VIVO data ingest: The basics

Coming from a background of knowing very little to nothing about the semantic web, RDF and SPARQL, I have perhaps had a steeper learning curve than most. To help people who are also starting at a very basic level, I thought I would go over the basics here.

Firstly, for your first data entry I suggest using the example that is given in the ingest guide and doing everything exactly as described:
http://www.vivoweb.org/data-ingest-guide

Following on from, and in conjunction with, this guide, I would like to highlight some points that were useful to me when learning:
First, the ontology. This is the structure of classes and properties that provides the headings and determines what data can be included under them. More than that, it describes how each object relates to the other objects, creating a network that connects all the information.

Instances

An instance is something that has properties of its own, rather than just being a text string. Instances can often be searched for under classes - for example a person's name can be found under "People>Person". Instances are not limited to people either - "Location>Address" will also show a range of instances. Some are not directly available from the menu but exist to link things together; for example, URLs are instances that have the properties link anchor text and link URI. To know which values are just strings and which are instances, you need to have a look at the classes and properties that show up under the admin menu.

Data Structure

A class will only show up if it has at least one instance stored within it.
A property will only show up for the general user if it has a value; an admin will see them all, even the blank ones.
It is important to note that an admin will see the pages differently to other users.

Class Group: These group classes together, eg "People" or "Location". They are the initial options that show up when you click "home".
Class: These contain instances of some type, eg "Person" or "Address". They are the subcategory options available when a Class Group is selected, shown with an adjacent bar graph giving the relative numbers in each class.
Namespace Pointer: These are the values seen when you search within a class, and they point to the location of an instance. They appear as a list of links within a class, eg "James Bond" or "Peter Parker" within the class "People>Person".
Property Group: These group the properties into related fields. They provide a range of major headings within an instance and give structure to the information contained within it, eg "Contact" or "Publications".
Data Property: These are the names for the properties themselves and contain the data - they may contain strings or pointers to other instances, eg Phone: "021 111 111" or Address: <namespace of the address>. When a namespace is used, the name of the namespace, eg "Bobs Address", will appear in its place and provide a link (see the sketch below).

Basics of RDF:

You can use SELECT to display results and test things out, then change it to CONSTRUCT when you actually want to create the RDF that represents the results your SELECT query found. All the PREFIXes at the top allow you to write in shorthand, for example:
PREFIX rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
This allows you to write "rdf:" instead of the full link. In a triple, a link of this kind describes how the "subject" (an instance) is related to the "object", which may be a property value or another instance. The name for this relational link is a "predicate". Thus everything can be described as Subject-Predicate-Object, that is "something" - "is related to by" - "something else". SPARQL uses variables prefixed with a ? and attempts to match all values these variables can take that result in a valid database result. For example, a data selection query:

#Describe which variables you would like to display, eg ?s ?p ?o. Note * can be used to display all variables used throughout the query.
SELECT ?s ?p ?o
WHERE{

#Find all values of ?s that have the predicate "rdf:type" and the object "vivo:Address".
?s rdf:type vivo:Address .

#Note that now ?s has been assigned a value, you can use it in later lines to search
?s ?p ?o # For each value of ?s, find the predicate and object and store them in ?p and ?o
}

VIVO Ingest Tools

To enter data, you must first take a comma-separated file (.csv) and convert it into RDF triple format stored in a Jena model. This is the initial, temporary RDF that will be used to create the final RDF with a syntax that matches VIVO. The final RDF can then be entered directly into the VIVO database, and when the search index is rebuilt the data will display on the site.

Found at "Site Admin > Ingest tools"
The main tools you will need to use are below:
Ingest Menu

1. Manage Jena Models:
Jena models are what store the initial RDF created from a .csv file. This is really easy to do - simply click "Create Model" and enter a name (any name is fine). To view a model's contents, click its "output model" link to download and view it. To delete it, click "remove", and to clear its contents without deleting it, click "clear statements". When entering new data, you will need one model to load your initial .csv data into and then a second one to hold the final RDF that is valid for entry into VIVO. It is also possible to store the initial RDF in several models if desired and then combine them into a single one when producing the final RDF. Below I have created three models: one for people, one for websites and one for the final construct.

2. Convert CSV to RDF
Select a .csv file from your computer (or a URL). Below this there will be a box asking for the namespace where these properties should be created. This is the location where all the instances you create will be stored. In my case this is "http://localhost/vivo/", but put wherever your site is hosted followed by "/vivo/" as a nice default place to put everything. It will then prompt you for the local name of this class. This can just be a short name that provides a simple description of what you are entering, eg "ppl". All the headings in your csv file will be appended to this, eg.

<http://localhost/vivo/ppl_address> & <http://localhost/vivo/ppl_phone> to create the predicate part of the triple (the middle one). The exact value you use is relatively unimportant, as it will not persist when you complete the second RDF conversion, but I have found the shorter the better, so you don't have to type so much when creating the queries. Finally, select one of the Jena models you created and click Next.

On the next page it will ask you to select your URI prefix. This is what will be put at the start of the name for all URIs stored in this group, and it should be named appropriately for ease of identification later, eg "people". A random integer will then be appended to each one to make it unique. All will be stored in the namespace you provided earlier, eg.
RDF output: <people539801>
Location: <http://localhost/vivo/people539801>
Then click "Convert CSV". If this is successful, you should be directed back to the Ingest Tools. If it is unsuccessful for some reason, then it may refresh the previous page. In this case, return to Ingest tools and try again after clearing the Jena model (see Manage Jena Models above).

3. Viewing initial RDF
Now the csv file should have been converted to RDF and stored in the Jena model you created. To view it, from the Ingest Menu select "Manage Jena Models" and then click the "output model" link below the heading of the model the data was stored in. You can then open this as a text file if you like (I recommend Notepad++), but be careful of using standard Notepad as it doesn't recognise the line breaks properly and the file will be extremely difficult to read. Here is a snippet of what you might expect:

At the top there will be a lot of prefixes like: @prefix vivo:    <http://vivoweb.org/ontology/core#> .
This is found in every script and just allows shorthand in the RDF, to conserve space and make it more human readable.
Below these is the actual data in RDF format. First the subject is listed, and then all the predicate and object pairs that relate to that subject below it:

<people2822807> #subject
      a       <http://localhost/vivo/ppl> ;
      <http://localhost/vivo/ppl_company> #predicate
              "Trademe" ; #object
      <http://localhost/vivo/ppl_fname> #predicate
              "James" ; #object
      <http://localhost/vivo/ppl_lname> #predicate
              "Bond" ; #object

Then another subject will be listed and it will continue through all the data you have entered. But what about the line immediately after the subject:
a       <http://localhost/vivo/ppl> ;
The "a" is a shorthand way of writing "rdf:type" which is a type of predicate. The subject in this case is <http://localhost/vivo/ppl> rather than an of the data you have entered. This is simply saying the instance <people2822807> is a type of "ppl". Although this doesnt make alot of sense now, it will when it is further processed into the final RDF. If all the data is there, then its time to move on. Otherwise go back and clear the jena model and repeat the steps for csv conversion.

4. Constructing the final RDF

To produce the final RDF data, this initial data must be processed so that the predicates match those within the site, so that the site knows when and where to display the data. Also, the data itself must be stored on the site, and each instance given its own space. To do this conversion, from the Ingest Menu click "Execute SPARQL CONSTRUCT". This will allow you to run a query that takes the RDF data in the model you created, changes it into a syntax suitable for VIVO, and then outputs it into another model of your choosing. In the box there will be a whole lot of PREFIXes similar to those in the RDF.

These simply allow you to write a shortened version instead of typing out the full thing. eg:
PREFIX vivo: <http://vivoweb.org/ontology/core#>
This allows you to write "vivo:XXX" rather than "<http://vivoweb.org/ontology/core#XXX>". You can add more of these if you think they will be useful. I will enter one for this batch:
PREFIX local: <http://localhost/vivo/>
This is the namespace I chose for the data in the initial RDF, and it will make it easier to type everything out. When you do this, remember to include the final "/" or final "#" if you want it included in the final output.
Next we need to describe how we want to construct the final RDF and where to get the data from:

CONSTRUCT{
} WHERE {
}

Despite the CONSTRUCT being written first, I find it easier to start on the WHERE, which we can work out from the initial RDF. As discussed earlier under Basics of RDF, a ? indicates that what follows is a variable. SPARQL will try to ensure that all the values stored within any variable meet the criteria of every line that includes it. For example, say we have some data stored in the initial RDF. Let's use a real example:
Data:

<people2822807>
      a       <http://localhost/vivo/ppl> ;
      <http://localhost/vivo/ppl_fname>
              "John" ;
      <http://localhost/vivo/ppl_lname>
              "Doe" ;

<people8672>
      a       <http://localhost/vivo/ppl> ;
       <http://localhost/vivo/ppl_fname>
              "John" ;
      <http://localhost/vivo/ppl_lname>
              "Key" ;

<people7736021>
      a       <http://localhost/vivo/ppl> ;
       <http://localhost/vivo/ppl_fname>
              "Jane" ;
      <http://localhost/vivo/ppl_lname>
              "Doe" ;

<people4901946>
      a       <http://localhost/vivo/ppl> ;
      <http://localhost/vivo/ppl_fname>
              "John" .
The script:
PREFIX local: <http://localhost/vivo/>
?a_subject local:ppl_fname "John" .
?a_subject local:ppl_lname "Doe" .

*Note the last entry has no last name entered.
In this case ?a_subject will only take on the values of subjects which have both the first name "John" and the last name "Doe".
ie ?a_subject = <people2822807>
It is possible to use several variables in a single line. This is useful for storing information in these variables, or it could be an unused variable that is there simply because you are unsure what the value is. eg.

The script:
PREFIX local: <http://localhost/vivo/>
?a_subject local:ppl_fname "John" .
?a_subject local:ppl_lname ?a_object .

This will search for all values of ?a_subject which have the first name "John" and also have a last name; although its value can be anything, it must exist. These last name values will be stored in ?a_object. Thus:
?a_subject = <people2822807> & <people8672>
?a_object = "Doe" & "Key"

Finally, if you don't want subjects with a missing value, like the missing last name in the data above, to be excluded, you can use the OPTIONAL{ } function.

The script:
PREFIX local: <http://localhost/vivo/>
?a_subject local:ppl_fname "John" .
OPTIONAL{?a_subject local:ppl_lname ?a_object}

This will search for all values of ?a_subject which have the first name "John"; it doesn't matter whether they have a last name, they will still be included. But for those that do happen to have a last name, the value will be stored in ?a_object. Thus:
?a_subject = <people2822807> & <people8672> & <people4901946>
?a_object = "Doe" & "Key"

So we want to get all the values from the initial RDF for the subjects and objects, store them in variables, and then print them back out in the CONSTRUCT part with predicates that match the VIVO syntax. The easiest way I have found to work out which properties are in each class, without searching through the property hierarchy in the Admin panel, is the following:
1. Click "Site Admin"

2. Under the heading "Data Input" there is a dropdown menu. Select the type of instance you would like to create, such as "People>Person". Fill in the required fields and submit.

3. When the individual is displayed in front of you, at the top there is a link titled "Edit this individual". Click this and you'll be taken to the control panel. Here click "edit this individual" again. This will show a list of all the properties you can add to this instance.
   #Note: To add new properties you must go to Site Admin, click Data Properties and make a new addition. If you are going to do this, I suggest you use other data properties already present as models for how to fill them out.

Add a value to all the fields you might want to enter and click "Submit Changes". This will take you back to the Individual Control panel.

4. Click "Raw Statements with This Resources as Subject". What this is doing is finding all the predicates and objects that have this instances subject and displaying them for you.

You can identify what each field is from the "obj" column, which contains the values you just entered. On the left hand side is "pred", which contains the matching VIVO predicates for the fields under "obj". I suggest you copy these down, perhaps into a spreadsheet, and put what each field is next to it. This will make them easier to find in future. The listing will likely also contain some objects you have not personally entered, but that were automatically created when you made the instance, such as "Agent", "Thing" and "Person". You don't have to worry about these, but it may be useful for understanding which instances fit under which superclass(es).
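
Alternatively, you can get much the same listing from "Site Admin > SPARQL Query" with a small query like the one below. The individual's URI here is hypothetical - substitute the URI of the person you just created:

SELECT ?pred ?obj
WHERE{
<http://localhost/vivo/individual/n1234> ?pred ?obj .
}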

Once you have all the fields you want and all the predicates for these, you can begin.

Example Script:
#All the automatically generated PREFIXes appear here; only the relevant ones are shown:
PREFIX rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:  <http://www.w3.org/2000/01/rdf-schema#>
PREFIX vivo: <http://vivoweb.org/ontology/core#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

#personally added Prefix:
PREFIX local: <http://localhost/vivo/>

CONSTRUCT{
?subject a vivo:Person .
?subject rdfs:label ?name .
?subject foaf:firstName ?firstname .
?subject foaf:lastName ?lastname .
?subject vivo:phoneNumber ?wphone .
?subject vivo:preferredTitle ?title .
?subject vivo:email ?wemail .
} WHERE {
?subject a local:ppl .
?subject local:ppl_name ?name .
?subject local:ppl_fname ?firstname .
?subject local:ppl_lname ?lastname .

OPTIONAL{?subject local:ppl_workphone ?wphone}
OPTIONAL{?subject local:ppl_title ?title}

OPTIONAL{?subject local:ppl_email ?wemail}
}
The first four PREFIXes (rdf, rdfs, vivo and foaf) are some of the standard prefixes that appear at the top of the page automatically. You can see that the properties come from several different locations and so require different prefixes, but all of the predicates used in the CONSTRUCT come from the list of VIVO predicates you copied down earlier. In the WHERE clause at the bottom, all the predicates (except the first one, "a", which was also in the initial RDF) use "local:". This is because all the values in the initial RDF used the namespace <http://localhost/vivo/????> and "local:" has been defined as exactly that in the personally added PREFIX.

The first four lines of the WHERE clause are the requirements on ?subject, all of which must be met for it to be included. This means that ?subject must have the predicate "a" (ie "rdf:type") with the object "local:ppl" (ie <http://localhost/vivo/ppl>). Likewise it must also have a name, first name and last name. If any of these are absent, that subject will not be included in the variable ?subject. If present, the work phone, title and email are also stored in their respective variables ?wphone, ?title and ?wemail, but they are not a requirement.

When ?subject is a certain value, eg <people2822807>, there is only one value for the object that has the predicate "local:ppl_name"; likewise for a ?subject of <people8672> there is only one value for "local:ppl_name". So although there may be a lot of values stored in the variable ?name, they are each bound to their respective ?subject and so are paired. Thus, when constructing, for each value of ?subject and each predicate defined within the CONSTRUCT clause (label, first name, last name, phone, title, email), there is a single object (stored in one of the variables from the WHERE clause) that matches that ?subject. These are all written out to produce the final RDF.

Now that you have a script to enter, recall that you should be in "Site Admin > Ingest Menu > Execute SPARQL CONSTRUCT Query". Type your script in below the PREFIXes. Below this it will ask you to select source models. These are the models you stored the initial RDF data in. If you want to combine them you can tick multiple boxes and use the UNION option within the script, but for a start, working with one is enough I would imagine. Then select the output or "Destination Model"; this should be the second spare model that you created earlier to store the final RDF.

Click "Execute CONSTRUCT". This may take a while depending on the size and complexity of your script, but unless it's very large or complex, it's unlikely to take more than a few minutes. If it takes much longer than that, you may have some sort of loop and should stop and review your query. If all goes well, it should say at the top of the screen how many lines were entered.

From here, go back to the Ingest Menu and click "Manage Jena Models". Find the model you loaded the final RDF into and click its "output model" link. It should look quite similar to the initial RDF, except that it now uses the VIVO syntax for the predicates instead of the arbitrary one that was created from the .csv file. Now that this is complete, you are almost done. Save that file.

There should be one block like this for each person that was entered. Note that not all the fields have to be present for each individual - eg Peter Parker has no phone number. This is possible because it was an OPTIONAL property.
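
As a rough sketch only, one person's block in the final output might look something like this. The values are the made-up ones from earlier, the subject will really be the full URI in your namespace, and the usual @prefix lines will appear at the top of the file:

<people2822807>
      a       vivo:Person ;
      rdfs:label        "James Bond" ;
      foaf:firstName    "James" ;
      foaf:lastName     "Bond" ;
      vivo:phoneNumber  "021 111 111" ;
      vivo:email        "j.bond@example.com" .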

Finally go to "Site Admin" from the main menu. Under Advanced Data Tools, select "Add/Remove RDF data". Select the RDF file and ensure "add instance data" is selected (default). Select N3 from the dropdown menu #Note this is not the default value. Select Submit. This should show Data entered Successfully.

Now that the data has been entered, it will take a short period of time for it to appear in the database. Despite this automatic display, it is highly recommended to rebuild the search index, to prevent errors and mismatches between the display and the database. To do this, go back to "Site Admin" on the main menu and, under Refresh Content, click "Rebuild search index" then "rebuild" (see the bottom of the Data Removal section). For small amounts of data this should only take a couple of minutes or less, but it may take longer for larger data sets. When this is complete, the data should be readily available via the normal search tabs at the top of the page.

 

Data Removal:

When removing data, the data you want to remove must first be turned into RDF format. This is then submitted in the same place as data is entered: "Add/Remove RDF data" under the "Site Admin" tab. This time, instead of selecting "add instance data", you select "remove mixed RDF". First, though, you must have the RDF data to remove. You could use the RDF that was used to enter the data, however this would not be complete, because some additional implied RDF data was created when you submitted it - things such as, when you create a "person", you also create a "thing", because a person is a subset of thing. To ensure you remove all the data that relates to what was entered, you need to construct a query that will produce all the current RDF for those entries.

Earlier we used the "Ingest Tools > Execute SPARQL CONSTRUCT" to create a construct from our initial RDF data, but this time we want to create RDF from the data that is within the database. To do this, within the "Site Admin" page, select "SPARQL Query" from the heading "advanced Data Tools". Delete the example they have given you (everything except the PREFIX's).

Now we want to write a query to find all the data points. I suggest you first use a SELECT query until you have selected exactly what you want, and then turn it into a CONSTRUCT query. Here is an example:

SELECT ?s ?p ?o
WHERE{
?s a vivo:Person .
?s ?p ?o .
}

This will find all the subjects which have an "rdf:type" of "vivo:Person" and store them in ?s. It then searches those subjects within ?s to find all predicates and objects which have them as a subject. In other words, it finds all the data that relates to these subjects and displays it. When you have the query displaying the data you want, change "SELECT" to "CONSTRUCT" and add the extra braces and full stop shown below to correct the syntax.

CONSTRUCT{ ?s ?p ?o .
}WHERE{
?s a vivo:Person .
?s ?p ?o .
}

Also make sure you set the option "Format for CONSTRUCT and DESCRIBE query results" to N3 rather than the default RDF/XML-ABBREV. Then click "Run Query".

When this is complete, you should be prompted to download the file. Open this RDF data and make sure it matches what you would expect; it should be very similar to the RDF that you previously uploaded, but with a few extra lines that relate to assumed data. Also note that the location of the instance is now on the server rather than localhost.
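
As a rough idea of those extra lines, one person's block in the downloaded RDF may carry additional rdf:type statements for the superclasses mentioned earlier (Agent, Thing). The URI and the exact class list below are only an illustration and will depend on your server and ontology version:

<people2822807>
      a       vivo:Person , foaf:Agent , owl:Thing ;
      rdfs:label       "James Bond" ;
      foaf:firstName   "James" ;
      foaf:lastName    "Bond" .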

Once you have this, go to "Add/Remove RDF data" as described at the start of this section, select the RDF file to upload, select "remove mixed RDF" and set the type to N3 instead of RDF/XML. Finally, submit to remove the entries. When this is complete it will give you a success message.

Remember to then rebuild the search index before beginning to search through the remaining data.


After this is complete, any data added should be displayed and any data removed should have disappeared. If records are viewed before this final step is complete, some links will not work and some instances will show up as "<undefined>".

 

I hope that this has been useful to you. For toying around and trying different things out, the VirtualBox version of VIVO, which comes already pre-installed, is quite useful. It can be accessed here:
https://sourceforge.net/projects/vivo/files/VIVO%20Virtual%20Appliances/

Submitted by Brendan Tonson-Older on

Comments

Excellent post

Brendan - excellent explanation. This would be perfect for the VIVO Wiki. Would you be willing to

  • permit us to link to your post? (good)
  • allow us to port your post to the Wiki? (better)
  • start contributing to the Wiki yourself? (best)

The new VIVO Wiki is located at https://wiki.duraspace.org/display/VIVO. Write to me at jeb228 at cornell_edu, or just go to the Wiki and sign up.

Jim