Planet Adtech

Related

  • IBM SLRP
  • LSID
  • Jastor

We are…

a group of IBM Semantic Web-focused software engineers in Cambridge, Massachusetts. This site aggregates our collective efforts, thoughts, and musings.

February 07, 2007

Updates to sparql.js

by lee on TechnicaLee Speaking

I'm not sure if anyone is using sparql.js, the JavaScript library that Elias and I wrote for issuing SPARQL queries. (Probably not, given its Firefox-and-friends-only orientation and the standard cross-site XMLHttpRequest security restrictions.) Since I first blogged about the library last year, we've made a few changes to it. Most notably, we've removed the dependency on the Yahoo! connection manager (or on any other third-party libraries, for that matter). Additionally, we've added a setRequestHeader method which passes the given headers and values along to the underlying HTTP request object. We use this functionality, for example, to provide user credentials (via HTTP Basic Auth) when SPARQLing against a Boca server.
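
As a rough illustration of what supplying credentials involves, the sketch below builds the two pieces such a request needs: a SPARQL Protocol GET URL and an HTTP Basic Auth header value. The helper names (basicAuthHeader, buildSparqlRequest), the endpoint URL, and the credentials are all made up for this example and are not part of sparql.js; only the setRequestHeader method name comes from the description above.

```javascript
// Hypothetical helpers for illustration; none of these are part of sparql.js.

// Base64-encode an ASCII string in a browser (btoa) or in Node.js (Buffer).
function toBase64(text) {
  if (typeof btoa === "function") {
    return btoa(text);
  }
  return Buffer.from(text, "binary").toString("base64");
}

// Build the value of an HTTP Basic Auth "Authorization" header.
function basicAuthHeader(user, password) {
  return "Basic " + toBase64(user + ":" + password);
}

// Describe a SPARQL Protocol query request: the query text travels in the
// URL-encoded "query" parameter of a GET request.
function buildSparqlRequest(endpoint, query, user, password) {
  var separator = endpoint.indexOf("?") === -1 ? "?" : "&";
  return {
    url: endpoint + separator + "query=" + encodeURIComponent(query),
    headers: { Authorization: basicAuthHeader(user, password) }
  };
}

var request = buildSparqlRequest(
  "http://example.org/sparql",            // assumed endpoint
  "SELECT * WHERE { ?s ?p ?o } LIMIT 1",
  "user",                                 // assumed credentials
  "pass"
);
```

With the library, the header value would presumably be handed to setRequestHeader("Authorization", ...) on whichever object exposes that method; check the library source for the exact call site.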

The update should be transparent to any current uses of the library. Please let me know if you try it out and experience any problems.

February 06, 2007

Text indexing and query in Boca

by wingerz on ~wingerz

Just about every computer user is very familiar with and competent in text search. While end users may not be writing custom search queries, they appreciate UIs that allow them to search with more accuracy and precision. Occasionally users want to find something very specific by searching across people’s names, or book titles, or paper abstracts instead of all of the indexed text in a system. Clever keyword searching and luck can only get you part of the way there.

Sleuth, Boca’s text indexing component, addresses this problem (in the Boca world). We’ve been using it for quite a while. Similar to LARQ, Boca uses Lucene to index string literals when the feature is enabled. We’ve designated a magic predicate for querying the text index with SPARQL and hooked it into Glitter, our wonderfully-named SPARQL engine. So now we can do SPARQL queries with integrated text queries, like “find me people (not airplane components or animal appendages) where the name matches ‘Wing’”:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX boca: <http://boca.adtech.internet.ibm.com/predicates/>
SELECT ?person ?name
WHERE {
	?person foaf:name ?name .
	?name boca:textmatch "Wing" .
}

This powerful feature allows SPARQL-aware developers to roll their own APIs. It’s easy to whip up a search across all the literals for traditional text search behavior. With a little more work, you can craft more sophisticated searches, like one for authors of a paper that mentions a specific search term in the abstract (say, “march madness”).
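
The "authors of a paper" search could be assembled along these lines. This is only a sketch: the ex: vocabulary (ex:author, ex:abstract) and the function names are invented for illustration, and how Sleuth interprets a multi-word term is not covered here. Only the boca:textmatch predicate and its namespace come from the query above.

```javascript
// Escape a string for use inside a double-quoted SPARQL literal.
function escapeSparqlLiteral(text) {
  return text
    .replace(/\\/g, "\\\\")
    .replace(/"/g, '\\"')
    .replace(/\n/g, "\\n")
    .replace(/\r/g, "\\r");
}

// Build a query for authors of papers whose abstract matches a search term.
// The ex: properties are hypothetical; substitute your own data model.
function buildAbstractSearchQuery(term) {
  return [
    "PREFIX boca: <http://boca.adtech.internet.ibm.com/predicates/>",
    "PREFIX ex: <http://example.org/papers#>",
    "SELECT ?author ?paper",
    "WHERE {",
    "  ?paper ex:author ?author .",
    "  ?paper ex:abstract ?abstract .",
    '  ?abstract boca:textmatch "' + escapeSparqlLiteral(term) + '" .',
    "}"
  ].join("\n");
}

var query = buildAbstractSearchQuery("march madness");
```

Escaping the term before splicing it into the query keeps user input from breaking out of the literal.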

For more details on how to set this up, please see our documentation on Boca text indexing.

January 30, 2007

ODO 0.20 Perl libraries Released

by sean on IBM Semantic Layered Research Platform

Late last week I released ODO 0.20 which has some important updates:

  • Jena Database compatibility layer. It is now possible to connect to and read/write databases created with HP Labs’ Jena. The code is still experimental but provides a good starting point for feedback. There are test cases in this release so regression testing and bug demonstration should be easy.
  • RDFS code generator. Ontologies and more specifically the RDFS code generator have been updated and bug tested. The version in the first release was a port of older code that didn’t translate perfectly into the new ODO framework. I am happy to report that the code is now used to host the ODO-Jena compatibility layer and there are now test cases for it as well.
  • N3 RDF parser. This release includes an initial version of an N3 RDF parser. It isn’t complete because I don’t have a test suite to make sure it accepts valid N3. Hopefully with each release this component will mature.
  • OWL-Lite code generator is still in development. A casualty of the debugging and testing of the RDFS code generator, this system still needs to be updated and tested with the new Ontology “layer.”
  • Many more bug fixes and test cases.

More info about ODO can be found here and, of course, in the POD-style documentation. Please send all feedback to the list.

Stephen Evanchik

January 29, 2007

Released ODO 0.20

by Stephen Evanchik on Stephen Evanchik - Semantic Web

Late last week I released ODO 0.20 which has some important updates:

  • Jena Database compatibility layer
    It is now possible to connect to and read/write databases created with HP Labs' Jena. The code is still experimental but provides a good starting point for feedback. There are test cases in this release so regression testing and bug demonstration should be easy.
  • RDFS code generator.
    Ontologies and more specifically the RDFS code generator have been updated and bug tested. The version in the first release was a port of older code that didn't translate perfectly into the new ODO framework. I am happy to report that the code is now used to host the ODO-Jena compatibility layer and there are now test cases for it as well.

January 19, 2007

Announcing: Boca 1.8 - new database support

by lee on TechnicaLee Speaking

While I've been writing dense treatises on Semantic Web development, Matt's been hard at work on the latest release of Boca. Matt's announcement of Boca 1.8 carries all the details as well as a look at what Boca 2.0 will bring. Amidst the usual slew of bug fixes, usability improvements, and performance fixes, the major addition to Boca is support for three new databases beyond DB2. Boca now also runs on MySQL, PostgreSQL, and HSQLDB. Cool stuff.

In other Semantic Layered Research Platform news, we're working towards pushing out stable releases (with documentation and installation packaging) of two more of our components: Queso (Atom-driven Web interface to Boca) and DDR (binary data repository with metadata-extractor infrastructure to store metadata within Boca). We're hoping to get these out by the middle of February, so stay tuned.

Boca version 1.8

by mroy on IBM Semantic Layered Research Platform

Today we released Boca version 1.8. I’m happy to learn that users are starting to use Boca and encourage people to let me know any feedback they have not only about the current version, but also about the future goals I’ve described here.

Changes for 1.8:

Added preliminary support for more databases. These include MySQL (version 5), Postgres (version 8.2), and HSQL (version 1.8). While these databases run and pass all JUnit tests, no performance tuning has been done on them as it has been on DB2, so your mileage may vary. We provide example configuration files for these databases, but you will need to download the appropriate Java libraries from the database vendor.

Various fixes for bugs with node storage and database access when creating administrative objects like users and roles. We changed some of the tables to provide better performance for node storage.

We have changed the authentication mechanism slightly. In previous versions, a user would use their full URI as the id when authenticating with the system. This meant that in an application that prompted for a userid and password, a user would have to type in their full URI. A User now has a userId property, which is used as the login id. If a userId is not specified when a user is created, their URI is used as the userId. Our examples and property files have been updated to reflect this change, so for example “http://boca.adtech.internet.ibm.com/users/default” has been updated to have a userId of “default”. There is further use of this userId when using a custom IAuthenticationProvider, but that will be covered in a follow-up post.

Since a majority of changes in this release addressed database issues, it will be necessary to create a new database: some of the tables have changed significantly enough that there is no simple upgrade path. We hope this is the last major change to the database schema.

Future work:

The current 1.x release series of Boca is considered our stable branch, and we plan on maintaining this branch into the future. While the current design has served us well for many years, we have decided that we would like to move to an underlying RDF API that better reflects our system, and for this reason I have been porting Boca to use the RDF interfaces in Sesame 2. In the current 1.X branch of Boca, we have been basing our interfaces around the Jena Graph API, but as we moved toward Named Graphs and away from ideas like reification, I realized we are maintaining a codebase that we are shoehorning into this current API. As a result, we are probably missing key pieces that would make it 100% Jena compatible, and missing out on the ability to make an API more appropriate for our system. While it is still our intention to provide a way to access Boca using the Jena API, I decided that the underlying implementation of the system should move forward using APIs that are more compatible with our design goals.

What does this mean to Boca and the users of Boca? Our current plan is to release an early version of a Boca 2.0 branch based on these new interfaces. In this new branch, the core Boca server and core client libraries are built using the org.openrdf.model interfaces for things like URIs, Literals and Statements. The overall client experience will remain very similar to the current API, with the big difference being that we will no longer implement the Jena Graph APIs. We will have a Jena wrapper around our DatasetService which will provide a means to get Jena-compatible graphs from the DatasetService. We also plan to provide a Sesame 2 SailConnection wrapper, so that users of Sesame will be able to use Boca in a manner more closely resembling the Sesame API.

One advantage of this move is that we have been able to build Boca with Maven 2 and standard Maven repositories, so for those users wanting to build Boca locally, it should provide an easier path than in the 1.X branch.

All our application/infrastructure related components will be migrated to Boca 2.0 and, together with any new work we embark on, will likely go forward only on the new API. We therefore recommend that existing users of Boca consider moving to Boca 2.0, since Boca 1.8 will be maintenance only.

Matt Roy

Using RDF on the Web: A Vision

by lee on TechnicaLee Speaking

(This is the second part of two posts about using RDF on the Web. The first post was a survey of approaches for creating RDF-data-driven Web applications.) All existing implementations referred to in this post are discussed in more detail and linked to in part one.

Here's what I would like to see, along with some thoughts on what is or is not implemented. It's by no means a complete solution and there are plenty of unanswered questions. I'd also never claim that it's the right solution for all or most applications. But I think it has a certain elegance and power that would make developing certain types of Web applications straightforward, quick, and enjoyable. Whenever I refer to "the application" or "the app", I'm talking about a browser-based Web application implemented in JavaScript.

  • To begin with, I imagine servers around the Web storing domain-specific RDF data. This could be actual, materialized RDF data or virtual RDF views of underlying data in other formats. This first piece of the vision is, of course, widely implemented (e.g. Jena, Sesame, Boca, Oracle, and Virtuoso).

  • The application fetches RDF from such a server. This may be done in a variety of ways:

    • An HTTP GET request for a particular RDF/XML or Turtle document
    • An HTTP GET request for a particular named graph within a quad store (a la Boca or Sesame)
    • A SPARQL CONSTRUCT query extracting and transforming the pieces of the domain-specific data that are most relevant to the application
    • A SPARQL DESCRIBE query requesting RDF about a particular resource (URI)

    In my mind, the CONSTRUCT approach is the most appealing method here: it allows the application to massage data which it may be receiving from multiple data sources into a single domain-specific RDF model that can be as close as possible to the application's own view of the world. In other words, reading the RDF via a query effectively allows the application to define its own API.

    Once again, the software for this step already exists via traditional Web servers and SPARQL protocol endpoints.

  • Second, the application must parse the RDF into a client-side model. Precisely how this is done depends on the form taken by the RDF received from the server:

    • The server returns RDF/XML. In this case, the client can use Jim Ley's parser to end up with a list of triples representing the RDF graph. The software to do this is already implemented.
    • The server returns Turtle. In this case, the client can use Masahide Kanzaki's parser to end up with a list of triples representing the RDF graph. The software to do this is already implemented.
    • The server returns RDF/JSON. In this case, the client can use Douglas Crockford's JSON parsing library (effectively a regular expression security check followed by a call to eval(...)). While the software is implemented here, the RDF/JSON standard which I've cavalierly tossed about so far does not yet exist. Here, I'm imagining a specification which defines RDF/JSON based on the common JavaScript data structure used by the above two parsers. (A bit of work probably still needs to be done if this were to become a full RDF/JSON specification, as I do not believe the current format used by the two parsers can distinguish blank node subjects from subjects with URIs.)

    In any case, we now have on the client a simple RDF graph of data specific to the domain of our application. Yet as I've said before, we'd like to make application development easier by moving away from triples at this point into data structures which more closely represent the concepts being manipulated by the application.

  • The next step, then, is to map the RDF model into an application-friendly JavaScript object model. If I understand ActiveRDF correctly (and in all fairness I've only had the chance to play with it a very limited amount), it will examine either the ontological statements or instance data within an RDF model and will generate a Ruby class hierarchy accordingly. The introduction to ActiveRDF explains the dirty-but-well-appreciated trick that is used: "Just use the part of the URI behind the last ”/” or ”#” and Active RDF will figure out what property you mean on its own." Of course, sometimes there will be ambiguities, clashes, or properties written to which did not already exist (with full URIs) in the instance data received; in these cases, manual intervention will be necessary. But I'd suggest that in many, many cases, applying this sort of best-effort heuristic to a domain-specific RDF model (especially one which the application has selected via a CONSTRUCT query) will result in extremely natural object hierarchies.

    None of this piece is implemented at all. I'd imagine that it would not be too difficult, following the model set forth by the ActiveRDF folks.

    Late-breaking news: Niklas Lindström, developer of the Python RDF ORM system Oort followed up on my last post and said (among other interesting things):

    I use an approach of "removing dimensions": namespaces, I18N (optionally), RDF-specific distinctions (collections vs. multiple properties) and other forms of graph traversing.

    Sounds like there would be some more simplification processes that could be adapted from Oort in addition to those adapted from ActiveRDF.

  • The main logic of the Web application (and the work of the application developer) goes here. The developer receives a domain model and can render it and attach logic to it in any way he or she sees fit. Often this will be via a traditional model-view-controller approach: this approach is facilitated by toolkits such as dojo or even via a system such as nike templates (nee microtemplates). Thus, the software to enable this meat-and-potatoes part of application development already exists.

    In the course of the user interacting with the application, certain data values change, new data values are added, and/or some data items are deleted. The application controller handles these mutations via the domain-specific object structures, without regard to any RDF model.

  • When it comes time to commit the changes (this could happen as changes occur or once the user saves/commits his or her work), standard JavaScript (i.e. a reusable library, rather than application-specific code) recognizes what has changed and maps (inverts) the objects back to the RDF model (as before, represented as arrays of triples). This inversion is probably performed by the same library that automatically generated the object structure from the RDF model in the first place. As with that piece of this puzzle, this library does not yet exist.

    Reversing the RDF ORM mapping is clearly challenging, especially when new data is added which has not been previously seen by the library. In some cases--perhaps even in most?--the application will need to provide hints to the library to help the inversion. I imagine that the system probably needs to keep an untouched deep copy of the original domain objects to allow it to find new, removed, and dirty data at this point. (An alternative would be requiring adds, deletes, and mutations to be performed via methods, but this constrains the natural use of the domain objects.)

  • Next, we determine the RDF difference between our original model and our updated model. The canonical work on RDF deltas is a design note by Tim Berners-Lee and Dan Connolly. Basically, though, an RDF diff amounts simply to a collection of triples to remove and a collection of triples to add to a graph. No (JavaScript) code yet exists to calculate RDF graph diffs, though the algorithms are widely implemented in other environments including cwm, rdf-utils, and SemVersion. We also work often with RDF diffs in Boca (when the Boca client replicates changes to a Boca server). I'd hope that this implementation experience would translate easily to a JavaScript implementation.

  • Finally, we serialize the RDF diffs and send them back to the data source. This requires two components that are not yet well-defined:

    • A serialization format for the RDF diffs. Tim and Dan's note uses the ability to quote graphs within N3 combined with a handful of predicates (diff:replacement, diff:deletion, and diff:insertion). I can also imagine a simple extension of (whatever ends up being) the RDF/JSON format to specify the triples to remove and add:
        {
          'add' : [ RDF/JSON triple structures go here ],
          'remove' : [ RDF/JSON triple structures go here ]
        }
      
    • An endpoint or protocol which accepts this RDF diff serialization. Once we've expressed the changes to our source data, of course, we need somewhere to send them. Preferably, there would be a standard protocol (à la the SPARQL Protocol) for sending these changes to a server. To my knowledge, endpoints that accept RDF diffs to update RDF data are not currently implemented. (Late-breaking addition: on my first post, Chris and Richard both pointed me to Mark Baker's work on RDF forms. While I'm not very familiar with any existing uses of this work, it looks like it might be an interesting way to describe the capabilities of an RDF update endpoint.)

    As an alternative for this step, the entire client-side RDF model could be serialized (to RDF/XML or to N-Triples or to RDF/JSON) and HTTP PUT back to an origin server. This strategy seems to make the most sense in a document-oriented system; to my knowledge this is also not currently implemented.
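
The diff step described above can be sketched as a set difference over two arrays of triples. This is a minimal illustration under stated assumptions, not an implementation of any existing library: the function names are invented, triples use the subject/predicate/object/type/lang/datatype shape produced by the parsers mentioned earlier, and blank nodes are compared by label only, which sidesteps the hard part of real RDF graph diffing.

```javascript
// Build a string key that identifies a triple, including literal details.
function tripleKey(t) {
  return JSON.stringify([
    t.subject, t.predicate, t.object,
    t.type, t.lang || "", t.datatype || ""
  ]);
}

// Return the triples to add and to remove in order to turn `before` into
// `after`. Blank nodes are compared by label only (a simplification).
function diffTriples(before, after) {
  var beforeKeys = Object.create(null);
  var afterKeys = Object.create(null);
  before.forEach(function (t) { beforeKeys[tripleKey(t)] = true; });
  after.forEach(function (t) { afterKeys[tripleKey(t)] = true; });
  return {
    add: after.filter(function (t) { return !beforeKeys[tripleKey(t)]; }),
    remove: before.filter(function (t) { return !afterKeys[tripleKey(t)]; })
  };
}

var original = [
  { subject: "http://example.org/lee", predicate: "http://example.org/city",
    object: "Cambridge", type: "literal" }
];
var updated = [
  { subject: "http://example.org/lee", predicate: "http://example.org/city",
    object: "Boston", type: "literal" }
];

var diff = diffTriples(original, updated);
var payload = JSON.stringify(diff); // candidate body for an update request
```

The resulting object has the same add/remove shape as the serialization imagined above, so it could be posted as-is to an endpoint that understood it.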

That's my vision, as raw and underdeveloped as it may be. There are a large number of extensions, challenges and related work that I have not yet mentioned, but which will need to be addressed when creating or working with this type of Web application. Some discussion of these is also in order.

Handling Multiple Sources of Data

To use the above Web-application-development environment to create Web 2.0-style mash-ups, most of the steps would need to be performed once per data source being integrated. This adds to the system a provenance requirement, whereby the libraries could offer the application a unified view of the domain-specific data while still maintaining links between individual data elements and their source graphs/servers/endpoints to facilitate update. When the RDF diffs are computed, they would need to be sent back to the proper origins. Also, the sample JavaScript structures that I've mentioned as a base for RDF/JSON and the RDF/JSON diff serialization would likely need to be augmented with a URI identifying the source graph of each triple. (That is, we'd end up working with a quad system, though we'd probably be able to ignore that in the object hierarchy that the application deals with.) In many cases, though, an application that reads from many data sources will write only to a single source; it does not seem particularly onerous for the application to specify a default "write-back" endpoint.
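
Routing changes back to their origins might look something like the following sketch. It assumes each triple in a diff carries an extra graph property naming its source (the quad idea above) and falls back to a default write-back target when that property is missing. The property name, the function name, and the URIs are all hypothetical.

```javascript
// Split one combined diff into one diff per source graph. Triples without a
// `graph` property are assigned to the default write-back target.
function splitDiffBySource(diff, defaultGraph) {
  var perSource = Object.create(null);
  ["add", "remove"].forEach(function (kind) {
    diff[kind].forEach(function (quad) {
      var graph = quad.graph || defaultGraph;
      if (!perSource[graph]) {
        perSource[graph] = { add: [], remove: [] };
      }
      perSource[graph][kind].push(quad);
    });
  });
  return perSource;
}

var combined = {
  add: [
    { subject: "http://example.org/lee", predicate: "http://example.org/city",
      object: "Boston", type: "literal", graph: "http://a.example/graph" },
    { subject: "http://example.org/lee", predicate: "http://example.org/nick",
      object: "lee", type: "literal" } // no source recorded: uses the default
  ],
  remove: [
    { subject: "http://example.org/lee", predicate: "http://example.org/city",
      object: "Cambridge", type: "literal", graph: "http://a.example/graph" }
  ]
};

var bySource = splitDiffBySource(combined, "http://b.example/writeback");
```

Each entry of bySource could then be serialized and sent to its own endpoint.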

Inverting SPARQL CONSTRUCT Queries

An appealing part of the above system (to me, at least) is the use of CONSTRUCT queries to map origin data to a common RDF model before merging it on the client and then mapping it into a domain-specific JavaScript object structure. Such transformations, however, would make it quite difficult--if not impossible--to automatically send the proper updates back to the origin servers. We'd need a way of inverting the CONSTRUCT query which generated the triples the application has (indirectly) worked with, and while I have not given it much thought, I imagine that that is quite difficult, if not impossible.

SPARQL UPDATE

The DAWG has postponed any work on updating graphs for the initial version of SPARQL, but Max Völkel and Richard Cyganiak have started a bit of discussion on what update in SPARQL might look like (though Richard has apparently soured on the idea a bit since then). At first blush, using SPARQL to update data seems like a natural counterpart to using SPARQL to retrieve the data. However, in the vision I describe above, the application would likely need to craft a corresponding SPARQL UPDATE query for each SPARQL CONSTRUCT query that is used to retrieve the data in the first place. This would be a larger burden on the application developer, so should probably be avoided.

Related Work

I wanted to acknowledge that in several ways this whole pattern is closely related to but (in some mindset, at least) the inverse of a paradigm that Danny Ayers has floated in the past. Danny has suggested using SPARQL CONSTRUCT queries to transition from domain-specific models to domain-independent models (for example, a reporting model). Data from various sources (and disparate domains) can be merged at the domain-independent level and then (perhaps via XSLT) used to generate Web pages summarizing and analyzing the data in question. In my thoughts above, we're also using the CONSTRUCT queries to generate an agreed-upon model, but in this case we're seeking an extremely domain-specific model to make it easier for the Web-application developer to deal with RDF data (and related data from multiple sources).

Danny also wrote some related material to www-archive. It's not the same vision, but parts of it sound familiar.

Other Caveats

Updating data has security implications, of course. I haven't even begun to think about them.

Blank nodes complicate almost everything; this may be sacrilege in some circles, but in most cases I'm willing to pretend that blank nodes don't exist for my data-integration needs. Incorporating blank nodes makes the RDF/JSON structures (slightly) more complicated; it raises the question of smushing together nodes when joining various models; and it significantly complicates the process of specifying which triples to remove when serializing the RDF diffs. I'd guess that it's all doable using functional and inverse-functional properties and/or with told bnodes, but it probably requires more help from the application developer.

I have some worries about concurrency issues for update. Again, I haven't thought about that much and I know that the Queso guys have already tackled some of those problems (as have many, many other people I'm sure), so I'm willing to assert that these issues could be overcome.

In many rich-client applications, data is retrieved incrementally in response to user-initiated actions. I don't think that this presents a problem for the above scheme, but we'd need to ensure that newly arriving data could be seamlessly incorporated not only into the RDF models but also into the object hierarchies that the application works with.

Bill de hÓra raised some questions about the feasibility of roundtripping RDF data with HTML forms a while back. There's some interesting conversation in the comments there which ties into what I've written here. That said, I don't think the problems he illustrates apply here--there's power above and beyond HTML forms in putting an extra JavaScript-based layer of code between the data entry interface (whether it be an HTML form or a more specialized Web UI) and the data update endpoint(s).


OK, that's more than enough for now. These are still ideas clearly in progress, and none of the ideas are particularly new. That said, the environment as I envision it doesn't exist, and I suppose I'm claiming that if it did exist it would demonstrate some utility of Semantic Web technologies via ease of development of data- and integration-driven Web applications. As always, I'd enjoy feedback on these thoughts and also any pointers to work I might not know about.

January 16, 2007

Using RDF on the Web: A Survey

by lee on TechnicaLee Speaking

(This is part one of two posts exploring building read-write Web applications using RDF. Part two will follow, shortly. Update: Part two is now available, also.)

The Web permeates our world today. Far more than static Web sites, the Web has come to be dominated by Web applications--useful software that runs inside a Web browser and on a server. And the latest trend in Web applications, Web 2.0, encourages--among other things--highly interactive Web sites with rich user interfaces featuring content from various sources around the Web integrated within the browser.

Many of us who have drunk deeply from the Semantic Web Kool-Aid are excited about the potential of RDF, SPARQL, and OWL to provide flexible data modeling, easier data integration, and networked data access and query. It's no coincidence that people often refer to the Semantic Web as a web of data. And so it seems to me that RDF and friends should be well-equipped to make the task of generating new and more powerful Web mash-ups simple, elegant, and enjoyable. Yet while there are a great number of projects using Semantic Web technologies to create Web applications, there doesn't seem to have emerged any end-to-end solution for creating browser-based read-write applications using RDF which focus on data integration and ease of development.

Following a discussion on this topic at work the other day, I decided to do a brief survey of what approaches do already exist for creating RDF-based Web applications. I want to give a brief overview of several options, assess how they fit together, and then outline a vision for some missing pieces that I feel might greatly empower Web developers working with Semantic Web technologies.

First, a bit on what I'm looking for. I want to be able to quickly develop data-driven Web applications that read from and write back to RDF data sources. I'd like to exploit standard protocols and interfaces as much as possible, and limit the amount of domain-specific code that needs to be written. I'd like the infrastructure to make it as easy as possible for the application developer to retrieve data, integrate the data, and work with it in a convenient and familiar format. That is, in the end, I'm probably looking for a system that allows the developer to work with a model of simple, domain-specific JavaScript object hierarchies.

In any case, here's the survey. I've tried to include most of the systems I know of which involve RDF data on the Web, even those which are not necessarily appropriate for creating generalized RDF-based Web apps. I'll follow up with a vision of what could be in my next post.

Semantic Mediawiki

This is an example of a terrific project which is not what I'm looking for here. Semantic Mediawiki provides wiki markup that captures the knowledge contained within a wiki as RDF which can then be exported or queried. While an installation of Semantic Mediawiki will allow me to read and write RDF data via the Web, I am constrained within the wiki framework; further, the interface to reading and writing the RDF is markup-based rather than programmatic.

The Semantic Bank API

The SIMILE project provides an HTTP POST API for publishing and persisting RDF data found on local Web pages to a server-side bank (i.e. storage). They also provide a JavaScript library (BSD license) which wraps this API. While this API supports writing a particular type of RDF data to a store, it does not deal with reading arbitrary RDF from across the Web. The API also seems to require uploaded data to be serialized as RDF/XML before being sent to a Semantic Bank. This does not seem to be what I'm looking for to create RDF-based Web applications.

The Tabulator RDF parser and API

MIT student David Sheets created a JavaScript RDF/XML parser (W3C license). It is fully compliant with the RDF/XML specification, and as such is a great choice for any Web application which needs to gather and parse arbitrary RDF models expressed in RDF/XML. The Tabulator RDF parser populates an RDFStore object. By default, it populates an RDFIndexedFormula store, which inherits from the simpler RDFFormula store. These are rather sophisticated stores which perform (some) bnode and inverse-functional-property smushing and maintain multiple triple indexes keyed on subjects, predicates, and objects.

Clearly, this is an excellent API for developers wishing to work with the full RDF model; naturally, it is the appropriate choice for an application like the Tabulator which at its core is an application that eats, breathes, and dreams RDF data. As such, however, the model is very generic and there is no (obvious, simple) way to translate it into a domain-specific, non-RDF model to drive domain-specific Web applications. Also, the parser and store libraries are read-only: there is no capability to serialize models back to RDF/XML (or any other format) and no capability to store changes back to the source of the data.

(Thanks to Dave Brondsema for an excellent example of using the Tabulator RDF parser which clarified where the existing implementations of the RDFStore interface can be found.)

Jim Ley's JavaScript RDF parser

Jim Ley created perhaps the first JavaScript library for parsing and working with RDF data from JavaScript within a Web browser. Jim's parser (BSD license) handles most RDF/XML serializations and returns a simple JavaScript object which wraps an array of triples and provides methods to find triples by matching subjects, predicates, and objects (any or all of which can be wildcards). Each triple is a simple JavaScript object with the following structure:

{
  subject: ...,
  predicate: ...,
  object: ...,
  type: ...,
  lang: ...,
  datatype: ...
}

The type attribute can be either literal or resource, and blank nodes are represented as resources of the form genid:NNNN. This structure is a simple and straightforward representation of the RDF model. It could be relatively easily mapped into an object graph, and from there into a domain-specific object structure. The simplicity of the triple structure makes it a reasonable choice for a potential RDF/JSON serialization. More on this later.
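
To make the "mapped into an object graph" remark concrete, here is a rough sketch that turns such a triple array into linked JavaScript objects, naming each property after the part of the predicate URI behind the last "/" or "#". The function names and sample data are invented for illustration and are not part of Jim's parser. Name clashes between predicates, and a predicate whose local name happens to be "uri", are not handled.

```javascript
// Use the part of a URI after the last "/" or "#" as a property name.
function localName(uri) {
  var cut = Math.max(uri.lastIndexOf("/"), uri.lastIndexOf("#"));
  return uri.substring(cut + 1);
}

// Turn an array of triples into linked objects, indexed by subject id.
// A property that occurs more than once becomes an array of values.
function triplesToObjects(triples) {
  var nodes = Object.create(null);
  function node(id) {
    if (!nodes[id]) { nodes[id] = { uri: id }; }
    return nodes[id];
  }
  triples.forEach(function (t) {
    var subject = node(t.subject);
    var name = localName(t.predicate);
    var value = t.type === "resource" ? node(t.object) : t.object;
    if (!Object.prototype.hasOwnProperty.call(subject, name)) {
      subject[name] = value;
    } else if (Array.isArray(subject[name])) {
      subject[name].push(value);
    } else {
      subject[name] = [subject[name], value];
    }
  });
  return nodes;
}

var triples = [
  { subject: "http://example.org/lee", predicate: "http://example.org/address",
    object: "genid:1", type: "resource" },
  { subject: "genid:1", predicate: "http://example.org/city",
    object: "Cambridge", type: "literal" },
  { subject: "genid:1", predicate: "http://example.org/state",
    object: "MA", type: "literal" }
];

var lee = triplesToObjects(triples)["http://example.org/lee"];
var myState = lee.address.state;
```

Because resource-valued objects are replaced by references to their own nodes, nested access like lee.address.state falls out without any application-specific code.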

Jim's parser also provides a simple method to serialize the JavaScript RDF model to N-Triples, though that's the closest it comes to providing support for updating source data with a changed RDF graph.
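For a sense of what such a serialization involves, here is a small N-Triples writer over the same triple structure. It is a sketch rather than the code in Jim's parser: the function names are invented here, blank nodes are assumed to arrive as genid:NNNN strings (as described above), and non-ASCII characters are passed through without the \u escaping that strict N-Triples expects.

```javascript
// Escape the characters that must be escaped inside an N-Triples literal.
function escapeLiteral(text) {
  return String(text)
    .replace(/\\/g, "\\\\")
    .replace(/"/g, '\\"')
    .replace(/\n/g, "\\n")
    .replace(/\r/g, "\\r")
    .replace(/\t/g, "\\t");
}

// Write a resource: genid:NNNN becomes a blank node label, anything else a URI.
function resourceToNT(value) {
  if (value.indexOf("genid:") === 0) {
    return "_:genid" + value.substring(6);
  }
  return "<" + value + ">";
}

// Serialize one triple object to a single N-Triples line (without newline).
function tripleToNT(t) {
  var object;
  if (t.type === "literal") {
    object = '"' + escapeLiteral(t.object) + '"';
    if (t.lang) {
      object += "@" + t.lang;
    } else if (t.datatype) {
      object += "^^<" + t.datatype + ">";
    }
  } else {
    object = resourceToNT(t.object);
  }
  return resourceToNT(t.subject) + " <" + t.predicate + "> " + object + " .";
}

// Serialize an array of triples, one per line.
function toNTriples(triples) {
  return triples.map(function (t) { return tripleToNT(t) + "\n"; }).join("");
}
```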

Masahide Kanzaki's Javascript Turtle parser

In early 2006, Masahide Kanzaki wrote a JavaScript library for parsing RDF models expressed in Turtle. This parser is licensed under the terms of the GPL 2.0 and can parse into two different formats. One of these formats is a simple list of triples, (intentionally) identical to the object structure generated by Jim Ley's RDF/XML parser. The other format is a JSON representation of the Turtle document itself. This format is appealing because a nested Turtle snippet such as:

@prefix : <http://example.org/> .

:lee :address [ :city "Cambridge" ; :state "MA" ] .

translates to this JavaScript object:

{
  "@prefix": "<http://example.org/>",
  "address": {
    "city": "Cambridge",
    "state": "MA"
  }
}

While this format loses the URI of the root resource (http://example.org/lee), it provides a nicely nested object structure which could be manipulated easily with JavaScript such as:

  var lee = turtle.parse_to_json(turtleStr); // turtleStr holds the Turtle text shown above
  var myState = lee.address.state; // this is easy and domain-specific - yay!

Of course, things get more complicated with non-empty namespace prefixes (the properties become names like ex:name which can't be accessed using the obj.prop syntax and instead need to use the obj["ex:name"] syntax). This method of parsing also does not handle Turtle files with more than a single root resource well. And an application that used this method and wanted to get at full URIs (rather than the namespace prefix artifacts of the Turtle syntax) would have to parse and resolve the namespace prefixes itself. Still, this begins to give ideas on how we'd most like to work with our RDF data in the end within our Web app.
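Resolving the prefixes by hand might look something like the sketch below. It assumes the application already has a map from prefix to namespace URI (with the empty string standing for the default prefix); how that map is obtained from the parser's output is left open here, and the function names expandName and expandKeys are hypothetical.

```javascript
// Expand a name like "ex:name" (or "address" for the default prefix)
// to a full URI. Names with unknown prefixes are returned unchanged.
function expandName(name, prefixes) {
  var colon = name.indexOf(":");
  var prefix = (colon === -1) ? "" : name.substring(0, colon);
  var local = (colon === -1) ? name : name.substring(colon + 1);
  if (!Object.prototype.hasOwnProperty.call(prefixes, prefix)) {
    return name;
  }
  return prefixes[prefix] + local;
}

// Recursively copy a nested object, expanding every property name.
// Keys starting with "@" (such as "@prefix") are copied as they are.
function expandKeys(value, prefixes) {
  if (Array.isArray(value)) {
    return value.map(function (item) { return expandKeys(item, prefixes); });
  }
  if (value === null || typeof value !== "object") {
    return value;
  }
  var out = {};
  Object.keys(value).forEach(function (key) {
    var newKey = (key.charAt(0) === "@") ? key : expandName(key, prefixes);
    out[newKey] = expandKeys(value[key], prefixes);
  });
  return out;
}
```

The expanded object has to be read with the bracket syntax throughout, which is exactly the trade-off described above: full URIs are unambiguous, but the convenient obj.prop access is gone.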

Masahide Kanzaki also provides a companion library which serializes an array of triples back to Turtle. As with Jim Ley's parser, this may be a first step in writing changes to the RDF back to the data's original store; such an approach requires an endpoint which accepts PUT or POSTed RDF data (in either N-Triples or Turtle syntax).

SPARQL + SPARQL/JSON + sparql.js

The DAWG published a Working Group Note specifying how the results of a SPARQL SELECT or ASK query can be serialized within JSON. Elias and I have also written a JavaScript library (MIT license) to issue SPARQL queries against a remote server and receive the results as JSON. By default, the JavaScript objects produced from the library match exactly the SPARQL results in JSON specification:

{
  "head": { "vars": [ "book" , "title" ]
  } ,
  "results": { "distinct": false , "ordered": false ,
    "bindings": [
      {
        "book": { "type": "uri" , "value": "http://example.org/book/book6" } ,
        "title": { "type": "literal" , "value": "Harry Potter and the Half-Blood Prince" }
      } ,
      ...

The library also provides a number of convenience methods which issue SPARQL queries and return the results in less verbose structures: selectValues returns an array of literal values for queries selecting a single variable; selectSingleValue returns a single literal value for queries selecting a single variable which expect to receive a single row; and selectValueArrays returns a hash relating each of the query's variables to an array of values for that variable. I've used these convenience methods in the SPARQL calendar and SPARQL antibodies demos and found them quite easy to use for SPARQL queries returning small amounts of data.
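To illustrate the kind of flattening involved, the sketch below turns a SPARQL JSON results object into the less verbose shapes described above. These helpers are written for this example and are not the sparql.js implementation; the names valuesFor and valueArrays, and the choice to use null for unbound variables, are assumptions.

```javascript
// Collect the plain values bound to one variable, skipping unbound rows.
function valuesFor(results, variable) {
  return results.results.bindings
    .filter(function (row) { return row[variable] !== undefined; })
    .map(function (row) { return row[variable].value; });
}

// Build a hash from each variable name to an array of its values.
// Unbound variables yield null so that the arrays stay row-aligned.
function valueArrays(results) {
  var out = {};
  results.head.vars.forEach(function (variable) {
    out[variable] = results.results.bindings.map(function (row) {
      return row[variable] ? row[variable].value : null;
    });
  });
  return out;
}
```

Both helpers discard the type, language, and datatype information, which is what makes the result compact and also why it suits small, well-understood queries best.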

Note, however, that this method does not actually work with RDF on the client side. Because it is designed for SELECT (or ASK) queries, the Web application developer ends up working with lists of values in the application (more generally, a table or result set structure). Richard Cyganiak has suggested serializing entire RDF graphs using this method by using the query SELECT ?s ?p ?o WHERE { ?s ?p ?o } and treating the three-column result set as an RDF/JSON serialization. This is a clever idea, but results in a somewhat unwieldy JavaScript object representing a list of triples: if a list of triples is my goal, I'd rather use the Jim Ley simple object format. But in general, I'd rather have my RDF in a form where I can easily traverse the graph's relationships without worrying about subjects, predicates, and objects.
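For comparison, converting the three-column result set into the simple triple objects is short. The sketch below assumes the variables are named s, p, and o as in the query above, treats both "literal" and "typed-literal" bindings as literals, and uses empty strings for a missing language or datatype; that last detail is a guess at the convention, not something taken from Jim's parser.

```javascript
// Blank nodes are rewritten to the genid:NNNN form; URIs pass through.
function termValue(term) {
  return (term.type === "bnode") ? "genid:" + term.value : term.value;
}

// Turn SELECT ?s ?p ?o results (SPARQL JSON format) into triple objects.
function bindingsToTriples(results) {
  return results.results.bindings.map(function (row) {
    var o = row.o;
    var isLiteral = (o.type === "literal" || o.type === "typed-literal");
    return {
      subject: termValue(row.s),
      predicate: row.p.value,
      object: isLiteral ? o.value : termValue(o),
      type: isLiteral ? "literal" : "resource",
      lang: o["xml:lang"] || "",
      datatype: o.datatype || ""
    };
  });
}
```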

Additionally, the SPARQL SELECT query approach is a read-only approach. There is no current way to modify values returned from a SPARQL query and send the modified values (along with the query) back to an endpoint to change the underlying RDF graph(s).

JSONC, JSONI, and JSONP

Benjamin Nowack implemented the SPARQL JSON results format in ARC (W3C license), and then went a bit further. He proposes three additions/modifications to the standard SPARQL JSON results which result in saved bandwidth, more directly usable structures, and the ability to instruct a SPARQL endpoint to return JavaScript above and beyond the results object itself.

  • JSONC: Benjamin suggests an additional jsonc parameter to a SPARQL endpoint; the value of this parameter instructs the server to flatten certain variables in the result set. The result structure contains only the string value of the flattened variables, rather than a full structure containing type, language, and datatype information.
  • JSONI: JSONI is another parameter to the SPARQL endpoint which instructs the server to return certain selected variables nested within others. Effectively, this allows certain variables within the result set to be indexed based on the values of other variables. This results in more naturally nested structures which can be more closely aligned with domain-specific models and hence more directly useful by JavaScript application developers.
  • JSONP: JSONP is one solution to the problem of cross-domain XMLHttpRequest security restrictions. The jsonp parameter to a SPARQL server would specify the name of a function; the server wraps the resulting JSON object in a call to that function in the returned value. This allows the SPARQL endpoint to be used via a <script src="..."></script> invocation which avoids the cross-domain limitation.

The first two methods here are similar to what the sparql.js feature provides on the client side for transforming the SPARQL JSON results format. By implementing them on the server, JSONC and JSONI can save significant bandwidth when returning large result sets. However, in most cases bandwidth concerns can be alleviated by sending gzip'ed content, and performing the transforms on the client allows for a much wider range of possible transformations (and no burden on SPARQL endpoints to support various transformations for interoperability). As far as I know, ARC is currently the only SPARQL endpoint that implements JSONC and JSONI.
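As an example of doing this kind of transform on the client, the sketch below nests a standard SPARQL JSON result set under the values of one chosen variable. It only illustrates the general idea of indexing some variables by others; it does not reproduce ARC's JSONI output format, and the name indexBy is made up.

```javascript
// Group result rows by the value of keyVar. Each group holds the
// remaining variables of each row, flattened to plain values.
function indexBy(results, keyVar) {
  var index = Object.create(null);
  results.results.bindings.forEach(function (row) {
    if (!row[keyVar]) {
      return; // skip rows where the key variable is unbound
    }
    var key = row[keyVar].value;
    var rest = {};
    Object.keys(row).forEach(function (name) {
      if (name !== keyVar) {
        rest[name] = row[name].value;
      }
    });
    if (!index[key]) {
      index[key] = [];
    }
    index[key].push(rest);
  });
  return index;
}
```

Because this runs in the browser, the application can pick a different key variable or a different nesting for each query without the endpoint having to support any of it.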

JSONP is a reasonable solution in some cases to solving the cross-domain XMLHttpRequest problem. I believe that other SPARQL endpoints (Joseki, for instance) implement a similar option via an HTTP parameter named callback. Unfortunately, this method often breaks down with moderate-length SPARQL queries: these queries can generate HTTP query strings which are longer than either the browser (which parses the script element) or the server is willing to handle.
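A sketch of the client side of this approach follows. The query parameter is the standard SPARQL protocol parameter; the callback parameter name follows the convention mentioned above and may differ between endpoints. The function names, the endpoint URL in the usage comment, and the maximum length passed to fitsInUrl are illustrative assumptions, since actual limits vary by browser and server.

```javascript
// Build the URL for a JSONP-style request to a SPARQL endpoint.
function buildJsonpUrl(endpoint, query, callbackName) {
  var separator = (endpoint.indexOf("?") === -1) ? "?" : "&";
  return endpoint + separator +
    "query=" + encodeURIComponent(query) +
    "&callback=" + encodeURIComponent(callbackName);
}

// Check the URL against a caller-chosen limit before trying to use it.
function fitsInUrl(url, maxLength) {
  return url.length <= maxLength;
}

// Browser only: inject a script element so the response calls the callback.
function loadJsonp(url, doc) {
  var script = doc.createElement("script");
  script.src = url;
  doc.getElementsByTagName("head")[0].appendChild(script);
  return script;
}

// Usage in a browser (hypothetical endpoint and callback name):
//   window.handleResults = function (json) { /* use json.results.bindings */ };
//   var url = buildJsonpUrl("http://example.org/sparql", queryText, "handleResults");
//   if (fitsInUrl(url, 2000)) { loadJsonp(url, document); }
```

Note how quickly the encoded query grows: every space, brace, and question mark in the SPARQL text becomes three characters in the URL.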

Queso

Queso is the Web application framework component of the IBM Semantic Layered Research Platform. It uses the Atom Publishing Protocol to allow a browser-based Web application to read and write RDF data from a server. RDF data is generated about all Atom entries and collections that are PUT or POSTed to the server using the Atom OWL ontology. In addition, the content of Atom entries can contain RDF as either RDF/XML or as XHTML marked up with RDFa; the Queso server extracts the RDF from this content and makes it available to SPARQL querying and to other (non-Web) applications.

By using the Atom Publishing Protocol, an application working against a Queso server can both read and write RDF data from that Queso server. While Queso does contain JavaScript libraries to parse the Atom XML format into usable JavaScript objects, libraries do not yet exist to extract RDF data from the content of the Atom entries. Nor do libraries exist yet that can take RDF represented in JavaScript (perhaps in the Jim Ley fashion) and serialize it to RDF/XML in the content of an Atom entry. Current work with Queso has focused on rendering RDFa snippets via standard HTML DOM manipulation, but has not yet worked with the actual RDF data itself. In this way, Queso is an interesting application paradigm for working with RDF data on the Web, but it does not yet provide a way to work easily with domain-specific data within a browser-based development environment.

(Before Ben, Elias, and Wing come after me with flaming torches, I should add that Queso is still very much evolving: we hope that the lessons we learn from this survey and discussion about a vision of RDF-based Web apps (in my next post) will help guide us as Queso continues to mature.)

RPC / RESTful API / the traditional approach

I debated whether to put this on here and decided the survey was incomplete without it. This is the paradigm that is probably most widely used and is extremely familiar. A server component interacts with one or more RDF stores and returns domain-specific structures (usually serialized as XML or JSON) to the JavaScript client in response to domain-specific API calls. This is the approach taken by an ActiveRDF application, for instance. There are plenty of examples of this style of Web application paradigm: one which we've been discussing recently is the Boca Admin client, a Web app that Rouben is working on to help administer Boca servers.

This is a straightforward, well-understood approach to creating well-defined, scalable, and service-oriented Web applications. Yet it falls short in my evaluation in this survey because it requires a server and client to agree on a domain-specific model. This means that my client-side code cannot integrate data from multiple endpoints across the Web unless those endpoints also agree on the domain model (or unless I write client code to parse and interpret the models returned by every endpoint I'm interested in). Of course, this method also requires the maintenance of both server-side and client-side application code, two sets of code with often radically different development needs.

This is still often a preferred approach to creating Web applications. But it's not really what I'm thinking of when I contemplate the power of driving Web apps with RDF data, and so I'm not going to discuss it further here.


That's what I've got in my survey right now. I welcome any suggestions for things that I'm missing. In my next post, I'm going to outline a vision of what I see a developer-friendly RDF-based Web application environment looking like. I'll also discuss what pieces are already implemented (mainly using systems discussed in this survey) and which are not yet implemented. There'll also be many open questions raised, I'm sure. (Update: Part two is now available, also.)


(I didn't examine which of these approaches provide support for simple inferencing of the owl:sameAs and rdfs:subPropertyOf flavor, though that would be useful to know.)

January 09, 2007

Who loves RDF/XML?

by lee on TechnicaLee Speaking

I wrote the following as a comment on Seth's latest post about RDF/XML syntax, but the blog engine asked me to add two unspecified numbers, and I had a great deal of difficulty doing that correctly. So instead, it will live here, and I'd love to learn answers to this question from Seth or anyone else who might have any answers. Quoting myself:

Hi Seth,

This is a completely serious question: Who are these people who are insisting on RDF/XML as the/a core of the semantic web? Where can I meet them? Or have I met them and not realized it? Or are they mostly straw-men, as part of me suspects?

Inquiring minds -- and SWEO members -- want to know.

thanks,
Lee

January 02, 2007

Boca Admin has a few new features

by rmeschian on Rouben's Blog

  1. The properties table is storable.
  2. Refresh individual items in lists such as user metadata
  3. Refresh lists such as the list of all graphs in the system
  4. Alphabetically sort lists in ascending/descending order
  5. Search for items in lists by typing in any substring in any case (regular expressions are not yet supported).
  6. Add users to a role by just going through a list of users and checking off the ones you want to be in the specified role (this one is for you Matt).

Screenshot:

newbocafeatures1.jpg

December 27, 2006

Playing Fetch with the DAWG

by lee on TechnicaLee Speaking

The summary: I was looking for an easy way to search through minutes of the DAWG, given that some but not all of the minutes are reproduced in plain text within a mailing list message. All minutes are (in one way or another) URL accessible, however, so I set up Apache Nutch to crawl, index, and search the minutes. I learned stuff along the way, and that's what the rest of this post shares.


One of the first things I'm doing as I'm getting up to speed in my new role as DAWG chair is finding the issues the DAWG has not yet resolved and determining whether we're on target to address the issues. One of the issues raised a few months ago was the syntactical order of the LIMIT and OFFSET keywords within queries. I had remembered that the group had reached a decision about this issue, but did not remember the details. I wanted to find the minutes which recorded the decision.

I could have searched the mailing list for limit and offset and probably found what I needed by perusing the search results. But not all minutes make it into mailing list messages as something other than links or attachments, and I didn't want to wade through general discussion. I'd rather be able to search the minutes explicitly. So here's what I did:

(I work in a Windows XP environment with a standard Cygwin installation.)

  1. Updated the DAWG homepage, adding links to minutes of the past few months' teleconferences.
  2. Dug up a script I'd written last year to pull links from a Web page where the text of the link matches a certain pattern. Invoked this script with the pattern '\d+\s+?\w{3}' against the URL http://www.w3.org/2001/sw/DataAccess/ to pull out all the links to minutes from the Web page. This heuristic approach works well, but it would feel far more elegant to have the markup authoritatively tell me which links were links to minutes. Via RDFa, perhaps. I redirected the list of links produced by this script to the text file, dawg-minutes/root-urls/minutes.
  3. Downloaded the latest version of Apache Nutch and unzipped it, adding a symlink from nutch-install-dir/bin/nutch such that nutch ended up in my path.
  4. Followed instructions #2 and #3 from the Nutch user manual. This involves supplying a name to the user agent which Nutch crawls the Web with and also specifying a URL filter that decides which pages to crawl (or which pages not to crawl). To be on the safe side, I added these two lines to nutch-install-dir/conf/crawl-urlfilter.txt:
      +^http://([a-z0-9]*\.)*w3c.org/
      +^http://([a-z0-9]*\.)*w3.org
    
  5. The next step was to crawl the list of links I had already generated. I didn't want to follow any other links from these URLs, so this was a pretty simple invocation of Nutch. I did get trapped for a bit by the fact that earlier versions of Nutch required the command-line argument to be a text file with the list of URLs while the current version requires the argument to be the directory containing lists of links. I ended up invoking nutch as:
      cd dawg-minutes ; nutch crawl root-urls -dir nutch/ -depth 1
    
    This fetched, crawled, and indexed the set of DAWG minutes (but no other links thanks to the -depth 1) and stored the resulting data structures within the nutch subdirectory.
  6. At this point, I had (still unresolved) trouble getting the command-line search tool to work:
      nutch org.apache.nutch.searcher.NutchBean apache
    
    Regardless of the working directory from which I executed this, I always received Total hits: 0. This problem led me to discover Luke, the Lucene Index Toolbox, which confirmed for me that my indexes had been properly created and populated.
  7. I pressed ahead with getting Nutch's Web interface set up. I already had an installation of Apache Tomcat 5.5, so no installation was needed there. Instead, I copied the file nutch-install-dir/nutch-version.war to nutch.war at the root of my Tomcat webapps directory.
  8. I started Tomcat from the dawg-minutes/nutch directory (where Nutch had put all of its indexes and other data structures), and launched a Web browser to http://localhost:5000/nutch. (The default Tomcat install runs on port 8080, I believe; I have too many programs clamoring for my port 8080.)
  9. The Nutch search interface appeared, but again any searches that I performed led to no hits being returned!
  10. Some Web searching led me to a mailing-list message which suggested investigating the searcher.dir property in webapps/nutch/WEB-INF/classes/nutch-site.xml. I added this property with a value of c:/documents and settings/.../dawg-minutes/nutch and restarted tomcat.
  11. All's well that ends well.
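The link-pulling script from step 2 isn't reproduced in this post. A rough JavaScript approximation of the idea, assuming the page's HTML is already available as a string, might look like the following. The function name extractLinks is hypothetical, the regular-expression approach to HTML is deliberately simplistic (quoted href attributes only), and relative URLs are returned as written rather than resolved.

```javascript
// Return the href of every link whose visible text matches textPattern.
// textPattern should be a non-global RegExp so that test() is stateless.
function extractLinks(html, textPattern) {
  var anchor = /<a\b[^>]*\bhref\s*=\s*(?:"([^"]*)"|'([^']*)')[^>]*>([\s\S]*?)<\/a>/gi;
  var links = [];
  var match;
  while ((match = anchor.exec(html)) !== null) {
    var href = (match[1] !== undefined) ? match[1] : match[2];
    // Strip nested markup and collapse whitespace in the link text.
    var text = match[3].replace(/<[^>]*>/g, "").replace(/\s+/g, " ").trim();
    if (textPattern.test(text)) {
      links.push(href);
    }
  }
  return links;
}

// The pattern from step 2: digits, whitespace, then three word characters,
// which matches link text such as "19 Dec 2006".
var minutesPattern = /\d+\s+?\w{3}/;
```

Writing one matching href per line to a file inside dawg-minutes/root-urls would then give Nutch the seed list it expects in step 5.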

So I ran into a few speed bumps, but in the end I've got a relatively lightweight system for indexing and searching DAWG minutes. Hooray!

Searching the DAWG minutes with Apache Nutch

December 23, 2006

boca administrator is on its way

by rmeschian on Rouben's Blog

I have spent the last few weeks working on an administration system for boca.

The system is written using javascript on the client that communicates to a servlet using AJAX.

This system is run as a plugin to the boca server.

Below I have listed the various features that will be in the first release as well as screenshots of the application.

- LOGIN -

Feature list:

  1. Allow any user in the system to log in
  2. Only allow the user to perform operations that they have access to

Status:Functionally Complete

Screenshots:

- OVERVIEW -

Feature list:

  1. Display the properties the server was initialized with

Status:Functionally complete, may add some more metadata about the server

Screenshots:

overviewTabScreenshot.jpg

- GRAPHS -

Feature list:

  1. List all graphs the user has access to in the system
  2. Allow the user to be able to create new graphs
  3. Display revision number of the graph
  4. Display the creator of the graph
  5. Display who the graph was last modified by
  6. Display ACL information including which roles have access to what actions (read, add, remove, etc)
  7. Allow the user to change the graph ACL by adding roles to various permissions (read, add, remove, etc)

Status:Functionally Complete, requires more error checking and styling

Screenshots:

- USERS -

Feature list:

  1. List all users in the system.
  2. Allow the user to be able to create new users
  3. Display the user ID of each user in the system
  4. Allow the user to edit the user ID of each user in the system
  5. Allow the user to edit the user password of each user in the system
  6. Display all roles each user in the system is in
  7. Allow the user to add/remove roles each user in the system is in
  8. Display the default ACL template including which roles have access to what actions (read, add, remove, etc)
  9. Allow the user to change the default ACL template for each user in the system by adding roles to various permissions (read, add, remove, etc)

Status:Functionally Complete, requires more error checking and styling

Screenshots:


- ROLES -

Feature list:

  1. List all roles in the system
  2. Allow the user to be able to create new roles
  3. Display all roles each role is a subrole of
  4. Allow the user to edit which roles each role is a subrole of
  5. Display all users that are in each role
  6. Allow user to edit which users are in each role

Status:Functionally Complete, requires more error checking and styling

Screenshots:

- QUERY -

Feature list:

  1. Allow users to perform SPARQL queries against the boca server
  2. Render the results as a table/xml/etc

Status:Not written yet

no screenshot available

Please note that I have only listed the features that I intend to implement for the first release of this tool. I realize that there is a long wish list in the minds of all boca developers, so feel free to post comments and suggestions so that I can add them to the wish list for the second release.

December 17, 2006

SPARQL for Flickr: Picture the Possibilities

by wingerz on ~wingerz

flickr.jpg

I recently purchased my first DSLR camera. It wasn’t an easy decision, and at some point I was looking for sample photos taken by a non-DSLR under a certain condition (wide aperture). I started with the Flickr Camera Finder. There is so much wonderful data on pages like the list of Canon cameras and the individual camera pages. The data can be viewed in several ways, but it all just leaves me wanting more. Sure, for a particular camera I can search for pictures tagged with “food”, but what if I want to specify photos with a wide aperture that were taken on November 23, 2005?

They’re sitting on a gold mine of data, but the only way to get at it is through the web API (The advanced search is not very powerful). It’s possible to get at some of the EXIF data (photo metadata), but only if you have the ID for a photo; there’s no way to search across all of the images. Even if they managed to implement this particular interface, what if I want to search for photos that satisfy these restrictions that were posted by users within three friend-links of me?

If Flickr slaps a SPARQL endpoint on its data, it opens up all sorts of amazing possibilities. Using API keys, they could allow paid access to the data from photo equipment sellers (and free access to web hackers), who would be able to offer their customers the ability to find pictures taken with particular cameras and lenses and the people who own them (possibly restricting this set of people to friends or foafs). Of course, Flickr could put together a proprietary web API and do this now, but then they would have to code up every new API method request themselves rather than letting data subscribers write their own queries. And SPARQL-able data has the additional benefit of being easier to integrate with other sources.
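To give a flavor of what a data subscriber's own query could look like, here is a sketch that builds a SPARQL query for tagged photos shot at a wide aperture on a given day. Everything about the vocabulary is invented for illustration: Flickr offers no such endpoint, and the namespace and the ex:tag, ex:fNumber, and ex:dateTaken properties are hypothetical.

```javascript
// Hypothetical vocabulary; no such Flickr schema exists.
var PHOTO_NS = "http://example.org/photo#";

// Build a query for photos with a given tag, an f-number at or below
// maxFNumber (a smaller f-number means a wider aperture), taken on isoDate.
function buildPhotoQuery(tag, maxFNumber, isoDate) {
  if (!/^\d{4}-\d{2}-\d{2}$/.test(isoDate)) {
    throw new Error("isoDate must look like YYYY-MM-DD");
  }
  if (!isFinite(Number(maxFNumber))) {
    throw new Error("maxFNumber must be a number");
  }
  // Escape backslashes and quotes so the tag is a valid string literal.
  var safeTag = String(tag).replace(/\\/g, "\\\\").replace(/"/g, '\\"');
  return [
    "PREFIX ex: <" + PHOTO_NS + ">",
    "PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>",
    "SELECT ?photo WHERE {",
    '  ?photo ex:tag "' + safeTag + '" ;',
    "         ex:fNumber ?f ;",
    '         ex:dateTaken "' + isoDate + '"^^xsd:date .',
    "  FILTER (?f <= " + Number(maxFNumber) + ")",
    "}"
  ].join("\n");
}
```

A call such as buildPhotoQuery("food", 2.8, "2005-11-23") covers the example from the start of this post; adding a friend-of-a-friend restriction would be a matter of adding more patterns to the query rather than waiting for a new API method.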

December 14, 2006

ODO - A Perl framework for the Semantic Web

by sean on IBM Semantic Layered Research Platform

ODO, an acronym for “Ontologies, Databases, and Optimizations,” is a Perl framework for RDF manipulation. ODO is still evolving and I have more features to push out, but right now it supports:

  • Nodes, statements and graph backed by memory
  • Plastor, RDFS and OWL-Lite to Perl code generators (like Jastor, but for Perl)
  • Queries using RDQL with SPARQL on its way
  • RDF/XML and NTriple parsers

Each of these items is built inside the ODO framework so it is possible to extend and enhance the library over time without breaking applications (hopefully!).

I have additional components of the library to push out over the next few weeks and am in the process of making the POD available on this site. I will write some demo applications soon! Feedback is always welcome on our developer list.

Stephen Evanchik, December 2006

ODO: Semantic Web libraries in Perl

by lee on TechnicaLee Speaking

My colleague Stephen Evanchik has announced the release of ODO, part of the IBM Semantic Layered Research Platform:

ODO is an acronym for "Ontologies, Databases, and Optimizations," which are the three items I was most interested in experimenting with at the time. They were also the three categories of functionality I couldn't find in the existing Perl RDF libraries. ODO is still evolving and I have some more features to push out but right now it supports:

  • Nodes, statements and graph backed by memory
  • RDFS and OWL-Lite to Perl code generators
  • Queries using RDQL with SPARQL on its way
  • RDF/XML and NTriple parsers

The second point on that list is our Perl analog of the Jastor project, which generates Java code for RDF data access from OWL ontologies.

Contributors

Ben Szekely
Into the Woods
Elias Torres
Elias Torres
IBM SLRP
IBM Semantic Layered Research Platform
Lee Feigenbaum
TechnicaLee Speaking
Rouben Meschian
Rouben's Blog
Stephen Evanchik
Stephen Evanchik - Semantic Web
Wing Yung
~wingerz

Last updated on February 12, 2007 06:00 PM (All times are UTC)