RECOVERY.GOV
Tracking, making sense of, and assessing the stimulus program requires other kinds of data than are provided today in recovery.gov.
The Obama administration values openness, transparency, collaboration, and connected governance. This is a big topic, especially given the scale of the Federal government (the scope of policy, information, and services that it deals with) and the number of individual citizens, media and social media, private companies, non-profit organizations, localities, states and regional organizations involved in one way or another with the stimulus program.
The Recovery.gov site is intended to enable people (citizens, investigative journalists, interest groups, government employees, … anyone) to find out how recovery funds are being spent and what we are getting for our money. Today, the information being reported by agencies is limited. It encompasses the program areas authorized and the funds obligated by each agency, but not how the money is being used or what is being accomplished with the funds.
So, this is not enough information to enable us to know whether any part of the stimulus program is working. We need to link obligation data from agencies with other sources that will enable us to track who is actually spending the money, where they are spending it, what they are spending it on, how they are spending it, and most importantly, what measurable results we are able to see coming from the monies spent. Where does this data come from? Some will come from Recovery.gov, but much more will come from sources outside of the recovery.gov spreadsheets. So, we need ways to connect information from disparate sources together, if we are going to follow the money from authorized program to use of funds to reported result.
Also, to investigate and analyze recovery initiatives, we need data that allows citizens to go deeper than mere surface reports. Creating 200 jobs in a town of 100 people would be suspicious. Creating 200 jobs in a locality that needs 2 million would be insignificant, hence a failure. We need to see recovery programs, spending, and accomplishments in the context of local statistical data and demographics. Sources of local information will be important in assessing the performance of specific projects. Data that enables citizens to gauge whether funds are being wasted or used fraudulently will come from varied public and private sources, e.g., internet social media, news feeds. We need transparency all the way into each specific project, and across its life cycle.
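The two job-creation examples above amount to a plausibility check of a reported figure against local context. A minimal sketch follows, assuming a hypothetical function name and an arbitrary 1% threshold for "insignificant"; neither comes from recovery.gov, and real thresholds would need to be set by analysts.

```python
def assess_job_claim(jobs_reported, population, jobs_needed, min_share=0.01):
    """Classify a reported job count against local context.

    min_share is an assumed, illustrative cut-off: a claim covering less
    than this fraction of the local need is labelled insignificant.
    """
    # More jobs than residents cannot be right at face value.
    if jobs_reported > population:
        return "suspicious"
    # A tiny fraction of the local need is too small to matter.
    if jobs_needed > 0 and jobs_reported / jobs_needed < min_share:
        return "insignificant"
    return "plausible"


# 200 jobs in a town of 100 people
print(assess_job_claim(200, population=100, jobs_needed=50))
# 200 jobs where 2 million are needed (population figure is made up)
print(assess_job_claim(200, population=5_000_000, jobs_needed=2_000_000))
```

The point is only that such a check needs population and labor statistics alongside the spending report, which is why local data sources matter.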
The idea is to take a “data.gov approach to recovery.gov” and to implement it using Web 2.0 and Web 3.0 techniques. Web 2.0 in this sense means using XML to expose data structures, file formats, and services such as feeds. Web 3.0 means going further to expose and link the concepts and relationships found in data structures, file formats, documents, and web pages using semantic web and natural language technologies.
For additional background on the role of cloud computing, web 2.0, and web 3.0 semantic technologies in implementing Obama administration themes of transparent, open, and collaborative e-governance, see the following presentation:
DATA.GOV
The Data.gov site is intended to provide a one-stop location where citizens (and their machines) can gain access to public information in a usable form. For data.gov to function, data providers need to expose five key categories of information that make transparency and accessibility operational for both people and machines, namely:
1 data set -- can be uploaded or remote
2 tagging or metadata about the data giving the structure of the data, provenance, etc.
3 definitions of what this metadata means (concepts, relationships, assertions, etc.), also "domain knowledge" explaining how to interpret it.
4 API or web services method of accessing the data, preferably REST-ful
5 URL where the data can be accessed.
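One way to picture the five categories is as a single record that a provider fills in for each data set. The sketch below is an assumption for illustration only: the class name, field names, and example.org URLs are hypothetical and do not describe an actual data.gov schema.

```python
from dataclasses import dataclass


@dataclass
class DatasetDescriptor:
    data_location: str   # 1. the data set itself (uploaded or remote)
    metadata: dict       # 2. structure, provenance, etc.
    definitions: dict    # 3. what the metadata terms mean
    api_endpoint: str    # 4. web service for accessing the data
    access_url: str      # 5. URL where the data can be accessed

    def missing_categories(self):
        """Return the names of categories the provider left empty."""
        return [name for name, value in vars(self).items() if not value]


descriptor = DatasetDescriptor(
    data_location="http://example.org/data/obligations.csv",
    metadata={"columns": ["project_id", "funds_obligated"], "source": "agency report"},
    definitions={},  # left empty on purpose to show the check
    api_endpoint="http://example.org/api/obligations",
    access_url="http://example.org/datasets/obligations",
)
print(descriptor.missing_categories())
```

A check like this makes it easy to see which providers have supplied data but not the definitions needed to interpret it.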
The data for data.gov is to be selected and provided by all 25 agencies and 125 programs. The basic idea, in its simplest form, is to expose tables of information using XML and to provision data sets to the public using RSS feeds. The back-end concept is under discussion. One view is that agencies should expose data in some standard format that would be ported to some form of data.gov warehouse where it is made available through services. Another concept is that agencies should select from data they already make available on their websites, and provide a web services interface to enable data.gov to crawl the site and harvest the information needed. There are trade-offs either way. Personally, I favor making fewer demands on agencies and handling conversion and interpretation issues as part of data.gov's value add.
Metadata, Tags & Semantics.
To create mash-ups through data.gov or any aggregation vehicle, we have to know something about the data structure and what the data names (or column headings, or stubs, etc.) mean. For example, to answer a recovery.gov query I may need two or more spreadsheets containing different information from different sources. I want to build a composite that combines some information from each of them. Sure, I can always rebuild the spreadsheet by hand using Excel. Hand-building is tedious because Excel knows nothing about what any of the column heads mean. But what if I could expose the structure of the spreadsheet so that the computer software could interpret each column heading as the name of a property, and the cells beneath it as instances of that property? Then, assuming I've been careful about how I named columns, I'd have a semi-automated way to build composite spreadsheets. That’s what semantic web technologies give me.
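To make the composite-spreadsheet idea concrete, here is a minimal sketch that treats column headings as property names and joins two tables on a shared property. The tables, column names, and values are invented for illustration, and plain CSV handling stands in for the semantic web tooling discussed here.

```python
import csv
import io

# Two made-up tables from different sources that share one column name.
obligations_csv = """project_id,agency,funds_obligated
P1,DOT,500000
P2,DOE,250000
"""
results_csv = """project_id,jobs_reported
P1,12
P2,7
"""


def read_rows(text):
    """Read CSV text into dicts keyed by column heading (the 'property')."""
    return list(csv.DictReader(io.StringIO(text)))


def merge_on(key, left, right):
    """Combine rows from both tables that agree on the shared property."""
    right_index = {row[key]: row for row in right}
    merged = []
    for row in left:
        match = right_index.get(row[key])
        if match is not None:
            merged.append({**row, **match})
    return merged


composite = merge_on("project_id", read_rows(obligations_csv), read_rows(results_csv))
for row in composite:
    print(row)
```

This only works because both tables happen to use the same heading for the same thing, which is exactly the care in naming that the paragraph above assumes.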
Returning to recovery.gov, it gets more complicated if I want to combine (in some way) tables created by different people at different times using different data definitions. It is even more complex if I want to map and link content from different spreadsheets, files, documents, web pages, etc. together. In the past we’ve always done this mapping one-off, by hand, with the results embedded in a particular solution and not readily re-usable.
To map or mash-up different data sources and services, I have a metadata problem to resolve. What did person X mean by data field Y? Is the value presented in spreadsheet #1 (for example, a measurement of something over some time period) computed in a way that is compatible with other values that are presented in spreadsheet #2? Also, what do I know about the provenance and quality or integrity of the data in either table? When people resolve these questions, they draw on additional information about the data, as well as some knowledge about the domain in which the data is used. Typically, we find definitions written out in documents or on web pages, or we call someone who knows what the data names and data structures mean.
What if we would like to have an automated or semi-automated way to speed up and simplify the process of combining or mashing-up data? Then, the computer software needs to access and interpret more information than just the data set plus the data names. That is, if we want the computer to be able to help us to interpret, align, and link information from different sources, then we need to express the semantics of this metadata in a way that is machine computable.
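As a rough illustration of what "machine computable" could mean here, the sketch below records, for each source and column name, the shared concept it denotes and a unit conversion factor, so software can align values from two differently defined tables. The source names, column names, concept name, and factors are all hypothetical; a real system would draw these from a published vocabulary or ontology rather than a hard-coded dictionary.

```python
# (source, column heading) -> (shared concept, multiplier to the shared unit)
CONCEPT_MAP = {
    ("sheet1", "Oblig ($K)"): ("funds_obligated_usd", 1000),
    ("sheet2", "obligated_dollars"): ("funds_obligated_usd", 1),
}


def harmonize(source, row):
    """Translate one row into shared concepts, skipping unknown columns."""
    harmonized = {}
    for column, value in row.items():
        mapping = CONCEPT_MAP.get((source, column))
        if mapping is None:
            # No machine-readable definition: a person still has to decide.
            continue
        concept, factor = mapping
        harmonized[concept] = float(value) * factor
    return harmonized


print(harmonize("sheet1", {"Oblig ($K)": "500", "notes": "estimate"}))
print(harmonize("sheet2", {"obligated_dollars": "500000"}))
```

Both rows come out under the same concept and unit, while the column with no definition is left for a human to interpret.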
What about natural language, and content structures other than tables? In addition, programs that extract information and concepts from natural language (or other sources) can apply the RSS strategy to communicate concepts and instances. This assumes that the classes of information being communicated are known, or otherwise discoverable, by the entity receiving the RSS feed. Typically, the number of classes is restricted, for example, to people, places, organizations, events. Also, when relationships are extracted, then the categories of these are restricted.
Linked Data
Linked data is a powerful idea that has gained a growing following. Linked data calls for expressing structures, tags, and metadata as RDFa / XHTML, which provides a pretty robust way to move the metadata around. Semantic web technologies provide a standard way to express both the data (or document, or page) and the meta information about the concepts and relationships of the data set, which can then be processed and transformed externally in various ways, not only to change formatting but also to map, align, and harmonize alternative meanings of terms.
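For a sense of what linked data looks like underneath, the sketch below turns one table row into subject, predicate, object statements and prints them in the N-Triples line format. Note that this swaps in N-Triples as a simpler serialization than the RDFa / XHTML mentioned above, and that the base URI, term names, and row values are hypothetical.

```python
BASE = "http://example.org/recovery/"  # hypothetical namespace


def uri(local_name):
    """Wrap a local name as a full URI reference."""
    return f"<{BASE}{local_name}>"


def literal(value):
    """Quote a plain string value, escaping backslashes and quotes."""
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def row_to_triples(row, key):
    """One statement per column: the row is the subject, the column the property."""
    subject = uri(f"project/{row[key]}")
    return [
        (subject, uri(f"term/{column}"), literal(value))
        for column, value in row.items()
        if column != key
    ]


def to_ntriples(triples):
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


print(to_ntriples(row_to_triples({"project_id": "P1", "agency": "DOT"}, "project_id")))
```

Because every subject and property is a URI, statements produced by different sources can be combined and linked without first agreeing on a table layout.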
RSS feeds can communicate the structure of information, together with the content or instances of it. Typically this is applied to tables of data, forms, and simple document structures. Conceivably, it could be extended to schemas of graphics as well. It is the responsibility of the recipient of the RSS feed to know what the tags and structure of information mean.
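A small sketch of that RSS strategy: each table row becomes a feed item, and each column becomes a namespaced element inside it, so the structure travels with the content. The namespace URI, prefix, feed title, link, and row values are assumptions for illustration, not an existing data.gov or DC government feed format.

```python
import xml.etree.ElementTree as ET

TERMS_NS = "http://example.org/recovery-terms#"  # hypothetical vocabulary
ET.register_namespace("rec", TERMS_NS)


def build_feed(rows):
    """Serialize table rows as an RSS 2.0 feed with one item per row."""
    rss = ET.Element("rss", {"version": "2.0"})
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = "Example obligations feed"
    ET.SubElement(channel, "link").text = "http://example.org/feeds/obligations"
    ET.SubElement(channel, "description").text = "Illustrative table rows as feed items"
    for row in rows:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = row["project_id"]
        # Each column heading becomes a namespaced element carrying the cell value.
        for column, value in row.items():
            ET.SubElement(item, f"{{{TERMS_NS}}}{column}").text = value
    return ET.tostring(rss, encoding="unicode")


feed_xml = build_feed([{"project_id": "P1", "funds_obligated": "500000"}])
print(feed_xml)
```

The feed says that an item has a "funds_obligated" element, but not what that element means; the recipient still has to look up the vocabulary, which is the limitation noted above.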
A precursor of current thinking about data.gov was a program that the DC government established to provide RSS feeds for a number of data sets that could be used to create mash-ups and composite services. There was even a contest for putting together interesting mash-ups, widgets etc. that produced a number of cool services, and at zero cost to the government. The data.gov initiative will likely feature contests.
Semantic publishing.
Semantic publishing means sharing both the human readable, and the machine interpretable forms of information. In addition to tables of structured data, present the data structure and data element definitions and metadata in both human and machine-readable forms. Key ideas here include:
· First of all, the concepts of "public information" cover a lot more kinds of content than structured data tables, for example, all of the human-readable content published through government web sites, document collections, published reports, and much, much more.
· Second, the "publishing" concept carries with it the implication that this body of information has been subject to editing for understanding by people as well as to some "quality standard" of consistency, accuracy, and credibility.
· Third, the "semantic" aspect of the publishing means that all associated contents -- text, tables, graphics, images, etc. -- come with their structure and metadata expressed digitally (using semantic web, ontology, or other standard) so that any content in the publication can be machine processed as well as read by humans.
· Fourth, important government data aggregations and document collections already exist that combine multiple sources, have well formulated methodologies, have checked their facts, and are well curated and supported. These make logical building blocks for open public information that is trustworthy, accessible by concepts, and accessible in an internet friendly way.
While it is always a good idea to start simply and get some early successes, it is also important to look ahead so as not to get boxed in. There is plenty of value in exposing the structure of tables together with the data they contain using RSS. But this approach will encounter limitations as the data.gov site scales to more and more data sets from more sources, as well as to sources of higher knowledge content. For data.gov to succeed, we will need to share knowledge about the data and domain in forms that both humans and machines can interpret. This will include incorporating terminology references and techniques for disambiguating the senses of natural language used to identify and define data concepts, provenance, and trust. This requires a form of syndication that is inclusive of ontology (definitions) and natural language understanding.