A Distributed Data Object Architecture for One Stop
Shopping for Government Statistics on the Internet.
Cavan Capps, Bureau of the Census
William Rankin, Bureau of the Census
Introduction: The Current Environment and Current Expectations
Note: This document reflects the views of the authors and is not intended to represent the official views or policy of the U.S. Census Bureau.
Comments are appreciated.
The expansion of the Internet is redefining the information explosion. It is rapidly becoming the medium of choice for inexpensive, timely access to information of all types. In effective organizations, it gives businesses increased access to customers and clients, and increases feedback between provider and client on a scale never previously experienced. As the web provides direct access to intelligent computers and databases, the sophistication of information and services is increasing. Originally developed as a way to post documents, the web is becoming the provider of a new class of information, referred to as active content. Documents on the web can come alive, be queried, adapt to individual user requests, and drill down to provide more precise, granular content upon request. In effect, documents on the web are a new medium, providing distinctly different capabilities from current paper reports.
It is into this environment that the statistical community must thrust itself. Taxpayers are demanding more for their money, and statistical information can no longer be the domain of the highly educated few with access to expensive computer resources. Better statistical graphics and easier access to data will increase both the range of use and the influence of the important statistical datasets carefully maintained, at taxpayer expense, by the different government agencies.
Specifically, Web Technology makes:
Needs and Public Expectations:
New technologies have growing pains. The very success of the Web has been achieved through an explosion of complex innovation that is often inconsistent and difficult to master. The explosion of information makes it difficult to find information quickly. The statistical community is working to address this through projects like "Fedstats" and the "WhiteHouse Briefing Room". However, for many users, access to complex data and statistics still requires a time-consuming search that may often be unsuccessful. Specifically, the public expects:
Due to advances in Internet and database technology, these
user requests are more technically achievable than is generally perceived.
Making data and statistics more accessible requires less of an advance
in technology than a change in the way we think about data and its delivery.
Basic Principles
Proposed Architecture
Data Treated as Objects
Data is described as an object with associated relationships,
rules and display methods stored in a relational database. This provides
easy maintenance and allows us to gracefully extend the attributes, rules
and methods of data as needs develop and information becomes available.
Microdata attributes include information on:
Macrodata Tables attributes include information on:
Macrodata Timeseries attributes include information on:
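The idea of describing data as an object whose attributes, rules and display methods live in a relational database can be sketched as follows. This is only an illustration under stated assumptions: the table names (`data_object`, `object_attribute`), the column names, and the sample object `median_income` with its attribute values are all hypothetical and do not come from the prototype. The point it tries to show is that storing attributes as rows lets a description be extended later without changing the schema.

```python
import sqlite3

# Hypothetical schema: one row per data object, plus one row per attribute.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE data_object (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE,
    kind TEXT NOT NULL CHECK (kind IN ('microdata', 'table', 'timeseries')),
    display_method TEXT NOT NULL
);
CREATE TABLE object_attribute (
    object_id INTEGER NOT NULL REFERENCES data_object(id),
    attr_key TEXT NOT NULL,
    attr_value TEXT NOT NULL,
    PRIMARY KEY (object_id, attr_key)
);
""")


def register_object(name, kind, display_method, attributes):
    """Insert a data object and its attributes; return the new object id."""
    cur = conn.execute(
        "INSERT INTO data_object (name, kind, display_method) VALUES (?, ?, ?)",
        (name, kind, display_method),
    )
    object_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO object_attribute (object_id, attr_key, attr_value) "
        "VALUES (?, ?, ?)",
        [(object_id, k, v) for k, v in attributes.items()],
    )
    return object_id


def describe(name):
    """Return the object's kind, display method and attributes as a dict."""
    row = conn.execute(
        "SELECT id, kind, display_method FROM data_object WHERE name = ?",
        (name,),
    ).fetchone()
    if row is None:
        raise KeyError(name)
    object_id, kind, display_method = row
    attrs = dict(conn.execute(
        "SELECT attr_key, attr_value FROM object_attribute WHERE object_id = ?",
        (object_id,),
    ))
    return {"kind": kind, "display_method": display_method, "attributes": attrs}


# Illustrative object; the values are placeholders, not real metadata.
oid = register_object("median_income", "timeseries", "line_chart",
                      {"units": "dollars", "frequency": "annual"})
# Extending the description later needs only a new row, not a schema change.
conn.execute("INSERT INTO object_attribute VALUES (?, ?, ?)",
             (oid, "source", "example survey"))
info = describe("median_income")
```

A front end would call something like `describe` to learn how to display an object before asking for any of its data.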
A Distributed, Replicated Data Object Repository
Defined as:
Data object repositories should replicate public data
definitions to all other data repositories in the system. Confidential
or private data should not be replicated through the system. Organizations
that need to access private data on their internal networks should access
the local private repository, which contains both local private and public
data definitions.
Each organization should host the data object definitions of its own private or confidential data and, potentially, the definitions of the public data from other organizations. The public data definitions should be replicated across servers from different organizations. This spreads users across a robust, redundant, distributed system, ensuring reasonable performance in the same way that the Internet Domain Name Service does.
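The replication rule described above (public definitions go everywhere, private definitions stay in their home repository) can be sketched in a few lines. The classes `Definition` and `Repository`, the function `replicate`, and the sample definition names are hypothetical and exist only for illustration; a real system would replicate over the network and handle updates and conflicts, which this sketch ignores.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Definition:
    """A data object definition (metadata only, never the data itself)."""
    owner: str
    name: str
    public: bool


@dataclass
class Repository:
    """One organization's repository of data object definitions."""
    organization: str
    definitions: dict = field(default_factory=dict)

    def add(self, definition):
        self.definitions[(definition.owner, definition.name)] = definition

    def public_definitions(self):
        return [d for d in self.definitions.values() if d.public]


def replicate(repositories):
    """Copy every public definition to every repository.

    Private definitions are never copied, so they remain visible only
    in the repository of the organization that owns them.
    """
    shared = [d for repo in repositories for d in repo.public_definitions()]
    for repo in repositories:
        for d in shared:
            repo.definitions.setdefault((d.owner, d.name), d)


# Illustrative repositories and definition names (assumptions).
census = Repository("census")
census.add(Definition("census", "population", public=True))
census.add(Definition("census", "respondent_file", public=False))

bls = Repository("bls")
bls.add(Definition("bls", "price_index", public=True))

replicate([census, bls])
```

After replication, a user who reaches either server can discover every public definition, while `respondent_file` is known only to the `census` repository.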
Encapsulated Distributed Data Engines
The physical format, storage methods, large scale aggregation
methods, and large scale transformation methods are encapsulated in the
data engine object. The data-engine object knows how the data set is physically
stored and knows how to manipulate the data to return a data subset or
aggregation in the format requested by the user. This allows multiple data
storage technologies to be used by different organizations that are geographically
distributed. Systems must be maintainable. The most expensive part of any
system is the cost of labor needed to support it.
This architecture encourages the organization that sponsors the data to use the type of database technology that it understands best, at the scale that is most cost effective. At the Census Bureau we are using SAS, SYBASE, and Oracle on UNIX computers. There is a great need for an inexpensive data warehousing technology that would allow smaller organizations with fewer resources to mount other data sets. Data is separated organizationally from the various data request front ends. Data can be reorganized physically to improve performance without affecting other parts of the distributed system. New data warehousing technology can be substituted for older database technologies as appropriate without side effects.
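The encapsulation described above can be illustrated with a minimal sketch in which two engines store the same records differently but answer the same request identically. Everything named here is an assumption made for the example: the `DataEngine` interface, its single `total_by` method, the `ListEngine` and `SqliteEngine` classes, and the placeholder records. The engines in actual use (SAS, SYBASE, Oracle) are not modeled.

```python
import sqlite3
from abc import ABC, abstractmethod


class DataEngine(ABC):
    """Hides how a data set is stored; exposes only the request interface."""

    @abstractmethod
    def total_by(self, group_column, value_column):
        """Return {group value: sum of value_column} for the data set."""


class ListEngine(DataEngine):
    """Keeps records as a list of dicts in memory."""

    def __init__(self, rows):
        self._rows = list(rows)

    def total_by(self, group_column, value_column):
        totals = {}
        for row in self._rows:
            key = row[group_column]
            totals[key] = totals.get(key, 0) + row[value_column]
        return totals


class SqliteEngine(DataEngine):
    """Keeps the same records in a relational table."""

    _COLUMNS = {"region", "persons"}

    def __init__(self, rows):
        self._conn = sqlite3.connect(":memory:")
        self._conn.execute("CREATE TABLE facts (region TEXT, persons INTEGER)")
        self._conn.executemany(
            "INSERT INTO facts VALUES (:region, :persons)", list(rows)
        )

    def total_by(self, group_column, value_column):
        # Column names cannot be bound as parameters, so check them first.
        if not {group_column, value_column} <= self._COLUMNS:
            raise ValueError("unknown column")
        query = (f"SELECT {group_column}, SUM({value_column}) "
                 f"FROM facts GROUP BY {group_column}")
        return dict(self._conn.execute(query))


# Placeholder records, not real statistics.
records = [
    {"region": "A", "persons": 10},
    {"region": "B", "persons": 5},
    {"region": "A", "persons": 7},
]
list_result = ListEngine(records).total_by("region", "persons")
sql_result = SqliteEngine(records).total_by("region", "persons")
```

Because a front end depends only on `DataEngine`, an organization could swap one storage technology for another without the front end changing.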
Public Intergovernmental Registration Process
A major strength of this approach is that evolutionary
improvement and multi-organization participation are designed into the system.
Different organizations have different technical strengths. For example,
organizations that maintain geographic boundary information and geographic
display tools might contribute their display methods to the overall system.
For multiple tools to work together effectively, protocols
must be established that ensure that different modules plug together seamlessly.
Each module should include a digital signature of the organization that
provides and supports it, as well as links to appropriate help and a mail
address for the supporting organization.
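A registration step of the kind described above might look like the following sketch. It makes several assumptions that are not in the paper: the required fields, the `Registry` class, the module name and the example.org addresses are all hypothetical. Note also that an HMAC over a shared secret is used here as a stand-in for a true public-key digital signature, which would require a cryptographic library outside the Python standard library.

```python
import hashlib
import hmac

# Hypothetical set of fields every registered module must declare.
REQUIRED_FIELDS = ("name", "organization", "help_url", "contact", "entry_point")


def sign(module, key):
    """Compute a tag over the module's declared fields.

    Stand-in only: an HMAC with a shared secret, not a real digital signature.
    """
    payload = "\n".join(f"{f}={module[f]}" for f in REQUIRED_FIELDS)
    return hmac.new(key, payload.encode("utf-8"), hashlib.sha256).hexdigest()


class Registry:
    """Accepts a module only if it is complete and its tag verifies."""

    def __init__(self, org_keys):
        self._org_keys = org_keys  # organization name -> secret key (bytes)
        self._modules = {}

    def register(self, module, signature):
        missing = [f for f in REQUIRED_FIELDS if f not in module]
        if missing:
            raise ValueError(f"missing fields: {missing}")
        key = self._org_keys.get(module["organization"])
        if key is None:
            raise PermissionError("unknown organization")
        if not hmac.compare_digest(sign(module, key), signature):
            raise PermissionError("signature does not match module contents")
        self._modules[module["name"]] = dict(module)

    def lookup(self, name):
        return self._modules[name]


# Illustrative module description; every value is a placeholder.
geo_key = b"example-secret"
module = {
    "name": "boundary_map",
    "organization": "geo_agency",
    "help_url": "https://example.org/help/boundary_map",
    "contact": "support@example.org",
    "entry_point": "geo_display.render_map",
}
signature = sign(module, geo_key)

registry = Registry({"geo_agency": geo_key})
registry.register(module, signature)

# A module altered after signing is rejected.
tampered = dict(module, contact="someone-else@example.org")
try:
    registry.register(tampered, signature)
    tamper_rejected = False
except PermissionError:
    tamper_rejected = True
```

Checking required fields at registration is one simple way to enforce a plug-in protocol; the help link and contact address travel with the module so users always know who supports it.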
Summary
The Web has inspired greater public expectations and created
new possibilities for intelligent information access. The statistical agencies,
in particular, have an opportunity to revolutionize the way the public
accesses and understands data. Documents on the web can be active, data
can be expressed graphically, and the data can explain its relationship to
other data items, how it was created, and how statistically reliable
it is. The distributed nature of information on the web is both a
strength and a challenge. Finding the appropriate information quickly continues
to be a difficult task. The construction of a distributed data object repository
with distributed data engines can begin to help users find appropriate
data items more quickly and provide for a more powerful data expression
over the web. Many of the ideas expressed here are actively being explored
in the data access and dissemination system prototyping process started
in 1995 by the Bureau of the Census and the Bureau of Labor Statistics.