A Distributed Data Object Architecture for One Stop Shopping for Government Statistics on the Internet.

                        
Cavan Capps       Bureau of the Census             
William Rankin    Bureau of the Census


Introduction: The Current Environment and Current Expectations

Note: This document reflects the views of the authors and is not intended to represent the official views or policy of the U.S. Census Bureau.

Comments are appreciated.

The expansion of the Internet is redefining the information explosion. It is rapidly becoming the medium of choice for inexpensive, timely access to information of all types. In effective organizations, it is giving businesses increased access to customers and clients, and increasing feedback between provider and client on a scale never previously experienced. As the web provides direct access to intelligent computers and databases, the sophistication of information and services is increasing. Originally developed as a way to post documents, the web is becoming the provider of a new class of information, referred to as active content. Documents on the web can come alive, be queried, adapt to individual user requests, and drill down to provide more precise, granular content upon request. In effect, documents on the web are a new medium, providing capabilities distinctly different from those of current paper reports. It is into this environment that the statistical community must thrust itself. Taxpayers are demanding more for their money, and statistical information can no longer be the domain of the highly educated few with access to expensive computer resources. Better statistical graphics and easier access to data will increase both the range of use and the influence of the important statistical datasets carefully maintained, at taxpayer expense, by the different government agencies.

Specifically, Web technology:

  1. Lowers the cost of data dissemination.
  2. Makes data more accurate, because information can be rapidly updated.
  3. Makes data interactive and open to manipulation.
  4. Allows information to be stored once, maintained by the original content provider, and presented from different points of view.
  5. Makes government information more democratic. The web is growing faster than previous information technologies, including radio, television, and the telephone.


Needs and Public Expectations:

New technologies have growing pains. The very success of the Web has been achieved through an explosion of complex innovation that is often inconsistent and difficult to master. The explosion of information makes it difficult to find information quickly. The statistical community is working to address this through projects like "Fedstats" and the "White House Briefing Room". However, for many users, access to complex data and statistics still requires a time-consuming search that may often be unsuccessful. Specifically, the public expects the following:

  1. Data and statistics should be easy to find. Users want to go to one place and easily find the answer to their question.
  2. Users want to navigate to data on the basis of the topic, not on the basis of the organizational structure of the statistical community.
  3. Users want the agency that sponsors the data to maintain it. On the Web accountability is credibility.
  4. Users are looking for consistency. Users are confused by inconsistency in approach. Currently, every data set is explored and accessed differently.
  5. Users want the latest technology and tools that are available at low cost.
  6. Users need help in using data properly.
  7. Users expect more visual displays of data.
  8. Users would like to use data from many sources to build composite pictures. Currently this is difficult.
  9. Users would like to use data from multiple government jurisdictions. For example, users would like to use national and state health, educational, and demographic data together.
  10. Users need more information about the data including how to use it and how to relate it to other data and statistics.

Due to advances in Internet and database technology, these user requests are more technically achievable than generally perceived. Making data and statistics more accessible requires less of an advance in technology than a change in the way we think about data and its delivery.

Basic Principles

  1. Data should be treated as objects with characteristics, relationships with other data, and associated rules and methods.
  2. Data should be physically distributed. It should be located close to those who sponsor it, because they can effectively maintain the accuracy and immediacy of its content. Coordination must be enforced by technical protocols and effective tools.
  3. Access to data should be negotiated through a distributed, replicated "Data Object Repository".
  4. Common Internet Registration of Data Objects should be available and enforced.
  5. Common Internet Registration of Data Methods should be available and enforced.
  6. Data access, manipulation, display and data engine tools should be object oriented. Their functionality should be encapsulated in modules that are designed to be replaced as new technological options become available. Parts of the system should gracefully evolve without creating serious side-effects throughout other elements of the system.



Proposed Architecture

Data Treated as Objects

Data is described as an object with associated relationships, rules, and display methods, all stored in a relational database. This provides easy maintenance and allows us to gracefully extend the attributes, rules, and methods of data as needs develop and as information becomes available.
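As a rough illustration of describing a data object in a relational database, the sketch below uses Python's built-in sqlite3 module. The table names (data_object, attribute, display_method), the column names, and the sample variable household_income are hypothetical and are not taken from any Census Bureau system. Storing attributes as rows rather than as fixed columns is one possible way to let the description grow without changing the schema.

```python
import sqlite3

# Hypothetical schema: every name here is an illustrative assumption.
SCHEMA = """
CREATE TABLE data_object (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE,
    kind TEXT NOT NULL              -- e.g. 'microdata', 'table', 'timeseries'
);
CREATE TABLE attribute (
    object_id INTEGER NOT NULL REFERENCES data_object(id),
    key       TEXT NOT NULL,
    value     TEXT NOT NULL
);
CREATE TABLE display_method (
    object_id INTEGER NOT NULL REFERENCES data_object(id),
    method    TEXT NOT NULL
);
"""


def build_demo():
    """Create an in-memory database holding one made-up data object."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute(
        "INSERT INTO data_object (id, name, kind) "
        "VALUES (1, 'household_income', 'microdata')"
    )
    conn.executemany(
        "INSERT INTO attribute VALUES (1, ?, ?)",
        [("weight", "person_weight"), ("user_note", "illustrative note")],
    )
    conn.executemany(
        "INSERT INTO display_method VALUES (1, ?)",
        [("barchart",), ("scatter-plot",)],
    )
    return conn


def describe(conn, name):
    """Return the attributes and display methods recorded for one data object."""
    row = conn.execute(
        "SELECT id, kind FROM data_object WHERE name = ?", (name,)
    ).fetchone()
    if row is None:
        raise KeyError(name)
    object_id, kind = row
    attributes = dict(
        conn.execute(
            "SELECT key, value FROM attribute WHERE object_id = ?", (object_id,)
        )
    )
    methods = [
        method
        for (method,) in conn.execute(
            "SELECT method FROM display_method WHERE object_id = ? ORDER BY method",
            (object_id,),
        )
    ]
    return {"name": name, "kind": kind, "attributes": attributes,
            "display_methods": methods}
```

In this sketch, a new attribute such as an embargo rule would be added with one more row in the attribute table, which is the kind of graceful extension the paragraph above describes.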

Microdata attributes include information on:

  1. History (unedited data, recodes, fills, skip patterns, etc.)
  2. Comparability to data collected in previous surveys (for example CPS 1994 to CPS 1992)
  3. Links between other appropriate files (longitudinal files to supplements, etc.)
  4. User notes
  5. Security and embargo
  6. Use of weights, variance estimates, reliability controls for aggregations
  7. Appropriate display methods for
  8. Display methods of the underlying data object. Connecting the display methods to the data will provide the ability to query the data from data displays. For example, scatter-plots, graphs, and barcharts can be used to directly query data and related data.
  9. Related data may include:
  10. Matching Rules
  11. Aggregation Rules (for interpolated medians, percent change, weighting, etc.)
  12. Links to other User Requested Data Element metadata including:

Macrodata Tables attributes include information on:

  1. Reliability information by cell
  2. Creation information (variables used, estimation used)
  3. Data set version (does the table reflect any reruns of the microdata?)
  4. User notes
  5. Variances and standard errors (reliability)
  6. Confidentiality controls
  7. Links to related Timeseries
  8. Links to related Microdata components
  9. Code to run table with a slightly different universe

Macrodata Timeseries attributes include information on:

  1. Links to appropriate published tables that show data context.
  2. Creation information (variables used, estimation used, assumptions made to bridge survey changes made in the underlying micro data)
  3. Links to related timeseries (seasonally adjusted, not seasonally adjusted, inflation adjusted, nominal series, levels, flows, percentage changes, different geographic levels, different SIC levels, etc.)
  4. User notes
  5. Appropriate display methods (line charts, thematic maps, etc.)
  6. Links to variance information
  7. Breaks in timeseries (due to changes in measurement methods, etc.)
  8. Links to published analyses and press releases.
  9. Links to aggregation algorithms

A Distributed, Replicated Data Object Repository

Defined as:

  1. Containing data object definitions for multiple agencies, covering both Federal and State data.
  2. Supporting confidential data at different agencies, with access control administered by each agency.
  3. Containing a common, replicated public view of the data definitions. The public portion of the repository should be replicated or mirrored across organizations. This could follow the model of the Internet Domain Name System (DNS), which maps host names (for example, www.bls.gov) to IP addresses.
  4. Including definitions that support searches for data across data sets and time, and that support matching between appropriate datasets.
  5. Data access tools should query the data object model to specify data requests. The data is now intelligent: it knows about itself and how it can be used. After the user has specified the query using the definitions available in the repository, the query is passed to the appropriate data-engine for processing. The data-engine is an object whose methods know how to complete the physical query and do tabulation and recoding.
  6. The repository should contain pointers to the distributed data engines where the physical copies of the data exist. The repository should not record how the physical data is stored; it should remain logical in nature.
  7. The repository should remain lean. As the number of data objects increases the size of the repository will grow quickly. All text should be referred to by pointers in the repository and physically stored outside of the repository.
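The request flow described in the list above (specify the query against the repository, then pass it to the engine the repository points to) might be sketched as follows. This is a minimal sketch under stated assumptions: the class names Definition and Repository, the function run_query, and the idea of representing an engine as a plain callable are all hypothetical and are not part of the proposed system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Definition:
    """Logical description of one data object (hypothetical fields)."""
    name: str        # logical name of the data object
    topic: str       # subject used for topic-based navigation
    engine: str      # pointer to the data engine holding the physical data
    notes_ref: str   # pointer to descriptive text kept outside the repository


class Repository:
    """Holds logical definitions only; it never stores the data itself."""

    def __init__(self):
        self._definitions = {}

    def register(self, definition):
        self._definitions[definition.name] = definition

    def find_by_topic(self, topic):
        """Let users navigate by topic rather than by owning organization."""
        return sorted(
            d.name for d in self._definitions.values() if d.topic == topic
        )

    def resolve(self, name):
        return self._definitions[name]


def run_query(repository, engines, name, request):
    """Resolve a logical name, then hand the request to the engine it points to.

    `engines` maps an engine pointer to a callable; in a real deployment the
    pointer would presumably be a network address rather than a dictionary key.
    """
    definition = repository.resolve(name)
    engine = engines[definition.engine]
    return engine(name, request)
```

Note that the definition carries only pointers (to the engine and to the descriptive text), which is one way to keep the repository lean and logical as items 6 and 7 recommend.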




Data Object Repositories should replicate public data definitions to all other data repositories in the system. Confidential or private data should not be replicated through the system. Organizations that need to access private data on their internal networks should access the local private repository, which contains both local private and public data definitions.

Each organization should host the data object definitions of its own private or confidential data and, potentially, the definitions of the public data from other organizations. The public data definitions should be replicated across servers from different organizations. This will create a robust, redundant system that spreads users across many servers, ensuring reasonable performance in the same way that the Internet Domain Name System does.
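The replication rule above (public definitions travel to every repository, private definitions stay where they are hosted) could be sketched as below. The class RepositoryNode, the function synchronize, and the all-to-all copying strategy are assumptions made for illustration; the paper does not specify a replication protocol.

```python
class RepositoryNode:
    """One organization's repository, split into public and private parts."""

    def __init__(self, owner):
        self.owner = owner
        self.public = {}    # definitions replicated to every peer
        self.private = {}   # confidential definitions that never leave this node

    def add(self, name, definition, confidential=False):
        target = self.private if confidential else self.public
        target[name] = definition

    def replicate_to(self, peer):
        """Copy only the public definitions to a peer node."""
        peer.public.update(self.public)

    def lookup(self, name, internal=False):
        """Public callers see public definitions; internal callers see both."""
        if internal and name in self.private:
            return self.private[name]
        return self.public.get(name)


def synchronize(nodes):
    """Naive all-to-all replication of public definitions (an assumption)."""
    for source in nodes:
        for target in nodes:
            if source is not target:
                source.replicate_to(target)
```

After synchronize runs, any node can answer a public lookup, while a confidential definition is visible only to internal callers at the organization that hosts it.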


Encapsulated Distributed Data Engines

The physical format, storage methods, large scale aggregation methods, and large scale transformation methods are encapsulated in the data engine object. The data-engine object knows how the data set is physically stored and knows how to manipulate the data to return a data subset or aggregation in the format requested by the user. This allows multiple data storage technologies to be used by different organizations that are geographically distributed. Systems must be maintainable. The most expensive part of any system is the cost of labor needed to support it.

This architecture encourages the organization that sponsors the data to use the type of database technology that it understands best. It allows the organization to use the scale of database that is most cost effective. At the Census Bureau we are using SAS, SYBASE, and Oracle on UNIX computers. There is a great need for an inexpensive data warehousing technology that would allow smaller organizations with fewer resources to mount other data sets. Data is separated organizationally from the various data request front ends. Data can be reorganized physically to improve performance without affecting other parts of the distributed system. New data warehousing technology can be substituted for older database technologies as appropriate without side effects.
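One way to picture this encapsulation is a common engine interface with interchangeable storage behind it. The sketch below is illustrative only: the DataEngine interface, its count_by method, and the two backends (a Python list and an in-memory SQLite table) are hypothetical stand-ins for the SAS, SYBASE, and Oracle systems mentioned above, and the survey columns are invented.

```python
import sqlite3
from abc import ABC, abstractmethod


class DataEngine(ABC):
    """Interface seen by front ends; storage details stay hidden behind it."""

    @abstractmethod
    def count_by(self, variable):
        """Return {category: count} for one categorical variable."""


class InMemoryEngine(DataEngine):
    """Backend that keeps records in a plain Python list."""

    def __init__(self, records):
        self._records = list(records)

    def count_by(self, variable):
        counts = {}
        for record in self._records:
            key = record[variable]
            counts[key] = counts.get(key, 0) + 1
        return counts


class SqliteEngine(DataEngine):
    """Backend that keeps the same records in a relational table."""

    _COLUMNS = ("region", "sector")  # hypothetical survey variables

    def __init__(self, records):
        self._conn = sqlite3.connect(":memory:")
        self._conn.execute("CREATE TABLE survey (region TEXT, sector TEXT)")
        self._conn.executemany(
            "INSERT INTO survey VALUES (:region, :sector)", list(records)
        )

    def count_by(self, variable):
        # Column names cannot be bound as parameters, so check a whitelist.
        if variable not in self._COLUMNS:
            raise KeyError(variable)
        query = f"SELECT {variable}, COUNT(*) FROM survey GROUP BY {variable}"
        return dict(self._conn.execute(query))
```

A front end that calls count_by receives the same result from either backend, so the physical storage can be swapped or reorganized without changes elsewhere in the system.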


Public Intergovernmental Registration Process

A major strength of this approach is that evolutionary improvement and multi-organization participation are designed into the system. Different organizations have different technical strengths. For example, organizations that maintain geographic boundary information and geographic display tools might contribute their display methods to the overall system. For multiple tools to work together effectively, protocols must be established that ensure that different modules plug together seamlessly. Each module should include a digital signature of the organization that provides and supports it, as well as links to appropriate help and a mail address for the supporting organization.
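A registration check of this kind might look like the sketch below. It is a simplification under loud assumptions: the required fields, the MethodRegistry class, and the sample module are hypothetical, and an HMAC-SHA256 tag computed with a shared secret stands in for a true public-key digital signature, which Python's standard library does not provide.

```python
import hashlib
import hmac

# Hypothetical registration protocol: fields every module must supply.
REQUIRED_FIELDS = ("name", "provider", "help_url", "contact", "render")


def sign(module_name, provider, key):
    """HMAC-SHA256 tag, a stand-in for a real public-key digital signature."""
    message = f"{provider}:{module_name}".encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


class MethodRegistry:
    """Accepts a display method only if it follows the protocol and verifies."""

    def __init__(self, provider_keys):
        self._keys = provider_keys  # provider name -> shared secret (assumption)
        self._modules = {}

    def register(self, module, signature):
        missing = [field for field in REQUIRED_FIELDS if field not in module]
        if missing:
            raise ValueError(f"missing fields: {missing}")
        if not callable(module["render"]):
            raise TypeError("render must be callable")
        key = self._keys.get(module["provider"])
        if key is None:
            raise PermissionError("unknown provider")
        expected = sign(module["name"], module["provider"], key)
        if not hmac.compare_digest(expected, signature):
            raise PermissionError("signature check failed")
        self._modules[module["name"]] = module

    def get(self, name):
        return self._modules[name]
```

In this sketch the help link and contact address travel with the module, so a user of a shared display method can always identify and reach the organization that supports it.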


Summary


The Web has inspired greater public expectations and creates new possibilities for intelligent information access. The statistical agencies, in particular, have an opportunity to revolutionize the way the public accesses and understands data. Documents on the web can be active, data can be expressed graphically, and the data can explain its relationship to other data items, how it was created, and how statistically reliable it is. The distributed nature of information on the web is both a strength and a challenge. Finding the appropriate information quickly continues to be a difficult task. The construction of a distributed data object repository with distributed data engines can begin to help users find appropriate data items more quickly and provide for more powerful data expression over the web. Many of the ideas expressed here are actively being explored in the data access and dissemination system prototyping process started in 1995 by the Bureau of the Census and the Bureau of Labor Statistics.