Class Projects

Class Projects for IT 864/ CSI 710 Fall 2003

Class Projects - Fall 2002 Link

Class Projects - Fall 2001 Link

Class Projects - Fall 2000 Link

Group 1

Title: Knowledge SIFTER (Smart Information Filtering Technology for Electronic Resources): Knowledge Acquisition and Integration from multiple heterogeneous sources.

Name E-mail

Mizan Chowdhury mchowdh1@gmu.edu
Scott Mitchell smitch@erols.com
Hanjo Jeong hjeong@gmu.edu
Stephen Smith ssmith7@gmu.edu
Alberto Damiano adamiano@gmu.edu
Jingwei Si jsi@gmu.edu

Abstract:

We will extend WebSifter to handle spatio-temporal user queries, generate queries against existing public geo-located data, and return a prioritized list of results. This requires converting a non-technical user query into spatial and/or temporal elements so that the appropriate queries can be made to the databases. The conversion will be based on the existing WebSifter concept of the Ontology Agent. Extra agents will be developed to derive location and time information from the user query. We propose two such agents, which will convert location names (e.g., Washington) and event names (e.g., 9/11) to location(s) and time(s). Clearly these agents can be rather extensive; we plan on either accessing an existing database or creating a minimal test database of our own. As a proof of concept, that information will then be used to formulate queries for each public database. Further development must be done to access public data sites (e.g., USGS and NASA Goddard DAAC) and retrieve the relevant data.
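As a rough illustration of the two proposed agents, the sketch below resolves location and event names against a minimal test database. It is not part of WebSifter: the gazetteer and event table, their contents, and the names `location_agent`, `event_agent`, and `resolve` are all assumptions made for this example, and the coordinates are approximate.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Tuple

# Hypothetical minimal test gazetteer: place name -> (latitude, longitude).
GAZETTEER = {
    "washington": (38.9072, -77.0369),  # Washington, D.C. (approximate)
    "new york": (40.7128, -74.0060),    # New York City (approximate)
}

# Hypothetical minimal event table: event name -> (place name, date).
EVENTS = {
    "9/11": ("new york", date(2001, 9, 11)),
}


@dataclass
class SpatioTemporalQuery:
    """Spatial and temporal elements derived from a free-text query."""
    location: Optional[Tuple[float, float]] = None
    when: Optional[date] = None


def location_agent(text):
    """Return coordinates of the first known place name found in the text."""
    lowered = text.lower()
    for name, coords in GAZETTEER.items():
        if name in lowered:
            return coords
    return None


def event_agent(text):
    """Return (coordinates, date) of the first known event name in the text."""
    lowered = text.lower()
    for name, (place, when) in EVENTS.items():
        if name in lowered:
            return GAZETTEER[place], when
    return None


def resolve(text):
    """Combine both agents; an explicit place name wins over the event's place."""
    query = SpatioTemporalQuery(location=location_agent(text))
    event = event_agent(text)
    if event is not None:
        event_location, query.when = event
        if query.location is None:
            query.location = event_location
    return query
```

A query that falls outside the test database simply yields empty elements, which is one possible way to detect queries outside the source database scope.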

Ten Questions:

  1. What kind of information does each ontology agent supply to the query?
  2. How are the USGS, Goddard DAAC, and NOAA databases queried?
  3. How do we generate location information from a user query?
  4. How do we generate temporal information from a user query?
  5. How should the user format the query?
  6. How do we gain extra value from our query?
  7. How do we rank our data?
  8. How do we handle user queries outside of our source database scope?
  9. How do we assure ourselves of the accuracy of our response?
  10. How can user feedback be incorporated?


Group 2

Title: Development of a Versatile SNP Database Management Tool by Bridging Pre-existing LIMS with Novel Polymorphic DNA Analysis Tools and Allowing Dynamic User-centric Design and Query

Name E-mail

Andrew Carr dcarr2@gmu.edu
Ari Kahn akahn1@gmu.edu
Billy Wang bwanghome@yahoo.com

Abstract:

One of the most common avenues for studying genetic variation within complex diseases is single nucleotide polymorphism (SNP) analysis. SNPs are small changes that occur within a DNA sequence. A large volume of this genetic variation information has accrued. For example, dbSNP has over 9,577,627 SNPs in its database. Management of this information from data collection through analysis is crucial.

The production of such data sets is increasingly common, and it is recognized that the data can support a variety of analyses. However, there is a lack of tools to manage and analyze SNP data. This becomes especially problematic for smaller organizations that have little or no in-house bioinformatics software support. Currently most scientists use a disconnected collection of software scripts, and much of the process is manual and error-prone, often involving many reformatting steps. As a result the process is slow and cumbersome. In addition, results are kept in simplistic databases (i.e., flat files or schema databases). Analysis may be laborious, requiring manual entry of data into a spreadsheet format; dynamic analysis of the data at intermediate stages is not an option in these instances. Final results are often made public at data warehouses such as dbSNP, but often require an additional format change before publication. Further compounding the issue is the nonstandard platform compatibility of analysis tools. What is needed is a streamlined process for acquiring, analyzing, and curating SNP project data so that it can be efficiently shared and mined.

One approach to solving this problem is to develop a database management system that is flexible enough to accommodate different SNP projects with variable attribute sets and to allow user-centric questions to be asked about the polymorphic DNA and its metadata. This management system, through a system of dynamic argument lists and monitored constraints, can be developed to interface with pre-existing SNP LIMS and polymorphic DNA analysis tools.
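One way to picture variable attribute sets is a row-modeled table (one row per project, SNP, attribute, and value) from which a per-project view is built on demand. The sketch below uses Python's built-in `sqlite3` module; the table name `snp_attribute`, its columns, the sample rows, and the helper names are assumptions for illustration, not the project's actual schema.

```python
import sqlite3


def build_pivot_sql(attributes):
    """Build a pivot query with one output column per requested attribute."""
    if not attributes:
        raise ValueError("at least one attribute is required")
    # Attribute names are bound as parameters, never pasted into the SQL text.
    columns = ", ".join(
        f"MAX(CASE WHEN attribute = ? THEN value END) AS col{i}"
        for i in range(len(attributes))
    )
    return (
        f"SELECT snp_id, {columns} FROM snp_attribute "
        "WHERE project = ? GROUP BY snp_id ORDER BY snp_id"
    )


def dynamic_view(conn, project, attributes):
    """Return one row per SNP with the requested attributes as columns."""
    sql = build_pivot_sql(attributes)
    return conn.execute(sql, [*attributes, project]).fetchall()


# Hypothetical sample data: two projects with different attribute sets.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE snp_attribute "
    "(project TEXT, snp_id TEXT, attribute TEXT, value TEXT)"
)
conn.executemany(
    "INSERT INTO snp_attribute VALUES (?, ?, ?, ?)",
    [
        ("projA", "snp1", "allele", "A/G"),
        ("projA", "snp1", "chromosome", "7"),
        ("projA", "snp2", "allele", "C/T"),
        ("projB", "snp9", "tissue", "liver"),
    ],
)

rows = dynamic_view(conn, "projA", ["allele", "chromosome"])
```

A SNP that lacks a requested attribute comes back with `None` in that column, so a newly defined project-specific attribute can be queried without changing the table definition.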

Ten Questions:

1) How feasible and useful will this approach to SNP database management be for real-world database projects?

2) Which SQL tool will work best to create a dynamic view? That is, should we use SQL, database programming functions, or an outside program to manage queries?

3) What are the best methods to join and make cross-study comparisons (i.e., between projects)?

  • Should an attribute ontology universal to all studies/projects be defined?
  • How does the query adapt to newly defined project specific attributes?
  • What is the best way to reconcile similar attributes with different names across studies/projects?

4) What methods should be used to cleanse data to HIPAA standards?

  • Should cleansing be done by allowing only HIPAA-compliant attributes, or should security be installed so that only those with permission can view sensitive data?

5) What is the optimal way for storing and accessing row modeled data?

  • Is it faster to have one long list of attributes containing all projects or should each project be stored as an individual file?
  • Eventually data should be pulled from the warehouses. Should this information be stored locally or queried when needed?

6) What is the best way to deal with the enforcement of constraints and triggers?

7) What is the best method for efficiently managing the SNP frequency data?

  • What happens to the data and the frequency tables necessary for analysis when a rollback occurs?
  • Should we use SQL, database functions, or an outside program to manage the queries?

8) Given a set of polymorphisms (i.e. a haplotype), what is the probability that this matches an existing set of attributes?

9) What is the best manner, in regards to attribute defined haplotypes, for probability information to be stored and updated?

10) How should user preferences be monitored and tracked?

  • What is the best way to keep track of session information?
  • How should analysis of specific data be married to the data so as to alleviate the need to repeat a study?

 


Group 3

Title:  Data Management System for the Generation and Validation of an Electronic Common Technical Document (eCTD)

Name E-mail
Chris Holland chris-holland@comcast.net
Sanjeev Raman sraman@gmu.edu
Andrey Makeev amakeev@ssd5.nrl.navy.mil

Abstract: 

Recent guidelines from global health authorities will revolutionize the means by which the pharmaceutical and biotechnology industries apply for licenses to market new drugs. Guidelines for the electronic Common Technical Document (eCTD) allow for the elimination of paper submissions and therefore greatly facilitate the entire process of creating and reviewing these applications. The creation process, however, will still be a difficult one. Although these guidelines specify a rigid organizational structure for these submissions, the content will vary from one drug application to the next. The numerous files that represent scientific documents, study reports, and databases will have to be specified with each submission and then organized properly. A tool that could 1) enter and store the information needed for each required file, 2) automatically organize the files, and 3) generate the required XML “backbone” file(s) (which are used to electronically navigate through the comprehensive document) would greatly facilitate the submission process and replace the tedious, labor-intensive process of moving files, renaming files, and entering information manually. The purpose of this project is to develop such a tool, or set of tools, for achieving this task.

Ten Questions:

  1. What specific information would be needed in an eCTD document database?
  2. What database architecture would best suit an eCTD document database?
  3. Can a flexible data-entry tool be developed so that modifications to the eCTD format can be easily reflected in the data-entry system?
  4. Using a database of the necessary files and file attributes needed for each specific eCTD, can a tool be developed that would automatically create the appropriate INDEX.XML file used as the “backbone” of the eCTD?
  5. Using the same database, can a tool also be developed for automatically creating the appropriate “study tagging” XML files necessary for each clinical (or non-clinical) study report?
  6. Can a tool be developed to handle the task of moving and renaming the files necessary for each specific eCTD from the various locations on a network into the appropriate file structure under the eCTD instance?
  7. Can a tool be developed to handle the creation of MD5 checksum values for each file referenced in the eCTD backbone file (index.xml)?
  8. Can all of the above-mentioned tools be incorporated into one application?
  9. How will changes to health authority and International Conference on Harmonisation (ICH) regulations affect each of the eCTD creation tools?
  10. How viable would such an application be to the pharmaceutical and biotechnology industry?
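Questions 4 and 7 can be sketched together: walk a submission folder, compute an MD5 checksum for each file, and emit one XML entry per file. The sketch below uses only the Python standard library. The element and attribute names (`backbone`, `leaf`, `href`, `checksum`, `checksum-type`) are simplified placeholders chosen for this example and are not taken from the official eCTD DTD, so a real tool would have to follow the published specification instead.

```python
import hashlib
import xml.etree.ElementTree as ET
from pathlib import Path


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, read in chunks to bound memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_backbone(root_dir):
    """Build a simplified backbone element listing every file under root_dir."""
    root_dir = Path(root_dir)
    backbone = ET.Element("backbone")  # placeholder name, not the eCTD DTD
    files = sorted(p for p in root_dir.rglob("*") if p.is_file())
    for path in files:
        ET.SubElement(
            backbone,
            "leaf",
            {
                "href": path.relative_to(root_dir).as_posix(),
                "checksum-type": "md5",
                "checksum": md5_of_file(path),
            },
        )
    return backbone


def write_backbone(root_dir, out_path):
    """Serialize the backbone for root_dir to an XML file at out_path."""
    tree = ET.ElementTree(build_backbone(root_dir))
    tree.write(out_path, encoding="utf-8", xml_declaration=True)
```

Writing the backbone outside the submission folder (or excluding it from the walk) avoids listing the backbone file inside itself on a second run.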

Group 4

Title: Molecular Cluster Database

Name E-mail
Xiao Dong xdong@scs.gmu.edu
Pamela Williams williams_pamela@bah.com
Chunling Zhang czhang1@gmu.edu
Vitoreia Addei vaddei@gmu.edu

Abstract:

We propose developing a database of molecular cluster structures based on research conducted by the GMU Laboratory for the Computer Design of Materials, and integrating this database with the Cambridge Structural Database (CSD) and the American Mineralogist Crystal Structure Database (AMCSD). We plan to incorporate the interactive software for the AMCSD that allows the molecular cluster structures to be viewed and manipulated and computes physical properties for clusters such as geometry and electron densities. Further, we plan to develop software to compute other cluster properties, and conduct statistical analysis of properties for different size clusters.

Ten Questions:

  1. How should attribute data with different physical units be reconciled?

  2. How should cluster data be properly attributed in the database?

  3. The CCD URL ( http://www-wales.ch.cam.ac.uk/CCD.html ) contains a series of links to articles and the research group that provided the data for each cluster data set.

  4. What physical properties can be computed for each cluster data set?

  5. Should these computed properties be stored in the database at the request of the user?

  6. Which cluster properties as a function of cluster size should be analyzed statistically?

  7. How can the cluster structures be viewed and manipulated by the user?

  8. Should the cluster property computation and statistical analysis be part of the interactive user interface? How should the results be displayed to the user?

 


Group 5

Title: Integrated online system for forest fire query, detection and forecasting

Name E-mail
Vatsula Bisht vbisht@gmu.edu
Xianjun Hao xhao1@gmu.edu
Mona Al Razgan malrazga@gmu.edu
Gaurav Jain gjain@gmu.edu
Wen Zhang wzhang@gmu.edu

Abstract: 

Forest fires are a major natural disaster in the United States, causing extensive property damage each year. It is therefore very important to develop an online information system that provides forest fire occurrence information and forecasts forest fires. Satellites can observe the earth with large spatial coverage and long temporal coverage, so satellite remote sensing provides a very efficient way to detect and forecast forest fires.

The main objective of this project is to develop an integrated platform for forest fire occurrence information management, forest fire detection, and forest fire forecasting. The system can be used for forest fire occurrence information queries, forest fire danger alerts, and the development and validation of forest fire forecasting algorithms.

Some systems for managing fire occurrence information exist, but most of them do not provide a convenient user interface for queries. As for forest fire forecasting, there are some algorithms based on satellite remote sensing, but few of them provide a convenient user interface. In addition, current systems usually use different data formats and different spatial and temporal resolutions, so it is very difficult to integrate information from different systems.

In this project, we will:

  1. Create a database for related satellite remote sensing datasets and fire occurrence information, and provide a web interface for queries.
  2. Implement some fire detection and forecasting algorithms based on remote sensing, and provide a web interface for fire danger queries.
  3. Integrate fire occurrence information with fire detection/forecasting for algorithm validation, providing an efficient platform for algorithm research.

Ten Questions:

  1. Why create a database rather than use remote sensing datasets directly?
  2. Why is it important to create a scientific database for fire occurrence and remote sensing datasets?
  3. How can scientific algorithms be integrated into the system?
  4. What are the difficulties in integrating datasets from different sources?
  5. How can datasets from different sources be integrated?
  6. What are the current approaches to forest fire detection?
  7. What are the current approaches to forest fire forecasting?
  8. With an integrated scientific database, is it possible to develop new approaches to fire detection and fire forecasting?
  9. Why is a real-time system so important for forest fire detection and forecasting?
  10. How can a real-time fire alert system be developed based on this project?



Group 6

Title: Inferring Knowledge to Assist Law Enforcement Officers by Applying Data Mining Techniques to NIBRS Data.

Name E-mail

Doug Seeman (project lead) william.seeman@ngc.com
Jarek Pietrzykowski jarek@scs.gmu.edu
Janusz Wojtusiak (webmaster) jwojt@scs.gmu.edu
Jeff Ruby jeffruby@verizon.net
Chien-Chih Lin clin3@gmu.edu

Abstract:

In the field of law enforcement, the ability to obtain knowledge relevant to current criminal activity is crucial to the effectiveness of any law enforcement agency. This knowledge exists in a variety of forms and attempts to answer a wide range of questions. The following is a sample of the types of knowledge that may be of interest to law enforcement officers, with an example of each.

  • Clustering various criminal activities - new gangs can be realized by grouping together areas of criminal activity that are typically gang related, namely drug activity and homicide
  • Association of one criminal activity with another - the increase of drug activity in a particular area correlates to an increase in robbery in the same area
  • Classification and prediction of criminal activities - classification of murder victims can shed insight into the discovery of trends and thus allow effective diversion of resources


Data mining techniques lend themselves naturally to providing this sort of knowledge and answering these types of questions. The goal of our project is to apply various data mining and knowledge discovery techniques to data that is widely accepted and understood within the law enforcement community: the National Incident Based Reporting System (NIBRS). This data is an aggregation of criminal incident reports, in a standard format, reported from across the United States. It contains a variety of symbolic attributes relating to the types of criminal activity, the demographic features of the victims and offenders, and other related entities contained within incident reports (property, vehicles, etc.). In addition to this standard symbolic data, the NIBRS data also contains both a temporal component and a spatial component.

Our system utilizes an underlying database to store and manipulate both the original NIBRS data and the inferred knowledge. A variety of data mining techniques are used to infer knowledge from the data and knowledge contained within the database. In addition, a graphical user interface is used to guide the learning process and interface with both the underlying database and the various data mining tools.

Ten Questions:

Statistical:

  1. How do changes in drug offenses affect changes in the 8 most serious offenses?
  2. What is the correlation, if any, between various offenses contained within the 8 most serious offenses?
  3. What are the distributions of various variables for different offenses? A focus is placed on various biographic elements, locations where crimes occur, and times when various crimes occur.
  4. What is the geographic distribution of gang related crime over weeks? Over months? Over years?

Data Mining:

  5. Using description methods, describe particular offenses in terms of other pertinent variables.
  6. Can valuable clusters be found by applying a clustering technique to various attributes associated with one particular offense? As an example, what clusters can be formed by clustering incidents that contain a murder offense in terms of drug activity, computer activity, gang activity, etc?
  7. Describe the fluctuation of the rates of various offenses in terms of other attributes associated with the offense. As an example, describe the decrease in the robbery rate by looking at drug use, gang-related crimes, and hate crimes.
  8. How can various agencies be clustered based on a number of pertinent variables?

Design:

  9. How can data be aggregated over time?
  10. How is the rate of change represented within the system? Will a percentage be used? Or will raw numbers be presented?
  11. What hierarchies are necessary, and how should they be structured for various types of data? As an example, time can be represented by hour, time of day (morning, afternoon, evening, night), day of week, month of year, phase of the moon, etc.
  12. How is geographical data incorporated into the data model? The available geographic data is based on jurisdictional boundaries of the reporting agency, which may or may not overlap.
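The time hierarchy in question 11 and the aggregation in question 9 can be sketched as follows. The level names, the time-of-day boundaries, and the helper names are assumptions made for this example; the text above does not define them.

```python
from collections import Counter
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def time_of_day(hour):
    """Map an hour (0-23) to a coarse period; the boundaries are assumed."""
    if 6 <= hour < 12:
        return "morning"
    if 12 <= hour < 18:
        return "afternoon"
    if 18 <= hour < 22:
        return "evening"
    return "night"


def time_levels(ts):
    """Return the value of one incident timestamp at each hierarchy level."""
    return {
        "hour": ts.hour,
        "time_of_day": time_of_day(ts.hour),
        "day_of_week": DAY_NAMES[ts.weekday()],
        "month": ts.month,
    }


def aggregate(timestamps, level):
    """Count incidents per value of the chosen hierarchy level."""
    return Counter(time_levels(ts)[level] for ts in timestamps)


# Hypothetical incident timestamps, for illustration only.
incidents = [
    datetime(2003, 9, 15, 23, 30),
    datetime(2003, 9, 15, 8, 5),
    datetime(2003, 10, 2, 2, 45),
]
by_period = aggregate(incidents, "time_of_day")
```

Storing the raw timestamp and deriving the levels on demand keeps the hierarchy open to further levels, such as phase of the moon, without reloading the data.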

Group 7

Title: A System for Collection, Storage, and Analysis of Multi-platform Computer System Data

Name E-mail
Paulo Costa pcosta@pobox.com
Billy Liao bcliao@yahoo.com
Vijay Malgari vmalgari@gmu.edu
Jim Jones jhjones@cox.net

Abstract:

Consider a computer system or systems which require analysis for usage patterns, application information, interaction with other systems, possible compromise, etc.  The current state of the art is that a knowledgeable analyst uses native tools and commands, external tools, and their experience and knowledge to collect the necessary information and conduct the analysis.  This methodology has several limitations, including:

(a) the analyst must have specific knowledge and experience to run and interpret tool and command output on the particular platform in question,

(b) the lack of a standard for collected data means that analysis across multiple heterogeneous (and even homogeneous) systems is difficult. 

We propose to build a system which will address these two limitations.  Specifically, the proposed project will:

  1. define the information to be gathered from a system which will support the intended analysis,
  2. identify tools and techniques for gathering such information,
  3. define a database to store this information which will enable analysis across multiple heterogeneous platforms,
  4. populate the database with data from several heterogeneous sample systems, and
  5. implement a query capability and one or more advanced analytical and/or data mining tools to operate on the database. 

The resulting system will enable the collection of system information independent of the analysis, allowing for the conduct of analysis across multiple systems without requiring knowledge or expertise of the platforms in question or the tools used to gather the data.  The definition of the database may also facilitate information sharing and the future establishment of a standard for representing system information.

Ten Queries:

  1. What email addresses does user A communicate with?
  2.  What web sites does user A visit?
  3.  What are usage patterns for common applications on System X?
  4.  What applications are unused on System X?
  5.  What applications are installed on System X?
  6.  Are there indications of compromise on system X?
  7.  Are there any files in common on systems X and Y?
  8.  Is there any indication that system X communicates with System Y?
  9.  When was the operating system installed on System X?
  10. When was the last user A activity?
  11.  Are there any indications that user A and user B are the same person (same or different systems)?
  12.  What is the distribution of file sizes?
  13.  What files have never been accessed?
  14.  Is file C content the same as file D content?
  15.  How can we find patterns and relationships in data?
  16.  How can we find spatial, temporal, and coincidence associations in the data?
  17.  What effect does database structure (data definitions, indexing, etc.) have on subsequent queries and analysis?
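Once collected data sits in a common table, several of these queries become short SQL statements. The sketch below uses Python's built-in `sqlite3` module to answer query 7 (files in common on systems X and Y) and query 13 (files never accessed). The `file_record` table, its columns, and the sample rows are assumptions for illustration, not the project's actual database definition; matching files by content hash is likewise one assumed approach.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema; last_accessed is NULL when a file was never accessed.
conn.execute(
    "CREATE TABLE file_record ("
    "system TEXT, path TEXT, size_bytes INTEGER, "
    "content_hash TEXT, last_accessed TEXT)"
)
conn.executemany(
    "INSERT INTO file_record VALUES (?, ?, ?, ?, ?)",
    [
        ("X", "/home/a/report.doc", 2048, "aaa111", "2003-10-01"),
        ("X", "/home/a/notes.txt", 512, "bbb222", None),
        ("Y", "D:/docs/report.doc", 2048, "aaa111", "2003-09-15"),
        ("Y", "D:/docs/todo.txt", 100, "ccc333", None),
    ],
)


def files_in_common(conn, system_a, system_b):
    """Pairs of paths on two systems whose content hashes match."""
    return conn.execute(
        "SELECT a.path, b.path FROM file_record AS a "
        "JOIN file_record AS b ON a.content_hash = b.content_hash "
        "WHERE a.system = ? AND b.system = ? "
        "ORDER BY a.path, b.path",
        (system_a, system_b),
    ).fetchall()


def never_accessed(conn, system):
    """Paths on one system that have no recorded access time."""
    rows = conn.execute(
        "SELECT path FROM file_record "
        "WHERE system = ? AND last_accessed IS NULL ORDER BY path",
        (system,),
    )
    return [row[0] for row in rows]
```

Because both systems share one table, the analyst does not need platform-specific tools at query time, which is the point of separating collection from analysis.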

Group 8

Title: Web-based Agricultural Information Visualization and Statistical Analysis System

Name E-mail
Chunguang Yu cyu@gmu.edu
Roongroj Chokngamwong  rchoknga@gmu.edu
Retish Gautam rgautam@gmu.edu

Abstract:

Farm Structure, Income & Performance is one of the key topics for research and analysis by the Economic Research Service (ERS). This topic measures and explains indicators of economic performance for the U.S. farm sector and major crop and livestock farm types. Analysis and data provide a perspective on the financial health of the U.S. agricultural economy. The topic roughly includes four subtopics: Farm Employment & Wages, Farm Financial Performance, Farm Income Estimates, and Farm Structure/Characteristics. Associated with these four subtopics, ERS provides three data products: Farm and Farm-related Employment, Farm Balance Sheet, and Farm Income data. Although these data are large in volume and relatively complicated for users, ERS currently provides only the raw data files. Hence, the general public cannot understand the data well before downloading them, or get any direct analysis results from ERS. Policy-makers (e.g., senators, congressmen, or officials) cannot immediately get enough information from the data to make decisions. Scholars and students cannot query the data and have to download larger data files than they really need. Therefore, building a scientific database system for these data products, and providing an online analysis tool based on it so that users at different levels can utilize the data, is becoming more and more necessary and urgent. Motivated by these challenges, we take these tasks as our objectives for the CSI 710 class project.

For the online data analysis tool, we want to provide basic statistical analysis (such as mean, median, and sum), distribution analysis, ranking analysis, time series analysis and forecasting, clustering, and 3D visualization. Since all these data products contain geographic information, we will also combine these statistical analyses with geo-location presentation.
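The basic statistics and ranking analysis can be sketched in a few lines. The figures below are invented for illustration and are not real ERS data; the function names are likewise assumptions. The sketch is in Python for brevity, although the system itself is planned around Java and Oracle.

```python
from statistics import mean, median

# Hypothetical values (not real ERS figures): state -> farm income by year.
farm_income = {
    "Virginia": {2000: 10.0, 2001: 12.0, 2002: 11.0},
    "Iowa": {2000: 30.0, 2001: 28.0, 2002: 35.0},
    "Kansas": {2000: 20.0, 2001: 22.0, 2002: 21.0},
}


def summarize(series):
    """Basic statistics for one region's yearly series."""
    values = list(series.values())
    return {"mean": mean(values), "median": median(values), "sum": sum(values)}


def rank_by_total(data):
    """Regions ordered from highest to lowest total over all years."""
    totals = {region: sum(series.values()) for region, series in data.items()}
    return sorted(totals, key=totals.get, reverse=True)


virginia_stats = summarize(farm_income["Virginia"])
ranking = rank_by_total(farm_income)
```

Because each series is keyed by region, the same summaries can later be attached to a map for the geo-location presentation.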

We would like to implement a three-layer architecture for the system: client side, server side, and database side. Technically, we will use a Java applet for the client interface, a Java servlet for the server (performing SQL queries and scientific computing), and Oracle for the database system. If time allows, we will also consider computing performance and security, which matter if we open our system to the public.

Ten Questions:

  1. What is the most suitable architecture for an agricultural information visualization and analysis system?
  2. How should we arrange the raw data in an Oracle database so that the data can be retrieved efficiently?
  3. Considering the frequent production of agricultural data, how can we improve the scalability of the system?
  4. How do we organize the metadata to integrate agricultural data products with geographic data?
  5. What kinds of data queries does our system support, and how can they be expressed efficiently in SQL?
  6. How can we combine the agricultural data analysis with geo-location information?
  7. How does our system maintain high performance if many users try to access the data at the same time?
  8. How do we design an understandable interface that enables users to perform queries and analysis easily and quickly?
  9. What is the significance of our web-based statistical analysis system to decision-makers?
  10. What kinds of analysis do our target users (e.g., decision-makers) really want, and which are most valuable and helpful to them?

Group 9

Title: Knowledge Query System for Astronomical Databases

Name E-mail
Davide Donato davide@physics.gmu.edu
Georgios Britzolakis gbritzol@gmu.edu

Abstract:

We are going to develop a database system that compiles scientific descriptive meta-data from three distributed major Astrophysics Databases. The ADS (NASA Astrophysics Data System), NED (NASA/IPAC Extragalactic Database) and HST (Hubble Space Telescope) are sources of information used extensively by the astronomical community. They store different sorts of information (ADS - literature, NED - scientific data, HST - observation data) in different formats through different interfaces.

Our goal is to create a so-called "meta-data knowledge warehouse" that captures information from multiple sources and provides a single unified warehouse and a single interface. This warehouse will allow scientists to find various kinds of information related to the study of specific astronomical objects (in our case Fanaroff-Riley radio galaxies), such as the conditions of observations of these objects performed by the Hubble Space Telescope, bibliographic references, and all other sorts of data astronomers might be interested in, without having to sort through the enormous amounts of astronomical data collected so far.

Ten Questions:

  1. Are there articles related to Fanaroff-Riley radio galaxies?
  2. What are the sources related to these articles?
  3. What is the basic data information on these sources?
  4. Which of these sources have been observed with HST and what are the main characteristics of these observations?
  5. Are there any X-ray observations for the sources observed at optical wavelengths?
  6. Are there any authors that published something about particular X-ray observations?
  7. Did the authors of the HST proposals publish the results of their optical observations?
  8. Does NED list other Fanaroff-Riley radio galaxies that have not been observed with HST?
  9. What is the exposure time to take an HST image of Fanaroff-Riley radio galaxies?
  10. Has the publication date of Fanaroff-Riley radio galaxy papers changed since the HST ACS (Advanced Camera for Surveys, the latest camera applied to HST optics) was put into operation?
     
