Class Projects for IT 864 / CSI 710, Fall 2003
Group 1
Title: Knowledge SIFTER (Smart Information Filtering Technology for Electronic Resources): Knowledge Acquisition and Integration from multiple heterogeneous sources.
Abstract: We will extend WebSifter to handle spatio-temporal user queries, generate queries against existing public geo-located data, and return a prioritized list of results. This requires converting a non-technical user query into spatial and/or temporal elements so that the appropriate queries can be made to the databases. The conversion will be based on the existing WebSifter concept of the Ontology Agent. Additional agents will be developed to derive location and time information from the user query. We propose two such agents, which will convert location names (e.g. Washington) and event names (e.g. 9/11) to location(s) and time(s). Clearly these agents can be rather extensive; we plan on either accessing an existing database or creating a minimal test database of our own. As a proof of concept, that information will then be used to formulate queries for each public database. Further development must be done to access public data sites (e.g. USGS and NASA Goddard DAAC) and retrieve the relevant data.
Ten Questions:
Group 2
Title: Development of a Versatile SNP Database Management Tool by Bridging Pre-existing LIMS with Novel Polymorphic DNA Analysis Tools and Allowing Dynamic User-centric Design and Query
Abstract: One of the most common avenues for studying genetic variation within complex diseases is single nucleotide polymorphism (SNP) analysis. SNPs are small changes that occur within a DNA sequence. A large volume of this genetic variation information has accrued; for example, dbSNP has over 9,577,627 SNPs in its database. Management of this information from data collection through analysis is crucial. The production of such data sets is increasingly common, and it is recognized that the data can support a variety of analyses. However, there is a lack of tools to manage and analyze SNP data. This becomes especially problematic for smaller organizations with little or no in-house bioinformatics software support. Currently most scientists use a disconnected collection of software scripts, and much of the process is manual and error-prone, often involving many reformatting steps. As a result the process is slow and cumbersome. In addition, results are kept in simplistic databases (i.e. flat files or schema databases). Analysis may be laborious, requiring manual entry of data into a spreadsheet format; dynamic analysis of the data at intermediate stages is not an option in these instances. Final results are often made public at data warehouses such as dbSNP, but often require an additional format change before publication. Further compounding the issue is the nonstandard platform compatibility of analysis tools. What is needed is a streamlined process for acquiring, analyzing and curating SNP project data so that it can be efficiently shared and mined. One approach to solving this problem is to develop a database management system that is flexible enough to accommodate different SNP projects with variable attribute sets and that allows user-centric questions to be asked about the polymorphic DNA and its metadata.
This management system, through a system of dynamic argument lists and monitored constraints, can be developed to interface with pre-existing SNP LIMS and polymorphic DNA analysis tools.
Ten Questions:
1) How feasible and useful will this approach to SNP database management be for real-world database projects?
2) Which SQL tool will work best to create a dynamic view? That is, should we use SQL, database programming functions, or an outside program to manage queries?
3) What are the best methods to join studies and make cross-study comparisons (i.e. between projects)?
4) What methods should be used to cleanse data to HIPAA standards?
5) What is the optimal way of storing and accessing row-modeled data?
6) What is the best way to deal with the enforcement of constraints and triggers?
7) What is the best method for efficiently managing the SNP frequency data?
8) Given a set of polymorphisms (i.e. a haplotype), what is the probability that this matches an existing set of attributes?
9) What is the best manner, with regard to attribute-defined haplotypes, for probability information to be stored and updated?
10) How should user preferences be monitored and tracked?
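Questions 2 and 5 concern row-modeled data and dynamic views. As a minimal sketch of one possible approach (not the group's design), the Python example below stores one attribute per row in SQLite and builds a per-project pivot query at run time. The table name `snp_attribute`, the column names, and all sample values are made up for illustration.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a row-modeled attribute table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE snp_attribute ("
        " project TEXT, snp_id TEXT, attribute TEXT, value TEXT,"
        " PRIMARY KEY (project, snp_id, attribute))"
    )
    # Hypothetical sample rows: each project may use a different attribute set.
    rows = [
        ("projA", "snp_0001", "chromosome", "7"),
        ("projA", "snp_0001", "allele", "A/G"),
        ("projA", "snp_0002", "chromosome", "12"),
        ("projB", "snp_0001", "allele", "A/G"),
        ("projB", "snp_0001", "method", "array"),
    ]
    conn.executemany("INSERT INTO snp_attribute VALUES (?, ?, ?, ?)", rows)
    return conn


def pivot(conn, project, attributes):
    """Return one dict per SNP with the requested attributes as columns.

    Attribute names are passed as bound parameters, so the SQL text only
    varies in the number of CASE expressions, never in user-supplied names.
    """
    cases = ["MAX(CASE WHEN attribute = ? THEN value END)" for _ in attributes]
    select_list = ", ".join(["snp_id", *cases])
    sql = (
        f"SELECT {select_list} FROM snp_attribute "
        "WHERE project = ? GROUP BY snp_id ORDER BY snp_id"
    )
    result = []
    for row in conn.execute(sql, [*attributes, project]):
        result.append({"snp_id": row[0], **dict(zip(attributes, row[1:]))})
    return result


if __name__ == "__main__":
    db = build_demo_db()
    for record in pivot(db, "projA", ["chromosome", "allele"]):
        print(record)
```

Attributes that a SNP does not have come back as `None`, which is one way to handle variable attribute sets without changing the table schema for each project.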
Group 3
Title: Data Management System for the Generation and Validation of an Electronic Common Technical Document (eCTD)
Abstract: Recent guidelines from global health authorities will revolutionize the means by which the pharmaceutical and biotechnology industries apply for licenses to market new drugs. Guidelines for the electronic Common Technical Document (eCTD) allow for the elimination of paper submissions and therefore greatly facilitate the entire process of creating and reviewing these applications. The creation process, however, will still be a difficult one. Although these guidelines specify a rigid organizational structure for these submissions, the content will vary from one drug application to the next. The numerous files that represent scientific documents, study reports, and databases will have to be specified with each submission and then organized properly. A tool that can 1) enter and store the information needed for each required file, 2) automatically organize the files, and 3) generate the required XML “backbone” file(s) (which are used to electronically navigate through the comprehensive document) would greatly facilitate the submission process and replace the tedious, labor-intensive process of moving files, renaming files, and entering information manually. The purpose of this project is to develop such a tool, or set of tools, for achieving this task.
Ten Questions:
Group 4
Title: Molecular Cluster Database
Abstract: We propose developing a database of molecular cluster structures based on research conducted by the GMU Laboratory for the Computer Design of Materials, and integrating this database with the Cambridge Structural Database (CSD) and the American Mineralogist Crystal Structure Database (AMCSD). We plan to incorporate the interactive software for the AMCSD that allows the molecular cluster structures to be viewed and manipulated and computes physical properties for clusters such as geometry and electron densities. Further, we plan to develop software to compute other cluster properties, and conduct statistical analysis of properties for clusters of different sizes.
Ten Questions:
Group 5
Title: Integrated online system for forest fire query, detection and forecasting
Abstract: Forest fires are a major natural disaster in the United States, causing extensive property damage each year. It is therefore important to develop an online information system that provides forest fire occurrence information and forecasts forest fires. Satellites can observe the earth with large spatial coverage and long temporal coverage, so satellite remote sensing provides a very efficient means of forest fire detection and forecasting. The main objective of this project is to develop an integrated platform for forest fire occurrence information management, forest fire detection, and forest fire forecasting. The system can be used for forest fire occurrence information queries, forest fire danger alerts, and forest fire forecasting algorithm development and validation. Currently, there are some systems for fire occurrence information management, but most of them do not provide a convenient user interface for queries. As for forest fire forecasting, there are some algorithms based on satellite remote sensing, but few of them provide a convenient user interface. In addition, current systems usually use different data formats and different spatial and temporal resolutions, so it is very difficult to integrate information from different systems. In this project, we will
Ten Questions:
| Name | Email |
| Doug Seeman (project lead) | |
| Jarek Pietrzykowski | jarek@scs.gmu.edu |
| Janusz Wojtusiak (webmaster) | jwojt@scs.gmu.edu |
| Jeff Ruby | jeffruby@verizon.net |
| Chien-Chih Lin | clin3@gmu.edu |
In the field of law enforcement, the ability to obtain knowledge relevant to current criminal activity is crucial to the effectiveness of any law enforcement agency. This knowledge exists in a variety of forms and attempts to answer a wide range of questions. The following is a sample of the various types of knowledge that may be of interest to law enforcement officers and various examples of each.
Data mining techniques lend themselves naturally to providing this sort of
knowledge and answering these types of questions. The goal of our project is to
apply various data mining and knowledge discovery techniques to data that is
widely accepted and understood within the law enforcement community - the
National Incident-Based Reporting System (NIBRS). This data is an aggregation of
criminal incident reports, in standard format, reported from across the United
States. It contains a variety of symbolic attributes relating to the types of
criminal activity, the demographic features of the victims and offenders, and
other related entities contained within incident reports (property, vehicles,
etc.). In addition to this standard symbolic data, the NIBRS data also contains
both a temporal component and a spatial component.
Our system utilizes an underlying database to store and manipulate both the original NIBRS data and the inferred knowledge. A variety of data mining techniques are used to infer knowledge from the data and knowledge contained within the database. In addition, a graphical user interface is used to guide the learning process and interface with both the underlying database and the various data mining tools.
Statistical:
Data Mining:
Design:
| Name | Email |
| Paulo Costa | pcosta@pobox.com |
| Billy Liao | bcliao@yahoo.com |
| Vijay Malgari | vmalgari@gmu.edu |
| Jim Jones | jhjones@cox.net |
Consider a computer system or systems which require analysis for usage patterns, application information, interaction with other systems, possible compromise, etc. The current state of the art is that a knowledgeable analyst uses native tools and commands, external tools, and their experience and knowledge to collect the necessary information and conduct the analysis. This methodology has several limitations, including:
(a) the analyst must have specific knowledge and experience to run and interpret tool and command output on the particular platform in question,
(b) the lack of a standard for collected data means that analysis across multiple heterogeneous (and even homogeneous) systems is difficult.
We propose to build a system which will address these two limitations. Specifically, the proposed project will:
The resulting system will enable the collection of system information independent of the analysis, allowing analysis to be conducted across multiple systems without requiring knowledge of, or expertise in, the platforms in question or the tools used to gather the data. The definition of the database may also facilitate information sharing and the future establishment of a standard for representing system information.
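To illustrate separating collection from analysis, here is a minimal Python sketch under stated assumptions: facts are gathered with the standard `platform` module and stored as (system, fact, value) rows in SQLite, so that a later query needs no knowledge of the platform that produced them. The `system_fact` table and the fact names are hypothetical and are not the schema proposed by the group.

```python
import platform
import sqlite3
import time


def collect_facts():
    """Gather a few facts about the local machine via the standard library."""
    return {
        "hostname": platform.node(),
        "os_name": platform.system(),
        "os_release": platform.release(),
        "architecture": platform.machine(),
    }


def store_facts(conn, system_id, facts, collected_at=None):
    """Store facts in a platform-neutral table (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS system_fact ("
        " system_id TEXT, fact TEXT, value TEXT, collected_at REAL)"
    )
    timestamp = time.time() if collected_at is None else collected_at
    conn.executemany(
        "INSERT INTO system_fact VALUES (?, ?, ?, ?)",
        [(system_id, name, value, timestamp) for name, value in facts.items()],
    )


def systems_with(conn, fact, value):
    """Analysis step: find systems by fact, independent of how it was collected."""
    cursor = conn.execute(
        "SELECT DISTINCT system_id FROM system_fact "
        "WHERE fact = ? AND value = ? ORDER BY system_id",
        (fact, value),
    )
    return [row[0] for row in cursor]


if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    store_facts(db, "local", collect_facts())
    print(systems_with(db, "os_name", platform.system()))
```

Because every collector writes the same three columns, the same query works whether the rows came from one platform or several.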
| Name | Email |
| Chunguang Yu | cyu@gmu.edu |
| Roongroj Chokngamwong | rchoknga@gmu.edu |
| Retish Gautam | rgautam@gmu.edu |
Farm Structure, Income & Performance is one of the key topics for research and analysis by the Economic Research Service (ERS). This topic measures and explains indicators of economic performance for the U.S. farm sector and major crop and livestock farm types. The analysis and data provide a perspective on the financial health of the U.S. agricultural economy. The topic comprises roughly four subtopics: Farm Employment & Wages, Farm Financial Performance, Farm Income Estimates, and Farm Structure/Characteristics. Associated with these four subtopics, ERS provides three data products: Farm and Farm-related Employment, Farm Balance Sheet, and Farm Income data. Although these data are large in volume and relatively complicated for users, the current ERS solution is still to provide only the raw data files. Hence, members of the general public cannot understand the data well before downloading them, nor get any direct analysis results from ERS. Policy-makers (e.g. senators, congressmen, or officials) cannot immediately get enough information from the data to make decisions either. Scholars and students cannot query the data and have to download larger data files than they really need. Therefore, building a scientific database system for these data products, and providing an online analysis tool based on this database system that enables users at different levels to utilize the data, is becoming more and more necessary and urgent. Motivated by these challenges, we have taken these tasks as our objectives for the CSI 710 class project.
For the online data analysis tool, we want to provide basic statistical analysis (such as mean, median, and sum), distribution analysis, ranking analysis, time series analysis and forecasting, clustering, and 3D visualization. Since all these data products contain geographic information, we will also combine these statistical analyses with geo-location presentation.
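The basic statistics and ranking analysis mentioned above can be sketched in a few lines of Python. This is only an illustration: the `RECORDS` rows (region, year, value) are made-up placeholders rather than ERS data, and the function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean, median

# Made-up (region, year, value) rows standing in for a farm income data product.
RECORDS = [
    ("State A", 2001, 120.0),
    ("State A", 2002, 180.0),
    ("State B", 2001, 90.0),
    ("State B", 2002, 60.0),
    ("State C", 2001, 310.0),
]


def summarize(values):
    """Basic statistics named in the text: mean, median, and sum."""
    return {"mean": mean(values), "median": median(values), "sum": sum(values)}


def rank_by_total(records):
    """Ranking analysis: total value per region, largest first."""
    totals = defaultdict(float)
    for region, _year, value in records:
        totals[region] += value
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    print(summarize([value for _, _, value in RECORDS]))
    for region, total in rank_by_total(RECORDS):
        print(region, total)
```

In a deployed system the same aggregates would more likely be computed in the database with SQL; the sketch only shows the shape of the results the interface would present.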
We would like to implement a three-layer architecture for the system: client side, server side, and database side. Technically, we will use a Java applet for the client interface, a Java servlet for the server (performing SQL queries and scientific computing), and Oracle for the database system. If time allows, we may also consider computing performance and security, which matter if we really hope to open our system to the public.
| Name | Email |
| Davide Donato | davide@physics.gmu.edu |
| Georgios Britzolakis | gbritzol@gmu.edu |
We are going to develop a database system that compiles scientific
descriptive meta-data from three major distributed astrophysics databases. The
ADS (NASA Astrophysics Data System), NED (NASA/IPAC Extragalactic Database) and
HST (Hubble Space Telescope) are sources of information used extensively by the
astronomical community. They store different sorts of information (ADS -
literature, NED - scientific data, HST - observation data) in different formats
through different interfaces.
Our goal is to create a so-called "meta-data knowledge warehouse" capturing
information from multiple sources and providing a single unified warehouse and a
single interface. This warehouse will allow scientists to retrieve various
kinds of information related to the study of specific astronomical objects (in
our case Fanaroff-Riley radio galaxies), the conditions of observations of these
objects performed by the Hubble Space Telescope, and bibliographic references
(and all other sorts of data astronomers might be interested in), without having
to sort through the enormous amounts of astronomical data collected so far.
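As a rough sketch of the unification step, the Python example below merges rows that have already been fetched from the three sources into one record per object, keyed by a normalized object name. The input row shapes, the `ObjectRecord` fields, and the sample identifiers are assumptions made for illustration; no real ADS, NED, or HST interface is called.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectRecord:
    """Unified meta-data for one astronomical object (hypothetical layout)."""
    name: str
    references: list = field(default_factory=list)    # literature (ADS-like)
    properties: dict = field(default_factory=dict)    # scientific data (NED-like)
    observations: list = field(default_factory=list)  # observation data (HST-like)


def normalize(name):
    """Make object names comparable across sources: upper case, single spaces."""
    return " ".join(name.upper().split())


def build_warehouse(literature, object_data, observations):
    """Merge (name, payload) rows from three sources into one dict of records."""
    warehouse = {}

    def record_for(name):
        key = normalize(name)
        if key not in warehouse:
            warehouse[key] = ObjectRecord(name=key)
        return warehouse[key]

    for name, reference in literature:
        record_for(name).references.append(reference)
    for name, props in object_data:
        record_for(name).properties.update(props)
    for name, observation in observations:
        record_for(name).observations.append(observation)
    return warehouse


if __name__ == "__main__":
    # Placeholder rows; object names and identifiers are invented.
    merged = build_warehouse(
        literature=[("obj 1", "ref-001"), ("OBJ  1", "ref-002")],
        object_data=[("Obj 1", {"redshift": 0.1})],
        observations=[("obj 2", "obs-17")],
    )
    for record in merged.values():
        print(record)
```

Real catalogs use several naming schemes for the same object, so a production warehouse would need a proper cross-identification step in place of the simple `normalize` function.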