Data Challenge 2018
February 24 – March 3, 2018
Access to datasets will be provided on February 24, 2018 (Challenge Kickoff)
3000 Rice Genome
The 3000 Rice Genome Project is an international effort to sequence the genomes of 3,024 rice varieties from 89 countries. Rice is the leading food source across the globe and is a vital crop to study to address food security and other global issues. Through analysis of these genomes, researchers can potentially identify genes for important agronomic traits such as better nutrition, climate change tolerance, and disease resistance. The collaborating organizations are the Chinese Academy of Agricultural Sciences, BGI Shenzhen, and the International Rice Research Institute (IRRI).
The Common Crawl corpus includes web crawl data collected over 8 years. Common Crawl offers the largest, most comprehensive, open repository of web crawl data on the cloud. The corpus contains raw web page data, extracted metadata, and text extractions. Common Crawl releases new web crawl data on a monthly basis. Machine-scale analysis of Common Crawl data provides insight into politics, art, economics, health, popular culture, and almost every other aspect of life. Common Crawl data is used around the world by people and organizations in many fields of interest, including academics, researchers, scientists, businesses, governments, technologists, startups, and hobbyists.
World Events Database
This world events database records hundreds of categories of physical activities around the world, from riots and protests to peace appeals and diplomatic exchanges, georeferenced to the city or mountaintop, across the entire planet. The dataset contains over a quarter-billion records organized into a set of tab-delimited files by date. This dataset was created by a well-known project that monitors the world’s broadcast, print, and web news from nearly every corner of every country in over 100 languages and identifies the people, locations, organizations, themes, sources, emotions, counts, quotes, images, and events driving our global society every second of every day.
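Because the records arrive as tab-delimited files, a first pass can stream each file and tally one column. The following is a minimal sketch in Python, not the project's own tooling: the function name `count_by_column`, the column position, and the sample rows are all hypothetical, since the real schema comes with the dataset documentation.

```python
import csv
import io
from collections import Counter

def count_by_column(lines, column_index):
    """Tally the values found at column_index in tab-delimited rows."""
    # QUOTE_NONE treats quote characters as ordinary text, which is a
    # common choice for tab-delimited exports (an assumption here).
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    counts = Counter()
    for row in reader:
        if column_index < len(row):  # skip short or blank rows
            counts[row[column_index]] += 1
    return counts

# Made-up rows (date, event category, location); NOT the real schema.
SAMPLE = (
    "20180224\tprotest\tCityA\n"
    "20180224\tappeal\tCityB\n"
    "20180225\tprotest\tCityA\n"
)
category_counts = count_by_column(io.StringIO(SAMPLE), column_index=1)
for category, n in category_counts.most_common():
    print(category, n)
```

For a real file, the same function can be given an open file handle (for example one opened with `newline=""`), so rows are processed one at a time rather than loaded into memory all at once.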
Risk-Screening Environmental Indicators
This dataset provides detailed air model results from EPA’s Risk-Screening Environmental Indicators (RSEI) model, which is based on the EPA inventory that tracks toxic chemical releases and waste management activities at industrial and federal facilities across the United States and its territories. The dataset includes the raw data in CSV format and complete documentation about the RSEI model, including its limitations, which should be taken into consideration before basing an analysis on this dataset. The data can be used to examine trends in air pollution from industrial facilities over time and across geographies. Participants can examine relationships between RSEI impacts and population demographics for environmental justice analyses.
St. Louis Voyage
The St. Louis, a German transatlantic liner, sailed from Germany on May 13, 1939. The ship carried 937 passengers who were fleeing the Third Reich. The majority of the passengers had applied for US visas and were planning to stay in Cuba until they could enter the United States. Unfortunately, the passengers were denied entry and were sent back to Europe. The passengers were dropped off at different locations in Europe to avoid Nazi Germany. This dataset contains information on the voyage, the locations, and the 937 passengers. The data was collected from many sources, such as old records, information from relatives of the refugees, and various archives. The dataset was created to help analyze the people, places, and events to build a narrative for the St. Louis Voyage.
The 1929 stock market crash devastated America’s economy and triggered the beginning of a 10-year economic depression. During this time, American families were at risk of losing homes to foreclosure. To tackle the mortgage crisis and restart the Great Depression economy, President Franklin Delano Roosevelt created federal loan programs to refinance troubled residential homes. The US government established the Home Owners’ Loan Corporation (HOLC) to determine potential refinance investments by assessing housing and neighborhood conditions. HOLC created maps and area descriptions to describe the features of and threats to a particular area; neighborhoods were graded based on racial/ethnic presence, high- and low-income families, and environmental problems. Referring to map shading, grading, and area descriptions, financial institutions made decisions on loan sizes, refinancing opportunities, etc. Unbeknownst to HOLC and the federal government, the 1939 surveys would have major effects on American cities, especially during Urban Renewal in the 1950s. In short, HOLC orchestrated the denial of financial services based on race and ethnic background, better known as redlining. Using techniques like data analytics, database design, and GIS referencing, we would like to analyze this dataset to unearth new patterns and engage any audience interested in learning more about modern and contemporary history.
The Next Generation Weather Radar (NEXRAD) is a network of 160 high-resolution Doppler radar sites that detects precipitation and atmospheric movement and disseminates data in approximately 5-minute intervals from each site. NEXRAD enables severe storm prediction and is used by researchers and commercial enterprises to study and address the impact of weather across multiple sectors. The dataset consists of the real-time feed and full historical archive of original resolution (Level II) NEXRAD data from June 1991 to present.
The Landsat program is a joint effort of the U.S. Geological Survey and NASA. First launched in 1972, the Landsat series of satellites has produced the longest continuous record of Earth’s land surface as seen from space. NASA is in charge of developing remote-sensing instruments and spacecraft, launching the satellites, and validating their performance. USGS develops the associated ground systems, then takes ownership and operates the satellites, as well as managing data reception, archiving, and distribution. Carefully calibrated Landsat imagery provides the U.S. and the world with a long-term, consistent inventory of vitally important global resources.
USAID – Education Database
USAID is the lead U.S. Government agency that works to end extreme global poverty and enable resilient, democratic societies to realize their potential. In an interconnected world, instability anywhere around the world can impact us here at home. Working side-by-side with the military in active conflicts, USAID plays a critical role in our nation’s effort to stabilize countries and build responsive local governance; they work on the same problems as our military using a different set of tools. Resolving the global learning crisis, that is, ensuring all children and youth are in school and learning, requires political will at the highest levels and strong collaboration in the countries where we work. USAID partners with other U.S. government agencies, donors, country governments, multilateral agencies, civil society, and the private sector to ensure equitable access to inclusive, quality education for all, especially the most marginalized and vulnerable. The goal is to reach 100 million children in the countries outlined in the USAID 2011-2015 Education Strategy.
Montgomery County’s Geo Mapping Tool Project
The dataset is a compilation of data collected on resources available for transitional age youth (TAY) who are at risk of disconnecting or are disconnected from their communities in Montgomery County. The data consists of information on organizational sites of non-profit organizations and government entities that provide services and facilitate programs for TAY. An organization may have multiple sites (physical locations), which are captured in this dataset. A cross-sectional dataset was compiled for this effort and includes variables such as location information (street, city, state, zip code), age groups served, contact information, services tags, program names, and other variables that identify requirements and accessibility of the site. A requirement, for example, could be to have literacy or no criminal background.
Montgomery County Ballfield Analysis Data
One of the planning functions at Montgomery County Parks is to recommend what, where, and when new facilities for sports and recreation should be constructed. Specific recommendations are based on numeric modeling, sport participation trends, and how heavily the existing inventory is used. This is then used to identify new areas for construction and changes needed in existing parks. Ballfields are among the most important, popular, and expensive facilities to build. This dataset has a variety of variables that can be used to derive trends, usage, and planning areas for new and old ballfields in Montgomery County.
Possible areas of analysis in this dataset:
- Where are the sports groups playing, and in how many Planning Areas?
- Are there any correlations that can be derived by looking at this data a different way, in order to determine whether a low-use field of one type (diamond) can be converted to a rectangular field that may be needed?
- Are there specific Planning Areas that have very high utilization percentages? Are all sports groups getting enough time to utilize the field (availability percentages)?
- Can you develop a model to predict field usage?
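As a starting point for the utilization question above, the sketch below computes a utilization percentage per Planning Area in Python. It assumes utilization means hours used divided by hours available, and the field names (`planning_area`, `hours_used`, `hours_available`) and sample values are hypothetical; the dataset may define and label these differently.

```python
from collections import defaultdict

def utilization_by_area(records):
    """Return {planning_area: percent of available hours that were used}.

    Assumes utilization = hours used / hours available (an assumption,
    not a definition taken from the dataset).
    """
    used = defaultdict(float)
    available = defaultdict(float)
    for rec in records:
        used[rec["planning_area"]] += rec["hours_used"]
        available[rec["planning_area"]] += rec["hours_available"]
    return {
        area: 100.0 * used[area] / available[area]
        for area in available
        if available[area] > 0  # avoid dividing by zero
    }

# Made-up records for illustration only.
sample = [
    {"planning_area": "A", "field_type": "diamond",
     "hours_used": 30.0, "hours_available": 100.0},
    {"planning_area": "A", "field_type": "rectangular",
     "hours_used": 90.0, "hours_available": 100.0},
    {"planning_area": "B", "field_type": "rectangular",
     "hours_used": 50.0, "hours_available": 200.0},
]
result = utilization_by_area(sample)
for area, pct in sorted(result.items()):
    print(f"{area}: {pct:.1f}%")
```

Grouping by `field_type` as well as by area would be one way to approach the diamond-versus-rectangular conversion question.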
Montgomery County Preventive Maintenance Work Orders
Montgomery County maintains computerized maintenance management systems (CMMS). These systems are large software programs that generate work orders for every maintenance task that any asset within the County needs. Each work order has a unique ID number assigned to it. The data within the spreadsheet represents one such report and contains information about system-generated maintenance work order numbers. The Work Order field shows the unique work order number. The dataset also contains a description of the work order and its status. The crew supervisor updates the status once the work is in progress or completed, which indicates where the work stands. Some areas of exploration using this data include:
- Number of preventive maintenance work orders issued for each maintenance area (to reassign field staff according to these numbers).
- Which maintenance area has the highest work load?
- Which trade shop has the highest work load?
- How might staff be reassigned to even out the workload? Is it possible to determine this based on the preventive maintenance work order patterns?
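The workload questions above largely reduce to grouped counts. Here is a minimal sketch in Python, assuming the report is exported as CSV; the column names `Maintenance Area` and `Status`, the helper `count_work_orders`, and the sample rows are hypothetical and should be replaced with the headers actually present in the spreadsheet.

```python
import csv
import io
from collections import Counter

def count_work_orders(rows, group_field="Maintenance Area"):
    """Count work orders per value of group_field (hypothetical column name)."""
    counts = Counter()
    for row in rows:
        # Rows with a missing or blank value are grouped under "UNKNOWN".
        key = (row.get(group_field) or "").strip() or "UNKNOWN"
        counts[key] += 1
    return counts

# Tiny made-up sample standing in for the real export.
SAMPLE = """Work Order,Maintenance Area,Status
1001,MC-CJ,Open
1002,MC-WH,Completed
1003,MC-CJ,In Progress
"""

sample_counts = count_work_orders(csv.DictReader(io.StringIO(SAMPLE)))
for area, n in sample_counts.most_common():
    print(area, n)
```

Passing a different `group_field`, such as a trade shop column if one exists in the report, gives the per-shop workload in the same way.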
Montgomery County Parks and Buildings (Work Order Report)
Montgomery County maintains computerized maintenance management systems (CMMS). These systems are large software programs that generate work orders for every maintenance task that any asset within the County needs. Each work order has a unique ID number assigned to it. The data within the spreadsheet represents one such report and contains information about work orders for assets such as Parks, Buildings, Playgrounds, etc. The Park/Building field shows the unique asset ID assigned. The dataset also contains a description of the asset and the classes that the assets are divided into. Classes can be further sub-divided into categories (each class can have multiple categories). Parks in Montgomery County are divided into Maintenance Areas (MC-CJ: Cabin John, MC-WH: Wheaton, etc.). There are two regions in Montgomery Parks, MC-South and MC-North. Each region covers a set of maintenance areas. Some areas of exploration using this data include:
- What types of park buildings are in each maintenance area?
- Where are additional park buildings needed to increase social equity?
LEGACY OF SLAVERY
This program seeks to preserve and promote the vast universe of experiences that have shaped the lives of Maryland’s African American population. Black Marylanders have made significant contributions to both the state and nation in the political, economic, agricultural, legal, and domestic arenas. Despite what often seemed insurmountable odds, Marylanders of Color have adapted, evolved, and prevailed. The dataset contains 4 different types of records – Manumissions, Certificates of Freedom, Runaway Slave Ads and Slave Statistics. They contain vital information about those who were enslaved, as well as runaway advertisements and committal notices for African Americans from local newspapers.
Medicine and Human Health Sciences Database
The Medical Heritage Library (MHL) is a digital curation collaborative among some of the world’s leading medical libraries to promote free and open access to quality historical resources in medicine and the human health sciences. The goal is to provide the means by which readers and scholars across a multitude of disciplines can examine the interrelated nature of medicine and society, both to inform contemporary medicine and strengthen the understanding of the world in which we live. Proposed Data Challenges:
- Make ArchiveSpark with MHL more intuitive by developing a user-friendly interface (or other mechanism) that makes ArchiveSpark functionality and workflows broadly accessible to the public. Products of this project could include a number of canned recipes for searching content with ArchiveSpark, as well as new approaches that make searching the dataset for extraction and analysis easier for researchers.
- Connect IndexCat to journal articles that have been digitized by the MHL. This challenge involves matching IndexCat entries with full-text articles residing in the Medical Heritage Library.
- Create an index of archaic medical terminology using medical dictionaries found in the Medical Heritage Library, map those terms to contemporary medical terminology (such as the Unified Medical Language System), and index the Medical Heritage Library corpus to facilitate the discovery of published content from the perspective of contemporary medicine.
The UMCP Department of Transportation Services (DOTS) – Bike Count
The UMCP Department of Transportation Services (DOTS) provides a full range of parking and transportation services, serving a diverse community of more than 37,000 students and 13,000 faculty and staff in the City of College Park. BikeUMD is an initiative by DOTS to encourage a healthy and cost-effective lifestyle. BikeUMD conducts manual bicycle counts at ten points of interest on campus. The datasets provided are from counts in 2015, 2016, and 2017. DOTS is interested in learning trends from this dataset, mainly location- and time-based ones. Insights on gender-based usage and possibly helmet usage would also be helpful.
National Cancer Institute (NCI) – HINTS
The U.S. National Cancer Institute (NCI) has been conducting the Health Information National Trends Survey (HINTS) since 2003 to learn about U.S. adults’ cancer-related perceptions and knowledge, their health behaviors, and their health-related information access, needs, seeking, and use. This survey is administered every few years to civilian, non-institutionalized adults in the U.S. Some possible investigations that could be conducted using this data set include assessing people’s varying levels of trust in different sources of health information, the extent to which they encounter barriers when searching for health information, their use of social media to share and ask for health information, and their use of technology to track their health and health behaviors. This data has already been prepared for statistical analysis and the data would lend itself nicely to interesting information visualizations, as well.
MORTEN BEYER AND AGNEW – AIRCRAFT DATA
mba REDBOOK is an advanced online aircraft valuation data platform provided by Morten Beyer & Agnew, a leader in aviation intelligence. Within its REDBOOK platform, mba provides access to the Systems Tracking Aircraft Repository (STAR) fleet module; this module monitors and maintains data surrounding the global fleet of commercial aircraft. Boeing, Airbus, Bombardier, Embraer, Saab, and ATR aircraft are all monitored and updated on a daily basis to best inform investors and operators about how the global fleet is growing and changing. This dataset is pulled directly from the STAR module uniquely for the UMD iSchool Data Challenge. You will be provided vital data covering over 41,000 “tails”; data points include operator, serial numbers, transaction history, historical and current status, and much more. Lessors, airlines, banks, and other financial institutions use this data on a daily basis.