Data Challenge

 February 24 – March 3, 2018


What is Data Challenge?

The University of Maryland (UMD) College of Information Studies will host Data Challenge: a week-long event providing an opportunity for students to solve a challenging problem by leveraging their skills and ideas.

All UMD students, regardless of year or major, are invited to participate. Participants receive a data set from a company and must use their creativity and analytical prowess to solve an information problem based on that data set. In addition to financial prizes, the event offers unmatched networking opportunities with the companies involved. Company leaders mentor participants one-on-one and give industry talks throughout the event.

This event is organized by the Master of Information Management Student Association (MIMSA) at the College of Information Studies. MIMSA believes that unrestricted access to data can have a positive impact on society. This event is intended to empower UMD students through the development of analytic skills while doing social good by solving real-world challenges. The University of Maryland College of Information Studies (UMD iSchool) is driven by the pursuit of big ideas and new discoveries that empower people and inspire communities.



Individual Registration

November 1, 2017 – November 30, 2017

Every team member first registers individually for the event through the registration link.

Team Registration

December 1, 2017 – February 20, 2018

If you have formed a team, one of the team members must register through the registration link.

Team Building Event

February 16, 2018 4:00 PM to 6:00 PM

This event is highly recommended for those who do not have a team yet. Meet and network with other participants with similar ideas who are looking to join a team. You will receive an invite to RSVP for this activity.

Challenge Kickoff

February 24, 2018 9:00 AM to 5:00 PM

Teams meet with their mentors at the venue and discuss their ideas and expectations.

Tech Talks

11:00 AM – 12:00 PM: Mary Mouton and Dan Morgan | US Department of Transportation | ‘Open Data Initiatives at the US Department of Transportation’
1:00 PM – 2:00 PM: Stephen Alexander | Amazon Web Services | ‘Build Fast. Build Secure.’
2:00 PM – 3:00 PM: Mary Shelley | SESYNC | ‘Data Science for Interdisciplinary Team Science’


Challenge Finale

March 3, 2018 8:00 AM to 12:30 PM

Teams present their work to the judges.



  • Participants may compete as individuals or as teams of up to 4 for this challenge.
  • Each participant should register individually with their UMD email ID by November 30, 2017 11:59 PM.
  • If you are unable to attend after registering, please email us to let us know.


  • Team registration begins on December 1, 2017 12:00 AM.
  • All registered students will be sent an email with the team registration form link.
  • Only one team member needs to fill out the form.
  • No changes to the team will be accepted once the team has registered.
  • If a participant has been registered for multiple teams, the participant will only be included with the first team to register.
  • Team registration ends on February 20, 2018 at 11:59 PM.
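The rule for participants listed on more than one team can be illustrated with a minimal Python sketch. The `TeamRegistration` record and its field names are hypothetical and are not part of the actual registration form; the sketch only assumes that each team registration carries a timestamp and a list of member emails.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class TeamRegistration:
    # Hypothetical record; the real form's fields are not documented here.
    team_name: str
    registered_at: datetime
    member_emails: list


def resolve_memberships(registrations):
    """Map each participant to the earliest-registered team that lists them."""
    assignment = {}
    for reg in sorted(registrations, key=lambda r: r.registered_at):
        for email in reg.member_emails:
            # setdefault keeps the first (earliest) team and ignores later ones.
            assignment.setdefault(email.lower(), reg.team_name)
    return assignment
```

Emails are lowercased so that the same address typed with different capitalization counts as one participant.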


  • Ph.D. students who are candidates may participate only as mentors.
  • Ph.D. students who are pre-candidates may choose to be either a mentor or a participant.
  • Teams can be made up of a combination of undergraduate, Master's, and Ph.D. students. However, a team may not have more than 50% Ph.D. students.
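As a rough illustration, the team composition rules can be expressed as a small check. This is a sketch under stated assumptions: the level labels are hypothetical, and because the rules do not say whether the 50% cap applies to a Ph.D. pre-candidate competing alone, the sketch applies the cap only to teams of two or more.

```python
MAX_TEAM_SIZE = 4
# Hypothetical labels for student levels; not taken from the registration form.
PHD_LEVELS = {"phd_precandidate", "phd_candidate"}


def team_violations(levels):
    """Return a list of rule violations for a team, given each member's level."""
    problems = []
    if not 1 <= len(levels) <= MAX_TEAM_SIZE:
        problems.append("team size must be between 1 and 4")
    if "phd_candidate" in levels:
        problems.append("Ph.D. candidates may participate only as mentors")
    phd_count = sum(level in PHD_LEVELS for level in levels)
    # Assumption: the 50% cap is checked only for teams of two or more.
    if len(levels) >= 2 and 2 * phd_count > len(levels):
        problems.append("more than 50% of the team are Ph.D. students")
    return problems
```

For example, a team of one undergraduate and one Ph.D. pre-candidate passes (exactly 50%), while two pre-candidates plus one Master's student does not.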




Access to datasets will be provided on February 24, 2018 (Challenge Kickoff)

  1. 3000 Rice Genome

    The 3000 Rice Genome Project is an international effort to sequence the genomes of 3,024 rice varieties from 89 countries. Rice is the leading food source across the globe and is a vital crop to study to address food security and other global issues. Through analysis of these genomes, researchers can potentially identify genes for important agronomic traits such as better nutrition, climate change tolerance, and disease resistance. The collaborating organizations are the Chinese Academy of Agricultural Sciences, BGI Shenzhen, and the International Rice Research Institute (IRRI).

  2. Common Crawl

    The Common Crawl corpus includes web crawl data collected over 8 years. Common Crawl offers the largest, most comprehensive, open repository of web crawl data on the cloud. The corpus contains raw web page data, extracted metadata, and text extractions. Common Crawl releases new web crawl data on a monthly basis. Machine-scale analysis of Common Crawl data provides insight into politics, art, economics, health, popular culture, and almost every other aspect of life. Common Crawl data is used around the world by people and organizations in many fields of interest, including academics, researchers, scientists, businesses, governments, technologists, startups, and hobbyists.

  3. World Events Database

    This world events database records hundreds of categories of physical activities around the world, from riots and protests to peace appeals and diplomatic exchanges, georeferenced to the city or mountaintop, across the entire planet. The dataset contains over a quarter-billion records organized into a set of tab-delimited files by date. This dataset was created by a well-known project that monitors the world’s broadcast, print, and web news from nearly every corner of every country in over 100 languages and identifies the people, locations, organizations, themes, sources, emotions, counts, quotes, images, and events driving our global society every second of every day.

  4. Risk-Screening Environmental Indicators

    This dataset provides detailed air model results from EPA’s Risk-Screening Environmental Indicators (RSEI) model, which is based on EPA’s inventory that tracks toxic chemical releases and waste management activities at industrial and federal facilities across the United States and territories. The dataset includes the raw data in CSV format and complete documentation about the RSEI model, including its limitations, which should be taken into consideration before basing an analysis on this dataset. The data can be used to examine trends in air pollution from industrial facilities over time and across geographies. Participants can examine relationships between RSEI impacts and population demographics for environmental justice analyses.

  5. St. Louis Voyage

    The St. Louis, a German transatlantic liner, sailed from Germany on May 13, 1939. The ship carried 937 passengers who were fleeing the Third Reich. The majority of the passengers had applied for US visas and were planning to stay in Cuba until they could enter the United States. Unfortunately, the passengers were denied entry and were sent back to Europe. The passengers were dropped off at different locations in Europe to avoid Nazi Germany. This dataset contains information about the voyage, the locations, and the 937 passengers. The data was collected from many sources, such as old records, information from relatives of the refugees, and various archives. The dataset was created to help analyze the people, places, and events to build a narrative for the St. Louis voyage.

  6. Mapping Inequality

    The 1929 stock market crash devastated America’s economy and triggered the beginning of a 10-year economic depression. During this time, American families were at risk of losing homes to foreclosure. To tackle the mortgage crisis and restart the Great Depression economy, President Franklin Delano Roosevelt created federal loan programs to refinance troubled residential homes. The US government established the Home Owners’ Loan Corporation (HOLC) to determine potential refinance investments by assessing housing and neighborhood conditions. HOLC created maps and area descriptions to describe the features of and threats to a particular area; neighborhoods were graded based on racial/ethnic presence, high- and low-income families, and environmental problems. Referring to map shading, grading, and area descriptions, financial institutions made decisions on loan sizes, refinancing opportunities, etc. Unbeknownst to HOLC and the federal government, the 1939 surveys would have major effects on American cities, especially during Urban Renewal in the 1950s. In short, HOLC orchestrated the denial of financial services based on race and ethnic background, better known as redlining. Using techniques like data analytics, database design, and GIS referencing, we would like to analyze this data set to unearth new patterns and engage any audience interested in learning more about modern and contemporary history.


  7. Next Generation Weather Radar (NEXRAD)

    The Next Generation Weather Radar (NEXRAD) is a network of 160 high-resolution Doppler radar sites that detects precipitation and atmospheric movement and disseminates data in approximately 5-minute intervals from each site. NEXRAD enables severe storm prediction and is used by researchers and commercial enterprises to study and address the impact of weather across multiple sectors. The dataset consists of the real-time feed and full historical archive of original resolution (Level II) NEXRAD data from June 1991 to present.

  8. Landsat

    The Landsat program is a joint effort of the U.S. Geological Survey and NASA. First launched in 1972, the Landsat series of satellites has produced the longest continuous record of Earth’s land surface as seen from space. NASA is in charge of developing remote-sensing instruments and spacecraft, launching the satellites, and validating their performance. USGS develops the associated ground systems, then takes ownership and operates the satellites, as well as managing data reception, archiving, and distribution. Carefully calibrated Landsat imagery provides the U.S. and the world with a long-term, consistent inventory of vitally important global resources.

  9. USAID – Education Database

    USAID is the lead U.S. Government agency that works to end extreme global poverty and enable resilient, democratic societies to realize their potential. In an interconnected world, instability anywhere around the world can impact us here at home. Working side-by-side with the military in active conflicts, USAID plays a critical role in our nation’s effort to stabilize countries and build responsive local governance; they work on the same problems as our military using a different set of tools. Resolving the global learning crisis (ensuring all children and youth are in school and learning) requires political will at the highest levels and strong collaboration in the countries where we work. USAID partners with other U.S. government agencies, donors, country governments, multilateral agencies, civil society, and the private sector to ensure equitable access to inclusive, quality education for all – especially the most marginalized and vulnerable. The goal is to reach 100 million children in the countries outlined in the USAID 2011-2015 Education Strategy.

  10. Montgomery County’s Geo Mapping Tool Project

    The dataset is a compilation of data collected on resources available for transitional age youth (TAY) who are at risk of disconnecting or are disconnected from their communities in Montgomery County. The data consists of information on organizational sites of non-profit organizations and government entities that provide services and facilitate programs for TAY. An organization may have multiple sites (physical locations), which are captured in this data set. A cross-sectional data set was compiled for this effort and includes variables such as location information (street, city, state, zip code), age groups served, contact information, service tags, program names, and other variables that identify requirements and accessibility of the site. A requirement, for example, could be literacy or having no criminal background.

  11. Montgomery County Ballfield Analysis Data

    One of the planning functions at Montgomery County Parks is to recommend what, where, and when new facilities for sports and recreation should be constructed. Specific recommendations are based on numeric modeling, sport participation trends, and how much usage the existing inventory receives. This is then used to identify new areas for construction and changes needed in existing parks. Ballfields are among the most important, popular, and expensive facilities to build. This dataset has a variety of variables that can be used to derive trends, usage, and planning areas for new and old ballfields in Montgomery County.
    Possible areas of analysis in this dataset:

    • Where are the sports groups playing, and in how many Planning Areas?
    • Are there any correlations that can be derived by looking at this data a different way, in order to determine whether a low-use field of one type (diamond) can be converted to a rectangular field that may be needed?
    • Are there specific Planning Areas that have very high utilization percentages? Are all sports groups getting enough time to utilize the field (availability percentages)?
    • Can you develop a model to predict field usage?
  12. Montgomery County Preventive Maintenance Work Orders

    Montgomery County maintains computerized maintenance management systems (CMMS). These systems are large software programs that generate work orders for every piece of work that any asset within the County needs. Each work order has a unique ID number assigned to it. The data within the spreadsheet represents one such report and contains information about system-generated maintenance work order numbers. The Work Order field shows the unique work order number. The dataset also contains a description of the work order and its status. Once the work is in progress or completed, the crew supervisor has to change the status to indicate where the work stands. Some areas of exploration using this data include:

    • How many preventive maintenance work orders are issued for each maintenance area? (These numbers can inform the reassignment of field staff.)
    • Which maintenance area has the highest work load?
    • Which trade shop has the highest work load?
    • How might staff be reassigned to even out the workload? Is it possible to determine this based on the preventive maintenance work order patterns?
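    A first pass at the workload questions above could be a simple count of work orders per maintenance area and per trade shop. The following Python sketch uses only the standard library. The column names ("Maintenance Area", "Trade Shop") and the sample rows are assumptions made for illustration; the actual spreadsheet may be laid out differently.

```python
import csv
import io
from collections import Counter

# Hypothetical sample; the real export's column names and values may differ.
SAMPLE_CSV = """Work Order,Maintenance Area,Trade Shop,Status
1001,MC-CJ,Plumbing,Open
1002,MC-WH,Electrical,Completed
1003,MC-CJ,Plumbing,In Progress
"""


def count_by(rows, column):
    """Count work orders per distinct value of `column`, skipping blank values."""
    return Counter(row[column] for row in rows if row.get(column))


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
orders_per_area = count_by(rows, "Maintenance Area")
orders_per_shop = count_by(rows, "Trade Shop")

# most_common(1) gives the busiest maintenance area and trade shop in the sample.
busiest_area = orders_per_area.most_common(1)
busiest_shop = orders_per_shop.most_common(1)
```

    Comparing these counts across areas is one possible starting point for the staff reassignment question; it does not account for how long each work order takes.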
  13. Montgomery County Parks and Buildings (Work Order Report)

    Montgomery County maintains computerized maintenance management systems (CMMS). These systems are large software programs that generate work orders for every piece of work that any asset within the County needs. Each work order has a unique ID number assigned to it. The data within the spreadsheet represents one such report and contains information about work orders for assets such as parks, buildings, playgrounds, etc. The Park/Building field shows the unique asset ID assigned. The dataset also contains a description of the asset and the classes that the assets are divided into. Classes can be further subdivided into categories (each class can have multiple categories). Parks in Montgomery County are divided into Maintenance Areas (MC-CJ: Cabin John, MC-WH: Wheaton, etc.). There are two regions in Montgomery Parks, MC-South and MC-North. Each region covers a set of maintenance areas. Some areas of exploration using this data include:

    • What types of park buildings are in each maintenance area?
    • Where are additional park buildings needed to increase social equity?

    This program seeks to preserve and promote the vast universe of experiences that have shaped the lives of Maryland’s African American population. Black Marylanders have made significant contributions to both the state and nation in the political, economic, agricultural, legal, and domestic arenas. Despite what often seemed insurmountable odds, Marylanders of Color have adapted, evolved, and prevailed. The dataset contains 4 different types of records – Manumissions, Certificates of Freedom, Runaway Slave Ads and Slave Statistics. They contain vital information about those who were enslaved, as well as runaway advertisements and committal notices for African Americans from local newspapers.

  15. Medicine and Human Health Sciences Database

    The Medical Heritage Library (MHL) is a digital curation collaborative among some of the world’s leading medical libraries to promote free and open access to quality historical resources in medicine and the human health sciences. The goal is to provide the means by which readers and scholars across a multitude of disciplines can examine the interrelated nature of medicine and society, both to inform contemporary medicine and strengthen the understanding of the world in which we live. Proposed Data Challenges:

    • Make ArchiveSpark with MHL more intuitive by developing a user-friendly interface (or other mechanism) for making ArchiveSpark functionality more broadly accessible. This project seeks to make ArchiveSpark workflows broadly accessible to the public. Products of this project could include creating a number of canned recipes for searching content with ArchiveSpark and considering new approaches that make searching the dataset for extraction and analysis easier for researchers.
    • Connect IndexCat to journal articles that have been digitized by the MHL. This challenge involves matching IndexCat entries with full-text articles residing in the Medical Heritage Library.
    • Create an index of archaic medical terminology using medical dictionaries found in the Medical Heritage Library, map those terms to contemporary medical terminology (such as the Unified Medical Language System), and index the Medical Heritage Library corpus to facilitate the discovery of published content from the perspective of contemporary medicine.
  16. The UMCP Department of Transportation Services (DOTS)- Bike Count

    The UMCP Department of Transportation Services (DOTS) provides a full range of parking and transportation services, serving a diverse community of more than 37,000 students and 13,000 faculty and staff in the City of College Park. BikeUMD is an initiative by DOTS to encourage a healthy and cost-effective lifestyle. BikeUMD conducts manual bicycle counts at ten points of interest on campus. The data sets provided are from counts in 2015, 2016, and 2017. DOTS is interested in learning trends from this dataset, mainly location- and time-based ones. Insights on gender-based usage and even helmet usage would be helpful too!

  17. National Cancer Institute (NCI) – HINTS

    The U.S. National Cancer Institute (NCI) has been conducting the Health Information National Trends Survey (HINTS) since 2003 to learn about U.S. adults’ cancer-related perceptions and knowledge, their health behaviors, and their health-related information access, needs, seeking, and use. This survey is administered every few years to civilian, non-institutionalized adults in the U.S. Some possible investigations that could be conducted using this data set include assessing people’s varying levels of trust in different sources of health information, the extent to which they encounter barriers when searching for health information, their use of social media to share and ask for health information, and their use of technology to track their health and health behaviors. This data has already been prepared for statistical analysis and the data would lend itself nicely to interesting information visualizations, as well.


  18. mba REDBOOK

    mba REDBOOK is an advanced online aircraft valuation data platform provided by Morten Beyer & Agnew, a leader in aviation intelligence. Within its REDBOOK platform, mba provides access to the Systems Tracking Aircraft Repository (STAR) fleet module; this module monitors and maintains data surrounding the global fleet of commercial aircraft. Boeing, Airbus, Bombardier, Embraer, Saab, and ATR aircraft are all monitored and updated on a daily basis to best inform investors and operators about how the global fleet is growing and changing. This dataset is pulled directly from the STAR module uniquely for the UMD iSchool Data Challenge. You will be provided vital data covering over 41,000 “tails”; data points include operator, serial numbers, transaction history, historical and current status, and much more. Lessors, airlines, banks, and other financial institutions use this data on a daily basis.









Frequently Asked Questions

Can I attend this challenge?

Yes, if you are an undergraduate, graduate, or Ph.D. student enrolled in the University of Maryland, College Park, we’d love to have you join us!


Do I have to pay?

This event is absolutely free! It includes admission, interaction with awesome mentors, swag, food and beverages.


How do I register?

Click the Register button and tell us about yourself. Registrations open at 12:00 AM (EST) on November 1, 2017 and close at 11:59 PM on November 30, 2017.


Do you provide food?

More than you can imagine. No, seriously, we will keep you well fed and watered.


Can I bring my own team?

Yes, you can! Your team can include 2-4 members.


I don’t have a team. What do I do?

We will have a team building session before the event so you can interact with fellow participants and make your team!


What do I bring to Data Challenge?

Bring a valid student ID, your laptop and charger, and bundles of enthusiasm.


I don’t know how to code. Can I still register?

Yes! Students with varied skills and an interest in working with lots of data are encouraged to participate.


Where is the event?

The event is taking place at Riggs Alumni Center – Dorothy D. & Nicholas Orem Alumni Hall this year.
The entire event will be from February 24, 2018 to March 3, 2018. However, you are expected to be present only on February 24, 2018 and March 3, 2018.

Challenge Kickoff:

February 24, 2018

9:00 AM – 5:00 PM


Challenge Finale:

March 3, 2018

8:00 AM – 12:30 PM


I have more questions. Who can I talk to?

Drop us an email, and we’ll get back to you as soon as we can.






We are looking for you to come up with a great idea and a working prototype within the duration of the Data Challenge event. What is crucial to any data enthusiast? Data sets! Data Challenge provides these data sets for you to work with and come up with an innovative finding. We encourage you to use third-party services, APIs, open-source projects, libraries, and frameworks. Let’s face it, we need all the help we can get during a hectic event. There’s no need to break DRY rules when there are so many great resources available to all data enthusiasts.


Participants may compete as individuals or as teams of up to 4 for this challenge, and no changes to a team will be accepted after registration. If a participant has registered with multiple teams, the participant will be considered part of the first team to register.


To ensure a level playing field for all contestants, all code, design, art, music, SFX, and assets must be created during the Data Challenge. We want to ensure that all participants start off on the same footing and we also want to preserve the true nature of the event. You are, however, free to make plans and brainstorm prior to the event. We take this rule very seriously for the sake of all members attending the event. Failure to comply may result in the offending team’s disqualification.


All teams retain full ownership of what they have created during the Data Challenge. The organizers are here to help innovators, data enthusiasts and entrepreneurs realize their dreams, not destroy them.

Code of Conduct

Be respectful. Harassment and abuse are never tolerated. If you are in a situation that makes you uncomfortable at a Data Challenge event, if the event itself is creating an unsafe or inappropriate environment, or if interacting with a Data Challenge representative or event organizer makes you uncomfortable, please report it using the procedures included in this document.

Harassment includes but is not limited to offensive verbal or written comments related to gender, age, sexual orientation, disability, physical appearance, body size, race, or religion; sexual images in public spaces; deliberate intimidation; stalking; following; harassing photography or recording; sustained disruption of talks or other events; inappropriate physical contact; and unwelcome sexual attention. If what you’re doing is making someone feel uncomfortable, that counts as harassment and is enough reason to stop doing it.

Participants asked to stop any harassing behavior are expected to comply immediately. Sponsors, judges, mentors, volunteers, organizers, Data Challenge staff, and anyone else at the event are also subject to the anti-harassment policy. In particular, attendees should not use sexualized images, activities, or other material in their hacks or during the event. Event staff and volunteers should not use sexualized clothing, uniforms, or costumes, or otherwise create a sexualized environment.

If a participant engages in harassing behavior, Data Challenge may take any action it deems appropriate, including warning the offender or expelling the offender from the event with no eligibility for reimbursement or refund of any type. Data Challenge representatives can engage University authorities to take appropriate action as needed. If you are being harassed, notice that someone else is being harassed, or have any other concerns, please contact Data Challenge using the reporting procedures defined below.

Reporting Procedures

Data Challenge representatives will be happy to help participants contact campus security or local law enforcement, and engage the right resources to assist those experiencing harassment to feel safe for the duration of the event. We value your attendance. We expect participants to follow these rules at all Data Challenge venues, Data Challenge related social events, and on Data Challenge supplied transportation.

If you feel uncomfortable or think there may be a potential violation of the code of conduct, please report it immediately. All reporters have the right to remain anonymous.

Data Challenge reserves the right to revise, make exceptions to, or otherwise amend these policies in whole or in part. If you have any questions regarding these policies, please contact Data Challenge by email.


