DC19

Data Challenge 2019

Data Integration and the Community

February 23 – March 2, 2019



Winning Teams

Grand Prize

Team DC19028
Nisha Dayananda | Prathima Devanath | Sandeep Raju | Prashant Rathod
Dataset: Real-time Train Prediction
Supplied by: Washington Metropolitan Area Transit Authority

Best Community Integration

Team DC19017
Luc d’Hauthuille | Yung Tzu Huang | Ruthwik Kuppachi | Ling Shu Kung
Dataset: mBike Bikeshare
Supplied by: City of College Park

Most Innovative Project

Team DC19016
Yushuang Chen | Wenyan Tuo | Can Yang
Dataset: Factba.se Trump Dataset
Supplied by: FactSquared

Highest Quality Project

Team DC19047
Sanaz Aliari | Moein Eshfagh | Longsheng Yin | Yaqian Zhang
Dataset: mBike Bikeshare
Supplied by: City of College Park

Best Expression of Results

Team DC19006
Matthew Chou | Yasmin Ibrahim | Kanika Taneja
Dataset: Police Crime Statistic Analysis
Supplied by: City of New Carrollton

Best Team Presentation

Team DC19051
Olivia Isaacs | K. Sarah Ostrach | Natalie Salive
Dataset: Legacy of Slavery in Maryland
Supplied by: University of Maryland Digital Curation Innovation Center

Outstanding Undergraduate Team

Team DC19005
Erick Herrera | Nathan Kwon | Jonah Lynn Rivera
Dataset: Signal Detection Exercise
Supplied by: UMD National Consortium for the Study of Terrorism and Responses to Terrorism

People’s Choice Award

Team DC19003
Shruti Hegde | Vyjayanthi Kamath | Himanshi Manglunia
Dataset: Police Crime Statistic Analysis
Supplied by: City of New Carrollton


Datasets

  1. City of Baltimore, Department of General Services
    Facilities Maintenance Performance Metrics

As an organization, we have committed to data-driven decision making as the management framework for day-to-day operations, for internal ‘stat’ meetings, and for our participation in the city’s “Outcome Based Budgeting” paradigm. Improved performance metrics give our agency more insightful quantitative measures for assessing and improving operations and the agency workforce, which in turn results in better service delivery to the citizens who rely on city-owned facilities and services. Although our indicators are sufficient to determine some outcomes, we know that more comprehensive research and analysis would help us better understand what is happening in all facets of the facilities maintenance division, and therefore improve services to city agencies and citizens. We want to better understand the data around the facilities maintenance performance metrics so that we can optimize our labor workforce to best suit our city’s needs and show how this affects work order execution. In the future, this understanding could also inform the development of our human capital layout, routes, work teams, and vendor relationships and responsibilities around parts inventory management.

Additionally, our department recently implemented a gainsharing model for the workforce in the DGS Fleet Division. The model incentivizes maintenance staff and enables them to receive a portion of the savings from improved productivity and cost-saving measures that they initiate or participate in, based on metrics negotiated and agreed upon by union and budget officials, technician leads and other relevant stakeholders. Results from this pilot program have prompted interest in answering the policy and implementation questions associated with a potential expansion of the gainsharing model to Facilities Maintenance Division operations. However, we lack the fundamental statistical standards, benchmarking tools and insights for work orders upon which we might base a gainsharing model or another cost-saving initiative. Better assessment of performance measures, and clear statistical measures on which to base the division’s work performance, can set the foundation for implementing a similar employee-empowering efficiency model in our other divisions. More immediately, they will provide a tool for better day-to-day management and evaluation of work teams and of the processes associated with work types, individuals or specific vendors, feeding continuous process improvement activities.

  1. City of College Park
    City Vehicle Fuel

Data is from 2/4/2016 to 12/19/2018 and contains 8,266 rows. The Fuel Dataset contains fields for Date, Time, Vehicle ID, Fuel Type (Gas/Diesel), Gallons and Odometer Mileage. We’d be interested to see if you could identify any patterns or trends in our fuel consumption data, either by vehicle or for the fleet overall. We have no pre-defined answer requirements and we’re open to any insights that you can guide the students to.
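
As one possible starting point, the sketch below estimates miles per gallon for each vehicle from consecutive fill-ups. The key names (vehicle_id, date, gallons, odometer) and the sample rows are hypothetical, not the dataset's actual column names or values, and the calculation assumes each fill-up tops off the tank.

```python
from collections import defaultdict

def fuel_economy(records):
    """Estimate miles per gallon per vehicle from fill-up records.

    Each record is a dict with hypothetical keys 'vehicle_id',
    'date' (ISO string), 'gallons' and 'odometer'.
    """
    by_vehicle = defaultdict(list)
    for rec in records:
        by_vehicle[rec["vehicle_id"]].append(rec)

    mpg = {}
    for vehicle_id, rows in by_vehicle.items():
        if len(rows) < 2:
            continue  # one fill-up gives no distance to measure
        rows.sort(key=lambda rec: rec["date"])
        miles = rows[-1]["odometer"] - rows[0]["odometer"]
        # Fuel bought at the first fill-up is excluded: under the
        # fill-to-full assumption it replaces fuel burned earlier.
        gallons = sum(rec["gallons"] for rec in rows[1:])
        if miles > 0 and gallons > 0:
            mpg[vehicle_id] = miles / gallons
    return mpg

# Made-up sample rows for illustration only.
sample = [
    {"vehicle_id": "V1", "date": "2016-02-04", "gallons": 10.0, "odometer": 1000},
    {"vehicle_id": "V1", "date": "2016-02-11", "gallons": 12.0, "odometer": 1240},
    {"vehicle_id": "V1", "date": "2016-02-18", "gallons": 8.0, "odometer": 1400},
    {"vehicle_id": "V2", "date": "2016-02-05", "gallons": 15.0, "odometer": 5000},
]
result = fuel_economy(sample)
print(result)
```

Vehicles with a single fill-up, or with odometer readings that do not increase, are skipped rather than guessed at; such rows may also be worth flagging as data quality issues.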

  1. City of College Park
    mBike Bikeshare

Data is from 6/14/2017 to 1/14/2019 and contains 1,851,924 rows. Generally speaking, the bikeshare dataset contains fields for GPS coordinates, Trip ID, Events (i.e. Docking/Undocking/Pausing), User ID, Date and Time. Bikeshare data analysis could help us locate new bikeshare stations based on current usage patterns or figure out what months/days/times are most popular. Data visualization would help with these questions. The students could even go much further to overlay GIS terrain info and make a model that predicts how much rebalancing a new station at the top of a hill might need based on elevation. This could be based on existing trip data for the number of trips that “start at” versus the number “that end at” higher elevations. Or, students could combine our data with historic weather data to see how much weather conditions affect ridership. We’re open to any insights that the students can discover.
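
The "start at" versus "end at" comparison described above can be sketched as a net flow per station. This is a minimal illustration that assumes each event has already been mapped from its GPS coordinates to a station name; the station names and events below are made up, and only the Docking/Undocking event labels come from the dataset description.

```python
from collections import Counter

def net_flow(events):
    """Return arrivals minus departures for each station.

    `events` is an iterable of (station, event) pairs. A negative
    value means more bikes leave the station than arrive, so the
    station would need bikes brought back to it.
    """
    flow = Counter()
    for station, event in events:
        if event == "Undocking":
            flow[station] -= 1   # a trip starts here
        elif event == "Docking":
            flow[station] += 1   # a trip ends here
        # Other events (for example Pausing) do not move a bike
        # between stations and are ignored.
    return dict(flow)

# Made-up events for illustration only.
sample_events = [
    ("Hilltop", "Undocking"),
    ("Valley", "Docking"),
    ("Hilltop", "Undocking"),
    ("Valley", "Docking"),
    ("Valley", "Undocking"),
    ("Hilltop", "Docking"),
    ("Valley", "Pausing"),
]
flows = net_flow(sample_events)
print(flows)
```

Joining these net flows with station elevation would be one way to explore how much rebalancing a hilltop station might need.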

  1. City of New Carrollton
    Police Crime Statistic Analysis

Data is from 01/01/2012 to 08/31/2018 and contains seven (7) subsets of data, by year, 2012-2018. We are looking at both the type and frequency of various offenses as well as the location in the City. Key statistics for analysis are: Date, Time, Call type and Address. This analysis will allow the City to better target geographic areas for increased patrol/presence and to identify trends in recurring crimes for training, prevention and public awareness initiatives. The analysis could also be used to help with patrol routes and shift work modifications to better address the specific areas, times and types of crimes most prevalent in the City.

Furthermore, the City of New Carrollton implemented a Special Tax District a number of years ago to provide enhanced police services to address increased crime in the City. We use statistical analysis of the crime statistics provided by Prince George’s County Police to track the effectiveness of the district as well as to identify hot spots and other recurring issues.

  1. Digital Curation Innovation Center
    Legacy of Slavery in Maryland

The Legacy of Slavery in Maryland is a major initiative of the Maryland State Archives (http://slavery.msa.maryland.gov/). The program seeks to preserve and promote the vast universe of experiences that have shaped the lives of Maryland’s African American population. Over the last 18 years, some 420,000 individuals have been identified and data has been assembled into 16 major databases. The DCIC has now partnered with the Maryland State Archives to help interpret this data and reveal hidden stories.

Students will get a chance to work with the 12,000+ records from the Runaway Slave Ads collection. This collection features local newspaper advertisements placed by slaveowners in an attempt to retrieve their escaped slaves. The data links newspaper clippings with data on the fugitive, the owner, and the points of departure and destination.

The project offers an opportunity to discover new insights into the patterns of fugitive slaves in Maryland, based on location, time period, and relationships.

  1. FactSquared
    Factba.se Trump Dataset

The Factba.se Trump dataset is the canonical source of everything Donald Trump has said publicly since 1976. It includes more than eight million words and spans more than 600 hours of transcribed video, along with interviews, speeches, press conferences, vlogs that were deleted from the Internet, 15 hours of Howard Stern interviews, op-eds he wrote and quite a bit more. The dataset also includes each word tokenized to the timestamp in the corresponding video, readability scores, sentiment analysis, entities, text emotion, second-by-second voice stress analysis, rate of speech, and a proprietary former Israeli Defense Force emotion analysis on all multimedia. It also includes his full @realDonaldTrump Twitter account, including deleted tweets, and his schedule since assuming office, with geolocation.

The dataset can measure nearly anything about Trump. It is used daily by the Washington press corps to fact check. There’s only so much a single person can say in their lifetime. The transcripts and Twitter are about 60MB of text total. The rest is the supporting data to play with. An example of how to play with it, from The New York Times this month: https://nyti.ms/2E6SuhZ. We want you to think of what you can do with this. Some examples are: How does sentiment change on different topics over 40+ years? How do changes in language and tone affect what is being discussed? Has his language changed over time? What are the patterns in questions asked of him? Does he speak differently in interviews with different reporters vs. speeches? Does he talk about men and women differently? Does he talk about countries differently over time? People of color? It’s all in the data.

  1. Maryland Small Business Development Center
    Individual Consulting Services

Maryland Small Business Development Center (SBDC) provides free individual business consulting and group training to small businesses, both existing and pre-venture ones. The dataset lists businesses who received individual consulting services during the last 10 years (2009-2018) and includes: (1) economic impact outcomes achieved by the clients as a result of consulting; (2) consulting and training activity; (3) characteristics of the businesses, and (4) socio-demographics of the owners.
The analysis could help uncover factors that determine positive economic impact of business consulting and may include, for example, the following areas:
• Which (if any) socio-demographic, geographical, and industry (NAICS code) characteristics of pre-venture businesses predict a successful business start? In other words, are clients with certain demographics more successful in starting a business than others? Are there industries with a higher success rate? Rate industries by the percentage of clients who started a business.
• What are the determinants of securing capital investments by the clients? Which of the available factors (if any) affect the amount of investments?
• Are there factors that predict an increase in revenue and an increase in the number of employees for existing businesses? If yes, what are the most important factors?
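
The request to rate industries by the percentage of clients who started a business can be sketched as follows. This is only an illustration: the input shape (a NAICS code paired with a started/not-started flag) and the sample rows are assumptions, not the dataset's actual layout.

```python
from collections import defaultdict

def start_rate_by_industry(clients):
    """Rank industries by the share of pre-venture clients who started.

    `clients` is an iterable of (naics_code, did_start) pairs, where
    did_start is True if the client went on to start a business.
    Returns (naics_code, rate) pairs sorted from highest to lowest rate.
    """
    totals = defaultdict(int)
    started = defaultdict(int)
    for naics_code, did_start in clients:
        totals[naics_code] += 1
        if did_start:
            started[naics_code] += 1
    rates = {code: started[code] / totals[code] for code in totals}
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)

# Made-up sample rows for illustration only.
sample_clients = [
    ("722", True), ("722", False),
    ("541", True), ("541", True), ("541", False), ("541", True),
]
ranking = start_rate_by_industry(sample_clients)
print(ranking)
```

Industries with very few clients can produce extreme rates, so a minimum client count per industry may be worth applying before drawing conclusions.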

  1. National Cancer Institute
    Health Information National Trends Survey

The U.S. National Cancer Institute (NCI) has been conducting the Health Information National Trends Survey (HINTS) since 2003 to learn about U.S. adults’ cancer-related perceptions and knowledge, their health behaviors, and their health-related information access, needs, seeking, and use. This survey is administered every few years to civilian, non-institutionalized adults in the U.S. Some possible investigations that could be conducted using this data set include assessing people’s varying levels of trust in different sources of health information, the extent to which they encounter barriers when searching for health information, their use of social media to share and ask for health information, and their use of technology to track their health and health behaviors. This data has already been prepared for statistical analysis and the data would lend itself nicely to interesting information visualizations, as well.

  1. Triadelphia Veterinary Clinic
    SNAP 4Dx Plus Test

One year’s worth of SNAP 4Dx test results has been provided. The SNAP 4Dx test is run during dogs’ annual wellness and non-wellness examinations. It gives a negative or positive result for heartworm disease and three tick-borne diseases: anaplasmosis, ehrlichiosis, and Lyme disease.
Within this data we have provided the date the test was completed, the breed of the dog, the weight of the dog, the zip code in which the dog resides, and the test results.
Our questions are as follows:
1) Which zip codes in our area have the most dogs testing positive for tick-borne diseases, and which tick-borne diseases are most prevalent in each area?
2) Does the size of the dog make it more or less prone to developing tick-borne disease? If so, which breeds?
3) Is there a breed most likely to develop a particular tick-borne disease?
4) Does the length of the dog’s coat (according to its breed standard) have any effect on whether a dog is more or less likely to get a tick-borne disease?

  1. UMD Civil & Environmental Engineering
    FEMA Public Assistance

This dataset contains all claims data for FEMA’s Public Assistance (PA) Program starting in 1998. The data are more consistently reported after 2003. PA is the largest source of disaster aid following most presidentially-declared disasters, and is available to owners of public and private non-profit infrastructure who experience losses during disasters. This program reimburses applicants for at least 75% of disaster-related expenses. The program has two broad categories: emergency management and rebuilding. The money allocated for rebuilding must be used to rebuild to pre-event conditions. Structural modifications or physical relocations that lessen the risk of damage in future hazards are covered by the applicant. Nearly all types of acute disasters are covered by this program (e.g., floods, earthquakes, chemical spills), though slow-onset disasters (e.g., sea-level rise) are not covered. A typical year might have 60 to 100 presidentially-declared disasters. There have been many calls to reform this program to align better with the realities of climate change and effective risk mitigation.

  1. UMD Department of Nutrition and Food Science
    Lightly Processed Plant Foods

Micronutrient levels of lightly processed fruits, vegetables and nuts.
Dataset was derived from USDA National Nutrient Database for Standard Reference. The documentation can be found at https://www.ars.usda.gov/ARSUserFiles/80400525/Data/SR-Legacy/SR-Legacy_Doc.pdf.
These are likely lightly processed ready-to-eat foods. Can you make recommendations to increase dietary diversity so that people do not eat similar foods all the time?

  1. UMD Department of Nutrition and Food Science
    Unprocessed and Minimally Processed Plant Foods

Micronutrient levels of unprocessed fruits, vegetables and nuts.
Dataset was derived from USDA National Nutrient Database for Standard Reference. The documentation can be found at https://www.ars.usda.gov/ARSUserFiles/80400525/Data/SR-Legacy/SR-Legacy_Doc.pdf.
Can the micronutrient profiles of unprocessed or minimally processed foods be grouped together based on similarity and used for recommendations on dietary diversification, nutrition practice and patient education? Many of them are plants grown in soil. Are there intrinsic relationships among the micronutrients of these plant foods?

  1. UMD National Consortium for the Study of Terrorism and Responses to Terrorism
    Signal Detection Exercise

The goal of this exercise is to determine humans’ ability to detect (whether with the aid of computer algorithms or manually) potential threat indicators (signals) in a naturally occurring noisy environment (background noise).
The scenario for this signal detection exercise is that a transnational criminal organization is plotting to conduct a criminal operation somewhere in the United States, and several local, state, and Federal law enforcement agencies have collected various intelligence that may be relevant to breaking open what this criminal organization is up to. Due to the size of intelligence collected, and limitations on the types of intelligence that can be collected within the bounds of the law, the agencies need some help in distinguishing between the actors (nodes) and transactions (edges) that are potentially the indicators (signals) of the criminal activity and the nodes and edges that are innocuous.

Data provided are all synthetically generated – in other words, these data are computer generated and are not real – with data characteristics representing a few different types of transactional data.
The data collected (and provided for this exercise) are unattributed, meaning they only contain the metadata of each transaction and no content (i.e. the data would contain the source and destination of a transaction, but not the specific content of that transaction).

  1. Washington DC, Department of For-Hire Vehicles
    Taxicab Trip Records

The data set includes complete taxicab trip records for calendar year 2017. Each record includes information about the driver, vehicle, time and distance traveled, trip origin location, trip destination location, and the fare. (For reference, DFHV publishes the taxi cab rates for the District of Columbia.) Traditionally, taxi cab fares have been fixed to ensure consistency and uniformity for consumers, and fares have not been allowed to change with regard to demand or availability. In the last few years, the taxi cab industry in Washington, DC has lost market share as demand for taxi cabs has fallen and Transportation Network Companies have entered the market.

As part of the agency’s commitment to encourage innovation in the for-hire vehicle industry, DFHV seeks to introduce a dynamic pricing system for taxi cabs. DFHV wants to implement three pricing tiers (e.g., off-peak, normal and peak, or low, medium and high) based on trip data. The tiers could be based on times or areas that have more or less demand for taxi cab service. As the taxi cab regulator, DFHV sets taxi cab rates and would establish parameters for dynamic pricing, which taxi cabs would follow.

Goals of a dynamic pricing system include:
• Ensuring the economic vitality of the taxi cab industry by ensuring peak fares during periods of most demand and generating new trips through discounting.
• Encouraging new trips in areas or times that currently have marginal taxi cab service.
• Encouraging travel at off-peak hours to relieve congestion.

Questions for a project group that selects this data set include:
• What times and/or areas should DFHV implement peak or high fares? How high should fares be raised?
• What times and/or areas should DFHV implement off-peak or low fares? How much should fares be discounted?
• In your proposed dynamic pricing model, will individual drivers earn more?
• Based on the data, are there other opportunities for a varying pricing structure?
• Are there other variables to consider for a dynamic pricing system?
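
One simple way to explore the three-tier idea is to rank hours of the day by trip count and split the ranking into thirds. The sketch below is an assumption-laden illustration, not DFHV's method: the input is a plain list of pickup hours (0 to 23), the sample values are made up, and the equal-thirds split is just one of many possible cut-off rules.

```python
from collections import Counter

def pricing_tiers(pickup_hours):
    """Assign each observed hour to 'off-peak', 'normal' or 'peak'.

    Hours are ranked by trip count (ties broken by hour) and the
    ranking is split into three equal-sized groups.
    """
    counts = Counter(pickup_hours)
    ranked = sorted(counts, key=lambda hour: (counts[hour], hour))
    n = len(ranked)
    tiers = {}
    for position, hour in enumerate(ranked):
        if position < n / 3:
            tiers[hour] = "off-peak"
        elif position < 2 * n / 3:
            tiers[hour] = "normal"
        else:
            tiers[hour] = "peak"
    return tiers

# Made-up pickup hours for illustration only.
sample_hours = [3] + [8] * 5 + [12] * 3 + [17] * 6 + [22] * 2 + [14] * 3
tiers = pricing_tiers(sample_hours)
print(tiers)
```

Because the split is by rank, two hours with the same trip count can land in different tiers; a threshold based on the counts themselves would avoid that. The same grouping could be applied to pickup areas instead of hours.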

  1. Washington Metropolitan Area Transit Authority
    Real-time Train Prediction

WMATA provides real-time train prediction data, as well as other datasets, to assist customers in travel planning and navigating the rail system. This data appears on the in-station Passenger Information Display System (PIDS) screens, as well as on our public API where it is consumed by apps, screens, and devices. WMATA has been refining the engine that creates the real-time arrivals data over the past year and is interested in an independent assessment of the quality of the predictions. The real-time arrivals data can be paired with actual train arrival data and rail system alerts data to further assess the quality of the predictions and correlate missed predictions with known events on the rail system. The data distribution contains predictions and actual train arrival data. Alerts data can be pulled from the internet: https://www.wmata.com/service/daily-report/list.cfm
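
Pairing predictions with actual arrivals can be sketched as below. This is a minimal illustration under stated assumptions: records are keyed by a hypothetical (train id, station) pair, each key has a single predicted and a single actual arrival time, and the sample values are made up. The real data layout may differ.

```python
from datetime import datetime

def prediction_errors(predictions, actuals):
    """Return signed errors in seconds (predicted minus actual).

    Both arguments map a (train_id, station) key to a datetime.
    Predictions with no matching actual arrival are skipped.
    """
    errors = []
    for key, predicted in predictions.items():
        actual = actuals.get(key)
        if actual is None:
            continue
        errors.append((predicted - actual).total_seconds())
    return errors

def mean_absolute_error(errors):
    """Average size of the errors, ignoring direction."""
    if not errors:
        return None
    return sum(abs(err) for err in errors) / len(errors)

# Made-up sample values for illustration only.
predictions = {
    ("T101", "Metro Center"): datetime(2019, 2, 23, 8, 5, 0),
    ("T102", "Metro Center"): datetime(2019, 2, 23, 8, 12, 0),
    ("T103", "Gallery Place"): datetime(2019, 2, 23, 8, 20, 0),
}
actuals = {
    ("T101", "Metro Center"): datetime(2019, 2, 23, 8, 6, 0),
    ("T102", "Metro Center"): datetime(2019, 2, 23, 8, 11, 30),
}
errors = prediction_errors(predictions, actuals)
mae = mean_absolute_error(errors)
print(errors, mae)
```

A negative error means the train arrived later than predicted. Grouping the errors by time of day, or by whether an alert was active, would be one way to relate missed predictions to known events on the rail system.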