To Build or Not To Build
During this past month I was lucky enough to work with Bridges to Prosperity, a non-profit organization that builds bridges in rural areas to promote safer and easier access to local necessities. They have a simple mission: “Rural isolation is the root cause of poverty; connection is the foundation for opportunity.” With these bridges they are allowing people to go about their lives in ways they would never have dreamed of. Students can safely travel to school, farmers can effectively sell and trade crops, and more people have access to opportunities that were previously too unsafe to reach.
Getting to the meat of the problem with Stakeholder Abbie
Abbie came to our data science team with two major issues. She needed some heavy-duty data cleansing, and help dialing in the efforts of her senior engineering team. The data in need of cleaning went back 6–7 years and had been collected in an unorganized fashion, leaving a large block of text that, in its current form, is not useful for many tasks in her organization. Our job was to parse and reorganize the bridge site records that contained this data so they could be used in future predictive analysis or visualizations. Our second task was to use our “Data Science mojo”, aka machine learning algorithms, to help her team reduce the number of times an engineer takes an entire week to assess a possible new location, only to discover that it will not support a bridge build.
In order to tackle a problem, you need to understand all the players
From the 10,000 ft view down to staring it right in the eyes: that’s how I like to break down the tasks given to us. As we break things down, we consistently reference the overall picture and point of view. We had our assignments from Abbie, but before we could divide and conquer, we needed to understand our objectives a little more. “Measure twice, cut once,” my dad would always say. The web team’s success depended directly on the data cleaning, which meant we needed a cleaned dataset, ripe for querying, loaded into a database, and hosted on our API with all the proper routes set, ASAP. Due to the size of the task we decided to work in two teams. Team 1: 2013/2014 data parsing. Team 2: general data cleaning on the remaining features.
The majority of my time during this project was spent on the data cleansing task. To put it simply, it was very challenging. Each item needed a semi-custom solution to properly extract the feature you were looking for, and load it appropriately into the row.
Figure: one of many functions that used regex to pull a feature out of the text.
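As a rough illustration of that kind of function (not the project's actual code), here is a minimal sketch of pulling one numeric feature out of a free-text note with a regular expression. The note text, the "span" field, and the name `extract_span_m` are all made up for the example; the real fields each needed their own pattern.

```python
import re

# Hypothetical free-text note; the real field names and layout are assumptions.
RAW_NOTE = "Span: 42 m. Crossing type - river; Nearest village: Example Village (2.5 km)"

# One pattern per feature, since each field needed its own semi-custom rule.
SPAN_PATTERN = re.compile(r"span\s*[:\-]\s*(\d+(?:\.\d+)?)\s*m\b", re.IGNORECASE)


def extract_span_m(text):
    """Return the span in meters as a float, or None if the note has no span."""
    match = SPAN_PATTERN.search(text)
    return float(match.group(1)) if match else None


span = extract_span_m(RAW_NOTE)
missing = extract_span_m("no measurements recorded")
```

Returning `None` when nothing matches keeps a missing value visibly missing in the row instead of silently becoming a 0.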
After that I was entrusted with merging our newly created features with the cleaned original features from team 2’s work, producing a dataset that was more readable and ready for analysis.
With Great Power Comes Great Responsibility
My next task was to back up the machine learning effort. Our second team moved through their portion of the data cleansing faster than team 1, so we decided that team 2 would begin building a model to help predict the success of a bridge assessment.
To understand the model building process, we must understand the bridge building process. First, a local assessor performs some rough estimations for a bridge location, including basic demographic information and first-pass measurements of what the bridge could look like. From there the assessor sets a “Flag for rejection” on the site, recording whether, based on that first assessment, the bridge is likely to pass a senior engineering review. If the assessor believes it to be a viable location, an engineer plans to spend one week at the location taking additional measurements and starting the initial planning stages.
What happens if there is a discrepancy between the first-pass assessment and the engineering review? Time and money are lost. That was our goal: to help save the money it would cost and, more importantly, the time that would be lost.
Before we even started cleaning the data, there was much talk of simply organizing the features and target for the machine learning task. Through much conversation with Abbie and a little help from our data science manager, we were able to collapse multiple categories into a binary target: after senior review, did the site turn out to be a “Good Site” or a “Not Good Site”, aka 1 or 0?
With most machine learning tasks we want plenty of data for simple things such as a train/validation/test split and cross-validation metrics. To our surprise, we had much less data than we were used to: 65 instances in which the result was “Good Site” and 24 in which it was “Not Good Site”. My immediate thought was “uh oh”.
In the figure above, -1 marks the sites that have undergone initial assessment but have not yet been assessed by the senior team.
Previously, most tasks we had come across involved the idea of “too much data” (not really a thing in data science), but here the issue was: can we reasonably create a model whose predictions are accurate enough to give Bridges to Prosperity more help than a field expert could?
Not only was the training data microscopic, it was also imbalanced. When it rains it pours, amirite?
Our brainstorming led us to a few possible tools: semi-supervised learning, SMOTE, and NearMiss were the three paths we chose as a guiding hope for this problem. Semi-supervised learning turned out to be a flop (< 25% accuracy), so we will focus on SMOTE.
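To put numbers on “microscopic” and “imbalanced”, here is a quick back-of-the-envelope check using the two class counts from this post. The point it makes is that a model which always answers “Good Site” already gets most rows right, so plain accuracy has a high bar to clear before it means anything.

```python
good_sites = 65      # labeled "Good Site" (1), from the counts in this post
not_good_sites = 24  # labeled "Not Good Site" (0)

total = good_sites + not_good_sites
minority_share = not_good_sites / total
# Accuracy of a "model" that always predicts the majority class.
baseline_accuracy = max(good_sites, not_good_sites) / total

print(f"labeled sites: {total}")
print(f"minority share: {minority_share:.1%}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.1%}")
```

With 89 labeled sites, the always-“Good Site” baseline sits at roughly 73% accuracy, and it never catches a single bad site, which is exactly the case the engineers care about.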
Synthetic Minority Over-sampling Technique
Utilizing SMOTE was the best chance we had at increasing our model’s efficacy and reducing the risk of overfitting the data. SMOTE works by looking at where the minority class resides in vector space; it draws lines between each of those points and its nearest minority-class neighbors, then adds new data points to the class along those lines. In a simple world where everything is 2D, it’s relatively easy to imagine. When data points are multi-dimensional, it can be a little harder to visualize.
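The “draw a line, drop a point on it” idea can be sketched in a few lines of plain Python. This is a simplified illustration of the mechanism, not the code we ran and not a full SMOTE implementation; the function name `smote_sketch`, its parameters, and the toy coordinates are made up for the example.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest minority neighbors of the chosen point, excluding itself.
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        # Pick a random spot on the segment between base and neighbor.
        t = rng.random()
        synthetic.append(tuple(b + t * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


# Toy 2D minority class (made-up coordinates, not project data).
minority_points = [(1.0, 1.0), (2.0, 1.5), (1.5, 3.0), (3.0, 2.5)]
new_points = smote_sketch(minority_points, n_new=5)
```

Because every synthetic point sits on a segment between two real minority points, the new data never leaves the region the minority class already occupies. In practice a library such as imbalanced-learn (its `SMOTE` class) is the usual route rather than a hand-rolled version like this.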
Sounds pretty simple: build an ideal model, slap SMOTE in there, and done. Everyone gets a prediction! Not so fast. If we are creating data points, we need some way to validate that the model is working appropriately. This is where cross-validation techniques come in.
This is where all four teammates’ heads had to be put together. Everyone took to their notebooks to create their own ideal model and build out their own path to validation. Some visualized, some tested, some relied on numbers. We all came to a similar conclusion: “We need to settle on the least wrong model.” Yes, I said that right, least wrong.
Once we settled on a model to bring back to Abbie, we made sure to bring the caveats with it. We knew it could be created, but relied upon? No. Because we included all of our conclusions with the model, she was able to add its predictions as data points in her decision-making process. As anyone should, take our predictions as suggestions more than anything else.
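One detail matters a lot when synthetic points are involved: the oversampling should only ever see the training rows of each fold, so that every validation score is measured on real sites. Below is a minimal sketch of that layout in plain Python, using the same class counts as this post. The helper `stratified_folds` is a made-up name for illustration, and the model fitting itself is left as a comment rather than guessed at.

```python
import random


def stratified_folds(labels, n_folds=5, seed=0):
    """Split row indices into n_folds groups, keeping each class spread evenly."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal this class's rows out to the folds like a deck of cards.
        for position, i in enumerate(idx):
            folds[position % n_folds].append(i)
    return folds


# Labels with the same counts as the post: 65 good sites (1), 24 not good (0).
labels = [1] * 65 + [0] * 24
folds = stratified_folds(labels)

for k, val_idx in enumerate(folds):
    held_out = set(val_idx)
    train_idx = [i for i in range(len(labels)) if i not in held_out]
    # Oversample (e.g. with SMOTE) using train_idx rows only, fit the model,
    # then score on val_idx, which holds real sites and no synthetic ones.
    n_minority_val = sum(1 for i in val_idx if labels[i] == 0)
    print(f"fold {k}: train={len(train_idx)} val={len(val_idx)} minority in val={n_minority_val}")
```

Stratifying keeps a handful of “Not Good Site” rows in every validation fold, which is about the best you can do with only 24 of them. If you are using scikit-learn, the pipeline class in imbalanced-learn is one way to get the same "resample inside the training fold only" behavior without writing the loop by hand.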
Going into the experience, I was a little nervous in general; maybe call it stage fright. It’s very exciting to work with a real problem. From the start I have always been obsessed with impact: can my work make a difference in whatever I’m doing? I was extremely excited to make a positive mark on this already incredible company and mission.
As Data Scientists we have a responsibility to be experts not only in the first half of our name, but in the second half as well. Science is a process, a process that makes suggestions, not predictions. When we find that the null hypothesis can be rejected, our next step is to suggest an alternative. We hope to support the suggestion with certainty and significance, examples and models. But when push comes to shove, being a Data Scientist is much more than making a model and shipping the best result for the stakeholder. It is respecting the process and understanding everything involved outside of what goes on within a notebook.
I want to give a huge thank you to Abbie at Bridges to Prosperity for entrusting me and my team with her projects, listening to our results, and answering our questions. I am very thankful for the experience and for the hard work of all my teammates, Web and Data Science. It has been a fun and eventful 4 weeks!