Insight 4: Manipulate the problem-model-data-metric equation
TLDR: In industrial data science the best solution is often reached by manipulating the problem, the metric or the data rather than the model.
In today’s post, we will describe how we believe academic data science and competitive data science differ from data science in the industry.
Industrial, academic and competitive data scientists
Machine learning research is moving forward at a rapid pace. Industrial data scientists have to stay up to date with the newest academic articles if they want to use cutting-edge methods. Because of the close ties to academia, it is a common mistake to copy the methodology of researchers when working on problems in the industry. This leads to suboptimal solutions because the goals of academia are different from those of the industry. Likewise, data scientists who used to participate in competitions such as the ones hosted by Kaggle are biased towards a methodology that is suboptimal for data science in the industry.
The 4 components of a data science solution
In academia and competitions, almost all resources go into improving models. In the industry, however, we are often allowed to manipulate all 4 components of the final solution: The problem, the model, the data and the metric.
• The problem is the central issue that we are trying to solve. Problem descriptions should immediately reveal the type of problem, e.g. classification, regression or outlier detection.
Example: An example of a problem description could be: “Given a sentence from a legal document, determine how likely it is that the sentence contains sensitive information.”
• The model is the algorithm or combination of algorithms that are developed to solve the problem, as well as their settings and hyperparameters. For most projects, the model that achieves the best performance is the right choice. However, properties such as interpretability, prediction time or theoretical bounds on the rates of convergence also come into play when choosing a model.
• The data is the datasets used for training and evaluation. Datasets should be representative of the problem, contain little or no noise, and, most importantly, be large. Usually, datasets have to be pre-processed before they can be used by the model.
• The metric is the expression used to evaluate how well models are solving the problem. The metric must be automatically computable, even though this often means that we must sacrifice how well the metric fits the problem. Consider problems for which multiple solutions exist, such as text summarization; automatic evaluation of such problems is far from trivial.
How to manipulate each component of the equation
Because the problem, the data and the metric usually cannot be changed in academic or competitive contexts, industrial data scientists often forget to manipulate these components. In the following section, we will elaborate on how data scientists can manipulate each of the 4 components to their advantage.
Manipulating the problem
In competitions, the problem is always to reach the highest score on the test data. This cannot be changed. Even when the metric does not accurately measure the intentions of the contest organizer, the problem stays the same, and competitors are allowed to exploit such conflicts of interest.
In a competition where the goal is to predict the price of a taxi fare, competitors might exploit the fact that the metric does not penalize some guesses sufficiently.
In academia, researchers can sometimes change the problems slightly to their advantage, but only problems that can be applied to a wide range of applications are of interest.
Academics will attempt to tackle problems such as assigning a Part of Speech category to words in input sentences. In an industry context, problems will usually be domain specific such as extracting significant named entities from certain types of legal documents.
In the industry, however, the initial problem should often be modified or completely changed. Many problems posed by laymen should be skipped altogether. Experienced data scientists avoid problems that are impossible to solve and manage the expectations of stakeholders.
Problems can be modified to improve commercial viability in 3 different ways:
1. Narrowing the problem
The initial problem description and the tools available sometimes address a problem that is too general. Narrowing the problem to a more specific use case reduces the variance of the input data, so the final solution will achieve equal or better performance.
Rather than building a general document classification system, you might want to build a classification system for documents in a particular domain, such as legal documents.
2. Switching the problem
The initial problem might not correlate with the goal of the end-user. By changing the problem to provide maximum value to the end-user, the solution may be perceived as performing better.
You might realize that your classification problem should assume an open world instead of a closed world, or that it should be transformed into multiple binary classifications with confidence.
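One such reformulation is to turn a single multiclass problem into several binary problems, each of which can later get its own classifier and confidence threshold. A minimal sketch, with hypothetical category names:

```python
# Reformulate one multiclass problem as several binary problems, each
# answering "does this item belong to category X?". Category names are
# illustrative, not taken from any real system.
def to_binary_targets(labels, categories):
    """One binary target list per category: 1 if the label matches, else 0."""
    return {cat: [1 if label == cat else 0 for label in labels]
            for cat in categories}

labels = ["contract", "email", "contract", "invoice"]
binary_targets = to_binary_targets(labels, ["contract", "email", "invoice"])
# binary_targets["contract"] -> [1, 0, 1, 0]
```

Each binary target list can then be paired with its own model and decision threshold, which also makes an open-world assumption easier to handle: an item no classifier is confident about simply falls into no category.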
3. Solving an easier substitute problem
The initial problem might be so difficult that you cannot achieve sufficient performance. Thus, you must settle for solving a less complex problem.
Reducing the number of categories for a classification problem will almost certainly improve the performance of the classifier.
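In practice, reducing the number of categories can be as simple as a mapping from fine-grained labels to coarse ones. A sketch, with all category names hypothetical:

```python
# Collapse fine-grained categories into coarser ones to make the
# classification problem easier. All category names are hypothetical.
COARSE_CATEGORY = {
    "employment_contract": "contract",
    "lease_contract": "contract",
    "internal_email": "correspondence",
    "external_email": "correspondence",
}

def simplify_labels(labels):
    """Map each fine-grained label to its coarse category, defaulting to 'other'."""
    return [COARSE_CATEGORY.get(label, "other") for label in labels]

simplify_labels(["lease_contract", "internal_email", "press_release"])
# -> ["contract", "correspondence", "other"]
```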
Manipulating the metric
In competitions, the problem is always to reach the highest score on the test data. Thus, the metric fits the problem perfectly and should never be changed.
In academia, the metric is usually defined by previous research that you wish to improve upon. Metrics in academia are usually simple and standardized. The metric will often make solutions seem more capable than they actually are. This is because of an interesting feedback loop:
First, a group of academics writes an article that on the surface seems to achieve impressive results. Then, the article gains attention from the industry, and thus from other researchers. Finally, in order to compare results, other researchers must use the same metric. Thus, metrics that make solutions seem impressive on the surface are more likely to remain in use.
Consider the metric used to evaluate POS-taggers. The standard metric is word-level accuracy, and state-of-the-art solutions reach an impressive ~95% accuracy for English. However, 95% word-level accuracy means that 1 in 20 words is classified incorrectly. The result is that every other sentence in a text contains a misclassified word. Considering how many words have unambiguous POS-tags, the state-of-the-art performance does not seem so impressive.
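Assuming word-level errors are independent (a simplifying assumption), the gap between word-level and sentence-level quality follows directly:

```python
# Fraction of sentences containing at least one tagging error, assuming
# independent word-level errors (a simplifying assumption).
def sentence_error_rate(word_accuracy, sentence_length):
    return 1.0 - word_accuracy ** sentence_length

sentence_error_rate(0.95, 15)  # ~0.54: over half of 15-word sentences contain an error
```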
In the industry, it is usually the case that the best metric is not one of the academic standards. Metrics should aim to measure the value provided to end-users. Inventing a scoring method that reflects user-value often requires weighting several domain specific heuristic measurements. In the industry we often need to sacrifice mathematical simplicity and comparability to previous results to create a metric that has as close a problem fit as possible.
Non-standard metrics can be created as a weighted combination of:
- Allowing multi-category assignments for classification tasks.
- Heuristic scoring of different categories, so that some categories are penalized differently.
- Evaluating the model based on training error as well as evaluation error.
For example, in life-or-death scenarios such as cancer treatment, you might want to penalize false negatives much more heavily than false positives.
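A minimal sketch of such an asymmetric metric; the weight of 10 is illustrative and should in practice be chosen together with domain experts:

```python
# A domain-specific cost metric where false negatives (missed positives)
# are weighted more heavily than false positives. The weight of 10 is
# an illustrative assumption, not a recommendation.
def weighted_error_cost(false_positives, false_negatives, fn_weight=10.0):
    return false_positives + fn_weight * false_negatives

weighted_error_cost(false_positives=5, false_negatives=2)  # -> 25.0
```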
We advise data scientists to put a great deal of effort into developing and refining their metric to reflect the value provided to end-users. Maintaining a representative metric is similar to maintaining good automatic tests for traditional software: it allows teams to continually refactor solutions with confidence that the project is constantly progressing and never moving in the wrong direction.
Manipulating the model
In academia and competitions, the model is usually the only component of the equation that is allowed to be manipulated. Thus, all resources are spent on this component. While academics attempt to invent new models that can be applied to multiple domains, competitive and industrial data scientists focus on improving the models from academia by exploiting the specific constraints of narrow use cases.
In academia, simple models are preferred because they are generally applicable.
In competitions, the simplicity of models does not matter. In fact, a simple model implies that the solution is easily achieved and thus uncompetitive. Winning solutions usually stack multiple models in complicated ensembles to squeeze out every last bit of performance. In other words: Nobody cares how the sausage is made, any hack that improves the performance of the model is viable.
In the industry, model simplicity improves maintainability and decreases turn-around time. This is especially important for applications for which the dataset changes over time because maintainers might want to continually adjust the model accordingly.
In the industry, the choice of model can also depend on additional requirements and trade-offs such as:
- Interpretability of results: Interpretability is often very valuable in the development process and is even sometimes a strict requirement.
- Prediction time: In many cases, slow prediction times will result in a poor experience for end-users.
- Training time: In some cases, end-users train models on their own data, making training time a trade-off.
- Zero Training error: In a few cases, it is a requirement that the model performs with zero training error.
Because models are the primary subject of interest for academics and competitors alike, there is a wide range of great standard models available in the public domain. It is therefore often unnecessary for industrial data scientists to spend resources on manipulating this component. Most of the hard work is already provided by academia and open source contributors.
Novice data scientists in the industry tend to fall into the trap of spending excessive resources adjusting models. We believe this mistake is a result of the following factors:
- There is a lot of learning material available on model tuning because models are the primary subject of interest for academics and competitors alike.
- Mistakenly using the methodology taught in academia and used in competitions.
- Models are considered the “science” of data science and usually require little “engineering”. Data scientists with little practical experience will have a hard time dealing with the complexity of engineering.
Manipulating the data
Andrej Karpathy, the Director of AI at Tesla, has shared a diagram estimating how much time he spent working on models versus data during his PhD and while working at Tesla. This illustrates one of the main differences between working as a data scientist in academia and in the industry.
This diagram illustrates a very important point. The bottleneck in commercial AI is data, not algorithms. The data component is where industrial data scientists should spend the majority of their time. Data Visualization and Exploratory Data Analysis (“EDA”) are probably the most important subfields to master to develop good data science solutions. Even in academic and competitive settings where directly manipulating the data is disallowed, Data Visualization and EDA provide tremendous value.
In academia, data scientists usually work with clean standardized datasets. The datasets are easily interpreted and have a simple structure. Datasets are usually huge, allowing for researchers to experiment with large, complex models.
Data visualization and EDA are often completely overlooked in academia, as fitting solutions to datasets makes them less generally applicable. However, we recommend that researchers spend time on EDA regardless, as data insights can often inspire novel ways to solve general problems.
In competitions, features are often secret, undocumented or simply difficult to interpret. This makes data visualization and EDA even more important.
In the industry, the structure of the dataset is often quite complex. The complexity stems from relationships between columns, correlated or untrustworthy data sources, missing data, noise, and outliers.
The data used to evaluate the standard models provided by academia rarely reflects common business use cases. For some problems, training and test data are inherently different, making it difficult to avoid biased models.
For many domain-specific problems, little or no data is available until users start providing the service with data. This is usually referred to as the “chicken and egg problem” or the “cold start problem”.
When little or no data is available, industrial data scientists have to find ways to obtain more. Being able to effectively obtain more data is often what sets experienced industrial data scientists apart from novices. Common methods include:
- Finding more data from open public sources.
- Annotating data internally. Often, the economics of manually labelling data pays off.
- Writing a good annotation description and using paid human-labour services such as Mechanical Turk.
- Building graphical user interfaces for data annotation and providing users with incentives to perform the labelling.
The benefits of knowing the difference between different types of data science
It is important for industrial data scientists to be aware of the different goals of academia, competitions and the industry. We believe that internalizing these differences will help data scientists:
- Remember not only to modify the model, but also the problem, the data, and the metric.
- Determine which components to spend the most resources on manipulating.
- Avoid false impressions of the performance of academic results.
- Yield value from exploiting the gap between academic models and domain-specific implementations.
In recent years, data scarcity has led the field to use a variety of techniques that leverage patterns in datasets other than the target data, such as transfer learning and multitask learning. Weak supervision is also a promising technique in fields where data scientists have enough intuition about particular problems to define reasonably accurate labeling functions.
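To make the idea of labeling functions concrete, here is a minimal weak-supervision sketch in the style popularized by Snorkel. The rules, keywords, and sentences are hypothetical, chosen only to illustrate the mechanism:

```python
# A minimal weak-supervision sketch: several heuristic labeling functions
# vote on whether a sentence contains sensitive information, and a simple
# majority vote combines them. All rules and keywords are hypothetical.
SENSITIVE, NOT_SENSITIVE, ABSTAIN = 1, 0, -1

def lf_mentions_salary(sentence):
    return SENSITIVE if "salary" in sentence.lower() else ABSTAIN

def lf_mentions_ssn(sentence):
    return SENSITIVE if "social security" in sentence.lower() else ABSTAIN

def lf_short_boilerplate(sentence):
    return NOT_SENSITIVE if len(sentence.split()) < 4 else ABSTAIN

LABELING_FUNCTIONS = [lf_mentions_salary, lf_mentions_ssn, lf_short_boilerplate]

def majority_vote(sentence):
    """Combine labeling-function votes, ignoring abstentions."""
    votes = [lf(sentence) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    return max(set(votes), key=votes.count)

majority_vote("The salary is confidential.")  # -> 1 (SENSITIVE)
```

Real weak-supervision frameworks learn per-function accuracies instead of taking a flat majority vote, but the interface is the same: cheap heuristic labelers that together produce training labels.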
Image from: https://hazyresearch.github.io/snorkel/blog/ws_blog_post.html
Data scientists should interpret results after manipulating the whole pipeline. This helps decide which components to manipulate in subsequent iterations. More often than not, working on the data component yields the best development-time/performance economics. We encourage you to use exploratory methods such as SHAP to interpret the results of the entire pipeline. Investigating feature importance can aid the development process and can sometimes even provide valuable information to end-users.
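SHAP itself requires the `shap` library and a trained model; as a lighter-weight illustration of the same idea, here is a permutation-importance sketch against an assumed toy model:

```python
import random

# Permutation feature importance: a simpler relative of SHAP for checking
# which features a pipeline actually relies on. The toy "model" below is an
# assumption for illustration: it scores rows as 3*x0 + 1*x1 and ignores x2.
def model_error(rows, targets):
    predictions = [3 * row[0] + 1 * row[1] for row in rows]
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(rows)

def permutation_importance(rows, targets, feature_index, seed=0):
    """Increase in error when one feature column is shuffled."""
    rng = random.Random(seed)
    baseline = model_error(rows, targets)
    column = [row[feature_index] for row in rows]
    rng.shuffle(column)
    shuffled = [list(row) for row in rows]
    for row, value in zip(shuffled, column):
        row[feature_index] = value
    return model_error(shuffled, targets) - baseline

rows = [[1, 2, 5], [2, 1, 7], [3, 3, 2], [0, 4, 9]]
targets = [3 * row[0] + 1 * row[1] for row in rows]
# Shuffling the ignored third feature leaves the error unchanged (importance 0).
```

A feature whose shuffling barely changes the error is one the pipeline does not depend on, which is exactly the kind of insight that can redirect effort toward the data component.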
To sum up
In this fourth insight, “Manipulate the problem-model-data-metric equation”, we go into detail about what distinguishes pragmatic data science in the industry from academic and competitive data science. In industrial data science, engineers choose more than which algorithms to use to solve a particular problem and which data to train on. In many cases, it is advantageous to solve a different problem than the one originally formulated and to choose a different measurement of the quality of the solution. We argue that because industrial engineers can change more of the variables of the equation, they must master a broad range of skills, ranging from data collection techniques to stakeholder management.
In our fifth insight “Engineering means making the right trade-offs” we argue that pragmatic engineers must strike a balance between different opposing desirable properties of any solution. We explain how the end goals of industrial data science teams differ from that of academic and competitive data science teams.
Get ‘The Pragmatic Data Scientist’ as a whitepaper