Insight 5: Engineering means making the right trade-offs
TLDR: Pragmatic engineers make conscious decisions about the trade-offs between desirable properties of data science projects.
In the fifth insight, we explain how the end goals of industrial data science teams differ from those of academic and competitive data science teams.
6 properties to consider
Engineers must deal with the complexity of the real world. This often requires making trade-offs between several desirable properties of the final solution. Pragmatic engineers know this and make conscious decisions about what properties to prioritize.
We propose to consider the following 6 properties for data science projects:
- Performance: The performance of the model on the evaluation dataset, as calculated by the metric.
- Features: Other desirable properties of the solution, such as fast prediction time, fast training time, easy interpretability of results, and confidence intervals.
- Development time: The time required to develop the solution.
- Turn-around time: The time required to maintain the solution by making adjustments.
- Computational requirements: The storage, memory, CPU, GPU, and TPU requirements.
- Skill requirements: The level of education required by data scientists to contribute to the solution. Using uncommon technologies will require retraining time for new maintainers.
A common mistake of novice engineers is to put emphasis on the wrong properties. We believe this mistake stems from several factors:
- People with engineering-type personalities tend to focus on things that can be measured directly, such as computational requirements.
- Novice engineers often do not consider the final value perceived by end-users.
- Novice engineers rely on the same prioritization as academic and competitive data science, for the reasons described at the beginning of the previous insight.
In the following section, we estimate how much engineers focus on each of the properties in academic, competitive and industrial contexts.
For each property, we have assigned two scores from 0 to 10 in each of the contexts: how many resources we believe should be spent on the property (needed), and how many resources we believe data scientists usually spend on improving it (actual). A discrepancy between these two numbers highlights a common mis-prioritization. The scores are very subjective, and we encourage you to write to us if you disagree with them.
Performance (Needed/Actual): Academic (9/9), Kaggle (10/10), Industry (7/9).
Performance is the primary focus of all three contexts.
In academia, researchers only stop optimizing performance when it would result in overfitting to the domain, thus losing general applicability.
In competitions, the focus on performance is pushed to the point where competitors are willing to sacrifice all other properties for tiny performance improvements. Competitors only stop optimizing the performance of their solution when they cannot possibly achieve better performance.
While performance is certainly important in the industry, other properties are usually more important than data scientists realize, which creates a discrepancy between the actual and the needed focus on performance. Pragmatic industrial data scientists should stop optimizing performance when the resources would be better spent elsewhere.
Features (Needed/Actual): Academic (6/5), Kaggle (0/0), Industry (6/4).
Features represent a kind of catch-all category that covers a number of desirable properties.
In competitions, features do not matter.
In academia, desirable features are used to argue for the feasibility of solutions with lower performance than competing methods. However, researchers tend to focus on directly measurable properties, so performance often takes focus away from features. Unlike in competitive and industrial contexts, in academia the general applicability of solutions is considered a feature.
In the industry, features such as fast prediction times and interpretability are often hard constraints. In settings where they are not, they are often neglected, even though they can have a big impact on end-user experience. Pragmatic industrial data scientists realize this and make conscious decisions about the trade-offs between different solutions.
Features of data science projects include:
- Fast training time.
- Fast prediction time.
- General applicability.
- Zero training error.
- Interpretability of results.
- Confidence intervals.
- Novelty of the solution (very important for academia).
- The ability to run on-device.
- The ability to run on encrypted data.
We encourage you to let us know if we have missed any features on this list.
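The idea that some features act as hard constraints in the industry, while performance is optimized only among the solutions that satisfy them, can be sketched in a few lines of Python. This is a minimal illustration, not a recommended selection procedure: the `Candidate` class, the `pick` function, and all model names, scores, and latencies below are hypothetical values made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    score: float         # evaluation metric, higher is better (hypothetical)
    latency_ms: float    # prediction time per request (hypothetical)
    interpretable: bool


def pick(candidates, max_latency_ms, need_interpretable):
    """Drop candidates that violate a hard constraint, then take the best score."""
    feasible = [
        c for c in candidates
        if c.latency_ms <= max_latency_ms
        and (c.interpretable or not need_interpretable)
    ]
    # default=None: no candidate satisfies the constraints.
    return max(feasible, key=lambda c: c.score, default=None)


# Made-up candidates for illustration only.
candidates = [
    Candidate("large ensemble", 0.93, 250.0, False),
    Candidate("gradient boosting", 0.91, 20.0, False),
    Candidate("logistic regression", 0.88, 1.0, True),
]

best = pick(candidates, max_latency_ms=50.0, need_interpretable=True)
print(best.name if best else "no feasible candidate")
```

With these made-up numbers, the highest-scoring model is never chosen under a 50 ms latency limit, and adding the interpretability requirement moves the choice to the simplest model.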
Development time (Needed/Actual): Academic (7/7), Kaggle (7/7), Industry (5/8).
Academics and especially competitors experience hard constraints on development time.
In the industry, it is usually best to give up fast development time in exchange for a better turn-around time. This can only be achieved by doing everything conceivable to reduce the accumulation of complexity, as described in Insight 2. As in traditional software development, it is often beneficial to reduce development time by building a Minimum Viable Product. Novice managers and engineers tend to focus on the development time of solutions, neglecting turn-around time.
Turn-around time (Needed/Actual): Academic (3/2), Kaggle (2/1), Industry (10/5).
In the industry, turn-around time is usually the most important property to consider. Too few software teams realize this, resulting in solutions that slowly but steadily disintegrate into an unmanageable mess.
In academia, turn-around time does not matter as much. Once a paper is out, researchers move on to the next problem. However, since the longevity of software projects is usually grossly underestimated, we believe that most academics would benefit from prioritizing turn-around time by adopting proper engineering methods.
Computational requirements (Needed/Actual): Academic (2/2), Kaggle (1/1), Industry (2/5).
In most contexts, computational requirements do not matter much, because computing power is cheap relative to developer salaries. The exception is when huge amounts of data are available or can be generated. In that case, the computational requirements of training set hard limits on the performance of models.
It is our experience that novice engineers mistakenly focus on computational requirements, even though this property tends to have a relatively small impact on the feasibility of the final solution. It is usually better to trade off computational requirements for other properties. Pragmatic engineers know the importance of avoiding such premature optimization.
Skill requirements (Needed/Actual): Academic (1/1), Kaggle (1/1), Industry (6/3).
In academia, projects with high skill requirements are expected, and actively encouraged. Proving that you can use the most difficult techniques is a badge of honour, not a drawback.
In competitions, leveraging advanced techniques can give you the edge to outperform your competitors.
In the industry, however, using advanced techniques is a double-edged sword. While it can lead to outperforming the competition, it can also make scaling up the team difficult and expensive. If the same product can be built with lower skill requirements, it is a strictly better solution. Pragmatic engineers realize this and use the most common tools and frameworks when possible. If a commonly used, publicly available tool can be used, it is usually preferable to both obscure publicly available tools and custom-built solutions.
The results of making conscious decisions about what property to prioritize
We find that there is little to no discrepancy between the needed and the actual prioritization of properties in competitions and academia. In the industry, however, mis-prioritization is the rule rather than the exception. This results in suboptimal solutions that do not provide maximum value to end-users.
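The size of the discrepancy can be checked directly from the scores given earlier. The sketch below sums the absolute difference between needed and actual focus for each context. It assumes that the six score lines appear in the same order as the six properties in the list at the start of this insight; the names `SCORES`, `total_discrepancy`, and `gaps` are our own and only serve the illustration.

```python
# (needed, actual) scores copied from the score lines of this insight.
SCORES = {
    "Performance": {"Academic": (9, 9), "Kaggle": (10, 10), "Industry": (7, 9)},
    "Features": {"Academic": (6, 5), "Kaggle": (0, 0), "Industry": (6, 4)},
    "Development time": {"Academic": (7, 7), "Kaggle": (7, 7), "Industry": (5, 8)},
    "Turn-around time": {"Academic": (3, 2), "Kaggle": (2, 1), "Industry": (10, 5)},
    "Computational requirements": {"Academic": (2, 2), "Kaggle": (1, 1), "Industry": (2, 5)},
    "Skill requirements": {"Academic": (1, 1), "Kaggle": (1, 1), "Industry": (6, 3)},
}


def total_discrepancy(context):
    """Sum of |needed - actual| over all six properties for one context."""
    total = 0
    for by_context in SCORES.values():
        needed, actual = by_context[context]
        total += abs(needed - actual)
    return total


def gaps(context):
    """Per-property gap (actual - needed); positive means too much focus."""
    return {
        prop: by_context[context][1] - by_context[context][0]
        for prop, by_context in SCORES.items()
    }


for context in ("Academic", "Kaggle", "Industry"):
    print(context, total_discrepancy(context))
# Academic 2
# Kaggle 1
# Industry 18
```

Calling `gaps("Industry")` shows where the industrial total comes from: turn-around time has the largest gap (5 points too little focus), while development time and computational requirements each receive 3 points too much.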
In data science as well as engineering in general, it is often the case that one of the properties of a project is more important for end-users than all the others combined. This is similar to the field of computational performance optimization, where engineers always make sure to determine the worst bottleneck. Since it is highly probable that this bottleneck has a larger overhead than all other bottlenecks by a large factor, it is the only part of the solution worth optimizing. Pragmatic engineers realize this and avoid drawing a false equivalence between the importance of different properties.
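The bottleneck analogy can be made concrete with a small calculation. The sketch below computes, for each stage of a pipeline, the upper bound on the overall speedup if that stage took no time at all. The stage names and timings are hypothetical and only chosen to illustrate a dominant bottleneck; `speedup_if_removed` is our own helper name.

```python
def speedup_if_removed(timings):
    """Upper bound on overall speedup if a single stage took zero time."""
    total = sum(timings.values())
    bounds = {}
    for stage, seconds in timings.items():
        remaining = total - seconds
        # A pipeline with only one stage would have nothing left to run.
        bounds[stage] = total / remaining if remaining > 0 else float("inf")
    return bounds


# Hypothetical stage timings in seconds, made up for illustration.
timings = {
    "load data": 4.0,
    "feature engineering": 80.0,
    "train model": 12.0,
    "evaluate": 4.0,
}

bounds = speedup_if_removed(timings)
worst = max(timings, key=timings.get)

for stage, bound in bounds.items():
    print(f"{stage}: at most {bound:.2f}x faster overall")
print("worst bottleneck:", worst)
```

With these numbers, eliminating the worst bottleneck could make the whole pipeline up to 5 times faster, while eliminating any other stage entirely yields less than a 1.2 times improvement. The same reasoning applies when one property of a project dominates the value perceived by end-users.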
To sum up
In this fifth and last insight, “Engineering means making the right trade-offs”, we argue that pragmatic engineers must strike a balance between opposing desirable properties of the solution. The insight explains how the end goals of industrial data science teams differ from those of academic and competitive data science teams. We argue that in the industry, turn-around time is undervalued, while properties such as computational requirements and development time receive too much attention.
Thank you for reading through our lists of insights. We are always happy to receive comments and critique, so feel free to reach out to us.