Arundo participates in the 2017 Geilo Winter School in eScience organized by SINTEF
Mark Tibbetts, Data Scientist, hosted a session detailing applications of machine learning in an industrial setting.
OSLO, Norway -- I am on the train back to Oslo having spent the last few days in Geilo, Norway at the 2017 Geilo Winter School in eScience organized by SINTEF.
This year’s school covered Machine Learning, Deep Learning, and Data Analytics. With over 120 registered participants, the school brought together those interested in Machine Learning from a wide variety of research disciplines, with representatives from both academia and private companies across Norway. Arundo’s data science team was invited by SINTEF to host a session on the applications of Machine Learning in an industrial setting, and I volunteered to run that session.
I have worked at Arundo as a data scientist for five months and, in that time, have come across a wide variety of interesting use cases for advanced analytics. I therefore used my time at the school to discuss some of those use cases and to outline some general good-practice tips for collaborative data science projects. Other sessions in the school focused on understanding the implementation and use of Machine Learning tools, meaning the participants were all fully trained data scientists by the time I spoke to them. I decided to keep the technical content low and instead discuss the following points.
Data scientists regularly have to bridge the gap between their own domain of expertise and that of the people they work with, such as project managers, engineers, software developers, and anyone else whose eyes glaze over at the first mention of a cost function. Being able to communicate highly technical concepts clearly and concisely is therefore a vital skill, especially when that project manager has heard that using something called an SVM (support vector machine, a machine learning algorithm) with their data is going to change their world.
Rather than training and testing Machine Learning algorithms, much of a data scientist’s time with industrial data is spent understanding those data. That can include finding the data across multiple database archives, mapping sensor signals to physical variables perhaps in some hierarchical structure, assessing the completeness of available data for modeling, and determining how to scale an analysis from a few assets to hundreds.
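As a rough illustration of what assessing completeness can look like, here is a minimal Python sketch. It assumes a sensor is expected to report once per fixed interval; the function name `completeness` and the sample readings are hypothetical and not taken from any Arundo tooling.

```python
from datetime import datetime, timedelta


def completeness(timestamps, start, end, interval):
    """Fraction of expected sampling slots in [start, end) holding at least one reading."""
    expected = (end - start) // interval  # number of slots the sensor should have filled
    if expected <= 0:
        raise ValueError("window must span at least one interval")
    filled = set()
    for ts in timestamps:
        if start <= ts < end:
            # Index of the slot this reading falls into
            filled.add((ts - start) // interval)
    return len(filled) / expected


# Hypothetical sensor that should report once a minute over a ten-minute window
start = datetime(2017, 1, 1, 0, 0)
end = start + timedelta(minutes=10)
readings = [start + timedelta(minutes=m) for m in (0, 1, 2, 5, 9)]

print(completeness(readings, start, end, timedelta(minutes=1)))
```

Running a check like this per signal, before any modeling starts, gives a quick view of which assets have enough history to be worth including.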
If, as a data scientist, you are lucky enough to have ended up with a somewhat complete and well-mapped source of industrial data, then understanding the best strategy for applying Machine Learning techniques and choosing the best model is the next skill required. Even a straightforward classification problem can fail spectacularly if you forget that your input data is a time series and decide to construct and validate your model by randomly splitting your historical data into training and testing samples. A random split lets the model train on observations recorded after some of the test observations, so the validation score overstates how well the model will do on genuinely new data. A vital part of best practice for model construction is having a clear picture of how that model is going to be applied to future data. How will you re-train if future data drifts from your historical data? Or will you use adaptive learning? If your model construction strategy is wrong, the model will never provide insights of value.
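The time-series pitfall can be sketched in a few lines of Python. This is only an illustration under simple assumptions: rows are hypothetical `(timestamp, value)` pairs, and the helper names `random_split`, `chronological_split`, and `leaks_future` are made up for this example.

```python
import random


def random_split(rows, train_fraction, seed=0):
    """Shuffle rows and split them, ignoring time order (the risky approach)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def chronological_split(rows, train_fraction):
    """Train on the earliest rows and test on the latest ones."""
    ordered = sorted(rows, key=lambda row: row[0])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


def leaks_future(train, test):
    """True if any training row is timestamped at or after the earliest test row."""
    return max(t for t, _ in train) >= min(t for t, _ in test)


# Hypothetical history: integer timestamps with a made-up sensor value
rows = [(t, t % 7) for t in range(100)]

rand_train, rand_test = random_split(rows, 0.8)
chron_train, chron_test = chronological_split(rows, 0.8)

print("random split leaks future data:", leaks_future(rand_train, rand_test))
print("chronological split leaks future data:", leaks_future(chron_train, chron_test))
```

Splitting by time mirrors how the model will actually be used, trained on the past and applied to what comes next. Libraries such as scikit-learn offer a walk-forward variant of the same idea in `TimeSeriesSplit`.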
Even when you have constructed the perfect model, one that is going to change the industrial world, it will not matter if you don’t have a clear strategy for how to apply that model to future data. Your model, and its industrial revolution, will likely die in a PowerPoint presentation, never to be mentioned again. You also need to convince the engineer who will be sitting in a control room, making decisions based on your model’s output, to trust that model. Unless you can show them otherwise, those engineers will likely see your model as an unintuitive black box. At Arundo, our DeepQ and LiveQ software solutions allow our data scientists to deploy models immediately on new data, and the social functions allow fast and direct communication between data scientists and engineers.
Finally, working as a data scientist with industrial data is a highly collaborative process. This means you need to understand good practice for project structure (for example, a template tool such as Cookiecutter), use clear version and environment control, write modular and well-documented code, utilize cloud computing when data gets very large, and keep track of complex data transformations either locally or in the cloud.
The school participants themselves contributed to some excellent discussions throughout the session around the above points. I hope they found my experiences and advice useful in their own careers. The slides and associated notebooks from my session will be available on the school’s webpage.