
Creating Business Value from Unstructured Data

Consider an oil platform operating in the North Sea. This platform generates data from complex control systems, personnel and resource planning systems, and various computers and devices—in addition to tens of thousands of individual sensors attached to equipment, meters, and gauges.

Meanwhile, important data related to the platform’s operations, such as production losses, failure notifications, and maintenance records, are generated at a consistent rate and stored across systems that vary by asset. The long-term trends of near-ubiquitous sensors, cheaper data storage, and cheaper compute resources drive an ever-growing blizzard of operational data from industrial assets such as this platform.

However, the vast majority of data generated each day from, about, or related to this platform will go unused and unnoticed.

Accessing, analyzing, and driving daily business decisions from this data is the fundamental challenge and promise underlying the Industrial Internet of Things and the big data revolution. It is the future of machine learning and smart industrial operations.

The format of this data varies in nature. Some of it is highly structured. Sensor data, for instance, is a time-stamped measurement of a physical value such as temperature, pressure, vibration, or flow rate. Sensor records may be generated at very frequent intervals, and may need to be understood in the context of other sensors, but once found and contextualized, the information in any individual sensor record is relatively easy to parse.

Often, however, critical information related to an industrial operation is much less structured than sensor data, yet it must be joined with sensor data to yield valuable insight. For instance, on our oil platform, we might want to investigate whether different failure modes experienced by compressors could be predicted in advance by patterns in sensor readings. To achieve this, we would have to sift through process diagrams and sensor hierarchies to find all of the sensors related to compressors. We would then need to review thousands of historical entries in maintenance logs, written by humans in a natural and unstructured format (often in various languages!), to find which common failures occurred on compressors, and when.
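To make the join described above concrete, here is a minimal sketch of tagging sensor readings with the failure that followed them. It assumes the sensor-to-equipment mapping and the failure events have already been extracted. The record types, the tag names ("PT-101", "compressor-A"), and the 24-hour lookback window are all hypothetical choices for illustration, not part of any particular system.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class SensorReading:
    sensor_id: str
    timestamp: datetime
    value: float


@dataclass
class FailureEvent:
    equipment_id: str
    timestamp: datetime
    failure_mode: str


def label_readings(readings, failures, sensor_to_equipment, lookback):
    """Pair each reading with the failure mode that occurred on the same
    equipment within `lookback` after the reading, or None if there was none."""
    labeled = []
    for reading in readings:
        equipment = sensor_to_equipment.get(reading.sensor_id)
        label: Optional[str] = None
        for failure in failures:
            lead_time = failure.timestamp - reading.timestamp
            if failure.equipment_id == equipment and timedelta(0) <= lead_time <= lookback:
                label = failure.failure_mode
                break
        labeled.append((reading, label))
    return labeled


# Hypothetical example data: two compressor readings and one pump reading.
sensor_to_equipment = {"PT-101": "compressor-A", "TT-205": "pump-B"}
readings = [
    SensorReading("PT-101", datetime(2020, 1, 1, 6, 0), 41.2),
    SensorReading("PT-101", datetime(2020, 1, 3, 6, 0), 47.9),
    SensorReading("TT-205", datetime(2020, 1, 3, 6, 0), 88.0),
]
failures = [FailureEvent("compressor-A", datetime(2020, 1, 3, 18, 0), "seal leak")]

labeled = label_readings(readings, failures, sensor_to_equipment, timedelta(hours=24))
for reading, label in labeled:
    print(reading.sensor_id, reading.timestamp, label)
```

Only the second compressor reading falls within 24 hours before the failure, so it is the only one that receives a label. In practice the hard part is producing the `failures` list and the sensor mapping in the first place, which is the problem the rest of this article addresses.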

This is not easy. For instance, one human operator might record a maintenance log entry that states: “the compressor leaked a light brown fluid that smells like eggs.” Another operator might enter: “maple syrup was leaking out of the compressor and it smelled like breakfast.” Human subject matter experts understand that these two entries refer to similar types of compressor failure. However, it is extraordinarily time-intensive, expensive, and error-prone to manually review and categorize years of unstructured maintenance data for any specific piece of equipment—much less all of the equipment across an entire fleet of platforms.

The process of getting input data into a state where it is properly understandable by a machine learning model can take significant time and effort, with many data scientists claiming that they spend 80% of their time finding, cleansing, and joining the data required to build advanced analytics models.

Data integration is a fundamental challenge to data science adoption in heavy industry, and not just in terms of data science and data engineering time and effort. With a manual approach, it could take two months to access and blend the data needed to analyze a single type of equipment across five sites. At that rate, analyzing ten types of equipment across fifty sites means one hundred times the work: 200 months, or more than sixteen years!
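The estimate above can be checked with a few lines of arithmetic. This sketch assumes effort scales linearly with both the number of equipment types and the number of sites, which is what "at that rate" implies. The function name and its parameters are illustrative only.

```python
def estimated_months(equipment_types, sites, base_months=2.0, base_types=1, base_sites=5):
    """Scale the baseline effort (2 months for 1 equipment type across 5 sites)
    linearly in both equipment types and sites."""
    workload_ratio = (equipment_types / base_types) * (sites / base_sites)
    return base_months * workload_ratio


months = estimated_months(equipment_types=10, sites=50)
years = months / 12
print(f"{months:.0f} months, about {years:.1f} years")
# prints "200 months, about 16.7 years"
```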

One solution to this challenge is to train machine learning models (through human-augmented review of topic clusters) to automatically classify maintenance entries by failure mode, and only query human experts in rare exceptions. This approach rapidly accelerates the data integration process. It makes the ultimate goal—a model that can ingest datasets from sensors, failure notifications, and maintenance logs on an ongoing basis, and return processed datasets and critical information related to equipment performance and potential downtime—much more attainable.
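The routing idea in this approach (classify automatically, escalate only the uncertain entries) can be sketched as follows. This is not the trained model described above: as a stand-in, it scores a new entry by simple word overlap (cosine similarity over word counts) against a handful of expert-labeled seed entries. The seed entries, the stop-word list, the label names, and the 0.3 threshold are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list and expert-labeled seed entries.
STOPWORDS = {"a", "an", "the", "and", "of", "it", "that", "from",
             "out", "like", "was", "is", "to", "on", "in"}
LABELED = [
    ("the compressor leaked a light brown fluid that smells like eggs", "fluid leak"),
    ("excessive vibration and rattling noise from the compressor housing", "vibration"),
]
NEEDS_REVIEW = "needs expert review"


def tokenize(text):
    """Lower-case the text and count its words, ignoring stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def classify(entry, labeled=LABELED, threshold=0.3):
    """Return (label, score); entries below the threshold go to a human expert."""
    vec = tokenize(entry)
    best_label, best_score = NEEDS_REVIEW, 0.0
    for text, label in labeled:
        score = cosine(vec, tokenize(text))
        if score > best_score:
            best_label, best_score = label, score
    if best_score < threshold:
        return NEEDS_REVIEW, best_score
    return best_label, best_score


for entry in [
    "fluid leaking out of the compressor and it smells odd",
    "maple syrup was leaking out of the compressor and it smelled like breakfast",
]:
    print(classify(entry)[0], "<-", entry)
```

The first entry shares enough vocabulary with the labeled leak example to be classified automatically. The "maple syrup" entry does not, so it is routed to an expert. Once the expert's label is added to the seed set, similar wording can be handled without review. A production system would replace word overlap with a model that captures meaning rather than exact vocabulary, but the routing logic stays the same.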

Automated labelling of unstructured event data allows machines to adapt to human processes. This ultimately enables humans to make better, more informed business decisions. This is the future of heavy industry.

To learn more, watch Alexandra's video on "Organizing data for industrial data science"

Learn more at

Alexandra Gunderson is a Data Scientist at Arundo. An engineer with a master's degree in computational methods, Alexandra previously worked at Aker Solutions, an engineering services firm in the offshore oil & gas industry. Contact her at

Ellie Dobson, PhD, is VP Data Science at Arundo Analytics, a software company enabling advanced analytics in heavy industry. A physicist by training, Ellie previously worked at Pivotal Software, The MathWorks, and CERN. Contact her at