We talked about event labeling. How do you get sensors onto this equivalent equipment hierarchy? The way sensors are streamed now, if I'm a control room engineer, I maybe want to look at 5 or 10 sensors per piece of equipment that I deem most critical. The way they do this is they've gone back to these complicated diagrams and chosen the sensors they want, and that's probably good enough for their purposes. But as data scientists, we want all the data. We're very greedy. How do you do that? How can you take this long list of sensors and assign it to a hierarchy?
This is an example of a list of sensors as it would be provided to us by a shipping company. You can see that there's a tag ID. This is the number-letter combination that you see on the process and instrumentation diagram. Then you have a tag description. This is a little more human-readable: you start to see DG4 BREAKER, and somebody who works on the ship probably knows what that means. What you want to do is put a label on each sensor, like level one, level two, and break down where on the equipment structure the sensor sits. How do you make sense of the tag structure? Well, you need to use this tag description.
A ship engineer would probably see DG4 BREAKER and know that's the breaker on diesel generator number four, DG1 BREAKER is the breaker on diesel generator number one, and so forth. Here we're seeing something new: the BT1 BREAKER. This is on bow thruster number one. This is how it would look after you've labeled these sensors and assigned them here. You start to see, wait a minute, everything on DG1 has BC30009B. We don't care about the letters at the end, those are different, but you see that there is a sort of structure. There is some overlap between the tags of the sensors on each piece of equipment. That is because there is a recurrent structure in the way they name these tags. They don't choose 50,000 tag names to put on a ship or a rig ad hoc. They usually do it according to a standard. I've only ever seen it done this way.
The standard for oil and gas could be an ISO standard, while in shipping it could be an SFI standard. They look different, but there is a similar structure across these two industries: they always have a functional code. The functional code describes what the sensor is measuring. TE is probably a temperature element. Then they have a system number. System number 23 for oil and gas tends to refer to the compression train. The sequence number is usually unique per piece of equipment. That doesn't mean it identifies a piece of equipment one-to-one, because there can be multiple sequence numbers per equipment. Then you have the suffix. The suffix can mean either a redundant sensor or a redundant piece of equipment. I think there are two defining fields for matching tags to equipment: the system number and the sequence number. Now you know that. That's great, but how do you scale that?
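To make the tag anatomy concrete, here is a minimal sketch that pulls the four fields out of a tag string. The layout `<system>-<functional code>-<sequence><suffix>` (for example `23-TE-1001A`) is an assumption for illustration only; real ISO or SFI style tags are laid out differently from site to site, and the names `TagParts`, `parse_tag` and `equipment_key` are hypothetical.

```python
import re
from typing import NamedTuple, Optional, Tuple


class TagParts(NamedTuple):
    system: str           # e.g. "23" (compression train in the talk's example)
    functional_code: str  # e.g. "TE" (probably a temperature element)
    sequence: str         # usually unique per piece of equipment
    suffix: str           # redundant sensor or redundant equipment, may be empty


# ASSUMED layout, not taken from any standard: "23-TE-1001A".
TAG_PATTERN = re.compile(
    r"^(?P<system>\d{2})-(?P<functional_code>[A-Z]{2,4})-"
    r"(?P<sequence>\d+)(?P<suffix>[A-Z]?)$"
)


def parse_tag(tag: str) -> Optional[TagParts]:
    """Split a tag into its fields, or return None if it does not fit the layout."""
    match = TAG_PATTERN.match(tag.strip().upper())
    if match is None:
        return None
    return TagParts(**match.groupdict())


def equipment_key(parts: TagParts) -> Tuple[str, str]:
    """The two fields the talk singles out for matching tags to equipment."""
    return (parts.system, parts.sequence)
```

Two sensors such as `23-TE-1001A` and `23-PT-1001B` would then share the key `("23", "1001")`, which is the property the rest of the approach relies on.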
How do you take that knowledge and scale it without having to go in and identify the functional codes, sequence numbers, suffixes, and so forth for each piece of equipment? Well, you could have somebody label some of the data and predict the rest. I actually did have somebody label all the data for me so that I could evaluate my model. It was actually my old boss. He went through a list of 25,000 sensors and assigned each one to a hierarchy so that I could see how well my model was performing.
A lot of the time when you evaluate a model, you take a random train/test split: you train on one part and predict on the rest so that you can see how well it's performing. If we train a model on 10% of the data set and predict on the other 90%, we get about 50% accuracy, so it's not that good. And that accuracy doesn't really get much better even after you've labeled 50%. How can you improve those results without asking somebody to label everything one by one?
Well, you can take the set of all the functional codes, sequence numbers, and suffixes. I'm not sure how many of you are familiar with data frames, but you could, for example, take a tag ID like BC30009A, split it into three columns, and then do something like a group by. You take the unique set and disregard the parts at the end that are probably not indicative of a piece of equipment.
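The split-and-group step can be sketched in a few lines. This uses plain Python rather than a data frame so it runs anywhere, but the same idea maps directly onto a split into columns followed by a group by. It assumes tags look like `BC30009A` (letters, digits, optional trailing letters); the function names are hypothetical, and every tag in the sample list other than `BC30009A` and `BC30009B` is invented for illustration.

```python
import re
from collections import defaultdict

# ASSUMED tag shape: leading letters, digits, optional trailing letters.
TAG_RE = re.compile(r"^([A-Z]+)(\d+)([A-Z]*)$")


def split_tag(tag_id):
    """Split 'BC30009A' into the three 'columns' ('BC', '30009', 'A')."""
    match = TAG_RE.match(tag_id)
    if match is None:
        raise ValueError(f"unexpected tag format: {tag_id!r}")
    return match.groups()


def group_by_structure(tag_ids):
    """Group tags on (letters, digits), disregarding the trailing suffix."""
    groups = defaultdict(list)
    for tag_id in tag_ids:
        prefix, number, _suffix = split_tag(tag_id)
        groups[(prefix, number)].append(tag_id)
    return dict(groups)


# Sample data: only the BC30009 tags come from the talk, the rest is made up.
tags = ["BC30009A", "BC30009B", "BC30010A", "TE12001"]
groups = group_by_structure(tags)

# One representative per unique structure is all a person needs to label.
representatives = [members[0] for members in groups.values()]
```

Here four tags collapse to three representatives; on a real tag list the reduction depends on how many sensors share each structure.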
If you do this for this maritime example, we went from 23,000 sensors that somebody would have to label to 3,500. How could this look as a tool? Well, you would split the tags, take this unique set, so these 3,500, and provide a suggestion to the user. You could make an application, a tool that's really user-friendly and doesn't require them to go into Excel. You can maybe make this suggestion using the tag description, with some sort of fuzzy matching to the hierarchy, and then the user could verify it.
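The talk doesn't say which fuzzy matching method was used, so the following is only one possible sketch, using the standard library's `difflib`. It compares each word of the tag description against a short alias for each hierarchy node and suggests the best match above a cutoff. The `ALIASES` table, the level-one names ("Power generation", "Propulsion") and the `suggest_node` function are all hypothetical.

```python
import difflib

# HYPOTHETICAL hierarchy: alias -> (level one, level two).
ALIASES = {
    "DG1": ("Power generation", "Diesel generator 1"),
    "DG4": ("Power generation", "Diesel generator 4"),
    "BT1": ("Propulsion", "Bow thruster 1"),
}


def suggest_node(description, aliases=ALIASES, cutoff=0.8):
    """Suggest a hierarchy node for a tag description, or None if nothing is close.

    Each whitespace-separated token of the description is compared with each
    alias; the highest similarity ratio at or above `cutoff` wins.
    """
    best = None  # (score, node)
    for token in description.upper().split():
        for alias, node in aliases.items():
            score = difflib.SequenceMatcher(None, token, alias).ratio()
            if score >= cutoff and (best is None or score > best[0]):
                best = (score, node)
    return best[1] if best is not None else None
```

A description like `DG4 BREAKER` would be suggested for diesel generator 4, while a description with no recognizable alias returns `None` and is left for the user to place by hand. The cutoff of 0.8 is an arbitrary starting point, not a tuned value.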
The user would verify the placement, so they would go through these 3,500 sensors and verify the placement that the algorithm suggested. Then you could train a model on this. After you've labeled these 3,500 sensors, you could train a model that learns which equipment a sensor belongs to from the structure of its tag. When we did this, we ended up labeling about 12% of the data set, which is where the line is in the middle there, and we got 72% accuracy.
It didn't really increase the amount of work, but it uses a much more representative subset, so the model is smarter. Instead of taking all the sensors on, for example, one piece of equipment, you're taking one sensor, and the algorithm learns from that.
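As a rough illustration of the label-representatives-then-predict loop, here is a deliberately simple stand-in for the model: a lookup that propagates each verified label to every sensor sharing the same tag structure. This is not the model from the talk, which isn't described in detail. The tag shape, the function names, and all sample tags other than the `BC30009` to DG1 pairing are assumptions.

```python
import re

# ASSUMED tag shape: leading letters, digits, optional trailing letters.
TAG_RE = re.compile(r"^([A-Z]+)(\d+)([A-Z]*)$")


def structure_key(tag_id):
    """Return (letters, digits) for a tag, ignoring the suffix; None if malformed."""
    match = TAG_RE.match(tag_id)
    return match.group(1, 2) if match else None


def fit_lookup(verified):
    """Build structure -> equipment from user-verified representative tags."""
    model = {}
    for tag_id, equipment in verified.items():
        key = structure_key(tag_id)
        if key is not None:
            model[key] = equipment
    return model


def predict(model, tag_id):
    """Predict the equipment for a tag, or None if its structure was never labeled."""
    return model.get(structure_key(tag_id))


def accuracy(model, truth):
    """Fraction of fully labeled tags for which the prediction matches."""
    correct = sum(predict(model, tag_id) == eq for tag_id, eq in truth.items())
    return correct / len(truth)


# Illustrative data only: one verified sensor per structure.
verified = {"BC30009A": "DG1", "BC30012A": "DG4"}
model = fit_lookup(verified)

# Held-out sensors; BC30015A has a structure nobody labeled, so it is missed.
truth = {"BC30009B": "DG1", "BC30012B": "DG4", "BC30015A": "BT1"}
score = accuracy(model, truth)
```

With this toy data two of the three held-out sensors are placed correctly. A real model would also need to generalize to structures it has never seen, which a plain lookup cannot do.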