How to prepare your dataset for machine learning?

In the latest video, machine learning expert Tim Eschert addresses two myths about what it takes to prepare a dataset for a machine learning project:

  1. “Machine learning requires a very large dataset”
  2. “Preparing my dataset for machine learning takes man-years”

Machine learning myth buster series

Myth 1: “Machine Learning requires a very large dataset”

In industrial machine learning, having the right data matters far more than having all the data. The question is: what is the right data?

First of all, we have to measure the target, i.e. the quantity we want to predict. A target can be, for example, quality, cost, or tolerance. But of course, we should measure more than just this target. In machine learning, we call these additional values “tags” or “factors”. In manufacturing, it is important that we differentiate between “measured” and “controllable” factors:

  • Measured factors: we cannot influence them; they are typically data from sensors.
  • Controllable factors: we can influence them; they are typically parameters, set points, etc.

Let’s look at a simple example: If we imagine a pipe, we may have throughput readings from several flow sensors. But we only have a valve in one spot, where we can actually influence the throughput.

Here we can already see one of the first lessons of any machine learning project: if a measured factor, such as a sensor reading, has a very high influence on the target, a first outcome could be finding a way to control that factor, e.g. by adding a valve to our pipe.

In conclusion, the right data is data that has a significant influence on the target. This can also include unexpected parameters, or parameters we suspect have an influence without being 100% sure.
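To make the idea of "influence on the target" more concrete, here is a minimal sketch that ranks factors by the strength of their linear relationship with the target and marks whether each one is measured or controllable. It is an illustration, not the method used in the video: the `Factor` class, the `rank_factors` helper, the 0.7 threshold, and all numbers are assumptions made for this example, and a correlation only captures linear relationships and does not prove cause and effect.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class Factor:
    name: str
    controllable: bool   # True for set points/valves, False for sensor-only readings
    values: list[float]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no linear information
    return cov / sqrt(var_x * var_y)


def rank_factors(factors, target, threshold=0.7):
    """Return (name, |correlation|, note) tuples, strongest influence first."""
    ranked = []
    for f in factors:
        strength = abs(pearson(f.values, target))
        if strength < threshold:
            note = "weak linear influence"
        elif f.controllable:
            note = "strong and controllable"
        else:
            note = "strong but only measured: candidate for a new control"
        ranked.append((f.name, round(strength, 2), note))
    return sorted(ranked, key=lambda row: row[1], reverse=True)


# Hypothetical pipe example with made-up numbers, for illustration only.
quality = [1.0, 2.1, 2.9, 4.2, 5.0]
factors = [
    Factor("valve_position", True, [10, 20, 30, 40, 50]),
    Factor("flow_sensor_2", False, [5.1, 5.9, 7.2, 8.0, 9.1]),
    Factor("ambient_temp", False, [21, 19, 22, 20, 21]),
]
for row in rank_factors(factors, quality):
    print(row)
```

In this toy data, the flow sensor is flagged as a strong but uncontrolled factor, which corresponds to the "add a valve" outcome described above.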

Any machine learning problem relies on a solid data foundation. But in many use cases, a couple of months of data can already get us a long way. So how can we make sure we get the most value from the data we have?

  • Have a clean database
    • There should be no (or close to no) periods of missing data. In a first dataset, sensor readings should not include outliers or other strange artifacts. Data should not be unnecessarily truncated if more detailed measurements are being collected.
  • Understand that different machine learning methods are good in different use-cases. On the large map of machine learning methods, …
    • Statistical (or Bayesian) machine learning excels at extracting impressive value from little data
    • Neural Nets are very good at image recognition or voice recognition
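The "clean database" point above can be checked with a few lines of code before any modelling starts. The sketch below is one possible way to do it, under assumptions of our own: missing readings are represented as `None`, outliers are flagged with the common 1.5 × IQR rule (one heuristic among many), and the function names and sample values are hypothetical.

```python
from statistics import quantiles


def missing_share(readings):
    """Fraction of readings that are missing (represented here as None)."""
    if not readings:
        return 1.0
    return sum(r is None for r in readings) / len(readings)


def iqr_outliers(readings, k=1.5):
    """Indices of readings outside [Q1 - k*IQR, Q3 + k*IQR].

    Missing values are skipped. Needs at least two non-missing readings.
    """
    present = [r for r in readings if r is not None]
    q1, _, q3 = quantiles(present, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [
        i for i, r in enumerate(readings)
        if r is not None and not low <= r <= high
    ]


# Hypothetical temperature readings with one gap and one spike.
temps = [70.1, 70.4, None, 69.8, 70.2, 250.0, 70.0, 70.3]
print("share of missing readings:", missing_share(temps))
print("indices of suspected outliers:", iqr_outliers(temps))
```

A flagged reading is only a suspect. Whether a spike is a sensor artifact or a real process event is a question for someone who knows the process.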

People tend to think that you need a huge amount of data because of the recent resurgence in popularity of neural networks. Tech giants like Google and Amazon use this type of ML to solve problems by throwing lots of data at them. However, this is just one approach to ML.
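As a small illustration of how a statistical (Bayesian) approach can work with little data, the sketch below estimates a defect rate from only 20 inspected parts. It uses the standard Beta-Binomial conjugate update. The prior, the counts, and the function names are assumptions chosen for this example and do not come from the video.

```python
from math import sqrt


def beta_posterior(prior_a, prior_b, defects, good):
    """Conjugate update: Beta(a, b) prior plus binomial counts
    gives a Beta(a + defects, b + good) posterior."""
    return prior_a + defects, prior_b + good


def beta_mean_std(a, b):
    """Mean and standard deviation of a Beta(a, b) distribution."""
    mean = a / (a + b)
    std = sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    return mean, std


# Hypothetical numbers: a weak prior belief of roughly 10% defects, Beta(1, 9),
# followed by only 20 inspected parts, 3 of them defective.
a, b = beta_posterior(1, 9, defects=3, good=17)
mean, std = beta_mean_std(a, b)
print(f"estimated defect rate: {mean:.3f} +/- {std:.3f}")
```

The useful part is the uncertainty: with so few parts the standard deviation stays large, and it shrinks as more data arrives, so the model states how much it actually knows.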

Myth 2: “Preparing my dataset for machine learning takes man-years”

Before starting here, keep in mind that we may already have useful data without knowing it: every fairly automated production process inherently generates data, and sometimes it is just a question of exporting a dataset.

However, it is true that every dataset needs some amount of planning. So, how can we overcome this?

The simplest yet most important step is to establish a data culture and to raise awareness of what is possible with data in all branches of the company. By meticulously documenting and archiving what is happening in production processes, rich datasets can easily come together. Once this is established, most of the actual logging runs automatically.

In terms of project management, it makes more sense to work in short “sprint” cycles than to try to get one perfect dataset on the first try. We have to make sure we are collecting data and get an overview of what we already have, but also make sure that we are not blindly collecting arbitrary data.

Planning a machine learning project and planning the “data process” are as important as running the machine learning project itself. For more information on running your machine learning project, attend our “Leading the Factory of the Future Masterclass”.
