Preserving data privacy in Machine Learning pipelines with Federated Learning

July 31, 2024

The feature extraction capabilities of Machine Learning (ML) models have led to their wide adoption across many sectors: anomaly detection for machinery, user clustering and behavioral prediction, market trend prediction, and the analysis of text, sound, and image data. The performance of ML models relies heavily on their ability to recognize patterns in large amounts of data, which has pushed companies to monitor their processes and build large databases that can be used to train ML models, or even monetized through data marketplaces.

One of the main problems with traditional ML is that most pipelines are centrally managed: even when training is distributed across several devices to speed it up through parallelization, the whole process is governed by a single entity, and data collected from different sources is aggregated into a common dataset. However, such aggregation is not possible when the data is confidential, or when it is personal data protected by regulations such as the General Data Protection Regulation (GDPR) in the European Union. Even then, the data holders may still have good reason to train an ML model jointly, because their datasets overlap in either the feature space or the sample space.

In these cases, federated learning can come in handy as a way to preserve data privacy. Instead of aggregating data, federated learning aggregates ML models. Each client trains the model on its local dataset and then transmits the trained model to a central server, which aggregates the updates from all clients. Since the training data could still be inferred by an attacker with access to the trained model, the server adds random noise to the aggregated weights before sending them back to the clients, a technique known as differential privacy.
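The round described above (local training, server-side aggregation, added noise) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a production recipe: the model is a toy linear regression, the aggregation is a sample-size-weighted average, and the names `local_train`, `aggregate`, `federated_round`, and `noise_std` are hypothetical. Adding Gaussian noise as shown does not by itself give a formal differential privacy guarantee; real deployments typically also bound (clip) each update and calibrate the noise to a privacy budget.

```python
import random

def local_train(weights, data, lr=0.1, epochs=20):
    """Client step: gradient descent on y ~ w*x + b using only local data."""
    w, b = weights
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return [w, b]

def aggregate(client_weights, client_sizes, noise_std=0.0, rng=None):
    """Server step: sample-size-weighted average, then Gaussian noise."""
    rng = rng or random.Random()
    total = sum(client_sizes)
    dim = len(client_weights[0])
    averaged = [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]
    # Illustrative noise only; noise_std is not calibrated to a privacy budget.
    return [v + rng.gauss(0.0, noise_std) for v in averaged]

def federated_round(global_weights, client_datasets, noise_std=0.0, rng=None):
    """One round: every client trains locally; only model weights reach the server."""
    updates = [local_train(global_weights, d) for d in client_datasets]
    sizes = [len(d) for d in client_datasets]
    return aggregate(updates, sizes, noise_std, rng)

# Three hypothetical clients holding disjoint samples of y = 2x + 1.
clients = [
    [(i / 10, 2 * (i / 10) + 1) for i in range(0, 10)],
    [(i / 10, 2 * (i / 10) + 1) for i in range(10, 20)],
    [(i / 10, 2 * (i / 10) + 1) for i in range(0, 20)],
]

weights = [0.0, 0.0]
for _ in range(100):
    weights = federated_round(weights, clients)  # noise disabled for clarity
print(weights)
```

With the noise disabled, the shared model should approach the slope and intercept of the underlying line even though no client ever sends its raw samples. Setting `noise_std` above zero trades some accuracy for privacy.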

Federated learning also improves communication efficiency by replacing frequent data transmissions with more sporadic model transmissions, thereby reducing the bandwidth and energy demands of the system. When data is collected by low-power devices with limited bandwidth, such as sensors for anomaly detection in industrial machinery, this can greatly improve energy efficiency. Efficient communication also enables user profiling on a global scale, so that large enterprises can better understand their users’ needs while preserving their privacy.
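A back-of-the-envelope comparison makes the communication argument concrete. The sketch below is only an illustration: the function names and every number in it (sample count, feature count, model size, number of rounds, 4 bytes per value) are hypothetical assumptions, not measurements from a real deployment.

```python
def bytes_raw_upload(samples, features, bytes_per_value=4):
    """Traffic if a device ships every raw sample to a central server."""
    return samples * features * bytes_per_value

def bytes_federated(rounds, model_params, bytes_per_value=4):
    """Traffic if the device exchanges model weights instead.

    Each round counts one upload of the local model and one download
    of the aggregated model.
    """
    return 2 * rounds * model_params * bytes_per_value

# Hypothetical sensor: 1,000,000 readings with 8 features each.
raw = bytes_raw_upload(samples=1_000_000, features=8)
# Hypothetical model: 10,000 parameters, trained over 50 rounds.
fed = bytes_federated(rounds=50, model_params=10_000)

print(f"raw upload: {raw / 1e6:.0f} MB, federated: {fed / 1e6:.0f} MB")
```

The saving depends on the ratio between data volume and model size. For a very large model trained on little local data, exchanging weights can cost more than sending the data itself.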

Awareness of data privacy has increased among users in recent years, and governments enforce it through data protection laws. Federated learning offers the best of both worlds, allowing companies to leverage the power of data while complying with data protection regulations.

More info:
ATOS