Machine Learning

Modelling and machine learning

The platform has a separate micro-service component responsible for machine learning model training and prediction serving. This micro-service is implemented predominantly in Python, leveraging best-in-class automatic-differentiation frameworks to implement the specifics of the modelling pipelines. All models are accelerated on GPU hardware to speed up training and prediction processes that need to consider hundreds of millions of datapoints.

For model training purposes that need to consider large amounts of historic data, this ML micro-service accesses data derived from integrations on the platform (e.g. Open Banking data) through a dedicated data warehouse that contains redacted versions of the original data. This warehouse is populated by an Extract-Transform-Load (ETL) service. For existing models, these data-transformation and training processes are triggered automatically by our deployment process, which produces an audit trail stored as logs and model artifacts in S3. Every newly trained model is benchmarked against the previous version on a consistent test set, and it is published as the new version only if it improves on its predecessor.
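The promotion gate described above can be sketched as follows. This is a hedged, illustrative toy in which models are plain callables and the metric is accuracy; the function and variable names are assumptions, not the service's actual API.

```python
# Hypothetical sketch of the post-training promotion gate: a candidate
# model is published only if it beats the current version on a fixed,
# shared test set. Names and the accuracy metric are illustrative.

def evaluate(model, test_set):
    """Return a scalar score (higher is better) on the shared test set."""
    correct = sum(1 for features, label in test_set if model(features) == label)
    return correct / len(test_set)

def should_publish(candidate, current, test_set):
    """Publish the candidate only if it strictly improves on the current model."""
    return evaluate(candidate, test_set) > evaluate(current, test_set)

# Toy usage: models are callables mapping a feature to a label.
test_set = [(0, 0), (1, 1), (2, 0), (3, 1)]
current = lambda x: 0          # always predicts 0 -> scores 0.5
candidate = lambda x: x % 2    # matches every label -> scores 1.0
print(should_publish(candidate, current, test_set))  # True
```

Using a strict inequality means a tie keeps the incumbent model, which avoids churning versions when retraining produces no measurable gain.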

For prediction purposes, the micro-service exposes a RESTful API that the core platform can use to trigger predictions for batches of data. This API is built around a generic pipeline abstraction as its core resource, which can be configured for different model types and different input and output schemas.
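The shape of that pipeline abstraction might look like the following minimal sketch. The class and field names are assumptions for illustration; the real service exposes an analogous resource over REST rather than as an in-process object.

```python
# Illustrative sketch of a generic prediction pipeline: a configurable
# input schema -> model -> output schema chain applied to a batch of
# records. All names here are hypothetical.

from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class PredictionPipeline:
    """Batch prediction pipeline with pluggable pre/post-processing."""
    model: Callable[[list[Any]], list[Any]]
    preprocess: Callable[[dict], Any] = lambda record: record["value"]
    postprocess: Callable[[Any], dict] = lambda score: {"score": score}

    def predict_batch(self, records: list[dict]) -> list[dict]:
        inputs = [self.preprocess(r) for r in records]   # apply input schema
        outputs = self.model(inputs)                     # run the model
        return [self.postprocess(o) for o in outputs]    # apply output schema

# Toy usage: a "model" that doubles each input value.
pipeline = PredictionPipeline(model=lambda xs: [2 * x for x in xs])
print(pipeline.predict_batch([{"value": 1}, {"value": 3}]))
# [{'score': 2}, {'score': 6}]
```

Keeping the model, input mapping, and output mapping as separately configurable pieces is what lets one resource serve many model types and schemas.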

All models are deployed alongside a rules-based engine that can be used to override model behaviour in special edge cases and in mission-critical applications.
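One common way to combine a model with such a rules engine is to check the rules first and fall back to the model only when no rule matches. The sketch below assumes that structure; the rule shapes and names are illustrative, not the actual engine's API.

```python
# Hedged sketch of a rules-based override layer: rules are evaluated in
# order and the first matching rule's verdict replaces the model's
# prediction. Rule and record structures are assumptions.

def predict_with_rules(model, rules, record):
    """Apply the first matching rule; otherwise fall back to the model."""
    for condition, verdict in rules:
        if condition(record):
            return verdict
    return model(record)

# Toy usage: always decline sanctioned counterparties, regardless of score.
rules = [(lambda r: r.get("sanctioned", False), "decline")]
model = lambda r: "approve" if r["score"] > 0.5 else "decline"

print(predict_with_rules(model, rules, {"score": 0.9, "sanctioned": True}))   # decline
print(predict_with_rules(model, rules, {"score": 0.9, "sanctioned": False}))  # approve
```

Ordering the rules before the model gives deterministic, auditable behaviour in the edge cases where a statistical prediction must not be trusted on its own.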

The micro-service is deployed on AWS Fargate, so it scales horizontally as required under load and benefits from the cross-platform observability and monitoring described in Monitoring.

Depending on the use case, we use both simple and complex model architectures – from linear models to proprietary neural network models that operate on top of multiple data sources and solve multiple prediction tasks simultaneously.
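The multi-task idea can be illustrated with a toy: one shared representation computed from several data sources feeds multiple task-specific heads. Everything below (feature construction, head weights, task names) is invented for illustration; the production models are proprietary neural networks, not these hand-written functions.

```python
# Toy sketch of a multi-task model: multiple data sources are combined
# into one shared feature vector, from which several task-specific
# "heads" each produce a prediction. All numbers and names are invented.

def shared_representation(sources: dict) -> list[float]:
    """Combine multiple data sources into one feature vector."""
    return [sources["balance"] / 1000.0, float(sources["num_transactions"])]

def affordability_head(features: list[float]) -> float:
    """Hypothetical affordability score from the shared features."""
    return 0.7 * features[0] + 0.1 * features[1]

def churn_head(features: list[float]) -> float:
    """Hypothetical churn-risk score from the same shared features."""
    return max(0.0, 1.0 - 0.2 * features[1])

def predict_all_tasks(sources: dict) -> dict:
    """Solve several prediction tasks from one shared representation."""
    features = shared_representation(sources)
    return {
        "affordability": affordability_head(features),
        "churn_risk": churn_head(features),
    }

print(predict_all_tasks({"balance": 2000, "num_transactions": 3}))
```

Sharing the representation is the point: each input source is processed once, and every task benefits from features learned for the others.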