
A Complete Machine Learning Project From Scratch: Setting Up

In this first of a series of posts, I will describe how to build a machine learning-based fake news detector from scratch. That means I will literally construct a system that learns how to discern reality from lies (reasonably well), using nothing but raw data. The project will take us all the way from initial setup to a deployed solution. Full source code here.

I’m doing this because, looking at the state of tutorials today, a machine learning project for beginners usually means copy-pasting some sample code off the TensorFlow website and running it on an overused benchmark dataset.

In these posts, I will describe a viable sequence for carrying a machine learning product through a realistic lifecycle, trying to be as true as possible to the details.

I will go into the nitty-gritty of the technology decisions, down to how I would organize the code repository structure for fast engineering iteration. As I progress through the posts, I will incrementally add code to the repository until at the end I have a fully functional and deployable system.

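To make the repository-structure point concrete, here is a minimal sketch of how a project skeleton could be bootstrapped. The directory names are illustrative assumptions of mine, not the actual layout of the linked repository:

```python
# Illustrative sketch only: bootstrap a plausible ML project skeleton.
# The directory names are assumptions, not the actual repo layout.
from pathlib import Path

DIRS = [
    "fake_news/data",       # raw and processed datasets
    "fake_news/model",      # model definitions and training code
    "fake_news/notebooks",  # exploratory data analysis
    "fake_news/tests",      # unit tests for the pipeline
    "fake_news/deploy",     # serving and CI configuration
]

for d in DIRS:
    path = Path(d)
    path.mkdir(parents=True, exist_ok=True)
    (path / ".gitkeep").touch()  # keep empty directories under version control
```

Keeping data, modeling code, notebooks, tests, and deployment configuration in separate top-level directories is one common way to keep engineering iteration fast as the project grows.
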
These posts will cover all of the following:

  1. Ideation, organizing your codebase, and setting up tooling (this post!)
  2. Dataset acquisition and exploratory data analysis
  3. Building and testing the pipeline with a v1 model
  4. Performing error analysis and iterating toward a v2 model
  5. Deploying the model and connecting a continuous integration solution

With that let’s get started!

Building Machine Learning Systems is Hard

There’s no easy way to say this: building a fully-fledged ML system is complex. Starting from the very beginning, the process of building a functional and useful system involves at least the following steps:

  1. Ideation and definition of your problem statement
  2. Acquiring (or labelling) a dataset
  3. Exploring your data to understand its characteristics
  4. Building a training pipeline for an initial version of your model (a minimal sketch follows below)
  5. Testing and performing error analysis on your model’s failure modes
  6. Iterating from this error analysis to build improved models
  7. Repeating steps 4-6 until you get the model performance you need
  8. Building the infrastructure to deploy your model with the runtime characteristics your users want
  9. Monitoring your model consistently and using that to repeat any of steps 2-8

Sounds like a lot? It is.
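
To give a feel for what step 4 can look like in practice, here is a minimal, hedged sketch of a v1 training pipeline for a text classifier. The dataset path, column names, and model choice are assumptions made for illustration; the real dataset and models are the subject of later posts in the series.

```python
# Minimal sketch of a v1 training pipeline (step 4), under assumed inputs:
# a CSV at data/train.csv with a "text" column and a binary "label" column.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

df = pd.read_csv("data/train.csv")
X_train, X_val, y_train, y_val = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=42
)

# Bag-of-words features plus a linear classifier: a simple baseline that gives
# us an end-to-end pipeline to run error analysis against (steps 5-6).
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=50_000, ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])

pipeline.fit(X_train, y_train)
print(classification_report(y_val, pipeline.predict(X_val)))
```

The point of a baseline like this is not to be accurate; it is to get something running end to end quickly, so that steps 5 and 6 have concrete failure modes to analyze.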

And here’s what the current machine learning tooling/infrastructure landscape looks like (image credit: FSDL):

When you couple the inherent complexity of building a machine learning product with the myriad tooling decision points, it’s no surprise that a widely cited 87% of data science projects never make it into production!
