Presentation: Techniques and Best Practices for Prepping Data for ML Projects

Track: Machine Learning for Developers

Location: Soho Complex, 7th fl.

Duration: 4:10pm - 5:00pm

Day of week: Monday

Abstract

Gathering and preparing data is one of the biggest challenges facing anyone who wants to do advanced analytics and machine learning. If you are a developer or software engineer, this talk will show you how to quickly and efficiently ingest a variety of data types and prepare them for analysis in the Python and Pandas data science ecosystem. This is a big topic, and we will only be able to cover the essentials. We will discuss data preparation and “massaging” to lay the groundwork for robust machine learning models.

We will focus on the tools and techniques that are directly applicable to the problem and use a real-world dataset to walk you through the entire data preparation process from end to end. The process can be summarized in the following main steps:

  • ML problem formulation
  • Data collection
  • Extract, transform and load (ETL) data from a variety of sources into the Pandas ecosystem
  • Data preprocessing
  • Feature engineering
  • Exploratory data analysis (EDA), an iterative process that integrates with data preprocessing and feature engineering
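
As a rough illustration of the ETL, preprocessing and EDA steps listed above, here is a minimal pandas sketch. The inline CSV and JSON strings, the column names (`order_id`, `amount`, `city`) and the helper `load_and_clean` are all hypothetical stand-ins for real sources; they are not the dataset used in the talk. Filling missing values with the median is just one possible choice.

```python
import io

import pandas as pd

# Hypothetical raw sources standing in for real files or APIs.
CSV_SOURCE = """order_id,amount,city
1,10.5,Toronto
2,,Montreal
2,,Montreal
3,7.25,toronto
"""
JSON_SOURCE = '[{"order_id": 4, "amount": 3.0, "city": "Ottawa"}]'


def load_and_clean(csv_text: str, json_text: str) -> pd.DataFrame:
    """Load two sources into one DataFrame and apply basic cleaning."""
    # Extract: read each source into its own DataFrame.
    csv_df = pd.read_csv(io.StringIO(csv_text))
    json_df = pd.read_json(io.StringIO(json_text))

    # Load: stack both sources into a single table.
    df = pd.concat([csv_df, json_df], ignore_index=True)

    # Preprocess: drop exact duplicate rows, normalize text casing,
    # and fill missing amounts with the median (an assumed strategy).
    df = df.drop_duplicates().reset_index(drop=True)
    df["city"] = df["city"].str.strip().str.title()
    df["amount"] = df["amount"].fillna(df["amount"].median())
    return df


orders = load_and_clean(CSV_SOURCE, JSON_SOURCE)

# EDA: quick summaries that often send you back to preprocessing.
missing_per_column = orders.isna().sum()
city_counts = orders["city"].value_counts()
print(orders.describe())
print(city_counts)
```

In practice the EDA summaries tend to reveal new problems (inconsistent labels, outliers, unexpected gaps), which is why this step loops back into preprocessing and feature engineering rather than coming strictly last.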

In this session, you will learn how to:

  • Collect raw data and create a data set.
  • Recognize the effects of data quality on Machine Learning algorithms.
  • Set a realistic time budget for data preparation (it accounts for about 80% of a data scientist's work).
  • Understand and explain every process of data transformation and feature encoding.
  • Transform timestamp, numeric, categorical and text data.
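
To make the last item concrete, here is a small sketch of one common transformation for each of the four data types, using pandas and NumPy. The toy DataFrame and its column names (`signup`, `income`, `plan`, `comment`) are hypothetical, and the specific techniques chosen (calendar features, log plus z-score scaling, one-hot encoding, simple text cleaning) are illustrative options rather than the ones the talk necessarily covers.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "signup": ["2021-03-01 16:10", "2021-03-02 09:30", "2021-03-06 23:45"],
    "income": [30000.0, 45000.0, 120000.0],
    "plan": ["free", "pro", "free"],
    "comment": ["Great talk", "too  fast!", "Great great slides"],
})

# Timestamp: parse strings, then derive calendar features.
df["signup"] = pd.to_datetime(df["signup"])
df["signup_hour"] = df["signup"].dt.hour
df["signup_is_weekend"] = df["signup"].dt.dayofweek >= 5  # Sat=5, Sun=6

# Numeric: log-transform to reduce skew, then standardize (z-score).
log_income = np.log1p(df["income"])
df["income_scaled"] = (log_income - log_income.mean()) / log_income.std()

# Categorical: one-hot encode into indicator columns.
df = pd.get_dummies(df, columns=["plan"], prefix="plan")

# Text: lowercase, strip punctuation, collapse whitespace, count words.
clean = (
    df["comment"]
    .str.lower()
    .str.replace(r"[^a-z\s]", "", regex=True)
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
)
df["comment_word_count"] = clean.str.split().str.len()

print(df.drop(columns=["comment"]).head())
```

Each transformation turns a raw column into something a model can consume directly, and each is simple enough to explain to a stakeholder, which is the point of the "understand and explain" item above.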

Speaker: Susan Li

Sr. Data Scientist at Kognitiv Corporation

I am Susan Li, the Sr. Data Scientist at Kognitiv, where I specialize in machine learning and NLP. I’m passionate about helping organizations realize the potential of big data and advanced analytics, and helping individuals build their data literacy skills. I frequently write and speak about predictive analytics, machine learning and NLP for technical and general audiences. In my free time, I can be found training for the next half marathon.

Tracks

Monday, 24 June

Tuesday, 25 June

Wednesday, 26 June