I wanted to take some time to self-reflect on my journey as a data scientist, and restart blogging as a declaration of intent to learn. This declaration of intent will hopefully generate future blog posts, so - stay tuned!

I’ve found that thinking about what to learn next to advance in the field of machine learning creates a lot of paralysis. I’m interested in learning more about deep learning, reinforcement learning, and Bayesian methods. At the same time, very few problems at work require trying these methods, so it’s been difficult to keep up with the latest.

In the time that I’ve had to think as a result of the Coronavirus shutdown, I’ve realized a few things about myself: I really enjoy building data products that people use, and I consider myself a full stack data scientist, or a generalist. There’s been a lot written about full stack data science, but the gist is that we work on all aspects of data collection, data processing, feature engineering, reporting, training, deployment, visualization, monitoring, testing… and are able to wrap all of that into production-ready pipelines. We also have the business acumen to approach whitespace problems, determine whether machine learning is an appropriate solution, and share results in a productive way.

That being said, many companies out there are going in the direction of hiring data science specialists (e.g., product analysts who do A/B testing, research scientists who only work on NLP, or machine learning engineers who optimize production code). There is no right answer to which is a better career path. I do believe that, generally speaking, the solution that delivers the most value in the machine learning world is often not the one that uses the most specific algorithmic technique, but rather the one that can take an inefficient decision-making process and build a system - through heuristics, machine learning, or otherwise - to address it.

Coming to the above conclusion has helped me to narrow the field of what I want to learn going forward - to be a better full stack data scientist, I’d like to focus my learning on the field of machine learning systems design!

The diagram below (from the paper Hidden Technical Debt in Machine Learning Systems, but found via this excellent post by Luigi of MLinProduction) explains it best - learning algorithms (“ML Code”) are only a small component of a production ML system. There is so much about deploying models that isn’t covered in most beginner data science learning resources, and the knowledge of how to build a sustainable ML system is often learned by experience.

Through paying attention to TWiMLAI, r/datascience and r/MachineLearning, I’ve collated several resources that I plan to review over the next few months:

ML-specific resources for systems design:

General systems design:

I’m excited to go down this learning path in my quest to become a better data scientist and machine learning engineer! In addition, I’m planning to kickstart my writing on a few other topics, specifically:

  1. Back to Basics: A series where I explain foundational machine learning or statistics concepts for myself, so I can better explain them to others
  2. Mentorship: I’ve had an opportunity to participate in several mentorship panels for women and high school students who are interested in data science careers, and would like to share my thoughts here

Do you know of any more machine learning systems design resources? Feel free to send me an e-mail at neo.kaiting@gmail.com.
