How to Progress in Data Science

January 1, 2020

I’ve been a data scientist for nearly six years. As with most other things I’ve done in life, I kind of fell into it. I’d done some machine learning during my PhD, though this wasn’t the important bit. Post-PhD, having been wooed by its sexiness, I wound up in my first data science job. I didn’t really know what I was doing, but I assumed nobody else did either and cracked on.

Six years on and I’ve been moderately successful - I’m now senior enough that people actually listen to what I say and, for better or worse, act on my recommendations! Over this time I’ve worked with and met a lot of other data scientists. There are many different types - between six and ten, depending on who you ask. Each is valuable, but the context in which each succeeds varies.

I would probably describe myself as a machine learning engineer, though mainly because that’s the bit I find most fun! In reality I’ve done all sorts over the years - I’ve built web applications, written infrastructure as code for production servers, configured CI/CD pipelines… Whatever was required to solve the problem at hand. I once tried to migrate a WordPress site. In the end I gave up and we paid for someone who knew what they were doing!

So, how have I been successful? By “successful” I mean I now earn enough money to maybe afford a house some day, even though I eat avocado toast… Well, I firmly believe it’s because I’ve ~~focused on~~ ended up being cross-functional. There are many data scientists who know more statistics than me, and plenty more who can build higher performing machine learning models. However, in my experience most companies most of the time benefit from data scientists capable of (and interested in) more than just statistics and machine learning.

A more focused skill set is certainly valuable but, in my experience, the range of contexts in which you can succeed is narrower. Generally speaking, the smaller the company the greater the need for cross-functionality. However, larger organisations producing more mature data products require specialism, which is where increased focus will serve you well. That said, cross-functionality is beneficial wherever you work. Even if your day-to-day is pure data science, some degree of cross-functionality will give you greater context and improve your working relationship with other teams.

I love doing all these non-data science things - I can never commit to a side project because there are just so many things I want to try! - but most data scientists probably don’t. So what, specifically, are the key non-data science skills you need to progress, whichever area you decide to specialise in?

Reproducibility. Whilst the most important objective in data science is normally communication of results, you should always endeavour to make your analysis reproducible. This is especially important if you’re producing a prototype for a future production system rather than a research report. There are plenty of tools to help with this, such as Kedro or (shameless plug!) my own template, DataBake.
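One small habit that helps with reproducibility, whatever tooling you use, is fixing your random seeds. The snippet below is only a minimal sketch using the Python standard library: the function name `split_train_test` and its default values are made up for illustration and are not taken from Kedro or DataBake.

```python
import random


def split_train_test(items, test_fraction=0.2, seed=42):
    """Shuffle `items` with a fixed seed and split them into (train, test).

    The function name and defaults are illustrative assumptions.
    """
    rng = random.Random(seed)  # local generator, so global random state is untouched
    shuffled = list(items)     # copy, so the caller's data is not modified
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Running the split twice with the same seed gives the same partition.
train_a, test_a = split_train_test(range(10))
train_b, test_b = split_train_test(range(10))
```

Because the seed is an explicit argument, anyone rerunning the analysis gets the same train/test split, and changing it is a deliberate, visible decision.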

Learn to code. This doesn’t simply mean make Python do things. It means learn to code well, following software engineering best practices. “It works on my laptop” just doesn’t cut it I’m afraid! You normally don’t need to go as far as writing a full test suite. However, taking care to get the basics right, such as not hard-coding values, adding helpful comments, and using functions/classes to organise your code, goes a long way towards making your project reproducible and maintainable.
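To make those basics concrete, here is a rough sketch of what they might look like in practice. Everything in it (the `Config` class, the `average_value` function, the column names and the threshold) is hypothetical and only there to illustrate the idea.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Config:
    """Settings kept in one place rather than scattered through the code."""

    min_age: int = 18             # hypothetical threshold, for illustration only
    value_column: str = "spend"   # hypothetical column name


def average_value(rows, config):
    """Return the mean of `config.value_column` over rows with age >= `config.min_age`."""
    values = [row[config.value_column] for row in rows if row["age"] >= config.min_age]
    if not values:
        raise ValueError("no rows left after filtering")
    return mean(values)


# Toy data, invented for this example.
rows = [
    {"age": 17, "spend": 5.0},
    {"age": 30, "spend": 10.0},
    {"age": 45, "spend": 20.0},
]
result = average_value(rows, Config())
```

Changing the threshold is now a matter of passing `Config(min_age=40)` rather than hunting for a magic number buried in the analysis.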

Be pragmatic. This means understanding the business context of what you’re doing and appreciating a product doesn’t need to perform as well as you want it to perform to be useful to a customer. Most of the web doesn’t work very well and data science is no exception. I’m frequently amazed at how low the bar is!

Avoid machine learning. This might be a controversial one (at least amongst data scientists!) but hear me out… Machine learning is hard. It’s hard to build. It’s hard to deploy. It’s hard to maintain. If you can get 80% of the value (that’s customer value, not necessarily model performance) without machine learning, do that. Or, at the very least, do that first then iterate.

Keep half an eye on production. I’ve lost count of the number of times I’ve seen a very highly performing model noped by engineering, normally due to some variation of “that won’t scale”. Bunch of killjoys! Still, it’s worth keeping production in mind whilst building your prototypes, ideally involving an engineer from the outset. Remember, 80% accuracy in production is infinitely more valuable than 98% on your laptop!

Learn to talk to non-technical people. I can’t stress enough how important this is. Pro tip: don’t show them equations! In fact, try and avoid technical detail. What they want to know is why the problem needs solving in the first place (typically this translates into how much money the solution will save/make) and whether you have solved it sufficiently well.

I hope you found this article useful - I follow this advice each working day and it’s served me well so far! For more (occasional!) pearls of wisdom/streams of nonsense you can follow me on Twitter, or to see what I’ve been up to recently check out my GitHub.