Contributing to Open-Source Software (OSS) can be a rewarding endeavor, especially for new data scientists. It helps improve skills, provides invaluable experience when collaborating on projects, and gives you a chance to showcase your code. However, many data scientists do not consider themselves to be formally trained software developers. This can make contributing to OSS a scary proposition.
One source of fear is that every step in the software development process, from design to continuous integration, seems to have a set of best practices that are often overlooked in data science training. This leaves data scientists feeling under-equipped to participate in an arena tailored for software developers. It has been documented that even some skilled software developers feel anxious when deciding to contribute to OSS. Multiply that anxiety for a data scientist.
Almost all the popular data science libraries we use every day are successful OSS projects sometimes maintained and often improved by the community. Scikit-learn, PyTorch, and TensorFlow come to mind. See, data scientists can write high-quality code! Ultimately, data scientists are writing code with the intent to ship a model to production or deploy a robust data pipeline. Since we’re writing software, we should be held to the same standards as other software projects.
In this post, I’d like to share a few things that I’ve learned over the past couple of years of developing software in data science. My intent is to share the pieces of information that I wish someone shared with me when I was beginning my data science journey. The hope is that inexperienced data scientists who are hesitant to get started will use these tips to feel empowered to start writing better data science code and contribute to OSS sooner rather than later.
…and pay attention! This item had the single greatest impact on my personal development. All my current software development habits are a direct result of incorporating experienced software developers in my life in some form or another. It’s worth noting that you don’t have to be best friends with these people. The experienced developer could take the form of a tech lead on your team at work, a classmate, or someone from a local meetup. Additionally, there are some really great developers who regularly tweet, post in forums, write technical blogs, and create podcasts (Talk Python is my favorite). Try your best to identify people who are not data scientists, as they can provide you with opinions that would be more difficult to find in the data science field.
The key to this tip is to listen to and engage with these people. First, ask yourself, “Why are they implementing such a method?” If you don’t understand, look it up. If you still don’t understand (or if you do), send them an email or reply with a comment asking them to expand on their point. Chances are, these people have good reasons behind their decisions. Seeing or hearing these reasons in the context of your question will give you a better understanding of the topic from a software development perspective. I’m not recommending that you continually hound experienced developers. It’s important to respect their time and space.
An unfortunate misconception that prevents people from contributing to OSS is that if they make their code public it will be laughed at or ridiculed. The majority of the time, this is just not true. As in any other type of community, a few bad apples can ruin the party for everyone, but they are the exception. I have contributed to small OSS projects, popular OSS projects, and even maintain a few of my own. In each of these scenarios, every interaction has been at the very least respectful, and in most cases, people are appreciative. After more than two years of contributions to OSS, if someone were disrespectful toward me, they would clearly be an outlier at this point.
Some people state “unfamiliarity with git” or even just “unfamiliarity with GitHub” as their reason for not contributing to OSS. This is why the title for number three contains “etc”. GitHub, Bitbucket, and GitLab have a substantial amount of functionality that supports a git workflow, but is not considered part of “git” per se. So even for an experienced git user who has never used GitHub, there is still a slight learning curve on the road to proficiency.
Instead of coding all of your projects locally, reserve some extra time to push to one of the platforms mentioned above. Make the repo private if you like; you can still securely share it with a friend. Practice making pull requests or visually inspect the git commit history after rebasing a feature branch. Becoming more familiar with the tabs, buttons, tools, and the git workflow in a private repository will give you confidence when working on your public contributions.
You will be hard-pressed to find a successful data science project that does not use some form of version control for its software development, and git just happens to be the most popular. For this reason, try to practice using a git workflow for your everyday projects. Then, when collaborating with people on an OSS project, you won’t be hung up on git workflow mishaps and you can spend your time focusing on the code.
4. Use software best practices
Rachael Tatman, a data scientist from Kaggle, recently posted a video and a kernel outlining how to write more professional-looking data science code, so I recommend following her advice, and I’ll keep the sections short where we overlap. Each of these practices will make your life easier when interacting with collaborators, again allowing you to focus on adding features rather than going back to clean up bad habits.
Other data scientists may know that i and j index the elements of the array A, but mathematical notation doesn’t translate to informative variable names in source code. I like to use variable names that would have meaning when re-reading the source code or for someone else reading for the first time. But I’ll leave it to Will Koehrsen to suggest naming conventions for data scientists.
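To make the contrast concrete, here is the same computation written twice, once with notation-style names and once with descriptive ones. This is only an illustrative sketch; the function names and the sales data are made up for the example.

```python
# Notation-style names: fine on a whiteboard, opaque in source code.
def f(A):
    s = 0
    for i in range(len(A)):
        for j in range(len(A[i])):
            s += A[i][j]
    return s


# Descriptive names: the same logic, readable without the math context.
def total_sales(sales_by_store):
    """Sum every weekly sales figure across all stores."""
    total = 0
    for weekly_sales in sales_by_store:
        for sale in weekly_sales:
            total += sale
    return total


# Hypothetical data: one inner list of weekly sales per store.
sales_by_store = [[120, 95, 130], [80, 110, 105]]
print(total_sales(sales_by_store))
```

Both functions return the same number, but only the second one tells a reviewer what that number means.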
Whether they are aware of it or not, data scientists often implement a coding style called method chaining. When there is a DataFrame object in memory, we often know the exact next steps to transform a column into the format we want. One downside to method chaining is that the chains can be difficult to debug, especially when there are no comments. Think of others when writing comments and remember, a short comment can go a long way.
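As a rough sketch of what a commented chain can look like, assuming pandas is installed, the example below cleans a small made-up table in a single chain. The column names and values are hypothetical; the point is that one short comment per step tells a reader what each link is for.

```python
import pandas as pd

# Hypothetical data for illustration only.
raw = pd.DataFrame(
    {
        "city": [" Boston", "boston ", "Denver", None],
        "temp_f": [68.0, 72.0, 59.0, 80.0],
    }
)

clean = (
    raw
    # Drop rows where no city was recorded.
    .dropna(subset=["city"])
    # Normalize whitespace and capitalization so spellings match.
    .assign(city=lambda df: df["city"].str.strip().str.title())
    # Convert Fahrenheit to Celsius.
    .assign(temp_c=lambda df: (df["temp_f"] - 32) * 5 / 9)
    # Reduce to one average temperature per city.
    .groupby("city", as_index=False)["temp_c"]
    .mean()
)
print(clean)
```

When a chain like this misbehaves, the comments also mark natural places to cut it short and inspect the intermediate result.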
You will thank yourself later, or whoever is reviewing your pull request will thank you. Sometimes unit tests just don’t make sense for small one-off data science projects. But unit tests become more important as you think about scaling up your code or collaborating with others.
Code is easier to test when it is modular, which translates to writing functions with the intent that they will perform one or two operations. Below is an example of a function that normalizes a pandas DataFrame in Python. The unit test for this function asserts that the mean and standard deviation of each column are 0.0 and 1.0, respectively. If this test ever fails, the developer knows that something has gone wrong in the `normalize_dataframe` function and will know where to start debugging.
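A minimal sketch of such a function and its test might look like the following. It assumes pandas and NumPy are installed and that every column is numeric with a non-zero spread; the test name and the sample data are illustrative choices, not taken from a real project.

```python
import numpy as np
import pandas as pd


def normalize_dataframe(df):
    """Scale each numeric column to zero mean and unit standard deviation."""
    return (df - df.mean()) / df.std()


def test_normalize_dataframe():
    # Small hypothetical input; any numeric columns with non-zero spread work.
    df = pd.DataFrame(
        {
            "height": [150.0, 160.0, 170.0, 180.0],
            "weight": [50.0, 65.0, 70.0, 90.0],
        }
    )
    normalized = normalize_dataframe(df)
    # Compare with a tolerance, since floating-point results are rarely exact.
    assert np.allclose(normalized.mean(), 0.0)
    assert np.allclose(normalized.std(), 1.0)


# Run the test directly; a test runner would discover it by name instead.
test_normalize_dataframe()
```

With a test runner such as pytest, functions named `test_*` are collected automatically, so the direct call on the last line would not be needed.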
Congrats, you wrote unit tests and pushed your code to GitHub, but a few days later, someone reports an issue stating they can’t get your code to run on their machine. Continuous integration (CI) would help identify these problems before they become issues on GitHub. Travis is an example of a CI service that integrates directly with GitHub free of charge for public repos. CI services allow you to build and test your code in containerized environments. I recommend setting up one of the CI services for any project on which you intend to collaborate with others.
Using a code editor with a debugger, sometimes referred to as an integrated development environment (IDE), can seem like overkill for data science projects. But once you learn how to use your editor’s debugger, it becomes a powerful tool for software development. A debugger allows users to execute code up until a breakpoint then explore the variable namespace, which makes it much easier to “debug” why a program may be crashing. You should give yourself at least 30 days of trying out a debugger for development before giving up hope and switching back to notebooks. Take the time to read the tutorial and ask a more experienced software developer (perhaps from Tip #1) if they can walk you through debugging your next bug.
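If you want to try the idea before committing to an IDE, Python ships with a built-in `breakpoint()` function (Python 3.7 and later) that pauses the program and opens the standard `pdb` debugger in the terminal. The sketch below uses a made-up helper, `column_means`, and leaves the breakpoint commented out so the script runs straight through by default.

```python
def column_means(rows):
    """Return the mean of each column in a list of equal-length rows."""
    n_rows = len(rows)
    totals = [0.0] * len(rows[0])
    for row in rows:
        for col, value in enumerate(row):
            totals[col] += value
    # Uncomment the next line to pause here and inspect the local variables.
    # breakpoint()
    return [total / n_rows for total in totals]


print(column_means([[1.0, 2.0], [3.0, 4.0]]))
```

Once paused, typing `p totals` prints a variable, `n` steps to the next line, and `c` continues execution. An IDE debugger offers the same operations through its interface, with the variable namespace visible at a glance.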
Jupyter notebooks are a popular choice for many users when first starting a data science project. Notebooks are great for exploratory data analysis and visualizations, but they have their shortcomings, reviewed here by Joel Grus. Namely, the cells can be run out of order, they are difficult for beginners to understand, and they encourage bad habits. The two popular IDEs PyCharm and VS Code have recognized the popularity of notebooks and recently integrated support for them directly in the IDE. Hopefully, data scientists will come to IDEs for the notebook support but stay for the debugger.
Bonus: Be receptive to constructive criticism
Keep in mind that OSS is maintained and improved by both formal and informal code reviews. People will look at your code, critique it, and might ask you to change a few lines. If you’ve made it this far, isn’t this what you were looking for? Someone has taken the time to review your code and spent enough energy to try and improve it.
As your software development skills improve, you will encounter other users asking you to modify your code when you might want to leave it as-is. Defend your decision with the same respect that you were expecting to receive. It’s really no different than being a sensible human.