How to get started with Data Analysis and Data Science in Python and R — A pragmatic approach
People often get frustrated when they use software that is poorly suited to a given data analysis task, yet they keep using it because it is familiar. A typical example is using MS Excel for data that consists mainly of text. Using Python or R would make the job much easier and allow people to work more efficiently. However, just as often people shy away from learning Python or R because they believe that coding is difficult. A common misconception is that you need to be good at maths in order to be a good programmer (check this out for more misconceptions). If you are one of those people, let me assure you that this is not true. In this article, I want to give you a set of tools to get started with Data Analysis and Data Science in Python or R. I taught a beginner’s Data Science summer course at University College London in 2019 and will share all my tips and resources here (for free).
As with anything new, you need to start with the basics. In this case, that means learning basic syntax. I would suggest spending at least a weekend getting a feel for the language you want to learn: do some simple arithmetic, familiarise yourself with simple data structures (lists, sets, dictionaries, etc.), and write some functions, if-else statements, and for-loops. There are plenty of resources out there to get you started. I suggest you check out sites like Coursera, Udemy, edX, and Udacity and find a course that fits your learning style (I personally like the syllabus of IBM’s course on Coursera, “Python for Data Science and AI”). HackerRank and LeetCode are also great websites for practising your coding skills and do not require you to download anything onto your computer: you can practise in the browser (although I prefer doing it offline; more on that in the next section). Check out HackerRank’s Python challenges and DataCamp’s Introduction to R.
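To give a rough idea of what that first weekend might cover, here is a minimal Python sketch that touches each of the basics mentioned above. All names and values (such as scores and pass_mark) are made up for illustration.

```python
# Simple arithmetic
total = 7 + 3 * 2      # multiplication is evaluated before addition
average = total / 2    # "/" always returns a float in Python 3

# Simple data structures
scores = [88, 92, 75]                     # list: ordered, allows duplicates
unique_tags = {"python", "r", "python"}   # set: duplicates collapse into one
ages = {"Ada": 36, "Alan": 41}            # dictionary: maps keys to values


def describe(score, pass_mark=80):
    """Return 'pass' or 'fail' (pass_mark is a hypothetical threshold)."""
    if score >= pass_mark:
        return "pass"
    else:
        return "fail"


# A for-loop that applies the function to every item in the list
results = []
for score in scores:
    results.append(describe(score))

print(total, average, len(unique_tags), ages["Ada"], results)
```

If you can read this snippet and predict what it prints before running it, you probably know enough syntax to move on to a real dataset.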
For both Python and R, I suggest you download Anaconda. Anaconda is a free distribution of Python and R for scientific computing that aims to simplify package management (if you don’t know what packages are, don’t worry; more on that later). Anaconda comes with a tool called Jupyter Notebook, an open-source web application that allows you to create and share documents that contain live code, equations, visualisations and narrative text. It’s my favourite tool for data analysis. If you want to get a feel for it, check out Google Colab, a free Jupyter notebook environment that requires no setup and runs entirely in the cloud.
As I have told my students countless times over the past few years: “You are not the first person to encounter that problem. Someone has already asked that question before, so google it!” If you get an error message you don’t understand, or you don’t know how to round a number to two decimal places in Python, Stack Overflow is your friend! Whatever the problem is, someone has almost certainly already answered your question on Stack Overflow. So don’t be afraid of googling something when you get stuck. Programmers google all the time. (And as I told my Chinese summer school students: Google being banned in your country is NO excuse not to search the web for answers.)
What I noticed while teaching the summer school course, however, is that many people don’t know how to formulate their problem and therefore struggle to find an answer on the internet. Knowing how to put your problem into words is a skill in itself and requires practice. Take the rounding example from above: googling “how to round number” will not give you the answer you are looking for (try it). The top result when googling “how to round number python” will talk you through Python’s built-in round() function, which will require you to read through more text than necessary. Only when you google “how to round number python two decimal points” will you get a list of suggested questions on Stack Overflow. This may sound obvious, but it can prove difficult when you are dealing with questions and problems you have not encountered before.
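For reference, the answer that this search eventually leads to is short. The sketch below shows two common options, the built-in round() function and string formatting; the example value is arbitrary.

```python
value = 3.14159

# Option 1: round() returns a number (a float)
rounded = round(value, 2)

# Option 2: an f-string with the ".2f" format returns text,
# which is handy when you only want to display the number
as_text = f"{value:.2f}"

# Formatting always shows two decimal places, even for "short" numbers
padded = f"{2.5:.2f}"

print(rounded)   # 3.14
print(as_text)   # 3.14
print(padded)    # 2.50
```

Which one you want depends on whether you need a number for further calculation or a string for display.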
OK, you know the basics and you have familiarised yourself with Jupyter Notebooks, RStudio, Google Colab, or whatever other tool you want to use. What now? Now you look for a famous dataset that is easy to understand but challenging enough to do some interesting things with. Go through a tutorial step by step and try to understand every line of code and what it is doing. During my summer school course, I used the Titanic dataset. It’s a great dataset to which you can apply many data exploration and visualisation techniques, as well as different classification algorithms. You can then apply the same principles to your own data. Here are two step-by-step tutorials on how to analyse the Titanic dataset and how to train a classifier that predicts whether someone survives or dies on the Titanic.
Note: The R example covers more ground, so I encourage you to have a look at it even if you want to get started with Python.
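To give a flavour of the kind of data exploration such a tutorial starts with, here is a minimal sketch that computes the survival rate per group using only Python’s standard library. The rows below are made up for illustration and are NOT real Titanic records. The column names (Sex, Pclass, Survived) are modelled on the commonly used version of the dataset and are an assumption here, and survival_rate_by is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict

# Made-up rows for illustration only; NOT real Titanic records.
TOY_CSV = """Sex,Pclass,Survived
female,1,1
female,3,1
female,3,0
male,1,0
male,2,0
male,3,1
"""


def survival_rate_by(rows, column):
    """Return {group value: share of rows in that group with Survived == 1}."""
    survived = defaultdict(int)
    total = defaultdict(int)
    for row in rows:
        key = row[column]
        total[key] += 1
        survived[key] += int(row["Survived"])
    return {key: survived[key] / total[key] for key in total}


# csv.DictReader turns each line into a dictionary keyed by the header row
rows = list(csv.DictReader(io.StringIO(TOY_CSV)))
rates = survival_rate_by(rows, "Sex")

for group, rate in sorted(rates.items()):
    print(f"{group}: {rate:.2f}")
```

In practice you would load the full dataset with pandas, where the same question is a one-liner along the lines of df.groupby("Sex")["Survived"].mean(). Writing it out by hand once can help you understand what that line is doing.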
While going through such a tutorial, you will encounter packages/libraries you might want to look into (like pandas or scikit-learn in Python) and machine learning algorithms like logistic regression, decision trees, support vector machines or neural networks. How deep you dig into each of these topics is up to you, and covering them is beyond the scope of this article. I just want to provide you with the initial tools to get started.
If you want to apply your new skills at work or are interested in a particular type of data, your next step should be to work through a famous dataset that is similar to the sort of data you want to analyse (images, sound, text and numerical data all require different approaches). In my PhD I mainly analyse text data, a process called natural language processing (NLP). Analysing text data like movie reviews is, of course, quite different to analysing the type of data given in the Titanic dataset. I will write an article on how to get started with NLP soon and link it here. Here are some examples of famous datasets (you will find many resources on the internet on how to analyse them and train machine learning models on them):
Note: You can also use all of the classification datasets to train unsupervised learning algorithms such as k-means, but that is beyond the scope of this article.
I hope this article was helpful and gave you some pointers on how to get started with data analysis, data science, and machine learning using Python or R. If you have any questions or want me to write an article like “Intro to Machine Learning”, please let me know.