Exploring the two leading programming languages for statistical analysis and data science
R vs Python is a popular topic for those interested in data science, but the two languages are very different by design and have very different ecosystems, from their IDEs to their package libraries.
There are some high-level similarities between R and Python: both are free and open source programming languages with communities of passionate developers, even though Python’s community is much larger than R’s.
Both languages are multi-platform (Windows, macOS and Linux) and support everything from local desktop development to large-scale cloud-based development.
Both Python and R are also interpreted languages (i.e. you do not have to compile your programs before running them; the interpreter executes them at runtime). This allows developers to rapidly run scripts that repeat a particular task, such as training an ML model or generating a set of plots for a data set.
The downside of interpreted languages is that they are often a lot slower than compiled languages. R is generally considered to suffer from slower execution more than Python.
The Python Programming Language
Python is a general purpose programming language: it serves a wide range of fields, from web frameworks and servers to Machine Learning frameworks and game engines.
There is a range of text editors, IDEs and cloud-based workspaces that support Python code.
Python has a plethora of use cases beyond just data science, and is therefore found not just in one domain, but throughout the whole IT industry.
Being a multi-paradigm language, Python can take the role of an object oriented programming language, a functional language, and more – and it is this flexibility that makes Python so “general purpose”.
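To make the multi-paradigm point concrete, here is a minimal sketch that solves one small problem in object oriented, functional and procedural styles. The `Measurement` class and both helper functions are hypothetical names invented for this illustration.

```python
from dataclasses import dataclass
from functools import reduce


# Object oriented style: data and behaviour live together in a class.
@dataclass
class Measurement:
    label: str
    value: float

    def doubled(self) -> "Measurement":
        return Measurement(self.label, self.value * 2)


# Functional style: pure functions composed with filter and reduce.
def total_of_positive(values):
    positives = filter(lambda v: v > 0, values)
    return reduce(lambda acc, v: acc + v, positives, 0.0)


# Procedural style: a plain loop that updates a running total.
def total_of_positive_loop(values):
    total = 0.0
    for v in values:
        if v > 0:
            total += v
    return total


readings = [Measurement("a", 1.5), Measurement("b", -2.0), Measurement("c", 3.0)]
values = [m.doubled().value for m in readings]
print(total_of_positive(values), total_of_positive_loop(values))
```

Both functions return the same total; which style reads best is largely a matter of taste and of the surrounding codebase.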
The R Programming Language
R, on the other hand, is not a general purpose language. It is a statistical programming language, designed for statistical computing and graphics.
Common tasks undertaken with R include data wrangling (cleaning and unifying inconsistent, incomplete or messy data sets), statistical analysis, and plotting with data visualization packages. R’s set of tools allows the data scientist to develop sophisticated data models.
With R being a more niche language than Python, it is more widely used in academia for research purposes (not just by programmers, but in any field that requires efficient data analysis: medicine, genetics, astronomy, etc.).
R is also object oriented, but leans heavily on the functional paradigm, and this becomes obvious when you start working with it.
R and Python Job Market
An interesting takeaway from Stack Overflow’s 2021 Developer Survey is that Python and R report the same median annual salary, at $59,454.
The job market is healthy for data scientists and computer science in general, with the rapid growth of some key industries driving higher demand for Python and R programmers:
Affordable and scalable clouds for data analysis
More capable cloud ecosystems enable data scientists to work with huge data sets, backed by ample CPU and GPU resources that are not possible with standalone machines or small local networks.
R’s RStudio IDE (Integrated Development Environment) in particular offers great enterprise-level support for large-scale data analysis, while Python’s popular accompanying Jupyter Notebooks tool is now integrated into cloud-based environments with scalable resources at hand.
Artificial intelligence and Machine Learning
Machine Learning, and the encompassing field of artificial intelligence, entail statistical analysis on almost every project, because Neural Networks are essentially black boxes of numbers that data scientists need to make sense of in order to optimise them.
Packages such as Matplotlib for Python and R’s ggplot2 are popular libraries to visualize data. The most used Machine Learning frameworks are TensorFlow, Keras and PyTorch, all of which have their flagship implementations in Python.
R generally handles data visualization, plotting and graph generation better than Python does. Python’s visualisation offerings do work, albeit with more complex APIs and less tight integration with the IDE.
R and Python Popularity
Although this difference in popularity may be striking at first glance, it becomes apparent that R has little competition as a programming language for analysing data and quickly generating graphs and reports, all within the ecosystem R provides, which we will explore further down.
This edge R has over other popular programming languages becomes more apparent in the PYPL index (which ranks languages based on Google searches for tutorials on each language), which now ranks Python at number 1 and R at number 7.
Interestingly, both have maintained their respective positions from a year ago, suggesting their popularity will not be changing in the near future.
Python and R are the two main languages used for data science. To read about other upcoming languages, check out our accompanying Best Data Science Programming Languages article.
With a general introduction to Python and R complete, the next section explains how each programming language is used, to give a real intuition of the differences between the two.
Working with R
The majority of R programming is done within RStudio.
RStudio can be downloaded for free for desktop or cloud-based deployments, but also comes as paid enterprise-level versions (with some high price tags) that come with more support, commercial licences, tighter integration with the surrounding RStudio ecosystem of tools, and more.
RStudio is used in big data, finance, life sciences, and more. The Comprehensive R Archive Network (CRAN) further expands R’s capabilities, hosting a large library of packages maintained by R’s data science community.
Working with Python
The Jupyter Notebook interface gives data analysis tasks some of the GUI elements that R users enjoy with RStudio, combining data visualisations, code execution and output, and comments in one cohesive and interactive document.
Complex programs with many steps – such as Machine Learning tasks that entail data import and formatting, model initialisation, training, and testing – are very well suited for notebook environments.
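As a rough illustration of how those steps might map onto notebook cells, the sketch below walks through import, initialisation, training and testing using only Python’s standard library. The synthetic data, the dictionary-based `model` and the `fit` and `mean_absolute_error` helpers are all assumptions made for this example; a real project would more likely load a file and use a dedicated Machine Learning framework.

```python
import random

# Cell 1: data import and formatting (synthetic data stands in for a real file).
random.seed(0)
xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 + random.uniform(-0.5, 0.5) for x in xs]
data = list(zip(xs, ys))
random.shuffle(data)
train, test = data[:15], data[15:]

# Cell 2: model initialisation (a straight line y = slope * x + intercept).
model = {"slope": 0.0, "intercept": 0.0}


# Cell 3: training with ordinary least squares.
def fit(model, rows):
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    var = sum((x - mean_x) ** 2 for x, _ in rows)
    model["slope"] = cov / var
    model["intercept"] = mean_y - model["slope"] * mean_x
    return model


fit(model, train)


# Cell 4: testing on held-out rows using mean absolute error.
def mean_absolute_error(model, rows):
    errors = [abs(model["slope"] * x + model["intercept"] - y) for x, y in rows]
    return sum(errors) / len(errors)


test_error = mean_absolute_error(model, test)
print(model, test_error)
```

In a notebook, each commented cell would be run on its own, so a change to the training step can be re-run without repeating the data import.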
Python vs R: Working with Data
Let’s now compare common tasks that both R and Python are designed for, and explore some of their key differences.
R is designed to import data from Excel, CSV and text files. As R is a progression from GUI-based programs, files built in Minitab or saved in SPSS format can also be turned into R data frames.
Python supports a wider range of data formats, including SQL and other databases, CSV and JSON formatted files, and easy access to web APIs, making it ideal for logging user data. Python is ideal for production deployments, whereas R is more suited to offline data analysis and statistics.
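As a small sketch of that range, the example below reads the same kind of record from CSV, JSON and a SQL database using only Python’s standard library. The inline sample data, the `traffic` table and its column names are made up for illustration; in practice the CSV would come from a file and the JSON from a web API response.

```python
import csv
import io
import json
import sqlite3

# CSV: parse rows from text (io.StringIO stands in for an open file).
csv_text = "name,visits\nalice,3\nbob,5\n"
csv_rows = [
    {"name": row["name"], "visits": int(row["visits"])}
    for row in csv.DictReader(io.StringIO(csv_text))
]

# JSON: decode a payload such as one returned by a web API.
json_text = '[{"name": "carol", "visits": 2}]'
json_rows = json.loads(json_text)

# SQL: query an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE traffic (name TEXT, visits INTEGER)")
conn.execute("INSERT INTO traffic VALUES (?, ?)", ("dave", 7))
sql_rows = [
    {"name": name, "visits": visits}
    for name, visits in conn.execute("SELECT name, visits FROM traffic")
]
conn.close()

# All three sources end up as the same plain Python structure.
all_rows = csv_rows + json_rows + sql_rows
total_visits = sum(row["visits"] for row in all_rows)
print(total_visits)  # 17
```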
R is optimised for numerical modelling and analysis, and offers a number of packages to sort and display data that are tightly integrated into the RStudio development environment.
Alongside standard plotting visuals, probability distributions and statistical tests can be applied to data.
Python data is typically explored with Pandas, the flagship data analysis library for Python. Pandas focuses on data manipulation: filtering, sorting, formatting and other wrangling techniques can be applied alongside Python’s standard library and other complementary packages.
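A minimal sketch of that kind of wrangling with Pandas might look like the following, assuming Pandas is installed. The city and temperature values are invented sample data, and the chain of steps is just one way to express the idea.

```python
import pandas as pd

# A small, deliberately messy data set (one missing temperature).
df = pd.DataFrame(
    {
        "city": ["Oslo", "Lima", "Cairo", "Pune"],
        "temp_c": [4.0, None, 31.5, 28.0],
    }
)

cleaned = (
    df.dropna(subset=["temp_c"])                 # drop rows with missing values
    .loc[lambda d: d["temp_c"] > 10]             # filter: keep warm cities only
    .sort_values("temp_c", ascending=False)      # sort: warmest first
    .assign(temp_f=lambda d: d["temp_c"] * 9 / 5 + 32)  # format: add Fahrenheit
    .reset_index(drop=True)
)
print(cleaned)
```

Chaining the steps keeps the whole clean-up readable from top to bottom, which is similar in spirit to piping data through a sequence of functions in R.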
The collection of packages known as Tidyverse is the go-to solution for data modelling with R.
The Tidyverse packages give you a standardised library of tools (all conforming to the same philosophy and data structure design patterns) that can take on the entire analysis pipeline, from data importing and wrangling to manipulation and visualisation.
Python also offers strong libraries for modelling. The closest Python alternative to Tidyverse is SciPy, a collection of fundamental tools for scientific computing that just happen to be extremely useful for handling data too.
RStudio’s professional (paid) versions offer scalability features such as load balancing, running parallel processes and resource monitoring. Google Colaboratory is Python’s strongest alternative for a scalable cloud-hosted environment, allowing access to GPU resources that are essential for Machine Learning tasks.
Outsource Data Science with Iglu
With over 10 years of experience in the industry, Iglu has a track record of attracting talented data science specialists from all over the world. Our Enterprise-grade employees range from senior talents with decades of experience to junior employees for more affordable solutions.
See our comprehensive list of services for more information; we look forward to working with you.
Python vs R: Who Wins?
Although Python has a much larger share of the market, a much larger community and many more use cases, R has chosen to do one thing, and one thing very well: data science, statistics, and data visualisations.
It is common for data sets to be cleaned, formatted, and processed in Python, but this process is a lot more streamlined with R’s ecosystem. However, for hosting services or live deployments, or for web development tasks beyond static markdown, Python is the go-to programming language.
This raises the interesting proposition of R and Python working hand-in-hand, which is indeed a supported workflow in RStudio, with enhanced Python support being introduced in version 1.4.
A data scientist with programming skills in both R and Python can leverage the strengths of both languages: use R for preparing data or analysing results by utilising the .Rmd (R Markdown) format (or export the findings into a web project via Shiny), and use Python to do everything else.
Regardless, R does not have to lose in order for Python to enjoy its large market share: both languages have distinct purposes for a range of applications, and complement each other’s offerings well.