Data analysis is fundamentally about finding answers to questions with data. When we perform a calculation or compute a statistic for a set of data, it is rarely enough to do that across the entire dataset. Instead, we will usually want to split the data into groups, perform the computation and then compare the results across the different groups.

Let’s say we were a digital marketing team investigating the potential reasons behind a recent decline in conversion rate. Looking at conversion rate as a whole over time would be…
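
As a rough illustration of the split, compute and compare idea, the sketch below uses only the Python standard library. The session records, the channel names and the `conversion_rate` helper are all hypothetical and invented for this example; they are not taken from any real campaign.

```python
from collections import defaultdict

# Hypothetical session records: (marketing channel, did the visitor convert?)
sessions = [
    ("email", True),
    ("email", False),
    ("search", True),
    ("search", True),
    ("search", False),
    ("social", False),
    ("social", False),
    ("social", True),
    ("social", False),
]


def conversion_rate(outcomes):
    """Share of sessions that converted (True counts as 1, False as 0)."""
    return sum(outcomes) / len(outcomes)


# The statistic computed across the entire dataset.
overall = conversion_rate([converted for _, converted in sessions])

# Split: collect the outcomes for each channel.
groups = defaultdict(list)
for channel, converted in sessions:
    groups[channel].append(converted)

# Compute: apply the same statistic to every group.
by_channel = {
    channel: conversion_rate(outcomes) for channel, outcomes in groups.items()
}

# Compare: the per-group figures can differ widely from the overall figure.
print(f"overall: {overall:.2f}")
for channel, rate in sorted(by_channel.items()):
    print(f"{channel}: {rate:.2f}")
```

A single overall rate hides the spread between channels, which is why the grouped view tends to be more useful. In practice a library method such as pandas `DataFrame.groupby` would usually replace the manual loop.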

Over the last few years I have written several articles about learning data science using online resources. During my own learning journey I have identified some of the best free or low-cost material available for learning data science.

I recently spent some time consolidating this list into this GitHub repository so that it can be used as a quick reference for anyone who wants to expand their data science skills. …

Statistics is “*a branch of mathematics dealing with the collection, analysis, interpretation, and presentation of masses of numerical data*”. Throw programming and machine learning into the mix and you have a pretty good description of the core skills for data science.

Statistics is used in almost all aspects of data science. It is used to analyse, transform and clean data, and to evaluate and optimise machine learning algorithms. It is also used in the presentation of insights and findings.

The field of statistics is extremely broad and determining what exactly you need to learn and in what order can be difficult. Additionally…

The Python programming language has many different versions. Similarly, Python libraries have multiple versions, each of which works with specific versions of Python, and most of them depend on other packages to run. These other packages are known as **dependencies**.

Every data science project that you undertake is likely to require its own unique set of **third-party Python packages**. Virtual environments act as self-contained environments encapsulating the Python version and all dependencies for a project. Creating a new virtual environment is one of the first steps that is usually taken when starting any new data science project.

Creating a new…
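
One way to create such an environment is with the `venv` module from the Python standard library. The sketch below is a minimal example under that assumption; other tools exist, and the excerpt above does not say which one is used. The `create_project_env` helper and the `.venv` folder name are hypothetical choices made for illustration, and the demo writes to a temporary directory so that nothing is left behind.

```python
import tempfile
import venv
from pathlib import Path


def create_project_env(project_dir):
    """Create a virtual environment in a '.venv' folder inside project_dir."""
    env_dir = Path(project_dir) / ".venv"
    # with_pip=False keeps the example fast and avoids needing ensurepip;
    # pass with_pip=True if pip should be installed into the environment.
    venv.create(env_dir, with_pip=False)
    return env_dir


# Demo in a throwaway directory standing in for a real project folder.
with tempfile.TemporaryDirectory() as project_dir:
    env_dir = create_project_env(project_dir)
    # Every environment built by venv contains a pyvenv.cfg file that
    # records which Python installation it was created from.
    config_found = (env_dir / "pyvenv.cfg").is_file()
    print(f"Created {env_dir.name}; pyvenv.cfg present: {config_found}")
```

Once activated, packages installed into the environment stay separate from those of other projects.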

Statistics is a fundamental skill that data scientists use every day. It is the branch of mathematics that allows us to collect, describe, interpret, visualise, and make inferences about data. Data scientists will use it for data analysis, experiment design, and statistical modelling.

Statistics is also essential for machine learning. We will use statistics to understand the data prior to training a model. When we take samples of data for training and testing our models we need to employ statistical techniques to ensure fairness. …

When I started learning data science a few years ago, most job ads listed a PhD, or at the very least a master's, in maths, statistics or a similar subject as an essential requirement.

Over the last couple of years, things have evolved, with the development of machine learning libraries that abstract away much of the complexity behind the algorithms, and with a realisation that practically applying machine learning to solve business problems requires a set of skills that are not usually acquired through academic study alone. …

Pandas is typically the default choice for data scientists, analysts and engineers when it comes to manipulating and analysing data with Python. The fundamental data structure used when working with this library is the DataFrame. The Pandas DataFrame is a two-dimensional structure consisting of rows and columns of data, not unlike an Excel spreadsheet or SQL database table.

Once data is made available in this data structure you can perform a variety of operations to prepare your data for analysis. …
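
To make the row-and-column structure concrete, here is a small sketch that builds a DataFrame and performs one simple preparation step. The product data and column names are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical sales data: each dictionary key becomes a column,
# and each position in the lists becomes a row.
df = pd.DataFrame(
    {
        "product": ["apple", "banana", "cherry"],
        "units_sold": [10, 4, 7],
        "unit_price": [0.5, 0.25, 2.0],
    }
)

# shape reports (number of rows, number of columns), much like the
# dimensions of a spreadsheet range or a database table.
n_rows, n_cols = df.shape

# A simple preparation step: derive a new column from two existing ones.
df["revenue"] = df["units_sold"] * df["unit_price"]

# Rows can be selected by position, columns by name.
second_row = df.iloc[1]

print(df)
print(f"{n_rows} rows and {n_cols} columns before adding 'revenue'")
```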

Machine learning has the potential to help solve a wide range of problems, both in business and in the world in general. Ordinarily, developing a machine learning model and deploying it to a state where it can be used operationally requires deep knowledge of programming and a good understanding of the algorithms behind it.

This limits the use of machine learning to a small group of people and also, therefore, limits the number of problems that can be solved.

Fortunately, over the last couple of years, a number of libraries and tools have sprung up that…

Scikit-learn, first developed as a Google Summer of Code project in 2007, is now widely considered to be the most popular Python library for machine learning.

There are a number of reasons why this library is seen as one of the best choices for machine learning projects, especially in production systems. These include, but aren’t limited to, the following:

- It has a high level of support and strict governance for the development of the library, which means that it is an incredibly robust tool.
- There is a clear, consistent code style which ensures that your machine learning code is…

Pandas is the definitive library for performing data analysis with Python. It was originally developed by a company called AQR Capital Management but was open-sourced for general use in 2009.

It rapidly became the go-to tool for data analysis for Python users and now has a huge array of features for data extraction, manipulation, visualisation and analysis.

Pandas has many useful methods and functions. Here are ten things you might not know about the library.

Pandas can be installed with pip if you don’t already have it. The full documentation, with some excellent general data analysis tutorials, can be found here.

`…`

Data Scientist | Writer, Speaker, Founder DatAcademy | www.rebecca-vickery.com | www.linkedin.com/in/rebecca-vickery