Getting started with data analysis using this useful tool

Photo by Camylla Battani on Unsplash

Data analysis is fundamentally about finding answers to questions with data. When we perform a calculation or compute a statistic for a set of data, it is usually not enough to do that across the entire dataset. Instead, we will usually want to split the data into groups, perform the computation and then compare the results across the different groups.

Let’s say we were a digital marketing team investigating the potential reasons behind a recent decline in conversion rate. Looking at conversion rate as a whole over time would be…
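As a rough illustration of the split, compute and compare idea, here is a minimal sketch using the Pandas `groupby` method. The `sessions` data, the `channel` and `converted` column names and the values are all hypothetical and only exist to make the example runnable.

```python
import pandas as pd

# Hypothetical sessions data: one row per visit, with the traffic channel
# and whether the visit converted (1) or not (0).
sessions = pd.DataFrame(
    {
        "channel": ["email", "email", "search", "search", "search", "social"],
        "converted": [1, 0, 1, 1, 0, 0],
    }
)

# A single overall conversion rate hides any differences between channels.
overall_rate = sessions["converted"].mean()

# Split the rows by channel, compute the mean of each group, then compare.
rate_by_channel = sessions.groupby("channel")["converted"].mean()

print(overall_rate)
print(rate_by_channel)
```

Comparing the per-channel rates with the overall rate is one simple way to see whether a change is concentrated in a particular group.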


Introducing a curated list of free resources for learning data science

Photo by vnwayne fan on Unsplash

Over the last few years I have written several articles about learning data science using online resources. During my own learning journey I have identified some of the best free or low-cost material available for learning data science.

I recently spent some time consolidating this list into this GitHub repository so that it can be used as a quick reference for anyone who wants to expand their data science skills. …


Getting Started

… explained in plain English

Photo by ThisisEngineering RAEng on Unsplash

Statistics is “a branch of mathematics dealing with the collection, analysis, interpretation, and presentation of masses of numerical data”. Throw programming and machine learning into the mix and you have a pretty good description of the core skills for data science.

Statistics is used in almost all aspects of data science. It is used to analyse, transform and clean data, and to evaluate and optimise machine learning algorithms. It is also used in the presentation of insights and findings.

The field of statistics is extremely broad and determining what exactly you need to learn and in what order can be difficult. Additionally…


What are they, what are the options and why do we need them?

Photo by Lewis Ngugi on Unsplash

The Python programming language has many different versions. Similarly, Python libraries have multiple versions, work with specific versions of Python, and usually depend on other packages to run. These other packages are known as a set of dependencies.

Every data science project that you undertake is likely to require its own unique set of third-party Python packages. Virtual environments act as self-contained environments encapsulating the Python version and all dependencies for a project. Creating a new virtual environment is one of the first steps that is usually taken when starting any new data science project.
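As a minimal sketch of that first step, the standard library `venv` module can create an environment from Python itself. The temporary project directory and the `.venv` folder name are assumptions made for illustration, and other tools for managing environments exist.

```python
import tempfile
import venv
from pathlib import Path

# Hypothetical project location; a temporary directory keeps the sketch tidy.
project_dir = Path(tempfile.mkdtemp())
env_dir = project_dir / ".venv"

# with_pip=False skips installing pip so the example runs quickly;
# a real project would normally pass with_pip=True.
venv.create(env_dir, with_pip=False)

# pyvenv.cfg records which Python installation the environment is based on.
config_file = env_dir / "pyvenv.cfg"
print(config_file.read_text())
```

The same thing is more commonly done from the command line with `python -m venv .venv`, after which packages installed inside the activated environment stay separate from other projects.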


Learn all the statistics you need for data science for free

Photo by Daniel Schludi on Unsplash

Statistics is a fundamental skill that data scientists use every day. It is the branch of mathematics that allows us to collect, describe, interpret, visualise, and make inferences about data. Data scientists will use it for data analysis, experiment design, and statistical modelling.

Statistics is also essential for machine learning. We will use statistics to understand the data prior to training a model. When we take samples of data for training and testing our models we need to employ statistical techniques to ensure fairness. …
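One technique that may be relevant here is stratified sampling, which keeps the class proportions the same in the training and test sets. The sketch below assumes scikit-learn is installed and uses made-up, imbalanced labels; it is an illustration of the idea rather than the approach the article goes on to describe.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 80 negative and 20 positive examples.
features = [[i] for i in range(100)]
labels = [0] * 80 + [1] * 20

# stratify=labels asks for the 80/20 class balance to be preserved
# in both the training set and the test set.
train_X, test_X, train_y, test_y = train_test_split(
    features, labels, test_size=0.25, stratify=labels, random_state=0
)

print(Counter(train_y))
print(Counter(test_y))
```

Without stratification a random split of a small, imbalanced dataset can leave the test set with very few examples of the rarer class.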


and why the data science generalist will triumph

Photo by Markus Winkler on Unsplash

When I started learning data science a few years ago most job ads requested a PhD, or at the very least a masters, in maths, statistics or a similar subject as an essential requirement.

Over the last couple of years, things have evolved. Machine learning libraries now abstract away much of the complexity behind the algorithms, and there is a realisation that practically applying machine learning to solve business problems requires a set of skills that are not usually acquired through academic study alone. …


Getting started with this extremely useful data structure

Photo by Danielle MacInnes on Unsplash

Pandas is typically the default choice for data scientists, analysts and engineers when it comes to manipulating and analysing data with Python. The fundamental data structure used when working with this library is the DataFrame. The Pandas DataFrame is a two-dimensional structure consisting of rows and columns of data, not unlike an Excel spreadsheet or SQL database table.

Once data is made available in this data structure you can perform a variety of operations to prepare your data for analysis. …
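To make the rows and columns idea concrete, here is a minimal sketch that builds a small DataFrame and applies one simple preparation step. The `orders` data and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical data: each dictionary key becomes a column,
# and each position in the lists becomes a row.
orders = pd.DataFrame(
    {
        "customer": ["Ana", "Ben", "Cara"],
        "amount": [20.0, None, 35.5],
    }
)

print(orders.shape)   # (number of rows, number of columns)
print(orders.dtypes)  # the data type held by each column

# One simple preparation step: drop rows where the amount is missing.
clean_orders = orders.dropna(subset=["amount"])
print(clean_orders)
```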


Low code AI with PyCaret, BigQueryML and fastai

Photo by Kasya Shahovskaya on Unsplash

Machine learning has the potential to help solve a wide range of problems, both in business and in the world in general. Ordinarily, developing a machine learning model and deploying it to a state where it can be used operationally requires deep knowledge of programming and a good understanding of the algorithms behind it.

This limits the use of machine learning to a small group of people and also, therefore, limits the number of problems that can be solved.

Fortunately, over the last couple of years, a number of libraries and tools have sprung up that…


Getting Started

Getting started with the number one Python machine learning library

Photo by Clément H on Unsplash

Scikit-learn, first developed as a Google Summer of Code project in 2007, is now widely considered to be the most popular Python library for machine learning.

There are a number of reasons why this library is seen as one of the best choices for machine learning projects, especially in production systems. These include, but aren’t limited to, the following.

  • It has a high level of support and strict governance for the development of the library which means that it is an incredibly robust tool.
  • There is a clear, consistent code style which ensures that your machine learning code is…
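One way to see the consistency mentioned above is that different scikit-learn models are trained and used through the same `fit` and `predict` calls. The sketch below assumes scikit-learn is installed; the choice of the bundled iris dataset and of these two models is purely for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Small example dataset that ships with scikit-learn.
X, y = load_iris(return_X_y=True)

# Two quite different algorithms, driven by identical code.
for model in (LogisticRegression(max_iter=1000), KNeighborsClassifier()):
    model.fit(X, y)                      # train on the data
    predictions = model.predict(X[:5])   # predict for the first five rows
    print(type(model).__name__, predictions)
```

Because the interface is shared, swapping one model for another usually means changing a single line.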


Until now…

Photo by Jon Tyson on Unsplash

Pandas is the definitive library for performing data analysis with Python. It was originally developed by a company called AQR Capital Management but was open-sourced for general use in 2009.

It rapidly became the go-to tool for data analysis for Python users and now has a huge array of features for data extraction, manipulation, visualisation and analysis.

Pandas has many useful methods and functions. Here are ten things you might not know about the library.

Pandas can be pip installed if you don’t already have it. The full documentation, with some excellent general data analysis tutorials, can be found here.

Rebecca Vickery

Data Scientist | Writer, Speaker, Founder DatAcademy | www.rebecca-vickery.com | www.linkedin.com/in/rebecca-vickery
