Can Genealogical data be tidy?

“Happy families are all alike; every unhappy family is unhappy in its own way” (Leo Tolstoy)

“Like families, tidy datasets are all alike but every messy dataset is messy in its own way” (Hadley Wickham)

In this post, I’ll be exploring how genealogical data stored in the de facto standard format, GEDCOM, could be made tidy, and arguing that this is not really ideal. About 6 years ago, long before I got involved with Data Science and when R was just the 18th letter of the alphabet, I started researching my family history. [Read More]

What's in a package?

Happy Christmas! The holiday season has got me thinking about how discovering a new R package is like receiving a Christmas gift: you’re not quite sure what’s inside, but you’re hoping it’ll enrich your programming or analysis life in some way! In this short blog post I’ll be exploring a way of peeling back the wrapping paper, so to speak, and getting straight to the point of a package succinctly. [Read More]

Experimenting with Hierarchical Clustering in a galaxy far far away...

Introduction
This post will be taking a bit of an unexpected diversion. As I was experimenting with hierarchical clustering, I ran into the issue of how many clusters to assume. From that point I went deep into the rabbit hole and found out some really useful stuff that I wish I’d known when I wrote my previous post. I’ve discovered that choosing a number of clusters is a whole topic in itself, and there are, in general, two ways of validating a choice of cluster number: [Read More]

Use the k-means clustering, Luke

In my last post I scraped some character statistics from the mobile game Star Wars: Galaxy of Heroes. In this post, I’ll try out k-means clustering to see whether it produces an intuitive result, and to learn how to integrate this kind of analysis into a tidy workflow using broom. First I’ll load the required packages and set some plot preferences.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────────────────────── tidyverse 1. [Read More]

Experimentation with Unsupervised Learning

Motivation
I’ve written before about my learning plans, which always seem to be in a state of flux, and in particular about learning machine learning. Part of the reason I’m so reticent is that I’m a mathematician, and statistics does not come naturally or easily to me. My limited past experience has shown me just how much I don’t know. It’s fairly easy to apply a statistical model in R, and even to have a go at assessing its performance; however, I am acutely aware that there is a certain ‘dark art’ to it, one that requires knowing exactly how to interpret results and how far you can take them. [Read More]

Are R ecosystems the future?

Some random thoughts… Over the past 6 months I’ve been creating, refining, and delivering a variety of ‘Introduction to R’ training courses. The more I do this, the more I come to the view that not nearly enough is made of taking an ecosystem-oriented view of packages. A good way of talking about #rstats functionality is in terms of ecosystems, rather than individual packages. Tidyverse, tidymodels, RMarkdown & Co, and HTML widgets are all worth highlighting. [Read More]

Let's call it tidysearch

R turned 25 last year, and yet it’s only in relatively recent years that the language has really taken off, with numerous conferences every year driven by a passionate and vibrant community of users. A large part of this has been driven by an ecosystem of R packages called the Tidyverse, with which many new users nowadays begin their R journey. This alternative ‘opinionated’ set of packages has now been adopted as canon by many users (including me), and the wave of hype and success associated with it has caused many experienced R users, well versed in long-established ‘Base R’ functions, to take the leap into a new way of coding and a whole new set of functions. [Read More]

Mapping homelessness in England

Introduction
For this blog post, I decided to try to find a dataset covering an issue I feel quite strongly about: homelessness. I managed to find a fairly large dataset from the Cambridgeshire Insight website. For a while I’ve wanted to try out R’s mapping potential and hopefully generate a heatmap, so I’ve deliberately tried to find a dataset where I can try this out. [Read More]

Two years in Data Science and not yet a Data Scientist

What’s in a name? Despite the potentially grumpy-sounding title of this post, this is more a positive reflection on the past two years since I started working in Data Science. I think I’ve come a long way, but there is still so far to go if I am to confidently call myself a Data Scientist. Why does a job title matter? It’s a good way of thinking about your competencies, describing where you want to go, and conveying that to other people. [Read More]

Portsmouth R User Group - 2nd Meeting

Last month I attended my first ever R User Group meeting, which was held at the University of Portsmouth in their impressive Future Technology Centre. I’d been itching to go to one of these meetups for a while, but unfortunately there was nothing in the South of England, so when this opportunity came around I couldn’t miss it, especially as I couldn’t attend the first one. It was well attended, with about 30 people from all manner of backgrounds, and two briefs were given. [Read More]