Data Science For Football Business - Clustering Analysis

Bas Schnater, Kenneth Cortsen and Daniel Rascher explore the use of machine learning to understand fan transactional behaviour to improve marketing effectiveness.



The use of data is becoming increasingly integrated into numerous aspects of the study of football, whether from a sporting or a business perspective. From a sporting perspective, there are many success stories, books, movies and podcasts as well as a growing number of dedicated academic programs focused on performance-based data science. However, there is unexploited potential concerning data science practices in the business of sports, e.g., the business of running a football club. While top European football clubs have invested intensively in professionalizing their sports analytics departments, most clubs have not done the same for their business or commercial departments. Just as athletic performance analytics can improve winning and net scoring, many of the same techniques can improve the management of the commercial aspects of a football club (Cortsen & Rascher, 2018).


This new blog series on data science practices will dive into why and how competent data management can provide value to operating a football business. In this part, Bas teams up with Danish sports business researcher and UEFA A-licensed football expert Kenneth Cortsen from University College of Northern Denmark and Aalborg BK/AaB, and with Daniel Rascher, Professor and Academic Director at the University of San Francisco’s Sports Management Master’s Program, to elaborate on a data science technique that is very beneficial for marketing to fan bases: creating segments or groups of fans using cluster analysis on fan behavioural data.1


For this article, a combination of real and fictional2 data has been used. Clubs participating in this research wish to remain anonymous.


1 This is in contrast to clustering fan bases based on survey data and the extrapolation of the results to the broader fan base.

2 Fictional data is used to show how the technique works instead of drawing conclusions for a real team.


What is cluster analysis?

Cluster analysis (or clustering) is an unsupervised machine learning technique that can be applied to find similarities in fan behaviour and thus to divide fans into groups characterized by those similarities (Jain, 2008). This can include grouping fans based on ticket purchase behaviours and/or multiple transactional and non-transactional behaviours (ticketing, e-mail, food & beverage, merchandising, social media, etc.). Grouping fans based on similar behaviours can help commercial and marketing teams improve targeting and hence personalise their offerings to fans.


Figure 1. Source: https://i.stack.imgur.com/cIDB3.png


There are many parallels between clustering and segmentation; however, there are some key differences. While traditional segmentation3 is the process of grouping fans based on observed similarities (age, distance to stadium, purchasing history), “clustering is the process of finding those similarities in customers so that they can be grouped.”4 In other words, clustering finds patterns in the data that are not visible to the human eye. As a result, clustering can lead to a more accurate segmentation because it takes multiple variables and similarities into account at once.


3 Many segmentation studies use cluster analysis to create segments, not just observation of characteristics.

4 https://www.acquia.com/blog/difference-between-segmentation-and-clustering.


Why cluster a sports fan base?

First, the obvious question to ask is why it would be beneficial to cluster a fan base. The best answer is to gain an understanding of fan behaviour. During our work in football, we have been motivated to find hidden patterns in consumer/fan behaviour. Without realizing it, consumers leave many digital footprints along their fan journey, which contain valuable information. Analysing and grouping these footprints will expose clusters of fans with similar behaviours and spending patterns. Yet this is not only feasible in the digital domain. Analog fans also leave information about their consumer experiences. Consider the location and time when a fan gets his/her ticket scanned at the access gate, or the moment when he/she purchases a beer, purchases a hotdog or visits the merchandise store. All of this information can be analysed and used to provide tailored offerings to fans.


Given the increasing complexity of offerings in sports and the availability of data, it is more critical than ever to be able to extract marketing insights about customers. By grouping fans into segments that show similar behaviours (clusters), these hidden patterns become visible. More importantly, marketing teams can then leverage these insights by offering products, services or experiences tailored to each cluster’s needs and by predicting what groups of fans are likely to purchase during their next interaction with the club. For example, do fans who enter the stadium early actually purchase more drinks and food? Or do fans who respond to e-mails really buy more tickets? And for sponsors, cluster analysis could help to identify specific fan segments that generate relevant ROI (Return on Investment), ROE (Return on Engagement) and ROO (Return on Objectives).


COVID-19 and the new fan experience

Since the COVID-19 outbreak, many governing bodies, leagues and clubs have been struggling with national rules and regulations on welcoming fans back into the stadium. This is not only an operational challenge, but also a commercial one: who do you allow back first? Should fans be selected via a “lucky” or random draw, is it best to invite the most loyal fans back first, or should a club first invite a high percentage of the fans at risk of not renewing, to prevent them from cancelling their season tickets? Each of these scenarios may make sense depending on what is known about the fan base. Clubs may also be able to understand which fans are likely to churn the following season (i.e., cancel their subscription or season tickets) by tracking data points such as previous matchday behaviour, tenure of the season ticket, and average spending at the stadium during games or off-days.



In addition to improving fans’ experiences, clustering insights are also very advantageous for sponsors. Many sponsors expect more data-driven activations of their commercial rights; therefore, it is imperative to understand the characteristics of fan behaviour. Quantifying and qualifying the hidden similarities among thousands of fans, and thus understanding rather than merely describing them, will allow clubs to offer more personalized and precise commercial packages to sponsors.


How to cluster a fan base?

Hopefully the description above has helped explain the practical purposes of clustering a fan base. The next step is to actually do it. It is vital to remember first that clustering is an unsupervised machine learning technique. This means that clustering attempts to find patterns in unlabelled or uncategorized data and to find ‘meaningful’ groups of data points that aren’t yet classified. This is exactly what we’ll do here, as cluster analysis tries to group items with similar characteristics. One example is grouping football players based on various performance metrics. Here, we will focus on customer data related to the business of football.


With the proper ingredients and preparation, the analysis can be completed relatively quickly. Although the concepts and steps are quite technical, and sensitivity tests are needed, here are the basic necessities:


– Expertise. Experience with software that can perform cluster analysis, e.g., Python, R or SPSS, is definitely recommended. Experience with collecting, cleaning and preparing data for analysis is also needed, and this requires a statistical understanding of data manipulation.


– Data. Preferably numerical data, as this is quite easy to analyse. Think age, newsletter interactions, amount of money spent per season, matches visited, etc.


– Business understanding. Contextual understanding is crucial for data science to have meaning and to be able to address business challenges. In our opinion, data science in this context should always have a business problem or challenge as its starting point, i.e., what business questions (problems or challenges) can data science help to solve? Nowadays, there is typically more data available than ever before, and finding answers to (as yet) unasked but relevant questions is critical in the sense that it will also train football clubs in asking the ‘right’ questions at the ‘right’ time. Part of the process of implementing data science into the business process of a club is for the club to understand the full scope of what can be analysed and to prioritise what matters.


– Choice of technique. As there are various methods of clustering a fan base, there’s more to the choice of techniques than what this article covers. For the purpose of this article, we use K-Means clustering analysis as it is relatively straightforward.
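To give a feel for what K-Means does under the hood, here is a minimal sketch in Python using only NumPy. The function name `kmeans`, the tiny fan table and its two columns (age, matches visited) are illustrative assumptions and not the data or code behind this article. The algorithm alternates between two steps: assign every fan to the nearest cluster centre, then move each centre to the average of its members.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-Means (Lloyd's algorithm): returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct fans chosen at random as initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each fan joins the nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: each centroid moves to the mean of its members
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # Centroids stopped moving, so the assignment is stable.
        centroids = new_centroids
    return labels, centroids

# Fictional fans: columns are age and matches visited.
fans = np.array([
    [19, 3], [22, 4], [24, 2], [21, 5],
    [55, 16], [60, 18], [58, 15], [62, 17],
], dtype=float)

labels, centroids = kmeans(fans, k=2)
print(labels)
print(centroids)
```

In real projects a library implementation would normally be preferred, and variables measured on different scales are usually standardised first so that no single variable dominates the distance calculation.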


Before the data is analysed, it is important to make sure it is in numerical format. This means that all categorical data (e.g., gender or age group) needs to be encoded. For the remainder of this article, Python will be used as the tool of preference. However, as stated before, other software options can also perform cluster analysis. Within Python, Pandas, a data analysis library, offers an easy way to create indicator or dummy variables.8 It is also critical to decide how to handle missing values. To make the research more robust, it is often useful to conduct the analysis twice: once dropping all records with missing values, and once filling in the missing values with suitable replacements (possibly using the group mean).9 The impact of these two methods on the final results is then assessed, and if there are stark differences, further investigation is needed.


8 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html

9 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html#pandas.DataFrame.fillna
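The preparation steps described above could look roughly as follows in Pandas. This is a sketch on fictional records: the column names (`gender`, `membership`, `spend_per_season`, etc.) and the choice of membership type as the "group" for the group mean are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Fictional fan records; column names are illustrative assumptions.
fans = pd.DataFrame({
    "fan_id": [1, 2, 3, 4, 5, 6],
    "gender": ["F", "M", "M", "F", "M", "F"],
    "membership": ["season", "season", "single", "single", "season", "single"],
    "matches_visited": [17, 15, 3, 4, 16, 2],
    "spend_per_season": [820.0, np.nan, 95.0, 110.0, 760.0, np.nan],
})

# Encode categorical columns as 0/1 indicator (dummy) variables.
encoded = pd.get_dummies(fans, columns=["gender", "membership"], dtype=int)

# Option 1: drop every fan that has a missing value.
dropped = encoded.dropna()

# Option 2: fill missing spend with the mean of the fan's membership group.
group_mean = fans.groupby("membership")["spend_per_season"].transform("mean")
filled = encoded.copy()
filled["spend_per_season"] = fans["spend_per_season"].fillna(group_mean)

# Compare the two versions before clustering on either of them.
print(len(dropped), len(filled))
print(dropped["spend_per_season"].mean(), filled["spend_per_season"].mean())
```

Running the clustering on both `dropped` and `filled` and comparing the resulting clusters is one way to carry out the sensitivity check mentioned above.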


Another important step is to determine the number of variables to be used in the clustering. The more variables that are included, the less informative the process can become because either the segments are too general to be useful, or too specific with a relatively small number of people in them. Unfortunately, there is a trade-off because excluding a variable essentially excludes its information from the final analysis. It is advised to start with a logically defined set of variables (based on customer information like annual income and household spending) and then to form the clusters around these. There are other methods, e.g., principal components analysis, which can shrink the number of variables using statistical techniques.
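As an illustration of how principal components analysis can shrink the number of variables, here is a short sketch using scikit-learn. The five behaviour variables are fictional and deliberately correlated, and the 90% variance threshold is an arbitrary assumption rather than a recommendation.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n_fans = 200

# Fictional behaviour variables; spend is deliberately tied to attendance.
matches = rng.integers(1, 18, size=n_fans).astype(float)
food_spend = 12 * matches + rng.normal(0, 10, size=n_fans)
merch_spend = 8 * matches + rng.normal(0, 15, size=n_fans)
email_opens = rng.integers(0, 40, size=n_fans).astype(float)
age = rng.integers(16, 75, size=n_fans).astype(float)

X = np.column_stack([matches, food_spend, merch_spend, email_opens, age])

# Put every variable on the same scale so that no single unit dominates.
X_scaled = StandardScaler().fit_transform(X)

# Keep as many components as are needed to explain at least 90% of the variance.
pca = PCA(n_components=0.90, svd_solver="full")
X_reduced = pca.fit_transform(X_scaled)

print(X.shape, "->", X_reduced.shape)
print(pca.explained_variance_ratio_.round(2))
```

The trade-off is interpretability: each component is a blend of the original variables, which can make the resulting clusters harder to explain to a commercial team.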


What does clustering look like?

It is essential to understand the relationships between variables in the data beforehand and in relation to the defined questions that the football club wants to answer. As figure 2 shows, there is a moderate positive correlation (r = 0.4) between the two variables, i.e., age and matches visited.


Figure 2. Example dataset before clustering (r = 0.4)
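A correlation like this can be checked in a couple of lines before any clustering is done. The sketch below generates fictional data in which older fans attend slightly more matches, with plenty of noise; the coefficients are assumptions chosen only so that the result lands in the same region as the example.

```python
import numpy as np

rng = np.random.default_rng(7)
n_fans = 500

# Fictional data: attendance rises slightly with age, plus a lot of noise.
age = rng.uniform(16, 75, size=n_fans)
noise = rng.normal(0, 3, size=n_fans)
matches_visited = np.clip(np.round(6 + 0.075 * age + noise), 0, None)

# Pearson correlation coefficient between the two variables.
r = np.corrcoef(age, matches_visited)[0, 1]
print(f"r = {r:.2f}")
```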


By clustering this fan base, the goal is to find hidden patterns in the data that are not visible without further analysis. As figure 2 showed, there seems to be a relationship between age and the number of matches visited. However, is this only for home games, or also for away games? With these two variables, 2 clearly defined clusters can be created (see figure 3). When adding a third variable, such as the number of home games visited, a different clustering result becomes visible in which the data points get drastically rearranged (see figure 4). Also, the cluster boundaries are no longer as strictly defined; they seem to overlap slightly. Adding a fourth variable creates a slightly confusing result, with cluster 2 being spread out through the other 3 clusters (see figure 5). A relevant question here would be: will this still be workable? In this example, arguments can be found for sticking to 3 clusters and not adding the fourth variable. Repeating this process multiple times with different variables may lead to a ‘cleaner’ arrangement for the fourth cluster. For now, 3 clusters will be accepted as the final result.


Figure 3. 2 cluster example (age, matches)

Figure 4. 3 variables (age, matches, # home games)

Figure 5. 4 variables (age, matches, home games, years customer)
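The effect of adding a variable can be reproduced on fictional data with scikit-learn's `KMeans`. In this sketch, two of the three invented fan profiles look identical on age and matches visited and only differ in home games, so the third variable reveals a group that the first two could not separate. The profiles, group sizes and the helper `fictional_group` are assumptions for illustration and do not correspond to the figures in this article.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

def fictional_group(n, age, matches, home_games):
    """Draw n fictional fans scattered around the given average profile."""
    return np.column_stack([
        rng.normal(age, 4, n),
        rng.normal(matches, 1.5, n),
        rng.normal(home_games, 1.0, n),
    ])

# Three fictional profiles: columns are age, matches visited, home games.
X = np.vstack([
    fictional_group(100, age=25, matches=6, home_games=5),
    fictional_group(100, age=55, matches=16, home_games=8),
    fictional_group(100, age=55, matches=16, home_games=15),
])
X_scaled = StandardScaler().fit_transform(X)

# Two variables (age, matches) with two clusters.
km2 = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_scaled[:, :2])

# Adding the third variable (home games) with three clusters.
km3 = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)

print(np.bincount(km2.labels_))  # cluster sizes with two variables
print(np.bincount(km3.labels_))  # cluster sizes with three variables
```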


To conclude, there are many ways to cluster a fan base and there isn’t necessarily a single right or wrong approach. In fact, it takes a lot of adjustment and trial-and-error to find the most appropriate number of clusters. What needs to be emphasized is that ‘less is more’: the fewer variables used to create the clusters, the easier it is for the football club to work with the assigned clusters. However, fewer clusters tend to mean that the members of each cluster are less unique, or to put it another way, fewer clusters imply more generality within each cluster. Thus, it is an art to find the correct balance in this process.
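Part of the trial-and-error can be supported with numbers. One common heuristic, which is our illustrative choice here and not necessarily the procedure used for the figures above, is to fit K-Means for several values of k and compare the silhouette score, which is higher when clusters are compact and well separated. The fan profiles below are fictional.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Fictional fans from three profiles: age, matches visited, home games.
profiles = [(22, 5, 4), (45, 17, 16), (65, 9, 3)]
X = np.vstack([rng.normal(p, (4, 1.5, 1.0), size=(100, 3)) for p in profiles])
X_scaled = StandardScaler().fit_transform(X)

scores = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    # Inertia generally keeps falling as k grows, so it cannot pick k on its
    # own; the silhouette score rewards compact, well-separated clusters.
    scores[k] = silhouette_score(X_scaled, model.labels_)
    print(k, round(model.inertia_, 1), round(scores[k], 3))

best_k = max(scores, key=scores.get)
print("Suggested number of clusters:", best_k)
```

A score like this is only a guide. Whether the suggested clusters are workable for the marketing team remains a business judgement.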


Practical case – COVID-19 and season ticket holders

How can this technique be applied? Imagine you’re the head of marketing and ticketing at a football club and you can welcome season ticket holders back into the stadium again. This is good news, but who should you allow back in first? Not all season ticket holders can come back immediately, as that would often exceed the new maximum COVID-19 capacity. There are several ways to approach this challenge. One way is to welcome the most loyal fans back first. Another potential solution is to welcome back the fans who have the most expensive season tickets, as their churn the following year would have the largest financial impact.


Based on the clustering analysis from earlier, it is possible to combine these choices. This means looking for a cluster of fans that meets several of the above-mentioned arguments at once. From figures 6 and 7, it becomes clear that cluster 1 spends more money on average on their season ticket (see figure 6) and that many people in that cluster visited most of the home games (see figure 7). So if you were a club and had to first invite the fans whose non-renewal would hurt the club financially, the retention risk of cluster 1 (considering that it is a larger cluster than cluster 3) is higher than that of cluster 3. In other words, if a lot of season ticket holders from cluster 1 did not renew, it would hurt the club financially more than if fans in cluster 2 did not renew. Therefore, from a financial standpoint, it would make sense to invite cluster 1 to the stadium first, then cluster 3, followed by cluster 2.


Figure 6. Average season ticket price

Figure 7. Home games in previous season
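Once every season ticket holder carries a cluster label, the comparison behind this kind of decision is a simple group summary. The sketch below uses a handful of fictional holders; the column names and the `revenue_at_risk` measure (total season ticket revenue per cluster) are assumptions chosen to mirror the financial argument above.

```python
import pandas as pd

# Fictional season ticket holders with a cluster label already assigned.
holders = pd.DataFrame({
    "cluster": [1, 1, 1, 1, 2, 2, 3, 3, 3],
    "season_ticket_price": [650, 700, 620, 680, 250, 270, 480, 450, 500],
    "home_games_attended": [16, 17, 15, 17, 6, 8, 12, 11, 13],
})

summary = holders.groupby("cluster").agg(
    fans=("season_ticket_price", "size"),
    avg_price=("season_ticket_price", "mean"),
    avg_home_games=("home_games_attended", "mean"),
    revenue_at_risk=("season_ticket_price", "sum"),
)

# Invite first the clusters whose non-renewal would cost the club the most.
invite_order = summary.sort_values("revenue_at_risk", ascending=False).index.tolist()
print(summary)
print("Invitation order:", invite_order)
```

Other rankings, for example by average home games attended to reward loyalty, only require changing the column passed to `sort_values`.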



Hopefully this article provides a first introduction to the possibilities of K-Means clustering. Some limitations do need to be mentioned, though: the user always needs to determine the ideal number of clusters.10 This means that choosing the number of clusters takes a lot of time and consideration. Also, K-Means clustering is not always the ideal clustering method, for example when the dataset holds extreme outliers, in which case a K-Medoids technique would be advised. However, this is another methodological discussion, which falls outside the scope of this article. In short, each technique has its pros and cons, which need to be considered when performing clustering analysis.


10 There are versions of cluster analysis that choose the number of clusters from the data.
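To see why outliers matter, compare how the two techniques represent a single cluster. This is not a full K-Medoids implementation, only a sketch of the one idea that distinguishes it: K-Means summarises a cluster by its mean, whereas K-Medoids picks the actual member that is closest to all the others. The spend figures are fictional.

```python
import numpy as np

# Fictional seasonal spend for one cluster; the last fan is an extreme outlier.
spend = np.array([[90.0], [100.0], [110.0], [120.0], [5000.0]])

# K-Means represents the cluster by its mean, which the outlier drags upward.
centroid = spend.mean(axis=0)

# K-Medoids represents it by the member with the smallest total distance
# to all other members, so the representative is always a real fan.
pairwise = np.abs(spend - spend.T)
medoid = spend[pairwise.sum(axis=1).argmin()]

print(centroid, medoid)
```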



Cortsen, K., & Rascher, D. (2018). The Application of Sports Technology and Sports Data for Commercial Purposes. In D. Almeida Marinho, & H. Pereira Neiva, The Use of Technology in Sport – Emerging Challenges (pp. 47-85). London: InTech Open.


Jain, A. K. (2008). Data Clustering: 50 Years Beyond K-means. ECML PKDD 2008, 3-4.



About the authors

Bas Schnater (MSc.) has worked in sports for over 8 years and has been active in various roles for organizations in The Netherlands and Australia including Australian Open, FIH, Melbourne Sports & Aquatic Centre and in a club role at Dutch Eredivisie club AZ Alkmaar. At this club, Schnater was responsible for all fan intelligence projects, digital innovation and campaign management. In 2018, his campaign methodology got nominated for a global World Football Summit Award as the Best Club Commercial Initiative. With his consultancy Fan Engagement, Schnater helps various football organizations across Europe such as UEFA with setting up fan strategies, CRM projects and fan research. He is a frequently invited guest lecturer for education institutes such as the Johan Cruyff Institute, IFBI Brussels, ISDE Masters Sports & Law, Sports Business Institute, Coventry University, University College Birmingham, the University College of Northern Denmark and the University of Applied Sciences Amsterdam.


Schnater has also spoken at international sports business conferences such as various ESSMA Summits, SportsBiz Warsaw, KPMG Sports Analytics Conference, Football Business Summit and the Capgemini Future of Sports conference, and has contributed to private knowledge sharing events of the English Premier League, the Swedish Elitfotboll, the Portuguese Primeira Liga, the Dutch Eredivisie CV and the Hungarian OTP Bank Liga. Schnater is the creator of the Growing Attendance Model (2018) and the Data Maturity Model (2021*), and has contributed over 20 publications to the industry literature, including case study contributions to Winning With Data (2018) and Digital Sport Marketing (2020). Apart from sports, Schnater also works in publishing at international publisher Mediahuis. In his role, he helps newspapers to improve content performance and maximize subscription conversion by managing various data science projects.

*to be published


Kenneth Cortsen co-founded the Department of Sport Management at University College of Northern Denmark (UCN). In addition to its Sport Management Program with general admission criteria, UCN Sport Management collaborates with FIFPro (World Players’ Union) in Amsterdam and runs online sport management education for professional athletes from various European countries. Cortsen is also an External Lecturer at DIS, Copenhagen, where he teaches sport economics for American elite students from various universities in the US. His PhD focused on sports branding at different levels, on how to improve sports branding interactions and on how to capitalize on this process. Cortsen does sports business research, lectures and consults for organisations in Denmark and abroad, e.g., guest lectures or research visits at University of San Francisco, Vlerick Business School, Harvard Business School, University of Northern Colorado, San Diego State University Sports MBA Program, and Johan Cruyff Institute (Amsterdam and Barcelona). He is a frequent speaker at national and international sport conferences, e.g., the Sport Accord Convention, Forum SPORTBIZ in Warsaw, Poland (September 2020) and the Australian Sports Technologies Network Masterclass session (November 2020). He holds a UEFA A-license coaching certificate and has been coaching in the Danish football club Aalborg BK/AaB. After four years as head coach for the men’s reserve team, he became the head women’s coach and took the team from the third tier of Danish football to a current position in the Danish Super League in June 2020 before becoming a strategic advisor for the board of the mother club while overseeing individual coaching projects and some international collaboration projects with other European top clubs.


Daniel Rascher (Ph.D in Economics, UC Berkeley) is Professor and co-Director of the sport management program at the University of San Francisco, having also taught at the University of Massachusetts, Northwestern University, Stanford University, and the IE Business School. He has over 70 publications including a co-authored sport finance textbook, Financial Management in the Sport Industry. At SportsEconomics and OSKR, he has worked on over 150 sports business consulting projects for clients involved in the NBA, NFL, MLB, NHL, NCAA, NASCAR, MLS, PGA, media, sporting goods and apparel, professional boxing, mixed martial arts, minor league baseball, NHRA, AHL, Formula One racing, Indy Car racing, American Le Mans racing, Premier League Football (soccer), women’s professional soccer, professional cycling, endurance sports, Indian Premier League, ticketing, IHRSA, music, as well as sports commissions, government, convention and visitors bureaus, tourism businesses, and B2B enterprises. He has been named Research Fellow of the North American Society for Sport Management and was given the Lifetime Achievement Award from the Applied Sport Management Association. Dan has testified as an expert witness in federal and state courts, in arbitration proceedings, and provided public testimony numerous times to state and local governments.