The Rising Tennis Star

Author

James Liang

Published

February 10, 2025

Executive Summary

Tennis has long been defined by the dominance of three extraordinary players—Novak Djokovic, Rafael Nadal, and Roger Federer. Their achievements, spanning almost two uncontested decades, have set an unparalleled benchmark in the sport. Their rivalry, marked by intense matches and moments of brilliance, has captivated fans around the world, pushing the boundaries of what was thought possible in tennis.

For me, growing up in the leafy suburbs of Melbourne, watching them play at the Australian Open with my family became a warm, nostalgic tradition. Whether we gathered around the small TV in the living room, all squeezed together, or simply let the matches play softly in the background, it was always a moment that brought us closer, especially as we cheered on our three favourite players. However, with their gradual transition away from professional play, a new generation of rising stars is poised to reshape the competitive landscape.

This report aims to provide:

  • A quantitative exploration of the Big 3’s dominance

  • A data-driven analysis of how the sport has evolved, and

  • A statistical approach to identifying the key attributes that define a tennis superstar.

Leveraging analytical techniques, including visualisations, regressions, and principal component analysis (PCA), this study uncovers patterns in performance metrics, player trajectories, and the factors influencing success at the highest level. This report offers insights into the shifting dynamics of professional tennis and highlights the emerging talents most likely to leave their mark on the sport.

Preparing the Data

Data Collection

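Data used for this report includes:

  • ATP Match Statistics by Jeff Sackmann, containing match-level results and statistics for ATP matches from 1968 to 2024.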

Code
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

##### 1. Association of Tennis Professionals (ATP) Match Statistics #####
years = range(1968, 2025)
base_url = "https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_{}.csv"

# Read and concatenate all CSV files
dfs = []
for year in years:
    url = base_url.format(year)
    try:
        df = pd.read_csv(url)
        dfs.append(df)
        # print(f"Successfully loaded: {year}")
    except Exception as e:
        print(f"Failed to load {year}: {e}")

# Combine all dataframes
atp_match_stats = pd.concat(dfs, ignore_index=True)

# atp_match_stats.to_csv("data/atp_1968-2024.csv", index=False)
atp_match_stats = pd.read_csv('data/atp_1968-2024.csv')
  • ATP Player Information by Jeff Sackmann, containing the personal attributes of ATP players, including details such as player birth year, country of origin, and other personal attributes.
Code
##### 2. ATP Player Information #####
base_url = "https://raw.githubusercontent.com/JeffSackmann/tennis_atp/refs/heads/master/atp_players.csv"
player_info = pd.read_csv(base_url)
  • ATP Rankings by Jeff Sackmann, containing the weekly rankings of all ATP players.
Code
##### 3. ATP Rankings #####
base_url = "https://raw.githubusercontent.com/JeffSackmann/tennis_atp/refs/heads/master/atp_rankings_{}s.csv"
years = ['00', '10', '20', '90', '80', '70']
# Read and concatenate all CSV files
dfs = []
for year in years:
    url = base_url.format(year)
    try:
        df = pd.read_csv(url)
        dfs.append(df)
        # print(f"Successfully loaded: {year}")
    except Exception as e:
        print(f"Failed to load {year}: {e}")

# Append current rankings up to 2024
current_seed_url = "https://raw.githubusercontent.com/JeffSackmann/tennis_atp/refs/heads/master/atp_rankings_current.csv"
df = pd.read_csv(current_seed_url)
dfs.append(df)

# Combine all dataframes
player_seed = pd.concat(dfs, ignore_index=True)

# player_seed = pd.read_csv('data/all_ranking_data.csv')
  • ATP Advanced Match Statistics (Manual Match Labels) by Nirodha Epasinghege Dona, Paramjit S. Gill, and Tim B. Swartz, which aggregates manually labeled shot-by-shot data from Sackmann’s open-source Match Charting Project. The primary dataset used in this report is sourced from Journal of the Royal Statistical Society Series A: Statistics in Society, Volume 188, Issue 1, January 2025, Pages 188–204.
Code
##### 4. ATP Advanced Match Statistics #####
advanced_match_stats = pd.read_csv('data/Supp_1_men_data.csv')
  • Tennis Abstract Metrics by Jeff Sackmann, containing the Top 100 player related statistics, scraped from the site.
Code
##### 5. Tennis Abstract Metrics #####
atp_100_advanced_player_info = pd.read_csv("data/atp_100_advanced_player_info.csv")

Preprocess Data

The datasets used in this report are relatively well-maintained and structured, making them suitable for analysis with minimal preprocessing. However, before proceeding, a simple data inspection was conducted to ensure consistency, completeness, and accuracy.

  1. Data Integrity Checks: Before making any transformations, each dataset was reviewed for missing values, inconsistencies, and potential errors. This included:
  • Uniqueness checks
  • Identifying missing or null values
  • Data type consistency (e.g. converting percentage-based statistics from strings to numerical values).
  2. Handling Missing Data
  • One notable issue was found in the ATP Player Rankings dataset, where weekly ranking updates were not always recorded, leaving intermittent gaps in the timeline. Since player rankings are updated every Monday, missing data could result in misleading trends when analyzing ranking progression.
Code
import pandas as pd
import matplotlib.pyplot as plt

# Load dataset
df = player_seed.copy()
df["ranking_date"] = pd.to_datetime(df["ranking_date"], format="%Y%m%d")

# Generate expected weekly range (ATP rankings update every Monday)
full_date_range = pd.date_range(start=df["ranking_date"].min(), end=df["ranking_date"].max(), freq="W-MON")

# Get missing weeks
missing_weeks = full_date_range.difference(df["ranking_date"].unique())

# Plot rankings and missing weeks
plt.scatter(df["ranking_date"], [1] * len(df), label="Available Weeks", color="blue", marker="o")
plt.scatter(missing_weeks, [1] * len(missing_weeks), label="Missing Weeks", color="red", marker="x")

plt.xlabel("Date")
plt.ylabel("Ranking Weeks")
plt.title("ATP Ranking Weeks: Missing Weeks Visualization")
plt.legend()
plt.xticks(rotation=45)
plt.grid(True)

plt.show()

if missing_weeks.empty:
    print("✅ No missing weeks. All rankings are accounted for.")
else:
    print(f"\n❌ Missing {len(missing_weeks)} of {len(full_date_range)} weeks\n")


❌ Missing 422 of 2680 weeks

To address this, forward filling was applied to impute the missing weeks, on the assumption that a player’s ranking stayed the same until the next recorded update. It is also worth noting that between 2020-03-23 and 2020-08-23 many tournaments could not proceed because of COVID lockdowns, so ATP rankings were frozen during that period.

Code
import pandas as pd

# Load dataset
final_df = player_seed.copy()
final_df["ranking_date"] = pd.to_datetime(final_df["ranking_date"], format="%Y%m%d")

# Sort by player and date
final_df = final_df.sort_values(by=["player", "ranking_date"]).reset_index(drop=True)

# Generate full weekly date range
full_date_range = pd.date_range(start=final_df["ranking_date"].min(), end=final_df["ranking_date"].max(), freq="W-MON")

# Create a dataframe with all player-week combinations
players = final_df["player"].unique()
date_expanded = pd.MultiIndex.from_product([players, full_date_range], names=["player", "ranking_date"])
expanded_df = pd.DataFrame(index=date_expanded).reset_index()

# Merge with original data
merged_df = expanded_df.merge(final_df, on=["player", "ranking_date"], how="left")

# Forward-fill missing values (impute rank and points)
merged_df["rank"] = merged_df.groupby("player")["rank"].ffill()
merged_df["points"] = merged_df.groupby("player")["points"].ffill()

# Remove any rows where `rank` is still NaN (for players who didn't exist in the dataset yet)
merged_df = merged_df.dropna(subset=["rank"])
merged_df = merged_df.reset_index(drop=True)

# Save or return final dataframe
merged_df.to_csv("data/imputed_ranking_data.csv", index=False)

print("✅ Missing dates imputed successfully!")
✅ Missing dates imputed successfully!
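
To make the forward-fill step concrete, here is a minimal sketch on made-up data (two hypothetical player IDs, three Mondays, with the middle week missing). It mirrors the same idea of building every player-week combination and carrying the last known rank forward; the values are illustrative only and are not taken from the ATP dataset.

```python
import pandas as pd

# Toy ranking data (made up for illustration): the week of 2021-01-18 is missing
toy = pd.DataFrame({
    "player": [1, 1, 2, 2],
    "ranking_date": pd.to_datetime(
        ["2021-01-11", "2021-01-25", "2021-01-11", "2021-01-25"]
    ),
    "rank": [1, 2, 2, 1],
})

# Build every player-week combination on a Monday grid
weeks = pd.date_range("2021-01-11", "2021-01-25", freq="W-MON")
grid = pd.MultiIndex.from_product(
    [toy["player"].unique(), weeks], names=["player", "ranking_date"]
)
filled = pd.DataFrame(index=grid).reset_index().merge(
    toy, on=["player", "ranking_date"], how="left"
)

# Carry each player's last known rank forward into the missing week
filled["rank"] = filled.groupby("player")["rank"].ffill()
print(filled)
```

In this toy case, each player's rank for 2021-01-18 is copied from 2021-01-11, which is exactly the assumption that a ranking stays unchanged until the next recorded update.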

Another important issue is that, whilst ATP rankings are mostly complete from 1985 to the present, 1982 is missing and rankings from 1973 to 1984 are especially intermittent, as the raw data was not captured during the scraping process. Therefore, in addition to the forward filling, the data is restricted to the years 2000 to 2024, and the frozen-ranking period in 2020 is removed.

Code
# Load dataset
imputed_rankings_raw = pd.read_csv("data/imputed_ranking_data.csv")
imputed_rankings_raw["ranking_date"] = pd.to_datetime(imputed_rankings_raw["ranking_date"])

# Remove the frozen ranking period
imputed_rankings_raw = imputed_rankings_raw[
    ~((imputed_rankings_raw["ranking_date"] >= "2020-03-23") & (imputed_rankings_raw["ranking_date"] <= "2020-08-23"))
]

# Filter for years between 2000 and 2024
imputed_rankings = imputed_rankings_raw[(imputed_rankings_raw["ranking_date"].dt.year >= 2000) & (imputed_rankings_raw["ranking_date"].dt.year <= 2024)]

#####
# Rank 1 Players and The Number of Weeks at Rank 1
rank_1_df = imputed_rankings[imputed_rankings["rank"] == 1]
rank_1_count = rank_1_df.groupby("player").size().reset_index(name="Weeks at #1")

rank_1_count = rank_1_count.sort_values(by="Weeks at #1", ascending=False)

merged_df = rank_1_count.merge(player_info, left_on="player", right_on="player_id", how="left")

# merged_df[['player_id', 'name_first', 'name_last', 'Weeks at #1']].head(6)

After imputation, we can verify the accuracy of the dataset by examining the top ATP rankings of leading players, particularly Djokovic, Federer, and Nadal. It’s important to note that the forward-filled dataset has been filtered to include only ranking weeks from 2000 onward, so any weeks at No. 1 before 2000 are not counted.

ATP Official Site Stats Vs. Forward Filled Data

Link to ATP Historical Rank 1

Analysis

A Dominating Trio

Number of Grand Slam Victories by Player

Code
final = atp_match_stats[(
            atp_match_stats['tourney_level'] == 'G') & (atp_match_stats['round'] == 'F')] \
                .groupby('winner_name')['tourney_id'] \
                    .count().reset_index()

# Rename and sort
final = final.rename(columns={'tourney_id': 'Grand Slam Wins'}) \
            .sort_values(by='Grand Slam Wins', ascending=False).head(10)

# Set figure
sns.set_theme(style="whitegrid")

# Create a color gradient with the most wins highlighted
colors = sns.color_palette("viridis", len(final))
highlight_color = "purple"  # Special color for the top player
bar_colors = [highlight_color if i == 0 else colors[i] for i in range(len(final))]

ax = sns.barplot(x="Grand Slam Wins", y="winner_name", data=final, palette=bar_colors)

# Add annotations on bars
for index, value in enumerate(final["Grand Slam Wins"]):
    ax.text(value + 0.5, index, str(value), ha="left", va="center", fontsize=12, fontweight="bold", color="black")

# Titles and Labels
plt.xlabel("Number of Grand Slam Titles", fontsize=14, fontweight="bold")
plt.ylabel("Player", fontsize=14, fontweight="bold")
plt.title("Top 10 Grand Slam Winners (1968-2024)", fontsize=16, fontweight="bold", pad=15)

sns.despine(left=True, top=True)
plt.show()

The dominance of the Big Three—Roger Federer, Novak Djokovic, and Rafael Nadal—stands in stark contrast to even some of the sport’s greatest champions such as Pete Sampras, Andre Agassi, and Jimmy Connors, who each laid the foundation for modern tennis.

Their combined Grand Slam tally not only surpasses that of their predecessors but also accounts for a remarkable portion of total Grand Slam titles in the Open Era. Each of them has set records that were once thought untouchable, whether through Federer’s elegant shot-making, Nadal’s clay-court dominance, or Djokovic’s unmatched consistency across all surfaces. These three are the only men in the history of tennis to each amass 20 or more Grand Slam titles.

In contrast, current-generation players, such as Daniil Medvedev, Alexander Zverev, and Carlos Alcaraz, have fallen significantly short in terms of Grand Slam wins - which is hardly surprising, given that the Big Three have monopolized the Grand Slam victories over the past two decades, leaving little room for others to claim major titles.

Code
import pandas as pd
import matplotlib.pyplot as plt

# Load data
df = atp_match_stats.copy()
df["tourney_date"] = pd.to_datetime(df["tourney_date"], format="%Y%m%d")

# Filter for Grand Slam finals between 2003 and 2022
df = df[
    (df['tourney_level'] == 'G') & 
    (df['round'] == 'F') & 
    (df['tourney_date'].dt.year >= 2003) & 
    (df['tourney_date'].dt.year <= 2022)
].copy()

# Count Grand Slam wins per player
df = df.groupby('winner_name')['tourney_id'].count().reset_index()

# Rename and sort
df = df.rename(columns={'tourney_id': 'Grand Slam Wins'}) \
       .sort_values(by='Grand Slam Wins', ascending=False)

# Aggregate total Grand Slam wins per player
grand_slam_totals = df.groupby("winner_name")["Grand Slam Wins"].sum()
big_3 = ["Roger Federer", "Novak Djokovic", "Rafael Nadal"]

# Calculate proportions
big_3_wins = grand_slam_totals[grand_slam_totals.index.isin(big_3)].sum()
total_wins = grand_slam_totals.sum()

big_3_proportion = big_3_wins / total_wins
other_players_proportion = 1 - big_3_proportion

# Data for pie chart
labels = ["Federer, Nadal & Djokovic", "Other Players"]
sizes = [big_3_proportion, other_players_proportion]
colors = ["gold", "lightgrey"]

# Create pie chart
plt.figure(figsize=(8, 8))
plt.pie(sizes, labels=labels, autopct="%1.1f%%", colors=colors, startangle=140, 
        wedgeprops={"edgecolor": "black"}, textprops={"fontsize": 12})
plt.title("Grand Slam Wins: Big 3 vs Other Players (2003-2022)", fontsize=14, fontweight="bold")

# Save plot
plt.savefig("images/pie_chart_big_3.png", dpi=300, bbox_inches="tight")
plt.close()

Since 2003, Roger Federer, Rafael Nadal, and Novak Djokovic have defined an era of dominance in men’s tennis, capturing the vast majority of Grand Slam titles during their time on tour. Their achievements account for a remarkable share of total Grand Slam victories in the Open Era, setting a standard unmatched in the sport’s history. In fact, between 2003 and 2022, the three players alone won 80% of all Grand Slam tournaments held over those two decades.

Their overwhelming success, in comparison to all other players, underscores their sustained excellence and lasting impact on the game - an achievement that would be difficult to emulate for any player in future.

The Rank 1 Tennis Player Across History

Another key measure in tennis is the ATP Ranking. The ATP Ranking is based on a player’s performance over the past 52 weeks, using their best 19 tournament results. Points are awarded based on event prestige, with Grand Slams offering up to 2000 points and smaller tournaments awarding fewer. The ranking updates weekly, with points expiring after a year. Mandatory events include Grand Slams, ATP Masters 1000s, and the ATP Finals, where players can earn extra points. Missing key tournaments without a valid reason can result in penalties. This system rewards consistency and sustained success, influencing seeding, tournament entry, and career opportunities.
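
The ranking rule described above can be sketched in a few lines. This is a deliberately simplified illustration, not the ATP's actual calculation: the function name `ranking_points`, its parameters, and the sample results are hypothetical, and the sketch ignores mandatory-event rules, penalties, and the special treatment of the ATP Finals.

```python
from datetime import date, timedelta


def ranking_points(results, as_of, best_n=19, window_weeks=52):
    """Sum the best `best_n` results earned in the `window_weeks` before `as_of`.

    `results` is a list of (tournament_date, points) tuples.
    """
    cutoff = as_of - timedelta(weeks=window_weeks)
    # Keep only results that have not yet expired
    live = [pts for played, pts in results if cutoff < played <= as_of]
    # Count only the best `best_n` of those results
    return sum(sorted(live, reverse=True)[:best_n])


# Hypothetical results for a single player
results = [
    (date(2024, 1, 28), 2000),  # Grand Slam title
    (date(2024, 3, 17), 1000),  # Masters 1000 title
    (date(2023, 1, 29), 2000),  # more than 52 weeks old on the as-of date, so expired
]
total = ranking_points(results, as_of=date(2024, 6, 3))
print(total)  # 3000
```

The expired result shows why a player must keep defending points: a title from more than a year ago no longer contributes to the total.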

Being Rank 1 in the ATP Rankings means a player has accumulated the most ranking points over the past 52 weeks. Holding the No. 1 ranking is a prestigious achievement, signifying dominance over the competition and granting advantages such as top seeding in tournaments.

To truly illustrate the degree of dominance held by the Big Three players, it is worth looking at the timeline of Rank 1 players, over the past 24 years.

Code
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.colors as mcolors

# Combine first and last names
player_info['player_name'] = player_info['name_first'] + " " + player_info['name_last']
plot_df = imputed_rankings[["player", "ranking_date", "rank"]]

# Sort by player and ranking_date to ensure proper order
plot_df = plot_df.sort_values(by=["player", "ranking_date"])

# Merge with player_info to get player names
player_rankings = plot_df.merge(player_info[['player_id', 'player_name']], left_on='player', right_on='player_id', how='left')

# List of top players for plotting.
top_players = rank_1_count["player"].tolist() 

# Filter the dataset for top players
top_players = player_rankings[player_rankings['player'].isin(top_players)]

############

# Create a pivot table with players as rows and ranking dates as columns
pivot_df = top_players.pivot_table(index='player_name', columns='ranking_date', values='rank', aggfunc='first')

# Keep only the weeks where the player was ranked 1 (everything else becomes NaN)
rank_one_df = pivot_df.where(pivot_df == 1)

# Iterate over each player to plot when they were ranked 1
for player in rank_one_df.index:
    rank_one_dates = rank_one_df.columns[rank_one_df.loc[player].notna()]
    
    # Plot a scatter plot of these dates for the player
    plt.scatter(rank_one_dates, [player] * len(rank_one_dates), label=player, s=100, alpha=0.7)

# Plot
plt.xlabel('Ranking Date')
plt.ylabel('Player')
plt.title('Timeline of Players Ranked #1 (2000 - Current)')
plt.xticks(rotation=45)
plt.tight_layout()

plt.show()

From 2004 to 2024, only four players outside the Big Three managed to reach the world No. 1 spot. In stark contrast, in the four years prior to their reign (2000–2004), a total of seven different players claimed the top ranking, highlighting a much more volatile and competitive era. This shift in the rankings landscape marks a clear turning point in modern tennis, where Federer, Nadal, and Djokovic not only raised the bar but also established an era of unprecedented stability at the sport’s highest level.

Their sheer dominance underscores the significance of skill in tennis—sustained excellence at the top is no coincidence but rather a testament to their extraordinary level of play. It wasn’t just about winning titles; it was about consistently outclassing elite competition across different surfaces, conditions, and eras. The longevity of their success, spanning nearly two decades, speaks volumes about the gap they created between themselves and the rest of the field.

Yet since 2022, a noticeable shift has begun to take place. With Federer and Nadal retired and Djokovic facing tougher competition from the next generation, the era of the Big Three appears to be winding down, with newer players such as Alcaraz and Sinner taking over the No. 1 ranking in the past two years.

The Changing Tennis Landscape

Rallying and Serving

To understand how tennis has evolved over time, it is essential to examine both the changes in the style of play and the factors influencing these shifts. One noticeable change is the increasing length of rallies in modern tennis. While some sources, like BBC News, suggest that rallies aren’t getting much longer overall, a closer look reveals a shift in rally patterns.

Code
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

df = advanced_match_stats.copy()
df["date"] = pd.to_datetime(df["match_id"].str[:8], format="%Y%m%d")

# Filter 
df = df[(df["date"].dt.year >= 1985) & (df["date"].dt.year <= 2022)]

grand_slam_keywords = ["Australian_Open", "French_Open", "Roland_Garros", "Wimbledon", "US_Open"]
df = df[df["match_id"].str.contains("|".join(grand_slam_keywords))]

# Categorize Touches into bins: (0, 3] -> "1-3", (3, inf) -> "4+"
# The lower edge is 0 because the bins are right-closed; an edge of 1 would drop one-touch rallies
bins = [0, 3, float("inf")]
labels = ["1-3", "4+"]
df["Touches_Category"] = pd.cut(df["Touches"], bins=bins, labels=labels, right=True)

# Count rallies per match per category
rally_sum_per_match = df.groupby(["match_id", "date", "Touches_Category"])["Touches"].count().reset_index()

# Remove matches where all categories have Touches = 0 -> Excluding Aces
total_touches_per_match = rally_sum_per_match.groupby(["match_id", "date"])["Touches"].sum()
valid_matches = total_touches_per_match[total_touches_per_match > 0].index
rally_sum_per_match = rally_sum_per_match.set_index(["match_id", "date"]).loc[valid_matches].reset_index()

# Compute the average rally count per year for each category
rally_sum_per_match["year"] = rally_sum_per_match["date"].dt.year

total_touches_per_year = rally_sum_per_match.groupby("year")["Touches"].sum().reset_index()
total_touches_per_year.rename(columns={"Touches": "Total_Touches"}, inplace=True)

avg_rally_per_year = rally_sum_per_match.groupby(["year", "Touches_Category"])["Touches"].sum().reset_index()

# Merge with total touches per year
avg_rally_per_year = avg_rally_per_year.merge(total_touches_per_year, on="year")

# Compute proportion
avg_rally_per_year["Proportion"] = avg_rally_per_year["Touches"] / avg_rally_per_year["Total_Touches"]

# Plot scatter plot with regression line
category_colors = {"1-3": "red", "4+": "purple"}

for category, color in category_colors.items():
    subset = avg_rally_per_year[avg_rally_per_year["Touches_Category"] == category]
    sns.regplot(x=subset["year"], y=subset["Proportion"], scatter=True, label=f"Touches {category}", color=color, order=1)

# Labels and title
plt.xlabel("Year")
plt.ylabel("Proportion of Touches")
plt.title("Proportion of Rally Count in Grand Slam (1985-2022) - Scatter & Regression")
plt.xticks(rotation=45)
plt.legend()
plt.grid()

# Move legend outside the plot
plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
plt.tight_layout()

# Show plot
plt.show()

The proportion of rallies lasting four or more shots has steadily increased since 1985, while the proportion lasting three shots or fewer has decreased. This suggests a move away from the traditional serve-and-volley game toward a style that relies more on baseline play, where players chase down balls and engage in longer exchanges. This shift can likely be attributed to advancements in tennis racquet technology, improvements in player fitness, and the increasing emphasis on rallying from the baseline.
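
As a small illustration of how the two rally categories and their yearly proportions are derived, the sketch below bins made-up rally lengths with `pd.cut` and computes the share of each category per year. The `Touches` column name follows the dataset used above; the numbers themselves are invented for the example.

```python
import pandas as pd

# Made-up rally lengths (touches per point) for two seasons, for illustration only
rallies = pd.DataFrame({
    "year": [1990] * 5 + [2020] * 5,
    "Touches": [1, 2, 2, 3, 6, 1, 3, 5, 7, 9],
})

# Right-closed bins: (0, 3] -> "1-3", (3, inf) -> "4+"
rallies["category"] = pd.cut(
    rallies["Touches"], bins=[0, 3, float("inf")], labels=["1-3", "4+"]
)

# Share of points falling in each category, per year (each row sums to 1)
share = pd.crosstab(rallies["year"], rallies["category"], normalize="index")
print(share)
```

In this toy data the "4+" share rises from 0.2 to 0.6 between the two years, which is the kind of shift the regression lines in the plot are summarising.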

Additionally, there has been a rise in the number of aces per match, signaling the growing importance of powerful serves in today’s game.

Code
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

final_df = atp_match_stats.copy()

# Load data.
final_df["tourney_date"] = pd.to_datetime(final_df["tourney_date"], format="%Y%m%d")
final_df["year"] = final_df["tourney_date"].dt.year

# Filter for relevant years (adjust range as needed)
final_df = final_df[(final_df["year"] >= 1995) & (final_df["year"] <= 2022)]

# Compute the match winner's average aces per match per year (w_ace only; the loser's aces are not included)
avg_aces_per_year = final_df.groupby("year")["w_ace"].mean().reset_index()

# Plot with a quadratic (order=2) trend line
sns.regplot(x=avg_aces_per_year["year"], y=avg_aces_per_year["w_ace"], 
            order=2, scatter=True, line_kws={"color": "red"}, scatter_kws={"color": "blue"})

# Labels and title
plt.xlabel("Year")
plt.ylabel("Average Winner Aces per Match")
plt.title("Average Number of Winner Aces per Match Over the Years")
plt.xticks(rotation=45)
plt.grid()

plt.show()

Average Match Times over the years

Across the tennis world, match durations have steadily trended upwards over the years, reflecting the evolving nature of the sport. This trend is consistent across most major tournaments, where matches now last roughly 20% longer on average than in 1997. Several factors contribute to this change, including the aforementioned advancements in racquet technology, improved player fitness, and a shift in playing styles. Interestingly, serve preparation (the time a player spends bouncing the ball or otherwise getting ready to serve) has also been getting longer.

To visualise the trend in match durations, spline interpolation was used. The make_interp_spline method fits a spline (a type of piecewise polynomial) that passes through every yearly average and is smooth and continuous in between. This makes the overall trend easier to read than straight line segments, although, because the curve goes through each data point exactly, it does not remove year-to-year noise from the data.
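
The behaviour of `make_interp_spline` can be checked on a small example. The sketch below uses made-up yearly averages (not values from the dataset) and confirms that the fitted cubic spline reproduces every input point, so the curve only fills in smooth values between the observations.

```python
import numpy as np
from scipy.interpolate import make_interp_spline

# Made-up yearly average match durations in minutes, for illustration only
years = np.array([2000.0, 2001.0, 2002.0, 2003.0, 2004.0, 2005.0])
minutes = np.array([140.0, 152.0, 147.0, 158.0, 150.0, 161.0])

# Cubic (k=3) interpolating spline through the yearly averages
spline = make_interp_spline(years, minutes, k=3)

# Evaluate on a dense grid to draw a smooth curve between the points
dense_years = np.linspace(years.min(), years.max(), 300)
smooth_minutes = spline(dense_years)

# The spline passes through every original data point
passes_through_points = np.allclose(spline(years), minutes)
print(passes_through_points)
```

If genuine noise reduction were wanted, a smoothing method (for example a rolling mean or a smoothing spline) would be a different technique from the interpolation used here.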

However, it is important to note that the dataset used to track these trends has some gaps. Specifically, data from the 1997 Australian Open is incomplete, and no data is available for the 1998 and 2015 US Open tournaments. Additionally, the 2020 Wimbledon was not held due to the global pandemic, and the 2024 Australian Open data is also incomplete, so it has been excluded from the analysis.

Code
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from scipy.interpolate import make_interp_spline
import mplcursors

# Load dataset
final_df = atp_match_stats.copy()

# Convert tourney_date to datetime format and extract year
final_df["tourney_date"] = pd.to_datetime(final_df["tourney_date"], format="%Y%m%d")
final_df["year"] = final_df["tourney_date"].dt.year

# Filter data (excluding 2024 Australian Open due to incomplete data)
final_df = final_df[(final_df["year"] >= 1997) & (final_df["year"] <= 2023)]

# Remove known missing data cases
excluded_entries = [
    (1997, "Australian Open"),  
    (1998, "US Open"),          
    (2015, "US Open"),          
    (2020, "Wimbledon")         
]

for year, tourney in excluded_entries:
    final_df = final_df[~((final_df["year"] == year) & (final_df["tourney_name"].str.contains(tourney, case=False)))]

# Filter for Grand Slam matches
grand_slam_matches = final_df[final_df["tourney_level"] == "G"]

# Compute average match duration per year for each Grand Slam
avg_minutes_per_tourney = grand_slam_matches.groupby(["year", "tourney_name"])["minutes"].mean().reset_index()

# Define colors and line styles for each Grand Slam
tourney_styles = {
    "Australian Open": {"color": "blue", "linestyle": "solid"},  
    "Roland Garros": {"color": "red", "linestyle": "dotted"},    
    "Wimbledon": {"color": "green", "linestyle": "dashed"},      
    "US Open": {"color": "purple", "linestyle": "dashdot"}       
}

# Plot Settings
fig, ax = plt.subplots(figsize=(8,6), facecolor="white")
ax.set_facecolor("white") 

# Store points for hover functionality
hover_points = []
for tourney, style in tourney_styles.items():
    subset = avg_minutes_per_tourney[avg_minutes_per_tourney["tourney_name"].str.contains(tourney, case=False)]
    
    x = subset["year"].values
    y = subset["minutes"].values

    if len(x) > 3:  
        x_smooth = np.linspace(x.min(), x.max(), 300)  
        spline = make_interp_spline(x, y, k=3)  
        y_smooth = spline(x_smooth)
    else:
        x_smooth, y_smooth = x, y  

    # Plot smoothed lines with different styles
    ax.plot(x_smooth, y_smooth, linestyle=style["linestyle"], color=style["color"], linewidth=2.5, label=tourney)

    # Store original data points for hover tooltips (hidden dots)
    scatter = ax.scatter(x, y, color=style["color"], s=20, alpha=0)  
    hover_points.append(scatter)

# Labels and title
ax.set_xlabel("Year", fontsize=12, fontweight="bold", labelpad=10)
ax.set_ylabel("Average Match Duration (minutes)", fontsize=12, fontweight="bold", labelpad=10)
ax.set_title("Grand Slam Match Durations (1997-2023)", fontsize=14, fontweight="bold", pad=15)

# Grids
ax.yaxis.grid(True, linestyle="--", linewidth=0.6, color="#E0E0E0", alpha=0.5)  
ax.xaxis.grid(False)  

ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)

plt.xticks(fontsize=10)
plt.yticks(fontsize=10)
plt.legend(title="Grand Slam", fontsize=10, title_fontsize=11, loc="upper left", bbox_to_anchor=(1, 1))

cursor = mplcursors.cursor(hover_points, hover=True)
cursor.connect("add", lambda sel: sel.annotation.set_text(f"{int(sel.target[0])}: {sel.target[1]:.1f} min"))

# Show plot
plt.tight_layout()
plt.show()

What makes a Tennis Star?

When we consider what makes a tennis star, the immediate answer often revolves around their ability to dominate major titles, such as Grand Slams—just as Roger Federer, Rafael Nadal, and Novak Djokovic have done. This dominance is often associated with consistently winning matches, accumulating titles, and generating large numbers of total points won across their careers.

However, a deeper look at the statistics reveals an interesting paradox. The variation in the percentage of points won between players is minimal, even for the game’s elite. For instance, a player like Federer, often regarded as one of the sport’s greatest, only wins slightly more than half of the points played in a match.

Code
atp_100_advanced_player_info = pd.read_csv("data/atp_100_advanced_player_info.csv")

# Convert 'TPW%' to numeric by stripping '%' and converting to float
atp_100_advanced_player_info['TPW%'] = atp_100_advanced_player_info['TPW%'].str.rstrip('%').astype(float)

atp_100_advanced_player_info['M W%'] = atp_100_advanced_player_info['M W%'].str.rstrip('%').astype(float)

# Calculate 2.5th and 97.5th percentiles
tpw_2_5th_percentile = atp_100_advanced_player_info['TPW%'].quantile(0.025)
tpw_97_5th_percentile = atp_100_advanced_player_info['TPW%'].quantile(0.975)

# Plot histogram with percentiles
plt.figure(figsize=(8, 5))
atp_100_advanced_player_info['TPW%'].plot(kind='hist', bins=12, edgecolor='black', alpha=0.7)

# Add vertical lines for the percentiles
plt.axvline(tpw_2_5th_percentile, color='r', linestyle='dashed', linewidth=2)
plt.axvline(tpw_97_5th_percentile, color='g', linestyle='dashed', linewidth=2)

# Labels and title
plt.xlim(40, 60)
plt.xlabel('TPW%')
plt.ylabel('Frequency')
plt.title('Distribution of Points Won%')

# Grid for readability
plt.grid(axis='y', linestyle='--', alpha=0.7)

plt.show()

In fact, among the top 100 current players, 95% of them average between 47.44% and 54.02% of total points won per match. Given such a narrow range in points won—just slightly above 50%—one might wonder: How is it possible for players like the Big Three to consistently dominate the sport? The small variation in points won suggests that, almost by chance, even players ranked outside the top 100 could theoretically reach the No. 1 spot!

Code
# Plot scatter plot
plt.scatter(atp_100_advanced_player_info['TPW%'], atp_100_advanced_player_info['M W%'])

# Labels and title
plt.xlabel('TPW%')
plt.ylabel('M W%')
plt.title('Scatter Plot of TPW% vs M W%')

# Show plot
plt.show()

However, upon closer examination, it becomes clear that despite the narrow difference in points won, a higher percentage of points won plays a crucial role in a player’s ability to consistently win sets and ultimately secure victories in matches.
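
One way to see how a small edge in points can translate into a much larger edge in results is to look at a single game. The sketch below is an illustration rather than part of the report's analysis: it assumes every point is won independently with the same probability `p`, which ignores the difference between serving and returning, and the function name `game_win_prob` is introduced here only for the example.

```python
def game_win_prob(p):
    """Probability of winning a game when each point is won independently with probability p."""
    q = 1.0 - p
    # Win before deuce: to love, to 15, or to 30 (the last point is always won)
    before_deuce = p**4 * (1 + 4 * q + 10 * q**2)
    # Reach deuce at three points each, then win two points in a row before the opponent does
    reach_deuce = 20 * p**3 * q**3
    win_from_deuce = p**2 / (p**2 + q**2)
    return before_deuce + reach_deuce * win_from_deuce


for p in (0.50, 0.52, 0.55):
    print(f"point win probability {p:.2f} -> game win probability {game_win_prob(p):.3f}")
```

Under these assumptions, winning 55% of points corresponds to winning roughly 62% of games, and the same amplification repeats again at the set and match level.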

To understand why this is the case, we can look at a few possible explanations:

  • Not all points in tennis carry the same weight. Some points are more critical than others, and top players excel at capitalizing on these pivotal moments.

  • Specific skills in tennis are essential for consistently winning points.

A straightforward way to test this hypothesis is by using a correlation matrix, which allows us to explore the relationships between important variables and understand how various stats may influence each other.

Investigation

For the purposes of this report, we examine the effects of the following variables on an individual player’s average Match Win Rate:

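  • BPConv%: Percentage of Break Point opportunities converted. A break point is a point which, if won by the returner, wins them the opponent’s service game.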

  • BPSvd%: Percentage of Break Points saved on Serve.

  • RPW: Percentage of Return Points won.

  • Brk%: Break Rate, which is the percentage of return games won.

  • Ace%: Ace Rate, which is the percentage of service points won with an ace (a legal serve that the opponent does not manage to touch).

  • Hld%: Hold Rate, which is the percentage of service games won.

  • 1stIn: Percentage of First Serves In.

  • TB W%: Tiebreak winning percentage.

Code
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Reload the raw data so that every column is still stored as a percentage string
atp_100_advanced_player_info = pd.read_csv("data/atp_100_advanced_player_info.csv")

# Convert the variables of interest to numeric by stripping the trailing '%'
variables = ['M W%', 'BPConv%', 'BPSvd%', 'RPW', 'Brk%', 'Ace%', 'Hld%', '1stIn', 'TB W%']
for var in variables:
    atp_100_advanced_player_info[var] = atp_100_advanced_player_info[var].str.rstrip('%').astype(float)

# Calculate the correlation matrix
correlation_matrix = atp_100_advanced_player_info[variables].corr()

# Plot the correlation matrix using seaborn heatmap
plt.figure(figsize=(10, 7))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5, cbar=True)

# Set title
plt.title('Correlation Matrix of Tennis Statistics')

# Show the plot
plt.show()

Observations from the correlation matrix reveal that Hold Rate, Tiebreak Winning Percentage, and Break Rate exhibit some of the highest correlations with Match Win Rate, with values of 0.61, 0.54, and 0.47, respectively. These metrics are crucial as they represent key moments within a match: winning your own service games (Hold Rate), prevailing in tiebreaks (Tiebreak Winning Percentage), and winning games on the opponent’s serve (Break Rate). Each of these areas has a direct impact on a player’s ability to maintain momentum and secure victories.

An interesting pattern emerges when examining the correlation between Break Rate and Percentage of Return Points Won. The high correlation between these two variables suggests that a player’s ability to break serve is closely tied to their effectiveness on return points. That is, players who win a higher proportion of return points are more likely to break their opponent’s serve, which ultimately contributes to a higher overall break rate.

However, this high level of correlation also suggests multicollinearity, which makes it difficult to separate out the individual effects of the two variables on the Match Win Rate. In a regression, this could inflate the standard errors of the coefficients, making it harder to assess the true effect of each variable on the outcome. Whilst we may not be building a predictive model for our analysis, it is noteworthy enough to consider either removing one of the variables, or applying other techniques such as Principal Component Analysis to merge the two variables going forward.
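One common way to quantify multicollinearity is the variance inflation factor (VIF), which can be read off the diagonal of the inverse correlation matrix. The sketch below is illustrative only and was not run on the report’s data: it uses synthetic numbers, the column names simply echo RPW, Brk% and 1stIn, and `vif_table` is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def vif_table(df):
    """Variance inflation factor per column: the diagonal of the inverse correlation matrix."""
    corr = df.corr().to_numpy()
    vif = np.diag(np.linalg.inv(corr))
    return pd.Series(vif, index=df.columns, name="VIF")


# Synthetic stand-in data (NOT the report's data): two strongly related columns and one unrelated
rng = np.random.default_rng(0)
rpw = rng.normal(38, 2, size=100)
demo = pd.DataFrame({
    "RPW": rpw,
    "Brk%": 1.5 * rpw - 35 + rng.normal(0, 1, size=100),
    "1stIn": rng.normal(62, 3, size=100),
})

vif = vif_table(demo)
print(vif.round(2))
```

A VIF close to 1 means a variable is largely independent of the others. Values above roughly 5 to 10 are often treated as a warning sign, although that threshold is a rule of thumb rather than a strict test.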

Past and Present - Player Performance

When comparing player performance across generations, one key area of focus is how new-generation players are measuring up to the Big 3, particularly in metrics like Percentage of Return Points Won. In a 2020 study by Tim Roback & Nick Anderson from Tennis Project, it was found that Percentage of Return Points Won showed a strong positive correlation with individual player match win rates.

What stands out is the impressive consistency with which players like Federer, Nadal, and Djokovic have maintained some of the highest Return Points Won percentages, typically ranging from 40% to 42%, a significant benchmark in the sport.

With Federer and Nadal now retired, Djokovic continues to perform at a high level, maintaining a return win rate close to 41%. However, new rising stars like Carlos Alcaraz and Jannik Sinner are swiftly emerging as the successors to these legends. Both players are demonstrating similar performances and achieving victories comparable to what Nadal and Federer once did, making it increasingly challenging for Djokovic to maintain the #1 spot. In this evolving dynamic, the torch appears to be passing to a new era of tennis champions.

Code
# Return Point winning percentage
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import linregress

# Plot the scatter plot for RPW vs M W%
plt.figure(figsize=(8, 5))
plt.scatter(atp_100_advanced_player_info['RPW'], atp_100_advanced_player_info['M W%'], alpha=0.7)

# Add regression line
slope, intercept, r_value, p_value, std_err = linregress(atp_100_advanced_player_info['RPW'], atp_100_advanced_player_info['M W%'])

# Create the regression line values
regression_line = slope * atp_100_advanced_player_info['RPW'] + intercept

# Plot the regression line (labelled so that plt.legend() has an entry to display)
plt.plot(atp_100_advanced_player_info['RPW'], regression_line, color='red', label=f'Linear fit (r = {r_value:.2f})')

# Annotate selected players on the plot
players = ['Jannik Sinner', 'Novak Djokovic', 'Carlos Alcaraz']
for player in players:
    # Match on a substring of the name, since the 'Player' column may carry extra text such as a country code
    player_data = atp_100_advanced_player_info[atp_100_advanced_player_info['Player'].str.contains(player, case=False, na=False)]
    
    # Check if player data is found
    if not player_data.empty:
        plt.text(player_data['RPW'].values[0] - 0.3, player_data['M W%'].values[0] - 0.5, player,
                 fontsize=9, ha='right', color='black',
                 bbox=dict(facecolor='white', edgecolor='black', boxstyle='round,pad=0.3'))
    else:
        print(f"Player {player} not found in the DataFrame.")

# Labels and title
plt.xlabel('RPW')
plt.ylabel('M W%')
plt.title('Scatter Plot of RPW vs M W%')

# Show plot
plt.legend()
plt.show()

Rising Star

While we can continue to analyze the previously defined metrics individually (as we did with RPW) to assess which players perform the best, the high dimensionality of the data makes it challenging to view everything at once in a single visualization. Additionally, as mentioned earlier, some of the variables used in constructing the model exhibit signs of multicollinearity.

To better understand which groups of attributes align players with similar performance characteristics, we must address the complexity of the data. Although each variable provides valuable insight, the scatter plots and simple linear regressions used so far only allow us to examine the relationship between two variables at a time, making it difficult to identify clusters of players.

To overcome this limitation, I apply Principal Component Analysis (PCA) for dimensionality reduction. PCA enables us to condense the high-dimensional data into a more manageable form, allowing us to visually identify any clusters of strong-performing players. This approach will help us uncover hidden patterns that might otherwise be obscured.

Code
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Load Data
atp_100_advanced_player_info = pd.read_csv("data/atp_100_advanced_player_info.csv")

# Select the relevant variables and convert percentage columns to float
variables = ['BPConv%', 'BPSvd%', 'RPW', 'Brk%', 'Ace%', 'Hld%', '1stIn', 'TB W%']
for var in variables:
    atp_100_advanced_player_info[var] = atp_100_advanced_player_info[var].str.rstrip('%').astype(float)

# Select only the variables needed for PCA
X = atp_100_advanced_player_info[variables]

# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Perform PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Create a DataFrame for the first two principal components
pca_df = pd.DataFrame(X_pca, columns=['PC1', 'PC2'])

# Add the player names to the PCA DataFrame
pca_df['Player'] = atp_100_advanced_player_info['Player']

######## Get the PC Loadings Visualisation

# Get the loadings (components) for the first two PCs
loadings = pca.components_

# Create a DataFrame for the loadings of PC1 and PC2
loadings_df = pd.DataFrame(loadings.T, columns=['PC1', 'PC2'], index=variables)

# Plot the loadings for PC1 and PC2
plt.barh(loadings_df.index, loadings_df['PC1'], color='b', alpha=0.7, label='PC1')
plt.barh(loadings_df.index, loadings_df['PC2'], color='r', alpha=0.7, label='PC2')

# Add labels and title
plt.xlabel('Loading Value')
plt.title('Loadings of the First Two Principal Components (PC1 and PC2)')
plt.axvline(0, color='black',linewidth=0.5)

# Display the legend
plt.legend()

# Show the plot
plt.show()

After applying Principal Component Analysis (PCA) to reduce the dimensionality of the data, we can examine the loadings of the first two PCs (principal components) to understand which variables contribute the most to the variance captured by these components. These loadings indicate the strength and direction of each variable’s relationship with the principal components, essentially revealing how the original features were combined.

Principal Component 1:

  • In this case, Hold Rate (Hld%), Ace Rate (Ace%), and Break Points Saved (BPSvd%) have the highest positive loadings on PC1, suggesting that this principal component strongly represents players who excel in serving and holding their service games.

  • Conversely, Break Rate (Brk%), Break Point Conversion (BPConv%) and Return Points Won (RPW) have the highest negative loadings, indicating that PC1 also captures players who thrive on breaking opponents’ serves—but in the opposite direction of strong servers. Players with high PC1 scores are likely dominant servers, while those with low PC1 scores are more effective returners.

Principal Component 2:

  • Unlike PC1, Tiebreak Win Percentage (TB W%), Hold Rate (Hld%), and Break Points Saved (BPSvd%) have the highest negative loadings, suggesting that players who excel in these areas are positioned on one end of this component.

  • Notably, there are no significant positive loadings, meaning PC2 does not strongly represent any particular attributes in the opposite direction.
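To illustrate how loadings combine the original features, here is a NumPy-only sketch of what PCA computes. It runs on synthetic data rather than the player statistics, so the numbers carry no tennis meaning. The names `raw`, `loadings` and `scores` are introduced for illustration, and the signs of the components may differ from those returned by scikit-learn.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for the stats matrix: 100 "players" x 3 "metrics"
raw = rng.normal(size=(100, 3))
raw[:, 1] = 0.8 * raw[:, 0] + 0.2 * raw[:, 1]  # make two metrics correlated

# Standardize each column to zero mean and unit variance
X_scaled = (raw - raw.mean(axis=0)) / raw.std(axis=0)

# PCA via SVD: the rows of Vt are the principal directions, i.e. the loadings
U, S, Vt = np.linalg.svd(X_scaled, full_matrices=False)
loadings = Vt[:2]                # shape (2, n_features), laid out like pca.components_
scores = X_scaled @ loadings.T   # each row holds one player's PC1 and PC2 coordinates

# One player's PC1 score is a weighted sum of their standardized stats
player0_pc1 = float(np.dot(X_scaled[0], loadings[0]))
print(player0_pc1, scores[0, 0])
```

Read this way, a large positive loading means that an above-average value in that metric pushes a player’s score up on that component, while a large negative loading pushes it down.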

Next, I plot the loadings of both principal components together with the player scores in a biplot, to gain insight into the patterns that differentiate top-performing players from others.

Code
def biplot(pca, X_pca, labels, variables):
    plt.figure(figsize=(9, 7))
    pc1, pc2 = 0, 1  # First two principal components

    # Scatter plot of the projected data
    plt.scatter(X_pca[:, pc1], X_pca[:, pc2], alpha=0.5)

    # Add the variable loadings (vectors)
    for i, var in enumerate(variables):
        plt.arrow(0, 0, pca.components_[pc1, i] * 3, pca.components_[pc2, i] * 3, 
                  color='r', alpha=0.75, head_width=0.1, length_includes_head=True)
        plt.text(pca.components_[pc1, i] * 3.5, pca.components_[pc2, i] * 3.5, 
                 var, color='r', fontsize=10, ha='center', va='center', fontweight='bold')

    # Highlight the players of interest, matched by substring against the labels passed in
    players = ['Sinner', 'Djokovic', 'Alcaraz', 'Raphael Collignon', 'Medvedev', 'Zverev', 'Berrettini', 'de Minaur', 'Fritz']
    for player in players:
        matches = labels[labels.str.contains(player, case=False, na=False)]
        if not matches.empty:
            # Convert the first match's index label to a row position in X_pca
            idx = labels.index.get_loc(matches.index[0])
            plt.scatter(X_pca[idx, pc1], X_pca[idx, pc2], color='black', s=100)
            plt.text(X_pca[idx, pc1] - 0.5, X_pca[idx, pc2] - 0.5, player, fontsize=10, color='black')

    plt.xlabel(f"PC{pc1+1} ({pca.explained_variance_ratio_[pc1]*100:.2f}%)", fontsize=12)
    plt.ylabel(f"PC{pc2+1} ({pca.explained_variance_ratio_[pc2]*100:.2f}%)", fontsize=12)
    plt.title("PCA Biplot (First Two PCs)", fontsize=14, fontweight='bold')
    plt.axhline(0, color='gray', linestyle='--', alpha=0.5)
    plt.axvline(0, color='gray', linestyle='--', alpha=0.5)
    plt.grid(True)
    plt.show()

# Draw the biplot for the Top 100 players
biplot(pca, X_pca, atp_100_advanced_player_info['Player'], variables)

Upon examining the biplot for the current Top 100 ATP players and highlighting some of the strongest and most well-known competitors—such as Djokovic, Sinner, Alcaraz, and Zverev—we observe three distinct clusters based on the selected performance metrics:

  • Serve-Dominant Players – These players excel in Hold Rate (Hld%), Ace Rate (Ace%), and Break Points Saved (BPSvd%), indicating a strong ability to maintain their service games. Examples include Berrettini and Fritz.

  • Return-Oriented Players – This group is characterized by high Break Rate (Brk%), Break Point Conversion (BPConv%), and Return Points Won (RPW), showcasing their effectiveness in breaking opponents’ serves, such as de Minaur, Medvedev, or Alcaraz (who is well regarded for his ability to chase down balls).

  • All-Court Players – These players have large negative scores on PC2, which we previously identified as associated with Tiebreak Win Percentage (TB W%), Hold Rate (Hld%), and Break Points Saved (BPSvd%). Notably, they perform at a high level across multiple aspects of the game, demonstrating well-rounded dominance. Examples include Djokovic, Zverev and, leading by far, the current Rank 1 player, Jannik Sinner.

For reference, I have also highlighted Raphael Collignon, who was ranked 92 as at the first week of March 2025.
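The grouping above was read off the plot by eye. As a rough sketch of how it could be made reproducible, the hypothetical helper below labels a player from their two PC scores. It assumes the sign conventions described earlier (high PC1 for servers, low PC1 for returners, strongly negative PC2 for all-court players). The cutoff of -1.0 and the example scores are arbitrary illustrations, not values taken from the analysis.

```python
import numpy as np


def label_style(pc1, pc2, pc2_cutoff=-1.0):
    """Assign an illustrative playing-style label from a player's PC1 and PC2 scores."""
    if pc2 <= pc2_cutoff:
        return "all-court"
    return "serve-dominant" if pc1 > 0 else "return-oriented"


# Made-up scores for three hypothetical players (columns: PC1, PC2)
example_scores = np.array([
    [2.1, 0.3],
    [-1.8, 0.4],
    [0.2, -2.5],
])

styles = [label_style(pc1, pc2) for pc1, pc2 in example_scores]
print(styles)
```

A rule like this is sensitive to the chosen cutoff, so it is best treated as a way to sanity-check the visual reading rather than as a formal clustering.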

Code
# Scree plot: refit PCA with all components to see how much variance each one explains
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Fit on the standardized data from earlier; a separate name keeps the two-component fit intact
pca_full = PCA()
X_pca_all = pca_full.fit_transform(X_scaled)

# Extract the explained variance ratio for all principal components
explained_variance_ratio = pca_full.explained_variance_ratio_

# Plot the explained variance for all PCs (Elbow Method)
# plt.figure(figsize=(10, 6))

# Plot the explained variance for each principal component
plt.plot(range(1, len(explained_variance_ratio) + 1), explained_variance_ratio, marker='o', color='b', label='Explained Variance')
plt.xlabel('Principal Component')
plt.ylabel('Explained Variance Ratio')
plt.title('Explained Variance by Each Principal Component (Elbow Method)')
plt.grid(True)

# Add a vertical line to indicate where the "elbow" occurs
plt.axvline(x=4, color='r', linestyle='--', label="Elbow (Optimal Number of Components)")

# Show the plot
plt.legend()
plt.show()

# for i, ratio in enumerate(explained_variance_ratio, 1):
#     print(f"PC{i}: {ratio:.4f} ({ratio*100:.2f}%)")

As a side note, whilst I have visualized only the first two principal components to provide a clear graphical representation, it is important to note that PCA inherently reduces dimensionality, meaning some information from the original data is lost in the process. By projecting complex, high-dimensional data onto a simpler plane, we prioritize interpretability at the expense of completeness.

In this case, the first two principal components explain approximately 67.67% of the total variance. However, analyzing the elbow of the curve suggests that the optimal number of components would be four, which would capture around 90% of the total variance, providing a more comprehensive representation of player performance.
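The number of components needed to reach a target share of variance can also be computed directly from the cumulative explained variance, as a complement to reading the elbow by eye. The ratios below are made-up illustrative values, not the ones produced by this report’s PCA, and `target` is an assumed threshold.

```python
import numpy as np

# Illustrative explained-variance ratios for eight components (NOT the report's values)
example_ratios = np.array([0.42, 0.26, 0.14, 0.09, 0.04, 0.03, 0.015, 0.005])

# Running total of the variance explained by the first k components
cumulative = np.cumsum(example_ratios)

# Smallest number of components whose cumulative share reaches the target
target = 0.90
n_needed = int(np.argmax(cumulative >= target)) + 1

print(f"Components needed for {target:.0%} of variance: {n_needed}")
```

With the real analysis, the same calculation could be applied to the `explained_variance_ratio_` attribute of the fitted PCA. scikit-learn’s `PCA` also accepts a fraction between 0 and 1 as `n_components` to select components by explained variance.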

Concluding Remarks

From the early 2000s, men’s tennis was dominated by three legendary players: Federer, Nadal, and Djokovic. However, as the sport evolves, demanding greater physicality through longer rallies and extended match durations, a new generation of players is slowly beginning to take their place. Rising stars like Alcaraz and Sinner are showcasing a level of dominance reminiscent of the Big Three in their prime, making it clear why they are regarded as the future of tennis, with Sinner leading the game by far at the moment.

Of course, in tennis, there are no guaranteed victors. The sport is constantly evolving, and new contenders continue to emerge. For example, Alex de Minaur, a promising Australian star, recently became the first Australian man to reach an Australian Open quarterfinal since Nick Kyrgios. Meanwhile, young talents like Learner Tien have shown remarkable consistency, defeating top-ranked ATP players such as Medvedev and Zverev despite being ranked outside the top 100.

But as the game continues to evolve, one thing remains certain: the relentless pursuit of excellence will always define the greatest champions.