Kaggle PUBG Competition Data Analysis

Currently there is a fun competition running over on the Kaggle Data Science website.

The objective is to use metrics from a large dataset of PlayerUnknown's Battlegrounds (PUBG) matches to build a model that predicts player performance in the game. This blog post covers my exploratory data analysis of the dataset.

First, of course, we need to load some Python libraries and the data we plan to analyse.


# library import
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os, sys

# data import
pubg_train = pd.read_csv('train_V2.csv')
pubg_test = pd.read_csv('test_V2.csv')

One of the very first things to do is to check that the training and testing sets are drawn from the same distribution. This can easily be done by comparing the basic statistics given by describe for the two datasets. If we subtract the testing set’s describe dataframe from the training set’s and then divide by the training set’s, the resulting values should be close to zero for each mean and quartile figure if the two sets contain similar distributions, and this is indeed what we see.

pubg_train_stats = pubg_train.describe()
pubg_test_stats = pubg_test.describe()
# the test set has no target column, so drop it before comparing
pubg_train_stats = pubg_train_stats.drop(columns = 'winPlacePerc')
train_test_difference = (
    pubg_train_stats - pubg_test_stats) / pubg_train_stats
train_test_difference
Sample of the subtraction of the test data from the training data showing the low variances, demonstrating similar distributions

Having established that the sets are drawn from the same distribution, we shall focus on the training set for visualisation and analysis. Let’s start with a correlation plot to identify the most significant variables.

# numeric_only restricts the correlation to numeric columns
# (avoids errors from the id and matchType string columns on newer pandas)
corr_vals = pubg_train.corr(numeric_only=True)
fig, axes = plt.subplots(figsize=(15,15))
sns.heatmap(corr_vals, ax = axes, cmap="rainbow", annot=True);
Correlation plot for the PUBG training set

Looking at the bottom row we can see strong correlation between our target variable “winPlacePerc” and variables such as boosts, killPlace (negatively, since a lower kill-placement rank is better), walkDistance and weaponsAcquired, and reasonable correlation with a number of other variables. We can also clearly see that some variables are not well correlated with our target and are likely to be of less use to us.

These results make considerable intuitive sense. PUBG is a battle-royale-style game in which up to 100 players try to stay alive. Players start with no weapons, so acquiring weapons will obviously be useful, as will killing enemies, since otherwise they will kill you. The game also features a shrinking play area which forces players to travel, accounting for the usefulness of walking or otherwise covering long distances.
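
As a quick numeric complement to the heatmap (a small addition of my own, reusing the corr_vals dataframe computed above), we can rank the variables by the absolute value of their correlation with the target:

# rank variables by the absolute value of their correlation with
# the target, dropping the target's self-correlation of 1.0
target_corr = corr_vals['winPlacePerc'].drop('winPlacePerc')
print(target_corr.abs().sort_values(ascending=False).head(10))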

Next let’s plot some two-dimensional KDE plots of the most significant variables against the target variable. Doing this on the entire dataset takes an exorbitant amount of time and memory. As the data comes pre-shuffled, we can simply take a 1% sample to give us a small enough dataset to work with more easily.

# let's see how each individual variable is related to the target variable
# note that to finish in a reasonable time we need to take a subset:
# every 100th row gives a 1% sample of the pre-shuffled data
pubg_small = pubg_train.iloc[::100]

# select interesting correlated columns
column_list = [
'damageDealt', 'DBNOs', 'heals',
'killPlace', 'kills',
'killStreaks', 'rideDistance', 'walkDistance',
'weaponsAcquired']
pubg_clipped = pubg_small[column_list+['winPlacePerc']]

# clip off extremes to get nice plots
pubg_clipped = pubg_clipped.clip(
    lower=None, upper= pubg_clipped.quantile(0.999),
    axis = 1)

# cycle through columns to look at correlation
for column in column_list:
    sns.jointplot(x = column, y = "winPlacePerc", 
        data = pubg_clipped, kind = "kde")

A variety of KDE plots showing differing relationships between variables and winPlacePerc

The plots show a variety of different relationships to winPlacePerc. Some of these look reasonably linear; however, some seem rather more complicated. We should check this, since if the data is non-linear we may need a more sophisticated model to account for it. The easiest way is to fit a polynomial regression to the data and see whether we get a monotonically increasing line. If not, it is unlikely that linear regression will give us a good fit.


# cycle through columns to look at how linear the correlation is
for column in column_list:
    fig, axes = plt.subplots(figsize=(6,6))
    ax = sns.regplot(x = column, y = "winPlacePerc", data = pubg_clipped,
        scatter_kws={"s": 80}, order=3, line_kws={'color':'red'},
        robust = False, ci=None, truncate=True)

From these we can see that while some variables like walkDistance and kills could be approximated by a linear relationship, others such as weaponsAcquired have a definite sweet spot for high winPlacePerc scores, and others such as heals have a complicated relationship, possibly suggesting a number of competing strategies being used by effective players. It seems likely that we will need to base our model on something more sophisticated than linear regression for this challenge.
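
One way to quantify this observation (my own sketch, not part of the original analysis) is to compare each variable’s Pearson correlation, which captures only linear association, with its Spearman rank correlation, which captures any monotonic association. A Spearman value well above the Pearson value hints at a monotonic but non-linear relationship, while a low Spearman value suggests the relationship is not even monotonic:

# compare linear (Pearson) and monotonic (Spearman) association
# for each candidate variable against the target
for column in column_list:
    pearson = pubg_clipped[column].corr(pubg_clipped['winPlacePerc'])
    spearman = pubg_clipped[column].corr(
        pubg_clipped['winPlacePerc'], method='spearman')
    print(f'{column}: pearson = {pearson:.2f}, spearman = {spearman:.2f}')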

Now let’s try to conduct some feature engineering. A number of possibilities spring to mind:

  • Some modes are played in teams, so perhaps we should look at team aggregate variables
  • We could normalise the variables with respect to the average scores in the matches they correspond to
  • Perhaps we could extract information about player skill by comparing ratios between pickups and distance walked, or kills and headshot kills

Let’s implement these and see if our engineered variables are more highly correlated with our target variable.

# set up our feature engineering labels
# list of the variables suspected to be significant in the data analysis
variables = ['killPlace', 'boosts', 'walkDistance', 'weaponsAcquired', 
    'damageDealt', 'heals', 'kills', 'longestKill', 'killStreaks', 
    'rideDistance','rampage', 'lethality', 'items', 'totalDistance']

keep_labels = variables + ['matchId','groupId', 
    'matchType', 'winPlacePerc']

def feature_engineering(pubg_data):
    '''FEATURE ENGINEERING
    GIVEN: a PUBG dataframe, which must have a dummy 'winPlacePerc'
    column if it is a test set. Conducts feature engineering including:
    deriving new player-skill ratios, producing group and match
    aggregate data, and normalising data with the relevant match stats.
    RETURNS: a pubg_engineered dataframe consisting of the feature
    engineered input columns plus the target column 'winPlacePerc'
    (the dummy column if a test set)
    '''

    # total the pickups
    pubg_data['items'] = pubg_data[
        'heals'] + pubg_data['boosts'] + pubg_data["weaponsAcquired"]

    # total the distance
    pubg_data['totalDistance'] = pubg_data[
        'rideDistance'] + pubg_data[
        'swimDistance'] + pubg_data['walkDistance']

    # estimate accuracy of players
    # (kills of 0 gives 0/0 = NaN, which we tidy up to 0;
    # guard against inf as well, just in case)
    pubg_data['lethality'] = (pubg_data['headshotKills']
        / pubg_data['kills']).replace(np.inf, 0).fillna(0)

    # estimate how players behave in shootouts
    pubg_data['rampage'] = (pubg_data['killStreaks']
        / pubg_data['kills']).replace(np.inf, 0).fillna(0)

    # reduce dataframe to the columns we want to use
    pubg_data = pubg_data[keep_labels]

    # use groupby to get means for each team
    pubg_group_means = pubg_data.groupby(
        ['matchId','groupId']).mean(numeric_only=True).reset_index()

    # use groupby to get means of each match
    pubg_match_means = pubg_data.groupby(
        ['matchId']).mean(numeric_only=True).reset_index()

    # merge back in leaving columns unchanged for one set to allow for 
    # future suffixing (only affects shared columns)
    pubg_engineered = pd.merge(pubg_data, pubg_group_means,
    suffixes=["", "_group"], how = "left", on = ['matchId', 'groupId'])
    pubg_engineered = pd.merge(pubg_engineered, pubg_match_means,
    suffixes=["_player", "_match"], how = "left", on = ['matchId'])

    # norm the player variables
    for variable in variables:
        pubg_engineered[variable+'_norm'] = pubg_engineered[
            variable+'_player']/(pubg_engineered[variable+'_match']+0.1)

    # norm the group variables
    for variable in variables:
        pubg_engineered[variable+'_g_norm'] = pubg_engineered[
            variable+'_group']/(pubg_engineered[variable+'_match']+0.1)

    # one hot encode the matchTypes since different matches
    # may follow different logics
    one_hot = pd.get_dummies(pubg_engineered['matchType'])

    # Drop matchType column as it is now encoded
    pubg_engineered = pubg_engineered.drop('matchType',axis = 1)

    # Join the encoded df
    pubg_engineered = pubg_engineered.join(one_hot)
    pubg_engineered.drop(columns = ['winPlacePerc_group',
        'winPlacePerc_match'], inplace = True)
    pubg_engineered.rename(columns = {
        'winPlacePerc_player': 'winPlacePerc'}, inplace = True)
    pubg_engineered = pubg_engineered.reset_index(drop=True)

    return pubg_engineered

We can use this function to engineer our new variables. Having done this, we can then draw correlation plots between the differently engineered forms of each variable to see which ones are most significant.

# must run feature engineering on the full set to get correct group and match means
pubg_engineered = feature_engineering(pubg_train)
# group related columns together
pubg_engineered = pubg_engineered.sort_index(axis=1)

# grab our columns
available_columns = list(pubg_engineered.columns.values)
# work out where each group of variables we want to compare starts
# (indices chosen by inspecting the sorted column list)
start_values = [0,7,17,22,27,32,37,42,47,59,64,73,78,83]

# loop over our subsets creating correlation plots
for start in start_values:
    column_selection = available_columns[
        start: start+5] + ['winPlacePerc']
    corr_vals = pubg_engineered[column_selection].corr()

    fig, axes = plt.subplots(figsize=(5,5))
    sns.heatmap(corr_vals, ax = axes,
        cmap="rainbow", annot=True);

The correlation plots clearly show that the grouped and normalised features have a higher correlation with winPlacePerc than the raw variables did before engineering. We should clearly consider including them in our model.
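
To make this comparison concrete, here is a small illustrative snippet (my own addition; the suffixes follow the naming used in feature_engineering above) that prints, for each base variable, the absolute correlation of its raw player value alongside its best engineered form:

# for each base variable, compare |correlation| with the target of
# the raw player value against its group and match-normalised forms
target = pubg_engineered['winPlacePerc']
for variable in variables:
    forms = [variable + suffix for suffix in
             ['_player', '_group', '_norm', '_g_norm']]
    corrs = {form: abs(pubg_engineered[form].corr(target))
             for form in forms}
    best = max(corrs, key=corrs.get)
    print(f"{variable}: player = {corrs[variable + '_player']:.2f}, "
          f"best = {best} ({corrs[best]:.2f})")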

For brevity, not all the variables have been illustrated here. If you wish to investigate more closely yourself, the notebook is available on Kaggle.
