{"id":233,"date":"2018-09-21T17:01:54","date_gmt":"2018-09-21T17:01:54","guid":{"rendered":"https:\/\/justinmatters.co.uk\/wp\/?p=233"},"modified":"2018-10-11T12:28:33","modified_gmt":"2018-10-11T12:28:33","slug":"the-danger-of-missing-values-in-data-sets","status":"publish","type":"post","link":"https:\/\/justinmatters.co.uk\/wp\/the-danger-of-missing-values-in-data-sets\/","title":{"rendered":"The Danger of Missing Values in Data Sets"},"content":{"rendered":"<p>Missing values in data sets may not seem like too much of a problem at first glance. The rows can be ignored, average values can be imputed or the data can be marked as missing. However, applying the wrong strategy can lead to serious errors of interpretation. Let's take a look at a concrete example in that old favourite of a data set, the <a href=\"https:\/\/www.kaggle.com\/c\/titanic\/data\">Titanic survival training data set<\/a>. Note that for simplicity only the training data from the Kaggle challenge is being examined here.<\/p>\n<p>First let's import our data obtained from the <a href=\"https:\/\/www.kaggle.com\/c\/titanic\/data\">link<\/a> above and take a quick look at it:<\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\n# Import libraries\r\nimport matplotlib\r\nimport pandas as pd\r\nimport numpy as np\r\nimport seaborn as sns\r\n\r\n# Enable inline plotting (Jupyter notebook magic)\r\n%matplotlib inline\r\n\r\n# Import our data\r\ntitanic_train = pd.read_csv(&quot;train.csv&quot;)\r\n# Make the Survived column more human readable\r\ntitanic_train&#x5B;&quot;Survived&quot;] = titanic_train&#x5B;&quot;Survived&quot;].map(\r\n    {1: &quot;Yes&quot;, 0: &quot;No&quot;})\r\n# Then display\r\ntitanic_train.head()\r\n\r\n<\/pre>\n<p>We can see that some passengers have a cabin assigned to them. Looking through the data, we realise that the first letter of the cabin is the deck assignment. 
A little historical research indicates that deck may be correlated with passenger class, since first class had the top decks (A-E), second class decks D-F, and third class decks E-G. Which deck people were on seems likely to affect survival (after all, the lifeboats were on the upper decks), so we decide to investigate deck assignments more closely.<\/p>\n<p>First we need to extract the deck information, then we can visualise the relationship between deck and survival.<\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\n# Create a new column using the first letter of the Cabin column\r\n# (rows with no cabin stay NaN after index alignment)\r\ntitanic_train&#x5B;&quot;Deck&quot;] = (\r\n    titanic_train&#x5B;&quot;Cabin&quot;].dropna().astype(str).str&#x5B;0].str.upper()\r\n)\r\n# Remove the anomalous deck assignment &quot;T&quot; (Deck column only)\r\ntitanic_train.loc&#x5B;titanic_train&#x5B;&quot;Deck&quot;] == &quot;T&quot;, &quot;Deck&quot;] = np.nan\r\n\r\n# Plot survival versus deck\r\nsns.countplot(x=&quot;Survived&quot;, data=titanic_train, hue=&quot;Deck&quot;,\r\n    hue_order=&#x5B;&quot;A&quot;, &quot;B&quot;, &quot;C&quot;, &quot;D&quot;, &quot;E&quot;, &quot;F&quot;, &quot;G&quot;],\r\n    order=&#x5B;&quot;Yes&quot;, &quot;No&quot;]);\r\n<\/pre>\n<figure id=\"attachment_237\" aria-describedby=\"caption-attachment-237\" style=\"width: 386px\" class=\"wp-caption alignnone\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-237 size-full\" src=\"https:\/\/justinmatters.co.uk\/wp\/wp-content\/uploads\/2018\/09\/Figure1.png\" alt=\"According to this misleading plot two thirds of the passengers of the Titanic survived\" width=\"386\" height=\"266\" srcset=\"https:\/\/justinmatters.co.uk\/wp\/wp-content\/uploads\/2018\/09\/Figure1.png 386w, https:\/\/justinmatters.co.uk\/wp\/wp-content\/uploads\/2018\/09\/Figure1-300x207.png 300w\" sizes=\"auto, (max-width: 386px) 100vw, 386px\" \/><figcaption id=\"caption-attachment-237\" class=\"wp-caption-text\">Plot of survival by deck for passengers with a known 
deck<\/figcaption><\/figure>\n<p>Looking at this naively, it would appear that the survival rate on deck G is as good as that on deck A, the survival rates on decks D and E are better than that on deck B, and the survival rate on deck F is better than that on deck C. Given that second and third class passengers occupied decks D-G, it might appear that they fared reasonably well during the sinking. However, the graph above is highly misleading.<\/p>\n<p>The first indication that all is not as it seems is that, according to the plot by deck, more than half the passengers survived. However, a <a href=\"https:\/\/en.wikipedia.org\/wiki\/RMS_Titanic\">knowledge of history<\/a> (or plotting the whole training data set) reveals that the situation is very different. In fact, far fewer than half the passengers in the data set survived.<\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\n# Plot the whole data set\r\nsns.countplot(x=&quot;Survived&quot;, data=titanic_train,\r\n    order=&#x5B;&quot;Yes&quot;, &quot;No&quot;]);\r\n<\/pre>\n<figure id=\"attachment_238\" aria-describedby=\"caption-attachment-238\" style=\"width: 392px\" class=\"wp-caption alignnone\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-238 size-full\" src=\"https:\/\/justinmatters.co.uk\/wp\/wp-content\/uploads\/2018\/09\/Figure2.png\" alt=\"A more accurate plot, only one third of the passengers of the Titanic survive\" width=\"392\" height=\"266\" srcset=\"https:\/\/justinmatters.co.uk\/wp\/wp-content\/uploads\/2018\/09\/Figure2.png 392w, https:\/\/justinmatters.co.uk\/wp\/wp-content\/uploads\/2018\/09\/Figure2-300x204.png 300w\" sizes=\"auto, (max-width: 392px) 100vw, 392px\" \/><figcaption id=\"caption-attachment-238\" class=\"wp-caption-text\">Plot of survivors of the Titanic from the whole training set<\/figcaption><\/figure>\n<p>So what is the cause of the discrepancy? 
Well, let's add a bar for the passengers for whom we have no cabin data to our survival by deck plot.<\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\n# Fill the missing decks with a placeholder &quot;N&quot; and plot again\r\ntitanic_train&#x5B;&quot;Deck&quot;] = titanic_train&#x5B;&quot;Deck&quot;].fillna(&quot;N&quot;)\r\n\r\n# Visualise what happened to those for whom we have \r\n# no deck indication alongside those we do\r\nsns.countplot(x=&quot;Survived&quot;, data=titanic_train, hue=&quot;Deck&quot;,\r\n    hue_order=&#x5B;&quot;A&quot;, &quot;B&quot;, &quot;C&quot;, &quot;D&quot;, &quot;E&quot;, &quot;F&quot;, &quot;G&quot;, &quot;N&quot;],\r\n    order=&#x5B;&quot;Yes&quot;, &quot;No&quot;]);\r\n<\/pre>\n<figure id=\"attachment_239\" aria-describedby=\"caption-attachment-239\" style=\"width: 392px\" class=\"wp-caption alignnone\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-239 size-full\" src=\"https:\/\/justinmatters.co.uk\/wp\/wp-content\/uploads\/2018\/09\/Figure3.png\" alt=\"Most of the passengers do not have a known deck allocation according to this plot\" width=\"392\" height=\"266\" srcset=\"https:\/\/justinmatters.co.uk\/wp\/wp-content\/uploads\/2018\/09\/Figure3.png 392w, https:\/\/justinmatters.co.uk\/wp\/wp-content\/uploads\/2018\/09\/Figure3-300x204.png 300w\" sizes=\"auto, (max-width: 392px) 100vw, 392px\" \/><figcaption id=\"caption-attachment-239\" class=\"wp-caption-text\">A survival by deck plot with those with no known deck added<\/figcaption><\/figure>\n<p>Now we can clearly see our issue: the rows with deck allocations and those without are not drawn from the same distribution. 
This means that inferences we draw about one population have no reason to transfer to the other population.<\/p>\n<p>Let's investigate a little further by adding a column noting whether or not we have a deck allocation for a given passenger, so that we can plot this information directly.<\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\n# Let's add a column indicating whether the deck is known\r\ntitanic_train&#x5B;&quot;KnownDeck&quot;] = titanic_train&#x5B;&quot;Deck&quot;].apply(\r\n    lambda x: &quot;No&quot; if x == &quot;N&quot; else &quot;Yes&quot;)\r\n\r\n# Now display the distribution of survivors\r\nsns.countplot(x=&quot;Survived&quot;, data=titanic_train,\r\n    hue=&quot;KnownDeck&quot;);\r\n<\/pre>\n<figure id=\"attachment_240\" aria-describedby=\"caption-attachment-240\" style=\"width: 392px\" class=\"wp-caption alignnone\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-240 size-full\" src=\"https:\/\/justinmatters.co.uk\/wp\/wp-content\/uploads\/2018\/09\/Figure4.png\" alt=\"Being a survivor increases the chance of having a known deck\" width=\"392\" height=\"266\" srcset=\"https:\/\/justinmatters.co.uk\/wp\/wp-content\/uploads\/2018\/09\/Figure4.png 392w, https:\/\/justinmatters.co.uk\/wp\/wp-content\/uploads\/2018\/09\/Figure4-300x204.png 300w\" sizes=\"auto, (max-width: 392px) 100vw, 392px\" \/><figcaption id=\"caption-attachment-240\" class=\"wp-caption-text\">A plot demonstrating the correlation between known deck allocation and survival<\/figcaption><\/figure>\n<p>Here the difference can be seen very starkly: having a known deck is highly correlated with passenger survival.<\/p>\n<p>What if we break down the passengers by class, whether they survived and whether they had a known deck allocation?<\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\n# Now let's plot to see if the cabins we have are a \r\n# representative sample across passenger classes\r\ngrid = sns.FacetGrid(titanic_train, row 
=&quot;KnownDeck&quot;, \r\n    row_order = &#x5B;&quot;Yes&quot;, &quot;No&quot;], col= &quot;Pclass&quot;, height=4, aspect=1)\r\ngrid.map(sns.countplot, &quot;Survived&quot;, order = &#x5B;&quot;Yes&quot;, &quot;No&quot;]);\r\n<\/pre>\n<figure id=\"attachment_243\" aria-describedby=\"caption-attachment-243\" style=\"width: 856px\" class=\"wp-caption alignnone\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-243 size-full\" src=\"https:\/\/justinmatters.co.uk\/wp\/wp-content\/uploads\/2018\/09\/Figure5.png\" alt=\"Being a survivor increases the chance of having a known deck, also first class passengers are more likely and third class passengers less likely to have a known deck\" width=\"856\" height=\"568\" srcset=\"https:\/\/justinmatters.co.uk\/wp\/wp-content\/uploads\/2018\/09\/Figure5.png 856w, https:\/\/justinmatters.co.uk\/wp\/wp-content\/uploads\/2018\/09\/Figure5-300x199.png 300w, https:\/\/justinmatters.co.uk\/wp\/wp-content\/uploads\/2018\/09\/Figure5-768x510.png 768w, https:\/\/justinmatters.co.uk\/wp\/wp-content\/uploads\/2018\/09\/Figure5-700x464.png 700w\" sizes=\"auto, (max-width: 856px) 100vw, 856px\" \/><figcaption id=\"caption-attachment-243\" class=\"wp-caption-text\">A plot demonstrating the correlation between known deck allocation and survival broken down by passenger class<\/figcaption><\/figure>\n<p>Here we can see that, regardless of class, we are more likely to have a deck allocation for a passenger who survived. Also, we can see that first class passengers were most likely to survive and third class least likely, which is more in line with the known facts of the disaster.<\/p>\n<p>But why are having a known deck and survival so highly correlated? Well, it seems unlikely that having a known deck caused passengers to survive. More likely, causation works the other way round in this case. 
The data set records cabin numbers where they are known, and this is more likely if the passenger survived. The data set will have been compiled after the sinking, so the survival bias in the deck data suggests that cabin allocations were mostly recorded for survivors. A higher proportion of first class victims have known cabin allocations than second and third class victims, so perhaps first class allocations were recorded by the White Star Line and the few other cabin allocations that are known were discovered by questioning survivors.<\/p>\n<p>In this case the effect of the missing values is fairly obvious. However, imagine a less famous case, and imagine that instead of retaining the rows with missing data we had initially &#8220;cleaned&#8221; our data using<\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\ntitanic_train.dropna()\r\n<\/pre>\n<p>In such a case we would not necessarily get any hint that the conclusions we are drawing are on shaky ground. After all, if we already knew the answers, we would not need to look at the data. This could lead to serious errors in our conclusions. Missing values need to be treated with care: think and check before you drop them or fill them.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Missing values in data sets may not seem like too much of a problem at first glance. 
The rows can be ignored, average values can&hellip;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[11],"tags":[33,6,14,34],"class_list":["post-233","post","type-post","status-publish","format-standard","hentry","category-data-science","tag-data-science","tag-problem-solving","tag-python","tag-seaborn"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.4 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>The Danger of Missing Values in Data Sets - Justin&#039;s Blog<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/justinmatters.co.uk\/wp\/the-danger-of-missing-values-in-data-sets\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"The Danger of Missing Values in Data Sets - Justin&#039;s Blog\" \/>\n<meta property=\"og:description\" content=\"Missing values in data sets may not seem like too much of a problem at first glance. 
The rows can be ignored, average values can&hellip;\" \/>\n<meta property=\"og:url\" content=\"https:\/\/justinmatters.co.uk\/wp\/the-danger-of-missing-values-in-data-sets\/\" \/>\n<meta property=\"og:site_name\" content=\"Justin&#039;s Blog\" \/>\n<meta property=\"article:published_time\" content=\"2018-09-21T17:01:54+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2018-10-11T12:28:33+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/justinmatters.co.uk\/wp\/wp-content\/uploads\/2018\/09\/Figure1.png\" \/>\n<meta name=\"author\" content=\"justinmatters\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"justinmatters\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"6 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/justinmatters.co.uk\\\/wp\\\/the-danger-of-missing-values-in-data-sets\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/justinmatters.co.uk\\\/wp\\\/the-danger-of-missing-values-in-data-sets\\\/\"},\"author\":{\"name\":\"justinmatters\",\"@id\":\"https:\\\/\\\/justinmatters.co.uk\\\/wp\\\/#\\\/schema\\\/person\\\/7c3e0740e1fef74f705c19f175f6f321\"},\"headline\":\"The Danger of Missing Values in Data 
Sets\",\"datePublished\":\"2018-09-21T17:01:54+00:00\",\"dateModified\":\"2018-10-11T12:28:33+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/justinmatters.co.uk\\\/wp\\\/the-danger-of-missing-values-in-data-sets\\\/\"},\"wordCount\":1263,\"commentCount\":0,\"image\":{\"@id\":\"https:\\\/\\\/justinmatters.co.uk\\\/wp\\\/the-danger-of-missing-values-in-data-sets\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/justinmatters.co.uk\\\/wp\\\/wp-content\\\/uploads\\\/2018\\\/09\\\/Figure1.png\",\"keywords\":[\"Data Science\",\"Problem Solving\",\"Python\",\"Seaborn\"],\"articleSection\":[\"Data Science\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\\\/\\\/justinmatters.co.uk\\\/wp\\\/the-danger-of-missing-values-in-data-sets\\\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/justinmatters.co.uk\\\/wp\\\/the-danger-of-missing-values-in-data-sets\\\/\",\"url\":\"https:\\\/\\\/justinmatters.co.uk\\\/wp\\\/the-danger-of-missing-values-in-data-sets\\\/\",\"name\":\"The Danger of Missing Values in Data Sets - Justin&#039;s 
Blog\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/justinmatters.co.uk\\\/wp\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/justinmatters.co.uk\\\/wp\\\/the-danger-of-missing-values-in-data-sets\\\/#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/justinmatters.co.uk\\\/wp\\\/the-danger-of-missing-values-in-data-sets\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/justinmatters.co.uk\\\/wp\\\/wp-content\\\/uploads\\\/2018\\\/09\\\/Figure1.png\",\"datePublished\":\"2018-09-21T17:01:54+00:00\",\"dateModified\":\"2018-10-11T12:28:33+00:00\",\"author\":{\"@id\":\"https:\\\/\\\/justinmatters.co.uk\\\/wp\\\/#\\\/schema\\\/person\\\/7c3e0740e1fef74f705c19f175f6f321\"},\"breadcrumb\":{\"@id\":\"https:\\\/\\\/justinmatters.co.uk\\\/wp\\\/the-danger-of-missing-values-in-data-sets\\\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/justinmatters.co.uk\\\/wp\\\/the-danger-of-missing-values-in-data-sets\\\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/justinmatters.co.uk\\\/wp\\\/the-danger-of-missing-values-in-data-sets\\\/#primaryimage\",\"url\":\"https:\\\/\\\/justinmatters.co.uk\\\/wp\\\/wp-content\\\/uploads\\\/2018\\\/09\\\/Figure1.png\",\"contentUrl\":\"https:\\\/\\\/justinmatters.co.uk\\\/wp\\\/wp-content\\\/uploads\\\/2018\\\/09\\\/Figure1.png\",\"width\":386,\"height\":266},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/justinmatters.co.uk\\\/wp\\\/the-danger-of-missing-values-in-data-sets\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/justinmatters.co.uk\\\/wp\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"The Danger of Missing Values in Data Sets\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/justinmatters.co.uk\\\/wp\\\/#website\",\"url\":\"https:\\\/\\\/justinmatters.co.uk\\\/wp\\\/\",\"name\":\"Justin's Blog\",\"description\":\"Justin&#039;s Coding and 
Geek Blog\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/justinmatters.co.uk\\\/wp\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/justinmatters.co.uk\\\/wp\\\/#\\\/schema\\\/person\\\/7c3e0740e1fef74f705c19f175f6f321\",\"name\":\"justinmatters\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/27cf337940887c098b79716aa7025ce782bd51de3f6b07a9dcad710bbf576c59?s=96&d=mm&r=g\",\"url\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/27cf337940887c098b79716aa7025ce782bd51de3f6b07a9dcad710bbf576c59?s=96&d=mm&r=g\",\"contentUrl\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/27cf337940887c098b79716aa7025ce782bd51de3f6b07a9dcad710bbf576c59?s=96&d=mm&r=g\",\"caption\":\"justinmatters\"},\"description\":\"Data Scientist specialising in Python, PySpark, SQL and Machine Learning\",\"sameAs\":[\"https:\\\/\\\/justinmatters.co.uk\\\/wp\\\/\",\"https:\\\/\\\/uk.linkedin.com\\\/in\\\/justin-matters-edinburgh\"]}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"The Danger of Missing Values in Data Sets - Justin&#039;s Blog","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/justinmatters.co.uk\/wp\/the-danger-of-missing-values-in-data-sets\/","og_locale":"en_US","og_type":"article","og_title":"The Danger of Missing Values in Data Sets - Justin&#039;s Blog","og_description":"Missing values in data sets may not seem like too much of a problem at first glance. 
The rows can be ignored, average values can&hellip;","og_url":"https:\/\/justinmatters.co.uk\/wp\/the-danger-of-missing-values-in-data-sets\/","og_site_name":"Justin&#039;s Blog","article_published_time":"2018-09-21T17:01:54+00:00","article_modified_time":"2018-10-11T12:28:33+00:00","og_image":[{"url":"https:\/\/justinmatters.co.uk\/wp\/wp-content\/uploads\/2018\/09\/Figure1.png","type":"","width":"","height":""}],"author":"justinmatters","twitter_card":"summary_large_image","twitter_misc":{"Written by":"justinmatters","Est. reading time":"6 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/justinmatters.co.uk\/wp\/the-danger-of-missing-values-in-data-sets\/#article","isPartOf":{"@id":"https:\/\/justinmatters.co.uk\/wp\/the-danger-of-missing-values-in-data-sets\/"},"author":{"name":"justinmatters","@id":"https:\/\/justinmatters.co.uk\/wp\/#\/schema\/person\/7c3e0740e1fef74f705c19f175f6f321"},"headline":"The Danger of Missing Values in Data Sets","datePublished":"2018-09-21T17:01:54+00:00","dateModified":"2018-10-11T12:28:33+00:00","mainEntityOfPage":{"@id":"https:\/\/justinmatters.co.uk\/wp\/the-danger-of-missing-values-in-data-sets\/"},"wordCount":1263,"commentCount":0,"image":{"@id":"https:\/\/justinmatters.co.uk\/wp\/the-danger-of-missing-values-in-data-sets\/#primaryimage"},"thumbnailUrl":"https:\/\/justinmatters.co.uk\/wp\/wp-content\/uploads\/2018\/09\/Figure1.png","keywords":["Data Science","Problem Solving","Python","Seaborn"],"articleSection":["Data Science"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/justinmatters.co.uk\/wp\/the-danger-of-missing-values-in-data-sets\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/justinmatters.co.uk\/wp\/the-danger-of-missing-values-in-data-sets\/","url":"https:\/\/justinmatters.co.uk\/wp\/the-danger-of-missing-values-in-data-sets\/","name":"The Danger of Missing Values in Data Sets - Justin&#039;s 
Blog","isPartOf":{"@id":"https:\/\/justinmatters.co.uk\/wp\/#website"},"primaryImageOfPage":{"@id":"https:\/\/justinmatters.co.uk\/wp\/the-danger-of-missing-values-in-data-sets\/#primaryimage"},"image":{"@id":"https:\/\/justinmatters.co.uk\/wp\/the-danger-of-missing-values-in-data-sets\/#primaryimage"},"thumbnailUrl":"https:\/\/justinmatters.co.uk\/wp\/wp-content\/uploads\/2018\/09\/Figure1.png","datePublished":"2018-09-21T17:01:54+00:00","dateModified":"2018-10-11T12:28:33+00:00","author":{"@id":"https:\/\/justinmatters.co.uk\/wp\/#\/schema\/person\/7c3e0740e1fef74f705c19f175f6f321"},"breadcrumb":{"@id":"https:\/\/justinmatters.co.uk\/wp\/the-danger-of-missing-values-in-data-sets\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/justinmatters.co.uk\/wp\/the-danger-of-missing-values-in-data-sets\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/justinmatters.co.uk\/wp\/the-danger-of-missing-values-in-data-sets\/#primaryimage","url":"https:\/\/justinmatters.co.uk\/wp\/wp-content\/uploads\/2018\/09\/Figure1.png","contentUrl":"https:\/\/justinmatters.co.uk\/wp\/wp-content\/uploads\/2018\/09\/Figure1.png","width":386,"height":266},{"@type":"BreadcrumbList","@id":"https:\/\/justinmatters.co.uk\/wp\/the-danger-of-missing-values-in-data-sets\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/justinmatters.co.uk\/wp\/"},{"@type":"ListItem","position":2,"name":"The Danger of Missing Values in Data Sets"}]},{"@type":"WebSite","@id":"https:\/\/justinmatters.co.uk\/wp\/#website","url":"https:\/\/justinmatters.co.uk\/wp\/","name":"Justin's Blog","description":"Justin&#039;s Coding and Geek 
Blog","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/justinmatters.co.uk\/wp\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Person","@id":"https:\/\/justinmatters.co.uk\/wp\/#\/schema\/person\/7c3e0740e1fef74f705c19f175f6f321","name":"justinmatters","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/secure.gravatar.com\/avatar\/27cf337940887c098b79716aa7025ce782bd51de3f6b07a9dcad710bbf576c59?s=96&d=mm&r=g","url":"https:\/\/secure.gravatar.com\/avatar\/27cf337940887c098b79716aa7025ce782bd51de3f6b07a9dcad710bbf576c59?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/27cf337940887c098b79716aa7025ce782bd51de3f6b07a9dcad710bbf576c59?s=96&d=mm&r=g","caption":"justinmatters"},"description":"Data Scientist specialising in Python, PySpark, SQL and Machine Learning","sameAs":["https:\/\/justinmatters.co.uk\/wp\/","https:\/\/uk.linkedin.com\/in\/justin-matters-edinburgh"]}]}},"_links":{"self":[{"href":"https:\/\/justinmatters.co.uk\/wp\/wp-json\/wp\/v2\/posts\/233","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/justinmatters.co.uk\/wp\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/justinmatters.co.uk\/wp\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/justinmatters.co.uk\/wp\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/justinmatters.co.uk\/wp\/wp-json\/wp\/v2\/comments?post=233"}],"version-history":[{"count":11,"href":"https:\/\/justinmatters.co.uk\/wp\/wp-json\/wp\/v2\/posts\/233\/revisions"}],"predecessor-version":[{"id":253,"href":"https:\/\/justinmatters.co.uk\/wp\/wp-json\/wp\/v2\/posts\/233\/revisions\/253"}],"wp:attachment":[{"href":"https:\/\/justinmatters.co.uk\/wp\/wp-json\/wp\/v2\/media?parent=233"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/
justinmatters.co.uk\/wp\/wp-json\/wp\/v2\/categories?post=233"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/justinmatters.co.uk\/wp\/wp-json\/wp\/v2\/tags?post=233"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}