{"id":687,"date":"2020-03-06T19:39:38","date_gmt":"2020-03-06T19:39:38","guid":{"rendered":"https:\/\/justinmatters.co.uk\/wp\/?p=687"},"modified":"2020-03-22T13:53:39","modified_gmt":"2020-03-22T13:53:39","slug":"my-pydata-presentation-on-pyspark-and-databricks","status":"publish","type":"post","link":"https:\/\/justinmatters.co.uk\/wp\/my-pydata-presentation-on-pyspark-and-databricks\/","title":{"rendered":"My PyData Presentation on PySpark and Databricks"},"content":{"rendered":"<p>I gave a flash talk at <a href=\"https:\/\/www.meetup.com\/PyData-Edinburgh\/\">Edinburgh PyData<\/a> yesterday. It covered the merits and pitfalls of PySpark and Databricks as a big data processing platform. I thought I would take a moment to share my slides online in case anyone else would like to take a look. They can be found here on <a href=\"https:\/\/docs.google.com\/presentation\/d\/1a-AEB1793PStFOgOf31jHlaJKP0lIS3wd4q4N4aX46k\/edit?usp=sharing\">Google Slides<\/a>.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-688 aligncenter\" src=\"https:\/\/justinmatters.co.uk\/wp\/wp-content\/uploads\/2020\/03\/Pydata.png\" alt=\"\" width=\"348\" height=\"145\" srcset=\"https:\/\/justinmatters.co.uk\/wp\/wp-content\/uploads\/2020\/03\/Pydata.png 348w, https:\/\/justinmatters.co.uk\/wp\/wp-content\/uploads\/2020\/03\/Pydata-300x125.png 300w\" sizes=\"auto, (max-width: 348px) 100vw, 348px\" \/><\/p>\n<p>TLDR: a promising set of tools that are still under rapid development. The learning curve can be a little steep here and there due to some departures from standard Python and dataframe design principles. Very useful now (I am using them in production at <a href=\"https:\/\/queryclick.com\/\">QueryClick<\/a>) and likely to get even better over time.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I gave a flash talk at Edinburgh Pydata yesterday. 
It covered the merits and pitfalls of PySpark and Databricks as a big data processing platform.&hellip;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[11],"tags":[57,55,58,54,14],"class_list":["post-687","post","type-post","status-publish","format-standard","hentry","category-data-science","tag-big-data","tag-databricks","tag-pydata","tag-pyspark","tag-python"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.4 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>My PyData Presentation on PySpark and Databricks - Justin&#039;s Blog<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/justinmatters.co.uk\/wp\/my-pydata-presentation-on-pyspark-and-databricks\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"My PyData Presentation on PySpark and Databricks - Justin&#039;s Blog\" \/>\n<meta property=\"og:description\" content=\"I gave a flash talk at Edinburgh Pydata yesterday. 
It covered the merits and pitfalls of PySpark and Databricks as a big data processing platform.&hellip;\" \/>\n<meta property=\"og:url\" content=\"https:\/\/justinmatters.co.uk\/wp\/my-pydata-presentation-on-pyspark-and-databricks\/\" \/>\n<meta property=\"og:site_name\" content=\"Justin&#039;s Blog\" \/>\n<meta property=\"article:published_time\" content=\"2020-03-06T19:39:38+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2020-03-22T13:53:39+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/justinmatters.co.uk\/wp\/wp-content\/uploads\/2020\/03\/Pydata.png\" \/>\n<meta name=\"author\" content=\"justinmatters\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"justinmatters\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"1 minute\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/justinmatters.co.uk\\\/wp\\\/my-pydata-presentation-on-pyspark-and-databricks\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/justinmatters.co.uk\\\/wp\\\/my-pydata-presentation-on-pyspark-and-databricks\\\/\"},\"author\":{\"name\":\"justinmatters\",\"@id\":\"https:\\\/\\\/justinmatters.co.uk\\\/wp\\\/#\\\/schema\\\/person\\\/7c3e0740e1fef74f705c19f175f6f321\"},\"headline\":\"My PyData Presentation on PySpark and 
Databricks\",\"datePublished\":\"2020-03-06T19:39:38+00:00\",\"dateModified\":\"2020-03-22T13:53:39+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/justinmatters.co.uk\\\/wp\\\/my-pydata-presentation-on-pyspark-and-databricks\\\/\"},\"wordCount\":116,\"commentCount\":0,\"image\":{\"@id\":\"https:\\\/\\\/justinmatters.co.uk\\\/wp\\\/my-pydata-presentation-on-pyspark-and-databricks\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/justinmatters.co.uk\\\/wp\\\/wp-content\\\/uploads\\\/2020\\\/03\\\/Pydata.png\",\"keywords\":[\"Big data\",\"Databricks\",\"Pydata\",\"PySpark\",\"Python\"],\"articleSection\":[\"Data Science\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\\\/\\\/justinmatters.co.uk\\\/wp\\\/my-pydata-presentation-on-pyspark-and-databricks\\\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/justinmatters.co.uk\\\/wp\\\/my-pydata-presentation-on-pyspark-and-databricks\\\/\",\"url\":\"https:\\\/\\\/justinmatters.co.uk\\\/wp\\\/my-pydata-presentation-on-pyspark-and-databricks\\\/\",\"name\":\"My PyData Presentation on PySpark and Databricks - Justin&#039;s 
Blog\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/justinmatters.co.uk\\\/wp\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/justinmatters.co.uk\\\/wp\\\/my-pydata-presentation-on-pyspark-and-databricks\\\/#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/justinmatters.co.uk\\\/wp\\\/my-pydata-presentation-on-pyspark-and-databricks\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/justinmatters.co.uk\\\/wp\\\/wp-content\\\/uploads\\\/2020\\\/03\\\/Pydata.png\",\"datePublished\":\"2020-03-06T19:39:38+00:00\",\"dateModified\":\"2020-03-22T13:53:39+00:00\",\"author\":{\"@id\":\"https:\\\/\\\/justinmatters.co.uk\\\/wp\\\/#\\\/schema\\\/person\\\/7c3e0740e1fef74f705c19f175f6f321\"},\"breadcrumb\":{\"@id\":\"https:\\\/\\\/justinmatters.co.uk\\\/wp\\\/my-pydata-presentation-on-pyspark-and-databricks\\\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/justinmatters.co.uk\\\/wp\\\/my-pydata-presentation-on-pyspark-and-databricks\\\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/justinmatters.co.uk\\\/wp\\\/my-pydata-presentation-on-pyspark-and-databricks\\\/#primaryimage\",\"url\":\"https:\\\/\\\/justinmatters.co.uk\\\/wp\\\/wp-content\\\/uploads\\\/2020\\\/03\\\/Pydata.png\",\"contentUrl\":\"https:\\\/\\\/justinmatters.co.uk\\\/wp\\\/wp-content\\\/uploads\\\/2020\\\/03\\\/Pydata.png\",\"width\":348,\"height\":145},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/justinmatters.co.uk\\\/wp\\\/my-pydata-presentation-on-pyspark-and-databricks\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/justinmatters.co.uk\\\/wp\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"My PyData Presentation on PySpark and Databricks\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/justinmatters.co.uk\\\/wp\\\/#website\",\"url\":\"https:\\\/\\\/justinmatters.co.uk\\\/wp\\\/\",\"name\":\"Justin's 
Blog\",\"description\":\"Justin&#039;s Coding and Geek Blog\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/justinmatters.co.uk\\\/wp\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/justinmatters.co.uk\\\/wp\\\/#\\\/schema\\\/person\\\/7c3e0740e1fef74f705c19f175f6f321\",\"name\":\"justinmatters\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/27cf337940887c098b79716aa7025ce782bd51de3f6b07a9dcad710bbf576c59?s=96&d=mm&r=g\",\"url\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/27cf337940887c098b79716aa7025ce782bd51de3f6b07a9dcad710bbf576c59?s=96&d=mm&r=g\",\"contentUrl\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/27cf337940887c098b79716aa7025ce782bd51de3f6b07a9dcad710bbf576c59?s=96&d=mm&r=g\",\"caption\":\"justinmatters\"},\"description\":\"Data Scientist specialising in Python, PySpark, SQL and Machine Learning\",\"sameAs\":[\"https:\\\/\\\/justinmatters.co.uk\\\/wp\\\/\",\"https:\\\/\\\/uk.linkedin.com\\\/in\\\/justin-matters-edinburgh\"]}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"My PyData Presentation on PySpark and Databricks - Justin&#039;s Blog","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/justinmatters.co.uk\/wp\/my-pydata-presentation-on-pyspark-and-databricks\/","og_locale":"en_US","og_type":"article","og_title":"My PyData Presentation on PySpark and Databricks - Justin&#039;s Blog","og_description":"I gave a flash talk at Edinburgh Pydata yesterday. 
It covered the merits and pitfalls of PySpark and Databricks as a big data processing platform.&hellip;","og_url":"https:\/\/justinmatters.co.uk\/wp\/my-pydata-presentation-on-pyspark-and-databricks\/","og_site_name":"Justin&#039;s Blog","article_published_time":"2020-03-06T19:39:38+00:00","article_modified_time":"2020-03-22T13:53:39+00:00","og_image":[{"url":"https:\/\/justinmatters.co.uk\/wp\/wp-content\/uploads\/2020\/03\/Pydata.png","type":"","width":"","height":""}],"author":"justinmatters","twitter_card":"summary_large_image","twitter_misc":{"Written by":"justinmatters","Est. reading time":"1 minute"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/justinmatters.co.uk\/wp\/my-pydata-presentation-on-pyspark-and-databricks\/#article","isPartOf":{"@id":"https:\/\/justinmatters.co.uk\/wp\/my-pydata-presentation-on-pyspark-and-databricks\/"},"author":{"name":"justinmatters","@id":"https:\/\/justinmatters.co.uk\/wp\/#\/schema\/person\/7c3e0740e1fef74f705c19f175f6f321"},"headline":"My PyData Presentation on PySpark and Databricks","datePublished":"2020-03-06T19:39:38+00:00","dateModified":"2020-03-22T13:53:39+00:00","mainEntityOfPage":{"@id":"https:\/\/justinmatters.co.uk\/wp\/my-pydata-presentation-on-pyspark-and-databricks\/"},"wordCount":116,"commentCount":0,"image":{"@id":"https:\/\/justinmatters.co.uk\/wp\/my-pydata-presentation-on-pyspark-and-databricks\/#primaryimage"},"thumbnailUrl":"https:\/\/justinmatters.co.uk\/wp\/wp-content\/uploads\/2020\/03\/Pydata.png","keywords":["Big data","Databricks","Pydata","PySpark","Python"],"articleSection":["Data 
Science"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/justinmatters.co.uk\/wp\/my-pydata-presentation-on-pyspark-and-databricks\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/justinmatters.co.uk\/wp\/my-pydata-presentation-on-pyspark-and-databricks\/","url":"https:\/\/justinmatters.co.uk\/wp\/my-pydata-presentation-on-pyspark-and-databricks\/","name":"My PyData Presentation on PySpark and Databricks - Justin&#039;s Blog","isPartOf":{"@id":"https:\/\/justinmatters.co.uk\/wp\/#website"},"primaryImageOfPage":{"@id":"https:\/\/justinmatters.co.uk\/wp\/my-pydata-presentation-on-pyspark-and-databricks\/#primaryimage"},"image":{"@id":"https:\/\/justinmatters.co.uk\/wp\/my-pydata-presentation-on-pyspark-and-databricks\/#primaryimage"},"thumbnailUrl":"https:\/\/justinmatters.co.uk\/wp\/wp-content\/uploads\/2020\/03\/Pydata.png","datePublished":"2020-03-06T19:39:38+00:00","dateModified":"2020-03-22T13:53:39+00:00","author":{"@id":"https:\/\/justinmatters.co.uk\/wp\/#\/schema\/person\/7c3e0740e1fef74f705c19f175f6f321"},"breadcrumb":{"@id":"https:\/\/justinmatters.co.uk\/wp\/my-pydata-presentation-on-pyspark-and-databricks\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/justinmatters.co.uk\/wp\/my-pydata-presentation-on-pyspark-and-databricks\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/justinmatters.co.uk\/wp\/my-pydata-presentation-on-pyspark-and-databricks\/#primaryimage","url":"https:\/\/justinmatters.co.uk\/wp\/wp-content\/uploads\/2020\/03\/Pydata.png","contentUrl":"https:\/\/justinmatters.co.uk\/wp\/wp-content\/uploads\/2020\/03\/Pydata.png","width":348,"height":145},{"@type":"BreadcrumbList","@id":"https:\/\/justinmatters.co.uk\/wp\/my-pydata-presentation-on-pyspark-and-databricks\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/justinmatters.co.uk\/wp\/"},{"@type":"ListItem","position":2
,"name":"My PyData Presentation on PySpark and Databricks"}]},{"@type":"WebSite","@id":"https:\/\/justinmatters.co.uk\/wp\/#website","url":"https:\/\/justinmatters.co.uk\/wp\/","name":"Justin's Blog","description":"Justin&#039;s Coding and Geek Blog","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/justinmatters.co.uk\/wp\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Person","@id":"https:\/\/justinmatters.co.uk\/wp\/#\/schema\/person\/7c3e0740e1fef74f705c19f175f6f321","name":"justinmatters","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/secure.gravatar.com\/avatar\/27cf337940887c098b79716aa7025ce782bd51de3f6b07a9dcad710bbf576c59?s=96&d=mm&r=g","url":"https:\/\/secure.gravatar.com\/avatar\/27cf337940887c098b79716aa7025ce782bd51de3f6b07a9dcad710bbf576c59?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/27cf337940887c098b79716aa7025ce782bd51de3f6b07a9dcad710bbf576c59?s=96&d=mm&r=g","caption":"justinmatters"},"description":"Data Scientist specialising in Python, PySpark, SQL and Machine 
Learning","sameAs":["https:\/\/justinmatters.co.uk\/wp\/","https:\/\/uk.linkedin.com\/in\/justin-matters-edinburgh"]}]}},"_links":{"self":[{"href":"https:\/\/justinmatters.co.uk\/wp\/wp-json\/wp\/v2\/posts\/687","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/justinmatters.co.uk\/wp\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/justinmatters.co.uk\/wp\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/justinmatters.co.uk\/wp\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/justinmatters.co.uk\/wp\/wp-json\/wp\/v2\/comments?post=687"}],"version-history":[{"count":1,"href":"https:\/\/justinmatters.co.uk\/wp\/wp-json\/wp\/v2\/posts\/687\/revisions"}],"predecessor-version":[{"id":689,"href":"https:\/\/justinmatters.co.uk\/wp\/wp-json\/wp\/v2\/posts\/687\/revisions\/689"}],"wp:attachment":[{"href":"https:\/\/justinmatters.co.uk\/wp\/wp-json\/wp\/v2\/media?parent=687"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/justinmatters.co.uk\/wp\/wp-json\/wp\/v2\/categories?post=687"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/justinmatters.co.uk\/wp\/wp-json\/wp\/v2\/tags?post=687"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}