N-grams are a well-established method in natural language processing. They can be used in situations like predictive text, sentiment analysis and other useful tasks.…
Tag: Python
Recently I needed to plot some geographic data I had been working on in PySpark on Databricks. I did a bit of research and found…
PySpark window functions are useful when you want to examine relationships within groups of data rather than between groups of data, as with groupBy. To…
Recently I needed to check for equality between PySpark DataFrames as part of a test suite. To my surprise I discovered that there is no…
I gave a flash talk at Edinburgh PyData yesterday. It covered the merits and pitfalls of PySpark and Databricks as a big data processing platform.…
There are a variety of ways to filter strings in PySpark, each with its own advantages and disadvantages. This post will consider three of the…
Jupyter notebooks are great; however, the interface for file handling has its issues. One issue is that files have to be downloaded individually, there is…
PySpark is very powerful. However, because it is based on Scala, we need to be careful about types, as they are not Pythonic. And because…
If you work in data science, you have probably come across the pipeline model for handling data transformations. It is used by many machine learning…
Databricks is a very handy cloud platform for large-scale data processing and machine learning using Spark. However, it does have some idiosyncrasies. Here are…