Empty Spark dataframes

Creating an empty Spark dataframe is a bit tricky. Let's see some examples. First, let's create a SparkSession object to use.

import pandas as pd  # needed for the later examples

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('my_app').getOrCreate()

The following command fails because the schema cannot be inferred from an empty list. We can make it work by specifying the schema as a DDL string.

spark.createDataFrame([])  # fails!
# ...
# ValueError: can not infer schema from empty dataset

df = spark.createDataFrame([], 'a INT, b DOUBLE, c STRING')  # works!
df.dtypes
# [('a', 'int'), ('b', 'double'), ('c', 'string')]
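
By the way, instead of a DDL string we can also build the schema programmatically. Here is a minimal sketch using an explicit StructType, equivalent to the DDL string above:

from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType, StringType

# equivalent to the 'a INT, b DOUBLE, c STRING' DDL string
schema = StructType([
    StructField('a', IntegerType()),
    StructField('b', DoubleType()),
    StructField('c', StringType()),
])
spark.createDataFrame([], schema).dtypes
# [('a', 'int'), ('b', 'double'), ('c', 'string')]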

Things get a little bit more interesting when we create a Spark dataframe from a pandas dataframe.

x = pd.DataFrame({'a': [1.0], 'b': [1.0]})  # works as expected
print(x.dtypes)  # pandas dataframe has a schema
# a    float64
# b    float64
# dtype: object
spark.createDataFrame(x).dtypes  # no need to specify schema because it can be inferred
# [('a', 'double'), ('b', 'double')]

An empty pandas dataframe still has a schema, but Spark is unable to infer it.

y = pd.DataFrame({'a': [], 'b': []})
print(y.dtypes)  # default dtype is float64
# a    float64
# b    float64
# dtype: object
spark.createDataFrame(y)  # fails!
# ...
# ValueError: can not infer schema from empty dataset
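
One workaround is to translate the pandas dtypes into a Spark schema ourselves. The sketch below assumes a hypothetical pandas_schema helper whose dtype mapping covers only a few common cases:

from pyspark.sql.types import StructType, StructField, LongType, DoubleType, BooleanType, StringType

# hypothetical mapping from common pandas dtypes to Spark types (not exhaustive)
DTYPE_MAP = {
    'int64': LongType(),
    'float64': DoubleType(),
    'bool': BooleanType(),
    'object': StringType(),
}

def pandas_schema(pdf):
    # build a Spark schema from the pandas dataframe's own dtypes
    return StructType([
        StructField(name, DTYPE_MAP[str(dtype)])
        for name, dtype in pdf.dtypes.items()
    ])

spark.createDataFrame(y, pandas_schema(y)).dtypes
# [('a', 'double'), ('b', 'double')]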

Now, funnily enough, when we pass an explicit schema, Spark completely ignores the empty pandas dataframe's own schema, right down to its column names.

# works as expected
spark.createDataFrame(pd.DataFrame({'a': [], 'b': []}), 'a INT, b DOUBLE').dtypes
# [('a', 'int'), ('b', 'double')]

# also works, even with fewer columns than the pandas dataframe!
spark.createDataFrame(pd.DataFrame({'a': [], 'b': []}), 'a INT').dtypes
# [('a', 'int')]

# also works!
spark.createDataFrame(pd.DataFrame({'a': [], 'b': []}), 'b INT').dtypes
# [('b', 'int')]

# still works, even though column 'c' doesn't exist in the pandas dataframe!
spark.createDataFrame(pd.DataFrame({'a': [], 'b': []}), 'c INT').dtypes
# [('c', 'int')]
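
Whatever schema we pass, the result is still a dataframe with zero rows; only the schema we supply survives. A quick sanity check:

df = spark.createDataFrame(pd.DataFrame({'a': [], 'b': []}), 'c INT')
df.count()
# 0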