How to create a sample single-column Spark DataFrame in Python?

I want to create a sample single-column DataFrame, but the following code is not working:
df = spark.createDataFrame(["10","11","13"], ("age"))
## ValueError
## ...
## ValueError: Could not parse datatype: age
The expected result:
age
10
11
13

the following code is not working
With single elements like these you need to provide the schema, either as a type string:
spark.createDataFrame(["10","11","13"], "string").toDF("age")
or DataType:
from pyspark.sql.types import StringType
spark.createDataFrame(["10","11","13"], StringType()).toDF("age")
To get a named column directly, the elements should be tuples and the schema a sequence of column names:
spark.createDataFrame([("10", ), ("11", ), ("13", )], ["age"])

There is also a pretty easy way to create a sample DataFrame in PySpark:
>>> df = sc.parallelize([[1,2,3], [2,3,4]]).toDF()
>>> df.show()
+---+---+---+
| _1| _2| _3|
+---+---+---+
|  1|  2|  3|
|  2|  3|  4|
+---+---+---+
To create it with column names:
>>> df1 = sc.parallelize([[1,2,3], [2,3,4]]).toDF(("a", "b", "c"))
>>> df1.show()
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  2|  3|
|  2|  3|  4|
+---+---+---+
This way there is no need to define a schema either. Hope this is the simplest way.
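If you prefer to skip the RDD detour, a similar one-liner works directly through the SparkSession (a minimal sketch, assuming spark is an active SparkSession):
df2 = spark.createDataFrame([[1, 2, 3], [2, 3, 4]], ["a", "b", "c"])
df2.show()  # same result as df1 above; column types are inferred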

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([{"a": "x", "b": "y", "c": "3"}])
Output: (no need to define schema)
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  x|  y|  3|
+---+---+---+
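Note that inferring the schema from dict rows prints a deprecation warning on some Spark 2.x versions; an equivalent sketch using Row objects avoids it:
from pyspark.sql import Row

df = spark.createDataFrame([Row(a="x", b="y", c="3")])
df.show()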

For pandas + PySpark users: if pandas is already installed on the cluster, you can simply do this:
import pandas as pd

# create pandas dataframe
df = pd.DataFrame({'col1': [1, 2, 3], 'col2': ['a', 'b', 'c']})
# convert to spark dataframe
df = spark.createDataFrame(df)
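Going the other way is just as simple (a small sketch, assuming the Spark DataFrame df from above); note that toPandas() collects all rows to the driver, so it is only suitable for small results:
pdf = df.toPandas()  # back to a pandas DataFrame on the driver
print(pdf)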
Local Spark Setup
import findspark
findspark.init()
import pyspark
spark = (pyspark
         .sql
         .SparkSession
         .builder
         .master("local")
         .getOrCreate())

See my farsante lib for creating a DataFrame with fake data:
import farsante
df = farsante.quick_pyspark_df(['first_name', 'last_name'], 7)
df.show()
+----------+---------+
|first_name|last_name|
+----------+---------+
|     Tommy|     Hess|
|    Arthur| Melendez|
|  Clemente|    Blair|
|    Wesley|   Conrad|
|    Willis|   Dunlap|
|     Bruna|  Sellers|
|     Tonda| Schwartz|
+----------+---------+
Here's how to explicitly specify the schema when creating the PySpark DataFrame:
from pyspark.sql.types import StructType, StructField, IntegerType

df = spark.createDataFrame(
    [(10,), (11,), (13,)],
    StructType([StructField("some_int", IntegerType(), True)]))
df.show()
+--------+
|some_int|
+--------+
|      10|
|      11|
|      13|
+--------+
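Equivalently, on recent Spark versions the schema can be given as a DDL-formatted string, which is a bit shorter (a sketch, assuming the same spark session):
df = spark.createDataFrame([(10,), (11,), (13,)], "some_int int")
df.show()  # same output as above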

You can also try something like this:
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)  # sc is the Spark context
sample = sqlContext.createDataFrame(
    [
        ('qwe', 23),  # enter your data here
        ('rty', 34),
        ('yui', 56),
    ],
    ['abc', 'def']  # the column labels go here
)
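With Spark 2.0+ you no longer need a separate SQLContext; roughly the same thing through the SparkSession would be (a sketch, assuming spark is an active SparkSession):
sample = spark.createDataFrame(
    [
        ('qwe', 23),
        ('rty', 34),
        ('yui', 56),
    ],
    ['abc', 'def'],
)
sample.show()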

There are several ways to create a DataFrame; creating one is among the first steps you learn while working with PySpark.
Assuming you already have data, columns, and an RDD:
1) df = rdd.toDF()
2) df = rdd.toDF(columns)  # assigns column names
3) df = spark.createDataFrame(rdd).toDF(*columns)
4) df = spark.createDataFrame(data).toDF(*columns)
5) df = spark.createDataFrame(rowData, columns)
Besides these, you can find many more examples of creating a PySpark DataFrame; a self-contained sketch of options 2) and 4) follows below.
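A minimal, self-contained sketch of options 2) and 4) (the data, column names, and variable names here are made up for illustration; an active SparkSession named spark is assumed):
data = [("James", 30), ("Anna", 25)]  # hypothetical sample data
columns = ["name", "age"]
rdd = spark.sparkContext.parallelize(data)
df_from_rdd = rdd.toDF(columns)                             # option 2
df_from_data = spark.createDataFrame(data).toDF(*columns)   # option 4
df_from_data.show()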

Related

Spark DataFrame: Add a new column according to other columns

I want to add a new column new_col: if the value of column a is in yes_list, then the value in new_col should be 1, otherwise 0.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
rdd = sc.parallelize([{"a": 'y'}, {"a": 'y', "b": 2}, {"a": 'n', "c": 3}])
rdd_df = sqlContext.read.json(rdd)
yes_list = ['y']
Something like this:
rdd_df.withColumn("new_col", [1 if val in yes_list else 0 for val in rdd_df["a"]])
But the above is not correct and raises an error:
TypeError: Column is not iterable
How can I achieve this?
You can use the when and isin functions from the Spark SQL API. It would go as follows:
from pyspark.sql import functions
rdd_df.withColumn("new_col", functions.when(rdd_df['a'].isin(yes_list), 1).otherwise(0)).show()
+---+----+----+-------+
|  a|   b|   c|new_col|
+---+----+----+-------+
|  y|null|null|      1|
|  y|   2|null|      1|
|  n|null|   3|      0|
+---+----+----+-------+

Checking the dtypes of PySpark DataFrame columns

I want to check each column of the PySpark DataFrame, and if a column has a specific dtype, then perform certain functions on it. Below is my code and dataset.
dataset:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master('local').appName('Word Count').config('spark.some.config.option', 'some-value').getOrCreate()
df = spark.createDataFrame(
    [
        ('A', 1),
        ('A', 2),
        ('A', 3),
        ('A', 4),
        ('B', 5),
        ('B', 6),
        ('B', 7),
        ('B', 8),
    ],
    ['id', 'v']
)  # I save this to csv, so you can just ignore my read-csv part below.
Codes:
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
df = sqlContext.read.load('test.csv',
                          format='com.databricks.spark.csv',
                          header='true',
                          inferSchema='true')

from functools import reduce
from pyspark.sql.functions import col
import numpy as np

# this is the part that I want to change:
i = reduce(lambda x, y: x.withColumn(y, np.where(col(y).dtypes != 'str', col(y) + 2, col(y))), df.columns, df)
Side learning request: if possible, can anyone tell me how to edit only a specific column? I understand .select can be used, but could someone show some examples with a dataset? Thank you.
My expected output:
+---+---+
| id| v|
+---+---+
| A| 3|
| A| 4|
| A| 5|
| A| 6|
| B| 7|
| B| 8|
| B| 9|
| B| 10|
+---+---+
Side note: I am new to PySpark, so I don't get why 'col' is needed. What is it, actually?
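One possible way to approach this (a minimal sketch, assuming the df with columns id and v created above): df.dtypes returns (column name, type name) pairs, so you can build a select that adds 2 only to the numeric columns and passes the others through unchanged.
from pyspark.sql.functions import col

exprs = [
    (col(c) + 2).alias(c) if t in ('int', 'bigint', 'double') else col(c)
    for c, t in df.dtypes
]
df.select(*exprs).show()
As for col: it simply builds a Column expression from a column name, which is what DataFrame operations like select and withColumn expect rather than plain Python values.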

Append list of lists as column to PySpark's dataframe (Concatenating two dataframes without common column)

I have a DataFrame in PySpark:
from pyspark.sql import SQLContext, SparkSession
spark = SparkSession.builder.getOrCreate()
sqlcontext = SQLContext(spark)
df = sqlcontext.createDataFrame([['a'],['b'],['c'],['d'],['e']], ['id'])
df.show()
+---+
| id|
+---+
|  a|
|  b|
|  c|
|  d|
|  e|
+---+
And I have a list of lists:
l = [[1,1], [2,2], [3,3], [4,4], [5,5]]
Is it possible to append this list as a column to df? Namely, the first element of l should appear next to the first row of df, the second element of l next to the second row of df, etc. It should look like this:
+---+------+
| id|     l|
+---+------+
|  a| [1,1]|
|  b| [2,2]|
|  c| [3,3]|
|  d| [4,4]|
|  e| [5,5]|
+---+------+
UDFs are generally slow, but a more efficient way without using any UDFs would be:
import pyspark.sql.functions as F
ldf = spark.createDataFrame(l, schema = "array<int>")
df1 = df.withColumn("m_id", F.monotonically_increasing_id())
df2 = ldf.withColumn("m_id", F.monotonically_increasing_id())
df3 = df2.join(df1, "m_id", "outer").drop("m_id")
df3.select("id", "value").show()
+---+------+
| id| value|
+---+------+
|  a|[1, 1]|
|  b|[2, 2]|
|  d|[4, 4]|
|  c|[3, 3]|
|  e|[5, 5]|
+---+------+
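A caveat worth noting: monotonically_increasing_id() only guarantees unique, increasing values within each DataFrame, so the ids generated for df and ldf are not guaranteed to line up row-for-row (the shuffled row order in the output above hints at this). A sketch of an order-safe alternative using zipWithIndex, assuming the df, l, and spark objects defined in the question:
df_idx = df.rdd.zipWithIndex().map(lambda r: (r[1], r[0]["id"])).toDF(["idx", "id"])
l_idx = spark.createDataFrame([(i, v) for i, v in enumerate(l)], ["idx", "l"])
l_idx.join(df_idx, "idx").orderBy("idx").select("id", "l").show()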
Assuming that you have the same number of rows in your df as items in your list (df.count() == len(l)), you can add a row number (to fix the order) to your df and, based on that, access the corresponding item in your list l.
from pyspark.sql.functions import row_number, lit
from pyspark.sql.window import *
df = df.withColumn("row_num", row_number().over(Window().orderBy(lit('A'))))
df.show()
The above code will produce:
+---+-------+
| id|row_num|
+---+-------+
|  a|      1|
|  b|      2|
|  c|      3|
|  d|      4|
|  e|      5|
+---+-------+
Then, you can just iterate your df and access the specified index in your list:
def map_df(row):
    return (row.id, l[row.row_num - 1])

new_df = df.rdd.map(map_df).toDF(["id", "l"])
new_df.show()
Output:
+---+------+
| id|     l|
+---+------+
|  a|[1, 1]|
|  b|[2, 2]|
|  c|[3, 3]|
|  d|[4, 4]|
|  e|[5, 5]|
+---+------+
Thanks to Cesar's answer, I figured out how to do it without making the dataframe an RDD and coming back. It would be something like this:
from pyspark.sql import SQLContext, SparkSession
from pyspark.sql.functions import row_number, lit, udf
from pyspark.sql.window import Window
from pyspark.sql.types import ArrayType, FloatType, IntegerType
spark = SparkSession.builder.getOrCreate()
sqlcontext = SQLContext(spark)
df = sqlcontext.createDataFrame([['a'],['b'],['c'],['d'],['e']], ['id'])
df = df.withColumn("row_num", row_number().over(Window().orderBy(lit('A'))))
new_col = [[1.,1.], [2.,2.], [3.,3.], [4.,4.], [5.,5.]]
map_list_to_column = udf(lambda row_num: new_col[row_num -1], ArrayType(FloatType()))
df.withColumn('new_col', map_list_to_column(df.row_num)).drop('row_num').show()

Recursion error while trying to create a copy of a Spark DataFrame using the copy module

I am trying to create a copy of a Spark DataFrame using Python's copy module, but I am running into a RecursionError. The following is the code I am using:
>>> df = spark.createDataFrame([[1,2],[3,4]],['x1','x2'])
>>> df.show()
+---+---+
| x1| x2|
+---+---+
| 1| 2|
| 3| 4|
+---+---+
>>> import copy
>>> df_copy = copy.copy(df)
This code results in RecursionError: maximum recursion depth exceeded. The same happens when I use copy.deepcopy as well.
What is the correct way to create copies of a spark dataframe in python? And why does the current approach result in a recursion error?
To (shallow) copy a DataFrame you can just assign it to a new variable:
import pyspark.sql.functions as F
import pandas as pd
# Sample data
df = pd.DataFrame({'x1': [1,2,3] })
df = spark.createDataFrame(df)
df2 = df
df2 = df2.withColumn('x1', F.col('x1') + 1)
print('df:')
df.show()
print('df2:')
df2.show()
Output:
df:
+---+
| x1|
+---+
|  1|
|  2|
|  3|
+---+
df2:
+---+
| x1|
+---+
|  2|
|  3|
|  4|
+---+
As you can see, after assigning df to df2 and deriving a modified DataFrame under the same name, the original DataFrame df remains unchanged, because DataFrames are immutable.
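If you specifically want a separate DataFrame object rather than a second reference to the same one, a common pattern (a sketch using standard DataFrame methods) is to derive a new DataFrame from the existing one:
df_copy = df.select("*")          # a new DataFrame object over the same data
df_aliased = df.alias("df_copy")  # an aliased reference, handy for self-joins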

How to retrieve all columns using the PySpark collect_list function

I am on PySpark 2.0.1. I'm trying to group my DataFrame and retrieve the values for all of its fields. I found that
z = data1.groupby('country').agg(F.collect_list('names'))
will give me the values for the country and names attributes, with the names column labeled collect_list(names). But for my job I have a DataFrame with around 15 columns, and I will run a loop, changing the groupby field each time inside the loop, and I need the output for all of the remaining fields. Can you please suggest how to do this using collect_list() or any other PySpark functions?
I tried this code too
from pyspark.sql import functions as F
fieldnames=data1.schema.names
names1= list()
for item in fieldnames:
    if item != 'names':
        names1.append(item)
z = data1.groupby('names').agg(F.collect_list(names1))
z.show()
but got error message
Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.collect_list. Trace: py4j.Py4JException: Method collect_list([class java.util.ArrayList]) does not exist
Use struct to combine the columns before calling groupBy. Suppose you have a DataFrame:
import pyspark.sql.functions as f

df = spark.createDataFrame(sc.parallelize([(0, 1, 2), (0, 4, 5), (1, 7, 8), (1, 8, 7)])).toDF("a", "b", "c")
df = df.select("a", f.struct(["b", "c"]).alias("newcol"))
df.show()
+---+------+
|  a|newcol|
+---+------+
|  0| [1,2]|
|  0| [4,5]|
|  1| [7,8]|
|  1| [8,7]|
+---+------+
df = df.groupBy("a").agg(f.collect_list("newcol").alias("collected_col"))
df.show()
+---+--------------+
|  a| collected_col|
+---+--------------+
|  0|[[1,2], [4,5]]|
|  1|[[7,8], [8,7]]|
+---+--------------+
Aggregation operations can be done only on single columns.
After the aggregation, you can collect the result and iterate over it to separate the combined columns and build an index dict, or you can write a UDF to separate the combined columns:
from pyspark.sql.types import *
from pyspark.sql.functions import udf, col

def foo(x):
    x1 = [y[0] for y in x]
    x2 = [y[1] for y in x]
    return (x1, x2)

st = StructType([StructField("b", ArrayType(LongType())), StructField("c", ArrayType(LongType()))])
udf_foo = udf(foo, st)
df = df.withColumn("ncol", udf_foo("collected_col")).select(
    "a",
    col("ncol").getItem("b").alias("b"),
    col("ncol").getItem("c").alias("c"))
df.show()
+---+------+------+
|  a|     b|     c|
+---+------+------+
|  0|[1, 4]|[2, 5]|
|  1|[7, 8]|[8, 7]|
+---+------+------+
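On Spark 2.4+ the same unpacking can be done without a Python UDF, using the transform higher-order function through expr (a sketch, assuming grouped is the result of the groupBy/agg step above, i.e. the DataFrame that still has the collected_col column):
from pyspark.sql.functions import expr

grouped.select(
    "a",
    expr("transform(collected_col, x -> x.b)").alias("b"),
    expr("transform(collected_col, x -> x.c)").alias("c"),
).show()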
Actually, we can do it in PySpark 2.2.
First we create a constant column ("Temp"), group by that column, and apply agg with an iterable *exprs that contains a collect_list expression for each column.
Below is the code:
import pyspark.sql.functions as ftions
import functools as ftools

def groupColumnData(df, columns):
    df = df.withColumn("Temp", ftions.lit(1))
    exprs = [ftions.collect_list(colName) for colName in columns]
    df = df.groupby('Temp').agg(*exprs)
    df = df.drop("Temp")
    df = df.toDF(*columns)
    return df
Input Data:
df.show()
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  0|  1|  2|
|  0|  4|  5|
|  1|  7|  8|
|  1|  8|  7|
+---+---+---+
Output Data:
df.show()
+------------+------------+------------+
|           a|           b|           c|
+------------+------------+------------+
|[0, 0, 1, 1]|[1, 4, 7, 8]|[2, 5, 8, 7]|
+------------+------------+------------+
In Spark 2.4.4 and Python 3.7 (I guess it is also relevant for earlier Spark and Python versions):
My suggestion is based on pauli's answer; instead of creating the struct and then using the agg function, create the struct inside collect_list:
from pyspark.sql.functions import collect_list, struct

df = spark.createDataFrame([(0, 1, 2), (0, 4, 5), (1, 7, 8), (1, 8, 7)]).toDF("a", "b", "c")
df.groupBy("a").agg(collect_list(struct(["b", "c"])).alias("res")).show()
result:
+---+----------------+
|  a|             res|
+---+----------------+
|  0|[[1, 2], [4, 5]]|
|  1|[[7, 8], [8, 7]]|
+---+----------------+
I just use the concat_ws function; it works perfectly fine.
from pyspark.sql.functions import collect_list, concat_ws

df = spark.createDataFrame([(0, 1, 2), (0, 4, 5), (1, 7, 8), (1, 8, 7)]).toDF("a", "b", "c")
df.groupBy('a').agg(collect_list(concat_ws(',', 'b', 'c')).alias('r')).show()
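Note that concat_ws turns each (b, c) pair into a single comma-separated string, so the collected column is an array of strings rather than an array of structs (a quick check, assuming the lines above):
res = df.groupBy('a').agg(collect_list(concat_ws(',', 'b', 'c')).alias('r'))
res.printSchema()  # r: array<string>, e.g. ["1,2", "4,5"] for a = 0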
