Pyspark: how to split a dataframe into chunks and save them? - python

I need to split a pyspark dataframe df and save the different chunks.
This is what I am doing: I define a column id_tmp and I split the dataframe based on that.
from pyspark.sql import Window
from pyspark.sql.functions import row_number, monotonically_increasing_id

chunk = 10000
id1 = 0
id2 = chunk
df = df.withColumn('id_tmp', row_number().over(Window.orderBy(monotonically_increasing_id())) - 1)
c = df.count()
while id1 < c:
    stop_df = df.filter((df.id_tmp < id2) & (df.id_tmp >= id1))
    stop_df.write.format('com.databricks.spark.csv').save('myFolder/')
    id1 += chunk
    id2 += chunk
Is there a more efficient way without defining the column id_tmp?

I suggest you use the partitionBy method from the DataFrameWriter interface built into Spark (docs). Here is an example.
Given the df DataFrame, the chunk identifier needs to be one or more columns. In my example it is id_tmp. The following snippet generates a DF with 12 records and 4 chunk ids.
import pyspark.sql.functions as F
df = spark.range(0, 12).withColumn("id_tmp", F.col("id") % 4).orderBy("id_tmp")
df.show()
Returns:
+---+------+
| id|id_tmp|
+---+------+
|  8|     0|
|  0|     0|
|  4|     0|
|  1|     1|
|  9|     1|
|  5|     1|
|  6|     2|
|  2|     2|
| 10|     2|
|  3|     3|
| 11|     3|
|  7|     3|
+---+------+
To save each chunk independently you need:
(df
 .repartition("id_tmp")
 .write
 .partitionBy("id_tmp")
 .mode("overwrite")
 .format("csv")
 .save("output_folder"))
repartition will shuffle the records so that each partition holds the complete set of records for a given "id_tmp" value. Then partitionBy writes each chunk to its own folder as a single file.
Resulting folder structure:
output_folder/
output_folder/._SUCCESS.crc
output_folder/id_tmp=0
output_folder/id_tmp=0/.part-00000-eba244a4-ce95-4f4d-b9b8-8e5f972b144f.c000.csv.crc
output_folder/id_tmp=0/part-00000-eba244a4-ce95-4f4d-b9b8-8e5f972b144f.c000.csv
output_folder/id_tmp=1
output_folder/id_tmp=1/.part-00000-eba244a4-ce95-4f4d-b9b8-8e5f972b144f.c000.csv.crc
output_folder/id_tmp=1/part-00000-eba244a4-ce95-4f4d-b9b8-8e5f972b144f.c000.csv
output_folder/id_tmp=2
output_folder/id_tmp=2/.part-00000-eba244a4-ce95-4f4d-b9b8-8e5f972b144f.c000.csv.crc
output_folder/id_tmp=2/part-00000-eba244a4-ce95-4f4d-b9b8-8e5f972b144f.c000.csv
output_folder/id_tmp=3
output_folder/id_tmp=3/.part-00000-eba244a4-ce95-4f4d-b9b8-8e5f972b144f.c000.csv.crc
output_folder/id_tmp=3/part-00000-eba244a4-ce95-4f4d-b9b8-8e5f972b144f.c000.csv
output_folder/_SUCCESS
The size and number of partitions are quite important for Spark's performance. Don't partition the dataset too finely, and keep reasonable file sizes (around 1 GB per file), especially if you are using cloud storage services. It is also advisable to use the partition columns if you want to filter the data when loading (e.g. year=YYYY/month=MM/day=DD).
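For instance, here is a minimal sketch of reading a single chunk back from the output above (the path and column name match the example; CSV read options such as headers are left out):
import pyspark.sql.functions as F

# Partition discovery turns the id_tmp=X folder names into an id_tmp column,
# and the filter prunes the scan to output_folder/id_tmp=0/ only.
chunk_0 = (spark.read
    .format("csv")
    .load("output_folder")
    .filter(F.col("id_tmp") == 0))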

Related

AWS - how to convert list of tuples into multiple data frames (python - glue job)

I have a list of tuples like
(
(('a1','a2','a3'), [[1,2,3],[4,5,6],[7,8,9]]),
(('b1','b2','b3'), [[11,21,13],[14,15,16],[74,84,94]])
)
This needs to be converted to multiple dataframes, one for each dataset.
For example, a1,a2,a3 must go to dataframe1; similarly, b1,b2,b3 must be converted to dataframe2.
Is there any easy way to iterate through them within a dataframe?
Currently I can convert the 1st dataset into a dataframe using the code below:
x = spark.createDataFrame(data = r[0][1], schema = r[0][0])
r is where I have downloaded and stored the file. This reads only 1,2,3,4,5,6,7,8,9.
Should I convert it into a dict?
I want to be able to iterate over the list of tuples within a dataframe.
Create a list of dataframes and iterate over it.
data = [
    (('a1','a2','a3'), [[1,2,3],[4,5,6],[7,8,9]]),
    (('b1','b2','b3'), [[11,21,13],[14,15,16],[74,84,94]])
]
df_list = [spark.createDataFrame(data[i][1], data[i][0]) for i in range(0, len(data))]
for i in range(0, len(df_list)):
    df_list[i].show()
+---+---+---+
| a1| a2| a3|
+---+---+---+
|  1|  2|  3|
|  4|  5|  6|
|  7|  8|  9|
+---+---+---+

+---+---+---+
| b1| b2| b3|
+---+---+---+
| 11| 21| 13|
| 14| 15| 16|
| 74| 84| 94|
+---+---+---+
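As a side note, the same list can also be built by unpacking each tuple directly; this is equivalent to the comprehension above, just a style choice:
# Unpack (columns, rows) from each tuple instead of indexing into it
df_list = [spark.createDataFrame(rows, list(cols)) for cols, rows in data]
for df_i in df_list:
    df_i.show()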

How to create multiple count columns in Pyspark?

I have a dataframe of title and bin:
+---------------------+-------------+
|                Title|          bin|
+---------------------+-------------+
|  Forrest Gump (1994)|            3|
|  Pulp Fiction (1994)|            2|
|   Matrix, The (1999)|            3|
|     Toy Story (1995)|            1|
|    Fight Club (1999)|            3|
+---------------------+-------------+
How do I count the occurrences of each bin and put them into individual columns of a new dataframe using Pyspark? For instance:
+------------+------------+------------+
| count(bin1)| count(bin2)| count(bin3)|
+------------+------------+------------+
|           1|           1|           3|
+------------+------------+------------+
Is this possible? Would someone please help me with this if you know how?
Group by bin and count, then pivot the bin column and rename the columns of the resulting dataframe if you want:
import pyspark.sql.functions as F
df1 = df.groupBy("bin").count().groupBy().pivot("bin").agg(F.first("count"))
df1 = df1.toDF(*[f"count_bin{c}" for c in df1.columns])
df1.show()
#+----------+----------+----------+
#|count_bin1|count_bin2|count_bin3|
#+----------+----------+----------+
#|         1|         1|         3|
#+----------+----------+----------+

Flatten pyspark Dataframe to get timestamp for each particular value and field

I have tried to find the change in value for each column attribute in the following manner:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

windowSpec = Window.partitionBy("attribute").orderBy(df_series['time'].asc())
final_df_series = df_series.withColumn('lagdate', F.lag(df_series['time'], -1).over(windowSpec))\
    .withColumn("value_lagvalue$df", F.lag(df_series["value"], -1).over(windowSpec))\
    .withColumn("value_grp$df", (F.col("value") - F.col("value_lagvalue$df")).cast("int"))\
    .filter(F.col("value_grp$df") != 0).drop(F.col("value_grp$df"))\
    .select("attribute", "lagdate", "value_lagvalue$df").persist()
The output dataframe from the above code is:
+---------+-------------------+-----------------+
|attribute|            lagdate|value_lagvalue$df|
+---------+-------------------+-----------------+
| column93|2020-09-07 10:29:24|                3|
| column93|2020-09-07 10:29:38|                1|
| column93|2020-09-07 10:31:08|                0|
| column94|2020-09-07 10:29:26|                3|
| column94|2020-09-07 10:29:40|                1|
| column94|2020-09-07 10:31:18|                0|
|column281|2020-09-07 10:29:34|                3|
|column281|2020-09-07 10:29:54|                0|
|column281|2020-09-07 10:31:08|                3|
|column281|2020-09-07 10:31:13|                0|
|column281|2020-09-07 10:35:24|                3|
|column281|2020-09-07 10:36:08|                0|
|column282|2020-09-07 10:41:13|                3|
|column282|2020-09-07 10:49:24|                1|
|column284|2020-09-07 10:51:08|                1|
|column284|2020-09-07 11:01:13|                0|
|column285|2020-09-07 11:21:13|                1|
+---------+-------------------+-----------------+
I want to transform it into the following structure:
attribute,timestamp_3,timestamp_1,timestamp_0
column93,2020-09-07 10:29:24,2020-09-07 10:29:38,2020-09-07 10:31:08
column94,2020-09-07 10:29:26,2020-09-07 10:29:40,2020-09-07 10:31:18
column281,2020-09-07 10:29:34,null,2020-09-07 10:29:54
column281,2020-09-07 10:31:08,null,2020-09-07 10:31:13
column281,2020-09-07 10:35:24,null,2020-09-07 10:36:08
column282,2020-09-07 10:41:13,2020-09-07 10:49:24,null
column284,null,2020-09-07 10:51:08,2020-09-07 11:01:13
column285,null,2020-09-07 11:21:13,null
Any help appreciated. (A solution in pyspark is preferable, since it is optimized for large dataframes of this kind, but a pandas solution would also be very helpful.)
Update:
This article seems to achieve nearly the same thing. I hope the community can help in achieving the desired goal:
PySpark explode list into multiple columns based on name
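For what it's worth, here is a rough pyspark sketch of one possible approach (not from the original post): number the "cycles" within each attribute so that a new cycle starts whenever the value stops decreasing, then group by attribute and cycle and pivot on the value. Column names follow the dataframe above; the set of pivot values [3, 1, 0] is assumed from the sample data.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("attribute").orderBy("lagdate")

# Start a new cycle whenever the value does not decrease compared to the previous row
cycles = (final_df_series
    .withColumn("prev_val", F.lag("value_lagvalue$df").over(w))
    .withColumn("new_cycle",
                F.when(F.col("prev_val").isNull() |
                       (F.col("value_lagvalue$df") >= F.col("prev_val")), 1).otherwise(0))
    .withColumn("cycle", F.sum("new_cycle").over(w)))

# One row per (attribute, cycle), with the timestamp of each value as a column
result = (cycles
    .groupBy("attribute", "cycle")
    .pivot("value_lagvalue$df", [3, 1, 0])
    .agg(F.first("lagdate"))
    .withColumnRenamed("3", "timestamp_3")
    .withColumnRenamed("1", "timestamp_1")
    .withColumnRenamed("0", "timestamp_0")
    .drop("cycle")
    .orderBy("attribute"))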

Pyspark: filter function error with .isNotNull() and 2 other conditions

I'm trying to filter my dataframe in Pyspark and I want to write my results to a parquet file, but I get an error every time because something is wrong with my isNotNull() condition. I have 3 conditions in the filter function, and if one of them is true the resulting row should be written to the parquet file.
I tried different versions with OR and | and different versions with isNotNull(), but nothing helped me.
This is one example I tried:
from pyspark.sql.functions import col
df.filter(
    (df['col1'] == 'attribute1') |
    (df['col1'] == 'attribute2') |
    (df.where(col("col2").isNotNull()))
).write.save("new_parquet.parquet")
This is the other example I tried, but in that example it ignores the rows with attribute1 or attribute2:
df.filter(
    (df['col1'] == 'attribute1') |
    (df['col1'] == 'attribute2') |
    (df['col2'].isNotNull())
).write.save("new_parquet.parquet")
This is the error message:
AttributeError: 'DataFrame' object has no attribute '_get_object_id'
I hope you can help me, I'm new to the topic. Thank you so much!
First off, about the col1 filter, you could do it using isin like this:
df['col1'].isin(['attribute1', 'attribute2'])
And then:
df.filter((df['col1'].isin(['attribute1', 'attribute2'])) | (df['col2'].isNotNull()))
AFAIK, dataframe.column.isNotNull() should work, but I don't have sample data to test it, sorry.
See the example below:
from pyspark.sql import functions as F
df = spark.createDataFrame([(3,'a'),(5,None),(9,'a'),(1,'b'),(7,None),(3,None)], ["id", "value"])
df.show()
The original DataFrame:
+---+-----+
| id|value|
+---+-----+
|  3|    a|
|  5| null|
|  9|    a|
|  1|    b|
|  7| null|
|  3| null|
+---+-----+
Now we do the filter:
df = df.filter((df['id'] == 3) | (df['id'] == 9) | (~F.isnull('value')))
df.show()
+---+-----+
| id|value|
+---+-----+
|  3|    a|
|  9|    a|
|  1|    b|
|  3| null|
+---+-----+
So you see:
row(3, 'a') and row(3, null) are selected because of `df['id']==3`
row(9, 'a') is selected because of `df['id']==9`
row(1, 'b') is selected because of `~F.isnull('value')`, but row(5, null) and row(7, null) are not selected.
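Putting the pieces together for the asker's columns, a sketch of the full filter-and-write could look like the following (column names col1/col2 and the attribute values are taken from the question; the output path and write mode are assumptions):
# Combine the isin filter with the null check and write the result as parquet
filtered = df.filter(
    (df['col1'].isin(['attribute1', 'attribute2'])) |
    (df['col2'].isNotNull())
)
filtered.write.mode("overwrite").parquet("new_parquet.parquet")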

count values in multiple columns that contain a substring based on strings of lists pyspark

I have a data frame in Pyspark like below. I want to count values in two columns based on some lists and populate new columns for each list.
df.show()
+---+-------------+--------------+
| id|       device|  device_model|
+---+-------------+--------------+
|  3|      mac pro|           mac|
|  1|       iphone|       iphone5|
|  1|android phone|       android|
|  1|   windows pc|       windows|
|  1|   spy camera|    spy camera|
|  2|             |        camera|
|  2|       iphone|  apple iphone|
|  3|   spy camera|              |
|  3|         cctv|          cctv|
+---+-------------+--------------+
The lists are below:
phone_list = ['iphone', 'android', 'nokia']
pc_list = ['windows', 'mac']
security_list = ['camera', 'cctv']
I want to count the device and device_model values for each id and pivot the counts into a new data frame.
I want to count the values in both the device_model and device columns, for each id, that match the strings in the lists.
For example: phone_list contains the string iphone, and this should count both iphone and iphone5.
The result I want
+---+------+----+--------+
| id|phones|  pc|security|
+---+------+----+--------+
|  1|     4|   2|       2|
|  2|     2|null|       1|
|  3|  null|   2|       3|
+---+------+----+--------+
I have done it like below:
df.withColumn('cat',
    F.when(df.device.isin(phone_list), 'phones').otherwise(
        F.when(df.device.isin(pc_list), 'pc').otherwise(
            F.when(df.device.isin(security_list), 'security')))
).groupBy('id').pivot('cat').agg(F.count('cat')).show()
Using the above I can only handle the device column, and only when the string matches exactly. I am unable to figure out how to do it for both columns and when a value contains the string.
How can I achieve the result I want?
Here is a working solution. I have used a udf function for checking the strings and calculating the sum. You can use built-in functions if possible. (Comments are provided as explanation.)
#creating dictionary for the lists with names for columns
columnLists = {'phone': phone_list, 'pc': pc_list, 'security': security_list}

#udf function for checking the strings and summing them
from pyspark.sql import functions as F
from pyspark.sql import types as t

def checkDevices(device, deviceModel, name):
    sum = 0
    for x in columnLists[name]:
        if x in device:
            sum += 1
        if x in deviceModel:
            sum += 1
    return sum

checkDevicesAndSum = F.udf(checkDevices, t.IntegerType())

#populating the sum returned from udf function to respective columns
for x in columnLists:
    df = df.withColumn(x, checkDevicesAndSum(F.col('device'), F.col('device_model'), F.lit(x)))

#finally grouping and sum
df.groupBy('id').agg(F.sum('phone').alias('phone'), F.sum('pc').alias('pc'), F.sum('security').alias('security')).show()
which should give you
+---+-----+---+--------+
| id|phone| pc|security|
+---+-----+---+--------+
|  3|    0|  2|       3|
|  1|    4|  2|       2|
|  2|    2|  0|       1|
+---+-----+---+--------+
The aggregation part can be generalized like the rest. Improvements and modifications are all in your hands. :)
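For reference, a rough sketch of what a UDF-free version could look like, using rlike to do the substring matching on both columns (this is my own assumption, not part of the answer above; null values in the two columns would need extra handling):
from pyspark.sql import functions as F

categories = {'phones': phone_list, 'pc': pc_list, 'security': security_list}

aggs = []
for name, words in categories.items():
    pattern = "|".join(words)  # e.g. "iphone|android|nokia"
    # 1 per column whose value contains any of the words, summed per id
    matches = (F.col("device").rlike(pattern).cast("int") +
               F.col("device_model").rlike(pattern).cast("int"))
    aggs.append(F.sum(matches).alias(name))

df.groupBy("id").agg(*aggs).show()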
