Merge two dataframes in PySpark - python

I have two dataframes, DF1 and DF2. DF1 is the master, which accumulates any additional information from DF2.
Let's say DF1 is in the following format:
Item Id | item | count
---------------------------
1 | item 1 | 2
2 | item 2 | 3
1 | item 3 | 2
3 | item 4 | 5
DF2 contains two items that are already present in DF1 and two new entries. (Item Id and item together form a single group and can be treated as the key for the join.)
Item Id | item | count
---------------------------
1 | item 1 | 2
3 | item 4 | 2
4 | item 4 | 4
5 | item 5 | 2
I need to combine the two dataframes so that the counts of existing items are incremented and new items are inserted.
The result should look like this:
Item Id | item | count
---------------------------
1 | item 1 | 4
2 | item 2 | 3
1 | item 3 | 2
3 | item 4 | 7
4 | item 4 | 4
5 | item 5 | 2
I have one way to achieve this, but I'm not sure if it's efficient or the right way to do it:
from pyspark.sql import functions as F

# temp is presumably DF2 with its count column renamed to "newcount"
temp1 = df1.join(temp, ['item_id', 'item'], 'full_outer') \
    .na.fill(0)

temp1 \
    .groupby("item_id", "item") \
    .agg(F.sum(temp1["count"] + temp1["newcount"])) \
    .show()

Since the schema of the two dataframes is the same, you can perform a union and then group by the key columns and aggregate the counts:
from pyspark.sql import functions as F

# step 1: union the two dataframes
df3 = df1.union(df2)
# step 2: group by the key columns and sum the counts
df3.groupBy("Item Id", "item").agg(F.sum("count").alias("count"))

There are several ways to do this.
Based on what you describe, the most straightforward solution would be to work at the RDD level with SparkContext.union:
rdd1 = DF1.rdd
rdd2 = DF2.rdd
union_rdd = sc.union([rdd1, rdd2])
An alternative solution would be to use DataFrame.union from pyspark.sql.
Note: I suggested unionAll previously, but it is deprecated since Spark 2.0.
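A minimal sketch of that DataFrame-level alternative, assuming DF1 and DF2 have the schema shown in the question (union keeps all rows, so the counts still need to be aggregated afterwards):
from pyspark.sql import functions as F

combined = DF1.union(DF2)
combined.groupBy("Item Id", "item").agg(F.sum("count").alias("count")).show()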

wandermonk's solution is recommended because it does not use a join. Avoid joins as much as possible, since they trigger shuffling (also known as a wide transformation), which means data is transferred over the network, and that is expensive and slow.
You should also look at your data sizes (both tables big, one small and one big, etc.) so you can tune the performance side accordingly.
I tried showing the group-by solution using Spark SQL, as it does the same thing but is easier to understand and manipulate.
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

list_1 = [[1, "item 1", 2], [2, "item 2", 3], [1, "item 3", 2], [3, "item 4", 5]]
list_2 = [[1, "item 1", 2], [3, "item 4", 2], [4, "item 4", 4], [5, "item 5", 2]]

my_schema = StructType([
    StructField("Item_ID", IntegerType(), True),
    StructField("Item_Name", StringType(), True),
    StructField("Quantity", IntegerType(), True)
])

df1 = spark.createDataFrame(list_1, my_schema)
df2 = spark.createDataFrame(list_2, my_schema)
df1.createOrReplaceTempView("df1")
df2.createOrReplaceTempView("df2")

df3 = df2.union(df1)
df3.createOrReplaceTempView("df3")

df4 = spark.sql("select Item_ID, Item_Name, sum(Quantity) as Quantity from df3 group by Item_ID, Item_Name")
df4.show(10)
Now if you look at the Spark UI, you can see the shuffle operation and the number of stages even for such a small data set.
[Screenshot: number of stages for such a small job]
[Screenshot: the shuffle operation for this group-by command]
I also recommend looking at the SQL plan and understanding the cost; the Exchange node represents the shuffle here.
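The plan below can be printed from the same session, for example with df4 from the snippet above:
# prints the physical plan, including the Exchange (shuffle) node
df4.explain()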
== Physical Plan ==
*(2) HashAggregate(keys=[Item_ID#6, Item_Name#7], functions=[sum(cast(Quantity#8 as bigint))], output=[Item_ID#6, Item_Name#7, Quantity#32L])
+- Exchange hashpartitioning(Item_ID#6, Item_Name#7, 200)
   +- *(1) HashAggregate(keys=[Item_ID#6, Item_Name#7], functions=[partial_sum(cast(Quantity#8 as bigint))], output=[Item_ID#6, Item_Name#7, sum#38L])
      +- Union
         :- Scan ExistingRDD[Item_ID#6,Item_Name#7,Quantity#8]
         +- Scan ExistingRDD[Item_ID#0,Item_Name#1,Quantity#2]

Related

How to combine two pandas dataset based on multiple conditions?

I want to combine two datasets in Python based on multiple conditions using pandas.
The two datasets have different numbers of rows: the first one contains almost 300k entries, while the second one contains almost 1000 entries.
More specifically, the first dataset, "A", has the following columns:
Path | Line | Severity | Vulnerability | Name | Text | Title
An instance of the content of "A" is this:
src.bla.bla.class.java | 24 | medium | Logging found | hr.kravarscan.enchantedfortress_15 | description | Enchanted Fortress
While the second dataset: "B" contains the following information:
Class | Path | DTWC | DR | DW | IDFP
An instance of the content in "B" is this:
y.x.bla.MainActivity | com.lucao.limpazap_11| 0 | 0 | 0 | 0
I want to combine these two datasets as follows:
If A['Name'] is equal to B['Path'] AND B['Class'] is in A['Class'],
then
merge the two lines into another data frame "C".
An output example is the following:
Suppose that A contains:
src.bla.bla.class.java| 24| medium| Logging found| hr.kravarscan.enchantedfortress_15| description| Enchanted Fortress|
and B contains:
com.bla.class | hr.kravarscan.enchantedfortress_15| 0 | 0 | 0 | 0
the output should be the following:
src.bla.bla.class.java| 24| medium| Logging found| hr.kravarscan.enchantedfortress_15| description| Enchanted Fortress| com.bla.class | hr.kravarscan.enchantedfortress_15| 0 | 0 | 0 | 0
I'm not sure if this is the best or most efficient way, but I have tested it and it works. My answer is pretty straightforward: we loop over the two dataframes and apply the desired conditions.
Suppose dataset A is df_a and dataset B is df_b.
First we have to add a suffix to every column of df_a and df_b so that matching rows can be combined later.
df_a.columns= [i+'_A' for i in df_a.columns]
df_b.columns= [i+'_B' for i in df_b.columns]
Then we can apply this for loop:
df_c = pd.DataFrame()
# Iterate through df_a
for (idx_A, v_A) in df_a.iterrows():
    # Iterate through df_b
    for (idx_B, v_B) in df_b.iterrows():
        # Apply the condition
        if v_A['Name_A'] == v_B['Path_B'] and v_B['Class_B'] in v_A['Path_A']:
            # Cast both series to dictionaries and merge them into a new dict
            c_dict = {**v_A.to_dict(), **v_B.to_dict()}
            # Append c_dict to df_c (note: DataFrame.append is removed in pandas 2.x;
            # there you would collect the dicts in a list and pd.concat at the end)
            df_c = df_c.append(c_dict, ignore_index=True)
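A faster alternative, as a rough sketch assuming the same _A/_B suffixes as above, is to merge on the equality condition first and then filter the substring condition, which avoids the nested Python loop:
import pandas as pd

# merge on the equality condition first (Name_A == Path_B) ...
merged = df_a.merge(df_b, left_on='Name_A', right_on='Path_B', how='inner')
# ... then keep only the rows where Class_B is a substring of Path_A
mask = merged.apply(lambda row: row['Class_B'] in row['Path_A'], axis=1)
df_c = merged[mask].reset_index(drop=True)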

Subsetting data with a column condition

I have a dataframe which contains Dates, Visitor_ID and Pages columns. The Pages column has different row-wise entries for each date. Please refer to the table below to understand the data.
| Dates     | Visitor_ID | Pages      |
|-----------|------------|------------|
| 10/1/2021 | 1          | xy         |
| 10/1/2021 | 1          | step2      |
| 10/1/2021 | 1          | xx         |
| 10/1/2021 | 1          | NetBanking |
| 10/1/2021 | 2          | step1      |
| 10/1/2021 | 2          | xy         |
| 10/1/2021 | 3          | step1      |
| 10/1/2021 | 3          | NetBanking |
| 11/1/2021 | 4          | step1      |
| 12/1/2021 | 4          | NetBanking |
Desired output:
| Dates     | Visitor_ID |
|-----------|------------|
| 10/1/2021 | 1          |
| 10/1/2021 | 3          |
The output should be a subset of the actual data: for a given Visitor_ID, if a page containing the string "step" appears before a page containing "NetBanking" on the same date, return that Visitor_ID.
To initialise your dataframe you could do:
import pandas as pd
columns = ["Dates", "Visitor_ID", "Pages"]
records = [
    ["10/1/2021", 1, "xy"],
    ["10/1/2021", 1, "step2"],
    ["10/1/2021", 1, "NetBanking"],
    ["10/1/2021", 2, "step1"],
    ["10/1/2021", 2, "xy"],
    ["10/1/2021", 3, "step1"],
    ["10/1/2021", 3, "NetBanking"],
    ["11/1/2021", 4, "step1"],
    ["12/1/2021", 4, "NetBanking"],
]
data = pd.DataFrame().from_records(records, columns=columns)
data["Dates"] = pd.DatetimeIndex(data["Dates"])
index_names = columns[:2]
data.set_index(index_names, drop=True, inplace=True)
Note that I have left out your third line in the records, otherwise I cannot reproduce your desired output. I have made this a multi-index data frame in order to easily loop over the groups 'date/visitor'. The structure of the dataframe looks like:
print(data)
                            Pages
Dates      Visitor_ID
2021-10-01 1                   xy
           1                step2
           1           NetBanking
           2                step1
           2                   xy
           3                step1
           3           NetBanking
2021-11-01 4                step1
2021-12-01 4           NetBanking
Now to select the customers from the same date and from the same group, I am going to loop over these groups and use 2 masks to select the required records:
for date_time, data_per_date in data.groupby(level=0):
    for visitor, data_per_visitor in data_per_date.groupby(level=0):
        # select the column with the Pages
        pages = data_per_visitor["Pages"].str
        # make 2 boolean masks, for the records with step and netbanking
        has_step = pages.contains("step")
        has_netbanking = pages.contains("NetBanking")
        # to get the records after each 'step' record, apply a diff on 'has_step'.
        # Convert to int first for the correct result.
        # Each diff with outcome -1 fulfills this requirement; make a
        # mask based on this requirement
        diff_step = has_step.astype(int).diff()
        records_after_step = diff_step == -1
        # combine the 2 masks to create your final mask to make a selection
        mask = records_after_step & has_netbanking
        # select the records and print to screen
        selection = data_per_visitor[mask]
        if not selection.empty:
            print(selection.reset_index()[index_names])
This gives the following output:
Dates Visitor_ID
0 2021-10-01 1
1 2021-10-01 3
EDIT:
I was reading your question again. The solution above assumes that only a record with 'NetBanking' directly following a record with 'step' is valid; that is why I thought your example input did not correspond with your desired output. However, if you allow rows in between an occurrence of 'step' and the first 'NetBanking', the solution above does not work. In that case it is better to explicitly iterate over the rows of your dataframe per date and client id. An example would be:
for date_time, data_per_date in data.groupby(level=0):
    for visitor, data_per_visitor in data_per_date.groupby(level=0):
        after_step = False
        index_selection = list()
        data_per_visitor.reset_index(inplace=True)
        for index, records in data_per_visitor.iterrows():
            page = records["Pages"]
            if "step" in page and not after_step:
                after_step = True
            if "NetBanking" in page and after_step:
                index_selection.append(index)
                after_step = False
        selection = data_per_visitor.reindex(index_selection)
        if not selection.empty:
            print(selection.reset_index()[index_names])
Normally I would not recommend using 'iterrows', as it is really slow, but in this case I don't see an easy alternative. The output of the second algorithm is the same as the first for my data. If you do include the third line from your example data, the second algorithm still gives the same output.
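A vectorized sketch of this 'rows in between are allowed' interpretation, reusing the data and index_names defined above: flag the step and NetBanking rows per date/visitor group and check whether a NetBanking row appears at or after the first step row.
def step_then_netbanking(group):
    has_step = group["Pages"].str.contains("step")
    has_netbanking = group["Pages"].str.contains("NetBanking")
    # cummax() marks every row from the first "step" onwards
    return (has_step.cummax() & has_netbanking).any()

matches = data.groupby(level=[0, 1]).apply(step_then_netbanking)
print(matches[matches].reset_index()[index_names])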

Not able to use grouped unique values as index in a pandas pivot_table DataFrame

I'm trying to filter unique values for product descriptions (short: description), each with its unique prices (net_price). Example:
description   net_price   site
Product 1     9           USCA
Product 2     7           USCA
product 3     6           USCA
product 1     12          USNY
product 2     7           USNY
product 4     10          USBP
This is what I get after filtering with un = master.dropna().groupby('description')['net_price'].unique(), which is what I want: a list of products, each with its unique prices:
description net_price
product 1 [9,12]
product 2 [7]
product 3 [6]
product 4 [10]
Using serv = master.pivot_table(values='quantity', index=un, columns='site', aggfunc=np.sum) I'm trying to build a pivot table that uses the unique descriptions as the index.
What I'm expecting to get:
USCA USNY USBP
description
product 1 9
product 1 12
product 2 7
product 3 6
product 4 10
What I actually get is an empty DataFrame. NOTE: this works if I don't filter for unique values, except that it then does not give me duplicates with different prices.
my script:
import camelot
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import re
re_filter = re.compile(r'^[A-Z]')
pdf_data = camelot.read_pdf('test.pdf', pages='all', flavor='lattice', encoding='latin-1')  # GETS TABLES
list_of_df = list()  # LIST HOLDS ALL DATAFRAMES
prev_df = []  # LIST HOLDS DATAFRAMES WHOSE ENDING SITE COL IS A DIGIT
for table in pdf_data:  # iterates over all dataframes
    df = table.df.rename({0: 'product_no', 1: 'description', 2: 'quantity', 3: 'net_price', 4: 'total'}, axis=1)
    site = df.iloc[-1, 0]
    df = df.reindex(df.index.drop(0)).reset_index(drop=True)
    if re.match(re_filter, str(site)):  # CHECKS IF SITE STARTS LIKE 'USCA' INSTEAD OF '112233'
        if len(prev_df) > 0:  # CHECKS IF 'prev_df' HAS ANY DATAFRAME IN IT
            prev_df.insert(1, df)
            df = pd.concat(prev_df)  # TRIES TO CONCATENATE CURRENT DATAFRAME WITH THE ONE STORED IN 'prev_df'
            df["site"] = site
            df.drop(index=df.index[-1], axis=0, inplace=True)  # THIS IS DROPPING LAST ROW OF 'prev_df' --> need to move
            list_of_df.append(df)
            prev_df.clear()  # removes item in prev_df
        else:
            # makes normal changes to current DataFrame
            df['site'] = site
            df.drop(index=df.index[-1], axis=0, inplace=True)
            list_of_df.append(df)
    else:
        prev_df.insert(0, df)  # APPENDS INCOMPLETE DATAFRAME TO 'prev_df'
master = pd.concat(list_of_df)
# FILTER UNIQUE LIST OF DESCRIPTIONS
un= master.dropna().groupby('description')['net_price'].unique()
print(un)
# master_filtered = master.dropna(axis=1, how='all', inplace=True)
# currently gets data in this format need to display like this
# | service1 | 4 | site1 | site1 | site2 | site3 | site4 | site5 |
# | 2 | site2 service 1 | 4 | 2 | | | 4 |
# | 4 | site5
serv= master.pivot_table(values='quantity', index=un, columns='site', aggfunc=np.sum)
print(serv)
serv.to_csv('test2-output.csv')
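For reference, a rough sketch of how pivot_table is usually fed (untested against this PDF data): pass column names for index instead of the pre-aggregated un Series, and make sure quantity is numeric, since camelot extracts text.
# quantity comes out of camelot as text, so convert it before aggregating
master['quantity'] = pd.to_numeric(master['quantity'], errors='coerce')

# index takes column names; using both description and net_price keeps one
# row per unique (description, price) pair, as in the expected layout above
serv = master.pivot_table(values='quantity',
                          index=['description', 'net_price'],
                          columns='site',
                          aggfunc=np.sum)
print(serv)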

How can I compare columns from another dataframe in PySpark

I have two dataframes:
First with AVG values:
+----------+-----+
| Category | AVG |
+----------+-----+
| Categ    | 1.0 |
| Categ2   | 0.5 |
+----------+-----+
...
The second has the following columns: Category, Name, Price.
The question is:
How can I delete all the records for which the price is less than the average price from the first table?
I tried that way:
dataGreaterAvge = data.where(data.Price >= avgCategoryPrice.where(data.Category == avgCategoryPrice.Category).collect()[0]["avg(Price)"])
dataGreaterAvge - First dataframe
data - Second dataframe
However, this does not work as it should, because it only takes the value of the first element from the average values table
Spark works like SQL... so...
First you need to join the dataframes.
a = df1.alias('a')
b = df2.alias('b')
df_joined = a.join(b, a.Category == b.Category)
Then you will be able to filter properly:
from pyspark.sql.functions import col

# here df1 is assumed to be the averages table (alias a) and df2 the
# Category/Name/Price table (alias b); keep only the records whose price
# is at least the category average
df_joined.where(col('b.Price') >= col('a.AVG')) \
         .select(col('b.Category'), col('b.Name'), col('b.Price'))
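An alternative sketch that avoids the explicit join, assuming the column names from the question: compute the per-category average with a window function over the second dataframe and filter directly. The window aggregate still shuffles by Category, but it keeps everything in one dataframe.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy('Category')
result = (data
          .withColumn('avg_price', F.avg('Price').over(w))
          .where(F.col('Price') >= F.col('avg_price'))
          .drop('avg_price'))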

Python - Groupby a DataFrameGroupBy object

I have a pandas dataframe in Python to which I am applying a groupby, and then I want to apply a new groupby + sum on the previous result. To be more specific, first I am doing:
check_df = data_df.groupby(['hotel_code', 'dp_id', 'market', 'number_of_rooms'])[['market', 'number_of_rooms']]
And then I want to do:
check_df = check_df.groupby(['market'])['number_of_rooms'].sum()
So, I am getting the following error:
AttributeError: Cannot access callable attribute 'groupby' of 'DataFrameGroupBy'
objects, try using the 'apply' method
My initial data look like that:
hotel_code | market | number_of_rooms | ....
---------------------------------------------
001 | a | 200 | ...
001 | a | 200 |
002 | a | 300 | ...
Notice that I may have duplicates of pairs like (a, 200); that's why I need the first groupby.
What I want in the end is something like that:
Market | Rooms
--------------
a | 3000
b | 250
I'm just trying to translate the following sql query into python:
select a.market, sum(a.number_of_rooms)
from (
    select market, number_of_rooms
    from opinmind_dev..cg_mm_booking_dataset_full
    group by hotel_code, market, number_of_rooms
) as a
group by market;
Any ideas how I can fix that? If you need any more info, let me know.
ps. I am new to Python and data science
IIUC, instead of:
check_df = data_df.groupby(['hotel_code', 'dp_id', 'market', 'number_of_rooms'])[['market', 'number_of_rooms']]
You should simply do:
check_df = data_df.drop_duplicates(subset=['hotel_code', 'dp_id', 'market', 'number_of_rooms'])\
                  .loc[:, ['market', 'number_of_rooms']]\
                  .groupby('market')\
                  .sum()
import numpy as np
import pandas as pd

df = pd.DataFrame({'Market': [1, 1, 1, 2, 2, 2, 3, 3], 'Rooms': range(8), 'C': np.random.rand(8)})
Market Rooms C
0 1 0 0.187793
1 1 1 0.325284
2 1 2 0.095147
3 2 3 0.296781
4 2 4 0.022262
5 2 5 0.201078
6 3 6 0.160082
7 3 7 0.683151
You need to move the column selection away from the grouped DataFrame. Either of the following should work.
df.groupby('Market').sum()[['Rooms']]
df[['Rooms']].groupby(df['Market']).sum()
Rooms
Market
1 3
2 12
3 13
If you select using ['Rooms'] instead of [['Rooms']] you will get a Series instead of a DataFrame.
The dataframes produced use market as their index. If you want to convert it to a normal data column, use:
df.reset_index()
Market Rooms
0 1 3
1 2 12
2 3 13
If I understand your question correctly, you could simply do:
data_df.groupby('Market').agg({'Rooms': np.sum})
or:
data_df.groupby(['Market'], as_index=False).agg({'Rooms': np.sum})
data_df = pd.DataFrame({'Market': ['A', 'B', 'C', 'B'],
                        'Hotel': ['H1', 'H2', 'H4', 'H5'],
                        'Rooms': [20, 40, 50, 34]})

data_df.groupby('Market').agg({'Rooms': np.sum})
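On the sample data_df above, that aggregation should give roughly:
        Rooms
Market
A          20
B          74
C          50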
