If duplicate row, update rows to 0 in PySpark - python

I need to update values in the DF.EMAIL column to 0 when there are duplicate values in that column.
# generate DF
data = [('2345', 'leo#gmai.com'),
('2398', 'leo#hotmai.com'),
('2398', 'leo#hotmai.com'),
('2328', 'leo#yahoo.con'),
('3983', 'leo#yahoo.com.ar')]
# create DF
df = sc.parallelize(data).toDF(['ID', 'EMAIL'])
# show DF
df.show()
Partial Solution
from pyspark.sql.functions import count, when

# create a duplicate_indicator column: 0 if the row has no duplicates, 1 if it does
df_join = df.join(
    df.groupBy(df.columns).agg((count("*") > 1).cast("int").alias("duplicate_indicator")),
    on=df.columns,
    how="inner"
)
# set EMAIL to "" where duplicate_indicator == 1
df1 = df_join.withColumn(
    "EMAIL",
    when(df_join.duplicate_indicator == 1, "")
    .otherwise(df_join.EMAIL)
)

Syntax-wise, this looks more compact but yours might perform better.
df = (df.withColumn('count', count('*').over(Window.partitionBy('ID')))
        .withColumn('EMAIL', when(col('count') > 1, '').otherwise(col('EMAIL'))))
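For completeness, a runnable sketch of the same window idea with the imports it needs, assuming a SparkSession is available; it partitions by EMAIL (the column the question is actually about) and writes the string '0' as the title suggests, rather than an empty string:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, when
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(data, ['ID', 'EMAIL'])

# count how often each EMAIL occurs, then overwrite the duplicated ones with '0'
w = Window.partitionBy('EMAIL')
df = (df.withColumn('cnt', count('*').over(w))
        .withColumn('EMAIL', when(col('cnt') > 1, '0').otherwise(col('EMAIL')))
        .drop('cnt'))
df.show()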

Related

How to create lag feature in pandas in this case?

I have a table like this (with more columns):
date,Sector,Value1,Value2
14/03/22,Medical,86,64
14/03/22,Medical,464,99
14/03/22,Industry,22,35
14/03/22,Services,555,843
15/03/22,Services,111,533
15/03/22,Industry,222,169
15/03/22,Medical,672,937
15/03/22,Medical,5534,825
I have created some features like this:
sectorGroup = df.groupby(["date","Sector"])[["Value1","Value2"]].mean().reset_index()
df = pd.merge(df,sectorGroup,on=["date","Sector"],how="left",suffixes=["","_bySector"])
dateGroupGroup = df.groupby(["date"])[["Value1","Value2"]].mean().reset_index()
df = pd.merge(df,dateGroupGroup,on=["date"],how="left",suffixes=["","_byDate"])
Now my new df looks like this:
date,Sector,Value1,Value2,Value1_bySector,Value2_bySector,Value1_byDate,Value2_byDate
14/03/22,Medical,86,64,275.0,81.5,281.75,260.25
14/03/22,Medical,464,99,275.0,81.5,281.75,260.25
14/03/22,Industry,22,35,22.0,35.0,281.75,260.25
14/03/22,Services,555,843,555.0,843.0,281.75,260.25
15/03/22,Services,111,533,111.0,533.0,1634.75,616.0
15/03/22,Industry,222,169,222.0,169.0,1634.75,616.0
15/03/22,Medical,672,937,3103.0,881.0,1634.75,616.0
15/03/22,Medical,5534,825,3103.0,881.0,1634.75,616.0
Now, I want to create lag features for Value1_bySector, Value2_bySector, Value1_byDate, Value2_byDate.
For example, new columns named Value1_by_Date_lag1 and Value1_bySector_lag1.
These new columns will look like this:
date,Sector,Value1_by_Date_lag1,Value1_bySector_lag1
15/03/22,Services,281.75,555.0
15/03/22,Industry,281.75,22.0
15/03/22,Medical,281.75,275.0
15/03/22,Medical,281.75,275.0
Basically in Value1_by_Date_lag1, the date "15/03" will contain the value "281.75" which is for the date "14/03" (lag of 1 shift).
Basically in Value1_bySector_lag1, the date "15/03" and Sector "Medical" will contain the value "275.0", which is the value for "14/03" and "Medical" rows.
I hope the question is clear and gives you all the details.
Create a lagged date variable by shifting the date column, and then merge again with dateGroupGroup and sectorGroup using the lagged date instead of the actual date.
import io
import pandas as pd

df = pd.read_csv(io.StringIO("""date,Sector,Value1,Value2
14/03/22,Medical,86,64
14/03/22,Medical,464,99
14/03/22,Industry,22,35
14/03/22,Services,555,843
15/03/22,Services,111,533
15/03/22,Industry,222,169
15/03/22,Medical,672,937
15/03/22,Medical,5534,825"""))
# Add a lagged date variable
lagged = df.groupby("date")["date"].first().shift()
df = df.join(lagged, on="date", rsuffix="_lag")
# Create date and sector groups and merge them into df, as you already do
sectorGroup = df.groupby(["date","Sector"])[["Value1","Value2"]].mean().reset_index()
df = pd.merge(df,sectorGroup,on=["date","Sector"],how="left",suffixes=["","_bySector"])
dateGroupGroup = df.groupby("date")[["Value1","Value2"]].mean().reset_index()
df = pd.merge(df, dateGroupGroup, on="date",how="left", suffixes=["","_byDate"])
# Merge again, this time matching the lagged date in df to the actual date in sectorGroup and dateGroupGroup
df = pd.merge(df, sectorGroup, left_on=["date_lag", "Sector"], right_on=["date", "Sector"], how="left", suffixes=["", "_by_sector_lag"])
df = pd.merge(df, dateGroupGroup, left_on="date_lag", right_on="date", how="left", suffixes=["", "_by_date_lag"])
# Drop the extra unnecessary columns that have been created in the merge
df = df.drop(columns=['date_by_date_lag', 'date_by_sector_lag'])
This assumes the data is sorted by date - if not you will have to sort before generating the lagged date. It will work whether or not all the dates are consecutive.
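If it is not sorted yet, a minimal sketch of that preparatory step (note the dates are dd/mm/yy strings, so they should be parsed first to avoid a lexicographic sort):
# parse the dd/mm/yy strings so the sort is chronological
df["date"] = pd.to_datetime(df["date"], format="%d/%m/%y")
df = df.sort_values("date").reset_index(drop=True)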
I found one inefficient solution (slow and memory intensive).
Lag of "date" group:
cols = ["Value1_byDate", "Value2_byDate"]
temp = df[["date"] + cols]
temp = temp.drop_duplicates()
for i in range(10):
    temp.date = temp.date.shift(-1 - i)
    df = pd.merge(df, temp, on="date", how="left", suffixes=["", "_lag" + str(i + 1)])
Lag of "date" and "Sector" group:
cols = ["Value1_bySector", "Value2_bySector"]
temp = df[["date", "Sector"] + cols]
temp = temp.drop_duplicates()
for i in range(10):
    temp[["Value1_bySector", "Value2_bySector"]] = temp.groupby("Sector")[["Value1_bySector", "Value2_bySector"]].shift(1 + i)
    df = pd.merge(df, temp, on=["date", "Sector"], how="left", suffixes=["", "_lag" + str(i + 1)])
Is there a simpler solution?
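One arguably simpler pattern for a single lag (a sketch building on the *_byDate and *_bySector columns created above; extending it to 10 lags just means looping over the shift amount): de-duplicate down to the group level, shift there, and merge back.
# date-level lag: one row per date, shift the per-date means to the next date
by_date = df[["date", "Value1_byDate", "Value2_byDate"]].drop_duplicates().sort_values("date")
by_date["Value1_byDate_lag1"] = by_date["Value1_byDate"].shift(1)
by_date["Value2_byDate_lag1"] = by_date["Value2_byDate"].shift(1)
df = df.merge(by_date[["date", "Value1_byDate_lag1", "Value2_byDate_lag1"]], on="date", how="left")

# sector-level lag: shift within each sector, ordered by date
by_sec = df[["date", "Sector", "Value1_bySector", "Value2_bySector"]].drop_duplicates().sort_values("date")
by_sec["Value1_bySector_lag1"] = by_sec.groupby("Sector")["Value1_bySector"].shift(1)
by_sec["Value2_bySector_lag1"] = by_sec.groupby("Sector")["Value2_bySector"].shift(1)
df = df.merge(by_sec[["date", "Sector", "Value1_bySector_lag1", "Value2_bySector_lag1"]], on=["date", "Sector"], how="left")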

Pandas: Replace empty column values with the non-empty value based on a condition

I have a dataset in this format:
and it needs to be grouped by the DocumentId and PersonId columns and sorted by StartDate, which I am doing like this:
df = pd.read_csv(path).sort_values(by=["StartDate"]).groupby(["DocumentId", "PersonId"])
Now, if there is a row in this group with DocumentCode RT and a non-empty EndDate, all other rows need to be filled with that end date. So the result dataset should be the following:
I could not figure out a way to do that. I think I can iterate over each groupby subset, but how will I find the end date value and replace it for each row in that subset?
Based on the suggestions to use bfill(), I tried the following:
df["EndDate"] = (
df.sort_values(by=["StartDate"])
.groupby(["DocumentId", "PersonId"])["EndDate"]
.bfill()
)
The above works fine, but how can I add the condition that DocumentCode must be RT?
You can calculate the value to use for filling NaN inside the apply function.
def fill_end_date(df):
    rt_doc = df[df["DocumentCode"] == "RT"]
    # if there is a row in this group with DocumentCode RT
    if not rt_doc.empty:
        end_date = rt_doc.iloc[0]["EndDate"]
        # and EndDate is not empty
        if pd.notnull(end_date):
            # all other rows need to be filled with that end date
            df = df.fillna({"EndDate": end_date})
    return df

df = pd.read_csv(path).sort_values(by=["StartDate"])
df.groupby(["DocumentId", "PersonId"]).apply(fill_end_date).reset_index(drop=True)
You could find the empty cells and replace them with np.nan, then fillna with method='bfill':
import numpy as np

df['EndDate'] = df['EndDate'].apply(lambda x: np.nan if x == '' else x)
df['EndDate'].fillna(method='bfill', inplace=True)
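The backfill alone ignores the DocumentCode condition from the question; a rough way to combine the two (a sketch, assuming missing EndDate values are NaN and the frame is already sorted by StartDate):
# keep EndDate only where it comes from an RT row, spread that value across the
# (DocumentId, PersonId) group, and use it to fill the missing EndDate values
rt_end = df["EndDate"].where(df["DocumentCode"] == "RT")
group_fill = rt_end.groupby([df["DocumentId"], df["PersonId"]]).transform(lambda s: s.ffill().bfill())
df["EndDate"] = df["EndDate"].fillna(group_fill)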
Alternatively you could iterate through the df from last row to first row, and fill in the EndDate where necessary:
d = df.loc[df.shape[0] - 1, 'EndDate']  # initial condition
for i in range(df.shape[0] - 1, -1, -1):
    if df.loc[i, 'DocumentCode'] == 'RT':
        d = df.loc[i, 'EndDate']
    else:
        df.loc[i, 'EndDate'] = d

How to write complicated function to aggregate DataFrame

I have a DataFrame in Python like below, which presents agreements of clients:
df = pd.DataFrame({"ID" : [1,2,1,1,3],
"amount" : [100,200,300,400,500],
"status" : ["active", "finished", "finished",
"active", "finished"]})
I need to write a FUNCTION in Python which will calculate:
1. Number (NumAg) and amount (AmAg) of contracts per ID
2. Number (NumAct) and amount (AmAct) of active contracts per ID
3. Number (NumFin) and amount (AmFin) of finished contracts per ID
To be more precise, I need this function to create a DataFrame like the one below:
The below solution should fit your use case.
import pandas as pd

def summarise_df(df):
    # Define mask to filter df by 'active' value in 'status' column for 'NumAct', 'AmAct', 'NumFin', and 'AmFin' columns
    active_mask = df['status'].str.contains('active')
    return df.groupby('ID').agg(  # Create first columns in output df using agg (no mask needed)
        NumAg=pd.NamedAgg(column='amount', aggfunc='count'),
        AmAg=pd.NamedAgg(column='amount', aggfunc='sum')
    ).join(  # Add columns using values with 'active' status
        df[active_mask].groupby('ID').agg(
            NumAct=pd.NamedAgg(column='amount', aggfunc='count'),
            AmAct=pd.NamedAgg(column='amount', aggfunc='sum')
        )
    ).join(  # Add columns using values with NOT 'active' (i.e. 'finished') status
        df[~active_mask].groupby('ID').agg(
            NumFin=pd.NamedAgg(column='amount', aggfunc='count'),
            AmFin=pd.NamedAgg(column='amount', aggfunc='sum')
        )
    ).fillna(0)  # Replace NaN values with 0
I would recommend reading over this function and its comments alongside documentation for groupby() and join() so that you can develop a better understanding of exactly what is being done here. It is seldom a wise decision to rely upon code that you don't have a good grasp on.
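A quick usage sketch with the sample frame from the question:
df = pd.DataFrame({"ID": [1, 2, 1, 1, 3],
                   "amount": [100, 200, 300, 400, 500],
                   "status": ["active", "finished", "finished", "active", "finished"]})
print(summarise_df(df))  # one row per ID with NumAg, AmAg, NumAct, AmAct, NumFin, AmFin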
You could use groupby on ID with agg, after adding two bool columns that make the aggregation easier:
df['AmAct'] = df.amount[df.status.eq('active')]
df['AmFin'] = df.amount[df.status.eq('finished')]
df = df.groupby('ID').agg(
    NumAg=('ID', 'count'),
    AmAg=('amount', 'sum'),
    NumAct=('status', lambda col: col.eq('active').sum()),
    AmAct=('AmAct', 'sum'),
    NumFin=('status', lambda col: col.eq('finished').sum()),
    AmFin=('AmFin', 'sum')
)
Result:
    NumAg  AmAg  NumAct  AmAct  NumFin  AmFin
ID
1       3   800       2  500.0       1  300.0
2       1   200       0    0.0       1  200.0
3       1   500       0    0.0       1  500.0
Or add some more columns to df to do a simpler groupby on ID with sum:
df.insert(1, 'NumAg', 1)
df['NumAct'] = df.status.eq('active')
df['AmAct'] = df.amount[df.NumAct]
df['NumFin'] = df.status.eq('finished')
df['AmFin'] = df.amount[df.NumFin]
df.drop(columns=['status'], inplace=True)
df = df.groupby('ID').sum().rename(columns={'amount': 'AmAg'})
with the same result.
Or, maybe the easiest way, let pivot_table do most of the work, after adding a count column to df, and some column-rearranging afterwards:
df['count'] = 1
df = df.pivot_table(index='ID', columns='status', values=['count', 'amount'],
                    aggfunc=sum, fill_value=0, margins=True).drop('All')
df.columns = ['AmAct', 'AmFin', 'AmAg', 'NumAct', 'NumFin', 'NumAg']
df = df[['NumAg', 'AmAg', 'NumAct', 'AmAct', 'NumFin', 'AmFin']]

Create dataframe conditionally on other dataframe elements

Happy 2020! I would like to create a dataframe based on two others. I have the below two dataframes:
df1 = pd.DataFrame({'date':['03.05.1982','04.05.1982','05.05.1982','06.05.1982','07.05.1982','10.05.1982','11.05.1982'],'A': [63.63,64.08,64.19,65.11,65.36,65.25,65.36], 'B': [63.83, 64.10, 64.19, 65.08, 65.33, 65.28, 65.36], 'C':[63.99, 64.22, 64.30, 65.16, 65.41, 65.36, 65.44]})
df2 = pd.DataFrame({'Name':['A','B','C'],'Notice': ['05.05.1982','07.05.1982','12.05.1982']})
The idea is to create df3 such that this dataframe takes the value of A until A's notice date (found in df2) is reached, then df3 switches to the values of B until B's notice date is reached and so on. When we are during notice date, it should take the mean between the current column and the next one.
In the above example, df3 should be as follows (with formulas to illustrate):
df3 = pd.DataFrame({'date':['03.05.1982','04.05.1982','05.05.1982','06.05.1982','07.05.1982','10.05.1982','11.05.1982'], 'Result':[63.63,64.08,(64.19+64.19)/2,65.08,(65.33+65.41)/2,65.36,65.44]})
My idea was to first create a temporary dataframe with same dimensions as df1 and to fill it with 1's when the index date is prior to notice and 0's after. Doing a rolling mean with window 1 would give for each column a series of 1 until I reach 0.5 (signalling a switch).
Not sure if there is a better way to get df3?
I tried the following:
def fill_rule(df_p, df_t):
    return np.where(df_p.index > df_t[df_t.Name == df_p.name]['Notice'][0], 0, 1)

df1['date'] = pd.to_datetime(df1['date'])
df2['notice'] = pd.to_datetime(df2['notice'])
df1.set_index("date", inplace=True)
temp = df1.apply(lambda x: fill_rule(x, df2), axis=0)
And I got the following error: KeyError: (0, 'occurred at index B')
# map each date to the contract whose Notice falls on that date, then backfill
df1['t'] = df1['date'].map(df2.set_index(["Notice"])['Name'])
df1['t'] = df1['t'].fillna(method='bfill').fillna("C")
df3 = pd.DataFrame()
# for every row, pick the value from the column named in 't'
df3['Result'] = df1.apply(lambda row: row[row['t']], axis=1)
df3['date'] = df1['date']
You can use the between method to select the specific date ranges and then use loc to substitute the specific values:
# Initializing the output
df3 = df1.copy()
df3.drop(['B', 'C'], axis=1, inplace=True)
df3.columns = ['date', 'Result']
df3['Result'] = 0.0
df3['count'] = 0

# Modifying df2 to add a dummy sample at the beginning
temp = df2.copy()
temp = temp.iloc[0]
temp = pd.DataFrame(temp).T
temp.Name = 'Z'
temp.Notice = pd.to_datetime("05-05-1980")
df2 = pd.concat([temp, df2])

for i in range(len(df2) - 1):
    startDate = df2.iloc[i]['Notice']
    endDate = df2.iloc[i + 1]['Notice']
    name = df2.iloc[i + 1]['Name']
    indices = [df1.date.between(startDate, endDate, inclusive=True)][0]
    df3.loc[indices, 'Result'] += df1[indices][name]
    df3.loc[indices, 'count'] += 1

df3.Result = df3.apply(lambda x: x.Result / x['count'], axis=1)

Comparing two Data Frames and getting differences

I want to compare two Data Frames and print out my differences in a selective way. Here is what I want to accomplish in pictures:
Dataframe 1
Dataframe 2
Desired Output - Dataframe 3
What I have tried so far?
import pandas as pd
import numpy as np
df1 = pd.read_excel("01.xlsx")
df2 = pd.read_excel("02.xlsx")
def diff_pd(df1, df2):
    """Identify differences between two pandas DataFrames."""
    assert (df1.columns == df2.columns).all(), \
        "DataFrame column names are different"
    if any(df1.dtypes != df2.dtypes):
        print("Data types are different, trying to convert")
        df2 = df2.astype(df1.dtypes)
    if df1.equals(df2):
        return None
    else:  # need to account for np.nan != np.nan returning True
        diff_mask = (df1 != df2) & ~(df1.isnull() & df2.isnull())
        ne_stacked = diff_mask.stack()
        changed = ne_stacked[ne_stacked]
        changed.index.names = ['id', 'Naziv usluge']
        difference_locations = np.where(diff_mask)
        changed_from = df1.values[difference_locations]
        changed_to = df2.values[difference_locations]
        return pd.DataFrame({'Service Previous': changed_from, 'Service Current': changed_to},
                            index=changed.index)

df3 = diff_pd(df1, df2)
df3 = df3.fillna(0)
df3 = df3.reset_index()
print(df3)
To be fair, I found that code on another thread, and it does get the job done, but I still have some issues:
My dataframes are not equal - what do I do?
I don't fully understand the code I provided.
Thank you!
How about something easier to start with ...
Try this:
import pandas as pd

data1 = {'Name': ['Tom', 'Bob', 'Mary'], 'Age': [20, 30, 40], 'Pay': [10, 10, 20]}
data2 = {'Name': ['Tom', 'Bob', 'Mary'], 'Age': [40, 30, 20]}
df1 = pd.DataFrame.from_records(data1)
df2 = pd.DataFrame.from_records(data2)

# Checking columns
for col in df1.columns:
    if col not in df2.columns:
        print(f"DF2 Missing Col {col}")

# Check column values
for col in df1.columns:
    if col in df2.columns:
        # OK, we have the same column
        if list(df1[col]) == list(df2[col]):
            print(f"Columns {col} are the same")
        else:
            print(f"Columns {col} have differences")
It should output
DF2 Missing Col Pay
Columns Age have differences
Columns Name are the same
Python 3.7 is needed, or change the f-string formatting.
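If the two frames do not even have the same rows (the "my dataframes are not equal" case above), one option worth sketching is to align them on a key column first with an outer merge and the indicator flag (assuming here that 'Name' is the shared key):
# outer-merge on the key so rows missing from either side survive the comparison
merged = df1.merge(df2, on="Name", how="outer", suffixes=("_df1", "_df2"), indicator=True)

# rows present in only one of the frames
print(merged[merged["_merge"] != "both"])

# rows present in both frames but with a different Age
both = merged[merged["_merge"] == "both"]
print(both[both["Age_df1"] != both["Age_df2"]])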
