Add list containing dates to PySpark Dataframe

Add list containing dates to PySpark Dataframe - python

I've created a list of dates that I would like to add to a Spark dataframe with StructType = StringType. However, the final df below only contains null values.
#Step 1: Create data-range and put into list
start_date = '2020-05-01'
end_date = '2020-05-10'
my_dates = pd.date_range(start_date,end_date).tolist()
#Step 2: Add list to Spark Df
cSchema = StructType([StructField("date", ArrayType(StringType()))])
df2 = spark.createDataFrame(my_dates,schema,cSchema)

Maybe you could try something like:
start_date = '2020-05-01'
end_date = '2020-05-10'
my_dates = pd.date_range(start_date,end_date).tolist()
new_df = spark.createDataFrame([(value,) for value in mydates], ['date'])
new_df.show()

Related

How to create lag feature in pandas in this case?

I have a table like this (with more columns):
date,Sector,Value1,Value2
14/03/22,Medical,86,64
14/03/22,Medical,464,99
14/03/22,Industry,22,35
14/03/22,Services,555,843
15/03/22,Services,111,533
15/03/22,Industry,222,169
15/03/22,Medical,672,937
15/03/22,Medical,5534,825
I have created some features like this:
sectorGroup = df.groupby(["date","Sector"])["Value1","Value2"].mean().reset_index()
df = pd.merge(df,sectorGroup,on=["date","Sector"],how="left",suffixes=["","_bySector"])
dateGroupGroup = df.groupby(["date"])["Value1","Value2"].mean().reset_index()
df = pd.merge(df,dateGroupGroup,on=["date"],how="left",suffixes=["","_byDate"])
Now my new df looks like this:
date,Sector,Value1,Value2,Value1_bySector,Value2_bySector,Value1_byDate,Value2_byDate
14/03/22,Medical,86,64,275.0,81.5,281.75,260.25
14/03/22,Medical,464,99,275.0,81.5,281.75,260.25
14/03/22,Industry,22,35,22.0,35.0,281.75,260.25
14/03/22,Services,555,843,555.0,843.0,281.75,260.25
15/03/22,Services,111,533,111.0,533.0,1634.75,616.0
15/03/22,Industry,222,169,222.0,169.0,1634.75,616.0
15/03/22,Medical,672,937,3103.0,881.0,1634.75,616.0
15/03/22,Medical,5534,825,3103.0,881.0,1634.75,616.0
Now, I want to create lag features for Value1_bySector,Value2_bySector,Value1_byDate,Value2_byDate
For example, a new column named Value1_by_Date_lag1 and Value1_bySector_lag1.
And this new column will look like this:
date,Sector,Value1_by_Date_lag1,Value1_bySector_lag1
15/03/22,Services,281.75,555.0
15/03/22,Industry,281.75,22.0
15/03/22,Medical,281.75,275.0
15/03/22,Medical,281.75,275.0
Basically in Value1_by_Date_lag1, the date "15/03" will contain the value "281.75" which is for the date "14/03" (lag of 1 shift).
Basically in Value1_bySector_lag1, the date "15/03" and Sector "Medical" will contain the value "275.0", which is the value for "14/03" and "Medical" rows.
I hope, the question is clear and gave you all the details.

Create a lagged date variable by shifting the date column, and then merge again with dateGroupGroup and sectorGroup using the lagged date instead of the actual date.
df = pd.read_csv(io.StringIO("""date,Sector,Value1,Value2
14/03/22,Medical,86,64
14/03/22,Medical,464,99
14/03/22,Industry,22,35
14/03/22,Services,555,843
15/03/22,Services,111,533
15/03/22,Industry,222,169
15/03/22,Medical,672,937
15/03/22,Medical,5534,825"""))
# Add a lagged date variable
lagged = df.groupby("date")["date"].first().shift()
df = df.join(lagged, on="date", rsuffix="_lag")
# Create date and sector groups and merge them into df, as you already do
sectorGroup = df.groupby(["date","Sector"])[["Value1","Value2"]].mean().reset_index()
df = pd.merge(df,sectorGroup,on=["date","Sector"],how="left",suffixes=["","_bySector"])
dateGroupGroup = df.groupby("date")[["Value1","Value2"]].mean().reset_index()
df = pd.merge(df, dateGroupGroup, on="date",how="left", suffixes=["","_byDate"])
# Merge again, this time matching the lagged date in df to the actual date in sectorGroup and dateGroupGroup
df = pd.merge(df, sectorGroup, left_on=["date_lag", "Sector"], right_on=["date", "Sector"], how="left", suffixes=["", "_by_sector_lag"])
df = pd.merge(df, dateGroupGroup, left_on="date_lag", right_on="date", how="left", suffixes=["", "_by_date_lag"])
# Drop the extra unnecessary columns that have been created in the merge
df = df.drop(columns=['date_by_date_lag', 'date_by_sector_lag'])
This assumes the data is sorted by date - if not you will have to sort before generating the lagged date. It will work whether or not all the dates are consecutive.

I found 1 inefficient solution (slow and memory intensive).
Lag of "date" group
cols = ["Value1_byDate","Value2_byDate"]
temp = df[["date"]+cols]
temp = temp.drop_duplicates()
for i in range(10):
temp.date = temp.date.shift(-1-i)
df = pd.merge(df,temp,on="date",how="left",suffixes=["","_lag"+str(i+1)])
Lag of "date" and "Sector" group
cols = ["Value1_bySector","Value2_bySector"]
temp = df[["date","Sector"]+cols]
temp = temp.drop_duplicates()
for i in range(10):
temp[["Value1_bySector","Value2_bySector"]] = temp.groupby("Sector")["Value1_bySector","Value2_bySector"].shift(1+1)
df = pd.merge(df,temp,on=["date","Sector"],how="left",suffixes=["","_lag"+str(i+1)])
Is there a more simple solution?

Pandas: Replace empty column values with the non-empty value based on a condition

I have a dataset in this format:
and it needs to be grouped by DocumentId and PersonId columns and sorted by StartDate. Which I doing it like this:
df = pd.read_csv(path).sort_values(by=["StartDate"]).groupby(["DocumentId", "PersonId"])
Now if there is row in this group by with DocumentCode RT and EndDate not empty, all other rows need to be filled by that end date. So this result dataset should be following:
I could not figure out a way to do that. I think I can iterate over each groupby subset but how will find the value from the end date and replace it for each row in that subset.
Based on the suggestions to use bfill(). I tried putting it as following:
df["EndDate"] = (
df.sort_values(by=["StartDate"])
.groupby(["DocumentId", "PersonId"])["EndDate"]
.bfill()
)
Above works fine but how can I add the condition for DocumentCode being RT?

You can calculate the value to use to fill nan inside the apply function.
def fill_end_date(df):
rt_doc = df[df["DocumentCode"] == "RT"]
# if there is row in this group by with DocumentCode RT
if not rt_doc.empty:
end_date = rt_doc.iloc[0]["EndDate"]
# and EndDate not empty
if pd.notnull(end_date):
# all other rows need to be filled by that end date
df = df.fillna({"EndDate": end_date})
return df
df = pd.read_csv(path).sort_values(by=["StartDate"])
df.groupby(["DocumentId", "PersonId"]).apply(fill_end_date).reset_index(drop=True)

You could find the empty cells and replace with np.nan, then fillna with method='bfill'
df['EndDate'] = df['EndDate'].apply(lambda x: np.nan if x=='' else x)
df['EndDate'].fillna(method = 'bfill', inplace=True)
Alternatively you could iterate through the df from last row to first row, and fill in the EndDate where necessary:
d = df.loc[df.shape[0]-1, 'EndDate'] #initial condition
for i in range(df.shape[0]-1, -1, -1):
if df.loc[i, 'DocumentCode'] == 'RT':
d = df.loc[i, 'EndDate']
else:
df.loc[i, 'EndDate'] = d

Unable to convert a list into a dataframe. Keep getting the error "ValueError: Must pass 2-d input. shape=(1, 4, 5)"

I have to 2 dfs:
dfMiss:
and
dfSuper:
I need to create a final output that summarises the data in the 2 tables which I am able to do shown in the code below:
dfCity = dfSuper \
.groupby(by='City').count() \
.drop(columns='Superhero ID') \
.rename(columns={'Superhero': 'Total count'})
print("This is the df city : ")
print(dfCity)
## Convert column MissionEndDate to DateTime format
for df in dfMiss:
# Dates are interpreted as MM/dd/yyyy by default, dayfirst=False
df['Mission End date'] = pd.to_datetime(df['Mission End date'], dayfirst=True)
# Get Year and Quarter, given Q1 2020 starts in April
date = df['Mission End date'] - pd.DateOffset(months=3)
df['Mission End quarter'] = date.dt.year.astype(str) + ' Q' + date.dt.quarter.astype(str)
## Get no. Superheros working per City per Quarter
dfCount = []
for dfM in dfMiss:
# Merge DataFrames
df = dfSuper.merge(dfM, left_on='Superhero ID', right_on='SID')
df = df.pivot_table(index=['City', 'Superhero'], columns='Mission End quarter', aggfunc='nunique')
# Get the first group (all the groups have the same values)
df = df[df.columns[0][0]]
# Group the values by City (effectively "collapsing" the 'Superhero' column)
df = df.groupby(by=['City']).count()
dfCount += [df]
## Get no. Superheros available per City per Quarter
dfFree = []
for dfC in dfCount:
# Merge DataFrames
df = dfCity.merge(right=dfC, on='City', how='outer').fillna(0) # convert NaN values to 0
# Subtract no. working superheros from total no. superheros per city
for col in df.columns[1:]:
df[col] = df['Total count'] - df[col]
dfFree += [df.astype(int)]
print(dfFree)
dfResult = pd.DataFrame(dfFree)
The problem is when I try to convert DfFree into a dataframe I get the error:
"ValueError: Must pass 2-d input. shape=(1, 4, 5) "
The line that raises the error is
dfResult = pd.DataFrame(dfFree)
Anyone have any idea what this means and how I can convert the list into a df?
Thanks :)

separate your code using SOLID. separation of concerns. It is not easy to read
sid=[665544,665544,2121,665544,212121,123456,666666]
mission_end_date=["10/10/2020", "03/03/2021", "02/02/2021", "05/12/2020", "15/07/2021", "03/06/2021", "12/10/2020"]
superherod_sid=[212121,364331,678523,432432,665544,123456,555555,666666,432432]
hero=["Spiderman","Ironman","Batman","Dr. Strange","Thor","Superman","Nightwing","Loki","Wolverine"]
city=["New York","New York","Gotham","New York","Asgard","Metropolis","Gotham","Asgard","New York"]
df_mission=pd.DataFrame({'sid':sid,'mission_end_date':mission_end_date})
df_super=pd.DataFrame({'sid':superherod_sid,'hero':hero, 'city':city})
df=df_super.merge(df_mission,on="sid", how="left")
df['mission_end_date']=pd.to_datetime(df['mission_end_date'])
df['mission_end_date_quarter']=df['mission_end_date'].dt.quarter
df['mission_end_date_year']=df['mission_end_date'].dt.year
print(df.head(20))
pivot = df.pivot_table(index=['city', 'hero'], columns='mission_end_date_quarter', aggfunc='nunique').fillna(0)
print(pivot.head())

Optimize the unique id with in certain period

I have below dataframe called "df" and calculating the sum by unique id called "Id".
Can anyone help me in optimizing the code i have tried.
import pandas as pd
from datetime import datetime, timedelta
df= {'Date':['2019-01-11 10:23:45','2019-01-09 10:23:45', '2019-01-11 10:27:45',
'2019-01-11 10:25:45', '2019-01-11 10:30:45', '2019-01-11 10:35:45',
'2019-02-09 10:25:45'],
'Id':['100','200','300','100','100', '100','200'],
'Amount':[200,400,330,100,300,200,500],
}
df= pd.DataFrame(df)
df["Date"] = pd.to_datetime(df['Date'])

You can try to use groupby, after this each adjust within sub-groupby not to the whole df
s = {}
for x , y in df.groupby(['Id','NCC']):
for i in y.index:
start_date = y['Date'][i] - timedelta(seconds=300)
end_date = y['Date'][i]
mask = (y['Date'] >= start_date) & (y['Date'] < end_date)
count = y.loc[mask]
count = count.loc[(y['Sys'] == 1)]
if len(count) == 0:
s.update({i : 0})
else:
s.update({i : count['Amount'].sum()})
df['New']=pd.Series(s)

If the original data frame has 2 million rows, it would probably be faster to convert the 'Date' column to an index and sort it. Then you can sub select each 5-minute interval:
df = df.set_index('Date').sort_index()
df['Sum_Amt'] = 0
for end in df.index:
start = end - pd.Timedelta('5min')
current_window = df[start : end] # data frame with 5-minute look-back
sum_amt = <calc logic applied to `current_window` goes here>
df.at[end, 'Sum_Amt'] = sum_amt
print(current_window)
print()
I'm not following the logic for calculating Sum_Amt, so I left that out.

Create dataframe conditionally to other dataframe elements

Happy 2020! I would like to create a dataframe based on two others. I have the below two dataframes:
df1 = pd.DataFrame({'date':['03.05.1982','04.05.1982','05.05.1982','06.05.1982','07.05.1982','10.05.1982','11.05.1982'],'A': [63.63,64.08,64.19,65.11,65.36,65.25,65.36], 'B': [63.83, 64.10, 64.19, 65.08, 65.33, 65.28, 65.36], 'C':[63.99, 64.22, 64.30, 65.16, 65.41, 65.36, 65.44]})
df2 = pd.DataFrame({'Name':['A','B','C'],'Notice': ['05.05.1982','07.05.1982','12.05.1982']})
The idea is to create df3 such that this dataframe takes the value of A until A's notice date (found in df2) is reached, then df3 switches to the values of B until B's notice date is reached and so on. When we are during notice date, it should take the mean between the current column and the next one.
In the above example, df3 should be as follows (with formulas to illustrate):
df3 = pd.DataFrame({'date':['03.05.1982','04.05.1982','05.05.1982','06.05.1982','07.05.1982','10.05.1982','11.05.1982'], 'Result':[63.63,64.08,(64.19+64.19)/2,65.08,(65.33+65.41)/2,65.36,65.44]})
My idea was to first create a temporary dataframe with same dimensions as df1 and to fill it with 1's when the index date is prior to notice and 0's after. Doing a rolling mean with window 1 would give for each column a series of 1 until I reach 0.5 (signalling a switch).
Not sure if there is a better way to get df3?
I tried the following:
def fill_rule(df_p,df_t):
return np.where(df_p.index > df_t[df_t.Name==df_p.name]['Notice'][0], 0, 1)
df1['date'] = pd.to_datetime(df1['date'])
df2['notice'] = pd.to_datetime(df2['notice'])
df1.set_index("date", inplace = True)
temp = df1.apply(lambda x: fill_rule(x, df2), axis = 0)
And I got the following error: KeyError: (0, 'occurred at index B')

df1['t'] = df1['date'].map(df2.set_index(["Notice"])['Name'])
df1['t'] =df1['t'].fillna(method='bfill').fillna("C")
df3 = pd.DataFrame()
df3['Result'] = df1.apply(lambda row: row[row['t']],axis =1)
df3['date'] = df1['date']

You can use the between method to select the specific date ranges in both dataframes and then use iloc to substitute the specific values
#Initializing the output
df3 = df1.copy()
df3.drop(['B','C'], axis = 1, inplace = True)
df3.columns = ['date','Result']
df3['Result'] = 0.0
df3['count'] = 0
#Modifying df2 to add a dummy sample at the beginning
temp = df2.copy()
temp = temp.iloc[0]
temp = pd.DataFrame(temp).T
temp.Name ='Z'
temp.Notice = pd.to_datetime("05-05-1980")
df2 = pd.concat([temp,df2])
for i in range(len(df2)-1):
startDate = df2.iloc[i]['Notice']
endDate = df2.iloc[i+1]['Notice']
name = df2.iloc[i+1]['Name']
indices = [df1.date.between(startDate, endDate, inclusive=True)][0]
df3.loc[indices,'Result'] += df1[indices][name]
df3.loc[indices,'count'] += 1
df3.Result = df3.apply(lambda x : x.Result/x['count'], axis = 1)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Add list containing dates to PySpark Dataframe - python

Maybe you could try something like: start_date = '2020-05-01' end_date = '2020-05-10' my_dates = pd.date_range(start_date,end_date).tolist() new_df = spark.createDataFrame([(value,) for value in mydates], ['date']) new_df.show()

Related

How to create lag feature in pandas in this case?

Pandas: Replace empty column values with the non-empty value based on a condition

Unable to convert a list into a dataframe. Keep getting the error "ValueError: Must pass 2-d input. shape=(1, 4, 5)"

Optimize the unique id with in certain period

Create dataframe conditionally to other dataframe elements

Categories

Resources