I want to do something similar to what pd.combine_first() does, but as a row-wise operation performed on a shared index, and also to add a new column in place of the old ones while keeping the original values of the shared column names.
In this case the 'ts' column is the one I want to replace with time_now.
time_now = "2022-08-05"
row1 = {'unique_id':5,'ts': '2022-08-02','id2':2,'val':300, 'ffo1':55, 'debt':200}
row2 = {'unique_id':5,'ts': '2022-08-03' ,'id2':2, 'esg':True,'gov_rt':90}
row3 = {'unique_id':5,'ts': '2022-08-04','id2':2, 'rank':5,'buy_or_sell':'sell'}
df = pd.DataFrame([row1,row2,row3])
unique_id ts id2 val ffo1 debt esg gov_rt rank \
0 5 2022-08-02 2 300.0 55.0 200.0 NaN NaN NaN
1 5 2022-08-03 2 NaN NaN NaN True 90.0 NaN
2 5 2022-08-04 2 NaN NaN NaN NaN NaN 5.0
buy_or_sell
0 NaN
1 NaN
2 sell
My desired output is below, using the new timestamp, but keeping the old ones based on their group index.
rows = [{'unique_id':5, 'ts':time_now ,'id2':2,'val':300, 'ffo1':55, 'debt':200,'esg':True,'gov_rt':90,'rank':5,'buy_or_sell':'sell', 'ts_1':'2022-08-02','ts_2':'2022-08-03', 'ts_3':'2022-08-04'}]
output = pd.DataFrame(rows)
unique_id ts id2 val ffo1 debt esg gov_rt rank \
0 5 2022-08-05 2 300 55 200 True 90 5
buy_or_sell ts_1 ts_2 ts_3
0 sell 2022-08-02 2022-08-03 2022-08-04
The part below seems to work when run by itself. But I cannot get it to work inside of a function because of differences between index lengths.
df2 = df.set_index('ts').stack().reset_index()
rows = dict(zip(df2['level_1'], df2[0]))
ts = df2['ts'].unique().tolist()
for cnt, value in enumerate(ts):
    rows[f'ts_{cnt}'] = value
# drop all rows
df2 = pd.DataFrame([rows])
df2['time'] = time_now
df2
The problem was that I forgot to put the dictionary into a list to create a records-oriented dataframe. Additionally, when using a similar function, the index might need to be dropped when resetting it, since duplicate columns might otherwise be created.
I still wonder if there's a better way to do what I want, since it's kind of slow.
def func(df):
    df2 = df.set_index('ts').stack().reset_index()
    rows = dict(zip(df2['level_1'], df2[0]))
    ts = df2['ts'].unique().tolist()
    for cnt, value in enumerate(ts):
        rows[f'ts_{cnt}'] = value
    # drop all rows
    df2 = pd.DataFrame([rows])
    df2['time'] = time_now
    return df2
#run this
df.groupby('unique_id').apply(func)
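A possibly faster alternative, sketched under the assumption that the desired output above is exactly what is wanted: collapse each group with groupby.first (which takes the first non-null value per column), and spread the original timestamps into ts_1, ts_2, ... with a cumcount-numbered pivot, avoiding the per-group stack and rebuild.
# Sketch only; assumes df and time_now as defined above.
collapsed = df.groupby('unique_id', as_index=False).first()   # first non-null value per column

# number the timestamps within each group, then pivot them into ts_1, ts_2, ...
ts_wide = (
    df.assign(n=df.groupby('unique_id').cumcount() + 1)
      .pivot(index='unique_id', columns='n', values='ts')
      .add_prefix('ts_')
      .reset_index()
)

out = collapsed.merge(ts_wide, on='unique_id')
out['ts'] = time_now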
I am trying to reformat a DataFrame into a single line item per categorical group, but my fixed format needs to retain all elements of data associated with the category as new columns.
For example, I have a DataFrame:
dta = {'day':['A','A','A','A','B','C','C','C','C','C'],
'param1':[100,200,2,3,7,23,43,98,1,0],
'param2':[1,20,65,3,67,2,3,98,654,5]}
df = pd.DataFrame(dta)
I need to be able to transform/reformat the DataFrame where the data is grouped by the 'day' column (e.g. one row per day) but then has columns generated dynamically according to how many entries are within each category.
For example category C in the 'day' column has 5 entries, meaning for 'day' C you would have 5 param1 values and 5 param2 values.
The associated values for days A and B would be populated with NaN or empty where they do not have entries.
e.g.
dta2 = {'day':['A','B','C'],
'param1_1':[100,7,23],
'param1_2':[200,np.nan,43],
'param1_3':[2,np.nan,98],
'param1_4':[3,np.nan,1],
'param1_5':[np.nan,np.nan,0],
'param2_1':[1,67,2],
'param2_2':[20,np.nan,3],
'param2_3':[65,np.nan,98],
'param2_4':[3,np.nan,654],
'param2_5':[np.nan,np.nan,5]
}
df2 = pd.DataFrame(dta2)
Unfortunately this is a predefined format that I have to maintain.
I am aiming to use Pandas as efficiently as possible to minimise deconstructing and reassembling the DataFrame.
You first need to melt, then add a helper column using cumcount of the labels per group, and pivot:
df2 = (
df.melt(id_vars='day')
.assign(group=lambda d: d.groupby(['day', 'variable']).cumcount().add(1).astype(str))
.pivot(index='day', columns=['variable', 'group'], values='value')
)
df2.columns = df2.columns.map('_'.join)
df2 = df2.reset_index()
output:
day param1_1 param1_2 param1_3 param1_4 param1_5 param2_1 param2_2 param2_3 param2_4 param2_5
0 A 100.0 200.0 2.0 3.0 NaN 1.0 20.0 65.0 3.0 NaN
1 B 7.0 NaN NaN NaN NaN 67.0 NaN NaN NaN NaN
2 C 23.0 43.0 98.0 1.0 0.0 2.0 3.0 98.0 654.0 5.0
I'm currently trying to add together a column that has two rows, as such:
Now I just need to add rows 1 and 2 together for each column, and I want to append the average underneath each column, under its respective header name. I currently have this:
for x in sub_houseKeeping:
    if "phch" in x:
        sub_houseKeeping['Average'] = sub_houseKeeping[x].sum()/2
However, this adds together the entire row and appends it to the end of the rows, not the bottom of the column as I wished. How can I fix it to add to the bottom of the column?
This?
import io
import pandas as pd

data=''' id  a   b
0   1   34  10
1   2   27  40'''
df = pd.read_csv(io.StringIO(data), sep='\s+', engine='python')
df1 = df.append(df[['a', 'b']].mean(), ignore_index=True)
df1
id a b
0 1.0 34.0 10.0
1 2.0 27.0 40.0
2 NaN 30.5 25.0
Try this:
sub_houseKeeping = pd.DataFrame({'ID':['200650_s_at','1565446_at'], 'phchp003v1':[2174.84972,6.724141107], 'phchp003v2':[444.9008362,4.093883364]})
sub_houseKeeping = sub_houseKeeping.append(pd.DataFrame(sub_houseKeeping.mean(axis=0)).T, ignore_index=True)
Output:
print(sub_houseKeeping)
ID phchp003v1 phchp003v2
0 200650_s_at 2174.849720 444.900836
1 1565446_at 6.724141 4.093883
2 NaN 1090.786931 224.497360
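Note: DataFrame.append was removed in pandas 2.0, so on newer versions the same result can be built with pd.concat. A minimal sketch using the example data above:
import pandas as pd

sub_houseKeeping = pd.DataFrame({'ID': ['200650_s_at', '1565446_at'],
                                 'phchp003v1': [2174.84972, 6.724141107],
                                 'phchp003v2': [444.9008362, 4.093883364]})
# mean of the numeric columns as a one-row frame, stacked underneath the data
mean_row = sub_houseKeeping.mean(numeric_only=True).to_frame().T
sub_houseKeeping = pd.concat([sub_houseKeeping, mean_row], ignore_index=True)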
Apologies if this has already been asked and answered, but I searched for a whole day and could not locate the right solution. Please point me towards it if a solution already exists.
I am trying to fill NA/NaN values in a column of my pandas dataframe (df1). The fill values are located in another dataframe (df2), which contains the unique ids and a corresponding value. How do I match the id in df1.Prod_id (where the existing value in df1.item_wt is NaN) and then find the corresponding value in df2.mean_wt to fill the NaN value in df1.item_wt? The dataframes are of different sizes: df1 has 80k+ rows while df2 has only 1559. The column names are also different because they come from different sources. The fill has to be done in place.
I would appreciate any pandas way of doing this, to avoid iterative looping given the size of the actual dataframe.
I have tried combine_first and map with zero success, as the dataframe sizes are different, so the extra rows get no replacement.
data1 = {'Prod_id':['PR1', 'PR2', 'PR3', 'PR4', 'PR2', 'PR3', 'PR1', 'PR4'],
         'store':['store1','store2','store3','store6','store3','store8','store45','store23'],
         'item_wt':[28, np.nan, 29, 42, np.nan, 34, 87, np.nan]}
df1 = pd.DataFrame(data1)
data2 = {'Item_name':['PR1', 'PR2', 'PR3', 'PR4'],'mean_wt':[18,12,22,9]}
df2 = pd.DataFrame(data2)
final df should be like:
data1 = {'Prod_id':['PR1', 'PR2', 'PR3', 'PR4', 'PR2', 'PR3', 'PR1', 'PR4'],
         'store':['store1','store2','store3','store6','store3','store8','store45','store23'],
         'item_wt':[28, 12, 29, 42, 12, 34, 87, 9]}
df1 = pd.DataFrame(data1)
You can use fillna, assigning the underlying numpy array via .values, because the original and new Series have different indices:
df1['item_wt'] = (df1.set_index('Prod_id')['item_wt']
.fillna(df2.set_index('Item_name')['mean_wt']).values)
print (df1)
Prod_id store item_wt
0 PR1 store1 28.0
1 PR2 store2 12.0
2 PR3 store3 29.0
3 PR4 store6 42.0
4 PR2 store3 12.0
5 PR3 store8 34.0
6 PR1 store45 87.0
7 PR4 store23 9.0
Or use map first:
s = df2.set_index('Item_name')['mean_wt']
df1['item_wt'] = df1['item_wt'].fillna(df1['Prod_id'].map(s))
#alternative
#df1['item_wt'] = df1['item_wt'].combine_first(df1['Prod_id'].map(s))
print (df1)
Prod_id store item_wt
0 PR1 store1 28.0
1 PR2 store2 12.0
2 PR3 store3 29.0
3 PR4 store6 42.0
4 PR2 store3 12.0
5 PR3 store8 34.0
6 PR1 store45 87.0
7 PR4 store23 9.0
I have a dataframe which can be simplified like this:
df = pd.DataFrame(index = ['01/11/2017', '01/11/2017', '01/11/2017', '02/11/2017', '02/11/2017', '02/11/2017'],
columns = ['Period','_A', '_B', '_C'] )
df.Period = [1, 2, 3, 1, 2, 3]
df
which looks like:
Date Period _A _B _C
01/11/2017 1 NaN NaN NaN
01/11/2017 2 NaN NaN NaN
01/11/2017 3 NaN NaN NaN
02/11/2017 1 NaN NaN NaN
02/11/2017 2 NaN NaN NaN
02/11/2017 3 NaN NaN NaN
And I want to apply my function to each cell
Get_Y(Date, Period, Location)
(where _A, _B, _C, ... are the locations).
Get_Y is a complex function, that looks up data from other dataframes using the Date, Period and Location, and based on criteria gives a value for Y (a float between 0 and 1).
I have managed to make this work with iterrows:
for index, row in PeriodDF.iterrows():
    date = index
    Period = row.loc[row.index[0]]
    LocationList = row.index[1:]
    print(date, Period)
    for Location in LocationList:
        PeriodDF.loc[(PeriodDF.index == date) & (PeriodDF.Period == Period), Location] = Get_Y(date, Period, Location)
But this takes over 1 hour.
There must be a way to do this faster in pandas.
I have tried creating 3 dataframes, one an array of the Period, one an array of the Location, and one of the Date, but I am not sure how to apply Get_Y elementwise using the value from each dataframe.
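One possible speed-up, sketched only and not tested against the real Get_Y: melt the frame to long format once, call Get_Y per (Date, Period, Location) row, and unstack back to the original shape. The cost of Get_Y itself remains, but this removes the repeated boolean-mask lookups and assignments inside the loop.
# Sketch: assumes PeriodDF and Get_Y(date, period, location) exist as above,
# and that the date index is unnamed (reset_index produces a column called 'index').
long = (PeriodDF.reset_index()
                .rename(columns={'index': 'Date'})
                .melt(id_vars=['Date', 'Period'], var_name='Location'))

# one Get_Y call per (Date, Period, Location) combination
long['value'] = [Get_Y(d, p, loc)
                 for d, p, loc in zip(long['Date'], long['Period'], long['Location'])]

result = (long.set_index(['Date', 'Period', 'Location'])['value']
              .unstack('Location')
              .reset_index())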
I have a problem. I want to create a new dataframe from another one, and I want to avoid duplicate rows. That means if the same email already exists, I should concatenate the new data side by side; otherwise, stack it top and bottom. But the problem is that I'm getting an indexing error every time.
pandas.indexes.base.InvalidIndexError: Reindexing only valid with uniquely valued Index objects
And here is what I did :
if not self.data.empty:
    if data_frame_['Email'][0] in self.data['Email'].get_values():
        self.data = pd.concat([self.data, data_frame_], axis=1)
    else:
        self.data = pd.concat([self.data, data_frame_], axis=0)
else:
    self.data = data_frame_.copy()
end = time.time()
data_frame_ has only one row this is why I'm using
data_frame_['Email'][0]
Example of data (which is in data_frame_):
Email Project1 Target1 Project2 Target2
-------------------------------------------------------------
kml#mail.com 1 5000 NaN NaN
abc#abc.com 7 5000 NaN NaN
kml#mail.com 7 4000 NaN NaN
What I desire is :
Email Project1 Target1 Project2 Target2
-------------------------------------------------------------
kml#mail.com 1 5000 7 4000
abc#abc.com 7 5000 NaN NaN
PS: I could do it using dicts, but to protect code integrity, I'd like to use dataframes.
Thank you in advance.
You can use pivot_table, but first create groups by cumcount:
#rename columns
df.rename(columns={'Project1':'Project','Target1':'Target'}, inplace=True)
print (df)
Email Project Target
0 kml#mail.com 1 5000
1 abc#abc.com 7 5000
2 kml#mail.com 7 4000
df['g'] = (df.groupby('Email').cumcount() + 1).astype(str)
df1 = df.pivot_table(index='Email', columns='g', values=['Project', 'Target'])
#Sort multiindex in columns
df1 = df1.sort_index(axis=1, level=1)
#'reset' multiindex in columns
df1.columns = [''.join(col) for col in df1.columns]
print (df1)
Project1 Target1 Project2 Target2
Email
abc#abc.com 7.0 5000.0 NaN NaN
kml#mail.com 1.0 5000.0 7.0 4000.0