I have a dataframe which can be simplified like this:
import pandas as pd

df = pd.DataFrame(index=['01/11/2017', '01/11/2017', '01/11/2017', '02/11/2017', '02/11/2017', '02/11/2017'],
                  columns=['Period', '_A', '_B', '_C'])
df.Period = [1, 2, 3, 1, 2, 3]
df
which looks like:
Date Period _A _B _C
01/11/2017 1 NaN NaN NaN
01/11/2017 2 NaN NaN NaN
01/11/2017 3 NaN NaN NaN
02/11/2017 1 NaN NaN NaN
02/11/2017 2 NaN NaN NaN
02/11/2017 3 NaN NaN NaN
And I want to apply my function to each cell
Get_Y(Date, Period, Location)
(where _A, _B, _C, ... are the locations).
Get_Y is a complex function, that looks up data from other dataframes using the Date, Period and Location, and based on criteria gives a value for Y (a float between 0 and 1).
I have managed to make this work with iterrows:
for index, row in PeriodDF.iterrows():
    date = index
    Period = row.loc[row.index[0]]
    LocationList = row.index[1:]
    print(date, Period)
    for Location in LocationList:
        PeriodDF.loc[(PeriodDF.index == date) & (PeriodDF.Period == Period), Location] = Get_Y(date, Period, Location)
But this takes over 1 hour.
There must be a way to do this faster in pandas.
I have tried creating 3 dataframes, one an array of the Period, one an array of the Location, and one of the Date, but I am not sure how to apply Get_Y elementwise using the corresponding value from each dataframe.
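One way to cut down the cost, assuming Get_Y(date, period, location) really has to be called once per (Date, Period, Location) combination, is to build each location column in a single pass and assign it back as a whole column instead of doing a boolean-masked .loc assignment per cell. This is only a sketch of that idea; the remaining cost is whatever Get_Y itself does on each call.

locations = [c for c in df.columns if c != 'Period']
for loc in locations:
    # one Python-level pass per location, but only one column assignment each
    df[loc] = [Get_Y(date, period, loc) for date, period in zip(df.index, df['Period'])]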
I want to create a simple matrix where I have as index the name of a software requirement and as column all the software test cases in the project.
Where one SWRS is covered by one SWTS, I need to place "something" (for example a cross).
In my code draft, I create an empty dataframe and then iterate to place the cross:
import pandas as pd

struct = {
    "swrslist": ["swrs1", "swrs2", "swrs3", "swrs4"],
    "swtslist": ["swts1", "swts2", "swts3", "swts4", "swts5", "swts6"],
    "mapping":
    {
        "swrs1": ["swts1", "swts3", "swts4"],
        "swrs2": ["swts2", "swts3", "swts5"],
        "swrs4": ["swts1", "swts3", "swts5"]
    }
}

if __name__ == "__main__":
    df = pd.DataFrame(index=pd.Index(pd.Series(struct["swrslist"])),
                      columns=pd.Index(struct["swtslist"]))
    print(df)
    for key in struct["mapping"].keys():
        for elem in struct["mapping"][key]:
            print(key, elem)
            df.at[key, elem] = "x"
    print(df)
    df.to_excel("mapping.xlsx")
The output is the following:
swts1 swts2 swts3 swts4 swts5 swts6
swrs1 x NaN x x NaN NaN
swrs2 NaN x x NaN x NaN
swrs3 NaN NaN NaN NaN NaN NaN
swrs4 x NaN x NaN x NaN
I know that creating an empty dataframe and then iterating over it is not efficient.
I tried to create the dataframe as follows:
df = pd.DataFrame(struct["mapping"], index=pd.Index(pd.Series(struct["swrslist"])),
                  columns=pd.Index(struct["swtslist"]))
but it creates an empty dataframe:
swts1 swts2 swts3 swts4 swts5 swts6
swrs1 NaN NaN NaN NaN NaN NaN
swrs2 NaN NaN NaN NaN NaN NaN
swrs3 NaN NaN NaN NaN NaN NaN
swrs4 NaN NaN NaN NaN NaN NaN
Furthermore, in the future I plan to provide different values depending on whether a SWTS is a pass, a fail, or not executed.
How can I create the dataframe efficiently, rather than iterating over the "mapping" entries?
Though I used a for loop too, how about this?
df = pd.DataFrame(index=pd.Index(pd.Series(struct["swrslist"])), columns=pd.Index(struct["swtslist"]))
for key, value in struct["mapping"].items():
    df.loc[key, value] = "x"
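If you want to drop the explicit loop entirely, one possibility (a sketch reusing the struct from the question) is to flatten the mapping into (swrs, swts) pairs and let pd.crosstab build the matrix:

import pandas as pd

pairs = pd.DataFrame(
    [(req, tc) for req, tcs in struct["mapping"].items() for tc in tcs],
    columns=["swrs", "swts"],
)
# count each pair, add the missing swrs3 row and swts6 column, then turn counts into marks
matrix = (
    pd.crosstab(pairs["swrs"], pairs["swts"])
      .reindex(index=struct["swrslist"], columns=struct["swtslist"], fill_value=0)
      .replace({1: "x", 0: ""})
)
matrix.to_excel("mapping.xlsx")

For the planned pass/fail/not-executed values, the same pairs frame could carry a result column and be reshaped with pairs.pivot(index="swrs", columns="swts", values="result") instead of being counted.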
I want to do something similar to what pd.DataFrame.combine_first() does, but as a row-wise operation performed on a shared index, and also to add a new column in place of the old ones while keeping the original values of the shared column names.
In this case the 'ts' column is one that I want to replace with time_now.
time_now = "2022-08-05"
row1 = {'unique_id':5,'ts': '2022-08-02','id2':2,'val':300, 'ffo1':55, 'debt':200}
row2 = {'unique_id':5,'ts': '2022-08-03' ,'id2':2, 'esg':True,'gov_rt':90}
row3 = {'unique_id':5,'ts': '2022-08-04','id2':2, 'rank':5,'buy_or_sell':'sell'}
df = pd.DataFrame([row1,row2,row3])
unique_id ts id2 val ffo1 debt esg gov_rt rank \
0 5 2022-08-02 2 300.0 55.0 200.0 NaN NaN NaN
1 5 2022-08-03 2 NaN NaN NaN True 90.0 NaN
2 5 2022-08-04 2 NaN NaN NaN NaN NaN 5.0
buy_or_sell
0 NaN
1 NaN
2 sell
My desired output is below, using the new timestamp, but keeping the old ones based on their group index.
rows = [{'unique_id':5, 'ts':time_now ,'id2':2,'val':300, 'ffo1':55, 'debt':200,'esg':True,'gov_rt':90,'rank':5,'buy_or_sell':'sell', 'ts_1':'2022-08-02','ts_2':'2022-08-03', 'ts_3':'2022-08-04'}]
output = pd.DataFrame(rows)
unique_id ts id2 val ffo1 debt esg gov_rt rank \
0 5 2022-08-05 2 300 55 200 True 90 5
buy_or_sell ts_1 ts_2 ts_3
0 sell 2022-08-02 2022-08-03 2022-08-04
The part below seems to work when run by itself. But I cannot get it to work inside of a function because of differences between index lengths.
df2 = df.set_index('ts').stack().reset_index()
rows = dict(zip(df2['level_1'], df2[0]))
ts = df2['ts'].unique().tolist()
for cnt, value in enumerate(ts):
    rows[f'ts_{cnt}'] = value
# drop all rows
df2 = pd.DataFrame([rows])
df2['time'] = time_now
df2
The problem was that I forgot to put the dictionary into a list to create a records-oriented dataframe. Additionally, when using a similar function, the index might need to be dropped when resetting, as duplicated columns might otherwise be created.
I still wonder if there's a better way to do what I want, since it's kind of slow.
def func(df):
    df2 = df.set_index('ts').stack().reset_index()
    rows = dict(zip(df2['level_1'], df2[0]))
    ts = df2['ts'].unique().tolist()
    for cnt, value in enumerate(ts):
        rows[f'ts_{cnt}'] = value
    # drop all rows
    df2 = pd.DataFrame([rows])
    df2['time'] = time_now
    return df2
#run this
df.groupby('unique_id').apply(func)
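A possible alternative (a sketch, under the assumption that the non-ts columns never hold conflicting values within a unique_id group): GroupBy.first already takes the first non-null value of every column, and the old timestamps can be spread into ts_1, ts_2, ... with a cumcount plus pivot instead of stacking.

# first non-null value of every column within each unique_id group
collapsed = df.groupby('unique_id', as_index=False).first()

# number the rows within each group and pivot the original timestamps into ts_1, ts_2, ...
ts_wide = (df.assign(n=df.groupby('unique_id').cumcount() + 1)
             .pivot(index='unique_id', columns='n', values='ts')
             .add_prefix('ts_'))

out = collapsed.drop(columns='ts').join(ts_wide, on='unique_id')
out.insert(1, 'ts', time_now)  # put the new timestamp back in the second position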
I have two dataframes, the main dataframe has two columns for Lat and Long some of which have values and some of which are NaN. I have another dataframe that is a subset of this main dataframe with Lat and Long filled in with values. I'd like to fill in the main DataFrame with these values based on matching ID.
Main DataFrame:
ID Lat Long
0 9547507704 33.853682 -80.369867
1 9777677704 32.942332 -80.066165
2 5791407702 47.636067 -122.302559
3 6223567700 34.224719 -117.372550
4 9662437702 42.521828 -82.913680
... ... ... ...
968552 4395967002 NaN NaN
968553 6985647108 NaN NaN
968554 7996438405 NaN NaN
968555 9054647103 NaN NaN
968556 9184687004 NaN NaN
DataFrame to fill:
ID Lat Long
0 2392497107 36.824257 -76.272486
1 2649457102 37.633918 -77.507746
2 2952437110 37.511077 -77.528711
3 3379937304 39.119430 -77.569008
4 3773127208 36.909731 -76.070420
... ... ... ...
23263 9512327001 37.371059 -79.194838
23264 9677417002 38.406665 -78.913133
23265 9715167306 38.761194 -77.454184
23266 9767568404 37.022287 -76.319882
23267 9872047407 38.823017 -77.057818
The two dataframes are of different lengths.
EDIT for clarification: I need to replace the NaN in the Lat & Long columns of the main DataFrame with the Lat & Long from the subset if ID matches in both DataFrames. My DataFrames are both >60 columns, I am only trying to replace the NaN for those two columns.
Edit:
I went with this mapping solution, although it isn't exactly what I'm looking for; I know there is a much simpler solution.
#mapping coordinates to NaN values in main
m = dict(zip(fill_df.ID,fill_df.Lat))
main_df.Lat = main_df.Lat.fillna(main_df.ID.map(m))
n = dict(zip(fill_df.ID,fill_df.Long))
main_df.Long = main_df.Long.fillna(main_df.ID.map(n))
new_df = pd.merge(main_df, sub_df, how='left', on='ID')
I guess the left join will do the job.
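For the left join to actually fill the gaps, the joined columns still have to be folded back into Lat and Long; a sketch of the full step, using the fill_df/main_df names from the edit above, could look like this:

merged = main_df.merge(fill_df[['ID', 'Lat', 'Long']], how='left', on='ID', suffixes=('', '_fill'))
merged['Lat'] = merged['Lat'].fillna(merged['Lat_fill'])
merged['Long'] = merged['Long'].fillna(merged['Long_fill'])
main_df = merged.drop(columns=['Lat_fill', 'Long_fill'])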
One approach is to use DataFrame.combine_first. This method aligns DataFrames on index and columns, so you need to set ID as the index of each DataFrame, call df_main.combine_first(df_filler), then reset ID back into a column. (Seems awkward; there's probably a more elegant approach.)
Assuming your main DataFrame is named df_main and your DataFrame to fill is named df_filler:
df_main.set_index('ID').combine_first(df_filler.set_index('ID')).reset_index()
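Since both frames have more than 60 columns and only Lat and Long need filling, it may be worth restricting the filler frame to those two columns so combine_first cannot touch anything else (again just a sketch with the same assumed names):

filled = (df_main.set_index('ID')
                 .combine_first(df_filler.set_index('ID')[['Lat', 'Long']])
                 .reset_index())

Because the filler frame's IDs are a subset of the main frame's, the union of IDs that combine_first produces is the same set of rows as the main frame.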
This should do the trick:
import math
import pandas as pd
A = pd.DataFrame({'ID' : [1, 2, 3], 'Lat':[4, 5, 6], 'Long': [7, 8, float('nan')]})
B = pd.DataFrame({'ID' : [2, 3], 'Lat':[5, 6], 'Long': [8, 9]})
print('Old table:')
print(A)
print('Fix table:')
print(B)
for i in A.index.to_list():
    for j in B.index.to_list():
        if not A['ID'][i] == B['ID'][j]:
            continue
        if math.isnan(A['Lat'][i]):
            A.at[i, 'Lat'] = B['Lat'][j]
        if math.isnan(A['Long'][i]):
            A.at[i, 'Long'] = B['Long'][j]
print('New table:')
print(A)
Returns:
Old table:
ID Lat Long
0 1 4 7.0
1 2 5 8.0
2 3 6 NaN
Fix table:
ID Lat Long
0 2 5 8
1 3 6 9
New table:
ID Lat Long
0 1 4 7.0
1 2 5 8.0
2 3 6 9.0
Not very elegant but gets the job done :)
I have a simple problem that probably has a simple solution but I couldn't find it anywhere. I have the following multi-index column DataFrame:
mux = pd.MultiIndex.from_product([['A', 'B', 'C'], ['Datetime', 'Str', 'Ret']])
dfr = pd.DataFrame(columns=mux)
| A | B | C |
|Datetime|Str|Ret|Datetime|Str|Ret|Datetime|Str|Ret|
I need to add values one by one at the end of a specific sub-column. For example, add one value at the end of column A sub-column Datetime and leave the rest of the row as it is, then add another value to column B sub-column Str and again leave the rest of the values in the same row untouched, and so on. So my questions are: Is it possible to target individual locations in this type of DataFrame? How? And is it possible to append not a full row but an individual value, always at the end after the previous value, without knowing where the end is? Thank you so much for your answers.
IIUC, you can use .loc:
idx = len(dfr) # get the index of the next row after the last one
dfr.loc[idx, ('A', 'Datetime')] = pd.to_datetime('2021-09-24')
dfr.loc[idx, ('B', 'Str')] = 'Hello'
dfr.loc[idx, ('C', 'Ret')] = 4.3
Output:
>>> dfr
A B C
Datetime Str Ret Datetime Str Ret Datetime Str Ret
0 2021-09-24 00:00:00 NaN NaN NaN Hello NaN NaN NaN 4.3
Update
I mean for example when I have a different number of values in different columns (for example 6 values in column A-Str but only 4 in column B-Datetime), but I don't really know. In that case what I need is to add the next value in that column after the last one, so I need to know the index of the last non-NaN value of that particular column so I can use it in your answer, because if I use len(dfr) while trying to add a value to the column that only has 4 values, it will end up in the 7th row instead of the 5th row; this is because one of the columns may have more values than the others.
You can do it easily using last_valid_index. Create a convenient function append_to_col to append values in place in your dataframe:
def append_to_col(col, val):
    idx = dfr[col].last_valid_index()
    dfr.loc[idx + 1 if idx is not None else 0, col] = val

# Fill your dataframe
append_to_col(('A', 'Datetime'), '2021-09-24')
append_to_col(('A', 'Datetime'), '2021-09-25')
append_to_col(('B', 'Str'), 'Hello')
append_to_col(('C', 'Ret'), 4.3)
append_to_col(('C', 'Ret'), 8.2)
append_to_col(('A', 'Datetime'), '2021-09-26')
Output:
>>> dfr
A B C
Datetime Str Ret Datetime Str Ret Datetime Str Ret
0 2021-09-24 NaN NaN NaN Hello NaN NaN NaN 4.3
1 2021-09-25 NaN NaN NaN NaN NaN NaN NaN 8.2
2 2021-09-26 NaN NaN NaN NaN NaN NaN NaN NaN
I have a dataframe 'DF', part of which looks like this:
I want to select only the values between 0 and 0.01, to form a new dataframe (with blanks where the value was over 0.01).
To do this, I tried:
similarity = []
for x in DF:
    similarity.append([DF[DF.between(0, 0.01).any(axis=1)]])
simdf = pd.DataFrame(similarity)
simdf.to_csv("similarity.csv")
However, I get the error AttributeError: 'DataFrame' object has no attribute 'between'.
How do I select a range of values and create a new dataframe with these?
Just do the two comparisons:
df_new = df[(df>0) & (df<0.01)]
Example:
import pandas as pd
df = pd.DataFrame({"a":[0,2,4,54,56,4],"b":[4,5,7,12,3,4]})
print(df[(df>5) & (df<33)])
a b
0 NaN NaN
1 NaN NaN
2 NaN 7.0
3 NaN 12.0
4 NaN NaN
5 NaN NaN
If you want a blank string instead of NaN:
df[(df>5) & (df<33)].fillna("")
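The original error happens because between is a Series method, not a DataFrame method. If you prefer it over the two comparisons, it can be applied column by column via apply (a sketch; note that between is inclusive on both ends by default, unlike the strict > and < above):

mask = DF.apply(lambda col: col.between(0, 0.01))
simdf = DF[mask]
simdf.to_csv("similarity.csv")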