Column not found while renaming in pandas DataFrame - Python

I have this pandas DataFrame:
timestamp EG2021 EGH2021
2021-01-04 33 NaN
2021-02-04 45 65
I am trying to replace the column names with new names, as mapped in an Excel file like this:
OldId NewId
EG2021 LER_EG2021
EGH2021 LER_EGH2021
I tried the code below, but it is not working. I get this error:
KeyError: "None of [Index(['LER_EG2021', 'LER_EGH2021'],\n
dtype='object', length=186)] are in the [columns]"
Code:
df = pd.ExcelFile('ids.xlsx').parse('Sheet1')
x=[]
x.append(df['external_ids'].to_list())
dtest_df = (my panda dataframe as mentioned above)
mapper = df.set_index(df['oldId'])[df['NewId']]
dtest_df.columns = dtest_df.columns.Series.replace(mapper)
Any idea what I am doing wrong?

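The error occurs because df[df['NewId']] uses the NewId values as column labels to look up. Pass the column names instead (written here as shown in the mapping table):
mapper = df.set_index('OldId')['NewId'].to_dict()
# keep the original name for columns missing from the mapping (e.g. timestamp)
dtest_df.columns = dtest_df.columns.map(lambda c: mapper.get(c, c))
Or:
dtest_df = dtest_df.rename(columns=df.set_index('OldId')['NewId'].to_dict())
dtest_df output: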
timestamp LER_EG2021 LER_EGH2021
0 2021-01-04 33 NaN
1 2021-02-04 45 65

Another way: build a dict by zipping the old and new ID columns of df:
dtest_df.rename(columns=dict(zip(df['OldId'], df['NewId'])), inplace=True)
timestamp LER_EG2021 LER_EGH2021
0 2021-01-04 33 NaN
1 2021-02-04 45 65
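The difference between the two renaming routes can be checked on a small in-memory example. This is only a sketch: the `mapping` and `data` frames are hypothetical stand-ins for the Excel sheet and the asker's dataframe, built from the values shown in the question.

```python
import pandas as pd

# Hypothetical stand-in for the Excel mapping sheet
mapping = pd.DataFrame({
    "OldId": ["EG2021", "EGH2021"],
    "NewId": ["LER_EG2021", "LER_EGH2021"],
})

# Hypothetical stand-in for the dataframe whose columns get renamed
data = pd.DataFrame({
    "timestamp": ["2021-01-04", "2021-02-04"],
    "EG2021": [33, 45],
    "EGH2021": [float("nan"), 65],
})

lookup = dict(zip(mapping["OldId"], mapping["NewId"]))

# rename() leaves labels that are not in the dict untouched
renamed = data.rename(columns=lookup)

# Index.map() with a plain dict turns unmapped labels into NaN
mapped_labels = data.columns.map(lookup)

print(list(renamed.columns))
print(list(mapped_labels))
```

Because `timestamp` is not in the mapping, `rename` is usually the safer choice when only some of the columns should change.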

Related

Python Dataframe NaN rows slicing, filling and rejoining

I have a big dataframe. Some of the values in a column are NaN. I want to fill them with some value based on the other column value.
Data:
df =
A B
2019-10-01 09:19:40 667.029710 10
2019-10-01 09:20:15 673.518030 20
2019-10-01 09:21:29 533.137144 30
2020-07-25 15:51:15 NaN 40
2020-07-25 17:20:20 NaN 50
2020-07-25 17:21:23 NaN 60
I want to fill NaN in A column based on the B column value.
My code:
sdf = df[df['A'].isnull()] # slice NaN and create a new dataframe
sdf['A'] = sdf['B']*sdf['B']
df = pd.concat([df,sdf])
Everything works fine, but I feel my code is lengthy. Is there a one-liner?
With fillna we can do:
df['A'] = df['A'].fillna(df['B']**2)
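As a quick check, here is a minimal sketch on a hypothetical four-row frame (values loosely based on the question). It also shows that slicing and concatenating adds the filled rows on top of the original NaN rows, whereas fillna keeps the row count unchanged.

```python
import numpy as np
import pandas as pd

# Hypothetical small frame: two known values in A, two missing
df = pd.DataFrame({
    "A": [667.029710, 673.518030, np.nan, np.nan],
    "B": [10, 20, 40, 50],
})

# Slice-and-concat: the NaN rows stay in df, so the result has extra rows
sdf = df[df["A"].isna()].copy()
sdf["A"] = sdf["B"] ** 2
stacked = pd.concat([df, sdf])

# fillna aligns on the index and only touches the missing entries
filled = df.copy()
filled["A"] = filled["A"].fillna(filled["B"] ** 2)

print(len(stacked), len(filled))
print(filled)
```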

Summing all values from one day in a time series in pandas pivot

How do I calculate the sum of all values from one day in a time series in a pandas pivot?
My pandas pivot looks like this:
Date 2019-10-01 2019-10-02 2019-10-03 .... 2019-12-01
Hospital_name
Hospital1 12 15 16 .... 12
Hospital2 10 17 14 .... 12
Hospital3 15 20 12 .... 12
and I want a pivot like this:
Date 2019-10-01 2019-10-02 2019-10-03 .... 2019-12-01
Sum 37 52 42 .... 36
My data type is:
type(df.columns[0])
out: str
type(df.columns[1])
out: pandas._libs.tslibs.timestamps.Timestamp
Thanks for your help!
sum is your friend here, as stated in the comments. Using dummy df:
2019-10-01 2019-10-02 2019-10-03
Hospital_name John Maya Robin
h1 12 12 42
h2 15 55 13
h3 14 42 22
You simply ignore the first row and use sum:
df.loc[df.index!='Hospital_name'].sum()
2019-10-01 41.0
2019-10-02 109.0
2019-10-03 77.0
dtype: float64
EDIT: It looks like you have multiindex columns. You can drop this using:
df.columns = df.columns.droplevel()
(taken from this answer)
new_df = df.transpose()
new_df["Total"] = df[0:].sum()
df = new_df.transpose()
new_df = df.transpose() makes new_df a transposed copy of df.
new_df["Total"] = df[0:].sum() adds a Total column holding the sum for each date.
df = new_df.transpose() transposes back, so the table has its original layout plus a Total row.
For a better experience, you can try each line in a Jupyter notebook or JupyterLab to see what happens.
If you are satisfied with the answer, please mark it as accepted.
Thank you
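For the layout in the question (hospitals as the index, dates as columns), a single-row result can be sketched as follows. The frame is rebuilt by hand from the first three date columns shown in the question, and the names `pivot` and `totals` are illustrative only.

```python
import pandas as pd

# Rebuild the first three date columns of the pivot from the question
dates = pd.DatetimeIndex(["2019-10-01", "2019-10-02", "2019-10-03"], name="Date")
pivot = pd.DataFrame(
    [[12, 15, 16], [10, 17, 14], [15, 20, 12]],
    index=pd.Index(["Hospital1", "Hospital2", "Hospital3"], name="Hospital_name"),
    columns=dates,
)

# sum() adds up each column (one value per date);
# to_frame("Sum").T turns that Series into a single row labelled "Sum"
totals = pivot.sum().to_frame("Sum").T

print(totals)
```

Here `totals` has one row named `Sum` with the values 37, 52 and 42, matching the desired output in the question.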

Transpose groupby sublevels into columns pandas/python

gp1.groupby(by=['ID', 'CD'])['BALANCE_AM'].sum()
ID CD
4332 5 0.0
58 0.0
123 22656.0
756423 47 645087.0
123 227655.0
I want to create a column for each CD value, holding the sum of BALANCE_AM.
Desired Output
ID 5 58 123 47
4332 0 0 22656.0 NaN
756423 NaN NaN 227655.0 645087.0
Add Series.unstack, and DataFrame.reset_index if ID needs to be a column:
df = gp1.groupby(by=['ID', 'CD'])['BALANCE_AM'].sum().unstack().reset_index()
Another way is to use pivot_table instead of groupby:
gp1.pivot_table(values='BALANCE_AM', index='ID', columns='CD', aggfunc='sum')
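Both routes can be tried on a small frame. This is a sketch in which `gp1` is rebuilt by hand from the grouped sums in the question, so each (ID, CD) pair appears only once.

```python
import pandas as pd

# Hypothetical input reconstructed from the grouped output in the question
gp1 = pd.DataFrame({
    "ID": [4332, 4332, 4332, 756423, 756423],
    "CD": [5, 58, 123, 47, 123],
    "BALANCE_AM": [0.0, 0.0, 22656.0, 645087.0, 227655.0],
})

# Route 1: group, sum, then move the CD level of the index into columns
wide = gp1.groupby(["ID", "CD"])["BALANCE_AM"].sum().unstack()

# Route 2: pivot_table does the grouping and reshaping in one call
wide_pt = gp1.pivot_table(values="BALANCE_AM", index="ID", columns="CD", aggfunc="sum")

print(wide)
print(wide_pt)
```

Combinations that never occur, such as ID 4332 with CD 47, come out as NaN in both results.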

Selecting column values of a dataframe which is in a range and put it in appropriate columns of another dataframe in pandas

I have a csv file which is something like below
date,mean,min,max,std
2018-03-15,3.9999999999999964,inf,0.0,100.0
2018-03-16,0.46403712296984756,90.0,0.0,inf
2018-03-17,2.32452732452731,,0.0,143.2191767899579
2018-03-18,2.8571428571428523,inf,0.0,100.0
2018-03-20,0.6928406466512793,100.0,0.0,inf
2018-03-22,2.8675703858185635,,0.0,119.05383697172658
I want to select the column values that are > 20 and < 500 (that is, in the range 20 to 500) and put those values, along with the date, into a column of another dataframe. The other dataframe looks something like this:
Date percentage_change location
2018-02-14 23.44 BOM
So I want to get the date and value from the csv and add them to the new dataframe in the appropriate columns. Something like:
Date percentage_change location
2018-02-14 23.44 BOM
2018-03-15 100.0 NaN
2018-03-16 90.0 NaN
2018-03-17 143.2191767899579 NaN
.... .... ....
Now I am aware of functions like df.max(axis=1) and df.min(axis=1), which give you the min and max, but I am not sure how to find values based on a range. So how can this be achieved?
Given dataframes df1 (the csv data) and df2 (the target dataframe), you can achieve this by aligning column names, cleaning the numeric data, and then appending with pd.concat (DataFrame.append was removed in pandas 2.0).
import numpy as np
import pandas as pd
df_app = (
    df1.loc[:, ['date', 'mean', 'min', 'std']]
    .rename(columns={'date': 'Date'})
    .replace(np.inf, 0)  # treat inf as 0
    .fillna(0)           # treat missing values as 0
)
print(df_app)
# use the larger of min and std as the candidate value
df_app['percentage_change'] = np.maximum(df_app['min'], df_app['std'])
print(df_app)
# keep only values from 20 to 500 (between is inclusive on both ends by default)
df_app = df_app[df_app['percentage_change'].between(20, 500)]
res = pd.concat([df2, df_app.loc[:, ['Date', 'percentage_change']]])
print(res)
#          Date  percentage_change location
# 0  2018-02-14          23.440000      BOM
# 0  2018-03-15         100.000000      NaN
# 1  2018-03-16          90.000000      NaN
# 2  2018-03-17         143.219177      NaN
# 3  2018-03-18         100.000000      NaN
# 4  2018-03-20         100.000000      NaN
# 5  2018-03-22         119.053837      NaN
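Since the question asks for values strictly greater than 20 and strictly less than 500, the boundary handling of `Series.between` may matter. A minimal sketch on a hypothetical series, assuming pandas 1.3 or later for the string form of `inclusive`:

```python
import pandas as pd

# Hypothetical values, including both boundary values
s = pd.Series([20.0, 23.44, 100.0, 500.0, 143.2])

# Default: both endpoints are included (20 <= x <= 500)
closed = s[s.between(20, 500)]

# Strict range (20 < x < 500); string values for inclusive need pandas >= 1.3
strict = s[s.between(20, 500, inclusive="neither")]

# Equivalent strict filter with plain comparisons, for older pandas versions
strict_cmp = s[(s > 20) & (s < 500)]

print(closed.tolist())
print(strict.tolist())
```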

How to append multiple pandas DataFrames in a loop?

I've been banging my head on this python problem for a while and am stuck. I am for-looping through several csv files and want one data frame that appends the csv files in a way that one column from each csv file is a column name and sets a common index of a date_time.
There are 11 csv files that look like this data frame except for different value and pod number, but the time_stamp is the same for all the csvs.
data
pod time_stamp value
0 97 2016-02-22 3.048000
1 97 2016-02-29 23.622001
2 97 2016-03-07 13.970001
3 97 2016-03-14 6.604000
4 97 2016-03-21 NaN
And this is the for-loop that I have so far:
import glob
import pandas as pd
filenames = sorted(glob.glob('*.csv'))
new = []
for f in filenames:
    data = pd.read_csv(f)
    time_stamp = [pd.to_datetime(d) for d in time_stamp]
    new.append(data)
my_df = pd.DataFrame(new, columns=['pod','time_stamp','value'])
What I want is a data frame that looks like this where each column is the result of value from each of the csv files.
time_stamp 97 98 99 ...
2016-02-22 3.04800 4.20002 3.5500
2016-02-29 23.62201 24.7392 21.1110
2016-03-07 13.97001 11.0284 12.0000
But right now the output of my_df is very wrong and looks like this. Any ideas of where I went wrong?
0
0 pod time_stamp value 0 22 2016-...
1 pod time_stamp value 0 72 2016-...
2 pod time_stamp value 0 79 2016-0...
3 pod time_stamp value 0 86 2016-...
4 pod time_stamp value 0 87 2016-...
5 pod time_stamp value 0 88 2016-...
6 pod time_stamp value 0 90 2016-0...
7 pod time_stamp value 0 93 2016-0...
8 pod time_stamp value 0 95 2016-...
I'd recommend first concatenating all your dataframes together with pd.concat, and then doing one final pivot operation.
filenames = sorted(glob.glob('*.csv'))
new = [pd.read_csv(f, parse_dates=['time_stamp']) for f in filenames]
df = pd.concat(new) # omit axis argument since it is 0 by default
df = df.pivot(index='time_stamp', columns='pod')
Note that I'm forcing read_csv to parse time_stamp when loading the dataframe, so parsing after loading is no longer required.
MCVE
df
pod time_stamp value
0 97 2016-02-22 3.048000
1 97 2016-02-29 23.622001
2 97 2016-03-07 13.970001
3 97 2016-03-14 6.604000
4 97 2016-03-21 NaN
df.pivot(index='time_stamp', columns='pod')
value
pod 97
time_stamp
2016-02-22 3.048000
2016-02-29 23.622001
2016-03-07 13.970001
2016-03-14 6.604000
2016-03-21 NaN
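To get one flat column per pod, as in the desired output, the `values` argument of `pivot` can be set as well. The sketch below uses two hypothetical in-memory frames in place of the CSV files. The pod 97 values come from the sample data and the pod 98 values from the desired output in the question.

```python
import pandas as pd

stamps = pd.to_datetime(["2016-02-22", "2016-02-29"])

# Hypothetical stand-ins for two of the CSV files (one frame per pod)
frames = [
    pd.DataFrame({"pod": 97, "time_stamp": stamps, "value": [3.048, 23.622001]}),
    pd.DataFrame({"pod": 98, "time_stamp": stamps, "value": [4.20002, 24.7392]}),
]

# Stack the frames vertically, then spread pods into columns
long_df = pd.concat(frames, ignore_index=True)
wide = long_df.pivot(index="time_stamp", columns="pod", values="value")

print(wide)
```

Passing `values="value"` avoids the extra `value` level that appears in the column header of the output above.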
