Pandas Map creating NaNs - python
My intention is to replace numeric labels with readable ones. I found out about building a dictionary and mapping it onto a dataframe column. To that end, I first extracted the necessary fields and created a dictionary, which I then fed to the map function.
My programme is as follows:
import pandas as pd

factor_name = 'Help in household'
df = pd.read_csv('dat.csv')
labels = pd.read_csv('labels.csv')
fact_df = labels.loc[labels['Column'] == factor_name]
fact_dict = dict(zip(fact_df['Level'], fact_df['Rename']))
print(df.index.to_series().map(fact_dict))
My labels.csv is as follows:
Column,Name,Level,Rename
Help in household,Every day,4,Every day
Help in household,Never,1,Never
Help in household,Once a month,2,Once a month
Help in household,Once a week,3,Once a week
State,AN,AN,Andaman & Nicobar
State,AP,AP,Andhra Pradesh
State,AR,AR,Arunachal Pradesh
State,BR,BR,Bihar
State,CG,CG,Chattisgarh
State,CH,CH,Chandigarh
State,DD,DD,Daman & Diu
State,DL,DL,Delhi
State,DN,DN,Dadra & Nagar Haveli
State,GA,GA,Goa
State,GJ,GJ,Gujarat
State,HP,HP,Himachal Pradesh
State,HR,HR,Haryana
State,JH,JH,Jharkhand
State,JK,JK,Jammu & Kashmir
State,KA,KA,Karnataka
State,KL,KL,Kerala
State,MG,MG,Meghalaya
State,MH,MH,Maharashtra
State,MN,MN,Manipur
State,MP,MP,Madhya Pradesh
State,MZ,MZ,Mizoram
State,NG,NG,Nagaland
State,OR,OR,Orissa
State,PB,PB,Punjab
State,PY,PY,Pondicherry
State,RJ,RJ,Rajasthan
State,SK,SK,Sikkim
State,TN,TN,Tamil Nadu
State,TR,TR,Tripura
State,UK,UK,Uttarakhand
State,UP,UP,Uttar Pradesh
State,WB,WB,West Bengal
My dat.csv is as follows:
Id,Help in household,Maths,Reading,Science,Social
11011001001,4,20.37,,27.78,
11011001002,3,12.96,,38.18,
11011001003,4,27.78,70,,
11011001004,4,,56.67,,36
11011001005,1,,,14.55,8.33
11011001006,4,,23.33,,30
11011001007,4,40.74,70,,
11011001008,3,,26.67,,22.92
Intended result is as follows:
4 Every day
1 Never
2 Once a month
3 Once a week
The mapping fails: the result is full of NaNs, which I do not want. Can anyone tell me why?
Two things go wrong in your code. First, you map over df.index, i.e. the Id values, and none of those appear as keys in your dictionary. Second, because the Level column in labels.csv mixes numbers with state codes, pandas reads it as strings, so even the correct column would not match its integer values against string keys. Map the data column instead, and cast it to str first:
In [140]: df['Help in household'] \
.astype(str) \
.map(labels.loc[labels['Column']=='Help in household',['Level','Rename']]
.set_index('Level')['Rename'])
Out[140]:
0 Every day
1 Once a week
2 Every day
3 Every day
4 Never
5 Every day
6 Every day
7 Once a week
Name: Help in household, dtype: object
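Equivalently, your original dictionary approach works once those two issues are fixed; a minimal sketch, assuming dat.csv and labels.csv exactly as posted:

import pandas as pd

df = pd.read_csv('dat.csv')
labels = pd.read_csv('labels.csv')

fact_df = labels.loc[labels['Column'] == 'Help in household']
fact_dict = dict(zip(fact_df['Level'], fact_df['Rename']))
# 'Level' was read as str (the column also holds state codes),
# so cast the data column to str before mapping
print(df['Help in household'].astype(str).map(fact_dict))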
You may also consider using merge:
In [147]: df.assign(Level=df['Help in household'].astype(str)) \
.merge(labels.loc[labels['Column']=='Help in household',['Level','Rename']],
on='Level')
Out[147]:
Id Help in household Maths Reading Science Social Level Rename
0 11011001001 4 20.37 NaN 27.78 NaN 4 Every day
1 11011001003 4 27.78 70.00 NaN NaN 4 Every day
2 11011001004 4 NaN 56.67 NaN 36.00 4 Every day
3 11011001006 4 NaN 23.33 NaN 30.00 4 Every day
4 11011001007 4 40.74 70.00 NaN NaN 4 Every day
5 11011001002 3 12.96 NaN 38.18 NaN 3 Once a week
6 11011001008 3 NaN 26.67 NaN 22.92 3 Once a week
7 11011001005 1 NaN NaN 14.55 8.33 1 Never
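As the Out[147] above shows, the default inner merge reorders the rows, grouping them by key. A small variant (a sketch, not part of the original answer) that keeps the original row order and drops the helper column, by passing how='left' and chaining .drop():

df.assign(Level=df['Help in household'].astype(str)) \
  .merge(labels.loc[labels['Column']=='Help in household',['Level','Rename']],
         on='Level', how='left') \
  .drop(columns='Level')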
Related
Python: Comparing rows values in a time period conditional
This is a sample of a pandas dataframe that I'm working on:

   ID         DATE      HOUR      TYPE   CODE  CITY
0  222304678  27/09/22  15:19:00  50201  3     Manila
1  222304694  18/09/22  10:46:00  30202  2     Innsbruck
2  222081537  18/09/22  10:47:00  30202  1     Innsbruck
3  221848197  17/09/22  21:54:00  30202  2     Austin
4  221455590  13/09/22  4:50:00   30409  2     Panama
5  220540157  06/09/22  12:29:00  30603  3     Sydney
6  220367113  06/09/22  12:32:00  30202  2     Sydney
7  221380583  06/09/22  12:56:00  30204  4     Sydney
8  221381826  06/09/22  12:58:00  30202  1     Sydney
9  221365584  22/08/22  12:35:00  50202  1     Tokyo

When a row has CODE = 1, I need to compare it with the rows that occurred in the 30 minutes before it, under the following conditions: the same city, the same date, and codes other than 1. I then need to create another dataframe with the rows that met the condition (or at least highlight them). I have tried df.loc, but I don't know how to express the time range.
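A minimal sketch of one way to do this, assuming DATE and HOUR combine into a parseable day-first timestamp and that "same date" means the calendar date of the CODE = 1 row:

import pandas as pd

df['ts'] = pd.to_datetime(df['DATE'] + ' ' + df['HOUR'], dayfirst=True)

ones = df[df['CODE'] == 1]
others = df[df['CODE'] != 1]

hits = []
for _, row in ones.iterrows():
    # rows in the same city, on the same date, within the 30 minutes before this one
    window = others[(others['CITY'] == row['CITY'])
                    & (others['DATE'] == row['DATE'])
                    & (others['ts'] <= row['ts'])
                    & (others['ts'] >= row['ts'] - pd.Timedelta(minutes=30))]
    hits.append(window)

result = pd.concat(hits).drop_duplicates()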
Merge/Concat 2 dataframe with different holiday dates
I would like to merge/concat "outer" two dataframes with different sets of holiday dates. The Date column is a string, and both dataframes exclude non-pricing days, e.g. public holidays and weekends.

Assuming dataframe 1 follows US holidays:

df1_US_holiday
Date       Price_A
5/6/2020   2
5/5/2020   3
5/4/2020   4
5/1/2020   5
4/30/2020  6
4/29/2020  1
4/28/2020  3
4/27/2020  1

Assuming dataframe 2 follows China holidays (note: 1-5 May is a China holiday):

df2_China_holiday
Date       Price_B
5/6/2020   4
4/30/2020  3
4/29/2020  2
4/28/2020  2
4/27/2020  5

Expected merge/concat result:

Date       Price_A  Price_B
5/6/2020   2        4
5/5/2020   3        NaN
5/4/2020   4        NaN
5/1/2020   5        NaN
4/30/2020  6        3
4/29/2020  1        2
4/28/2020  3        2
4/27/2020  1        5

Ultimately, I would like to fill the NaNs with fillna(method='bfill'). Should I include a holiday library for this merge/concat action?
Pandas provides facilities for combining Series and DataFrame objects, with various kinds of set logic for the indexes and relational-algebra functionality in the case of join/merge-type operations. Take a look at the merging section of the pandas documentation; it covers what you want to achieve here.
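Concretely, a minimal sketch assuming the two frames as posted above. An outer merge on Date produces the expected NaNs, so no holiday library is needed:

import pandas as pd

merged = df1_US_holiday.merge(df2_China_holiday, on='Date', how='outer')
# parse the dates so the frame can be sorted newest-first like the expected output
merged['Date'] = pd.to_datetime(merged['Date'], format='%m/%d/%Y')
merged = merged.sort_values('Date', ascending=False).reset_index(drop=True)
# equivalent to fillna(method='bfill') on the newest-first frame
merged['Price_B'] = merged['Price_B'].bfill()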
pandas groupby by customized year, e.g. a school year
In a pandas dataframe I would like to find the mean values of a column, grouped by a 'customized' year. An example would be to compute the mean values of school marks for a school year (e.g. Sep/YYYY to Aug/YYYY+1). The pandas docs give some information on offsets, business years etc., but I can't really make sense of it to get a working example. Here is a minimal example where mean values of school marks are computed per calendar year (Jan-Dec), which is what I do not want:

import pandas as pd
import numpy as np

df = pd.DataFrame(data=np.random.randint(low=1, high=5, size=36),
                  index=pd.date_range('2001-09-01', freq='M', periods=36),
                  columns=['marks'])
df_yearly = df.groupby(pd.Grouper(freq="A")).mean()

This could yield e.g.:

print(df):
            marks
2001-09-30      1
2001-10-31      4
2001-11-30      2
2001-12-31      1
2002-01-31      4
2002-02-28      1
2002-03-31      2
2002-04-30      1
2002-05-31      3
2002-06-30      3
2002-07-31      3
2002-08-31      3
2002-09-30      4
2002-10-31      1
...
2003-11-30      4
2003-12-31      2
2004-01-31      1
2004-02-29      2
2004-03-31      1
2004-04-30      3
2004-05-31      4
2004-06-30      2
2004-07-31      2
2004-08-31      4

print(df_yearly):
               marks
2001-12-31  2.000000
2002-12-31  2.583333
2003-12-31  2.666667
2004-12-31  2.375000

My desired output would correspond to something like:

2001-09/2002-08    mean_value
2002-09/2003-08    mean_value
2003-09/2004-08    mean_value

Many thanks!
We can manually compute the school years:

# if month >= 9 we move it to the next year
school_years = df.index.year + (df.index.month > 8).astype(int)

Another option is to use a fiscal year starting from September:

school_years = df.index.to_period('Q-AUG').qyear

And we can groupby:

df.groupby(school_years).mean()

Output:

         marks
2002  2.333333
2003  2.500000
2004  2.500000
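A quick sanity check (a hypothetical assertion, not part of the answer): both groupers label a school year by the calendar year in which it ends, so they agree on every date in the index:

assert (df.index.year + (df.index.month > 8).astype(int)
        == df.index.to_period('Q-AUG').qyear).all()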
One more approach:

a = (df.index.month == 9).cumsum()
val = df.groupby(a, sort=False)['marks'].mean().reset_index()
dates = df.index.to_series().groupby(a, sort=False).agg(['first', 'last']).reset_index()
dates.merge(val, on='index')

Output:

   index      first       last     marks
0      1 2001-09-30 2002-08-31  2.750000
1      2 2002-09-30 2003-08-31  2.333333
2      3 2003-09-30 2004-08-31  2.083333
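If you want labels exactly like the asked-for 2001-09/2002-08 form, a hypothetical formatting step on top of the period-based grouper:

out = df.groupby(df.index.to_period('A-AUG'))['marks'].mean()
# each annual period knows its own start and end timestamps
out.index = out.index.map(lambda p: f"{p.start_time:%Y-%m}/{p.end_time:%Y-%m}")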
Row-wise average for a subset of columns with missing values
I've got a DataFrame which has occasional missing values and looks something like this:

       Monday  Tuesday  Wednesday
Mike       42      NaN         12
Jenna     NaN      NaN         15
Jon        21        4          1

I'd like to add a new column to my dataframe where I calculate the average across all columns for every row. Meaning, for Mike I'd need (df['Monday'] + df['Wednesday'])/2, but for Jenna I'd simply use df['Wednesday']/1. Does anyone know the best way to account for this variation that results from missing values and calculate the average?
You can simply:

df['avg'] = df.mean(axis=1)

       Monday  Tuesday  Wednesday        avg
Mike       42      NaN         12  27.000000
Jenna     NaN      NaN         15  15.000000
Jon        21        4          1   8.666667

because .mean() ignores missing values by default: see docs. To select a subset, you can:

df['avg'] = df[['Monday', 'Tuesday']].mean(axis=1)

       Monday  Tuesday  Wednesday   avg
Mike       42      NaN         12  42.0
Jenna     NaN      NaN         15   NaN
Jon        21        4          1  12.5
Alternative - using iloc (can also use loc here):

df['avg'] = df.iloc[:, 0:2].mean(axis=1)
Resurrecting this question because all previous answers currently print a warning. In most cases, use assign():

df = df.assign(avg=df.mean(axis=1))

For specific columns, one can input them by name:

df = df.assign(avg=df.loc[:, ["Monday", "Tuesday", "Wednesday"]].mean(axis=1))

Or by index, using one more than the last desired index as it is not inclusive:

df = df.assign(avg=df.iloc[:, 0:3].mean(axis=1))
Pandas Reindex - Fill Column with Missing Values
I tried several examples on this topic but with no results. I'm reading a DataFrame like:

Code,Counts
10006,5
10011,2
10012,26
10013,20
10014,17
10015,2
10018,2
10019,3

How can I get another DataFrame like:

Code,Counts
10006,5
10007,NaN
10008,NaN
...
10011,2
10012,26
10013,20
10014,17
10015,2
10016,NaN
10017,NaN
10018,2
10019,3

Basically, how can I fill in the missing values of the 'Code' column? I tried the df.reindex() method but I can't figure out how it works. Thanks a lot.
I'd set the index to your 'Code' column, then reindex by passing a new array based on the current index (np.arange takes start and stop params; you need to add 1 to the stop to include the last code), and then reset_index. This assumes your 'Code' values are already sorted:

In [21]:
import numpy as np

df.set_index('Code', inplace=True)
df = df.reindex(index=np.arange(df.index[0], df.index[-1] + 1)).reset_index()
df

Out[21]:
     Code  Counts
0   10006       5
1   10007     NaN
2   10008     NaN
3   10009     NaN
4   10010     NaN
5   10011       2
6   10012      26
7   10013      20
8   10014      17
9   10015       2
10  10016     NaN
11  10017     NaN
12  10018       2
13  10019       3
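A hypothetical variant if the codes are not guaranteed to be sorted: build the full range from the min and max instead of the first and last rows:

import numpy as np

full_range = np.arange(df['Code'].min(), df['Code'].max() + 1)
# reindex against the full range, then restore 'Code' as a regular column
df = df.set_index('Code').reindex(full_range).rename_axis('Code').reset_index()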