Row-wise average for a subset of columns with missing values - python

I've got a `DataFrame` which has occasional missing values and looks something like this:
       Monday  Tuesday  Wednesday
Mike       42      NaN         12
Jenna     NaN      NaN         15
Jon        21        4          1
I'd like to add a new column to my data frame where I'd calculate the average across all columns for every row.
Meaning, for Mike, I'd need
(df['Monday'] + df['Wednesday'])/2, but for Jenna, I'd simply use df['Wednesday']/1
Does anyone know the best way to account for this variation that results from missing values and calculate the average?

You can simply:
df['avg'] = df.mean(axis=1)
       Monday  Tuesday  Wednesday        avg
Mike       42      NaN         12  27.000000
Jenna     NaN      NaN         15  15.000000
Jon        21        4          1   8.666667
because .mean() skips missing values by default (skipna=True): see the docs.
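For reference, a minimal self-contained sketch (names taken from the question) that reproduces the frame and the row-wise mean above:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"Monday": [42, np.nan, 21],
     "Tuesday": [np.nan, np.nan, 4],
     "Wednesday": [12, 15, 1]},
    index=["Mike", "Jenna", "Jon"],
)

# skipna=True is the default, so each row is averaged over
# its own non-missing values only.
df["avg"] = df.mean(axis=1)
print(df)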
To select a subset, you can:
df['avg'] = df[['Monday', 'Tuesday']].mean(axis=1)
       Monday  Tuesday  Wednesday   avg
Mike       42      NaN         12  42.0
Jenna     NaN      NaN         15   NaN
Jon        21        4          1  12.5

Alternative - using iloc (can also use loc here):
df['avg'] = df.iloc[:,0:2].mean(axis=1)
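For completeness, note that a loc label slice, unlike iloc, is inclusive of both endpoints; a small sketch:

df['avg'] = df.loc[:, 'Monday':'Tuesday'].mean(axis=1)  # 'Tuesday' is included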

Resurrecting this question because all the previous answers currently print a warning.
In most cases, use assign():
df = df.assign(avg=df.mean(axis=1))
For specific columns, one can input them by name:
df = df.assign(avg=df.loc[:, ["Monday", "Tuesday", "Wednesday"]].mean(axis=1))
Or by index, using one more than the last desired index, as the end of the slice is not inclusive:
df = df.assign(avg=df.iloc[:, 0:3].mean(axis=1))
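The warning goes away because assign always returns a new DataFrame instead of writing into a (possibly chained) selection; a quick sketch of that behaviour:

out = df.assign(avg=df.mean(axis=1))  # new frame carrying the extra column
print('avg' in df.columns)            # False: df itself is unchanged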

Related

Python Pandas totals and dates

I'm sorry for not posting the data, but it wouldn't really help. The thing is, I need to make a graph, and I have a CSV file full of information organised by date. It has 'Cases', 'Deaths', 'Recoveries', 'Critical', 'Hospitalized' and 'States' as categories. It goes in order by date and has the number of cases, deaths and recoveries per day for each state. How do I sum these categories to make a graph that shows how the total is increasing? I really have no idea how to start, so I can't post my data. Below are some numbers that try to explain what I have.
0 2020-02-20 1 Andalucía NaN NaN NaN
1 2020-02-20 2 Aragón NaN NaN NaN
2 2020-02-20 3 Asturias NaN NaN NaN
3 2020-02-20 4 Baleares 1.0 NaN NaN
4 2020-02-20 5 Canarias 1.0 NaN NaN
.. ... ... ... ... ... ...
888 2020-04-06 19 Melilla 92.0 40.0 3.0
889 2020-04-06 14 Murcia 1283.0 500.0 84.0
890 2020-04-06 15 Navarra 3355.0 1488.0 124.0
891 2020-04-06 16 País Vasco 9021.0 4856.0 417.0
892 2020-04-06 17 La Rioja 2846.0 918.0 66.0
It's unclear exactly what you mean by "sum this categories". I'm assuming that, for each date, you want to sum the values across all the different regions to come up with the total values for Spain?
In which case, you will want to groupby the date, then .sum() the columns (you can drop the States category):
grouped_df = df.groupby("date")[["Cases", "Deaths", ...]].sum().reset_index()
grouped_df.set_index("date").plot()
This snippet will probably not work directly, you may need to reformat the dates etc. But should be enough to get you started.
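Putting that together, a hedged sketch assuming the CSV really has columns named 'date', 'Cases', 'Deaths' and 'Recoveries' (the file name here is a placeholder):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("covid_by_state.csv", parse_dates=["date"])  # placeholder file name
totals = df.groupby("date")[["Cases", "Deaths", "Recoveries"]].sum()
totals.plot()  # 'date' is already the index after the groupby
plt.show()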
I think you are looking for groupby followed by a cumsum not including dates.
columns_to_group = ['Cases', 'Deaths', 'Recoveries',
                    'Critical', 'Hospitalized', 'date']
new_columns = ['Cases_sum', 'Deaths_sum', 'Recoveries_sum',
               'Critical_sum', 'Hospitalized_sum']
df_grouped = df[columns_to_group].groupby('date').sum().reset_index()
For plotting, seaborn provides an easy function:
import seaborn as sns
df_melted = df_grouped.melt(id_vars=["date"])
sns.lineplot(data=df_melted, x='date', y='value', hue='variable')
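To show how the totals are increasing, the cumulative sum mentioned above can be layered on top of df_grouped before melting; a sketch:

df_cum = df_grouped.set_index('date').cumsum().reset_index()  # running totals per category
df_melted = df_cum.melt(id_vars=["date"])
sns.lineplot(data=df_melted, x='date', y='value', hue='variable')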

Pandas Map creating NaNs

My intention is to replace labels. I found out about using a dictionary and mapping it onto the dataframe. To that end, I first extracted the necessary fields and created a dictionary, which I then fed to the map function.
My programme is as follows:
factor_name = 'Help in household'
df = pd.read_csv('dat.csv')
labels = pd.read_csv('labels.csv')
fact_df = labels.loc[labels['Column'] == factor_name]
fact_dict = dict(zip(fact_df['Level'], fact_df['Rename']))
print(df.index.to_series().map(fact_dict))
My labels.csv is as follows:
Column,Name,Level,Rename
Help in household,Every day,4,Every day
Help in household,Never,1,Never
Help in household,Once a month,2,Once a month
Help in household,Once a week,3,Once a week
State,AN,AN,Andaman & Nicobar
State,AP,AP,Andhra Pradesh
State,AR,AR,Arunachal Pradesh
State,BR,BR,Bihar
State,CG,CG,Chattisgarh
State,CH,CH,Chandigarh
State,DD,DD,Daman & Diu
State,DL,DL,Delhi
State,DN,DN,Dadra & Nagar Haveli
State,GA,GA,Goa
State,GJ,GJ,Gujarat
State,HP,HP,Himachal Pradesh
State,HR,HR,Haryana
State,JH,JH,Jharkhand
State,JK,JK,Jammu & Kashmir
State,KA,KA,Karnataka
State,KL,KL,Kerala
State,MG,MG,Meghalaya
State,MH,MH,Maharashtra
State,MN,MN,Manipur
State,MP,MP,Madhya Pradesh
State,MZ,MZ,Mizoram
State,NG,NG,Nagaland
State,OR,OR,Orissa
State,PB,PB,Punjab
State,PY,PY,Pondicherry
State,RJ,RJ,Rajasthan
State,SK,SK,Sikkim
State,TN,TN,Tamil Nadu
State,TR,TR,Tripura
State,UK,UK,Uttarakhand
State,UP,UP,Uttar Pradesh
State,WB,WB,West Bengal
My dat.csv is as follows:
Id,Help in household,Maths,Reading,Science,Social
11011001001,4,20.37,,27.78,
11011001002,3,12.96,,38.18,
11011001003,4,27.78,70,,
11011001004,4,,56.67,,36
11011001005,1,,,14.55,8.33
11011001006,4,,23.33,,30
11011001007,4,40.74,70,,
11011001008,3,,26.67,,22.92
Intended result is as follows:
4 Every day
1 Never
2 Once a month
3 Once a week
The mapping fails. The result always causes NaNs to appear which I do not want. Can anyone tell me why?
The map comes back as NaN because your dict keys are strings: the Level column in labels.csv mixes numbers and state codes, so pandas reads the whole column as str, while the values in dat.csv are integers (and note you're mapping the index, not the 'Help in household' column). Cast to str and map the right column. Try this:
In [140]: df['Help in household'] \
.astype(str) \
.map(labels.loc[labels['Column']=='Help in household',['Level','Rename']]
.set_index('Level')['Rename'])
Out[140]:
0 Every day
1 Once a week
2 Every day
3 Every day
4 Never
5 Every day
6 Every day
7 Once a week
Name: Help in household, dtype: object
You may also consider using merge:
In [147]: df.assign(Level=df['Help in household'].astype(str)) \
.merge(labels.loc[labels['Column']=='Help in household',['Level','Rename']],
on='Level')
Out[147]:
Id Help in household Maths Reading Science Social Level Rename
0 11011001001 4 20.37 NaN 27.78 NaN 4 Every day
1 11011001003 4 27.78 70.00 NaN NaN 4 Every day
2 11011001004 4 NaN 56.67 NaN 36.00 4 Every day
3 11011001006 4 NaN 23.33 NaN 30.00 4 Every day
4 11011001007 4 40.74 70.00 NaN NaN 4 Every day
5 11011001002 3 12.96 NaN 38.18 NaN 3 Once a week
6 11011001008 3 NaN 26.67 NaN 22.92 3 Once a week
7 11011001005 1 NaN NaN 14.55 8.33 1 Never
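For intuition on the dtype mismatch, here is a tiny standalone sketch: Series.map matches dict keys by exact value and type, so integer values never hit string keys:

import pandas as pd

s = pd.Series([4, 3, 1])                                  # ints, as in dat.csv
d = {'4': 'Every day', '3': 'Once a week', '1': 'Never'}  # str keys, as read from labels.csv

print(s.map(d))              # all NaN: the int 4 never equals the str '4'
print(s.astype(str).map(d))  # Every day / Once a week / Never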

Python pandas dataframe select rows from columns

In an Excel sheet with columns Rainfall / Year / Month, I want to sum rainfall data per year. That is, for instance, for the year 2000, summing the Rainfall cells of months 1 to 12 into a new one.
I tried using pandas in Python but can't manage it (I just started coding). How can I proceed? Any help is welcome, thanks!
Here the head of the data (which has been downloaded):
rainfall (mm) \tyear month country iso3 iso2
0 120.54000 1990 1 ECU NaN NaN
1 231.15652 1990 2 ECU NaN NaN
2 136.62088 1990 3 ECU NaN NaN
3 203.47653 1990 4 ECU NaN NaN
4 164.20956 1990 5 ECU NaN NaN
Use groupby and aggregate sum if you need the sum for every year:
df = df.groupby('\tyear')['rainfall (mm)'].sum()
But if you need only one value:
df.loc[df['\tyear'] == 2000, 'rainfall (mm)'].sum()
If you just want the year 2000, use
df[df['\tyear'] == 2000]['rainfall (mm)'].sum()
Otherwise, jezrael's answer is nice because it sums rainfall (mm) for each distinct value of \tyear.
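As an aside (not from either answer): the literal tab in the header is easy to trip over, so stripping whitespace from the column names first may be simpler:

df.columns = df.columns.str.strip()        # '\tyear' becomes 'year'
df.groupby('year')['rainfall (mm)'].sum()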

Pandas changing cell values based on another cell

I am currently formatting data from two different data sets.
One of the datasets reflects an observation count of people in a room on an hourly basis; the second one is a count of people based on wifi logs generated at 5-minute intervals.
After merging these two dataframes into one, I run into the issue where each hourly row (such as "10:00:00") has the data from the original set, but the 5-minute rows (such as "10:47:14") do not.
Here is how the merged dataframe looks:
room time con auth capacity % Count module size
0      B002  Mon Nov 02 10:32:06   23   23  90  NaN  NaN        NaN  NaN
1      B002  Mon Nov 02 10:37:10   25   25  90  NaN  NaN        NaN  NaN
12527  B002  Mon Nov 02 10:00:00  NaN  NaN  90  50%  45.0  COMP30520   60
12528  B002  Mon Nov 02 11:00:00  NaN  NaN  90   0%   0.0  COMP30520   60
Is there a way for me to go through the dataframe, find the "%", "Count", "module" and "size" values from the 10:00:00 row, and write them to all the rows of the same day where the hour is between 10:00:00 and 10:59:59?
That would allow me to have all the information on each row and then allow me to gather the min(), max() and median() based on 'day' and 'hour'.
To answer the comment asking for the original dataframes, here they are:
first dataframe:
time room module size
0 Mon Nov 02 09:00:00 B002 COMP30190 29
1 Mon Nov 02 10:00:00 B002 COMP40660 53
second dataframe:
room time con auth capacity % Count
0 B002 Mon Nov 02 20:32:06 0 0 NaN NaN NaN
1 B002 Mon Nov 02 20:37:10 0 0 NaN NaN NaN
2 B002 Mon Nov 02 20:42:12 0 0 NaN NaN NaN
12797 B008 Wed Nov 11 13:00:00 NaN NaN 40 25 10.0
12798 B008 Wed Nov 11 14:00:00 NaN NaN 40 50 20.0
12799 B008 Wed Nov 11 15:00:00 NaN NaN 40 25 10.0
This is how these two dataframes were merged together:
DFinal = pd.merge(DF, d3, left_on=["room", "time"], right_on=["room", "time"], how="outer", left_index=False, right_index=False)
Any help with this would be greatly appreciated.
Thanks a lot,
-Romain
Somewhere to start:
b = df[(df['time'] > X) & (df['time'] < Y)]
selects all the elements within times X and Y
And then
df.loc[df['column_name'].isin(b)]
gives you the rows you want (i.e. between X and Y), and you can just assign to them as you see fit.
I think you'll want to assign the values from the row at hour X to the selected rows?
Hope that helps.
Note that these functions are cut-and-paste jobs from
[1] Filter dataframe rows if value in column is in a set list of values
[2] Select rows from a DataFrame based on values in a column in pandas
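A concrete sketch of that idea against the merged frame, assuming the 'time' column has first been parsed to real datetimes (e.g. with pd.to_datetime) and using hypothetical bounds X and Y:

import pandas as pd

X = pd.Timestamp('2015-11-02 10:00:00')  # hypothetical window start
Y = pd.Timestamp('2015-11-02 11:00:00')  # hypothetical window end

window = (DFinal['time'] >= X) & (DFinal['time'] < Y)
cols = ['%', 'Count', 'module', 'size']
hourly = DFinal.loc[window & DFinal['Count'].notna(), cols].iloc[0]  # the 10:00:00 row
DFinal.loc[window, cols] = hourly.values  # broadcast it over every 5-minute row in the window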
If I understood correctly, you want to fill all the missing values in your merged dataframe with the closest data point available in the given hour. I did something similar in essence in the past using a variant of pandas.cut for timeseries, but I can't seem to find it; it wasn't really nice anyway.
While I'm not entirely sure, the fillna method of the pandas dataframe might be what you want (see the docs).
Let your two dataframes be named df_hour and df_cinq; you merged them like this:
df = pd.merge(df_hour, df_cinq, left_on=["room", "time"], right_on=["room", "time"], how="outer", left_index=False, right_index=False)
Then you change your index to time and sort it:
df.set_index('time',inplace=True)
df.sort_index(inplace=True)
The fillna method has an option called 'method' that can take these values:
Method            Action
pad / ffill       Fill values forward
bfill / backfill  Fill values backward
nearest           Fill from the nearest index value (note: accepted by reindex, not by fillna)
Using it to do forward filling (i.e. missing values are filled with the preceding value in the frame):
df.fillna(method='ffill', inplace=True)
The problem with this on your data is that all of the missing data in the non-working hours belonging to the 5-minute observations will be filled with outdated data points. You can use the limit option to limit the number of consecutive data points to be filled, but I don't know if that's useful to you.
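For example (a sketch): with one row every 5 minutes, limiting the fill to 12 consecutive rows caps the propagation at roughly one hour:

df.fillna(method='ffill', limit=12, inplace=True)  # 12 rows x 5 min = 1 hour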
Here's a complete script I wrote as a toy example:
import pandas as pd
import random

hourly_count = 8        # work hours
cinq_count = 24 * 12    # 1 day of 5-minute slots

hour_rng = pd.date_range('1/1/2016-09:00:00', periods=hourly_count, freq='H')
cinq_rng = pd.date_range('1/1/2016-00:02:53', periods=cinq_count, freq='5min')

roomz = 'room0 room1 secretroom'.split()

hourlydata = {'col1': [], 'col2': [], 'room': []}
for i in range(hourly_count):
    hourlydata['room'].append(random.choice(roomz))
    hourlydata['col1'].append(random.random())
    hourlydata['col2'].append(random.randint(0, 100))

cinqdata = {'col3': [], 'col4': [], 'room': []}
frts = 'apples oranges peaches grapefruits whatmore'.split()
vgtbls = 'onion1 onion2 onion3 onion4 onion5 onion0'.split()
for i in range(cinq_count):
    cinqdata['room'].append(random.choice(roomz))
    cinqdata['col3'].append(random.choice(frts))
    cinqdata['col4'].append(random.choice(vgtbls))

hourlydf = pd.DataFrame(hourlydata)
hourlydf['time'] = hour_rng
cinqdf = pd.DataFrame(cinqdata)
cinqdf['time'] = cinq_rng

df = pd.merge(hourlydf, cinqdf, left_on=['room', 'time'],
              right_on=['room', 'time'], how='outer',
              left_index=False, right_index=False)
df.set_index('time', inplace=True)
df.sort_index(inplace=True)
df.fillna(method='ffill', inplace=True)
print(df['2016-1-1 09:00:00':'2016-1-1 17:00:00'])
Actually, I was able to fix this by:
First: slicing the "time" feature to generate two additional columns, one for the day shown in "time" and one for the hour.
I used lambda functions to get these columns:
df['date'] = df['time'].map(lambda x: x[:10])    # day part, e.g. 'Mon Nov 02'
df['hour'] = df['time'].map(lambda x: x[11:13])  # hour part, e.g. '10'
Based on these two new columns, I modified the way the dataframes were being merged. Here is the code I used to fix it:
dataframeFinal = pd.merge(dataframe1, dataframe2, left_on=["room", "date", "hour"],
right_on=["room", "date", "hour"], how="outer",
left_index=False, right_index=False, copy=False)
After this merge I ended up with duplicate time columns ('time_x' and 'time_y').
So I replaced the NaN values as follows:
dataframeFinal.time_y.fillna(dataframeFinal.time_x, inplace=True)
Now the column "time_y" contains all the time values, no more NaN.
I do not need the "time_x" column, so I drop it from the dataframe:
dataframeFinal = dataframeFinal.drop('time_x', axis=1)

Pandas Reindex - Fill Column with Missing Values

I tried several examples on this topic, but with no results. I'm reading a DataFrame like:
Code,Counts
10006,5
10011,2
10012,26
10013,20
10014,17
10015,2
10018,2
10019,3
How can I get another DataFrame like:
Code,Counts
10006,5
10007,NaN
10008,NaN
...
10011,2
10012,26
10013,20
10014,17
10015,2
10016,NaN
10017,NaN
10018,2
10019,3
Basically filling the missing values of the 'Code' Column? I tried the df.reindex() method but I can't figure out how it works. Thanks a lot.
I'd set the index to your 'Code' column, then reindex by passing in a new array based on your current index (np.arange accepts start and stop parameters, and you need to add 1 to the stop because it's exclusive), and then reset_index. This assumes that your 'Code' values are already sorted:
In [21]:
import numpy as np
df.set_index('Code', inplace=True)
df = df.reindex(index=np.arange(df.index[0], df.index[-1] + 1)).reset_index()
df
Out[21]:
Code Counts
0 10006 5
1 10007 NaN
2 10008 NaN
3 10009 NaN
4 10010 NaN
5 10011 2
6 10012 26
7 10013 20
8 10014 17
9 10015 2
10 10016 NaN
11 10017 NaN
12 10018 2
13 10019 3
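An equivalent sketch that doesn't rely on positional first/last values, using min/max instead, so it works even if 'Code' isn't pre-sorted:

import numpy as np

full_range = np.arange(df['Code'].min(), df['Code'].max() + 1)
df = df.set_index('Code').reindex(full_range).reset_index()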
