Python pandas dataframe select rows from columns

In an Excel sheet with columns Rainfall / Year / Month, I want to sum the rainfall data per year: for instance, for the year 2000, sum the Rainfall cells for months 1 through 12 into a new one.
I tried using pandas in Python but couldn't manage it (I just started coding). How can I proceed? Any help is welcome, thanks!
Here is the head of the data (which was downloaded):
   rainfall (mm)  \tyear  month country iso3 iso2
0      120.54000    1990      1     ECU  NaN  NaN
1      231.15652    1990      2     ECU  NaN  NaN
2      136.62088    1990      3     ECU  NaN  NaN
3      203.47653    1990      4     ECU  NaN  NaN
4      164.20956    1990      5     ECU  NaN  NaN

Use groupby and aggregate sum if you need the sum for every year:
df = df.groupby('\tyear')['rainfall (mm)'].sum()
But if you need only one value:
df.loc[df['\tyear'] == 2000, 'rainfall (mm)'].sum()

If you just want the year 2000, use
df[df['\tyear'] == 2000]['rainfall (mm)'].sum()
Otherwise, jezrael's answer is nice because it sums rainfall (mm) for each distinct value of \tyear.
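Putting both together, a minimal self-contained sketch (assuming the head shown above, and keeping the tab-prefixed '\tyear' column name exactly as it appears in the downloaded file):
import pandas as pd

# Rebuild the head of the data; the '\tyear' column name contains a
# literal tab carried over from the source file.
df = pd.DataFrame({
    'rainfall (mm)': [120.54000, 231.15652, 136.62088, 203.47653, 164.20956],
    '\tyear': [1990, 1990, 1990, 1990, 1990],
    'month': [1, 2, 3, 4, 5],
    'country': ['ECU'] * 5,
})

# One summed value per distinct year:
per_year = df.groupby('\tyear')['rainfall (mm)'].sum()

# A single scalar for one year (1990 here, since the head only covers 1990):
total_1990 = df.loc[df['\tyear'] == 1990, 'rainfall (mm)'].sum()
print(per_year)
print(total_1990)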

Related

Filter values as per std deviation for individual column

I am working on a requirement where I need to set particular values to NaN based on the variable upper, which is my upper standard-deviation bound.
Here is some sample code:
import numpy as np
import pandas as pd

data = {'year': ['2014','2014','2015','2014','2015','2015','2015','2014','2015'],
        'month': ['Hyundai','Toyota','Hyundai','Toyota','Hyundai','Toyota','Hyundai','Toyota','Toyota'],
        'make': [23, 34, 32, 22, 12, 33, 44, 11, 21]}
df = pd.DataFrame.from_dict(data)
df = pd.pivot_table(df, index='month', columns='year', values='make', aggfunc=np.sum)
upper = df.mean() + 3*df.std()
This is just sample data; the real data is huge. Based on upper's value for every year, I need to filter each year's column accordingly.
The sample input df, the per-year upper std-dev values, and the desired output were shown as images in the original post.
Based on the upper std-deviation value for each individual year, a value should be converted to NaN if value < upper.
E.g. 2014 has upper = 138, so in 2014's column only, any value < upper is converted to NaN.
2014's upper value applies only to 2014 itself, and the same goes for 2015.
IIUC, use DataFrame.lt to compare the DataFrame against the Series, and then set NaNs where it matches using DataFrame.mask:
print(df.lt(upper))
year      2014  2015
month
Hyundai   True  True
Toyota    True  True
df = df.mask(df.lt(upper))
print(df)
year     2014  2015
month
Hyundai   NaN   NaN
Toyota    NaN   NaN
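For reference, a runnable sketch combining the question's setup with the mask step (using aggfunc='sum', which pandas accepts in place of np.sum):
import pandas as pd

data = {'year': ['2014','2014','2015','2014','2015','2015','2015','2014','2015'],
        'month': ['Hyundai','Toyota','Hyundai','Toyota','Hyundai','Toyota','Hyundai','Toyota','Toyota'],
        'make': [23, 34, 32, 22, 12, 33, 44, 11, 21]}
df = pd.pivot_table(pd.DataFrame(data), index='month', columns='year',
                    values='make', aggfunc='sum')

# Column-wise threshold: one upper value per year.
upper = df.mean() + 3 * df.std()

# lt() aligns the Series to the columns by their 'year' labels;
# mask() replaces each True position with NaN.
df = df.mask(df.lt(upper))
print(df)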

Calculate standard deviation for intervals in dataframe column

I would like to calculate standard deviations for non-rolling intervals.
I have a df like this:
value  std  year
    3  nan  2001
    2  nan  2001
    4  nan  2001
   19  nan  2002
   23  nan  2002
   34  nan  2002
and so on. I would just like to calculate the standard deviation for every year and save it in every cell of the respective rows in "std". I have the same amount of data for every year, so the length of the intervals never changes.
I already tried:
df["std"] = df.groupby("year").std()
but since the right-hand side produces a new dataframe with the std of every column grouped by year, this obviously does not work.
Thank you all very much for your support!
IIUC, try the transform() method:
df['std']=df.groupby("year")['value'].transform('std')
OR
If you want to find the standard deviation of multiple columns then:
df[['std1','std2']]=df.groupby("year")[['column1','column2']].transform('std')
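As a self-contained sketch of this approach on the question's sample data:
import pandas as pd

df = pd.DataFrame({
    'value': [3, 2, 4, 19, 23, 34],
    'year':  [2001, 2001, 2001, 2002, 2002, 2002],
})

# transform('std') returns a Series aligned to the original index, so
# every row receives the standard deviation of its own year's interval.
df['std'] = df.groupby('year')['value'].transform('std')
print(df)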

Python / Pandas: Fill NaN with order - linear interpolation --> ffill --> bfill

I have a df:
company year revenues
0 company 1 2019 1,425,000,000
1 company 1 2018 1,576,000,000
2 company 1 2017 1,615,000,000
3 company 1 2016 1,498,000,000
4 company 1 2015 1,569,000,000
5 company 2 2019 nan
6 company 2 2018 1,061,757,075
7 company 2 2017 nan
8 company 2 2016 573,414,893
9 company 2 2015 599,402,347
I would like to fill the nan values in a specific order: linearly interpolate first, then forward fill, and then backward fill. I currently have:
f_2_impute = [x for x in cl_data.columns if cl_data[x].dtypes != 'O' and 'total' not in x and 'year' not in x]
def ffbf(x):
    return x.ffill().bfill()
group_with = ['company']
for x in cl_data[f_2_impute]:
    cl_data[x] = cl_data.groupby(group_with)[x].apply(lambda fill_it: ffbf(fill_it))
which performs ffill() and bfill(). Ideally I want a function that first tries to linearly interpolate the missing values, then forward fills them, and then backward fills them.
Any quick ways of achieving it? Thanking you in advance.
I believe you first need to convert the column to floats, if reading from a file:
df = pd.read_csv(file, thousands=',')
Or:
df['revenues'] = df['revenues'].replace(',','', regex=True).astype(float)
and then add DataFrame.interpolate:
def ffbf(x):
    return x.interpolate().ffill().bfill()
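A runnable sketch of the whole pipeline on the question's data, using transform rather than the question's apply loop so the result aligns back to the original index (a variation on my part, not from the answer above):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'company': ['company 1'] * 5 + ['company 2'] * 5,
    'year': [2019, 2018, 2017, 2016, 2015] * 2,
    'revenues': [1425000000, 1576000000, 1615000000, 1498000000, 1569000000,
                 np.nan, 1061757075, np.nan, 573414893, 599402347],
})

def ffbf(x):
    # interpolate() fills gaps between known values; ffill()/bfill() then
    # catch leading/trailing NaNs that interpolation cannot reach.
    return x.interpolate().ffill().bfill()

df['revenues'] = df.groupby('company')['revenues'].transform(ffbf)
print(df)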

Pandas: calculate mean of Dataframe column values per "year"

I have a data frame representing customers' checkins (visits) to restaurants. year is simply the year when a checkin at a restaurant happened.
What I want to do is add a column average_checkin to my initial DataFrame df that represents the average number of visits a restaurant gets per year.
import numpy as np
import pandas as pd

data = {
    'restaurant_id': ['--1UhMGODdWsrMastO9DZw', '--1UhMGODdWsrMastO9DZw', '--1UhMGODdWsrMastO9DZw', '--1UhMGODdWsrMastO9DZw', '--1UhMGODdWsrMastO9DZw', '--1UhMGODdWsrMastO9DZw', '--6MefnULPED_I942VcFNA', '--6MefnULPED_I942VcFNA', '--6MefnULPED_I942VcFNA', '--6MefnULPED_I942VcFNA'],
    'year': ['2016', '2016', '2016', '2016', '2017', '2017', '2011', '2011', '2012', '2012'],
}
df = pd.DataFrame(data, columns=['restaurant_id', 'year'])
# here I count the total number of checkins a restaurant had
d = df.groupby('restaurant_id')['year'].count().to_dict()
df['nb_checkin'] = df['restaurant_id'].map(d)
mean_checkin = df.groupby(['restaurant_id', 'year']).agg({'nb_checkin': [np.mean]})
mean_checkin.columns = ['mean_checkin']
mean_checkin.reset_index()
# the values in mean_checkin make no sense
# I need to merge it with df to add that new column
I am still new to the pandas library; I tried something like this but my results make no sense. Is there something wrong with my syntax? If any clarification is needed, please ask.
The average number of visits per year can be calculated as the total number of visits a restaurant has, divided by the number of unique years you have data for.
grouped = df.groupby(["restaurant_id"])
avg_annual_visits = grouped["year"].count() / grouped["year"].nunique()
avg_annual_visits = avg_annual_visits.rename("avg_annual_visits")
print(avg_annual_visits)
restaurant_id
--1UhMGODdWsrMastO9DZw 3.0
--6MefnULPED_I942VcFNA 2.0
Name: avg_annual_visits, dtype: float64
Then if you wanted to merge it back to your original data:
df = df.merge(avg_annual_visits, left_on="restaurant_id", right_index=True)
print(df)
restaurant_id year avg_annual_visits
0 --1UhMGODdWsrMastO9DZw 2016 3.0
1 --1UhMGODdWsrMastO9DZw 2016 3.0
2 --1UhMGODdWsrMastO9DZw 2016 3.0
3 --1UhMGODdWsrMastO9DZw 2016 3.0
4 --1UhMGODdWsrMastO9DZw 2017 3.0
5 --1UhMGODdWsrMastO9DZw 2017 3.0
6 --6MefnULPED_I942VcFNA 2011 2.0
7 --6MefnULPED_I942VcFNA 2011 2.0
8 --6MefnULPED_I942VcFNA 2012 2.0
9 --6MefnULPED_I942VcFNA 2012 2.0
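If you prefer to write the column directly without a merge, the same ratio can be computed with two aligned transforms (an equivalent variation, not part of the answer above):
# Continuing with the df defined in the question:
g = df.groupby('restaurant_id')['year']
df['avg_annual_visits'] = g.transform('count') / g.transform('nunique')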

Row-wise average for a subset of columns with missing values

I've got a DataFrame which has occasional missing values, and looks something like this:
Monday Tuesday Wednesday
================================================
Mike 42 NaN 12
Jenna NaN NaN 15
Jon 21 4 1
I'd like to add a new column to my data frame where I'd calculate the average across all columns for every row.
Meaning, for Mike, I'd need
(df['Monday'] + df['Wednesday'])/2, but for Jenna, I'd simply use df['Wednesday']/1
Does anyone know the best way to account for this variation that results from missing values and calculate the average?
You can simply:
df['avg'] = df.mean(axis=1)
Monday Tuesday Wednesday avg
Mike 42 NaN 12 27.000000
Jenna NaN NaN 15 15.000000
Jon 21 4 1 8.666667
because .mean() ignores missing values by default: see docs.
To select a subset, you can:
df['avg'] = df[['Monday', 'Tuesday']].mean(axis=1)
Monday Tuesday Wednesday avg
Mike 42 NaN 12 42.0
Jenna NaN NaN 15 NaN
Jon 21 4 1 12.5
Alternative - using iloc (can also use loc here):
df['avg'] = df.iloc[:,0:2].mean(axis=1)
Resurrecting this Question because all previous answers currently print a Warning.
In most cases, use assign():
df = df.assign(avg=df.mean(axis=1))
For specific columns, one can input them by name:
df = df.assign(avg=df.loc[:, ["Monday", "Tuesday", "Wednesday"]].mean(axis=1))
Or by index, using one more than the last desired index as it is not inclusive:
df = df.assign(avg=df.iloc[:, 0:3].mean(axis=1))
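For reference, a self-contained sketch of the assign() approach on the sample data above (assuming the warning referred to is pandas' SettingWithCopyWarning):
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'Monday': [42, np.nan, 21],
     'Tuesday': [np.nan, np.nan, 4],
     'Wednesday': [12, 15, 1]},
    index=['Mike', 'Jenna', 'Jon'])

# assign() returns a new DataFrame instead of writing into a possible
# view, which is what sidesteps the warning.
df = df.assign(avg=df.mean(axis=1))
print(df)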
