Column name in pandas dataframe resulting from groupby - python

I have the following code:
import pandas as pd
import numpy as np
df = pd.DataFrame({'clif_cod' : [1,2,3,3,4,4,4],
'peds_val_fat' : [10.2, 15.2, 30.9, 14.8, 10.99, 39.9, 54.9],
'mes' : [1,2,4,5,5,6,12],
'ano' : [2016, 2016, 2016, 2016, 2016, 2016, 2016]})
vetor_valores = df.groupby(['mes','clif_cod']).sum()
which yields me this output:
ano peds_val_fat
mes clif_cod
1 1 2016 10.20
2 2 2016 15.20
4 3 2016 30.90
5 3 2016 14.80
4 2016 10.99
6 4 2016 39.90
12 4 2016 54.90
How do I select rows based on mes and clif_cod?
When I do list(df) I only get ano and peds_val_fat.

IIUC, you can just pass the argument as_index=False to your groupby. You can then access the result as you would any other dataframe:
vetor_valores = df.groupby(['mes','clif_cod'], as_index=False).sum()
>>> vetor_valores
mes clif_cod ano peds_val_fat
0 1 1 2016 10.20
1 2 2 2016 15.20
2 4 3 2016 30.90
3 5 3 2016 14.80
4 5 4 2016 10.99
5 6 4 2016 39.90
6 12 4 2016 54.90
To access values, you can now use iloc or loc as you would any dataframe:
# Select first row:
vetor_valores.iloc[0]
...
Alternatively, if you've already created your groupby and don't want to go back and re-make it, you can reset the index, the result is identical.
vetor_valores.reset_index()

By using pd.IndexSlice:
vetor_valores.loc[[pd.IndexSlice[1,1]],:]
Out[272]:
ano peds_val_fat
mes clif_cod
1 1 2016 10.2

You've got a dataframe with a two-level MultiIndex. Use both values to access rows, e.g., vetor_valores.loc[(4,3)].
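To make both access styles concrete, here is a minimal runnable sketch using the question's data:

```python
import pandas as pd

df = pd.DataFrame({'clif_cod': [1, 2, 3, 3, 4, 4, 4],
                   'peds_val_fat': [10.2, 15.2, 30.9, 14.8, 10.99, 39.9, 54.9],
                   'mes': [1, 2, 4, 5, 5, 6, 12],
                   'ano': [2016] * 7})
vetor_valores = df.groupby(['mes', 'clif_cod']).sum()

# One row of the MultiIndex: pass a (mes, clif_cod) tuple to .loc
row = vetor_valores.loc[(4, 3)]
print(row['peds_val_fat'])  # 30.9

# All rows for a given mes: indexing on the outer level alone also works
print(vetor_valores.loc[5])
```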

Use the axis parameter in .loc:
vetor_valores.loc(axis=0)[1,:]
Output:
ano peds_val_fat
mes clif_cod
1 1 2016 10.2

Related

How to fill missing values in a dataframe based on group value counts?

I have a pandas DataFrame with 2 columns: Year(int) and Condition(string). In column Condition I have a nan value and I want to replace it based on information from groupby operation.
import pandas as pd
import numpy as np
year = [2015, 2016, 2017, 2016, 2016, 2017, 2015, 2016, 2015, 2015]
cond = ["good", "good", "excellent", "good", 'excellent','excellent', np.nan, 'good','excellent', 'good']
X = pd.DataFrame({'year': year, 'condition': cond})
stat = X.groupby('year')['condition'].value_counts()
It gives:
print(X)
year condition
0 2015 good
1 2016 good
2 2017 excellent
3 2016 good
4 2016 excellent
5 2017 excellent
6 2015 NaN
7 2016 good
8 2015 excellent
9 2015 good
print(stat)
year condition
2015 good 2
excellent 1
2016 good 3
excellent 1
2017 excellent 2
As the NaN value in row 6 has year = 2015, and stat tells me the most frequent condition for 2015 is 'good', I want to replace this NaN with 'good'.
I have tried fillna and the .transform method, but it does not work :(
I would be grateful for any help.
I did a little extra transformation to get stat as a dictionary mapping the year to its highest frequency name (credit to this answer):
In[0]:
fill_dict = stat.unstack().idxmax(axis=1).to_dict()
fill_dict
Out[0]:
{2015: 'good', 2016: 'good', 2017: 'excellent'}
Then use fillna with map based on this dictionary (credit to this answer):
In[0]:
X['condition'] = X['condition'].fillna(X['year'].map(fill_dict))
X
Out[0]:
year condition
0 2015 good
1 2016 good
2 2017 excellent
3 2016 good
4 2016 excellent
5 2017 excellent
6 2015 good
7 2016 good
8 2015 excellent
9 2015 good
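For completeness, the transform route the asker attempted can also be made to work; a sketch (assuming every year has at least one non-null condition) that fills each NaN with its year's most frequent value:

```python
import pandas as pd
import numpy as np

year = [2015, 2016, 2017, 2016, 2016, 2017, 2015, 2016, 2015, 2015]
cond = ['good', 'good', 'excellent', 'good', 'excellent', 'excellent',
        np.nan, 'good', 'excellent', 'good']
X = pd.DataFrame({'year': year, 'condition': cond})

# Per-year mode via transform; mode() returns a Series, so take its first entry
X['condition'] = X['condition'].fillna(
    X.groupby('year')['condition'].transform(lambda s: s.mode()[0]))
print(X.loc[6, 'condition'])  # good
```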

Select by Column Values

I know it is possible in arcpy; I'm trying to find out if it can be done in pandas.
I have the following
data= {'Species':[ 'P.PIN','P.PIN','V.FOG', 'V.KOP', 'E.MON', 'E.CLA', 'E.KLI', 'D.FGH','W.ERT','S.MIX','P.PIN'],
'FY':[ '2002','2016','2018','2010','2009','2019','2017','2016','2018','2018','2016']}
I need to select all the P.PIN, P.RAD and any other species starting with E that have a FY of 2016 or later, and put them into a new dataframe.
How can I get this done? All I am able to do is select P.PIN and P.RAD, but I am stuck adding in all the others starting with E:
df3 =df[(df['FY']>=2016)&(df1['LastSpecies'].isin(['P.PIN','P.RAD']))]
Your help will be highly appreciated.
Here is a step-by-step way. You could also combine the logic inside a single np.where(); the steps are shown separately so each condition is visible.
Start by typecasting your df['FY'] values as int so we can use the greater than (>) operator.
>>> df['FY'] = df['FY'].astype(int)
>>> df['flag'] = np.where(df['Species'].isin(['P.PIN', 'P.RAD']), ['Take'], ['Remove'])
>>> df
Species FY flag
0 P.PIN 2002 Take
1 P.PIN 2016 Take
2 V.FOG 2018 Remove
3 V.KOP 2010 Remove
4 E.MON 2009 Remove
5 E.CLA 2019 Remove
6 E.KLI 2017 Remove
7 D.FGH 2016 Remove
8 W.ERT 2018 Remove
9 S.MIX 2018 Remove
10 P.PIN 2016 Take
>>> df['flag'] = np.where((df['FY'] > 2016) & (df['Species'].str.startswith('E')), ['Take'], df['flag'])
>>> df
Species FY flag
0 P.PIN 2002 Take
1 P.PIN 2016 Take
2 V.FOG 2018 Remove
3 V.KOP 2010 Remove
4 E.MON 2009 Remove
5 E.CLA 2019 Take
6 E.KLI 2017 Take
7 D.FGH 2016 Remove
8 W.ERT 2018 Remove
9 S.MIX 2018 Remove
10 P.PIN 2016 Take
>>> new_df = df[df['flag'].isin(['Take'])][['Species', 'FY']]
>>> new_df
Species FY
0 P.PIN 2002
1 P.PIN 2016
5 E.CLA 2019
6 E.KLI 2017
10 P.PIN 2016
Hope this helps :D
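The three steps can also be collapsed into a single boolean mask; a sketch reproducing the same selection (P.PIN/P.RAD regardless of year, E-species only when FY is after 2016):

```python
import pandas as pd

df = pd.DataFrame({'Species': ['P.PIN', 'P.PIN', 'V.FOG', 'V.KOP', 'E.MON', 'E.CLA',
                               'E.KLI', 'D.FGH', 'W.ERT', 'S.MIX', 'P.PIN'],
                   'FY': ['2002', '2016', '2018', '2010', '2009', '2019',
                          '2017', '2016', '2018', '2018', '2016']})
df['FY'] = df['FY'].astype(int)

# Keep P.PIN/P.RAD unconditionally, plus E-species with FY after 2016
mask = df['Species'].isin(['P.PIN', 'P.RAD']) | (
    df['Species'].str.startswith('E') & (df['FY'] > 2016))
new_df = df[mask]
print(new_df['Species'].tolist())  # ['P.PIN', 'P.PIN', 'E.CLA', 'E.KLI', 'P.PIN']
```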

pandas groupby sum if count equals condition

I'm downloading data from FRED. I'm summing to get annual numbers, but I don't want incomplete years, so I need to sum only when the count of observations in a year is 12, because the series is monthly.
import pandas_datareader.data as web
mnemonic = 'RSFSXMV'
df = web.DataReader(mnemonic, 'fred', 2000, 2020)
df['year'] = df.index.year
new_df = df.groupby(["year"])[mnemonic].sum().reset_index()
print(new_df)
I don't want 2019 to show up.
In your case, we use transform with nunique to make sure each year has 12 unique months; if not, we drop those rows before doing the groupby sum:
df['Month']=df.index.month
m=df.groupby('year').Month.transform('nunique')==12
new_df = df.loc[m].groupby(["year"])[mnemonic].sum().reset_index()
Or with isin:
df['Month'] = df.index.month
m = df.groupby('year').Month.nunique()
new_df = df.loc[df.year.isin(m.index[m == 12])].groupby(["year"])[mnemonic].sum().reset_index()
You could use the aggregate function count within the groupby:
df['year'] = df.index.year
df = df.groupby('year').agg({'RSFSXMV': 'sum', 'year': 'count'})
which will give you:
RSFSXMV year
year
2000 2487790 12
2001 2563218 12
2002 2641870 12
2003 2770397 12
2004 2969282 12
2005 3196141 12
2006 3397323 12
2007 3531906 12
2008 3601512 12
2009 3393753 12
2010 3541327 12
2011 3784014 12
2012 3934506 12
2013 4043037 12
2014 4191342 12
2015 4252113 12
2016 4357528 12
2017 4561833 12
2018 4810502 12
2019 2042147 5
Then simply drop those rows with a year count less than 12
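The same drop can also be expressed with GroupBy.filter. Since the FRED download isn't reproducible here, this sketch uses synthetic monthly data of the same shape (the column name RSFSXMV is kept only for illustration):

```python
import pandas as pd

# Synthetic stand-in for the monthly series: two full years plus 5 months of 2019
idx = pd.date_range('2017-01-01', '2019-05-01', freq='MS')
df = pd.DataFrame({'RSFSXMV': range(len(idx))}, index=idx)
df['year'] = df.index.year

# Keep only years that contribute 12 rows, then sum per year
full = df.groupby('year').filter(lambda g: len(g) == 12)
new_df = full.groupby('year')['RSFSXMV'].sum().reset_index()
print(new_df['year'].tolist())  # [2017, 2018]
```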

iterate over pandas dataframe and create another dataframe with repetitive records

I have a dataframe act with columns as ['ids','start-yr','end-yr'].
I want to create another dataframe timeline with columns as ['ids','years'].
using the act df. So if act has fields as
ids start-yr end-yr
--------------------------------
'IAs728-ahe83j' 2014 2016
'J8273nbajsu-193h' 2012 2018
I want the timeline df to be populated like this:
ids years
------------------------
'IAs728-ahe83j' 2014
'IAs728-ahe83j' 2015
'IAs728-ahe83j' 2016
'J8273nbajsu-193h' 2012
'J8273nbajsu-193h' 2013
'J8273nbajsu-193h' 2014
'J8273nbajsu-193h' 2015
'J8273nbajsu-193h' 2016
'J8273nbajsu-193h' 2017
'J8273nbajsu-193h' 2018
My attempt so far:
timeline = pd.DataFrame(columns=['ids','years'])
cnt = 0
for ix, row in act.iterrows():
    for yr in range(int(row['start-yr']), int(row['end-yr']) + 1):
        timeline.loc[cnt, 'ids'] = row['ids']
        timeline.loc[cnt, 'years'] = yr
        cnt += 1
But this is a very costly, time-consuming operation (which is obvious, I know). So what would be the best pythonic approach to populate a pandas df in a situation like this?
Any help is appreciated, thanks.
Use a list comprehension with range to build a list of tuples, then the DataFrame constructor:
a = [(i, x) for i, a, b in df.values for x in range(a, b + 1)]
df = pd.DataFrame(a, columns=['ids','years'])
print (df)
ids years
0 'IAs728-ahe83j' 2014
1 'IAs728-ahe83j' 2015
2 'IAs728-ahe83j' 2016
3 'J8273nbajsu-193h' 2012
4 'J8273nbajsu-193h' 2013
5 'J8273nbajsu-193h' 2014
6 'J8273nbajsu-193h' 2015
7 'J8273nbajsu-193h' 2016
8 'J8273nbajsu-193h' 2017
9 'J8273nbajsu-193h' 2018
If the DataFrame has more columns than these, filter the three needed ones by list first:
c = ['ids','start-yr','end-yr']
a = [(i, x) for i, a, b in df[c].values for x in range(a, b + 1)]
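A vectorized alternative (a sketch, not from the original answer) is to repeat each id by its span length with numpy and build the years column from per-row ranges:

```python
import pandas as pd
import numpy as np

act = pd.DataFrame({'ids': ['IAs728-ahe83j', 'J8273nbajsu-193h'],
                    'start-yr': [2014, 2012],
                    'end-yr': [2016, 2018]})

# Number of years spanned by each row (inclusive on both ends)
n = (act['end-yr'] - act['start-yr'] + 1).to_numpy()
timeline = pd.DataFrame({
    'ids': np.repeat(act['ids'].to_numpy(), n),
    'years': np.concatenate([np.arange(s, e + 1)
                             for s, e in zip(act['start-yr'], act['end-yr'])])
})
print(len(timeline))  # 10
```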

Pandas Creating Dataframes from Loops

I am trying to make a dataframe so that I can send it to a CSV easily, otherwise I have to do this process manually..
I'd like this to be my final output. Each person has a month and year combo that starts at 1/1/2014 and goes to 12/1/2016:
Name date
0 ben 1/1/2014
1 ben 2/1/2014
2 ben 3/1/2014
3 ben 4/1/2014
....
12 dan 1/1/2014
13 dan 2/1/2014
14 dan 3/1/2014
code so far:
import pandas as pd
days = [1]
months = list(range(1, 13))
years = ['2014', '2015', '2016']
listof_people = ['ben','dan','nathan', 'gary', 'Mark', 'Sean', 'Tim', 'Chris']
df = pd.DataFrame({"Name": listof_people})
for month in months:
    df.append({'date': month}, ignore_index=True)
print(df)
When I try looping to create the dataframe it either does not work, I get index errors (because of the non-matching lists) and I'm at a loss.
I've done a good bit of searching and have found some following links that are similar, but I can't reverse engineer the work to fit my case.
Filling empty python dataframe using loops
How to build and fill pandas dataframe from for loop?
I don't want anyone to feel like they are "doing my homework", so if i'm derping on something simple please let me know.
I think you can use product for all the combinations, with to_datetime for the date column:
from itertools import product
days = [1]
months = list(range(1, 13))
years = ['2014', '2015', '2016']
listof_people = ['ben','dan','nathan', 'gary', 'Mark', 'Sean', 'Tim', 'Chris']
df1 = pd.DataFrame(list(product(listof_people, months, days, years)))
df1.columns = ['Name', 'month','day','year']
print (df1)
Name month day year
0 ben 1 1 2014
1 ben 1 1 2015
2 ben 1 1 2016
3 ben 2 1 2014
4 ben 2 1 2015
5 ben 2 1 2016
6 ben 3 1 2014
7 ben 3 1 2015
8 ben 3 1 2016
9 ben 4 1 2014
10 ben 4 1 2015
...
...
df1['date'] = pd.to_datetime(df1[['month','day','year']])
df1 = df1[['Name','date']]
print (df1)
Name date
0 ben 2014-01-01
1 ben 2015-01-01
2 ben 2016-01-01
3 ben 2014-02-01
4 ben 2015-02-01
5 ben 2016-02-01
6 ben 2014-03-01
7 ben 2015-03-01
...
...
Alternatively, build the combinations with a MultiIndex:
mux = pd.MultiIndex.from_product(
    [listof_people, years, months],
    names=['Name', 'Year', 'Month'])
pd.Series(
    1, mux, name='Day'
).reset_index().assign(
    date=lambda d: pd.to_datetime(d[['Year', 'Month', 'Day']])
)[['Name', 'date']]
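Since the dates are just the first of each month over a fixed window, pd.date_range with a month-start frequency offers a simpler route; a sketch using the same inputs:

```python
import pandas as pd

listof_people = ['ben', 'dan', 'nathan', 'gary', 'Mark', 'Sean', 'Tim', 'Chris']
dates = pd.date_range('2014-01-01', '2016-12-01', freq='MS')  # month starts

# Cross every name with every month-start date
df = pd.DataFrame([(name, d) for name in listof_people for d in dates],
                  columns=['Name', 'date'])
print(len(df))  # 8 people * 36 months = 288
```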
