Sum based on criteria in row and column conditions [duplicate] - python

I have a dataframe in Pandas that looks something like this:
Year Type Money
2012 A 2
2012 A 3
2012 B 4
2012 B 5
2012 C 7
2013 A 6
2013 A 4
2013 B 3
2013 B 2
2013 C 1
2014 A 3
2014 A 4
2014 B 5
I want to sum it up as such:
A B C
2012 5 9 7
2013 10 5 1
2014 7 5 0
For instance, the first entry of 5 is a sum of all entries in the data from year 2012 and with Type A.
Is there a simple way to go about doing this? I know how to go about this using SUMIFS in Excel but want to avoid that if possible.

Try:
df.groupby(['Year','Type']).Money.sum().unstack(level=1).fillna(0)
Output:
Type A B C
Year
2012 5.0 9.0 7.0
2013 10.0 5.0 1.0
2014 7.0 5.0 0.0
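An equivalent result can be obtained in one call with pivot_table. The sketch below rebuilds the question's data by hand (the variable names df and result are only illustrative) and sums Money for every Year/Type pair, filling missing combinations with 0:

```python
import pandas as pd

# Data copied from the question above.
df = pd.DataFrame({
    "Year": [2012, 2012, 2012, 2012, 2012,
             2013, 2013, 2013, 2013, 2013,
             2014, 2014, 2014],
    "Type": ["A", "A", "B", "B", "C",
             "A", "A", "B", "B", "C",
             "A", "A", "B"],
    "Money": [2, 3, 4, 5, 7, 6, 4, 3, 2, 1, 3, 4, 5],
})

# One row per Year, one column per Type; cells are the sum of Money.
# fill_value=0 replaces Year/Type combinations that never occur (2014, C).
result = df.pivot_table(index="Year", columns="Type", values="Money",
                        aggfunc="sum", fill_value=0)
print(result)
```

This is the closest pandas analogue to a SUMIFS grid in Excel: index plays the role of the row criterion and columns the role of the column criterion.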

Related

Filter individuals that don't have data for the whole period

I am using Python 3.9 on Pycharm. I have the following dataframe:
id year gdp
0 A 2019 3
1 A 2020 0
2 A 2021 5
3 B 2019 4
4 B 2020 2
5 B 2021 1
6 C 2020 5
7 C 2021 4
I want to keep individuals that have available data for the whole period. In other words, I would like to filter the rows such that I only keep id that have data for the three years (2019, 2020, 2021). This means excluding all observations of id C and keep all observations of id A and B:
id year gdp
0 A 2019 3
1 A 2020 0
2 A 2021 5
3 B 2019 4
4 B 2020 2
5 B 2021 1
Is it feasible in Python?
Since you want to keep only the ids for which all three years exist, you can group the dataframe by id and then filter each group by comparing the set of years you require with the set of years available for that id:
>>> years = {2019, 2020, 2021}
>>> df.groupby('id').filter(lambda x: set(x['year'].values)==years)
# df is your dataframe
id year gdp
0 A 2019 3
1 A 2020 0
2 A 2021 5
3 B 2019 4
4 B 2020 2
5 B 2021 1
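For reference, here is a self-contained sketch of the same group-and-filter idea. It builds the dataframe by hand and uses issubset instead of strict equality, which is an assumption about what you want: an id that also has data for additional years (say 2018) would still be kept. The names required_years and complete are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["A", "A", "A", "B", "B", "B", "C", "C"],
    "year": [2019, 2020, 2021, 2019, 2020, 2021, 2020, 2021],
    "gdp": [3, 0, 5, 4, 2, 1, 5, 4],
})

required_years = {2019, 2020, 2021}

# Keep a group only if its years cover every required year.
complete = df.groupby("id").filter(
    lambda g: required_years.issubset(set(g["year"]))
)
print(complete)
```

If ids with extra years should be dropped as well, compare with == as in the answer above.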
First, count how many distinct years exist in the column year, then use a boolean mask to filter your dataframe. For that, you need pandas.DataFrame.groupby and transform to count the distinct years available for each id, and keep the rows whose id has all of them.
from io import StringIO
import pandas as pd
s = """id year gdp
A 2019 3
A 2020 0
A 2021 5
B 2019 4
B 2020 2
B 2021 1
C 2020 5
C 2021 4
"""
# Split on any whitespace so the sample parses whether it uses tabs or spaces
df = pd.read_csv(StringIO(s), sep=r'\s+')
# Number of distinct years per id, compared with the number of distinct years overall
mask = df.groupby('id')['year'].transform('nunique').eq(df['year'].nunique())
out = df[mask]
>>> print(out)
id year gdp
0 A 2019 3
1 A 2020 0
2 A 2021 5
3 B 2019 4
4 B 2020 2
5 B 2021 1
Here is a way using pivot and dropna to automatically find ids with missing values:
keep = df.pivot(index='id', columns='year', values='gdp').dropna().index
# ['A', 'B']
out = df[df['id'].isin(keep)]
output:
id year gdp
0 A 2019 3
1 A 2020 0
2 A 2021 5
3 B 2019 4
4 B 2020 2
5 B 2021 1

pivot dataframe using columns and values

I have a data frame like this:
Year Month Date X Y
2015 5 1 0.21120733 0.17662421
2015 5 2 0.36878636 0.14629167
2015 5 3 0.27969632 0.37910569
2016 5 1 -1.2968733 8.29E-02
2016 5 2 -1.1575716 -0.20657887
2016 5 3 -1.0049003 -0.39670503
2017 5 1 -1.5630698 1.1710221
2017 5 2 -1.70889 0.93349206
2017 5 3 -1.8548334 0.86701781
2018 5 1 -7.94E-02 0.3962194
2018 5 2 -2.91E-02 0.39321879
I want to make it like
2015 2016 2017 2018
0.21120733 -1.2968733 -1.5630698 -7.94E-02
0.36878636 -1.1575716 -1.70889 -2.91E-02
0.27969632 -1.0049003 -1.8548334 NA
I tried using df.pivot(columns='Year',values='X'), but the result is not what I expected.
Try passing index in pivot():
out=df.pivot(columns='Year',values='X',index='Date')
#If needed use:
out=out.rename_axis(index=None,columns=None)
OR
Try via agg() and dropna():
out=df.pivot(columns='Year',values='X').agg(sorted,key=pd.isnull).dropna(how='all')
#If needed use:
out.columns.names=[None]
output of out:
2015 2016 2017 2018
0 0.211207 -1.296873 -1.563070 -0.0794
1 0.368786 -1.157572 -1.708890 -0.0291
2 0.279696 -1.004900 -1.854833 NaN
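If the rows should be aligned by their position within each year rather than by the Date value, one possible sketch is to number the rows per year with groupby(...).cumcount() and pivot on that counter. The helper column name row is an assumption made for this example; only the X column from the question is reproduced.

```python
import pandas as pd

df = pd.DataFrame({
    "Year": [2015] * 3 + [2016] * 3 + [2017] * 3 + [2018] * 2,
    "Date": [1, 2, 3] * 3 + [1, 2],
    "X": [0.21120733, 0.36878636, 0.27969632,
          -1.2968733, -1.1575716, -1.0049003,
          -1.5630698, -1.70889, -1.8548334,
          -7.94e-02, -2.91e-02],
})

# Number the rows inside each year: 0, 1, 2, ...
df["row"] = df.groupby("Year").cumcount()

# Each (row, Year) pair is now unique, so pivot has a well-defined index.
out = df.pivot(index="row", columns="Year", values="X")
print(out)
```

Years with fewer rows (2018 here) end up with NaN in the trailing positions.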

Grabbing data from previous year in a Pandas DataFrame

I've got this df:
d={'year':[2019,2018,2017],'B':[10,5,17]}
df=pd.DataFrame(data=d)
print(df)
year B
0 2019 10
1 2018 5
2 2017 17
I want to create a column "B_previous_year" that grabs B data from the previous year, in a way it looks like this:
year B B_previous_year
0 2019 10 5
1 2018 5 17
2 2017 17 NaN
I'm trying this:
df['B_previous_year']=df.B.loc[df.year == (df.year - 1)]
However my B_previous_year is getting full of NaN
year B B_previous_year
0 2019 10 NaN
1 2018 5 NaN
2 2017 17 NaN
How could I do that?
If you want to keep the values as integers:
df = df.convert_dtypes()
df['New'] = df.B.shift(-1)
df
Output:
year B New
0 2019 10 5
1 2018 5 17
2 2017 17 <NA>
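You might want to sort the dataframe by year first, shift B down by one row, and keep the shifted value only where the difference from one row to the next is, indeed, one year:
df = df.sort_values(by='year')
# shift() takes B from the preceding row; where() blanks it out if that row is not exactly one year earlier
df['B_previous_year'] = df['B'].shift().where(df['year'].diff() == 1)
year B B_previous_year
2 2017 17 NaN
1 2018 5 17.0
0 2019 10 5.0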
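The attempt in the question returns only NaN because df.year == (df.year - 1) compares each row with itself, which is never true. Another option that does not depend on row order is to look up B by year. The sketch below assumes each year appears only once in the dataframe; b_by_year is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({"year": [2019, 2018, 2017], "B": [10, 5, 17]})

# Series of B indexed by year, used as a lookup table.
# Assumption: every year occurs exactly once.
b_by_year = df.set_index("year")["B"]

# For each row, look up B at (year - 1); a missing previous year gives NaN.
df["B_previous_year"] = (df["year"] - 1).map(b_by_year)
print(df)
```

Because the lookup is keyed on the year itself, gaps in the sequence (for example a missing 2018) produce NaN instead of silently picking up the wrong row.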

Loop Optimization in python

I have a dataframe df like this
Product Yr Value
A 2014 1
A 2015 3
A 2016 2
B 2015 2
B 2016 1
I want to compute the cumulative maximum of Value within each product, i.e.
Product Yr Value
A 2014 1
A 2015 3
A 2016 3
B 2015 2
B 2016 2
My actual data has about 50000 products
I am writing a code like:
df2=pd.DataFrame()
for i in (df['Product'].unique()):
    data3=df[df['Product']==i]
    data3.sort_values(by=['Yr'])
    data3['Value']=data3['Value'].cummax()
    df2=df2.append(data3)
#df2 is my result
This code takes a very long time (about 3 days) for roughly 50000 products and 10 years. Is there some way to speed it up?
You can use groupby.cummax instead:
df['Value'] = df.sort_values('Yr').groupby('Product').Value.cummax()
df
#Product Yr Value
#0 A 2014 1
#1 A 2015 3
#2 A 2016 3
#3 B 2015 2
#4 B 2016 2
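As a self-contained sketch of the same idea, the example below starts from deliberately unsorted rows, sorts once by product and year, and then takes the running maximum per product in a single vectorized call. The sample data is the question's table in shuffled order.

```python
import pandas as pd

# The question's rows, shuffled to show that sorting matters.
df = pd.DataFrame({
    "Product": ["A", "B", "A", "A", "B"],
    "Yr": [2016, 2016, 2014, 2015, 2015],
    "Value": [2, 1, 1, 3, 2],
})

# Sort once, then compute the cumulative maximum within each product.
df = df.sort_values(["Product", "Yr"]).reset_index(drop=True)
df["Value"] = df.groupby("Product")["Value"].cummax()
print(df)
```

This avoids the per-product Python loop and the repeated appends, both of which grow expensive as the number of products increases.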

Get first index from multiindex grouping by level

I have a multiindex pandas dataframe:
SHOPPING_COUNT
CLIENT YEAR MONTH
1000063 2013 12 9
2014 1 9
2 7
3 9
2015 4 6
5 5
6 9
1001327 2014 5 1
6 1
2015 2 7
3 1
4 3
1001399 2013 8 1
I would like to know the first index entry of each client, ordering by level 0.
I mean, I would want to get:
1000063 2013 12
1001327 2014 5
1001399 2013 8
Let df be your dataframe, you can do something like:
df = df.groupby(level=0).apply(lambda x: x.iloc[0:1])
df.index = df.index.droplevel(0)
There may well be a simpler way to do this, but I think this method works.
It's not very programmatic, but if you look at the result of:
client = 1000063
df.loc[client].index
Then the following would work:
# MultiIndex.labels was renamed to MultiIndex.codes in pandas 0.24
year = df.loc[client].index.levels[0][df.loc[client].index.codes[0][0]]
month = df.loc[client].index.levels[1][df.loc[client].index.codes[1][0]]
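A shorter route that may be worth trying is to sort the index and take the first row of each client with groupby(level=...).head(1). The sketch below rebuilds a reduced version of the question's frame (only a few rows per client), so the data and the name first_keys are illustrative.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [(1000063, 2013, 12), (1000063, 2014, 1), (1000063, 2014, 2),
     (1001327, 2014, 5), (1001327, 2014, 6), (1001327, 2015, 2),
     (1001399, 2013, 8)],
    names=["CLIENT", "YEAR", "MONTH"],
)
df = pd.DataFrame({"SHOPPING_COUNT": [9, 9, 7, 1, 1, 7, 1]}, index=index)

# Sort by (CLIENT, YEAR, MONTH), then keep the first row of each client.
first = df.sort_index().groupby(level="CLIENT").head(1)

# The index of the result holds the (CLIENT, YEAR, MONTH) tuples.
first_keys = first.index.tolist()
print(first_keys)
```

head(1) keeps the original MultiIndex on the selected rows, so no droplevel step is needed afterwards.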
