Summing up rows in a DataFrame while maintaining a similar DataFrame structure - python

I have the following DataFrame:
Stint Year ID Data1 Data2 Team
1 2010 A 10 1 SFN
1 2011 A 10 1 SFN
1 2013 A 10 1 SFN
2 2013 A 10 1 ATL
1 1922 B 10 1 ARI
1 1923 B 10 1 ARI
1 1924 B 10 1 ARI
I'm trying to return a new DataFrame which sums up values in the Data1 and Data2 columns for identical years. I would like the DataFrame above to ultimately look like this:
Year ID Data1 Data2
2010 A 10 1
2011 A 10 1
2013 A 20 2
1922 B 10 1
1923 B 10 1
1924 B 10 1
I've messed around with some groupby functions, but I'm having trouble getting the proper DataFrame structure.
Thanks!

groupby with as_index=False
Passing as_index=False keeps the grouped columns as regular columns instead of moving them into a new index:
df.groupby(['Year', 'ID'], as_index=False)[['Data1', 'Data2']].sum()
Year ID Data1 Data2
0 1922 B 10 1
1 1923 B 10 1
2 1924 B 10 1
3 2010 A 10 1
4 2011 A 10 1
5 2013 A 20 2

groupby with sort=False
If you also want to keep the rows in their original Year order, i.e. [2010, 2011, 2013, 1922, 1923, 1924], pass sort=False as well:
df.groupby(['Year', 'ID'], as_index=False, sort=False)[['Data1', 'Data2']].sum()
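As a self-contained check, here is a minimal sketch that reconstructs the question's data and runs the grouped sum (the DataFrame construction is an assumption based on the table above):
import pandas as pd

# Rebuild the question's DataFrame from the table shown above
df = pd.DataFrame({
    'Stint': [1, 1, 1, 2, 1, 1, 1],
    'Year': [2010, 2011, 2013, 2013, 1922, 1923, 1924],
    'ID': ['A', 'A', 'A', 'A', 'B', 'B', 'B'],
    'Data1': [10, 10, 10, 10, 10, 10, 10],
    'Data2': [1, 1, 1, 1, 1, 1, 1],
    'Team': ['SFN', 'SFN', 'SFN', 'ATL', 'ARI', 'ARI', 'ARI'],
})

# sort=False preserves the order in which each (Year, ID) group first appears
print(df.groupby(['Year', 'ID'], as_index=False, sort=False)[['Data1', 'Data2']].sum())
   Year ID  Data1  Data2
0  2010  A     10      1
1  2011  A     10      1
2  2013  A     20      2
3  1922  B     10      1
4  1923  B     10      1
5  1924  B     10      1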

Related

Filter individuals that don't have data for the whole period

I am using Python 3.9 on PyCharm. I have the following dataframe:
id year gdp
0 A 2019 3
1 A 2020 0
2 A 2021 5
3 B 2019 4
4 B 2020 2
5 B 2021 1
6 C 2020 5
7 C 2021 4
I want to keep the individuals that have data available for the whole period. In other words, I would like to filter the rows so that I only keep the ids that have data for all three years (2019, 2020, 2021). This means excluding all observations of id C and keeping all observations of ids A and B:
id year gdp
0 A 2019 3
1 A 2020 0
2 A 2021 5
3 B 2019 4
4 B 2020 2
5 B 2021 1
Is it feasible in Python?
Since you want to include only the ids for which all three years exist, you can group the dataframe by id and then filter on set equality between the years you want and the years available for each particular id:
>>> years = {2019, 2020, 2021}
>>> df.groupby('id').filter(lambda x: set(x['year'].values)==years)
# df is your dataframe
id year gdp
0 A 2019 3
1 A 2020 0
2 A 2021 5
3 B 2019 4
4 B 2020 2
5 B 2021 1
Alternatively, you can build a boolean mask: use pandas.DataFrame.groupby with transform to count the distinct years available for each id and compare that to the number of distinct years in the whole column.
from io import StringIO
import pandas as pd
s = """id year gdp
A 2019 3
A 2020 0
A 2021 5
B 2019 4
B 2020 2
B 2021 1
C 2020 5
C 2021 4
"""
df = pd.read_csv(StringIO(s), sep=r'\s+')  # the sample above is whitespace-separated
# Keep ids whose number of distinct years equals the number of distinct years overall
mask = df.groupby('id')['year'].transform('nunique').eq(df['year'].nunique())
out = df[mask]
>>> print(out)
id year gdp
0 A 2019 3
1 A 2020 0
2 A 2021 5
3 B 2019 4
4 B 2020 2
5 B 2021 1
Here is a way using pivot and dropna to automatically find the ids with missing values (pandas 2.0+ requires the keyword arguments):
keep = df.pivot(index='id', columns='year', values='gdp').dropna().index
# ['A', 'B']
out = df[df['id'].isin(keep)]
output:
id year gdp
0 A 2019 3
1 A 2020 0
2 A 2021 5
3 B 2019 4
4 B 2020 2
5 B 2021 1

Why do I get index value inside column value when I do pandas groupby?

I have a data frame as follows:
import pandas as pd
df = pd.DataFrame()
df['year'] = ['2015','2015','2016','2016','2017','2017']
df['months'] = [2,4,4,2,3,5]
df['perc'] = ['25','35','55','75','34','38']
Which results in dataframe:
year months perc
0 2015 2 25
1 2015 4 35
2 2016 4 55
3 2016 2 75
4 2017 3 34
5 2017 5 38
When I do a pandas groupby on the year column, the resulting DataFrame carries the index of the original DF alongside a new group index, with the year still inside the year column.
Command used:
result = pd.DataFrame(df.groupby('year',as_index=False).apply(lambda x: x.nlargest(1, ['months'])))
The result has a MultiIndex containing the original DF's index (the 1, 2, 5) next to each year value:
year months perc
0 1 2015 4 35
1 2 2016 4 55
2 5 2017 5 38
print(result['year']) gives:
0 1 2015
1 2 2016
2 5 2017
Name: year, dtype: object
Why do I get the index of the original dataframe inside the year column, and how do I remove it?
I am not sure what your expected output is, but pass group_keys=False to keep only the original index:
(df.groupby('year', as_index=False, group_keys=False)
   .apply(lambda x: x.nlargest(1, ['months'])))
output:
year months perc
1 2015 4 35
2 2016 4 55
5 2017 5 38
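If the goal is just the row with the largest months per year, a sketch of an alternative (not from the original answer) avoids apply entirely by selecting rows via the per-group idxmax:
# Index label of the max 'months' within each year, then select those rows
result = df.loc[df.groupby('year')['months'].idxmax()]
This keeps the original index labels (1, 2, 5), matching the output above.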

Grabbing data from previous year in a Pandas DataFrame

I've got this df:
d={'year':[2019,2018,2017],'B':[10,5,17]}
df=pd.DataFrame(data=d)
print(df):
year B
0 2019 10
1 2018 5
2 2017 17
I want to create a column "B_previous_year" that grabs the B data from the previous year, so that it looks like this:
year B B_previous_year
0 2019 10 5
1 2018 5 17
2 2017 17 NaN
I'm trying this:
df['B_previous_year']=df.B.loc[df.year == (df.year - 1)]
However, my B_previous_year column is full of NaN:
year B B_previous_year
0 2019 10 NaN
1 2018 5 NaN
2 2017 17 NaN
How could I do that?
If you want to keep the column in a nullable integer format (note that shift(-1) relies on the rows already being ordered by descending year):
df = df.convert_dtypes()
df['New'] = df.B.shift(-1)
df
Output:
year B New
0 2019 10 5
1 2018 5 17
2 2017 17 <NA>
You might want to sort the dataframe by year first, then take the previous row's B only where the gap to the previous row is exactly one year:
df = df.sort_values(by='year')
df['B_previous_year'] = df['B'].shift(1).where(df['year'].diff() == 1)
   year   B  B_previous_year
2  2017  17              NaN
1  2018   5             17.0
0  2019  10              5.0
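A merge-based sketch (an alternative, not from the original answers) that also handles gaps in the years: line each row up with a copy of the frame whose year is shifted forward by one:
import pandas as pd

d = {'year': [2019, 2018, 2017], 'B': [10, 5, 17]}
df = pd.DataFrame(data=d)

# Shift each year forward by one so that last year's B lands on this year's row
prev = df.assign(year=df['year'] + 1).rename(columns={'B': 'B_previous_year'})
print(df.merge(prev, on='year', how='left'))
   year   B  B_previous_year
0  2019  10              5.0
1  2018   5             17.0
2  2017  17              NaN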

Merge 2 dataframes on Days and Month

I have the following dataframes:
print(df1)
day month quantity Operation_type
21 6 6 2
24 6 4 2
...
print(df2)
day month quantity Operation_type
22 6 10 1
23 6 15 1
...
I would like to get the following dataset:
print(final_df)
day month quantity Operation_type
21 6 6 2
22 6 10 1
23 6 15 1
24 6 4 2
...
I tried using:
final_df = pd.merge(df1, df2, on=['day', 'month'])
but it creates a huge dataset and does not seem to work properly.
Furthermore, when day and month are equal, I would like the row with Operation_type == 2 to come before the one with Operation_type == 1.
How can I solve this problem?
To combine the DataFrames into one, you don't want merge, you want pd.concat. To get the ordering right, use DataFrame.sort_values:
pd.concat([df1, df2]).sort_values(by=['day', 'month', 'Operation_type'],
                                  ascending=[True, True, False])
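A minimal sketch using just the four rows shown above (the '...' rows are omitted, so this is illustrative only):
import pandas as pd

cols = ['day', 'month', 'quantity', 'Operation_type']
df1 = pd.DataFrame([[21, 6, 6, 2], [24, 6, 4, 2]], columns=cols)
df2 = pd.DataFrame([[22, 6, 10, 1], [23, 6, 15, 1]], columns=cols)

# Stack the frames, then sort; Operation_type descending breaks day/month ties
final_df = (pd.concat([df1, df2], ignore_index=True)
              .sort_values(by=['day', 'month', 'Operation_type'],
                           ascending=[True, True, False])
              .reset_index(drop=True))
print(final_df)
   day  month  quantity  Operation_type
0   21      6         6               2
1   22      6        10               1
2   23      6        15               1
3   24      6         4               2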
Alternatively, an outer merge also works here: since the two frames share all four columns, merging on all of them yields the union of the rows:
res = pd.merge(df1, df2, how='outer').sort_values('day')
# day month quantity Operation_type
# 0 21 6 6 2
# 2 22 6 10 1
# 3 23 6 15 1
# 1 24 6 4 2

How to extract a percentage column from a periodic column and the sum of the column?

I have a matrix that looks like this in as pandas.DataFrame:
Store Sales year month day
0 1 5263 2015 7 31
1 1 5020 2015 7 30
2 1 4782 2015 7 29
3 2 5011 2015 7 28
4 2 6102 2015 7 27
[986159 rows x 5 columns]
After I do some transformation I get the total sales sum for each shop:
train['StoreTotalSales'] = train.groupby('Store')['Sales'].transform('sum')
But now I need to take each row of train.groupby(['Store', 'day', 'month']) and divide its Sales figure by the StoreTotalSales.
I've tried the following:
train['PercentSales'] = train.groupby(['Store','day', 'month'])['Sales'].transform(lambda x: x /float(x.sum()))
But it returns all 1s for the new PercentSales column:
Store Sales year month day StoreTotalSales PercentSales
0 1 5263 2015 7 31 26178 1
1 1 5020 2015 7 30 26178 1
2 1 4782 2015 7 29 26178 1
3 2 5011 2015 7 28 12357 1
4 2 6102 2015 7 27 12357 1
But the PercentSales column should have been:
0 5263/26178
1 5020/26178
2 4782/26178
3 5011/12357
4 6102/12357
Why the complication of another groupby? Grouping by ['Store', 'day', 'month'] makes each group so narrow that in your data every group is evidently a single row, so x / x.sum() is always 1. If all you want is to divide the column by the per-store sum, you can simply do:
train['PercentSales'] = train.groupby('Store')['Sales'].transform(lambda x: x/x.sum())
Or equivalently, following your method:
train['StoreTotalSales'] = train.groupby('Store')['Sales'].transform('sum')
train['PercentSales'] = train['Sales']/train['StoreTotalSales']
Let me know if you run into additional problems.
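As a runnable sketch on the five sample rows only (the totals will differ from the question's 26178 and 12357, which come from the full 986159-row dataset):
import pandas as pd

train = pd.DataFrame({
    'Store': [1, 1, 1, 2, 2],
    'Sales': [5263, 5020, 4782, 5011, 6102],
    'year': [2015, 2015, 2015, 2015, 2015],
    'month': [7, 7, 7, 7, 7],
    'day': [31, 30, 29, 28, 27],
})

# Each row's share of its store's total sales; the shares sum to 1 per store
train['PercentSales'] = train.groupby('Store')['Sales'].transform(lambda x: x / x.sum())
print(train)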
