Looking to merge/concatenate/groupby different rows in Pandas dataframe - python

I will be iterating through a large list of dataframes of baseball statistics for different players. The data is indexed by year. What I am looking to do is group by year, summing WAR while keeping Salary the same. Also, I am looking to drop rows that are not single years; in my data set these entries are strings.
To group:
for x in clean_stats_list:
    x.groupby("Year")
To eliminate rows:
for x in clean_stats_list:
    for i in x['Year']:
        if len(i) > 4:
            x['Year'][i].drop()
WAR Year Salary
0 1.4 2008 $390,000
1 0.9 2009 $418,000
2 2.4 2010 $445,000
3 3.6 2011 $3,400,000
4 5.2 2012 $5,400,000
5 1.3 2013 $7,400,000
6 6.8 2014 $10,000,000
7 3.8 2015 $10,000,000
9 0.2 2015 $10,000,000
11 5.5 2016 $15,833,333
12 2.0 2017 $21,833,333
13 1.3 2018 $21,833,333
14 34.3 11 Seasons $96,952,999
16 25.4 CIN (8 yrs) $37,453,000
17 8.8 SFG (3 yrs) $59,499,999
This is what I am expecting to achieve:
WAR Year Salary
0 1.4 2008 $390,000
1 0.9 2009 $418,000
2 2.4 2010 $445,000
3 3.6 2011 $3,400,000
4 5.2 2012 $5,400,000
5 1.3 2013 $7,400,000
6 6.8 2014 $10,000,000
7 4.0 2015 $10,000,000
11 5.5 2016 $15,833,333
12 2.0 2017 $21,833,333
13 1.3 2018 $21,833,333

To filter based on the length of the Year column, why don't you try creating a boolean mask and then selecting with it?
Code:
mask_df = your_df['Year'].str.len() == 4
your_df_cleaned = your_df.loc[mask_df]

You can use a regex to validate the years, to avoid keeping values that have length 4 but are not years, with Series.str.contains and boolean indexing:
# https://stackoverflow.com/a/4374209
# validate years between 1000-2999
df1 = df[df['Year'].str.contains(r'^[12][0-9]{3}$')]
# validate between 0000-9999
# df1 = df[df['Year'].str.contains(r'^\d{4}$')]
print(df1)
WAR Year Salary
0 1.4 2008 $390,000
1 0.9 2009 $418,000
2 2.4 2010 $445,000
3 3.6 2011 $3,400,000
4 5.2 2012 $5,400,000
5 1.3 2013 $7,400,000
6 6.8 2014 $10,000,000
7 3.8 2015 $10,000,000
9 0.2 2015 $10,000,000
11 5.5 2016 $15,833,333
12 2.0 2017 $21,833,333
13 1.3 2018 $21,833,333
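The answers above cover the filtering; to also collapse the duplicate 2015 rows as in the expected output, you can follow the filter with a groupby aggregation. A minimal sketch, assuming Salary is identical within a year (out is an illustrative name):
# drop the non-year rows, then aggregate per year: sum WAR, keep the first Salary
df1 = df[df['Year'].str.contains(r'^\d{4}$')]
out = df1.groupby('Year', as_index=False).agg({'WAR': 'sum', 'Salary': 'first'})
This gives one row per year, with WAR 4.0 for 2015 (3.8 + 0.2), though the original index is not preserved.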

Pandas : Groupby sum values

I am using this data frame, from Excel:
I'd like to show the total sales per year.
Year Sales
2021 7
2018 6
2018 787
2018 935
2018 1 059
2018 5
2018 72
2018 2
2018 3
2019 218
2019 256
2020 2
2018 4
2021 8
2019 14
2020 3
2018 3
2018 1
2020 34
I'm using this :
df.groupby(['Year'])['Sales'].agg('sum')
And the result :
2018.0 67879351 05957223431
2019.0 21825614
2020.0 2334
2021.0 78
Do you know why I don't get the sum of the values?
Thanks
The 'Sales' column is of dtype object, so convert it to numeric:
df['Sales'] = pd.to_numeric(df['Sales'].replace(r"\s+", '', regex=True), errors='coerce')
# or: df['Sales'] = df['Sales'].replace(r"\s+", '', regex=True).astype(float)
Now calculate the sum:
out = df.groupby(['Year'])['Sales'].sum()
Output of out:
Year
2018 2877
2019 488
2020 39
2021 15
Name: Sales, dtype: int64

How to populate a pandas dataframe (dfA) column "A" with values from another dataframe (dfB), depending on column / row values from dfB?

I have a df (dfA) with the life expectancy at birth and GDP per year for 6 countries, with the following structure:
country year expectancy gdp difference
chile 2000 60 1bn NA
chile 2001 63 1.5bn 0.5bn
chile 2002 65 2.5bn 0.5bn
chile 2003 68 3.5bn 1.0bn
.
.
.
chile 2015 80 10bn 10bn
Each row represents the data (GDP, expectancy, etc.) for a country in a given year, spanning from 2000 to 2015 across the 6 countries.
I have created a new dataframe to store important overall variables per country, such as GDP delta (GDP in 2015 minus GDP in 2000) per country. The new df (dfB) looks like this:
country startEndDelta (dummydata)
Chile x
China y
Germany z
Mexico a
USA b
Zimbabwe c
What I want to do is add a new column to my new df (dfB) that shows which year had the greatest increase in GDP for each country.
I was already able to calculate the year, but only by first creating another dataframe with records from a single country, as mentioned before.
The way I wish to do this would be something similar to:
dfB['biggestDeltaYear'] = ?year with the biggest increase in GDP?
Where this single line of code populates every row in dfB for my new column 'biggestDeltaYear'.
What are my options?
Thank you very much
Maybe you can try the groupby() method of pandas.DataFrame:
dfA.groupby('country').apply(lambda x: x['year'].iloc[x['difference'].argmax()])
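This returns a Series indexed by country, so, assuming the country spellings in dfB match those in dfA (the samples above differ in capitalization), it can be assigned in one line with map. A sketch, where biggest is an illustrative name:
# Series mapping each country to the year with its largest GDP increase
biggest = dfA.groupby('country').apply(lambda x: x['year'].iloc[x['difference'].argmax()])
dfB['biggestDeltaYear'] = dfB['country'].map(biggest)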
Here's another option:
dfA['biggestDeltaYear'] = (dfA.iloc[dfA.groupby('country')['difference']
.apply(lambda x: x.argmax())]['year'])
You should be able to achieve this using groupby and apply/lambda operations in Pandas. Below is a worked example.
Consider the following data:
Country,Year,GDP
Chile,2011,1.5
Chile,2012,1
Chile,2013,2
Chile,2014,2.3
Chile,2015,3.2
Nigeria,2011,0.6
Nigeria,2012,0.9
Nigeria,2013,2.1
Nigeria,2014,2.2
Nigeria,2015,2.6
Australia,2011,10.4
Australia,2012,14.4
Australia,2013,12.3
Australia,2014,13.3
Australia,2015,15
First, we apply the diff operation country wise:
df['diff'] = df.groupby("Country")["GDP"].diff()
Country Year GDP diff
0 Chile 2011 1.5 NaN
1 Chile 2012 1.0 -0.5
2 Chile 2013 2.0 1.0
3 Chile 2014 2.3 0.3
4 Chile 2015 3.2 0.9
5 Nigeria 2011 0.6 NaN
6 Nigeria 2012 0.9 0.3
7 Nigeria 2013 2.1 1.2
8 Nigeria 2014 2.2 0.1
9 Nigeria 2015 2.6 0.4
10 Australia 2011 10.4 NaN
11 Australia 2012 14.4 4.0
12 Australia 2013 12.3 -2.1
13 Australia 2014 13.3 1.0
14 Australia 2015 15.0 1.7
Then we can generate a boolean column based on the largest value:
df['biggestDeltaYear'] = df.groupby("Country")['diff'].apply(lambda x:x==x.max())
Country Year GDP diff biggestDeltaYear
0 Chile 2011 1.5 NaN False
1 Chile 2012 1.0 -0.5 False
2 Chile 2013 2.0 1.0 True
3 Chile 2014 2.3 0.3 False
4 Chile 2015 3.2 0.9 False
5 Nigeria 2011 0.6 NaN False
6 Nigeria 2012 0.9 0.3 False
7 Nigeria 2013 2.1 1.2 True
8 Nigeria 2014 2.2 0.1 False
9 Nigeria 2015 2.6 0.4 False
10 Australia 2011 10.4 NaN False
11 Australia 2012 14.4 4.0 True
12 Australia 2013 12.3 -2.1 False
13 Australia 2014 13.3 1.0 False
14 Australia 2015 15.0 1.7 False
The actual year values can also be obtained, instead of booleans, using:
df['Year'][df.groupby("Country")['diff'].apply(lambda x:x==x.max())]
or,
df.iloc[df.groupby("Country")['diff'].apply(lambda x:x.idxmax())]['Year']
HTH.

How to apply the unique function with transform and keep the complete columns in the data frame (pandas)

My goal here is to get, for each PatientNumber and year, the count of unique months, while keeping all the columns in the data frame.
This is the original data frame:
PatientNumber QT Answer Answerdate year month dayofyear count formula
1 1 transferring No 2017-03-03 2017 3 62 2.0 (1/3)
2 1 preparing food No 2017-03-03 2017 3 62 2.0 (1/3)
3 1 medications Yes 2017-03-03 2017 3 62 1.0 (1/3)
4 2 transferring No 2006-10-05 2006 10 275 3.0 0
5 2 preparing food No 2006-10-05 2006 10 275 3.0 0
6 2 medications No 2006-10-05 2006 10 275 3.0 0
7 2 transferring Yes 2007-4-15 2007 4 105 2.0 2/3
8 2 preparing food Yes 2007-4-15 2007 4 105 2.0 2/3
9 2 medications No 2007-4-15 2007 4 105 1.0 2/3
10 2 transferring Yes 2007-12-15 2007 12 345 1.0 1/3
11 2 preparing food No 2007-12-15 2007 12 345 2.0 1/3
12 2 medications No 2007-12-15 2007 12 345 2.0 1/3
13 2 transferring Yes 2008-10-10 2008 10 280 1.0 (1/3)
14 2 preparing food No 2008-10-10 2008 10 280 2.0 (1/3)
15 2 medications No 2008-10-10 2008 10 280 2.0 (1/3)
16 3 medications No 2008-10-10 2008 12 280 …… ………..
So the desired output should be the same as this, with one more column that shows the count of unique [PatientNumber, year, month] rows: for PatientNumber=1 it shows 1; for PatientNumber=2 it shows 1 in year 2006 and 2 in year 2007.
I applied this code:
data=data.groupby(['Clinic Number','year'])["month"].nunique().reset_index(name='counts')
the output of this code look like:
Clinic Number year **counts**
0 494383 1999 1
1 494383 2000 2
2 494383 2001 1
3 494383 2002 1
4 494383 2003 1
The counts output is correct, except it does not keep the rest of the fields. I want the complete columns because later I have to do some calculations on them.
Then I tried this code:
data['counts'] = data.groupby(['Clinic Number','year','month'])['month'].transform('count')
Again, it's not good, because it does not show the correct count. The output of this code is like this:
Clinic Number Question Text Answer Text ... year month counts
1 3529933 bathing No ... 2011 1 10
2 3529933 dressing No ... 2011 1 10
3 3529933 feeding No ... 2011 1 10
4 3529933 housekeeping No ... 2011 1 10
5 3529933 medications No ... 2011 1 10
Here counts should be 1, because for that patient and that year there is just one month.
Use the following modification to your code:
df['counts'] = df.groupby(['PatientNumber','year'])["month"].transform('nunique')
transform returns a series of length equal to your original dataframe, so you can add this series to your dataframe as a column.
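A minimal sketch of the same idea on a toy frame built from the sample above (columns trimmed for brevity):
import pandas as pd

df = pd.DataFrame({
    'PatientNumber': [1, 1, 2, 2, 2, 2],
    'year': [2017, 2017, 2006, 2007, 2007, 2008],
    'month': [3, 3, 10, 4, 12, 10],
})
# every row gets the number of distinct months for its patient/year pair
df['counts'] = df.groupby(['PatientNumber', 'year'])['month'].transform('nunique')
Here every 2007 row of patient 2 gets counts 2 (months 4 and 12), while the 2006 and 2008 rows get 1.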

Find the difference between years for sales

I have data that I am analysing for sales. I made some progress, and this is the last part I did, which shows each store's total sales for each year (2016, 2017, 2018).
Store_Key Year count Total_Sales
0 5.0 2016 28 6150.0
1 5.0 2017 39 8350.0
2 5.0 2018 27 5150.0
3 7.0 2016 3664 105370.0
4 7.0 2017 3736 116334.0
5 7.0 2018 3863 99375.0
6 10.0 2016 3930 79904.0
7 10.0 2017 3981 91227.0
8 10.0 2018 4432 97226.0
9 11.0 2016 4084 91156.0
10 11.0 2017 4220 99565.0
11 11.0 2018 4735 113584.0
12 16.0 2016 4257 135655.0
13 16.0 2017 4422 144725.0
14 16.0 2018 4630 133820.0
I want to see each store's sales difference between years, so I used a pivot table to show each year as a column.
Store_Key 2016 2017 2018
5.0 6150.0 8350.0 5150.0
7.0 105370.0 116334.0 99375.0
10.0 79904.0 91227.0 97226.0
11.0 91156.0 99565.0 113584.0
16.0 135655.0 144725.0 133820.0
18.0 237809.0 245645.0 88167.0
20.0 110225.0 131999.0 83302.0
24.0 94087.0 101062.0 108888.0
If the set of stores were constant, I could quickly take the difference between the columns, but unfortunately every year many new stores open and many shut down.
So my question is: is there any way to get the year-over-year difference while also showing the new stores and the closed stores?
I can find stores with NULL values and separate them, but I would love to know if there are better options.
To get the difference between 2017 and 2016, you can do:
df['evolution'] = df['2017'] - df['2016']
If you would like to drop lines where there is at least one NaN value, you can remove them like this:
df = df.dropna(axis=0, how='any')
If you have 0 instead of NaN, you can first replace the zeros:
import numpy as np
df = df.replace(0, np.nan)
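More generally, since missing years show up as NaN in the pivoted table, you can compute all year-over-year differences at once and flag openings and closures from the NaN pattern. A sketch, assuming the pivoted frame above is rebuilt as pivot (an illustrative name):
# one row per store, one column per year, as in the question
pivot = df.pivot_table(index='Store_Key', columns='Year', values='Total_Sales')

# year-over-year differences across the year columns
deltas = pivot.diff(axis=1)

# NaN followed by a value suggests a store that opened that year;
# a value followed by NaN suggests one that closed
# (ignore the first year column, which has no predecessor)
opened = pivot.notna() & pivot.shift(1, axis=1).isna()
closed = pivot.isna() & pivot.shift(1, axis=1).notna()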

How to add a column with the growth rate in a budget table in Pandas?

I would like to know how I can add a year-to-year growth rate to the following data in Pandas.
Date Total Managed Expenditure
0 2001 503.2
1 2002 529.9
2 2003 559.8
3 2004 593.2
4 2005 629.5
5 2006 652.1
6 2007 664.3
7 2008 688.2
8 2009 732.0
9 2010 759.2
10 2011 769.2
11 2012 759.8
12 2013 760.6
13 2014 753.3
14 2015 757.6
15 2016 753.9
Use Series.pct_change():
df['Total Managed Expenditure'].pct_change()
Out:
0 NaN
1 0.053060
2 0.056426
3 0.059664
4 0.061194
5 0.035902
6 0.018709
7 0.035978
8 0.063644
9 0.037158
10 0.013172
11 -0.012220
12 0.001053
13 -0.009598
14 0.005708
15 -0.004884
Name: Total Managed Expenditure, dtype: float64
To assign it back:
df['Growth Rate'] = df['Total Managed Expenditure'].pct_change()
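For reference, pct_change() is just each value divided by the previous one, minus 1, so the same column can be computed by hand (tme is just a shorthand name):
# equivalent manual computation
tme = df['Total Managed Expenditure']
df['Growth Rate'] = tme.div(tme.shift(1)).sub(1)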
