I have this function written in Python, and I want it to show the max value only once.
Here's the code:
def show_data():
    df = pd.DataFrame(myresult, columns=['Year', 'Production (Ton)'])
    df['Max Prod'] = df['Production (Ton)'].max()
    print(df)
And of course the output is this
Year Production (Ton) Max Prod
0 2010 339491 366999
1 2011 366999 366999
2 2012 361986 366999
3 2013 329461 366999
4 2014 355464 366999
5 2015 344998 366999
6 2016 274317 366999
7 2017 200916 366999
8 2018 217246 366999
9 2019 119830 366999
10 2020 66640 366999
Since every row has the same value, I want the output to look like this:
Year Production (Ton) Max Prod
0 2010 339491 366999
1 2011 366999
2 2012 361986
3 2013 329461
4 2014 355464
5 2015 344998
6 2016 274317
7 2017 200916
8 2018 217246
9 2019 119830
10 2020 66640
What should I change or add to my code?
You can use shift to generate a mask that can be used to replace duplicate consecutive values:
df.loc[df['Max Prod'] == df['Max Prod'].shift(1), 'Max Prod'] = ''
Output:
>>> df
Year Production (Ton) Max Prod
0 2010 339491 366999
1 2011 366999
2 2012 361986
3 2013 329461
4 2014 355464
5 2015 344998
6 2016 274317
7 2017 200916
8 2018 217246
9 2019 119830
10 2020 66640
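Putting it together, a minimal runnable sketch (with a few hypothetical rows standing in for `myresult`):

```python
import pandas as pd

# Hypothetical stand-in for the question's `myresult`
myresult = [(2010, 339491), (2011, 366999), (2012, 361986)]

df = pd.DataFrame(myresult, columns=['Year', 'Production (Ton)'])
df['Max Prod'] = df['Production (Ton)'].max()

# Allow mixed int/str values, then blank out consecutive duplicates
df['Max Prod'] = df['Max Prod'].astype(object)
df.loc[df['Max Prod'] == df['Max Prod'].shift(1), 'Max Prod'] = ''
print(df)
```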
You could also have the function as:
def show_data():
    df = pd.DataFrame(myresult, columns=['Year', 'Production (Ton)'])
    df['Max Prod'] = ''
    df.iloc[0, -1] = df['Production (Ton)'].max()
    print(df)
Given what you have now:
def show_data():
    df = pd.DataFrame(myresult, columns=['Year', 'Production (Ton)'])
    df['Max Prod'] = df['Production (Ton)'].max()
    df['Max Prod'] = df['Max Prod'].drop_duplicates()
    df = df.fillna('')
    print(df)
Output:
Year Production (Ton) Max Prod
0 2010 339491 366999
1 2011 366999
2 2012 361986
3 2013 329461
4 2014 355464
5 2015 344998
6 2016 274317
7 2017 200916
8 2018 217246
9 2019 119830
10 2020 66640
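The same idea as a runnable sketch, again with hypothetical sample rows in place of `myresult`:

```python
import pandas as pd

# Hypothetical stand-in for the question's `myresult`
myresult = [(2010, 339491), (2011, 366999), (2012, 361986)]

df = pd.DataFrame(myresult, columns=['Year', 'Production (Ton)'])
df['Max Prod'] = df['Production (Ton)'].max()
# drop_duplicates keeps only the first occurrence; the dropped rows
# become NaN when assigned back, and fillna blanks them out
df['Max Prod'] = df['Max Prod'].drop_duplicates()
df['Max Prod'] = df['Max Prod'].fillna('')
print(df)
```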
I have a dataset that is formatted like this:
index string
1 2008
1 2009
1 2010
2
2
2
3 2008
3 2009
3 2010
4 2008
4 2009
4 2010
5
5
5
I would like to fill in the missing data with the same sequence like this:
index string
1 2008
1 2009
1 2010
2 <-2008
2 <-2009
2 <-2010
3 2008
3 2009
3 2010
4 2008
4 2009
4 2010
5 <-2008
5 <-2009
5 <-2010
So the final result looks like this:
index string
1 2008
1 2009
1 2010
2 2008
2 2009
2 2010
3 2008
3 2009
3 2010
4 2008
4 2009
4 2010
5 2008
5 2009
5 2010
I am currently doing this in excel and it is an impossible task because of the number of rows that need to be filled.
I tried using fillna(method='ffill', limit=2, inplace=True), but this only fills with the value from the previous cell. Any help is appreciated.
You can try:
l = [2008, 2009, 2010]
# is the row NaN?
m = df['string'].isna()
# update with 2008, 2009, etc. in a defined order
df.loc[m, 'string'] = (df.groupby('index').cumcount()
.map(dict(enumerate(l)))
)
# convert dtype if needed
df['string'] = df['string'].convert_dtypes()
Alternative just defining a start year:
start = 2008
m = df['string'].isna()
df.loc[m, 'string'] = df.groupby('index').cumcount().add(start)
df['string'] = df['string'].convert_dtypes()
Output:
index string
0 1 2008
1 1 2009
2 1 2010
3 2 2008
4 2 2009
5 2 2010
6 3 2008
7 3 2009
8 3 2010
9 4 2008
10 4 2009
11 4 2010
12 5 2008
13 5 2009
14 5 2010
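The start-year variant is easy to check with a small self-contained sample (a hypothetical reconstruction of the question's data):

```python
import pandas as pd

# Hypothetical reconstruction: group 2 is entirely missing
df = pd.DataFrame({'index': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'string': [2008, 2009, 2010, None, None, None,
                              2008, 2009, 2010]})

start = 2008
m = df['string'].isna()
# cumcount restarts at 0 within each group, so adding `start`
# regenerates the 2008, 2009, 2010 sequence for the NaN rows
df.loc[m, 'string'] = df.groupby('index').cumcount().add(start)
df['string'] = df['string'].convert_dtypes()
print(df)
```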
Try this:
# find the NaN rows
m = df['string'].isna()
# m.sum() counts the NaNs; tile [2008, 2009, 2010] to cover them
# -> [2008, 2009, 2010, 2008, 2009, 2010, ...]
df.loc[m, 'string'] = [2008, 2009, 2010] * (m.sum() // 3)
Output:
string
index
1 2008
1 2009
1 2010
2 2008
2 2009
2 2010
3 2008
3 2009
3 2010
4 2008
4 2009
4 2010
5 2008
5 2009
5 2010
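A self-contained sketch of this tiling approach, with a hypothetical two-group sample in place of the real data:

```python
import pandas as pd

# Hypothetical reconstruction: one complete group, one all-NaN group
df = pd.DataFrame({'index': [1, 1, 1, 2, 2, 2],
                   'string': [2008, 2009, 2010, None, None, None]})

m = df['string'].isna()
# Repeat the three-year cycle once per block of three NaNs
df.loc[m, 'string'] = [2008, 2009, 2010] * (m.sum() // 3)
df['string'] = df['string'].astype(int)
print(df)
```

Note this relies on the NaN runs always being a multiple of three and aligned with the cycle; the groupby/cumcount answer above does not.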
I have dataframe in the following format:
course_id year month student_id
'Design' 2016 1 a123
'Design' 2016 1 a124
'Design' 2016 2 a125
'Design' 2016 3 a126
'Marketing' 2016 1 b123
'Marketing' 2016 2 b124
'Marketing' 2016 3 b125
'Marketing' 2016 3 b126
'Marketing' 2016 3 b127
'Marketing' 2016 4 b128
How can I calculate the number of new students for every course in every month, i.e. produce a table in the following format:
Year Month 'Design' 'Marketing'
2016 1 2 1
2016 2 1 1
2016 3 1 3
2016 4 0 1
You can use the pivot_table function:
df.pivot_table(index=['year', 'month'], columns='course_id',
               values='student_id', aggfunc=len).fillna(0).reset_index()
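A runnable sketch with a hypothetical sample mirroring the question's dataframe:

```python
import pandas as pd

# Hypothetical sample: two courses over two months
df = pd.DataFrame({'course_id': ['Design', 'Design', 'Marketing',
                                 'Marketing', 'Marketing'],
                   'year': [2016] * 5,
                   'month': [1, 1, 1, 2, 2],
                   'student_id': ['a123', 'a124', 'b123', 'b124', 'b125']})

# Count students per (year, month, course); missing combinations become 0
out = (df.pivot_table(index=['year', 'month'], columns='course_id',
                      values='student_id', aggfunc=len)
         .fillna(0).reset_index())
print(out)
```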
I want to merge or join two DataFrames on different date columns: join each Completed date with every earlier (or equal) Start date. I have the following dataframes:
df1:
Complted_date
2015
2017
2020
df2:
Start_date
2001
2010
2012
2015
2016
2017
2018
2019
2020
2021
And desired output is:
Complted_date Start_date
2015 2001
2015 2010
2015 2012
2015 2015
2017 2001
2017 2010
2017 2012
2017 2015
2017 2016
2017 2017
2020 2001
2020 2010
2020 2012
2020 2015
2020 2016
2020 2017
2020 2018
2020 2019
2020 2020
I've tried but I'm not getting the output I want.
Thank you for your help!!
Check out merge, which gives you the expected output:
(df1.assign(key=1)
.merge(df2.assign(key=1), on='key')
.query('Complted_date>=Start_date')
.drop('key', axis=1)
)
Output:
Complted_date Start_date
0 2015 2001
1 2015 2010
2 2015 2012
3 2015 2015
10 2017 2001
11 2017 2010
12 2017 2012
13 2017 2015
14 2017 2016
15 2017 2017
20 2020 2001
21 2020 2010
22 2020 2012
23 2020 2015
24 2020 2016
25 2020 2017
26 2020 2018
27 2020 2019
28 2020 2020
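On pandas 1.2+, the same cross join can be written without the dummy key using how='cross'; a sketch with a subset of the question's data:

```python
import pandas as pd

df1 = pd.DataFrame({'Complted_date': [2015, 2017, 2020]})
df2 = pd.DataFrame({'Start_date': [2001, 2010, 2015, 2021]})

out = (df1.merge(df2, how='cross')               # every pair of rows
          .query('Complted_date >= Start_date')  # keep earlier-or-equal starts
          .reset_index(drop=True))
print(out)
```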
However, you might want to check out merge_asof:
pd.merge_asof(df2, df1,
right_on='Complted_date',
left_on='Start_date',
direction='forward')
Output:
Start_date Complted_date
0 2001 2015.0
1 2010 2015.0
2 2012 2015.0
3 2015 2015.0
4 2016 2017.0
5 2017 2017.0
6 2018 2020.0
7 2019 2020.0
8 2020 2020.0
9 2021 NaN
You can do a cross join and pick the records where Complted_date >= Start_date.
Use df.merge with df.query:
In [101]: df1['tmp'] = 1
In [102]: df2['tmp'] = 1
In [107]: res = df1.merge(df2, how='outer').query("Complted_date >= Start_date").drop(columns='tmp')
In [108]: res
Out[108]:
Complted_date Start_date
0 2015 2001
1 2015 2010
2 2015 2012
3 2015 2015
10 2017 2001
11 2017 2010
12 2017 2012
13 2017 2015
14 2017 2016
15 2017 2017
20 2020 2001
21 2020 2010
22 2020 2012
23 2020 2015
24 2020 2016
25 2020 2017
26 2020 2018
27 2020 2019
28 2020 2020
Here is another way, using pd.Series() and explode():
df1['Start_date'] = pd.Series([df2['Start_date'].tolist()])
df1['Start_date'] = df1['Start_date'].ffill()
df1.explode('Start_date').loc[lambda x: x['Complted_date'].ge(x['Start_date'])].reset_index(drop=True)
You could use conditional_join from pyjanitor to get rows where compltd_date is >= start_date:
# pip install pyjanitor
import pandas as pd
import janitor
df1.conditional_join(df2, ('Complted_date', 'Start_date', '>='))
Out[1163]:
left right
Complted_date Start_date
0 2015 2001
1 2015 2010
2 2015 2012
3 2015 2015
4 2017 2001
5 2017 2010
6 2017 2012
7 2017 2015
8 2017 2016
9 2017 2017
10 2020 2001
11 2020 2010
12 2020 2012
13 2020 2015
14 2020 2016
15 2020 2017
16 2020 2018
17 2020 2019
18 2020 2020
Under the hood, it is just a binary search (searchsorted); the aim is to avoid a cartesian join and, hopefully, reduce memory usage.
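The binary-search idea can be sketched in plain pandas/NumPy, assuming Start_date is sorted (this mirrors the concept, not pyjanitor's actual internals):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Complted_date': [2015, 2017]})
df2 = pd.DataFrame({'Start_date': [2001, 2010, 2016, 2021]})  # assumed sorted

starts = df2['Start_date'].to_numpy()
# For each completed date, binary-search how many start dates are <= it
counts = np.searchsorted(starts, df1['Complted_date'].to_numpy(), side='right')

# Repeat each completed date once per matching start date
out = pd.DataFrame({
    'Complted_date': np.repeat(df1['Complted_date'].to_numpy(), counts),
    'Start_date': np.concatenate([starts[:c] for c in counts]),
})
print(out)
```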
I have a dataframe:
import pandas as pd
import numpy as np
ycap = [2015, 2016, 2017]
df = pd.DataFrame({'a': np.repeat(ycap, 5),
'b': np.random.randn(15)})
a b
0 2015 0.436967
1 2015 -0.539453
2 2015 -0.450282
3 2015 0.907723
4 2015 -2.279188
5 2016 1.468736
6 2016 -0.169522
7 2016 0.003501
8 2016 0.182321
9 2016 0.647310
10 2017 0.679443
11 2017 -0.154405
12 2017 -0.197271
13 2017 -0.153552
14 2017 0.518803
I would like to add column c, that would look like following:
a b c
0 2015 -0.826946 2014
1 2015 0.275072 2013
2 2015 0.735353 2012
3 2015 1.391345 2011
4 2015 0.389524 2010
5 2016 -0.944750 2015
6 2016 -1.192546 2014
7 2016 -0.247521 2013
8 2016 0.521094 2012
9 2016 0.273950 2011
10 2017 -1.199278 2016
11 2017 0.839705 2015
12 2017 0.075951 2014
13 2017 0.663696 2013
14 2017 0.398995 2012
I tried to achieve this using the following, but the 1 needs to increment within each group. How could I do it? Thanks.
gp = df.groupby('a')
df['c'] = gp['a'].apply(lambda x: x - 1)
Subtract a Series created by cumcount from column a, then subtract 1:
df['c'] = df['a'] - df.groupby('a').cumcount() - 1
print (df)
a b c
0 2015 0.285832 2014
1 2015 -0.223318 2013
2 2015 0.620920 2012
3 2015 -0.891164 2011
4 2015 -0.719840 2010
5 2016 -0.106774 2015
6 2016 -1.230357 2014
7 2016 0.747803 2013
8 2016 -0.002320 2012
9 2016 0.062715 2011
10 2017 0.805035 2016
11 2017 -0.385647 2015
12 2017 -0.457458 2014
13 2017 -1.589365 2013
14 2017 0.013825 2012
Detail:
print (df.groupby('a').cumcount())
0 0
1 1
2 2
3 3
4 4
5 0
6 1
7 2
8 3
9 4
10 0
11 1
12 2
13 3
14 4
dtype: int64
You can do it this way:
In [8]: df['c'] = df.groupby('a')['a'].transform(lambda x: x-np.arange(1, len(x)+1))
In [9]: df
Out[9]:
a b c
0 2015 0.436967 2014
1 2015 -0.539453 2013
2 2015 -0.450282 2012
3 2015 0.907723 2011
4 2015 -2.279188 2010
5 2016 1.468736 2015
6 2016 -0.169522 2014
7 2016 0.003501 2013
8 2016 0.182321 2012
9 2016 0.647310 2011
10 2017 0.679443 2016
11 2017 -0.154405 2015
12 2017 -0.197271 2014
13 2017 -0.153552 2013
14 2017 0.518803 2012
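Both answers are easy to sanity-check with a tiny self-contained sample of the cumcount formula:

```python
import numpy as np
import pandas as pd

# Two groups of three rows each
df = pd.DataFrame({'a': np.repeat([2015, 2016], 3)})

# cumcount numbers rows 0, 1, 2, ... within each group of `a`,
# so a - cumcount - 1 counts down starting from the year before
df['c'] = df['a'] - df.groupby('a').cumcount() - 1
print(df)
```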
I would like to know how I can add a year-over-year growth rate to the following data in Pandas.
Date Total Managed Expenditure
0 2001 503.2
1 2002 529.9
2 2003 559.8
3 2004 593.2
4 2005 629.5
5 2006 652.1
6 2007 664.3
7 2008 688.2
8 2009 732.0
9 2010 759.2
10 2011 769.2
11 2012 759.8
12 2013 760.6
13 2014 753.3
14 2015 757.6
15 2016 753.9
Use Series.pct_change():
df['Total Managed Expenditure'].pct_change()
Out:
0 NaN
1 0.053060
2 0.056426
3 0.059664
4 0.061194
5 0.035902
6 0.018709
7 0.035978
8 0.063644
9 0.037158
10 0.013172
11 -0.012220
12 0.001053
13 -0.009598
14 0.005708
15 -0.004884
Name: Total Managed Expenditure, dtype: float64
To assign it back:
df['Growth Rate'] = df['Total Managed Expenditure'].pct_change()
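As a quick check, pct_change computes (current - previous) / previous row by row; a sketch with the first few values from the question:

```python
import pandas as pd

# First few rows of the question's series
df = pd.DataFrame({'Total Managed Expenditure': [503.2, 529.9, 559.8]})

# Year-over-year growth rate; the first row has no predecessor, so NaN
df['Growth Rate'] = df['Total Managed Expenditure'].pct_change()
print(df)
```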