Using pandas first_valid_index() to get the index of the first non-null value of a column, how can I shift a single value of the column rather than the whole column? i.e.:
data = {'year': [2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019],
        'columnA': [10, 21, 20, 10, 39, 30, 31, 45, 23, 56],
        'columnB': [None, None, None, 10, 39, 30, 31, 45, 23, 56],
        'total': [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]}
df = pd.DataFrame(data)
df = df.set_index('year')
print(df)
columnA columnB total
year
2010 10 NaN 100
2011 21 NaN 200
2012 20 NaN 300
2013 10 10 400
2014 39 39 500
2015 30 30 600
2016 31 31 700
2017 45 45 800
2018 23 23 900
2019 56 56 1000
for col in df.columns:
    if col not in ['total']:
        idx = df[col].first_valid_index()
        # fails: df.loc[idx, 'total'] is a scalar (numpy.float64), not a Series,
        # so it has no .shift() method
        df.loc[idx, col] = df.loc[idx, col] + df.loc[idx, 'total'].shift(1)
print(df)
AttributeError: 'numpy.float64' object has no attribute 'shift'
Desired result:
print df
columnA columnB total
year
2010 10 NaN 100
2011 21 NaN 200
2012 20 NaN 300
2013 10 310 400
2014 39 39 500
2015 30 30 600
2016 31 31 700
2017 45 45 800
2018 23 23 900
2019 56 56 1000
Is that what you want?
In [63]: idx = df.columnB.first_valid_index()
In [64]: df.loc[idx, 'columnB'] += df.total.shift().loc[idx]
In [65]: df
Out[65]:
columnA columnB total
year
2010 10 NaN 100
2011 21 NaN 200
2012 20 NaN 300
2013 10 310.0 400
2014 39 39.0 500
2015 30 30.0 600
2016 31 31.0 700
2017 45 45.0 800
2018 23 23.0 900
2019 56 56.0 1000
UPDATE: as of pandas 0.20.1 the .ix indexer is deprecated, in favor of the stricter .iloc and .loc indexers.
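For anyone porting an older snippet, the .loc call used above is the drop-in replacement for the old .ix form; a minimal before/after sketch (same labels as in this answer):
# old (deprecated, later removed):
#   df.ix[idx, 'columnB'] += df.total.shift().ix[idx]
# replacement:
df.loc[idx, 'columnB'] += df.total.shift().loc[idx]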
You can filter for only the columns that contain at least one NaN value, by skipping the union of column total with the columns that have no NaN:
for col in df.columns:
    if col not in pd.Index(['total']).union(df.columns[~df.isnull().any()]):
        idx = df[col].first_valid_index()
        df.loc[idx, col] += df.total.shift().loc[idx]
print(df)
columnA columnB total
year
2010 10 NaN 100
2011 21 NaN 200
2012 20 NaN 300
2013 10 310.0 400
2014 39 39.0 500
2015 30 30.0 600
2016 31 31.0 700
2017 45 45.0 800
2018 23 23.0 900
2019 56 56.0 1000
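For reference, a self-contained version of that loop; it swaps the Index union for columns.difference plus an explicit NaN check, which selects the same columns (a sketch, assuming the frame from the question):
import pandas as pd

data = {'year': [2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019],
        'columnA': [10, 21, 20, 10, 39, 30, 31, 45, 23, 56],
        'columnB': [None, None, None, 10, 39, 30, 31, 45, 23, 56],
        'total': [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]}
df = pd.DataFrame(data).set_index('year')

# only touch columns that actually contain a NaN, and never 'total'
for col in df.columns.difference(['total']):
    if df[col].isna().any():
        idx = df[col].first_valid_index()
        df.loc[idx, col] += df['total'].shift().loc[idx]
print(df)  # columnB at 2013 becomes 10 + 300 = 310.0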
I have a multi-level column dataframe along the lines of the one below.
How can I add columns 'Sales' = 'Qty' * 'Price' one each for each 'Year'?
The input dataframe in dictionary format:
{('Qty', 2001): [50, 50], ('Qty', 2002): [100, 10], ('Qty', 2003): [200, 20], ('Qty', 2004): [300, 30], ('Qty', 2005): [400, 40], ('Price', 2001): [20, 11], ('Price', 2002): [21, 12], ('Price', 2003): [22, 13], ('Price', 2004): [23, 14], ('Price', 2005): [24, 15]}
Currently, I am splitting the dataframe for each year separately and adding a computed column. If there is an easier method, that would be great.
Here is the expected output
You can create the required column names with a list comprehension, and then simply assign the multiplication (df.mul).
new_cols = [('Sales', col) for col in df['Qty'].columns]
# [('Sales', 2001), ('Sales', 2002), ('Sales', 2003), ('Sales', 2004), ('Sales', 2005)]
df[new_cols] = df['Qty'].mul(df['Price'])
df
Qty Price Sales \
2001 2002 2003 2004 2005 2001 2002 2003 2004 2005 2001 2002 2003 2004
0 50 100 200 300 400 20 21 22 23 24 1000 2100 4400 6900
1 50 10 20 30 40 11 12 13 14 15 550 120 260 420
2005
0 9600
1 600
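For a self-contained run of the approach above, the frame can be rebuilt straight from the dictionary in the question; tuple keys become a two-level column MultiIndex automatically (a sketch):
import pandas as pd

data = {('Qty', 2001): [50, 50], ('Qty', 2002): [100, 10],
        ('Qty', 2003): [200, 20], ('Qty', 2004): [300, 30],
        ('Qty', 2005): [400, 40], ('Price', 2001): [20, 11],
        ('Price', 2002): [21, 12], ('Price', 2003): [22, 13],
        ('Price', 2004): [23, 14], ('Price', 2005): [24, 15]}
df = pd.DataFrame(data)

# selecting one first-level label gives a plain frame per year, aligned by column
new_cols = [('Sales', col) for col in df['Qty'].columns]
df[new_cols] = df['Qty'].mul(df['Price'])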
Let us stack to flatten the MultiIndex columns, then multiply and reshape back using unstack:
df.stack().eval('Sales = Price * Qty').unstack()
Price Qty Sales
2001 2002 2003 2004 2005 2001 2002 2003 2004 2005 2001 2002 2003 2004 2005
0 20 21 22 23 24 50 100 200 300 400 1000 2100 4400 6900 9600
1 11 12 13 14 15 50 10 20 30 40 550 120 260 420 600
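If the eval string feels opaque, the same stack/compute/unstack round trip works with a plain column assignment (a sketch, assuming the frame above):
stacked = df.stack()            # years move into the row index
stacked['Sales'] = stacked['Price'] * stacked['Qty']
out = stacked.unstack()         # years move back into the columns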
Year  Price
2017    200
2018    250
2019    300
Given the table above, is there a way to add months to each year? E.g., 2017 should have months Jan to Dec, with the same price carried forward across all 12 months, for all the years listed in a Pandas dataframe:
Year        Price
2017/01/01    200
2017/02/01    200
2017/03/01    200
2017/04/01    200
2017/05/01    200
...
There's probably a better answer out there (I know very little Pandas), but one thing that comes to mind is:
Get the date represented by your numeric "Year". That will give you January 1st at midnight of that year. You can drop the time part (the "hour", if you may) and keep just the date (January 1st of that year).
At this point your first row will be January (month 1). Then you can replicate the row, changing the "Year"'s month to 2 (February), 3 (March)... up to 12 (December), and insert it back into the Dataframe:
import pandas as pd

df = pd.DataFrame([
    {"Year": 2017, "Price": 200},
    {"Year": 2018, "Price": 300},
    {"Year": 2019, "Price": 400},
])
df["Year"] = pd.to_datetime(df["Year"], format='%Y').dt.date
for idx, row in df.iterrows():
    for i in range(2, 13):
        row["Year"] = row["Year"].replace(month=i)
        df = pd.concat([df, row.to_frame().T])
df = df.sort_values(['Year']).reset_index(drop=True)
print(df)
# Year Price
# 0 2017-01-01 200
# 1 2017-02-01 200
# 2 2017-03-01 200
# 3 2017-04-01 200
# 4 2017-05-01 200
# 5 2017-06-01 200
# 6 2017-07-01 200
# 7 2017-08-01 200
# 8 2017-09-01 200
# 9 2017-10-01 200
# 10 2017-11-01 200
# 11 2017-12-01 200
# 12 2018-01-01 300
# 13 2018-02-01 300
# 14 2018-03-01 300
# 15 2018-04-01 300
# 16 2018-05-01 300
# 17 2018-06-01 300
# 18 2018-07-01 300
# 19 2018-08-01 300
# 20 2018-09-01 300
# 21 2018-10-01 300
# 22 2018-11-01 300
# 23 2018-12-01 300
# 24 2019-01-01 400
# 25 2019-02-01 400
# 26 2019-03-01 400
# 27 2019-04-01 400
# 28 2019-05-01 400
# 29 2019-06-01 400
# 30 2019-07-01 400
# 31 2019-08-01 400
# 32 2019-09-01 400
# 33 2019-10-01 400
# 34 2019-11-01 400
# 35 2019-12-01 400
You could try this:
df.columns = [i.strip() for i in df.columns]
df['Year'] = df['Year'].apply(lambda x: pd.date_range(start=str(x), end=str(x+1), freq='1M').strftime('%m'))
df = df.explode('Year').reset_index(drop=True)
>>> df
Year Price
0 01 200
1 02 200
2 03 200
3 04 200
4 05 200
5 06 200
6 07 200
7 08 200
8 09 200
9 10 200
10 11 200
11 12 200
12 01 250
13 02 250
14 03 250
15 04 250
16 05 250
17 06 250
18 07 250
19 08 250
20 09 250
21 10 250
22 11 250
23 12 250
24 01 300
25 02 300
26 03 300
27 04 300
28 05 300
29 06 300
30 07 300
31 08 300
32 09 300
33 10 300
34 11 300
35 12 300
Create a dataframe with months 1-12
Cross merge that with your original data
Create a date out of the year, month, and day 1
Sample code:
from datetime import datetime

import pandas as pd

years = [2017, 2018, 2019, 2020, 2021, 2022]
prices = [200, 250, 300, 350, 350, 317]
your_df = pd.DataFrame(data=list(zip(years, prices)), columns=["Year", "Price"])
months = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
m_df = pd.DataFrame(data=months, columns=["Month"])
final_df = m_df.merge(your_df, how="cross")  # how="cross" needs pandas >= 1.2
final_df["Year"] = [datetime(y, m, 1) for y, m in zip(final_df.Year, final_df.Month)]
final_df = final_df.drop(columns="Month")
final_df
My df has USA states-related information. I want to rank the states based on their contribution.
My code:
df
State Value Year
0 FL 100 2012
1 CA 150 2013
2 MA 25 2014
3 FL 50 2014
4 CA 50 2015
5 MA 75 2016
Expected answer: compute state_capacity by summing each state's values across all years, then rank the states based on that capacity.
df
State Value Year State_Capa. Rank
0 FL 100 2012 150 2
1 CA 150 2013 200 1
2 MA 25 2014 100 3
3 FL 150 2014 200 2
4 CA 50 2015 200 1
5 MA 75 2016 100 3
My approach: I am able to compute the state capacity using groupby, but I ran into NaN when mapping it back onto the df.
state_capacity = df[['State', 'Value']].groupby(['State']).sum()
# dict() of a DataFrame keys on the column names ('Value'), not the states,
# hence the NaNs below
df['State_Capa.'] = df['State'].map(dict(state_capacity))
df
State Value Year State_Capa.
0 FL 100 2012 NaN
1 CA 150 2013 NaN
2 MA 25 2014 NaN
3 FL 50 2014 NaN
4 CA 50 2015 NaN
5 MA 75 2016 NaN
Try with transform, then rank:
df['new'] = df.groupby('State').Value.transform('sum').rank(method='dense',ascending=False)
Out[42]:
0 2.0
1 1.0
2 3.0
3 2.0
4 1.0
5 3.0
Name: Value, dtype: float64
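If you want both columns from the expected answer on the frame itself, keep the transformed sum and rank it separately (a sketch using the same calls as above; column names assumed from the question):
df['State_Capa.'] = df.groupby('State')['Value'].transform('sum')
df['Rank'] = df['State_Capa.'].rank(method='dense', ascending=False)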
As mentioned in the comment, your question seems to have a problem. However, I guess this might be what you want:
df = pd.DataFrame({
    'state': ['FL', 'CA', 'MA', 'FL', 'CA', 'MA'],
    'value': [100, 150, 25, 50, 50, 75],
    'year': [2012, 2013, 2014, 2014, 2015, 2016]
})
returns:
state value year
0 FL 100 2012
1 CA 150 2013
2 MA 25 2014
3 FL 50 2014
4 CA 50 2015
5 MA 75 2016
and
groupby_sum = df.groupby('state')[['value']].sum()
groupby_sum['rank'] = groupby_sum['value'].rank()
groupby_sum.reset_index()
returns:
state value rank
0 CA 200 3.0
1 FL 150 2.0
2 MA 100 1.0
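To bring those totals and ranks back onto the row-level frame, a merge sketch (assuming the lowercase column names used just above):
df = df.merge(groupby_sum.reset_index(), on='state', suffixes=('', '_total'))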
Given the following dataset, and the current week being 2019/W37, how do I drop rows that are previous to the current week using np.where?
Year Week Value
0 2019 31 10
1 2019 32 20
2 2019 33 30
3 2019 34 40
4 2019 35 50
5 2019 36 60
6 2019 37 70
7 2019 38 80
8 2019 39 90
9 2019 40 100
I tried the following:
import pandas as pd
import numpy as np
from datetime import datetime
data = {
    "Year": [2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019],
    "Week": [31, 32, 33, 34, 35, 36, 37, 38, 39, 40],
    "Value": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
}
df = pd.DataFrame(data)
print(df)
YearWeek = datetime.now().strftime("%Y/W%V")
print(YearWeek)
df["Exclude"] = np.where(str(df["Year"] + "/" + df["Week"]) < YearWeek, "Yes", "No")
print(df)
Try this:
df_new = df[pd.to_datetime(df["Year"].astype(str) + "/W" + df["Week"].astype(str), format="%Y/W%V", errors='ignore') >= YearWeek]
Or using np.where():
df.iloc[np.where(pd.to_datetime((df["Year"].astype(str) + "/W" + df["Week"].astype(str)), format="%Y/W%V", errors='ignore') >= YearWeek)]
To generate the exclude column:
df['exclude'] = np.where(pd.to_datetime((df["Year"].astype(str) + "/W" + df["Week"].astype(str)), format="%Y/W%V", errors='ignore') < YearWeek, 'Yes', 'No' )
>>> print(df)
Year Week Value
0 2019 31 10
1 2019 32 20
2 2019 33 30
3 2019 34 40
4 2019 35 50
5 2019 36 60
6 2019 37 70
7 2019 38 80
8 2019 39 90
9 2019 40 100
>>> today = pd.to_datetime('today')
>>> today
Timestamp('2019-09-12 22:54:46.039542')
>>> df[(df.Week < today.week) | (df.Year < today.year)]
Year Week Value
0 2019 31 10
1 2019 32 20
2 2019 33 30
3 2019 34 40
4 2019 35 50
5 2019 36 60
You can use a decimal week system:
w = df['Year'] + df['Week'] / 54
now = pd.Timestamp.now()
this_week = now.year + now.week / 54
df[w >= this_week]
Result
Year Week Value
6 2019 37 70
7 2019 38 80
8 2019 39 90
9 2019 40 100
In the ISO week date system a year can have up to 53 weeks, so we use 54 to prevent the last week of year N from comparing like year N+1. Anything greater than 53 works just as well. It's just a way for us to combine the year and the week into a single, comparable quantity.
We can do:
df[(df.Year*100+df.Week)<int(pd.to_datetime('today').strftime('%Y%W'))]
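Conversely, to keep only the current week onward with the same year*100 + week encoding (a small sketch):
cutoff = int(pd.to_datetime('today').strftime('%Y%W'))
df_kept = df[(df.Year * 100 + df.Week) >= cutoff]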
I have some data and after using a groupby function I now have a series that looks like this:
year
1997 15
1998 22
1999 24
2000 24
2001 28
2002 11
2003 15
2004 19
2005 10
2006 10
2007 21
2008 26
2009 23
2010 16
2011 33
2012 19
2013 26
2014 25
How can I create a pandas dataframe from here, with year as one column and the other column named sightings?
I am a pandas novice so don't really know what I am doing. I have tried the reindex and unstack functions but haven't been able to get what I want...
You can use reset_index and rename columns:
print (df.reset_index())
index year
0 1997 15
1 1998 22
2 1999 24
3 2000 24
4 2001 28
5 2002 11
6 2003 15
7 2004 19
8 2005 10
9 2006 10
10 2007 21
11 2008 26
12 2009 23
13 2010 16
14 2011 33
15 2012 19
16 2013 26
17 2014 25
print (df.reset_index().rename(columns=({'index':'year','year':'sightings'})))
year sightings
0 1997 15
1 1998 22
2 1999 24
3 2000 24
4 2001 28
5 2002 11
6 2003 15
7 2004 19
8 2005 10
9 2006 10
10 2007 21
11 2008 26
12 2009 23
13 2010 16
14 2011 33
15 2012 19
16 2013 26
17 2014 25
Another solution is to set the column names from a list of names:
df1 = df.reset_index()
df1.columns = ['year','sightings']
print (df1)
year sightings
0 1997 15
1 1998 22
2 1999 24
3 2000 24
4 2001 28
5 2002 11
6 2003 15
7 2004 19
8 2005 10
9 2006 10
10 2007 21
11 2008 26
12 2009 23
13 2010 16
14 2011 33
15 2012 19
16 2013 26
17 2014 25
EDIT:
Sometimes it helps to add the parameter as_index=False to groupby, so that a DataFrame is returned:
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 3],
                   'B': [4, 5, 6]})
print (df)
A B
0 1 4
1 1 5
2 3 6
print (df.groupby('A')['B'].sum())
A
1 9
3 6
Name: B, dtype: int64
print (df.groupby('A', as_index=False)['B'].sum())
A B
0 1 9
1 3 6
Or, in one line, rename the Series and reset the index:
s.rename('sightings').reset_index()
I've also used this method during the groupby stage to put the results straight into a dataframe:
df2 = df1.groupby(['Year']).count()
df3 = pd.DataFrame(df2).reset_index()
If your original dataframe - df1 - had "Year" and "Sightings" as its two columns, then df3 should have each year listed under "Year" and the count (or sum, average, whatever) listed under "Sightings".
If not, you can change the column names by doing the following:
df3.columns = ['Year','Sightings']
or
df3 = df3.rename(columns={'oldname_A': 'Year', 'oldname_B': 'Sightings'})