add months in an existing data frame - python

Year  Price
2017  200
2018  250
2019  300
Given the table above, is there a way to add months to each year? For example, 2017 should have months Jan to Dec, with the same price carried forward across all 12 months, for every year listed in a Pandas DataFrame:
Year        Price
2017/01/01  200
2017/02/01  200
2017/03/01  200
2017/04/01  200
2017/05/01  200

There's probably a better answer out there (I know very little Pandas), but one approach that comes to mind is:
Convert your numeric "Year" to a date. That gives you January 1st at midnight of that year; you can drop the time part (the "hour", if you may) and keep just the date (January 1st of that year).
At this point your first row represents January (month 1). Then replicate the row, changing the date's month to 2 (February), 3 (March)... up to 12 (December), and insert each copy back into the DataFrame:
import pandas as pd
df = pd.DataFrame([
    {"Year": 2017, "Price": 200},
    {"Year": 2018, "Price": 300},
    {"Year": 2019, "Price": 400},
])
df["Year"] = pd.to_datetime(df["Year"], format='%Y').dt.date
for idx, row in df.iterrows():
    for i in range(2, 13):
        row["Year"] = row["Year"].replace(month=i)
        df = pd.concat([df, row.to_frame().T])
df = df.sort_values(['Year']).reset_index(drop=True)
print(df)
# Year Price
# 0 2017-01-01 200
# 1 2017-02-01 200
# 2 2017-03-01 200
# 3 2017-04-01 200
# 4 2017-05-01 200
# 5 2017-06-01 200
# 6 2017-07-01 200
# 7 2017-08-01 200
# 8 2017-09-01 200
# 9 2017-10-01 200
# 10 2017-11-01 200
# 11 2017-12-01 200
# 12 2018-01-01 300
# 13 2018-02-01 300
# 14 2018-03-01 300
# 15 2018-04-01 300
# 16 2018-05-01 300
# 17 2018-06-01 300
# 18 2018-07-01 300
# 19 2018-08-01 300
# 20 2018-09-01 300
# 21 2018-10-01 300
# 22 2018-11-01 300
# 23 2018-12-01 300
# 24 2019-01-01 400
# 25 2019-02-01 400
# 26 2019-03-01 400
# 27 2019-04-01 400
# 28 2019-05-01 400
# 29 2019-06-01 400
# 30 2019-07-01 400
# 31 2019-08-01 400
# 32 2019-09-01 400
# 33 2019-10-01 400
# 34 2019-11-01 400
# 35 2019-12-01 400

You could try this:
df.columns = [i.strip() for i in df.columns]
df['Year'] = df['Year'].apply(lambda x: pd.date_range(start=str(x), end=str(x+1), freq='1M').strftime('%m'))
df = df.explode('Year').reset_index(drop=True)
>>> df
Year Price
0 01 200
1 02 200
2 03 200
3 04 200
4 05 200
5 06 200
6 07 200
7 08 200
8 09 200
9 10 200
10 11 200
11 12 200
12 01 250
13 02 250
14 03 250
15 04 250
16 05 250
17 06 250
18 07 250
19 08 250
20 09 250
21 10 250
22 11 250
23 12 250
24 01 300
25 02 300
26 03 300
27 04 300
28 05 300
29 06 300
30 07 300
31 08 300
32 09 300
33 10 300
34 11 300
35 12 300
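The output above keeps only the month number and drops the year. If full month-start dates are wanted instead, a small variant of the same explode idea (a sketch, still assuming the original integer Year column) is:
df['Year'] = df['Year'].apply(lambda x: pd.date_range(start=str(x), periods=12, freq='MS'))
df = df.explode('Year').reset_index(drop=True)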

Create a dataframe with months 1-12
Cross merge that with your original data
Create a date out of the year, month, and day 1
Sample code:
import pandas as pd
from datetime import datetime

years = [2017, 2018, 2019, 2020, 2021, 2022]
prices = [200, 250, 300, 350, 350, 317]
your_df = pd.DataFrame(data=[(x, y) for x, y in zip(years, prices)], columns=["Year", "Price"])
months = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
m_df = pd.DataFrame(data=months, columns=["Month"])
final_df = m_df.merge(your_df, how="cross")
final_df["Year"] = [datetime(y, m, 1) for y, m in zip(final_df.Year, final_df.Month)]
final_df = final_df.drop(columns="Month")
final_df

Related

How to find cumulative sum of specific column in CSV file

I have a csv file in the format:
20 05 2019 12:00:00, 100
21 05 2019 12:00:00, 200
22 05 2019 12:00:00, 480
I want to access the second column; I've tried a variety of different alterations, but none have worked.
Initially I tried:
import pandas as pd
import numpy as np
col = [i for i in range(2)]
col[1] = "Power"
data = pd.read_csv('FILENAME.csv', names=col)
df1 = data.sum(data, axis=1)
df2 = np.cumsum(df1)
print(df2)
You can use the cumsum function:
data['Power'].cumsum()
Output:
0 100
1 300
2 780
Name: Power, dtype: int64
Use df.cumsum:
In [1820]: df = pd.read_csv('FILENAME.csv', names=col)
In [1821]: df
Out[1821]:
0 Power
0 20 05 2019 12:00:00 100
1 21 05 2019 12:00:00 200
2 22 05 2019 12:00:00 480
In [1823]: df['cumulative sum'] = df['Power'].cumsum()
In [1824]: df
Out[1824]:
0 Power cumulative sum
0 20 05 2019 12:00:00 100 100
1 21 05 2019 12:00:00 200 300
2 22 05 2019 12:00:00 480 780
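If you also want the first column parsed as real timestamps rather than left as strings, a sketch (assuming the same two-column CSV; the 'Timestamp' column name is just illustrative):
import pandas as pd

df = pd.read_csv('FILENAME.csv', names=['Timestamp', 'Power'])
df['Timestamp'] = pd.to_datetime(df['Timestamp'], format='%d %m %Y %H:%M:%S')
df['cumulative sum'] = df['Power'].cumsum()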

conditional dataframe shift

I have the below dataframe:
ID1 ID2 mon price
10 2 06 500
20 3 07 200
20 3 08 300
20 3 09 400
21 2 07 100
21 2 08 200
21 2 09 300
Required output:
ID1 ID2 mon price ID1_shift ID2_shift mon_shift price_shift
10 2 06 500 10 2 06 500
20 3 07 200 20 3 07 200
20 3 08 300 20 3 07 200
20 3 09 400 20 3 08 300
21 2 07 100 21 2 07 100
21 2 08 200 21 2 07 100
21 2 09 300 21 2 08 200
I tried using df.shift() in different ways but was not successful. Your valuable comments will be helpful.
I want to shift the dataframe grouped by (ID1, ID2) and, where the shift yields NaN, fill with the current values.
I tried the below, but it works with only a single column:
df["price_shift"] = df.groupby(["ID1", "ID2"]).price.shift().fillna(df["price"])
Thanks
I came up with the below, but this is only feasible for a small number of columns. Is there any way the complete row can be shifted with a groupby as above?
df1 = pd.DataFrame()
df1['price_shift'] = df.groupby(['ID1', 'ID2']).price.shift(1).fillna(df['price'])
df1['mon_shift'] = df.groupby(['ID1', 'ID2']).mon.shift(1).fillna(df['mon'])
df1[['ID1_shift', 'ID2_shift']] = df[['ID1', 'ID2']]
df2 = pd.concat([df, df1], axis=1)
df2
try the below:
for column_name in df.columns:
    df[column_name + "_shift"] = df[column_name]
cheers
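For reference, a sketch that shifts all the non-key columns within each (ID1, ID2) group at once and fills each group's first row with its own values (the names here are illustrative, not the asker's code):
shifted = df.groupby(['ID1', 'ID2']).shift(1)   # shifts mon and price within each group
shifted = shifted.fillna(df)                    # first row of each group falls back to its own values
shifted[['ID1', 'ID2']] = df[['ID1', 'ID2']]    # the group keys are unchanged by the shift
df2 = pd.concat([df, shifted.add_suffix('_shift')], axis=1)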

How to create a column for each year from a single date column containing year and month?

If I have data like this:
Date Values
2005-01 10
2005-02 20
2005-03 30
2006-01 40
2006-02 50
2006-03 70
how can I pivot the years into columns, like this?
Date 2005 2006
01 10 40
02 20 50
03 30 70
Thanks.
You can use split with pivot:
df[['year','month']] = df.Date.str.split('-', expand=True)
df = df.pivot(index='month', columns='year', values='Values')
print (df)
year 2005 2006
month
01 10 40
02 20 50
03 30 70
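An equivalent sketch using real datetimes instead of string splitting (assuming the Date values are year-month strings as shown):
df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m')
df['year'] = df['Date'].dt.year
df['month'] = df['Date'].dt.month
df = df.pivot(index='month', columns='year', values='Values')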

how to shift single value of a pandas dataframe column

Using pandas first_valid_index() to get the index of the first non-null value of a column, how can I shift a single value of the column rather than the whole column? I.e.:
data = {'year': [2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019],
        'columnA': [10, 21, 20, 10, 39, 30, 31, 45, 23, 56],
        'columnB': [None, None, None, 10, 39, 30, 31, 45, 23, 56],
        'total': [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]}
df = pd.DataFrame(data)
df = df.set_index('year')
print df
columnA columnB total
year
2010 10 NaN 100
2011 21 NaN 200
2012 20 NaN 300
2013 10 10 400
2014 39 39 500
2015 30 30 600
2016 31 31 700
2017 45 45 800
2018 23 23 900
2019 56 56 1000
for col in df.columns:
    if col not in ['total']:
        idx = df[col].first_valid_index()
        df.loc[idx, col] = df.loc[idx, col] + df.loc[idx, 'total'].shift(1)
print df
AttributeError: 'numpy.float64' object has no attribute 'shift'
desired result:
print df
columnA columnB total
year
2010 10 NaN 100
2011 21 NaN 200
2012 20 NaN 300
2013 10 310 400
2014 39 39 500
2015 30 30 600
2016 31 31 700
2017 45 45 800
2018 23 23 900
2019 56 56 1000
Is that what you want?
In [63]: idx = df.columnB.first_valid_index()
In [64]: df.loc[idx, 'columnB'] += df.total.shift().loc[idx]
In [65]: df
Out[65]:
columnA columnB total
year
2010 10 NaN 100
2011 21 NaN 200
2012 20 NaN 300
2013 10 310.0 400
2014 39 39.0 500
2015 30 30.0 600
2016 31 31.0 700
2017 45 45.0 800
2018 23 23.0 900
2019 56 56.0 1000
UPDATE: starting from Pandas 0.20.1 the .ix indexer is deprecated, in favor of the more strict .iloc and .loc indexers.
You can filter for all column names where there is at least one NaN value, and then take the union with the column total:
for col in df.columns:
    if col not in pd.Index(['total']).union(df.columns[~df.isnull().any()]):
        idx = df[col].first_valid_index()
        df.loc[idx, col] += df.total.shift().loc[idx]
print (df)
columnA columnB total
year
2010 10 NaN 100
2011 21 NaN 200
2012 20 NaN 300
2013 10 310.0 400
2014 39 39.0 500
2015 30 30.0 600
2016 31 31.0 700
2017 45 45.0 800
2018 23 23.0 900
2019 56 56.0 1000

Pandas groupby month and year

I have the following dataframe:
Date abc xyz
01-Jun-13 100 200
03-Jun-13 -20 50
15-Aug-13 40 -5
20-Jan-14 25 15
21-Feb-14 60 80
I need to group the data by year and month. I.e., Group by Jan 2013, Feb 2013, Mar 2013, etc...
I will be using the newly grouped data to create a plot showing abc vs xyz per year/month.
I've tried various combinations of groupby and sum, but I just can't seem to get anything to work. How can I do it?
You can use either resample or Grouper (which resamples under the hood).
First make sure that the datetime column actually holds datetimes (hit it with pd.to_datetime). It's easier if it's a DatetimeIndex:
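A sketch of that conversion step (assuming Date starts as a column of dd-Mon-yy strings, as in the question):
df['Date'] = pd.to_datetime(df['Date'], format='%d-%b-%y')
df1 = df.set_index('Date')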
In [11]: df1
Out[11]:
abc xyz
Date
2013-06-01 100 200
2013-06-03 -20 50
2013-08-15 40 -5
2014-01-20 25 15
2014-02-21 60 80
In [12]: g = df1.groupby(pd.Grouper(freq="M")) # DataFrameGroupBy (grouped by Month)
In [13]: g.sum()
Out[13]:
abc xyz
Date
2013-06-30 80 250
2013-07-31 NaN NaN
2013-08-31 40 -5
2013-09-30 NaN NaN
2013-10-31 NaN NaN
2013-11-30 NaN NaN
2013-12-31 NaN NaN
2014-01-31 25 15
2014-02-28 60 80
In [14]: df1.resample("M").sum() # the same
Out[14]:
abc xyz
Date
2013-06-30 80 250
2013-07-31 NaN NaN
2013-08-31 40 -5
2013-09-30 NaN NaN
2013-10-31 NaN NaN
2013-11-30 NaN NaN
2013-12-31 NaN NaN
2014-01-31 25 15
2014-02-28 60 80
Note: Previously pd.Grouper(freq="M") was written as pd.TimeGrouper("M"). The latter is now deprecated since 0.21.
I had thought the following would work, but it doesn't (due to as_index not being respected? I'm not sure). I'm including it for interest's sake.
If it's a column (it has to be a datetime64 column! as I say, hit it with to_datetime), you can use the PeriodIndex:
In [21]: df
Out[21]:
Date abc xyz
0 2013-06-01 100 200
1 2013-06-03 -20 50
2 2013-08-15 40 -5
3 2014-01-20 25 15
4 2014-02-21 60 80
In [22]: pd.DatetimeIndex(df.Date).to_period("M") # old way
Out[22]:
<class 'pandas.tseries.period.PeriodIndex'>
[2013-06, ..., 2014-02]
Length: 5, Freq: M
In [23]: per = df.Date.dt.to_period("M") # new way to get the same
In [24]: g = df.groupby(per)
In [25]: g.sum() # dang not quite what we want (doesn't fill in the gaps)
Out[25]:
abc xyz
2013-06 80 250
2013-08 40 -5
2014-01 25 15
2014-02 60 80
To get the desired result we have to reindex...
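A sketch of that reindex, filling in the missing months (reusing per and g from above):
full_range = pd.period_range(per.min(), per.max(), freq='M')
g.sum().reindex(full_range)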
Keep it simple:
GB = DF.groupby([(DF.index.year), (DF.index.month)]).sum()
giving you,
print(GB)
abc xyz
2013 6 80 250
8 40 -5
2014 1 25 15
2 60 80
and then you can plot like asked using,
GB.plot('abc', 'xyz', kind='scatter')
There are different ways to do that. I created the data frame below to showcase the different techniques for filtering your data.
df = pd.DataFrame({'Date': ['01-Jun-13', '03-Jun-13', '15-Aug-13', '20-Jan-14', '21-Feb-14'],
                   'abc': [100, -20, 40, 25, 60], 'xyz': [200, 50, -5, 15, 80]})
I separated out month, year, and day, and also the month-year combination, as you described:
def getMonth(s):
    return s.split("-")[1]

def getDay(s):
    return s.split("-")[0]

def getYear(s):
    return s.split("-")[2]

def getYearMonth(s):
    return s.split("-")[1] + "-" + s.split("-")[2]
I created new columns: year, month, day, and YearMonth. In your case, you need one of the two: group using the two columns 'year' and 'month', or using the single column 'YearMonth'.
df['year'] = df['Date'].apply(getYear)
df['month'] = df['Date'].apply(getMonth)
df['day'] = df['Date'].apply(getDay)
df['YearMonth'] = df['Date'].apply(getYearMonth)
Output:
Date abc xyz year month day YearMonth
0 01-Jun-13 100 200 13 Jun 01 Jun-13
1 03-Jun-13 -20 50 13 Jun 03 Jun-13
2 15-Aug-13 40 -5 13 Aug 15 Aug-13
3 20-Jan-14 25 15 14 Jan 20 Jan-14
4 21-Feb-14 60 80 14 Feb 21 Feb-14
You can go through the different groups in groupby(..) items.
In this case, we are grouping by two columns:
for key, g in df.groupby(['year', 'month']):
    print key, g
Output:
('13', 'Jun') Date abc xyz year month day YearMonth
0 01-Jun-13 100 200 13 Jun 01 Jun-13
1 03-Jun-13 -20 50 13 Jun 03 Jun-13
('13', 'Aug') Date abc xyz year month day YearMonth
2 15-Aug-13 40 -5 13 Aug 15 Aug-13
('14', 'Jan') Date abc xyz year month day YearMonth
3 20-Jan-14 25 15 14 Jan 20 Jan-14
('14', 'Feb') Date abc xyz year month day YearMonth
4 21-Feb-14 60 80 14 Feb 21 Feb-14
In this case, we are grouping by one column:
for key, g in df.groupby(['YearMonth']):
    print key, g
Output:
Jun-13 Date abc xyz year month day YearMonth
0 01-Jun-13 100 200 13 Jun 01 Jun-13
1 03-Jun-13 -20 50 13 Jun 03 Jun-13
Aug-13 Date abc xyz year month day YearMonth
2 15-Aug-13 40 -5 13 Aug 15 Aug-13
Jan-14 Date abc xyz year month day YearMonth
3 20-Jan-14 25 15 14 Jan 20 Jan-14
Feb-14 Date abc xyz year month day YearMonth
4 21-Feb-14 60 80 14 Feb 21 Feb-14
In case you want to access a specific item, you can use get_group
print df.groupby(['YearMonth']).get_group('Jun-13')
Output:
Date abc xyz year month day YearMonth
0 01-Jun-13 100 200 13 Jun 01 Jun-13
1 03-Jun-13 -20 50 13 Jun 03 Jun-13
Similar to get_group, this filtering hack gives the same grouped values:
print df[df['YearMonth']=='Jun-13']
Output:
Date abc xyz year month day YearMonth
0 01-Jun-13 100 200 13 Jun 01 Jun-13
1 03-Jun-13 -20 50 13 Jun 03 Jun-13
You can select the list of abc or xyz values for Jun-13:
print df[df['YearMonth']=='Jun-13'].abc.values
print df[df['YearMonth']=='Jun-13'].xyz.values
Output:
[100 -20] #abc values
[200 50] #xyz values
You can use this to go through the dates you have classified as "year-month" and apply criteria to get the related data:
for x in set(df.YearMonth):
    print df[df['YearMonth']==x].abc.values
    print df[df['YearMonth']==x].xyz.values
I recommend checking this answer as well.
You can also do it by creating a string column with the year and month as follows:
df['date'] = df.index
df['year-month'] = df['date'].apply(lambda x: str(x.year) + ' ' + str(x.month))
grouped = df.groupby('year-month')
However this doesn't preserve the order when you loop over the groups, e.g.
for name, group in grouped:
    print(name)
Will give:
2007 11
2007 12
2008 1
2008 10
2008 11
2008 12
2008 2
2008 3
2008 4
2008 5
2008 6
2008 7
2008 8
2008 9
2009 1
2009 10
So then, if you want to preserve the order, you must do as suggested by @Q-man above:
grouped = df.groupby([df.index.year, df.index.month])
This will preserve the order in the above loop:
(2007, 11)
(2007, 12)
(2008, 1)
(2008, 2)
(2008, 3)
(2008, 4)
(2008, 5)
(2008, 6)
(2008, 7)
(2008, 8)
(2008, 9)
(2008, 10)
Some of the answers are using Date as an index instead of a column (and there's nothing wrong with doing that).
However, for anyone who has the dates stored as a column (instead of an index), remember to access the column's dt attribute. That is:
# First make sure `Date` is a datetime column
df['Date'] = pd.to_datetime(
arg=df['Date'],
format='%d-%b-%y' # Assuming dd-Mon-yy format
)
# Group by year and month
df.groupby(
[
df['Date'].dt.year,
df['Date'].dt.month
]
).sum()
