My dataframe holds the number of red cars sold in each month, one row per year and one column per month. I have to build a predictive model to predict monthly sales.
I want to convert the current data frame into the format below for time series modeling.
How can I read the column and row headers to create a date column? I am hoping for a new data frame.
You can use melt() to transform the dataframe from the wide to the long format. Then we combine the Year and month information to make an actual date:
import pandas as pd
df = pd.DataFrame({'YEAR' : [2021,2022],
'JAN' : [5, 232],
'FEB':[545, 48]})
df2 = df.melt(id_vars = ['YEAR'], var_name = 'month', value_name = 'sales')
df2['date'] = df2.apply(lambda row: pd.to_datetime(str(row['YEAR']) + row['month'], format = '%Y%b'), axis = 1)
df2.sort_values('date')[['date', 'sales']]
This gives the output:
        date  sales
0 2021-01-01      5
2 2021-02-01    545
1 2022-01-01    232
3 2022-02-01     48
(For time series analysis you would probably want to set the date column as the index.)
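As a rough sketch of that last step, the same reshaping can be done without a row-wise apply and finished by moving the dates into the index. The names long_df and ts are only illustrative, and parsing 'JAN'/'FEB' with %b assumes English month abbreviations:

```python
import pandas as pd

df = pd.DataFrame({'YEAR': [2021, 2022],
                   'JAN': [5, 232],
                   'FEB': [545, 48]})

# Wide -> long: one row per (year, month) pair
long_df = df.melt(id_vars='YEAR', var_name='month', value_name='sales')

# Build the date strings column-wise ('2021JAN', ...) and parse them in one call
long_df['date'] = pd.to_datetime(long_df['YEAR'].astype(str) + long_df['month'],
                                 format='%Y%b')

# Use the dates as the index and sort chronologically
ts = long_df.set_index('date')['sales'].sort_index()
print(ts)
```

The result is a Series with a DatetimeIndex, which is the shape most time series tools expect.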
The DataFrame below contains a housing price dataset from 1996 to 2016.
Apart from the first 6 columns, the other columns need to be converted to datetime type.
I tried to run the following code:
HousingPrice.columns[6:] = pd.to_datetime(HousingPrice.columns[6:])
but I got the error:
TypeError: Index does not support mutable operations
I wish to convert some columns in the columns Index to Datetime type, but not all columns.
A pandas Index is immutable, so you can't assign to a slice of it.
However, you can access and modify the underlying values through the index's array attribute:
HousingPrice.columns.array[6:] = pd.to_datetime(HousingPrice.columns[6:])
should work.
Note that this changes the column labels only. To convert the values in those columns, you can do this:
date_cols = HousingPrice.columns[6:]
HousingPrice[date_cols] = HousingPrice[date_cols].apply(pd.to_datetime, errors='coerce', axis=1)
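For the column labels themselves, an alternative that does not write into the index's underlying array is to build a new list of labels and assign it to columns as a whole. This is only a sketch: prices is a hypothetical stand-in for HousingPrice with one fixed column instead of six, and the '%Y-%m' format is an assumption about how the labels look.

```python
import pandas as pd

# Hypothetical stand-in for HousingPrice: one fixed column, then month columns
prices = pd.DataFrame({'RegionName': ['A', 'B'],
                       '1996-04': [100.0, 200.0],
                       '1996-05': [101.0, 202.0]})

n_fixed = 1  # number of leading columns to leave untouched (6 in the question)

# Keep the first labels as they are and parse the remaining ones as dates
new_labels = (list(prices.columns[:n_fixed])
              + list(pd.to_datetime(prices.columns[n_fixed:], format='%Y-%m')))

# Replacing the whole Index is allowed; only item assignment is not
prices.columns = new_labels
print(prices.columns)
```

Because the labels are now a mix of strings and Timestamps, the resulting Index has object dtype.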
EDIT
Illustrated example:
data = {'0ther_col': [1,2,3], '1996-04': ['1996-04','1996-05','1996-06'], '1995-05':['1996-02','1996-08','1996-10']}
print('ORIGINAL DATAFRAME')
df = pd.DataFrame.from_records(data)
print(df)
print("\nDATE COLUMNS")
date_cols = df.columns[-2:]
print(df.dtypes)
print('\nCASTING DATE COLUMNS TO DATETIME')
df[date_cols] = df[date_cols].apply(pd.to_datetime, errors='coerce', axis=1)
print(df.dtypes)
print('\nCASTING DATE COLUMN INDEXES TO DATETIME')
print("OLD INDEX -", df.columns)
df.columns.array[-2:] = pd.to_datetime(df[date_cols].columns)
print("NEW INDEX -",df.columns)
print('\nFINAL DATAFRAME')
print(df)
yields:
ORIGINAL DATAFRAME
0ther_col 1995-05 1996-04
0 1 1996-02 1996-04
1 2 1996-08 1996-05
2 3 1996-10 1996-06
DATE COLUMNS
0ther_col int64
1995-05 object
1996-04 object
dtype: object
CASTING DATE COLUMNS TO DATETIME
0ther_col int64
1995-05 datetime64[ns]
1996-04 datetime64[ns]
dtype: object
CASTING DATE COLUMN INDEXES TO DATETIME
OLD INDEX - Index(['0ther_col', '1995-05', '1996-04'], dtype='object')
NEW INDEX - Index(['0ther_col', 1995-05-01 00:00:00, 1996-04-01 00:00:00], dtype='object')
FINAL DATAFRAME
0ther_col 1995-05-01 00:00:00 1996-04-01 00:00:00
0 1 1996-02-01 1996-04-01
1 2 1996-08-01 1996-05-01
2 3 1996-10-01 1996-06-01
I have a pandas dataframe as shown in the code below. I am trying to "resample"
the data to get a daily count of the tickets column. It does not give any error, but the
resampling is not working. This is a sample of a much larger dataset. I want to be
able to get counts by day, week, month, quarter, etc., but the .resample option is
not giving me a solution. What am I doing wrong?
import pandas as pd
df = pd.DataFrame([['2019-07-30T00:00:00','22:15:00','car'],
['2013-10-12T00:00:00','0:10:00','bus'],
['2014-03-31T00:00:00','9:06:00','ship'],
['2014-03-31T00:00:00','8:15:00','ship'],
['2014-03-31T00:00:00','12:06:00','ship'],
['2014-03-31T00:00:00','9:24:00','ship'],
['2013-10-12T00:00:00','9:06:00','ship'],
['2018-03-31T00:00:00','9:06:00','ship']],
columns=['date_field','time_field','transportation'])
df['date_field2'] = pd.to_datetime(df['date_field'])
df['time_field2'] = pd.to_datetime(df['time_field'],unit = 'ns').dt.time
df['date_time_field'] = df.apply(lambda df : pd.datetime.combine(df['date_field2'],df['time_field2']),1)
df.set_index(['date_time_field'],inplace=True)
df.drop(columns=['date_field','time_field','date_field2','time_field2'],inplace=True)
df['tickets']=1
df.sort_index(inplace=True)
df.drop(columns=['transportation'],inplace=True)
df.resample('D').sum()
print('\ndaily resampling:')
print(df)
I think you forgot to assign the output to a variable, like:
df1 = df.resample('D').sum()
print (df1)
Your code can also be simplified:
# join the two columns with a space; pop removes time_field from the DataFrame
df['date_field'] = pd.to_datetime(df['date_field']+ ' ' + df.pop('time_field'))
# create a sorted DatetimeIndex and drop the unused column
df = df.set_index(['date_field']).sort_index().drop(columns=['transportation'])
# count rows per day
df1 = df.resample('D').size()
print (df1)
date_field
2013-10-12 2
2013-10-13 0
2013-10-14 0
2013-10-15 0
2013-10-16 0
..
2019-07-26 0
2019-07-27 0
2019-07-28 0
2019-07-29 0
2019-07-30 1
Freq: D, Length: 2118, dtype: int64
Also, I don't think using inplace=True is good practice.
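For the weekly, monthly and quarterly counts mentioned in the question, something along these lines may work. It is a sketch rather than part of the original answer: the timestamps are the question's sample values already combined into one string each, tickets is an illustrative name, and grouping by to_period is my own choice for the month and quarter counts (unlike resample, it does not add rows for empty periods).

```python
import pandas as pd

# The question's sample rows, with date and time already combined
stamps = pd.to_datetime(['2019-07-30 22:15:00', '2013-10-12 00:10:00',
                         '2014-03-31 09:06:00', '2014-03-31 08:15:00',
                         '2014-03-31 12:06:00', '2014-03-31 09:24:00',
                         '2013-10-12 09:06:00', '2018-03-31 09:06:00'])

# One ticket per row, indexed by timestamp
tickets = pd.Series(1, index=stamps).sort_index()

# resample returns a new object, so the result has to be assigned
daily = tickets.resample('D').size()
weekly = tickets.resample('W').size()

# Counts per calendar month and quarter (only periods that contain data)
monthly = tickets.groupby(tickets.index.to_period('M')).size()
quarterly = tickets.groupby(tickets.index.to_period('Q')).size()

print(monthly)
print(quarterly)
```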
Data
data = {"account":{"0":383080,"1":383080,"2":383080,"3":412290,"4":412290,"5":412290,"6":412290,"7":412290,"8":218895,"9":218895,"10":218895,"11":218895},"name":{"0":"Will LLC","1":"Will LLC","2":"Will LLC","3":"Jerde-Hilpert","4":"Jerde-Hilpert","5":"Jerde-Hilpert","6":"Jerde-Hilpert","7":"Jerde-Hilpert","8":"Kulas Inc","9":"Kulas Inc","10":"Kulas Inc","11":"Kulas Inc"},"order":{"0":10001,"1":10001,"2":10001,"3":10005,"4":10005,"5":10005,"6":10005,"7":10005,"8":10006,"9":10006,"10":10006,"11":10006},"sku":{"0":"B1-20000","1":"S1-27722","2":"B1-86481","3":"S1-06532","4":"S1-82801","5":"S1-06532","6":"S1-47412","7":"S1-27722","8":"S1-27722","9":"B1-33087","10":"B1-33364","11":"B1-20000"},"quantity":{"0":7,"1":11,"2":3,"3":48,"4":21,"5":9,"6":44,"7":36,"8":32,"9":23,"10":3,"11":-1},"unit price":{"0":33.69,"1":21.12,"2":35.99,"3":55.82,"4":13.62,"5":92.55,"6":78.91,"7":25.42,"8":95.66,"9":22.55,"10":72.3,"11":72.18},"ext price":{"0":235.83,"1":232.32,"2":107.97,"3":2679.36,"4":286.02,"5":832.95,"6":3472.04,"7":915.12,"8":3061.12,"9":518.65,"10":216.9,"11":72.18}}
df = pd.DataFrame(data=data)
Current Solution
sku_total = df.groupby(['order','sku'])['ext price'].sum().rename('sku total').reset_index()
sku_total['sku total'] / sku_total['order'].map(df.groupby('order')['ext price'].sum())
Question
How to divide:
df.groupby(['order','sku'])['ext price'].sum()
by
df.groupby('order')['ext price'].sum()
Without having to reset_index?
Doesn't div do the trick, or am I misunderstanding something?
import pandas as pd
import numpy as np
data = {"account":{"0":383080,"1":383080,"2":383080,"3":412290,"4":412290,"5":412290,"6":412290,"7":412290,"8":218895,"9":218895,"10":218895,"11":218895},"name":{"0":"Will LLC","1":"Will LLC","2":"Will LLC","3":"Jerde-Hilpert","4":"Jerde-Hilpert","5":"Jerde-Hilpert","6":"Jerde-Hilpert","7":"Jerde-Hilpert","8":"Kulas Inc","9":"Kulas Inc","10":"Kulas Inc","11":"Kulas Inc"},"order":{"0":10001,"1":10001,"2":10001,"3":10005,"4":10005,"5":10005,"6":10005,"7":10005,"8":10006,"9":10006,"10":10006,"11":10006},"sku":{"0":"B1-20000","1":"S1-27722","2":"B1-86481","3":"S1-06532","4":"S1-82801","5":"S1-06532","6":"S1-47412","7":"S1-27722","8":"S1-27722","9":"B1-33087","10":"B1-33364","11":"B1-20000"},"quantity":{"0":7,"1":11,"2":3,"3":48,"4":21,"5":9,"6":44,"7":36,"8":32,"9":23,"10":3,"11":-1},"unit price":{"0":33.69,"1":21.12,"2":35.99,"3":55.82,"4":13.62,"5":92.55,"6":78.91,"7":25.42,"8":95.66,"9":22.55,"10":72.3,"11":72.18},"ext price":{"0":235.83,"1":232.32,"2":107.97,"3":2679.36,"4":286.02,"5":832.95,"6":3472.04,"7":915.12,"8":3061.12,"9":518.65,"10":216.9,"11":72.18}}
df = pd.DataFrame(data=data)
print(df)
df_1 = df.groupby(['order','sku'])['ext price'].sum()
df_2 = df.groupby('order')['ext price'].sum()
df_res = df_1.div(df_2)
print(df_res)
Output:
order sku
10001 B1-20000 0.409342
B1-86481 0.187409
S1-27722 0.403249
10005 S1-06532 0.429090
S1-27722 0.111798
S1-47412 0.424170
S1-82801 0.034942
10006 B1-20000 0.018657
B1-33087 0.134058
B1-33364 0.056063
S1-27722 0.791222
Name: ext price, dtype: float64
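If you prefer to make the alignment explicit, div also takes a level argument that names the MultiIndex level to match against. A small sketch using only the first order's rows from the data above (per_sku, per_order and share are illustrative names):

```python
import pandas as pd

# Subset of the question's data: the three rows of order 10001
df = pd.DataFrame({'order': [10001, 10001, 10001],
                   'sku': ['B1-20000', 'S1-27722', 'B1-86481'],
                   'ext price': [235.83, 232.32, 107.97]})

per_sku = df.groupby(['order', 'sku'])['ext price'].sum()
per_order = df.groupby('order')['ext price'].sum()

# Broadcast the per-order totals across the 'order' level of the MultiIndex
share = per_sku.div(per_order, level='order')
print(share)
```

The result keeps the (order, sku) MultiIndex, so no reset_index is needed.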
IIUC,
we can use transform, which lets you do groupby operations while maintaining the original index.
You can then assign the result to a new column if you wish.
s = (df.groupby(['order','sku'])['ext price'].transform('sum')
/ df.groupby('order')['ext price'].transform('sum'))
print(s)
0 0.409342
1 0.403249
2 0.187409
3 0.429090
4 0.034942
5 0.429090
6 0.424170
7 0.111798
8 0.791222
9 0.134058
10 0.056063
11 0.018657
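Assigning the result back as a column could look like the sketch below. It uses only the rows of order 10005 from the data above, and the column name 'sku share' is made up for the example:

```python
import pandas as pd

# Subset of the question's data: the five rows of order 10005
df = pd.DataFrame({'order': [10005] * 5,
                   'sku': ['S1-06532', 'S1-82801', 'S1-06532',
                           'S1-47412', 'S1-27722'],
                   'ext price': [2679.36, 286.02, 832.95, 3472.04, 915.12]})

# transform returns one value per original row, so the result lines up with df
sku_sum = df.groupby(['order', 'sku'])['ext price'].transform('sum')
order_sum = df.groupby('order')['ext price'].transform('sum')

# 'sku share' is a hypothetical column name
df['sku share'] = sku_sum / order_sum
print(df)
```

Rows that share the same order and sku (here the two S1-06532 rows) get the same share.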
I want to convert an hourly pandas Series into a DataFrame indexed only by the date, with each hour as a column.
For example, let's say I have this Series:
YEAR = 2017
serie = pd.Series(pd.date_range(
start=f'{YEAR}-01-01', end=f'{YEAR}-12-31 23:00:00', freq='H'))
But I want it like:
h01 h02 h03 h04 h05 ...
Date
2017-01-01 data data data data data ...
I assume your Series has a DatetimeIndex and holds some data.
Then you need DataFrame.pivot, with DataFrame.assign to create new columns from DatetimeIndex.date and DatetimeIndex.strftime, and Series.to_frame to get a one-column DataFrame:
import numpy as np
import pandas as pd

YEAR = 2017
serie = pd.Series(np.arange(8760), pd.date_range(
    start=f'{YEAR}-01-01', end=f'{YEAR}-12-31 23:00:00', freq='H'))
df = serie.to_frame('vals').assign(date = lambda x: x.index.date,
                                   hour = lambda x: x.index.strftime('h%H'))
#print (df)
# pivot needs keyword arguments in pandas 2.0 and later
df1 = df.pivot(index='date', columns='hour', values='vals')
#print (df1)
Another solution:
serie.index = [serie.index.date, serie.index.strftime('h%H')]
df1 = serie.unstack()
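To see what the unstack variant produces, here is a small sketch on two days of made-up hourly data. The names hourly and wide are illustrative, and the level names 'Date' and 'hour' are my own additions. Note that strftime('h%H') labels the columns h00 to h23 rather than h01 to h24.

```python
import pandas as pd

# Two days of made-up hourly values: 0..47
idx = pd.date_range('2017-01-01', periods=48, freq=pd.Timedelta(hours=1))
hourly = pd.Series(range(48), index=idx)

# Replace the DatetimeIndex by a (date, hour label) MultiIndex
wide = hourly.copy()
wide.index = pd.MultiIndex.from_arrays(
    [hourly.index.date, hourly.index.strftime('h%H')],
    names=['Date', 'hour'])

# Move the hour level into the columns: one row per date, one column per hour
wide = wide.unstack('hour')
print(wide)
```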