Generating missing consecutive dates between dates - python

I have a file that is dynamically generated (i.e., the file headers stay the same but the values change). For instance, suppose the file looks like this:
ID,CLASS,DATE,MRK
1,321,02/12/2016,30
2,321,05/12/2016,40
3,321,06/12/2016,0
4,321,07/12/2016,60
5,321,10/12/2016,70
6,876,5/12/2016,100
7,876,7/12/2016,80
Notice that for CLASS 321 some dates are missing, namely 03/12/2016, 04/12/2016, 08/12/2016 and 09/12/2016. I'm trying to insert the missing dates in the appropriate places, with the corresponding MRK value set to 0. The expected output would be:
ID,CLASS,DATE,MRK
1,321,02/12/2016,30
2,321,03/12/2016,0
3,321,04/12/2016,0
4,321,05/12/2016,40
5,321,06/12/2016,0
6,321,07/12/2016,60
7,321,08/12/2016,0
8,321,09/12/2016,0
9,321,10/12/2016,70
10,876,5/12/2016,100
11,876,6/12/2016,0
12,876,7/12/2016,80
This is what I came up with so far:
import pandas as pd
df = pd.read_csv('In.txt')
resampled_df = df.resample('D').mean()
print(resampled_df)
But I'm getting exception:
TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'RangeIndex'
Could somebody help out a Python newbie here?
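The error occurs because resample only works on a time-based index, and read_csv gives you a plain RangeIndex unless told otherwise. A minimal illustration:
import pandas as pd

df = pd.DataFrame({'MRK': [30, 40]})   # default RangeIndex, no dates
try:
    df.resample('D').mean()
except TypeError as e:
    print(e)   # Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex...
The answers below fix this by parsing DATE during the read and moving it into the index.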

Read your CSV like this -
df = pd.read_csv('file.csv',
                 sep=',',
                 parse_dates=['DATE'],
                 dayfirst=True,  # important, since the day comes first in these dates
                 index_col=['DATE'])
Now, call groupby + resample + first, and tie up loose ends -
import numpy as np

df = df.groupby('CLASS').resample('1D')[['ID', 'MRK']].first()
df['ID'] = np.arange(1, len(df) + 1)          # renumber the IDs over the new rows
df['MRK'] = df['MRK'].fillna(0).astype(int)   # inserted dates get MRK = 0
df.reset_index()
CLASS DATE ID MRK
0 321 2016-12-02 1 30
1 321 2016-12-03 2 0
2 321 2016-12-04 3 0
3 321 2016-12-05 4 40
4 321 2016-12-06 5 0
5 321 2016-12-07 6 60
6 321 2016-12-08 7 0
7 321 2016-12-09 8 0
8 321 2016-12-10 9 70
9 876 2016-12-05 10 100
10 876 2016-12-06 11 0
11 876 2016-12-07 12 80
In particular, MRK needs fillna so the inserted dates get 0; ID is simply renumbered afterwards.
If the order of columns is important, here's another version.
df = pd.read_csv('file.csv',
                 sep=',',
                 parse_dates=['DATE'],
                 dayfirst=True)
c = df.columns
df = df.set_index('DATE').groupby('CLASS').resample('1D')[['MRK']].first()
df['MRK'] = df.MRK.fillna(0).astype(int)
df['ID'] = np.arange(1, len(df) + 1)
df = df.reset_index().reindex(columns=c)
df['DATE'] = df['DATE'].dt.strftime('%d/%m/%Y')
df
ID CLASS DATE MRK
0 1 321 02/12/2016 30
1 2 321 03/12/2016 0
2 3 321 04/12/2016 0
3 4 321 05/12/2016 40
4 5 321 06/12/2016 0
5 6 321 07/12/2016 60
6 7 321 08/12/2016 0
7 8 321 09/12/2016 0
8 9 321 10/12/2016 70
9 10 876 05/12/2016 100
10 11 876 06/12/2016 0
11 12 876 07/12/2016 80

First convert the DATE column to datetimes, then group by CLASS and resample, and finally add the ID column with insert:
df['DATE'] = pd.to_datetime(df['DATE'], dayfirst=True)
df = (df.set_index('DATE')
        .groupby('CLASS')
        .resample('d')['MRK']
        .asfreq()
        .fillna(0)
        .astype(int)
        .reset_index())
df.insert(0, 'ID', range(1, len(df) + 1))
print(df)
ID CLASS DATE MRK
0 1 321 2016-12-02 30
1 2 321 2016-12-03 0
2 3 321 2016-12-04 0
3 4 321 2016-12-05 40
4 5 321 2016-12-06 0
5 6 321 2016-12-07 60
6 7 321 2016-12-08 0
7 8 321 2016-12-09 0
8 9 321 2016-12-10 70
9 10 876 2016-12-05 100
10 11 876 2016-12-06 0
11 12 876 2016-12-07 80
Alternative solution:
df = (df.set_index('DATE')
        .groupby('CLASS')
        .resample('d')['MRK']
        .first()
        .fillna(0)
        .astype(int)
        .reset_index())
df.insert(0, 'ID', range(1, len(df) + 1))
print(df)
ID CLASS DATE MRK
0 1 321 2016-12-02 30
1 2 321 2016-12-03 0
2 3 321 2016-12-04 0
3 4 321 2016-12-05 40
4 5 321 2016-12-06 0
5 6 321 2016-12-07 60
6 7 321 2016-12-08 0
7 8 321 2016-12-09 0
8 9 321 2016-12-10 70
9 10 876 2016-12-05 100
10 11 876 2016-12-06 0
11 12 876 2016-12-07 80
Finally, to get the same date format as the input, use strftime:
df['DATE'] = df['DATE'].dt.strftime('%d/%m/%Y')
print(df)
ID CLASS DATE MRK
0 1 321 02/12/2016 30
1 2 321 03/12/2016 0
2 3 321 04/12/2016 0
3 4 321 05/12/2016 40
4 5 321 06/12/2016 0
5 6 321 07/12/2016 60
6 7 321 08/12/2016 0
7 8 321 09/12/2016 0
8 9 321 10/12/2016 70
9 10 876 05/12/2016 100
10 11 876 06/12/2016 0
11 12 876 07/12/2016 80
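For reference, the same output can be produced without resample, by reindexing each class onto its full daily range. A sketch with a hypothetical fill_days helper (the name and structure are illustrative, not from the answers above):
import pandas as pd

df = pd.read_csv('file.csv', parse_dates=['DATE'], dayfirst=True)

def fill_days(g):
    # build the complete daily range for one class; missing days become NaN rows
    full = pd.date_range(g['DATE'].min(), g['DATE'].max(), freq='D')
    return g.set_index('DATE').reindex(full).rename_axis('DATE')

out = df.groupby('CLASS')[['DATE', 'MRK']].apply(fill_days).reset_index()
out['MRK'] = out['MRK'].fillna(0).astype(int)
out.insert(0, 'ID', range(1, len(out) + 1))
print(out)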

Related

How to divide multiple columns based on three conditions

This is my dataset, where I have different countries, different models for each country, years, and the price and volume.
data_dic = {
    "Country": [1, 1, 1, 1, 2, 2, 2, 2],
    "Model": ["A", "B", "B", "A", "A", "B", "B", "A"],
    "Year": [2005, 2005, 2020, 2020, 2005, 2005, 2020, 2020],
    "Price": [100, 172, 852, 953, 350, 452, 658, 896],
    "Volume": [4, 8, 9, 10, 12, 6, 8, 9]
}
df = pd.DataFrame(data_dic)
Country Model Year Price Volume
0 1 A 2005 100 4
4 2 A 2005 350 12
3 1 A 2020 953 10
7 2 A 2020 896 9
1 1 B 2005 172 8
5 2 B 2005 452 6
2 1 B 2020 852 9
6 2 B 2020 658 8
I would like to obtain the following, where 1) column "Division_Price" is the division of Price between the years 2005 and 2020 for each Country and Model (e.g., Country 1 with Model A), and 2) column "Division_Volume" is the same division for Volume:
data_dic2 = {
    "Country": [1, 1, 1, 1, 2, 2, 2, 2],
    "Model": ["A", "B", "B", "A", "A", "B", "B", "A"],
    "Year": [2005, 2005, 2020, 2020, 2005, 2005, 2020, 2020],
    "Price": [100, 172, 852, 953, 350, 452, 658, 896],
    "Volume": [4, 8, 9, 10, 12, 6, 8, 9],
    "Division_Price": [0.953, 4.95, 4.95, 0.953, 2.56, 1.45, 1.45, 2.56],
    "Division_Volume": [2.5, 1.125, 1.125, 2.5, 1, 1.33, 1.33, 1]
}
print(pd.DataFrame(data_dic2))
Country Model Year Price Volume Division_Price Division_Volume
0 1 A 2005 100 4 0.953 2.500
4 2 A 2005 350 12 2.560 1.000
3 1 A 2020 953 10 0.953 2.500
7 2 A 2020 896 9 2.560 1.000
1 1 B 2005 172 8 4.950 1.125
5 2 B 2005 452 6 1.450 1.330
2 1 B 2020 852 9 4.950 1.125
6 2 B 2020 658 8 1.450 1.330
My whole dataset has up to 50 countries and I have up to 10 models with years ranging 1990 to 2030.
I am still unsure how to account for the multiple conditions across the three columns so that I can automatically divide Price and Volume based on Country, Year and Model.
Thanks!
You can try the following, using df.pivot, df.stack() and df.merge:
>>> df2 = (df.pivot(index='Year', columns=['Model', 'Country'], values=['Price', 'Volume'])
             .diff().bfill(downcast='infer').abs().stack().stack()
             .sort_index(level=-1).add_prefix('Difference_'))
>>> df2
Difference_Price Difference_Volume
Year Country Model
2005 1 A 853 6
2 A 546 3
2020 1 A 853 6
2 A 546 3
2005 1 B 680 1
2 B 206 2
2020 1 B 680 1
2 B 206 2
>>> df.merge(df2, on=['Country', 'Model', 'Year'], how='right')
Country Model Year Price Volume Difference_Price Difference_Volume
0 1 A 2005 100 4 853 6
1 2 A 2005 350 12 546 3
2 1 A 2020 953 10 853 6
3 2 A 2020 896 9 546 3
4 1 B 2005 172 8 680 1
5 2 B 2005 452 6 206 2
6 1 B 2020 852 9 680 1
7 2 B 2020 658 8 206 2
EDIT:
For your new dataframe I think the 0.953 should be 9.530; if so, you can use pct_change and add 1:
>>> df2 = (df.pivot(index='Year', columns=['Model', 'Country'], values=['Price', 'Volume'])
             .pct_change(1).add(1).bfill(downcast='infer').abs().stack().stack()
             .sort_index(level=-1).add_prefix('Division_').round(3))
>>> df2
Division_Price Division_Volume
Year Country Model
2005 1 A 9.530 2.500
2 A 2.560 0.750
2020 1 A 9.530 2.500
2 A 2.560 0.750
2005 1 B 4.953 1.125
2 B 1.456 1.333
2020 1 B 4.953 1.125
2 B 1.456 1.333
>>> df.merge(df2, on=['Country', 'Model', 'Year'], how='right')
Country Model Year Price Volume Division_Price Division_Volume
0 1 A 2005 100 4 9.530 2.500
1 2 A 2005 350 12 2.560 0.750
2 1 A 2020 953 10 9.530 2.500
3 2 A 2020 896 9 2.560 0.750
4 1 B 2005 172 8 4.953 1.125
5 2 B 2005 452 6 1.456 1.333
6 1 B 2020 852 9 4.953 1.125
7 2 B 2020 658 8 1.456 1.333
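If the pivot/stack reshaping feels heavy, the same ratios can be sketched with groupby and transform, taking each (Country, Model) pair's last-year value over its first-year value. This assumes exactly one row per pair and year, as in the sample data:
import pandas as pd

df = pd.DataFrame(data_dic).sort_values('Year')
for col in ['Price', 'Volume']:
    # ratio of the latest year's value to the earliest year's value per pair
    df['Division_' + col] = (df.groupby(['Country', 'Model'])[col]
                               .transform(lambda s: s.iloc[-1] / s.iloc[0])
                               .round(3))
print(df.sort_index())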

Sort pandas csv list with string date

I have read a couple of similar posts about this issue, but none of the solutions worked for me. This is the CSV I have:
Score date term
0 72 3 Feb ·   1
1 47 1 Feb ·   1
2 119 6 Feb ·   1
8 101 7 hrs ·   1
9 536 11 min ·   1
10 53 2 hrs ·   1
11 20 11 Feb ·   3
3 15 1 hrs ·   2
4 33 7 Feb ·   1
5 153 4 Feb ·   3
6 34 3 min ·   2
7 26 3 Feb ·   3
I want to sort the CSV by date. What's the easiest way to do that?
You can create 2 helper columns: one with datetimes created by to_datetime and a second with timedeltas created by to_timedelta. to_timedelta needs the HH:MM:SS format, so Series.replace with regexes converts the 'min' and 'hrs' strings first; finally, DataFrame.sort_values sorts by both columns:
df['date1'] = pd.to_datetime(df['date'], format='%d %b', errors='coerce')
times = df['date'].replace({r'(\d+)\s+min': r'00:\1:00',
                            r'(\d+)\s+hrs': r'\1:00:00'},
                           regex=True)
df['times'] = pd.to_timedelta(times, errors='coerce')
df = df.sort_values(['times', 'date1'])
print(df)
Score date term date1 times
6 34 3 min 2 NaT 00:03:00
9 536 11 min 1 NaT 00:11:00
3 15 1 hrs 2 NaT 01:00:00
10 53 2 hrs 1 NaT 02:00:00
8 101 7 hrs 1 NaT 07:00:00
1 47 1 Feb 1 1900-02-01 NaT
0 72 3 Feb 1 1900-02-03 NaT
7 26 3 Feb 3 1900-02-03 NaT
5 153 4 Feb 3 1900-02-04 NaT
2 119 6 Feb 1 1900-02-06 NaT
4 33 7 Feb 1 1900-02-07 NaT
11 20 11 Feb 3 1900-02-11 NaT
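If a single sort key is preferred over two helper columns, the parsed dates can be combined with the timedeltas subtracted from an anchor timestamp. The anchor below is a hypothetical "now" chosen to land after all absolute dates, so adjust it to how the data was scraped:
import pandas as pd

dates = pd.to_datetime(df['date'], format='%d %b', errors='coerce')
times = pd.to_timedelta(
    df['date'].replace({r'(\d+)\s+min': r'00:\1:00',
                        r'(\d+)\s+hrs': r'\1:00:00'}, regex=True),
    errors='coerce')
anchor = pd.Timestamp('1900-03-01')        # hypothetical "now", after all parsed dates
df['key'] = dates.fillna(anchor - times)   # older entries get earlier keys
df = df.sort_values('key')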

How to compress rows after groupby in pandas

I have performed a groupby on my dataframe.
grouped = data_df.groupby(['Cluster','Visit Number Final'])['Visitor_ID'].count()
I am getting the output below:
data_df.groupby(['Cluster','Visit Number Final'])['Visitor_ID'].count()
Out[81]:
Cluster Visit Number Final
0 1 21846
2 1485
3 299
4 95
5 24
6 8
7 3
1 1 33600
2 2283
3 404
4 117
5 34
6 7
2 1 5858
2 311
3 55
4 14
5 6
6 3
7 1
3 1 19699
2 1101
3 214
4 78
5 14
6 8
7 3
4 1 10086
2 344
3 59
4 14
5 3
6 1
Name: Visitor_ID, dtype: int64
Now I want to compress the rows whose Visit Number Final is greater than 3 (i.e., add a single row holding the sum for Visit Number Final 4, 5, 6, ...). I tried groupby.filter but am not getting the expected output.
My final output should look like this:
Cluster Visit Number Final
0 1 21846
2 1485
3 299
>=4 130
1 1 33600
2 2283
3 404
>=4 158
2 1 5858
2 311
3 55
>=4 24
3 1 19699
2 1101
3 214
>=4 103
4 1 10086
2 344
3 59
>=4 18
The easiest way is to replace the 'Visit Number Final' values bigger than 3 before you group the dataframe:
df.loc[df['Visit Number Final'] > 3, 'Visit Number Final'] = '>=4'
df.groupby(['Cluster','Visit Number Final'])['Visitor_ID'].count()
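Note that this writes the string '>=4' into what was an integer column of data_df; if the original values are still needed afterwards, a small sketch that works on a copy:
out = data_df.copy()
out.loc[out['Visit Number Final'] > 3, 'Visit Number Final'] = '>=4'
print(out.groupby(['Cluster', 'Visit Number Final'])['Visitor_ID'].count())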
Try this:
# here df is the grouped counts from above as a one-column DataFrame,
# e.g. df = grouped.to_frame('Number Final')
visit_val = df.index.get_level_values(1)
grp = np.where(visit_val > 3, '>=4', visit_val)
(df.groupby(['Cluster', grp])['Number Final'].sum()
   .reset_index().rename(columns={'level_1': 'Visit'}))
Output:
Cluster Visit Number Final
0 0 1 21846
1 0 2 1485
2 0 3 299
3 0 >=4 130
4 1 1 33600
5 1 2 2283
6 1 3 404
7 1 >=4 158
8 2 1 5858
9 2 2 311
10 2 3 55
11 2 >=4 24
12 3 1 19699
13 3 2 1101
14 3 3 214
15 3 >=4 103
16 4 1 10086
17 4 2 344
18 4 3 59
19 4 >=4 18
Or, to get a dataframe that keeps the MultiIndex:
(df.groupby(['Cluster', grp])['Number Final'].sum()
   .rename_axis(['Cluster', 'Visit']).to_frame())
Output:
Number Final
Cluster Visit
0 1 21846
2 1485
3 299
>=4 130
1 1 33600
2 2283
3 404
>=4 158
2 1 5858
2 311
3 55
>=4 24
3 1 19699
2 1101
3 214
>=4 103
4 1 10086
2 344
3 59
>=4 18
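For completeness, a cut-based variant that bins the visit numbers without modifying data_df at all (the bin edges and labels below are illustrative):
import pandas as pd

visits = pd.cut(data_df['Visit Number Final'],
                bins=[0, 1, 2, 3, float('inf')],
                labels=['1', '2', '3', '>=4'])
print(data_df.groupby(['Cluster', visits], observed=True)['Visitor_ID'].count())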

Pandas Dataframe calculating in intervals

I have a dataframe like this:
time value
0 1 214
1 4 234
2 5 253
3 7 272
4 9 201
5 11 221
6 13 211
7 15 201
8 17 199
I want to split it into intervals and, for every interval, calculate the difference between each value and the first row of that interval.
The result should look like this with an interval of 6, for example (the separator lines are just for illustration):
time value diff_to_first
0 1 214 0
1 4 234 20
2 5 253 39
--------------------------------
3 7 272 0
4 9 201 -71
5 11 221 -51
--------------------------------
6 13 211 0
7 15 201 -10
8 17 199 -12
With the following code I get the wanted result, but I think it is not very elegant. Are there better solutions (for example, how can I integrate the subset term into the loc statement)?
import pandas as pd

interval = 6
low = 0
df = pd.DataFrame([[1, 214], [4, 234], [5, 253], [7, 272], [9, 201], [11, 221],
                   [13, 211], [15, 201], [17, 199]], columns=['time', 'value'])
df['diff_to_first'] = None
maxvalue = df['time'].max()
while low <= maxvalue:
    high = low + interval
    subset = df[(df['time'] >= low) & (df['time'] < high)]
    first = subset.iloc[0]['value']
    df.loc[(df['time'] >= low) & (df['time'] < high), 'diff_to_first'] = (
        df.loc[(df['time'] >= low) & (df['time'] < high), 'value'] - first)
    low = high
You can make a new "group" column, then use groupby and apply a function that computes the difference to each group's first value. That is more elegant, though I think my way of creating the "group" column could itself be more elegant. =)
import numpy as np

def diff(df):
    # difference of each value to the first value of the group
    df['diff_to_first'] = df.value - df.value.values[0]
    return df

df['group'] = np.concatenate([[i] * 3 for i in range(len(df) // 3)])
df.groupby('group').apply(diff)
Output:
time value group diff_to_first
0 1 214 0 0
1 4 234 0 20
2 5 253 0 39
3 7 272 1 0
4 9 201 1 -71
5 11 221 1 -51
6 13 211 2 0
7 15 201 2 -10
8 17 199 2 -12
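If the helper column should not remain in the result, it can be dropped after the apply:
result = df.groupby('group').apply(diff).drop(columns='group')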
You can also group the dataframe into chunks of interval rows and difference the grouped data with a shift of 1 index (note this gives the difference to the previous row within each chunk, not to the chunk's first row):
interval = 3
groups = np.repeat(np.arange(len(df) // interval), interval)[:len(df)]
df['diff_to_first'] = df.value.groupby(groups).apply(lambda x: x - x.shift()).fillna(0)
Out:
time value diff_to_first
0 1 214 0.0
1 4 234 20.0
2 5 253 19.0
3 7 272 0.0
4 9 201 -71.0
5 11 221 20.0
6 13 211 0.0
7 15 201 -10.0
8 17 199 -2.0
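Both answers above chunk the frame by position (three rows at a time), which happens to line up with this data. Grouping on time // interval instead follows the question's time-based windows even when rows are unevenly spaced; a short transform-based sketch:
interval = 6
groups = df['time'] // interval      # 0 for times 0-5, 1 for 6-11, 2 for 12-17, ...
df['diff_to_first'] = df['value'] - df.groupby(groups)['value'].transform('first')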

Cumulative Sum by date (Month)

I have a pandas dataframe and I need to work out the cumulative sum for each month.
Date Amount
2017/01/12 50
2017/01/12 30
2017/01/15 70
2017/01/23 80
2017/02/01 90
2017/02/01 10
2017/02/02 10
2017/02/03 10
2017/02/03 20
2017/02/04 60
2017/02/04 90
2017/02/04 100
The cumulative sum is the running total per day (i.e., days 01-31) within each month. However, some days are missing. The data frame should look like:
Date Sum_Amount
2017/01/12 80
2017/01/15 150
2017/01/23 230
2017/02/01 100
2017/02/02 110
2017/02/03 140
2017/02/04 390
If you only need the cumulative sum within months, you can use groupby with sum per day and then group by the month of the index:
df.Date = pd.to_datetime(df.Date)
df = df.groupby('Date').Amount.sum()
df = df.groupby(df.index.month).cumsum().reset_index()
print(df)
Date Amount
0 2017-01-12 80
1 2017-01-15 150
2 2017-01-23 230
3 2017-02-01 100
4 2017-02-02 110
5 2017-02-03 140
6 2017-02-04 390
But if you need to distinguish both months and years, convert the index to month periods with to_period:
df = df.groupby(df.index.to_period('m')).cumsum().reset_index()
The difference is easier to see with a modified df that has a different year added:
print(df)
Date Amount
0 2017/01/12 50
1 2017/01/12 30
2 2017/01/15 70
3 2017/01/23 80
4 2017/02/01 90
5 2017/02/01 10
6 2017/02/02 10
7 2017/02/03 10
8 2018/02/03 20
9 2018/02/04 60
10 2018/02/04 90
11 2018/02/04 100
df.Date = pd.to_datetime(df.Date)
df = df.groupby('Date').Amount.sum()
df = df.groupby(df.index.month).cumsum().reset_index()
print(df)
Date Amount
0 2017-01-12 80
1 2017-01-15 150
2 2017-01-23 230
3 2017-02-01 100
4 2017-02-02 110
5 2017-02-03 120
6 2018-02-03 140
7 2018-02-04 390
df.Date = pd.to_datetime(df.Date)
df = df.groupby('Date').Amount.sum()
df = df.groupby(df.index.to_period('m')).cumsum().reset_index()
print(df)
Date Amount
0 2017-01-12 80
1 2017-01-15 150
2 2017-01-23 230
3 2017-02-01 100
4 2017-02-02 110
5 2017-02-03 120
6 2018-02-03 20
7 2018-02-04 270
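For convenience, the whole computation chains into one expression and also restores the Sum_Amount column name from the question (a compact restatement of the same method, not a different one):
import pandas as pd

df['Date'] = pd.to_datetime(df['Date'])
out = (df.groupby('Date')['Amount'].sum()          # collapse duplicate days
         .groupby(lambda d: d.to_period('M'))      # regroup by month period
         .cumsum()
         .rename('Sum_Amount')
         .reset_index())
print(out)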
