Let's suppose that I have a dataset which consists of the following columns:
Stock_id: the id of a stock
Date: a date of 2018 e.g. 25/03/2018
Stock_value: the value of the stock at this specific date
I have some dates, different for each stock, which are entirely missing from the dataset and I would like to fill them in.
By missing dates, I mean that there is not even a row for each of these dates; not that these exist on the dataset and simply that the Stock_value at the rows is NA etc.
A limitation is that some stocks were introduced to the stock market in some time in 2018 so apparently I do not want to fill in dates for these stocks while these stocks were not existent.
By this I mean that if a stock was introduced to the stock market at the 21/05/2018 then apparently I want to fill in any missing dates for this stock from 21/05/2018 to 31/12/2018 but not dates before the 21/05/2018.
What is the most efficient way to do this?
I have seen some posts on StackOverflow (post_1, post_2 etc) but I think that my case is a more special one so I would like to see an efficient way to do this.
Let me provide an example. Let's limit this only to two stocks and only to the week from 01/01/2018 to the 07/01/2018 otherwise it won't fit in here.
Let's suppose that I initially have the following:
Stock_id Date Stock_value
1 01/01/2018 124
1 02/01/2018 130
1 03/01/2018 136
1 05/01/2018 129
1 06/01/2018 131
1 07/01/2018 133
2 03/01/2018 144
2 04/01/2018 148
2 06/01/2018 150
2 07/01/2018 147
Thus for Stock_id = 1 the date 04/01/2018 is missing.
For Stock_id = 2 the date 05/01/2018 is missing and since the dates for this stock are starting at 03/01/2018 then the dates before this date should not be filled in (because the stock was introduced at the stock market at the 03/01/2018).
Hence, I would like to have the following as output:
Stock_id Date Stock_value
1 01/01/2018 124
1 02/01/2018 130
1 03/01/2018 136
1 04/01/2018 NA
1 05/01/2018 129
1 06/01/2018 131
1 07/01/2018 133
2 03/01/2018 144
2 04/01/2018 148
2 05/01/2018 NA
2 06/01/2018 150
2 07/01/2018 147
Use asfreq per groups, but if large data performance should be problematic:
df = (df.set_index( 'Date')
.groupby('Stock_id')['Stock_value']
.apply(lambda x: x.asfreq('D'))
.reset_index()
)
print (df)
Stock_id Date Stock_value
0 1 2018-01-01 124.0
1 1 2018-01-02 130.0
2 1 2018-01-03 136.0
3 1 2018-01-04 NaN
4 1 2018-01-05 129.0
5 1 2018-01-06 131.0
6 1 2018-01-07 133.0
7 2 2018-01-03 144.0
8 2 2018-01-04 148.0
9 2 2018-01-05 NaN
10 2 2018-01-06 150.0
11 2 2018-01-07 147.0
EDIT:
If want change values by minimal datetime per group with some scalar for maximum datetime, use reindex with date_range:
df = (df.set_index( 'Date')
.groupby('Stock_id')['Stock_value']
.apply(lambda x: x.reindex(pd.date_range(x.index.min(), '2019-02-20')))
.reset_index()
)
df.set_index(['Date', 'Stock_id']).unstack().fillna(method='ffill').stack().reset_index()
Related
I have a data-frame with date as the index and columns for a few companies' stock prices. I want to append a few new dates at the bottom but with only one of the cos' stock price filled in. How can I do that? I have tried append and concat but am running into issues because I am updating only one co's stock price.
Lets say this is the dataframe:
Co1 Co2 Co3 Co4
Date 1 100 200 300 400
Date 2 105 210 290 350
Date 3 102 205 325 380
I want to add Co3's prices for two new dates from a data-series, lets say:
Date
Date 4 300
Date 5 310
Date 6 305
Name:Co3, Length:3, dtype:float64
Would appreciate if anyone can guide!
Thanks
Let's try pandas.concat
print(Co3)
Date
Date 4 300
Date 5 310
Date 6 305
Name: Co3, dtype: int64
pd.concat([df1, pd.DataFrame(Co3)]).fillna(0.0)
Co1 Co2 Co3 Co4
Date
Date 1 100.0 200.0 300 400.0
Date 2 105.0 210.0 290 350.0
Date 3 102.0 205.0 325 380.0
Date 4 0.0 0.0 300 0.0
Date 5 0.0 0.0 310 0.0
Date 6 0.0 0.0 305 0.0
A sample CSV data in which the first column is a time stamp (date + time):
2018-01-01 10:00:00,23,43
2018-01-02 11:00:00,34,35
2018-01-05 12:00:00,25,4
2018-01-10 15:00:00,22,96
2018-01-01 18:00:00,24,53
2018-03-01 10:00:00,94,98
2018-04-20 10:00:00,90,9
2018-04-10 10:00:00,45,51
2018-01-01 10:00:00,74,44
2018-12-01 10:00:00,76,87
2018-11-01 10:00:00,76,87
2018-12-12 10:00:00,87,90
I already wrote some codes to do the monthly aggregated values task while waiting for someone to give me some suggestions.
Thanks #moys, anyway!
import pandas as pd
df = pd.read_csv('Sample.txt', header=None, names = ['Timestamp', 'Value 1', 'Value 2'])
df1['Timestamp'] = pd.to_datetime(df1['Timestamp'])
df1['Monthly'] = df1['Timestamp'].dt.to_period('M')
grouper = pd.Grouper(key='Monthly')
df2 = df1.groupby(grouper)['Value 1', 'Value 2'].sum().reset_index()
The output is:
Monthly Value 1 Value 2
0 2018-01 202 275
1 2018-03 94 98
2 2018-04 135 60
3 2018-12 163 177
4 2018-11 76 87
What if there's a dataset with more columns, how to motified the my code to make it automatically working on the dataset which has more columns?
2018-02-01 10:00:00,23,43,32
2018-02-02 11:00:00,34,35,43
2018-03-05 12:00:00,25,4,43
2018-02-10 15:00:00,22,96,24
2018-05-01 18:00:00,24,53,98
2018-02-01 10:00:00,94,98,32
2018-02-20 10:00:00,90,9,24
2018-07-10 10:00:00,45,51,32
2018-01-01 10:00:00,74,44,34
2018-12-04 10:00:00,76,87,53
2018-12-02 10:00:00,76,87,21
2018-12-12 10:00:00,87,90,98
You can do something like below
df.groupby(pd.to_datetime(df['date']).dt.month).sum().reset_index()
Output Here, 'date' column is the month number.
date val1 val2
0 1 202 275
1 3 94 98
2 4 135 60
3 11 76 87
4 12 163 177
My Dataframe df3 looks something like this:
Id Timestamp Data Group_Id
0 1 2018-01-01 00:00:05.523 125.5 101
1 2 2018-01-01 00:00:05.757 125.0 101
2 3 2018-01-02 00:00:09.507 127.0 52
3 4 2018-01-02 00:00:13.743 126.5 52
4 5 2018-01-03 00:00:15.407 125.5 50
...
11 11 2018-01-01 00:00:07.523 125.5 120
12 12 2018-01-01 00:00:08.757 125.0 120
13 13 2018-01-04 00:00:14.507 127.0 300
14 14 2018-01-04 00:00:15.743 126.5 300
15 15 2018-01-05 00:00:19.407 125.5 350
I wanted to resample using ffill for every second so that it looks like this:
Id Timestamp Data Group_Id
0 1 2018-01-01 00:00:06.000 125.00 101
1 2 2018-01-01 00:00:07.000 125.00 101
2 3 2018-01-01 00:00:08.000 125.00 101
3 4 2018-01-02 00:00:09.000 125.00 52
4 5 2018-01-02 00:00:10.000 127.00 52
...
My code:
def resample(df):
indexing = df[['Timestamp','Data']]
indexing['Timestamp']=pd.to_datetime(indexing['Timestamp'])
indexing =indexing.set_index('Timestamp')
indexing1= indexing.resample('1S',fill_method='ffill')
# indexing1 = indexing1.resample('D')
return indexing1
indexing = resample(df3)
but incurred error
ValueError: cannot reindex a non-unique index with a method or limit
I don't quite understand what this error mean. #jezrael from this similar question suggested using drop_duplicates with groupby. I am not sure what this does to the data as it seems there are no duplicates in my data? Can someone explain this please? Thanks.
This error is caused because of the following:
Id Timestamp Data Group_Id
0 1 2018-01-01 00:00:05.523 125.5 101
1 2 2018-01-01 00:00:05.757 125.0 101
When you resample both these timestamps to the nearest second they both become
2018-01-01 00:00:06 and pandas doesn't know which value for the data to pick
because it has two to select from. Instead what you can do is use an aggregation function
such as last (though mean, max, min may also be suitable) in order to
select one of the values. Then you can apply the forward fill.
Example:
from io import StringIO
import pandas as pd
df = pd.read_table(StringIO(""" Id Timestamp Data Group_Id
0 1 2018-01-01 00:00:05.523 125.5 101
1 2 2018-01-01 00:00:05.757 125.0 101
2 3 2018-01-02 00:00:09.507 127.0 52
3 4 2018-01-02 00:00:13.743 126.5 52
4 5 2018-01-03 00:00:15.407 125.5 50"""), sep='\s\s+')
df['Timestamp'] = pd.to_datetime(df['Timestamp']).dt.round('s')
df.set_index('Timestamp', inplace=True)
df = df.resample('1S').last().ffill()
I tried to ask this question previously, but it was too ambiguous so here goes again. I am new to programming, so I am still learning how to ask questions in a useful way.
In summary, I have a pandas dataframe that resembles "INPUT DATA" that I would like to convert to "DESIRED OUTPUT", as shown below.
Each row contains an ID, a DateTime, and a Value. For each unique ID, the first row corresponds to timepoint 'zero', and each subsequent row contains a value 5 minutes following the previous row and so on.
I would like to calculate the mean of all the IDs for every 'time elapsed' timepoint. For example, in "DESIRED OUTPUT" Time Elapsed=0.0 would have the value 128.3 (100+105+180/3); Time Elapsed=5.0 would have the value 150.0 (150+110+190/3); Time Elapsed=10.0 would have the value 133.3 (125+90+185/3) and so on for Time Elapsed=15,20,25 etc.
I'm not sure how to create a new column which has the value for the time elapsed for each ID (e.g. 0.0, 5.0, 10.0 etc). I think that once I know how to do that, then I can use the groupby function to calculate the means for each time elapsed.
INPUT DATA
ID DateTime Value
1 2018-01-01 15:00:00 100
1 2018-01-01 15:05:00 150
1 2018-01-01 15:10:00 125
2 2018-02-02 13:15:00 105
2 2018-02-02 13:20:00 110
2 2018-02-02 13:25:00 90
3 2019-03-03 05:05:00 180
3 2019-03-03 05:10:00 190
3 2019-03-03 05:15:00 185
DESIRED OUTPUT
Time Elapsed Mean Value
0.0 128.3
5.0 150.0
10.0 133.3
Here is one way , using transform with groupby get the group key 'Time Elapsed', then just groupby it get the mean
df['Time Elapsed']=df.DateTime-df.groupby('ID').DateTime.transform('first')
df.groupby('Time Elapsed').Value.mean()
Out[998]:
Time Elapsed
00:00:00 128.333333
00:05:00 150.000000
00:10:00 133.333333
Name: Value, dtype: float64
You can do this explicitly by taking advantage of the datetime attributes of the DateTime column in your DataFrame
First get the year, month and day for each DateTime since they are all changing in your data
df['month'] = df['DateTime'].dt.month
df['day'] = df['DateTime'].dt.day
df['year'] = df['DateTime'].dt.year
print(df)
ID DateTime Value month day year
1 1 2018-01-01 15:00:00 100 1 1 2018
1 1 2018-01-01 15:05:00 150 1 1 2018
1 1 2018-01-01 15:10:00 125 1 1 2018
2 2 2018-02-02 13:15:00 105 2 2 2018
2 2 2018-02-02 13:20:00 110 2 2 2018
2 2 2018-02-02 13:25:00 90 2 2 2018
3 3 2019-03-03 05:05:00 180 3 3 2019
3 3 2019-03-03 05:10:00 190 3 3 2019
3 3 2019-03-03 05:15:00 185 3 3 2019
Then append a sequential DateTime counter column (per this SO post)
the counter is computed within (1) each year, (2) then each month and then (3) each day
since the data are in multiples of 5 minutes, use this to scale the counter values (i.e. the counter will be in multiples of 5 minutes, rather than a sequence of increasing integers)
df['Time Elapsed'] = df.groupby(['year', 'month', 'day']).cumcount() + 1
df['Time Elapsed'] *= 5
print(df)
ID DateTime Value month day year cumulative_record
1 1 2018-01-01 15:00:00 100 1 1 2018 5
1 1 2018-01-01 15:05:00 150 1 1 2018 10
1 1 2018-01-01 15:10:00 125 1 1 2018 15
2 2 2018-02-02 13:15:00 105 2 2 2018 5
2 2 2018-02-02 13:20:00 110 2 2 2018 10
2 2 2018-02-02 13:25:00 90 2 2 2018 15
3 3 2019-03-03 05:05:00 180 3 3 2019 5
3 3 2019-03-03 05:10:00 190 3 3 2019 10
3 3 2019-03-03 05:15:00 185 3 3 2019 15
Perform the groupby over the newly appended counter column
dfg = df.groupby('Time Elapsed')['Value'].mean()
print(dfg)
Time Elapsed
5 128.333333
10 150.000000
15 133.333333
Name: Value, dtype: float64
I have a dataframe with a series of floats called balance and a series of timestamps called due_date. I'd like to create a new column called current that displays the balance if the due_date is >= today (all else ""), a column called 1-30 Days that displays the balance if the due_date is 1 to 30 days ago (all else ""), and a column called >30 Days that displays the balance if the due_date is more than 30 days ago (all else "").
Here are some example rows:
balance due_date
0 250.00 2017-10-22
1 400.00 2017-10-04
2 3000.00 2017-09-08
3 3000.00 2017-09-08
4 250.00 2017-08-05
Any help would be greatly appreciated.
Using pd.cut and pd.crosstab
df['diff']=(pd.to_datetime('today')-df.due_date).dt.days
df['New']=pd.cut(df['diff'],bins = [0,1,30,99999],labels=["current","1-30","more than 30"])
pd.concat([df,pd.crosstab(df.index.get_level_values(0),df.New).apply(lambda x: x.mul(df.balance))],axis=1)
Out[928]:
balance due_date diff New more than 30
row_0
0 250.0 2017-01-22 261 more than 30 250.0
1 400.0 2017-02-04 248 more than 30 400.0
2 3000.0 2017-02-08 244 more than 30 3000.0
3 3000.0 2017-02-08 244 more than 30 3000.0
4 250.0 2017-02-05 247 more than 30 250.0