Upsample data and interpolate - python

I have the following dataframe:
Month Col_1 Col_2
1 0,121 0,123
2 0,231 0,356
3 0,150 0,156
4 0,264 0,426
...
I need to resample this to weekly resolution and to interpolate between the points. The latter part, the interpolation is straight-forward. The reindex part is a bit tricky, on the other hand, at least for me.
If I use the DataFrame.reindex() method, it will only erase all the entries from the dataframe. I have tried to do it manually, by using .loc() to create new 'NaN' entries between each consecutive months, but this method overwrites the entries I already have.
Any clue how to do it? Thanks!

I have to assume a start date, I chose 2009-12-31.
To get resample to work, you need a pd.DateTimeIndex.
start_date = pd.to_datetime('2009-12-31')
df.Month = df.Month.apply(lambda x: start_date + pd.offsets.MonthEnd(x))
df = df.set_index('Month')
df.resample('W').interpolate()
Replicable code
from StringIO import StringIO
import pandas as pd
text = """Month Col_1 Col_2
1 0,121 0,123
2 0,231 0,356
3 0,150 0,156
4 0,264 0,426"""
df = pd.read_csv(StringIO(text), decimal=',', delim_whitespace=True)
start_date = pd.to_datetime('2009-12-31')
df.Month = df.Month.apply(lambda x: start_date + pd.offsets.MonthEnd(x))
df = df.set_index('Month')
df.resample('W').interpolate()

Related

Python - calculating difference between price extracting time

I need to create a new column and the value should be:
the current fair_price - fair_price 15 minutes ago(or the closest row)
I need to filter who is the row 15 minutes before then calculate the diff.
import numpy as np
import pandas as pd
from datetime import timedelta
df = pd.DataFrame(pd.read_csv('./data.csv'))
def calculate_15min(row):
end_date = pd.to_datetime(row['date']) - timedelta(minutes=15)
mask = (pd.to_datetime(df['date']) <= end_date).head(1)
price_before = df.loc[mask]
return price_before['fair_price']
def calc_new_val(row):
return 'show date 15 minutes before, maybe it will be null, nope'
df['15_min_ago'] = df.apply(lambda row: calculate_15min(row), axis=1)
myFields = ['pkey_id', 'date', '15_min_ago', 'fair_price']
print(df[myFields].head(5))
df[myFields].head(5).to_csv('output.csv', index=False)
I did it using nodejs but python is not my beach, maybe you have a fast solution...
pkey_id,date,fair_price,15_min_ago
465620,2021-05-17 12:28:30,45080.23,fair_price_15_min_before
465625,2021-05-17 12:28:35,45060.17,fair_price_15_min_before
465629,2021-05-17 12:28:40,45052.74,fair_price_15_min_before
465633,2021-05-17 12:28:45,45043.89,fair_price_15_min_before
465636,2021-05-17 12:28:50,45040.93,fair_price_15_min_before
465640,2021-05-17 12:28:56,45049.95,fair_price_15_min_before
465643,2021-05-17 12:29:00,45045.38,fair_price_15_min_before
465646,2021-05-17 12:29:05,45039.87,fair_price_15_min_before
465650,2021-05-17 12:29:10,45045.55,fair_price_15_min_before
465652,2021-05-17 12:29:15,45042.53,fair_price_15_min_before
465653,2021-05-17 12:29:20,45039.34,fair_price_15_min_before
466377,2021-05-17 12:42:50,45142.74,fair_price_15_min_before
466380,2021-05-17 12:42:55,45143.24,fair_price_15_min_before
466393,2021-05-17 12:43:00,45130.98,fair_price_15_min_before
466398,2021-05-17 12:43:05,45128.13,fair_price_15_min_before
466400,2021-05-17 12:43:10,45140.9,fair_price_15_min_before
466401,2021-05-17 12:43:15,45136.38,fair_price_15_min_before
466404,2021-05-17 12:43:20,45118.54,fair_price_15_min_before
466405,2021-05-17 12:43:25,45120.69,fair_price_15_min_before
466407,2021-05-17 12:43:30,45121.37,fair_price_15_min_before
466413,2021-05-17 12:43:36,45133.71,fair_price_15_min_before
466415,2021-05-17 12:43:40,45137.74,fair_price_15_min_before
466419,2021-05-17 12:43:45,45127.96,fair_price_15_min_before
466431,2021-05-17 12:43:50,45100.83,fair_price_15_min_before
466437,2021-05-17 12:43:55,45091.78,fair_price_15_min_before
466438,2021-05-17 12:44:00,45084.75,fair_price_15_min_before
466445,2021-05-17 12:44:06,45094.08,fair_price_15_min_before
466448,2021-05-17 12:44:10,45106.51,fair_price_15_min_before
466456,2021-05-17 12:44:15,45122.97,fair_price_15_min_before
466461,2021-05-17 12:44:20,45106.78,fair_price_15_min_before
466466,2021-05-17 12:44:25,45096.55,fair_price_15_min_before
466469,2021-05-17 12:44:30,45088.06,fair_price_15_min_before
466474,2021-05-17 12:44:35,45086.12,fair_price_15_min_before
466491,2021-05-17 12:44:40,45065.95,fair_price_15_min_before
466495,2021-05-17 12:44:45,45068.21,fair_price_15_min_before
466502,2021-05-17 12:44:55,45066.47,fair_price_15_min_before
466506,2021-05-17 12:45:00,45063.82,fair_price_15_min_before
466512,2021-05-17 12:45:05,45070.48,fair_price_15_min_before
466519,2021-05-17 12:45:10,45050.59,fair_price_15_min_before
466523,2021-05-17 12:45:16,45041.13,fair_price_15_min_before
466526,2021-05-17 12:45:20,45038.36,fair_price_15_min_before
466535,2021-05-17 12:45:25,45029.72,fair_price_15_min_before
466553,2021-05-17 12:45:31,45016.2,fair_price_15_min_before
466557,2021-05-17 12:45:35,45011.2,fair_price_15_min_before
466559,2021-05-17 12:45:40,45007.04,fair_price_15_min_before
This is the CSV
Firstly convert your date column to datetime dtype:
df['date']=pd.to_datetime(df['date'])
Then filter values:
date15min=df['date']-pd.offsets.DateOffset(minutes=15)
out=df.loc[df['date'].isin(date15min.tolist())]
Now Finally do your calculations:
df['price_before_15min']=df['fair_price'].where(df['date'].isin((out['date']+pd.offsets.DateOffset(minutes=15)).tolist()))
df['price_before_15min']=df['price_before_15min'].diff()
df['date_before_15min']=date15min
Now If you print df you will get your desired output
Update:
For that purpose just make a slightly change in the above method:
out=df.loc[df['date'].dt.minute.isin(date15min.dt.minute.tolist())]
df['price_before_15min']=df['fair_price'].where(df['date'].dt.minute.isin((out['date']+pd.offsets.DateOffset(minutes=15)).dt.minute.tolist()))

Collect all transactions for each day and report total spent that day

I have a DataFrame that looks like this
date Burned
8/11/2019 7:00 0.0
8/11/2019 7:00 10101.0
8/11/2019 8:16 5.2
I have this code:
import pandas as pd
import numpy as np
# Read data from file 'filename.csv'
# (in the same directory that your python process is based)
# Control delimiters, rows, column names with read_csv (see later)
df = pd.read_csv("../example.csv")
# Preview the first 5 lines of the loaded data
df = df.assign(Burned = df['Quantity'])
df.loc[df['To'] != '0x0000000000000000000000000000000000000000', 'Burned'] = 0.0
# OR:
df['cum_sum'] = df['Burned'].cumsum()
df['percent_burned'] = df['cum_sum']/df['Quantity'].max()*100.0
a=pd.concat([df['DateTime'], df['Burned']], axis=1, keys=['date', 'Burned'])
b=a.groupby(df.index.date).count()
But I get this error: AttributeError: 'RangeIndex' object has no attribute 'date'
Basically I am wanting to sort all these times just by day since it has timestamps throughout the day. I don't care what time of the day different things occured, I just want to get the total number of 'Burned' per day.
First add parse_dates=['DateTime'] to read_csv for convert column Datetime:
df = pd.read_csv("../example.csv", parse_dates=['DateTime'])
Or first column:
df = pd.read_csv("../example.csv", parse_dates=[0])
In your solution is date column, so need Series.dt.date with sum:
b = a.groupby(a['date'].dt.date)['Burned'].sum().reset_index(name='Total')

Python: store a value in a variable so that you can recognize each reoccurence

If this question is unclear, I am very open to constructive criticism.
I have an excel table with about 50 rows of data, with the first column in each row being a date. I need to access all the data for only one date, and that date appears only about 1-5 times. It is the most recent date so I've already organized the table by date with the most recent being at the top.
So my goal is to store that date in a variable and then have Python look only for that variable (that date) and take only the columns corresponding to that variable. I need to use this code on 100's of other excel files as well, so it would need to arbitrarily take the most recent date (always at the top though).
My current code below simply takes the first 5 rows because I know that's how many times this date occurs.
import os
from numpy import genfromtxt
import pandas as pd
path = 'Z:\\folderwithcsvfile'
for filename in os.listdir(path):
file_path = os.path.join(path, filename)
if os.path.isfile(file_path):
broken_df = pd.read_csv(file_path)
df3 = broken_df['DATE']
df4 = broken_df['TRADE ID']
df5 = broken_df['AVAILABLE STOCK']
df6 = broken_df['AMOUNT']
df7 = broken_df['SALE PRICE']
print (df3)
#print (df3.head(6))
print (df4.head(6))
print (df5.head(6))
print (df6.head(6))
print (df7.head(6))
This is a relatively simple filtering operation. You state that you want to "take only the columns" that are the latest date, so I assume that an acceptable result will be a filter DataFrame with just the correct columns.
Here's a simple CSV that is similar to your structure:
DATE,TRADE ID,AVAILABLE STOCK
10/11/2016,123,123
10/11/2016,123,123
10/10/2016,123,123
10/9/2016,123,123
10/11/2016,123,123
Note that I mixed up the dates a little bit, because it's hacky and error-prone to just assume that the latest dates will be on the top. The following script will filter it appropriately:
import pandas as pd
import numpy as np
df = pd.read_csv('data.csv')
# convert the DATE column to datetimes
df['DATE'] = pd.to_datetime(df['DATE'])
# find the latest datetime
latest_date = df['DATE'].max()
# use index filtering to only choose the columns that equal the latest date
latest_rows = df[df['DATE'] == latest_date]
print (latest_rows)
# now you can perform your operations on latest_rows
In my example, this will print:
DATE TRADE ID AVAILABLE STOCK
0 2016-10-11 123 123
1 2016-10-11 123 123
4 2016-10-11 123 123

Replace text with numbers using dictionary in pandas

I'm trying to replace months represented as a character (e.g. 'NOV') for their numerical counterparts ('-11-'). I can get the following piece of code to work properly.
df_cohorts['ltouch_datetime'] = df_cohorts['ltouch_datetime'].str.replace('NOV','-11-')
df_cohorts['ltouch_datetime'] = df_cohorts['ltouch_datetime'].str.replace('DEC','-12-')
df_cohorts['ltouch_datetime'] = df_cohorts['ltouch_datetime'].str.replace('JAN','-01-')
However, to avoid redundancy, I'd like to use a dictionary and .replace to replace the character variable for all months.
r_month1 = {'JAN':'-01-','FEB':'-02-','MAR':'-03-','APR':'-04-','MAY':'-05-','JUN':'-06-','JUL':'-07-','AUG':'-08-','SEP':'-09-','OCT':'-10-','NOV':'-11-','DEC':'-12-'}
df_cohorts.replace({'conversion_datetime': r_month1,'ltouch_datetime': r_month1})
When I enter the code above, my output dataset is unchanged. For reference, please see my sample data below.
User_ID ltouch_datetime conversion_datetime
001 11NOV14:13:12:56 11NOV14:16:12:00
002 07NOV14:17:46:14 08NOV14:13:10:00
003 04DEC14:17:46:14 04DEC15:13:12:00
Thanks!
Let me suggest a different approach: You could parse the date strings into a column of pandas TimeStamps like this:
import pandas as pd
df = pd.read_table('data', sep='\s+')
for col in ('ltouch_datetime', 'conversion_datetime'):
df[col] = pd.to_datetime(df[col], format='%d%b%y:%H:%M:%S')
print(df)
# User_ID ltouch_datetime conversion_datetime
# 0 1 2014-11-11 13:12:56 2014-11-11 16:12:00
# 1 2 2014-11-07 17:46:14 2014-11-08 13:10:00
# 2 3 2014-12-04 17:46:14 2015-12-04 13:12:00
I would stop right here, since representing dates as TimeStamps is the ideal
form for the data in Pandas.
However, if you need/want date strings with 3-letter months like 'NOV' converted to -11-, then you can convert the Timestamps with strftime and apply:
for col in ('ltouch_datetime', 'conversion_datetime'):
df[col] = df[col].apply(lambda x: x.strftime('%d-%m-%y:%H:%M:%S'))
print(df)
yields
User_ID ltouch_datetime conversion_datetime
0 1 11-11-14:13:12:56 11-11-14:16:12:00
1 2 07-11-14:17:46:14 08-11-14:13:10:00
2 3 04-12-14:17:46:14 04-12-15:13:12:00
To answer your question literally, in order to use Series.str.replace you need a column with the month string abbreviations all by themselves. You can arrange for that by first calling Series.str.extract. Then you can join the columns back into one using apply:
import pandas as pd
import calendar
month_map = {calendar.month_abbr[m].upper():'-{:02d}-'.format(m)
for m in range(1,13)}
df = pd.read_table('data', sep='\s+')
for col in ('ltouch_datetime', 'conversion_datetime'):
tmp = df[col].str.extract(r'(.*?)(\D+)(.*)')
tmp[1] = tmp[1].replace(month_map)
df[col] = tmp.apply(''.join, axis=1)
print(df)
yields
User_ID ltouch_datetime conversion_datetime
0 1 11-11-14:13:12:56 11-11-14:16:12:00
1 2 07-11-14:17:46:14 08-11-14:13:10:00
2 3 04-12-14:17:46:14 04-12-15:13:12:00
Finally, although you haven't asked for this directly, it's good to be aware
that if your data is in a file, you can parse the datestring columns into
TimeStamps directly using
import pandas as pd
import datetime as DT
df = pd.read_table(
'data', sep='\s+', parse_dates=[1,2],
date_parser=lambda x: DT.datetime.strptime(x, '%d%b%y:%H:%M:%S'))
This might be the most convenient method of all (assuming you want TimeStamps).

Time Series using numpy or pandas

I'm a beginner of Python related environment and I have problem with using time series data.
The below is my OHLC 1 minute data.
2011-11-01,9:00:00,248.50,248.95,248.20,248.70
2011-11-01,9:01:00,248.70,249.00,248.65,248.85
2011-11-01,9:02:00,248.90,249.25,248.70,249.15
...
2011-11-01,15:03:00,250.25,250.30,250.05,250.15
2011-11-01,15:04:00,250.15,250.60,250.10,250.60
2011-11-01,15:15:00,250.55,250.55,250.55,250.55
2011-11-02,9:00:00,245.55,246.25,245.40,245.80
2011-11-02,9:01:00,245.85,246.40,245.75,246.35
2011-11-02,9:02:00,246.30,246.45,245.75,245.80
2011-11-02,9:03:00,245.75,245.85,245.30,245.35
...
I'd like to extract the last "CLOSE" data per each row and convert data format like the following:
2011-11-01, 248.70, 248.85, 249.15, ... 250.15, 250.60, 250.55
2011-11-02, 245.80, 246.35, 245.80, ...
...
I'd like to calculate the highest Close value and it's time(minute) per EACH DAY like the following:
2011-11-01, 10:23:03, 250.55
2011-11-02, 11:02:36, 251.00
....
Any help would be very appreciated.
Thank you in advance,
You can use the pandas library. In the case of your data you can get the max as:
import pandas as pd
# Read in the data and parse the first two columns as a
# date-time and set it as index
df = pd.read_csv('your_file', parse_dates=[[0,1]], index_col=0, header=None)
# get only the fifth column (close)
df = df[[5]]
# Resample to date frequency and get the max value for each day.
df.resample('D', how='max')
If you want to show also the times, keep them in your DataFrame as a column and pass a function that will determine the max close value and return that row:
>>> df = pd.read_csv('your_file', parse_dates=[[0,1]], index_col=0, header=None,
usecols=[0, 1, 5], names=['d', 't', 'close'])
>>> df['time'] = df.index
>>> df.resample('D', how=lambda group: group.iloc[group['close'].argmax()])
close time
d_t
2011-11-01 250.60 2011-11-01 15:04:00
2011-11-02 246.35 2011-11-02 09:01:00
And if you wan't a list of the prices per day then just do a groupby per day and return the list of all the prices from every group using the apply on the grouped object:
>>> df.groupby(lambda dt: dt.date()).apply(lambda group: list(group['close']))
2011-11-01 [248.7, 248.85, 249.15, 250.15, 250.6, 250.55]
2011-11-02 [245.8, 246.35, 245.8, 245.35]
For more information take a look at the docs: Time Series
Update for the concrete data set:
The problem with your data set is that you have some days without any data, so the function passed in as the resampler should handle those cases:
def func(group):
if len(group) == 0:
return None
return group.iloc[group['close'].argmax()]
df.resample('D', how=func).dropna()

Categories

Resources