I have been working through some tasks to develop my Pandas skills, but I ran into unexpected errors in the data files I used. I wanted to fix them myself, but I have no idea how.
Basically I have an Excel file with the columns PayType, Money, and Date. The PayType column has 4 different payment types: car rent payment, car service fee payment, and 2 more which are not important. On every car rent payment entry there is an automatic service fee deduction, which happens at exactly the same time. I used a pivot table with the PayTypes as columns, because I wanted to calculate the percentage of these fees.
Before pivot table: (screenshot)
Time difference example: (screenshot)
After pivot table: (screenshot)
import numpy as np
import pandas as pd

df = pd.read_excel('C:/Data.xlsx', sheet_name='Sheet1',
                   usecols=['PayType', 'Money', 'Date'])
df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d %H:%M:%S.%f')

# One column per PayType, indexed by Date
df = df.pivot_table(index=['Date'], columns=['PayType']).fillna(0)

# This is the part that does not work: trying to pull the two pay types
# back together when their timestamps differ by a second or two
df = pd.merge_asof(df['Money', 'serviceFee'], df['Money', 'carRenting'],
                   on='Date', tolerance=pd.Timedelta('2s'))

df['Percentage'] = df['Money', 'serviceFee'] / df['Money', 'carRenting'] * 100
df['Percentage'] = df['Percentage'].abs()
df['Charges'] = np.where(df['Percentage'].notna(),
                         np.where(df['Percentage'] > 26, 'Overcharge - 30%', 'Fixed - 25%'),
                         'Null')
df.to_excel("Finale123.xlsx")
In the pivot table, almost all of the car renting and fee payment entries happened at the same moment, so their times are equal and they end up in one row. But there are a few cases where the times for the car renting and the fee payment differ by just 1 or 2 seconds. Because of this time difference, they are split into 2 different rows.
I tried to use merge_asof, but it didn't work.
How can I merge 2 rows whose times differ by at most 2 seconds, given that this time column (Date) is also the index of the pivot table?
I had a similar problem. I needed to merge time series data from multiple sensors. The time interval of the sensor measurements is 5 seconds and the time format is yyyy:MM:dd HH:mm:ss. To do the merge, I also needed to sort the column used for the merge.
sensors_livingroom = load(filename_livingroom)   # load() returns a DataFrame with a 'time' column
sensors_bedroom = load(filename_bedroom)
sensors_livingroom = sensors_livingroom.set_index("time")
sensors_bedroom = sensors_bedroom.set_index("time")
sensors_livingroom.index = pd.to_datetime(sensors_livingroom.index, dayfirst=True)
sensors_bedroom.index = pd.to_datetime(sensors_bedroom.index, dayfirst=True)
sensors_livingroom.sort_index(inplace=True)
sensors_bedroom.sort_index(inplace=True)
# 'time' is now the index, so merge on the indexes rather than on a column
sensors = pd.merge_asof(sensors_bedroom, sensors_livingroom,
                        left_index=True, right_index=True, direction="nearest")
In my case I wanted to merge to the nearest time value, so I set the direction parameter to nearest. In your case, it seems that the time of one dataframe will always be smaller than the time of the other, so it may be better to set the direction parameter to forward or backward. See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge_asof.html
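Applied to the question above, a minimal sketch (the PayType values and column names are taken from the question's code, the rest is assumed) would keep the two pay types as separate frames built from the raw, un-pivoted data, sort them by Date, and merge with a small tolerance:

import pandas as pd

# split the raw (un-pivoted) data into one frame per pay type
rent = (df[df['PayType'] == 'carRenting'][['Date', 'Money']]
        .rename(columns={'Money': 'carRenting'})
        .sort_values('Date'))
fee = (df[df['PayType'] == 'serviceFee'][['Date', 'Money']]
       .rename(columns={'Money': 'serviceFee'})
       .sort_values('Date'))

# match every service fee to the rent payment closest in time,
# but no more than 2 seconds away
merged = pd.merge_asof(fee, rent, on='Date',
                       direction='nearest', tolerance=pd.Timedelta('2s'))
merged['Percentage'] = (merged['serviceFee'] / merged['carRenting'] * 100).abs()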
I am trying to drop specific rows in a dataframe whose index is a datetime at 1-hour intervals during specific times of the day (it is hourly stock market data).
For instance: 2021-10-26 09:30:00-4:00, 2021-10-26 10:30:00-4:00, 2021-10-26 11:30:00-4:00, 2021-10-26 12:30:00-4:00, etc.
I want to be able to specify the rows to keep by hh:mm (e.g. keep just the 6:30 and 10:30 data each day) and drop all the rest.
I'm pretty new to programming so have absolutely no idea how to do this.
If your column labels are datetime (Timestamp) objects and not strings, you can do something like this:
df = pd.DataFrame()
# ...input data, etc...
kept = []
for col in df.columns:
    # a Timestamp exposes .hour and .minute directly
    if col.hour in (6, 10) and col.minute == 30:
        kept.append(col)
df = df[kept]
See the section on working with time in pandas, about halfway down this page:
https://www.dataquest.io/blog/python-datetime-tutorial/
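Since the question says the timestamps are the index rather than the columns, an alternative sketch (an addition, not from the source above) is to build a boolean mask from the index's clock times and keep only the wanted rows:

# keep only the 06:30 and 10:30 rows of each day; df is assumed to have a DatetimeIndex
mask = df.index.strftime('%H:%M').isin(['06:30', '10:30'])
df = df[mask]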
I have a weekly time series of multiple variables and I am trying to see what percent rank the last 26-week correlation would have versus all previous 26-week correlations.
I can generate a correlation matrix for the first 26-week period using the corr function in pandas, but I don't know how to loop through all previous periods to find the values of these correlations and then rank them.
I hope there is a better way to achieve this; if so, please let me know.
I have tried calculating parallel dataframes, but I couldn't write a formula to rank the most recent one, so I believe the solution lies with multi-indexing.
daterange = pd.date_range('20160701', periods=100, freq='1W')
np.random.seed(120)
df_corr = pd.DataFrame(np.random.rand(100, 5), index=daterange, columns=list('abcde'))
df_corr_chg = df_corr.diff()
df_corr_chg = df_corr_chg[1:]
df_corr_chg = df_corr_chg.replace(0, 0.01)
d = df_corr_chg.shape[0]
df_CCC = df_corr_chg[::-1]   # most recent week first
for s in range(0, d - 26):
    i = df_CCC.iloc[s:26 + s]   # each trailing 26-week window; this is where I am stuck
I am looking for a multi-indexed table showing the correlations at different times
Example of output
e.g.:
        a          b
a  1    1         -0.101713
   2    1         -0.031109
   n    1          0.471764
b  1   -0.101713   1
   2   -0.031109   1
   n    0.471764   1
Here is a recipe for how you could approach the problem.
I assume you have one price per week (otherwise just pre-aggregate your dataframe).
# In case your weeks are not numbered:
# sort your dataframe by symbol (EUR, SPX, ...) and date descending,
# so the most recent rows of each symbol come first.
df.sort_values(['symbol', 'date'], ascending=False, inplace=True)
# Now build a boolean indexer that flags the 26 most recent rows per symbol.
indexer = df.groupby('symbol').cumcount() < 26
# Correlation matrix between symbols over those 26 weeks.
df.loc[indexer].pivot(index='date', columns='symbol', values='pricecolumn').corr()
One more hint, in case you need to pre-aggregate your dataframe: you could add an auxiliary column with the week number, like this:
df['week_number'] = df['datefield'].dt.isocalendar().week   # .dt.week is deprecated in newer pandas
Then I guess you would like to have the last price of each week. You could do that as follows:
df_last = (df.sort_values(['symbol', 'week_number', 'date'], ascending=True, inplace=False)
             .groupby(['symbol', 'week_number'])
             .aggregate('last'))
df_last.reset_index(inplace=True)
Then use df_last in place of df above. Please check/change the field names I assumed.
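Working directly with the wide df_corr_chg frame from the question, a sketch of the looping and ranking step (an addition, not part of the recipe above) could look like this:

import pandas as pd

window = 26
corrs = {}
for end in range(window, len(df_corr_chg) + 1):
    win = df_corr_chg.iloc[end - window:end]
    # stack() turns the correlation matrix into a Series indexed by (var1, var2)
    corrs[win.index[-1]] = win.corr().stack()

hist = pd.DataFrame(corrs).T                  # one row per window end date, one column per pair
latest_rank = hist.rank(pct=True).iloc[-1]    # percent rank of the most recent window, pair by pair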
Python newbie here, but I have some intra-day financial data going back to 2012, so it has the same hours each day (the same trading session each day), just different dates. I want to be able to select certain times out of the data, check the corresponding OHLC data for that period, and then do some analysis on it.
At the moment it's a CSV file, and I'm doing:
import pandas as pd
data = pd.read_csv('data.csv')
date = data['date']
op = data['open']
high = data['high']
low = data['low']
close = data['close']
volume = data['volume']
The thing is that the date column is in the format "dd/mm/yyyy 00:00:00" as one string, so is it possible to still select between certain times, like between "09:00:00" and "10:00:00"? Or do I have to separate the time part from the date and make it its own column? If so, how?
I believe pandas has a between_time() function, but that seems to need a DataFrame with a datetime index, so how can I convert my data into that form? Then I should be able to use between_time to select the times I want. Also, because there are obviously thousands of days, each with their own "xx:xx:xx" to "xx:xx:xx", I want to pull that same time period from every day, not just the first occurrence of "xx:xx:xx" to "xx:xx:xx" as it works its way down the data, if that makes sense. Thanks!!
Consider the dataframe df
from pandas_datareader import data
df = data.get_data_yahoo('AAPL', start='2016-08-01', end='2016-08-03')
df = df.asfreq('H').ffill()
option 1
convert index to series then dt.hour.isin
slc = df.index.to_series().dt.hour.isin([9, 10])
df.loc[slc]
option 2
numpy broadcasting
slc = (df.index.hour[:, None] == [9, 10]).any(1)
df.loc[slc]
response to comment
To then get a range within that time slot per day, use resample + agg + np.ptp (peak to peak)
df.loc[slc].resample('D').agg(np.ptp)
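Since the question mentions between_time(): once the parsed timestamps are the index, between_time selects the same clock-time window on every day. A minimal sketch, assuming the 'date' column name from the question:

data['date'] = pd.to_datetime(data['date'], dayfirst=True)   # "dd/mm/yyyy HH:MM:SS" strings
data = data.set_index('date')

# every row between 09:00 and 10:00, on every trading day
morning = data.between_time('09:00', '10:00')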
I've created a pandas dataframe from a 205MB csv (approx 1.1 million rows by 15 columns). It holds a column called starttime that is dtype object (it's more precisely a string). The format is as follows: 7/1/2015 00:00:03.
I would like to create two new dataframes from this pandas dataframe. One should contain all rows corresponding with weekend dates, the other should contain all rows corresponding with weekday dates.
Weekend dates are:
weekends = ['7/4/2015', '7/5/2015', '7/11/2015', '7/12/2015',
'7/18/2015', '7/19/2015', '7/25/2015', '7/26/2015']
I attempted to convert the string to datetime (pd.to_datetime), hoping that would make the values easier to parse, but when I do, it hangs for so long that I ended up restarting the kernel several times.
Then I decided to use df["date"], df["time"] = zip(*df['starttime'].str.split(' ').tolist()) to create two new columns in the original dataframe (one for date, one for time). Next I figured I'd use a boolean test to 'flag' weekend records (according to the new date field) as True and all others False and create another column holding those values, then I'd be able to group by True and False.
For example,
test1 = bikes['date'] == '7/1/2015' returns True for all 7/1/2015 values, but I can't figure out how to iterate over all items in weekends so that I get True for all weekend dates. I tried this and broke Python (hung again):
for i in weekends:
    for k in df['date']:
        test2 = df['date'] == i
I'd appreciate any help (with both my logic and my code).
First, create a DataFrame of string timestamps with 1.1m rows:
df = pd.DataFrame({'date': ['7/1/2015 00:00:03', '7/1/2015 00:00:04'] * 550000})
Next, you can simply convert them to Pandas timestamps as follows:
df['ts'] = pd.to_datetime(df.date)
This operation took just under two minutes. However, it took under seven seconds if you specify the format:
df['ts'] = pd.to_datetime(df.date, format='%m/%d/%Y %H:%M:%S')
Now, it is easy to set up a weekend flag as follows (which took about 3 seconds):
df['weekend'] = [d.weekday() >= 5 for d in df.ts]
Finally, it is easy to subset your DataFrame, which takes virtually no time:
df_weekdays = df.loc[~df.weekend, :]
df_weekends = df.loc[df.weekend, :]
The weekend flag is to help explain what is happening. You can simplify as follows:
df_weekdays = df.loc[df.ts.apply(lambda ts: ts.weekday() < 5), :]
df_weekends = df.loc[df.ts.apply(lambda ts: ts.weekday() >= 5), :]
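As an aside (not part of the original timing comparison), the list comprehension can be replaced by the vectorized .dt accessor, which avoids the Python-level loop entirely:

df['weekend'] = df.ts.dt.dayofweek >= 5   # Saturday = 5, Sunday = 6
df_weekends = df.loc[df.weekend]
df_weekdays = df.loc[~df.weekend]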
I am trying to get a fairly basic resampling method to work with a pandas data frame. My data frame df is indexed by datetime entries and contains prices:
price
datetime
2000-08-16 09:29:55.755000 7.302786
2000-08-16 09:30:10.642000 7.304059
2000-08-16 09:30:26.598000 7.304435
2000-08-16 09:30:41.372000 7.304314
2000-08-16 09:30:56.718000 7.304334
I would like to downsample this to 5min. Using
df.resample('5Min', closed='left').last()
takes the closest point to the left of each 5-minute mark in my data; similarly
df.resample('5Min', closed='left').first()
takes the closest point to the right.
However, I would like to take the linear interpolation between the points to the left and to the right instead; e.g. if my df contains the two consecutive entries
time t1, price p1
time t2, price p2
and
t1<t<t2 where t is a multiple of 5min
then the resampled dataframe should have the entry
time t, price p1+(t-t1)/(t2-t1)*(p2-p1)
Try creating two separate dataframes, reset_index them (so they share the same numerical index), fillna on them, and then just do the math on df1 and df2, e.g.:
df1 = df.resample('5Min', closed='left').last().reset_index().ffill()
df2 = df.resample('5Min', closed='left').first().reset_index().ffill()
dt = df1.datetime - df2.datetime
px_fld = df1.price + ...
something like that should do the trick.
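If you want the exact linear interpolation described in the question rather than differencing two resampled frames, another sketch (an alternative, not the answer above) is to union a 5-minute grid with the original index, interpolate linearly in time, and keep only the grid points:

import pandas as pd

# 5-minute marks covered by the data
grid = pd.date_range(df.index.min().ceil('5Min'),
                     df.index.max().floor('5Min'), freq='5Min')

resampled = (df.reindex(df.index.union(grid))
               .interpolate(method='time')   # p1 + (t - t1) / (t2 - t1) * (p2 - p1)
               .reindex(grid))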