Pandas index match with time values - python

I am trying to index-match two dataframes and write the data back to Excel. The Excel file to be filled looks like this:
Name Location Date Open High TimeH Low TimeL Close
1 Orange New York 20200501.0 5.5 5.58 18:00 5.45 16:00 5.7
0 Apple Minsk 20200505.0 3.5 3.85 NaN 3.45 NaN 3.65
2 Steak Dallas 20200506.0 8.5 8.85 NaN 8.45 NaN 8.65
The 'TimeH' and 'TimeL' values should be looked up from a dataframe that looks like this:
Name Date Time Open High Low Close Volume VWAP Trades
4 Apple 20200505 15:30:00 3.50 3.85 3.45 3.70 1500 3.73 95
5 Apple 20200505 17:00:00 3.65 3.70 3.50 3.60 1600 3.65 54
6 Apple 20200505 20:00:00 3.80 3.85 3.35 3.81 1700 3.73 41
7 Apple 20200505 22:00:00 3.60 3.84 3.45 3.65 1800 3.75 62
4 Steak 20200506 10:00:00 8.50 8.85 8.45 8.70 1500 8.73 95
5 Steak 20200506 12:00:00 8.65 8.70 8.50 8.60 1600 8.65 54
6 Steak 20200506 14:00:00 8.80 8.85 8.45 8.81 1700 8.73 41
7 Steak 20200506 16:00:00 8.60 8.84 8.45 8.65 1800 8.75 62
These values should then be written back to the Excel file, which should look like this once everything has worked:
Name Location Date Open High TimeH Low TimeL Close
1 Orange New York 20200501.0 5.5 5.58 18:00:00 5.45 16:00:00 5.7
0 Apple Minsk 20200505.0 3.5 3.85 10:00:00 3.45 20:00:00 3.65
2 Steak Dallas 20200506.0 8.5 8.85 15:30:00 8.45 14:00:00 8.65
I was using the following code to index the values 'Open', 'High', 'Low', 'Close', which works great:
rdf13 = rdf12.groupby(['Name','Date']).agg(Open=('Open','first'),High=('High','max'),Low=('Low','min'), Close=('Close','last'),Volume=('Volume','sum'),VWAP=('VWAP','mean'),Trades=('Trades','sum')).reset_index()
result11 = pd.merge(rdf13, rdf11, how='inner', on=['Name', 'Date']).iloc[:,:-4].dropna(axis=1).rename(columns = {"Open_x": "Open", "High_x": "High", "Low_x": "Low", "Close_x": "Close", "Volume_x": "Volume", "VWAP_x": "VWAP", "Trades_x": "Trades"})
result12 = result11.reindex(index=result11.index[::-1])
result13 = result12[['Name', 'Location', 'Date', 'Check_2','Open', 'High', 'Low', 'Close', 'Volume', 'VWAP', 'Trades']].reset_index()
readfile11 = pd.read_excel(r"Trackers\TEST Tracker.xlsx")
readfile11['Count'] = np.arange(len(readfile11))
df11 = readfile11.set_index(['Name', 'Location', 'Date'])
df12 = result13.set_index(['Name', 'Location', 'Date'])
fdf11 = df12.combine_first(df11).reset_index().reindex(readfile11.columns, axis=1).sort_values('Count')
print("Updated Day1 Data Frame")
print(fdf11)
writefdf10 = fdf11.to_excel(r"Trackers\TEST Tracker.xlsx", "Entries", index=False)
But when I extend this to look up the TimeH value with the following code:
colnames40 = rdf12.rename(columns = {"Time": "TimeH"})
result41 = pd.merge(colnames40, rdf11, how='inner', on=['Name', 'Date', 'High']).iloc[:,:-4].dropna(axis=1).rename(columns = {"TimeH_x": "TimeH"})
result42 = result41.reindex(index=result41.index[::-1])
result43 = result42[['Name', 'Location', 'Date', 'Check_2', 'High', 'TimeH']].reset_index()
readfile41 = pd.read_excel(r"Trackers\TEST Tracker.xlsx")
readfile41['Count'] = np.arange(len(readfile41))
df41 = readfile41.set_index(['Name', 'Location', 'Date', 'High'])
df42 = result43.set_index(['Name', 'Location', 'Date', 'High'])
fdf41 = df42.combine_first(df41).reset_index().reindex(readfile41.columns, axis=1).sort_values('Count')
print("Updated Day3 Data Frame")
print(fdf41)
writefdf40 = fdf41.to_excel(r"Trackers\TEST Tracker.xlsx", "Entries", index=False)
it does not work and returns nothing, so the NaN values in the 'TimeH' column stay NaN. I messed around with the variables, but I either got errors because I did something wrong or it still returned NaN values.
Can someone here help me make Python look up the time values?

Apparently I just had a little typo in my code.
result41 = pd.merge(colnames40, rdf11, how='inner', on=['Name', 'Date', 'High']).iloc[:,:-4].dropna(axis=1).rename(columns = {"TimeH_x": "TimeH"})
should have been
result41 = pd.merge(colnames40, rdf31, how='inner', on=['Name', 'Date', 'High']).iloc[:,:-4].dropna(axis=1).rename(columns = {"TimeH_x": "TimeH"})
The problem now is that the data contains duplicate values, which makes sense because of the rdf31 reference, but the issue is that df.drop_duplicates(keep='first', inplace=False) returns 'None' values for some reason. That's outside the scope of this question though.
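As an aside, a more direct route to TimeH/TimeL is to take, for each (Name, Date) group, the Time of the row where High is largest and Low is smallest, via idxmax/idxmin, and merge that back into the tracker. A minimal sketch, not the approach above, assuming rdf12 is the intraday frame shown earlier with columns Name, Date, Time, High and Low:
# Look up the Time of each group's extreme directly via idxmax/idxmin.
intraday = rdf12.reset_index(drop=True)  # idxmax/idxmin need a unique index
grp = intraday.groupby(['Name', 'Date'])
hi = intraday.loc[grp['High'].idxmax(), ['Name', 'Date', 'Time']].rename(columns={'Time': 'TimeH'})
lo = intraday.loc[grp['Low'].idxmin(), ['Name', 'Date', 'Time']].rename(columns={'Time': 'TimeL'})
times = hi.merge(lo, on=['Name', 'Date'])  # one TimeH/TimeL row per Name and Date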

Related

alphavantage timeseries: fill missing datetimes, with the volume of filled rows set to 0

Currently, when I download data from alphavantage, I get gaps in the timestamps:
date open high low close volume
2022-08-01 04:15:00 1.00 1.01 0.99 1.00 200
2022-08-01 04:30:00 1.00 1.03 1.00 1.02 300
2022-08-01 05:45:00 1.02 1.04 1.00 1.03 500
as you can see, rows between 04:30 and 05:45 are missing and I would like to fill them in: 2022-08-01 04:45, 2022-08-01 05:00, etc.
I have some conditions too as follows:
missing datetime rows take their OHLC from the last timestamp, so 04:45's OHLC will just be 04:30's data.
volume is 0 for those newly created lines.
for consistency in future downloads, I'd like to specify start and end datetimes. So let's say this stock is AAPL and I want 04:15 to 09:30; in future I want TSLA to have the same timestamps as AAPL.
Anyway, desired output:
date open high low close volume
2022-08-01 04:15:00 1.00 1.01 0.99 1.00 200
2022-08-01 04:30:00 1.00 1.03 1.00 1.02 300
2022-08-01 04:45:00 1.00 1.03 1.00 1.02 0
2022-08-01 05:00:00 1.00 1.03 1.00 1.02 0
2022-08-01 05:15:00 1.00 1.03 1.00 1.02 0
2022-08-01 05:30:00 1.00 1.03 1.00 1.02 0
2022-08-01 05:45:00 1.02 1.04 1.00 1.03 500
Thanks a lot, my head hurts...
edit:
I tried this; the full code is here. However, it can only fill missing timestamps with the last row's data, so volume is also copied from the last timestamp. This is working code if you replace the API key with your own:
import pandas as pd
from alpha_vantage.timeseries import TimeSeries
import time
import pytz
from datetime import datetime
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 2000)
api_key = 'pls use your own api key'
def get_eastern_datetime(calctype=1):
    # 1 - datetime, 2 - date, 3 - time
    est = pytz.timezone('US/Eastern')
    fmt = '%d-%m-%Y %H-%M'
    if calctype == 1:
        fmt = '%d-%m-%Y %H-%M'
    elif calctype == 2:
        fmt = '%d-%m-%Y'
    elif calctype == 3:
        fmt = '%H:%M'
    current_date = datetime.now()
    return current_date.astimezone(est).strftime(fmt)

def get_alphavantage(ticker):
    # Pull data from alpha vantage: search a company name, get the symbol of the match, then pull data.
    ts = TimeSeries(key=api_key, output_format='pandas')
    tuple_back = ts.get_symbol_search(keywords=ticker)
    # print(tuple_back)
    symbol = tuple_back[0].iloc[0, 0].strip()  # tuple_back[0] is the df, so we take row 0, column 0, which is the symbol
    df, meta_data = ts.get_intraday(symbol=symbol, interval='15min', outputsize='full')  # check this line syntax
    # print(meta_data)
    # Clean up and save
    df = df.iloc[::-1]  # reverse rows so that the earliest is on top
    df.columns = ['open', 'high', 'low', 'close', 'volume']
    df.reset_index(inplace=True)  # must, to move date out of the index
    df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d %H:%M:%S')  # fixed: was '%Y-%M-%D %H:%M%S', an invalid format string
    print(df.head(20), '\n')
    # Resample
    df = df.set_index('date').sort_index().asfreq(freq='15T', method='ffill')
    df.to_csv(r"C:\your own path\{} {}.csv".format(symbol, get_eastern_datetime(calctype=1)))
    print(df.head(20), '\n')

if __name__ == '__main__':
    get_alphavantage('GME')
If I use .asfreq() without method='ffill', I get rows of NaN; I can then use .fillna(0) for volume and .ffill() for the other columns, and I get the expected results.
Minimal working example
data = '''date  open  high  low  close  volume
2022-08-01 04:15:00  1.00  1.01  0.99  1.00  200
2022-08-01 04:30:00  1.00  1.03  1.00  1.02  300
2022-08-01 05:45:00  1.02  1.04  1.00  1.03  500'''
import pandas as pd
import io
df = pd.read_csv(io.StringIO(data), sep=r'\s{2,}', engine='python')  # python engine for the regex separator
df['date'] = pd.to_datetime(df['date'])
# ---
df = df.set_index('date').sort_index().asfreq(freq='15T')
print('\n--- before ---\n')
print(df)
# ---
df['volume'] = df['volume'].fillna(0)
df['open'] = df['open'].ffill()
df['high'] = df['high'].ffill()
df['low'] = df['low'].ffill()
df['close'] = df['close'].ffill()
print('\n--- after ---\n')
print(df)
Result:
--- before ---
open high low close volume
date
2022-08-01 04:15:00 1.00 1.01 0.99 1.00 200.0
2022-08-01 04:30:00 1.00 1.03 1.00 1.02 300.0
2022-08-01 04:45:00 NaN NaN NaN NaN NaN
2022-08-01 05:00:00 NaN NaN NaN NaN NaN
2022-08-01 05:15:00 NaN NaN NaN NaN NaN
2022-08-01 05:30:00 NaN NaN NaN NaN NaN
2022-08-01 05:45:00 1.02 1.04 1.00 1.03 500.0
--- after ---
open high low close volume
date
2022-08-01 04:15:00 1.00 1.01 0.99 1.00 200.0
2022-08-01 04:30:00 1.00 1.03 1.00 1.02 300.0
2022-08-01 04:45:00 1.00 1.03 1.00 1.02 0.0
2022-08-01 05:00:00 1.00 1.03 1.00 1.02 0.0
2022-08-01 05:15:00 1.00 1.03 1.00 1.02 0.0
2022-08-01 05:30:00 1.00 1.03 1.00 1.02 0.0
2022-08-01 05:45:00 1.02 1.04 1.00 1.03 500.0
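The code above covers the first two conditions; for the third (identical timestamps across tickers), a hedged sketch that reindexes against an explicit fixed grid instead of asfreq, so every download spans the same window regardless of which bars the API returned (the window below is just the 04:15-09:30 example from the question):
import pandas as pd

grid = pd.date_range('2022-08-01 04:15', '2022-08-01 09:30', freq='15T')
df = df.reindex(grid)                  # same index for AAPL, TSLA, ...
df['volume'] = df['volume'].fillna(0)  # inserted bars had no trades
df[['open', 'high', 'low', 'close']] = df[['open', 'high', 'low', 'close']].ffill()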

Time Series: Fill NaNs from another dataframe

I am working with temperature data and I have created a file that has multi-year averages for a few thousand cities; the format is as below (df1):
Date City PRCP TMAX TMIN TAVG
01-Jan Zurich 0.94 3.54 0.36 1.95
01-Feb Zurich 4.12 9.14 3.04 6.09
01-Mar Zurich 4.1 5.9 0.3 3.1
01-Apr Zurich 0.32 13.78 4.22 9
01-May Zurich 9.42 11.32 5.34 8.33
...
I have the above data for all 365 days with no nulls. Notice that the date column only has day and month because year is irrelevant.
Based on the above data I am trying to clean yearly files; my second dataframe has data in the below format (df2):
ID Date City PRCP TAVG TMAX TMIN
abcd1 2020-01-01 Zurich 0 -1.9 -0.9 NaN
abcd1 2020-01-02 Zurich 9.1 NaN 12.7 4.9
abcd1 2020-01-03 Zurich 0.8 8.55 13.2 3.9
abcd1 2020-01-04 Zurich 0 4.1 10.8 -2.6
...
Each city has a unique ID. The date column has the format %Y-%m-%d.
I am trying to replace the nulls in the second dataframe with the values in my first dataframe by matching day and month. This is what I tried:
df1["Date"] = pd.to_datetime(df1["Date"], errors = 'coerce') ##date format change##
df1["Date"] = df1['Date'].dt.strftime('%d-%m')
df2 = df2.drop(columns='ID')
df2 = df2.fillna(df1) ##To replace nulls##
df1["Date"] = pd.to_datetime(df1["Date"], errors = 'coerce')
df1["Date"] = df1['Date'].dt.strftime('%Y-%m-%d') ## Change data back to original format##
Even with this I end up with nulls in my yearly file, i.e. df2. (Note: df1 has no nulls.)
Please suggest a better way to replace only the nulls, or any corrections to the code if necessary.
We can approach this by adding a column Date2 to df2 with the same format as the Date column of df1. Then, with both dataframes indexed on this date format and City, we perform an update on df2 using .update(), as follows:
df2["Date2"] = pd.to_datetime(df2["Date"], errors = 'coerce').dt.strftime('%d-%b') # dd-MMM (e.g. 01-JAN)
df2a = df2.set_index(['Date2', 'City']) # Create df2a from df2 with set index on Date2 and City
df2a.update(df1.set_index(['Date', 'City']), overwrite=False) # update only NaN values of df2a by corresponding values of df1
df2 = df2a.reset_index(level=1).reset_index(drop=True) # result put back to df2 throwing away the temp `Date2` row index
df2.insert(2, 'City', df2.pop('City')) # relocate column City back to its original position
.update() modifies a DataFrame in place using non-NA values from another DataFrame. The DataFrame's length does not increase as a result of the update; only values at matching index/column labels are updated. Hence we give both dataframes the same row index so that updates are performed on the corresponding columns with matching labels.
Note that we use the parameter overwrite=False in .update() to ensure we only update values that are NaN in the original DataFrame df2.
Demo
Data Setup:
Data added to df1 to showcase replacing values of df2 from df1:
print(df1)
Date City PRCP TMAX TMIN TAVG
0 01-Jan Zurich 0.94 3.54 0.36 1.95
1 02-Jan Zurich 0.95 3.55 0.37 1.96 <=== Added this row
2 01-Feb Zurich 4.12 9.14 3.04 6.09
3 01-Mar Zurich 4.10 5.90 0.30 3.10
4 01-Apr Zurich 0.32 13.78 4.22 9.00
5 01-May Zurich 9.42 11.32 5.34 8.33
print(df2) # before processing
ID Date City PRCP TAVG TMAX TMIN
0 abcd1 2020-01-01 Zurich 0.0 -1.90 -0.9 NaN <=== with NaN value
1 abcd1 2020-01-02 Zurich 9.1 NaN 12.7 4.9 <=== with NaN value
2 abcd1 2020-01-03 Zurich 0.8 8.55 13.2 3.9
3 abcd1 2020-01-04 Zurich 0.0 4.10 10.8 -2.6
Run the new code:
df2["Date2"] = pd.to_datetime(df2["Date"], errors = 'coerce').dt.strftime('%d-%b') # dd-MMM (e.g. 01-Jan)
df2a = df2.set_index(['Date2', 'City']) # Create df2a from df2 with set index on Date2 and City
df2a.update(df1.set_index(['Date', 'City']), overwrite=False) # update only NaN values of df2a by corresponding values of df1
df2 = df2a.reset_index(level=1).reset_index(drop=True) # result put back to df2 throwing away the temp `Date2` row index
df2.insert(2, 'City', df2.pop('City')) # relocate column City back to its original position
Result:
print(df2)
ID Date City PRCP TAVG TMAX TMIN
0 abcd1 2020-01-01 Zurich 0.0 -1.90 -0.9 0.36 <== TMIN updated with df1 value
1 abcd1 2020-01-02 Zurich 9.1 1.96 12.7 4.90 <== TAVG updated with df1 value
2 abcd1 2020-01-03 Zurich 0.8 8.55 13.2 3.90
3 abcd1 2020-01-04 Zurich 0.0 4.10 10.8 -2.60

How to add missing dates in pandas

I have the following dataframe:
data
Out[120]:
High Low Open Close Volume Adj Close
Date
2018-01-02 12.66 12.50 12.52 12.66 20773300.0 10.842077
2018-01-03 12.80 12.67 12.68 12.76 29765600.0 10.927719
2018-01-04 13.04 12.77 12.78 12.98 37478200.0 11.116128
2018-01-05 13.22 13.04 13.06 13.20 46121900.0 11.304538
2018-01-08 13.22 13.11 13.21 13.15 33828300.0 11.261715
... ... ... ... ... ...
2020-06-25 6.05 5.80 5.86 6.03 73612700.0 6.030000
2020-06-26 6.07 5.81 6.04 5.91 118435400.0 5.910000
2020-06-29 6.07 5.81 5.91 6.01 58208400.0 6.010000
2020-06-30 6.10 5.90 5.98 6.08 61909300.0 6.080000
2020-07-01 6.18 5.95 6.10 5.98 62333600.0 5.980000
[629 rows x 6 columns]
Some of the dates are missing from the Date index. I know I can do this to get all the dates:
pd.date_range(start, end, freq ='D')
Out[121]:
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
'2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08',
'2018-01-09', '2018-01-10',
...
'2020-06-23', '2020-06-24', '2020-06-25', '2020-06-26',
'2020-06-27', '2020-06-28', '2020-06-29', '2020-06-30',
'2020-07-01', '2020-07-02'],
dtype='datetime64[ns]', length=914, freq='D')
How can I compare all the dates with the index and add just the dates that are missing?
Use DataFrame.reindex, which also works if you need a custom start and end datetime:
df = df.reindex(pd.date_range(start, end, freq ='D'))
Or DataFrame.asfreq to add the missing datetimes between the existing data:
df = df.asfreq('d')
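Either way the inserted rows are all NaN. If the gaps should then be filled the way price data usually is (an assumption beyond what was asked), a minimal follow-up sketch:
df = df.asfreq('d')                    # insert missing calendar days as NaN rows
df['Volume'] = df['Volume'].fillna(0)  # assumed: no volume on non-trading days
df = df.ffill()                        # assumed: carry prices forward over gaps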

parsing data in excel file to create data frame

I am analyzing data from an Excel file.
I want to create a data frame by parsing the data from Excel using Python.
The data in my Excel file looks as follows (the original screenshot is not reproduced here):
The first row, highlighted in yellow, contains the match, which will be one of the columns in the data frame that I want to create.
In fact, the second and fourth rows are the names of the columns that I want in the new data frame.
The third and fifth rows are the values of each column.
The sample here is only for one match; I have multiple matches in the Excel file.
I want to create a data frame that contains the column Match and all the names shown in blue in the file.
I have attached a sample file that contains multiple matches.
Download the file here.
My expected data frame is
Match 1-0 2-0 2-1 3-0 3-1 3-2 4-0 4-1 4-2 4-3.......
MOL Vidi -vs- Chelsea 14 42 20 170 85 85 225 225 225 .....
Can anyone advise me how to parse the Excel data and convert it to a data frame?
Thanks,
Zep
Use:
import pandas as pd
from datetime import datetime
df = pd.read_excel('test_match.xlsx')
#mask for check a-z in column HOME -vs- AWAY
m1 = df['HOME -vs- AWAY'].str.contains('[a-z]', na=False)
#create index by matches
df.index = df['HOME -vs- AWAY'].where(m1).ffill()
df.index.name = 'Match'
#remove same index and HOME -vs- AWAY column rows
df = df[df.index != df['HOME -vs- AWAY']].copy()
#test if datetime or string
m2 = df['HOME -vs- AWAY'].apply(lambda x: isinstance(x, datetime))
m3 = df['HOME -vs- AWAY'].apply(lambda x: isinstance(x, str))
#select next rows and set new column names
df1 = df[m2.shift().fillna(False)]
df1.columns = df[m2].iloc[0]
#also remove columns that are entirely NaN
df2 = df[m3.shift().fillna(False)].dropna(axis=1, how='all')
df2.columns = df[m3].iloc[0].dropna()
#join together
df = pd.concat([df1, df2], axis=1).astype(float).reset_index().rename_axis(None, axis=1)
print (df.head())
Match 2000-01-01 00:00:00 2000-02-01 00:00:00 \
0 MOL Vidi -vs- Chelsea 14.00 42.00
1 Lazio -vs- Eintracht Frankfurt 8.57 11.55
2 Sevilla -vs- FC Krasnodar 7.87 6.63
3 Villarreal -vs- Spartak Moscow 7.43 7.03
4 Rennes -vs- FC Astana 4.95 6.38
2018-02-01 00:00:00 2000-03-01 00:00:00 2018-03-01 00:00:00 \
0 20.00 170.00 85.00
1 7.87 23.80 15.55
2 7.87 8.72 8.65
3 7.07 10.00 9.43
4 7.33 12.00 13.20
2018-03-02 00:00:00 2000-04-01 00:00:00 2018-04-01 00:00:00 \
0 85.0 225.00 225.00
1 21.3 64.30 42.00
2 25.9 14.80 14.65
3 23.9 19.35 17.65
4 38.1 31.50 34.10
2018-04-02 00:00:00 ... 0-1 0-2 2018-01-02 00:00:00 \
0 225.0 ... 5.6 6.80 7.00
1 55.7 ... 11.0 19.05 10.45
2 38.1 ... 28.0 79.60 29.20
3 38.4 ... 20.9 58.50 22.70
4 81.4 ... 12.9 42.80 22.70
0-3 2018-01-03 00:00:00 2018-02-03 00:00:00 0-4 \
0 12.5 12.0 32.0 30.0
1 48.4 27.4 29.8 167.3
2 223.0 110.0 85.4 227.5
3 203.5 87.6 73.4 225.5
4 201.7 97.6 103.6 225.5
2018-01-04 00:00:00 2018-02-04 00:00:00 2018-03-04 00:00:00
0 29.0 60.0 220.0
1 91.8 102.5 168.3
2 227.5 227.5 227.5
3 225.5 225.5 225.5
4 225.5 225.5 225.5
[5 rows x 27 columns]
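One quirk visible above: column names such as 2000-02-01 00:00:00 are score labels like 2-0 that Excel auto-parsed as dates (with 2018 as the then-current year). A hedged clean-up sketch, assuming exactly that parsing rule:
from datetime import datetime

def score_label(col):
    # '1-0' was parsed as 2000-01-01, '2-1' as 2018-02-01, '3-2' as 2018-03-02, ...
    if not isinstance(col, datetime):
        return col
    return f'{col.month}-0' if col.year == 2000 else f'{col.month}-{col.day}'

df.columns = [score_label(c) for c in df.columns]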

Python pandas rolling mean while retaining index and column

I have a pandas DataFrame of statistics for NBA games. Here's a sample of the data for away teams:
away_team away_efg away_drb away_score
date
2000-10-31 19:00:00 Los Angeles Clippers 0.522 74.4 94
2000-10-31 19:00:00 Milwaukee Bucks 0.434 63.0 93
2000-10-31 19:30:00 Minnesota Timberwolves 0.523 73.8 106
2000-10-31 19:30:00 Charlotte Hornets 0.605 77.1 106
2000-10-31 19:30:00 Seattle SuperSonics 0.429 73.1 88
There are many more numeric columns other than the away_score column, and also analogous columns for the home team.
What I would like is, for each row, replace the numeric columns (other than score) with the mean of the previous three observations, partitioned by team. I can almost get what I want by doing the following:
home_df.groupby("team").apply(lambda x: x.rolling(window=3).mean())
This returns, for example,
>>> home_avg[home_avg["team"]=="Utah Jazz"].head()
3par ast blk drb efg ftr orb
0 NaN NaN NaN NaN NaN NaN NaN
50 NaN NaN NaN NaN NaN NaN NaN
81 0.146667 71.600000 9.4 74.666667 0.512000 0.347667 25.833333
Taking this, along with
>>> home_df[home_df["team"]=="Utah Jazz"].head()
3par ast blk drb efg ftr orb stl team tov trb
0 0.118 76.7 7.1 64.7 0.535 0.365 25.6 11.5 Utah Jazz 10.8 42.9
50 0.100 63.9 9.1 80.5 0.536 0.414 27.6 2.2 Utah Jazz 20.2 58.6
81 0.222 74.2 12.0 78.8 0.465 0.264 24.3 7.3 Utah Jazz 13.9 50.0
122 0.119 81.8 11.3 75.0 0.515 0.642 25.0 12.2 Utah Jazz 21.8 52.5
135 0.129 76.7 17.8 75.9 0.650 0.400 37.9 5.7 Utah Jazz 18.8 62.7
demonstrates that it is including the current row in the calculation of the mean. I want to avoid this. More specifically, the desired output for row 81 would be all NaNs (because there haven't been three games yet), and the entry in the 3par column for row 122 would be .146667 (the average of the values in that column for rows 0, 50, and 81).
So, my question is, how can I exclude the current row in the rolling mean calculation?
You can use shift here, which shifts the values by a given amount so that your rolling window uses the last three values excluding the current one:
# create a dummy data frame with numeric values
import numpy as np
import pandas as pd

df = pd.DataFrame({"numeric_col": np.random.randint(0, 100, size=5)})
print(df)
numeric_col
0 66
1 60
2 74
3 41
4 83
df["mean"] = df["numeric_col"].shift(1).rolling(window=3).mean()
print(df)
numeric_col mean
0 66 NaN
1 60 NaN
2 74 NaN
3 41 66.666667
4 83 58.333333
Accordingly, change your apply function to lambda x: x.shift(1).rolling(window=3).mean() to make it work in your specific example.
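Applied to the frame from the question, a sketch using transform so the original index and columns are kept (the 'team' grouping column is from the question; the score column name here is an assumption):
num_cols = home_df.select_dtypes('number').columns.difference(['home_score'])  # 'home_score' assumed
home_df[num_cols] = (home_df.groupby('team')[num_cols]
                            .transform(lambda x: x.shift(1).rolling(window=3).mean()))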
