I have a dataset that marks the occurrences of an event every minute for four years. Here's a sample:
In [547]: result
Out[547]:
uuid timestamp col1 col2 col3
0 100 2016-03-30 00:00:00+02:00 NaN NaN NaN
1 100 2016-03-30 00:01:00+02:00 NaN NaN NaN
2 100 2016-03-30 00:02:00+02:00 NaN NaN NaN
3 100 2016-03-30 00:03:00+02:00 1.49 1.79 0.979
4 100 2016-03-30 00:04:00+02:00 NaN NaN NaN
... ... ... .. ...
1435 100 2016-03-30 23:55:00+02:00 NaN NaN NaN
1436 100 2016-03-30 23:56:00+02:00 1.39 2.19 1.09
1437 100 2016-03-30 23:57:00+02:00 NaN NaN NaN
1438 100 2016-03-30 23:58:00+02:00 NaN NaN NaN
1439 100 2016-03-30 23:59:00+02:00 NaN NaN NaN
[1440 rows x 5 columns]
I am trying to compute summary statistics from the non-blank rows, aggregated into six-hour windows. The resample() function works well for this. Here's a sample:
In [548]: result = result.set_index('timestamp').tz_convert('Europe/Berlin').resample('6h', label='right', closed='right', origin='start_day').agg(['mean', 'last', 'count']).iloc[:,-9:]
Out[548]:
col1_mean col1_last ... col3_last times_changed
timestamp ...
2016-03-30 00:00:00+02:00 NaN NaN ... NaN 0
2016-03-30 07:00:00+02:00 1.0690 1.069 ... 1.279 1
2016-03-30 13:00:00+02:00 1.0365 1.009 ... 1.239 4
2016-03-30 19:00:00+02:00 1.0150 0.989 ... 1.209 5
2016-03-30 01:00:00+02:00 1.1290 1.129 ... 1.329 1
[5 rows x 7 columns]
This looks great and is the format I'd like to work with. However, when I run my code on all data (spanning many years), here's an excerpt of what the output looks like:
In [549]: result
Out[549]:
col1_mean col1_last ... col3_last times_changed
timestamp ...
2016-03-27 00:00:00+01:00 NaN NaN ... NaN 0
2016-03-27 07:00:00+02:00 1.0690 1.069 ... 1.279 1
2016-03-27 13:00:00+02:00 1.0365 1.009 ... 1.239 4
2016-03-27 19:00:00+02:00 1.0150 0.989 ... 1.209 5
2016-03-28 01:00:00+02:00 1.1290 1.129 ... 1.329 1
[5 rows x 7 columns]
The new index takes DST into consideration and throws everything off by an hour. I would like the new times to still be between 0–6, 6–12 etc.
Is there a way to coerce my dataset to adhere to a 0–6, 6–12 format? If there's an extra hour, maybe the aggregations from that could still be tucked into the 0–6 range?
The timezone I'm working with is Europe/Berlin and I tried converting everything to UTC. However, values are not at their right date or time — for example, an occurrence at 00:15hrs would be 23:15hrs the previous day, which throws off those summary statistics.
Are there any creative solutions to fix this?
Have you tried this? I think it should work.
(It first converts to the local timezone, then strips the timezone info with .tz_localize(None), so the resample bins are built from naive local wall-clock times.)
result = (
    result.set_index('timestamp')
          .tz_convert('Europe/Berlin')
          .tz_localize(None)
          .resample('6h', label='right', closed='right', origin='start_day')
          .agg(['mean', 'last', 'count'])
          .iloc[:, -9:]
)
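As a quick sanity check (a sketch, assuming you have run the line above so that result holds the 6-hourly aggregate), you can confirm that every bin label now falls on a 6-hour wall-clock boundary, even across the late-March DST change:
# result.index is now a naive DatetimeIndex of local (Berlin) wall-clock times
assert (result.index.hour % 6 == 0).all()
assert (result.index.minute == 0).all()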
The dataset:
I have a collection of csv files inside a folder; each csv file has two columns: timestamp and close. Each file is saved as {symbol}.csv, where the symbols come from a list, e.g. ['ADAUSDT', 'MAGICUSDT', 'LUNCUSDT', 'LINAUSDT', 'LEVERUSDT', 'BUSDUSDT', 'BTSUSDT', 'ALGOUSDT'].... In reality I have more than 100 symbols.
Here's the link to sample csv files in case you need them.
What I would like to do:
I want to merge the close prices from these files into one data frame using pd.concat without losing much data. Most of the files start around a similar date, but some of them don't have data going back a full year (e.g. LUNCUSDT). In those cases I want a way to either drop those files and merge the rest, depending on whether the remaining start dates all fall within a close range.
If that is too complicated, I would instead try to align them all based on the most recent data. However, the DateTime stamps in the last rows are not all in the same range either.
I would appreciate any help on how I can approach this logic. Thanks in advance.
Here's my attempt:
import pandas as pd

symbols = pd.read_csv('symbols.csv')
symbols = symbols.symbols.to_list()

merged_df = pd.DataFrame()
for symbol in symbols:
    df = pd.read_csv(f"OHLC/5m/{symbol}.csv", usecols=[0, 4])
    df = df.rename(columns={'close': symbol})
    # df['timestamp'] = pd.to_datetime(df['timestamp'], unit='ms')
    merged_df = pd.concat([merged_df, df], axis=1)

merged_df
This unfortunately produces an uneven dataframe with a repeated timestamp column for every symbol, so I don't know how I could work out the earliest and latest time in each row:
timestamp ADAUSDT timestamp XRPUSDT timestamp XLMUSDT timestamp TRXUSDT timestamp VETUSDT ... timestamp LEVERUSDT timestamp STGUSDT timestamp LUNCUSDT timestamp HFTUSDT timestamp MAGICUSDT
0 2022-02-14 17:35:00 1.048 2022-02-14 17:35:00 0.7989 2022-02-14 17:35:00 0.2112 2022-02-14 17:35:00 0.06484 2022-02-14 17:35:00 0.05662 ... 2022-07-13 04:00:00 0.001252 2022-08-19 09:00:00 0.4667 2022-09-09 08:00:00 0.000529 2022-11-07 13:00:00 3.6009 2022-12-12 08:00:00 0.7873
1 2022-02-14 17:40:00 1.047 2022-02-14 17:40:00 0.7986 2022-02-14 17:40:00 0.2111 2022-02-14 17:40:00 0.06482 2022-02-14 17:40:00 0.05665 ... 2022-07-13 04:05:00 0.001249 2022-08-19 09:05:00 0.5257 2022-09-09 08:05:00 0.000522 2022-11-07 13:05:00 2.9160 2022-12-12 08:05:00 0.8116
2 2022-02-14 17:45:00 1.048 2022-02-14 17:45:00 0.7981 2022-02-14 17:45:00 0.2111 2022-02-14 17:45:00 0.06488 2022-02-14 17:45:00 0.05668 ... 2022-07-13 04:10:00 0.001320 2022-08-19 09:10:00 0.5100 2022-09-09 08:10:00 0.000517 2022-11-07 13:10:00 2.6169 2022-12-12 08:10:00 0.8064
3 2022-02-14 17:50:00 1.047 2022-02-14 17:50:00 0.7980 2022-02-14 17:50:00 0.2109 2022-02-14 17:50:00 0.06477 2022-02-14 17:50:00 0.05658 ... 2022-07-13 04:15:00 0.001417 2022-08-19 09:15:00 0.5341 2022-09-09 08:15:00 0.000520 2022-11-07 13:15:00 2.4513 2022-12-12 08:15:00 0.8035
4 2022-02-14 17:55:00 1.047 2022-02-14 17:55:00 0.7969 2022-02-14 17:55:00 0.2108 2022-02-14 17:55:00 0.06474 2022-02-14 17:55:00 0.05656 ... 2022-07-13 04:20:00 0.001400 2022-08-19 09:20:00 0.6345 2022-09-09 08:20:00 0.000527 2022-11-07 13:20:00 2.5170 2022-12-12 08:20:00 0.8550
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
105123 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
105124 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
105125 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
105126 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
105127 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Any help would be appreciated. Thank you!
Option 1
Your test implementation attempts to perform a horizontal concat. However, I think you should consider a vertical concat instead - and indeed, the title of your question is simply "best way to merge multiple csv files with different timestamps"
The advantage of the standard vertical concat is that, for time series datasets that arrive at irregular intervals, it is far more memory efficient. You've already touched on the issues you run into with a horizontal concat in your example - multiple timestamp columns, one for each csv.
A better approach might be to add an extra column called symbol to differentiate each row, and then do a vertical concat like so:
import pandas as pd

symbols = pd.read_csv('symbols.csv')
symbols = symbols.symbols.to_list()

dfs = []
for symbol in symbols:
    df = pd.read_csv(f"OHLC/5m/{symbol}.csv", usecols=[0, 4])
    df['timestamp'] = pd.to_datetime(df['timestamp'], unit='ms')
    # add symbol as a column so each row stays identifiable after the concat
    df['symbol'] = symbol
    dfs.append(df)

merged_df = pd.concat(dfs).set_index('timestamp').sort_index()
Now, your merged df will look something like this
+-----------+-------+--------+
| timestamp | close | symbol |
+-----------+-------+--------+
| ......... | ..... | ...... |
+-----------+-------+--------+
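If you later need a wide view again (one close column per symbol), you can pivot the long frame back out; a sketch, assuming each (timestamp, symbol) pair occurs at most once:
# rows become timestamps, columns become symbols, values are the close prices
wide = (
    merged_df
    .reset_index()
    .pivot(index='timestamp', columns='symbol', values='close')
)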
Option 2
If you still want to do a horizontal concat, you could try to take advantage of merge_asof instead to get things on the same row.
To get back into the realm of regular intervals, you could use pandas' date_range to generate a periodic column, and then merge_asof every csv file back into it.
import pandas as pd

symbols = pd.read_csv('symbols.csv')
symbols = symbols.symbols.to_list()

df_merged = pd.DataFrame(
    data=pd.date_range(start='1/1/2023', end='1/2/2023', freq='T'),
    columns=['time'],
).set_index('time')  # a sorted index is required for merge_asof

for symbol in symbols:
    df = pd.read_csv(f"OHLC/5m/{symbol}.csv", usecols=[0, 4])
    df['timestamp'] = pd.to_datetime(df['timestamp'], unit='ms')
    df = df.set_index('timestamp')  # required for merge_asof
    df_merged = pd.merge_asof(
        left=df_merged,
        right=df.rename(columns={'close': symbol}),
        left_index=True,
        right_index=True,
    )
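One design note: merge_asof matches backward with no limit by default, so once a late-starting symbol has its first row, each 5-minute close is repeated on every later grid minute until the next close arrives. If you would rather leave gaps than reuse stale prices, you can bound the match with merge_asof's direction and tolerance parameters; a sketch of just that call inside the loop above:
df_merged = pd.merge_asof(
    left=df_merged,
    right=df.rename(columns={'close': symbol}),
    left_index=True,
    right_index=True,
    direction='backward',            # take the most recent earlier close (the default)
    tolerance=pd.Timedelta('5min'),  # but never carry a close more than 5 minutes forward
)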
Option 3
If you're OK with using a huge amount of memory, you can simply merge every dataframe as you are already doing, and then fill the resulting gaps with a backward fill (bfill).
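A minimal sketch of that approach (assuming the same symbols list and file layout as above; note that bfill pulls future values backwards, so for prices a forward fill may be what you actually want):
import pandas as pd

frames = []
for symbol in symbols:
    df = pd.read_csv(f"OHLC/5m/{symbol}.csv", usecols=[0, 4])
    df['timestamp'] = pd.to_datetime(df['timestamp'], unit='ms')
    frames.append(df.set_index('timestamp').rename(columns={'close': symbol}))

# outer-join on the timestamp index: one column per symbol, NaN where a symbol has no bar
merged_df = pd.concat(frames, axis=1).sort_index()

# fill the gaps backward as described above (use .ffill() to carry the last close forward instead)
merged_df = merged_df.bfill()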
I'm trying to use the usual times I take medication (plus the 4 hours it stays active on top of that) to fill a data frame with a label of 2, 1 or 0: 1 while I am on the medication, 2 for the hour just after coming off it, and 0 otherwise.
Here is an example of the dataframe I am trying to add this column to:
id sentiment magnitude angry disgusted fearful \
created
2020-05-21 12:00:00 23.0 -0.033333 0.5 NaN NaN NaN
2020-05-21 12:15:00 NaN NaN NaN NaN NaN NaN
2020-05-21 12:30:00 NaN NaN NaN NaN NaN NaN
2020-05-21 12:45:00 NaN NaN NaN NaN NaN NaN
2020-05-21 13:00:00 NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ...
2021-04-20 00:45:00 NaN NaN NaN NaN NaN NaN
2021-04-20 01:00:00 NaN NaN NaN NaN NaN NaN
2021-04-20 01:15:00 NaN NaN NaN NaN NaN NaN
2021-04-20 01:30:00 NaN NaN NaN NaN NaN NaN
2021-04-20 01:45:00 46022.0 -1.000000 1.0 NaN NaN NaN
happy neutral sad surprised
created
2020-05-21 12:00:00 NaN NaN NaN NaN
2020-05-21 12:15:00 NaN NaN NaN NaN
2020-05-21 12:30:00 NaN NaN NaN NaN
2020-05-21 12:45:00 NaN NaN NaN NaN
2020-05-21 13:00:00 NaN NaN NaN NaN
... ... ... ... ...
2021-04-20 00:45:00 NaN NaN NaN NaN
2021-04-20 01:00:00 NaN NaN NaN NaN
2021-04-20 01:15:00 NaN NaN NaN NaN
2021-04-20 01:30:00 NaN NaN NaN NaN
2021-04-20 01:45:00 NaN NaN NaN NaN
[32024 rows x 10 columns]
And here are the timestamps for when I usually take my medication:
['09:00 AM', '12:00 PM', '03:00 PM']
How would I use those time stamps to get this sort of column information?
Update
Building on the question: how would I make sure the medication label is only applied where there is data available, and that the one-hour after-medication window is applied correctly?
Thanks
Use np.select() to choose the appropriate label for a given condition.
First dropna() if all values after created are null (subset=df.columns[1:]). You can change the subset depending on your needs (e.g., subset=['id'] if rows should be dropped just for having a null id).
Then generate datetime arrays for taken-, active-, and after-medication periods based on the duration of the medication. Check whether the created times match any of the times in active (label 1) or after (label 2), otherwise default to 0.
import numpy as np
import pandas as pd

# drop rows that are empty except for column 0 (i.e., except for df.created)
df.dropna(subset=df.columns[1:], inplace=True)

# convert times to datetime
df.created = pd.to_datetime(df.created)
taken = pd.to_datetime(['09:00:00', '12:00:00', '15:00:00'])

# generate time arrays
duration = 2  # hours
active = np.array([(taken + pd.Timedelta(f'{h}H')).time for h in range(duration)]).ravel()
after = (taken + pd.Timedelta(f'{duration}H')).time

# define boolean masks by label
conditions = {
    1: df.created.dt.floor('H').dt.time.isin(active),
    2: df.created.dt.floor('H').dt.time.isin(after),
}

# create medication column with np.select()
df['medication'] = np.select(conditions.values(), conditions.keys(), default=0)
Here is the output with some slightly modified data that better demonstrate the active / after / nan scenarios:
created id sentiment magnitude medication
0 2020-05-21 12:00:00 23.0 -0.033333 0.5 1
3 2020-05-21 12:45:00 39.0 -0.500000 0.5 1
4 2020-05-21 13:00:00 90.0 -0.500000 0.5 1
5 2020-05-21 13:15:00 100.0 -0.033333 0.1 1
9 2020-05-21 14:15:00 1000.0 0.033333 0.5 2
10 2020-05-21 14:30:00 3.0 0.001000 1.0 2
17 2021-04-20 01:00:00 46022.0 -1.000000 1.0 0
20 2021-04-20 01:45:00 46022.0 -1.000000 1.0 0
The following routine retrieves and loads a data file:
import wget
import pandas as pd

wget.download("https://www.aaii.com/files/surveys/sentiment.xls", "C:/temp/sentiment.xls")
df = pd.read_excel("C:/temp/sentiment.xls", sheet_name="SENTIMENT", skiprows=3,
                   parse_dates=['Date'], date_format='%m-%d-%y', index_col='Date')
The first three data lines are incomplete, so I can slice them off like this: df[3:]
At about line 1640 there is a blank line. I wish to skip the rest of the file after that line. I tried to find that line like so and get its index so I could do another slice, but I get nan for the index value.
df[df.isnull().all(1)].index.values[0]
How can I find that line and skip the rest of the file?
I think you have two nan-row problems in this file:
The first row after the header is already an empty row, leading to a NaN index.
The reason for your post: the empty row that marks the end of the data you're interested in.
First, import the data as you did:
df = pd.read_excel("sentiment.xls", sheet_name = "SENTIMENT", skiprows=3, parse_dates=['Date'], date_format='%m-%d-%y', index_col ='Date')
df.head()
Bullish Neutral Bearish ... High Low Close
Date ...
NaN NaN NaN NaN ... NaN NaN NaN
1987-06-26 00:00:00 NaN NaN NaN ... NaN NaN NaN
1987-07-17 00:00:00 NaN NaN NaN ... 314.59 307.63 314.59
1987-07-24 00:00:00 0.36 0.50 0.14 ... 311.39 307.81 309.27
1987-07-31 00:00:00 0.26 0.48 0.26 ... 318.66 310.65 318.66
Then remove the first empty row (the NaN index), problem No. 1:
df = df[1:]
df.head()
Bullish Neutral Bearish ... High Low Close
Date ...
1987-06-26 00:00:00 NaN NaN NaN ... NaN NaN NaN
1987-07-17 00:00:00 NaN NaN NaN ... 314.59 307.63 314.59
1987-07-24 00:00:00 0.36 0.50 0.14 ... 311.39 307.81 309.27
1987-07-31 00:00:00 0.26 0.48 0.26 ... 318.66 310.65 318.66
1987-08-07 00:00:00 0.56 0.15 0.29 ... 323.00 316.23 323.00
And now you want to select all rows before the first NaN index, problem No. 2.
Idea: create a boolean array with True entries for all NaN indices, cast it to integer and build the cumulative sum. The result is 0 for all the data you want and >0 from the first unwanted line until the end.
Comparing it against 0 returns a boolean index for your data:
data_idx = df.index.isna().astype(int).cumsum() == 0
Applied to your dataframe:
df[data_idx]
Bullish Neutral ... Low Close
Date ...
1987-06-26 00:00:00 NaN NaN ... NaN NaN
1987-07-17 00:00:00 NaN NaN ... 307.63 314.59
1987-07-24 00:00:00 0.360000 0.500000 ... 307.81 309.27
1987-07-31 00:00:00 0.260000 0.480000 ... 310.65 318.66
1987-08-07 00:00:00 0.560000 0.150000 ... 316.23 323.00
... ... ... ... ...
2018-10-11 00:00:00 0.306061 0.339394 ... 2784.86 2785.68
2018-10-18 00:00:00 0.339350 0.310469 ... 2710.51 2809.21
2018-10-25 00:00:00 0.279693 0.310345 ... 2651.89 2656.10
2018-11-01 00:00:00 0.379310 0.275862 ... 2603.54 2711.74
2018-11-08 00:00:00 0.412844 0.275229 ... 2700.44 2813.89
[1635 rows x 12 columns]
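Putting both fixes together, the whole cleanup condenses to (same technique, just combined):
df = df[1:]                                          # problem No. 1: drop the leading NaN-index row
df = df[df.index.isna().astype(int).cumsum() == 0]   # problem No. 2: keep rows before the next NaN index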
I have a time series that I want to lag and use to predict one year ahead. It looks like:
Date         Energy   Pred Energy   Lag Error
.
2017-09-01   9        8.4
2017-10-01   10       9
2017-11-01   11       10
2017-12-01   12       11.5
2018-01-01   1        1.3
NaT                                 (pred-true)
NaT
NaT
NaT
.
.
All I want to do is impute dates into the NaT entries so they continue from 2018-01-01 to 2019-01-01 (just fill them down like an Excel drag-and-fill), because there are enough NaT positions to reach that point.
I've tried model['Date'].fillna() with various methods, and it either just repeats the previous date or drops things I don't want to drop.
Is there a way to just fill these NaTs in one-month increments, continuing the previous data?
Make the df and set the index (there are better ways to set the index):
"""
Date,Energy,Pred Energy,Lag Error
2017-09-01,9,8.4
2017-10-01,10,9
2017-11-01,11,10
2017-12-01,12,11.5
2018-01-01,1,1.3
"""
import pandas as pd
df = pd.read_clipboard(sep=",", parse_dates=True)
df.set_index(pd.DatetimeIndex(df['Date']), inplace=True)
df.drop("Date", axis=1, inplace=True)
df
Reindex to a new date_range:
idx = pd.date_range(start='2017-09-01', end='2019-01-01', freq='MS')
df = df.reindex(idx)
Output:
Energy Pred Energy Lag Error
2017-09-01 9.0 8.4 NaN
2017-10-01 10.0 9.0 NaN
2017-11-01 11.0 10.0 NaN
2017-12-01 12.0 11.5 NaN
2018-01-01 1.0 1.3 NaN
2018-02-01 NaN NaN NaN
2018-03-01 NaN NaN NaN
2018-04-01 NaN NaN NaN
2018-05-01 NaN NaN NaN
2018-06-01 NaN NaN NaN
2018-07-01 NaN NaN NaN
2018-08-01 NaN NaN NaN
2018-09-01 NaN NaN NaN
2018-10-01 NaN NaN NaN
2018-11-01 NaN NaN NaN
2018-12-01 NaN NaN NaN
2019-01-01 NaN NaN NaN
Help from:
Pandas Set DatetimeIndex
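If you would rather have the dates back as a regular Date column (as in your original frame) instead of as the index, one extra step after the reindex gets you there; a sketch:
# name the DatetimeIndex 'Date' and move it back into a column
df = df.rename_axis('Date').reset_index()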
I have two dataframes loaded from CSV files:
time_df: consists of all the dates I want, as shown below
0 2017-01-31
1 2017-01-26
2 2017-01-12
3 2017-01-09
4 2017-01-02
price_df: consists of other fields and many dates that I do not need
Date NYSEARCA:TPYP NYSEARCA:MENU NYSEARCA:SLYV NYSEARCA:CZA
0 2017-01-31 NaN 16.56 117.75 55.96
1 2017-01-26 NaN 16.68 116.89 55.84
2 2017-01-27 NaN 16.70 118.47 56.04
3 2017-01-12 NaN 16.81 119.14 56.13
5 2017-01-09 NaN 16.91 120.00 56.26
6 2017-01-08 NaN 16.91 120.00 56.26
7 2017-01-02 NaN 16.91 120.00 56.26
My aim is to delete the rows in price_df whose dates do not match the dates in time_df.
I tried:
del price_df['Date'] if price_df['Date']!=time_df['Date']
but that isn't valid, so I tried print(price_df['Date'] != time_df['Date'])
but it raises the following error: Can only compare identically-labeled Series objects
Sounds like a problem an inner join can fix:
time_df.merge(price_df, on='Date', copy=False)
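Note that this assumes time_df's column is actually named Date and that both Date columns share the same dtype; if not, a quick rename and to_datetime conversion first keeps the merge key aligned. A sketch (the column names here are taken from the samples above):
import pandas as pd

time_df.columns = ['Date']                         # assumption: the single column holds the wanted dates
time_df['Date'] = pd.to_datetime(time_df['Date'])
price_df['Date'] = pd.to_datetime(price_df['Date'])

# inner join: keeps only the rows of price_df whose Date also appears in time_df
filtered = time_df.merge(price_df, on='Date')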