I would like to import the following file, which contains data in a weekly format (Thursdays only), and convert it to a daily file with the values from each Thursday filled forward through the next Wednesday, skipping Saturday and Sunday.
https://www.aaii.com/files/surveys/sentiment.xls
I can import it:
df = pd.read_excel("C:\\Users\\Public\\Portfolio\\exports\\sentiment.xls", sheet_name = "SENTIMENT", skiprows=3, parse_dates=['Date'], date_format='%m-%d-%y')
That import works, but it is as far as I can get. Even the simplest resampling fails with
TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'RangeIndex'
I tried df['Date'] = pd.to_datetime(df['Date']) and other methods with no incremental success.
Thoughts as to how to get this done?
You can try it like this:
df = pd.read_excel("sentiment.xls", sheet_name = "SENTIMENT", skiprows=3, parse_dates=['Date'], date_format='%m-%d-%y')
Your Date column has NaN values, so when you try to convert it to datetime it fails:
>>> df['Date']
0 NaN
1 1987-06-26 00:00:00
2 1987-07-17 00:00:00
3 1987-07-24 00:00:00
4 1987-07-31 00:00:00
So, to convert to datetime you need to use errors='coerce':
>>> df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
Now your dates are parsed:
>>> df['Date']
0 NaT
1 1987-06-26
2 1987-07-17
3 1987-07-24
4 1987-07-31
5 1987-08-07
6 1987-08-14
7 1987-08-21
Now set your index to the Date column before resampling, as mentioned in the comments:
>>> df.set_index('Date', inplace=True)
>>> df.head()
Bullish Neutral Bearish Total Mov Avg Spread Average +St. Dev. - St. Dev. High Low Close
Date
NaT NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1987-06-26 NaN NaN NaN NaN NaN NaN 0.382642 0.484295 0.280989 NaN NaN NaN
1987-07-17 NaN NaN NaN NaN NaN NaN 0.382642 0.484295 0.280989 314.59 307.63 314.59
1987-07-24 0.36 0.50 0.14 1.0 NaN 0.22 0.382642 0.484295 0.280989 311.39 307.81 309.27
1987-07-31 0.26 0.48 0.26 1.0 NaN 0.00 0.382642 0.484295 0.280989 318.66 310.65 318.66
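With a proper DatetimeIndex in place, resampling now works; a minimal sketch (my own naming) that first drops the unparsed NaT row and then expands the weekly rows to daily:
df = df[df.index.notna()]        # drop the row whose date failed to parse
daily = df.resample('D').ffill() # weekly Thursday rows become daily, forward-filled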
I think this is the correct answer: it converts to daily and strips Saturdays and Sundays (it does not know about market holidays).
import pandas as pd
from pandas.tseries.offsets import BDay
# read the Excel file, use the SENTIMENT sheet, drop the first three rows, parse dates to datetime, index on Date
df = pd.read_excel("C:\\Users\\Public\\Portfolio\\exports\\sentiment.xls", sheet_name = "SENTIMENT", skiprows=3, parse_dates=['Date'], date_format='%m-%d-%y', index_col ='Date')
df = df[3:].asfreq('D', method='ffill') # skip 3 lines then expand to daily and fill forward
df = df[df.index.map(BDay().is_on_offset)] # keep business days Mon-Fri (onOffset in older pandas)
df = df[df.index.dayofweek < 5] # strip Saturdays and Sundays (redundant with the BDay filter, but harmless)
print(df.head(250))
There may be a more elegant method, but that gets the job done.
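One candidate, a sketch I have not verified against the live file: resample straight to business-day frequency, which expands to daily and drops weekends in one step (like the BDay filter above, it is unaware of market holidays):
import pandas as pd

df = pd.read_excel("C:\\Users\\Public\\Portfolio\\exports\\sentiment.xls",
                   sheet_name="SENTIMENT", skiprows=3,
                   parse_dates=['Date'], index_col='Date')
df = df[3:].resample('B').ffill() # 'B' = business days, so Sat/Sun never appear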
Related
I am having some trouble managing and combining columns in order to get one datetime column out of three columns containing the date, the hours and the minutes.
Assume the following df (copy it and type df = pd.read_clipboard() to reproduce) with the types as noted below:
>>>df
date hour minute
0 2021-01-01 7.0 15.0
1 2021-01-02 3.0 30.0
2 2021-01-02 NaN NaN
3 2021-01-03 9.0 0.0
4 2021-01-04 4.0 45.0
>>>df.dtypes
date object
hour float64
minute float64
dtype: object
I want to replace the three columns with one called 'datetime' and I have tried a few things but I face the following problems:
I first create a 'time' column with df['time'] = (pd.to_datetime(df['hour'], unit='h') + pd.to_timedelta(df['minute'], unit='m')).dt.time and then try to concatenate it with the 'date' via df['datetime'] = df['date'] + ' ' + df['time'] (with the purpose of then converting the 'datetime' column with pd.to_datetime(df['datetime'])). However, I get
TypeError: can only concatenate str (not "datetime.time") to str
If I convert 'hour' and 'minute' to str in order to concatenate the three columns into 'datetime', I then run into the NaN values, which prevent converting 'datetime' to the corresponding type.
I have also tried first converting the 'date' column with df['date'] = df['date'].astype('datetime64[ns]'), again creating the 'time' column as above, and combining the two with df['datetime'] = pd.datetime.combine(df['date'], df['time']), which returns
TypeError: combine() argument 1 must be datetime.date, not Series
along with the warning
FutureWarning: The pandas.datetime class is deprecated and will be removed from pandas in a future version. Import from datetime module instead.
Is there a generic solution to combine the three columns and ignore the NaN values (assuming it could return 00:00:00)?
What if I have a row with all NaN values? Would it be possible to ignore all NaNs and have 'datetime' be NaN for that row?
Thank you in advance, ^_^
First convert date to datetime and then add the hour and minute timedeltas, replacing missing values with a 0 timedelta:
td = pd.Timedelta(0)
df['datetime'] = (pd.to_datetime(df['date']) +
pd.to_timedelta(df['hour'], unit='h').fillna(td) +
pd.to_timedelta(df['minute'], unit='m').fillna(td))
print (df)
date hour minute datetime
0 2021-01-01 7.0 15.0 2021-01-01 07:15:00
1 2021-01-02 3.0 30.0 2021-01-02 03:30:00
2 2021-01-02 NaN NaN 2021-01-02 00:00:00
3 2021-01-03 9.0 0.0 2021-01-03 09:00:00
4 2021-01-04 4.0 45.0 2021-01-04 04:45:00
Or you can use Series.add with fill_value=0:
df['datetime'] = (pd.to_datetime(df['date'])
.add(pd.to_timedelta(df['hour'], unit='h'), fill_value=0)
.add(pd.to_timedelta(df['minute'], unit='m'), fill_value=0))
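Regarding the all-NaN-row follow-up: if the date itself can be missing, a sketch that parses it with errors='coerce', so the row ends up NaT (NaT plus any timedelta stays NaT):
td = pd.Timedelta(0)
df['datetime'] = (pd.to_datetime(df['date'], errors='coerce') # missing date -> NaT
                  + pd.to_timedelta(df['hour'], unit='h').fillna(td)
                  + pd.to_timedelta(df['minute'], unit='m').fillna(td)) # NaT + 0 stays NaT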
I would recommend converting hour and minute columns to string and constructing the datetime string from the provided components.
Logically, you need to perform the following steps:
Step 1. Fill missing values for hour and minute with zeros.
df['hour'] = df['hour'].fillna(0)
df['minute'] = df['minute'].fillna(0)
Step 2. Convert float values for hour and minute into integer ones, because your final output should look like 2021-01-01 7:15, not 2021-01-01 7.0:15.0.
df['hour'] = df['hour'].astype(int)
df['minute'] = df['minute'].astype(int)
Step 3. Convert integer values for hour and minute to the string representation.
df['hour'] = df['hour'].astype(str)
df['minute'] = df['minute'].astype(str)
Step 4. Concatenate date, hour and minute into one column of the correct format.
df['result'] = df['date'].str.cat(df['hour'].str.cat(df['minute'], sep=':'), sep=' ')
Step 5. Convert your result column to datetime object.
pd.to_datetime(df['result'])
It is also possible to perform all of these steps in one command, though it reads a bit messy:
df['result'] = pd.to_datetime(df['date'].str.cat(df['hour'].fillna(0).astype(int).astype(str).str.cat(df['minute'].fillna(0).astype(int).astype(str), sep=':'), sep=' '))
Result:
date hour minute result
0 2021-01-01 7.0 15.0 2021-01-01 07:15:00
1 2021-01-02 3.0 30.0 2021-01-02 03:30:00
2 2021-01-02 NaN NaN 2021-01-02 00:00:00
3 2021-01-03 9.0 0.0 2021-01-03 09:00:00
4 2021-01-04 4.0 45.0 2021-01-04 04:45:00
I have a table:
import pandas as pd
import numpy as np
df = pd.DataFrame([
("2019-01-22", np.nan, np.nan),
("2019-01-25", 10, 15),
("2019-01-28", 200, 260),
("2019-02-03", 3010, 3800),
("2019-02-05", 40109, 45009)],
columns=["date", "col1", "col2"])
I need to add new rows to the table where the date (day) is missing. In the added rows, in columns col1 and col2, there must be values copied from the row located below in the table (from rows with more recent dates).
I need to get the following table:
Use pandas.to_datetime and asfreq:
df.set_index(pd.to_datetime(df['date'])).drop(columns='date').asfreq('D').bfill().reset_index()
Output:
date col1 col2
0 2019-01-22 10.0 15.0
1 2019-01-23 10.0 15.0
2 2019-01-24 10.0 15.0
3 2019-01-25 10.0 15.0
4 2019-01-26 200.0 260.0
5 2019-01-27 200.0 260.0
6 2019-01-28 200.0 260.0
7 2019-01-29 3010.0 3800.0
8 2019-01-30 3010.0 3800.0
9 2019-01-31 3010.0 3800.0
10 2019-02-01 3010.0 3800.0
11 2019-02-02 3010.0 3800.0
12 2019-02-03 3010.0 3800.0
13 2019-02-04 40109.0 45009.0
14 2019-02-05 40109.0 45009.0
df = df.sort_values("date")
df = df.bfill()
Sort the dataframe by date and fill the nulls with the next non-null values. Note that this fills the existing NaNs but does not insert rows for the missing days; combining it with asfreq covers both, as sketched below.
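A minimal sketch of that combination, assuming the same df as above:
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values('date').set_index('date').asfreq('D').bfill().reset_index()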
try this code:
import pandas as pd
import numpy as np
df = pd.DataFrame([
("2019-01-22", np.nan, np.nan),
("2019-01-25", 10, 15),
("2019-01-28", 200, 260),
("2019-02-03", 3010, 3800),
("2019-02-05", 40109, 45009)],
columns=["date", "col1", "col2"])
df['date'] = pd.to_datetime(df['date'])
df.index = df['date']
df.drop(columns='date', inplace=True)
df = df.resample('D').asfreq().bfill()
df.reset_index(inplace=True)
convert date to an actual date object (it was a str)
set the index to the date column (because of how resample/bfill work)
drop the now-redundant date column
resample the dates on a daily basis and backfill missing data
reset the index so date is back to being a regular column
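For reference, the same five steps can be written as one chained expression; purely a matter of taste:
df = (df.assign(date=pd.to_datetime(df['date']))
        .set_index('date')
        .resample('D').asfreq()
        .bfill()
        .reset_index())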
I have extracted raw data from a CSV file (download link) and set the index column to Date.
The index is not in datetime format, and when I try to convert it using the code below
df.index=pd.to_datetime(df.index)
I get this error:
"ValueError: month must be in 1..12"
The current dtype for the index is 'object'
I have seen some previous questions related to datetime conversion, but I'm afraid I couldn't use them to solve my problem. Could someone help, please?
thanks,
The problem is that there are 3 different datetime formats. The solution is to parse each one separately; unmatched values become NaN, so replace them using Series.combine_first:
df = pd.read_csv('FFdata1.csv', index_col=['Date'])
df = df.reset_index()
#format YYDDMM
d1 = pd.to_datetime(df['Date'], format='%y%d%m', errors='coerce')
#format YYYY
d2 = pd.to_datetime(df['Date'], format='%Y', errors='coerce')
#format YYYYMM
d3 = pd.to_datetime(df['Date'], format='%Y%m', errors='coerce')
df['Date'] = d1.combine_first(d2).combine_first(d3)
#check not parsed datetimes
print(df[df['Date'].isna()])
Date Mkt-RF SMB HML RF
1113 NaT NaN NaN NaN NaN
1114 NaT NaN NaN NaN NaN
1115 NaT Mkt-RF SMB HML RF
1208 NaT NaN NaN NaN NaN
1209 NaT NaN NaN NaN NaN
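Assuming the NaT rows really are only the separator and repeated-header junk shown above, they can then simply be dropped:
df = df.dropna(subset=['Date'])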
Another possible solution is to create 3 separate DataFrames:
df = pd.read_csv('FFdata1.csv', index_col=['Date'])
df = df.reset_index()
#format YYDDMM
d1 = pd.to_datetime(df['Date'], format='%y%d%m', errors='coerce')
df1 = df.assign(Date=d1).dropna(subset=['Date'])
print (df1.head())
Date Mkt-RF SMB HML RF
0 2019-07-26 2.96 -2.3 -2.87 0.22
1 2019-08-26 2.64 -1.4 4.19 0.25
2 2019-09-26 0.36 -1.32 0.01 0.23
3 2019-10-26 -3.24 0.04 0.51 0.32
4 2019-11-26 2.53 -0.2 -0.35 0.31
#format YYYY
d2 = pd.to_datetime(df['Date'], format='%Y', errors='coerce')
df2 = df.assign(Date=d2).dropna(subset=['Date'])
print (df2.head())
Date Mkt-RF SMB HML RF
1116 1927-01-01 29.47 -2.46 -3.75 3.12
1117 1928-01-01 35.39 4.2 -6.15 3.56
1118 1929-01-01 -19.54 -30.8 11.81 4.75
1119 1930-01-01 -31.23 -5.13 -12.28 2.41
1120 1931-01-01 -45.11 3.53 -14.29 1.07
#format YYYYMM
d3 = pd.to_datetime(df['Date'], format='%Y%m', errors='coerce')
df3 = df.assign(Date=d3).dropna(subset=['Date'])
print (df3.head())
Date Mkt-RF SMB HML RF
0 1926-07-01 2.96 -2.3 -2.87 0.22
1 1926-08-01 2.64 -1.4 4.19 0.25
2 1926-09-01 0.36 -1.32 0.01 0.23
3 1926-10-01 -3.24 0.04 0.51 0.32
4 1926-11-01 2.53 -0.2 -0.35 0.31
The file contains more than one data series. The beginning of the file has a header line and then dates formatted as %Y%m. But at line 1115, we find a line containing only empty values, followed by textual information (Annual Factors: January-December), a new header line, and then annual data with dates formatted as %Y only. This is far beyond what read_csv can automagically process.
So my advice is to first load the file without trying to parse the Date column, then reject any line past the first one containing an empty date, and only then parse the date on the remaining lines.
Code could be:
df = pd.read_csv('FFdata1.csv')
df = df.loc[df.index < df[df.Date.isna()].index[0]]
df['Date'] = pd.to_datetime(df.Date,format='%Y%m')
df.set_index('Date', inplace=True)
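If the annual block is also wanted, a hedged sketch along the same lines (it assumes the annual rows share the monthly section's column layout):
raw = pd.read_csv('FFdata1.csv')
break_idx = raw[raw['Date'].isna()].index[0] # first blank-date line marks the break
annual = raw.loc[break_idx + 1:].copy()
annual['Date'] = pd.to_datetime(annual['Date'], format='%Y', errors='coerce')
annual = annual.dropna(subset=['Date']).set_index('Date') # non-year rows coerce to NaT and drop out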
Problem
I have a data frame containing financial data sampled at 1 minute intervals. Occasionally a row or two of data might be missing.
I'm looking for a good (simple and efficient) way to insert new rows into the dataframe at the points in which there is missing data.
The new rows can be empty except for the index, which contains the timestamp.
For example:
#Example Input---------------------------------------------
open high low close
2019-02-07 16:01:00 124.624 124.627 124.647 124.617
2019-02-07 16:04:00 124.646 124.655 124.664 124.645
# Desired Ouput--------------------------------------------
open high low close
2019-02-07 16:01:00 124.624 124.627 124.647 124.617
2019-02-07 16:02:00 NaN NaN NaN NaN
2019-02-07 16:03:00 NaN NaN NaN NaN
2019-02-07 16:04:00 124.646 124.655 124.664 124.645
My current method is based off this post - Find missing minute data in time series data using pandas - which advises only how to identify the gaps, not how to fill them.
What I'm doing is creating a DatetimeIndex of 1-minute intervals. Then, using this index, I create an entirely new dataframe, which can then be merged into my original dataframe, thus filling the gaps. The code is shown below. It seems quite a roundabout way of doing this. I would like to know if there is a better way, maybe by resampling the data?
import pandas as pd
from datetime import datetime
# Initialise prices dataframe with missing data
prices = pd.DataFrame([[datetime(2019,2,7,16,0), 124.634, 124.624, 124.65, 124.62],[datetime(2019,2,7,16,4), 124.624, 124.627, 124.647, 124.617]])
prices.columns = ['datetime','open','high','low','close']
prices = prices.set_index('datetime')
print(prices)
# Create a new dataframe with complete set of time intervals
idx_ref = pd.date_range(start=datetime(2019,2,7,16,0), end=datetime(2019,2,7,16,4), freq='min')
df = pd.DataFrame(index=idx_ref)
# Merge the two dataframes
prices = pd.merge(df, prices, how='outer', left_index=True, right_index=True)
print(prices)
Use DataFrame.asfreq, which works with a DatetimeIndex:
prices = prices.set_index('datetime').asfreq('1Min')
print(prices)
open high low close
datetime
2019-02-07 16:00:00 124.634 124.624 124.650 124.620
2019-02-07 16:01:00 NaN NaN NaN NaN
2019-02-07 16:02:00 NaN NaN NaN NaN
2019-02-07 16:03:00 NaN NaN NaN NaN
2019-02-07 16:04:00 124.624 124.627 124.647 124.617
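An equivalent sketch with reindex, which also lets the grid extend beyond the observed first and last timestamps if needed:
full_idx = pd.date_range(prices.index.min(), prices.index.max(), freq='min')
prices = prices.reindex(full_idx) # rows for missing minutes appear filled with NaN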
A more manual answer would be:
from datetime import datetime, timedelta
from dateutil import parser
import pandas as pd
df = pd.DataFrame({
'a': ['2021-02-07 11:00:30', '2021-02-07 11:00:31', '2021-02-07 11:00:35'],
'b': [64.8, 64.8, 50.3]
})
max_dt = parser.parse(max(df['a']))
min_dt = parser.parse(min(df['a']))
dt_range = []
while min_dt <= max_dt:
dt_range.append(min_dt.strftime("%Y-%m-%d %H:%M:%S"))
min_dt += timedelta(seconds=1)
complete_df = pd.DataFrame({'a': dt_range})
final_df = complete_df.merge(df, how='left', on='a')
It converts the following dataframe:
a b
0 2021-02-07 11:00:30 64.8
1 2021-02-07 11:00:31 64.8
2 2021-02-07 11:00:35 50.3
to:
a b
0 2021-02-07 11:00:30 64.8
1 2021-02-07 11:00:31 64.8
2 2021-02-07 11:00:32 NaN
3 2021-02-07 11:00:33 NaN
4 2021-02-07 11:00:34 NaN
5 2021-02-07 11:00:35 50.3
whose null values we can fill later.
The proposal of #jezrael didn't work for me initially because my index was a different type than DatetimeIndex. The execution of prices.asfreq() wiped out all the prices data, though it filled the gaps with NaN, like this:
open high low close
datetime
2019-02-07 16:00:00 NaN NaN NaN NaN
2019-02-07 16:01:00 NaN NaN NaN NaN
2019-02-07 16:02:00 NaN NaN NaN NaN
2019-02-07 16:03:00 NaN NaN NaN NaN
2019-02-07 16:04:00 NaN NaN NaN NaN
To fix this I had to change the type of the index column, like this:
prices['date'] = pd.to_datetime(prices['datetime'])
prices = prices.set_index('date')
prices.drop(['datetime'], axis=1, inplace=True)
That code converts the 'datetime' column to datetime type and sets the new column as the index.
Now I can call
prices = prices.asfreq('1Min')
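The same fix can also be written as a single chain, a sketch equivalent to the three statements above:
prices = (prices.set_index(pd.to_datetime(prices['datetime']))
                .drop(columns='datetime')
                .asfreq('1Min'))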
I need to download and process the Australian Bureau of Meteorology weather files. So far the following Python works well; it's extracting and cleansing the data exactly as I want:
import pandas as pd
df = pd.read_csv("ftp://ftp.bom.gov.au/anon/gen/fwo/IDY02122.dat", comment='#', skiprows=3, na_values=-9999.0, quotechar='"', skipfooter=1, names=['stn', 'per', 'evap', 'amax', 'amin', 'gmin', 'suns', 'rain', 'prob'], header=0, converters={'stn': str})
The issue is that the file is overwritten daily, and the metadata indicating what day and time the forecast was produced on sits in the comment fields on the first two lines, i.e. the file contains the following data:
# date=20131111
# time=06
[fcst_DB]
stn[7] , per, evap, amax, amin, gmin, suns, rain, prob
"001006", 0,-9999.0, 39.9,-9999.0,-9999.0,-9999.0, 4.0, 100.0
"001006", 1,-9999.0, 39.4, 26.5,-9999.0,-9999.0, 6.0, 100.0
"001006", 2,-9999.0, 35.5, 26.2,-9999.0,-9999.0, 7.0, 100.0
Is it possible using pandas to include the first two lines in the result? Ideally by adding a date and a time column to the result, using the values 20131111 and 06 for each row in the output.
Regards
Dave
Will the first two lines always be a date and time? In that case I'd suggest parsing those separately and handing the rest of the stream off to read_csv.
import urllib2
In [29]: r = urllib2.urlopen(url)
In [30]: date = next(r).strip('# date=').rstrip()
In [31]: time = next(r).strip('# time=').rstrip()
In [32]: stamp = pd.to_datetime(date + ' ' + time)
In [33]: stamp
Out[33]: Timestamp('2013-11-12 00:00:00', tz=None)
Then use your code to read the rest (I changed skiprows to 1):
In [34]: df = pd.read_csv("ftp://ftp.bom.gov.au/anon/gen/fwo/IDY02122.dat", comment='#',
skiprows=1, na_values=-9999.0, quotechar='"', skipfooter=1,
names=['stn', 'per', 'evap', 'amax', 'amin', 'gmin', 'suns',
'rain', 'prob'], header=0, converters={'stn': str})
In [43]: df['timestamp'] = stamp
In [44]: df.head()
Out[44]:
stn per evap amax amin gmin suns rain prob timestamp
0 001006 0 NaN 39.9 NaN NaN NaN 2.9 100.0 2013-11-12 00:00:00
1 001006 1 NaN 35.8 25.8 NaN NaN 7.0 100.0 2013-11-12 00:00:00
2 001006 2 NaN 37.0 25.5 NaN NaN 4.0 71.4 2013-11-12 00:00:00
3 001006 3 NaN 39.0 26.0 NaN NaN 1.0 60.0 2013-11-12 00:00:00
4 001006 4 NaN 41.2 26.1 NaN NaN 0.0 40.0 2013-11-12 00:00:00
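For what it's worth, a sketch of the same idea on Python 3, where urllib2 is gone and the stream yields bytes; the split-on-'=' parsing is an assumption based on the two-line header format shown above:
from urllib.request import urlopen
import pandas as pd

url = "ftp://ftp.bom.gov.au/anon/gen/fwo/IDY02122.dat"
with urlopen(url) as r:
    date = next(r).decode().split('=', 1)[1].strip() # '20131111'
    time = next(r).decode().split('=', 1)[1].strip() # '06'
stamp = pd.to_datetime(date + time, format='%Y%m%d%H')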