Not able to convert dataframe index to datetime format - python

I have extracted raw data from a CSV file and set the index column to Date (shown in the attached clip).
The index is not in datetime format and when I try to convert using the below code
df.index=pd.to_datetime(df.index)
I get this error:
"ValueError: month must be in 1..12"
The current dtype for the index is 'object'
I have seen some previous questions about conversion to datetime, but I'm afraid I couldn't use them to solve my problem. Could someone help, please?
thanks,

The problem is that the file mixes 3 different types of datetimes. The solution is to parse each format separately; unmatched values become NaT, so to replace them use Series.combine_first:
df = pd.read_csv('FFdata1.csv', index_col=['Date'])
df = df.reset_index()
#format YYDDMM
d1 = pd.to_datetime(df['Date'], format='%y%d%m', errors='coerce')
#format YYYY
d2 = pd.to_datetime(df['Date'], format='%Y', errors='coerce')
#format YYYYMM
d3 = pd.to_datetime(df['Date'], format='%Y%m', errors='coerce')
#give %Y%m and %Y precedence: %y%d%m would also (mis)match the 6-digit %Y%m values
df['Date'] = d3.combine_first(d2).combine_first(d1)
#check not parsed datetimes
print(df[df['Date'].isna()])
Date Mkt-RF SMB HML RF
1113 NaT NaN NaN NaN NaN
1114 NaT NaN NaN NaN NaN
1115 NaT Mkt-RF SMB HML RF
1208 NaT NaN NaN NaN NaN
1209 NaT NaN NaN NaN NaN
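The NaT rows are the blank lines and the embedded second header inside the file, so once identified they can simply be dropped before restoring the index. A small follow-up sketch, assuming the df parsed above:
#drop rows whose Date could not be parsed, then restore the index
df = df.dropna(subset=['Date']).set_index('Date')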
Another possible solution is to create 3 separate DataFrames:
df = pd.read_csv('FFdata1.csv', index_col=['Date'])
df = df.reset_index()
#format YYDDMM
d1 = pd.to_datetime(df['Date'], format='%y%d%m', errors='coerce')
df1 = df.assign(Date=d1).dropna(subset=['Date'])
print (df1.head())
Date Mkt-RF SMB HML RF
0 2019-07-26 2.96 -2.3 -2.87 0.22
1 2019-08-26 2.64 -1.4 4.19 0.25
2 2019-09-26 0.36 -1.32 0.01 0.23
3 2019-10-26 -3.24 0.04 0.51 0.32
4 2019-11-26 2.53 -0.2 -0.35 0.31
#format YYYY
d2 = pd.to_datetime(df['Date'], format='%Y', errors='coerce')
df2 = df.assign(Date=d2).dropna(subset=['Date'])
print (df2.head())
Date Mkt-RF SMB HML RF
1116 1927-01-01 29.47 -2.46 -3.75 3.12
1117 1928-01-01 35.39 4.2 -6.15 3.56
1118 1929-01-01 -19.54 -30.8 11.81 4.75
1119 1930-01-01 -31.23 -5.13 -12.28 2.41
1120 1931-01-01 -45.11 3.53 -14.29 1.07
#format YYYYMM
d3 = pd.to_datetime(df['Date'], format='%Y%m', errors='coerce')
df3 = df.assign(Date=d3).dropna(subset=['Date'])
print (df3.head())
Date Mkt-RF SMB HML RF
0 1926-07-01 2.96 -2.3 -2.87 0.22
1 1926-08-01 2.64 -1.4 4.19 0.25
2 1926-09-01 0.36 -1.32 0.01 0.23
3 1926-10-01 -3.24 0.04 0.51 0.32
4 1926-11-01 2.53 -0.2 -0.35 0.31

The file contains more than one data series. The beginning of the file has a header line followed by dates formatted as %Y%m. But at line 1115 we find a line containing only empty values, followed by textual information (Annual Factors: January-December), a new header line, and then annual data with dates formatted as %Y only. This is far beyond what read_csv can automagically process.
So my advice is to first load the file without trying to parse the Date column, then reject any line past the first one containing an empty date, and only then parse the date on the remaining lines.
Code could be:
df = pd.read_csv('FFdata1.csv')
df = df.loc[df.index < df[df.Date.isna()].index[0]]
df['Date'] = pd.to_datetime(df.Date,format='%Y%m')
df.set_index('Date', inplace=True)
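A quick sanity check after the conversion, to confirm the index really is datetime64 and covers the expected monthly range (a hedged snippet, assuming the code above ran without error):
print(df.index.dtype)                  # datetime64[ns]
print(df.index.min(), df.index.max())  # first and last month of the monthly block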

Related

How to get the difference between first row and current row in pandas

I have 2 columns in a dataframe (date & sell_price). I want to add one more column named profit, calculated as the current sell_price minus the first sell_price (the starred one). My expected output is like this:
date sell_price profit(needs to be added)
0 2018-10-26 **21.20** NaN
1 2018-10-29 15.15 -6.05
2 2018-10-30 15.65 -5.55
3 2018-10-31 0.15 -21.05
4 2018-11-01 5.20 -16.00
I know about diff in pandas, which gives the difference between consecutive rows. How can we achieve the expected output with diff or any other pandas function?
For a general Index like a DatetimeIndex, use iloc with iat; these work only with positions, so get_loc is necessary:
pos = df.columns.get_loc('sell_price')
df['profit'] = df.iloc[1:, pos] - df.iat[0, pos]
With the default RangeIndex, use loc with at:
df['profit'] = df.loc[1:, 'sell_price'] - df.at[0, 'sell_price']
print (df)
date sell_price profit
0 2018-10-26 21.20 NaN
1 2018-10-29 15.15 -6.05
2 2018-10-30 15.65 -5.55
3 2018-10-31 0.15 -21.05
4 2018-11-01 5.20 -16.00
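If the NaN in the first row is not required, a shorter alternative is to subtract the first value from the whole column; a minimal sketch, assuming the same df (note it puts 0.00, not NaN, in the first row):
df['profit'] = df['sell_price'] - df['sell_price'].iat[0]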

How to resample AAII weekly data to daily?

I would like to import the following file which contains data in a weekly format (Thursdays only) and convert it to a daily file with the values from Thursday filled out through the next Wednesday skipping Saturday and Sunday.
https://www.aaii.com/files/surveys/sentiment.xls
I can import it:
df = pd.read_excel("C:\\Users\\Public\\Portfolio\\exports\\sentiment.xls", sheet_name = "SENTIMENT", skiprows=3, parse_dates=['Date'], date_format='%m-%d-%y')
Here is the result:
But that is as far as I can get. Even the simplest resampling fails with
TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'RangeIndex'
I tried df['Date'] = pd.to_datetime(df['Date']) and other methods with no incremental success.
Thoughts as to how to get this done?
You can try it like this.
df = pd.read_excel("sentiment.xls", sheet_name = "SENTIMENT", skiprows=3, parse_dates=['Date'], date_format='%m-%d-%y')
Your Date column has NaN values, so when you try to convert it to datetime, the conversion fails:
>>> df['Date']
0 NaN
1 1987-06-26 00:00:00
2 1987-07-17 00:00:00
3 1987-07-24 00:00:00
4 1987-07-31 00:00:00
So, to convert to datetime you need to use errors='coerce':
>>> df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
Now your dates are parsed:
>>> df['Date']
0 NaT
1 1987-06-26
2 1987-07-17
3 1987-07-24
4 1987-07-31
5 1987-08-07
6 1987-08-14
7 1987-08-21
Now set your index to the Date column so you can resample, as mentioned in the comments:
>>> df.set_index('Date', inplace=True)
>>> df.head()
Bullish Neutral Bearish Total Mov Avg Spread Average +St. Dev. - St. Dev. High Low Close
Date
NaT NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1987-06-26 NaN NaN NaN NaN NaN NaN 0.382642 0.484295 0.280989 NaN NaN NaN
1987-07-17 NaN NaN NaN NaN NaN NaN 0.382642 0.484295 0.280989 314.59 307.63 314.59
1987-07-24 0.36 0.50 0.14 1.0 NaN 0.22 0.382642 0.484295 0.280989 311.39 307.81 309.27
1987-07-31 0.26 0.48 0.26 1.0 NaN 0.00 0.382642 0.484295 0.280989 318.66 310.65 318.66
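With the DatetimeIndex in place, the resample that previously raised the TypeError should now work. A minimal sketch, assuming a daily forward-fill is the goal (the NaT row is dropped first so it cannot interfere):
>>> df = df[df.index.notna()]      # drop the row whose date failed to parse
>>> df = df.resample('D').ffill()  # no longer raises TypeError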
I think this is the correct answer: it converts to daily, strips non-trading days, and drops Saturdays and Sundays.
import pandas as pd
from pandas.tseries.offsets import BDay
# read excel, use SENTIMENT sheet, drop the first three rows, parse dates to datetime, index on date
df = pd.read_excel("C:\\Users\\Public\\Portfolio\\exports\\sentiment.xls", sheet_name = "SENTIMENT", skiprows=3, parse_dates=['Date'], date_format='%m-%d-%y', index_col ='Date')
df = df[3:].asfreq('D', method='ffill') # skip 3 lines then expand to daily and fill forward
df = df[df.index.map(BDay().onOffset)] # strip non-trading weekdays
df = df[df.index.dayofweek < 5] # strip Saturdays and Sundays
print(df.head(250))
There may be a more elegant method, but that gets the job done.
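One more elegant candidate: resampling straight to business days replaces the weekend-stripping steps with one call. A sketch assuming the DatetimeIndex built above (like the BDay check, it does not remove weekday market holidays):
df = df.resample('B').ffill()  # 'B' = business days, so Saturdays/Sundays never appear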

python converting URL CSV to pandas for numerical and date values always gives wrong output

Here is Python code that tries to read a CSV file from the Alpha Vantage URL and convert it to a pandas dataframe. There are multiple issues with it. Before describing them, here is the code:
dailyurl = 'https://www.alphavantage.co/query?function=TIME_SERIES_DAILY_ADJUSTED&symbol=NSE:{}&apikey=key&outputsize=full&datatype=csv'.format(Ticker)
cols = ['timestamp', 'open', 'high', 'low', 'close','adjusted_close','volume','dividend_amount','split_coefficient']
dfdaily = pd.read_csv(dailyurl, skiprows=0, header=None,names=cols)
dfmonthly = pd.read_csv(monthlyurl, skiprows=0, header=None,names=cols)
dfdaily.rename(columns = {'timestamp':'date'}, inplace = True)
dfdaily = dfdaily.apply(pd.to_numeric, errors='coerce').fillna(0).astype(float)
dfdaily.drop(dfdaily.index[:1], inplace=True)
dfdaily = dfdaily.apply(pd.to_numeric, errors='coerce').fillna(0).astype(float)
dfdaily.reset_index(inplace=True, drop=False)
print(dfdaily.head(6))
Issues:
dfdaily = pd.read_csv(dailyurl, skiprows=0, header=None, names=cols) does not seem to return a proper pandas dataframe (it looks like it contains strings), hence when I use this dataframe I get the error "high is not double".
The URL's return value contains a multi-index, as below:
0 1 2 3 4
0 Timestamp open High Low close
1 09-02-2017 100 110 99 96
The 0,1,2,3,4 column index above is not wanted, hence I added dfdaily.drop(dfdaily.index[:1], inplace=True). Is there a better way to convert this CSV to a pandas dataframe?
Since the values are read as strings, I tried making the dataframe numeric with this line:
dfdaily = dfdaily.apply(pd.to_numeric, errors='coerce').fillna(0).astype(float)
but this converts the date values to 0.0, which defeats the purpose; the dates should be retained as they are. Also, with this many lines of code the conversion takes a lot of time, so a better way of getting the desired output is needed.
The output I am getting is :
index date open high low close adjusted_close volume
0 1 0.0 1629.05 1655.00 1617.30 1639.40 1639.40 703720.0
1 2 0.0 1654.00 1679.00 1638.05 1662.15 1662.15 750746.0
2 3 0.0 1680.00 1687.00 1620.60 1641.65 1641.65 1466983.0
3 4 0.0 1530.00 1683.75 1511.20 1662.15 1662.15 2109416.0
4 5 0.0 1600.00 1627.95 1546.50 1604.95 1604.95 1472164.0
5 6 0.0 1708.05 1713.00 1620.20 1628.90 1628.90 1645045.0
The multi-index is not required, the date should stay a date (not "0"), and open, high, low and close should be in numerical format. Please shed some light on this optimization: code that gives a numerical pandas dataframe indexed by "date", so it can be used for further arithmetic and logical operations.
I think you need to omit the names parameter, because the CSV already has a header. Also, for a DatetimeIndex add the index_col parameter to set the first column as the index and parse_dates to convert it to datetimes. Finally, rename_axis renames timestamp to date:
dfdaily = pd.read_csv(dailyurl, index_col=[0], parse_dates=[0])
dfdaily = dfdaily.rename_axis('date')
print (dfdaily.head())
open high low close adjusted_close volume \
date
2018-02-09 20.25 21.0 20.25 20.25 20.25 21700
2018-02-08 20.50 20.5 20.25 20.50 20.50 1688900
2018-02-07 20.50 20.5 20.25 20.50 20.50 301800
2018-02-06 20.25 21.0 20.25 20.25 20.25 39400
2018-02-05 20.50 21.0 20.25 20.50 20.50 5400
dividend_amount split_coefficient
date
2018-02-09 0.0 1.0
2018-02-08 0.0 1.0
2018-02-07 0.0 1.0
2018-02-06 0.0 1.0
2018-02-05 0.0 1.0
print (dfdaily.dtypes)
open float64
high float64
low float64
close float64
adjusted_close float64
volume int64
dividend_amount float64
split_coefficient float64
dtype: object
print (dfdaily.index)
DatetimeIndex(['2018-02-09', '2018-02-08', '2018-02-07', '2018-02-06',
'2018-02-05', '2018-02-02', '2018-02-01', '2018-01-31',
'2018-01-30', '2018-01-29',
...
'2000-01-14', '2000-01-13', '2000-01-12', '2000-01-11',
'2000-01-10', '2000-01-07', '2000-01-06', '2000-01-05',
'2000-01-04', '2000-01-03'],
dtype='datetime64[ns]', name='date', length=4556, freq=None)
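One extra detail: Alpha Vantage returns rows newest-first, as the index above shows, so if you plan to slice by date range or resample, it may help to sort the index first (a small optional step):
dfdaily = dfdaily.sort_index()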

Append values in pandas where value equals other value

I have two data frames:
dfi = pd.read_csv('C:/Users/Mauricio/Desktop/inflation.csv')
dfm = pd.read_csv('C:/Users/Mauricio/Desktop/maturity.csv')
# equals the following
observation_date CPIAUCSL
0 1947-01-01 21.48
1 1947-02-01 21.62
2 1947-03-01 22.00
3 1947-04-01 22.00
4 1947-05-01 21.95
observation_date DGS10
0 1962-01-02 4.06
1 1962-01-03 4.03
2 1962-01-04 3.99
3 1962-01-05 4.02
4 1962-01-08 4.03
I created a copy as df doing the following:
df = dfi.copy(deep=True)
which returns an exact copy of dfi. dfi dates go by month and dfm dates go by day. I want to create a new column in df such that every time a date in dfi equals a date in dfm, the corresponding DGS10 value is appended.
I have this so far:
for date in df.observation_date:
    for date2 in dfm.observation_date:
        if date == date2:
            df['mat_rate'] = dfm['DGS10']
# this is what I get but dates do not match values
observation_date CPIAUCSL mat_rate
0 1947-01-01 21.48 4.06
1 1947-02-01 21.62 4.03
2 1947-03-01 22.00 3.99
3 1947-04-01 22.00 4.02
4 1947-05-01 21.95 4.03
It runs, but it does not align the values where date == date2. What can I do so it appends the values only where date equals date2?
Thank you!
If the date formats are inconsistent, convert them first:
dfi.observation_date = pd.to_datetime(dfi.observation_date, format='%Y-%m-%d')
dfm.observation_date = pd.to_datetime(dfm.observation_date, format='%Y-%m-%d')
Now, getting your result should be easy with a merge:
df = dfi.merge(dfm, on='observation_date')
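Note that merge defaults to an inner join, so rows of dfi without a matching day in dfm are dropped. If you would rather keep every row of dfi and get NaN where there is no matching date, a left join is a reasonable variant:
df = dfi.merge(dfm, on='observation_date', how='left')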

Extracting metadata from comment fields using pandas

I need to download and process the Australian Bureau of Meteorology weather files. So far the following Python works well; it extracts and cleanses the data exactly as I want:
import pandas as pd
df = pd.read_csv("ftp://ftp.bom.gov.au/anon/gen/fwo/IDY02122.dat", comment='#', skiprows=3, na_values=-9999.0, quotechar='"', skipfooter=1, names=['stn', 'per', 'evap', 'amax', 'amin', 'gmin', 'suns', 'rain', 'prob'], header=0, converters={'stn': str})
The issue is the file is overwritten daily, and the metadata which indicates what day and time the forecast was produced on is in the comment fields on the first two lines, i.e. the file contains the following data
# date=20131111
# time=06
[fcst_DB]
stn[7] , per, evap, amax, amin, gmin, suns, rain, prob
"001006", 0,-9999.0, 39.9,-9999.0,-9999.0,-9999.0, 4.0, 100.0
"001006", 1,-9999.0, 39.4, 26.5,-9999.0,-9999.0, 6.0, 100.0
"001006", 2,-9999.0, 35.5, 26.2,-9999.0,-9999.0, 7.0, 100.0
Is it possible, using pandas, to include the first two lines in the result? Ideally by adding a date and time column to the result, using the values 20131111 and 06 for each row in the output.
Regards
Dave
Will the first two lines always be a date and time? In that case I'd suggest parsing those separately and handing the rest of the stream off to read_csv.
import urllib2
url = "ftp://ftp.bom.gov.au/anon/gen/fwo/IDY02122.dat"
In [29]: r = urllib2.urlopen(url)
In [30]: date = next(r).strip('# date=').rstrip()
In [31]: time = next(r).strip('# time=').rstrip()
In [32]: stamp = pd.to_datetime(date + ' ' + time)
In [33]: stamp
Out[33]: Timestamp('2013-11-12 00:00:00', tz=None)
Then use your code to read (I changed the skiprows to 1)
In [34]: df = pd.read_csv("ftp://ftp.bom.gov.au/anon/gen/fwo/IDY02122.dat", comment='#',
skiprows=1, na_values=-9999.0, quotechar='"', skipfooter=1,
names=['stn', 'per', 'evap', 'amax', 'amin', 'gmin', 'suns',
'rain', 'prob'], header=0, converters={'stn': str})
In [43]: df['timestamp'] = stamp
In [44]: df.head()
Out[44]:
stn per evap amax amin gmin suns rain prob timestamp
0 001006 0 NaN 39.9 NaN NaN NaN 2.9 100.0 2013-11-12 00:00:00
1 001006 1 NaN 35.8 25.8 NaN NaN 7.0 100.0 2013-11-12 00:00:00
2 001006 2 NaN 37.0 25.5 NaN NaN 4.0 71.4 2013-11-12 00:00:00
3 001006 3 NaN 39.0 26.0 NaN NaN 1.0 60.0 2013-11-12 00:00:00
4 001006 4 NaN 41.2 26.1 NaN NaN 0.0 40.0 2013-11-12 00:00:00
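The answer above is written for Python 2 (urllib2). A minimal Python 3 sketch of the same idea, with an explicit format so the stamp parses unambiguously (urllib.request replaces urllib2, and the FTP stream yields bytes that must be decoded):
import urllib.request
import pandas as pd

url = "ftp://ftp.bom.gov.au/anon/gen/fwo/IDY02122.dat"
with urllib.request.urlopen(url) as r:
    date = next(r).decode().strip('# date=').rstrip()  # e.g. '20131111'
    time = next(r).decode().strip('# time=').rstrip()  # e.g. '06'
stamp = pd.to_datetime(date + ' ' + time, format='%Y%m%d %H')
The rest is unchanged: read the CSV with skiprows=1 as above and assign df['timestamp'] = stamp.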
