Extracting metadata from comment fields using pandas - python

I need to download and process the Australian Bureau of Meteorology weather files. So far the following Python works well; it extracts and cleanses the data exactly as I want:
import pandas as pd
df = pd.read_csv("ftp://ftp.bom.gov.au/anon/gen/fwo/IDY02122.dat", comment='#',
                 skiprows=3, na_values=-9999.0, quotechar='"', skipfooter=1,
                 names=['stn', 'per', 'evap', 'amax', 'amin', 'gmin', 'suns',
                        'rain', 'prob'],
                 header=0, converters={'stn': str})
The issue is that the file is overwritten daily, and the metadata indicating what day and time the forecast was produced on sits in the comment fields on the first two lines, i.e. the file contains the following data:
# date=20131111
# time=06
[fcst_DB]
stn[7] , per, evap, amax, amin, gmin, suns, rain, prob
"001006", 0,-9999.0, 39.9,-9999.0,-9999.0,-9999.0, 4.0, 100.0
"001006", 1,-9999.0, 39.4, 26.5,-9999.0,-9999.0, 6.0, 100.0
"001006", 2,-9999.0, 35.5, 26.2,-9999.0,-9999.0, 7.0, 100.0
Is it possible, using pandas, to include the first two lines in the result? Ideally by adding a date and a time column to the result and using the values 20131111 and 06 for each row in the output.

Will the first two lines always be a date and time? In that case I'd suggest parsing those separately and handing the rest of the stream off to read_csv.
In [27]: url = "ftp://ftp.bom.gov.au/anon/gen/fwo/IDY02122.dat"
In [28]: import urllib2
In [29]: r = urllib2.urlopen(url)
In [30]: date = next(r).strip('# date=').rstrip()
In [31]: time = next(r).strip('# time=').rstrip()
In [32]: stamp = pd.to_datetime(date + ' ' + time)
In [33]: stamp
Out[33]: Timestamp('2013-11-12 00:00:00', tz=None)
Then use your code to read (I changed the skiprows to 1)
In [34]: df = pd.read_csv("ftp://ftp.bom.gov.au/anon/gen/fwo/IDY02122.dat", comment='#',
                          skiprows=1, na_values=-9999.0, quotechar='"', skipfooter=1,
                          names=['stn', 'per', 'evap', 'amax', 'amin', 'gmin', 'suns',
                                 'rain', 'prob'], header=0, converters={'stn': str})
In [43]: df['timestamp'] = stamp
In [44]: df.head()
Out[44]:
stn per evap amax amin gmin suns rain prob timestamp
0 001006 0 NaN 39.9 NaN NaN NaN 2.9 100.0 2013-11-12 00:00:00
1 001006 1 NaN 35.8 25.8 NaN NaN 7.0 100.0 2013-11-12 00:00:00
2 001006 2 NaN 37.0 25.5 NaN NaN 4.0 71.4 2013-11-12 00:00:00
3 001006 3 NaN 39.0 26.0 NaN NaN 1.0 60.0 2013-11-12 00:00:00
4 001006 4 NaN 41.2 26.1 NaN NaN 0.0 40.0 2013-11-12 00:00:00
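urllib2 is Python 2 only; for newer Pythons, here is a minimal sketch of the same approach (the URL and read_csv arguments are taken from the question; the to_datetime format string is an assumption based on the sample metadata):
import pandas as pd
from urllib.request import urlopen

url = "ftp://ftp.bom.gov.au/anon/gen/fwo/IDY02122.dat"

# Parse the two metadata comment lines ("# date=20131111", "# time=06").
with urlopen(url) as r:
    date = next(r).decode().split("=", 1)[1].strip()
    time = next(r).decode().split("=", 1)[1].strip()
stamp = pd.to_datetime(date + " " + time, format="%Y%m%d %H")

# Re-read the table with the question's original call.
df = pd.read_csv(url, comment='#', skiprows=3, na_values=-9999.0,
                 quotechar='"', skipfooter=1, engine='python',
                 names=['stn', 'per', 'evap', 'amax', 'amin', 'gmin',
                        'suns', 'rain', 'prob'],
                 header=0, converters={'stn': str})
df['timestamp'] = stamp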

Related

Pandas Aggregate Daily Data to Monthly Timeseries

I have a time series that looks like this (below), and I want to resample it monthly, so that 2019-10 equals the average of all the PTS values for October, 2019-11 the average of all the PTS values for November, and so on.
However, when I use the df.resample('M').mean() method, if the final day of a month does not have a value, it fills in a NaN in my data frame. How do I solve this?
Date PTS
2019-10-23 14.0
2019-10-26 14.0
2019-10-27 8.0
2019-10-29 29.0
2019-10-31 17.0
2019-11-03 12.0
2019-11-05 2.0
2019-11-07 15.0
2019-11-08 7.0
2019-11-14 16.0
2019-11-16 12.0
2019-11-20 22.0
2019-11-22 9.0
2019-11-23 20.0
2019-11-25 18.0
Would this work?
df.resample('M').mean().dropna()
Do you have a code sample? This works:
import pandas as pd
import numpy as np
rng = np.random.default_rng()
days = np.arange(31)
data = pd.DataFrame({"dates": np.datetime64("2019-03-01") + rng.choice(days, 60),
                     "values": rng.integers(0, 60, size=60)})
data.set_index("dates", inplace=True)
# Set the last day to null.
data.loc["2019-03-31"] = np.nan
# This works
data.resample("M").mean()
It also works with an incomplete month:
incomplete_days = np.arange(10)
data = pd.DataFrame({"dates": np.datetime64("2019-03-01") + rng.choice(incomplete_days, 10),
                     "values": rng.integers(0, 60, size=10)})
data.set_index("dates", inplace=True)
data.resample("M").mean()
You should check your data and types more thoroughly in case the NaN you're receiving indicates a more pressing issue.
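To tie this back to the question's data, a minimal sketch recreating the table above and applying the dropna suggestion from the comments:
import pandas as pd

# Recreate the question's series.
s = pd.Series(
    [14.0, 14.0, 8.0, 29.0, 17.0, 12.0, 2.0, 15.0, 7.0,
     16.0, 12.0, 22.0, 9.0, 20.0, 18.0],
    index=pd.to_datetime([
        "2019-10-23", "2019-10-26", "2019-10-27", "2019-10-29", "2019-10-31",
        "2019-11-03", "2019-11-05", "2019-11-07", "2019-11-08", "2019-11-14",
        "2019-11-16", "2019-11-20", "2019-11-22", "2019-11-23", "2019-11-25",
    ]),
    name="PTS",
)

monthly = s.resample("M").mean()   # one mean per calendar month
print(monthly.dropna())            # drop any month whose mean came out NaN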

How to add rows to a table with missing dates (days), filling in the added rows with the values copied from the rows below?

I have table:
import pandas as pd
import numpy as np
df = pd.DataFrame([
    ("2019-01-22", np.nan, np.nan),
    ("2019-01-25", 10, 15),
    ("2019-01-28", 200, 260),
    ("2019-02-03", 3010, 3800),
    ("2019-02-05", 40109, 45009)],
    columns=["date", "col1", "col2"])
I need to add new rows to the table wherever a date (day) is missing. In the added rows, columns col1 and col2 must take the values copied from the row below in the table (the row with the nearest more recent date). I need to get the table shown in the output below.
Use pandas.to_datetime and asfreq:
df.set_index(pd.to_datetime(df['date'])).drop(columns='date').asfreq('D').bfill().reset_index()
Output:
date col1 col2
0 2019-01-22 10.0 15.0
1 2019-01-23 10.0 15.0
2 2019-01-24 10.0 15.0
3 2019-01-25 10.0 15.0
4 2019-01-26 200.0 260.0
5 2019-01-27 200.0 260.0
6 2019-01-28 200.0 260.0
7 2019-01-29 3010.0 3800.0
8 2019-01-30 3010.0 3800.0
9 2019-01-31 3010.0 3800.0
10 2019-02-01 3010.0 3800.0
11 2019-02-02 3010.0 3800.0
12 2019-02-03 3010.0 3800.0
13 2019-02-04 40109.0 45009.0
14 2019-02-05 40109.0 45009.0
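For comparison, an equivalent sketch using reindex over a complete daily date_range (this should give the same table, assuming the dates parse cleanly):
import pandas as pd

idx = pd.to_datetime(df['date'])
full_range = pd.date_range(idx.min(), idx.max(), freq='D')

out = (df.set_index(idx)
         .drop(columns='date')
         .reindex(full_range)   # insert the missing days as NaN rows
         .bfill()               # copy values up from the next known row
         .rename_axis('date')
         .reset_index())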
df = df.sort_values("date")
df = df.bfill()
Sort the dataframe according to date and fill the nulls with the next non-null values.
Try this code:
import pandas as pd
import numpy as np
df = pd.DataFrame([
("2019-01-22", np.nan, np.nan),
("2019-01-25", 10, 15),
("2019-01-28", 200, 260),
("2019-02-03", 3010, 3800),
("2019-02-05", 40109, 45009)],
columns=["date", "col1", "col2"])
df['date'] = pd.to_datetime(df['date'])
df.index = df['date']
df.drop(columns='date', inplace=True)
df = df.resample('D').asfreq().bfill()
df.reset_index(inplace=True)
- convert date to an actual date object (it was a str)
- set the index to the date column (because of how resample/bfill work)
- drop the date column
- resample dates on a daily basis, backfilling missing data
- reset the index so it's back to being a regular column
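The same steps as a single chain (a sketch of the code above, not a new method):
out = (df.assign(date=pd.to_datetime(df['date']))
         .set_index('date')
         .resample('D').asfreq()
         .bfill()
         .reset_index())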

How to resample AAII weekly data to daily?

I would like to import the following file, which contains data in a weekly format (Thursdays only), and convert it to a daily file with the values from Thursday filled forward through the next Wednesday, skipping Saturday and Sunday.
https://www.aaii.com/files/surveys/sentiment.xls
I can import it:
df = pd.read_excel("C:\\Users\\Public\\Portfolio\\exports\\sentiment.xls", sheet_name = "SENTIMENT", skiprows=3, parse_dates=['Date'], date_format='%m-%d-%y')
That is as far as I can get, though. Even the simplest resampling fails with
TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'RangeIndex'
I tried df['Date'] = pd.to_datetime(df['Date']) and other methods with no incremental success.
Thoughts as to how to get this done?
You can try it like this:
df = pd.read_excel("sentiment.xls", sheet_name = "SENTIMENT", skiprows=3, parse_dates=['Date'], date_format='%m-%d-%y')
Your Date column has NaN values, so when you try to convert it to datetime it fails:
>>> df['Date']
0 NaN
1 1987-06-26 00:00:00
2 1987-07-17 00:00:00
3 1987-07-24 00:00:00
4 1987-07-31 00:00:00
So, to convert to datetime you need to use errors='coerce':
>>> df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
Now your dates are parsed:
>>> df['Date']
0 NaT
1 1987-06-26
2 1987-07-17
3 1987-07-24
4 1987-07-31
5 1987-08-07
6 1987-08-14
7 1987-08-21
Now set your index to the Date column so you can resample, as mentioned in the comments:
>>> df.set_index('Date', inplace=True)
>>> df.head()
Bullish Neutral Bearish Total Mov Avg Spread Average +St. Dev. - St. Dev. High Low Close
Date
NaT NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1987-06-26 NaN NaN NaN NaN NaN NaN 0.382642 0.484295 0.280989 NaN NaN NaN
1987-07-17 NaN NaN NaN NaN NaN NaN 0.382642 0.484295 0.280989 314.59 307.63 314.59
1987-07-24 0.36 0.50 0.14 1.0 NaN 0.22 0.382642 0.484295 0.280989 311.39 307.81 309.27
1987-07-31 0.26 0.48 0.26 1.0 NaN 0.00 0.382642 0.484295 0.280989 318.66 310.65 318.66
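From here, a minimal sketch of the daily expansion the question asks for (an assumption: the NaT row must be dropped first; weekends still appear and are stripped in the next answer):
df = df[df.index.notna()]         # drop the NaT row left over from the header junk
daily = df.resample('D').ffill()  # carry each Thursday's values forward day by day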
I think this is the correct answer: it converts to daily and strips non-trading days and Saturdays/Sundays.
import pandas as pd
from pandas.tseries.offsets import BDay
# read csv, use SENTIMENT sheet, drop the first three rows, parse dates to datetime, index on date
df = pd.read_excel("C:\\Users\\Public\\Portfolio\\exports\\sentiment.xls", sheet_name="SENTIMENT",
                   skiprows=3, parse_dates=['Date'], date_format='%m-%d-%y', index_col='Date')
df = df[3:].asfreq('D', method='ffill')     # skip 3 lines then expand to daily and fill forward
df = df[df.index.map(BDay().is_on_offset)]  # strip non-trading weekdays
df = df[df.index.dayofweek < 5]             # strip Saturdays and Sundays
print(df.head(250))
There may be a more elegant method, but that gets the job done.
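For what it's worth, a possibly more elegant sketch using a business-day frequency directly (assuming the index is already a clean DatetimeIndex, as above):
import pandas as pd

df = pd.read_excel("sentiment.xls", sheet_name="SENTIMENT", skiprows=3,
                   parse_dates=['Date'], date_format='%m-%d-%y', index_col='Date')
df = df[3:]                             # drop the leading junk rows, as above
daily = df.asfreq('B', method='ffill')  # business days only, so Sat/Sun never appear
Like the BDay filter above, this still keeps weekday market holidays; stripping those would need an exchange calendar.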

Python: converting URL CSV to pandas for numerical and date values always gives wrong output

Here is Python code that tries to read the CSV file from the Alpha Vantage URL and convert it to a pandas DataFrame. There are multiple issues with it. Before raising them, here is the code:
dailyurl = 'https://www.alphavantage.co/query?function=TIME_SERIES_DAILY_ADJUSTED&symbol=NSE:{}&apikey=key&outputsize=full&datatype=csv'.format(Ticker)
cols = ['timestamp', 'open', 'high', 'low', 'close','adjusted_close','volume','dividend_amount','split_coefficient']
dfdaily = pd.read_csv(dailyurl, skiprows=0, header=None, names=cols)
dfmonthly = pd.read_csv(monthlyurl, skiprows=0, header=None, names=cols)
dfdaily.rename(columns={'timestamp': 'date'}, inplace=True)
dfdaily = dfdaily.apply(pd.to_numeric, errors='coerce').fillna(0).astype(float)
dfdaily.drop(dfdaily.index[:1], inplace=True)
dfdaily = dfdaily.apply(pd.to_numeric, errors='coerce').fillna(0).astype(float)
dfdaily.reset_index(inplace=True, drop=False)
print(dfdaily.head(6))
Issues:
1. dfdaily = pd.read_csv(dailyurl, skiprows=0, header=None, names=cols) returns values that do not seem to match a pandas DataFrame (it looks like it contains strings), so when I use this DataFrame I get the error "high is not double".
2. The URL's return value contains a multi-index, as below:
0 1 2 3 4
0 Timestamp open High Low close
1 09-02-2017 100 110 99 96
The first 0,1,2,3,4 column index is not wanted, hence I added dfdaily.drop(dfdaily.index[:1], inplace=True). Is there a better way to build the DataFrame when converting this CSV?
3. Since the values read are strings, I tried making the DataFrame numeric with this line:
dfdaily = dfdaily.apply(pd.to_numeric, errors='coerce').fillna(0).astype(float)
but this converts the date values to 0.0, which defeats the purpose; the dates should be retained as they are. And with this many lines of code the conversion takes a lot of time, so a better way of getting the desired output is needed.
The output I am getting is:
index date open high low close adjusted_close volume
0 1 0.0 1629.05 1655.00 1617.30 1639.40 1639.40 703720.0
1 2 0.0 1654.00 1679.00 1638.05 1662.15 1662.15 750746.0
2 3 0.0 1680.00 1687.00 1620.60 1641.65 1641.65 1466983.0
3 4 0.0 1530.00 1683.75 1511.20 1662.15 1662.15 2109416.0
4 5 0.0 1600.00 1627.95 1546.50 1604.95 1604.95 1472164.0
5 6 0.0 1708.05 1713.00 1620.20 1628.90 1628.90 1645045.0
The MultiIndex is not required, the date should remain a date (not 0), and open, high, low, and close should be numeric. Please shed some light on optimizing this: ideally code that yields a numeric pandas DataFrame indexed by "date", ready for further arithmetic and logical operations.
I think you need to omit the names parameter, because the csv has a header. Also, for a DatetimeIndex, add the index_col parameter to set the first column as the index and parse_dates to convert it to datetimes. Last, rename_axis renames timestamp to date:
dfdaily = pd.read_csv(dailyurl, index_col=[0], parse_dates=[0])
dfdaily = dfdaily.rename_axis('date')
print (dfdaily.head())
open high low close adjusted_close volume \
date
2018-02-09 20.25 21.0 20.25 20.25 20.25 21700
2018-02-08 20.50 20.5 20.25 20.50 20.50 1688900
2018-02-07 20.50 20.5 20.25 20.50 20.50 301800
2018-02-06 20.25 21.0 20.25 20.25 20.25 39400
2018-02-05 20.50 21.0 20.25 20.50 20.50 5400
dividend_amount split_coefficient
date
2018-02-09 0.0 1.0
2018-02-08 0.0 1.0
2018-02-07 0.0 1.0
2018-02-06 0.0 1.0
2018-02-05 0.0 1.0
print (dfdaily.dtypes)
open float64
high float64
low float64
close float64
adjusted_close float64
volume int64
dividend_amount float64
split_coefficient float64
dtype: object
print (dfdaily.index)
DatetimeIndex(['2018-02-09', '2018-02-08', '2018-02-07', '2018-02-06',
'2018-02-05', '2018-02-02', '2018-02-01', '2018-01-31',
'2018-01-30', '2018-01-29',
...
'2000-01-14', '2000-01-13', '2000-01-12', '2000-01-11',
'2000-01-10', '2000-01-07', '2000-01-06', '2000-01-05',
'2000-01-04', '2000-01-03'],
dtype='datetime64[ns]', name='date', length=4556, freq=None)
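One follow-up worth noting: the index above runs newest-first (2018 down to 2000), so if later arithmetic assumes chronological order, sort the index first (a small sketch reusing dfdaily from above):
dfdaily = dfdaily.sort_index()                 # oldest row first
print(dfdaily.index.is_monotonic_increasing)   # True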

How to fillna/missing values for an irregular timeseries for a Drug when Half-life is known

I have a dataframe (df) where column A is drug units dosed at the time point given by Timestamp. I want to fill the missing values (NaN) with the drug concentration implied by the half-life of the drug (180 mins). I am struggling with the code in pandas. Would really appreciate help and insight. Thanks in advance.
df
A
Timestamp
1991-04-21 09:09:00 9.0
1991-04-21 3:00:00 NaN
1991-04-21 9:00:00 NaN
1991-04-22 07:35:00 10.0
1991-04-22 13:40:00 NaN
1991-04-22 16:56:00 NaN
Given that the half-life of the drug is 180 mins, I want to fill the NaN values as a function of the time elapsed and the half-life of the drug, something like:
Timestamp A
1991-04-21 09:00:00 9.0
1991-04-21 3:00:00 ~2.25
1991-04-21 9:00:00 ~0.55
1991-04-22 07:35:00 10.0
1991-04-22 13:40:00 ~2.5
1991-04-22 16:56:00 ~0.75
Your timestamps are not sorted and I'm assuming this was a typo. I fixed it below.
import pandas as pd
import numpy as np
from io import StringIO
text = """TimeStamp              A
1991-04-21 09:09:00    9.0
1991-04-21 13:00:00    NaN
1991-04-21 19:00:00    NaN
1991-04-22 07:35:00    10.0
1991-04-22 13:40:00    NaN
1991-04-22 16:56:00    NaN"""
df = pd.read_csv(StringIO(text), sep='\s{2,}', engine='python', parse_dates=[0])
This is the magic code.
# half-life of 180 minutes is 10,800 seconds
# we need to calculate lamda (intentionally mis-spelled)
lamda = 10800 / np.log(2)
# returns time difference for each element
# relative to first element
def time_diff(x):
    return x - x.iloc[0]
# create partition of non-nulls with subsequent nulls
partition = df.A.notnull().cumsum()
# calculate time differences in seconds for each
# element relative to most recent non-null observation
# use .dt accessor and method .total_seconds()
tdiffs = df.TimeStamp.groupby(partition).apply(time_diff).dt.total_seconds()
# apply exponential decay
decay = np.exp(-tdiffs / lamda)
# finally, forward fill the observations and multiply by decay
decay * df.A.ffill()
0 9.000000
1 3.697606
2 0.924402
3 10.000000
4 2.452325
5 1.152895
dtype: float64
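As a quick sanity check on the decay constant, after exactly one half-life the decay factor should be 0.5 (a small sketch):
import numpy as np

half_life_s = 180 * 60           # 180 minutes in seconds
lamda = half_life_s / np.log(2)
assert np.isclose(np.exp(-half_life_s / lamda), 0.5)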
