I am creating a pandas dataframe from historical weather data downloaded from weather underground.
import json
import requests
import pandas as pd
import numpy as np
import datetime
from dateutil.parser import parse
address = "http://api.wunderground.com/api/7036740167876b59/history_20060405/q/CA/San_Francisco.json"
r = requests.get(address)
wu_data = r.json()
Because I do not need all the data I only use the list of observations. This list contains two elements - date and utcdate - that are actually dictionaries.
df = pd.DataFrame.from_dict(wu_data["history"]["observations"])
I would like to index the dataframe I have created with the parsed date from the 'pretty' key within the dictionary. I can access this value by using the array index, but I can't figure out how to do this directly without a loop. For example, for the 23th element I can write
pretty_date = df["date"].values[23]["pretty"]
print pretty_date
time = parse(pretty_date)
print time
And I get
11:56 PM PDT on April 05, 2006
2006-04-05 23:56:00
This is what I am doing at the moment
g = lambda x: parse(x["pretty"])
df_dates = pd.DataFrame.from_dict(df["date"])
df.index = df_date["date"].apply(g)
df is now reindexed. At this point I can remove the columns I do not need.
Is there a more direct way to do this?
Please notice that sometimes there are multiple observations for the same date, but I deal with data cleaning, duplicates, etc. in a different part of the code.
Since the dtype held in pretty is just object, you can simply grab them to a list and get indexed. Not sure if this is what you want:
# by the way, `r.json` should be without ()`
wu_data = r.json
df = pd.DataFrame.from_dict(wu_data["history"]["observations"])
# just index using list comprehension, getting "pretty" inside df["date"] object.
df.index = [parse(df["date"][n]["pretty"]) for n in range(len(df))]
df.index
<class 'pandas.tseries.index.DatetimeIndex'>
[2006-04-05 00:56:00, ..., 2006-04-05 23:56:00]
Length: 24, Freq: None, Timezone: None
Hope this helps.
Related
I am fairly new to python and coding in general.
I have a big data file that provides daily data for the period 2011-2018 for a number of stock tickers (300~).
The data is a .csv file with circa 150k rows and looks as follows (short example):
Date,Symbol,ShortExemptVolume,ShortVolume,TotalVolume
20110103,AAWW,0.0,28369,78113.0
20110103,AMD,0.0,3183556,8095093.0
20110103,AMRS,0.0,14196,18811.0
20110103,ARAY,0.0,31685,77976.0
20110103,ARCC,0.0,177208,423768.0
20110103,ASCMA,0.0,3930,26527.0
20110103,ATI,0.0,193772,301287.0
20110103,ATSG,0.0,23659,72965.0
20110103,AVID,0.0,7211,18896.0
20110103,BMRN,0.0,21740,213974.0
20110103,CAMP,0.0,2000,11401.0
20110103,CIEN,0.0,625165,1309490.0
20110103,COWN,0.0,3195,24293.0
20110103,CSV,0.0,6133,25394.0
I have a function that allows me to filter for a specific symbol and get 10 observations before and after a specified date (could be any date between 2011 and 2018).
import pandas as pd
from datetime import datetime
import urllib
import datetime
def get_data(issue_date, stock_ticker):
df = pd.read_csv (r'D:\Project\Data\Short_Interest\exampledata.csv')
df['Date'] = pd.to_datetime(df['Date'], format="%Y%m%d")
d = df
df = pd.DataFrame(d)
short = df.loc[df.Symbol.eq(stock_ticker)]
# get the index of the row of interest
ix = short[short.Date.eq(issue_date)].index[0]
# get the item row for that row's index
iloc_ix = short.index.get_loc(ix)
# get the +/-1 iloc rows (+2 because that is how slices work), basically +1 and -1 trading days
short_data = short.iloc[iloc_ix-10: iloc_ix+11]
return [short_data]
I want to create a script that iterates a list of 'issue_dates' and 'stock_tickers'. The list (a .csv) looks as following:
ARAY,07/08/2017
ARAY,24/04/2014
ACETQ,16/11/2015
ACETQ,16/11/2015
NVLNA,15/08/2014
ATSG,29/09/2017
ATI,24/05/2016
MDRX,18/06/2013
MDRX,18/06/2013
AMAGX,10/05/2017
AMAGX,14/02/2014
AMD,14/09/2016
To break down my problem and question I would like to know how to do the following:
First, how do I load the inputs?
Second, how do I call the function on each input?
And last, how do I accumulate all the function returns in one dataframe?
To load the inputs and call the function for each row; iterate over the csv file and pass each row's values to the function and accumulate the resulting Seriesin a list.
I modified your function a bit: removed the DataFrame creation so it is only done once and added a try/except block to account for missing dates or tickers (your example data didn't match up too well). The dates in the second csv look like they are day/month/year so I converted them for that format.
import pandas as pd
import datetime, csv
def get_data(df, issue_date, stock_ticker):
'''Return a Series for the ticker centered on the issue date.
'''
short = df.loc[df.Symbol.eq(stock_ticker)]
# get the index of the row of interest
try:
ix = short[short.Date.eq(issue_date)].index[0]
# get the item row for that row's index
iloc_ix = short.index.get_loc(ix)
# get the +/-1 iloc rows (+2 because that is how slices work), basically +1 and -1 trading days
short_data = short.iloc[iloc_ix-10: iloc_ix+11]
except IndexError:
msg = f'no data for {stock_ticker} on {issue_date}'
#log.info(msg)
print(msg)
short_data = None
return short_data
df = pd.read_csv (datafile)
df['Date'] = pd.to_datetime(df['Date'], format="%Y%m%d")
results = []
with open('issues.csv') as issues:
for ticker,date in csv.reader(issues):
day,month,year = map(int,date.split('/'))
# dt = datetime.datetime.strptime(date, r'%d/%m/%Y')
date = datetime.date(year,month,day)
s = get_data(df,date,ticker)
results.append(s)
# print(s)
Creating a single DataFrame or table for all that info may be problematic especially since the date ranges are all different. Probably should ask a separate question regarding that. Its mcve should probably just include a few minimal Pandas Series with a couple of different date ranges and tickers.
I am having a hard time summing two dates that are saved in two separate json files. I want to add set dates together which are saved in separate libraries.
The first file (A1.json) contains: {"expires": "2019-09-11"}
The second file (Whitelist.json) contains: {"expires": "0000-01-00"}
These dates are created by using tkcalendar and are later exported to these seperate files, the idea being that summing them lets me set a time date one month into the future. However, I can't seem to add them together without some form of an error.
I have tried converting the json files to strings in python and then adding them and also using the striptime command to sum the dates.
Here is the relevant chunk of the code:
{with open('A1.json') as f:
data=json.loads(f.read())
for material in data.items():
A1 = (format(material[1]['expires']))
with open('Whitelist.json') as f:
data=json.loads(f.read())
for material in data.items():
A2 = (format(material[1]['expires']))
print(A1+A2)}
When this is used, they just get pasted one after another. They don't get summed the way I need.
I also have tried the following code:
{t1 = dt.datetime.strptime('A1', '%d-%m-%Y')
t2 = dt.datetime.strptime('Whitelist', '%d-%m-%Y')
time_zero = dt.datetime.strptime('00:00:00', '%d/%m/%Y')
print((t1 - time_zero + Whitelist).time())}
However, this constantly gives out ValueError: time data does not match format '%y:%m:%d'.
What I expect is the sum of 2019-09-11 and 0000-01-00's result is 2019-10-11. However, the result is 2019-09-110000-01-00. Trying the strptime method gives out ValueErrors such as: ValueError: time data does not match format '%y:%m:%d'.
Thank you in advance, and I apologize if I did something wrong on my first post.
Use pandas:
the actual format of the json file isn't provided, so use something like the following to get the data into a DataFrame:
pd.read_json('A1.json', orient='records'): parameters will depend on the format of the file
json_normalize
d2 is not a proper datetime format so don't try to convert it.
the Code section below, will use a dict to set up the DataFrame for the example.
json files to DataFrames:
df1 = pd.read_json('A1.json', orient='records')
df2 = pd.read_json('Whitelist.json', orient='records')
df = pd.DataFrame()
df['expires'] = df1.expires
df['d2'] = df2.expires
Code:
import pandas as pd
df = pd.DataFrame({"expires": ["2019-09-11", "2019-10-11", "2019-11-11"],
"d2": ["0000-01-00", "0000-02-00", "0000-03-00"]})
Expand d2 using str.split:
df.expires = pd.to_datetime(df.expires)
df[['y', 'm', 'd']] = df.d2.str.split('-', expand=True)
Use pd.DateOffset:
df['expires_new'] = df[['expires', 'm']].apply(lambda x: x[0] + pd.DateOffset(months=int(x[1])), axis=1)
if d2 is expected to have more than just a new m or month value, the lambda expression can be changed to call a function that adjusts for y, m, and d values.
I am pulling data using pytreasurydirect and I would like to query each unique cusip and then append them and create a pandas dataframe table. I am having difficulties generating the the pandas dataframe. I believe it is because of the unicode structure of the data.
import pandas as pd
from pytreasurydirect import TreasuryDirect
td = TreasuryDirect()
cusip_list = [['912796PY9','08/09/2018'],['912796PY9','06/07/2018']]
for i in cusip_list:
cusip =''.join(i[0])
issuedate =''.join(i[1])
cusip_value=(td.security_info(cusip, issuedate))
#pd.DataFrame(cusip_value.items())
df = pd.DataFrame(cusip_value, index=['a'])
td = td.append(df, ignore_index=False)
Example of data from pytreasurydirect :
Index([u'accruedInterestPer100', u'accruedInterestPer1000',
u'adjustedAccruedInterestPer1000', u'adjustedPrice',
u'allocationPercentage', u'allocationPercentageDecimals',
u'announcedCusip', u'announcementDate', u'auctionDate',
u'auctionDateYear',
...
u'totalTendered', u'treasuryDirectAccepted',
u'treasuryDirectTendersAccepted', u'type',
u'unadjustedAccruedInterestPer1000', u'unadjustedPrice',
u'updatedTimestamp', u'xmlFilenameAnnouncement',
u'xmlFilenameCompetitiveResults', u'xmlFilenameSpecialAnnouncement'],
dtype='object', length=116)
I think you want to define a function like this:
def securities(type):
secs = td.security_type(type)
keys = secs[0].keys() if secs else []
seri = [pd.Series([sec[key] for sec in secs]) for key in keys]
return pd.DataFrame(dict(zip(keys, seri)))
Then, use it:
df = securities('Bond')
df[['cusip', 'issueDate', 'maturityDate']].head()
to get results like these, for example (TreasuryDirect returns a lot of addition columns):
cusip issueDate maturityDate
0 912810SD1 2018-08-15T00:00:00 2048-08-15T00:00:00
1 912810SC3 2018-07-16T00:00:00 2048-05-15T00:00:00
2 912810SC3 2018-06-15T00:00:00 2048-05-15T00:00:00
3 912810SC3 2018-05-15T00:00:00 2048-05-15T00:00:00
4 912810SA7 2018-04-16T00:00:00 2048-02-15T00:00:00
At least today those are the results today. The results will change over time as bonds are issued and, alas, mature. Note the multiple issueDates per cusip.
Finally, per the TreasuryDirect website (https://www.treasurydirect.gov/webapis/webapisecurities.htm), the possible security types are: Bill, Note, Bond, CMB, TIPS, FRN.
If this question is unclear, I am very open to constructive criticism.
I have an excel table with about 50 rows of data, with the first column in each row being a date. I need to access all the data for only one date, and that date appears only about 1-5 times. It is the most recent date so I've already organized the table by date with the most recent being at the top.
So my goal is to store that date in a variable and then have Python look only for that variable (that date) and take only the columns corresponding to that variable. I need to use this code on 100's of other excel files as well, so it would need to arbitrarily take the most recent date (always at the top though).
My current code below simply takes the first 5 rows because I know that's how many times this date occurs.
import os
from numpy import genfromtxt
import pandas as pd
path = 'Z:\\folderwithcsvfile'
for filename in os.listdir(path):
file_path = os.path.join(path, filename)
if os.path.isfile(file_path):
broken_df = pd.read_csv(file_path)
df3 = broken_df['DATE']
df4 = broken_df['TRADE ID']
df5 = broken_df['AVAILABLE STOCK']
df6 = broken_df['AMOUNT']
df7 = broken_df['SALE PRICE']
print (df3)
#print (df3.head(6))
print (df4.head(6))
print (df5.head(6))
print (df6.head(6))
print (df7.head(6))
This is a relatively simple filtering operation. You state that you want to "take only the columns" that are the latest date, so I assume that an acceptable result will be a filter DataFrame with just the correct columns.
Here's a simple CSV that is similar to your structure:
DATE,TRADE ID,AVAILABLE STOCK
10/11/2016,123,123
10/11/2016,123,123
10/10/2016,123,123
10/9/2016,123,123
10/11/2016,123,123
Note that I mixed up the dates a little bit, because it's hacky and error-prone to just assume that the latest dates will be on the top. The following script will filter it appropriately:
import pandas as pd
import numpy as np
df = pd.read_csv('data.csv')
# convert the DATE column to datetimes
df['DATE'] = pd.to_datetime(df['DATE'])
# find the latest datetime
latest_date = df['DATE'].max()
# use index filtering to only choose the columns that equal the latest date
latest_rows = df[df['DATE'] == latest_date]
print (latest_rows)
# now you can perform your operations on latest_rows
In my example, this will print:
DATE TRADE ID AVAILABLE STOCK
0 2016-10-11 123 123
1 2016-10-11 123 123
4 2016-10-11 123 123
i have a pandas dataframe having a column as
from pandas import DataFrame
df = pf.DataFrame({ 'column_name' : [u'Monday,30 December,2013', u'Delivered', u'19:23', u'1']})
now i want to extract every thing from it and store in three columns as
date status time
[30/December/2013] ['Delivered'] [19:23]
i have so far used this :
import dateutil.parser as dparser
dparser.parse([u'Monday,30 December,2013', u'Delivered', u'19:23', u'1'])
but this throws an error . can anyone please guide me to a solution ?
You can apply() a function to a column, see the whole example:
from pandas import DataFrame
df = DataFrame({'date': ['Monday,30 December,2013'], 'delivery': ['Delivered'], 'time': ['19:23'], 'status':['1']})
# delete the status column
del df['status']
def splitter(val):
parts = val.split(',')
return parts[1]
df['date'] = df['date'].apply(splitter)
This yields a dataframe with date, delivery and the time.