I have a use case where I need to create a Python dictionary keyed by year-month and then concatenate all the dataframes into a single dataframe. My current implementation is below:
dict_year_month = {}
temp_dict_1 = {}
temp_dict_2 = {}
for ym in [201104, 201105 ... 201706]:
    key_name = 'df_' + str(ym) + 'A'
    temp_dict_1[key_name] = df[(df['col1'] <= ym) & (df['col2'] > ym)
                               & (df['col3'] == 1)]
    temp_dict_2[key_name] = df[(df['col1'] <= ym) & (df['col2'] == 0)
                               & (df['col3'] == 1)]
    if not temp_dict_1[key_name].empty:
        dict_year_month[key_name] = temp_dict_1[key_name]
        dict_year_month[key_name].loc[:, 'new_col'] = ym
    elif not temp_dict_2[key_name].empty:
        dict_year_month[key_name] = temp_dict_2[key_name]
        dict_year_month[key_name].loc[:, 'new_col'] = ym
    dict_year_month[key_name] = dict_year_month[key_name].sort_values('col4')
    dict_year_month[key_name] = dict_year_month[key_name].drop_duplicates('col5')
    # .. do some other processing

# create individual dataframes as df_201104A .. and so on ..
dict_year_month
# concatenate all the above individual dataframes into a single dataframe:
df1 = pd.concat([
    dict_year_month['df_201104A'], dict_year_month['df_201105A'],
    ... so on till dict_year_month['df_201706A']])
Now the challenge: I have to rerun this code every quarter, so each time I have to update the script with the new year-month dict keys, and the pd.concat call needs to be updated with the new year-months as well. I am looking for another solution, where I could perhaps read the keys, and build the list of dataframes to concatenate, from a properties or config file?
There are only a few things you need to do to get there. The first is to enumerate the months between your start and end month, which I do below using rrule, reading the start and end dates from a file; this gets you the keys for your dictionary. Then just use the .values() method on the dictionary to get all the dataframes.
from dateutil import rrule
from datetime import datetime
import pickle
import pandas as pd

# get these from wherever: config, etc.
params = {
    'start_year': 2011,
    'start_month': 4,
    'end_year': 2017,
    'end_month': 6,
}
pickle.dump(params, open("params.pkl", "wb"))
params = pickle.load(open("params.pkl", "rb"))
start = datetime(year=params['start_year'], month=params['start_month'], day=1)
end = datetime(year=params['end_year'], month=params['end_month'], day=1)
keys = [int(dt.strftime("%Y%m")) for dt in rrule.rrule(rrule.MONTHLY, dtstart=start, until=end)]
print(keys)
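# a sketch: if you also want the question's original-style key names, rebuild them from the months
key_names = ['df_' + str(k) + 'A' for k in keys]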
## Do some things and get a dict
dict_year_month = {201104: pd.DataFrame([[1, 2, 3]]), 201105: pd.DataFrame([[4, 5, 6]])} # ... etc
pd.concat(dict_year_month.values())
The pickle file is to show one way of saving and loading parameters; it is a binary format, so manually editing the parameters wouldn't really work. You might want to investigate something like yaml for something more sophisticated; feel free to ask a new question if you need help with that.
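For reference, a minimal sketch of the yaml route, assuming PyYAML is installed and a hand-editable params.yml with the same four fields:
import yaml

# params.yml would contain, e.g.:
#   start_year: 2011
#   start_month: 4
#   end_year: 2017
#   end_month: 6
with open('params.yml') as f:
    params = yaml.safe_load(f)  # returns a plain dict, editable by hand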
Related
I am fairly new to python and coding in general.
I have a big data file that provides daily data for the period 2011-2018 for a number of stock tickers (~300).
The data is a .csv file with circa 150k rows and looks as follows (short example):
Date,Symbol,ShortExemptVolume,ShortVolume,TotalVolume
20110103,AAWW,0.0,28369,78113.0
20110103,AMD,0.0,3183556,8095093.0
20110103,AMRS,0.0,14196,18811.0
20110103,ARAY,0.0,31685,77976.0
20110103,ARCC,0.0,177208,423768.0
20110103,ASCMA,0.0,3930,26527.0
20110103,ATI,0.0,193772,301287.0
20110103,ATSG,0.0,23659,72965.0
20110103,AVID,0.0,7211,18896.0
20110103,BMRN,0.0,21740,213974.0
20110103,CAMP,0.0,2000,11401.0
20110103,CIEN,0.0,625165,1309490.0
20110103,COWN,0.0,3195,24293.0
20110103,CSV,0.0,6133,25394.0
I have a function that allows me to filter for a specific symbol and get 10 observations before and after a specified date (could be any date between 2011 and 2018).
import pandas as pd
from datetime import datetime
import urllib

def get_data(issue_date, stock_ticker):
    df = pd.read_csv(r'D:\Project\Data\Short_Interest\exampledata.csv')
    df['Date'] = pd.to_datetime(df['Date'], format="%Y%m%d")
    d = df
    df = pd.DataFrame(d)
    short = df.loc[df.Symbol.eq(stock_ticker)]
    # get the index label of the row of interest
    ix = short[short.Date.eq(issue_date)].index[0]
    # get the positional index for that label
    iloc_ix = short.index.get_loc(ix)
    # take the 10 rows either side (+11 because slice ends are exclusive), i.e. +/-10 trading days
    short_data = short.iloc[iloc_ix-10: iloc_ix+11]
    return [short_data]
I want to create a script that iterates over a list of 'issue_dates' and 'stock_tickers'. The list (a .csv) looks as follows:
ARAY,07/08/2017
ARAY,24/04/2014
ACETQ,16/11/2015
ACETQ,16/11/2015
NVLNA,15/08/2014
ATSG,29/09/2017
ATI,24/05/2016
MDRX,18/06/2013
MDRX,18/06/2013
AMAGX,10/05/2017
AMAGX,14/02/2014
AMD,14/09/2016
To break down my problem and question I would like to know how to do the following:
First, how do I load the inputs?
Second, how do I call the function on each input?
And last, how do I accumulate all the function returns in one dataframe?
To load the inputs and call the function for each row, iterate over the csv file, pass each row's values to the function, and accumulate the resulting DataFrame slices in a list.
I modified your function a bit: removed the DataFrame creation so it is only done once, and added a try/except block to account for missing dates or tickers (your example data didn't match up too well). The dates in the second csv look like day/month/year, so I converted them from that format.
import pandas as pd
import datetime, csv

def get_data(df, issue_date, stock_ticker):
    '''Return a DataFrame slice for the ticker, centered on the issue date.
    '''
    short = df.loc[df.Symbol.eq(stock_ticker)]
    try:
        # get the index label of the row of interest
        ix = short[short.Date.eq(issue_date)].index[0]
        # get the positional index for that label
        iloc_ix = short.index.get_loc(ix)
        # take the 10 rows either side (+11 because slice ends are exclusive), i.e. +/-10 trading days
        short_data = short.iloc[iloc_ix-10: iloc_ix+11]
    except IndexError:
        msg = f'no data for {stock_ticker} on {issue_date}'
        #log.info(msg)
        print(msg)
        short_data = None
    return short_data

df = pd.read_csv(datafile)  # datafile: path to the big csv
df['Date'] = pd.to_datetime(df['Date'], format="%Y%m%d")

results = []
with open('issues.csv') as issues:
    for ticker, date in csv.reader(issues):
        day, month, year = map(int, date.split('/'))
        # dt = datetime.datetime.strptime(date, r'%d/%m/%Y')
        date = datetime.date(year, month, day)
        s = get_data(df, date, ticker)
        results.append(s)
        # print(s)
Creating a single DataFrame or table for all that info may be problematic, especially since the date ranges are all different. You should probably ask a separate question regarding that; its mcve should probably just include a few minimal DataFrame slices with a couple of different date ranges and tickers.
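That said, if a plain stacked table is enough, a minimal sketch (assuming the results list from above, with None entries for the misses):
# drop the misses, then stack everything into one long DataFrame
frames = [r for r in results if r is not None]
combined = pd.concat(frames, ignore_index=True)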
I am having a hard time summing two dates that are saved in two separate json files. I want to add these set dates together, which are stored in separate files.
The first file (A1.json) contains: {"expires": "2019-09-11"}
The second file (Whitelist.json) contains: {"expires": "0000-01-00"}
These dates are created using tkcalendar and are later exported to these separate files, the idea being that summing them lets me set a date one month into the future. However, I can't seem to add them together without some form of an error.
I have tried converting the json files to strings in python and then adding them, and also using strptime to sum the dates.
Here is the relevant chunk of the code:
with open('A1.json') as f:
    data = json.loads(f.read())
    for material in data.items():
        A1 = format(material[1]['expires'])

with open('Whitelist.json') as f:
    data = json.loads(f.read())
    for material in data.items():
        A2 = format(material[1]['expires'])

print(A1 + A2)
When this is used, they just get pasted one after another. They don't get summed the way I need.
I also have tried the following code:
t1 = dt.datetime.strptime('A1', '%d-%m-%Y')
t2 = dt.datetime.strptime('Whitelist', '%d-%m-%Y')
time_zero = dt.datetime.strptime('00:00:00', '%d/%m/%Y')
print((t1 - time_zero + Whitelist).time())
However, this constantly gives out ValueError: time data does not match format '%y:%m:%d'.
What I expect is that the sum of 2019-09-11 and 0000-01-00 comes out to 2019-10-11. Instead, the first approach pastes the strings together as 2019-09-110000-01-00, and the strptime approach gives ValueErrors like the one above.
Thank you in advance, and I apologize if I did something wrong on my first post.
Use pandas:
The actual format of the json files isn't provided, so use something like pd.read_json('A1.json', orient='records') to get the data into a DataFrame; the parameters will depend on the format of the file. json_normalize is another option.
d2 is not a proper datetime format, so don't try to convert it.
The Code section below uses a dict to set up the DataFrame for the example.
json files to DataFrames:
df1 = pd.read_json('A1.json', orient='records')
df2 = pd.read_json('Whitelist.json', orient='records')
df = pd.DataFrame()
df['expires'] = df1.expires
df['d2'] = df2.expires
Code:
import pandas as pd
df = pd.DataFrame({"expires": ["2019-09-11", "2019-10-11", "2019-11-11"],
                   "d2": ["0000-01-00", "0000-02-00", "0000-03-00"]})
Convert expires to datetime and expand d2 using str.split:
df.expires = pd.to_datetime(df.expires)
df[['y', 'm', 'd']] = df.d2.str.split('-', expand=True)
Use pd.DateOffset:
df['expires_new'] = df[['expires', 'm']].apply(lambda x: x[0] + pd.DateOffset(months=int(x[1])), axis=1)
If d2 is expected to have more than just a new m (month) value, the lambda expression can be changed to call a function that adjusts for the y, m, and d values, as sketched below.
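A minimal sketch of that generalization, assuming d2 always splits into three integer parts:
def add_offset(row):
    # unpack the 'y-m-d' offsets from d2 and shift expires by all three
    y, m, d = (int(part) for part in row['d2'].split('-'))
    return row['expires'] + pd.DateOffset(years=y, months=m, days=d)

df['expires_new'] = df.apply(add_offset, axis=1)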
I have a list of DataFrames that come from the census api; I stored each year's pull into the list.
So at the end of my for loop I have a list with one dataframe per year, and a list of years to go alongside it.
The problem I am having is merging all the DataFrames in the list while also tagging them with the years from the other list.
I have tried using the reduce function, but it looks like it only takes 2 of the 6 DataFrames I have.
concat just adds them to the dataframe without tagging or changing anything.
# Dependencies
import pandas as pd
import requests
import json
import pprint
from census import Census
from us import states

# Census
from config import (api_key, gkey)

year = 2012
c = Census(api_key, year)
yearlst = []
datalst = []
for length in range(6):
    c = Census(api_key, year)
    data = c.acs5.get(('NAME', "B25077_001E", "B25064_001E",
                       "B15003_022E", "B19013_001E"),
                      {'for': 'zip code tabulation area:*'})
    data_df = pd.DataFrame(data)
    data_df = data_df.rename(columns={"NAME": "Name",
                                      "zip code tabulation area": "Zipcode",
                                      "B25077_001E": "Median Home Value",
                                      "B25064_001E": "Median Rent",
                                      "B15003_022E": "Bachelor Degrees",
                                      "B19013_001E": "Median Income"})
    data_df = data_df.astype({'Zipcode': 'int64'})
    filtervalue = data_df['Median Home Value'] > 0
    filtervalue2 = data_df['Median Rent'] > 0
    filtervalue3 = data_df['Median Income'] > 0
    cleandata = data_df[filtervalue & filtervalue2 & filtervalue3]
    cleandata = cleandata.dropna()
    yearlst.append(year)
    datalst.append(cleandata)
    year += 1
So this generates the two separate lists, one with the years and the other with the dataframes.
My output came out as either one DataFrame with missing entries, or everything concatenated without any columns changed.
What I'm looking for is how to merge all the frames within the list, with datalst[0] tagged with yearlst[0] when merging, if at all possible.
No need for the year list; simply assign a year column to each data frame. Also, avoid incrementing year by hand and make it the loop iterator instead. In fact, consider chaining your process:
datalst = []
for year in range(2012, 2019):
    c = Census(api_key, year)
    data = c.acs5.get(('NAME', "B25077_001E", "B25064_001E", "B15003_022E", "B19013_001E"),
                      {'for': 'zip code tabulation area:*'})
    cleandata = (pd.DataFrame(data)
                   .rename(columns={"NAME": "Name",
                                    "zip code tabulation area": "Zipcode",
                                    "B25077_001E": "Median_Home_Value",
                                    "B25064_001E": "Median_Rent",
                                    "B15003_022E": "Bachelor_Degrees",
                                    "B19013_001E": "Median_Income"})
                   .astype({'Zipcode': 'int64'})
                   .query('(Median_Home_Value > 0) & (Median_Rent > 0) & (Median_Income > 0)')
                   .dropna()
                   .assign(year_column=year)
                 )
    datalst.append(cleandata)

final_data = pd.concat(datalst, ignore_index=True)
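As a quick sanity check (a sketch), you can confirm every pull carries its year tag:
print(final_data['year_column'].value_counts().sort_index())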
I'm trying to add two columns to each dataframe, based on its name, before concatenating all of them. One column is year and the other is trimester, so t1_15 would be trimester 1 and year 2015.
I tried building a function that did it in one go, but due to time constraints I ended up doing it manually, like this. I'm now returning to this problem with more time and would really like to sort it out.
frames_15 = [t1_15, t2_15, t3_15, t4_15]
for i in frames_15:
    i['year'] = 2015

frames_16 = [t1_16, t2_16, t3_16, t4_16]
for i in frames_16:
    i['year'] = 2016

frames_17 = [t1_17, t2_17, t3_17]
for i in frames_17:
    i['year'] = 2017

frames_trim_1 = [t1_15, t1_16, t1_17]
for i in frames_trim_1:
    i['trimestre'] = 1

frames_trim_2 = [t2_15, t2_16, t2_17]
for i in frames_trim_2:
    i['trimestre'] = 2

frames_trim_3 = [t3_15, t3_16, t3_17]
for i in frames_trim_3:
    i['trimestre'] = 3

frames_trim_4 = [t4_15, t4_16]
for i in frames_trim_4:
    i['trimestre'] = 4
I'd like each df to have year and trimester columns based on its name.
Thanks in advance.
The best way would be to build a dictionary where you register your dataframes; you already gave them names according to their trimester assignment.
If you already know this info at creation time, you could register the dataframes in a dictionary whose keys are tuples of trimester and year, as sketched right below.
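A minimal sketch of that registration pattern, where load_trimester is a hypothetical stand-in for however you currently create each frame (trim the ranges to match the trimesters you actually have):
import pandas as pd

frames = {}
for year in (2015, 2016, 2017):
    for trimester in (1, 2, 3, 4):
        # load_trimester is hypothetical; substitute your own creation code
        frames[(trimester, year)] = load_trimester(trimester, year)

# tag each frame from its key, then concatenate everything
tagged = [df.assign(trimestre=t, year=y) for (t, y), df in frames.items()]
result = pd.concat(tagged, ignore_index=True)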
If you have something like in your description, you could also use the globals dictionary, but this is not quite clean and should be regarded as a last resort when there is no cleaner way.
If you build up your dictionary with keys named like the variable names above, or if you want to use the globals dictionary directly, you could do it as follows:
import re
import pandas as pd

df_directory = dict(globals())
name_re = re.compile('^t([0-9])_([0-9]{2})$')
for name, df in df_directory.items():
    matcher = name_re.match(name)
    if matcher and isinstance(df, pd.DataFrame):
        trimester, year = matcher.groups()
        df['trimestre'] = int(trimester)
        df['year'] = int(year) + 2000
This processes all variables named by the scheme tX_XX that are of type DataFrame, parses the trimester and year out of each name, and assigns them to the two columns.
I am creating a pandas dataframe from historical weather data downloaded from weather underground.
import json
import requests
import pandas as pd
import numpy as np
import datetime
from dateutil.parser import parse
address = "http://api.wunderground.com/api/7036740167876b59/history_20060405/q/CA/San_Francisco.json"
r = requests.get(address)
wu_data = r.json()
Because I do not need all the data, I only use the list of observations. Each observation contains two elements - date and utcdate - that are actually dictionaries.
df = pd.DataFrame.from_dict(wu_data["history"]["observations"])
I would like to index the dataframe I have created with the parsed date from the 'pretty' key within the dictionary. I can access this value using the array index, but I can't figure out how to do this directly without a loop. For example, for the 23rd element I can write
pretty_date = df["date"].values[23]["pretty"]
print pretty_date
time = parse(pretty_date)
print time
And I get
11:56 PM PDT on April 05, 2006
2006-04-05 23:56:00
This is what I am doing at the moment
g = lambda x: parse(x["pretty"])
df_dates = pd.DataFrame.from_dict(df["date"])
df.index = df_dates["date"].apply(g)
df is now reindexed. At this point I can remove the columns I do not need.
Is there a more direct way to do this?
Please notice that sometimes there are multiple observations for the same date, but I deal with data cleaning, duplicates, etc. in a different part of the code.
Since the dtype held in pretty is just object, you can simply grab the values into a list and use it as the index. Not sure if this is what you want:
wu_data = r.json()
df = pd.DataFrame.from_dict(wu_data["history"]["observations"])
# index using a list comprehension, parsing "pretty" out of each df["date"] dict
df.index = [parse(d["pretty"]) for d in df["date"]]
df.index
<class 'pandas.tseries.index.DatetimeIndex'>
[2006-04-05 00:56:00, ..., 2006-04-05 23:56:00]
Length: 24, Freq: None, Timezone: None
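An equivalent spelling with map, in case you prefer to avoid indexing by position (a sketch over the same df):
df.index = df["date"].map(lambda d: parse(d["pretty"]))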
Hope this helps.