I am using a mixture of lists and pandas DataFrames to clean and merge CSV data. The following snippet from my code runs disgustingly slowly. It generates a CSV with about 3 million (3MM) lines of data.
UniqueAPI = Uniquify(API)
dummydata = []
#bridge the gaps in the data with zeros
for i in range(0,len(UniqueAPI)):
    DateList = []
    DaysList = []
    PDaysList = []
    OperatorList = []
    OGOnumList = []
    CountyList = []
    MunicipalityList = []
    LatitudeList = []
    LongitudeList = []
    UnconventionalList = []
    ConfigurationList = []
    HomeUseList = []
    ReportingPeriodList = []
    RecordSourceList = []
    for j in range(0,len(API)):
        if UniqueAPI[i] == API[j]:
            #print(str(ProdDate[j]))
            DateList.append(ProdDate[j])
            DaysList = Days[j]
            OperatorList = Operator[j]
            OGOnumList = OGOnum[j]
            CountyList = County[j]
            MunicipalityList = Municipality[j]
            LatitudeList = Latitude[j]
            LongitudeList = Longitude[j]
            UnconventionalList = Unconventional[j]
            ConfigurationList = Configuration[j]
            HomeUseList = HomeUse[j]
            ReportingPeriodList = ReportingPeriod[j]
            RecordSourceList = RecordSource[j]
    df = pd.DataFrame(DateList, columns = ['Date'])
    df['Date'] = pd.to_datetime(df['Date'])
    minDate = df.min()
    maxDate = df.max()
    Years = int((maxDate - minDate)/np.timedelta64(1,'Y'))
    Months = int(round((maxDate - minDate)/np.timedelta64(1,'M')))
    finalMonths = Months - Years*12 + 1
    Y,x = str(minDate).split("-",1)
    x,Y = str(Y).split(" ",1)
    for k in range(0,Years + 1):
        if k == Years:
            ender = int(finalMonths + 1)
        else:
            ender = int(13)
        full_df = pd.DataFrame()
        if k > 0:
            del full_df
            full_df = pd.DataFrame()
        full_df['API'] = UniqueAPI[i]
        full_df['Production Month'] = [pd.to_datetime(str(x)+'/1/'+str(int(Y)+k)) for x in range(1,ender)]
        full_df['Days'] = DaysList
        full_df['Operator'] = OperatorList
        full_df['OGO_NUM'] = OGOnumList
        full_df['County'] = CountyList
        full_df['Municipality'] = MunicipalityList
        full_df['Latitude'] = LatitudeList
        full_df['Longitude'] = LongitudeList
        full_df['Unconventional'] = UnconventionalList
        full_df['Well_Configuration'] = ConfigurationList
        full_df['Home_Use'] = HomeUseList
        full_df['Reporting_Period'] = ReportingPeriodList
        full_df['Record_Source'] = RecordSourceList
        dummydata.append(full_df)
full_df = pd.concat(dummydata)
result = full_df.merge(dataClean,how='left').fillna(0)
print(result[:100])
result.to_csv(ResultPath, index_label=False, index=False)
This snippet has been running for hours, and the output should have about 3 million lines. There has to be a faster way to do this with pandas. The goal is as follows:
For each unique API, I find all occurrences in the main list of APIs.
Using that information, I build a list of dates.
I find the min and max date for each list corresponding to an API.
I then build an empty pandas DataFrame that has every month between the two dates for the associated API.
I append this DataFrame to a list, dummydata, and loop to the next API.
I then concatenate this dummydata list into a single DataFrame.
This DataFrame is then merged with another DataFrame that holds the cleaned data.
The end result is a CSV that has a 0 value for every month that was missing, but should exist, between the min and max dates for each API in the original unclean list.
This all takes far longer than I would expect. I would have thought that finding the min and max date for each unique item, interpolating monthly between them, and filling the months that have no data with 0 would be about a three-line job in pandas. Any options you think I should explore, or any snippets of code that could help, would be much appreciated!
You could start by cleaning up the code a bit. These lines don't seem to have any effect or functional purpose since full_df was just created and is already an empty dataframe:
if k > 0:
    del full_df
    full_df = pd.DataFrame()
Then, when you actually build up your full_df, it's better to do it all at once rather than one column at a time. Note that pd.concat only accepts Series or DataFrame objects, not plain lists or scalars, so pass a dict to the DataFrame constructor instead. Try something like this:
full_df = pd.DataFrame({
    'API': UniqueAPI[i],
    'Production Month': [pd.to_datetime(str(x)+'/1/'+str(int(Y)+k)) for x in range(1,ender)],
    'Days': DaysList,
    # etc...
})
The dict keys become the column labels, so you set them all at once and don't need to assign full_df.columns separately. Scalar values such as UniqueAPI[i] are repeated for every row of the list-valued column.
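For the underlying goal in the question (every month between each API's first and last date, with missing months filled with 0), the loops can probably be replaced with groupby, date_range and a left merge. This is only a minimal sketch: the small data_clean frame and its Gas column are made up for illustration, and it assumes the production dates are already month-start timestamps.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned data; the column names are assumptions.
data_clean = pd.DataFrame({
    "API": ["A", "A", "B"],
    "Production Month": pd.to_datetime(["2020-01-01", "2020-04-01", "2021-06-01"]),
    "Gas": [10.0, 40.0, 5.0],
})

# First and last production month per API.
bounds = data_clean.groupby("API")["Production Month"].agg(["min", "max"])

# One row per (API, month) between the two bounds, at month-start frequency.
grid = pd.concat(
    [
        pd.DataFrame({"API": api, "Production Month": pd.date_range(lo, hi, freq="MS")})
        for api, lo, hi in bounds.itertuples()
    ],
    ignore_index=True,
)

# Left-merge the real rows onto the grid; months without data become 0.
result = grid.merge(data_clean, on=["API", "Production Month"], how="left").fillna(0)
print(result)
```

Note that fillna(0) also zeroes any per-API constant columns (operator, county and so on) in the gap rows, so those would need to be merged in separately, keyed on API only.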
Related
I want to create 24 DataFrames in a loop, where the first DataFrame is created by:
df_hour_0 = df_dummies.copy()
df_hour_0['Price_24_1'] = df_hour_0['Price_REG1']
df_hour_0['Price_24_2'] = df_hour_0['Price_REG2']
df_hour_0['Price_24_3'] = df_hour_0['Price_REG3']
df_hour_0['Price_24_4'] = df_hour_0['Price_REG4']
df_hour_0[['Price_24_1', 'Price_24_2', 'Price_24_3', 'Price_24_4'
]] = df_hour_0[['Price_24_1', 'Price_24_2', 'Price_24_3', 'Price_24_4']].shift(1)
First, I tried this approach, which doesn't work. (Maybe another version of this approach can work?).
for i in range(24):
    df_hour_i = df_dummies.copy()
    df_hour_i['Price_24_1'] = df_hour_i['Price_REG1']
    df_hour_i['Price_24_2'] = df_hour_i['Price_REG2']
    df_hour_i['Price_24_3'] = df_hour_i['Price_REG3']
    df_hour_i['Price_24_4'] = df_hour_i['Price_REG4']
    df_hour_i[['Price_24_1', 'Price_24_2', 'Price_24_3', 'Price_24_4']] = df_hour_0[['Price_24_1', 'Price_24_2', 'Price_24_3', 'Price_24_4']].shift(i+1)
Now I have read that people, in general, recommend using a dictionary so I have tried this:
d = {}
for x in range(24):
    d[x] = pd.DataFrame()
which gives me 24 empty DataFrames in one dictionary, but now I struggle with how to fill them.
Build a list of DataFrames like this:
cols = ['Price_24_1', 'Price_24_2', 'Price_24_3', 'Price_24_4']
dfs = []
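for i in range(24):
    df = df_hour_0.copy()
    df[cols] = df[cols].shift(i)
    dfs.append(df)
The frame with an extra shift of i rows is at dfs[i]. Because df_hour_0 is already shifted by 1, dfs[0] is identical to df_hour_0 and dfs[i] has a total shift of i + 1.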
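If you prefer the dictionary from the question, the same loop can fill it directly. This is a rough sketch, assuming df_dummies holds the four Price_REG columns; the sample values below are made up.

```python
import pandas as pd

# Hypothetical stand-in for df_dummies with the four price columns.
df_dummies = pd.DataFrame({
    "Price_REG1": [1.0, 2.0, 3.0, 4.0],
    "Price_REG2": [10.0, 20.0, 30.0, 40.0],
    "Price_REG3": [5.0, 6.0, 7.0, 8.0],
    "Price_REG4": [0.5, 0.6, 0.7, 0.8],
})

src = ["Price_REG1", "Price_REG2", "Price_REG3", "Price_REG4"]
cols = ["Price_24_1", "Price_24_2", "Price_24_3", "Price_24_4"]

frames = {}
for hour in range(24):
    df = df_dummies.copy()
    # Shift each source column by hour + 1 rows, so frames[0] matches df_hour_0.
    for new, old in zip(cols, src):
        df[new] = df[old].shift(hour + 1)
    frames[hour] = df
```

frames[0] then plays the role of df_hour_0, frames[1] of df_hour_1, and so on, without creating 24 separately named variables.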
I want to create a DataFrame from my data. What I do is essentially a grid search over different parameters for my algorithm. Do you have any idea how this can be done better? Right now, if I need to add two more parameters to my grid, or add more data to analyze, I have to manually add several lists, append values to each of them, and then add another column to the DataFrame dict. Is there another way? Right now it looks really ugly.
type_preds = []
type_models = []
type_lens = []
type_smalls = []
lfc1s = []
lfc2s = []
lfc3s = []
lv2s = []
sfp1s = []
len_small_fils = []
ratio_small_fills = []
ratio_big_fils = []
for path_to_config in path_to_results.iterdir():
    try:
        type_pred, type_model, type_len, type_small, len_small_fil, ratio_big_fil, ratio_small_fill = path_to_config.name[:-4].split('__')
    except:
        print(path_to_config)
        continue
    path_to_trackings = sorted([str(el) for el in list(path_to_config.iterdir())])[::-1]
    sfp1, lv2, lfc3, lfc2, lfc1 = display_metrics(path_to_gts, path_to_trackings)
    type_preds.append(type_pred)
    type_models.append(type_model)
    type_lens.append(type_len)
    type_smalls.append(type_small)
    len_small_fils.append(len_small_fil)
    ratio_big_fils.append(ratio_big_fil)
    ratio_small_fills.append(ratio_small_fill)
    lfc1s.append(lfc1)
    lfc2s.append(lfc2)
    lfc3s.append(lfc3)
    lv2s.append(lv2)
    sfp1s.append(sfp1)
df = pd.DataFrame({
    'type_pred': type_preds,
    'type_model': type_models,
    'type_len': type_lens,
    'type_small': type_smalls,
    'len_small_fil': len_small_fils,
    'ratio_small_fill': ratio_small_fills,
    'ratio_big_fill': ratio_big_fils,
    'lfc3': lfc3s,
    'lfc2': lfc2s,
    'lfc1': lfc1s,
    'lv2': lv2s,
    'sfp1': sfp1s
})
Something along these lines might make it easier:
data = []
for path_to_config in path_to_results.iterdir():
    # split() itself never raises, so check the field count explicitly
    # (the tuple unpacking in the question did this implicitly)
    parts = path_to_config.name[:-4].split('__')
    if len(parts) != 7:
        print(path_to_config)
        continue
    row = list(parts)
    path_to_trackings = sorted([str(el) for el in list(path_to_config.iterdir())])[::-1]
    row.extend(display_metrics(path_to_gts, path_to_trackings))
    data.append(row)
df = pd.DataFrame(
    data,
    columns=[
        "type_pred",
        "type_model",
        "type_len",
        "type_small",
        "len_small_fil",
        "ratio_big_fil",
        "ratio_small_fill",
        # End of first half
        "sfp1",
        "lv2",
        "lfc3",
        "lfc2",
        "lfc1",
    ])
Then every time you add an extra return variable to either the first or second function, you just need to add an extra column to the final DataFrame creation (and update the expected field count if the file name gains a field).
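A variation on the same idea is to build one dict per row, so each value sits next to its column name and the column order cannot drift out of sync. This is only a sketch: fake_display_metrics and the file names below are hypothetical stand-ins for display_metrics and the real directory listing.

```python
import pandas as pd

NAME_FIELDS = ["type_pred", "type_model", "type_len", "type_small",
               "len_small_fil", "ratio_big_fil", "ratio_small_fill"]
METRIC_FIELDS = ["sfp1", "lv2", "lfc3", "lfc2", "lfc1"]

def fake_display_metrics(name):
    # Hypothetical stand-in for display_metrics; returns five numbers.
    return (0.1, 0.2, 0.3, 0.4, 0.5)

# Hypothetical file names; the second one is malformed and gets skipped.
names = ["a__b__c__d__5__0.3__0.7.txt", "broken_name.txt"]

rows = []
for name in names:
    parts = name[:-4].split("__")
    if len(parts) != len(NAME_FIELDS):
        print("skipping", name)
        continue
    # Pair every value with its field name instead of relying on position.
    row = dict(zip(NAME_FIELDS, parts))
    row.update(zip(METRIC_FIELDS, fake_display_metrics(name)))
    rows.append(row)

df = pd.DataFrame(rows)
```

Adding a parameter then means adding one name to NAME_FIELDS or METRIC_FIELDS and nothing else.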
I have 1-minute OHLCV candlestick data, and I need to aggregate it into 15-minute candlesticks. The data comes from MongoDB; this is a version in plain Python:
def get_candela(self,tf):
    c = dict()
    candel = dict()
    candele_finale = list()
    prov_c = list()
    db = database("price_data", "1min_OHLC_XBTUSD")
    col = database.get_collection(db,"1min_OHLC_XBTUSD")
    db_candela = col.find({}, sort = [('timestamp', pymongo.ASCENDING)]).limit(20)
    candele = list(db_candela)
    timestamp_calc = list()
    open_calc = list()
    max_calc = list()
    min_calc = list()
    close_calc = list()
    vol_calc = list()
    #for _ in range(len(candele)):
    for a in range(tf):
        if len(candele) == 0:
            break
        prov_c.append(candele[a])
    c.append(prov_c)
    candele[:tf]=[]
    for b in range(len(c)):
        cndl = c[b]
        for d in range(tf):
            print(cndl)
            cnd = cndl[d]
            #print(len(cnd))
            timestamp_calc.append(cnd["timestamp"])
            open_calc.append(cnd["open"])
            max_calc.append(cnd["high"])
            min_calc.append(cnd["low"])
            close_calc.append(cnd["close"])
            vol_calc.append(cnd["volume"])
        index_close=len(close_calc)
        candel["timestamp"] = timestamp_calc[d]
        candel["open"] = open_calc[0]
        candel["high"] = max(max_calc)
        candel["low"] = min(min_calc)
        candel["close"] = close_calc[index_close-1]
        candel["volume"] = sum(vol_calc)
        #print(candel)
        candele_finale.append(candel)
        max_calc.clear()
        min_calc.clear()
        vol_calc.clear()
    return candele_finale
This returns an array with only the last candlestick created.
And this is another version in pandas:
db = database("price_data", "1min_OHLC_XBTUSD")
col = database.get_collection(db,"1min_OHLC_XBTUSD")
db_candela = col.find({}, sort = [('timestamp', pymongo.ASCENDING)]).limit(20)
prov_c = list()
for item in db_candela:
    cc={"timestamp":item["timestamp"],"open":item["open"],"high":item["high"],"low":item["low"],"close":item["close"],"volume":item["volume"]}
    prov_c.append(cc)
print(prov_c)
data = pandas.DataFrame([prov_c], index=[pandas.to_datetime(cc["timestamp"])])
#print(data)
df = data.resample('5T').agg({'timestamp':'first','open':'first','high':'max', 'low':'min','close' : 'last','volume': 'sum'})
#print(data.mean())
#with pandas.option_context('display.max_rows', None, 'display.max_columns',None): # more options can be specified also
pprint(df)
This returns a dataframe with weird/random values.
I had the same question today and answered it myself. Basically, there is a pandas method called resample that does all of the work for you.
Here is my code:
import json
import pandas as pd
#load the raw data and clean it up
data_json = open('./testapiresults.json') #load json object
data_dict = json.load(data_json) #convert to a dict object
df = pd.DataFrame.from_dict(data_dict) #convert to panel data's dataframe object
df['datetime'] = pd.to_datetime(df['datetime'],unit='ms') #convert from unix time (ms from epoch) to human time
df = df.set_index(pd.DatetimeIndex(df['datetime'])) #set time series as index format for resample function
#resample to aggregate to a different frequency of data
candle_summary = pd.DataFrame()
candle_summary['open'] = df.open.resample('15Min').first()
candle_summary['high'] = df.high.resample('15Min').max()
candle_summary['low'] = df.low.resample('15Min').min()
candle_summary['close'] = df.close.resample('15Min').last()
candle_summary.head()
I had to export to CSV and recalculate it in Excel to double-check that it was calculating correctly, but this works. I have not figured out the pandas.DataFrame.resample('30min').ohlc() function, but it looks like it could make for some really elegant code if I could get it to work without tons of errors.
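The per-column resample calls above can also be collapsed into a single agg call with one rule per column, which picks up volume as well. A minimal sketch on synthetic 1-minute data, assuming the frame has a DatetimeIndex and columns named open, high, low, close and volume:

```python
import numpy as np
import pandas as pd

# Synthetic 1-minute candles; the values are made up for illustration.
idx = pd.date_range("2021-01-01 00:00", periods=30, freq="min")
close = np.arange(1.0, 31.0)
df = pd.DataFrame(
    {
        "open": close - 0.5,
        "high": close + 1.0,
        "low": close - 1.0,
        "close": close,
        "volume": 2.0,
    },
    index=idx,
)

# One aggregation rule per column: first open, highest high,
# lowest low, last close, summed volume.
candles_15m = df.resample("15min").agg(
    {"open": "first", "high": "max", "low": "min", "close": "last", "volume": "sum"}
)
print(candles_15m)
```

As far as I can tell, .ohlc() is meant for a single price series (for example raw trade prices) and derives all four values from it, so for data that is already in OHLC form the agg dict is probably the better fit.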
I'm trying to count the words that each row in Dataframe 1 shares with every row of words in Dataframe 2.
Based on these similarities, I want to create a new DataFrame where the columns are the N rows of Dataframe 2
and the values are the similarity counts.
My current code works, but it runs very slowly. I'm not sure how to optimize it...
df = pd.DataFrame([])
for x in range(10000):
    save = {}
    terms_1 = data['text_tokenized'].iloc[x]
    save['code'] = data['code'].iloc[x]
    for y in range(3000):
        terms_2 = data2['terms'].iloc[y]
        similar_n = len(list(terms_2.intersection(terms_1)))
        save[data2['code'].iloc[y]] = similar_n
    df = df.append(pd.DataFrame([save]))
Update: new code (still running slow)
def get_sim(x, terms):
    similar_n = len(list(x.intersection(terms)))
    return similar_n
for index in icd10_terms.itertuples():
    code,terms = index[1],index[2]
    data[code] = data['text_tokenized'].apply(get_sim, args=(terms,))
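One option that may be worth exploring is to turn both columns of sets into binary indicator matrices over a shared vocabulary, so that all pairwise intersection sizes come out of a single matrix product. This is a rough sketch on made-up data, assuming every cell holds a Python set of tokens. With a large vocabulary the dense matrices below could get big, in which case a sparse matrix type would be the natural swap.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for data and data2 (each row holds a set of tokens).
data = pd.DataFrame({
    "code": ["d1", "d2"],
    "text_tokenized": [{"a", "b", "c"}, {"c", "d"}],
})
data2 = pd.DataFrame({
    "code": ["t1", "t2", "t3"],
    "terms": [{"a", "c"}, {"d"}, {"x"}],
})

# Map every distinct token to a column position.
all_tokens = set().union(*data["text_tokenized"], *data2["terms"])
vocab = {tok: i for i, tok in enumerate(sorted(all_tokens))}

def to_indicator(sets):
    # Binary matrix: one row per set, one column per vocabulary token.
    m = np.zeros((len(sets), len(vocab)), dtype=np.int32)
    for r, s in enumerate(sets):
        m[r, [vocab[t] for t in s]] = 1
    return m

a = to_indicator(data["text_tokenized"])
b = to_indicator(data2["terms"])

# Entry (i, j) is the number of tokens shared by row i of data and row j of data2.
sim = pd.DataFrame(a @ b.T, columns=data2["code"].to_numpy())
sim.insert(0, "code", data["code"].to_numpy())
print(sim)
```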
I have a few categorical (descriptive) columns in my DataFrame df_churn that I'd like to convert to numerical values. I'd also like to create a lookup table, because I will eventually need to convert them back.
The problem is that every column has a different number of categories, so appending to df_categories is not easy, and I can't think of a simple way to do it.
Here is what I have so far. It stops after the first column because of the different lengths.
cat_clmn = ['CLI_REGION','CLI_PROVINCE','CLI_ORIGIN','cli_origin2','cli_origin3', 'ONE_PRD_TYPE_1']
df_categories = pd.DataFrame()
def categorizer(_clmn):
    for clmn in cat_clmn:
        dict_cat = {key: value for value, key in enumerate(df_churn[clmn].unique())}
        df_categories[clmn] = dict_cat.values()
        df_categories[clmn + '_key'] = dict_cat.keys()
        df_churn[clmn + '_CAT'] = df_churn[clmn].map(dict_cat)
categorizer(cat_clmn)
Here is a temporary workaround, but I am sure it can be done in a better way.
df_CLI_REGION = pd.DataFrame()
df_CLI_PROVINCE = pd.DataFrame()
df_CLI_ORIGIN = pd.DataFrame()
df_cli_origin2 = pd.DataFrame()
df_cli_origin3 = pd.DataFrame()
df_ONE_PRD_TYPE_1 = pd.DataFrame()
cat_clmn = ['CLI_REGION','CLI_PROVINCE','CLI_ORIGIN','cli_origin2','cli_origin3', 'ONE_PRD_TYPE_1']
df_lst = [df_CLI_REGION,df_CLI_PROVINCE,df_CLI_ORIGIN,df_cli_origin2,df_cli_origin3, df_ONE_PRD_TYPE_1]
def categorizer(_clmn):
    for clmn, df in zip(cat_clmn,df_lst):
        d = {key: value for value, key in enumerate(df_churn[clmn].unique())}
        df[clmn] = d.values()
        df[clmn + '_key'] = d.keys()
        df_churn[clmn + '_CAT'] = df_churn[clmn].map(d)
categorizer(cat_clmn)
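Since the lookup tables have different lengths, one way around the problem is to keep them in a dict keyed by column name instead of forcing them into a single DataFrame. pd.factorize returns both the integer codes and the unique values in one call. A minimal sketch, with a made-up df_churn that has only two of the columns:

```python
import pandas as pd

# Hypothetical stand-in for df_churn with two of the categorical columns.
df_churn = pd.DataFrame({
    "CLI_REGION": ["north", "south", "north", "east"],
    "CLI_ORIGIN": ["web", "web", "shop", "web"],
})
cat_clmn = ["CLI_REGION", "CLI_ORIGIN"]

lookups = {}
for clmn in cat_clmn:
    # codes: one integer per row; uniques: the category for each integer.
    codes, uniques = pd.factorize(df_churn[clmn])
    df_churn[clmn + "_CAT"] = codes
    lookups[clmn] = uniques

# Converting back: index the stored uniques with the integer codes.
restored = lookups["CLI_REGION"].take(df_churn["CLI_REGION_CAT"].to_numpy())
```

Missing values get the code -1 from factorize, so they would need separate handling before converting back.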