I want to create 24 DataFrames in a loop, where the first DataFrame is created by:
df_hour_0 = df_dummies.copy()
df_hour_0['Price_24_1'] = df_hour_0['Price_REG1']
df_hour_0['Price_24_2'] = df_hour_0['Price_REG2']
df_hour_0['Price_24_3'] = df_hour_0['Price_REG3']
df_hour_0['Price_24_4'] = df_hour_0['Price_REG4']
df_hour_0[['Price_24_1', 'Price_24_2', 'Price_24_3', 'Price_24_4'
]] = df_hour_0[['Price_24_1', 'Price_24_2', 'Price_24_3', 'Price_24_4']].shift(1)
First, I tried this approach, which doesn't work. (Maybe another version of this approach can work?)
for i in range(24):
    df_hour_i = df_dummies.copy()
    df_hour_i['Price_24_1'] = df_hour_i['Price_REG1']
    df_hour_i['Price_24_2'] = df_hour_i['Price_REG2']
    df_hour_i['Price_24_3'] = df_hour_i['Price_REG3']
    df_hour_i['Price_24_4'] = df_hour_i['Price_REG4']
    df_hour_i[['Price_24_1', 'Price_24_2', 'Price_24_3', 'Price_24_4']] = df_hour_0[['Price_24_1', 'Price_24_2', 'Price_24_3', 'Price_24_4']].shift(i+1)
Now I have read that people, in general, recommend using a dictionary so I have tried this:
d = {}
for x in range(24):
    d[x] = pd.DataFrame()
which gives me 24 empty DF in one dictionary, but now I struggle with how to fill them.
Build a list of DataFrames like this:
cols = ['Price_24_1', 'Price_24_2', 'Price_24_3', 'Price_24_4']
dfs = []
for i in range(24):
    df = df_hour_0.copy()
    df[cols] = df[cols].shift(i)
    dfs.append(df)
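The DataFrame shifted by i ends up at dfs[i]. Because df_hour_0 is already shifted by 1, dfs[i] holds the price columns shifted by i + 1 rows relative to the original prices, and dfs[0] equals df_hour_0.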
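If you prefer the dictionary you mentioned, the same idea can be sketched as below. This is a minimal sketch, not a definitive implementation: the small df_dummies frame is a made-up stand-in for your data, and the key names (df_hour_0, df_hour_1, ...) are just one possible choice.

```python
import pandas as pd

# Hypothetical stand-in for df_dummies with the four regional price columns.
df_dummies = pd.DataFrame({
    "Price_REG1": [10.0, 11.0, 12.0, 13.0],
    "Price_REG2": [20.0, 21.0, 22.0, 23.0],
    "Price_REG3": [30.0, 31.0, 32.0, 33.0],
    "Price_REG4": [40.0, 41.0, 42.0, 43.0],
})

src_cols = ["Price_REG1", "Price_REG2", "Price_REG3", "Price_REG4"]
lag_cols = ["Price_24_1", "Price_24_2", "Price_24_3", "Price_24_4"]

frames = {}
for i in range(24):
    df = df_dummies.copy()
    # Hour i holds the prices lagged by i + 1 rows, as in the question.
    for src, lag in zip(src_cols, lag_cols):
        df[lag] = df[src].shift(i + 1)
    frames[f"df_hour_{i}"] = df
```

You then look a frame up by name, for example frames["df_hour_5"], instead of creating 24 separate variables.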
I want to create a DataFrame from my data. What I do is essentially a grid search over different parameters for my algorithm. Do you have any idea how this can be done better? Right now, if I need to add two more parameters to my grid, or add more data to analyze, I have to manually add several lists, append values to them, and then add another column to the DataFrame dict. Is there another way? Right now it looks really ugly.
type_preds = []
type_models = []
type_lens = []
type_smalls = []
lfc1s = []
lfc2s = []
lfc3s = []
lv2s = []
sfp1s = []
len_small_fils = []
ratio_small_fills = []
ratio_big_fils = []
for path_to_config in path_to_results.iterdir():
    try:
        type_pred, type_model, type_len, type_small, len_small_fil, ratio_big_fil, ratio_small_fill = path_to_config.name[:-4].split('__')
    except:
        print(path_to_config)
        continue
    path_to_trackings = sorted([str(el) for el in list(path_to_config.iterdir())])[::-1]
    sfp1, lv2, lfc3, lfc2, lfc1 = display_metrics(path_to_gts, path_to_trackings)
    type_preds.append(type_pred)
    type_models.append(type_model)
    type_lens.append(type_len)
    type_smalls.append(type_small)
    len_small_fils.append(len_small_fil)
    ratio_big_fils.append(ratio_big_fil)
    ratio_small_fills.append(ratio_small_fill)
    lfc1s.append(lfc1)
    lfc2s.append(lfc2)
    lfc3s.append(lfc3)
    lv2s.append(lv2)
    sfp1s.append(sfp1)
df = pd.DataFrame({
'type_pred': type_preds,
'type_model': type_models,
'type_len': type_lens,
'type_small': type_smalls,
'len_small_fil': len_small_fils,
'ratio_small_fill': ratio_small_fills,
'ratio_big_fill': ratio_big_fils,
'lfc3': lfc3s,
'lfc2': lfc2s,
'lfc1': lfc1s,
'lv2': lv2s,
'sfp1': sfp1s
})
Something along these lines might make it easier:
data = []
for path_to_config in path_to_results.iterdir():
    row = []
    try:
        row.extend(path_to_config.name[:-4].split('__'))
    except:
        print(path_to_config)
        continue
    path_to_trackings = sorted([str(el) for el in list(path_to_config.iterdir())])[::-1]
    row.extend(display_metrics(path_to_gts, path_to_trackings))
    data.append(row)
df = pd.DataFrame(
data,
columns=[
"type_pred",
"type_model",
"type_len",
"type_small",
"len_small_fil",
"ratio_big_fil",
"ratio_small_fill",
# End of first half
"sfp1",
"lv2",
"lfc3",
"lfc2",
"lfc1",
])
Then every time you add an extra return variable to either the first or second function you just need to add an extra column to the final DataFrame creation.
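A variation on the same idea is to build each row as a dict keyed by column name, so the names and values stay paired. The sketch below is self-contained and makes several assumptions: fake_display_metrics is a hypothetical stand-in for display_metrics, and the file names are invented. It also checks the number of fields explicitly, because split() on its own does not raise when a name is malformed.

```python
import pandas as pd

PARAM_NAMES = ["type_pred", "type_model", "type_len", "type_small",
               "len_small_fil", "ratio_big_fil", "ratio_small_fill"]
METRIC_NAMES = ["sfp1", "lv2", "lfc3", "lfc2", "lfc1"]


def fake_display_metrics(name):
    # Hypothetical stand-in for display_metrics(); returns five numbers.
    return (0.1, 0.2, 0.3, 0.4, 0.5)


# Hypothetical config file names; the second one is malformed on purpose.
names = ["a__b__c__d__1__2__3.txt", "broken_name.txt"]

rows = []
for name in names:
    parts = name[:-4].split("__")
    if len(parts) != len(PARAM_NAMES):
        print(name)  # report and skip names that do not match the pattern
        continue
    row = dict(zip(PARAM_NAMES, parts))
    row.update(zip(METRIC_NAMES, fake_display_metrics(name)))
    rows.append(row)

df = pd.DataFrame(rows, columns=PARAM_NAMES + METRIC_NAMES)
```

Adding a parameter or a metric then means extending one of the two name lists, and the expected field count follows automatically.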
I have 1-minute OHLCV candlestick data, and I need to aggregate it to create 15-minute candlesticks. The data comes from MongoDB; this is a version in plain Python:
def get_candela(self,tf):
    c = dict()
    candel = dict()
    candele_finale = list()
    prov_c = list()
    db = database("price_data", "1min_OHLC_XBTUSD")
    col = database.get_collection(db,"1min_OHLC_XBTUSD")
    db_candela = col.find({}, sort = [('timestamp', pymongo.ASCENDING)]).limit(20)
    candele = list(db_candela)
    timestamp_calc = list()
    open_calc = list()
    max_calc = list()
    min_calc = list()
    close_calc = list()
    vol_calc = list()
    #for _ in range(len(candele)):
    for a in range(tf):
        if len(candele) == 0:
            break
        prov_c.append(candele[a])
    c.append(prov_c)
    candele[:tf]=[]
    for b in range(len(c)):
        cndl = c[b]
        for d in range(tf):
            print(cndl)
            cnd = cndl[d]
            #print(len(cnd))
            timestamp_calc.append(cnd["timestamp"])
            open_calc.append(cnd["open"])
            max_calc.append(cnd["high"])
            min_calc.append(cnd["low"])
            close_calc.append(cnd["close"])
            vol_calc.append(cnd["volume"])
        index_close=len(close_calc)
        candel["timestamp"] = timestamp_calc[d]
        candel["open"] = open_calc[0]
        candel["high"] = max(max_calc)
        candel["low"] = min(min_calc)
        candel["close"] = close_calc[index_close-1]
        candel["volume"] = sum(vol_calc)
        #print(candel)
        candele_finale.append(candel)
        max_calc.clear()
        min_calc.clear()
        vol_calc.clear()
    return candele_finale
This returns an array with only the last candlestick created.
And this is another version in pandas:
db = database("price_data", "1min_OHLC_XBTUSD")
col = database.get_collection(db,"1min_OHLC_XBTUSD")
db_candela = col.find({}, sort = [('timestamp', pymongo.ASCENDING)]).limit(20)
prov_c = list()
for item in db_candela:
    cc={"timestamp":item["timestamp"],"open":item["open"],"high":item["high"],"low":item["low"],"close":item["close"],"volume":item["volume"]}
    prov_c.append(cc)
print(prov_c)
data = pandas.DataFrame([prov_c], index=[pandas.to_datetime(cc["timestamp"])])
#print(data)
df = data.resample('5T').agg({'timestamp':'first','open':'first','high':'max', 'low':'min','close' : 'last','volume': 'sum'})
#print(data.mean())
#with pandas.option_context('display.max_rows', None, 'display.max_columns',None): # more options can be specified also
pprint(df)
This returns a dataframe with weird/random values.
I had the same question today, and I answered it. Basically there is a pandas function called resample that does all of the work for you.
Here is my code:
import json
import pandas as pd
#load the raw data and clean it up
data_json = open('./testapiresults.json') #load json object
data_dict = json.load(data_json) #convert to a dict object
df = pd.DataFrame.from_dict(data_dict) #convert to panel data's dataframe object
df['datetime'] = pd.to_datetime(df['datetime'],unit='ms') #convert from unix time (ms from epoch) to human time
df = df.set_index(pd.DatetimeIndex(df['datetime'])) #set time series as index format for resample function
#resample to aggregate to a different frequency of data
candle_summary = pd.DataFrame()
candle_summary['open'] = df.open.resample('15Min').first()
candle_summary['high'] = df.high.resample('15Min').max()
candle_summary['low'] = df.low.resample('15Min').min()
candle_summary['close'] = df.close.resample('15Min').last()
candle_summary.head()
I had to export to CSV and recalculate it in Excel to double-check that it was calculating correctly, but this works. I have not figured out the pandas.DataFrame.resample('30min').ohlc() function, but it looks like it could make for some really elegant code if I could get it to work without tons of errors.
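For what it's worth, the four separate resample calls can be collapsed into a single agg call that also handles volume. The sketch below uses synthetic 1-minute candles rather than the MongoDB data, so the column values are made up. It also shows ohlc() applied to a single price series, which is the case that method is designed for.

```python
import numpy as np
import pandas as pd

# Synthetic 1-minute candles (30 rows) standing in for the real data.
idx = pd.date_range("2021-01-01 00:00", periods=30, freq="min")
base = np.arange(30, dtype=float)
df = pd.DataFrame({
    "open": base,
    "high": base + 2,
    "low": base - 1,
    "close": base + 1,
    "volume": np.ones(30),
}, index=idx)

# One aggregation rule per column, applied to each 15-minute bucket.
candles_15m = df.resample("15min").agg({
    "open": "first",
    "high": "max",
    "low": "min",
    "close": "last",
    "volume": "sum",
})

# ohlc() works on one series: it derives open/high/low/close from that
# series alone, here from the 1-minute close prices.
close_ohlc = df["close"].resample("15min").ohlc()
```

The important precondition is the same as in the code above: the frame must have a DatetimeIndex before resample is called.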
I need a DataFrame of one column ['Week'] that has all values from 0 to 100 inclusive.
I need it as a Dataframe so I can perform a pd.merge
So far I have tried creating an empty DataFrame, creating a series of 0-100 and then attempting to append this series to the DataFrame as a column.
alert_count_list = pd.DataFrame()
week_list= pd.Series(range(0,101))
alert_count_list['week'] = alert_count_list.append(week_list)
Try this:
df = pd.DataFrame(columns=["week"])
df["week"] = np.arange(101)
alert_count_list = pd.DataFrame(np.arange(101), columns=['week'])
or
alert_count_list = pd.DataFrame({'week':range(101)})
You can try:
week_vals = []
for i in range(0, 101):
    week_vals.append(i)
df = pd.DataFrame(columns = ['week'])
df['week'] = week_vals
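Since the stated goal is a pd.merge, here is a minimal sketch of how the week frame might be used to fill in missing weeks. The alerts frame and its alerts column are hypothetical, invented only for illustration.

```python
import pandas as pd

# All weeks from 0 to 100 inclusive.
weeks = pd.DataFrame({"week": range(101)})

# Hypothetical alert counts that only cover some weeks.
alerts = pd.DataFrame({"week": [0, 3, 100], "alerts": [5, 2, 7]})

# A left merge keeps every week; weeks without data get NaN, then 0.
full = weeks.merge(alerts, on="week", how="left")
full["alerts"] = full["alerts"].fillna(0).astype(int)
```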
I have a few categorical columns (descriptions) in my DataFrame df_churn which I'd like to convert to numerical values. I'd also like to create a lookup table, because I will need to convert them back eventually.
The problem is that every column has a different number of categories, so appending to df_categories is not easy, and I can't think of a simple way to do it.
Here is what I have so far. It stops after the first column because of the different lengths.
cat_clmn = ['CLI_REGION','CLI_PROVINCE','CLI_ORIGIN','cli_origin2','cli_origin3', 'ONE_PRD_TYPE_1']
df_categories = pd.DataFrame()
def categorizer(_clmn):
    for clmn in cat_clmn:
        dict_cat = {key: value for value, key in enumerate(df_churn[clmn].unique())}
        df_categories[clmn] = dict_cat.values()
        df_categories[clmn + '_key'] = dict_cat.keys()
        df_churn[clmn + '_CAT'] = df_churn[clmn].map(dict_cat)
categorizer(cat_clmn)
There is a temporary solution, but I am sure it can be done in a better way.
df_CLI_REGION = pd.DataFrame()
df_CLI_PROVINCE = pd.DataFrame()
df_CLI_ORIGIN = pd.DataFrame()
df_cli_origin2 = pd.DataFrame()
df_cli_origin3 = pd.DataFrame()
df_ONE_PRD_TYPE_1 = pd.DataFrame()
cat_clmn = ['CLI_REGION','CLI_PROVINCE','CLI_ORIGIN','cli_origin2','cli_origin3', 'ONE_PRD_TYPE_1']
df_lst = [df_CLI_REGION,df_CLI_PROVINCE,df_CLI_ORIGIN,df_cli_origin2,df_cli_origin3, df_ONE_PRD_TYPE_1]
def categorizer(_clmn):
    for clmn, df in zip(cat_clmn,df_lst):
        d = {key: value for value, key in enumerate(df_churn[clmn].unique())}
        df[clmn] = d.values()
        df[clmn + '_key'] = d.keys()
        df_churn[clmn + '_CAT'] = df_churn[clmn].map(d)
categorizer(cat_clmn)
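One possible way to avoid a separate DataFrame variable per column is to keep the lookup tables in a dict and let pd.factorize produce the codes. This is only a sketch under assumptions: the small df_churn below is invented sample data, and the lookups dict is a name introduced here for illustration.

```python
import pandas as pd

# Hypothetical sample standing in for df_churn.
df_churn = pd.DataFrame({
    "CLI_REGION": ["north", "south", "north", "east"],
    "CLI_ORIGIN": ["web", "web", "shop", "web"],
})
cat_clmn = ["CLI_REGION", "CLI_ORIGIN"]

lookups = {}
for clmn in cat_clmn:
    # codes: one integer per row; uniques: the category for each code.
    # Missing values, if any, receive the code -1.
    codes, uniques = pd.factorize(df_churn[clmn])
    df_churn[clmn + "_CAT"] = codes
    # Position in this Series equals the code, so it works as a lookup table.
    lookups[clmn] = pd.Series(uniques, name=clmn)

# Converting back: map each code to its category through the lookup table.
restored = df_churn["CLI_REGION_CAT"].map(lookups["CLI_REGION"])
```

Because each lookup table is its own Series, the differing number of categories per column is no longer a problem.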
I am using a mixture of lists and pandas DataFrames to clean and merge CSV data. The following is a snippet from my code that runs disgustingly slowly. It generates a CSV with about 3MM lines of data.
UniqueAPI = Uniquify(API)
dummydata = []
#bridge the gaps in the data with zeros
for i in range(0,len(UniqueAPI)):
    DateList = []
    DaysList = []
    PDaysList = []
    OperatorList = []
    OGOnumList = []
    CountyList = []
    MunicipalityList = []
    LatitudeList = []
    LongitudeList = []
    UnconventionalList = []
    ConfigurationList = []
    HomeUseList = []
    ReportingPeriodList = []
    RecordSourceList = []
    for j in range(0,len(API)):
        if UniqueAPI[i] == API[j]:
            #print(str(ProdDate[j]))
            DateList.append(ProdDate[j])
            DaysList = Days[j]
            OperatorList = Operator[j]
            OGOnumList = OGOnum[j]
            CountyList = County[j]
            MunicipalityList = Municipality[j]
            LatitudeList = Latitude[j]
            LongitudeList = Longitude[j]
            UnconventionalList = Unconventional[j]
            ConfigurationList = Configuration[j]
            HomeUseList = HomeUse[j]
            ReportingPeriodList = ReportingPeriod[j]
            RecordSourceList = RecordSource[j]
    df = pd.DataFrame(DateList, columns = ['Date'])
    df['Date'] = pd.to_datetime(df['Date'])
    minDate = df.min()
    maxDate = df.max()
    Years = int((maxDate - minDate)/np.timedelta64(1,'Y'))
    Months = int(round((maxDate - minDate)/np.timedelta64(1,'M')))
    finalMonths = Months - Years*12 + 1
    Y,x = str(minDate).split("-",1)
    x,Y = str(Y).split(" ",1)
    for k in range(0,Years + 1):
        if k == Years:
            ender = int(finalMonths + 1)
        else:
            ender = int(13)
        full_df = pd.DataFrame()
        if k > 0:
            del full_df
            full_df = pd.DataFrame()
        full_df['API'] = UniqueAPI[i]
        full_df['Production Month'] = [pd.to_datetime(str(x)+'/1/'+str(int(Y)+k)) for x in range(1,ender)]
        full_df['Days'] = DaysList
        full_df['Operator'] = OperatorList
        full_df['OGO_NUM'] = OGOnumList
        full_df['County'] = CountyList
        full_df['Municipality'] = MunicipalityList
        full_df['Latitude'] = LatitudeList
        full_df['Longitude'] = LongitudeList
        full_df['Unconventional'] = UnconventionalList
        full_df['Well_Configuration'] = ConfigurationList
        full_df['Home_Use'] = HomeUseList
        full_df['Reporting_Period'] = ReportingPeriodList
        full_df['Record_Source'] = RecordSourceList
        dummydata.append(full_df)
full_df = pd.concat(dummydata)
result = full_df.merge(dataClean,how='left').fillna(0)
print(result[:100])
result.to_csv(ResultPath, index_label=False, index=False)
This snippet has been running for hours, and the output should have ~3MM lines. There has to be a faster way in pandas to accomplish the goal, which I will describe:
1. For each unique API, I find all occurrences in the main list of APIs.
2. Using that information, I build a list of dates.
3. I find a min and max date for each list corresponding to an API.
4. I then build an empty pandas DataFrame that has every month between the two dates for the associated API.
5. I then append this DataFrame to a list "dummydata" and loop to the next API.
6. Taking this dummy data list, I then concatenate it into a DataFrame.
7. This DataFrame is then merged with another DataFrame with cleaned data.
8. The end result is a CSV that has a 0 value for dates that did not exist but should, between the max and min dates for each corresponding API in the original unclean list.
This all takes far longer than I would expect. I would have thought that finding the min and max date for each unique item and interpolating monthly between them, filling in months that don't have data with 0, would be a three-line thing in pandas. Any options you think I should explore, or any snippets of code that could help me out, are much appreciated!
You could start by cleaning up the code a bit. These lines don't seem to have any effect or functional purpose since full_df was just created and is already an empty dataframe:
if k > 0:
    del full_df
    full_df = pd.DataFrame()
Then, when you actually build full_df, it's better to do it all at once rather than one column at a time. Note that pd.concat only accepts pandas objects (Series or DataFrames), not plain lists or scalars, so pass a dict to the DataFrame constructor instead. Try something like this:
full_df = pd.DataFrame({
    'API': UniqueAPI[i],
    'Production Month': [pd.to_datetime(str(x)+'/1/'+str(int(Y)+k)) for x in range(1,ender)],
    'Days': DaysList,
    # etc...
})
The dict keys become the column labels, so they are set in the same step. Scalar values such as UniqueAPI[i] are broadcast to every row.
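Going further, the overall goal described in the question (every month between each API's first and last date, with 0 for missing months) can probably be expressed without the nested loops. The following is a minimal sketch under assumptions: the small dataClean frame and its Gas column are invented, and it assumes production dates are stored as month-start timestamps.

```python
import pandas as pd

# Hypothetical cleaned data: one row per API and production month, with gaps.
dataClean = pd.DataFrame({
    "API": ["A", "A", "B", "B"],
    "Production Month": pd.to_datetime(
        ["2020-01-01", "2020-04-01", "2019-11-01", "2020-01-01"]),
    "Gas": [10.0, 40.0, 5.0, 7.0],
})

# First and last production month per API.
span = dataClean.groupby("API")["Production Month"].agg(["min", "max"])

# One row per API per month-start between its own min and max date.
grid = pd.concat(
    [pd.DataFrame({"API": api,
                   "Production Month": pd.date_range(lo, hi, freq="MS")})
     for api, lo, hi in span.itertuples(name=None)],
    ignore_index=True,
)

# Left merge keeps every grid month; months without data become 0.
result = grid.merge(dataClean, on=["API", "Production Month"],
                    how="left").fillna(0)
```

Per-well constants such as Operator or County would also be filled with 0 by this fillna, so in practice you would likely merge those in from a separate one-row-per-API table instead.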