Iteratively add columns of various lengths to DataFrame - python
I have a few categorical columns (descriptions) in my DataFrame df_churn which I'd like to convert to numerical values. And of course I'd like to create a lookup table, because I will need to convert them back eventually.
The problem is that every column has a different number of categories, so appending them to df_categories is not straightforward and I can't think of any simple way to do it.
Here is what I have so far. It stops after the first column because of the differing lengths.
cat_clmn = ['CLI_REGION','CLI_PROVINCE','CLI_ORIGIN','cli_origin2','cli_origin3', 'ONE_PRD_TYPE_1']
df_categories = pd.DataFrame()
def categorizer(_clmn):
    for clmn in cat_clmn:
        dict_cat = {key: value for value, key in enumerate(df_churn[clmn].unique())}
        df_categories[clmn] = dict_cat.values()
        df_categories[clmn + '_key'] = dict_cat.keys()
        df_churn[clmn + '_CAT'] = df_churn[clmn].map(dict_cat)

categorizer(cat_clmn)
There is a temporary solution, but I am sure it can be done in a better way.
df_CLI_REGION = pd.DataFrame()
df_CLI_PROVINCE = pd.DataFrame()
df_CLI_ORIGIN = pd.DataFrame()
df_cli_origin2 = pd.DataFrame()
df_cli_origin3 = pd.DataFrame()
df_ONE_PRD_TYPE_1 = pd.DataFrame()
cat_clmn = ['CLI_REGION','CLI_PROVINCE','CLI_ORIGIN','cli_origin2','cli_origin3', 'ONE_PRD_TYPE_1']
df_lst = [df_CLI_REGION,df_CLI_PROVINCE,df_CLI_ORIGIN,df_cli_origin2,df_cli_origin3, df_ONE_PRD_TYPE_1]
def categorizer(_clmn):
    for clmn, df in zip(cat_clmn, df_lst):
        d = {key: value for value, key in enumerate(df_churn[clmn].unique())}
        df[clmn] = d.values()
        df[clmn + '_key'] = d.keys()
        df_churn[clmn + '_CAT'] = df_churn[clmn].map(d)

categorizer(cat_clmn)
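For reference, a minimal sketch of one possible cleaner approach, assuming df_churn and cat_clmn as above (the lookup dict is illustrative, not part of the original post): pd.factorize returns both the integer codes and the unique values in one call, so each column's mapping can be kept as a small Series keyed by code instead of a per-column DataFrame.

import pandas as pd

lookup = {}  # column name -> Series mapping integer code -> original category
for clmn in cat_clmn:
    codes, uniques = pd.factorize(df_churn[clmn])
    df_churn[clmn + '_CAT'] = codes    # numerical version of the column
    lookup[clmn] = pd.Series(uniques)  # the index of this Series is the code

# converting back later, e.g. for the first column:
# df_churn['CLI_REGION'] = df_churn['CLI_REGION_CAT'].map(lookup['CLI_REGION'])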
Related
How to iterate with groupby in pandas
I have a function minmax that basically iterates over a dataframe of transactions. I want to compute a set of calculations including the id, so accountstart and accountend are the two fields calculated. The intention is to make these calculations by month and account. So when I do:

df1 = df.loc[df['accountNo']==10]
minmax(df1)

it works. What I can't do is:

df.groupby('accountNo').apply(minmax)

When I do:

grouped = df.groupby('accountNo')
for i, j in grouped:
    print(minmax(j))

it does the computation and prints the result, but without the print it complains about a KeyError: -1 coming from itertools. So awkward. How do I tackle that in pandas?

def minmax(x):
    dfminmax = {}
    accno = set(x['accountNo'])
    accno = repr(accno)
    kgroup = x.groupby('monthStart')['cumsum'].sum()
    maxt = x['startbalance'].max()
    kgroup = pd.DataFrame(kgroup)
    kgroup['startbalance'] = 0
    kgroup['startbalance'][0] = maxt
    kgroup['endbalance'] = 0
    kgroup['accountNo'] = accno
    kgroup['accountNo'] = kgroup['accountNo'].str.strip('{}.0')
    kgroup.reset_index(inplace=True)
    for idx, row in kgroup.iterrows():
        if kgroup.loc[idx, 'startbalance'] == 0:
            kgroup.loc[idx, 'startbalance'] = kgroup.loc[idx-1, 'endbalance']
        if kgroup.loc[idx, 'endbalance'] == 0:
            kgroup.loc[idx, 'endbalance'] = kgroup.loc[idx, 'cumsum'] + kgroup.loc[idx, 'startbalance']
    dfminmax['monthStart'].append(kgroup['monthStart'])
    dfminmax['startbalance'].append(kgroup['startbalance'])
    dfminmax['endbalance'].append(kgroup['endbalance'])
    dfminmax['accountNo'].append(kgroup['accountNo'])
    return dfminmax
In a groupby, .agg() applies your function to each column of a group as a Series, whereas .apply() passes the whole group. Using .agg, as in df.groupby('accountNo').agg(yourfunction), should yield better results here. Be sure to check out the documentation for details on the implementation.
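To illustrate the distinction with a toy example (hypothetical data, not from the posts above): .agg calls the function once per column of each group, passing a Series, while .apply calls it once per group, passing the group's DataFrame.

import pandas as pd

df = pd.DataFrame({'accountNo': [10, 10, 11, 11],
                   'cumsum': [1.0, 2.0, 3.0, 4.0]})

# .agg: the lambda receives each remaining column of the group as a Series
per_column = df.groupby('accountNo').agg(lambda s: s.max() - s.min())

# .apply: the lambda receives the whole group as a DataFrame
per_group = df.groupby('accountNo').apply(lambda g: g['cumsum'].sum())

print(per_column)
print(per_group)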
Creating a DataFrame from a lot of lists
I want to create a dataframe from my data. What I do is essentially a grid search over different parameters for my algorithm. Do you have any idea how this can be done better? Right now, if I need to add two more parameters to my grid, or add more data on which I perform my analysis, I need to manually add a lot of lists, append some values to each of them, and then add another column to the DataFrame dict. Is there another way? Because right now it looks really ugly.

type_preds = []
type_models = []
type_lens = []
type_smalls = []
lfc1s = []
lfc2s = []
lfc3s = []
lv2s = []
sfp1s = []
len_small_fils = []
ratio_small_fills = []
ratio_big_fils = []

for path_to_config in path_to_results.iterdir():
    try:
        type_pred, type_model, type_len, type_small, len_small_fil, ratio_big_fil, ratio_small_fill = path_to_config.name[:-4].split('__')
    except:
        print(path_to_config)
        continue
    path_to_trackings = sorted([str(el) for el in list(path_to_config.iterdir())])[::-1]
    sfp1, lv2, lfc3, lfc2, lfc1 = display_metrics(path_to_gts, path_to_trackings)

    type_preds.append(type_pred)
    type_models.append(type_model)
    type_lens.append(type_len)
    type_smalls.append(type_small)
    len_small_fils.append(len_small_fil)
    ratio_big_fils.append(ratio_big_fil)
    ratio_small_fills.append(ratio_small_fill)
    lfc1s.append(lfc1)
    lfc2s.append(lfc2)
    lfc3s.append(lfc3)
    lv2s.append(lv2)
    sfp1s.append(sfp1)

df = pd.DataFrame({
    'type_pred': type_preds,
    'type_model': type_models,
    'type_len': type_lens,
    'type_small': type_smalls,
    'len_small_fil': len_small_fils,
    'ratio_small_fill': ratio_small_fills,
    'ratio_big_fill': ratio_big_fils,
    'lfc3': lfc3s,
    'lfc2': lfc2s,
    'lfc1': lfc1s,
    'lv2': lv2s,
    'sfp1': sfp1s
})
Something along these lines might make it easier:

data = []
for path_to_config in path_to_results.iterdir():
    row = []
    try:
        row.extend(path_to_config.name[:-4].split('__'))
    except:
        print(path_to_config)
        continue
    path_to_trackings = sorted([str(el) for el in list(path_to_config.iterdir())])[::-1]
    row.extend(display_metrics(path_to_gts, path_to_trackings))
    data.append(row)

df = pd.DataFrame(
    data,
    columns=[
        "type_pred",
        "type_model",
        "type_len",
        "type_small",
        "len_small_fil",
        "ratio_big_fil",
        "ratio_small_fill",
        # End of first half
        "sfp1",
        "lv2",
        "lfc3",
        "lfc2",
        "lfc1",
    ])

Then every time you add an extra return variable to either the first or second function you just need to add an extra column to the final DataFrame creation.
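A variation on the same idea, if positional rows feel fragile, is to build a list of dicts and let the keys become the columns. This is only a sketch under the same assumptions as the answer above (path_to_results, path_to_gts and display_metrics come from the question; field_names and metric_names are illustrative):

import pandas as pd

field_names = ['type_pred', 'type_model', 'type_len', 'type_small',
               'len_small_fil', 'ratio_big_fil', 'ratio_small_fill']
metric_names = ['sfp1', 'lv2', 'lfc3', 'lfc2', 'lfc1']

records = []
for path_to_config in path_to_results.iterdir():
    fields = path_to_config.name[:-4].split('__')
    if len(fields) != len(field_names):  # malformed directory name, skip it
        print(path_to_config)
        continue
    path_to_trackings = sorted(str(el) for el in path_to_config.iterdir())[::-1]
    metrics = display_metrics(path_to_gts, path_to_trackings)
    records.append(dict(zip(field_names + metric_names, fields + list(metrics))))

df = pd.DataFrame(records)  # column names come straight from the dict keys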
Aggregating financial candlestick data
I have 1-minute OHLCV candlestick data, and I need to aggregate it to create 15-minute candlesticks. The data comes from MongoDB; this is a version in plain Python:

def get_candela(self, tf):
    c = dict()
    candel = dict()
    candele_finale = list()
    prov_c = list()
    db = database("price_data", "1min_OHLC_XBTUSD")
    col = database.get_collection(db, "1min_OHLC_XBTUSD")
    db_candela = col.find({}, sort=[('timestamp', pymongo.ASCENDING)]).limit(20)
    candele = list(db_candela)
    timestamp_calc = list()
    open_calc = list()
    max_calc = list()
    min_calc = list()
    close_calc = list()
    vol_calc = list()
    #for _ in range(len(candele)):
    for a in range(tf):
        if len(candele) == 0:
            break
        prov_c.append(candele[a])
    c.append(prov_c)
    candele[:tf] = []
    for b in range(len(c)):
        cndl = c[b]
        for d in range(tf):
            print(cndl)
            cnd = cndl[d]
            #print(len(cnd))
            timestamp_calc.append(cnd["timestamp"])
            open_calc.append(cnd["open"])
            max_calc.append(cnd["high"])
            min_calc.append(cnd["low"])
            close_calc.append(cnd["close"])
            vol_calc.append(cnd["volume"])
            index_close = len(close_calc)
        candel["timestamp"] = timestamp_calc[d]
        candel["open"] = open_calc[0]
        candel["high"] = max(max_calc)
        candel["low"] = min(min_calc)
        candel["close"] = close_calc[index_close-1]
        candel["volume"] = sum(vol_calc)
        #print(candel)
        candele_finale.append(candel)
        max_calc.clear()
        min_calc.clear()
        vol_calc.clear()
    return candele_finale

This returns an array with only the last candlestick created. And this is another version in pandas:

db = database("price_data", "1min_OHLC_XBTUSD")
col = database.get_collection(db, "1min_OHLC_XBTUSD")
db_candela = col.find({}, sort=[('timestamp', pymongo.ASCENDING)]).limit(20)
prov_c = list()
for item in db_candela:
    cc = {"timestamp": item["timestamp"], "open": item["open"], "high": item["high"], "low": item["low"], "close": item["close"], "volume": item["volume"]}
    prov_c.append(cc)
print(prov_c)
data = pandas.DataFrame([prov_c], index=[pandas.to_datetime(cc["timestamp"])])
#print(data)
df = data.resample('5T').agg({'timestamp': 'first', 'open': 'first', 'high': 'max', 'low': 'min', 'close': 'last', 'volume': 'sum'})
#print(data.mean())
#with pandas.option_context('display.max_rows', None, 'display.max_columns', None):  # more options can be specified also
pprint(df)

This returns a dataframe with weird/random values.
I had the same question today, and I answered it. Basically there is a pandas function called resample that does all of the work for you. Here is my code:

import json
import pandas as pd

# load the raw data and clean it up
data_json = open('./testapiresults.json')  # load json object
data_dict = json.load(data_json)           # convert to a dict object
df = pd.DataFrame.from_dict(data_dict)     # convert to a pandas DataFrame
df['datetime'] = pd.to_datetime(df['datetime'], unit='ms')  # convert from unix time (ms from epoch) to human time
df = df.set_index(pd.DatetimeIndex(df['datetime']))         # set the time series as the index for the resample function

# resample to aggregate to a different frequency of data
candle_summary = pd.DataFrame()
candle_summary['open'] = df.open.resample('15Min').first()
candle_summary['high'] = df.high.resample('15Min').max()
candle_summary['low'] = df.low.resample('15Min').min()
candle_summary['close'] = df.close.resample('15Min').last()
candle_summary.head()

I had to export to csv and recalculate it in Excel to double-check that it was calculating correctly, but this works. I have not figured out the pandas.DataFrame.resample('30min').ohlc() function, but that looks like it could make some really elegant code if I could figure out how to make it work without tons of errors.
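For the volume column (which the snippet above skips), a single .agg call on the resampler can produce the whole 15-minute candle at once. This is only a sketch, assuming df already has a DatetimeIndex and columns named open, high, low, close and volume as in the question:

candle_summary = df.resample('15Min').agg({
    'open': 'first',    # first trade of the window opens the candle
    'high': 'max',
    'low': 'min',
    'close': 'last',    # last trade of the window closes the candle
    'volume': 'sum',    # volume accumulates over the window
})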
Python - Pandas library returns wrong column values after parsing a CSV file
SOLVED: Found the solution by myself. Turns out that when you want to retrieve specific columns by their names, you should pass the names in the order they appear inside the csv (which is really stupid for a library that is intended to save some parsing time for a developer, IMO). Correct me if I am wrong, but I don't see an option to get a specific column's values by its name if the columns are listed in a different order...

I am trying to read a comma-separated value file with Python and then parse it using the pandas library. Since the file has many values (columns) that are not needed, I make a list of the column names I do need. Here's a look at the csv file format:

Div,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HTHG,HTAG,HTR,Attendance,Referee,HS,AS,HST,AST,HHW,AHW,HC,AC,HF,AF,HO,AO,HY,AY,HR,AR,HBP,ABP,GBH,GBD,GBA,IWH,IWD,IWA,LBH,LBD,LBA,SBH,SBD,SBA,WHH,WHD,WHA
E0,19/08/00,Charlton,Man City,4,0,H,2,0,H,20043,Rob Harris,17,8,14,4,2,1,6,6,13,12,8,6,1,2,0,0,10,20,2,3,3.2,2.2,2.9,2.7,2.2,3.25,2.75,2.2,3.25,2.88,2.1,3.2,3.1
E0,19/08/00,Chelsea,West Ham,4,2,H,1,0,H,34914,Graham Barber,17,12,10,5,1,0,7,7,19,14,2,3,1,2,0,0,10,20,1.47,3.4,5.2,1.6,3.2,4.2,1.5,3.4,6,1.5,3.6,6,1.44,3.6,6.5
E0,19/08/00,Coventry,Middlesbrough,1,3,A,1,1,D,20624,Barry Knight,6,16,3,9,0,1,8,4,15,21,1,3,5,3,1,0,75,30,2.15,3,3,2.2,2.9,2.7,2.25,3.2,2.75,2.3,3.2,2.75,2.3,3.2,2.62
E0,19/08/00,Derby,Southampton,2,2,D,1,2,A,27223,Andy D'Urso,6,13,4,6,0,0,5,8,11,13,0,2,1,1,0,0,10,10,2,3.1,3.2,1.8,3,3.5,2.2,3.25,2.75,2.05,3.2,3.2,2,3.2,3.2
E0,19/08/00,Leeds,Everton,2,0,H,2,0,H,40010,Dermot Gallagher,17,12,8,6,0,0,6,4,21,20,6,1,1,3,0,0,10,30,1.65,3.3,4.3,1.55,3.3,4.5,1.55,3.5,5,1.57,3.6,5,1.61,3.5,4.5
E0,19/08/00,Leicester,Aston Villa,0,0,D,0,0,D,21455,Mike Riley,5,5,4,3,0,0,5,4,12,12,1,4,2,3,0,0,20,30,2.15,3.1,2.9,2.3,2.9,2.5,2.35,3.2,2.6,2.25,3.25,2.75,2.4,3.25,2.5
E0,19/08/00,Liverpool,Bradford,1,0,H,0,0,D,44183,Paul Durkin,16,3,10,2,0,0,6,1,8,8,5,0,1,1,0,0,10,10,1.25,4.1,7.2,1.25,4.3,8,1.35,4,8,1.36,4,8,1.33,4,8

This list is passed to pandas.read_csv()'s names parameter. See the code.
# Returns an array of the column names needed for our raw data table
def cols_to_extract():
    cols_to_use = [None] * RawDataCols.COUNT
    cols_to_use[RawDataCols.DATE] = 'Date'
    cols_to_use[RawDataCols.HOME_TEAM] = 'HomeTeam'
    cols_to_use[RawDataCols.AWAY_TEAM] = 'AwayTeam'
    cols_to_use[RawDataCols.FTHG] = 'FTHG'
    cols_to_use[RawDataCols.HG] = 'HG'
    cols_to_use[RawDataCols.FTAG] = 'FTAG'
    cols_to_use[RawDataCols.AG] = 'AG'
    cols_to_use[RawDataCols.FTR] = 'FTR'
    cols_to_use[RawDataCols.RES] = 'Res'
    cols_to_use[RawDataCols.HTHG] = 'HTHG'
    cols_to_use[RawDataCols.HTAG] = 'HTAG'
    cols_to_use[RawDataCols.HTR] = 'HTR'
    cols_to_use[RawDataCols.ATTENDANCE] = 'Attendance'
    cols_to_use[RawDataCols.HS] = 'HS'
    cols_to_use[RawDataCols.AS] = 'AS'
    cols_to_use[RawDataCols.HST] = 'HST'
    cols_to_use[RawDataCols.AST] = 'AST'
    cols_to_use[RawDataCols.HHW] = 'HHW'
    cols_to_use[RawDataCols.AHW] = 'AHW'
    cols_to_use[RawDataCols.HC] = 'HC'
    cols_to_use[RawDataCols.AC] = 'AC'
    cols_to_use[RawDataCols.HF] = 'HF'
    cols_to_use[RawDataCols.AF] = 'AF'
    cols_to_use[RawDataCols.HFKC] = 'HFKC'
    cols_to_use[RawDataCols.AFKC] = 'AFKC'
    cols_to_use[RawDataCols.HO] = 'HO'
    cols_to_use[RawDataCols.AO] = 'AO'
    cols_to_use[RawDataCols.HY] = 'HY'
    cols_to_use[RawDataCols.AY] = 'AY'
    cols_to_use[RawDataCols.HR] = 'HR'
    cols_to_use[RawDataCols.AR] = 'AR'
    return cols_to_use

# Extracts raw data from the raw data csv and populates the raw match data table in the database
def extract_raw_data(csv):
    # Clear the database table if it has any logs
    # if MatchRawData.objects.count != 0:
    #     MatchRawData.objects.delete()

    cols_to_use = cols_to_extract()

    # Read and parse the csv file
    parsed_csv = pd.read_csv(csv, delimiter=',', names=cols_to_use, header=0)

    for col in cols_to_use:
        values = parsed_csv[col].values
        for val in values:
            print(str(col) + ' --------> ' + str(val))

Where RawDataCols is an IntEnum:

class RawDataCols(IntEnum):
    DATE = 0
    HOME_TEAM = 1
    AWAY_TEAM = 2
    FTHG = 3
    HG = 4
    FTAG = 5
    AG = 6
    FTR = 7
    RES = 8
    ...

The column names are obtained using it. That part of the code works OK. The correct column name is obtained, but after trying to get its values using values = parsed_csv[col].values, pandas returns the values of a wrong column. The wrong column index is around 13 indexes away from the one I am trying to get. What am I missing?
You can select columns by name. Just use the following line:

values = parsed_csv[["Column Name", "Column Name2"]]

Or you can select them index-wise with:

cols = [1, 2, 3, 4]
values = parsed_csv[parsed_csv.columns[cols]]
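Another option worth noting (a sketch only, with a hypothetical file name): read_csv's usecols parameter selects columns by the names in the file's own header row, so the order of the list you pass does not matter and no names= re-labelling is needed.

import pandas as pd

# subset of the csv's header columns, in any order; 'matches.csv' is hypothetical
wanted = ['HomeTeam', 'Date', 'AwayTeam', 'FTHG', 'FTAG', 'FTR']
parsed_csv = pd.read_csv('matches.csv', usecols=wanted)
print(parsed_csv['HomeTeam'].values)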
Slow Data analysis using pandas
I am using a mixture of both lists and pandas dataframes to accomplish a clean and merge of csv data. The following is a snippet from my code that runs disgustingly slow... It generates a csv with about 3MM lines of data.

UniqueAPI = Uniquify(API)
dummydata = []
#bridge the gaps in the data with zeros
for i in range(0, len(UniqueAPI)):
    DateList = []
    DaysList = []
    PDaysList = []
    OperatorList = []
    OGOnumList = []
    CountyList = []
    MunicipalityList = []
    LatitudeList = []
    LongitudeList = []
    UnconventionalList = []
    ConfigurationList = []
    HomeUseList = []
    ReportingPeriodList = []
    RecordSourceList = []
    for j in range(0, len(API)):
        if UniqueAPI[i] == API[j]:
            #print(str(ProdDate[j]))
            DateList.append(ProdDate[j])
            DaysList = Days[j]
            OperatorList = Operator[j]
            OGOnumList = OGOnum[j]
            CountyList = County[j]
            MunicipalityList = Municipality[j]
            LatitudeList = Latitude[j]
            LongitudeList = Longitude[j]
            UnconventionalList = Unconventional[j]
            ConfigurationList = Configuration[j]
            HomeUseList = HomeUse[j]
            ReportingPeriodList = ReportingPeriod[j]
            RecordSourceList = RecordSource[j]
    df = pd.DataFrame(DateList, columns=['Date'])
    df['Date'] = pd.to_datetime(df['Date'])
    minDate = df.min()
    maxDate = df.max()
    Years = int((maxDate - minDate)/np.timedelta64(1,'Y'))
    Months = int(round((maxDate - minDate)/np.timedelta64(1,'M')))
    finalMonths = Months - Years*12 + 1
    Y, x = str(minDate).split("-", 1)
    x, Y = str(Y).split(" ", 1)
    for k in range(0, Years + 1):
        if k == Years:
            ender = int(finalMonths + 1)
        else:
            ender = int(13)
        full_df = pd.DataFrame()
        if k > 0:
            del full_df
            full_df = pd.DataFrame()
        full_df['API'] = UniqueAPI[i]
        full_df['Production Month'] = [pd.to_datetime(str(x)+'/1/'+str(int(Y)+k)) for x in range(1, ender)]
        full_df['Days'] = DaysList
        full_df['Operator'] = OperatorList
        full_df['OGO_NUM'] = OGOnumList
        full_df['County'] = CountyList
        full_df['Municipality'] = MunicipalityList
        full_df['Latitude'] = LatitudeList
        full_df['Longitude'] = LongitudeList
        full_df['Unconventional'] = UnconventionalList
        full_df['Well_Configuration'] = ConfigurationList
        full_df['Home_Use'] = HomeUseList
        full_df['Reporting_Period'] = ReportingPeriodList
        full_df['Record_Source'] = RecordSourceList
        dummydata.append(full_df)
full_df = pd.concat(dummydata)
result = full_df.merge(dataClean, how='left').fillna(0)
print(result[:100])
result.to_csv(ResultPath, index_label=False, index=False)

This snippet of code has been running for hours. The output should have ~3MM lines. There has to be a faster way using pandas to accomplish the goal, which I will describe:

for each unique API I find all occurrences in the main list of APIs
using that information I build a list of dates
I find a min and max date for each list corresponding to an API
I then build an empty pandas DataFrame that has every month between the two dates for the associated API
I then append this data frame to a list "dummydata" and loop to the next API
taking this dummydata list I then concatenate it into a DataFrame
this DataFrame is then merged with another dataframe with cleaned data
the end result is a csv that has a 0 value for dates that did not exist but should, between the max and min dates for each corresponding API in the original unclean list

This all takes way longer than I would expect. I would have thought that finding the min/max date for each unique item and interpolating monthly between them, filling in months that don't have data with 0, would be like a three-line thing in pandas. Any options that you guys think I should explore, or any snippets of code that could help me out, would be much appreciated!
You could start by cleaning up the code a bit. These lines don't seem to have any effect or functional purpose since full_df was just created and is already an empty dataframe:

if k > 0:
    del full_df
    full_df = pd.DataFrame()

Then when you actually build up your full_df it's better to do it all at once rather than one column at a time. So try something like this:

full_df = pd.concat([UniqueAPI[i],
                     [pd.to_datetime(str(x)+'/1/'+str(int(Y)+k)) for x in range(1, ender)],
                     DaysList,
                     etc...
                     ], axis=1)

Then you would need to add the column labels which you could also do all at once (in the same order as your lists in the concat() call).

full_df.columns = ['API', 'Production Month', 'Days', etc.]
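Beyond that cleanup, the month-filling itself can usually be pushed into pandas instead of nested loops. The following is only a sketch, assuming the raw rows already sit in a single DataFrame named raw with an 'API' key, a month-start 'Date' column and the per-well attribute columns (these names are illustrative, not the poster's actual variables):

import pandas as pd

def fill_months(group):
    # one row per month between each well's first and last production date
    months = pd.date_range(group['Date'].min(), group['Date'].max(), freq='MS')
    out = group.set_index('Date').reindex(months)
    out['API'] = group['API'].iloc[0]  # re-fill the key dropped by reindex
    # months with no data get 0, matching the poster's intent
    return out.fillna(0).rename_axis('Production Month').reset_index()

filled = raw.groupby('API', group_keys=False).apply(fill_months)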