I am trying to convert some Pandas code to PySpark, which will run on an EMR cluster. This is my first time working with PySpark, and I am not sure of the optimal way to code the objective. The job is trying to achieve the following:
There is a base dataframe with schema like so:
institution_id, user_id, st_date
For every unique institution_id, get all users
For every user of that institution_id, take all unique st_dates in sorted order, compute the difference between each pair of consecutive st_dates, and output a dictionary
Here is what the code looks like as of now:
from collections import defaultdict

import numpy as np
import pandas as pd


def process_user(current_user, inst_cycles):
    current_user_dates = np.sort(current_user.st_date.unique())
    if current_user_dates.size > 1:
        prev_date = pd.to_datetime(current_user_dates[0]).date()
        for current_datetime in current_user_dates[1:]:
            current_date = pd.to_datetime(current_datetime).date()
            month = current_date.month
            delta = current_date - prev_date
            cycle_days = delta.days
            inst_cycles[month][cycle_days] += 1
            prev_date = current_date
    return inst_cycles


def get_inst_monthly_distribution(current_inst):
    inst_cycles = defaultdict(lambda: defaultdict(int))
    inst_user_ids = current_inst.select('user_id').distinct().collect()
    for _, user_id in enumerate(inst_user_ids):
        user_id_str = user_id[0]
        current_user = current_inst.filter(current_inst.user_id == user_id_str)
        inst_cycles = process_user(current_user, inst_cycles)
    return inst_cycles


def get_monthly_distributions(inst_ids, df):
    cycles = {}
    for _, inst_id_str in enumerate(inst_ids.keys()):
        current_inst = df.filter(df.inst_id == inst_id_str)
        cycles[inst_id_str] = get_inst_monthly_distribution(current_inst)
    return cycles


def execute():
    df = load_data()  # df is a Spark dataframe
    inst_names = get_inst_names(df)
    monthly_distributions = get_monthly_distributions(inst_names, df)
I think this code is not taking advantage of the parallelism of Spark, and can be coded in a much better way without the for loops. Is that correct?
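For what it's worth, here is a rough, untested sketch of a Spark-native version of the same logic, assuming st_date is a date or timestamp column and that one row per (institution_id, month, cycle_days) with a count is an acceptable stand-in for the nested dictionaries:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Pair consecutive st_dates per (institution, user) with a window ordered by date.
w = Window.partitionBy('institution_id', 'user_id').orderBy('st_date')

cycles = (
    df.select('institution_id', 'user_id', 'st_date').distinct()
      .withColumn('prev_date', F.lag('st_date').over(w))
      .where(F.col('prev_date').isNotNull())
      .withColumn('cycle_days', F.datediff('st_date', 'prev_date'))
      .withColumn('month', F.month('st_date'))
      # parallel equivalent of inst_cycles[month][cycle_days] += 1
      .groupBy('institution_id', 'month', 'cycle_days')
      .count()
)

This keeps all of the work inside Spark instead of collecting the user ids to the driver and filtering the dataframe once per user.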
Related
I have a function minmax that basically iterates over a dataframe of transactions. I want to compute a set of calculations that includes the id, so accountstart and accountend are the two fields calculated. The intention is to make these calculations by month and account.
So when I do the following, it works:
df1 = df.loc[df['accountNo'] == 10]
minmax(df1)
What I can't do is:
df.groupby('accountNo').apply(minmax)
When I do:
grouped = df.groupby('accountNo')
for i, j in grouped:
    print(minmax(j))
It does the computation and prints the result, but without the print it complains about KeyError: -1 coming from itertools. So awkward.
How to tackle that in Pandas?
def minmax(x):
    dfminmax = {}
    accno = set(x['accountNo'])
    accno = repr(accno)
    kgroup = x.groupby('monthStart')['cumsum'].sum()
    maxt = x['startbalance'].max()
    kgroup = pd.DataFrame(kgroup)
    kgroup['startbalance'] = 0
    kgroup['startbalance'][0] = maxt
    kgroup['endbalance'] = 0
    kgroup['accountNo'] = accno
    kgroup['accountNo'] = kgroup['accountNo'].str.strip('{}.0')
    kgroup.reset_index(inplace=True)
    for idx, row in kgroup.iterrows():
        if kgroup.loc[idx, 'startbalance'] == 0:
            kgroup.loc[idx, 'startbalance'] = kgroup.loc[idx-1, 'endbalance']
        if kgroup.loc[idx, 'endbalance'] == 0:
            kgroup.loc[idx, 'endbalance'] = kgroup.loc[idx, 'cumsum'] + kgroup.loc[idx, 'startbalance']
    dfminmax['monthStart'].append(kgroup['monthStart'])
    dfminmax['startbalance'].append(kgroup['startbalance'])
    dfminmax['endbalance'].append(kgroup['endbalance'])
    dfminmax['accountNo'].append(kgroup['accountNo'])
    return dfminmax
.apply() takes pandas Series as inputs, not DataFrames. Using .agg, as in df.groupby('accountNo').agg(yourfunction) should yield better results. Be sure to check out the documentation for details on implementation.
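For illustration, a minimal sketch of that .agg pattern, with hypothetical aggregations standing in for the real start/end balance logic (column names taken from the question; named aggregation needs pandas 0.25 or newer):

# Hypothetical per-account aggregates; the real balance logic would replace these.
summary = df.groupby('accountNo').agg(
    startbalance=('startbalance', 'max'),
    endbalance=('cumsum', 'sum'),
)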
I have 1-minute OHLCV candlestick data, and I need to aggregate it to create 15-minute candlesticks. The data comes from MongoDB; this is a version in plain Python:
def get_candela(self, tf):
    c = dict()
    candel = dict()
    candele_finale = list()
    prov_c = list()
    db = database("price_data", "1min_OHLC_XBTUSD")
    col = database.get_collection(db, "1min_OHLC_XBTUSD")
    db_candela = col.find({}, sort=[('timestamp', pymongo.ASCENDING)]).limit(20)
    candele = list(db_candela)
    timestamp_calc = list()
    open_calc = list()
    max_calc = list()
    min_calc = list()
    close_calc = list()
    vol_calc = list()
    #for _ in range(len(candele)):
    for a in range(tf):
        if len(candele) == 0:
            break
        prov_c.append(candele[a])
        c.append(prov_c)
        candele[:tf] = []
    for b in range(len(c)):
        cndl = c[b]
        for d in range(tf):
            print(cndl)
            cnd = cndl[d]
            #print(len(cnd))
            timestamp_calc.append(cnd["timestamp"])
            open_calc.append(cnd["open"])
            max_calc.append(cnd["high"])
            min_calc.append(cnd["low"])
            close_calc.append(cnd["close"])
            vol_calc.append(cnd["volume"])
            index_close = len(close_calc)
        candel["timestamp"] = timestamp_calc[d]
        candel["open"] = open_calc[0]
        candel["high"] = max(max_calc)
        candel["low"] = min(min_calc)
        candel["close"] = close_calc[index_close-1]
        candel["volume"] = sum(vol_calc)
        #print(candel)
        candele_finale.append(candel)
        max_calc.clear()
        min_calc.clear()
        vol_calc.clear()
    return candele_finale
This returns an array with only the last candlestick created.
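(A likely reason, for reference: candel is a single dict that is mutated and re-appended on every pass of the outer loop, so every element of candele_finale ends up pointing at the same object. Creating a fresh dict for each aggregated candle, as sketched below, would avoid that.)

for b in range(len(c)):
    candel = dict()  # new dict per aggregated candle instead of reusing one object
    cndl = c[b]
    # (rest of the loop unchanged)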
And this is another version in pandas:
db = database("price_data", "1min_OHLC_XBTUSD")
col = database.get_collection(db,"1min_OHLC_XBTUSD")
db_candela = col.find({}, sort = [('timestamp', pymongo.ASCENDING)]).limit(20)
prov_c = list()
for item in db_candela:
cc={"timestamp":item["timestamp"],"open":item["open"],"high":item["high"],"low":item["low"],"close":item["close"],"volume":item["volume"]}
prov_c.append(cc)
print(prov_c)
data = pandas.DataFrame([prov_c], index=[pandas.to_datetime(cc["timestamp"])])
#print(data)
df = data.resample('5T').agg({'timestamp':'first','open':'first','high':'max', 'low':'min','close' : 'last','volume': 'sum'})
#print(data.mean())
#with pandas.option_context('display.max_rows', None, 'display.max_columns',None): # more options can be specified also
pprint(df)
This returns a dataframe with weird/random values.
I had the same question today, and I answered it. Basically there is a pandas function called resample that does all of the work for you.
Here is my code:
import json
import pandas as pd
#load the raw data and clean it up
data_json = open('./testapiresults.json') #load json object
data_dict = json.load(data_json) #convert to a dict object
df = pd.DataFrame.from_dict(data_dict) #convert to panel data's dataframe object
df['datetime'] = pd.to_datetime(df['datetime'],unit='ms') #convert from unix time (ms from epoch) to human time
df = df.set_index(pd.DatetimeIndex(df['datetime'])) #set time series as index format for resample function
#resample to aggregate to a different frequency of data
candle_summary = pd.DataFrame()
candle_summary['open'] = df.open.resample('15Min').first()
candle_summary['high'] = df.high.resample('15Min').max()
candle_summary['low'] = df.low.resample('15Min').min()
candle_summary['close'] = df.close.resample('15Min').last()
candle_summary.head()
I had to export to CSV and recalculate it in Excel to double-check that it was calculating correctly, but this works. I have not figured out the pandas.DataFrame.resample('30min').ohlc() function, but that looks like it could make for some really elegant code if I could get it to work without tons of errors.
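For completeness, the whole OHLCV aggregation can also be done with a single agg call; this is a sketch assuming the frame has a DatetimeIndex and a volume column, which the snippet above does not include:

ohlcv_15m = df.resample('15Min').agg({
    'open': 'first',   # first 1-minute open in the window
    'high': 'max',
    'low': 'min',
    'close': 'last',   # last 1-minute close in the window
    'volume': 'sum',
})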
I have a file with around 500K records.
Each record needs to be validated.
Records are de-duplicated and stored in a list:
with open(filename) as f:
    records = f.readlines()
The validation file I use is stored in a Pandas DataFrame.
This DataFrame contains around 80K records and 9 columns (myfile.csv).
filename = 'myfile.csv'
df = pd.read_csv(filename)

def check(df, destination):
    try:
        area_code = destination[:3]
        office_code = destination[3:6]
        subscriber_number = destination[6:]
        if any(df['AREA_CODE'].astype(int) == area_code):
            area_code_numbers = df[df['AREA_CODE'] == area_code]
            if any(area_code_numbers['OFFICE_CODE'].astype(int) == office_code):
                matching_records = area_code_numbers[area_code_numbers['OFFICE_CODE'].astype(int) == office_code]
                start = subscriber_number >= matching_records['SUBSCRIBER_START']
                end = subscriber_number <= matching_records['SUBSCRIBER_END']
                # Perform intersection
                record_found = matching_records[start & end]['LABEL'].to_string(index=False)
                # We should return only 1 value
                if len(record_found) > 0:
                    return record_found
                else:
                    return 'INVALID_SUBSCRIBER'
            else:
                return 'INVALID_OFFICE_CODE'
        else:
            return 'INVALID_AREA_CODE'
    except KeyError:
        pass
    except Exception:
        pass
I'm looking for a way to speed up the comparisons, because when I run it on the full file it just hangs. If I run it with a small subset (10K) it works fine.
Not sure if there is a more efficient notation/recommendation.
for record in records:
    check(df, record)
Using macOS, 8 GB RAM / 2.3 GHz Intel Core i7.
Running cProfile.run on the check function alone shows:
4253 function calls (4199 primitive calls) in 0.017 seconds.
Hence I estimate 500K records will take around 2.5 hours.
While no sample data is available, consider this untested approach: a couple of left-join merges of the two data pieces, followed by the validation steps. This avoids any looping and runs the conditional logic across columns:
import pandas as pd
import numpy as np

with open('RecordsValidate.txt') as f:
    records = f.readlines()
print(records)

rdf = pd.DataFrame({'rcd_id': list(range(1, len(records)+1)),
                    'rcd_area_code': [int(rcd[:3]) for rcd in records],
                    'rcd_office_code': [int(rcd[3:6]) for rcd in records],
                    'rcd_subscriber_number': [rcd[6:] for rcd in records]})

filename = 'myfile.csv'
df = pd.read_csv(filename)

# VALIDATE AREA CODE
mrgdf = pd.merge(df, rdf, how='left', left_on=['AREA_CODE'], right_on=['rcd_area_code'])
mrgdf['RETURN'] = np.where(pd.isnull(mrgdf['rcd_id']), 'INVALID_AREA_CODE', np.nan)
mrgdf.drop([c for c in rdf.columns], inplace=True, axis=1)

# VALIDATE OFFICE CODE
mrgdf = pd.merge(mrgdf, rdf, how='left', left_on=['AREA_CODE', 'OFFICE_CODE'],
                 right_on=['rcd_area_code', 'rcd_office_code'])
mrgdf['RETURN'] = np.where(pd.isnull(mrgdf['rcd_id']), 'INVALID_OFFICE_CODE', mrgdf['RETURN'])

# VALIDATE SUBSCRIBER
mrgdf['RETURN'] = np.where((mrgdf['rcd_subscriber_number'] < mrgdf['SUBSCRIBER_START']) |
                           (mrgdf['rcd_subscriber_number'] > mrgdf['SUBSCRIBER_END']) |
                           (mrgdf['LABEL'].str.len() == 0),
                           'INVALID_SUBSCRIBER', mrgdf['RETURN'])
mrgdf.drop([c for c in rdf.columns], inplace=True, axis=1)
I have the dataframe below. bet_pl and co_pl keep track of the daily changes in the two balances. I have updated co_balance based on co_pl and its cumsum.
init_balance = D('100.0')
co_thresh = D('1.05') * init_balance

def get_pl_co(row):
    if row['eod_bet_balance'] > co_thresh:
        diff = row['eod_bet_balance'] - co_thresh
        return diff
    else:
        return Decimal('0.0')

df_odds_winloss['eod_bet_balance'] = df_odds_winloss['bet_pl'].cumsum() + init_balance
df_odds_winloss['sod_bet_balance'] = df_odds_winloss['eod_bet_balance'].shift(1).fillna(init_balance)
df_odds_winloss['co_pl'] = df_odds_winloss.apply(get_pl_co, axis=1)
df_odds_winloss['co_balance'] = df_odds_winloss['co_pl'].cumsum()

# trying this
df_odds_winloss['eod_bet_balance'] = df_odds_winloss['eod_bet_balance'] - df_odds_winloss['co_pl']
Now I want eod_bet_balance to be updated with a negative co_pl, as it is a transfer between the two balances, but I am not getting the right eod (end of day) balances.
Can anyone give a hint?
UPDATED: The eod_balances reflect the change in bet_pl but not the subsequent change in co_pl.
FINAL UPDATE:
initial_balance = D('100.0')
df = pd.DataFrame({'SP': res_df['SP'], 'winloss': bin_seq_l}, columns=['SP', 'winloss'])
df['bet_pl'] = df.apply(get_pl_lvl, axis=1)
df['interim_balance'] = df['bet_pl'].cumsum() + initial_balance
df['co_pl'] = (df['interim_balance'] - co_thresh).clip_lower(0)
df['co_balance'] = df['co_pl'].cumsum()
df['post_co_balance'] = df['interim_balance'] - df['co_pl']
bf_r = D('0.05')
df['post_co_deduct_balance'] = df['post_co_balance'] - (df['post_co_balance'] * bf_r)
df['sod_bet_balance'] = df['post_co_deduct_balance'].shift(1).fillna(initial_balance)
First, you don't need to apply a custom function to get co_pl; it can be done like so:
df['co_pl'] = (df['eod_bet_balance'] - co_thresh).clip_lower(0)
As for updating the other column, if I understand correctly you want something like this:
df['eod_bet_balance'] = df['eod_bet_balance'].clip_upper(co_thresh)
or, equivalently...
df['eod_bet_balance'] -= df['co_pl']
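As a side note, clip_lower and clip_upper were deprecated and later removed from pandas, so on recent versions the same two lines can be written with clip:

df['co_pl'] = (df['eod_bet_balance'] - co_thresh).clip(lower=0)
df['eod_bet_balance'] = df['eod_bet_balance'].clip(upper=co_thresh)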
I am using a mixture of lists and pandas dataframes to accomplish a clean and merge of CSV data. The following is a snippet from my code that runs disgustingly slow; it generates a CSV with about 3MM lines of data.
UniqueAPI = Uniquify(API)
dummydata = []

#bridge the gaps in the data with zeros
for i in range(0, len(UniqueAPI)):
    DateList = []
    DaysList = []
    PDaysList = []
    OperatorList = []
    OGOnumList = []
    CountyList = []
    MunicipalityList = []
    LatitudeList = []
    LongitudeList = []
    UnconventionalList = []
    ConfigurationList = []
    HomeUseList = []
    ReportingPeriodList = []
    RecordSourceList = []
    for j in range(0, len(API)):
        if UniqueAPI[i] == API[j]:
            #print(str(ProdDate[j]))
            DateList.append(ProdDate[j])
            DaysList = Days[j]
            OperatorList = Operator[j]
            OGOnumList = OGOnum[j]
            CountyList = County[j]
            MunicipalityList = Municipality[j]
            LatitudeList = Latitude[j]
            LongitudeList = Longitude[j]
            UnconventionalList = Unconventional[j]
            ConfigurationList = Configuration[j]
            HomeUseList = HomeUse[j]
            ReportingPeriodList = ReportingPeriod[j]
            RecordSourceList = RecordSource[j]
    df = pd.DataFrame(DateList, columns=['Date'])
    df['Date'] = pd.to_datetime(df['Date'])
    minDate = df.min()
    maxDate = df.max()
    Years = int((maxDate - minDate)/np.timedelta64(1, 'Y'))
    Months = int(round((maxDate - minDate)/np.timedelta64(1, 'M')))
    finalMonths = Months - Years*12 + 1
    Y, x = str(minDate).split("-", 1)
    x, Y = str(Y).split(" ", 1)
    for k in range(0, Years + 1):
        if k == Years:
            ender = int(finalMonths + 1)
        else:
            ender = int(13)
        full_df = pd.DataFrame()
        if k > 0:
            del full_df
            full_df = pd.DataFrame()
        full_df['API'] = UniqueAPI[i]
        full_df['Production Month'] = [pd.to_datetime(str(x)+'/1/'+str(int(Y)+k)) for x in range(1, ender)]
        full_df['Days'] = DaysList
        full_df['Operator'] = OperatorList
        full_df['OGO_NUM'] = OGOnumList
        full_df['County'] = CountyList
        full_df['Municipality'] = MunicipalityList
        full_df['Latitude'] = LatitudeList
        full_df['Longitude'] = LongitudeList
        full_df['Unconventional'] = UnconventionalList
        full_df['Well_Configuration'] = ConfigurationList
        full_df['Home_Use'] = HomeUseList
        full_df['Reporting_Period'] = ReportingPeriodList
        full_df['Record_Source'] = RecordSourceList
        dummydata.append(full_df)

full_df = pd.concat(dummydata)
result = full_df.merge(dataClean, how='left').fillna(0)
print(result[:100])
result.to_csv(ResultPath, index_label=False, index=False)
This snippet of code has been running for hours; the output should have ~3MM lines, and there has to be a faster way in pandas to accomplish the goal, which I will describe:
For each unique API I find all occurrences in the main list of APIs.
Using that information I build a list of dates.
I find the min and max date for each list corresponding to an API.
I then build an empty pandas DataFrame that has every month between the two dates for the associated API.
I then append this DataFrame to a list "dummydata" and loop to the next API.
Taking this dummydata list I then concatenate it into a DataFrame.
This DataFrame is then merged with another DataFrame containing cleaned data.
The end result is a CSV that has a 0 value for dates that did not exist but should, between the max and min dates for each corresponding API in the original unclean list.
This all takes way longer than I would expect. I would have thought that finding the min and max date for each unique item and interpolating monthly between them, filling months that don't have data with 0, would be like a three-line thing in Pandas. Any options you think I should explore, or any snippets of code that could help me out, are much appreciated!
You could start by cleaning up the code a bit. These lines don't seem to have any effect or functional purpose since full_df was just created and is already an empty dataframe:
if k > 0:
    del full_df
    full_df = pd.DataFrame()
Then when you actually build up your full_df it's better to do it all at once rather than one column at a time. So try something like this:
full_df = pd.concat([UniqueAPI[i],
                     [pd.to_datetime(str(x)+'/1/'+str(int(Y)+k)) for x in range(1, ender)],
                     DaysList,
                     etc...
                     ],
                    axis=1)
Then you would need to add the column labels which you could also do all at once (in the same order as your lists in the concat() call).
full_df.columns = ['API', 'Production Month', 'Days', etc.]
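As an untested illustration of that idea, each per-API frame could also be built from a dict in a single constructor call inside the loop (the names come from the question's code, and the scalar values are broadcast across the month rows):

months = [pd.to_datetime(str(m) + '/1/' + str(int(Y) + k)) for m in range(1, ender)]
full_df = pd.DataFrame({
    'API': UniqueAPI[i],          # scalar, repeated for every month row
    'Production Month': months,
    'Days': DaysList,
    'Operator': OperatorList,
    # remaining columns follow the same pattern
})

Either way, the column-by-column assignment disappears and the frame is created in one step.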