How to parallelize for loops in PySpark?

I am trying to convert some Pandas code to PySpark, which will run on an EMR cluster. This is my first time working with PySpark, and I am not sure of the optimal way to structure the job. The job is trying to achieve the following:
There is a base dataframe with schema like so:
institution_id, user_id, st_date
For every unique institution_id, get all users
For every user for the institution_id, take all unique st_dates in sorted order, get the difference between pairs of consecutive st_dates and output a dictionary
Here is what the code looks like as of now:
from collections import defaultdict

import numpy as np
import pandas as pd

def process_user(current_user, inst_cycles):
    current_user_dates = np.sort(current_user.st_date.unique())
    if current_user_dates.size > 1:
        prev_date = pd.to_datetime(current_user_dates[0]).date()
        for current_datetime in current_user_dates[1:]:
            current_date = pd.to_datetime(current_datetime).date()
            month = current_date.month
            delta = current_date - prev_date
            cycle_days = delta.days
            inst_cycles[month][cycle_days] += 1
            prev_date = current_date
    return inst_cycles

def get_inst_monthly_distribution(current_inst):
    inst_cycles = defaultdict(lambda: defaultdict(int))
    inst_user_ids = current_inst.select('user_id').distinct().collect()
    for _, user_id in enumerate(inst_user_ids):
        user_id_str = user_id[0]
        current_user = current_inst.filter(current_inst.user_id == user_id_str)
        inst_cycles = process_user(current_user, inst_cycles)
    return inst_cycles

def get_monthly_distributions(inst_ids, df):
    cycles = {}
    for _, inst_id_str in enumerate(inst_ids.keys()):
        current_inst = df.filter(df.inst_id == inst_id_str)
        cycles[inst_id_str] = get_inst_monthly_distribution(current_inst)
    return cycles

def execute():
    df = load_data()  # df is a Spark dataframe
    inst_names = get_inst_names(df)
    monthly_distributions = get_monthly_distributions(inst_names, df)
I think this code is not taking advantage of the parallelism of Spark, and can be coded in a much better way without the for loops. Is that correct?
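For illustration only (not a tested rewrite of the job above): the same per-user computation can be pushed into Spark itself with a window partitioned by institution_id and user_id, which avoids the driver-side filter/collect loops entirely. A hedged sketch, assuming df is the base DataFrame with the schema described above (institution_id, user_id, st_date):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("institution_id", "user_id").orderBy("st_date")

cycles = (
    df.select("institution_id", "user_id", "st_date").distinct()
      .withColumn("prev_date", F.lag("st_date").over(w))             # previous st_date for this user
      .withColumn("cycle_days", F.datediff("st_date", "prev_date"))  # gap in days between consecutive dates
      .withColumn("month", F.month("st_date"))
      .where(F.col("cycle_days").isNotNull())
      .groupBy("institution_id", "month", "cycle_days")
      .count()                                                       # same tallies as inst_cycles
)

Collecting or writing out this small aggregated result then yields the per-institution month/cycle_days distribution while letting Spark parallelize the heavy work.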

Related

how to iterate with groupby in pandas

I have a function minmax that basically iterates over a dataframe of transactions. I want to produce a set of calculations including the id, so accountstart and accountend are the two fields calculated. The intention is to do these calculations by month and account.
So when I do:
df1 = df.loc[df['accountNo']==10]
minmax(df1)
it works.
What I can't do is:
df.groupby('accountNo').apply(minmax)
When I do:
grouped = df.groupby('accountNo')
for i,j in grouped:
print(minmax(j))
It does the computation and prints the result, but without the print it complains about KeyError: -1 from itertools. So awkward.
How to tackle that in Pandas?
def minmax(x):
    dfminmax = {}
    accno = set(x['accountNo'])
    accno = repr(accno)
    kgroup = x.groupby('monthStart')['cumsum'].sum()
    maxt = x['startbalance'].max()
    kgroup = pd.DataFrame(kgroup)
    kgroup['startbalance'] = 0
    kgroup['startbalance'][0] = maxt
    kgroup['endbalance'] = 0
    kgroup['accountNo'] = accno
    kgroup['accountNo'] = kgroup['accountNo'].str.strip('{}.0')
    kgroup.reset_index(inplace=True)
    for idx, row in kgroup.iterrows():
        if kgroup.loc[idx, 'startbalance'] == 0:
            kgroup.loc[idx, 'startbalance'] = kgroup.loc[idx-1, 'endbalance']
        if kgroup.loc[idx, 'endbalance'] == 0:
            kgroup.loc[idx, 'endbalance'] = kgroup.loc[idx, 'cumsum'] + kgroup.loc[idx, 'startbalance']
    dfminmax['monthStart'].append(kgroup['monthStart'])
    dfminmax['startbalance'].append(kgroup['startbalance'])
    dfminmax['endbalance'].append(kgroup['endbalance'])
    dfminmax['accountNo'].append(kgroup['accountNo'])
    return dfminmax
.apply() takes pandas Series as inputs, not DataFrames. Using .agg, as in df.groupby('accountNo').agg(yourfunction) should yield better results. Be sure to check out the documentation for details on implementation.
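As a rough illustration of the .agg suggestion (column names are taken from the question; the specific aggregations are only assumptions, not the poster's intended logic):

import pandas as pd

# Hedged sketch: one named aggregation per output column, grouped by accountNo.
summary = df.groupby('accountNo').agg(
    startbalance=('startbalance', 'max'),   # e.g. largest start balance per account
    total_cumsum=('cumsum', 'sum'),         # e.g. summed cumsum per account
)

Named aggregation like this requires pandas 0.25 or newer.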

Aggregating financial candlestick data

I have 1-minute OHLCV candlestick data, and I need to aggregate it to create 15-minute candlesticks. The data comes from MongoDB; this is a version in plain Python:
def get_candela(self, tf):
    c = list()
    candel = dict()
    candele_finale = list()
    prov_c = list()
    db = database("price_data", "1min_OHLC_XBTUSD")
    col = database.get_collection(db, "1min_OHLC_XBTUSD")
    db_candela = col.find({}, sort=[('timestamp', pymongo.ASCENDING)]).limit(20)
    candele = list(db_candela)
    timestamp_calc = list()
    open_calc = list()
    max_calc = list()
    min_calc = list()
    close_calc = list()
    vol_calc = list()
    #for _ in range(len(candele)):
    for a in range(tf):
        if len(candele) == 0:
            break
        prov_c.append(candele[a])
    c.append(prov_c)
    candele[:tf] = []
    for b in range(len(c)):
        cndl = c[b]
        for d in range(tf):
            print(cndl)
            cnd = cndl[d]
            #print(len(cnd))
            timestamp_calc.append(cnd["timestamp"])
            open_calc.append(cnd["open"])
            max_calc.append(cnd["high"])
            min_calc.append(cnd["low"])
            close_calc.append(cnd["close"])
            vol_calc.append(cnd["volume"])
            index_close = len(close_calc)
        candel["timestamp"] = timestamp_calc[d]
        candel["open"] = open_calc[0]
        candel["high"] = max(max_calc)
        candel["low"] = min(min_calc)
        candel["close"] = close_calc[index_close-1]
        candel["volume"] = sum(vol_calc)
        #print(candel)
        candele_finale.append(candel)
        max_calc.clear()
        min_calc.clear()
        vol_calc.clear()
    return candele_finale
This returns an array with only the last candlestick created.
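A hedged guess at why only the last candle survives: candel is a single dict that gets mutated on every pass through the outer loop and appended each time, so every element of candele_finale ends up pointing at the same object. Appending a copy instead keeps each candle independent:

# Sketch only: append a copy so later iterations don't overwrite earlier candles.
candele_finale.append(dict(candel))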
And this is another version in pandas:
db = database("price_data", "1min_OHLC_XBTUSD")
col = database.get_collection(db, "1min_OHLC_XBTUSD")
db_candela = col.find({}, sort=[('timestamp', pymongo.ASCENDING)]).limit(20)
prov_c = list()
for item in db_candela:
    cc = {"timestamp": item["timestamp"], "open": item["open"], "high": item["high"],
          "low": item["low"], "close": item["close"], "volume": item["volume"]}
    prov_c.append(cc)
print(prov_c)
data = pandas.DataFrame([prov_c], index=[pandas.to_datetime(cc["timestamp"])])
#print(data)
df = data.resample('5T').agg({'timestamp': 'first', 'open': 'first', 'high': 'max',
                              'low': 'min', 'close': 'last', 'volume': 'sum'})
#print(data.mean())
#with pandas.option_context('display.max_rows', None, 'display.max_columns', None):
pprint(df)
This returns a dataframe with weird/random values.
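A likely culprit, judging only from the snippet: pandas.DataFrame([prov_c], index=[...]) wraps the list of dicts in another list, giving a single-row frame (each cell holding a whole dict) with a one-element index, so the resample has nothing sensible to aggregate. A hedged sketch of the intended construction:

# Sketch only: build the frame from the list of dicts and index it by the parsed timestamps.
data = pandas.DataFrame(prov_c)
data.index = pandas.to_datetime(data['timestamp'])
df = data.resample('15T').agg({'open': 'first', 'high': 'max',
                               'low': 'min', 'close': 'last', 'volume': 'sum'})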
I had the same question today, and I answered it. Basically there is a pandas function called resample that does all of the work for you.
Here is my code:
import json
import pandas as pd
#load the raw data and clean it up
data_json = open('./testapiresults.json') #load json object
data_dict = json.load(data_json) #convert to a dict object
df = pd.DataFrame.from_dict(data_dict) #convert to panel data's dataframe object
df['datetime'] = pd.to_datetime(df['datetime'],unit='ms') #convert from unix time (ms from epoch) to human time
df = df.set_index(pd.DatetimeIndex(df['datetime'])) #set time series as index format for resample function
#resample to aggregate to a different frequency of data
candle_summary = pd.DataFrame()
candle_summary['open'] = df.open.resample('15Min').first()
candle_summary['high'] = df.high.resample('15Min').max()
candle_summary['low'] = df.low.resample('15Min').min()
candle_summary['close'] = df.close.resample('15Min').last()
candle_summary.head()
I had to export to CSV and recalculate it in Excel to double-check that it was calculating correctly, but this works. I have not figured out the pandas.DataFrame.resample('30min').ohlc() function, but that looks like it could make some really elegant code if I could figure out how to make it work without tons of errors.
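One hedged guess about those errors: .ohlc() builds open/high/low/close from a single price series, so it is most naturally applied to one column (applied to a whole DataFrame it returns MultiIndex columns, one OHLC set per input column). A sketch using the same df as above:

# Sketch only: 15-minute candles derived from the 1-minute close series, plus summed volume.
ohlc = df['close'].resample('15Min').ohlc()
ohlc['volume'] = df['volume'].resample('15Min').sum()

Note this derives high/low from the 1-minute closes, which can differ slightly from taking the max/min of the original high/low columns as in the code above.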

Pandas optimization for multiple records

I have a file with around 500K records.
Each record needs to be validated.
Records are de-duplicated and stored in a list:
with open(filename) as f:
    records = f.readlines()
The validation file I use is stored in a pandas DataFrame. This DataFrame contains around 80K records and 9 columns (myfile.csv).
filename = 'myfile.csv'
df = pd.read_csv(filename)
def check(df, destination):
    try:
        area_code = destination[:3]
        office_code = destination[3:6]
        subscriber_number = destination[6:]
        if any(df['AREA_CODE'].astype(int) == area_code):
            area_code_numbers = df[df['AREA_CODE'] == area_code]
            if any(area_code_numbers['OFFICE_CODE'].astype(int) == office_code):
                matching_records = area_code_numbers[area_code_numbers['OFFICE_CODE'].astype(int) == office_code]
                start = subscriber_number >= matching_records['SUBSCRIBER_START']
                end = subscriber_number <= matching_records['SUBSCRIBER_END']
                # Perform intersection
                record_found = matching_records[start & end]['LABEL'].to_string(index=False)
                # We should return only 1 value
                if len(record_found) > 0:
                    return record_found
                else:
                    return 'INVALID_SUBSCRIBER'
            else:
                return 'INVALID_OFFICE_CODE'
        else:
            return 'INVALID_AREA_CODE'
    except KeyError:
        pass
    except Exception:
        pass
I'm looking for a way to improve the comparisons, because when I run it, it just hangs. If I run it with a small subset (10K) it works fine.
I am not sure if there is a more efficient notation or recommended approach.
for record in records:
    check(df, record)
I am using macOS with 8 GB RAM and a 2.3 GHz Intel Core i7.
Running cProfile.run on the check function alone shows:
4253 function calls (4199 primitive calls) in 0.017 seconds.
Hence I assume 500K records will take around 2.5 hours.
While no data is available, consider this untested approach: a couple of left-join merges of both data pieces, followed by the validation steps. This avoids any looping and runs the conditional logic across whole columns:
import pandas as pd
import numpy as np

with open('RecordsValidate.txt') as f:
    records = f.readlines()
print(records)

rdf = pd.DataFrame({'rcd_id': list(range(1, len(records)+1)),
                    'rcd_area_code': [int(rcd[:3]) for rcd in records],
                    'rcd_office_code': [int(rcd[3:6]) for rcd in records],
                    'rcd_subscriber_number': [rcd[6:] for rcd in records]})

filename = 'myfile.csv'
df = pd.read_csv(filename)

# VALIDATE AREA CODE
mrgdf = pd.merge(df, rdf, how='left', left_on=['AREA_CODE'], right_on=['rcd_area_code'])
mrgdf['RETURN'] = np.where(pd.isnull(mrgdf['rcd_id']), 'INVALID_AREA_CODE', np.nan)
mrgdf.drop([c for c in rdf.columns], inplace=True, axis=1)

# VALIDATE OFFICE CODE
mrgdf = pd.merge(mrgdf, rdf, how='left', left_on=['AREA_CODE', 'OFFICE_CODE'],
                 right_on=['rcd_area_code', 'rcd_office_code'])
mrgdf['RETURN'] = np.where(pd.isnull(mrgdf['rcd_id']), 'INVALID_OFFICE_CODE', mrgdf['RETURN'])

# VALIDATE SUBSCRIBER
mrgdf['RETURN'] = np.where((mrgdf['rcd_subscriber_number'] < mrgdf['SUBSCRIBER_START']) |
                           (mrgdf['rcd_subscriber_number'] > mrgdf['SUBSCRIBER_END']) |
                           (mrgdf['LABEL'].str.len() == 0),
                           'INVALID_SUBSCRIBER', mrgdf['RETURN'])
mrgdf.drop([c for c in rdf.columns], inplace=True, axis=1)

Pandas apply a change which affects 2 columns at the same time

I have the dataframe below. bet_pl and co_pl keep track of the daily changes in the 2 balances. I have updated co_balance based on co_pl and the cumsum.
init_balance = D('100.0')
co_thresh = D('1.05') * init_balance

def get_pl_co(row):
    if row['eod_bet_balance'] > co_thresh:
        diff = row['eod_bet_balance'] - co_thresh
        return diff
    else:
        return Decimal('0.0')

df_odds_winloss['eod_bet_balance'] = df_odds_winloss['bet_pl'].cumsum() + initial_balance
df_odds_winloss['sod_bet_balance'] = df_odds_winloss['eod_bet_balance'].shift(1).fillna(init_balance)
df_odds_winloss['co_pl'] = df_odds_winloss.apply(get_pl_co, axis=1)
df_odds_winloss['co_balance'] = df_odds_winloss['co_pl'].cumsum()
# trying this
df_odds_winloss['eod_bet_balance'] = df_odds_winloss['eod_bet_balance'] - df_odds_winloss['co_pl']
Now I want eod_bet_balance to be updated with negative co_pl, as it is a transfer between the 2 balances, but I am not getting the right eod (end of day) balances.
Can anyone give a hint?
UPDATED: The eod_balances reflect the change in bet_pl but not the subsequent change in co_pl.
FINAL UPDATE:
initial_balance = D('100.0')
df = pd.DataFrame({ 'SP': res_df['SP'], 'winloss': bin_seq_l}, columns=['SP', 'winloss'])
df['bet_pl'] = df.apply(get_pl_lvl, axis=1)
df['interim_balance'] = df_odds_winloss['bet_pl'].cumsum()+initial_balance
df['co_pl'] = (df['interim_balance'] - co_thresh).clip_lower(0)
df['co_balance'] = df_odds_winloss['co_pl'].cumsum()
df['post_co_balance'] = df['interim_balance'] - df['co_pl']
bf_r = D('0.05')
df['post_co_deduct_balance'] = df['post_co_balance'] - (df['post_co_balance']* bf_r)
df['sod_bet_balance'] = df['post_co_deduct_balance'].shift(1).fillna(init_balance)
First, you don't need to apply a custom function to get co_pl; it can be done like so:
df['co_pl'] = (df['eod_bet_balance'] - co_thresh).clip_lower(0)
As for updating the other column, if I understand correctly you want something like this:
df['eod_bet_balance'] = df['eod_bet_balance'].clip_upper(co_thresh)
or, equivalently...
df['eod_bet_balance'] -= df['co_pl']
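One caveat on the snippets above: clip_lower and clip_upper were deprecated and later removed from pandas (in 1.0), so on current versions the same idea reads:

# Equivalent using clip(); clip_lower/clip_upper no longer exist in recent pandas.
df['co_pl'] = (df['eod_bet_balance'] - co_thresh).clip(lower=0)
df['eod_bet_balance'] = df['eod_bet_balance'].clip(upper=co_thresh)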

Slow Data analysis using pandas

I am using a mixture of lists and pandas dataframes to accomplish a clean and merge of CSV data. The following is a snippet from my code that runs disgustingly slow... it generates a CSV with about 3MM lines of data.
UniqueAPI = Uniquify(API)
dummydata = []
#bridge the gaps in the data with zeros
for i in range(0, len(UniqueAPI)):
    DateList = []
    DaysList = []
    PDaysList = []
    OperatorList = []
    OGOnumList = []
    CountyList = []
    MunicipalityList = []
    LatitudeList = []
    LongitudeList = []
    UnconventionalList = []
    ConfigurationList = []
    HomeUseList = []
    ReportingPeriodList = []
    RecordSourceList = []
    for j in range(0, len(API)):
        if UniqueAPI[i] == API[j]:
            #print(str(ProdDate[j]))
            DateList.append(ProdDate[j])
            DaysList = Days[j]
            OperatorList = Operator[j]
            OGOnumList = OGOnum[j]
            CountyList = County[j]
            MunicipalityList = Municipality[j]
            LatitudeList = Latitude[j]
            LongitudeList = Longitude[j]
            UnconventionalList = Unconventional[j]
            ConfigurationList = Configuration[j]
            HomeUseList = HomeUse[j]
            ReportingPeriodList = ReportingPeriod[j]
            RecordSourceList = RecordSource[j]
    df = pd.DataFrame(DateList, columns=['Date'])
    df['Date'] = pd.to_datetime(df['Date'])
    minDate = df.min()
    maxDate = df.max()
    Years = int((maxDate - minDate)/np.timedelta64(1, 'Y'))
    Months = int(round((maxDate - minDate)/np.timedelta64(1, 'M')))
    finalMonths = Months - Years*12 + 1
    Y, x = str(minDate).split("-", 1)
    x, Y = str(Y).split(" ", 1)
    for k in range(0, Years + 1):
        if k == Years:
            ender = int(finalMonths + 1)
        else:
            ender = int(13)
        full_df = pd.DataFrame()
        if k > 0:
            del full_df
            full_df = pd.DataFrame()
        full_df['API'] = UniqueAPI[i]
        full_df['Production Month'] = [pd.to_datetime(str(x)+'/1/'+str(int(Y)+k)) for x in range(1, ender)]
        full_df['Days'] = DaysList
        full_df['Operator'] = OperatorList
        full_df['OGO_NUM'] = OGOnumList
        full_df['County'] = CountyList
        full_df['Municipality'] = MunicipalityList
        full_df['Latitude'] = LatitudeList
        full_df['Longitude'] = LongitudeList
        full_df['Unconventional'] = UnconventionalList
        full_df['Well_Configuration'] = ConfigurationList
        full_df['Home_Use'] = HomeUseList
        full_df['Reporting_Period'] = ReportingPeriodList
        full_df['Record_Source'] = RecordSourceList
        dummydata.append(full_df)
full_df = pd.concat(dummydata)
result = full_df.merge(dataClean, how='left').fillna(0)
print(result[:100])
result.to_csv(ResultPath, index_label=False, index=False)
This snippet of code has been running for hours; the output should have ~3MM lines. There has to be a faster way using pandas to accomplish the goal, which I will describe:
For each unique API, I find all occurrences in the main list of APIs.
Using that information, I build a list of dates.
I find a min and max date for each list corresponding to an API.
I then build an empty pandas DataFrame that has every month between the two dates for the associated API.
I then append this data frame to a list "dummydata" and loop to the next API.
Taking this dummydata list, I then concatenate it into a DataFrame.
This DataFrame is then merged with another dataframe with cleaned data.
The end result is a CSV that has a 0 value for dates that did not exist but should, between the max and min dates for each corresponding API in the original unclean list.
This all takes way longer than I would expect. I would have thought that finding the min/max date for each unique item and interpolating monthly between them, filling months that don't have data with 0, would be like a three-line thing in pandas. Any options you think I should explore, or any snippets of code that could help me out, would be much appreciated!
You could start by cleaning up the code a bit. These lines don't seem to have any effect or functional purpose since full_df was just created and is already an empty dataframe:
if k > 0:
    del full_df
    full_df = pd.DataFrame()
Then when you actually build up your full_df it's better to do it all at once rather than one column at a time. So try something like this:
full_df = pd.concat([UniqueAPI[i],
                     [pd.to_datetime(str(x)+'/1/'+str(int(Y)+k)) for x in range(1, ender)],
                     DaysList,
                     etc...
                     ],
                    axis=1)
Then you would need to add the column labels which you could also do all at once (in the same order as your lists in the concat() call).
full_df.columns = ['API', 'Production Month', 'Days', etc.]
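Separately, the "find the min/max date per API and fill monthly" step the question describes can be expressed directly with a groupby/resample. A hedged sketch with illustrative names only (a tidy frame raw with an 'API' column, a datetime 'Date' column, and a numeric 'Value' column), not the poster's actual variables:

import pandas as pd

# Sketch only: one row per calendar month between each API's first and last date;
# months with no data come out as 0 because sum() over an empty bin is 0.
monthly = (raw.set_index('Date')
              .groupby('API')['Value']
              .resample('MS')
              .sum()
              .reset_index())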
