I am trying to run a simple MapReduce program on HDInsight through Azure. My program is written in Python and simply counts how many rows of numbers (time series) meet certain criteria. The final result is just a count for each category. My code is shown below.
from mrjob.job import MRJob
import numpy as np
import time
class MRTimeSeriesFrequencyCount(MRJob):

    def mapper(self, _, line):
        series = [float(i) for i in line.split(',')]
        diff = list(np.diff(series))
        avg = sum(diff) / len(diff)
        std = np.std(diff)
        fit = np.polyfit(list(range(len(series))), series, deg=1)
        yield "Down", 1 if (series[len(series)-1]-series[0]) < 0 else 0
        yield "Up", 1 if (series[len(series)-1]-series[0]) > 0 else 0
        yield "Reverse", 1 if (fit[0]*(series[len(series)-1]-series[0])) < 0 else 0
        yield "Volatile", 1 if std/avg > 0.33 else 0

    def reducer(self, key, values):
        yield key, sum(values)

if __name__ == '__main__':
    start_time = time.time()
    MRTimeSeriesFrequencyCount.run()
    print("--- %s seconds ---" % (time.time() - start_time))
I am new to MapReduce and Hadoop. When I scale up the number of rows, which are stored in a CSV file, my laptop (an HP EliteBook 8570w) still performs faster than running the code on Hadoop (456 seconds vs 628.29 seconds for 1 million rows). The cluster has 4 worker nodes with 4 cores each and 2 head nodes with 4 cores each. Shouldn't it perform faster? Is there some other bottleneck, such as reading in the data? Is mrjob running it on only one node? Thanks in advance for the help.
As far as I know, Hadoop needs some time to prepare the startup of an M/R job and the data on HDFS, so you can't get better performance for a small dataset on a Hadoop cluster than on a single local machine.
You have 1 million rows of data. Assuming each row is about 1 KB, the 1 million rows are roughly 1 GB. That is a small dataset for Hadoop, so the time saved by distributing the work is not enough to make up for the startup latency before the job actually runs on Hadoop.
For reference, there is an SO thread (Why submitting job to mapreduce takes so much time in General?) whose accepted answer explains the startup latency behind your issue.
Related
I'm looking for advice on optimizing this further; I'm working with data that has about 150,000 rows.
The basic operating principle is that I use devSep to hold the indexes where the data switches from one device to another. I then loop through those indexes and smooth each device's slice of the data using the loess function call.
I've tried using threads to speed this up, but the results are less than impressive:
Threads:
1 -> 1:44
2 -> 1:24
4 -> 1:25
8 -> 1:43
Any advice would be much appreciated.
import pandas as pd
from loess.loess_1d import loess_1d
import numpy as np
import threading as th
import multiprocessing as mp
import math
#import time
#import datetime
#Global Variables
devSep=[]
#Convert to Allow the Loess to work
colLen=len(EFF)
smoothEFF=[None]*colLen
I_VISO=np.array(I_VISO)
EFF=np.array(EFF)
def threadJob(tID, partSZ):
    beg = partSZ*tID
    end = partSZ*(tID+1)
    if (end > len(devSep)):
        prev = devSep[beg]
        for i in devSep[beg+1:]:
            smoothEFF[prev:i] = loess_1d(I_VISO[prev:i], EFF[prev:i])[1]
            prev = i
    else:
        prev = devSep[beg]
        for i in devSep[beg+1:end+1]:
            smoothEFF[prev:i] = loess_1d(I_VISO[prev:i], EFF[prev:i])[1]
            prev = i
    return 0
#Find Separations of data with different devices.
#search through DEV column and find indexes
#where the value changes
prev = DEV.iat[0]
for index, value in DEV.items():
    if (value != prev):
        devSep.append(index)
        prev = value
devSep.append(index)
#start_time=time.time()
#Only Spawn Threads if there are a considerable amount of entries
if (1):
    #Spawn Threads
    numProcess = 2
    tList = []
    partSZ = math.ceil(colLen/numProcess)
    for tID in range(0, numProcess):
        tList.append(th.Thread(target=threadJob, args=(tID, partSZ)))
        tList[tID].start()
    #Join Threads once Complete
    for tID in tList:
        tID.join()
else:
    prev = -1
    for i in devSep:
        smoothEFF[prev+1:i] = loess_1d(I_VISO[prev+1:i], EFF[prev+1:i])[1]
        prev = i
#end_time=(time.time() - start_time)
#exit("--- %s ---" % str(datetime.timedelta(seconds=end_time)))
I'm a beginner at Hadoop and Linux.
The Problem
The Hadoop reduce phase gets stuck (or moves really, really slowly) when the input data is large (e.g. 600k rows or 6M rows), even though the Map and Reduce functions are quite simple: 2021-08-08 22:53:12,350 INFO mapreduce.Job: map 100% reduce 67%.
In the Linux System Monitor I can see that when reduce hits 67%, only one CPU keeps running at 100% while the rest of them are sleeping (see this picture).
What ran successfully
I ran the MapReduce job with small input data (600 rows) quickly and successfully, without any issue: 2021-08-08 19:44:13,350 INFO mapreduce.Job: map 100% reduce 100%.
Mapper (Python)
#!/usr/bin/env python3
import sys
from itertools import islice
from operator import itemgetter
def read_input(file):
    # read file except first line
    for line in islice(file, 1, None):
        # split the line into words
        yield line.split(',')

def main(separator='\t'):
    # input comes from STDIN (standard input)
    data = read_input(sys.stdin)
    for words in data:
        # for each row we take only the needed columns
        data_row = list(itemgetter(*[1, 2, 4, 5, 6, 9, 10, 18])(words))
        data_row[7] = data_row[7].replace('\n', '')
        # taking year and month No. from first column to create the
        # key that will send to reducer
        date = data_row[0].split(' ')[0].split('-')
        key = str(date[0]) + '_' + str(date[1])
        # value that will send to reducer
        value = ','.join(data_row)
        # print here will send the output pair (key, value)
        print('%s%s%s' % (key, separator, value))

if __name__ == "__main__":
    main()
Reducer (Python)
#!/usr/bin/env python3
from itertools import groupby
from operator import itemgetter
import sys
import pandas as pd
import numpy as np
import time
def read_mapper_output(file):
    for line in file:
        yield line

def main(separator='\t'):
    all_rows_2015 = []
    all_rows_2016 = []
    start_time = time.time()
    names = ['tpep_pickup_datetime', 'tpep_dropoff_datetime', 'trip_distance',
             'pickup_longitude', 'pickup_latitude', 'dropoff_longitude',
             'dropoff_latitude', 'total_amount']
    df = pd.DataFrame(columns=names)
    # input comes from STDIN (standard input)
    data = read_mapper_output(sys.stdin)
    for words in data:
        # get key & value from Mapper
        key, value = words.split(separator)
        row = value.split(',')
        # split data with belong to 2015 from data belong to 2016
        if key in '2015_01 2015_02 2015_03':
            all_rows_2015.append(row)
            if len(all_rows_2015) >= 10:
                df = df.append(pd.DataFrame(all_rows_2015, columns=names))
                all_rows_2015 = []
        elif key in '2016_01 2016_02 2016_03':
            all_rows_2016.append(row)
            if len(all_rows_2016) >= 10:
                df = df.append(pd.DataFrame(all_rows_2016, columns=names))
                all_rows_2016 = []
    print(df.to_string())
    print("--- %s seconds ---" % (time.time() - start_time))

if __name__ == "__main__":
    main()
More Info
I'm using Hadoop v3.2.1 on Linux (installed on VMware) to run the MapReduce job in Python.
Reduce Job in Numbers:
Input Data Size | Number of rows  | Reduce job time     | Result
~98 KB          | 600 rows        | ~0.1 sec            | good
~953 KB         | 6,000 rows      | ~1 sec              | good
~9.5 MB         | 60,000 rows     | ~52 sec             | good
~94 MB          | 600,000 rows    | ~5647 sec (~94 min) | very slow
~11 GB          | 76,000,000 rows | ??                  | impossible
The goal is to run on ~76M rows of input data, which is impossible while this issue remains.
"when reduce hit the 67% only one CPU keep running at the time at 100% and the rest of them are sleeping" - you have skew. One key has far more values than any other key.
I see some problems here.
In the reduce phase you don't do any summarization, you just filter Q1 of 2015 and Q1 of 2016. Reduce is supposed to be used for summarization, like grouping by key or doing some calculations based on the keys.
If you just need to filter data, do it in the map phase to save cycles (assuming you're billed for all data processed).
You store a lot of stuff in RAM inside a dataframe. Since you don't know how big each key is, you are experiencing thrashing. Combined with heavy keys, this will make your process page-fault on every DataFrame.append after some time.
There are some fixes:
Do you really need a reduce phase? Since you are just filtering the first three months of 2015 and 2016, you can do this in the map phase. This will also make things a bit faster if you need a reduce later, since less data will reach the reduce phase.
def main(separator='\t'):
    # input comes from STDIN (standard input)
    data = read_input(sys.stdin)
    for words in data:
        # for each row we take only the needed columns
        data_row = list(itemgetter(*[1, 2, 4, 5, 6, 9, 10, 18])(words))
        # Find out first if you are keeping this data
        # taking year and month No. from first column to create the
        # key that will be sent to the reducer
        date = data_row[0].split(' ')[0].split('-')
        # Filter out everything except Q1 of 2015 and 2016
        # (the split pieces are strings, so compare against strings)
        if (date[1] in ['01', '02', '03']) and (date[0] in ['2015', '2016']):
            # We keep this data. Calculate the key and clean up data_row[7]
            key = str(date[0]) + '_' + str(date[1])
            data_row[7] = data_row[7].replace('\n', '')
            # value that will be sent to the reducer
            value = ','.join(data_row)
            # print here will send the output pair (key, value)
            print('%s%s%s' % (key, separator, value))
Try not to store data in memory during the reduce. Since you are only filtering, print() each result as soon as you have it. If your source data is not sorted, the reduce will still serve as a way to get all data from the same month together.
You've also got a bug in your reduce phase: you're losing the last (number_of_records_per_key modulo 10) rows for each key, because the leftover batches smaller than 10 are never appended to the dataframe. Don't append to the dataframe at all; print each result as soon as possible, as in the sketch below.
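A minimal sketch of what such a streaming reducer could look like, assuming the same tab-separated key/value lines your mapper emits (the set of wanted keys simply mirrors the ones in your original reducer):
#!/usr/bin/env python3
# Sketch: stream filtered rows straight to stdout instead of buffering them
# in a pandas DataFrame, so memory use stays flat regardless of input size.
import sys

WANTED_KEYS = {'2015_01', '2015_02', '2015_03',
               '2016_01', '2016_02', '2016_03'}

def main(separator='\t'):
    for line in sys.stdin:
        # each line is "key<TAB>comma,separated,values"
        key, _, value = line.rstrip('\n').partition(separator)
        if key in WANTED_KEYS:
            # print immediately; nothing is accumulated in RAM
            print('%s%s%s' % (key, separator, value))

if __name__ == '__main__':
    main()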
I have 600 CSV files, and each file contains around 1,500 rows of data. I have to run a function on every row of data. I have defined the function:
def query_prepare(data):
    """function goes here"""
    """here input data is list of single row of dataframe"""
The above function performs operations like strip() and replace() based on conditions. It takes every single row of data as a list, e.g.:
data = ['apple$*7','orange ','bananna','-'].
This is what my initial dataframe looks like:
           a        b         c  d
0   apple$*7   orange   bananna  -
1  apple()*7  flower]  *bananna  -
I checked the function on one row of data: processing takes around 0.04 s, so if I run it on one CSV file containing 1,500 rows of data it takes almost 1500*0.04 s. I have tried some of these methods:
# normal in built apply function
t = time.time()
a = df.apply(lambda x: query_prepare(x.to_list()),axis=1)
print('time taken',time.time()-t)
# time taken 52.519816637039185
# with swifter
t = time.time()
a = df.swifter.allow_dask_on_strings().apply(lambda x: query_prepare(x.to_list()),axis=1)
print('time taken',time.time()-t)
# time taken 160.31028127670288
# with pandarallel
pandarallel.initialize()
t = time.time()
a = df.parallel_apply(lambda x: query_prepare(x.to_list()),axis=1)
print('time taken',time.time()-t)
# time taken 55.000578
I have already done everything I can with my query_prepare function to reduce the time, so there is no way to change or modify it further. Any other suggestions?
P.S. By the way, I'm running it on Google Colab.
EDIT: If we have 1,500 rows of data, could we split it into 15 chunks and then apply the function to each chunk? Can we decrease the time by a factor of 15 if we do something like this? (I'm sorry, I'm not sure whether it's possible or not; please point me in a good direction.)
For example you could roughly do the following:
def sanitize_column(s: pd.Series):
    return s.str.strip().str.strip('1234567890()*[]')
then you could do:
df.apply(sanitize_column, axis=0)
with:
df = pd.DataFrame({'a': ['apple7', 'apple()*7'], 'b': [" asd ", ']asds89']})
this will give
       a     b
0  apple   asd
1  apple  asds
This should be faster than your solution. For proper benchmarking, we'd need your full solution.
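For a rough sense of the difference, here is a hedged timing sketch; the toy frame just repeats the two example rows above to roughly one CSV file's size (1,500 rows), and absolute timings will of course vary on your data and on Colab:
import time
import pandas as pd

def sanitize_column(s: pd.Series):
    return s.str.strip().str.strip('1234567890()*[]')

# Toy frame: the two example rows repeated to ~1,500 rows (one CSV file's worth).
df = pd.DataFrame({'a': ['apple7', 'apple()*7'] * 750,
                   'b': [' asd ', ']asds89'] * 750})

t = time.time()
cleaned = df.apply(sanitize_column, axis=0)   # column-wise, vectorized string ops
print('vectorized columns:', time.time() - t, 'seconds')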
I cannot speed up the total calculation time by adding more processes to multiprocessing. It takes just as long to run with 3 processes as with 7.
I've tried chunking the data so each process works on a much larger set of calculations. Same result.
I've initialized the static data in each process instead of passing it as an argument.
I've tried returning the DataFrame from pool.map vs writing it out to a file.
I've timed the EvalContract section. With 3 processes, 1 contract with 1 scenario takes 35 seconds to complete. With 7 processes, running the same contract and scenario takes 55 seconds.
import os
import time
import pandas as pd
import numpy as np
import multiprocessing as mp
import itertools as it
from functools import partial
def initializer(Cont, Scen, RandWD, dicS):
    global dfCleanCont
    global dfScen
    global dfWithdrawalRandom
    global dicSensit
    dfCleanCont = Cont
    dfScen = Scen
    dfWithdrawalRandom = RandWD
    dicSensit = dicS
def ValueProj(ContScen):
    Contract = dfCleanCont.loc[ContScen[0]]
    PTS = Contract.name
    ProjWDs = dfWithdrawalRandom[Contract['WD_ID']]
    dfScenOneSet = dfScen[dfScen["Trial"]==ContScen[1]]
    '''Do various projection calculations. All calculation in numpy arrays then converted to DataFrame before returning. Dataframe shape[601,35]'''
    return dfContProj
def ReserveProjectionPreprocess(Scen, dfBarclayRates, dicProjOuterSeries, liProjValContract):
    Timestep = liProjValContract[0]['Outer_t']
    dfInnerLoopScen = SetupInnerLoopScenarios(Timestep, Scen, dicSensit)
    BBC = BuildBarclayCurve(Timestep, Scen[Scen['Timestep']==Timestep][dicSensit['irCols']].iloc[0].to_list(), dfBarclayRates.loc[Timestep], dicSensit)
    '''Do various inner loop projection calculations, up to 601 timesteps. All calculation in numpy arrays.'''
    return pd.Series({'PTS': Contract.name,
                      'OuterScenNum': ContractValProjOne['OuterScenNum'],
                      'Outer_t': ContractValProjOne['Outer_t'],
                      'Reserve': max(PVL-ContractValProjOne['MV']-AssetHaircut, 0)})
def EvalContract(liCS):
    for CS in liCS:
        '''Evaluate single contract with single scenario'''
        start_time = time.time()
        dfOuterLoop = ValueProj(CS)
        Contract = dfCleanCont.loc[CS[0]]
        PTS = Contract.name
        dfScenOneSet = dfScen[dfScen["Trial"]==CS[1]]
        dfOuterLoopCut = dfOuterLoop[dfOuterLoop['BV']!=0][:-1]
        MinMVt = dicSensit['ProjectionYrs']*12 if sum(dfOuterLoop[(dfOuterLoop['MV']==0) & (dfOuterLoop['Outer_t']>0)]['Outer_t'])==0 else min(dfOuterLoop[(dfOuterLoop['MV']==0) & (dfOuterLoop['Outer_t']>0)]['Outer_t'])
        MinBVt = dicSensit['ProjectionYrs']*12 if sum(dfOuterLoop[(dfOuterLoop['BV']==0) & (dfOuterLoop['Outer_t']>0)]['Outer_t'])==0 else min(dfOuterLoop[(dfOuterLoop['BV']==0) & (dfOuterLoop['Outer_t']>0)]['Outer_t'])
        dicProjOuterSeries = {'Contract': Contract,
                              'BaseLapsePartContribution': dfOuterLoop['BaseLapsePartContribution'].values,
                              'BaseLapsePartNetTransfer': dfOuterLoop['BaseLapsePartNetTransfer'].values,
                              'BaseLapsePartWithdrawal': dfOuterLoop['BaseLapsePartWithdrawal'].values,
                              'PrudentEstDynPartWDPct': dfOuterLoop['PrudentEstDynPartWDPct'].values,
                              'KnownPutQueueWD': dfOuterLoop['KnownPutQueueWD'].values,
                              'BaseLapsePlanSponsor': dfOuterLoop['BaseLapsePlanSponsor'].values,
                              'PrudentEstDynPlanWDPct': dfOuterLoop['PrudentEstDynPlanWDPct'].values,
                              'MonthlyDefaultCharge': dfOuterLoop['MonthlyDefaultCharge'].values,
                              'Outer_t_Maturity': min(MinMVt, MinBVt)-1}
        liProjValContract = []
        for _, row in dfOuterLoopCut.iterrows():
            liProjValContract.append([row])
        func = partial(ReserveProjectionPreprocess, dfScenOneSet, dicProjOuterSeries)
        dfReserve = pd.concat(map(func, liProjValContract), axis=1, ignore_index=True).T
        dfOuterLoopwRes = pd.merge(dfOuterLoop, dfReserve, how='left', on=['PTS','OuterScenNum','Outer_t'])
        dfOuterLoopwRes['Reserve'].fillna(value=0, inplace=True)
        fname = 'OuterProjection_{0}_{1}.parquet'.format(PTS, CS[1])
        dfOuterLoopwRes.to_parquet(os.path.join(dicSensit['OutputDirOuterLoop'], fname), index=False)
    return 1
if __name__ == '__main__':
    dfCleanCont = 'DataFrame of 150 contract data. Each row is a contract with various info such as market value, interest rate, maturity date, etc. Identifier index is "PTS". Shape [150, 41]'
    dfScen = 'DataFrame of interest rate scenarios. 100 scenarios ("Trial"). Each scenario has 601 timesteps and 11 interest rate term points. Shape [60100, 13]'
    liContID = list(dfCleanCont.index)
    liScenID = dfScen["Trial"].unique()
    liCS = list(it.product(liContID, liScenID))
    pool = mp.Pool(7, initializer, (ns.dfCleanCont, ns.dfScen, dfWithdrawalRandom, dicSensit,))
    n = 10
    liCSGroup = [liCS[x:x+n] for x in range(0, len(liCS), n)]
    dfCombOuterProj = pd.concat(pool.map(func=EvalContract, iterable=liCSGroup))
    pool.close()
    pool.join()
I was expecting a significant speed gain with more processes. There's a bottleneck somewhere, but I can't seem to find it. I tried cProfile but am getting the same cumulative time with 3 or 7 processes.
I am using the postgres dedupe example code.
For 10,000 rows, it takes 163 seconds. I found that it spends most of the time in this part:
full_data = []
cluster_membership = collections.defaultdict(lambda: 'x')
for cluster_id, (cluster, score) in enumerate(clustered_dupes):
    for record_id in cluster:
        for row in data:
            if record_id == int(row[0]):
                row = list(row)
                row.insert(0, cluster_id)
                row = tuple(row)
                full_data.append(row)
Is there any possible optimization for this part that produces the same result with lower time complexity? Will this script work for 150 million records?