I'm a beginner at Hadoop and Linux.
The Problem
The Hadoop reduce phase gets stuck (or runs extremely slowly) when the input data is large (e.g. 600k rows or 6M rows), even though the Map and Reduce functions are quite simple: 2021-08-08 22:53:12,350 INFO mapreduce.Job: map 100% reduce 67%.
In the Linux System Monitor I can see that once reduce hits 67%, only one CPU at a time keeps running at 100% while the rest are idle.
What ran successfully
I ran the MapReduce job with small input data (600 rows) and it finished quickly without any issue: 2021-08-08 19:44:13,350 INFO mapreduce.Job: map 100% reduce 100%.
Mapper (Python)
#!/usr/bin/env python3
import sys
from itertools import islice
from operator import itemgetter

def read_input(file):
    # read file except first line
    for line in islice(file, 1, None):
        # split the line into words
        yield line.split(',')

def main(separator='\t'):
    # input comes from STDIN (standard input)
    data = read_input(sys.stdin)
    for words in data:
        # for each row we take only the needed columns
        data_row = list(itemgetter(*[1, 2, 4, 5, 6, 9, 10, 18])(words))
        data_row[7] = data_row[7].replace('\n', '')
        # take the year and month number from the first column to create
        # the key that will be sent to the reducer
        date = data_row[0].split(' ')[0].split('-')
        key = str(date[0]) + '_' + str(date[1])
        # value that will be sent to the reducer
        value = ','.join(data_row)
        # print here will send the output pair (key, value)
        print('%s%s%s' % (key, separator, value))

if __name__ == "__main__":
    main()
Reducer (Python)
#!/usr/bin/env python3
from itertools import groupby
from operator import itemgetter
import sys
import pandas as pd
import numpy as np
import time

def read_mapper_output(file):
    for line in file:
        yield line

def main(separator='\t'):
    all_rows_2015 = []
    all_rows_2016 = []
    start_time = time.time()
    names = ['tpep_pickup_datetime', 'tpep_dropoff_datetime', 'trip_distance',
             'pickup_longitude', 'pickup_latitude', 'dropoff_longitude',
             'dropoff_latitude', 'total_amount']
    df = pd.DataFrame(columns=names)
    # input comes from STDIN (standard input)
    data = read_mapper_output(sys.stdin)
    for words in data:
        # get key & value from Mapper
        key, value = words.split(separator)
        row = value.split(',')
        # separate the rows that belong to 2015 from those that belong to 2016
        if key in '2015_01 2015_02 2015_03':
            all_rows_2015.append(row)
            if len(all_rows_2015) >= 10:
                df=df.append(pd.DataFrame(all_rows_2015, columns=names))
                all_rows_2015 = []
        elif key in '2016_01 2016_02 2016_03':
            all_rows_2016.append(row)
            if len(all_rows_2016) >= 10:
                df=df.append(pd.DataFrame(all_rows_2016, columns=names))
                all_rows_2016 = []
    print(df.to_string())
    print("--- %s seconds ---" % (time.time() - start_time))

if __name__ == "__main__":
    main()
More Info
I'm using Hadoop v3.2.1 on Linux installed on VMware to run MapReduce job in Python.
Reduce job in numbers:

Input data size   Number of rows    Reduce job time        Result
~98 KB            600 rows          ~0.1 sec               good
~953 KB           6,000 rows        ~1 sec                 good
~9.5 MB           60,000 rows       ~52 sec                good
~94 MB            600,000 rows      ~5647 sec (~94 min)    very slow
~11 GB            76,000,000 rows   ??                     impossible
The goal is to run on ~76M rows of input data, which is impossible while this issue remains.
"when reduce hit the 67% only one CPU keep running at the time at 100% and the rest of them are sleeping" - you have skew. One key has far more values than any other key.
I see some problems here.
In the reduce phase you don't do any summarization, you just filter Q1 2015 and Q1 2016. Reduce is meant for summarization, such as grouping by key or doing calculations based on the keys.
If you just need to filter data, do it in the map phase to save cycles (assume you're billed for all data).
You store a lot of data in RAM inside a dataframe. Since you don't know how big each key is, you are experiencing thrashing. Combined with heavy keys, this will eventually make your process page-fault on every DataFrame.append.
There are some fixes:
Do you really need a reduce phase? Since you are just keeping the first three months of 2015 and 2016, you can do this in the map phase. Even if you need a reduce later, this makes the process a bit faster, since less data reaches the reduce phase.
def main(separator='\t'):
    # input comes from STDIN (standard input)
    data = read_input(sys.stdin)
    for words in data:
        # for each row we take only the needed columns
        data_row = list(itemgetter(*[1, 2, 4, 5, 6, 9, 10, 18])(words))
        # Decide first whether this row is kept:
        # take the year and month number from the first column
        date = data_row[0].split(' ')[0].split('-')
        # Filter: the date parts are strings such as '2015' and '01',
        # so compare against strings, not integers
        if date[1] in ('01', '02', '03') and date[0] in ('2015', '2016'):
            # We keep this row. Calculate the key and clean up data_row[7]
            key = str(date[0]) + '_' + str(date[1])
            data_row[7] = data_row[7].replace('\n', '')
            # value that will be sent to the reducer
            value = ','.join(data_row)
            # print here will send the output pair (key, value)
            print('%s%s%s' % (key, separator, value))
Try not to store data in memory during the reduce. Since you are only filtering, print() each result as soon as you have it. If your source data is not sorted, the reduce will still serve as a way to bring all data from the same month together.
You've got a bug in your reduce phase: you lose up to 9 trailing rows from each buffer (the row count modulo 10), because the leftovers are never appended to the dataframe. Don't append to the dataframe at all; print each result as soon as possible.
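The advice above (stream the rows instead of collecting them in a dataframe) can be sketched as follows. The timings in the question fit this diagnosis: going from 60,000 to 600,000 rows costs about 100 times more time, which is what you would expect if every DataFrame.append copies the whole frame again. This is only a sketch under assumptions: the names stream_reduce and WANTED_KEYS are made up for illustration, the demo lines are invented, and the key set simply mirrors the months the original reducer tests for.

```python
import io

# Keys the original reducer keeps (first three months of 2015 and 2016).
WANTED_KEYS = {"2015_01", "2015_02", "2015_03",
               "2016_01", "2016_02", "2016_03"}

def stream_reduce(lines, out, separator="\t"):
    """Write each wanted value as soon as it is read; return how many were kept."""
    kept = 0
    for line in lines:
        # partition() splits on the first separator only
        key, _, value = line.rstrip("\n").partition(separator)
        if key in WANTED_KEYS:  # exact match, not a substring test
            out.write(value + "\n")
            kept += 1
    return kept

# Demo on three made-up mapper output lines.
demo_input = io.StringIO(
    "2015_01\ta,b\n"
    "2015_07\tc,d\n"
    "2016_03\te,f\n"
)
demo_output = io.StringIO()
kept = stream_reduce(demo_input, demo_output)
print(kept)
```

In a streaming job you would call stream_reduce(sys.stdin, sys.stdout) instead of the demo. Memory use stays constant because no row is held after it has been written, and no trailing rows are lost because there is no buffer to flush.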
Related
Disclaimer: this is my first post ever, so sorry if I don't meet certain standards of the community.
I use python3, Jupyter Notebooks, Pandas
I used KMC kmer counter to count kmers of 60,000 DNA sequences in a reasonable amount of time. I want to use these kmer counts as input to ML algorithms as part of a Bag Of Words model.
Each file of kmer counts looks like the sample below, and I have 60K such files:
AAAAAC 2
AAAAAG 6
AAAAAT 2
AAAACC 4
AAAACG 2
AAAACT 3
AAAAGA 5
I want to create a single DataFrame from all the 60K files with one line per DNA sequence kmer counts which would have this form:
The target DataFrame shape
A first approach was successful and I managed to import 100 sequences(100 txt files) in 58 seconds, using this code:
import time
import pandas as pd

k = 6  # assumed k-mer length, inferred from the '6mer' folder name
countsPath = r'D:\DataSet\MULTI\bow\6mer'
df = pd.DataFrame()  # accumulator; it must exist before the loop
start = time.time()
for i in range(0, 60000):
    sample = pd.read_fwf(countsPath + r'\kmers-' + str(k) +'-seqNb-'+ str(i) + '.txt',sep=" ", header=None).T
    new_header = sample.iloc[0] #grab the first row for the header
    sample = sample[1:] #take the data less the header row
    sample.columns = new_header #set the header row as the df header
    # DataFrame.append was removed in pandas 2.0; this line needs an older pandas
    df= df.append(sample, ignore_index=True) #APPEND Sample to df DataSet
end = time.time()
# total time taken
print(f"Runtime of the program is {end - start} secs")
# display(sample)
display(df)
However, this was very slow: it took 59 secs on 100 files. Scaling to the full dataset means roughly 600 times that.
I tried Dask (DataFrame and Bag) to speed things up because it reads dictionary-like data, but I couldn't append each file as a row. The resulting Dask DataFrame looks like this:
0 AAAAA 18
1 AAAAC 16
2 AAAAG 13
...
1023 TTTTT 14
0 AAAAA 5
1 AAAAC 4
...
1023 TTTTT 9
0 AAAAA 18
1 AAAAC 16
2 AAAAG 13
3 AAAAT 12
4 AAACA 11
So the files are being inserted in a single column.
Does anyone have a better way of efficiently creating a DataFrame from 60k txt files?
Love the disclaimer. I have a similar one - this is the first time I'm trying to answer a question. But I'm pretty certain I got this...and so will you:
dict_name = dict(zip(df['column_name'],df['the_other_column_name']))
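To connect the one-liner above to the 60K-file problem, here is a minimal sketch of the same idea applied per file: parse each file straight into a dict and keep one dict per sequence, so the table can be built once at the end rather than appended to row by row. The function names read_kmer_counts and load_all and the two demo files are assumptions for illustration; the real file names and folder come from the question and are not reproduced here.

```python
import os
import tempfile

def read_kmer_counts(path):
    """Parse one 'KMER COUNT' text file into a {kmer: count} dict."""
    counts = {}
    with open(path) as handle:
        for line in handle:
            parts = line.split()
            if len(parts) == 2:  # skip blank or malformed lines
                counts[parts[0]] = int(parts[1])
    return counts

def load_all(paths):
    """Return one dict per file; each dict becomes one row of the table."""
    return [read_kmer_counts(p) for p in paths]

# Tiny demo with two made-up files in a temporary folder.
demo_dir = tempfile.mkdtemp()
demo_files = {
    "seq0.txt": "AAAAAC 2\nAAAAAG 6\n",
    "seq1.txt": "AAAAAC 1\nAAAAAT 4\n",
}
paths = []
for name, text in demo_files.items():
    path = os.path.join(demo_dir, name)
    with open(path, "w") as handle:
        handle.write(text)
    paths.append(path)

rows = load_all(paths)
print(rows)
```

Passing the resulting list to pd.DataFrame(rows) builds the whole table in a single step, with one column per kmer and NaN where a kmer is absent from a file (fillna(0) turns those into zero counts). This avoids the repeated copying that makes df.append in a loop so slow.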
I have 600 CSV files, each containing around 1500 rows of data. I have to run a function on every row, and I have already defined that function:
def query_prepare(data):
    """function goes here"""
    """here input data is list of single row of dataframe"""
The function performs operations such as strip() and replace() based on conditions. It takes each row's data as a list, for example:
data = ['apple$*7','orange ','bananna','-']
This is what my initial dataframe looks like:
a b c d
0 apple$*7 orange bananna -
1 apple()*7 flower] *bananna -
I timed the function: processing one row takes around 0.04 s, so running it on one CSV file with 1500 rows takes almost 1500 * 0.04 s. I have tried several methods:
# normal in built apply function
t = time.time()
a = df.apply(lambda x: query_prepare(x.to_list()),axis=1)
print('time taken',time.time()-t)
# time taken 52.519816637039185
# with swifter
t = time.time()
a = df.swifter.allow_dask_on_strings().apply(lambda x: query_prepare(x.to_list()),axis=1)
print('time taken',time.time()-t)
# time taken 160.31028127670288
# with pandarallel
pandarallel.initialize()
t = time.time()
a = df.parallel_apply(lambda x: query_prepare(x.to_list()),axis=1)
print('time taken',time.time()-t)
# time taken 55.000578
I have already optimized query_prepare as far as I can, so there is no way to change or modify it further. Any other suggestions?
P.S. I'm running this on Google Colab.
EDIT: If we have 1500 rows of data, could we split them into 15 parts and apply the function to each part? Would that cut the time by a factor of 15? (Sorry, I'm not sure whether this is possible; please point me in the right direction.)
For example you could roughly do the following:
def sanitize_column(s: pd.Series):
    return s.str.strip().str.strip('1234567890()*[]')
then you could do:
df.apply(sanitize_column, axis=0)
with:
df = pd.DataFrame({'a': ['apple7', 'apple()*7'], 'b': [" asd ", ']asds89']})
this will give
       a     b
0  apple   asd
1  apple  asds
This should be faster than your solution. For proper benchmarking, we'd need your full solution.
I am currently occupied with a dataset consisting of 90 .csv files. There are three types of .csv files (30 of each type).
Each CSV has 20k to 30k rows on average and 3 columns (Unix timestamp, integer, integer).
Here's an example of the header and a row:
Timestamp id1 id2
151341342 324 112
I am currently using 'os' to list all files in the directory.
The process for each CSV file is as follows:
Read it through pandas into a dataframe
iterate the rows of the file and for each row convert the timestamp to readable format.
Use the converted timestamp and Integers to create a relationship-type of object and add it on a list of relationships
The list will later be looped to create the relationships in my neo4j database.
The problem I am having is that the process takes too much time. I have asked and searched for ways to do it faster (I got answers like PySpark and threads), but I did not find anything that really fits my needs. I am really stuck: with my resources it takes around 1 hour and 20 minutes to run the whole process for one of the big .csv files (meaning one with around 30k rows).
Converting to readable format:
ts = int(row['Timestamp'])
formatted_ts = datetime.utcfromtimestamp(ts).strftime('%Y-%m-%d %H:%M:%S')
I then pass the parameters to py2neo's Relationship function to create my relationships. That list will be looped over later.
node1 = graph.evaluate('MATCH (n:User) WHERE n.id={id} RETURN n', id=int(row["id1"]))
node2 = graph.evaluate('MATCH (n:User) WHERE n.id={id} RETURN n', id=int(row['id2']))
rels.append(Relationship(node1, rel_type, node2, date=date, time=time))
time to compute row: 0:00:00.001000
time to create relationship: 0:00:00.169622
time to compute row: 0:00:00.001002
time to create relationship: 0:00:00.166384
time to compute row: 0:00:00
time to create relationship: 0:00:00.173672
time to compute row: 0:00:00
time to create relationship: 0:00:00.171142
I measured the time for the two parts of the process, as shown above. Each step is fast, and there doesn't seem to be a problem other than the size of the files. That is why the only thing that comes to mind is that parallelism would help process those files faster (by handling, say, 4 files at the same time instead of one).
Sorry for not posting everything.
I am really looking forward to replies.
Thank you in advance
That sounds fishy to me. Processing csv files of that size should not be that slow.
I just generated a 30k-line CSV file of the type you described (3 columns filled with random numbers of the size you specified).
import random
with open("file.csv", "w") as fid:
    fid.write("Timestamp;id1;id2\n")
    for i in range(30000):
        ts = int(random.random()*1000000000)
        id1 = int(random.random()*1000)
        id2 = int(random.random()*1000)
        fid.write("{};{};{}\n".format(ts, id1, id2))
Just reading the csv file into a list using plain Python takes well under a second. Printing all the data takes about 3 seconds.
from datetime import datetime

def convert_date(string):
    ts = int(string)
    formatted_ts = datetime.utcfromtimestamp(ts).strftime('%Y-%m-%d %H:%M:%S')
    split_ts = formatted_ts.split()
    date = split_ts[0]
    time = split_ts[1]
    return date

with open("file.csv", "r") as fid:
    header = fid.readline()
    lines = []
    for line in fid.readlines():
        line_split = line.strip().split(";")
        line_split[0] = convert_date(line_split[0])
        lines.append(line_split)

for line in lines:
    print(line)
Could you elaborate what you do after reading the data? Especially "create a relationship-type of object and add it on a list of relationships"
That could help pinpoint your timing issue. Maybe there is a bug somewhere?
You could try timing different parts of your code to see which one takes the longest.
Generally, what you describe should be possible within seconds, not hours.
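The per-row timings posted in the question already hint at where the time goes: about 0.17 s per relationship times roughly 30,000 rows is about 5,100 s, or around 85 minutes, which is close to the reported 1 hour 20 minutes. If most of that 0.17 s is spent in the two node lookups per row (an assumption, since the question does not break it down further), caching nodes by id would avoid repeating the same query. The sketch below uses a hypothetical fetch_node function as a stand-in for the real database call, and the sample rows are invented.

```python
from functools import lru_cache

lookup_calls = 0  # counts how often the slow lookup really runs

def fetch_node(user_id):
    """Hypothetical stand-in for the slow graph.evaluate(...) lookup."""
    global lookup_calls
    lookup_calls += 1
    return {"id": user_id}

@lru_cache(maxsize=None)
def cached_node(user_id):
    # The first request for an id hits fetch_node; repeats come from the cache.
    return fetch_node(user_id)

# Three made-up rows: (timestamp, id1, id2)
rows = [(151341342, 324, 112), (151341400, 324, 7), (151341500, 112, 7)]
rels = []
for ts, id1, id2 in rows:
    rels.append((cached_node(id1), ts, cached_node(id2)))

print(lookup_calls)
```

Here six lookups are requested but only three reach fetch_node, one per distinct id. Whether this helps in practice depends on how often ids repeat in the real files, so timing the lookups separately from the Relationship construction would be the first thing to check.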
I am trying to run a simple MapReduce program on HDInsight through Azure. My program is written in Python and simply counts how many rows of numbers (time series) meet certain criteria. The final results are just counts for each category. My code is shown below.
from mrjob.job import MRJob
import numpy as np
import time

class MRTimeSeriesFrequencyCount(MRJob):
    def mapper(self, _, line):
        series = [float(i) for i in line.split(',')]
        diff = list(np.diff(series))
        avg = sum(diff) / len(diff)
        std = np.std(diff)
        fit = np.polyfit(list(range(len(series))), series, deg=1)
        yield "Down", 1 if (series[len(series)-1]-series[0]) < 0 else 0
        yield "Up", 1 if (series[len(series)-1]-series[0]) > 0 else 0
        yield "Reverse", 1 if (fit[0]*(series[len(series)-1]-series[0])) < 0 else 0
        yield "Volatile", 1 if std/avg > 0.33 else 0

    def reducer(self, key, values):
        yield key, sum(values)

if __name__ == '__main__':
    start_time = time.time()
    MRTimeSeriesFrequencyCount.run()
    print("--- %s seconds ---" % (time.time() - start_time))
I am new to mapreduce and hadoop. When I scale up the number of rows, which are stored in a csv, my laptop which is an HP Elitebook 8570w still performs faster than running the code in Hadoop (456 seconds vs 628.29 seconds for 1 million rows). The cluster has 4 worker nodes with 4 cores each and 2 head nodes with 4 cores each. Shouldn't it perform faster? Is there some other bottleneck such as reading in the data? Is mrjob running it on only one node? Thanks in advance for the help.
As far as I know, Hadoop needs some time to start up an M/R job and prepare the data on HDFS, so for a small data set you can't expect a Hadoop cluster to be faster than a single local machine.
You have 1 million rows of data. If I assume each row is about 1 KB, the whole data set is about 1 GB. That is small for Hadoop, so the time saved by parallelism is not enough to make up for the startup latency before the job actually runs.
For reference, there is an SO thread (Why submitting job to mapreduce takes so much time in General?) whose accepted answer explains this latency.
I have a vector that contain stock tickers like tickers = ['AAPL','XOM','GOOG'] and in my "traditional" python program I would loop over this tickers vector, select one ticker string like AAPL, import a csv file that contains AAPL stock returns, use the returns as an input to a common function, and finally generate a csv file as an output. I have over 4000 tickers and the function to apply to each ticker takes time to process. I have access to a computer cluster with the mpi4py package with access to about 100 processors per job. I understand well (and was able to implement) this mpi example in python:
from mpi4py import MPI
comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()
if rank == 0:
    data = [i for i in range(8)]
    # dividing data into chunks
    chunks = [[] for _ in range(size)]
    for i, chunk in enumerate(data):
        chunks[i % size].append(chunk)
else:
    data = None
    chunks = None
data = comm.scatter(chunks, root=0)
print(str(rank) + ': ' + str(data))
[cha#cluster] ~/utils> mpirun -np 3 ./mpi.py
2: [2, 5]
0: [0, 3, 6]
1: [1, 4, 7]
So in this example, we have a data vector of size 8 and assign each of the 3 processors a roughly equal share of its elements. How can I adapt this example to assign each processor one stock ticker and apply the function that needs to run for each ticker? And how can I tell Python that, once a processor becomes free, it should go back to the tickers vector and process a ticker that has not yet been processed?
There's another way to think of this. You have 100 processors processing 4000 chunks of data. One way you can look at this is that each processor gets a block of data on which to operate. Evenly split, each processor will get 40 tickers to process. Processor 1 will get 0-39, processor 2 will get 40-79, etc.
Thinking this way, you don't need to worry about what happens when a processor finishes its tasks. Just have a loop:
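def process(ticker):
    # load data
    # process data
    # output data
    pass  # placeholder body so the function is valid Python

# tickers, size and rank are the variables from your own program
block_size = len(tickers) // size  # integer division; 40 in your example
for i in range(block_size):
    ticker = tickers[rank * block_size + i]
    process(ticker)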
Does this make sense?
[edit]
If you're wanting to read more, this is really just a variation on row-major order indexing, a common method for accessing multidimensional data that's stored in a single dimension of memory.
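The block indexing above works cleanly when the number of tickers divides evenly by the number of processors. A minimal sketch of the same idea that also covers a remainder is shown below. The helper name block_bounds and the 4003 made-up ticker names are assumptions for illustration; in a real MPI program, size and rank would come from comm.Get_size() and comm.Get_rank() as in the question.

```python
def block_bounds(n_items, size, rank):
    """Return (start, stop) of the contiguous block owned by `rank`.

    The first `n_items % size` ranks each take one extra item, so every
    item is assigned exactly once even when the split is uneven.
    """
    base, extra = divmod(n_items, size)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop

tickers = ["T%04d" % i for i in range(4003)]  # made-up ticker names
size = 100                                    # stand-in for comm.Get_size()

blocks = [block_bounds(len(tickers), size, r) for r in range(size)]
print(blocks[0], blocks[-1])  # (0, 41) (3963, 4003)
```

Each rank would then loop over tickers[start:stop] and call process on every entry. This is still a static split, so it suits the case where each ticker takes a similar amount of time; it does not hand out new work to processors that finish early.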