Input split for Map function in Hadoop

This is my first implementation in Hadoop. I am trying to implement my algorithm for a probabilistic dataset in MapReduce. In my dataset, the last column holds an id (the number of unique ids in the dataset equals the number of nodes in my cluster). I have to divide the dataset based on this column's value, and each set of records should be processed by one node in my cluster.
For example, if I have three nodes in my cluster, then for the dataset below one node should process all the records with id=1, another those with id=2, and another those with id=3:
name time dept id
--------------------
b1 2:00pm z1 1
b2 3:00pm z2 2
c1 4:00pm y2 1
b3 3:00pm z3 3
c4 4:00pm x2 2
My map function should take each split as input and process it in parallel on each node.
I am trying to understand which approach is possible in Hadoop. One option is to pass the whole dataset as input to my map function, with an additional argument telling map to split the data based on the id value.
The other is to split the data beforehand into "n" (number of nodes) subsets and load them onto the nodes. If this is the correct approach, how can I split the data based on a value and load it onto different nodes? From what I have read, Hadoop splits the data into blocks based on the specified size. How can we specify a particular condition while loading? I should add that I am writing my program in Python.
Could someone please advise? Thanks.

The simplest thing for you would probably be to have the mapper output the data with the id as key, which guarantees that one reducer gets all the records for a specific id, and then do your processing in the reduce phase.
For example,
Input data:
b1 2:00pm z1 1
b2 3:00pm z2 2
c1 4:00pm y2 1
b3 3:00pm z3 3
c4 4:00pm x2 2
Mapper code:
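#!/usr/bin/env python
import sys
for line in sys.stdin:
    line = line.strip()
    cols = line.split("\t")
    key = cols[-1]  # the id is the last column
    # emit "<id>\t<original line>" so that Hadoop partitions and sorts by id
    print(key + "\t" + line)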
Map output:
1 b1 2:00pm z1 1
2 b2 3:00pm z2 2
1 c1 4:00pm y2 1
3 b3 3:00pm z3 3
2 c4 4:00pm x2 2
Reducer 1 input:
1 b1 2:00pm z1 1
1 c1 4:00pm y2 1
Reducer 2 input:
2 b2 3:00pm z2 2
2 c4 4:00pm x2 2
Reducer 3 input:
3 b3 3:00pm z3 3
Reducer code:
#!/usr/bin/env python
import sys
for line in sys.stdin:
    line = line.strip()
    cols = line.split("\t")
    # drop the leading key column to recover the original record
    orig_line = "\t".join(cols[1:])
    # do stuff...
Note that this way a single reducer might get several keys, but its input will be sorted by key, and you can control the number of reducers with the mapred.reduce.tasks option (named mapreduce.job.reduces in Hadoop 2 and later).
EDIT
If you want to collect your data in the reducer per key you can do something like this (not sure it will run as-is but you get the idea)
#!/usr/bin/env python
import sys
def process_data(key_id, data_list):
    # data_list has all the lines for key_id
    pass
last_key = None
data = []
for line in sys.stdin:
    line = line.strip()
    cols = line.split("\t")
    key = cols[0]
    # input is sorted by key, so a key change means the previous group is complete
    if last_key is not None and key != last_key:
        process_data(last_key, data)
        data = []
    orig_line = "\t".join(cols[1:])
    data.append(orig_line)
    last_key = key
# flush the final group (skipped when there was no input at all)
if last_key is not None:
    process_data(last_key, data)
If you aren't worried about running out of memory in the reducer step you can simplify the code like this:
#!/usr/bin/env python
import sys
from collections import defaultdict
def process_data(key_id, data_list):
    # data_list has all the lines for key_id
    pass
all_data = defaultdict(list)
for line in sys.stdin:
    line = line.strip()
    cols = line.split("\t")
    key = cols[0]
    orig_line = "\t".join(cols[1:])
    all_data[key].append(orig_line)
# items() works on both Python 2 and 3 (iteritems() is Python 2 only)
for key, data in all_data.items():
    process_data(key, data)
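The key-change bookkeeping in the streaming reducer above can also be expressed with itertools.groupby, which relies on the same guarantee that reducer input arrives sorted by key. This is only a minimal sketch, assuming tab-separated "key<TAB>record" lines; group_by_key is a hypothetical helper name, not part of Hadoop streaming, and the sample lines stand in for sys.stdin:

```python
from itertools import groupby


def group_by_key(lines):
    """Yield (key, records) pairs from key-sorted 'key<TAB>record' lines."""
    # Split each line once into [key, rest-of-line].
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    # groupby only merges adjacent equal keys, which is why sorted input matters.
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield key, [pair[1] for pair in group if len(pair) > 1]


if __name__ == "__main__":
    # In a real reducer, iterate over sys.stdin instead of this sample.
    sample = [
        "1\tb1\t2:00pm\tz1\t1\n",
        "1\tc1\t4:00pm\ty2\t1\n",
        "2\tb2\t3:00pm\tz2\t2\n",
    ]
    for key, records in group_by_key(sample):
        print(key + "\t" + str(len(records)))
```

Only one group is held in memory at a time, so this keeps the memory profile of the first version while reading closer to the second.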

If I understood your question correctly, the best way is to load your dataset into a Hive table and then write a UDF in Python. After that, do something like this:
select your_python_udf(name, time, dept, id) from table group by id;
This behaves like a reduce phase, so you may need to set the number of reducers before launching the query:
set mapred.reduce.tasks=50;
How to create a custom UDF:
Hive Plugins
Create Function

Related

Data Scraping from txt file with consistent structure

I'm working with a very old program that outputs the results for a batch query in a very odd format (at least for me).
Imagine having queried info for the objects A, B and C.
The output will look like this:
name : A
------
Group 1
p1 : 11
p2 : 12
Group 2
p3 : 23
p4 : 24
======
name : B
------
Group 1
p1 : 11
p2 : 12
Group 2
p3 : 23
p4 : 24
======
name : C
------
Group 1
p1 : 11
p2 : 12
Group 2
p3 : 23
p4 : 24
Do you have any idea how to put the data into a more useful format?
A good format might be a table with columns A, B, C and rows p1, p2, ...
I had a few ideas, but I don't really know how to implement them:
Every object is separated by a ====== string, so I can use this to split the output into many .txt files.
Then I can read the files with Excel, setting : as the separator, to obtain a CSV file with 2 columns (one containing the p descriptors and one with the actual values).
Finally, I need to merge all the CSV files into one single CSV with as many columns as objects and one row per p descriptor.
I'd like to do this in Python, but I don't know of any package for this situation. Also, there are a few hundred objects, so I need an automated way of doing this.
Any tip, advice, or idea you can think of is welcome.

Here's a quick solution that puts the data you say you need (not all labels) into a CSV file. Each output line starts with the name (A/B/C), followed by the values of p1..px.
It does not handle missing values: only the values that are present are listed, so column 5 will not always be p4. It relies on the assumption that a name line starts every entry and that every other a:b line has a value b to be stored. This should be a good starting point if you need to put the data into another structure.
The format is truly special, more of a report structure, so I'd guess there's no suitable general-purpose library. Flat format is another similarly tricky old format type for which there are libraries; I used one when calculating how much money each Swedish participant in the Interrail program should receive. Tricky business but fun! :-)
The code:
import re
import csv
# the with block closes the file, so no explicit close() is needed
with open('input.txt') as f:
    lines = f.readlines()
entries = []
entry = []
for line in lines:
    parts = re.split(r':', line)
    # only "label : value" lines contain a colon; separators and group headers are skipped
    if len(parts) >= 2:
        label = parts[0]
        value = parts[1].strip()
        if label.startswith('name'):
            print('got name: ' + value)
            # start new entry with the name as first value
            entry = [value]
            entries.append(entry)
        else:
            print('got value: ' + value)
            entry.append(value)
print('collected {} entries'.format(len(entries)))
with open('output.csv', 'w', newline='') as output:
    wr = csv.writer(output, quoting=csv.QUOTE_ALL)
    wr.writerows(entries)
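The layout asked for in the question (one column per object, one row per p label) can be reached by keeping the labels while parsing and transposing at the end. The following is only a sketch under the same assumption as the script above, namely that every entry starts with a name line; parse_report and to_rows are hypothetical helper names, and the sample text is made up for illustration (object B deliberately lacks p2 to show how a missing value becomes an empty cell):

```python
import csv
import io


def parse_report(text):
    """Return {object name: {label: value}} from the report text."""
    table = {}
    current = None
    for line in text.splitlines():
        label, sep, value = line.partition(":")
        if not sep:
            continue  # separators and "Group N" lines carry no value
        label, value = label.strip(), value.strip()
        if label == "name":
            current = table.setdefault(value, {})
        elif current is not None:
            current[label] = value
    return table


def to_rows(table):
    """Build a header row of object names, then one row per label."""
    names = list(table)
    labels = []
    for values in table.values():
        for label in values:
            if label not in labels:
                labels.append(label)  # keep first-seen order
    rows = [["label"] + names]
    for label in labels:
        # Missing values become empty cells, so the columns stay aligned.
        rows.append([label] + [table[name].get(label, "") for name in names])
    return rows


sample = """name : A
------
Group 1
p1 : 11
p2 : 12
======
name : B
------
Group 1
p1 : 21
"""
buffer = io.StringIO()
csv.writer(buffer).writerows(to_rows(parse_report(sample)))
print(buffer.getvalue())
```

Writing to a real file instead of the in-memory buffer works the same way as in the script above.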

Python- replace last n chars of a specific section of a specific row found in a text file

I have 1000s of text files where I want to replace a very specific section of text with a predefined string. These files contain data like this:
Type Basemap 20221118202211
QSNGAGL1 20221120209912300111111 1B Bus O
QO1290BOB203871145 T1
QI1290BOA0587611451145B T1
QI1290BOB2044911451145B T1
QI1200BOB2014411451145B T1
QI1200BOB2014611451145B T1
QT1200DOY385621145 T1
QSNGAGL2 20221120209912300100110 1B Bus O
QO1290BOB203871145 T1
QI1290BOA0587611451145B T1
QI1200DOY2932411451145B T1
QI1200DOA2517511451145B T1
QT1200DOY385621145 T1
QSNFB 1 20221009209912300101100 1 Bus O
QO1290BOB203871115 T1
QI1290BOA0587611151115B T1
QI1290BOB2044911151115B T1
#(and so on... for ~60,000 rows per file...)
The first row is a header which only appears once per file. The spacing in the data is not consistent. The number of 'non-QS*' rows between each 'QS*' row varies.
I want to be able to:
iterate through each file
find each row starting with 'QS'
find the 2nd section of text in this row (the number usually starting 2022... This is a date range, with 7 numbers on the end representing each 7 days of the week with a 1 or a 0)
replace these last 7 characters of this section with specific text ('1111100')
save this as a new file with the prefix 'fixed_' on the file name (so as not to overwrite the original file)
I've thought about exploring pandas but I can't get it to read the data correctly. It doesn't help that on row 55,000 and on (in some files), there appears to be another column of data where a text string has spilled over to the right of its row. I also can't use a simple find and replace as these last 7 values could be any combination of 1s and 0s.
Using the second 'QS' row from the example above, I'd want '20221120209912300100110' changed to '20221120209912301111100'. Note how the last 7 characters are the '1111100' I desire.
UPDATE: I've changed the sample text above to include a differently laid out 'QS*' row, which can also occur.

Try (regex demo):
import re
# group 1: the leading QS token and following whitespace; group 2: all digits except the last 7
pat = re.compile(r"(^\s*QS\S+\s*)(\d+?)\d{7}\b")
with open("input.txt", "r") as f_in, open("fixed_output.txt", "w") as f_out:
    for line in f_in:
        line = pat.sub(r"\g<1>\g<2>1111100", line)
        f_out.write(line)
If input.txt contains the text in the question then fixed_output.txt will contain:
Type Basemap 20221118202211
QSNGAGL1 20221120209912301111100 1B Bus O
QO1290BOB203871145 T1
QI1290BOA0587611451145B T1
QI1290BOB2044911451145B T1
QI1200BOB2014411451145B T1
QI1200BOB2014611451145B T1
QT1200DOY385621145 T1
QSNGAGL2 20221120209912301111100 1B Bus O
QO1290BOB203871145 T1
QI1290BOA0587611451145B T1
QI1200DOY2932411451145B T1
QI1200DOA2517511451145B T1
QT1200DOY385621145 T1
QSNGAGL3 20221120209912301111100 1B Bus O
QO1290BOB203871115 T1
QI1290BOA0587611151115B T1
QI1290BOB2044911151115B T1
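The pattern above expects the digits to follow the first token directly, so a row laid out like 'QSNFB 1 2022...' (an extra field before the date range) is left unchanged. A possible variant is sketched below. It assumes, based only on the sample rows, that the date range is always two 8-digit dates (16 digits) followed by the 7 day flags; fix_line and fix_file are hypothetical helper names:

```python
import re
from pathlib import Path

# Assumption: 16 date digits followed by exactly 7 day flags (0 or 1).
QS_PATTERN = re.compile(r"^(\s*QS.*?\b\d{16})[01]{7}\b")


def fix_line(line, flags="1111100"):
    """Replace the 7 day flags on a QS row; other rows pass through unchanged."""
    return QS_PATTERN.sub(lambda m: m.group(1) + flags, line, count=1)


def fix_file(path):
    """Write a corrected copy of path next to it, prefixed with 'fixed_'."""
    src = Path(path)
    dst = src.with_name("fixed_" + src.name)
    with src.open() as f_in, dst.open("w") as f_out:
        for line in f_in:
            f_out.write(fix_line(line))
    return dst


print(fix_line("QSNFB 1 20221009209912300101100 1 Bus O"))
```

To process many files, fix_file could be called in a loop over something like Path(folder).glob("*.txt"), where folder is whatever directory holds the originals.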

Calculate the median of a list of values in parallel using Hadoop map-reduce

I'm new to Hadoop and mrjob. I have a text file in which each line consists of "id groupId value". I am trying to calculate the median of all values in the text file using Hadoop map-reduce, but I'm stuck on computing a single overall median. What I get instead is a median value for each id, like:
"123213" 5.0
"123218" 2
"231532" 1
"234634" 7
"234654" 2
"345345" 9
"345445" 4.5
"345645" 2
"346324" 2
"436324" 6
"436456" 2
"674576" 10
"781623" 1.5
The output should be like "median value of all values is: ####". I was influenced by this article: https://computehustle.com/2019/09/02/getting-started-with-mapreduce-in-python/
My Python file, median-mrjob.py:
from mrjob.job import MRJob
from mrjob.step import MRStep
class MRMedian(MRJob):
    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_stats, combiner=self.reducer_count_stats),
            MRStep(reducer=self.reducer_sort_by_values),
            MRStep(reducer=self.reducer_retrieve_median)
        ]
    def mapper_get_stats(self, _, line):
        line_arr = line.split(" ")
        values = int(float(line_arr[-1]))
        id = line_arr[0]
        yield id, values
    def reducer_count_stats(self, key, values):
        yield str(sum(values)).zfill(2), key
    def reducer_sort_by_values(self, values, ids):
        for id in ids:
            yield id, values
    def reducer_retrieve_median(self, id, values):
        valList=[]
        median = 0
        for val in values:
            valList.append(int(val))
        N = len(valList)
        #find the median
        if N % 2 == 0:
            #if N is even
            m1 = N / 2
            m2 = (N / 2) + 1
            #Convert to integer, match post
            m1 = int(m1) - 1
            m2 = int(m2) - 1
            median = (valList[m1] + valList[m2]) / 2
        else:
            m = (N + 1) / 2
            # Convert to integer, match position
            m = int(m) - 1
            median = valList[m]
        yield (id, median)
if __name__ == '__main__':
    MRMedian.run()
My original text files is about 1million and 1billion line of data, but I have created a test file which has arbitrary data. It has the name input.txt :
781623 2 2.3243
781623 1 1.1243
234654 1 2.122
123218 8 2.1245
436456 22 2.26346
436324 3 6.6667
346324 8 2.123
674576 1 10.1232
345345 1 9.56135
345645 7 2.1231
345445 10 6.1232
231532 1 1.1232
234634 6 7.124
345445 6 3.654376
123213 18 8.123
123213 2 2.1232
What I care about is the values, bearing in mind that there might be duplicates. I run the code from the terminal with: python median-mrjob.py input.txt
Update: The point of the assignment is not to use any libraries, so I need to sort the list manually (or maybe part of it, as I understood) and calculate the median manually (hardcoding). Otherwise the goal of using MapReduce disappears. Using PySpark is not allowed in this assignment. Check this link for more inspiration: Computing median in map reduce

The output should be like "median value of all values is: ####"
Then you need to force all data to one reducer first (effectively defeating the purpose of using MapReduce).
You'd do that by discarding the ID instead of using it as the key:
def mapper_get_stats(self, _, line):
    line_arr = line.split()
    if line_arr: # prevent empty lines
        value = float(line_arr[-1])
        yield None, value
After that, sort and find the median (I fixed your parameter order)
def reducer_retrieve_median(self, key, values):
    import statistics
    yield None, f"median value of all values is: {statistics.median(values)}" # automatically sorts the data
So, only two steps
class MRMedian(MRJob):
    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_stats),
            MRStep(reducer=self.reducer_retrieve_median)
        ]
For the given file, you should see
null "median value of all values is: 2.2938799999999997"
original text files is about 1million and 1billion line of data
Not that it matters, but which is it?
You should upload the file to HDFS first; then you can use better tools than mrjob for this, such as Hive or Pig.
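Since the update to the question rules out library helpers, the statistics.median call in the reducer could be swapped for a hand-written function. This is a minimal sketch using only built-ins; median_of is a hypothetical name, and the values are the third column of the sample input.txt:

```python
def median_of(values):
    """Median of an iterable of numbers, using only built-ins."""
    ordered = sorted(values)  # sorted() also accepts the generator mrjob passes in
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data")
    mid = n // 2
    if n % 2:
        # Odd count: the middle element is the median.
        return ordered[mid]
    # Even count: average the two middle elements.
    return (ordered[mid - 1] + ordered[mid]) / 2


values = [2.3243, 1.1243, 2.122, 2.1245, 2.26346, 6.6667, 2.123, 10.1232,
          9.56135, 2.1231, 6.1232, 1.1232, 7.124, 3.654376, 8.123, 2.1232]
print("median value of all values is:", median_of(values))
```

Inside the reducer, the yield would then use median_of(values) in place of statistics.median(values). All values still have to reach a single reducer, so this changes only the library dependency, not the scaling behaviour.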

Reading excel and storing data with xlrd

I have this data in excel sheet
FT_NAME   FC_NAME  C_NAME
FT_NAME1  FC1      C1
FT_NAME2  FC21     C21
          FC22     C22
FT_NAME3  FC31     C31
          FC32     C32
FT_NAME4  FC4      C4
where column names are
FT_NAME,FC_NAME,C_NAME
and I want to store these values in a data structure for further use. Currently I am trying to store them in a list of lists, but could not do so with the following code:
i=4
oc=sheet.cell(i,8).value
fcl,ocl=[],[]
while oc:
    ft=sheet.cell(i,6).value
    fc=sheet.cell(i,7).value
    oc=sheet.cell(i,8).value
    if ft:
        self.foreign_tables.append(ft)
        fcl.append(fc)
        ocl.append(oc)
        self.foreign_col.append(fcl)
        self.own_col.append(ocl)
        fcl,ocl=[],[]
    else:
        fcl.append(fc)
        ocl.append(oc)
    i+=1
I expect the output to be:
ft=[FT_NAME1,FT_NAME2,FT_NAME3,FT_NAME4]
fc=[FC1, [FC21,FC22],[FC31,FC32],FC4]
oc=[C1,[C21,C22],[C31,C32],C4]
Could anyone please help with a better, more Pythonic solution?

You can use pandas. It reads the data into a DataFrame, which behaves much like a dictionary of columns.
import pandas as pd
data = pd.read_excel('file.xlsx', sheet_name='Sheet1')
# forward-fill the blank FT_NAME cells with the value from the row above
data = data.ffill()
print(data)
It gives the following output:
FT_NAME FC_NAME C_NAME
0 FT_NAME1 FC1 C1
1 FT_NAME2 FC21 C21
2 FT_NAME2 FC22 C22
3 FT_NAME3 FC31 C31
4 FT_NAME3 FC32 C32
5 FT_NAME4 FC4 C4
To get the sublist structure try using this function:
def group(data):
    output = []
    # sorted unique FT_NAME values become the first list
    names = list(set(data['FT_NAME'].values))
    names.sort()
    output.append(names)
    headernames = list(data.columns)
    headernames.pop(0)
    # for every remaining column, collect its values per FT_NAME
    for ci in list(headernames):
        column_group = []
        column_data = data[ci].values
        for name in names:
            column_group.append(list(column_data[data['FT_NAME'].values == name]))
        output.append(column_group)
    return output
If you call it like this:
ft, fc, oc = group(data)
print(ft)
print(fc)
print(oc)
you get the following output:
['FT_NAME1', 'FT_NAME2', 'FT_NAME3', 'FT_NAME4']
[['FC1'], ['FC21', 'FC22'], ['FC31', 'FC32'], ['FC4']]
[['C1'], ['C21', 'C22'], ['C31', 'C32'], ['C4']]
which is what you want except for the single element now also being in a list.
It is not the cleanest method but it gets the job done.
Hope it helps.
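If pandas is not an option, the same grouping can be sketched on plain rows, for example tuples built from sheet.cell(...).value as in the question. This is only an illustration: group_rows and the sample rows are hypothetical, and an empty FT_NAME cell is assumed to mean "same group as the row above":

```python
def group_rows(rows):
    """Group (ft, fc, oc) rows; an empty ft continues the previous group."""
    ft_names, fc_groups, oc_groups = [], [], []
    for ft, fc, oc in rows:
        if ft:
            # A non-empty FT_NAME starts a new group.
            ft_names.append(ft)
            fc_groups.append([])
            oc_groups.append([])
        if not ft_names:
            raise ValueError("the first row must have an FT_NAME")
        fc_groups[-1].append(fc)
        oc_groups[-1].append(oc)
    return ft_names, fc_groups, oc_groups


rows = [
    ("FT_NAME1", "FC1", "C1"),
    ("FT_NAME2", "FC21", "C21"),
    ("", "FC22", "C22"),
    ("FT_NAME3", "FC31", "C31"),
    ("", "FC32", "C32"),
    ("FT_NAME4", "FC4", "C4"),
]
ft, fc, oc = group_rows(rows)
print(ft)
print(fc)
print(oc)
```

As with the pandas version, single elements end up in one-item lists, which keeps every entry the same type.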

Improve python code in terms of speed

I have a very big file (1.5 billion lines) in the following form:
1 67108547 67109226 gene1$transcript1 0 + 1 0
1 67108547 67109226 gene1$transcript1 0 + 2 1
1 67108547 67109226 gene1$transcript1 0 + 3 3
1 67108547 67109226 gene1$transcript1 0 + 4 4
.
.
.
1 33547109 33557650 gene2$transcript1 0 + 239 2
1 33547109 33557650 gene2$transcript1 0 + 240 0
.
.
.
1 69109226 69109999 gene1$transcript1 0 + 351 1
1 69109226 69109999 gene1$transcript1 0 + 352 0
What I want to do is to reorganize/sort this file based on the identifier in column 4. The file consists of blocks. Concatenating columns 4, 1, 2 and 3 gives the unique identifier for each block. This is the key for the dictionary all_exons, and the value is a numpy array containing all the values of column 8. Then I have a second dictionary, unique_identifiers, whose keys are the attributes from column 4 and whose values are lists of the corresponding block identifiers. As output I write a file in the following form:
>gene1
0
1
3
4
1
0
>gene2
2
0
I already wrote some code (see below) that does this, but my implementation is very slow. It takes around 18 hours to run.
import os
import sys
import time
from contextlib import contextmanager
import pandas as pd
import numpy as np
def parse_blocks(bedtools_file):
    unique_identifiers = {} # Dictionary with key: gene, value: list of exons
    all_exons = {} # Dictionary containing all exons
    # Parse file and ...
    with open(bedtools_file) as fp:
        sp_line = []
        for line in fp:
            sp_line = line.strip().split("\t")
            current_id = sp_line[3].split("$")[0]
            identifier="$".join([sp_line[3],sp_line[0],sp_line[1],sp_line[2]])
            if(identifier in all_exons):
                item = float(sp_line[7])
                all_exons[identifier]=np.append(all_exons[identifier],item)
            else:
                all_exons[identifier] = np.array([sp_line[7]],float)
            if(current_id in unique_identifiers):
                unique_identifiers[current_id].add(identifier)
            else:
                unique_identifiers[current_id] =set([identifier])
    return unique_identifiers, all_exons
identifiers, introns = parse_blocks(options.bed)
w = open(options.out, 'w')
for gene in sorted(list(identifiers)):
    w.write(">"+str(gene)+"\n")
    for intron in sorted(list(identifiers[gene])):
        for base in introns[intron]:
            w.write(str(base)+"\n")
w.close()
How can I improve the above code so that it runs faster?

You also import pandas, so here is a pandas solution that needs basically only two lines of code.
However, I do not know how it performs on large data sets or whether it is faster than your approach (though I am pretty sure it is).
In the example below, the data you provided is stored in table.txt. I then use groupby to collect all the values in your 8th column into a list for the respective identifier in your column 4 (note that my indices start at 0) and convert this data structure into a dictionary, which can then be printed easily.
import pandas as pd
df = pd.read_csv("table.txt", header=None, sep=r"\s+")  # or use sep="\t" if the file is tab-separated
op = dict(df.groupby(3)[7].apply(lambda x: x.tolist()))
So in this case op looks like this:
{'gene1$transcript1': [0, 1, 3, 4, 1, 0], 'gene2$transcript1': [2, 0]}
Now you could print the output like this and redirect it to a file:
for k, v in op.items():
    print(k.split('$')[0])
    for val in v:
        print(val)
This gives you the desired output:
gene1
0
1
3
4
1
0
gene2
2
0
Maybe you can give it a try and let me know how it compares to your solution!?
Edit2:
In the comments you mentioned that you would like to print the genes in the correct order. You can do this as follows:
# add some fake genes to op from above
op['gene0$stuff'] = [7,9]
op['gene4$stuff'] = [5,9]
# print using 'sorted'
for k, v in sorted(op.items()):
    print(k.split('$')[0])
    for val in v:
        print(val)
which gives you:
gene0
7
9
gene1
0
1
3
4
1
0
gene2
2
0
gene4
5
9
EDIT1:
I am not sure whether duplicates are intended but you could easily get rid of them by doing the following:
op2 = dict(df.groupby(3)[7].apply(lambda x: set(x)))
Now op2 would look like this:
{'gene1$transcript1': {0, 1, 3, 4}, 'gene2$transcript1': {0, 2}}
You print the output as before:
for k, v in op2.items():
    print(k.split('$')[0])
    for val in v:
        print(val)
which gives you
gene1
0
1
3
4
gene2
0
2

I'll try to simplify your question. My solution works like this:
First, scan over the big file. For every distinct current_id, open a temporary file and append the values of column 8 to that file.
After the full scan, concatenate all chunks into a result file.
Here's the code:
# -*- coding: utf-8 -*-
import os
import tempfile
import subprocess
class ChunkBoss(object):
    """Boss for file chunks"""
    def __init__(self):
        self.opened_files = {}
    def write_chunk(self, current_id, value):
        if current_id not in self.opened_files:
            # text mode so that str can be written on Python 3;
            # delete=False keeps the file on disk until clean_up()
            self.opened_files[current_id] = tempfile.NamedTemporaryFile(mode='w', delete=False)
            self.opened_files[current_id].write('>%s\n' % current_id)
        self.opened_files[current_id].write('%s\n' % value)
    def cat_result(self, filename):
        """Concatenate chunks into one big file
        """
        # Sort the chunks
        chunk_file_list = []
        for current_id in sorted(self.opened_files.keys()):
            chunk_file_list.append(self.opened_files[current_id].name)
        # Flush chunks
        for chunk in self.opened_files.values():
            chunk.flush()
        # By calling cat command
        with open(filename, 'wb') as fp:
            subprocess.call(['cat', ] + chunk_file_list, stdout=fp, stderr=fp)
    def clean_up(self):
        for chunk in self.opened_files.values():
            chunk.close()
            os.unlink(chunk.name)
def main():
    boss = ChunkBoss()
    with open('bigfile.data') as fp:
        for line in fp:
            data = line.strip().split()
            current_id = data[3].split("$")[0]
            value = data[7]
            # Write value to temp chunk
            boss.write_chunk(current_id, value)
    boss.cat_result('result.txt')
    boss.clean_up()
if __name__ == '__main__':
    main()
I tested the performance of my script, with bigfile.data containing about 150k lines. It took about 0.5s to finish on my laptop. Maybe you can give it a try.
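One likely cost in the question's code is np.append, which returns a new array on every call, so each block is copied again and again as it grows. The sketch below keeps the question's grouping (gene first, then block identifier) but accumulates plain lists. group_values and write_grouped are hypothetical names, the values are kept as the original strings rather than converted to float, and for 1.5 billion lines the result may not fit in memory, in which case the temp-file approach above is the safer choice:

```python
import io
from collections import defaultdict


def group_values(lines):
    """Map gene -> block identifier -> list of column-8 values (as strings)."""
    genes = defaultdict(lambda: defaultdict(list))
    for line in lines:
        fields = line.split()
        if len(fields) < 8:
            continue  # skip blank or malformed lines
        gene = fields[3].split("$")[0]
        block = "$".join([fields[3], fields[0], fields[1], fields[2]])
        # list.append is amortized O(1); no array is copied here.
        genes[gene][block].append(fields[7])
    return genes


def write_grouped(genes, out):
    """Write '>gene' headers followed by the values, block by block."""
    for gene in sorted(genes):
        out.write(">" + gene + "\n")
        for block in sorted(genes[gene]):
            out.write("\n".join(genes[gene][block]) + "\n")


sample = [
    "1 67108547 67109226 gene1$transcript1 0 + 1 0",
    "1 67108547 67109226 gene1$transcript1 0 + 2 1",
    "1 33547109 33557650 gene2$transcript1 0 + 239 2",
    "1 69109226 69109999 gene1$transcript1 0 + 351 1",
]
out = io.StringIO()
write_grouped(group_values(sample), out)
print(out.getvalue())
```

Passing an open file object instead of the sample list and the StringIO buffer is enough to run it on real data.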
