operation inside map function in pyspark - python

I want to take the data from the file name (as it contains some information) and write it to a csvfile_info file without using a loop.
I am new to pyspark. Please help me with the code and let me know how I can proceed.
This is what I tried:
Code:
c = os.path.join("-------")
input_file = sc.textFile(fileDir)
file1 = input_file.split('_')
csvfile_info = open(c, 'a')
details = file1.map(lambda p:
    name=p[0],
    id=p[1],
    from_date=p[2],
    to_date=p[3],
    TimestampWithExtension=p[4]\
    file_timestamp=TimestampWithExtension.split('.')[0]\
    info = '{0},{1},{2},{3},{4},{5} \n'.\
        format(name,id,from_date,to_date,file_timestamp,input_file)\
    csvfile_info.write(info)
)

Don't try to write the data inside of the map() function. You should instead map each record to the appropriate string, and then dump the resultant RDD to a file. Try this:
input_file = sc.textFile(fileDir)  # returns an RDD

def map_record_to_string(x):
    p = x.split('_')
    name = p[0]
    id = p[1]
    from_date = p[2]
    to_date = p[3]
    TimestampWithExtension = p[4]
    file_timestamp = TimestampWithExtension.split('.')[0]
    info = '{0},{1},{2},{3},{4},{5} \n'.format(
        name,
        id,
        from_date,
        to_date,
        file_timestamp,
        input_file
    )
    return info

details = input_file.map(map_record_to_string)  # returns a different RDD
details.saveAsTextFile("path/to/output")
Note: I haven't tested this code, but this is one approach you could take.
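A side note: saveAsTextFile() writes one part-file per output partition rather than a single CSV. If one file is required and the result is small enough to fit on a single worker, the RDD can be collapsed to one partition first; a minimal sketch (the output path below is just a placeholder):
# Collapse to a single partition so Spark writes a single part-file, then save.
details.coalesce(1).saveAsTextFile("path/to/output_single")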
Explanation
From the docs, input_file = sc.textFile(fileDir) will return an RDD of strings with the file contents.
All of the operations you want to do are on the contents of the RDD, the elements of the file. Calling split() on the RDD doesn't make sense, because split() is a string function. What you want to do instead is call split() and the other operations on each record (line in the file) of the RDD. This is exactly what map() does.
An RDD is like an iterable, but you don't operate on it with a traditional loop. It's an abstraction that allows for parallelization. From the user's perspective the map(f) function applies the function f to each element in the RDD, as it would be done in a loop. Functionally calling input_file.map(f) is equivalent to the following:
# let rdd_as_list be a list of strings containing the contents of the file
map_output = []
for record in rdd_as_list:
    map_output.append(f(record))
Or equivalently:
# let rdd_as_list be a list of strings containing the contents of the file
map_output = map(f, rdd_as_list)
Calling map() on an RDD returns a new RDD, whose contents are the results of applying the function. In this case, details is a new RDD and it contains the rows of input_file after they have been processed by map_record_to_string.
You could have also written the map() step as details = input_file.map(lambda x: map_record_to_string(x)) if that makes it easier to understand.

Related

how do i convert ['121341']['132324'] (type string) into 2 separate lists python

I am trying to add some file-operation capabilities to an example program, and I am struggling with reading from the file. Here is the modified code:
def read(fn):
    fileout = open(f"{fn}", "a+")
    fileout.seek(0, 0)
    s = fileout.readlines()
    if s == []:
        print("the file specified does not appear to exists or is empty. if the file does not exist, it will be created")
    else:
        last = s[-1]
        print(last)
        print(type(last))
        convert(last)

def find(last):
    tup = last.partition(".")
    fi = tup[0:1]
    return fi[0]

def convert(last):
    tup = last.partition(".")
    part = tup[2:]
    print(part)
    part = part[0]
    print(part)
    part = part.split("\n")
    print(part)
    part = part[0]
    print(part)
    print(type(part))

# __main__
file(fn)
The write functionality writes in the form of
(fileindex number).[(planned campaign)][(conducted campaign)]
Example: some random data written to the file by the program (the first two numbers are dates):
0.['12hell']['12hh']
1.['12hell']['12hh']
2.['121341']['132324']
But I am struggling to write the read function; I don't understand how I could convert the data back.
With the current read function I get back
['121341']['132324']
as a string. I have brainstormed many ideas but could not figure out how to convert the string to lists (they need to be 2 separate lists).
Edit: the flaw was actually in the format that I was writing in; I added a , between the two lists and used eval as suggested in an answer, thanks.
Insert a ',' in between the brackets, then use eval. This will return a tuple of lists.
strLists = "['121341']['132324']['abcdf']"
strLists = strLists.replace('][', '],[')
evalLists = eval(strLists)
for each in evalLists:
    print(each)
Output:
['121341']
['132324']
['abcdf']
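If the string ever comes from an untrusted source, eval() will happily execute arbitrary code. A safer sketch of the same idea using only the standard library: ast.literal_eval() parses the string as a Python literal without executing it (the names planned, conducted and extra below are just illustrative).
import ast

strLists = "['121341']['132324']['abcdf']"
# Insert commas so the whole string becomes one tuple-of-lists literal,
# then parse it safely instead of executing it with eval().
planned, conducted, extra = ast.literal_eval(strLists.replace('][', '],['))
print(planned)    # ['121341']
print(conducted)  # ['132324']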

Filtering rows of an RDD in the map phase using pyspark

I am filtering a dataset using this code in pyspark:
rdd = sc.textFile("location...").map(lambda line: line.split(",")). \
    filter(lambda line: condition...)
My problem is this: in my pseudo-code for the solution, the filtering of lines that don't meet my condition can be done in the map phase, and the whole dataset thus parsed once. In this case, however, the dataset is parsed twice, which is more expensive.
Is there a way to do this with one parse?
In your code the map step is done before the filtering. If you want to optimize this and the output of your mapping function is not needed for the filtering, it is advisable to filter before mapping; that way you reduce the number of input elements passed to the map function.
Filtering before mapping:
rdd = sc.textFile("location...").filter(lambda line: condition...). \
    map(lambda line: line.split(","))
Also, if you want to put some filtering logic in the mapping function, that can be done; however, in this case you need to filter out the None elements at the end.
words = sc.parallelize(
    ["my code",
     "java is a required",
     "hadoop is a framework",
     "spark",
     "akka",
     "spark vs hadoop",
     "pyspark",
     "pyspark and spark"]
)

def mapCondition(line):
    if line.startswith("p"):
        return line

tokenized = words.map(lambda line: mapCondition(line))
print tokenized.collect()
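As noted above, lines that fail the condition come back as None, so a final filter is needed. A minimal sketch of that last step, continuing from the tokenized RDD above:
# Drop the None entries that mapCondition returns for non-matching lines.
tokenized = tokenized.filter(lambda x: x is not None)
print(tokenized.collect())  # only the lines starting with "p" remain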
You can operate directly on line.split(",") in the filter lambda function. For example, you can compare the string before the first comma as below:
rdd = sc.textFile("location...").filter(lambda line: line.split(",")[0] == "string")

map() with partial arguments: save up space

I have a very large list of dictionaries, whose keys are a triple of (string, float, string) and whose values are again lists.
cols_to_aggr is basically a list(defaultdict(list))
I wish I could pass to my function _compute_aggregation not just the list index i but only the data contained at that index, namely cols_to_aggr[i], instead of passing the whole data structure cols_to_aggr and having to extract the smaller chunk inside my parallelized function.
The problem is that passing the whole data structure causes my Pool to eat up all my memory with no efficiency at all.
with multiprocessing.Pool(processes=n_workers, maxtasksperchild=500) as pool:
    results = pool.map(
        partial(_compute_aggregation, cols_to_aggr=cols_to_aggr,
                aggregations=aggregations, pivot_ladetag_pos=pivot_ladetag_pos,
                to_ix=to_ix), cols_to_aggr)

def _compute_aggregation(index, cols_to_aggr, aggregations, pivot_ladetag_pos, to_ix):
    data_to_process = cols_to_aggr[index]
To patch my memory issue I tried to set maxtasksperchild, but without success; I have no clue how to set it optimally.
Using dict.values(), you can iterate over the values of a dictionary.
So you could change your code to:
with multiprocessing.Pool(processes=n_workers, maxtasksperchild=500) as pool:
    results = pool.map(
        partial(_compute_aggregation,
                aggregations=aggregations, pivot_ladetag_pos=pivot_ladetag_pos,
                to_ix=to_ix), cols_to_aggr.values())

def _compute_aggregation(value, aggregations, pivot_ladetag_pos, to_ix):
    data_to_process = value
If you still need the keys in your _compute_aggregation function, use dict.items() instead.
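For completeness, a minimal sketch of the dict.items() variant mentioned above (assuming cols_to_aggr really is a dict; each worker then receives one (key, value) pair):
def _compute_aggregation(item, aggregations, pivot_ladetag_pos, to_ix):
    # item is a (key, value) tuple; unpack it to recover both pieces.
    key, data_to_process = item

with multiprocessing.Pool(processes=n_workers, maxtasksperchild=500) as pool:
    results = pool.map(
        partial(_compute_aggregation,
                aggregations=aggregations, pivot_ladetag_pos=pivot_ladetag_pos,
                to_ix=to_ix), cols_to_aggr.items())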

Python 3 - cumulative functions alternatives

I was wondering if there is a more pythonic, or alternative, way to do this. I want to compare the results of cumulative functions. Each function modifies the output of the previous one, and I would like to see, after each function, what the effect is. Note that in order to get the actual results after running the main functions, one last function is needed to calculate something. In code, it looks like this (just pseudocode):
for textfile in path:
    data = doStuff1(textfile)
calculateandPrint()

for textfile in path:
    data = doStuff1(textfile)
    data = doStuff2(data)
calculateandPrint()

for textfile in path:
    data = doStuff1(textfile)
    data = doStuff2(data)
    data = doStuff3(data)
calculateandPrint()
As you can see, for n functions I would need n(n+1)/2 manually written loops. Is there, as I said, something more pythonic (for example a list of functions?) that would clean up the code and keep it short and manageable as more and more functions are added?
The actual code, where documents is a custom object:
for doc in documents:
    doc.list_strippedtext = prepareData(doc.text)
bow = createBOW(documents)

for doc in documents:
    doc.list_strippedtext = prepareData(doc.text)
    doc.list_strippedtext = preprocess(doc.list_strippedtext)
bow = createBOW(documents)

for doc in documents:
    doc.list_strippedtext = prepareData(doc.text)
    doc.list_strippedtext = preprocess(doc.list_strippedtext)
    doc.list_strippedtext = abbreviations(doc.list_strippedtext)
bow = createBOW(documents)
While this is only a small part, more functions need to be added.
You could define a set of chains, applied with functools.reduce()
from functools import reduce

chains = (
    (doStuff1,),
    (doStuff1, doStuff2),
    (doStuff1, doStuff2, doStuff3),
)

for textfile in path:
    for chain in chains:
        data = reduce(lambda data, func: func(data), chain, textfile)
        calculateandPrint(data)
The reduce() call effectively does func3(func2(func1(textfile))) if chain contained 3 functions.
I assumed here that you wanted to apply calculateandPrint() per textfile in path after the chain of functions has been applied.
Each iteration of the for chain in chains loop represents one of your doStuffx loop bodies in your original example, but we only loop through for textfile in path once.
You can also swap the loops; adjusting to your example:
for chain in chains:
    for doc in documents:
        doc.list_strippedtext = reduce(lambda data, func: func(data), chain, doc.text)
    bow = createBOW(documents)
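Since the goal was to keep this manageable as more functions are added, the chains tuple does not have to be written out by hand either; a small sketch (using the same doStuff functions) that builds the growing prefixes from a single list:
funcs = [doStuff1, doStuff2, doStuff3]  # extend this list as new steps are added
# Each chain is a prefix of funcs: (doStuff1,), (doStuff1, doStuff2), ...
chains = [tuple(funcs[:i + 1]) for i in range(len(funcs))]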

How can I pass the line I'm iterating over into a sort/sorted function in a for loop in Python?

I'm trying to loop over a number of log files and need to sort the file entries (lines) across all of the files being looped over.
This is what I'm doing:
import glob
import fileinput

f = glob.glob('logs/')
for line in sorted(fileinput.input(f), key=stringsplit(line)):
    print line
So, I'm opening all files and then want to use the stringsplit function (which extracts a date from the file entry) as sorting criteria.
Problem is, doing this gives me an error saying:
name 'line' is not defined
Question:
Is it not possible to pass the line being looped over as a parameter into a sorting function? How can this be done?
Thanks!
Try key=lambda line: stringsplit(line).
The sorting is done before you start iterating in the for-loop.
The key keyword must be a callable. It is called for every entry in the input sequence.
A lambda is an easy way to create such a callable:
sorted(..., key=lambda line: stringsplit(line))
I would be extremely wary of sorting the output of fileinput with many, large files though. sorted() must read all lines into memory to be able to sort them. If your files are many and / or large, you'll use up all memory, eventually leading to a MemoryError exception.
Use a different method to pre-sort your logs first. You can use the UNIX tool sort, or an external sorting technique instead.
If your input files are already sorted, you can merge them using the same key:
import operator

def mergeiter(*iterables, **kwargs):
    """Given a set of sorted iterables, yield the next value in merged order"""
    iterables = [iter(it) for it in iterables]
    iterables = {i: [next(it), i, it] for i, it in enumerate(iterables)}
    if 'key' not in kwargs:
        key = operator.itemgetter(0)
    else:
        key = lambda item, key=kwargs['key']: key(item[0])
    while True:
        value, i, it = min(iterables.values(), key=key)
        yield value
        try:
            iterables[i][0] = next(it)
        except StopIteration:
            del iterables[i]
            if not iterables:
                raise
then pass in your open file objects:
files = [open(f) for f in glob.glob('logs/*')]
for line in mergeiter(*files, key=lambda line: stringsplit(line)):
    # lines are looped over in merged order.
but you need to make certain that the stringsplit() function returns values as they are ordered in the input log files.
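For what it's worth, the standard library offers the same sorted-merge behaviour: heapq.merge() accepts a key argument (since Python 3.5) and yields the merged lines lazily. A minimal sketch, assuming the same stringsplit() key function and already-sorted input files:
import glob
import heapq

files = [open(f) for f in glob.glob('logs/*')]
# heapq.merge lazily yields lines from the already-sorted files in merged order.
for line in heapq.merge(*files, key=stringsplit):
    print(line, end='')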
