I was wondering if there is a more Pythonic, or alternative, way to do this. I want to compare the results of cumulative functions. Each function modifies the output of the previous one, and I would like to see, after each of the functions, what the effect is. Note that in order to get the actual results after running the main functions, one last function is needed to calculate something. In code it looks like this (just pseudocode):
for textfile in path:
    data = doStuff1(textfile)
calculateandPrint()

for textfile in path:
    data = doStuff1(textfile)
    data = doStuff2(data)
calculateandPrint()

for textfile in path:
    data = doStuff1(textfile)
    data = doStuff2(data)
    data = doStuff3(data)
calculateandPrint()
As you can see, for n functions I would need n(n+1)/2 manually written loops. Is there, like I said, something more Pythonic (for example, a list of functions?) that would clean up the code and make it much shorter and more manageable as more and more functions are added?
The actual code, where documents is a custom object:
for doc in documents:
    doc.list_strippedtext = prepareData(doc.text)
bow = createBOW(documents)

for doc in documents:
    doc.list_strippedtext = prepareData(doc.text)
    doc.list_strippedtext = preprocess(doc.list_strippedtext)
bow = createBOW(documents)

for doc in documents:
    doc.list_strippedtext = prepareData(doc.text)
    doc.list_strippedtext = preprocess(doc.list_strippedtext)
    doc.list_strippedtext = abbreviations(doc.list_strippedtext)
bow = createBOW(documents)
While this is only a small part, more functions need to be added.
You could define a set of chains, applied with functools.reduce():

from functools import reduce

chains = (
    (doStuff1,),
    (doStuff1, doStuff2),
    (doStuff1, doStuff2, doStuff3),
)

for textfile in path:
    for chain in chains:
        data = reduce(lambda data, func: func(data), chain, textfile)
        calculateandPrint(data)
The reduce() call effectively does func3(func2(func1(textfile))) if chain contained 3 functions.
I assumed here that you wanted to apply calculateandPrint() per textfile in path after the chain of functions has been applied.
Each iteration of the for chain in chains loop represents one of your doStuffx loop bodies in your original example, but we only loop through for textfile in path once.
You can also swap the loops; adjusting to your example:
for chain in chains:
    for doc in documents:
        doc.list_strippedtext = reduce(lambda data, func: func(data), chain, doc.text)
    bow = createBOW(documents)
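If every chain is just a longer prefix of the same pipeline, you could also generate the chains from a single list of functions rather than writing each tuple out by hand. A minimal sketch using the function names from the question:

from functools import reduce

# One pipeline, in the order the steps should be applied
pipeline = [prepareData, preprocess, abbreviations]

# Prefix chains: (prepareData,), (prepareData, preprocess), ...
chains = [tuple(pipeline[:i]) for i in range(1, len(pipeline) + 1)]

for chain in chains:
    for doc in documents:
        doc.list_strippedtext = reduce(lambda data, func: func(data), chain, doc.text)
    bow = createBOW(documents)

Adding another step then only means appending one function to pipeline.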
I have a function called plot_il_ih that receives two data frames in order to generate a plot. I also have a set of folders, each containing a .h5 file with the data I need to give to plot_il_ih. I'm trying to feed the function two datasets at a time, but unsuccessfully.
I've been using pathlib to do so:

path = Path("files")
for log in path.glob("log*"):
    for file in log.glob("log*.h5"):
        df = pd.read_hdf(file, key="log")

but using this loop I can only feed one data frame at a time, and I need two of them.
The structure of the folders is something like:

files -> log1 -> log1.h5
         log2 -> log2.h5
         log3 -> log3.h5
         log4 -> log4.h5
I would like to feed the function plot_il_ih the following sequence,
plot_il_ih(dataframeof_log1.h5, dataframeof_log2.h5) then
plot_il_ih(dataframeof_log2.h5, dataframeof_log3.h5) and so on.
I have tried to use zip
def pairwise(iterable):
    a = iter(iterable)
    return zip(a, a)

for l1, l2 in pairwise(list(path.glob('log*'))):
    plot_il_ih(l1, l2)
but it doesn't move forward; it just opens the first two.
What is wrong with my logic?
Consider something like this; you might have to play around with the indexing:
filelist = list(path.glob('log*'))

for i in range(1, len(filelist)):
    print(filelist[i-1])
    print(filelist[i])
    print('\n')
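If the goal is overlapping pairs (log1 with log2, log2 with log3, and so on), zipping the sorted file list against itself shifted by one also works and avoids manual indexing. A rough sketch, assuming the .h5 files are read with pandas.read_hdf and the "log" key from the question:

import pandas as pd
from pathlib import Path

path = Path("files")
# Grab the .h5 files directly and sort them so the logs pair up in order
# (adjust the sort key if there are more than nine logs)
files = sorted(path.glob("log*/log*.h5"))

# zip(files, files[1:]) yields (log1.h5, log2.h5), (log2.h5, log3.h5), ...
for f1, f2 in zip(files, files[1:]):
    plot_il_ih(pd.read_hdf(f1, key="log"), pd.read_hdf(f2, key="log"))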
The example is artificial, but I have run into similar problems many times.
db_file_names = ['f1', 'f2']  # list of database files

def make_report(filename):
    # read the database and prepare some report object
    return report_object
Now I want to construct a dictionary: db_version -> number_of_tables. The report object contains all the information I need.
The dictionary comprehension could look like:
d = {
    make_report(filename).db_version: make_report(filename).num_tables
    for filename in db_file_names
}
This approach sometimes works, but is very inefficient: the report is prepared twice for each database.
To avoid this inefficiency I usually use one of the following approaches:
Use temporary storage:
reports = [make_report(filename) for filename in db_file_names]
d = {r.db_version: r.num_tables for r in reports}
Or use an adaptor generator:
def gen_data():
    for filename in db_file_names:
        report = make_report(filename)
        yield report.db_version, report.num_tables

d = {dat[0]: dat[1] for dat in gen_data()}
But it's usually only after I write some wrong comprehension and think it over that I realize a clean and simple comprehension isn't possible in this case.
The question is: is there a better way to create the required dictionary in such situations?
Since yesterday (when I decided to post this question) I have invented one more approach, which I like more than all the others:
d = {
    report.db_version: report.num_tables
    for filename in db_file_names
    for report in [make_report(filename)]
}
but even this one doesn't look very good.
You can use:
d = {
    r.db_version: r.num_tables
    for r in map(make_report, db_file_names)
}
Note that in Python 3, map gives an iterator, thus there is no unnecessary storage cost.
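On Python 3.8 and later, an assignment expression gives yet another option (an extra possibility, not part of this answer): bind the report once in the key expression and reuse it for the value. A short sketch:

# Python 3.8+: the walrus operator binds the report once (the key expression
# is evaluated before the value, so `report` is available for num_tables)
d = {
    (report := make_report(filename)).db_version: report.num_tables
    for filename in db_file_names
}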
Here's a functional way:
from operator import attrgetter

res = dict(map(attrgetter('db_version', 'num_tables'),
               map(make_report, db_file_names)))
Unfortunately, functional composition is not part of the standard library, but the 3rd party toolz does offer this feature:
from toolz import compose

foo = compose(attrgetter('db_version', 'num_tables'), make_report)
res = dict(map(foo, db_file_names))
Conceptually, you can think of these functional solutions as outputting an iterable of tuples, which can then be fed directly to dict.
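To make that intermediate shape concrete: attrgetter called with several names returns a tuple of the attributes, so each mapped report becomes a (db_version, num_tables) pair that dict() can consume directly. A self-contained toy sketch with a stand-in Report class (purely illustrative):

from operator import attrgetter

class Report:
    def __init__(self, db_version, num_tables):
        self.db_version = db_version
        self.num_tables = num_tables

reports = [Report('9.6', 12), Report('10.4', 7)]

# attrgetter('db_version', 'num_tables')(r) -> (r.db_version, r.num_tables)
pairs = map(attrgetter('db_version', 'num_tables'), reports)
print(dict(pairs))  # {'9.6': 12, '10.4': 7}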
I have a function which accepts two inputs provided by itertools.combinations and outputs a solution. The two inputs should be stored as a tuple forming the key in the dict, while the result is the value.
I can pool this and get all of the results as a list, which I can then insert into a dictionary one-by-one, but this seems inefficient. Is there a way to get the results as each job finishes, and directly add it to the dict?
Essentially, I have the code below:
all_solutions = {}
for start, goal in itertools.combinations(graph, 2):
    all_solutions[(start, goal)] = search(graph, start, goal)
I am trying to parallelize it as follows:
all_solutions = {}
manager = multiprocessing.Manager()
graph_pool = manager.dict(graph)
pool = multiprocessing.Pool()
results = pool.starmap(search, zip(itertools.repeat(graph_pool),
                                   itertools.combinations(graph, 2)))
for i, start_goal in enumerate(itertools.combinations(graph, 2)):
    start, goal = start_goal[0], start_goal[1]
    all_solutions[(start, goal)] = results[i]
Which actually works, but iterates twice, once in the pool, and once to write to a dict (not to mention the clunky tuple unpacking).
This is possible; you just need to switch to a lazy mapping function (not map or starmap, which have to finish computing all the results before you can begin using any of them):
import itertools
import multiprocessing
from functools import partial
from itertools import tee

manager = multiprocessing.Manager()
graph_pool = manager.dict(graph)
pool = multiprocessing.Pool()

# Since you're processing in order and in parallel, tee might help a little
# by only generating the dict keys/search arguments once. That said,
# combinations of n choose 2 are fairly cheap; the overhead of tee's caching
# might overwhelm the cost of just generating the combinations twice
startgoals1, startgoals2 = tee(itertools.combinations(graph, 2))

# Use partial binding of search with graph_pool to be able to use imap
# without a wrapper function; using imap lets us consume results as they become
# available, so the tee-ed generators don't store too many temporaries
results = pool.imap(partial(search, graph_pool), startgoals2)

# Efficiently create the dict from the start/goal pairs and the results of the search.
# This line is eager, so it won't complete until all the results are generated, but
# it will be consuming the results as they become available, in parallel with
# calculating the results
all_solutions = dict(zip(startgoals1, results))
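As a minimal, self-contained illustration of the same dict(zip(keys, imap_results)) pattern, with a toy worker function standing in for search:

import itertools
import multiprocessing
from itertools import tee

def toy_search(pair):
    start, goal = pair
    return abs(start - goal)  # stand-in for a real search result

if __name__ == '__main__':
    nodes = [1, 2, 3, 4]
    keys, args = tee(itertools.combinations(nodes, 2))
    with multiprocessing.Pool() as pool:
        # dict(zip(...)) pulls results from imap as they are produced
        solutions = dict(zip(keys, pool.imap(toy_search, args)))
    print(solutions)  # {(1, 2): 1, (1, 3): 2, (1, 4): 3, (2, 3): 1, (2, 4): 2, (3, 4): 1}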
I want to take the data from the file name (as it contains some info) and write it into the csvfile_info file, without using a loop.
I am new to PySpark. Could someone help me with the code and let me know how I can proceed?
This is what I tried:
c = os.path.join("-------")
input_file = sc.textFile(fileDir)
file1 = input_file.split('_')
csvfile_info = open(c, 'a')

details = file1.map(lambda p:
    name=p[0],
    id=p[1],
    from_date=p[2],
    to_date=p[3],
    TimestampWithExtension=p[4]\
    file_timestamp=TimestampWithExtension.split('.')[0]\
    info = '{0},{1},{2},{3},{4},{5} \n'.\
        format(name, id, from_date, to_date, file_timestamp, input_file)\
    csvfile_info.write(info)
)
Don't try to write the data inside of the map() function. You should instead map each record to the appropriate string, and then dump the resultant rdd to a file. Try this:
input_file = sc.textFile(fileDir)  # returns an RDD

def map_record_to_string(x):
    p = x.split('_')
    name = p[0]
    id = p[1]
    from_date = p[2]
    to_date = p[3]
    TimestampWithExtension = p[4]
    file_timestamp = TimestampWithExtension.split('.')[0]
    info = '{0},{1},{2},{3},{4},{5} \n'.format(
        name,
        id,
        from_date,
        to_date,
        file_timestamp,
        input_file
    )
    return info

details = input_file.map(map_record_to_string)  # returns a different RDD
details.saveAsTextFile("path/to/output")
Note: I haven't tested this code, but this is one approach you could take.
Explanation
From the docs, input_file = sc.textFile(fileDir) will return an RDD of strings with the file contents.
All of the operations you want to do are on the contents of the RDD, the elements of the file. Calling split() on the RDD doesn't make sense, because split() is a string function. What you want to do instead is call split() and the other operations on each record (line in the file) of the RDD. This is exactly what map() does.
An RDD is like an iterable, but you don't operate on it with a traditional loop. It's an abstraction that allows for parallelization. From the user's perspective, the map(f) function applies the function f to each element in the RDD, as would be done in a loop. Functionally, calling input_file.map(f) is equivalent to the following:
# let rdd_as_list be a list of strings containing the contents of the file
map_output = []
for record in rdd_as_list:
    map_output.append(f(record))
Or equivalently:
# let rdd_as_list be a list of strings containing the contents of the file
map_output = map(f, rdd_as_list)
Calling map() on an RDD returns a new RDD, whose contents are the results of applying the function. In this case, details is a new RDD and it contains the rows of input_file after they have been processed by map_record_to_string.
You could have also written the map() step as details = input_file.map(lambda x: map_record_to_string(x)) if that makes it easier to understand.
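For a quick sanity check before writing anything out, you could pull a few mapped records back to the driver with RDD.take(), using the details RDD built above:

# Inspect the first few formatted rows before calling saveAsTextFile()
print(details.take(3))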
I have a set of filenames coming from two different directories.
currList=set(['pathA/file1', 'pathA/file2', 'pathB/file3', etc.])
My code is processing the files and needs to change currList
by comparing it to its content at the previous iteration, say processList.
For that, I compute a symmetric difference:
toProcess=set(currList).symmetric_difference(set(processList))
Actually, I need the symmetric_difference to operate on the basename (file1...) not
on the complete filename (pathA/file1).
I guess I need to reimplement the __eq__ operator, but I have no clue how to do that in Python.
Is reimplementing __eq__ the right approach, or is there another better/equivalent approach?
Here is a token (and likely poorly constructed) itertools version that should run a little bit faster if speed ever becomes a concern (although I agree that @Zarkonnen's one-liner is pretty sweet, so +1 there :) ).
from itertools import ifilter

currList = set(['pathA/file1', 'pathA/file2', 'pathB/file3'])
processList = set(['pathA/file1', 'pathA/file9', 'pathA/file3'])

# This can also be a lambda inside the map functions - the speed stays the same
def FileName(f):
    return f.split('/')[-1]

# diff will be a set of filenames with no path that will be checked during
# the ifilter process
curr = map(FileName, list(currList))
process = map(FileName, list(processList))
diff = set(curr).symmetric_difference(set(process))

# This filters out any elements from the symmetric difference of the two sets
# where the filename is not in the diff set
results = set(ifilter(lambda x: x.split('/')[-1] in diff,
                      currList.symmetric_difference(processList)))
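For the sample sets above, this keeps only the paths whose basenames differ between the two sets, matching the output shown in the edit of the answer below:

print(results)
# set(['pathA/file2', 'pathA/file9'])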
You can do this with the magic of generator expressions.
def basename(x):
    return x.split("/")[-1]

result = set(x for x in set(currList).union(set(processList)) if (basename(x) in [basename(y) for y in currList]) != (basename(x) in [basename(y) for y in processList]))
should do the trick. It gives you all the elements X that appear in one list or the other, and whose basename-presence in the two lists is not the same.
Edit:
Running this with:
currList=set(['pathA/file1', 'pathA/file2', 'pathB/file3'])
processList=set(['pathA/file1', 'pathA/file9', 'pathA/file3'])
returns:
set(['pathA/file2', 'pathA/file9'])
which would appear to be correct.