Re-implement __eq__ to compare sets with symmetric_difference in Python

I have a set of filenames coming from two different directories.
currList=set(['pathA/file1', 'pathA/file2', 'pathB/file3', etc.])
My code processes the files and needs to update currList
by comparing it to its contents from the previous iteration, say processList.
For that, I compute a symmetric difference:
toProcess=set(currList).symmetric_difference(set(processList))
Actually, I need the symmetric_difference to operate on the basename (file1...) not
on the complete filename (pathA/file1).
I guess I need to reimplement the __eq__ operator, but I have no clue how to do that in Python.
Is reimplementing __eq__ the right approach, or is there a better/equivalent one?
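One note on the __eq__ idea before the answers: you can't override __eq__ on the built-in str, and set operations look elements up via __hash__ before ever calling __eq__, so this route requires a small wrapper class that defines both consistently. A minimal sketch, with ByBasename as a hypothetical name:

class ByBasename(object):
    """Hypothetical wrapper that compares paths by basename only."""
    def __init__(self, path):
        self.path = path
        self.base = path.split('/')[-1]
    # Sets consult __hash__ first, then __eq__, so both must agree.
    def __eq__(self, other):
        return self.base == other.base
    def __ne__(self, other):  # needed on Python 2
        return not self == other
    def __hash__(self):
        return hash(self.base)

toProcess = {ByBasename(p) for p in currList}.symmetric_difference(
    ByBasename(p) for p in processList)
paths = set(w.path for w in toProcess)

The answers below sidestep the wrapper entirely, which is usually simpler.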

Here is a token (and likely poorly constructed) itertools version that should run a little faster if speed ever becomes a concern (although I agree that @Zarkonnen's one-liner is pretty sweet, so +1 there :) ).
from itertools import ifilter  # Python 2; on Python 3 the built-in filter is lazy

currList = set(['pathA/file1', 'pathA/file2', 'pathB/file3'])
processList = set(['pathA/file1', 'pathA/file9', 'pathA/file3'])

# This can also be a lambda inside the map calls - the speed stays the same
def FileName(f):
    return f.split('/')[-1]

# diff will be a set of basenames (no path) that is checked during
# the ifilter pass
curr = map(FileName, currList)
process = map(FileName, processList)
diff = set(curr).symmetric_difference(set(process))

# This keeps only the elements of the full-path symmetric difference
# whose basename is in the diff set
results = set(ifilter(lambda x: x.split('/')[-1] in diff,
                      currList.symmetric_difference(processList)))

You can do this with the magic of generator expressions.
def basename(x):
    return x.split("/")[-1]

result = set(x for x in set(currList).union(set(processList))
             if (basename(x) in [basename(y) for y in currList])
             != (basename(x) in [basename(y) for y in processList]))
should do the trick. It gives you all the elements X that appear in one list or the other, and whose basename-presence in the two lists is not the same.
Edit:
Running this with:
currList=set(['pathA/file1', 'pathA/file2', 'pathB/file3'])
processList=set(['pathA/file1', 'pathA/file9', 'pathA/file3'])
returns:
set(['pathA/file2', 'pathA/file9'])
which would appear to be correct.

Related

For loops and conditionals in Python

I am new to Python and I was wondering if there was a way I could shorten/optimise the below loops:
for breakdown in data_breakdown:
    for data_source in data_source_ids:
        for camera in camera_ids:
            if (camera.get("id") == data_source.get("parent_id")) and (data_source.get("id") == breakdown.get('parent_id')):
                for res in result:
                    if res.get("camera_id") == camera.get("id"):
                        res.get('data').update({breakdown.get('name'): breakdown.get('total')})
I tried this one-liner, but it doesn't seem to work:
res.get('data').update({breakdown.get('name'): breakdown.get('total')}) for camera in camera_ids if (camera.get("id") == data_source.get("parent_id")) and (data_source.get("id") == breakdown.get('parent_id'))
You can use itertools.product to handle the nested loops for you, and I think (although I'm not sure because I can't see your data) you can skip all the .get and .update and just use the [] operator:
from itertools import product

for b, d, c in product(data_breakdown, data_source_ids, camera_ids):
    if c["id"] != d["parent_id"] or d["id"] != b["parent_id"]:
        continue
    for res in result:
        if res["camera_id"] == c["id"]:
            res['data'][b['name']] = b['total']
If anything, to optimize the performance of those loops you should make them longer and more nested, with the data_source.get("id") == breakdown.get('parent_id') check happening before you ever enter the camera loop, as in the sketch below.
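A rough sketch of that reordering, reusing the question's names (untested against your data):

for breakdown in data_breakdown:
    for data_source in data_source_ids:
        # Hoisted: this check no longer repeats once per camera
        if data_source.get("id") != breakdown.get("parent_id"):
            continue
        for camera in camera_ids:
            if camera.get("id") != data_source.get("parent_id"):
                continue
            for res in result:
                if res.get("camera_id") == camera.get("id"):
                    res["data"][breakdown["name"]] = breakdown["total"]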
But there is perhaps an alternative, where you change the structure of your data so that you don't need to loop nearly as much to find matching ID values. Convert each of your current lists (of dicts) into a single dict whose keys are the 'id' values you'll be trying to match in that loop, with the whole dict as the value.
sources_dict = {source.get("id"): source for source in data_source_ids}
cameras_dict = {camera.get("id"): camera for camera in camera_ids}
results_dict = {res.get("camera_id"): res for res in result}
Now the whole loop only needs one level:
for breakdown in data_breakdown:
    source = sources_dict[breakdown["parent_id"]]
    camera = cameras_dict[source["parent_id"]]
    res = results_dict[camera["id"]]
    res["data"][breakdown["name"]] = breakdown["total"]
This code assumes that all the lookups with get in your current code were going to succeed in getting a value. You weren't actually checking if any of the values you were getting from a get call was None, so there probably wasn't much benefit to it.
I'd further note that it's not clear whether the camera loop in your original code was necessary at all. You might be able to skip it and directly compare data_source['parent_id'] against res['camera_id'], without comparing them both to a camera['id'] in between. In my updated version, that translates to leaving out the creation of cameras_dict and indexing results_dict directly with source["parent_id"], rather than finding camera first, as sketched below.
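In code, that shortened version might look like this (a sketch that assumes data_source['parent_id'] really does line up with res['camera_id']):

sources_dict = {source["id"]: source for source in data_source_ids}
results_dict = {res["camera_id"]: res for res in result}

for breakdown in data_breakdown:
    source = sources_dict[breakdown["parent_id"]]
    # Skip the camera hop: index the results directly by the source's parent_id
    res = results_dict[source["parent_id"]]
    res["data"][breakdown["name"]] = breakdown["total"]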

Lambda that searches list and increments

Using a Python lambda, can you check whether an element exists in another list (of maps) and also increment a variable? I'm attempting to optimise/refactor my code using a lambda, but I've gone and confused myself.
Below is my existing code that I want to convert to a lambda. Is it possible to do this using one lambda, or will I need two? Any advice on how I can convert it?
current_orders = auth.get_orders()
# returns [{'id': 'foo', 'price': 1.99, ...}, ...]
deleted_orders = auth.cancel_orders()
# returns ids of all cancelled orders: [{'id': 'foo'}, {'id': 'bar'}, ...]

# Attempting to convert to lambda
n_deleted = 0
for del_order in deleted_orders:
    for order in current_orders:
        if del_order['id'] == order['id']:
            n_deleted += 1

# lambda
n_deleted = filter(lambda order, n: n += order['id'] in current_orders, deleted_orders)
# end

if n_deleted != len(orders):
    logger.error("Failed to cancel all limit orders")
Note: I know I can say if len(deleted_orders) < len(current_orders): logger.error("Failed to delete ALL orders") but I want to expand my lambda eventually to say ...: logger.error("Failed to delete ORDER with ID: %s")
You can't use += (or assignment of any kind) in a lambda at all, and using filter for side-effects is a terrible idea (this pattern looks kind of like how reduce is used, but it's hard to tell what you're trying to do).
It looks like you're trying to count how many order['id'] values appear in current_orders. You shouldn't use a lambda for this at all. To improve efficiency, extract the ids from each list as a set and use set operations to check that the same ids appear in both:
from future_builtins import map  # Only on Py2, to get generator-based map
from operator import itemgetter

# ... rest of your code ...

getid = itemgetter('id')

# Creating the sets requires a single linear pass, and comparison is
# roughly linear as well; your original code had quadratic performance.
if set(map(getid, current_orders)) != set(map(getid, deleted_orders)):
    logger.error("Failed to cancel all limit orders")
If you want to know which orders weren't canceled, a slight tweak works, replacing the if check and logger output with:
for oid in set(map(getid, current_orders)).difference(map(getid, deleted_orders)):
    logger.error("Failed to cancel order ID %s", oid)
If you want the error logs ordered by oid, wrap the set.difference call in sorted; if you want them in the same order as current_orders, change to:
from itertools import filterfalse  # On Py2, it's ifilterfalse

# Could inline the deletedids creation in filterfalse if you prefer; frozenset optional
deletedids = frozenset(map(getid, deleted_orders))
for oid in filterfalse(deletedids.__contains__, map(getid, current_orders)):
    logger.error("Failed to cancel order ID %s", oid)
It is possible to hack around it, but lambdas should not mutate; they should return a new result. Also, you shouldn't overcomplicate lambdas: they are meant for short, quick functions, such as a key for a sort method (see the sort-key example after the code below).
Probably you should be using a comprehension instead, e.g.:
current_order_ids = {order['id'] for order in current_orders}
not_del = [order for order in deleted_orders if order['id'] not in current_order_ids]
for order in not_del:
    logger.error("Failed to delete ORDER with ID: %s", order['id'])

Creating a list of dictionaries

I have code that generates a list of 28 dictionaries. It cycles through 28 files and links data points from each file in the appropriate dictionary. To make my code more flexible, I wanted to use:
tegDics = [dict() for x in range(len(files))]
But when I run the code the first 27 dictionaries are blank and only the last, tegDics[27], has data. Below is the code including the clumsy, yet functional, code I'm having to use that generates the dictionaries:
x = 0
import os
files = os.listdir("DirPath")
os.chdir("DirPath")
tegDics = [{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{}]  # THIS WORKS!!!
#tegDics = [dict() for x in range(len(files))] - THIS WON'T WORK!!!
allRads = []
while x < len(tegDics):  # now builds dictionaries
    for line in open(files[x]):
        z = line.split('\t')
        allRads.append(z[2])
        tegDics[x][z[2]] = z[4]  # pairs catNo with locNo
    x += 1
Does anybody know why the more elegant code doesn't work?
Since you're using x within the list comprehension, it will no longer be zero by the time you reach the while loop - it will be len(files)-1 instead. I suggest changing the variable you use to something else. It's traditional to use a single underscore for a value you don't care about.
tegDics = [dict() for _ in range(len(files))]
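A quick demonstration of the leak (Python 2 semantics; Python 3 gives the comprehension its own scope, so this particular collision disappears there):

x = 0
dics = [dict() for x in range(5)]
print(x)  # Python 2 prints 4 - the comprehension's x leaked out; Python 3 prints 0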
It could be useful to eliminate your use of x entirely. It's customary in Python to iterate directly over the objects in a sequence rather than using a counter variable. You might do something like:
for tegDic in tegDics:
    # do stuff with tegDic here
It's slightly trickier in your case, since you want to iterate through tegDics and files at the same time. You can use zip to do that.
import os

files = os.listdir("DirPath")
os.chdir("DirPath")
tegDics = [dict() for _ in range(len(files))]
allRads = []
for file, tegDic in zip(files, tegDics):
    for line in open(file):
        z = line.split('\t')
        allRads.append(z[2])
        tegDic[z[2]] = z[4]  # pairs catNo with locNo
Finally, a warning about a tempting shortcut: tegDics = [{}]*len(files) looks simplest of all, but it creates len(files) references to one and the same dictionary, so every file's data would end up in a single dict.
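A quick demonstration of that aliasing:

dics = [{}] * 3
dics[0]['key'] = 'value'
# All three slots are the SAME dict object:
print(dics)  # [{'key': 'value'}, {'key': 'value'}, {'key': 'value'}]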

Python 3 - cumulative functions alternatives

I was wondering if there is a more Pythonic, or alternative, way to do this. I want to compare the results of cumulative functions: each function modifies the output of the previous one, and I would like to see the effect after each function is added. Beware that in order to get the actual results after running the main functions, one last function is needed to calculate something. In code, it looks like this (just kind of pseudocode):
for textfile in path:
    data = doStuff1(textfile)
calculateandPrint()

for textfile in path:
    data = doStuff1(textfile)
    data = doStuff2(data)
calculateandPrint()

for textfile in path:
    data = doStuff1(textfile)
    data = doStuff2(data)
    data = doStuff3(data)
calculateandPrint()
As you can see, for n functions I would need n(n+1)/2 manually written loops. Is there, like I said, something more Pythonic (for example, a list of functions) that would clean up the code and keep it short and manageable as more and more functions are added?
The actual code, where documents is a custom object:
for doc in documents:
    doc.list_strippedtext = prepareData(doc.text)
bow = createBOW(documents)

for doc in documents:
    doc.list_strippedtext = prepareData(doc.text)
    doc.list_strippedtext = preprocess(doc.list_strippedtext)
bow = createBOW(documents)

for doc in documents:
    doc.list_strippedtext = prepareData(doc.text)
    doc.list_strippedtext = preprocess(doc.list_strippedtext)
    doc.list_strippedtext = abbreviations(doc.list_strippedtext)
bow = createBOW(documents)
This is only a small part; more functions need to be added.
You could define a set of chains, applied with functools.reduce()
from functools import reduce

chains = (
    (doStuff1,),
    (doStuff1, doStuff2),
    (doStuff1, doStuff2, doStuff3),
)

for textfile in path:
    for chain in chains:
        data = reduce(lambda data, func: func(data), chain, textfile)
        calculateandPrint(data)
The reduce() call effectively does func3(func2(func1(textfile))) if chain contained 3 functions.
I assumed here that you wanted to apply calculateandPrint() per textfile in path after the chain of functions has been applied.
Each iteration of the for chain in chains loop represents one of your doStuffx loop bodies in your original example, but we only loop through for textfile in path once.
You can also swap the loops; adjusting to your example:
for chain in chains:
    for doc in documents:
        doc.list_strippedtext = reduce(lambda data, func: func(data), chain, doc.text)
    bow = createBOW(documents)

Python generator expression if-else

I am using Python to parse a large file. What I want to do is
if condition == True:
    append to list A
else:
    append to list B
I want to use generator expressions for this, to save memory. Here is the actual code:
def is_low_qual(read):
    lowqual_bp = (bq for bq in phred_quals(read) if bq < qual_threshold)
    if iter_length(lowqual_bp) > num_allowed:
        return True
    else:
        return False

lowqual = (read for read in SeqIO.parse(r_file, "fastq") if is_low_qual(read) == True)
highqual = (read for read in SeqIO.parse(r_file, "fastq") if is_low_qual(read) == False)

SeqIO.write(highqual, flt_out_handle, "fastq")
SeqIO.write(lowqual, junk_out_handle, "fastq")

def iter_length(the_gen):
    return sum(1 for i in the_gen)
You can use itertools.tee in conjunction with itertools.ifilter and itertools.ifilterfalse:
import itertools

def is_condition_true(x):
    ...

gen1, gen2 = itertools.tee(sequences)
low = itertools.ifilter(is_condition_true, gen1)
high = itertools.ifilterfalse(is_condition_true, gen2)
Using tee ensures that the function works correctly even if sequences is itself a generator.
Note, though, that tee could itself use a fair bit of memory (up to a list of size len(sequences)) if low and high are consumed at different rates (e.g. if low is exhausted before high is used).
I think you're striving to avoid iterating over your collection twice. If so, this type of approach works:
high, low = [], []
_Nones = [high.append(x) if is_condition_true(x) else low.append(x) for x in sequences]

This is probably inadvisable, because it uses a list comprehension purely for its side effects; that's generally considered anti-Pythonic.
Just to add a more general answer: If your main concern is memory, you should use one generator that loops over the whole file, and handle each item as low or high as it comes. Something like:
for r in sequences:
    if condition_true(r):
        handle_low(r)
    else:
        handle_high(r)
If you need to collect all high/low elements before using either, then you can't guard against a potential memory hit. The reason is that you can't know which elements are high/low until you read them. If you have to process low first, and it turns out all the elements are actually high, you have no choice but to store them in a list as you go, which will use memory. Doing it with one loop allows you to handle each element one at a time, but you have to balance this against other concerns (i.e., how cumbersome it is to do it this way, which will depend on exactly what you're trying to do with the data).
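Adapted to the question's names, a sketch of that single pass which still collects both groups before writing:

low, high = [], []
for read in SeqIO.parse(r_file, "fastq"):
    # Each read is classified exactly once, in one pass over the file
    (low if is_low_qual(read) else high).append(read)
SeqIO.write(high, flt_out_handle, "fastq")
SeqIO.write(low, junk_out_handle, "fastq")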
This is surprisingly difficult to do elegantly. Here's something that works:
from itertools import tee, ifilter, ifilterfalse
low, high = [f(condition, g) for f, g in zip((ifilter, ifilterfalse), tee(seq))]
Note that as you consume items from one resulting iterator (say low), the internal deque in tee will have to expand to contain any items that you have not yet consumed from high (including, unfortunately, those which ifilterfalse will reject). As such this might not save as much memory as you're hoping.
Here's an implementation that uses as little additional memory as possible:
from collections import deque

def filtertee(func, iterable, codomain=(False, True)):
    it = iter(iterable)
    deques = dict((r, deque()) for r in codomain)
    def gen(mydeque):
        while True:
            while not mydeque:               # as long as the local deque is empty
                try:
                    newval = next(it)        # fetch a new value,
                except StopIteration:
                    return                   # source exhausted; explicit return for PEP 479
                result = func(newval)        # find its image under `func`,
                try:
                    d = deques[result]       # find the appropriate deque, and
                except KeyError:
                    raise ValueError("func returned value outside codomain")
                d.append(newval)             # add it.
            yield mydeque.popleft()
    return dict((r, gen(d)) for r, d in deques.items())
This returns a dict from the codomain of the function to a generator providing the items that take that value under func:
gen = filtertee(condition, seq)
low, high = gen[True], gen[False]
Note that it's your responsibility to ensure that condition only returns values in codomain.
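The codomain isn't limited to booleans. A hypothetical three-way split (relying on the explicit StopIteration return added above):

gens = filtertee(lambda n: n % 3, range(10), codomain=(0, 1, 2))
print(list(gens[0]))  # [0, 3, 6, 9]
print(list(gens[1]))  # [1, 4, 7]
print(list(gens[2]))  # [2, 5, 8]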
