I have a generator function that creates a Cartesian product of lists. The real application uses more complex objects, but they can be represented by strings:
import itertools
s1 = ['a', 'b']
s2 = ['c', 'd', 'e', 'f']
s3 = ['c', 'd', 'e', 'f']
s4 = ['g']
p = itertools.product(*[s1,s2,s3,s4])
names = [''.join(s) for s in p]
In this example, the result is 32 combinations of characters:
names
['accg', 'acdg', 'aceg', 'acfg', 'adcg', 'addg', 'adeg', 'adfg', 'aecg',
'aedg', 'aeeg', 'aefg', 'afcg', 'afdg', 'afeg', 'affg', 'bccg', 'bcdg',
'bceg', 'bcfg', 'bdcg', 'bddg', 'bdeg', 'bdfg', 'becg', 'bedg', 'beeg',
'befg', 'bfcg', 'bfdg', 'bfeg', 'bffg']
Now, let's say I have some constraints such that certain character combinations are illegal. For example, let's say that only strings that match the regex '[ab].c' are allowed ('a' or 'b', followed by any character, followed by 'c').
After applying these constraints, we are left with a reduced set of just 8 strings:
import re
r = re.compile('[ab].c')
filter(r.match, names)
['accg', 'adcg', 'aecg', 'afcg', 'bccg', 'bdcg', 'becg', 'bfcg']
In the real application the chains are longer, there could be thousands of combinations, and applying the hundreds of constraints is fairly computationally intensive, so I'm concerned about scalability.
Right now I'm going through every single combination and checking its validity. Does an algorithm/data structure exist that could speed up this process?
EDIT:
Maybe this will clarify a little: In the real application I am assembling random 2D pictures of buildings from simple basic blocks (like pillars, roof segments, windows, etc.). The constraints limit what kind of blocks (and their rotations) can be grouped together, so the resulting random image looks realistic, and not a random jumble.
A given constraint can contain many combinations of patterns. But of all those combinations, many are not valid because a different constraint would prohibit some portion of it. So in my example, one constraint would contain the full Cartesian product of characters above. And a second constraint is the '[ab].c'; this second constraint reduces the number of valid combinations of the first constraint that I need to consider.
Because these constraints are difficult to create, I'm looking to visualize what all the combinations of blocks in each constraint look like, but only the valid combinations that pass all constraints. Hence my question. Thanks!
Try just providing the iterator that generates the names directly to filter, like so:
import itertools
import re
s1 = ['a', 'b']
s2 = ['c', 'd', 'e', 'f']
s3 = ['c', 'd', 'e', 'f']
s4 = ['g']
p = itertools.product(*[s1,s2,s3,s4])
r = re.compile('[ab].c')
l = filter(r.search, (''.join(s) for s in p))
print(list(l))
That way, it shouldn't assemble the full set of combinations in memory; it will only keep the ones that fit the criteria. There is probably another way as well.
EDIT:
One of the primary differences from the original, is that instead of:
[''.join(s) for s in p]
Which is a list comprehension, we use:
(''.join(s) for s in p)
Which is a generator expression.
The important difference here is that a list comprehension builds the entire list up front, while a generator expression lets filter pull values as they are needed. The key mechanism is lazy evaluation, which boils down to only evaluating an expression when its value is actually required.
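To see the laziness for yourself, note that nothing is computed until a value is requested. A minimal check, using the s1 through s4 lists from the question:

gen = (''.join(s) for s in itertools.product(s1, s2, s3, s4))
print(next(gen))  # prints 'accg': only the first combination was built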
By switching from a list to a generator, Rob's answer saves space but not time (at least, not asymptotically). You've asked a very broad question about how to enumerate all solutions to what is essentially a constraint satisfaction problem. The big wins are going to come from enforcing local consistency, but it's difficult to give you advice without specific knowledge of your constraints.
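To make the pruning idea concrete, here is a rough sketch (not your application's real constraints) of a depth-first product that abandons a partial combination as soon as it violates a constraint; is_viable_prefix is a hypothetical stand-in for whatever checks can run on a partial string:

import re

def constrained_product(pools, is_viable_prefix):
    # Depth-first walk over the product; a branch is abandoned as soon
    # as the partial combination already violates a constraint.
    def extend(prefix, rest):
        if not rest:
            yield prefix
            return
        for item in rest[0]:
            candidate = prefix + item
            if is_viable_prefix(candidate):
                yield from extend(candidate, rest[1:])
    yield from extend('', pools)

# Hypothetical prefix check for the '[ab].c' rule from the question:
# reject a partial string only once enough characters exist to decide.
r = re.compile('[ab].c')
def is_viable_prefix(s):
    return len(s) < 3 or r.match(s) is not None

pools = [['a', 'b'], ['c', 'd', 'e', 'f'], ['c', 'd', 'e', 'f'], ['g']]
print(list(constrained_product(pools, is_viable_prefix)))
# -> ['accg', 'adcg', 'aecg', 'afcg', 'bccg', 'bdcg', 'becg', 'bfcg']

With this shape, invalid branches are cut off at the first illegal block instead of being enumerated and filtered afterwards, which is where the scalability win comes from.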
I am new to data wrangling in python.
I have a column in a dataframe that has text like:
I really like Product A!
I think Product B is for me!
I will go with Product C.
My objective is to create a new column with the product name (including the word 'Product'). I do not want to use regex. The product name is unique within a row, so there will be no row with a string such as
I really like Product A and Product B
Problem in generic form: I have a list of unique items; let's call it list A. I have another list of strings, where each string includes at most one of the items from list A. How do I create a new list with the matched items?
I have written the following code. It works fine. But even I (new to programming) can tell this is highly inefficient.
Any better and elegant solution?
product_type = ['Product A', 'Product B', 'Product C', 'Product D']
product_list = [None] * len(fed_df['product_line'])
for i in range(len(product_list)):
for product in product_type:
if product in fed_df['product_line'][i]:
product_list[i] = product
fed_df['product_line'] = product_list
Short Background
Fundamentally, at some point, each element of each list will need to be compared, much as you've written it (although you can skip to the next line once a match has been found). But the trick to writing good Python code is to use functionality implemented at a lower level for efficiency, rather than writing it yourself. For example, you should try to avoid using
for i in range(len(myList)): #code which accesses myList[i]
when you can use
for myListElement in myList: #code which uses myListElement
since in the latter, the access to myList is handled internally, and more efficiently, than Python computing each i and then indexing the ith element of myList. This is true of some other high-level programming languages too.
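If you do need the index, as in the assignment to product_list[i] above, enumerate is the idiomatic middle ground. A minimal sketch of the original loop rewritten that way:

for i, line in enumerate(fed_df['product_line']):
    for product in product_type:
        if product in line:
            product_list[i] = product
            break  # skip to the next line once a match is found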
Actual Answer
Anyway, to answer your question, I came up with the following and I believe it would be more efficient:
answer = map(lambda product_line_element:
                 next(filter(lambda product: product in product_line_element,
                             product_type),
                      None),
             fed_df['product_line'])
What this does: for each line of fed_df['product_line'] (map), it replaces the line with the first product type (next) among those product types found in that line (filter), or None if no product type matches.
How I tested
To test this I made a list of lists to use as fed_df['product_line']:
[['h', 'a', 'g'], ['k', 'b', 'l'], ['u', 't', 'a'], ['r', 'e', 'p'], ['g', 'e', 'b']]
and used "a" and "b" as the product_type items, which gave
['a', 'b', 'a', None, 'b']
as a result, which I think is what you are after...
These mapping functions are usually preferred over for loops, since they avoid mutation and can be made multi-threaded/multi-process more easily.
Another bonus of this solution is that the result isn't calculated until later code attempts to access answer, which spreads the CPU usage a bit better. You can force it to be calculated by converting answer into a list (list(answer)), but it shouldn't be necessary.
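For reference, a minimal reproduction of the test described above, assuming the list of lists stands in for fed_df['product_line']:

product_type = ['a', 'b']
product_line = [['h', 'a', 'g'], ['k', 'b', 'l'], ['u', 't', 'a'], ['r', 'e', 'p'], ['g', 'e', 'b']]
answer = map(lambda line: next(filter(lambda p: p in line, product_type), None), product_line)
print(list(answer))  # ['a', 'b', 'a', None, 'b']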
I hope I understood your problem correctly. Let me know if you have any questions :)
So, I've decided that it's time to learn regular expressions. Thus, I set out to solve various problems, and after a bit of smooth sailing, I seem to have hit a wall and need help getting unstuck.
The task:
Given a list of characters and logical operators, find all possible combinations of these characters and operators that are not gibberish.
For example, given:
my_list = ['p', 'q', '&', '|']
the output would be:
answers = ['p', 'q', 'p&q', 'p|q'...]
However, strings like 'pq&' and 'p&|' are gibberish and therefore not allowed.
Naturally, the more elements are added to my_list, the more complicated the process becomes.
My current approach:
(I'd like to learn how to solve it with regex, but I am also curious whether a better way exists... but again, my focus is regex)
step 1:
find all permutations of the elements such that each permutation is 3 <= x <= len(my_list) long.
step 2:
Loop over the list, and if a regex match is found, pull that element out and put it in the answers list.
(I'm not married to this 2-step approach, it is just what seemed most logical to me)
My current code, minus the regex:
import re
from itertools import permutations
my_list = ['p', 'q', '~r', 'r', '|', '&']
foo = []
answers = []
count = 3
while count < 7:
    for i in permutations(my_list, count):
        foo.append(''.join(i))
    count += 1
for i in foo:
    if re.match(r'insert_regex', i):
        answers.append(i)
print answers
Now, I have tried a vast slew of different regex's to get this to work (too many to list them all here) but some of the main ones are:
First, a straightforward approach: find all the cases that have two letters side by side or two operators side by side, and then, instead of appending to 'answers', just remove those strings from 'foo'. This is the regex I tried:
r'(\w\w)[&\|]{2,}'
and did not even come close.
I then decided to try and find the strings that I wanted, as opposed to the ones I did not want.
First I tested:
r'^[~\w]'
to make sure I could get the strings whose first character were a letter or a negation. This worked. I was happy.
I then tried:
r'^[~\w][&\|]'
to try and get the next logical operator; however, it only picked up strings whose first character was a letter, and ignored all of the strings whose first character was a negation.
I then tried a conditional so that if the first character was a negation, the next character would be a letter, otherwise it would be an operator:
r'^(?(~)\w|[&\|])'
but this threw me "error: bad character in group name".
I then tried to resolve this error by:
r'^(?:(~)\w|[&\|])'
But that returned only strings that started with '~' or an operator.
I then tried a slew of other things related to conditionals and groupings (2 days worth, actually), but I can't seem to find a solution. Part of the problem is that I don't know enough about regex to know where to go to find the solution, so I have kind of been wandering around the internet aimlessly.
I have run through a lot of tutorials and explanation pages, but they are all rather opaque and don't piece things together in a way that is conducive to understanding... they just sort of throw out code for you to copy and paste or mimic.
Any insights you have would be much appreciated, and as much as I would love an answer to the problem, if possible, an ELI5 explanation of what the solution does would be excellent for my own progress.
In a bitter twist of irony, it turns out that I had the solution written down (I documented all the regexes I tried), but it originally failed because I was removing strings from the list I was iterating over instead of from a copy.
If anyone is looking for a solution to the problem, the following code worked on all of my test cases (can't promise beyond that, however).
import re
from itertools import permutations
import copy
a = ['p', 'q', 'r', '~r', '|', '&']
foo = []
count = 3
while count < len(a)+1:
    for j in permutations(a, count):
        foo.append(''.join(j))
    count += 1
foo_copy = copy.copy(foo)
for i in foo:
    if re.search(r'(^[&\|])|(\w\w)|(\w~)|([&\|][&\|])|([&\|]$)', i):
        foo_copy.remove(i)
print foo_copy
You have a list of variables (characters), binary operators, and/or variables prefixed with a unary operator (like ~). The last case can be dealt with just like a variable.
As binary operators need a variable at either side, we can conclude that a valid expression is an alternation of variables and operators, starting and ending with a variable.
So, you could first divide the input list into two lists based on whether an item is a variable or an operator. Then you could increase the size of the output you will generate, and for each size, get the permutations of both lists and zip these in order to build a valid expression each time. This way you don't need a regular expression to verify the validity.
Here is the suggested function:
from itertools import permutations, zip_longest, chain

def expressions(my_list):
    answers = []
    variables = [x for x in my_list if x[-1].isalpha()]
    operators = [x for x in my_list if not x[-1].isalpha()]
    max_var_count = min(len(operators) + 1, len(variables))
    for var_count in range(1, max_var_count + 1):
        for vars in permutations(variables, var_count):
            for ops in permutations(operators, var_count - 1):
                answers.append(''.join(list(chain.from_iterable(zip_longest(vars, ops)))[:-1]))
    return answers

print(expressions(['p', 'q', '~r', 'r', '|', '&']))
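If I ran this correctly, the output starts with the single variables and then the two-variable expressions, e.g. ['p', 'q', '~r', 'r', 'p|q', 'p&q', 'p|~r', ...], continuing up through the three-variable expressions.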
Alright, so I have a question. I am working on a script that grabs random names from a provided list and generates a list of 5 of them. I know that you can use the command
import random

items = ['names', 'go', 'here']
rand_item = items[random.randrange(len(items))]
This, if I am not mistaken, should grab one random item from the list (correct me if I am wrong). My question is: how would I get it to generate, say, a list of 5 names, going down like below:
random
names
generated
using
code
Also, is there a way to make it so that if I run this 5 days in a row, it doesn't repeat the names in the same order?
I appreciate any help you can give, and any corrections to my existing code.
Edit:
The general use for my script will be to generate task assignments for a group of users every day, 5 days a week. What I am looking for is a way to generate these names in 5 different rotations.
I apologize for any confusion. Though some of the returned answers will be helpful.
Edit2:
Alright so I think I have mostly what I want, thank you Markus Meskanen & mescalinum, I used some of the code from both of you to resolve most of this issue. I appreciate it greatly. Below is the code I am using now.
import random
items = ['items', 'go', 'in', 'this', 'string']
rand_item = random.sample(items, 5)
for item in rand_item:
    print item
random.choice() is good for selecting one element at random.
However if you want to select multiple elements at random without repetition, you could use random.sample():
for item in random.sample(items, 5):
    print item
For the last question, you should trust the (pseudo-)random generator not to give the same sequence on two consecutive days. The random seed is initialized with the current time by default, so it's unlikely you will observe the same sequence on two consecutive days, although not impossible, especially if the number of items is small.
If you absolutely need to avoid this, save the last sequence to a file, and load it before shuffling, and keep shuffling until it gives you a different order.
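A minimal sketch of that idea, assuming a throwaway state file named last_order.json:

import json
import random

def fresh_shuffle(items, state_file='last_order.json'):
    # Load yesterday's order, if the file exists and parses.
    try:
        with open(state_file) as f:
            last = json.load(f)
    except (IOError, ValueError):
        last = None
    order = items[:]
    random.shuffle(order)
    # Keep shuffling until the order differs from last time.
    while order == last and len(items) > 1:
        random.shuffle(order)
    with open(state_file, 'w') as f:
        json.dump(order, f)
    return order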
You could use random.choice() to get one item only:
items = ['names','go','here']
rand_item = random.choice(items)
Now just repeat this 5 times (a for loop!)
If you want the names just in a random order, use random.shuffle() to get a different result every time.
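For instance, a minimal sketch of both suggestions, using the items list from the question:

import random

items = ['names', 'go', 'here']

# Five independent picks (repeats are possible):
for _ in range(5):
    print(random.choice(items))

# Or each item once, in a random order:
random.shuffle(items)
for item in items:
    print(item)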
It is not clear from your question whether you simply want to shuffle the items or choose a subset. From what I can make out, you want the second case.
You can use random.sample to get a given number of random items from a list in Python. If I wanted to get 3 random items from a list of five letters, I would do:
>>> import random
>>> random.sample(['a', 'b', 'c', 'd', 'e'], 3)
['b', 'a', 'e']
Note that the letters are not necessarily returned in the same order - 'b' is returned before 'a', although that wasn't the case in the original list.
Regarding the second part of your question, preventing it from generating the same letters in the same order: you can append every newly generated sublist to a file, read that file back in during your script's execution, and generate a new sublist until it differs from every past generated sublist.
random.shuffle(items) will handle the random order generation
In [15]: print items
['names', 'go', 'here']
In [16]: for item in items: print item
names
go
here
In [17]: random.shuffle(items)
In [18]: for item in items: print item
here
names
go
For completeness, I agree with the above poster on random.choice().
I have several file names that I am trying to compare. Here are some examples:
files = ['FilePrefix10.jpg', 'FilePrefix11.jpg', 'FilePrefix21.jpg', 'FilePrefixOoufhgonstdobgfohj#lwghkoph[]**^.jpg']
What I need to do is extract "FilePrefix" from each file name, which changes depending on the directory. I have several folders containing many jpg's. Within each folder, each jpg has a FilePrefix in common with every other jpg in that directory. I need the variable portion of the jpg file name. I am unable to predict what FilePrefix is going to be ahead of time.
I had the idea to just compare two file names using difflib (in Python) and extract FilePrefix (and subsequently the variable portion) that way. I've run into the following issue:
>>> comp1 = SequenceMatcher(None, files[0], files[1])
>>> comp1.get_matching_blocks()
[Match(a=0, b=0, size=11), Match(a=12, b=12, size=4), Match(a=16, b=16, size=0)]
>>> comp1 = SequenceMatcher(None, files[1], files[2])
>>> comp1.get_matching_blocks()
[Match(a=0, b=0, size=10), Match(a=11, b=11, size=5), Match(a=16, b=16, size=0)]
As you can see, the first size does not match up. It's confusing the ten's and digit's place, making it hard for me to match a difference between more than two files. Is there a correct way to find a minimum size among all files within the directory? Or alternatively, is there a better way to extract FilePrefix?
Thank you.
It's not that it's "confusing the ten's and digit's place", it's that in the first matchup the ten's place isn't different, so it's considered part of the matching prefix.
For your use case, there seems to be a pretty easy solution to this ambiguity: just match all adjacent pairs, and take the minimum. Like this:
def prefix(x, y):
    comp = SequenceMatcher(None, x, y)
    matches = comp.get_matching_blocks()
    prefix_match = matches[0]
    prefix_size = prefix_match[2]
    return prefix_size

pairs = zip(files, files[1:])
matches = (prefix(x, y) for x, y in pairs)
prefixlen = min(matches)
prefix = files[0][:prefixlen]
The prefix function is pretty straightforward, except for one thing: I used [2] instead of .size because there's an annoying bug in the 2.7 difflib where the second call to get_matching_blocks may return a tuple instead of a namedtuple. This won't affect the code as-is, but if you add some debugging prints it will break.
Now, pairs is a list of all adjacent pairs of names, created by zipping together files and files[1:]. (If this isn't clear, print(zip(files, files[1:])). If you're using Python 3.x, you'll need print(list(zip(files, files[1:]))) instead, because zip returns a lazy iterator instead of a printable list.)
Now we just want to call prefix on each of the pairs, and take the smallest value we get back. That's what min is for. (I'm passing it a generator expression, which can be a tricky concept at first—but if you just think of it as a list comprehension that doesn't build the list, it's pretty simple.)
You could obviously compact this into two or three lines while still leaving it readable:
prefixlen = min(SequenceMatcher(None, x, y).get_matching_blocks()[0][2]
for x, y in zip(files, files[1:]))
prefix = files[0][:prefixlen]
However, it's worth considering that SequenceMatcher is probably overkill here. It's looking for the longest matches anywhere, not just the longest prefix matches, which means it's essentially O(N^3) on the length of the strings, when it only needs to be O(NM) where M is the length of the result. Plus, it's not inconceivable that there could be, say, a suffix that's longer than the longest prefix, so it would return the wrong result.
So, why not just do it manually?
def prefixes(name):
    while name:
        yield name
        name = name[:-1]

def maxprefix(names):
    first, names = names[0], names[1:]
    for prefix in prefixes(first):
        if all(name.startswith(prefix) for name in names):
            return prefix
prefixes(first) just gives you 'FilePrefix10.jpg', 'FilePrefix10.jp', 'FilePrefix10.j', etc., down to 'F'. So we just loop over those, checking whether each one is also a prefix of all of the other names, and return the first one that is.
And you can do this even faster by thinking character by character instead of prefix by prefix:
def maxprefix(names):
    for i, letters in enumerate(zip(*names)):
        if len(set(letters)) > 1:
            return names[0][:i]
Here, we're just checking whether the first character is the same in all names, then whether the second character is the same in all names, and so on. Once we find one where that fails, the prefix is all characters up to that (from any of the names).
The zip reorganizes the list of names into a list of tuples, where the first one is the first character of each name, the second is the second character of each name, and so on. That is, [('F', 'F', 'F', 'F'), ('i', 'i', 'i', 'i'), …].
The enumerate just gives us the index along with the value. So, instead of getting ('F', 'F', 'F', 'F'), you get 0, ('F', 'F', 'F', 'F'). We need that index for the last step.
Now, to check that ('F', 'F', 'F', 'F') are all the same, I just put them in a set. If they're all the same, the set will have just one element—{'F'}, then {'i'}, etc. If they're not, it'll have multiple elements—{'1', '2'}—and that's how we know we've gone past the prefix.
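For the files list from the question, either version should give the same answer:

files = ['FilePrefix10.jpg', 'FilePrefix11.jpg', 'FilePrefix21.jpg',
         'FilePrefixOoufhgonstdobgfohj#lwghkoph[]**^.jpg']
print(maxprefix(files))  # FilePrefix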
The only way to be certain is to check ALL the filenames. So just iterate through them all, checking against the kept maximum matching string as you go.
You might try something like this:
files = ['FilePrefix10.jpg',
         'FilePrefix11.jpg',
         'FilePrefix21.jpg',
         'FilePrefixOoufhgonstdobgfohj#lwghkoph[]**^.jpg',
         'FileProtector354.jpg'
         ]

prefix = files[0]
max = 0
for f in files:
    for c in range(0, len(prefix) + 1):
        if prefix[:c] != f[:c]:
            prefix = f[:c-1]
            max = c - 1
            break
print prefix, max
Please pardon the 'un-Pythonicness' of the solution, but I wanted the algorithm to be obvious to any level programmer.
I can't seem to find a question on SO about my particular problem, so forgive me if this has been asked before!
Anyway, I'm writing a script to loop through a set of URL's and give me a list of unique urls with unique parameters.
The trouble I'm having is actually comparing the parameters to eliminate multiple duplicates. It's a bit hard to explain, so some examples are probably in order:
Say I have a list of URL's like this
hxxp://www.somesite.com/page.php?id=3&title=derp
hxxp://www.somesite.com/page.php?id=4&title=blah
hxxp://www.somesite.com/page.php?id=3&c=32&title=thing
hxxp://www.somesite.com/page.php?b=33&id=3
I have it parsing each URL into a list of lists, so eventually I have a list like this:
sort = [['id', 'title'], ['id', 'c', 'title'], ['b', 'id']]
I need to figure out a way to give me just 2 lists in my list at that point:
new = [['id', 'c', 'title'], ['b', 'id']]
As of right now I've got a bit of code to sort it out a little. I know I'm close, and I've been slamming my head against this for a couple of days now :(. Any ideas?
Thanks in advance! :)
EDIT: Sorry for not being clear! This script is aimed at finding unique entry points for web applications post-spidering. Basically if a URL has 3 unique entry points
['id', 'c', 'title']
I'd prefer that to the same link with 2 unique entry points, such as:
['id', 'title']
So I need my new list of lists to eliminate the one with 2 and prefer the one with 3, ONLY if the smaller set of variables is contained in the larger set. If it's still unclear let me know, and thank you for the quick responses! :)
I'll assume that subsets are considered "duplicates" (non-commutatively, of course)...
Start by converting each query into a set and ordering them all from largest to smallest. Then add each query to a new list if it isn't a subset of an already-added query. Since any set is a subset of itself, this logic covers exact duplicates:
a = []
for q in sorted((set(q) for q in sort), key=len, reverse=True):
    if not any(q.issubset(Q) for Q in a):
        a.append(q)
a = [list(q) for q in a]  # Back to lists, if you want
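As a quick check of the logic against the example from the question (each set is sorted for a stable printout, since sets are unordered):

sort = [['id', 'title'], ['id', 'c', 'title'], ['b', 'id']]
a = []
for q in sorted((set(q) for q in sort), key=len, reverse=True):
    if not any(q.issubset(Q) for Q in a):
        a.append(q)
print([sorted(q) for q in a])  # [['c', 'id', 'title'], ['b', 'id']]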