I have the string 'ABBA'. If I turn it into a set, I get this:
In: set('ABBA')
Out: {'A', 'B'}
However, if I join them as a string, it reverses the order:
In: ''.join(set('ABBA'))
Out: 'BA'
The same happens if I try to turn the string into a list:
In: list(set('ABBA'))
Out: ['B', 'A']
Why is this happening and how do I address it?
EDIT
The reason applying sorted doesn't work is that if I make a set out of 'CDA', it will return 'ACD', thus losing the order of the original string. My question pertains to preserving the original order of the set itself.
Sets are unordered collections, i.e. you may get a different order every time you run the command. Sets also contain only unique elements, so there will be no repetition among the elements of the set.
If you run set('ABBA') a few times, sometimes you will get the output {'A', 'B'} and sometimes {'B', 'A'}. The same thing happens when you use join: the output will sometimes be 'AB' and sometimes 'BA'.
There is an ordered set recipe for this, which is referenced in the Python 2 documentation. It runs on Python 2.6 or later and 3.0 or later without any modifications. The interface is almost exactly the same as that of a normal set, except that initialisation should be done with a list:
OrderedSet([1, 2, 3])
This is a MutableSet, so the signature for .union doesn't match that of set, but since it includes __or__, something similar can easily be added.
b = "AAEEBBCCDD"
a = set(b)  # unordered
print(a)  # e.g. {'B', 'D', 'C', 'A', 'E'} or {'A', 'E', 'B', 'D', 'C'}, ...
# the order is not reversed, just arbitrary
print(''.join(a))
print(list(a))
print(sorted(a, key=b.index))  # restores the original sequence of b
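On Python 3.7 and later there is another way to keep the original order while removing duplicates: dictionary keys are unique and preserve insertion order, so dict.fromkeys can act as an ordered set. A minimal sketch:

```python
b = "AAEEBBCCDD"

# dict keys are unique and (on Python 3.7+) keep insertion order,
# so this removes duplicates while preserving the original sequence
deduped = ''.join(dict.fromkeys(b))
print(deduped)  # AEBCD
```

Unlike sorted(a, key=b.index), this is a single pass and does not need to search b for every element.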
I am new to data wrangling in python.
I have a column in a dataframe that has text like:
I really like Product A!
I think Product B is for me!
I will go with Product C.
My objective is to create a new column with Product Name (Including the word 'Product'). I do not want to use Regex. Product name is unique in a row. So there will be no row with string such as
I really like Product A and Product B
Problem in generic form: I have a list of unique items; let's call it list A. I have another list of strings, where each string includes at most one of the items from list A. How do I create a new list with the matched items?
I have written the following code. It works fine, but even I (new to programming) can tell it is highly inefficient.
Any better and elegant solution?
product_type = ['Product A', 'Product B', 'Product C', 'Product D']
product_list = [None] * len(fed_df['product_line'])
for i in range(len(product_list)):
    for product in product_type:
        if product in fed_df['product_line'][i]:
            product_list[i] = product
fed_df['product_line'] = product_list
Short Background
Fundamentally, at some point, each element of each list will need to be compared, much as you've written it (although you can skip to the next iteration once a match has been found). But the trick to writing good Python code is to use functionality implemented at a lower level for efficiency, rather than trying to write it yourself. For example, you should try to avoid using
for i in range(len(myList)): #code which accesses myList[i]
when you can use
for myListElement in myList: #code which uses myListElement
since in the latter, the accessing of myList is handled internally, and more efficiently than Python computing i and then accessing the i-th element of myList. This is true of some other high-level programming languages too.
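When the index really is needed alongside the element, enumerate gives you both without manual indexing; a small illustration:

```python
my_list = ['a', 'b', 'c']

# enumerate yields (index, element) pairs, so there is no need
# for range(len(...)) followed by my_list[i]
for i, element in enumerate(my_list):
    print(i, element)
```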
Actual Answer
Anyway, to answer your question, I came up with the following and I believe it would be more efficient:
answer = map(
    lambda product_line_element: next(
        filter(lambda product: product in product_line_element, product_type),
        None,
    ),
    fed_df['product_line'],
)
What this does is map over each line of fed_df['product_line'] (map), replacing it with the first product type (next) found in that line (filter), or None if no product type matches.
How I tested
To test this I made a list of lists to use as fed_df['product_line']
[['h', 'a', 'g'], ['k', 'b', 'l'], ['u', 't', 'a'], ['r', 'e', 'p'], ['g', 'e', 'b']]
and searched for "a" and "b" "product_types", which gave
['a', 'b', 'a', None, 'b']
as a result, which I think is what you are after...
These mapping functions are usually preferred over for loops, since they avoid mutation and can be made multi-threaded/multi-process more easily.
Another bonus of this solution is that the result isn't computed until later code accesses answer, which spreads the CPU usage a bit better. You can force it to be computed by converting answer into a list (list(answer)), but that shouldn't be necessary.
I hope I understood your problem correctly. Let me know if you have any questions :)
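For reference, here is the same map/filter expression run end to end on the sample sentences from the question, with a plain list standing in for the fed_df['product_line'] column:

```python
product_type = ['Product A', 'Product B', 'Product C', 'Product D']
product_line = [
    'I really like Product A!',
    'I think Product B is for me!',
    'I will go with Product C.',
]

# map each line to the first product found in it, or None if nothing matches
result = list(map(
    lambda line: next(filter(lambda product: product in line, product_type), None),
    product_line,
))
print(result)  # ['Product A', 'Product B', 'Product C']
```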
Let me have a dictionary:
P={'S':['dB','N'],'B':['C','CcB','bA']}
How can I get the second value of the second key from dictionary P?
Also, if the value is a string with more than one character, like 'bA' (the third value of key 'B'), can I somehow return the first character of this value?
As #jonrsharpe has stated before, dictionaries aren't ordered by design.
What this means is that every time you attempt to access a dictionary "by order", you may encounter a different result.
Observe the following (python interactive interpreter):
>>>P={'S':['dB','N'],'B':['C','CcB','bA'], 'L':["qqq"]}
>>>P.keys()
['S', 'B', 'L']
It's easy to see that in this instance, the "order" as we defined it matches the order we receive from the dict.keys() method.
However, you may also observe this result:
>>> P={'S':['dB','N'],'B':['C','CcB','bA'], 'L':["qqq"], 'A':[]}
>>> P.keys()
['A', 'S', 'B', 'L']
In this example, the key 'A' should be fourth in our list, but it is actually first.
This is just a small example of why you may not treat dictionaries as ordered lists.
Maybe you could go ahead and tell us what your intentions are and an alternative may be suggested.
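That said, the values in P are plain lists, so the indexing the question asks about works directly once you name the key explicitly rather than relying on key order:

```python
P = {'S': ['dB', 'N'], 'B': ['C', 'CcB', 'bA']}

# second value of key 'B' (lists are zero-indexed)
print(P['B'][1])     # CcB

# third value of key 'B', then its first character
# (strings support the same indexing as lists)
print(P['B'][2][0])  # b
```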
I have generator function that creates a Cartesian product of lists. The real application uses more complex objects, but they can be represented by strings:
import itertools
s1 = ['a', 'b']
s2 = ['c', 'd', 'e', 'f']
s3 = ['c', 'd', 'e', 'f']
s4 = ['g']
p = itertools.product(*[s1,s2,s3,s4])
names = [''.join(s) for s in p]
In this example, the result is 32 combinations of characters:
names
['accg', 'acdg', 'aceg', 'acfg', 'adcg', 'addg', 'adeg', 'adfg', 'aecg',
'aedg', 'aeeg', 'aefg', 'afcg', 'afdg', 'afeg', 'affg', 'bccg', 'bcdg',
'bceg', 'bcfg', 'bdcg', 'bddg', 'bdeg', 'bdfg', 'becg', 'bedg', 'beeg',
'befg', 'bfcg', 'bfdg', 'bfeg', 'bffg']
Now, let's say I have some constraints such that certain character combinations are illegal. For example, let's say that only strings that contain the regex '[ab].c' are allowed. ('a' or 'b' followed by any letter followed by 'c')
After applying these constraints, we are left with a reduced set of just 8 strings:
import re
r = re.compile('[ab].c')
filter(r.match, names)
['accg', 'adcg', 'aecg', 'afcg', 'bccg', 'bdcg', 'becg', 'bfcg']
In the real application the chains are longer, there can be thousands of combinations, and applying the hundreds of constraints is fairly computationally intensive, so I'm concerned about scalability.
Right now I'm going through every single combination and checking its validity. Does an algorithm/data structure exist that could speed up this process?
EDIT:
Maybe this will clarify a little: In the real application I am assembling random 2D pictures of buildings from simple basic blocks (like pillars, roof segments, windows, etc.). The constraints limit what kind of blocks (and their rotations) can be grouped together, so the resulting random image looks realistic, and not a random jumble.
A given constraint can contain many combinations of patterns. But of all those combinations, many are not valid because a different constraint would prohibit some portion of it. So in my example, one constraint would contain the full Cartesian product of characters above. And a second constraint is the '[ab].c'; this second constraint reduces the number of valid combinations of the first constraint that I need to consider.
Because these constraints are difficult to create, I am looking to visualize what all the combinations of blocks in each constraint look like, but only the valid combinations that pass all constraints. Hence my question. Thanks!
Try just providing the iterator that generates the names directly to filter, like so:
import itertools
import re
s1 = ['a', 'b']
s2 = ['c', 'd', 'e', 'f']
s3 = ['c', 'd', 'e', 'f']
s4 = ['g']
p = itertools.product(*[s1,s2,s3,s4])
r = re.compile('[ab].c')
l = filter(r.search, (''.join(s) for s in p))
print(list(l))
That way, it shouldn't assemble the full set of combinations in memory, it will only keep the ones that fit the criteria. There is probably another way as well.
EDIT:
One of the primary differences from the original, is that instead of:
[''.join(s) for s in p]
Which is a list comprehension, we use:
(''.join(s) for s in p)
Which is a generator.
The important difference here is that a list comprehension creates a list using the designated criteria and generator, while only providing the generator allows the filter to generate values as needed. The important mechanism is lazy evaluation, which really just boils down to only evaluating expressions as their values become necessary.
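One way to see the laziness: itertools.islice pulls only as many matches as requested, so the rest of the product is never joined or tested at all.

```python
import itertools
import re

s1 = ['a', 'b']
s2 = ['c', 'd', 'e', 'f']
s3 = ['c', 'd', 'e', 'f']
s4 = ['g']

p = itertools.product(s1, s2, s3, s4)
r = re.compile('[ab].c')
matches = filter(r.search, (''.join(s) for s in p))

# islice stops after the first 3 matches; the remaining
# combinations are never generated
first_three = list(itertools.islice(matches, 3))
print(first_three)  # ['accg', 'adcg', 'aecg']
```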
By switching from a list to a generator, Rob's answer saves space but not time (at least, not asymptotically). You've asked a very broad question about how to enumerate all solutions to what is essentially a constraint satisfaction problem. The big wins are going to come from enforcing local consistency, but it's difficult to give you advice without specific knowledge of your constraints.
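As a sketch of that idea: rather than generating full combinations and then filtering, you can prune prefixes of the product before they are fully expanded. With the '[ab].c' constraint, any prefix whose third character is not 'c' can be discarded before the last list is even touched. This sketch assumes, as in the example, that each constraint only inspects a bounded window of positions:

```python
import itertools

s1 = ['a', 'b']
s2 = ['c', 'd', 'e', 'f']
s3 = ['c', 'd', 'e', 'f']
s4 = ['g']

def extend(prefixes, choices, ok):
    """Extend each surviving prefix by one position, pruning as we go."""
    return [p + c for p in prefixes for c in choices if ok(p + c)]

def ok(prefix):
    # encodes '[ab].c' positionally: first char in 'ab', third char 'c'
    if len(prefix) >= 1 and prefix[0] not in 'ab':
        return False
    if len(prefix) >= 3 and prefix[2] != 'c':
        return False
    return True

prefixes = ['']
for choices in (s1, s2, s3, s4):
    prefixes = extend(prefixes, choices, ok)

print(prefixes)
# ['accg', 'adcg', 'aecg', 'afcg', 'bccg', 'bdcg', 'becg', 'bfcg']
```

Here the dead branches (e.g. everything starting 'acd') are cut after three characters instead of being expanded through all remaining lists, which is where the savings come from when chains are long.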
Trying to learn Python I encountered the following:
>>> set('spam') - set('ham')
set(['p', 's'])
Why is it set(['p', 's'])? I mean: why is 'h' missing?
The - operator on Python sets maps to the difference method, which is defined as the members of set A that are not members of set B. So in this case, the members of "spam" that are not in "ham" are "s" and "p". Notice that this operation is not commutative (that is, a - b == b - a is not always true).
You may be looking for the symmetric_difference method or the ^ operator:
>>> set("spam") ^ set("ham")
{'h', 'p', 's'}
This operator is commutative.
Because that is the definition of a set difference. In plain English, it is equivalent to "what elements are in A that are not also in B?".
Note that reversing the operands makes this more obvious:
>>> set('spam') - set('ham')
{'s', 'p'}
>>> set('ham') - set('spam')
{'h'}
To get all unique elements, disregarding the order in which you ask, you can use symmetric_difference
>>> set('spam').symmetric_difference(set('ham'))
{'s', 'h', 'p'}
There are two different operators:
Set difference. This is defined as the elements of A not present in B, and is written as A - B or A.difference(B).
Symmetric set difference. This is defined as the elements of either set not present in the other set, and is written as A ^ B or A.symmetric_difference(B).
Your code is using the former, whereas you seem to be expecting the latter.
The set difference is the set of all characters in the first set that are not in the second set. 'p' and 's' appear in the first set but not in the second, so they are in the set difference. 'h' does not appear in the first set, so it is not in the set difference (regardless of whether or not it is in the second set).
You can also obtain the desired result as:
>>> (set('spam') | set('ham')) - (set('spam') & set('ham'))
set(['p', 's', 'h'])
Create the union using | and the intersection using &, then take the set difference, i.e. subtract the common elements from the set of all elements.
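The identity can be checked directly: for any two sets, the union minus the intersection equals the symmetric difference.

```python
a, b = set('spam'), set('ham')

# union minus intersection ...
via_operators = (a | b) - (a & b)

# ... is exactly the symmetric difference
print(via_operators == a ^ b)  # True
print(sorted(via_operators))   # ['h', 'p', 's']
```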