Particular List Comprehension - python

Could anyone explain how to understand this particular list comprehension? I have tried to decode it using How to read aloud Python List Comprehensions?, but I am still not able to understand it.
words = "".join([",",c][int(c.isalnum())] for c in sen).split(",")
Let's say:
sen = 'i love dogs'
The output would then be:
['i', 'love', 'dogs']

Here is a better way with split:
print(sen.split())
Output:
['i', 'love', 'dogs']
Explaining your code: it iterates over the string and, for every character that is not alphanumeric (a space, for instance), substitutes a comma.
After all of that, split is used to break the string apart on the commas.
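As a sketch of the same idea (not the original poster's code), the comprehension can also be written with a conditional expression, which reads more naturally:
sen = 'i love dogs'
# Replace every non-alphanumeric character with a comma, then split on the commas.
words = "".join(c if c.isalnum() else "," for c in sen).split(",")
print(words)  # ['i', 'love', 'dogs']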

Basically, you've got this:
For each character (c) in the sentence (sen), create a two-element list [',', character].
If the character is a letter or digit (.isalnum()), add the character itself to the sequence being built by the comprehension, i.e.:
`[',', character][1]`
If not, add the comma (",") instead, i.e.:
`[',', character][0]`
Now, join the list together into a string:
`"".join(['I', ',', 'l', 'o', 'v', 'e', ',', 'd', 'o', 'g', 's', ','])`
becomes
`"I,love,dogs,"`
Now split that string on commas into a list:
`"I,love,dogs,".split(",")`
becomes
`['I', 'love', 'dogs', '']`
The trick here is that `[",", c][int(c.isalnum())]` is not a slice but an index into a two-element list: the truth value of c.isalnum(), converted to an int, selects either index 0 (the comma) or index 1 (the character).
So, basically, if c is the character "b", for example, you have `[',', 'b'][1]`, which evaluates to 'b'.
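To see the selection mechanics in isolation (a small illustrative snippet, not part of the original answer):
c = 'b'
print(c.isalnum())                  # True
print(int(c.isalnum()))             # 1
print([",", c][int(c.isalnum())])   # b   (the character, element at index 1)
c = ' '
print([",", c][int(c.isalnum())])   # ,   (the comma, element at index 0, because isalnum() is False)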
Hope this helps.
PS: In my example, I'm using sen = 'I love dogs.' (note the trailing period). Can you spot the difference between your result and mine, and understand why it happens?
Here's the code:
sen = 'I love dogs.'
words = "".join([",", character][int(character.isalnum())] for character in sen).split(",")
print(words)
Result:
['I', 'love', 'dogs', '']
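As a side note (an addition, not part of the original answer): the trailing '.' is replaced by a comma, so split produces a final empty string. One way to drop empty entries, if they are unwanted, is to filter them out:
words = [w for w in words if w]
print(words)  # ['I', 'love', 'dogs']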

Related

Drop Duplicate Substrings from String with NO Spaces

Given a Pandas DF column that looks like this:
...how can I turn it into this:
XOM
ZM
AAPL
SOFI
NKLA
TIGR
Although these strings appear to be at most 4 characters long, I can't rely on that; I want to be able to take a string like ABCDEFGHIJABCDEFGHIJ and still turn it into ABCDEFGHIJ in one column calculation, preferably WITHOUT for-looping/iterating through the rows.
You can use a regex pattern like r'\b(\w+)\1\b' with str.extract, as below:
import pandas as pd

df = pd.DataFrame({'Symbol': ['ZOMZOM', 'ZMZM', 'SOFISOFI',
                              'ABCDEFGHIJABCDEFGHIJ', 'NOTDUPLICATED']})
print(df['Symbol'].str.extract(r'\b(\w+)\1\b'))
Output:
0
0 ZOM
1 ZM
2 SOFI
3 ABCDEFGHIJ
4 NaN # <- from `NOTDUPLICATED`
Explanation:
\b is a word boundary
(\w+) captures one or more word characters as the first group
\1 is a backreference to what the first group (\w+) captured, i.e. the same text repeated immediately after it
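If you also want to keep the values that are not duplicated (instead of getting NaN for them), one option, as a sketch on top of the answer above rather than something it shows, is to fill the missing extractions back in from the original column:
df['Symbol'] = df['Symbol'].str.extract(r'\b(\w+)\1\b', expand=False).fillna(df['Symbol'])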
An alternative approach, which does involve iteration, also uses regular expressions. Evaluate the longest possible substrings first, getting progressively shorter. Use each substring to compile a regex that looks for the substring repeated two or more times; if that is found, replace it with a single occurrence of the substring.
This version does not handle leading or trailing characters that are not part of the repetition.
When it performs a removal, it returns, breaking the loop. Going with the longest substrings first ensures that strings like 'AAPLAAPL' keep the double A intact.
import re

def remove_repeated(text):
    # Try suffixes of the input, longest first.
    for i in range(len(text)):
        substr = text[i:]
        # substr is interpolated directly; re.escape(substr) would be safer
        # if the data could contain regex metacharacters.
        pattern = re.compile(f"({substr}){{2,}}")
        if pattern.search(text):
            # Collapse two or more consecutive occurrences into one.
            return pattern.sub(substr, text)
    return text
>>> remove_repeated('abcdabcd')
'abcd'
>>> remove_repeated('abcdabcdabcd')
'abcd'
>>> remove_repeated('aabcdaabcdaabcd')
'aabcd'
To make this more flexible, here is a helper function that yields all of the substrings of a string, starting with the longest, written as a generator expression so we don't actually generate more than we need.
def substrings(text):
    return (text[i:i + l] for l in range(len(text), 0, -1)
                          for i in range(len(text) - l + 1))
>>> list(substrings("hello"))
['hello', 'hell', 'ello', 'hel', 'ell', 'llo', 'he', 'el', 'll', 'lo', 'h', 'e', 'l', 'l', 'o']
But there's no way 'hello' is going to be repeated in 'hello', so we can make this at least somewhat more efficient by looking at only substrings at most half the length of the input string.
def substrings(text):
    return (text[i:i + l] for l in range(len(text) // 2, 0, -1)
                          for i in range(len(text) - l + 1))
>>> list(substrings("hello"))
['he', 'el', 'll', 'lo', 'h', 'e', 'l', 'l', 'o']
Now, a little tweak to the original function:
def remove_repeated(text):
    for s in substrings(text):
        pattern = re.compile(f"({s}){{2,}}")
        if pattern.search(text):
            return pattern.sub(s, text)
    return text
And now:
>>> remove_repeated('AAPLAAPL')
'AAPL'
>>> remove_repeated('fooAAPLAAPLbar')
'fooAAPLbar'
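To tie this back to the Pandas question, the function can then be applied to the column. This is an extra usage sketch (apply does iterate row by row under the hood; the column name simply follows the earlier example):
import pandas as pd

df = pd.DataFrame({'Symbol': ['ZOMZOM', 'ZMZM', 'SOFISOFI', 'NOTDUPLICATED']})
df['Symbol'] = df['Symbol'].apply(remove_repeated)
print(df)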

remove the last two characters from strings in list using python

I have a very complicated problem; at least, my Python skills are not enough to solve it, and I need help or at least ideas on how to approach it.
I have a huge list of words that looks like this:
words_articles = ['diao', 'carrosos', 'cidadea', 'cidadesas']
I need to append the last one or two characters of each string into a new list, because they are these words' articles: 'a', 'o', 'as', 'os'.
My result should be two lists like the following:
words=['dia','carros', 'cidade', 'cidades']
articles=['o', 'os','a','as']
I have no idea how to solve this. I just know that I have to loop through each string but from this stage on I don't know what to do.
words_articles = ['diao', 'carrosos', 'cidadea', 'cidadesas']
words = []
articles = []
for y in words_articles:
    for x in y:
What should I do next after this?
You can test the last letter:
words_articles = ['diao', 'carrosos', 'cidadea', 'cidadesas']
words = []
articles = []
for word in words_articles:
    if word[-1] == 's':
        # Plural article: take the last two characters.
        words.append(word[:-2])
        articles.append(word[-2:])
    else:
        # Singular article: take only the last character.
        words.append(word[:-1])
        articles.append(word[-1:])
print(words)
print(articles)
Out:
['dia', 'carros', 'cidade', 'cidades']
['o', 'os', 'a', 'as']
You probably want the 'words' and 'articles' variables to be sets to prevent duplication:
words_articles = ['diao', 'carrosos', 'cidadea', 'cidadesas']
words = set()
articles = set()
for word in words_articles:
    # A trailing 's' marks the plural articles 'as'/'os'.
    suffix_size = 2 if word.endswith("s") else 1
    words.add(word[:-suffix_size])
    articles.add(word[-suffix_size:])
Output would be:
{'carros', 'cidade', 'dia', 'cidades'}
{'a', 'as', 'os', 'o'}
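If you prefer lists (keeping order and duplicates), the same suffix rule can be written as a single pass with a comprehension and zip. This is an extra sketch, not part of either answer above:
words_articles = ['diao', 'carrosos', 'cidadea', 'cidadesas']
pairs = [(w[:-2], w[-2:]) if w.endswith('s') else (w[:-1], w[-1:])
         for w in words_articles]
words, articles = (list(t) for t in zip(*pairs))
print(words)     # ['dia', 'carros', 'cidade', 'cidades']
print(articles)  # ['o', 'os', 'a', 'as']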

How to convert entire string to one item in list, not one character per item?

I'm not sure of the terminology so bear with me. I have self.desc.
>>> print (self.desc)
This is a string.
It can be multiple sentences, and sometimes, but not always, this string needs to be split into sentences. Because of all the intricacies in determining what a sentence is, I used NLTK for this.
if splitme==True:
    self.desc = sent_tokenize(self.desc)
The problem is that when I don't tokenize it I get "This is a string.", and when I do I get ["This is a string."]. So later on, when I have to reference it, I get:
Tokenized:
>>> print(self.desc[0])
[This is a string.]
Not tokenized:
>>> print(self.desc[0])
T
I know it's because when it is tokenized it's a list and when it's not tokenized it's a string. I just don't know how to fix it. I tried converting the string to a list, but that just puts each character into the list:
>>> self.desc = list(self.desc)
>>> print (self.desc)
['T', 'h', 'i', 's', ' ', 'i', 's', ' ', 'a', ' ', 'l', 'i', 's', 't', '.']
I'm not quite sure how to go about fixing this. I've looked for answers but maybe I'm not using the right search terms.
You can check whether self.desc is still a plain string and, if it is, wrap it in a list:
self.desc = [self.desc] if isinstance(self.desc, str) else self.desc
or you can do the opposite, and turn lists into strings:
self.desc = self.desc[0] if isinstance(self.desc, list) else self.desc
The rest of your code must act accordingly to the transformation you have performed.
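Applied to the tokenizing code from the question, the simplest fix is arguably to make self.desc a list in both branches, so the rest of the code can always treat it as a list of sentences. A minimal sketch, assuming sent_tokenize comes from NLTK as in the question:
from nltk.tokenize import sent_tokenize

if splitme:
    self.desc = sent_tokenize(self.desc)  # already returns a list of sentences
else:
    self.desc = [self.desc]               # wrap the single string in a list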

Python, splitting strings on middle characters with overlapping matches using regex

In Python, I am using regular expressions to retrieve strings from a dictionary that show a specific pattern, such as some optional characters, then a specific character, then another optional part (e.g. ^(\w{0,2})o(\w{0,2})$).
This works as expected, but now I'd like to split the string into two substrings (possibly one of them empty) using the central character as the delimiter. The issue I am having stems from the possibility of multiple overlapping matches inside a string (e.g. I'd want to use the previous regex to split the string room in two different ways, (r, om) and (ro, m)).
Both re.search().groups() and re.findall() did not solve this issue, and the docs on the re module seem to point out that overlapping matches are not returned by these methods.
Here is a snippet showing the undesired behaviour:
import re
dictionary = ('room', 'door', 'window', 'desk', 'for')
regex = re.compile(r'^(\w{0,2})o(\w{0,2})$')
halves = []
for word in dictionary:
    matches = regex.findall(word)
    if matches:
        halves.append(matches)
I am posting this as an answer mainly so as not to leave the question unanswered in case someone stumbles here in the future; since I've managed to reach the desired behaviour, albeit probably not in a very Pythonic way, it might be useful as a starting point for someone else. Notes on how to improve this answer (i.e. making it more "Pythonic" or simply more efficient) would be very welcome.
The only way I found to get all the possible splits of the words having a length in a certain range and a character in a certain range of positions, using the characters in the "legal" positions as delimiters, with both the re and the newer regex modules, involves using multiple regexes. The snippet below builds an appropriate regex at runtime from the length range of the word, the character to look for, and the range of possible positions of that character.
import re

dictionary = ('room', 'roam', 'flow', 'door', 'window',
              'desk', 'for', 'fo', 'foo', 'of', 'sorrow')
char = 'o'
word_len = (3, 6)
char_pos = (2, 3)

# The word must have a length within word_len AND contain char within char_pos.
regex_str = (r'(?=^\w{' + str(word_len[0]) + ',' + str(word_len[1]) + '}$)'
             + r'(?=\w{' + str(char_pos[0] - 1) + ',' + str(char_pos[1] - 1) + '}' + char + ')')

halves = []
for word in dictionary:
    matches = re.match(regex_str, word)
    if matches:
        matched_halves = []
        for pos in range(char_pos[0] - 1, char_pos[1]):
            # Split on char only when it sits exactly `pos` characters from the start.
            split_regex_str = r'(?<=^\w{' + str(pos) + '})' + char
            split_word = re.split(split_regex_str, word)
            if len(split_word) == 2:
                matched_halves.append(split_word)
        halves.append(matched_halves)
The output is:
[[['r', 'om'], ['ro', 'm']], [['r', 'am']], [['fl', 'w']], [['d', 'or'], ['do', 'r']], [['f', 'r']], [['f', 'o'], ['fo', '']], [['s', 'rrow']]]
At this point I might start considering using a regex just to find the words to be split, and then doing the splitting in a "dumb" way, simply checking whether the characters in the allowed position range are equal to char. Anyhow, any remark is extremely appreciated.
EDIT: Fixed.
Does a simple while loop work?
What you want is re.search and then loop with a 1 shift:
https://docs.python.org/2/library/re.html
import re

dictionary = ('room', 'door', 'window', 'desk', 'for')
regex = re.compile(r'(\w{0,2})o(\w{0,2})')
halves = []
for word in dictionary:
    start = 0
    while start < len(word):
        match = regex.search(word, start)
        if match:
            # Restart the search one character after the previous match's start.
            start = match.start() + 1
            halves.append([match.group(1), match.group(2)])
        else:
            # No matches left.
            break
print(halves)
Output:
[['ro', 'm'], ['o', 'm'], ['', 'm'], ['do', 'r'], ['o', 'r'], ['', 'r'], ['nd', 'w'], ['d', 'w'], ['', 'w'], ['f', 'r'], ['', 'r']]
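As a further alternative (not part of this answer), the third-party regex module can return overlapping matches directly through findall(..., overlapped=True), which produces the same kind of result as the manual shifting loop above:
import regex

dictionary = ('room', 'door', 'window', 'desk', 'for')
halves = []
for word in dictionary:
    # overlapped=True also returns matches that start inside earlier matches.
    halves.extend([list(groups) for groups in
                   regex.findall(r'(\w{0,2})o(\w{0,2})', word, overlapped=True)])
print(halves)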

Solving jumbled word puzzles with python?

I have an interesting programming puzzle for you:
You will be given two things:
A word made of a list of English words put together, e.g.:
word = "iamtiredareyou"
Possible subsets:
subsets = [
'i', 'a', 'am', 'amt', 'm', 't', 'ti', 'tire', 'tired', 'i',
'ire', 'r', 're', 'red', 'redare', 'e', 'd', 'da', 'dar', 'dare',
'a', 'ar', 'are', 'r', 're', 'e', 'ey', 'y', 'yo', 'you', 'o', 'u'
]
Challenges:
Level-1: I need to programmatically find the members of subsets which, joined together in some order, make "iamtiredareyou", i.e. ['i', 'am', 'tired', 'are', 'you'].
Level-2: The original string may contain some extra characters in sequence which are not present in the subsets, e.g. "iamtired12aareyou". The subsets given are the same as above; the solution should automatically include the extra characters in the right place in the result array, i.e. ['i', 'am', 'tired', '12a', 'are', 'you'].
How can I do this?
Generally, a recursive algorithm will do.
Start by checking every subset against the start of the given word; when one matches, append it to the found values and recurse on the remaining part of the word with the current found values.
If the word has been fully consumed, record the found values.
Something like this:
all = []

def frec(word, values=[]):
    global all
    if word == "":  # got a result
        all += [values]
    for s in subsets:
        if word.startswith(s):
            frec(word[len(s):], values + [s])

frec(word)
Note that there are lots of possible solutions, since subsets includes many one-character strings; you might want to pick one of the shortest results. (There are 13146 solutions; use `all.sort(key=len)` to put the shortest first.)
For Level 2 you need another branch: if no subset matches, add more and more characters to the next value (and recurse on that) until a match is found.
all = []

def frec(word, values=[]):
    global all
    if word == "":  # got a result
        all += [values]
        return True
    match = False
    for s in subsets:
        if word.startswith(s):
            match = True
            frec(word[len(s):], values + [s])
    if not match:
        # No subset fits here: treat the first character as an extra symbol.
        return frec(word[1:], values + [word[0]])

frec(word)
This does not try to combine non-subset values into one string, though.
I think you should do your own programming exercises...
For the Level 1 challenge you could do it recursively. Probably not the most efficient solution, but the easiest:
word = "iamtiredareyou"
subsets = ['i', 'a', 'am', 'amt', 'm', 't', 'ti', 'tire', 'tired', 'i', 'ire', 'r', 're', 'red', 'redare', 'e', 'd', 'da', 'dar', 'dare', 'a', 'ar', 'are', 'r', 're', 'e', 'ey', 'y', 'yo', 'you', 'o', 'u']
def findsubset():
    global word
    for subset in subsets:
        if word.startswith(subset):
            setlist.append(subset)
            word = word[len(subset):]
            if word == "":
                print(setlist)
            else:
                findsubset()
            # Backtrack: restore the word and drop the subset we just tried.
            word = subset + word
            setlist.pop()
# Remove duplicate entries by making a set
subsets = set(subsets)
setlist = []
findsubset()
Your list of subsets has duplicates in it - e.g. 'a' appears twice - so my code makes it a set to remove the duplicates before searching for results.
Sorry about the lack of programming snippet, but I'd like to suggest dynamic programming. Attack level 1 and level 2 at the same time by giving each word a cost, and adding all the single characters not present as single character high cost words. The problem is then to find the way of splitting the sequence up into words that gives the least total cost.
Work from left to right along the sequence, at each point working out and saving the least cost solution up to and including the current point, and the length of the word that ends that solution. To work out the answer for the next point in the sequence, consider all of the known words that are suffixes of the sequence. For each such word, work out the best total cost by adding the cost of that word to the (already worked out) cost of the best solution ending just before that word starts. Note the smallest total cost and the length of the word that produces it.
Once you have the best cost for the entire sequence, use the length of the last word in that sequence to work out what the last word is, and then step back that number of characters to inspect the answer worked out at that point and get the word just preceding the last word, and so on.
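A minimal sketch of that dynamic-programming idea in Python (the function name and cost values are assumptions for illustration: known words cost 1, unknown characters cost 10, so the cheapest split uses as many known words as possible):
def segment(sequence, known_words):
    WORD_COST = 1      # assumed cost of using a known word
    UNKNOWN_COST = 10  # assumed penalty for a character not covered by any word
    # best[i] = (total cost, length of last word) for the prefix sequence[:i]
    best = [(0, 0)] + [None] * len(sequence)
    for i in range(1, len(sequence) + 1):
        # Fallback: treat sequence[i-1] as a single-character high-cost "word".
        candidates = [(best[i - 1][0] + UNKNOWN_COST, 1)]
        for w in known_words:
            # Consider every known word that is a suffix of sequence[:i].
            if w and sequence.endswith(w, 0, i):
                candidates.append((best[i - len(w)][0] + WORD_COST, len(w)))
        best[i] = min(candidates)
    # Walk back from the end, using the stored last-word lengths.
    result = []
    i = len(sequence)
    while i > 0:
        length = best[i][1]
        result.append(sequence[i - length:i])
        i -= length
    result.reverse()
    return result
With the subsets from the question, segment("iamtiredareyou", set(subsets)) returns one lowest-cost split, e.g. ['i', 'am', 'tired', 'are', 'you'] (ties between equally cheap splits may be resolved differently).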
Isn't this just the same as finding permutations, but with some conditions? You start the permutation algorithm (a recursive one) and check whether the string you have so far matches the first X characters of the word to find; if it does, you continue the recursion until you find the whole word, otherwise you go back.
Level 2 is a bit odd if you ask me, because then you could write almost anything as the "word to be found", but basically it would be just like Level 1, except that if you can't find a substring in your list you simply add it letter by letter (i.e. you have 'love' and a list of ['l', 'e']: you match 'l' but you lack 'o', so you add it and check whether any word in the list starting with 'v' matches the word to be found; none does, so you add 'v' to 'o', and so on).
And if you're bored you can implement a genetic algorithm; it's really fun but not really efficient.
Here is a recursive, inefficient Java solution:
private static void findSolutions(Set<String> fragments, String target,
                                  HashSet<String> solution, Collection<Set<String>> solutions) {
    if (target.isEmpty()) {
        solutions.add(solution);
        return;
    }
    for (String frag : fragments) {
        if (target.startsWith(frag)) {
            HashSet<String> solution2 = new HashSet<String>(solution);
            solution2.add(frag);
            findSolutions(fragments, target.substring(frag.length()), solution2, solutions);
        }
    }
}

public static Collection<Set<String>> findSolutions(Set<String> fragments, String target) {
    HashSet<String> solution = new HashSet<String>();
    Collection<Set<String>> solutions = new ArrayList<Set<String>>();
    findSolutions(fragments, target, solution, solutions);
    return solutions;
}
