Splitting strings in Python based on index

This sounds pretty basic, but I can't think of a neat, straightforward way to do this in Python yet.
I have a string like "abcdefgh" and I need to create a list of elements picking two characters at a time from the string to get ['ab','cd','ef','gh'].
What I am doing right now is this:
output = []
for i in range(0, len(input), 2):
    output.append(input[i:i+2])
Is there a nicer way?

In [2]: s = 'abcdefgh'
In [3]: [s[i:i+2] for i in range(0, len(s), 2)]
Out[3]: ['ab', 'cd', 'ef', 'gh']

Just for the fun of it, if you hate for loops:
>>> s='abcdefgh'
>>> map(''.join, zip(s[::2], s[1::2]))
['ab', 'cd', 'ef', 'gh']

Is there a nicer way?
Sure. List comprehension can do that.
def n_chars_at_a_time(s, n=2):
    return [s[i:i+n] for i in xrange(0, len(s), n)]
should do what you want. The s[i:i+n] returns the substring starting at i and ending n characters later.
n_chars_at_a_time("foo bar baz boo", 2)
produces
['fo', 'o ', 'ba', 'r ', 'ba', 'z ', 'bo', 'o']
in the Python REPL.
For more info see Generator Expressions and List Comprehensions:
Two common operations on an iterator's output are 1) performing some operation for every element, 2) selecting a subset of elements that meet some condition.
For example, given a list of strings, you might want to strip off trailing whitespace from each line or extract all the strings containing a given substring.
List comprehensions and generator expressions (short form: “listcomps” and “genexps”) are a concise notation for such operations...
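If the input string is long and you only need the chunks one at a time, the same slicing idea also works lazily as a generator expression (a small sketch; wrap it in list() when you want the whole result at once):
s = 'abcdefgh'
pairs = (s[i:i+2] for i in range(0, len(s), 2))  # nothing is built until you iterate
print(list(pairs))  # ['ab', 'cd', 'ef', 'gh']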

Related

Drop Duplicate Substrings from String with NO Spaces

Given a Pandas DF column that looks like this:
...how can I turn it into this:
XOM
ZM
AAPL
SOFI
NKLA
TIGR
Although these strings appear to be 4 characters in length maximum, I can't rely on that, I want to be able to have a string like ABCDEFGHIJABCDEFGHIJ and still be able to turn it into ABCDEFGHIJ in one column calculation. Preferably WITHOUT for looping/iterating through the rows.
You can use a regex pattern like r'\b(\w+)\1\b' with str.extract, as below:
import pandas as pd

df = pd.DataFrame({'Symbol': ['ZOMZOM', 'ZMZM', 'SOFISOFI',
                              'ABCDEFGHIJABCDEFGHIJ', 'NOTDUPLICATED']})
print(df['Symbol'].str.extract(r'\b(\w+)\1\b'))
Output:
0
0 ZOM
1 ZM
2 SOFI
3 ABCDEFGHIJ
4 NaN # <- from `NOTDUPLICATED`
Explanation:
\b is a word boundary
(\w+) captures a word
\1 refers back to the word captured by the first group (\w+)
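To see what that pattern does outside of pandas, here is a rough single-string equivalent with the standard re module, using fullmatch instead of the word boundaries (just an illustration of the backreference):
import re
# (\w+) captures a word; \1 requires that same word to appear again right after it
print(re.fullmatch(r'(\w+)\1', 'SOFISOFI').group(1))  # SOFI
print(re.fullmatch(r'(\w+)\1', 'NOTDUPLICATED'))      # None -- nothing is repeated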
An alternative approach, which does involve iteration but also regular expressions: evaluate the longest possible substrings first, getting progressively shorter. Use each substring to compile a regex that looks for that substring repeated two or more times. If it finds that, replace it with a single occurrence of the substring.
It does not handle leading or trailing characters that are not part of the repetition.
When it performs a removal, it returns, breaking the loop. Going with the longest substrings first ensures things like 'AAPLAAPL' leave the double A intact.
import re
def remove_repeated(str):
    for i in range(len(str)):
        substr = str[i:]
        pattern = re.compile(f"({substr}){{2,}}")
        if pattern.search(str):
            return pattern.sub(substr, str)
    return str
>>> remove_repeated('abcdabcd')
'abcd'
>>> remove_repeated('abcdabcdabcd')
'abcd'
>>> remove_repeated('aabcdaabcdaabcd')
'aabcd'
To make this more flexible, here is a helper function to get all of the substrings of a string, starting with the longest, written as a generator expression so we don't have to actually generate more than we need:
def substrings(str):
    return (str[i:i+l] for l in range(len(str), 0, -1)
            for i in range(len(str) - l + 1))
>>> list(substrings("hello"))
['hello', 'hell', 'ello', 'hel', 'ell', 'llo', 'he', 'el', 'll', 'lo', 'h', 'e', 'l', 'l', 'o']
But there's no way 'hello' is going to be repeated in 'hello', so we can make this at least somewhat more efficient by looking at only substrings at most half the length of the input string.
def substrings(str):
    return (str[i:i+l] for l in range(len(str)//2, 0, -1)
            for i in range(len(str) - l + 1))
>>> list(substrings("hello"))
['he', 'el', 'll', 'lo', 'h', 'e', 'l', 'l', 'o']
Now, a little tweak to the original function:
def remove_repeated(str):
    for s in substrings(str):
        pattern = re.compile(f"({s}){{2,}}")
        if pattern.search(str):
            return pattern.sub(s, str)
    return str
And now:
>>> remove_repeated('AAPLAAPL')
'AAPL'
>>> remove_repeated('fooAAPLAAPLbar')
'fooAAPLbar'
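To apply this to the original DataFrame column, Series.map calls the function element-wise, so there is no explicit loop over rows in your code (a sketch reusing the df from the first answer; the 'Deduped' column name is just illustrative, and non-duplicated values pass through unchanged):
import pandas as pd
df = pd.DataFrame({'Symbol': ['ZOMZOM', 'ZMZM', 'SOFISOFI',
                              'ABCDEFGHIJABCDEFGHIJ', 'NOTDUPLICATED']})
df['Deduped'] = df['Symbol'].map(remove_repeated)
print(df)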

Particular List Comprehension

Could anyone explain how to understand this particular list comprehension?
I have tried to decode the list comprehension below using How to read aloud Python List Comprehensions?, but am still not able to understand it.
words = "".join([",",c][int(c.isalnum())] for c in sen).split(",")
Let's say:
sen='i love dogs'
So the output would be:
['i', 'love', 'dogs']
Here is a better way with split:
print(sen.split())
Output:
['i', 'love', 'dogs']
Explaining your code:
It iterates over the string, and if the character is not alphanumeric (e.g. a space), it replaces it with a comma.
After all of that, it uses split to break the string apart at the commas.
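The same replace-then-split idea reads more clearly with a conditional expression instead of the list-indexing trick (an equivalent sketch of what the original comprehension does):
sen = 'i love dogs'
words = "".join(c if c.isalnum() else ',' for c in sen).split(',')
print(words)  # ['i', 'love', 'dogs']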
Basically, you've got this:
For each character (c) in the sentence (sen), create a list [',', character].
If character is a letter or number (.isalnum()), add the character to the list being built by the comprehension. Or rather:
`[',', character][1]`.
If not, take the comma (","), and add that to the list being built by the comprehension.
Or rather:
`[',', character][0]`
Now, join the list together into a string:
`"".join(['I', ',', 'l', 'o', 'v', 'e', ',', 'd', 'o', 'g', 's', ','])`
becomes
`"I,love,dogs,"`
Now split that string, using commas as the break, into a list:
"I,love,dogs,".split(",")
becomes
`['I', 'love', 'dogs', '']`
The trick here is that [",",c][int(c.isalnum())] is actually an index into a two-element list, using the truth value of isalnum(), converted to an int, as either index 0 or index 1.
So, basically, if c is the character "b", for example, you have [',', character][1], which picks "b".
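For example, in the REPL:
>>> [',', 'b'][int('b'.isalnum())]  # True -> 1 -> the character itself
'b'
>>> [',', ' '][int(' '.isalnum())]  # False -> 0 -> a comma
','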
Hope this helps.
PS: In my example, I'm using sen = 'I love dogs.'. Can you spot the difference between your result and mine, and understand why it happens?
Here's code:
sen = 'I love dogs.'
words = "".join([",",character][int(character.isalnum())] for character in sentence).split(",")
print(words)
Result:
['I', 'love', 'dogs', '']

Find all strings in nested brackets

How do I find strings in nested brackets?
Let's say I have a string
uv(wh(x(yz))
and I want to find all strings in brackets (so wh, x, yz).
import re
s="uuv(wh(x(yz))"
regex = r"(\(\w*?\))"
matches = re.findall(regex, s)
The above code only finds yz.
Can I modify this regex to find all matches?
To get all properly parenthesized text:
import re
def get_all_in_parens(text):
    in_parens = []
    n = "has something to substitute"
    while n:
        text, n = re.subn(r'\(([^()]*)\)',  # match flat expression in parens
                          lambda m: in_parens.append(m.group(1)) or '',
                          text)
    return in_parens
Example:
>>> get_all_in_parens("uuv(wh(x(yz))")
['yz', 'x']
Note: there is no 'wh' in the result due to the unbalanced paren.
If the parentheses are balanced, it returns all three nested substrings:
>>> get_all_in_parens("uuv(wh(x(yz)))")
['yz', 'x', 'wh']
>>> get_all_in_parens("a(b(c)de)")
['c', 'bde']
Would a string split work instead of a regex?
s='uv(wh(x(yz))'
match=[''.join(x for x in i if x.isalpha()) for i in s.split('(')]
>>> print(match)
['uv', 'wh', 'x', 'yz']
>>> match.pop(0)
You could pop off the first element: if the string started inside a parenthesis, the first element would be blank, which you wouldn't want, and if it wasn't blank, that means it wasn't inside parentheses, so again you wouldn't want it.
Since that wasn't flexible enough, something like this would work:
def match(string):
    unrefined_match = re.findall('\((\w+)|(\w+)\)', string)
    return [x for i in unrefined_match for x in i if x]
>>> match('uv(wh(x(yz))')
['wh', 'x', 'yz']
>>> match('a(b(c)de)')
['b', 'c', 'de']
Using regex, a pattern such as this might potentially work:
\((\w{1,})
Result:
['wh', 'x', 'yz']
Your current pattern escapes the ( and ), so they are matched as literal characters around the word; only the innermost group (yz) is directly enclosed by both, which is why it is the only match.
Well, if you know how to convert from a PHP regex to Python, then you can use this:
\(((?>[^()]+)|(?R))*\)
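For reference, that recursive pattern needs the third-party regex module (the standard re module supports neither atomic groups nor (?R) recursion). A minimal sketch on a balanced version of the input:
import regex  # third-party module: pip install regex
pattern = regex.compile(r'\(((?>[^()]+)|(?R))*\)')
m = pattern.search('uuv(wh(x(yz)))')
print(m.group(0))  # (wh(x(yz))) -- the whole balanced outer group
It matches whole balanced groups rather than listing each inner string, so you would still need something like the subn loop above to pull out wh, x and yz separately.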

Python, splitting strings on middle characters with overlapping matches using regex

In Python, I am using regular expressions to retrieve strings from a dictionary which show a specific pattern, such as having some repetition of characters, then a specific character, and another repetitive part (e.g. ^(\w{0,2})o(\w{0,2})$).
This works as expected, but now I'd like to split the string into two substrings (possibly one of them empty) using the central character as the delimiter. The issue I am having stems from the possibility of multiple overlapping matches inside a string (e.g. I'd want to use the previous regex to split the string room in two different ways, (r, om) and (ro, m)).
Both re.search().groups() and re.findall() did not solve this issue, and the docs on the re module seem to point out that overlapping matches will not be returned by these methods.
Here is a snippet showing the undesired behaviour:
import re
dictionary = ('room', 'door', 'window', 'desk', 'for')
regex = re.compile('^(\w{0,2})o(\w{0,2})$')
halves = []
for word in dictionary:
    matches = regex.findall(word)
    if matches:
        halves.append(matches)
I am posting this as an answer mainly so as not to leave the question unanswered, in case someone stumbles here in the future. Since I've managed to reach the desired behaviour, albeit probably not in a very Pythonic way, this might be useful as a starting point for someone else. Any notes on how to improve this answer (i.e. make it more Pythonic or simply more efficient) would be very welcome.
The only way I found of getting all the possible splits of words having a length in a certain range and a character in a certain range of positions, using the characters in the "legal" positions as delimiters, both with the re and the newer regex modules, involves using multiple regexes. This snippet creates an appropriate regex at runtime, given the length range of the word, the character to look for and the range of possible positions of that character.
import re

dictionary = ('room', 'roam', 'flow', 'door', 'window',
              'desk', 'for', 'fo', 'foo', 'of', 'sorrow')
char = 'o'
word_len = (3, 6)
char_pos = (2, 3)
regex_str = ('(?=^\w{' + str(word_len[0]) + ',' + str(word_len[1]) + '}$)(?=\w{'
             + str(char_pos[0]-1) + ',' + str(char_pos[1]-1) + '}' + char + ')')
halves = []
for word in dictionary:
    matches = re.match(regex_str, word)
    if matches:
        matched_halves = []
        for pos in xrange(char_pos[0]-1, char_pos[1]):
            split_regex_str = '(?<=^\w{' + str(pos) + '})' + char
            split_word = re.split(split_regex_str, word)
            if len(split_word) == 2:
                matched_halves.append(split_word)
        halves.append(matched_halves)
The output is:
[[['r', 'om'], ['ro', 'm']], [['r', 'am']], [['fl', 'w']], [['d', 'or'], ['do', 'r']], [['f', 'r']], [['f', 'o'], ['fo', '']], [['s', 'rrow']]]
At this point I might start considering using a regex just to find the words to be split and then doing the splitting in a 'dumb' way, just checking whether the characters in the range of positions are equal to char. Anyhow, any remark is extremely appreciated.
EDIT: Fixed.
Does a simple while loop work?
What you want is re.search, and then to loop with a shift of 1:
https://docs.python.org/2/library/re.html
>>> dictionary = ('room', 'door', 'window', 'desk', 'for')
>>> regex = re.compile('(\w{0,2})o(\w{0,2})')
>>> halves = []
>>> for word in dictionary:
...     start = 0
...     while start < len(word):
...         match = regex.search(word, start)
...         if match:
...             start = match.start() + 1
...             halves.append([match.group(1), match.group(2)])
...         else:
...             # no matches left
...             break
>>> print halves
[['ro', 'm'], ['o', 'm'], ['', 'm'], ['do', 'r'], ['o', 'r'], ['', 'r'], ['nd', 'w'], ['d', 'w'], ['', 'w'], ['f', 'r'], ['', 'r']]
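If a third-party dependency is acceptable, the regex module can return overlapping matches directly via overlapped=True, which should reproduce the manual one-character shift above (a sketch, shown for a single word rather than the whole dictionary):
import regex  # third-party module: pip install regex
print(regex.findall(r'(\w{0,2})o(\w{0,2})', 'room', overlapped=True))
# expected: [('ro', 'm'), ('o', 'm'), ('', 'm')]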

Create new list of substrings from list of strings

Is there an easy way in python of creating a list of substrings from a list of strings?
Example:
original list: ['abcd','efgh','ijkl','mnop']
list of substrings: ['bc','fg','jk','no']
I know this could be achieved with a simple loop, but is there an easier way in Python (maybe a one-liner)?
Use slicing and list comprehension:
>>> lis = ['abcd','efgh','ijkl','mnop']
>>> [ x[1:3] for x in lis]
['bc', 'fg', 'jk', 'no']
Slicing:
>>> s = 'abcd'
>>> s[1:3]  # return the sub-string from index 1 to 2 (3 is not inclusive)
'bc'
With a mix of slicing and list comprehensions, you can do it like this:
listy = ['abcd','efgh','ijkl','mnop']
[item[1:3] for item in listy]
# ['bc', 'fg', 'jk', 'no']
You can use a one-liner list comprehension.
Using slicing and relative positions, you can trim the first and last character of each item.
>>> l = ['abcd','efgh','ijkl','mnop']
>>> [x[1:-1] for x in l]
['bc', 'fg', 'jk', 'no']
If you are doing this many times, consider using a function:
def trim(string, trim_left=1, trim_right=1):
    return string[trim_left:-trim_right]

def trim_list(lst, trim_left=1, trim_right=1):
    return [trim(x, trim_left, trim_right) for x in lst]
>>> trim_list(['abcd','efgh','ijkl','mnop'])
['bc', 'fg', 'jk', 'no']
If you want to do this in one line you could try this:
>>> map(lambda s: s[1:-1], ['abcd','efgh','ijkl','mnop'])
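Note that on Python 3, map returns an iterator rather than a list, so wrap it in list() to see the resulting list:
>>> list(map(lambda s: s[1:-1], ['abcd', 'efgh', 'ijkl', 'mnop']))
['bc', 'fg', 'jk', 'no']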
