How to split a string between all chars and int in Python

For example I want to convert "2pL11H10K" into [2, p, L, 11, H, 10, K]

Use a regular expression. For example:
import re
your_string = "2pL11H10K"
# Match either a single letter or a run of one or more digits
items = re.findall(r'[A-Za-z]|\d+', your_string)
print(items)
This prints:
['2', 'p', 'L', '11', 'H', '10', 'K']

Regular expressions are a good fit here, and there are already some quality answers, but note that you will also need to convert the numeric tokens from str to int. The same kind of pattern helps with that: [0-9]+ (or \d+) identifies runs of one or more digits, and each such run can then be passed to int().
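As a minimal sketch of that conversion (the helper name split_tokens is hypothetical and not taken from the answers here), each token returned by re.findall can be checked with str.isdigit and converted:

```python
import re

def split_tokens(text):
    """Split text into single letters and integer runs (illustrative helper)."""
    tokens = re.findall(r'[A-Za-z]|\d+', text)
    # Digit runs become int; letters stay as one-character strings.
    return [int(tok) if tok.isdigit() else tok for tok in tokens]

print(split_tokens("2pL11H10K"))
```

Characters that are neither ASCII letters nor digits are silently skipped by this pattern, which may or may not be what you want.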

If you do not wish to import any modules, you can implement the logic in a loop that checks whether the previous element and the current character are both digits. However, regex would likely be the most elegant solution.
string = "2pL11H10K"
result = []
for e in string:
    # Extend the previous element while we are still inside a run of digits
    if result and result[-1].isdigit() and e.isdigit():
        result[-1] += e
    else:
        result.append(e)

Related

Split a string after multiple delimiters and include it

Hello I'm trying to split a string without removing the delimiter and it can have multiple delimiters.
The delimiters can be 'D', 'M' or 'Y'
For example:
>>>string = '1D5Y4D2M'
>>>re.split(someregex, string) #should ideally return
['1D', '5Y', '4D', '2M']
To keep the delimiter I used the approach from "Python split() without removing the delimiter":
>>> re.split('([^D]+D)', '1D5Y4D2M')
['', '1D', '', '5Y4D', '2M']
For multiple delimiters I used the approach from "In Python, how do I split a string and keep the separators?":
>>> re.split('(D|M|Y)', '1D5Y4D2M')
['1', 'D', '5', 'Y', '4', 'D', '2', 'M', '']
Combining both doesn't quite make it.
>>> re.split('([^D]+D|[^M]+M|[^Y]+Y)', string)
['', '1D', '', '5Y4D', '', '2M', '']
Any ideas?

I'd use findall() in your case. How about:
re.findall(r'\d+[DYM]', string)
Which will result in:
['1D', '5Y', '4D', '2M']

(?<=(?:D|Y|M))
You need to split on a zero-width assertion. This can be done with the third-party regex module; since Python 3.7, the built-in re.split also splits on empty matches.
See the demo:
https://regex101.com/r/aKV13g/1

You can split at the locations right after D, Y or M but not at the end of the string with
re.split(r'(?<=[DYM])(?!$)', text)
See the regex demo. Details:
(?<=[DYM]) - a positive lookbehind that matches a location that is immediately preceded with D or Y or M
(?!$) - a negative lookahead that fails the match if the current position is the string end position.
Note
In the current scenario, (?<=[DYM]) can be used instead of a more verbose (?<=D|Y|M) since all alternatives are single characters. If you have multichar delimiters, you would have to use a non-capturing group, (?:...), with lookbehind alternatives inside it. For example, to separate right after Y, DX and MZB you would use (?:(?<=Y)|(?<=DX)|(?<=MZB)). See Python Regex Engine - "look-behind requires fixed-width pattern" Error
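A short sketch of both patterns in action, assuming Python 3.7 or later (where re.split accepts patterns that can match the empty string). The multi-character input '1Y2DX3MZB' is a made-up example, not part of the question:

```python
import re

text = '1D5Y4D2M'
# Split right after D, Y or M, except at the very end of the string.
single = re.split(r'(?<=[DYM])(?!$)', text)
print(single)

# Multi-character delimiters: one fixed-width lookbehind per alternative.
multi_text = '1Y2DX3MZB'  # hypothetical input for illustration
multi = re.split(r'(?:(?<=Y)|(?<=DX)|(?<=MZB))(?!$)', multi_text)
print(multi)
```

Each lookbehind has its own fixed width, which is why the alternatives are written as separate lookbehinds rather than as one lookbehind containing alternatives of different lengths.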

I think it will work fine without regex or split
time complexity O(n)
string = '1D5Y4D2M'
temp = ''
res = []
for x in string:
    if x == 'D':
        temp += 'D'
        res.append(temp)
        temp = ''
    elif x == 'M':
        temp += 'M'
        res.append(temp)
        temp = ''
    elif x == 'Y':
        temp += 'Y'
        res.append(temp)
        temp = ''
    else:
        temp += x
print(res)

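Using translate (this assumes '*' never occurs in the input):
string = '1D5Y4D2M'
delimiters = ['D', 'Y', 'M']
# Append a '*' marker after each delimiter, drop the trailing marker, then split on it
result = string.translate({ord(c): f'{c}*' for c in delimiters}).strip('*').split('*')
print(result)
Output:
['1D', '5Y', '4D', '2M']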

Drop Duplicate Substrings from String with NO Spaces

Given a Pandas DF column that looks like this:
...how can I turn it into this:
XOM
ZM
AAPL
SOFI
NKLA
TIGR
Although these strings appear to be at most 4 characters long, I can't rely on that. I want to be able to take a string like ABCDEFGHIJABCDEFGHIJ and still turn it into ABCDEFGHIJ in one column calculation, preferably WITHOUT looping/iterating through the rows.

You can use a regex pattern like r'\b(\w+)\1\b' with str.extract, as below:
import pandas as pd
df = pd.DataFrame({'Symbol': ['ZOMZOM', 'ZMZM', 'SOFISOFI',
                              'ABCDEFGHIJABCDEFGHIJ', 'NOTDUPLICATED']})
print(df['Symbol'].str.extract(r'\b(\w+)\1\b'))
Output:
            0
0         ZOM
1          ZM
2        SOFI
3  ABCDEFGHIJ
4         NaN  # <- from `NOTDUPLICATED`
Explanation:
\b is a word boundary
(\w+) captures a word
\1 refers back to the text captured by the first group, (\w+)
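If values that are not duplicated should pass through unchanged instead of becoming NaN, the same pattern can be used for substitution rather than extraction. The following is a sketch using only the standard library; the helper name collapse_doubled is hypothetical:

```python
import re

def collapse_doubled(symbol):
    """Collapse 'XX' into 'X' when a whole word is one substring repeated twice."""
    # \1 in the replacement keeps a single copy of the captured substring.
    return re.sub(r'\b(\w+)\1\b', r'\1', symbol)

symbols = ['ZOMZOM', 'ZMZM', 'SOFISOFI', 'ABCDEFGHIJABCDEFGHIJ', 'NOTDUPLICATED']
collapsed = [collapse_doubled(s) for s in symbols]
print(collapsed)
```

With pandas, the same pattern and replacement could be passed to Series.str.replace with regex=True, which avoids an explicit loop over the rows.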

An alternative approach, which does involve iteration but also uses regular expressions: evaluate the longest possible substrings first, getting progressively shorter. Use each substring to compile a regex that looks for that substring repeated two or more times. If the regex matches, replace the repetition with a single occurrence of the substring.
This version does not handle leading or trailing characters that are not part of the repetition.
When it performs a removal, it returns, which ends the loop. Going with the longest substrings first ensures that things like 'AAPLAAPL' leave the double A intact.
import re
def remove_repeated(text):
    # Try each suffix, longest first (assumes text contains no regex metacharacters)
    for i in range(len(text)):
        substr = text[i:]
        pattern = re.compile(f"({substr}){{2,}}")
        if pattern.search(text):
            return pattern.sub(substr, text)
    return text
>>> remove_repeated('abcdabcd')
'abcd'
>>> remove_repeated('abcdabcdabcd')
'abcd'
>>> remove_repeated('aabcdaabcdaabcd')
'aabcd'
To make this more flexible, we can add a helper function that yields all of the substrings in a string, starting with the longest. It returns a generator expression, so we don't generate more substrings than we need.
def substrings(text):
    return (text[i:i+l] for l in range(len(text), 0, -1)
                        for i in range(len(text) - l + 1))
>>> list(substrings("hello"))
['hello', 'hell', 'ello', 'hel', 'ell', 'llo', 'he', 'el', 'll', 'lo', 'h', 'e', 'l', 'l', 'o']
But there's no way 'hello' is going to be repeated in 'hello', so we can make this at least somewhat more efficient by looking at only substrings at most half the length of the input string.
def substrings(text):
    return (text[i:i+l] for l in range(len(text)//2, 0, -1)
                        for i in range(len(text) - l + 1))
>>> list(substrings("hello"))
['he', 'el', 'll', 'lo', 'h', 'e', 'l', 'l', 'o']
Now, a little tweak to the original function:
def remove_repeated(text):
    # Candidates come longest first, so the first repeated one wins
    for s in substrings(text):
        pattern = re.compile(f"({s}){{2,}}")
        if pattern.search(text):
            return pattern.sub(s, text)
    return text
And now:
>>> remove_repeated('AAPLAAPL')
'AAPL'
>>> remove_repeated('fooAAPLAAPLbar')
'fooAAPLbar'
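One caveat with interpolating a substring directly into a pattern is that regex metacharacters in the input (such as + or .) are interpreted as regex syntax. The following variant is a sketch, not the answer's own code: it escapes each candidate with re.escape and uses a function as the replacement so the substring is inserted literally.

```python
import re

def substrings(text):
    """Yield substrings of at most half the input length, longest first."""
    return (text[i:i + n]
            for n in range(len(text) // 2, 0, -1)
            for i in range(len(text) - n + 1))

def remove_repeated(text):
    for s in substrings(text):
        # re.escape makes characters like '+' match themselves.
        pattern = re.compile(f"(?:{re.escape(s)}){{2,}}")
        if pattern.search(text):
            # A callable replacement avoids backslash processing in s.
            return pattern.sub(lambda _m: s, text)
    return text

print(remove_repeated('fooAAPLAAPLbar'))
print(remove_repeated('A+A+'))
```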

How to remove alphabets and extract numbers using regex in python?

How to remove alphabets and extract numbers using regex in python?
import re
l=["098765432123 M","123456789012"]
s = re.findall(r"(?<!\d)\d{12}", l)
print(s)
Expected Output:
123456789012

If all you want is a filtered list containing only the elements made up purely of digits, use filter with str.isdigit:
list(filter(str.isdigit, l))
Or, as @tobias_k suggested, a list comprehension is always your friend:
[s for s in l if s.isdigit()]
Output:
['123456789012']

I would suggest using a negative lookahead assertion if, as stated, you want to use regex only.
import re
l = ["098765432123 M", "123456789012"]
res = []
for a in l:
    # 12 digits, not preceded by a digit and not followed by a space plus a letter
    s = re.search(r"(?<!\d)\d{12}(?! [a-zA-Z])", a)
    if s is not None:
        res.append(s.group(0))
The result would then be:
['123456789012']

To keep only the digits you can use re.findall(r'\d', s), but you'll get a list of single characters:
s = re.findall(r'\d', "098765432123 M")
print(s)
['0', '9', '8', '7', '6', '5', '4', '3', '2', '1', '2', '3']

So to be clear, do you want to ignore the whole string if there is an alphabetic character in it? Or do you still want to extract the numbers from a string that contains both numbers and alphabetic characters?
If you want to find all numbers, and always find the longest number, use this (where test_str is the string to search):
regex = r"\d+"
matches = re.finditer(regex, test_str)
\d matches a digit, and + matches one or more of the preceding item. Because + is greedy, each match is the longest consecutive run of digits.
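As a small illustration of iterating over those matches (the input string below is made up for the example), each match object exposes both the matched digits and their position:

```python
import re

test_str = "098765432123 M and 42"  # hypothetical input for illustration
for m in re.finditer(r"\d+", test_str):
    # m.start() is the offset of the run, m.group() the digits themselves
    print(m.start(), m.group())

# Collect just the digit runs as strings.
runs = [m.group() for m in re.finditer(r"\d+", test_str)]
```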
If you only want to find strings without alphabetic characters:
import re
regex = r"[a-zA-Z]"
test_str = ("098765432123 M", "123456789012")
for x in test_str:
    if not re.search(regex, x):
        print(x)

Find all strings in nested brackets

How do I find strings in nested brackets?
Let's say I have a string
uv(wh(x(yz))
and I want to find all strings in brackets (so wh, x, yz)
import re
s="uuv(wh(x(yz))"
regex = r"(\(\w*?\))"
matches = re.findall(regex, s)
The above code only finds yz
Can I modify this regex to find all matches?

To get all properly parenthesized text:
import re
def get_all_in_parens(text):
    in_parens = []
    n = "has something to substitute"
    while n:
        text, n = re.subn(r'\(([^()]*)\)',  # match flat expression in parens
                          lambda m: in_parens.append(m.group(1)) or '', text)
    return in_parens
Example:
>>> get_all_in_parens("uuv(wh(x(yz))")
['yz', 'x']
Note: there is no 'wh' in the result due to the unbalanced paren.
If the parentheses are balanced, it returns all three nested substrings:
>>> get_all_in_parens("uuv(wh(x(yz)))")
['yz', 'x', 'wh']
>>> get_all_in_parens("a(b(c)de)")
['c', 'bde']

Would a string split work instead of a regex?
>>> s = 'uv(wh(x(yz))'
>>> match = [''.join(x for x in i if x.isalpha()) for i in s.split('(')]
>>> print(match)
['uv', 'wh', 'x', 'yz']
>>> match.pop(0)
'uv'
You can pop off the first element: if the string starts with a parenthesis, the first element is empty, which you don't want; and if it isn't empty, that text was not inside parentheses, so again you don't want it.
Since that wasn't flexible enough, something like this would work:
import re
def match(string):
    # Capture word characters that follow a '(' or that precede a ')'
    unrefined_match = re.findall(r'\((\w+)|(\w+)\)', string)
    return [x for i in unrefined_match for x in i if x]
>>> match('uv(wh(x(yz))')
['wh', 'x', 'yz']
>>> match('a(b(c)de)')
['b', 'c', 'de']

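Using regex, a pattern such as this might work:
\((\w{1,})
Result:
['wh', 'x', 'yz']
Your current pattern requires a literal ) immediately after the word characters, so only the innermost (yz) can match. The pattern above drops that requirement and captures the word characters that follow each (.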

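Well, if you know how to convert a PHP (PCRE) regex to Python, then you can use this:
\(((?>[^()]+)|(?R))*\)
Note that Python's built-in re module does not support the recursive (?R) construct; the third-party regex module does.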
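Without recursive patterns in the standard library, one alternative is a small stack-based scan. The following is a sketch and not one of the answers here; the function name strings_in_brackets is made up. It collects the text directly enclosed by each opening bracket, in order of opening, and tolerates unbalanced input:

```python
def strings_in_brackets(text):
    """Return the text directly enclosed by each '(' in order of opening."""
    results = []   # one entry per '(' seen so far
    open_idx = []  # stack of indices into results for brackets still open
    for ch in text:
        if ch == '(':
            results.append('')
            open_idx.append(len(results) - 1)
        elif ch == ')':
            if open_idx:  # ignore a stray ')'
                open_idx.pop()
        elif open_idx:
            # Ordinary character: it belongs to the innermost open bracket.
            results[open_idx[-1]] += ch
    return results

print(strings_in_brackets('uv(wh(x(yz))'))
print(strings_in_brackets('a(b(c)de)'))
```

Unlike the regex-based answers, this treats text on both sides of a nested group as one string, so 'a(b(c)de)' yields 'bde' for the outer bracket.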

Regular expression using finder

I'm trying to figure out this expression:
p = re.compile ("[I need this]")
for m in p.finditer('foo, I need this, more foo'):
print m.start(), m.group()
I need to understand why I'm getting "e" in count 22, and how to rewrite this correctly.

[] denotes a character class. In your case, [I need this] means: match a single character that is one of I, n, e, d, t, h, i, s, or a space. It is equivalent to [Inedthis ]. If you would like to match the whole phrase, omit the brackets. If you want to match literal brackets as well, escape them: \[I ... \].

By using [], you are searching for the character class [ Idehinst], that is the set of the characters ' ', 'I', 'd', 'e', 'h', 'i', 'n', 's', 't'.
Using (...) matches whatever regular expression is inside the parentheses, and indicates the start and end of a group.
If you want to search for the whole phrase as a group, use (I need this):
>>> import re
>>> p = re.compile("(I need this)")
>>> for m in p.finditer('foo, I need this, more foo'):
...     print m.start(), m.group()
...
5 I need this
For more information, see 7.2.1. Regular Expression Syntax in the official documentation.
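The snippets in this thread use the Python 2 print statement. As a sketch of the same comparison in Python 3 (the variable names class_hits and phrase_hits are only for illustration), the difference between the character class and the literal phrase looks like this:

```python
import re

text = 'foo, I need this, more foo'

# Character class: each match is ONE character taken from the set.
class_hits = [(m.start(), m.group()) for m in re.finditer(r'[I need this]', text)]
print(class_hits[:3])

# Literal phrase: the whole substring is matched once.
phrase_hits = [(m.start(), m.group()) for m in re.finditer(r'I need this', text)]
print(phrase_hits)
```

The character class also picks up the "e" in "more" and every space in the string, which explains the unexpected single-character matches.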
