Split string on first occurrence of separator only - python

The POS tagger that I use processes the following string
3+2
as shown below.
3/num++/sign+2/num
I'd like to split this result as follows using python.
['3/num', '+/sign', '2/num']
How can I do that?

Use re.split -
>>> import re
>>> re.split(r'(?<!\+)\+', '3/num++/sign+2/num')
['3/num', '+/sign', '2/num']
The regex pattern will split on a + sign as long as no other + precedes it.
(?<! # negative lookbehind
\+ # plus sign
)
\+ # plus sign
Note that lookbehinds (in general) do not support varying length patterns.
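As a small sketch of the same idea, the annotated pattern above can be compiled as-is with re.VERBOSE. The helper name split_tagged is hypothetical, and the example assumes that a literal + token is always immediately preceded by the + separator, as in the tagger output shown in the question.

```python
import re

# Same pattern as above, written with inline comments (re.VERBOSE ignores
# whitespace and treats "#" as a comment marker inside the pattern).
SEPARATOR = re.compile(r"""
    (?<!\+)   # negative lookbehind: the previous character is not a plus
    \+        # the plus sign to split on
""", re.VERBOSE)

def split_tagged(tagged):
    """Split tagger output on '+' separators, keeping '+/sign' tokens intact."""
    return SEPARATOR.split(tagged)

print(split_tagged('3/num++/sign+2/num'))
# ['3/num', '+/sign', '2/num']
```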

The tricky part I believe is the double + sign. You can replace the signs with special characters and get it done.
This should work,
st = '3/num++/sign+2/num'
st = st.replace('++', '#$')
st = st.replace('+', '#')
st = st.replace('$', '+')
print (st.split('#'))
One issue with this approach is that the original string must not contain the placeholder characters # and $, so choose them carefully for your use case.
Edit: This answer is naive; the regex one is better.
That is, as pointed out by COLDSPEED, you should use the following regex approach with a lookbehind:
import re
print(re.split(r'(?<!\+)\+', '3/num++/sign+2/num'))

Although the ask was to use regex, here is an example on how to do this with standard .split():
my_string = '3/num++/sign+2/num'
my_list = []
result = []
# enumerate over the split string
for e in my_string.split('/'):
    if '+' in e:
        if '++' in e:
            #split element on double + and add in + as well
            my_list.append(e.split('++')[0])
            my_list.append('+')
        else:
            #split element on single +
            my_list.extend(e.split('+'))
    else:
        #add element
        my_list.append(e)
# at this point my_list contains
# ['3', 'num', '+', 'sign', '2', 'num']
# enumerate on the list, steps of 2
for i in range(0, len(my_list), 2):
    #add result
    result.append(my_list[i] + '/' + my_list[i+1])
print('result', result)
# result returns
# ['3/num', '+/sign', '2/num']

Related

how to extract words from the string in the list in python?

I have a string of type
string = "[A] Assam[B] Meghalaya[C] West Bengal[D] Odisha "
Output = ['Assam', 'Meghalaya','West Bengal','Odisha']
I tried many ways, but I always end up splitting the substring West Bengal into two halves.
I am not able to cover this edge case.
I tried passing the string into the function below and then splitting the result, but it does not work:
def remove_alpha(string):
    option = ['[A]', '[B]', '[C]', '[D]']
    res = ""
    for i in option:
        res = string.replace(i, '')
        string = res
    return res
You can use regex for this:
import re
string = "[A] Assam[B] Meghalaya[C] West Bengal[D] Odisha "
pattern = re.compile(r"] (.*?)(?:\[|$)")
output = pattern.findall(string.strip())
print(output)
# ['Assam', 'Meghalaya', 'West Bengal', 'Odisha']
How it works: https://regex101.com/r/5peFyC/1
re module
You can split on regex patterns using re.split:
import re
string = "[A] Assam[B] Meghalaya[C] West Bengal[D] Odisha "
print(re.split(r"\s*\[\w\]\s*", string.strip())[1:])
Note that we first eliminate the spaces around the string by strip(), then we use r"\s*\[\w\]\s*" to match up options like [A] with possible spaces. Since the first element of the result is empty, we remove that by slicing [1:] at the end.
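If the option labels themselves are needed, a variation on the same idea is to capture each label together with its text. This is only a sketch: the function name parse_options is hypothetical, and it assumes each label is a single word character in square brackets and that the option text never contains "[".

```python
import re

def parse_options(text):
    """Map each bracketed label, e.g. 'A', to the text that follows it."""
    # \[(\w)\]  captures the label; [^\[]* captures everything up to the
    # next opening bracket (or the end of the string).
    pairs = re.findall(r"\[(\w)\]\s*([^\[]*)", text)
    return {label: value.strip() for label, value in pairs}

string = "[A] Assam[B] Meghalaya[C] West Bengal[D] Odisha "
print(parse_options(string))
# {'A': 'Assam', 'B': 'Meghalaya', 'C': 'West Bengal', 'D': 'Odisha'}
```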
This can be done in a one-line list-comprehension plus a special case of the last option:
[string[string.find(option[i]):string.find(option[i+1])].split(option[i])[1].strip() for i in range(len(option) - 1)] + [string.split(option[-1])[1].strip()]
Broken down into a loop, with some explicit intermediate steps for readability:
res = []
for i in range(len(option) - 1):
    from_ind = string.find(option[i])
    to_ind = string.find(option[i+1])
    sub_str = string[from_ind:to_ind]
    clean_sub_str = sub_str.split(option[i])[1].strip()
    res.append(clean_sub_str)
# Last option add-on
res.append(string.split(option[-1])[1].strip())
print(res)
# ['Assam', 'Meghalaya', 'West Bengal', 'Odisha']
This is not as pretty as using regex, but allows for more flexibility in defining the "options".
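The two approaches can also be combined: keep an explicit list of options, but let re.split do the work. The sketch below is one possible way to do that, not the method from the answer above; split_on_options is a hypothetical helper, and re.escape is used so that the brackets in each option are treated literally.

```python
import re

def split_on_options(text, options):
    """Split text on any of the literal option markers and drop empty parts."""
    # Escape each marker so that characters like "[" lose their regex meaning.
    pattern = '|'.join(re.escape(opt) for opt in options)
    return [part.strip() for part in re.split(pattern, text) if part.strip()]

option = ['[A]', '[B]', '[C]', '[D]']
string = "[A] Assam[B] Meghalaya[C] West Bengal[D] Odisha "
print(split_on_options(string, option))
# ['Assam', 'Meghalaya', 'West Bengal', 'Odisha']
```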
You can split the string with re.split(), which is much more powerful than the str.split() method, and then clean up the result with a list comprehension.
This solution does not require modifying the input string before splitting, and it also works when the input contains stray whitespace, as demonstrated below:
import re
s = " [A] Assam[B] Meghalaya [C] West Bengal [D] Odisha "
print([ r.strip() for r in re.split(r"\[[A-Z]\]", s) if r.strip() ] )
# gives: ['Assam', 'Meghalaya', 'West Bengal', 'Odisha']
The regex pattern r'\[[A-Z]\]' splits on '[A]' up to '[Z]'
r.strip() removes any white spaces enclosing the results
if r.strip() removes empty strings and strings containing only spaces from the result of splitting
The backslash in '\[' and '\]' is necessary as square brackets have a special meaning when using regular expression pattern and must be escaped
[A-Z] represents any of uppercase ASCII letters from A up to Z

how to split string between different separators in python

I want to pick up a substring from <personne01166+30-90>, which the output should look like: +30 and -90.
The strings can be like: 'personne01144+0-30', 'personne01146+0+0', 'personne01180+60-75', etc.
I tried use
<string.split('+')[len(string.split('+')) -1 ].split('+')[0]>
but the output must be two correspondent numbers.
Here is how you can use a list comprehension and re.findall:
import re
s = ['personne01144+0-30', 'personne01146+0+0', 'personne01180+60-75']
print([re.findall(r'[+-]\d+', i) for i in s])
Output:
[['+0', '-30'], ['+0', '+0'], ['+60', '-75']]
re.findall(r'[+-]\d+', i) finds all non-overlapping matches of the pattern [+-]\d+ in the string i.
[+-] matches either + or -. \d+ matches one or more consecutive digits.
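If the values are needed as numbers rather than strings, the matches can be passed straight to int(), which accepts a leading sign. A minimal sketch, where extract_offsets is a hypothetical helper name:

```python
import re

def extract_offsets(name):
    """Return every signed integer in name, e.g. '+30' -> 30, '-90' -> -90."""
    return [int(token) for token in re.findall(r'[+-]\d+', name)]

for name in ['personne01144+0-30', 'personne01146+0+0', 'personne01180+60-75']:
    print(extract_offsets(name))
# [0, -30]
# [0, 0]
# [60, -75]
```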
If you know the interesting part always comes after + then you can simply split twice.
numbers = string.split('+', 1)[1]
if '+' in numbers:
    this, that = numbers.split('+')
elif '-' in numbers:
    this, that = numbers.split('-')
    # the values are still strings, so restore the sign by prepending it
    that = '-' + that
else:
    raise ValueError('Could not parse %s' % string)
Perhaps a regex-based approach makes more sense, though;
import re
m = re.search(r'([-+]\d+)([-+]\d+)$', string)
if m:
    this, that = m.groups()

Pandas to match column contents to keywords (with spaces and brackets )

A column in the data frame contains the keywords I want to match.
I want to check whether each row of the column contains any of the keywords. If yes, print them.
Tried below:
import pandas as pd
import re
Keywords = [
"Caden(S, A)",
"Caden(a",
"Caden(.A))",
"Caden.Q",
"Caden.K",
"Caden"
]
data = {'People' : ["Caden(S, A) Charlotte.A, Caden.K;", "Emily.P Ethan.B; Caden(a", "Grayson.Q, Lily; Caden(.A))", "Mason, Emily.Q Noah.B; Caden.Q - Riley.P"]}
df = pd.DataFrame(data)
pat = '|'.join(r"\b{}\b".format(x) for x in Keywords)
df["found"] = df['People'].str.findall(pat).str.join('; ')
print(df["found"])
It returns NaN. I guess the challenge lies in the spaces and brackets in the keywords.
What's the right way to get the ideal outputs? Thank you.
Caden(S, A); Caden.K
Caden(a
Caden(.A))
Caden.Q
Since you do not need every keyword, only the longest one where keywords overlap, you can use a regex with findall.
First, sort the keywords by length in descending order so that longer alternatives are tried before their shorter prefixes (for example, "Caden(S, A)" before "Caden"). Then escape the values, since they contain special characters. Finally, replace \b with the unambiguous word boundaries (?<!\w) and (?!\w), because \b is context dependent.
Use
pat = r'(?<!\w)(?:{})(?!\w)'.format('|'.join(map(re.escape, sorted(Keywords,key=len,reverse=True))))
See an online Python test:
import re
Keywords = ["Caden(S, A)", "Caden(a","Caden(.A))", "Caden.Q", "Caden.K", "Caden"]
rx = r'(?<!\w)(?:{})(?!\w)'.format('|'.join(map(re.escape, sorted(Keywords,key=len,reverse=True))))
# => (?<!\w)(?:Caden\(S,\ A\)|Caden\(\.A\)\)|Caden\(a|Caden\.Q|Caden\.K|Caden)(?!\w)
strs = ["Caden(S, A) Charlotte.A, Caden.K;", "Emily.P Ethan.B; Caden(a", "Grayson.Q, Lily; Caden(.A))", "Mason, Emily.Q Noah.B; Caden.Q - Riley.P"]
for s in strs:
    print(re.findall(rx, s))
Output
['Caden(S, A)', 'Caden.K']
['Caden(a']
['Caden(.A))']
['Caden.Q']
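To see why \b is the problem for keywords that end in a non-word character such as ")", the short sketch below compares the two kinds of boundary on a single keyword. The variable names are only for illustration.

```python
import re

keyword = "Caden(S, A)"
text = "Caden(S, A) Charlotte.A, Caden.K;"

# \b needs a word character on exactly one side. After ")" comes a space,
# so both sides are non-word characters and the trailing \b cannot match.
with_b = re.findall(r"\b{}\b".format(re.escape(keyword)), text)

# (?<!\w) and (?!\w) only require that no word character is adjacent.
with_lookarounds = re.findall(
    r"(?<!\w){}(?!\w)".format(re.escape(keyword)), text)

print(with_b)            # []
print(with_lookarounds)  # ['Caden(S, A)']
```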
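I don't know if this solution is optimal, but it works. I replaced '.' with 8, '(' with 6 and ')' with 9, because those characters are regex metacharacters and are not matched literally when the pattern is left unescaped. Note that this only works if the text contains no digits 6, 8 or 9 of its own.
This is a kind of bijection between {8, 6, 9} and {'.', '(', ')'}.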
for i in range(len(Keywords)):
    Keywords[i] = Keywords[i].replace('(','6').replace(')','9').replace('.','8')
for i in range(len(df['People'])):
    df['People'][i] = df['People'][i].replace('(','6').replace(')','9').replace('.','8')
And then you apply your function
pat = '|'.join(r"\b{}\b".format(x) for x in Keywords)
df["found"] = df['People'].str.findall(pat).str.join('; ')
As a final step, restore the original characters {'.', '(', ')'}:
for i in range(len(df['found'])):
    df['found'][i] = df['found'][i].replace('6','(').replace('9',')').replace('8','.')
    df['People'][i] = df['People'][i].replace('6','(').replace('9',')').replace('8','.')
Voilà

How can I replace part of a string with a pattern

For example, if the string is "abbacdeffel" and the pattern "xyyx" is to be replaced with "1234",
the result would go from "abbacdeffel" to "1234cd1234l".
I have tried to think this through but couldn't come up with anything. At first I thought a dictionary might help, but nothing came to mind.
What you're looking to do can be accomplished with regex, more formally known as regular expressions. Regular expressions let you extract exactly what you want from a string.
In your case, you want to match substrings with the pattern abba, so use the following regex:
(\w)(\w)\2\1
https://regex101.com/r/hP8lA3/1
This captures two word characters and uses backreferences to require the second group again, followed by the first group.
So implementing this in python code looks like this:
First, import the regex module in python
import re
Then, declare your variable
text = "abbacdeffel"
re.finditer returns an iterator of match objects, so you can loop over all the matches
matches = re.finditer(r"(\w)(\w)\2\1", text)
Go through all the matches that the regexp found and replace each matched substring with "1234"
for match in matches:
    text = text.replace(match.group(0), "1234")
For debugging:
print(text)
Complete Code:
import re
text = "abbacdeffel"
matches = re.finditer(r"(\w)(\w)\2\1", text)
for match in matches:
    text = text.replace(match.group(0), "1234")
print(text)
You can learn more about Regular Expressions here: https://regexone.com/references/python
New version of code (there was a bug):
def replace_with_pattern(pattern, line, replace):
    from collections import OrderedDict
    set_of_chars_in_pattern = set(pattern)
    indice_start_pattern = 0
    output_line = ""
    while indice_start_pattern < len(line):
        potential_end_pattern = indice_start_pattern + len(pattern)
        subline = line[indice_start_pattern:potential_end_pattern]
        print(subline)
        set_of_chars_in_subline = set(subline)
        if len(set_of_chars_in_subline) != len(set_of_chars_in_pattern):
            output_line += line[indice_start_pattern]
            indice_start_pattern += 1
            continue
        map_of_chars = OrderedDict()
        liste_of_chars_in_pattern = []
        for char in pattern:
            if char not in liste_of_chars_in_pattern:
                liste_of_chars_in_pattern.append(char)
        print(liste_of_chars_in_pattern)
        for subline_char in subline:
            if subline_char not in map_of_chars.values():
                map_of_chars[liste_of_chars_in_pattern.pop(0)] = subline_char
        print(map_of_chars)
        wanted_subline = ""
        for char_of_pattern in pattern:
            wanted_subline += map_of_chars[char_of_pattern]
        print("wanted_subline =" + wanted_subline)
        if subline == wanted_subline:
            output_line += replace
            indice_start_pattern += len(pattern)
        else:
            output_line += line[indice_start_pattern]
            indice_start_pattern += 1
    return output_line
Some tests:
test1 = replace_with_pattern("xyyx", "abbacdeffel", "1234")
test2 = replace_with_pattern("abbacdeffel", "abbacdeffel", "1234")
print(test1, test2)
=> 1234cd1234l 1234
Here goes my attempt:
([a-zA-Z])(?!\1)([a-zA-Z])\2\1
Assuming you want to match letters only (for other ranges, change both [a-zA-Z] as appropriate), we have:
([a-zA-Z])
Find the first character, and note it so we can later refer to it with \1.
(?!\1)
Check to see if the next character is not the same as the first, but without advancing the search pointer. This is to prevent aaaa being accepted. If aaaa is OK, just remove this subexpression.
([a-zA-Z])
Find the second character, and note it so we can later refer to it with \2.
\2\1
Now find the second again, then the first again, so we match the full abba pattern.
And finally, to do a replace operation, the full command would be:
import re
re.sub(r'([a-zA-Z])(?!\1)([a-zA-Z])\2\1',
       '1234',
       'abbacdeffelzzzz')
The r at the start of the regex pattern is to prevent Python processing the backslashes. Without it, you would need to do:
import re
re.sub('([a-zA-Z])(?!\\1)([a-zA-Z])\\2\\1',
       '1234',
       'abbacdeffelzzzz')
Now, I see the spec has expanded to a user-defined pattern; here is some code that will build that pattern:
import re
def make_re(pattern, charset):
    result = ''
    seen = []
    for c in pattern:
        # Is this a letter we've seen before?
        if c in seen:
            # Yes, so we want to match the captured pattern
            result += '\\' + str(seen.index(c)+1)
        else:
            # No, so match a new character from the charset,
            # but first exclude already matched characters
            for i in range(len(seen)):
                result += '(?!\\' + str(i + 1) + ')'
            result += '(' + charset + ')'
            # Note we have seen this letter
            seen.append(c)
    return result
print(re.sub(make_re('xzzx', '\\d'), 'abba', 'abba1221b99999889'))
print(re.sub(make_re('xyzxyz', '[a-z]'), '123123', 'abcabc zyxzyyx zyzzyz'))
Outputs:
abbaabbab9999abba
123123 zyxzyyx zyzzyz

Python re.findall getting value

I have a text file with multiple lines in the following pattern:
Server:x.x.x # U:100 # P:100 # Pre:00 # Tel:xxxxxx
I wrote this code to get the value after Pre:
x2 = (re.findall(r'Pre:(\d+)',s))
I'm not very familiar with regex patterns, but this code doesn't get the value if it is + or empty (a None value).
Any suggestions to generalize the code to get whatever value comes after Pre: until the next #, without the space?
How about this as the pattern? It will get everything until the next " #" but without being greedy (that's what the ? is for).
r"Pre:(.*?) #"
The example you've provided works just fine:
>>> import re
>>> s = 'Server:x.x.x # U:100 # P:100 # Pre:00 # Tel:xxxxxx'
>>> re.findall(r'Pre:(\d+)', s)
['00']
You may need to add handling of +/-, . and , for signed numbers and decimals: ([-+]?[\d.,]+).
If you need to match any string (not just numbers) you may want to use Pre:(.*?)\s*#.
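As a quick check of that last pattern against the cases from the question (a numeric value, a bare +, and an empty value), here is a minimal sketch. The helper name pre_value and the sample lines other than the first are made up for illustration.

```python
import re

def pre_value(line):
    """Return the text between 'Pre:' and the next '#', without trailing spaces."""
    m = re.search(r'Pre:(.*?)\s*#', line)
    # None means the line has no "Pre:" field followed by "#".
    return m.group(1) if m else None

print(pre_value('Server:x.x.x # U:100 # P:100 # Pre:00 # Tel:xxxxxx'))  # 00
print(pre_value('Server:x.x.x # U:100 # P:100 # Pre:+ # Tel:xxxxxx'))   # +
print(repr(pre_value('Server:x.x.x # U:100 # P:100 # Pre: # Tel:xxxxxx')))  # ''
```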
Or you may avoid using regexps at all and split row by # separator:
>>> s.split('#')
['Server:x.x.x ', ' U:100 ', ' P:100 ', ' Pre:00 ', ' Tel:xxxxxx']
Then split each row on the first ':':
>>> for row in s.split('#'):
...     k, v = row.split(':', 1)
...     print(k.strip(), '=', v.strip())
...
Server = x.x.x
U = 100
P = 100
Pre = 00
Tel = xxxxxx
A non-regex approach would involve splitting by # and then by : forming a dictionary which would make accessing the parts of the string easy and readable:
>>> s = "Server:x.x.x # U:100 # P:100 # Pre:00 # Tel:xxxxxx"
>>> d = dict([key.split(":") for key in s.split(" # ")])
>>> d["Pre"]
'00'
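A slightly more defensive variant of the same dictionary idea is sketched below. It assumes fields are separated by "#" and that the key ends at the first ":", so empty values and values that themselves contain ":" are still handled. The function name parse_record is hypothetical.

```python
def parse_record(line):
    """Turn 'Key:value # Key:value ...' into a dict of stripped strings."""
    record = {}
    for field in line.split('#'):
        # partition splits on the first ':' only and never raises,
        # even if the field has no ':' at all.
        key, _, value = field.partition(':')
        record[key.strip()] = value.strip()
    return record

d = parse_record("Server:x.x.x # U:100 # P:100 # Pre:00 # Tel:xxxxxx")
print(d["Pre"])  # 00
```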
x2 = (re.findall(r'Pre:(.*?) #',s))
Pre:(.*?) #
Match the character string “Pre:” literally «Pre:»
Match the regex below and capture its match into backreference number 1 «(.*?)»
Match any single character that is NOT a line break character «.»
Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
Match the character string “ #” literally « #»
