Python regular expression to extract the parenthesis

I have the following unwieldy code to extract 'ABC' and '(XYZ)' from the string 'ABC(XYZ)':
import re
test_str = 'ABC(XYZ)'
partone = re.sub(r'\([^)]*\)', '', test_str)
parttwo_temp = re.match(r'.*\((.+)\)', test_str)
parttwo = '(' + parttwo_temp.group(1) + ')'
I was wondering if someone can think of a better regular expression to split up the string. Thanks.

You may use re.findall
>>> import re
>>> test_str = 'ABC(XYZ)'
>>> re.findall(r'\([^()]*\)|[^()]+', test_str)
['ABC', '(XYZ)']
>>> [i for i in re.findall(r'(.*)(\([^()]*\))', test_str)[0]]
['ABC', '(XYZ)']

[i for i in re.split(r'(.*?)(\(.*?\))', test_str) if i]

For this kind of input data, we can replace the ( with space+( and split by space:
>>> s = 'ABC(XYZ)'
>>> s.replace("(", " (").split()
['ABC', '(XYZ)']
This way we are artificially creating a delimiter before every opening parenthesis.
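As a rough comparison of the two approaches above, the sketch below wraps the findall pattern in a helper (the name split_parens is made up for illustration) and runs it on an input that contains spaces and more than one parenthesized group. The replace-and-split trick relies on whitespace as the delimiter, so it is assumed to suit only inputs without spaces.

```python
import re

def split_parens(text):
    # Hypothetical helper: each match is either one "(...)" group
    # or a run of characters that contains no parentheses.
    return re.findall(r'\([^()]*\)|[^()]+', text)

print(split_parens('ABC(XYZ)'))           # ['ABC', '(XYZ)']
print(split_parens('A B(X Y)DEF(123)'))   # ['A B', '(X Y)', 'DEF', '(123)']

# The replace-and-split approach breaks the pieces apart at every space:
print('A B(X Y)'.replace('(', ' (').split())  # ['A', 'B', '(X', 'Y)']
```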

Related

python regular expression split function issue

I'm using python2 and I want to get rid of these empty strings in the output of the following python regular expression:
import re
x = "010101000110100001100001"
print re.split("([0-1]{8})", x)
and the output is this :
['', '01010100', '', '01101000', '', '01100001', '']
I just want to get this output:
['01010100', '01101000', '01100001']

Regex probably isn't what you want to use in this case. It seems that you want to just split the string into groups of n (8) characters.
I poached an answer from this question.
def split_every(n, s):
    return [ s[i:i+n] for i in xrange(0, len(s), n) ]
split_every(8, "010101000110100001100001")
Out[2]: ['01010100', '01101000', '01100001']

One possible way:
print filter(None, re.split("([0-1]{8})", x))

import re
x = "010101000110100001100001"
l = re.split("([0-1]{8})", x)
l2 = [i for i in l if i]
out:
['01010100', '01101000', '01100001']

This is exactly what split is for: it splits the string, using the regular expression as the separator.
If you need to find all matches, use findall instead:
import re
x = "010101000110100001100001"
print(re.findall("([0-1]{8})", x))
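To make the difference concrete, here is a small side-by-side sketch. It assumes the input length is a multiple of 8; any trailing bits that do not fill a full group would simply be skipped by findall.

```python
import re

x = "010101000110100001100001"

# With a capturing group, re.split also returns the separators. The empty
# strings are the (empty) text before, between and after the 8-bit matches.
pieces = re.split(r"([01]{8})", x)
print(pieces)

# re.findall returns only the matches, so there is nothing to filter out.
chunks = re.findall(r"[01]{8}", x)
print(chunks)
```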

print([a for a in re.split("([0-1]{8})", x) if a != ''])

Following your regex approach, you can simply use a filter to get your desired output.
import re
x = "010101000110100001100001"
unfiltered_list = re.split("([0-1]{8})", x)
print filter(None, unfiltered_list)
If you run this, you should get:
['01010100', '01101000', '01100001']

Python removing delimiters from strings

I have 2 related questions/ issues.
def remove_delimiters (delimiters, s):
    for d in delimiters:
        ind = s.find(d)
        while ind != -1:
            s = s[:ind] + s[ind+1:]
            ind = s.find(d)
    return ' '.join(s.split())
delimiters = [",", ".", "!", "?", "/", "&", "-", ":", ";", "#", "'", "..."]
d_dataset_list = ['hey-you...are you ok?']
d_list = []
for d in d_dataset_list:
    d_list.append(remove_delimiters(delimiters, d))
print d_list
Output = 'heyyouare you ok'
What is the best way to avoid words being joined together when a delimiter is removed? For example, the output here should be hey you are you ok.
There may be many different runs of dots, for example .. or .........., not just .... How can I implement a rule that removes any run of more than one . without hard-coding every sequence in my delimiters list? Thank you.

You could try something like this:
Given delimiters d, join them to a regular expression
>>> d = ",.!?/&-:;#'..."
>>> "["+"\\".join(d)+"]"
"[,\\.\\!\\?\\/\\&\\-\\:\\;\\#\\'\\.\\.\\.]"
Split the string using this regex with re.split
>>> s = 'hey-you...are you ok?'
>>> re.split("["+"\\".join(d)+"]", s)
['hey', 'you', '', '', 'are you ok', '']
Join all the non-empty fragments back together
>>> ' '.join(w for w in re.split("["+"\\".join(d)+"]", s) if w)
'hey you are you ok'
Also, if you just want to remove all non-word characters, you can use the character class \W instead of manually enumerating all the delimiters:
>>> ' '.join(w for w in re.split(r"\W", s) if w)
'hey you are you ok'
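A variation on the same idea, sketched under the assumption that the delimiters arrive as a list of strings (as in the question): re.escape handles the escaping, and a trailing + treats any run of delimiters as one separator, so .. and .......... need no special entries. The function name collapse_delimiters is made up for this example.

```python
import re

def collapse_delimiters(delimiters, s):
    # Escape each delimiter so characters such as "." and "?" match literally.
    alternatives = "|".join(re.escape(d) for d in delimiters)
    # The trailing "+" makes a whole run of delimiters count as one match.
    pattern = "(?:" + alternatives + ")+"
    # Replace each run with a space, then normalise the whitespace.
    return " ".join(re.sub(pattern, " ", s).split())

delimiters = [",", ".", "!", "?", "/", "&", "-", ":", ";", "#", "'"]
print(collapse_delimiters(delimiters, "hey-you...are you ok?"))  # hey you are you ok
print(collapse_delimiters(delimiters, "wait..........what"))     # wait what
```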

So first of all, your function for removing delimiters could be simplified greatly by using the replace function (http://www.tutorialspoint.com/python/string_replace.htm)
This would help solve your first question. Instead of just removing them, replace with a space, then get rid of the spaces using the pattern you already used (split() treats consecutive delimiters as one)
A better function, which does this, would be:
def remove_delimiters (delimiters, s):
    new_s = s
    for i in delimiters: #replace each delimiter in turn with a space
        new_s = new_s.replace(i, ' ')
    return ' '.join(new_s.split())
To answer your second question, I'd say it's time for regular expressions. The pattern \.+ matches a run of one or more dots:
>>> import re
>>> ss = 'hey ... you are ....... what?'
>>> print ' '.join(re.sub(r'\.+', ' ', ss).split())
hey you are what?
>>>

String split on specific characters

I have a string like;
'[abc] [def] [zzz]'
How would I be able to split it into three parts:
abc
def
zzz

You can use re.findall:
>>> from re import findall
>>> findall(r'\[([^\]]*)\]', '[abc] [def] [zzz]')
['abc', 'def', 'zzz']
>>>
All of the Regex syntax used above is explained in the link, but here is a quick breakdown:
\[ # [
( # The start of a capture group
[^\]]* # Zero or more characters that are not ]
) # The end of the capture group
\] # ]
For those who want a non-Regex solution, you could always use a list comprehension and str.split:
>>> [x[1:-1] for x in '[abc] [def] [zzz]'.split()]
['abc', 'def', 'zzz']
>>>
[1:-1] strips off the square brackets on each end of x.
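One difference between the two approaches shows up when a bracketed item itself contains a space. The sketch below uses a made-up input to illustrate this; whether it matters depends on the real data.

```python
import re

s = '[abc def] [zzz]'

# The regex looks for matching brackets, so inner spaces are preserved.
by_regex = re.findall(r'\[([^\]]*)\]', s)
print(by_regex)   # ['abc def', 'zzz']

# str.split() cuts at every space first, so slicing [1:-1] then removes
# characters that are not brackets.
by_split = [x[1:-1] for x in s.split()]
print(by_split)   # ['ab', 'ef', 'zzz']
```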

Another way:
s = '[abc] [def] [zzz]'
s = [i.strip('[]') for i in s.split()]

python regular expression, pulling all letters out

Is there a better way to pull A and F from this: A13:F20
a="A13:F20"
import re
pattern = re.compile(r'\D+\d+\D+')
matches = re.search(pattern, a)
num = matches.group(0)
print num[0]
print num[len(num)-1]
output
A
F
note: the digits are of unknown length

You don't have to use regular expressions, or re at all. Assuming you want just letters to remain, you could do something like this:
a = "A13:F20"
a = filter(lambda x: x.isalpha(), a)

I'd do it like this:
>>> re.findall(r'[a-z]', a, re.IGNORECASE)
['A', 'F']
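If the letters can be longer than one character (for example a range such as AB7:ZZ1000), matching single letters would split them up. A possible sketch, assuming the input always has the shape letters, digits, colon, letters, digits, captures each letter run as a group; column_letters is a hypothetical helper name.

```python
import re

def column_letters(cell_range):
    # Hypothetical helper: letters, digits, a colon, letters, digits.
    m = re.match(r'([A-Za-z]+)\d+:([A-Za-z]+)\d+$', cell_range)
    # Return None when the input does not have the assumed shape.
    return m.groups() if m else None

print(column_letters("A13:F20"))     # ('A', 'F')
print(column_letters("AB7:ZZ1000"))  # ('AB', 'ZZ')
```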

Use a simple list comprehension as a filter to keep only the letters from the string.
print [char for char in input_string if char.isalpha()]
# ['A', 'F']

You could use re.sub:
>>> a="A13.F20"
>>> re.sub(r'[^A-Z]', '', a) # Remove everything apart from A-Z
'AF'
>>> re.sub(r'[A-Z]', '', a) # Remove A-Z
'13.20'
>>>

If you're working with strings that all have the same format, you can just cut out substrings:
a="A13:F20"
print a[0], a[4]
More on python slicing in this answer:
Is there a way to substring a string in Python?

How do I split the following string?

I have the following string where I need to extract only the first digits from it.
string = '50.2000\xc2\xb0 E'
How do I extract 50.2000 from string?

If the number can be followed by any kind of character, try using a regex:
>>> import re
>>> r = re.compile(r'(\d+\.\d+)')
>>> r.match('50.2000\xc2\xb0 E').group(1)
'50.2000'

mystring = '50.2000\xc2\xb0 E'
print mystring.split("\xc2", 1)[0]
Output
50.2000

If you just wanted to split the first digits, just slice the string:
start = 10 # start at index 10 (the 11th character)
print mystring[start:]
Demo:
>>> my_string = 'abcasdkljf23u109842398470ujw{}{\\][\\['
>>> start = 10
>>> print(my_string[start:])
23u109842398470ujw{}{\][\[
You can also split the string at the first backslash:
>>> s = r'50.2000\xc2\xb0 E'
>>> s.split('\\', 1)
['50.2000', 'xc2\\xb0 E']

You could solve this using a regular expression:
In [1]: import re
In [2]: string = '50.2000\xc2\xb0 E'
In [3]: m = re.match(r'^([0-9]+\.?[0-9]*)', string)
In [4]: m.group(0)
Out[4]: '50.2000'
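If the value is needed as a number rather than as text, the match can be passed to float. This is only a sketch: leading_number is a made-up helper, and it assumes the string starts with an unsigned decimal number.

```python
import re

def leading_number(text):
    # Hypothetical helper: digits, optionally followed by "." and more digits.
    m = re.match(r'[0-9]+(?:\.[0-9]+)?', text)
    # re.match only looks at the start of the string; None means no number there.
    return float(m.group(0)) if m else None

print(leading_number('50.2000\xc2\xb0 E'))  # 50.2
print(leading_number('E 50.2000'))          # None
```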
