I have the following unwieldy code to extract out 'ABC' and '(XYZ)' from a string 'ABC(XYZ)'
import re
test_str = 'ABC(XYZ)'
partone = re.sub(r'\([^)]*\)', '', test_str)
parttwo_temp = re.match('.*\((.+)\)', test_str)
parttwo = '(' + parttwo_temp.group(1) + ')'
I was wondering if someone can think of a better regular expression to split up the string. Thanks.
You may use re.findall
>>> import re
>>> test_str = 'ABC(XYZ)'
>>> re.findall(r'\([^()]*\)|[^()]+', test_str)
['ABC', '(XYZ)']
>>> [i for i in re.findall(r'(.*)(\([^()]*\))', test_str)[0]]
['ABC', '(XYZ)']
[i for i in re.split(r'(.*?)(\(.*?\))', test_str) if i]
For this kind of input data, we can replace the ( with space+( and split by space:
>>> s = 'ABC(XYZ)'
>>> s.replace("(", " (").split()
['ABC', '(XYZ)']
This way we are artificially creating a delimiter before every opening parenthesis.
Related
I'm using python2 and I want to get rid of these empty strings in the output of the following python regular expression:
import re
x = "010101000110100001100001"
print re.split("([0-1]{8})", x)
and the output is this :
['', '01010100', '', '01101000', '', '01100001', '']
I just want to get this output:
['01010100', '01101000', '01100001']
Regex probably isn't what you want to use in this case. It seems that you want to just split the string into groups of n (8) characters.
I poached an answer from this question.
def split_every(n, s):
return [ s[i:i+n] for i in xrange(0, len(s), n) ]
split_every(8, "010101000110100001100001")
Out[2]: ['01010100', '01101000', '01100001']
One possible way:
print filter(None, re.split("([0-1]{8})", x))
import re
x = "010101000110100001100001"
l = re.split("([0-1]{8})", x)
l2 = [i for i in l if i]
out:
['01010100', '01101000', '01100001']
This is exactly what is split for. It is split string using regular expression as separator.
If you need to find all matches try use findall instead:
import re
x = "010101000110100001100001"
print(re.findall("([0-1]{8})", x))
print([a for a in re.split("([0-1]{8})", x) if a != ''])
Following your regex approach, you can simply use a filter to get your desired output.
import re
x = "010101000110100001100001"
unfiltered_list = re.split("([0-1]{8})", x)
print filter(None, unfiltered_list)
If you run this, you should get:
['01010100', '01101000', '01100001']
I have 2 related questions/ issues.
def remove_delimiters (delimiters, s):
for d in delimiters:
ind = s.find(d)
while ind != -1:
s = s[:ind] + s[ind+1:]
ind = s.find(d)
return ' '.join(s.split())
delimiters = [",", ".", "!", "?", "/", "&", "-", ":", ";", "#", "'", "..."]
d_dataset_list = ['hey-you...are you ok?']
d_list = []
for d in d_dataset_list:
d_list.append(remove_delimiters(delimiters, d[1]))
print d_list
Output = 'heyyouare you ok'
What is the best way of avoiding strings being combined together when a delimiter is removed? For example, so that the output is hey you are you ok ?
There may be a number of different sequences of ..., for example .. or .......... etc. How does one go around implementing some form of rule, where if more than one . appear after each other, to remove it? I want to try and avoid hard-coding all sequences in my delimiters list. Thankyou
You could try something like this:
Given delimiters d, join them to a regular expression
>>> d = ",.!?/&-:;#'..."
>>> "["+"\\".join(d)+"]"
"[,\\.\\!\\?\\/\\&\\-\\:\\;\\#\\'\\.\\.\\.]"
Split the string using this regex with re.split
>>> s = 'hey-you...are you ok?'
>>> re.split("["+"\\".join(d)+"]", s)
['hey', 'you', '', '', 'are you ok', '']
Join all the non-empty fragments back together
>>> ' '.join(w for w in re.split("["+"\\".join(d)+"]", s) if w)
'hey you are you ok'
Also, if you just want to remove all non-word characters, you can just use the character group \W instead of manually enumerating all the delimiters:
>>> ' '.join(w for w in re.split(r"\W", s) if w)
'hey you are you ok'
So first of all, your function for removing delimiters could be simplified greatly by using the replace function (http://www.tutorialspoint.com/python/string_replace.htm)
This would help solve your first question. Instead of just removing them, replace with a space, then get rid of the spaces using the pattern you already used (split() treats consecutive delimiters as one)
A better function, which does this, would be:
def remove_delimiters (delimiters, s):
new_s = s
for i in delimiters: #replace each delimiter in turn with a space
new_s = new_s.replace(i, ' ')
return ' '.join(new_s.split())
to answer your second question, I'd say it's time for regular expressions
>>> import re
... ss = 'hey ... you are ....... what?'
... print re.sub('[.+]',' ',ss)
hey you are what?
>>>
I have a string like;
'[abc] [def] [zzz]'
How would I be able to split it into three parts:
abc
def
zzz
You can use re.findall:
>>> from re import findall
>>> findall('\[([^\]]*)\]', '[abc] [def] [zzz]')
['abc', 'def', 'zzz']
>>>
All of the Regex syntax used above is explained in the link, but here is a quick breakdown:
\[ # [
( # The start of a capture group
[^\]]* # Zero or more characters that are not ]
) # The end of the capture group
\] # ]
For those who want a non-Regex solution, you could always use a list comprehension and str.split:
>>> [x[1:-1] for x in '[abc] [def] [zzz]'.split()]
['abc', 'def', 'zzz']
>>>
[1:-1] strips off the square brackets on each end of x.
Another way:
s = '[abc] [def] [zzz]'
s = [i.strip('[]') for i in s.split()]
Is there a better way to pull A and F from this: A13:F20
a="A13:F20"
import re
pattern = re.compile(r'\D+\d+\D+')
matches = re.search(pattern, a)
num = matches.group(0)
print num[0]
print num[len(num)-1]
output
A
F
note: the digits are of unknown length
You don't have to use regular expressions, or re at all. Assuming you want just letters to remain, you could do something like this:
a = "A13:F20"
a = filter(lambda x: x.isalpha(), a)
I'd do it like this:
>>> re.findall(r'[a-z]', a, re.IGNORECASE)
['A', 'F']
Use a simple list comprehension, as a filter and get only the alphabets from the actual string.
print [char for char in input_string if char.isalpha()]
# ['A', 'F']
You could use re.sub:
>>> a="A13.F20"
>>> re.sub(r'[^A-Z]', '', a) # Remove everything apart from A-Z
'AF'
>>> re.sub(r'[A-Z]', '', a) # Remove A-Z
'13.20'
>>>
If you're working with strings that all have the same format, you can just cut out substrings:
a="A13:F20"
print a[0], a[4]
More on python slicing in this answer:
Is there a way to substring a string in Python?
I have the following string where I need to extract only the first digits from it.
string = '50.2000\xc2\xb0 E'
How do I extract 50.2000 from string?
If the number can be followed by any kind of character, try using a regex:
>>> import re
>>> r = re.compile(r'(\d+\.\d+)')
>>> r.match('50.2000\xc2\xb0 E').group(1)
'50.2000'
mystring = '50.2000\xc2\xb0 E'
print mystring.split("\xc2", 1)[0]
Output
50.2000
If you just wanted to split the first digits, just slice the string:
start = 10 #start at the 10th digit
print mystring[start:]
Demo:
>>> my_string = 'abcasdkljf23u109842398470ujw{}{\\][\\['
>>> start = 10
>>> print(my_string[start:])
23u109842398470ujw{}{\][\[
You can, split the string at the first \:
>>> s = r'50.2000\xc2\xb0 E'
>>> s.split('\\', 1)
['50.2000', 'xc2\\xb0 E']
You could solve this using a regular expression:
In [1]: import re
In [2]: string = '50.2000\xc2\xb0 E'
In [3]: m = re.match('^([0-9]+\.?[0-9]*)', string)
In [4]: m.group(0)
Out[4]: '50.2000'