Split string based on a regular expression - python

I have the output of a command in tabular form. I'm parsing this output from a result file and storing it in a string. Each element in one row is separated by one or more whitespace characters, thus I'm using regular expressions to match 1 or more spaces and split it. However, a space is being inserted between every element:
>>> str1="a b c d" # spaces are irregular
>>> str1
'a b c d'
>>> str2=re.split("( )+", str1)
>>> str2
['a', ' ', 'b', ' ', 'c', ' ', 'd'] # 1 space element between!!!
Is there a better way to do this?
After each split str2 is appended to a list.

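By wrapping the pattern in parentheses, ( ), you are creating a capturing group, and re.split includes the captured text in its result. If you simply remove the parentheses, you will not have this problem.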
>>> str1 = "a b c d"
>>> re.split(" +", str1)
['a', 'b', 'c', 'd']
However, there is no need for regex here: str.split without any delimiter specified will split on runs of whitespace for you. This would be the best way in this case.
>>> str1.split()
['a', 'b', 'c', 'd']
If you really want a regex, you can use this ('\s' represents any whitespace character, and it's clearer):
>>> re.split(r"\s+", str1)
['a', 'b', 'c', 'd']
Or you can find all runs of non-whitespace characters:
>>> re.findall(r'\S+',str1)
['a', 'b', 'c', 'd']
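The approaches above behave differently once the input contains tabs or leading/trailing whitespace. Here is a minimal sketch comparing them; the sample `row` is made up for illustration and is not data from the question:

```python
import re

# Hypothetical sample row with a leading space, a tab, and a trailing newline.
row = " a  b\tc   d\n"

by_spaces = re.split(r" +", row)       # splits on runs of spaces only
by_whitespace = re.split(r"\s+", row)  # splits on runs of any whitespace
by_str_split = row.split()             # whitespace runs, ends ignored
by_findall = re.findall(r"\S+", row)   # matches the fields themselves

print(by_spaces)      # ['', 'a', 'b\tc', 'd\n']
print(by_whitespace)  # ['', 'a', 'b', 'c', 'd', '']
print(by_str_split)   # ['a', 'b', 'c', 'd']
print(by_findall)     # ['a', 'b', 'c', 'd']
```

Only str.split() and re.findall avoid the empty strings at the ends, which is one more reason to prefer them for tabular output.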

The str.split method will automatically remove all white space between items:
>>> str1 = "a b c d"
>>> str1.split()
['a', 'b', 'c', 'd']
Docs are here: http://docs.python.org/library/stdtypes.html#str.split

When you use re.split and the split pattern contains capturing groups, the groups are retained in the output. If you don't want this, use a non-capturing group instead.
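As a small illustration of the difference, assuming an input string with irregular spacing (the variable names here are only for the example):

```python
import re

text = "a  b c   d"  # hypothetical input with irregular spacing

# Capturing group: the last space matched by the group is kept in the result.
with_capture = re.split(r"( )+", text)

# Non-capturing group (?:...): groups the pattern but adds nothing to the result.
without_capture = re.split(r"(?: )+", text)

print(with_capture)     # ['a', ' ', 'b', ' ', 'c', ' ', 'd']
print(without_capture)  # ['a', 'b', 'c', 'd']
```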

It's very simple, actually. Try this:
str1="a b c d"
splitStr1 = str1.split()
print(splitStr1)

Related

Python: remove initial number and underscore from strings in a line (but keep other underscores)

I have a split line (split using .split()) that looks like this:
['a', 'b', 'c', '1_a23_4', '2_b234', '300_235_2_2', '1000_1_1_1_1']
Each string has a variable number of underscores and different combinations of letters/numbers after the first underscore. For any string with a number followed by an underscore, I want to drop the initial number/underscore to get this result:
['a', 'b', 'c', 'a23_4', 'b234', '235_2_2', '1_1_1_1']
This is similar to this question, but I have multiple underscores for some strings in the split line.
You can use re.sub:
import re
d = ['a', 'b', 'c', '1_a23_4', '2_b234', '300_235_2_2', '1000_1_1_1_1']
new_d = [re.sub(r'^\d+_', '', i) for i in d]
Output:
['a', 'b', 'c', 'a23_4', 'b234', '235_2_2', '1_1_1_1']
>>> l=['a', 'b', 'c', '1_a23_4', '2_b234', '300_235_2_2', '1000_1_1_1_1']
>>> l=["_".join(i.split("_")[1:]) if "_" in i else i for i in l]
>>> l
['a', 'b', 'c', 'a23_4', 'b234', '235_2_2', '1_1_1_1']
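The two answers agree on the sample data but not in general: the re.sub version removes a prefix only when it is numeric, while the split/join version drops whatever precedes the first underscore. A short sketch of the difference, with hypothetical helper names and sample strings:

```python
import re

def strip_numeric_prefix(s):
    """Remove a leading run of digits followed by one underscore."""
    return re.sub(r"^\d+_", "", s)

def drop_first_field(s):
    """Remove everything up to and including the first underscore."""
    return "_".join(s.split("_")[1:]) if "_" in s else s

samples = ["300_235_2_2", "abc_12", "a"]  # made-up test strings
numeric_only = [strip_numeric_prefix(s) for s in samples]
any_prefix = [drop_first_field(s) for s in samples]

print(numeric_only)  # ['235_2_2', 'abc_12', 'a']
print(any_prefix)    # ['235_2_2', '12', 'a']
```

If a non-numeric first field such as 'abc_12' must be left alone, the re.sub form is the one that matches the stated requirement.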

How to remove certain characters from lists (Python 2.7)?

I've got a list where each element is:
['a ',' b ',' c ',' d\n ']
I want to manipulate it so that each element just becomes:
['a','b','c','d']
I don't think the spaces matter, but for some reason I can't seem to remove the \n from the end of the 4th element. I've tried converting to string and removing it using:
str.split('\n')
No error is returned, but it doesn't do anything to the list, it still has the \n at the end.
I've also tried:
d.replace('\n','')
But this just returns an error.
This is clearly a simple problem but I'm a complete beginner to Python so any help would be appreciated, thank you.
Edit:
It seems I have a list of arrays (I think) so am I right in thinking that list[0], list[1] etc are their own arrays? Does that mean I can use a for loop for i in list to strip \n from each one?
>>> my_array = ['a ',' b ',' c ',' d\n ']
>>> my_array = [c.strip() for c in my_array]
>>> my_array
['a', 'b', 'c', 'd']
If you have a list of arrays, then you can do something along the lines of:
>>> list_of_arrays = [['a', 'b', 'c', 'd'], ['a ', ' b ', ' c ', ' d\n ']]
>>> new_list = [[c.strip() for c in array] for array in list_of_arrays]
>>> new_list
[['a', 'b', 'c', 'd'], ['a', 'b', 'c', 'd']]
Try this -
arr = ['a ',' b ',' c ',' d\n ']
arr = [s.strip() for s in arr]
A very simple answer is to join your list, strip the newline character, and split to get a new list:
Newlist = ''.join(myList).strip().split()
Your Newlist is now:
['a', 'b', 'c', 'd']
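The join/strip/split approach and the per-element strip() give the same result only when no element contains internal whitespace. A minimal sketch with a made-up list (`cells` is an assumption, not the asker's data):

```python
# Hypothetical list in which one element contains an internal space.
cells = ["new york ", " b ", " c\n "]

# Strip each element individually: internal whitespace is preserved.
per_element = [s.strip() for s in cells]

# Join, strip, and split: every whitespace run becomes a boundary.
joined = "".join(cells).strip().split()

print(per_element)  # ['new york', 'b', 'c']
print(joined)       # ['new', 'york', 'b', 'c']
```

For data like the question's single-letter elements either form works, but the list comprehension keeps the number of elements unchanged.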

How can I avoid those empty strings caused by preceding or trailing whitespaces?

>>> import re
>>> re.split(r'[ "]+', ' a n" "c ')
['', 'a', 'n', 'c', '']
When there is preceding or trailing whitespace, there will be empty strings after splitting.
How can I avoid those empty strings? Thanks.
The empty values are the things between the splits. re.split() is not the right tool for the job.
I recommend matching what you want instead.
>>> re.findall(r'[^ "]+', ' a n" "c ')
['a', 'n', 'c']
If you must use split, you could use a list comprehension and filter it directly.
>>> [x for x in re.split(r'[ "]+', ' a n" "c ') if x != '']
['a', 'n', 'c']
That's what re.split is supposed to do. You're asking it to split the string on any runs of whitespace or quotes; if it didn't return an empty string at the start, you wouldn't be able to distinguish that case from the case with no preceding whitespace.
If what you're actually asking for is to find all runs of non-whitespace-or-quote characters, just write that:
>>> re.findall(r'[^ "]+', ' a n" "c ')
['a', 'n', 'c']
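I like abarnert's solution.
However, you can also strip the string before splitting (maybe not the most Pythonic way):
myString.strip()
Note that strip() with no arguments removes only whitespace; if quotes can also appear at the ends, pass the characters explicitly, e.g. myString.strip(' "').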
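A sketch of the strip-then-split idea, using the question's pattern. The second input, `quoted`, is a made-up string added to show why the characters passed to strip() should match the separator set:

```python
import re

pattern = r'[ "]+'
text = ' a n" "c '

# Trim separator characters from both ends, then split on the remaining runs.
parts = re.split(pattern, text.strip(' "'))
print(parts)  # ['a', 'n', 'c']

# Hypothetical input that starts and ends with a quote instead of a space.
quoted = '"a b"'
whitespace_only = re.split(pattern, quoted.strip())
all_separators = re.split(pattern, quoted.strip(' "'))
print(whitespace_only)  # ['', 'a', 'b', '']
print(all_separators)   # ['a', 'b']
```

If the input consists only of separators, this still returns [''], so the findall approach remains the more robust choice.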

find all characters NOT in regex pattern

Let's say I have a regex of legal characters
legals = re.compile("[abc]")
I can return a list of legal characters in a string like this:
finder = re.finditer(legals, "abcdefg")
[match.group() for match in finder]
['a', 'b', 'c']
How can I use regex to find a list of the characters NOT in the regex? IE in my case it would return
['d','e','f','g']
Edit: To clarify, I'm hoping to find a way to do this without modifying the regex itself.
Negate the character class:
>>> illegals = re.compile("[^abc]")
>>> finder = re.finditer(illegals, "abcdefg")
>>> [match.group() for match in finder]
['d', 'e', 'f', 'g']
If you can't do that (and you're only dealing with matches that are one character long), you could remove the legal characters and keep what remains:
>>> legals = re.compile("[abc]")
>>> remains = legals.sub("", "abcdefg")
>>> [char for char in remains]
['d', 'e', 'f', 'g']
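Another way to leave the regex untouched is to test each character against it. This is only a sketch and, like the sub() approach, it assumes the pattern matches single characters; it relies on Pattern.fullmatch, available in Python 3.4 and later:

```python
import re

legals = re.compile("[abc]")  # the unmodified pattern from the question
text = "abcdefg"

# Keep each character that the pattern does not match on its own.
illegal_chars = [ch for ch in text if not legals.fullmatch(ch)]

print(illegal_chars)  # ['d', 'e', 'f', 'g']
```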

Regular expression group capture with multiple matches

Quick regular expression question.
I'm trying to capture multiple instances of a capture group in python (don't think it's python specific), but the subsequent captures seems to overwrite the previous.
In this over-simplified example, I'm essentially trying to split a string:
x = 'abcdef'
r = re.compile(r'(\w){6}')
m = r.match(x)
m.groups() # = ('f',) ?!?
I want to get ('a', 'b', 'c', 'd', 'e', 'f'), but because regex overwrites subsequent captures, I get ('f',)
Is this how regex is supposed to behave? Is there a way to do what I want without having to repeat the syntax six times?
Thanks in advance!
Andrew
You can't use groups for this, I'm afraid. A repeated group is overwritten on each repetition, so only its last match is kept; as far as I know, most regex engines work this way. A possible solution is to use findall() or similar.
r=re.compile(r'\w')
r.findall(x)
# ['a', 'b', 'c', 'd', 'e', 'f']
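To make the behaviour concrete, here is a small sketch using only the standard re module. Building the pattern by repeating the group with string multiplication is an illustration of one way to avoid typing it six times, not something taken from the answers:

```python
import re

x = "abcdef"

# A repeated group keeps only the text from its final repetition.
m = re.match(r"(\w){6}", x)
last_only = m.groups()
print(m.group(0))  # abcdef
print(last_only)   # ('f',)

# findall returns every non-overlapping match as a list.
letters = re.findall(r"\w", x)
print(letters)     # ['a', 'b', 'c', 'd', 'e', 'f']

# Six separate groups, generated rather than written out by hand.
six_groups = re.match(r"(\w)" * 6, x).groups()
print(six_groups)  # ('a', 'b', 'c', 'd', 'e', 'f')
```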
The regex module can do this.
> m = regex.match(r'(\w){6}', "abcdef")
> m.captures(1)
['a', 'b', 'c', 'd', 'e', 'f']
Also works with named captures:
> m = regex.match(r'(?P<letter>\w){6}', "abcdef")
> m.capturesdict()
{'letter': ['a', 'b', 'c', 'd', 'e', 'f']}
The regex module is a third-party package (installable from PyPI) that is designed to be backwards-compatible with the 're' module while offering many more features and capabilities.
To find all matches in a given string use re.findall(regex, string). Also, if you want to obtain every letter here, your regex should be either '(\w){1}' or just '(\w)'.
See:
r = re.compile(r'(\w)')
l = re.findall(r, x)
l == ['a', 'b', 'c', 'd', 'e', 'f']
I suppose your question is a simplified version of what you actually need, so here is a slightly more complex example:
import re
pat = re.compile('[UI][bd][ae]')
ch = 'UbaUdeIbaIbeIdaIdeUdeUdaUdeUbeIda'
print([mat.group() for mat in pat.finditer(ch)])
Result:
['Uba', 'Ude', 'Iba', 'Ibe', 'Ida', 'Ide', 'Ude', 'Uda', 'Ude', 'Ube', 'Ida']
