python regex: string with maximum one whitespace - python

Hello, I would like to know how to write a regex pattern for fields that may each contain at most one whitespace character in a row. More specifically:
s = "a  b d d  c"
pattern = "(?P<a>.*) +(?P<b>.*) +(?P<c>.*)"
print(re.match(pattern, s).groupdict())
returns:
{'a': 'a  b d d', 'b': '', 'c': 'c'}
I would like to have:
{'a': 'a', 'b': 'b d d', 'c': 'c'}

Another option could be to use zip and a dict and generate the characters based on the length of the matches.
You can get the matches which contain at max one whitespace using a repeating pattern matching a non whitespace char \S and repeat 0+ times a space followed by a non whitespace char:
\S(?: \S)*
For example:
import re
a=97
regex = r"\S(?: \S)*"
test_str = "a  b d d  c"
matches = re.findall(regex, test_str)
chars = list(map(chr, range(a, a+len(matches))))
print(dict(zip(chars, matches)))
Result
{'a': 'a', 'b': 'b d d', 'c': 'c'}

With the help of The fourth bird's answer, I managed to do it the way I had imagined:
import re
s = "a  b d d  c"
pattern = r"(?P<a>\S(?: \S)*) +(?P<b>\S(?: \S)*) +(?P<c>\S(?: \S)*)"
print(re.match(pattern, s).groupdict())
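The three named groups repeat the same sub-pattern, so the pattern can also be assembled from a list of names. A minimal sketch, assuming the fields are separated by two or more spaces as in the question; `TOKEN` and `build_pattern` are hypothetical names, not part of the original answer:

```python
import re

# One field: a non-whitespace char, then any number of " x" pairs
# (so a field never contains two spaces in a row).
TOKEN = r"\S(?: \S)*"

def build_pattern(names):
    """Join one named group per field name, separated by runs of spaces."""
    return " +".join(f"(?P<{name}>{TOKEN})" for name in names)

pattern = build_pattern(["a", "b", "c"])
match = re.match(pattern, "a  b d d  c")
result = match.groupdict() if match else None
print(result)
```

Checking `match` before calling `groupdict()` avoids an `AttributeError` when the input does not fit the pattern.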

Looks like you just want to split your string on runs of 2 or more spaces. You can do it this way:
import re
s = "a  b d d  c"
re.split(r' {2,}', s)
which returns:
['a', 'b d d', 'c']

It's probably easier to use re.split, since the delimiter is known (2 or more spaces), but the patterns in-between are not. I'm sure someone better at regex than myself can work out the look-aheads, but by splitting on \s{2,}, you can greatly simplify the problem.
You can make your dictionary of named groups like so:
import re
s = "a  b d d  c"
x = dict(zip('abc', re.split(r'\s{2,}', s)))
x
{'a': 'a', 'b': 'b d d', 'c': 'c'}
The first argument to zip supplies the group names. To extend this to more general names:
groups = ['group_1', 'another group', 'third_group']
x = dict(zip(groups, re.split(r'\s{2,}', s)))
{'group_1': 'a', 'another group': 'b d d', 'third_group': 'c'}
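Note that zip stops at the shorter of its inputs, so a line with too many or too few fields is silently truncated. A hedged sketch of a stricter variant that raises instead; `fields_to_dict` is a hypothetical helper name:

```python
import re

def fields_to_dict(names, line):
    """Split on runs of 2+ whitespace chars and pair the fields with names."""
    fields = re.split(r"\s{2,}", line.strip())
    if len(fields) != len(names):
        # zip() would silently drop the surplus; fail loudly instead
        raise ValueError(f"expected {len(names)} fields, got {len(fields)}")
    return dict(zip(names, fields))

result = fields_to_dict(["a", "b", "c"], "a  b d d  c")
print(result)
```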

I found another solution that I like even better:
import re
s = "a  b dll d  c"
pattern = r"(?P<a>(\S*[\t]?)*) +(?P<b>(\S*[\t ]?)*) +(?P<c>(\S*[\t ]?)*)"
print(re.match(pattern, s).groupdict())
Here each token may even be longer than one letter.

Related

How to split a string which has blank to list?

I have the following code:
can="p1=a b c p2=d e f g"
new = can.split()
print(new)
When I run it, I get:
['p1=a', 'b', 'c', 'p2=d', 'e', 'f', 'g']
But what I really need is:
['p1=a b c', 'p2=d e f g']
Here a b c is the value of p1 and d e f g is the value of p2. How can I achieve this? Thank you!
If you want to have ['p1=a b c', 'p2=d e f g'], you can split using a regex:
import re
new = re.split(r'\s+(?=\w+=)', can)
If you want a dictionary {'p1': 'a b c', 'p2': 'd e f g'}, further split on =:
import re
new = dict(x.split('=', 1) for x in re.split(r'\s+(?=\w+=)', can))
You can just match your desired results, looking for a variable name, then equals and characters until you get to either another variable name and equals, or the end-of-line:
import re
can="p1=a b c p2=d e f g"
re.findall(r'\w+=.*?(?=\s*\w+=|$)', can)
Output:
['p1=a b c', 'p2=d e f g']
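The matching approach can also produce the dictionary directly by capturing the name and the value separately. A minimal sketch, assuming every value is followed either by whitespace and another `name=` or by the end of the string; the variable names `pairs` and `result` are illustrative:

```python
import re

can = "p1=a b c p2=d e f g"

# Group 1: the key; group 2: the value, expanded lazily until the
# lookahead sees the next "name=" or the end of the string.
pairs = re.findall(r'(\w+)=(.*?)(?=\s+\w+=|$)', can)
result = dict(pairs)
print(result)
```

With two capturing groups, `re.findall` returns a list of `(key, value)` tuples, which `dict` accepts as is.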

How to switch text in a string?

I want to switch text but I always fail.
Let's say I want to switch,
I with We in x='I are We'
I tried
x=x.replace('I','We').replace('We','I')
but it is obvious that it will print I are I
Can someone help?
You can use a regex to avoid going through your string several times (each replace call scans the whole string) and to make the code more readable. It also works when a word occurs several times.
string = 'I are We, I'
import re
replacements = {'I': 'We', 'We': 'I'}
print(re.sub("I|We", lambda x: replacements[x.group()], string)) # Matching words you want to replace, and replace them using a dict
Output
"We are I, We"
You can use re.sub with function as a substitution:
In [9]: import re
In [10]: x = 'I are We'
In [11]: re.sub('I|We', lambda match: 'We' if match.group(0) == 'I' else 'I', x)
Out[11]: 'We are I'
If you need to replace more than 2 substrings you may want to create a dict like d = {'I': 'We', 'We': 'I', 'You': 'Not You'} and pick correct replacement like lambda match: d[match.group(0)]. You may also want to construct regular expression dynamically based on the replacement strings, but make sure to escape them:
In [14]: d = {'We': 'I', 'I': 'We', 'ar|e': 'am'}
In [15]: re.sub('|'.join(map(re.escape, d.keys())), lambda match: d[match.group(0)], 'We ar|e I')
Out[15]: 'I am We'
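One caveat with a pattern like `I|We` is that it also matches inside longer words, for example the `I` in `It`. A hedged sketch that adds word boundaries, assuming every key starts and ends with a word character; `swap_words` is a hypothetical helper name:

```python
import re

def swap_words(text, mapping):
    """Replace whole words only, in a single pass over the text."""
    # Longest keys first, so a longer key wins over a key that is its prefix.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, keys)) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(0)], text)

result = swap_words("I think We said It", {"I": "We", "We": "I"})
print(result)
```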
x='I are We'
x=x.replace('I','You').replace('We','I').replace('You','We')
>>> x
'We are I'
It is a bit clunky, but I tend to do something along the lines of
x='I are We'
x=x.replace('I','we')
x=x.replace('We','I')
x=x.replace('we','We')
Which can be shortened to
x=x.replace('I','we').replace('We','I').replace('we','We')
This doesn't make use of replace, but I hope it helps:
s = "I are We"
d = {"I": "We", "We": "I"}
" ".join([d.get(x, x) for x in s.split()])
>>> 'We are I'
x='I are We'
dic = {'I':'We','We':'I'}
sol = []
for i in x.split():
    if i in dic:
        sol.append(dic[i])
    else:
        sol.append(i)
result = ' '.join(sol)
print(result)

How can I ignore a string in python regex group matching?

Say I have the following string
>>> mystr = 'A-ABd54-Bf657'
(a random string of dash-delimited character groups) and want to match the opening part, and the rest of the string, in separate groups. I can use
>>> re.match('(?P<a>[a-zA-Z0-9]+)-(?P<b>[a-zA-Z0-9-]+)', mystr)
This produces a groupdict() like this:
{'a': 'A', 'b': 'ABd54-Bf657'}
How can I get the same regex to match group b but separately match a specific suffix (or set of suffixes) if present? Ideally something like this
>>> myregex = <help me here>
>>> re.match(myregex, 'A-ABd54-Bf657').groupdict()
{'a': 'A', 'b': 'ABd54-Bf657', 'test': None}
>>> re.match(myregex, 'A-ABd54-Bf657-blah').groupdict()
{'a': 'A', 'b': 'ABd54-Bf657-blah', 'test': None}
>>> re.match(myregex, 'A-ABd54-Bf657-test').groupdict()
{'a': 'A', 'b': 'ABd54-Bf657', 'test': 'test'}
Thanks.
mystr = 'A-ABd54-Bf657'
re.match('(?P<a>[a-zA-Z0-9]+)-(?P<b>[a-zA-Z0-9-]+?)(?:-(?P<test>test))?$', mystr)
                                                 ^                    ^
The first indicated ? makes the + quantifier non-greedy, so that it consumes the minimum possible.
The second indicated ? makes the group optional.
The $ is necessary; otherwise the non-greedy quantifier would stop after a single character and the optional group would match nothing.

How to split a string into characters in python

I have a string 'ABCDEFG'
I want to be able to list each character sequentially followed by the next one.
Example
A B
B C
C D
D E
E F
F G
G
Can you tell me an efficient way of doing this? Thanks
In Python, a string is already seen as an enumerable list of characters, so you don't need to split it; it's already "split". You just need to build your list of substrings.
It's not clear what form you want the result in. If you just want substrings, this works:
s = 'ABCDEFG'
[s[i:i+2] for i in range(len(s))]
#=> ['AB', 'BC', 'CD', 'DE', 'EF', 'FG', 'G']
If you want the pairs to themselves be lists instead of strings, just call list on each one:
[list(s[i:i+2]) for i in range(len(s))]
#=> [['A', 'B'], ['B', 'C'], ['C', 'D'], ['D', 'E'], ['E', 'F'], ['F', 'G'], ['G']]
And if you want strings after all, but with something like a space between the letters, join them back together after the list call:
[' '.join(list(s[i:i+2])) for i in range(len(s))]
#=> ['A B', 'B C', 'C D', 'D E', 'E F', 'F G', 'G']
You need to keep the last character, so use zip_longest from itertools (named izip_longest in Python 2):
>>> import itertools
>>> s = 'ABCDEFG'
>>> for c, cnext in itertools.zip_longest(s, s[1:], fillvalue=''):
...     print(c, cnext)
...
A B
B C
C D
D E
E F
F G
G
def doit(input):
    for i in range(len(input)):
        print(input[i] + (input[i + 1] if i != len(input) - 1 else ''))
doit("ABCDEFG")
Which yields:
>>> doit("ABCDEFG")
AB
BC
CD
DE
EF
FG
G
There's an itertools pairwise recipe for exactly this use case:
import itertools
def pairwise(myStr):
    a, b = itertools.tee(myStr)
    next(b, None)
    for s1, s2 in zip(a, b):
        print(s1, s2)
Output:
In [121]: pairwise('ABCDEFG')
A B
B C
C D
D E
E F
F G
Your problem is that you have a list of strings, not a string:
with open('ref.txt') as f:
    f1 = f.read().splitlines()
f.read() returns a string. You call splitlines() on it, getting a list of strings (one per line). If your input is actually 'ABCDEFG', this will of course be a list of one string, ['ABCDEFG'].
l = list(f1)
Since f1 is already a list, this just makes l a duplicate copy of that list.
print(l, f1, len(l))
And this just prints the list of lines, and the copy of the list of lines, and the number of lines.
So, first, what happens if you drop the splitlines()? Then f1 will be the string 'ABCDEFG', instead of a list with that one string. That's a good start. And you can drop the l part entirely, because f1 is already an iterable of its characters; list(f1) will just be a different iterable of the same characters.
So, now you want to print each letter with the next letter. One way to do that is by zipping 'ABCDEFG' and 'BCDEFG '. But how do you get that 'BCDEFG '? Simple; it's just f1[1:] + ' '.
So:
with open('ref.txt') as f:
    f1 = f.read()
for left, right in zip(f1, f1[1:] + ' '):
    print(left, right)
Of course for something this simple, there are many other ways to do the same thing. You can iterate over range(len(f1)) and get 2-element slices, or you can use itertools.zip_longest, or you can write a general-purpose "overlapping adjacent groups of size N from any iterable" function out of itertools.tee and zip, etc.
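The general-purpose "overlapping groups of size N" idea mentioned above could look roughly like this. It is a sketch built from itertools.tee and zip; `windows` is a hypothetical name, and trailing partial groups (such as the lone 'G') are dropped:

```python
from itertools import islice, tee

def windows(iterable, n):
    """Yield overlapping tuples of n adjacent items from any iterable."""
    iterators = tee(iterable, n)
    for offset, it in enumerate(iterators):
        # Advance the k-th copy by k items (the itertools "consume" idiom).
        next(islice(it, offset, offset), None)
    return zip(*iterators)

result = ["".join(w) for w in windows("ABCDEFG", 2)]
print(result)
```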
If you want a space between the characters, you can use zip and a list comprehension:
>>> s="ABCDEFG"
>>> l=[' '.join(i) for i in zip(s,s[1:])]
>>> l
['A B', 'B C', 'C D', 'D E', 'E F', 'F G']
>>> for i in l:
...     print(i)
...
A B
B C
C D
D E
E F
F G
If you don't want the space, just use a list comprehension:
>>> [s[i:i+2] for i in range(len(s))]
['AB', 'BC', 'CD', 'DE', 'EF', 'FG', 'G']

Split string based on a regular expression

I have the output of a command in tabular form. I'm parsing this output from a result file and storing it in a string. Each element in one row is separated by one or more whitespace characters, thus I'm using regular expressions to match 1 or more spaces and split it. However, a space is being inserted between every element:
>>> str1="a b c d" # spaces are irregular
>>> str1
'a b c d'
>>> str2=re.split("( )+", str1)
>>> str2
['a', ' ', 'b', ' ', 'c', ' ', 'd'] # 1 space element between!!!
Is there a better way to do this?
After each split str2 is appended to a list.
By writing ( ), you create a capturing group, and re.split includes the captured text in its result. If you simply remove the parentheses, you will not have this problem.
>>> str1 = "a b c d"
>>> re.split(" +", str1)
['a', 'b', 'c', 'd']
However, there is no need for a regex here: str.split with no delimiter specified splits on runs of whitespace for you. This would be the best way in this case.
>>> str1.split()
['a', 'b', 'c', 'd']
If you really want a regex, you can use this (\s matches any whitespace character, which makes the intent clearer):
>>> re.split(r"\s+", str1)
['a', 'b', 'c', 'd']
or you can find all runs of non-whitespace characters:
>>> re.findall(r'\S+',str1)
['a', 'b', 'c', 'd']
The str.split method will automatically remove all white space between items:
>>> str1 = "a b c d"
>>> str1.split()
['a', 'b', 'c', 'd']
Docs are here: http://docs.python.org/library/stdtypes.html#str.split
When you use re.split and the split pattern contains capturing groups, the groups are retained in the output. If you don't want this, use a non-capturing group instead.
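The difference between the two kinds of group can be seen side by side. A small sketch; the irregular spacing in `text` is an assumed example input:

```python
import re

text = "a   b  c    d"

# Capturing group: the last captured space of each separator is kept.
with_capture = re.split(r"( )+", text)

# Non-capturing group: separators are discarded entirely.
without_capture = re.split(r"(?: )+", text)

print(with_capture)
print(without_capture)
```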
It's very simple, actually. Try this:
str1="a b c d"
splitStr1 = str1.split()
print(splitStr1)
