Difference between re.search() and re.findall() - python

The following code is very strange:
>>> words = "4324324 blahblah"
>>> print re.findall(r'(\s)\w+', words)
[' ']
>>> print re.search(r'(\s)\w+', words).group()
blahblah
The () operator seems to behave poorly with findall. Why is this? I need it for a csv file.
Edit for clarity: I want to display blahblah using findall.
I discovered that re.findall(r'\s(\w+)', words) does what I want, but have no idea why findall treats groups in this way.

One character off:
>>> print re.search(r'(\s)\w+', words).groups()
(' ',)
>>> re.search(r'(\s)\w+', words).group(1)
' '
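findall returns a list of what the groups captured: when the pattern contains a capturing group, each result is the text of that group rather than the whole match (with more than one group, you get a list of tuples). You're getting a space back because that's what you capture. Stop capturing, and it works fine: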
>>> print re.findall(r'\s\w+', words)
[' blahblah']
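To make the rule concrete, here is a small sketch (written in Python 3 syntax; the variable names are only for illustration) of what findall returns with no group, one capturing group, two capturing groups, and a non-capturing group:

```python
import re

words = "4324324 blahblah"

# No capturing group: each item is the whole match.
no_group = re.findall(r'\s\w+', words)

# One capturing group: each item is that group's text only.
one_group = re.findall(r'(\s)\w+', words)

# Two capturing groups: each item is a tuple with one entry per group.
two_groups = re.findall(r'(\s)(\w+)', words)

# Non-capturing group (?:...): groups the pattern without changing the result.
non_capturing = re.findall(r'(?:\s)\w+', words)

print(no_group)       # [' blahblah']
print(one_group)      # [' ']
print(two_groups)     # [(' ', 'blahblah')]
print(non_capturing)  # [' blahblah']
```

So if the parentheses are only there for grouping, switching to (?:...) keeps findall returning the full match.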
Use the csv module

If you prefer to keep the capturing groups in your regex, but you still want to find the entire contents of each match instead of the groups, you can use the following:
[m.group() for m in re.finditer(r'(\s)\w+', words)]
For example:
>>> [m.group() for m in re.finditer(r'(\s)\w+', '4324324 blahblah')]
[' blahblah']

Related

How to replace an re match with a transformation of that match?

For example, I have a string:
The struct-of-application and struct-of-world
With re.sub, it will replace the matched with a predefined string. How can I replace the match with a transformation of the matched content? To get, for example:
The [application_of_struct](http://application_of_struct) and [world-of-struct](http://world-of-struct)
If I write a simple regex ((\w+-)+\w+) and try to use re.sub, it seems I can't use what I matched as part of the replacement, let alone edit the matched content:
In [10]: p.sub('struct','The struct-of-application and struct-of-world')
Out[10]: 'The struct and struct'
Use a function for the replacement:
import re
s = 'The struct-of-application and struct-of-world'
p = re.compile(r'((\w+-)+\w+)')
def replace(match):
    return 'http://{}'.format(match.group())
    # for Python 3.6+ ...
    # return f'http://{match.group()}'
>>> p.sub(replace, s)
'The http://struct-of-application and http://struct-of-world'
Try this:
>>> p = re.compile(r"((\w+-)+\w+)")
>>> p.sub('[\\1](http://\\1)','The struct-of-application and struct-of-world')
'The [struct-of-application](http://struct-of-application) and [struct-of-world](http://struct-of-world)'
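Neither answer above actually transforms the matched text, which is what the question asked for. A minimal sketch of that, assuming the desired transformation is "reverse the hyphen-separated parts and join them with underscores" (the question's example output is not fully consistent, so this is a guess), and with to_link as a hypothetical helper name:

```python
import re

# One or more hyphen-joined words, e.g. "struct-of-application".
PATTERN = re.compile(r'\w+(?:-\w+)+')

def to_link(match):
    """Build a Markdown-style link from a transformed copy of the match."""
    parts = match.group().split('-')
    # Assumed transformation: reverse the parts, join with underscores.
    name = '_'.join(reversed(parts))
    return '[{0}](http://{0})'.format(name)

s = 'The struct-of-application and struct-of-world'
result = PATTERN.sub(to_link, s)
print(result)
# The [application_of_struct](http://application_of_struct) and [world_of_struct](http://world_of_struct)
```

Because the replacement is an ordinary function, any string logic can go inside it; backreferences such as \1 can only copy the matched text, not edit it.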

Split string multiple times in Python

This should be a simple thing to do but I can't get it to work.
Say I have this string.
I want this string to be splitted into smaller strings.
And, well, I want to split it into smaller strings, but only take what is between a T and a S.
So, the result should yield
this, to be s, to s, trings
So far I've tried splitting on every S, and then up to every T (backwards). However, it will only get the first "this", and stop. How can I make it continue and get all the things that are between T's and S's?
(In this program I export the results to another text file)
matches = open('string.txt', 'r')
with open ('test.txt', 'a') as file:
    for line in matches:
        test = line.split("S")
        file.write(test[0].split("T")[-1] + "\n")
matches.close()
Maybe using Regular Expressions would be better, though I don't know how to work with them too well?
You want a re.findall() call instead:
re.findall(r't[^s]*s', line, flags=re.I)
Demo:
>>> import re
>>> sample = 'I want this string to be splitted into smaller strings.'
>>> re.findall(r't[^s]*s', sample, flags=re.I)
['t this', 'tring to be s', 'tted into s', 'trings']
Note that this matches 't this' and 'tted into s'; your rules need clarification as to why those first t characters shouldn't match when 'trings' does.
It sounds as if you want to match only text between t and s without including any other t:
>>> re.findall(r't[^ts]*s', sample, flags=re.I)
['this', 'to be s', 'to s', 'trings']
where tring in the second result and tted in the 3rd are not included because there is a later t in those results.
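To tie this back to the file-processing loop in the question, here is a rough sketch in Python 3. The function name write_segments is made up for illustration, and io.StringIO objects stand in for the question's string.txt and test.txt so the example runs without touching the disk:

```python
import io
import re

# Text from a "t" to the next "s" with no other "t" in between, ignoring case.
SEGMENT = re.compile(r't[^ts]*s', flags=re.IGNORECASE)

def write_segments(lines, out):
    """Write every matching segment found in lines to out, one per line."""
    for line in lines:
        for segment in SEGMENT.findall(line):
            out.write(segment + "\n")

# Stand-ins for open('string.txt') and open('test.txt', 'a').
source = io.StringIO("I want this string to be splitted into smaller strings.\n")
target = io.StringIO()
write_segments(source, target)
print(target.getvalue())
```

With real files, the two StringIO objects would be replaced by file objects opened in a with statement.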

splitting merged words in python

I am working with a text where all "\n"s have been deleted (which merges two words into one, like "I like bananasAnd this is a new line.And another one.") What I would like to do now is tell Python to look for combinations of a small letter followed by capital letter/punctuation followed by capital letter and insert a whitespace.
I thought this would be easy with regular expressions, but it is not: I couldn't find an "insert" function or anything, and the string methods don't seem to help either. How do I do this?
Any help would be greatly appreciated, I am despairing over here...
Thanks, patrick
Try the following:
re.sub(r"([a-z\.!?])([A-Z])", r"\1 \2", your_string)
For example:
import re
lines = "I like bananasAnd this is a new line.And another one."
print re.sub(r"([a-z\.!?])([A-Z])", r"\1 \2", lines)
# I like bananas And this is a new line. And another one.
If you want to insert a newline instead of a space, change the replacement to r"\1\n\2".
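A variation on the same idea, not taken from the answer above: lookbehind and lookahead assertions match the empty position between the two characters, so nothing has to be captured and put back. This is a sketch in Python 3, with insert_breaks as a hypothetical helper name:

```python
import re

def insert_breaks(text, separator=" "):
    """Insert separator wherever a lowercase letter or . ! ? is
    immediately followed by an uppercase letter."""
    # (?<=...) looks behind, (?=...) looks ahead; neither consumes text,
    # so the match is the empty gap between the two characters.
    return re.sub(r'(?<=[a-z.!?])(?=[A-Z])', separator, text)

lines = "I like bananasAnd this is a new line.And another one."
print(insert_breaks(lines))
# I like bananas And this is a new line. And another one.
```

Passing separator="\n" restores the line breaks instead of inserting spaces.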
Using re.sub you should be able to make a pattern that grabs a lowercase letter (optionally followed by punctuation) and an uppercase letter, and substitutes them for the same characters with a newline in between (use a space in the replacement if you prefer):
import re
re.sub(r'([a-z][.?]?)([A-Z])', '\\1\n\\2', mystring)
You're looking for the sub function. See http://docs.python.org/library/re.html for documentation.
Hmm, interesting. You can use regular expressions to replace text with the sub() function:
>>> import re
>>> string = 'fooBar'
>>> re.sub(r'([a-z][.!?]*)([A-Z])', r'\1 \2', string)
'foo Bar'
If you really don't have any caps except at the beginning of a sentence, it will probably be easiest to just loop through the string.
>>> import string
>>> s = "a word endsA new sentence"
>>> lastend = 0
>>> sentences = list()
>>> for i in range(0, len(s)):
...     if s[i] in string.uppercase:
...         sentences.append(s[lastend:i])
...         lastend = i
...
>>> sentences.append(s[lastend:])
>>> print sentences
['a word ends', 'A new sentence']
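Note that string.uppercase exists only in Python 2. A sketch of the same loop for Python 3, using str.isupper() instead (the function name split_at_capitals is an illustrative choice, not from the answer):

```python
def split_at_capitals(text):
    """Split text into pieces, starting a new piece at each uppercase letter."""
    sentences = []
    last_end = 0
    for i, ch in enumerate(text):
        # Skip index 0 so a leading capital does not produce an empty piece.
        if ch.isupper() and i > 0:
            sentences.append(text[last_end:i])
            last_end = i
    sentences.append(text[last_end:])
    return sentences

print(split_at_capitals("a word endsA new sentence"))
# ['a word ends', 'A new sentence']
```

As in the original, this only works if capitals appear nowhere except at the start of a sentence.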
Here's another approach, which avoids regular expressions and does not use any imported libraries, just built-ins...
s = "I like bananasAnd this is a new line.And another one."
with_whitespace = ''
last_was_upper = True
for c in s:
    if c.isupper():
        if not last_was_upper:
            with_whitespace += ' '
        last_was_upper = True
    else:
        last_was_upper = False
    with_whitespace += c
print with_whitespace
Yields:
I like bananas And this is a new line. And another one.

Replace ",**" with a linebreak using RegEx (or something else)

I'm getting started with RegEx and I was wondering if anyone could help me craft a statement to convert coordinates as follows:
145.00694,-37.80421,9 145.00686,-37.80382,9 145.00595,-37.8035,16 145.00586,-37.80301,16
to
145.00694,-37.80421
145.00686,-37.80382
145.00595,-37.8035
145.00586,-37.80301
(Strip off the last comma and value and turn it into a line break.)
I can't figure out how to use wildcards to do something like that. Any help would be greatly appreciated! Thanks.
"Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems." --Jamie Zawinski
Avoid that problem and use string methods:
s="145.00694,-37.80421,9 145.00686,-37.80382,9 145.00595,-37.8035,16 145.00586,-37.80301,16"
lines = s.split(' ') # each line is separated by ' '
for line in lines:
    a,b,c=line.split(',') # three parts, separated by ','
    print a + ',' + b # rejoin the first two parts with a comma
Regex have their uses, but this is not one of them.
>>> import re
>>> s="145.00694,-37.80421,9 145.00686,-37.80382,9 145.00595,-37.8035,16 145.00586,-37.80301,16"
>>> print re.sub(r",\d+\s*", "\n", s)
145.00694,-37.80421
145.00686,-37.80382
145.00595,-37.8035
145.00586,-37.80301
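If you do go the regex route, it may be safer to anchor on "the last comma-separated field of each whitespace-separated record" rather than on digits, so the result does not depend on every latitude carrying a minus sign. A sketch in Python 3, with drop_last_field as a hypothetical helper name:

```python
import re

def drop_last_field(text):
    """Remove the final ',value' from each whitespace-separated record
    and return the remaining records as a list."""
    # ,[^,\s]+ is a comma plus a field; (?=\s|$) requires that the field
    # is followed by whitespace or the end of the string.
    return re.sub(r',[^,\s]+(?=\s|$)', '', text).split()

s = ("145.00694,-37.80421,9 145.00686,-37.80382,9 "
     "145.00595,-37.8035,16 145.00586,-37.80301,16")
pairs = drop_last_field(s)
print('\n'.join(pairs))
```

The lookahead is what keeps the middle field intact: it is followed by a comma, never by whitespace.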
String methods seem to suffice here, regex are overkill:
>>> s='145.00694,-37.80421,9 145.00686,-37.80382,9 145.00595,-37.8035,16 145.00586,-37.80301,16'
>>> print('\n'.join(line.rpartition(',')[0] for line in s.split()))
145.00694,-37.80421
145.00686,-37.80382
145.00595,-37.8035
145.00586,-37.80301
>>> import re
>>> s = '145.00694,37.80421,9 145.00686,-37.80382,9 145.00595,-37.8035,16 145.00586,-37.80301,16'
>>> patt = '(%s,%s),%s' % (('[+-]?\d+\.?\d*', )*3)
>>> m = re.findall(patt, s)
>>> m
['145.00694,37.80421', '145.00686,-37.80382', '145.00595,-37.8035', '145.00586,-37.80301']
>>> print '\n'.join(m)
145.00694,37.80421
145.00686,-37.80382
145.00595,-37.8035
145.00586,-37.80301
but I prefer not to use regular expressions in this case.
I like SilentGhost's solution.

Create a Reg Exp to search for __word__?

In a program I'm making in Python, I want all words formatted like __word__ to stand out. How could I search for words like these using a regex?
Perhaps something like
\b__(\S+)__\b
>>> import re
>>> re.findall(r"\b__(\S+)__\b","Here __is__ a __test__ sentence")
['is', 'test']
>>> re.findall(r"\b__(\S+)__\b","__Here__ is a test __sentence__")
['Here', 'sentence']
>>> re.findall(r"\b__(\S+)__\b","__Here's__ a test __sentence__")
["Here's", 'sentence']
or you can put tags around the word like this
>>> print re.sub(r"\b(__)(\S+)(__)\b",r"<b>\2</b>","__Here__ is a test __sentence__")
<b>Here</b> is a test <b>sentence</b>
If you need more fine grained control over the legal word characters it's best to be explicit
\b__([a-zA-Z0-9_':]+)__\b ### count "'" and ":" as part of words
>>> re.findall(r"\b__([a-zA-Z0-9_']+)__\b","__Here's__ a test __sentence:__")
["Here's"]
>>> re.findall(r"\b__([a-zA-Z0-9_':]+)__\b","__Here's__ a test __sentence:__")
["Here's", 'sentence:']
Take a squizz here: http://docs.python.org/library/re.html
That should show you syntax and examples from which you can build a check for word(s) pre- and post-pended with 2 underscores.
The simplest regex for this would be
__.+__
If you want access to the word itself from your code, you should use
__(.+)__
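One caveat with these two patterns: .+ is greedy, so when a line contains several marked words it matches from the first __ to the last __. A short Python 3 sketch of the difference, using the non-greedy .+? as one possible fix (the sample sentence is made up):

```python
import re

text = "What __word__ you search __for__"

# Greedy: .+ grabs as much as it can, so there is a single long match.
greedy = re.findall(r'__(.+)__', text)

# Non-greedy: .+? stops at the first closing __, giving one match per word.
lazy = re.findall(r'__(.+?)__', text)

print(greedy)  # ['word__ you search __for']
print(lazy)    # ['word', 'for']
```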
This will give you a list with all such words
>>> import re
>>> m = re.findall("(__\w+__)", "What __word__ you search __for__")
>>> print m
['__word__', '__for__']
\b(__\w+__)\b
\b word boundary
\w+ one or more word characters - [a-zA-Z0-9_]
simple string functions. no regex
>>> mystring="blah __word__ blah __word2__"
>>> for item in mystring.split():
...     if item.startswith("__") and item.endswith("__"):
...         print item
...
__word__
__word2__
