Python re.findall prints list instead of string - python

address = ('http://www.somesite.com/article.php?page=' +numb)
html = urllib2.urlopen(address).read()
regex = re.findall(r"([a-f\d]{12})", html)
If you run the script, the output will be something similar to this:
['aaaaaaaaaaaa', 'bbbbbbbbbbbb', 'cccccccccccc']
How do I make the script print this output instead (note the line breaks):
aaaaaaaaaaaa
bbbbbbbbbbbb
cccccccccccc
Any help?

Just print regex like this:
print "\n".join(regex)
address = ('http://www.somesite.com/article.php?page=' +numb)
html = urllib2.urlopen(address).read()
regex = re.findall(r"([a-f\d]{12})", html)
print "\n".join(regex)

re.findall() returns a list. So you can either iterate over the list and print out each element separately like so:
address = ('http://www.somesite.com/article.php?page=' +numb)
html = urllib2.urlopen(address).read()
for match in re.findall(r"([a-f\d]{12})", html):
    print match
Or you can do as @bigOTHER suggests: join the list into one newline-separated string and print that string. The printed result is the same either way.
Source: https://docs.python.org/2/library/re.html#re.findall
re.findall(pattern, string, flags=0) Return all non-overlapping
matches of pattern in string, as a list of strings. The string is
scanned left-to-right, and matches are returned in the order found. If
one or more groups are present in the pattern, return a list of
groups; this will be a list of tuples if the pattern has more than one
group. Empty matches are included in the result unless they touch the
beginning of another match.
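A minimal Python 3 sketch of both options, run on a made-up HTML string instead of a fetched page. The name sample_html and its contents are assumptions for illustration only; the pattern is the one from the question.

```python
import re

# Hypothetical input standing in for the downloaded page:
# three <p> elements, each holding a 12-character hex-like token.
sample_html = "".join("<p>%s</p>" % (c * 12) for c in "abc")

# The pattern has exactly one group, so findall returns a list of strings.
matches = re.findall(r"([a-f\d]{12})", sample_html)

# Option 1: build one newline-separated string and print it once.
joined = "\n".join(matches)
print(joined)

# Option 2: print each match on its own line.
for match in matches:
    print(match)
```

Both options print the same three lines; join is handy when the text is needed as a single string afterwards.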

Use join on the result:
"".join("{0}\n".format(x) for x in re.findall(r"([a-f\d]{12})", html))

Related

How to find matching words using regex?

I have strings in a text file with more than 2000 lines, like:
cool.add.come.ADD_COPY
add.cool.warm.ADD_IN
warm.cool.warm.MINUS
cool.add.go.MINUS_COPY
I have a list of more than 200 matching words, like:
store=['ADD_COPY','add.cool.warm.ADD_IN', 'warm.cool.warm.MINUS', 'MINUS_COPY']
I'm using a regular expression in the code:
def all(store, file):
    lst = []
    for match in re.finditer(r'[\w.]+', file):
        words = match.group()
        if words in store:
            lst.append(words)
    return lst
Then I loop over the result to check it against my requirement.
Output I'm getting:
add.cool.warm.ADD_IN
warm.cool.warm.MINUS
If I change the identifiers to \w+ then I get only:
ADD_COPY
MINUS_COPY
Required output:
add.cool.warm.ADD_IN
warm.cool.warm.MINUS
ADD_COPY
MINUS_COPY
It appears you want to get the results using a mere list comprehension:
results = set([item for item in store if item in text])
If you need a regex (in case you plan to match whole words only, or match your store items only in specific contexts), you may get the matches using
import re
text="""cool.add.come.ADD_COPY
add.cool.warm.ADD_IN
warm.cool.warm.MINUS
cool.add.go.MINUS_COPY"""
store=['ADD_COPY','add.cool.warm.ADD_IN', 'warm.cool.warm.MINUS', 'MINUS_COPY']
rx="|".join(sorted(map(re.escape, store), key=len, reverse=True))
print(re.findall(rx, text))
The regex will look like
add\.cool\.warm\.ADD_IN|warm\.cool\.warm\.MINUS|MINUS_COPY|ADD_COPY
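The pattern is simply all your store items, with special characters escaped, sorted by length in descending order and joined with |. Sorting matters because alternation tries the alternatives left to right, so a longer item must come before any shorter item it starts with.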
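To see why the length sort is there, here is a small sketch with a hypothetical store in which one item is a prefix of another (the two-item store below is made up for illustration and is not the list from the question).

```python
import re

# Hypothetical store: "ADD" is a prefix of "ADD_COPY".
store = ["ADD", "ADD_COPY"]
text = "cool.add.come.ADD_COPY"

# Alternatives in list order: "ADD" is tried first and wins.
unsorted_rx = "|".join(map(re.escape, store))
# Alternatives longest first: "ADD_COPY" is tried before "ADD".
sorted_rx = "|".join(sorted(map(re.escape, store), key=len, reverse=True))

unsorted_hits = re.findall(unsorted_rx, text)
sorted_hits = re.findall(sorted_rx, text)

print(unsorted_hits)  # only the short prefix is reported
print(sorted_hits)    # the full item is reported
```

Without the sort, the shorter alternative matches first and the rest of the longer item is never reported.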

Regular expression to retrieve string parts within parentheses separated by commas

I have a string from which I want to extract the values within the parentheses, and then split those values on the commas.
Example: x(142,1,23ERWA31)
I would like to get:
142
1
23ERWA31
Is it possible to get everything with one regex?
I have found a method to do so, but it is ugly.
This is how I did it in python:
import re
string = "x(142,1,23ERWA31)"
firstResult = re.search(r"\((.*?)\)", string)
secondResult = re.search(r"(?<=\()(.*?)(?=\))", firstResult.group(0))
finalResult = [x.strip() for x in secondResult.group(0).split(',')]
for i in finalResult:
    print(i)
142
1
23ERWA31
This works for your example string:
import re
string = "x(142,1,23ERWA31)"
l = re.findall(r'([^(,)]+)(?!.*\()', string)
print(l)
Result: a plain list
['142', '1', '23ERWA31']
The expression matches a run of characters other than "(", "," and ")" that is not followed by a "(" anywhere further in the string. That negative lookahead is what prevents the leading x from being picked up, so it also works if your preamble x consists of more than a single character.
Using findall rather than search makes sure all items are found, and as a bonus it returns a plain list of the results.
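A short sketch that exercises this pattern on a few inputs. The helper name split_args and the second and third test strings are made up for illustration; note that with several parenthesised groups on one line, only the items after the last "(" survive the lookahead.

```python
import re

# Same pattern as above: a run without "(", "," or ")" that has no "(" after it.
ARG_RX = re.compile(r'([^(,)]+)(?!.*\()')

def split_args(call):
    """Return every run matched by ARG_RX in the given string."""
    return ARG_RX.findall(call)

single = split_args("x(142,1,23ERWA31)")
long_name = split_args("point(142,1,23ERWA31)")  # multi-character preamble
two_groups = split_args("a(1,2) b(3,4)")         # only the last group is kept

print(single)
print(long_name)
print(two_groups)
```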
You can make this a lot simpler. You are running your first Regex but then not taking the result. You want .group(1) (inside the brackets), not .group(0) (the whole match). Once you have that you can just split it on ,:
import re
string = "x(142,1,23ERWA31)"
firstResult = re.search(r"\((.*?)\)", string)
for e in firstResult.group(1).split(','):
    print(e)
This looks a little wonky, and it assumes there are always exactly three values in the parentheses, but try this regex:
\((.*?),(.*?),(.*?)\)
To extract all the group matches into a single object, your code would then look like:
import re
string = "x(142,1,23ERWA31)"
firstResult = re.search(r"\((.*?),(.*?),(.*?)\)", string).groups()
You can then index the firstResult object (a tuple) like a list:
>>> print(firstResult[2])
23ERWA31

python re.findall returns a list of tuples (strings are expected) [duplicate]

re.findall returns a list of tuples that contain the expected strings and also something unexpected.
I was writing a function findtags(text) to find tags in a given paragraph of text. When I call re.findall(tags, text) to find the tags in the text, it returns a list of tuples. Each tuple contains the string I expected, plus an extra element.
The function findtags(text) is as follows:
import re
def findtags(text):
    parms = r'(\w+\s*=\s*"[^"]*"\s*)*'
    tags = r'(<\s*\w+\s*' + parms + r'\s*/?>)'
    print(re.findall(tags, text))
    return re.findall(tags, text)
testtext1 = """
My favorite website in the world is probably
<a href="www.udacity.com">Udacity</a>. If you want
that link to open in a <b>new tab</b> by default, you should
write <a href="www.udacity.com"target="_blank">Udacity</a>
instead!
"""
findtags(testtext1)
The expected result is
['<a href="www.udacity.com">',
'<b>',
'<a href="www.udacity.com"target="_blank">']
The actual result is
[('<a href="www.udacity.com">', 'href="www.udacity.com"'),
('<b>', ''),
('<a href="www.udacity.com"target="_blank">', 'target="_blank"')]
re.findall returns tuples because your pattern has two capturing groups. Make the parms group a non-capturing one using ?: like this:
import re
def findtags(text):
    # make this a non-capturing group
    parms = r'(?:\w+\s*=\s*"[^"]*"\s*)*'
    tags = r'(<\s*\w+\s*' + parms + r'\s*/?>)'
    print(re.findall(tags, text))
    return re.findall(tags, text)
testtext1 = """
My favorite website in the world is probably
<a href="www.udacity.com">Udacity</a>. If you want
that link to open in a <b>new tab</b> by default, you should
write <a href="www.udacity.com"target="_blank">Udacity</a>
instead!
"""
findtags(testtext1)
Output:
['<a href="www.udacity.com">', '<b>', '<a href="www.udacity.com"target="_blank">']
Another way: if the pattern contains no capturing group at all, re.findall returns the whole matched text:
# non-capturing group
parms = r'(?:\w+\s*=\s*"[^"]*"\s*)*'
# no group at all
tags = r'<\s*\w+\s*' + parms + r'\s*/?>'
According to the docs for re.findall:
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result.
In your case, the stuff in parentheses in parms = '(\w+\s*=\s*"[^"]*"\s*)*' is a repeated group, so a list of tuples of possibly empty strings is returned.
Looks like you don't want to return your inner capture group matches, so make it a non-capturing group instead.
parms = '(?:\w+\s*=\s*"[^"]*"\s*)*'
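The effect of the number of capturing groups on the return value of re.findall can be seen on a much smaller input. This is only an illustrative sketch; the sample strings and variable names are made up.

```python
import re

text = "a=1 b=2"

# Two capturing groups: a list of tuples, one entry per group.
two_groups = re.findall(r"(\w+)=(\d+)", text)
# One capturing group: a list of that group's strings.
one_group = re.findall(r"(?:\w+)=(\d+)", text)
# No capturing group: a list of the whole matches.
no_group = re.findall(r"\w+=\d+", text)
# A repeated inner group keeps only its last repetition,
# or an empty string if it never matched.
repeated = re.findall(r"(x(\d)*)", "x12 x")

print(two_groups)
print(one_group)
print(no_group)
print(repeated)
```

The last case mirrors the question: the outer group holds the full tag, while the repeated inner group contributes the last attribute or an empty string.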

Python Regex: Remove the parts of the string that does not match regex pattern

I want to remove the parts of the string that do not match the format I want. Example:
import re
string = 'remove2017abcdremove'
pattern = re.compile("((20[0-9]{2})([a-zA-Z]{4}))")
result = pattern.search(string)
if result:
    print('1')
else:
    print('0')
It prints "1", so I can find the matching format inside the string. However, I also want to remove the parts that say "remove".
I want it to return:
desired_output = '2017abcd'
You need to extract the matched text from the search result, which is done by calling group() on the match object:
import re
string = 'remove2017abcdremove'
pattern = re.compile("(20[0-9]{2}[a-zA-Z]{4})")
string = pattern.search(string).group()
# 2017abcd
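Note that search returns None when nothing matches, so calling group() directly raises AttributeError on such input. A minimal sketch with a guard follows; the helper name extract and the second test string are assumptions for illustration.

```python
import re

# Same format as in the question: "20" + two digits + four letters.
PATTERN = re.compile(r"20[0-9]{2}[a-zA-Z]{4}")

def extract(text):
    """Return the first substring matching PATTERN, or None if there is none."""
    m = PATTERN.search(text)
    return m.group() if m else None

found = extract("remove2017abcdremove")
missing = extract("nothing to see here")

print(found)
print(missing)
```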

how to extract portion of a string between two substrings in a multiline string in python

I'm trying to extract the portion of a string between two string identifiers. The technique works if the start string is in the first line, but it does not work for substrings in other lines.
The string is like this:
mystring="""abc jhfshf iztzrtzoi hjge);
kjsyh ldjfsj sjsdgj sodfsd);
sjfhsdvh isdjgdfg sdgjhg isjdgg);
ghdcbnv jgdfkjg fdjgjfdgj);
vgdfnkvgfd dfgjfdjgjöfd);
end"""
Until now I have the following code.
startString='jhfshf'
endString=';'
search_var=mystring[mystring.find(startString)+len(startString):mystring.find(endString)]
print(search_var)
I get the correct output: iztzrtzoi hjge)
But if I search for a string in the second line (for example startString='ldjfsj'), it does not work. Can anybody suggest a correction?
Using Regex.
Demo:
import re
mystring="""abc jhfshf iztzrtzoi hjge);
kjsyh ldjfsj sjsdgj sodfsd);
sjfhsdvh isdjgdfg sdgjhg isjdgg);
ghdcbnv jgdfkjg fdjgjfdgj);
vgdfnkvgfd dfgjfdjgjöfd);
end"""
m = re.search(r"(?<=jhfshf).*?(?=;)", mystring)
if m:
    print(m.group())
Output:
iztzrtzoi hjge)
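The original slicing fails for later lines because mystring.find(endString) always returns the position of the first ';', which lies before the start string, so the slice is empty. Below is a sketch of two ways around that for an arbitrary start string; the function names between_find and between_re are made up, and the sample text is a shortened copy of the string from the question.

```python
import re

mystring = """abc jhfshf iztzrtzoi hjge);
kjsyh ldjfsj sjsdgj sodfsd);
sjfhsdvh isdjgdfg sdgjhg isjdgg);
end"""

def between_find(text, start, end):
    """Slice between start and the first end that comes after it."""
    begin = text.find(start)
    if begin == -1:
        return None
    begin += len(start)
    stop = text.find(end, begin)  # search for end only from begin onwards
    return text[begin:stop] if stop != -1 else None

def between_re(text, start, end):
    """Same result with a regex; re.escape makes the delimiters literal."""
    m = re.search(re.escape(start) + r"(.*?)" + re.escape(end), text, re.DOTALL)
    return m.group(1) if m else None

print(between_find(mystring, "ldjfsj", ";"))
print(between_re(mystring, "ldjfsj", ";"))
```

Both helpers return the text including its leading space; call strip() on the result if that is not wanted.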
