Extract All urls from a string in python - python

Given a string of text that may contain multiple URLs, all starting with http://, for example:
someString = "Text amongst words and links http://www.text.com more text more text another http http://www.word.com"
How can I extract all the URLs from a string like the one above, leaving just the following?
http://www.text.com
http://www.word.com

This should work:
>>> import re
>>> for url in re.findall(r'(http://\S+)', someString): print(url)
...
http://www.text.com
http://www.word.com
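
As a script rather than a console session, the same idea might look like the sketch below. The optional "s" in the pattern (so that https:// links also match) is an addition that the question did not ask for, and the names some_string and url_pattern are only illustrative.

```python
import re

some_string = ("Text amongst words and links http://www.text.com more text "
               "more text another http http://www.word.com")

# Match http:// or https:// followed by a run of non-whitespace characters.
# The bare word "http" in the text has no "://" after it, so it is skipped.
url_pattern = re.compile(r'https?://\S+')

urls = url_pattern.findall(some_string)
for url in urls:
    print(url)
```

Note that \S+ also swallows punctuation that directly follows a link, such as a trailing comma or period, so real-world text may need extra cleanup.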

You want regular expressions.
In Python: https://docs.python.org/2/library/re.html
Regular expression to evaluate: http://daringfireball.net/2010/07/improved_regex_for_matching_urls
It shouldn't take you long from there.

Related

How to handle HTML entities in parsed text - Python

I have parsed text that contains HTML entities for various symbols such as quotation marks and dashes.
This is what one string looks like:
Introduction &#8211 First page&#8218s content
And I would like to achieve this:
Introduction - First page's content
Is there a library or common solution that converts the HTML entities in a string? Or do I need to write a function that replaces each entity with the proper character?
I already checked these answers, but I need something that works with a plain Python string that contains HTML entities.
The html module doesn't require anything special from the string. It just works:
>>> import html
>>> html.unescape('Introduction &#8211 First page&#8218s content')
'Introduction – First page‚s content'
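
A slightly fuller sketch, assuming Python 3 and assuming the second entity was meant to be &#8217; (the right single quotation mark) rather than &#8218;. The mapping to plain ASCII at the end is an optional extra step, not something html.unescape does by itself.

```python
import html

# Numeric character references, written with their terminating semicolons.
raw = 'Introduction &#8211; First page&#8217;s content &amp; more'

# html.unescape turns entities into the characters they stand for.
clean = html.unescape(raw)
print(clean)

# Optional: map the typographic characters to ASCII look-alikes.
ascii_map = {0x2013: '-', 0x2019: "'"}   # en dash, right single quote
plain = clean.translate(ascii_map)
print(plain)
```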
Try
print unicode(x)
or
print x.encode('ascii')

web scrape python find all by text instead of find all by element tag

Let's use the word technology for my example.
I want to search all the text on a webpage. I want to find every element whose text contains the word "technology" and print only the contents of those elements. Please help me figure this out.
words = soup.body.get_text()
for word in words:
    i = word.soup.find_all("technology")
    print(i)
You should search by text, which can be done with the text argument (renamed to string in modern BeautifulSoup versions), either with a function that checks for the substring:
for element in soup.find_all(text=lambda text: text and "technology" in text):
    print(element.get_text())
Or, via a regular expression pattern:
import re
for element in soup.find_all(text=re.compile("technology")):
    print(element.get_text())
Since you are looking for data inside of an 'HTML structure' and not a typical data structure, you are going to have to nearly write an HTML parser for this job. Python doesn't normally know that "some string here" relates to another string wrapped in brackets somewhere else.
There may be a library for this, but I have a feeling that there isn't :(
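
For what it's worth, the standard library does ship an HTML parser (html.parser), so a hand-rolled search is possible without third-party packages. The following is a rough sketch under stated assumptions, not a replacement for BeautifulSoup: the class name TextFinder, the sample markup, and the simple tag stack are all made up for illustration, and void tags such as <br> are not handled.

```python
from html.parser import HTMLParser


class TextFinder(HTMLParser):
    """Collect (enclosing tag, text) pairs for text containing a word."""

    def __init__(self, word):
        super().__init__()
        self.word = word
        self.stack = []   # currently open tags, innermost last
        self.hits = []    # (enclosing tag name, stripped text)

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        # The comparison is case-sensitive.
        if self.word in data:
            parent = self.stack[-1] if self.stack else None
            self.hits.append((parent, data.strip()))


sample = ("<body><h1>All about technology</h1>"
          "<p>New technology arrives.</p><p>Nothing here.</p></body>")

finder = TextFinder("technology")
finder.feed(sample)
finder.close()
print(finder.hits)
```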

How to return everything in a string that is not matched by a regex?

I have a string and a regular expression that matches portions of the string. I want to return a string representing what's left of the original string after all matches have been removed.
import re
string = '<font size="2px" face="Tahoma"><br>Good Morning, </font><div><br></div><div>As per last email'
pattern = r'<[a-zA-Z0-9 ="/\-:;.]*>'
re.findall(pattern, string)
['<font size="2px" face="Tahoma">',
'<br>',
'</font>',
'<div>',
'<br>',
'</div>',
'<div>']
desired_string = "Good Morning, As per last email"
Instead of re.findall, use re.sub to replace each match with an empty string.
re.sub(pattern, "", string)
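
Put together as a runnable sketch: the input is wrapped in single quotes so that the double quotes inside the HTML need no escaping, and re.subn is shown as an optional variant that also reports how many matches were removed. The variable names remaining and count are illustrative.

```python
import re

string = ('<font size="2px" face="Tahoma"><br>Good Morning, </font>'
          '<div><br></div><div>As per last email')
pattern = r'<[a-zA-Z0-9 ="/\-:;.]*>'

# Replace every match with the empty string, keeping everything else.
remaining = re.sub(pattern, "", string)
print(remaining)

# re.subn does the same but also returns the number of substitutions.
_, count = re.subn(pattern, "", string)
print(count)
```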
While that's the literal answer to your general question about removing patterns from a string, it appears that your specific problem is related to manipulating HTML. It's generally a bad idea to try to manipulate HTML with regular expressions. For more information see this answer to a similar question: https://stackoverflow.com/a/1732454/7432
Instead of a regular expression, use an HTML parser like BeautifulSoup. It looks like you are trying to strip the HTML elements and get the underlying text.
from bs4 import BeautifulSoup
string="""<font size="2px" face="Tahoma"><br>Good Morning, </font><div><br></div><div>As per last email"""
soup = BeautifulSoup(string, 'lxml')
print(soup.get_text())
This outputs:
Good Morning, As per last email
One thing to notice is that an &nbsp; entity in the input is changed to a space character by this method.

Python Regex Tokenize

I'm trying to figure out how to use regular expressions in Python to extract out certain URLs in strings. For example, I might have 'blahblahblah (a href="example.com")'. In this case I want to extract all "example.com" links. How can I do that instead of just splitting the string?
Thanks!
There is a great module called BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/) for parsing HTML. You should use it instead of regex to get information out of HTML. Here's an example:
>>> from bs4 import BeautifulSoup
>>> html = """<p> some <a href="http://link.com">HTML</a> and <a href="http://second.com">another link</a></p>"""
>>> soup = BeautifulSoup(html, 'html.parser')
>>> mylist = soup.find_all('a')
>>> for link in mylist:
...     print(link['href'])
...
http://link.com
http://second.com
Here is a link to the documentation, which is really easy to follow: http://www.crummy.com/software/BeautifulSoup/bs4/doc/
Regexes are very powerful tools, but they might not be the right tool in all circumstances (as others have suggested already). That said, here's a minimal example from the console that uses regex, as requested:
>>> import re
>>> s = 'blahblahblah (a href="example.com") another bla <a href="subdomain.example2.net">'
>>> re.findall(r'a href="(.*?)"', s)
['example.com', 'subdomain.example2.net']
Focus on r'a href="(.*?)"'. In English it translates to: "find a string beginning with a href=", then save as the result any characters until you hit the next ". The syntax is:
the () means "save only the stuff in here"
the . means "any character"
the * means "any number of times"
the ? means "non-greedy", or in other words: find the shortest string that satisfies the requirements (try without the question mark and you will see what happens).
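
The difference the question mark makes can be seen by running both variants on the same string. This is a small sketch; the names lazy and greedy are only labels for the two results.

```python
import re

s = 'blahblahblah (a href="example.com") another bla <a href="subdomain.example2.net">'

# Non-greedy: each capture stops at the first closing quote it reaches.
lazy = re.findall(r'a href="(.*?)"', s)

# Greedy: the capture runs on to the last double quote in the string,
# so both links end up inside a single match.
greedy = re.findall(r'a href="(.*)"', s)

print(lazy)
print(greedy)
```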
HTH!
Do not use regexp:
Here is why you should not reach for regex first when dealing with HTML or XML (or URLs).
If you wish to use regex anyway, you can find several patterns that do the job, and several ways to fetch the strings you wish to find.
These patterns do the job:
r'\(a href="(.*?)"\)'
r'\(a href="(.*)"\)'
r'\(a href="(.+?)"\)'
1. re.findall()
re.findall(pattern, string, flags=0)
Return all non-overlapping matches of pattern in string, as a list of
strings. The string is scanned left-to-right, and matches are returned
in the order found. If one or more groups are present in the pattern,
return a list of groups; this will be a list of tuples if the pattern
has more than one group. Empty matches are included in the result
unless they touch the beginning of another match.
import re
st = 'blahblahblah (a href="example.com") another bla <a href="polymer.edu">'
re.findall(r'\(a href="(.+?)"\)', st)
2. re.search()
re.search(pattern, string, flags=0)
Scan through string looking for a location where the regular
expression pattern produces a match, and return a corresponding
MatchObject instance.
Then use the match object's group() method to go through the groups. For instance, with the regex r'\(a href="(.+?(.).+?)"\)', which also works here, you have several nested groups: group 0 is the match for the whole pattern, and group 1 is the match for the first parenthesized sub-pattern, (.+?(.).+?).
You would use search when looking for only the first occurrence of the pattern. With your example this would be:
>>> st = 'blahblahblah (a href="example.com") another bla (a href="polymer.edu")'
>>> m=re.search(r'\(a href="(.+?(.).+?)"\)', st)
>>> m.group(1)
'example.com'
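
If every occurrence is needed along with its match object, re.finditer is one option. A minimal sketch, assuming the same parenthesized link format as above; the pattern and the name found are illustrative.

```python
import re

st = 'blahblahblah (a href="example.com") another bla (a href="polymer.edu")'
pattern = re.compile(r'\(a href="(.+?)"\)')

# finditer yields one match object per non-overlapping match, so both
# the whole match (group 0) and the captured part (group 1) are available.
found = [(m.group(0), m.group(1)) for m in pattern.finditer(st)]

for whole, link in found:
    print(whole, '->', link)
```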

python regex to find any link that contains the text 'abc123'

I am using Beautiful Soup to find all links by their href.
links = myhtml.findAll('a', href=re.compile('????'))
I need to find all links that have 'abc123' in the href text.
I need help with the regex; see the ???? in my code snippet.
If 'abc123' is literally what you want to search for, anywhere in the href, then re.compile('abc123') as suggested by other answers is correct. If the actual string you want to match contains punctuation, e.g. 'abc123.com', then use instead
re.compile(re.escape('abc123.com'))
The re.escape part will "escape" any punctuation so that it's taken literally, just like alphanumerics are; without it, some punctuation gets interpreted in various ways by RE's engine, for example the dot ('.') in the above example would be taken as "any single character whatsoever", so re.compile('abc123.com') would match, e.g. 'abc123zcom' (and many other strings of a similar nature).
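
The effect can be checked directly with the re module alone. In this sketch the sample strings and variable names are made up for illustration.

```python
import re

literal = 'abc123.com'
unescaped = re.compile(literal)             # '.' means "any character"
escaped = re.compile(re.escape(literal))    # '.' means a literal dot

# The unescaped pattern wrongly accepts 'abc123zcom'.
loose_hit = unescaped.search('http://abc123zcom/page') is not None

# The escaped pattern rejects it but still finds the real thing.
strict_miss = escaped.search('http://abc123zcom/page') is None
strict_hit = escaped.search('http://abc123.com/page') is not None

print(loose_hit, strict_miss, strict_hit)
```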
"abc123" should give you what you want
If that doesn't work, then BS is probably using re.match, in which case you would want ".*abc123.*"
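
The distinction behind that advice can be seen with plain re calls, whichever of the two BeautifulSoup uses internally. The href value here is a hypothetical example.

```python
import re

href = 'http://example.com/abc123/page'   # hypothetical href value

# re.search looks for the pattern anywhere in the string.
search_plain = re.search('abc123', href) is not None

# re.match only succeeds if the pattern matches at the very start.
match_plain = re.match('abc123', href) is not None

# Padding with .* lets re.match succeed as well.
match_padded = re.match('.*abc123.*', href) is not None

print(search_plain, match_plain, match_padded)
```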
If you want all the links containing 'abc123' you can simply write:
links = myhtml.findAll('a', href=re.compile('abc123'))
