regex for certain html links - python

EDIT: you guys are right, bs4 is much better and I've started using it; it's much more intuitive and actually finds links
although I'm still struggling at points haha
thank you all very much
I had a look and this doesn't seem to be covered in the other posts
So I am pretty sure I can use regex for this, as the 15 links in this HTML page are pretty well defined, I think. It's an Amazon page with 15 product links and I want those links.
The input is this:
Nikon Coolpix L340 Bridge Camera - Bl...
I have tried:
import re
links = re.findall(r'^(/n/n/n/n/n/n).(")', page)
which won't work. Any thoughts?

Use the regexp below:
s = """Nikon Coolpix L340 Bridge Camera - Bl..."""
re.findall('(?<=\n\n\n\n\n\n)(.*?)"', s)
Your previous regexp was anchored with ^, so it only looked for the \n sequence at the beginning of the string, not in the middle, which is where it sits in the sample string. (Note also that /n in your pattern is a literal slash followed by n; the newline escape is \n.)
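Since the EDIT above says bs4 ended up working, here is a minimal BeautifulSoup sketch of the same task. It is hedged: page is the HTML string from the question, and because the real product-link markup isn't shown, the sketch just grabs every href and leaves the filtering to you.
from bs4 import BeautifulSoup
soup = BeautifulSoup(page, 'html.parser')
# Collect the href of every anchor; narrow this down to the 15 product links
# with whatever class/id filter the real markup supports.
links = [a['href'] for a in soup.find_all('a', href=True)]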

Related

Using regex to find something in the middle of a href while looping

For "extra credit" in a beginners class in Python that I am taking I wanted to extract data out of a URL using regex. I know that there are other ways I could probably do this, but my regex desperately needs work so...
Given a URL to start at, find the xth occurrence of a href on the page, and use that link to go down a level. Rinse and repeat until I have found the required link on the page at the requested depth on the site.
I am using Python 3.7 and Beautiful Soup 4.
At the beginning of the program, after all of the house-keeping is done, I have:
starting_url = 'http://blah_blah_blah_by_Joe.html'
extracted_name = re.findall('(?<=by_)([a-zA-Z0-9]+)[^.html]*', starting_url)
selected_names.append(extracted_name)
# Just for testing purposes
print(selected_names)   # [['Joe']]
Hmm, a bit odd; I didn't expect a nested list, but I know how to flatten a list, so okay. Let's go on.
I work my way through a couple of loops, opening each url for the next level down by using:
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')
tags = soup('a')
Continue processing and, in the loop where the program should have found the href I want:
# Testing to check I have found the correct href
print(desired_link)   # <a href="http://blah_blah_blah_by_Mary.html">blah blah</a>
type(desired_link)    # bs4.element.Tag
Correct link, but a "type" that is new to me and not something I can use re.findall on. So, after more research, I found:
for link in soup.find_all('a'):
    tags = link.get('href')
    type(tags)   # str
    print(tags)
http://blah_blah_blah_by_George.html
http://blah_blah_blah_by_Bill.html
http://blah_blah_blah_by_Mary.html
etc.
Right type, but when I look at what printed, I think what I am looking at is maybe just one long string? I need a way to assign just the third href to a variable that I can use in re.findall('regex expression', desired_link).
Time to ask for help, I think.
And, while we are at it, any ideas about why I get the nested list the first time I used re.findall with the regex?
Please let me know how to improve this question so it is clearer what I've done and what I'm looking for (I KNOW you guys will, without me even asking).
You've printed every link on the page, but each time through the loop tags contains only one of them (you can print len(tags) to check this easily).
Also, I suggest replacing [a-zA-Z0-9]+ with \w+, which catches letters, digits and underscores and is much cleaner.
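A minimal sketch of that answer's point, assuming soup is the parsed page from the loop above: collect the hrefs into a plain list, index the one you need, and remember that re.findall itself returns a list (which is also why appending its result produced the nested [['Joe']]).
hrefs = [link.get('href') for link in soup.find_all('a')]
desired_link = hrefs[2]  # the third href, 0-based
name = re.findall(r'(?<=by_)\w+', desired_link)[0]  # findall returns a list; take its first element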

How to extract information from a web page using python in json or xml format?

I need help in extracting information from a webpage. I give the URL, and then I need to extract information like contact number, address, href, name of person, etc. I am able to extract the page source completely for a provided URL with known tags, but I need generic source code to extract this data from any URL. For example, I used regex to extract emails:
import urllib
import re
#htmlfile=urllib.urlopen("http://www.plainsboronj.com/content/departmental-directory")
urls=["http://www.plainsboronj.com/content/departmental-directory"]
i=0
regex='\b[A-Za-z0-9._%+-]+#[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b'
pattern=re.compile(regex)
print pattern
while i<len(urls):
    htmlfile=urllib.urlopen(urls[i])
    htmltext=htmlfile.read()
    titles=re.findall(pattern,htmltext)
    print titles
    i+=1
This gives me empty list. Any help to extract all info as I said above will be highly appreciated.
The idea is to give a URL and the extract all information like name, phone number, email, address etc in json or xml format. Thank you all in advance...!!
To start with, you need to fix your regex: in a normal Python string literal, \b is interpreted as a backspace character, not a regex word boundary.
The easy way to fix this is to use a raw string r'' instead:
regex=r'\b[A-Za-z0-9._%+-]+#[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b'
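A quick way to convince yourself of the difference:
import re
# In a normal string literal, '\b' is the backspace character (\x08),
# so the original pattern could never match a word boundary.
assert '\b' == '\x08'
pattern = re.compile(r'\b[A-Za-z0-9._%+-]+#[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b')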
Meanwhile I have managed to get it working, after some small modifications (beware that I am working with Python 3.4.2):
import urllib.request
import re
#htmlfile=urllib.urlopen("http://www.plainsboronj.com/content/departmental-directory")
urls=["http://www.plainsboronj.com/content/departmental-directory"]
i=0
regex='[A-Za-z0-9._%+-]+#[A-Za-z0-9.-]+\.[A-Za-z]{2,6}'
pattern=re.compile(regex)
print(pattern)
while i<len(urls):
    htmlfile=urllib.request.urlopen(urls[i])
    htmltext=htmlfile.read()
    titles=re.findall(pattern,htmltext.decode())
    print(titles)
    i+=1
The result is:
['townshipclerk#plainsboronj.com', 'acancro#plainsboronj.com', ...]
Good luck
I think you're on the wrong track here: you have an HTML file from which you are trying to extract information. You started by filtering on the '#' sign to find e-mail addresses (hence your choice of regular expressions). However, other things like names, phone numbers, etc. are not recognisable with regular expressions, so another approach might be useful. The documentation at https://docs.python.org/3/library/html.parser.html explains how to parse HTML files. In my opinion this will be a better approach for solving your needs.
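As a rough illustration, here is a minimal html.parser sketch; the handler bodies are assumptions, since what you collect depends on the data you are after:
from html.parser import HTMLParser

class InfoExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []
    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.links.append(href)
    def handle_data(self, data):
        self.text.append(data)

parser = InfoExtractor()
parser.feed(htmltext.decode())  # htmltext as read from urlopen() above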

How can I parse HTML code with "html written" URL in Python?

I am starting to program in Python, and have been reading a couple of posts that say I should use an HTML parser rather than re to get a URL out of a text.
I have the source code, which I got from page.read() using urllib's urlopen.
Now, my problem is that the parser is removing the url part from the text.
Also, if I read correctly, with var = page.read(), var is stored as a string?
How can I tell it to give me the text between two "tags"? The URL is always between flv= and ;, so it doesn't start with href, which is what the parsers look for, and it doesn't contain http:// either.
I have read many posts, but they all seem to look for href in the code.
Do I have it all completely wrong?
Thank you!
You could consider implementing your own search/grab. In pseudocode, it would look a little like this:
find location of 'flv=' in HTML = location_start
find location of ';' in HTML = location_end
grab everything in between: HTML[location_start : location_end]
You should be able to implement this in Python; a minimal sketch follows.
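A minimal Python sketch of that pseudocode; note the slice has to start after the 'flv=' marker, and the search for ';' has to begin there too, or you would grab the marker itself or an earlier semicolon:
def grab_between(html, start_marker='flv=', end_marker=';'):
    location_start = html.find(start_marker)
    if location_start == -1:
        return None
    location_start += len(start_marker)  # skip past 'flv='
    location_end = html.find(end_marker, location_start)
    if location_end == -1:
        return None
    return html[location_start:location_end]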
Good luck!

Finding urls containing a specific string

I haven't used RegEx before, and everyone seems to agree that it's bad for web scraping and HTML in particular, but I'm not really sure how to solve my little challenge without it.
I have a small Python scraper that opens 24 different webpages. In each webpage, there are links to other webpages. I want to make a simple solution that gets the links that I need, and even though the webpages are somewhat similar, the links that I want are not.
The only common thing between the URLs seems to be a specific string: 'uge' or 'Uge' (uge means week in Danish, and the week number changes every week, duh). It's not as if the URLs have a common ID or something like that which I could use to target the correct ones each time.
I figure it would be possible to use RegEx to go through the webpage and find all URLs that have 'uge' or 'Uge' in them and then open them. But is there a way to do that using BS? And if I do it using RegEx, what would a possible solution look like?
For example, here are two of the urls I want to grab in different webpages:
http://www.domstol.dk/KobenhavnsByret/retslister/Pages/Uge45-Tvangsauktioner.aspx
http://www.domstol.dk/esbjerg/retslister/Pages/Straffesageruge32.aspx
This should work... The RegEx uge\d\d? tells it to find "uge" followed by a digit, and possibly another one.
import re
for item in listofurls:
    matches = re.findall(r"uge\d\d?", item, re.IGNORECASE)
    if matches:
        print(item)  # just do whatever you want to do when it finds it
Yes, you can do this with BeautifulSoup.
import re
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(html_string)
# To find just 'Uge##' or 'uge##', as specified in the question:
urls = [el["href"] for el in soup.findAll("a", href=re.compile(r"[Uu]ge\d+"))]
# To find without regard to case at all:
urls = [el["href"] for el in soup.findAll("a", href=re.compile(r"(?i)uge\d+"))]
Or just use a simple for loop:
list_of_urls = ["""LIST GOES HERE"""]
for url in list_of_urls:
    if 'uge' in url.lower():
        # Code to execute
        pass
The regular expression would look something like: uge\d\d
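Putting the answers together, a hedged sketch of the full scrape (listofurls, holding the 24 page URLs, is assumed; BeautifulSoup 3 and urllib2 as used above):
import re
import urllib2
from BeautifulSoup import BeautifulSoup

pattern = re.compile(r'(?i)uge\d+')
for page_url in listofurls:
    soup = BeautifulSoup(urllib2.urlopen(page_url).read())
    for a in soup.findAll('a', href=pattern):
        print(a['href'])  # or open it: urllib2.urlopen(a['href'])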

Python strategy for extracting text from malformed html pages

I'm trying to extract text from arbitrary HTML pages. Some of the pages (which I have no control over) have malformed HTML or scripts, which makes this difficult. Also, I'm on a shared hosting environment, so while I can install any pure-Python lib, I can't just install anything I want on the server.
pyparsing and html2text.py also did not seem to work for malformed html pages.
Example URL is http://apnews.myway.com/article/20091015/D9BB7CGG1.html
My current implementation is approximately the following:
# Try using BeautifulSoup 3.0.7a
import BeautifulSoup
from BeautifulSoup import Comment
soup = BeautifulSoup.BeautifulSoup(s)
# Strip out comments and script blocks, then keep the remaining text
comments = soup.findAll(text=lambda text: isinstance(text, Comment))
[comment.extract() for comment in comments]
c = soup.findAll('script')
for i in c:
    i.extract()
body = soup.body(text=True)
text = ''.join(body)
# if BeautifulSoup can't handle it,
# alter html by trying to find 1st instance of "<body" and replace everything prior to that, with "<html><head></head>"
# try beautifulsoup again with new html
If BeautifulSoup still does not work, I resort to a heuristic: look at the first and last characters of a line (to see whether it looks like a line of code, e.g. starting with # or <, or ending with ;), take a sample of the line, and check whether the tokens are English words or numbers. If too few of the tokens are words or numbers, I guess that the line is code.
I could use machine learning to inspect each line, but that seems a little expensive, and I would probably have to train it (since I don't know that much about unsupervised learning), and of course write it as well.
Any advice, tools, or strategies would be most welcome. Also, I realize that the latter part of this is rather messy: if I get a line that is determined to contain code, I currently throw away the entire line, even if there is some small amount of actual English text in it.
Try not to laugh, but:
from subprocess import Popen, PIPE

class TextFormatter:
    def __init__(self, lynx='/usr/bin/lynx'):
        self.lynx = lynx

    def html2text(self, unicode_html_source):
        "Expects unicode; returns unicode"
        return Popen([self.lynx,
                      '-assume-charset=UTF-8',
                      '-display-charset=UTF-8',
                      '-dump',
                      '-stdin'],
                     stdin=PIPE,
                     stdout=PIPE).communicate(input=unicode_html_source.encode('utf-8'))[0].decode('utf-8')
I hope you've got lynx!
Well, it depends how good the solution has to be. I had a similar problem, importing hundreds of old html pages into a new website. I basically did
from BeautifulSoup import BeautifulSoup
from html2text import html2text

# remove all that crap around the body and let BS fix the tags
newhtml = "<html><body>%s</body></html>" % (
    u''.join( unicode( tag ) for tag in BeautifulSoup( oldhtml ).body.contents ))
# use html2text to turn it into text
text = html2text( newhtml )
and it worked out, but of course the documents could be so bad that even BS can't salvage much.
BeautifulSoup does badly with malformed HTML. What about some regex-fu?
>>> import re
>>>
>>> html = """<p>This is paragraph with a bunch of lines
... from a news story.</p>"""
>>>
>>> pattern = re.compile('(?<=p>).+(?=</p)', re.DOTALL)
>>> pattern.search(html).group()
'This is paragraph with a bunch of lines\nfrom a news story.'
You can then assemble a list of valid tags from which you want to extract information.
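A hedged sketch of that idea; it assumes the tags of interest carry no attributes, since the lookbehind must be fixed-width:
import re

def extract_tag_text(html, tags=('p', 'h1', 'h2', 'li')):
    # Apply the same lookbehind/lookahead trick once per tag name.
    chunks = []
    for tag in tags:
        pattern = re.compile('(?<=<%s>).+?(?=</%s>)' % (tag, tag), re.DOTALL)
        chunks.extend(pattern.findall(html))
    return chunks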
