re.search on webpage source with a carriage return mid-string causing errors in Python

Using Python 2.6.6
I'm trying to grab the title of YouTube links using a mechanize browser. While it works on links to actual videos, linking to a channel's page, a playlist, etc. causes it to crash.
The relevant code segment:
ytpage = br.open(ytlink).read()
yttitle = re.search('<title>(.*)</title>', ytpage)
yttitle = yttitle.group(1)
The error:
yttitle = yttitle.group(1)
AttributeError: 'NoneType' object has no attribute 'group'
The only difference I can see is that a direct video link lays out the title tags on a single line in the source, whereas every other YouTube page seems to put a carriage return/newline in the middle of the title tags.
Anyone know how I can get around that carriage return, assuming that is the problem?
Cheers.

You can use the re.DOTALL flag, which makes . match any character, including a newline (see the re documentation).
So your second line of code should look like:
yttitle = re.search('<title>(.*)</title>', ytpage, re.DOTALL)
By the way, to extract data from a webpage it might be easier to use Beautiful Soup.
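A minimal sketch putting that together, with a None check so a page that somehow lacks a <title> does not raise the same AttributeError (br and ytlink are the mechanize objects from the question, and the non-greedy .*? stops at the first closing tag):
import re

ytpage = br.open(ytlink).read()  # br and ytlink as in the question
match = re.search('<title>(.*?)</title>', ytpage, re.DOTALL)
if match is not None:
    yttitle = match.group(1).strip()
    print(yttitle)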

Related

Python Beautiful Soup scrape only if text match?

I just started learning Beautiful Soup; I've been watching videos and am getting the hang of it somewhat. But in the examples provided, the HTML already seems well structured, and they never search for a specific word.
What I'm trying to do is print only the information for a specific country when it is mentioned; if it isn't mentioned, nothing should be printed.
Later on I'll build it out so that it appends to a text file.
I simply want to grab everyone who is from New Zealand, but for experimenting I've been using United States because it appears more frequently.
At the moment my code looks like this; it simply grabs all of them.
from bs4 import BeautifulSoup
import requests
source = requests.get('https://pogotrainer.club/?sort=worldwide').text
soup = BeautifulSoup(source, 'lxml')
trainer = soup.find('article')
for box in trainer.find_all('div', class_='media-body'):
    print(box.text)
In one tutorial I saw they used findNext; what matters in the end is the friend code listed, so I tried:
usa = box.find(text="United States").findNext(class_="TCLink")
However, printing it with print(usa) gives me this error:
AttributeError: 'NoneType' object has no attribute 'findNext'
Before that, I had also tried things like:
usa = soup.find(text="United")
But it prints
None
even though, looking at the page, the text is there.
Does anyone have suggestions?
Thanks in advance
Let's break down AttributeError: 'NoneType' object has no attribute 'findNext':
The NoneType object is the result of box.find(text="United States"): when the text isn't found, find returns None.
You then access the attribute .findNext (which is actually a method), but because the object is None, the call fails.
You're assuming the find always succeeds, so you have to check what you're actually working with. You might want to try this:
for box in trainer.find_all('div', class_='media-body'):
    print(box)
Always try to know what you're working with by, for example, printing it explicitly.
That's one of Python's weaknesses (or strengths, depending on what you work on): it leaves this part of the debugging to the user.
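Applying that check to the original goal, here is a minimal sketch; the URL, selectors, and the TCLink class are taken from the question, so treat them as assumptions about the page:
from bs4 import BeautifulSoup
import requests

source = requests.get('https://pogotrainer.club/?sort=worldwide').text
soup = BeautifulSoup(source, 'lxml')
trainer = soup.find('article')
for box in trainer.find_all('div', class_='media-body'):
    country = box.find(text="United States")
    if country is None:
        continue  # this box does not mention the country; skip it
    code = country.findNext(class_="TCLink")
    if code is not None:
        print(code.text)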

Extracting the value of an attribute with a hyphen using Beautiful Soup 4

I am traversing my tags using BeautifulSoup 4. I have the following tag contents and am unable to extract the value of the 'data-event-name' attribute; I want '15:02' from it.
This is the HTML I need to extract 15:02 from.
I have tried many, many things but am unable to get this value. I tried the re package, Python's getattr, find, find_all, etc. This is one example of something I tried:
for racemeetnum, r1_a in enumerate(r1, start=1):
    event1 = getattr(r1_a, 'data-event-name')  # doesn't work
Thank you @Jack Fleming. I managed to sort this out last night. In the end my issue wasn't that I couldn't find the attribute; it was that I wasn't trapping the errors when the attribute wasn't found. I surrounded the code with a try/except and it worked fine.
Thanks for responding!
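A minimal sketch of that approach; in bs4, hyphenated attributes are read with dictionary-style access rather than getattr (r1 and the attribute name come from the question):
# Assumes r1 is an iterable of bs4 Tag objects, as in the question.
for racemeetnum, r1_a in enumerate(r1, start=1):
    try:
        event1 = r1_a['data-event-name']  # dict-style access handles the hyphen
    except KeyError:
        continue  # this tag has no data-event-name attribute
    print(racemeetnum, event1)
Alternatively, r1_a.get('data-event-name') returns None instead of raising, which avoids the try/except entirely.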

Using Xpath to get the anchor text of a link in Python when the link has no class

(disclaimer: I only vaguely know python & am pretty new to coding)
I'm trying to get the text part of a link, but it doesn't have a specific class, and depending on how I word my code I get either way too many things (the xpath wasn't specific enough) or an empty list [].
A screenshot of what I'm trying to access is in the original post. tree is all the HTML from the page.
The code that returns a blank is:
cardInfo = tree.xpath('div[@class="cardDetails"]/table/tbody/tr/td[2]/a/text()')
The code that returns way too much:
cardInfo = tree.xpath('//a[contains(@href, "domain_name")]/text()')
I tried going into Inspect in Chrome and copying the xpath, which also gave me nothing. I've successfully gotten other things out of the page that are just plain text, not links. Super sorry if I didn't explain this well, but does anyone have an idea of what I can write?
If you meant to find the text next to "Set Name:", you can do:
>>> import lxml.html
>>> tree = lxml.html.parse('http://shop.tcgplayer.com/pokemon/jungle/nidoqueen-7')
>>> tree.xpath(".//b[text()='Set Name:']/parent::td/following-sibling::td/a/text()")
['Jungle']
.//b[text()='Set Name:'] finds the b tag containing the text Set Name:, parent::td selects its parent td element, and following-sibling::td selects the td element that follows it.
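For the original cardDetails query, here is a hedged variant; note that browsers insert tbody when rendering, so the raw source may not contain it, which is a common reason a copied xpath returns an empty list:
import lxml.html

tree = lxml.html.parse('http://shop.tcgplayer.com/pokemon/jungle/nidoqueen-7')
# Leading // searches the whole document; td[2] is carried over from the
# question's own xpath and is an assumption about the table layout.
card_info = tree.xpath('//div[@class="cardDetails"]/table//tr/td[2]/a/text()')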

Unknown errors in webcrawler-dictionary (Python, modules: beautifulsoup4, operator, requests)

I am a beginner at Python and I have developed a program that is meant to crawl a website (that sells things) and print out the frequency of different words in the titles of the different items on sale.
There are three functions in my program:
1) A function that takes the text of the website and refines it to make a string
2) A function that takes that string and cleans it up, getting rid of things like brackets, commas, asterisks etc.
3) A function that then takes this string and sorts the words by how many times they are written on the website
I had an error in this program with my BeautifulSoup4 module, this other post helped me get rid of it: How to get rid of BeautifulSoup user warning?
However, this produced two more errors in my program:
1) An error with the link I put into the first function
File "/Users/lowryj1/PycharmProjects/untitled2/Jaer.py", line 39, in <module>
start('https://hongkong.asiaxpat.com/classifieds/glassware/')
And this is the code that is wrong (The link is the website I'm crawling):
start('https://hongkong.asiaxpat.com/classifieds/glassware/')
2) An error on the line of code in the first function where I try to split the string and make all of the characters lowercase:
File "/Users/lowryj1/PycharmProjects/untitled2/Jaer.py", line 11, in start
words = content.lower().split()
AttributeError: 'NoneType' object has no attribute 'lower'
And this is the code that is wrong:
words = content.lower().split()
This is the area where the error occurs (url is where my website URL comes in):
def start(url):
    word_list = []
    source_code = requests.get(url).text
    soup = BeautifulSoup(source_code, "html5lib")
    for post_text in soup.findAll('a', {'target': '_blank'}):
        content = post_text.string
        words = content.lower().split()
I have tried my best to solve these problems; most solutions I've tried only make the issues worse. Please help me solve these errors, as I was unable to find adequate solutions via research.
First, note that the bs4 docs show slightly different syntax for find_all.
But assuming your syntax is also correct, it fails because some of the found post_texts have no textual content, so .string returns None. You need to check your anchors for that; it is probably an error in the source pages.
But if you just want to avoid the issue, use:
if post_text.string is not None:
    content = post_text.string
    words = content.lower().split()
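Putting that guard back into the function from the question, a minimal sketch (the URL and selector are the question's own):
import requests
from bs4 import BeautifulSoup

def start(url):
    word_list = []
    source_code = requests.get(url).text
    soup = BeautifulSoup(source_code, "html5lib")
    for post_text in soup.find_all('a', {'target': '_blank'}):
        if post_text.string is None:
            continue  # anchor has no single text node; skip it
        word_list.extend(post_text.string.lower().split())
    return word_list

words = start('https://hongkong.asiaxpat.com/classifieds/glassware/')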

Python strategy for extracting text from malformed html pages

I'm trying to extract text from arbitrary HTML pages. Some of the pages (which I have no control over) have malformed HTML or scripts, which makes this difficult. Also, I'm on a shared hosting environment, so I can install any Python lib, but I can't just install anything I want on the server.
pyparsing and html2text.py also did not seem to work for malformed HTML pages.
Example URL is http://apnews.myway.com/article/20091015/D9BB7CGG1.html
My current implementation is approximately the following:
# Try using BeautifulSoup 3.0.7a
from BeautifulSoup import BeautifulSoup, Comment

soup = BeautifulSoup(s)  # s is the raw html of the page
comments = soup.findAll(text=lambda text: isinstance(text, Comment))
[comment.extract() for comment in comments]
for script in soup.findAll('script'):
    script.extract()
body = soup.body(text=True)
text = ''.join(body)
# If BeautifulSoup can't handle it, alter the html by finding the 1st
# instance of "<body" and replacing everything prior to it with
# "<html><head></head>", then try BeautifulSoup again with the new html.
If BeautifulSoup still does not work, I resort to a heuristic: look at the first and last characters of a line (to see whether it looks like a line of code, e.g. starting with # or < or ending with ;), take a sample of the line, and check whether the tokens are English words or numbers. If too few of the tokens are words or numbers, I guess that the line is code.
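A rough sketch of that heuristic (the thresholds and the token check here are illustrative placeholders, not the actual code):
CODE_STARTS = ('<', '#', '//', '{', '}')
CODE_ENDS = (';', '{', '}', '>')

def looks_like_code(line, min_wordish=0.5):
    stripped = line.strip()
    if not stripped:
        return False
    if stripped.startswith(CODE_STARTS) or stripped.endswith(CODE_ENDS):
        return True
    tokens = stripped.split()
    # Crude stand-in for a real dictionary lookup: count tokens that are
    # purely alphabetic or purely numeric.
    wordish = sum(1 for t in tokens if t.isalpha() or t.isdigit())
    return wordish / float(len(tokens)) < min_wordish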
I could use machine learning to inspect each line, but that seems a little expensive; I would probably have to train it (since I don't know that much about unsupervised learning), and of course write it as well.
Any advice, tools, or strategies would be most welcome. Also, I realize the latter part of this is rather messy: if a line is determined to contain code, I currently throw away the entire line, even if it contains some small amount of actual English text.
Try not to laugh, but:
from subprocess import Popen, PIPE

class TextFormatter:
    def __init__(self, lynx='/usr/bin/lynx'):
        self.lynx = lynx

    def html2text(self, unicode_html_source):
        "Expects unicode; returns unicode"
        return Popen([self.lynx,
                      '-assume-charset=UTF-8',
                      '-display-charset=UTF-8',
                      '-dump',
                      '-stdin'],
                     stdin=PIPE,
                     stdout=PIPE).communicate(
                         input=unicode_html_source.encode('utf-8'))[0].decode('utf-8')
I hope you've got lynx!
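A quick usage sketch, assuming lynx really is at /usr/bin/lynx:
tf = TextFormatter()
print(tf.html2text(u'<html><body><p>Hello &amp; goodbye</p></body></html>'))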
Well, it depends how good the solution has to be. I had a similar problem importing hundreds of old HTML pages into a new website. Basically, I did:
# remove all that crap around the body and let BS fix the tags
newhtml = "<html><body>%s</body></html>" % (
    u''.join(unicode(tag) for tag in BeautifulSoup(oldhtml).body.contents))
# use html2text to turn it into text
text = html2text(newhtml)
and it worked out, but of course the documents could be so bad that even BS can't salvage much.
BeautifulSoup does badly with malformed HTML. What about some regex-fu?
>>> import re
>>>
>>> html = """<p>This is paragraph with a bunch of lines
... from a news story.</p>"""
>>>
>>> pattern = re.compile('(?<=p>).+(?=</p)', re.DOTALL)
>>> pattern.search(html).group()
'This is paragraph with a bunch of lines\nfrom a news story.'
You can then assemble a list of valid tags from which you want to extract information.
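A minimal sketch of that idea; the tag list is an assumption, and the non-greedy .+? keeps each match inside a single element:
import re

TEXT_TAGS = ['p', 'h1', 'h2', 'li', 'td']  # assumed list of tags to keep

def extract_text(html, tags=TEXT_TAGS):
    chunks = []
    for tag in tags:
        # Each lookbehind is fixed-width once the tag is substituted in,
        # which Python's re module requires.
        pattern = re.compile('(?<=<%s>).+?(?=</%s)' % (tag, tag), re.DOTALL)
        chunks.extend(pattern.findall(html))
    return chunks

print(extract_text('<p>First\nstory.</p><li>An item</li>'))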
