Using BeautifulSoup on very large HTML file - memory error? - python

I'm learning Python by working on a project - a Facebook message analyzer. I downloaded my data, which includes a messages.htm file of all my messages. I'm trying to write a program to parse this file and output data (# of messages, most common words, etc.)
However, my messages.htm file is 270MB. When creating a BeautifulSoup object in the shell for testing, any other file (all < 1MB) works just fine. But I can't create a bs object of messages.htm. Here's the error:
>>> mf = open('messages.htm', encoding="utf8")
>>> ms = bs4.BeautifulSoup(mf)
Traceback (most recent call last):
File "<pyshell#73>", line 1, in <module>
ms = bs4.BeautifulSoup(mf)
File "C:\Program Files (x86)\Python\lib\site-packages\bs4\__init__.py", line 161, in __init__
markup = markup.read()
File "C:\Program Files (x86)\Python\lib\codecs.py", line 319, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
MemoryError
So I can't even begin working with this file. This is my first time tackling something like this and I'm only just learning Python so any suggestions would be much appreciated!

As you're using this as a learning exercise, I won't give too much code. You may be better off with ElementTree's iterparse to allow you to process as you parse. BeautifulSoup doesn't have this functionality as far as I am aware.
To get you started:
import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9

# binary mode lets the parser work out the encoding itself
with open('messages.htm', 'rb') as source:
    # get an iterable
    context = ET.iterparse(source, events=("start", "end"))
    # turn it into an iterator
    context = iter(context)
    # get the root element
    event, root = next(context)
    for event, elem in context:
        if event == "end":
            # do something with elem
            # get rid of the elements after processing
            root.clear()
If you're set on using BeautifulSoup, you could look into splitting the source HTML into manageable chunks, but you'd need to be careful to keep the thread-message structure and ensure you keep valid HTML.
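To make the iterparse idea concrete, here is a minimal sketch that counts messages and words while streaming. It rests on several assumptions: the input must be well-formed XML (iterparse raises ParseError on sloppy HTML), each message is taken to be a <p> element, and each <thread> is taken to be a direct child of the root. The sample document, those tag names and the count_messages helper are made up for illustration; the real Facebook export may be laid out differently.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical stand-in for messages.htm; the real file's layout may differ.
SAMPLE = b"""<messages>
<thread><p>hello there</p><p>hello again</p></thread>
<thread><p>bye</p></thread>
</messages>"""


def count_messages(source, message_tag="p", thread_tag="thread"):
    """Count messages and words while streaming through `source`."""
    n_messages = 0
    words = Counter()
    context = iter(ET.iterparse(source, events=("start", "end")))
    _, root = next(context)  # the first "start" event belongs to the root
    for event, elem in context:
        if event != "end":
            continue
        if elem.tag == message_tag:
            n_messages += 1
            words.update((elem.text or "").lower().split())
        elif elem.tag == thread_tag:
            # the thread is complete, so drop it from the tree
            root.clear()
    return n_messages, words


n_messages, words = count_messages(io.BytesIO(SAMPLE))
print(n_messages, words.most_common(1))
```

Clearing the root only after a whole thread has ended keeps memory use bounded by the largest single thread rather than by the whole file.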

Related

How to fix bs4 select error: 'TypeError: __init__() keywords must be strings'

I'm writing a script that uses a post request and gets an XML in return. I need to parse that XML to know if the post request was accepted or not.
I'm using bs4 to parse it and it worked fine until about a week ago when I started to get an error I didn't get before:
TypeError: __init__() keywords must be strings
I'm using bs4's select function in other parts of the same file without getting this error, and I can't find anything about it online.
At first I thought it was a version issue, but I tried both python3.7 and 3.6 and got the same error.
This is the code used to produce the error:
res = requests.post(url, data = body, headers = headers)
logging.debug('Res HTTP status is {}'.format(res.status_code))
try:
    res.raise_for_status()
    resSoup = BeautifulSoup(res.text, 'xml')
    # get the resultcode from the resultcode tag
    resCode = resSoup.select_one('ResultCode').text
Full error message:
Traceback (most recent call last):
File "EbarInt.py", line 292, in <module>
resCode = resSoup.select_one('ResultCode').text
File "C:\Program Files (x86)\Python36-32\lib\site-packages\bs4\element.py", line 1345, in select_one
value = self.select(selector, namespaces, 1, **kwargs)
File "C:\Program Files (x86)\Python36-32\lib\site-packages\bs4\element.py", line 1377, in select
return soupsieve.select(selector, self, namespaces, limit, **kwargs)
File "C:\Program Files (x86)\Python36-32\lib\site-packages\soupsieve\__init__.py", line 108, in select
return compile(select, namespaces, flags).select(tag, limit)
File "C:\Program Files (x86)\Python36-32\lib\site-packages\soupsieve\__init__.py", line 50, in compile
namespaces = ct.Namespaces(**(namespaces))
TypeError: __init__() keywords must be strings
When I check res.text type I get class 'str' as expected.
When I log res.text I get:
<?xml version="1.0" encoding="utf-8"?><soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:wsa="http://schemas.xmlsoap.org/ws/2004/08/addressing" xmlns:wsse="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd" xmlns:wsu="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd"><soap:Header><wsa:Action>Trackem.Web.Services/CreateOrUpdateTaskResponse</wsa:Action><wsa:MessageID>urn:uuid:3ecae312-d416-40a5-a6a3-9607ebf28d7a</wsa:MessageID><wsa:RelatesTo>urn:uuid:6ab7e354-6499-4e37-9d6e-61219bac11f6</wsa:RelatesTo><wsa:To>http://schemas.xmlsoap.org/ws/2004/08/addressing/role/anonymous</wsa:To><wsse:Security><wsu:Timestamp wsu:Id="Timestamp-6b84a16f-327b-42db-987f-7f1ea52ef802"><wsu:Created>2019-01-06T10:33:08Z</wsu:Created><wsu:Expires>2019-01-06T10:38:08Z</wsu:Expires></wsu:Timestamp></wsse:Security></soap:Header><soap:Body><CreateOrUpdateTaskResponse xmlns="Trackem.Web.Services"><CreateOrUpdateTaskResult><ResultCode>OK</ResultCode><ResultCodeAsInt>0</ResultCodeAsInt><TaskNumber>18000146</TaskNumber></CreateOrUpdateTaskResult></CreateOrUpdateTaskResponse></soap:Body></soap:Envelope>
Update: BeautifulSoup 4.7.1 has been released, fixing the default-namespace issue. See the release notes. You probably would want to upgrade just for the performance fixes.
Original answer:
You must have upgraded to BeautifulSoup 4.7, which replaced the simple and limited internal CSS parser with the soupsieve project, which is a far more complete CSS implementation.
It is that project that has an issue with the default namespace attached to one of the elements in your response:
<CreateOrUpdateTaskResponse xmlns="Trackem.Web.Services">
The XML parser used to build the BeautifulSoup object tree correctly communicates that as the None -> 'Trackem.Web.Services' mapping in the namespace dictionary, but the soupsieve code required that all namespaces have a prefix name (xmlns:prefix) with the default namespace marked with an empty string, not None, leading to this bug. I've reported this as issue #68 to the soupsieve project.
You don't need to use select_one at all here, since you are not using any CSS syntax beyond an element name. Use soup.find() instead:
resCode = resSoup.find('ResultCode').text
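To see what the default namespace does to element names, here is a small sketch that uses the standard library's xml.etree.ElementTree instead of bs4 (a swapped-in parser, chosen only because it needs no extra packages, so it says nothing about how bs4 or soupsieve work internally). The XML is a trimmed copy of the response from the question.

```python
import xml.etree.ElementTree as ET

# Trimmed version of the SOAP response shown in the question.
RESPONSE = (
    '<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">'
    '<soap:Body>'
    '<CreateOrUpdateTaskResponse xmlns="Trackem.Web.Services">'
    '<CreateOrUpdateTaskResult><ResultCode>OK</ResultCode></CreateOrUpdateTaskResult>'
    '</CreateOrUpdateTaskResponse>'
    '</soap:Body></soap:Envelope>'
)

root = ET.fromstring(RESPONSE)

# The un-prefixed xmlns="..." also applies to ResultCode, so its full
# name carries the namespace URI in braces.
ns_tag = "{Trackem.Web.Services}ResultCode"
result_code = root.find(".//" + ns_tag).text
print(result_code)

# Searching for the bare name finds nothing in ElementTree, because the
# element's qualified name includes the namespace.
print(root.find(".//ResultCode"))
```

This is the same None -> 'Trackem.Web.Services' mapping described above, seen from a parser that always spells the namespace out.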

Write data scraped to text file with python script

I am a newbie to data scraping. This is the first program I am writing in Python to scrape data and store it in a text file. I have written the following code to scrape the data.
from bs4 import BeautifulSoup
import urllib2
text_file = open("scrape.txt","w")
url = urllib2.urlopen("http://ga.healthinspections.us/georgia/search.cfm?1=1&f=s&r=name&s=&inspectionType=&sd=04/24/2016&ed=05/24/2016&useDate=NO&county=Appling&")
content = url.read()
soup = BeautifulSoup(content, "html.parser")
type = soup.find('span',attrs={"style":"display:inline-block; font-size:10pt;"}).findAll()
for found in type:
    text_file.write(found)
However, when I run this program from the command prompt, it shows me the following error.
c:\PyProj\Scraping>python sample1.py
Traceback (most recent call last):
File "sample1.py", line 9, in <module>
text_file.write(found)
TypeError: expected a string or other character buffer object
What am I missing here? Is there anything I haven't added? Thanks.
You need to check whether soup.find() returned None (i.e. it did not actually find what you searched for) before calling findAll() on the result.
Also, don't use the name type; it's a builtin.
find and find_all return a Tag object and a list of Tag objects, respectively. If you call print on a Tag, you see its string representation, but file.write does not do that conversion for you. You have to decide which attribute of found you want to write.
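The difference between print and write can be shown without bs4 at all. The sketch below is written for Python 3 and uses FakeTag, a hypothetical stand-in class (not bs4's real Tag) that merely has a text attribute and a string form, plus an in-memory io.StringIO in place of the text file.

```python
import io


class FakeTag:
    """Hypothetical stand-in for a bs4 Tag: some text plus a string form."""

    def __init__(self, name, text):
        self.name = name
        self.text = text

    def __str__(self):
        return "<{0}>{1}</{0}>".format(self.name, self.text)


found = FakeTag("b", "Appling")
out = io.StringIO()  # behaves like a file opened in text mode

print(found)  # works: print() calls str(found) for you

try:
    out.write(found)  # fails: write() only accepts str
except TypeError as exc:
    print("write() refused the object:", type(exc).__name__)

# Decide explicitly what to write: the markup or just the text.
out.write(str(found) + "\n")
out.write(found.text + "\n")
print(out.getvalue())
```

With a real Tag the same choice applies: convert it with str() to keep the markup, or write only its text.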

Saving Image from URL using Python Requests - URL type error

Using the following code:
with open('newim','wb') as f:
    f.write(requests.get(repr(url)))
where the url is:
url = ''
I get the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python33\lib\site-packages\requests\api.py", line 69, in get
return request('get', url, params=params, **kwargs)
File "C:\Python33\lib\site-packages\requests\api.py", line 50, in request
response = session.request(method=method, url=url, **kwargs)
File "C:\Python33\lib\site-packages\requests\sessions.py", line 465, in request
resp = self.send(prep, **send_kwargs)
File "C:\Python33\lib\site-packages\requests\sessions.py", line 567, in send
adapter = self.get_adapter(url=request.url)
File "C:\Python33\lib\site-packages\requests\sessions.py", line 641, in get_adapter
raise InvalidSchema("No connection adapters were found for '%s'" % url)
I have seen other posts with what, at first glance, appears to be a similar problem but I haven't had any luck just adding 'https://' or anything like that...I seriously want to avoid having to do this in webdriver+Autoit or something because I have to do a similar exercise for thousands of images.
There seems to be a problem with your understanding of the concept of embedded images. The url you have posted is, actually, what your browser returns when you select 'View Image' or 'Copy Image Location' (or something similar, depending on the browser) from the context menu, and formally is called a data URI.
It is not an http url pointing to an image, and you can not use it to retrieve actual images from any server: this is exactly what requests points out in the error message.
So, how do we get these images?
The following script will handle this task:
import requests
from lxml import html
import binascii as ba

i = 0
url = "<Page URL goes here>"  # Ex: http://server/dir/images.html
page = requests.get(url)
struct = html.fromstring(page.text)
# collect the src attribute of every <img> element
images = struct.xpath('//img/@src')
for img in images:
    i += 1
    # image type, e.g. "png" from "data:image/png;base64,..."
    ext = img.partition('data:image/')[2].split(';')[0]
    with open('newim' + str(i) + '.' + ext, 'wb') as f:
        # decode the base64 payload that follows "base64,"
        f.write(ba.a2b_base64(img.partition('base64,')[2]))
print("Done")
To run it you will need to install the lxml library along with requests.
Here follows a short description of how the script functions:
First it requests the url from the server and, after it gets the server's response, it stores it in a Response object (page).
Then it utilizes html.fromstring() from lxml to transform the "textified" content of page into a tree structure which can be processed with XPath expressions, like this one: images = struct.xpath('//img/@src').
The result is a list containing the contents of the src attribute of every image in the page. In this case (embedded images) these are the data URIs.
Then, for every image in the list, it first gets the image type (which will be used as the newim's extension), using partition() and split() and stores it in ext. Then it converts the base64 encoded data to binary (using a2b_base64() from binascii module) and writes the output to the file.
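The decoding step described above can be tried on its own with only the standard library. In this sketch the payload is a made-up byte string rather than a real image, and split_data_uri is a hypothetical helper name; the string handling mirrors the partition() and split() calls from the script.

```python
import base64
import binascii

# Made-up payload: plain bytes, base64-encoded and wrapped the way an
# embedded PNG would be. It is not a real image.
payload = base64.b64encode(b"not a real image").decode("ascii")
data_uri = "data:image/png;base64," + payload


def split_data_uri(uri):
    """Return (extension, raw_bytes) for a base64 data:image/... URI."""
    # text between "data:image/" and the first ";" is the image type
    ext = uri.partition("data:image/")[2].split(";")[0]
    # everything after "base64," is the encoded content
    raw = binascii.a2b_base64(uri.partition("base64,")[2])
    return ext, raw


ext, raw = split_data_uri(data_uri)
print(ext, raw)
```

Writing raw to a file opened with 'wb' gives the same result as the f.write() call in the script.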
As a small demo, save this html code (as, eg, images.html) somewhere in your server
<h1>Images</h1>
<img src="" />
<br />
<img src=""></img>
<br />
<img src=""/>
and point to it in the script: requests.get("http://yourserver/somedir/images.html").
When you run the script you will get 3 images, named newim1.png, newim2.png and newim3.jpg respectively.
As a reminder, do note that this script (in its current form) will only handle embedded images. If you want to process also ordinary linked images, then you have to modify it accordingly (but this is not difficult).
This is an image encoded in base64. Quoting the URL below: "base64 equals to text (string) representation of the image itself".
Read this for a detailed explanation:
http://www.stoimen.com/blog/2009/04/23/when-you-should-use-base64-for-images/
In order to use them you'll have to implement a base64 decoder. Luckily SO already provides you with the answer on how to do it:
Python base64 data decode

UnicodeEncodeError: 'ascii' codec can't encode characters in position 7-9: ordinal not in range(128)

I've been working on a program to go through various links that I already have saved in a text file, which are mostly summer opportunities/camps/etc. and scrape through them to see if key words like "scholarship" or "financial aid" pop up. However, when I run through it, it gives me the error that's in the title above.
This question has been asked a few times, but it appears to be for different reasons for different people. Therefore, I get that there's probably an error involving Unicode, but I have no idea where or why that would be.
This is the code:
import BeautifulSoup
import requests
import nltk
file_from = open("links.txt", "r")
list_of_urls = file_from.read().splitlines()
aid_words = ["financial", "aid", "merit", "scholarship"]
count = 0
fin_aid = []
while count <= 10:
    for url in list_of_urls:
        clean = 1
        result = "nothing found"
        source = requests.get(url)
        plain_text = source.text
        soup = BeautifulSoup.BeautifulSoup(plain_text)
        print (str(url).upper())
        for links in soup.findAll('p', text = True):
            tokenized_text = nltk.word_tokenize(links)
            for word in tokenized_text:
                if word not in aid_words:
                    print ("not it " + str(clean))
                    clean += 1
                    pass
                else:
                    result = str(word)
                    print (result)
                    fin_aid.append(url)
                    break
        count += 1
        the_golden_book = {"link: ": str(url), "word found: ": str(result)}
        fin_aid.append(the_golden_book)
file_to = open("links_with_aid.txt", "w")
file_to.write(str(fin_aid))
file_to.close()
print ("scrape finished")
print (str(fin_aid))
Basically, I wanted to take all the links from links.txt, visit the first ten (as a test), search for the four words in the list "aid_words", and return results in the form of "not it" and the number of words searched so far, if none of the words have been found yet, or the word that was detected if one is found (so that I can visit the link later and search for it, to see if it's a false alarm or not).
When I run this through the Command Prompt, this is the stuff it shows me right before the error message.
Traceback (most recent call last):
File "finaid.py", line 20, in <module>
soup = BeautifulSoup.BeautifulSoup(plain_text.encode("utf-8"))
File "C:\Python27\lib\site-packages\BeautifulSoup.py", line 1522, in __init__
BeautifulStoneSoup.__init__(self, *args, **kwargs)
File "C:\Python27\lib\site-packages\BeautifulSoup.py", line 1147, in __init__
self._feed(isHTML=isHTML)
File "C:\Python27\lib\site-packages\BeautifulSoup.py", line 1189, in _feed
SGMLParser.feed(self, markup)
File "C:\Python27\lib\sgmllib.py", line 104, in feed
self.goahead(0)
File "C:\Python27\lib\sgmllib.py", line 143, in goahead
k = self.parse_endtag(i)
File "C:\Python27\lib\sgmllib.py", line 320, in parse_endtag
self.finish_endtag(tag)
File "C:\Python27\lib\sgmllib.py", line 358, in finish_endtag
method = getattr(self, 'end_' + tag)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 7-9: ordinal not in range(128)
I'm running this on Python 2.7.10, and I'm on Windows 8.1. Thanks for any help you can provide! As far as I can tell, it shouldn't be anything in "link.txt", which is literally just links that a colleague crawled and saved earlier.
I do quite a bit of website scraping and I can tell you this: please try to write your scraper code using Python 3. As soon as I updated my scrapers to use Python 3 a lot of my encoding issues went away. Be sure on your file write to use 'a' instead of 'w' if you do go to Python 3 and you want to keep the contents of that file intact.
Let me know if you have specific questions about making that transition.
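As a rough illustration of both points, here is a Python 3 sketch. It first reproduces the kind of failure named in the title by encoding non-ASCII text as ASCII, then writes the same text with an explicit encoding, using 'w' once and 'a' once. The sample text and the temporary file location are made up for the example.

```python
import os
import tempfile

text = "financial aid \u2013 caf\u00e9"  # contains non-ASCII characters

# Encoding to ASCII fails with the error class from the title.
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    print("ascii failed:", exc.reason)

# Naming the encoding explicitly avoids any implicit ASCII step.
path = os.path.join(tempfile.mkdtemp(), "links_with_aid.txt")
with open(path, "w", encoding="utf-8") as f:  # "w" starts a fresh file
    f.write(text + "\n")
with open(path, "a", encoding="utf-8") as f:  # "a" keeps what is there
    f.write("second line\n")

with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)
```

In Python 3 all strings are Unicode, so the encoding decision moves to the places where text is read or written, which is where it can be stated explicitly.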
On the "expected string or buffer" error: that usually shows up for me when I pass in an object instead of a string. To check whether that is happening, print the value, like so:
for links in soup.findAll('p', text = True):
    print(links)
    tokenized_text = nltk.word_tokenize(links)
If it doesn't print text to your terminal (or wherever you are running the script from), then you are passing in an object where a string is expected.
Pseudo-code to fix it might look like:
for links in soup.findAll('p', text = True):
    # assuming bs4: .text is a property there, so get_text() is the callable form
    links = links.get_text()
    tokenized_text = nltk.word_tokenize(links)

Parsing XML/HTML encoded GChats

I'm attempting to learn XML in order to parse GChats downloaded from GMail via IMAP. To do so I am using lxml. Each line of the chat messages is formatted like so:
<cli:message to="email@gmail.com" iconset="square" from="email@gmail.com" int:cid="insertid" int:sequence-no="1" int:time-stamp="1236608405935" xmlns:int="google:internal" xmlns:cli="jabber:client">
  <cli:body>Nikko</cli:body>
  <met:google-mail-signature xmlns:met="google:metadata">0c7ef6e618e9876b</met:google-mail-signature>
  <x stamp="20090309T14:20:05" xmlns="jabber:x:delay"/>
  <time ms="1236608405975" xmlns="google:timestamp"/>
</cli:message>
When I try to build the XML tree like so:
root = etree.Element("cli:message")
I get this error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "lxml.etree.pyx", line 2568, in lxml.etree.Element (src/lxml/lxml.etree.c:52878)
File "apihelpers.pxi", line 126, in lxml.etree._makeElement (src/lxml/lxml.etree.c:11497)
File "apihelpers.pxi", line 1542, in lxml.etree._tagValidOrRaise (src/lxml/lxml.etree.c:23956)
ValueError: Invalid tag name u'cli:message'
When I try to escape it like so:
root = etree.Element("cli\:message")
I get the exact same error.
The header of the chats also gives this information, which seems relevant:
Content-Type: text/xml; charset=utf-8
Content-Transfer-Encoding: 7bit
Does anyone know what's going on here?
So this didn't get any response, but in case anyone was wondering, BeautifulSoup worked fantastically for this. All I had to do was this:
soup = BeautifulSoup(repr(msg_data))
print(soup.get_text())
And I got (fairly) clear text.
The reason you got an invalid tag name is that lxml does not refer to elements by their prefix ("cli"). The prefix is replaced by the namespace URI it was declared with, so the tag looks like this instead:
{url_where_cli_is_defined}message
If you refer to Automatic XSD validation you will see what I did to simplify managing large numbers of schemas.
Similarly, to avoid this very problem, you can use str.replace() to change "cli:" to "{url}". Keeping all the namespaces in one dictionary made this process quick.
I imagine BeautifulSoup does this for you automatically.
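The {uri}name form can be seen directly with the standard library's xml.etree.ElementTree, used here in place of lxml only so the sketch needs no extra packages; lxml's etree accepts the same {uri}name notation. The XML is a trimmed version of the chat line from the question.

```python
import xml.etree.ElementTree as ET

# Trimmed version of the chat line from the question.
MESSAGE = (
    '<cli:message xmlns:cli="jabber:client">'
    '<cli:body>Nikko</cli:body>'
    '</cli:message>'
)

root = ET.fromstring(MESSAGE)
print(root.tag)  # the prefix is replaced by the URI in braces

# Look up a child either by its full name or through a prefix map.
body = root.find("{jabber:client}body")
same_body = root.find("cli:body", {"cli": "jabber:client"})
print(body.text, body is same_body)

# Building an element uses the same {uri}name form, not "cli:message".
new_root = ET.Element("{jabber:client}message")
print(new_root.tag)
```

Passing a prefix map to find() is often tidier than rewriting the strings by hand, since the prefixes stay readable in the code.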
