I am parsing a Wikipedia metadata file with bs4 and python 3.5
This works for extraction, from a test slice of the (much larger) file:
from bs4 import BeautifulSoup

with open("Wikipedia/test.xml", 'r') as xml_file:
    xml = xml_file.read()

print(BeautifulSoup(xml, 'lxml').select("timestamp"))
The issue is that the metadata files are all 12+ gigs, so rather than slurping in the entire file as a string before ensoupification, I'd like to have BeautifulSoup read the data as an iterator (possibly even from gzcat to avoid having the data sitting around in uncompressed files).
However, my attempts to hand BS anything other than a string causes it to choke. Is there a way to get BS to read data as a stream instead of a string?
You can give BS a file handle object.
with open("Wikipedia/test.xml", 'r') as xml_file:
    soup = BeautifulSoup(xml_file, 'lxml')
This is the first example in the "Making the Soup" section of the documentation.
Neither BeautifulSoup nor lxml has a streaming option, but you can use iterparse() to read a large XML file in chunks:
import xml.etree.ElementTree as etree

for event, elem in etree.iterparse("Wikipedia/test.xml", events=('start', 'end')):
    if event == 'end':
        if elem.tag.endswith('timestamp'):  # tags in the dump are namespaced
            print(elem.text)
        elem.clear()  # free the memory held by elements already processed
Read more in the ElementTree iterparse documentation or the lxml notes on iterative parsing.
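Since iterparse() accepts any file-like object, you can also feed it the compressed dump directly and never keep an uncompressed copy on disk. A minimal sketch, assuming a hypothetical enwiki-meta.xml.gz path and that you only want the timestamps:

import gzip
import xml.etree.ElementTree as etree

# gzip.open returns a file-like object, so the dump is decompressed on the fly
with gzip.open("Wikipedia/enwiki-meta.xml.gz", 'rb') as xml_file:
    for event, elem in etree.iterparse(xml_file, events=('end',)):
        if elem.tag.endswith('timestamp'):  # hypothetical target tag from the question
            print(elem.text)
        elem.clear()  # keep memory flat while streaming the 12+ GB file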
I wish to read symbols from some XML content stored in a text file. When I use the xml parser, I get the first symbol only; however, I get both symbols when I use the lxml parser. Here is the XML content:
<?xml version="1.0" encoding="utf-8"?>
<lookupdata symbolstring="WDS">
<key>
<symbol>WDS</symbol>
<exchange>NYE</exchange>
<openfigi>BBG001S5WCY6</openfigi>
<qmidentifier>USI79Z473117AAG</qmidentifier>
</key>
<equityinfo>
<longname>
Woodside Energy Group Limited American Depositary Shares each representing one
</longname>
<shortname>Woodside Energy </shortname>
<instrumenttype>equity</instrumenttype>
<sectype>DR</sectype>
<isocfi>EDSXFR</isocfi>
<issuetype>AD</issuetype>
<proprietaryquoteeligible>false</proprietaryquoteeligible>
</equityinfo>
</lookupdata>
<lookupdata symbolstring="PAM">
<key>
<symbol>PAM</symbol>
<exchange>NYE</exchange>
<openfigi>BBG001T5K0S1</openfigi>
<qmidentifier>USI68Z3Z75887AS</qmidentifier>
</key>
<equityinfo>
<longname>Pampa Energia S.A.</longname>
<shortname>PAM</shortname>
<instrumenttype>equity</instrumenttype>
<sectype>DR</sectype>
<isocfi>EDSXFR</isocfi>
<issuetype>AD</issuetype>
</equityinfo>
</lookupdata>
When I read the xml content from a text file and parse the symbols, I get only the first symbol.
from bs4 import BeautifulSoup

with open("input_xml.txt") as infile:
    item = infile.read()

soup = BeautifulSoup(item, "xml")
for item in soup.select("lookupdata symbol"):
    print(item.text)
current output:
WDS
If I replace xml with lxml in BeautifulSoup(item,"xml"), I get both symbols. When I use lxml, a warning pops up, though.
As the content is XML, I would like to stick to the xml parser rather than lxml.
Expected output:
WDS
PAM
The issue seems to be that the built-in xml parser only loads the first item; it simply stops after the first lookupdata element ends. Given that all the examples in the xml docs have a single top-level container element, I'm assuming it stops parsing once the first top-level element closes (though I'm not sure, it's just an assumption). You can add a print(soup) after you load it in to see what it is actually working with.
You could use BeautifulSoup(item, "html.parser"), which uses the built-in html parser; that works here.
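For instance, a minimal sketch of that route, using the same file name as above:

from bs4 import BeautifulSoup

with open("input_xml.txt") as infile:
    item = infile.read()

# html.parser tolerates multiple top-level elements, so both records survive
soup = BeautifulSoup(item, "html.parser")
for found in soup.select("lookupdata symbol"):
    print(found.text)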
Or, to keep using the xml library, surround it with some top-level dummy element, like:
from bs4 import BeautifulSoup

with open("input_xml.txt") as infile:
    item = infile.read()

patched = f"<root>{item}</root>"
soup = BeautifulSoup(patched, "xml")
for found in soup.select("lookupdata symbol"):
    print(found.text)
Output:
WDS
PAM
I have an HTML string which is guaranteed to contain only text (i.e. no images, videos, or other assets). However, just to note, some of the text may carry formatting; for example, some of it might be bold.
Is there a way to convert the HTML string output to a .txt file? I don't care about maintaining the formatting but I do want to maintain the spacing of the text.
Is that possible with Python?
#!/usr/bin/env python
import urllib2
import html2text
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(urllib2.urlopen('http://example.com/page.html').read())
txt = soup.find('div', {'class': 'body'})
# html2text expects markup as a string, not a Tag object
print(html2text.html2text(unicode(txt)))
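If you only need the visible text and want to write it straight to a .txt file, here is a minimal Python 3 sketch using bs4's get_text(); the HTML string and output path are placeholders:

from bs4 import BeautifulSoup

html_string = "<div><p>Some <b>bold</b> text and more text.</p></div>"  # placeholder input
soup = BeautifulSoup(html_string, "html.parser")

# get_text() drops the tags but keeps the text; the separator preserves spacing
# between adjacent elements
with open("output.txt", "w", encoding="utf-8") as out:
    out.write(soup.get_text(separator=" "))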
I'm parsing text from an XML file. Parsing works well, and I can print the results in full, but when I try to write the text into a text document, all I get in the document is the last item.
from bs4 import BeautifulSoup
import urllib.request
import sys

req = urllib.request.urlopen('file:///C:/Users/John/Desktop/Dow%20Jones/compaq%20neg%201.xml')
xml = BeautifulSoup(req, 'xml')

for item in xml.findAll('paragraph'):
    sys.stdout = open('CN1.txt', 'w')
    print(item.text)
    sys.stdout.close()
What am I missing here?
It looks like you are opening the file every time you go through the loop, which I am surprised it let you do. Each time it opens the file, it opens it in write mode and therefore wipes out everything that was written on the previous pass through the loop.
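A minimal sketch of one way around it, keeping the same file paths as in the question: open the output file once, before the loop, and write each paragraph to it.

from bs4 import BeautifulSoup
import urllib.request

req = urllib.request.urlopen('file:///C:/Users/John/Desktop/Dow%20Jones/compaq%20neg%201.xml')
xml = BeautifulSoup(req, 'xml')

# open once; every paragraph is written to the same handle
with open('CN1.txt', 'w', encoding='utf-8') as outfile:
    for item in xml.findAll('paragraph'):
        outfile.write(item.text + '\n')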
I am writing a program to scrape a Wikipedia table with Python. Everything works fine except for some of the characters, which don't seem to be encoded properly by Python.
Here is the code:
import csv
import requests
from BeautifulSoup import BeautifulSoup
import sys

reload(sys)
sys.setdefaultencoding("utf-8")

url = 'https://en.wikipedia.org/wiki/List_of_airports_by_IATA_code:_A'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html)
table = soup.find('table', attrs={'class': 'wikitable sortable'})

list_of_rows = []
for row in table.findAll('tr'):
    list_of_cells = []
    for cell in row.findAll('td'):
        text = cell.text.replace(' ', '')
        list_of_cells.append(text)
    list_of_rows.append(list_of_cells)

outfile = open("./scrapedata.csv", "wb")
writer = csv.writer(outfile)
print list_of_rows
writer.writerows(list_of_rows)
For example, Merzbrück is being written out as MerzbrÃ¼ck.
The issue seems to be mostly with accented characters (é, è, ç, à, etc.). Is there a way I can avoid this?
Thanks in advance for your help.
This is of course an encoding issue. The question is where it is. My suggestion is that you work through each step and look at the raw data to see if you can find out where exactly the encoding issue is.
So, for example, print response.content to see if the symbols are as you expect in the requests object. If so, move on, and check out soup.prettify() to see if the BeautifulSoup object looks ok, then list_of_rows, etc.
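A quick sketch of that kind of step-by-step check (Python 2, matching the question's code; the slice lengths are arbitrary):

print repr(response.content[:300])   # are the raw bytes from requests intact?
print repr(soup.prettify()[:300])    # does the parsed soup still look right?
print list_of_rows[:2]               # and the extracted Python strings?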
All that said, my suspicion is that the issue has to do with writing to csv. The csv documentation has an example of how to write unicode to csv. This answer also might help you with the problem.
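One common Python 2 workaround, sketched here under the assumption that every cell is a unicode string: encode each cell to UTF-8 bytes yourself before handing the rows to csv.writer.

# encode every cell to UTF-8 before writing (the Python 2 csv module works on byte strings)
encoded_rows = [[cell.encode('utf-8') for cell in row] for row in list_of_rows]
with open("./scrapedata.csv", "wb") as outfile:
    writer = csv.writer(outfile)
    writer.writerows(encoded_rows)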
For what it's worth, I was able to write the proper symbols to csv using the pandas library (I'm using python 3 so your experience or syntax may be a little different since it looks like you are using python 2):
import pandas as pd
df = pd.DataFrame(list_of_rows)
df.to_csv('scrapedata.csv', encoding='utf-8')
I am trying to read an entire web page and assign it to a variable, but am having trouble doing that. The variable seems to only be able to hold the first 512 or so lines of the page source.
I tried using readlines() to just print all lines of the source to the screen, and that gave me the source in its entirety, but I need to be able to parse it with regex, so I need to store it in a variable somehow. Help?
data = urllib2.urlopen(url)
print data
Only gives me about 1/3 of the source.
data = urllib2.urlopen(url)
for lines in data.readlines():
    print lines
This gives me the entire source.
Like I said, I need to be able to parse the string with regex, but the part I need isn't in the first 1/3 I'm able to store in my variable.
You probably are looking for Beautiful Soup: http://www.crummy.com/software/BeautifulSoup/ It's an open-source web-parsing library for Python. Best of luck!
You should be able to use file.read() to read the entire file into a string. That will give you the entire source. Something like
data = urllib2.urlopen(url)
print data.read()
should give you the entire webpage.
From there, don't parse HTML with regex (there is a well-worn Stack Overflow answer to this effect); use a dedicated HTML parser instead. Alternatively, clean up the HTML and convert it to XHTML (for instance with HTML Tidy), and then use an XML parsing library like the standard ElementTree. Which approach is best depends on your application.
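A minimal sketch of the dedicated-parser route with BeautifulSoup (Python 2 to match the question; the URL and the tag being extracted are placeholders):

import urllib2
from bs4 import BeautifulSoup

html = urllib2.urlopen('http://example.com/page.html').read()
soup = BeautifulSoup(html, 'html.parser')
# pull every link target instead of fishing for them with a regex
for link in soup.find_all('a'):
    print(link.get('href'))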
Actually, print data should not give you any HTML content, because data is just a file-like object. From the official documentation (https://docs.python.org/2/library/urllib2.html):
This function returns a file-like object
This is what I got:
print data
<addinfourl at 140131449328200 whose fp = <socket._fileobject object at 0x7f72e547fc50>>
readlines() returns a list of the lines of the HTML source, and you can assemble them into a single string like this:

import urllib2

data = urllib2.urlopen(url)
l = []
for line in data.readlines():
    l.append(line)
s = ''.join(l)  # the lines already end in '\n', so a plain join reassembles the source

You can then use either the list l or the string s, according to your need.
I would also recommend using an open-source web-parsing library rather than regex for full HTML parsing; you may still want a regex for picking URLs out of the text, though.
If you want to parse the page stored in the variable afterwards, you might use gazpacho:
from gazpacho import Soup
url = "https://www.example.com"
soup = Soup.get(url)
str(soup)
That way you can perform finds to extract the information you're after!
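For example, a small sketch of pulling every link target out of the fetched page, assuming gazpacho's mode="all" option on find (which returns a list of matches):

from gazpacho import Soup

url = "https://www.example.com"
soup = Soup.get(url)

# find all <a> tags and print their href attributes
for link in soup.find("a", mode="all"):
    print(link.attrs.get("href"))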