I am trying to read an entire web page and assign it to a variable, but am having trouble doing that. The variable seems to only be able to hold the first 512 or so lines of the page source.
I tried using readlines() to just print all lines of the source to the screen, and that gave me the source in its entirety, but I need to be able to parse it with regex, so I need to store it in a variable somehow. Help?
data = urllib2.urlopen(url)
print data
Only gives me about 1/3 of the source.
data = urllib2.urlopen(url)
for lines in data.readlines():
    print lines
This gives me the entire source.
Like I said, I need to be able to parse the string with regex, but the part I need isn't in the first 1/3 I'm able to store in my variable.
You are probably looking for Beautiful Soup: http://www.crummy.com/software/BeautifulSoup/ It's an open-source HTML-parsing library for Python. Best of luck!
You should be able to use file.read() to read the entire file into a string. That will give you the entire source. Something like
data = urllib2.urlopen(url)
print data.read()
should give you the entire webpage.
From there, don't parse HTML with regex (well-worn post to this effect here), but use a dedicated HTML parser instead. Alternatively, clean up the HTML and convert it to XHTML (for instance with HTML Tidy), and then use an XML parsing library like the standard ElementTree. Which approach is best depends on your application.
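For instance, a minimal sketch of the ElementTree route, assuming the page has already been run through HTML Tidy so it is plain well-formed XHTML with entities resolved (url is the variable from the question):

import urllib2
import xml.etree.ElementTree as ET

xhtml = urllib2.urlopen(url).read()
root = ET.fromstring(xhtml)

# XHTML elements live in a namespace, so tag lookups must qualify the name.
ns = {'x': 'http://www.w3.org/1999/xhtml'}
for link in root.findall('.//x:a', ns):
    print link.get('href')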
Actually, print data should not give you any HTML content, because data is just a file-like object. From the official documentation (https://docs.python.org/2/library/urllib2.html):
This function returns a file-like object
This is what I got :
print data
<addinfourl at 140131449328200 whose fp = <socket._fileobject object at 0x7f72e547fc50>>
readlines() returns a list of the lines of the HTML source, and you can join them into a single string:

import urllib2

data = urllib2.urlopen(url)
l = []
for line in data.readlines():
    l.append(line)

# Each line already ends in '\n', so join on the empty string
# to avoid doubling the line breaks.
s = ''.join(l)

You can use either the list l or the string s, according to your needs.
I would also recommend using an open-source HTML-parsing library rather than regex for parsing complete HTML; you may still want regex for parsing the URLs, though.
If you want to parse the result afterwards, you might use gazpacho:
from gazpacho import Soup
url = "https://www.example.com"
soup = Soup.get(url)
str(soup)
That way you can perform finds to extract the information you're after!
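For example, a quick sketch of a follow-up find (the tag and class here are placeholders, not anything from your page):

from gazpacho import Soup

url = "https://www.example.com"
soup = Soup.get(url)

# find() returns a Soup, a list of Soups, or None depending on matches.
links = soup.find("a", {"class": "result"})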
I wish to read symbols from some XML content stored in a text file. When I use xml as the parser, I get the first symbol only; however, I get both symbols when I use lxml. Here is the XML content.
<?xml version="1.0" encoding="utf-8"?>
<lookupdata symbolstring="WDS">
<key>
<symbol>WDS</symbol>
<exchange>NYE</exchange>
<openfigi>BBG001S5WCY6</openfigi>
<qmidentifier>USI79Z473117AAG</qmidentifier>
</key>
<equityinfo>
<longname>
Woodside Energy Group Limited American Depositary Shares each representing one
</longname>
<shortname>Woodside Energy </shortname>
<instrumenttype>equity</instrumenttype>
<sectype>DR</sectype>
<isocfi>EDSXFR</isocfi>
<issuetype>AD</issuetype>
<proprietaryquoteeligible>false</proprietaryquoteeligible>
</equityinfo>
</lookupdata>
<lookupdata symbolstring="PAM">
<key>
<symbol>PAM</symbol>
<exchange>NYE</exchange>
<openfigi>BBG001T5K0S1</openfigi>
<qmidentifier>USI68Z3Z75887AS</qmidentifier>
</key>
<equityinfo>
<longname>Pampa Energia S.A.</longname>
<shortname>PAM</shortname>
<instrumenttype>equity</instrumenttype>
<sectype>DR</sectype>
<isocfi>EDSXFR</isocfi>
<issuetype>AD</issuetype>
</equityinfo>
</lookupdata>
When I read the xml content from a text file and parse the symbols, I get only the first symbol.
from bs4 import BeautifulSoup

with open("input_xml.txt") as infile:
    item = infile.read()

soup = BeautifulSoup(item, "xml")
for item in soup.select("lookupdata symbol"):
    print(item.text)
Current output:
WDS
If I replace xml with lxml in BeautifulSoup(item, "xml"), I get both symbols, though a warning pops up when I use lxml.
As the content is XML, I would like to stick with the xml parser instead of lxml.
Expected output:
WDS
PAM
The issue seems to be that the built-in xml parser only loads the first item: it simply stops once the first lookupdata ends. Given that all the examples in the xml docs have a single top-level container element, I'm assuming it stops parsing after the first top-level element is closed (though I'm not sure; that's just an assumption). You can add a print(soup) after loading to see what it is actually working with.
You could use BeautifulSoup(item, "html.parser"), which uses the built-in html library and handles this correctly.
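That variant is a one-line change to your script; a sketch:

from bs4 import BeautifulSoup

with open("input_xml.txt") as infile:
    item = infile.read()

# html.parser tolerates multiple top-level elements, unlike the xml parser.
soup = BeautifulSoup(item, "html.parser")
for found in soup.select("lookupdata symbol"):
    print(found.text)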
Or, to keep using the xml library, surround it with some top-level dummy element, like:
from bs4 import BeautifulSoup

with open("input_xml.txt") as infile:
    item = infile.read()

patched = f"<root>{item}</root>"
soup = BeautifulSoup(patched, "xml")
for found in soup.select("lookupdata symbol"):
    print(found.text)
Output:
WDS
PAM
I downloaded web pages using wget, and now I am trying to extract some data I need from those pages. The problem is with the Japanese words contained in this data; the English words were extracted perfectly.
When I try to extract the Japanese words and use them in another app, they appear as gibberish. While testing different methods, there was one solution that fixed only half of the Japanese words.
What I tried: I tried
from_encoding="utf-8"
which had no effect. I also tried multiple ways to extract the text from the HTML, like
section.get_text(strip=True)
section.text.strip()
and others. I also tried to encode the generated text using URL encoding, which did not work, and I tried every snippet I could find on Stack Overflow.
One of the methods that strangely worked (but not completely) was saving the string in a dictionary, then saving that to a JSON file, then reading the JSON from ANOTHER script. Just using the dictionary as it is would not work; I have to use the JSON file as a middleman between the two scripts. Strange. (And not all the words worked.)
My question may look like a duplicate of another question, but that other question is about scraping from the internet, and what I am trying to do is extract from an offline source.
Here is a simple script demonstrating the main problem:
from bs4 import BeautifulSoup

page = BeautifulSoup(open("page1.html"), 'html.parser', from_encoding="utf-8")
word = page.find('span', {'class': "radical-icon"})
wordtxt = word.get_text(strip=True)

# then save the word to a file
with open("text.txt", "w", encoding="utf8") as text_file:
    text_file.write(wordtxt)
When I open the file, I get gibberish characters.
Here is the part of the HTML that BeautifulSoup searches:
<span class="radical-icon" lang="ja">亠</span>
The expected result is to get the symbols into the text file, or to save them properly in any other way.
Is there a better web scraper to use that handles UTF-8 properly?
PS: Sorry for the bad English.
I think I found an answer: just uninstall beautifulsoup4. I don't need it.
Python has a built-in way to search strings; I tried something like this:
import codecs
import re

with codecs.open("page1.html", 'r', 'utf-8') as myfile:
    for line in myfile:
        if line.find('<span class="radical-icon"') > -1:
            result = re.search('<span class="radical-icon" lang="ja">(.*)</span>', line)
            s = result.group(1)

with codecs.open("text.txt", 'w', 'utf-8') as textfile:
    textfile.write(s)
which is an overcomplicated and non-Pythonic way of doing it, but what works works.
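In Python 3 the built-in open() already takes an encoding argument, so the same idea works without codecs. A sketch, assuming the same file names:

import re

pattern = re.compile(r'<span class="radical-icon" lang="ja">(.*)</span>')

s = None
with open("page1.html", encoding="utf-8") as myfile:
    for line in myfile:
        match = pattern.search(line)
        if match:
            s = match.group(1)  # keeps the last match, as the original does

if s is not None:
    with open("text.txt", "w", encoding="utf-8") as textfile:
        textfile.write(s)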
I’m trying to read HTML content and extract only the data (such as the lines in a Wikipedia article). Here’s my code in Python:
import urllib.request
from html.parser import HTMLParser

urlText = []

# Define HTML parser
class parseText(HTMLParser):
    def handle_data(self, data):
        print(data)
        if data != '\n':
            urlText.append(data)

def main():
    thisurl = "https://en.wikipedia.org/wiki/Python_(programming_language)"
    # Create an instance of the HTML parser (the above class)
    lParser = parseText()
    # Feed the HTML into the parser; handle_data is called implicitly.
    with urllib.request.urlopen(thisurl) as url:
        htmlAsBytes = url.read()
        #print(htmlAsBytes)
        htmlAsString = htmlAsBytes.decode(encoding="utf-8")
        #print(htmlAsString)
        lParser.feed(htmlAsString)
    lParser.close()
    #for item in urlText:
    #    print(item)
I do get the HTML content from the webpage and if I print the bytes object returned by the read() method, it looks like I receive all the HTML content of the webpage. However, when I try to parse this content to get rid of the tags and store only the readable data, I’m not getting the result I expect at all.
The problem is that in order to use the feed() method of the parser, one has to convert the bytes object to a string. To do that you use the decode() method, which receives the encoding with which to do the conversion. If I print the decoded string, the content printed doesn’t contain the data itself (the useful readable data I’m trying to extract). Why does that happen and how can I solve this?
Note: I'm using Python 3.
Thanks for the help.
All right, I eventually used BeautifulSoup to do the job, as Alden recommended, but I still don't know why the decoding process mysteriously gets rid of the data.
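For reference, a minimal sketch of that BeautifulSoup route (the exact calls are a guess at the shape of the solution, not necessarily the code used):

import urllib.request
from bs4 import BeautifulSoup

thisurl = "https://en.wikipedia.org/wiki/Python_(programming_language)"
with urllib.request.urlopen(thisurl) as url:
    html = url.read()

# BeautifulSoup handles the byte decoding itself and strips the markup.
soup = BeautifulSoup(html, "html.parser")
urlText = soup.get_text().splitlines()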
I have a huge HTML file that I have converted to a text file (the file is the source of Facebook's home page). Assume the text file has a specific keyword in some places, for example: "some_keyword: [bla bla]". How would I print all the different bla blas that follow some_keyword?
{id:"1126830890",name:"Hillary Clinton",firstName:"Hillary"}
Imagine there are 50 different names in this format in the page. How would I print all the names that follow "name:", considering the text is very large and the program crashes when you read() it or try to search through its lines?
Sample File:
shortProfiles:{"100000094503825":{id:"100000094503825",name:"Bla blah",firstName:"Blah",vanity:"blah",thumbSrc:"https://scontent-lax3-1.xx.fbcdn.net/v/t1.0-1/c19.0.64.64/p64x64/10354686_10150004552801856_220367501106153455_n.jpg?oh=3b26bb13129d4f9a482d9c4115b9eeb2&oe=5883062B",uri:"https://www.facebook.com/blah",gender:2,i18nGender:16777216,type:"friend",is_friend:true,mThumbSrcSmall:null,mThumbSrcLarge:null,dir:null,searchTokens:["Bla"],alternateName:"",is_nonfriend_messenger_contact:false},"1347968857":
Based on your comment: since you are the person responsible for writing the data to the file, write the data in JSON format and read it back with the json module:
import json

with open('/path/to/your_file') as json_file:
    json_data = json.load(json_file)

for item in json_data:
    print item['name']
Explanation:
Let's say data is the variable storing
{id:"1126830890",name:"Hillary Clinton",firstName:"Hillary"}
which changes dynamically within your code wherever you perform the write operation on the file. Instead, append each one to a list:
a = []
for item in page_content:
    # data = some xy logic on the HTML file
    a.append(data)
Now write this list to the file using json.dump().
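A sketch of the round trip (the file path and fields are placeholders):

import json

a = [{"id": "1126830890", "name": "Hillary Clinton", "firstName": "Hillary"}]

# Write the list out as valid JSON...
with open('/path/to/your_file', 'w') as fp:
    json.dump(a, fp)

# ...then read it back and iterate over the dicts.
with open('/path/to/your_file') as fp:
    for item in json.load(fp):
        print item['name']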
I just wanted to throw this out there, even though I agree with all the comments about dealing with the HTML directly or using Facebook's API (probably the safest way): open file objects in Python act as generators, yielding one line at a time without reading the entire file into memory, and the re module can be used to extract information from text.
This can be done like so:
import re
regex = re.compile(r"(?:some_keyword:\s\[)(.*?)\]")
with open("filename.txt", "r") as fp:
for line in fp:
for match in regex.findall(line):
print(match)
Of course this only works if the file is in a "line-based" format, but the end effect is that only the line you are on is loaded into memory at any one time.
Here are the Python 2 docs for the re module: https://docs.python.org/2/library/re.html
Here are the Python 3 docs for the re module: https://docs.python.org/3/library/re.html
I cannot find documentation detailing the generator capabilities of file objects in Python; it seems to be one of those well-known secrets... Please feel free to edit and remove this paragraph if you know where in the Python docs this is covered.
My goal is to take an XML file, pull out all instances of a specific element, remove the XML tags, then work on the remaining text.
I started with this, which works to remove the XML tags, but only from the entire XML file:
from urllib import urlopen
import re
url = [URL of XML FILE HERE] #the url of the file to search
raw = urlopen(url).read() #open the file and read it into variable
exp = re.compile(r'<.*?>')
text_only = exp.sub('',raw).strip()
I've also got this, text2 = soup.find_all('quoted-block'), which creates a list of all the quoted-block elements (yes, I know I need to import BeautifulSoup).
But I can't figure out how to apply the regex to the list resulting from soup.find_all. I've tried text_only = [item for item in text2 if exp.sub('', item).strip()] and variations, but I keep getting this error: TypeError: expected string or buffer
What am I doing wrong?
You don't want to regex this. The TypeError comes from passing BeautifulSoup Tag objects, rather than strings, to exp.sub. Instead, just use BeautifulSoup's existing support for grabbing text:
quoted_blocks = soup.find_all('quoted-block')
text_chunks = [block.get_text() for block in quoted_blocks]
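Putting it together with the fetch from your first snippet (a sketch; the 'xml' argument assumes lxml is installed, and 'html.parser' works too):

from urllib import urlopen
from bs4 import BeautifulSoup

raw = urlopen(url).read()  # url is the XML file's URL, as in the question
soup = BeautifulSoup(raw, 'xml')

# get_text() strips the tags, so no regex is needed.
quoted_blocks = soup.find_all('quoted-block')
text_chunks = [block.get_text() for block in quoted_blocks]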