Using Python to get data (text) from Wix

I'm working on a Python project for which I created a test Wix website.
I want to get the data (text) from the Wix website using urllib,
so I did:
urllib.request.urlopen(ADDRESS).readlines()
The problem is that this did not give me any of the text on the page, only information about the structure of the page in HTML.
How would I extract the text I want from the website?

I think you'll need to end up parsing the html for the information you want. Check out this python library:
https://docs.python.org/3/library/html.parser.html
You could potentially do something like this:
from html.parser import HTMLParser

rel_data = []

class MyHTMLParser(HTMLParser):
    def handle_data(self, data):
        # handle_data is called once for every run of text between tags
        rel_data.append(data)

parser = MyHTMLParser()
parser.feed('<html><head><title>Test</title></head>'
            '<body><h1>Parse me!</h1></body></html>')
print(rel_data)
Output
["Test", "Parse me!"]

Related

Parse and extract URLs inside HTML web content without using the BeautifulSoup or urllib libraries

I am new to Python, and I am sorry if my question is very basic. In my program, I need to parse an HTML web page and extract all of the links inside it. Assume my web page content is as below:
<html><head><title>Fakebook</title><style TYPE="text/css"><!--
#pagelist li { display: inline; padding-right: 10px; }
--></style></head><body><h1>testwebapp</h1><p>Home</p><hr/><h1>Welcome to testwebapp</h1><p>Random URLs!</p><ul><li>Rennie Tach</li><li>Pid Ko</li><li>Ler She</li><li>iti Sar</li><li><a </ul>
<p>Page 1 of 2
<ul id="pagelist"><li>
1
</li><li>2</li><li>next</li><li>last</li></ul></p>
</body></html>
Now, I need to parse this web content and extract all of the links inside it. In other words, I need the content below to be extracted from the web page:
/testwebapp/847945358/
/testwebapp/848854776/
/testwebapp/850558104/
/testwebapp/851635068/
/testwebapp/570508160/fri/2/
/testwebapp/570508160/fri/2/
/testwebapp/570508160/fri/2/
I have searched a lot about parsing web pages using Python, such as this, this, or this, but many of them use libraries such as urllib, urllib2, BeautifulSoup, or requests, which I cannot use in my program, because it will run on a machine where those libraries are not installed. So I need to parse the web content manually. My idea was to save the web page content in a string, convert the string (separated by spaces) into an array of strings, then check each item of the array and, if it contains /testwebapp/ or the fri keyword, save it in an array. But when I use the command below to convert the string containing my web page content into an array, I get this error:
arrayofwords_fromwebpage = (webcontent_saved_in_a_string).split(" ")
and the error is:
TypeError: a bytes-like object is required, not 'str'
Is there any quick and efficient way to parse and extract these links from an HTML web page without using any library such as urllib, urllib2, or BeautifulSoup?
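Note that the TypeError above just means the page content was read as bytes rather than str; a minimal fix, assuming the content is UTF-8, is to decode it before splitting:

webcontent_str = webcontent_saved_in_a_string.decode('utf-8')  # bytes -> str (encoding is an assumption)
arrayofwords_fromwebpage = webcontent_str.split(" ")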
If all that you need is to find every URL using only plain Python, this function will help you:
def search(html):
    HREF = 'a href="'
    res = []
    s, e = 0, 0
    while True:
        s = html.find(HREF, e)       # next occurrence of a href="
        if s == -1:                  # no more links
            break
        e = html.find('">', s)       # closing quote of the attribute
        res.append(html[s + len(HREF):e])
    return res
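A possible usage, with a hypothetical page.html holding the markup (decode first if the content arrived as bytes):

content = open('page.html').read()   # hypothetical file containing the markup as a str
wanted = [u for u in search(content) if '/testwebapp/' in u or 'fri' in u]
print(wanted)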
You can use something from the standard library, namely HTMLParser.
I subclass it for your purpose by watching for 'a' tags. When the parser encounters one it looks for the 'href' attribute and, if it's present, it prints its value.
To execute it, I instantiate the subclass, then give its feed method the HTML that you presented in your question.
You can see the results at the end of this answer.
>>> from html.parser import HTMLParser
>>> class SharoozHTMLParser(HTMLParser):
... def handle_starttag(self, tag, attrs):
... if tag == 'a':
... attrs = {k: v for (k, v) in attrs}
... if 'href' in attrs:
... print (attrs['href'])
...
>>> parser = SharoozHTMLParser()
>>> parser.feed(open('temp.htm').read())
/testwebapp/
/testwebapp/847945358/
/testwebapp/848854776/
/testwebapp/850558104/
/testwebapp/851635068/
/testwebapp/570508160/fri/2/
/testwebapp/570508160/fri/2/
/testwebapp/570508160/fri/2/
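If you would rather collect the hrefs into a list (matching the array-based plan in the question) than print them, a small variation of the same parser would do it, e.g.:

from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            attrs = dict(attrs)          # attrs arrives as a list of (name, value) pairs
            if 'href' in attrs:
                self.links.append(attrs['href'])

collector = LinkCollector()
collector.feed(open('temp.htm').read())
print(collector.links)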

Python print function side-effect

I'm using lxml to parse some HTML with Russian letters, which is why I'm having headaches with encodings.
I transform the HTML text into a tree using the following code, then try to extract some things from the page (the header and the article content) using CSS queries.
from lxml import html
from bs4 import UnicodeDammit

doc = UnicodeDammit(html_text, is_html=True)
parser = html.HTMLParser(encoding=doc.original_encoding)
tree = html.fromstring(html_text, parser=parser)
...

def extract_title(tree):
    metas = tree.cssselect("meta[property^=og]")
    for meta in metas:
        # print(meta.attrib)
        # print(sys.stdout.encoding)
        # print("123")  # Uncomment this to fix the error
        content = meta.attrib['content']
        print(content.encode('utf-8'))  # This fails with "[Decode error - output not utf-8]"
I get "Decode error" when i'm trying to print unicode symbols to stdout. But if i add some print statement before failing print then everything works fine. I never saw such strange behavior of python print function. I thought it has no side-effects.
Do you have any idea why this is happening?
I use Windows and Sublime to run this code.

Using regular expressions to parse HTML

I am new to Python. A coder helped me out by giving me some code to parse HTML, and I'm having trouble understanding how it works. My idea is for it to grab (consume?) the HTML from
funtweets.com/random and basically tell me a funny joke in the morning, as an alarm clock. It currently extracts all the jokes on the page, and I only want one. Either modifying the code or a detailed explanation of how it works would be helpful to me. This is the code:
import re
import urllib2

page = urllib2.urlopen("http://www.m.funtweets.com/random").read()
user = re.compile(r'<span>#</span>(\w+)')
text = re.compile(r"</b></a> (\w.*)")
user_lst = [match.group(1) for match in re.finditer(user, page)]
text_lst = [match.group(1) for match in re.finditer(text, page)]

for _user, _text in zip(user_lst, text_lst):
    print '#{0}\n{1}\n'.format(_user, _text)
user3530608: do you want one match, instead of iterating through matches?
This is a nice way to get started with Python regular expressions.
Here is a small tweak to your code. I don't have Python in front of me to test it, so let me know if you run into any issues.
import re
import urllib2

page = urllib2.urlopen("http://www.m.funtweets.com/random").read()
umatch = re.search(r"<span>#</span>(\w+)", page)
user = umatch.group(1)  # group(1) is the captured username, without the surrounding tags
utext = re.search(r"</b></a> (\w.*)", page)
text = utext.group(1)   # group(1) is the joke text alone
print '#{0}\n{1}\n'.format(user, text)
Although you can parse HTML with regexes, I strongly suggest you use a third-party Python library.
My favorite HTML-parsing library is PyQuery; you can use it like jQuery,
for example:
from pyquery import PyQuery as pq

page = pq(url='http://www.m.funtweets.com/random')
users = page("#user_id")
a_first = page("a:first")
...
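For the "just one joke" goal, PyQuery's :first pseudo-selector plus .text() would get a single entry with the tags stripped (a hypothetical sketch; the real selector depends on the page's markup):

first_joke = page("a:first").text()  # .text() returns the element's visible text, tags removed
print first_joke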
You can find it here: https://pypi.python.org/pypi/pyquery
Just:
pip install PyQuery
or
easy_install PyQuery
You'll love it!
Another HTML-parsing library: https://pypi.python.org/pypi/beautifulsoup4/4.3.2
If anyone is interested in getting only one joke from the HTML, with no HTML tags, here is the final code:
import re
import urllib2

def remove_html_tags(text):
    pattern = re.compile(r'</b></a>')
    return pattern.sub('', text)

page = urllib2.urlopen("http://www.m.funtweets.com/random").read()
umatch = re.search(r"<span>#</span>(\w+)", page)
user = umatch.group()
utext = re.search(r"</b></a> (\w.*)", page)
text = utext.group()
print remove_html_tags(text)

Simple scraping of youtube xml to get a Python list of videos

I have an xml feed, say:
http://gdata.youtube.com/feeds/api/videos/-/bass/fishing/
I want to get the list of hrefs for the videos:
['http://www.youtube.com/watch?v=aJvVkBcbFFY', 'ht....', ... ]
from xml.etree import cElementTree as ET
import urllib

def get_bass_fishing_URLs():
    results = []
    data = urllib.urlopen(
        'http://gdata.youtube.com/feeds/api/videos/-/bass/fishing/')
    tree = ET.parse(data)
    ns = '{http://www.w3.org/2005/Atom}'
    for entry in tree.findall(ns + 'entry'):
        for link in entry.findall(ns + 'link'):
            if link.get('rel') == 'alternate':
                results.append(link.get('href'))
    return results
since it appears that what you want are the so-called "alternate" links. The many small possible variations, if you want something slightly different, should, I hope, be clear from the above code (plus the standard Python library docs for ElementTree).
Have a look at Universal Feed Parser, which is an open source RSS and Atom feed parser for Python.
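For example, a minimal sketch assuming feedparser is installed (each Atom entry's rel="alternate" href is exposed as entry.link):

import feedparser

feed = feedparser.parse('http://gdata.youtube.com/feeds/api/videos/-/bass/fishing/')
hrefs = [entry.link for entry in feed.entries]  # entry.link is the alternate href
print hrefs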
In such a simple case, this should be enough:
import re
import urllib2

request = urllib2.urlopen("http://gdata.youtube.com/feeds/api/videos/-/bass/fishing/")
text = request.read()
videos = re.findall(r"http://www\.youtube\.com/watch\?v=[\w-]+", text)
If you want to do more complicated stuff, parsing the XML will be better suited than regular expressions:
import urllib
from xml.dom import minidom

xmldoc = minidom.parse(urllib.urlopen('http://gdata.youtube.com/feeds/api/videos/-/bass/fishing/'))
links = xmldoc.getElementsByTagName('link')
hrefs = []
for link in links:  # fixed: iterate over the link elements, not the other way around
    if link.getAttribute('rel') == 'alternate':
        hrefs.append(link.getAttribute('href'))
print hrefs

Python lxml screen scraping?

I need to do some HTML parsing with Python. After some research, lxml seems to be my best choice, but I am having a hard time finding examples that help me with what I am trying to do; this is why I am here. I need to scrape a page for all of its viewable text, stripping out all tags and JavaScript, leaving me with only the text that is visible. Sounds simple enough. I did it with HTMLParser, but it does not handle JavaScript well:
import HTMLParser
import cStringIO

class HTML2Text(HTMLParser.HTMLParser):
    def __init__(self):
        HTMLParser.HTMLParser.__init__(self)
        self.output = cStringIO.StringIO()

    def get_text(self):
        return self.output.getvalue()

    def handle_data(self, data):
        self.output.write(data)

def ParseHTML(source):
    p = HTML2Text()
    p.feed(source)
    text = p.get_text()
    return text
Any ideas for a way to do this with lxml, or a better way to do it with HTMLParser? HTMLParser would be best because no additional libs are needed. Thanks, everyone.
Scott F.
No screen-scraping library I know of "does well with JavaScript" -- it's just too hard to anticipate all the ways in which JS could alter the HTML DOM dynamically, conditionally, &c.
scrape.py can do this for you.
It's as simple as:
import scrape
s = scrape.Session()
s.go('yoursite.com')
print s.doc.text
Jump to about 2:40 in this video for an awesome overview from the creator of scrape.py:
pycon.blip.tv/file/3261277
BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/) is often the right answer to python html scraping questions.
I know of no Python HTML parsing libraries that handle running javascript in the page being parsed. It's not "simple enough" for the reasons given by Alex Martelli and more.
For this task you may need to think about going to a higher level than just parsing HTML and look at web application testing frameworks.
Two that can execute javascript and are either Python based or can interface with Python:
PAMIE
Selenium (see the sketch below)
Unfortunately I'm not sure if the "unit testing" orientation of these frameworks will actually let you scrape out visible text.
So the only other solution would be to do it yourself, say by integrating python-spidermonkey into your app.
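As an illustration of the Selenium route mentioned above, a minimal sketch (assuming the Selenium bindings and a Firefox driver are installed; a WebElement's text attribute yields only the text actually rendered as visible, which is what the question asks for):

from selenium import webdriver

driver = webdriver.Firefox()
driver.get('http://www.example.com/')           # placeholder URL
body = driver.find_element_by_tag_name('body')
print body.text                                 # only the visible text, scripts and tags excluded
driver.quit()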
Your code is smart and very flexible, to an extent, I think.
How about simply adding handle_starttag() and handle_endtag() to suppress the <script> blocks?
import HTMLParser
import cStringIO

class HTML2Text(HTMLParser.HTMLParser):
    def __init__(self):
        HTMLParser.HTMLParser.__init__(self)
        self.output = cStringIO.StringIO()
        self.is_in_script = False

    def get_text(self):
        return self.output.getvalue()

    def handle_data(self, data):
        if not self.is_in_script:   # drop text that sits inside a <script> block
            self.output.write(data)

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.is_in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.is_in_script = False

def ParseHTML(source):
    p = HTML2Text()
    p.feed(source)
    text = p.get_text()
    return text
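A quick check of the behavior (an illustrative example; note the output has no space between the runs of text, because handle_data writes them verbatim):

sample = '<html><body><h1>Hi</h1><script>var x = 1;</script>visible</body></html>'
print ParseHTML(sample)  # prints 'Hivisible' -- the script body is suppressed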
