Parse directly on pure HTML source with Selenium in Python

I'm trying to test a Selenium program I wrote by giving it an HTML source as a string, mainly for speed. I don't want it to fetch a URL and I don't want it to open a file; I just want to pass it a string that contains the whole DIV part of the site and do the parsing on that.
This is part of a module that I wrote:
source = driver.page_source
return {'containers': source}
and in another module,
def get_rail_origin(self):
    return self.data['containers'].find_element_by_id('o_outDepName')...
I'm trying to parse it, but I get
AttributeError: 'str' object has no attribute 'find_element_by_id'
So how can I parse pure HTML source without opening any file or URL?

Selenium works with the live HTML DOM. If you want to grab the source and then parse it, you can try, for instance, lxml.html:
from lxml import html

def get_rail_origin(self):
    source = html.fromstring(self.data['containers'])
    return source.get_element_by_id('o_outDepName')
P.S. I assumed that self.data['containers'] is HTML source code
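For completeness, a minimal sketch of how the two modules could fit together under that assumption; the driver and the element id come from the question, and the text_content() call is just one way to read the element's text:
from lxml import html

# Module 1: capture the rendered source once, then reuse the string.
source = driver.page_source
data = {'containers': source}

# Module 2: parse the stored string instead of hitting the URL again.
tree = html.fromstring(data['containers'])
element = tree.get_element_by_id('o_outDepName')
print(element.text_content())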

Related

Generate and download tsv from a website (with python)

I have this website and want to write a script that produces the same output as clicking 'Export' -> 'Generate tsv' -> waiting for the file to be generated -> 'Download'.
The end goal is to use this for a list of approx. 1700 proteins that I have in a .txt file (so extract a protein ID, in this case 'Q9BXF6', put it into the URL https://www.ebi.ac.uk/interpro/protein/UniProt/Q9BXF6/entry/InterPro/#table) and download all the results as .tsv files.
I tried inspecting the 'Export' button, but the source code wasn't illuminating (or I didn't know where to look). I also tried this:
import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.ebi.ac.uk/interpro/protein/UniProt/Q9BXF6/entry/InterPro/#table')
soup = BeautifulSoup(r.content, 'html.parser')
to locate what I need but it outputs a bunch of characters that I can't really understand.
I also tried downloading the whole page just like it is with the urllib library:
import urllib.request

myurl = 'https://www.ebi.ac.uk/interpro/protein/UniProt/Q9BXF6/entry/InterPro/#table'
with urllib.request.urlopen(myurl) as f:
    html = f.read().decode('utf-8')
or
urllib.request.urlretrieve(myurl, 'interpro.txt')  # although this didn't work
It seems as if all the content is stored somewhere else and only referred to, and everything I've tried outputs something useless. I don't know anything about HTML and am really new to Python (I only use R).
For your first question: you can use the URL of the following element to retrieve the data you require for the next problem.
href="blob:https://www.ebi.ac.uk/806960aa-720c-4958-9392-f242adee627b"
The URL is set in the href attribute, which you can then use to make the request that downloads the file. You can also find it by right-clicking the TSV download button and choosing Inspect Element; you will then see this href attribute.
Following that, download it with, for example:
import urllib.request

url = 'https://www.ebi.ac.uk/806960aa-720c-4958-9392-f242adee627b'
urllib.request.urlretrieve(url, '/Users/abc/Downloads/file.tsv')  # any directory to save to

with open("/Users/abc/Downloads/file.tsv") as file_in:
    for line in file_in:
        pass  # here make your calls for your second problem
You can also use a web automation tool such as Selenium to solve this problem gracefully. If that is of interest, do look into it - it's not hard.
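A rough Selenium sketch of that idea, looping over the protein IDs described in the question; the input file name and the element locators (the 'Export', 'Generate' and 'Download' controls) are hypothetical placeholders that would need to be checked against the live page:
import time
from selenium import webdriver

driver = webdriver.Chrome()  # assumes chromedriver is on the PATH

with open('proteins.txt') as f:  # hypothetical file containing the ~1700 IDs
    proteins = [line.strip() for line in f if line.strip()]

for protein in proteins:
    url = 'https://www.ebi.ac.uk/interpro/protein/UniProt/{}/entry/InterPro/#table'.format(protein)
    driver.get(url)
    # The XPaths below are hypothetical; inspect the page to find the real ones.
    driver.find_element_by_xpath("//button[contains(., 'Export')]").click()
    driver.find_element_by_xpath("//button[contains(., 'Generate')]").click()
    time.sleep(10)  # crude wait; a WebDriverWait on the download link would be better
    driver.find_element_by_xpath("//a[contains(., 'Download')]").click()

driver.quit()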

Python AttributeError NoneType 'text'

I'm trying to make a Python script that can scrape the RAW Paste Data section of a saved pastebin page. But I'm running into an AttributeError: a NoneType object has no attribute 'text'. I'm using BeautifulSoup in my project. I also tried to install spider-egg with pip so I could use that as well, but there were issues downloading the package from the server.
I need to be able to grab several lines from the RAW Paste Data section and then print them back out.
first_string = raw_box.text.strip()
second_string = raw_box2.text.strip()
From the pastebin page I have the element and class names for the RAW Paste Data section, which is:
<textarea id="paste_code" class="paste_code" name="paste_code" onkeydown="return catchTab(this,event)">
Taking the class name paste_code, I then have this:
raw_box = soup.find('first_string ', attrs={'class': 'paste_code'})
raw_box2 = soup.find('second_string ', attrs={'class': 'paste_code'})
I thought that should have been it, but apparently not, because I get the error mentioned above. After parsing the stripped data I need to redirect it into a file, after printing what it got. I also want to make this Python 3 compatible, but that would take a little more work, I think, since there are a lot of differences between Python 2.7.12 and 3.5.2.
The following approach should help to get you started:
import requests
from bs4 import BeautifulSoup
r = requests.get('https://pastebin.com/hGeHMBQf')
soup = BeautifulSoup(r.text, "html.parser")
raw = soup.find('textarea', id='paste_code').text
print(raw)
Which for this example should display:
hello world
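Since the question also asks about redirecting the scraped text into a file, a small follow-up to the snippet above; the output file name is just an example:
raw = soup.find('textarea', id='paste_code').text
with open('paste_output.txt', 'w') as out_file:
    out_file.write(raw)  # save the RAW Paste Data content
print(raw)  # and print it back out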

XPATH query using python

I have an html file resulting from an ifort code coverage report. This html file contains numerous lines as follows:
<a name="l1" style="background-color: #ffffff"> module WriteOutput</a>
I succeeded in importing the file using the following in python:
from lxml import html

with open(SampleSourceFile, "r") as f:
    page = f.read()

tree = html.fromstring(page)
Then I was actually able to get all the name attributes using the following XPath syntax:
tree.xpath(r'/html/body//a/@name')
I see that this offers interesting possibilities. Is it also possible to extract the content of the <a> tag, namely, in this case, the string 'module WriteOutput', using XPath?
Also, can I add constraints? For instance, I'd like to get back only the names of the <a> tags with a certain background color. Are these things possible?
Thanks,
Though I have not tried it, something like this should work (note that the background color lives inside the style attribute, so that is what the predicate has to test):
tree.xpath(r'/html/body//a[@style="background-color: #ffffff"]')
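And to answer the first part of the question as well (getting the tag content), a short sketch built on the same tree; contains() is used so the match does not depend on the exact formatting of the style attribute:
# Print the name attribute and the text of every <a> with the white background
for a in tree.xpath(r'/html/body//a[contains(@style, "#ffffff")]'):
    print(a.get('name'), a.text_content().strip())  # e.g. l1 module WriteOutput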

How to save a webpage's text content as a text file using python

I wrote this Python script:
from string import punctuation
from collections import Counter
import urllib
from stripogram import html2text
myurl = urllib.urlopen("https://www.google.co.in/?gfe_rd=cr&ei=v-PPV5aYHs6L8Qfwwrlg#q=samsung%20j7")
html_string = myurl.read()
text = html2text( html_string )
file = open("/home/nextremer/Final_CF/contentBased/contentCount/hi.txt", "w")
file.write(text)
file.close()
Using this script I didn't get good output, only some HTML code.
I want to save all of the web page's text content in a text file.
I also tried urllib2 and bs4, but I didn't get results.
I don't want the output as an HTML structure; I want all the text data from the web page.
What do you mean by "webpage text"?
It seems you don't want the full HTML file. If you just want the text you see in your browser, that is not easily solvable, because parsing an HTML document can be very complex, especially with JavaScript-rich pages.
It starts with deciding whether a string between "<" and ">" is a regular tag and goes as far as analyzing the CSS properties changed by JavaScript behavior.
That is why people write very big and complex rendering engines for web browsers.
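That said, for a rough approximation of the visible text, BeautifulSoup (which the question already mentions trying) can strip tags and scripts. A minimal sketch using Python 3's urllib.request, with no claim that it matches exactly what the browser renders:
import urllib.request
from bs4 import BeautifulSoup

with urllib.request.urlopen("https://www.example.com") as f:
    html_string = f.read().decode("utf-8", errors="replace")

soup = BeautifulSoup(html_string, "html.parser")
for tag in soup(["script", "style"]):
    tag.decompose()  # script/style blocks are not visible text

text = soup.get_text(separator="\n", strip=True)
with open("hi.txt", "w", encoding="utf-8") as out_file:
    out_file.write(text)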
You don't need to write any hard algorithms to extract data from a search result; Google has an API for that.
Here is an example: https://github.com/google/google-api-python-client/blob/master/samples/customsearch/main.py
But to use it, you first have to register with Google for an API key.
You can find all the information here: https://developers.google.com/api-client-library/python/start/get_started
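A minimal sketch of that API call, assuming you have registered an API key and created a custom search engine; both placeholder values below are hypothetical:
from googleapiclient.discovery import build

# Build the Custom Search service with your own key.
service = build("customsearch", "v1", developerKey="YOUR_API_KEY")
response = service.cse().list(q="samsung j7", cx="YOUR_CSE_ID").execute()
for item in response.get("items", []):
    print(item["title"], item["link"])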
import urllib
urllib.urlretrieve("http://www.example.com/test.html", "test.txt")

Is there any way to parse DOM tree for website content? [duplicate]

This question already has answers here:
Amazon web scraping
(3 answers)
Closed 7 years ago.
There are some packages for parsing dom tree from xml content, like https://docs.python.org/2/library/xml.dom.minidom.html.
But I don't want to target XML, only HTML website page content.
from htmldom import htmldom

dom = htmldom.HtmlDom( "http://www.yahoo.com" ).createDom()
# Find all the links present on a page and print their "href" values
a = dom.find( "a" )
for link in a:
    print( link.attr( "href" ) )
but for this I am getting this error:
Error while reading url: http://www.yahoo.com
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/htmldom/htmldom.py", line 333, in createDom
raise Exception
Exception
I already checked BeautifulSoup, but it is not what I want. BeautifulSoup works only on the HTML of the page; if the page content is loaded dynamically using JavaScript, it fails. And I don't want to select elements with getElementByClassName and the like, but with something like dom.children(0).children(1).
So is there any way, for example using a headless browser or Selenium, to parse the entire DOM tree structure and, by going through children and sub-children, access the target element?
The Python Selenium API provides you with everything you might need. You can start with
html = driver.find_element_by_tag_name("html")
or
body = driver.find_element_by_tag_name("body")
and then go from there with
body.find_element_by_xpath('./*[' + str(x) + ']')
which would be equivalent to body.children(x-1). You don't need to use BeautifulSoup or any other DOM traversal framework on top of that, but you certainly can, by taking the page source and letting it be parsed by another library such as BeautifulSoup:
soup = BeautifulSoup(driver.page_source, "html.parser")
soup.html.contents[0] #...
Yes, but it's not going to be simple enough to include the code in a SO post. You're on the right track though.
Basically you're going to need to use a headless renderer of your choice (e.g. Selenium) to download all the resources and execute the javascript. There's really no use reinventing the wheel there.
Then you'll need to echo the HTML from the headless renderer out to a file on the page-ready event (every headless browser I've worked with offers this ability). At that point you can run BeautifulSoup over that file to navigate the DOM. BeautifulSoup does support the child-based traversal you want: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#going-down
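A minimal sketch of that flow, assuming a Chrome driver; BeautifulSoup's .children/.contents give the child-by-index style of traversal the question asks about:
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()  # any driver that can render JavaScript works
driver.get("http://www.yahoo.com")

# Dump the JavaScript-rendered HTML to a file once the page has loaded.
with open("page.html", "w", encoding="utf-8") as f:
    f.write(driver.page_source)
driver.quit()

# Traverse the DOM child by child, like dom.children(0).children(1).
with open("page.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

for child in soup.html.children:
    print(getattr(child, "name", None))  # e.g. head, body (None for bare text nodes)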
