Faster way to parse JS generated site - python

I am creating a script that will have to parse a JS-generated site (using Python) at least 1,000 times a day. Parsing it the usual way (opening it in a browser and then grabbing the code) takes about 30 seconds, which is not satisfactory. I thought about how to make the process faster and I have an idea: is it possible to run a browser that doesn't create a window (ignores the visual part) and is simply a process, in other words an "invisible" browser? What I want to know is whether that would be efficient, and whether there are other ways to make it run faster. Any help is appreciated.
EDIT: Here is the code for my parser:
from selenium import webdriver
import re
browser = webdriver.Firefox()
browser.get('http://www.spokeo.com/search?q=Joe+Henderson,+Phoenix,+AZ&sao7=t104#:18643819031')
content = browser.page_source
browser.quit()
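What is described here is essentially a headless browser. A minimal sketch, assuming a reasonably recent Selenium with Firefox and geckodriver available, would be to pass a headless option so no window is ever created while the page is still fully rendered, JavaScript included:
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

# Run Firefox in headless mode: no window is opened, but the page
# (including JavaScript) is rendered exactly as in a normal browser.
options = Options()
options.add_argument('-headless')
browser = webdriver.Firefox(options=options)
browser.get('http://www.spokeo.com/search?q=Joe+Henderson,+Phoenix,+AZ&sao7=t104#:18643819031')
content = browser.page_source
browser.quit()
Whether this is fast enough depends mostly on the page itself; the headless flag removes the rendering window, not the network and JavaScript execution time.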

Related

Easy way to work around slow Selenium Python startup?

So I have a program I want to run using Selenium specifically, one that takes a series of actions on a password-protected website. Basically, I need to be able to input a unique link and password when I get them, which takes me to the main website I have automated. The issue is that Selenium takes a long time to start up and load a webpage, and time is very important in this application. Launching the browser directly to the link after entering it takes too long. What I have tried is preloading the browser to a different website (e.g., https://google.com) beforehand, and then waiting on user input for the link to the actual page. This works a lot quicker, but I'm having trouble getting it to work inside a function and with multiprocessing. I am using multiprocessing to execute this on a wide scale with lots of instances, and I am trying to start all of my functions the second a link is defined by me. I am on Windows 10, using Python 3.8.3, and using Chrome for my Selenium browser.
from selenium import webdriver

global link
link = input('Paste Link Here: ')

def instance_1():
    browser1 = webdriver.Chrome(*my webdriver file path*)
    browser1.get('https://google.com')
    # need something that waits here until the link variable is defined by me
    browser1.get(link)
    # the rest of the automation works fine from here
Ideally, the solution would be able to work with multiprocessing. The ideal flow would be something like this:
1. All Selenium instances (written as their own functions) start up and preload to a website (this part works fine)
2. They wait until the link to go to is specified (this is where the issue is)
3. They then go to the link and execute the automation (this part works fine)
TL;DR: basically, anything that would allow me to let the program continue while waiting on the input would be nice.
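One way the waiting step could work (a sketch only, not code from the question) is a multiprocessing.Event that each worker process blocks on until the main process has read the link; run_instance below is a hypothetical worker, and chromedriver is assumed to be on the PATH:
from multiprocessing import Event, Manager, Process

from selenium import webdriver

def run_instance(link_ready, shared):
    # Preload the browser before the real link is known.
    browser = webdriver.Chrome()  # assumes chromedriver is on the PATH
    browser.get('https://google.com')
    # Block here until the main process signals that the link is available.
    link_ready.wait()
    browser.get(shared['link'])
    # ...the rest of the automation would go here...
    browser.quit()

if __name__ == '__main__':
    manager = Manager()
    shared = manager.dict()
    link_ready = Event()
    workers = [Process(target=run_instance, args=(link_ready, shared)) for _ in range(3)]
    for w in workers:
        w.start()          # browsers start preloading immediately
    shared['link'] = input('Paste Link Here: ')
    link_ready.set()       # wake every waiting worker at once
    for w in workers:
        w.join()
Each worker sits at link_ready.wait() with its browser already warmed up, so all instances navigate the moment the link is typed in.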

I am trying to download the yearly data from this website using Python, but I am not sure how to approach it

I want to learn how to download the CSV files for the last ten years using Python. I think this would be helpful.
https://www.usgovernmentspending.com/compare_state_debt
My attempts involve requests and pandas.
This is a multipart problem and I'm going to outline the steps I think you should use.
The first part is simply downloading the webpage; I would suggest using something like requests to get it.
Once you have that, you can use Beautiful Soup to parse the webpage.
I took a look at the website and it looks like there are a number of ways you could download the data. I think the best way is going to be to extract all the text from that particular part of the page.
Once you do that, you are probably going to need to clean up the data. I suggest using pandas for that.
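A minimal sketch of that pipeline, assuming the figures sit in an ordinary HTML table in the fetched page (the asker later found Selenium was needed, so the table may in fact be JavaScript-rendered and this exact selector is only illustrative):
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Step 1: download the page.
response = requests.get('https://www.usgovernmentspending.com/compare_state_debt')
response.raise_for_status()

# Step 2: parse the HTML.
soup = BeautifulSoup(response.text, 'html.parser')

# Step 3: pull rows out of the first table (illustrative; inspect the real
# page to pick the right table and cells).
table = soup.find('table')
rows = []
for tr in table.find_all('tr'):
    cells = [cell.get_text(strip=True) for cell in tr.find_all(['th', 'td'])]
    if cells:
        rows.append(cells)

# Step 4: clean up with pandas.
df = pd.DataFrame(rows[1:], columns=rows[0])
print(df.head())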
People on here aren't going to solve the whole problem for you. That said, if you get stuck along the way and have a specific question, StackOverflow can probably help at that point.
Issue resolved: I managed to solve it using Selenium, by doing the following:
from selenium import webdriver  # allows launching the browser

# Opening in incognito (optional).
driver_option = webdriver.ChromeOptions()
# driver_option.add_argument("--incognito")

chromedriver_path = '# Write your path here'  # Change this to your own chromedriver path!

# Creating a webdriver.
def create_webdriver():
    return webdriver.Chrome(executable_path=chromedriver_path, options=driver_option)

URL = ""
browser = create_webdriver()
browser.get(URL)

# Finding the download link.
elem1 = browser.find_element_by_link_text("download file")
# Clicking it.
elem1.click()
I put the previous code in a loop for all the years until 2020 and I got all the files in CSV format.
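For reference, the loop could look roughly like the sketch below; the per-year link text is hypothetical (not taken from the actual page), so it would need to be adjusted to whatever the real download links say:
browser = create_webdriver()
browser.get(URL)

for year in range(2010, 2021):
    # "download file 2010", "download file 2011", ... is a made-up pattern;
    # use the link text that actually appears on the page for each year.
    elem = browser.find_element_by_link_text("download file {}".format(year))
    elem.click()

browser.quit()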

Python selenium webdriver code performance

I am scraping a webpage using Selenium in Python. I am able to locate the elements using this code:
from selenium import webdriver
import codecs
driver = webdriver.Chrome()
driver.get("url")
results_table = driver.find_elements_by_xpath('//*[@id="content"]/table[1]/tbody/tr')
Each element in results_table is in turn a set of sub-elements, with the number of sub-elements varying from element to element. My goal is to output each element, as a list or as a delimited string, into an output file. My code so far is this:
results_file = codecs.open(path + "results.txt", "w", "cp1252")
for i, element in enumerate(results_table):
    element_fields = element.find_elements_by_xpath(".//*[text()][count(*)=0]")
    element_list = [field.text for field in element_fields]
    stuff_to_write = '#'.join(element_list) + "\r\n"
    results_file.write(stuff_to_write)
    #print(i)
results_file.close()
driver.quit()
This second part of the code takes about 2.5 minutes on a list of ~400 elements, each with about 10 sub-elements. I get the desired output, but it is too slow. What could I do to improve the performance?
Using Python 3.6.
Download the whole page in one shot, then use something like BeautifulSoup to process it. I haven't used Splinter or Selenium in a while, but in Splinter, .html will give you the page. I'm not sure what the syntax is for that in Selenium, but there should be a way to grab the whole page.
Selenium (and Splinter, which is layered on top of Selenium) is notoriously slow for randomly accessing web page content. It looks like .page_source gives the entire contents of the page in Selenium (see stackoverflow.com/questions/35486374/…). If reading all the chunks on the page one at a time is killing your performance (and it probably is), reading the whole page once and processing it offline will be oodles faster.
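A sketch of that idea applied to the code above: grab driver.page_source once, quit the browser, and do the row and field extraction offline with lxml, assuming the same XPath expressions still match the page:
import codecs

from lxml import html
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("url")  # placeholder URL, as in the question

# One round trip to the browser instead of one per field.
tree = html.fromstring(driver.page_source)
driver.quit()

results_file = codecs.open("results.txt", "w", "cp1252")
for row in tree.xpath('//*[@id="content"]/table[1]/tbody/tr'):
    fields = row.xpath(".//*[text()][count(*)=0]")
    results_file.write('#'.join(field.text_content() for field in fields) + "\r\n")
results_file.close()
All the per-element lookups now run against an in-memory lxml tree, so each XPath call drops from a WebDriver round trip to a local function call.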

Rendering Javascript to obtain static HTML in Python

I have a large number of HTML files which I want to process using BeautifulSoup to generate some statistics. However, I came across the problem that the HTML files contain scripts that may generate more HTML code, which is not being processed. Therefore, I need to render all the JavaScript into static HTML before proceeding.
I have seen some options such as using Selenium, but it doesn't seem to fit since I don't want to launch a browser (it should be done in the background).
Can someone please suggest an appropriate approach to this?
Thanks in advance!
Since you need a JavaScript engine, using a headless browser is the way to go.
Using the Selenium WebDriver with the PhantomJS headless browser is probably your best option:
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.PhantomJS()
driver.get("...")
bs = BeautifulSoup(driver.page_source)

Using Python requests.get to parse html code that does not load at once

I am trying to write a Python script that will periodically check a website to see if an item is available. I have used requests.get, lxml.html, and xpath successfully in the past to automate website searches. In the case of this particular URL (http://www.anthropologie.com/anthro/product/4120200892474.jsp?cm_vc=SEARCH_RESULTS#/) and others on the same website, my code was not working.
import requests
from lxml import html
page = requests.get("http://www.anthropologie.com/anthro/product/4120200892474.jsp?cm_vc=SEARCH_RESULTS#/")
tree = html.fromstring(page.text)
html_element = tree.xpath(".//div[@class='product-soldout ng-scope']")
At this point, html_element should be a list of elements (I think in this case only 1), but instead it is empty. I think this is because the website is not loading all at once, so when requests.get() goes out and grabs it, it's only grabbing the first part. So my questions are:
1: Am I correct in my assessment of the problem?
and
2: If so, is there a way to make requests.get() wait before returning the HTML, or perhaps another route entirely to get the whole page?
Thanks
Edit: Thanks to both responses. I used Selenium and got my script working.
You are not correct in your assessment of the problem.
You can check the results and see that there's a </html> right near the end. That means you've got the whole page.
And requests.text always grabs the whole page; if you want to stream it a bit at a time, you have to do so explicitly.
Your problem is that the table doesn't actually exist in the HTML; it's built dynamically by client-side JavaScript. You can see that by actually reading the HTML that's returned. So, unless you run that JavaScript, you don't have the information.
There are a number of general solutions to that. For example:
Use selenium or similar to drive an actual browser to download the page.
Manually work out what the JavaScript code does and do equivalent work in Python (see the sketch after this list).
Run a headless JavaScript interpreter against a DOM that you've built up.
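As a purely hypothetical illustration of the second option: if the page's JavaScript turned out to fetch its data from a JSON endpoint, that endpoint could be called directly with requests. The URL and field name below are made up; the real work is watching the browser's network tab to find the request the page actually makes:
import requests

# Made-up endpoint and field, for illustration only.
resp = requests.get(
    "http://www.anthropologie.com/api/product/4120200892474",
    headers={"Accept": "application/json"},
)
data = resp.json()
print(data.get("availability"))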
The page uses JavaScript to load the table, which therefore isn't present when requests fetches the HTML, so you are getting all the HTML, just not what is generated by JavaScript. You could use Selenium combined with PhantomJS for headless browsing to get the rendered HTML:
from selenium import webdriver
browser = webdriver.PhantomJS()
browser.get("http://www.anthropologie.eu/anthro/index.jsp#/")
html = browser.page_source
print(html)
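From there, the asker's original lxml/XPath check can be run against the rendered source instead of the raw requests response; a sketch combining the two, using the same URL and XPath as in the question:
from lxml import html
from selenium import webdriver

browser = webdriver.PhantomJS()
browser.get("http://www.anthropologie.com/anthro/product/4120200892474.jsp?cm_vc=SEARCH_RESULTS#/")
tree = html.fromstring(browser.page_source)
browser.quit()

# The same XPath as in the question, now run against the JavaScript-rendered HTML.
html_element = tree.xpath(".//div[@class='product-soldout ng-scope']")
print(len(html_element))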
