How can I grab PDF links from a website with a Python script?

Quite often I have to download PDFs from websites, but sometimes they are not all on one page.
The links are split across paginated pages, and I have to click through every page to get them.
I am learning Python and I want to write a script where I can enter a URL and have it extract the PDF links from that website.
I am new to Python, so can anyone please point me in the right direction?

Pretty simple with urllib2, urlparse and lxml. I've commented things more verbosely since you're new to Python:
# modules we're using (you'll need to download lxml)
import lxml.html, urllib2, urlparse
# the url of the page you want to scrape
base_url = 'http://www.renderx.com/demos/examples.html'
# fetch the page
res = urllib2.urlopen(base_url)
# parse the response into an xml tree
tree = lxml.html.fromstring(res.read())
# construct a namespace dictionary to pass to the xpath() call
# this lets us use regular expressions in the xpath
ns = {'re': 'http://exslt.org/regular-expressions'}
# iterate over all <a> tags whose href ends in ".pdf" (case-insensitive)
for node in tree.xpath('//a[re:test(@href, "\.pdf$", "i")]', namespaces=ns):
    # print the href, joining it to the base_url
    print urlparse.urljoin(base_url, node.attrib['href'])
Result:
http://www.renderx.com/files/demos/examples/Fund.pdf
http://www.renderx.com/files/demos/examples/FundII.pdf
http://www.renderx.com/files/demos/examples/FundIII.pdf
...
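
The answer above uses Python 2 modules (urllib2, urlparse). A minimal Python 3 sketch of the same idea follows, extended to walk paginated listings as the question asks. It swaps lxml for the standard library's html.parser so it runs without extra installs. The names LinkCollector, parse_page, fetch and collect_pdf_links are hypothetical, and it assumes the site marks its pagination link with rel="next", which many sites do not; adjust that check to match the real markup.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collect hrefs ending in .pdf and the href of a rel="next" link, if any."""

    def __init__(self):
        super().__init__()
        self.pdf_hrefs = []
        self.next_href = None

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        href = attrs.get("href")
        if not href:
            return
        # Ignore any query string or fragment when checking the extension.
        path = href.split("#")[0].split("?")[0]
        if path.lower().endswith(".pdf"):
            self.pdf_hrefs.append(href)
        # Assumption: the pagination link is marked rel="next".
        if "next" in (attrs.get("rel") or "").lower().split():
            self.next_href = href


def parse_page(html, page_url):
    """Return (absolute PDF URLs, absolute next-page URL or None)."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    pdfs = [urljoin(page_url, h) for h in collector.pdf_hrefs]
    next_url = urljoin(page_url, collector.next_href) if collector.next_href else None
    return pdfs, next_url


def fetch(url):
    """Download a page and decode it as text."""
    with urlopen(url) as res:
        charset = res.headers.get_content_charset() or "utf-8"
        return res.read().decode(charset, errors="replace")


def collect_pdf_links(start_url, fetch=fetch, max_pages=50):
    """Follow rel="next" links from start_url, gathering unique PDF links."""
    found, seen, url = [], set(), start_url
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        pdfs, url = parse_page(fetch(url), url)
        for pdf in pdfs:
            if pdf not in found:
                found.append(pdf)
    return found
```

Calling collect_pdf_links('http://www.renderx.com/demos/examples.html') would print nothing by itself; loop over the returned list to print or download each link. The fetch parameter is there so the crawl logic can be exercised with canned pages instead of the network.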

If there are a lot of pages with links, you can try the excellent Scrapy framework (http://scrapy.org/).
It is fairly easy to learn, and it can download the PDF files you need.

I am writing this on my phone, so it may not be very readable.
If you are going to grab things from a website that serves only static pages, you can easily fetch the HTML with requests:
import requests
# url is the address of the page you want to fetch
page_content = requests.get(url).text
But if you scrape something like a communication website, there will be anti-scraping measures, and getting past them becomes the real problem.
First way: make your requests look more like a browser (a human).
Add the headers (you can copy them from Chrome's dev tools or from Fiddler).
Build the POST form correctly; it should match the form your browser submits.
Get the cookies and add them to your requests.
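
The three steps above (headers, POST form, cookies) can be sketched as follows. This is only an illustration using the standard library rather than requests, so it runs without extra installs; the same idea applies to a requests session. The header values and the names BROWSER_HEADERS, make_opener and build_request are assumptions, not values taken from any real site; copy the real headers from your own browser.

```python
import http.cookiejar
import urllib.parse
import urllib.request

# Assumption: example values only; copy the real ones from your browser's dev tools.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-US,en;q=0.9",
}


def make_opener():
    """Build an opener that stores cookies and sends them back automatically."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def build_request(url, form=None):
    """Create a GET request, or a POST request when form fields are given."""
    data = urllib.parse.urlencode(form).encode("ascii") if form else None
    return urllib.request.Request(url, data=data, headers=BROWSER_HEADERS)
```

To use it, create one opener, call opener.open(build_request(login_url, form_fields)) first, then reuse the same opener for later pages so the cookies set by the first response are sent along.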
Second way: use Selenium with a browser driver. Selenium drives a real browser through its driver (I use chromedriver).
Remember to add chromedriver to your PATH,
or pass the path to the driver executable in code:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
driver = webdriver.Chrome(service=Service(path))  # Selenium 4 form; the exact setup call depends on your Selenium version
driver.get(url)
This really loads the URL in a browser, so it makes scraping much easier.
Get the web page:
page = driver.page_source
Some websites redirect through several pages, which can cause errors. Make your script wait until a certain element appears:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
certain_element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, 'idOfAnElementYouKnowExists')))
Or use an implicit wait, with whatever timeout you like:
driver.implicitly_wait(5)  # seconds
You can also control the page through WebDriver. I will not describe that here; see the Selenium documentation.

Related

Url request does not parse every information in HTML using Python

I am trying to extract information from an exchange website (chiliz.net) using Python (requests module) and the following code:
data = requests.get(url,time.sleep(15)).text
I used time.sleep since the website is not directly connecting to the exchange main page, but I am not sure it is necessary.
The thing is that I cannot find anything written under <body style> in the HTML text (which is the data variable in this case). How can I get the full HTML code and then start to extract the price information from this website?
I know Python, but I am not that familiar with websites/HTML, so I would appreciate it if you explained the website-related parts as if you were talking to a beginner. Thanks!

There could be a few reasons for this.
The website appears to run behind a proxy server, which adds to your request's loading time. This is why it does not connect directly to the main page.
It might also be that the elements are rendered by JavaScript AFTER the page has loaded, so you only get the page and not the JavaScript-rendered parts. You can try increasing your sleep() time, but I don't think that will help.
You can also use a library called Selenium. It automates browsers, and you can use its page_source property to obtain the HTML source code.
Code:
from selenium import webdriver
browser = webdriver.Firefox()
browser.get("http://example.com")
html_source = browser.page_source
With Selenium, you can also use an XPath expression to locate the data and extract the price information from this website. Alternatively, once you have the HTML, you can use a parser such as bs4 to extract the required data.

Python Requests only pulling half of intended tags

I'm attempting to scrape a website, and pull each sheriff's name and county. I'm using devtools in chrome to identify the HTML tag needed to locate that information.
import requests
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
URL = 'https://oregonsheriffs.org/about-ossa/meet-your-sheriffs'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
sheriff_names = soup.find_all('a', class_ = 'eg-sheriff-list-skin-element-1')
sheriff_counties = soup.find_all(class_ = 'eg-sheriff-list-skin-element-2')
However, I'm finding that Requests is not pulling the entire page's HTML, even though the closing tag is at the end. If I scan page.content, I find that Sheriff Harrold is the last sheriff included, and that every sheriff from Curtis Landers onwards is missing (I tried pasting the full output of page.content, but it's too long).
My best guess from reading this answer is that the website has JavaScript that loads the remaining part of the page upon interaction, which would imply that I need to use something like Selenium to interact with the page and get the rest of it to load first.
However, if you look at the website, it's very simple, so as a novice part of me thinks there has to be a way to scrape this basic website without a more complex tool like Selenium. That said, I recognize that the website is WordPress-generated, and WordPress can set up delayed JavaScript even on simple websites.
My questions are:
1) Do I really need to use Selenium to scrape a simple, WordPress-generated website like this? Or is there a way to get the full page to load with just Requests? Is there any way to tell when a web page will require a web driver and when Requests will be enough?
2) I'm thinking one step ahead here: if I want to scale up this project, how would I be able to tell that Requests has not returned the full website, without manually inspecting the results for every website?
Thanks!

Unfortunately, your initial instinct is almost certainly correct. If you look at the page source it seems that they have some sort of lazy loading going on, pulling content from an external source.
A quick look at the page source indicates that they're probably using the "Essential Grid" WordPress plugin to do this. I think this supports preloading. If you look at the requests that are made, you might be able to work out how it's loading this and pull directly from that source (perhaps a REST call, AJAX, etc.).
In a generalized sense, I'm afraid that there really isn't any automated way to programmatically determine if a page has 'fully' loaded, as that behavior is defined in code and can be triggered by anything.
If you want to capture information from pages that load content as you scroll, though, I believe Selenium is the tool you'll have to use.
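
There is no general test, as noted above, but for a site whose layout you already know, a crude per-site sanity check can flag responses that look truncated. The sketch below is only that: page_problems is a hypothetical helper, and the expected text and counts are things you would have to supply yourself after inspecting the page once in a browser. Counting raw substring occurrences is deliberately simple and can overcount (for example when the class name also appears in inline CSS).

```python
def page_problems(html, must_contain=(), min_counts=None):
    """Return a list of reasons the fetched HTML looks incomplete (empty if none)."""
    problems = []
    # Text that should appear somewhere on a fully loaded page.
    for text in must_contain:
        if text not in html:
            problems.append("missing text: " + text)
    # Markers (such as a class name) that should appear at least N times.
    for needle, minimum in (min_counts or {}).items():
        found = html.count(needle)
        if found < minimum:
            problems.append(
                "%s: found %d, expected at least %d" % (needle, found, minimum)
            )
    return problems
```

For the page in the question, one might call page_problems(page.text, min_counts={'eg-sheriff-list-skin-element-1': 36}), assuming one entry per Oregon county; any non-empty result means the response deserves a closer look.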

Is there a way to get information about elements from the inspect menu in a website?

I've tried to get the world population from this website: https://www.worldometers.info/world-population/
but I can only get the html code, not the data of the actual numbers.
I already tried to find children of the object I tried to get data from. I also tried to list the whole object, but nothing seemed to work.
# imports
import requests
from bs4 import BeautifulSoup
# fetch the page and parse its HTML
r = requests.get('https://www.worldometers.info/world-population/')
soup = BeautifulSoup(r.text, 'html.parser')
# this only finds the one object that is listed below
current_population = soup.find('div', {'class': 'maincounter-number'}).find_all('span', recursive=False)
print(current_population)
This is the object the information is stored in:
<span class="rts-counter" rel="current_population">retrieving data... </span>
and in 'inspect-mode' you can see this:
<span class="rts-counter" rel="current_population"><span class="rts-nr-sign"></span><span class="rts-nr-int rts-nr-10e9">7</span><span class="rts-nr-thsep">,</span><span class="rts-nr-int rts-nr-10e6">703</span><span class="rts-nr-thsep">,</span><span class="rts-nr-int rts-nr-10e3">227</span><span class="rts-nr-thsep">,</span><span class="rts-nr-int rts-nr-10e0">630</span></span>
I always only get the first one, but want to get the second one from 'inspect-mode'.

You are going to need a tool that lets JavaScript run, such as Selenium, because this number is set by a counter generated in this script: https://www.realtimestatistics.net/rts/RTSp.js
from selenium import webdriver
from selenium.webdriver.common.by import By
d = webdriver.Chrome()
d.get('https://www.worldometers.info/world-population/')
# read the text of the element whose rel attribute is current_population
print(d.find_element(By.CSS_SELECTOR, '[rel="current_population"]').text)
You could try writing your own version of that JavaScript, but I wouldn't recommend it.
I didn't need an explicit wait condition for the Selenium script, but one could be added.

The website you are scraping is a JavaScript web app. The element content you see in inspect mode is the result of running some JavaScript code after the page downloads that populates that element. Prior to the JavaScript running, the element only contains the text "retrieving data...", which is what you see in your Python code. Neither the Python requests library nor BeautifulSoup run JavaScript in downloaded HTML -- they only download and parse the HTML, and that is why your code only sees the initial text.
You have two options:
1) Inspect the JavaScript code or website calls and figure out what HTTP URL the page is calling to retrieve the value it puts into that element. Have your Python code fetch that URL instead and parse the value from the response for that URL.
2) Use a full browser engine. This StackOverflow answer provides a solution: Web-scraping JavaScript page with Python
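
To make the difference concrete, the sketch below parses the two snippets quoted in the question with the standard library's html.parser (swapped in for BeautifulSoup so it runs without extra installs). The names RelTextParser and text_of are hypothetical. It only illustrates that a parser returns whatever text is in the HTML it is handed: the placeholder for the downloaded page, the number for the browser-rendered DOM.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not change the nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class RelTextParser(HTMLParser):
    """Collect the text inside the first element with a given rel attribute."""

    def __init__(self, rel):
        super().__init__()
        self.rel = rel
        self.depth = 0      # greater than 0 while inside the target element
        self.done = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done or tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("rel") == self.rel:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def text_of(html, rel):
    """Return the stripped text of the first element whose rel equals rel."""
    parser = RelTextParser(rel)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()


# What requests downloads (from the question).
STATIC = '<span class="rts-counter" rel="current_population">retrieving data... </span>'
# What the browser shows after JavaScript has run (from the question).
RENDERED = (
    '<span class="rts-counter" rel="current_population">'
    '<span class="rts-nr-sign"></span>'
    '<span class="rts-nr-int rts-nr-10e9">7</span><span class="rts-nr-thsep">,</span>'
    '<span class="rts-nr-int rts-nr-10e6">703</span><span class="rts-nr-thsep">,</span>'
    '<span class="rts-nr-int rts-nr-10e3">227</span><span class="rts-nr-thsep">,</span>'
    '<span class="rts-nr-int rts-nr-10e0">630</span></span>'
)

print(text_of(STATIC, "current_population"))    # retrieving data...
print(text_of(RENDERED, "current_population"))  # 7,703,227,630
```

A check like text_of(...) == "retrieving data..." can also serve as a signal that a page needs a browser engine before it is worth parsing.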

The content is rendered into the DOM by JavaScript, so Beautiful Soup alone will not work as you want it to.
You will need something that lets JavaScript run (e.g. a browser), so you could make your own browser using Qt4 or the like. Sentdex has a good tutorial on it here:
https://www.youtube.com/watch?v=FSH77vnOGqU
Otherwise, you could use Selenium:
from selenium import webdriver
import time
driver = webdriver.Firefox()
driver.get('https://www.worldometers.info/world-population/')
# give the page's JavaScript a few seconds to fill in the counter
time.sleep(5)
html = driver.page_source

data mining from website using xpath in python

I ran this program, but it gives me only "[]" instead of the web page data. Please help.
import urllib
import re
import lxml.html
start_link= "http://aepcindia.com/ApparelMarketplaces/detail"
html_string = urllib.urlopen(start_link)
dom = lxml.html.fromstring(html_string.read())
side_bar_link = dom.xpath("//*[@id='show_cont']/div/table/tr[2]/td[2]/text()")
print side_bar_link
file = open("next_page.txt","w")
for link in side_bar_link:
    file.write(link)
    print link
file.close()

The HTML source you are downloading contains an empty content area: <div id="show_cont"></div>. This div is populated later by a JavaScript function, showData(). When you look at the page in a browser, that JavaScript has already run, which is not the case when you just download the HTML source using urllib.
To get the data you want, you can try to mimic the POST request in the showData() function or, preferably, scrape the website using a scriptable headless browser.
Update: While a headless browser would be a much more generally applicable approach, in this case it might be overkill. You will actually be better off reverse engineering the showData() function. The AJAX call in it is easy to spot, it delivers a plain HTML table, and you can also limit searches :)
http://aepcindia.com/ApparelMarketplaces/ajax_detail/search_type:/search_value:

Using Python requests.get to parse html code that does not load at once

I am trying to write a Python script that will periodically check a website to see if an item is available. I have used requests.get, lxml.html, and xpath successfully in the past to automate website searches. In the case of this particular URL (http://www.anthropologie.com/anthro/product/4120200892474.jsp?cm_vc=SEARCH_RESULTS#/) and others on the same website, my code was not working.
import requests
from lxml import html
page = requests.get("http://www.anthropologie.com/anthro/product/4120200892474.jsp?cm_vc=SEARCH_RESULTS#/")
tree = html.fromstring(page.text)
html_element = tree.xpath(".//div[@class='product-soldout ng-scope']")
At this point, html_element should be a list of elements (I think in this case only one), but instead it is empty. I think this is because the website is not loading all at once, so when requests.get() goes out and grabs it, it only grabs the first part. So my questions are
1: Am I correct in my assessment of the problem?
and
2: If so, is there a way to make requests.get() wait before returning the html, or perhaps another route entirely to get the whole page.
Thanks
Edit: Thanks to both responses. I used Selenium and got my script working.

You are not correct in your assessment of the problem.
You can check the results and see that there's a </html> right near the end. That means you've got the whole page.
And requests.text always grabs the whole page; if you want to stream it a bit at a time, you have to do so explicitly.
Your problem is that the table doesn't actually exist in the HTML; it's built dynamically by client-side JavaScript. You can see that by actually reading the HTML that's returned. So, unless you run that JavaScript, you don't have the information.
There are a number of general solutions to that. For example:
Use selenium or similar to drive an actual browser to download the page.
Manually work out what the JavaScript code does and do equivalent work in Python.
Run a headless JavaScript interpreter against a DOM that you've built up.

The page uses JavaScript to load the table, and that has not happened when requests gets the HTML. So you are getting all of the HTML, just not what JavaScript generates. You could use Selenium combined with PhantomJS for headless browsing to get the HTML:
from selenium import webdriver
browser = webdriver.PhantomJS()
browser.get("http://www.anthropologie.eu/anthro/index.jsp#/")
html = browser.page_source
print(html)
