Hi, I am trying to make this basic scraper work. It should go to a website, fill in "City" and "Area", search for restaurants, and return the HTML page.
This is the code I'm using:
import requests
from collections import OrderedDict
from bs4 import BeautifulSoup

payload = OrderedDict([('cityId', 'NewYork'), ('area', 'Centralpark')])
req = requests.get("http://www.somewebsite.com", params=payload)
f = req.content
soup = BeautifulSoup(f, "html.parser")
And here is what the source HTML looks like:
When I check the resulting soup variable, it doesn't contain the search results; it only has the data from the first page, which holds the form for entering the city and area values (i.e. www.somewebsite.com, whereas what I want is the result of www.somewebsite.com?cityId=NewYork&area=centralPark). Is there anything I have to pass along with the params to explicitly press the search button, or is there another way to make this work?
You first need to check whether you can visit the URL in a web browser and get the correct result.
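For example, you can confirm what requests actually asked for by printing the final URL and comparing it with the one that works in your browser; if the site's search form submits via POST rather than GET, the query-string version will just return the form page again. A minimal sketch, reusing the placeholder URL and parameter names from the question:
import requests
from bs4 import BeautifulSoup

payload = {'cityId': 'NewYork', 'area': 'Centralpark'}
req = requests.get("http://www.somewebsite.com", params=payload)

print(req.url)          # should match ...?cityId=NewYork&area=Centralpark
print(req.status_code)  # 200 does not guarantee the search actually ran

# If the form's method attribute (or the browser's Network tab) shows a POST,
# try sending the same fields as form data instead:
# req = requests.post("http://www.somewebsite.com", data=payload)

soup = BeautifulSoup(req.content, "html.parser")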
Related
I'm trying to scrape the press releases of a Danish political party (https://danskfolkeparti.dk/nyheder/) but the content of the press releases only appears after clicking 'search' within a web browser. There is no navigable html (that I can find) that allows for identifying a unique URL for the 'search' function, and the website URL does not change after clicking search within a browser.
import pandas as pd
import requests
from bs4 import BeautifulSoup
url = 'http://danskfolkeparti.dk/nyheder.html'
headers = {'User-Agent': 'Mozilla/5.0'}
soup = BeautifulSoup(requests.get(url=url, headers=headers).content, 'html.parser')
### the data I'm looking for would usually be accessible using something like the following.
### However, the HTML does not appear until AFTER the search is clicked within a browser
soup.find_all("div", class_='timeline')
Printing soup shows the HTML without the desired content. The search button on the website (Søg, in Danish) is not accessible as an endpoint. After clicking the search button (<Søg>) in a web browser, the desired content appears and is viewable by 'inspecting' the page, but the URL does not change, so there's no clear way to access the page with BeautifulSoup.
The desired content is the title, url and date of each individual press release. For example, the first press release that appears when searching with default settings is the following:
title: Året 2022 viste hvad Dansk Folkeparti er gjort af
date: 23/12/2022
url: https://danskfolkeparti.dk/nyheder/mortens-nyhedsbrev/aaret-2022-viste-hvad-dansk-folkeparti-er-gjort-af/
Any help with this would be greatly appreciated!!
the HTML does not appear until AFTER the search is clicked
That was not what I experienced - when I went to the nyheder page, there were already 10 posts on the timeline, and more loaded when I scrolled down.
However, it's true that the HTML fetched by requests.get does not contain the timeline. It's an empty frame with just the top and bottom panes of the page, and the rest is rendered with JavaScript. I can suggest two ways to get around this - either use Selenium or scrape via their API.
Solution 1: Selenium
I have 2 functions which I often use for scraping:
linkToSoup_selenium which takes a URL and [if everything goes ok] returns a BeautifulSoup object. For this site, you can use it to:
scroll down a certain number of times [it's best to over-estimate how many scrolls you need]
wait for the links and dates to load
click "Accept Cookies" (if you want to, doesn't make a difference tbh)
selectForList which takes a bs4 Tag and list of CSS selectors and returns the corresponding details from that Tag
(If you are unfamiliar with CSS selectors, I often use this reference as a cheatsheet.)
So, you can set up a reference dictionary (selRef) of selectors [that will be passed to selectForList later] and then fetch and parse the loaded HTML with linkToSoup_selenium:
selRef = {
    'title': 'div.content>a.post-link[href]', 'date': 'p.text-date',
    'url': ('div.content>a.post-link[href]', 'href'),
    # 'category': 'div.content>p.post-category-timeline',
    # 'excerpt': 'div.content>p.post-content-rendered',
}
soup = linkToSoup_selenium('https://danskfolkeparti.dk/nyheder/', ecx=[
    'div.timeline>div>div.content>a.post-link[href]',  # load links [+titles]
    'div.timeline>div>p.text-date'  # load dates [probably redundant]
], clickFirst=[
    'a[role="button"][data-cli_action="accept_all"]'  # accept cookies [optional]
], by_method='css', scrollN=(20, 5), returnErr=True)  # scroll 20x with 5sec breaks
Since returnErr=True is set, the function will return a string containing an error message if something causes it to fail, so you should habitually check for that before trying to extract the data.
if isinstance(soup, str):
    print(type(soup), miniStr(soup)[:100])  # print error message
    prTimeline = []
else:
    prTimeline = [{k: v for k, v in zip(
        list(selRef.keys()), selectForList(pr, list(selRef.values()))
    )} for pr in soup.select('div.timeline>div')]
Now prTimeline looks something like
## [only the first 3 of 76 are included below] ##
[{'title': 'Året 2022 viste hvad Dansk Folkeparti er gjort af',
'date': '23/12/2022',
'url': 'https://danskfolkeparti.dk/nyheder/mortens-nyhedsbrev/aaret-2022-viste-hvad-dansk-folkeparti-er-gjort-af/'},
{'title': 'Mette Frederiksen har stemt danskerne hjem til 2001',
'date': '13/12/2022',
'url': 'https://danskfolkeparti.dk/nyheder/mortens-nyhedsbrev/mette-frederiksen-har-stemt-danskerne-hjem-til-2001/'},
{'title': 'Vi klarede folketingsvalget – men der skal kæmpes lidt endnu',
'date': '23/11/2022',
'url': 'https://danskfolkeparti.dk/nyheder/mortens-nyhedsbrev/vi-klarede-folketingsvalget-men-der-skal-kaempes-lidt-endnu/'}]
Solution 2: API
If you open the Network Tab before clicking search (or just scrolling all the way down, or refreshing the page), you might see this request with a JSON response that is being used to fetch the data for populating the timeline. So, you just need to replicate this API request.
However, as @SergeyK commented,
site have Cloudflare protection, so u cant get result without setting up cookies
and the same seems to be true for the API as well. I'm not good with setting headers and cookies as needed; so instead, I tend to just use cloudscraper [or HTMLSession sometimes] in such cases.
import cloudscraper

qStr = 'categories=20,85,15,83,73,84,&before=2022-12-28T23:59:59'
qStr += '&after=1990-05-08T01:01:01&per_page=99&page=1'
apiUrl = f'https://danskfolkeparti.dk/wp-json/wp/v2/posts?{qStr}'
prTimeline = [{
    'title': pr['title']['rendered'],
    'date': pr['date'],  # 'date': pr['date_gmt'],
    'url': pr['link']
} for pr in cloudscraper.create_scraper().get(apiUrl).json()]
and the resulting prTimeline looks pretty similar to the Selenium output.
There's an expanded version using a set of functions that lets you get the same results with
prTimeline, rStatus = danskfolkeparti_apiScraper(pathsRef={'title': ['title', 'rendered'], 'date': ['date'], 'url': ['link']})
But you can do much more, like passing searchFor={'before': '2022-10-01T00:00:00'} to only get posts before October, or searchFor="search terms" to search by keywords but
you can't search for keywords and also set parameters like category/time/etc
you have to make sure that before and after are in ISO format, and that page is a positive integer, and that categories is a list of integers [or they might be ignored]
You can get more information if you leave the default arguments and make use of all of the functions, as below:
# from bs4 import BeautifulSoup
### FIRST PASTE EVERYTHING FROM https://pastebin.com/aSQrW9ff ###
prTimeline, prtStatus = danskfolkeparti_apiScraper()
prCats = danskfolkeparti_catNames([pr['category'] for pr in prTimeline])
for pi, (pr, prCat) in enumerate(zip(prTimeline, prCats)):
    prTimeline[pi]['category'] = prCat
    cTxt = BeautifulSoup(pr['content'], 'html.parser').get_text(' ')
    cTxt = ' '.join(w for w in cTxt.split() if w)  # reduce whitespace
    prTimeline[pi]['content'] = cTxt
FULL RESULTS
The full results from the last code snippet, as well as the selenium solution (with ALL of selRef uncommented) have been uploaded to this spreadsheet. The CSVs were saved using pandas.
# import pandas
fileName = 'prTimeline.csv' # 'dfn_prTimeline_api.csv' # 'dfn_prTimeline_sel.csv'
pandas.DataFrame(prTimeline).to_csv(fileName, index=False)
Also, if you are curious, you can see the categories in the default API call with danskfolkeparti_categories([20,85,15,83,73,84]).
There are 4000 pages of data shown in the image of a website, BscScan. Is there any way to go directly to a certain page number, say 1000, using Python? I only know the long way of doing it, i.e. going to page 1, 2, 3, 4, and so on, but it would take too much time to reach page 1000.
Web-page image
After sniffing around the page I can see that there is a method to change the page via a URL.
1. Go to the page: https://bscscan.com/token/0x8076c74c5e3f5852037f31ff0093eeb8c8add8d3
2. View the page source. Search for a var called sid. It will look like this: var sid = '877a5d483a2d45460c42adf16029f473';
3. Use that sid in the following URL: https://bscscan.com/token/generic-tokentxns2?m=normal&contractAddress=0x8076c74c5e3f5852037f31ff0093eeb8c8add8d3&a=&sid=PUTTHESIDHERE&p=1
4. You will now be able to change the page by changing the page number (p=1 in the URL in step 3).
You can use this with Selenium to extract the data you need. You will need a headless browser, as the SID is generated for each session.
You can extract the sid programmatically by applying a regex to the page source:
import requests
import re

def get_sid(url):
    # pull the session id out of the page source
    r = requests.get(url)
    pattern = re.findall(r"var sid = '[^']*", r.text)
    sid = pattern[0].replace("var sid = '", "")
    return sid
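Putting it together, here is a rough usage sketch that jumps straight to a page; the endpoint and parameters are copied from the URL in step 3, but since the sid is tied to a session, plain requests may still be rejected and you may need the Selenium route mentioned above:
base = "https://bscscan.com/token/0x8076c74c5e3f5852037f31ff0093eeb8c8add8d3"
sid = get_sid(base)  # regex helper from above

# build the paging URL from step 3 and ask for page 1000 directly
page_url = ("https://bscscan.com/token/generic-tokentxns2"
            "?m=normal&contractAddress=0x8076c74c5e3f5852037f31ff0093eeb8c8add8d3"
            f"&a=&sid={sid}&p=1000")
page = requests.get(page_url)
print(page.status_code)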
You already located the button anyway:
next_button = driver.find_element_by_xpath('.//li[@data-original-title="Go to Next"]/a')
Then change the address of the button (see "Edit element in browser with python selenium"), specifically the last argument, which is the page number, and finally:
next_button.click()
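If you want to do the "change the address" step from Python as well, one option is to rewrite the link's href in the live DOM with execute_script before clicking. A rough sketch, assuming the page number is the last =-separated value in the href (verify that in the browser first):
# locate the "Go to Next" link as above
next_button = driver.find_element_by_xpath('.//li[@data-original-title="Go to Next"]/a')

target_page = 1000  # hypothetical page you want to jump to
old_href = next_button.get_attribute('href')
new_href = old_href.rsplit('=', 1)[0] + '=' + str(target_page)

# overwrite the href attribute in the DOM, then click as before
driver.execute_script("arguments[0].setAttribute('href', arguments[1]);",
                      next_button, new_href)
next_button.click()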
I'm trying to scrape matches and their respective odds from a local bookie site, but for every site I try, my web scraper doesn't return anything; it just prints "Process finished with exit code 0".
Can someone help me crack open the containers and get out the contents.
I have tried the sites below for almost a month with no success. The problem seems to be with the exact div, class, or span element layout.
https://www.betlion.co.ug/
https://www.betpawa.ug/
https://www.premierbet.ug/
For example, I tried link 2 in the code shown below:
import requests
from bs4 import BeautifulSoup

url = "https://www.betpawa.ug/"
response = requests.get(url, timeout=5)
content = BeautifulSoup(response.content, "html.parser")

for match in content.findAll("div", attrs={"class": "events-container prematch", "id": "Bp-Event-591531"}):
    print(match.text.strip())
I expect the program to return a list of matches, odds, and all the other components of the container. However, the program runs and just prints "Process finished with exit code 0" and nothing else.
It looks like the base site gets loaded in two phases:
Load some HTML structure for the page,
Use JavaScript to fill in the contents
You can prove this to yourself by right-clicking on the page, choosing "view page source", and then searching for "events-container" (it is not there).
So you'll need something more powerful than requests + bs4. I have heard of folks using Selenium to do this, but I'm not familiar with it.
You could consider using urllib.request instead of requests:
from urllib.request import Request, urlopen
- build your request:
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
- retrieve the document:
res = urlopen(req)
- parse it using bs4:
html = BeautifulSoup(res, 'html.parser')
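Put together, that looks roughly like the snippet below. Note that, just like requests, this only fetches the initial HTML, so it still won't see the JavaScript-rendered odds; it mainly helps when a site blocks the default requests user agent.
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

url = "https://www.betpawa.ug/"
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
res = urlopen(req)
html = BeautifulSoup(res, 'html.parser')
print(html.title)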
Like Chris Curvey described, the problem is that requests can't execute the JavaScript on the page. If you print your content variable, you can see that the page displays a message like: "JavaScript Required! To provide you with the best possible product, our website requires JavaScript to function..." With Selenium you control a full browser in the form of a WebDriver (for example, the ChromeDriver binary for the Google Chrome browser):
from bs4 import BeautifulSoup
from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
# chrome_options.add_argument('headless')
driver = webdriver.Chrome(chrome_options=chrome_options)

url = "https://www.betpawa.ug/"
driver.get(url)
page = driver.page_source
content = BeautifulSoup(page, 'html.parser')

for match in content.findAll("div", attrs={"class": "events-container"}):
    print(match.text.strip())
Update:
In line 13, the command print(match.text.strip()) simply extracts only the text elements of each match-div that has the class attribute "events-container".
If you want to extract more specific content, you can access each match through the match variable.
You need to know:
- which of the available pieces of information you want,
- how to identify that information inside the match-div's structure,
- and in which data type you need it.
To make it easy, run the program and open Chrome's developer tools with the F12 key; in the top-left corner you will see the icon for "select an element ...".
If you click on the icon and then click the desired element in the browser, the equivalent source appears in the area below the icon.
Analyse it carefully to get the info you need, for example:
- The title of the football match is the first h3-tag in the match-div and is a string.
- The odds shown are span-tags with the class event-odds and are numbers (float/double).
Search for the functions you need on Google or in the reference of the package you use (BeautifulSoup4).
Let's try to get it quick and dirty by using the BeautifulSoup functions on the match variable, so we don't pick up elements from the full site:
# (1) let's try to find the h3-tag
title_tags = match.findAll("h3")        # use on the match variable
if len(title_tags) > 0:                 # at least one found?
    title = title_tags[0].getText()     # get the text of the first one
    print("Title: ", title)             # show it
else:
    print("no h3-tags found")
    exit()

# (2) let's try to get some odds as numbers in the order in which they are displayed
odds_tags = match.findAll("span", attrs={"class": "event-odds"})
if len(odds_tags) > 2:                  # at least three found?
    odds = []                           # create a list
    for tag in odds_tags:               # loop over the odds_tags we found
        odd = tag.getText()             # get the text
        print("Odd: ", odd)
        # good, but it is a string; you can't compare it with a number in
        # Python and expect a good result.
        # You have to clean it and convert it:
        clean_odd = odd.strip()         # remove empty spaces
        odd = float(clean_odd)          # convert it to float
        print("Odd as Number:", odd)
        odds.append(odd)                # keep the converted odd in the list
else:
    print("something went wrong with the odds")
    exit()

input("Press enter to try it on the next match!")
I am trying to scrape data from this website. To access the tables, I need to click the "Search" button. I was able to successfully do this using mechanize:
br = mechanize.Browser()
br.open(url + 'Wildnew_Online_Status_New.aspx')
br.select_form(name='aspnetForm')
page = br.submit(id='ctl00_ContentPlaceHolder1_Button1')
"page" gives me the resulting webpage with the table, as needed. However, I'd like to iterate through the links to subsequent pages at the bottom, and this triggers javascript. I've heard mechanize does not support this, so I need a new strategy.
I believe I can get to subsequent pages using a post request from the requests library. However, I am not able to click "search" on the main page to get to the initial table. In other words, I want to replicate the above code using requests. I tried
s = requests.Session()
form_data = {'name': 'aspnetForm', 'id': 'ctl00_ContentPlaceHolder1_Button1'}
r = s.post('http://forestsclearance.nic.in/Wildnew_Online_Status_New.aspx', data=form_data)
Not sure why, but this returns the main page again (without clicking Search). Any help appreciated.
I think you should look into Scrapy.
You forgot some parameters in this POST request:
https://www.pastiebin.com/5bc6562304e3c
Check the POST request with Google dev tools.
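For an ASP.NET page like this, the missing parameters are usually the hidden __VIEWSTATE / __EVENTVALIDATION fields plus the button's own name/value pair. A hedged sketch of that idea (the field names follow the common ASP.NET pattern and the button name is guessed from the id in the question, so compare everything against the POST shown in the dev tools):
import requests
from bs4 import BeautifulSoup

s = requests.Session()
url = 'http://forestsclearance.nic.in/Wildnew_Online_Status_New.aspx'

# 1) GET the page once to collect every hidden input the form expects
soup = BeautifulSoup(s.get(url).content, 'html.parser')
form_data = {inp.get('name'): inp.get('value', '')
             for inp in soup.select('input[type=hidden]') if inp.get('name')}

# 2) add the button that triggers the search, then POST everything back
form_data['ctl00$ContentPlaceHolder1$Button1'] = 'Search'  # name guessed from the button id
r = s.post(url, data=form_data)
print(r.status_code)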
I'm trying to create a bot that checks for open classes. The webpage uses a cookie that is set when visiting the site; however, I can't seem to replicate this with requests/sessions in my code.
What it's supposed to do:
visit link 1 (creates the cookie) (search page)
visit link 2 which includes the search terms in the URL (search results)
When done in browser, Link 2 should show the search results
Issue:
I can create the cookie by visiting link 1,
but I can't use it with link 2, which includes the search terms;
this results in loading the same first link (the search page).
Here is some sample code I have tried:
s = requests.Session()
# create the cookie using first link
r = s.get(url)
# r2 should be search results
r2 = s.post(urlWithSearchTerms, cookies=r.cookies)
# parse html etc, however loads wrong page
data = r2.text
soup = BeautifulSoup(data,"html.parser")
print(soup.prettify())
Instead of loading the search results, it still loads the first page.
I also tried including r.headers, using sessions.post(url), using it without sessions, etc.
How would I get python to load the second page?
Thanks!
You are sending an HTTP POST request where you should be sending a GET.
change this line:
r2 = s.post(urlWithSearchTerms, cookies=r.cookies)
to:
r2 = s.get(urlWithSearchTerms, cookies=r.cookies)
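Since requests.Session already persists cookies between calls, you don't even need to pass cookies= explicitly; a compact sketch of the whole flow (url and urlWithSearchTerms are the placeholders from the question):
import requests
from bs4 import BeautifulSoup

s = requests.Session()
s.get(url)                      # first visit sets the cookie on the session
r2 = s.get(urlWithSearchTerms)  # the cookie is sent along automatically
soup = BeautifulSoup(r2.text, "html.parser")
print(soup.prettify())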