Solution: The action for this particular site is action="user/ajax/login", so that is what has to be appended to the main site's URL in order to post the payload (the action can be found by searching the page source with Ctrl+F for "action"). The url is the page that will be scraped. The with requests.Session() as s: block maintains the site's cookies across requests, which is what allows consistent scraping. The res variable holds the response from posting the payload to the login URL, which authenticates the session so you can scrape a page belonging to a specific account. After the POST, the session then GETs the specified URL. With this in place, BeautifulSoup can grab and parse the HTML of the account page; both "html.parser" and "lxml" work here. If the HTML you need lives inside an iframe, it's unlikely you can grab and parse it with requests alone, so I recommend using Selenium, preferably with Firefox.
import requests
from bs4 import BeautifulSoup

payload = {"username": "?????", "password": "?????"}
url = "https://9anime.to/user/watchlist"
loginurl = "https://9anime.to/user/ajax/login"

with requests.Session() as s:
    res = s.post(loginurl, data=payload)  # log in; the session keeps the cookies
    res = s.get(url)                      # fetch the account page with the same session

soup = BeautifulSoup(res.text, "html.parser")
[Windows 10] To install Selenium, run pip3 install selenium. The drivers are here: (Chrome: https://sites.google.com/a/chromium.org/chromedriver/downloads) (Firefox: https://github.com/mozilla/geckodriver/releases). To put "geckodriver" on PATH for Firefox Selenium: Control Panel -> "Environment Variables" -> "Path" -> "New" -> enter the folder containing geckodriver -> OK. Then you're all set.
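If you would rather not edit PATH, a minimal sketch (assuming Selenium 4 and a hypothetical location for geckodriver) is to point the driver at the executable directly:
from selenium import webdriver
from selenium.webdriver.firefox.service import Service

# Hypothetical path; replace with wherever you saved geckodriver
service = Service(r"C:\tools\geckodriver.exe")
driver = webdriver.Firefox(service=service)
driver.get("https://www.google.com/")
driver.quit()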
Also, in order to grab the iframes when using Selenium, try import time and time.sleep(5) after getting the URL with your driver. This gives the site extra time to load those iframes.
Example:
import time
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Firefox() # The WebDriver for this script
driver.get("https://www.google.com/")
time.sleep(5) # Extra time for the iframe(s) to load
soup = BeautifulSoup(driver.page_source, "lxml")
print(soup.prettify()) # To see full HTML content
print(soup.find_all("iframe")) # Finds all iframes
print(soup.find("iframe"))["src"] # If you need the 'src' from within an iframe.
You're trying to make a GET request to a URL that requires being logged in, so it is producing a 403 (Forbidden) error: the request is not authenticated to view the content.
If you think about it in terms of the URL you're constructing in your GET request, you would literally expose the username (x) and password (y) within the url like so:
https://9anime.to/user/watchlist?username=x&password=y
... which would of course be a security risk.
Without knowing what specific access you have to this particular site, in principle, you need to simulate authentication with a POST request first and then perform the GET request on that page afterwards. A successful response would return a 200 status code ('OK') and then you would be in a position to use BeautifulSoup to parse the content and target your desired part of that content from between the relevant HTML tags.
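As a minimal sketch of that flow (using the same payload and URLs as the code further down, with the status check added):
import requests

payload = {"username": "?????", "password": "?????"}
with requests.Session() as s:
    res = s.post("https://9anime.to/user/ajax/login", data=payload)
    if res.status_code == 200:            # 'OK' - the login request went through
        page = s.get("https://9anime.to/user/watchlist")
        print(page.status_code)           # should now be 200 rather than 403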
I suggest, to start, opening the address of the login page and connecting. Then add an
input('Enter something')
to pause the script while you log in (hit the ENTER key in the terminal to continue the process once connected, and voilà).
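A minimal sketch of that idea, assuming Selenium with Firefox as recommended earlier in this thread:
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Firefox()
driver.get("https://9anime.to")                    # open the site and log in by hand
input("Press ENTER once you are logged in...")     # pause until you confirm in the terminal

driver.get("https://9anime.to/user/watchlist")     # now fetch the page that needs auth
soup = BeautifulSoup(driver.page_source, "lxml")
print(soup.title)
driver.quit()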
Solved: The action tag was user/ajax/login in this case. So by appending that to the main URL of the website - not to https://9anime.to/user/watchlist but to https://9anime.to - you get https://9anime.to/user/ajax/login, and that is the login URL.
import requests
from bs4 import BeautifulSoup as bs

url = "https://9anime.to/user/watchlist"
loginurl = "https://9anime.to/user/ajax/login"
payload = {"username": "?????", "password": "?????"}

with requests.Session() as s:
    res = s.post(loginurl, data=payload)  # authenticate first
    res = s.get(url)                      # then request the watchlist with the same session
I am trying to grab an element from tradingview.com, specifically this link. I want the price of the symbol for whatever link I give my program. Looking through the page's elements, I noticed I can find the price of the stock here.
<div class="tv-symbol-price-quote__value js-symbol-last">
"3.065"
<span class>57851</span>
</div>
When running this code below, I get this output.
#This will not run on online IDE
import requests
from bs4 import BeautifulSoup
URL = "https://www.tradingview.com/symbols/NEARUSD/"
r = requests.get(URL)
soup = BeautifulSoup(r.content, 'html.parser') # If this line causes an error, run 'pip install html5lib'
L = [soup.find_all(class_ = "tv-symbol-price-quote__value js-symbol-last")]
print(L)
output
[[<div class="tv-symbol-price-quote__value js-symbol-last"></div>]]
How can I grab the entire price from this website? I would like the 3.065 as well as the 57851.
You have the most common problem: the page uses JavaScript to add/update elements, but BeautifulSoup/lxml and requests/urllib can't run JavaScript. You may need Selenium to control a real web browser that can run JS. Or use DevTools in Firefox/Chrome manually (Network tab) to see if JavaScript reads the data from some URL, and then try that URL with requests. JS usually gets JSON, which is easy to convert to a Python dictionary (without BS). You can also check whether the page has a (free) API for programmers.
Using DevTools I found that it uses JavaScript to send a POST (with some JSON data) to get the fresh price.
import requests
payload = {
    "columns": ["market_cap_calc", "market_cap_diluted_calc", "total_shares_outstanding", "total_shares_diluted", "total_value_traded"],
    "range": [0, 1],
    "symbols": {"tickers": ["BINANCE:NEARUSD"]}
}
url = 'https://scanner.tradingview.com/crypto/scan'
response = requests.post(url, json=payload)
print(response.text)
data = response.json()
print(data['data'][0]["d"][1]/1_000_000_000)
Result:
{"totalCount":1,"data":[{"s":"BINANCE:NEARUSD","d":[2507704855.0467912,3087555230,812197570,1000000000,106737372.9550421]}]}
3.08755523
EDIT:
It seems the above code gives only the market cap, and the page uses a websocket to get a fresh price every few seconds.
wss://data.tradingview.com/socket.io/websocket?from=symbols%2FNEARUSD%2F&date=2022_10_17-11_33
And this would need more complex code.
The other answer (with Selenium) gives you the correct value.
The webpage's contents are loaded dynamically by JavaScript, so you have to use an automation tool such as Selenium, or the hidden API.
Here I use selenium with bs4 to grab the desired dynamic content.
import time
from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.chrome.service import Service
webdriver_service = Service("./chromedriver") #Your chromedriver path
driver = webdriver.Chrome(service=webdriver_service)
url= "https://www.tradingview.com/symbols/NEARUSD/"
driver.get(url)
driver.maximize_window()
time.sleep(5)
soup = BeautifulSoup(driver.page_source,"lxml")
price = soup.find('div',class_ = "tv-symbol-price-quote__value js-symbol-last").get_text(strip=True)
print(price)
Output:
3.07525163
I created a bs4 web-scraping app with Python. My program returns an empty list for review. The soup itself parses normally.
from bs4 import BeautifulSoup
import requests
import pandas as pd
data = []
usernames = []
titles = []
comments = []
result = requests.get('https://www.kupujemprodajem.com/review.php?action=list')
soup = BeautifulSoup(result.text, 'html.parser')
review = soup.findAll('div', class_="single-review")
print(review)
for i in review:
    header = i.find('div', class_="single-review__header")
    footer = i.find('div', class_="comment-holder")
    username = header.find('a', class_="single-review__username").text
    title = header.find('div', class_="single-review__related-to").text
    comment = footer.find('div', class_="single-review__comment").text
    usernames.append(username)
    titles.append(title)
    comments.append(comment)
data.append(usernames)
data.append(titles)
data.append(comments)
print(data)
It isn't a problem with the class name.
It looks like the reason this doesn't work is that the website needs a login in order to access that page. If you were to visit https://www.kupujemprodajem.com/review.php?action=list in a private browser tab, it would just take you to a login page.
There are two paths I can think of that you could take here:
Reverse engineer how the login process works and use the requests library to make a request to log in, then (most likely) take the session cookie from that response so you can request pages that require sign-in.
(much simpler) Use Selenium instead. Selenium is a library that allows you to control a full browser instance, so you can easily input credentials this way. Beautiful Soup, on the other hand, simply parses HTML, so things like authenticating often take much more work in Beautiful Soup than they do in Selenium. I'd definitely suggest looking into it if you haven't already; a sketch of this approach follows below.
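A minimal sketch of the Selenium path, assuming hypothetical form locators and login URL (the real id/name attributes of the login form would need to be checked in the page source):
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import time

driver = webdriver.Chrome()
driver.get("https://www.kupujemprodajem.com/login.php")    # hypothetical login URL

# 'email' and 'password' are placeholder locators; inspect the real form to find them
driver.find_element(By.NAME, "email").send_keys("you@example.com")
driver.find_element(By.NAME, "password").send_keys("your-password")
driver.find_element(By.CSS_SELECTOR, "button[type=submit]").click()
time.sleep(5)  # give the login time to complete

driver.get("https://www.kupujemprodajem.com/review.php?action=list")
soup = BeautifulSoup(driver.page_source, "html.parser")
print(soup.find_all('div', class_="single-review"))
driver.quit()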
I am trying to scrape some data from a specific website using the requests and Beautiful Soup libraries. Unfortunately, I am not receiving the HTML for that page, but for the parent page https://salesweb.civilview.com. Thank you for your help!
import requests
from bs4 import BeautifulSoup
example="https://salesweb.civilview.com/Sales/SaleDetails?PropertyId=473016965"
exampleGet=requests.get(example)
exampleGetText=exampleGet.text
soup = BeautifulSoup(exampleGetText,"lxml")
soup
You need to feed a cookie to the request:
import requests
from bs4 import BeautifulSoup
cookie = {'ASP.NET_SessionId': 'rk2b0dxast1eyu5jvxezltgh'}
example="https://salesweb.civilview.com/Sales/SaleDetails?PropertyId=473016964"
exampleGet=requests.get(example, cookies=cookie)
exampleGetText=exampleGet.text
soup = BeautifulSoup(exampleGetText,"lxml")
soup.title
<title>Sales Listing Detail</title>
That particular cookie may not work for you, so you'll need manually navigate to that page one time, then go into the developer (web inspector) tools in your browser, and lookup the cookie under "Headers" in the network tab. My cookie looked like 'ASP.NET_SessionId=rk2b0dxast1eyu5jvxezltgh'.
The cookie should be valid for other property pages as well.
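For instance, a quick sketch of reusing the same session cookie across several property pages (the PropertyId values here are just the two that appear in this question and answer; substitute your own):
import requests
from bs4 import BeautifulSoup

cookie = {'ASP.NET_SessionId': 'rk2b0dxast1eyu5jvxezltgh'}  # replace with your own session id
property_ids = [473016964, 473016965]

for pid in property_ids:
    url = "https://salesweb.civilview.com/Sales/SaleDetails?PropertyId={}".format(pid)
    res = requests.get(url, cookies=cookie)
    soup = BeautifulSoup(res.text, "lxml")
    print(pid, soup.title)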
I'm trying to access a webpage and return all of the hyperlinks on that page. I'm using the same code from a question that was answered here.
I wish to access this correct page, but it is only returning content from this incorrect page.
Here is the code I am running:
import httplib2
from bs4 import SoupStrainer, BeautifulSoup
http = httplib2.Http()
status, response = http.request('https://www.iparkit.com/Minneapolis')
for link in BeautifulSoup(response, 'html.parser', parseOnlyThese=SoupStrainer('a')):
    if link.has_attr('href'):
        print(link['href'])
Results:
/account
/monthlyAccount
/myproducts
/search
/
{{Market}}/search?expressSearch=true&timezone={{timezone}}
{{Market}}/search?expressSearch=false&timezone={{timezone}}
{{Market}}/events
monthly?timezone={{timezone}}
/login?next={{ getNextLocation(browserLocation) }}
/account
/monthlyAccount
/myproducts
find
parking-app
iparkit-express
https://interpark.custhelp.com
contact
/
/parking-app
find
https://interpark.custhelp.com
/monthly
/iparkit-express
/partners
/privacy
/terms
/contact
/events
I don't mind returning the above results, but it doesn't return any links that could get me to the page I want. Maybe it's protected? Any ideas or suggestions? Thank you in advance.
The page you are trying to scrape is fully generated by JavaScript.
This http.request('https://www.iparkit.com/Minneapolis') would give almost nothing in this case.
Instead, you must do what a real browser does - process the JavaScript, then scrape what has been rendered. For this you can try Selenium.
For your page, after running JavaScript you will get ~84 URLs, while trying to scrape without running JavaScript, you would get ~7 URLs.
from selenium import webdriver
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome('PATH_TO_CHROME_WEBDRIVER', chrome_options=chrome_options)
driver.get('https://www.iparkit.com/Minneapolis')
content = driver.page_source
Then you extract what you want from that content using BeautifulSoup in your case.
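For example, a short sketch of that extraction step, pulling the now-rendered links out of the Selenium page source in the same way as the original snippet:
from bs4 import BeautifulSoup

soup = BeautifulSoup(content, 'html.parser')  # 'content' is driver.page_source from above
for link in soup.find_all('a', href=True):
    print(link['href'])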
I want to get the "Problems Solved" count from a HackerEarth profile page, for example,
https://www.hackerearth.com/#babe
When I inspect the element, I can see the value inside an element with the class dark-weight 700, but when I view the page source I cannot find that class. I think the content is loaded by JavaScript. Therefore, when I use Python's bs4 library, it returns None for that element.
I do not want to use Selenium because it opens a new browser window, and I am doing all this within Django, so I want the scripts to run in the backend without any interruption and return only the number of problems solved, that is, 119.
Fortunately the data is loaded via a publicly available API (/users/pagelets/babe/coding-data/ for this user), so you can get the info with requests and bs4.
import requests
from bs4 import BeautifulSoup
user = 'babe'
url = 'https://www.hackerearth.com/users/pagelets/{}/coding-data/'.format(user)
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
problems_solved = soup.find(string='Problems Solved').find_next().text
print(problems_solved)
119