So, I successfully used requests_html to render JavaScript for Dailymotion.com. Now I tried the same with Odysee.com, but using render() just gives me a big data mess with nothing in it that relates to the search I make there.
I suppose that rendering the JavaScript is just not the right way to get data from Odysee. But maybe somebody knows what WOULD be a way?
I'll post my code, and below that a little excerpt from the data that is obtained:
from requests_html import HTMLSession

def search(search_terms):
    session = HTMLSession()
    url = f"https://odysee.com/$/search?q={search_terms}"
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'}
    try:
        resp = session.get(url, timeout=10, headers=headers)
    except:
        return "connection error"
    resp.html.render()
    print(resp.html.html)
    print(resp.html.text)

search("clannad")
Data mess:
<script src="/public/ui-202302061938.f1d5d3f511.js" async=""></script>
<style type="text/css">:root{--breakpoint-medium: 1150px}[dir=ltr] .pulse{animation:pulse 2s infinite ease-in-out}[dir=rtl] .pulse{animation:pulse 2s infinite ease-in-out}.header{z-index:3;position:fixed;top:0;width:100vw;font-size:var(--font-body);user-select:none;-webkit-user-select:none;-webkit-app-region:drag;-webkit-backdrop-filter:blur(4px);backdrop-filter:blur(4px)}[dir] .header{background-color:var(--color-header-background)} ...
[truncated: the rest is several thousand more characters of inlined CSS rules, none of it related to the search results]
I found a solution. By looking at the headers that are sent and received, I found this link:
https://lighthouse.odysee.tv/search?s=YOURQUERYHERE&size=20&from=0&claimType=f
which returns a JSON object with names and "claimId"s. To open a result, one just uses odysee.com/name:ClaimId.
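A minimal sketch of querying that endpoint directly with requests (the parameter names are taken from the URL above; the exact "name" and "claimId" field names in the JSON response are assumptions):
import requests

def odysee_search(query, size=20, offset=0):
    # Query the Lighthouse endpoint found via the network headers
    resp = requests.get(
        "https://lighthouse.odysee.tv/search",
        params={"s": query, "size": size, "from": offset, "claimType": "f"},
        timeout=10,
    )
    resp.raise_for_status()
    # Combine each result's name and claimId into a watchable odysee.com URL
    return [f"https://odysee.com/{item['name']}:{item['claimId']}" for item in resp.json()]

print(odysee_search("clannad"))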
Additional information here:
https://github.com/searx/searx/issues/2504
How can I bypass a 503 with BS4?
Selenium takes a long time to run, so I would rather not use it.
The site to request is https://mangalib.me/.
Changing the user-agent did not help.
There is no loop in the code; the error comes from the very first request.
header = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'accept-language': 'en-US,en;q=0.8',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
}
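(The request itself is not shown in the question, but presumably it looked something like this, using the header dict and the site above.)
import requests

r = requests.get("https://mangalib.me/", headers=header)
print(r.status_code)  # 503 on the very first request, as described above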
I did just:
import requests
r = requests.get("https://mangalib.me/")
and got a 503 too. In r.text I found:
<noscript>
<h1 data-translate="turn_on_js" style="color:#bd2426;">Please turn JavaScript on and reload the page.</h1>
</noscript>
and later
<div id="no-cookie-warning" class="cookie-warning" data-translate="turn_on_cookies" style="display:none">
<p data-translate="turn_on_cookies" style="color:#bd2426;">Please enable Cookies and reload the page.</p>
</div>
<p data-translate="process_is_automatic">This process is automatic. Your browser will redirect to your requested content shortly.</p>
<p data-translate="allow_5_secs" id="cf-spinner-allow-5-secs" >Please allow up to 5 seconds…</p>
So I suspect you need a tool with JavaScript support if you want to access this page.
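A minimal sketch of that idea, reusing the requests_html approach from the first question above (untested against this site; a Cloudflare-style challenge like the one shown here may still defeat a headless renderer):
from requests_html import HTMLSession

session = HTMLSession()
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
r = session.get("https://mangalib.me/", headers=headers)
r.html.render()            # executes the page's JavaScript in a headless Chromium
print(r.html.html[:1000])  # check whether the challenge page has been replaced by real content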
I am trying to scrape data from Yahoo Finance at this URL: https://finance.yahoo.com/quote/AAPL/key-statistics?p=AAPL. After running the Python code below, I get the following HTML response:
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import requests, lxml
from lxml import html
stockStatDict = {}
stockSymbol = 'AAPL'
URL = 'https://finance.yahoo.com/quote/'+ stockSymbol + '/key-statistics?p=' + stockSymbol
page = requests.get(URL)
print(page.text)
<!DOCTYPE html>
<html lang="en-us"><head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<meta charset="utf-8">
<title>Yahoo</title>
<meta name="viewport" content="width=device-width,initial-scale=1,minimal-ui">
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
<style>
html {
height: 100%;
}
body {
background: #fafafc url(https://s.yimg.com/nn/img/sad-panda-201402200631.png) 50% 50%;
background-size: cover;
height: 100%;
text-align: center;
font: 300 18px "helvetica neue", helvetica, verdana, tahoma, arial, sans-serif;
}
table {
height: 100%;
width: 100%;
table-layout: fixed;
border-collapse: collapse;
border-spacing: 0;
border: none;
}
h1 {
font-size: 42px;
font-weight: 400;
color: #400090;
}
p {
color: #1A1A1A;
}
#message-1 {
font-weight: bold;
margin: 0;
}
#message-2 {
display: inline-block;
*display: inline;
zoom: 1;
max-width: 17em;
_width: 17em;
}
</style>
<script>
document.write('<img src="//geo.yahoo.com/b?s=1197757129&t='+new Date().getTime()+'&src=aws&err_url='+encodeURIComponent(document.URL)+'&err=%<pssc>&test='+encodeURIComponent('%<{Bucket}cqh[:200]>')+'" width="0px" height="0px"/>');var beacon = new Image();beacon.src="//bcn.fp.yahoo.com/p?s=1197757129&t="+new Date().getTime()+"&src=aws&err_url="+encodeURIComponent(document.URL)+"&err=%<pssc>&test="+encodeURIComponent('%<{Bucket}cqh[:200]>');
</script>
</head>
<body>
<!-- status code : 404 -->
<!-- Not Found on Server -->
<table>
<tbody><tr>
<td>
<img src="https://s.yimg.com/rz/p/yahoo_frontpage_en-US_s_f_p_205x58_frontpage.png" alt="Yahoo Logo">
<h1 style="margin-top:20px;">Will be right back...</h1>
<p id="message-1">Thank you for your patience.</p>
<p id="message-2">Our engineers are working quickly to resolve the issue.</p>
</td>
</tr>
</tbody></table>
</body></html>
I am confused, because I had no problem scraping the data on the summary tab at this URL https://finance.yahoo.com/quote/AAPL?p=AAPL using the following code
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import requests, lxml
from lxml import html
stockDict = {}
stockSymbol = 'AAPL'
URL = 'https://finance.yahoo.com/quote/'+ stockSymbol + '?p=' + stockSymbol
page = requests.get(URL)
print(page.text)
soup = BeautifulSoup(page.content, 'html.parser')
stock_data = soup.find_all('table')
stock_data
for table in stock_data:
    trs = table.find_all('tr')
    for tr in trs:
        tds = tr.find_all('td')
        if len(tds) > 0:
            stockDict[tds[0].get_text()] = [tds[1].get_text()]

stock_sum_df = pd.DataFrame(data=stockDict)
print(stock_sum_df.head())
print(stock_sum_df.info())
Anyone have any idea what I'm doing wrong? I'm also using the free version of yahoo finance if that makes any difference.
So I figured out your problem.
The User-Agent request header contains a characteristic string that allows the network protocol peers to identify the application type, operating system, software vendor or software version of the requesting software user agent. Validating User-Agent header on server side is a common operation so be sure to use valid browser’s User-Agent string to avoid getting blocked.
Source: http://go-colly.org/articles/scraping_related_http_headers/
The only thing you need to do is to set a legitimate user-agent. Therefore add headers to emulate a browser:
# This is a standard user-agent of Chrome browser running on Windows 10
headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36' }
Example:
import requests
headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
stockSymbol = 'AAPL'
url = 'https://finance.yahoo.com/quote/'+ stockSymbol + '/key-statistics?p=' + stockSymbol
resp = requests.get(url, headers=headers, timeout=5).text
print(resp)
Additionally, you can add more headers to look even more like a legitimate browser, for example:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'DNT': '1',  # Do Not Track request header
    'Connection': 'close'
}
These blocks usually kick in for two main reasons:
Anti-automation systems (systems that detect bots/crawlers).
Sites that moderate or vary content based on the browser you are visiting with.
Hence, it is always a good idea to supply a user-agent in the headers when designing automation systems.
I am not really sure what causes the issue or what the intent of your project is. However, if your intent is to do something with Yahoo Finance data, and not to learn how to scrape data, then the following module could help you out: https://pypi.org/project/yahoo-finance/
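Going by that package's PyPI README, usage would look roughly like the sketch below. The Share class and its getters are taken from that README as I recall it, and since the package is quite old it may no longer work against Yahoo's current API:
from yahoo_finance import Share

apple = Share('AAPL')     # look up a ticker symbol
print(apple.get_price())  # latest price as a string
print(apple.get_open())   # opening price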
I'm trying to log into a webpage using Python's requests library. It doesn't work, and I think the main issue is that I'm forgetting to send some information with the request, but unfortunately I don't know how to figure out what exactly is missing.
Important:
This question is not about how to use Python to log in to a webpage (there are already enough other questions that answer this [see here, here, etc.]). I'd like to know how to figure out, from a given HTML page, what I need to send to pass a login screen.
Example
Regardless of that, I think an example can't hurt.
The login I tried to pass is https://mangadex.org/login. Looking at the HTML I found
<input autofocus="" tabindex="1" type="text" name="login_username" id="login_username" class="form-control" placeholder="Username" required="">
<input tabindex="2" type="password" name="login_password" id="login_password" class="form-control" placeholder="Password" required="">
So my first attempt was:
import requests

url = 'https://mangadex.org/login'
payload = {'login_username': 'XXXXXX',
           'login_password': 'YYYYYY'}

# Use 'with' to ensure the session context is closed after use.
with requests.Session() as s:
    p = s.post(url, data=payload)
    # print the html returned or something more intelligent to see if it's a successful login page.
    print(p.text)
Unfortunately, I just get redirected to the login screen. So there seems to be something "hidden" that gets sent along with the login information, as suggested here (see step 1.3). The issue is that I don't really know if the above website has something like this (there were some hidden fields, but they don't seem to be involved in the login process). If not, I really don't understand how I'm supposed to figure out what is missing.
TL;DR:
Given the html code of a webpage, how do I figure out from the html code what information is necessary to be sent to the website to successfully log in?
If you only have the HTML, you can usually only learn the Content-Type and the field names in the form (or, at best, the login API endpoint). Mostly it depends on the code on the backend, and most pages use some measure to prevent web scraping.
If you use the code below against the page you posted:
import requests
url = "https://mangadex.org/ajax/actions.ajax.php?function=login"
payload = {
    "login_password": "xxxxx",
    "login_username": "acs"
}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36',
    'Content-Type': 'multipart/form-data; boundary=----WebKitFormBoundaryIEBjAQpjLF2kWUAJ',
}

with requests.Session() as s:
    response = s.post(url, headers=headers, data=payload)
    print(response.text)
See the result:
Hacking attempt... Go away.
But if you add 'X-Requested-With': 'XMLHttpRequest' to the headers in your code:
import requests
url = "https://mangadex.org/ajax/actions.ajax.php?function=login"
payload = {
    "login_password": "xxxxx",
    "login_username": "acs"
}
headers = {
    'X-Requested-With': 'XMLHttpRequest',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36',
    'Content-Type': 'multipart/form-data; boundary=----WebKitFormBoundaryIEBjAQpjLF2kWUAJ',
}

with requests.Session() as s:
    response = s.post(url, headers=headers, data=payload)
    print(response.text)
With that header, the login information is sent and accepted normally.
TL;DR:
I don't think you can tell from the HTML alone; you need to analyse the site's actual requests yourself.
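For the general case the question asks about, a rough sketch of the usual approach is to fetch the login page first, copy every hidden input (CSRF tokens and the like) into your payload, and then post it together with the credentials. The URL and credential field names below are placeholders, and as shown above, mangadex itself needed the X-Requested-With header rather than hidden fields:
import requests
from bs4 import BeautifulSoup

login_url = "https://example.com/login"  # placeholder; use the real form's action URL

with requests.Session() as s:
    page = s.get(login_url)
    soup = BeautifulSoup(page.text, "html.parser")
    form = soup.find("form")

    # Carry over whatever hidden fields the form already contains
    payload = {
        field["name"]: field.get("value", "")
        for field in form.find_all("input", type="hidden")
        if field.get("name")
    }
    payload["username"] = "XXXXXX"  # field names differ per site; read them from the form
    payload["password"] = "YYYYYY"

    response = s.post(login_url, data=payload)
    print(response.status_code)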
I am trying to get the HTML of the Supreme main page so I can parse it.
Here is what I am trying:
import requests
from bs4 import BeautifulSoup

all_page = requests.get('https://www.supremenewyork.com/index', headers={
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'
}).text
all_page_html = BeautifulSoup(all_page, 'html.parser')
print(all_page_html)
But instead of the full page content I get this response:
<!DOCTYPE html>
<html lang="en"><head><meta charset="utf-8"/><meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/><title>Supreme</title><meta content="Supreme. The official website of Supreme. EST 1994. NYC." name="description"/><meta content="telephone=no" name="format-detection"/><meta content="on" http-equiv="cleartype"/><meta content="notranslate" name="google"/><meta content="app-id=664573705" name="apple-itunes-app"/><link href="//www.google-analytics.com" rel="dns-prefetch"/><link href="//ssl.google-analytics.com" rel="dns-prefetch"/><link href="//d2flb1n945r21v.cloudfront.net" rel="dns-prefetch"/><script src="https://www.google.com/recaptcha/api.js">async defer</script><meta content="width=device-width, initial-scale=1, minimum-scale=1, maximum-scale=1, user-scalable=no" id="viewport" name="viewport"/><link href="//d17ol771963kd3.cloudfront.net/assets/application-2000eb9ad53eb6df5a7d0fd8c85c0c03.css" media="all" rel="stylesheet"/><script \
etc.
Is this some kind of block, or am I missing something? I even added request headers, but I still get this type of response instead of a normal one.
Well, that's actually how the page is: it is an HTML page with some CSS and JavaScript in it. You should use "Inspect Element" to find the elements you want to grab and note down the class they are stored in, so you can locate them more easily.
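For example (the div and class name below are hypothetical; replace them with whatever you see in the inspector):
import requests
from bs4 import BeautifulSoup

all_page = requests.get('https://www.supremenewyork.com/index', headers={
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'
}).text
all_page_html = BeautifulSoup(all_page, 'html.parser')

# Hypothetical selector: grab every element with the class you noted in "Inspect Element"
for item in all_page_html.find_all('div', class_='some-class-from-inspector'):
    print(item.get_text(strip=True))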
I am trying to scrape an HTTP website using proxies, and when I try to extract the text it shows "Web Page Blocked". How can I avoid this error?
My code is as follows:
import requests
from bs4 import BeautifulSoup

url = "http://campanulaceae.myspecies.info/"
proxy_dict = {
    'http': "174.138.54.49:8080",
    'https': "174.138.54.49:8080"
}
page = requests.get(url, proxies=proxy_dict)
soup = BeautifulSoup(page.text, 'html.parser')
print(soup)
I get the output below when I try to print the text from the website.
<html>
<head>
<title>Web Page Blocked</title>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="NO-CACHE" http-equiv="PRAGMA"/>
<meta content="initial-scale=1.0" name="viewport"/>
........
<body bgcolor="#e7e8e9">
<div id="content">
<h1>Web Page Blocked</h1>
<p>Access to the web page you were trying to visit has been blocked in accordance with company policy. Please contact your system administrator if you believe this is in error.</p>
Because you did not specify a user-agent for the request headers.
Quite often, sites block requests that come from robot-like sources.
Try it like this:
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.106 Safari/537.36'}
page = requests.get(url, headers=headers, proxies=proxy_dict)
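Putting it together with the code from the question (same URL and proxy), the full request would look something like this:
import requests
from bs4 import BeautifulSoup

url = "http://campanulaceae.myspecies.info/"
proxy_dict = {
    'http': "174.138.54.49:8080",
    'https': "174.138.54.49:8080"
}
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.106 Safari/537.36'}

page = requests.get(url, headers=headers, proxies=proxy_dict)
soup = BeautifulSoup(page.text, 'html.parser')
print(soup.title)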