I’m trying to create a script that parses the source code of https://www.youtube.com/feed/subscriptions and retrieves the URLs of the videos in my subscription feed, so that I can feed them into an MP4 downloader and save the files to my FTP server.
However, I have been stuck on this problem for a couple of hours.
import bs4
import requests
source = requests.get('https://www.youtube.com/feed/subscriptions')
sourceSoup = bs4.BeautifulSoup(source.text,'html.parser')
sourceSoup.select('#grid-319397 > li:nth-child(1) > div > div.yt-lockup-dismissable > div.yt-lockup-content > h3')
[]
I am right-clicking the element, choosing ‘Inspect element’, then ‘Copy selector’, and pasting the result into the select method.
As you can see, it keeps returning an empty list.
I have tried many variations of this, but it's not picking up anything. I have the same problem when doing the same thing on the homepage, so I doubt it is because the feed is behind a login (although I am logged in on the PC on which the script is running).
Can someone please point me in the right direction?
You are facing 2 different (but somewhat related) issues:
1. The page that the server returns for the GET request sent by your code may be different from the page you receive when you visit it with your browser, because the server does not recognize your code's user-agent.
2. The item you're looking for is only visible after you log in.
Now, instead of handling both of these issues manually, there is a YouTube Data API that you should consider using.
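For reference, here is a rough sketch (not part of the original answer) of listing your subscribed channels through the YouTube Data API v3. It assumes you have created OAuth client credentials in the Google Cloud console, saved them as client_secret.json (a placeholder name), and installed google-api-python-client and google-auth-oauthlib:

from google_auth_oauthlib.flow import InstalledAppFlow
from googleapiclient.discovery import build

# 'client_secret.json' is a placeholder for your own OAuth client file
flow = InstalledAppFlow.from_client_secrets_file(
    'client_secret.json',
    scopes=['https://www.googleapis.com/auth/youtube.readonly'])
credentials = flow.run_local_server(port=0)

# list the channels the authenticated account is subscribed to
youtube = build('youtube', 'v3', credentials=credentials)
response = youtube.subscriptions().list(part='snippet', mine=True, maxResults=50).execute()
for item in response.get('items', []):
    print(item['snippet']['title'])

From there you would still need to look up each channel's recent uploads to get the video URLs, but it avoids scraping the logged-in HTML page entirely.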
Here is a demo showing that we get a different page for different user-agents:
import requests

# default user-agent (python-requests/...)
python_user_agent_request = requests.get('http://www.youtube.com')

# same request, but pretending to be Chrome
chrome_user_agent_request = requests.get(
    'http://www.youtube.com',
    headers={'user-agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 '
                           '(KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'})

print(python_user_agent_request.request.headers['user-agent'])
# >> python-requests/2.7.0 CPython/3.4.2 Windows/7

print(chrome_user_agent_request.request.headers['user-agent'])
# >> Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36

# .text holds the HTML page source
print(python_user_agent_request.text == chrome_user_agent_request.text)
# >> False
Related
I'm attempting to write a script that logs in to my online banking account (USAA) and gets the transaction information so that I can do some analysis on it. I managed to get Playwright to correctly go to the webpage, enter my username, and click the "Next" button, but every time it does that, I get the below pop-up message:
We are unable to complete your request. Our system is currently unavailable. Please try again later.
When I manually enter my username and click the "Next" button, however, it works just fine. The code I'm using is below:
from playwright.sync_api import sync_playwright
import dotenv
import os
dotenv.load_dotenv()
USER_NAME = os.getenv('USER_NAME') or ''
PASSWORD = os.getenv('PASSWORD') or ''
with sync_playwright() as p:
    browser = p.chromium.launch(headless=False, slow_mo=1000)
    context = browser.new_context(user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36')
    page = context.new_page()
    page.goto('https://www.usaa.com/my/logon?logoffjump=true&wa_ref=pub_global_log_on')
    page.locator('input:below(:text(\'Online ID\'))').fill(USER_NAME)
    page.locator('button[type=submit]').click()
    page.locator('input:below(:text(\'Password\'))').fill(PASSWORD)
    page.locator('button[type=submit]').click()
I suspect that the issue isn't within the code itself, but rather some security feature I'm not aware of, much less know how to get around. Has anyone else successfully written a script that accomplishes this? Did you have to get around this issue, and if so, how did you do it?
Hi, I'm trying to build a manga downloader app. For this reason I'm scraping several sites; however, I have a problem once I get the image URL.
I can see the image in my browser (Chrome), and I can also download it, but I can't do the same using any popular scripting library.
Here is what I've tried:
String imgSrc = "https://cdn.mangaeden.com/mangasimg/aa/aa75d306397d1d11d07d66746dae78a36dc78672ae9e97a08cb7abb4.jpg";
Connection.Response resultImageResponse = Jsoup.connect(imgSrc)
.userAgent(
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21")
.referrer("none").execute();
// output here
OutputStreamWriter out = new OutputStreamWriter(new FileOutputStream(new java.io.File(String.valueOf(imgPath))));
out.write(resultImageResponse.body()); // resultImageResponse.body() is where the image's contents are.
out.close();
I've also tried this:
URL imgUrl = new URL(imgSrc);
Files.copy(imgUrl.openStream(), imgPath);
Lastly, since I was sure the link works, I tried to download the image using Python, but in this case I also get a 403 error:
import requests

url = "https://cdn.mangaeden.com/mangasimg/d0/d08f07d762acda8a1f004677ab2414b9766a616e20bd92de4e2e44f1.jpg"
res = requests.get(url)
Googling, I found this: Unable to get image url in Mangaeden API Angular 6, which seems really close to my problem; however, I don't understand whether I'm setting the referrer wrong or it simply doesn't work at all...
Do you have any tips?
Thank you!
How to fix?
Add some "headers" to your request to show that you might be a "browser"; this will give you a 200 response and you can save the file.
Note: this also works in Postman; just overwrite the hidden user-agent and you will get the image as the response.
Example (python)
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'
}
url = "https://cdn.mangaeden.com/mangasimg/d0/d08f07d762acda8a1f004677ab2414b9766a616e20bd92de4e2e44f1.jpg"
res = requests.get(url, headers=headers)

with open("image.jpg", 'wb') as f:
    f.write(res.content)
Someone wrote this answer, but later deleted it, so I will copy the answer in case it can be useful.
AFAIK, you can't download anything other than HTML documents using jsoup.
If you open up Developer Tools in your browser, you can get the exact request the browser has made. With Chrome, it's something like this.
The minimal cURL request would in your case be:
curl 'https://cdn.mangaeden.com/mangasimg/aa/aa75d306397d1d11d07d66746dae78a36dc78672ae9e97a08cb7abb4.jpg' \
  -H 'user-agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21' \
  --output image.jpg
You can refer to HedgeHog's answer for a sample Python solution; here's how to achieve the same in Java using the new HTTP Client:
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse.BodyHandlers;
import java.nio.file.Paths;

public class ImageDownload {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://cdn.mangaeden.com/mangasimg/aa/aa75d306397d1d11d07d66746dae78a36dc78672ae9e97a08cb7abb4.jpg"))
                .header("user-agent", "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21")
                .build();
        client.send(request, BodyHandlers.ofFile(Paths.get("image.jpg")));
    }
}
I adopted this solution in my Java code.
Also, one last thing: if the image is downloaded but you can't open it, it is probably due to a 503 error code in the response; in that case you just have to perform the request again. You can recognize broken images because the image reader will say something like
Not a JPEG file: starts with 0x3c 0x68
which is "<h", i.e. an HTML error page was saved instead of the image.
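As a small illustration (a sketch in Python, reusing the headers and url from the example above; the retry count is arbitrary), you can detect this case by checking the first bytes of the response, since real JPEG data starts with 0xFF 0xD8 while an HTML error page starts with '<':

import time
import requests

for attempt in range(3):
    res = requests.get(url, headers=headers)
    # a real JPEG starts with the bytes FF D8; an HTML error page starts with '<'
    if res.status_code == 200 and res.content[:2] == b'\xff\xd8':
        with open("image.jpg", "wb") as f:
            f.write(res.content)
        break
    time.sleep(1)  # e.g. after a 503, wait briefly and try again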
I've tried searching for this - can't seem to find the answer!
I'm trying to do a really simple scrape of an entire webpage so that I can look for key words. I'm using the following code:
import requests
Website = requests.get('http://www.somfy.com', {'User-Agent':'a'}, headers = {'Accept': '*/*'})
print (Website.text)
print (Website.status_code)
When I visit this website in a browser (e.g. Chrome or Firefox) it works. When I run the Python code I just get the result "Gone" (status code 410).
I'd like to be able to reliably put in a range of website urls, and pull back the raw html to be able to look for key-words.
Questions
1. What have I done wrong, and how should I set this up to have the best chance of success in the future?
2. Could you point me to any guidance on how to go about working out what is wrong?
Many thanks - and sorry for the beginner questions!
You have an invalid User-Agent, and you passed it as the second positional argument (params) instead of including it in your headers.
I have fixed your code for you - it returns a 200 status code.
import requests
Website = requests.get('http://www.somfy.com', headers= {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3835.0 Safari/537.36', 'Accept': '*/*'})
print (Website.text)
print (Website.status_code)
import requests
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
headers = {'User-Agent': user_agent}
page = requests.get("https://sky.lea.moe/stats/PapaGordsmack/", headers=headers)
html_contents = page.text
print(html_contents)
I am trying to scrape the sky.lea.moe website for a specific user, but when I request the HTML and print it, it is different from the one shown in the browser (in Chrome, viewing the page source).
The one I get is: https://pastebin.com/91zRw3vP
Analyzing this one, it is something about checking browser and redirecting. Any ideas what I should do?
This is Cloudflare's anti-DDoS protection, and it is effective at stopping scraping: a JS script normally runs the check and redirects you after a few seconds.
Something like Selenium is probably your best option for getting around it, though you might be able to scrape the JS file and work out the URL it redirects to. You could also try spoofing your referrer to be this page, so the request goes to the correct one.
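If you go the Selenium route, a minimal sketch would look something like the following (an illustration, not tested against this site; it assumes Chrome and a matching chromedriver are installed, and the fixed sleep is just a crude way to let the JS challenge finish):

import time
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://sky.lea.moe/stats/PapaGordsmack/")
time.sleep(10)  # give the JS challenge time to run and redirect
html_contents = driver.page_source
driver.quit()
print(html_contents)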
Browsers indeed do more than just download a webpage; they also download additional resources, parse stylesheets, and so on. To scrape a webpage, it is advisable to use a scraping framework like Scrapy, which handles these concerns for you and provides convenient tools for extracting information from the pages. A bare-bones example is sketched below.
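For instance, a minimal Scrapy spider (just an illustrative sketch, not the asker's code; the filename and the extracted field are arbitrary) could be run with "scrapy runspider stats_spider.py -o out.json":

import scrapy

class StatsSpider(scrapy.Spider):
    name = "stats"
    start_urls = ["https://sky.lea.moe/stats/PapaGordsmack/"]
    # Scrapy lets you set request headers such as the user-agent via settings
    custom_settings = {
        "USER_AGENT": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36",
    }

    def parse(self, response):
        # extract whatever you need; here, just the page title as an example
        yield {"title": response.css("title::text").get()}

Note that Scrapy alone will not execute the Cloudflare JavaScript check, so for this particular site you may still need a browser-based approach as described above.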
I've written a script to scrape data from a div and return a boolean indicating whether a prespecified string exists within it; everything works perfectly locally. However, when I copy the code to a Colab notebook, the script hits a reCAPTCHA and returns a 403 status code.
My code is below:
# imports assumed from the surrounding notebook (not shown in the original snippet)
import requests
import pandas as pd
from bs4 import BeautifulSoup as soup
from tqdm import tqdm

def stock_checker(listofurls):
    headers = {
        'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36"
    }
    stock_level = []
    for target_url in tqdm(listofurls):
        print(target_url)
        query = requests.get(target_url, headers=headers).text
        html = soup(query, "html.parser")
        soup_result = html.find("div", {"class": "product-details__options-basket"}).text
        stock_bool = "Out of Stock" if "Out of Stock" in str(soup_result) else "In Stock"
        stock_level.append(stock_bool)
    return pd.DataFrame({"URls": listofurls, "In Stock": stock_level})

print(stock_checker(myurllist))
The HTML returned is the reCAPTCHA page, so the div I'm referencing further down does not exist and the code errors.
Any ideas why this is happening in Colab and not locally, and/or how to fix the issue?
PS: I'm putting it in Colab so others can use it just by running the code, without needing to write any code themselves.
RE: "why is this happening?" --
Headless browsers are often used for abuse, and hence more often receive counter-abuse tests like captchas. The chance is likely greater when executing from shared IP ranges typical of Cloud providers.
The short version is that the site you are using is probably working as intended. If you are not adhering to its robots.txt directives, I'd start there in order to reduce the chance of encountering counter-abuse mechanisms.
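If you want to check the robots.txt directives programmatically before fetching, a quick sketch with the standard library (the URLs here are placeholders, not from the question) looks like this:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()
# True if the given user-agent is allowed to fetch that path
print(rp.can_fetch("my-bot/1.0", "https://www.example.com/some/page"))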