Python Web Crawler No Results

I am making a basic Web Crawler/Spider with Python. I am trying to crawl through a YouTube channel and print all the titles of the videos on it but it never returns anything.
Here is my code so far:
import requests
from bs4 import BeautifulSoup
url = 'https://www.youtube.com/c/DanTDM/videos'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
x = soup.select(".yt-simple-endpoint style-scope ytd-grid-video-renderer")
print(x)
And the output is always: []. An empty list (which means it didn't find anything). I need to know what I'm doing wrong.

The code itself seems correct.
Call print(response.text) and see whether YouTube is returning you a blocking page instead of the channel.
Anti-scraping measures may be in action, such as checking your user agent, etc.
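As a quick check you can also try sending a browser-like User-Agent header; a minimal sketch (the header string is just an example value):
import requests

url = 'https://www.youtube.com/c/DanTDM/videos'
# Example browser-like User-Agent string; some sites serve different pages to unknown clients
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
response = requests.get(url, headers=headers)
print(response.status_code)
print(response.text[:500])  # inspect the start of the page for a consent/blocking notice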

Browser Automation with Selenium
When I send a request to YouTube, I receive the following page:
(A 'Before you continue to YouTube' page.)
So...
We should use Selenium instead, as we need to click one of the buttons on that consent page; I don't think we can interact with the website like that using the requests module.
Selenium allows you to take control of your browser. Read the documentation!
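A minimal sketch of that approach; the consent-button handling and the video-title selector are assumptions and may need adjusting for your region and YouTube's current layout:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # or webdriver.Firefox()
driver.get('https://www.youtube.com/c/DanTDM/videos')
time.sleep(3)  # crude wait for the page (or the consent dialog) to render

# Click the consent button if it appears; matching on the button text is an assumption
for button in driver.find_elements(By.TAG_NAME, 'button'):
    if 'accept' in button.text.lower():
        button.click()
        break

time.sleep(3)  # wait for the video grid to load
# Grid video titles are anchors with id="video-title" (an assumption about YouTube's markup)
for title in driver.find_elements(By.ID, 'video-title'):
    print(title.get_attribute('title') or title.text)

driver.quit()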

Related

How do I fix getting "None" as a response when web scraping?

I am trying to write a small script that gets the view count from a YouTube video and prints it. However, when I print the text variable with this code, I just get "None". Is there a way to get the actual view count using these libraries?
import requests
from bs4 import BeautifulSoup
url = requests.get("https://www.youtube.com/watch?v=dQw4w9WgXcQ")
soup = BeautifulSoup(url.text, 'html.parser')
text = soup.find('span', {'class': "view-count style-scopeytd-video-view-count-renderer"})
print(text)
To see why, you should use wget or curl to fetch a copy of that page and look at it, or use "view source" from your browser. That's what requests sees. None of those classes appear in the HTML you get back. That's why you get None -- because there ARE none.
YouTube builds all of its pages dynamically, through Javascript. requests doesn't interpret Javascript. If you need to do this, you'll need to use something like Selenium to run a real browser with a Javascript interpreter built in.
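A rough sketch of that route; the view-count selector is an assumption, and YouTube's markup changes often:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://www.youtube.com/watch?v=dQw4w9WgXcQ")

# Wait until a view-count element has been rendered by JavaScript (the class name is an assumption)
view_count = WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "span.view-count"))
)
print(view_count.text)
driver.quit()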

Why am I getting an empty body tag content when trying to use web scraping using the requests library?

I have been trying to scrape a website using the requests and BeautifulSoup Python libraries.
The problem is that I'm getting the HTML data of the web page, but the body tag content is empty, while in the inspect panel on the website it isn't.
Can anyone explain why this is happening and what I can do to get the content of the body?
Here is my code:
from bs4 import BeautifulSoup
import requests
source = requests.get('https://webaccess-il.rexail.com/?s_jwe=eyJhbGciOiJkaXIiLCJlbmMiOiJBMTI4Q0JDLUhTMjU2In0..gKfb7AnqhUiIMIn0PGb35g.SUsLS70gBec9GBgraaV5BK8hKyqm-VvMSNjP3nIumtcrj9h19zOkYjaBHrW4SDL10DjeIcwQcz9ul1p8umMHKxPPC-QZpCyJbk7JQkUSqFM._d_sGsiSyPF_Xqs2hmLN5A#/store-products-shopping-non-customers').text
soup = BeautifulSoup(source, 'lxml')
print(soup)
Here is the inspect panel of the website:
And here is the output of my code:
Thank you :)
There are two reasons why your code might not work. The first is that the website requires additional header or cookie information, which you can try to find using the browser's inspect tool and add via
requests.get(url, headers=headers, cookies=cookies)
where headers and cookies are dictionaries.
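For example (a minimal sketch; the header and cookie values are placeholders you would copy from the browser's network tab):
import requests

url = 'https://webaccess-il.rexail.com/...'  # the page you are scraping (shortened here)
headers = {'User-Agent': 'Mozilla/5.0', 'Accept': 'text/html'}  # values copied from the browser request
cookies = {'session': 'PASTE_VALUE_FROM_BROWSER'}               # placeholder cookie
response = requests.get(url, headers=headers, cookies=cookies)
print(response.status_code)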
The other reason, which I believe is the case here, is that the content is loaded dynamically via JavaScript after the page is built, and what you get back is only the initially loaded website.
To also provide you with a solution, I attach an example using Selenium, which simulates a whole browser and therefore receives the full website. Selenium does have a bit of setup overhead, which you can easily google.
from time import sleep
from selenium import webdriver
from bs4 import BeautifulSoup
url = 'https://webaccess-il.rexail.com/?s_jwe=eyJhbGciOiJkaXIiLCJlbmMiOiJBMTI4Q0JDLUhTMjU2In0..gKfb7AnqhUiIMIn0PGb35g.SUsLS70gBec9GBgraaV5BK8hKyqm-VvMSNjP3nIumtcrj9h19zOkYjaBHrW4SDL10DjeIcwQcz9ul1p8umMHKxPPC-QZpCyJbk7JQkUSqFM._d_sGsiSyPF_Xqs2hmLN5A#/store-products-shopping-non-customers'
driver = webdriver.Firefox()
driver.get(url)
sleep(10)
content = driver.page_source
soup = BeautifulSoup(content, 'lxml')
If you don't want the browser simulation to be visible, you can add
from selenium.webdriver.firefox.options import Options
options = Options()
options.headless = True
driver = webdriver.Firefox(options=options)
which will make it run in the background.
As an alternative to Firefox, you can use pretty much any browser with the appropriate driver.
A Linux-based setup example can be found here: Link
Even though I find Selenium easier for beginners, that site bothered me, so I figured out a pure-requests way that I also want to share.
Process:
When you look at the network traffic after loading the website, you find a lot of outgoing GET requests. Assuming you are interested in the products that are loaded, I found a call, right above the product images being loaded from Amazon S3, going to
https://client-il.rexail.com/client/public/public-catalog?s_jwe=eyJhbGciOiJkaXIiLCJlbmMiOiJBMTI4Q0JDLUhTMjU2In0..gKfb7AnqhUiIMIn0PGb35g.SUsLS70gBec9GBgraaV5BK8hKyqm-VvMSNjP3nIumtcrj9h19zOkYjaBHrW4SDL10DjeIcwQcz9ul1p8umMHKxPPC-QZpCyJbk7JQkUSqFM._d_sGsiSyPF_Xqs2hmLN5A
which, importantly, is of the form
https://client-il.rexail.com/client/public/public-catalog?s_jwe=[...]
Upon opening the URL I found it to indeed be a JSON listing of the products. However, the s_jwe token is dynamic, and without it the JSON doesn't load.
Now, investigating the initially loaded URL and searching for s_jwe, you will find
<script>
window.customerStore = {store: angular.fromJson({"id":26,"name":"\u05de\u05e9\u05e7 \u05d4\u05e8 \u05e4\u05e8\u05d7\u05d9\u05dd","imagePath":"images\/stores\/26\/88aa6827bcf05f9484b0dafaedf22b0a.png","secondaryImagePath":"images\/stores\/4d5d1f54038b217244956071ca62312d.png","thirdImagePath":"images\/stores\/26\/2f9294180e7d656ba7280540379869ee.png","fourthImagePath":"images\/stores\/26\/bd2861565b18613497a6ce66903bf9eb.png","externalWebTrackingAccounts":"[{\"accountType\":\"googleAnalytics\",\"identifier\":\"UA-130110792-1\",\"primaryDomain\":\"ecomeshek.co.il\"},{\"accountType\":\"facebookPixel\",\"identifier\":\"3958210627568899\"}]","worksWithStoreCoupons":false,"performSellingUnitsEstimationLearning":false}), s_jwe: "eyJhbGciOiJkaXIiLCJlbmMiOiJBMTI4Q0JDLUhTMjU2In0..gKfb7AnqhUiIMIn0PGb35g.SUsLS70gBec9GBgraaV5BK8hKyqm-VvMSNjP3nIumtcrj9h19zOkYjaBHrW4SDL10DjeIcwQcz9ul1p8umMHKxPPC-QZpCyJbk7JQkUSqFM._d_sGsiSyPF_Xqs2hmLN5A"};
const externalWebTrackingAccounts = angular.fromJson(customerStore.store.externalWebTrackingAccounts);
</script>
containing
s_jwe: "eyJhbGciOiJkaXIiLCJlbmMiOiJBMTI4Q0JDLUhTMjU2In0..gKfb7AnqhUiIMIn0PGb35g.SUsLS70gBec9GBgraaV5BK8hKyqm-VvMSNjP3nIumtcrj9h19zOkYjaBHrW4SDL10DjeIcwQcz9ul1p8umMHKxPPC-QZpCyJbk7JQkUSqFM._d_sGsiSyPF_Xqs2hmLN5A"
So to summarize: even though the initial page does not contain the products, it does contain the token and the product URL.
Now you can extract the two and call the product catalog directly, like so:
FINAL CODE:
import requests
import re
import json
s = requests.Session()
initial_url = 'https://webaccess-il.rexail.com/?s_jwe=eyJhbGciOiJkaXIiLCJlbmMiOiJBMTI4Q0JDLUhTMjU2In0..gKfb7AnqhUiIMIn0PGb35g.SUsLS70gBec9GBgraaV5BK8hKyqm-VvMSNjP3nIumtcrj9h19zOkYjaBHrW4SDL10DjeIcwQcz9ul1p8umMHKxPPC-QZpCyJbk7JQkUSqFM._d_sGsiSyPF_Xqs2hmLN5A#/store-products-shopping-non-customers'
initial_site = s.get(url= initial_url).content.decode('utf-8')
jwe = re.findall(r's_jwe:.*"(.*)"', initial_site)
product_url = "https://client-il.rexail.com/client/public/public-catalog?s_jwe="+ jwe[0]
products_site = s.get(url= product_url).content.decode('utf-8')
products = json.loads(products_site)["data"]
print(products[0])
There is a little bit of fine-tuning required with the decoding, but I am sure you can manage that. ;)
This of course is the leaner way of scraping that website, but as I hopefully showed, scraping is always a bit of playing Sherlock Holmes.
Any questions, glad to help.

PYTHON 3 - How to web scrape a password protected website?

I'm trying to access a website at work, but it is username/password protected. The username/password pop-up looks as in the attached picture (a browser login dialog).
I attach my code for viewing the website below.
I can see the HTML code, but it comes with a "401 Authorization Required" error.
Can you please help?
import requests
from bs4 import BeautifulSoup as bs
r = requests.get("http://10.75.19.101/mfgindex", auth=('root', 'password'))
# Convert to beautiful soup object
soup = bs(r.content, features="html.parser")
# print
print(soup.prettify())
Generally, if a site is password-protected you obviously can't bypass the login procedure. That forces you to use an RPA-style process, where your code controls the web browser, performs the login with the real username and password, then browses the pages you need and extracts the elements you require from the HTML using BeautifulSoup.
For that I suggest trying out Selenium (https://www.selenium.dev/).
A short tutorial is here:
https://medium.com/ymedialabs-innovation/web-scraping-using-beautiful-soup-and-selenium-for-dynamic-page-2f8ad15efe25
I used it to solve a similar task some time ago and it worked well.
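A minimal sketch of that idea, assuming the site also exposes a form-based login page; the login URL and field names here are hypothetical, and a native browser pop-up asking for HTTP Basic credentials cannot be filled in this way:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("http://10.75.19.101/login")  # hypothetical form-based login page
driver.find_element(By.NAME, "username").send_keys("root")      # field names are assumptions
driver.find_element(By.NAME, "password").send_keys("password")
driver.find_element(By.CSS_SELECTOR, "input[type=submit]").click()

driver.get("http://10.75.19.101/mfgindex")  # browse the protected page after logging in
soup = BeautifulSoup(driver.page_source, "html.parser")
print(soup.prettify())
driver.quit()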

How do I scrape a full instagram page in python?

Long story short, I'm trying to create an Instagram scraper in Python that loads the entire page and grabs all the links to the images. I have it working; the only problem is that it only loads the first 12 photos that Instagram shows. Is there any way I can tell requests to load the entire page?
Working code:
import json
import requests
from bs4 import BeautifulSoup
import sys
r = requests.get('https://www.instagram.com/accountName/')
soup = BeautifulSoup(r.text, 'lxml')
script = soup.find('script', text=lambda t: t.startswith('window._sharedData'))
page_json = script.text.split(' = ', 1)[1].rstrip(';')
data = json.loads(page_json)
non_bmp_map = dict.fromkeys(range(0x10000, sys.maxunicode + 1), 0xfffd)
for post in data['entry_data']['ProfilePage'][0]['graphql']['user']['edge_owner_to_timeline_media']['edges']:
    image_src = post['node']['display_url']
    print(image_src)
As Scratch already mentioned, Instagram uses "infinite scrolling", which won't let you load the entire page. But you can check the total number of posts at the top of the page (within the first span with the _fd86t class) and then check whether the page already contains all of them. Otherwise, you'll have to use a GET request to fetch a new JSON response. The benefit of this is that the request contains a first field, which seems to let you control how many posts you get: you can change it from the default 12 to fetch all of the remaining posts (hopefully).
The necessary request looks similar to the following (where I've anonymised the actual entries, and with some help from the comments):
https://www.instagram.com/graphql/query/?query_hash=472f257a40c653c64c666ce877d59d2b&variables={"id":"XXX","first":12,"after":"XXX"}
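A minimal sketch of issuing that request with requests; the id and after values are the anonymised placeholders from the URL above, and the response field names mirror the JSON structure used in the question, so treat them as assumptions:
import json
import requests

params = {
    "query_hash": "472f257a40c653c64c666ce877d59d2b",
    # "id" is the numeric user id, "after" is the end cursor from the previous response (placeholders here)
    "variables": json.dumps({"id": "XXX", "first": 50, "after": "XXX"}),
}
response = requests.get("https://www.instagram.com/graphql/query/", params=params)
media = response.json()["data"]["user"]["edge_owner_to_timeline_media"]
for edge in media["edges"]:
    print(edge["node"]["display_url"])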
parse_ig.py
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
from InstagramAPI import InstagramAPI
import time
c = webdriver.Chrome()
# load IG page here, whether a hashtag or a public user's page using c.get(url)
for i in range(10):
    c.send_keys(Keys.END)
    time.sleep(1)
soup = BeautifulSoup(c.page_source, 'html.parser')
ids = [a['href'].split('/') for a in soup.find_all('a') if 'tagged' in a['href']]
Once you have the ids, you can use Instagram's old API to get data for those. I'm not sure if it still works, but there was an API that I used -- which was limited by how much FB has slowly deprecated parts of the old API. Here's the link, in case you don't want to access Instagram API on your own :)
You can also improve this simple code. For instance, instead of a for loop you could use a while loop (i.e. keep pressing the END key while the page is still loading).
#zero's answer is incomplete (at least as of 1/15/19). c.send_keys is not a valid method. Instead, this is what I did:
c = webdriver.Chrome()
c.get(some_url)
element = c.find_element_by_tag_name('body') # or whatever tag you're looking to scrape from
for i in range(10):
    element.send_keys(Keys.END)
    time.sleep(1)
soup = BeautifulSoup(c.page_source, 'html.parser')
Here is a link to a good tutorial for scraping Instagram profile info and posts that also handles pagination and works in 2022: Scraping Instagram
In summary, you have to use the Instagram GraphQL API endpoint, which requires the user identifier and the cursor from the previous page's response: https://instagram.com/graphql/query/?query_id=17888483320059182&id={user_id}&first=24&after={end_cursor}
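A rough sketch of paginating with that endpoint; the user_id is a placeholder, and the response field names follow the structure shown in the other answers, so treat them as assumptions:
import requests

user_id = "XXX"  # placeholder: the numeric id of the profile
query_id = "17888483320059182"
end_cursor = ""  # an empty cursor fetches the first page
while True:
    url = (f"https://instagram.com/graphql/query/?query_id={query_id}"
           f"&id={user_id}&first=24&after={end_cursor}")
    data = requests.get(url).json()
    media = data["data"]["user"]["edge_owner_to_timeline_media"]
    for edge in media["edges"]:
        print(edge["node"]["display_url"])
    page_info = media["page_info"]
    if not page_info["has_next_page"]:
        break
    end_cursor = page_info["end_cursor"]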

Go through to original URL on social media management websites

I'm doing web scraping as part of an academic project, where it's important that all links are followed through to the actual content. Annoyingly, there are some important error cases with "social media management" sites, where users post their links to detect who clicks on them.
For instance, consider this link on linkis.com, which links to http:// + bit.ly + /1P1xh9J (separated link due to SO posting restrictions), which in turn links to http://conservatives4palin.com. The issue arises as the original link at linkis.com does not automatically redirect forward. Instead, the user has to click the cross in the top right corner to go to the original URL.
Furthermore, there seem to be different variations (see e.g. linkis.com link 2, where the cross is at the bottom left of the website). These are the only two variations I've found, but there might be more. Note that I'm using a web scraper very similar to this one. The functionality to follow through to the actual link does not need to be stable/functioning over time, as this is a one-time academic project.
How do I automatically go on to the original URL? Would the best approach be to design a regex that finds the relevant link?
In many cases you will have to use browser automation to scrape web pages that generate their content using JavaScript; scraping the HTML returned by a plain GET request will not yield the result you want. You have two options here:
Try to work your way around all the additional JavaScript requests to get the content you want, which can be very time-consuming.
Use browser automation, which lets you open a real browser and automate its tasks; you can use Selenium for that.
I have been developing bots and scrapers for years now, and unless the webpage you are requesting relies very little on JavaScript, you should use something like Selenium.
Here is some code to get you started with selenium:
from selenium import webdriver
#Create a chrome browser instance, other drivers are also available
driver = webdriver.Chrome()
#Request a page
driver.get('http://linkis.com/conservatives4palin.com/uGXam')
#Select elements on the page and trigger events
#Selenium supports also xpath and css selectors
#Clicks the tag with the given id
driver.find_element_by_id('some_id').click()
The common architecture these sites follow is that they show the target website inside an iframe. The sample code below works for both of the cases you mention.
In order to get the final URL you can do something like this:
import requests
from bs4 import BeautifulSoup
urls = ["http://linkis.com/conservatives4palin.com/uGXam", "http://linkis.com/paper.li/gsoberon/jozY2"]
response_data = []
for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    short_url = soup.find("iframe", {"id": "source_site"})['src']
    response_data.append(requests.get(short_url).url)
print(response_data)
Based on the two websites you gave, I think you can try the following code to get the original URL, since it is hidden in a piece of JavaScript (the main scraper code I'm using is from the question you posted):
try:
    from HTMLParser import HTMLParser
except ImportError:
    from html.parser import HTMLParser
import requests, re
from contextlib import closing
CHUNKSIZE = 1024
reurl = re.compile("\"longUrl\":\"(.*?)\"")
buffer = ""
htmlp = HTMLParser()
with closing(requests.get("http://linkis.com/conservatives4palin.com/uGXam", stream=True)) as res:
    for chunk in res.iter_content(chunk_size=CHUNKSIZE, decode_unicode=True):
        buffer = "".join([buffer, chunk])
        match = reurl.search(buffer)
        if match:
            print(htmlp.unescape(match.group(1)).replace('\\',''))
            break
say you're able to grab the href attribute/value:
s = 'href="/url/go/?url=http%3A%2F%2Fbit.ly%2F1P1xh9J"'
then you need to do the following:
import urllib.parse
s=s.partition('http')
s=s[1]+urllib.parse.unquote(s[2][0:-1])
s=urllib.parse.unquote(s)
and s will now be a string containing the original bit.ly link!
try the following code:
import requests
url = 'http://'+'bit.ly'+'/1P1xh9J'
realsite = requests.get(url)
print(realsite.url)
it prints the desired output:
http://conservatives4palin.com/2015/11/robert-tracinski-the-climate-change-inquisition-begins.html?utm_source=twitterfeed&utm_medium=twitter
