Since the Instagram API is not working, I am trying to crawl information for a given hashtag. The hashtag search page loads its data via Ajax, so I followed instructions online to find the URL the data is actually retrieved from. I ended up with the following link:
https://www.instagram.com/graphql/query/?query_hash=f92f56d47dc7a55b606908374b43a314&variables=%7B%22tag_name%22%3A%22cancun%22%2C%22show_ranked%22%3Afalse%2C%22first%22%3A20%2C%22after%22%3A%22QVFENlVELW9hZjlJVWU1RWd6anpWdGNsYkVwU3M5TzUtaDlRN3VoRHlwU1EwWWRBZ2t6TFkzbEl1M3RRcmItd0JKbVBiM2pLUXZpT0JzNWp3dFhIcElfWg%3D%3D%22%7D
However, when I try to crawl that page with urlopen, Instagram blocks my crawler. I tried setting a User-Agent header to get around it, but that is not working either.
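Roughly, that attempt looked like this (a sketch, not my exact code; the User-Agent value is illustrative and the URL is truncated here for readability):

from urllib.request import Request, urlopen

# sketch of the urlopen attempt described above; use the full GraphQL URL quoted earlier
url = 'https://www.instagram.com/graphql/query/?query_hash=f92f56d47dc7a55b606908374b43a314&variables=...'
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})  # illustrative User-Agent
html = urlopen(req).read().decode('utf-8')
print(html)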
Then I tried using a Selenium WebDriver to simulate a real browser. That gets around the block, but the crawl doesn't return anything useful.
Does anyone know what is going wrong here?
from bs4 import BeautifulSoup
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)  # the driver was never created in the original snippet
url = 'https://www.instagram.com/graphql/query/?query_hash=f92f56d47dc7a55b606908374b43a314&variables=%7B%22tag_name%22%3A%22cancun%22%2C%22show_ranked%22%3Afalse%2C%22first%22%3A20%2C%22after%22%3A%22QVFENlVELW9hZjlJVWU1RWd6anpWdGNsYkVwU3M5TzUtaDlRN3VoRHlwU1EwWWRBZ2t6TFkzbEl1M3RRcmItd0JKbVBiM2pLUXZpT0JzNWp3dFhIcElfWg%3D%3D%22%7D'
driver.get(url)
pagesource = driver.page_source
bsObj = BeautifulSoup(pagesource, 'html.parser')
print(bsObj.prettify())
Appreciate any help!
I want to scrape the data on this website: "https://www.findhelp.org/care/support-network--san-francisco-ca?postal=94105". I tried using Beautiful Soup and Selenium.
The first approach:
import requests
from bs4 import BeautifulSoup
url="https://www.findhelp.org/care/support-network--san-francisco-ca?postal=94105"
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'lxml')
print(soup)
This did not give the output I expected; the fetched page contains something like this: "Sorry, something about your browser or browsing activity made us think you were a robot."
The second approach:
from selenium import webdriver
url="https://www.findhelp.org/care/support-network--san-francisco-ca?postal=94105"
PATH = r"C:\Users\Vinay Edula\Desktop\xxxxxxxx\chromedriver.exe"
driver= webdriver.Chrome(PATH)
driver.get(url)
This approach works fine for one or two pages on that site, but after that the website blocks the requests.
The third approach:
import time
import webbrowser

chrome_path = 'C:/Program Files (x86)/Google/Chrome/Application/chrome.exe %s'
for i in range(10):
    url = "https://www.findhelp.org/care/support-network--san-francisco-ca?postal=94105&cursor=" + str(i * 10) + "&limit=10"
    webbrowser.get(chrome_path).open(url)
    time.sleep(10)
This code works fine: it opens the site in Chrome without any error or blocking, but I don't know how to fetch the source code.
When the Python code tries to fetch the page, or when Selenium accesses it from its automated browser, I get the error. When I open the webpage manually, or via the webbrowser module, I can see the contents. So how can I solve this problem? My final aim is to fetch the contents of this paginated site: https://www.findhelp.org/care/support-network--san-francisco-ca?postal=94105
Any solution to this problem will be highly appreciated.
You can use the page_source property of Selenium as follows:
from selenium import webdriver

browser = webdriver.Firefox()
browser.get("http://example.com")
html = browser.page_source  # the rendered HTML after the browser has processed the page
print(html)
I am looking to scrape the following web page, where I wish to get all the text on the page, including all the clickable elements.
I've attempted to use requests:
import requests

response = requests.get("https://cronoschimp.club/market/details/2424?isHonorary=false")
print(response.text)
This returns the page's metadata but none of the actual data.
Is there a way to click through and get the elements in the floating boxes?
As it's a JavaScript-rendered web page, you can't get the data using requests or bs4, because they can't render JavaScript. You need a browser-automation tool such as Selenium. Here I use Selenium together with bs4, and it works fine. Please see the minimal working example below:
Code:
from bs4 import BeautifulSoup
import time
from selenium import webdriver

driver = webdriver.Chrome('chromedriver.exe')
driver.maximize_window()
time.sleep(8)

url = 'https://cronoschimp.club/market/details/2424?isHonorary=false'
driver.get(url)
time.sleep(20)  # wait for the JavaScript-rendered content to load

soup = BeautifulSoup(driver.page_source, 'lxml')
name = soup.find('div', class_="DetailsHeader_title__1NbGC").get_text(strip=True)
p = soup.find('span', class_="DetailsHeader_value__1wPm8")
price = p.get_text(strip=True) if p else "Not for sale"
print([name, price])
Output:
['Chimp #2424', 'Not for sale']
I have been trying to scrape a website using the requests and BeautifulSoup Python libraries.
The problem is that I get the HTML of the page, but the body tag's content is empty, whereas in the browser's inspect panel it isn't.
Can anyone explain why this happens and what I can do to get the content of the body?
Here is my code:
from bs4 import BeautifulSoup
import requests
source = requests.get('https://webaccess-il.rexail.com/?s_jwe=eyJhbGciOiJkaXIiLCJlbmMiOiJBMTI4Q0JDLUhTMjU2In0..gKfb7AnqhUiIMIn0PGb35g.SUsLS70gBec9GBgraaV5BK8hKyqm-VvMSNjP3nIumtcrj9h19zOkYjaBHrW4SDL10DjeIcwQcz9ul1p8umMHKxPPC-QZpCyJbk7JQkUSqFM._d_sGsiSyPF_Xqs2hmLN5A#/store-products-shopping-non-customers').text
soup = BeautifulSoup(source, 'lxml')
print(soup)
Thank you :)
There are two possible reasons your code does not work. The first is that the website requires additional header or cookie information, which you can try to find using the browser's inspect tool and add via
requests.get(url, headers=headers, cookies=cookies)
where headers and cookies are dictionaries.
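A minimal sketch of what that could look like (the header and cookie values here are illustrative assumptions, not the site's actual requirements):

import requests

url = 'https://webaccess-il.rexail.com/...'  # the store URL from the question
headers = {'User-Agent': 'Mozilla/5.0'}               # copy the real values from the inspect tool
cookies = {'session': 'value-copied-from-dev-tools'}  # hypothetical cookie name and value
response = requests.get(url, headers=headers, cookies=cookies)
print(response.text)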
The other reason, which I believe applies here, is that the content is loaded dynamically via JavaScript after the page is built, so what you get back is only the initially loaded page.
To also give you a solution, I attach an example using Selenium, which simulates a whole browser and therefore is served the full website. Selenium has a bit of setup overhead, but you can easily google it.
from time import sleep
from selenium import webdriver
from bs4 import BeautifulSoup

url = 'https://webaccess-il.rexail.com/?s_jwe=eyJhbGciOiJkaXIiLCJlbmMiOiJBMTI4Q0JDLUhTMjU2In0..gKfb7AnqhUiIMIn0PGb35g.SUsLS70gBec9GBgraaV5BK8hKyqm-VvMSNjP3nIumtcrj9h19zOkYjaBHrW4SDL10DjeIcwQcz9ul1p8umMHKxPPC-QZpCyJbk7JQkUSqFM._d_sGsiSyPF_Xqs2hmLN5A#/store-products-shopping-non-customers'
driver = webdriver.Firefox()
driver.get(url)
sleep(10)  # give the JavaScript time to load the content

content = driver.page_source
soup = BeautifulSoup(content, 'html.parser')  # passing an explicit parser avoids a bs4 warning
If you want the browser simulation to be invisible, you can add
from selenium.webdriver.firefox.options import Options
options = Options()
options.headless = True
driver = webdriver.Firefox(options=options)
which will make it run in the background.
Instead of Firefox you can use pretty much any browser, given the appropriate driver.
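For example, a headless Chrome setup would look roughly like this (a sketch; it assumes chromedriver is installed and discoverable on your PATH):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')  # the Chrome counterpart of options.headless = True above
driver = webdriver.Chrome(options=options)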
A Linux-based setup example can be found here: Link
Even though I find Selenium easier for beginners, that site bothered me, so I figured out a pure-requests way that I also want to share.
Process:
When you look at the network traffic after loading the website, you find a lot of outgoing GET requests. Assuming you are interested in the products that are loaded, I found a call, right above the product images being fetched from Amazon S3, going to
https://client-il.rexail.com/client/public/public-catalog?s_jwe=eyJhbGciOiJkaXIiLCJlbmMiOiJBMTI4Q0JDLUhTMjU2In0..gKfb7AnqhUiIMIn0PGb35g.SUsLS70gBec9GBgraaV5BK8hKyqm-VvMSNjP3nIumtcrj9h19zOkYjaBHrW4SDL10DjeIcwQcz9ul1p8umMHKxPPC-QZpCyJbk7JQkUSqFM._d_sGsiSyPF_Xqs2hmLN5A
or, reduced to the important part:
https://client-il.rexail.com/client/public/public-catalog?s_jwe=[...]
Opening that URL, I found it to be indeed a JSON feed of the products. However, the s_jwe token is dynamic, and without it the JSON doesn't load.
Now, investigating the initially loaded page and searching for s_jwe, you will find
<script>
window.customerStore = {store: angular.fromJson({"id":26,"name":"\u05de\u05e9\u05e7 \u05d4\u05e8 \u05e4\u05e8\u05d7\u05d9\u05dd","imagePath":"images\/stores\/26\/88aa6827bcf05f9484b0dafaedf22b0a.png","secondaryImagePath":"images\/stores\/4d5d1f54038b217244956071ca62312d.png","thirdImagePath":"images\/stores\/26\/2f9294180e7d656ba7280540379869ee.png","fourthImagePath":"images\/stores\/26\/bd2861565b18613497a6ce66903bf9eb.png","externalWebTrackingAccounts":"[{\"accountType\":\"googleAnalytics\",\"identifier\":\"UA-130110792-1\",\"primaryDomain\":\"ecomeshek.co.il\"},{\"accountType\":\"facebookPixel\",\"identifier\":\"3958210627568899\"}]","worksWithStoreCoupons":false,"performSellingUnitsEstimationLearning":false}), s_jwe: "eyJhbGciOiJkaXIiLCJlbmMiOiJBMTI4Q0JDLUhTMjU2In0..gKfb7AnqhUiIMIn0PGb35g.SUsLS70gBec9GBgraaV5BK8hKyqm-VvMSNjP3nIumtcrj9h19zOkYjaBHrW4SDL10DjeIcwQcz9ul1p8umMHKxPPC-QZpCyJbk7JQkUSqFM._d_sGsiSyPF_Xqs2hmLN5A"};
const externalWebTrackingAccounts = angular.fromJson(customerStore.store.externalWebTrackingAccounts);
</script>
containing
s_jwe: "eyJhbGciOiJkaXIiLCJlbmMiOiJBMTI4Q0JDLUhTMjU2In0..gKfb7AnqhUiIMIn0PGb35g.SUsLS70gBec9GBgraaV5BK8hKyqm-VvMSNjP3nIumtcrj9h19zOkYjaBHrW4SDL10DjeIcwQcz9ul1p8umMHKxPPC-QZpCyJbk7JQkUSqFM._d_sGsiSyPF_Xqs2hmLN5A"
So, to summarize: even though the initial page does not contain the products, it does contain the token and the product URL.
Now you can extract the two and call the product catalog directly as such:
FINAL CODE:
import requests
import re
import json

s = requests.Session()

initial_url = 'https://webaccess-il.rexail.com/?s_jwe=eyJhbGciOiJkaXIiLCJlbmMiOiJBMTI4Q0JDLUhTMjU2In0..gKfb7AnqhUiIMIn0PGb35g.SUsLS70gBec9GBgraaV5BK8hKyqm-VvMSNjP3nIumtcrj9h19zOkYjaBHrW4SDL10DjeIcwQcz9ul1p8umMHKxPPC-QZpCyJbk7JQkUSqFM._d_sGsiSyPF_Xqs2hmLN5A#/store-products-shopping-non-customers'
initial_site = s.get(url=initial_url).content.decode('utf-8')

# pull the dynamic s_jwe token out of the inline <script> block
jwe = re.findall(r's_jwe:.*"(.*)"', initial_site)

product_url = "https://client-il.rexail.com/client/public/public-catalog?s_jwe=" + jwe[0]
products_site = s.get(url=product_url).content.decode('utf-8')
products = json.loads(products_site)["data"]
print(products[0])
There is a little bit of fine-tuning required with the decoding, but I am sure you can manage that. ;)
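For instance, assuming the endpoint really returns well-formed JSON, one shortcut is to let requests do the parsing itself (reusing s and product_url from the code above):

# equivalent to the decode/json.loads pair above, assuming a well-formed JSON response
products = s.get(url=product_url).json()["data"]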
This is, of course, the leaner way of scraping that website, but as I hopefully showed, scraping is always a bit of playing Sherlock Holmes.
Any questions, glad to help.
I've been using BeautifulSoup for a while and haven't had many problems.
But now I'm trying to scrape a site that is giving me some trouble.
My code is this:
import requests
from bs4 import BeautifulSoup

preSoup = requests.get('https://www.betbrain.com/football/world/')
soup = BeautifulSoup(preSoup.content, "lxml")
print(soup)
The content I get seems to be some sort of script and/or the API they're connected to, but not the real content of the webpage I see in the browser.
I can't reach the games, for example. Does anyone know a way around this?
Thank you
Okay: requests only gets the HTML and doesn't run the JavaScript. You have to use a webdriver for that.
You can use Chrome, Firefox, etc.; I use PhantomJS because it runs in the background as a "headless" browser. Below you will find some example code that will help you understand how to use it.
from bs4 import BeautifulSoup
import time
from selenium import webdriver

driver = webdriver.PhantomJS()
driver.get("https://www.betbrain.com/football/world/")
time.sleep(5)  # give it some time to load the JS

html = driver.page_source
soup = BeautifulSoup(html, 'lxml')
for i in soup.findAll("span", {"class": "Participant1"}):
    print(i.text)
I'm trying to access a webpage and return all of the hyperlinks on that page. I'm using the same code from a question that was answered here.
I wish to access this correct page, but it only returns content from this incorrect page.
Here is the code I am running:
import httplib2
from bs4 import SoupStrainer, BeautifulSoup

http = httplib2.Http()
status, response = http.request('https://www.iparkit.com/Minneapolis')

for link in BeautifulSoup(response, 'html.parser', parseOnlyThese=SoupStrainer('a')):
    if link.has_attr('href'):
        print(link['href'])
Results:
/account
/monthlyAccount
/myproducts
/search
/
{{Market}}/search?expressSearch=true&timezone={{timezone}}
{{Market}}/search?expressSearch=false&timezone={{timezone}}
{{Market}}/events
monthly?timezone={{timezone}}
/login?next={{ getNextLocation(browserLocation) }}
/account
/monthlyAccount
/myproducts
find
parking-app
iparkit-express
https://interpark.custhelp.com
contact
/
/parking-app
find
https://interpark.custhelp.com
/monthly
/iparkit-express
/partners
/privacy
/terms
/contact
/events
I don't mind getting the above results, but none of them are links that could get me to the page I want. Maybe it's protected? Any ideas or suggestions? Thank you in advance.
The page you are trying to scrape is fully JavaScript-generated.
In this case, http.request('https://www.iparkit.com/Minneapolis') gives you almost nothing.
Instead, you must do what a real browser does: process the JavaScript, then scrape what has been rendered. For this you can try Selenium.
For your page, after running the JavaScript you will get ~84 URLs, while scraping without running the JavaScript you would get ~7 URLs.
from selenium import webdriver
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome('PATH_TO_CHROME_WEBDRIVER', chrome_options=chrome_options)
driver.get('https://www.iparkit.com/Minneapolis')
content = driver.page_source
Then you extract what you want from that content using BeautifulSoup in your case.
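A minimal sketch of that extraction step, assuming you want the hrefs as in your original code:

from bs4 import BeautifulSoup

# parse the Selenium-rendered HTML and print every hyperlink target
soup = BeautifulSoup(content, 'html.parser')
for link in soup.find_all('a', href=True):
    print(link['href'])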