Website returning spoofed result/404 while scraping? - python

I am trying to scrape the following site. I tried using requests.get and parsing the response with Beautiful Soup, but it does not return the same result as when the page is viewed in a browser. I also tried calling the endpoint the site uses directly, but that returns a 404 error. I have tried using headers, but that has not solved it. How do I solve it?
Here is the code I used:
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest'
}
url = 'url'
x = requests.get(url, headers=headers)
The above code does return output, but it does not contain the same content as the website, namely the link to an article that appears in the browser.

The page is loaded via AJAX. I found the API.
The full URL should be:
url = "https://legitquest.com/Search/GetResultBySelectedSearchResult?caseText=AIR+1950+SC+1&type=citation&filter=&sortBy=1&formattedCitation=AIR+1950+SC+1&removeFilter=&filterValueList=&_={}".format(str(time.time()).replace(".","")[:-4])
But for some reason it still couldn't crawl the page, even with the right URL (the site uses strict rules to prevent crawling).
I strongly recommend you use Selenium. It will be easier.
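A minimal sketch of that approach (assumptions: chromedriver is on your PATH, and the starting URL below is a placeholder for the page you actually view in the browser):

import time
from selenium import webdriver

driver = webdriver.Chrome()  # assumes chromedriver is on your PATH
driver.get("https://www.legitquest.com")  # placeholder: use the page you browse
time.sleep(5)  # crude wait to give the AJAX calls time to finish
html = driver.page_source  # the rendered DOM, after JavaScript has run
driver.quit()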
Update: I got it:
import requests
import time

headers = {
    "X-Requested-With": "XMLHttpRequest"
}
url = 'https://legitquest.com/Search/GetResultBySelectedSearchResult?caseText=AIR+1950+SC+1&type=citation&filter=&sortBy=1&formattedCitation=AIR+1950+SC+1&removeFilter=&filterValueList=&_={}'.format(str(time.time()).replace(".","")[:-4])
x = requests.get(url, headers=headers)
print(x.json()["CaseDetails"][0]["LinkText"])
Result:
Sheth Maneklal Mansukhbhai V. Messrs. Hormusji Jamshedji Ginwallaand Sons
The JSON format:
{
    'filterList': '',
    'filterValueList': '',
    'caseText': 'AIR 1950 SC 1',
    'currentpage': 1,
    'CaseCount': 1,
    'openPopup': False,
    'UserId': '',
    'IsSubscribed': False,
    'IsMobileDevice': False,
    'CaseDetails': [{
        'LinkText': 'Sheth Maneklal Mansukhbhai V. Messrs. Hormusji Jamshedji Ginwallaand Sons',
        'PartyName': 'sheth-maneklal-mansukhbhai-vs-messrs.-hormusji-jamshedji-ginwallaand-sons',
        'SearchString': None,
        'CaseId': 21763,
        'EncryptedId': '1EBBB',
        'CourtName': 'Supreme Court Of India',
        'Id': 125883,
        'CourtId': 1,
        'CaseType': None,
        'HeadNotes': None,
        'Judges': "HON'BLE MR. JUSTICE M.C. MAHAJAN<BR />HON'BLE MR. JUSTICE SAIYID FAZAL ALI<BR />HON'BLE MR. JUSTICE B.K. MUKHERJEA",
        'DateOfJudgment': '21-03-1950',
        'Judgment': None,
        'OrderByDateTime': '/Date(-624326400000)/',
        'CaseNo': None,
        'Advocates': None,
        'CitationText': '',
        'CitatedCount': 0,
        'CopyText': None,
        'AlternativeCitation': '(1950) SCR 75 ; AIR 1950 SC 1 ; 1950 SCJ 317 ; (1950) 63 LW 495',
        'Petitioner': None,
        'Responder': None,
        'Citation': None,
        'Question': None,
        'HighlightedText': '',
        'IsFoundText': True,
        'IsOverruledExist': False,
        'IsDistinguishedExist': False,
        'IsOtherStatusExist': True,
        'OtherStatusImgUrl': 'https://www.legitquest.com/Content/themes/treatment/referred.svg',
        'OverruledImgUrl': None,
        'DistinguishedImgUrl': None,
        'BookmarkId': 0,
        'Chart': None,
        'CaseCitedCount': None,
        'SnapShot': None
    }]
}
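Once the response parses, the fields shown above can be read straight off the dict. For example, continuing from the request above (all keys appear in the JSON dump):

# Continuing from the request above: x = requests.get(url, headers=headers)
data = x.json()
for case in data["CaseDetails"]:
    print(case["LinkText"])
    print(case["CourtName"], "|", case["DateOfJudgment"])
    print(case["AlternativeCitation"])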

On doing this:
import requests
from bs4 import BeautifulSoup as soup

url = 'https://legitquest.com/Home/GetCaseDetails?searchType=citation&publisher=AIR%201950%20SC%201'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.113 Safari/537.36'}
page_html = requests.get(url, headers=headers)
print("Status code:", page_html.status_code)
page_soup = soup(page_html.content, features="lxml")
I got the result you require.

Selecting links within a div tag using beautiful soup

I am trying to run the following code
import requests
from bs4 import BeautifulSoup

headers = {
    'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'
}
params = {
    'q': 'Machine learning',
    'hl': 'en'
}
html = requests.get('https://scholar.google.com/scholar', headers=headers,
                    params=params).text
soup = BeautifulSoup(html, 'lxml')
for result in soup.select('.gs_r.gs_or.gs_scl'):
    profiles = result.select('.gs_a a')['href']
The following output (error) is being shown
"TypeError: list indices must be integers or slices, not str"
What is it I am doing wrong?
The following is tested and works:
import requests
from bs4 import BeautifulSoup as bs

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}
params = {
    'q': 'Machine learning',
    'hl': 'en'
}
html = requests.get('https://scholar.google.com/scholar', headers=headers,
                    params=params).text
soup = bs(html, 'lxml')
for result in soup.select('.gs_r.gs_or.gs_scl'):
    profiles = result.select('.gs_a a')
    for p in profiles:
        print(p.get('href'))
Result in terminal:
/citations?user=rSVIHasAAAAJ&hl=en&oi=sra
/citations?user=MnfzuPYAAAAJ&hl=en&oi=sra
/citations?user=09kJn28AAAAJ&hl=en&oi=sra
/citations?user=yxUduqMAAAAJ&hl=en&oi=sra
/citations?user=MnfzuPYAAAAJ&hl=en&oi=sra
/citations?user=9Vdfc2sAAAAJ&hl=en&oi=sra
/citations?user=lXYKgiYAAAAJ&hl=en&oi=sra
/citations?user=xzss3t0AAAAJ&hl=en&oi=sra
/citations?user=BFdcm_gAAAAJ&hl=en&oi=sra
/citations?user=okf5bmQAAAAJ&hl=en&oi=sra
/citations?user=09kJn28AAAAJ&hl=en&oi=sra
In your code, you were trying to obtain the href attribute from a list (soup.select returns a list, while soup.select_one returns a single element).
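A tiny standalone illustration of the difference:

from bs4 import BeautifulSoup

html = '<div class="gs_a"><a href="/a">x</a><a href="/b">y</a></div>'
soup = BeautifulSoup(html, 'lxml')
links = soup.select('.gs_a a')      # list of Tags; links['href'] raises the TypeError
first = soup.select_one('.gs_a a')  # a single Tag (or None if nothing matches)
print([a['href'] for a in links])   # ['/a', '/b']
print(first['href'])                # '/a'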
See the BeautifulSoup documentation.

Why is this web scrape not working on python?

I have been using the attached code recently. For the past few weeks it worked completely fine and always produced results. However, when I ran it today it didn't work for some reason. Could you please help and provide a solution to the problem?
import requests, json
from bs4 import BeautifulSoup

headers = {
    "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {"q": "dji", "hl": "en", 'gl': 'us', 'tbm': 'shop'}
response = requests.get("https://www.google.com/search",
                        params=params,
                        headers=headers)
soup = BeautifulSoup(response.text, 'lxml')

# list with two dict() combined
shopping_data = []
shopping_results_dict = {}

for shopping_result in soup.select('.sh-dgr__content'):
    title = shopping_result.select_one('.Lq5OHe.eaGTj h4').text
    product_link = f"https://www.google.com{shopping_result.select_one('.Lq5OHe.eaGTj')['href']}"
    source = shopping_result.select_one('.IuHnof').text
    price = shopping_result.select_one('span.kHxwFf span').text
    try:
        rating = shopping_result.select_one('.Rsc7Yb').text
    except:
        rating = None
    try:
        reviews = shopping_result.select_one('.Rsc7Yb').next_sibling.next_sibling
    except:
        reviews = None
    try:
        delivery = shopping_result.select_one('.vEjMR').text
    except:
        delivery = None
    shopping_results_dict.update({
        'shopping_results': [{
            'title': title,
            'link': product_link,
            'source': source,
            'price': price,
            'rating': rating,
            'reviews': reviews,
            'delivery': delivery,
        }]
    })
    shopping_data.append(dict(shopping_results_dict))
print(title)
Because .select in for shopping_result in soup.select('.sh-dgr__content'): could not find any elements, it returned an empty list, so the body of the for loop never executed and Python skipped past the loop.
title only exists once the body of the for loop has executed at least once.
You should make sure the selector you use actually matches the element(s) on the current page.
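A defensive pattern, sketched on top of the question's code, that fails loudly instead of silently skipping the loop:

# Continuing from the question's soup object: fail loudly when the
# selector stops matching, instead of silently skipping the loop body.
results = soup.select('.sh-dgr__content')
if not results:
    raise RuntimeError('Selector matched nothing -- the page markup '
                       'probably changed or the request was blocked')
for shopping_result in results:
    ...  # the original extraction logic goes here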

Unable to scrape all the urls available in different depth out of some json content

I'm trying to parse all the URL values available at different depths within some JSON content. I'm attaching a sample containing the URLs at different depths for your consideration.
This is how they are structured (truncated):
{'hasSub': True,
 'navigationTitle': 'Products',
 'nodeName': 'products',
 'pages': [{'hasSub': True,
            'navigationTitle': 'Enclosures',
            'nodeName': 'PG0002SCHRANK1',
            'pages': [{'hasSub': True,
                       'navigationTitle': 'Hygienic Design',
                       'nodeName': 'PG0125SCHRANK1',
                       'pages': [{'hasSub': False,
                                  'navigationTitle': 'Hygienic Design Terminal box HD',
                                  'nodeName': 'PRO0130',
                                  'target': '_self',
                                  'url': '/com-en/products/PG0002SCHRANK1/PG0125SCHRANK1/PRO0130'},
                                 {'hasSub': False,
                                  'navigationTitle': 'Hygienic Design Compact enclosure HD, single-door',
                                  'nodeName': 'PRO0131',
                                  'target': '_self',
                                  'url': '/com-en/products/PG0002SCHRANK1/PG0125SCHRANK1/PRO0131'},
If I consider the above content, the output I'm after:
/com-en/products/PG0002SCHRANK1/PG0125SCHRANK1/PRO0130
/com-en/products/PG0002SCHRANK1/PG0125SCHRANK1/PRO0131
The script that I've written to produce the json content:
import requests
from pprint import pprint

url = 'https://www.rittal.com/.rest/nav/menu/tree?'
params = {
    'path': 'com',
    'locale': 'en',
    'deep': '10'
}

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
    s.headers['Accept'] = 'application/json, text/plain, */*'
    r = s.get(url, params=params)
    pprint(r.json()['pages'][0])
How can I scrape all the urls from different depth out of the json content?
Okay, it seems I've found a solution elsewhere that fetches all the available links out of any nested JSON.
import requests
from pprint import pprint

url = 'https://www.rittal.com/.rest/nav/menu/tree?'
params = {
    'path': 'com',
    'locale': 'en',
    'deep': '10'
}

def json_extract(obj, key):
    # Recursively collect the value of every occurrence of `key`
    # anywhere inside a nested dict/list structure.
    arr = []

    def extract(obj, arr, key):
        if isinstance(obj, dict):
            for k, v in obj.items():
                if isinstance(v, (dict, list)):
                    extract(v, arr, key)
                elif k == key:
                    arr.append(v)
        elif isinstance(obj, list):
            for item in obj:
                extract(item, arr, key)
        return arr

    values = extract(obj, arr, key)
    return values

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
    s.headers['Accept'] = 'application/json, text/plain, */*'
    r = s.get(url, params=params).json()
    for item in json_extract(r, 'url'):
        print(item)
The number of links the script produces is around 3,500.
What you can do is recurse over the JSON. This is the best way to handle the differing depth of URLs.
The following recursion will retrieve the deepest URLs by recursing over the JSON.
import requests
from pprint import pprint

url = 'https://www.rittal.com/.rest/nav/menu/tree'
params = {
    'path': 'com',
    'locale': 'en',
    'deep': '10'
}

def recurse(data):
    if 'pages' in data:
        for page in data['pages']:
            recurse(page)
    elif 'url' in data and data['url'].startswith('/com-en/'):
        urls.append(data['url'])

urls = []
with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
    s.headers['Accept'] = 'application/json, text/plain, */*'
    r = s.get(url, params=params).json()
    recurse(r)
pprint(urls)
This is how it works:
Recursive case - if there are pages at the current level, then recurse for each page at the current level
Base case - if a URL appears at the current level, then append it to a list of URLs
Also, if you switch out the elif for an if, it will give you all the URLs at any level.
Update: It seems there are 2 rogue URLs in that JSON. In particular, one is https://www.eplan-software.com/solutions/eplan-platform/ and another is blank! As such, I've added the condition data['url'].startswith('/com-en/') to only append the URLs which fit the expected pattern.
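Spelled out, the elif-to-if variant mentioned above looks like this; the only change is that a node's URL is collected even when the node also has sub-pages:

def recurse_all(data):
    # "if" instead of "elif": collect a URL at every level, not just
    # at the deepest nodes.
    if 'pages' in data:
        for page in data['pages']:
            recurse_all(page)
    if 'url' in data and data['url'].startswith('/com-en/'):
        urls.append(data['url'])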

Pick only one number from an HTML page with BeautifulSoup

I have this URL with worldwide coronavirus data and I would like to pick only one number: the new cases in Arizona, which is +2383 right now.
import requests
from bs4 import BeautifulSoup
import lxml
url = "https://www.worldmeter.com/coronavirus/us/"
page = requests.get("https://www.worldmeter.com/coronavirus/us/")
soup = BeautifulSoup(page.content, "lxml")
page.close()
newcases = soup.find('a', href_="https://worldmeter.com/coronavirus/arizona", class_="tableRowLinkYellow newCasesStates").get_text(strip=True)
print(newcases)
I get this error:
AttributeError: 'NoneType' object has no attribute 'get_text'
How do I pick only that number from the whole table? Thank you for your time.
Just like Linh said, it was generated by JavaScript. Using Selenium is an easy way but not efficient enough (too slow).
You could scrape the API directly:
import requests

url = "https://worldmeter.com/coronavirus/wp-admin/admin-ajax.php?action=wp_ajax_ninja_tables_public_action&table_id=2582&target_action=get-all-data&default_sorting=old_first"
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36",
}
results = requests.get(url, headers=headers).json()
for result in results:
    if result["state_name"] == "Arizona":
        print(result)
        print("The newcases is", result["new_cases"])
And this gave me:
{'state_name': 'Arizona', 'positive': '275,436', 'new_cases': '2,383', 'death_in_states': '6,302', 'new_deaths': '2', 'recovered_states': '45,400', 'new_recovered': '364', 'totaltestresults': 'Arizona', 'postname': 'arizona', 'cases_100_k_population': '3,866.37', 'state_population': '7278717', 'death_100_k_population': '88.46'}
The newcases is 2,383
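Note that the API returns numbers as comma-formatted strings (as the output above shows); if you need an integer for comparisons or arithmetic, strip the commas first:

# Continuing from the loop above: "2,383" -> 2383
new_cases = int(result["new_cases"].replace(",", ""))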

How to log in to Instacart using requests?

So I tried the following rough version:
import requests
from bs4 import BeautifulSoup

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
}
session = requests.Session()
res1 = session.get('http://www.instacart.com', headers=headers)
soup = BeautifulSoup(res1.content, 'html.parser')
token = soup.find('meta', {'name': 'csrf-token'}).get('content')
data = {"user": {"email": "user@gmail.com", "password": "xxxxx"},
        "authenticity_token": token}
res2 = session.post('https://www.instacart.com/accounts/login', headers=headers, data=data)
print(res2)
I always get the following error:
<Response [400]>
with the response content:
b'{"status":400,"error":"There was a problem in the JSON you submitted: Empty input () at line 1, column 1"}'
What am I doing wrong?
Actually, you were missing the correct params for the POST request.
I made a GET request to the main site to collect the necessary authenticity_token, which is used within the POST request, and then made the POST request to the correct login URL.
import requests
from bs4 import BeautifulSoup

params = {
    'source': 'web',
    'cache_key': 'undefined'
}
data = {
    'email': 'email@email.com',
    'grant_type': 'password',
    'password': 'yourpassword',
    'scope': '',
    'signup_v3_endpoints_web': 'null'
}
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:75.0) Gecko/20100101 Firefox/75.0",
}

def main(url):
    with requests.Session() as req:
        r = req.get(url, headers=headers)
        soup = BeautifulSoup(r.content, 'html.parser')
        data['authenticity_token'] = soup.find(
            "meta", {'name': 'csrf-token'}).get("content")
        r = req.post(
            "https://www.instacart.com/v3/dynamic_data/authenticate/login",
            params=params, json=data, headers=headers).json()
        print(r)

main("https://www.instacart.com")
