I'm trying to scrape the reviews from this page, but my code doesn't return them. Any help would be appreciated.
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36'
}
r = requests.get('https://www.realpatientratings.com/botox-cosmetic')
soup = BeautifulSoup(r.content, 'lxml')
tags = soup.find_all('p', class_='text')
for u in tags:
    print(u.text)
After checking the XHR requests, I found that you're fetching the wrong page.
Try:
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36'
}
# request the XHR endpoint that actually returns the review markup
r = requests.get('https://www.realpatientratings.com/reviews/procreviewfilters?type=surgical&star=&procedureId=147&sort=new&location=&state=0&within=0', headers=headers)
soup = BeautifulSoup(r.content, 'lxml')
tags = soup.find_all('p', class_='text')
for u in tags:
    print(u.text)
I just changed https://www.realpatientratings.com/botox-cosmetic to https://www.realpatientratings.com/reviews/procreviewfilters?type=surgical&star=&procedureId=147&sort=new&location=&state=0&within=0, which is the endpoint the page loads its reviews from.
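If you'd rather not hard-code that long query string, requests can build it for you from a params dict. A small variant of the snippet above (the parameter values are taken straight from the URL, and headers is the same dict as before):
# same request, with the query string built by requests from a dict
params = {
    'type': 'surgical',
    'star': '',
    'procedureId': '147',
    'sort': 'new',
    'location': '',
    'state': '0',
    'within': '0',
}
r = requests.get('https://www.realpatientratings.com/reviews/procreviewfilters',
                 params=params, headers=headers)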
I'm trying to make a pop-up program that shows the MIR4 Draco price, but the price comes back as None:
import requests
from bs4 import BeautifulSoup

url = 'https://www.xdraco.com/coin/price/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/86.0.4240.198 Safari/537.36'
}
site = requests.get(url, headers=headers)
soup = BeautifulSoup(site.content, 'html5lib')
price = soup.find('span', class_="amount")
print(price)  # prints None
As @jabbson mentioned, you won't be able to parse a site whose content is dynamically loaded with JavaScript using requests alone.
This might be a way to get the data you want.
If you check the network requests made by the page, you will find that it calls a few different APIs. I found one that seems to have the info you're looking for. You can make POST requests to this API as shown below...
import requests

headers = {
    'accept': 'application/json, text/plain, */*',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'
}
response = requests.post('https://api.mir4global.com/wallet/prices/hydra/daily', headers=headers)
output = response.json()
# 'output' is a dictionary; the last element of 'Data' is the latest entry
print(output['Data'][-1])
OUTPUT:
{'CreatedDT': '2022-08-04 21:55:00', 'HydraPrice': '2.1301000000000001', 'HydraAmount': '13434', 'HydraPricePrev': '2.3336000000000001', 'HydraAmountPrev': '5972', 'HydraUSDWemixRate': '2.9401340627166839', 'HydraUSDKLAYRate': '0.29840511595654395', 'USDHydraRate': '6.2627795669928084'}
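Note that the API returns its numeric fields as strings, so cast before doing any arithmetic. A short follow-up (field names taken from the output above):
latest = output['Data'][-1]
price = float(latest['HydraPrice'])  # e.g. 2.1301
print(round(price, 4))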
I am trying to scrape the product links from an Amazon listing page, but my code only returns two or three links.
The URL is https://www.amazon.com/s?rh=n%3A1069242&fs=true&ref=lp_1069242_sar
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'
}
r = requests.get('https://www.amazon.com/s?rh=n%3A1069242&fs=true&ref=lp_1069242_sar')
soup = BeautifulSoup(r.content, 'html.parser')
for link in soup.find_all('a', href=True):
    print(link['href'])
Here is the working solution:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

base_url = 'https://www.amazon.com'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36',
    'session': '141-2320098-4829807'
}
r = requests.get('https://www.amazon.com/s?rh=n%3A1069242&fs=true&ref=lp_1069242_sar', headers=headers)
soup = BeautifulSoup(r.content, 'lxml')
# product links carry this class combination; the hrefs are relative,
# so join each one with the base URL
for link in soup.find_all('a', class_="a-link-normal s-underline-text s-underline-link-text a-text-normal", href=True):
    href = link['href']
    print(urljoin(base_url, href))
Output:
https://www.amazon.com/gp/slredirect/picassoRedirect.html/ref=pa_sp_atf_browse_office-products_sr_pg1_1?ie=UTF8&adId=A05861132UJ9W79S82Z3&url=%2FFiskars-Inch-Student-Scissors-Pack%2Fdp%2FB08CL355MN%2Fref%3Dsr_1_1_sspa%3Fdchild%3D1%26qid%3D1633717907%26s%3Doffice-products%26sr%3D1-1-spons%26psc%3D1&qualifier=1633717907&id=1565389383398743&widgetName=sp_atf_browse
https://www.amazon.com/gp/slredirect/picassoRedirect.html/ref=pa_sp_atf_browse_office-products_sr_pg1_1?ie=UTF8&adId=A0918144191FAIKGYK3YC&url=%2FFiskars-Inch-Blunt-Kids-Scissors%2Fdp%2FB00TJSS9ZW%2Fref%3Dsr_1_2_sspa%3Fdchild%3D1%26qid%3D1633717907%26s%3Doffice-products%26sr%3D1-2-spons%26psc%3D1&qualifier=1633717907&id=1565389383398743&widgetName=sp_atf_browse
https://www.amazon.com/gp/slredirect/picassoRedirect.html/ref=pa_sp_atf_browse_office-products_sr_pg1_1?ie=UTF8&adId=A09889161KB2CNO5NB8QC&url=%2FLind-Kitchen-Dispenser-Decorative-Stationery%2Fdp%2FB07VRLW5C6%2Fref%3Dsr_1_3_sspa%3Fdchild%3D1%26qid%3D1633717907%26s%3Doffice-products%26sr%3D1-3-spons%26psc%3D1&qualifier=1633717907&id=1565389383398743&widgetName=sp_atf_browse
https://www.amazon.com/Zebra-Pen-Retractable-Ballpoint-18-Count/dp/B00M382RJO/ref=sr_1_4?dchild=1&qid=1633717907&s=office-products&sr=1-4
...and so on
I am trying to scrape an API with requests. The endpoint is https://www.nseindia.com/api/event-calendar.
This is the error it gives me:
ValueError: No JSON object could be decoded
Here is the code:
import requests
import json

url = 'https://www.nseindia.com/api/event-calendar'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36'}
request = requests.get(url, headers=headers)
data = json.loads(request.text)
print(data)
How can I scrape this website?
Try this:
import requests
from bs4 import BeautifulSoup

url = 'https://www.nseindia.com/companies-listing/corporate-filings-event-calendar'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36'}
request = requests.get(url, headers=headers)
soup = BeautifulSoup(request.text, 'html.parser')
print(soup)
The table is probably being dynamically generated with JavaScript, so requests alone won't work. You need Selenium and a headless browser for that.
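A minimal sketch of that approach, assuming the calendar is rendered into an HTML table (the wait condition and selector are assumptions you may need to adjust for the real page):
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = Options()
options.add_argument('--headless=new')  # plain '--headless' on older Chrome versions
driver = webdriver.Chrome(options=options)
try:
    driver.get('https://www.nseindia.com/companies-listing/corporate-filings-event-calendar')
    # wait until the JS-rendered table appears (assumed: any <table> element)
    WebDriverWait(driver, 20).until(
        EC.presence_of_element_located((By.TAG_NAME, 'table'))
    )
    # page_source now contains the rendered markup; parse it with BeautifulSoup if needed
    print(driver.page_source[:500])
finally:
    driver.quit()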
I am trying to parse the following page: https://www.amazon.de/s?k=lego+7134&__mk_nl_NL=amazon&ref=nb_sb_noss_1.
requests.get returns the full page source, but when I try to parse it with Beautiful Soup, I get an empty list [].
I've tried changing the encoding, using Chromium, requests-html, different parsers, replacing the beginning of the source, etc. I'm sad to say that nothing seems to work.
import random
import requests
from bs4 import BeautifulSoup as soup

url = "https://www.amazon.de/s?k=lego+7134&__mk_nl_NL=amazon&ref=nb_sb_noss_1"

userAgentList = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36',
    'Mozilla/5.0 (Windows NT 5.1; rv:7.0.1) Gecko/20100101 Firefox/7.0.1',
    'Mozilla/5.0 (Windows NT 5.1; rv:36.0) Gecko/20100101 Firefox/36.0',
]

proxyList = [
    'xxx.x.x.xxx:8080',
    'xx.xx.xx.xx:3128',
]

def make_soup_am(url):
    print(url)
    s = requests.Session()
    # requests expects proxies as a dict keyed by scheme, not a list
    proxy = random.choice(proxyList)
    s.proxies = {'http': proxy, 'https': proxy}
    headers = {'User-Agent': random.choice(userAgentList)}
    pageHTML = s.get(url, headers=headers).text
    pageSoup = soup(pageHTML, features='lxml')
    return pageSoup

make_soup_am(url)
Does anyone have an idea?
Thanks in advance,
Tom
I request this page with urllib2, but response.read() gives me unreadable bytes:
import urllib2

url = 'http://www.bilibili.com/video/av1669338'
user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
headers = {"User-Agent": user_agent}
request = urllib2.Request(url, headers=headers)
response = urllib2.urlopen(request)
text = response.read()
text[:100]
'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\xcd}ys\x1bG\xb2\xe7\xdfV\xc4|\x87\x1exhRk\x81\xb8\x08\x10\x90E\xfa\x89\xb2f\x9f\xe3\xd9\xcf\x9e\x1dyb7\xec\tD\x03h\x90\x90p\t\x07)yf"D\xf9I&EI\xd4}\x91\xb6.\xeb\xb0e\x93\x94%Y\xbc$E\xccW\x194\x00\xfe\xe5\xaf\xf0~Y\xd5\xd5\xa8\xeeF\x83\xa7'
I also tried with requests and BeautifulSoup:
import requests
from bs4 import BeautifulSoup

def data():
    url = 'http://www.bilibili.com/video/av1669338'
    user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
    headers = {"User-Agent": user_agent}
    response = requests.get(url, headers=headers)
    data = response.content
    _html = BeautifulSoup(data, 'lxml')
    _meta = _html.head.select('meta[name=keywords]')
    print _meta[0]['content']
Try this:
import bs4, requests

res = requests.get("http://www.bilibili.com/video/av1669338")
soup = bs4.BeautifulSoup(res.content, "lxml")
result = soup.find("meta", attrs={"name": "keywords"}).get("content")
print result
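Why this works: the leading '\x1f\x8b\x08' bytes in your output are the gzip magic number, i.e. the server sent a gzip-compressed body. requests decompresses Content-Encoding: gzip transparently, while urllib2 hands you the raw bytes. If you wanted to stay with urllib2, a minimal sketch of the manual decompression:
import gzip
import io
import urllib2

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"}
request = urllib2.Request('http://www.bilibili.com/video/av1669338', headers=headers)
response = urllib2.urlopen(request)
raw = response.read()
# the b'\x1f\x8b' prefix marks gzip data; decompress it by hand
text = gzip.GzipFile(fileobj=io.BytesIO(raw)).read()
print text[:100]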