I have been trying to find the XPath for the "customers who viewed this product also viewed" section, but I cannot seem to get the code correct. I have very little experience with XPath, but I have been using an online scraper to get the info and learn from it.
What I've been doing is:
import requests
from lxml import html

def AmzonParser(url):
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36'}
    page = requests.get(url, headers=headers)
    return html.fromstring(page.content)

def scraper(doc):
    # attributes are selected with @ (not #), and href has to be read as an attribute
    XPATH_RECOMMENDED = '//*[@id="anonCarousel3"]/ol/li[1]/div/a/@href'
    RAW_RECOMMENDED = doc.xpath(XPATH_RECOMMENDED)
    RECOMMENDED = ' '.join(RAW_RECOMMENDED).strip() if RAW_RECOMMENDED else None
    return RECOMMENDED
My main goal is to get the "customers also viewed" link so I can pass it back to the scraper. This is just a snippet of my code.
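To make that goal concrete, here is a minimal sketch of following the recommendation chain, reusing the two helpers above. The starting URL is a placeholder, and the carousel id and XPath can differ from page to page, so treat them as assumptions:

start_url = 'https://www.amazon.com/dp/EXAMPLE_ASIN'  # placeholder product URL
url = start_url
for _ in range(5):                      # follow the "also viewed" link a few hops
    doc = AmzonParser(url)
    recommended = scraper(doc)
    if not recommended:
        break
    # Amazon often returns relative links, so make them absolute before reusing them
    url = 'https://www.amazon.com' + recommended if recommended.startswith('/') else recommended
    print(url)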
I'm writing a program to track the prices of Amazon products, where the user provides the URL of each product. I first read two URLs as input and store them as objects in a list, then loop over the list of objects one by one to track the prices. I would also like to be able to add more URLs while the loop is running, without pausing it. I have attached the full program I have written so far below. Thanks.
import requests
import lxml
from bs4 import BeautifulSoup
import re
import time

class PriceTrack:
    productMethods = []

    def __init__(self, url):
        HEADERS = {'User-Agent':
                       'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36',
                   'Accept-Language': 'en-US, en;q=0.5'}
        source = requests.get(url, headers=HEADERS).text
        soup = BeautifulSoup(source, 'lxml')
        # title
        title = soup.find(id='productTitle')
        ProductTitle = title.get_text().strip()
        # price
        price = soup.find(id='priceblock_ourprice').text[2:]
        result = re.sub(r"[,₹]", "", price)
        price, _ = result.split('.')
        price = float(price)
        # price list
        price_list = []
        self.url = url                    # keep the URL so the price can be re-fetched later
        self.ProductTitle = ProductTitle
        self.Price = price
        price_list.append(price)
        self.PriceList = price_list

    def trackAndadd(self):
        HEADERS = {'User-Agent':
                       'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36',
                   'Accept-Language': 'en-US, en;q=0.5'}
        source = requests.get(self.url, headers=HEADERS).text
        soup = BeautifulSoup(source, 'lxml')
        # price
        price = soup.find(id='priceblock_ourprice').text[2:]
        result = re.sub(r"[,₹]", "", price)
        price, _ = result.split('.')
        price = float(price)
        if price != self.PriceList[-1]:
            self.PriceList.append(price)
        print(self.ProductTitle, self.Price, self.PriceList)
        print()

    @classmethod
    def Addproduct(cls, url):
        product = cls(url)
        PriceTrack.productMethods.append(product)
        print(str(product))

url = input('Enter the url :')
PriceTrack.Addproduct(url)
url = input('Enter the url :')
PriceTrack.Addproduct(url)

while True:
    for i in PriceTrack.productMethods:
        i.trackAndadd()
    time.sleep(5)
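To picture the "add more URLs without pausing" part, here is a rough sketch of how the tail end of the script could look, using the standard-library threading module and assuming the PriceTrack class above; it is only an illustration of the behaviour being asked for:

import threading

def accept_urls():
    # runs in the background so input() never blocks the tracking loop
    while True:
        new_url = input('Enter the url :')
        PriceTrack.Addproduct(new_url)

threading.Thread(target=accept_urls, daemon=True).start()

while True:
    # iterate over a copy so the list can grow while the loop is running
    for product in list(PriceTrack.productMethods):
        product.trackAndadd()
    time.sleep(5)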
I suggest you take a look at the Scrapy Python package, which does what you are looking to do and then some. You will probably have to rewrite your code a bit to fit their framework, but yielding initial links and then yielding additional links as execution goes on is the bread and butter of Scrapy (see for example here).
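For illustration, a minimal sketch of a Scrapy spider that yields scraped items and new requests from the same callback; the start URLs and CSS selectors below are placeholders, not taken from your pages:

import scrapy

class PriceSpider(scrapy.Spider):
    name = "prices"
    # placeholder product pages; replace with the URLs you want to track
    start_urls = [
        "https://www.amazon.in/dp/EXAMPLE1",
        "https://www.amazon.in/dp/EXAMPLE2",
    ]

    def parse(self, response):
        # yield the data for the current page
        yield {
            "url": response.url,
            "title": response.css("#productTitle::text").get(default="").strip(),
        }
        # yield additional requests as you discover them; Scrapy schedules
        # them alongside the ones already queued, without pausing anything
        for href in response.css("a.more-products::attr(href)").getall():
            yield response.follow(href, callback=self.parse)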
What I am trying to do is go to https://www.ssrn.com/index.cfm/en/, search for an author name, and then click on that author's name to get to the author page. Right now I am stuck on the first step, as requests.post() just returns the original input URL, not the results page. I've searched around for a while, but I have no idea what the problem is, or whether I've misidentified the form key or something else.
Here is my code:
url = "https://www.ssrn.com/index.cfm/en/"
header = {}
header['User-Agent'] = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'
obj = requests.post(url, headers = header, data = {'txtKey_Words': 'auth_name'}) # make POST request for each name
print(obj.url)
So print(obj.url) just returns https://www.ssrn.com/index.cfm/en/ instead of https://papers.ssrn.com/sol3/results.cfm
Much appreciated!
Edit: Using https://papers.ssrn.com/sol3/results.cfm returns https://papers.ssrn.com/sol3/displayabstractsearch.cfm
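For what it's worth, one way to see where the search form actually submits is to inspect the form markup before posting. A minimal sketch, assuming the search box sits in an ordinary <form> element; it prints whatever action URLs and field names the page really uses rather than hard-coding them:

import requests
from bs4 import BeautifulSoup

url = "https://www.ssrn.com/index.cfm/en/"
header = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

# fetch the landing page and list every form's target and input names,
# so the POST can be sent to the form's real action URL with the real keys
page = requests.get(url, headers=header)
soup = BeautifulSoup(page.text, "html.parser")
for form in soup.find_all("form"):
    print(form.get("action"), form.get("method"))
    for field in form.find_all(["input", "select"]):
        print("   ", field.get("name"), "=", field.get("value"))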
I am currently working on a project that gets all the orders I've placed on Amazon, categorizes them, and then writes them to an Excel file. The problem is that when I try to scrape the page using bs4, I get None as the result.
I've made a similar project before, which searches Amazon for the product you want and then saves all the data about that product (name, price, reviews) in a JSON file.
That worked perfectly, but this doesn't seem to work.
Here is the code:
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36'
}
link = 'https://www.amazon.in/gp/your-account/order-history?opt=ab&digitalOrders=1&unifiedOrders=1&returnTo=&orderFilter=year-2020'
data = requests.get(link, headers=headers)
soup = BeautifulSoup(data.text, 'lxml')
product = soup.find('div', class_="a-box-group a-spacing-base order")
print(product)
I'm a beginner, but I think it's because I need to log in to see my details, even though my password is already saved in my browser.
Any help is appreciated.
Thanks
See this GitHub project for reference.
Amazon, like most prominent companies, doesn't allow simple scraping and requires some form of authentication.
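If you do stay with requests, one rough sketch is to reuse the session cookies from a browser where you are already logged in. The cookie names and values below are placeholders; copy the real ones for amazon.in out of your browser's developer tools:

import requests
from bs4 import BeautifulSoup

session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36'})

# placeholder cookie names/values; paste every cookie your logged-in browser sends
for name, value in {'session-id': 'PASTE_VALUE_HERE', 'another-cookie': 'PASTE_VALUE_HERE'}.items():
    session.cookies.set(name, value, domain='.amazon.in')

resp = session.get('https://www.amazon.in/gp/your-account/order-history')
soup = BeautifulSoup(resp.text, 'lxml')
print(soup.find('div', class_='a-box-group a-spacing-base order'))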
I am getting certain elements in Google Chrome (Inspect) that do not appear in Internet Explorer when I view the source of the same web page.
I assume Beautiful Soup uses Internet Explorer internally? Its results match IE more closely.
However, when I use Chrome's Inspect feature, I see certain elements that are not listed in the page source.
Is there a way I can emulate this in Python or with Beautiful Soup?
You can change your user agent to one of the following:
https://webscraping.com/blog/User-agents/
A snippet: changing the User-Agent forces the page to serve different content (mobile vs. desktop Chrome):
from bs4 import BeautifulSoup
import requests

# desktop UA, kept for comparison
# headers = {'User-Agent': 'Mozilla/5.0 (X11; U; Linux i686; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.63 Safari/534.3'}
headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 5.1.1; SM-G928X Build/LMY47X) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.83 Mobile Safari/537.36'}
result = requests.get("http://derstandard.at", headers=headers)
c = result.content
print(result.request.headers)
print(len(c))
Note: some websites protect themselves against user-agent spoofing, so not every site will respond to these switches.
I am trying to read a LinkedIn company page, for example https://www.linkedin.com/company/facebook, and get the company name, location, type of industry, etc.
This is my code below:
library(RCurl)
library(rvest)

urlCreate1 <- "https://www.linkedin.com/company/facebook"
parse_rvest <- getURL(urlCreate1, useragent = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36")
content <- read_html(parse_rvest)  # parse the downloaded HTML before selecting nodes
nameRest <- content %>% html_nodes(".industry") %>% html_text()
nameRest
The output I get for this is character(0), which from previous posts I understand means it isn't finding the .industry node in the HTML I read.
I have also tried this:
parse_rvest <- content(GET(urlCreate1), encoding = 'UTF-8')
but it doesn't help.
I have Python code that works, but I need this to be done in R.
This is part of the Python code I got online:
import json
import requests
from lxml import html

url = "https://www.linkedin.com/company/facebook"
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36'}
response = requests.get(url, headers=headers)
formatted_response = response.text.replace('<!--', '').replace('-->', '')
doc = html.fromstring(formatted_response)
# the company data is embedded as JSON inside a <code> element
datafrom_xpath = doc.xpath('//code[@id="stream-promo-top-bar-embed-id-content"]//text()')
if datafrom_xpath:
    try:
        json_formatted_data = json.loads(datafrom_xpath[0])
        company_name = json_formatted_data['companyName'] if 'companyName' in json_formatted_data else None
        size = json_formatted_data['size'] if 'size' in json_formatted_data else None
    except ValueError:
        pass  # the embedded JSON could not be parsed
Please help me read the page. I am using SelectorGadget to get the selector (.industry).
Have a look at the LinkedIn API via the Rlinkedin package:
https://cran.r-project.org/web/packages/Rlinkedin/Rlinkedin.pdf
Then you should be able to easily, and legally, do whatever you want to do.
Here are some ideas to get you started.
http://thinktostart.com/analyze-linkedin-with-r/
https://github.com/hadley/httr/issues/200
https://www.reddit.com/r/datascience/comments/3rufk5/pulling_data_from_linkedin_api/