I'm currently working on a project that fetches all the orders I've placed on Amazon, categorizes them, and writes them to an Excel file. The problem is that when I try to scrape the page using bs4, the result comes back as None.
I've made a similar project before, which searches Amazon for a product and saves all of its data (name, price, reviews) to a JSON file.
That worked perfectly, but this doesn't seem to work.
Here is the code:
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36'
}
link = 'https://www.amazon.in/gp/your-account/order-history?opt=ab&digitalOrders=1&unifiedOrders=1&returnTo=&orderFilter=year-2020'
data = requests.get(link, headers = headers)
soup = BeautifulSoup(data.text, 'lxml')
product = soup.find('div', class_="a-box-group a-spacing-base order")
print(product)
I'm a beginner, but I think it's because I need to log in to see my details; my password is already saved in my browser, though.
Any help is appreciated.
Thanks
See this GitHub project for reference
Amazon, like most prominent companies, doesn't allow simple scraping; you need some form of authentication.
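One rough sketch of a workaround (not a verified recipe) is to reuse your browser's logged-in session by copying its cookies into a requests.Session. The cookie names and values below are placeholders, not real Amazon cookies; copy the actual ones from DevTools (Application > Cookies) for your own account:

```python
import requests

# Sketch: reuse a logged-in browser session by copying its cookies.
# 'session-id' / 'session-token' and their values are placeholders --
# substitute the real names/values from your browser's DevTools.
session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36'
})
session.cookies.set('session-id', 'PASTE-FROM-BROWSER', domain='.amazon.in')
session.cookies.set('session-token', 'PASTE-FROM-BROWSER', domain='.amazon.in')

# With valid cookies copied in, the order-history page should return your data:
# page = session.get('https://www.amazon.in/gp/your-account/order-history')
```

Keep in mind these cookies expire, so this is fragile; a browser-automation tool that performs the login itself is the more robust route.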
Related
I am trying to scrape the tbody under the following element of https://steamdb.info/app/730/graphs/ (I have permission to scrape).
<div id="chart-month-breakdown" class="table-responsive">
However, when trying to scrape the content or access it through Selenium, I can't because it appears as such:
<div id="chart-month-breakdown" class="table-responsive" hidden="">
The 'hidden' attribute only disappears when I browse the page manually, so I am not able to scrape it through requests.get.
Is there a way to get the content?
If you are good with stats, here are all the stats that table uses. The site fetches the table data from an API, and you can make the API call directly like this:
import requests
headers = {
    'referer': 'https://steamdb.info/app/730/graphs/',
    'x-requested-with': 'XMLHttpRequest',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'
}
a = requests.get('https://steamdb.info/api/GetGraph/?type=concurrent_week&appid=730',headers = headers).text
print(a)
If you are using Selenium: removing the hidden attribute got me all the data. You can execute JavaScript through Selenium to make that div visible. Try something like this:
javaScript = "document.getElementById('chart-month-breakdown').removeAttribute('hidden');"
driver.execute_script(javaScript)
time.sleep(3)  # to let the data populate
I'm trying to scrape the value of Balance from a webpage using the requests module. I've looked for the name Balance in dev tools and in the page source but found it nowhere. I'm hoping there is some way to grab the value of Balance from that webpage without using a browser simulator.
website address
Output I'm after:
I've tried with:
import requests
from bs4 import BeautifulSoup
link = 'https://tronscan.org/?fbclid=IwAR2WiSKZoTDPWX1ufaAIEg9vaA5oLj9Yd_RUfpjE6MWEQKRGBaK-L_JdtwQ#/contract/TCSPn1Lbdv62QfSCczbLdwupNoCFYAfUVL'
headers = {"User-Agent":"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36"}
res = requests.get(link,headers=headers)
soup = BeautifulSoup(res.text,'lxml')
balance = soup.select_one("li:has(> p:contains('Balance'))").get_text(strip=True)
print(balance)
The reason the page's HTML doesn't contain the balance is that the page makes AJAX requests which return the information you want after the page is loaded. You can inspect these requests by opening the developer tools (press F12 in Chrome; it might differ in other browsers) and going to the Network tab, where you'll see this:
Here you can see the request that you want is account?address= followed by the code that is in the URL string for the page, and mousing over that shows the complete URL for the AJAX request, highlighted in coral, and the part of the response which holds the data you want is on the right highlighted in turquoise.
You can look at response by going here and find tokenBalances.
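As a side note, you don't have to copy the address by hand: the contract address is the last segment of the page URL's fragment (the part after '#'), so the API URL can be derived programmatically. A small stdlib-only sketch:

```python
from urllib.parse import urlparse

# The contract address is the last segment of the URL fragment.
page_url = 'https://tronscan.org/#/contract/TCSPn1Lbdv62QfSCczbLdwupNoCFYAfUVL'
address = urlparse(page_url).fragment.rsplit('/', 1)[-1]

# Build the AJAX endpoint seen in the Network tab from that address.
api_url = f'https://apilist.tronscan.org/api/account?address={address}'
print(api_url)
```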
In order to get the balance in Python you can run the following:
import requests, json
url = 'https://apilist.tronscan.org/api/account?address=TCSPn1Lbdv62QfSCczbLdwupNoCFYAfUVL'
headers = {"User-Agent":"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36"}
response = requests.get(url, headers=headers)
data = response.json()
balance = data['tokenBalances'][0]['balance']
print(balance)
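If you want the lookup to fail gracefully when tokenBalances is missing or empty, you can guard it. The dict below is a made-up stand-in that only mirrors the tokenBalances shape used above; the values are invented for illustration:

```python
# Stand-in for the API response; mirrors the 'tokenBalances' shape
# described above, with invented values.
sample = {'tokenBalances': [{'name': 'TRX', 'balance': '2452000000'}]}

# next() with a default of None avoids an IndexError/KeyError
# if the response doesn't have the expected shape.
balance = next((t.get('balance') for t in sample.get('tokenBalances', [])), None)
print(balance)
```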
I'm a Korean who just started learning Python.
First, I apologize for my English.
I learned how to use BeautifulSoup on YouTube, and crawling certain sites worked.
However, I found that crawling did not go well on certain other sites, and after searching I learned that I had to set a User-Agent.
So I used requests to write code that sets the User-Agent. I then used the same code as before to read a particular class from the HTML, but it did not work.
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}
url ='https://store.leagueoflegends.co.kr/skins'
response = requests.get(url, headers = headers)
soup = BeautifulSoup(response.text, 'html.parser')
for skin in soup.select(".item-name"):
    print(skin)
Here's my code. I have no idea what the problem is.
Please help me.
Your problem is that requests does not render JavaScript; it only gives you the "initial" source code of the page. What you should use is a package called Selenium. It lets you control your browser (Chrome, Firefox, etc.) from Python. The website won't be able to tell the difference, and you won't need to mess with headers and user-agents. There are plenty of videos on YouTube on how to use it.
I've tried searching for this - can't seem to find the answer!
I'm trying to do a really simple scrape of an entire webpage so that I can look for key words. I'm using the following code:
import requests
Website = requests.get('http://www.somfy.com', {'User-Agent':'a'}, headers = {'Accept': '*/*'})
print (Website.text)
print (Website.status_code)
When I visit this website in a browser (eg chrome or firefox) it works. When I run the python code I just get the result "Gone" (error code 410).
I'd like to be able to reliably put in a range of website urls, and pull back the raw html to be able to look for key-words.
Questions
1. What have I done wrong, and how should I set this up to have the best chance of success in the future.
2. Could you point me to any guidance on how to go about working out what is wrong?
Many thanks - and sorry for the beginner questions!
You have an invalid User-Agent and you didn't include it in your headers.
I have fixed your code for you - it returns a 200 status code.
import requests
Website = requests.get('http://www.somfy.com', headers= {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3835.0 Safari/537.36', 'Accept': '*/*'})
print (Website.text)
print (Website.status_code)
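For the record, the original call put the User-Agent dict in the second positional slot of requests.get, which requests treats as query parameters, not headers. You can see this without touching the network by preparing the request:

```python
import requests

# Passing a dict as the second positional argument of requests.get() makes it
# the `params` argument, so the intended header is appended to the URL as a
# query string instead of being sent as a request header.
req = requests.Request('GET', 'http://www.somfy.com', params={'User-Agent': 'a'})
prepared = req.prepare()

print(prepared.url)            # the "User-Agent" ends up in the query string...
print(dict(prepared.headers))  # ...and never appears in the headers
```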
I have been trying to find the XPath for the "customers who viewed this product also viewed" section, but I cannot seem to get the code correct. I have very little experience with XPath, but I have been using an online scraper to get the info and learn from it.
What I've been doing is:
def AmzonParser(url):
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36'}
    page = requests.get(url, headers=headers)

def scraper():
    while True:
        XPATH_RECOMMENDED = '//a[@id="anonCarousel3"]/ol/li[1]/div/a/div[2]/@href'
        RAW_RECOMMENDED = doc.xpath(XPATH_RECOMMENDED)
        RECOMMENDED = ' '.join(RAW_RECOMMENDED).strip() if RAW_RECOMMENDED else None
My main goal is to get the "customers also viewed" link so I can pass it to the scraper. This is just a snippet of my code.
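In XPath, attributes are addressed with @ (as in //a[@id="..."] and @href). As a tiny offline sanity check of that pattern, here is a sketch using the stdlib ElementTree's limited XPath support against a simplified, made-up stand-in for the carousel markup (not real Amazon HTML):

```python
import xml.etree.ElementTree as ET

# Simplified, invented stand-in for the "also viewed" carousel markup.
html = """
<div>
  <a id="anonCarousel3"><ol>
    <li><div><a href="https://example.com/also-viewed"><div>Item</div></a></div></li>
  </ol></a>
</div>
"""

root = ET.fromstring(html)

# Attributes are matched with [@attr='value']; the href is read off the
# matched element. (lxml's full XPath additionally allows selecting @href
# directly in the path, as in the question's snippet.)
link = root.find(".//a[@id='anonCarousel3']/ol/li/div/a")
print(link.get('href'))
```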