Beautiful soup took too long to get data, ps never got data

Beautiful soup took too long to get data, ps never got data - python

This is the code, it took too long to get the data, plus never retrieved the data.
import requests
from bs4 import BeautifulSoup
print("started")
url="https://www.analog.com/en/products.html#"
def get_data(url):
r=requests.get(url)
soup=BeautifulSoup(r.text,"html.parser")
return soup
def parse(soup):
datas=soup.find_all("div",{"class":"product-row row"})
print(len(datas))
return
print("started")
soup=get_data(url)
print("got data")
parse(soup)

You will need to provide a User-Agent to you request header, just add
header = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
at the top of your file and then add the "headers" parameter to your request, as follows
r=requests.get(url,headers=header)
You can read more at this question: How to use Python requests to fake a browser visit a.k.a and generate User Agent?

Related

Beautiful soup returning empty in PythonAnywhere

I have a bs4 app that would in this context prints the most recent post on igg-games.com
Code:
from bs4 import BeautifulSoup
import requests
def get_new():
new = {}
for i in BeautifulSoup(requests.get('https://igg-games.com/').text, features="html.parser").find_all('article'):
elem = i.find('a', class_='uk-link-reset')
new[elem.get_text()] = (elem.get('href'), ", ".join([x.get_text() for x in i.find_all('a', rel = 'category tag')]), i.find('time').get_text())
return new
current = get_new()
new_item = list(current.items())[0]
print(f"Title: {new_item[0]}\nLink: {new_item[1][0]}\nCatagories: {new_item[1][1]}\nAdded: {new_item[1][2]}")
Output on my machine:
Title: Beholder�s Lair Free Download
Link: https://igg-games.com/beholders-lair-free-download.html
Catagories: Action, Adventure
Added: January 7, 2021
I know it works. However, my end goal is to turn this into rss feed entries. So I plugged it all into a premium PythonAnywhere container. However, my function get_new() returns {}. Is there something I need to do that I'm missing?

Solved thanks to the help of Dmytro O.
Since it was likely that PythonAnywhere was blocked as a client, setting the user agent allowed me to receive a response from my intended site.
#the fix
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
response = requests.get(url, headers=headers)
when placed in my code
def get_new():
new = {}
for i in BeautifulSoup(requests.get('https://igg-games.com/', headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}).text, features="html.parser").find_all('article'):
elem = i.find('a', class_='uk-link-reset')
new[elem.get_text()] = (elem.get('href'), ", ".join([x.get_text() for x in i.find_all('a', rel = 'category tag')]), i.find('time').get_text())
return new
This method was provided to me through this stack overflow post: How to use Python requests to fake a browser visit a.k.a and generate User Agent?

Why is requests.get() returning an outdated website in Python?

Relevant line of code is :
response = requests.get(url)
Here's what I've tried so far :
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
response = requests.get(url, headers=headers)
and :
from fake_useragent import UserAgent
import requests
ua = UserAgent()
headers = {'User-Agent':str(ua.chrome)}
response = requests.get(url, headers=headers)
But the data I get is still not the current version of the website.
The website I'm trying to scrape is this grocery store flyer.
Can anyone tell me why the data I get is outdated and/or how to fix it?
Update: it works all of a sudden but I haven't changed anything so I'm still curious as to why ...

Why Request doesn't work on a specific URL?

I have a question re: requests module in Python.
So far I have been using this to scrape and it's been working well.
However when I do it against one particular website (code below - and refer to the Jupyter Notebook snapshot), it just doesn't want to complete the task (showing [*] forever).
from bs4 import BeautifulSoup
import requests
import pandas as pd
import json
page = requests.get('https://www.stoneisland.com/ca/stone-island-shadow-project/coats-jackets', verify = False)
soup = BeautifulSoup(page.content, 'html.parser')
Some users also suggest using headers such as below to speed it up but it doesnt work for me as well:
url = 'https://www.stoneisland.com/ca/stone-island-shadow-project/coats-jackets'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.3'}
req = requests.get(url = url, headers = headers)
Not sure what's going on (this is the first time for me) but I might be missing on something obvious. If someone can explain why this is not working? Or if it's working in your machine, please do let me know!

The page attempts to add a cookie the first time you visit it. By using the requests module and not defining a cookie will prevent you from being able to connect to the page.
I've modified your script to include my cookie which should work - if it doesn't, copy your cookie (for this host domain) from the browser to the script.
url = 'https://www.stoneisland.com/ca/stone-island-shadow-project/coats-jackets'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.3'}
cookies = {
'TS01e58ec0': '01a1c9e334eb0b8b191d36d0da302b2bca8927a0ffd2565884aff3ce69db2486850b7fb8e283001c711cc882a8d1f749838ff59d3d'
}
req = requests.get(url = url, headers = headers, cookies=cookies)

Why are the parsed label names different?

from bs4 import BeautifulSoup
import requests
web_url = r'https://www.mlb.com/scores/2019-05-12'
get_web = requests.get(web_url).text
soup = BeautifulSoup(get_web,"html.parser")
score = soup.find_all('div',class_='container')
print(score)
I want to find this.
But result is this

Send headers to API to tell it "hey I'm a desktop browser" to get identical HTML from server side:
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
request = requests.get(url, headers={'User-Agent': user_agent})
Useful links:
How to use Python requests to fake a browser visit?
Sending "User-agent" using Requests library in Python

Session not transferring over to next requests

I'm finally learning how to use class and __init__, however I'm having an issue with session. It seems like session is not carrying over to the next request. I made a simple script for testing, it adds an item, then I make another request to see if the Bag contains any value (eg. Bag(1)). The problem is that the item is adding but I'm getting Bag(0) when I make the second request. All I can think of is that there might be an issue with session on my part, but I can't figure it out. Here's the script:
import requests, re
from bs4 import BeautifulSoup
class Test():
def __init__(self):
self.s = requests.Session()
self.userAgent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36'
def cart(self):
headers = {'User-Agent': self.userAgent}
r = self.s.get('http://undefeated.com/store/index.php?api=1&rowid=130007&qty=1', headers=headers)
print(r.text)
if re.findall('Added', r.text):
r = self.s.get('http://undefeated.com/store/cart/pg', headers=headers).text
soup = BeautifulSoup(r, 'lxml')
bag = soup.find('li', {'class': 'leaf cart'}).text
print(bag)
start = Test().cart()

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Beautiful soup took too long to get data, ps never got data - python

Related

Beautiful soup returning empty in PythonAnywhere

Why is requests.get() returning an outdated website in Python?

Why Request doesn't work on a specific URL?

Why are the parsed label names different?

Session not transferring over to next requests

Categories

Resources