Python: parse data with BeautifulSoup

Python: parse data with BeautifulSoup - python

I have a list of urls
yandex.ru/search?text=игрушка%20"веселая%20гусеница"%20keenway%20отзывы&lr=47
yandex.ru/search?text=модис&lr=47
yandex.ru/search?text=модис&lr=47
yandex.ru/search?text=авито&lr=47
yandex.ru/search?text=авито&lr=47
yandex.ru/search?text=цветок%20киддиленд%20музыкальный&lr=47
dns-shop.ru/product/c7bf1138670f3361/rul-hori-racing-wheel
dns-shop.ru/product/c7bf1138670f3361/rul-hori-racing-wheel#opinion
kaluga.onlinetrade.ru/catalogue/ruli_dgoystiki_geympadi-c31/hori/reviews/rul_hori_racing_wheel_controller_ps_4_ps4_020e_acps440-274260.html
kazan.onlinetrade.ru/catalogue/ruli_dgoystiki_geympadi-c31/hori/reviews/rul_hori_racing_wheel_controller_xboxone_xbox_005u_acxone34-274261.html
kazan.onlinetrade.ru/catalogue/ruli_dgoystiki_geympadi-c31/hori/reviews/rul_hori_racing_wheel_controller_xboxone_xbox_005u_acxone34-274261.html
ebay.com
And I need to get text from tag title to every from this.
html = urllib.urlopen(url, proxies=proxies).read()
print html
soup = BeautifulSoup(html, 'html.parser')
titles = soup.title.get_text()
When I print html I get real code of pages. But when I try print title
I get
ERROR: The requested URL could not be retrieved
for most of urls.
What's wrong there?

Related

How to scrape a website using selected words if present?

I have used Beautifulsoup to scrape a website. My current code helps me to get the website content in HTML format. I used soup to find the word if it is present or not but I am not able to get the paragraph it belongs to.
import requests
from bs4 import BeautifulSoup
# Make a request
page = requests.get(
"https://manychat.com/")
soup = BeautifulSoup(page.content, 'html.parser')
# Extract title of page
page_title = soup.title.text
# Extract body of page
page_body = soup.body
# Extract head of page
page_head = soup.head
# print the result
print(page_body, page_head)
thirdParty = soup.find(text = 'Facebook')

Usually, the areas you're interested in searching are of a common kind, like <div> with a common class. So, you have Soup return all of the <div>s with that class, and you search the div text for your word.

Soup works on one IMBD page but not on another. How to solve?

url1 = "https://www.imdb.com/user/ur34087578/watchlist"
url = "https://www.imdb.com/search/title/?groups=top_1000&ref_=adv_prv"
results1 = requests.get(url1, headers=headers)
results = requests.get(url, headers=headers)
soup1 = BeautifulSoup(results1.text, "html.parser")
soup = BeautifulSoup(results.text, "html.parser")
movie_div1 = soup1.find_all('div', class_='lister-item-content')
movie_div = soup.find_all('div', class_='lister-item mode-advanced')
#using unique tag for each movie in the respective link
print(movie_div1)
#empty list
print(movie_div)
#gives perfect list
Why is movie_div1 giving an empty list? I am not able to identify any difference in the URL structures to indicate the code should be different. All leads appreciated.

Unfortunately the div you want is processed by a javascript code so you can't get by scraping the raw html request.
You can get the movies you want by the request json your browser gets, which you won't need to scrape the code with beautifulsoup, making your script much faster.
2nd option is using Selenium.
Good luck.

As #SakuraFreak mentioned, you could parse the JSON received. However, this JSON response is embedded within the HTML itself which is later converted to HTML by browser JS (this is what you see as <div class="lister-item-content">...</div>.
For example, this is how you would extract the JSON content from the HTML to display movie/show names from the watchlist:
import requests
from bs4 import BeautifulSoup
import json
url = "https://www.imdb.com/user/ur34087578/watchlist"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
details = str(soup.find('span', class_='ab_widget'))
json_initial = "IMDbReactInitialState.push("
json_leftover = ");\n"
json_start = details.find(json_initial) + len(json_initial)
details = details[json_start:]
json_end = details.find(json_leftover)
json_data = json.loads(details[:json_end])
imdb_titles = json_data["titles"]
for item in imdb_titles.values():
print(item["primary"]["title"])

how to remove the starting and ending tags using python Beautiful soup

I'm having difficulty in stripping the starting and ending tags from a json url. I've used beautiful soup and the only problem i'm facing is that i'm getting <pre> tags in my response. Please advise how can i remove the starting and ending tags. The code chunk i'm using is here:
page = Page( "link to json")
soup = bs.BeautifulSoup(page.html, "html.parser")
#fetching the response i want from the url it's inside pre tags.
json = soup.find("pre")
print(json)

So Thanks to Demian Wolf. The solution is something like this:
page = Page( "link to json")
soup = bs.BeautifulSoup(page.html, "html.parser")
#fetching the response i want from the url it's inside pre tags.
json = soup.find("pre")
print(json.text)

You may use soup.text to remove all the tags:
from bs4 import BeautifulSoup
soup = BeautifulSoup("<pre>Hello, world!</pre>", "html.parser")
print(soup.find("pre").text)

How can I get the text from this specific div class?

I want to extract the text here
a lot of text
I used
url = ('https://osu.ppy.sh/users/1521445')
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
mestuff = soup.find("div", {"class":"bbcode bbcode--profile-page"})
but it never fails to return with "None" in the terminal.
How can I go about this?
Link is "https://osu.ppy.sh/users/1521445"
(This is a repost since the old question was super old. I don't know if I should've made another question or not but aa)

Data is dynamically loaded from script tag so, as in other answer, you can grab from that tag. You can target the tag by its id then you need to pull out the relevant json, then the html from that json, then parse html which would have been loaded dynamically on page (at this point you can use your original class selector)
import requests, json, pprint
from bs4 import BeautifulSoup as bs
r = requests.get('https://osu.ppy.sh/users/1521445')
soup = bs(r.content, 'lxml')
all_data = json.loads(soup.select_one('#json-user').text)
soup = bs(all_data['page']['html'], 'lxml')
pprint.pprint(soup.select_one('.bbcode--profile-page').get_text('\n'))

You could try this:
url = ('https://osu.ppy.sh/users/1521445')
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
x = soup.findAll("script",{"id":re.compile(r"json-user")})
result = re.findall('raw\":(.+)},\"previous_usernames', x[0].text.strip())
print(result)
Im not sure why the div with class='bbcode bbcode--profile-page' is string inside script tag with class='json-user', that's why you can't get it's value by div with class='bbcode bbcode--profile-page'
Hope this could help

using python to pull href tags

tyring to pull the href links for the products on this webpage. The code pulls all of the href's except the products that are listed on the page.
from bs4 import BeautifulSoup
import requests
url = "https://www.neb.com/search#t=_483FEC15-900D-4CF1-B514-1B921DD055BA&sort=%40ftitle51880%20ascending"
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, 'lxml')
tags = soup.find_all('a')
for tag in tags:
print(tag.get('href'))

The products are loaded through rest API dynamically, the URL is this:
https://international.neb.com/coveo/rest/v2/?sitecoreItemUri=sitecore%3A%2F%2Fweb%2F%7BA1D9D237-B272-4C5E-A23F-EC954EB71A26%7D%3Flang%3Den%26ver%3D1&siteName=nebinternational
Loading this response will get you the URLs.
Next time, check your network inspector if any part of web page isn't loading dynamically (or use selenium).

Try to verify if the product href's is in the received response. I'm telling you to do this because if the part of the products is being dynamically generated by ajax, for example, a simple get on the main page will not bring them.
Print the response and verifiy if the products are being received in the html

I think you want something like this:
from bs4 import BeautifulSoup
import urllib.request
for numb in ('1', '100'):
resp = urllib.request.urlopen("https://www.neb.com/search#first=" + numb + "&t=_483FEC15-900D-4CF1-B514-1B921DD055BA&sort=%40ftitle51880%20ascending")
soup = BeautifulSoup(resp, from_encoding=resp.info().get_param('charset'))
for link in soup.find_all('a', href=True):
print(link['href'])

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python: parse data with BeautifulSoup - python

Related

How to scrape a website using selected words if present?

Soup works on one IMBD page but not on another. How to solve?

how to remove the starting and ending tags using python Beautiful soup

How can I get the text from this specific div class?

using python to pull href tags

Categories

Resources