I've been trying to scrape some data from a site using BeautifulSoup. I've scraped some of the data successfully, but for other fields (phone, website) I get errors.
https://yellowpages.com.eg/en/search/spas/3231
This is the link to the site I'm trying to scrape.
from bs4 import BeautifulSoup
import requests

url = 'https://yellowpages.com.eg/en/search/spas/3231'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')

info = soup.find_all('div', class_='col-xs-12 padding_0')
for item in info:
    phone = item.find('span', class_='phone-spans')
    print(phone)
Every time I run this code the result is None.
I'm not sure where that code comes from; I couldn't see anything on the page that looked similar. However, this code works:
from bs4 import BeautifulSoup
import requests

url = 'https://yellowpages.com.eg/en/search/spas/3231'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')

for item in soup.find_all('div', class_='searchResultsDiv'):
    name = item.find('a', class_='companyName').text.strip()
    phone = item.find('a', class_='search-call-mob')['href']
    print(name, phone)
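If a listing is missing either element, item.find returns None and the .text lookup raises AttributeError. A slightly more defensive variant (a sketch, assuming the same class names as the working code above):

for item in soup.find_all('div', class_='searchResultsDiv'):
    name_tag = item.find('a', class_='companyName')
    phone_tag = item.find('a', class_='search-call-mob')
    if name_tag is None or phone_tag is None:
        continue  # skip listings missing either field
    print(name_tag.text.strip(), phone_tag.get('href'))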
This is the code I used to get the data from a website with all the possible Wordle words. I'm trying to put them in a list so I can create a Wordle clone, but I get a weird output when I do this. Please help.
import requests
from bs4 import BeautifulSoup
url = "https://raw.githubusercontent.com/tabatkins/wordle-list/main/words"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
word_list = list(soup)
You don't need BeautifulSoup here; simply split the text of the response:
import requests
url = "https://raw.githubusercontent.com/tabatkins/wordle-list/main/words"
requests.get(url).text.split()
Or, if you'd like to do it with BeautifulSoup anyway:
import requests
from bs4 import BeautifulSoup
url = "https://raw.githubusercontent.com/tabatkins/wordle-list/main/words"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
soup.text.split()
Output:
['women',
'nikau',
'swack',
'feens',
'fyles',
'poled',
'clags',
'starn',...]
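Since the file is plain text, the split gives a ready-to-use word list for the clone. A minimal usage sketch for validating guesses ('crane' is just an example guess, not from the original post):

import requests

url = "https://raw.githubusercontent.com/tabatkins/wordle-list/main/words"
valid_words = set(requests.get(url).text.split())  # a set gives O(1) lookups

guess = "crane"  # example guess
print(guess in valid_words)  # True when the guess is an allowed word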
I'm trying to web-scrape the Google News page for a personal project and retrieve the article headlines to print out onto another page. I've been searching for typos or mistakes, but I'm not sure why my element keeps coming back as "None" when I try to print it.
import requests
from bs4 import BeautifulSoup

URL = 'https://www.google.com/search?q=beyond+meat&rlz=1C1CHBF_enUS898US898&sxsrf=ALeKk00IH9jp1Kz5-LSyi7FUB4rd6--_hw:1624935518812&source=lnms&tbm=nws&sa=X&ved=2ahUKEwicqIbD7LvxAhVWo54KHXgRA9oQ_AUoAXoECAEQAw&biw=1536&bih=754'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')

results = soup.find('div', id='rso')  # grabs everything in column
article_results = results.find_all('div', class_='yr3B8d KWQBje')  # grabs divs surrounding each article
for article_result in article_results:
    headliner = article_result.find('div', class_='JheGif nDgy9d')  # grabs article header div for every article
    if headliner is None:
        continue
    headliner_text = headliner.text.strip()
    print(headliner_text)
import requests
from bs4 import BeautifulSoup

URL = 'https://www.google.com/search?q=beyond+meat&rlz=1C1CHBF_enUS898US898&sxsrf=ALeKk00IH9jp1Kz5-LSyi7FUB4rd6--_hw:1624935518812&source=lnms&tbm=nws&sa=X&ved=2ahUKEwicqIbD7LvxAhVWo54KHXgRA9oQ_AUoAXoECAEQAw&biw=1536&bih=754'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')

headers = soup.find_all('div', class_='BNeawe vvjwJb AP7Wnd')
for h in headers:
    print(h.text)
Refer to the output: is this what you are expecting? The class names differ from what you see in your browser's dev tools because requests receives the plain, non-JavaScript version of the page, so selectors copied from the rendered DOM often match nothing.
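More generally, when a selector unexpectedly returns None, it helps to save exactly what requests received and inspect that file instead of the browser's rendered DOM. A minimal sketch (the shortened query URL is illustrative, not from the original post):

import requests

# Shortened news-search URL for illustration (not from the original post).
URL = 'https://www.google.com/search?q=beyond+meat&tbm=nws'
page = requests.get(URL)

# Save the raw response; open it in an editor to see which class names are
# actually present in the HTML that requests received.
with open('fetched.html', 'w', encoding='utf-8') as f:
    f.write(page.text)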
I'm new-ish to Python and started experimenting with Beautiful Soup 4. I tried writing code that would get all the links on one page and then, with those links, repeat the process until I have an entire website parsed.
import bs4 as bs
import urllib.request as url

links_unclean = []
links_clean = []

soup = bs.BeautifulSoup(url.urlopen('https://pythonprogramming.net/parsememcparseface/').read(), 'html.parser')

for url in soup.find_all('a'):
    print(url.get('href'))
    links_unclean.append(url.get('href'))

for link in links_unclean:
    if (link[:8] == 'https://'):
        links_clean.append(link)

print(links_clean)

while True:
    for link in links_clean:
        soup = bs.BeautifulSoup(url.urlopen(link).read(), 'html.parser')

        for url in soup.find_all('a'):
            print(url.get('href'))
            links_unclean.append(url.get('href'))

        for link in links_unclean:
            if (link[:8] == 'https://'):
                links_clean.append(link)

        links_clean = list(dict.fromkeys(links_clean))
        input()
But I'm now getting this error:
'NoneType' object is not callable
line 20, in
soup = bs.BeautifulSoup(url.urlopen(link).read(), 'html.parser')
Can you please help?
Be careful when importing modules as something. In this case, the module alias url from line 2 gets overridden by your for loop variable when you iterate. Once that loop has run, url is a BeautifulSoup Tag rather than the urllib.request module; attribute lookup on a Tag searches for a child tag of that name and returns None when there isn't one, and calling None is exactly what raises 'NoneType' object is not callable.
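A minimal, self-contained repro of that shadowing (hypothetical one-tag document, not from the original code):

import urllib.request as url
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="https://example.com">x</a>', 'html.parser')

for url in soup.find_all('a'):  # the loop variable replaces the module alias
    pass

# `url` is now the last <a> Tag, not urllib.request. Attribute lookup on a
# Tag searches for a child tag of that name and yields None when absent:
print(url.urlopen)  # None
# So url.urlopen(link) calls None, raising:
# TypeError: 'NoneType' object is not callable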
Here is a shorter solution that will also give back only URLs containing https as part of the href attribute:
from bs4 import BeautifulSoup
from urllib.request import urlopen

content = urlopen('https://pythonprogramming.net/parsememcparseface/')
soup = BeautifulSoup(content, "html.parser")
base = soup.find('body')

for link in BeautifulSoup(str(base), "html.parser").findAll("a"):
    if 'href' in link.attrs:
        if 'https' in link['href']:
            print(link['href'])
However, this paints an incomplete picture: not all links are captured because of HTML errors on the page. May I also recommend the following alternative, which is very simple and works flawlessly in your scenario (note: you will need the Requests-HTML package):
from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://pythonprogramming.net/parsememcparseface/')

for link in r.html.absolute_links:
    print(link)
This will output all URLs, including both those that reference other URLs on the same domain and those that are external websites.
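If you only need one of those two groups, the standard library's urlparse can split them. A small sketch building on the snippet above (the exact-netloc comparison is a simplifying assumption; subdomains would need extra handling):

from urllib.parse import urlparse
from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://pythonprogramming.net/parsememcparseface/')

domain = urlparse(r.url).netloc
internal = {link for link in r.html.absolute_links if urlparse(link).netloc == domain}
external = r.html.absolute_links - internal

print(len(internal), 'internal links;', len(external), 'external links')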
I would consider using an attribute = value CSS selector with the ^ operator to specify that the href attribute begins with https; you will then only have valid protocols. Also, use set comprehensions to avoid duplicates and a Session to re-use the connection.
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd

final = []

with requests.Session() as s:
    r = s.get('https://pythonprogramming.net/parsememcparseface/')
    soup = bs(r.content, 'lxml')
    httpsLinks = {item['href'] for item in soup.select('[href^=https]')}
    for link in httpsLinks:
        r = s.get(link)
        soup = bs(r.content, 'lxml')
        newHttpsLinks = [item['href'] for item in soup.select('[href^=https]')]
        final.append(newHttpsLinks)

tidyList = list({item for sublist in final for item in sublist})
df = pd.DataFrame(tidyList)
print(df)
I am currently trying to scrape the product URLs from the Lazada e-commerce platform; however, I am getting random links from the website rather than the product links.
https://www.lazada.com.my/oldtown-white-coffee/?langFlag=en&q=All-Products&from=wangpu&pageTypeId=2
My code below:
from bs4 import BeautifulSoup, SoupStrainer
import requests

url = "https://www.lazada.com.my/oldtown-white-coffee/?langFlag=en&q=All-Products&from=wangpu&pageTypeId=2"
page = requests.get(url)
data = page.text
soup = BeautifulSoup(data)

links = soup.find_all('div', {'class': 'c16H9d'})
for link in soup.find_all("a"):
    print(link.get("href"))
The result I am getting out of this code (which is not what I want):
This is the section of the links that I need; I want to list all of the product URLs from the products page.
I hope you guys can help me with this. I know it should be simple; it just doesn't seem to work, and I have been looking at it since yesterday.
The page is dynamic. Within the HTML source code is a script that assigns a JSON structure holding the products. You can pull that out, then parse the JSON object to print off the URLs:
from bs4 import BeautifulSoup, SoupStrainer
import requests
import json

url = "https://www.lazada.com.my/oldtown-white-coffee/?langFlag=en&q=All-Products&from=wangpu&pageTypeId=2"
page = requests.get(url)
data = page.text
soup = BeautifulSoup(data)

scripts = soup.find_all('script')

jsonObj = None
for script in scripts:
    if 'window.pageData=' in script.text:
        jsonStr = script.text
        jsonStr = jsonStr.split("window.pageData=")[1]
        jsonObj = json.loads(jsonStr)

products = jsonObj['mods']['listItems']
for item in products:
    print(item['productUrl'])
Output:
//www.lazada.com.my/products/bundle-of-6-oldtown-white-coffee-3-in-1-classic-15s-x6-packs-free-oldtown-4-sticks-random-flavor-i434054933-s631848411.html?mp=1
//www.lazada.com.my/products/oldtown-white-coffee-3-in-1-hazelnut-15s-x6-packs-i436317818-s639072921.html?mp=1
//www.lazada.com.my/products/oldtown-white-coffee-3-in-1-classic-15s-x4-packs-i436315786-s639051449.html?mp=1
//www.lazada.com.my/products/bundle-of-6-oldtown-white-coffee-3-in-1-extra-rich-15s-x6-packs-free-oldtown-brown-mug-x1-4-sticks-random-flavor-i434056795-s631848533.html?mp=1
//www.lazada.com.my/products/bundle-of-6-oldtown-white-coffee-2-in-1-coffee-creamer-15s-x6-packs-free-oldtown-brown-mug-x1-4-sticks-random-flavor-i434061102-s631849047.html?mp=1
//www.lazada.com.my/products/oldtown-white-coffee-3-in-1-hazelnut-15s-x2-packs-free-2-sticks-random-flavor-i434043717-s631808896.html?mp=1
//www.lazada.com.my/products/oldtown-white-coffee-3-in-1-hazelnut-15s-x4-packs-i436302855-s639061262.html?mp=1
//www.lazada.com.my/products/bundle-of-6-oldtown-white-coffee-3-in-1-less-sugar-15s-x6-packs-free-oldtown-brown-mug-x1-4-sticks-random-flavor-i434054922-s631844581.html?mp=1
//www.lazada.com.my/products/oldtown-white-coffee-3-in-1-extra-rich-15s-x4-packs-i436302850-s639053701.html?mp=1
//www.lazada.com.my/products/oldtown-white-coffee-3-in-1-less-sugar-15s-i428864462-s623547178.html?mp=1
//www.lazada.com.my/products/oldtown-white-coffee-3-in-1-extra-rich-15s-x2-packs-free-2-sticks-random-flavor-i434051011-s631820678.html?mp=1
//www.lazada.com.my/products/oldtown-white-coffee-3-in-1-classic-15s-i429053474-s623935676.html?mp=1
//www.lazada.com.my/products/oldtown-3-in-1-white-milk-tea-13s-i429057450-s623942383.html?mp=1
//www.lazada.com.my/products/oldtown-white-coffee-3-in-1-natural-cane-sugar-15s-x6-packs-i436315838-s639079875.html?mp=1
//www.lazada.com.my/products/oldtown-white-coffee-3-in-1-mocha-15s-i429056707-s623944523.html?mp=1
//www.lazada.com.my/products/oldtown-white-coffee-3-in-1-hazelnut-15s-i446475373-s668555737.html?mp=1
//www.lazada.com.my/products/oldtown-white-coffee-3-in-1-less-sugar-15s-x2-packs-free-2-sticks-random-flavor-i434038995-s631814134.html?mp=1
//www.lazada.com.my/products/bundle-of-6-oldtown-white-coffee-3-in-1-hazelnut-15s-x6-packs-free-oldtown-4-sticks-random-flavor-i434059429-s631843615.html?mp=1
//www.lazada.com.my/products/oldtown-3-in-1-white-milk-tea-13s-x2-packs-free-2-sticks-random-flavor-i434048248-s631818679.html?mp=1
//www.lazada.com.my/products/oldtown-white-coffee-2-in-1-coffee-creamer-15s-x2-packs-free-2-sticks-random-flavor-i434045440-s631815896.html?mp=1
//www.lazada.com.my/products/oldtown-white-coffee-3-in-1-natural-cane-sugar-15s-x4-packs-i436347739-s639057542.html?mp=1
//www.lazada.com.my/products/oldtown-black-series-enrich-freeze-dried-instant-coffee-100g-i434617928-s633601781.html?mp=1
//www.lazada.com.my/products/oldtown-white-coffee-3-in-1-less-sugar-15s-x4-packs-i436353732-s639059385.html?mp=1
//www.lazada.com.my/products/oldtown-white-coffee-2-in-1-coffee-creamer-15s-i429056294-s623942054.html?mp=1
//www.lazada.com.my/products/bundle-of-4-oldtown-white-coffee-2-in-1-coffee-creamer-15s-x4-packs-free-4-sticks-random-flavor-i434053005-s631824941.html?mp=1
//www.lazada.com.my/products/oldtown-white-coffee-3-in-1-classic-15s-x2-packs-free-2-sticks-random-flavor-i434041889-s631816196.html?mp=1
//www.lazada.com.my/products/oldtown-white-coffee-3-in-1-extra-rich-15s-i429055995-s623946380.html?mp=1
//www.lazada.com.my/products/oldtown-white-coffee-3-in-1-less-sugar-15s-x6-packs-i436317821-s639075740.html?mp=1
//www.lazada.com.my/products/oldtown-black-series-enrich-freeze-dried-instant-coffee-100g-2-glass-i435467037-s636334815.html?mp=1
//www.lazada.com.my/products/oldtown-white-coffee-3-in-1-natural-cane-sugar-15s-x2-packs-free-2-sticks-random-flavor-i434043759-s631818279.html?mp=1
//www.lazada.com.my/products/oldtown-white-coffee-2-in-1-coffee-creamer-15s-x2-packs-i436315754-s639016791.html?mp=1
//www.lazada.com.my/products/oldtown-white-coffee-3-in-1-classic-15s-x6-packs-i436353749-s639073757.html?mp=1
//www.lazada.com.my/products/bundle-of-6-oldtown-3-in-1-white-milk-tea-13s-x6-packs-free-oldtown-4-sticks-random-flavor-i434058314-s631843262.html?mp=1
//www.lazada.com.my/products/bundle-of-4-oldtown-white-coffee-3-in-1-hazelnut-15s-x4-packs-free-4-sticks-random-flavor-i446532232-s668570216.html?mp=1
//www.lazada.com.my/products/oldtown-3-in-1-white-milk-tea-13s-x6-packs-i436303866-s639071227.html?mp=1
//www.lazada.com.my/products/oldtown-black-series-enrich-freeze-dried-instant-coffee-100g-2-glass-foc-glass-mug-i442451196-s654637474.html?mp=1
//www.lazada.com.my/products/bundle-of-4-oldtown-3-in-1-white-milk-tea-13s-x4-packs-free-4-sticks-random-flavor-i434046913-s631825874.html?mp=1
//www.lazada.com.my/products/bundle-of-4-oldtown-white-coffee-3-in-1-classic-15s-x4-packs-free-4-sticks-random-flavor-i434052193-s631832149.html?mp=1
//www.lazada.com.my/products/oldtown-3-in-1-white-milk-tea-13s-x2-packs-i436282923-s639018195.html?mp=1
//www.lazada.com.my/products/bundle-of-6-oldtown-white-coffee-3-in-1-natural-cane-sugar-15s-x6-packs-free-oldtown-4-sticks-random-flavor-i434062035-s631849174.html?mp=1
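Note that the productUrl values come back protocol-relative (they begin with //). If you need absolute URLs, the standard library's urljoin can resolve them against an https base; a small sketch reusing products from the code above:

from urllib.parse import urljoin

# `products` is the list parsed from window.pageData above; urljoin
# resolves the protocol-relative "//..." form against an https base.
for item in products:
    print(urljoin('https://www.lazada.com.my', item['productUrl']))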
So I'm trying to write a mediocre script to download subtitles from one particular website, as y'all can see. I'm a newbie to BeautifulSoup; so far I have a list of all the "href" values after a search query (GET). So how do I navigate further after getting all the links?
Here's the code:
import requests
from bs4 import BeautifulSoup

usearch = input("Movie Name? : ")
url = "https://www.yifysubtitles.com/search?q=" + usearch
print(url)

resp = requests.get(url)
soup = BeautifulSoup(resp.content, 'lxml')

for link in soup.find_all('a'):
    dictn = link.get('href')
    print(dictn)
You need to use resp.text instead of resp.content
Try this to get the search results.
import requests
from bs4 import BeautifulSoup

base_url_f = "https://www.yifysubtitles.com"
search_url = base_url_f + "/search?q=last+jedi"

resp = requests.get(search_url)
soup = BeautifulSoup(resp.text, 'lxml')

for media in soup.find_all("div", {"class": "media-body"}):
    print(base_url_f + media.find('a')['href'])
Output: https://www.yifysubtitles.com/movie-imdb/tt2527336
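To navigate further, request each result URL and parse the page it points to with another BeautifulSoup pass. A sketch assuming the same media-body markup as above (what you extract from each movie page depends on its structure, which isn't shown here):

import requests
from bs4 import BeautifulSoup

base_url_f = "https://www.yifysubtitles.com"
search_url = base_url_f + "/search?q=last+jedi"

resp = requests.get(search_url)
soup = BeautifulSoup(resp.text, 'lxml')

# Follow each search result and parse the movie page it links to.
for media in soup.find_all("div", {"class": "media-body"}):
    movie_url = base_url_f + media.find('a')['href']
    movie_soup = BeautifulSoup(requests.get(movie_url).text, 'lxml')
    # Inspect movie_soup here for the subtitle links you want; printing
    # the page title just confirms the navigation worked.
    print(movie_soup.title.text if movie_soup.title else movie_url)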