How to break up webpage text - python

I assume I have to use the <br/> tags to break up this text and extract the fixtures only, but I cannot figure it out at all!
import requests
from bs4 import BeautifulSoup
# sample web page
sample_web_page = 'https://bleacherreport.com/articles/10005879-epl-schedule-2021-22-official-list-of-fixtures-for-new-premier-league-season'
# call get method to request that page
page = requests.get(sample_web_page)
# with the help of BeautifulSoup and an html parser, create the soup
soup = BeautifulSoup(page.content, "html.parser")
# grab every <p> without a class attribute
z = soup.find_all('p', {'class': ''})
print(z)
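No answer was included with this question, but here is one possible approach as a sketch (assuming the fixtures sit inside those class-less <p> tags separated by <br/> tags, which is an assumption about this page's markup): BeautifulSoup keeps the text chunks on either side of a <br/> as separate strings, so get_text() with a '\n' separator puts each fixture on its own line.
# Sketch: split each paragraph's text at the <br/> boundaries.
# get_text('\n') joins the text chunks that <br/> tags separate,
# so each fixture lands on its own line.
for p in soup.find_all('p', {'class': ''}):
    for line in p.get_text('\n').split('\n'):
        line = line.strip()
        if line:  # skip empty chunks
            print(line)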

Related

How to use the find_all method in bs4 on an element without a class

import requests
from bs4 import BeautifulSoup
result = requests.get('http://textfiles.com/stories/').text
soup = BeautifulSoup(result, 'lxml')
stories = soup.find_all('tr')
print(stories)
The find method works but find_all doesn't, and I'm not sure why. Maybe it is because the elements don't have a class?
The correct code is:
import requests
from bs4 import BeautifulSoup
result = requests.get('http://textfiles.com/stories/')
soup = BeautifulSoup(result.content, 'html5lib')
stories = soup.find_all('tr')
You can access each 'tr' by indexing:
stories[0]
The 0 can be replaced with any valid index into the list.
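To go a step further, here is a small sketch that pulls the link text and href out of each row (assuming each 'tr' on that index page wraps an 'a' tag, which is worth verifying against the live markup):
# Sketch: extract the link from each table row, skipping rows without one.
for row in stories:
    link = row.find('a')
    if link is not None:
        print(link.get_text(strip=True), link.get('href'))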
You can also use pandas, e.g.:
import pandas
import requests
from bs4 import BeautifulSoup
result = requests.get('http://textfiles.com/stories/')
soup = BeautifulSoup(result.content, 'html5lib')
df = pandas.read_html(soup.prettify())
print(df)
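Note that pandas.read_html returns a list of DataFrames, one per <table> found in the markup, so you usually index into the result:
print(df[0].head())  # first table on the page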

How to scrape price data on a page using BeautifulSoup

I am new to web scraping and am having trouble figuring out how to scrape all the prices on the webpage below. What I tried returns an empty list; any pointers would be great!
import bs4
import requests
from bs4 import BeautifulSoup
from urllib.request import urlopen
import pandas as pd
from datetime import datetime
from pytz import timezone
import urllib.request
url = 'https://www.remax.ca/find-real-estate'
page = urlopen(url)
soup = bs4.BeautifulSoup(page,'html.parser')
price = soup.findAll('h3', {'class' : 'price'})
First thing: if you use from bs4 import BeautifulSoup, don't also use import bs4.
Second, write soup = BeautifulSoup(page, 'html.parser')
Then use price = soup.find_all('h3', {'class': 'price'})
After this, "price" should hold all the price elements, but you will still need to refine the result, since in that form you get the full markup of each h3 rather than just the price text.
EDIT
from bs4 import BeautifulSoup
from urllib.request import urlopen
url = 'https://www.remax.ca/find-real-estate'
page = urlopen(url)
soup = BeautifulSoup(page, 'html.parser')
price = soup.find_all('h3', {'class': 'price'})
for p in price:
    print(p.text)
This should do the job. I eliminated pandas because I don't have it installed.
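If you then want the prices as numbers rather than strings, here is a minimal follow-up sketch (assuming the text looks like '$499,000'; the exact format on the live page may differ):
# Sketch: strip the currency symbol and separators so prices can be compared.
prices = []
for p in price:
    cleaned = p.text.replace('$', '').replace(',', '').strip()
    if cleaned:
        prices.append(float(cleaned))
print(prices)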

Beautifulsoup - Remove HTML tags

I am trying to strip away all the HTML tags from the 'Profile' soup; however, I am unable to perform the .text.strip() operation because soup.select() returns a list, as shown in the code below.
import requests
from bs4 import BeautifulSoup
from pprint import pprint
page = requests.get("https://web.archive.org/web/20121007172955/http://www.nga.gov/collection/anZ1.htm").text
soup = BeautifulSoup(page, "html.parser")
info = {}
info['Profile'] = soup.select('div.text-desc-members')
pprint(info)
Just iterate through that list:
import requests
from bs4 import BeautifulSoup
from pprint import pprint
page = requests.get("https://web.archive.org/web/20121007172955/http://www.nga.gov/collection/anZ1.htm").text
soup = BeautifulSoup(page, "html.parser")
info = {}
info['Profile'] = soup.select('div.text-desc-members')
for item in info['Profile']:
    pprint(item.text.strip())
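If you want a single tag-free string rather than printing each block, a small variation joins the stripped text of every match:
# Variation: collapse all matched blocks into one tag-free string.
profile_text = "\n".join(item.get_text(strip=True) for item in info['Profile'])
info['Profile'] = profile_text
pprint(info)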

Having trouble getting page source with BeautifulSoup

I am trying to get the HTML source of a web page using BeautifulSoup.
import bs4 as bs
import urllib.request
sourceUrl = 'https://www.pakwheels.com/forums/t/planing-a-trip-from-karachi-to-lahore-by-road-in-feb-2017/414115/2.html'
source = urllib.request.urlopen(sourceUrl).read()
soup = bs.BeautifulSoup(source, 'html.parser')
print(soup)
I want the HTML source of the page. This is what I am getting now:
'ps.store("siteSettings", {"title":"PakWheels Forums","contact_email":"sami.ullah#pakeventures.com","contact_url":"https://www.pakwheels.com/main/contact_us","logo_url":"https://www.pakwheels.com/assets/logo.png","logo_small_url":"/images/d-logo-sketch-small.png","mobile_logo_url":"data:image/svg+xml;base64,PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0idXRmLTgiPz4NCjwhLS0gR2VuZXJhdG9yOiBBZG9iZSBJbGx1c3RyYXRvciAxNi4wLjAsIFNWRyBFeHBvcnQgUGx1Zy1JbiAuIFNWRyBWZXJzaW9uOiA2LjAwIEJ1aWxkIDApICAtLT4NCjwhRE9DVFlQRSBzdmcgUFVCTElDICItLy9XM0MvL0RURCBTVkcgMS4xLy9FTiIgImh0dHA6Ly93d3cudzMub3JnL0dyYXBoaWNzL1NWRy8xLjEvRFREL3N2ZzExLmR0ZCI+DQo8c3ZnIHZlcnNpb249IjEuMSIgaWQ9IkxheWVyXzEiIHhtbG5zPSJodHRwOi8vd3d3LnczLm9yZy8yMDAwL3N2ZyIgeG1sbnM6eGxpbms9Imh0dHA6Ly93d3cudzMub3JnLzE5OTkveGxpbmsiIHg9IjBweCIgeT0iMHB4Ig0KCSB3aWR0aD0iMjQwcHgiIGhlaWdodD0iNjBweCIgdmlld0JveD0iMCAwIDI0MCA2MCIgZW5hYmxlLWJhY2tncm91bmQ9Im5ldyAwIDAgMjQwIDYwIiB4bWw6c3BhY2U9InByZXNlcnZlIj4NCjxwYXRoIGZpbGw9IiNGRkZGRkYiIGQ9Ik02LjkwMiwyMy4yODZDMzQuNzc3LDIwLjI2Miw1Ny4yNC'
Have a look at this code:
from urllib import request
from bs4 import BeautifulSoup
url_1 = "http://www.google.com"
page = request.urlopen(url_1)
soup = BeautifulSoup(page, "html.parser")
print(soup.prettify())
Import everything you need correctly, and pass an explicit parser to BeautifulSoup. As for the output you are seeing: that is what the server actually sends. The forum appears to build most of its HTML with JavaScript, which urllib cannot execute, so only the bootstrap payload shows up in the downloaded source.

Crawl site with infinite scrolling

I'm trying to crawl Flipkart, but Flipkart does not load its whole page at once, so I'm not able to crawl it. Please help.
from bs4 import BeautifulSoup
import requests
url = "https://www.flipkart.com/offers-list/weekend-specials?screen=dynamic&pk=contentTheme%3DLS-Nov-Weekend_widgetType%3DdealCard&wid=4.dealCard.OMU&otracker=hp_omu_Weekend+Specials_1"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
name = soup.find_all("div", {"class": "iUmrbN"})
for i in name:
    print(i.text)
This is not giving any output.
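No answer was posted here, but the usual approach is worth sketching (a sketch, not a tested solution for this page): content behind infinite scroll is injected by JavaScript, so requests alone never sees it. One option is to drive a real browser with Selenium, scroll until the page height stops growing, then hand the rendered HTML to BeautifulSoup. The class name "iUmrbN" comes from the question and is likely stale, since Flipkart rotates its generated class names.
# Sketch: render the page in a real browser, scroll to the bottom until
# no new content loads, then parse the final HTML with BeautifulSoup.
import time
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get(url)  # url as defined in the question above

last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give lazy-loaded content time to arrive
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break  # the page stopped growing; everything is loaded
    last_height = new_height

soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()
for div in soup.find_all("div", {"class": "iUmrbN"}):  # class from the question; may be stale
    print(div.text)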
