I'm looking to extract data from Instagram and record the time of each post without using auth.
The code below gives me the HTML of the IG profile page, but I'm not able to extract the time element from that HTML.
from requests_html import HTMLSession
from bs4 import BeautifulSoup
import json
url_path = 'https://www.instagram.com/<username>'
session = HTMLSession()
r = session.get(url_path)
soup = BeautifulSoup(r.content,features='lxml')
print(soup)
I would like to extract data from the time element near the bottom of this screenshot
To extract the time you can target the html tag and its class (note that findAll returns a list, so call .text on a single element, and that these obfuscated class names tend to change):
time_tag = soup.find("time", {"class": "_1o9PC Nzb55"})
if time_tag:
    print(time_tag.text)
I'm guessing that the picture you've shared is a browser inspector screenshot. Although inspecting the page is a good starting point for web scraping, you should check what BeautifulSoup is actually getting. If you check the print of soup, you will see that the data you are looking for is JSON inside a script tag, so your code and any other solution that targets the time tag won't work with BS4 alone. You might try selenium instead.
Anyway, here goes the BeautifulSoup pseudo-solution using the Instagram profile from your screenshot:
from bs4 import BeautifulSoup
import json
import re
import requests
import time

url_path = "https://www.instagram.com/srirachi9/"
response = requests.get(url_path)
soup = BeautifulSoup(response.content, "lxml")

# The profile data is embedded as "window._sharedData = {...};" in a script tag
pattern = re.compile(r"window\._sharedData = (.*);", re.MULTILINE)
script = soup.find("script", text=lambda x: x and "window._sharedData" in x).text
data = json.loads(re.search(pattern, script).group(1))

times = len(data['entry_data']['ProfilePage'][0]['graphql']['user']['edge_owner_to_timeline_media']['edges'])
for x in range(times):
    print(time.strftime(
        '%Y-%m-%d %H:%M:%S',
        time.localtime(data['entry_data']['ProfilePage'][0]['graphql']['user']['edge_owner_to_timeline_media']['edges'][x]['node']['taken_at_timestamp'])))
The times variable is the number of timestamps the JSON contains. It may look like a mess, but it's just a matter of patiently following the JSON structure and indexing accordingly.
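If the chain of indexes gets hard to read, a tiny helper that walks a key path can tidy it up. This is only a convenience sketch on top of the data dict from the snippet above; get_path is a name I made up, not part of any library:
def get_path(obj, *keys):
    # Follow a sequence of dict keys and list indexes one step at a time.
    for key in keys:
        obj = obj[key]
    return obj

edges = get_path(data, 'entry_data', 'ProfilePage', 0, 'graphql',
                 'user', 'edge_owner_to_timeline_media', 'edges')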
Related
I am crawling a page with Python.
The discount price on the page is highlighted in red, and it exists as text inside a script tag when you look at the page with the browser developer tools.
from bs4 import BeautifulSoup as bs4
import requests as req
import json
url = 'https://www.11st.co.kr/products/4976666261?NaPm=ct=ld6p5dso|ci=e5e093b328f0ae7bb7c9b67d5fd75928ea152434|tr=slsbrc|sn=17703|hk=87f5ed3e082f9a3cd79cdd0650afa9612c37d9e8&utm_term=&utm_campaign=%B3%D7%C0%CC%B9%F6pc_%B0%A1%B0%DD%BA%F1%B1%B3%B1%E2%BA%BB&utm_source=%B3%D7%C0%CC%B9%F6_PC_PCS&utm_medium=%B0%A1%B0%DD%BA%F1%B1%B3'
res = req.get(url)
soup = bs4(res.text,'html.parser')
# json_data1=soup.find('body').find_all('script',type='text/javascript')[-4].text.split('\n')[1].split('=')[1].replace(';',"")
# data=json.loads(json_data1)
# print(data)
json_data2=soup.find('body').find_all('script',type='text/javascript')[-4].text.split('\n')
print(json_data2)
However, when I print the result in the terminal through this code, the discount price I saw in the browser is printed as the regular price, as shown below. How can I get that value?
The selenium module takes too long, so I want to do this with requests or some other approach.
Using regular expressions will do the trick.
from bs4 import BeautifulSoup as bs4
import re
import requests as req
import json
url = 'https://www.11st.co.kr/products/4976666261?NaPm=ct=ld6p5dso|ci=e5e093b328f0ae7bb7c9b67d5fd75928ea152434|tr=slsbrc|sn=17703|hk=87f5ed3e082f9a3cd79cdd0650afa9612c37d9e8&utm_term=&utm_campaign=%B3%D7%C0%CC%B9%F6pc_%B0%A1%B0%DD%BA%F1%B1%B3%B1%E2%BA%BB&utm_source=%B3%D7%C0%CC%B9%F6_PC_PCS&utm_medium=%B0%A1%B0%DD%BA%F1%B1%B3'
res = req.get(url)
soup = bs4(res.text,'html.parser')
# json_data1=soup.find('body').find_all('script',type='text/javascript')[-4].text.split('\n')[1].split('=')[1].replace(';',"")
# data=json.loads(json_data1)
# print(data)
json_data2=soup.find('body').find_all('script',type='text/javascript')[-4].text.split('\n')
for i in json_data2:
    results = re.findall(r'lastPrc : (\d+?),',i)
    if results:
        print(results)
OUTPUT
['1310000']
The value that you are looking for is no longer there.
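A quick way to confirm that yourself is to search the raw response for the lastPrc key before parsing it line by line. This is just a sanity-check sketch that reuses the res object from the snippet above:
# If the key is gone, the regex approach can no longer work.
if 'lastPrc' in res.text:
    print(re.findall(r'lastPrc : (\d+?),', res.text))
else:
    print('lastPrc is no longer present in the response')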
I am currently trying to extract the text of the match name I have scraped.
import pandas as pd
import requests
from bs4 import BeautifulSoup
import re
url = 'https://www.betexplorer.com/odds-movements/soccer/'
res = requests.get(url)
soup = BeautifulSoup(res.content, "lxml")
times = soup.select('span.table-main__time') #good
matches = soup.find_all("td",class_ ="table-main__tt")
I have located the tag/class, and it seems the value I want to retrieve sits inside the a tag behind the href. The output I wish to achieve here is 'Can Tho - Long An'.
This is a dynamic webpage, so the same match probably won't be there later, but I am looking for pointers on how I can extract just the text and not the whole HTML.
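Since each matched td holds the match name inside its a tag, a minimal sketch (using the matches list from your snippet) is to call get_text() on each element; get_text(strip=True) returns only the visible text, without the surrounding HTML:
for match in matches:
    # get_text() strips the tags and keeps the text, e.g. 'Can Tho - Long An'
    print(match.get_text(strip=True))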
I am trying to scrape the main table with tag :
<table _ngcontent-jna-c4="" class="rayanDynamicStatement">
from the following website using the BeautifulSoup library, but the code returns an empty [], while printing soup returns the HTML string and the request status is 200. I found out that when I use the browser's 'inspect element' tool I can see the table tag, but in 'view page source' the table tag, which is part of the app-root tag, is not shown (you see <app-root></app-root>, which is empty). Besides, there is no JSON file among the webpage's components to extract data from. Please help me: how can I scrape the table data?
import urllib.request
import pandas as pd
from urllib.parse import unquote
from bs4 import BeautifulSoup
yurl='https://www.codal.ir/Reports/Decision.aspx?LetterSerial=T1hETjlDjOQQQaQQQfaL0Mb7uucg%3D%3D&rt=0&let=6&ct=0&ft=-1&sheetId=0'
req=urllib.request.urlopen(yurl)
print(req.status)
#get response
response = req.read()
html = response.decode("utf-8")
#make html readable
soup = BeautifulSoup(html, features="html")
table_body=soup.find_all("table")
print(table_body)
The table is in the source HTML but kinda hidden and then rendered by JavaScript. It's in one of the <script> tags. This can be located with bs4 and then parsed with regex. Finally, the table data can be passed to json.loads, then into a pandas DataFrame and out to a .csv file, but since I don't know any Persian, you'd have to see if it's of any use.
Just by looking at some values, I think it is.
Oh, and this can be done without selenium.
Here's how:
import pandas as pd
import json
import re
import requests
from bs4 import BeautifulSoup
url = "https://www.codal.ir/Reports/Decision.aspx?LetterSerial=T1hETjlDjOQQQaQQQfaL0Mb7uucg%3D%3D&rt=0&let=6&ct=0&ft=-1&sheetId=0"
scripts = BeautifulSoup(
requests.get(url, verify=False).content,
"lxml",
).find_all("script", {"type": "text/javascript"})
table_data = json.loads(
re.search(r"var datasource = ({.*})", scripts[-5].string).group(1),
)
pd.DataFrame(
table_data["sheets"][0]["tables"][0]["cells"],
).to_csv("huge_table.csv", index=False)
This outputs a huge .csv file of the table's cells.
This might not be the best solution, but with the webdriver in headless mode you can get everything you want:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
option = Options()
option.add_argument('--headless')
url = 'https://www.codal.ir/Reports/Decision.aspx?LetterSerial=T1hETjlDjOQQQaQQQfaL0Mb7uucg%3D%3D&rt=0&let=6&ct=0&ft=-1&sheetId=0'
driver = webdriver.Chrome(options=option)
driver.get(url)
bs = BeautifulSoup(driver.page_source, 'html.parser')
print(bs.find('table'))
driver.quit()
It looks like the elements you're trying to get are rendered by JavaScript. You will need to use something like Selenium instead in order to get the fully rendered HTML.
I am writing a simple web scraper to extract the game times for the ncaa basketball games. The code doesn't need to be pretty, just work. I have extracted the value from other span tags on the same page but for some reason I cannot get this one working.
from bs4 import BeautifulSoup as soup
import requests
url = 'http://www.espn.com/mens-college-basketball/game/_/id/401123420'
response = requests.get(url)
soupy = soup(response.content, 'html.parser')
containers = soupy.findAll("div",{"class" : "team-container"})
for container in containers:
    spans = container.findAll("span")
    divs = container.find("div",{"class": "record"})
    ranks = spans[0].text
    team_name = spans[1].text
    team_mascot = spans[2].text
    team_abbr = spans[3].text
    team_record = divs.text
time_container = soupy.find("span", {"class":"time game-time"})
game_times = time_container.text
refs_container = soupy.find("div", {"class" : "game-info-note__container"})
refs = refs_container.text
print(ranks)
print(team_name)
print(team_mascot)
print(team_abbr)
print(team_record)
print(game_times)
print(refs)
The specific code I am concerned about is this,
time_container = soupy.find("span", {"class":"time game-time"})
game_times = time_container.text
I just provided the rest of the code to show that .text works on the other span tags. The time is the only data I truly want; I just get an empty string with how my code is currently.
This is the output I get when I print time_container:
<span class="time game-time" data-dateformat="time1" data-showtimezone="true"></span>
or just '' when I do game_times.
Here is the line of the HTML from the website:
<span class="time game-time" data-dateformat="time1" data-showtimezone="true">6:10 PM CT</span>
I don't understand why the 6:10 PM is gone when I run the script.
The site is dynamic, thus you need to use selenium:
from bs4 import BeautifulSoup as soup
from selenium import webdriver

d = webdriver.Chrome('/path/to/chromedriver')
d.get('http://www.espn.com/mens-college-basketball/game/_/id/401123420')
game_time = soup(d.page_source, 'html.parser').find('span', {'class':'time game-time'}).text
Output:
'7:10 PM ET'
See the full selenium documentation for more details.
An alternative would be to use some of ESPN's endpoints. These endpoints return JSON responses: https://site.api.espn.com/apis/site/v2/sports/basketball/mens-college-basketball/scoreboard
You can see other endpoints at this GitHub link: https://gist.github.com/akeaswaran/b48b02f1c94f873c6655e7129910fc3b
This will make your application pretty lightweight compared to running Selenium.
I recommend opening up inspect and going to the network tab. You can see all sorts of cool stuff happening. You can see all the requests that are happening in the site.
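As a rough sketch of that approach, the scoreboard endpoint above can be queried with plain requests. I'm assuming the response contains an 'events' list whose items carry 'date' and 'name' fields; inspect the actual JSON to confirm before relying on it:
import requests

url = ('https://site.api.espn.com/apis/site/v2/sports/'
       'basketball/mens-college-basketball/scoreboard')
data = requests.get(url).json()

# Assumed shape: one entry per game under 'events',
# each with an ISO 'date' and a display 'name'.
for event in data.get('events', []):
    print(event.get('date'), event.get('name'))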
You can easily grab the time from an attribute on the page with requests:
import requests
from bs4 import BeautifulSoup as bs
from dateutil.parser import parse
r = requests.get('http://www.espn.com/mens-college-basketball/game/_/id/401123420')
soup = bs(r.content, 'lxml')
timing = soup.select_one('[data-date]')['data-date']
print(timing)
match_time = parse(timing).time()
print(match_time)
I want to get the ticker values from this webpage: https://www.oslobors.no/markedsaktivitet/#/list/shares/quotelist/ob/all/all/false
However, when using BeautifulSoup I don't seem to get all the content, and I don't quite understand how to change my code in order to achieve my goal.
import urllib3
from bs4 import BeautifulSoup
def oslobors():
    http = urllib3.PoolManager()
    url = 'https://www.oslobors.no/markedsaktivitet/#/list/shares/quotelist/ob/all/all/false'
    response = http.request('GET', url)
    soup = BeautifulSoup(response.data, "html.parser")
    print(soup)
    return
print(oslobors())
The content you want to parse is generated dynamically. You can either use a browser simulator like selenium, or you can try the URL below, which returns a JSON response. The following is the easy way to go.
import requests
url = 'https://www.oslobors.no/ob/servlets/components?type=table&generators%5B0%5D%5Bsource%5D=feed.ob.quotes.EQUITIES%2BPCC&generators%5B1%5D%5Bsource%5D=feed.merk.quotes.EQUITIES%2BPCC&filter=&view=DELAYED&columns=PERIOD%2C+INSTRUMENT_TYPE%2C+TRADE_TIME%2C+ITEM_SECTOR%2C+ITEM%2C+LONG_NAME%2C+BID%2C+ASK%2C+LASTNZ_DIV%2C+CLOSE_LAST_TRADED%2C+CHANGE_PCT_SLACK%2C+TURNOVER_TOTAL%2C+TRADES_COUNT_TOTAL%2C+MARKET_CAP%2C+HAS_LIQUIDITY_PROVIDER%2C+PERIOD%2C+MIC%2C+GICS_CODE_LEVEL_1%2C+TIME%2C+VOLUME_TOTAL&channel=a66b1ba745886f611af56cec74115a51'
res = requests.get(url)
for ticker in res.json()['rows']:
    ticker_name = ticker['values']['ITEM']
    print(ticker_name)
Results you may get (partial):
APP
HEX
APCL
ODFB
SAS NOK
WWI
ASC