How to Extract a Division from Html with BeautifulSoup

How to Extract a Division from Html with BeautifulSoup - python

I am trying to extract the 'meanings' section of a dictionary entry from a html file using beautifulsoup but it is giving me some trouble. Here is a summary of what I have tried so far:
I right click on the dictionary entry page below and save the webpage to my Python directory as 'aufmachen.html'
https://www.duden.de/rechtschreibung/aufmachen
Within the source code of this webpage, the section that I am trying to extract starts from line 1042 with the expression
I wrote the code below but neither tags nor Bedeutungen contains any search results.
import requests
import pandas as pd
import urllib.request
from bs4 import BeautifulSoup
with open("aufmachen.html",encoding="utf8") as f:
doc = BeautifulSoup(f,"html.parser")
tags = doc.body.findAll(text = '<div class="division " id="bedeutungen">')
print(tags)
Bedeutungen = doc.body.findAll("div", {"id": "bedeutungen"})
print(Bedeutungen)
Could you please help me with this problem?
Thanks for your time in advance.

The main bug in your code is that you send BS a file, not a string. Call .read() on your file to get a string.
with open("aufmachen.html", "r",encoding="utf8") as f:
doc = BeautifulSoup(f.read(),"html.parser")
However it seems you want to pull in the HTML file from a URL, not a file on your computer. This can be done like this:
from bs4 import BeautifulSoup
import requests
url = "https://www.duden.de/rechtschreibung/aufmachen"
req = requests.get(url)
soup = BeautifulSoup(req.text, "html.parser")
Bedeutungen = soup.body.findAll("div", {"id": "bedeutungen"})
print(Bedeutungen)
Your first call to .findAll() didn't work because the text kwarg looks for text inside the tag, not a tag itself. The following works, but there's no particular reason to use this over the other shown above.
tags = soup.body.findAll("div", class_="division", id="bedeutungen")

Related

Extractinf info form HTML that has no tags

I am using both selenium and BeautifulSoup in order to do some web scraping. I have managed myself to obtain the next piece of code:
from selenium.webdriver import Chrome
from bs4 import BeautifulSoup
url = 'https://www.renfe.com/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1591014181985.html'
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'lxml')
The output soup produces has the following structure:
<html>
<head>
</head>
<body>
<rf-list-detail line-color="245,150,40" line-number="C2" line-text="Línea C2"
list="[{... ;direction":"Place1"}
,... ,
;direction":"Place2"}...
Recall both text and output style have been modified for reading reasons. I attach an image of the actual output just in case it is more convinient.
Does anyone know how could I obtain every PlaceN (in the image, Moixent would be Place1) in a list? Something like
places = [Place1,...,PlaceN]
I have tried parsing it, but as it has no tags (or at least my html knowledge, which is barely none, says so) I obtain nothing. I have also tried using a regular expression, which I have just found out where a thing, but I am not sure how to do it properly.
Any thoughts?
Thank you in advance!!
output of soup

This site responds with non-html structure. So, you need no html-parser like BeautifulSoup or lxml for this task.
Here example using requests library. You can install it like this
pip install requests
import requests
import html
import json
url = 'https://www.renfe.com/content/renfe/es/es/cercanias/cercanias-valencia/lineas/jcr:content/root/responsivegrid/rftabdetailline/item_1591014181985.html'
response = requests.get(url)
data = response.text # get data from site
raw_list = data.split("'")[1] # extract rf-list-detail.list attribute
json_list = html.unescape(raw_list) # decode html symbols
parsed_list = json.loads(json_list) # parse json
print(parsed_list) # printing result
directions = []
for item in parsed_list:
directions.append(item["direction"])
print(directions) # extracting directions
# ['Moixent', 'Vallada', 'Montesa', "L'Alcudia de Crespins", 'Xàtiva', "L'Enova-Manuel", 'La Pobla Llarga', 'Carcaixent', 'Alzira', 'Algemesí', 'Benifaió-Almussafes', 'Silla', 'Catarroja', 'Massanassa', 'Alfafar-Benetússer', 'València Nord']

Finding tables returns [] with bs4

I am trying to scrape a table from this url: https://cryptoli.st/lists/fixed-supply
I gather that the table I want is in the div class "dataTables_scroll". I use the following code and it only returns an empty list:
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
url = requests.get("https://cryptoli.st/lists/fixed-supply")
soup = bs(url.content, 'lxml')
table = soup.find_all("div", {"class": "dataTables_scroll"})
print(table)
Any help would be most appreciated.
Thanks!

The reason is that the response you get from requests.get() does not contain table data in it.
It might be loaded on client-side(by javascript).
What can you do about this? Using a selenium webdriver is a possible solution. You can "wait" until the table is loaded and becomes interactive, then get the page content with selenium, pass the context to bs4 to do the scraping.
You can check the response by writing it to a file:
f = open("demofile.html", "w", encoding='utf-8')
f.write(soup.prettify())
f.close()
and you will be able to see "...Loading..." where the table is expected.

I believe the data is loaded from a script tag. I have to go to work so can't spend more time working out how to appropriately recreate the a dataframe from the "|" delimited data at present, but the following may serve as a starting point for others, as it extracts the relevant entries from the script tag for the table body.
import requests, re
import ast
r = requests.get('https://cryptoli.st/lists/fixed-supply').text
s = re.search(r'cl\.coinmainlist\.dataraw = (\[.*?\]);', r, flags = re.S).group(1)
data = ast.literal_eval(s)
data = [i.split('|') for i in data]
print(data)

scraping table from a website result as empty

I am trying to scrape the main table with tag :
<table _ngcontent-jna-c4="" class="rayanDynamicStatement">
from following website using 'BeautifulSoup' library, but the code returns empty [] while printing soup returns html string and request status is 200. I found out that when i use browser 'inspect element' tool i can see the table tag but in "view page source" the table tag which is part of "app-root" tag is not shown. (you see <app-root></app-root> which is empty). Besides there is no "json" file in the webpage's components to extract data from it. Please help me how can I scrape the table data.
import urllib.request
import pandas as pd
from urllib.parse import unquote
from bs4 import BeautifulSoup
yurl='https://www.codal.ir/Reports/Decision.aspx?LetterSerial=T1hETjlDjOQQQaQQQfaL0Mb7uucg%3D%3D&rt=0&let=6&ct=0&ft=-1&sheetId=0'
req=urllib.request.urlopen(yurl)
print(req.status)
#get response
response = req.read()
html = response.decode("utf-8")
#make html readable
soup = BeautifulSoup(html, features="html")
table_body=soup.find_all("table")
print(table_body)

The table is in the source HTML but kinda hidden and then rendered by JavaScript. It's in one of the <script> tags. This can be located with bs4 and then parsed with regex. Finally, the table data can be dumped to json.loads then to a pandas and to a .csv file, but since I don't know any Persian, you'd have to see if it's of any use.
Just by looking at some values, I think it is.
Oh, and this can be done without selenium.
Here's how:
import pandas as pd
import json
import re
import requests
from bs4 import BeautifulSoup
url = "https://www.codal.ir/Reports/Decision.aspx?LetterSerial=T1hETjlDjOQQQaQQQfaL0Mb7uucg%3D%3D&rt=0&let=6&ct=0&ft=-1&sheetId=0"
scripts = BeautifulSoup(
requests.get(url, verify=False).content,
"lxml",
).find_all("script", {"type": "text/javascript"})
table_data = json.loads(
re.search(r"var datasource = ({.*})", scripts[-5].string).group(1),
)
pd.DataFrame(
table_data["sheets"][0]["tables"][0]["cells"],
).to_csv("huge_table.csv", index=False)
This outputs a huge file that looks like this:

Might not the best solution, but with webdriver in headless mode you can get all what you want:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
option = Options()
option.add_argument('--headless')
url = 'https://www.codal.ir/Reports/Decision.aspx?LetterSerial=T1hETjlDjOQQQaQQQfaL0Mb7uucg%3D%3D&rt=0&let=6&ct=0&ft=-1&sheetId=0'
driver = webdriver.Chrome(options=option)
driver.get(url)
bs = BeautifulSoup(driver.page_source, 'html.parser')
print(bs.find('table'))
driver.quit()

It looks like the elements your're trying to get are rendered by some JavaScript code. You will need to use something like Selenium instead in order to get the fully rendered HTML.

Extract element from HTML with Python's BeautifulSoup library

I'm looking to extract data from Instagram and record the time of the post without using auth.
The below code gives me the HTML of the pages from the IG post, but I'm not able to extract the time element from the HTML.
from requests_html import HTMLSession
from bs4 import BeautifulSoup
import json
url_path = 'https://www.instagram.com/<username>'
session = HTMLSession()
r = session.get(url_path)
soup = BeautifulSoup(r.content,features='lxml')
print(soup)
I would like to extract data from the time element near the bottom of this screenshot

to extract time you can use html tag and its class :
time = soup.findAll("time", {"class": "_1o9PC Nzb55"}).text

I'm guessing that the picture you've shared is a browser inspector screenshot. Although inspecting the code is a good basic guideline on web scraping you should check what BeautifullSoup is getting. If you check the print of soup you will see that the data you are looking for its a json inside of a script tag. So your code and any other solution that targets the time tag aren't working on BS4. You might try with selenium maybe.
Anyway here goes the BeautifullSoup pseudo-solution using the instagram from your screenshot:
from bs4 import BeautifulSoup
import json
import re
import requests
import time
url_path = "https://www.instagram.com/srirachi9/"
response = requests.get(url_path)
soup = BeautifulSoup(response.content)
pattern = re.compile(r"window\._sharedData\ = (.*);", re.MULTILINE)
script = soup.find("script", text=lambda x: x and "window._sharedData" in x).text
data = json.loads(re.search(pattern, script).group(1))
times = len(data['entry_data']['ProfilePage'][0]['graphql']['user']['edge_owner_to_timeline_media']['edges'])
for x in range(times):
time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(data['entry_data']['ProfilePage'][0]['graphql']['user']['edge_owner_to_timeline_media']['edges'][x]['node']['taken_at_timestamp']))
The times variable its the amount of timestamps the json contains. It may look like hell but its just a matter of patiently following the json structure and indexing accordingly.

How to scrape total search results using Python

I am a beginner in Python and web scraping but I am really interested. What I want to do is to extract the total number of search results per day.
If you open it, you will see here:
Used Cars for Sale
Results 1 - 20 of 30,376
What I want is only the number 30,376. Is there any way to extract it on a daily basis automatically and save it to an excel file please? I have played around some packages in Python but all I got is error messages and something not relevant like below:
from bs4 import BeautifulSoup
from urllib.request import urlopen
base_url = "..."
def make_soup(url):
html = urlopen(url).read()
return BeautifulSoup(html, "lxml")
make_soup(base_url)
Can someone show me how to extract that particular number please? Thanks!

Here is the one way through requests module and soup.select function.
from bs4 import BeautifulSoup
import requests
base_url = "http://www.autotrader.co.nz/used-cars-for-sale"
def make_soup(url):
html = requests.get(url).content
soup = BeautifulSoup(html, "lxml")
txt = soup.select('#result-header .result-count')[0].text
print txt.split()[-1]
make_soup(base_url)
soup.select accepts an css selector as argument. This #result-header .result-count selector means find the element having result-count class which was inside an element having result-header as id.

from bs4 import BeautifulSoup
from urllib.request import urlopen
base_url = "http://www.autotrader.co.nz/used-cars-for-sale"
html = urlopen(base_url).read()
soup = BeautifulSoup(html, 'lxml')
result_count = soup.find(class_="result-count").text.split('of ')[-1]
print(result_count)
out:
30,376

from bs4 import BeautifulSoup
import requests, re
base_url = "http://www.autotrader.co.nz/used-cars-for-sale"
a = BeautifulSoup(requests.get(base_url).content).select('div#result-header p.result-count')[0].text
num = re.search('([\w,]+)$',a)
print int(num.groups(1)[0].replace(',',''))
Output:
30378
Will get any other number also which is at the end of the statement.
Appending new rows to an Existing Excel File
Script to append today's date and the extracted number to existing excel file:
!!!Important!!!: Don't run this code directly on your main file. Instead make a copy of it first and run on that file. If it works properly then you can run it on your main file. I'm not responsible if you loose your data :)
import openpyxl
import datetime
wb = openpyxl.load_workbook('/home/yusuf/Desktop/data.xlsx')
sheet = wb.get_sheet_by_name('Sheet1')
a = sheet.get_highest_row()
sheet.cell(row=a,column=0).value=datetime.date.today()
sheet.cell(row=a,column=1).value=30378 # use a variable here from the above (previous) code.
wb.save('/home/yusuf/Desktop/data.xlsx')

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to Extract a Division from Html with BeautifulSoup - python

Related

Extractinf info form HTML that has no tags

Finding tables returns [] with bs4

scraping table from a website result as empty

Extract element from HTML with Python's BeautifulSoup library

How to scrape total search results using Python

Categories

Resources