Is it possible to extract a value from the data layer?
This is the URL: https://www.repubblica.it/cronaca/2021/04/09/news/vaccini_ecco_chi_sono_gli_oltre_due_milioni_di_italiani_che_hanno_ricevuto_una_dose_fuori_dalle_liste_delle_priorita_-295650286/?ref=RHTP-BH-I0-P1-S1-T1
I need to extract "dateModified" from the <script type="application/ld+json"> tag.
Thank you!
import requests
from bs4 import BeautifulSoup
import json
url='https://www.repubblica.it/cronaca/2021/04/09/news/vaccini_ecco_chi_sono_gli_oltre_due_milioni_di_italiani_che_hanno_ricevuto_una_dose_fuori_dalle_liste_delle_priorita_-295650286/?ref=RHTP-BH-I0-P1-S1-T1'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
script = soup.find_all('script')[0]
print(script.find('dateModified'))
Yes, you need to grab the tag's .string attribute and pass it to json.loads.
Here's how:
import json
import requests
from bs4 import BeautifulSoup
url='https://www.repubblica.it/cronaca/2021/04/09/news/vaccini_ecco_chi_sono_gli_oltre_due_milioni_di_italiani_che_hanno_ricevuto_una_dose_fuori_dalle_liste_delle_priorita_-295650286/?ref=RHTP-BH-I0-P1-S1-T1'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
print(json.loads(soup.find_all('script')[0].string)["dateModified"])
Output:
2021-04-09T10:26:13Z
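A slightly more robust variant selects the script by its type attribute instead of relying on it being the first <script> on the page. A minimal sketch, with an inline HTML snippet standing in for the article page (the real ld+json block has many more keys):

```python
import json
from bs4 import BeautifulSoup

# Inline stand-in for the article page.
html = '''<html><head>
<script type="application/ld+json">
{"@type": "NewsArticle", "dateModified": "2021-04-09T10:26:13Z"}
</script>
</head></html>'''

soup = BeautifulSoup(html, 'html.parser')
# Find the ld+json script by its attribute rather than by position.
tag = soup.find('script', type='application/ld+json')
data = json.loads(tag.string)
print(data['dateModified'])  # 2021-04-09T10:26:13Z
```

This keeps working even if another <script> tag is inserted before the structured-data block.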
This is the code I used to get the data from a website listing all possible Wordle words. I'm trying to put them in a list so I can create a Wordle clone, but I get a weird output when I do this. Please help.
import requests
from bs4 import BeautifulSoup
url = "https://raw.githubusercontent.com/tabatkins/wordle-list/main/words"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
word_list = list(soup)
You don't need BeautifulSoup; simply split the text of the response:
import requests
url = "https://raw.githubusercontent.com/tabatkins/wordle-list/main/words"
requests.get(url).text.split()
Or, if you'd like to do it with BeautifulSoup anyway:
import requests
from bs4 import BeautifulSoup
url = "https://raw.githubusercontent.com/tabatkins/wordle-list/main/words"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
soup.text.split()
Output:
['women',
'nikau',
'swack',
'feens',
'fyles',
'poled',
'clags',
'starn',...]
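Once split, the list can be used directly in a Wordle clone. A small sketch: the sample words below stand in for the full downloaded text, and is_valid_guess is a hypothetical helper, not part of any library:

```python
# Sample words standing in for requests.get(url).text
text = "women nikau swack feens fyles poled clags starn"
word_list = text.split()

def is_valid_guess(guess, words):
    # A guess is valid if it is five letters long and appears in the list.
    return len(guess) == 5 and guess in words

print(is_valid_guess("swack", word_list))  # True
print(is_valid_guess("zzzzz", word_list))  # False
```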
import requests
from bs4 import BeautifulSoup
result = requests.get('http://textfiles.com/stories/').text
soup = BeautifulSoup(result, 'lxml')
stories = soup.find_all('tr')
print(stories)
The find method works, but find_all doesn't. I'm not sure why; maybe it's because the rows don't have a class?
The correct code is:
import requests
from bs4 import BeautifulSoup
result=requests.get('http://textfiles.com/stories/')
soup = BeautifulSoup(result.content, 'html5lib')
stories=soup.find_all('tr')
You can access each 'tr' by indexing, e.g.
stories[0]
where 0 can be replaced with any valid index into the list.
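Each row can then be unpacked further, for instance to pull out each story's link. A minimal sketch against an inline snippet shaped like the directory listing (the filenames are stand-ins; the real page has many more rows):

```python
from bs4 import BeautifulSoup

# Inline stand-in for the directory listing page.
html = '''<table>
<tr><td><a href="3lpigs.txt">3lpigs.txt</a></td><td>The Three Little Pigs</td></tr>
<tr><td><a href="aesop.txt">aesop.txt</a></td><td>Fables</td></tr>
</table>'''

soup = BeautifulSoup(html, 'html.parser')
# Skip rows without a link (e.g. header rows on the real page).
links = [row.find('a')['href'] for row in soup.find_all('tr') if row.find('a')]
print(links)  # ['3lpigs.txt', 'aesop.txt']
```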
You can also use pandas, e.g.:
import pandas
import requests
from bs4 import BeautifulSoup
result=requests.get('http://textfiles.com/stories/')
soup = BeautifulSoup(result.content, 'html5lib')
df = pandas.read_html(soup.prettify())
print(df)
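Note that pandas.read_html returns a list of DataFrames, one per <table> in the markup, so you index into the result. A minimal sketch with an inline table (read_html needs lxml or html5lib installed; the column name and value are stand-ins):

```python
import pandas as pd
from io import StringIO

# One <table> in the markup -> a list containing one DataFrame.
html = "<table><tr><th>story</th></tr><tr><td>3lpigs.txt</td></tr></table>"
tables = pd.read_html(StringIO(html))
df = tables[0]
print(df['story'][0])  # 3lpigs.txt
```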
I'm trying to get the whole table at this URL = "https://www.timeshighereducation.com/world-university-rankings/2020/subject-ranking/life-sciences#!/page/0/length/25/sort_by/rank/sort_order/asc/cols/stats" into a DataFrame (821 rows in total; I need the entire table). The code I'm using is this:
import requests
from bs4 import BeautifulSoup
import json
url = "https://www.timeshighereducation.com/world-university-rankings/2020/subject-ranking/life-sciences#!/page/0/length/25/sort_by/rank/sort_order/asc/cols/stats"
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
print(soup) # It doesn't print anything
My idea is to get the info in soup, look for the tag <script> jQuery.extend(Drupal.settings, {"basePath": ..., and extract the following JSON link https://www.timeshighereducation.com/sites/default/files/the_data_rankings/life_sciences_rankings_2020_0__a2e62a5137c61efeef38fac9fb83a262.json, where all the data in the table lives. I already have a function to read this JSON link, but first I need to find the info in soup and then get the JSON link. It has to work this way because I have to read many tables, and getting each JSON link by manual inspection is not an option for me.
You want the following regex pattern, which finds the desired string after "url":
from bs4 import BeautifulSoup as bs
import requests
import re
with requests.Session() as s:
    s.headers = {'User-Agent': 'Mozilla/5.0'}
    r = s.get('https://www.timeshighereducation.com/world-university-rankings/2020/subject-ranking/life-sciences#!/page/0/length/25/sort_by/rank/sort_order/asc/cols/stats')
    url = re.search('"url":"(.*?)"', r.text).group(1).replace('\\/', '/')
    data = s.get(url).json()
    print(data)
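The pattern can be checked against an inline fragment shaped like the Drupal.settings blob. The fragment below is a shortened stand-in (the file name is invented); the slashes in the real page are escaped as \/, which is why the replace is needed:

```python
import re

# Shortened stand-in for the Drupal.settings <script> content.
text = ('jQuery.extend(Drupal.settings, {"ajax": {"url":'
        '"https:\\/\\/www.timeshighereducation.com\\/sites\\/default\\/files\\/data.json"}});')

# Capture everything between "url":" and the next quote, then unescape the slashes.
url = re.search('"url":"(.*?)"', text).group(1).replace('\\/', '/')
print(url)  # https://www.timeshighereducation.com/sites/default/files/data.json
```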
I want to parse the given website and scrape the table. To me the code looks right. I'm new to Python and web parsing.
import requests
from bs4 import BeautifulSoup
response = requests.get('https://delhifightscorona.in/')
doc = BeautifulSoup(response.text, 'lxml-xml')
cases = doc.find_all('div', {"class": "cell"})
print(cases)
Doing this returns:
[]
Change your parser and the class, and there you have it.
import requests
from bs4 import BeautifulSoup
soup = BeautifulSoup(requests.get('https://delhifightscorona.in/').text, 'html.parser').find('div', {"class": "grid-x grid-padding-x small-up-2"})
print(soup.find("h3").getText())
Output:
423,831
You can choose to print only the cases or the total stats with the date.
import requests
from bs4 import BeautifulSoup
response = requests.get('https://delhifightscorona.in/')
doc = BeautifulSoup(response.text, 'html.parser')
stats = doc.find('div', {"class": "cell medium-5"})
print(stats.text) #Print the whole block with dates and the figures
cases = stats.find('h3')
print(cases.text) #Print the cases only
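The class-based lookup can be seen on an inline snippet with the same shape as the page markup (the class name comes from the answer above; the heading text and figure are stand-ins):

```python
from bs4 import BeautifulSoup

# Inline stand-in for the stats block on the page.
html = ('<div class="cell medium-5">'
        '<h5>Total cases</h5><h3>423,831</h3></div>')

soup = BeautifulSoup(html, 'html.parser')
# Matching on the full class string "cell medium-5" works because the
# attribute value matches exactly.
stats = soup.find('div', {"class": "cell medium-5"})
print(stats.find('h3').text)  # 423,831
```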
I am trying to strip away all the HTML tags from the ‘profile’ soup; however, I am unable to perform the “.text.strip()” operation because it is a list, as shown in the code below:
import requests
from bs4 import BeautifulSoup
from pprint import pprint
page = requests.get("https://web.archive.org/web/20121007172955/http://www.nga.gov/collection/anZ1.htm").text
soup = BeautifulSoup(page, "html.parser")
info = {}
info['Profile'] = soup.select('div.text-desc-members')
pprint(info)
Just iterate through that list:
import requests
from bs4 import BeautifulSoup
from pprint import pprint
page = requests.get("https://web.archive.org/web/20121007172955/http://www.nga.gov/collection/anZ1.htm").text
soup = BeautifulSoup(page, "html.parser")
info = {}
info['Profile'] = soup.select('div.text-desc-members')
for item in info['Profile']:
    pprint(item.text.strip())
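The point is that select() always returns a list, even when only one element matches, so the text has to be pulled from each item (or collected in a comprehension). A sketch with inline HTML standing in for the archived page:

```python
from bs4 import BeautifulSoup

# Inline stand-in for the archived page markup.
html = ('<div class="text-desc-members"><p>  First paragraph.  </p></div>'
        '<div class="text-desc-members"><p>  Second paragraph.  </p></div>')

soup = BeautifulSoup(html, 'html.parser')
# Strip each matched element's text individually.
profile = [item.text.strip() for item in soup.select('div.text-desc-members')]
print(profile)  # ['First paragraph.', 'Second paragraph.']
```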