Scraping react-id - python

I'm trying to use Scrapy on this page to extract the phone number from this element:
sel = Selector(response)
sel.xpath('.//*[@class="ProfileSimpleContact-item"]/span/span/text()').extract()
but this returns:
['(11) 98528-27...']
I want to get the full number (without the "..."), which only appears after dynamically clicking a React element (react-id). How can I get it?

You could use Splash as a last resort, but that would make your spider more expensive and complex.
Luckily, in your case you can use one of the <script> tags to get the required data.
First you need to get the correct <script> tag:
ans = response.xpath("//script[contains(text(),'telephone')]/text()").extract_first()
It gives you a JSON object like this:
{
    "@context": "http://schema.org",
    "@type": "Person",
    "name": "Cynthia Hóss Rocha",
    "description": "advogada há 15 anos.",
    "telephone": "(11) 985282712",
    "image": "imgs.jusbr.com/profiles/5368773/images/1419878998_standard.jpg",
    "jobTitle": "Advogado",
    "url": "https://cynthiahossrocha.jusbrasil.com.br",
    "address": {
        "@type": "PostalAddress",
        "addressLocality": "São Paulo (SP)",
        "streetAddress": "Rua Marconi, 131",
        "postalCode": "01047-000"
    }
}
To convert it into an object you need to import json and use json.loads:
json_ans = json.loads(ans)
Finally you only need to extract the required value:
phone = json_ans["telephone"]
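Putting the pieces together, a minimal sketch (assuming the usual Scrapy response object inside a spider callback, and that the JSON-LD <script> tag is present on the page):
import json

ans = response.xpath("//script[contains(text(),'telephone')]/text()").extract_first()
json_ans = json.loads(ans)      # parse the JSON-LD blob into a dict
phone = json_ans["telephone"]   # "(11) 985282712"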

Related

Reusable Beautiful Soup Parser/Config?

I have a Selenium-based web crawler application that monitors over 100 different medical publications, with more being added regularly. Each of these publications has a different site structure, so I've tried to make the web crawler as general and re-usable as possible (especially because this is intended for use by other colleagues). For each crawler, the user specifies a list of regex URL patterns that the crawler is allowed to crawl. From there, the crawler will grab any links found as well as specified sections of the HTML. This has been useful in downloading large amounts of content in a fraction of the time it would take to do manually.
I'm now trying to figure out a way to generate custom reports based on the HTML of a certain page. For example, when crawling X site, export a JSON file that shows the number of issues on the page, the name of each issue, the number of articles under each issue, then the title and author names of each of those articles. The page I'll use as an example and test case is https://www.paediatrieschweiz.ch/zeitschriften/
I've created a list comprehension that produces my desired output.
import requests
from bs4 import BeautifulSoup

url = "https://www.paediatrieschweiz.ch/zeitschriften/"
html = requests.get(url).content
soup = BeautifulSoup(html, 'html.parser')
report = [{
    'Issue': unit.find('p', {'class': 'section__spitzmarke'}).text.strip(),
    'Articles': [{
        'Title': article.find('h3', {'class': 'teaser__title'}).text.strip(),
        'Author': article.find('p', {'class': 'teaser__authors'}).text.strip()
    } for article in unit.find_all('article')]
} for unit in soup.find_all('div', {'class': 'section__inner'})]
Sample JSON output:
[
{
"Issue": "Paediatrica Vol. 33-3/2022",
"Articles": [
{
"Title": "Editorial",
"Author": "Daniela Kaiser, Florian Schaub"
},
{
"Title": "Interview mit Dr. med. Germann Clenin",
"Author": "Florian Schaub, Daniela Kaiser"
},
{
"Title": "Ern\u00e4hrung f\u00fcr sportliche Kinder und Jugendliche",
"Author": "Simone Reber"
},
{
"Title": "Diabetes und Sport",
"Author": "Paolo Tonella"
},
{
"Title": "Asthma und Belastung",
"Author": "Isabelle Rochat"
},
{
"Title": "Sport bei Kindern und Jugendlichen mit rheumatischer Erkrankung",
"Author": "Daniela Kaiser"
},
{
"Title": "H\u00e4mophilie und Sport",
"Author": "Denise Etzweiler, Manuela Albisetti"
},
{
"Title": "Apophysen - die Achillesferse junger Sportler",
"Author": "Florian Schaub"
},
{
"Title": "R\u00fcckenschmerzen bei Athleten im Wachstumsalter",
"Author": "Markus Renggli"
},
{
"Title": "COVID-19 bei jugendlichen AthletenInnen: Diagnose und Return to Sports",
"Author": "Susi Kriemler, Jannos Siaplaouras, Holger F\u00f6rster, Christine Joisten"
},
{
"Title": "Schutz von Kindern und Jugendlichen im Sport",
"Author": "Daniela Kaiser"
}
]
},
{
"Issue": "Paediatrica Vol. 33-2/2022",
"Articles": [
{
"Title": "Editorial",
"Author": "Caroline Schnider, Jean-Cristoph Caubet"
},
{
"Title": "Der prim\u00e4re Immundefekt \u2013 Ein praktischer Leitfaden f\u00fcr den Kinderarzt",
"Author": "Tiphaine Arlabosse, Katerina Theodoropoulou, Fabio Candotti"
},
{
"Title": "Arzneimittelallergien bei Kindern: was sollten Kinder\u00e4rzte wissen?",
"Author": "Felicitas Bellutti Enders, Mich\u00e8le Roth, Samuel Roethlisberger"
},
{
"Title": "Orale Immuntherapie bei Nahrungsmittelallergien im Kindesalter",
"Author": "Yannick Perrin, Caroline Roduit"
},
{
"Title": "Autoimmunit\u00e4t in der P\u00e4diatrie: \u00dcberlegungen der p\u00e4diatrischen Rheumatologie",
"Author": "Florence A. Aeschlimann, Raffaella Carlomagno"
...
However, if possible I'd like to avoid using a custom Python statement or function for each individual crawler, as each would require different code. Each crawler has its own JSON config file which specifies the start URL, allowed URL patterns, desired content to download, etc.
My initial idea was to create a JSON config to specify the Beautiful Soup elements to scrape and store in a dictionary - something that a colleague who does not write code would be able to set up. My idea was something like this...
{
    "name": "Paedriactia",
    "unit": {
        "selector": {
            "name": "div",
            "attrs": {"class": "section__inner"},
            "find_all": true
        },
        "items": {
            "issue": {
                "name": "p", "attrs": {"class": "section__spitzmarke"}, "get_text": true
            }
        },
        "subunits": {
            "articles": {
                "selector": {
                    "name": "article",
                    "find_all": true
                },
                "items": {
                    "Title": {
                        "name": "h3",
                        "attrs": {"class": "teaser__title"},
                        "get_text": true
                    }
                }
            }
        }
    }
}
...along with a Python function that would be able to parse the HTML according to the config and produce a JSON output.
However, I'm at a total loss as to how to do this, especially when it comes to handling nested elements. Each time I've tried to approach this on my own I've totally confused myself and have ended up back at the start.
If any of this makes sense, would anyone have any advice or an idea of how to approach this sort of Beautiful Soup config?
I'm also fairly proficient in writing code with Beautiful Soup, so I'm open to the idea of writing custom Beautiful Soup functions and statements for each crawler (and perhaps even training others to do the same). If I take this route, where would be the best place to store all of that custom code and import it? Would I be able to have some sort of "parse.py" file in each crawler folder and import its functions only as needed (i.e., when that specific crawler is run)?
With BeautifulSoup, I strongly prefer using select and select_one to using find_all and find when scraping nested elements. (If you're not used to working with CSS selectors, I find the w3schools reference page to be a good cheatsheet for them.)
If you define a function like
def getSoupData(mSoup, dataStruct, maxDepth=None, curDepth=0):
    if type(dataStruct) != dict:
        # so selector/targetAttr can also be sent as a single string
        if str(dataStruct).startswith('"ta":'):
            dKey = 'targetAttr'
        else:
            dKey = 'cssSelector'
        dataStruct = str(dataStruct).replace('"ta":', '', 1)
        dataStruct = {dKey: dataStruct}

    # default values: isList=False, items={}
    isList = dataStruct['isList'] if 'isList' in dataStruct else False
    if 'items' in dataStruct and type(dataStruct['items']) == dict:
        items = dataStruct['items']
    else:
        items = {}

    # no selector -> just use the input directly
    if 'cssSelector' not in dataStruct:
        soup = mSoup if type(mSoup) == list else [mSoup]
    else:
        soup = mSoup.select(dataStruct['cssSelector'])

    # so that unneeded parts are not processed:
    if not isList:
        soup = soup[:1]

    # return empty if nothing was selected
    if not soup:
        return [] if isList else None

    # return text or attribute values - no more recursion
    if items == {}:
        if 'targetAttr' in dataStruct:
            targetAttr = dataStruct['targetAttr']
        else:
            targetAttr = '"text"'  # default

        if targetAttr == '"text"':
            sData = [s.get_text(strip=True) for s in soup]
        # can put in more options with elif
        else:
            sData = [s.get(targetAttr) for s in soup]
        return sData if isList else sData[0]

    # return error - recursion limited
    if maxDepth is not None and curDepth > maxDepth:
        return {
            'errorMsg': f'Maximum [{maxDepth}] exceeded at depth={curDepth}'
        }

    # recursively get items
    sData = [dict([(i, getSoupData(
        s, items[i], maxDepth, curDepth + 1
    )) for i in items]) for s in soup]
    return sData if isList else sData[0]  # return list only if isList is set
you can make your data structure as nested as your HTML structure [because the function is recursive], if you want that for some reason; you can also set maxDepth to limit how deeply it recurses. If you don't want any limit, you can get rid of both maxDepth and curDepth as well as any parts involving them.
Then, you can make your config file something like
{
    "name": "Paedriactia",
    "data_structure": {
        "cssSelector": "div.section__inner",
        "items": {
            "Issue": "p.section__spitzmarke",
            "Articles": {
                "cssSelector": "article",
                "items": {
                    "Title": "h3.teaser__title",
                    "Author": "p.teaser__authors"
                },
                "isList": true
            }
        },
        "isList": true
    },
    "url": "https://www.paediatrieschweiz.ch/zeitschriften/"
}
["isList": true here is equivalent to your "find_all": true; and your "subunits" are also defined as "items" here - the function can differentiate based on the structure/dataType.]
Now the same data that you showed [at the beginning of your question] can be extracted with
# import json
configC = json.load(open('crawlerConfig_paedriactia.json', 'r'))
url = configC['url']
html = requests.get(url).content
soup = BeautifulSoup(html, 'html.parser')
dStruct = configC['data_structure']
getSoupData(soup, dStruct)
For this example, you could add the article links by adding {"cssSelector": "a.teaser__inner", "targetAttr": "href"} as ...Articles.items.Link.
Also, note that [because of the defaults set at the beginning of the function], "Title": "h3.teaser__title" is the same as
"Title": { "cssSelector": "h3.teaser__title" }
and
"Link": "\"ta\":href"
would be the same as
"Link": {"targetAttr": "href"}

Sending GET request to URL with fragment returns the content of the main page

I am trying to web-scrape this webpage but I always end up getting the "main" page (same URL but without "#face-a-face" at the end). It's the same problem this user encountered (see this forum thread). He got an answer, but I am not able to generalize it and apply it to the website I want to scrape.
import requests
from bs4 import BeautifulSoup
url_main = "https://www.lequipe.fr/Football/match-direct/ligue-1/2020-2021/ol-dijon-live/477168"
url_target = url_main + "#face-a-face"
soup_main = BeautifulSoup(requests.get(url_main, verify=False).content, "html.parser")
soup_target = BeautifulSoup(requests.get(url_target, verify=False).content, "html.parser")
print(soup_main == soup_target)
prints True, but I would expect the two pages to have different content.
For example, I would like to extract all the "confrontations depuis 2011" (head-to-head results since 2011) from the target webpage. How can I get the final content of this webpage with a GET request (or some other way)? Thanks!
All the data comes from a highly nested JSON file.
You can get that file and extract the information you need.
Here's how:
import json
import requests
endpoint = "https://iphdata.lequipe.fr/iPhoneDatas/EFR/STD/ALL/V2/Football/Prelive/68/477168.json"
team_data = requests.get(endpoint).json()
specifics = team_data["items"][1]["objet"]["matches"][0]["specifics"]
print(json.dumps(specifics, indent=2))
This should get you a dictionary:
{
"__type": "specifics_sport_collectif",
"vainqueur": "domicile",
"score": {
"__type": "score",
"exterieur": "1",
"domicile": "4"
},
"exterieur": {
"__type": "effectif_sport_collectif",
"equipe": {
"__type": "equipe",
"id": "202",
"url_image": "https://medias.lequipe.fr/logo-football/202/{width}{declinaison}",
"nom": "Dijon",
"url_fiche": "https://www.lequipe.fr/Football/FootballFicheClub202.html"
}
},
"domicile": {
"__type": "effectif_sport_collectif",
"equipe": {
"__type": "equipe",
"id": "22",
"url_image": "https://medias.lequipe.fr/logo-football/22/{width}{declinaison}",
"nom": "Lyon",
"url_fiche": "https://www.lequipe.fr/Football/FootballFicheClub22.html"
}
},
"is_final": false,
"prolongation": false,
"vainqueur_final": "domicile",
"is_qualifier": false
}
And if, for example, you just want the score, add these lines:
just_the_score = specifics["score"]
print(just_the_score)
To get this:
{'__type': 'score', 'exterieur': '1', 'domicile': '4'}
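Other fields are just more key lookups into the same nested structure. For example, a small sketch (using the specifics dict from above) that pulls the team names and the winner:
home = specifics["domicile"]["equipe"]["nom"]    # "Lyon"
away = specifics["exterieur"]["equipe"]["nom"]   # "Dijon"
winner = specifics["vainqueur"]                  # "domicile"
print(home, "vs", away, "- winner:", winner)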

Is it possible to retrieve object_story_spec for an ad creative which was not created with that object_story_spec? [Python Facebook API]

I have a set of ad creatives that I retrieve through the Facebook Business Python SDK. I need these specifically to retrieve the outbound URL when someone clicks on the ad: AdCreative['object_story_spec']['video_data']['call_to_action']['value']['link'].
I use the following call:
adcreatives = set.get_ad_creatives(fields=[
    AdCreative.Field.id,
    AdCreative.Field.name,
    AdCreative.Field.object_story_spec,
    AdCreative.Field.effective_object_story_id,
])
Where set is an ad set.
For some cases, the result looks like this (with actual data removed), which is expected:
<AdCreative> {
"body": "[<BODY>]",
"effective_object_story_id": "[<EFFECTIVE_OBJECT_STORY_ID>]",
"id": "[<ID>]",
"name": "[<NAME>]",
"object_story_spec": {
"instagram_actor_id": "[<INSTAGRAM_ACTOR_ID>]",
"page_id": "[<PAGE_ID>]",
"video_data": {
"call_to_action": {
"type": "[<TYPE>]",
"value": {
"link": "[<LINK>]", <== This is what I need
"link_format": "[<LINK_FORMAT>]"
}
},
"image_hash": "[<IMAGE_HASH>]",
"image_url": "[<IMAGE_URL>]",
"message": "[<MESSAGE>]",
"video_id": "[<VIDEO_ID>]"
}
}
}
While sometimes results look like this:
<AdCreative> {
"effective_object_story_id": "[<EFFECTIVE_OBJECT_STORY_ID>]",
"id": "[<ID>]",
"name": "[<NAME>]",
"object_story_spec": {
"instagram_actor_id": "[<INSTAGRAM_ACTOR_ID>]",
"page_id": "[<PAGE_ID>]"
}
}
According to this earlier question: Can't get AdCreative ObjectStorySpec this is due to the fact that the object_story_spec is not populated if it is linked to a creative, instead of created along with the creative.
However, the video_data (and, as such, the link) should be saved somewhere. Is there a way to retrieve it? Maybe through effective_object_story_id?
The documentation page for object_story_spec (https://developers.facebook.com/docs/marketing-api/reference/ad-creative-object-story-spec/v12.0) does not have the information I am looking for.

Beautiful Soup returns an empty string when website has text

Considering this website here: https://dlnr.hawaii.gov/dsp/parks/oahu/ahupuaa-o-kahana-state-park/
I'm looking to scrape the content under the headings on the right. Here is my sample code which should return the list of contents but is returning empty strings:
import requests as req
from bs4 import BeautifulSoup as bs
r = req.get('https://dlnr.hawaii.gov/dsp/parks/oahu/ahupuaa-o-kahana-state-park/').text
soup = bs(r)
par = soup.find('h3', text= 'Facilities')
for sib in par.next_siblings:
    print(sib)
This returns:
<ul class="park_icon">
<div class="clearfix"></div>
</ul>
The website doesn't show any div element with that class. Also, the list items are not being captured.
Facilities, and other info in that frame, are loaded dynamically by JavaScript, so bs4 doesn't see them in the source HTML because they're simply not there.
However, you can query the endpoint and get all the info you need.
Here's how:
import json
import re
import time
import requests
headers = {
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/90.0.4430.93 Safari/537.36",
    "referer": "https://dlnr.hawaii.gov/",
}
endpoint = f"https://stateparksadmin.ehawaii.gov/camping/park-site.json?parkId=57853&_={int(time.time())}"
response = requests.get(endpoint, headers=headers).text
data = json.loads(re.search(r"callback\((.*)\);", response).group(1))
print("\n".join(f for f in data["park info"]["facilities"]))
Output:
Boat Ramp
Campsites
Picnic table
Restroom
Showers
Trash Cans
Water Fountain
Here's the entire JSON:
{
"park info": {
"name": "Ahupua\u02bba \u02bbO Kahana State Park",
"id": 57853,
"island": "Oahu",
"activities": [
"Beachgoing",
"Camping",
"Dogs on Leash",
"Fishing",
"Hiking",
"Hunting",
"Sightseeing"
],
"facilities": [
"Boat Ramp",
"Campsites",
"Picnic table",
"Restroom",
"Showers",
"Trash Cans",
"Water Fountain"
],
"prohibited": [
"No Motorized Vehicles/ATV's",
"No Alcoholic Beverages",
"No Open Fires",
"No Smoking",
"No Commercial Activities"
],
"hazards": [],
"photos": [],
"location": {
"latitude": 21.556086,
"longitude": -157.875579
},
"hiking": [
{
"name": "Nakoa Trail",
"id": 17,
"activities": [
"Dogs on Leash",
"Hiking",
"Hunting",
"Sightseeing"
],
"facilities": [
"No Drinking Water"
],
"prohibited": [
"No Bicycles",
"No Open Fires",
"No Littering/Dumping",
"No Camping",
"No Smoking"
],
"hazards": [
"Flash Flood"
],
"photos": [],
"location": {
"latitude": 21.551087,
"longitude": -157.881228
},
"has_google_street": false
},
{
"name": "Kapa\u2018ele\u2018ele Trail",
"id": 18,
"activities": [
"Dogs on Leash",
"Hiking",
"Sightseeing"
],
"facilities": [
"No Drinking Water",
"Restroom",
"Trash Cans"
],
"prohibited": [
"No Bicycles",
"No Open Fires",
"No Littering/Dumping",
"No Camping",
"No Smoking"
],
"hazards": [],
"photos": [],
"location": {
"latitude": 21.554744,
"longitude": -157.876601
},
"has_google_street": false
}
]
}
}
You've already been given the necessary answer and I thought I would provide insight into another way you could have divined what was going on (other than looking in network traffic).
Let's start with your observation:
the list items are not being captured.
Examining each of the li elements we see that the html is of the form
class="parkicon facilities icon01" - where 01 is a variable number representing the particular icon visible on the page.
A quick search through the associated source files will show you that these numbers and their corresponding facility references are listed in
https://dlnr.hawaii.gov/dsp/wp-content/themes/hic_state_template_StateParks/js/icon.js:
var w_fac_icons={"ADA Accessible":"01","Boat Ramp":"02","Campsites":"03","Food Concession":"04","Lodging":"05","No Drinking Water":"06","Picnic Pavilion":"07","Picnic table":"08","Pier Fishing":"09","Restroom":"10","Showers":"11","Trash Cans":"12","Walking Path":"13","Water Fountain":"14","Gift Shop":"15","Scenic Viewpoint":"16"}
If you then search the source html for w_fac_icons you will come across (lines 560-582):
// Icon Facilities
var i_facilities = [];
for (var i = 0, l = parkfac.length; i < l; ++i) {
    var icon_fac = '<li class="parkicon facilities icon' + w_fac_icons[parkfac[i]] + '"><span>' + parkfac[i] + '</span></li>';
    i_facilities.push(icon_fac);
};
if (l > 0) {
    jQuery('#i_facilities ul').html(i_facilities.join(''));
} else {
    jQuery('#i_facilities').hide();
}
This shows you how the li element html is constructed through javascript running on the page with parkfac[i] returning the text description in the span, and w_fac_icons[parkfac[i]] returning the numeric value associated with the icon in the class value.
If you track back parkfac you will arrive at line 472
var parkfac = parkinfo.facilities;
If you then track back function parkinfo you will arrive at line 446 onwards, where you will find the ajax request which dynamically grabs the json data used to update the webpage:
function parkinfo() {
    var campID = 57853;
    jQuery.ajax({
        type: 'GET',
        url: 'https://stateparksadmin.ehawaii.gov/camping/park-site.json',
        data: "parkId=" + campID,
Data can be passed within a query string as params using a GET request.
This is therefore the request you are looking for in the network tab.
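If you want to reproduce that ajax call yourself with requests, a minimal sketch passing parkId as a query parameter (the response is wrapped in callback(...), so it is stripped the same way as in the first answer; browser-like headers may also be needed):
import json
import re
import requests

endpoint = "https://stateparksadmin.ehawaii.gov/camping/park-site.json"
raw = requests.get(endpoint, params={"parkId": 57853}).text
# strip the callback(...) wrapper before parsing the JSON payload
data = json.loads(re.search(r"callback\((.*)\);", raw).group(1))
print(data["park info"]["facilities"])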
While the above answers technically answer the question, if you're scraping data from multiple pages it's not feasible to dig into the endpoints each time.
The simpler approach, when you know you're dealing with a JavaScript-rendered page, is to load it with scrapy-splash or Selenium; the rendered elements can then be parsed with BeautifulSoup.
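For example, a minimal Selenium sketch (assuming a local Chrome/chromedriver setup); the rendered page source is handed to BeautifulSoup, and the facilities are read out of the #i_facilities list that the JavaScript above populates:
import time
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://dlnr.hawaii.gov/dsp/parks/oahu/ahupuaa-o-kahana-state-park/")
time.sleep(5)  # crude wait so the page's JavaScript has time to build the lists
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

facilities = [s.get_text(strip=True) for s in soup.select("#i_facilities li span")]
print(facilities)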

How can I scrape the content of this specific website (cineatlas)?

I am trying to scrape the content of this particular website : https://www.cineatlas.com/
I tried scraping the dates part shown in the screenshot, using this basic BeautifulSoup code:
import requests
from bs4 import BeautifulSoup

response = requests.get('https://www.cineatlas.com/')
soup = BeautifulSoup(response.text, 'html.parser')
time = soup.find('ul', class_='slidee')
This is what I get instead of the list of elements
<ul class="slidee">
<!-- adding dates -->
</ul>
The site creates the HTML elements dynamically from JavaScript content. You can extract the JS content using re, for example:
import re
import json
import requests
from ast import literal_eval
url = 'https://www.cineatlas.com/'
html_data = requests.get(url).text
movieData = re.findall(r'movieData = ({.*?}), movieDataByReleaseDate', html_data, flags=re.DOTALL)[0]
movieData = re.sub(r'\s*/\*.*?\*/\s*', '', movieData) # remove comments
movieData = literal_eval(movieData) # in movieData you have now the information about the current movies
print(json.dumps(movieData, indent=4)) # print data to the screen
Prints:
{
"2019-08-06": [
{
"url": "fast--furious--hobbs--shaw",
"image-portrait": "https://d10u9ygjms7run.cloudfront.net/dd2qd1xaf4pceqxvb41s1xpzs0/1562603443098_891497ecc8b16b3a662ad8b036820ed1_500x735.jpg",
"image-landscape": "https://d10u9ygjms7run.cloudfront.net/dd2qd1xaf4pceqxvb41s1xpzs0/1562603421049_7c233477779f25725bf22aeaacba469a_700x259.jpg",
"title": "FAST & FURIOUS : HOBBS & SHAW",
"releaseDate": "2019-08-07",
"endpoint": "ST00000392",
"duration": "120 mins",
"rating": "Classification TOUT",
"director": "",
"actors": "",
"times": [
{
"time": "7:00pm",
"bookingLink": "https://ticketing.eu.veezi.com/purchase/8388?siteToken=b4ehk19v6cqkjfwdsyctqra72m",
"attributes": [
{
"_id": "5d468c20f67cc430833a5a2b",
"shortName": "VF",
"description": "Version Fran\u00e7aise"
},
{
"_id": "5d468c20f67cc430833a5a2a",
"shortName": "3D",
"description": "3D"
}
]
},
{
"time": "9:50pm",
"bookingLink": "https://ticketing.eu.veezi.com/purchase/8389?siteToken=b4ehk19v6cqkjfwdsyctqra72m",
... and so on.
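From there it's plain dictionary/list traversal. For example, a small sketch (assuming movieData keeps the date -> list-of-movies shape shown above) that prints each film with its showtimes:
for date, movies in movieData.items():
    for movie in movies:
        times = ", ".join(t["time"] for t in movie["times"])
        print(date, "|", movie["title"], "|", times)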
lis = time.findChildren()
This returns a list of child nodes
