Beautiful soup, get into app.start(text) - python

I obtained the following markup (parsed with lxml) using BeautifulSoup:
<script>
app.start({
el: $('body'),
property: {
p_A: 'A',
p_B: 'B'
});
</script>
From it I would like to get the dictionary {p_A: 'A', p_B: 'B'}, but I do not know how to get at the contents of app.start.
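One possible approach, sketched under assumptions: BeautifulSoup can hand back the raw JavaScript as a string (for example via soup.find("script").string), and the property object can then be pulled out with regular expressions. The script_text literal below stands in for that string, and the patterns assume simple key: 'value' pairs with single-quoted values.

```python
import re

# Stand-in for the text of the <script> tag, e.g. soup.find("script").string
script_text = """
app.start({
el: $('body'),
property: {
p_A: 'A',
p_B: 'B'
});
"""

# Isolate the body of `property: { ...`, stopping at the first closing brace
# or parenthesis (the snippet in the question never closes the inner brace).
block = re.search(r"property\s*:\s*\{([^})]*)", script_text)

# Collect every  name: 'value'  pair inside that block.
properties = dict(re.findall(r"(\w+)\s*:\s*'([^']*)'", block.group(1))) if block else {}

print(properties)  # {'p_A': 'A', 'p_B': 'B'}
```

This breaks down on nested objects or values that contain quotes; a JavaScript-aware parser would be the sturdier choice for anything more complex.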

Related

Specify a time range using requests-html when scraping a dynamic page

I've recently been scraping soccer results for a friend, and I'm stuck. Below is the link:
https://www.mlssoccer.com/mlsnext/schedule/2021-2022/u16_mls-next-schedule
I'm trying to switch from Selenium to requests-html. The only reason I used Selenium in the first place is that it is the only way I know to click the calendar.
The default date is the current date, but I need the full match history. Is it possible to change the default date using requests-html, and if so, how?
Thanks in advance for your time and effort; any useful suggestion is appreciated.
---------------------------- EDIT------------------------------
After searching for a while, I found the following possible way to do this using requests-html:
from requests_html import HTMLSession
url = "https://www.modular11.com/public_schedule/league/get_matches"
session = HTMLSession()
response = session.post(url, data={"start_date": "2021-10-30 00:00:00"})
print(response.url)
# print(response.text)
response.html.render(timeout=1200)
print(response.html.text)
Please select the gender, league & age of the matches you are looking for.
$(function () { // Fix event duplicate $('.main_row').unbind('click'); $('.main_row').on('click', function () { $(this) .find('button.icon') .children('span') .toggleClass('glyphicon glyphicon-menu-down') .toggleClass('glyphicon glyphicon-menu-up'); if ($(this).find('button.icon span').hasClass('glyphicon-menu-down')) { $(this) .closest('.container-row') .find('.table-row-heading') .addClass('hidden-mobile'); $(this) .closest('.container-row') .find('.table-content-row') .addClass('hidden-xs hidden-sm'); $(this).css({ 'background-color': 'inherit', 'color': '#0B0B33' }); $(this) .closest('.container-row') .find('.mobile-scrolling') .removeClass('hide-scrolling'); } else if ($(this).find('button.icon span').hasClass('glyphicon-menu-up')) { $(this) .closest('.container-row') .find('.table-row-heading') .removeClass('hidden-mobile'); $(this) .closest('.container-row') .find('.table-content-row') .removeClass('hidden-xs hidden-sm'); $(this).css({ 'background-color': '#2A3851', 'color': '#EEEEEE !important' }); $(this) .closest('.container-row') .find('.mobile-scrolling') .addClass('hide-scrolling'); } }); });
But it seems the data was not sent to the server correctly, so the request returns the output above instead of the schedule shown at https://www.modular11.com/schedule?year=14.
The calendar makes an XHR request (which you can monitor in the network tab of your browser's developer tools) with a query dict that is easy to customize. The request returns HTML, which you'll have to parse with Beautiful Soup:
import requests
query_dict={'open_page': ['0'], 'academy': ['0'], 'league': ['12'], 'gender': ['1'], 'age': ['["14"]'], 'brackets': ['null'], 'groups': ['null'], 'group': ['null'], 'match_number': ['0'], 'status': ['scheduled'], 'match_type': ['2'], 'schedule': ['0'], 'start_date': ['2021-11-03 00:00:00'], 'end_date': ['2021-11-10 23:59:00']}
r = requests.post('https://www.modular11.com/public_schedule/league/get_matches', data = query_dict)
print(r.text)
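To cover a whole season rather than one week, the same request could be repeated over consecutive date windows. The sketch below only builds the payloads and sends nothing; build_query and weekly_windows are hypothetical helper names, the fixed fields are copied from the query above, and whether the server accepts arbitrary ranges (or needs a different status value for past matches) is an assumption that would have to be checked in the browser's network tab.

```python
from datetime import date, timedelta

# Fixed fields copied from the query above; only the dates vary.
BASE_QUERY = {
    'open_page': '0', 'academy': '0', 'league': '12', 'gender': '1',
    'age': '["14"]', 'brackets': 'null', 'groups': 'null', 'group': 'null',
    'match_number': '0', 'status': 'scheduled', 'match_type': '2',
    'schedule': '0',
}

def build_query(start, end):
    """Return a payload covering start 00:00:00 through end 23:59:00."""
    query = dict(BASE_QUERY)
    query['start_date'] = f"{start:%Y-%m-%d} 00:00:00"
    query['end_date'] = f"{end:%Y-%m-%d} 23:59:00"
    return query

def weekly_windows(first, last):
    """Yield (start, end) date pairs that cover first..last in 7-day steps."""
    start = first
    while start <= last:
        end = min(start + timedelta(days=6), last)
        yield start, end
        start = end + timedelta(days=1)

payloads = [build_query(s, e)
            for s, e in weekly_windows(date(2021, 9, 1), date(2021, 9, 20))]
for p in payloads:
    print(p['start_date'], '->', p['end_date'])
```

Each payload can then be passed as data= to requests.post, exactly as in the snippet above.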

Why is my function stripping the dir attribute despite it being in my list of allowed attributes?

I am using the bleach package to strip away invalid html. I am puzzled why the dir attribute is being stripped from my string. Is dir not an attribute, or could it just be that the package does not support dir?
I have included the entire script, so you can run it for your convenience.
import bleach
string = """<p dir="rtl">asdasdasd <span>asdasdasd</span> asdsadasdsad .<br data-mce-bogus="1"></p>"""
def strip_invalid_html(html):
    """ strips invalid tags/attributes """
    allowed_tags = [
        'p', 'a', 'blockquote',
        'h1', 'h2', 'h3', 'h4', 'h5',
        'strong', 'em',
        'br',
        'span',
    ]
    allowed_attributes = {
        'a': ['href', 'title'],
        'dir': ['rtl', 'ltr']
    }
    cleaned_html = bleach.clean(
        html,
        attributes=allowed_attributes,
        strip=True,
        tags=allowed_tags
    )
    print(cleaned_html)
strip_invalid_html(string)
If you pass a dict for attributes, the dict should map tag names to allowed attribute names, not map attribute names to allowed attribute values.
If you want 'dir' to be an allowed attribute for p tags, you need a 'p': ['dir'] entry, not a 'dir': ['rtl', 'ltr'] entry:
allowed_attributes = {
'a': ['href', 'title'],
'p': ['dir'],
}
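If the original intent was also to restrict which values dir may take, bleach lets a callable stand in for the list of attribute names; it is called with the tag, the attribute name, and the value, and returns True to keep the attribute. A minimal sketch, where allow_dir and clean_rtl are illustrative names rather than anything from the question:

```python
def allow_dir(tag, name, value):
    """Attribute filter: keep dir only when its value is rtl or ltr."""
    return name == 'dir' and value in ('rtl', 'ltr')

allowed_attributes = {
    'a': ['href', 'title'],
    'p': allow_dir,   # a callable may replace the list of attribute names
}

def clean_rtl(html):
    """Run bleach.clean with the filter above (requires bleach installed)."""
    import bleach  # imported here so the filter can be tried without bleach
    return bleach.clean(html, tags={'p', 'a', 'span', 'br'},
                        attributes=allowed_attributes, strip=True)

print(allow_dir('p', 'dir', 'rtl'))    # True
print(allow_dir('p', 'dir', 'auto'))   # False
```

With this mapping, a p tag keeps dir="rtl" or dir="ltr", and any other attribute or value on p is dropped.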

How to find specific script tag from a webpage using Beautifulsoup

I'm new to Python and BeautifulSoup. I'm trying to find JSON data inside a script tag. My problem is that the webpage contains many script tags.
I need to get this script tag:
<script type="text/javascript">
P.when('A').register("ImageBlockATF", function(A){
var data = {
'colorImages': { 'initial': [{"hiRes":"https://images-na.ssl-images-amazon.com/images/I/61mw5BDEYoL._AC_SL1003_.jpg",
"thumb":"https://images-na.ssl-images-amazon.com/images/I/41lv4ReBL4L._AC_US40_.jpg",
"large":"https://images-na.ssl-images-amazon.com/images/I/41lv4ReBL4L._AC_.jpg",
"main":{"https://images-na.ssl-images-amazon.com/images/I/61mw5BDEYoL._AC_SY355_.jpg":[355,355],
"https://images-na.ssl-images-amazon.com/images/I/61mw5BDEYoL._AC_SY450_.jpg":[450,450],
"https://images-na.ssl-images-amazon.com/images/I/61mw5BDEYoL._AC_SX425_.jpg":[425,425],
"https://images-na.ssl-images-amazon.com/images/I/61mw5BDEYoL._AC_SX466_.jpg":[466,466],
"https://images-na.ssl-images-amazon.com/images/I/61mw5BDEYoL._AC_SX522_.jpg":[522,522],
"https://images-na.ssl-images-amazon.com/images/I/61mw5BDEYoL._AC_SX569_.jpg":[569,569],
"https://images-na.ssl-images-amazon.com/images/I/61mw5BDEYoL._AC_SX679_.jpg":[679,679]},
"variant":"MAIN","lowRes":null},{"hiRes":"https://images-na.ssl-images-amazon.com/images/I/61kOw5lC%2B%2BL._AC_SL1005_.jpg","thumb":"https://images-na.ssl-images-amazon.com/images/I/41shdN1aAoL._AC_US40_.jpg","large":"https://images-na.ssl-images-amazon.com/images/I/41shdN1aAoL._AC_.jpg","main":{"https://images-na.ssl-images-amazon.com/images/I/61kOw5lC%2B%2BL._AC_SY355_.jpg":[355,355],"https://images-na.ssl-images-amazon.com/images/I/61kOw5lC%2B%2BL._AC_SY450_.jpg":[450,450],"https://images-na.ssl-images-amazon.com/images/I/61kOw5lC%2B%2BL._AC_SX425_.jpg":[425,425],"https://images-na.ssl-images-amazon.com/images/I/61kOw5lC%2B%2BL._AC_SX466_.jpg":[466,466],"https://images-na.ssl-images-amazon.com/images/I/61kOw5lC%2B%2BL._AC_SX522_.jpg":[522,522],"https://images-na.ssl-images-amazon.com/images/I/61kOw5lC%2B%2BL._AC_SX569_.jpg":[569,569],"https://images-na.ssl-images-amazon.com/images/I/61kOw5lC%2B%2BL._AC_SX679_.jpg":[679,679]},"variant":"PT01","lowRes":null},{"hiRes":"https://images-na.ssl-images-amazon.com/images/I/511019WE7xL._AC_SL1005_.jpg","thumb":"https://images-na.ssl-images-amazon.com/images/I/41pt8OOHsaL._AC_US40_.jpg","large":"https://images-na.ssl-images-amazon.com/images/I/41pt8OOHsaL._AC_.jpg","main":{"https://images-na.ssl-images-amazon.com/images/I/511019WE7xL._AC_SY355_.jpg":[355,355],"https://images-na.ssl-images-amazon.com/images/I/511019WE7xL._AC_SY450_.jpg":[450,450],"https://images-na.ssl-images-amazon.com/images/I/511019WE7xL._AC_SX425_.jpg":[425,425],"https://images-na.ssl-images-amazon.com/images/I/511019WE7xL._AC_SX466_.jpg":[466,466],"https://images-na.ssl-images-amazon.com/images/I/511019WE7xL._AC_SX522_.jpg":[522,522],"https://images-na.ssl-images-amazon.com/images/I/511019WE7xL._AC_SX569_.jpg":[569,569],"https://images-na.ssl-images-amazon.com/images/I/511019WE7xL._AC_SX679_.jpg":[679,679]},"variant":"PT02","lowRes":null}]},
'colorToAsin': {'initial': {}},
'holderRatio': 1.0,
'holderMaxHeight': 700,
'heroImage': {'initial': []},
'heroVideo': {'initial': []},
'spin360ColorData': {'initial': {}},
'spin360ColorEnabled': {'initial': 0},
'spin360ConfigEnabled': false,
'spin360LazyLoadEnabled': false,
'showroomEnabled': false,
'showroomViewModel': {'initial': {}},
'playVideoInImmersiveView':true,
'useTabbedImmersiveView':true,
'totalVideoCount':'0',
'videoIngressATFSlateThumbURL':'',
'mediaTypeCount':'0',
'atfEnhancedHoverOverlay' : true,
'winningAsin': 'B08373YYCM',
'weblabs' : {},
'aibExp3Layout' : 1,
'aibRuleName' : 'frank-powered',
'acEnabled' : true,
'dp60VideoPosition': 0,
'dp60VariantList': '',
'dp60VideoThumb': '',
'dp60MainImage': 'https://images-na.ssl-images-amazon.com/images/I/61mw5BDEYoL._AC_SY355_.jpg',
'airyConfig' :A.$.parseJSON('{"jsUrl":"https://images-na.ssl-images-amazon.com/images/G/01/vap/video/airy2/prod/2.0.1460.0/js/airy.skin._CB485981857_.js","cssUrl":"https://images-na.ssl-images-amazon.com/images/G/01/vap/video/airy2/prod/2.0.1460.0/css/beacon._CB485971591_.css","swfUrl":"https://images-na.ssl-images-amazon.com/images/G/01/vap/video/airy2/prod/2.0.1460.0/flash/AiryBasicRenderer._CB485925577_.swf","foresterMetadataParams":{"marketplaceId":"A2VIGQ35RCS4UG","method":"Kitchen.ImageBlock","requestId":"4MGH16D6R7WCR018779W","session":"259-8488476-1037262","client":"Dpx"}}')
};
A.trigger('P.AboveTheFold'); // trigger ATF event.
return data;
});
</script>
How can I get this script tag, which starts with "P.when('A').register("ImageBlockATF", function(A){", from the webpage using a regular expression?
You can get all script tags with:
import requests
from bs4 import BeautifulSoup
page = requests.get("url")
soup = BeautifulSoup(page.text, "html.parser")
results = soup.find_all("script")
and then filter them with:
your_script_tag = [x for x in results if "P.when('A').register" in str(x)]
print(your_script_tag)
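Once the right tag is found, the JSON part can be cut out of its text with a regular expression and decoded with the json module. The sketch below uses a shortened stand-in for the script text with placeholder example.com URLs, and it assumes (as in the question's sample) that the 'initial' array is valid JSON and is followed by the 'colorToAsin' key.

```python
import json
import re

# Shortened stand-in for the matching tag's text (e.g. your_script_tag[0].string)
script_text = """
P.when('A').register("ImageBlockATF", function(A){
var data = {
'colorImages': { 'initial': [{"hiRes":"https://example.com/a.jpg","variant":"MAIN","lowRes":null},
{"hiRes":"https://example.com/b.jpg","variant":"PT01","lowRes":null}]},
'colorToAsin': {'initial': {}},
'holderRatio': 1.0
};
return data;
});
"""

# Capture the array between 'initial': and the brace closing 'colorImages'.
match = re.search(
    r"'colorImages':\s*\{\s*'initial':\s*(\[.*?\])\s*\},\s*'colorToAsin'",
    script_text, re.DOTALL)
images = json.loads(match.group(1)) if match else []

for image in images:
    print(image["variant"], image["hiRes"])
```

The rest of the data object mixes single quotes and JavaScript calls, so it is not valid JSON as a whole; only the captured array is decoded here.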

Parse multiple similar field values from XML file with Python Regular Expression

I am trying to parse an xml file with regular expression.
Whichever script tag has "catch" alias, I need to collect "type" and "value".
<script type="abc">
<line x="word" size="1" alias="catch" value="4" desc="description"/>
</script>
<script type="xyz">
<line x="state" size="5" alias="catch" value="8" desc="description"/>
</script>
I tried this regular expression with multiline and dotall:
>>> re.findall(r'script\s+type=\"(\w+)\".*alias=\"catch\"\s+value=\"(\d+)\"', a, re.MULTILINE|re.DOTALL)
Output which I am getting is:
[('abc', '8')]
Expected output is:
[('abc', '4'), ('xyz', '8')]
Can someone help me in figuring out what I am missing here?
I recommend using BeautifulSoup. You can parse through the tags and, with a little bit of data re-structuring, easily check for the right alias values and store the related attributes of interest. Like this:
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, "lxml")
to_keep = []
for script in soup.find_all("script"):
    t = script["type"]
    attrs = {
        k: v for k, v in [attr.split("=")
                          for attr in script.contents[0].split()
                          if "=" in attr]
    }
    if attrs["alias"] == '"catch"':
        to_keep.append({"type": t, "value": attrs["value"]})
to_keep
# [{'type': 'abc', 'value': '"4"'}, {'type': 'xyz', 'value': '"8"'}]
Data:
data = """<script type="abc">
<line x="word" size="1" alias="catch" value="4" desc="description"/>
</script>
<script type="xyz">
<line x="state" size="5" alias="catch" value="8" desc="description"/>
</script>"""
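Found the answer: the greedy .* ran on to the last alias="catch" in the file, so it has to be made non-greedy (.*?). Thanks, downvoter; I don't think there was any need to downvote this question.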
>>> re.findall(r'script\s+type=\"(\w+)\".*?alias=\"catch\"\s+value=\"(\d+)\".*?\<\/script\>', a, re.MULTILINE|re.DOTALL)
[('abc', '4'), ('xyz', '8')]
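If the file is well-formed XML, as the sample is, the standard library's XML parser sidesteps the greedy versus non-greedy question altogether. A minimal sketch, assuming the fragments can be wrapped in a single made-up root element:

```python
import xml.etree.ElementTree as ET

a = """<script type="abc">
<line x="word" size="1" alias="catch" value="4" desc="description"/>
</script>
<script type="xyz">
<line x="state" size="5" alias="catch" value="8" desc="description"/>
</script>"""

# The sample has two top-level elements, so wrap it in a single root first.
root = ET.fromstring(f"<root>{a}</root>")

pairs = [
    (script.get("type"), line.get("value"))
    for script in root.iter("script")
    for line in script.findall("line")
    if line.get("alias") == "catch"
]
print(pairs)  # [('abc', '4'), ('xyz', '8')]
```

Unlike the regex, this does not depend on the order of the attributes inside the line element.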

How to extract text from file (with in script tag) using Python or beautifulsoup

Could you please help me with this? I want to extract the lat and lng values from the code below, which sits in a SCRIPT tag (not in the body), using Beautiful Soup or plain Python. I am new to Python, and blogs recommend Beautiful Soup for extraction.
I want these two values: lat: 21.25335, lng: 81.649445
I am using a regular expression for this. My regular expression is "^l([a-t])(:) ([0-9])([^,]+)".
See this link for the regular expression and the HTML file:
http://regexr.com/3glde
This regular expression matches both values, but I want only the numeric part of lat and lng to be stored in variables.
Below is the Python code I am using:
import re
pattern = re.compile(r"^[l]([a-t])([a-t])(\:) ([0-9])([^,]+)")
with open(r'C:\hile_text.html') as f:
    for i, line in enumerate(f):
        for match in re.finditer(pattern, line):
            print('Found on line %s: %s' % (i+1, match.groups()))
Output:
Found on line 3218: ('a', 't', ':', '2', '1.244791')
Found on line 3219: ('n', 'g', ':', '8', '1.643486')
I want only the numeric values as output, like 21.25335,81.649445, and I want to store them in variables. Alternative code is welcome too.
Please help me out soon. Thanks in anticipation.
This is the script tag in the HTML file:
<script type="text/javascript">
window.mapDivId = 'map0Div';
window.map0Div = {
lat: 21.25335,
lng: 81.649445,
zoom: null,
locId: 5897747,
geoId: 297595,
isAttraction: false,
isEatery: true,
isLodging: false,
isNeighborhood: false,
title: "Aman Age Roll & Chicken ",
homeIcon: true,
url: "/Restaurant_Review-g297595-d5897747-Reviews-Aman_Age_Roll_Chicken-Raipur_Raipur_District_Chhattisgarh.html",
minPins: [
['hotel', 20],
['restaurant', 20],
['attraction', 20],
['vacation_rental', 0] ],
units: 'km',
geoMap: false,
tabletFullSite: false,
reuseHoverDivs: false,
noSponsors: true };
ta.store('infobox_js', 'https://static.tacdn.com/js3/infobox-c-v21051733989b.js');
ta.store("ta.maps.apiKey", "");
(function() {
var onload = function() {
if (window.location.hash == "#MAPVIEW") {
ta.run("ta.mapsv2.Factory.handleHashLocation", {}, true);
}
}
if (window.addEventListener) {
if (window.history && window.history.pushState) {
window.addEventListener("popstate", function(e) {
ta.run("ta.mapsv2.Factory.handleHashLocation", {}, false);
}, false);
}
window.addEventListener('load', onload, false);
}
else if (window.attachEvent) {
window.attachEvent('onload', onload);
}
})();
ta.store("mapsv2.show_sidebar", true);
ta.store('mapsv2_restaurant_reservation_js', ["https://static.tacdn.com/js3/ta-mapsv2-restaurant-reservation-c-v2430632369b.js"]);
ta.store('mapsv2.typeahead_css', "https://static.tacdn.com/css2/maps_typeahead-v21940478230b.css");
// Feature gate VR price pins on SRP map. VRC-14803
ta.store('mapsv2.vr_srp_map_price_enabled', true);
ta.store('mapsv2.geoName', 'Raipur');
ta.store('mapsv2.map_addressnotfound', "Address not found"); ta.store('mapsv2.map_addressnotfound3', "We couldn\'t find that location near {0}. Please try another search."); ta.store('mapsv2.directions', "Directions from {0} to {1}"); ta.store('mapsv2.enter_dates', "Enter dates for best prices"); ta.store('mapsv2.best_prices', "Best prices for your stay"); ta.store('mapsv2.list_accom', "List of accommodations"); ta.store('mapsv2.list_hotels', "List of hotels"); ta.store('mapsv2.list_vrs', "List of holiday rentals"); ta.store('mapsv2.more_accom', "More accommodations"); ta.store('mapsv2.more_hotels', "More hotels"); ta.store('mapsv2.more_vrs', "More Holiday Homes"); ta.store('mapsv2.sold_out_on_1', "SOLD OUT on 1 site"); ta.store('mapsv2.sold_out_on_y', "SOLD OUT on 2 sites"); </script>
Your regular expression is a bit off.
^l says you are trying to match an 'l' that is the very first character on a line.
^\s+(l[an][gt])(:\s+)(\d+\.\d+) would work better. Check out a regex analyzer tool such as http://www.myezapp.com/apps/dev/regexp/show.ws to get a breakdown of what is happening.
Here is a breakdown:
Sequence: match all of the followings in order
  BeginOfLine
  Repeat
    WhiteSpaceCharacter
    one or more times
  CapturingGroup
    GroupNumber:1
    Sequence: match all of the followings in order
      l
      AnyCharIn[ a n]
      AnyCharIn[ g t]
  CapturingGroup
    GroupNumber:2
    Sequence: match all of the followings in order
      :
      Repeat
        WhiteSpaceCharacter
        one or more times
  CapturingGroup
    GroupNumber:3
    Sequence: match all of the followings in order
      Repeat
        Digit
        one or more times
      .
      Repeat
        Digit
        one or more times
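To end up with the two numbers in variables, the idea above can be applied to the whole file with re.MULTILINE. This is a sketch with a slightly tightened variant of the pattern (explicit lat|lng alternatives, optional indentation, optional minus sign), which is my own adjustment rather than the pattern given above; html_text is a shortened stand-in for the file's contents.

```python
import re

# Shortened stand-in for the contents of the HTML file
html_text = """<script type="text/javascript">
window.mapDivId = 'map0Div';
window.map0Div = {
  lat: 21.25335,
  lng: 81.649445,
  zoom: null,
  locId: 5897747,
};
</script>"""

# Named alternatives instead of l[an][gt]; only the number is kept as the value.
pattern = re.compile(r"^\s*(lat|lng):\s*(-?\d+(?:\.\d+)?)", re.MULTILINE)

coords = {name: float(value) for name, value in pattern.findall(html_text)}
lat, lng = coords["lat"], coords["lng"]
print(lat, lng)  # 21.25335 81.649445
```

For a real file, html_text would come from reading the file in one go, for example with open(path).read(), instead of looping line by line.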
