Python: How to get full match with RegEx - python

I'm trying to extract a link from some JavaScript. The JavaScript part isn't relevant anymore because I transformed it into a string (text).
Here is the script part:
<script>
setTimeout("location.href = 'https://airdownload.adobe.com/air/win/download/30.0/AdobeAIRInstaller.exe';", 2000);
$(function() {
$("#whats_new_panels").bxSlider({
controls: false,
auto: true,
pause: 15000
});
});
setTimeout(function(){
$("#download_messaging").hide();
$("#next_button").show();
}, 10000);
</script>
Here is what I do:
import re
def get_link_from_text(text):
    text = text.replace('\n', '')
    text = text.replace('\t', '')
    text = re.sub(' +', ' ', text)
    search_for = re.compile("href[ ]*=[ ]*'[^;]*")
    debug = re.search(search_for, text)
    return debug
What I want is the href link, and I kind of get it, but for some reason only like this:
<_sre.SRE_Match object; span=(30, 112), match="href = 'https://airdownload.adobe.com/air/win/dow>
and not like I want it to be
<_sre.SRE_Match object; span=(30, 112), match="href = 'https://airdownload.adobe.com/air/win/download/30.0/AdobeAIRInstaller.exe'">
So my question is how to get the full link and not only a part of it.
Might the problem be that re.search isn't returning longer strings? I tried altering the regex, and I even tried matching the link character by character, but it still returns only the truncated part shown above.

I've modified it slightly, but for me it returns the complete string you desire now.
import re
text = """
<script>
setTimeout("location.href = 'https://airdownload.adobe.com/air/win/download/30.0/AdobeAIRInstaller.exe';", 2000);
$(function() {
$("#whats_new_panels").bxSlider({
controls: false,
auto: true,
pause: 15000
});
});
setTimeout(function(){
$("#download_messaging").hide();
$("#next_button").show();
}, 10000);
</script>
"""
def get_link_from_text(text):
    text = text.replace('\n', '')
    text = text.replace('\t', '')
    text = re.sub(' +', ' ', text)
    search_for = re.compile("href[ ]*=[ ]*'[^;]*")
    debug = search_for.findall(text)
    print(debug)

get_link_from_text(text)
Output:
["href = 'https://airdownload.adobe.com/air/win/download/30.0/AdobeAIRInstaller.exe'"]
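A note on why the original attempt looked truncated: as far as I know, the repr of a match object only displays the beginning of the matched text, so the match itself was probably complete all along. A minimal sketch, assuming the same script text as above; the capture group around the URL is an addition for illustration, not part of the original pattern:

```python
import re

text = ("setTimeout(\"location.href = "
        "'https://airdownload.adobe.com/air/win/download/30.0/AdobeAIRInstaller.exe';\", 2000);")

# Same idea as the pattern above, plus a capture group (an addition) around the URL.
match = re.search(r"href\s*=\s*'([^']*)'", text)

full_match = match.group(0)  # the complete matched text, not the shortened repr
link = match.group(1)        # only the URL between the single quotes

print(full_match)
print(link)
```

Calling group(0) returns the whole match as a string, and group(1) returns just the link, so no extra string slicing is needed.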

Related

python/ruby regex for domain in text

I have some text, and I need to find all the domains in it: domains with and without subdomains, and with and without https.
opt-i.mydomain.com 'www-oud.mydomain.com'
https\\u003a\\u002f\\u002fapi.noddos.com\\u002fabout\\u002fen-us\\u002fsignin\\u002f"
<script type="text/javascript">
(function(){
var baseUrl = 'https\x3A\x2F\x2Fhubspot.mydomain.com';
var baseUrl = 'https\x3A\x2F\x2Fhub-spot.mydomain.com';
</script>
#=========================================================
define("modules/constants/env", [], function() {
return {"BATCH_THUMB_ENDPOINTS": [], "LIVE_TRANSCODE_SERVER": "showbox-tr.dropbox.com", "STATIC_CONTENT_HOST": "cfl.dropboxstatic.com", "NOTES_WEBSERVER": "paper.dropbox.com", "REDIRECT_SAFE_ORIGINS": ["www.dropbox.com", "dropbox.com", "api.dropboxapi.com", "api.dropbox.com", "linux.dropbox.com", "photos.dropbox.com", "carousel.dropbox.com", "client-web.dropbox.com", "services.pp.dropbox.com", "www.dropbox.com", "docsend.com", "paper.dropbox.com", "notes.dropbox.com", "test.composer.dropbox.com", "showcase.dropbox.com", "collections.dropbox.com", "embedded.hellosign.com", "help.dropbox.com", "help-stg.dropbox.com", "experience.dropbox.com", "learn.dropbox.com", "learn-stage.dropbox.com", "app.hellosign.com", "replay.dropbox.com"], "PROF_SHARING_WEBSERVER": "showcase.dropbox.com", "FUNCAPTCHA_SERVER": "dropboxcaptcha.com", "__esModule": true};
});
#========================================================
https://mydomain.co/path/2
https://api.mydomain.co/path/2
https://api-v1.mydomain.co/path/2
https://superdomain.com:443
https://api.superdomain.com:443
https\\u003a\\u002f\\u002fapi.noddos.com\\u002fabout\\u002fen-us\\u002fsignin\\u002f"
https\\u003a\\u002f\\u002fnoddos.com\\u002fabout\\u002fen-us\\u002fsignin\\u002f"
root.I13N_config.location = "https:\u002F\u002Flocation.com\u002Faccount\u002Fchallenge\u002Frecaptcha
root.I13N_config.location = "https:\u002F\u002Fapi.location.com\u002Faccount\u002Fchallenge\u002Frecaptcha
&scope=openid%20profile%20https%3A%2F%2Fapi.domain2.com%2Fv2%2FOfficeHome.All&response_mode=form_post&nonce
&scope=openid%20profile%20https%3A%2F%2Fdomain2.com%2Fv2%2FOfficeHome.All&response_mode=form_post&nonce
https%3a%2f%2fwww.anotherdomain.com%2fv2%2
I tried this regex, but it does not capture everything I need.
re.compile(
r'''((?<=x2[fF]|02[fF]|%2[fF])|(?<=//))(\w\.|\w[A-Za-z0-9-]{0,61}\w\.){1,3}[A-Za-z]{2,6}|(?<=["'])(\w\.|\w[A-Za-z0-9-]{0,61}\w\.){1,3}[A-Za-z]{2,6}(?=["'])''',
re.VERBOSE)
Result captured by the regex:
{'showbox-tr.dropbox.com', 'api.dropbox.com', 'api-v1.mydomain.co', 'www-oud.mydomain.com', 'cfl.dropboxstatic.com', 'hub-spot.mydomain.com', 'paper.dropbox.com', 'superdomain.com', 'api.superdomain.com', 'linux.dropbox.com', 'embedded.hellosign.com', 'api.location.com', 'api.dropboxapi.com', 'www.dropbox.com', 'location.com', 'api.domain2.com', 'dropbox.com', 'api.noddos.com', 'dropboxcaptcha.com', 'learn-stage.dropbox.com', 'test.composer.dropbox.com', 'help-stg.dropbox.com', 'replay.dropbox.com', 'domain2.com', 'hubspot.mydomain.com', 'learn.dropbox.com', 'help.dropbox.com', 'collections.dropbox.com', 'app.hellosign.com', 'api.mydomain.co', 'noddos.com', 'docsend.com', 'mydomain.co', 'notes.dropbox.com', 'photos.dropbox.com', 'client-web.dropbox.com', 'services.pp.dropbox.com', 'OfficeHome.All', 'showcase.dropbox.com', 'carousel.dropbox.com', 'experience.dropbox.com'}
Expected results (in general, the regex fails to capture "opt-i.mydomain.com"):
{'opt-i.mydomain.com', 'hubspot.mydomain.com', 'embedded.hellosign.com', 'carousel.dropbox.com', 'api.dropboxapi.com', 'experience.dropbox.com', 'linux.dropbox.com', 'api.superdomain.com', 'noddos.com', 'showcase.dropbox.com', 'app.hellosign.com', 'www-oud.mydomain.com', 'showbox-tr.dropbox.com', 'help-stg.dropbox.com', 'api.domain2.com', 'notes.dropbox.com', 'paper.dropbox.com', 'services.pp.dropbox.com', 'collections.dropbox.com', 'learn.dropbox.com', 'location.com', 'api.location.com', 'docsend.com', 'api.dropbox.com', 'replay.dropbox.com', 'mydomain.co', 'hub-spot.mydomain.com', 'www.dropbox.com', 'learn-stage.dropbox.com', 'domain2.com', 'help.dropbox.com', 'api.mydomain.co', 'api-v1.mydomain.co', 'superdomain.com', 'dropboxcaptcha.com', 'api.noddos.com', 'dropbox.com', 'test.composer.dropbox.com', 'cfl.dropboxstatic.com', 'client-web.dropbox.com', 'opt-i.mydomain.com', 'photos.dropbox.com'}
I also tested the regex below. It matches the domains better than the previous one, but it has a problem with domains preceded by Unicode escapes. For example:
"https\\u003a\\u002f\\u002fapi.noddos.com" will capture "u002fapi.noddos.com" but we need "api.noddos.com"
re.compile(
r'''
([a-z0-9][a-z0-9\-]{0,61}[a-z0-9]\.)+[a-z0-9][a-z0-9\-]*[a-z0-9]
''', re.VERBOSE)
import re
regex = r"(?:[a-z0-9][a-z0-9\-]{0,61}[a-z0-9]\.)+[a-z0-9][a-z0-9\-]*[a-z0-9]"
data = """opt-i.mydomain.com 'www-oud.mydomain.com'
https\\u003a\\u002f\\u002fapi.noddos.com\\u002fabout\\u002fen-us\\u002fsignin\\u002f"
<script type="text/javascript">
(function(){
var baseUrl = 'https\x3A\x2F\x2Fhubspot.mydomain.com';
var baseUrl = 'https\x3A\x2F\x2Fhub-spot.mydomain.com';
</script>
#=========================================================
define("modules/constants/env", [], function() {
return {"BATCH_THUMB_ENDPOINTS": [], "LIVE_TRANSCODE_SERVER": "showbox-tr.dropbox.com", "STATIC_CONTENT_HOST": "cfl.dropboxstatic.com", "NOTES_WEBSERVER": "paper.dropbox.com", "REDIRECT_SAFE_ORIGINS": ["www.dropbox.com", "dropbox.com", "api.dropboxapi.com", "api.dropbox.com", "linux.dropbox.com", "photos.dropbox.com", "carousel.dropbox.com", "client-web.dropbox.com", "services.pp.dropbox.com", "www.dropbox.com", "docsend.com", "paper.dropbox.com", "notes.dropbox.com", "test.composer.dropbox.com", "showcase.dropbox.com", "collections.dropbox.com", "embedded.hellosign.com", "help.dropbox.com", "help-stg.dropbox.com", "experience.dropbox.com", "learn.dropbox.com", "learn-stage.dropbox.com", "app.hellosign.com", "replay.dropbox.com"], "PROF_SHARING_WEBSERVER": "showcase.dropbox.com", "FUNCAPTCHA_SERVER": "dropboxcaptcha.com", "__esModule": true};
});
#========================================================
https://mydomain.co/path/2
https://api.mydomain.co/path/2
https://api-v1.mydomain.co/path/2
https://superdomain.com:443
https://api.superdomain.com:443
https\\u003a\\u002f\\u002fapi.noddos.com\\u002fabout\\u002fen-us\\u002fsignin\\u002f"
https\\u003a\\u002f\\u002fnoddos.com\\u002fabout\\u002fen-us\\u002fsignin\\u002f"
root.I13N_config.location = "https:\u002F\u002Flocation.com\u002Faccount\u002Fchallenge\u002Frecaptcha
root.I13N_config.location = "https:\u002F\u002Fapi.location.com\u002Faccount\u002Fchallenge\u002Frecaptcha
&scope=openid%20profile%20https%3A%2F%2Fapi.domain2.com%2Fv2%2FOfficeHome.All&response_mode=form_post&nonce
&scope=openid%20profile%20https%3A%2F%2Fdomain2.com%2Fv2%2FOfficeHome.All&response_mode=form_post&nonce"""
data = re.sub(r"u[0-9a-z]{4}", "", data)
matches = re.findall(regex, data)
#output matches:
['opt-i.mydomain.com', 'www-oud.mydomain.com', 'api.noddos.com', 'ht.mydomain.com', 'hub-spot.mydomain.com', 'showbox-tr.dropbox.com', 'cfl.dropboxstatic.com', 'paper.dropbox.com', 'www.dropbox.com', 'dropbox.com', 'api.dropboxapi.com', 'api.dropbox.com', 'linux.dropbox.com', 'photos.dropbox.com', 'carousel.dropbox.com', 'client-web.dropbox.com', 'services.pp.dropbox.com', 'www.dropbox.com', 'docsend.com', 'paper.dropbox.com', 'notes.dropbox.com', 'test.composer.dropbox.com', 'showcase.dropbox.com', 'collections.dropbox.com', 'embedded.hellosign.com', 'help.dropbox.com', 'help-stg.dropbox.com', 'experience.dropbox.com', 'learn.dropbox.com', 'learn-stage.dropbox.com', 'app.hellosign.com', 'replay.dropbox.com', 'showcase.dropbox.com', 'dropboxcaptcha.com', 'mydomain.co', 'api.mydomain.co', 'api-v1.mydomain.co', 'somain.com', 'api.somain.com', 'api.noddos.com', 'noddos.com', 'config.location', 'location.com', 'config.location', 'api.location.com', 'api.domain2.com', 'domain2.com']
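The 'ht.mydomain.com' and 'somain.com' entries in the output above appear to come from the re.sub call: the pattern u[0-9a-z]{4} also deletes ordinary text such as "ubspo" in "hubspot" and "uperd" in "superdomain". A possible alternative, sketched here under the assumption that the escapes are always \uXXXX, \xXX, or %XX, is to decode them into real characters before searching. The names ESCAPE, DOMAIN, decode_escapes, and find_domains are hypothetical helpers, not part of the original answer:

```python
import re

# Matches \u003a / \\u003a, \x3A, and %3A style escapes (hex value in one of three groups).
ESCAPE = re.compile(r"\\+u([0-9a-fA-F]{4})|\\+x([0-9a-fA-F]{2})|%([0-9a-fA-F]{2})")
# Same domain pattern as in the answer above.
DOMAIN = re.compile(r"(?:[a-z0-9][a-z0-9\-]{0,61}[a-z0-9]\.)+[a-z0-9][a-z0-9\-]*[a-z0-9]")

def decode_escapes(text):
    """Replace each escape sequence with the character it encodes."""
    return ESCAPE.sub(lambda m: chr(int(m.group(1) or m.group(2) or m.group(3), 16)), text)

def find_domains(text):
    """Decode escapes first, then collect the unique domains."""
    return set(DOMAIN.findall(decode_escapes(text)))

# Raw string, so the backslashes stay in the text as they would in a scraped page.
sample = r"""
https\\u003a\\u002f\\u002fapi.noddos.com\\u002fabout
var baseUrl = 'https\x3A\x2F\x2Fhubspot.mydomain.com';
https://api.superdomain.com:443
&scope=openid%20profile%20https%3A%2F%2Fapi.domain2.com%2Fv2%2FOfficeHome.All&nonce
"""
print(sorted(find_domains(sample)))
```

Because the escapes become ":" and "/", they act as natural boundaries and the domain pattern no longer picks up fragments like "u002fapi". Identifiers such as config.location would still match, so some post-filtering may be needed.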

How to extract link from html script in python?

How can I extract the URL from a script of HTML with Python?
The HTML provided:
function download() {
window.open('https:somelink.com');
}
const text = `<div style=\'position: relative;padding-bottom: 56.25%;height: 0;overflow: hidden;\'>
<iframe allowfullscreen=\'allowfullscreen\' src=\'URL\' style=\'border: 0;height: 100%;left: 0;position: absolute;top: 0;width: 100%;\' ></iframe>
</div>`;
function embed() {
var element = document.getElementById('embed-text');
console.log(element);
element.innerHTML = text
}
Desired output will be:
https://somelink.com
Any help will do. Thanks!
You should use regex like this:
var urlRegex = /(https?:\/\/[^\s]+)/; // the regex
// your string
var input = "<div style=\'position: relative;padding-bottom: 56.25%;height: 0;overflow: hidden;\'><iframe allowfullscreen=\'allowfullscreen\' src=\" https://my-url.com/test \" style=\'border: 0;height: 100%;left: 0;position: absolute;top: 0;width: 100%;\' ></iframe></div>";
console.log(input.match(urlRegex)[1]); // apply the regex and log the result
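Since the question asks for Python and the snippet above is JavaScript, here is a rough Python equivalent. It is only a sketch: the sample assumes the link is written as https://somelink.com (matching the desired output), and the character class additionally excludes quotes so the closing quote of the JavaScript string is not captured:

```python
import re

html = """function download() {
    window.open('https://somelink.com');
}"""

# Stop at whitespace or a quote so the closing quote of the JavaScript string is excluded.
match = re.search(r"https?://[^\s'\"]+", html)
url = match.group(0) if match else None
print(url)
```

re.findall with the same pattern would return every URL in the text instead of only the first one.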

How to extract text from file (with in script tag) using Python or beautifulsoup

Could you please help me with this? I want to extract the lat and lng values from the code below, which sits in a SCRIPT tag (not in the body), using Beautiful Soup or plain Python. I am new to Python, and blogs recommend Beautiful Soup for extraction.
I want these two values: lat: 21.25335, lng: 81.649445
I am using a regular expression for this. My regular expression is "^l([a-t])(:) ([0-9])([^,]+)".
See this link for the regular expression and the HTML file:
http://regexr.com/3glde
This regular expression finds both lines, but I want only the numeric parts of lat and lng stored in variables.
Below is the Python code I am using:
import re
pattern = re.compile(r"^[l]([a-t])([a-t])(\:) ([0-9])([^,]+)")
with open(r'C:\hile_text.html') as f:
    for i, line in enumerate(f):
        for match in re.finditer(pattern, line):
            print('Found on line %s: %s' % (i + 1, match.groups()))
Output:
Found on line 3218: ('a', 't', ':', '2', '1.244791')
Found on line 3219: ('n', 'g', ':', '8', '1.643486')
I want only the numeric values as output, like 21.25335,81.649445, stored in variables. Alternative code is welcome too.
Please help me out. Thanks in anticipation.
This is the script tag in the HTML file:
<script type="text/javascript">
window.mapDivId = 'map0Div';
window.map0Div = {
lat: 21.25335,
lng: 81.649445,
zoom: null,
locId: 5897747,
geoId: 297595,
isAttraction: false,
isEatery: true,
isLodging: false,
isNeighborhood: false,
title: "Aman Age Roll & Chicken ",
homeIcon: true,
url: "/Restaurant_Review-g297595-d5897747-Reviews-Aman_Age_Roll_Chicken-Raipur_Raipur_District_Chhattisgarh.html",
minPins: [
['hotel', 20],
['restaurant', 20],
['attraction', 20],
['vacation_rental', 0] ],
units: 'km',
geoMap: false,
tabletFullSite: false,
reuseHoverDivs: false,
noSponsors: true };
ta.store('infobox_js', 'https://static.tacdn.com/js3/infobox-c-v21051733989b.js');
ta.store("ta.maps.apiKey", "");
(function() {
var onload = function() {
if (window.location.hash == "#MAPVIEW") {
ta.run("ta.mapsv2.Factory.handleHashLocation", {}, true);
}
}
if (window.addEventListener) {
if (window.history && window.history.pushState) {
window.addEventListener("popstate", function(e) {
ta.run("ta.mapsv2.Factory.handleHashLocation", {}, false);
}, false);
}
window.addEventListener('load', onload, false);
}
else if (window.attachEvent) {
window.attachEvent('onload', onload);
}
})();
ta.store("mapsv2.show_sidebar", true);
ta.store('mapsv2_restaurant_reservation_js', ["https://static.tacdn.com/js3/ta-mapsv2-restaurant-reservation-c-v2430632369b.js"]);
ta.store('mapsv2.typeahead_css', "https://static.tacdn.com/css2/maps_typeahead-v21940478230b.css");
// Feature gate VR price pins on SRP map. VRC-14803
ta.store('mapsv2.vr_srp_map_price_enabled', true);
ta.store('mapsv2.geoName', 'Raipur');
ta.store('mapsv2.map_addressnotfound', "Address not found"); ta.store('mapsv2.map_addressnotfound3', "We couldn\'t find that location near {0}. Please try another search."); ta.store('mapsv2.directions', "Directions from {0} to {1}"); ta.store('mapsv2.enter_dates', "Enter dates for best prices"); ta.store('mapsv2.best_prices', "Best prices for your stay"); ta.store('mapsv2.list_accom', "List of accommodations"); ta.store('mapsv2.list_hotels', "List of hotels"); ta.store('mapsv2.list_vrs', "List of holiday rentals"); ta.store('mapsv2.more_accom', "More accommodations"); ta.store('mapsv2.more_hotels', "More hotels"); ta.store('mapsv2.more_vrs', "More Holiday Homes"); ta.store('mapsv2.sold_out_on_1', "SOLD OUT on 1 site"); ta.store('mapsv2.sold_out_on_y', "SOLD OUT on 2 sites"); </script>
Your regular expression is a bit messed up.
^l says you are trying to match an 'l' that is the very first character on a line.
^\s+(l[an][gt])(:\s+)(\d+\.\d+) would work better. Check out a regex analyzer tool such as http://www.myezapp.com/apps/dev/regexp/show.ws to get a breakdown of what is happening.
Here is a breakdown:
Sequence: match all of the followings in order
  BeginOfLine
  Repeat
    WhiteSpaceCharacter
    one or more times
  CapturingGroup
    GroupNumber:1
    Sequence: match all of the followings in order
      l
      AnyCharIn[ a n]
      AnyCharIn[ g t]
  CapturingGroup
    GroupNumber:2
    Sequence: match all of the followings in order
      :
      Repeat
        WhiteSpaceCharacter
        one or more times
  CapturingGroup
    GroupNumber:3
    Sequence: match all of the followings in order
      Repeat
        Digit
        one or more times
      .
      Repeat
        Digit
        one or more times
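To store the numbers in variables, the matched text can be converted with float. The following is a minimal sketch rather than the answerer's code: it uses a slightly different pattern (lat|lng spelled out, optional leading whitespace, optional minus sign) and searches the whole script text at once with re.MULTILINE. The script string is a shortened, assumed sample of the tag from the question:

```python
import re

script = """window.map0Div = {
lat: 21.25335,
lng: 81.649445,
zoom: null,
locId: 5897747,
"""

# re.MULTILINE lets ^ match at the start of every line, not only the first one.
pattern = re.compile(r"^\s*(lat|lng):\s*(-?\d+\.\d+)", re.MULTILINE)

# findall returns (name, value) tuples because the pattern has two groups.
coords = {name: float(value) for name, value in pattern.findall(script)}
lat, lng = coords["lat"], coords["lng"]
print(lat, lng)
```

Capturing only the name and the number keeps the colon and whitespace out of the result, so no further splitting is needed.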

Python regular expression matching a multiline javascript code

I'm having a bit of trouble getting a Python regex to work when matching against text that spans multiple lines. My example is:
function initialize()
{
var myLatlng = new google.maps.LatLng(23.800567,5.942068);
var myOptions =
{
panControl: true,
zoomControl: true,
scaleControl: false,
streetViewControl: true,
zoom: 11,
center: myLatlng,
mapTypeId: google.maps.MapTypeId.HYBRID
}
var map = new google.maps.Map(document.getElementById("map"), myOptions);
var bounds = new google.maps.LatLngBounds();
var locations = [
['<div CLASS="Tekst"><B>tss fsdf<\/B><BR>hopp <BR><\/div>', 54.538665,24.885818, 1, 'text']
,
['<div CLASS="Tekst"><\/div>', 24.465462,24.966919, 1, 'text']
]
What I want to extract is the content of locations. The result should look like:
- '<div CLASS="Tekst"><B>tss fsdf<\/B><BR>hopp <BR><\/div>',
54.538665,24.885818, 1, 'text'
- '<div CLASS="Tekst"><\/div>', 24.465462,24.966919, 1, 'text'
I tried a regex like this:
regex = r"var locations =\[\[(.+?)\]\]"
But it doesn't work.
Hello, you can try this regex:
re.findall(r"(<div.+)\]", toparse)
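The pattern in the question likely fails for two reasons: it expects "=[[" with no whitespace or newlines in between, and "." does not match newlines unless re.DOTALL is set. A two-step sketch, assuming the locations array is the last bracketed structure in the text (on a full page the greedy match may need a tighter end marker):

```python
import re

js = r"""
var locations = [
['<div CLASS="Tekst"><B>tss fsdf<\/B><BR>hopp <BR><\/div>', 54.538665,24.885818, 1, 'text']
,
['<div CLASS="Tekst"><\/div>', 24.465462,24.966919, 1, 'text']
]
"""

# Step 1: grab everything between "var locations = [" and the last "]".
# re.DOTALL makes "." match newlines, so the match can span several lines.
block = re.search(r"var locations\s*=\s*\[(.*)\]", js, re.DOTALL)

# Step 2: pull out the content of each inner [...] row.
rows = re.findall(r"\[(.*?)\]", block.group(1), re.DOTALL)

for row in rows:
    print("-", row)
```

Splitting the work into an outer and an inner match keeps each pattern simple, and it keeps the leading quote of each row, which the single-pattern version drops.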

re.compile regex assistance (python, beautifulsoup)

Using this code from a different thread
import re
import requests
from bs4 import BeautifulSoup
data = """
<script type="text/javascript">
window._propertyData =
{ *** a lot of random code and other data ***
"property": {"street": "21st Street", "apartment": "2101", "available": false}
*** more data ***
}
</script>
"""
soup = BeautifulSoup(data, "xml")
pattern = re.compile(r'\"street\":\s*\"(.*?)\"', re.MULTILINE | re.DOTALL)
script = soup.find("script", text=pattern)
print(pattern.search(script.text).group(1))
This gets me the desired result:
21st Street
However, I tried different variations of the regex to get the whole object and couldn't get this output:
{"street": "21st Street", "apartment": "2101", "available": false}
I have tried the following:
pattern = re.compile(r'\"property\":\s*\"(.*?)\{\"', re.MULTILINE | re.DOTALL)
It's not producing the desired result.
Your help is appreciated!
Thanks.
As commented above, correct your typo and use this:
r"property\W+({.*?})"
property : matches the exact string
\W+ : matches one or more non-word characters
({.*?}) : capture group one
  .* matches any characters inside the braces {}
  ? makes the match as short as possible
You can try this:
\"property\":\s*(\{.*?\})
Capture group 1 contains your desired data.
Sample code:
import re
regex = r"\"property\":\s*(\{.*?\})"
test_str = ("window._propertyData = \n"
" { *** a lot of random code and other data ***\n"
" \"property\": {\"street\": \"21st Street\", \"apartment\": \"2101\", \"available\": false}\n"
" *** more data ***\n"
" }")
matches = re.finditer(regex, test_str, re.MULTILINE | re.DOTALL)
for matchNum, match in enumerate(matches):
    print(match.group(1))
Try this. It may be long, but it works fine:
\"property\"\:\s*(\{((?:\"\w+\"\:\s*\"?[\w\s]+\"?\,?\s?)+?)\})
https://regex101.com/r/7KzzRV/3
import re
import ast
data = """
<script type="text/javascript">
window._propertyData =
{ *** a lot of random code and other data ***
"property": {"street": "21st Street", "apartment": "2101", "available": false}
*** more data ***
}
</script>
"""
property = re.search(r'"property": ({.+?})', data)
str_form = property.group(1)
print('str_form: ' + str_form)
dict_form = ast.literal_eval(str_form.replace('false', 'False'))
print('dict_form: ', dict_form)
out:
str_form: {"street": "21st Street", "apartment": "2101", "available": false}
dict_form: {'available': False, 'street': '21st Street', 'apartment': '2101'}
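Since the captured text is JSON, another option is to let the json module parse it, which understands false directly and avoids the string replacement. A minimal sketch, assuming the value after "property" is valid JSON; it uses json.JSONDecoder.raw_decode so that nested braces are handled by the decoder rather than by the regex:

```python
import json
import re

data = """
window._propertyData =
{ *** a lot of random code and other data ***
"property": {"street": "21st Street", "apartment": "2101", "available": false}
*** more data ***
}
"""

# Locate the key with a regex, then let the JSON decoder read one complete object
# starting right after it. raw_decode returns the object and the index where it ended.
key = re.search(r'"property":\s*', data)
prop, end = json.JSONDecoder().raw_decode(data, key.end())

print(prop)
print(prop["street"])
```

The result is a regular dict with a real boolean for "available", so there is no need to rewrite false as False first.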
