I have text as result i need to find all domains on this text, domains with subdomain and without with https and without;
opt-i.mydomain.com 'www-oud.mydomain.com'
https\\u003a\\u002f\\u002fapi.noddos.com\\u002fabout\\u002fen-us\\u002fsignin\\u002f"
<script type="text/javascript">
(function(){
var baseUrl = 'https\x3A\x2F\x2Fhubspot.mydomain.com';
var baseUrl = 'https\x3A\x2F\x2Fhub-spot.mydomain.com';
</script>
#=========================================================
define("modules/constants/env", [], function() {
return {"BATCH_THUMB_ENDPOINTS": [], "LIVE_TRANSCODE_SERVER": "showbox-tr.dropbox.com", "STATIC_CONTENT_HOST": "cfl.dropboxstatic.com", "NOTES_WEBSERVER": "paper.dropbox.com", "REDIRECT_SAFE_ORIGINS": ["www.dropbox.com", "dropbox.com", "api.dropboxapi.com", "api.dropbox.com", "linux.dropbox.com", "photos.dropbox.com", "carousel.dropbox.com", "client-web.dropbox.com", "services.pp.dropbox.com", "www.dropbox.com", "docsend.com", "paper.dropbox.com", "notes.dropbox.com", "test.composer.dropbox.com", "showcase.dropbox.com", "collections.dropbox.com", "embedded.hellosign.com", "help.dropbox.com", "help-stg.dropbox.com", "experience.dropbox.com", "learn.dropbox.com", "learn-stage.dropbox.com", "app.hellosign.com", "replay.dropbox.com"], "PROF_SHARING_WEBSERVER": "showcase.dropbox.com", "FUNCAPTCHA_SERVER": "dropboxcaptcha.com", "__esModule": true};
});
#========================================================
https://mydomain.co/path/2
https://api.mydomain.co/path/2
https://api-v1.mydomain.co/path/2
https://superdomain.com:443
https://api.superdomain.com:443
https\\u003a\\u002f\\u002fapi.noddos.com\\u002fabout\\u002fen-us\\u002fsignin\\u002f"
https\\u003a\\u002f\\u002fnoddos.com\\u002fabout\\u002fen-us\\u002fsignin\\u002f"
root.I13N_config.location = "https:\u002F\u002Flocation.com\u002Faccount\u002Fchallenge\u002Frecaptcha
root.I13N_config.location = "https:\u002F\u002Fapi.location.com\u002Faccount\u002Fchallenge\u002Frecaptcha
&scope=openid%20profile%20https%3A%2F%2Fapi.domain2.com%2Fv2%2FOfficeHome.All&response_mode=form_post&nonce
&scope=openid%20profile%20https%3A%2F%2Fdomain2.com%2Fv2%2FOfficeHome.All&response_mode=form_post&nonce
https%3a%2f%2fwww.anotherdomain.com%2fv2%2
i try this regex but its not capture all i need.
re.compile(
r'''((?<=x2[fF]|02[fF]|%2[fF])|(?<=//))(\w\.|\w[A-Za-z0-9-]{0,61}\w\.){1,3}[A-Za-z]{2,6}|(?<=["'])(\w\.|\w[A-Za-z0-9-]{0,61}\w\.){1,3}[A-Za-z]{2,6}(?=["'])''',
re.VERBOSE)
captured result by regex:
{'showbox-tr.dropbox.com', 'api.dropbox.com', 'api-v1.mydomain.co', 'www-oud.mydomain.com', 'cfl.dropboxstatic.com', 'hub-spot.mydomain.com', 'paper.dropbox.com', 'superdomain.com', 'api.superdomain.com', 'linux.dropbox.com', 'embedded.hellosign.com', 'api.location.com', 'api.dropboxapi.com', 'www.dropbox.com', 'location.com', 'api.domain2.com', 'dropbox.com', 'api.noddos.com', 'dropboxcaptcha.com', 'learn-stage.dropbox.com', 'test.composer.dropbox.com', 'help-stg.dropbox.com', 'replay.dropbox.com', 'domain2.com', 'hubspot.mydomain.com', 'learn.dropbox.com', 'help.dropbox.com', 'collections.dropbox.com', 'app.hellosign.com', 'api.mydomain.co', 'noddos.com', 'docsend.com', 'mydomain.co', 'notes.dropbox.com', 'photos.dropbox.com', 'client-web.dropbox.com', 'services.pp.dropbox.com', 'OfficeHome.All', 'showcase.dropbox.com', 'carousel.dropbox.com', 'experience.dropbox.com'}
expected results so in general its not capture "opt-i.mydomain.com":
{'opt-i.mydomain.com', 'hubspot.mydomain.com', 'embedded.hellosign.com', 'carousel.dropbox.com', 'api.dropboxapi.com', 'experience.dropbox.com', 'linux.dropbox.com', 'api.superdomain.com', 'noddos.com', 'showcase.dropbox.com', 'app.hellosign.com', 'www-oud.mydomain.com', 'showbox-tr.dropbox.com', 'help-stg.dropbox.com', 'api.domain2.com', 'notes.dropbox.com', 'paper.dropbox.com', 'services.pp.dropbox.com', 'collections.dropbox.com', 'learn.dropbox.com', 'location.com', 'api.location.com', 'docsend.com', 'api.dropbox.com', 'replay.dropbox.com', 'mydomain.co', 'hub-spot.mydomain.com', 'www.dropbox.com', 'learn-stage.dropbox.com', 'domain2.com', 'help.dropbox.com', 'api.mydomain.co', 'api-v1.mydomain.co', 'superdomain.com', 'dropboxcaptcha.com', 'api.noddos.com', 'dropbox.com', 'test.composer.dropbox.com', 'cfl.dropboxstatic.com', 'client-web.dropbox.com', 'opt-i.mydomain.com', 'photos.dropbox.com'}
i also test this regex it match all domains better than previus but problem is with doamin`s that has unicode example:
"https\\u003a\\u002f\\u002fapi.noddos.com" will capture "u002fapi.noddos.com" but we need "api.noddos.com"
re.compile(
r'''
([a-z0-9][a-z0-9\-]{0,61}[a-z0-9]\.)+[a-z0-9][a-z0-9\-]*[a-z0-9]
''', re.VERBOSE)
import re
regex = r"(?:[a-z0-9][a-z0-9\-]{0,61}[a-z0-9]\.)+[a-z0-9][a-z0-9\-]*[a-z0-9]"
data = """opt-i.mydomain.com 'www-oud.mydomain.com'
https\\u003a\\u002f\\u002fapi.noddos.com\\u002fabout\\u002fen-us\\u002fsignin\\u002f"
<script type="text/javascript">
(function(){
var baseUrl = 'https\x3A\x2F\x2Fhubspot.mydomain.com';
var baseUrl = 'https\x3A\x2F\x2Fhub-spot.mydomain.com';
</script>
#=========================================================
define("modules/constants/env", [], function() {
return {"BATCH_THUMB_ENDPOINTS": [], "LIVE_TRANSCODE_SERVER": "showbox-tr.dropbox.com", "STATIC_CONTENT_HOST": "cfl.dropboxstatic.com", "NOTES_WEBSERVER": "paper.dropbox.com", "REDIRECT_SAFE_ORIGINS": ["www.dropbox.com", "dropbox.com", "api.dropboxapi.com", "api.dropbox.com", "linux.dropbox.com", "photos.dropbox.com", "carousel.dropbox.com", "client-web.dropbox.com", "services.pp.dropbox.com", "www.dropbox.com", "docsend.com", "paper.dropbox.com", "notes.dropbox.com", "test.composer.dropbox.com", "showcase.dropbox.com", "collections.dropbox.com", "embedded.hellosign.com", "help.dropbox.com", "help-stg.dropbox.com", "experience.dropbox.com", "learn.dropbox.com", "learn-stage.dropbox.com", "app.hellosign.com", "replay.dropbox.com"], "PROF_SHARING_WEBSERVER": "showcase.dropbox.com", "FUNCAPTCHA_SERVER": "dropboxcaptcha.com", "__esModule": true};
});
#========================================================
https://mydomain.co/path/2
https://api.mydomain.co/path/2
https://api-v1.mydomain.co/path/2
https://superdomain.com:443
https://api.superdomain.com:443
https\\u003a\\u002f\\u002fapi.noddos.com\\u002fabout\\u002fen-us\\u002fsignin\\u002f"
https\\u003a\\u002f\\u002fnoddos.com\\u002fabout\\u002fen-us\\u002fsignin\\u002f"
root.I13N_config.location = "https:\u002F\u002Flocation.com\u002Faccount\u002Fchallenge\u002Frecaptcha
root.I13N_config.location = "https:\u002F\u002Fapi.location.com\u002Faccount\u002Fchallenge\u002Frecaptcha
&scope=openid%20profile%20https%3A%2F%2Fapi.domain2.com%2Fv2%2FOfficeHome.All&response_mode=form_post&nonce
&scope=openid%20profile%20https%3A%2F%2Fdomain2.com%2Fv2%2FOfficeHome.All&response_mode=form_post&nonce"""
data = re.sub(r"u[0-9a-z]{4}", "", data)
matches = re.findall(regex, data)
#output matches:
['opt-i.mydomain.com', 'www-oud.mydomain.com', 'api.noddos.com', 'ht.mydomain.com', 'hub-spot.mydomain.com', 'showbox-tr.dropbox.com', 'cfl.dropboxstatic.com', 'paper.dropbox.com', 'www.dropbox.com', 'dropbox.com', 'api.dropboxapi.com', 'api.dropbox.com', 'linux.dropbox.com', 'photos.dropbox.com', 'carousel.dropbox.com', 'client-web.dropbox.com', 'services.pp.dropbox.com', 'www.dropbox.com', 'docsend.com', 'paper.dropbox.com', 'notes.dropbox.com', 'test.composer.dropbox.com', 'showcase.dropbox.com', 'collections.dropbox.com', 'embedded.hellosign.com', 'help.dropbox.com', 'help-stg.dropbox.com', 'experience.dropbox.com', 'learn.dropbox.com', 'learn-stage.dropbox.com', 'app.hellosign.com', 'replay.dropbox.com', 'showcase.dropbox.com', 'dropboxcaptcha.com', 'mydomain.co', 'api.mydomain.co', 'api-v1.mydomain.co', 'somain.com', 'api.somain.com', 'api.noddos.com', 'noddos.com', 'config.location', 'location.com', 'config.location', 'api.location.com', 'api.domain2.com', 'domain2.com']
Could you please help me with this little thing. I am looking to extract lat and lng value from the below code in SCRIPT tag(not in Body) using Beautiful soup(Python) or python. I am new to Python and blog are recommending to use Beautiful soup for extracting.
I want these two values lat: 21.25335 , lng: 81.649445
I am using regular expression for this . My regular expresion "^l([a-t])(:) ([0-9])([^,]+)"
Check this link for Regular expression and html file -
http://regexr.com/3glde
I get those two value with this regular expression but i want only those lat and lng value (numeric part ) to be stored in variable .
Here below is my python code which I am using
import re
pattern = re.compile("^[l]([a-t])([a-t])(\:) ([0-9])([^,]+)")
for i, line in enumerate(open('C:\hile_text.html')):
for match in re.finditer(pattern, line):
print 'Found on line %s: %s' % (i+1, match.groups())
Output:
Found on line 3218: ('a', 't', ':', '2', '1.244791')
Found on line 3219: ('n', 'g', ':', '8', '1.643486')
I want only those numeric value as output like 21.25335,81.649445 and want to store these values in variables or else you can provide alternate code to this.
plzz soon help me out .Thanks in anticipation.
This is the script tag in html file .
<script type="text/javascript">
window.mapDivId = 'map0Div';
window.map0Div = {
lat: 21.25335,
lng: 81.649445,
zoom: null,
locId: 5897747,
geoId: 297595,
isAttraction: false,
isEatery: true,
isLodging: false,
isNeighborhood: false,
title: "Aman Age Roll & Chicken ",
homeIcon: true,
url: "/Restaurant_Review-g297595-d5897747-Reviews-Aman_Age_Roll_Chicken-Raipur_Raipur_District_Chhattisgarh.html",
minPins: [
['hotel', 20],
['restaurant', 20],
['attraction', 20],
['vacation_rental', 0] ],
units: 'km',
geoMap: false,
tabletFullSite: false,
reuseHoverDivs: false,
noSponsors: true };
ta.store('infobox_js', 'https://static.tacdn.com/js3/infobox-c-v21051733989b.js');
ta.store("ta.maps.apiKey", "");
(function() {
var onload = function() {
if (window.location.hash == "#MAPVIEW") {
ta.run("ta.mapsv2.Factory.handleHashLocation", {}, true);
}
}
if (window.addEventListener) {
if (window.history && window.history.pushState) {
window.addEventListener("popstate", function(e) {
ta.run("ta.mapsv2.Factory.handleHashLocation", {}, false);
}, false);
}
window.addEventListener('load', onload, false);
}
else if (window.attachEvent) {
window.attachEvent('onload', onload);
}
})();
ta.store("mapsv2.show_sidebar", true);
ta.store('mapsv2_restaurant_reservation_js', ["https://static.tacdn.com/js3/ta-mapsv2-restaurant-reservation-c-v2430632369b.js"]);
ta.store('mapsv2.typeahead_css', "https://static.tacdn.com/css2/maps_typeahead-v21940478230b.css");
// Feature gate VR price pins on SRP map. VRC-14803
ta.store('mapsv2.vr_srp_map_price_enabled', true);
ta.store('mapsv2.geoName', 'Raipur');
ta.store('mapsv2.map_addressnotfound', "Address not found"); ta.store('mapsv2.map_addressnotfound3', "We couldn\'t find that location near {0}. Please try another search."); ta.store('mapsv2.directions', "Directions from {0} to {1}"); ta.store('mapsv2.enter_dates', "Enter dates for best prices"); ta.store('mapsv2.best_prices', "Best prices for your stay"); ta.store('mapsv2.list_accom', "List of accommodations"); ta.store('mapsv2.list_hotels', "List of hotels"); ta.store('mapsv2.list_vrs', "List of holiday rentals"); ta.store('mapsv2.more_accom', "More accommodations"); ta.store('mapsv2.more_hotels', "More hotels"); ta.store('mapsv2.more_vrs', "More Holiday Homes"); ta.store('mapsv2.sold_out_on_1', "SOLD OUT on 1 site"); ta.store('mapsv2.sold_out_on_y', "SOLD OUT on 2 sites"); </script>
Your regular expression is a bit messed up.
^l says you are trying to match an 'l' that is the very first character on a line.
^\s+(l[an][gt])(:\s+)(\d+\.\d+) would work better. Checkout an regerx analyzer tool such as http://www.myezapp.com/apps/dev/regexp/show.ws to get a breakdown of what is happening.
Here is a breakdown
Sequence: match all of the followings in order
BeginOfLine
Repeat
WhiteSpaceCharacter
one or more times
CapturingGroup
GroupNumber:1
Sequence: match all of the followings in order
l
AnyCharIn[ a n]
AnyCharIn[ g t]
CapturingGroup
GroupNumber:2
Sequence: match all of the followings in order
:
Repeat
WhiteSpaceCharacter
one or more times
CapturingGroup
GroupNumber:3
Sequence: match all of the followings in order
Repeat
Digit
one or more times
.
Repeat
Digit
one or more times
I'm having a bit of trouble getting a Python regex to work when matching against text that spans multiple lines. My example is:
function initialize()
{
var myLatlng = new google.maps.LatLng(23.800567,5.942068);
var myOptions =
{
panControl: true,
zoomControl: true,
scaleControl: false,
streetViewControl: true,
zoom: 11,
center: myLatlng,
mapTypeId: google.maps.MapTypeId.HYBRID
}
var map = new google.maps.Map(document.getElementById("map"), myOptions);
var bounds = new google.maps.LatLngBounds();
var locations = [
['<div CLASS="Tekst"><B>tss fsdf<\/B><BR>hopp <BR><\/div>', 54.538665,24.885818, 1, 'text']
,
['<div CLASS="Tekst"><\/div>', 24.465462,24.966919, 1, 'text']
]
What I want to extract is context in locations. As result I want to look like:
- '<div CLASS="Tekst"><B>tss fsdf<\/B><BR>hopp <BR><\/div>',
54.538665,24.885818, 1, 'text'
- '<div CLASS="Tekst"><\/div>', 24.465462,24.966919, 1, 'text'
I try regex like this:
regex = r"var locations =\[\[(.+?)\]\]"
But it doesnt work.
hello you can try this regex
re.findall("(<div.+)\]",toparse)
Using this code from a different thread
import re
import requests
from bs4 import BeautifulSoup
data = """
<script type="text/javascript">
window._propertyData =
{ *** a lot of random code and other data ***
"property": {"street": "21st Street", "apartment": "2101", "available": false}
*** more data ***
}
</script>
"""
soup = BeautifulSoup(data, "xml")
pattern = re.compile(r'\"street\":\s*\"(.*?)\"', re.MULTILINE | re.DOTALL)
script = soup.find("script", text=pattern)
print pattern.search(script.text).group(1)
This gets me the desired result:
21st Street
However, i was trying to get the whole thing by trying different variations of the regex and couldn't achieve the output to be:
{"street": "21st Street", "apartment": "2101", "available": false}
I have tried the following:
pattern = re.compile(r'\"property\":\s*\"(.*?)\{\"', re.MULTILINE | re.DOTALL)
Its not producing the desired result.
Your help is appreciated!
Thanks.
As per commented above , correct your typo and you use this
r"property\W+({.*?})"
RegexDemo
property : look for exact string
\W+ : matches any non-word character
({.*?}) : capture group one
.* matches any character inside braces {}
? matches as few times as possible
You can try this:
\"property\":\s*(\{.*?\})
capture group 1 contains yor desired data
Explanation
Sample Code:
import re
regex = r"\"property\":\s*(\{.*?\})"
test_str = ("window._propertyData = \n"
" { *** a lot of random code and other data ***\n"
" \"property\": {\"street\": \"21st Street\", \"apartment\": \"2101\", \"available\": false}\n"
" *** more data ***\n"
" }")
matches = re.finditer(regex, test_str, re.MULTILINE | re.DOTALL)
for matchNum, match in enumerate(matches):
print(match.group(1))
Run it here
Try this, It may be long but work's fine
\"property\"\:\s*(\{((?:\"\w+\"\:\s*\"?[\w\s]+\"?\,?\s?)+?)\})
https://regex101.com/r/7KzzRV/3
import re
import ast
data = """
<script type="text/javascript">
window._propertyData =
{ *** a lot of random code and other data ***
"property": {"street": "21st Street", "apartment": "2101", "available": false}
*** more data ***
}
</script>
"""
property = re.search(r'"property": ({.+?})', data)
str_form = property.group(1)
print('str_form: ' + str_form)
dict_form = ast.literal_eval(str_form.replace('false', 'False'))
print('dict_form: ', dict_form)
out:
str_form: {"street": "21st Street", "apartment": "2101", "available": false}
dict_form: {'available': False, 'street': '21st Street', 'apartment': '2101'}