I have some text output, and I need to find all domains in it: with and without subdomains, with and without the https scheme. Example input:
opt-i.mydomain.com 'www-oud.mydomain.com'
https\\u003a\\u002f\\u002fapi.noddos.com\\u002fabout\\u002fen-us\\u002fsignin\\u002f"
<script type="text/javascript">
(function(){
var baseUrl = 'https\x3A\x2F\x2Fhubspot.mydomain.com';
var baseUrl = 'https\x3A\x2F\x2Fhub-spot.mydomain.com';
</script>
#=========================================================
define("modules/constants/env", [], function() {
return {"BATCH_THUMB_ENDPOINTS": [], "LIVE_TRANSCODE_SERVER": "showbox-tr.dropbox.com", "STATIC_CONTENT_HOST": "cfl.dropboxstatic.com", "NOTES_WEBSERVER": "paper.dropbox.com", "REDIRECT_SAFE_ORIGINS": ["www.dropbox.com", "dropbox.com", "api.dropboxapi.com", "api.dropbox.com", "linux.dropbox.com", "photos.dropbox.com", "carousel.dropbox.com", "client-web.dropbox.com", "services.pp.dropbox.com", "www.dropbox.com", "docsend.com", "paper.dropbox.com", "notes.dropbox.com", "test.composer.dropbox.com", "showcase.dropbox.com", "collections.dropbox.com", "embedded.hellosign.com", "help.dropbox.com", "help-stg.dropbox.com", "experience.dropbox.com", "learn.dropbox.com", "learn-stage.dropbox.com", "app.hellosign.com", "replay.dropbox.com"], "PROF_SHARING_WEBSERVER": "showcase.dropbox.com", "FUNCAPTCHA_SERVER": "dropboxcaptcha.com", "__esModule": true};
});
#========================================================
https://mydomain.co/path/2
https://api.mydomain.co/path/2
https://api-v1.mydomain.co/path/2
https://superdomain.com:443
https://api.superdomain.com:443
https\\u003a\\u002f\\u002fapi.noddos.com\\u002fabout\\u002fen-us\\u002fsignin\\u002f"
https\\u003a\\u002f\\u002fnoddos.com\\u002fabout\\u002fen-us\\u002fsignin\\u002f"
root.I13N_config.location = "https:\u002F\u002Flocation.com\u002Faccount\u002Fchallenge\u002Frecaptcha
root.I13N_config.location = "https:\u002F\u002Fapi.location.com\u002Faccount\u002Fchallenge\u002Frecaptcha
&scope=openid%20profile%20https%3A%2F%2Fapi.domain2.com%2Fv2%2FOfficeHome.All&response_mode=form_post&nonce
&scope=openid%20profile%20https%3A%2F%2Fdomain2.com%2Fv2%2FOfficeHome.All&response_mode=form_post&nonce
https%3a%2f%2fwww.anotherdomain.com%2fv2%2
I tried this regex, but it doesn't capture everything I need:
re.compile(
    r'''((?<=x2[fF]|02[fF]|%2[fF])|(?<=//))(\w\.|\w[A-Za-z0-9-]{0,61}\w\.){1,3}[A-Za-z]{2,6}|(?<=["'])(\w\.|\w[A-Za-z0-9-]{0,61}\w\.){1,3}[A-Za-z]{2,6}(?=["'])''',
    re.VERBOSE)
Result captured by this regex:
{'showbox-tr.dropbox.com', 'api.dropbox.com', 'api-v1.mydomain.co', 'www-oud.mydomain.com', 'cfl.dropboxstatic.com', 'hub-spot.mydomain.com', 'paper.dropbox.com', 'superdomain.com', 'api.superdomain.com', 'linux.dropbox.com', 'embedded.hellosign.com', 'api.location.com', 'api.dropboxapi.com', 'www.dropbox.com', 'location.com', 'api.domain2.com', 'dropbox.com', 'api.noddos.com', 'dropboxcaptcha.com', 'learn-stage.dropbox.com', 'test.composer.dropbox.com', 'help-stg.dropbox.com', 'replay.dropbox.com', 'domain2.com', 'hubspot.mydomain.com', 'learn.dropbox.com', 'help.dropbox.com', 'collections.dropbox.com', 'app.hellosign.com', 'api.mydomain.co', 'noddos.com', 'docsend.com', 'mydomain.co', 'notes.dropbox.com', 'photos.dropbox.com', 'client-web.dropbox.com', 'services.pp.dropbox.com', 'OfficeHome.All', 'showcase.dropbox.com', 'carousel.dropbox.com', 'experience.dropbox.com'}
Expected result (in particular, the regex above does not capture "opt-i.mydomain.com"):
{'opt-i.mydomain.com', 'hubspot.mydomain.com', 'embedded.hellosign.com', 'carousel.dropbox.com', 'api.dropboxapi.com', 'experience.dropbox.com', 'linux.dropbox.com', 'api.superdomain.com', 'noddos.com', 'showcase.dropbox.com', 'app.hellosign.com', 'www-oud.mydomain.com', 'showbox-tr.dropbox.com', 'help-stg.dropbox.com', 'api.domain2.com', 'notes.dropbox.com', 'paper.dropbox.com', 'services.pp.dropbox.com', 'collections.dropbox.com', 'learn.dropbox.com', 'location.com', 'api.location.com', 'docsend.com', 'api.dropbox.com', 'replay.dropbox.com', 'mydomain.co', 'hub-spot.mydomain.com', 'www.dropbox.com', 'learn-stage.dropbox.com', 'domain2.com', 'help.dropbox.com', 'api.mydomain.co', 'api-v1.mydomain.co', 'superdomain.com', 'dropboxcaptcha.com', 'api.noddos.com', 'dropbox.com', 'test.composer.dropbox.com', 'cfl.dropboxstatic.com', 'client-web.dropbox.com', 'opt-i.mydomain.com', 'photos.dropbox.com'}
I also tested the regex below. It matches domains better than the previous one, but it has a problem with domains preceded by unicode escapes. For example,
"https\\u003a\\u002f\\u002fapi.noddos.com" is captured as "u002fapi.noddos.com", but we need "api.noddos.com".
re.compile(
    r'''
    ([a-z0-9][a-z0-9\-]{0,61}[a-z0-9]\.)+[a-z0-9][a-z0-9\-]*[a-z0-9]
    ''', re.VERBOSE)
import re
regex = r"(?:[a-z0-9][a-z0-9\-]{0,61}[a-z0-9]\.)+[a-z0-9][a-z0-9\-]*[a-z0-9]"
data = """opt-i.mydomain.com 'www-oud.mydomain.com'
https\\u003a\\u002f\\u002fapi.noddos.com\\u002fabout\\u002fen-us\\u002fsignin\\u002f"
<script type="text/javascript">
(function(){
var baseUrl = 'https\x3A\x2F\x2Fhubspot.mydomain.com';
var baseUrl = 'https\x3A\x2F\x2Fhub-spot.mydomain.com';
</script>
#=========================================================
define("modules/constants/env", [], function() {
return {"BATCH_THUMB_ENDPOINTS": [], "LIVE_TRANSCODE_SERVER": "showbox-tr.dropbox.com", "STATIC_CONTENT_HOST": "cfl.dropboxstatic.com", "NOTES_WEBSERVER": "paper.dropbox.com", "REDIRECT_SAFE_ORIGINS": ["www.dropbox.com", "dropbox.com", "api.dropboxapi.com", "api.dropbox.com", "linux.dropbox.com", "photos.dropbox.com", "carousel.dropbox.com", "client-web.dropbox.com", "services.pp.dropbox.com", "www.dropbox.com", "docsend.com", "paper.dropbox.com", "notes.dropbox.com", "test.composer.dropbox.com", "showcase.dropbox.com", "collections.dropbox.com", "embedded.hellosign.com", "help.dropbox.com", "help-stg.dropbox.com", "experience.dropbox.com", "learn.dropbox.com", "learn-stage.dropbox.com", "app.hellosign.com", "replay.dropbox.com"], "PROF_SHARING_WEBSERVER": "showcase.dropbox.com", "FUNCAPTCHA_SERVER": "dropboxcaptcha.com", "__esModule": true};
});
#========================================================
https://mydomain.co/path/2
https://api.mydomain.co/path/2
https://api-v1.mydomain.co/path/2
https://superdomain.com:443
https://api.superdomain.com:443
https\\u003a\\u002f\\u002fapi.noddos.com\\u002fabout\\u002fen-us\\u002fsignin\\u002f"
https\\u003a\\u002f\\u002fnoddos.com\\u002fabout\\u002fen-us\\u002fsignin\\u002f"
root.I13N_config.location = "https:\u002F\u002Flocation.com\u002Faccount\u002Fchallenge\u002Frecaptcha
root.I13N_config.location = "https:\u002F\u002Fapi.location.com\u002Faccount\u002Fchallenge\u002Frecaptcha
&scope=openid%20profile%20https%3A%2F%2Fapi.domain2.com%2Fv2%2FOfficeHome.All&response_mode=form_post&nonce
&scope=openid%20profile%20https%3A%2F%2Fdomain2.com%2Fv2%2FOfficeHome.All&response_mode=form_post&nonce"""
data = re.sub(r"u[0-9a-z]{4}", "", data)
matches = re.findall(regex, data)
# Output matches — note the mangled entries such as 'ht.mydomain.com' and 'somain.com', caused by the re.sub above stripping any "u" followed by four characters:
['opt-i.mydomain.com', 'www-oud.mydomain.com', 'api.noddos.com', 'ht.mydomain.com', 'hub-spot.mydomain.com', 'showbox-tr.dropbox.com', 'cfl.dropboxstatic.com', 'paper.dropbox.com', 'www.dropbox.com', 'dropbox.com', 'api.dropboxapi.com', 'api.dropbox.com', 'linux.dropbox.com', 'photos.dropbox.com', 'carousel.dropbox.com', 'client-web.dropbox.com', 'services.pp.dropbox.com', 'www.dropbox.com', 'docsend.com', 'paper.dropbox.com', 'notes.dropbox.com', 'test.composer.dropbox.com', 'showcase.dropbox.com', 'collections.dropbox.com', 'embedded.hellosign.com', 'help.dropbox.com', 'help-stg.dropbox.com', 'experience.dropbox.com', 'learn.dropbox.com', 'learn-stage.dropbox.com', 'app.hellosign.com', 'replay.dropbox.com', 'showcase.dropbox.com', 'dropboxcaptcha.com', 'mydomain.co', 'api.mydomain.co', 'api-v1.mydomain.co', 'somain.com', 'api.somain.com', 'api.noddos.com', 'noddos.com', 'config.location', 'location.com', 'config.location', 'api.location.com', 'api.domain2.com', 'domain2.com']
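A different approach (a sketch with my own helper name and escape handling, not the only way): decode the `\u00XX`, `\xXX`, and `%XX` escape forms into real characters first, so `\u002f` becomes `/` instead of being blindly stripped, then apply the simple domain pattern:

```python
import re

def extract_domains(text):
    # Decode \u003a / \\u002f style escapes into their characters.
    text = re.sub(r'\\+u00([0-9a-fA-F]{2})',
                  lambda m: chr(int(m.group(1), 16)), text)
    # Decode \x3A / \x2F style escapes.
    text = re.sub(r'\\+x([0-9a-fA-F]{2})',
                  lambda m: chr(int(m.group(1), 16)), text)
    # Decode %3A / %2F percent-encoding.
    text = re.sub(r'%([0-9a-fA-F]{2})',
                  lambda m: chr(int(m.group(1), 16)), text)
    domain = re.compile(r'(?:[a-z0-9][a-z0-9\-]{0,61}[a-z0-9]\.)+[a-z]{2,}\b', re.I)
    return set(domain.findall(text))

print(sorted(extract_domains(
    r"https\\u003a\\u002f\\u002fapi.noddos.com\\u002fabout 'opt-i.mydomain.com'")))
# ['api.noddos.com', 'opt-i.mydomain.com']
```

This keeps `api.noddos.com` intact instead of producing `u002fapi.noddos.com` or `ht.mydomain.com`. It still needs a TLD allow-list (or a library such as `tldextract`) to weed out lookalike tokens such as `config.location` or `OfficeHome.All`.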
Related
How can I extract the URL from a script of HTML with Python?
The HTML provided:
function download() {
    window.open('https:somelink.com');
}

const text = `<div style=\'position: relative;padding-bottom: 56.25%;height: 0;overflow: hidden;\'>
<iframe allowfullscreen=\'allowfullscreen\' src=\'URL\' style=\'border: 0;height: 100%;left: 0;position: absolute;top: 0;width: 100%;\' ></iframe>
</div>`;

function embed() {
    var element = document.getElementById('embed-text');
    console.log(element);
    element.innerHTML = text;
}
Desired output will be:
https://somelink.com
Any help will do. Thanks!
You can use a regex like this:
var urlRegex = /(https?:\/\/[^\s]+)/; // the regex
// your string
var input = "<div style=\'position: relative;padding-bottom: 56.25%;height: 0;overflow: hidden;\'><iframe allowfullscreen=\'allowfullscreen\' src=\" https://my-url.com/test \" style=\'border: 0;height: 100%;left: 0;position: absolute;top: 0;width: 100%;\' ></iframe></div>";
console.log(input.match(urlRegex)[1]); // run the regex and log the result
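Since the question asks for Python, a minimal Python equivalent (assuming the script actually contains a well-formed `https://` URL) looks like this:

```python
import re

html = "window.open('https://somelink.com');"

# Match the scheme, then everything up to the next whitespace or quote.
url_re = re.compile(r"https?://[^\s'\"]+")
match = url_re.search(html)
if match:
    print(match.group(0))  # https://somelink.com
```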
I'm crawling through some directories with ASP.NET programming via Scrapy.
The pages to crawl through are encoded as such:
javascript:__doPostBack('MoreInfoListZbgs1$Pager','X')
where X is an int between 1 and 180. The problem is that the URL stays the same when I click the next page (or any other page).
I've written the code below, but it can only extract the links on the first page.
# -*- coding: utf-8 -*-
import scrapy
from bs4 import BeautifulSoup
import re
from scrapy.http import FormRequest
import js2xml
import requests
from datetime import datetime


class nnggzySpider(scrapy.Spider):
    name = 'nnggzygov'
    start_urls = [
        'https://www.nnggzy.org.cn/gxnnzbw/showinfo/zbxxmore.aspx?categorynum=001004001'
    ]
    base_url = 'https://www.nnggzy.org.cn'
    custom_settings = {
        'LOG_LEVEL': 'ERROR'
    }

    def parse(self, response):
        _response = response.text
        self.data = {}
        soup = BeautifulSoup(response.body, 'html.parser')
        tags = soup.find_all('a', href=re.compile(r"InfoDetail"))
        # Get the pagination parameters
        __VIEWSTATE = re.findall(r'id="__VIEWSTATE" value="(.*?)" />', _response)
        A = __VIEWSTATE[0]
        # print(A)
        __EVENTTARGET = 'MoreInfoListZbgs1$Pager'
        B = __EVENTTARGET
        __CSRFTOKEN = re.findall(r'id="__CSRFTOKEN" value="(.*?)" />', _response)
        C = __CSRFTOKEN
        page_num = re.findall(r'title="转到第(.*?)页"', _response)
        max_page = page_num[-1]
        content = {
            '__VIEWSTATE': A,
            '__EVENTTARGET': B,
            '__CSRFTOKEN': C,
            'page_num': max_page
        }
        infoid = re.findall(r'InfoID=(.*?)&CategoryNum', _response)
        print(infoid)
        yield scrapy.Request(url=response.url, callback=self.parse_detail, meta={"data": content})

    def parse_detail(self, response):
        max_page = response.meta['data']['page_num']
        for i in range(2, int(max_page)):
            data = {
                '__CSRFTOKEN': '{}'.format(response.meta['data']['__CSRFTOKEN']),
                '__VIEWSTATE': '{}'.format(response.meta['data']['__VIEWSTATE']),
                '__EVENTTARGET': 'MoreInfoListZbgs1$Pager',
                '__EVENTARGUMENT': '{}'.format(i),
                # '__VIEWSTATEENCRYPTED': '',
                # 'txtKey': ''
            }
            yield scrapy.FormRequest(url=response.url, callback=self.parse, formdata=data, method="POST", dont_filter=True)
Can anyone help me with this?
It looks like pagination on the mentioned website is done by sending POST requests with form data like:
{
    "__CSRFTOKEN": ...,
    "__VIEWSTATE": ...,
    "__EVENTTARGET": "MoreInfoListZbgs1$Pager",
    "__EVENTARGUMENT": page_number,
    "__VIEWSTATEENCRYPTED": "",
    "txtKey": ""
}
I know this is a year-old thread, but I'm posting the answer for future visitors from Google search.
Your form submission didn't work because there are likely additional hidden fields further down the web page, inside the form. That was the case for me, and here's the working submission:
# This is the next page link
# <a id="nextId" href="javascript:__doPostBack('MoreInfoListZbgs1$Pager','')"> Next </a>
# This is how the website evaluate the next link
# <script type="text/javascript">
# //<![CDATA[
# var theForm = document.forms['Form1'];
# if (!theForm) {
# theForm = document.Form1;
# }
# function __doPostBack(eventTarget, eventArgument) {
# if (!theForm.onsubmit || (theForm.onsubmit() != false)) {
# theForm.__EVENTTARGET.value = eventTarget;
# theForm.__EVENTARGUMENT.value = eventArgument;
# theForm.submit();
# }
# }
# //]]>
# </script>
# According to above js code, we need to pass in the following arguments:
data = {
    '__EVENTTARGET': 'MoreInfoListZbgs1$Pager',  # first argument from the javascript:__doPostBack('MoreInfoListZbgs1$Pager','') next link
    '__EVENTARGUMENT': '',  # second argument from the next link; in my case it is empty
    '__VIEWSTATE': response.css('input[name=__VIEWSTATE]::attr("value")').get(),
    # These are the additional hidden input fields you need to pass in
    '__VIEWSTATEGENERATOR': response.css('input[name=__VIEWSTATEGENERATOR]::attr("value")').get(),
    '__EVENTVALIDATION': response.css('input[name=__EVENTVALIDATION]::attr("value")').get(),
}
yield scrapy.FormRequest(url=form_action_url_here, formdata=data, callback=self.parse)
I have a problem with Swagger. I made a web service that is invoked from a web page; there, the returned information is visible, but when I try to view it through Swagger (I need to check it in both places), it shows an error.
This is only my second web service in Python. I've been writing it by intuition and by reading Google results, so I know it's neither the best nor the most beautiful code.
Here are the code of my page and the Python code, so I can read your opinions:
from app import APP
import cx_Oracle
import json
from flask import render_template, request, jsonify
import log
import database
from flask_restplus import Api, Resource, fields

with open('./config/config_countries.json', 'r') as config_file:
    config = json.load(config_file)
with open('./config/config_general.json', 'r') as config_file:
    config_general = json.load(config_file)

global max_return
global log_tag

log_tag = config["CONFIG"]["LOG_TAG"]
srv_name = config["CONFIG"]["LOG_TAG"]
desc = config["CONFIG"]["DESCRIPTION"]
profile_name = config["CONFIG"]["PROFILE_NAME"]
conn_str = config_general["CONFIG"]["DB"]
max_return = config_general["CONFIG"]["MAX_RETURN"]
limite = config_general["CONFIG"]["LIMIT_PER_SECOND"]

log.init(profile_name)
log.dbg('Start')
database.init()

api = Api(APP, version='1.0', title=srv_name,
          description=desc + '\n'
          'Database:' + conn_str + '\n'
          'Max request by second:' + limite)
ns = api.namespace('getCountries', description='getCountries Predictions')


#APP.route('/getCountries', methods=['GET', 'POST'])
#ns.response(200, 'Success')
#ns.response(404, 'Not found')
#ns.response(429, 'Too many request')
#ns.param('country', 'Substring Country (ej:ARGE)')
def getCountries():
    try:
        list_parametros = (request.get_json())
        country = list_parametros["country"]
        cur = database.db.cursor()
        list = cur.var(cx_Oracle.STRING)
        cur.callproc('PREDICTIVO.get_country', (country, max_return, list))
    except Exception as e:
        database.init()
        if database.db is not None:
            log.inf('Reconexion OK')
            cur = database.db.cursor()
            list = cur.var(cx_Oracle.STRING)
            cur.callproc('PREDICTIVO.get_country', (country, max_return, list))
        else:
            log.err('Connection Fail')
            list = None
    response = list.getvalue()
    return json.dumps(response), 200
And the HTML:
<html>
<head>
<script type="text/javascript" src="http://code.jquery.com/jquery-1.7.1.min.js"></script>
</head>
<body>
<input list="lista_paises" id="pais" type="text" placeholder="Pais"/>
<datalist id="lista_paises"></datalist><br />
</body>
</html>
<script type="text/javascript">
// <![CDATA[
$("#pais").keyup(function (e) {
    if ($("#pais").val().length > 0) {
        $.ajax({
            type: 'POST',
            url: 'http://127.0.0.1:5100/getCountries',
            data: JSON.stringify({ "country": $("#pais").val() }),
            contentType: 'application/json; charset=utf-8',
            dataType: "json",
            success: function (response) {
                console.log('RESPUESTA:' + response);
                var parsedJSON = JSON.parse(response);
                var options = '';
                for (var i = 0; i < parsedJSON.list_pais.length; i++) {
                    options += '<option value="' + parsedJSON.list_pais[i].pais + '" />';
                }
                document.getElementById('lista_paises').innerHTML = options;
            },
            error: function (error) {
                console.log(error);
            }
        });
    } else {
        document.getElementById('lista_paises').innerHTML = '';
    }
});
// ]]>
</script>
I'm trying to extract a link from some JavaScript. The JavaScript part isn't relevant anymore, because I transformed it into a string (text).
Here is the script part:
<script>
setTimeout("location.href = 'https://airdownload.adobe.com/air/win/download/30.0/AdobeAIRInstaller.exe';", 2000);
$(function() {
    $("#whats_new_panels").bxSlider({
        controls: false,
        auto: true,
        pause: 15000
    });
});
setTimeout(function(){
    $("#download_messaging").hide();
    $("#next_button").show();
}, 10000);
</script>
Here is what I do:
import re


def get_link_from_text(text):
    text = text.replace('\n', '')
    text = text.replace('\t', '')
    text = re.sub(' +', ' ', text)
    search_for = re.compile("href[ ]*=[ ]*'[^;]*")
    debug = re.search(search_for, text)
    return debug
What I want is the href link, and I kind of get it, but for some reason only like this:
<_sre.SRE_Match object; span=(30, 112), match="href = 'https://airdownload.adobe.com/air/win/dow>
and not like I want it to be
<_sre.SRE_Match object; span=(30, 112), match="href = 'https://airdownload.adobe.com/air/win/download/30.0/AdobeAIRInstaller.exe'">
So my question is how to get the full link and not only part of it.
Might the problem be that re.search isn't returning longer strings? I tried altering the regex, I even tried matching the link character by character, but it still returns only the part I called out earlier.
I've modified it slightly; for me it now returns the complete string you want.
import re

text = """
<script>
setTimeout("location.href = 'https://airdownload.adobe.com/air/win/download/30.0/AdobeAIRInstaller.exe';", 2000);
$(function() {
    $("#whats_new_panels").bxSlider({
        controls: false,
        auto: true,
        pause: 15000
    });
});
setTimeout(function(){
    $("#download_messaging").hide();
    $("#next_button").show();
}, 10000);
</script>
"""


def get_link_from_text(text):
    text = text.replace('\n', '')
    text = text.replace('\t', '')
    text = re.sub(' +', ' ', text)
    search_for = re.compile("href[ ]*=[ ]*'[^;]*")
    debug = search_for.findall(text)
    print(debug)


get_link_from_text(text)
Output:
["href = 'https://airdownload.adobe.com/air/win/download/30.0/AdobeAIRInstaller.exe'"]
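Worth adding: the original `re.search` call was already matching the complete link. The `<_sre.SRE_Match ...>` line in the question is just the match object's truncated repr; `.group(0)` holds the full text. A quick check:

```python
import re

text = "location.href = 'https://airdownload.adobe.com/air/win/download/30.0/AdobeAIRInstaller.exe';"
m = re.search(r"href[ ]*=[ ]*'[^;]*", text)
# The repr of `m` is shortened for display, but the match itself is complete:
print(m.group(0))
# href = 'https://airdownload.adobe.com/air/win/download/30.0/AdobeAIRInstaller.exe'
```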
Using this code from a different thread:
import re
import requests
from bs4 import BeautifulSoup
data = """
<script type="text/javascript">
window._propertyData =
{ *** a lot of random code and other data ***
"property": {"street": "21st Street", "apartment": "2101", "available": false}
*** more data ***
}
</script>
"""
soup = BeautifulSoup(data, "xml")
pattern = re.compile(r'\"street\":\s*\"(.*?)\"', re.MULTILINE | re.DOTALL)
script = soup.find("script", text=pattern)
print(pattern.search(script.text).group(1))
This gets me the desired result:
21st Street
However, I was trying to get the whole thing by trying different variations of the regex, and couldn't get the output to be:
{"street": "21st Street", "apartment": "2101", "available": false}
I have tried the following:
pattern = re.compile(r'\"property\":\s*\"(.*?)\{\"', re.MULTILINE | re.DOTALL)
It's not producing the desired result.
Your help is appreciated!
Thanks.
As commented above, correct your typo, and you can use this:
r"property\W+({.*?})"
property : look for exact string
\W+ : matches any non-word character
({.*?}) : capture group one
.* matches any character inside braces {}
? matches as few times as possible
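A quick check of that pattern in Python, on a trimmed piece of the snippet from the question:

```python
import re

data = '"property": {"street": "21st Street", "apartment": "2101", "available": false}'

# \W+ bridges the quote, colon and space between the key and the value.
m = re.search(r"property\W+({.*?})", data)
print(m.group(1))
# {"street": "21st Street", "apartment": "2101", "available": false}
```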
You can try this:
\"property\":\s*(\{.*?\})
Capture group 1 contains your desired data.
Explanation
Sample Code:
import re

regex = r"\"property\":\s*(\{.*?\})"
test_str = ("window._propertyData = \n"
            " { *** a lot of random code and other data ***\n"
            " \"property\": {\"street\": \"21st Street\", \"apartment\": \"2101\", \"available\": false}\n"
            " *** more data ***\n"
            " }")
matches = re.finditer(regex, test_str, re.MULTILINE | re.DOTALL)

for matchNum, match in enumerate(matches):
    print(match.group(1))
Try this. It may be long, but it works fine:
\"property\"\:\s*(\{((?:\"\w+\"\:\s*\"?[\w\s]+\"?\,?\s?)+?)\})
https://regex101.com/r/7KzzRV/3
import re
import ast
data = """
<script type="text/javascript">
window._propertyData =
{ *** a lot of random code and other data ***
"property": {"street": "21st Street", "apartment": "2101", "available": false}
*** more data ***
}
</script>
"""
property = re.search(r'"property": ({.+?})', data)
str_form = property.group(1)
print('str_form: ' + str_form)
dict_form = ast.literal_eval(str_form.replace('false', 'False'))
print('dict_form: ', dict_form)
Output:
str_form: {"street": "21st Street", "apartment": "2101", "available": false}
dict_form: {'available': False, 'street': '21st Street', 'apartment': '2101'}
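A side note on the design choice: the captured fragment is itself valid JSON (`false` is a JSON literal), so `json.loads` parses it directly, and the `replace('false', 'False')` / `ast.literal_eval` step isn't needed:

```python
import json
import re

data = '"property": {"street": "21st Street", "apartment": "2101", "available": false}'

str_form = re.search(r'"property": ({.+?})', data).group(1)
dict_form = json.loads(str_form)  # the JSON parser maps false -> False
print(dict_form)
```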