How to extract link from html script in python?

How can I extract the URL from a script embedded in HTML using Python?
The HTML provided:
function download() {
    window.open('https://somelink.com');
}
const text = `<div style=\'position: relative;padding-bottom: 56.25%;height: 0;overflow: hidden;\'>
<iframe allowfullscreen=\'allowfullscreen\' src=\'URL\' style=\'border: 0;height: 100%;left: 0;position: absolute;top: 0;width: 100%;\' ></iframe>
</div>`;
function embed() {
    var element = document.getElementById('embed-text');
    console.log(element);
    element.innerHTML = text
}
Desired output will be:
https://somelink.com
Any help will do. Thanks!

You can use a regex like this (shown here in JavaScript; the same pattern also works with Python's re module):
var urlRegex = /(https?:\/\/[^\s]+)/; // the regex
// your string
var input = "<div style=\'position: relative;padding-bottom: 56.25%;height: 0;overflow: hidden;\'><iframe allowfullscreen=\'allowfullscreen\' src=\" https://my-url.com/test \" style=\'border: 0;height: 100%;left: 0;position: absolute;top: 0;width: 100%;\' ></iframe></div>";
console.log(input.match(urlRegex)[1]); // apply the regex and log the result
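
Since the question asks for Python, here is a minimal sketch of the same idea with the re module. It assumes the URL is the first argument of a window.open(...) call, as in the question's script; the variable names are illustrative only. Anchoring on the quotes avoids a pitfall of the whitespace-based pattern above, which would also capture the trailing `');` when applied to that script.

```python
import re

# The script text from the question, e.g. taken from a <script> tag.
script = """
function download() {
    window.open('https://somelink.com');
}
"""

# Capture whatever sits between the quotes of the first window.open() call.
match = re.search(r"window\.open\(\s*['\"]([^'\"]+)['\"]", script)
url = match.group(1) if match else None
print(url)
```

If the page contains several such calls, re.findall with the same pattern returns all of the captured URLs.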

Related

Python save embed base64 pdf string into file

I have an HTML document with an embedded PDF document in base64-encoded format. I would like to extract the string and save it as a PDF file, using the code below.
But when I open the result in Adobe Reader, it reports an invalid format. I am looking for a way to fix this.
I think the PDF data was encoded using the JavaScript encodeURIComponent function and needs to be decoded in Python.
Sample embed tag:
<embed type="application/pdf" src="data:application/pdf;base64,JVBERi0xLjQKJeLjz9MKMSAwIG9iago8PC9D">
Code
import base64
from io import BytesIO
from pathlib import Path

def decode_b64():
    b64 = "JVBERi0xLjQKJeLjz9MKMSAwIG9iago8PC9D"
    buffer = BytesIO()
    content = base64.b64decode(b64)
    buffer.write(content)
    with open(Path(Path.home(), 'Downloads', 'mytest.pdf'), "wb") as f:
        f.write(buffer.getvalue())

if __name__ == "__main__":
    decode_b64()
=== Update 1:
I found a way to do the conversion in JavaScript. It would be nice to port this code to Python.
const {readFileSync, writeFile, promises: fsPromises} = require('fs');
var data = readFileSync("pdf-file.html", 'utf-8');
var DOMParser = require('xmldom').DOMParser;
var parser = new DOMParser();
const virtualDoc = parser.parseFromString(data, 'text/html');
var elem = virtualDoc.getElementsByTagName('embed')[0];
for (var i = 0; i < elem.attributes.length; i++) {
    var attrib = elem.attributes[i];
    if (attrib.specified) {
        if (attrib.name == "src") {
            var result = attrib.value;
            result = result.replace('data:application/pdf;base64,', '');
            let buff = Buffer.from(decodeURIComponent(result), 'base64');
            writeFile('pdf-file.pdf', buff, err => {
                if (err) {
                    console.error(err);
                }
            });
        }
    }
}

This is something you should have been able to chase down yourself. I wasn't 100% sure how JavaScript's encodeURIComponent encodes the two base64 characters '+' and '/', so I wrote up a simple HTML page:
<script>
var s = "abcde++defgh//";
alert(encodeURIComponent(s));
</script>
When I ran that page, the result was "abcde%2B%2Bdefgh%2F%2F", and that is all the information you need to fix up those strings:
import base64
from pathlib import Path

def decode_b64():
    b64 = "JVBERi0xLjQKJeLjz9MKMSAwIG9iago8PC9D......"
    # Undo the percent-encoding that encodeURIComponent applied to '+' and '/'
    b64 = b64.replace('%2B','+').replace('%2F','/')
    content = base64.b64decode(b64)
    with open(Path(Path.home(), 'Downloads', 'mytest.pdf'), "wb") as f:
        f.write(content)
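
A slightly more general variant, as a sketch: urllib.parse.unquote from the standard library reverses every percent-escape in one step, including the `%3D` that encodeURIComponent produces for base64 `=` padding. The function name decode_data_uri and the sample strings are illustrative assumptions, not part of the original post.

```python
import base64
from urllib.parse import unquote

PREFIX = "data:application/pdf;base64,"


def decode_data_uri(src: str) -> bytes:
    """Decode the src attribute of an <embed> tag holding a base64 PDF."""
    if not src.startswith(PREFIX):
        raise ValueError("not a base64 PDF data URI")
    # Reverse encodeURIComponent first (%2B -> +, %2F -> /, %3D -> =),
    # then decode the base64 payload.
    payload = unquote(src[len(PREFIX):])
    return base64.b64decode(payload)


# The truncated sample from the question; only the PDF header is present.
sample = PREFIX + "JVBERi0xLjQKJeLjz9MKMSAwIG9iago8PC9D"
pdf_bytes = decode_data_uri(sample)
print(pdf_bytes[:8])
```

The returned bytes can then be written to a file opened in "wb" mode, exactly as in the function above.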

python/ruby regex for domain in text

I have the text below and need to find all domains in it: with and without subdomains, and with and without an https prefix.
opt-i.mydomain.com 'www-oud.mydomain.com'
https\\u003a\\u002f\\u002fapi.noddos.com\\u002fabout\\u002fen-us\\u002fsignin\\u002f"
<script type="text/javascript">
(function(){
var baseUrl = 'https\x3A\x2F\x2Fhubspot.mydomain.com';
var baseUrl = 'https\x3A\x2F\x2Fhub-spot.mydomain.com';
</script>
#=========================================================
define("modules/constants/env", [], function() {
return {"BATCH_THUMB_ENDPOINTS": [], "LIVE_TRANSCODE_SERVER": "showbox-tr.dropbox.com", "STATIC_CONTENT_HOST": "cfl.dropboxstatic.com", "NOTES_WEBSERVER": "paper.dropbox.com", "REDIRECT_SAFE_ORIGINS": ["www.dropbox.com", "dropbox.com", "api.dropboxapi.com", "api.dropbox.com", "linux.dropbox.com", "photos.dropbox.com", "carousel.dropbox.com", "client-web.dropbox.com", "services.pp.dropbox.com", "www.dropbox.com", "docsend.com", "paper.dropbox.com", "notes.dropbox.com", "test.composer.dropbox.com", "showcase.dropbox.com", "collections.dropbox.com", "embedded.hellosign.com", "help.dropbox.com", "help-stg.dropbox.com", "experience.dropbox.com", "learn.dropbox.com", "learn-stage.dropbox.com", "app.hellosign.com", "replay.dropbox.com"], "PROF_SHARING_WEBSERVER": "showcase.dropbox.com", "FUNCAPTCHA_SERVER": "dropboxcaptcha.com", "__esModule": true};
});
#========================================================
https://mydomain.co/path/2
https://api.mydomain.co/path/2
https://api-v1.mydomain.co/path/2
https://superdomain.com:443
https://api.superdomain.com:443
https\\u003a\\u002f\\u002fapi.noddos.com\\u002fabout\\u002fen-us\\u002fsignin\\u002f"
https\\u003a\\u002f\\u002fnoddos.com\\u002fabout\\u002fen-us\\u002fsignin\\u002f"
root.I13N_config.location = "https:\u002F\u002Flocation.com\u002Faccount\u002Fchallenge\u002Frecaptcha
root.I13N_config.location = "https:\u002F\u002Fapi.location.com\u002Faccount\u002Fchallenge\u002Frecaptcha
&scope=openid%20profile%20https%3A%2F%2Fapi.domain2.com%2Fv2%2FOfficeHome.All&response_mode=form_post&nonce
&scope=openid%20profile%20https%3A%2F%2Fdomain2.com%2Fv2%2FOfficeHome.All&response_mode=form_post&nonce
https%3a%2f%2fwww.anotherdomain.com%2fv2%2
I tried this regex, but it does not capture everything I need.
re.compile(
r'''((?<=x2[fF]|02[fF]|%2[fF])|(?<=//))(\w\.|\w[A-Za-z0-9-]{0,61}\w\.){1,3}[A-Za-z]{2,6}|(?<=["'])(\w\.|\w[A-Za-z0-9-]{0,61}\w\.){1,3}[A-Za-z]{2,6}(?=["'])''',
re.VERBOSE)
Result captured by the regex:
{'showbox-tr.dropbox.com', 'api.dropbox.com', 'api-v1.mydomain.co', 'www-oud.mydomain.com', 'cfl.dropboxstatic.com', 'hub-spot.mydomain.com', 'paper.dropbox.com', 'superdomain.com', 'api.superdomain.com', 'linux.dropbox.com', 'embedded.hellosign.com', 'api.location.com', 'api.dropboxapi.com', 'www.dropbox.com', 'location.com', 'api.domain2.com', 'dropbox.com', 'api.noddos.com', 'dropboxcaptcha.com', 'learn-stage.dropbox.com', 'test.composer.dropbox.com', 'help-stg.dropbox.com', 'replay.dropbox.com', 'domain2.com', 'hubspot.mydomain.com', 'learn.dropbox.com', 'help.dropbox.com', 'collections.dropbox.com', 'app.hellosign.com', 'api.mydomain.co', 'noddos.com', 'docsend.com', 'mydomain.co', 'notes.dropbox.com', 'photos.dropbox.com', 'client-web.dropbox.com', 'services.pp.dropbox.com', 'OfficeHome.All', 'showcase.dropbox.com', 'carousel.dropbox.com', 'experience.dropbox.com'}
Expected result (in general, the regex above fails to capture "opt-i.mydomain.com"):
{'opt-i.mydomain.com', 'hubspot.mydomain.com', 'embedded.hellosign.com', 'carousel.dropbox.com', 'api.dropboxapi.com', 'experience.dropbox.com', 'linux.dropbox.com', 'api.superdomain.com', 'noddos.com', 'showcase.dropbox.com', 'app.hellosign.com', 'www-oud.mydomain.com', 'showbox-tr.dropbox.com', 'help-stg.dropbox.com', 'api.domain2.com', 'notes.dropbox.com', 'paper.dropbox.com', 'services.pp.dropbox.com', 'collections.dropbox.com', 'learn.dropbox.com', 'location.com', 'api.location.com', 'docsend.com', 'api.dropbox.com', 'replay.dropbox.com', 'mydomain.co', 'hub-spot.mydomain.com', 'www.dropbox.com', 'learn-stage.dropbox.com', 'domain2.com', 'help.dropbox.com', 'api.mydomain.co', 'api-v1.mydomain.co', 'superdomain.com', 'dropboxcaptcha.com', 'api.noddos.com', 'dropbox.com', 'test.composer.dropbox.com', 'cfl.dropboxstatic.com', 'client-web.dropbox.com', 'opt-i.mydomain.com', 'photos.dropbox.com'}
I also tested the regex below. It matches the domains better than the previous one, but it has a problem with domains that are preceded by Unicode escapes. For example:
"https\\u003a\\u002f\\u002fapi.noddos.com" is captured as "u002fapi.noddos.com", but we need "api.noddos.com".
re.compile(
r'''
([a-z0-9][a-z0-9\-]{0,61}[a-z0-9]\.)+[a-z0-9][a-z0-9\-]*[a-z0-9]
''', re.VERBOSE)

import re
regex = r"(?:[a-z0-9][a-z0-9\-]{0,61}[a-z0-9]\.)+[a-z0-9][a-z0-9\-]*[a-z0-9]"
data = """opt-i.mydomain.com 'www-oud.mydomain.com'
https\\u003a\\u002f\\u002fapi.noddos.com\\u002fabout\\u002fen-us\\u002fsignin\\u002f"
<script type="text/javascript">
(function(){
var baseUrl = 'https\x3A\x2F\x2Fhubspot.mydomain.com';
var baseUrl = 'https\x3A\x2F\x2Fhub-spot.mydomain.com';
</script>
#=========================================================
define("modules/constants/env", [], function() {
return {"BATCH_THUMB_ENDPOINTS": [], "LIVE_TRANSCODE_SERVER": "showbox-tr.dropbox.com", "STATIC_CONTENT_HOST": "cfl.dropboxstatic.com", "NOTES_WEBSERVER": "paper.dropbox.com", "REDIRECT_SAFE_ORIGINS": ["www.dropbox.com", "dropbox.com", "api.dropboxapi.com", "api.dropbox.com", "linux.dropbox.com", "photos.dropbox.com", "carousel.dropbox.com", "client-web.dropbox.com", "services.pp.dropbox.com", "www.dropbox.com", "docsend.com", "paper.dropbox.com", "notes.dropbox.com", "test.composer.dropbox.com", "showcase.dropbox.com", "collections.dropbox.com", "embedded.hellosign.com", "help.dropbox.com", "help-stg.dropbox.com", "experience.dropbox.com", "learn.dropbox.com", "learn-stage.dropbox.com", "app.hellosign.com", "replay.dropbox.com"], "PROF_SHARING_WEBSERVER": "showcase.dropbox.com", "FUNCAPTCHA_SERVER": "dropboxcaptcha.com", "__esModule": true};
});
#========================================================
https://mydomain.co/path/2
https://api.mydomain.co/path/2
https://api-v1.mydomain.co/path/2
https://superdomain.com:443
https://api.superdomain.com:443
https\\u003a\\u002f\\u002fapi.noddos.com\\u002fabout\\u002fen-us\\u002fsignin\\u002f"
https\\u003a\\u002f\\u002fnoddos.com\\u002fabout\\u002fen-us\\u002fsignin\\u002f"
root.I13N_config.location = "https:\u002F\u002Flocation.com\u002Faccount\u002Fchallenge\u002Frecaptcha
root.I13N_config.location = "https:\u002F\u002Fapi.location.com\u002Faccount\u002Fchallenge\u002Frecaptcha
&scope=openid%20profile%20https%3A%2F%2Fapi.domain2.com%2Fv2%2FOfficeHome.All&response_mode=form_post&nonce
&scope=openid%20profile%20https%3A%2F%2Fdomain2.com%2Fv2%2FOfficeHome.All&response_mode=form_post&nonce"""
data = re.sub(r"u[0-9a-z]{4}", "", data)
matches = re.findall(regex, data)
#output matches:
['opt-i.mydomain.com', 'www-oud.mydomain.com', 'api.noddos.com', 'ht.mydomain.com', 'hub-spot.mydomain.com', 'showbox-tr.dropbox.com', 'cfl.dropboxstatic.com', 'paper.dropbox.com', 'www.dropbox.com', 'dropbox.com', 'api.dropboxapi.com', 'api.dropbox.com', 'linux.dropbox.com', 'photos.dropbox.com', 'carousel.dropbox.com', 'client-web.dropbox.com', 'services.pp.dropbox.com', 'www.dropbox.com', 'docsend.com', 'paper.dropbox.com', 'notes.dropbox.com', 'test.composer.dropbox.com', 'showcase.dropbox.com', 'collections.dropbox.com', 'embedded.hellosign.com', 'help.dropbox.com', 'help-stg.dropbox.com', 'experience.dropbox.com', 'learn.dropbox.com', 'learn-stage.dropbox.com', 'app.hellosign.com', 'replay.dropbox.com', 'showcase.dropbox.com', 'dropboxcaptcha.com', 'mydomain.co', 'api.mydomain.co', 'api-v1.mydomain.co', 'somain.com', 'api.somain.com', 'api.noddos.com', 'noddos.com', 'config.location', 'location.com', 'config.location', 'api.location.com', 'api.domain2.com', 'domain2.com']
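
Note that the output above contains 'ht.mydomain.com' and 'somain.com': deleting every `u` followed by four letters or digits also eats parts of real host names such as "hubspot" and "superdomain". A possible alternative, sketched here under the assumption that the escapes are always `\uXXXX`, `\xXX` or percent-encoded, is to decode them before matching instead of deleting them. The names ESCAPE, DOMAIN and decode_escapes, and the shortened sample text, are illustrative only.

```python
import re
from urllib.parse import unquote

# Matches \uXXXX and \xXX escapes, with one or more leading backslashes.
ESCAPE = re.compile(r"\\+(?:u([0-9a-fA-F]{4})|x([0-9a-fA-F]{2}))")
# Same domain pattern as in the answer above.
DOMAIN = re.compile(
    r"(?:[a-z0-9][a-z0-9\-]{0,61}[a-z0-9]\.)+[a-z0-9][a-z0-9\-]*[a-z0-9]"
)


def decode_escapes(text):
    """Turn backslash-u, backslash-x and percent escapes into plain characters."""
    text = ESCAPE.sub(lambda m: chr(int(m.group(1) or m.group(2), 16)), text)
    return unquote(text)


# A raw string, so the backslashes reach the regex exactly as written.
sample = r"""
https\\u003a\\u002f\\u002fapi.noddos.com\\u002fabout
var baseUrl = 'https\x3A\x2F\x2Fhubspot.mydomain.com';
scope=openid%20profile%20https%3A%2F%2Fapi.domain2.com%2Fv2
https://superdomain.com:443
"""

domains = sorted(set(DOMAIN.findall(decode_escapes(sample))))
print(domains)
```

This sketch does not filter out non-domain tokens such as 'config.location'; those would still need a check against a list of valid top-level domains.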

How to store python output received in node js?

I'm invoking a Python script from Node.js. The Python script retrieves data from a REST API, stores it in a dataframe, and then runs a search based on user input. I'm confused about what variable type Python uses when it sends the data to Node.js. I've tried converting it to a string, but Node.js reports an unresolved variable type. Here's the code:
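import json
import sys

import requests
from pandas import json_normalize

r = requests.get(url)  # url: the REST API endpoint, defined elsewhere
data = r.json()
nested = json.loads(r.text)
nested_full = json_normalize(nested)
req_data = json_normalize(nested, record_path='items')
# The third column name was garbled in the original post;
# "requestor_name" is assumed from the filter used below.
search = req_data.get(["name", "id", "requestor_name"])
#search.head(10)
filter = sys.argv[1:]
print(filter)
input = filter[0]
print(input)
result = search[search["requestor_name"].str.contains(input)]
result = result.to_string(index=False)
response = '```' + str(result) + '```'
print(response)
sys.stdout.flush()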
Here's the Node.js program that invokes the Python script above. How do I store the output in a format that I can pass to another function in Node?
var input = 'robert';
var childProcess = require("child_process").spawn('python', ['./search.py', input], {stdio: 'inherit'})
const stream = require('stream');
const format = require('string-format')
childProcess.on('data', function(data){
process.stdout.write("python script output",data)
result += String(data);
console.log("Here it is", data);
});
childProcess.on('close', function(code) {
if ( code === 1 ){
process.stderr.write("error occured",code);
process.exit(1);
}
else{
process.stdout.write('done');
}
});

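According to the Node.js child_process docs, the 'data' event is emitted on childProcess.stdout, not on the child process object itself. Note that with {stdio: 'inherit'} childProcess.stdout is null, so drop that option (the default is 'pipe'):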
childProcess.stdout.on('data', (data) => {
console.log(`stdout: ${data}`);
});
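
As for the variable type: whatever the Python script prints arrives in Node as raw bytes (Buffer chunks), so no Python type survives the trip. One common option, sketched below, is to print a single JSON document and parse it on the Node side with JSON.parse once the stream has closed. The sample records and the search helper are hypothetical stand-ins for the dataframe logic in the question.

```python
import json
import sys


def search(records, needle):
    """Return the records whose 'requestor_name' contains the needle."""
    return [r for r in records if needle in r.get("requestor_name", "")]


def main(argv):
    # Hypothetical sample data standing in for the REST API response.
    records = [
        {"name": "ticket-1", "id": 1, "requestor_name": "robert"},
        {"name": "ticket-2", "id": 2, "requestor_name": "alice"},
    ]
    needle = argv[1] if len(argv) > 1 else ""
    # Write exactly one JSON document to stdout and nothing else,
    # so the caller can parse the collected output in one go.
    json.dump(search(records, needle), sys.stdout)
    sys.stdout.flush()


if __name__ == "__main__":
    main(sys.argv)
```

With a pandas dataframe, result.to_json(orient="records") produces a comparable JSON string. Debug prints should go to stderr so they do not corrupt the JSON on stdout.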

Google text detection api - Web demo result is different from using api

I have tried both the Google Vision API text detection feature and Google's web demo to OCR my image. The two results are not the same.
First, I tried the demo at https://cloud.google.com/vision/docs/drag-and-drop. Then I tried the API from Python code. The two results differ and I don't know why. Could you please help me with this problem?
My image: http://dfp.crawl.kyanon.digital/crawled_images/m.vta/1931/m.vta-home-slidebanner-image/2/assets/100000_samsung-galaxy-m20.png
My api result: "SAMSUNG Galaxy M20Siêu Pin vô doi, sac nhanh tuc thiMoiSAMSUNG4.990.000dTrà gop 0%Mua ngay"
My web demo result: https://imge.to/i/q4gRw
Thank you very much
My Python code:
client = vision.ImageAnnotatorClient()
raw_byte = cv2.imencode('.jpg', image)[1].tostring()
post_image = types.Image(content=raw_byte)
image_context = vision.types.ImageContext()
response = client.text_detection(image=post_image, image_context=image_context)

This is TypeScript code, but the idea is to use something like document_text_detection instead of text_detection (I am unsure what exactly the Python API provides).
Using documentTextDetection() instead of textDetection() solved the exact same problem for me.
const fs = require("fs");
const path = require("path");
const vision = require("@google-cloud/vision");
async function quickstart() {
let text = '';
const fileName = "j056vt-_800w_800h_sb.jpg";
const imageFile = fs.readFileSync(fileName);
const image = Buffer.from(imageFile).toString("base64");
const client = new vision.ImageAnnotatorClient();
const request = {
image: {
content: image
},
imageContext: {
languageHints: ["vi-VN"]
}
};
const [result] = await client.documentTextDetection(request);
// OUTPUT METHOD A
for (const tmp of result.textAnnotations) {
text += tmp.description + "\n";
}
console.log(text);
const out = path.basename(fileName, path.extname(fileName)) + ".txt";
fs.writeFileSync(out, text);
// OUTPUT METHOD B
const fullTextAnnotation = result.fullTextAnnotation;
console.log(`Full text: ${fullTextAnnotation.text}`);
fullTextAnnotation.pages.forEach(page => {
page.blocks.forEach(block => {
console.log(`Block confidence: ${block.confidence}`);
block.paragraphs.forEach(paragraph => {
console.log(`Paragraph confidence: ${paragraph.confidence}`);
paragraph.words.forEach(word => {
const wordText = word.symbols.map(s => s.text).join("");
console.log(`Word text: ${wordText}`);
console.log(`Word confidence: ${word.confidence}`);
word.symbols.forEach(symbol => {
console.log(`Symbol text: ${symbol.text}`);
console.log(`Symbol confidence: ${symbol.confidence}`);
});
});
});
});
});
}
quickstart();
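
For the Python client, a rough equivalent might look like the sketch below. It assumes a recent google-cloud-vision release (2.x or later) and working application credentials; the function name detect_document_text and the default "vi" language hint are illustrative choices, not taken from the original post.

```python
def detect_document_text(path, language_hints=("vi",)):
    """Run dense-text OCR on a local image file and return the full text."""
    # Imported here so that this module loads even where the client
    # library is not installed; normally this goes at the top of the file.
    from google.cloud import vision

    client = vision.ImageAnnotatorClient()
    with open(path, "rb") as f:
        image = vision.Image(content=f.read())

    # document_text_detection is the Python counterpart of the
    # documentTextDetection() call used in the TypeScript code above.
    response = client.document_text_detection(
        image=image,
        image_context={"language_hints": list(language_hints)},
    )
    if response.error.message:
        raise RuntimeError(response.error.message)
    return response.full_text_annotation.text
```

Calling detect_document_text with the path of a local copy of the image should return the recognized text as one string, with line breaks where the service detected them.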

Actually, comparing both of your results, the only difference I see is the way the results are displayed. The Google Cloud Drag and Drop site displays the results with the bounding boxes and tries to find areas of text.
The response you get with your python script includes the same information. A few examples:
texts = response.text_annotations
print([i.description for i in texts])
# prints all the words that were found in the image
print([i.bounding_poly.vertices for i in texts])
# prints all boxes around detected words
Feel free to ask more questions for clarification.
A few other thoughts:
Are you preprocessing your images?
What size are the images?

Python: How to get full match with RegEx

I'm trying to extract a link from some JavaScript. The JavaScript part isn't relevant anymore because I transformed it into a string (text).
Here is the script part:
<script>
setTimeout("location.href = 'https://airdownload.adobe.com/air/win/download/30.0/AdobeAIRInstaller.exe';", 2000);
$(function() {
$("#whats_new_panels").bxSlider({
controls: false,
auto: true,
pause: 15000
});
});
setTimeout(function(){
$("#download_messaging").hide();
$("#next_button").show();
}, 10000);
</script>
Here is what I do:
import re

def get_link_from_text(text):
    text = text.replace('\n', '')
    text = text.replace('\t', '')
    text = re.sub(' +', ' ', text)
    search_for = re.compile("href[ ]*=[ ]*'[^;]*")
    debug = re.search(search_for, text)
    return debug
What I want is the href link, and I kind of get it, but for some reason only like this:
<_sre.SRE_Match object; span=(30, 112), match="href = 'https://airdownload.adobe.com/air/win/dow>
and not like I want it to be:
<_sre.SRE_Match object; span=(30, 112), match="href = 'https://airdownload.adobe.com/air/win/download/30.0/AdobeAIRInstaller.exe'">
So my question is how to get the full link and not only a part of it.
Might the problem be that re.search isn't returning longer strings? I tried altering the regex, and I even tried matching the link character by character, but it still returns only the part shown above.

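I've modified it slightly, and for me it now returns the complete string you want. Note that the repr of a match object truncates the displayed match text, so the output in the question is only shortened for display; the match itself is complete, and match.group(0) returns it in full.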
import re
text = """
<script>
setTimeout("location.href = 'https://airdownload.adobe.com/air/win/download/30.0/AdobeAIRInstaller.exe';", 2000);
$(function() {
$("#whats_new_panels").bxSlider({
controls: false,
auto: true,
pause: 15000
});
});
setTimeout(function(){
$("#download_messaging").hide();
$("#next_button").show();
}, 10000);
</script>
"""
def get_link_from_text(text):
    text = text.replace('\n', '')
    text = text.replace('\t', '')
    text = re.sub(' +', ' ', text)
    search_for = re.compile("href[ ]*=[ ]*'[^;]*")
    debug = search_for.findall(text)
    print(debug)

get_link_from_text(text)
Output:
["href = 'https://airdownload.adobe.com/air/win/download/30.0/AdobeAIRInstaller.exe'"]
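
If only the URL itself is wanted, without the leading `href = '` and the closing quote, a capture group does the job. This is a small sketch that assumes the link is always wrapped in single quotes, as in the script above; the variable names are illustrative.

```python
import re

script = """setTimeout("location.href = 'https://airdownload.adobe.com/air/win/download/30.0/AdobeAIRInstaller.exe';", 2000);"""

# The parentheses capture only the text between the single quotes.
match = re.search(r"href\s*=\s*'([^']*)'", script)
link = match.group(1) if match else None
print(link)
```

Here match.group(0) would give the whole matched text and match.group(1) only the captured URL; neither is truncated the way the repr of the match object is.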
