I have html document embeded with pdf document in base64 encoded format. I like to extract the string and save it as pdf file. using below code to save it as pdf file.
but its on opening in adobe reader, saying invalid format. looking to fix this issue.
I think pdf file encoded using Javascript encodeURIComponent function. need to convert using Python.
sample embed tag
<embed type="application/pdf" src="data:application/pdf;base64,JVBERi0xLjQKJeLjz9MKMSAwIG9iago8PC9D">
Code
import base64
def decode_b64():
b64 = "JVBERi0xLjQKJeLjz9MKMSAwIG9iago8PC9D"
buffer = BytesIO.BytesIO()
content = base64.b64decode(b64)
buffer.write(content)
with open(Path(Path.home(), 'Downloads', 'mytest.pdf'), "wb") as f:
f.write(buffer.getvalue())
if __name__ == "__main__":
decode_b64()
=== Update 1:
found the way to convert using JavaScript: It will be nice if we can port this code to Python.
const {readFileSync, writeFile, promises: fsPromises} = require('fs');
var data=readFileSync("pdf-file.html", 'utf-8')
var DOMParser = require('xmldom').DOMParser;
var parser = new DOMParser();
const virtualDoc = parser.parseFromString(data, 'text/html');
var elem = virtualDoc.getElementsByTagName('embed')[0];
for (var i = 0; i < elem.attributes.length; i++) {
var attrib = elem.attributes[i];
if (attrib.specified) {
if( attrib.name == "src") {
var result =attrib.value
result=result.replace('data:application/pdf;base64,','');
let buff = Buffer.from(decodeURIComponent(result), 'base64');
writeFile('pdf-file.pdf', buff, err => {
if (err) {
console.error(err);
}
});
}
}
}
This is a situation that you should have been able to chase down yourself. I wasn't 100% sure how Javascript encoded those two characters, so I wrote up a simple HTML page:
<script>
var s = "abcde++defgh//";
alert(encodeURIComponent(s));
</script>
When I ran that page, the result was "abcde%2B%2Bdefgh%2F%2F", and that is all the information you need to fix up those strings.
def decode_b64():
b64 = "JVBERi0xLjQKJeLjz9MKMSAwIG9iago8PC9D......"
b64 = b64.replace('%2B','+').replace('%2F','/')
content = base64.b64decode(b64)
with open(Path(Path.home(), 'Downloads', 'mytest.pdf'), "wb") as f:
f.write(content)
I have text as result i need to find all domains on this text, domains with subdomain and without with https and without;
opt-i.mydomain.com 'www-oud.mydomain.com'
https\\u003a\\u002f\\u002fapi.noddos.com\\u002fabout\\u002fen-us\\u002fsignin\\u002f"
<script type="text/javascript">
(function(){
var baseUrl = 'https\x3A\x2F\x2Fhubspot.mydomain.com';
var baseUrl = 'https\x3A\x2F\x2Fhub-spot.mydomain.com';
</script>
#=========================================================
define("modules/constants/env", [], function() {
return {"BATCH_THUMB_ENDPOINTS": [], "LIVE_TRANSCODE_SERVER": "showbox-tr.dropbox.com", "STATIC_CONTENT_HOST": "cfl.dropboxstatic.com", "NOTES_WEBSERVER": "paper.dropbox.com", "REDIRECT_SAFE_ORIGINS": ["www.dropbox.com", "dropbox.com", "api.dropboxapi.com", "api.dropbox.com", "linux.dropbox.com", "photos.dropbox.com", "carousel.dropbox.com", "client-web.dropbox.com", "services.pp.dropbox.com", "www.dropbox.com", "docsend.com", "paper.dropbox.com", "notes.dropbox.com", "test.composer.dropbox.com", "showcase.dropbox.com", "collections.dropbox.com", "embedded.hellosign.com", "help.dropbox.com", "help-stg.dropbox.com", "experience.dropbox.com", "learn.dropbox.com", "learn-stage.dropbox.com", "app.hellosign.com", "replay.dropbox.com"], "PROF_SHARING_WEBSERVER": "showcase.dropbox.com", "FUNCAPTCHA_SERVER": "dropboxcaptcha.com", "__esModule": true};
});
#========================================================
https://mydomain.co/path/2
https://api.mydomain.co/path/2
https://api-v1.mydomain.co/path/2
https://superdomain.com:443
https://api.superdomain.com:443
https\\u003a\\u002f\\u002fapi.noddos.com\\u002fabout\\u002fen-us\\u002fsignin\\u002f"
https\\u003a\\u002f\\u002fnoddos.com\\u002fabout\\u002fen-us\\u002fsignin\\u002f"
root.I13N_config.location = "https:\u002F\u002Flocation.com\u002Faccount\u002Fchallenge\u002Frecaptcha
root.I13N_config.location = "https:\u002F\u002Fapi.location.com\u002Faccount\u002Fchallenge\u002Frecaptcha
&scope=openid%20profile%20https%3A%2F%2Fapi.domain2.com%2Fv2%2FOfficeHome.All&response_mode=form_post&nonce
&scope=openid%20profile%20https%3A%2F%2Fdomain2.com%2Fv2%2FOfficeHome.All&response_mode=form_post&nonce
https%3a%2f%2fwww.anotherdomain.com%2fv2%2
i try this regex but its not capture all i need.
re.compile(
r'''((?<=x2[fF]|02[fF]|%2[fF])|(?<=//))(\w\.|\w[A-Za-z0-9-]{0,61}\w\.){1,3}[A-Za-z]{2,6}|(?<=["'])(\w\.|\w[A-Za-z0-9-]{0,61}\w\.){1,3}[A-Za-z]{2,6}(?=["'])''',
re.VERBOSE)
captured result by regex:
{'showbox-tr.dropbox.com', 'api.dropbox.com', 'api-v1.mydomain.co', 'www-oud.mydomain.com', 'cfl.dropboxstatic.com', 'hub-spot.mydomain.com', 'paper.dropbox.com', 'superdomain.com', 'api.superdomain.com', 'linux.dropbox.com', 'embedded.hellosign.com', 'api.location.com', 'api.dropboxapi.com', 'www.dropbox.com', 'location.com', 'api.domain2.com', 'dropbox.com', 'api.noddos.com', 'dropboxcaptcha.com', 'learn-stage.dropbox.com', 'test.composer.dropbox.com', 'help-stg.dropbox.com', 'replay.dropbox.com', 'domain2.com', 'hubspot.mydomain.com', 'learn.dropbox.com', 'help.dropbox.com', 'collections.dropbox.com', 'app.hellosign.com', 'api.mydomain.co', 'noddos.com', 'docsend.com', 'mydomain.co', 'notes.dropbox.com', 'photos.dropbox.com', 'client-web.dropbox.com', 'services.pp.dropbox.com', 'OfficeHome.All', 'showcase.dropbox.com', 'carousel.dropbox.com', 'experience.dropbox.com'}
expected results so in general its not capture "opt-i.mydomain.com":
{'opt-i.mydomain.com', 'hubspot.mydomain.com', 'embedded.hellosign.com', 'carousel.dropbox.com', 'api.dropboxapi.com', 'experience.dropbox.com', 'linux.dropbox.com', 'api.superdomain.com', 'noddos.com', 'showcase.dropbox.com', 'app.hellosign.com', 'www-oud.mydomain.com', 'showbox-tr.dropbox.com', 'help-stg.dropbox.com', 'api.domain2.com', 'notes.dropbox.com', 'paper.dropbox.com', 'services.pp.dropbox.com', 'collections.dropbox.com', 'learn.dropbox.com', 'location.com', 'api.location.com', 'docsend.com', 'api.dropbox.com', 'replay.dropbox.com', 'mydomain.co', 'hub-spot.mydomain.com', 'www.dropbox.com', 'learn-stage.dropbox.com', 'domain2.com', 'help.dropbox.com', 'api.mydomain.co', 'api-v1.mydomain.co', 'superdomain.com', 'dropboxcaptcha.com', 'api.noddos.com', 'dropbox.com', 'test.composer.dropbox.com', 'cfl.dropboxstatic.com', 'client-web.dropbox.com', 'opt-i.mydomain.com', 'photos.dropbox.com'}
i also test this regex it match all domains better than previus but problem is with doamin`s that has unicode example:
"https\\u003a\\u002f\\u002fapi.noddos.com" will capture "u002fapi.noddos.com" but we need "api.noddos.com"
re.compile(
r'''
([a-z0-9][a-z0-9\-]{0,61}[a-z0-9]\.)+[a-z0-9][a-z0-9\-]*[a-z0-9]
''', re.VERBOSE)
import re
regex = r"(?:[a-z0-9][a-z0-9\-]{0,61}[a-z0-9]\.)+[a-z0-9][a-z0-9\-]*[a-z0-9]"
data = """opt-i.mydomain.com 'www-oud.mydomain.com'
https\\u003a\\u002f\\u002fapi.noddos.com\\u002fabout\\u002fen-us\\u002fsignin\\u002f"
<script type="text/javascript">
(function(){
var baseUrl = 'https\x3A\x2F\x2Fhubspot.mydomain.com';
var baseUrl = 'https\x3A\x2F\x2Fhub-spot.mydomain.com';
</script>
#=========================================================
define("modules/constants/env", [], function() {
return {"BATCH_THUMB_ENDPOINTS": [], "LIVE_TRANSCODE_SERVER": "showbox-tr.dropbox.com", "STATIC_CONTENT_HOST": "cfl.dropboxstatic.com", "NOTES_WEBSERVER": "paper.dropbox.com", "REDIRECT_SAFE_ORIGINS": ["www.dropbox.com", "dropbox.com", "api.dropboxapi.com", "api.dropbox.com", "linux.dropbox.com", "photos.dropbox.com", "carousel.dropbox.com", "client-web.dropbox.com", "services.pp.dropbox.com", "www.dropbox.com", "docsend.com", "paper.dropbox.com", "notes.dropbox.com", "test.composer.dropbox.com", "showcase.dropbox.com", "collections.dropbox.com", "embedded.hellosign.com", "help.dropbox.com", "help-stg.dropbox.com", "experience.dropbox.com", "learn.dropbox.com", "learn-stage.dropbox.com", "app.hellosign.com", "replay.dropbox.com"], "PROF_SHARING_WEBSERVER": "showcase.dropbox.com", "FUNCAPTCHA_SERVER": "dropboxcaptcha.com", "__esModule": true};
});
#========================================================
https://mydomain.co/path/2
https://api.mydomain.co/path/2
https://api-v1.mydomain.co/path/2
https://superdomain.com:443
https://api.superdomain.com:443
https\\u003a\\u002f\\u002fapi.noddos.com\\u002fabout\\u002fen-us\\u002fsignin\\u002f"
https\\u003a\\u002f\\u002fnoddos.com\\u002fabout\\u002fen-us\\u002fsignin\\u002f"
root.I13N_config.location = "https:\u002F\u002Flocation.com\u002Faccount\u002Fchallenge\u002Frecaptcha
root.I13N_config.location = "https:\u002F\u002Fapi.location.com\u002Faccount\u002Fchallenge\u002Frecaptcha
&scope=openid%20profile%20https%3A%2F%2Fapi.domain2.com%2Fv2%2FOfficeHome.All&response_mode=form_post&nonce
&scope=openid%20profile%20https%3A%2F%2Fdomain2.com%2Fv2%2FOfficeHome.All&response_mode=form_post&nonce"""
data = re.sub(r"u[0-9a-z]{4}", "", data)
matches = re.findall(regex, data)
#output matches:
['opt-i.mydomain.com', 'www-oud.mydomain.com', 'api.noddos.com', 'ht.mydomain.com', 'hub-spot.mydomain.com', 'showbox-tr.dropbox.com', 'cfl.dropboxstatic.com', 'paper.dropbox.com', 'www.dropbox.com', 'dropbox.com', 'api.dropboxapi.com', 'api.dropbox.com', 'linux.dropbox.com', 'photos.dropbox.com', 'carousel.dropbox.com', 'client-web.dropbox.com', 'services.pp.dropbox.com', 'www.dropbox.com', 'docsend.com', 'paper.dropbox.com', 'notes.dropbox.com', 'test.composer.dropbox.com', 'showcase.dropbox.com', 'collections.dropbox.com', 'embedded.hellosign.com', 'help.dropbox.com', 'help-stg.dropbox.com', 'experience.dropbox.com', 'learn.dropbox.com', 'learn-stage.dropbox.com', 'app.hellosign.com', 'replay.dropbox.com', 'showcase.dropbox.com', 'dropboxcaptcha.com', 'mydomain.co', 'api.mydomain.co', 'api-v1.mydomain.co', 'somain.com', 'api.somain.com', 'api.noddos.com', 'noddos.com', 'config.location', 'location.com', 'config.location', 'api.location.com', 'api.domain2.com', 'domain2.com']
I have tried to use Google Vision API Text detection feature and the web demo of Google to OCR my image. Two results is not same.
Firstly, i tried it with demo at url, https://cloud.google.com/vision/docs/drag-and-drop. Finally, i tried it with google api code by python language. Two results is not same and i don't know why . Could you please help me this problem?
My image: http://dfp.crawl.kyanon.digital/crawled_images/m.vta/1931/m.vta-home-slidebanner-image/2/assets/100000_samsung-galaxy-m20.png
My api result: "SAMSUNG Galaxy M20Siêu Pin vô doi, sac nhanh tuc thiMoiSAMSUNG4.990.000dTrà gop 0%Mua ngay"
My web demo result: https://imge.to/i/q4gRw
Thank you very much
my python code here:
client = vision.ImageAnnotatorClient()
raw_byte = cv2.imencode('.jpg', image)[1].tostring()
post_image = types.Image(content=raw_byte)
image_context = vision.types.ImageContext()
response = client.text_detection(image=post_image, image_context=image_context)
This is Typescript code.
But the idea is not to use text_detection but something like document_text_detection (unsure what the python API specifically provides).
Using documentTextDetection() instead of textDetection() solved the exact same problem for me.
const fs = require("fs");
const path = require("path");
const vision = require("#google-cloud/vision");
async function quickstart() {
let text = '';
const fileName = "j056vt-_800w_800h_sb.jpg";
const imageFile = fs.readFileSync(fileName);
const image = Buffer.from(imageFile).toString("base64");
const client = new vision.ImageAnnotatorClient();
const request = {
image: {
content: image
},
imageContext: {
languageHints: ["vi-VN"]
}
};
const [result] = await client.documentTextDetection(request);
// OUTPUT METHOD A
for (const tmp of result.textAnnotations) {
text += tmp.description + "\n";
}
console.log(text);
const out = path.basename(fileName, path.extname(fileName)) + ".txt";
fs.writeFileSync(out, text);
// OUTPUT METHOD B
const fullTextAnnotation = result.fullTextAnnotation;
console.log(`Full text: ${fullTextAnnotation.text}`);
fullTextAnnotation.pages.forEach(page => {
page.blocks.forEach(block => {
console.log(`Block confidence: ${block.confidence}`);
block.paragraphs.forEach(paragraph => {
console.log(`Paragraph confidence: ${paragraph.confidence}`);
paragraph.words.forEach(word => {
const wordText = word.symbols.map(s => s.text).join("");
console.log(`Word text: ${wordText}`);
console.log(`Word confidence: ${word.confidence}`);
word.symbols.forEach(symbol => {
console.log(`Symbol text: ${symbol.text}`);
console.log(`Symbol confidence: ${symbol.confidence}`);
});
});
});
});
});
}
quickstart();
Actually, comparing both of your results, the only difference I see is the way the results are displayed. The Google Cloud Drag and Drop site displays the results with the bounding boxes and tries to find areas of text.
The response you get with your python script includes the same information. A few examples:
texts = response.text_annotations
print([i.description for i in texts])
# prints all the words that were found in the image
print([i.bounding_poly.vertices for i in texts])
# prints all boxes around detected words
Feel free to ask more questions for clarification.
A few other thoughts:
Are you preprocessing your images?
What size are the images?
I'm trying to filter out a link from some java script. The java script part isin't relevant anymore because I transfromed it into a string (text).
Here is the script part:
<script>
setTimeout("location.href = 'https://airdownload.adobe.com/air/win/download/30.0/AdobeAIRInstaller.exe';", 2000);
$(function() {
$("#whats_new_panels").bxSlider({
controls: false,
auto: true,
pause: 15000
});
});
setTimeout(function(){
$("#download_messaging").hide();
$("#next_button").show();
}, 10000);
</script>
Here is what I do:
import re
def get_link_from_text(text):
text = text.replace('\n', '')
text = text.replace('\t', '')
text = re.sub(' +', ' ', text)
search_for = re.compile("href[ ]*=[ ]*'[^;]*")
debug = re.search(search_for, text)
return debug
What I want is the href link and I kind of get it, but for some reason only like this
<_sre.SRE_Match object; span=(30, 112), match="href = 'https://airdownload.adobe.com/air/win/dow>
and not like I want it to be
<_sre.SRE_Match object; span=(30, 112), match="href = 'https://airdownload.adobe.com/air/win/download/30.0/AdobeAIRInstaller.exe'">
So my question is how to get the full link and not only a part of it.
Might the problem be that re.search isin't returning longer strings? Because I tried altering the RegEx, I even tried matching the link 1 by 1, but it still returns only the part I called out earlier.
I've modified it slightly, but for me it returns the complete string you desire now.
import re
text = """
<script>
setTimeout("location.href = 'https://airdownload.adobe.com/air/win/download/30.0/AdobeAIRInstaller.exe';", 2000);
$(function() {
$("#whats_new_panels").bxSlider({
controls: false,
auto: true,
pause: 15000
});
});
setTimeout(function(){
$("#download_messaging").hide();
$("#next_button").show();
}, 10000);
</script>
"""
def get_link_from_text(text):
text = text.replace('\n', '')
text = text.replace('\t', '')
text = re.sub(' +', ' ', text)
search_for = re.compile("href[ ]*=[ ]*'[^;]*")
debug = search_for.findall(text)
print(debug)
get_link_from_text(text)
Output:
["href = 'https://airdownload.adobe.com/air/win/download/30.0/AdobeAIRInstaller.exe'"]