Google Document AI giving different outputs for the same file - Python

I was using the Document OCR API to extract text from a PDF file, but part of it is not accurate. I found that the reason may be the presence of some Chinese characters.
The following is a made-up example in which I cropped part of the region where the extracted text is wrong and added some Chinese characters to reproduce the problem.
When I use the website version, I cannot get the Chinese characters but the remaining characters are correct.
When I use Python to extract the text, I can get the Chinese characters correctly but part of the remaining characters are wrong.
The actual string that I got.
Are the versions of Document AI in the website and API different? How can I get all the characters correctly?
Update:
When I print the detected_languages after printing the text, I get the following output. (I don't know why, but with lines = page.lines the detected_languages of both lines is an empty list; I had to switch to page.blocks or page.paragraphs first.)
Code:
from google.cloud import documentai_v1beta3 as documentai

project_id = 'secret-medium-xxxxxx'
location = 'us'  # Format is 'us' or 'eu'
processor_id = 'abcdefg123456'  # Create processor in Cloud Console

opts = {}
if location == "eu":
    opts = {"api_endpoint": "eu-documentai.googleapis.com"}

client = documentai.DocumentProcessorServiceClient(client_options=opts)


def get_text(doc_element: dict, document: dict):
    """
    Document AI identifies form fields by their offsets
    in document text. This function converts offsets
    to text snippets.
    """
    response = ""
    # If a text segment spans several lines, it will
    # be stored in different text segments.
    for segment in doc_element.text_anchor.text_segments:
        start_index = (
            int(segment.start_index)
            if segment in doc_element.text_anchor.text_segments
            else 0
        )
        end_index = int(segment.end_index)
        response += document.text[start_index:end_index]
    return response


def get_lines_of_text(file_path: str, location: str = location, processor_id: str = processor_id, project_id: str = project_id):
    # You must set the api_endpoint if you use a location other than 'us', e.g.:
    # opts = {}
    # if location == "eu":
    #     opts = {"api_endpoint": "eu-documentai.googleapis.com"}

    # The full resource name of the processor, e.g.:
    # projects/project-id/locations/location/processor/processor-id
    # You must create new processors in the Cloud Console first
    name = f"projects/{project_id}/locations/{location}/processors/{processor_id}"

    # Read the file into memory
    with open(file_path, "rb") as image:
        image_content = image.read()

    document = {"content": image_content, "mime_type": "application/pdf"}

    # Configure the process request
    request = {"name": name, "raw_document": document}
    result = client.process_document(request=request)
    document = result.document
    document_pages = document.pages
    response_text = []

    # For a full list of Document object attributes, please reference this page:
    # https://googleapis.dev/python/documentai/latest/_modules/google/cloud/documentai_v1beta3/types/document.html#Document

    # Read the text recognition output from the processor
    print("The document contains the following paragraphs:")
    for page in document_pages:
        lines = page.blocks
        for line in lines:
            block_text = get_text(line.layout, document)
            confidence = line.layout.confidence
            response_text.append((block_text[:-1] if block_text[-1:] == '\n' else block_text, confidence))
            print(f"Text: {block_text}")
            print("Detected Language", line.detected_languages)
    return response_text


if __name__ == '__main__':
    print(get_lines_of_text('/pdf path'))
It seems the language code is wrong; will this affect the result?

Posting this Community Wiki for better visibility.
One of the features of Document AI is OCR (Optical Character Recognition), which recognizes text in various kinds of files.
In this scenario the OP received different outputs from the Try it function and from the Python client library.
Why are there discrepancies between Try it and the Python library?
It's hard to say, as both methods use the same documentai_v1beta3 API. It might be related to modifications applied to the PDF when it is uploaded to the Try it demo, to different endpoints, to language/alphabet recognition, or to something else entirely.
When you use the Python client you also get a confidence score for the text identification. Below are examples from my tests:
However, the OP's confidence is only about 0.73, so wrong results are possible, and in this case the issue is visible. I don't think it can be improved in code. A higher-quality PDF might help (in the OP's example there are some dots which might affect the identification).
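If the low-confidence blocks are the main concern, one option is to surface them on the client side rather than trusting every block equally. A rough sketch, reusing the (text, confidence) tuples returned by get_lines_of_text from the question (the 0.8 threshold is an arbitrary choice, not a Document AI recommendation):
# flag OCR blocks whose confidence falls below an arbitrary threshold
LOW_CONFIDENCE = 0.8  # tune for your own documents

for text, confidence in get_lines_of_text('/pdf path'):
    marker = "CHECK" if confidence < LOW_CONFIDENCE else "ok"
    print(f"[{marker} {confidence:.2f}] {text}")
This does not improve the recognition itself; it only makes the suspect blocks easy to find for manual review.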

Related

I am looking to create an API endpoint route that returns txt in a JSON format - Python

I'm new to developing and my question involves creating an API endpoint in our route. The API will be used for a POST from a Vuetify UI. Data will come from our MongoDB. We will be getting a .txt file for our shell script, but it will have to be POSTed as JSON. I think these are the steps for converting the text file:
1) create a list for the lines of the .txt
2) add each line to the list
3) join the list elements into a string
4) create a dictionary with the file/file content and convert it to JSON
This is my current code for the steps:
import json

### something.txt: an example of the shell script ###
f = open("something.txt")

# create a list to put the lines of the file in
file_output = []

# add each line of the file to the list
for line in f:
    file_output.append(line)

# mash all of the list elements together into one string
fileoutput2 = ''.join(file_output)
print(fileoutput2)

# create a dict with file and file content and then convert to JSON
json_object = {"file": fileoutput2}
json_response = json.dumps(json_object)
print(json_response)

which prints:
{"file": "Hello\n\nSomething\n\nGoodbye"}
I have the following code for my baseline, which I execute on a button press in the UI:
@bp_customer.route('/install-setup/<string:customer_id>', methods=['POST'])
def install_setup(customer_id):
    cust = Customer()
    customer = cust.get_customer(customer_id)

    ### example of a series of lines with newline characters between them
    script_string = "Beginning\nof\nscript\n"
    json_object = {"file": script_string}
    json_response = json.dumps(json_object)

    # get the install shell script content
    # replace the values (somebody has already done this)
    # attempt to return the below example json_response
    return make_response(jsonify(json_response), 200)
My current Vuetify button-press code is here; I just have to amend it to a POST and the new route once this is established:
onClickScript() {
  console.log("clicked");
  axios
    .get("https://sword-gc-eadsusl5rq-uc.a.run.app/install-setup/")
    .then((resp) => {
      console.log("resp: ", resp.data);
      this.scriptData = resp.data;
    });
},
I'm having a hard time combining these 2 concepts in the correct way. Any input as to whether I'm on the right path? Insight from anyone who's much more experienced than me?
You're on the right path, but needlessly complicating things a bit. For example, the first bit could be just:
import json

with open("something.txt") as f:
    json_response = json.dumps({'file': f.read()})

print(json_response)
And since you're looking to pass everything through jsonify anyway, even this would suffice:
with open("something.txt") as f:
    data = {'file': f.read()}
You can then pass data directly through jsonify. The rest of it isn't sufficiently complete to offer any concrete comments, but the basic idea is OK.
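For instance, a minimal sketch of the route (assuming a plain Flask app rather than the blueprint in the question, and that something.txt sits next to the app) could be:
from flask import Flask, jsonify

app = Flask(__name__)

@app.route('/install-setup/<string:customer_id>', methods=['POST'])
def install_setup(customer_id):
    # read the shell script text and return it as {"file": "..."}
    with open("something.txt") as f:
        data = {'file': f.read()}
    return jsonify(data), 200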
If you have a working whole, you could go to https://codereview.stackexchange.com/ to ask for some reviews; questions on Stack Overflow should be limited to actual questions about getting something to work.

Problem passing markdown template via GitLab projects API in a Python script

I am trying to use the GitLab projects API to edit the MR templates of multiple projects. The problem is that it only sends the first line of the markdown template.
While messing around with the script, I was toying with converting the template to HTML and found that the whole template was sent when converted to HTML.
I am probably missing something super simple, but for the life of me I can't figure out why it can send the entire template in HTML but only the first line of it natively in markdown.
I have been searching for a solution for a bit now, so I apologize if my google-fu missed an obvious answer here.
Here is the script...
#!/usr/bin/env python3
import argparse
import requests

gitlab_addr = "https://gitlab.com/api/v4"

# Insert your project IDs into the array below.
project_IDs = [xxxx, yyyy, zzzz]

# Insert your MR template info below.
with open('/.gitlab/merge_request_templates/DefaultMRTemplate.md', 'r') as file:
    MR_template = file.read()
# print(MR_template)


def getArgs():
    parser = argparse.ArgumentParser(
        description='This tool updates the default template for a single '
                    'or multiple program\'s MRs. \n\nYou will need to edit '
                    'the script to input your MR template and projects IDs.'
                    '\nYou will also need to pass in your API Token via '
                    ' command line.\n\nYou want to see "200 OK" on the '
                    ' command line as confirmation.',
        formatter_class=argparse.RawTextHelpFormatter)
    parser.add_argument("token", type=str,
                        help="API Token. Create one at User Settings / Access Tokens")
    return parser.parse_args()


def ChangeTemplate():
    token = getArgs().token
    headers = {"PRIVATE-TOKEN": token, }
    for x in project_IDs:
        addr = f"{gitlab_addr}/projects/{x}/?merge_requests_template={MR_template}"
        response = requests.put(addr, headers=headers)
        # You want to see "200 OK" on the command line.
        print(response.status_code, response.reason)


def main():
    ChangeTemplate()


if __name__ == '__main__':
    main()
Here is a sample template...
See guidance here: https://example.com/Gitlab+MR+Guide
## Description
%% Put a description here %%
%% Add an issue link here %%
## Tests
%% Include test listing here %%
## Checklists
**Author Checklist**
- [ ] A: Did you fill out the description, add an issue link (in title or desc) and fill out the test section?
- [ ] A: Add a peer to the MR
**Assignee 1 Checklist:**
- [ ] P: Verify the description field is filled out, issue link is included (in title or desc) and the test section is filled out
- [ ] P: Add a code owner to the MR
**Assignee 2 (Code Owner) Checklist:**
- [ ] O: Verify the description field is filled out, issue link is included (in title or desc) and the test section is filled out
- [ ] O: Verify unit test coverage is at least 40% line coverage with a goal of 90%
Problem output:
See guidance here: https://example.com/Gitlab MR Guide
Your data needs to be properly encoded in the request. Trying to format the literal contents of the file into the query string won't work here.
Use the data keyword argument to requests.put, which will pass the data in the request body (or use params to set query params). requests will handle the proper encoding of the data.
addr = f"{gitlab_addr}/projects/{x}/"
payload = {'merge_requests_template': MR_template}
response = requests.put(addr, headers=headers, data=payload)
# or params=payload to use query string
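Applied to the question's ChangeTemplate function, the loop would look roughly like this (same variables and endpoint as in the original script):
def ChangeTemplate():
    token = getArgs().token
    headers = {"PRIVATE-TOKEN": token}
    payload = {'merge_requests_template': MR_template}
    for x in project_IDs:
        addr = f"{gitlab_addr}/projects/{x}/"
        # requests encodes the multi-line template in the request body
        response = requests.put(addr, headers=headers, data=payload)
        print(response.status_code, response.reason)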

How to assign a number to a word from an API response

I've created a custom machine learning model using Watson Knowledge Studio and deployed it to an NLU service. I've also managed to access my model in Python. My custom model has been designed to identify specific entity types such as Advice, Cancellation, Awareness, and so on. What I want to do is extract these entity types from the API JSON response and assign a number to them (e.g. Advice = 1, Cancellation = 2, Awareness = 3, etc.) and then write them, along with the sample text (e.g. "I want to cancel my subscription with Gameloft."), to a CSV file with column headings (ID, Sentence, Entity Type). I have already managed to extract the entity types and the sample text and write them to a .txt file; however, I need to write them to a CSV file.
import json
from watson_developer_cloud import NaturalLanguageUnderstandingV1
from watson_developer_cloud.natural_language_understanding_v1 \
    import Features, EntitiesOptions, KeywordsOptions

natural_language_understanding = NaturalLanguageUnderstandingV1(
    username='**************',
    password='*********',
    version='2018-03-16')

text = "I want to cancel my subscription with Gameloft."

response = natural_language_understanding.analyze(
    text=text,
    features=Features(
        entities=EntitiesOptions(
            emotion=True,
            sentiment=True,
            limit=2,
            model="**************"),
        keywords=KeywordsOptions(
            emotion=True,
            sentiment=True,
            limit=2)))

print(json.dumps(response, indent=2))

response['keywords'][0]['text']
response['entities'][0]['type']

if response['entities'][0]['type'] == "Cancellation":
    print('1')

with open('C:\\Users\\Results.txt', "w") as f:
    for x in response['entities']:
        f.write(x['type'] + ' ')
Please help me with the following:
How can I assign numbers to my entity types?
Is there a way to create a loop that loads multiple sentences/text to be analyzed by the NLU API?
How can I write everything(entity types, text and the numbers assigned to the entity types) to a CSV file?
You can use a dictionary to map entity types to numbers.
For example:
entity_number_dict = {
    'Advice': 1,
    'Cancellation': 2,
    'Awareness': 3
}

print(entity_number_dict[response['entities'][0]['type']])
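For the CSV part and for analyzing several sentences, something along these lines should work with the standard csv module. This is only a sketch: sentences is a hypothetical list of texts to analyze, the analyze() call mirrors the one in the question, and unknown entity types fall back to 0.
import csv

entity_number_dict = {'Advice': 1, 'Cancellation': 2, 'Awareness': 3}
sentences = ["I want to cancel my subscription with Gameloft."]  # hypothetical input list

with open('results.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['ID', 'Sentence', 'Entity Type'])
    for i, sentence in enumerate(sentences, start=1):
        response = natural_language_understanding.analyze(
            text=sentence,
            features=Features(
                entities=EntitiesOptions(limit=2, model="**************")))
        for entity in response['entities']:
            # map each entity type to its number, defaulting to 0 for unknown types
            writer.writerow([i, sentence, entity_number_dict.get(entity['type'], 0)])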

Python adding hyperlink to Outlook task through win32com

I would like to create a hyperlink in the body of a task created through win32com.
This is my code so far:
import datetime as dt
import win32com.client

outlook = win32com.client.Dispatch("Outlook.Application")
outlook_task_item = 3
recipient = "my_email@site.com"

task = outlook.CreateItem(outlook_task_item)
task.Subject = "hello world"
task.Body = "please update the file here"
task.DueDate = dt.datetime.today()
task.ReminderTime = dt.datetime.today()
task.ReminderSet = True
task.Save()
I have tried to set the property task.HTMLBody but I get the error:
AttributeError: Property 'CreateItem.HTMLBody' can not be set.
I have also tried
task.Body = "Here is the <a href='http://www.python.org'>link</a> I need"
but I am not getting a proper hyperlink.
However, if I create a task from the Outlook front end, I am able to add hyperlinks.
You can also try:
task.HTMLBody = "Here is the <a href='http://www.python.org'>link</a> I need"
This will overwrite the data in task.Body with the HTML provided in task.HTMLBody, so whichever of Body or HTMLBody is set last will be taken as the body of the item.
Tasks do not support HTML. Instead, you have to provide RTF.
You can investigate (but not set) the RTF of a given task through task.RTFBody (and task.RTFBody.obj for a convenient view of it). To use RTF in the body of a task, simply use the task.Body property; setting it to a byte array containing RTF will automatically use that RTF in the body. Concretely, to get the body you want, you could set
task.Body = rb'{\rtf1{Here is the }{\field{\*\fldinst { HYPERLINK "https://www.python.org" }}{\fldrslt {link}}}{ I need}}'
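Putting that together with the task-creation code from the question, a minimal sketch might look like this:
import datetime as dt
import win32com.client

outlook = win32com.client.Dispatch("Outlook.Application")
task = outlook.CreateItem(3)  # 3 = olTaskItem
task.Subject = "hello world"
# assigning an RTF byte string to Body makes Outlook render it as RTF,
# including the embedded HYPERLINK field
task.Body = rb'{\rtf1{Here is the }{\field{\*\fldinst { HYPERLINK "https://www.python.org" }}{\fldrslt {link}}}{ I need}}'
task.DueDate = dt.datetime.today()
task.ReminderTime = dt.datetime.today()
task.ReminderSet = True
task.Save()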

Setting pgNumType property in python-docx is without effect

I'm trying to set page numbers in a Word document using python-docx. I found an attribute pgNumType (pageNumberType), which I'm setting with this code:
from docx import Document
from docx.enum.section import WD_SECTION
from docx.oxml import OxmlElement
from docx.oxml.ns import qn

document = Document()
section = document.add_section(WD_SECTION.CONTINUOUS)
sections = document.sections
sectPr = sections[0]._sectPr

pgNumType = OxmlElement('w:pgNumType')
pgNumType.set(qn('w:fmt'), 'decimal')
pgNumType.set(qn('w:start'), '1')
sectPr.append(pgNumType)
This code does nothing; no page numbers appear in the output document. I did the same with the lnNumType attribute, which is for line numbers, and it worked fine. So what is it with the pgNumType attribute? The program executes without error, so the attribute exists. Does anyone know why it has no effect?
Your page style setting is fine, but it does not automatically insert a page number field anywhere in your document. Normally, page numbers appear in a header or a footer, but unfortunately python-docx does not currently support headers, footers or fields. The former two appear to be an ongoing work in progress: https://github.com/python-openxml/python-docx/issues/104.
The linked issue mentions a number of workarounds. The one I have found to be most robust is to create an otherwise blank document with the headers and footers set up exactly the way you want in MS Word. You can then load and append to that document instead of the default template that docx.Document returns.
This technique is implied to be the suggested method for editing headers in the official documentation:
A lot of how a document looks is determined by the parts that are left when you delete all the content. Things like styles and page headers and footers are contained separately from the main content, allowing you to place a good deal of customization in your starting document that then appears in the document you produce.
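In practice the workaround looks roughly like this; template_with_footer.docx is a hypothetical document you prepare once in Word, with the page-number field already placed in its footer:
from docx import Document

# start from a document whose footer already contains the page number field
document = Document('template_with_footer.docx')
document.add_paragraph('Body text goes here; the footer and page numbers '
                       'come from the starting document.')
document.save('output.docx')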
@T.Poe I was working on the same problem.
new_section = document.add_section() # Added new section for assigning different footer on each page.
sectPr = new_section._sectPr
pgNumType = OxmlElement('w:pgNumType')
pgNumType.set(qn('w:fmt'), 'decimal')
pgNumType.set(qn('w:start'), '1')
sectPr.append(pgNumType)
new_footer = new_section.footer # Get footer-area of the recent section in document
new_footer.is_linked_to_previous = False
footer_para = new_footer.add_paragraph()
run_footer = footer_para.add_run("Your footer here")
_add_number_range(run_footer)
font = run_footer.font
font.name = 'Arial'
font.size = Pt(8)
footer_para.paragraph_format.page_break_before = True
This worked for me :)
I don't know what's not working for you; I just created a new section.
The code for the functions I used to define the field is as follows:
def _add_field(run, field):
    """ add a field to a run
    """
    fldChar1 = OxmlElement('w:fldChar')  # creates a new element
    fldChar1.set(qn('w:fldCharType'), 'begin')  # sets attribute on element
    instrText = OxmlElement('w:instrText')
    instrText.set(qn('xml:space'), 'preserve')  # sets attribute on element
    instrText.text = field
    fldChar2 = OxmlElement('w:fldChar')
    fldChar2.set(qn('w:fldCharType'), 'separate')
    t = OxmlElement('w:t')
    t.text = "Seq"
    fldChar2.append(t)
    fldChar4 = OxmlElement('w:fldChar')
    fldChar4.set(qn('w:fldCharType'), 'end')
    r_element = run._r
    r_element.append(fldChar1)
    r_element.append(instrText)
    r_element.append(fldChar2)
    r_element.append(fldChar4)


def _add_number_range(run):
    """ add a number range field to a run
    """
    _add_field(run, r'Page')
