I am building the backend for a system in Django and generating PDF files using ReportLab. Notice that the dots and rings above some letters are shifted to the right for some reason. Why does this occur? The font is Times New Roman.
I found that the issue comes from pasting. I had the original text in a PDF document, which I copied and pasted into a .py file (UTF-8 encoded). When I typed the same words out myself, the problem disappeared.
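A likely explanation (an assumption, but consistent with the copy/paste trigger) is that the PDF text was in decomposed Unicode form (NFD), where the base letter and the combining dot or ring are separate code points; the font setup then draws the combining mark as a standalone glyph shifted to the right. A minimal sketch of the fix, normalizing to NFC before handing the string to ReportLab:

import unicodedata

# Text pasted from a PDF may be in decomposed form (NFD): base letter
# plus a separate combining mark, e.g. "A" + U+030A COMBINING RING ABOVE.
pasted = "Z\u0307A\u030A"  # hypothetical pasted text: Ż and Å as two code points each

# Recompose into single precomposed code points (NFC), so each letter
# is drawn as one glyph instead of letter + floating mark.
clean = unicodedata.normalize("NFC", pasted)
print(len(pasted), len(clean))  # 4 2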
I have a scanned PDF and, using Python, I have already converted it to a searchable PDF, but that's not exactly what I want.
It is a well-known fact that a scanned PDF consisting of images of text pixelates at some point as you zoom in, whereas a text PDF doesn't. That is roughly what I want: to replace all of the text in my scanned PDF with real text, as in a typed PDF, or possibly to create a new PDF with all the contents of the old one.
I am writing this post after searching the internet thoroughly and not finding anywhere to start from. I went through libraries such as pytesseract and ReportLab but couldn't find anything promising, which is why there is no code posted in this question.
Hence, I want to know a way to replace all of the text in the scanned PDF with real text at exactly the same positions.
Note: the PDF I am working on contains images as well, and I want everything to stay exactly in its original place, with only the text replaced.
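As a possible starting point (a sketch only, not a tested pipeline; the file names and the font are assumptions): pytesseract can report a bounding box for every word it recognizes on a page image, and ReportLab can draw real text at those coordinates on a fresh page. Something along these lines:

import pytesseract
from PIL import Image
from reportlab.pdfgen import canvas

img = Image.open("page1.png")  # one scanned page rendered as an image
ocr = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)

c = canvas.Canvas("rebuilt.pdf", pagesize=(img.width, img.height))
for i, word in enumerate(ocr["text"]):
    if not word.strip():
        continue  # skip empty entries (block/line/whitespace rows)
    x, top, h = ocr["left"][i], ocr["top"][i], ocr["height"][i]
    c.setFont("Helvetica", h)
    # Tesseract measures from the top-left corner; PDF coordinates
    # start at the bottom-left, so flip the y axis.
    c.drawString(x, img.height - top - h, word)
c.showPage()
c.save()

The original pictures would still have to be placed back with c.drawImage, and the per-word font size is a rough guess from the box height, but it shows where the pieces fit together.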
A colleague gave me a .docx Microsoft Word file with 90 images in it that need to be extracted so they can be turned into flashcards. I tried the Python module docx2txt, which worked OK but only extracted 34 images. On closer inspection, I found this was because when my coworker made the original file, he took screenshots of PowerPoint slides he had made, with about 4-6 of the images on one slide. He would then put a screenshot into Word, copy it several times, and use the built-in Word cropping tool to trim down to each individual picture he needed in a particular line of the document.

docx2txt copied the picture files to my designated directory perfectly, but did not keep that cropping: any picture he had inserted and trimmed down to size was copied out as the full image. Does anyone know of a way to keep the cropping so I don't have to go through and manually copy 90 pictures one by one? Perhaps converting to a .pdf file and using a PDF-related module, or some other Python library that preserves the picture cropping? Thanks for any help you can provide! I'm somewhat of a beginner with Python, but love it when I can get it to automate stuff... even if it ends up taking longer to figure out how to do it than just boring myself to death saving the photos manually, lol.
https://support.microsoft.com/en-us/topic/reduce-the-file-size-of-a-picture-in-microsoft-office-8db7211c-d958-457c-babd-194109eb9535
Important: Cropped parts of the picture are not removed from the file and can potentially be seen by others, including search engines if the cropped image is posted online. Only the Office desktop apps can remove cropped areas from the underlying image file.
Follow the relevant section for desktop Office (Windows or Mac); per the note above, this CANNOT work in Office for the web (365).
Go to "Other kinds of cropping".
Important: If you delete cropped areas and later change your mind, you can click the Undo button to restore them. Deletions can [ONLY] be undone until the file is saved.
So make a backup copy of the file first.
Select the picture or pictures (if you want them all selected, Ctrl+A highlights everything).
Then follow the instructions:
Go to Picture Tools > Format, and in the Adjust group, click Compress Pictures.
Be sure that the "Delete cropped areas of pictures" check box is selected.
DEselect the "Apply only to this picture" check box, so the change applies to every picture.
Double-check a few manually to verify all is well, then save a copy.
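Once the compressed copy is saved, the embedded files themselves are the trimmed versions, so re-running docx2txt on it should now yield the cropped pictures. Since a .docx is just a ZIP archive, you can also pull them out directly; a minimal sketch (the file and folder names are placeholders):

import zipfile

# A .docx file is a ZIP archive; embedded pictures live under word/media/.
with zipfile.ZipFile("flashcards_compressed.docx") as docx:
    for name in docx.namelist():
        if name.startswith("word/media/"):
            docx.extract(name, "extracted_images")

# Equivalently: docx2txt.process("flashcards_compressed.docx", "extracted_images")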
I recently used the Google Vision API to extract text from a PDF. Now I am searching for a keyword in the response text (from the API). When I compare the given string and the found string, they do not match, even though they have the same characters.
The only reason I can see is the fonts of the given and found strings, which look different and might lead to different ASCII/UTF-8 codes for the characters in the string. (I have never come across such a problem.)
How do I solve this? How can I bring these two strings to the same characters? I am using a Jupyter notebook, but even when I paste the comparison into a terminal it still evaluates to False.
Here are the strings I am trying to match:
'КА Р5259' == 'KA P5259'
But they look the same on Stack Overflow so here's a screenshot:
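A quick way to see why the comparison fails (a small diagnostic sketch): the К, А and Р in the left-hand string are Cyrillic letters that merely look like their Latin counterparts, so Python correctly treats the strings as different.

import unicodedata

for ch in 'КАР' + 'KAP':
    print(ch, 'U+%04X' % ord(ch), unicodedata.name(ch))

# К U+041A CYRILLIC CAPITAL LETTER KA
# А U+0410 CYRILLIC CAPITAL LETTER A
# Р U+0420 CYRILLIC CAPITAL LETTER ER
# K U+004B LATIN CAPITAL LETTER K
# A U+0041 LATIN CAPITAL LETTER A
# P U+0050 LATIN CAPITAL LETTER P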
Thanks everyone for your comments.
I found the solution and am posting it here; it might be helpful for someone. It's correct that Python itself does not support font faces. So if you copy a styled character and paste it into the Python console or a Jupyter notebook (which renders font faces, since it uses HTML to display information), what you paste can be a different Unicode character that merely looks the same.
So the idea is to first bring the text response into a plain-text format, which I achieved by storing the response in a .txt file (or a .pkl file, more precisely), which I had to do anyway to preserve the response objects for later data analysis. Once the response is stored in a plain-text file, you can read it back without the font face problem I faced above.
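If anyone still needs to compare such strings directly in memory, a common workaround (not part of the solution above, and the mapping below covers only the letters from this example) is to fold the Cyrillic look-alikes to Latin before comparing:

# Map the Cyrillic homoglyphs from this example to their Latin look-alikes.
HOMOGLYPHS = {ord('К'): 'K', ord('А'): 'A', ord('Р'): 'P'}

found = 'КА Р5259'   # as returned by the Vision API
given = 'KA P5259'   # the keyword being searched for
print(found.translate(HOMOGLYPHS) == given)  # True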
I've got a problem with parsing an Arabic PDF to plain text. I have tried Apache Tika, PDFBox (both in Java and Python) and a few less popular tools like PyPDF2, every time getting a mixed order of signs. For PDFBox I used the hint from the documentation for RTL languages (link), but it didn't work.
The example is presented below:
Original PDF:
Generated text:
The order is changed in every line where Latin signs occur. Has anyone faced a similar problem and solved it?
Thanks for the help!
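One thing that may be worth trying (an assumption, not a verified fix for this particular PDF): extractors often return bidirectional text in logical order while the viewer shows it in visual order, or the other way around, and mixed Latin runs are exactly where the two disagree. The python-bidi package reorders a line with the Unicode bidi algorithm:

# pip install python-bidi
from bidi.algorithm import get_display

# A stand-in for one line of Tika / PDFBox / PyPDF2 output with mixed
# Arabic and Latin runs.
extracted_line = "\u0645\u0631\u062d\u0628\u0627 ABC 123"

# Reorder the line so RTL runs and embedded Latin runs appear in
# display order.
print(get_display(extracted_line))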
I am writing a Python script for mass-replacement of links (actually image and script sources) in HTML files; I am using lxml. There is one problem: the HTML files are quizzes, and they have data packaged like this (there is also some Cyrillic here):
<input class="question_data" value="{"text":"<p>[1] је наука која се бави чувањем, обрадом и преносом информација помоћу рачунара.</p>","fields":[{"id":"1","type":"fill","element":{"sirina":"103","maxDuzina":"12","odgovor":["Информатика"]}}]}" name="question:1:data" id="id3a1"/>
When I try to print out this data in Python using:
print "OLD_DATA:", data
It just prints the error "UnicodeEncodeError: character maps to <undefined>". There are more of these elements. My goal is to change the links of the images in the value part of the input, but I can't change the links if I don't know how to print this data (or how it should be written to the file). How does Python handle (interpret) this? Please help. Thanks!!! :)
You're running into the same problem I've hit many times in the past. That error almost always means that the console environment you're using can't display the characters it's trying to print. It might be worth trying to log to a file instead, then opening the log in an editor that can display the characters.
If you really want to be able to see it on your console, it might be worth writing a function to screen the strings you're printing for unprintable characters, along the lines of the sketch below.
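For example, a minimal sketch (Python 2, to match the print statement in the question; the fallback encoding and the log file name are assumptions):

# -*- coding: utf-8 -*-
import codecs
import sys

data = u'<p>[1] је наука ...</p>'  # stand-in for the parsed attribute value

# Option 1: print a console-safe version, replacing anything the
# console's encoding can't display instead of raising UnicodeEncodeError.
console_encoding = sys.stdout.encoding or 'ascii'
print "OLD_DATA:", data.encode(console_encoding, 'replace')

# Option 2: log the exact text to a UTF-8 file and inspect it in an
# editor that can display Cyrillic.
with codecs.open('debug.log', 'a', encoding='utf-8') as log:
    log.write(data + u'\n')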
I also found a couple of other Stack Overflow posts that might be helpful in your efforts:
How do I get Cyrillic in the output, Python?
What is right way to use cyrillic in python lxml library
I would also recommend this Python manual entry and this article:
https://docs.python.org/2/howto/unicode.html
http://www.joelonsoftware.com/articles/Unicode.html