Unusual font when extracting text from PDF - python

I have been trying to extract text from PDF files and most of the files seem to work fine. However, one particular document has text in this unusual font: ｉｎ ｓｏｌｉｄ
I have tried extraction with PHP and then with Python, and neither could fix this font. I also tried copying the text into text editing tools to see whether I could fix it there, but couldn't do much. Please note that the original PDF document looks fine; it's only when the text is copied and pasted into a text editor that the gaps between characters appear. I am completely clueless about what to do. Please suggest a solution to fix this in PHP/Python (preferably PHP).

Pre-Unicode, some character encodings allowed you to compose Japanese/Korean/Chinese characters either as two half-width characters or as one full-width character. In that case, Latin characters could also be full width so that they mixed evenly with the other characters. You have full-width Latin characters on your hands, and that's why they space out oddly.
You can normalize the string with NFKC compatibility normalization to get back to regular Latin characters. This will also change any half-width/full-width Japanese/Korean/Chinese compatibility characters into their regular forms.
>>> import unicodedata
>>> t = "ｉｎ ｓｏｌｉｄ"
>>> unicodedata.normalize("NFKC", t)
'in solid'
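If the text is coming out of a PDF extractor, the same normalization can be applied to the whole extracted string. Here's a minimal sketch, assuming the pdfminer.six package and a hypothetical input.pdf:

import unicodedata
from pdfminer.high_level import extract_text  # pdfminer.six

# Extract the raw text, then fold full-width Latin letters (and other
# compatibility characters) back to their regular forms.
raw = extract_text("input.pdf")   # placeholder path
print(unicodedata.normalize("NFKC", raw))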

Related

RTL (Arabic) ligatures problem when extracting text from PDF

When extracting Arabic text from a PDF file using libraries like PyMuPDF or PDFMiner, the words are returned in backward order, which is normal behavior for RTL languages, and you need to use the bidi algorithm to be able to display the text correctly across UIs/GUIs.
The problem is when you have ligature characters that are composed of two characters: these ligatures are not reversed, which makes the extracted text inaccurate.
Here's an example: let's say we have a font with a ligature glyph "لا" that maps to "uni0644 uni0627". The PDF itself renders the word correctly, but when you extract the text using the library's text extraction method, you get this:
كارتــــــشلاا
Notice how all the characters are in reverse order except "لا".
And here's the final result after applying the bidi algorithm:
االشــــــتراك
Am I missing something? Is there any workaround to fix this without detecting false positives and breaking them up, or should I write my own implementation that correctly handles ligature decomposition in bidirectional text?
Most likely, the actual text on the PDF page isn't Unicode but font CIDs (identifying the glyphs used), and the program converting the CIDs to Unicode doesn't take RTL into account.
An example using RTL with English (sorry): suppose the word "fire" is rendered RTL as "erif" with 3 glyphs: e, r, and fi (through arbitrary CIDs, perhaps \001\002\003).
If the CIDs are used to get the Unicode information, and the "fi" ligature is de-ligatured, you'll get "erfi" as the data.
In this case, there's no way of knowing that the 'f' and 'i' characters should actually compose a ligature and be flipped around. I'm assuming that's the case for these Arabic characters.
It's unlikely that the tools you're using know anything about RTL or are going to be much help here. You'll need different tools, or an approach that can get you the CIDs directly so you can recompose the Unicode in the correct order.
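One way to get at per-glyph information rather than the pre-assembled string is PyMuPDF's raw extraction mode. Below is a rough sketch, assuming PyMuPDF (fitz) and a hypothetical page.pdf; it only prints each glyph with its horizontal position, so the visual order on the page can be inspected and the logical order rebuilt from it:

import fitz  # PyMuPDF

doc = fitz.open("page.pdf")   # placeholder path
page = doc[0]

# "rawdict" returns one entry per character/glyph, including its origin
# (the baseline start point), so the left-to-right order on the page can
# be recovered independently of the order in the extracted string.
for block in page.get_text("rawdict")["blocks"]:
    for line in block.get("lines", []):      # image blocks have no "lines"
        for span in line["spans"]:
            for ch in span["chars"]:
                print(ch["c"], round(ch["origin"][0], 1))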

How to pad Chinese/Japanese characters so that they are aligned with normal characters, have the same width in .srt file (python)

I wanted to create 3 strings in separate rows that would be displayed aligned horizontally, like this:
[screenshot: the three strings aligned in the terminal]
The same strings saved in an .srt file look like this instead:
[screenshot: the strings in a web video player, no longer aligned]
I already tried replacing the spaces with various other whitespace characters, without success.
The bit of code that I use for padding:
def pad(strings):  # "strings" contains 3 strings shown as one column, separated by "|" in the terminal
    # Japanese/Chinese characters are counted twice because in the terminal they take up two cells
    lengths = [2 * len(strings[0]), len(strings[1]), len(strings[2])]
    max_len = max(lengths)
    for i in range(len(lengths)):
        if lengths[i] < max_len:
            diff_len = max_len - lengths[i]
            append_spaces = diff_len // 2
            # below I add spaces so that the strings are aligned; this works in the terminal,
            # but not in a video player with the loaded srt
            strings[i] = " " * append_spaces + strings[i] + " " * append_spaces + " " * (diff_len % 2)
    return strings
It's important for me to use .srt files, without any styling. That is the only way it'll be recognized properly in most web media players.
If there is no way around it, how could I achieve the result seen in the terminal with any other subtitle format?
For anyone who stumbles upon the same problem: the solution is to use a monospaced font where (at least in my case) every kanji and kana symbol is twice the width of a normal character. The only one that worked for me is "unifont_jp-14.0.01.ttf" from here.
With that font installed I'm able to use it in VLC with the .srt file, or use a subtitle editor to create subtitles with that font embedded in a more advanced file format.
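As a side note, the hard-coded 2*len(...) in the pad() function above only works if you already know which line is CJK. Here's a small sketch of a width-aware variant, assuming the third-party wcwidth package; it only fixes the terminal-side measurement, and the player still needs a monospaced font with double-width CJK glyphs as described above:

from wcwidth import wcswidth  # pip install wcwidth

def pad(strings):
    # wcswidth() counts wide (CJK) characters as two terminal cells,
    # so the same code works for mixed scripts.
    widths = [wcswidth(s) for s in strings]
    max_len = max(widths)
    for i, w in enumerate(widths):
        diff = max_len - w
        left = diff // 2
        strings[i] = " " * left + strings[i] + " " * (diff - left)
    return strings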

Python, Convert white spaces to invisible ASCII character

A quick question. I have no idea how to google for an answer.
I have a Python program that takes input from users and generates string output.
I then use these string variables to populate text boxes in other software (Illustrator).
The string is made out of: number + '%' + text, e.g.
'50% Cool', '25% Alright', '25% Decent'.
These three elements are put into one text box (next to one another), and, as is usual with text boxes, if the whole text does not fit on one line, it wraps to the next line at a white space ' '. Like so:
50% Cool 25% Alright 25%
Decent
I need to keep this feature (where text gets moved down to a lower line if it does not fit), but I need it to move the whole element and not split it.
Like so:
50% Cool 25% Alright
25% Decent
The only way I can think of to stop this from happening is to use some sort of invisible ASCII character which connects the parts of each element together while still looking like a normal, human-visible white space.
Does anyone know of such an ASCII connector that could be used?
So, understand first of all that what you're asking about is encoding specific. In ANSI/Latin-1 encoding, a non-breaking space is character 160 (some older code pages use character 255 instead). (See this discussion for details.) E.g.:
non_breaking_space = chr(160)
HOWEVER, that's for encoded 8-bit binary strings. Assuming you're using Python 3 (which you should consider if you're not), your strings are all Unicode strings, not encoded binary strings.
This also raises the question of how you plan on getting these strings into Illustrator. Does the user copy and paste them? Are they encoded into a file? That will affect how you want to transmit a non-breaking space.
Assuming you're using Python 3 and not worrying about encoding, this is what you want:
'Alright\u00a025%'
\u00a0 inserts a Unicode no-break space (U+00A0), which displays like a regular space but won't be used as a line-break opportunity.
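For illustration, here's a minimal sketch that builds the text with a no-break space inside each element and ordinary (breakable) spaces between elements, so that a wrapping text box keeps "25%" and "Decent" together; the wrapping itself happens in Illustrator, not in Python:

NBSP = "\u00a0"  # no-break space

elements = [("50%", "Cool"), ("25%", "Alright"), ("25%", "Decent")]

# Join number and word with NBSP so they can't be split across lines,
# and join the elements themselves with regular breakable spaces.
text = " ".join(f"{num}{NBSP}{word}" for num, word in elements)
print(text)   # displays like '50% Cool 25% Alright 25% Decent'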

remap scrambled PDF characters to readable text

I have a problem because cups-pdf creates PDF documents where characters are mapped to strange symbols (on Ubuntu Linux 14.04 and 16.04). I think it's some kind of Unicode issue, even though Python tells me the extracted value is an ordinary string (type() reports str).
It makes no difference whether I grab the text out of the PDF via mouse copy-paste from evince / Firefox or via the Python PDFMiner module. So it's true: the PDF has broken text information, which is nevertheless rendered correctly in the PDF document itself. I did not know that, but the text and the text graphics in a PDF document don't seem to be bound very tightly together.
When I copy text from such a PDF document, for example the name "Raphael" turns into "✡✍✑✒✍☛✓", so each single character maps to "✡=R ✍=a ✑=p ✒=h ✍=a ☛=e ✓=l"
Another example is: "Devel" turns into "✭☛✮☛✓"
How can I write a function in Python which shifts this "wrong" information to the correct one? On the PDF document itself everything is perfectly readable.
This has something to do with cups-pdf using PostScript to create the PDF but not adding the correct font/character information to the document.
The letter 'l' always comes out as the symbol '✓', which is the check mark Unicode character.
How can I remap the characters from this strange representation to the correct one in Python? In other words, how can I shift or remap the symbol '✓' to the letter 'l'? Any idea?
Why do I need this?
I need to search for a text value in these documents.
The PDF appears to be using a specialised font to prevent copying. The text is scrambled, but so are the letters in the font. So if the letter a was once mapped to Unicode code point U+0061, the PDF has replaced all those a's with U+270D instead, and the special font replaces the normal "WRITING HAND" glyph with the letter a.
In other words, it's using a substitution cypher.
You'll have to unscramble this like any other substitution cypher: you need to create a reverse mapping from encrypted codepoint to un-encrypted codepoint. You can use the PDF as a guide; as a human you can easily read the actual text, and you can also see how it relates to the copied Unicode codepoints.
For example, we know that U+270D maps to U+0061:
>>> hex(ord('✍'))
'0x270d'
>>> hex(ord('a'))
'0x61'
because when you copy an a from the PDF, you get the 270d code point instead. Simply build up a table for the rest of the alphabet. That may sound like a lot of manual work, but you already have the plaintext. Imagine not knowing what the text contains (e.g. if you only had the symbols that copying the text produces); then you'd have to do a full cryptanalysis first (for a substitution cypher, assume a specific language and count symbols; each language has a typical frequency distribution for its letters, and such a distribution can often be matched in an encrypted body of text to map back to the original letters).
Theoretically, you should be able to extract the specialised font, then analyse that to produce a translation table. This would require some form of computer vision, however; the computer won't easily know that a raster of pixels or a series of vector lines forms a specific letter. For roughly 70 code points (uppercase, lowercase, digits, some punctuation) it'll probably be easier to just create the table by hand.
Once you have a table, Python can do the translation for you; I've taken your clues and created a partial table for just those letters:
mapping = {
    0x270d: 'a',
    0x261b: 'e',
    0x2712: 'h',
    0x2713: 'l',
    0x2711: 'p',
    0x272e: 'v',
    0x272d: 'D',
    0x2721: 'R',
}
print(encrypted.translate(mapping))
All you need to do is fill in the remaining mappings; the str.translate() method will then take care of the rest.
Demo using the above partial table on your sample encrypted text samples:
>>> print("✡✍✑✒✍☛✓".translate(mapping))
Raphael
>>> print("✭☛✮☛✓".translate(mapping))
Devel
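If you already have a few scrambled/plain pairs like the ones above, the table can also be built programmatically rather than typed in by hand. A minimal sketch (build_mapping is just an illustrative helper, not part of any library):

def build_mapping(pairs):
    # pairs: iterable of (scrambled, plain) strings of equal length
    mapping = {}
    for scrambled, plain in pairs:
        for s, p in zip(scrambled, plain):
            mapping[ord(s)] = p
    return mapping

mapping = build_mapping([("✡✍✑✒✍☛✓", "Raphael"), ("✭☛✮☛✓", "Devel")])
print("✭☛✮☛✓".translate(mapping))  # -> Devel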

Display width of unicode strings in Python [duplicate]

This question already has answers here: Normalizing Unicode (2 answers). Closed 8 years ago.
How can I determine the display width of a Unicode string in Python 3.x, and is there a way to use that information to align those strings with str.format()?
Motivating example: Printing a table of strings to the console. Some of the strings contain non-ASCII characters.
>>> for title in d.keys():
...     print("{:<20} | {}".format(title, d[title]))
zootehni- | zooteh.
zootekni- | zootek.
zoothèque | zooth.
zooveterinar- | zoovet.
zoovetinstitut- | zoovetinst.
母 | 母母
>>> s = 'è'
>>> len(s)
2
>>> [ord(c) for c in s]
[101, 768]
>>> unicodedata.name(s[1])
'COMBINING GRAVE ACCENT'
>>> s2 = '母'
>>> len(s2)
1
As can be seen, str.format() simply takes the number of code-points in the string (len(s)) as its width, leading to skewed columns in the output. Searching through the unicodedata module, I have not found anything suggesting a solution.
Unicode normalization can fix the problem for è, but not for Asian characters, which often have a larger display width. Similarly, zero-width Unicode characters exist (e.g. the zero-width space, which allows line breaks within words). You can't work around these issues with normalization, so please do not suggest "normalize your strings".
Edit: Added info about normalization.
Edit 2: My original dataset also contains some European combining characters that don't result in a single code point even after normalization:
zwemwater | zwemw.
zwia̢z- | zw.
>>> s3 = 'a\u0322' # The 'a + combining retroflex hook below' from zwiaz
>>> len(unicodedata.normalize('NFC', s3))
2
You have several options:
Some consoles support escape sequences for pixel-exact positioning of the cursor. Might cause some overprinting, though.
Historical note: This approach was used in the Amiga terminal to display images in a console window by printing a line of text and then advancing the cursor down by one pixel. The leftover pixels of the text line slowly built an image.
Create a table in your code which contains the real (pixel) widths of all Unicode characters in the font that is used in the console / terminal window. Use a UI framework and a small Python script to generate this table.
Then add code which calculates the real width of the text using this table. The result might not be a multiple of the character width in the console, though. Together with pixel-exact cursor movement, this might solve your issue.
Note: You'll have to add special handling for ligatures (fi, fl) and composites. Alternatively, you can load a UI framework without opening a window and use the graphics primitives to calculate the string widths.
Use the tab character (\t) to indent. But that will only help if your shell actually uses the real text width to place the cursor. Many terminals will simply count characters.
Create an HTML file with a table and look at it in a browser.
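For the common case of a terminal with a monospaced font where CJK characters occupy two cells, a rough approximation of display width can be computed from the Unicode East Asian Width property. A minimal sketch using only the standard library (it ignores edge cases such as zero-width spaces and emoji):

import unicodedata

def display_width(s):
    # Wide ('W') and fullwidth ('F') characters take two terminal cells,
    # combining marks take none, everything else is counted as one.
    width = 0
    for ch in s:
        if unicodedata.combining(ch):
            continue
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width

def pad_right(s, total):
    return s + " " * max(total - display_width(s), 0)

for title, abbrev in [("zoothèque", "zooth."), ("母", "母母")]:
    print(pad_right(title, 20), "|", abbrev)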
