When extracting Arabic text from a PDF file using libraries like PyMuPDF or PDFMiner, the words are returned in reversed order, which is normal behavior for RTL languages, and you need the bidi algorithm to display them correctly across UIs/GUIs.
The problem is with ligature characters that are composed of two characters: these ligatures are not reversed, which makes the extracted text inaccurate.
Here's an example:
Let's say we have a font with a ligature glyph "لا" that maps to "uni0644 uni0627", and the PDF renders it correctly.
When you extract the PDF text using the library's text extraction method, you get this:
كارتــــــشلاا
Notice how all characters are in reverse order except "لا".
And here's the final result after applying the bidi algorithm:
االشــــــتراك
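(For reference, the bidi step here is typically done with the python-bidi package; a minimal sketch, assuming that library:)

from bidi.algorithm import get_display  # pip install python-bidi

extracted = "كارتــــــشلاا"  # visual-order text as returned by the extractor
print(get_display(extracted))  # reorders the RTL run for display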
Am I missing something? Is there a workaround that fixes this without detecting false positives and breaking them, or should I write my own implementation that correctly handles ligature decomposition in bidirectional text?
Most likely, the actual text on the PDF page isn't Unicode but font CIDs (identifying the glyphs used), and the program converting the CIDs to Unicode doesn't take RTL into account.
An example using RTL with English (sorry): suppose the word "fire" is rendered RTL as "erif" with 3 glyphs: e, r, and fi (through arbitrary CIDs, perhaps \001\002\003).
If the CIDs are used to get the Unicode information, and the "fi" ligature is de-ligatured, you'll get "erfi" as the data.
In this case, there's no way of knowing that the 'f' and 'i' characters should actually compose a ligature and be flipped around. I'm assuming that's the case for these Arabic characters.
It's unlikely that the tools you're using know anything about RTL or are going to be much help here. You'll need different tools, or an approach that can get you the CIDs directly so you can recompose the Unicode in the correct order.
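If you do stay with plain text extraction, the only plain-text workaround is the false-positive-prone one the question already mentions: pre-flip the known ligature decompositions so the bidi pass puts them back into logical order. A minimal sketch for just the lam-alef pair:

LAM_ALEF = "\u0644\u0627"  # the decomposition of the "لا" ligature

def preflip_lam_alef(visual_text):
    # The ligature's decomposition kept logical order while the rest of
    # the text came out in visual order, so flip it before running bidi.
    # WARNING: this also flips genuine lam+alef sequences that were not
    # rendered as a ligature glyph -- exactly the false positives the
    # question worries about.
    return visual_text.replace(LAM_ALEF, LAM_ALEF[::-1])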
I have been trying to extract text from PDF files and most of the files seem to work fine. However, one particular document has text in this unusual font: ｉｎ ｓｏｌｉｄ
I have tried extraction using PHP and then Python, and both were unable to fix this font. I tried copying the text into text-editing tools to see if I could fix it there, but couldn't do much. Note that the original PDF document looks fine, but when the text is copied and pasted into a text editor, gaps appear between the characters. I am completely clueless about what to do. Please suggest a solution in PHP/Python (preferably PHP).
Pre-Unicode, some character encodings allowed you to compose Japanese/Korean/Chinese characters either as two half-width characters or one full-width character. In that case, Latin characters could be full-width so they mixed evenly with the other characters. You have full-width Latin characters on your hands, and that's why they space out oddly.
You can normalize the string with NFKC compatibility normalization to get back to regular Latin. This will also change any half-/full-width Japanese/Korean/Chinese characters into their standard equivalents (with NFKD, some of those may decompose into multi-code-point sequences).
>>> import unicodedata
>>> t="in solid"
>>> unicodedata.normalize("NFKC", t)
'in solid'
A quick question; I have no idea how to Google for an answer.
I have a Python program that takes input from users and generates string output.
I then use these string variables to populate text boxes in other software (Illustrator).
The string is made out of: number + '%' + text, e.g.
'50% Cool', '25% Alright', '25% Decent'.
These three elements are entered into one text box (next to one another), and, as is usual with text boxes, if the whole text does not fit on one line, it wraps down to the next line at the first white space ' '. Like so:
50% Cool 25% Alright 25%
Decent
I need to keep this feature (where text moves down to a lower line if it does not fit), but I need it to move the whole element rather than split it.
Like so:
50% Cool 25% Alright
25% Decent
The only way I can think of to stop this from happening is to use some sort of invisible ASCII code that connects each element together (while still retaining human-visible white space).
Does anyone know of such an ASCII connector that could be used?
So, understand first of all that what you're asking about is encoding-specific. In ASCII/ANSI encoding (or Latin-1), a non-breaking space can be either character 160 or character 255. (See this discussion for details.) E.g.:
non_breaking_space = chr(160)
HOWEVER, that's for encoded ASCII binary strings. Assuming you're using Python 3 (which you should consider if you're not), your strings are all Unicode strings, not encoded binary strings.
And this also raises the question of how you plan to get these strings into Illustrator. Does the user copy and paste them? Are they encoded into a file? That will affect how you want to transmit a non-breaking space.
Assuming you're using Python 3 and not worrying about encoding, this is what you want:
'Alright\u00a025%'
\u00a0 inserts a Unicode non-breaking space (U+00A0); \u0020 would be an ordinary breaking space.
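Putting it together, a minimal sketch that glues each element internally with no-break spaces while keeping ordinary spaces between elements, so a line can only wrap between elements:

NBSP = "\u00a0"  # NO-BREAK SPACE
elements = ["50% Cool", "25% Alright", "25% Decent"]
# replace internal spaces so the text box can only break between elements
line = " ".join(e.replace(" ", NBSP) for e in elements)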
I am working on a Python program which contains an Arabic-English database and allows the user to update this database and also to study the vocabulary. I am almost done implementing all the functions I need, but the most important part is missing: the encoding of the Arabic strings.
To append new vocabulary to the database txt file, a dictionary is created and its content is appended to the file. To study vocabulary, the content of the txt file is converted back into a dictionary, a random word is printed to the console, and the user is asked for its translation. The idea is that the user can write both the English word and the Arabic word in Latin letters, and the program will internally convert the pseudo-Arabic string to Arabic letters. For example, if the user writes 'b' when asked for the Arabic word, I want to append 'ب'.
1. There are about 80 signs I have to consider in the implementation. Is there a way to create some mapping between the Latin-letter input string and the respective Arabic signs? The most intuitive idea would be to write one if statement after another, but that's probably super slow.
2. I have trouble printing the Arabic string to the console. This input
print('bla{}!'.format(chr(0xfe9e)))
print('bla{}!'.format(chr(int('0x'+'0627',16))))
will result in printing the Arabic sign whereas this won't:
print('{}'.format(chr(0xfe9e)))
What can I do to avoid this problem, since I want a sequence consisting of Unicode symbols only?
Did you try the encode/decode functions? For example, you can write
u = "سلام".encode('utf-8')
print(u.decode('utf-8'))
This is not the final answer, but it can give you a start.
First of all check your encoding:
import sys
sys.getdefaultencoding()
Edit:
sys.setdefaultencoding('UTF8') was removed from the sys module, but you can still comment with what sys.getdefaultencoding() returns on your machine.
However, for Arabic characters, you can generate them all at once from a range:
According to this website, Arabic characters run from 0x0620 to 0x064B and Basic Latin characters from 0x0061 to 0x007B (for lowercase).
So:
arabic_chr = [chr(k) for k in range(0x0620, 0x064B)]
latin_chr = [chr(k) for k in range(0x0061, 0x007B)]
Now all you have to do is find a relation between the two lists, or maybe extend the ranges further (I speak Arabic, and I know that one character can have many forms and can change shape from one word to another).
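A plain dict lookup is the usual (and fast) way to express such a mapping instead of chained if statements; a minimal sketch, where the Latin-to-Arabic pairings are purely illustrative and not a standard transliteration scheme:

# illustrative pairings only -- extend to the ~80 signs you need
TRANSLIT = {
    'b': '\u0628',  # ب ARABIC LETTER BEH
    't': '\u062a',  # ت ARABIC LETTER TEH
    'j': '\u062c',  # ج ARABIC LETTER JEEM
}

def to_arabic(latin):
    return ''.join(TRANSLIT.get(ch, ch) for ch in latin)

print(to_arabic('bt'))  # بت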
I have a problem with cups-PDF creating PDF documents where characters are mapped to strange symbols (on Ubuntu Linux 14.04 and 16.04). I think it's some kind of Unicode issue, even though Python tells me it's a string: type(object) returns str.
It makes no difference whether I grab the text out of the PDF via mouse copy-paste from Evince/Firefox or via the Python PDFMiner module. So it's true: the PDF has broken text information that is nevertheless rendered correctly in the PDF document itself. I did not know that, but the text and the rendered text graphics in a PDF document do not seem to be bound together very tightly.
When I do copy text from such created PDF document by example the name "Raphael" turns into "✡✍✑✒✍☛✓" so each single character maps to "✡=R ✍=a ✑=p ✒=h ✍=a ☛=e ✓=l"
Another example is: "Devel" turns into "✭☛✮☛✓"
How can I write a function in Python which shifts this "wrong" information to the correct one? In the PDF document itself, everything is perfectly readable.
This has something to do with cups-PDF using PostScript to create the PDF but not adding the correct font/character information to the document.
If the letter 'l' is always the symbol '✓' (the CHECK MARK Unicode character, U+2713), how can I remap the characters from this strange representation back to the correct one in Python? So how can I shift or remap the symbol '✓' to the letter 'l'? Any ideas?
Why do I need this?
I need to search for text values in these documents.
The PDF appears to be using a specialised font to prevent copying. The text is scrambled, but so are the letters in the font. So if 'a' was once mapped to Unicode code point U+0061, the PDF has replaced all those a's with U+270D instead, and the special font renders the letter 'a' in place of the normal WRITING HAND glyph.
In other words, it's using a substitution cypher.
You'll have to unscramble this like any other substitution cypher: you need to create a reverse mapping from encrypted codepoint to un-encrypted codepoint. You can use the PDF as a guide; as a human you can easily read the actual text, and you can also see how it relates to the copied Unicode codepoints.
For example, we know that U+270D maps to U+0061:
>>> hex(ord('✍'))
'0x270d'
>>> hex(ord('a'))
'0x61'
because when you copy an 'a' from the PDF, you get the U+270D code point instead. Simply build up a table for the rest of the alphabet. That may sound like a lot of manual work, but you already have the plaintext. Imagine not knowing what the text contains (e.g. you only had the symbols that copying the text produces); then you'd have to do a full cryptanalysis first (for a substitution cypher, assume a specific language and count symbols; each language has a typical frequency distribution for its letters, and such a distribution can often be matched in an encrypted body of text to map back to the original letters).
Theoretically, you should be able to extract the specialised font, then analyse that to produce a translation table. This would require some form of computer vision, however; the computer won't easily know that a raster of pixels or series of vector lines forms a specific letter. For roughly 70 code points (uppercase, lowercase, digits, some punctuation) it'll probably be easier to just create the table by hand.
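(If you did not have the plaintext, a first cryptanalysis step could be a simple symbol-frequency count over the copied text:)

from collections import Counter
# tally how often each scrambled symbol occurs in the copied text
print(Counter("✡✍✑✒✍☛✓").most_common())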
Once you have a table, Python can do the translation for you; I've taken your clues and created a partial table for just those letters:
mapping = {
    0x270d: 'a',
    0x261b: 'e',
    0x2712: 'h',
    0x2713: 'l',
    0x2711: 'p',
    0x272e: 'v',
    0x272d: 'D',
    0x2721: 'R',
}
# 'encrypted' is the scrambled text copied from the PDF
print(encrypted.translate(mapping))
All you need to do is fill in the remaining mappings; the str.translate() method will then take care of the rest.
Demo using the above partial table on your sample encrypted text samples:
>>> print("✡✍✑✒✍☛✓".translate(mapping))
Raphael
>>> print("✭☛✮☛✓".translate(mapping))
Devel
I'd like to print emojis from Python 3 source code.
I'm working on a project that analyzes Facebook message histories, and in the raw htm data file I downloaded, a lot of emojis are displayed as boxes with question marks, as happens when a value can't be displayed. If I copy-paste these symbols into the terminal as strings, I get values such as \U000fe328. This is also the output I get when I run the htm files through BeautifulSoup.
I Googled this string (and others), and consistently one of the only sites that comes up is iemoji.com; in the case of the above string, this page lists the string as a Python Src. I want to be able to print out these strings as their corresponding emojis (after all, they were originally emojis when they were messaged). After looking around, I found a mapping of src encodings at this page that maps strings like the one above to emoji string names. I then found this emoji-string-names-to-Unicode list, which for the most part maps the emoji names to Unicode. If I try printing out those values, I get good output, like the following:
>>> print(u'\U0001F624')
😤
Is there a way to map these "Python src" encodings to their Unicode values? Chaining both lists would work, were it not for the fact that the original src mapping is missing around 50% of the Unicode values found in the Unicode list. And if I do end up having to do that, is there a good way to find the Python src value of a given emoji? From my testing, emojis as strings equal their Unicode escapes, e.g. '😤' == u'\U0001F624', but I can't find any such relation to \U000fe328.
This has nothing to do with Python. An escape like \U000fe328 just contains the hexadecimal representation of a code point, so this one is U+FE328 (which is a private-use character).
These days a lot of emoji are assigned to code points, e.g. 😤 is U+1F624 FACE WITH LOOK OF TRIUMPH.
Before these were assigned, various programs used various code points in the private-use ranges to represent emoji. Facebook apparently used the private-use character U+FE328. The mapping from these code points to the standard code points is arbitrary; some of them may not have a standard equivalent at all.
So what you have to look for is a table which tells you which of these old assignments correspond to which standard code point.
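In Python such a table is again just a dict you can feed to str.translate(); a minimal sketch, where the single pairing below is an assumption taken from the question's research, not a verified Facebook mapping:

# hypothetical table: old private-use code point -> standard emoji
PUA_TO_STANDARD = {
    0x0fe328: '\U0001F624',  # assumed pairing, for illustration only
}
print('\U000fe328'.translate(PUA_TO_STANDARD))  # 😤 (if the pairing holds)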
There's php-emoji on GitHub, which appears to contain these mappings. But note that this is PHP code, and the characters are represented as UTF-8 (e.g. the character above would be "\xf3\xbe\x8c\xa8").
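Decoding those UTF-8 byte sequences back to a code point in Python is straightforward:

>>> b"\xf3\xbe\x8c\xa8".decode("utf-8")
'\U000fe328'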