Convert PDF to text: Adobe Reader vs. Python libraries

Convert PDF to text: Adobe Reader vs. Python libraries - python

I have a PDF which I try to convert to text for further processing.
The structure of the PDF is stable but tricky, as it also contains elements and graph that sometimes also serve as a background for the text that is written in the particular position. Therefore, I'd like to extract as much text as possible.
I first tried the Adobe Reader function to save the PDF as text which gives good results but doesn't allow to have this process fully automated. At least I don't know a way to interact with the Adobe Reader through the command line or.
Therefore, I tried some python libraries designed for this purpose but it seems that they have a different way to convert the pdf to text. I tried PdfMiner, PyPDF2 and pdftotext. None of the libraries give me the same result as the Adobe Reader.
The PDF looks like the following (a little cropped due to sensitive data which isn't relevant):
Adobe extracts the following text:
OCT 15° (4.3 mm) ART (25) Q: 34 [HR]
ILMILM200μm200μm 04590135180225270315360
TMPTSNSNASNITITMP
1000 800 600 400 200 0
Position [°]
CC
7.7 (APS)
G227(12%) T206(54%) TS226(20%) TI304(38%) N203(5%) NS213(6%)
NI276(12%) Segmentationunconfirmed! Classification MRW Within
Normal Limits
OCT ART (100) Q: 31 [HS]
ILMILMRNFLRNFL200μm200μm 111 04590135180225270315360
300 240 180 120 60 0
TMP TS NS NAS NI TI TMP
Position [°]
CC
7.7 (APS)
Classification RNFLT Outside Normal Limits
G78<1% T62(15%) TS103(5%) TI134(10%) N65(7%) NS77(3%) NI73(3%)
Segmentationunconfirmed! RNFL Thickness (3.5 mm) [μm]
WithinNormalLimits(>5%) Borderline(<5%)OutsideNormalLimits(<1%)
While, for example, PDFminer extracts:
Average Thickness [�m]
Vol [mm�]
8.26
200 �m 200 �m
OCT 20.0� (5.6 mm) ART (21) Q: 25 [HS]
267
1.42
321
0.50
335
0.53
299
1.59
Center:
Central Min:
Central Max:
222 �m
221 �m
314 �m
Circle Diameters: 1, 3, 6 mm ETDRS
292
1.55
331
0.52
272
0.21
326
0.51
271
1.44
ILMILM
BMBM
200 �m 200 �m
Which is a lot different. Is there any reason for that and do you know any python library that has the same ability of the Adobe Reader to convert PDF to text?

Not necessarily an explanation as to why Adobe Reader extracts the text from a pdf differently as opposed to some python libraries but I have achieved a really good solution with tika.
This is was tika extracted:
OCT 15� (4.2 mm) ART (26) Q: 31 [HR]
NITSTMP NAS TMPTINSM in
im u
m R
im W
id th
[ �
m ]
1000 800 600 400 200
0
Position [�]
36031527022518013590450
ILMILM
RNFLRNFL
200 �m200 �m
OCT ART (100) Q: 27 [HS]
NITSTMP NAS TMPTINS
R N
F L T
h ickn
e ss (3
.5 m
m ) [�
m ]
300 240 180 120 60 0
Position [�]
36031527022518013590450
40
G 240
(10%)
T 239
(70%)
TS 213 (9%)
TI 285
(22%)
N 230 (5%)
NS 209 (3%)
NI 283 (9%)
CC 7.7 (APS)
Segmentation unconfirmed!
Classification MRW
Borderline
G 78
<1%
T 58
(8%)
TS 91
(2%)
TI 124 (6%)
N 64
(8%)
NS 110
(43%)
NI 71
(4%)
CC 7.7 (APS)
Segmentation unconfirmed!
Classification RNFLT
Outside Normal Limits
Within Normal Limits (>5%)
Borderline (<5%) Outside Normal Limits (<1%)
Reference database: European Descent (2014)

Related

Tesseract OCR with Python

i am looking for a way too get the text from the image below. I tried to use tesseract but the outputs weren't good at all (see code block below). Do i have to edit the picture to get a better output? What params should i use for tesseract? Or is there even a better way? Is Tesseract confused by the small icons?
enter image description here
TEAM 1 41 / 28 / 63
& ¢ 18 #) BanemBanem
v7 é 18 # Feldwebel Nick
* 3 18 C) Eldijaner
6 & 15 ) MarkusLanz187
4 a a
be 18 = benjamin2436
TEAM 2 27 / 41 / 47
w 8 17 e) grummeldom
* 5 15 © Edelmann
é § 18 cB) BanemBanem
6 é 14 &) DefreezeLP
# & 15 # Berboinsens
72,105
BE Dwr w
WE MIS AE #
ZV we
BoQwD Bb
tore &
64,599
WEL See #
Soe Pew we
PRESS
LINEAR AD
25
418
4/2
417
413
413
/9
419
/9
43
‘7
168
302
209
44
161
198
138
274
42
227
14,298
22,143
12,554
9,925
13,185
13,462
12,096
16,722
8,588
13,731
BANS + OBJECTIVES
"yy y
usd dl ia
4,
V3 Gs,
8 2 1 5 0
BANS + OBJECTIVES
vf ie af
o %
I tried these commands for tesseract
tesseract test2.png out3 digits
tesseract statsedit.png out.txt -l eng

In order for tesseract to make the OCR, you have to give it a well processed image (and the tune tesseract based on what you're trying to 'OCR').
Usually, opencv-python is the used library for making this pre-processing, by firstly converting it to a grayscale image, then applying a little blur, and finally thresholding it.
I'm pretty sure you can find on YouTube some kind of tutorials that show u step-by-step how to correctly pre-process an image, based on your use-case.

PyTesseract not seeing some single-digit numbers in table

I have this image of a table
I'm trying to parse it using PyTesseract. I've gotten pretty darn close using this code:
from PIL import Image, ImageOps
import pytesseract
og_image = Image.open('og_image.png')
grayscale = ImageOps.grayscale(og_image)
inverted = ImageOps.invert(grayscale.convert('RGB'))
print(pytesseract.image_to_string(inverted))
This seems to be very accurate, except the single-digit numbers in the second-to-last column are blank. Do I need to do something different to pick up on those numbers?

Tesseract has several modes of page segmentation, and choosing the right one is necessary to help it getting best result. See documentation.
Also in this case, you can restrict tesseract to a certain character set.
Another thing, tesseract is sensitive to the fonts and image size. A simple resizing can change the results greatly. Here I change image size horizontally by factor 2 and vertically to get best result ;)
Combining all the above, you will get:
custom_config = r'--psm 6 -c tessedit_char_whitelist=0123456789.'
print(pytesseract.image_to_string(inverted.resize((1506, 412), Image.ANTIALIAS), config=custom_config))
1525 .199 303 82 161 162 7 .241
1464 .290 424 70 139 198 25 .352
1456 .292 425 116 224 224 0 .345
1433 .240 346 81 130 187 15 .275
1390 .273 373 108 217 216 3 .345
1386 .276 383 54 181 154 18 .315
1225 .208 255 68 148 129 1 .242
1218 .238 230 46 128 127 18 .273
1117 .240 268 43 113 1193 1 .308

Finding the minimum distance for each row using pandas

I am trying to match DNA spiral of different bacterias with their ancestors and I have around 1 million observations. I want to identify the closest ancestor for each bacteria, i.e. I want to compare them with same or older generation ( equal or smaller generation numbers) so my data frame looks like this (for simplicity let's assume DNA vector consist of one number):
bacteria_id generation DNA_vector
213 230 23
254 230 18
256 229 39
289 229 16
310 228 24
324 228 45
I tried to create a matrix and choose the smallest value from that matrix for each bacteria but as it will consist of lot of rows and columns, I get memory error before matrix is created.
Let's assume that it is not bacteria but car and I compare each car with its own generation (e.g. cars launched in 2010) and with the older ones. And also let's change DNA_vector to number of features. And I will assume it is more similar to other car if the difference between number of features is smaller.
So I want to create two additional columns. First one will tell the minimum difference (e.g. for the first one it will be 1, and the most similar car will be model 310)
Expected output is:
bacteria_id generation DNA_vector most_similar_bacteria distance
213 230 23 310 1 (i.e. 24 -23)
254 230 18 289 2
256 229 39 324 6
289 229 16 228 8
310 228 24 324 19
324 228 45 NA NA
Do you have any recommendations?

If you're running into memory errors because of a large dataset, you could try using dask. It is a 'parallel' computing library extremely similar to pandas that allows you to process larger datasets by using your hard drive instead of RAM.
https://dask.pydata.org/en/latest/
May not be something exactly as what you're looking for, but I have had good luck using it with large datasets as you describe.

Graphically displaying BLAST alignments from local source

I have an issue that I am trying to work through. I have a large dataset of about 25,000 genes that seem to the product of domain shuffling or gene fusions. I would like to view these alignments in pdf format based on BLAST outfmt 6 output.
I have BLAST output files for each of these genes with 1 sequence (the recombinogenic gene) and a varying number of subject genes with the following columns:
qseqid sseqid evalue qstart qend qlen sstart send slen length
I was hoping to parse the files through some code to produce images like the attached file, using the following example blast output file:
Cluster_1___Hsap10003 Cluster_2___Hsap00200 1e-30 5 100 300 10 105 240 95
Cluster_1___Hsap10003 Cluster_2___Hsap00200 1e-10 200 230 300 205 235 30 95
Cluster_1___Hsap10003 Cluster_3___Aver00900 1e-20 5 100 300 10 105 125 100
Cluster_1___Hsap10003 Cluster_3___Atha00809 1e-20 5 110 300 5 115 120 105
Cluster_1___Hsap10003 Cluster_4___Ecol00002 1e-10 70 170 300 205 235 30 95
Cluster_1___Hsap10003 Cluster_4___Ecol00003 1e-30 75 175 300 10 105 240 95
Cluster_1___Hsap10003 Cluster_4___Sfle00009 1e-10 80 180 300 205 235 30 95
Cluster_1___Hsap10003 Cluster_5___Spom00010 1e-30 160 260 300 10 105 240 95
Cluster_1___Hsap10003 Cluster_5___Scer01566 1e-10 170 270 300 205 235 30 95
Cluster_1___Hsap10003 Cluster_5___Afla00888 1e-30 175 275 300 10 105 240 95
I am looking for the query sequence to be a thick coloured bar, and the alignment section of each subject to be thick colourful bars with thin black lines showing the rest of the gene length (one subject per line showing all alignment sections against the query).
Does anyone know any software or know of any github code that may do something like this?
Thanks so much!

Python barbs wrong direction

There is probably a really simple answer to this and I'm only asking as a last resort as I usually get my answers by searching but I can't figure this out or find an answer. Basically I'm plotting some wind barbs in Python but they are pointing in the wrong direction and I don't know why.
Data is imported from a file and put into lists, I found on another stackoverflow post how to set the U, V for barbs using np.sin and np.cos, which results in the correct wind speed but the direction is wrong. I'm basically plotting a very simple tephigram or Skew-T.
# Program to read in radiosonde data from a file named "raob.dat"
# Import numpy since we are going to use numpy arrays and the loadtxt
# function.
import numpy as np
import matplotlib.pyplot as plt
# Open the file for reading and store the file handle as "f"
# The filename is 'raob.dat'
f=open('data.dat')
# Read the data from the file handle f. np.loadtxt() is useful for reading
# simply-formatted text files.
datain=np.loadtxt(f)
# Close the file.
f.close();
# We can copy the different columns into
# pressure, temperature and dewpoint temperature
# Note that the colon means consider all elements in that dimension.
# and remember indices start from zero
p=datain[:,0]
temp=datain[:,1]
temp_dew=datain[:,2]
wind_dir=datain[:,3]
wind_spd=datain[:,4]
print 'Pressure/hPa: ', p
print 'Temperature/C: ', temp
print 'Dewpoint temperature: ', temp_dew
print 'Wind Direction/Deg: ', wind_dir
print 'Wind Speed/kts: ', wind_spd
# for the barb vectors. This is the bit I think it causing the problem
u=wind_spd*np.sin(wind_dir)
v=wind_spd*np.cos(wind_dir)
#change units
#p=p/10
#temp=temp/10
#temp_dew=temp_dew/10
#plot graphs
fig1=plt.figure()
x1=temp
x2=temp_dew
y1=p
y2=p
x=np.linspace(50,50,len(y1))
#print x
plt.plot(x1,y1,'r',label='Temp')
plt.plot(x2,y2,'g',label='Dew Point Temp')
plt.legend(loc=3,fontsize='x-small')
plt.gca().invert_yaxis()
#fig2=plt.figure()
plt.barbs(x,y1,u,v)
plt.yticks(y1)
plt.grid(axis='y')
plt.show()
The barbs should all mostly be in the same direction as you can see in the direction in degrees from the data.
Any help is appreciated. Thank you.
Here is the data that is used:
996 25.2 24.9 290 12
963.2 24.5 22.6 315 42
930.4 23.8 20.1 325 43
929 23.8 20 325 43
925 23.4 19.6 325 43
900 22 17 325 43
898.6 21.9 17 325 43
867.6 20.1 16.5 320 41
850 19 16.2 320 44
807.9 16.8 14 320 43
779.4 15.2 12.4 320 44
752 13.7 10.9 325 43
725.5 12.2 9.3 320 44
700 10.6 7.8 325 45
649.7 7 4.9 315 44
603.2 3.4 1.9 325 49
563 0 -0.8 325 50
559.6 -0.2 -1 325 50
500 -3.5 -4.9 335 52
499.3 -3.5 -5 330 54
491 -4.1 -5.5 332 52
480.3 -5 -6.4 335 50
427.2 -9.7 -11 330 45
413 -11.1 -12.3 335 43
400 -12.7 -14.4 340 42
363.9 -16.9 -19.2 350 37
300 -26.3 -30.2 325 40
250 -36.7 -41.2 330 35
200 -49.9 0 335 0
150 -66.6 0 0 10
100 -83.5 0 0 30
Liam

# for the barb vectors. This is the bit I think it causing the problem
u=wind_spd*np.sin(wind_dir)
v=wind_spd*np.cos(wind_dir)
Instead try:
u=wind_spd*np.sin((np.pi/180)*wind_dir)
v=wind_spd*np.cos((np.pi/180)*wind_dir)
(http://tornado.sfsu.edu/geosciences/classes/m430/Wind/WindDirection.html)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Convert PDF to text: Adobe Reader vs. Python libraries - python

Related

Tesseract OCR with Python

PyTesseract not seeing some single-digit numbers in table

Finding the minimum distance for each row using pandas

Graphically displaying BLAST alignments from local source

Python barbs wrong direction

Categories

Resources