Don't understand the result I have with pytesseract

I'm trying to read the following image:
try:
    import Image
except ImportError:
    from PIL import Image
import pytesseract as tes
results = tes.image_to_string(Image.open('./test.png'), boxes=True)
print(results)
And here is the result I get:
_ 239 780 263 787 0
. 239 758 263 767 0
L 235 737 263 761 0
1 220 763 229 783 0
1 220 741 229 761 0
‘ 129 763 137 784 0
1 129 741 136 761 0
1 220 650 229 670 0
‘ 220 628 229 648 0
F 235 537 263 561 0
. 239 531 263 540 0
A 239 511 268 534 0
_ 199 554 223 561 0
I 260 401 268 421 0
r 235 424 263 448 0
. 239 418 263 427 0
_ 239 398 263 404 0
{ 220 424 229 444 0
I 220 401 229 421 0
“ 220 288 229 331 0
What does this mean? How can I interpret this result?
Thanks a lot!

Because you set boxes=True in tes.image_to_string(), the output is in box file format: the first field on each line is the recognized character, followed by the bounding-box coordinates of that occurrence of the character in the image. With boxes=False, tesseract outputs only the recognized characters.
The image you are trying to OCR shows 7-segment digits, so you may need trained (language) data for 7-segment digits to get a good result.
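To make the box output easier to work with, each line can be split into its fields. The sketch below assumes the usual Tesseract box file layout (character, left, bottom, right, top, page, with the origin at the bottom-left of the image); the Box type and parse_box_line helper are hypothetical names introduced for illustration, not part of pytesseract.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """One line of box output (field order is an assumption, see lead-in)."""
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    # Split from the right so the character field is kept intact.
    char, left, bottom, right, top, page = line.rsplit(" ", 5)
    return Box(char, int(left), int(bottom), int(right), int(top), int(page))


# Two lines copied from the output in the question.
sample = "L 235 737 263 761 0\n1 220 763 229 783 0"
boxes = [parse_box_line(line) for line in sample.splitlines() if line.strip()]

# Joining the characters gives what boxes=False would return for these lines.
text = "".join(box.char for box in boxes)
print(text)
for box in boxes:
    # Width and height of each bounding box in pixels.
    print(box.char, box.right - box.left, box.top - box.bottom)
```

Boxes that are much smaller or larger than their neighbours (such as the quote marks and underscores in the question) are a hint that the engine is matching segment fragments rather than whole digits.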

Related

How do I filter based on indices in Python?

I am having an issue with manipulating indices once I have used the groupby command. My problem is similar to this code:
import pandas as pd
import numpy as np
np.random.seed(0)
df=pd.DataFrame(np.random.randint(0,10,size=(1000000,5)),columns=list('ABCDE'))
M=df.groupby(['A','B','D','E'])['C'].sum().unstack()
M
E        0    1    2    3    4    5    6    7    8    9
A B D
0 0 0  464  414  553  420  499  394  528  423  415  443
    1  407  479  392  441  433  472  520  421  484  384
    2  545  546  523  356  386  434  531  534  486  417
    3  408  511  422  424  477  351  452  395  341  492
    4  502  462  403  434  428  444  506  414  418  328
...    ...  ...  ...  ...  ...  ...  ...  ...  ...  ...
9 9 5  419  416  485  386  581  330  408  489  394  454
    6  416  475  469  490  357  523  418  514  555  499
    7  528  419  462  486  565  388  438  445  469  521
    8  390  454  566  341  459  463  478  463  426  499
    9  414  436  441  462  403  415  362  472  433  430
[1000 rows x 10 columns]
I am wondering how to filter down to only situations where B is greater than A, when they are both in the index here. If they weren't in the index then I would be doing something like M=M[M['A']<M['B']].

You can temporarily convert the index to_frame:
out = M.loc[M.index.to_frame().query('B>A').index]
Or use Index.get_level_values:
A = M.index.get_level_values('A')
B = M.index.get_level_values('B')
out = M.loc[B>A]
Output:
E        0    1    2    3    4    5    6    7    8    9
A B D
0 1 0  489  452  421  455  442  377  440  476  477  451
    1  468  448  473  443  557  492  471  460  476  469
    2  576  472  465  355  503  448  491  437  546  425
    3  404  438  474  516  410  446  411  459  467  450
    4  500  418  441  445  420  605  467  580  479  377
...    ...  ...  ...  ...  ...  ...  ...  ...  ...  ...
8 9 5  390  466  436  493  446  508  375  390  485  393
    6  457  478  476  417  458  460  361  397  432  403
    7  516  587  379  406  396  449  430  433  357  432
    8  390  460  489  427  346  490  498  454  395  345
    9  474  510  466  336  484  577  443  428  459  406
[450 rows x 10 columns]
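The same idea is easier to check on a tiny frame. This is a minimal sketch with made-up data (the 3x3 index and the total column are assumptions for illustration); it also shows DataFrame.query, which can refer to named index levels directly, as one more option.

```python
import pandas as pd

# Small frame with a two-level index named A and B (illustrative data).
idx = pd.MultiIndex.from_product([[0, 1, 2], [0, 1, 2]], names=["A", "B"])
M = pd.DataFrame({"total": range(9)}, index=idx)

# Compare the two index levels element-wise to build a boolean mask.
A = M.index.get_level_values("A")
B = M.index.get_level_values("B")
out = M[B > A]
print(out)

# query() resolves named index levels much like column names.
alt = M.query("B > A")
```

Both results keep only the rows whose B level is strictly greater than the A level.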

Python list alignment

I have an assignment I am trying to complete.
I have 100 random ints. From those 100 I have to create a 10x10 table. Done.
Within that table I have to right-align the values in each column. That is the part I'm missing.
Below is the code for that:
print(num, end=(" " if counter < 10 else "\n"))

Late answer, but you can also use:
import random
rl = random.sample(range(100, 999), 100)
max_n = 10
for n, x in enumerate(rl, 1):
    print(x, end=("\n" if n % max_n == 0 else " "))
440 688 758 837 279 736 510 706 392 631
588 511 610 792 535 526 335 842 247 124
552 329 245 689 832 407 919 302 592 385
542 890 406 898 189 116 495 764 664 471
851 728 292 314 839 503 691 355 350 213
661 489 800 649 521 958 123 205 983 219
321 633 120 388 632 187 158 576 294 835
673 470 699 908 456 270 220 878 376 884
816 525 147 104 602 637 249 763 494 127
981 524 262 915 267 873 886 397 922 932

You can just format the number before printing it.
print(f"{num:>5}", end=(" " if counter < 10 else "\n"))
Alternatively, if you want to convert the numbers to strings, you can use the string method rjust.
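A minimal sketch of the rjust variant, assuming the numbers are already in a list; the seed, the value range, and the variable names are illustrative choices rather than part of the assignment.

```python
import random

random.seed(0)  # assumption: fixed seed so the demo is repeatable
numbers = [random.randint(1, 999) for _ in range(100)]

# Pad every number to the width of the longest one.
width = len(str(max(numbers)))

rows = []
for start in range(0, len(numbers), 10):
    row = numbers[start:start + 10]
    # str.rjust pads on the left, which right-aligns the value in its column.
    rows.append(" ".join(str(n).rjust(width) for n in row))

table = "\n".join(rows)
print(table)
```

Computing the width from the data keeps the columns aligned even if the range of the numbers changes.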

There is a simple way to do it. I hope I have made it clear.
import random
# Generate 100 random numbers in the range 1 to 999.
random_numbers = [random.randrange(1, 1000) for _ in range(100)]
# Create an empty string to store the pretty string.
pretty_txt = ''
# Loop through random_numbers and use enumerate to get the iteration count.
for index, i in enumerate(random_numbers):
    # Add the current number to the string, followed by a space.
    pretty_txt += str(i) + ' '
    # Add a newline after every tenth number.
    # (index % 9 == 0 would break after ten numbers first, then after every nine.)
    if (index + 1) % 10 == 0:
        pretty_txt += '\n'
print(pretty_txt)
Sample output, ten numbers per row:
796 477 578 458 284 287 43 535 514 504
91 411 288 980 85 233 394 313 263 135
770 793 394 362 433 370 725 472 981 932
398 275 626 631 817 82 551 775 211 755
202 81 750 695 402 809 477 925 347 31
313 514 363 115 144 341 684 662 522 236
219 142 114 621 940 241 110 851 997 699
685 434 813 983 710 124 443 569 613 456
232 80 927 445 179 49 871 821 428 750
792 527 799 878 731 221 780 16 779 333

Slicing a 2D array using indices 1D array

I have a 2D array of shape (10, 24) and a 1D array of shape (10,).
I want to slice the 2D array using the 1D array so that the result still has shape (10, 24), but each row holds the values from that row's index in the 1D array onwards, padded with zeros.
import numpy as np
x1 = np.random.randint(1,20,10)
print(x1)
[ 8, 13, 13, 13, 14, 3, 14, 14, 11, 16]
y1 = np.random.randint(low = 1, high = 999, size = 240).reshape(10,24)
print(y1)
[[152 128 251 282 334 776 650 247 990 803 700 323 250 262 552 220 744 50
684 695 600 293 138 5]
[830 917 148 612 801 746 623 794 435 469 610 598 29 452 188 688 364 56
246 991 554 33 716 712]
[603 16 838 65 312 764 676 392 187 476 878 229 555 558 58 194 565 764
48 579 447 202 81 300]
[315 562 276 993 859 145 82 484 134 59 397 566 573 263 340 465 728 406
767 408 294 115 394 941]
[422 891 475 174 720 672 526 52 938 347 114 613 186 151 925 482 315 373
856 155 5 60 65 746]
[978 621 543 785 663 32 817 497 615 897 713 459 396 154 220 221 171 589
571 587 248 668 413 553]
[227 188 4 874 975 586 93 179 356 740 645 723 558 814 64 922 748 457
249 688 799 239 708 516]
[230 556 563 55 390 666 304 661 218 744 502 720 418 581 839 772 818 278
190 997 553 71 897 909]
[631 928 606 111 927 912 81 38 529 956 759 6 725 325 944 174 62 804
82 358 305 291 454 34]
[193 661 452 54 816 251 750 183 60 563 787 283 599 182 823 546 629 527
667 614 615 3 790 124]]
I want my resultant array to be:
[[990 803 700 323 250 262 552 220 744 50 684 695 600 293 138 5 0 0 0 0 0 0 0 0]
[452 188 688 364 56 246 991 554 33 716 712 0 0 0 0 0 0 0 0 0 0 0 0 0]
[558 58 194 565 764 48 579 447 202 81 300 0 0 0 0 0 0 0 0 0 0 0 0 0]
[263 340 465 728 406 767 408 294 115 394 941 0 0 0 0 0 0 0 0 0 0 0 0 0]
[925 482 315 373 856 155 5 60 65 746 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[785 663 32 817 497 615 897 713 459 396 154 220 221 171 589 571 587 248 668 413 553 0 0 0]
[64 922 748 457 249 688 799 239 708 516 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]
[839 772 818 278 190 997 553 71 897 909 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[6 725 325 944 174 62 804 82 358 305 291 454 34 0 0 0 0 0 0 0 0 0 0 0 ]
[546 629 527 667 614 615 3 790 124 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]

Here's a vectorized one with masking and also leveraging broadcasting -
def select_gt_indices(a, idx):
    r = np.arange(a.shape[1])
    select_mask = idx[:,None] <= r
    put_mask = (a.shape[1]-idx-1)[:,None] >= r
    # or put_mask = np.sort(select_mask,axis=1)[:,::-1]
    out = np.zeros_like(a)
    out[put_mask] = a[select_mask]
    return out
Sample run -
In [92]: np.random.seed(0)
...: a = np.random.randint(0,999,(4,5))
...: idx = np.array([2,4,3,0])
In [93]: a
Out[93]:
array([[684, 559, 629, 192, 835],
[763, 707, 359, 9, 723],
[277, 754, 804, 599, 70],
[472, 600, 396, 314, 705]])
In [94]: idx
Out[94]: array([2, 4, 3, 0])
In [95]: select_gt_indices(a, idx)
Out[95]:
array([[629, 192, 835, 0, 0],
[723, 0, 0, 0, 0],
[599, 70, 0, 0, 0],
[472, 600, 396, 314, 705]])

I don't think you can do this with plain slicing, because you are padding with 0s. You can create an array of zeros and populate it, for example:
y1_result = np.zeros_like(y1)
for row, x_i in enumerate(x1):
    for j, element in enumerate(y1[row, x_i:]):
        y1_result[row, j] = element
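As a quick check of the loop approach, here is a sketch that copies each row's tail in one slice assignment instead of element by element. It reuses the 4x5 array and indices from the sample run of the vectorized answer; the variable names are illustrative.

```python
import numpy as np

# Array and start indices taken from the sample run of the vectorized answer.
a = np.array([[684, 559, 629, 192, 835],
              [763, 707, 359,   9, 723],
              [277, 754, 804, 599,  70],
              [472, 600, 396, 314, 705]])
idx = np.array([2, 4, 3, 0])

# Start from zeros so everything past the copied tail is already padded.
out = np.zeros_like(a)
for row, start in enumerate(idx):
    tail = a[row, start:]          # values from the start index onwards
    out[row, :tail.size] = tail    # left-align them in the output row

print(out)
```

This loops over rows only, so it may be fast enough for a (10, 24) array; the masking version avoids the Python loop entirely.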

Efficient for loop in python

I want to write a for loop in python which iterates for example like 111, 112, 113, 114, 121, 122, 123, 124, 131,.. up to 444. Is there an efficient way to do so?
I tried to convert between decimal and base 4 system but is there a better way to do so?

>>> from itertools import chain
>>> for k in chain.from_iterable(range(i+1, i+5) for i in range(110, 450, 10)):
...     print(k)
...
111
112
113
114
121
122
123
124
131
132
133
134
141
142
.
.
.
423
424
431
432
433
434
441
442
443
444

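So, like this (checking every digit rather than only the last one, so that values such as 151 or 205 are excluded):
[i for i in range(111, 445) if all('0' < c < '5' for c in str(i))]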

You can also test each digit arithmetically:
[i for i in range(111, 445) if all(0 < d < 5 for d in (i // 100, i // 10 % 10, i % 10))]

You can convert a range of integers to base 4 using base_repr from numpy:
import numpy
for i in range(64):
    print(int(numpy.base_repr(i, base=4)) + 111)
Output:
111
112
113
114
121
122
123
124
131
132
133
134
141
142
143
144
211
212
213
214
221
222
223
224
231
232
233
234
241
242
243
244
311
312
313
314
321
322
323
324
331
332
333
334
341
342
343
344
411
412
413
414
421
422
423
424
431
432
433
434
441
442
443
444
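Another option that needs only the standard library is itertools.product, which yields every combination of three digits drawn from 1 to 4 in the required order. This is a minimal sketch; the variable names are illustrative.

```python
from itertools import product

# Every (hundreds, tens, units) triple with each digit in 1..4,
# generated in lexicographic order: (1, 1, 1), (1, 1, 2), ..., (4, 4, 4).
numbers = [100 * a + 10 * b + c for a, b, c in product(range(1, 5), repeat=3)]

for n in numbers:
    print(n)
```

If the digits themselves are what the loop body needs, iterating over product(range(1, 5), repeat=3) directly avoids building the integers at all.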

How to split column data and create new DataFrame with multiple columns

I'd like to split the data in the following DataFrame:
df = pd.DataFrame(data={'per': np.repeat([10,20,30], 32), 'r': 12*list(range(8)), 'cnt': np.random.randint(300, 400, 96)}); df
cnt per r
0 355 10 0
1 359 10 1
2 347 10 2
3 390 10 3
4 304 10 4
5 306 10 5
.. ... ... ..
87 357 30 7
88 371 30 0
89 396 30 1
90 357 30 2
91 353 30 3
92 306 30 4
93 301 30 5
94 329 30 6
95 312 30 7
[96 rows x 3 columns]
such that for each r value a new column cnt_r{r} exists in a DataFrame, while also keeping the corresponding per column.
The following piece of code almost does what I want, except that it loses the per column:
pd.DataFrame({'cnt_r{}'.format(i): df[df.r==i].reset_index()['cnt'] for i in range(8)})
cnt_r0 cnt_r1 cnt_r2 cnt_r3 cnt_r4 cnt_r5 cnt_r6 cnt_r7
0 355 359 347 390 304 306 366 310
1 394 331 384 312 380 350 318 396
2 340 336 360 389 352 370 353 319
...
9 341 300 386 334 386 314 358 326
10 357 386 311 382 356 339 375 357
11 371 396 357 353 306 301 329 312
I need a way to build the following DataFrame:
per cnt_r0 cnt_r1 cnt_r2 cnt_r3 cnt_r4 cnt_r5 cnt_r6 cnt_r7
0 10 355 359 347 390 304 306 366 310
1 10 394 331 384 312 380 350 318 396
2 10 340 336 360 389 352 370 353 319
...
7 20 384 385 376 323 345 339 339 347
9 30 341 300 386 334 386 314 358 326
10 30 357 386 311 382 356 339 375 357
11 30 371 396 357 353 306 301 329 312
Note that by construction my dataset has the same number of values per per for each r. Obviously my dataset is much larger than the example one (about 800 million records).
Many thanks for your time.

If possible, reshape the values into a 2D array and then insert the new column per:
import numpy as np
import pandas as pd
np.random.seed(1256)
df = pd.DataFrame(data={'per': np.repeat([10,20,30], 32),
                        'r': 12*list(range(8)),
                        'cnt': np.random.randint(300, 400, 96)})
df1 = pd.DataFrame(df['cnt'].values.reshape(-1, 8)).add_prefix('cnt_r')
df1.insert(0, 'per', np.repeat([10,20,30], 4))
print(df1)
per cnt_r0 cnt_r1 cnt_r2 cnt_r3 cnt_r4 cnt_r5 cnt_r6 cnt_r7
0 10 365 358 305 311 393 343 340 313
1 10 393 319 358 351 322 387 316 359
2 10 360 301 337 333 322 337 393 396
3 10 320 344 325 310 338 381 314 339
4 20 323 305 342 340 343 319 332 371
5 20 398 308 350 320 340 319 305 369
6 20 344 340 345 332 373 334 304 331
7 20 323 349 301 334 344 374 300 336
8 30 357 375 396 354 309 391 304 334
9 30 311 395 372 359 370 342 351 330
10 30 378 302 306 341 308 392 387 332
11 30 350 373 316 376 338 351 398 304
Or use cumcount to create new groups, then reshape with set_index and unstack:
df = (df.set_index([df.groupby('r').cumcount(), 'per','r'])['cnt']
        .unstack()
        .add_prefix('cnt_r')
        .reset_index(level=1)
        .rename_axis(None, axis=1))
print(df)
per cnt_r0 cnt_r1 cnt_r2 cnt_r3 cnt_r4 cnt_r5 cnt_r6 cnt_r7
0 10 365 358 305 311 393 343 340 313
1 10 393 319 358 351 322 387 316 359
2 10 360 301 337 333 322 337 393 396
3 10 320 344 325 310 338 381 314 339
4 20 323 305 342 340 343 319 332 371
5 20 398 308 350 320 340 319 305 369
6 20 344 340 345 332 373 334 304 331
7 20 323 349 301 334 344 374 300 336
8 30 357 375 396 354 309 391 304 334
9 30 311 395 372 359 370 342 351 330
10 30 378 302 306 341 308 392 387 332
11 30 350 373 316 376 338 351 398 304
