print() clears console with pytesseract - python

I followed a tutorial on a webpage to write a program that recognizes text in images. The code is very straightforward:
import cv2, pytesseract
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
img = cv2.imread('test_image.jpeg')
text = pytesseract.image_to_string(img)
print(text)
print('done')
where "test_image.jpeg" is an image (with some text) placed in the same folder where the program has been saved.
I already installed cv2 and pytesseract and the installation was successful as they are in the list of modules installed.
But when I run the program, it clears the console (working on Spyder, Python 3.8) and it just prints 'done'.
I've made some checks and the problem is the line "print(text)": when it runs, it clears the console.
I tried to use another output function, sys.stdout.write(text), and it does the same thing.
I tried to uninstall numpy and cv2 and reinstalled them.
I checked if text was a string with "print(type(text))" and it says string.
I checked in the Variable Explorer if just before printing the 'text' variable had the right text in it, and it had it, so everything in "background" should be ok.
Now I'm thinking I might just be missing something easy that I can't see. Could you help me? Thank you! :)

As noted in the comments, you can diagnose the problem by printing repr(text):
print(repr(text))
You will see (at least in my case) a lot of \n characters and one \x0c, the form feed character, which is a sort of 'new page' control character. These come from the way pytesseract works.
The easiest fix I found is to slice them off the end of the string:
text = text[:-5]
The number of escape sequences depends on the image, but you could create a function to identify and delete them, for example, starting from the end, you delete every escape sequence you find.

Building on the previous answer by Matte, you can check for the existence of escape sequences by printing
print(repr(text))
To get rid of the escape sequence you have two options:
You strip the OCR result to remove all whitespace at the beginning or end of the string. This works regardless of how many \n or \x0c characters there are at the end.
text = text.strip()
You change or remove the page separator used by tesseract (by default this is \x0c, see https://tesseract-ocr.github.io/tessdoc/FAQ.html#tesseract-400).
text = pytesseract.image_to_string(img, config="-c page_separator=''")
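The effect of stripping can be checked without running OCR at all. The following is a minimal sketch that assumes the OCR result ends in a few newlines plus a form feed, as described above; the sample string is made up and merely stands in for the value returned by pytesseract.image_to_string.

```python
# Stand-in for the string returned by pytesseract.image_to_string(img).
# The text and the number of trailing newlines are invented for illustration.
text = "Some recognized text\n\n\x0c"

# repr() makes the otherwise invisible control characters visible.
print(repr(text))

# strip() removes leading and trailing whitespace, which includes the
# form feed character \x0c.
cleaned = text.strip()

# Alternative: remove only the form feed and keep the newlines.
no_form_feed = text.replace("\x0c", "")

print(cleaned)
print("done")
```

strip() is usually the simpler choice; replace() is only worth it if the trailing newlines matter to you.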

Related

unable to read file from external location in python

I am trying to read a .txt file (kept in another location) in Python, but I am getting an error.
FileNotFoundError
in ()
----> 1 employeeFile=open("C:‪/Users/xxxxxxxx/Desktop/python/files/employee.txt","r")
2 print(employeeFile.read())
3 employeeFile.close()
FileNotFoundError: [Errno 2] No such file or directory: 'C:\u202a/Users/xxxxxxxx/Desktop/python/files/employee.txt'
Code used:
employeeFile=open("C:‪/Users/xxxxxxxx/Desktop/python/files/employee.txt","r")
print(employeeFile.read())
employeeFile.close()
I tried using both forward slashes (/) and backslashes (\), but I get the same error. Please let me know what is missing in the code.
I'm guessing you copied and pasted the path from a Windows properties pane, switching backslashes to forward slashes manually. The problem is that the properties dialog inserts a Unicode LEFT-TO-RIGHT EMBEDDING character (U+202A) into the path so the display is consistent, even in locales with right-to-left languages (e.g. Arabic, Hebrew).
You can read more about this on Raymond Chen's blog, The Old New Thing. The solution is to delete that invisible character from your path string. Selecting everything from the initial " to the first forward slash, deleting it, then retyping "C:/, should do the trick.
As your error message suggests, there's a weird character between the colon and the forward slash (C:[some character]/). Other than that the code is fine.
employeeFile = open("C:/Users/xxxxxxxx/Desktop/python/files/employee.txt", "r")
You can copy paste this code and use it.
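If retyping the path is not practical, the invisible character can also be found and removed in code. This is only a sketch: strip_format_chars is a hypothetical helper name, and it assumes that dropping every character in the Unicode "Cf" (format) category is acceptable for your paths, which may not hold for file names that legitimately contain such characters.

```python
import unicodedata


def strip_format_chars(path):
    """Return path without Unicode format characters (category 'Cf')."""
    return "".join(ch for ch in path if unicodedata.category(ch) != "Cf")


# The path from the question, with the hidden U+202A written as an escape.
pasted = "C:\u202a/Users/xxxxxxxx/Desktop/python/files/employee.txt"

# ascii() shows non-ASCII characters as escapes, so U+202A becomes visible.
print(ascii(pasted))

clean = strip_format_chars(pasted)
print(ascii(clean))
```

The cleaned string can then be passed to open() as usual.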

Python special characters encoding problems in PATH

Given this simple code, I receive faulty paths if the userfolder contains any special characters. For example the returned path is expected to be "C:\Users\Aoë\", but the ë is instead shown as a ‰ or a \u2030 depending on what is done with encoding. This then messes up the rest of my code because of attempts to write to nonexistent paths.
I ran into this problem trying to run kivy, but it seems to be happening globally.
from pathlib import Path
home = str(Path.home())
print(home)
I've spent quite some time, but haven't been able to reach a solution. This is with the latest python, x64 on windows with eclipse. No matter what I do, I cannot get python to handle special characters properly.
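Try the r (raw string) prefix, which stops backslashes in a string literal from being read as escape sequences:
home = r'%s'%str(Path.home())
Note that the prefix only affects the literal '%s' itself, so this line produces the same string as str(Path.home()); it helps with hard-coded paths, not with the encoding of a value returned at run time.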
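One plausible way for ë to turn into ‰ is a code page mismatch: the character is encoded with an OEM code page and the resulting byte is decoded as Windows-1252. The sketch below only reproduces that effect. Which layer (console, IDE, or library) actually does this on the asker's machine is not known from the question, and cp850 is an assumed code page.

```python
import sys

# The expected path from the question; \u00eb is the letter e with diaeresis.
original = "C:\\Users\\Ao\u00eb"

# Assumption: one layer encodes with an OEM code page (cp850 here) and
# another layer decodes the same bytes as cp1252.
garbled = original.encode("cp850").decode("cp1252")

# ascii() keeps the output readable whatever the console encoding is.
print(ascii(original))
print(ascii(garbled))

# Encodings worth checking when output looks wrong.
print(sys.stdout.encoding)
print(sys.getfilesystemencoding())
```

If the garbled form only appears when printing, the string itself is probably fine and the console encoding is the thing to fix.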

File with "|"s in Atom editor has smiley faces printed from Python; split("|") doesn't work

I have an input file I'm trying to process with Python, which appears to have content like the following:
# This works, when run at a REPL
line = 'aababasdf|75=2|asdfa|150=17|asdfasdf'
date = line.split('|75=')[1].split('|',1)[0]
When I run the above by hand, or copy-and-paste the file's contents from Atom, it works. However, when I have Python open the file and read the line itself, it fails:
# This fails, reading from the file from which contents were copy-and-pasted
with open(filename) as curfile:
    for line in curfile:
        date = line.split('|75=')[1].split('|',1)[0]
This code fails with an IndexError: the split() creates only a single segment, so no [1] segment exists.
When I print the line from the file-based code, it prints smiley faces where the |s should be.
What could be going wrong here? How can I better debug this scenario?
If you're running this from the Windows console (code page 437) there are two vertical bar characters: b'\x7c' and b'\xb3'. The first is part of the ASCII character set, and the second is one of the line-drawing characters that were part of the original PC.
>>> print(b'\x7c\xb3'.decode('cp437'))
|│
In addition you appear to be using a text editor that shows b'\x01' as a vertical bar as well. That's a non-standard way of displaying that character, which is generally invisible since it's an ASCII/Unicode control character.
Once you've figured out the actual character in the file, you can substitute it in your split call.
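A quick way to find the actual character is to print repr() of the line instead of the line itself. The sketch below assumes the separator turns out to be the control character \x01, which the Windows console (code page 437) draws as a smiley face; the sample line is hypothetical and stands in for a line read from the file.

```python
# Hypothetical line as it might be read from the file, with \x01 as the
# real separator instead of the ASCII vertical bar.
line = "aababasdf\x0175=2\x01asdfa\x01150=17\x01asdfasdf\n"

# repr() shows control characters as escapes instead of console glyphs.
print(repr(line))

# Substitute the separator that repr() revealed into the split calls.
sep = "\x01"
date = line.split(sep + "75=")[1].split(sep, 1)[0]
print(date)
```

If repr() shows a different escape in your file, set sep to that character instead.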

Jupyter Notebook encoding error?

I started to learn pandas by following this tutorial:
https://github.com/jvns/pandas-cookbook
Right in the first chapter I try a very elementary example of reading a csv file. The example goes like this:
import pandas as pd
broken_df = pd.read_csv("..\data\bikes.csv")
I get a lengthy error message, which ends with a line:
FileNotFoundError: File b'..\\data\x08ikes.csv' does not exist
So although I write 'bikes.csv', which I have in the correct folder, the program seems to be searching for a file called 'x08ikes.csv'. Could this be an encoding error? sys.getdefaultencoding() returns 'utf-8'.
I am using Anaconda3 for 64bit Windows, version 4.4.0. My browser is Brave. Any ideas what is going wrong here?
The backslash character '\' has special meaning in Python string literals: it "escapes" the next character. In this case '\b' is an escape sequence with a meaning of its own (backspace, \x08), which is why the error message shows x08ikes.csv. There are three ways around this:
Escape the escapes:
You can use the backslash to escape the next backslash, telling Python "this is just another character"
broken_df = pd.read_csv("..\\data\\bikes.csv")
Use a raw string:
Placing r at the beginning of a string tells Python to interpret everything in the string as-is
broken_df = pd.read_csv(r"..\data\bikes.csv")
Use forward slashes:
This is specific to file paths. You can write the path to your file using forward slashes instead of backslashes.
broken_df = pd.read_csv("../data/bikes.csv")
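The difference between the variants can be seen without pandas by comparing the strings themselves. This is a small sketch; the variable names are only illustrative.

```python
# Only \b is left unescaped here, to isolate the backspace problem.
broken = "..\\data\bikes.csv"
print(repr(broken))  # '..\\data\x08ikes.csv'

escaped = "..\\data\\bikes.csv"  # escape the escapes
raw = r"..\data\bikes.csv"       # raw string
forward = "../data/bikes.csv"    # forward slashes

# The first two spellings produce exactly the same string.
print(escaped == raw)
print(repr(forward))
```

Either of the last three strings is safe to pass to pd.read_csv; the forward-slash form also works unchanged on other operating systems.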
Alternatively, upload bikes.csv into the Jupyter Home "Files" tab. Open it and you may still see the message. Then go to File->New to get a new blank file. Open the original bikes.csv in Notepad, then copy and paste its content into the new file in Jupyter Notebook. This may help to resolve it.
Then you can run the following code.
import pandas as pd
broken_df = pd.read_csv("../data/bikes.csv")

Eclipse/PyDev treats newlines pasted into its console as instructions, but I want it to parse them as part of a long string

I am working on a Python script to automate some repetitive text-fiddling tasks I need to do. I use PyDev as a plugin for Eclipse as my IDE.
I need the script to accept user input pasted from the clipboard. The input will typically be many lines long, with many newline characters included.
I currently have the script asking for input as follows:
oldTableString = raw_input('Paste text of old table here:\n')
The console correctly displays the prompt and waits for user input. However, once I paste text into the console, it appears to interpret any newline characters in the pasted text as presses of the enter button, and executes the code as if the only input it received was the first line of the pasted text (before the first newline character), followed by a press of the enter key (which it interprets as a cue that I'm done giving it input).
I've confirmed that it's only reading the first line of the input via the following line:
print oldTableString
...which, as expected, prints out only the first line of whatever I paste into the console.
How can I get Eclipse to recognize that I want it to parse the entirety of what I paste into the console, newlines included, as a single string?
Thanks!
text = ""
tmp = raw_input("Enter text:\n")
while tmp != "":
    text += tmp + "\n"
    tmp = raw_input()
print text
This works but you have to press enter one more time.
What about reading directly from the clipboard, or looping over every line until it receives a termination symbol or times out? Also, is it important to make it work under Eclipse? Does it work when executed directly?
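The "termination symbol" idea could look roughly like this. It is a sketch in Python 3 rather than the Python 2 used in the question; read_pasted_block and the "END" marker are hypothetical names, and whether the PyDev console delivers pasted lines to stdin this way is an assumption that has not been checked here.

```python
import io
import sys


def read_pasted_block(stream=sys.stdin, terminator="END"):
    """Collect lines until one equals terminator or the input ends."""
    lines = []
    for line in stream:
        if line.rstrip("\r\n") == terminator:
            break
        lines.append(line)
    return "".join(lines)


# Demo with an in-memory stream standing in for text pasted at the console.
pasted = io.StringIO("row 1\n\nrow 3\nEND\nignored\n")
old_table_string = read_pasted_block(pasted)
print(repr(old_table_string))
```

Unlike the loop that stops at the first empty line, this version keeps blank lines inside the pasted text, at the cost of having to type the terminator on its own line afterwards.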
