How can I get pytesseract to adjust its targeting parameters in real time depending on where my data has shifted? - python

I'm trying to use pytesseract to read data from a web page. However, the data is not always in exactly the same place. Sometimes I end up on an application where the data has shifted a little to the left or right, either because the application is missing some information or because one of the cells contains too much.
Here's a sample image:
Here, I'm trying to read and print the first line containing the text Brazil(BZ) (and eventually, I'll have it read the other two lines as well). However, this information is again, scattered. So, I'm currently targeting it like this:
import pytesseract
import os
import cv2
import numpy as np
import pyautogui
import configparser
pytesseract.pytesseract.tesseract_cmd = os.path.join(os.environ['ProgramFiles'], 'Tesseract-OCR', 'tesseract.exe')
config = configparser.ConfigParser()
config.read('config.ini')
l1 = config['info']['l1']
l2 = config['info']['l2']
l3 = config['info']['l3']
l4 = config['info']['l4']
x1, y1 = int(l1), int(l2)
x2, y2 = int(l3), int(l4)
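try:
    screenshot = np.array(pyautogui.screenshot())
    cropped_image1 = screenshot[y1:y2, x1:x2]
    # pyautogui screenshots are PIL images in RGB order, so convert from RGB (not BGR)
    gray_image1 = cv2.cvtColor(cropped_image1, cv2.COLOR_RGB2GRAY)
    text1 = pytesseract.image_to_string(gray_image1)
    print(text1)
except Exception as exc:
    # report the failure instead of silently ignoring it
    print(exc)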
My config.ini file has the following:
[info]
l1 = 189
l2 = 1038
l3 = 537
l4 = 1115
This is the area on the page I'm targeting, and it's failing to target the correct area on some applications. Is there a way to maybe search for the text "Country" then go down a little bit, and print whatever is below it, regardless of where it is on my screen?
Would I perhaps do better with selenium? I don't know if it can read html data from web pages in real time AFTER I process an application and end up on a different part of the site. Can it do that?
I tried targeting my data of interest but it fails because information is shifting depending on what information an applicant put in their application.
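One way to approach the "search for Country, then read what is below it" idea is to ask Tesseract for word positions and build the crop box from the position of the anchor word. The sketch below assumes the dictionary layout returned by `pytesseract.image_to_data(..., output_type=pytesseract.Output.DICT)` (parallel lists under `text`, `left`, `top`, `width`, `height`). The helper names `region_below_anchor` and `read_below_anchor`, and the padding, width, and height values, are hypothetical and would need tuning for the real page.

```python
def region_below_anchor(data, anchor="Country", pad_x=20, height_factor=4, width=350):
    """Return (x1, y1, x2, y2) for a box just below the first word containing anchor.

    data is a dict of parallel lists as produced by pytesseract.image_to_data
    with output_type=Output.DICT. Returns None if the anchor is not found.
    """
    for i, word in enumerate(data["text"]):
        if anchor.lower() in word.lower():
            left, top, h = data["left"][i], data["top"][i], data["height"][i]
            x1 = max(left - pad_x, 0)      # start slightly left of the anchor
            y1 = top + h                   # start just under the anchor word
            return x1, y1, x1 + width, y1 + h * height_factor
    return None


def read_below_anchor(image, anchor="Country"):
    """OCR the whole image once to find anchor, then OCR the region under it."""
    import pytesseract  # imported here so the helper above has no dependencies

    data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
    box = region_below_anchor(data, anchor)
    if box is None:
        return None
    x1, y1, x2, y2 = box
    # image is assumed to be a NumPy array, e.g. the grayscale screenshot
    return pytesseract.image_to_string(image[y1:y2, x1:x2])
```

With the question's code, `read_below_anchor(gray_image)` would be called on a grayscale version of the full screenshot rather than on a fixed crop, so the result no longer depends on the coordinates stored in config.ini.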

Related

How to change the scale of json animation lottie in streamlit?

I'm using Streamlit for visualization and wanted to add some animation, so I found and installed Lottie (streamlit-lottie). I liked one gradient animation that I would like to use as a separator between blocks; that is, it should be a thin strip across the entire width of the screen. That is where I ran into a problem.
Sample code:
import streamlit as st
import requests
import json
from streamlit_lottie import st_lottie
from streamlit_lottie import st_lottie_spinner
def load_lottieurl(url: str):
    """ Load animation and images from lottie"""
    r = requests.get(url)
    if r.status_code != 200:
        return None
    else:
        animation = json.loads(r.text)
        return animation
animation_1 = load_lottieurl('https://assets5.lottiefiles.com/packages/lf20_OXZeQi.json')
st_lottie(animation_1, speed=1, loop=True, quality="medium", width=1920)
The problem is that I can't scale the image in one dimension without scaling it entirely. Is it possible to do this somehow, or to crop the displayed image vertically, literally to a height of about 10 pixels? No matter how I tweak the settings, I can't get it to display correctly.
That's what happens now, I just wanted it to be a small strip that would shimmer:

Replacing a word with another word, and replacing an image with another image in a PDF file through python, is this possible?

I need to replace K words with K other words in every PDF file under a certain path, and on top of this I need to replace every logo with another logo. I have around 1000 PDF files, so I do not want to use Adobe Acrobat and edit one file at a time. How can I start this?
Replacing words seems at least doable as long as there is a decent PDF reader one can access through Python (note that I want to do this task in Python). However, replacing an image might be more difficult. I will most likely have to find the dimensions of the current image and resize the replacement image dynamically while the program runs through these PDF files.
Hi, so I've written down some code regarding this:
from pikepdf import Pdf, PdfImage, Name
import os
import glob
from PIL import Image
import zlib
example = Pdf.open(r'...\Likelihood.pdf')
PagesWithImages = []
ImageCodesForPages = []
# Grab all the pages and all the images in every page.
for i in example.pages:
    if len(list(i.images.keys())) >= 1:
        PagesWithImages.append(i)
        ImageCodesForPages.append(list(i.images.keys()))
pdfImages = []
for i,j in zip(PagesWithImages, ImageCodesForPages):
    for x in j:
        pdfImages.append(i.images[x])
# Replace every single page using random image, ensure that the dimensions remain the same?
for i in pdfImages:
    pdfimage = PdfImage(i)
    rawimage = pdfimage.obj
    im = Image.open(r'...\panda.jpg')
    pillowimage = pdfimage.as_pil_image()
    print(pillowimage.height)
    print(pillowimage.width)
    im = im.resize((pillowimage.width, pillowimage.height))
    im.show()
    rawimage.write(zlib.compress(im.tobytes()), filter=Name("/FlateDecode"))
    rawimage.ColorSpace = Name("/DeviceRGB")
So just one problem, it doesn't actually replace anything. If you're wondering why and how I wrote this code I actually got it from this documentation:
https://buildmedia.readthedocs.org/media/pdf/pikepdf/latest/pikepdf.pdf
Start at Page 53
I essentially put all the pdfImages into a list, as one page can have multiple images. The last for loop then tries to replace all these images while keeping the same width and height. Also note that I changed the file path names here; they are definitely not the issue.
Again Thank You
I have figured out what I was doing wrong. For anyone who wants to replace an image with another image in place in a PDF file, this is what you do:
from pikepdf import Pdf, PdfImage, Name
from PIL import Image
import zlib
example = Pdf.open(filepath, allow_overwriting_input=True)
PagesWithImages = []
ImageCodesForPages = []
# Grab all the pages and all the images in every page.
for i in example.pages:
    imagelists = list(i.images.keys())
    if len(imagelists) >= 1:
        for x in imagelists:
            rawimage = i.images[x]
            pdfimage = PdfImage(rawimage)
            rawimage = pdfimage.obj
            pillowimage = pdfimage.as_pil_image()
            im = Image.open(imagePath)
            im = im.resize((pillowimage.width, pillowimage.height))
            rawimage.write(zlib.compress(im.tobytes()), filter=Name("/FlateDecode"))
            rawimage.ColorSpace = Name("/DeviceRGB")
            rawimage.Width, rawimage.Height = pillowimage.width, pillowimage.height
example.save()
Essentially, I changed the arguments in the first line (the Pdf.open call) to specify that the input file may be overwritten. I also added the last line, which actually saves the changes.
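Since the original question mentions around 1000 files, the single-file code above could be wrapped in a function and driven by a small batch loop. The following is only a sketch: `process_all_pdfs` and `replace_fn` are hypothetical names, and `replace_fn` stands for whatever per-file routine you write (for example, the pikepdf code above wrapped in a function that takes a file path).

```python
from pathlib import Path


def process_all_pdfs(folder, replace_fn):
    """Call replace_fn(path) for every PDF under folder, collecting failures.

    Returns (done, failed), where failed is a list of (path, exception) pairs,
    so one damaged file does not stop the whole batch.
    """
    done, failed = [], []
    for pdf_path in sorted(Path(folder).rglob("*")):
        # match .pdf case-insensitively and skip directories
        if not pdf_path.is_file() or pdf_path.suffix.lower() != ".pdf":
            continue
        try:
            replace_fn(pdf_path)
            done.append(pdf_path)
        except Exception as exc:
            failed.append((pdf_path, exc))
    return done, failed
```

Keeping the failures in a list makes it easy to re-run only the files that could not be processed, which matters when the batch is large and the files were produced by different tools.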

I am having trouble with Streamlit: I am using a Chromebook and can't get it to insert images

from PIL import Image
import requests
import streamlit as st
from streamlit_lottie import st_lottie
#find more emojis at the link he set line 4 will show how your website would be layedout for users to see
st.set_page_config(page_title="My Webpage", page_icon=":tada:", layout="wide")
def load_lottieurl(url):
    r = requests.get(url)
    if r.status_code != 200:
        return None
    return r.json()
#lottie files are for animation to be put on your webiste (pip install requests)(pip install streamlit-lottie)(pip install pillow)
lottie_coding = load_lottieurl("https://assets7.lottiefiles.com/packages/lf20_2znxgjyt.json")
img_william = Image.open("images\william.png")
img_tacos = Image.open("images\tacos.png")
#header section / st.container will organize the code it will work fine without
with st.container():
    st.subheader("hello i am william and this is my first website made by python header :wave:")
    st.title("this would be were the title would go ")
    st.write("this would be a small paragraph you wold write this area")
#st.write is for small paragrahps / st.title is to start the paragraps )
# what i am doing with the 3 lines and quotes on line 15 and that i am diving space on the website
with st.container():
    st.write("---")
    left_column, right_column = st.columns(2)
    with left_column:
        st.header("this will be the column for the left side ")
        st.write('##')
        st.write(
            """
i am currently doing what the tutotial is telling me to do even though i am quite cofnused:
- i will master this and i will understtand the concepts of making a website with this framework
- i will be great i bless god for changing how i act and i am as a person
- i know allot of people do not want to see me win but god does and thats all i need
- rome was not built in a day and they will say the samething about my journey
i am confused but if i keep puttingand effort in everyday i will master this in jesues name
"""
        )
I am having difficulty loading images in Python. I am using the Streamlit framework and following a tutorial on how to make my own website. When I do what the tutorial does to load the image, the terminal reports that there is no such file or directory. I have made an images folder in the same folder as the code file, so I am very confused. Please help. I am also using a Chromebook.
If the images folder is in the working directory and you want to display the images, all you need is the code below.
st.image("images/william.png")
st.image("images/tacos.png")
To display the images with PIL instead, you can do the following:
img_william = Image.open("images/william.png")
img_tacos = Image.open("images/tacos.png")
st.image(img_william)
st.image(img_tacos)
If that still fails, try this last method:
Right-click on your images folder and copy the full path.
from pathlib import Path
SRC_DIR = 'C:\\users\\Desktop\\images' # Replace 'C:\\users\\Desktop\\images' with the path you copied in the first step, but keep the doubled backslashes (\\) when editing
img_william = Image.open(Path(SRC_DIR, "william.png"))
img_tacos = Image.open(Path(SRC_DIR, "tacos.png"))
st.image(img_william)
st.image(img_tacos)
Your SRC_DIR should be something like
SRC_DIR = 'C:\\user\\My files\\Webpage\\images'
I am pretty sure you are missing something out.
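One likely cause is worth spelling out: in the question's code the paths are written with single backslashes, and in a normal Python string `"\t"` is a tab character, so `"images\tacos.png"` does not name the file you expect. On a Chromebook's Linux environment the backslash is not a path separator in any case. The sketch below shows the effect and a separator-independent way to build the path with pathlib. `image_path` is a hypothetical helper, and the folder name `images` is taken from the question.

```python
from pathlib import Path

# "\t" in a normal string literal is a single tab character
broken = "images\tacos.png"
has_tab = "\t" in broken  # True: the string contains a tab, not a backslash and a "t"


def image_path(name, base_dir=None):
    """Return <base_dir>/images/<name>, joined with the platform's own separator."""
    base = Path(base_dir) if base_dir is not None else Path.cwd()
    return base / "images" / name
```

In a Streamlit script this could be used as `Image.open(image_path("tacos.png", Path(__file__).parent))`, which resolves the folder relative to the script file rather than to wherever `streamlit run` was started.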

How can I pull information from Excel into PowerPoint using Python and keep the format?

I've written a script with Python's xlrd and pptx to read each workbook in a directory and pull information from each sheet into a table in a PowerPoint slide. It works okay if the Excel table is small, but I don't know what will be in these Excel files, and the slide becomes illegible when there are too many rows and columns. My main problem arose when an Excel file had graphs instead of cells and the script couldn't read it. So I tried using pyscreenshot to open the document and take a screenshot, but this seems slow and unnecessary. I'd like to make a slide in the PowerPoint look exactly as it would in Excel, but with the ability to add and change things.
# import libraries and modules
import xlrd
from pptx import Presentation
from pptx.util import Inches, Pt
import time
import glob
import os
start = time.time()
prs = Presentation()
title_slide_layout = prs.slide_layouts[0]
slide = prs.slides.add_slide(title_slide_layout)
shapes = slide.shapes
title = slide.shapes.title
subtitle = slide.placeholders[1]
title.text = "Dashboard Generator"
subtitle.text = "made with Python-pptx and xlrd"
for filename in glob.glob(os.path.join("C:/Users/penelope/Desktop/PMO/myfiles/", '*.xlsx')):
    print(filename)
    file_location = filename
    try:
        workbook = xlrd.open_workbook(file_location)
        nsheets = workbook.nsheets
        for n in range(0, nsheets):
            sheet = workbook.sheet_by_index(n)
            print("sheet:", sheet)
            rows = sheet.nrows
            cols = sheet.ncols
            c = cols
            r = rows
            if c > 0:
                print(c, r)
                slide = prs.slides.add_slide(prs.slide_layouts[5])
                shapes = slide.shapes
                title = slide.shapes.title
                title.text = "Table testing"
                left = Inches(0.0)
                top = Inches(2.0)
                width = Inches(6.0)
                height = Inches(4.0)
                num = 10.0/c
                table = shapes.add_table(rows, cols, left, top, width, height).table
                for i in range(0, c):
                    table.columns[i].width = Inches(num)
                for i in range(0,r):
                    for e in range(0,c):
                        table.cell(i,e).text = str(sheet.cell_value(i,e))
                        cell = table.rows[i].cells[e]
                        paragraph = cell.text_frame.paragraphs[0]
                        paragraph.font.size = Pt(11)
    except:
        print("Error!")
        pass
prs.save('powerpointfile1.pptx')
end = time.time()
print(end - start)
And this is my screenshot script:
import os
import time
import pyscreenshot as ImageGrab
from PIL import Image
if __name__ == "__main__":
    os.system('start excel.exe "C:/Users/penelope/Desktop/PMO/TestCase.xlsx"')
    time.sleep(3)
    im=ImageGrab.grab(bbox=(24,210,1800,990))
    im.save("image7.png")
    img = Image.open('image7.png')
    img.show()
Well, you've chosen a hard problem. Certainly all the times I've attempted this sort of thing I've ended up abandoning the effort.
The fundamental explanation I formed was that Excel (and Word) are "flowed" document environments. That is, when you run out of room on one page, it flows to the next. PowerPoint, on the other hand, is a page-by-page exhibit layout environment. Each slide is independent of the rest (evidenced by the ability to reorder slides freely), each meant to be shown all at once, and not scrolled. This leads to each slide being self-contained, which means constrained to a single "page".
There's a limit to how much information one can place on a slide and still have it communicate. Generally less is better. So, perhaps it's not a surprise all my early efforts there ended in frustration :) I also concluded that an effective "dashboard" slide would require very skillful layout, and extreme restraint on content length, probably requiring specific (human) summarization effort (not just copying from a "database").
Regarding the charts bit, those theoretically can be moved to PowerPoint and I've even seen it done, but it's technically quite challenging. There is no API support for it in python-pptx. This historical issue on the GitHub repo may give some idea what was involved. Not for the faint of heart I expect :)
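It does not solve the layout problem in general, but one modest way to respect the "one slide, one page" constraint described above is to split a large sheet into several slide-sized tables instead of shrinking everything onto one slide. The sketch below is an assumption-laden illustration: `chunk_rows` is a hypothetical helper, the limit of 12 rows per slide is arbitrary, and it assumes the first row of the sheet is a header worth repeating.

```python
def chunk_rows(rows, max_rows=12, header=True):
    """Split rows into slide-sized chunks, repeating the header row on each.

    rows is a list of row lists; max_rows is the total number of table rows
    allowed per slide, including the repeated header when header is True.
    """
    if not rows:
        return []
    head, body = (rows[:1], rows[1:]) if header else ([], rows)
    per_slide = max_rows - len(head)
    if per_slide < 1:
        raise ValueError("max_rows must leave room for at least one data row")
    if not body:
        return [head]
    return [head + body[i:i + per_slide] for i in range(0, len(body), per_slide)]
```

Each chunk can then be written to its own slide with the same `shapes.add_table(len(chunk), cols, left, top, width, height)` call used in the question's loop, so the font size and column widths stay readable however long the sheet is.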

Batch download google images with tags

I'm trying to find an efficient and replicable way to batch download full-size image files from a Google image search. Other people have asked similar things, but I haven't found anything that's exactly what I'm looking for or that I understand.
Most refer to the deprecated Google Image Search API or to the Google Custom Search API, which doesn't seem to work for the whole web, or are just about downloading images from a single URL.
I imagine this could be a two step process: First, pull all the image URLs from a search and then batch download from those?
I should add that I am a beginner (which is probably obvious; sorry). So if someone could explain as well as point me in the right direction, that would be much appreciated.
I've also looked into freeware options, but those seem spotty as well. Unless anyone knows of a reliable one.
Download images from google image search (python)
In Python, is there a way I can download all or some of the image files (e.g. JPG/PNG) from a Google Images search result?
Also, does anyone know anything about the labels from this, and whether they exist somewhere or are associated with the images? https://en.wikipedia.org/wiki/Google_Image_Labeler
import json
import os
import time
import requests
from PIL import Image
from io import BytesIO
from requests.exceptions import ConnectionError

def go(query, path):
    """Download full size images from Google image search.
    Don't print or republish images without permission.
    I used this to train a learning algorithm.
    """
    BASE_URL = 'https://ajax.googleapis.com/ajax/services/search/images?'\
               'v=1.0&q=' + query + '&start=%d'
    BASE_PATH = os.path.join(path, query)
    if not os.path.exists(BASE_PATH):
        os.makedirs(BASE_PATH)
    start = 0 # Google's start query string parameter for pagination.
    while start < 60: # Google will only return a max of 56 results.
        r = requests.get(BASE_URL % start)
        for image_info in json.loads(r.text)['responseData']['results']:
            url = image_info['unescapedUrl']
            try:
                image_r = requests.get(url)
            except ConnectionError as e:
                print('could not download %s' % url)
                continue
            # Remove file-system path characters from name.
            title = image_info['titleNoFormatting'].replace('/', '').replace('\\', '')
            # Open in binary mode: JPEG data is bytes, not text.
            file = open(os.path.join(BASE_PATH, '%s.jpg') % title, 'wb')
            try:
                Image.open(BytesIO(image_r.content)).save(file, 'JPEG')
            except IOError as e:
                # Throw away some gifs...blegh.
                print('could not save %s' % url)
                continue
            finally:
                file.close()
        print(start)
        start += 4 # 4 images per page.
        # Be nice to Google and they'll be nice back :)
        time.sleep(1.5)

# Example use
go('landscape', 'myDirectory')
Update
I was able to create a Custom Search using the full web as specified here, and successfully execute to get the image links, but as also mentioned in that previous post, they don't exactly align with the normal Google image results.
Try using the ImageSoup module. To install it, simply:
pip install imagesoup
A sample code:
>>> from imagesoup import ImageSoup
>>>
>>> soup = ImageSoup()
>>> images_wanted = 50
>>> query = 'landscape'
>>> images = soup.search(query, n_images=50)
Now you have a list with 50 landscape images from Google Images. Let's play with the first one:
>>> im = images[0]
>>> im.URL
https://static.pexels.com/photos/279315/pexels-photo-279315.jpeg
>>> im.size
(2600, 1300)
>>> im.mode
RGB
>>> im.dpi
(300, 300)
>>> im.color_count
493230
>>> # Let's check the main 4 colors in the image. We use
>>> # reduce_size = True to speed up the process.
>>> im.main_color(reduce_size=True, n=4)
[('black', 0.2244), ('darkslategrey', 0.1057), ('darkolivegreen', 0.0761), ('dodgerblue', 0.0531)]
>>> # Let's take a look at our image
>>> im.show()
>>> # Nice image! Let's save it.
>>> im.to_file('landscape.jpg')
The number of images returned by each search may vary, but it is usually smaller than 900. If you want to get all images, set n_images=1000.
To contribute or report bugs, check the github repo: https://github.com/rafpyprog/ImageSoup
