Using Chromedriver with Google Images download - python

I am trying to run the following code to pull in some images:
from google_images_download import google_images_download #importing the library
response = google_images_download.googleimagesdownload() #class instantiation
arguments = {"keywords":"foxes, shiba inu outside","limit":2000,"print_urls":True} #creating list of arguments
paths = response.download(arguments) #passing the arguments to the function
print(paths) #printing absolute paths of the downloaded images
Because I am trying to download more than 100 images, I get the following message:
Looks like we cannot locate the path the 'chromedriver' (use the '--chromedriver' argument to specify the path to the executable.) or google chrome browser is not installed on your machine (exception: expected str, bytes or os.PathLike object, not NoneType)
I am unsure how to integrate the chromedriver piece into my code and set the path. I searched around but I cannot find a clear answer. I tried adding the line
browser = webdriver.Chrome(executable_path=r"/Users/jerelnovick/Desktop/Projects/Image_Recognition/chromedriver.exe")
as I read in one post but that gave me a message saying
WebDriverException: Message: 'Image_Recognitionchromedriver.exe' executable needs to be in PATH. Please see https://sites.google.com/a/chromium.org/chromedriver/home
I am using a Mac.

There are some additional steps to take to get more than 100 images. From the docs:
If you would want to download more than 100 images per keyword, then you will need to install ‘selenium’ library along with ‘chromedriver’ extension.
And then your arguments will need to be updated as:
arguments = {"keywords":"foxes, shiba inu outside",
             "limit":2000,
             "print_urls":True,
             "chromedriver":"/Users/jerelnovick/Desktop/Projects/Image_Recognition/chromedriver"}
Also make sure the chromedriver you download is the macOS build. On a Mac the executable is named chromedriver, with no .exe extension.
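As a rough sketch of how the path could be checked before the download starts, the helper below builds the same arguments dictionary but fails early if the driver file is missing or not executable. The name build_arguments is hypothetical and is not part of google_images_download; only the dictionary keys come from the answer above.

```python
import os


def build_arguments(keywords, limit, chromedriver_path):
    """Return an arguments dict, failing early if the driver path is unusable.

    Hypothetical helper for illustration; not part of google_images_download.
    """
    # The file must exist at the given location.
    if not os.path.isfile(chromedriver_path):
        raise FileNotFoundError(f"chromedriver not found at: {chromedriver_path}")
    # The file must also be executable by the current user.
    if not os.access(chromedriver_path, os.X_OK):
        raise PermissionError(f"chromedriver is not executable: {chromedriver_path}")
    return {
        "keywords": keywords,
        "limit": limit,
        "print_urls": True,
        "chromedriver": chromedriver_path,
    }
```

The returned dictionary can then be passed to response.download(...) exactly as in the question.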

Related

py-wallpaper returning "is not recognized as an internal or external command"

Hi there,
I am trying to change my Windows 11 desktop wallpaper in the simplest way possible, and I found the py-wallpaper library, which lets me do this:
from wallpaper import set_wallpaper, get_wallpaper
# getting the current wallpaper
print(get_wallpaper())
# setting a new wallpaper
set_wallpaper("FULL_IMG_PATH")
But whichever of them I call, I get this error:
>>> wallpaper.set_wallpaper(r"C:\Users\Khaled\Downloads\Screenshot 2022-08-13 190317.png")
'C:\Users\Khaled\AppData\Local\Programs\Python\Python310\Lib\site-packages\wallpaper\win-wallpaper.exe' is not recognized as an internal or external command,
operable program or batch file.
[]
Note: I do not want to use ctypes
win.py contains the code used to set and load wallpapers. The file contains these lines at the top:
import os
real_path = os.path.realpath(__file__)
win_wallpaper_path = os.path.join(os.path.dirname(real_path), 'win-wallpaper.exe')
The script is trying to call win-wallpaper.exe from wherever win.py is located. Anyway, as the comments mention, the developer didn't list this requirement, but I looked it up and found this repository on GitHub which seems to be the required program.
The source code of wallpaper.c in that repo is only 41 lines. You can read through the source and then download the binary if you deem it safe.
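To confirm that the missing helper is really the cause, a small check along these lines may help. It mirrors the path logic quoted from win.py; the function names are hypothetical, and it assumes you pass in the path of the installed win.py file.

```python
import os


def expected_helper_path(module_file):
    """Mirror the path logic quoted from win.py for a given module file."""
    real_path = os.path.realpath(module_file)
    return os.path.join(os.path.dirname(real_path), "win-wallpaper.exe")


def helper_is_present(module_file):
    """Return True if win-wallpaper.exe sits next to the given module file."""
    return os.path.isfile(expected_helper_path(module_file))
```

If helper_is_present(...) returns False for the installed win.py, the executable has to be placed in that same folder before set_wallpaper can work.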

Converting PDF to JPG in Python

I am trying to convert a PDF into JPEG using Python. Below are the steps I have taken as well as the code but, firstly, here are:
Expected results: Have 1 JPEG file per page in the PDF file added into my "Output" folder.
Actual results: The code appears to run indefinitely without any JPEGs being added to the "Output" folder.
Steps taken:
Installed pdf2image via CMD (pip install pdf2image)
Installed Poppler.
Note on Poppler:
It is required to add it to PATH and I had done this in the environment variables but I kept getting the error pdf2image.exceptions.PDFInfoNotInstalledError: Unable to get page count. Is Poppler installed and in PATH?. So as a workaround, I have added the path in the code directly and I am not receiving this error any longer.
from pdf2image import convert_from_path
path = "D:/Users/<USERNAME>/Desktop/Python/DeratingTool/"
pdfname = path+"<PDFNAME>.pdf"
images = convert_from_path(pdfname, 500,poppler_path=r'C:\Program Files\Release-22.04.0-0\poppler-22.04.0\Library\bin')
output_folder_path = "D:/Users/<USERNAME>/Desktop/Python/DeratingTool/Output"
i = 1
for image in images:
    image.save(output_folder_path + str(i) + "jpg", "JPEG")
    i = i+1
Any ideas why this doesn't seem to be able to finish would be most welcome.
Thank you.
I actually found all of the information I needed for the desired result right in the definitions (thank you, @RJAdriaansen, for pointing me back there). The default format is "PPM" and can be changed to "jpeg". Below is the functioning code for my purposes:
from pdf2image import convert_from_path
path = "D:/Users/<USERNAME>/Desktop/Python/DeratingTool/"
pdfname = path+"<FILENAME>.pdf"
images = convert_from_path(
    pdfname,
    dpi=500,
    poppler_path=r'C:\Program Files\Release-22.04.0-0\poppler-22.04.0\Library\bin',
    output_folder="D:/Users/<USERNAME>/Desktop/Python/DeratingTool/Output",
    fmt="jpeg",
    jpegopt=None)
Thank you
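A side note on the loop in the question: it joins the folder, the page number and "jpg" by plain string concatenation, so there is no path separator and no dot before the extension. If you prefer to save the pages yourself, a minimal sketch of building the file names could look like this. The helper name page_path and the "page_N" naming scheme are assumptions made for illustration.

```python
from pathlib import Path


def page_path(output_dir, page_number, extension="jpg"):
    """Build a path such as <output_dir>/page_1.jpg for one PDF page.

    Hypothetical helper; the 'page_N' naming scheme is only an example.
    """
    return Path(output_dir) / f"page_{page_number}.{extension}"
```

Inside the loop this would be used as image.save(page_path(output_folder_path, i), "JPEG").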

Pytesseract Failed loading language 'chi-sim'

I am working with the Python tesseract package, with sample code like the following:
import pytesseract
from PIL import Image
tessdata_dir_config = "--tessdata-dir \"/opt/homebrew/Cellar/tesseract-lang/4.1.0/share/tessdata/\""
image = Image.open("dataset/test.jpeg")
text = pytesseract.image_to_string(image, lang = "chi-sim", config = tessdata_dir_config)
print(text)
And I received the following error message:
pytesseract.pytesseract.TesseractError: (1, 'Error opening data file /opt/homebrew/Cellar/tesseract-lang/4.1.0/share/tessdata/chi-sim.traineddata Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory. Failed loading language 'chi-sim' Tesseract couldn't load any languages! Could not initialize tesseract.')
From my understanding, the error occurred when reading the file chi-sim.traineddata (which stands for Simplified Chinese). I explain below the attempts I have made to solve this problem.
My development environment is macOS on an M1, and I installed tesseract and tesseract-lang from Homebrew. I am pretty sure that the path specified above is exactly where the source files are located, since when I call
print(pytesseract.get_languages(config = ""))
I get a long list of languages printed, including chi-sim.
Further, if we just use English instead of Chinese, the following code can successfully recognize the English texts in an image:
text = pytesseract.image_to_string(image)
I've tried to specify environment variable TESSDATA_PREFIX in multiple ways, including:
Using config parameter as in the original code.
Adding global environment variable in PyCharm.
Adding the following line in the code
os.environ["TESSDATA_PREFIX"] = "tesseract/4.1.1/share/tessdata/"
Adding the following line to bash_profile in terminal
export TESSDATA_PREFIX=/opt/homebrew/Cellar/tesseract-lang/4.1.0/share/tessdata/
But unfortunately, none of these work.
It seems as if my file chi-sim.traineddata is, somehow, broken, so I directly downloaded the trained data file from GitHub (https://github.com/tesseract-ocr/tessdata/blob/master/chi_sim.traineddata), hit the "Download" button on the right, and placed the downloaded file in the tesseract-lang and original tesseract directory (where eng.traineddata is located). Yes, I've tried both, but neither works.
With respect to this issue, are there any potential solutions?
The code works for me on Linux if I use lang="chi_sim" with _ instead of -, because the file downloaded from the server is named chi_sim.traineddata, also with _ instead of -.
If I rename the file to chi-sim.traineddata, then I can use lang="chi-sim" (with - instead of _).
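Since the language code has to match the file name exactly, one way to catch this kind of mismatch is to look in the tessdata directory before calling pytesseract. The sketch below uses only the standard library; find_traineddata is a hypothetical helper and not part of pytesseract.

```python
import os


def find_traineddata(tessdata_dir, lang):
    """Return the language code whose .traineddata file exists, or None.

    Tries the code as given, then the hyphen/underscore variants.
    Hypothetical helper for illustration; not part of pytesseract.
    """
    candidates = [lang, lang.replace("-", "_"), lang.replace("_", "-")]
    for code in candidates:
        if os.path.isfile(os.path.join(tessdata_dir, code + ".traineddata")):
            return code
    return None
```

The returned code can then be passed as the lang argument of pytesseract.image_to_string.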

pdfkit changes href from relative to absolute paths on conversion

I'm using pdfkit to convert html files that have links with href attributes in them.
Inside of the html, href's are written with relative paths, e.g.:
PIC
When I convert this to pdf, the hrefs seem to be automatically rewritten to absolute paths (C:/Users/...).
Why does pdfkit change the href?
Wkhtmltopdf, which pdfkit relies on, converts relative links to absolute links by default.
This can be stopped by using the command line tool with a special flag:
wkhtmltopdf --keep-relative-links src destination
Or by telling pdfkit to apply this option:
import sys
import pdfkit

def convert_to_pdf(path):
    try:
        # run the conversion and write the result to a file
        # (path_wkthmltopdf is assumed to be defined elsewhere as the path to the wkhtmltopdf executable)
        config = pdfkit.configuration(wkhtmltopdf=path_wkthmltopdf)
        options = {
            '--keep-relative-links': ''
        }
        pdfkit.from_url(path+'.htm', path+'.pdf', configuration=config, options=options)
    except Exception as why:
        # report the error
        sys.stderr.write('Pdf Conversion Error: {}\n'.format(why))
        raise
Usually, a PDF created from an HTML file is opened in another location (for example, on another computer after being sent by mail), so the full path is needed for the links to resolve correctly.
Of course, this only works if the other computer can access the path. Paths on C: only work on the local machine, not on other PCs.

Python & ActiveSync file exchange

I have the following problem:
I want to copy some files with a python3 script from a (windows) mobile device which is connected via ActiveSync to a windows computer.
I tried it as follows, showing the 3 failed path attempts:
def CopyDir(LocDir, DestDir):
    shutil.copytree(LocDir, DestDir)
Dir = '%%CSIDL_PROGRAM_FILES%%\MD_Data' # Path worked in .bat - but not here
Dir = 'Computer\Pocket_PC\\\Program Files' # Direct path out of windows explorer - didn't work
Dir = '::{20D04FE0-3AEA-1069-A2D8-08002B30309D}\\\?\activesyncwpdenumerator#umb#2&306b293b&0&activesyncwpddevice-ab5f2fe4-d830-f1e7-d806-33298de2d8ee-#{6ac27878-a6fa-4155-ba85-f98f491d4f33}\f%7CS%7C%5C\f%7CF%7C%5CProgram%20Files%5C' # Cryptic path read out by a windows clipboard tool - didn't work
CopyDir(Dir, DestDir+'\\MD_Data')
The error is always the same: system couldn't find path
Now I'm asking myself: am I simply using the wrong path format for ActiveSync? Or do I need a special ActiveSync module? (I didn't find one.)
Thanks already in advance!
