Current environment: Anaconda, using the Spyder IDE.
I wrote a program to scrape a website and filter posts by keywords.
This is the output.
What is the best way to display a list of images given jpeg addresses?
A few attempts include:
Using Pillow - but the best I can do is open a new tab in Google Chrome to display the photo.
Using Image - but I can only open images already saved on my computer, in Preview.
Using csv - but that only exports the output as a neatly formatted CSV file.
Try using a regex pattern to match all these cases.
import re
if re.search(r'^.*\.jpeg$', full_url):
    pass  # do something with the matching URL
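To make that a bit more concrete, here is a minimal sketch that filters a list of URLs by image extension and then writes the matches into a single HTML page, which can be opened in any browser. The extension list, the function names `is_image_url` and `gallery_html`, and the 300-pixel width are my own assumptions, not something from the question.

```python
import html
import re
from urllib.parse import urlparse

# Assumed list of extensions; extend it to cover the cases in your output.
IMAGE_EXT = re.compile(r"\.(?:jpe?g|png|gif)$", re.IGNORECASE)


def is_image_url(url):
    """True if the URL path (ignoring any ?query part) ends in an image extension."""
    return bool(IMAGE_EXT.search(urlparse(url).path))


def gallery_html(urls):
    """Build one HTML page with an <img> tag for every image URL in urls."""
    tags = "\n".join(
        f'<img src="{html.escape(u, quote=True)}" width="300">'
        for u in urls
        if is_image_url(u)
    )
    return f"<html><body>\n{tags}\n</body></html>"
```

You could then write the returned string to a file such as gallery.html and open it with `webbrowser.open`, so all the images show up on one page instead of one tab each.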
I have a website with a search button: I enter a numeric value and press Enter. It goes to another page that displays some content containing URLs. If I click one of those URLs, it asks me to save a diagram, which is in either TIFF or PDF format.
To download a TIFF diagram, I use the swift plugin in Internet Explorer and save it to my machine.
At the moment I do this work manually, and I want to automate the whole process.
Steps:
Use the Python requests module and pass the URL with the numeric value to the post method.
Save the response content to a variable.
Perform pattern matching and fetch the URL.
Click the URL. This is where I am stuck, because the diagram is a TIFF and I cannot save it locally.
Is there a module to download a TIFF diagram and save it to the local machine?
I want to share how I resolved the issue in the question above, as it might be useful for others.
Since the TIFF image needs to be downloaded from the web, I used the Python requests module together with the Pillow module, as below:
from PIL import Image
import requests
tiffURL = 'https://***.tif'
# Stream the response and let Pillow read the image from the raw byte stream
img = Image.open(requests.get(tiffURL, stream=True).raw)
img.save('imagename.jpg')
# img.save('imagename.jpg', quality=95)
Note:
A TIFF image cannot be viewed in an ordinary editor, so I converted it to JPG.
If you want higher image quality (less JPEG compression), you can pass quality=95 to the save method.
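For step 3 of the question (pattern matching to fetch the URL), here is a minimal sketch using only the standard library. It assumes the result page links to the diagrams with ordinary `<a href="...">` tags ending in .tif, .tiff or .pdf, which I cannot verify for the site in question; `DiagramLinkParser` and `diagram_links` are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class DiagramLinkParser(HTMLParser):
    """Collect links whose path ends in .tif, .tiff or .pdf (assumed link format)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        # Ignore any ?query part when checking the extension.
        if href.lower().split("?")[0].endswith((".tif", ".tiff", ".pdf")):
            # Turn relative links into absolute ones.
            self.links.append(urljoin(self.base_url, href))


def diagram_links(page_html, base_url):
    """Return absolute URLs of all TIFF/PDF links found in page_html."""
    parser = DiagramLinkParser(base_url)
    parser.feed(page_html)
    return parser.links
```

Each URL returned this way could then be passed to requests.get as in the code above, with PDF files written to disk as-is rather than opened with Pillow.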
While creating the text layer for a scanned PDF (with the text obtained via OCR), how can I edit the text (because OCR gives wrong text) without messing up how the page looks?
ocrmypdf does the best job of creating a text layer (making a scanned PDF searchable) and produces PDF/A-standard documents without disturbing the page appearance. It uses Tesseract OCR to detect text, but sometimes Tesseract detects the wrong text. So I want to let the user change that text before completing the creation of the PDF.
Here is an example PDF on which OCR does not work properly, so I want to update the OCR-detected text before it is rendered into the PDF.
Either kind of solution works for me: a change to the ocrmypdf source code, or updating the text using PDFBox.
Example:
OCRMYPDF input file
OCRMYPDF output file
I'm working on a project that needs to process images from one file and output them to another. The changes vary, but the main one is that colour profiles need to be changed. However, everything I have seen so far can only convert to sRGB or similar, whereas I need to either be able to add a profile or have an extensive or full list of profiles to convert to. For example, one of the profiles I'll need is eciRGB v2.
Please help me automate this in Python (I can't use Photoshop).
You can do this with ImageMagick in the terminal.
Download the colour profile (eciRGB_v2.icc) and make sure you know the path to it. I recommend keeping it in the same directory as your images.
example set up
Then open Terminal at that directory and run this code:
convert image.tif -profile eciRGB_v2.icc output.tif
example of result
Link to download ImageMagick: https://imagemagick.org/index.php
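Since the question asks for Python, the same command can be driven from a script. This is only a sketch: the function names `profile_command` and `convert_folder`, the folder layout, and the `*.tif` pattern are my assumptions. Note that ImageMagick 7 installs the command as `magick` rather than `convert`, so the executable name is a parameter.

```python
import subprocess
from pathlib import Path


def profile_command(src, dst, icc_path, executable="convert"):
    """Build the ImageMagick command line that applies an ICC profile to one image."""
    return [executable, str(src), "-profile", str(icc_path), str(dst)]


def convert_folder(in_dir, out_dir, icc_path, executable="convert"):
    """Run the command for every .tif file in in_dir, writing results to out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for src in sorted(Path(in_dir).glob("*.tif")):
        cmd = profile_command(src, out_dir / src.name, icc_path, executable)
        # check=True raises CalledProcessError if ImageMagick reports a failure.
        subprocess.run(cmd, check=True)
```

For example, `convert_folder("images", "converted", "eciRGB_v2.icc")` would process every TIFF in a folder named images, assuming ImageMagick is on the PATH.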
I found there are some libraries for extracting images from PDF or Word files, like docx2txt and pdfimages. But how can I get the content around the images (for example, there may be a title below the image)? Or get the page number of each image?
Some other tools, like PyPDF2 and minecart, can extract images page by page. However, I could not run that code successfully.
Is there a good way to get some information about the images? (Either from the images obtained with docx2txt or pdfimages, or another way to extract images together with their info.)
I looked at the code of docx2txt, and it simply parses the XML of the docx file. So it's actually a very easy task.
Ref: docx2txt
docx2python pulls the images into a folder and leaves -----image1.png---- markers in the extracted text. This might get you close to where you'd like to go.
A few months ago, I reprogrammed docx2python to reproduce a structured (leveled) XML file from a docx file, which works out pretty well on many files.
As far as I know, a paragraph contains several runs, and each run contains only one text element and sometimes images. You can read this document for details. https://learn.microsoft.com/en-us/dotnet/api/documentformat.openxml.wordprocessing.paragraph?view=openxml-2.8.1 .
docx2python supports extracting images along with the text around them. When you read paragraphs with docx2python, ----media/imagen---- appears in your text as an image placeholder. You can then reach that image if you set extract_image=True. That way you get the name of the image in the paragraph text and a list of image files. Match them as you like.
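If you would rather parse the XML yourself, as mentioned above, here is a minimal sketch using only the standard library. It assumes images are embedded through `a:blip` elements with an `r:embed` relationship id, which is the usual case but not the only one, and it treats the following paragraph as a possible caption. `images_with_context` is a hypothetical name. It does not give page numbers, because a docx file does not store a fixed page for each paragraph.

```python
import zipfile
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
A = "{http://schemas.openxmlformats.org/drawingml/2006/main}"
R = "{http://schemas.openxmlformats.org/officeDocument/2006/relationships}"
REL = "{http://schemas.openxmlformats.org/package/2006/relationships}"


def images_with_context(docx_file):
    """Return (image target, text of its paragraph, text of the next paragraph) tuples."""
    with zipfile.ZipFile(docx_file) as z:
        rels = ET.fromstring(z.read("word/_rels/document.xml.rels"))
        body = ET.fromstring(z.read("word/document.xml"))
    # Map relationship ids (rId1, ...) to targets such as media/image1.png,
    # which are paths relative to the word/ folder inside the archive.
    targets = {rel.get("Id"): rel.get("Target") for rel in rels.iter(REL + "Relationship")}
    paragraphs = list(body.iter(W + "p"))
    texts = ["".join(t.text or "" for t in p.iter(W + "t")) for p in paragraphs]
    found = []
    for i, p in enumerate(paragraphs):
        for blip in p.iter(A + "blip"):
            next_text = texts[i + 1] if i + 1 < len(texts) else ""
            found.append((targets.get(blip.get(R + "embed")), texts[i], next_text))
    return found
```

The function accepts a path or a file-like object, since that is what `zipfile.ZipFile` accepts.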
I've successfully written code that goes through several URLs, finds a specific image in each of them, and saves its address. Now I want to download the image.
I'm using this.
def update(name,set,url):
    urllib.urlretrieve(url,"c:/path/"+set+"/"+url)
It currently runs, but the images this code downloads can't be opened. I get a message saying that either I don't have the proper update or the Windows viewer can't open the file because it doesn't support the format.
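I can't tell from the question what the server actually returns, so the following is only a sketch of two things worth checking: the code above uses the whole URL as the local file name, and the saved data may not be an image at all (for example an HTML error page). The helper names and the signature list are my own assumptions. Note also that `urllib.urlretrieve` is the Python 2 name; in Python 3 the download functions live in `urllib.request`.

```python
import os
import urllib.request
from urllib.parse import urlparse

# Leading bytes of JPEG, PNG and GIF files (assumed to be the formats of interest).
IMAGE_SIGNATURES = (b"\xff\xd8\xff", b"\x89PNG\r\n\x1a\n", b"GIF8")


def local_path(folder, url):
    """Use only the last path segment of the URL as the local file name."""
    name = os.path.basename(urlparse(url).path)
    return f"{folder}/{name}"


def looks_like_image(data):
    """True if the downloaded bytes start with a known image signature."""
    return data.startswith(IMAGE_SIGNATURES)


def download_image(url, folder):
    """Download url into folder, refusing to save data that is not an image."""
    with urllib.request.urlopen(url) as response:
        data = response.read()
    if not looks_like_image(data):
        raise ValueError(f"{url} did not return JPEG, PNG or GIF data")
    path = local_path(folder, url)
    with open(path, "wb") as f:  # binary mode, so the bytes are written unchanged
        f.write(data)
    return path
```

If `download_image` raises the ValueError, printing the first few hundred bytes of the response should show what the site is sending instead of the picture.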