I wrote a Python program that creates music notation from user input. It generates a PDF, and everything works fine.
The only problem is that the notation consists of hundreds of small images, and I would like to know whether they can be merged into one big image. I don't need them to be selectable or anything. Basically, I want the PDF to contain one big picture per page.
Is that possible using PyPDF2?
Thanks in advance and have a great weekend!
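One possible direction, offered as a sketch rather than a confirmed solution: as far as I know, PyPDF2 works on PDF structure and does not rasterize pages, so the example below uses Pillow instead (a swapped-in library, not part of the question). It assumes the small notation images and their page positions are available before the PDF is written; `compose_page` and `save_pages_as_pdf` are hypothetical helper names.

```python
import io
from PIL import Image


def compose_page(tiles, page_size, background="white"):
    """Paste (image, (x, y)) tiles onto one page-sized canvas."""
    page = Image.new("RGB", page_size, background)
    for tile, position in tiles:
        # Use the alpha channel as a mask so transparent tiles keep the background.
        mask = tile if tile.mode == "RGBA" else None
        page.paste(tile, position, mask)
    return page


def save_pages_as_pdf(pages, target, dpi=300):
    """Write one flattened image per PDF page; target is a path or file object."""
    first, rest = pages[0], pages[1:]
    first.save(target, "PDF", save_all=True, append_images=rest, resolution=dpi)
```

Each page then holds a single image, so nothing on it is individually selectable.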
I'm new to Python and would be really grateful if someone could offer some advice or point me in the right direction. I have one large CSV file with many order numbers and order lines. (One order may have many order lines, and the file contains all orders from the last 24 hours.)
I'm trying to create multiple PDFs grouped by order number, each showing that order's lines (one PDF per order).
I found a great tutorial on YouTube that showed me how to do the grouping and create multiple files (mark2741-generate certificates). It looks great because I have added the order address and a logo to the canvas. But I can't add a table to the canvas, based on the group, to show the order lines.
Any ideas?
PS: I'm using ReportLab.
Thanks in advance
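A minimal sketch of one way to approach this, under stated assumptions: the CSV has a header row, and the column name `order_number` (like the helper names here) is hypothetical. The rows are grouped with the standard library, and each group is turned into the list-of-lists shape that ReportLab's `Table` flowable accepts, then drawn on an existing canvas with `wrapOn` and `drawOn`.

```python
import csv
import io
from collections import defaultdict


def group_order_lines(csv_file, key="order_number"):
    """Return {order number: [row dict, ...]}, keeping file order."""
    groups = defaultdict(list)
    for row in csv.DictReader(csv_file):
        groups[row[key]].append(row)
    return dict(groups)


def table_data(rows, columns):
    """Header row plus one list per order line."""
    return [list(columns)] + [[row[c] for c in columns] for row in rows]


def draw_order_table(canvas, rows, columns, x, y_top,
                     avail_width=450, avail_height=600):
    """Draw the order lines on a ReportLab canvas, top edge at y_top."""
    from reportlab.platypus import Table  # lazy import; needs reportlab installed

    table = Table(table_data(rows, columns))
    _, height = table.wrapOn(canvas, avail_width, avail_height)
    # drawOn takes the lower-left corner, so shift down by the table height.
    table.drawOn(canvas, x, y_top - height)
```

With this, the per-order loop from the tutorial could call `draw_order_table` once for each group before saving that order's canvas.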
I searched quite a bit but couldn't find a solution for this kind of problem, hence this question. Most answers cover image or text extraction, which is comparatively easier.
I need to extract tables as text (CSV) and graphs as images from PDFs.
Can anyone help me with efficient Python 3.6 code for this?
So far I can extract JPEGs using startmark = b"\xff\xd8" and endmark = b"\xff\xd9", but not all tables and graphs in a PDF are plain JPEGs, so my code fails badly.
For example, I want to extract the table on page 11 and the graphs on page 12 of the PDF linked below, as images or in whatever form is feasible. How should I go about it?
https://hartmannazurecdn.azureedge.net/media/2369/annual-report-2017.pdf
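For reference, the marker-based JPEG extraction described in the question can be sketched as follows. This is an illustration, not the asker's actual code, and `find_jpegs` is a hypothetical name. It only finds images embedded as raw JPEG data; images stored with other encodings and charts drawn as vector graphics are missed, which matches the failure described above.

```python
def find_jpegs(data, startmark=b"\xff\xd8", endmark=b"\xff\xd9"):
    """Return every byte span that starts with startmark and ends with endmark."""
    images = []
    pos = 0
    while True:
        start = data.find(startmark, pos)
        if start < 0:
            break
        end = data.find(endmark, start + len(startmark))
        if end < 0:
            break
        end += len(endmark)  # include the end marker itself
        images.append(data[start:end])
        pos = end
    return images


# Usage (the file name is a placeholder):
# with open("report.pdf", "rb") as f:
#     for i, jpg in enumerate(find_jpegs(f.read())):
#         with open(f"image{i}.jpg", "wb") as out:
#             out.write(jpg)
```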
For extracting tables you can use camelot.
Here is an article about it.
For images, I found this question and answer: Extract images from PDF without resampling, in python?
Try PyMuPDF (https://github.com/pymupdf/PyMuPDF/tree/1.18.3) for graphs made up of text, bars, lines, and axes. It has many extra utilities.
I have a challenge and am looking for any information, tips, or examples that could help. I have searched Google and this forum many times with different queries, but I haven't found the same task or an algorithm for it. I have tried many commercial image-comparison programs to find the different and common parts, but none of them does this well or smartly.
I have a website with many different boxes, modules, elements, etc. First I take a screenshot and save it as web1.png.
Next I change some boxes and elements on the website: for example, I remove a block, add new elements, or move a module to another place.
Then I take another screenshot of the website after the changes and save it as web2.png.
Now comes the most important part, what I actually want to achieve.
I want to feed these two images (web1.png and web2.png) to a script, in Python or another technology, in which a smart algorithm compares the two files and marks the differences, or perhaps only the elements that are the same in both files.
I think the biggest problem is defining what exactly constitutes a separate block, module, or element in a website screenshot, then finding the same block in the second screenshot, and then marking it or creating a result PNG containing that element. I'm not sure whether this is possible, or whether there is a smart algorithm or way to do it. Thank you in advance for all the help and guidance.
Here are example images.
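As a starting point only, here is a sketch of the simplest form of the comparison: a pixel-level difference using Pillow that marks the bounding box of everything that changed. It assumes both screenshots have the same size, and the function names are hypothetical. It does not recognise a block that was moved to another place, which is the hard part identified in the question.

```python
from PIL import Image, ImageChops, ImageDraw


def changed_region(before, after):
    """Bounding box (left, upper, right, lower) of differing pixels, or None."""
    diff = ImageChops.difference(before.convert("RGB"), after.convert("RGB"))
    return diff.getbbox()


def mark_changes(before, after, outline="red", width=3):
    """Return a copy of `after` with a rectangle around the changed region."""
    marked = after.convert("RGB")  # convert() returns a new image
    box = changed_region(before, after)
    if box is not None:
        ImageDraw.Draw(marked).rectangle(box, outline=outline, width=width)
    return marked


# Usage with the files from the question:
# result = mark_changes(Image.open("web1.png"), Image.open("web2.png"))
# result.save("diff.png")
```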
I have a lot of PDF, DOC[X], TIFF, and other files (scans from a shared folder). Each file is converted into a pack of text files: one text file per page.
Each pack of files can contain multiple documents (for example, three contracts). Documents are not necessarily contracts.
While processing a pack, I don't know what kinds of documents it contains, and one pack may contain multiple document kinds (contracts, invoices, etc.).
I'm looking for possible approaches to solve this programmatically.
I tried searching for something like this but without any success.
UPD: I tried to create a binary classifier with scikit-learn and am now looking for another solution.
Since these are "scans", this sounds like something that could be approached with computer vision; however, that is currently far above my level of programming.
E.g. projects like SimpleCV may be a good starting point:
http://www.simplecv.org/
Or possibly you could get away with running OCR on the "scans" and working from the contents. pytesseract seems popular for this type of task:
https://pypi.org/project/pytesseract/
However, that still leaves open how you would tell your program that a given part of the image means there are three separate contracts. Is there anything about these files in particular that makes this clear, e.g. "1 of 3" on the pages, a logo, or something else? That will be the main factor determining how complex a problem you are trying to solve.
The best solution was to create a binary classifier (SGDClassifier) and train it on the classes first-page and not-first-page. Each item in the dataset was trimmed to 100 tokens (words).
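A rough sketch of how that could look. Only SGDClassifier, the two classes, and the 100-token trim come from the description above; the TF-IDF features, the pipeline layout, and all function names are assumptions made for illustration.

```python
def trim_tokens(text, max_tokens=100):
    """Keep only the first max_tokens whitespace-separated words of a page."""
    return " ".join(text.split()[:max_tokens])


def build_first_page_classifier():
    """TF-IDF features feeding an SGDClassifier (feature choice is an assumption)."""
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import SGDClassifier
    from sklearn.pipeline import make_pipeline

    return make_pipeline(TfidfVectorizer(), SGDClassifier(random_state=0))


def split_pack(pages, is_first_page):
    """Group consecutive pages into documents, starting a new one at each first page."""
    documents = []
    for page, first in zip(pages, is_first_page):
        if first or not documents:
            documents.append([])
        documents[-1].append(page)
    return documents


# Usage, with labels 1 = first-page and 0 = not-first-page:
# model = build_first_page_classifier()
# model.fit([trim_tokens(p) for p in train_pages], train_labels)
# flags = model.predict([trim_tokens(p) for p in pack_pages])
# documents = split_pack(pack_pages, flags)
```

The document kind (contract, invoice, etc.) is not decided here; the classifier only marks where each document in the pack begins.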
I've got a PDF file that I'm trying to obtain specific data from.
I've been able to parse the PDF via PyPDF2 into one long string, but searching for specific data is difficult because of (I assume) the formatting in the original PDF.
What I am looking to do is retrieve specific known fields and the data that immediately follows each one (as formatted in the PDF), and then store these in separate variables.
The PDFs are bills, so they are all laid out in exactly the same way, with defined fields and images. I want to extract these fields.
What would be the best way to achieve this?
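To make the goal concrete, here is a minimal sketch of the "known field, then the data that follows" idea using regular expressions on the extracted string. It assumes the text extraction keeps each label close to its value, which is not guaranteed; the field labels in the usage comment and the function name are hypothetical.

```python
import re


def extract_fields(text, labels):
    """Map each known label to the text that follows it, or None if absent."""
    found = {}
    for label in labels:
        # Label, optional colon, optional whitespace (including a line break),
        # then the rest of that line as the value.
        pattern = re.escape(label) + r"\s*:?\s*([^\n]+)"
        match = re.search(pattern, text)
        found[label] = match.group(1).strip() if match else None
    return found


# Usage, where `text` is the long string obtained from PyPDF2:
# fields = extract_fields(text, ["Invoice Number", "Amount Due", "Due Date"])
# invoice_number = fields["Invoice Number"]
```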
In general, this is probably impossible (or extremely difficult), and the details (that you don't mention) are very important. Study the complex PDF specification in detail. Note that PDF derives from PostScript, a Turing-complete language, and although PDF itself is more restricted, there is no general procedure guaranteed to recover the data behind an arbitrary file.
For example, what a human reader sees as digits could be stored as text, or as a JPEG image, etc. In practice, many PDF documents contain that kind of data. Practically speaking, PDF is an output-only format designed for on-screen display and printing, not for data extraction.
You need to understand exactly how that PDF file was generated (with what software, from what data). Without help, that could take a lot of time (maybe several years of full-time reverse-engineering work).
A much better approach is to contact the person or entity providing that PDF file and negotiate some way of accessing the actual data (or at least get a detailed explanation of how that particular PDF file is generated). For example, if the PDF file is computed from some database, you had better access that database.
Perhaps metadata or comments in your PDF file might help in guessing how it was generated.
The source of the data might produce various kinds of PDF files. For example, my cheap scanner can produce PDFs, but your program would have a hard time extracting numerical data from them (because that kind of PDF essentially wraps a pixelated image à la JPEG) and would need image recognition techniques (i.e. OCR) to do so.