I'm writing a Python program to extract tables from Excel sheets and PDFs. Currently I'm using a different library for each file type: xlrd for Excel sheets, pdfminer for PDFs.
I'm wondering if there is a generic approach to extracting tables from any type of file (xls, pdf, csv, word, etc.). Since I'm planning to expand the list of supported file types, writing a different function for each one would be cumbersome.
P.S. I came across PETL while looking for solutions. I could not find any Excel/PDF extraction examples, and I could not fully understand the documentation. Would PETL fulfill my requirement? If yes, I would really appreciate an example. Thank you.
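(For reference, here is a hedged sketch of what PETL extraction looks like for the formats it does support. PETL reads tabular sources such as csv, xls, and xlsx, but to my knowledge it has no PDF reader, so a pdf branch would still need pdfminer or similar. The file names below are placeholders.)

import petl as etl

def extract_table(path):
    # Dispatch on file extension to the matching petl reader.
    if path.endswith(".csv"):
        return etl.fromcsv(path)
    elif path.endswith(".xlsx"):
        return etl.fromxlsx(path)  # uses openpyxl under the hood
    elif path.endswith(".xls"):
        return etl.fromxls(path)   # uses xlrd under the hood
    raise ValueError(f"unsupported file type: {path}")

table = extract_table("data.csv")
print(etl.look(table))  # preview the first few rows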
Related
I have been uploading csv, excel, json, or geojson files into a PostgreSQL database using Python/Django.
I've noticed that the scripts are redundant and sometimes difficult to maintain when we need to update keys or columns. Is there a design pattern I could use? I have never used one before.
Any suggestions or links would help!
Thanks for taking the time to read my question.
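(One pattern that often comes up for this kind of problem is a registry of loader functions keyed by file extension, a simple form of the strategy pattern. A minimal sketch follows; the loader names are hypothetical and the Django save step is only indicated in a comment.)

import csv
import json
import os

def load_csv(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def load_json(path):
    with open(path) as f:
        return json.load(f)

# Map each supported extension to its loader; supporting a new file
# type means adding one function and one entry here.
LOADERS = {
    ".csv": load_csv,
    ".json": load_json,
    # ".xlsx": load_excel,  # e.g. via pandas.read_excel
}

def load_records(path):
    ext = os.path.splitext(path)[1].lower()
    try:
        return LOADERS[ext](path)
    except KeyError:
        raise ValueError(f"unsupported file type: {ext}")

# The resulting dicts can then go through the Django ORM, e.g.:
# for record in load_records("data.csv"):
#     MyModel.objects.create(**record)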
I am working on a personal project to learn Python scripting for Excel, and I want to learn how to move data from one workbook to another.
In this example, I am emulating a company employee ledger that has name, position, address, and more (the organization is by row, so every employee takes up one row). The goal of the project is to have a selected number of people transferred to a new ledger (another Excel file). I have a list of emails in a .txt file (it could even be another Excel file, but I thought .txt would be easier), and I want the script to run through the .txt file, get the emails, and look for any rows with a matching email address (all emails are in column B). If any are found, it should copy that entire row to the new Excel file.
I tried a lot of ways to make this work, but I could not figure it out. I am really new to Python, so I am not even sure if this is possible. Would really appreciate some help!
You have essentially two packages that allow manipulation of Excel files. For reading in data and performing analysis, the standard package is pandas. You can save files as .xlsx, but you are only really working with the base table data and not the file itself (i.e., you are extracting data FROM the file, not working WITH the file).
However, what you really need is to manipulate Excel files directly, which is better done with openpyxl.
You can also read files (such as your text file) using the built-in open() function with a with statement, which is native to Python and not a third-party import like pandas or openpyxl.
Part of learning to program includes learning how to use documentation.
As such, here is the documentation you require with sufficient examples to learn openpyxl: https://openpyxl.readthedocs.io/en/stable/
And you can learn about pandas here: https://pandas.pydata.org/docs/user_guide/index.html
And you can learn about python with open here: https://docs.python.org/3/tutorial/inputoutput.html
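To give a taste of how these pieces fit together, here is a rough, untested sketch of the workflow you describe; the file names, the header row, and the assumption that column B holds the emails are all taken from your description or guessed:

# Collect the wanted emails, then copy matching rows from the
# source ledger to a new workbook.
from openpyxl import load_workbook, Workbook

# 1. Read the emails from the text file, one per line.
with open("emails.txt") as f:
    wanted = {line.strip().lower() for line in f if line.strip()}

# 2. Scan the ledger and append matching rows to a new workbook.
src = load_workbook("ledger.xlsx").active
dst_wb = Workbook()
dst = dst_wb.active

for row in src.iter_rows(min_row=2, values_only=True):  # skip the header
    email = row[1]  # column B holds the email address
    if email and str(email).lower() in wanted:
        dst.append(row)

dst_wb.save("selected_employees.xlsx")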
Hope this helps.
EDIT: It's possible I or another person can give you a specific example using your data / code etc., but you would have to provide it fully. Since you're learning, I suggest using the documentation or YouTube.
I've been trying to do this and I really have no clue. I've searched a lot, and I know that I can merge the files easily with VBA or other languages, but I really want to do it with Python.
Can anyone get me on track?
I wish there were straightforward support in openpyxl/xlsxwriter for copying sheets across different workbooks.
However, as I see it, you would have to mash up a recipe using a couple of libraries:
one for reading the worksheet data, and
another for writing the data to a unified xlsx.
For both of the above there are a lot of options in terms of Python packages; one possible recipe is sketched below.
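For example, a minimal sketch using openpyxl for both the reading and the writing (the file names are placeholders, and note that only cell values are carried over, not formatting or images):

from openpyxl import load_workbook, Workbook

# Read every sheet from the source workbook (computed values only).
src = load_workbook("source.xlsx", data_only=True)

# Write the data into same-named sheets of a new workbook.
dst = Workbook()
dst.remove(dst.active)  # drop the default empty sheet

for sheet in src.worksheets:
    out = dst.create_sheet(title=sheet.title)
    for row in sheet.iter_rows(values_only=True):
        out.append(row)

dst.save("merged.xlsx")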
I need to create some Excel tables, but these tables don't have a simple look.
There are some pictures, some special fonts, etc.
But the complicated parts are static, that is, always the same.
So my idea was to create an Excel template with these tricky parts and then, from Python, just insert the dynamic data into this template.
I am working with the pandas framework, but I haven't found a way to do that, either with or without it.
Any idea?
There isn't an easy way to do this with any of the usual "direct file manipulation" libraries in Python (xlrd, xlwt, XlsxWriter, OpenPyXL; these are what pandas uses). The reason is that the structure of a workbook file is such that it's impossible or prohibitively difficult (depending on whether you're talking about .xls or .xlsx) to do anything resembling "in-place" editing, short of re-implementing Excel itself.
So for what you're trying to do, your best option is to let Excel do the work. (I'm assuming you can run Excel, since you mention that you'd like to create Excel templates.) There are ways to automate Excel, the most straightforward probably being Microsoft's VBA or VBScript. But if you want to do it in Python, you can, using PyWin32 or pywinauto.
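For instance, here is a rough pywin32 sketch of the template approach (Windows with Excel installed is required; the paths, sheet name, cell addresses, and values are placeholders):

import win32com.client

excel = win32com.client.Dispatch("Excel.Application")
excel.Visible = False
try:
    # Open the pre-made template with the static pictures/fonts intact.
    wb = excel.Workbooks.Open(r"C:\templates\report_template.xlsx")
    ws = wb.Worksheets("Data")            # the sheet holding the dynamic part
    ws.Range("B2").Value = "2024 totals"  # fill in the dynamic cells
    ws.Range("B3").Value = 1234.56
    wb.SaveAs(r"C:\reports\report_filled.xlsx")
    wb.Close(SaveChanges=False)
finally:
    excel.Quit()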
How can I convert a .csv file into a .dbf file using a Python script? I found this piece of code online, but I'm not certain how reliable it is. Are there any modules out there that have this functionality?
Using the dbf package you can convert a basic csv file with code similar to this:
import dbf
some_table = dbf.from_csv(csvfile='/path/to/file.csv', to_disk=True)
This will create a table with the same name, with either Character or Memo fields, and with field names of f0, f1, f2, etc.
For a different filename use the filename parameter, and if you know your field names you can also use the field_names parameter.
some_table = dbf.from_csv(csvfile='data.csv', filename='mytable',
                          field_names='name age birth'.split())
Rather basic documentation is available here.
Disclosure: I am the author of this package.
You won't find anything on the net that reads a CSV file and writes a DBF file such that you can just invoke it and supply two file paths. For each DBF field you need to specify the type, size, and (if relevant) number of decimal places.
Some questions:
What software is going to consume the output DBF file?
There is no such thing as "the" (one and only) DBF file format. Do you need dBase III? dBase IV? dBase 7? Visual FoxPro? etc.?
What is the maximum length of text field that you need to write? Do you have non-ASCII text?
Which version of Python?
If your requirements are minimal (dBase III format, no non-ASCII text, text <= 254 bytes long, Python 2.X), then the cookbook recipe that you quoted should do the job.
Use the csv library to read your data from the csv file. The third-party dbf library can write a dbf file for you.
Edit: Originally, I listed dbfpy, but the library above seems to be more actively updated.
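As a rough sketch of that combination (the field specs and file names here are assumptions, and the header handling depends on your csv):

import csv
import dbf

# Define the target table structure up front: type and size per field.
table = dbf.Table("people.dbf", "name C(50); age N(3,0)")
table.open(mode=dbf.READ_WRITE)

with open("people.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    for name, age in reader:
        table.append((name, int(age)))

table.close()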
None that are well-polished, to my knowledge. I have had to work with xBase files many times over the years, and I keep finding myself writing code to do it when I have to do it. I have, somewhere in one of my backups, a pretty functional, pure-Python library to do it, but I don't know precisely where that is.
Fortunately, the xBase file format isn't all that complex. You can find the specification on the Internet, of course. At a glance the module that you linked to looks fine, but of course make copies of any data that you are working with before using it.
A solid, read/write, fully functional xBase library with all the bells and whistles is something that has been on my TODO list for a while... I might even get to it in what is left of this year, if I'm lucky... (probably not, though, sadly).
I have created a Python script here. It should be customizable for any csv layout. You do need to know your DBF data structure before this will be possible. This script requires two csv files, one for your DBF header setup and one for your body data. Good luck.
https://github.com/mikebrennan/csv2dbf_python