Orange3 and Python Data Export/Import

Orange 3 seems to be a great tool. However, I had trouble saving and reading my files in my Python code (JupyterLab with pandas). There is a way to save the file in an Orange pickle format, but I had no luck finding a way to properly open that file again.
If there is a better way to export data tables as well, that would be much appreciated.

You can easily open a pickle file that holds the data table with:
```python
from Orange.data import Table

table = Table("pickled_file.pkl")
```
You can save an Orange Table in various formats (.tab, .csv, .pickle, ...). Just use the `save` method on the table.
Here is an example using the Iris dataset.
```python
from Orange.data import Table

table = Table("iris")
table.save("iris.csv")
```
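If you want the data as a pandas DataFrame in JupyterLab, here is a minimal sketch using Orange's pandas bridge (the `pandas_compat` module is present in recent Orange 3 releases; check your version):
```python
from Orange.data import Table
from Orange.data.pandas_compat import table_to_frame  # pandas bridge in recent Orange 3

table = Table("pickled_file.pkl")
df = table_to_frame(table)  # convert the Orange Table to a pandas DataFrame
df.head()
```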

Related

How to extract a table from any file using python?

I'm writing a Python program to extract tables from Excel sheets and PDFs. Currently, I'm using a different library for each file type: xlrd for Excel sheets and pdfminer for PDFs.
I'm wondering if there is a generic approach to extract tables from any type of file (xls, pdf, csv, word etc.). Since I'm planning to expand the list of supported file types, writing different functions for each file type would be cumbersome.
P.S. I came across PETL while looking for solutions. I could not find any excel/pdf extraction examples and I could not fully understand the documentation. Would PETL fulfill my requirement? If yes, I would really appreciate an example. Thank you.
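For the petl part, here is a minimal sketch of a dispatch-by-extension loader (assuming petl and openpyxl are installed; the file names are hypothetical). Note that petl covers delimited text and Excel files, but it has no PDF reader, so PDFs would still need a separate library such as pdfminer, tabula, or camelot.
```python
import os
import petl as etl

def load_table(path):
    """Pick a petl reader based on the file extension."""
    ext = os.path.splitext(path)[1].lower()
    if ext == ".csv":
        return etl.fromcsv(path)    # delimited text
    if ext == ".xlsx":
        return etl.fromxlsx(path)   # Excel, read via openpyxl
    raise ValueError(f"No petl reader wired up for {ext!r}")

tbl = load_table("report.csv")      # hypothetical file
print(etl.look(tbl))                # preview the first few rows
```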

How to update and delete csv data in a flask website

I'm a beginner in Flask and would like to know how to update and delete csv data using a flask website.
My CSV database is:
```
name
Mark
Tom
Matt
```
I would like to know how I could add, update, and delete data on a csv file using a flask website.
Try out pandas
```python
# Load the pandas library with alias 'pd'
import pandas as pd

# Read data from file 'filename.csv'
data = pd.read_csv("filename.csv")

# Preview the first 5 lines of the loaded data
data.head()
```
Check out more here pandas
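Building on that, a minimal sketch of add/update/delete on the single-column `name` CSV above with pandas (the file path and helper names are hypothetical):
```python
import pandas as pd

CSV_PATH = "filename.csv"  # hypothetical path to the CSV "database"

def add_name(name):
    df = pd.read_csv(CSV_PATH)
    df = pd.concat([df, pd.DataFrame({"name": [name]})], ignore_index=True)
    df.to_csv(CSV_PATH, index=False)  # write the whole file back

def update_name(old, new):
    df = pd.read_csv(CSV_PATH)
    df.loc[df["name"] == old, "name"] = new
    df.to_csv(CSV_PATH, index=False)

def delete_name(name):
    df = pd.read_csv(CSV_PATH)
    df = df[df["name"] != name]
    df.to_csv(CSV_PATH, index=False)
```
Each Flask route would simply call one of these helpers with values taken from the submitted form.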
Why do you need to store or process the data in a CSV file? You will probably need conditional CRUD, which is a very troublesome approach with CSV.
You could use SQLite or a similar database instead of a CSV file; it handles this much more efficiently. SQLite
Even so, if you are determined to use a CSV file, maybe this helps: CRUD with CSV

tabula vs camelot for table extraction from PDF

I need to extract tables from PDFs. These tables can be of any type: multiple headers, vertical headers, horizontal headers, etc.
I have implemented the basic use cases for both and found tabula doing a bit better than camelot, though it is still not able to detect all tables perfectly, and I am not sure whether either will work for all kinds of tables.
So I'm seeking suggestions from experts who have implemented a similar use case.
Example PDFs: PDF1 PDF2 PDF3
Tabula implementation:
```python
import tabula

tab = tabula.read_pdf('pdfs/PDF1.pdf', pages='all')
for t in tab:
    print(t, "\n=========================\n")
```
Camelot implementation:
```python
import camelot

tables = camelot.read_pdf('pdfs/PDF1.pdf', pages='all', split_text=True)
tables
for tabs in tables:
    print(tabs.df, "\n=================================\n")
```
Please read this: https://camelot-py.readthedocs.io/en/master/#why-camelot
The main advantage of Camelot is that this library is rich in parameters, through which you can improve the extraction.
Obviously, the application of these parameters requires some study and various attempts.
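For instance, a minimal sketch of parameter tweaking (the values here are illustrative, not tuned for the example PDFs):
```python
import camelot

# 'stream' infers table structure from whitespace; 'lattice' uses ruling lines.
tables = camelot.read_pdf(
    'pdfs/PDF1.pdf',
    pages='all',
    flavor='stream',
    edge_tol=200,   # tolerance for extending detected text edges vertically
    row_tol=10,     # tolerance for grouping nearby text into the same row
)

for t in tables:
    print(t.parsing_report)  # accuracy / whitespace metrics per table
    print(t.df)
```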
Here you can find a comparison of Camelot with other PDF table extraction libraries.
I think Camelot extracts data in a cleaner format, without jumbling it up (i.e. the data retains its information and row contents are not mixed up), so the quality of the extracted data is better when the number of lines per cell differs.
Note that Tabula requires a Java Runtime Environment.
There are both open-source (Tabula, pdf-table-extract) and closed-source (smallpdf, PDFTables) tools that are widely used to extract tables from PDF files. They either give a nice output or fail miserably. There is no in between. This is not helpful, since everything in the real world, including PDF table extraction, is fuzzy. This leads to the creation of ad-hoc table extraction scripts for each type of PDF table.
Camelot was created to offer users complete control over table extraction. If you can’t get your desired output with the default settings, you can tweak them and get the job done!

Load 100 million rows to DWH daily

There are several OLTP Postgres databases which in total accept 100 million rows daily.
There is also a Greenplum DWH. How can I load these 100 million rows of data, with only a little transformation, into Greenplum daily?
I am going to use Python for that.
I am sure that doing it the traditional way (psycopg2 + cursor.execute("INSERT ...")), even with batches, is going to take a lot of time and will create a bottleneck in the whole system.
Do you have any suggestions how to optimize the process of data loading? Any links or books which may help also welcome.
You should try to export the data into a flat file (csv, txt, etc.).
Then you can use one of Greenplum's utilities to import the data.
Look here.
You can do the transformation on the data with Python before creating the flat file, and use Python to automate the whole process: export the data into a file and import the data into the table.
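A minimal sketch of that flat-file approach with psycopg2 (connection strings, table and column names are hypothetical; for the highest throughput, Greenplum also ships dedicated bulk loaders such as gpfdist/gpload):
```python
import psycopg2

src = psycopg2.connect("dbname=oltp host=pg-source user=etl")        # hypothetical OLTP source
dst = psycopg2.connect("dbname=dwh host=greenplum-master user=etl")  # hypothetical Greenplum DWH

# 1. Export: dump the last day's rows from the OLTP database into a CSV file.
with src, src.cursor() as cur, open("batch.csv", "w") as f:
    cur.copy_expert(
        "COPY (SELECT id, payload FROM events "
        "      WHERE created_at >= now() - interval '1 day') "
        "TO STDOUT WITH CSV",
        f,
    )

# (Any light transformation can be applied to batch.csv with Python here.)

# 2. Import: bulk-load the file into Greenplum with COPY instead of row-by-row INSERTs.
with dst, dst.cursor() as cur, open("batch.csv") as f:
    cur.copy_expert("COPY fact_events (id, payload) FROM STDIN WITH CSV", f)
```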

How to save table in python?

I am trying to save a table of data to an SVG, PDF, or PNG file. Are there any libraries to do it?
I've tried pygal, but it seems that it only provides saving of charts.
Edit: this table is just a couple of arrays with data, and I need to build a nice table from them.
Use tabulate; the documentation can be found here.
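A minimal sketch of building a table from a couple of arrays with tabulate (the example data is made up); note that tabulate produces text output, so saving to SVG/PDF/PNG would still need another tool:
```python
from tabulate import tabulate

# Hypothetical example data: two parallel arrays
names = ["Mark", "Tom", "Matt"]
ages = [34, 29, 41]

table = tabulate(list(zip(names, ages)), headers=["name", "age"], tablefmt="grid")
print(table)
```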
