Python - read parquet file without pandas - python

Currently I'm using the code below on Python 3.5, Windows to read in a parquet file.
import pandas as pd
parquetfilename = 'File1.parquet'
parquetFile = pd.read_parquet(parquetfilename, columns=['column1', 'column2'])
However, I'd like to do so without using pandas. How to best do this? I'm using both Python 2.7 and 3.6 on Windows.

You can use duckdb for this. It's an embedded RDBMS similar to SQLite but with OLAP in mind. There's a nice Python API and a SQL function to import Parquet files:
import duckdb
conn = duckdb.connect(":memory:") # or a file name to persist the DB
# Keep in mind this doesn't support partitioned datasets,
# so you can only read one partition at a time
conn.execute("CREATE TABLE mydata AS SELECT * FROM parquet_scan('/path/to/mydata.parquet')")
# Export a query as CSV
conn.execute("COPY (SELECT * FROM mydata WHERE col = 'val') TO 'col_val.csv' WITH (HEADER 1, DELIMITER ',')")

Related

How do I export a pandas DataFrame to Microsoft Access?

I have a Pandas DataFrame with around 200,000 indexes/rows and 30 columns.
I need to have this directly exported into an .mdb file, converting it into a csv and manually importing it will not work.
I understand there's tools like pyodbc that help a lot with importing/reading access, but there is little documentation on how to export.
I'd love any help anyone can give, and would strongly appreciate any examples.
First convert the dataframe into .csv file using the below command
name_of_your_dataframe.to_csv("filename.csv", sep='\t', encoding='utf-8')
Then load .csv to .mdb using pyodbc
MS Access can directly query CSV files and run a Make-Table Query(https://support.office.com/en-us/article/Create-a-make-table-query-96424f9e-82fd-411e-aca4-e21ad0a94f1b) to produce a resulting table. However, some cleaning is needed to remove the rubbish rows. Below opens two files one for reading and other for writing. Assuming rubbish is in first column of csv, the if logic writes any line that has some data in second column (adjust as needed):
import os
import csv
import pyodbc
# TEXT FILE CLEAN
with open('C:\Path\To\Raw.csv', 'r') as reader, open('C:\Path\To\Clean.csv', 'w') as writer:
read_csv = csv.reader(reader); write_csv = csv.writer(writer,lineterminator='\n')
for line in read_csv:
if len(line[1]) > 0:
write_csv.writerow(line)
# DATABASE CONNECTION
access_path = "C:\Path\To\Access\\DB.mdb"
con = pyodbc.connect("DRIVER={{Microsoft Access Driver (*.mdb, *.accdb)}};DBQ={};" \
.format(access_path))
# RUN QUERY
strSQL = "SELECT * INTO [TableName] FROM [text;HDR=Yes;FMT=Delimited(,);" + \
"Database=C:\Path\To\Folder].Clean.csv;"
cur = con.cursor()
cur.execute(strSQL)
con.commit()
con.close() # CLOSE CONNECTION
os.remove('C\Path\To\Clean.csv') # DELETE CLEAN TEMP
2020 Update
There is now a supported external SQLAlchemy dialect for Microsoft Access ...
https://github.com/gordthompson/sqlalchemy-access
... which enables you to use pandas' to_sql method directly via pyodbc and the Microsoft Access ODBC driver (on Windows).
I would recommend to export the pandas dataframe to csv as usual like this:
dataframe_name.to_csv("df_filename.csv", sep=',', encoding='utf-8')
Then you can convert it to .mdb file as this stackoverflow answer shows

How to read an ORC file stored locally in Python Pandas?

Can I think of an ORC file as similar to a CSV file with column headings and row labels containing data? If so, can I somehow read it into a simple pandas dataframe? I am not that familiar with tools like Hadoop or Spark, but is it necessary to understand them just to see the contents of a local ORC file in Python?
The filename is someFile.snappy.orc
I can see online that spark.read.orc('someFile.snappy.orc') works, but even after import pyspark, it is throwing error.
I haven't been able to find any great options, there are a few dead projects trying to wrap the java reader. However, pyarrow does have an ORC reader that won't require you using pyspark. It's a bit limited but it works.
import pandas as pd
import pyarrow.orc as orc
with open(filename) as file:
data = orc.ORCFile(file)
df = data.read().to_pandas()
In case import pyarrow.orc as orc does not work (did not work for me in Windows 10), you can read them to Spark data frame then convert to pandas's data frame
import findspark
from pyspark.sql import SparkSession
findspark.init()
spark = SparkSession.builder.getOrCreate()
df_spark = spark.read.orc('example.orc')
df_pandas = df_spark.toPandas()
Starting from Pandas 1.0.0, there is a built in function for Pandas.
https://pandas.pydata.org/docs/reference/api/pandas.read_orc.html
import pandas as pd
import pyarrow.orc
df = pd.read_orc('/tmp/your_df.orc')
Be sure to read this warning about dependencies. This function might not work on Windows
https://pandas.pydata.org/docs/getting_started/install.html#install-warn-orc
If you want to use
read_orc(), it is highly recommended to install pyarrow using conda
ORC, like AVRO and PARQUET, are format specifically designed for massive storage. You can think about them "like a csv", they are all files containing data, with their particular structure (different than csv, or a json of course!).
Using pyspark should be easy reading an orc file, as soon as your environment grants the Hive support.
Answering your question, I'm not sure that in a local environment without Hive you will be able to read it, I've never done it (you can do a quick test with the following code):
Loads ORC files, returning the result as a DataFrame.
Note: Currently ORC support is only available together with Hive support.
>>> df = spark.read.orc('python/test_support/sql/orc_partitioned')
Hive is a data warehouse system, that allows you to query your data on HDFS (distributed file system) through Map-Reduce like a traditional relational database (creating queries SQL-like, doesn't support 100% all the standard SQL features!).
Edit: Try the following to create a new Spark Session. Not to be rude, but I suggest you to follow one of many PySpark tutorial in order to understand the basics of this "world". Everything will be much clearer.
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Test').getOrCreate()
Easiest way is using pyorc:
import pyorc
import pandas as pd
with open(r"my_orc_file.orc", "rb") as orc_file:
reader = pyorc.Reader(orc_file)
orc_data = reader.read()
orc_schema = reader.schema
columns = list(orc_schema.fields)
df = pd.DataFrame(data=orc_data, columns=columns)
I did not want to submit a spark job to read local ORC files or have pandas. This worked for me.
import pyarrow.orc as orc
data_reader = orc.ORCFile("/path/to/orc/part_file.zstd.orc")
data = data_reader.read()
source = data.to_pydict()

How i can insert data from dataframe(in python) to greenplum table?

Problem Statement:
I have multiple csv files. I am cleaning them using python and inserting them to SQL server using bcp. Now I want to insert that into Greenplum instead of SQL Server. Please suggest a way to bulk insert into greenplum table directly from python data-frame to GreenPlum table.
Solution: (What i can think)
Way i can think is CSV-> Dataframe -> Cleainig -> Dataframe -> CSV -> then Use Gpload for Bulk load. And integrate it in Shell script for automation.
Do anyone has a good solution for it.
Issue in loading data directly from dataframe to gp table:
As gpload ask for the file path. Can i pass a varibale or dataframe to that? Is there any way to bulkload into greenplum ?I dont want to create a csv or txt file from dataframe and then load it to greenplum.
I would use psycopg2 and the io libraries to do this. io is built-in and you can install psycopg2 using pip (or conda).
Basically, you write your dataframe to a string buffer ("memory file") in the csv format. Then you use psycopg2's copy_from function to bulk load/copy it to your table.
This should get you started:
import io
import pandas
import psycopg2
# Write your dataframe to memory as csv
csv_io = io.StringIO()
dataframe.to_csv(csv_io, sep='\t', header=False, index=False)
csv_io.seek(0)
# Connect to the GreenPlum database.
greenplum = psycopg2.connect(host='host', database='database', user='user', password='password')
gp_cursor = greenplum.cursor()
# Copy the data from the buffer to the table.
gp_cursor.copy_from(csv_io, 'db.table')
greenplum.commit()
# Close the GreenPlum cursor and connection.
gp_cursor.close()
greenplum.close()

How to write python sql output into CSV using a dataframe

IMPORT MODULES
import pyodbc
import pandas as pd
import csv
CREATE CONNECTION TO MICROSOFT SQL SERVER
msconn = pyodbc.connect(driver='{SQL Server}',
server='SERVER',
database='DATABASE',
trusted_msconnection='yes')
cursor = msconn.cursor()
CREATE VARIABLES THAT HOLD SQL STATEMENTS
SCRIPT = "SELECT * FROM TABLE"
PRINT DATA
cursor.execute(SCRIPT)
cursor.commit
for row in cursor:
print (row)
WRITE ALL ROWS WITH COLUMN NAME TO CSV --- NEED HELP HERE
Pandas
Since pandas support direct import from an RDBMS with the name being called read_sql you don't need to write this manually.
from sqlalchemy import create_engine
import pandas as pd
engine = create_engine('mssql+pyodbc://user:pass#mydsn')
df = pd.read_sql(sql='SELECT * FROM ...', con=engine)
The right tool: odo
From odo docs
Loading CSV files into databases is a solved problem. It’s a problem
that has been solved well. Instead of rolling our own loader every
time we need to do this and wasting computational resources, we should
use the native loaders in the database of our choosing.
And it works the other way round also.
from odo import odo
odo('mssql+pyodbc://user:pass#mydsn::tablename','myfile.csv')
#e4c5's answer is great as it should be faster compared to for loop + cursor - i would extend it with saving result set to CSV:
...
pd.read_sql(sql='SELECT * FROM TABLE', con=msconn) \
.to_csv('/path/to/file.csv', index=False)
if you want to read all rows (not specifying WHERE clause):
pd.read_sql_table('TABLE', con=msconn).to_csv('/path/to/file.csv', index=False)

SQL statement for CSV files on IPython notebook

I have a tabledata.csv file and I have been using pandas.read_csv to read or choose specific columns with specific conditions.
For instance I use the following code to select all "name" where session_id =1, which is working fine on IPython Notebook on datascientistworkbench.
df = pandas.read_csv('/resources/data/findhelp/tabledata.csv')
df['name'][df['session_id']==1]
I just wonder after I have read the csv file, is it possible to somehow "switch/read" it as a sql database. (i am pretty sure that i did not explain it well using the correct terms, sorry about that!). But what I want is that I do want to use SQL statements on IPython notebook to choose specific rows with specific conditions. Like I could use something like:
Select `name`, count(distinct `session_id`) from tabledata where `session_id` like "100.1%" group by `session_id` order by `session_id`
But I guess I do need to figure out a way to change the csv file into another version so that I could use sql statement. Many thx!
Here is a quick primer on pandas and sql, using the builtin sqlite3 package. Generally speaking you can do all SQL operations in pandas in one way or another. But databases are of course useful. The first thing you need to do is store the original df in a sql database so that you can query it. Steps listed below.
import pandas as pd
import sqlite3
#read the CSV
df = pd.read_csv('/resources/data/findhelp/tabledata.csv')
#connect to a database
conn = sqlite3.connect("Any_Database_Name.db") #if the db does not exist, this creates a Any_Database_Name.db file in the current directory
#store your table in the database:
df.to_sql('Some_Table_Name', conn)
#read a SQL Query out of your database and into a pandas dataframe
sql_string = 'SELECT * FROM Some_Table_Name'
df = pd.read_sql(sql_string, conn)
Another answer suggested using SQLite. However, DuckDB is a much faster alternative than loading your data into SQLite.
First, loading your data will take time; second, SQLite is not optimized for analytical queries (e.g., aggregations).
Here's a full example you can run in a Jupyter notebook:
Installation
pip install jupysql duckdb duckdb-engine
Note: if you want to run this in a notebook, use %pip install jupysql duckdb duckdb-engine
Example
Load extension (%sql magic) and create in-memory database:
%load_ext SQL
%sql duckdb://
Download some sample CSV data:
from urllib.request import urlretrieve
urlretrieve("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv", "penguins.csv")
Query:
%%sql
SELECT species, COUNT(*) AS count
FROM penguins.csv
GROUP BY species
ORDER BY count DESC
JupySQL documentation available here

Categories

Resources