How to keep null values when writing to csv - python

I'm writing data from SQL Server into a CSV file using Python's csv module and then uploading the CSV file to a Postgres database using the COPY command. The issue is that Python's csv writer automatically converts None values into empty strings (""), and my job fails when a column has an int or float datatype and the load tries to insert this "" where it should be a None or NULL value.
To make it as easy as possible to interface with modules which
implement the DB API, the value None is written as the empty string.
https://docs.python.org/3.4/library/csv.html?highlight=csv#csv.writer
What is the best way to keep the null value? Is there a better way to write csvs in Python? I'm open to all suggestions.
Example:
I have lat and long values:
42.313270000 -71.116240000
42.377010000 -71.064770000
NULL NULL
When writing to csv it converts nulls to "":
with file_path.open(mode='w', newline='') as outfile:
    csv_writer = csv.writer(outfile, delimiter=',', quoting=csv.QUOTE_NONNUMERIC)
    if include_headers:
        csv_writer.writerow(col[0] for col in self.cursor.description)
    for row in self.cursor:
        csv_writer.writerow(row)
The resulting file:
42.313270000,-71.116240000
42.377010000,-71.064770000
"",""
NULL
Specifies the string that represents a null value. The default is \N
(backslash-N) in text format, and an unquoted empty string in CSV
format. You might prefer an empty string even in text format for cases
where you don't want to distinguish nulls from empty strings. This
option is not allowed when using binary format.
https://www.postgresql.org/docs/9.2/sql-copy.html
ANSWER:
What solved the problem for me was changing the quoting to csv.QUOTE_MINIMAL.
csv.QUOTE_MINIMAL Instructs writer objects to only quote those fields
which contain special characters such as delimiter, quotechar or any
of the characters in lineterminator.
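For reference, here is a minimal sketch of that change applied to the writer from the question (reusing the file_path, include_headers and self.cursor names from above):

with file_path.open(mode='w', newline='') as outfile:
    # QUOTE_MINIMAL writes None as an unquoted empty field, which
    # COPY ... WITH (FORMAT csv, NULL '') then loads as NULL
    csv_writer = csv.writer(outfile, delimiter=',', quoting=csv.QUOTE_MINIMAL)
    if include_headers:
        csv_writer.writerow(col[0] for col in self.cursor.description)
    for row in self.cursor:
        csv_writer.writerow(row)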
Related questions:
- Postgresql COPY empty string as NULL not work

You have two options here: change the csv.writer() quoting option in Python, or tell PostgreSQL to accept quoted strings as possible NULLs (the latter requires PostgreSQL 9.4 or newer).
Python csv.writer() and quoting
On the Python side, you are telling the csv.writer() object to add quotes, because you configured it to use csv.QUOTE_NONNUMERIC:
Instructs writer objects to quote all non-numeric fields.
None values are non-numeric, so result in "" being written.
Switch to using csv.QUOTE_MINIMAL or csv.QUOTE_NONE:
csv.QUOTE_MINIMAL
Instructs writer objects to only quote those fields which contain special characters such as delimiter, quotechar or any of the characters in lineterminator.
csv.QUOTE_NONE
Instructs writer objects to never quote fields. When the current delimiter occurs in output data it is preceded by the current escapechar character.
Since all you are writing is longitude and latitude values, you don't need any quoting here; there are no delimiters or quote characters present in your data.
With either option, the CSV output for None values is simply an empty string:
>>> import csv
>>> from io import StringIO
>>> def test_csv_writing(rows, quoting):
...     outfile = StringIO()
...     csv_writer = csv.writer(outfile, delimiter=',', quoting=quoting)
...     csv_writer.writerows(rows)
...     return outfile.getvalue()
...
>>> rows = [
...     [42.313270000, -71.116240000],
...     [42.377010000, -71.064770000],
...     [None, None],
... ]
>>> print(test_csv_writing(rows, csv.QUOTE_NONNUMERIC))
42.31327,-71.11624
42.37701,-71.06477
"",""
>>> print(test_csv_writing(rows, csv.QUOTE_MINIMAL))
42.31327,-71.11624
42.37701,-71.06477
,
>>> print(test_csv_writing(rows, csv.QUOTE_NONE))
42.31327,-71.11624
42.37701,-71.06477
,
PostgreSQL 9.4 COPY FROM, NULL values and FORCE_NULL
As of PostgreSQL 9.4, you can also force PostgreSQL to accept quoted empty strings as NULLs, when you use the FORCE_NULL option. From the COPY FROM documentation:
FORCE_NULL
Match the specified columns' values against the null string, even if it has been quoted, and if a match is found set the value to NULL. In the default case where the null string is empty, this converts a quoted empty string into NULL. This option is allowed only in COPY FROM, and only when using CSV format.
Naming the columns in a FORCE_NULL option lets PostgreSQL accept both the empty column and "" as NULL values for those columns, e.g.:
COPY position (
    lon,
    lat
)
FROM 'filename'
WITH (
    FORMAT csv,
    NULL '',
    DELIMITER ',',
    FORCE_NULL (lon, lat)
);
at which point it doesn't matter anymore what quoting options you used on the Python side.
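For illustration, here is a sketch of driving that COPY from Python with psycopg2's copy_expert(); conn is assumed to be an existing psycopg2 connection, and file_path is the CSV written earlier:

import psycopg2  # assumed driver

copy_sql = """
    COPY position (lon, lat)
    FROM STDIN
    WITH (FORMAT csv, NULL '', DELIMITER ',', FORCE_NULL (lon, lat))
"""

# add HEADER to the WITH options if the file was written with a header row
with conn.cursor() as cur, file_path.open('r', newline='') as infile:
    cur.copy_expert(copy_sql, infile)
conn.commit()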
Other options to consider
For simple data transformation tasks from other databases, don't use Python
If you are already querying databases to collate data to go into PostgreSQL, consider inserting directly into Postgres. If the data comes from other sources, using the foreign data wrapper (fdw) module lets you cut out the middleman and pull data directly into PostgreSQL from other sources.
Numpy data? Consider using COPY FROM as binary, directly from Python
Numpy data can more efficiently be inserted via binary COPY FROM; the linked answer augments a numpy structured array with the required extra metadata and byte ordering, then efficiently creates a binary copy of the data and inserts it into PostgreSQL using COPY FROM STDIN WITH BINARY and the psycopg2.copy_expert() method. This neatly avoids number -> text -> number conversions.
Persisting data to handle large datasets in a pipeline?
Don't re-invent the data pipeline wheels. Consider using existing projects such as Apache Spark, which have already solved the efficiency problems. Spark lets you treat data as a structured stream, and includes the infrastructure to run data analysis steps in parallel, and you can treat distributed, structured data as Pandas dataframes.
Another option might be to look at Dask to help share datasets between distributed tasks to process large amounts of data.
Even if converting an already running project to Spark might be a step too far, at least consider using Apache Arrow, the data exchange platform Spark builds on top of. The pyarrow project would let you exchange data via Parquet files, or exchange data over IPC.
The Pandas and Numpy teams are quite heavily invested in supporting the needs of Arrow and Dask (there is considerable overlap in core members between these projects) and are actively working to make Python data exchange as efficient as possible, including extending Python's pickle module to allow for out-of-band data streams to avoid unnecessary memory copying when sharing data.

Your code

for row in self.cursor:
    csv_writer.writerow(row)

uses the writer as-is, but you don't have to do that. You can filter the values, replacing particular ones, with a generator expression and a ternary conditional:

for row in self.cursor:
    csv_writer.writerow("null" if x is None else x for x in row)

You are asking for csv.QUOTE_NONNUMERIC. This quotes every field that is not a number, so None is written as the quoted empty string "". You should consider using csv.QUOTE_MINIMAL instead, as it might be closer to what you are after:
Test Code:
import csv

test_data = (None, 0, '', 'data')

for name, quotes in (('test1.csv', csv.QUOTE_NONNUMERIC),
                     ('test2.csv', csv.QUOTE_MINIMAL)):
    with open(name, mode='w', newline='') as outfile:
        csv_writer = csv.writer(outfile, delimiter=',', quoting=quotes)
        csv_writer.writerow(test_data)
Results:
test1.csv:
"",0,"","data"
test2.csv:
,0,,data

I'm writing data from sql server into a csv file using Python's csv module and then uploading the csv file to a postgres database using the copy command.
I believe your real requirement is to move data rows through the filesystem, and as both the sentence above and the question title make clear, you are currently doing that with a CSV file.
The trouble is that the CSV format offers poor support for the RDBMS notion of NULL.
Let me solve your problem for you by changing the question slightly.
I'd like to introduce you to parquet format.
Given a set of table rows in memory, it allows you to very quickly persist them to a compressed binary file, and recover them, with metadata and NULLs intact, no text quoting hassles.
Here is an example, using the pyarrow 0.12.1 parquet engine:
import pandas as pd
import pyarrow


def round_trip(fspec='/tmp/locations.parquet'):
    rows = [
        dict(lat=42.313, lng=-71.116),
        dict(lat=42.377, lng=-71.065),
        dict(lat=None, lng=None),
    ]
    df = pd.DataFrame(rows)
    df.to_parquet(fspec)
    del df
    df2 = pd.read_parquet(fspec)
    print(df2)


if __name__ == '__main__':
    round_trip()
Output:
      lat     lng
0  42.313 -71.116
1  42.377 -71.065
2     NaN     NaN
Once you've recovered the rows in a dataframe you're free to call df2.to_sql() or use some other favorite technique to put numbers and NULLs into a DB table.
EDIT:
If you're able to run .to_sql() on the PG server, or on the same LAN, then do that.
Otherwise your favorite technique will likely involve .copy_expert().
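For instance, one common .copy_expert() pattern is to serialise the dataframe to an in-memory CSV buffer and stream it into COPY. A sketch, assuming a psycopg2 connection conn and a hypothetical target table locations(lat, lng):

import io

buf = io.StringIO()
df2.to_csv(buf, index=False, header=False, na_rep='')  # NaN/None become empty, unquoted fields
buf.seek(0)

with conn.cursor() as cur:
    cur.copy_expert("COPY locations (lat, lng) FROM STDIN WITH (FORMAT csv, NULL '')", buf)
conn.commit()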
Why?
The summary is that with psycopg2, "bulk INSERT is slow".
Middle layers like sqlalchemy and pandas, and well-written apps that care about insert performance, will use .executemany().
The idea is to send lots of rows all at once, without waiting for individual result status, because we're not worried about unique index violations.
So TCP gets a giant buffer of SQL text and sends it all at once, saturating the end-to-end channel's bandwidth,
much as copy_expert sends a big buffer to TCP to achieve high bandwidth.
In contrast, the psycopg2 driver lacks support for high-performance executemany().
As of 2.7.4 it just executes items one at a time, sending a SQL command across the WAN and waiting a round-trip time for the result before sending the next command.
Ping your server;
if ping times suggest you could get a dozen round trips per second,
then plan on only inserting about a dozen rows per second.
Most of the time is spent waiting for a reply packet, rather than spent processing DB rows.
It would be lovely if at some future date psycopg2 would offer better support for this.

I would use pandas, psycopg2, and SQLAlchemy; make sure all three are installed. Coming from your current workflow and avoiding writing to CSV:

# no need to import psycopg2 directly; SQLAlchemy uses it under the hood
import pandas as pd
from sqlalchemy import create_engine

# create connection to Postgres (SQLAlchemy expects the postgresql:// dialect name)
engine = create_engine('postgresql://.....')

# get column names from cursor.description
columns = [col[0] for col in self.cursor.description]

# convert data into a dataframe
df = pd.DataFrame(self.cursor.fetchall(), columns=columns)

# send dataframe to Postgres
df.to_sql('name_of_table', engine, if_exists='append', index=False)

# if you still need to write to csv
df.to_csv('your_file.csv')

Related

Fastest way to convert object to string Python

I have a ~250,000 row dataset I am pulling from Oracle DB. The main data I am concerned with is a text field that is pulled as a HUGECLOB object. Storing my data to a CSV took quite a bit of time, so I decided to switch to Feather via: Pandas read_csv speed up
There was some other question that used feather too, but I cannot find it. I tried to_hdf, but for some reason that did not work. Either way, to get my query results to work with Feather, the text column must be converted to string. So I have:
SQLquery = ('SELECT*')
datai = pd.read_sql(SQLquery, conn)
print("Query Passed, start date")
datai['REPORTDATE'] = pd.to_datetime(datai['REPORTDATE'], format='%m-%d-%Y')
print("Row done, string")
datai['LOWER(LD.LDTEXT)'] = datai['LOWER(LD.LDTEXT)'].apply(str)
print("Data Retrieved")
print("To feather start")
datai.to_feather(r'C:\Users\asdean\Documents\Nuclear Text\dataStore\rawData2.feather')
print("Done with feather")
Note: I put a bunch of print statements because I was trying to figure out where it got hung up.
The text column is identified as datai['LOWER(LD.LDTEXT)']. Some of the rows contain quite a bit of text (~ a couple paragraphs). The string conversion TAKES FOREVER (or may not even be completing). I did not have this problem when I read it from an old CSV (old data, no longer use, we updated the query, etc).
I have tried all the common ways of doing this with astype(str), map(str), apply(str), and values.astype(str), with no success in speeding it up to a reasonable (less than one hour) pace. Is there a way to do faster object-to-string conversions? Is there a library I am missing here? Is there a faster way through Oracle/HUGECLOB? How can I speed this up?
I'm assuming the data is stored in the DB as the Oracle CLOB datatype. How big is the data? If the strings are < 1 GB, make sure they are being fetched as strings (not streamed) in the lower layer; see https://cx-oracle.readthedocs.io/en/latest/user_guide/lob_data.html#fetching-lobs-as-strings-and-bytes
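The technique described on that page is an output type handler. A sketch against the question's conn and SQLquery; the constant names below follow recent cx_Oracle releases and may be spelled cx_Oracle.CLOB / cx_Oracle.LONG_STRING on older versions:

import cx_Oracle

def output_type_handler(cursor, name, default_type, size, precision, scale):
    # fetch CLOB columns directly as Python strings instead of LOB locators
    if default_type == cx_Oracle.DB_TYPE_CLOB:
        return cursor.var(cx_Oracle.DB_TYPE_LONG, arraysize=cursor.arraysize)

conn.outputtypehandler = output_type_handler
datai = pd.read_sql(SQLquery, conn)  # the text column now arrives as str; the .apply(str) pass goes away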

Conserving data types from dataframe to SQL Server table

I have a dataframe (originally from a csv)
df = pd.read_excel(r'R:\__Test Server\DailyStatuses\DailyWork10.18.xlsx')
I created a dataframe to deal with some null values in the rows. I've also created the table in SQL Server, and defined the column types (datetime, int and varchar).
I'm building an insert string to insert the data into a new table in SQL Server.
insert_query = 'INSERT INTO [DailyStatuses].[dbo].[StatusReports] VALUES ('
for i in range(df.shape[0]):
    for j in range(df.shape[1]):
        insert_query += (df[df.columns.values[j]][i]) + ','
    insert_query = insert_query[:-1] + '),('
insert_query = insert_query[:-3] + ');'
My output is:
INSERT INTO [DailyStatuses].[dbo].[StatusReports] VALUES (3916, 2019-10-17 16:45:54...
I'm constantly running into errors about data types, is it best to define everything as a str so it's easier to insert into SQL Server (and define each column in the table as a str) and define data types upon extraction later down the road?
You would be better off using parameters (see the sketch below), but based on the question you asked, you are going to have to deal with each data type separately.
For int values, you'll be fine with them as-is.
For string values, you'll have to put single quotes around them, i.e. 'a value', and you'll also need to replace any embedded single quotes with two single quotes.
For datetime values, you should use a format that isn't affected by regional settings and put quotes around them, e.g. '20191231 12:54:54'.
The other alternative (as you suggest) is to bring them all in as strings and do the clean-up and data-type changes within SQL Server. That's often a more reliable direction. Again though, don't forget to double up any embedded single quotes within the values.
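Here is a sketch of the parameterised route, assuming a pyodbc connection conn and the three columns (datetime, int, varchar) described in the question; the driver then handles quoting and type conversion for you:

import pandas as pd
import pyodbc  # assumed driver; any DB API driver with executemany works similarly

# convert pandas NaN/NaT placeholders to None so they are inserted as SQL NULL
rows = [tuple(None if pd.isna(v) else v for v in rec)
        for rec in df.itertuples(index=False, name=None)]

insert_query = ('INSERT INTO [DailyStatuses].[dbo].[StatusReports] '
                'VALUES (?, ?, ?)')  # one ? marker per column

cursor = conn.cursor()
cursor.fast_executemany = True  # pyodbc-specific switch that batches the parameter sets
cursor.executemany(insert_query, rows)
conn.commit()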

create database by load a csv files using the header as columnnames (and add a column that has the filename as a name)

I have CSV files that I want to make database tables from in MySQL. I've searched all over and can't find anything on how to use the header as the column names for the table. I suppose this must be possible. In other words, when creating a new table in MySQL, do you really have to define all the columns, their names, their types, etc. in advance? It would be great if MySQL could do something like Office Access, where it converts to the corresponding type depending on how the value looks.
I know this is maybe a too broadly defined question, but any pointers in this matter would be helpful. I am learning Python too, so if it can be done through a python script that would be great too.
Thank you very much.
Using Python, you could use the csv module's DictReader class, which makes it pretty easy to use the headers from the CSV files as labels for the input data. It basically reads each line in as a dictionary object with the headers as keys, so you can use the keys as the source for your column names when accessing MySQL.
A quick example that reads a csv into a list of dictionaries:
example.csv:
name,address,city,state,phone
jack,111 washington st, somewhere, NE, 888-867-5309
jill,112 washington st, somewhere else, NE, 888-867-5310
john,113 washington st, another place, NE, 888-867-5311
example.py:
import csv

data = []
with open("example.csv") as csvfile:
    reader = csv.DictReader(csvfile)
    for line in reader:
        data.append(line)

print(data[0].keys())
print(data[0]['address'])
print(data[1]['name'])
print(data[2]['phone'])
output:
$:python example.py
dict_keys(['name', 'address', 'city', 'state', 'phone'])
111 washington st
jill
888-867-5311
More in-depth examples at: http://java.dzone.com/articles/python-101-reading-and-writing
Some info on connection to MySQL in Python: How do I connect to a MySQL Database in Python?
The csv module can easily give you the column names from the first line, and then the values from the other ones. The hard part will be to guess the correct column types. When you load a csv file into an Excel worksheet, you only have a few types: numeric, string, date.
In a database like MySQL, you can define the size of string columns, and you can give the table a primary key and eventually other indexes. You will not be able to guess that part automatically from a csv file.
The simplest way is to treat all columns as varchar(255). It is really uncommon to have fields in a csv file that do not fit in 255 characters. If you want something more clever, you will have to scan the file twice: the first time to determine the maximum size of each column, and at the end you could take the smallest power of 2 greater than that. The next step would be to check whether any column contains only integers or floating-point values. That begins to be harder to do automatically, because the representation of floating-point values may differ depending on the locale. For example, 12.51 in an English locale would be 12,51 in a French locale. But Python can give you the locale.
The hardest thing would be eventual date or datetime fields, because there are many possible formats, either purely numeric (dd/mm/yyyy or mm/dd/yy) or using plain text (Monday, 29th of September).
My advice would be to define a default mode, for example all strings, or just integers and strings, and use configuration parameters or even a configuration file to fine-tune the conversion per column.
For the reading part, the csv module will give you everything you need.
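As a rough sketch of the all-varchar default mode described above (hypothetical table name, trusted CSV headers, and the mysql.connector driver assumed):

import csv
import mysql.connector  # assumed driver; any DB API connector works similarly

def load_csv(csv_path, table_name, conn):
    with open(csv_path, newline='') as f:
        reader = csv.reader(f)
        headers = next(reader)  # first row supplies the column names
        # default mode: every column becomes VARCHAR(255)
        columns = ', '.join('`%s` VARCHAR(255)' % h.strip() for h in headers)
        cur = conn.cursor()
        cur.execute('CREATE TABLE `%s` (%s)' % (table_name, columns))
        placeholders = ', '.join(['%s'] * len(headers))
        cur.executemany('INSERT INTO `%s` VALUES (%s)' % (table_name, placeholders),
                        list(reader))
    conn.commit()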

Sorting in pandas for large datasets

I would like to sort my data by a given column, specifically p-values. However, the issue is that I am not able to load my entire data into memory. Thus, the following doesn't work, or rather works only for small datasets.
data = data.sort(columns=["P_VALUE"], ascending=True, axis=0)
Is there a quick way to sort my data by a given column that only takes chunks into account and doesn't require loading entire datasets in memory?
In the past, I've used Linux's pair of venerable sort and split utilities to sort massive files that choked pandas.
I don't want to disparage the other answer on this page. However, since your data is text format (as you indicated in the comments), I think it's a tremendous complication to start transferring it into other formats (HDF, SQL, etc.), for something that GNU/Linux utilities have been solving very efficiently for the last 30-40 years.
Say your file is called stuff.csv, and looks like this:
4.9,3.0,1.4,0.6
4.8,2.8,1.3,1.2
Then the following command will sort it by the 3rd column:
sort --parallel=8 -t , -nrk3 stuff.csv
Note that the number of threads here is set to 8.
The above will work with files that fit into the main memory. When your file is too large, you would first split it into a number of parts. So
split -l 100000 stuff.csv stuff
would split the file into files of length at most 100000 lines.
Now you would sort each file individually, as above. Finally, you would use mergesort, again through (wait for it...) sort:
sort -m sorted_stuff_* > final_sorted_stuff.csv
Finally, if your file is not in CSV (say it is a tgz file), then you should find a way to pipe a CSV version of it into split.
As I mentioned in the comments, this answer already provides a possible solution. It is based on the HDF format.
About the sorting problem, there are at least three possible ways to solve it with that approach.
First, you can try to use pandas directly, querying the HDF-stored-DataFrame.
Second, you can use PyTables, which pandas uses under the hood.
Francesc Alted gives a hint in the PyTables mailing list:
The simplest way is by setting the sortby parameter to true in the
Table.copy() method. This triggers an on-disk sorting operation, so you
don't have to be afraid of your available memory. You will need the Pro
version for getting this capability.
In the docs, it says:
sortby :
If specified, and sortby corresponds to a column with an index, then the copy will be sorted by this index. If you want to ensure a fully sorted order, the index must be a CSI one. A reverse sorted copy can be achieved by specifying a negative value for the step keyword. If sortby is omitted or None, the original table order is used
Third, still with PyTables, you can use the method Table.itersorted().
From the docs:
Table.itersorted(sortby, checkCSI=False, start=None, stop=None, step=None)
Iterate table data following the order of the index of sortby column. The sortby column must have associated a full index.
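A minimal sketch of that route, assuming a hypothetical HDF5 file data.h5 with a PyTables Table node at /pvalues; handle() stands in for whatever you do with each row:

import tables  # PyTables

with tables.open_file('data.h5', mode='a') as h5:
    table = h5.get_node('/pvalues')
    if not table.cols.P_VALUE.is_indexed:
        table.cols.P_VALUE.create_csindex()  # itersorted() needs a completely sorted index
    for row in table.itersorted('P_VALUE'):
        handle(row)  # rows arrive in ascending P_VALUE order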
Another approach consists of using a database in between. The detailed workflow can be seen in this IPython Notebook published at plot.ly.
This allows you to solve the sorting problem, along with other data analyses that are possible with pandas. It looks like it was created by the user chris, so all the credit goes to him. I am copying the relevant parts here.
Introduction
This notebook explores a 3.9Gb CSV file.
This notebook is a primer on out-of-memory data analysis with
pandas: A library with easy-to-use data structures and data analysis tools. Also, interfaces to out-of-memory databases like SQLite.
IPython notebook: An interface for writing and sharing python code, text, and plots.
SQLite: A self-contained, serverless database that's easy to set up and query from pandas.
Plotly: A platform for publishing beautiful, interactive graphs from Python to the web.
Requirements
import pandas as pd
from sqlalchemy import create_engine # database connection
Import the CSV data into SQLite
Load the CSV, chunk-by-chunk, into a DataFrame
Process the data a bit, strip out uninteresting columns
Append it to the SQLite database
disk_engine = create_engine('sqlite:///311_8M.db')  # Initializes database with filename 311_8M.db in current directory
chunksize = 20000
index_start = 1

for df in pd.read_csv('311_100M.csv', chunksize=chunksize, iterator=True, encoding='utf-8'):
    # do stuff
    df.index += index_start
    df.to_sql('data', disk_engine, if_exists='append')
    index_start = df.index[-1] + 1
Query value counts and order the results
Housing and Development Dept receives the most complaints
df = pd.read_sql_query('SELECT Agency, COUNT(*) as `num_complaints` '
                       'FROM data '
                       'GROUP BY Agency '
                       'ORDER BY -num_complaints', disk_engine)
Limiting the number of sorted entries
What are the 10 most common complaints in each city?
df = pd.read_sql_query('SELECT City, COUNT(*) as `num_complaints` '
                       'FROM data '
                       'GROUP BY `City` '
                       'ORDER BY -num_complaints '
                       'LIMIT 10 ', disk_engine)
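Applied to the sorting question itself, the same database can then hand the rows back already ordered, in memory-sized chunks. A sketch reusing disk_engine and chunksize from above, assuming the table has a P_VALUE column:

first_chunk = True
for chunk in pd.read_sql_query('SELECT * FROM data ORDER BY P_VALUE ASC',
                               disk_engine, chunksize=chunksize):
    # each chunk is an already-sorted DataFrame of at most chunksize rows
    chunk.to_csv('sorted_output.csv', mode='a', index=False, header=first_chunk)
    first_chunk = False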
Possibly related and useful links
Pandas: in memory sorting hdf5 files
ptrepack sortby needs 'full' index
http://pandas.pydata.org/pandas-docs/stable/cookbook.html#hdfstore
http://www.pytables.org/usersguide/optimization.html
Blaze might be the tool for you with the ability to work with pandas and csv files out of core.
http://blaze.readthedocs.org/en/latest/ooc.html
import blaze
import pandas as pd
d = blaze.Data('my-large-file.csv')
d.P_VALUE.sort() # Uses Chunked Pandas
For faster processing, load it into a database first, which Blaze can control. But if this is a one-off and you have some time, then the posted code should do it.
If your csv file contains only structured data, I would suggest an approach using only Linux commands.
Assume the csv file contains two columns, COL_1 and P_VALUE:
map.py:
import sys

for line in sys.stdin:
    col_1, p_value = line.rstrip('\n').split(',')
    # emit p_value first so a plain numeric sort orders the rows by it
    print("%f,%s" % (float(p_value), col_1))
then the following Linux pipeline will generate a csv file sorted by p_value:
cat input.csv | python map.py | sort -n > output.csv
If you're familiar with Hadoop, the same map.py plus a simple reduce.py will generate the sorted csv file via the Hadoop streaming system.
Here are my honest suggestions; there are three options you can pursue.
I like pandas for its rich documentation and features, but I have been advised to use NumPy instead, as it feels comparatively faster for larger datasets. You can also think of using other tools for an easier job.
If you are using Python 3, you can break your big data chunk into sets and do concurrent threading. I am too lazy for this and it does not look cool; as I understand it, pandas, NumPy and SciPy are built with hardware design perspectives in mind to enable multithreading.
I prefer this last option; it is the easy and lazy technique in my view. Check the documentation at http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort.html
You can also use the 'kind' parameter in the pandas sort function you are using.
Godspeed, my friend.

Efficient way to import a lot of csv files into PostgreSQL db

I see plenty of examples of importing a CSV into a PostgreSQL db, but what I need is an efficient way to import 500,000 CSV's into a single PostgreSQL db. Each CSV is a bit over 500KB (so grand total of approx 272GB of data).
The CSVs are identically formatted and there are no duplicate records (the data was generated programmatically from a raw data source). I have been searching and will continue to search online for options, but I would appreciate any direction on getting this done in the most efficient manner possible. I do have some experience with Python, but will dig into any other solution that seems appropriate.
Thanks!
If you start by reading the PostgreSQL guide "Populating a Database" you'll see several pieces of advice:
Load the data in a single transaction.
Use COPY if at all possible.
Remove indexes, foreign key constraints etc before loading the data and restore them afterwards.
PostgreSQL's COPY statement already supports the CSV format:
COPY table (column1, column2, ...) FROM '/path/to/data.csv' WITH (FORMAT CSV)
so it looks as if you are best off not using Python at all, or using Python only to generate the required sequence of COPY statements.
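A sketch of that second route, assuming psycopg2, a target table named measurements that matches the CSV layout, and the files collected under /path/to/csvs (all hypothetical names):

import glob
import psycopg2  # assumed driver

conn = psycopg2.connect('dbname=mydb user=me')  # hypothetical connection string
with conn, conn.cursor() as cur:  # one transaction for the whole load, as the guide advises
    for path in sorted(glob.glob('/path/to/csvs/*.csv')):
        with open(path, newline='') as f:
            # add HEADER to the options if the files carry a header row
            cur.copy_expert("COPY measurements FROM STDIN WITH (FORMAT csv)", f)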
Nice chunk of data you have there. I'm not 100% sure about Postgres, but MySQL at least provides some SQL commands to feed a CSV directly into a table. This bypasses any insert checks and so on, and is therefore more than an order of magnitude faster than ordinary insert operations.
So probably the fastest way to go is to create a simple Python script that tells your Postgres server which CSV files, in which order, to hungrily devour into its endless tables.
I use PHP and Postgres; I read the CSV file with PHP and build a string in the following format:
{ {line1 column1, line1 column2, line1 column3} , {line2 column1, line2 column2, line2 column3} }
Take care to do it in a single transaction by passing the string parameter to a PostgreSQL function.
I can check all records, formatting, amount of data, etc., and import 500,000 records in about 3 minutes.
To read the data in postgresql function:
DECLARE
d varchar[];
BEGIN
FOREACH d SLICE 1 IN ARRAY p_dados
LOOP
INSERT INTO schema.table (
column1,
column2,
column3,
)
VALUES (
d[1],
d[2]::INTEGER, -- explicit conversion to INTEGER
d[3]::BIGINT, -- explicit conversion to BIGINT
);
END LOOP;
END;
