Fastest way to convert object to string in Python

I have a ~250,000 row dataset I am pulling from Oracle DB. The main data I am concerned with is a text field that is pulled as a HUGECLOB object. Storing my data to a CSV took quite a bit of time, so I decided to switch to Feather via: Pandas read_csv speed up
There was some other question that used feather too, but I cannot find it. I tried to_hdf, but for some reason that did not work. Either way, to get my query results to work with Feather, the text column must be converted to string. So I have:
SQLquery = ('SELECT*')
datai = pd.read_sql(SQLquery, conn)
print("Query Passed, start date")
datai['REPORTDATE'] = pd.to_datetime(datai['REPORTDATE'], format='%m-%d-%Y')
print("Row done, string")
datai['LOWER(LD.LDTEXT)'] = datai['LOWER(LD.LDTEXT)'].apply(str)
print("Data Retrieved")
print("To feather start")
datai.to_feather(r'C:\Users\asdean\Documents\Nuclear Text\dataStore\rawData2.feather')
print("Done with feather")
Note: I put a bunch of print statements because I was trying to figure out where it got hung up.
The text column is identified as datai['LOWER(LD.LDTEXT)']. Some of the rows contain quite a bit of text (~a couple of paragraphs). The string conversion TAKES FOREVER (or may not even be completing). I did not have this problem when I read from an old CSV (old data we no longer use since we updated the query).
I have tried all the common ways of doing this with astype(str), map(str), apply(str), and values.astype(str), with no success in speeding it up to a reasonable (less than 1 hour) pace. Is there a way to do faster object-to-string conversions? Is there a library I am missing here? Is there a faster way through Oracle/HUGECLOB? How can I speed this up?

I'm assuming the data is stored in the DB as the Oracle CLOB datatype. How big is the data? If the strings are < 1 GB, make sure they are being fetched as strings (not streamed) in the lower layer; see https://cx-oracle.readthedocs.io/en/latest/user_guide/lob_data.html#fetching-lobs-as-strings-and-bytes
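For example, a minimal sketch of that approach using cx_Oracle's output type handler; the constant names follow cx_Oracle 8.x (older versions use cx_Oracle.CLOB / cx_Oracle.LONG_STRING), and conn and SQLquery are the ones from the question:
import cx_Oracle
import pandas as pd

# Fetch CLOB columns as plain Python strings instead of LOB objects
def output_type_handler(cursor, name, default_type, size, precision, scale):
    if default_type == cx_Oracle.DB_TYPE_CLOB:
        return cursor.var(cx_Oracle.DB_TYPE_LONG, arraysize=cursor.arraysize)

conn.outputtypehandler = output_type_handler
datai = pd.read_sql(SQLquery, conn)  # LOWER(LD.LDTEXT) now comes back as str
With the LOBs materialized as strings during the fetch, the slow apply(str) pass over the column should no longer be needed.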

Related

"Couldn't convert string to float" in python reading .DBF files - handling erroneous sensor data

I'm trying to read some sensor data, which is stored in .DBF-format, which usually works perfectly. However, one of my sensors seems to be having some issues, as all of a sudden, my usual scripts don't work any longer.
Let me clarify: The only table per file consists of four columns (time, temperature, humidity, dewpoint), and at some point (spanning roughly rows 25 to 150 or so) only the humidity value apparently is set to "--327", which Python tries to convert to float and obviously fails to do. From the experiment I did and the values before and after that timespan (always 0.0), I know that these values are in fact not interesting for my analysis.
Now the thing I can't wrap my head around is that I'm reading the data line by line (I have some other stuff to do anyway before creating a pandas df), and I thought I was prepared for weird values by doing this:
import numpy as np
import dbfread

data = dbfread.DBF(file)
for line in data:
    try:
        val_humi = line["humidity"]
    except Exception:
        val_humi = np.nan
        print("Error reading the humidity")
    default_further_processing_of_read_value_and_df_insertion(val_humi)
While probably not being best practice, this usually works for me as I don't really care about these few values being set to np.nan in a file containing several thousand entries.
How can I get rid of the error message I'm receiving so that the rest of my script can continue working? I'm grateful for any suggestions.
ValueError: could not convert string to float: b'--327'
P.S.:
My first idea was to directly load the dbf into a pd.dataframe, but the error was the same.
My next best idea so far was to manually check each read value (performance is not an issue) for it being equal to "--327" and then manually setting it to "0", but I can't even get there as the error message is thrown as soon as I'm attempting to extract the value. Maybe there is some option like <"set_read_data_type" = str> or so?
Thanks in advance!
EDIT:
Maybe this also helps:
If I import the dbf-file into excel, the erroneous values appear as empty cells, while only opening the file with something like DBFViewer2000 shows the --327 value, that python also sees.
Using my dbf library, you can specify your own float conversion routine to handle weird errors:
import dbf

def fix_float(string):
    try:
        return float(string)
    except ValueError:
        return 0.0

data = dbf.Table(file, default_data_types={'N': fix_float})
data.open()
for record in data:
    ...  # do stuff with each record
Note that there are a few differences in how dbf chooses to handle data and data access, so spend a few minutes exploring it.

What is an appropriate choice to store very large files with python? .csv files truncated data in certain cells

I'm writing a python script for the data acquisition phase of my project and so far I've been storing data in .csv files. While I was reading data from a particular .csv file, I got the error:
SyntaxError: EOL while scanning string literal
I took a look at the specific row in the file and the data in the specific cell was truncated. I am using pandas to store dicts to CSV and it never threw an error. I guess .csv will save itself no matter what, even if that means it will delete data without any warning.
I thought of changing to .xls. When the same row was being stored, an error came up saying (something along the lines of):
Max character length reached. Max character length per cell was ~32k.
Then I thought that it may just be an Excel/LibreOffice Calc issue (I tried both), that they can't display the data in the cell but the data is actually there. So I tried printing the specific cell; the data was indeed truncated. The specific cell contains a dict, whose values are float, int, boolean or string. However, all of them have been converted to strings.
My question is, is there a way to fix it without changing the file format?
In the case that I have to change the file format, what would be an appropriate choice to store very large files? I am thinking about hdf5.
In case you need more info, do let me know. Thank you!
There is a limit to field size:
csv.field_size_limit([new_limit])
Returns the current maximum field size allowed by the parser.
If new_limit is given, this becomes the new limit.
On my system (Python 3.8.0), I get:
>>> import csv
>>> csv.field_size_limit()
131072
which is exactly 128 KiB.
You could try to set the limit higher:
csv.field_size_limit(your_new_limit)
But maybe a different file format would be more adapted depending on what kind of data you store.
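For example, a small sketch of raising the limit before re-reading the problem file; the filename is a placeholder, and the fallback covers platforms where sys.maxsize does not fit in a C long:
import csv
import sys

try:
    csv.field_size_limit(sys.maxsize)
except OverflowError:
    # the limit must fit in a C long on this platform
    csv.field_size_limit(2**31 - 1)

with open('big_cells.csv', newline='') as f:
    for row in csv.reader(f):
        pass  # process the row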

How to keep null values when writing to csv

I'm writing data from sql server into a csv file using Python's csv module and then uploading the csv file to a postgres database using the copy command. The issue is that Python's csv writer automatically converts Nulls into an empty string "" and it fails my job when the column is an int or float datatype and it tries to insert this "" when it should be a None or null value.
To make it as easy as possible to interface with modules which
implement the DB API, the value None is written as the empty string.
https://docs.python.org/3.4/library/csv.html?highlight=csv#csv.writer
What is the best way to keep the null value? Is there a better way to write csvs in Python? I'm open to all suggestions.
Example:
I have lat and long values:
42.313270000 -71.116240000
42.377010000 -71.064770000
NULL NULL
When writing to csv it converts nulls to "":
with file_path.open(mode='w', newline='') as outfile:
    csv_writer = csv.writer(outfile, delimiter=',', quoting=csv.QUOTE_NONNUMERIC)
    if include_headers:
        csv_writer.writerow(col[0] for col in self.cursor.description)
    for row in self.cursor:
        csv_writer.writerow(row)
The resulting CSV:
42.313270000,-71.116240000
42.377010000,-71.064770000
"",""
NULL
Specifies the string that represents a null value. The default is \N
(backslash-N) in text format, and an unquoted empty string in CSV
format. You might prefer an empty string even in text format for cases
where you don't want to distinguish nulls from empty strings. This
option is not allowed when using binary format.
https://www.postgresql.org/docs/9.2/sql-copy.html
ANSWER:
What solved the problem for me was changing the quoting to csv.QUOTE_MINIMAL.
csv.QUOTE_MINIMAL Instructs writer objects to only quote those fields
which contain special characters such as delimiter, quotechar or any
of the characters in lineterminator.
Related questions:
- Postgresql COPY empty string as NULL not work
You have two options here: change the csv.writer() quoting option in Python, or tell PostgreSQL to accept quoted strings as possible NULLs (requires PostgreSQL 9.4 or newer).
Python csv.writer() and quoting
On the Python side, you are telling the csv.writer() object to add quotes, because you configured it to use csv.QUOTE_NONNUMERIC:
Instructs writer objects to quote all non-numeric fields.
None values are non-numeric, so result in "" being written.
Switch to using csv.QUOTE_MINIMAL or csv.QUOTE_NONE:
csv.QUOTE_MINIMAL
Instructs writer objects to only quote those fields which contain special characters such as delimiter, quotechar or any of the characters in lineterminator.
csv.QUOTE_NONE
Instructs writer objects to never quote fields. When the current delimiter occurs in output data it is preceded by the current escapechar character.
Since all you are writing is longitude and latitude values, you don't need any quoting here; there are no delimiters or quote characters present in your data.
With either option, the CSV output for None values is simply an empty string:
>>> import csv
>>> from io import StringIO
>>> def test_csv_writing(rows, quoting):
...     outfile = StringIO()
...     csv_writer = csv.writer(outfile, delimiter=',', quoting=quoting)
...     csv_writer.writerows(rows)
...     return outfile.getvalue()
...
>>> rows = [
...     [42.313270000, -71.116240000],
...     [42.377010000, -71.064770000],
...     [None, None],
... ]
>>> print(test_csv_writing(rows, csv.QUOTE_NONNUMERIC))
42.31327,-71.11624
42.37701,-71.06477
"",""
>>> print(test_csv_writing(rows, csv.QUOTE_MINIMAL))
42.31327,-71.11624
42.37701,-71.06477
,
>>> print(test_csv_writing(rows, csv.QUOTE_NONE))
42.31327,-71.11624
42.37701,-71.06477
,
PostgreSQL 9.4 COPY FROM, NULL values and FORCE_NULL
As of PostgreSQL 9.4, you can also force PostgreSQL to accept quoted empty strings as NULLs, when you use the FORCE_NULL option. From the COPY FROM documentation:
FORCE_NULL
Match the specified columns' values against the null string, even if it has been quoted, and if a match is found set the value to NULL. In the default case where the null string is empty, this converts a quoted empty string into NULL. This option is allowed only in COPY FROM, and only when using CSV format.
Naming the columns in a FORCE_NULL option lets PostgreSQL accept both the empty column and "" as NULL values for those columns, e.g.:
COPY position (
    lon,
    lat
)
FROM 'filename'
WITH (
    FORMAT csv,
    NULL '',
    DELIMITER ',',
    FORCE_NULL(lon, lat)
);
at which point it doesn't matter anymore what quoting options you used on the Python side.
Other options to consider
For simple data transformation tasks from other databases, don't use Python
If you are already querying databases to collate data to go into PostgreSQL, consider inserting directly into Postgres. If the data comes from other sources, using the foreign data wrapper (fdw) module lets you cut out the middle-man and pull data directly into PostgreSQL from other sources.
Numpy data? Consider using COPY FROM as binary, directly from Python
Numpy data can more efficiently be inserted via binary COPY FROM; the linked answer augments a numpy structured array with the required extra metadata and byte ordering, then efficiently creates a binary copy of the data and inserts it into PostgreSQL using COPY FROM STDIN WITH BINARY and the psycopg2.copy_expert() method. This neatly avoids number -> text -> number conversions.
Persisting data to handle large datasets in a pipeline?
Don't re-invent the data pipeline wheels. Consider using existing projects such as Apache Spark, which have already solved the efficiency problems. Spark lets you treat data as a structured stream, and includes the infrastructure to run data analysis steps in parallel, and you can treat distributed, structured data as Pandas dataframes.
Another option might be to look at Dask to help share datasets between distributed tasks to process large amounts of data.
Even if converting an already running project to Spark might be a step too far, at least consider using Apache Arrow, the data exchange platform Spark builds on top of. The pyarrow project would let you exchange data via Parquet files, or exchange data over IPC.
The Pandas and Numpy teams are quite heavily invested in supporting the needs of Arrow and Dask (there is considerable overlap in core members between these projects) and are actively working to make Python data exchange as efficient as possible, including extending Python's pickle module to allow for out-of-band data streams to avoid unnecessary memory copying when sharing data.
Your code
for row in self.cursor:
    csv_writer.writerow(row)
uses the writer as-is, but you don't have to do that. You can filter the values to change particular values with a generator expression and a conditional (ternary) expression:
for row in self.cursor:
    csv_writer.writerow("null" if x is None else x for x in row)
You are asking for csv.QUOTE_NONNUMERIC. This will quote everything that is not a number. You should consider using csv.QUOTE_MINIMAL, as it might be more what you are after:
Test Code:
import csv

test_data = (None, 0, '', 'data')

for name, quotes in (('test1.csv', csv.QUOTE_NONNUMERIC),
                     ('test2.csv', csv.QUOTE_MINIMAL)):
    with open(name, mode='w') as outfile:
        csv_writer = csv.writer(outfile, delimiter=',', quoting=quotes)
        csv_writer.writerow(test_data)
Results:
test1.csv:
"",0,"","data"
test2.csv:
,0,,data
I'm writing data from sql server into a csv file using Python's csv module and then uploading the csv file to a postgres database using the copy command.
I believe your true requirement is you need to hop data rows through the filesystem, and as both the sentence above and the question title make clear, you are currently doing that with a csv file.
Trouble is that csv format offers poor support for the RDBMS notion of NULL.
Let me solve your problem for you by changing the question slightly.
I'd like to introduce you to parquet format.
Given a set of table rows in memory, it allows you to very quickly persist them to a compressed binary file, and recover them, with metadata and NULLs intact, no text quoting hassles.
Here is an example, using the pyarrow 0.12.1 parquet engine:
import pandas as pd
import pyarrow

def round_trip(fspec='/tmp/locations.parquet'):
    rows = [
        dict(lat=42.313, lng=-71.116),
        dict(lat=42.377, lng=-71.065),
        dict(lat=None, lng=None),
    ]
    df = pd.DataFrame(rows)
    df.to_parquet(fspec)
    del df
    df2 = pd.read_parquet(fspec)
    print(df2)

if __name__ == '__main__':
    round_trip()
Output:
      lat     lng
0  42.313 -71.116
1  42.377 -71.065
2     NaN     NaN
Once you've recovered the rows in a dataframe you're free to call df2.to_sql() or use some other favorite technique to put numbers and NULLs into a DB table.
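For instance, a minimal sketch of that last step; the connection string and table name are placeholders, and df2 is the dataframe recovered in the example above:
from sqlalchemy import create_engine

engine = create_engine('postgresql://user:password@localhost/mydb')
df2.to_sql('locations', engine, if_exists='append', index=False)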
EDIT:
If you're able to run .to_sql() on the PG server, or on the same LAN, then do that. Otherwise your favorite technique will likely involve .copy_expert(). Why? The summary is that with psycopg2, "bulk INSERT is slow".
Middle layers like sqlalchemy and pandas, and well-written apps that care about insert performance, will use .executemany(). The idea is to send lots of rows all at once, without waiting for individual result status, because we're not worried about unique index violations. So TCP gets a giant buffer of SQL text and sends it all at once, saturating the end-to-end channel's bandwidth, much as copy_expert sends a big buffer to TCP to achieve high bandwidth.
In contrast, the psycopg2 driver lacks support for high-performance executemany. As of 2.7.4 it just executes items one at a time, sending a SQL command across the WAN and waiting a round-trip time for the result before sending the next command. Ping your server; if ping times suggest you could get a dozen round trips per second, then plan on inserting only about a dozen rows per second. Most of that time is spent waiting for a reply packet rather than processing DB rows. It would be lovely if at some future date psycopg2 offered better support for this.
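For reference, a minimal copy_expert sketch, reusing the position table from the COPY example above; the connection string is a placeholder:
import io
import psycopg2

conn = psycopg2.connect('dbname=mydb user=me')
# unquoted empty fields are read back as NULL by COPY in CSV mode
buf = io.StringIO('42.31327,-71.11624\n42.37701,-71.06477\n,\n')
with conn, conn.cursor() as cur:
    cur.copy_expert(
        "COPY position (lon, lat) FROM STDIN WITH (FORMAT csv, NULL '')",
        buf,
    )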
I would use pandas, psycopg2 and sqlalchemy. Make sure they are installed. Coming from your current workflow and avoiding writing to CSV:
# no need to import psycopg2
import pandas as pd
from sqlalchemy import create_engine

# create connection to postgres
engine = create_engine('postgresql://.....')

# get column names from cursor.description
columns = [col[0] for col in self.cursor.description]

# convert data into dataframe
df = pd.DataFrame(self.cursor.fetchall(), columns=columns)

# send dataframe to postgres
df.to_sql('name_of_table', engine, if_exists='append', index=False)

# if you still need to write to csv
df.to_csv('your_file.csv')

Pandas - Appending 'table' format to HDF5Store with different dtypes: invalid combinate of [values_axes]

I recently started trying to use the HDF5 format in Python pandas to store data but encountered a problem I can't find a workaround for. Before this I worked with CSV files and had no trouble appending new data.
This is what I try:
store = pd.HDFStore('cdw.h5')
frame.to_hdf('cdw.h5','cdw/data_cleaned', format='table',append=True, data_columns=True,dropna=False)
And it throws:
ValueError: invalid combinate of [values_axes] on appending data [name->Ordereingangsdatum,cname->Ordereingangsdatum,dtype->float64,kind->float,shape->(1, 176345)] vs current table [name->Ordereingangsdatum,cname->Ordereingangsdatum,dtype->bytes128,kind->string,shape->None]
I get that it is telling me I want to append a different data type for a column, but what baffles me is that I have written the same CSV file, together with some other CSV files, from a DataFrame to that HDF5 file before.
I'm doing analysis in the forwarding industry and the data there is very inconsistent; more often than not there are missing values, mixed dtypes in columns, or other 'data dirt'.
I'm looking for a way to append data to the HDF5 file no matter what is inside a column, as long as the column names are the same.
It would be great to be able to enforce appending data to the HDF store independently of datatypes, or to find another simple solution for my problem. The goal is to automate the analysis later on, so I'd rather not change datatypes every time I have a missing value in one of my 62 columns.
Another question within my question: my read_hdf file access consumes more time than my read_csv; I have around 1.5 million rows with 62 columns. Is this because I have no SSD drive? I have read that file access with read_hdf should be faster.
I'm asking myself whether I should rather stick with CSV files or with HDF5.
Help would be greatly appreciated.
Okay, for anyone having the same issue with appending data where the dtype is not guaranteed to be the same: I finally found a solution. First convert every column to object:
li = list(frame)
frame[li] = frame[li].astype(object)
frame.info()
Then try df.to_hdf(key, value, append=True) and wait for its error message. The error message TypeError: Cannot serialize the column [not_one_datatype] because its data contents are [mixed] object dtype will tell you which columns it still doesn't like. Convert the mentioned columns with df['not_one_datatype'].astype(float); converting those columns to float worked for me. Only use integer if you are sure that a float will never occur in that column, otherwise the append method will break again.
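Put together, a sketch of that workflow; the frame and its column names here are toy stand-ins for the real data:
import numpy as np
import pandas as pd

frame = pd.DataFrame({'Ordereingangsdatum': [20180101.0, np.nan],
                      'Spediteur': ['DHL', 'UPS']})

# 1. cast every column to object before appending
li = list(frame)
frame[li] = frame[li].astype(object)

# 2. give the columns that to_hdf() complains about an explicit float dtype
frame['Ordereingangsdatum'] = frame['Ordereingangsdatum'].astype(float)

# 3. append to the table-format store (requires PyTables)
frame.to_hdf('cdw.h5', 'cdw/data_cleaned', format='table',
             append=True, data_columns=True, dropna=False)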
I decided to work with CSV and HDF5 files in parallel. If I hit a problem with HDF5 that I have no workaround for, I will simply switch to CSV; this is what I can personally recommend.
Update: Okay, it seems that the creators of this format have not thought about reality when designing the HDF API: the HDF5 min_itemsize error (ValueError: Trying to store a string with len [##] in [y] column but this column has a limit of [##]!) occurs when trying to append data to an already existing file if some column value happens to be longer than in the initial write to the HDF file.
Now the joke here is that the creators of this API expect me to know the maximum length of every possible value in a column at the time of the first write? Really? Another inconsistency is that df.to_hdf(append=True) does not have the parameter min_itemsize={'column1': 1000}. This format is at best suited for storing self-created data, but definitely not for data where the dtypes and the length of the entries in each column are not set in stone. The only solution left, when you want to append data from pandas dataframes independently of the stubborn HDF5 API in Python, is to insert into every dataframe, before appending, a row with very long strings (except for the numeric columns), just to be sure that you will always be able to append the data no matter how long it might get.
When doing this, the write process will take ages and slurp gigantic amounts of disk space for the huge HDF5 file.
CSV definitely wins against HDF5 in terms of performance, integration and especially usability.

Efficient way to import a lot of csv files into PostgreSQL db

I see plenty of examples of importing a CSV into a PostgreSQL db, but what I need is an efficient way to import 500,000 CSVs into a single PostgreSQL db. Each CSV is a bit over 500 KB (so a grand total of approx 272 GB of data).
The CSVs are identically formatted and there are no duplicate records (the data was generated programmatically from a raw data source). I have been searching and will continue to search online for options, but I would appreciate any direction on getting this done in the most efficient manner possible. I do have some experience with Python, but will dig into any other solution that seems appropriate.
Thanks!
If you start by reading the PostgreSQL guide "Populating a Database" you'll see several pieces of advice:
Load the data in a single transaction.
Use COPY if at all possible.
Remove indexes, foreign key constraints etc before loading the data and restore them afterwards.
PostgreSQL's COPY statement already supports the CSV format:
COPY table (column1, column2, ...) FROM '/path/to/data.csv' WITH (FORMAT CSV)
so it looks as if you are best off not using Python at all, or using Python only to generate the required sequence of COPY statements.
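For example, a sketch of driving that COPY loop from Python with psycopg2; the connection string, directory and table/column names are placeholders:
import glob
import psycopg2

conn = psycopg2.connect('dbname=mydb user=me')
with conn, conn.cursor() as cur:  # single transaction, committed on exit
    for path in sorted(glob.glob('/data/csv/*.csv')):
        with open(path) as f:
            cur.copy_expert(
                'COPY my_table (column1, column2) FROM STDIN WITH (FORMAT CSV)',
                f,
            )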
Nice chunk of data you have there. I'm not 100% sure about Postgres, but MySQL at least provides SQL commands to feed a CSV directly into a table. This bypasses any insert checks and so on, and is therefore more than an order of magnitude faster than ordinary insert operations.
So probably the fastest way to go is to create a simple Python script that tells your Postgres server which CSV files, in which order, to hungrily devour into its endless tables.
I use PHP and Postgres, and read the CSV file with PHP to build a string in the following format:
{ {line1 column1, line1 column2, line1 column3} , { line2 column1, line2 column2, line2 column3} }
I load it in a single transaction by passing the string as a parameter to a PostgreSQL function.
I can check all records, formatting, amount of data, etc., and import 500,000 records in about 3 minutes.
To read the data in the PostgreSQL function:
DECLARE
    d varchar[];
BEGIN
    FOREACH d SLICE 1 IN ARRAY p_dados
    LOOP
        INSERT INTO schema.table (
            column1,
            column2,
            column3
        )
        VALUES (
            d[1],
            d[2]::INTEGER, -- explicit conversion to INTEGER
            d[3]::BIGINT   -- explicit conversion to BIGINT
        );
    END LOOP;
END;
