Pandas read_sql integer became float - python

I ran into a problem: when I use pandas to read a MySQL table, some columns (see 'to_nlc') that used to be integers become floats (a .0 is automatically appended).
Can anyone figure it out, or offer some guesses? Thanks very much!

The problem is that your data contains NaN values, so the int columns are automatically cast to float.
I think you should check NA type promotions in the pandas docs:
When introducing NAs into an existing Series or DataFrame via reindex or some other means, boolean and integer types will be promoted to a different dtype in order to store the NAs. These are summarized by this table:
Typeclass    Promotion dtype for storing NAs
floating     no change
object       no change
integer      cast to float64
boolean      cast to object
While this may seem like a heavy trade-off, I have found very few cases where this is an issue in practice. Some explanation of the motivation is in the next section.
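A minimal sketch of this promotion (not from the original post): introducing a missing value into an integer Series forces the cast to float64.
import pandas as pd

s = pd.Series([1, 2, 3])
print(s.dtype)                        # int64
s_with_na = s.reindex([0, 1, 2, 3])   # index 3 has no value, so NaN is introduced
print(s_with_na.dtype)                # float64
print(s_with_na.tolist())             # [1.0, 2.0, 3.0, nan]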

As already said, the problem is that pandas' integer dtype cannot handle NULL/NA values.
You can replace read_sql_table with read_sql and convert NULL to some integer value (for example 0 or -1, whatever carries the NULL meaning in your setting):
df = pandas.read_sql("SELECT col1, col2, IFNULL(col3, 0) FROM table", engine)
Here col3 can be NULL in MySQL; IFNULL returns 0 if it is NULL, and the col3 value otherwise.
Or the same thing with a little helper function:
def read_sql_table_with_nullcast(table_name, engine, null_cast={}):
    """
    table_name - table name
    engine     - SQL engine
    null_cast  - dictionary of columns to replace NULL:
                 column name as key, value to replace with as value.
                 For example {'col3': 0} will set all NULLs in col3 to 0.
    """
    import pandas
    cols = pandas.read_sql("SHOW COLUMNS FROM " + table_name, engine)
    cols_call = [c if c not in null_cast
                 else "IFNULL(%s,%d) AS %s" % (c, null_cast[c], c)
                 for c in cols['Field']]
    sel = ",".join(cols_call)
    return pandas.read_sql("SELECT " + sel + " FROM " + table_name, engine)

read_sql_table_with_nullcast("table", engine, {'col3': 0})

You can use the parameter coerce_float=False:
df = pd.read_sql(sql, con=conn, coerce_float=False)
From the docs:
coerce_float : bool, default True
Attempts to convert values of non-string, non-numeric objects (like decimal.Decimal) to floating point, useful for SQL result sets.
https://pandas.pydata.org/docs/reference/api/pandas.read_sql.html

Another possibility is to exclude NULL values in the WHERE clause of your SQL query, if you're not expecting them and they correspond to unusable rows.
It won't be suitable in all circumstances, but it is a clean and simple option when it does apply.
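A minimal sketch of that approach (column and table names are illustrative, and engine is assumed to be an existing SQLAlchemy engine): filter the NULLs out in SQL so the column arrives complete and keeps its integer dtype.
import pandas as pd

df = pd.read_sql(
    "SELECT col1, col2, col3 FROM table WHERE col3 IS NOT NULL",
    engine,
)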

Related

Why does SQLite3 not yield an error

I am quite new to SQL, but I am trying to fix a bug in the output of an SQL query. However, this question does not concern the bug itself, but rather why SQLite3 does not yield an error when it should.
I have a query string that looks like:
QueryString = ("SELECT e.event_id, "
               "count(e.event_id), "
               "e.state, "
               "MIN(e.boot_time) AS boot_time, "
               "e.time_occurred, "
               "COALESCE(e.info, 0) AS info "
               "FROM events AS e "
               "JOIN leg ON leg.id = e.leg_id "
               "GROUP BY e.event_id "
               "ORDER BY leg.num_leg DESC, "
               "e.event_id ASC;\n"
               )
This yields an output with no errors.
What I don't understand is why there is no error when I GROUP BY e.event_id while e.state and e.time_occurred are not wrapped in aggregate functions and are not part of the GROUP BY clause.
e.state is a string column. e.time_occurred is an integer column.
I am using the QueryString in Python.
In a misguided attempt to be compatible with MySQL, this is allowed. (The non-aggregated column values come from some random row in the group.)
Since SQLite 3.7.11, using min() or max() guarantees that the values in the non-aggregated columns come from the row that has the minimum/maximum value in the group.
SQLite and MySQL allow bare columns in an aggregation query. This is explained in the documentation:
In the query above, the "a" column is part of the GROUP BY clause and
so each row of the output contains one of the distinct values for "a".
The "c" column is contained within the sum() aggregate function and so
that output column is the sum of all "c" values in rows that have the
same value for "a". But what is the result of the bare column "b"? The
answer is that the "b" result will be the value for "b" in one of the
input rows that form the aggregate. The problem is that you usually do
not know which input row is used to compute "b", and so in many cases
the value for "b" is undefined.
Your particular query is:
SELECT e.event_id, count(e.event_id), e.state, MIN(e.boot_time) AS boot_time,
       e.time_occurred, COALESCE(e.info, 0) AS info
FROM events AS e JOIN
     leg
     ON leg.id = e.leg_id
GROUP BY e.event_id
ORDER BY leg.num_leg DESC, e.event_id ASC;
If e.event_id is the primary key in events, then this syntax is even supported by the ANSI standard, because event_id is sufficient to uniquely define the other columns in a row in events.
If e.event_id is a PRIMARY or UNIQUE key of the table then e.time_occurred is called "functionally dependent" and would not even throw an error in other SQL compliant DBMSs.
However, SQLite has not implemented functional dependency. In the case of SQLite (and MySQL) no error is thrown even for columns that are not functionally dependent on the GROUP BY columns.
SQLite (and MySQL) simply selects an arbitrary row from the result set to fill the (in SQLite lingo) "bare column"; see the SQLite documentation quoted above.
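A minimal sketch of the bare-column behaviour described above, using Python's sqlite3 module on an illustrative table (not the table from the question): with min(c) in the select list, the bare column b is taken from the row holding the minimum c (SQLite >= 3.7.11).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    "CREATE TABLE tab (a TEXT, b TEXT, c INTEGER);"
    "INSERT INTO tab VALUES ('x', 'first', 2), ('x', 'second', 1), ('y', 'only', 5);"
)
for row in conn.execute("SELECT a, min(c), b FROM tab GROUP BY a"):
    print(row)
# ('x', 1, 'second')  -- b comes from the row with the minimum c
# ('y', 5, 'only')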

Pyspark changes longs to ints

I'm using Pyspark 1.2.1 with Hive. (Upgrading will not happen immediately).
The problem I have is that when I select from a Hive table, and add an index, pyspark changes long values to ints, so I end up with a temp table with a column of type Long, but values of type Integer. (See code below).
My question is: how can I either (a) perform the merge of the index (see code) without changing longs to ints; or (b) add the index in some other way that avoids the problem; or (c) randomize table columns without needing to join?
The underlying problem I'm trying to solve is that I want to randomize the order of certain columns in a hive table, and write that to a new table. This is to make data no longer personally identifiable. I'm doing that by adding an incrementing index to the original table and the randomised columns, then joining on that index.
The table looks like:
primary | longcolumn | randomisecolumn
The code is:
hc = HiveContext(sc)
orig = hc.sql('select * from mytable')
widx = orig.zipWithIndex().map(merge_index_on_row)
sql_context.applySchema(widx, add_index_schema(orig.schema())) \
    .registerTempTable('sani_first')
# At this point sani_first has a column longcolumn with type long,
# but (many of) the values are ints
def merge_index_on_row((row, idx), idx_name=INDEX_COL):
    """
    Row is a SchemaRDD row object; idx is an integer;
    schema is the schema for row with an added index col at the end;
    returns a version of row applying schema and holding the index in the new row.
    """
    as_dict = row.asDict()
    as_dict[idx_name] = idx
    return Row(**as_dict)

def add_index_schema(schema):
    """
    Take a schema, add a column for an index, return the new schema.
    """
    return StructType(sorted(schema.fields + [StructField(INDEX_COL, IntegerType(), False)],
                             key=lambda x: x.name))
In the absence of a better solution, I'm going to force the affected columns to long type in the python code. This is...not great.
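A minimal sketch of that forcing (my assumption of what it could look like, not the poster's actual fix; it reuses INDEX_COL, the longcolumn column and the helpers above, and assumes LongType is importable from pyspark.sql in 1.2.1):
from pyspark.sql import LongType  # assumed location of the type in pyspark 1.2

def merge_index_on_row((row, idx), idx_name=INDEX_COL):
    # Force the index and the affected column to Python longs so the values
    # agree with a LongType schema instead of silently arriving as ints.
    as_dict = row.asDict()
    as_dict[idx_name] = long(idx)
    if as_dict['longcolumn'] is not None:
        as_dict['longcolumn'] = long(as_dict['longcolumn'])
    return Row(**as_dict)

def add_index_schema(schema):
    # Declare the index column as LongType so the schema matches the values.
    return StructType(sorted(schema.fields + [StructField(INDEX_COL, LongType(), False)],
                             key=lambda x: x.name))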

How to insert empty excel dates into oracle with Python+Pandas?

I've got a Python application that is using pandas to grok some excel spreadsheets and insert values into an oracle database.
For date cells that have a value, this works fine. For empty date cells I am inserting a NaT, which I would have thought would be fine, but in Oracle that is becoming some weird invalid time that displays as "0001-255-255 00:00:00" (Something like MAXINT or 0 being converted to a timestamp I'm guessing?)
In[72]: x.iloc[0][9]
Out[72]: NaT
Above is the bit of data in the DataFrame, you can see it's a NaT.
But this is what I see in Oracle..
SQL> select TDATE from TABLE where id=5067 AND version=5;
TDATE
---------
01-NOVEMB
SQL> select dump("TDATE") from TABLE where id=5067 AND version=5;
DUMP("TDATE")
--------------------------------------------------------------------------------
Typ=12 Len=7: 100,101,255,255,1,1,1
I tried doing df.replace and/or df.where to convert NaT to None but I get assorted errors with either of these that seem to imply the substitution is not valid in that way.
Any way to ensure consistency of a null date across these datastores?!
This issue has been fixed in Pandas 0.15.0.
If you can, update to Pandas >= 0.15.0. Starting with that version, NaN and NaT are properly stored as NULL in the database.
After having performed some experiments, it appears that Pandas passes NaT to SQLAlchemy and down to cx_Oracle, which in turn blindly sends an invalid date to Oracle (which in turn does not complain).
Anyway, one workaround I was able to come up with is to add a BEFORE INSERT TRIGGER to fix incoming timestamps. For that to work, you will have to create the table manually first.
-- Create the table
CREATE TABLE W ("ID" NUMBER(5), "TDATE" TIMESTAMP);
And then the trigger:
-- Create a trigger on the table
CREATE OR REPLACE TRIGGER fix_null_ts
BEFORE INSERT ON W
FOR EACH ROW WHEN (extract(month from new.tdate) = 255)
BEGIN
:new.tdate := NULL;
END;
/
After that, from Python, use pandas.DataFrame.to_sql(..., if_exists='append'):
>>> d = [{"id":1,"tdate":datetime.now()},{"id":2}]
>>> f = pd.DataFrame(d)
>>> f.to_sql("W",engine, if_exists='append', index=False)
# ^^^^^^^^^^^^^^^^^^
# don't drop the table! append data to an existing table
And check:
>>> result = engine.execute("select * from w")
>>> for row in result:
... print(row)
...
(1, datetime.datetime(2014, 10, 31, 1, 10, 2))
(2, None)
Beware that if you ever need to write another DataFrame to the same table, you will first need to delete its content, but not drop the table, otherwise you would lose the trigger at the same time. For example:
# Some new data
>>> d = [{"id":3}]
>>> f = pd.DataFrame(d)
# Truncate the table and write the new data
>>> engine.execute("truncate table w")
>>> f.to_sql("W",engine, if_exists='append', index=False)
>>> result = engine.execute("select * from w")
# Check the result
>>> for row in result:
... print(row)
...
(3, None)
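If upgrading is not an option and you would rather handle this on the Python side than with a trigger, a commonly used workaround (a sketch I have not verified against that exact pandas version) is to cast the frame to object dtype first, so the datetime column can actually hold None, and only then replace the missing values before calling to_sql:
import pandas as pd

# Cast to object so NaT/NaN can be replaced by a genuine Python None,
# which the database driver then sends as a NULL.
f = f.astype(object).where(pd.notnull(f), None)
f.to_sql("W", engine, if_exists='append', index=False)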
I hope the data type of the date column in the Oracle database is DATE.
In that case, remember that a DATE holds a date part and a time part together. While loading into the database, make sure you use TO_DATE with a proper datetime format mask for the date literal.
That is about loading. For display, use TO_CHAR with a proper datetime format to see the value the way you want a datetime value shown.
Regarding NULL values: unless you have a NOT NULL constraint, I don't see any issue with loading; the NULL values would be loaded as NULL anyway. If you want to manipulate NULL values, use the NVL function with the value you want to replace NULL with.
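A minimal sketch of that advice from Python (illustrative, reusing the W table and engine from the answer above; TO_DATE, TO_CHAR and NVL are standard Oracle functions):
from sqlalchemy import text

with engine.begin() as conn:
    # Load a string date with an explicit TO_DATE format mask.
    conn.execute(
        text("INSERT INTO W (id, tdate) "
             "VALUES (:id, TO_DATE(:tdate, 'YYYY-MM-DD HH24:MI:SS'))"),
        {"id": 4, "tdate": "2014-10-31 01:10:02"},
    )
    # Display with TO_CHAR, substituting a fallback for NULL dates via NVL.
    for row in conn.execute(text(
            "SELECT id, NVL(TO_CHAR(tdate, 'YYYY-MM-DD'), 'no date') FROM W")):
        print(row)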

How to easily initialize a new object with all values null using sqlalchemy

I have a database table with about 15 columns, and I'm using sqlalchemy to access the database. If I want to create a new row to add to the database table, and I want the values of every column (except the id) to be null, is there an easier and more elegant way to do this rather than doing:
new_object = table(col1=None, col2=None, col3=None, ...)  # all the way to column 15
The error I get when I only give the value of the id and no other parameters is as follows:
"TypeError: __init__() missing 15 required positional arguments:"...
and then it lists the 15 parameters I didn't assign values to.
The INSERT statement fills all columns of a table that are not mentioned explicitly with their respective column default. If none has been defined, NULL is the default default (sic!).
Plus, you can instruct Postgres to insert the column default with the key word DEFAULT:
INSERT INTO tbl (id) VALUES (DEFAULT) RETURNING id;
Should do what you are after and return the newly created id.
Not sure how to translate this to your brand of ORM.
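For the ORM side, a minimal sketch (assuming SQLAlchemy 1.4+ and illustrative model/column names, not the asker's actual table): if you rely on the declarative base's default constructor instead of a custom __init__ that requires every column, any column you do not pass is simply inserted as NULL (or its column default).
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, Session

Base = declarative_base()

class Thing(Base):
    __tablename__ = "thing"
    id = Column(Integer, primary_key=True)
    col1 = Column(String)   # nullable by default
    col2 = Column(String)
    # ... up to col15

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(Thing())    # no arguments: every column except id stays NULL
    session.commit()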

SELECT * in SQLAlchemy?

Is it possible to do SELECT * in SQLAlchemy?
Specifically, SELECT * WHERE foo=1?
Is no one feeling the ORM love of SQLAlchemy today? The presented answers correctly describe the lower-level interface that SQLAlchemy provides. Just for completeness, this is the more-likely (for me) real-world situation where you have a session instance and a User class that is ORM mapped to the users table.
for user in session.query(User).filter_by(name='jack'):
    print(user)
    # ...
And this does an explicit select on all columns.
The following selection works for me in the core expression language (returning a RowProxy object):
foo_col = sqlalchemy.sql.column('foo')
s = sqlalchemy.sql.select(['*']).where(foo_col == 1)
If you don't list any columns, you get all of them.
query = users.select()
query = query.where(users.c.name == 'jack')
result = conn.execute(query)
for row in result:
    print row
Should work.
You can always use a raw SQL too:
str_sql = sql.text("YOUR STRING SQL")
# if you have some args:
args = {
    'myarg1': yourarg1,
    'myarg2': yourarg2}
# then call the execute method on your connection
results = conn.execute(str_sql, args).fetchall()
Where Bar is the class mapped to your table and session is your sa session:
bars = session.query(Bar).filter(Bar.foo == 1)
Turns out you can do:
sa.select('*', ...)
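A minimal sketch of that in context (assuming SQLAlchemy 1.4+, where literal_column('*') is the explicit way to ask for SELECT *; table and column names are illustrative):
import sqlalchemy as sa

engine = sa.create_engine("sqlite://")
metadata = sa.MetaData()
users = sa.Table(
    "users", metadata,
    sa.Column("id", sa.Integer, primary_key=True),
    sa.Column("foo", sa.Integer),
)
metadata.create_all(engine)

with engine.begin() as conn:
    conn.execute(users.insert(), [{"foo": 1}, {"foo": 2}])
    # Emits: SELECT * FROM users WHERE users.foo = ?
    stmt = sa.select(sa.literal_column("*")).select_from(users).where(users.c.foo == 1)
    for row in conn.execute(stmt):
        print(row)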
I had the same issue: I was trying to get all columns from a table as a list instead of getting ORM objects back, so that I could convert that list to a pandas dataframe and display it.
What works is to use .c on a subquery or cte as follows:
U = select(User).cte('U')
stmt = select(*U.c)
rows = session.execute(stmt)
Then you get a list of tuples with each column.
Another option is to use __table__.columns in the same way:
stmt = select(*User.__table__.columns)
rows = session.execute(stmt)
In case you want to convert the results to a dataframe, either of these one-liners works:
pd.DataFrame.from_records(rows, columns=rows.keys())
pd.DataFrame.from_records(dict(zip(r.keys(), r)) for r in rows)
For joins, if columns are not defined manually, only the columns of the target table are returned. To get all columns for joins (User table joined with Group table):
sql = User.select(from_obj(Group, User.c.group_id == Group.c.id))
# Add all columns of the Group table to the select
sql = sql.column(Group)
session.connection().execute(sql)
If you're using the ORM, you can build a query using the normal ORM constructs and then execute it directly to get raw column values:
query = session.query(User).filter_by(name='jack')
for cols in session.connection().execute(query):
    print cols
every_column = User.__table__.columns
records = session.query(*every_column).filter(User.foo==1).all()
When a ORM class is passed to the query function, e.g. query(User), the result will be composed of ORM instances. In the majority of cases, this is what the dev wants and will be easiest to deal with--demonstrated by the popularity of the answer above that corresponds to this approach.
In some cases, devs may instead want an iterable sequence of values. In these cases, one can pass the list of desired column objects to query(). This answer shows how to pass the entire list of columns without hardcoding them, while still working with SQLAlchemy at the ORM layer.
