I've got a Python application that uses pandas to grok some Excel spreadsheets and insert values into an Oracle database.
For date cells that have a value, this works fine. For empty date cells I am inserting a NaT, which I would have thought would be fine, but in Oracle that is becoming some weird invalid time that displays as "0001-255-255 00:00:00" (something like MAXINT or 0 being converted to a timestamp, I'm guessing?).
In[72]: x.iloc[0][9]
Out[72]: NaT
Above is the bit of data in the DataFrame; you can see it's a NaT.
But this is what I see in Oracle:
SQL> select TDATE from TABLE where id=5067 AND version=5;
TDATE
---------
01-NOVEMB
SQL> select dump("TDATE") TABLE where id=5067 AND version=5;
DUMP("TDATE")
--------------------------------------------------------------------------------
Typ=12 Len=7: 100,101,255,255,1,1,1
I tried using df.replace and/or df.where to convert NaT to None, but I get assorted errors with either of these that seem to imply the substitution isn't valid that way.
Any way to ensure consistency of a null date across these datastores?!
This issue has been fixed in pandas 0.15.0.
If you can, update to pandas >= 0.15.0. Starting with that version, NaN and NaT are properly stored as NULL in the database.
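If you cannot upgrade yet, a workaround many people use on older versions is to cast the frame to object dtype and swap NaT/NaN for None before writing. This is only a sketch, with df standing for your DataFrame and engine for your SQLAlchemy engine:
import pandas as pd
# the object cast is needed because a datetime64 column cannot hold None;
# after the cast, NaT/NaN can be replaced by None, which the driver sends as NULL
clean = df.astype(object).where(pd.notnull(df), None)
clean.to_sql("W", engine, if_exists='append', index=False)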
After having performed some experiments, it appears that pandas passes NaT to SQLAlchemy and down to cx_Oracle, which in turn blindly sends an invalid date to Oracle (which in turn does not complain).
Anyway, one workaround I was able to come up with is to add a BEFORE INSERT trigger to fix incoming timestamps. For that to work, you will have to create the table manually first.
-- Create the table
CREATE TABLE W ("ID" NUMBER(5), "TDATE" TIMESTAMP);
And then the trigger:
-- Create a trigger on the table
CREATE OR REPLACE TRIGGER fix_null_ts
BEFORE INSERT ON W
FOR EACH ROW WHEN (extract(month from new.tdate) = 255)
BEGIN
:new.tdate := NULL;
END;
/
After that, from Python, using pandas.DataFrame.to_sql(..., if_exists='append'):
>>> d = [{"id":1,"tdate":datetime.now()},{"id":2}]
>>> f = pd.DataFrame(d)
>>> f.to_sql("W",engine, if_exists='append', index=False)
# ^^^^^^^^^^^^^^^^^^
# don't drop the table! append data to an existing table
And check:
>>> result = engine.execute("select * from w")
>>> for row in result:
... print(row)
...
(1, datetime.datetime(2014, 10, 31, 1, 10, 2))
(2, None)
Beware that, if you ever need to write another DataFrame to the same table, you will first need to delete its content, but not drop the table, otherwise you would lose the trigger along with it. For example:
# Some new data
>>> d = [{"id":3}]
>>> f = pd.DataFrame(d)
# Truncate the table and write the new data
>>> engine.execute("truncate table w")
>>> f.to_sql("W",engine, if_exists='append', index=False)
>>> result = engine.execute("select * from w")
# Check the result
>>> for row in result:
... print(row)
...
(3, None)
I hope the data type of the date column in the Oracle database is DATE.
In that case, remember that a DATE holds a date part and a time part together. While loading into the database, make sure you use TO_DATE with a proper datetime format for the date literal.
That's loading. To display, use TO_CHAR with a proper datetime format to see the value the way human eyes want to see a datetime.
As for the NULL values, unless you have a NOT NULL constraint, I don't see any issue with loading; the NULLs would simply be loaded as NULL. If you want to manipulate the NULL values, use the NVL function with whatever value you want to substitute for NULL.
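For example, if you read the data back with pandas, you could apply NVL right in the query. This sketch reuses the W/TDATE names from above, and the fallback date is arbitrary:
import pandas as pd
# substitute an arbitrary sentinel date for NULL timestamps at query time
df = pd.read_sql(
    "SELECT ID, NVL(TDATE, DATE '1970-01-01') AS TDATE FROM W",
    engine,
)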
Related
I ran into a problem where, when I use pandas to read a MySQL table, some columns (see 'to_nlc') that used to be integers become floats (a .0 gets appended automatically).
Can anyone figure it out, or offer some guesses? Thanks very much!
The problem is that your data contains NaN values, so the int columns are automatically cast to float.
I think you should check the docs on NA type promotions:
When introducing NAs into an existing Series or DataFrame via reindex or some other means, boolean and integer types will be promoted to a different dtype in order to store the NAs. These are summarized by this table:
Typeclass    Promotion dtype for storing NAs
floating     no change
object       no change
integer      cast to float64
boolean      cast to object
While this may seem like a heavy trade-off, in practice I have found very few cases where this is an issue. Some explanation for the motivation is in the next section.
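A quick way to see this promotion for yourself:
import pandas as pd

s = pd.Series([1, 2, 3])       # dtype: int64
s2 = s.reindex([0, 1, 2, 3])   # label 3 is missing, so a NaN is introduced
print(s2.dtype)                # float64: the integers were promoted to hold the NaN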
As already said, the problem is that pandas' integer dtype cannot hold NULL/NA values.
You can replace read_sql_table with read_sql and convert NULL to some integer value (for example 0 or -1, whatever plays the role of NULL in your setting):
df = pandas.read_sql("SELECT col1, col2, IFNULL(col3, 0) FROM table", engine)
Here col3 can be NULL in MySQL; IFNULL returns 0 if it is NULL and the col3 value otherwise.
Or the same thing with a little helper function:
def read_sql_table_with_nullcast(table_name, engine, null_cast={}):
    """
    table_name - table name
    engine     - sql engine
    null_cast  - dictionary of columns to replace NULL:
                 column name as key, value to replace with as value.
                 For example {'col3': 0} will set all NULLs in col3 to 0.
    """
    import pandas
    cols = pandas.read_sql("SHOW COLUMNS FROM " + table_name, engine)
    cols_call = [c if c not in null_cast
                 else "ifnull(%s,%d) as %s" % (c, null_cast[c], c)
                 for c in cols['Field']]
    sel = ",".join(cols_call)
    return pandas.read_sql("SELECT " + sel + " FROM " + table_name, engine)

read_sql_table_with_nullcast("table", engine, {'col3': 0})
You can use the parameter coerce_float=False:
df = pd.read_sql(sql, con=conn, coerce_float=False)
coerce_float : bool, default True
Attempts to convert values of non-string, non-numeric objects (like decimal.Decimal) to floating point, useful for SQL result sets.
https://pandas.pydata.org/docs/reference/api/pandas.read_sql.html
Another possibility is to exclude NULL values in the WHERE clause of your SQL query, if you're not expecting them and they correspond to unusable rows.
So it won't be suitable in all circumstances, but is a clean and simple option when it does apply.
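A sketch of that, reusing the hypothetical col3 from the answers above:
import pandas as pd

# rows where col3 is NULL are unusable here, so filter them out on the SQL side
df = pd.read_sql("SELECT col1, col2, col3 FROM table WHERE col3 IS NOT NULL", engine)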
I'm using Pyspark 1.2.1 with Hive. (Upgrading will not happen immediately).
The problem I have is that when I select from a Hive table, and add an index, pyspark changes long values to ints, so I end up with a temp table with a column of type Long, but values of type Integer. (See code below).
My question is: how can I either (a) perform the merge of the index (see code) without changing longs to ints; or (b) add the index in some other way that avoids the problem; or (c) randomize table columns without needing to join?
The underlying problem I'm trying to solve is that I want to randomize the order of certain columns in a hive table, and write that to a new table. This is to make data no longer personally identifiable. I'm doing that by adding an incrementing index to the original table and the randomised columns, then joining on that index.
The table looks like:
primary | longcolumn | randomisecolumn
The code is:
hc = HiveContext(sc)
orig = hc.sql('select * from mytable')
widx = orig.zipWithIndex().map(merge_index_on_row)
hc.applySchema(widx, add_index_schema(orig.schema())) \
  .registerTempTable('sani_first')
# At this point sani_first has a column longcolumn with type long,
# but (many of) the values are ints
def merge_index_on_row((row, idx), idx_name=INDEX_COL):
    """
    Row is a SchemaRDD row object; idx is an integer;
    schema is the schema for row with an added index col at the end.
    Returns a version of row applying schema and holding the index in the new row.
    """
    as_dict = row.asDict()
    as_dict[idx_name] = idx
    return Row(**as_dict)
def add_index_schema(schema):
    """
    Take a schema, add a column for an index, return the new schema.
    """
    return StructType(sorted(schema.fields + [StructField(INDEX_COL, IntegerType(), False)],
                             key=lambda x: x.name))
In the absence of a better solution, I'm going to force the affected columns to long type in the Python code. This is...not great.
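For what it's worth, here is a minimal sketch of that coercion, assuming Python 2 (as used with pyspark 1.2.1) and that 'longcolumn' is the affected column; adjust the list to your schema:
AFFECTED_COLS = ['longcolumn']   # hypothetical: the columns that must stay 64-bit

def merge_index_on_row((row, idx), idx_name=INDEX_COL):
    """Same as above, but coerce the affected columns back to Python longs."""
    as_dict = row.asDict()
    as_dict[idx_name] = idx
    for col in AFFECTED_COLS:
        if as_dict[col] is not None:
            as_dict[col] = long(as_dict[col])   # Python 2 long, so the value stays 64-bit
    return Row(**as_dict)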
I'm trying to use pandas to do some analysis on some messaging data and am running into a few problems trying to prep the data. It is coming from a database I don't have control of, and therefore I need to do a little pruning and formatting before analyzing it.
Here is where I'm at so far:
#select all the messages in the database. Be careful if you get the whole test database, it may have 5000000 messages.
full_set_data = pd.read_sql("Select * from message",con=engine)
After I make this change to the timestamp and set it as the index, I'm no longer able to call to_csv.
#convert timestamp to a timedelta and set as index
#full_set_data[['timestamp']] = full_set_data[['timestamp']].astype(np.timedelta64)
indexed = full_set_data.set_index('timestamp')
indexed.to_csv('indexed.csv')
#extract the data columns I really care about since there are a bunch I don't need
datacolumns = indexed[['address','subaddress','rx_or_tx', 'wordcount'] + [col for col in indexed.columns if ('DATA' in col)]]
Here I need to format the DATA columns, and I get a "SettingWithCopyWarning".
#now need to format the DATA columns to something useful by removing the upper 4 bytes
for col in datacolumns.columns:
    if 'DATA' in col:
        datacolumns[col] = datacolumns[col].apply(lambda x: int(x, 16) & 0x0000ffff)
datacolumns.to_csv('data_col.csv')
#now group the data by "interaction key"
groups = datacolumns.groupby(['address','subaddress','rx_or_tx'])
I need to figure out how to get all the messages from a given group. get_group() requires I know key values ahead of time.
key_group = groups.get_group((1,1,1))
#foreach group in groups:
#do analysis
I have tried everything I could think of to fix the problems I'm running into, but I can't seem to get around them. I'm sure it's from me misunderstanding/misusing Pandas as I'm still figuring it out.
I'm looking to solve these issues:
1) I can't save to CSV after I set the timestamp (as timedelta64) as the index.
2) How do I apply a function to a set of columns without the SettingWithCopyWarning when reformatting the DATA columns?
3) How do I grab the rows for each group without using get_group(), since I don't know the keys ahead of time?
Thanks for any insight and help so I can better understand how to properly use Pandas.
Firstly, you can set the index column(s) and parse dates while querying the DB:
indexed = pd.read_sql_query("Select * from message", engine,
                            parse_dates='timestamp', index_col='timestamp')
Note I've used pd.read_sql_query here rather than pd.read_sql; read_sql is just a convenience wrapper that dispatches to read_sql_query or read_sql_table.
The SettingWithCopy warning is due to the fact that datacolumns is a view of indexed, i.e. a subset of its rows/columns, not an object in its own right. Check out this part of the docs: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
One way to get around this is to define
datacolumns = indexed[<cols>].copy()
Another would be to do
indexed = indexed[<cols>]
which effectively removes the columns you don't want, if you're happy that you won't need them again. You can then manipulate indexed at your leisure.
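Putting the first option together with the question's DATA-column transform (column names copied from the question), a sketch might look like this:
# explicit copy, so later assignments modify an independent frame
cols = ['address', 'subaddress', 'rx_or_tx', 'wordcount'] + \
       [c for c in indexed.columns if 'DATA' in c]
datacolumns = indexed[cols].copy()

for col in datacolumns.columns:
    if 'DATA' in col:
        # keep only the lower 16 bits of each hex-string value
        datacolumns[col] = datacolumns[col].apply(lambda x: int(x, 16) & 0x0000ffff)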
As for the groupby, you could introduce a column of tuples which would be the group keys:
indexed['interaction_key'] = list(zip(indexed['address'],
                                      indexed['subaddress'],
                                      indexed['rx_or_tx']))
indexed.groupby('interaction_key').apply(
    lambda df: some_function(df.interaction_key, ...))
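Alternatively, to grab the rows for each group without knowing the keys ahead of time, you can simply iterate over the GroupBy object; this is standard pandas, with do_analysis as a hypothetical placeholder for your own function:
groups = datacolumns.groupby(['address', 'subaddress', 'rx_or_tx'])

for key, group_df in groups:
    # key is the (address, subaddress, rx_or_tx) tuple for this group,
    # group_df is a DataFrame holding all of its rows
    do_analysis(key, group_df)   # do_analysis is a placeholder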
I'm not sure if it's all exactly what you want but let me know and I can edit.
I want to convert a MySQL query from a python script to an analogous query in R. The python uses a loop structure to search for specific values using genomic coordinates:
SQL = """SELECT value FROM %s FORCE INDEX (chrs) FORCE INDEX (sites)
WHERE `chrom` = %d AND `site` = %d""" % (Table, Chr, Start)
cur.execute(SQL)
In R, the chromosomes and sites are in a data frame, and for every row in the data frame I would like to extract a single value and add it to a new column.
So my current dataframe has a similar structure to the following:
df <- data.frame("Chr"=c(1,1,3,5,5), "Site"=c(100, 200, 400, 100, 300))
The amended dataframe should have an additional column with values from the database (at the corresponding genomic coordinates). The structure should be similar to:
df <- data.frame("Chr"=c(1,1,3,5,5), "Site"=c(100, 200, 400, 100, 300), "Value"=c(1.5, 0, 5, 60, 100))
So far I connected to the database using:
con <- dbConnect(MySQL(),
user="root", password="",
dbname="MyDataBase")
Rather than loop over each row in my dataframe, I would like to use something that would add the corresponding value to a new column in the existing dataframe.
Update with working solution based on answer below:
library(RMySQL)
library(plyr)   # for ldply
con <- dbConnect(MySQL(),
user="root", password="",
dbname="MyDataBase")
GetValue <- function(DataFrame, Table){
  queries <- sprintf("SELECT value as value
                      FROM %s FORCE INDEX (chrs) FORCE INDEX (sites)
                      WHERE chrom = %d AND site = %d UNION ALL SELECT 'NA' LIMIT 1",
                     Table, DataFrame$Chr, DataFrame$start)
  res <- ldply(queries, function(query) { dbGetQuery(con, query) })
  DataFrame[, Table] <- res$value
  return(DataFrame)
}
df <- GetValue(df, "TableName")
Maybe you could do something like this. First, build up your queries, then execute them, storing the results in a column of your dataframe. Not sure if the do.call("rbind", ...) part is necessary, but that basically takes a bunch of dataframe rows and squishes them together by row into a dataframe.
queries=sprintf("SELECT value as value FROM %s FORCE INDEX (chrs) FORCE INDEX (sites) WHERE chrom = %d AND site = %d UNION ALL SELECT 0 LIMIT 1", "TableName", df$Chrom, df$Pos)
df$Value = do.call("rbind",sapply(queries, function(query) dbSendQuery(mydb, query)))$value
I played with your SQL a little; my concern with the original is cases where it might return more than one row.
I like the data.table package for this kind of task, as its syntax is inspired by SQL.
require(data.table)
So an example database to match the values to a table
table <- data.table(chrom=rep(1:5, each=5),
                    site=rep(100*1:5, times=5),
                    Value=runif(5*5))
Now the SQL query can be translated into something like
# select from table, where chrom=Chr and site=Site, value
Chr <- 2
Site <- 200
table[chrom==Chr & site==Site, Value] # returns data.table
table[chrom==Chr & site==Site, ]$Value # returns numeric
Key (index) the table for quick lookup (assuming chrom and site are unique):
setkey(table, chrom, site)
table[J(Chr, Site), ]$Value # very fast lookup due to indexed table
Your data frame as a data.table with two columns, 'Chr' and 'Site', both integer:
df <- data.frame("Chr"=c(1,1,3,5,5), "Site"=c(100, 200, 400, 100, 300))
dt <- as.data.table(df) # adds data.table class to data.frame
setkey(dt, Chr, Site) # index for 'by' and for 'J' join
Match the values and append them in a new column (by reference, so the table is not copied):
# loop over keys Chr and Site and find the match in the table
# select the Value column and create a new column that contains this
dt[, Value:=table[chrom==Chr & site==Site]$Value, by=list(Chr, Site)]
# faster:
dt[, Value:=table[J(Chr, Site)]$Value, by=list(Chr, Site)]
# fastest: in one table merge operation assuming the keys are in the same order
table[J(dt)]
Why don't you use the RMySQL or sqldf package?
With RMySQL, you get MySQL access in R.
With sqldf, you can issue SQL queries on R data structures.
Using either of those, you do not need to reword your SQL query to get the same results.
Let me also mention the data.table package, which lets you do very efficient selects and joins on your data frames after converting them to data tables using as.data.table(your.data.frame). Another good thing about it is that a data.table object is a data.frame at the same time, so all your functions that work on the data frames work on these converted objects, too.
You could easily use the dplyr package. There is even a nice vignette about that: http://cran.rstudio.com/web/packages/dplyr/vignettes/databases.html.
One thing you need to know is:
You can connect to MySQL and MariaDB (a recent fork of MySQL) through
src_mysql(), mediated by the RMySQL package. Like PostgreSQL, you'll
need to provide a dbname, username, password, host, and port.
Hi, I have a result set from psycopg2 like so:
(
(timestamp1, val11, val12, val13, val14),
(timestamp2, val21, val22, val23, val24),
(timestamp3, val31, val32, val33, val34),
(timestamp4, val41, val42, val43, val44),
)
I have to return the differences between the values of the rows (except for the timestamp column).
Each row would subtract the previous row values.
The first row would be
timestamp, 'NaN', 'NaN' ....
This has to then be returned as a generic object,
i.e. something like an array of the following objects:
Group(timestamp=timestamp, rows=[val11, val12, val13, val14])
I was going to use Pandas to do the diff.
Something like below works ok on the values
df = DataFrame().from_records(data=results, columns=headers)
diffs = df.set_index('time', drop=False).diff()
But diff also operates on the timestamp column, and I can't get it to ignore a column while leaving the original timestamp column in place.
Also, I wasn't sure it was going to be efficient to get the data into my return format, as pandas advises against row access.
What would be a fast way to get the result set differences in my required output format?
Why did you set drop=False? That puts the timestamps in the index (where they will not be touched by diff) but also leaves a copy of the timestamps as a proper column, to be processed by diff.
I think this will do what you want:
diffs = df.set_index('time').diff().reset_index()
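For example, on a tiny frame shaped like the result set above (the column names and values here are made up):
import pandas as pd

df = pd.DataFrame({
    'time': pd.to_datetime(['2014-01-01 00:00', '2014-01-01 00:01', '2014-01-01 00:02']),
    'val1': [10, 13, 20],
    'val2': [5, 5, 9],
})

diffs = df.set_index('time').diff().reset_index()
# 'time' comes back as an untouched column; val1/val2 now hold row-over-row
# differences, with NaN in the first row
print(diffs)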
Since you mention psycopg2, take a look at the docs for pandas 0.14, released just a few days ago, which features improved SQL functionality, including new support for PostgreSQL. You can read and write directly between the database and pandas DataFrames.
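For illustration, here is a minimal sketch of that round trip using SQLAlchemy; the connection string and table names are placeholders:
import pandas as pd
from sqlalchemy import create_engine

# placeholder DSN; substitute your own credentials and database
engine = create_engine('postgresql://user:password@localhost:5432/mydb')

df = pd.read_sql_query('SELECT * FROM measurements', engine)   # hypothetical table
df.to_sql('measurements_diff', engine, if_exists='replace', index=False)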