I want to convert a MySQL query from a python script to an analogous query in R. The python uses a loop structure to search for specific values using genomic coordinates:
SQL = """SELECT value FROM %s FORCE INDEX (chrs) FORCE INDEX (sites)
WHERE `chrom` = %d AND `site` = %d""" % (Table, Chr, Start)
cur.execute(SQL)
In R, the chromosomes and sites are in a dataframe. For every row in the dataframe, I would like to extract a single value and add it to a new column in the dataframe.
So my current dataframe has a similar structure to the following:
df <- data.frame("Chr"=c(1,1,3,5,5), "Site"=c(100, 200, 400, 100, 300))
The amended dataframe should have an additional column with values from the database (at the corresponding genomic coordinates). The structure should be similar to:
df <- data.frame("Chr"=c(1,1,3,5,5), "Site"=c(100, 200, 400, 100, 300), "Value"=c(1.5, 0, 5, 60, 100))
So far I connected to the database using:
con <- dbConnect(MySQL(),
user="root", password="",
dbname="MyDataBase")
Rather than loop over each row in my dataframe, I would like to use something that would add the corresponding value to a new column in the existing dataframe.
Update with working solution based on answer below:
library(RMySQL)
library(plyr) # provides ldply()
con <- dbConnect(MySQL(),
user="root", password="",
dbname="MyDataBase")
GetValue <- function(DataFrame, Table){
queries <- sprintf("SELECT value as value
FROM %s FORCE INDEX (chrs) FORCE INDEX (sites)
WHERE chrom = %d AND site = %d UNION ALL SELECT 'NA' LIMIT 1", Table, DataFrame$Chr, DataFrame$Site)
res <- ldply(queries, function(query) { dbGetQuery(con, query)})
DataFrame[, Table] <- res$value
return(DataFrame)
}
df <- GetValue(df, "TableName")
Maybe you could do something like this. First, build up your queries, then execute them, storing the results in a column of your dataframe. Not sure if the do.call(rbind part is necessary, but that basically takes a bunch of dataframe rows, and squishes them together by row into a dataframe.
queries = sprintf("SELECT value as value FROM %s FORCE INDEX (chrs) FORCE INDEX (sites) WHERE chrom = %d AND site = %d UNION ALL SELECT 0 LIMIT 1", "TableName", df$Chr, df$Site)
# dbGetQuery() runs the query and returns the rows as a data frame
df$Value = do.call("rbind", lapply(queries, function(query) dbGetQuery(con, query)))$value
I played with your SQL a little, my concern with the original is with cases where it might return more than 1 row.
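As an alternative to sending one query per row, the coordinates can be loaded into a temporary table and joined against the data table in a single statement. The following is only a minimal sketch in Python, using the standard-library sqlite3 module as a stand-in for the MySQL server; the table names `genome_values` and `coords` and the sample rows are assumptions made up for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL server (assumption).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE genome_values (chrom INTEGER, site INTEGER, value REAL)")
con.executemany(
    "INSERT INTO genome_values VALUES (?, ?, ?)",
    [(1, 100, 1.5), (1, 200, 0.0), (3, 400, 5.0), (5, 100, 60.0)],
)

# Coordinates to look up, mirroring the Chr/Site columns of the dataframe.
coords = [(1, 100), (1, 200), (3, 400), (5, 100), (5, 300)]

# Load the coordinates into a temporary table, remembering their order.
con.execute("CREATE TEMP TABLE coords (pos INTEGER, chrom INTEGER, site INTEGER)")
con.executemany(
    "INSERT INTO coords VALUES (?, ?, ?)",
    [(i, c, s) for i, (c, s) in enumerate(coords)],
)

# A single LEFT JOIN yields a value (or NULL) for every requested coordinate.
rows = con.execute(
    """
    SELECT c.chrom, c.site, v.value
    FROM coords AS c
    LEFT JOIN genome_values AS v
      ON v.chrom = c.chrom AND v.site = c.site
    ORDER BY c.pos
    """
).fetchall()

values = [value for _, _, value in rows]
print(values)
```

The LEFT JOIN keeps coordinates that have no match (their value comes back as None). If the data table can hold several rows per coordinate, the join returns several rows for that coordinate, so it would need an aggregate or a uniqueness constraint.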
I like the data.table package for this kind of task, as its syntax is inspired by SQL
require(data.table)
So an example database to match the values to a table
table <- data.table(chrom=rep(1:5, each=5),
site=rep(100*1:5, times=5),
Value=runif(5*5))
Now the SQL query can be translated into something like
# select from table, where chrom=Chr and site=Site, value
Chr <- 2
Site <- 200
table[chrom==Chr & site==Site, Value] # returns the Value column as a numeric vector
table[chrom==Chr & site==Site, ]$Value # subsets the rows first, then extracts the same numeric vector
Key (index) database for quick lookup (assuming unique chrom and site..)
setkey(table, chrom, site)
table[J(Chr, Site), ]$Value # very fast lookup due to indexed table
Your dataframe as data table with two columns 'Chr' and 'Site' both integer
df <- data.frame("Chr"=c(1,1,3,5,5), "Site"=c(100, 200, 400, 100, 300))
dt <- as.data.table(df) # adds data.table class to data.frame
setkey(dt, Chr, Site) # index for 'by' and for 'J' join
Match the values and append in new column (by reference, so no copying of table)
# loop over keys Chr and Site and find the match in the table
# select the Value column and create a new column that contains this
dt[, Value:=table[chrom==Chr & site==Site]$Value, by=list(Chr, Site)]
# faster:
dt[, Value:=table[J(Chr, Site)]$Value, by=list(Chr, Site)]
# fastest: in one table merge operation assuming the keys are in the same order
table[J(dt)]
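For readers coming from the Python side of the question, the keyed-lookup idea can be sketched with a plain dict keyed by (chrom, site). This is only an analogy to setkey() and J(), not a translation of data.table; the sample values and the names `table` and `rows` are assumptions for illustration.

```python
# Hypothetical lookup table keyed by (chrom, site), analogous to a keyed data.table.
table = {
    (1, 100): 1.5,
    (1, 200): 0.0,
    (3, 400): 5.0,
    (5, 100): 60.0,
    (5, 300): 100.0,
}

# Rows to annotate, mirroring the Chr/Site dataframe from the question.
rows = [
    {"Chr": 1, "Site": 100},
    {"Chr": 1, "Site": 200},
    {"Chr": 3, "Site": 400},
    {"Chr": 5, "Site": 100},
    {"Chr": 5, "Site": 300},
]

# Each lookup is a hash access on the composite key; missing keys give None.
for row in rows:
    row["Value"] = table.get((row["Chr"], row["Site"]))

print([row["Value"] for row in rows])
```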
Why don't you use the RMySQL or sqldf package?
With RMySQL, you get MySQL access in R.
With sqldf, you can issue SQL queries on R data structures.
Using either of those, you do not need to reword your SQL query to get the same results.
Let me also mention the data.table package, which lets you do very efficient selects and joins on your data frames after converting them to data tables using as.data.table(your.data.frame). Another good thing about it is that a data.table object is a data.frame at the same time, so all your functions that work on the data frames work on these converted objects, too.
You could easily use the dplyr package. There is even a nice vignette about that - http://cran.rstudio.com/web/packages/dplyr/vignettes/databases.html.
One thing you need to know is:
You can connect to MySQL and MariaDB (a recent fork of MySQL) through
src_mysql(), mediated by the RMySQL package. Like PostgreSQL, you'll
need to provide a dbname, username, password, host, and port.
I'm loading some data from a SQL database into Python, but I need to apply some criteria that come from a Python dataframe. A simplified example is below:
some_sql = """
select column1,columns2
from table
where a between '{}' and '{}'
or a between '{}' and '{}'
or a between '{}' and '{}'
""".format(date1,date2,date3,date4,date5,date6)
date1,date2,date3,date4,date5,date6 are sourced from Python Dataframe. I can manually specify all 6 parameters, but I do have over 20 in fact...
df = DataFrame({'col1':['date1','date3','date5'],
'col2':['date2','date4','date6']})
Is there a way to build this with a loop, to be more efficient?
Setup
# Create a dummy dataframe
df = pd.DataFrame({'col1':['date1','date3','date5'],
'col2':['date2','date4','date6']})
# Prepare the SQL (conditions will be added later)
some_sql = """
select column1,columns2
from table
where """
First approach
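conditions = []
for _, data in df.iterrows():
    # iterrows() yields (index, row) pairs; the index is not needed here
    conditions.append(f"a between '{data['col1']}' and '{data['col2']}'")
# Join with 'or' so the first condition follows 'where' directly
some_sql += '\nor '.join(conditions)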
By using iterrows() we can iterate through the dataframe, row by row.
Alternative
some_sql += '\nor '.join(df.apply(lambda x: f"a between '{x['col1']}' and '{x['col2']}'", axis=1).tolist())
Using apply() should be faster than iterrows():
Although apply() also inherently loops through rows, it does so much
more efficiently than iterrows() by taking advantage of a number of
internal optimizations, such as using iterators in Cython.
source
Another alternative
some_sql += '\nor '.join([f"a between '{row['col1']}' and '{row['col2']}'" for row in df.to_dict('records')])
This converts the dataframe to a list of dicts, and then applies a list comprehension to create the conditions.
Result
select column1,columns2
from table
where a between 'date1' and 'date2'
or a between 'date3' and 'date4'
or a between 'date5' and 'date6'
As a secondary note to Kristof's answer above, I would note that even as an analyst one should probably be careful about things like SQL injection, so inlining data is something to be avoided.
If possible you should define your query once with placeholders and then create a param list to go with the placeholders. This also saves on the formatting too.
So in your case your query looks like:
some_sql = """
select column1,columns2
from table
where a between ? and ?
or a between ? and ?
or a between ? and ?
"""
And our param list generation is going to look like:
conditions = []
for row in df.iterrows():
    # Ignore the index
    data = row[1]
    conditions.append(data['col1'])
    conditions.append(data['col2'])
Then execute your SQL with placeholder syntax and params list as placeholders.
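A minimal sketch of that last step, assuming the standard-library sqlite3 driver, a hypothetical `events` table, and made-up date ranges in place of the dataframe columns. It also generates the placeholders from the number of ranges, so the query does not need to be edited when there are 20 pairs instead of 3. Note that the placeholder marker depends on the driver: sqlite3 uses `?`, while some other drivers (psycopg2, for example) use `%s`.

```python
import sqlite3

# Hypothetical date ranges, standing in for the col1/col2 columns of the dataframe.
ranges = [
    ("2020-01-01", "2020-01-31"),
    ("2020-03-01", "2020-03-31"),
    ("2020-05-01", "2020-05-31"),
]

# One "a between ? and ?" clause per range, joined with "or".
where = "\n   or ".join("a between ? and ?" for _ in ranges)
some_sql = f"select column1, column2\nfrom events\nwhere {where}"

# Flatten the pairs into one parameter list, in placeholder order.
params = [bound for pair in ranges for bound in pair]

# Throwaway in-memory table to run the query against (assumption).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (column1 TEXT, column2 TEXT, a TEXT)")
con.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [("x", "1", "2020-01-15"), ("y", "2", "2020-02-15"), ("z", "3", "2020-05-31")],
)

# The driver binds the values; nothing from the data is inlined into the SQL text.
rows = con.execute(some_sql, params).fetchall()
print(rows)
```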
I'm using Python to query a SQL database. I'm fairly new to databases. I've tried looking up this question, but I can't find a similar enough question to get the right answer.
I have a table with multiple columns/rows. I want to find the MAX of a single column, I want ALL columns returned (the entire ROW), and I want only one instance of the MAX. Right now I'm getting ten ROWS returned, because the MAX is repeated ten times. I only want one ROW returned.
The query strings I've tried so far:
sql = 'select max(f) from cbar'
# this returns one ROW, but only a single COLUMN (a single value)
sql = 'select * from cbar where f = (select max(f) from cbar)'
# this returns all COLUMNS, but it also returns multiple ROWS
I've tried a bunch more, but they returned nothing. They weren't right somehow. That's the problem, I'm too new to find the middle ground between my two working query statements.
In SQLite 3.7.11 or later, you can just retrieve all columns together with the maximum value:
SELECT *, max(f) FROM cbar;
But the SQLite library bundled with your Python might be too old for that. In the general case, you can sort the table by that column, and then just read the first row:
SELECT * FROM cbar ORDER BY f DESC LIMIT 1;
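A small sketch of the difference between the asker's second query and the ORDER BY ... LIMIT 1 form, using Python's sqlite3 module. The two-column layout of `cbar` and the sample rows are assumptions; only the table name and the column `f` come from the question.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical layout: an id column plus the column f from the question.
con.execute("CREATE TABLE cbar (id INTEGER, f REAL)")
con.executemany(
    "INSERT INTO cbar VALUES (?, ?)",
    [(1, 2.0), (2, 9.5), (3, 9.5), (4, 1.0)],
)

# Subquery form: returns every row that ties for the maximum.
all_max = con.execute(
    "SELECT * FROM cbar WHERE f = (SELECT max(f) FROM cbar)"
).fetchall()

# Sorted form: returns exactly one full row holding the maximum.
one_max = con.execute(
    "SELECT * FROM cbar ORDER BY f DESC LIMIT 1"
).fetchall()

print(len(all_max), len(one_max))
```

Which of the tied rows comes back from the sorted form is not specified. If that matters, adding a second sort key (for example `ORDER BY f DESC, id ASC`) makes the choice deterministic.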
I'm using Pyspark 1.2.1 with Hive. (Upgrading will not happen immediately).
The problem I have is that when I select from a Hive table, and add an index, pyspark changes long values to ints, so I end up with a temp table with a column of type Long, but values of type Integer. (See code below).
My question is: how can I either (a) perform the merge of the index (see code) without changing longs to ints; or (b) add the index in some other way that avoids the problem; or (c) randomize table columns without needing to join?
The underlying problem I'm trying to solve is that I want to randomize the order of certain columns in a hive table, and write that to a new table. This is to make data no longer personally identifiable. I'm doing that by adding an incrementing index to the original table and the randomised columns, then joining on that index.
The table looks like:
primary | longcolumn | randomisecolumn
The code is:
hc = HiveContext(sc)
orig = hc.sql('select * from mytable')
widx = orig.zipWithIndex().map(merge_index_on_row)
sql_context.applySchema(widx, add_index_schema(orig.schema())) \
    .registerTempTable('sani_first')
# At this point sani_first has a column longcolumn with type long,
# but (many of) the values are ints
def merge_index_on_row((row, idx), idx_name=INDEX_COL):
    """
    Row is a SchemaRDD row object; idx is an integer;
    schema is the schema for row with an added index col at the end
    returns a version of row applying schema and holding the index in the new row
    """
    as_dict = row.asDict()
    as_dict[idx_name] = idx
    return Row(**as_dict)

def add_index_schema(schema):
    """
    Take a schema, add a column for an index, return the new schema
    """
    return StructType(sorted(schema.fields + [StructField(INDEX_COL, IntegerType(), False)], key=lambda x: x.name))
In the absence of a better solution, I'm going to force the affected columns to long type in the python code. This is...not great.
I'm new to python (3) and having a hard time with finding relevant examples for how to handle the following scenario. I know this is on the verge of being a "what's best" question, but hopefully there is a clearly appropriate methodology for this.
I have csv data files that contain timestamps and then at least one column of data with a name defined by a master list (i.e. all possible column headers are known). For example:
File1.csv
date-time, data a, data b
2014-01-01, 23, 22
2014-01-01, 23, 22d
File2.csv
date-time, data d, data a
2014-01-01, 99, 20
2014-01-01, 100, 22
I've been going in circles trying to understand when to use tuples, lists, and dictionaries for this type of scenario for import into postgresql. Since the column order can change and the list of columns is different each time (although always from a master set), I'm not sure on how to best generate a data set that includes the time stamp and columns and then perform an insert into a postgresql table where unspecified columns are provided a value.
Given the dynamic nature of the columns' presence and the need to maintain the relationship with the timestamp for the Postgresql import via psycopg, what is recommended? Lists, lists of lists, dictionaries, or tuples?
I'm not begging for specific code, just some guidance. Thanks.
You can use the csv module to parse the input file and use its first row to build (prepare) a psycopg insert statement with the column names and %s in place of the values. For the remaining rows, simply execute this statement with the row as the values:
import csv
import psycopg2

connect_string = 'dbname=test host=localhost port=5493 user=postgres password=postgres'
connection = psycopg2.connect(connect_string)
cursor = connection.cursor()
# fn is the path to the CSV file
f = open(fn, 'rt')
try:
    reader = csv.reader(f)
    cols = []
    for row in reader:
        if not cols:
            # First row: the header defines the column list for this file
            cols = row
            psycopg_marks = ','.join(['%s' for s in cols])
            insert_statement = "INSERT INTO xyz (%s) VALUES (%s)" % (','.join(cols), psycopg_marks)
            print(insert_statement)
        else:
            # Data row: the driver binds the values to the %s placeholders
            print(row)
            cursor.execute(insert_statement, row)
finally:
    f.close()
...
For your example you will have to correct column names.
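Since column names cannot be passed as bound parameters, one cautious variation is to check each header against the master list before building the statement. The sketch below assumes the standard-library sqlite3 module in place of psycopg (so the placeholder is `?` rather than `%s`), a hypothetical `readings` table, and a made-up rule that turns a header such as "data a" into the column name `data_a`.

```python
import csv
import io
import sqlite3

# Master list of permitted column names (hypothetical, after normalisation).
ALLOWED = {"date_time", "data_a", "data_b", "data_d"}

def normalise(name):
    """Turn a CSV header such as 'data a' into a column name such as 'data_a'."""
    return name.strip().replace(" ", "_").replace("-", "_")

def load_csv(con, text):
    """Insert every data row of one CSV document into the readings table."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    header = [normalise(h) for h in next(reader)]
    unknown = [h for h in header if h not in ALLOWED]
    if unknown:
        raise ValueError(f"unexpected columns: {unknown}")
    marks = ", ".join("?" for _ in header)
    sql = f"INSERT INTO readings ({', '.join(header)}) VALUES ({marks})"
    # Columns that are absent from this file are left at their default (NULL).
    con.executemany(sql, reader)

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE readings (date_time TEXT, data_a REAL, data_b REAL, data_d REAL)"
)

file1 = "date-time, data a, data b\n2014-01-01, 23, 22\n2014-01-02, 24, 21\n"
file2 = "date-time, data d, data a\n2014-01-03, 99, 20\n"
load_csv(con, file1)
load_csv(con, file2)

rows = con.execute(
    "SELECT date_time, data_a, data_b, data_d FROM readings ORDER BY date_time"
).fetchall()
print(rows)
```

A list of rows plus one list of column names per file is enough here: the header fixes the order once, and each row is passed to the driver as-is.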
Is it possible to do SELECT * in SQLAlchemy?
Specifically, SELECT * WHERE foo=1?
Is no one feeling the ORM love of SQLAlchemy today? The presented answers correctly describe the lower-level interface that SQLAlchemy provides. Just for completeness, this is the more-likely (for me) real-world situation where you have a session instance and a User class that is ORM mapped to the users table.
for user in session.query(User).filter_by(name='jack'):
    print(user)
# ...
And this does an explicit select on all columns.
The following selection works for me in the core expression language (returning a RowProxy object):
foo_col = sqlalchemy.sql.column('foo')
s = sqlalchemy.sql.select(['*']).where(foo_col == 1)
If you don't list any columns, you get all of them.
query = users.select()
query = query.where(users.c.name=='jack')
result = conn.execute(query)
for row in result:
    print(row)
Should work.
You can always use a raw SQL too:
str_sql = sql.text("YOUR STRING SQL")
#if you have some args:
args = {
    'myarg1': yourarg1,
    'myarg2': yourarg2}
#then call the execute method from your connection
results = conn.execute(str_sql,args).fetchall()
Where Bar is the class mapped to your table and session is your sa session:
bars = session.query(Bar).filter(Bar.foo == 1)
Turns out you can do:
sa.select('*', ...)
I had the same issue, I was trying to get all columns from a table as a list instead of getting ORM objects back. So that I can convert that list to pandas dataframe and display.
What works is to use .c on a subquery or cte as follows:
U = select(User).cte('U')
stmt = select(*U.c)
rows = session.execute(stmt)
Then you get a list of tuples with each column.
Another option is to use __table__.columns in the same way:
stmt = select(*User.__table__.columns)
rows = session.execute(stmt)
In case you want to convert the results to dataframe here is the one liner:
pd.DataFrame.from_records(rows, columns=rows.keys())
For joins, if the columns are not defined manually, only the columns of the target table are returned. To get all columns for a join (User table joined with Group table):
sql = User.select(from_obj(Group, User.c.group_id == Group.c.id))
# Add all columns of Group table to select
sql = sql.column(Group)
session.connection().execute(sql)
A variant of the dataframe conversion above builds one dict per row:
pd.DataFrame.from_records(dict(zip(r.keys(), r)) for r in rows)
If you're using the ORM, you can build a query using the normal ORM constructs and then execute it directly to get raw column values:
query = session.query(User).filter_by(name='jack')
for cols in session.connection().execute(query):
    print(cols)
every_column = User.__table__.columns
records = session.query(*every_column).filter(User.foo==1).all()
When an ORM class is passed to the query function, e.g. query(User), the result will be composed of ORM instances. In the majority of cases, this is what the dev wants and will be easiest to deal with, as demonstrated by the popularity of the answer above that corresponds to this approach.
In some cases, devs may instead want an iterable sequence of values. In these cases, one can pass the list of desired column objects to query(). This answer shows how to pass the entire list of columns without hardcoding them, while still working with SQLAlchemy at the ORM layer.