I'm loading data from a SQL database into Python, but I need to apply some criteria that come from a Python DataFrame. To simplify, see the example below:
some_sql = """
select column1,columns2
from table
where a between '{}' and '{}'
or a between '{}' and '{}'
or a between '{}' and '{}'
""".format(date1,date2,date3,date4,date5,date6)
date1 through date6 are sourced from a Python DataFrame. I can manually specify all 6 parameters, but in fact I have over 20...
df = DataFrame({'col1': ['date1', 'date3', 'date5'],
                'col2': ['date2', 'date4', 'date6']})
Is there a way to do this with a loop so it is more efficient?
Setup
import pandas as pd

# Create a dummy dataframe
df = pd.DataFrame({'col1': ['date1', 'date3', 'date5'],
                   'col2': ['date2', 'date4', 'date6']})
# Prepare the SQL (conditions will be added later)
some_sql = """
select column1,columns2
from table
where """
First approach
conditions = []
for row in df.iterrows():
    # Ignore the index
    data = row[1]
    conditions.append(f"a between '{data['col1']}' and '{data['col2']}'")
some_sql += '\nor '.join(conditions)
By using iterrows() we can iterate through the dataframe, row by row.
Alternative
some_sql += '\nor '.join(df.apply(lambda x: f"a between '{x['col1']}' and '{x['col2']}'", axis=1).tolist())
Using apply() should be faster than iterrows():
Although apply() also inherently loops through rows, it does so much
more efficiently than iterrows() by taking advantage of a number of
internal optimizations, such as using iterators in Cython.
source
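If you want to verify the difference on your own data, here is a quick, hedged timing sketch (the gap depends heavily on the size and dtypes of the dataframe):
import timeit

def via_iterrows():
    return [f"a between '{r['col1']}' and '{r['col2']}'" for _, r in df.iterrows()]

def via_apply():
    return df.apply(lambda x: f"a between '{x['col1']}' and '{x['col2']}'", axis=1).tolist()

# Time both approaches over 100 repetitions on the dummy dataframe above
print(timeit.timeit(via_iterrows, number=100))
print(timeit.timeit(via_apply, number=100))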
Another alternative
some_sql += '\nor '.join([f"a between '{row['col1']}' and '{row['col2']}'" for row in df.to_dict('records')])
This converts the dataframe to a list of dicts, and then applies a list comprehension to create the conditions.
Result
select column1,columns2
from table
where a between 'date1' and 'date2'
or a between 'date3' and 'date4'
or a between 'date5' and 'date6'
As a secondary note to Kristof's answer above: even as an analyst, one should be careful about things like SQL injection, so inlining data into the query string is something to be avoided.
If possible, you should define your query once with placeholders and then create a param list to go with them. This also saves on the formatting.
So in your case your query looks like:
some_sql = """
select column1,columns2
from table
where a between ? and ?
or a between ? and ?
or a between ? and ?
And our param list generation is going to look like:
params = []
for row in df.iterrows():
    # Ignore the index
    data = row[1]
    params.append(data['col1'])
    params.append(data['col2'])
Then execute your SQL, passing the params list to fill the placeholders.
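Putting the pieces together, here is a minimal sketch; it assumes a DB-API connection conn whose driver uses the qmark ('?') paramstyle (e.g. pyodbc or sqlite3), so adjust the placeholder style for your driver. The WHERE clause can also be generated from the dataframe, so you are not limited to a fixed number of pairs:
# Build as many "a between ? and ?" clauses as the dataframe has rows
some_sql = "select column1,columns2\nfrom table\nwhere "
some_sql += '\nor '.join(['a between ? and ?'] * len(df))

params = []
for row in df.iterrows():
    data = row[1]
    params.append(data['col1'])
    params.append(data['col2'])

cur = conn.cursor()  # `conn` is an assumed, already-open connection
cur.execute(some_sql, params)
rows = cur.fetchall()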
Related
SUMMARY:
How to query against values from different data frame columns with table.column_name combinations in SQL Alchemy using the OR_ statement.
I'm working on a SQL Alchemy project where I pull down valid columns of a dataframe and enter them all into SQL Alchemy's filter. I've successfully got it running where it would enter all entries of a column using the head of the column like this:
qry = qry.filter(or_(*[getattr(Query_Tbl,column_head).like(x) \
for x in (df[column_head].dropna().values)]))
This produced the pattern I was looking for of (tbl.column1 like a OR tbl.column1 like b...) AND- etc.
However, there are groups of dataframe columns that, although different, still need to be placed together within the same OR_ group,
i.e. (The desired result)
(tbl1.col1 like a OR tbl.col1 like b OR tbl.col2 like c OR tbl.col2 like d OR tbl.col3 like e...) etc.
My latest attempt was to sub-group the columns I needed grouped together, then repeat the previous style inside those groups like:
qry = qry.filter(or_((*[getattr(Query_Tbl, set_id[0]).like(x) \
for x in (df[set_id[0]].dropna().values)]),
(*[getattr(Query_Tbl, set_id[1]).like(y) \
for y in (df[set_id[1]].dropna().values)]),
(*[getattr(Query_Tbl, set_id[2]).like(z) \
for z in (df[set_id[2]].dropna().values)])
))
Here set_id is a list of 3 strings corresponding to column1, column2, and column3, so I get the designated results. However, this simply produces:
(What I'm actually getting)
(tbl.col1 like a OR tbl.col1 like b..) AND (tbl.col2 like c OR tbl.col2 like d...) AND (tbl.col3 like e OR...)
Is there a better way to go about this in SQL Alchemy to get the result I want, or would it be better to find a way of feeding column values from Pandas directly into getattr() to work it into my existing code?
Thank you for reading and in advance for your help!
It appears I was having issues with the way the dataframe was formatted, and I was reading column names into groups differently. This pattern works for anyone who wants to process multiple dataframe columns into the same OR statements.
I apologize for the confusion; if anyone has comments or questions on the subject, I will gladly help others with this type of issue.
Alternatively, I found a much cleaner answer. Since SQL Alchemy's OR_ function can be used with a variable column if you use Python's built-in getattr() function, you only need to create (column, value) pairs, which you can then unpack in a loop.
for group in [group_2, group_3]:
    set_id = list(set(df.columns.values) & set(group))
    if len(set_id) > 1:
        set_tuple = list()
        for column in set_id:
            for value in df[column].dropna().values:
                set_tuple.append((column, value))
        print(set_tuple)
        qry = qry.filter(or_(*[getattr(Query_Tbl, id).like(x) for id, x in set_tuple]))
    df = df.drop(group, axis=1)
If you know which columns need to be grouped into the same OR_ statement, you can put them into lists and iterate through them. Inside those, you create a list of tuples holding the (column, value) pairs you need. Then within the OR_ function you unpack the columns and values in a loop and assign them accordingly. The code is much easier to read and much more compact. I found this to be a more robust solution than explicitly writing out cases for the group sizes.
I have the following SQLAlchemy db query
test_query = Unit.query.filter(Unit.id_1.in_(('3D0U|1|A|G|1', '3D0U|1|A|C|160')))
I would like the result of this query to be a list of lists, with the rows corresponding to each element in the IN clause as a separate list. Currently I'm getting all the rows in a single list.
This is what I've tried
result = []
for row in test_query:
    result.append(row.id_2)
When I print out the results, this is what I get
["3D0X|1|A|C|160", "4ERJ|1|A|C|160", "4ERL|1|A|C|160", "3D0X|1|A|G|1", "4ERJ|1|A|G|1", "4ERL|1|A|G|1"]
The desired output is:
[["3D0X|1|A|C|160", "4ERJ|1|A|C|160", "4ERL|1|A|C|160"], ["3D0X|1|A|G|1", "4ERJ|1|A|G|1", "4ERL|1|A|G|1"]]
Sample data from the Unit table is shown below
"id_1","chain_1","pdb_id_1","id_2","chain_2","pdb_id_2"
"3D0U|1|A|G|1","A","3D0U","3D0X|1|A|G|1","A","3D0X"
"3D0U|1|A|G|1","A","3D0U","4ERJ|1|A|G|1","A","4ERJ"
"3D0U|1|A|G|1","A","3D0U","4ERL|1|A|G|1","A","4ERL"
"3D0U|1|A|C|160","A","3D0U","3D0X|1|A|C|160","A","3D0X"
"3D0U|1|A|C|160","A","3D0U","4ERJ|1|A|C|160","A","4ERJ"
"3D0U|1|A|C|160","A","3D0U","4ERL|1|A|C|160","A","4ERL"
Any help is appreciated. Thanks
This could be solved in the following way: first order by Unit.id_1 in the query, and then
result = [[unit.id_2 for unit in units] for id_1, units in itertools.groupby(test_query, lambda x: x.id_1)]
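Putting it together, here is a minimal sketch assuming the Unit model from the question; the order_by is required because itertools.groupby only groups consecutive items:
import itertools

# Order by id_1 so groupby sees the rows for each id_1 contiguously
test_query = (Unit.query
              .filter(Unit.id_1.in_(('3D0U|1|A|G|1', '3D0U|1|A|C|160')))
              .order_by(Unit.id_1))

result = [[unit.id_2 for unit in units]
          for id_1, units in itertools.groupby(test_query, lambda x: x.id_1)]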
In the case of PostgreSQL there is also sqlalchemy.func.array_agg, which can be used to construct an array of id_2 values grouped by id_1.
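A hedged sketch of that alternative, assuming the same Unit model and an available SQLAlchemy session:
from sqlalchemy import func

# array_agg collects the id_2 values for each id_1 into a PostgreSQL array
rows = (session.query(Unit.id_1, func.array_agg(Unit.id_2))
        .filter(Unit.id_1.in_(('3D0U|1|A|G|1', '3D0U|1|A|C|160')))
        .group_by(Unit.id_1)
        .all())
result = [id_2_list for _, id_2_list in rows]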
For reference, the Python documentation for itertools describes what itertools.groupby does.
I am working with a data frame that has a structure something like the following:
In[75]: df.head(2)
Out[75]:
statusdata participant_id association latency response \
0 complete CLIENT-TEST-1476362617727 seeya 715 dislike
1 complete CLIENT-TEST-1476362617727 welome 800 like
stimuli elementdata statusmetadata demo$gender demo$question2 \
0 Sample B semi_imp complete male 23
1 Sample C semi_imp complete female 23
I want to be able to run a query string against the column demo$gender.
I.e,
df.query("demo$gender=='male'")
But this has a problem with the $ sign. If I replace the $ sign with another delimiter (like -), the problem persists. Can I fix up my query string to avoid this problem? I would prefer not to rename the columns, as they correspond tightly to other parts of my application.
I really want to stick with a query string as it is supplied by another component of our tech stack and creating a parser would be a heavy lift for what seems like a simple problem.
Thanks in advance.
With the most recent version of pandas, you can escape a column name that contains special characters with a backtick (`):
df.query("`demo$gender` == 'male'")
Another possibility is to clean the column names as a previous step in your process, replacing special characters with something more appropriate.
For instance:
(df
.rename(columns = lambda value: value.replace('$', '_'))
.query("demo_gender == 'male'")
)
For the interested, here is a simple procedure I used to accomplish the task:
# Identify invalid column names
invalid_column_names = [x for x in list(df.columns.values) if not x.isidentifier()]

# Make replacements in the query and keep track
# NOTE: This method fails if the frame has columns called REPL_0 etc.
replacements = dict()
for cn in invalid_column_names:
    r = 'REPL_' + str(invalid_column_names.index(cn))
    query = query.replace(cn, r)
    replacements[cn] = r

inv_replacements = {replacements[k]: k for k in replacements.keys()}

df = df.rename(columns=replacements)  # Rename the columns
df = df.query(query)                  # Carry out query
df = df.rename(columns=inv_replacements)
This amounts to identifying the invalid column names, transforming the query, and renaming the columns. Finally, we perform the query and then translate the column names back.
Credit to #chrisb for their answer that pointed me in the right direction
The current implementation of query requires the string to be a valid python expression, so column names must be valid python identifiers. Your two options are renaming the column, or using a plain boolean filter, like this:
df[df['demo$gender'] == 'male']
I have a database table with multiple fields which I am querying and pulling out all data which meets certain parameters. I am using psycopg2 for python with the following syntax:
cur.execute("SELECT * FROM failed_inserts where insertid='%s' AND site_failure=True"%import_id)
failed_sites= cur.fetchall()
This returns the correct values as a list, with the data's integrity and order maintained. However, I want to use the returned list elsewhere in my application, and I only have this list of values; it is not a dictionary with the field names as keys for these values. Rather than having to do
desiredValue = failed_sites[13]  # where 13 is the arbitrary index for desiredValue
I want to be able to query by the field name like:
desiredValue = failed_sites[fieldName]  # where fieldName is the name of the field I am looking for
Is there a simple way and efficient way to do this?
Thank you!
cursor.description will give you the column information (http://www.python.org/dev/peps/pep-0249/#cursor-objects). You can get the column names from it and use them to create a dictionary.
cursor.execute('SELECT ...')

columns = []
for column in cursor.description:
    columns.append(column[0].lower())

failed_sites = {}
for row in cursor:
    for i in range(len(row)):
        failed_sites[columns[i]] = row[i]
        if isinstance(row[i], basestring):
            failed_sites[columns[i]] = row[i].strip()
The "Dictionary-like cursor", part of psycopg2.extras, seems what you're looking for.
Is it possible to do SELECT * in SQLAlchemy?
Specifically, SELECT * WHERE foo=1?
Is no one feeling the ORM love of SQLAlchemy today? The presented answers correctly describe the lower-level interface that SQLAlchemy provides. Just for completeness, this is the more-likely (for me) real-world situation where you have a session instance and a User class that is ORM mapped to the users table.
for user in session.query(User).filter_by(name='jack'):
    print(user)
    # ...
And this does an explicit select on all columns.
The following selection works for me in the core expression language (returning a RowProxy object):
foo_col = sqlalchemy.sql.column('foo')
s = sqlalchemy.sql.select(['*']).where(foo_col == 1)
If you don't list any columns, you get all of them.
query = users.select()
query = query.where(users.c.name == 'jack')
result = conn.execute(query)
for row in result:
    print(row)
Should work.
You can always use a raw SQL too:
str_sql = sql.text("YOUR STRING SQL")

# if you have some args:
args = {
    'myarg1': yourarg1,
    'myarg2': yourarg2}

# then call the execute method from your connection
results = conn.execute(str_sql, args).fetchall()
Where Bar is the class mapped to your table and session is your sa session:
bars = session.query(Bar).filter(Bar.foo == 1)
Turns out you can do:
sa.select('*', ...)
I had the same issue: I was trying to get all columns from a table back as a list instead of ORM objects, so that I could convert that list to a pandas dataframe and display it.
What works is to use .c on a subquery or cte as follows:
U = select(User).cte('U')
stmt = select(*U.c)
rows = session.execute(stmt)
Then you get a list of tuples with each column.
Another option is to use __table__.columns in the same way:
stmt = select(*User.__table__.columns)
rows = session.execute(stmt)
In case you want to convert the results to a dataframe, here is the one-liner:
pd.DataFrame.from_records(rows, columns=rows.keys())
For joins, if columns are not defined manually, only the columns of the target table are returned. To get all columns for a join (the User table joined with the Group table):
sql = User.select(from_obj(Group, User.c.group_id == Group.c.id))
# Add all columns of Group table to select
sql = sql.column(Group)
session.connection().execute(sql)
If you're using the ORM, you can build a query using the normal ORM constructs and then execute it directly to get raw column values:
query = session.query(User).filter_by(name='jack')
for cols in session.connection().execute(query):
    print(cols)
every_column = User.__table__.columns
records = session.query(*every_column).filter(User.foo==1).all()
When an ORM class is passed to the query function, e.g. query(User), the result will be composed of ORM instances. In the majority of cases, this is what the dev wants and is easiest to deal with, as demonstrated by the popularity of the answer above that corresponds to this approach.
In some cases, devs may instead want an iterable sequence of values. In these cases, one can pass the list of desired column objects to query(). This answer shows how to pass the entire list of columns without hardcoding them, while still working with SQLAlchemy at the ORM layer.
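As a small usage sketch under the same assumptions (an ORM-mapped User class and an existing session), each record comes back as a plain tuple of column values, which you can zip back up with the column names if needed:
every_column = User.__table__.columns
records = session.query(*every_column).filter(User.foo == 1).all()

# Rebuild dicts keyed by column name, e.g. to feed a pandas DataFrame
column_names = [c.name for c in every_column]
dicts = [dict(zip(column_names, record)) for record in records]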