SUMMARY:
How to query against values from different data frame columns with table.column_name combinations in SQL Alchemy using the OR_ statement.
I'm working on a SQL Alchemy project where I pull down valid columns of a dataframe and enter them all into SQL Alchemy's filter. I've successfully got it running where it would enter all entries of a column using the head of the column like this:
qry = qry.filter(or_(*[getattr(Query_Tbl,column_head).like(x) \
for x in (df[column_head].dropna().values)]))
This produced the pattern I was looking for of (tbl.column1 like a OR tbl.column1 like b...) AND- etc.
However, there are groups of the dataframe that need to be placed together where the columns are different but still need to be placed within the OR_ category,
i.e. (The desired result)
(tbl1.col1 like a OR tbl.col1 like b OR tbl.col2 like c OR tbl.col2 like d OR tbl.col3 like e...) etc.
My latest attempt was to sub-group the columns I needed grouped together, then repeat the previous style inside those groups like:
qry = qry.filter(or_((*[getattr(Query_Tbl, set_id[0]).like(x) \
for x in (df[set_id[0]].dropna().values)]),
(*[getattr(Query_Tbl, set_id[1]).like(y) \
for y in (df[set_id[1]].dropna().values)]),
(*[getattr(Query_Tbl, set_id[2]).like(z) \
for z in (df[set_id[2]].dropna().values)])
))
Where set_id is a list of 3 strings corresponding to column1, column2, and column 3 so I get the designated results, however, this produces simply:
(What I'm actually getting)
(tbl.col1 like a OR tbl.col1 like b..) AND (tbl.col2 like c OR tbl.col2 like d...) AND (tbl.col3 like e OR...)
Is there a better way to go about this in SQL Alchemy to get the result I want, or would it better to find a way of implementing column values with Pandas directly into getattr() to work it into my existing code?
Thank you for reading and in advance for your help!
It appears I was having issues with the way the data-frame was formatted, and I was reading column names into groups differently. This pattern works for anyone who want to process multiple df columns into the same OR statements.
I apologize for the issue, if anyone has any comments or questions on the subject I will help others with this type of issue.
Alternatively, I found a much cleaner answer. Since SQL Alchemy's OR_ function can be used with a variable column if you use Python's built in getattr() function, you only need to create (column,value) pairs where by you can unpack both in a loop.
for group in [group_2, group_3]:
set_id = list(set(df.columns.values) & set(group))
if len(set_id) > 1:
set_tuple = list()
for column in set_id:
for value in df[column].dropna().values:
set_tuple.append((column, value))
print(set_tuple)
qry = qry.filter(or_(*[getattr(Query_Tbl,id).like(x) for id, x in set_tuple]))
df = df.drop(group, axis=1)
If you know what column need to be grouped in the Or_ statement, you can put them into lists and iterate through them. Inside those, you create a list of tuples where you create the (column, value) pairs you need. Then within the Or_ function you upact the column and values in a loop, and assign them accordingly. The code is must easier to read and much for compack. I found this to be a more robust solution than explicitly writing out cases for the group sizes.
Related
I'm trying to filter a Pandas dataframe based on a set of or conditions, but they're all very similar, and I'm wondering if there's a more efficient way to write this.
Specifically, I want to include rows from the dataframe (df) where any of a set of variables is 1:
df.query("Q50r5==1 or Q50r6==1 or Q50r7==1 or Q50r8==1 or Q50r9==1 or Q50r10==1 or Q50r11==1")
This filters correctly to rows where any of these variables is 1.
However, I expect to have a lot more situations where I need to filter my dataframe to something similar, e.g.:
df.query("Q20r1==1 or Q20r2==1 or Q20r3==1")
df.query("Q23r2==1 or Q23r5==1 or Q23r7==1 or Q23r8==1")
The pandas documentation on .query() doesn't specify any more than that you can use and and or like you can elsewhere in Python, so it's possible this is the only way to do this query, but is there some kind of sum or count I could do across these columns within the query? Something like "any(1,Q20r1,Q20r2,Q20r3)" or "sum(Q20r1,Q20r2,Q20r3)>0"?
EDIT: For example, using this small dataframe:
I would want to retrieve ID #s 1,2,4,5,7 and exclude #s 3 and 6, because 3 and 6 do not have any 1's across the columns I'm referring to.
You can use any with axis = 1 to check that at least one value is True in a row.
For example, you can run
df[(df[["Q20r1", "Q20r2", "Q20r3"]] == 1).any(axis = 1)]
I'm using from_statement() function do complex queries that requires multiple subtables which works fine apart from one thing, that you have to specify column names in query() calls.
e.g. If you want to get columns A and B from the following pseudo query
SELECT t1.colA AS A, t2.colB AS B FROM
(
SELECT
sth AS colA
FROM
(
...
) t1
...
then you'll have to specify A and B in your query()
session.query('A', 'B').from_statement(above_statement).all()
which can get annoying when you want to get say more than 5 columns from your query.
Is there some way to just make SQLAlchemy return whatever it gets from the query?
I've tried query('*') or just query() but neither works.
I think the easiest way would be to use execute():
session.execute(above_statement).fetchall()
If you still want to use query(), then you can probably put all of your columns into a list and then unpack them:
cols = ['A', 'B']
session.query(*cols).from_statement(above_statement).all()
I'm about to start some Python Data analysis unlike anything I've done before. I'm currently studying numpy, but so far it doesn't give me insight on how to do this.
I'm using python 2.7.14 Anaconda with cx_Oracle to Query complex records.
Each record will be a unique individual with a column for Employee ID, Relationship Tuples (Relationship Type Code paired with Department number, may contain multiple), Account Flags (Flag strings, may contain multiple). (3 columns total)
so one record might be:
[(123456), (135:2345678, 212:4354670, 198:9876545), (Flag1, Flag2, Flag3)]
I need to develop a python script that will take these records and create various counts.
The example record would be counted in at least 9 different counts
How many with relationship: 135
How many with relationship: 212
How many with relationship: 198
How many in Department: 2345678
How many in Department: 4354670
How many in Department: 9876545
How many with Flag: Flag1
How many with Flag: Flag2
How many with Flag: Flag3
The other tricky part of this, is I can't pre-define the relationship codes, departments, or flags What I'm counting for has to be determined by the data retrieved from the query.
Once I understand how to do that, hopefully the next step to also get how many relationship X has Flag y, etc., will be intuitive.
I know this is a lot to ask about, but If someone could just point me in the right direction so I can research or try some tutorials that would be very helpful. Thank you!
At least you need to structurate this data to make a good analysis, you can do it in your database engine or in python (I will do it by this way, using pandas like SNygard suggested).
At first, I create some fake data(it was provided by you):
import pandas as pd
import numpy as np
from ast import literal_eval
data = [[12346, '(135:2345678, 212:4354670, 198:9876545)', '(Flag1, Flag2, Flag3)'],
[12345, '(136:2343678, 212:4354670, 198:9876541, 199:9876535)', '(Flag1, Flag4)']]
df = pd.DataFrame(data,columns=['id','relationships','flags'])
df = df.set_index('id')
df
This return a dataframe like this:
raw_pandas_dataframe
In order to summarize or count by columns, we need to improve our data structure, in some way that we can apply group by operations with department, relationships or flags.
We will convert our relationships and flags columns from string type to a python list of strings. So, the flags column will be a python list of flags, and the relationships column will be a python list of relations.
df['relationships'] = df['relationships'].str.replace('\(','').str.replace('\)','')
df['relationships'] = df['relationships'].str.split(',')
df['flags'] = df['flags'].str.replace('\(','').str.replace('\)','')
df['flags'] = df['flags'].str.split(',')
df
The result is:
dataframe_1
With our relationships column converted to list, we can create a new dataframe with as much columns
as relations in that lists we have.
rel = pd.DataFrame(df['relationships'].values.tolist(), index=rel.index)
After that we need to stack our columns preserving its index, so we will use pandas multi_index: the id and the relation column number(0,1,2,3)
relations = rel.stack()
relations.index.names = ['id','relation_number']
relations
We get: dataframe_2
At this moment we have all of our relations in rows, but still we can't group by using
relation_type feature. So we will split our relations data in two columns: relation_type and department using :.
clear_relations = relations.str.split(':')
clear_relations = pd.DataFrame(clear_relations.values.tolist(), index=clear_relations.index,columns=['relation_type','department'])
clear_relations
The result is
dataframe_3_clear_relations
Our relations are ready to analyze, but our flags structure still is very useless. So we will convert the flag list, to columns and after that we will stack them.
flags = pd.DataFrame(df['flags'].values.tolist(), index=rel.index)
flags = flags.stack()
flags.index.names = ['id','flag_number']
The result is dataframe_4_clear_flags
Voilá!, It's all ready to analyze!.
So, for example, how many relations from each type we have, and wich one is the biggest:
clear_relations.groupby('relation_type').agg('count')['department'].sort_values(ascending=False)
We get: group_by_relation_type
All code: Github project
If you're willing to consider other packages, take a look at pandas which is built on top of numpy. You can read sql statements directly into a dataframe, then filter.
For example,
import pandas
sql = '''SELECT * FROM <table> WHERE <condition>'''
df = pandas.read_sql(sql, <connection>)
# Your output might look like the following:
0 1 2
0 12346 (135:2345678, 212:4354670, 198:9876545) (Flag1, Flag2, Flag3)
1 12345 (136:2343678, 212:4354670, 198:9876545) (Flag1, Flag2, Flag4)
# Format your records into rows
# This part will take some work, and really depends on how your data is formatted
# Do you have repeated values? Are the records always the same size?
# Select only the rows where relationship = 125
rel_125 = df[df['Relationship'] = 125]
The pandas formatting is more in depth than fits in a Q&A, but some good resources are here: 10 Minutes to Pandas.
You can also filter the rows directly, though it may not be the most efficient. For example, the following query selects only the rows where a relationship starts with '212'.
df[df['Relationship'].apply(lambda x: any(y.startswith('212') for y in x))]
I'm loading some data from SQL database to Python, but I need to apply some criteria from Python Dataframe, to be simplified, see example below:
some_sql = """
select column1,columns2
from table
where a between '{}' and '{}'
or a between '{}' and '{}'
or a between '{}' and '{}'
""".format(date1,date2,date3,date4,date5,date6)
date1,date2,date3,date4,date5,date6 are sourced from Python Dataframe. I can manually specify all 6 parameters, but I do have over 20 in fact...
df = DataFrame({'col1':['date1','date3','date5'],
'col2':['date2','date4','date6']})
is there a way I am able to do a loop here to be more efficient
Setup
# Create a dummy dataframe
df = pd.DataFrame({'col1':['date1','date3','date5'],
'col2':['date2','date4','date6']})
# Prepare the SQL (conditions will be added later)
some_sql = """
select column1,columns2
from table
where """
First approach
conditions = []
for row in df.iterrows():
# Ignore the index
data = row[1]
conditions.append(f"or a between '{data['col1']}' and '{data['col2']}'")
some_sql += '\n'.join(conditions)
By using iterrows() we can iterate through the dataframe, rows by row.
Alternative
some_sql += '\nor '.join(df.apply(lambda x: f"a between '{x['col1']}' and '{x['col2']}'", axis=1).tolist())
Using apply() should be faster that iterrows():
Although apply() also inherently loops through rows, it does so much
more efficiently than iterrows() by taking advantage of a number of
internal optimizations, such as using iterators in Cython.
source
Another alternative
some_sql += '\nor '.join([f"a between '{row['col1']}' and '{row['col2']}'" for row in df.to_dict('records')])
This converts the dataframe to a list of dicts, and then applies a list comprehension to create the conditions.
Result
select column1,columns2
from table
where a between 'date1' and 'date2'
or a between 'date3' and 'date4'
or a between 'date5' and 'date6'
As a secondary note to Kristof's answer above, I would note that even as an analyst one should probably be careful about things like SQL injection, so inlining data is something to be avoided.
If possible you should define your query once with placeholders and then create a param list to go with the placeholders. This also saves on the formatting too.
So in your case your query looks like:
some_sql = """
select column1,columns2
from table
where a between ? and ?
or a between ? and ?
or a between ? and ?
And our param list generation is going to look like:
conditions = []
for row in df.iterrows():
# Ignore the index
data = row[1]
conditions.append(data['col1'])
conditions.append(data['col2'])
Then execute your SQL with placeholder syntax and params list as placeholders.
I am working with a data frame that has a structure something like the following:
In[75]: df.head(2)
Out[75]:
statusdata participant_id association latency response \
0 complete CLIENT-TEST-1476362617727 seeya 715 dislike
1 complete CLIENT-TEST-1476362617727 welome 800 like
stimuli elementdata statusmetadata demo$gender demo$question2 \
0 Sample B semi_imp complete male 23
1 Sample C semi_imp complete female 23
I want to be able to run a query string against the column demo$gender.
I.e,
df.query("demo$gender=='male'")
But this has a problem with the $ sign. If I replace the $ sign with another delimited (like -) then the problem persists. Can I fix up my query string to avoid this problem. I would prefer not to rename the columns as these correspond tightly with other parts of my application.
I really want to stick with a query string as it is supplied by another component of our tech stack and creating a parser would be a heavy lift for what seems like a simple problem.
Thanks in advance.
With the most recent version of pandas, you can esscape a column's name that contains special characters with a backtick (`)
df.query("`demo$gender` == 'male'")
Another possibility is clean the columns names as a previous step in your process, replacing special characters by some other more appropriate.
For instance:
(df
.rename(columns = lambda value: value.replace('$', '_'))
.query("demo_gender == 'male'")
)
For the interested here is a simple proceedure I used to accomplish the task:
# Identify invalid column names
invalid_column_names = [x for x in list(df.columns.values) if not x.isidentifier() ]
# Make replacements in the query and keep track
# NOTE: This method fails if the frame has columns called REPL_0 etc.
replacements = dict()
for cn in invalid_column_names:
r = 'REPL_'+ str(invalid_column_names.index(cn))
query = query.replace(cn, r)
replacements[cn] = r
inv_replacements = {replacements[k] : k for k in replacements.keys()}
df = df.rename(columns=replacements) # Rename the columns
df = df.query(query) # Carry out query
df = df.rename(columns=inv_replacements)
Which amounts to identifying the invalid column names, transforming the query and renaming the columns. Finally we perform the query and then translate the column names back.
Credit to #chrisb for their answer that pointed me in the right direction
The current implementation of query requires the string to be a valid python expression, so column names must be valid python identifiers. Your two options are renaming the column, or using a plain boolean filter, like this:
df[df['demo$gender'] =='male']