Set Order of Columns in SqlAlchemy before writing to pandas dataframe - python

I'm trying to set the order of columns when building a table with SQLAlchemy, as of right now the columns are appearing in alphabetical order, I currently have:
def data_frame(query, columns):
def make_row(x):
return dict([(c, getattr(x, c)) for c in columns])
return pd.DataFrame([make_row(x) for x in query])
PackL = create_engine('mssql+pyodbc://u:pass#Server/db1?driver=SQL Server', echo=False)
Fabr = create_engine('mssql+pyodbc://u:pass#Server/db2?driver=SQL Server', echo=False)
Session = sessionmaker(bind=PackL)
session = Session()
Base = declarative_base()
metadata = MetaData()
class Tranv(Base):
__tablename__= "Transactions"
__table_args__= {'autoload': True, 'autoload_with':PackL}
newvarv = session.query(Tranv).filter_by(status='SCRAP').filter(Tranv.time_stamp.
between('2015-10-01', '2015-10-09'))
session.close()
dfx = data_frame(newvarv, ['action', 'employee_number', 'time_stamp', 'qty',
'part_number', 'card_number'])
Current dfx has the columns in alphabetical order, but I want it to order them by the order in which I define the columns when I create the data frame dfx. Therefore the order would be action, employee_number, time_stamp, qty, part_number, card_number. I can easily do this with Pandas, but that seems like extra (and unnecessary) steps.
I've searched the documentation, google, & stackoverflow but nothing really seems to fit my needs. As I'm still new with SQLAlchemy I appreciate any help. Am I right in thinking that because I'm autoloading the table, I can not easily define the order of my columns (I'm sure there is a workaround, but don't have a clue where in the documentation that might be found)?

The reason your columns are not in the order you specify, has nothing to do with the sql query or sqlalchemy. This is caused by the fact that you convert the query output to a dictionary, which you then feed to DataFrame.
As a dictionary has no order in python, pandas will sort it alphabetically to have a predictable output.
Using the current approach of the dict, you can always change the order of the columns afterwards by doing dfx.reindex(columns=['action', ..., 'card_number'])
Apart from the explanation why it is not ordered in your case, there are maybe better approaches to tackle this:
Use the builtin pd.read_sql_query. When working with sessions and Query objects, you can pass the selectable attribute to read_sql_query to convert it to a DataFrame:
query = session.query(Table)...
df = pd.read_sql_query(query.selectable, engine)
Do not convert to a dictionary, but keep the output as tuples which you feed to DataFrame: this will keep the order of the query output.

Related

Python SQLAlchemy ON DUPLICATE KEY UPDATE with multiple records

I'd like to use the ON DUPLICATE KEY UPDATE optionality provided by SQLAlchemy to upsert a bunch of records.
These records have been sucessfully inserted with python using the following (where connection is engine.connect() object and table is a Table object)
record_list = [{'col1': 'name1', 'col2': '2015-01-31', 'col3': 27.2},
{'col1': 'name1', 'col2': '2016-01-31', 'col3': 25.2}]
query = insert(table)
results = connection.execute(query, record_list)
Looking at the docs at https://docs.sqlalchemy.org/en/13/dialects/mysql.html#insert-on-duplicate-key-update-upsert as well as a number of SO questions (including the suggestion it's possible under the comments on SQLAlchemy ON DUPLICATE KEY UPDATE ) I've tried a number of different examples, but there were none that I could see that address multiple records with the upsert statement using this method.
I'm trying along the lines of
query = insert(table).values(record_list)
upsert_query = query.on_duplicate_key_update()
results = connection.execute(upsert_query)
but either get the issue that the .on_duplicate_key_update() requires cant be empty or that the SQL syntax is wrong.
If anyone has sucessfully managed and could help me with the code structure here I'd really appreciate it.
I just ran into a similar problem and creating a dictionary out of query.inserted solved it for me.
query = insert(table).values(record_list)
update_dict = {x.name: x for x in query.inserted}
upsert_query = query.on_duplicate_key_update(update_dict)
#user12730260’s answer is great! but has a little bug, the correct code is:
query = insert(table).values(record_list) # each record is a dict
update_dict = {x.name: x for x in query.inserted} # specify columns for update, u can filter some of it
upsert_query = query.on_duplicate_key_update(**update_dict) # here's the modification: u should expand the columns dict
Your on_duplicate_key_update function requires arguments that define the data to be inserted in the update. Please have a look at the example in the documentation that you have already found.
insert().on_duplicate_key_update({"key": "value"})

SQL Alchemy using Or_ looping multiple columns (Pandas Dataframes)

SUMMARY:
How to query against values from different data frame columns with table.column_name combinations in SQL Alchemy using the OR_ statement.
I'm working on a SQL Alchemy project where I pull down valid columns of a dataframe and enter them all into SQL Alchemy's filter. I've successfully got it running where it would enter all entries of a column using the head of the column like this:
qry = qry.filter(or_(*[getattr(Query_Tbl,column_head).like(x) \
for x in (df[column_head].dropna().values)]))
This produced the pattern I was looking for of (tbl.column1 like a OR tbl.column1 like b...) AND- etc.
However, there are groups of the dataframe that need to be placed together where the columns are different but still need to be placed within the OR_ category,
i.e. (The desired result)
(tbl1.col1 like a OR tbl.col1 like b OR tbl.col2 like c OR tbl.col2 like d OR tbl.col3 like e...) etc.
My latest attempt was to sub-group the columns I needed grouped together, then repeat the previous style inside those groups like:
qry = qry.filter(or_((*[getattr(Query_Tbl, set_id[0]).like(x) \
for x in (df[set_id[0]].dropna().values)]),
(*[getattr(Query_Tbl, set_id[1]).like(y) \
for y in (df[set_id[1]].dropna().values)]),
(*[getattr(Query_Tbl, set_id[2]).like(z) \
for z in (df[set_id[2]].dropna().values)])
))
Where set_id is a list of 3 strings corresponding to column1, column2, and column 3 so I get the designated results, however, this produces simply:
(What I'm actually getting)
(tbl.col1 like a OR tbl.col1 like b..) AND (tbl.col2 like c OR tbl.col2 like d...) AND (tbl.col3 like e OR...)
Is there a better way to go about this in SQL Alchemy to get the result I want, or would it better to find a way of implementing column values with Pandas directly into getattr() to work it into my existing code?
Thank you for reading and in advance for your help!
It appears I was having issues with the way the data-frame was formatted, and I was reading column names into groups differently. This pattern works for anyone who want to process multiple df columns into the same OR statements.
I apologize for the issue, if anyone has any comments or questions on the subject I will help others with this type of issue.
Alternatively, I found a much cleaner answer. Since SQL Alchemy's OR_ function can be used with a variable column if you use Python's built in getattr() function, you only need to create (column,value) pairs where by you can unpack both in a loop.
for group in [group_2, group_3]:
set_id = list(set(df.columns.values) & set(group))
if len(set_id) > 1:
set_tuple = list()
for column in set_id:
for value in df[column].dropna().values:
set_tuple.append((column, value))
print(set_tuple)
qry = qry.filter(or_(*[getattr(Query_Tbl,id).like(x) for id, x in set_tuple]))
df = df.drop(group, axis=1)
If you know what column need to be grouped in the Or_ statement, you can put them into lists and iterate through them. Inside those, you create a list of tuples where you create the (column, value) pairs you need. Then within the Or_ function you upact the column and values in a loop, and assign them accordingly. The code is must easier to read and much for compack. I found this to be a more robust solution than explicitly writing out cases for the group sizes.

Python Data Analysis from SQL Query

I'm about to start some Python Data analysis unlike anything I've done before. I'm currently studying numpy, but so far it doesn't give me insight on how to do this.
I'm using python 2.7.14 Anaconda with cx_Oracle to Query complex records.
Each record will be a unique individual with a column for Employee ID, Relationship Tuples (Relationship Type Code paired with Department number, may contain multiple), Account Flags (Flag strings, may contain multiple). (3 columns total)
so one record might be:
[(123456), (135:2345678, 212:4354670, 198:9876545), (Flag1, Flag2, Flag3)]
I need to develop a python script that will take these records and create various counts.
The example record would be counted in at least 9 different counts
How many with relationship: 135
How many with relationship: 212
How many with relationship: 198
How many in Department: 2345678
How many in Department: 4354670
How many in Department: 9876545
How many with Flag: Flag1
How many with Flag: Flag2
How many with Flag: Flag3
The other tricky part of this, is I can't pre-define the relationship codes, departments, or flags What I'm counting for has to be determined by the data retrieved from the query.
Once I understand how to do that, hopefully the next step to also get how many relationship X has Flag y, etc., will be intuitive.
I know this is a lot to ask about, but If someone could just point me in the right direction so I can research or try some tutorials that would be very helpful. Thank you!
At least you need to structurate this data to make a good analysis, you can do it in your database engine or in python (I will do it by this way, using pandas like SNygard suggested).
At first, I create some fake data(it was provided by you):
import pandas as pd
import numpy as np
from ast import literal_eval
data = [[12346, '(135:2345678, 212:4354670, 198:9876545)', '(Flag1, Flag2, Flag3)'],
[12345, '(136:2343678, 212:4354670, 198:9876541, 199:9876535)', '(Flag1, Flag4)']]
df = pd.DataFrame(data,columns=['id','relationships','flags'])
df = df.set_index('id')
df
This return a dataframe like this:
raw_pandas_dataframe
In order to summarize or count by columns, we need to improve our data structure, in some way that we can apply group by operations with department, relationships or flags.
We will convert our relationships and flags columns from string type to a python list of strings. So, the flags column will be a python list of flags, and the relationships column will be a python list of relations.
df['relationships'] = df['relationships'].str.replace('\(','').str.replace('\)','')
df['relationships'] = df['relationships'].str.split(',')
df['flags'] = df['flags'].str.replace('\(','').str.replace('\)','')
df['flags'] = df['flags'].str.split(',')
df
The result is:
dataframe_1
With our relationships column converted to list, we can create a new dataframe with as much columns
as relations in that lists we have.
rel = pd.DataFrame(df['relationships'].values.tolist(), index=rel.index)
After that we need to stack our columns preserving its index, so we will use pandas multi_index: the id and the relation column number(0,1,2,3)
relations = rel.stack()
relations.index.names = ['id','relation_number']
relations
We get: dataframe_2
At this moment we have all of our relations in rows, but still we can't group by using
relation_type feature. So we will split our relations data in two columns: relation_type and department using :.
clear_relations = relations.str.split(':')
clear_relations = pd.DataFrame(clear_relations.values.tolist(), index=clear_relations.index,columns=['relation_type','department'])
clear_relations
The result is
dataframe_3_clear_relations
Our relations are ready to analyze, but our flags structure still is very useless. So we will convert the flag list, to columns and after that we will stack them.
flags = pd.DataFrame(df['flags'].values.tolist(), index=rel.index)
flags = flags.stack()
flags.index.names = ['id','flag_number']
The result is dataframe_4_clear_flags
Voilá!, It's all ready to analyze!.
So, for example, how many relations from each type we have, and wich one is the biggest:
clear_relations.groupby('relation_type').agg('count')['department'].sort_values(ascending=False)
We get: group_by_relation_type
All code: Github project
If you're willing to consider other packages, take a look at pandas which is built on top of numpy. You can read sql statements directly into a dataframe, then filter.
For example,
import pandas
sql = '''SELECT * FROM <table> WHERE <condition>'''
df = pandas.read_sql(sql, <connection>)
# Your output might look like the following:
0 1 2
0 12346 (135:2345678, 212:4354670, 198:9876545) (Flag1, Flag2, Flag3)
1 12345 (136:2343678, 212:4354670, 198:9876545) (Flag1, Flag2, Flag4)
# Format your records into rows
# This part will take some work, and really depends on how your data is formatted
# Do you have repeated values? Are the records always the same size?
# Select only the rows where relationship = 125
rel_125 = df[df['Relationship'] = 125]
The pandas formatting is more in depth than fits in a Q&A, but some good resources are here: 10 Minutes to Pandas.
You can also filter the rows directly, though it may not be the most efficient. For example, the following query selects only the rows where a relationship starts with '212'.
df[df['Relationship'].apply(lambda x: any(y.startswith('212') for y in x))]

SQLAlchemy and multi-column case-insensitive query

Assuming we have a table consisting of column_1, column_2, ... , column_n and all of them are string fields. The conditions we are going to do case-insensitive query are stored in a dictionary d like d[column_1] = "Hello", which may or may not contains all columns. How can we do the query?
I checked the question Case Insensitive Flask-SQLAlchemy Query. It contains a lot of awesome answers, but none of them works if we do not know what conditions we have until runtime.
You would need to build the query looping through each key of the dictionary.
As you didn't give any code sample, I'm going to call the table model class TableModel and each column will be column_1, column_2, etc.
Something like this should work:
d = {'column_1': 'some_string', 'column_3': 'another_string'}
# skipping 'column_2' just to exemplify how every column is optional in the dictionary
my_query = TableModel.query
for k in d:
my_query = my_query.filter(getattr(TableModel, k).ilike(d[k]))
And that's about it. Afterwards you can use my_query as any other query, e.g., my_query.count() or my_query.all()

SELECT * in SQLAlchemy?

Is it possible to do SELECT * in SQLAlchemy?
Specifically, SELECT * WHERE foo=1?
Is no one feeling the ORM love of SQLAlchemy today? The presented answers correctly describe the lower-level interface that SQLAlchemy provides. Just for completeness, this is the more-likely (for me) real-world situation where you have a session instance and a User class that is ORM mapped to the users table.
for user in session.query(User).filter_by(name='jack'):
print(user)
# ...
And this does an explicit select on all columns.
The following selection works for me in the core expression language (returning a RowProxy object):
foo_col = sqlalchemy.sql.column('foo')
s = sqlalchemy.sql.select(['*']).where(foo_col == 1)
If you don't list any columns, you get all of them.
query = users.select()
query = query.where(users.c.name=='jack')
result = conn.execute(query)
for row in result:
print row
Should work.
You can always use a raw SQL too:
str_sql = sql.text("YOUR STRING SQL")
#if you have some args:
args = {
'myarg1': yourarg1
'myarg2': yourarg2}
#then call the execute method from your connection
results = conn.execute(str_sql,args).fetchall()
Where Bar is the class mapped to your table and session is your sa session:
bars = session.query(Bar).filter(Bar.foo == 1)
Turns out you can do:
sa.select('*', ...)
I had the same issue, I was trying to get all columns from a table as a list instead of getting ORM objects back. So that I can convert that list to pandas dataframe and display.
What works is to use .c on a subquery or cte as follows:
U = select(User).cte('U')
stmt = select(*U.c)
rows = session.execute(stmt)
Then you get a list of tuples with each column.
Another option is to use __table__.columns in the same way:
stmt = select(*User.__table__.columns)
rows = session.execute(stmt)
In case you want to convert the results to dataframe here is the one liner:
pd.DataFrame.from_records(rows, columns=rows.keys())
For joins if columns are not defined manually, only columns of target table are returned. To get all columns for joins(User table joined with Group Table:
sql = User.select(from_obj(Group, User.c.group_id == Group.c.id))
# Add all coumns of Group table to select
sql = sql.column(Group)
session.connection().execute(sql)
I had the same issue, I was trying to get all columns from a table as a list instead of getting ORM objects back. So that I can convert that list to pandas dataframe and display.
What works is to use .c on a subquery or cte as follows:
U = select(User).cte('U')
stmt = select(*U.c)
rows = session.execute(stmt)
Then you get a list of tuples with each column.
Another option is to use __table__.columns in the same way:
stmt = select(*User.__table__.columns)
rows = session.execute(stmt)
In case you want to convert the results to dataframe here is the one liner:
pd.DataFrame.from_records(dict(zip(r.keys(), r)) for r in rows)
If you're using the ORM, you can build a query using the normal ORM constructs and then execute it directly to get raw column values:
query = session.query(User).filter_by(name='jack')
for cols in session.connection().execute(query):
print cols
every_column = User.__table__.columns
records = session.query(*every_column).filter(User.foo==1).all()
When a ORM class is passed to the query function, e.g. query(User), the result will be composed of ORM instances. In the majority of cases, this is what the dev wants and will be easiest to deal with--demonstrated by the popularity of the answer above that corresponds to this approach.
In some cases, devs may instead want an iterable sequence of values. In these cases, one can pass the list of desired column objects to query(). This answer shows how to pass the entire list of columns without hardcoding them, while still working with SQLAlchemy at the ORM layer.

Categories

Resources