I was doing some reading on Google and in the SQLAlchemy documentation, but could not find any built-in functionality that could take a standard SQL-formatted table and transform it into a crosstab query like Microsoft Access does.
In the past, when using Excel and Microsoft Access, I have created "crosstab" queries. Below is the SQL code from an example:
TRANSFORM Min([Fixed Day-19_Month-8_142040].VoltageAPhase) AS MinOfVoltageAPhase
SELECT [Fixed Day-19_Month-8_142040].Substation, [Fixed Day-19_Month-8_142040].Feeder, [Fixed Day-19_Month-8_142040].MeterID
FROM [Fixed Day-19_Month-8_142040]
GROUP BY [Fixed Day-19_Month-8_142040].Substation, [Fixed Day-19_Month-8_142040].Feeder, [Fixed Day-19_Month-8_142040].MeterID
PIVOT [Fixed Day-19_Month-8_142040].Date;
I am very unskilled when it comes to SQL, and the only way I was able to write this was by generating it in Access.
My question is: since SQLAlchemy Python code is really just a nice way of calling or generating SQL using Python functions/methods, is there a way I could use SQLAlchemy to run a custom query like the SQL above to make a crosstab query? Obviously, I would have to change some of the SQL to shoehorn in the correct fields and names, but the keywords should be the same, right?
The other problem is that, in addition to returning the objects for each entry in the table, I would need the field names (I think this is called "metadata"?). The end goal is that, once I had that information, I would output it to Excel or CSV using another package.
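For what it's worth, raw SQL can generally be handed to SQLAlchemy via text() and executed through the existing session; whether the Access-specific TRANSFORM/PIVOT keywords are accepted depends entirely on the database driver behind the engine, so the sketch below is only illustrative. It reuses the table and column names from the example query, and result.keys() is what returns the field names (the "metadata") asked about:

from sqlalchemy import text

raw_sql = text("""
TRANSFORM Min([Fixed Day-19_Month-8_142040].VoltageAPhase) AS MinOfVoltageAPhase
SELECT [Fixed Day-19_Month-8_142040].Substation, [Fixed Day-19_Month-8_142040].Feeder, [Fixed Day-19_Month-8_142040].MeterID
FROM [Fixed Day-19_Month-8_142040]
GROUP BY [Fixed Day-19_Month-8_142040].Substation, [Fixed Day-19_Month-8_142040].Feeder, [Fixed Day-19_Month-8_142040].MeterID
PIVOT [Fixed Day-19_Month-8_142040].Date
""")

result = session.execute(raw_sql)  # session is the SQLAlchemy session used elsewhere in this post
field_names = result.keys()        # the column names ("metadata") of the result set
rows = result.fetchall()           # the row data itself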
UPDATED
Okay, so Van's suggestion to use pandas is, I think, the way to go. I'm currently in the process of figuring out how to create the crosstab:
def OnCSVfile(self, event):
    query = session.query(Exception).filter_by(company=self.company)
    data_frame = pandas.read_sql(query.statement, query.session.bind)  # get the data frame in pandas
    pivot = data_frame.crosstab()
So I have been reading the pandas link you provided and have a question about the parameters.
pandas.crosstab(index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins=False, dropna=True)
Since I'm calling "crosstab" off the DataFrame object, I assume there must be some kind of built-in way the DataFrame recognizes column and row names. For index, would I pass in a list of strings that specify which fields I want tabulated in rows? For columns, would I pass in a list of strings that specify which field I want along the columns? From what I know about crosstab queries, there should only be one specification field for the column, right? For values, I want the minimum function, so I would have to pass some parameter to return the minimum value. Currently searching for an answer.
So say I have the following fields in my flat data frame (my original SQL query):
Name, Date and Rank
And I want to pivot the data as follows:
Name = Row of Crosstab
Date = Column of Crosstab
Rank = Min Value of Crosstab
Would the function call be something like:
data_frame.crosstab(['Name'], ['Date'], values=['Rank'], aggfunc=min)
I tried this code below:
query = session.query(Exception)
data_frame = pandas.read_sql(query.statement,query.session.bind)
row_list = pandas.Series(['meter_form'])
col_list = pandas.Series(['company'])
print(row_list)
pivot = data_frame.crosstab(row_list,col_list)
But I get an error saying data_frame does not have the attribute crosstab.
I guess this might be too much new information for you at once. Nonetheless, I would approach it completely differently. I would basically use the pandas Python library to do all the tasks:
Retrieve the data: since you are using SQLAlchemy already, you can simply query the database for only the data you need (flat, without any CROSSTAB/PIVOT).
Transform: put it into a pandas.DataFrame. For example, like this:
import pandas as pd
query = session.query(FixedDay...)
df = pd.read_sql(query.statement, query.session.bind)
Pivot: call pivot = pd.crosstab(...) to create a pivot in memory; note that crosstab is a top-level pandas function, not a DataFrame method. See pd.crosstab for more information.
Export: save it to Excel/CSV using DataFrame.to_excel or DataFrame.to_csv.
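Put together, a minimal end-to-end sketch of those steps might look like the following. It assumes FixedDay is the mapped class for the flat table and that the query returns the columns from the Access example (Substation, Feeder, MeterID, Date, VoltageAPhase); writing the Excel file also needs an Excel writer engine such as openpyxl installed:

import pandas as pd

query = session.query(FixedDay)  # FixedDay: assumed mapped class for the flat table
df = pd.read_sql(query.statement, query.session.bind)

# crosstab is a top-level pandas function; aggfunc=min reproduces the Min(...) from the Access query
pivot = pd.crosstab(index=[df['Substation'], df['Feeder'], df['MeterID']],
                    columns=df['Date'],
                    values=df['VoltageAPhase'],
                    aggfunc=min)

pivot.to_excel('crosstab.xlsx')  # or pivot.to_csv('crosstab.csv')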
Related
I am new to Python and trying to write code where I will be doing data manipulations on separate dataframes. I want to automate the process so that I can pass an existing dataframe to a function, and the manipulations will happen within the function on each dataframe, step by step, separately. In SAS I can create macros to do this task; however, I am unable to find a solution in Python.
SAS code:
%macro forecast(scenario);
    data work.&scenario;
        set work.&scenario;
        rownum = _n_;
    run;
%mend;
%forecast(base);
%forecast(severe);
Here the input is two datasets, base and severe. The output will be two datasets, base and severe, with the relevant rownum column added to both.
If I try to do this in Python, I can do it for a single dataframe like below:
Python code:
import numpy as np
df_base['rownum'] = np.arange(len(df_base))
It will add a rownum column to my dataframe.
Now I want to do the same manipulation for the other existing dataframe, df_severe, as well, using a function (or some other technique) that works similarly to SAS macros. I have more manipulations to do within the function, so I would like to avoid doing them individually/separately.
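One way this could look in Python, purely as a sketch: the rownum step mirrors the line above, any further per-dataframe manipulations would go inside the same function, and the function name forecast is just borrowed from the SAS macro:

import numpy as np

def forecast(df):
    # Python counterpart of the SAS %forecast macro: work on a copy so the
    # caller's dataframe is not modified in place, add rownum, return the result.
    df = df.copy()
    df['rownum'] = np.arange(len(df))
    # ... any further per-dataframe manipulations go here ...
    return df

df_base = forecast(df_base)
df_severe = forecast(df_severe)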
I have a few small data frames that I'm outputting to Excel on one sheet. To make them fit better, I need to merge some cells in one table, but to write this in XlsxWriter I need to specify the data parameter. I want to keep the data that is already written in the left cell by the to_excel() bit of code. Is there a way to do this without having to specify the data parameter, or do I need to look up the value in the dataframe to put in there?
For example:
df.to_excel(writer, 'sheet') gives similar to the following output:
Then I want to merge across C:D for this table without having to specify what data should be there (because it is already in column C), using something like:
worksheet.merge_range('C1:D1', cell_format = fmat) etc.
to get below:
Is this possible? Or will I need to lookup the values in the dataframe?
You will need to look up the data from the dataframe. There is no way in XlsxWriter to write formatting on top of existing data. The data and formatting need to be written at the same time (apart from conditional formatting, which can't be used for merging anyway).
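A minimal sketch of that lookup-then-merge approach, assuming the dataframe is written with pandas' xlsxwriter engine and that the value destined for the merged C1:D1 range sits at df.iloc[0, 1]; the exact cell mapping depends on how the index and header are laid out, so adjust the lookup accordingly:

import pandas as pd

writer = pd.ExcelWriter('out.xlsx', engine='xlsxwriter')
df.to_excel(writer, sheet_name='sheet')         # df is the small dataframe from the question

workbook = writer.book
worksheet = writer.sheets['sheet']
fmat = workbook.add_format({'align': 'center'})

value = df.iloc[0, 1]                           # look up whatever is already sitting in C1
worksheet.merge_range('C1:D1', value, fmat)     # data and formatting written together

writer.close()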
I have a pandas data frame called customer_df with 11k rows. I am creating a SQL query to insert the values of the df into a table customer_details on Postgres.
My code right now is:
insert_string = 'insert into customer_details (col1,col2,col3,col4) values '
for i in range(len(customer_df)):
    insert_string = insert_string + '''('{0}','{1}','{2}','{3}'),'''.format(
        customer_df.iloc[i]['customer'], customer_df.iloc[i]['customerid'],
        customer_df.iloc[i]['loc'], customer_df.iloc[i]['pin'])
upload_to_postgres(insert_string)
This string finally gives me something like:
insert into customer_details (col1,col2,col3,col4) values ('John',23,'KA',560021),('Ram',67,'AP',642918),(values of cust 3) .... (values of cust 11k)
which is sent to postgres using the upload_to_postgres.
This process of creating the string takes around 30 seconds to 1 minute. Is there a better optimized way to reduce the time?
You can use pandas to create and read that df faster.
Here's a pandas cheat sheet and the library download link:
https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
https://pandas.pydata.org/pandas-docs/stable/getting_started/install.html
The multiple insert is not the right way.
Inserting row by row creates a transaction for each row, so the best solution is to do a bulk insert.
If you can create a file like a CSV and use the bulk COPY action, your performance increases by orders of magnitude. I have no example here, but I've found this article that explains the methodology well.
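As a rough sketch of the CSV-plus-COPY idea, assuming the connection is made with psycopg2 (the connection string and the exact behaviour of upload_to_postgres are not shown in the question, so treat the details as placeholders):

import io
import psycopg2

# Serialize the dataframe to an in-memory CSV, then stream it to Postgres with COPY.
buffer = io.StringIO()
customer_df[['customer', 'customerid', 'loc', 'pin']].to_csv(buffer, index=False, header=False)
buffer.seek(0)

conn = psycopg2.connect("dbname=... user=...")  # placeholder connection details
with conn, conn.cursor() as cur:
    cur.copy_expert(
        "COPY customer_details (col1, col2, col3, col4) FROM STDIN WITH (FORMAT csv)",
        buffer,
    )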
I have a dataset with a lot of fields, so I don't want to load all of it into a pd.DataFrame, but just the basic ones.
Sometimes, I would like to do some filtering upon loading, and I would like to apply the filter via the query or eval methods, which means that I need a query string in the form of, e.g., "PROBABILITY > 10 and DISTANCE <= 50", but these columns need to be loaded in the dataframe.
Is it possible to extract the column names from the query string in order to load them from the dataset?
I know some magic using regex is possible, but I'm sure that it would break sooner or later, as the conditions get complicated.
So, I'm asking if there is a native pandas way to extract the column names from the query string.
I think you can use the usecols parameter when you load your dataframe. I use it when I load a CSV; I don't know whether that is possible when you load from SQL or another format.
columns_to_use = ['Column1', 'Column3']
pd.read_csv(..., usecols=columns_to_use)
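As a small usage sketch of that idea, assuming the data lives in a CSV and that the columns referenced by the filter are known up front (deriving them automatically from the query string is the part the question leaves open; the file name is a placeholder):

import pandas as pd

cols = ['PROBABILITY', 'DISTANCE']                    # columns referenced by the filter string
df = pd.read_csv('data.csv', usecols=cols)            # 'data.csv' is a placeholder path
filtered = df.query("PROBABILITY > 10 and DISTANCE <= 50")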
Thank you
I want to store a dataframe into an existing MSSQL table.
The dataframe has 3 columns, but the SQL table has only 2.
How is it possible to store just the 2 columns with the same names into SQL?
I tried the following code:
df.to_sql(sTable, engine, if_exists='append')
It works if the number and names of the columns are exactly the same, but I want to make my code more generic.
Create a dataframe with the right schema in the first place:
sql_df = df[['colA', 'colB']]
sql_df.to_sql(sTable, engine, if_exists='append')
Pandas ought to be pretty memory-efficient with this, meaning that the columns won't actually get duplicated, they'll just be referenced by sql_df. You could even rename columns to make this work.
A super generic way to accomplish this might look like:
def save_to_sql(df, col_map, table, engine):
    sql_df = pd.DataFrame()
    for old_name, new_name in col_map:
        sql_df[new_name] = df[old_name]
    sql_df.to_sql(table, engine, if_exists='append')
Which takes the dataframe and a list that pairs which columns to use with what they should be called to make them line up with the SQL table. E.g., save_to_sql(my_df, [('colA', 'column_a'), ('colB', 'column_b')], 'table_name', sql_engine)
That's a good solution. Now I'm also able to convert header names to SQL field names. The only topic I still have to solve is the index. DataFrames have an index (from 0...n), and I don't need that field in the DB, but I have not found a way to skip the index column when uploading to the SQL DB.
Does somebody have an idea?
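For what it's worth, to_sql has an index argument that controls whether the dataframe index is written as a column, so the index can be skipped like this:

sql_df.to_sql(sTable, engine, if_exists='append', index=False)  # index=False leaves the 0...n index out of the table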