python dataframe to_sql with different schema

I want to store a dataframe into an existing MSSQL table.
The dataframe has 3 columns, but the SQL table has only 2.
How can I store just the 2 columns whose names match the SQL table?
I tried the following code:
df.to_sql(sTable, engine, if_exists='append')
It works if the number and names of the columns are exactly the same, but I want to make my code more generic.

Create a dataframe with the right schema in the first place:
sql_df = df[['colA', 'colB']]
sql_df.to_sql(sTable, engine, if_exists='append')
Pandas ought to be pretty memory-efficient with this, meaning the columns won't actually get duplicated; they'll just be referenced by sql_df. You could even rename columns to make this work, as sketched below.
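For instance, a rough sketch of the rename route (the target column names here are hypothetical; adjust them to whatever the SQL table actually uses):
sql_df = df[['colA', 'colB']].rename(columns={'colA': 'column_a', 'colB': 'column_b'})
sql_df.to_sql(sTable, engine, if_exists='append')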
A super generic way to accomplish this might look like:
import pandas as pd

def save_to_sql(df, col_map, table, engine):
    sql_df = pd.DataFrame()
    for old_name, new_name in col_map:
        sql_df[new_name] = df[old_name]
    sql_df.to_sql(table, engine, if_exists='append')
Which takes the dataframe and a list that pairs which columns to use with what they should be called to make them line up with the SQL table. E.g., save_to_sql(my_df, [('colA', 'column_a'), ('colB', 'column_b')], 'table_name', sql_engine)

That's a good solution. Now I'm also able to convert header names to SQL field names. The only thing I still have to solve is the index. DataFrames have an index (from 0...n). I don't need that field in the DB, but I have not found a way to skip the index column when uploading to the SQL DB.
Does somebody have an idea?
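For what it's worth, DataFrame.to_sql takes an index argument that controls exactly this. A minimal sketch of the helper above with the index left out (same hypothetical column map as before):
def save_to_sql(df, col_map, table, engine):
    sql_df = pd.DataFrame()
    for old_name, new_name in col_map:
        sql_df[new_name] = df[old_name]
    # index=False tells pandas not to write the DataFrame's 0..n index as a column
    sql_df.to_sql(table, engine, if_exists='append', index=False)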

Related

Running a mysql query on each value in a column in DASK

I have a CSV file that contains a user-id. This CSV file is imported as a dask-dataframe.
Once it's in a dataframe, I need to take the user-id for each entry in the id column, run a SQL query to fetch the username for that user-id, and add it to the dataframe as a new column. I have a few such columns that need fetching.
I am unsure what the Dask way of running select queries against a value in a Dask dataframe is. How would I go about it? I don't just want to go the imperative route and solve it with a for-loop.
This isn't a full answer, but I can't comment yet.
Running multiple queries in a loop is quite inefficient; it would be better to run a single query to get all of the user-id/username pairs from your database into another dataframe, then use Dask's merge method to join the two dataframes on the user_id column.
https://docs.dask.org/en/latest/dataframe-joins.html
I'm not very experienced with Dask (most of my experience is with Pandas), so there may be a bit more to it than this, but something along these lines:
import dask.dataframe as dd
import pandas as pd
# my_db_connection using whatever database connector you happen to be using
dask_df = dd.read_csv("your_csv_file.csv")
user_df = pd.read_sql("""
SELECT user_id, username
FROM user_table
""", con=my_db_connection
)
# Assuming both dataframes use "user_id" as the column name,
# if not use right_on and left_on arguments
merged_df = dask_df.merge(user_df, how="left", on="user_id")
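One follow-up worth noting (an assumption on my part about how you'd consume the result): Dask evaluates lazily, so the merge isn't actually executed until you materialize it, for example:
result = merged_df.compute()  # collects the merged data into a regular pandas DataFrame
# or stream it back out to disk without collecting everything in memory:
# merged_df.to_csv("merged_*.csv", index=False)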

Pandas dataframe.to_sql cuts part of text value when inserting many rows

I'm inserting a large dataframe into my PostgreSQL table using pandas' to_sql function. The problem is that when I have a long text value in one of the columns, it gets cut off.
So instead of:
"3432711,3432712,3432713,3432715,3432716,3432718,3432719,3432720,3432721,3432722,3432724,3432725,3432726,3432727,3432729,3432730,3432731,3432732,3432733,3432734,3432736,3432737,3432739,3432740,3432741,3432742,3432743,3432744,3432745,3432746,3432747,3432748,3432749,3432750,3432751,3842395,3842396,3842397"
it only inserts:
"3432711,3432712,3432713,3432715,3432716,3432718,3432719,3432720,3432721,3432722,3432724,3432725,3432726,3432727,3432729,3432730,3432731,3432732,3432733,3432734,3432736,3432737,3432739,3432740,3432741,3432742,3432743,3432744,3432745,3432746,3432747,3432748,"
When I try to insert just one row of the dataframe containing this data, everything is fine and the whole text value is inserted.
This is part of the code that I use:
df = pd.read_sql(sql_statement, connection)
engine = create_engine('postgresql://postgres:xxx@localhost:xxxx/' + dbName)
df.to_sql(tableName, engine, if_exists='append')
Any ideas why it is happening and how to fix it?

pyspark: join using schema? Or converting the schema to a list?

I am using the following code to join two data frames:
new_df = df_1.join(df_2, on=['field_A', 'field_B', 'field_C'], how='left_outer')
The above code works fine, but sometimes df_1 and df_2 have hundreds of columns. Is it possible to join using the schema instead of manually adding all the columns? Or is there a way that I can transform the schema into a list? Thanks a lot!
You can't join on a schema, if what you meant was somehow having join incorporate the column dtypes. What you can do is extract the column names first, then pass them as the list argument for on=, like this:
join_cols = df_1.columns
df_1.join(df_2, on=join_cols, how='left_outer')
Now obviously you will have to edit the contents of join_cols to make sure it only has the names you actually want to join df_1 and df_2 on. But if there are hundreds of valid columns, that is probably much faster than adding them one by one. You could also make join_cols an intersection of df_1's and df_2's columns, then edit from there if that's more suitable.
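For instance, a rough sketch of the intersection approach (assuming the shared column names are exactly the ones you want to join on):
join_cols = [c for c in df_1.columns if c in df_2.columns]
new_df = df_1.join(df_2, on=join_cols, how='left_outer')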
Edit: I should add that the Spark 2.0 release is literally any day now, and I haven't versed myself in all the changes yet. So that might be worth looking into as well, or it may provide a future solution.

Column to Transaction ID for association rules on dataframes with Pandas in Python

I imported a CSV into Python with Pandas, and I would like to be able to use one of the columns as a transaction ID in order to make association rules.
(link: https://github.com/antonio1695/Python/blob/master/nearBPO/facturas.csv)
I hope someone can help me to:
Use UUID as a transaction ID for me to have a dataframe like the following:
UUID Desc
123ex Meat,Beer
In order for me to get association rules like: {Meat} => {Beer}.
Also, a recommendation on a library to do so in a simple way would be appreciated.
Thank you for your time.
You can aggregate values into a list by doing the following:
df.groupby('UUID')['Desc'].apply(list)
This will give you what you want. If you want the UUID back as a column, you can call reset_index on the above:
df.groupby('UUID')['Desc'].apply(list).reset_index()
Also, a Series can still be exported to a CSV the same way as a DataFrame:
df.groupby('UUID')['Desc'].apply(list).to_csv(your_path)
You may need to name your index prior to exporting, or, if you find it easier, just call reset_index to restore the index back as a column and then call to_csv.
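For illustration, a small self-contained sketch with made-up data shaped like the UUID/Desc example in the question:
import pandas as pd

# hypothetical transactions: two items for UUID '123ex', one for '456ab'
df = pd.DataFrame({'UUID': ['123ex', '123ex', '456ab'],
                   'Desc': ['Meat', 'Beer', 'Bread']})

baskets = df.groupby('UUID')['Desc'].apply(list).reset_index()
# baskets now has one row per UUID:
#   123ex -> ['Meat', 'Beer']
#   456ab -> ['Bread']
Each row's Desc list is then one transaction/basket, which is the shape association-rule mining generally expects.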

Creating a Cross Tab Query in SQL Alchemy

I did some reading on Google and in the SQLAlchemy documentation, but could not find any kind of built-in functionality that could take a standard SQL-formatted table and transform it into a cross tab query like Microsoft Access does.
In the past, when using Excel and Microsoft Access, I have created "cross tab" queries. Below is the SQL from an example:
TRANSFORM Min([Fixed Day-19_Month-8_142040].VoltageAPhase) AS MinOfVoltageAPhase
SELECT [Fixed Day-19_Month-8_142040].Substation, [Fixed Day-19_Month-8_142040].Feeder, [Fixed Day-19_Month-8_142040].MeterID
FROM [Fixed Day-19_Month-8_142040]
GROUP BY [Fixed Day-19_Month-8_142040].Substation, [Fixed Day-19_Month-8_142040].Feeder, [Fixed Day-19_Month-8_142040].MeterID
PIVOT [Fixed Day-19_Month-8_142040].Date;
I am very unskilled when it comes to SQL, and the only way I was able to write this was by generating it in Access.
My question is: since SQLAlchemy is really just a nice way of calling or generating SQL using Python functions/methods, is there a way I could use SQLAlchemy to call a custom query that generates the SQL (in the block above) to make a cross tab query? Obviously, I would have to change some of the SQL to shoehorn it in with the correct fields and names, but the keywords should be the same, right?
The other problem is that, in addition to returning the objects for each entry in the table, I would need the field names. I think this is called "metadata"? The end goal is that, once I have that information, I want to output it to Excel or CSV using another package.
UPDATED
Okay, so Van's suggestion to use pandas is, I think, the way to go. I'm currently in the process of figuring out how to create the cross tab:
def OnCSVfile(self, event):
    query = session.query(Exception).filter_by(company = self.company)
    data_frame = pandas.read_sql(query.statement, query.session.bind)  ## Get data frame in pandas
    pivot = data_frame.crosstab()
So I have been reading the pandas link you provided and have a question about the parameters.
pandas.crosstab(index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins=False, dropna=True)
Since I'm calling "crosstab" off the dataframe object, I assume there must be some kind of built-in way the dataframe recognizes column and row names. For index, would I pass in a list of strings that specify which fields I want tabulated in rows? For columns, would I pass in a list of strings that specify which field I want along the columns? From what I know about cross tab queries, there should only be one specification field for the column, right? For values, I want the minimum function, so I would have to pass some parameter to return the minimum value. Currently searching for an answer.
So say I have the following fields in my flat data frame (from my original SQL query):
Name, Date and Rank
And I want to pivot the data as follows:
Name = Row of Crosstab
Date = Column of Crosstab
Rank = Min Value of Crosstab
Would the function call be something like:
data_frame.crosstab(['Name'], ['Date'], values=['Rank'],aggfunc = min)
I tried this code below:
query = session.query(Exception)
data_frame = pandas.read_sql(query.statement,query.session.bind)
row_list = pandas.Series(['meter_form'])
col_list = pandas.Series(['company'])
print row_list
pivot = data_frame.crosstab(row_list,col_list)
But I get an error about data_frame not having the attribute crosstab.
I guess this might be too much new information for you at once. Nonetheless, I would approach it completely differently. I would basically use the pandas Python library to do all the tasks:
Retrieve the data: since you are using SQLAlchemy already, you can simply query the database for only the data you need (flat, without any CROSSTAB/PIVOT).
Transform: put it into a pandas.DataFrame. For example, like this:
import pandas as pd
query = session.query(FixedDay...)
df = pd.read_sql(query.statement, query.session.bind)
Pivot: call pivot = pd.crosstab(...) to create a pivot in memory. See pd.crosstab for more information.
Export: save it to Excel/CSV using DataFrame.to_excel.
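Tying this back to the Name/Date/Rank example above, a rough sketch of the pivot step (note that crosstab is a top-level pandas function, not a DataFrame method, which is what the AttributeError above is complaining about; the output file name is just a placeholder):
import pandas as pd

# df is the flat frame read via pd.read_sql above, with Name, Date and Rank columns
pivot = pd.crosstab(index=df['Name'], columns=df['Date'],
                    values=df['Rank'], aggfunc='min')
pivot.to_excel('crosstab.xlsx')  # or pivot.to_csv('crosstab.csv')
If you prefer calling it off the dataframe, df.pivot_table(index='Name', columns='Date', values='Rank', aggfunc='min') should give the same table.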
