I have a CSV file that contains a user-id column. This CSV file is imported as a Dask dataframe.
Once it is in a dataframe, I need to take each user-id in the id column, run a SQL query that fetches the user name for that user-id, and add it to the dataframe in a new column. I have a few such columns that need fetching.
I am unsure what the Dask way of running select queries against a value in a Dask dataframe is. How would I go about it? I don't just want to go the imperative route and solve it with a for-loop.
This isn't a full answer, but I can't comment yet
Running multiple queries in a loop is quite inefficient. It would be better to run a single query to get all of the user-id/username pairs from your database into another dataframe, then use Dask's merge method to join the two dataframes on the user_id column.
https://docs.dask.org/en/latest/dataframe-joins.html
I'm not very experienced with Dask; most of my experience is with pandas, so there may be a bit more to it than this, but something along these lines:
import dask.dataframe as dd
import pandas as pd
# my_db_connection using whatever database connector you happen to be using
dask_df = dd.read_csv("your_csv_file.csv")
user_df = pd.read_sql("""
SELECT user_id, username
FROM user_table
""", con=my_db_connection
)
# Assuming both dataframes use "user_id" as the column name,
# if not use right_on and left_on arguments
merged_df = dask_df.merge(user_df, how="left", on="user_id")
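One caveat worth adding (based on how Dask generally works rather than on having tested this exact setup): Dask is lazy, so the merge above only builds a task graph. To actually materialize or persist the result you still have to trigger the computation, for example:
# Nothing has been computed yet; either bring the result into memory as pandas...
result = merged_df.compute()
# ...or write it straight back to disk, one file per partition.
merged_df.to_csv("enriched-*.csv")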
I have a pandas dataframe called customer_df of length 11k. I am creating a SQL query to insert the values of the df into a table customer_details on Postgres.
My code right now is:
insert_string = 'insert into customer_details (col1,col2,col3,col4) values '
for i in range(len(customer_df)):
    insert_string = insert_string + ('''('{0}','{1}','{2}','{3}'),'''.format(customer_df.iloc[i]['customer'], customer_df.iloc[i]['customerid'], customer_df.iloc[i]['loc'], customer_df.iloc[i]['pin']))
upload_to_postgres(insert_string)
this string finally gives me something like
insert into customer_details (col1,col2,col3,col4) values ('John',23,'KA',560021),('Ram',67,'AP',642918),(values of cust 3) .... (values of cust 11k)
which is sent to postgres using the upload_to_postgres.
This process of creating the string takes around 30 seconds to 1 minute. Is there a better-optimized way to reduce the time?
You can use pandas itself to create and read that df faster.
Here's a pandas cheat sheet and the library download link:
https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
https://pandas.pydata.org/pandas-docs/stable/getting_started/install.html
Multiple inserts are not the right way.
With each row you create a transaction, so the best solution is to do a bulk insert.
If you can create a file like a CSV and use the bulk COPY action, your performance increases by orders of magnitude. I have no example here, but I found this article that explains the methodology well.
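Since no example was given above, here is a minimal sketch of the COPY approach with psycopg2 (the table and column names come from the question; the connection string is a placeholder):
import io
import psycopg2

# Placeholder connection details; adjust to your environment.
conn = psycopg2.connect("dbname=mydb user=myuser")

# Dump the dataframe to an in-memory CSV buffer (no header, no index).
buffer = io.StringIO()
customer_df.to_csv(buffer, index=False, header=False)
buffer.seek(0)

# COPY streams the whole buffer in one bulk operation instead of row-by-row inserts.
with conn.cursor() as cur:
    cur.copy_expert(
        "COPY customer_details (col1, col2, col3, col4) FROM STDIN WITH (FORMAT CSV)",
        buffer,
    )
conn.commit()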
Dask doesn't have a df.to_sql() like pandas, so I am trying to replicate the functionality and create an SQL table using the map_partitions method. Here is my code:
import dask.dataframe as dd
import pandas as pd
import sqlalchemy as sqla
db_url = 'my_db_url_connection'
conn = sqla.create_engine(db_url)
ddf = dd.read_csv('data/prod.csv')
meta=dict(ddf.dtypes)
ddf.map_partitions(lambda df: df.to_sql('table_name', db_url, if_exists='append',index=True), ddf, meta=meta)
This returns my dask dataframe object, but when I go look into my psql server there's no new table... what is going wrong here?
UPDATE
Still can't get it to work, but due to an independent issue. Follow-up question: duplicate key value violates unique constraint - postgres error when trying to create sql table from dask dataframe
Simply, you have created a dataframe which is a prescription of the work to be done, but you have not executed it. To execute, you need to call .compute() on the result.
Note that the output here is not really a dataframe; because to_sql has no output, each partition evaluates to None. So it might be cleaner to express this with df.to_delayed(), something like:
import dask
import pandas as pd

dto_sql = dask.delayed(pd.DataFrame.to_sql)
out = [dto_sql(d, 'table_name', db_url, if_exists='append', index=True)
       for d in ddf.to_delayed()]
dask.compute(*out)
Also note that whether you get good parallelism will depend on the database driver and the database system itself.
UPDATE: Dask to_sql() is now available
https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.DataFrame.to_sql
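With that API, the whole thing collapses to a single call; a minimal sketch, assuming a SQLAlchemy-style connection URI like the db_url above (the table name is a placeholder):
import dask.dataframe as dd

ddf = dd.read_csv('data/prod.csv')

# Dask's to_sql takes a connection URI string rather than an engine object.
# parallel=True lets partitions be written concurrently if the database can handle it.
ddf.to_sql('table_name', uri=db_url, if_exists='append', index=True, parallel=True)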
I want to store a dataframe into an existing MSSQL table.
Dataframe has 3 columns, but the SQL table has only 2.
How is it possible to store just the 2 columns whose names match into SQL?
I tried the following code:
df.to_sql(sTable, engine, if_exists='append')
It works if the number and names of the columns are exactly the same, but I want to make my code more generic.
Create a dataframe with the right schema in the first place:
sql_df = df[['colA', 'colB']]
sql_df.to_sql(sTable, engine, if_exists='append')
Pandas ought to be pretty memory-efficient with this, meaning that the columns won't actually get duplicated, they'll just be referenced by sql_df. You could even rename columns to make this work.
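For instance, if the dataframe's column names don't match the SQL table's, a rename on the way out should do it (the column names here are made up):
# Keep only the columns the table expects and rename them to the SQL column names.
sql_df = df[['colA', 'colB']].rename(columns={'colA': 'column_a', 'colB': 'column_b'})
sql_df.to_sql(sTable, engine, if_exists='append')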
A super generic way to accomplish this might look like:
import pandas as pd

def save_to_sql(df, col_map, table, engine):
    sql_df = pd.DataFrame()
    for old_name, new_name in col_map:
        sql_df[new_name] = df[old_name]
    sql_df.to_sql(table, engine, if_exists='append')
This takes the dataframe and a list that pairs each column to use with what it should be called, so that the columns line up with the SQL table. E.g., save_to_sql(my_df, [('colA', 'column_a'), ('colB', 'column_b')], 'table_name', sql_engine)
That's a good solution. Now I'm also able to convert header names to SQL field names. The only topic I still have to solve is the index. DataFrames have an index (from 0...n). I don't need that field in the DB, but I have not found a way to skip the index column when uploading to the SQL DB.
Does somebody have an idea?
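For what it's worth, pandas' to_sql accepts an index flag for exactly this; a minimal sketch, reusing the names from the snippets above:
# index=False stops pandas from writing the DataFrame index as an extra column.
sql_df.to_sql(sTable, engine, if_exists='append', index=False)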
I imported a CSV into Python with pandas and I would like to be able to use one of the columns as a transaction ID in order for me to make association rules.
(link: https://github.com/antonio1695/Python/blob/master/nearBPO/facturas.csv)
I hope someone can help me to:
Use UUID as a transaction ID for me to have a dataframe like the following:
UUID Desc
123ex Meat,Beer
In order for me to get association rules like: {Meat} => {Beer}.
Also, a recommendation on a library to do so in a simple way would be appreciated.
Thank you for your time.
You can aggregate values into a list by doing the following:
df.groupby('UUID')['Desc'].apply(list)
This will give you what you want, if you want the UUID back as a column you can call reset_index on the above:
df.groupby('UUID')['Desc'].apply(list).reset_index()
Also, for a Series you can still export this to a CSV the same as with a df:
df.groupby('UUID')['Desc'].apply(list).to_csv(your_path)
You may need to name your index prior to exporting, or if you find it easier, just reset_index to restore the index back as a column and then call to_csv.
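A quick self-contained sketch of what that aggregation produces (toy data standing in for the linked CSV):
import pandas as pd

# Toy data: two items under one UUID, one item under another.
df = pd.DataFrame({'UUID': ['123ex', '123ex', '456ex'],
                   'Desc': ['Meat', 'Beer', 'Bread']})

grouped = df.groupby('UUID')['Desc'].apply(list).reset_index()
# Each UUID now maps to a list of its Desc values, e.g. 123ex -> ['Meat', 'Beer'],
# which is the one-row-per-transaction shape association-rule libraries expect.
print(grouped)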
I was doing some reading on Google and in the SQLAlchemy documentation but could not find any kind of built-in functionality that could take a standard SQL-formatted table and transform it into a cross tab query like Microsoft Access does.
I have in the past, when using Excel and Microsoft Access, created "cross tab" queries. Below is the SQL code from an example:
TRANSFORM Min([Fixed Day-19_Month-8_142040].VoltageAPhase) AS MinOfVoltageAPhase
SELECT [Fixed Day-19_Month-8_142040].Substation, [Fixed Day-19_Month-8_142040].Feeder, [Fixed Day-19_Month-8_142040].MeterID
FROM [Fixed Day-19_Month-8_142040]
GROUP BY [Fixed Day-19_Month-8_142040].Substation, [Fixed Day-19_Month-8_142040].Feeder, [Fixed Day-19_Month-8_142040].MeterID
PIVOT [Fixed Day-19_Month-8_142040].Date;
I am very unskilled when it comes to SQL, and the only way I was able to write this was by generating it in Access.
My question is: since SQLAlchemy Python code is really just a nice way of calling or generating SQL code using Python functions/methods, is there a way I could use SQLAlchemy to call a custom query that generates the SQL code (in the block above) to make a cross tab query? Obviously, I would have to change some of the SQL code to shoehorn it in with the correct fields and names, but the keywords should be the same, right?
The other problem is... in addition to returning the objects for each entry in the table, I would need the field names... I think this is called "metadata"? The end goal is that, once I had that information, I would output to Excel or CSV using another package.
UPDATED
Okay, so I think Van's suggestion to use pandas is the way to go. I'm currently in the process of figuring out how to create the cross tab:
def OnCSVfile(self, event):
    query = session.query(Exception).filter_by(company=self.company)
    data_frame = pandas.read_sql(query.statement, query.session.bind)  ## Get data frame in pandas
    pivot = data_frame.crosstab()
So I have been reading the pandas link you provided and have a question about the parameters.
pandas.crosstab(index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins=False, dropna=True)
Since I'm calling "crosstab" off the dataframe object, I assume there must be some kind of built-in way the dataframe recognizes column and row names. For index, would I pass in a list of strings that specify which fields I want tabulated in rows? For columns, would I pass in a list of strings that specify which field I want along the columns? From what I know about cross tab queries, there should only be one specification field for the column, right? For values, I want the minimum function, so I would have to pass some parameter to return the minimum value. Currently searching for an answer.
So if I have the following fields in my flat data frame (my original SQL query):
Name, Date and Rank
And I want to pivot the data as follows:
Name = Row of Crosstab
Date = Column of Crosstab
Rank = Min Value of Crosstab
Would the function call be something like:
data_frame.crosstab(['Name'], ['Date'], values=['Rank'],aggfunc = min)
I tried this code below:
query = session.query(Exception)
data_frame = pandas.read_sql(query.statement,query.session.bind)
row_list = pandas.Series(['meter_form'])
col_list = pandas.Series(['company'])
print(row_list)
pivot = data_frame.crosstab(row_list,col_list)
But I get this error about data_frame not having the attribute crosstab:
I guess this might be too much new information for you at once. Nonetheless, I would approach it completely differently. I would basically use the pandas Python library to do all the tasks:
Retrieve the data: since you are using SQLAlchemy already, you can simply query the database for only the data you need (flat, without any CROSSTAB/PIVOT)
Transform: put it into a pandas.DataFrame. For example, like this:
import pandas as pd
query = session.query(FixedDay...)
df = pd.read_sql(query.statement, query.session.bind)
Pivot: call pivot = pd.crosstab(...) to create a pivot in memory (note that crosstab is a module-level pandas function, not a DataFrame method). See pd.crosstab for more information.
Export: save it to Excel/CSV using DataFrame.to_excel
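To tie it back to the Name/Date/Rank example above, a minimal sketch (assuming the flat query result has those three columns; the output file name is a placeholder):
import pandas as pd

query = session.query(Exception)  # the flat query from the question
df = pd.read_sql(query.statement, query.session.bind)

# Rows = Name, columns = Date, cell value = minimum Rank in each group.
pivot = pd.crosstab(index=df['Name'], columns=df['Date'],
                    values=df['Rank'], aggfunc=min)

pivot.to_excel('crosstab_output.xlsx')  # or pivot.to_csv('crosstab_output.csv')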