How can I create a function that uses a list of strings to iterate over the following? The intention is that a, b, and c represent tables the users upload. The goal is to programmatically iterate no matter how many tables the users upload. I'm just looking to pull the counts of new rows, broken out by table.
mylist = df.select('S_ID').distinct().rdd.flatMap(lambda x: x).collect()
mylist
>> ['a', 'b', 'c']
##Count new rows by S_ID type
a = df.filter(df.S_ID == 'a').count()
b = df.filter(df.S_ID == 'b').count()
c = df.filter(df.S_ID == 'c').count()
##Count current rows from Snowflake
a_current = spark.read.format(SNOWFLAKE_SOURCE_NAME).options(**sfOptions).option("query", "select R_ID FROM mytable WHERE S_ID = 'a'").load()
a_current = a_current.count()
b_current = spark.read.format(SNOWFLAKE_SOURCE_NAME).options(**sfOptions).option("query", "select R_ID FROM mytable WHERE S_ID = 'b'").load()
b_current = b_current.count()
c_current = spark.read.format(SNOWFLAKE_SOURCE_NAME).options(**sfOptions).option("query", "select R_ID FROM mytable WHERE S_ID = 'c'").load()
c_current = c_current.count()
##Calculate count of new rows
a_new = a - a_current
a_new = str(a_new)
b_new = b - b_current
b_new = str(b_new)
c_new = c - c_current
c_new = str(c_new)
Something like this:
new_counts_list = []
for i in mylist:
    i = df.filter(df.S_ID == 'i').count()
    i_current = spark.read.format(SNOWFLAKE_SOURCE_NAME).options(**sfOptions).option("query", "select R_ID FROM mytable WHERE S_ID = 'i'").load()
    i_current = i_current.count()
    i_new = i - i_current
    i_new = str(i_new)
    new_counts_list.append(i)
I'm stuck on keeping the {names : new_counts}
As it pertains to:
I'm stuck on keeping the {names : new_counts}
at the end of your for loop you may use
new_counts_list[i]=i_new
instead of
new_counts_list.append(i)
assuming that you change how new_counts_list is initialized, i.e. initialize it as a dict (new_counts_list = {}) instead of a list.
You also seem to be hardcoding the literal value 'i', which is a string, instead of using the variable i (i.e. without the quotes) in your proposed solution. Your updated solution may look like:
new_counts_list={}
for i in mylist:
    count_new = df.filter(df.S_ID == i).count()
    i_current = spark.read.format(SNOWFLAKE_SOURCE_NAME).options(**sfOptions).option("query", "select R_ID FROM mytable WHERE S_ID = '{0}'".format(i)).load()
    i_current = i_current.count()
    i_new = count_new - i_current
    i_new = str(i_new)
    new_counts_list[i] = i_new
On another note, while your approach, i.e. sequentially looping through each S_ID in mylist and running the operations, i.e.:
Running the action collect to pull all the S_ID to your driver node from your initial dataframe df into a list mylist
Separately counting the number of occurrences of S_ID in your initial dataframe then executing another potentially expensive (IO reads/network communication/shuffles) collect()
Creating another dataframe with spark.read.format(SNOWFLAKE_SOURCE_NAME) that will load all records filtered by each S_ID into memory before executing a count
Finding the difference between the initial dataframe and the snowflake source
will work, it is expensive in IO reads and, based on your cluster/setup, potentially expensive in network communication and shuffles.
You may consider using a groupby to reduce the number of times you execute the potentially expensive collect. Furthermore, you may also join the initial dataframe to your Snowflake source and let Spark optimize your operations as a lazy execution plan distributed across your cluster/setup. Moreover, similar to how you are using the pushdown filter for the Snowflake source, you may combine all selected S_ID in that query to allow Snowflake to reduce all the desired results in one read. You would not need a loop. This could potentially look like:
Approach 1
In this approach, I will provide a pure Spark solution to achieve your desired results.
from pyspark.sql import functions as F
# Ask spark to select only the `S_ID` and group the data but not execute the transformation
my_existing_counts_df = df.select('S_ID').groupBy('S_ID').count()
# Ask spark to select only the `S_ID` counts from the snowflake source
current_counts_df = (
    spark.read
    .format(SNOWFLAKE_SOURCE_NAME)
    .options(**sfOptions)
    .option("query", "select R_ID, COUNT(1) as cnt FROM mytable GROUP BY R_ID")
    .load()
)
# Join both datasets which will filter to only selected `S_ID`
# and determine the differences between the existing and current counts
results_df = (
    my_existing_counts_df.alias("existing")
    .join(
        current_counts_df.alias("current"),
        F.col("S_ID") == F.col("R_ID"),
        "inner"
    )
    .selectExpr(
        "S_ID",
        "`count` - cnt as actual_count"
    )
)
# Execute the above transformations with `collect` and
# convert the returned rows to your desired final dictionary
new_counts = {}
for row in results_df.collect():
    new_counts[row['S_ID']] = row['actual_count']
# your desired results are in `new_counts`
Approach 2
In this approach, I will collect the results of the group by and then use them to optimize the pushdown query to the Snowflake schema to return the desired results.
my_list_counts = df.select('S_ID').groupBy('S_ID').count().collect()
selected_sids = []
case_expression = ""
for row in my_list_counts:
    selected_sids.append(row['S_ID'])
    case_expression = case_expression + " WHEN R_ID='{0}' THEN {1} ".format(
        row['S_ID'],
        row['count']
    )
# The above has a table with columns `S_ID` and `count` where the
# latter is the number of occurrences of `S_ID` in the dataset `df`
snowflake_push_down_query = """
SELECT
    R_ID AS S_ID,
    ((CASE
        {0}
    END) - cnt) as actual_count
FROM (
    SELECT
        R_ID,
        COUNT(1) AS cnt
    FROM
        mytable
    WHERE
        R_ID IN ('{1}')
    GROUP BY
        R_ID
) t
""".format(
    case_expression,
    "','".join(selected_sids)
)
results_df = (
    spark.read
    .format(SNOWFLAKE_SOURCE_NAME)
    .options(**sfOptions)
    .option("query", snowflake_push_down_query)
    .load()
)
# Execute the above transformations with `collect` and
# convert the returned rows to your desired final dictionary
new_counts = {}
for row in results_df.collect():
    new_counts[row['S_ID']] = row['actual_count']
# your desired results are in `new_counts`
Let me know if this works for you.
Related
I want to insert/update values from a pandas dataframe into a postgres table.
I have a unique tuple (a, b) in the postgres table. If the tuple already exists I only want to update the third value c; if the tuple doesn't exist I want to create a triple (a, b, c).
What is the most efficient way to do so? I guess some sort of bulk insert, but I am not quite sure how exactly.
You can convert your dataframe to a CTE https://www.postgresql.org/docs/current/queries-with.html and insert the data from the CTE into the table afterwards. Something like this:
def convert_df_to_cte(df):
    vals = ', \n'.join([f"{tuple([f'$str${e}$str$' for e in row])}" for row in df.values])
    vals = vals.replace("'$str$", "$str$")
    vals = vals.replace("$str$'", "$str$")
    vals = vals.replace('"$str$', "$str$")
    vals = vals.replace('$str$"', "$str$")
    vals = vals.replace('$str$nan$str$', 'NULL')
    columns = ', \n'.join(df.columns)
    sql = f"""
    WITH vals AS (
        SELECT
            {columns}
        FROM
            (VALUES {vals}) AS t ({columns})
    )
    """
    return sql
df = pd.DataFrame([[1, 2, 3]], columns=['col_1', 'col_2', 'col_3'])
cte_sql = convert_df_to_cte(df)
sql_to_insert = f"""
{cte_sql}
INSERT INTO schema.table (col_1, col_2, col_3)
SELECT
col_1::integer, -- don't forget to cast to right type to avoid errors
col_2::integer, -- don't forget to cast to right type to avoid errors
col_3::character varying
FROM
vals
ON CONFLICT (col_1, col_2) DO UPDATE SET
col_3 = excluded.col_3;
"""
run_sql(sql_to_insert)
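run_sql above is just a placeholder for however you execute SQL against your database; a minimal sketch of what it could look like with psycopg2 (the connection parameters are hypothetical):
import psycopg2

def run_sql(sql):
    # hypothetical connection parameters -- replace with your own
    with psycopg2.connect(host="localhost", dbname="mydb", user="me", password="secret") as conn:
        with conn.cursor() as cur:
            cur.execute(sql)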
I have a SQL Server stored procedure that returns 3 separate tables.
How can I store each of this table in different data-frame using pandas?
Something like:
df1 - first table
df2 - second table
df3 - third table
Where should I start looking?
Thank you
import pandas as pd
import pyodbc
from datetime import datetime
param = datetime(year=2019,month=7,day=31)
query = """EXECUTE [dbo].PythonTest_USIC_TreatyYear_ReportingPackage #AsOFDate = '{0}'""".format(param)
conn = pyodbc.connect('DRIVER={SQL Server};server=myserver;DATABASE=mydatabase;Trusted_Connection=yes;')
df = pd.read_sql_query(query, conn)
print(df.head())
You should be able to just iterate through the result sets, convert them to DataFrames, and append those DataFrames to a list. For example, given the stored procedure
CREATE PROCEDURE dbo.MultiResultSP
AS
BEGIN
    -- SET NOCOUNT ON added to prevent extra result sets from
    -- interfering with SELECT statements.
    SET NOCOUNT ON;

    SELECT 1 AS [UserID], N'Gord' AS [UserName]
    UNION ALL
    SELECT 2 AS [UserID], N'Elaine' AS [UserName];

    SELECT N'pi' AS [Constant], 3.14 AS [Value]
    UNION ALL
    SELECT N'sqrt_2' AS [Constant], 1.41 AS [Value]
END
the Python code would look something like this:
data_frames = []

crsr = cnxn.cursor()
crsr.execute("EXEC dbo.MultiResultSP")
result = crsr.fetchall()
while result:
    col_names = [x[0] for x in crsr.description]
    data = [tuple(x) for x in result]  # convert pyodbc.Row objects to tuples
    data_frames.append(pd.DataFrame(data, columns=col_names))
    if crsr.nextset():
        result = crsr.fetchall()
    else:
        result = None

# check results
for df in data_frames:
    print(df)
    print()
""" console output:
UserID UserName
0 1 Gord
1 2 Elaine
Constant Value
0 pi 3.14
1 sqrt_2 1.41
"""
I have 2 tables in MySQL.
One has transactions with important columns, where each row has a debit account ID and a credit account ID. I have a second table which contains the account name and a special number associated with the account ID. I want to write a SQL query which will take data from the transactions table and attach the account name and account number from the second table.
I tried doing everything using two queries: one would get the transactions and the second would get the account details, and then I iterated over the dataframe and assigned everything one by one, which doesn't seem to be a good idea.
query = "SELECT tr_id, tr_date, description, dr_acc, cr_acc, amount, currency, currency_rate, document, comment FROM transactions WHERE " \
"company_id = {} {} and deleted = 0 {} LIMIT {}, {}".format(
company_id, filter, sort, sn, en)
df = ncon.getDF(query)
df.insert(4, 'dr_name', '')
df.insert(6, 'cr_name', '')
data = tuple(list(set(df['dr_acc'].values.tolist() + df['cr_acc'].values.tolist())))
query = "SELECT account_number, acc_id, account_name FROM tb_accounts WHERE company_id = {} and deleted = 0 and acc_id in {}".format(
company_id, data)
df_accs = ncon.getDF(query)
for index, row in df_accs.iterrows():
    acc = str(row['acc_id'])
    ac = row['account_number']
    nm = row['account_name']
    indx = df.index[df['dr_acc'] == acc].tolist()
    df.at[indx, 'dr_acc'] = ac
    df.at[indx, 'dr_name'] = nm
    indx = df.index[df['cr_acc'] == acc].tolist()
    df.at[indx, 'cr_acc'] = ac
    df.at[indx, 'cr_name'] = nm
What you're looking for, I think, is a SQL JOIN statement.
Taking a crack at writing a query that might work based on your code:
query = '''
SELECT transactions.tr_id,
transactions.tr_date,
transactions.description,
transactions.dr_acc,
transactions.cr_acc,
transactions.amount,
transactions.currency,
transactions.currency_rate,
transactions.document,
transactions.comment
FROM transactions INNER JOIN tb_accounts ON tb_accounts.acc_id = transactions.dr_acc
WHERE
transactions.company_id = {} AND
tb_accounts.company_id = {} AND
transactions.deleted = 0 AND
tb_accounts.deleted = 0
ORDER BY transactions.tr_id
LIMIT 10;'''
The above query will, roughly, return the listed transaction fields for each pair of rows where the account id matches.
NOTE, the query above will probably not have very good performance. SQL JOIN statements must be written with care, but I wrote it above in a way that's easy to understand, so as to illustrate the power of the JOIN.
You should as a matter of habit NEVER try to program something when you could use a join instead. As long as you take care to write a join properly so that it can be efficient, the MySQL engine will beat your python code for performance almost every time.
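If you then want the joined result back in a dataframe, one possible sketch is to read the join with pandas and bound parameters instead of string formatting (the connection URL and the restriction to the debit account are assumptions; the credit account name would need a second join on an aliased tb_accounts):
import pandas as pd
from sqlalchemy import create_engine, text

# hypothetical connection URL -- adjust to your own server and credentials
engine = create_engine("mysql+pymysql://user:password@localhost/mydb")

joined_query = text("""
    SELECT t.tr_id, t.tr_date, t.description,
           t.dr_acc, a.account_number AS dr_number, a.account_name AS dr_name,
           t.cr_acc, t.amount, t.currency, t.currency_rate, t.document, t.comment
    FROM transactions t
    INNER JOIN tb_accounts a ON a.acc_id = t.dr_acc
    WHERE t.company_id = :company_id AND t.deleted = 0 AND a.deleted = 0
""")

# company_id is the same variable used in the question's code
df = pd.read_sql(joined_query, engine, params={"company_id": company_id})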
Sort the two dataframes and use merge to combine them:
df1 = df1.sort_values(['dr_acc'], ascending=True)
df2 = df2.sort_values(['acc_id'], ascending=True)
merge2df = pd.merge(df1, df2, how='outer',
                    left_on=['dr_acc'], right_on=['acc_id'])
I assumed df1 is the first query's data set and df2 is the second query's data set.
SQL query:
'''SELECT tr_id, tr_date,
description,
dr_acc, cr_acc,
amount, currency,
currency_rate,
document,
account_number, acc_id, account_name,
comment FROM transactions left join
tb_accounts on transactions.dr_acc=tb_accounts.acc_id'''
I am attempting to query a subset of a MySql database table, feed the results into a Pandas DataFrame, alter some data, and then write the updated rows back to the same table. My table size is ~1MM rows, and the number of rows I will be altering will be relatively small (<50,000) so bringing back the entire table and performing a df.to_sql(tablename,engine, if_exists='replace') isn't a viable option. Is there a straightforward way to UPDATE the rows that have been altered without iterating over every row in the DataFrame?
I am aware of this project, which attempts to simulate an "upsert" workflow, but it seems it only accomplishes the task of inserting new non-duplicate rows rather than updating parts of existing rows:
GitHub Pandas-to_sql-upsert
Here is a skeleton of what I'm attempting to accomplish on a much larger scale:
import pandas as pd
from sqlalchemy import create_engine
import threading
#Get sample data
d = {'A' : [1, 2, 3, 4], 'B' : [4, 3, 2, 1]}
df = pd.DataFrame(d)
engine = create_engine(SQLALCHEMY_DATABASE_URI)
#Create a table with a unique constraint on A.
engine.execute("""DROP TABLE IF EXISTS test_upsert """)
engine.execute("""CREATE TABLE test_upsert (
A INTEGER,
B INTEGER,
PRIMARY KEY (A))
""")
#Insert data using pandas.to_sql
df.to_sql('test_upsert', engine, if_exists='append', index=False)
#Alter row where 'A' == 2
df_in_db.loc[df_in_db['A'] == 2, 'B'] = 6
Now I would like to write df_in_db back to my 'test_upsert' table with the updated data reflected.
This SO question is very similar, and one of the comments recommends using an "sqlalchemy table class" to perform the task.
Update table using sqlalchemy table class
Can anyone expand on how I would implement this for my specific case above if that is the best (only?) way to implement it?
I think the easiest way would be to first delete those rows that are going to be "upserted". This can be done in a loop, but it's not very efficient for bigger data sets (5K+ rows), so I'd save this slice of the DF into a temporary MySQL table:
# assuming we have already changed values in the rows and saved those changed rows in a separate DF: `x`
x = df[mask] # `mask` should help us to find changed rows...
# make sure `x` DF has a Primary Key column as index
x = x.set_index('a')
# dump a slice with changed rows to temporary MySQL table
x.to_sql('my_tmp', engine, if_exists='replace', index=True)
conn = engine.connect()
trans = conn.begin()
try:
    # delete those rows that we are going to "upsert"
    engine.execute('delete from test_upsert where a in (select a from my_tmp)')
    trans.commit()
    # insert changed rows
    x.to_sql('test_upsert', engine, if_exists='append', index=True)
except:
    trans.rollback()
    raise
PS: I didn't test this code, so it might have some small bugs, but it should give you an idea...
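A hedged variant of the same delete-then-insert idea that keeps both statements on a single connection and transaction (assuming a reasonably recent SQLAlchemy and a pandas version that accepts a Connection in to_sql):
from sqlalchemy import text

# same idea as above, but the DELETE and the INSERT share one transaction
with engine.begin() as conn:
    conn.execute(text("delete from test_upsert where a in (select a from my_tmp)"))
    x.to_sql("test_upsert", conn, if_exists="append", index=True)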
A MySQL-specific solution using pandas' to_sql "method" arg and SQLAlchemy's MySQL insert on_duplicate_key_update features:
import sqlalchemy as db
import sqlalchemy.dialects.mysql  # makes db.dialects.mysql available

def create_method(meta):
    def method(table, conn, keys, data_iter):
        sql_table = db.Table(table.name, meta, autoload=True)
        insert_stmt = db.dialects.mysql.insert(sql_table).values([dict(zip(keys, data)) for data in data_iter])
        upsert_stmt = insert_stmt.on_duplicate_key_update({x.name: x for x in insert_stmt.inserted})
        conn.execute(upsert_stmt)
    return method

engine = db.create_engine(...)
conn = engine.connect()
with conn.begin():
    meta = db.MetaData(conn)
    method = create_method(meta)
    df.to_sql(table_name, conn, if_exists='append', method=method)
Here is a general function that will update each row (but all values in the row simultaneously)
def update_table_from_df(df, table, where):
    '''Will take a dataframe and update each specified row in the SQL table
    with the DF values -- DF columns MUST match SQL columns
    WHERE statement should be triple-quoted string
    Will not update any columns contained in the WHERE statement'''
    update_string = f'UPDATE {table} set '
    for idx, row in df.iterrows():
        upstr = update_string
        for col in list(df.columns):
            if (col != 'datetime') & (col not in where):
                if col != df.columns[-1]:
                    if isinstance(row[col], str):
                        upstr += f'''{col} = '{row[col]}', '''
                    else:
                        upstr += f'''{col} = {row[col]}, '''
                else:
                    if isinstance(row[col], str):
                        upstr += f'''{col} = '{row[col]}' '''
                    else:
                        upstr += f'''{col} = {row[col]} '''
        upstr += where
        cursor.execute(upstr)
    cursor.commit()
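Note that interpolating values straight into the UPDATE string will break on values containing quotes; a hedged, parameterized sketch of the same per-row update (assuming a pyodbc-style cursor and a single key column, both hypothetical here) could look like:
def update_table_from_df_params(df, table, key_col, cursor):
    # parameterized version: values are bound by the driver, not interpolated
    cols = [c for c in df.columns if c != key_col]
    set_clause = ", ".join(f"{c} = ?" for c in cols)
    sql = f"UPDATE {table} SET {set_clause} WHERE {key_col} = ?"
    params = [tuple(row[c] for c in cols) + (row[key_col],) for _, row in df.iterrows()]
    cursor.executemany(sql, params)
    cursor.commit()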
I was struggling with this before and now I've found a way.
Basically, create a separate data frame in which you keep only the data that has to be updated.
df  # dataframe holding the updated data
s_update = ""  # string of update statements
# Loop through the data frame
for i in range(len(df)):
    s_update += "update your_table_name set column_name = '%s' where column_name = '%s';" % (df[col_name1][i], df[col_name2][i])
Now pass s_update to cursor.execute or engine.execute (wherever you execute your SQL queries).
This will update your data instantly.
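Whether several semicolon-separated statements can go into one execute call depends on your driver, so a cautious sketch splits them and runs them in one transaction (the connection URL is an assumption):
from sqlalchemy import create_engine, text

# hypothetical connection URL -- adjust to your database
engine = create_engine("mysql+pymysql://user:password@localhost/mydb")

with engine.begin() as conn:
    for stmt in (s.strip() for s in s_update.split(";")):
        if stmt:
            conn.execute(text(stmt))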
My original query is like
select table1.id, table1.value
from some_database.something table1
join some_set table2 on table2.code=table1.code
where table1.date_ >= :_startdate and table1.date_ <= :_enddate
which is saved in a string in Python. If I do
x = session.execute(script_str, {'_startdate': start_date, '_enddate': end_date})
then
x.fetchall()
will give me the table I want.
Now the situation is, table2 is no longer available to me in the Oracle database, instead it is available in my python environment as a DataFrame. I wonder what is the best way to fetch the same table from the database in this case?
You can use the IN clause instead.
First remove the join from the script_str:
script_str = """
select table1.id, table1.value
from something table1
where table1.date_ >= :_startdate and table1.date_ <= :_enddate
"""
Then, get codes from dataframe:
codes = myDataFrame.code_column.values
Now, we need to dynamically extend the script_str and the parameters to the query:
param_names = ['_code{}'.format(i) for i in range(len(codes))]
script_str += "AND table1.code IN ({})".format(
", ".join([":{}".format(p) for p in param_names])
)
Create dict with all parameters:
params = {
'_startdate': start_date,
'_enddate': end_date,
}
params.update(zip(param_names, codes))
And execute the query:
x = session.execute(script_str, params)
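Since table2 now lives in your Python environment, you may also want the result back as a dataframe; a small sketch (assuming pandas is available and x behaves like a SQLAlchemy result object):
import pandas as pd

# hypothetical: turn the fetched rows into a dataframe
result_df = pd.DataFrame(x.fetchall(), columns=x.keys())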