I have this Python code that pulls some data from a DB2 database into pandas dataframes data1 and data2. Data2 has a column named ARCD which holds text values such as '', '01', '03', '14', etc.
TABLE1
CSOFF CSDATE
ABC 20180101
ADV 20180212
AFS 20180121
ADF 20180202
ABC 20180115
TABLE2
AROFF ARAMT ARCD ARTRDT
ABC   200        20180101
AFS   150   01   20180121
ADV   210        20180129
I need only those records in data3 where the value in ARCD is blank, i.e. '', or '01'.
I can get all the records that have codes like '01', '03', etc., but I am not able to pull the records with blank values, i.e. ''.
import pyodbc
import pandas as pd
con = pyodbc.connect(
    driver='{iSeries Access ODBC Driver}',
    system='',
    uid='',
    pwd='')
cur = con.cursor()
query = """
SELECT * FROM QS99F.TABLE1 WHERE CSDATE > 20180100
"""
data1 = pd.read_sql(query,con,index_col = None)
query = """
SELECT * FROM QS99F.TABLE2 WHERE ARTRDT > 20180100
"""
data2 = pd.read_sql(query,con,index_col = None)
data3 = pd.merge(data1[['CSOFF','CSRATE']],data2[['AROFF','ARAMT','ARCD']],left_on=['CSOFF','CSMKT','CSSUFX'],right_on=['AROFF','ARMKT','ARSUFX'],how='inner')
dp = data3['ARCD'] == "01"
ar = data3['ARCD'] == ""
data3 = data3[dp|ar]
print (data3)
It is likely that your blank values are actually NaN rather than "".
You can change this as follows:
data3 = data3.fillna("")
before your tests.
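For context, here is a minimal sketch of how that fix slots into the filtering step; the small data3 frame below is built by hand (the column names follow the question, but the values are just placeholders) rather than read from DB2:
import pandas as pd
import numpy as np
# Hand-built stand-in for data3; blank codes come back from the driver as NaN.
data3 = pd.DataFrame({
    'AROFF': ['ABC', 'AFS', 'ADV'],
    'ARAMT': [200, 150, 210],
    'ARCD': [np.nan, '01', '03'],
})
# Replace NaN with empty strings so the equality test can match them.
data3 = data3.fillna("")
dp = data3['ARCD'] == "01"
ar = data3['ARCD'] == ""
print(data3[dp | ar])  # keeps the blank row and the '01' row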
I have two databases in Snowflake, DB1 & DB2. The data is migrated from DB1 to DB2, so the schema and the table names are the same.
Assume DB1.SCHEMA_1.TABLE_1 has this data:
STATE_ID STATE
1 AL
2 AN
3 AZ
4 AR
5 CA
6 AD
7 PN
8 AP
9 JH
10 TX
12 LA
and
Assume DB2.SCHEMA_1.TABLE_1 has this data:
STATE_ID STATE
1 AL
2 AK
3 AZ
4 AR
5 AC
6 AD
7 GP
8 AP
9 JH
10 HA
They both have one more column, 'record_created_timestamp', but I drop it in the code.
I wrote a PySpark script that performs a column-based comparison and runs in an AWS Glue job. I got help from this link: Generate a report of mismatch Columns between 2 Pyspark dataframes
My code is :
import sys
from pyspark.sql.session import SparkSession
from pyspark.sql import Row
from pyspark.sql.functions import concat, col, lit, to_timestamp, when
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from py4j.java_gateway import java_import
import os
from pyspark.sql.types import *
from pyspark.sql.functions import substring
from pyspark.sql.functions import array, count, first
import json
import datetime
import time
import boto3
from botocore.exceptions import ClientError
now = datetime.datetime.now()
year = now.strftime("%Y")
month = now.strftime("%m")
day = now.strftime("%d")
glueClient = boto3.client('glue')
ssmClient = boto3.client('ssm')
region = os.environ['AWS_DEFAULT_REGION']
SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'CONNECTION_INFO', 'TABLE_NAME', 'BUCKET_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
client = boto3.client("secretsmanager", region_name="us-east-1")
get_secret_value_response = client.get_secret_value(
    SecretId=args['CONNECTION_INFO']
)
secret = get_secret_value_response['SecretString']
secret = json.loads(secret)
db_username = secret.get('db_username')
db_password = secret.get('db_password')
db_warehouse = secret.get('db_warehouse')
db_url = secret.get('db_url')
db_account = secret.get('db_account')
db_name = secret.get('db_name')
db_schema = secret.get('db_schema')
logger = glueContext.get_logger()
logger.info('Fetching configuration.')
job.init(args['JOB_NAME'], args)
java_import(spark._jvm, SNOWFLAKE_SOURCE_NAME)
spark._jvm.net.snowflake.spark.snowflake.SnowflakeConnectorUtils.enablePushdownSession(spark._jvm.org.apache.spark.sql.SparkSession.builder().getOrCreate())
sfOptions = {
    "sfURL": db_url,
    "sfAccount": db_account,
    "sfUser": db_username,
    "sfPassword": db_password,
    "sfSchema": db_schema,
    "sfDatabase": db_name,
    "sfWarehouse": db_warehouse
}
print(f'database: {db_name}')
print(f'db_warehouse: {db_warehouse}')
print(f'db_schema: {db_schema}')
print(f'db_account: {db_account}')
table_name = args['TABLE_NAME']
bucket_name = args['BUCKET_NAME']
MySql_1 = f"""
select * from DB1.SCHEMA_1.TABLE_1
"""
df = spark.read.format("snowflake").options(**sfOptions).option("query", MySql_1).load()
df1 = df.drop('record_created_timestamp')
MySql_2 = f"""
select * from DB2.SCHEMA_1.TABLE_1
"""
df2 = spark.read.format("snowflake").options(**sfOptions).option("query", MySql_2).load()
df3 = df2.drop('record_created_timestamp')
# list of columns to be compared
cols = df1.columns[1:]
df_new = (df1.join(df3, "state_id", "outer")
          .select([when(~df1[c].eqNullSafe(df3[c]), array(df1[c], df3[c])).alias(c) for c in cols])
          .selectExpr('stack({},{}) as (Column_Name, mismatch)'.format(len(cols), ','.join('"{0}",`{0}`'.format(c) for c in cols)))
          .filter('mismatch is not NULL'))
df_newv1 = df_new.selectExpr('Column_Name', 'mismatch[0] as Mismatch_In_DB1_Table', 'mismatch[1] as Mismatch_In_DB2_Table')
df_newv1.show()
SNOWFLAKE_SOURCE_NAME = "snowflake"
job.commit()
This provides me the correct output:
Column_Name Mismatch_In_DB1_Table Mismatch_In_DB2_Table
STATE AN AK
STATE CA AC
STATE PN GP
STATE TX HA
If I use STATE instead of STATE_ID as the outer-join key,
df_new = (df1.join(df2, "state", "outer")
it shows this error:
AnalysisException: 'Resolved attribute(s) STATE#1,STATE#9 missing from STATE#14,STATE_ID#0,STATE_ID#8 in operator !Project [CASE WHEN NOT (STATE#1 <=> STATE#9) THEN array(STATE#1, STATE#9) END AS STATE#18]. Attribute(s) with the same name appear in the operation: STATE,STATE. Please check if the right attribute(s) are used.;;\n!Project [CASE WHEN NOT (STATE#1 <=> STATE#9) THEN array(STATE#1, STATE#9) END AS STATE#18]\n+- Project [coalesce(STATE#1, STATE#9) AS STATE#14, STATE_ID#0, STATE_ID#8]\n +- Join FullOuter, (STATE#1 = STATE#9)\n :- Project [STATE_ID#0, STATE#1]\n : +- Relation[STATE_ID#0,STATE#1,RECORD_CREATED_TIMESTAMP#2] SnowflakeRelation\n +- Relation[STATE_ID#8,STATE#9] SnowflakeRelation\n
I would appreciate an explanation of this, and I would like to know if there is a way to make this run even if I give STATE as the key.
Or, if there is some other code via which I can get the same output without this error, that would help too.
It seems that Spark is getting confused by the column names of the two DataFrames. Try giving them aliases to make sure they match:
df1 = df.drop('record_created_timestamp')\
    .select(df.STATE_ID.alias('state_id'), df.STATE.alias('state'))
df3 = df2.drop('record_created_timestamp')\
    .select(df2.STATE_ID.alias('state_id'), df2.STATE.alias('state'))
Also, make sure STATE_ID has no spaces or special characters in the column name.
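For reference, here is a minimal self-contained sketch (toy local data instead of the Glue/Snowflake job, and the _db1/_db2 suffixes are just illustrative names) of another way to avoid the ambiguous-attribute error: rename the non-key columns on each side before joining, so Spark never sees two columns with the same name, whichever key you pick:
from pyspark.sql import SparkSession
from pyspark.sql.functions import when, array, col

spark = SparkSession.builder.master("local[1]").appName("compare").getOrCreate()

# Toy stand-ins for the two Snowflake tables.
df1 = spark.createDataFrame([(1, "AL"), (2, "AN"), (5, "CA")], ["STATE_ID", "STATE"])
df2 = spark.createDataFrame([(1, "AL"), (2, "AK"), (5, "AC")], ["STATE_ID", "STATE"])

key = "STATE_ID"                      # works the same way with "STATE" as the key
cols = [c for c in df1.columns if c != key]

# Suffix the non-key columns so the two sides never share a name.
left = df1.select(key, *[col(c).alias(c + "_db1") for c in cols])
right = df2.select(key, *[col(c).alias(c + "_db2") for c in cols])

joined = left.join(right, key, "outer")
report = joined.select([
    when(~col(c + "_db1").eqNullSafe(col(c + "_db2")),
         array(col(c + "_db1"), col(c + "_db2"))).alias(c)
    for c in cols
])
# From here, the same stack()/filter logic from the question can be applied.
report.show()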
Is there any way I can compare two different databases (PostgreSQL, SQL Server) and return the missing rows? One row that exists in the SQL Server table is missing from the PostgreSQL table, and I have no clue how to return that answer.
I have two connections open, one for PostgreSQL (bpo_table_results) and one for SQL Server (rps_table_results).
postgresql table:
date count amount
1/1/21 500 1,234,654.12
sql server table:
date count amount
1/1/21 500 1,234,654.12
1/2/21 4541 3,457,787.24
expected results:
The row with the amount 3,457,787.24 is missing from your PostgreSQL table.
code:
def queryRPS(sql_server_conn, sql_server_cursor):
    rps_item_count_l = []
    rps_icl_amt_l = []
    rps_table_q_2 = f"""select * from rps..sendfile where processingdate = '{cd}' and datasetname like '%ICL%' """
    rps_table_results = sql_server_cursor.execute(rps_table_q_2).fetchall()
    for row in rps_table_results:
        rps_item_count = row[16]
        rps_item_count_l.append(rps_item_count)
        rps_icl_amt = row[18]
        rps_icl_amt_l.append(rps_icl_amt)
    # return the collected values so the caller can unpack them below
    return rps_item_count_l, rps_icl_amt_l

def queryBPO(postgres_conn, postgres_cursor, rps_item_count_l, rps_icl_amt_l):
    bpo_results_l = []
    rps_results_l = []
    for rps_count, rps_amount in zip(rps_item_count_l, rps_icl_amt_l):
        rps_amount_f = str(rps_amount).rstrip('0')
        rps_amount_f = ("{:,}".format(float(rps_amount_f)))
        bpo_icl_awk_q_2 = """select * from ppc_data.icl_awk where num_items = '%s' and
            file_total = '%s' """ % (str(rps_count), str(rps_amount_f))
        postgres_cursor.execute(bpo_icl_awk_q_2)
        bpo_table_results = postgres_cursor.fetchall()

rps_table_q_2 = f"""select * from rps..sendfile where processingdate = '{cd}' and datasetname like '%ICL%' """
rps_table_results = sql_server_cursor.execute(rps_table_q_2).fetchall()
rps_item_count_l, rps_icl_amt_l = queryRPS(sql_server_conn, sql_server_cursor)
queryBPO(postgres_conn, postgres_cursor, rps_item_count_l, rps_icl_amt_l)
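One possible approach (a sketch only; it assumes both result sets have first been loaded into pandas DataFrames with matching date/count/amount columns, and the frame names below are placeholders) is a left anti-join with merge(..., indicator=True). In practice the amounts may need rounding or Decimal normalization before they compare equal:
import pandas as pd

# Stand-in data; in practice these would be built from the two cursors above.
rps_df = pd.DataFrame({   # SQL Server side
    'date': ['1/1/21', '1/2/21'],
    'count': [500, 4541],
    'amount': [1234654.12, 3457787.24],
})
bpo_df = pd.DataFrame({   # PostgreSQL side
    'date': ['1/1/21'],
    'count': [500],
    'amount': [1234654.12],
})

# Left anti-join: rows present in SQL Server but absent from PostgreSQL.
merged = rps_df.merge(bpo_df, how='left', on=['date', 'count', 'amount'], indicator=True)
missing = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')

for _, row in missing.iterrows():
    print(f"The row in the amount of {row['amount']:,.2f} is missing from your PostgreSQL table.")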
I have a dataframe like this with more than 300 rows:
name price percent volume Buy Sell
1 BID 41.30 -0.36 62292.0 604.0 6067.0
2 BVH 49.00 -1.01 57041.0 3786.0 3510.0
3 CTD 67.80 6.94 68098.0 2929.0 576.0
4 CTG 23.45 0.43 298677.0 16965.0 20367.0
5 EIB 18.20 -0.27 10517.0 306.0 210.0
For each name I create one table in MySQL. Here is my code so far.
import sqlalchemy as sa

vn30 = vn30_list.iloc[:, [10,13,12,15,25,26]].dropna(how='all').fillna(0)
data = vn30_list.iloc[:, [13,12,15,25,26]].dropna(how='all').fillna(0)
data.columns = ['gia','percent','khoiluong','nnmua','nnban']
en = sa.create_engine('mysql+mysqlconnector://...', echo=True)
# insert into mysql, one table per name
for i in range(30):
    macp = vn30.iloc[i][0].lower()
    #print(row)
    compare_item = vn30.iloc[i][1]
    if compare_item == data.iloc[i][0]:
        row = data.iloc[i:i + 1, :]
        #print(row)
        row.to_sql(name=str(macp), con=en, if_exists="append", index=False, schema="online")
Is there any way to make it faster for 300 rows?
Thank you so much, and sorry for my English.
# import the module
from sqlalchemy import create_engine
# create sqlalchemy engine
engine = create_engine("mysql+pymysql://{user}:{pw}@localhost/{db}".format(user="root", pw="12345", db="employee"))
# Insert whole DataFrame into MySQL
data.to_sql('book_details', con = engine, if_exists = 'append', chunksize = 1000)
You can get all the details here: https://www.dataquest.io/blog/sql-insert-tutorial/
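If the one-table-per-name layout from the question has to stay, a modest speedup (a sketch, assuming the same vn30/data frames and engine as in the question) is to reuse a single connection and transaction for all the writes instead of letting every to_sql call check out its own:
# One transaction and one checked-out connection for all the per-name inserts.
with en.begin() as conn:
    for i in range(len(vn30)):
        macp = vn30.iloc[i][0].lower()
        if vn30.iloc[i][1] == data.iloc[i][0]:
            # each name still gets its own table, but the writes share one transaction
            data.iloc[i:i + 1, :].to_sql(name=str(macp), con=conn, if_exists="append",
                                         index=False, schema="online")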
I have the following code:
from __future__ import division
import pyodbc
import csv
import pandas as pd
import numpy as np
count = 1
dsn = "DRIVER={SQL Server};server=XXXX;database=ABCD"
conn = pyodbc.connect(dsn)
cursor = conn.cursor()
#policy = cursor.execute("select distinct client_num, py_type, py_num, max(ex_date_n) as ex_date_n, max(ef_date_n) as ef_date_n from dbo.policy group by client_num, py_type, py_num")
policy = cursor.execute("select distinct client_num, py_type, py_num, max(ex_date_n) as ex_date_n,max(ef_date_n) as ef_date_n from dbo.policy where client_num = 62961 and py_type = 'A' and py_num = '01' group by client_num, py_type, py_num")
results1 = cursor.fetchall()
for row in results1:
    pol_client_num = row.client_num.strip()
    pol_py_type = row.py_type.strip()
    pol_py_num = row.py_num.strip()
    pol_number = pol_client_num + pol_py_type + pol_py_num
    pol_exp_date = row.ex_date_n
    pol_eff_date = row.ef_date_n
    #related = cursor.execute("select distinct a.client_num,a.py_type,a.py_num,a.rclient_num,a.py_rtype,a.rpy_num from policy_replace a where a.client_num = "+pol_client_num+" and a.py_type = '"+pol_py_type+"' and a.py_num = '"+pol_py_num+"'")
    related = cursor.execute("select distinct a.client_num,a.py_type,a.py_num,a.rclient_num,a.py_rtype,a.rpy_num from policy_replace a where a.client_num = 62961 and a.py_type = 'A' and a.py_num = 01")
    results2 = cursor.fetchall()
    for row in results2:
        rel_client_num = row.rclient_num.strip()
        rel_py_type = row.py_rtype.strip()
        rel_py_num = row.rpy_num.strip()
        rel_pol_number = rel_client_num + rel_py_type + rel_py_num
        related_dates = cursor.execute("select max(b.ex_date_n) as ex_date_n, b.ef_date_n from policy b where b.ex_date_n >= 20200225 and b.client_num = "+rel_client_num+" and b.py_type = '"+rel_py_type+"' and b.py_num = '"+rel_py_num+"' group by b.ef_date_n")
        #related_dates = cursor.execute("select max(b.ex_date_n) as ex_date_n, b.ef_date_n from policy b where b.client_num = 37916 and b.py_type = 'F' and b.py_num = 05 group by b.ef_date_n")
        results3 = cursor.fetchall()
        for row in results3:
            rel_exp_date = row.ex_date_n
            rel_eff_date = row.ef_date_n
            final_result1 = (pol_number, pol_exp_date, pol_eff_date, rel_pol_number, rel_exp_date, rel_eff_date)
            df = pd.DataFrame(final_result1)
            df = df.transpose()
            df.columns = ['pol_number','pol_exp_date','pol_eff_date','rel_pol_number','rel_exp_date','rel_eff_date']
            df_grouped = df.groupby('pol_number')['rel_exp_date'].min()
            print(df_grouped)
print('done')
On execution, the following is the data output:
For results1,
'62961 ', 'A', '01', 20210429, 20200429
For results2,
('62961 ', 'A', '01', '62961', 'A', '02'),
('62961 ', 'A', '01', '62961', 'A', '03'),
('62961 ', 'A', '01', '63832', 'A', '01')
For results3,
[(20201107, 20191107)]
[(20210407, 20200407)]
[(20200719, 20190719)]
The expected output is as follows:
'69621','A','01',20210429,20200429,'69621','A','02',20201107,20191107,'62961','A','03',20210407,20200407,'63832','A','01',20200719,20190719,'63832','A','01',20200719,20190719
The format of the output required is:
for every row in results1 --- every related row in results2 --- every related date in results3 --- the minimum of the exp_date across all rows and the corresponding pol_num/rel_pol_num
This is the reason I am trying to use the df.min() function to get the minimum across all the exp_dates, but it doesn't seem to do the job, since I am possibly missing something. I also tried axis=0 as mentioned in the comments, but it didn't work. Any direction is appreciated.
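One sketch of a way to make the minimum work (not the asker's exact code; it assumes the tuples from the innermost loop are collected into a list so the DataFrame is built once, after the loops, instead of being rebuilt for every row):
import pandas as pd

# Toy rows standing in for the tuples collected inside the innermost loop;
# in the real script you would do
#   rows.append((pol_number, pol_exp_date, pol_eff_date,
#                rel_pol_number, rel_exp_date, rel_eff_date))
rows = [
    ('62961A01', 20210429, 20200429, '62961A02', 20201107, 20191107),
    ('62961A01', 20210429, 20200429, '62961A03', 20210407, 20200407),
    ('62961A01', 20210429, 20200429, '63832A01', 20200719, 20190719),
]

# Build one DataFrame after the loops instead of rebuilding it per row.
df = pd.DataFrame(rows, columns=['pol_number', 'pol_exp_date', 'pol_eff_date',
                                 'rel_pol_number', 'rel_exp_date', 'rel_eff_date'])

# Minimum rel_exp_date per policy, keeping the rel_pol_number it came from.
idx = df.groupby('pol_number')['rel_exp_date'].idxmin()
print(df.loc[idx, ['pol_number', 'rel_pol_number', 'rel_exp_date']])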
I am trying to parameterize some parts of a SQL Query using the below dictionary:
query_params = dict(
    {'target': 'status',
     'date_from': '201712',
     'date_to': '201805',
     'drform_target': 'NPA'
     })
sql_data_sample = str("""select *
from table_name
where dt = %(date_to)s
and %(target)s in (%(drform_target)s)
----------------------------------------------------
union all
----------------------------------------------------
(select *
from table_name
where dt = %(date_from)s
and %(target)s in ('ACT')
order by random() limit 50000);""")
df_data_sample = pd.read_sql(sql_data_sample,con = cnxn,params = query_params)
However, this returns a dataframe with no records at all. I am not sure what the error is, since no error is being thrown.
df_data_sample.shape
Out[7]: (0, 1211)
The final PostgreSql query would be:
select *
from table_name
where dt = '201805'
and status in ('NPA')
----------------------------------------------------
union all
----------------------------------------------------
(select *
from table_name
where dt = '201712'
and status in ('ACT')
order by random() limit 50000);-- This part of random() is only for running it on my local and not on server.
Below is a small sample of data for replication. The original data has more than a million records and 1211 columns
service_change_3m service_change_6m dt grp_m2 status
0 -2 201805 $50-$75 NPA
0 0 201805 < $25 NPA
0 -1 201805 $175-$200 ACT
0 0 201712 $150-$175 ACT
0 0 201712 $125-$150 ACT
-1 1 201805 $50-$75 NPA
Can someone please help me with this?
UPDATE:
Based on the suggestion by @shmee, I am finally using:
target = 'status'
query_params = dict(
    {
        'date_from': '201712',
        'date_to': '201805',
        'drform_target': 'NPA'
    })
sql_data_sample = str("""select *
from table_name
where dt = %(date_to)s
and {0} in (%(drform_target)s)
----------------------------------------------------
union all
----------------------------------------------------
(select *
from table_name
where dt = %(date_from)s
and {0} in ('ACT')
order by random() limit 50000);""").format(target)
df_data_sample = pd.read_sql(sql_data_sample,con = cnxn,params = query_params)
Yes, I am quite confident that your issue results from trying to set column names in your query via parameter binding (and %(target)s in ('ACT')) as mentioned in the comments.
This results in your query restricting the result set to records where 'status' in ('ACT') (i.e. Is the string 'status' an element of a list containing only the string 'ACT'?). This is, of course, false, hence no record gets selected and you get an empty result.
This should work as expected:
from psycopg2 import sql
col_name = 'status'
table_name = 'public.churn_data'
query_params = {'date_from': '201712',
                'date_to': '201805',
                'drform_target': 'NPA'
                }
sql_data_sample = """select *
from {0}
where dt = %(date_to)s
and {1} in (%(drform_target)s)
----------------------------------------------------
union all
----------------------------------------------------
(select *
from {0}
where dt = %(date_from)s
and {1} in ('ACT')
order by random() limit 50000);"""
sql_data_sample = sql.SQL(sql_data_sample).format(sql.Identifier(table_name),
                                                  sql.Identifier(col_name))
df_data_sample = pd.read_sql(sql_data_sample,con = cnxn,params = query_params)
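As a quick sanity check (a small addition, not part of the original answer), the composed query can be rendered back to plain text before it is executed; as_string needs a live psycopg2 connection, cnxn here, to know how to quote the identifiers:
# Print the statement with the identifiers filled in; the %(...)s parameter
# placeholders are left for the driver to resolve at execution time.
print(sql_data_sample.as_string(cnxn))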