I have two databases in Snowflake, DB1 and DB2. The data is migrated from DB1 to DB2, so the schema and the table names are the same.
Assume DB1.SCHEMA_1.TABLE_1 has this data:
STATE_ID STATE
1 AL
2 AN
3 AZ
4 AR
5 CA
6 AD
7 PN
8 AP
9 JH
10 TX
12 LA
and
Assume DB2.SCHEMA_1.TABLE_1 has this data:
STATE_ID STATE
1 AL
2 AK
3 AZ
4 AR
5 AC
6 AD
7 GP
8 AP
9 JH
10 HA
They both have one more column 'record_created_timestamp' but I drop it in the code.
I wrote a PySpark script that performs a column-based comparison and runs in an AWS Glue job. I got help from this link: Generate a report of mismatch Columns between 2 Pyspark dataframes
My code is :
import sys
from pyspark.sql.session import SparkSession
from pyspark.sql import Row
from pyspark.sql.functions import concat, col, lit, to_timestamp, when
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from py4j.java_gateway import java_import
import os
from pyspark.sql.types import *
from pyspark.sql.functions import substring
from pyspark.sql.functions import array, count, first
import json
import datetime
import time
import boto3
from botocore.exceptions import ClientError
now = datetime.datetime.now()
year = now.strftime("%Y")
month = now.strftime("%m")
day = now.strftime("%d")
glueClient = boto3.client('glue')
ssmClient = boto3.client('ssm')
region = os.environ['AWS_DEFAULT_REGION']
SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'CONNECTION_INFO', 'TABLE_NAME', 'BUCKET_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
client = boto3.client("secretsmanager", region_name="us-east-1")
get_secret_value_response = client.get_secret_value(
SecretId=args['CONNECTION_INFO']
)
secret = get_secret_value_response['SecretString']
secret = json.loads(secret)
db_username = secret.get('db_username')
db_password = secret.get('db_password')
db_warehouse = secret.get('db_warehouse')
db_url = secret.get('db_url')
db_account = secret.get('db_account')
db_name = secret.get('db_name')
db_schema = secret.get('db_schema')
logger = glueContext.get_logger()
logger.info('Fetching configuration.')
job.init(args['JOB_NAME'], args)
java_import(spark._jvm, SNOWFLAKE_SOURCE_NAME)
spark._jvm.net.snowflake.spark.snowflake.SnowflakeConnectorUtils.enablePushdownSession(spark._jvm.org.apache.spark.sql.SparkSession.builder().getOrCreate())
sfOptions = {
"sfURL" : db_url,
"sfAccount" : db_account,
"sfUser" : db_username,
"sfPassword" : db_password,
"sfSchema" : db_schema,
"sfDatabase" : db_name,
"sfWarehouse" : db_warehouse
}
print(f'database: {db_name}')
print(f'db_warehouse: {db_warehouse}')
print(f'db_schema: {db_schema}')
print(f'db_account: {db_account}')
table_name = args['TABLE_NAME']
bucket_name = args['BUCKET_NAME']
MySql_1 = f"""
select * from DB1.SCHEMA_1.TABLE_1
"""
df = spark.read.format("snowflake").options(**sfOptions).option("query", MySql_1).load()
df1 = df.drop('record_created_timestamp')
MySql_2 = f"""
select * from DB2.SCHEMA_1.TABLE_1
"""
df2 = spark.read.format("snowflake").options(**sfOptions).option("query", MySql_2).load()
df3 = df2.drop('record_created_timestamp')
# list of columns to be compared
cols = df1.columns[1:]
df_new = (df1.join(df3, "state_id", "outer")
.select([ when(~df1[c].eqNullSafe(df3[c]), array(df1[c], df3[c])).alias(c) for c in cols ])
.selectExpr('stack({},{}) as (Column_Name, mismatch)'.format(len(cols), ','.join('"{0}",`{0}`'.format(c) for c in cols)))
.filter('mismatch is not NULL'))
df_newv1 = df_new.selectExpr('Column_Name', 'mismatch[0] as Mismatch_In_DB1_Table', 'mismatch[1] as Mismatch_In_DB2_Table')
df_newv1.show()
SNOWFLAKE_SOURCE_NAME = "snowflake"
job.commit()
This provides me the correct output:
Column_Name Mismatch_In_DB1_Table Mismatch_In_DB2_Table
STATE AN AK
STATE CA AC
STATE PN GP
STATE TX HA
If I use STATE instead of STATE_ID as the outer join key,
df_new = (df1.join(df2, "state", "outer")
It shows this error.
AnalysisException: 'Resolved attribute(s) STATE#1,STATE#9 missing from STATE#14,STATE_ID#0,STATE_ID#8 in operator !Project [CASE WHEN NOT (STATE#1 <=> STATE#9) THEN array(STATE#1, STATE#9) END AS STATE#18]. Attribute(s) with the same name appear in the operation: STATE,STATE. Please check if the right attribute(s) are used.;;\n!Project [CASE WHEN NOT (STATE#1 <=> STATE#9) THEN array(STATE#1, STATE#9) END AS STATE#18]\n+- Project [coalesce(STATE#1, STATE#9) AS STATE#14, STATE_ID#0, STATE_ID#8]\n +- Join FullOuter, (STATE#1 = STATE#9)\n :- Project [STATE_ID#0, STATE#1]\n : +- Relation[STATE_ID#0,STATE#1,RECORD_CREATED_TIMESTAMP#2] SnowflakeRelation\n +- Relation[STATE_ID#8,STATE#9] SnowflakeRelation\n
I would appreciate an explanation of this, and I want to know if there is a way this could run even if I give STATE as the key.
Or, if there is some other code with which I can get the same output without this error, that would help too.
It seems that Spark is getting confused by the column names of the two dataframes. Try giving them aliases to make sure they match:
df1 = df.drop('record_created_timestamp')\
.select(df.STATE_ID.alias('state_id'), df.STATE.alias('state'))
df3 = df.drop('record_created_timestamp')\
.select(df.STATE_ID.alias('state_id'), df.STATE.alias('state'))
Also, make sure STATE_ID has no spaces or special characters in its column name.
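If you do want STATE as the join key, one way to avoid the ambiguity error (a sketch based on df1 and df3 from the code above, not the author's exact fix) is to suffix every column of the right-hand dataframe before joining, so each reference stays unambiguous:
from pyspark.sql.functions import array, col, when

# Suffix the right-hand columns so Spark never sees two attributes named STATE.
right = df3.select([col(c).alias(c + '_r') for c in df3.columns])
cols = [c for c in df1.columns if c != 'STATE']

df_new = (df1.join(right, df1['STATE'] == right['STATE_r'], 'outer')
    .select([when(~col(c).eqNullSafe(col(c + '_r')), array(col(c), col(c + '_r'))).alias(c) for c in cols])
    .selectExpr('stack({},{}) as (Column_Name, mismatch)'.format(len(cols), ','.join('"{0}",`{0}`'.format(c) for c in cols)))
    .filter('mismatch is not NULL'))
df_new.selectExpr('Column_Name', 'mismatch[0] as Mismatch_In_DB1_Table', 'mismatch[1] as Mismatch_In_DB2_Table').show()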
From these commands:
from stackapi import StackAPI
lst = ['11786778','12370060']
df = pd.DataFrame(lst)
SITE = StackAPI('stackoverflow', key="xxxx")
results = []
for i in range(1,len(df)):
    SITE.max_pages=10000000
    SITE.page_size=100
    post = SITE.fetch('/users/{ids}/reputation-history', ids=lst[i])
    results.append(post)
The results variable holds the responses in JSON format.
How is it possible to convert the results variable to a dataframe with five columns?
reputation_history_type, reputation_change, post_id, creation_date,
user_id
Here, try this:
from stackapi import StackAPI
import pandas as pd
lst = ['11786778','12370060']
SITE = StackAPI('stackoverflow')
results = []
SITE.max_pages=10000000
SITE.page_size=100
for i in lst:
    post = SITE.fetch('/users/{ids}/reputation-history', ids=[i]).get('items')
    results.extend([list(j.values()) for j in post])
df = pd.DataFrame(results, columns = ['reputation_history_type', 'reputation_change', 'post_id', 'creation_date', 'user_id'])
Output:
print(df.head()) gives:
reputation_history_type reputation_change post_id creation_date user_id
0 asker_accepts_answer 2 59126012 1575207944 11786778.0
1 post_undownvoted 2 59118819 1575139301 11786778.0
2 post_upvoted 10 59118819 1575139301 11786778.0
3 post_downvoted -2 59118819 1575139299 11786778.0
4 post_upvoted 10 59110166 1575094452 11786778.0
print(df.tail()) gives:
reputation_history_type reputation_change post_id creation_date user_id
170 post_upvoted 10 58906292 1574036540 12370060.0
171 answer_accepted 15 58896536 1573990105 12370060.0
172 post_upvoted 10 58896044 1573972834 12370060.0
173 post_downvoted 0 58896299 1573948372 12370060.0
174 post_downvoted 0 58896158 1573947435 12370060.0
NOTE:
You can just create a dataframe directly from the result, which will be a list of lists.
You don't need to set SITE.max_pages and SITE.page_size every time you loop through the list.
from stackapi import StackAPI
import pandas as pd

lst = ['11786778', '12370060']
SITE = StackAPI('stackoverflow', key="xxxx")
SITE.max_pages = 10000000
SITE.page_size = 100

results = []
for user_id in lst:
    post = SITE.fetch('/users/{ids}/reputation-history', ids=[user_id])
    # keep only the list of reputation-history records from each response
    results.extend(post.get('items', []))

data = [[item.get('reputation_history_type'), item.get('reputation_change'),
         item.get('post_id'), item.get('creation_date'), item.get('user_id')]
        for item in results]
df = pd.DataFrame(data, columns=['reputation_history_type', 'reputation_change',
                                 'post_id', 'creation_date', 'user_id'])
print(df)
Kinda flying blind since I maxed out my Stack Overflow API limit, but this should work:
from stackapi import StackAPI
import pandas as pd
from pandas.io.json import json_normalize
lst = ['11786778','12370060']
SITE = StackAPI('stackoverflow', key="xxx")
results = []
for ids in lst:
    SITE.max_pages=10000000
    SITE.page_size=100
    post = SITE.fetch('/users/{ids}/reputation-history', ids=ids)
    results.append(json_normalize(post, 'items'))
df = pd.concat(results, ignore_index=True)
json_normalize converts the JSON to a dataframe.
pd.concat concatenates the dataframes together into a single frame.
Why are Presto timestamp/decimal(38,18) data types returned as strings (enclosed in u'') instead of Python datetime/numeric types?
presto jdbc:
select typeof(col1),typeof(col2),typeof(col3),typeof(col4),typeof(col5),typeof(col6) from hive.x.y
result is
timestamp timestamp bigint decimal(38,18) varchar varchar
desc hive.x.y
#result is
for_dt timestamp NO NO NO NO 1
for_d timestamp NO NO NO NO 2
for_h bigint NO NO NO NO 3
value decimal(38,18) NO NO NO NO 4
metric varchar(2147483647) NO NO NO NO 5
lat_lon varchar(2147483647) NO NO NO NO 6
attempt 1
#python
from sqlalchemy.engine import create_engine
engine = create_engine('presto://u:p@host:port',connect_args={'protocol': 'https', 'requests_kwargs': {'verify': 'mypem'}})
result = engine.execute('select * from hive.x.y limit 1')
print(result.fetchall())
#result is
[(u'2010-02-18 03:00:00.000', u'2010-02-18 00:00:00.000', 3, u'-0.191912651062011660', u'hey', u'there')]
attempt 2
#python
from pyhive import presto
import requests
from requests.auth import HTTPBasicAuth
req_kw = {
'verify': 'mypem',
'auth': HTTPBasicAuth('u', 'p')
}
cursor = presto.connect(
host='host',
port=port,
protocol='https',
username='u',
requests_kwargs=req_kw,
).cursor()
query = '''select * from x.y limit 1'''
cursor.execute(query)
print cursor.fetchall()
#result is
[(u'2010-02-18 03:00:00.000', u'2010-02-18 00:00:00.000', 3, u'-0.191912651062011660', u'hey', u'there')]
The output you are getting from your SQL query comes from the database in that format.
You have two choices:
Map the Data Yourself (Write Your Own ORM)
Learn to use the ORM
Option 1
Note I've just hardcoded your query result in here for my testing.
from sqlalchemy.engine import create_engine
from datetime import datetime
from decimal import Decimal
# 2010-02-18 03:00:00.000
dateTimeFormat = "%Y-%m-%d %H:%M:%S.%f"
class hivexy:
    def __init__(self, for_dt, for_d, for_h, value, metric, lat_lon):
        self.for_dt = for_dt
        self.for_d = for_d
        self.for_h = for_h
        self.value = value
        self.metric = metric
        self.lat_lon = lat_lon

    # Pretty printing on print(hivexy)
    def __str__(self):
        baseString = ("for_dt: {}\n"
                      "for_d: {}\n"
                      "for_h: {}\n"
                      "value: {}\n"
                      "metric: {}\n"
                      "lat_lon: {}\n")
        return baseString.format(self.for_dt, self.for_d, self.for_h,
                                 self.value, self.metric, self.lat_lon)

#engine = create_engine('presto://u:p@host:port',connect_args={'protocol': 'https', 'requests_kwargs': {'verify': 'mypem'}})
#results = engine.execute("select * from hive.x.y limit 1")
results = [(u'2010-02-18 03:00:00.000', u'2010-02-18 00:00:00.000', 3, u'-0.191912651062011660', u'hey', u'there')]

hiveObjects = []
for row in results:
    for_dt = datetime.strptime(row[0], dateTimeFormat)
    for_d = datetime.strptime(row[1], dateTimeFormat)
    for_h = row[2]
    value = Decimal(row[3])
    metric = row[4]
    lat_lon = row[5]
    hiveObjects.append(hivexy(for_dt, for_d, for_h, value, metric, lat_lon))

for hiveObject in hiveObjects:
    print(hiveObject)
Option 2
This uses reflection: it queries the database metadata for field types so you don't have to do all of that work from Option 1.
from sqlalchemy import *
from sqlalchemy.engine import create_engine
from sqlalchemy.schema import *
engine = create_engine('presto://u:p@host:port',connect_args={'protocol': 'https', 'requests_kwargs': {'verify': 'mypem'}})
# Reflection - SQLAlchemy will get metadata from database including field types
hiveXYTable = Table('hive.x.y', MetaData(bind=engine), autoload=True)
s = select([hiveXYTable]).limit(1)
results = engine.execute(s)
for row in results:
    print(row)
I have files A and B, which are exactly the same. I am trying to perform inner and outer joins on these two dataframes. Since all the columns are duplicated, the existing answers were of no help.
The other questions that I have gone through have only a column or two duplicated; my issue is that the whole files are duplicates of each other, both in data and in column names.
My code:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql import DataFrameReader, DataFrameWriter
from datetime import datetime
import time
# #params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
print("All imports were successful.")
df = spark.read.orc(
's3://****'
)
print("First dataframe read with headers set to True")
df2 = spark.read.orc(
's3://****'
)
print("Second dataframe read with headers set to True")
# df3 = df.join(df2, ['c_0'], "outer")
# df3 = df.join(
# df2,
# df["column_test_1"] == df2["column_1"],
# "outer"
# )
df3 = df.alias('l').join(df2.alias('r'), on='c_0') #.collect()
print("Dataframes have been joined successfully.")
output_file_path = 's3://****'
df3.write.orc(
output_file_path
)
print("Dataframe has been written to csv.")
job.commit()
The error that I am facing is:
pyspark.sql.utils.AnalysisException: u'Duplicate column(s): "c_4", "c_38", "c_13", "c_27", "c_50", "c_16", "c_23", "c_24", "c_1", "c_35", "c_30", "c_56", "c_34", "c_7", "c_46", "c_49", "c_57", "c_45", "c_31", "c_53", "c_19", "c_25", "c_10", "c_8", "c_14", "c_42", "c_20", "c_47", "c_36", "c_29", "c_15", "c_43", "c_32", "c_5", "c_37", "c_18", "c_54", "c_3", "__created_at__", "c_51", "c_48", "c_9", "c_21", "c_26", "c_44", "c_55", "c_2", "c_17", "c_40", "c_28", "c_33", "c_41", "c_22", "c_11", "c_12", "c_52", "c_6", "c_39" found, cannot save to file.;'
There is no shortcut here. PySpark expects the left and right dataframes to have distinct sets of field names (with the exception of the join key).
One solution would be to prefix each field name with either a "left_" or "right_" as follows:
# Obtain columns lists
left_cols = df.columns
right_cols = df2.columns
# Prefix each dataframe's field with "left_" or "right_"
df = df.selectExpr([col + ' as left_' + col for col in left_cols])
df2 = df2.selectExpr([col + ' as right_' + col for col in right_cols])
# Perform join (the key columns are now prefixed as well)
df3 = df.join(df2, df['left_c_0'] == df2['right_c_0'])
Here is a helper function that joins two dataframes while adding aliases:
def join_with_aliases(left, right, on, how, right_prefix):
    # rename every non-key column on the right side with the given suffix
    renamed_right = right.selectExpr(
        [
            col + f" as {col}{right_prefix}"
            for col in right.columns
            if col not in on
        ]
        + on
    )
    return left.join(renamed_right, on=on, how=how)
and here is an example of how to use it:
df1 = spark.createDataFrame([[1, "a"], [2, "b"], [3, "c"]], ("id", "value"))
df2 = spark.createDataFrame([[1, "a"], [2, "b"], [3, "c"]], ("id", "value"))
join_with_aliases(
left=df1,
right=df2,
on=["id"],
how="inner",
right_prefix="_right"
).show()
+---+-----+------------+
| id|value|value_right|
+---+-----+------------+
| 1| a| a|
| 3| c| c|
| 2| b| b|
+---+-----+------------+
I did something like this in Scala; you can convert the same into PySpark as well.
Rename the columns in each dataframe:
dataFrame1.columns.foreach(columnName => {
  dataFrame1 = dataFrame1.select(dataFrame1.columns.head, dataFrame1.columns.tail: _*).withColumnRenamed(columnName, s"left_$columnName")
})
dataFrame2.columns.foreach(columnName => {
  dataFrame2 = dataFrame2.select(dataFrame2.columns.head, dataFrame2.columns.tail: _*).withColumnRenamed(columnName, s"right_$columnName")
})
Now join by referencing the renamed column names:
resultDF = dataFrame1.join(dataFrame2, dataFrame1("left_c_0") === dataFrame2("right_c_0"))
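For reference, here is a rough PySpark equivalent of the Scala above (a sketch, assuming df and df2 are the two dataframes from the question):
# Prefix every column on each side, then join on the prefixed key columns.
df_left = df.select([df[c].alias('left_' + c) for c in df.columns])
df_right = df2.select([df2[c].alias('right_' + c) for c in df2.columns])
result_df = df_left.join(df_right, df_left['left_c_0'] == df_right['right_c_0'])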
I am trying to parameterize some parts of a SQL Query using the below dictionary:
query_params = dict(
{'target':'status',
'date_from':'201712',
'date_to':'201805',
'drform_target':'NPA'
})
sql_data_sample = str("""select *
from table_name
where dt = %(date_to)s
and %(target)s in (%(drform_target)s)
----------------------------------------------------
union all
----------------------------------------------------
(select *,
from table_name
where dt = %(date_from)s
and %(target)s in ('ACT')
order by random() limit 50000);""")
df_data_sample = pd.read_sql(sql_data_sample,con = cnxn,params = query_params)
However, this returns a dataframe with no records at all. I am not sure what is wrong, since no error is being thrown.
df_data_sample.shape
Out[7]: (0, 1211)
The final PostgreSQL query would be:
select *
from table_name
where dt = '201805'
and status in ('NPA')
----------------------------------------------------
union all
----------------------------------------------------
(select *
from table_name
where dt = '201712'
and status in ('ACT')
order by random() limit 50000); -- the random() part is only for running it on my local machine, not on the server.
Below is a small sample of data for replication. The original data has more than a million records and 1211 columns.
service_change_3m service_change_6m dt grp_m2 status
0 -2 201805 $50-$75 NPA
0 0 201805 < $25 NPA
0 -1 201805 $175-$200 ACT
0 0 201712 $150-$175 ACT
0 0 201712 $125-$150 ACT
-1 1 201805 $50-$75 NPA
Can someone please help me with this?
UPDATE:
Based on the suggestion by @shmee, I am finally using:
target = 'status'
query_params = dict(
{
'date_from':'201712',
'date_to':'201805',
'drform_target':'NPA'
})
sql_data_sample = str("""select *
from table_name
where dt = %(date_to)s
and {0} in (%(drform_target)s)
----------------------------------------------------
union all
----------------------------------------------------
(select *,
from table_name
where dt = %(date_from)s
and {0} in ('ACT')
order by random() limit 50000);""").format(target)
df_data_sample = pd.read_sql(sql_data_sample,con = cnxn,params = query_params)
Yes, I am quite confident that your issue results from trying to set column names in your query via parameter binding (and %(target)s in ('ACT')) as mentioned in the comments.
This results in your query restricting the result set to records where 'status' in ('ACT') (i.e. Is the string 'status' an element of a list containing only the string 'ACT'?). This is, of course, false, hence no record gets selected and you get an empty result.
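One way to see this for yourself is to render the bound statement with the cursor's mogrify method (a sketch, assuming cnxn is a psycopg2 connection; on Python 3 mogrify returns bytes, but the rendered SQL is the point):
cur = cnxn.cursor()
print(cur.mogrify(
    "select * from table_name where dt = %(date_to)s and %(target)s in (%(drform_target)s)",
    {'target': 'status', 'date_to': '201805', 'drform_target': 'NPA'}
))
# roughly: select * from table_name where dt = '201805' and 'status' in ('NPA')
# 'status' is bound as a string literal, so the predicate can never be true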
This should work as expected:
from psycopg2 import sql
col_name = 'status'
table_name = 'public.churn_data'
query_params = {'date_from':'201712',
'date_to':'201805',
'drform_target':'NPA'
}
sql_data_sample = """select *
from {0}
where dt = %(date_to)s
and {1} in (%(drform_target)s)
----------------------------------------------------
union all
----------------------------------------------------
(select *
from {0}
where dt = %(date_from)s
and {1} in ('ACT')
order by random() limit 50000);"""
sql_data_sample = sql.SQL(sql_data_sample).format(sql.Identifier(table_name),
sql.Identifier(col_name))
df_data_sample = pd.read_sql(sql_data_sample,con = cnxn,params = query_params)
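To double-check what will actually be executed, you can render the composed statement against the connection first (assuming cnxn is the same psycopg2 connection used above):
print(sql_data_sample.as_string(cnxn))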
I have this Python code that pulls some data from a DB2 database into pandas dataframes data1 and data2. data2 has a column named ARCD which has text values such as '', '01', '03', '14', etc.
TABLE1
CSOFF CSDATE
ABC 20180101
ADV 20180212
AFS 20180121
ADF 20180202
ABC 20180115
TABLE2
AROFF  ARAMT  ARCD  ARTRDT
ABC    200          20180101
AFS    150    01    20180121
ADV    210          20180129
I need only those records in data3 where the value in ARCD is blank (i.e. '') or '01'.
I can get all the records that have codes like '01', '03', etc., but I am not able to pull records with blank values, i.e. ''.
import pyodbc
import pandas as pd
con = pyodbc.connect(
driver='{iSeries Access ODBC Driver}',
system='',
uid='',
pwd='')
cur = con.cursor()
query = """
SELECT * FROM QS99F.TABLE1 WHERE CSDATE > 20180100
"""
data1 = pd.read_sql(query,con,index_col = None)
query = """
SELECT * FROM QS99F.TABLE2 WHERE ARTRDT > 20180100
"""
data2 = pd.read_sql(query,con,index_col = None)
data3 = pd.merge(data1[['CSOFF','CSRATE']],data2[['AROFF','ARAMT','ARCD']],left_on=['CSOFF','CSMKT','CSSUFX'],right_on=['AROFF','ARMKT','ARSUFX'],how='inner')
dp = data3['ARCD'] == "01"
ar = data3['ARCD'] == ""
data3 = data3[dp|ar]
print (data3)
It is likely that your blank values are actually NaN rather than "".
You can change this as follows:
data3 = data3.fillna("")
before your tests.
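A minimal sketch of the full filter, assuming data3 has already been built by the merge above:
data3 = data3.fillna("")                       # turn NaN blanks into empty strings
data3 = data3[data3['ARCD'].isin(["", "01"])]  # keep only blank and '01' codes
print(data3)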