How to compare databases with tables using pandas - Python

I am trying to compare different databases to figure out whether the tables inside them are the same. For example, I have set it up as follows:
Database 'a' has only one table, called "abc"
Database 'b' has only one table, called "abc"
Database 'c' has two tables, called "abc" & "xyz"
I have written the following code and it runs fine, but as you can see from the output it prints "False" for both comparisons. Databases 'a' and 'b' each have only the one identical table, so I expect the first comparison to print "True", yet it prints "False". Databases 'b' and 'c' are not identical, because database 'c' has an extra table called 'xyz', so I expect the second comparison to print "False", which is correct.
Please let me know what is wrong with my code, or whether there is a workaround. Basically I want to diff two databases and check whether they contain the same tables or not.
import pandas as pd
import mysql.connector
mydb1 = mysql.connector.connect(host="localhost", user="xxxxxxxx", passwd="xxxxxxxx", database="a")
mydb2 = mysql.connector.connect(host="localhost", user="xxxxxxxx", passwd="xxxxxxxx", database="b")
mydb3 = mysql.connector.connect(host="localhost", user="xxxxxxxx", passwd="xxxxxxxx", database="c")
querystmt1 = "SHOW TABLES"
querystmt2 = "SHOW TABLES"
querystmt3 = "SHOW TABLES"
df1 = pd.read_sql(querystmt1, mydb1)
df2 = pd.read_sql(querystmt2, mydb2)
df3 = pd.read_sql(querystmt3, mydb3)
print(df1)
print(df2)
print(df3)
print(df1.equals(df2))
print(df2.equals(df3))

Since you are interested in the values of the dataframes, a solution is to convert the dataframes to dictionaries and then check whether the values are the same:
df1 = pd.read_sql(querystmt1, mydb1)
d1 = df1.to_dict()
df2 = pd.read_sql(querystmt2, mydb2)
d2 = df2.to_dict()
df3 = pd.read_sql(querystmt3, mydb3)
d3 = df3.to_dict()
# Checking
print(list(d1.values()) == list(d2.values())) # True
print(list(d2.values()) == list(d3.values())) # False
This is not the most computationally efficient way to do it (it involves a lot of type conversions), but it is sufficient for a one-off check.
If you want to check whether the two dataframes share at least one common value, you may use:
print(any(i in list(d3.values()) for i in list(d2.values())))
# The output is True since 'abc' is a table in both df2 and df3.
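Note that the underlying reason df1.equals(df2) returns False is that SHOW TABLES names its result column after the database (Tables_in_a vs Tables_in_b), so the column labels differ even when the table names match. A minimal sketch that compares only the table names themselves, reusing df1/df2/df3 from the question:
# Compare just the table names, ignoring the Tables_in_<db> column label.
tables1 = set(df1.iloc[:, 0])
tables2 = set(df2.iloc[:, 0])
tables3 = set(df3.iloc[:, 0])
print(tables1 == tables2)  # True  - both databases contain only 'abc'
print(tables2 == tables3)  # False - database 'c' also contains 'xyz'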

The headers are possibly different.
Try setting the headers to plain integer indexes before comparing:
df1.columns = range(df1.shape[1])
df2.columns = range(df2.shape[1])
df3.columns = range(df3.shape[1])
This works under the assumption that the column order is the same in all dataframes.
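With the headers replaced, the original comparison should then behave as expected (assuming the row order also matches):
print(df1.equals(df2))  # True  - same single table 'abc'
print(df2.equals(df3))  # False - df3 contains the extra table 'xyz'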

Try pd.testing.assert_frame_equal: it returns nothing if the two dataframes are equal and raises an AssertionError if they are not.
It accepts all sorts of keyword arguments to select what to check in the comparison (e.g. you can pass check_names=False if you don't want to check column names).
It is also explicit about where the dataframes differ: different sizes, different column names, different values - whatever it is, it will say so.
Give it a try!
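A minimal sketch of wrapping this in a True/False check with a hypothetical frames_equal helper, assuming the column headers have already been aligned (for example with the integer headers shown above):
from pandas.testing import assert_frame_equal

def frames_equal(left, right):
    # assert_frame_equal raises AssertionError with a descriptive message
    # when the frames differ; turn that into a boolean here.
    try:
        assert_frame_equal(left, right)
        return True
    except AssertionError:
        return False

print(frames_equal(df1, df2))  # True once the headers match
print(frames_equal(df2, df3))  # False - different number of tables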

Related

Optimal way to create a column by matching two other columns

The first df I have is one that has station codes and names, along with lat/long (not as relevant), like so:
code name latitude longitude
I have another df with start/end dates for travel times. This df has only the station code, not the station name, like so:
start_date start_station_code end_date end_station_code duration_sec
I am looking to add columns with the names of the start/end stations to the second df by matching the first df's "code" column against the second df's "start_station_code" / "end_station_code" columns.
I am relatively new to pandas, and was looking for a way to optimize doing this as my current method takes quite a while. I use the following code:
for j in range(0, len(df_stations)):
    for i in range(0, len(df)):
        if(df_stations['code'][j] == df['start_station_code'][i]):
            df['start_station'][i] = df_stations['name'][j]
        if(df_stations['code'][j] == df['end_station_code'][i]):
            df['end_station'][i] = df_stations['name'][j]
I am looking for a faster method, any help is appreciated. Thank you in advance.
Use merge. If you are familiar with SQL, merge is equivalent to a LEFT JOIN:
cols = ["code", "name"]
result = (
    second_df
    .merge(first_df[cols], left_on="start_station_code", right_on="code", how="left")
    .merge(first_df[cols], left_on="end_station_code", right_on="code", how="left")
    .rename(columns={"code_x": "start_station_code", "code_y": "end_station_code"})
)
The answer by #Code-Different is very nearly correct; however, the columns to be renamed are the name columns, not the code columns. For neatness you will likely also want to drop the extra code columns that the merges create. Using your names for the dataframes, df and df_stations, the code needed to produce the required dataframe is:
cols = ["code", "name"]
required_df = (
    df
    .merge(df_stations[cols], left_on="start_station_code", right_on="code")
    .merge(df_stations[cols], left_on="end_station_code", right_on="code")
    .rename(columns={"name_x": "start_station", "name_y": "end_station"})
    .drop(columns=['code_x', 'code_y'])
)
As you may notice, each merge gives the dataframe a duplicate 'code' column, which gets suffixed automatically ('_x', '_y'); this is the built-in default behaviour of merge. See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html for more detail.
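As a side note, since df_stations is essentially a lookup table, the same result can be obtained without a merge at all - a small sketch, assuming the values in df_stations['code'] are unique:
# Build a code -> name lookup Series once, then map both code columns onto it.
code_to_name = df_stations.set_index("code")["name"]
df["start_station"] = df["start_station_code"].map(code_to_name)
df["end_station"] = df["end_station_code"].map(code_to_name)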

Fill specific columns in a pandas dataframe rows with values from another dataframes values

I am trying to replace some missing and incorrect values in my master dataset by filling them in with correct values from two other datasets.
I created a miniature version of the full dataset like so (note that the real dataset is several thousand rows long):
import pandas as pd

data = {'From': ['GA0251','GA5201','GA5551','GA510A','GA5171','GA5151'],
        'To': ['GA0201_T','GA5151_T','GA5151_R','GA5151_V','GA5151_P','GA5171_B'],
        'From_Latitude': [55.86630869,0,55.85508787,55.85594626,55.85692217,55.85669934],
        'From_Longitude': [-4.27138731,0,-4.24126866,-4.24446585,-4.24516129,-4.24358251],
        'To_Latitude': [55.86614756,0,55.85522197,55.85593762,55.85693878,0],
        'To_Longitude': [-4.271040979,0,-4.241466534,-4.244607602,-4.244905037,0]}
dataset_to_correct = pd.DataFrame(data)
However, some values in the From lat/long and the To lat/long columns are incorrect. I have two tables like the ones below, one for To and one for From, whose values I would like to substitute into the main table in place of the two values for the matching row.
Table of corrected To lat/long:
data = {'Site': ['GA5151_T','GA5171_B'],
        'Correct_Latitude': [55.85952791,55.87044558],
        'Correct_Longitude': [55.85661767,-4.24358251]}
correct_to_coords = pd.DataFrame(data)
I would like to match this table to the To column and then replace the To_Latitude and To_Longitude with the correct values.
Table of corrected From lat/long:
data = {'Site': ['GA5201','GA0251'],
        'Correct_Latitude': [55.857577,55.86616756],
        'Correct_Longitude': [-4.242770,-4.272140979]}
correct_from_coords = pd.DataFrame(data)
I would like to match this table to the From column and then replace the From_Latitude and From_Longitude with the correct values.
Is there a way to match the Site in each table to the corresponding To or From value and then replace only the values in the respective columns?
I have tried using code from this answer (Elegant way to replace values in pandas.DataFrame from another DataFrame), but it seems to have no effect on the dataset:
(correct_to_coords.set_index('Site')
    .rename(columns={'Correct_Latitude': 'To_Latitude'})
    .combine_first(dataset_to_correct.set_index('To')))
#zswqa's answer produces the right result; #Anurag Dabas's doesn't.
Another possible solution: it is a bit faster than the merge method shown below, although both are correct.
dataset_to_correct.set_index("To",inplace=True)
correct_to_coords.set_index("Site",inplace=True)
dataset_to_correct.loc[correct_to_coords.index, "To_Latitude"] = correct_to_coords["Correct_Latitude"]
dataset_to_correct.loc[correct_to_coords.index, "To_Longitude"] = correct_to_coords["Correct_Longitude"]
dataset_to_correct.reset_index(inplace=True)
dataset_to_correct.set_index("From",inplace=True)
correct_from_coords.set_index("Site",inplace=True)
dataset_to_correct.loc[correct_from_coords.index, "From_Latitude"] = correct_from_coords["Correct_Latitude"]
dataset_to_correct.loc[correct_from_coords.index, "From_Longitude"] = correct_from_coords["Correct_Longitude"]
dataset_to_correct.reset_index(inplace=True)
merge = dataset_to_correct.merge(correct_to_coords, left_on='To', right_on='Site', how='left')
merge.loc[(merge.To == merge.Site), 'To_Latitude'] = merge.Correct_Latitude
merge.loc[(merge.To == merge.Site), 'To_Longitude'] = merge.Correct_Longitude
# del merge['Site']
# del merge['Correct_Latitude']
# del merge['Correct_Longitude']
merge = merge.drop(columns = ['Site','Correct_Latitude','Correct_Longitude'])
merge = merge.merge(correct_from_coords, left_on='From', right_on='Site', how='left')
merge.loc[(merge.From == merge.Site), 'From_Latitude'] = merge.Correct_Latitude
merge.loc[(merge.From == merge.Site), 'From_Longitude'] = merge.Correct_Longitude
# del merge['Site']
# del merge['Correct_Latitude']
# del merge['Correct_Longitude']
merge = merge.drop(columns = ['Site','Correct_Latitude','Correct_Longitude'])
merge
Let's try a dual merge using merge() + pop() + fillna() + drop():
dataset_to_correct = dataset_to_correct.merge(correct_to_coords, left_on='To', right_on='Site', how='left').drop(columns='Site')
dataset_to_correct['To_Latitude'] = dataset_to_correct.pop('Correct_Latitude').fillna(dataset_to_correct['To_Latitude'])
dataset_to_correct['To_Longitude'] = dataset_to_correct.pop('Correct_Longitude').fillna(dataset_to_correct['To_Longitude'])
dataset_to_correct = dataset_to_correct.merge(correct_from_coords, left_on='From', right_on='Site', how='left').drop(columns='Site')
dataset_to_correct['From_Latitude'] = dataset_to_correct.pop('Correct_Latitude').fillna(dataset_to_correct['From_Latitude'])
dataset_to_correct['From_Longitude'] = dataset_to_correct.pop('Correct_Longitude').fillna(dataset_to_correct['From_Longitude'])
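For reference, the replacement can also be written with DataFrame.update, which aligns on the index and overwrites matching cells in place - a rough sketch starting from the original dataset_to_correct, assuming the Site values are unique:
# Rename the correction columns so they line up with the target columns,
# then let update() overwrite only the rows whose index (To/From) matches.
to_fix = correct_to_coords.set_index('Site').rename(
    columns={'Correct_Latitude': 'To_Latitude', 'Correct_Longitude': 'To_Longitude'})
from_fix = correct_from_coords.set_index('Site').rename(
    columns={'Correct_Latitude': 'From_Latitude', 'Correct_Longitude': 'From_Longitude'})

dataset_to_correct = dataset_to_correct.set_index('To')
dataset_to_correct.update(to_fix)
dataset_to_correct = dataset_to_correct.reset_index().set_index('From')
dataset_to_correct.update(from_fix)
dataset_to_correct = dataset_to_correct.reset_index()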

Spark dataframe to dict with set

I'm having an issue with the output of my Spark dataframe. The file can range from a few GB to 50+ GB.
SparkDF = spark.read.format("csv").options(header="true", delimiter="|", maxColumns="100000").load("my_file.csv")
This gives me the correct DF that I want. But as per the requirement, I need to have the column name as the key and all the values of that column, collected into a set, as the value.
For example:
d = {'col1': ['1', '2', '3', '4'], 'col2': ['Jean', 'Cecil', 'Annie', 'Maurice'], 'col3': ['test', 'aaa', 'bbb', 'ccc', 'ddd']}
df = pd.DataFrame(data=d)
Should give me at the end:
{'col1': {'1', '2', '3', '4'}, 'col2': {'Jean', 'Cecil', 'Annie', 'Maurice'}, 'col3': {'test', 'aaa', 'bbb', 'ccc', 'ddd'}}
I've implemented the following:
def columnDict(dataFrame):
    colDict = dict(zip(dataFrame.schema.names, zip(*dataFrame.collect())))
    return colDict if colDict else dict.fromkeys(dataFrame.schema.names, ())
However, it returned a dict with tuples as values, not sets as I require.
I would like either to convert the tuples in the dictionary into sets, or to get a dictionary of sets directly as the output of my function.
EDIT:
For the full requirements:
Besides the dictionary mentioned above, there is another one that contains similar data for checking.
That is, the file that I load into a Spark DF and transform into a dictionary contains data that must be checked against the other dictionary.
The goal is to check every key of my dict (the loaded file) against the check dictionary: first to see whether it exists, and if it does, to check whether the values of that key are a subset of the check values.
If I load the check data into a dataframe it would look like this (note that I may not be able to change the fact that it's a dict; I will see if I can modify it from a dict to a Spark DF):
df = {'KeyName': ['col1', 'col2', 'col3'], 'ValueName': ['1, 2, 3, 4', 'Jean, Cecil, Annie, Maurice, Annie, Maurice', 'test, aaa, bbb, ccc,ddd,eee']}
df = pd.DataFrame(data=df)
print(df)
KeyName ValueName
0 col1 1, 2, 3, 4
1 col2 Jean, Cecil, Annie, Maurice, Annie, Maurice
2 col3 test, aaa, bbb, ccc,ddd,eee
So, in the end, the data in my file should be a subset of the row that has the same KeyName as my dict key.
I'm slightly stuck with legacy code and struggling a little to migrate it to Spark on Databricks.
EDIT 2:
Hopefully this will work. I uploaded the two files with modified data:
https://filebin.net/1rnnvqn2b0ww7qc8
FakeData.csv contains the data that I load on my side with the above code; it must be a subset of the second one.
FakeDataChecker.csv contains the data that is the actual full set available.
EDIT 3:
Forgot to add that all empty strings in FakeData should not be taken into account, nor should the ones in FakeDataChecker.
So I'm not sure I have understood your use case perfectly, but let's try a first draft.
From what I understand, you have a first file with all your data, and a checker file with the keys that need to be present in the data for each column. Additional keys present in the data should be filtered out.
This can be done with an inner join between your initial data and the data checker. If there aren't too many keys in the data checker, Spark should automatically broadcast the data checker dataframe for optimized joins.
Here is a first draft of the code; it isn't completely automated yet, pending your first questions and remarks.
First let's import the needed functions and the data:
from pyspark.sql.functions import col
from pyspark.sql import Window

spark.sql("set spark.sql.caseSensitive=true")

data = (
    spark
    .read
    .format("csv")
    .options(header=True, delimiter="|", maxColumns="100000")
    .load("FakeData.csv")
    .na.drop()
)
data_checker = (
    spark
    .read
    .format("csv")
    .options(header=True, delimiter="|", maxColumns="100000")
    .load("FakeDataChecker.csv")
    .na.drop(subset=["ValueName"])
)
We drop null values as you need; you can specify the wanted columns with the subset keyword.
Then let's prepare the dataframes for the joins:
data_checker_date = data_checker.filter(col("KeyName") == "DATE").select(col("ValueName").alias("date"))
data_checker_location = data_checker.filter(col("KeyName") == "LOCATION").select(col("ValueName").alias("location"))
data_checker_location_id = data_checker.filter(col("KeyName") == "LOCATIONID").select(col("ValueName").alias("locationid"))
data_checker_type = data_checker.filter(col("KeyName") == "TYPE").select(col("ValueName").alias("type"))
We need to alias the columns during the joins to avoid duplicated column names. And we set the case-sensitive option so that when we drop the lowercase columns we don't also drop the initial ones in CAPS.
Finally, we filter out, through inner joins, all keys not present in the data checker:
(
    data
    .join(data_checker_date, data.DATE == data_checker_date.date)
    .join(data_checker_location, data.LOCATION == data_checker_location.location)
    .join(data_checker_location_id, data.LOCATIONID == data_checker_location_id.locationid)
    .join(data_checker_type, data.TYPE == data_checker_type.type)
    .drop("date", "location", "locationid", "type")
    .show()
)
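If Spark does not broadcast the small checker dataframes on its own, the hint can also be made explicit - a minimal sketch for one of the joins, using the same dataframes as above:
from pyspark.sql.functions import broadcast

# Force a broadcast (map-side) join against the small lookup dataframe.
(
    data
    .join(broadcast(data_checker_date), data.DATE == data_checker_date.date)
    .drop("date")
    .show()
)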
In a next step, we can automate this by retrieving the distinct KeyNames of the columns (e.g. "DATE", "LOCATION", etc.), so we don't have to copy-paste the code 4 times, or X times in the future. Something along the lines of the sketch below:
from pyspark.sql.functions import collect_set

distinct_keynames = data_checker.select(collect_set('KeyName').alias('KeyName')).first()['KeyName']
for keyname in distinct_keynames:
    # etc... implement the logic of chaining joins
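A rough sketch of what that loop body could look like, under the assumption that every distinct KeyName in the checker corresponds to a column of the same name in data:
result = data
for keyname in distinct_keynames:
    # One-column lookup for this key, aliased to lower case as above.
    lookup = (
        data_checker
        .filter(col("KeyName") == keyname)
        .select(col("ValueName").alias(keyname.lower()))
    )
    result = result.join(lookup, col(keyname) == col(keyname.lower())).drop(keyname.lower())
result.show()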

use dictionary value as a variable for df

I'm importing multiple dataframes and wrote the following process: 1. a list of files to be converted to dataframes, 2. a list of names I want for the corresponding dataframes, and 3. a dictionary combining the two lists:
tbls = ['tbl1', 'tbl2', 'tbl3']
dbname = ['dfABC', 'dfrand', 'dfXYZ']
dictdf = dict(zip(tbls, dbname))
Then I cycle through tbls to import the dataframes. (getdf below is a short function I wrote that reads the path, sheet name, etc. for the Excel/CSV file in which the table (data) sits and imports the data.)
for tbl in tbls:
    dictdf[tbl] = getdf(tbl, dfRT, sfsession)
The process works, except that the dataframes are written into the dictionary, i.e. dfABC in the dictionary is replaced with a dataframe of 65K rows and 27 cols, and so on.
What I want instead is an actual variable dfABC equal to that dataframe of 65K rows and 27 cols. I tried:
str(dictdf[tbl]) = getdf(tbl, dfRT, sfsession)
but that gave an error. Is there a way to do this? Thanks.
Solved using exec and flipping the dictionary (the flip isn't needed for the solution):
tbls = ['tbl1', 'tbl2', 'tbl3']
dfs = ['dfABC', 'dfrand', 'dfXYZ']
dictdf = dict(zip(dfs, tbls))
for df in dfs:
    tbl = dictdf[df]
    exec(f"{df} = getdf('{tbl}', dfRT, sfsession)")
Please note #Xukrao's and #Yo_Chris's comments on keeping the dfs within the dictionary as a superior solution.
I found this question useful for understanding how exec works: What's the difference between eval, exec, and compile?
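For completeness, the dictionary-based approach those comments recommend would look roughly like this - no exec needed, and each dataframe stays accessible by name (frames is a hypothetical dict name; tbls, dbname and getdf are from the question):
frames = {}
for tbl, name in zip(tbls, dbname):
    frames[name] = getdf(tbl, dfRT, sfsession)

# e.g. the 65K x 27 dataframe is then available as:
frames['dfABC'].head()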

Pandas, append column based on unique subset of column values

I have a dataframe with many rows. I am appending a column using data produced by a custom function, like this:
import numpy
df['new_column'] = numpy.vectorize(fx)(df['col_a'], df['col_b'])
# takes 180964.377 ms
It is working fine; what I am trying to do is speed it up. There is really only a small group of unique combinations of col_a and col_b, so many of the iterations are redundant. I was thinking maybe pandas would just figure that out on its own, but I don't think that is the case. Consider this:
print(len(df.index))  # prints 127255
df_unique = df.copy().drop_duplicates(['col_a', 'col_b'])
print(len(df_unique.index))  # prints 9834
I also convinced myself of the possible speedup by running this:
df_unique['new_column'] = numpy.vectorize(fx)(df_unique['col_a'], df_unique['col_b'])
# takes 14611.357 ms
Since there is a lot of redundant data, what I am trying to do is update the large dataframe (df, 127255 rows) while running the fx function only the minimum number of times (9834 times), because of all the duplicate rows for col_a and col_b. Of course this means there will be multiple rows in df with the same values for col_a and col_b, but that is OK; the other columns of df are different and make each row unique.
Before I write a plain iterative for loop over the df_unique dataframe and do a conditional update on df, I wanted to ask whether there is a more "pythonic", neat way of doing this kind of update. Thanks a lot.
** UPDATE **
I created the simple for loop mentioned above, like this:
df = ...
df_unique = df.copy().drop_duplicates(['col_a', 'col_b'])
df_unique['new_column'] = np.vectorize(fx)(df_unique['col_a'], df_unique['col_b'])
for index, row in df_unique.iterrows():
    df.loc[(df['col_a'] == row['col_a']) & (df['col_b'] == row['col_b']), 'new_column'] = row['new_column']
# takes 165971.890 ms
So with this for loop there may be a slight performance increase, but not nearly what I would have expected.
FYI, this is the fx function. It queries a MySQL database.
def fx(d):
    exp_date = datetime.strptime(d.col_a, '%m/%d/%Y')
    if exp_date.weekday() == 5:
        exp_date -= timedelta(days=1)
    p = pandas.read_sql("select stat from table where a = '%s' and b_date = '%s';" % (d.col_a, exp_date.strftime('%Y-%m-%d')), engine)
    if len(p.index) == 0:
        return None
    else:
        return p.iloc[0].close
UPDATE:
If you can manage to read your three columns ['stat', 'a', 'b_date'] from the table named table into a tab DF, then you could merge it like this:
tab = pd.read_sql('select stat,a,b_date from table', engine)
df.merge(tab, left_on=[...], right_on=[...], how='left')
OLD answer:
you can merge/join your precalculated df_unique DF with the original df DF:
df['new_column'] = df.merge(df_unique, on=['col_a','col_b'], how='left')['new_column']
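One caveat with this one-liner: merge returns a dataframe with a fresh RangeIndex, and the column assignment aligns on df's index, so it is only safe if df itself has a default 0..n-1 index. A slightly more defensive sketch bypasses the index alignment by assigning the raw values positionally:
merged = df.merge(df_unique, on=['col_a', 'col_b'], how='left')
df['new_column'] = merged['new_column'].to_numpy()  # positional, not index-based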
MaxU's answer may already be what you want, but I'll show another approach which may be a bit faster (I didn't measure).
I assume that:
df[['col_a', 'col_b']] is sorted so that all identical entries are in consecutive rows (this is important);
df has a unique index (if not, you can create a temporary unique index).
I'll use the fact that df_unique.index is a subset of df.index.
# (keep='first' is actually default)
df_unique = df[['col_a', 'col_b']].drop_duplicates(keep='first').copy()
# You may try .apply instead of np.vectorize (I think it may be faster):
df_unique['result'] = df_unique.apply(fx, axis=1)
# Main part:
df['result'] = df_unique['result'] # uses 2.
df['result'] = df['result'].ffill() # uses 1.
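For completeness, the same "compute once per unique pair, then fill the rest" idea can also be written with a plain dictionary, which requires neither sorting nor any particular index - a small sketch using the names from the question:
# Compute fx once per unique (col_a, col_b) pair...
df_unique = df.drop_duplicates(['col_a', 'col_b']).copy()
df_unique['new_column'] = numpy.vectorize(fx)(df_unique['col_a'], df_unique['col_b'])

# ...then fill the full dataframe in one pass via a tuple-keyed lookup.
lookup = dict(zip(zip(df_unique['col_a'], df_unique['col_b']),
                  df_unique['new_column']))
df['new_column'] = [lookup[key] for key in zip(df['col_a'], df['col_b'])]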
