Compare columns from two different dataframes pandas - python

I am querying AD for a list of machines and filtering that list with pandas by last log-on date. When I am done with this data I have a single-column dataframe.
I have another report listing the machines on which a product we use is installed. I clean this data and am left with the devices I want to compare against the AD data, which is again a single column in a dataframe.
I have also tried comparing list to list, but I am not sure of the best method.
I tried merge, but my guess is that it compares DF1 row 1 to DF2 row 1.
DF1 = comp1,comp2,comp3,comp5
DF2 = comp1,comp2,comp3
How would I check each row in DF1 to make sure that each value in DF2 exist and return true or false?
I am trying to figure out machines in DF1 that don't exist in DF2.

DataFrame.isin
This is a simple check to see whether the values of one column exist in another. You can do this in a multitude of ways; this is probably one of the simplest.
I'm providing some dummy data, but please check out How to make good reproducible pandas examples.
import pandas as pd

machines = ['A','B','C']
machines_to_check = ['A','B']
df = pd.DataFrame({'AD' : machines})
df2 = pd.DataFrame({'AD' : machines_to_check})
Now, if we want the machines that exist in df but not in df2, we can use ~, which inverts the result of .isin.
non_matching_machines = df.loc[~df['AD'].isin(df2['AD'])]
print(non_matching_machines)
AD
2 C
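An alternative approach, sketched here on the same dummy data, is DataFrame.merge with indicator=True: the added _merge column labels each row as left_only, right_only, or both, so the left_only rows are exactly the machines in df that are missing from df2.

```python
import pandas as pd

# Same dummy data as above
df = pd.DataFrame({'AD': ['A', 'B', 'C']})
df2 = pd.DataFrame({'AD': ['A', 'B']})

# indicator=True adds a '_merge' column describing each row's origin
merged = df.merge(df2, on='AD', how='left', indicator=True)

# 'left_only' rows exist in df but not in df2
non_matching = merged.loc[merged['_merge'] == 'left_only', ['AD']]
print(non_matching)
```

This has the advantage of keeping both "missing" directions available (switch how='left' to how='outer' and filter on 'right_only' for the reverse check).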

Related

Combining two dataframes with different rows, keeping contents of first dataframe on all rows

Good day All,
I have two dataframes that need to be merged, which is a little different from the examples I have found so far, and I could not get it working. What I am currently getting (highlighted in red in the screenshot) is, I am sure, to do with the index, as dataframe 1 only has one record. I need to copy the contents of dataframe 1 into new columns of dataframe 2 for all rows.
I have tried merge, append, reset index etc...
DF 1:
Dataframe 1
DF 2:
Dataframe 2
Output Requirement:
Required Output
Any suggestions would be highly appreciated
Update:
I got it to work using the below statements, is there a more dynamic way than specifying the column names?
mod_df['Type'] = mod_df['Type'].fillna(method="ffill")
mod_df['Date'] = mod_df['Date'].fillna(method="ffill")
mod_df['Version'] = mod_df['Version'].fillna(method="ffill")
Assuming you have a single row in df1, use a cross merge:
out = df2.merge(df1, how='cross')
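As a minimal sketch with made-up data (the column names Type, Date, Version, and Machine are illustrative, not taken from the actual frames), the cross merge repeats df1's single row across every row of df2 without naming any columns:

```python
import pandas as pd

# Hypothetical data: df1 has a single row, df2 has many
df1 = pd.DataFrame({'Type': ['Full'], 'Date': ['2023-01-01'], 'Version': ['1.0']})
df2 = pd.DataFrame({'Machine': ['comp1', 'comp2', 'comp3']})

# how='cross' (pandas >= 1.2) pairs every row of df2 with the single row
# of df1, so df1's columns land on every row with no ffill needed
out = df2.merge(df1, how='cross')
print(out)
```

Unlike the fillna/ffill approach, this stays correct no matter how many columns df1 gains later.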

What is a reason why two PySpark dataframes wouldn't be equal if they have the same schema and data?

Suppose we have two PySpark dataframes df1 and df2. Also suppose that they have the same number of rows (5 rows). If df1.schema = df2.schema and df1.take(5) = df2.take(5), why wouldn't df1 = df2?
Data handled by Spark is distributed across worker nodes (executors), and its ordering is neither guaranteed nor predictable. Therefore it makes no sense to compare df1 == df2 directly. If you truly want to compare them, and as long as they have the same schema, you can do df1.subtract(df2).count() == 0 to check whether they contain exactly the same data.

Using select() on all existing columns AND using a list to generate new lit columns

I have an existing df with n columns, and I need to also add thousands of lit columns for a later application. Because there are so many columns to add, I cannot use a for loop of withColumn().
Is it possible to combine the following two .select() calls?
df1 = df.select([lit(f"{i}").alias(f"{i}") for i in test_list])
df2 = df.select("*")
This didn't seem to work, nor did a few other variations:
df1 = df.select("*", [lit(f"{i}").alias(f"{i}") for i in test_list])
We can use df.columns, which provides the list of existing columns, and append the lit columns to it:
df1 = df.select(df.columns+[lit(f"{i}").alias(f"{i}") for i in test_list])

How should I filter one dataframe by entries from another one in pandas with isin?

I have two dataframes (df1, df2). The column names and indices are the same (the difference is in the column entries). Also, df2 has only 20 entries, all of which also exist in df1, as I said.
I want to filter df1 by the entries of df2, but when I try to do it with isin nothing happens:
df1.isin(df2) or df1.index.isin(df2.index)
Please tell me what I'm doing wrong and how I should do it.
First of all, the isin function in pandas returns a dataframe of booleans, not the result you want, so it makes sense that the commands you used did not work.
I am positive that the following post will help:
pandas - filter dataframe by another dataframe by row elements
If you want to select the entries in df1 with an index that is also present in df2, you should be able to do it with:
df1.loc[df2.index]
or if you really want to use isin:
df1[df1.index.isin(df2.index)]
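As a quick sketch on dummy frames (the index labels and the val column are invented for illustration), both approaches return the rows of df1 whose index also appears in df2. One practical difference: .loc raises a KeyError if df2's index contains a label missing from df1, while the isin version silently skips such labels.

```python
import pandas as pd

# Dummy frames sharing index labels (names are illustrative only)
df1 = pd.DataFrame({'val': [10, 20, 30, 40]}, index=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame({'val': [10, 30]}, index=['a', 'c'])

# Label-based selection: every label in df2.index must exist in df1
by_loc = df1.loc[df2.index]

# Boolean mask: keeps only rows whose label appears in df2.index
by_isin = df1[df1.index.isin(df2.index)]
print(by_loc)
```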

How to match two columns from different Dataframes and with different length?

I have already generated df1 and df2.
df1
df2
Both Dataframes have a common column, df1[TB_DIV] and df2[DIV].
I want to generate a new df3 that contains all the info in df1, filtered by the df2[DIV] values which are NOT IN df1.
I tried to use the .isin function to filter df1 with the df2 info, but was not able to get the expected values:
m = DIV_LIST.DIV.isin(DIV_TABLE.TB_DIV)
DIV_LIST1 = DIV_LIST[m]
I obtained an empty df3, and in some cases errors due to a length mismatch.
Try going about it like this:
df1.loc[df1['TB_DIV'].isin(df2['DIV'])]
To get those that are not in, use:
df1.loc[~df1['TB_DIV'].isin(df2['DIV'])]
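A minimal runnable sketch of both filters, assuming dummy values for TB_DIV and DIV (the INFO column is invented for illustration):

```python
import pandas as pd

# Illustrative frames matching the column names in the question
df1 = pd.DataFrame({'TB_DIV': ['N1', 'N2', 'N3', 'N4'], 'INFO': [1, 2, 3, 4]})
df2 = pd.DataFrame({'DIV': ['N1', 'N3']})

# Rows of df1 whose TB_DIV appears in df2['DIV']
matching = df1.loc[df1['TB_DIV'].isin(df2['DIV'])]

# Rows of df1 whose TB_DIV does NOT appear in df2['DIV']
not_matching = df1.loc[~df1['TB_DIV'].isin(df2['DIV'])]
print(not_matching)
```

Note the direction of the check: the mask is built on df1's column against df2's values, so its length matches df1 and there is no length-mismatch error.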
