How to avoid duplicated columns after join operation? - python

In Scala it's easy to avoid duplicate columns after a join operation:
df1.join(df2, Seq("id"), "left").show()
However, is there a similar solution in PySpark? If I do df1.join(df2, df1["id"] == df2["id"], "left").show() in PySpark, I get two columns id...

You have three options:
1. Use outer join
aDF.join(bDF, "id", "outer").show()
2. Use aliasing (you will lose data related to B-specific ids with this approach):
from pyspark.sql.functions import col
aDF.alias("a").join(bDF.alias("b"), aDF.id == bDF.id, "outer").drop(col("b.id")).show()
3. Use drop to drop the columns
columns_to_drop = ['ida', 'idb']
df = df.drop(*columns_to_drop)
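For context, here is a minimal runnable sketch of option 1; aDF, bDF, and the id key come from the answer above, while the SparkSession setup and the a_val/b_val columns are made up for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Two small example dataframes sharing an "id" column.
aDF = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "a_val"])
bDF = spark.createDataFrame([(1, "x"), (3, "y")], ["id", "b_val"])

# Passing the join key as a column name (or a list of names) keeps a single "id" column.
aDF.join(bDF, "id", "outer").show()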
Let me know if that helps.

Related

How to use the content of one dataframe to index another multi-level index dataframe?

I have the following dataframes, site_1_df and site_2_df (both are similar), and I build index dataframes from them as follows:
site_1_index_df = pd.DataFrame(site_1_df.index.values.tolist(), columns=["SiteNumber", "WeekNumber", "PG"])
site_2_index_df = pd.DataFrame(site_2_df.index.values.tolist(), columns=["SiteNumber", "WeekNumber", "PG"])
index_intersection = pd.merge(left=site_1_index_df, right=site_2_index_df,
                              on=["WeekNumber", "PG"], how="inner")[["WeekNumber", "PG"]]
The resulting index_intersection holds the (WeekNumber, PG) pairs common to both dataframes.
Consequently, it is clear that site_1_df and site_2_df are multi-level indexed dataframes. Therefore, I would like to use index_intersection to index the above dataframes. In other words, if I am indexing into site_1_df, I want a subset of the rows from that same dataframe; technically, I should get back a dataframe with 8556 rows x 6 columns, i.e., the same number of rows as index_intersection. How can I achieve that efficiently in pandas?
I tried:
index_intersection = pd.merge(left=site_1_index_df, right=site_2_index_df,
                              on=["WeekNumber", "PG"], how="inner")[["SiteNumber_x", "WeekNumber", "PG"]]
index_intersection = index_intersection.rename(columns={"SiteNumber_x": "SiteNumber"})
index_intersection = index_intersection.set_index(["SiteNumber", "WeekNumber", "PG"])
index_intersection
This produces index_intersection with a (SiteNumber, WeekNumber, PG) MultiIndex. However, indexing the dataframe using another dataframe, such as:
site_2_df.loc[index_intersection]
# or
site_2_df.loc[index_intersection.index]
# or
site_2_df.loc[index_intersection.index.values]
will give me an error:
NotImplementedError: Indexing a MultiIndex with a DataFrame key is not implemented
Any help is much appreciated!!
So I figured out that I can find the intersection of the two dataframes, based on their indexes, through:
sites_common_rows = pd.merge(left=site_1_df.reset_index([0]), right=site_2_df.reset_index([0]),
                             left_index=True, right_index=True, how="inner")
The reset_index([0]) above is used to ignore the SiteNumber level, since it is totally different from one dataframe to the other. Consequently, I am able to compute the inner join between the two dataframes from their indexes.
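For reference, here is a minimal self-contained sketch of that approach; only the index level names come from the question, while the miniature frames and their values are made up.
import pandas as pd

# Hypothetical miniature versions of site_1_df / site_2_df with a
# (SiteNumber, WeekNumber, PG) MultiIndex; the values are made up.
idx1 = pd.MultiIndex.from_tuples([(1, 1, "A"), (1, 2, "B"), (1, 3, "C")],
                                 names=["SiteNumber", "WeekNumber", "PG"])
idx2 = pd.MultiIndex.from_tuples([(2, 1, "A"), (2, 3, "C"), (2, 4, "D")],
                                 names=["SiteNumber", "WeekNumber", "PG"])
site_1_df = pd.DataFrame({"x": [10, 20, 30]}, index=idx1)
site_2_df = pd.DataFrame({"y": [1, 2, 3]}, index=idx2)

# Drop the SiteNumber level (it differs between the frames) and inner-join
# on the remaining (WeekNumber, PG) index levels.
sites_common_rows = pd.merge(left=site_1_df.reset_index(0), right=site_2_df.reset_index(0),
                             left_index=True, right_index=True, how="inner")
print(sites_common_rows)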

Pyspark dataframe joins with few duplicated column names and few without duplicate columns

I need to implement pyspark dataframe joins in my project.
I need to join in 3 different cases.
1)
If both dataframes have join columns with the same names, I join as below. It eliminates the duplicate columns col1 and col2.
cond = ['col1', 'col2']
df1.join(df2, cond, "inner")
2) If the dataframes have join columns with different names, I join as below. It keeps all 4 join columns, as expected.
cond = [df1.col_x == df2.col_y,
        df1.col_a == df2.col_b]
df1.join(df2, cond, "inner")
3) If the dataframes have some join columns with the same names and some with different names, I tried the following, but it fails.
cond = [df1.col_x == df2.col_y,
        df1.col_a == df2.col_b,
        'col1',
        'col2',
        'col3']
df1.join(df2, cond, "inner")
I then tried the following, which worked.
cond = [df1.col_x == df2.col_y,
        df1.col_a == df2.col_b,
        df1.col1 == df2.col1,
        df1.col2 == df2.col2,
        df1.col3 == df2.col3]
df1.join(df2, cond, "inner")
But col1, col2, and col3 end up duplicated. I want to eliminate these duplicate columns during the join itself instead of dropping them later.
Please suggest how #3 can be achieved or suggest alternative approaches.
There is no way to do it: behind the scenes, an equi-join whose condition is given as a (sequence of) column-name string(s), which is called a natural join, is executed as if it were a regular equi-join (colA == colB), so
frame1.join(frame2,
            "shared_column",
            "inner")
gets translated to
frame1.join(frame2,
            frame1.shared_column == frame2.shared_column,
            "inner")
Afterwards, the duplicate column gets dropped (a projection).
If you have a join condition that mixes keys that could form a natural join with keys that require a regular equi-join, then either drop the duplicate columns afterwards, or rename the columns that are not equally named before the join, as in the sketch below.
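A sketch of that rename-before-join workaround, using the column names from the question (col_x/col_y, col_a/col_b, col1..col3); df2_renamed is a name introduced here for illustration.
# Rename df2's differently-named keys so every join key shares a name with df1.
df2_renamed = (df2
               .withColumnRenamed("col_y", "col_x")
               .withColumnRenamed("col_b", "col_a"))

# With all keys equally named, the join can be given as a list of column names,
# and Spark keeps only one copy of each key column.
joined = df1.join(df2_renamed, ["col_x", "col_a", "col1", "col2", "col3"], "inner")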

Compare Dataframes of different size and create a new one if condition is met

I need help to figure out the following problem:
I have two (2) dataframes of different sizes. I need to compare the values and, if the condition is met, replace the values in Dataframe 1.
If the value for a Material and Char in Dataframe 1 is "Y", I need to get the "Required or Optional" value from Dataframe 2. If it is Required, I replace the "Y" with "Y_REQD"; otherwise, if it is Optional, I replace the "Y" with "Y_OPT".
I've been using for loops, but the code is getting too complicated, which hints that this may not be the best way.
Thanks in advance.
This is more of a pivot problem: pivot df2, reindex it like df1, then add (concatenate) the strings:
df1=df1.replace({'Y':'Y_'})+df2.pivot(*df2.columns).reindex_like(df1).fillna('')
Mostly agree with @WeNYoBen's answer, but to make it completely right, dataframe2 needs to be revised using df.replace.
Short version:
df1=df1.replace({'Y':'Y_'})+df2.replace({'Rqd': 'REQD', 'Opt': 'OPT'}).pivot(*df2.columns).reindex_like(df1).fillna('')
Long version:
# break short into steps
# 1. replace
df2 = df2.replace({'Rqd': 'REQD', 'Opt': 'OPT'})
# 2. pivot
df2 = df2.pivot(*df2.columns)
# 3. reindex
df2 = df2.reindex_like(df1)
# 4. fillna (clean up so every cell is a string)
df2 = df2.fillna('')
# 5. replace 'Y' with 'Y_' in df1 and concatenate with df2
df1 = df1.replace({'Y':'Y_'}) + df2
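As a self-contained toy example of the same steps: only the overall shapes follow the question, while the Material/Char labels, the values, and the "ReqOpt" column name are assumptions made here for illustration. The keyword-argument pivot call is equivalent to pivot(*df2.columns) on the assumed column order.
import pandas as pd

# df1 is wide: one row per Material, one column per Char, cells 'Y' or ''.
df1 = pd.DataFrame({"Char1": ["Y", ""], "Char2": ["Y", "Y"]},
                   index=pd.Index(["Mat1", "Mat2"], name="Material"))
# df2 is long: Material / Char / Required-or-Optional flag.
df2 = pd.DataFrame({"Material": ["Mat1", "Mat1", "Mat2"],
                    "Char": ["Char1", "Char2", "Char2"],
                    "ReqOpt": ["Rqd", "Opt", "Rqd"]})

# Replace the flags, pivot df2 into the same wide shape as df1,
# align it with reindex_like, and concatenate the strings.
out = df1.replace({"Y": "Y_"}) + (
    df2.replace({"Rqd": "REQD", "Opt": "OPT"})
       .pivot(index="Material", columns="Char", values="ReqOpt")
       .reindex_like(df1)
       .fillna(""))
print(out)  # Mat1 row becomes Y_REQD / Y_OPT, Mat2 row becomes '' / Y_REQD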
Hope it helps.

How to merge two pandas dataframes on column of sets

I have columns in two dataframes representing interacting partners in a biological system, so if gene_A interacts with gene_B, the entry in column 'gene_pair' would be {gene_A, gene_B}. I want to do an inner join, but trying:
pd.merge(df1, df2, how='inner', on=['gene_pair'])
throws the error
TypeError: type object argument after * must be a sequence, not itertools.imap
I need to merge on the unordered pair, so as far as I can tell I can't merge on a combination of two individual columns with gene names. Is there another way to achieve this merge?
Some example dfs:
gene_pairs1 = [
    set(['gene_A', 'gene_B']),
    set(['gene_A', 'gene_C']),
    set(['gene_D', 'gene_A'])
]
df1 = pd.DataFrame({'r_name': ['r1', 'r2', 'r3'], 'gene_pair': gene_pairs1})
gene_pairs2 = [
    set(['gene_A', 'gene_B']),
    set(['gene_F', 'gene_A']),
    set(['gene_C', 'gene_A'])
]
df2 = pd.DataFrame({'function': ['f1', 'f2', 'f3'], 'gene_pair': gene_pairs2})
pd.merge(df1, df2, how='inner', on=['gene_pair'])
and I would like entry 'r1' to line up with 'f1' and 'r2' to line up with 'f3'.
Pretty simple in the end: I used frozenset rather than set, as in the sketch below.
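A minimal sketch of that fix, reusing the df1 and df2 defined in the question above (with pandas imported as pd): frozenset is hashable and order-insensitive, so merge can use the column as a key.
df1['gene_pair'] = df1['gene_pair'].apply(frozenset)
df2['gene_pair'] = df2['gene_pair'].apply(frozenset)
# r1 now pairs with f1 and r2 pairs with f3, as intended.
merged = pd.merge(df1, df2, how='inner', on=['gene_pair'])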
I suggest you create an extra id column for each pair and then join on that.
For example:
df2['gp'] = df2.gene_pair.apply(lambda x: list(x)[0][-1]+list(x)[1][-1])
df1['gp'] = df1.gene_pair.apply(lambda x: list(x)[0][-1]+list(x)[1][-1])
pd.merge(df1, df2[['function','gp']],how='inner',on=['gp']).drop('gp', axis=1)

pandas iterrows() two data frame

I'm doing something that I know I shouldn't be doing. I'm doing a for loop within a for loop (it sounds even more horrible as I write it down). Basically, what I want to do, theoretically, using two dataframes is something like this:
for index, row in df_2.iterrows():
    for index_1, row_1 in df_1.iterrows():
        if row['column_1'] == row_1['column_1'] and row['column_2'] == row_1['column_2'] and row['column_3'] == row_1['column_3']:
            row['column_4'] = row_1['column_4']
There has got to be a (better) way to do something like this. Please help!
As pointed out by @Andy Hayden in "is it possible to do fuzzy match merge with python pandas?", you can use difflib's get_close_matches function to create new join columns.
import difflib
df_2['fuzzy_column_1'] = df_2['column_1'].apply(lambda x: difflib.get_close_matches(x, df_1['column_1'])[0])
# Do same for all other columns
Now you can apply an inner join using the pandas merge function.
result_df = df_1.merge(df_2, left_on=['column_1', 'column_2', 'column_3'], right_on=['fuzzy_column_1', 'fuzzy_column_2', 'fuzzy_column_3'])
You can use the drop function to remove the unwanted fuzzy columns afterwards; a fuller sketch follows.
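A small end-to-end sketch of this approach on made-up data; the frame and column names (df_1, df_2, column_1..column_4) follow the question, while the values and the closest() helper are assumptions introduced here.
import difflib
import pandas as pd

# Made-up example frames; df_1 carries the column_4 values we want to pull over.
df_1 = pd.DataFrame({'column_1': ['apple', 'banana'],
                     'column_2': ['red', 'yellow'],
                     'column_3': ['small', 'large'],
                     'column_4': [1, 2]})
df_2 = pd.DataFrame({'column_1': ['appel', 'bananna'],
                     'column_2': ['red', 'yellow'],
                     'column_3': ['small', 'large']})

def closest(value, candidates):
    # Fall back to the original value when difflib finds no close match.
    matches = difflib.get_close_matches(value, candidates)
    return matches[0] if matches else value

# Build a fuzzy join column for each key column of df_2.
for col in ['column_1', 'column_2', 'column_3']:
    df_2['fuzzy_' + col] = df_2[col].apply(lambda x: closest(x, df_1[col].tolist()))

result_df = df_1.merge(df_2,
                       left_on=['column_1', 'column_2', 'column_3'],
                       right_on=['fuzzy_column_1', 'fuzzy_column_2', 'fuzzy_column_3'])
# Drop the helper columns once the join is done.
result_df = result_df.drop(columns=['fuzzy_column_1', 'fuzzy_column_2', 'fuzzy_column_3'])
print(result_df)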
