I have two dataframes. The first simply holds some unique ids with their associated names, like so:
Id name
0 name_a
1 name_b
2 name_c
The second dataframe stores, in each row, an array of ids from the first dataframe:
Row_1 row_2
0 [0,2]
1 [1,0]
My question: is it possible to replace the arrays in the second dataframe with the corresponding names from the first df, looked up by id, so that:
Row_1 row_2
0 [name_a, name_c]
1 [name_b, name_a]
Creating a map of the first df and applying it to the second df with a udf seems too time-consuming. Any help on how to approach this is much appreciated.
Join using the array_contains function, then groupBy and collect_list:
from pyspark.sql import functions as F

df1 = spark.createDataFrame([(0, "name_a"), (1, "name_b"), (2, "name_c")], ["Id", "name"])
df2 = spark.createDataFrame([(0, [0, 2]), (1, [1, 0])], ["Row_1", "Row_2"])

# join each name onto every row whose array contains its Id,
# then collect the matched names back into an array per row
result = df2.join(
    df1, on=F.array_contains("Row_2", F.col("Id")), how="left"
).groupBy("Row_1").agg(
    F.collect_list("name").alias("Row_2")
)
result.show()
#+-----+----------------+
#|Row_1| Row_2|
#+-----+----------------+
#| 0|[name_a, name_c]|
#| 1|[name_a, name_b]|
#+-----+----------------+
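On Spark 3.1+, an alternative that avoids the join is to build a literal id-to-name map from df1 and rewrite the array with transform. A sketch, only sensible while df1 is small enough to collect to the driver:
from pyspark.sql import functions as F

# build a literal id -> name map from the collected rows of df1
kv = [x for r in df1.collect() for x in (F.lit(r["Id"]), F.lit(r["name"]))]
mapping = F.create_map(*kv)

# replace each id in the array with its mapped name; ids missing from the map become null
result = df2.withColumn("Row_2", F.transform("Row_2", lambda x: mapping[x]))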
You can try using the explode function to convert the array into rows, then join with the initial dataframe, and in the last step do a groupBy with .agg(collect_list()):
from pyspark.sql.functions import explode, collect_list
# one row per (Row_1, id) pair; alias the exploded value so it can be joined on
df3 = df2.select(df2.Row_1, explode(df2.Row_2).alias("Id"))
# look up each exploded id's name in df1
df4 = df3.join(df1, df3.Id == df1.Id).select(df3.Row_1, df1.name)
# collapse back to one array of names per row
df5 = df4.groupBy('Row_1').agg(collect_list('name').alias('Row_2'))
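For the sample df1 and df2 above this gives the same result as the first answer (the order inside collect_list is not guaranteed):
df5.show()
#+-----+----------------+
#|Row_1|           Row_2|
#+-----+----------------+
#|    0|[name_a, name_c]|
#|    1|[name_a, name_b]|
#+-----+----------------+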
Reference links:
https://sparkbyexamples.com/pyspark/pyspark-explode-array-and-map-columns-to-rows/
https://www.owenrumney.co.uk/pyspark-opposite-of-explode/
Related
I have a PySpark DataFrame and I want to factorize the entire df at once instead of each column separately, to avoid the case where two different values in two columns get the same factorized code. I can do it with pandas as follows:
import numpy as np
import pandas as pd

_, b = pd.factorize(df.values.T.reshape(-1, ))
df = df.apply(lambda x: pd.Categorical(x, b).codes)
df = df.replace(-1, np.NaN)
Does anyone know how to do the same in PySpark? Thank you very much.
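One way to approach this in PySpark is to stack all columns into one, build a single value-to-code lookup, and join it back per column. A minimal sketch, assuming all columns share a comparable type (the frame sdf and the _code suffix are made up for illustration):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

sdf = spark.createDataFrame([("x", "y"), ("y", "z")], ["c1", "c2"])

# flatten every column into one, like df.values.T.reshape(-1,)
stacked = sdf.select(F.explode(F.array(*[F.col(c) for c in sdf.columns])).alias("value"))

# assign one global code per distinct value, like pd.factorize over the flattened values
codes = stacked.distinct().withColumn(
    "code", F.row_number().over(Window.orderBy("value")) - 1)

# map every column through the single lookup, so equal values share a code across columns
result = sdf
for c in sdf.columns:
    lookup = codes.select(F.col("value").alias(c), F.col("code").alias(c + "_code"))
    result = result.join(lookup, on=c, how="left")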
I want to merge 2 dataframes: the first is dm.shape = (21184, 34), the second is po.shape = (21184, 6). After merging I expect 40 columns. I wrote this:
dm = dm.merge(po, left_index=True, right_index=True)
then dm.shape = (4554, 40); my rows decreased.
P.S. po is the PolynomialFeatures of the numerical data of dm.
The problem is different index values, so convert them to the default RangeIndex in both DataFrames:
df = dm.reset_index(drop=True).merge(po.reset_index(drop=True),
left_index=True,
right_index=True)
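A minimal repro with made-up indexes shows why rows were lost: an index-on-index merge keeps only the labels present in both frames:
import pandas as pd
dm = pd.DataFrame({'a': [1, 2, 3]}, index=[0, 1, 2])
po = pd.DataFrame({'b': [4, 5, 6]}, index=[5, 1, 2])
# only labels 1 and 2 appear in both indexes
print (dm.merge(po, left_index=True, right_index=True).shape)  # (2, 2)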
Solution with concat (by default it does an outer join, but with the same index values in both frames it works the same):
df = pd.concat([dm.reset_index(drop=True), po.reset_index(drop=True)], axis=1)
Or use:
dm = (pd.DataFrame([dm.values.flatten().tolist(), po.values.flatten().tolist()])
        .rename(index=dict(zip(range(2), [*po.columns.tolist(), *dm.columns.tolist()])))
        .T)
You can use the join method and set the on parameter to the index of the joined dataframe:
df1 = pd.DataFrame({'col1': [1, 2]}, index=[1,2])
df2 = pd.DataFrame({'col2': [3, 4]}, index=[3,4])
df1.join(df2, on=df2.index)
Output:
col1 col2
1 1 3
2 2 4
The joined dataframe must not contain duplicated indices.
I'm a pandas newbie.
Here's the problem, with an example:
df = pd.DataFrame(data={'id':['john','joe','zack']})
I know that I can select rows where the "id" column contains "jo" like so:
mask = df['id'].str.contains('jo')
df[mask]
But suppose that id column is indexed
df = df.set_index('id')
Now how do I select the rows where the index column contains "jo"?
You need to use str.contains on the index instead of a column:
df = pd.DataFrame(data={'id':['john','joe','zack'],
'col':[1,2,3]})
df = df.set_index('id')
df1 = df[df.index.str.contains('jo')]
print (df1)
col
id
john 1
joe 2
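As a side note, DataFrame.filter can do the same selection directly on the index labels (like does a plain substring match, not a regex):
df2 = df.filter(like='jo', axis=0)
print (df2)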
I have a pandas DataFrame that I want to add rows to. The DataFrame looks like this:
col1 col2
a 1 5
b 2 6
c 3 7
I want to add rows to the dataframe, but only if they are unique. The problem is that some new rows might have the same index, but different values in the columns. If this is the case, I somehow need to know.
Some example rows to be added and the desired result:
row 1:
col1 col2
a 1 5
desired row 1 result: Not added - it is already in the dataframe
row 2:
col1 col2
a 9 9
desired row 2 result: something like,
print('non-unique entries for index a')
row 3:
col1 col2
d 4 4
desired row 3 result: just add the row to the dataframe.
try this:
# existing dataframe == df
# new rows == df_newrows
# dividing newrows dataframe into two, one for repeated indexes, one without.
df_newrows_usable = df_newrows.loc[~df_newrows.index.isin(df.index)]
df_newrows_discarded = df_newrows.loc[df_newrows.index.isin(df.index)]
print ('repeated indexes:', df_newrows_discarded)
# concat df and the new rows whose indexes are not already in df
new_df = pd.concat([df, df_newrows_usable], axis=0)
print ('new dataframe:', new_df)
The easy option would be to merge all rows and then keep only the unique ones via the dataframe method drop_duplicates.
However, this option doesn't report a warning / error when a duplicate row is appended.
drop_duplicates doesn't consider indexes, so the index must be reset before dropping the duplicates and set back afterwards:
import pandas as pd
# set up data frame
df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [5, 6, 7]}, index=['a', 'b', 'c'])
# set up row to be appended
row = pd.DataFrame({'col1': [3], 'col2': [7]}, index=['c'])
# append row (don't care if it's a duplicate)
df2 = pd.concat([df, row])
# drop duplicates, resetting the index first so it is part of the comparison
df2 = df2.reset_index()
df2 = df2.drop_duplicates()
df2 = df2.set_index('index')
If the warning message is an absolute requirement, we can write a function to that effect: it checks whether a row is a duplicate via a merge operation and appends the row only if it is unique.
def append_unique(df, row):
    d = df.reset_index()
    r = row.reset_index()
    # an inner merge on all columns (including the reset index) finds exact duplicates
    if d.merge(r, on=list(d.columns), how='inner').empty:
        d2 = pd.concat([d, r])
        d2 = d2.set_index('index')
        return d2
    print('non-unique entries for index', row.index.tolist())
    return df

df2 = append_unique(df2, row)
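Assuming the df2 and row frames defined above, the unique path looks like this (new_row is a made-up example with a fresh index):
new_row = pd.DataFrame({'col1': [4], 'col2': [4]}, index=['d'])
# 'd' is not in df2 yet, so the row is appended and the extended frame is returned
df2 = append_unique(df2, new_row)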
I have a pandas dataframe with two columns of data. Now I want to add a label over the two columns, like in the picture below:
Because the two columns don't share the same values, I can't use groupby. I just want to add the label AAA over them like that. So, how do I do it? Thank you
Reassign the columns attribute with a newly constructed pd.MultiIndex:
df.columns = pd.MultiIndex.from_product([['AAA'], df.columns.tolist()])
Consider the dataframe df
df = pd.DataFrame(1, ['hostname', 'tmserver'], ['value', 'time'])
print(df)
value time
hostname 1 1
tmserver 1 1
Then
df.columns = pd.MultiIndex.from_product([['AAA'], df.columns.tolist()])
print(df)
AAA
value time
hostname 1 1
tmserver 1 1
If you need to create a MultiIndex in the columns, the simplest way is:
df.columns = [['AAA'] * len(df.columns), df.columns]
It is similar to MultiIndex.from_arrays; it is also possible to add the names parameter:
n = ['a','b']
df.columns = pd.MultiIndex.from_arrays([['AAA'] * len(df.columns), df.columns], names=n)
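For the two-column df above, a quick check of the resulting header (the exact repr may differ between pandas versions):
print (df.columns)
MultiIndex([('AAA', 'value'),
            ('AAA',  'time')],
           names=['a', 'b'])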