How to factorize entire DataFrame in pyspark - python

I have a PySpark DataFrame and I want to factorize the entire df, not each column separately, to avoid the case where two different values in two columns end up with the same factorized value. I can do it with pandas as follows:
import numpy as np
import pandas as pd

# factorize all values at once so codes are shared across columns
_, b = pd.factorize(df.values.T.reshape(-1))
df = df.apply(lambda x: pd.Categorical(x, b).codes)
df = df.replace(-1, np.nan)
Does anyone know how to do the same in PySpark? Thank you very much.
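For reference, here is the pandas approach made self-contained on a small made-up frame (the column names and values are hypothetical), which shows the behaviour a PySpark version would need to reproduce: identical values in different columns get the same code.

```python
import numpy as np
import pandas as pd

# hypothetical data: 'b' and 'a' appear in both columns
df = pd.DataFrame({'c1': ['a', 'b', 'c'], 'c2': ['b', 'd', 'a']})

# factorize all values at once so codes are shared across columns
_, uniques = pd.factorize(df.values.T.reshape(-1))
out = df.apply(lambda x: pd.Categorical(x, uniques).codes)
out = out.replace(-1, np.nan)
```

Note that 'b' maps to the same code (1) in both columns, which is exactly what per-column factorization would not guarantee.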

Related

Pandas Apply returns a Series instead of a dataframe

The goal of the following code is to go through each row in df_label, extract the app1 and app2 names, filter df_all using those two names, concatenate the result, and return it as a DataFrame. Here is the code:
def create_dataset(se):
    # extracting the names of applications
    app1 = se.app1
    app2 = se.app2
    # extracting each application from df_all
    df1 = df_all[df_all.workload == app1]
    df1.columns = df1.columns + '_0'
    df2 = df_all[df_all.workload == app2]
    df2.columns = df2.columns + '_1'
    # combining workloads to create the pairs dataframe
    df3 = pd.concat([df1, df2], axis=1)
    display(df3)
    return df3

df_pairs = pd.DataFrame()
df_label.apply(create_dataset, axis=1)
#df_pairs = df_pairs.append(df_label.apply(create_dataset, axis=1))
I would like to append all the DataFrames returned from apply. However, while display(df3) shows the correct DataFrame, when it is returned from the function it is no longer a DataFrame but a Series: a Series whose single element seems to be the whole DataFrame. Any ideas what I am doing wrong?
When you select a single column, you get a Series instead of a DataFrame, so df1 and df2 will both be Series.
However, concatenating them on axis=1 should produce a DataFrame (whereas combining them on axis=0 would produce a Series). For example:
df = pd.DataFrame({'a':[1,2],'b':[3,4]})
df1 = df['a']
df2 = df['b']
>>> pd.concat([df1,df2],axis=1)
   a  b
0  1  3
1  2  4
>>> pd.concat([df1,df2],axis=0)
0 1
1 2
0 3
1 4
dtype: int64
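Getting back to the original problem: apply wraps each returned DataFrame inside a cell of a Series, so one straightforward workaround is to skip apply, collect the per-row results in a plain list, and concatenate them at the end. A minimal sketch, with made-up stand-ins for df_all and df_label:

```python
import pandas as pd

# hypothetical stand-ins for the question's df_all and df_label
df_all = pd.DataFrame({'workload': ['app1', 'app2'], 'metric': [1, 2]})
df_label = pd.DataFrame({'app1': ['app1'], 'app2': ['app2']})

def create_dataset(se):
    # reset_index so the two one-row frames align on concat
    df1 = df_all[df_all.workload == se.app1].reset_index(drop=True)
    df1.columns = df1.columns + '_0'
    df2 = df_all[df_all.workload == se.app2].reset_index(drop=True)
    df2.columns = df2.columns + '_1'
    return pd.concat([df1, df2], axis=1)

# collect the DataFrames in a list instead of going through apply,
# then stack them into one frame
pieces = [create_dataset(row) for _, row in df_label.iterrows()]
df_pairs = pd.concat(pieces, ignore_index=True)
```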

Group dataframe and aggregate data from several columns into a new column

I want to group this dataframe by column a, and create a new column (d) with all values from both column b and column c.
data_dict = {'a': list('aabbcc'),
'b': list('123456'),
'c': list('xxxyyy')}
df = pd.DataFrame(data_dict)
From this:
   a  b  c
0  a  1  x
1  a  2  x
2  b  3  x
3  b  4  y
4  c  5  y
5  c  6  y
to this:
        d
a
a   1x,2x
b   3x,4y
c   5y,6y
I've figured out one way of doing it,
df['d'] = df['b'] + df['c']
df.groupby('a').agg({'d': lambda x: ','.join(x)})
but is there a more pandas way?
I think "more pandas" is hard to define, but you can groupby and agg directly on the combined Series if you're trying to avoid the temporary column:
g = (df['b'] + df['c']).groupby(df['a']).agg(','.join).to_frame('d')
g:
        d
a
a   1x,2x
b   3x,4y
c   5y,6y
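Put together with the data from the question, this runs end-to-end as:

```python
import pandas as pd

data_dict = {'a': list('aabbcc'),
             'b': list('123456'),
             'c': list('xxxyyy')}
df = pd.DataFrame(data_dict)

# aggregate the combined Series directly, grouped by df['a'],
# without creating a temporary column on df
g = (df['b'] + df['c']).groupby(df['a']).agg(','.join).to_frame('d')
```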

DataFrame 'groupby' is fixing group columns with index

I have used a simple 'groupby' to condense rows in a Pandas dataframe:
df = df.groupby(['col1', 'col2', 'col3']).sum()
In the new DataFrame 'df', the three columns that were used in the 'groupby' call are now fixed within the index and are no longer regular columns; the remaining data columns have shifted left (what was previously column index 4 is now column index 0).
How do I stop this from happening / reinclude the three 'groupby' columns along with the original data?
Try:
df = df.groupby(['col1', 'col2', 'col3'], as_index=False).sum()
# or
df = df.groupby(['col1', 'col2', 'col3']).sum().reset_index()
Try resetting the index
df = df.reset_index()
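A quick sketch on a made-up frame showing that both options give the same result, with the grouping keys back as ordinary columns:

```python
import pandas as pd

# hypothetical frame: three grouping columns and one value column
df = pd.DataFrame({'col1': ['x', 'x'], 'col2': ['y', 'y'],
                   'col3': ['z', 'z'], 'val': [1, 2]})

# keep the grouping columns as regular columns while grouping
out1 = df.groupby(['col1', 'col2', 'col3'], as_index=False).sum()

# or group first and move the index back into columns afterwards
out2 = df.groupby(['col1', 'col2', 'col3']).sum().reset_index()
```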

Create label for two columns in pandas

I have a pandas dataframe with two columns of data. Now I want to make a single label spanning both columns, like the picture below:
Because the two columns do not share the same values, I can't use groupby. I just want to add the label AAA above both columns. How can I do this? Thank you.
Reassign to the columns attribute with a newly constructed pd.MultiIndex:
df.columns = pd.MultiIndex.from_product([['AAA'], df.columns.tolist()])
Consider the dataframe df
df = pd.DataFrame(1, ['hostname', 'tmserver'], ['value', 'time'])
print(df)
          value  time
hostname      1     1
tmserver      1     1
Then
df.columns = pd.MultiIndex.from_product([['AAA'], df.columns.tolist()])
print(df)
           AAA
         value time
hostname     1    1
tmserver     1    1
If you need to create a MultiIndex in the columns, the simplest way is:
df.columns = [['AAA'] * len(df.columns), df.columns]
It is similar to MultiIndex.from_arrays, where it is also possible to add a names parameter:
n = ['a','b']
df.columns = pd.MultiIndex.from_arrays([['AAA'] * len(df.columns), df.columns], names=n)
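The from_product variant from the accepted answer can be run end-to-end on the example frame above:

```python
import pandas as pd

# the example frame from the answer: all ones, two rows, two columns
df = pd.DataFrame(1, ['hostname', 'tmserver'], ['value', 'time'])

# prepend a single top level 'AAA' to every column
df.columns = pd.MultiIndex.from_product([['AAA'], df.columns.tolist()])
```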

Converting rows in pandas dataframe to columns

I want to convert rows in the following pandas dataframe to column headers:
  transition          area
0     A_to_B -9.339710e+10
1     B_to_C  2.135599e+02
result:
         A_to_B        B_to_C
0 -9.339710e+10  2.135599e+02
I tried using pivot table, but that does not seem to give the result I want.
I think you can first set_index with the column transition, then transpose with T, remove the columns name with rename_axis, and finally reset_index:
print(df.set_index('transition').T.rename_axis(None, axis=1).reset_index(drop=True))
         A_to_B    B_to_C
0 -9.339710e+10  213.5599
df = df.T
df.columns = df.iloc[0, :]  # the first row (the transition names) becomes the header
df = df.iloc[1:, :]         # drop that row from the data
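For completeness, the set_index/transpose approach can be run on the exact frame from the question:

```python
import pandas as pd

df = pd.DataFrame({'transition': ['A_to_B', 'B_to_C'],
                   'area': [-9.339710e+10, 2.135599e+02]})

# transition values become column headers; the old index is discarded
out = df.set_index('transition').T.rename_axis(None, axis=1).reset_index(drop=True)
```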
