Group dataframe and aggregate data from several columns into a new column - python

I want to group this dataframe by column a, and create a new column (d) with all values from both column b and column c.
data_dict = {'a': list('aabbcc'),
             'b': list('123456'),
             'c': list('xxxyyy')}
df = pd.DataFrame(data_dict)
From this dataframe, I want to get one row per value of a, with d containing the comma-joined b + c strings for that group.
I've figured out one way of doing it,
df['d'] = df['b'] + df['c']
df.groupby('a').agg({'d': lambda x: ','.join(x)})
but is there a more pandas way?

I think "more pandas" is hard to define, but you can groupby and agg directly on the concatenated Series if you want to avoid the temporary column:
g = (df['b'] + df['c']).groupby(df['a']).agg(','.join).to_frame('d')
g:
       d
a
a  1x,2x
b  3x,4y
c  5y,6y
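For reference, an equivalent spelled with assign, which avoids mutating df while keeping the grouped aggregation readable (a sketch, not the only idiom):

```python
import pandas as pd

df = pd.DataFrame({'a': list('aabbcc'),
                   'b': list('123456'),
                   'c': list('xxxyyy')})

# Build the combined column on a copy, then join the strings per group
out = (df.assign(d=df['b'] + df['c'])
         .groupby('a', as_index=False)
         .agg({'d': ','.join}))
```

With as_index=False the grouping key stays a regular column rather than becoming the index.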

Related

How to factorize entire DataFrame in pyspark

I have a PySpark DataFrame and I want to factorize the entire df instead of each column, to avoid the case where two different values in two columns get the same factorized value. I can do it with pandas as follows:
_, b = pd.factorize(df.values.T.reshape(-1))
df = df.apply(lambda x: pd.Categorical(x, b).codes)
df = df.replace(-1, np.nan)  # np.NaN was removed in NumPy 2.0; use np.nan
Does anyone know how to do the same in Pyspark? Thank you very much.
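To make the pandas approach above concrete, here is a small runnable example with hypothetical data (the PySpark equivalent is what the question asks for and is not shown here):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'b', 'c'],
                   'col2': ['b', 'd', 'a']})

# Factorize all values at once so codes are shared across columns
_, uniques = pd.factorize(df.values.T.reshape(-1))
coded = df.apply(lambda x: pd.Categorical(x, uniques).codes)
coded = coded.replace(-1, np.nan)  # values absent from uniques get code -1
```

Here 'b' receives the same code (1) in both columns, which is exactly what per-column factorization would not guarantee.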

Create new column with Pandas .apply() even if DataFrame is empty

I want to use Pandas apply to create a new column, and I want this to be fail-safe even if the DataFrame is empty. Here is a minimal example that works as expected:
df = pd.DataFrame(np.array([[1, 2], [3, 4]]), columns=['a', 'b'])  # two columns
add = lambda x: x['a'] + x['b']  # add columns a and b
df['c'] = df.apply(add, axis=1)  # creates new column c, as anticipated
However, it gets problematic when df happens to be empty. Consider the following example where now the DataFrame is empty, but otherwise equal:
df = pd.DataFrame( columns=['a','b']) # two columns, but no values
df['c'] = df.apply( add, axis=1 ) # raises an error!
How can I execute this last line safely, such that it just appends a column 'c' to the DataFrame, even if df is empty?
Interestingly enough, this works
df.apply( add, axis=1 )
but cannot be appended as column 'c'.
If you want to create a new column c based on the sum of columns a and b, then you can just do the following:
df['c'] = df['a'] + df['b'] # creates new column c, as anticipated :)
That way, you don't need to assign a lambda expression to the name add (PEP 8 recommends using def instead of binding a lambda to a name).
import numpy as np
import pandas as pd
df = pd.DataFrame(np.array([[1, 2], [3, 4]]), columns=['a', 'b']) # two columns
print(df)
   a  b
0  1  2
1  3  4
df['c'] = df['a'] + df['b'] # creates new column c, as anticipated
print(df)
   a  b  c
0  1  2  3
1  3  4  7
df = pd.DataFrame(columns=['a', 'b']) # two columns, but no values
df['c'] = df['a'] + df['b'] # creates new column c, as anticipated
print(df)
Empty DataFrame
Columns: [a, b, c]
Index: []
Even if the dataframe is empty, the above method will work.
If one axis (rows or columns) is empty, the apply function returns an empty result, and pandas cannot infer whether your function reduces each row to a scalar. For an empty pandas.DataFrame you therefore need to be explicit about the result type of apply and use the reduce mode:
‘reduce’ : returns a Series if possible rather than expanding list-like results. This is the opposite of ‘expand’.
This will work:
df = pd.DataFrame(columns=['a','b'])
df['c'] = df.apply(add, axis=1, result_type='reduce')
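A self-contained sketch of the reduce answer, covering both the empty and the non-empty case (using def rather than a named lambda, per the note above):

```python
import pandas as pd

def add(row):
    return row['a'] + row['b']

# Non-empty frame: behaves like a plain apply
df = pd.DataFrame({'a': [1, 3], 'b': [2, 4]})
df['c'] = df.apply(add, axis=1, result_type='reduce')

# Empty frame: returns an empty Series instead of an
# empty DataFrame, so the column assignment succeeds
empty = pd.DataFrame(columns=['a', 'b'])
empty['c'] = empty.apply(add, axis=1, result_type='reduce')
```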

pandas: dataframes row-wise comparison

I have two data frames that I would like to compare for equality in a row-wise manner. I am interested in computing the number of rows that have the same values for non-joined attributes.
For example,
import pandas as pd
df1 = pd.DataFrame({'a': [1,2,3,5], 'b': [2,3,4,6], 'c':[60,20,40,30], 'd':[50,90,10,30]})
df2 = pd.DataFrame({'a': [1,2,3,5], 'b': [2,3,4,6], 'c':[60,20,40,30], 'd':[50,90,40,40]})
I will be joining these two data frames on column a and b. There are two rows (first two) that have the same values for c and d in both the data frames.
I am currently using the following approach where I first join these two data frames, and then compute each row's values for equality.
df = df1.merge(df2, on=['a','b'])
cols1 = [c for c in df.columns.tolist() if c.endswith("_x")]
cols2 = [c for c in df.columns.tolist() if c.endswith("_y")]
num_rows_equal = 0
for index, row in df.iterrows():
    not_equal = False
    for col1, col2 in zip(cols1, cols2):
        if row[col1] != row[col2]:
            not_equal = True
            break
    if not not_equal:  # row values are equal
        num_rows_equal += 1
num_rows_equal
Is there a more efficient (pythonic) way to achieve the same result?
A shorter way of achieving that:
import pandas as pd
df1 = pd.DataFrame({'a': [1,2,3,5], 'b': [2,3,4,6], 'c':[60,20,40,30], 'd':[50,90,10,30]})
df2 = pd.DataFrame({'a': [1,2,3,5], 'b': [2,3,4,6], 'c':[60,20,40,30], 'd':[50,90,40,40]})
df = df1.merge(df2, on=['a','b'])
comparison_cols = [c.removesuffix('_x') for c in df.columns if c.endswith('_x')]  # removesuffix (Python 3.9+) avoids str.strip's per-character stripping
num_rows_equal = (df1[comparison_cols][df1[comparison_cols] == df2[comparison_cols]].isna().sum(axis=1) == 0).sum()
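If the two frames share the same index, as in this example, the comparison can be fully vectorized without a merge; a sketch:

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3, 5], 'b': [2, 3, 4, 6],
                    'c': [60, 20, 40, 30], 'd': [50, 90, 10, 30]})
df2 = pd.DataFrame({'a': [1, 2, 3, 5], 'b': [2, 3, 4, 6],
                    'c': [60, 20, 40, 30], 'd': [50, 90, 40, 40]})

non_join_cols = ['c', 'd']
# True only for rows where every non-join column matches
num_rows_equal = int((df1[non_join_cols] == df2[non_join_cols]).all(axis=1).sum())
```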
Use pandas.merge_ordered with how='inner'. From there, you can get the dataframe's shape and, by extension, your number of matching rows.
df_r = pd.merge_ordered(df1,df2,how='inner')
   a  b   c   d
0  1  2  60  50
1  2  3  20  90
no_of_rows = df_r.shape[0]
#print(no_of_rows)
#2

Pandas create a custom groupby aggregation for column

Is there a way in Pandas to create a new column that is a function of two columns' aggregates, so that for any arbitrary grouping it preserves the function? This would be functionally similar to creating a calculated column in Excel and pivoting by labels.
df1 = pd.DataFrame({'lab':['lab1','lab2']*5,'A':[1,2]*5,'B':[4,5]*5})
df1['C'] = df1.apply(lambda x: x['A']/x['B'],axis=1)
pd.pivot_table(df1, index='lab', aggfunc={'A': sum, 'B': sum, 'C': lambda x: x['A']/x['B']})  # pseudocode: pivot_table cannot actually express C this way
should return:
|lab |A |B |C  |
|----|--|--|---|
|lab1|5 |20|.25|
|lab2|10|25|.4 |
I'd like to aggregate by 'lab' (or any combination of labels) and have the dataframe return the aggregation without having to re-define the column calculation. I realize this is trivial to code manually, but it's repetitive when you have many columns.
There are two ways you can do this using apply or agg:
import numpy as np
import pandas as pd
# Method 1
df1.groupby('lab').apply(lambda df: pd.Series({'A': df['A'].sum(), 'B': df['B'].sum(), 'C': df['C'].unique()[0]})).reset_index()
# Method 2
df1.groupby('lab').agg({'A': 'sum',
                        'B': 'sum',
                        'C': lambda x: np.unique(x)}).reset_index()
# output
    lab   A   B     C
0  lab1   5  20  0.25
1  lab2  10  25  0.40
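Note the unique trick above works only because every row in a group happens to have the same ratio. A pattern that preserves C = A/B under any grouping is to aggregate the inputs and recompute the ratio afterwards (a sketch on the question's df1):

```python
import pandas as pd

df1 = pd.DataFrame({'lab': ['lab1', 'lab2'] * 5,
                    'A': [1, 2] * 5,
                    'B': [4, 5] * 5})

# Aggregate the input columns, then recompute C from the aggregates
g = df1.groupby('lab')[['A', 'B']].sum()
g['C'] = g['A'] / g['B']
```

The recomputation step is the part that has to be repeated per grouping, but it is a one-liner per derived column.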

Pandas: get string from specific column header

Context:
I am new to Pandas and need a function that creates new columns based on existing columns. Each new column's name should be the original column's name plus extra characters (example: create an "As NEW" column from the "As" column).
Can I access the old column's header string to build the new column's name?
Problem:
I have df['columnA'] and need to get the string "columnA".
If I understand you correctly, this may be what you're looking for.
You can use str.contains() for the columns, then use string formatting to create the new column name.
df = pd.DataFrame({'col1':['A', 'A', 'B','B'], 'As': ['B','B','C','C'], 'col2': ['C','C','A','A'], 'col3': [30,10,14,91]})
col = df.columns[df.columns.str.contains('As')]
df['%s New' % col[0]] = 'foo'
print (df)
  As col1 col2  col3 As New
0  B    A    C    30    foo
1  B    A    C    10    foo
2  C    B    A    14    foo
3  C    B    A    91    foo
Assuming that you have an empty DataFrame df with columns, you could access the columns of df as a list with:
>>> df.columns
Index(['columnA', 'columnB'], dtype='object')
.columns will allow you to overwrite the columns of df, but you don't need to pass in another Index. You can pass it a regular list, like so:
>>> df.columns = ['columna', 'columnb']
>>> df
Empty DataFrame
Columns: [columna, columnb]
Index: []
This can be done through the columns attribute.
cols = df.columns
# Do whatever operation you want on the list of strings in cols
df.columns = cols
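Putting the answers together, the original goal (an "As NEW" column derived from "As") can be sketched as below; the doubling transform is just a placeholder for whatever calculation is needed:

```python
import pandas as pd

df = pd.DataFrame({'As': [1, 2], 'Bs': [3, 4]})

# Snapshot the names first, since the loop adds new columns as it goes
for col in list(df.columns):
    df[f'{col} NEW'] = df[col] * 2  # placeholder transformation
```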
