Pandas apply to create multiple columns, using multiple columns as input - python

I am trying to use a function to create multiple outputs, using multiple columns as inputs. Here's my attempt:
df = pd.DataFrame(np.random.randint(0,10,size=(6, 4)), columns=list('ABCD'))
df.head()
A B C D
0 8 2 5 0
1 9 9 8 6
2 4 0 1 7
3 8 4 0 3
4 5 6 9 9
def some_func(a, b, c):
    return a+b, a+b+c
df['dd'], df['ee'] = df.apply(lambda x: some_func(a = x['A'], b = x['B'], c = x['C']), axis=1, result_type='expand')
df.head()
A B C D dd ee
0 8 2 5 0 0 1
1 9 9 8 6 0 1
2 4 0 1 7 0 1
3 8 4 0 3 0 1
4 5 6 9 9 0 1
The outputs are all 0 for the first new column and all 1 for the second. I am interested in the correct solution, but I am also curious why my code produced this result.

You can assign to the list of new columns ['dd','ee']:
def some_func(a, b, c):
    return a+b, a+b+c
df[['dd','ee']] = df.apply(lambda x: some_func(a = x['A'],
                                               b = x['B'],
                                               c = x['C']), axis=1, result_type='expand')
print (df)
A B C D dd ee
0 4 7 7 3 11 18
1 2 1 3 4 3 6
2 4 7 6 0 11 17
3 0 9 1 1 9 10
4 5 6 5 9 11 16
5 3 2 4 9 5 9
If possible, a vectorized solution is better and faster:
df = df.assign(dd = df.A + df.B, ee = df.A + df.B + df.C)

To explain the 0 and 1: they are actually the column names of the DataFrame returned by
df.apply(lambda x: some_func(a = x['A'], b = x['B'], c = x['C']), axis=1, result_type='expand')
That is, unpacking it iterates over the column names:
x = df.apply(lambda x: some_func(a = x['A'], b = x['B'], c = x['C']), axis=1, result_type='expand')
a, b = x
print(a) # first column name
print(b) # second column name
output:
0
1
So your assignment is effectively
df['dd'], df['ee'] = 0, 1
which results in
A B C D dd ee
0 8 2 5 0 0 1
1 9 9 8 6 0 1
2 4 0 1 7 0 1
3 8 4 0 3 0 1
4 5 6 9 9 0 1
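The behaviour described above can be reproduced on a small frame. This is a minimal sketch with made-up values (the three-row `df` is an assumption, not the asker's random data); it shows that iterating or unpacking a DataFrame yields column labels, while assigning to a list of names aligns whole columns.

```python
import pandas as pd

# Small illustrative frame; the values are arbitrary.
df = pd.DataFrame({"A": [8, 9, 4], "B": [2, 9, 0], "C": [5, 8, 1]})

def some_func(a, b, c):
    return a + b, a + b + c

expanded = df.apply(lambda x: some_func(x["A"], x["B"], x["C"]),
                    axis=1, result_type="expand")

# Iterating over a DataFrame yields its column labels, not its columns.
labels = list(expanded)
print(labels)  # [0, 1]

# Tuple unpacking iterates the same way, so each name receives a scalar label.
first, second = expanded

# Assigning to a list of column names aligns the two result columns instead.
df[["dd", "ee"]] = expanded
print(df)
```

Because `first` and `second` are the scalars 0 and 1, `df['dd'], df['ee'] = expanded` broadcasts those scalars down each new column.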

Alternative way, unpacking the row-wise tuples with zip:
df['dd'], df['ee'] = zip(*df.apply(lambda x: some_func(x['A'], x['B'], x['C']), axis=1))

Related

split columns wrt column names using pandas dataframe

I have a dataframe df:
(A,B) (B,C) (D,B) (E,F)
0 3 0 1
1 1 3 0
2 2 4 2
I want to split it into different columns for all columns in df as shown below:
A B B C D B E F
0 0 3 3 0 0 1 1
1 1 1 1 3 3 0 0
2 2 2 2 4 4 2 2
and add similar columns together:
A B C D E F
0 3 3 0 1 1
1 5 1 3 0 0
2 8 2 4 2 2
How can I achieve this using pandas?
With pandas, you can use this:
out = (
    df
    .T
    .reset_index()
    .assign(col=lambda x: x.pop("index").str.strip("()").str.split(","))
    .explode("col")
    .groupby("col", as_index=False).sum()
    .set_index("col")
    .T
    .rename_axis(None, axis=1)
)
# Output:
print(out)
A B C D E F
0 0 3 3 0 1 1
1 1 5 1 3 0 0
2 2 8 2 4 2 2
I assume each column label is a tuple such as ('A', 'B'):
pd.concat([pd.DataFrame([df[i].tolist()] * len(i), index=list(i)) for i in df.columns]).sum(level=0).T
result:
A B C D E F
0 0 3 3 0 1 1
1 1 5 1 3 0 0
2 2 8 2 4 2 2
If a FutureWarning occurs (the level argument of sum is deprecated in newer pandas versions), use the following code instead:
pd.concat([pd.DataFrame([df[i].tolist()] * len(i), index=list(i)) for i in df.columns]).groupby(level=0).sum().T
This gives the same result.
Use Series.str.findall to turn the column names into a MultiIndex, then concat the frame with each level dropped in turn and sum by column name:
df.columns = df.columns.str.findall(r'(\w+)').map(tuple)
df = (pd.concat([df.droplevel(x, axis=1) for x in range(df.columns.nlevels)], axis=1)
        .groupby(level=0, axis=1)
        .sum())
print (df)
A B C D E F
0 0 3 3 0 1 1
1 1 5 1 3 0 0
2 2 8 2 4 2 2
To write the output to a file without the index, use:
df.to_csv('file.csv', index=False)
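Since grouping along columns with `axis=1` raises a deprecation warning in recent pandas releases, the same idea can be sketched on the transposed frame instead. This is only one possible variant, assuming string headers of the form "(A,B)" as in the question; the intermediate names `names` and `long` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"(A,B)": [0, 1, 2],
                   "(B,C)": [3, 1, 2],
                   "(D,B)": [0, 3, 4],
                   "(E,F)": [1, 0, 2]})

# Map each header such as "(A,B)" to a list of names: ["A", "B"].
names = df.columns.to_series().str.findall(r"\w+")

# Transpose so the original columns become rows, attach the name lists,
# and repeat each row once per name it contains.
long = df.T.assign(name=names).explode("name")

# Sum the rows that share a name, then transpose back.
out = long.groupby("name").sum().T.rename_axis(None, axis=1)
print(out)
```

Grouping the rows of the transpose avoids the `axis=1` argument entirely, at the cost of two transpositions.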
You can use findall to extract the variables in the header, then melt and explode, and finally pivot_table:
out = (df
    .reset_index().melt('index')
    .assign(variable=lambda d: d['variable'].str.findall(r'(\w+)'))
    .explode('variable')
    .pivot_table(index='index', columns='variable', values='value', aggfunc='sum')
    .rename_axis(index=None, columns=None)
)
Output:
A B C D E F
0 0 3 3 0 1 1
1 1 5 1 3 0 0
2 2 8 2 4 2 2
Reproducible input:
df = pd.DataFrame({'(A,B)': [0, 1, 2],
'(B,C)': [3, 1, 2],
'(D,B)': [0, 3, 4],
'(E,F)': [1, 0, 2]})
printing/saving without index:
print(out.to_string(index=False))
A B C D E F
0 3 3 0 1 1
1 5 1 3 0 0
2 8 2 4 2 2
# as file
out.to_csv('yourfile.csv', index=False)

Pandas stack() if columns have a specific value

I am trying to stack this table based on ID column but only considering columns [A-D] where the value is 1 and not 0.
Current df:
ID A B C D
1 1 0 0 1
3 0 1 0 1
7 1 0 1 1
8 1 0 0 0
What I want:
ID LETTER
1 A
1 D
3 B
3 D
7 A
7 C
7 D
8 A
The following code works but I need a more efficient solution as I have a df with 93434 rows x 12377 columns.
stacked_df = df.set_index('ID').stack().reset_index(name='has_letter').rename(columns={'level_1':'LETTER'})
stacked_df = stacked_df[stacked_df['has_letter']==1].reset_index(drop=True)
stacked_df.drop(['has_letter'], axis=1, inplace=True)
Try:
print(
    df.set_index("ID")
    .apply(lambda x: x.index[x == 1], axis=1)
    .reset_index()
    .explode(0)
    .rename(columns={0: "LETTERS"})
)
Prints:
ID LETTERS
0 1 A
0 1 D
1 3 B
1 3 D
2 7 A
2 7 C
2 7 D
3 8 A
Or:
x = df.set_index("ID").stack()
print(
    x[x == 1]
    .reset_index()
    .drop(columns=0)
    .rename(columns={"level_1": "LETTER"})
)
Prints:
ID LETTER
0 1 A
1 1 D
2 3 B
3 3 D
4 7 A
5 7 C
6 7 D
7 8 A
You can mask the non-1 values and stack to remove the NaNs:
df2 = df.rename_axis(columns='LETTERS').set_index('ID')
stacked_df = (df2.where(df2.eq(1)).stack()
                 .reset_index().iloc[:,:2]
             )
Output:
ID LETTERS
0 1 A
1 1 D
2 3 B
3 3 D
4 7 A
5 7 C
6 7 D
7 8 A
Try this:
(df.set_index('ID').dot(df.columns[1:]) # use inner product of column names and values
.apply(list) # separate each letter
.explode() # explode each list
.reset_index(name='LETTER') # reset index for df
)
Output:
ID LETTER
0 1 A
1 1 D
2 3 B
3 3 D
4 7 A
5 7 C
6 7 D
7 8 A
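For a very wide frame, it may also be worth trying to locate the 1s directly in the underlying NumPy array rather than stacking every cell. The following is a rough sketch of that different technique (not taken from the answers above), assuming ID is the first column and every other column holds 0/1 flags; whether it is faster on the real data would need to be measured.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"ID": [1, 3, 7, 8],
                   "A": [1, 0, 1, 1],
                   "B": [0, 1, 0, 0],
                   "C": [0, 0, 1, 0],
                   "D": [1, 1, 1, 0]})

letters = df.columns[1:]           # every column except ID
values = df[letters].to_numpy()    # 2-D array of the 0/1 flags

# Row and column positions of the cells equal to 1, in row-major order.
rows, cols = np.nonzero(values == 1)

# Look up the ID for each row position and the letter for each column position.
stacked_df = pd.DataFrame({"ID": df["ID"].to_numpy()[rows],
                           "LETTER": letters.to_numpy()[cols]})
print(stacked_df)
```

Only the matching cells are materialised here, so the intermediate result stays proportional to the number of 1s rather than to rows times columns.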

How to merge two dataframes without basing the merge on any index

Suppose I have two dataframes X and Y:
import pandas as pd
X = pd.DataFrame({'A':[1,4,7],'B':[2,5,8],'C':[3,6,9]})
Y = pd.DataFrame({'D':[1],'E':[11]})
In [4]: X
Out[4]:
A B C
0 1 2 3
1 4 5 6
2 7 8 9
In [6]: Y
Out[6]:
D E
0 1 11
and then, I want to get the following result dataframe:
A B C D E
0 1 2 3 1 11
1 4 5 6 1 11
2 7 8 9 1 11
how?
Assuming that Y contains only one row:
In [9]: X.assign(**Y.to_dict('records')[0])
Out[9]:
A B C D E
0 1 2 3 1 11
1 4 5 6 1 11
2 7 8 9 1 11
or a much nicer alternative from @piRSquared:
In [27]: X.assign(**Y.iloc[0])
Out[27]:
A B C D E
0 1 2 3 1 11
1 4 5 6 1 11
2 7 8 9 1 11
Helper dict:
In [10]: Y.to_dict('records')[0]
Out[10]: {'D': 1, 'E': 11}
Here is another way
Y2 = pd.concat([Y]*3, ignore_index = True) #This duplicates the rows
Which produces:
D E
0 1 11
1 1 11
2 1 11
Then concat once again:
pd.concat([X,Y2], axis =1)
A B C D E
0 1 2 3 1 11
1 4 5 6 1 11
2 7 8 9 1 11
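Another option, not mentioned in the answers above, is a cross join, which pairs every row of X with every row of Y. This sketch assumes a pandas version that supports `how="cross"` in `merge` (added in pandas 1.2).

```python
import pandas as pd

X = pd.DataFrame({"A": [1, 4, 7], "B": [2, 5, 8], "C": [3, 6, 9]})
Y = pd.DataFrame({"D": [1], "E": [11]})

# Cartesian product of the rows: 3 rows of X times 1 row of Y.
result = X.merge(Y, how="cross")
print(result)
```

Unlike the `assign` approach, this does not require Y to have exactly one row; with n rows in Y, each row of X would appear n times.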

Add new column to pandas DataFrame with value dependent on previous row

I have an existing pandas DataFrame, and I want to add a new column, where the value of each row will depend on the previous row.
for example:
df1 = pd.DataFrame(np.random.randint(10, size=(4, 4)), columns=['a', 'b', 'c', 'd'])
df1
Out[31]:
a b c d
0 9 3 3 0
1 3 9 5 1
2 1 7 5 6
3 8 0 1 7
and now I want to create column e, where for each row i the value of df1['e'][i] would be: df1['e'][i] = df1['d'][i] - df1['d'][i-1]
desired output:
df1:
a b c d e
0 9 3 3 0 0
1 3 9 5 1 1
2 1 7 5 6 5
3 8 0 1 7 1
how can I achieve this?
You can use sub with shift:
df['e'] = df.d.sub(df.d.shift(), fill_value=0)
print (df)
a b c d e
0 9 3 3 0 0.0
1 3 9 5 1 1.0
2 1 7 5 6 5.0
3 8 0 1 7 1.0
If you need to convert to int:
df['e'] = df.d.sub(df.d.shift(), fill_value=0).astype(int)
print (df)
a b c d e
0 9 3 3 0 0
1 3 9 5 1 1
2 1 7 5 6 5
3 8 0 1 7 1
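The same row-to-row difference can also be written with `Series.diff`, which subtracts the previous row by default. A short sketch using the values from the question (typed in by hand here, since the original frame was random):

```python
import pandas as pd

df1 = pd.DataFrame({"a": [9, 3, 1, 8],
                    "b": [3, 9, 7, 0],
                    "c": [3, 5, 5, 1],
                    "d": [0, 1, 6, 7]})

# diff() leaves NaN in the first row (there is no previous row),
# so fill it with 0 before casting back to int.
df1["e"] = df1["d"].diff().fillna(0).astype(int)
print(df1)
```

The first row becomes 0 only because of the `fillna(0)`; a different fill value could be chosen if the first difference should be treated differently.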

Pandas - how to replace specific values in a Series?

I have a dataframe with a column called product_type such as:
df1.product_type.unique()
>> ["prod_1", "prod_2", "prod_3"]
df.prod_cost.dtype
>> dtype('O')
I am looking for the most efficient way to replace that by numerical values [1, 2, 3].
Thanks
Use factorize to encode a new column:
In [2]:
df = pd.DataFrame({'a':list('abcdbcbccc')})
df
Out[2]:
a
0 a
1 b
2 c
3 d
4 b
5 c
6 b
7 c
8 c
9 c
In [5]:
df['code'] = df['a'].factorize()[0] + 1
df
Out[5]:
a code
0 a 1
1 b 2
2 c 3
3 d 4
4 b 2
5 c 3
6 b 2
7 c 3
8 c 3
9 c 3
so in your case:
df1['product_type'] = df1['product_type'].factorize()[0] + 1
should work
Cast the column as a category, and then get the codes.
df1 = pd.DataFrame({'product_type': ['prod_1'] * 3 + ['prod_2'] * 3 + ['prod_3'] * 3})
df1['product_type_code'] = df1.product_type.astype('category').cat.codes
>>> df1
product_type product_type_code
0 prod_1 0
1 prod_1 0
2 prod_1 0
3 prod_2 1
4 prod_2 1
5 prod_2 1
6 prod_3 2
7 prod_3 2
8 prod_3 2
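The two approaches above can number the values differently: `factorize` follows the order of first appearance unless `sort=True` is passed, while category codes follow the sorted categories by default. The sketch below illustrates this on a small made-up Series, and adds an explicit dictionary with `Series.map` as a third option when specific numbers are required; the `mapping` dict is an assumption for illustration.

```python
import pandas as pd

# Deliberately unsorted so the difference in numbering is visible.
s = pd.Series(["prod_2", "prod_1", "prod_3", "prod_1"])

# Codes in order of first appearance: prod_2 -> 0, prod_1 -> 1, prod_3 -> 2.
codes_seen, uniques_seen = pd.factorize(s)

# Codes based on the sorted unique values: prod_1 -> 0, prod_2 -> 1, prod_3 -> 2.
codes_sorted, uniques_sorted = pd.factorize(s, sort=True)

# Category codes use the sorted categories by default as well.
cat_codes = s.astype("category").cat.codes

# An explicit mapping gives full control over which label gets which number.
mapping = {"prod_1": 1, "prod_2": 2, "prod_3": 3}
mapped = s.map(mapping)

print(codes_seen, codes_sorted, cat_codes.tolist(), mapped.tolist())
```

If the numeric value must match the suffix of the label (prod_1 to 1, and so on), the explicit mapping or the sorted variants are the safer choice.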
