Map function across multi-column DataFrame - python

Given a DataFrame like the following
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'x': [1, 2, 3, 4], 'y': [4, 3, 2, 1]})
I would like to map a row-wise function across its columns
In [3]: df.map(lambda (x, y): x + y)
and get something like the following
0 5
1 5
2 5
3 5
Name: None, dtype: int64
Is this possible?

You can apply a function row-wise by setting axis=1:
df.apply(lambda row: row.x + row.y, axis=1)
Out[145]:
0 5
1 5
2 5
3 5
dtype: int64
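For a simple case like this, a vectorized column operation avoids the per-row Python overhead of apply entirely; a minimal sketch on the question's data:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 4], 'y': [4, 3, 2, 1]})

# Vectorized addition operates on whole columns at once,
# which is much faster than apply(..., axis=1) on large frames.
result = df['x'] + df['y']     # equivalent here: df.sum(axis=1)
print(result.tolist())         # [5, 5, 5, 5]
```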

Related

Extract rows with repeats, from dataframe where column value matches value from an array

I have a pandas.DataFrame df with one of the column headers being 'X'. Let's say it is of size (N, M); N=3, M=2 in this example:
X Y
0 1 a
1 2 b
2 3 c
I have a 1D numpy.array arr of size (Q,), that contains values, some of which are repeats. Q=5 in this example:
array([1, 2, 3, 2, 2])
I would like to create a new pandas.DataFrame df_op that contains rows from df, where each row.X matches an entry from arr. This means some rows are extracted more than once, and the resultant df_op has size (Q, M). If possible, I would also like to keep the same order of entries as in arr.
X Y
0 1 a
1 2 b
2 3 c
3 2 b
4 2 b
Using the usual boolean indexing does not work, because that only picks up unique rows. I would also like to avoid loops if possible, because Q is large.
How can I get df_op? Thank you.
Use indexing to select the same row multiple times:
x = [1, 2, 3, 2, 2]
df = pd.DataFrame({'X': [1, 2, 3], 'Y': ['a', 'b', 'c']})
out = df.set_index('X').loc[x].reset_index()
Output:
>>> out
X Y
0 1 a
1 2 b
2 3 c
3 2 b
4 2 b
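A left merge keyed on the arr values is an alternative sketch that also preserves the order of arr, and avoids moving 'X' into the index:

```python
import pandas as pd

df = pd.DataFrame({'X': [1, 2, 3], 'Y': ['a', 'b', 'c']})
x = [1, 2, 3, 2, 2]

# A left merge keeps the row order of the left frame (the arr values),
# repeating rows of df wherever a value occurs more than once in x.
out = pd.DataFrame({'X': x}).merge(df, on='X', how='left')
print(out)
```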

pandas.apply expand column ValueError: If using all scalar values, you must pass an index

I want to apply a function to a DataFrame that returns several columns for each column in the original dataset. The apply function returns a DataFrame with columns and indexes but it still raises the error ValueError: If using all scalar values, you must pass an index.
I've tried to set the name of the output dataframe, to set the columns as a multiindex and set the index as a multiindex but it doesn't work.
Example: I have this input dataframe
df_all_users = pd.DataFrame(
    [[1, 2, 3],
     [1, 2, 3],
     [1, 2, 3]],
    index=["2020-01-01", "2020-01-02", "2020-01-03"],
    columns=["user_1", "user_2", "user_3"])
user_1 user_2 user_3
2020-01-01 1 2 3
2020-01-02 1 2 3
2020-01-03 1 2 3
The apply_function is like this:
def apply_function(df):
    df_out = pd.DataFrame(index=df.index)
    # these columns are in reality computed using some other functions
    df_out["column_1"] = df.values   # example: pyod.ocsvm.OCSVM.fit_predict(df.values)
    df_out["column_2"] = -df.values  # example: pyod.knn.KNN.fit_predict(df.values)
    # these are the things I've tried that didn't work
    df_out.name = df.name
    df_out.columns = pd.MultiIndex.from_tuples(
        [(df.name, column) for column in df_out.columns],
        names=["user", "score"])
    df_out.index = pd.MultiIndex.from_tuples(
        [(df.name, idx) for idx in df_out.index],
        names=["user", "date"])
    print(df_out)
    return df_out

df_all_users.apply(apply_function, axis=0, result_type="expand")
Which raises the error:
ValueError: If using all scalar values, you must pass an index
The output that I expect would be like this:
out_df = pd.DataFrame(
    [[1, 1, 2, 2, 3, 3],
     [1, 1, 2, 2, 3, 3],
     [1, 1, 2, 2, 3, 3]],
    index=["2020-01-01", "2020-01-02", "2020-01-03"],
    columns=pd.MultiIndex.from_tuples(
        [(user, column)
         for user in ["user_1", "user_2", "user_3"]
         for column in ["column_1", "column_2"]],
        names=("user", "score")))
user_1 user_2 user_3
column_1 column_2 column_1 column_2 column_1 column_2
2020-01-01 1 1 2 2 3 3
2020-01-02 1 1 2 2 3 3
2020-01-03 1 1 2 2 3 3
Do this:
import numpy as np
df_all_users[np.repeat(df_all_users.columns.values, 2)]
Ok, the answer was to transform the output to a Series of arrays, then concatenate the results:
import numpy as np
import pandas as pd

df_all_users = pd.DataFrame(
    [[1, 2, 3],
     [1, 2, 3],
     [1, 2, 3]],
    index=["2020-01-01", "2020-01-02", "2020-01-03"],
    columns=["user_1", "user_2", "user_3"])

def apply_function(df):
    df_out = pd.DataFrame(index=df.index)
    df_out["column_1"] = df.values
    df_out["column_2"] = df.values
    df_out = pd.Series([values for values in df_out.values], index=df.index)
    df_out.name = df.name
    return df_out

df_out = df_all_users.groupby(level=0, axis=1).apply(apply_function)
df_out = pd.DataFrame([np.concatenate(values, axis=0) for values in df_out.values],
                      index=df_out.index,
                      columns=pd.MultiIndex.from_tuples(
                          [(user, column)
                           for column in ["column_1", "column_2"]
                           for user in df_out.columns],
                          names=["user", "algorithm"]))
df_out
user user_1 user_2 user_3
algorithm column_1 column_2 column_1 column_2 column_1 column_2
2020-01-01 1 1 2 2 3 3
2020-01-02 1 1 2 2 3 3
2020-01-03 1 1 2 2 3 3
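An alternative sketch that sidesteps apply entirely: build one small frame of scores per user and let pd.concat create the column MultiIndex from the dict keys (the per-user score columns here are placeholders, standing in for the real fit_predict calls from the question):

```python
import pandas as pd

df_all_users = pd.DataFrame(
    [[1, 2, 3], [1, 2, 3], [1, 2, 3]],
    index=["2020-01-01", "2020-01-02", "2020-01-03"],
    columns=["user_1", "user_2", "user_3"])

# One frame of score columns per user; concat stitches them side by side
# and turns the dict keys into the outer column level.
parts = {
    user: pd.DataFrame({"column_1": df_all_users[user].values,
                        "column_2": df_all_users[user].values},
                       index=df_all_users.index)
    for user in df_all_users.columns
}
out = pd.concat(parts, axis=1, names=["user", "score"])
print(out)
```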

pandas apply to append total row

I have gone through numerous examples of putting a Total row to the end of a dataframe. I am just curious to know that why can't below approach work:
import pandas as pd
dict = {
    'a': [1, 2, 3, 4, 5],
    'b': [3, 5, 7, 9, 10]
}
df = pd.DataFrame(dict)
# print(df)

def func(x):
    x['sum'] = x.sum()

df2 = df.apply(lambda x: func(x), axis=0)
print(df2)
In this, x is always a Series containing a full column and I am appending an index called sum. Please guide.
EDIT: If we can calculate sum with axis=1 why can't we do it with axis=0.
The missing piece is the return x at the end of your function; without it, func returns None for every column:
def func(x):
    x['sum'] = x.sum()
    return x

df2 = df.apply(lambda x: func(x), axis=0)
print(df2)
a b
0 1 3
1 2 5
2 3 7
3 4 9
4 5 10
sum 15 34
But the simplest approach is setting with enlargement:
df.loc['sum'] = df.sum()
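A minimal sketch of the enlargement approach on the question's data:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [3, 5, 7, 9, 10]})

# Assigning to a label that does not exist yet enlarges the frame,
# appending a new row labelled 'sum'.
df.loc['sum'] = df.sum()
print(df.loc['sum'])   # a = 15, b = 34
```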

Pandas Dataframe 2D selection with row number and column label

What's the easiest way to get a value in a pandas 2D DataFrame using the row number and the column title as indices (a combo of loc and iloc)?
df = pd.DataFrame({'a':[1, 2, 3], 'b':[4, 5, 6]}, index=['i', 'j', 'k'])
df
a b
i 1 4
j 2 5
k 3 6
If you have an index that isn't numeric, and want to grab the first and second rows from column 'a', you can either use loc with indexing:
df.loc[df.index[[0, 1]], 'a']
i 1
j 2
Name: a, dtype: int64
Or, iloc + get_loc:
df.iloc[[0, 1], df.columns.get_loc('a')]
i 1
j 2
Name: a, dtype: int64
If the DataFrame uses the default integer index, you can just use loc and pass it the row number and column name:
import pandas as pd
df = pd.DataFrame({'a':[1,2,3], 'b':[4,5,6]})
print(df.loc[0, 'b'])
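For scalar lookups against an arbitrary, non-numeric index, one sketch is to translate the row position into a label and use at, or the column label into a position and use iat:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}, index=['i', 'j', 'k'])

# Row position -> label, then fast label-based scalar access:
v1 = df.at[df.index[1], 'b']
# Column label -> position, then fast position-based scalar access:
v2 = df.iat[1, df.columns.get_loc('b')]
print(v1, v2)   # 5 5
```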

Multiple filters Python Data.frame

I'm pretty new to Python. I'm trying to filter rows in a DataFrame the way I would in R.
sub_df = df[df[main_id]==3]
works, but
df[df[main_id] in [3,7]]
gives me error
"The truth value of a Series is ambiguous"
Can you suggest the correct syntax for similar selections?
You can use the pandas isin method. This would look like this:
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'f']})
df[df['A'].isin([2, 3])]
giving:
A B
1 2 b
2 3 f
Another option is apply with a membership test:
df[df[main_id].apply(lambda x: x in [3, 7])]
Yet another solution, using query:
In [60]: df = pd.DataFrame({'main_id': [0,1, 2, 3], 'x': list('ABCD')})
In [61]: df
Out[61]:
main_id x
0 0 A
1 1 B
2 2 C
3 3 D
In [62]: df.query("main_id in [0,3]")
Out[62]:
main_id x
0 0 A
3 3 D
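Both tools invert cleanly; a small sketch of excluding values rather than keeping them:

```python
import pandas as pd

df = pd.DataFrame({'main_id': [0, 1, 2, 3], 'x': list('ABCD')})

# ~ negates the boolean mask produced by isin
kept = df[~df['main_id'].isin([0, 3])]
# query supports the same selection with `not in`
kept_q = df.query("main_id not in [0, 3]")
print(kept['x'].tolist())   # ['B', 'C']
```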
