Map function across multi-column DataFrame - python

Given a DataFrame like the following
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'x': [1, 2, 3, 4], 'y': [4, 3, 2, 1]})
I would like to map a row-wise function across its columns
In [3]: df.map(lambda (x, y): x + y)
and get something like the following
0 5
1 5
2 5
3 5
Name: None, dtype: int64
Is this possible?

You can apply a function row-wise by setting axis=1:
df.apply(lambda row: row.x + row.y, axis=1)
Out[145]:
0 5
1 5
2 5
3 5
dtype: int64
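For a simple case like this, a vectorized column operation avoids the per-row Python overhead of apply entirely; a minimal sketch on the question's data:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 4], 'y': [4, 3, 2, 1]})

# Vectorized addition operates on whole columns at once,
# which is much faster than apply(..., axis=1) on large frames.
result = df['x'] + df['y']     # equivalent here: df.sum(axis=1)
print(result.tolist())         # [5, 5, 5, 5]
```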

Related

Extract rows with repeats, from dataframe where column value matches value from an array

I have a pandas.DataFrame df with one of the column headers being 'X'. Let's say it is of size (N, M); N=3, M=2 in this example:
X Y
0 1 a
1 2 b
2 3 c
I have a 1D numpy.array arr of size (Q,), that contains values, some of which are repeats. Q=5 in this example:
array([1, 2, 3, 2, 2])
I would like to create a new pandas.DataFrame df_op that contains rows from df, where each row.X matches an entry from arr. This means some rows are extracted more than once, and the resultant df_op has size (Q, M). If possible, I would also like to keep the same order of entries as in arr.
X Y
0 1 a
1 2 b
2 3 c
3 2 b
4 2 b
Using the usual boolean indexing does not work, because that only picks up unique rows. I would also like to avoid loops if possible, because Q is large.
How can I get df_op? Thank you.
Use indexing to select the same row multiple times:
x = [1, 2, 3, 2, 2]
df = pd.DataFrame({'X': [1, 2, 3], 'Y': ['a', 'b', 'c']})
out = df.set_index('X').loc[x].reset_index()
Output:
>>> out
X Y
0 1 a
1 2 b
2 3 c
3 2 b
4 2 b
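A left merge keyed on the arr values is an alternative sketch that also preserves the order of arr, and avoids moving 'X' into the index:

```python
import pandas as pd

df = pd.DataFrame({'X': [1, 2, 3], 'Y': ['a', 'b', 'c']})
x = [1, 2, 3, 2, 2]

# A left merge keeps the row order of the left frame (the arr values),
# repeating rows of df wherever a value occurs more than once in x.
out = pd.DataFrame({'X': x}).merge(df, on='X', how='left')
print(out)
```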

pandas.apply expand column ValueError: If using all scalar values, you must pass an index

I want to apply a function to a DataFrame that returns several columns for each column in the original dataset. The apply function returns a DataFrame with columns and indexes but it still raises the error ValueError: If using all scalar values, you must pass an index.
I've tried to set the name of the output dataframe, to set the columns as a multiindex and set the index as a multiindex but it doesn't work.
Example: I have this input dataframe
df_all_users = pd.DataFrame(
    [[1, 2, 3],
     [1, 2, 3],
     [1, 2, 3]],
    index=["2020-01-01", "2020-01-02", "2020-01-03"],
    columns=["user_1", "user_2", "user_3"])
user_1 user_2 user_3
2020-01-01 1 2 3
2020-01-02 1 2 3
2020-01-03 1 2 3
The apply_function is like this:
def apply_function(df):
    df_out = pd.DataFrame(index=df.index)
    # these columns are in reality computed using some other functions
    df_out["column_1"] = df.values   # example: pyod.ocsvm.OCSVM.fit_predict(df.values)
    df_out["column_2"] = -df.values  # example: pyod.knn.KNN.fit_predict(df.values)
    # these are the things I've tried that didn't work
    df_out.name = df.name
    df_out.columns = pd.MultiIndex.from_tuples(
        [(df.name, column) for column in df_out.columns],
        names=["user", "score"])
    df_out.index = pd.MultiIndex.from_tuples(
        [(df.name, idx) for idx in df_out.index],
        names=["user", "date"])
    print(df_out)
    return df_out

df_all_users.apply(apply_function, axis=0, result_type="expand")
Which raises the error:
ValueError: If using all scalar values, you must pass an index
The output that I expect would be like this:
out_df = pd.DataFrame(
    [[1, 1, 2, 2, 3, 3],
     [1, 1, 2, 2, 3, 3],
     [1, 1, 2, 2, 3, 3]],
    index=["2020-01-01", "2020-01-02", "2020-01-03"],
    columns=pd.MultiIndex.from_tuples(
        [(user, column)
         for user in ["user_1", "user_2", "user_3"]
         for column in ["column_1", "column_2"]],
        names=("user", "score")))
user_1 user_2 user_3
column_1 column_2 column_1 column_2 column_1 column_2
2020-01-01 1 1 2 2 3 3
2020-01-02 1 1 2 2 3 3
2020-01-03 1 1 2 2 3 3
Do this:
import numpy as np
df_all_users[np.repeat(df_all_users.columns.values, 2)]
Ok, the answer was to transform the output to a Series of arrays, then concatenate the results:
import numpy as np
import pandas as pd

df_all_users = pd.DataFrame(
    [[1, 2, 3],
     [1, 2, 3],
     [1, 2, 3]],
    index=["2020-01-01", "2020-01-02", "2020-01-03"],
    columns=["user_1", "user_2", "user_3"])

def apply_function(df):
    df_out = pd.DataFrame(index=df.index)
    df_out["column_1"] = df.values
    df_out["column_2"] = df.values
    df_out = pd.Series([values for values in df_out.values], index=df.index)
    df_out.name = df.name
    return df_out

df_out = df_all_users.groupby(level=0, axis=1).apply(apply_function)
df_out = pd.DataFrame([np.concatenate(values, axis=0) for values in df_out.values],
                      index=df_out.index,
                      columns=pd.MultiIndex.from_tuples(
                          [(user, column)
                           for column in ["column_1", "column_2"]
                           for user in df_out.columns],
                          names=["user", "algorithm"]))
df_out
user user_1 user_2 user_3
algorithm column_1 column_2 column_1 column_2 column_1 column_2
2020-01-01 1 1 2 2 3 3
2020-01-02 1 1 2 2 3 3
2020-01-03 1 1 2 2 3 3
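An alternative sketch that sidesteps apply entirely: build one small frame of scores per user and let pd.concat create the column MultiIndex from the dict keys (the per-user score columns here are placeholders, standing in for the real fit_predict calls from the question):

```python
import pandas as pd

df_all_users = pd.DataFrame(
    [[1, 2, 3], [1, 2, 3], [1, 2, 3]],
    index=["2020-01-01", "2020-01-02", "2020-01-03"],
    columns=["user_1", "user_2", "user_3"])

# One frame of score columns per user; concat stitches them side by side
# and turns the dict keys into the outer column level.
parts = {
    user: pd.DataFrame({"column_1": df_all_users[user].values,
                        "column_2": df_all_users[user].values},
                       index=df_all_users.index)
    for user in df_all_users.columns
}
out = pd.concat(parts, axis=1, names=["user", "score"])
print(out)
```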

pandas apply to append total row

I have gone through numerous examples of putting a Total row to the end of a dataframe. I am just curious to know that why can't below approach work:
import pandas as pd
dict = {
    'a': [1, 2, 3, 4, 5],
    'b': [3, 5, 7, 9, 10]
}
df = pd.DataFrame(dict)
# print(df)

def func(x):
    x['sum'] = x.sum()

df2 = df.apply(lambda x: func(x), axis=0)
print(df2)
In this, x is always a Series containing a full column and I am appending an index called sum. Please guide.
EDIT: If we can calculate sum with axis=1 why can't we do it with axis=0.
The missing piece is the return x at the end of your function; without it, func returns None for every column:
def func(x):
    x['sum'] = x.sum()
    return x

df2 = df.apply(lambda x: func(x), axis=0)
print(df2)
a b
0 1 3
1 2 5
2 3 7
3 4 9
4 5 10
sum 15 34
But the simplest approach is setting with enlargement:
df.loc['sum'] = df.sum()
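A minimal sketch of the enlargement approach on the question's data:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [3, 5, 7, 9, 10]})

# Assigning to a label that does not exist yet enlarges the frame,
# appending a new row labelled 'sum'.
df.loc['sum'] = df.sum()
print(df.loc['sum'])   # a = 15, b = 34
```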

Pandas Dataframe 2D selection with row number and column label

What's the easiest way to get a value in a pandas 2D DataFrame using the row number and the column title as indices (a combo of loc and iloc)?
df = pd.DataFrame({'a':[1, 2, 3], 'b':[4, 5, 6]}, index=['i', 'j', 'k'])
df
a b
i 1 4
j 2 5
k 3 6
If you have an index that isn't numeric, and want to grab the first and second rows from column 'a', you can either use loc with indexing:
df.loc[df.index[[0, 1]], 'a']
i 1
j 2
Name: a, dtype: int64
Or, iloc + get_loc:
df.iloc[[0, 1], df.columns.get_loc('a')]
i 1
j 2
Name: a, dtype: int64
If the DataFrame uses the default integer index, you can just use loc and pass it the row number and column name:
import pandas as pd
df = pd.DataFrame({'a':[1,2,3], 'b':[4,5,6]})
print(df.loc[0, 'b'])
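For scalar lookups against an arbitrary, non-numeric index, one sketch is to translate the row position into a label and use at, or the column label into a position and use iat:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}, index=['i', 'j', 'k'])

# Row position -> label, then fast label-based scalar access:
v1 = df.at[df.index[1], 'b']
# Column label -> position, then fast position-based scalar access:
v2 = df.iat[1, df.columns.get_loc('b')]
print(v1, v2)   # 5 5
```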

Multiple filters Python Data.frame

I'm pretty new to Python. I'm trying to filter rows in a DataFrame the way I would in R.
sub_df = df[df[main_id]==3]
works, but
df[df[main_id] in [3,7]]
gives me error
"The truth value of a Series is ambiguous"
Can you suggest the correct syntax for similar selections?
You can use the pandas isin method. This would look like this:
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'f']})
df[df['A'].isin([2, 3])]
giving:
A B
1 2 b
2 3 f
Another option is apply with a membership test:
df[df[main_id].apply(lambda x: x in [3, 7])]
Yet another solution, using query:
In [60]: df = pd.DataFrame({'main_id': [0,1, 2, 3], 'x': list('ABCD')})
In [61]: df
Out[61]:
main_id x
0 0 A
1 1 B
2 2 C
3 3 D
In [62]: df.query("main_id in [0,3]")
Out[62]:
main_id x
0 0 A
3 3 D
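Both tools invert cleanly; a small sketch of excluding values rather than keeping them:

```python
import pandas as pd

df = pd.DataFrame({'main_id': [0, 1, 2, 3], 'x': list('ABCD')})

# ~ negates the boolean mask produced by isin
kept = df[~df['main_id'].isin([0, 3])]
# query supports the same selection with `not in`
kept_q = df.query("main_id not in [0, 3]")
print(kept['x'].tolist())   # ['B', 'C']
```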
