I am trying to do a calculation in pandas that looks like it should be obvious, but after several tries I have not found how to do it correctly.
I have a dataframe that looks like this:
df = pd.DataFrame([["A", "a", 10.0],
                   ["A", "b", 12.0],
                   ["A", "c", 13.0],
                   ["B", "a", 5.0],
                   ["B", "b", 6.0],
                   ["B", "c", 7.0]])
The first column is a test name, the second is a class, and the third gives a time. Each test normally appears in the table with all 3 classes.
This format is suitable for plotting it like this:
sns.factorplot(x="2", y="0", hue="1", data=df, kind="bar")
So that for each test, I get a group of 3 bars, one for each class.
However I would like to change the dataframe so that each value in column 2 is not an absolute value, but a ratio compared to class "a".
So I would like to transform it to this:
df = pd.DataFrame([["A", "a", 1.0],
                   ["A", "b", 1.2],
                   ["A", "c", 1.3],
                   ["B", "a", 1.0],
                   ["B", "b", 1.2],
                   ["B", "c", 1.4]])
I am able to extract the series, change the indexes so that they match, and do the computation, for example:
df_a = df[df[1] == "a"].set_index(0)
df_b = df[df[1] == "b"].set_index(0)
df_b["ratio_a"] = df_b[2] / df_a[2]
But this is certainly very inefficient, and I would still need to merge the results back into the original format.
What is the correct way to do it?
You could use groupby/transform('first') to find the first value in each group:
import pandas as pd
df = pd.DataFrame([["A", "a", 10.0],
                   ["A", "b", 12.0],
                   ["A", "c", 13.0],
                   ["B", "b", 6.0],
                   ["B", "a", 5.0],
                   ["B", "c", 7.0]])
# sort so that class "a" comes first within each test group
df = df.sort_values(by=[0, 1])
df[2] /= df.groupby(0)[2].transform('first')
yields
0 1 2
0 A a 1.0
1 A b 1.2
2 A c 1.3
3 B a 1.0
4 B b 1.2
5 B c 1.4
You can also do this with some index alignment; this assumes the columns have been given names, e.g. df.columns = ['test', 'class', 'time']:
df1 = df.set_index(['test', 'class'])
df1 / df1.xs('a', level='class')
But transform is better.
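A runnable sketch of the index-alignment idea, assuming we first give the question's integer-labelled columns the names test, class, and time (those names are an assumption for illustration, not part of the original frame):

```python
import pandas as pd

df = pd.DataFrame([["A", "a", 10.0],
                   ["A", "b", 12.0],
                   ["A", "c", 13.0],
                   ["B", "a", 5.0],
                   ["B", "b", 6.0],
                   ["B", "c", 7.0]])
# The question's frame has integer column labels, so name them first
df.columns = ["test", "class", "time"]

s = df.set_index(["test", "class"])["time"]
# Series arithmetic aligns the "test" level of s against the index of the
# "a" slice, broadcasting each test's "a" time across that test's rows
ratios = s / s.xs("a", level="class")
print(ratios)
```

Working on the time Series (rather than the whole frame) keeps the alignment explicit: the divisor is indexed only by test, so pandas broadcasts it over the class level.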
I am comparing DataFrames with pandas. I want to distinguish the compared DataFrames' columns by naming them, so I am using the result_names parameter from the pandas documentation, but it returns: TypeError: DataFrame.compare() got an unexpected keyword argument 'result_names'.
Here is the code, which is simply the one suggested in the documentation (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.compare.html):
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "col1": ["a", "a", "b", "b", "a"],
        "col2": [1.0, 2.0, 3.0, np.nan, 5.0],
        "col3": [1.0, 2.0, 3.0, 4.0, 5.0]
    },
    columns=["col1", "col2", "col3"],
)
df2 = df.copy()
df2.loc[0, 'col1'] = 'c'
df2.loc[2, 'col3'] = 4.0
df.compare(df2, result_names=("left", "right"))
Any ideas why?
You need pandas ≥ 1.5: the result_names parameter was only added in version 1.5.0.
For earlier versions, you can instead rename the level:
df.compare(df2).rename({'self': 'left', 'other': 'right'}, axis=1, level=1)
output:
col1 col3
left right left right
0 a c NaN NaN
2 NaN NaN 3.0 4.0
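A version-tolerant sketch that combines both approaches: try result_names first, and fall back to the rename trick if the installed pandas predates 1.5 (the TypeError in the question is exactly what the older versions raise):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "col1": ["a", "a", "b", "b", "a"],
        "col2": [1.0, 2.0, 3.0, np.nan, 5.0],
        "col3": [1.0, 2.0, 3.0, 4.0, 5.0],
    }
)
df2 = df.copy()
df2.loc[0, "col1"] = "c"
df2.loc[2, "col3"] = 4.0

try:
    # pandas >= 1.5: name the two sides directly
    out = df.compare(df2, result_names=("left", "right"))
except TypeError:
    # older pandas: rename the default "self"/"other" column level
    out = df.compare(df2).rename({"self": "left", "other": "right"},
                                 axis=1, level=1)
print(out)
```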
I am trying to reindex a DataFrame, but it displays NaN values, and I am not able to understand why.
data = {
    "age": [50, 40, 30, 40],
    "qualified": [True, False, False, False]
}
index = ["P", "Q", "R", "S"]
df = pd.DataFrame(data, index=index)
new = ["A", "B", "C", "D"]
newdf = df.reindex(new)
print(newdf)
Output:
age qualified
A NaN NaN
B NaN NaN
C NaN NaN
D NaN NaN
I think you need DataFrame.set_index with a nested list if you want to replace the index values with new ones:
new = ["A", "B", "C", "D"]
newdf = df.set_index([new])
#alternative
#newdf.index = new
print(newdf)
age qualified
A 50 True
B 40 False
C 30 False
D 40 False
DataFrame.reindex works differently: it creates a new index from the list and aligns the data to it. That means it first matches the existing index values against the new list new, and for values without a match it produces NaNs:
data = {
    "age": [50, 40, 30, 40],
    "qualified": [True, False, False, False]
}
index = ["A", "Q", "D", "C"]
df = pd.DataFrame(data, index=index)
new = ["A", "B", "C"]
newdf = df.reindex(new)
print(newdf)
age qualified
A 50.0 True
B NaN NaN
C 40.0 False
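A small sketch contrasting the two operations; the fill_value parameter of reindex (an option beyond what the question used) controls what unmatched rows get instead of NaN:

```python
import pandas as pd

df = pd.DataFrame({"age": [50, 40, 30, 40],
                   "qualified": [True, False, False, False]},
                  index=["P", "Q", "R", "S"])

# set_index with a nested list relabels positionally, keeping all data
relabelled = df.set_index([["A", "B", "C", "D"]])

# reindex aligns on the existing labels; "X" has no match, so it gets
# fill_value instead of the NaN default
aligned = df.reindex(["P", "Q", "X"], fill_value=0)
print(relabelled)
print(aligned)
```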
I have two DataFrames: in one (df1) each user has string values, while in the other (df2) there is a value associated with each string.
I want a new DataFrame like df1, but with each string replaced by its corresponding value from df2. Is there a simple method to create such a DataFrame?
Here are sample definitions for df1 and df2:
df1 = pd.DataFrame({"user": ["user1", "user2", "user3", "user4"],
                    "p1": ["A", "C", "D", "D"],
                    "p2": ["B", "D", "D", "A"],
                    "p3": ["A", "B", "C", "D"],
                    "p4": ["D", "A", "B", "C"]})
df2 = pd.DataFrame({"N1": ["A", "B", "C", "D"],
                    "N2": ["1", "2", "5", "6"]})
My desired output should look like this
You can use df.stack() with Series.map and df.unstack():
In [95]: df3 = df1.set_index('user').stack().map(df2.set_index('N1')['N2']).unstack()
In [96]: df3
Out[96]:
p1 p2 p3 p4
user
user1 1 2 1 6
user2 5 6 2 1
user3 6 6 5 2
user4 6 1 6 5
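An alternative sketch that avoids the stack/unstack round trip: build a plain dict from df2 and let DataFrame.replace substitute every matching cell at once (values stay strings, as they are in df2):

```python
import pandas as pd

df1 = pd.DataFrame({"user": ["user1", "user2", "user3", "user4"],
                    "p1": ["A", "C", "D", "D"],
                    "p2": ["B", "D", "D", "A"],
                    "p3": ["A", "B", "C", "D"],
                    "p4": ["D", "A", "B", "C"]})
df2 = pd.DataFrame({"N1": ["A", "B", "C", "D"],
                    "N2": ["1", "2", "5", "6"]})

# {"A": "1", "B": "2", "C": "5", "D": "6"}
mapping = dict(zip(df2["N1"], df2["N2"]))
df3 = df1.set_index("user").replace(mapping)
print(df3)
```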
Sample data frame in Python:
d = {'col1': ["a", "a", "a", "b", "b", "b", "c", "c", "c"],
'col2': [3, 4, 5, 1, 3, 9, 5, 7, 23]}
df = pd.DataFrame(data=d)
Now I want to get the same output in Python with pandas as I get in R with the code below: the percentage change in col2 within each group of col1.
data.frame(col1 = c("a", "a", "a", "b", "b", "b", "c", "c", "c"),
col2 = c(3, 4, 5, 1, 3, 9, 16, 18, 23)) -> df
df %>%
dplyr::group_by(col1) %>%
dplyr::mutate(perc = (dplyr::last(col2) - col2[1]) / col2[1])
In Python, I tried:
def perc_change(column):
    index_1 = tu_in[column].iloc[0]
    index_2 = tu_in[column].iloc[-1]
    perc_change = (index_2 - index_1) / index_1
    return perc_change

d = {'col1': ["a", "a", "a", "b", "b", "b", "c", "c", "c"],
     'col2': [3, 4, 5, 1, 3, 9, 5, 7, 23]}
df = pd.DataFrame(data=d)
df.assign(perc_change = lambda x: x.groupby["col1"]["col2"].transform(perc_change))
But it gives me an error saying: 'method' object is not subscriptable.
I am new to python and trying to convert some R code into python. How can I solve this in an elegant way? Thank you!
You don't want transform here. transform is typically used when your aggregation returns one scalar value per group and you want to broadcast that result to all rows that belong to that group in the original DataFrame. Since GroupBy.pct_change already returns a result indexed like the original, you can just call it and assign the result back:
df['perc_change'] = df.groupby('col1')['col2'].pct_change()
# col1 col2 perc_change
#0 a 3 NaN
#1 a 4 0.333333
#2 a 5 0.250000
#3 b 1 NaN
#4 b 3 2.000000
#5 b 9 2.000000
#6 c 5 NaN
#7 c 7 0.400000
#8 c 23 2.285714
But if instead what you need is the overall percentage change within a group, i.e. the difference between the last and first value divided by the first value, then you do want transform:
df.groupby('col1')['col2'].transform(lambda x: (x.iloc[-1] - x.iloc[0])/x.iloc[0])
0 0.666667
1 0.666667
2 0.666667
3 8.000000
4 8.000000
5 8.000000
6 3.600000
7 3.600000
8 3.600000
Name: col2, dtype: float64
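Another way to express the same overall-change idea, shown here only as a sketch of the scalar-per-group pattern: aggregate to one value per group, then broadcast it back with map. This is equivalent to the transform lambda above and can be clearer when the per-group result is also wanted on its own:

```python
import pandas as pd

d = {'col1': ["a", "a", "a", "b", "b", "b", "c", "c", "c"],
     'col2': [3, 4, 5, 1, 3, 9, 5, 7, 23]}
df = pd.DataFrame(data=d)

# one scalar per group: (last - first) / first
per_group = df.groupby('col1')['col2'].agg(
    lambda s: (s.iloc[-1] - s.iloc[0]) / s.iloc[0])

# broadcast the per-group scalar back onto every row of that group
df['perc_change'] = df['col1'].map(per_group)
print(df)
```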
I have the following DataFrame, to which I apply groupby and sum():
import numpy as np
import pandas as pd

d = {'col1': ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
     'col2': [1, 2, 3, 4, 5, 6, np.nan, np.nan, np.nan]}
df = pd.DataFrame(data=d)
df.groupby("col1").sum()
This results in the following:
      col2
col1
A      6.0
B     15.0
C      0.0
I want C to show NaN instead of 0 since all of the values for C are NaN. How can I accomplish this? Apply() with a lambda function? Any help would be appreciated.
Use this:
df.groupby('col1').apply(pd.DataFrame.sum, skipna=False).reset_index(drop=True)
# Or --> df.groupby('col1', as_index=False).apply(pd.DataFrame.sum, skipna=False)
Without the apply(), thanks to #piRSquared and #Alollz (min_count=1 returns the sum for groups that contain at least one non-NaN value, and NaN for all-NaN groups):
df.set_index('col1').sum(level=0, min_count=1).reset_index()
(The level= argument of sum was removed in pandas 2.0; the equivalent there is df.groupby('col1').sum(min_count=1).)
Output of the apply() version (note that col1 is summed as well, which concatenates each group's strings):
  col1  col2
0  AAA   6.0
1  BBB  15.0
2  CCC   NaN
Thanks to #piRSquared, #Alollz, and #anky_91, you can also do it without setting and resetting the index:
d = {'col1': ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
     'col2': [1, 2, 3, 4, 5, 6, np.nan, np.nan, np.nan]}
df = pd.DataFrame(data=d)
df.groupby("col1", as_index=False).sum(min_count=1)
Output:
col1 col2
0 A 6.0
1 B 15.0
2 C NaN
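The reason min_count=1 works can be seen on a bare Series: a sum over only missing values defaults to 0.0 (the empty-sum convention), while min_count demands at least that many non-NA values before returning a number:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan])
# default: NaNs are skipped and an empty sum is 0.0
print(s.sum())
# min_count=1: fewer than 1 non-NA value present, so the result is NaN
print(s.sum(min_count=1))
```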
Make the call to sum use the parameter skipna=False. Note that skipna belongs to DataFrame.sum (as used in the apply-based answer above); GroupBy.sum does not accept it in most pandas versions, which is why the min_count=1 approaches are preferred.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sum.html
That link should provide the documentation you need and I expect it will fix your problem.