How to replace strings with values from one dataframe in another dataframe - python

I have two dataframes: in one dataframe (df1) each user has string values, while in the other dataframe (df2) each string value has an associated value.
I want a new dataframe similar to df1, but with each string replaced by its corresponding value from df2. Is there a simple method to create such a dataframe?
Here is the sample code for df1 and df2:
import pandas as pd

df1 = pd.DataFrame(
    {"user": ["user1", "user2", "user3", "user4"],
     "p1": ["A", "C", "D", "D"],
     "p2": ["B", "D", "D", "A"],
     "p3": ["A", "B", "C", "D"],
     "p4": ["D", "A", "B", "C"]},
    index=[0, 1, 2, 3],
)
df2 = pd.DataFrame(
    {"N1": ["A", "B", "C", "D"],
     "N2": ["1", "2", "5", "6"]},
    index=[0, 1, 2, 3],
)
My desired output should look like this (each string replaced by its N2 value):

You can use df.stack() with Series.map and df.unstack:
In [95]: df3 = df1.set_index('user').stack().map(df2.set_index('N1')['N2']).unstack()
In [96]: df3
Out[96]:
p1 p2 p3 p4
user
user1 1 2 1 6
user2 5 6 2 1
user3 6 6 5 2
user4 6 1 6 5
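If you would rather avoid the reshape, DataFrame.replace with a lookup built from df2 gives the same result; this is a minimal sketch (not the answer's method) using the frames above, and the values stay strings because N2 holds strings:
# build the lookup {'A': '1', 'B': '2', 'C': '5', 'D': '6'} from df2
mapping = dict(zip(df2['N1'], df2['N2']))
# replace every matching cell in df1
df3 = df1.set_index('user').replace(mapping)
print(df3)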

Related

python: pandas: add values from other dataframe into new column by condition

I have two dataframes with the following data:
fixtures = pd.DataFrame(
    {'HomeTeam': ["A", "B", "C", "D"],
     'AwayTeam': ["E", "F", "G", "H"]})
ratings = pd.DataFrame(
    {'team': ["A", "B", "C", "D", "E", "F", "G", "H"],
     "rating": ["1,5", "0,2", "0,5", "2", "3", "4,8", "0,9", "-0,4"]})
Now I want to map the values from ratings["rating"] to the respective team names, but I can't get it to work. Is it possible to have new columns with the ratings appearing to the right of the HomeTeam and AwayTeam columns?
expected output:
fixtures:
homeTeam homeTeamRating awayTeam AwayTeamRating
Team A 1,5 Team E 3
You can use:
# create a dict: key is the team name, value is the rating
to_replace = dict(zip(ratings.team, ratings.rating))
# {'A': '1,5', 'B': '0,2', 'C': '0,5', 'D': '2', 'E': '3', 'F': '4,8', 'G': '0,9', 'H': '-0,4'}

# map each team column to its rating and store the result in a new column
fixtures['homeTeamRating'] = fixtures['HomeTeam'].map(to_replace)
fixtures['AwayTeamRating'] = fixtures['AwayTeam'].map(to_replace)

# reorder the columns
fixtures = fixtures[['HomeTeam', 'homeTeamRating', 'AwayTeam', 'AwayTeamRating']]
'''
HomeTeam homeTeamRating AwayTeam AwayTeamRating
0 A 1,5 E 3
1 B 0,2 F 4,8
2 C 0,5 G 0,9
3 D 2 H -0,4
'''
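Note that the ratings are strings with a comma as decimal separator; if you need them as numbers later, convert them after mapping. A small sketch, assuming the column names created above:
for col in ['homeTeamRating', 'AwayTeamRating']:
    # turn strings like '1,5' into the float 1.5
    fixtures[col] = fixtures[col].str.replace(',', '.', regex=False).astype(float)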
If you need to apply a method over an existing column in order to compute values that will eventually be added as a new column in the existing DataFrame, then the pandas.DataFrame.apply() method should do the trick.
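For completeness, here is a hedged sketch of that apply route (the helper name add_ratings and the lookup dict are illustrative, not from the original answer); for a plain lookup, map as shown above is usually simpler:
lookup = dict(zip(ratings['team'], ratings['rating']))

def add_ratings(row):
    # look up both teams' ratings for one fixture row
    return pd.Series({'homeTeamRating': lookup.get(row['HomeTeam']),
                      'AwayTeamRating': lookup.get(row['AwayTeam'])})

fixtures[['homeTeamRating', 'AwayTeamRating']] = fixtures.apply(add_ratings, axis=1)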

Why nan values are shown after reindexing?

I am trying to reindex the DataFrame, but it's displaying NaN values and I am not able to understand why.
data = {
    "age": [50, 40, 30, 40],
    "qualified": [True, False, False, False]
}
index = ["P", "Q", "R", "S"]
df = pd.DataFrame(data, index=index)
new = ["A", "B", "C", "D"]
newdf = df.reindex(new)
print(newdf)
Output:
age qualified
A NaN NaN
B NaN NaN
C NaN NaN
D NaN NaN
I think you need DataFrame.set_index with a nested list if you want to replace the index values with new ones (set_index expects column names, so the new labels are wrapped in an extra list to be used as values):
new = ["A", "B", "C", "D"]
newdf = df.set_index([new])
#alternative
#newdf.index = new
print(newdf)
age qualified
A 50 True
B 40 False
C 30 False
D 40 False
DataFrame.reindex works differently: it builds the new index from the list and aligns the existing data to it. That means it first matches the existing index values against the values in the new list, and for values with no match it creates NaNs:
data = {
    "age": [50, 40, 30, 40],
    "qualified": [True, False, False, False]
}
index = ["A", "Q", "D", "C"]
df = pd.DataFrame(data, index=index)
new = ["A", "B", "C"]
newdf = df.reindex(new)
print(newdf)
age qualified
A 50.0 True
B NaN NaN
C 40.0 False
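If the alignment behaviour of reindex is what you actually want but NaN is not, reindex also accepts a fill_value for labels with no match; a minimal sketch with the frame above (filling with 0 is just an example):
newdf = df.reindex(new, fill_value=0)
print(newdf)
# rows A and C keep their data; the unmatched row B is filled with 0 in every column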

Pandas new col with indexes of rows sharing a code in another col

Let's say I have a DataFrame indexed on a unique Code. Each entry may inherit from another (unique) entry: the parent's Code is given in the Herit column.
I need a new column giving the list of children for every entry. I can obtain it for a given Code, but I don't manage to build the whole column.
Here is my M(non)WE:
import pandas as pd
data = pd.DataFrame({
    "Code": ["a", "aa", "ab", "b", "ba", "c"],
    "Herit": ["", "a", "a", "", "b", ""],
    "C": [12, 15, 13, 12, 14, 10]
})
data.set_index("Code", inplace=True)
print(data)
child_a = data[data.Herit == "a"].index.values
print(child_a)
data["child"] = data.apply(lambda x: data[data.Herit == x.index].index.values, axis=1)
print(data)
You can group by the Herit column and then reduce the corresponding Codes into lists:
>>> herits = df.groupby("Herit").Code.agg(list)
>>> herits
Herit
[a, b, c]
a [aa, ab]
b [ba]
Then you can map the Code column of your frame with this, assign the result to a new column, and fill the slots that don't have any children with "":
>>> df["Children"] = df.Code.map(herits).fillna("")
>>> df
Code Herit C Children
0 a 12 [aa, ab]
1 aa a 15
2 ab a 13
3 b 12 [ba]
4 ba b 14
5 c 10
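Note the snippet above assumes Code is an ordinary column; if you have already run data.set_index("Code", inplace=True) as in the question, reset the index first. A self-contained sketch of the whole solution (variable names follow the answer):
import pandas as pd

df = pd.DataFrame({
    "Code": ["a", "aa", "ab", "b", "ba", "c"],
    "Herit": ["", "a", "a", "", "b", ""],
    "C": [12, 15, 13, 12, 14, 10],
})

# list of child Codes per parent Code
herits = df.groupby("Herit")["Code"].agg(list)

# look up each Code in that mapping; Codes with no children get ""
df["Children"] = df["Code"].map(herits).fillna("")
print(df)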

equivalent python and pandas operation for group_by + mutate + indexing column vectors within mutate in R

Sample data frame in Python:
d = {'col1': ["a", "a", "a", "b", "b", "b", "c", "c", "c"],
     'col2': [3, 4, 5, 1, 3, 9, 5, 7, 23]}
df = pd.DataFrame(data=d)
Now I want to get the same output in Python with pandas as I get in R with the code below, i.e. the percentage change in col2 within each group defined by col1.
data.frame(col1 = c("a", "a", "a", "b", "b", "b", "c", "c", "c"),
           col2 = c(3, 4, 5, 1, 3, 9, 16, 18, 23)) -> df
df %>%
  dplyr::group_by(col1) %>%
  dplyr::mutate(perc = (dplyr::last(col2) - col2[1]) / col2[1])
In Python, I tried:
def perc_change(column):
    index_1 = tu_in[column].iloc[0]
    index_2 = tu_in[column].iloc[-1]
    perc_change = (index_2 - index_1) / index_1
    return perc_change
d = {'col1': ["a", "a", "a", "b", "b", "b", "c", "c", "c"],
     'col2': [3, 4, 5, 1, 3, 9, 5, 7, 23]}
df = pd.DataFrame(data=d)
df.assign(perc_change = lambda x: x.groupby["col1"]["col2"].transform(perc_change))
But it gives me an error saying: 'method' object is not subscriptable.
I am new to python and trying to convert some R code into python. How can I solve this in an elegant way? Thank you!
You don't want transform here. transform is typically used when your aggregation returns a scalar value per group and you want to broadcast that result to all rows that belong to that group in the original DataFrame. Because GroupBy.pct_change already returns a result indexed like the original, you can compute it and assign it back directly.
df['perc_change'] = df.groupby('col1')['col2'].pct_change()
# col1 col2 perc_change
#0 a 3 NaN
#1 a 4 0.333333
#2 a 5 0.250000
#3 b 1 NaN
#4 b 3 2.000000
#5 b 9 2.000000
#6 c 5 NaN
#7 c 7 0.400000
#8 c 23 2.285714
But if what you actually need is the overall percentage change within each group, i.e. the difference between the last and first value divided by the first value, then you do want transform.
df.groupby('col1')['col2'].transform(lambda x: (x.iloc[-1] - x.iloc[0])/x.iloc[0])
0 0.666667
1 0.666667
2 0.666667
3 8.000000
4 8.000000
5 8.000000
6 3.600000
7 3.600000
8 3.600000
Name: col2, dtype: float64
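To mirror the R mutate and keep the result in the frame, assign the transform output to a new column (the name perc is taken from the R example):
df['perc'] = df.groupby('col1')['col2'].transform(lambda x: (x.iloc[-1] - x.iloc[0]) / x.iloc[0])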

Divide a column depending on a row value in pandas

I am trying to do a calculation in Pandas that looks obvious, but after several tries I did not find how to do it correctly.
I have a dataframe that looks like this:
df = pd.DataFrame([["A", "a", 10.0],
                   ["A", "b", 12.0],
                   ["A", "c", 13.0],
                   ["B", "a", 5.0],
                   ["B", "b", 6.0],
                   ["B", "c", 7.0]])
The first column is a test name, the second column is a class, and the third column gives a time. Each test normally appears in the table with all 3 classes.
This is the correct format to plot it like this:
sns.factorplot(x="2", y="0", hue="1", data=df, kind="bar")
So that for each test, I get a group of 3 bars, one for each class.
However I would like to change the dataframe so that each value in column 2 is not an absolute value, but a ratio compared to class "a".
So I would like to transform it to this:
df = pd.DataFrame([["A", "a", 1.0],
                   ["A", "b", 1.2],
                   ["A", "c", 1.3],
                   ["B", "a", 1.0],
                   ["B", "b", 1.2],
                   ["B", "c", 1.4]])
I am able to extract the series, change the index so that they match, and do the computation, for example:
df_a = df[df[1] == "a"].set_index(0)
df_b = df[df[1] == "b"].set_index(0)
df_b["ratio_a"] = df_b[2] / df_a[2]
But this is certainly very inefficient, and I would then need to put everything back into the original format.
What is the correct way to do it?
You could use groupby/transform('first') to find the first value in each group:
import pandas as pd
df = pd.DataFrame([["A", "a", 10.0],
                   ["A", "b", 12.0],
                   ["A", "c", 13.0],
                   ["B", "b", 6.0],
                   ["B", "a", 5.0],
                   ["B", "c", 7.0]])
df = df.sort_values(by=[0,1])
df[2] /= df.groupby(0)[2].transform('first')
yields
0 1 2
0 A a 1.0
1 A b 1.2
2 A c 1.3
3 B a 1.0
4 B b 1.2
5 B c 1.4
You can also do this with some index alignment (this assumes the columns have been given the names test and class rather than the default 0, 1, 2):
df1 = df.set_index(['test', 'class'])
df1 / df1.xs('a', level='class')
But transform is better
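As a variation that does not depend on sort order, you can look up the class-"a" time per test and divide by it; a sketch assuming the columns are given the names test, class and time (not part of the original post):
import pandas as pd

df = pd.DataFrame([["A", "a", 10.0], ["A", "b", 12.0], ["A", "c", 13.0],
                   ["B", "a", 5.0], ["B", "b", 6.0], ["B", "c", 7.0]],
                  columns=["test", "class", "time"])

# class-"a" time for each test: A -> 10.0, B -> 5.0
base = df[df["class"] == "a"].set_index("test")["time"]

# divide every row's time by its test's class-"a" time
df["ratio"] = df["time"] / df["test"].map(base)
print(df)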
