I have a DataFrame like below
df = pd.DataFrame({
    'A': ['x', 'x', 'x', 'x', 'x'],
    'B': [1, 2, 1, 1, 2]
})
I would like to replace 'x' with 'y' wherever df['B'] == 2.
I know there are lots of ways to do this, but what is the shortest code that accomplishes it? I believe np.where is one option, but can it change a value (or overwrite a variable) based on values in another column?
Here are some alternatives: select by both conditions, or use replace combined with a selection on one condition:
df = pd.DataFrame({
    'A': ['x'] * 5,
    'B': [1, 2, 1, 1, 2]
})
df.loc[df.B.eq(2) & df.A.eq('x'), 'A'] = 'y'
print (df)
A B
0 x 1
1 y 2
2 x 1
3 x 1
4 y 2
Or:
df.A = df.A.mask(df.B.eq(2) & df.A.eq('x'), 'y')
Or:
df.A = df.A.mask(df.B.eq(2), df.A.replace('x','y'))
Or:
df.loc[df['B'].eq(2), 'A'] = df['A'].replace('x', 'y')
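Or, since the question mentions np.where, a minimal sketch of that route as well (keeping the existing value where the condition is False):
import numpy as np

# Replace 'x' with 'y' in column A wherever B equals 2; keep A unchanged otherwise.
df['A'] = np.where(df['B'].eq(2) & df['A'].eq('x'), 'y', df['A'])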
Try using loc:
>>> df.loc[df['B'] == 2, 'A'] = df.loc[df['B'] == 2, 'A'].replace('x', 'y')
>>> df
A B
0 x 1
1 y 2
2 x 1
3 x 1
4 y 2
>>>
With the following dataframe as an example:
df = pd.DataFrame({'Sample':['X', 'Y', 'Z'], 'Base':[2, 10, 3], 'A':[0,5,100], 'C':[0,10,7]})
I would like to add a new column df["indices"] containing the names of columns A and/or C, provided they satisfy two conditions:
The value must be greater than 5
df["A"]/df["Base"] (or df["C"]/df["Base"], respectively) must be greater than or equal to 1
The resulting dataframe would be:
df = pd.DataFrame({'Sample':['X', 'Y', 'Z'], 'Base':[2, 10, 3], 'A':[0,5,100], 'C':[0,10,7], 'indices': ['','C','A,C']})
I can get True or False values for my first condition with df[['A','C']] > 5, but I cannot get condition 2 to work, since it is based on another column of my dataframe. Getting the column names where I get True into a new column is yet another story. I imagine something with apply and get_loc or index, but I cannot get it to work no matter how I try.
Let's create a boolean mask satisfying the two given conditions, then use DataFrame.dot on this mask with the column names to build the comma-separated indices (the dot product concatenates the names of the columns that are True in each row):
m = df[['A', 'C']].gt(5) & df[['A', 'C']].div(df['Base'], axis=0).ge(1)
df['indices'] = m.dot(m.columns + ',').str.rstrip(',')
Sample Base A C indices
0 X 2 0 0
1 Y 10 5 10 C
2 Z 3 100 7 A,C
You can use df.loc to assign values back to the column when any number of conditions are met. A simple approach is to have three such statements, one per combination of conditions. You could also chain np.where calls to achieve the same thing, as sketched after the output below.
import pandas as pd

df = pd.DataFrame({'Sample': ['X', 'Y', 'Z'],
                   'Base': [2, 10, 3],
                   'A': [0, 5, 100],
                   'C': [0, 10, 7]})

df.loc[(df['A'] / df['Base'] >= 1) & (df['C'] / df['Base'] >= 1), 'indices'] = 'A,C'
df.loc[(df['A'] / df['Base'] >= 1) & (df['C'] / df['Base'] < 1), 'indices'] = 'A'
df.loc[(df['A'] / df['Base'] < 1) & (df['C'] / df['Base'] >= 1), 'indices'] = 'C'
Output
Sample Base A C indices
0 X 2 0 0 NaN
1 Y 10 5 10 C
2 Z 3 100 7 A,C
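A rough sketch of the chained-np.where idea mentioned above, using the same ratio conditions as the .loc version (and an empty string instead of NaN when neither column qualifies):
import numpy as np

a_ok = df['A'] / df['Base'] >= 1
c_ok = df['C'] / df['Base'] >= 1

# Nested np.where picks the label per row; '' where neither ratio qualifies.
df['indices'] = np.where(a_ok & c_ok, 'A,C',
                np.where(a_ok, 'A',
                np.where(c_ok, 'C', '')))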
I have a TYPE column and a VOLUME column.
What I'm looking to do is first check whether the TYPE column == 'var1'.
If so, I would like to make a calculation in the VOLUME column.
So far I have something like this:
data.loc[data['TYPE'] == 'var1', ['VOLUME']] * 2
data.loc[data['TYPE'] == 'var1', ['VOLUME']] * 4
This seems to set the entire column that meets the condition to the result of the last statement, so I end up with just two distinct values.
Out:
4
4
4
4
8
8
8
Another option:
data['VOLUME'] = data.loc[data['TYPE'] == 'var1', ['VOLUME']] * 2
This works for the first condition but shows NaN for the rows matching the second condition.
Then when I run:
data['VOLUME'] = data.loc[data['TYPE'] == 'var2', ['VOLUME']] * 4
The whole column shows as NaN.
Consider a simple example which demonstrates what is happening.
df = pd.DataFrame({'A': [1, 2, 3]})
df
A
0 1
1 2
2 3
Now, only values below 2 in column "A" are to be modified. So, try something like
df.loc[df.A < 2, 'A'] * 2
0 2
Name: A, dtype: int64
This series only has 1 row at index 0. If you try assigning this back, the implicit assumption is that the other index values are to be reset to NaN.
df.assign(A=df.loc[df.A < 2, 'A'] * 2)
A
0 2.0
1 NaN
2 NaN
What we want to do is to modify only the rows we're interested in. This is best done with the in-place modification arithmetic operator *=:
df.loc[df.A < 2, 'A'] *= 2
In your case, it is
data.loc[data['TYPE'] == 'var1', 'VOLUME'] *= 2
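If the second condition from the question (var2 multiplied by 4) also needs handling, the same pattern is simply applied once per condition; a small self-contained sketch with hypothetical values:
import pandas as pd

# Hypothetical frame mirroring the question's TYPE and VOLUME columns.
data = pd.DataFrame({'TYPE': ['var1', 'var2', 'var1'],
                     'VOLUME': [2.0, 2.0, 4.0]})

data.loc[data['TYPE'] == 'var1', 'VOLUME'] *= 2   # var1 rows doubled
data.loc[data['TYPE'] == 'var2', 'VOLUME'] *= 4   # var2 rows quadrupled
print(data)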
You are really close. The problem is in how you are storing the result. This should work:
data.loc[data['TYPE'] == 'var1', ['VOLUME']] = data['VOLUME'] * 2
You can use *= with loc:
In [11]: df = pd.DataFrame([[1], [2]], columns=["A"])
In [12]: df
Out[12]:
A
0 1
1 2
In [13]: df.loc[df.A == 1, "A"] *= 3
In [14]: df
Out[14]:
A
0 3
1 2
I'm currently merging the first and last string in a row. These strings are merged when they are to the right of a specific value. I'm hoping to change that to below a specific value.
import pandas as pd
d = ({
    'A': ['X', 'Foo', '', 'X', 'Big'],
    'B': ['No', '', '', 'No', ''],
    'C': ['Merge', 'Bar', '', 'Merge', 'Cat'],
})
df = pd.DataFrame(data=d)

m = df.A == 'X'

def f(x):
    s = x[x != '']
    x[s.index[1]] = x[s.index[1]] + ' ' + x[s.index[-1]]
    x[s.index[-1]] = ''
    return x

df = df.astype(str).mask(m, df[m].apply(f, axis=1))
This code merges the first and last string to the right of X, i.e. in the rows where column A equals 'X'.
Output:
     A         B    C
0    X  No Merge
1  Foo            Bar
2
3    X  No Merge
4  Big            Cat
I'm hoping to change it to rows beneath the value X.
Intended Output:
         A   B      C
0        X  No  Merge
1  Foo Bar
2
3        X  No  Merge
4  Big Cat
The solution is very similar: the boolean mask is shifted (with the resulting first NaN replaced by False), and the index [1] is changed to [0] so that the first non-empty value (column A) is selected:
m = (df.A == 'X').shift().fillna(False)

def f(x):
    s = x[x != '']
    x[s.index[0]] = x[s.index[0]] + ' ' + x[s.index[-1]]
    x[s.index[-1]] = ''
    return x

df = df.astype(str).mask(m, df[m].apply(f, axis=1))
print (df)
         A   B      C
0        X  No  Merge
1  Foo Bar
2
3        X  No  Merge
4  Big Cat
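As a side note, on reasonably recent pandas versions the shifted mask can be built without the separate fillna step by passing fill_value (a sketch, same behaviour assumed):
# shift(fill_value=False) fills the first position with False and keeps the bool dtype.
m = df.A.eq('X').shift(fill_value=False)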
I have a pandas dataframe and I need to modify all values in a given string column. Each column contains string values of the same length. The user provides the index range they want replaced for each value,
for example: [1:3] and the replacement value "AAA".
This would replace characters 1 to 3 of each string with the value AAA.
How can I use the applymap(), map() or apply() function to get this done?
SOLUTION: Here is the final solution I went with, based on the answer marked below:
import pandas as pd

df = pd.DataFrame({'A': ['ffgghh', 'ffrtss', 'ffrtds'],
                   #'B': ['ffrtss', 'ssgghh', 'd'],
                   'C': ['qqttss', ' 44', 'f']})
print(df)

old = ['g', 'r', 'z']
new = ['y', 'b', 'c']
vals = dict(zip(old, new))
pos = 2

for old, new in vals.items():
    df.loc[df['A'].str[pos] == old, 'A'] = df['A'].str.slice_replace(pos, pos + len(new), new)

print(df)
Use str.slice_replace:
df['B'] = df['B'].str.slice_replace(1, 3, 'AAA')
Sample Input:
A B
0 w abcdefg
1 x bbbbbbb
2 y ccccccc
3 z zzzzzzzz
Sample Output:
A B
0 w aAAAdefg
1 x bAAAbbbb
2 y cAAAcccc
3 z zAAAzzzzz
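Since the question asks about map()/apply(): plain Python slicing inside Series.map gives an equivalent result (a sketch, assuming the same 1:3 slice and 'AAA' replacement):
import pandas as pd

df = pd.DataFrame({'B': ['abcdefg', 'bbbbbbb']})
# Equivalent to str.slice_replace(1, 3, 'AAA'), but with ordinary Python slicing.
df['B'] = df['B'].map(lambda s: s[:1] + 'AAA' + s[3:])
print(df)  # aAAAdefg, bAAAbbbb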
IMO the most straightforward solution:
In [7]: df
Out[7]:
col
0 abcdefg
1 bbbbbbb
2 ccccccc
3 zzzzzzzz
In [9]: df.col = df.col.str[:1] + 'AAA' + df.col.str[4:]
In [10]: df
Out[10]:
col
0 aAAAefg
1 bAAAbbb
2 cAAAccc
3 zAAAzzzz
All I'm trying to do is add columns data1 and data2 if, in the same row, letters is 'a', subtract if it is 'c', and multiply if it is 'b'. Here is my code.
import pandas as pd
a=[['Date', 'letters', 'data1', 'data2'], ['1/2/2014', 'a', 6, 1], ['1/2/2014', 'a', 3, 1], ['1/3/2014', 'c', 1, 3],['1/3/2014', 'b', 3, 5]]
df = pd.DataFrame.from_records(a[1:],columns=a[0])
df['result']=df['data1']
for i in range(0, len(df)):
    if df['letters'][i] == 'a':
        df['result'][i] = df['data1'][i] + df['data2'][i]
    if df['letters'][i] == 'b':
        df['result'][i] = df['data1'][i] * df['data2'][i]
    if df['letters'][i] == 'c':
        df['result'][i] = df['data1'][i] - df['data2'][i]
>>> df
Date letters data1 data2 result
0 1/2/2014 a 6 1 7
1 1/2/2014 a 3 1 4
2 1/3/2014 c 1 3 -2
3 1/3/2014 b 3 5 15
My question: is there a way to do it in one line, without looping? Something in the spirit of this (non-working) pseudocode:
df['result'] = df['result'].map(lambda x: df['data1'][i]+df['data2'][i] if x == 'a' df['data1'][i]-df['data2'][i] elif x == 'c' else x)
You can use df.apply in combination with a lambda function. You have to use the keyword argument axis=1 to ensure you work on rows as opposed to the columns.
import pandas as pd
a=[['Date', 'letters', 'data1', 'data2'], ['1/2/2014', 'a', 6, 1], ['1/2/2014', 'a', 3, 1], ['1/3/2014', 'c', 1, 3]]
df = pd.DataFrame.from_records(a[1:],columns=a[0])
from operator import add, sub, mul
d = dict(a=add, b=mul, c=sub)
df['result'] = df.apply(lambda r: d[r['letters']](r['data1'], r['data2']), axis=1)
This will use the dictionary d to get the function you wish to use (add, sub, or mul).
Original solution below
df['result'] = df.apply(lambda r: r['data1'] + r['data2'] if r['letters'] == 'a'
else r['data1'] - r['data2'] if r['letters'] == 'c'
else r['data1'] * r['data2'], axis=1)
print(df)
Date letters data1 data2 result
0 1/2/2014 a 6 1 7
1 1/2/2014 a 3 1 4
2 1/3/2014 c 1 3 -2
The lambda function is a bit complex now, so I'll go into it in a bit more detail.
The lambda function uses a so-called ternary operator to express conditional logic in a single line; a typical ternary expression is of the form
a if b else c
Unfortunately you can't have an elif in a ternary expression, but what you can do is place another ternary inside the else clause, so it becomes
a if b else c if d else e
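For illustration, a tiny standalone example of the nested ternary form (hypothetical values):
x = 'c'
# Reads as: 1 if x == 'a', otherwise -1 if x == 'c', otherwise 0.
result = 1 if x == 'a' else -1 if x == 'c' else 0
print(result)  # -1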
You can use the .where method:
where(cond, other=nan, inplace=False, axis=None, level=None, try_cast=False, raise_on_error=True) method of pandas.core.series.Series instance
Return an object of same shape as self and whose corresponding entries are from self where cond is True and otherwise are from other.
as in:
>>> df['data1'] + df['data2'].where(df['letters'] == 'a', - df['data2'])
0 7
1 4
2 -2
dtype: int64
alternatively, numpy.where:
>>> df['data1'] + np.where(df['letters'] == 'a', 1, -1) * df['data2']
0 7
1 4
2 -2
dtype: int64
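Another option in the same spirit is np.select, which picks among the three results per row (a sketch assuming the question's column names):
import numpy as np

conditions = [df['letters'] == 'a', df['letters'] == 'b', df['letters'] == 'c']
choices = [df['data1'] + df['data2'],
           df['data1'] * df['data2'],
           df['data1'] - df['data2']]
# Rows matching none of the conditions fall back to data1 unchanged.
df['result'] = np.select(conditions, choices, default=df['data1'])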