pandas if then statement without looping - python

All I'm trying to do is add columns data1 and data2 if, in the same row, letters is 'a', subtract if it is 'c', and multiply if it is 'b'. Here is my code:
import pandas as pd
a = [['Date', 'letters', 'data1', 'data2'],
     ['1/2/2014', 'a', 6, 1],
     ['1/2/2014', 'a', 3, 1],
     ['1/3/2014', 'c', 1, 3],
     ['1/3/2014', 'b', 3, 5]]
df = pd.DataFrame.from_records(a[1:], columns=a[0])
df['result'] = df['data1']
for i in range(0, len(df)):
    if df['letters'][i] == 'a':
        df['result'][i] = df['data1'][i] + df['data2'][i]
    if df['letters'][i] == 'b':
        df['result'][i] = df['data1'][i] * df['data2'][i]
    if df['letters'][i] == 'c':
        df['result'][i] = df['data1'][i] - df['data2'][i]
>>> df
       Date letters  data1  data2  result
0  1/2/2014       a      6      1       7
1  1/2/2014       a      3      1       4
2  1/3/2014       c      1      3      -2
3  1/3/2014       b      3      5      15
My question: is there a way to do it in one line, without looping? Something in the spirit of:
df['result'] = df['result'].map(lambda x: df['data1'][i] + df['data2'][i] if x == 'a' else df['data1'][i] - df['data2'][i] if x == 'c' else x)

You can use df.apply in combination with a lambda function. You have to use the keyword argument axis=1 to ensure you work on rows as opposed to the columns.
import pandas as pd
a = [['Date', 'letters', 'data1', 'data2'],
     ['1/2/2014', 'a', 6, 1],
     ['1/2/2014', 'a', 3, 1],
     ['1/3/2014', 'c', 1, 3]]
df = pd.DataFrame.from_records(a[1:], columns=a[0])
from operator import add, sub, mul
d = dict(a=add, b=mul, c=sub)
df['result'] = df.apply(lambda r: d[r['letters']](r['data1'], r['data2']), axis=1)
This will use the dictionary d to get the function you wish to use (add, sub, or mul).
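If a row could contain a letter that is missing from d, a dict.get fallback avoids a KeyError (a sketch; the question's data only contains 'a', 'b', and 'c'):
# Hypothetical fallback: return data1 unchanged for unmapped letters
df['result'] = df.apply(
    lambda r: d.get(r['letters'], lambda x, y: x)(r['data1'], r['data2']),
    axis=1)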
Original solution below
df['result'] = df.apply(lambda r: r['data1'] + r['data2'] if r['letters'] == 'a'
                        else r['data1'] - r['data2'] if r['letters'] == 'c'
                        else r['data1'] * r['data2'], axis=1)
print(df)
       Date letters  data1  data2  result
0  1/2/2014       a      6      1       7
1  1/2/2014       a      3      1       4
2  1/3/2014       c      1      3      -2
The lambda function is a bit complex now so I'll go into it in a bit more detail...
The lambda function uses a so-called ternary (conditional) expression to express the branching in a single expression; a typical ternary expression has the form
a if b else c
Unfortunately you can't have an elif in a ternary expression, but what you can do is nest another one inside the else clause, so it becomes
a if b else c if d else e
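A minimal plain-Python illustration of the chained form:
x = 5
size = 'small' if x < 3 else 'medium' if x < 10 else 'large'
print(size)  # prints 'medium'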

You can use the .where method:
where(cond, other=nan, inplace=False, axis=None, level=None, try_cast=False, raise_on_error=True) method of pandas.core.series.Series instance
Return an object of same shape as self and whose corresponding entries are from self where cond is True and otherwise are from other.
as in:
>>> df['data1'] + df['data2'].where(df['letters'] == 'a', - df['data2'])
0 7
1 4
2 -2
dtype: int64
alternatively, numpy.where (assuming numpy is imported as np):
>>> df['data1'] + np.where(df['letters'] == 'a', 1, -1) * df['data2']
0 7
1 4
2 -2
dtype: int64
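For the full three-letter case (add, multiply, subtract), numpy.select is a fully vectorized alternative; this is a sketch not taken from the answers above, assuming numpy is imported as np and that every row matches one of the conditions:
conditions = [df['letters'] == 'a', df['letters'] == 'b', df['letters'] == 'c']
choices = [df['data1'] + df['data2'],
           df['data1'] * df['data2'],
           df['data1'] - df['data2']]
# np.select picks, per row, the choice whose condition is True first
df['result'] = np.select(conditions, choices)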

Related

Function Value with Combination(or Permutation) of Variables and Assign to Dataframe

I have n variables; suppose n equals 3 in this case. I want to apply one function to all of the combinations (or permutations, depending on how you want to solve this) of the variables and store the result in the corresponding row and column of a dataframe.
import numpy as np
import pandas as pd

a = 1
b = 2
c = 3
indexes = ['a', 'b', 'c']
df = pd.DataFrame({x: np.nan for x in indexes}, index=indexes)
If I apply sum (the function can be anything), then the result I want to get is like this:
   a  b  c
a  2  3  4
b  3  4  5
c  4  5  6
I can only think of iterating over all the variables, applying the function one by one, and using the iterator indices to set the value in the dataframe. Is there any better solution?
You can use apply and return a pd.Series for that effect. In such cases, pandas uses the series index as the columns of the resulting dataframe.
s = pd.Series({"a": 1, "b": 2, "c": 3})
s.apply(lambda x: x+s)
Just note that the operation you do is between an element and a series.
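For reference, the result matches the target frame:
   a  b  c
a  2  3  4
b  3  4  5
c  4  5  6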
If performance is important, I believe you need a broadcast sum over an array created from the variables:
a = 1
b = 2
c = 3
indexes = ['a', 'b', 'c']
arr = np.array([a,b,c])
df = pd.DataFrame(arr + arr[:, None], index=indexes, columns=indexes)
print (df)
   a  b  c
a  2  3  4
b  3  4  5
c  4  5  6
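The expression arr + arr[:, None] is an outer sum; if you prefer it spelled out, np.add.outer gives the same result:
df = pd.DataFrame(np.add.outer(arr, arr), index=indexes, columns=indexes)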

How do I keep the original pandas dataframe values while using df.astype()? I need to raise a value error for the example below

df = pd.DataFrame({
    'A': ['a', 'b', 'c', 'd', 'e'],
    'B': [1, 2.5, 3, 4, 5],
    'C': ['abc', 'def', 'ghi', 'jkl', 'mno']})
col_type = {'A': str, 'B': int, 'C': str}
df = df.astype(col_type)
df
Output is:
   A  B    C
0  a  1  abc
1  b  2  def
2  c  3  ghi
3  d  4  jkl
4  e  5  mno
But I want to raise a value error at index 1 for column B; I don't want the truncated integer value. And I want to do it automatically (i.e., loop through all columns).
Pandas' built-in .astype() doesn't appear to have the 'safe casting' mode you want.
In numpy you can use
np.ndarray.astype(preferred_type, casting='safe')
So unfortunately I don't have a pretty solution for you, but I would do something like
coltypes = [str, int, str]
colnames = ['a', 'b', 'c']
# cast column by column so numpy can enforce the casting rule
data_for_df = [df.values[:, i].astype(coltypes[i], casting='safe')
               for i in range(len(df.columns))]
df = pd.DataFrame(dict(zip(colnames, data_for_df)))
Someone might be able to give a better answer than me :)
If you want to check that some columns of floating point values contain only integers, you can simply examine the difference between the original column(s) and the same one(s) after int conversion:
(df['B'] - df['B'].astype('int')) == 0
gives the following Series:
0 True
1 False
2 True
3 True
4 True
Name: B, dtype: bool
From there, you can raise an exception:
tmp = (df['B'] - df['B'].astype('int')) == 0
if not tmp.all():
    raise TypeError("Non int value at " +
                    ', '.join(df[(df['B'] - df['B'].astype('int')) != 0]
                              .index.astype(str)))
With the sample data, it gives as expected:
TypeError: Non int value at 1
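To run the check automatically over every column, here is a sketch that loops over the col_type mapping from the question and validates integer targets before casting (the helper name safe_astype is invented for illustration, and df is assumed to be the original, uncast frame):
def safe_astype(df, col_type):
    # Validate that columns targeted at int hold only whole numbers
    for col, target in col_type.items():
        if target is int:
            bad = df[col] != df[col].astype(int)
            if bad.any():
                raise TypeError("Non int value at " +
                                ', '.join(bad[bad].index.astype(str)))
    return df.astype(col_type)

safe_astype(df, col_type)  # raises TypeError: Non int value at 1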

How to add multiple columns to dataframe by function

If I have a df such as this:
   a  b
0  1  3
1  2  4
I can use df['c'] = '' and df['d'] = -1 to add two columns and get this (column c is blank because it holds empty strings):
   a  b c  d
0  1  3   -1
1  2  4   -1
How can I put the code inside a function, so I can apply that function to df and add all the columns at once, instead of adding them one by one separately as above? Thanks
Create a dictionary:
dictionary = {'c': '', 'd': -1}

def new_columns(df, dictionary):
    return df.assign(**dictionary)
then call it with your df:
df = new_columns(df, dictionary)
or, if you don't need a function call (not sure what your use case is), just:
df.assign(**dictionary)
Note that assign returns a new DataFrame rather than modifying df in place.
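assign also accepts callables, which helps when a new column should be computed from existing ones (a sketch; the computed column e is not part of the original question):
df = df.assign(c='', d=-1, e=lambda x: x['a'] + x['b'])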
def update_df(a_df, new_cols_names, new_cols_vals):
    for n, v in zip(new_cols_names, new_cols_vals):
        a_df[n] = v
update_df(df, ['c', 'd', 'e'], ['', 5, 6])
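This variant mutates the passed-in frame in place; after the call, df looks roughly like:
   a  b c  d  e
0  1  3    5  6
1  2  4    5  6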

Slicing a DataFrameGroupBy object

Is there a way to slice a DataFrameGroupBy object?
For example, if I have:
df = pd.DataFrame({'A': [2, 1, 1, 3, 3], 'B': ['x', 'y', 'z', 'r', 'p']})
   A  B
0  2  x
1  1  y
2  1  z
3  3  r
4  3  p
dfg = df.groupby('A')
Now, the returned GroupBy object is indexed by values from A, and I would like to select a subset of it, e.g. to perform aggregation. It could be something like
dfg.loc[1:2].agg(...)
or, for a specific column,
dfg['B'].loc[1:2].agg(...)
EDIT. To make it more clear: by slicing the GroupBy object I mean accessing only a subset of groups. In the above example, the GroupBy object will contain 3 groups, for A = 1, A = 2, and A = 3. For some reason, I may only be interested in the groups for A = 1 and A = 2.
It seems you need a custom function with iloc; note that if you use agg, it is necessary to return an aggregated value:
df = df.groupby('A')['B'].agg(lambda x: ','.join(x.iloc[0:3]))
print (df)
A
1 y,z
2 x
3 r,p
Name: B, dtype: object
df = df.groupby('A')['B'].agg(lambda x: ','.join(x.iloc[1:3]))
print (df)
A
1 z
2
3 p
Name: B, dtype: object
For multiple columns:
df = pd.DataFrame({'A': [2, 1, 1, 3, 3],
                   'B': ['x', 'y', 'z', 'r', 'p'],
                   'C': ['g', 'y', 'y', 'u', 'k']})
print (df)
   A  B  C
0  2  x  g
1  1  y  y
2  1  z  y
3  3  r  u
4  3  p  k
df = df.groupby('A').agg(lambda x: ','.join(x.iloc[1:3]))
print (df)
   B  C
A
1  z  y
2
3  p  k
If I understand correctly, you only want some of the groups, but those are supposed to be returned completely:
   A  B
1  1  y
2  1  z
0  2  x
You can solve your problem by extracting the keys and then selecting groups based on those keys.
Assuming you already know the groups:
pd.concat([dfg.get_group(1),dfg.get_group(2)])
If you don't know the group names and are just looking for any n groups, this might work:
pd.concat([dfg.get_group(n) for n in list(dict(list(dfg)).keys())[:2]])
The output in both cases is a normal DataFrame, not a DataFrameGroupBy object, so it might be smarter to first filter your DataFrame and only aggregate afterwards:
df[df['A'].isin([1,2])].groupby('A')
The same for unknown groups:
df[df['A'].isin(list(set(df['A']))[:2])].groupby('A')
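Completing the filtered groupby with an aggregation, reusing the join aggregation from the first answer (a sketch):
df[df['A'].isin([1, 2])].groupby('A')['B'].agg(','.join)
which returns 'y,z' for group 1 and 'x' for group 2.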
I believe there are some Stack Overflow answers referring to this, like How to access pandas groupby dataframe by key

Pandas - Modify string values in each cell

I have a pandas dataframe and I need to modify all values in a given string column. Each column contains string values of the same length. The user provides the index range they want replaced and the replacement value,
for example: [1:3] and the replacement value "AAA".
This would replace the characters at positions 1 to 3 with the value AAA.
How can I use the applymap(), map() or apply() function to get this done?
SOLUTION: Here is the final solution I used, based on the answer marked below:
import pandas as pd

df = pd.DataFrame({'A': ['ffgghh', 'ffrtss', 'ffrtds'],
                   #'B': ['ffrtss', 'ssgghh', 'd'],
                   'C': ['qqttss', ' 44', 'f']})
print(df)
old = ['g', 'r', 'z']
new = ['y', 'b', 'c']
vals = dict(zip(old, new))
pos = 2
for old, new in vals.items():
    df.loc[df['A'].str[pos] == old, 'A'] = df['A'].str.slice_replace(pos, pos + len(new), new)
print(df)
Use str.slice_replace:
df['B'] = df['B'].str.slice_replace(1, 3, 'AAA')
Sample Input:
   A         B
0  w   abcdefg
1  x   bbbbbbb
2  y   ccccccc
3  z  zzzzzzzz
Sample Output:
   A          B
0  w   aAAAdefg
1  x   bAAAbbbb
2  y   cAAAcccc
3  z  zAAAzzzzz
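Since the question says the user supplies the index range, a thin wrapper makes the parameters explicit (replace_slice is a hypothetical helper name):
def replace_slice(s, start, stop, repl):
    # Replace the characters in [start:stop) of every string in the Series
    return s.str.slice_replace(start, stop, repl)

df['B'] = replace_slice(df['B'], 1, 3, 'AAA')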
IMO the most straightforward solution:
In [7]: df
Out[7]:
col
0 abcdefg
1 bbbbbbb
2 ccccccc
3 zzzzzzzz
In [9]: df.col = df.col.str[:1] + 'AAA' + df.col.str[4:]
In [10]: df
Out[10]:
col
0 aAAAefg
1 bAAAbbb
2 cAAAccc
3 zAAAzzzz
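Note that this is equivalent to a slice_replace call from the first answer with stop=4:
df.col = df.col.str.slice_replace(1, 4, 'AAA')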
