Assigning an alternative value to a pandas DataFrame column conditional on its value - python

I am trying to assign alternative values to a column in a pandas DataFrame object. The condition for assigning an alternative value is that the element currently has the value zero.
This is my code snippet:
df = pd.DataFrame({'A': [0, 1, 2, 0, 0, 1, 1, 0], 'B': [1, 2, 3, 4, 1, 2, 3, 4]})
for i, row in df.iterrows():
    if row['A'] == 0.0:
        df.iloc[i]['A'] = df.iloc[i-1]['A'] + df.iloc[i]['B'] - df.iloc[i-1]['B']
However, as it turns out, the values in these elements remain zero! The above has zero effect.
What's going on?

The original answer below works for some inputs, but it's not entirely right. Your code does work with the dataframe in your question, but it's not guaranteed to work with all dataframes. Here's an example where it fails:
df = pd.DataFrame(np.random.randn(6,4), index=list(range(0,12,2)), columns=['A', 'B', 'C', 'D'])
This dataframe will cause your code to fail because the indices are not 0, 1, 2... as your algorithm expects, they're 0, 2, 4, ..., as defined by index=list(range(0,12,2)).
That means the values of i returned by the iterator will also be 0, 2, 4,..., so you'll get unexpected results when you try to use i-1 as a parameter to iloc.
In short, when you use for i, row in df.iterrows(): to iterate over a dataframe, i takes on the index values of the dimension you're iterating over as they're defined in the dataframe. Make sure you know what those values are when using them with offsets inside the loop.
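To make that concrete, here's a small sketch (with a made-up frame) showing that the i yielded by iterrows() is the index label, not the ordinal position:

```python
import numpy as np
import pandas as pd

# A frame whose index labels are 0, 2, 4, ... rather than positions 0, 1, 2, ...
df = pd.DataFrame(np.arange(12).reshape(6, 2),
                  index=range(0, 12, 2), columns=['A', 'B'])

# iterrows() yields (label, row) pairs, so i runs over the labels
labels = [i for i, row in df.iterrows()]
print(labels)  # [0, 2, 4, 6, 8, 10]
```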
Original answer:
I can't figure out why your code doesn't work, but I can verify that it doesn't. It may have something to do with modifying a dataframe while iterating over it, since you can use df.iloc[1]['A'] = 0.0 to set a value outside a loop with no problems.
Try using DataFrame.at instead:
for i, row in df.iterrows():
    if row['A'] == 0.0:
        df.at[i, 'A'] = df.iloc[i-1]['A'] + df.iloc[i]['B'] - df.iloc[i-1]['B']
This doesn't do anything to account for df.iloc[i-1] returning the last row in the dataframe, so be aware of that when the first value in column A is 0.0.
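To illustrate that caveat with a toy frame: when i is 0, i - 1 is -1, and iloc interprets -1 as the last position:

```python
import pandas as pd

df = pd.DataFrame({'A': [10, 20, 30]})

# With i == 0, df.iloc[i - 1] is df.iloc[-1], i.e. the *last* row,
# not some row "before" the first one
prev = df.iloc[-1]['A']
print(prev)  # 30
```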

What about:
df = pd.DataFrame({'A': [0, 1, 2, 0, 0, 1, 1, 0], 'B': [1, 2, 3, 4, 1, 2, 3, 4]})
df['A'] = df.where(df[['A']] != 0,
                   df['A'].shift() + df['B'] - df['B'].shift(),
                   axis=0)['A']
print(df)
A B
0 NaN 1
1 1.0 2
2 2.0 3
3 3.0 4
4 -3.0 1
5 1.0 2
6 1.0 3
7 2.0 4
The NaN appears because there is no element prior to the first one for shift() to use.
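If that leading NaN is unwanted, one option (a sketch using Series.where, a slight variation on the expression above) is to fill it back from the original column:

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 2, 0, 0, 1, 1, 0],
                   'B': [1, 2, 3, 4, 1, 2, 3, 4]})

# Keep A where it is nonzero, otherwise use the shift-based expression
out = df['A'].where(df['A'] != 0, df['A'].shift() + df['B'] - df['B'].shift())
# The first row has no predecessor, so shift() produced NaN there;
# fall back to the original value
out = out.fillna(df['A'])
print(out.tolist())  # [0.0, 1.0, 2.0, 3.0, -3.0, 1.0, 1.0, 2.0]
```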

You are using chained indexing, which is related to the famous SettingWithCopy warning. See the SettingWithCopy chapter of modern pandas by Tom Augspurger.
In general this means that assignments of the form df['A']['B'] = ... are discouraged. It doesn't matter if you use a loc accessor there.
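A minimal sketch of the difference, on a toy frame (the exact behavior of the chained form varies across pandas versions and copy-on-write settings):

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 1], 'B': [10, 20]})

# Chained indexing: df.iloc[0] materializes an intermediate object first,
# so the assignment may land on a temporary copy rather than on df
df.iloc[0]['A'] = 99   # typically ineffective, and pandas warns about it

# Single accessor: one assignment directly on df, which sticks
df.loc[0, 'A'] = 99
print(df.loc[0, 'A'])  # 99
```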
If you add print statements to your code:
for i, row in df.iterrows():
    print(df)
    if row['A'] == 0.0:
        df.iloc[i]['A'] = df.iloc[i-1]['A'] + df.iloc[i]['B'] - df.iloc[i-1]['B']
you see strange things happening: the dataframe df is modified if and only if the first row of column 'A' is 0.
As Bill the Lizard pointed out, you need a single accessor. Note, however, that Bill's method uses label-based access, which may not be what you want when your dataframe is indexed differently. In that case a better solution is to use loc
for i, row in df.iterrows():
    if row['A'] == 0.0:
        df.loc[df.index[i], 'A'] = df.iloc[i-1]['A'] + df.iloc[i]['B'] - df.iloc[i-1]['B']
or iloc
for i, row in df.iterrows():
    if row['A'] == 0.0:
        df.iloc[i, df.columns.get_loc('A')] = df.iloc[i-1]['A'] + df.iloc[i]['B'] - df.iloc[i-1]['B']
assuming the index is unique in the last case.
Note that the chained indexing occurs when setting values.
Though this approach works, it is, per the documentation cited above, not encouraged!

Related

Scripting a simple counter

I wanted to create a simple script that counts, row by row, how often the value in one column is higher than the value in the other column:
d = {'a': [1, 3], 'b': [0, 2]}
df = pd.DataFrame(data=d, index=[1, 2])
print(df)
a b
1 1 0
2 3 2
My function:
def diff(dataframe):
    a_counter = 0
    b_counter = 0
    for i in dataframe["a"]:
        for ii in dataframe["b"]:
            if i > ii:
                a_counter += 1
            elif ii > i:
                b_counter += 1
    return a_counter, b_counter
However
diff(df)
returns (3, 1) instead of (2, 0). I know the problem is that every single value of one column gets compared to every value of the other column (e.g. 1 gets compared to both 0 and 2 of column b). There is probably a special pandas function for my problem, but can you help me fix my script?
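For reference, the smallest fix to the loop itself (a sketch) is to walk both columns in lockstep with zip instead of nesting the loops, so each row's a is compared only with the same row's b:

```python
import pandas as pd

d = {'a': [1, 3], 'b': [0, 2]}
df = pd.DataFrame(data=d, index=[1, 2])

def diff(dataframe):
    a_counter = 0
    b_counter = 0
    # zip pairs each row's 'a' with the same row's 'b' (row-wise, not all pairs)
    for i, ii in zip(dataframe['a'], dataframe['b']):
        if i > ii:
            a_counter += 1
        elif ii > i:
            b_counter += 1
    return a_counter, b_counter

print(diff(df))  # (2, 0)
```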
I would suggest adding some helper columns, in an intuitive way, to compute the sum of each condition a > b and b > a.
A working example based on your code:
import numpy as np
import pandas as pd

d = {'a': [1, 3], 'b': [0, 2]}
df = pd.DataFrame(data=d, index=[1, 2])

def diff(dataframe):
    dataframe['a>b'] = np.where(dataframe['a'] > dataframe['b'], 1, 0)
    dataframe['b>a'] = np.where(dataframe['b'] > dataframe['a'], 1, 0)
    return dataframe['a>b'].sum(), dataframe['b>a'].sum()

print(diff(df))
>>> (2, 0)
Basically, np.where(), the way I used it here, produces 1 where the condition is met and 0 otherwise. You can then add those columns up using a simple sum() applied to the desired columns.
Update
Maybe you can use:
>>> df['a'].gt(df['b']).sum(), df['b'].gt(df['a']).sum()
(2, 0)
IIUC, to fix your code:
def diff(dataframe):
    a_counter = 0
    b_counter = 0
    for i in dataframe["a"]:
        for ii in dataframe["b"]:
            if i > ii:
                a_counter += 1
            elif ii > i:
                b_counter += 1
    # Subtract the minimum of the two counters
    m = min(a_counter, b_counter)
    return a_counter - m, b_counter - m
Output:
>>> diff(df)
(2, 0)
IIUC, you can use the sign of the difference and count the values:
d = {1: 'a', -1: 'b', 0: 'equal'}
(np.sign(df['a'].sub(df['b']))
    .map(d)
    .value_counts()
    .reindex(list(d.values()), fill_value=0)
)
output:
a 2
b 0
equal 0
dtype: int64

Python pandas: elegant division within dataframe

I'm new on Stack Overflow and have switched from R to Python. I'm trying to do something that is probably not too difficult, and while I can do it by butchering, I wonder what the most pythonic way is. I am trying to divide certain values in a column (E where F=a) by values further down in the column (E where F=b), using column D as a lookup:
import pandas as pd
df = pd.DataFrame({'D':[1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1], 'E':[10,20,30,40,50,100, 250, 250, 360, 567, 400],'F':['a', 'a', 'a', 'a', 'a', 'b','b', 'b', 'b', 'b', 'c']})
print(df)
out = pd.DataFrame({'D': [1, 2, 3, 4, 5], 'a/b': [0.1, 0.08, 0.12, 0.1111, 0.0881]})
print(out)
Can anyone help write this nicely?
I'm not entirely sure what you mean by "using D column as lookup" since there is no need for such lookup in the example you provided.
However the quick and dirty way to achieve the output you did provide is
output = pd.DataFrame({'a/b': df[df['F'] == 'a']['E'].values / df[df['F'] == 'b']['E'].values})
output['D'] = df['D']
which makes the output
a/b D
0 0.100000 1
1 0.080000 2
2 0.120000 3
3 0.111111 4
4 0.088183 5
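If you do want column D to act as the lookup key explicitly, one sketch is to pivot on D and F first and then divide the resulting columns (assuming each (D, F) pair occurs at most once, as in your example):

```python
import pandas as pd

df = pd.DataFrame({'D': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1],
                   'E': [10, 20, 30, 40, 50, 100, 250, 250, 360, 567, 400],
                   'F': ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b', 'c']})

# One row per D value, one column per F value
wide = df.pivot(index='D', columns='F', values='E')
out = (wide['a'] / wide['b']).rename('a/b').reset_index()
print(out)
```

This keeps each ratio attached to its D value regardless of row order in the original frame.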
You can also look this up with .loc, as df.loc[rows, columns], where rows and columns are boolean conditions:
import numpy as np

# Get the unique indices from column D; sorting gives a stable, predictable order
# (a bare set() does not preserve order)
idx = sorted(set(df['D']))
# A is the array of 'E' values where 'F' == 'a'
A = np.array([df.loc[(df['F'] == 'a') & (df['D'] == i), 'E'].values[0] for i in idx])
# B is the array of 'E' values where 'F' == 'b'
B = np.array([df.loc[(df['F'] == 'b') & (df['D'] == i), 'E'].values[0] for i in idx])
# Now divide element-wise into your new dataframe of ratios
out = pd.DataFrame(np.vstack([A/B, idx]).T, columns=['a/b', 'D'])
Instead of using numpy.vstack, you can build the frame directly from a dict:
out = pd.DataFrame({'a/b': A/B, 'D': idx})
with the same result. I tried to do it in a single line (for no reason whatsoever)
Got it:
df = df.set_index('D')
out = df.loc[(df['F'] == 'a'), 'E'] / df.loc[(df['F'] == 'b'), 'E']
out = out.reset_index()
Thanks for your thoughts - I got inspired.

Is there a python function to fill missing data with consecutive value

I want to fill these missing numbers in column b with the consecutive values 1 and 2.
This is what I have done:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [1, 4, 7, 8, 4],
                   'b': [1, np.nan, 3, np.nan, 5]})
df['b'].fillna({'b': [1, 2]}, inplace=True)
but nothing is done.
One way is to use loc with an array:
df.loc[df['b'].isnull(), 'b'] = [1, 2]
What you're attempting is possible but cumbersome with fillna:
nulls = df['b'].isnull()
df['b'] = df['b'].fillna(pd.Series([1, 2], index=nulls[nulls].index))
You may be looking for interpolate but the above solutions are generic given an input list or array.
If, on the other hand, you want to fill nulls with a sequence 1, 2, 3, etc, you can use cumsum:
# fillna solution
df['b'] = df['b'].fillna(df['b'].isnull().cumsum())
# loc solution
nulls = df['b'].isnull()
df.loc[nulls, 'b'] = nulls.cumsum()
You can't feed fillna a list of values, as stated here and in the documentation. Also, if you're selecting the column, no need to tell fillna which column to use. You could do:
df.fillna({'b':1}, inplace=True)
Or
df['b'].fillna(1, inplace=True)
By the way, inplace is on the way to deprecation in Pandas, the preferred way to do this is, for example
df = df.fillna({'b':1})
You can interpolate. Example:
s = pd.Series([0, 1, np.nan, 3])
s.interpolate()
0    0.0
1    1.0
2    2.0
3    3.0
dtype: float64
If I understand the wording "consecutive values 1 and 2" correctly, the solution may be:
from itertools import islice, cycle

filler = [1, 2]
nans = df.b.isna()
df.loc[nans, 'b'] = list(islice(cycle(filler), sum(nans)))

Issues with adding new rows to a pandas dataframe

Apologies if the formatting on this is strange, it's the first time I've posted anything. I've created a multi-index data frame in Python, which works fine:
arrays = [['one', 'one', 'two', 'two'],
          ['A', 'B', 'A', 'B']]
tuples = list(zip(*arrays))
mindex = pd.MultiIndex.from_tuples(tuples)
s = pd.DataFrame(data=np.random.randn(4), index=mindex, columns=['Values'])
s
This works fine, except that I think I should be able to add new rows by simply typing
s['Values'].loc[('Three', 'A')] = 1
s['Values'].loc[('Three','B')]= 2
This returns no error message, and I can check it has worked by entering
s['Values'].loc[('Three', 'A')]
Which gives me 1. So all as expected.
However, I can't see the 'Three' data in Jupyter notebook - if simply type
s
then it only shows me the original one, two, A & B rows. This is probably because the new row is not in the index:
s.index
returns
MultiIndex(levels=[['one', 'two'], ['A', 'B']],
labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
Can anyone please give me a hint as to what's going on here? I'd like rows I subsequently add to appear in the index. Should I be using the .append function instead? It seems a bit cumbersome and other posts have recommended using the .loc approach above to add rows.
Thanks!
I believe you need to select the column inside DataFrame.loc:
s.loc[('Three', 'A'), 'Values'] = 1
s.loc[('Three', 'B'), 'Values'] = 2
print (s)
Values
one A -0.808372
B 0.904552
two A -0.443619
B 1.157234
Three A 1.000000
B 2.000000
print (s.index)
MultiIndex(levels=[['one', 'two', 'Three'], ['A', 'B']],
labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])
because your solution adds values to the column (a Series), but not to the DataFrame:
s['Values'].loc[('Three', 'A')] = 1
print (s['Values'])
one A -0.808372
B 0.904552
two A -0.443619
B 1.157234
Three A 1.000000
Name: Values, dtype: float64
print (s)
Values
one A -0.808372
B 0.904552
two A -0.443619
B 1.157234
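One small follow-up worth knowing (a sketch): enlarging a MultiIndex row by row can leave it unsorted, and sort_index() restores lexical order for efficient label-based slicing:

```python
import numpy as np
import pandas as pd

arrays = [['one', 'one', 'two', 'two'], ['A', 'B', 'A', 'B']]
mindex = pd.MultiIndex.from_tuples(list(zip(*arrays)))
s = pd.DataFrame(data=np.random.randn(4), index=mindex, columns=['Values'])

# Add the new rows with a single .loc accessor
s.loc[('Three', 'A'), 'Values'] = 1
s.loc[('Three', 'B'), 'Values'] = 2

# The enlarged index is no longer lexsorted; sort it back into order
s = s.sort_index()
print(len(s))  # 6
```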

Get row index from DataFrame row

Is it possible to get the row number (i.e. "the ordinal position of the index value") of a DataFrame row without adding an extra row that contains the row number (the index can be arbitrary, i.e. even a MultiIndex)?
>>> import pandas as pd
>>> df = pd.DataFrame({'a': [2, 3, 4, 2, 4, 6]})
>>> result = df[df.a > 3]
>>> result.iloc[0]
a 4
Name: 2, dtype: int64
# but how can I get the original row index of iloc[0] in df?
I could have done df['row_index'] = range(len(df)) which would maintain the original row number, but I am wondering if Pandas has a built-in way of doing this.
Access the .name attribute and use get_loc:
In [10]:
df.index.get_loc(result.iloc[0].name)
Out[10]:
2
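A related sketch: if you want the ordinal positions of all filtered rows at once, rather than one .name at a time, Index.get_indexer accepts the whole filtered index:

```python
import pandas as pd

df = pd.DataFrame({'a': [2, 3, 4, 2, 4, 6]})
result = df[df.a > 3]

# Ordinal positions in df of every row that survived the filter
positions = df.index.get_indexer(result.index)
print(list(positions))  # [2, 4, 5]
```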
Looking at this from a different side:
for r in df.itertuples():
    getattr(r, 'Index')
where df is the data frame. Maybe you want to add a conditional to capture the index only when a condition is met.
