I wanted to write a simple script that counts how many values in one column are higher than the corresponding values in another column:
d = {'a': [1, 3], 'b': [0, 2]}
df = pd.DataFrame(data=d, index=[1, 2])
print(df)
   a  b
1  1  0
2  3  2
My function:
def diff(dataframe):
    a_counter = 0
    b_counter = 0
    for i in dataframe["a"]:
        for ii in dataframe["b"]:
            if i > ii:
                a_counter += 1
            elif ii > i:
                b_counter += 1
    return a_counter, b_counter
However,
diff(df)
returns (3, 1) instead of (2, 0). I know the problem is that every single value of one column gets compared to every value of the other column (e.g. 1 gets compared to both 0 and 2 of column b). There is probably a dedicated function for my problem, but can you help me fix my script?
I would suggest adding some helper columns to compute, in an intuitive way, the sum of each condition a > b and b > a.
A working example based on your code:
import numpy as np
import pandas as pd
d = {'a': [1, 3], 'b': [0, 2]}
df = pd.DataFrame(data=d, index=[1, 2])
def diff(dataframe):
    dataframe['a>b'] = np.where(dataframe['a'] > dataframe['b'], 1, 0)
    dataframe['b>a'] = np.where(dataframe['b'] > dataframe['a'], 1, 0)
    return dataframe['a>b'].sum(), dataframe['b>a'].sum()
print(diff(df))
>>> (2, 0)
Basically, the way I used np.where() here, it produces 1 where the condition is met and 0 otherwise. You can then add those columns up using a simple sum() applied to the desired columns.
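For instance, a quick standalone check of that behaviour:
>>> np.where(pd.Series([1, 3]) > pd.Series([0, 5]), 1, 0)
array([1, 0])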
Update
Maybe you can use:
>>> df['a'].gt(df['b']).sum(), df['b'].gt(df['a']).sum()
(2, 0)
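(This works because gt returns a boolean Series, and sum() counts each True as 1 and each False as 0.)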
IIUC, to fix your code:
def diff(dataframe):
    a_counter = 0
    b_counter = 0
    for i in dataframe["a"]:
        for ii in dataframe["b"]:
            if i > ii:
                a_counter += 1
            elif ii > i:
                b_counter += 1
    # Subtract the minimum of the two counters
    m = min(a_counter, b_counter)
    return a_counter - m, b_counter - m
Output:
>>> diff(df)
(2, 0)
IIUC, you can use the sign of the difference and count the values:
d = {1: 'a', -1: 'b', 0: 'equal'}
(np.sign(df['a'].sub(df['b']))
   .map(d)
   .value_counts()
   .reindex(list(d.values()), fill_value=0)
)
output:
a 2
b 0
equal 0
dtype: int64
Related
I'm struggling to understand how the parameters for df.groupby work. I have the following code:
df = pd.read_sql(query_cnxn)
codegroup = df.groupby(['CODE'])
I then attempt a for loop as follows:
for code in codegroup:
    dfsize = codegroup.size()
    dfmax = codegroup['ID'].max()
    dfmin = codegroup['ID'].min()
    result = ((dfmax - dfmin) - dfsize)
    if result == 1:
        df2 = df2.append(itn)
    else:
        df3 = df3.append(itn)
I'm trying to iterate over each unique code. Does the for loop understand that I'm trying to loop through each code based on the above? Thank you in advance.
Pandas groupby returns an iterator that yields, for each group, a tuple of the group key and the group's DataFrame. You can perform your max and min operations on each group as:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'a': [0, 0, 0, 1, 1, 1], 'b': [3, 4, 5, 6, 7, 8]})
In [3]: for k, g in df.groupby('a'):
   ...:     print(g['b'].max())
   ...:
5
8
You can also get the max and min directly as a DataFrame using agg:
In [4]: df.groupby('a')['b'].agg(['max', 'min'])
Out[4]:
   max  min
a
0    5    3
1    8    6
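Applied to your case, a rough sketch (assuming df has the CODE and ID columns from your snippet; since itn is undefined there, this collects per-group statistics instead of appending rows):
stats = df.groupby('CODE')['ID'].agg(['max', 'min', 'size'])
stats['result'] = (stats['max'] - stats['min']) - stats['size']
df2 = stats[stats['result'] == 1]  # groups meeting your condition
df3 = stats[stats['result'] != 1]  # all other groups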
I'm new on Stack Overflow and have switched from R to Python. I'm trying to do something that is probably not too difficult, and while I can do it by butchering, I am wondering what the most Pythonic way is. I am trying to divide certain values in a column (E where F == 'a') by values further down in the same column (E where F == 'b'), using column D as a lookup:
import pandas as pd
df = pd.DataFrame({'D':[1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1], 'E':[10,20,30,40,50,100, 250, 250, 360, 567, 400],'F':['a', 'a', 'a', 'a', 'a', 'b','b', 'b', 'b', 'b', 'c']})
print(df)
out = pd.DataFrame({'D': [1, 2, 3, 4, 5], 'a/b': [0.1, 0.08, 0.12, 0.1111, 0.0881]})
print(out)
Can anyone help write this nicely?
I'm not entirely sure what you mean by "using column D as a lookup", since there is no need for such a lookup in the example you provided.
However, the quick-and-dirty way to achieve the output you did provide is
output = pd.DataFrame({'a/b': df[df['F'] == 'a']['E'].values / df[df['F'] == 'b']['E'].values})
output['D'] = df['D']
which makes the output
        a/b  D
0  0.100000  1
1  0.080000  2
2  0.120000  3
3  0.111111  4
4  0.088183  5
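If column D really does need to act as a lookup key (for example, if the a and b rows might come in different orders), one possible alternative is to pivot on D first; a sketch, assuming every (D, F) pair occurs at most once:
p = df.pivot(index='D', columns='F', values='E')
out = (p['a'] / p['b']).rename('a/b').reset_index()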
Lookup with .loc on a pandas DataFrame works as df.loc[rows, columns], where rows and columns are boolean conditions:
import numpy as np
# get the unique indices from column D; sorting fixes their order
idx = sorted(set(df['D']))
# A is an array of the values where 'F' == 'a'
A = np.array([df.loc[(df['F'] == 'a') & (df['D'] == i), 'E'].values[0] for i in idx])
# B is an array of the values where 'F' == 'b'
B = np.array([df.loc[(df['F'] == 'b') & (df['D'] == i), 'E'].values[0] for i in idx])
# Now divide to build your new dataframe of ratios
out = pd.DataFrame(np.vstack([A / B, idx]).T, columns=['a/b', 'D'])
Instead of using numpy.vstack, you can use:
out = pd.DataFrame([A / B, idx]).T
out.columns = ['a/b', 'D']
with the same result. I tried to do it in a single line (for no reason whatsoever).
Got it:
df = df.set_index('D')
out = df.loc[(df['F'] == 'a'), 'E'] / df.loc[(df['F'] == 'b'), 'E']
out = out.reset_index()
Setting D as the index makes the division align the a rows with the b rows by their D values. Thanks for your thoughts - I got inspired.
Let's say I have the following function which returns a tuple:
def return_tuple(x):
    if x in [1, '1', 'one']:
        return (1, 'one')
    else:
        return (2, 'two')
If I use the apply method, this returns:
df = pd.DataFrame({'col1': [1, 2, 'one']})
df['test'] = df['col1'].apply(return_tuple)
>>
  col1      test
0    1  (1, one)
1    2  (2, two)
2  one  (1, one)
But I would like something like this:
df['test_1'] = df['col1'].apply(return_tuple)??? # get 0-index in tuple
df['test_2'] = df['col1'].apply(return_tuple)??? # get 1 index in tuple
>>
  col1 test_1 test_2
0    1      1    one
1    2      2    two
2  one      1    one
Thanks.
Somewhere in between Alexander's and razdi's answers, using zip and tuple unpacking:
import pandas as pd
def return_tuple(x):
    if x in [1, '1', 'one']:
        return 1, 'one'
    else:
        return 2, 'two'
df_1 = pd.DataFrame({'col1': [1, 2, 3]})
df_1['test_1'], df_1['test_2'] = zip(*df_1['col1'].apply(return_tuple))
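which gives:
   col1  test_1 test_2
0     1       1    one
1     2       2    two
2     3       2    two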
You could also do it in a single step:
import pandas as pd
df = pd.DataFrame({'col1': [1,2,3]})
def return_tuple(x):
    if x['col1'] in [1, '1', 'one']:
        return pd.Series([1, 'one'])
    else:
        return pd.Series([2, 'two'])
df[['test_1', 'test_2']] = df.apply(return_tuple, axis=1)
import pandas as pd
def return_tuple(x):
    if x in [1, '1', 'one']:
        return 1, 'one'
    else:
        return 2, 'two'
df_1 = pd.DataFrame({'col1': [1, 2, 3]})
df_1['test_1'] = df_1['col1'].apply(lambda item: return_tuple(item)[0])
df_1['test_2'] = df_1['col1'].apply(lambda item: return_tuple(item)[1])
print(df_1)
It's as simple as that!
For more on lambda functions, see https://realpython.com/python-lambda/. There are a few relevant questions on SO, like this one.
After a bit of modification, here is the result:
df[['test1', 'test2']] = pd.DataFrame(df['col1'].apply(return_tuple).tolist(), index=df.index)
df
Here a join back onto df (via index alignment) is used. You can expand it to do what you want:
def return_tuple(x):
    if x in [1, '1', 'one']:
        return (1, 'one')
    else:
        return (2, 'two')
df = pd.DataFrame({'col1': [1,2,3]})
df['test'] = df['col1'].apply(return_tuple)
df[['test','test2']] = pd.DataFrame(df['test'].to_list(), index=df.index)
Out[32]:
   col1  test test2
0     1     1   one
1     2     2   two
2     3     2   two
You could also do this as a one-liner, without altering your existing function:
df[['test_1','test_2']] = pd.DataFrame(df['col1'].apply(return_tuple).tolist(),index=df.index)
I want to fill in the missing numbers in column b with the consecutive values 1 and 2.
This is what I have done:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [1, 4, 7, 8, 4],
                   'b': [1, np.nan, 3, np.nan, 5]})
df['b'].fillna({'b': [1, 2]}, inplace=True)
but nothing happens.
One way is to use loc with an array:
df.loc[df['b'].isnull(), 'b'] = [1, 2]
What you're attempting is possible but cumbersome with fillna:
nulls = df['b'].isnull()
df['b'] = df['b'].fillna(pd.Series([1, 2], index=nulls[nulls].index))
You may be looking for interpolate but the above solutions are generic given an input list or array.
If, on the other hand, you want to fill nulls with a sequence 1, 2, 3, etc, you can use cumsum:
# fillna solution
df['b'] = df['b'].fillna(df['b'].isnull().cumsum())
# loc solution
nulls = df['b'].isnull()
df.loc[nulls, 'b'] = nulls.cumsum()
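For the example frame above, the running count assigns 1 to the first gap and 2 to the second, so column b becomes [1.0, 1.0, 3.0, 2.0, 5.0].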
You can't feed fillna a list of values, as stated here and in the documentation. Also, if you're selecting the column, no need to tell fillna which column to use. You could do:
df.fillna({'b':1}, inplace=True)
Or
df['b'].fillna(1, inplace=True)
By the way, inplace is on its way to deprecation in pandas; the preferred way to do this is, for example,
df = df.fillna({'b':1})
You can interpolate. Example:
s = pd.Series([0, 1, np.nan, 3])
s.interpolate()
0    0.0
1    1.0
2    2.0
3    3.0
dtype: float64
If I understand the wording "consecutive values 1 and 2" correctly, the solution may be:
from itertools import islice, cycle
filler = [1, 2]
nans = df.b.isna()
df.loc[nans, 'b'] = list(islice(cycle(filler), sum(nans)))
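With two NaNs, as in the question, this fills 1 and then 2; with more missing values the filler cycles, e.g. three NaNs would be filled with 1, 2, 1.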
I am trying to assign alternative values to a column in a pandas DataFrame. The condition for assigning an alternative value is that the element currently has the value zero.
This is my code snippet:
df = pd.DataFrame({'A': [0, 1, 2, 0, 0, 1, 1, 0], 'B': [1, 2, 3, 4, 1, 2, 3, 4]})
for i, row in df.iterrows():
    if row['A'] == 0.0:
        df.iloc[i]['A'] = df.iloc[i-1]['A'] + df.iloc[i]['B'] - df.iloc[i-1]['B']
However, as it turns out, the values in these elements remain zero! The above has zero effect.
What's going on?
The original answer below works for some inputs, but it's not entirely right. Testing your code with the dataframe in your question, I found that it works, but it's not guaranteed to work with all dataframes. Here's an example where it doesn't work:
df = pd.DataFrame(np.random.randn(6,4), index=list(range(0,12,2)), columns=['A', 'B', 'C', 'D'])
This dataframe will cause your code to fail because the indices are not 0, 1, 2... as your algorithm expects, they're 0, 2, 4, ..., as defined by index=list(range(0,12,2)).
That means the values of i returned by the iterator will also be 0, 2, 4,..., so you'll get unexpected results when you try to use i-1 as a parameter to iloc.
In short, when you use for i, row in df.iterrows(): to iterate over a dataframe, i takes on the index values of the dimension you're iterating over as they're defined in the dataframe. Make sure you know what those values are when using them with offsets inside the loop.
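A quick illustration with a toy frame (hypothetical, just to show the labels):
df = pd.DataFrame({'A': [0, 1]}, index=[10, 20])
for i, row in df.iterrows():
    print(i)  # prints 10, then 20: the index labels, not 0 and 1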
Original answer:
I can't figure out why your code doesn't work, but I can verify that it doesn't. It may have something to do with modifying a dataframe while iterating over it, since you can use df.iloc[1]['A'] = 0.0 to set a value outside a loop with no problems.
Try using DataFrame.at instead:
for i, row in df.iterrows():
    if row['A'] == 0.0:
        df.at[i, 'A'] = df.iloc[i-1]['A'] + df.iloc[i]['B'] - df.iloc[i-1]['B']
This doesn't do anything to account for df.iloc[i-1] returning the last row in the dataframe, so be aware of that when the first value in column A is 0.0.
What about:
df = pd.DataFrame({'A': [0, 1, 2, 0, 0, 1, 1, 0], 'B': [1, 2, 3, 4, 1, 2, 3, 4]})
df['A'] = df.where(df[['A']] != 0,
                   df['A'].shift() + df['B'] - df['B'].shift(),
                   axis=0)['A']
print(df)
     A  B
0  NaN  1
1  1.0  2
2  2.0  3
3  3.0  4
4 -3.0  1
5  1.0  2
6  1.0  3
7  2.0  4
The NaN is there since there is no element prior to the first one.
You are using chained indexing, which is related to the famous SettingWithCopy warning. Check the SettingWithCopy section in Modern Pandas by Tom Augspurger.
In general this means that assignments of the form df['A']['B'] = ... are discouraged. It doesn't matter if you use a .loc accessor there.
If you add print statements to your code:
for i, row in df.iterrows():
    print(df)
    if row['A'] == 0.0:
        df.iloc[i]['A'] = df.iloc[i-1]['A'] + df.iloc[i]['B'] - df.iloc[i-1]['B']
you see strange things happening: the dataframe df is modified if and only if the first row of column 'A' is 0.
As Bill the Lizard pointed out, you need a single accessor. However, note that Bill's method uses label-based access, which may not be what you want when the dataframe is indexed differently. A better solution would then be to use loc
for i, row in df.iterrows():
    if row['A'] == 0.0:
        df.loc[df.index[i], 'A'] = df.iloc[i-1]['A'] + df.iloc[i]['B'] - df.iloc[i-1]['B']
or iloc
for i, row in df.iterrows():
    if row['A'] == 0.0:
        df.iloc[i, df.columns.get_loc('A')] = df.iloc[i-1]['A'] + df.iloc[i]['B'] - df.iloc[i-1]['B']
assuming the index is unique in the last case.
Note that the chained indexing occurs when setting values. Though this approach works, it is, per the discussion above, not encouraged!