Python pandas: elegant division within dataframe

I'm new to Stack Overflow and have switched from R to Python. I'm trying to do something that is probably not too difficult, and while I can get there by butchering, I am wondering what the most Pythonic way to do it is. I am trying to divide certain values in a column (E where F == 'a') by values further down the same column (E where F == 'b'), using column D as a lookup:
import pandas as pd
df = pd.DataFrame({'D': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1],
                   'E': [10, 20, 30, 40, 50, 100, 250, 250, 360, 567, 400],
                   'F': ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b', 'c']})
print(df)
out = pd.DataFrame({'D': [1, 2, 3, 4, 5], 'a/b': [0.1, 0.08, 0.12, 0.1111, 0.0882]})
print(out)
Can anyone help write this nicely?

I'm not entirely sure what you mean by "using column D as a lookup", since no such lookup is needed in the example you provided.
However, the quick and dirty way to achieve the output you did provide is the following (note that .values drops the index, so this relies on the 'a' and 'b' rows being equal in number and already aligned in D order):
output = pd.DataFrame({'a/b': df[df['F'] == 'a']['E'].values / df[df['F'] == 'b']['E'].values})
output['D'] = df['D']
which makes output come out as
        a/b  D
0  0.100000  1
1  0.080000  2
2  0.120000  3
3  0.111111  4
4  0.088183  5

You can do the lookup with .loc on a pandas DataFrame, as df.loc[rows, columns], where rows and columns are boolean conditions:
import numpy as np
# get the unique indices from column D; sorted() guarantees a stable order
# (a bare set would not preserve any particular order)
idx = sorted(set(df['D']))
# A is an array of the 'E' values where 'F' == 'a'
A = np.array([df.loc[(df['F']=='a') & (df['D']==i), 'E'].values[0] for i in idx])
# B is an array of the 'E' values where 'F' == 'b'
B = np.array([df.loc[(df['F']=='b') & (df['D']==i), 'E'].values[0] for i in idx])
# now divide into your new dataframe of ratios
out = pd.DataFrame(np.vstack([A/B, idx]).T, columns=['a/b', 'D'])
Instead of using numpy.vstack, you can use:
out = pd.DataFrame([A/B, idx]).T
out.columns = ['a/b', 'D']
with the same result. I tried to do it in a single line (for no reason whatsoever).
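For reference, with the sample data both versions produce the following (my run; note that D comes out as float because vstack builds a single float array):
        a/b    D
0  0.100000  1.0
1  0.080000  2.0
2  0.120000  3.0
3  0.111111  4.0
4  0.088183  5.0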

Got it:
df = df.set_index('D')
out = df.loc[(df['F'] == 'a'), 'E'] / df.loc[(df['F'] == 'b'), 'E']
out = out.reset_index()
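This works because, after set_index('D'), pandas aligns the two slices on their shared D index before dividing. The resulting column is still named E, since that is where the values came from; a final rename (my addition, not part of the original answer) matches the desired output exactly:
out = out.rename(columns={'E': 'a/b'})
print(out)
   D       a/b
0  1  0.100000
1  2  0.080000
2  3  0.120000
3  4  0.111111
4  5  0.088183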
Thanks for your thoughts - I got inspired.

Related

Scripting a simple counter

I wanted to create a simple script that counts how many values in one column are higher than the corresponding values in the other column:
d = {'a': [1, 3], 'b': [0, 2]}
df = pd.DataFrame(data=d, index=[1, 2])
print(df)
   a  b
1  1  0
2  3  2
My function:
def diff(dataframe):
    a_counter = 0
    b_counter = 0
    for i in dataframe["a"]:
        for ii in dataframe["b"]:
            if i > ii:
                a_counter += 1
            elif ii > i:
                b_counter += 1
    return a_counter, b_counter
However,
diff(df)
returns (3, 1) instead of (2, 0). I know the problem is that every single value of one column gets compared to every value of the other column (e.g. 1 gets compared to both 0 and 2 of column b). There is probably a special function for my problem, but can you help me fix my script?
I would suggest adding some helper columns to compute the sum of each condition, a > b and b > a, in an intuitive way.
A working example based on your code:
import numpy as np
import pandas as pd
d = {'a': [1, 3], 'b': [0, 2]}
df = pd.DataFrame(data=d, index=[1, 2])
def diff(dataframe):
    dataframe['a>b'] = np.where(dataframe['a'] > dataframe['b'], 1, 0)
    dataframe['b>a'] = np.where(dataframe['b'] > dataframe['a'], 1, 0)
    return dataframe['a>b'].sum(), dataframe['b>a'].sum()
print(diff(df))
>>> (2, 0)
Basically, the way I used np.where() here, it produces 1 wherever the condition is met and 0 otherwise. You can then add those columns up using a simple sum() applied to the desired columns.
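For instance, a minimal illustration using the two columns from the question:
>>> import numpy as np
>>> np.where(np.array([1, 3]) > np.array([0, 2]), 1, 0)
array([1, 1])
The two 1s sum to 2, which is exactly the a > b count.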
Update
Maybe you can use:
>>> df['a'].gt(df['b']).sum(), df['b'].gt(df['a']).sum()
(2, 0)
IIUC, to fix your code:
def diff(dataframe):
    a_counter = 0
    b_counter = 0
    for i in dataframe["a"]:
        for ii in dataframe["b"]:
            if i > ii:
                a_counter += 1
            elif ii > i:
                b_counter += 1
    # Subtract the minimum of the counters
    m = min(a_counter, b_counter)
    return a_counter - m, b_counter - m
Output:
>>> diff(df)
(2, 0)
IIUC, you can use the sign of the difference and count the values:
d = {1: 'a', -1: 'b', 0: 'equal'}
(np.sign(df['a'].sub(df['b']))
.map(d)
.value_counts()
.reindex(list(d.values()), fill_value=0)
)
output:
a        2
b        0
equal    0
dtype: int64

How do I select a specific column in a pivot_table - Python [duplicate]

I have a pandas DataFrame with 4 columns and I want to create a new DataFrame that only has three of the columns. This question is similar to Extracting specific columns from a data frame, but for pandas, not R. The following code does not work (it raises an error) and is certainly not the pandasnic way to do it.
import pandas as pd
old = pd.DataFrame({'A' : [4,5], 'B' : [10,20], 'C' : [100,50], 'D' : [-30,-50]})
new = pd.DataFrame(zip(old.A, old.C, old.D)) # raises TypeError: data argument can't be an iterator
What is the pandasnic way to do it?
There is a way of doing this, and it actually looks similar to R:
new = old[['A', 'C', 'D']].copy()
Here you are just selecting the columns you want from the original data frame and creating a variable for those. If you want to modify the new dataframe at all, you'll probably want to use .copy() to avoid a SettingWithCopyWarning.
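A minimal sketch of why the .copy() matters, assuming you go on to mutate new:
new = old[['A', 'C', 'D']]         # plain selection
new['A'] = 0                       # may raise SettingWithCopyWarning
new = old[['A', 'C', 'D']].copy()  # independent copy
new['A'] = 0                       # safe: old is untouched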
An alternative method is to use filter which will create a copy by default:
new = old.filter(['A','B','D'], axis=1)
Finally, depending on the number of columns in your original dataframe, it might be more succinct to express this using drop (which also creates a copy by default):
new = old.drop('B', axis=1)
The easiest way is
new = old[['A','C','D']]
Another simple way seems to be:
new = pd.DataFrame([old.A, old.B, old.C]).transpose()
where old.column_name gives you a Series.
Make a list of all the column Series you want to retain and pass it to the DataFrame constructor; a transpose is then needed to adjust the shape.
In [14]: pd.DataFrame([old.A, old.B, old.C]).transpose()
Out[14]:
   A   B    C
0  4  10  100
1  5  20   50
Columns by index:
# selected column index: 1, 6, 7
new = old.iloc[: , [1, 6, 7]].copy()
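If you start from column names rather than positions, one way to recover the positions is Index.get_indexer (a sketch; the column names here are taken from the question's example, where they map to positions 0, 2, 3):
pos = old.columns.get_indexer(['A', 'C', 'D'])  # array([0, 2, 3])
new = old.iloc[:, pos].copy()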
As far as I can tell, you don't necessarily need to specify the axis when using the filter function: for a DataFrame, filter operates on the columns by default.
new = old.filter(['A','B','D'])
returns the same dataframe as
new = old.filter(['A','B','D'], axis=1)
Generic functional form
def select_columns(data_frame, column_names):
    new_frame = data_frame.loc[:, column_names]
    return new_frame
Specific for your problem above
selected_columns = ['A', 'C', 'D']
new = select_columns(old, selected_columns)
As an alternative:
new = pd.DataFrame().assign(A=old['A'], C=old['C'], D=old['D'])
If you want a new data frame, then:
import pandas as pd
old = pd.DataFrame({'A' : [4,5], 'B' : [10,20], 'C' : [100,50], 'D' : [-30,-50]})
new= old[['A', 'C', 'D']]
You can drop the unwanted columns from the column index:
df = pd.DataFrame({'A': [1, 1], 'B': [2, 2], 'C': [3, 3], 'D': [4, 4]})
df[df.columns.drop(['B', 'C'])]
or
df.loc[:, df.columns.drop(['B', 'C'])]
Output:
   A  D
0  1  4
1  1  4

How to find all occurrences for each value in column A that is also in column B

Using Pandas, I'm trying to find the most recent overlapping occurrence of some value in Column A that also happens to be in Column B (though not necessarily occurring in the same row); this is to be done for all rows in Column A.
I've accomplished something close with an n^2 solution (creating a list from each column and iterating through with a nested for-loop), but I would like something faster if possible, as this needs to run on a table with tens of thousands of entries. (A vectorized solution would be ideal, but I am mostly looking for the "right" way to do this.)
df['idx'] = range(0, len(df.index))
A = list(df['r_A'])
B = list(df['r_B'])
A_B_Dict = {}
for i in range(0, len(B)-1):
    for j in range(0, len(A)-1):
        if B[i] == A[j]:
            A_search = df.loc[df['r_A'] == A[j]].index
            A_B_Dict[B[i]] = A_search
Given some df like so:
data = [[1, 'A', 'A'],
        [2, 'B', 'D'],
        [3, 'C', 'B'],
        [4, 'D', 'D']]
df = pd.DataFrame(data, columns=['idx', 'A', 'B'])
It should give back something like:
A_B_Dict = {'A': 1, 'B': 3, 'C': None, 'D': 4}
Such that the most recent occurrence (or all occurrences, for that matter) of each Column A value in Column B is stored as the value of A_B_Dict, where the key is the original value observed in Column A.
IIUC
# map each value of B to its idx; later duplicates overwrite earlier ones,
# so each key keeps its most recent occurrence
d = dict(zip(df.B, df.idx))
# look up every value of A in that mapping (values absent from B become NaN)
dict(zip(df.A, df.A.map(d)))
{'A': 1.0, 'B': 3.0, 'C': nan, 'D': 4.0}
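If you want all occurrences rather than only the most recent one, which the question also mentions, a groupby variant along the same lines (my extension, not part of the answer above):
# collect every idx per value of B, then map onto A as before
d_all = df.groupby('B')['idx'].apply(list).to_dict()
dict(zip(df.A, df.A.map(d_all)))
{'A': [1], 'B': [3], 'C': nan, 'D': [2, 4]}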

Is there a python function to fill missing data with consecutive value

I want to fill in these missing numbers in column b with the consecutive values 1 and 2.
This is what I have done:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [1, 4, 7, 8, 4],
                   'b': [1, np.nan, 3, np.nan, 5]})
df['b'].fillna({'b':[1,2]}, inplace=True)
but nothing happens.
One way is to use loc with an array:
df.loc[df['b'].isnull(), 'b'] = [1, 2]
What you're attempting is possible but cumbersome with fillna: it accepts a Series and aligns it on the index, so you have to build a Series indexed by the null positions:
nulls = df['b'].isnull()
df['b'] = df['b'].fillna(pd.Series([1, 2], index=nulls[nulls].index))
You may be looking for interpolate but the above solutions are generic given an input list or array.
If, on the other hand, you want to fill nulls with a sequence 1, 2, 3, etc, you can use cumsum:
# fillna solution
df['b'] = df['b'].fillna(df['b'].isnull().cumsum())
# loc solution
nulls = df['b'].isnull()
df.loc[nulls, 'b'] = nulls.cumsum()
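To see why the cumsum trick numbers the nulls consecutively, here is what it produces on the example column (my illustration):
nulls = df['b'].isnull()
nulls.cumsum()
0    0
1    1
2    1
3    2
4    2
dtype: int64
Only the null positions (1 and 3) read their values back, so they receive 1 and 2.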
You can't feed fillna a list of values, as stated in the documentation. Also, if you're selecting the column, there is no need to tell fillna which column to use. You could do:
df.fillna({'b':1}, inplace=True)
Or
df['b'].fillna(1, inplace=True)
By the way, inplace is on its way to deprecation in pandas; the preferred way to do this is, for example,
df = df.fillna({'b':1})
You can interpolate. Example:
s = pd.Series([0, 1, np.nan, 3])
s.interpolate()
0    0.0
1    1.0
2    2.0
3    3.0
dtype: float64
If I understand the wording "consecutive values 1 and 2" correctly, the solution may be:
from itertools import islice, cycle
filler = [1, 2]
nans = df.b.isna()
df.loc[nans, 'b'] = list(islice(cycle(filler), sum(nans)))
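A quick look at what the filler expression yields (cycle repeats the pattern and islice cuts it to the number of NaNs):
>>> list(islice(cycle([1, 2]), 2))
[1, 2]
>>> list(islice(cycle([1, 2]), 5))
[1, 2, 1, 2, 1]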

Count and sort pandas dataframe

I have a dataframe with a column 'code', which I have sorted based on frequency.
In order to see what each code means, there is also a column 'note'.
For each count/group of the 'code' column, I display the first note attached to that code:
df.groupby('code')['note'].agg(['count', 'first']).sort_values('count', ascending=False)
Now my question is, how do I display only those rows that have frequency of e.g. >= 30?
Add a query call before you sort. Also, if you only want those rows EQUALing < insert frequency here >, sort_values isn't needed (right?!).
df.groupby('code')['note'].agg(['count', 'first']).query('count == 30')
If the question is for all groups with AT LEAST < insert frequency here >, then
(
df.groupby('code')
.note.agg(['count', 'first'])
.query('count >= 30')
.sort_values('count', ascending=False)
)
Why do I use query? It's a lot easier to pipe and chain with it.
You can just filter your result accordingly:
grp = grp[grp['count'] >= 30]
Example with data
import pandas as pd
df = pd.DataFrame({'code': [1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3],
                   'note': ['A', 'B', 'A', 'A', 'C', 'C', 'C', 'A', 'A',
                            'B', 'B', 'C', 'A', 'B']})
res = df.groupby('code')['note'].agg(['count', 'first']).sort_values('count', ascending=False)
# count first
# code
# 2 5 C
# 3 5 B
# 1 4 A
res2 = res[res['count'] >= 5]
# count first
# code
# 2 5 C
# 3 5 B
