Can anyone explain the below code?
pima_df[~pima_df.applymap(np.isreal).all(1)]
pima_df is a dataframe.
It extracts the rows in which at least one element is not a real number (e.g., a complex value).
e.g.: pima_df =
a b
0 1 2
1 2 4+3j
2 3 5
The result would be:
a b
1 2 (4+3j)
In short (see the sketch below):
applymap - applies a function to each and every element of the DataFrame.
np.isreal - returns True for real values, otherwise False.
all - returns True if every element along an axis is True, otherwise False.
~ - negates the boolean mask.
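A minimal sketch of how the expression evaluates step by step, using a made-up two-column frame (the real pima_df has more columns):

import numpy as np
import pandas as pd

pima_df = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 4 + 3j, 5]})

real_mask = pima_df.applymap(np.isreal)   # True where an element is real
#       a      b
# 0  True   True
# 1  True  False
# 2  True   True

row_all_real = real_mask.all(1)           # True if every element in the row is real
# [True, False, True]

print(pima_df[~row_all_real])             # keeps only row 1, the one containing 4+3j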
Please look at the doc or help(np.isreal).
Returns a bool array, where True if input element is real.
If element has complex type with zero complex part, the return value
for that element is True.
To be precise, NumPy provides a set of methods for comparing and operating on arrays element-wise:
np.isreal: determines whether each element of an array is real.
np.all: determines whether all elements of an array evaluate to True.
tilde (~): used for boolean indexing; it means NOT.
applymap: works element-wise on a DataFrame.
all(): used to find rows where all the values are True.
The ~ is the operator equivalent of the __invert__ dunder, which has been overridden explicitly to perform vectorized logical inversions on pd.DataFrame/pd.Series objects.
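For instance, a quick sketch (toy Series, not from the original post) of ~ flipping a boolean mask via that __invert__ override:

import pandas as pd

mask = pd.Series([True, False, True])
print(~mask)   # element-wise NOT
# 0    False
# 1     True
# 2    False
# dtype: bool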
Example of boolean index (~):
>>> df
a b c d
0 a a 2 6
1 a a 4 7
2 b a 1 6
3 b a 2 1
4 c b 3 6
5 c b 0 2
6 d b 3 3
7 d b 2 1
8 e c 4 3
9 e c 2 0
10 f c 0 6
11 f c 1 2
>>> df.query('a in b')
a b c d
0 a a 2 6
1 a a 4 7
2 b a 1 6
3 b a 2 1
4 c b 3 6
5 c b 0 2
And the complement, using ~:
>>> df[~df.a.isin(df.b)]  # rows where a is NOT in b
a b c d
6 d b 3 3
7 d b 2 1
8 e c 4 3
9 e c 2 0
10 f c 0 6
11 f c 1 2
hope this will help.
This is similar to LabelEncoder from scikit-learn, but with the requirement that the numeric assignments follow the frequency of the category, i.e., the most frequent category is assigned the highest/lowest (depending on the use case) number.
E.g., if the variable can take the values [a, b, c] with frequencies such as:
Category
0 a
1 a
2 a
3 a
4 a
5 b
6 b
7 b
8 b
9 b
10 b
11 b
12 b
13 b
14 b
15 c
16 c
a occurs 5 times, b occurs 10 times and c occurs 2 times.
Then I want the replacements to be done as b=1, a=2 and c=3.
Use rank (note that argsort returns the sorting permutation, not ranks, so it does not give the desired order directly):
df['Order'] = df['Frequency'].rank(ascending=False).astype(int)
df
returns
Category Frequency Order
0 a 5 2
1 b 10 1
2 c 2 3
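For completeness, a minimal end-to-end sketch (assuming the raw data is a single 'Category' column, as in the question) that builds the frequency ranks and maps them back:

import pandas as pd

# 5 a's, 10 b's and 2 c's, as in the question.
data = pd.DataFrame({'Category': ['a'] * 5 + ['b'] * 10 + ['c'] * 2})

freq = data['Category'].value_counts()          # b=10, a=5, c=2
order = freq.rank(ascending=False).astype(int)  # b=1, a=2, c=3

# Map each raw category to its frequency rank.
data['Encoded'] = data['Category'].map(order)
print(data.head())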
If you are using pandas, you can use its map() method:
import pandas as pd
data = pd.DataFrame([['a'], ['b'], ['c']], columns=['category'])
print(data)
category
0 a
1 b
2 c
mapping_dict = {'b':1, 'a':2, 'c':3}
print(data['category'].map(mapping_dict))
0 2
1 1
2 3
LabelEncoder uses np.unique to find the unique values in a column, which returns them in sorted (alphabetical) order, so you cannot impose a custom ordering with it.
As suggested by @Vivek Kumar, I used map with a dict whose keys are the frequency-sorted column values and whose values are their positions:
data.Category = data.Category.map(dict(zip(data.Category.value_counts().index, range(1, len(data.Category.value_counts().index)+1))))
It looks a bit dirty; it is much clearer split across a couple of lines:
sorted_indices = data.Category.value_counts().index
data.Category = data.Category.map(dict(zip(sorted_indices, range(1, len(sorted_indices)+1))))
This is the closest I have to my requirement. The output looks like this:
Category
0 2
1 2
2 2
3 2
4 2
5 1
6 1
7 1
8 1
9 1
10 1
11 1
12 1
13 1
14 1
15 3
16 3
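A slightly more readable equivalent of the value_counts-based mapping above (a sketch, not from the original post), building the dict with enumerate:

import pandas as pd

# Example data with the same frequencies as in the question.
data = pd.DataFrame({'Category': ['a'] * 5 + ['b'] * 10 + ['c'] * 2})

# value_counts() orders categories by descending frequency,
# so enumerate(..., start=1) assigns 1 to the most frequent one.
mapping = {cat: rank for rank, cat in
           enumerate(data['Category'].value_counts().index, start=1)}

data['Category'] = data['Category'].map(mapping)
print(data['Category'].unique())   # [2 1 3] for a, b, c respectively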
I have a dataframe with patients, date, medications, and diagnosis.
Each patient has a unique id ('pid'), and may or may not be treated with different drugs.
What is the best practice for selecting all patients that have at some point been treated with a certain drug?
Since my dataset is huge, for-loops and if-statements are a last resort.
Example:
IN:
pid drug
1 A
1 B
1 C
2 A
2 C
2 E
3 B
3 C
3 D
4 D
4 E
4 F
Select all patients who have at some point been treated with drug 'B'. Note that all entries for those patients must be included, meaning not just the treatments with drug B, but all of their treatments:
OUT:
1 A
1 B
1 C
3 B
3 C
3 D
My current solution:
1) Get all pids for rows that include drug 'B'.
2) Get all rows whose pid appears in the result of step 1.
The problem with this solution is that I would need a very long if-statement listing all the pids (millions of them).
I do support COLDSPEED's answer, but since you say:
My current solution:
1) Get all pids for rows that include drug 'B'.
2) Get all rows whose pid appears in the result of step 1.
The problem with this solution is that I would need a very long if-statement listing all the pids (millions of them).
this can be solved much more simply than by hardcoding the ifs:
patients_B = df.loc[df['drug'] == 'B', 'pid']
or
patients_B = set(df.loc[df['drug'] == 'B', 'pid'])
and then
result = df[df['pid'].isin(patients_B)]
The easiest method involves groupby + transform:
df[df.drug.eq('B').groupby(df.pid).transform('any')]
pid drug
0 1 A
1 1 B
2 1 C
6 3 B
7 3 C
8 3 D
In pursuit of a faster solution, call groupby on df rather than on a Series:
df[df.groupby('pid').drug.transform(lambda x: x.eq('B').any())]
pid drug
0 1 A
1 1 B
2 1 C
6 3 B
7 3 C
8 3 D
Here is one way.
s = df.groupby('drug')['pid'].apply(set)
result = df[df['pid'].isin(s['B'])]
# pid drug
# 0 1 A
# 1 1 B
# 2 1 C
# 6 3 B
# 7 3 C
# 8 3 D
Explanation
Create the mapping series s as a separate initial step so that it
does not need recalculating for each query.
For the membership test, set gives O(1) lookups.
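A short self-contained sketch (the frame below just rebuilds the example from the question) showing how s can be reused for several drugs without regrouping:

import pandas as pd

df = pd.DataFrame({'pid':  [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
                   'drug': list('ABCACEBCDDEF')})

# One pass over the data: for each drug, the set of patients who received it.
s = df.groupby('drug')['pid'].apply(set)

treated_with_B = df[df['pid'].isin(s['B'])]   # patients 1 and 3
treated_with_E = df[df['pid'].isin(s['E'])]   # patients 2 and 4
print(treated_with_B)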
IIUC, use filter:
df.groupby('pid').filter(lambda x : (x['drug']=='B').any())
Out[18]:
pid drug
0 1 A
1 1 B
2 1 C
6 3 B
7 3 C
8 3 D
Assume that we have this array in Python:
import pandas as pd
arr = pd.DataFrame(['aabbc','aabccca','aa'])
I want to split each row into columns, one character per column. The lengths of the rows may differ.
This is the output I expect (a 3×7 matrix in this case):
1 2 3 4 5 6 7
1 a a b b c Na Na
2 a a b c c c a
3 a a Na Na Na Na Na
My matrix has 20000 rows and I would prefer not to use for loops. The original data is protein sequences.
I read [1], [2], [3], etc., and they didn't help me.
Option 1
One simple way to do this is using a list comprehension.
pd.DataFrame([list(x) for x in arr[0]])
0 1 2 3 4 5 6
0 a a b b c None None
1 a a b c c c a
2 a a None None None None None
Alternatively, use apply(list) which does the same thing.
pd.DataFrame(arr[0].apply(list).tolist())
0 1 2 3 4 5 6
0 a a b b c None None
1 a a b c c c a
2 a a None None None None None
Option 2
Alternative with extractall + unstack. You'll end up with a multi-index of columns. You can drop the first level of the result.
v = arr[0].str.extractall(r'(\w)').unstack()
v.columns = v.columns.droplevel(0)
v
match 0 1 2 3 4 5 6
0 a a b b c NaN NaN
1 a a b c c c a
2 a a NaN NaN NaN NaN NaN
Option 3
Manipulating view -
v = arr[0].values.astype(str)
pd.DataFrame(v.view('U1').reshape(v.shape[0], -1))
0 1 2 3 4 5 6
0 a a b b c
1 a a b c c c a
2 a a
This gives you empty strings ('') instead of NaN in the missing cells. Use replace if you want to convert them back to NaN.
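For example, a small sketch of that replace step (reusing the view-based frame from Option 3):

import numpy as np
import pandas as pd

arr = pd.DataFrame(['aabbc', 'aabccca', 'aa'])

v = arr[0].values.astype(str)
out = pd.DataFrame(v.view('U1').reshape(v.shape[0], -1))

# Turn the padding empty strings back into NaN.
out = out.replace('', np.nan)
print(out)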
I have a few columns in a Pandas dataframe which are sparsely populated, and I wish to replace their values with a flag indicating whether they are filled or not.
For instance, my dataframe is:
A B C D
0 8 NaN NaN 8
1 9 4 NaN 4
2 5 NaN 3 8
3 4 NaN NaN 1
I want it to be:
A B C D
0 8 F F 8
1 9 T F 4
2 5 F T 8
3 4 F F 1
(F = False, T = True)
Running pandas.isnull(df['B']) returns a new series of boolean values; how do I efficiently update the original dataframe with it?
You can just do df['B'] = df.B.notnull().
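A minimal sketch (reconstructing the example frame above) that applies this to both sparse columns:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [8, 9, 5, 4],
                   'B': [np.nan, 4, np.nan, np.nan],
                   'C': [np.nan, np.nan, 3, np.nan],
                   'D': [8, 4, 8, 1]})

# Replace each sparse column with a True/False "is filled" flag.
for col in ['B', 'C']:
    df[col] = df[col].notnull()

print(df)
#    A      B      C  D
# 0  8  False  False  8
# 1  9   True  False  4
# 2  5  False   True  8
# 3  4  False  False  1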
I have a pandas dataframe like the following:
A B C
1 2 1
3 4 0
5 2 0
5 3 1
I would like to get the value from A if the value of C is 1, and the value from B if C is 0. How would I do this? Ultimately I'd like to end up with a vector containing the values of A where C is 1 and of B where C is 0, which would be [1, 4, 2, 5].
Assuming you mean "from A if the value of C is 1 and from B if the value of C is 0", which makes sense given your intended output, I might use Series.where:
>>> df
A B C
0 1 2 1
1 3 4 0
2 5 2 0
3 5 3 1
>>> df.A.where(df.C, df.B)
0 1
1 4
2 2
3 5
dtype: int64
which is read as "make a series using the values of A where the corresponding value of C is true, otherwise use the corresponding values of B". Here, since 1 is truthy, we can use df.C directly, but we could equally use df.C == 1, or df.C*5+3 < 4, or any other boolean Series.
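A small sketch (rebuilding the example frame) that also pulls the result out as a plain list, plus an equivalent using np.where, which is an alternative not mentioned above:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 3, 5, 5],
                   'B': [2, 4, 2, 3],
                   'C': [1, 0, 0, 1]})

result = df.A.where(df.C == 1, df.B)
print(result.tolist())                     # [1, 4, 2, 5]

# np.where returns a NumPy array instead of a Series.
print(np.where(df.C == 1, df.A, df.B))     # [1 4 2 5]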