Let's say I have the following df:
Name    A    B    C    D
John  NaN    1    2  NaN
Mike    2  NaN  NaN  NaN
Fred  NaN    5    6    7
Ana     3  NaN    3    2
Fran    2  NaN    1    1
What I want to do is filter on some columns. For example, I want everyone who has only column A filled (in this case, Mike):
df_1 = df[(df['A'] > 0)&(~(df['A'] == 0))]
or I want only two columns filled (in this case, none):
df_1 = df[(df['A','B'] > 0)&(~(df['A','B'] == 0))]
I am really struggling with this.
Thanks!
isnull + all
Your syntax is incorrect. You can use pd.DataFrame.isnull:
mask1 = df['A'] > 0
mask2 = df[['B', 'C', 'D']].isnull().all(1)
df_1 = df[mask1 & mask2]
Similarly, for your second query:
mask1 = (df[['A', 'B']] > 0).all(1)
mask2 = df[['C', 'D']].isnull().all(1)
df_1 = df[mask1 & mask2]
This assumes you wish to filter explicitly for values greater than 0 in mask1. If any non-null number suffices, you can use pd.DataFrame.notnull.
Don't be afraid to split your masks across multiple lines in this way. It will make your code clearer and easier to manage.
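For instance, a minimal sketch of the notnull variant for the first query, assuming any non-null value in A should count:
mask1 = df['A'].notnull()
mask2 = df[['B', 'C', 'D']].isnull().all(1)
df_1 = df[mask1 & mask2]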
pipe + isnull + all
More generically, you can write a function to calculate and apply your Boolean series mask:
def masker(df, cols_required):
    """Supply list cols_required. These must be > 0; all others must be null."""
    mask1 = (df[cols_required] > 0).all(1)
    mask2 = df[df.columns.difference(cols_required)].isnull().all(1)
    return df[mask1 & mask2]
df = df.pipe(masker, cols_required=['A', 'B'])
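As a quick sanity check, here is the sample frame from the question run through masker, assuming Name is set as the index so that only A-D are data columns; only Mike survives:
import numpy as np
import pandas as pd

# rebuild the sample frame with Name as the index
df = pd.DataFrame({'Name': ['John', 'Mike', 'Fred', 'Ana', 'Fran'],
                   'A': [np.nan, 2, np.nan, 3, 2],
                   'B': [1, np.nan, 5, np.nan, np.nan],
                   'C': [2, np.nan, 6, 3, 1],
                   'D': [np.nan, np.nan, 7, 2, 1]}).set_index('Name')

df.pipe(masker, cols_required=['A'])
#         A   B   C   D
# Name
# Mike  2.0 NaN NaN NaN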
Related
I'm doing something wrong when attempting to set a column for a masked subset of rows to the substring extracted from another column.
Here is some example code that illustrates the problem I am facing:
import pandas as pd
data = [
    {'type': 'A', 'base_col': 'key=val'},
    {'type': 'B', 'base_col': 'other_val'},
    {'type': 'A', 'base_col': 'key=val'},
    {'type': 'B', 'base_col': 'other_val'}
]
df = pd.DataFrame(data)
mask = df['type'] == 'A'
df.loc[mask, 'derived_col'] = df[mask]['base_col'].str.extract(r'key=(.*)')
print("df:")
print(df)
print("mask:")
print(mask)
print("extraction:")
print(df[mask]['base_col'].str.extract(r'key=(.*)'))
The output I get from the above code is as follows:
df:
type base_col derived_col
0 A key=val NaN
1 B other_val NaN
2 A key=val NaN
3 B other_val NaN
mask:
0 True
1 False
2 True
3 False
Name: type, dtype: bool
extraction:
0
0 val
2 val
The boolean mask is as I expect, and the extracted substrings on the subset of rows (indexes 0, 2) are also as I expect, yet the new derived_col comes out as all NaN. The output I would expect in derived_col is 'val' for indexes 0 and 2, and NaN for the other two rows.
Please clarify what I am getting wrong here. Thanks!
You should assign a Series, not a DataFrame. str.extract returns a DataFrame here, so pick column 0:
mask = df['type'] == 'A'
df.loc[mask, 'derived_col'] = df[mask]['base_col'].str.extract(r'key=(.*)')[0]
df
Out[449]:
type base_col derived_col
0 A key=val val
1 B other_val NaN
2 A key=val val
3 B other_val NaN
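For what it's worth, an equivalent fix is to pass expand=False, which makes str.extract return a Series when the pattern has a single capture group, so no column selection is needed:
df.loc[mask, 'derived_col'] = df[mask]['base_col'].str.extract(r'key=(.*)', expand=False)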
With the following dataframe as an example:
df = pd.DataFrame({'Sample':['X', 'Y', 'Z'], 'Base':[2, 10, 3], 'A':[0,5,100], 'C':[0,10,7]})
I would like to add a new column called df["indices"] with the indices of columns df["A"] and/or df["C"] provided they satisfy 2 conditions:
Must be greater than 5
df["A"]/df["Base"] or df["C"]/df["Base"] must be greater than or equal to 1
The resulting dataframe would be:
df = pd.DataFrame({'Sample':['X', 'Y', 'Z'], 'Base':[2, 10, 3], 'A':[0,5,100], 'C':[0,10,7], 'indices': ['','C','A,C']})
I can get True or False values for my first condition with df[['A','C']] > 5, but I cannot get it to work with condition 2, which is based on another column of the dataframe. Getting the column names where I get True into a new column is yet another story. I imagine something with apply and get_loc or index, but I cannot get it to work no matter what I try.
Let's create a boolean mask satisfying the two given conditions, then use DataFrame.dot on this mask to get the indices:
m = df[['A', 'C']].gt(5) & df[['A', 'C']].div(df['Base'], axis=0).ge(1)
df['indices'] = m.dot(m.columns + ',').str.rstrip(',')
Sample Base A C indices
0 X 2 0 0
1 Y 10 5 10 C
2 Z 3 100 7 A,C
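To see why this works: DataFrame.dot of a boolean mask against string labels concatenates, per row, the labels of the True columns (each carrying its trailing comma), which rstrip then tidies up. The intermediate result looks like this:
m.dot(m.columns + ',')
# 0
# 1      C,
# 2    A,C,
# dtype: object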
You can use df.loc to assign values back to the column when any number of conditions are met. A simple approach would be to have 3 of these, each with your desired conditions. You could also probably chain together np.where to achieve the same thing if you wanted.
import pandas as pd
df = pd.DataFrame({'Sample': ['X', 'Y', 'Z'],
                   'Base': [2, 10, 3],
                   'A': [0, 5, 100],
                   'C': [0, 10, 7]})
df.loc[(df['A'] / df['Base'] >= 1) & (df['C'] / df['Base'] >= 1), 'indices'] = 'A,C'
df.loc[(df['A'] / df['Base'] >= 1) & (df['C'] / df['Base'] < 1), 'indices'] = 'A'
df.loc[(df['A'] / df['Base'] < 1) & (df['C'] / df['Base'] >= 1), 'indices'] = 'C'
Output
Sample Base A C indices
0 X 2 0 0 NaN
1 Y 10 5 10 C
2 Z 3 100 7 A,C
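For reference, the chained np.where version hinted at above might look like this sketch (same ratio conditions; it fills '' rather than NaN where neither holds):
import numpy as np

a_ok = df['A'] / df['Base'] >= 1
c_ok = df['C'] / df['Base'] >= 1
df['indices'] = np.where(a_ok & c_ok, 'A,C',
                np.where(a_ok, 'A',
                np.where(c_ok, 'C', '')))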
My goal is to conditionally index a data frame and change the values in a column for these indexes.
I intend to look through column 'A' to find entries equal to 'a' and update their column 'B' with the word 'okay'.
import numpy as np
import pandas as pd

group = ['a']
df = pd.DataFrame({"A": ['a', 'b', 'a', 'a', 'c'],
                   "B": [np.nan, np.nan, np.nan, np.nan, np.nan]})
>>>df
A B
0 a NaN
1 b NaN
2 a NaN
3 a NaN
4 c NaN
df[df['A'].apply(lambda x: x in group)]['B'].fillna('okay', inplace=True)
This gives me the following error:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self._update_inplace(new_data)
Following the documentation (what I understood of it) I tried the following instead:
df[df['A'].apply(lambda x: x in group)].loc[:,'B'].fillna('okay', inplace=True)
I can't figure out why the reassignment of NaN to 'okay' is not occurring in place, and how this can be rectified.
Thank you.
Try this with lambda:
Solution first:
>>> df
A B
0 a NaN
1 b NaN
2 a NaN
3 a NaN
4 c NaN
Using lambda with map or apply (use np.nan, not the string "NaN", for the misses):
>>> df["B"] = df["A"].map(lambda x: "okay" if x == "a" else np.nan)
or equivalently:
>>> df["B"] = df["A"].apply(lambda x: "okay" if x == "a" else np.nan)
>>> df
A B
0 a okay
1 b NaN
2 a okay
3 a okay
4 c NaN
Solution second:
Another fancy way is to create a dictionary and apply it across the column using map:
>>> frame = {'a': "okay"}
>>> df['B'] = df['A'].map(frame)
>>> df
A B
0 a okay
1 b NaN
2 a okay
3 a okay
4 c NaN
Solution third:
This has already been posted by @d_kennetz, but just to club it together here: you can do the row selection on column A and the assignment to column B in one shot:
>>> df.loc[df.A == 'a', 'B'] = "okay"
If I understand this correctly, you simply want to replace the value of a column on those rows matching a given condition (i.e. where column A belongs to a certain group, here with the single value 'a'). The following should do the trick:
import pandas as pd
group = ['a']
df = pd.DataFrame({"A": ['a','b','a','a','c'], "B": [None,None,None,None,None]})
print(df)
df.loc[df['A'].isin(group),'B'] = 'okay'
print(df)
What we're doing here is using the .loc indexer, which selects and assigns directly on the existing dataframe.
The first argument (df['A'].isin(group)) filters on the rows matching the given criterion. Notice you can use the equality operator (==) but not the in operator, so for membership you have to use .isin() instead.
The second argument selects only the 'B' column.
Then you just assign the desired value (here a constant).
Here's the output:
A B
0 a None
1 b None
2 a None
3 a None
4 c None
A B
0 a okay
1 b None
2 a okay
3 a okay
4 c None
If you wanted to do fancier stuff, you might do the following:
import pandas as pd
group = ['a', 'b']
df = pd.DataFrame({"A": ['a','b','a','a','c'], "B": [None,None,None,None,None]})
df.loc[df['A'].isin(group),'B'] = "okay, it was " + df['A']+df['A']
print(df)
Which gives you:
A B
0 a okay, it was aa
1 b okay, it was bb
2 a okay, it was aa
3 a okay, it was aa
4 c None
I have a pandas dataframe of shape ~[200K, 40]. The dataframe has a categorical column (one of many) with over 1000 unique values. I can view the value counts of such a column using:
df['column_name'].value_counts()
How do I now club together values with:
a value count less than some threshold (say, 100), mapping them to, say, "miscellaneous"?
Or, alternatively, based on the cumulative row-count percentage?
You can extract the values you want to mask from the index of value_counts and then map them to "miscellaneous" using replace:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0, 10, (2000, 2)), columns=['A', 'B'])
frequencies = df['A'].value_counts()
condition = frequencies<200 # you can define it however you want
mask_obs = frequencies[condition].index
mask_dict = dict.fromkeys(mask_obs, 'miscellaneous')
df['A'] = df['A'].replace(mask_dict) # or you could make a copy not to modify original data
Now, using value_counts will group all the values below your threshold as miscellaneous:
df['A'].value_counts()
Out[18]:
miscellaneous 947
3 226
1 221
0 204
7 201
2 201
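For the cumulative-percentage variant the question also asks about, a sketch along the same lines can work from normalized counts; the 0.90 cutoff here is an assumption, and it should be applied to the original, unreplaced column:
frequencies = df['A'].value_counts(normalize=True)     # sorted descending, as fractions
tail = frequencies.index[frequencies.cumsum() > 0.90]  # values beyond 90% row coverage
df['A'] = df['A'].replace(dict.fromkeys(tail, 'miscellaneous'))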
I think you need:
df = pd.DataFrame({ 'A': ['a','a','a','a','b','b','b','c','d']})
s = df['A'].value_counts()
print (s)
a 4
b 3
d 1
c 1
Name: A, dtype: int64
If you need to sum all the values below the threshold:
threshold = 2
m = s < threshold
# keep the values at or above the threshold
out = s[~m]
# sum the values below the threshold and append them as a new entry
out['misc'] = s[m].sum()
print (out)
a 4
b 3
misc 2
Name: A, dtype: int64
But if you need to rename the index values below the threshold:
out = s.rename(dict.fromkeys(s.index[s < threshold], 'misc'))
print (out)
a 4
b 3
misc 1
misc 1
Name: A, dtype: int64
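Note the rename keeps two separate misc rows; if they should be collapsed into one, a groupby sum over the index does it:
out.groupby(level=0).sum()
# a       4
# b       3
# misc    2
# Name: A, dtype: int64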
If you need to replace the original column, use GroupBy.transform with numpy.where:
df['A'] = np.where(df.groupby('A')['A'].transform('size') < threshold, 'misc', df['A'])
print (df)
A
0 a
1 a
2 a
3 a
4 b
5 b
6 b
7 misc
8 misc
An alternate solution:
cond = df['col'].value_counts()
threshold = 100
df['col'] = np.where(df['col'].isin(cond.index[cond >= threshold]), df['col'], 'miscellaneous')
I have a big data frame with lots of NaNs. I want to store it as a smaller data frame that keeps only the indexes and values of the non-NaN, non-zero entries.
dff = pd.DataFrame(np.random.randn(4,3), columns=list('ABC'))
dff.iloc[0:2,0] = np.nan
dff.iloc[2,2] = np.nan
dff.iloc[1:4,1] = 0
The data frame may look like this:
A B C
0 NaN -2.268882 0.337074
1 NaN 0.000000 1.340350
2 -1.526945 0.000000 NaN
3 -1.223816 0.000000 -2.185926
I want a data frame looks like this:
0 B -2.268882
0 C 0.337074
1 C 1.340350
2 A -1.526945
3 A -1.223816
3 C -2.185926
How can I do this quickly? I have a relatively big data frame, thousands by thousands...
Many thanks!
Replace 0 with np.nan and .stack() the result (see docs).
If there's a chance that you have all-np.nan rows after .replace(), you can do .dropna(how='all') before .stack() to reduce the number of rows to pivot. If that could apply to columns too, do .dropna(how='all', axis=1).
df.replace(0, np.nan).stack()
0 B -2.268882
C 0.337074
1 C 1.340350
2 A -1.526945
3 A -1.223816
C -2.185926
Combine with .reset_index() as needed.
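A quick sketch of that, with hypothetical column names, reproduces the flat shape asked for:
out = df.replace(0, np.nan).stack().reset_index()
out.columns = ['row', 'col', 'val']  # hypothetical names
print(out)
#    row col       val
# 0    0   B -2.268882
# 1    0   C  0.337074
# 2    1   C  1.340350
# 3    2   A -1.526945
# 4    3   A -1.223816
# 5    3   C -2.185926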
To select from the resulting Series with a MultiIndex, use .loc[(level_0, level_1)]:
df.replace(0, np.nan).stack().loc[(0, 'B')]  # -2.268882
Details on slicing etc in the docs.
I've come up with a bit of an ugly way of achieving this, but hey, it works. Note that this solution has an index starting from 0 and does not preserve the row-by-row ordering shown in your question (values come out grouped by column), if that matters.
import pandas as pd
import numpy as np
dff = pd.DataFrame(np.random.randn(4,3), columns=list('ABC'))
dff.iloc[0:2,0] = np.nan
dff.iloc[2,2] = np.nan
dff.iloc[1:4,1] = 0
dff.iloc[2,1] = np.nan
# mask to do logical and for two lists
mask = lambda y,z: list(map(lambda x: x[0] and x[1], zip(y,z)))
# create new frame
new_df = pd.DataFrame()
types = []
vals = []
# iterate over columns
for col in dff.columns:
    # get the non-empty and non-zero values from the current column
    data = dff[col][mask(dff[col].notnull(), dff[col] != 0)]
    # record the corresponding original column name
    types.extend([col for x in range(len(data))])
    vals.extend(data)
# populate the dataframe
new_df['Types'] = pd.Series(types)
new_df['Vals'] = pd.Series(vals)
print(new_df)
# A B C
#0 NaN -1.167975 -1.362128
#1 NaN 0.000000 1.388611
#2 1.482621 NaN NaN
#3 -1.108279 0.000000 -1.454491
# Types Vals
#0 A 1.482621
#1 A -1.108279
#2 B -1.167975
#3 C -1.362128
#4 C 1.388611
#5 C -1.454491
I am looking forward to a more pandas/python-like answer myself!