Pandas select data between both index and value range - python

I wish to set the values of a dataframe that lie between an index range and a value range to be NaN values. For example, say I have n columns, I want for every numeric data point in these columns to be set to NaN if they meet the following conditions:
The value is between -1 and 1
The index of this value is between 1 and 3
Below is some code that tries to do what I'm describing above, and it almost works; the problem is that it sets these values on a copy of the original dataframe, and trying to use .loc instead throws the following error:
KeyError: "None of [Index([('a',), ('b',), ('c',)], dtype='object')]
are in the [columns]"
import numpy as np
import pandas as pd
np.random.seed(398)
df = pd.DataFrame(np.random.randn(5, 3), columns=['a', 'b', 'c'])
row_indexer = (df.index > 0) & (df.index < 4)
col_indexer = (df > -1) & (df < 1)
df[row_indexer][col_indexer] = np.nan
I'm sure there's a really simple solution, I just can't figure out the correct syntax.
(Additionally, I want to "extract" these filtered values (the ones I'm setting to NaN) into a second dataframe, but I'm fairly sure any solution that solves the primary question will solve this additional issue)
Any help would be appreciated

Try broadcasting with numpy:
df[row_indexer[:,None] & col_indexer] = np.nan
Output:
a b c
0 -1.810802 -0.776590 -0.495147
1 1.381038 NaN 2.334671
2 NaN -1.571401 1.011139
3 -1.200217 -1.013983 NaN
4 1.261759 0.863896 0.228914
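As for the side question about extracting the filtered values into a second dataframe: a minimal sketch, capturing them with where before overwriting (assuming the same row_indexer and col_indexer as above):
mask = row_indexer[:, None] & col_indexer
extracted = df.where(mask)  # second dataframe: only the filtered values, NaN elsewhere
df[mask] = np.nan           # then blank those values out in the original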

I would use mul, since True * True = True:
out = df.mask(col_indexer.mul(row_indexer, axis=0))
Out[81]:
a b c
0 -1.810802 -0.776590 -0.495147
1 1.381038 NaN 2.334671
2 NaN -1.571401 1.011139
3 -1.200217 -1.013983 NaN
4 1.261759 0.863896 0.228914

Related

Pandas mask with composite expression behaviour

This question was previously asked (and then deleted) by a user. I was looking for a solution so I could post an answer, but then the question disappeared. Moreover, I can't make sense of pandas' behaviour here, so I would appreciate some clarity. The original question stated something along the lines of:
How can I replace every negative value except those in a given list with NaN in a Pandas dataframe?
my setup to reproduce the scenario is the following:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'A': [x for x in range(4)],
    'B': [x for x in range(-2, 2)]
})
This should technically only be a matter of passing a correct boolean expression when masking the dataframe; my attempted solution looks like:
df[df >= 0 | df.isin([-2])]
which produces:
index  A    B
0      0  NaN
1      1  NaN
2      2    0
3      3    1
which also masks out the number in the list (-2)!
Moreover, if I mask the dataframe with each of the two conditions separately, I get the expected behavior:
with df[df >= 0] (identical to the compound result above):
index  A    B
0      0  NaN
1      1  NaN
2      2    0
3      3    1
with df[df.isin([-2])]:
index    A     B
0      NaN  -2.0
1      NaN   NaN
2      NaN   NaN
3      NaN   NaN
So it seems like either I am running into some undefined behaviour as a result of performing logic on NaN values, or I have got something wrong somewhere.
Can anyone clarify this situation for me?
Solution
df[(df >= 0) | (df.isin([-2]))]
Explanation
In Python, bitwise OR, |, has a higher operator precedence than comparison operators like >=: https://docs.python.org/3/reference/expressions.html#operator-precedence
When filtering a pandas DataFrame on multiple boolean conditions, you need to enclose each condition in parentheses. More from the boolean indexing section of the pandas user guide:
Another common operation is the use of boolean vectors to filter the
data. The operators are: | for or, & for and, and ~ for not. These
must be grouped by using parentheses, since by default Python will
evaluate an expression such as df['A'] > 2 & df['B'] < 3 as df['A'] > (2 & df['B']) < 3, while the desired evaluation order is (df['A'] > 2) & (df['B'] < 3).
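To make the precedence point concrete, here is a short sketch using the df from the question (the printed output shown in the comments is approximate):
# `|` binds tighter than `>=`, so the unparenthesized expression is parsed as
#     df >= (0 | df.isin([-2]))
# i.e. df is compared against a mask, rather than OR-ing two boolean frames.
# With explicit parentheses, each comparison is evaluated first:
print(df[(df >= 0) | (df.isin([-2]))])
#    A    B
# 0  0 -2.0
# 1  1  NaN
# 2  2  0.0
# 3  3  1.0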

Add new column with column names of a table, based on conditions [duplicate]

I have a dataframe as below:
I want to get, for each row, the name of the column that contains 1 in that row.
Use DataFrame.dot:
df1 = df.dot(df.columns)
If there are multiple 1s per row:
df2 = df.dot(df.columns + ';').str.rstrip(';')
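As a rough illustration of how the dot trick works (the question's dataframe isn't reproduced here, so this uses the 0/1 sample frame from the answer further down; output in the comments is approximate):
import pandas as pd

# 0/1 indicator frame (same sample data as the answer below)
data = {'foo': [0, 0, 0, 0], 'bar': [0, 1, 0, 0], 'baz': [0, 0, 0, 0], 'spam': [0, 1, 0, 1]}
df = pd.DataFrame(data, index=['a', 'b', 'c', 'd'])

# multiplying the 0/1 values by the column names keeps the names where the value is 1
df1 = df.dot(df.columns)

# if a row can contain several 1s, join with ';' and strip the trailing separator
df2 = df.dot(df.columns + ';').str.rstrip(';')
print(df2)
# a
# b    bar;spam
# c
# d        spam
# dtype: object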
Firstly
Your question is very ambiguous and I recommend reading the link in @sammywemmy's comment. If I understand your problem correctly... we'll talk about this mask first:
df.columns[
    (df == 1)      # boolean mask of the whole frame
    .any(axis=0)   # True for every column that contains a 1
]
What's happening? Let's work our way outward, starting from within df.columns[**HERE**]:
(df == 1) makes a boolean mask of the df with True/False (1/0)
.any() as per the docs:
"Returns False unless there is at least one element within a series or along a Dataframe axis that is True or equivalent".
This gives us a handy Series to mask the column names with.
We will use this mask to automate your solution below.
Next:
Automate to get an output of (<row index>, [<col name>, <col name>, ...]) wherever there is a 1 in the row values. Although this will be slower on large datasets, it should do the trick:
import pandas as pd
data = {'foo':[0,0,0,0], 'bar':[0, 1, 0, 0], 'baz':[0,0,0,0], 'spam':[0,1,0,1]}
df = pd.DataFrame(data, index=['a','b','c','d'])
print(df)
foo bar baz spam
a 0 0 0 0
b 0 1 0 1
c 0 0 0 0
d 0 0 0 1
# group df by index and create a dict mapping each index label to its own sub-DataFrame
df_dict = dict(
    list(
        df.groupby(df.index)
    )
)
The next step is a for loop that iterates over each df in df_dict, checks it against the mask we created earlier, and prints the intended results:
for k, v in df_dict.items():  # k: index label, v: its sub-DataFrame
    check = v.columns[(v == 1).any()]
    if len(check) > 0:
        print((k, check.to_list()))
('b', ['bar', 'spam'])
('d', ['spam'])
Side note:
You see how I generated sample data that can be easily reproduced? In the future, please try to ask questions with posted sample data that can be reproduced. This way it helps you understand your problem better and it is easier for us to answer it for you.
Getting the column names can be split into two cases.
First case: if you want the result in a new column, the condition should match exactly one column per row, because this approach gives only one column name per row.
import numpy as np
import pandas as pd

data = {'foo': [0, 0, 3, 0], 'bar': [0, 5, 0, 0], 'baz': [0, 0, 2, 0], 'spam': [0, 1, 0, 1]}
df = pd.DataFrame(data)
df = df.replace(0, np.nan)
df
foo bar baz spam
0 NaN NaN NaN NaN
1 NaN 5.0 NaN 1.0
2 3.0 NaN 2.0 NaN
3 NaN NaN NaN 1.0
If you are looking for the min or the max:
max_col = df.idxmax(axis=1)
min_col = df.idxmin(axis=1)
out = df.assign(max=max_col, min=min_col)
out
foo bar baz spam max min
0 NaN NaN NaN NaN NaN NaN
1 NaN 5.0 NaN 1.0 bar spam
2 3.0 NaN 2.0 NaN foo baz
3 NaN NaN NaN 1.0 spam spam
Second case: if your condition is satisfied in multiple columns, for example you are looking for all columns that contain 1, then you need a list, because the result cannot fit into a single column of the same dataframe.
str_con = df.astype(str).apply(lambda x: x.str.contains('1.0', case=False, na=False)).any()
df.columns[str_con]
# output
Index(['spam'], dtype='object')  # only spam contains 1
Or, if you are looking for a numerical condition, for example columns containing a value greater than 1:
num_con = df.apply(lambda x: x > 1.0).any()
df.columns[num_con]
# output
Index(['foo', 'bar', 'baz'], dtype='object')  # these columns have a value greater than 1
Happy learning

pandas grabbing value to fill in new columns

I am trying to get new columns B and C with a condition: a value counts as "positive" (goes into column B) if the 'A' of one day is bigger than the 'A' of the day before; otherwise it counts as "negative" and goes into column C.
Here is an example of what I am trying to get:
         A       B       C
0.  167765
1.  235353  235353
2.   89260           89260
3.  188382  188382
4.  104677          104677
5.  207723  207723
I notice that this will cause an index error, because the number of values in columns B and C will differ from the length of the original column A.
Currently, I am trying the following to test moving specific data to column B, and it causes a "Length of values does not match length of index" error:
df['B'] = np.where(df['A'] <= 250000)
How do I accomplish the desired output, where the first row is NA or empty?
desired output:
         B       C
0.
1.  235353
2.           89260
3.  188382
4.          104677
5.  207723
I'm not able to understand how you got to your final result by the method you're describing.
In my understanding, a value should be placed in column B if it is greater than the value of the day before; otherwise it goes into column C.
You may need to correct me or adapt this answer if you meant differently.
The trick is to use .where on a pandas Series object, which inserts the NaNs automatically.
df = pd.DataFrame({'A': [167765, 235353, 89260, 188382, 104677, 207723]})
diffs = df['A'].diff()
df['B'] = df['A'].where(diffs >= 0)
df['C'] = df['A'].where(diffs < 0)
diffs is going to be the following Series, which also comes with a handy NaN in the first row:
0 NaN
1 67588.0
2 -146093.0
3 99122.0
4 -83705.0
5 103046.0
Name: A, dtype: float64
Comparing with NaN always returns False. Therefore we can omit the first row by testing for the positive and the negative case separately.
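A quick sketch of that NaN behaviour:
import numpy as np

# NaN fails both comparisons, so the first row lands in neither B nor C
print(np.nan >= 0)  # False
print(np.nan < 0)   # False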
The resulting table looks like this
A B C
0 167765 NaN NaN
1 235353 235353.0 NaN
2 89260 NaN 89260.0
3 188382 188382.0 NaN
4 104677 NaN 104677.0
5 207723 207723.0 NaN
You can try giving explicit lists of index labels:
df['B'] = np.where(df.index.isin([1, 2, 3]), df['A'], np.nan)
df['C'] = np.where(df.index.isin([4, 5]), df['A'], np.nan)
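If you would rather not hardcode the index lists, the same np.where pattern can be driven by the diff condition from the previous answer; a small sketch combining the two:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [167765, 235353, 89260, 188382, 104677, 207723]})
diffs = df['A'].diff()  # NaN in the first row, so it matches neither condition

df['B'] = np.where(diffs >= 0, df['A'], np.nan)
df['C'] = np.where(diffs < 0, df['A'], np.nan)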

How to "partially transpose" dataframe in Pandas?

I have csv file like this:
A,B,C,X
a,a,a,1.0
a,a,a,2.1
a,b,b,1.2
a,b,b,2.4
a,b,b,3.6
b,c,c,1.1
b,c,d,1.0
(A, B, C) is a "primary key" in this dataset, that means this set of columns should be unique. What I need to do is to find duplicates and present associated values (X column) in separate columns, like this:
A,B,C,X1,X2,X3
a,a,a,1.0,2.1,
a,b,b,1.2,2.4,3.6
I somehow know how to find duplicates and aggregate X values into tuples:
df = data.groupby(['A', 'B', 'C']).filter(lambda group: len(group) > 1).groupby(['A', 'B', 'C']).aggregate(tuple)
This is basically what I need, but I struggle with transforming it further.
I don't know how many duplicates a given key can have in my data, so I need to find the maximum and compute the column names:
df['items'] = df['X'].apply(lambda x: len(x))
columns = [f'x_{i}' for i in range(1, df['items'].max() + 1)]
and then create new dataframe with new columns:
df2 = pd.DataFrame(df['X'].tolist(), columns=columns)
But at this point I have lost the index :shrug:
This page in the pandas docs suggests I should use something like this:
df.pivot(columns=columns, values=['X'])
because df already contains an index, but I get this (confusing) error:
KeyError: "None of [Index(['x_1', 'x_2'], dtype='object')] are in the [columns]"
What am I missing here?
I originally marked this as a duplicate of the infamous pivot question, but since this is a bit different, here's an answer:
(df.assign(col=df.groupby(['A', 'B', 'C']).cumcount().add(1))
   .pivot_table(index=['A', 'B', 'C'], columns='col', values='X')
   .add_prefix('X')
   .reset_index()
)
Output:
col A B C X1 X2 X3
0 a a a 1.0 2.1 NaN
1 a b b 1.2 2.4 3.6
2 b c c 1.1 NaN NaN
3 b c d 1.0 NaN NaN
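To see why the cumcount helper is needed, here is roughly what the intermediate frame looks like before pivot_table, assuming df holds the sample csv from the question:
# number the duplicates within each (A, B, C) group, starting at 1
tmp = df.assign(col=df.groupby(['A', 'B', 'C']).cumcount().add(1))
print(tmp)
#    A  B  C    X  col
# 0  a  a  a  1.0    1
# 1  a  a  a  2.1    2
# 2  a  b  b  1.2    1
# 3  a  b  b  2.4    2
# 4  a  b  b  3.6    3
# 5  b  c  c  1.1    1
# 6  b  c  d  1.0    1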
Note: this only differs from the linked question/answer in that you group by/pivot on a set of columns instead of a single column.

How to select all elements greater than a given value in a dataframe

I have a csv that is read by my python code and a dataframe is created using pandas.
The CSV file is in the following format:
1 1.0
2 99.0
3 20.0
7 63
My code calculates the percentile and then needs to find all rows whose value in the 2nd column is greater than 60.
df = pd.read_csv(io.BytesIO(body), error_bad_lines=False, header=None, encoding='latin1', sep=',')
percentile = df.iloc[:, 1:2].quantile(0.99) # Selecting 2nd column and calculating percentile
criteria = df[df.iloc[:, 1:2] >= 60.0]
While my percentile code works fine, the criteria to find all rows whose 2nd-column value is greater than 60 returns:
NaN NaN
NaN NaN
NaN NaN
NaN NaN
Can you please help me find the error?
Just correct the condition inside criteria. Since the second column has positional index 1, you should write df.iloc[:, 1].
Example:
import pandas as pd
import numpy as np
b = np.array([[1, 2, 3, 7], [1, 99, 20, 63]])
df = pd.DataFrame(b.T)  # just creating the dataframe
criteria = df[df.iloc[:, 1] >= 60]
print(criteria)
Why?
It seems the cause lies in the type produced by the condition. Let's inspect.
Case 1:
type(df.iloc[:, 1] >= 60)
Returns pandas.core.series.Series, so it gives
df[df.iloc[:, 1] >= 60]
#out:
   0   1
1  2  99
3  7  63
Case 2:
type(df.iloc[:, 1:2] >= 60)
Returns a pandas.core.frame.DataFrame, and gives
df[df.iloc[:, 1:2] >= 60]
#out:
     0     1
0  NaN   NaN
1  NaN  99.0
2  NaN   NaN
3  NaN  63.0
Therefore I think it changes the way the index is processed.
Always keep in mind that 3 is a scalar, while 3:4 is a slice (it selects an array of columns).
For more info, it is always good to take a look at the official docs on pandas indexing.
Your indexing is a bit off: you only have two columns, [0, 1], and you are interested in selecting just the one with index 1. As @applesoup mentioned, the following is enough:
criteria = df[df.iloc[:, 1] >= 60.0]
However, I would consider naming the columns and referencing them by name. This will help you avoid mistakes in case your df structure changes, e.g.:
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3, 7], 'b': [1.0, 99.0, 20.0, 63.]})
criteria = df[df['b'] >= 60.0]
People here seem to be more interested in coming up with alternative solutions instead of digging into his code in order to find out what's really wrong. I will adopt a diametrically opposed strategy!
The problem with your code is that you are indexing your DataFrame df by another DataFrame. Why? Because you use slices instead of integer indexing.
df.iloc[:, 1:2] >= 60.0    # returns a DataFrame with one boolean column
df.iloc[:, 1] >= 60.0      # returns a Series
df.iloc[:, [1]] >= 60.0    # returns a DataFrame with one boolean column
So correct your code by using:
criteria = df[df.iloc[:, 1] >= 60.0]  # don't slice!
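To make the Series-versus-DataFrame distinction concrete, here is a small sketch with named columns (the column names are purely illustrative):
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 7], 'b': [1.0, 99.0, 20.0, 63.0]})

# boolean Series -> row filter: only the matching rows are kept
print(df[df['b'] >= 60.0])

# boolean DataFrame -> element-wise mask: the shape is kept, non-matching cells become NaN
print(df[df[['b']] >= 60.0])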
