Setting numbers outside of range as null [duplicate] - python

This question already has answers here:
Python Pandas replace values if not in value range
(4 answers)
Closed last year.
I am working with a pandas DataFrame and I am trying to set all the numbers that are outside of a range to null, but I'm having trouble:
df['Numbers'] = df['Numbers'].mask((df['Numbers']< -10) & (df['Numbers']> 10), inplace=True)
I want to keep the numbers between -10 and 10; if a number is outside those bounds, it should be set to null.
What am I doing wrong here?

One thing that immediately stands out is that you're combining your two conditions with &, so you're effectively asking for numbers that are both less than -10 and greater than 10 at the same time... which is never gonna match anything ;) (There's a second problem too: mask(..., inplace=True) returns None, so assigning its result back fills the column with None.)
I'd rewrite your code like this:
df.loc[df['Numbers'].lt(-10) | df['Numbers'].gt(10), 'Numbers'] = np.nan

I would do it like this:
df['Numbers'] = df['Numbers'].where((df['Numbers']>-10) & (df['Numbers']<10))
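To make both answers concrete, here is a minimal, self-contained sketch (the sample values are made up for illustration; only the column name Numbers comes from the question):

import numpy as np
import pandas as pd

df = pd.DataFrame({'Numbers': [-25, -10, 0, 7, 10, 42]})

# Option 1 (.loc): null out everything strictly outside [-10, 10]
df1 = df.copy()
df1.loc[df1['Numbers'].lt(-10) | df1['Numbers'].gt(10), 'Numbers'] = np.nan
print(df1['Numbers'].tolist())  # [nan, -10.0, 0.0, 7.0, 10.0, nan]

# Option 2 (.where): keep values where the condition holds, NaN elsewhere
# (strict inequalities here also null out the endpoints -10 and 10)
df2 = df.copy()
df2['Numbers'] = df2['Numbers'].where((df2['Numbers'] > -10) & (df2['Numbers'] < 10))
print(df2['Numbers'].tolist())  # [nan, nan, 0.0, 7.0, nan, nan]

Your original mask approach also works once the & becomes | and you assign the result instead of using inplace=True: df['Numbers'] = df['Numbers'].mask((df['Numbers'] < -10) | (df['Numbers'] > 10)).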


Python pandas: issues when subsetting Series using .values attribute [duplicate]

This question already has answers here:
python numpy arange unexpected results
(5 answers)
Closed 2 years ago.
I'm having an issue with a pandas Series: I've created an array with some values in it. For testing purposes I was trying to make sure certain values are present in the Series, so I'm subsetting it as below:
A = np.arange(start=-10, stop=10, step=0.1)
Aseries = pd.Series(A)
Aseries[Aseries.values == 9]
and this returns an empty Series. But if I just change the step (from 0.1 to 1), it works... I've double-checked that the Series actually contains the value I'm looking for (for both step values...).
Here's the code for when I change the step (With the output as proof)
#Generating an array containing 200 values from -10 to 10 with a step of 0.1
A = np.arange(start=-10, stop=10, step=0.1)
Aseries = pd.Series(A)
Aseries[Aseries.values == 9]
#Generating an array containing 20 values from -10 to 10 with a step of 1
B = np.arange(start=-10, stop=10, step=1)
Bseries = pd.Series(B)
print("'Aseries' having the value 9:")
print(Aseries[Aseries.values == 9])
print("'Bseries' having the value 9:")
print(Bseries[Bseries.values == 9])
output:
'Aseries' having the value 9:
Series([], dtype: float64)
'Bseries' having the value 9:
19 9
dtype: int32
Any idea what's going on here? Thanks in advance!
[EDIT]: for some reason I can't add any other post to this thread, so I'll add the solution I found here:
As #Quang Hoang and #Kim Rop explained, this is caused by the non-integer step value, which doesn't really produce what it's supposed to. So after:
Aseries = pd.Series(A)
I simply added a rounding instruction to keep only one decimal for the values in the array, and adapted my subsetting operation to a range check like this:
Aseries[(Aseries.values > 8.9) & (Aseries.values < 9.1)]
I'm not having the issue anymore... Thanks #Quang Hoang and #Kim Rop
According to the official documentation:
When using a non-integer step, such as 0.1, the results will often not
be consistent. It is better to use numpy.linspace for these cases.
And this is also partially because of floating point precision.
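As a concrete illustration of the floating-point issue (the exact stored value may vary by platform, but the pattern is typical): with a step of 0.1, the element that "should" be 9.0 is stored as something like 8.99999999999998, so an exact == 9 comparison finds nothing. A sketch of the robust fixes:

import numpy as np
import pandas as pd

A = np.arange(start=-10, stop=10, step=0.1)
Aseries = pd.Series(A)

# Exact comparison fails: no element is exactly 9.0
print(Aseries[Aseries.values == 9])            # empty Series

# Robust fix: compare with a tolerance instead of exact equality
print(Aseries[np.isclose(Aseries.values, 9)])  # matches the value near 9.0

# Alternative: build the grid with linspace, as the documentation suggests
B = pd.Series(np.linspace(-10, 10, 201))       # 201 evenly spaced points
print(B[np.isclose(B.values, 9)])              # tolerance check is still safest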

how to get one number from pandas sum / is in function [duplicate]

This question already has an answer here:
Count occurrences of certain string in entire pandas dataframe
(1 answer)
Closed 2 years ago.
Suppose I want to find the number of occurrences of something in a pandas dataframe as one number.
If I do df.isin(["ABC"]).sum() it gives me a table of all occurrences of "ABC" under each column.
What do I do if I want just one number which is the number of "ABC" entries under column 1?
Moreover, is there code to find entries that have both "ABC" under, say, column 1 and "DEF" under column 2? Even this should just be a single number of entries/rows that have both.
You can check with groupby + size
out = df.groupby(['col1', 'col2']).size()
print(out.loc[('ABC','DEF')])
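For instance, with a small made-up frame (col1 and col2 stand in for the question's columns):

import pandas as pd

df = pd.DataFrame({'col1': ['ABC', 'ABC', 'XYZ'],
                   'col2': ['DEF', 'GHI', 'DEF']})

out = df.groupby(['col1', 'col2']).size()
print(out.loc[('ABC', 'DEF')])  # 1 -> rows with ABC in col1 and DEF in col2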
Q1: I'm sure there are more sophisticated ways of doing this, but you can do something like:
num_occurences = data[(data['column_name'] == 'ABC')]
len(num_occurences.index)
Q2: To add in 'DEF' search, you can try
num_occurences = data[(data['column_name'] == 'ABC') & (data['column_2_name'] == 'DEF')]
len(num_occurences.index)
I know this works for quantitative values; you'll need to test it with qualitative ones.
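A note on the first part of the question: a boolean comparison sums True as 1, so you can also get the single number directly. A small sketch with the same made-up frame as above:

import pandas as pd

df = pd.DataFrame({'col1': ['ABC', 'ABC', 'XYZ'],
                   'col2': ['DEF', 'GHI', 'DEF']})

# Number of "ABC" entries in col1, as one number
print((df['col1'] == 'ABC').sum())                            # 2

# Number of rows with "ABC" in col1 AND "DEF" in col2
print(((df['col1'] == 'ABC') & (df['col2'] == 'DEF')).sum())  # 1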

Pandas - How to get sum of column with positive and negative values? [duplicate]

This question already has answers here:
converting currency with $ to numbers in Python pandas
(5 answers)
Closed 3 years ago.
I am summing a column of data using pandas that includes positive and negative values.
I first clean the data by removing the $ sign and parenthesis. Then format as a float.
How can I sum the whole column so that the negative numbers are subtracted?
Example:
$1000
($200)
$300
$1250
($100)
I want the answer to be 2250 not 2550.
Thanks in advance!
You want to identify the values and the signs:
# positive and negative
signs = np.where(s.str.startswith('('), -1, 1)
# extract the values
vals = s.str.extract(r'\$([\d.]+)')[0].astype(int)
# calculate the sum
vals.mul(signs).sum()
# 2250
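Putting it together with the sample values from the question (s is assumed to be the raw string column):

import numpy as np
import pandas as pd

s = pd.Series(['$1000', '($200)', '$300', '$1250', '($100)'])

signs = np.where(s.str.startswith('('), -1, 1)
vals = s.str.extract(r'\$([\d.]+)')[0].astype(int)
print(vals.mul(signs).sum())  # 2250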
A Pandas DataFrame object has the .sum method that takes axis as a parameter
my_dataframe['name_of_column_you_want'].sum(axis = 0) # axis=0 means down (the rows)
I don't understand your example.
import re

def clean(value):
    # parentheses indicate a negative value, e.g. "($200)" -> -200.0
    number = float(re.search(r'[\d.]+', value).group(0))
    return -number if '(' in value else number

my_dataframe['column_name'].apply(clean).sum()

if-else for multiple conditions dataframe [duplicate]

This question already has answers here:
Pandas conditional creation of a series/dataframe column
(13 answers)
Closed 3 years ago.
I don't know how to write the following idea properly:
I have a dataframe that has two columns, and many many rows.
I want to create a new column based on the data in these two columns, such that if there's 1 in one of them the value will be 1, otherwise 0.
Something like that:
if (df['col1']==1 | df['col2']==1):
df['newCol']=1
else:
df['newCol']=0
I tried to use the .loc function in different ways but I get different errors, so either I'm not using it correctly, or this is not the right solution...
Would appreciate your help. Thanks!
Simply use np.where or np.select
df['newCol'] = np.where((df['col1'] == 1) | (df['col2'] == 1), 1, 0)
OR
df['newCol'] = np.select([cond1, cond2, cond3], [choice1, choice2, choice3], default=def_value)
When a particular condition is true, np.select fills in the corresponding choice. Note the parentheses around each comparison: | binds more tightly than ==, so df['col1']==1 | df['col2']==1 does not mean what it looks like.
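As a concrete sketch of the np.select form for this exact question (two columns, one condition, made-up data):

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 0, 0, 1], 'col2': [0, 1, 0, 1]})

conditions = [(df['col1'] == 1) | (df['col2'] == 1)]
choices = [1]
df['newCol'] = np.select(conditions, choices, default=0)
print(df['newCol'].tolist())  # [1, 1, 0, 1]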
One way to solve this using .loc:
df.loc[(df['col1'] == 1) | (df['col2'] == 1), 'newCol'] = 1
df['newCol'].fillna(0, inplace=True)
In case you want newCol as a string, use
df.loc[(df['col1'] == 1) | (df['col2'] == 1), 'newCol'] = '1'
df['newCol'].fillna('0', inplace=True)
or
df['newCol'] = df['newCol'].astype(str)
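A quick check of the .loc variant on made-up data (note that newCol comes out as float, because the unmatched rows are NaN until fillna runs):

import pandas as pd

df = pd.DataFrame({'col1': [1, 0, 0], 'col2': [0, 0, 1]})
df.loc[(df['col1'] == 1) | (df['col2'] == 1), 'newCol'] = 1
df['newCol'].fillna(0, inplace=True)
print(df['newCol'].tolist())  # [1.0, 0.0, 1.0]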

Dropping outliers using standard deviation and mean formula [duplicate]

This question already has answers here:
Detect and exclude outliers in a pandas DataFrame
(19 answers)
Closed 4 years ago.
Hello everyone,
I am trying to remove the outliers from my dataset. I defined the outlier boundaries as mean-3*std and mean+3*std. Now I want to delete the values smaller than mean-3*std and the values bigger than mean+3*std. Could you help me write a formula for this? I am a beginner in Python. I already looked at similar questions, but they did not help so far.
Until now I had the following:
import pandas as pd
print(df_OmanAirTO.mean()-3*df_OmanAirTO.std(), df_OmanAirTO.mean()+3*df_OmanAirTO.std())
resulting in:
FuelFlow 2490.145718
ThrustDerateSmoothed 8.522145
CoreSpeed 93.945180
EGTHotDayMargin 9.950557
EGT 684.168701
TotalAirTemperature 11.980698
ThrustDerate -3.780215
dtype: float64
FuelFlow 4761.600157
ThrustDerateSmoothed 29.439075
CoreSpeed 101.360974
EGTHotDayMargin 90.414781
EGT 915.952163
TotalAirTemperature 43.266653
ThrustDerate 44.672861
dtype: float64
Now I want to delete the values smaller than mean-3*std and delete the values bigger than mean+3*std. How can I do this?
Thank you in advance for helping me!
I assume you want to apply the outlier conditionals on each column (i.e. in column FuelFlow, remove cells smaller than 2490.145718 and larger than 4761.600157, and in column ThrustDerateSmoothed, remove cells smaller than 8.522145 and larger than 29.439075, etc...)
I would try this (note the comparisons select the values to keep, i.e. those inside the bounds):
filt_outliers_df_oman = df_OmanAirTO.apply(
    lambda x: x[(x > df_OmanAirTO[x.name].mean() - 3 * df_OmanAirTO[x.name].std()) &
                (x < df_OmanAirTO[x.name].mean() + 3 * df_OmanAirTO[x.name].std())],
    axis=0
)
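A simpler equivalent sketch, assuming all columns of the question's df_OmanAirTO are numeric: compute the bounds once and mask the whole frame, then either keep NaNs in place of outliers or drop the affected rows:

mean, std = df_OmanAirTO.mean(), df_OmanAirTO.std()
within = (df_OmanAirTO > mean - 3 * std) & (df_OmanAirTO < mean + 3 * std)

masked = df_OmanAirTO[within]                 # out-of-range cells become NaN
kept_rows = df_OmanAirTO[within.all(axis=1)]  # or keep only fully in-range rows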
