I ran into some unexpected behaviour, at least for me, while working with a pandas DataFrame. I use a 2D comparison to check whether the values are smaller than a threshold. If I check the entire DataFrame at once, the check is wrong for some values, but if I check the affected cell explicitly, the result is correct.
print(df.loc[5,397])
out: 14.4 # --> it's actually 14.3999996185
print(df.loc[5,397] < 14.4)
out: True
print(df.loc[4:6,396:398] < 14.4)
out:
396 397 398
4 False False False
5 False False False #[5,397] should be True!
6 False False False
However, when I try to reproduce the error, I get the correct result?!
import numpy
import pandas as pd

data = numpy.array([[15, 15, 15], [15, 14.3999996185, 15], [15, 15, 15]])
df = pd.DataFrame(data)
print(df.loc[1,1])
out: 14.3999996185
print(df.loc[1,1] < 14.4)
out: True
print(df < 14.4)
out:
0 1 2
0 False False False
1 False True False
2 False False False
Thank you
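A quick way to see the stored value with more digits than the default output, assuming df is the original DataFrame from the question:
value = df.loc[5, 397]
print(f"{value:.10f}")   # shows 14.3999996185 rather than the rounded 14.4
print(value.dtype)       # the element dtype affects how the value prints and compares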
I have a DataFrame which contains two columns.
One column shows distance; the other contains 'trackId' values, where each trackId is associated with a set of distances.
Example:
trackId  distance
2        17.452
2        8.650
2        10.392
2        11.667
2        23.551
2        9.881
3        6.052
3        7.241
3        8.459
3        22.644
3        126.890
3        12.442
3        5.891
4        44.781
4        7.657
4        36.781
4        224.001
What I am trying to do is eliminate any trackIds that contain a large spike in distance -- a spike that is > 75.
In this example, trackIds 3 and 4 (and all their associated distances) would be removed from the DataFrame because they contain spikes in distance greater than 75. We would be left with a DataFrame containing only trackId 2 and all of its associated distance values.
Here is my code:
i = 0
k = 1
length = len(dataframe)
while i < length:
    if (dataframe.distance[k] - dataframe.distance[i]) > 75:
        bad_id = dataframe.trackId[k]
        condition = dataframe.trackid != bad_id
        df2 = dataframe[condition]
    i += 1
I tried to use a while loop that goes through all the different trackIds, subtracts the distance values, and checks whether the result is > 75. If it is, the program assigns that trackId to the variable 'bad_id' and uses it as a condition to filter the DataFrame so it only contains trackIds that are not equal to the bad_id(s).
I just keep getting NameErrors because I'm unsure how to structure the loop properly, and in general I'm not sure whether this approach works at all.
We can use diff to compute the difference between consecutive rows, then use groupby transform to check whether any difference within a group is greater than 75. Then we keep the groups where there are no matches:
m = ~(df['distance'].diff().gt(75).groupby(df['trackId']).transform('any'))
filtered_df = df.loc[m, df.columns]
filtered_df:
trackId distance
0 2.0 17.452
1 2.0 8.650
2 2.0 10.392
3 2.0 11.667
4 2.0 23.551
5 2.0 9.881
Breakdown of steps as a DataFrame:
breakdown = pd.DataFrame({'diff': df['distance'].diff()})
breakdown['gt 75'] = breakdown['diff'].gt(75)
breakdown['groupby any'] = (
    breakdown['gt 75'].groupby(df['trackId']).transform('any')
)
breakdown['negation'] = ~breakdown['groupby any']
print(breakdown)
breakdown:
diff gt 75 groupby any negation
0 NaN False False True
1 -8.802 False False True
2 1.742 False False True
3 1.275 False False True
4 11.884 False False True
5 -13.670 False False True
6 -3.829 False True False
7 1.189 False True False
8 1.218 False True False
9 14.185 False True False
10 104.246 True True False # Spike of more than 75
11 -114.448 False True False
12 -6.551 False True False
13 38.890 False True False
14 -37.124 False True False
15 29.124 False True False
16 187.220 True True False # Spike of more than 75
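Note that diff runs over the whole distance column, so the first row of each trackId is differenced against the last row of the previous trackId. With this data that boundary difference never exceeds 75, but computing the differences per group avoids the issue entirely; a minimal sketch of that variant:
# Variant sketch: difference the distances within each trackId group, so the first
# distance of a track is never compared against the last distance of the previous track.
within = df.groupby('trackId')['distance'].diff()            # NaN at the start of each group
m = ~within.gt(75).groupby(df['trackId']).transform('any')   # keep groups with no spike > 75
filtered_df = df[m]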
So I'm currently finishing a tutorial with the titanic dataset (https://www.kaggle.com/c/titanic/data).
Now I'm trying a couple of new things that might be related.
The output of train_df.info() shows that there are 891 entries and that several columns contain NaN values.
When I went to get a quick summary of the missing values, I got confused by .sum() and .count():
With train_df.isnull().sum(), .sum() increments by one for each null value, so the output is the number of missing entries for each column in the DataFrame (which is what I want).
However, if we use .count() we get 891 for each column, no matter whether we call .isnull().count() or .notnull().count().
So my questions are:
What does .count() mean in this context?
I thought it would count every instance produced by the preceding method (in this case, every null or not-null entry; basically what .sum() did).
Also, is my description of how .sum() is being used correct?
Just print out train_df.isnull() and you will see it.
# data analysis and wrangling
import pandas as pd
import numpy as np
# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
train_df = pd.read_csv('train.csv')
print(train_df.isnull())
result:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket \
0 False False False False False False False False False
1 False False False False False False False False False
2 False False False False False False False False False
3 False False False False False False False False False
4 False False False False False False False False False
.. ... ... ... ... ... ... ... ... ...
886 False False False False False False False False False
887 False False False False False False False False False
888 False False False False False True False False False
889 False False False False False False False False False
890 False False False False False False False False False
It has 891 rows, full of Trues and Falses.
When you use sum(), it returns the sum of every column, adding True (= 1) and False (= 0) together. Like this:
print(False+False+True+True)
2
When you use count(), it just returns the number of rows.
Of course you will get 891 for each column, no matter whether you use .isnull().count() or .notnull().count().
data.isnull().count() returns the total number of rows irrespective of missing values. You need to use data.isnull().sum().
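A minimal sketch of the difference, using a small made-up frame instead of the Titanic data:
import numpy as np
import pandas as pd

demo = pd.DataFrame({'Age': [22.0, np.nan, 35.0, np.nan],
                     'Fare': [7.25, 71.28, 8.05, 53.10]})

print(demo.isnull().sum())    # Age 2, Fare 0 -> missing values per column
print(demo.isnull().count())  # Age 4, Fare 4 -> total rows, missing or not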
Columns L, M, and N of my DataFrame are populated with 'True' and 'False' values (1000 rows). I would like to create a new column, 'count_false', that returns the number of times 'False' occurs across columns L, M, and N.
Any tips appreciated!
Thank you.
You can negate your dataframe and sum over axis=1:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 2, (5, 3)), columns=list('LMN')).astype(bool)
df['Falses'] = (~df).sum(1)
print(df)
L M N Falses
0 True False True 1
1 True False False 2
2 True True True 0
3 False True False 2
4 False False True 2
If you have additional columns, you can filter accordingly:
df['Falses'] = (~df[list('LMN')]).sum(1)
Try this: df[df == False].count(axis=1) — count() skips the NaN cells left where the condition is not met, so with axis=1 it gives the number of False values per row.
As explained in this Stack Overflow question, True and False are equal to 1 and 0 in Python, so something like line three of the following example should solve your problem:
import pandas as pd
df = pd.DataFrame([[True, False, True],[False, False, False],[True, False, True],[False, False, True]], columns=['L','M','N'])
df['count_false'] = 3 - (df['L']*1 + df['M']*1 + df['N']*1)
I am getting wrong results when I do an element wise comparison on a numpy array of floats.
For eg:
import numpy as np
a = np.arange(4, 5 + 0.025, 0.025)
print(a)
mask = a == 5.0
print(mask)
na = a[mask]
print(na)
When I run the above code, a == 5.0 doesn't give me True at the index where the value is in fact 5.0. I also tried setting the dtype of the array to numpy.double, thinking it could be a floating point precision issue, but it still returns the wrong result.
I am pretty sure I am missing something here... can anyone point me in the right direction or tell me what's wrong with the code above?
Thanks!
There is imprecision here when using float types; use np.isclose to compare an array against a scalar float value:
In [50]:
mask = np.isclose(a,5.0)
print(mask)
na = a[mask]
na
[False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False True False]
Out[50]:
array([ 5.])
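For reference, np.isclose compares with a relative and an absolute tolerance (rtol=1e-05 and atol=1e-08 by default), which can be tightened or loosened depending on how close should count as equal; a small sketch:
import numpy as np

a = np.arange(4, 5 + 0.025, 0.025)

# The default tolerances easily absorb this kind of accumulated rounding error
print(a[np.isclose(a, 5.0)])

# Purely absolute tolerance, tight enough that neighbouring grid points cannot match
print(a[np.isclose(a, 5.0, rtol=0, atol=1e-9)])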
I have these 2 data frames:
df_test
dimension1_id dimension2_id dimension3_id dimension4_id dimension5_id \
0 -1 -1 -1 -1 -1
1 1177314888 238198786 5770904146 133207291 Exact
2 1177314888 238198786 5770904266 18395155770 Exact
3 1177314888 238198786 5770904266 19338210057 Exact
4 1177314888 238198786 5770904266 30907903234 Exact
and
df_merge
dimension1_id dimension2_id dimension3_id dimension4_id dimension5_id \
0 -1 -1 -1 -1 -1
1 1177314888 238198786 5770904146 133207291 Exact
I want to remove everything that is inside df_merge from df_test, based on the combinations of dimension1_id, dimension2_id, dimension3_id, dimension4_id and dimension5_id.
This is my code:
df_test = df_test[
    (df_test['dimension5_id'].isin(df_merge.dimension5_id) == False) &
    (df_test['dimension4_id'].isin(df_merge.dimension4_id) == False) &
    (df_test['dimension3_id'].isin(df_merge.dimension3_id) == False) &
    (df_test['dimension2_id'].isin(df_merge.dimension2_id) == False) &
    (df_test['dimension1_id'].isin(df_merge.dimension1_id) == False)
]
But this code returns an empty data frame. How can I remove just the first and second rows from df_test?
You can use logical indexing to mask the rows you want by applying a direct comparison. In this case, you can check for values in df_test which are in df_merge:
df_test.isin(df_merge)
The resulting logical index acts as a mask:
   dimension1_id dimension2_id dimension3_id dimension4_id dimension5_id
0           True          True          True          True          True
1           True          True          True          True          True
2          False         False         False         False         False
3          False         False         False         False         False
4          False         False         False         False         False
True values mark the matching cells, so we can reduce the mask row-wise with all(axis=1) and negate it with ~ to return only the rows of df_test that are not in df_merge:
df_test[~df_test.isin(df_merge).all(axis=1)]
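If the rows should be matched on the full combination of the five dimension columns rather than per-column membership, a merge with indicator=True can also act as an anti-join; a sketch, assuming those five columns are the only keys that matter:
keys = ['dimension1_id', 'dimension2_id', 'dimension3_id',
        'dimension4_id', 'dimension5_id']

# Left-join df_test against the key combinations present in df_merge and keep
# only the rows that found no match ('left_only').
merged = df_test.merge(df_merge[keys].drop_duplicates(), on=keys,
                       how='left', indicator=True)
result = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')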