I see many questions on SO about how to filter an array (usually using pandas or numpy). The examples are very simple:
df = pd.DataFrame({ 'val': [1, 2, 3, 4, 5, 6, 7] })
a = df[df.val > 3]
Intuitively I understand the df[df.val > 3] statement, but it confuses me from a syntax point of view. In other languages, I would expect a lambda function instead of df.val > 3.
The question is: why is this style possible, and what is going on under the hood?
Update 1:
To be clearer about the confusing part: in other languages I have seen the following syntax:
df.filter(row => row.val > 3)
I understand what is going on under the hood there: for every row, the lambda function is called with the row as an argument and returns a boolean value. But here df.val > 3 doesn't make sense to me, because df.val looks like a column.
Moreover, I can write df[df > 3] and it compiles and executes successfully. That drives me crazy, because I don't understand how a DataFrame object can be compared to a number.
Create an array and dataframe from it:
In [104]: arr = np.arange(1,8); df = pd.DataFrame({'val':arr})
In [105]: arr
Out[105]: array([1, 2, 3, 4, 5, 6, 7])
In [106]: df
Out[106]:
val
0 1
1 2
2 3
3 4
4 5
5 6
6 7
numpy arrays have methods and operators that operate on the whole array. For example, you can multiply the array by a scalar, add a scalar to all elements, or, as in this case, compare each element to a scalar. That's all implemented by the class (numpy.ndarray), not by Python syntax.
In [107]: arr>3
Out[107]: array([False, False, False, True, True, True, True])
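In other words, arr > 3 is ordinary Python operator overloading: the expression is dispatched to the array's __gt__ method, which returns a new boolean array. A minimal sketch using the arr defined above:

arr.__gt__(3)       # same as arr > 3: an element-wise boolean array, not a single bool
np.greater(arr, 3)  # the equivalent numpy ufunc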
Similarly, pandas implements these methods (or uses the numpy methods on the underlying arrays). Selecting a column of the frame, with df['val'] or df.val:
In [108]: df.val
Out[108]:
0 1
1 2
2 3
3 4
4 5
5 6
6 7
Name: val, dtype: int32
This is a pandas Series (slight difference in display).
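You can verify the types involved; a quick sketch using the df defined above:

type(df.val)         # pandas.core.series.Series
type(df.val.values)  # numpy.ndarray - the underlying array the Series wraps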
It can be compared to a scalar - as with the array:
In [110]: df.val>3
Out[110]:
0 False
1 False
2 False
3 True
4 True
5 True
6 True
Name: val, dtype: bool
And the boolean array can be used to index the frame:
In [111]: df[arr>3]
Out[111]:
val
3 4
4 5
5 6
6 7
The boolean Series also works:
In [112]: df[df.val>3]
Out[112]:
val
3 4
4 5
5 6
6 7
Boolean array indexing works the same way:
In [113]: arr[arr>3]
Out[113]: array([4, 5, 6, 7])
Here I use the indexing to fetch values; setting values is analogous.
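For example, the same kind of boolean mask can appear on the left-hand side of an assignment; a small sketch using the arr and df from above:

arr[arr > 3] = 0               # zero out every element greater than 3
df.loc[df.val > 3, 'val'] = 0  # the pandas equivalent, using .loc with a boolean Series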
I have a vector dogSpecies showing all four unique dog species under investigation.
#a set of possible dog species
dogSpecies = [1,2,3,4]
I also have a data vector containing integer numbers corresponding to the records of dog species of all dogs tested.
# species of examined dogs
data = np.array([1, 1, 2, -1, 0, 2, 3, 5, 4])
Some of the records in data contain values other than 1, 2, 3 or 4 (such as -1, 0 or 5). If an element of data is not equal to any element of dogSpecies, that occurrence should be marked as False in a boolean error evaluation vector.
#initially all the elements of the boolean error evaluation vector are True.
errorEval = np.ones((np.size(data,axis = 0)),dtype=bool)
Ideally my errorEval vector would look like this:
errorEval = np.array([True, True, True, False, False, True, True, False, True])
I want a piece of code that checks whether the elements of data match any element of the dogSpecies vector. My code below for some reason marks every single element of errorEval as False.
for i in range(np.size(data, axis = 0)):
    # validation of the species
    if (data[i] != dogSpecies):
        errorEval[i] = False
I understand that I cannot compare a single element with a vector of four elements like above, but how do I do this then?
Isn't this just what you want?
for index, elem in enumerate(data):
    if elem not in dogSpecies:
        errorEval[index] = False
It's probably not very fast, since it doesn't use any vectorized numpy ufuncs, but if the array isn't very large that won't matter. Converting dogSpecies to a set will also speed things up.
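If you do want a vectorized version, numpy has a membership test that does the whole thing in one call; a sketch, assuming data and dogSpecies as defined in the question (np.in1d on older numpy versions):

errorEval = np.isin(data, dogSpecies)  # True where the value is a known species
# array([ True,  True,  True, False, False,  True,  True, False,  True])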
As an aside, your Python looks very C/Java-esque. I'd suggest reading the Python style guide (PEP 8).
If I understand correctly, you have a dataframe and a list of dog species. This should achieve what you want.
df = pd.DataFrame({'dog': [1,3,4,5,1,1,8,9,0]})
dog
0 1
1 3
2 4
3 5
4 1
5 1
6 8
7 9
8 0
df['errorEval'] = df['dog'].isin(dogSpecies).astype(int)
dog errorEval
0 1 1
1 3 1
2 4 1
3 5 0
4 1 1
5 1 1
6 8 0
7 9 0
8 0 0
df.errorEval.values
# array([1, 1, 1, 0, 1, 1, 0, 0, 0])
If you don't want to create a new column then you can do:
df.assign(errorEval=df['dog'].isin(dogSpecies).astype(int)).errorEval.values
# array([1, 1, 1, 0, 1, 1, 0, 0, 0])
As @FHTMitchel stated, you have to use in to check whether an element is in a list.
But you can use a list comprehension, which is shorter and faster than a normal loop:
errorEval = np.array([elem in dogSpecies for elem in data])
I am new to Python and Pandas, so please bear with me. I have a rather simple problem to solve, I suppose, but cannot seem to get it right.
I have a csv file that I would like to edit with a pandas dataframe. The data represents flows from home to work locations, along with the locations' ids, their coordinates in lat/lon, and a value for each flow.
id_home,name_home,lat_home,lon_home,id_work,work,lat_work,lon_work,value
1001,"Flensburg",54.78879007,9.4459971,1002,"Kiel",54.34189351,10.13048288,695
1001,"Flensburg",54.78879007,9.4459971,1003,"Lübeck, Hansestadt",53.88132436,10.72749774,106
1001,"Flensburg",54.78879007,9.4459971,1004,"Neumünster, Stadt",54.07797524,9.974475148,124
1001,"Flensburg",54.78879007,9.4459971,1051,"Dithmarschen",54.12904835,9.120139194,39
1001,"Flensburg",54.78879007,9.4459971,10,"Schleswig-Holstein",54.212,9.959,7618
1001,"Flensburg",54.78879007,9.4459971,1,"Schleswig-Holstein",54.20896049,9.957114419,7618
1001,"Flensburg",54.78879007,9.4459971,2000,"Hamburg, Freie und Hansestadt",53.57071859,9.943770215,567
1001,"Flensburg",54.78879007,9.4459971,20,"Hamburg",53.575,9.941,567
1001,"Flensburg",54.78879007,9.4459971,2,"Hamburg",53.57071859,9.943770215,567
1003,"Lübeck",53.88132436,10.72749774,100,"Saarland",49.379,6.979,25
1003,"Lübeck",53.88132436,10.72749774,10,"Saarland",54.212,9.959,25
1003,"Lübeck",53.88132436,10.72749774,11000,"Berlin, Stadt",52.50395948,13.39337765,274
1003,"Lübeck",53.88132436,10.72749774,110,"Berlin",52.507,13.405,274
1003,"Lübeck",53.88132436,10.72749774,11,"Berlin",52.50395948,13.39337765,274
I would like to delete all adjacent duplicate rows with the same value and keep only the last row of each run, i.e. the one where id_work has one or two digits. All other rows should be deleted. How can I achieve this? What I essentially need is the following output:
id_home,name_home,lat_home,lon_home,id_work,work,lat_work,lon_work,value
1001,"Flensburg",54.78879007,9.4459971,1002,"Kiel",54.34189351,10.13048288,695
1001,"Flensburg",54.78879007,9.4459971,1003,"Lübeck, Hansestadt",53.88132436,10.72749774,106
1001,"Flensburg",54.78879007,9.4459971,1004,"Neumünster, Stadt",54.07797524,9.974475148,124
1001,"Flensburg",54.78879007,9.4459971,1051,"Dithmarschen",54.12904835,9.120139194,39
1001,"Flensburg",54.78879007,9.4459971,1,"Schleswig-Holstein",54.20896049,9.957114419,7618
1001,"Flensburg",54.78879007,9.4459971,2,"Hamburg",53.57071859,9.943770215,567
1003,"Lübeck",53.88132436,10.72749774,10,"Saarland",54.212,9.959,25
1003,"Lübeck",53.88132436,10.72749774,11,"Berlin",52.50395948,13.39337765,274
Super thankful for any help!
drop_duplicates has a keep param; set it to 'last':
In [188]:
df.drop_duplicates(subset=['value'], keep='last')
Out[188]:
id name value
0 345 name1 456
1 12 name2 220
5 2 name6 567
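(Note: the df used in these examples isn't shown in the answer. A plausible reconstruction consistent with the outputs, where the names and the multi-digit ids in rows 2-4 are assumptions, is:)

df = pd.DataFrame({'id':    [345, 12, 100, 101, 102, 2],
                   'name':  ['name1', 'name2', 'name3', 'name4', 'name5', 'name6'],
                   'value': [456, 220, 567, 567, 567, 567]})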
Actually I think the following is what you want:
In [197]:
df.drop(df.index[(df['value'].isin(df.loc[df['value'].duplicated(), 'value'].unique())) & (df['id'].astype(str).str.len() != 1)])
Out[197]:
id name value
0 345 name1 456
1 12 name2 220
5 2 name6 567
Here we drop the row labels that have duplicated values and whose 'id' length is not 1. A breakdown:
In [198]:
df['value'].duplicated()
Out[198]:
0 False
1 False
2 False
3 True
4 True
5 True
Name: value, dtype: bool
In [199]:
df.loc[df['value'].duplicated(), 'value']
Out[199]:
3 567
4 567
5 567
Name: value, dtype: int64
In [200]:
df['value'].isin(df.loc[df['value'].duplicated(), 'value'].unique())
Out[200]:
0 False
1 False
2 True
3 True
4 True
5 True
Name: value, dtype: bool
In [201]:
(df['value'].isin(df.loc[df['value'].duplicated(), 'value'].unique())) & (df['id'].astype(str).str.len() != 1)
Out[201]:
0 False
1 False
2 True
3 True
4 True
5 False
dtype: bool
In [202]:
df.index[(df['value'].isin(df.loc[df['value'].duplicated(), 'value'].unique())) & (df['id'].astype(str).str.len() != 1)]
Out[202]:
Int64Index([2, 3, 4], dtype='int64')
So the above uses duplicated to flag the duplicated values, unique to get just the unique duplicated values, and isin to test for membership; we cast the 'id' column to str so we can test its length with str.len, and then use the boolean mask to select the index labels to drop.
Let's simplify this to the case where you have a single array:
arr = np.array([1, 1, 1, 2, 0, 0, 1, 1, 2, 0, 0, 0, 0, 2, 1, 0, 0, 1, 1, 1])
Now let's generate an array of bools which shows us the places where the values change:
arr[1:] != arr[:-1]
That tells us which values we want to keep: the values which are different from the next one. But it leaves out the last value, which should always be included, so:
mask = np.hstack((arr[1:] != arr[:-1], True))
Now, arr[mask] gives us:
array([1, 2, 0, 1, 2, 0, 2, 1, 0, 1])
And in case you don't believe the last occurrence of each element was selected, you can check mask.nonzero()[0] to get the indexes numerically:
array([ 2, 3, 5, 7, 8, 12, 13, 14, 16, 19])
Now that you know how to generate the mask for a single column, you can simply apply it to your entire dataframe as df[mask].
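A minimal sketch of that last step, assuming the question's CSV has been read into a frame named df with the value column shown:

vals = df['value'].values
mask = np.hstack((vals[1:] != vals[:-1], True))  # True at the last row of each run
result = df[mask]  # for this data, exactly the rows whose id_work has one or two digits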
I have a simple test case of a function which returns a df that can potentially contain NaN. I was testing if the output and expected output were equal.
>>> output
Out[1]:
r t ts tt ttct
0 2048 30 0 90 1
1 4096 90 1 30 1
2 0 70 2 65 1
[3 rows x 5 columns]
>>> expected
Out[2]:
r t ts tt ttct
0 2048 30 0 90 1
1 4096 90 1 30 1
2 0 70 2 65 1
[3 rows x 5 columns]
>>> output == expected
Out[3]:
r t ts tt ttct
0 True True True True True
1 True True True True True
2 True True True True True
However, I can't simply rely on the == operator because of NaNs. I was under the impression that the appropriate way to resolve this was by using the equals method. From the documentation:
pandas.DataFrame.equals
DataFrame.equals(other)
Determines if two NDFrame objects contain the same elements. NaNs in the same location are considered equal.
Nonetheless:
>>> expected.equals(output)
Out[4]: False
A little digging around reveals the difference in the frames:
>>> output._data
Out[5]:
BlockManager
Items: Index([u'r', u't', u'ts', u'tt', u'ttct'], dtype='object')
Axis 1: Int64Index([0, 1, 2], dtype='int64')
FloatBlock: [r], 1 x 3, dtype: float64
IntBlock: [t, ts, tt, ttct], 4 x 3, dtype: int64
>>> expected._data
Out[6]:
BlockManager
Items: Index([u'r', u't', u'ts', u'tt', u'ttct'], dtype='object')
Axis 1: Int64Index([0, 1, 2], dtype='int64')
IntBlock: [r, t, ts, tt, ttct], 5 x 3, dtype: int64
Force that output float block to int, or force the expected int block to float, and the test passes.
Obviously, there are different senses of equality, and the sort of test that DataFrame.equals performs could be useful in some cases. Nonetheless, the disparity between == and DataFrame.equals is frustrating to me and seems like an inconsistency. In pseudo-code, I would expect its behavior to match:
(self.index == other.index).all() \
and (self.columns == other.columns).all() \
and (self.values.fillna(SOME_MAGICAL_VALUE) == other.values.fillna(SOME_MAGICAL_VALUE)).all().all()
However, it doesn't. Am I wrong in my thinking, or is this an inconsistency in the Pandas API? Moreover, what IS the test I should be performing for my purposes, given the possible presence of NaN?
.equals() does just what it says. It tests for exact equality among elements, positioning of NaNs (and NaTs), dtype equality, and index equality. Think of this as a df is df2 type of test, but they don't have to actually be the same object; in other words, df.equals(df.copy()) IS always True.
Your example fails because different dtypes are not equal (though they may be equivalent). So you can use com.array_equivalent for this, or (df == df2).all().all() if you don't have NaNs.
That function is a replacement for np.array_equal, which is broken for NaN positional detection (and object dtypes).
It is mostly used internally. That said, if you would like an enhancement for equivalence (e.g. the elements are equivalent in the == sense and NaN positions match), please open an issue on GitHub (and even better, submit a PR!).
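In test code, a more direct option is pandas' own testing helper, which treats NaNs in the same positions as equal and can be told to ignore dtype differences; a sketch using the output and expected frames from the question:

from pandas.testing import assert_frame_equal

# Passes when values and NaN positions match; check_dtype=False ignores the
# float64 vs int64 block difference that made .equals() return False here.
assert_frame_equal(output, expected, check_dtype=False)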
I used a workaround digging into the MagicMock instance:
assert mock_instance.call_count == 1
call_args = mock_instance.call_args[0]    # positional arguments of the call (unused here)
call_kwargs = mock_instance.call_args[1]  # keyword arguments of the call
pd.testing.assert_frame_equal(call_kwargs['dataframe'], pd.DataFrame())