How does loc know which row to update when setting values? - python

import pandas as pd

df = pd.DataFrame({
    'Product': ['Umbrella', 'Mattress', 'Badminton',
                'Shuttle', 'Sofa', 'Football'],
    'MRP': [1200, 1500, 1600, 352, 5000, 500],
    'Discount': [0, 10, 0, 10, 20, 40]
})
# Print the dataframe
print(df)
df.loc[df.MRP >= 1500, "Discount"] = -1
print(df)
I want to understand how loc works. The purpose of loc is to select rows by label search. But in the above code, it seems to iterate over each row and insert -1 in the Discount column wherever the boolean is True. Does it do a label search?

The only "real" indexing on a DataFrame are the positional indexes (the 0 indexed values which correspond to the underlying structures).
loc, therefore, always has to "Convert a potentially-label-based key into a positional indexer." _get_setitem_indexer.
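You can see that translation with the public Index.get_indexer / Index.get_loc (just an illustration of the idea, not the internal code path loc actually takes):
import pandas as pd

idx = pd.Index(['a', 'b', 'c', 'd'])
# Labels are translated into 0-based positions on the underlying arrays
print(idx.get_indexer(['b', 'd']))  # [1 3]
print(idx.get_loc('c'))             # 2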
Stepping out from under the hood the docs on pandas.DataFrame.loc explicitly allow:
A single label, e.g. 5 or 'a', (note that 5 is interpreted as a label of the index, and never as an integer position along the index).
A list or array of labels, e.g. ['a', 'b', 'c'].
A slice object with labels, e.g. 'a':'f'.
A boolean array of the same length as the axis being sliced, e.g. [True, False, True].
An alignable boolean Series. The index of the key will be aligned before masking.
An alignable Index. The Index of the returned selection will be the input.
A callable function with one argument (the calling Series or DataFrame) and that returns valid output for indexing (one of the above).
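A minimal sketch of a few of those key types against the question's df (re-created here so the example is self-contained):
import pandas as pd

df = pd.DataFrame({
    'Product': ['Umbrella', 'Mattress', 'Badminton', 'Shuttle', 'Sofa', 'Football'],
    'MRP': [1200, 1500, 1600, 352, 5000, 500],
    'Discount': [0, 10, 0, 10, 20, 40]
})

print(df.loc[2, 'Product'])                   # single label -> 'Badminton'
print(df.loc[[0, 3], ['Product', 'MRP']])     # list of labels
print(df.loc[1:3, 'MRP'])                     # label slice, end label included
print(df.loc[df['MRP'] >= 1500, 'Discount'])  # boolean Series, aligned by index
print(df.loc[lambda d: d['Discount'] == 0])   # callable returning a valid key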
The benefit of loc is that it is extraordinarily flexible, particularly in terms of being able to chain it with other operations. Take this aggregation:
df.groupby('Discount')['MRP'].agg(sum)
Discount
0 2800
10 1852
20 5000
40 500
Name: MRP, dtype: int64
Filtering this with Series.loc can be written as:
df.groupby('Discount')['MRP'].agg(sum).loc[lambda s: s >= 1500]
Discount
0 2800
10 1852
20 5000
Name: MRP, dtype: int64
Another huge benefit of loc is its ability to index both dimensions:
df.loc[df['MRP'] >= 1500, ['Product', 'Discount']] = np.nan
Product MRP Discount
0 Umbrella 1200 0.0
1 NaN 1500 NaN
2 NaN 1600 NaN
3 Shuttle 352 10.0
4 NaN 5000 NaN
5 Football 500 40.0
TLDR; The power of loc is its ability to translate various inputs into positional ones, while the drawback is the overhead of those conversions.

The first line of the documentation for DataFrame.loc states:
Access a group of rows and columns by label(s) or a boolean array.
.loc[] is primarily label based, but may also be used with a boolean array
Let's take a look at the expression df.MRP >= 1500. This is a boolean series with the same index as the dataframe:
>>> df.MRP >= 1500
0 False
1 True
2 True
3 False
4 True
5 False
Name: MRP, dtype: bool
So clearly there is at least an opportunity to match labels. What happens when you remove the labels?
>>> df.loc[(df.MRP >= 1500).to_numpy(), "Discount"]
1 10
2 0
4 20
Name: Discount, dtype: int64
So .loc will use the ordering of the DataFrame when labels are not available. This makes sense. But does it use order or labels when the labels don't match?
Make a Series like df.MRP >= 1500 but out of order to see what gets selected:
>>> ind1 = pd.Series([True, True, True, False, False, False], index=[1, 2, 4, 0, 3, 5])
>>> df.loc[ind1, "Discount"]
1 10
2 0
4 20
Name: Discount, dtype: int64
So clearly, label matching happens when labels are available. When they are not, order is used instead:
>>> df.loc[ind1.to_numpy(), "Discount"]
0 0
1 10
2 0
Name: Discount, dtype: int64
Another interesting point is that the labels of the indexer must be a superset, not a subset, of the DataFrame's index. For example, if you shorten ind1 by one element, this is what happens:
>>> ind2 = pd.Series([True, True, True, False, False], index=[1, 2, 4, 0, 3])
>>> df.loc[ind2, "Discount"]
...
IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).
and
>>> df.loc[ind2.to_numpy(), "Discount"]
...
IndexError: Boolean index has wrong length: 5 instead of 6
Adding an extra element when doing label matching is OK, however:
>>> ind3 = pd.Series([True, True, True, False, False, False, True], index=[1, 2, 4, 0, 3, 5, 6])
>>> df.loc[ind3, "Discount"]
1 10
2 0
4 20
Name: Discount, dtype: int64
Notice that the element at label 6, which is not in the DataFrame's index, is ignored in the output.
And of course without labels, longer arrays are not acceptable either:
>>> df.loc[ind3.to_numpy(), "Discount"]
...
IndexError: Boolean index has wrong length: 7 instead of 6
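The same label alignment applies when setting values, which is what the snippet in the original question does. A small check (a sketch; the out-of-order ind1 from above still updates rows 1, 2 and 4):
import pandas as pd

df = pd.DataFrame({
    'MRP': [1200, 1500, 1600, 352, 5000, 500],
    'Discount': [0, 10, 0, 10, 20, 40]
})

# Boolean Series whose index is out of order: alignment happens by label,
# not by position, so the rows labelled 1, 2 and 4 get the new value.
ind1 = pd.Series([True, True, True, False, False, False], index=[1, 2, 4, 0, 3, 5])
df.loc[ind1, 'Discount'] = -1
print(df)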

Related

Get index of row where column value changes from previous row

I have a pandas dataframe with a column such as :
df1 = pd.DataFrame({ 'val': [997.95, 997.97, 989.17, 999.72, 984.66, 1902.15]})
I have 2 types of events that can be detected from this column, and I want to label them 1 and 2.
I need to get the indexes of each label, and to do so I need to find where the 'val' column has changed a lot (±7) from the previous row.
Expected output:
one = [0, 1, 3, 5]
two = [2, 4]
Use Series.diff and build a mask testing for values less than 0 (or -7), then use boolean indexing on the index:
m = df1.val.diff().lt(0)
#if need test less like -7
#m = df1.val.diff().lt(-7)
one = df1.index[~m]
two = df1.index[m]
print (one)
Int64Index([0, 1, 3, 5], dtype='int64')
print (two)
Int64Index([2, 4], dtype='int64')
If need lists:
one = df1.index[~m].tolist()
two = df1.index[m].tolist()
Details:
print (df1.val.diff())
0 NaN
1 0.02
2 -8.80
3 10.55
4 -15.06
5 917.49
Name: val, dtype: float64
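If the goal is to tag each row rather than collect indexes, one variant (a sketch, assuming the -7 threshold and a hypothetical new event column) is:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'val': [997.95, 997.97, 989.17, 999.72, 984.66, 1902.15]})

# Label 2 where the value dropped by more than 7 from the previous row, else 1
df1['event'] = np.where(df1['val'].diff().lt(-7), 2, 1)
print(df1['event'].tolist())
# [1, 1, 2, 1, 2, 1]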

Python language construction when filtering the array

I can see many questions in SO about how to filter an array (usually using pandas or numpy). The examples are very simple:
df = pd.DataFrame({ 'val': [1, 2, 3, 4, 5, 6, 7] })
a = df[df.val > 3]
Intuitively I understand the df[df.val > 3] statement. But it confuses me from the syntax point of view. In other languages, I would expect a lambda function instead of df.val > 3.
The question is: why is this style possible, and what is going on under the hood?
Update 1:
To be clearer about the confusing part: in other languages I have seen the following syntax:
df.filter(row => row.val > 3)
I understand here what is going on under the hood: For every iteration, the lambda function is called with the row as an argument and returns a boolean value. But here df.val > 3 doesn't make sense to me because df.val looks like a column.
Moreover, I can write df[df > 3] and it executes successfully, which drives me crazy because I don't understand how a DataFrame object can be compared to a number.
Create an array and dataframe from it:
In [104]: arr = np.arange(1,8); df = pd.DataFrame({'val':arr})
In [105]: arr
Out[105]: array([1, 2, 3, 4, 5, 6, 7])
In [106]: df
Out[106]:
val
0 1
1 2
2 3
3 4
4 5
5 6
6 7
numpy arrays have methods and operators that operate on the whole array. For example you can multiply the array by a scalar, or add a scalar to all elements. Or in this case compare each element to a scalar. That's all implemented by the class (numpy.ndarray), not by Python syntax.
In [107]: arr>3
Out[107]: array([False, False, False, True, True, True, True])
Similarly, pandas implements these methods (or uses numpy methods on the underlying arrays). Select a column of the frame with df['val'] or:
In [108]: df.val
Out[108]:
0 1
1 2
2 3
3 4
4 5
5 6
6 7
Name: val, dtype: int32
This is a pandas Series (slight difference in display).
It can be compared to a scalar - as with the array:
In [110]: df.val>3
Out[110]:
0 False
1 False
2 False
3 True
4 True
5 True
6 True
Name: val, dtype: bool
And the boolean array can be used to index the frame:
In [111]: df[arr>3]
Out[111]:
val
3 4
4 5
5 6
6 7
The boolean Series also works:
In [112]: df[df.val>3]
Out[112]:
val
3 4
4 5
5 6
6 7
Boolean array indexing works the same way:
In [113]: arr[arr>3]
Out[113]: array([4, 5, 6, 7])
Here I use the indexing to fetch values; setting values is analogous.
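For completeness, pandas does also accept a callable, which is closer to the lambda-based filter syntax the question mentions (a minimal sketch):
import pandas as pd

df = pd.DataFrame({'val': [1, 2, 3, 4, 5, 6, 7]})

# The callable is invoked with the DataFrame and must return a valid indexer,
# here a boolean Series.
print(df.loc[lambda d: d.val > 3])

# query() is another spelling that avoids repeating the frame's name.
print(df.query('val > 3'))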

Check, if a variable does not equal to any of the vector's elements

I have a vector dogSpecies showing all four unique dog species under investigation.
#a set of possible dog species
dogSpecies = [1,2,3,4]
I also have a data vector containing integer numbers corresponding to the records of dog species of all dogs tested.
# species of examined dogs
data = np.array([1, 1, 2, -1, 0, 2, 3, 5, 4])
Some of the records in data contain values different than 1,2,3 or 4. (Such as -1, 0 or 5). If an element in the data set is not equal to any element of the dogSpecies, such occurrence should be marked in an error evaluation boolean matrix as False.
#initially all the elements of the boolean error evaluation vector are True.
errorEval = np.ones((np.size(data,axis = 0)),dtype=bool)
Ideally my errorEval vector would look like this:
errorEval = np.array([True, True, True, False, False, True, True, False, True])
I want a piece of code that checks if the elements of data are not equal to the elements of dogSpecies vector. My code for some reason marks every single element of the errorEval vector as 'False'.
for i in range(np.size(data, axis=0)):
    # validation of the species
    if (data[i] != dogSpecies):
        errorEval[i] = False
I understand that I cannot compare a single element with a vector of four elements like above, but how do I do this then?
Isn't this just what you want?
for index, elem in enumerate(data):
    if elem not in dogSpecies:
        errorEval[index] = False
Probably not very fast, since it doesn't use any vectorized numpy ufuncs, but if the array isn't very large that won't matter. Converting dogSpecies to a set will also speed things up.
As an aside, your Python looks very C/Java-esque. I'd suggest reading the Python style guide.
If I understand correctly, you have a dataframe and a list of dog species. This should achieve what you want.
df = pd.DataFrame({'dog': [1,3,4,5,1,1,8,9,0]})
dog
0 1
1 3
2 4
3 5
4 1
5 1
6 8
7 9
8 0
df['errorEval'] = df['dog'].isin(dogSpecies).astype(int)
dog errorEval
0 1 1
1 3 1
2 4 1
3 5 0
4 1 1
5 1 1
6 8 0
7 9 0
8 0 0
df.errorEval.values
# array([1, 1, 1, 0, 1, 1, 0, 0, 0])
If you don't want to create a new column then you can do:
df.assign(errorEval=df['dog'].isin(dogSpecies).astype(int)).errorEval.values
# array([1, 1, 1, 0, 1, 1, 0, 0, 0])
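If you want booleans rather than 0/1, matching the errorEval vector in the question, simply skip the astype cast (a small aside):
df['dog'].isin(dogSpecies).values
# array([ True,  True,  True, False,  True,  True, False, False, False])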
As @FHTMitchel stated, you have to use in to check whether an element is in a list.
But you can use a list comprehension, which is faster than a normal loop and shorter:
errorEval = np.array([True if elem in dogSpecies else False for elem in data])
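A fully vectorized alternative is np.isin, which builds the same boolean mask without a Python-level loop (a sketch, assuming the corrected data array from the question):
import numpy as np

dogSpecies = [1, 2, 3, 4]
data = np.array([1, 1, 2, -1, 0, 2, 3, 5, 4])

# True where the element is one of the known species, False otherwise
errorEval = np.isin(data, dogSpecies)
print(errorEval)
# [ True  True  True False False  True  True False  True]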

How to drop rows of Pandas dataframe with same value based on condition in different column

I am new to Python and Pandas, so please bear with me. I have a rather simple problem to solve, I suppose, but cannot seem to get it right.
I have a csv-file, that I would like to edit with a pandas dataframe. The data presents flows from home to work locations, and the locations' respective ids as well as coordinates in lat/lon and a value for each flow.
id_home,name_home,lat_home,lon_home,id_work,work,lat_work,lon_work,value
1001,"Flensburg",54.78879007,9.4459971,1002,"Kiel",54.34189351,10.13048288,695
1001,"Flensburg",54.78879007,9.4459971,1003,"Lübeck, Hansestadt",53.88132436,10.72749774,106
1001,"Flensburg",54.78879007,9.4459971,1004,"Neumünster, Stadt",54.07797524,9.974475148,124
1001,"Flensburg",54.78879007,9.4459971,1051,"Dithmarschen",54.12904835,9.120139194,39
1001,"Flensburg",54.78879007,9.4459971,10,"Schleswig-Holstein",54.212,9.959,7618
1001,"Flensburg",54.78879007,9.4459971,1,"Schleswig-Holstein",54.20896049,9.957114419,7618
1001,"Flensburg",54.78879007,9.4459971,2000,"Hamburg, Freie und Hansestadt",53.57071859,9.943770215,567
1001,"Flensburg",54.78879007,9.4459971,20,"Hamburg",53.575,9.941,567
1001,"Flensburg",54.78879007,9.4459971,2,"Hamburg",53.57071859,9.943770215,567
1003,"Lübeck",53.88132436,10.72749774,100,"Saarland",49.379,6.979,25
1003,"Lübeck",53.88132436,10.72749774,10,"Saarland",54.212,9.959,25
1003,"Lübeck",53.88132436,10.72749774,11000,"Berlin, Stadt",52.50395948,13.39337765,274
1003,"Lübeck",53.88132436,10.72749774,110,"Berlin",52.507,13.405,274
1003,"Lübeck",53.88132436,10.72749774,11,"Berlin",52.50395948,13.39337765,274
I would like to delete all adjacent duplicate rows with the same value and keep only the last row of each run, i.e. the one where id_work has one or two digits. All other rows should be deleted. How can I achieve this? What I essentially need is the following output:
id_home,name_home,lat_home,lon_home,id_work,work,lat_work,lon_work,value
1001,"Flensburg",54.78879007,9.4459971,1002,"Kiel",54.34189351,10.13048288,695
1001,"Flensburg",54.78879007,9.4459971,1003,"Lübeck, Hansestadt",53.88132436,10.72749774,106
1001,"Flensburg",54.78879007,9.4459971,1004,"Neumünster, Stadt",54.07797524,9.974475148,124
1001,"Flensburg",54.78879007,9.4459971,1051,"Dithmarschen",54.12904835,9.120139194,39
1001,"Flensburg",54.78879007,9.4459971,1,"Schleswig-Holstein",54.20896049,9.957114419,7618
1001,"Flensburg",54.78879007,9.4459971,2,"Hamburg",53.57071859,9.943770215,567
1003,"Lübeck",53.88132436,10.72749774,10,"Saarland",54.212,9.959,25
1003,"Lübeck",53.88132436,10.72749774,11,"Berlin",52.50395948,13.39337765,274
Super thankful for any help!
drop_duplicates has a keep param, set this to last:
In [188]:
df.drop_duplicates(subset=['value'], keep='last')
Out[188]:
id name value
0 345 name1 456
1 12 name2 220
5 2 name6 567
Actually I think the following is what you want:
In [197]:
df.drop(df.index[(df['value'].isin(df.loc[df['value'].duplicated(), 'value'].unique())) & (df['id'].astype(str).str.len() != 1)])
Out[197]:
id name value
0 345 name1 456
1 12 name2 220
5 2 name6 567
Here we drop the row labels whose values are duplicated and whose 'id' length is not 1; a breakdown:
In [198]:
df['value'].duplicated()
Out[198]:
0 False
1 False
2 False
3 True
4 True
5 True
Name: value, dtype: bool
In [199]:
df.loc[df['value'].duplicated(), 'value']
Out[199]:
3 567
4 567
5 567
Name: value, dtype: int64
In [200]:
df['value'].isin(df.loc[df['value'].duplicated(), 'value'].unique())
Out[200]:
0 False
1 False
2 True
3 True
4 True
5 True
Name: value, dtype: bool
In [201]:
(df['value'].isin(df.loc[df['value'].duplicated(), 'value'].unique())) & (df['id'].astype(str).str.len() != 1)
Out[201]:
0 False
1 False
2 True
3 True
4 True
5 False
dtype: bool
In [202]:
df.index[(df['value'].isin(df.loc[df['value'].duplicated(), 'value'].unique())) & (df['id'].astype(str).str.len() != 1)]
Out[202]:
Int64Index([2, 3, 4], dtype='int64')
So the above uses duplicated to find the duplicated values, unique to get just the unique duplicated values, and isin to test for membership; we cast the 'id' column to str so we can test its length with str.len, and use the resulting boolean mask to select the index labels to drop.
Let's simplify this to the case where you have a single array:
arr = np.array([1, 1, 1, 2, 0, 0, 1, 1, 2, 0, 0, 0, 0, 2, 1, 0, 0, 1, 1, 1])
Now let's generate an array of bools which shows us the places where the values change:
arr[1:] != arr[:-1]
That tells us which values we want to keep--the values which are different from the next ones. But it leaves out the last value, which should always be included, so:
mask = np.hstack((arr[1:] != arr[:-1], True))
Now, arr[mask] gives us:
array([1, 2, 0, 1, 2, 0, 2, 1, 0, 1])
And in case you don't believe the last occurrence of each element was selected, you can check mask.nonzero() to get the indexes numerically:
array([ 2, 3, 5, 7, 8, 12, 13, 14, 16, 19])
Now that you know how to generate the mask for a single column, you can simply apply it to your entire dataframe as df[mask].
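Applied to the CSV from the question, the same adjacent-duplicate idea can be written with shift: keep each row whose value differs from the next row's, i.e. the last row of every run (a sketch; 'flows.csv' is an assumed filename for the data shown above):
import pandas as pd

df = pd.read_csv('flows.csv')  # the data from the question

# Keep a row only if its value differs from the next row's value,
# which keeps the last row of each run of adjacent duplicates.
result = df[df['value'] != df['value'].shift(-1)]
print(result)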

contract of pandas.DataFrame.equals

I have a simple test case of a function which returns a df that can potentially contain NaN. I was testing if the output and expected output were equal.
>>> output
Out[1]:
r t ts tt ttct
0 2048 30 0 90 1
1 4096 90 1 30 1
2 0 70 2 65 1
[3 rows x 5 columns]
>>> expected
Out[2]:
r t ts tt ttct
0 2048 30 0 90 1
1 4096 90 1 30 1
2 0 70 2 65 1
[3 rows x 5 columns]
>>> output == expected
Out[3]:
r t ts tt ttct
0 True True True True True
1 True True True True True
2 True True True True True
However, I can't simply rely on the == operator because of NaNs. I was under the impression that the appropriate way to resolve this was by using the equals method. From the documentation:
pandas.DataFrame.equals
DataFrame.equals(other)
Determines if two NDFrame objects contain the same elements. NaNs in the same location are considered equal.
Nonetheless:
>>> expected.equals(output)
Out[4]: False
A little digging around reveals the difference in the frames:
>>> output._data
Out[5]:
BlockManager
Items: Index([u'r', u't', u'ts', u'tt', u'ttct'], dtype='object')
Axis 1: Int64Index([0, 1, 2], dtype='int64')
FloatBlock: [r], 1 x 3, dtype: float64
IntBlock: [t, ts, tt, ttct], 4 x 3, dtype: int64
>>> expected._data
Out[6]:
BlockManager
Items: Index([u'r', u't', u'ts', u'tt', u'ttct'], dtype='object')
Axis 1: Int64Index([0, 1, 2], dtype='int64')
IntBlock: [r, t, ts, tt, ttct], 5 x 3, dtype: int64
Force that output float block to int, or force the expected int block to float, and the test passes.
Obviously, there are different senses of equality, and the sort of test that DataFrame.equals performs could be useful in some cases. Nonetheless, the disparity between == and DataFrame.equals is frustrating to me and seems like an inconsistency. In pseudo-code, I would expect its behavior to match:
(self.index == other.index).all() \
and (self.columns == other.columns).all() \
and (self.values.fillna(SOME_MAGICAL_VALUE) == other.values.fillna(SOME_MAGICAL_VALUE)).all().all()
However, it doesn't. Am I wrong in my thinking, or is this an inconsistency in the Pandas API? Moreover, what IS the test I should be performing for my purposes, given the possible presence of NaN?
.equals() does just what it says. It tests for exact equality among elements, positioning of nans (and NaTs), dtype equality, and index equality. Think of this as a df is df2 type of test, but they don't have to actually be the same object; IOW, df.equals(df.copy()) IS always True.
Your example fails because different dtypes are not equal (they may be equivalent though). So you can use com.array_equivalent for this, or (df == df2).all().all() if you don't have nans.
This is a replacement for np.array_equal which is broken for nan positional detections (and object dtypes).
It is mostly used internally. That said if you like an enhancement for equivalence (e.g. the elements are equivalent in the == sense and nan positionals match), pls open an issue on github. (and even better submit a PR!)
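If what you want is equality that tolerates dtype differences but still treats NaNs in the same position as equal, two options (a sketch; the frames here are made up to show an int vs float block):
import numpy as np
import pandas as pd
from pandas.testing import assert_frame_equal

output = pd.DataFrame({'r': [1.0, 2.0], 't': [30, 90]})
expected = pd.DataFrame({'r': [1, 2], 't': [30, 90]})

# Raises AssertionError on mismatch; check_dtype=False tolerates int64 vs float64,
# and NaNs in the same location compare equal.
assert_frame_equal(output, expected, check_dtype=False)

# A mask-based check in the spirit of the pseudo-code above:
same = ((output == expected) | (output.isna() & expected.isna())).all().all()
print(same)  # True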
I used a workaround digging into the MagicMock instance:
assert mock_instance.call_count == 1
call_args = mock_instance.call_args[0]
call_kwargs = mock_instance.call_args[1]
pd.testing.assert_frame_equal(call_kwargs['dataframe'], pd.DataFrame())
