Pandas Series

I have two pandas Series objects of equal length. Each element of the first one is a list of floats; each element of the second one is a bool.
first one:
0 [190, 2000]
1 [250, 1500, 2500]
2 [12.7, 2.700]
3 [22.4, 1750, 2750]
4 [11.5, 4.500]
...
8123 [113.7, 4000]
8124 [24, 1.900, 2.750]
8125 [190, 2000]
8126 [140, 1800, 3000]
8127 [140, 1800, 3000]
Name: torque, Length: 8128, dtype: object
second one:
0 True
1 True
2 False
3 False
4 False
...
8123 True
8124 False
8125 True
8126 True
8127 True
Name: torque, Length: 8128, dtype: object
I also have a function with two arguments: a list and a bool. It processes the list depending on the bool value and always returns a tuple of 2 elements. What I want to do is pass every element of the first Series through the function, also giving it the corresponding bool value from the second Series, and receive a third, two-column result (since the function returns a tuple of 2 elements). The first Series also has NaN values instead of lists in some rows, which I need to handle as well: in that case I need to write NaN and NaN to the resulting row.
I could write loops and complicated constructions, but I am sure there must be an easier way using pandas classes and functions.
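One way to avoid explicit loops (a minimal sketch; process below is a hypothetical stand-in for the real function) is to zip the two Series and expand the returned tuples into a two-column DataFrame, emitting (NaN, NaN) for rows where the list is missing:
import numpy as np
import pandas as pd

def process(lst, flag):
    # hypothetical placeholder: returns a 2-tuple, like the real function
    return (min(lst), max(lst)) if flag else (max(lst), min(lst))

def safe_process(lst, flag):
    # the lists live in an object-dtype Series, so test for NaN explicitly
    if not isinstance(lst, list):
        return (np.nan, np.nan)
    return process(lst, flag)

lists = pd.Series([[190, 2000], np.nan, [250, 1500, 2500]], name='torque')
flags = pd.Series([True, False, True])

# one row per input pair; the 2-tuples become two columns
result = pd.DataFrame(
    [safe_process(lst, flag) for lst, flag in zip(lists, flags)],
    index=lists.index,
)
result[0] and result[1] are then the two output columns.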

Related

How to take 2 samples out of pd.series, using sample, so the result would be sample1+sample2=original pd.series?

I need to create two random samples in order to do cross-validation in the next step. Say we have a pd.Series object and we are testing how this should work. But when I type this:
example=pd.Series([1,2,3,4,5,6])
example1=example.sample(n=2, replace=False, random_state=12345)
example2=example.sample(n=4, replace=False, random_state=12345)
print(example1)
print(example2)
I get:
5 6
3 4
dtype: int64
5 6
3 4
4 5
0 1
dtype: int64
but the elements should be different, and example1 plus example2 should together equal example. What can be done?
Remove the values selected into example1 by dropping example1.index; this requires unique index values:
#if not sure if unique
#example = example.reset_index(drop=True)
example2=example.drop(example1.index).sample(n=4, replace=False, random_state=12345)
If the values of the Series are unique, you can instead filter out the values already in example1:
example2=example[~example.isin(example1)].sample(n=4, replace=False, random_state=12345)
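A quick sanity check of the drop-based approach (a sketch using the example above): the two samples are disjoint and together cover the original Series.
import pandas as pd

example = pd.Series([1, 2, 3, 4, 5, 6])
example1 = example.sample(n=2, replace=False, random_state=12345)
example2 = example.drop(example1.index).sample(n=4, replace=False, random_state=12345)

# indexes are disjoint and together cover the whole Series
assert example1.index.intersection(example2.index).empty
assert len(example1) + len(example2) == len(example)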

How does loc know which row to update when setting values?

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Product': ['Umbrella', 'Matress', 'Badminton',
                'Shuttle', 'Sofa', 'Football'],
    'MRP': [1200, 1500, 1600, 352, 5000, 500],
    'Discount': [0, 10, 0, 10, 20, 40]
})
# Print the dataframe
print(df)
df.loc[df.MRP >= 1500, "Discount"] = -1
print(df)
I want to understand how loc works. The purpose of loc is to select rows by label. But in the code above, it seems to iterate over each row and insert -1 in the Discount column wherever the boolean is True. Does it do a label search?
The only "real" indexing on a DataFrame are the positional indexes (the 0 indexed values which correspond to the underlying structures).
loc, therefore, always has to "Convert a potentially-label-based key into a positional indexer." _get_setitem_indexer.
Stepping out from under the hood the docs on pandas.DataFrame.loc explicitly allow:
A single label, e.g. 5 or 'a', (note that 5 is interpreted as a label of the index, and never as an integer position along the index).
A list or array of labels, e.g. ['a', 'b', 'c'].
A slice object with labels, e.g. 'a':'f'.
A boolean array of the same length as the axis being sliced, e.g. [True, False, True].
An alignable boolean Series. The index of the key will be aligned before masking.
An alignable Index. The Index of the returned selection will be the input.
A callable function with one argument (the calling Series or DataFrame) and that returns valid output for indexing (one of the above).
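For instance, a quick sketch of a few of these accepted forms on a toy Series:
import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
s.loc['b']                  # single label -> 20
s.loc[['a', 'c']]           # list of labels
s.loc['a':'b']              # label slice, inclusive of both endpoints
s.loc[[True, False, True]]  # boolean array of the same length
s.loc[lambda x: x > 15]     # callable returning a valid indexer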
The benefit of loc is that it is extraordinarily flexible, particularly in terms of being able to chain it with other operations. See:
df.groupby('Discount')['MRP'].agg(sum)
Discount
0 2800
10 1852
20 5000
40 500
Name: MRP, dtype: int64
Filtering this with Series.loc can be written as:
df.groupby('Discount')['MRP'].agg(sum).loc[lambda s: s >= 1500]
Discount
0 2800
10 1852
20 5000
Name: MRP, dtype: int64
Another huge benefit of loc is its ability to index both dimensions:
df.loc[df['MRP'] >= 1500, ['Product', 'Discount']] = np.nan
df
Product MRP Discount
0 Umbrella 1200 0.0
1 NaN 1500 NaN
2 NaN 1600 NaN
3 Shuttle 352 10.0
4 NaN 5000 NaN
5 Football 500 40.0
TLDR; The power of loc is its ability to translate various inputs into positional inputs, while the drawback is overhead of those conversions.
The first line of the documentation for DataFrame.loc states:
Access a group of rows and columns by label(s) or a boolean array.
.loc[] is primarily label based, but may also be used with a boolean array
Let's take a look at the expression df.MRP >= 1500. This is a boolean series with the same index as the dataframe:
>>> df.MRP >= 1500
0 False
1 True
2 True
3 False
4 True
5 False
Name: MRP, dtype: bool
So clearly there is at least an opportunity to match labels. What happens when you remove the labels?
>>> df.loc[(df.MRP >= 1500).to_numpy(), "Discount"]
1 10
2 0
4 20
Name: Discount, dtype: int64
So .loc will use the ordering of the DataFrame when labels are not available. This makes sense. But does it use order or labels when the labels are present but in a different order?
Make a Series like df.MRP >= 1500 but out of order to see what gets selected:
>>> ind1 = pd.Series([True, True, True, False, False, False], index=[1, 2, 4, 0, 3, 5])
>>> df.loc[ind1, "Discount"]
1 10
2 0
4 20
Name: Discount, dtype: int64
So clearly, label matching happens when labels are available. When they are not, order is used instead:
>>> df.loc[ind1.to_numpy(), "Discount"]
0 0
1 10
2 0
Name: Discount, dtype: int64
Another interesting point is that the labels of the indexer must be a superset, not a subset, of the DataFrame's index. For example, if you shorten the indexer by one element, this is what happens:
>>> ind2 = pd.Series([True, True, True, False, False], index=[1, 2, 4, 0, 3])
>>> df.loc[ind2, "Discount"]
...
IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).
and
>>> df.loc[ind2.to_numpy(), "Discount"]
...
IndexError: Boolean index has wrong length: 5 instead of 6
Adding an extra element when doing label matching is OK, however:
>>> ind3 = pd.Series([True, True, True, False, False, False, True], index=[1, 2, 4, 0, 3, 5, 6])
>>> df.loc[ind3, "Discount"]
1 10
2 0
4 20
Name: Discount, dtype: int64
Notice that the element at label 6, which is not in the DataFrame's index, is ignored in the output.
And of course without labels, longer arrays are not acceptable either:
>>> df.loc[ind3.to_numpy(), "Discount"]
...
IndexError: Boolean index has wrong length: 7 instead of 6
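To tie this back to the original assignment question: the same label alignment applies when setting values. A quick sketch, reusing df and the out-of-order ind1 from above:
# ind1 is True for labels 1, 2 and 4, regardless of its row order
df.loc[ind1, "Discount"] = -1
print(df.loc[[1, 2, 4], "Discount"])  # all three are now -1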

How to correctly identify float values [0, 1] containing a dot, in DataFrame object dtype?

I have a dataframe like so, where my values are object dtype:
df = pd.DataFrame(data=['A', '290', '0.1744175757', '1', '1.0000000000'], columns=['Value'])
df
Out[65]:
Value
0 A
1 290
2 0.1744175757
3 1
4 1.0000000000
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 1 columns):
Value 5 non-null object
dtypes: object(1)
memory usage: 120.0+ bytes
What I want to do is select only percentages, in this case the values 0.1744175757 and 1.0000000000, which in my data happen to always contain a period/dot. This is a key point: I need to be able to differentiate between a 1 integer value and a 1.0000000000 percentage, as well as between a 0 and a 0.0000000000.
I've tried to look for the presence of the dot character, but this doesn't work; it returns True for every value, and I'm unclear why.
df[df['Value'].str.contains('.')]
Out[67]:
Value
0 A
1 290
2 0.1744175757
3 1
4 1.0000000000
I've also tried isdecimal(), but this isn't quite what I want:
df[df['Value'].str.isdecimal()]
Out[68]:
Value
1 290
3 1
The closest I've come is with a function:
def isPercent(x):
    if pd.isnull(x):
        return False
    try:
        x = float(x)
        return x % 1 != 0
    except:
        return False
df[df['Value'].apply(isPercent)]
Out[74]:
Value
2 0.1744175757
but this fails to correctly identify scenarios of 1.0000000000 (and 0.0000000000).
I have two questions:
Why doesn't str.contains('.') work in this context? It seems like the easiest way, since it would get me exactly what I need in my data 100% of the time, but it returns True even when there is clearly no '.' character in the value.
How might I correctly identify all values [0, 1] that have a dot character in the value?
str.contains performs a regex-based search by default, and in a regex '.' matches any character. To disable regex matching, use regex=False:
df[df['Value'].str.contains('.', regex=False)]
Value
2 0.1744175757
4 1.0000000000
You can also escape it to treat it literally:
df[df['Value'].str.contains(r'\.')]
Value
2 0.1744175757
4 1.0000000000
If you really want to pick up just float numbers, try using a regex that is a little more robust.
df[df['Value'].str.contains(r'\d+\.\d+')].astype(float)
Value
2 0.174418
4 1.000000
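If you want to be stricter still, anchoring the pattern (my addition, not part of the original answer) ensures the whole string is a decimal number, rather than merely containing one somewhere inside:
df[df['Value'].str.contains(r'^\d+\.\d+$')]
Value
2 0.1744175757
4 1.0000000000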

Casting pandas float64 to string with specific format

I have very small numbers in one pandas column. For example:
0 6.560000e+02
1 6.730000e+02
2 6.240000e+02
3 1.325000e+03
4 8.494000e-07
Unfortunately, when I cast it to a string, it gives me unusable values, such as:
df.astype('str').tolist()
['8.494e-07', ]
Is there a way to return the actual value of an item when casting to a string, such as:
'0.0000008494'
Given
# s = df[c]
s
0 6.560000e+02
1 6.730000e+02
2 6.240000e+02
3 1.325000e+03
4 8.494000e-07
Name: 1, dtype: float64
You can either call str.format through apply,
s.apply('{:.10f}'.format)
0 656.0000000000
1 673.0000000000
2 624.0000000000
3 1325.0000000000
4 0.0000008494
Name: 1, dtype: object
s.apply('{:.10f}'.format).tolist()
# ['656.0000000000', '673.0000000000', '624.0000000000',
# '1325.0000000000', '0.0000008494']
Or perhaps, through a list comprehension.
['{:f}'.format(x) for x in s]
# ['656.000000', '673.000000', '624.000000', '1325.000000', '0.000001']
Notice that if you do not specify decimal precision, the last value is rounded up.
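Another option worth knowing about (not from the original answer; it assumes NumPy >= 1.14) is np.format_float_positional, which renders the shortest exact positional form, so you get neither scientific notation nor a hard-coded precision:
import numpy as np

# trim='-' drops trailing zeros and the trailing decimal point
[np.format_float_positional(x, trim='-') for x in s]
# ['656', '673', '624', '1325', '0.0000008494']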

contract of pandas.DataFrame.equals

I have a simple test case of a function which returns a df that can potentially contain NaN. I was testing if the output and expected output were equal.
>>> output
Out[1]:
r t ts tt ttct
0 2048 30 0 90 1
1 4096 90 1 30 1
2 0 70 2 65 1
[3 rows x 5 columns]
>>> expected
Out[2]:
r t ts tt ttct
0 2048 30 0 90 1
1 4096 90 1 30 1
2 0 70 2 65 1
[3 rows x 5 columns]
>>> output == expected
Out[3]:
r t ts tt ttct
0 True True True True True
1 True True True True True
2 True True True True True
However, I can't simply rely on the == operator because of NaNs. I was under the impression that the appropriate way to resolve this was by using the equals method. From the documentation:
pandas.DataFrame.equals
DataFrame.equals(other)
Determines if two NDFrame objects contain the same elements. NaNs in the same location are considered equal.
Nonetheless:
>>> expected.equals(output)
Out[4]: False
A little digging around reveals the difference in the frames:
>>> output._data
Out[5]:
BlockManager
Items: Index([u'r', u't', u'ts', u'tt', u'ttct'], dtype='object')
Axis 1: Int64Index([0, 1, 2], dtype='int64')
FloatBlock: [r], 1 x 3, dtype: float64
IntBlock: [t, ts, tt, ttct], 4 x 3, dtype: int64
>>> expected._data
Out[6]:
BlockManager
Items: Index([u'r', u't', u'ts', u'tt', u'ttct'], dtype='object')
Axis 1: Int64Index([0, 1, 2], dtype='int64')
IntBlock: [r, t, ts, tt, ttct], 5 x 3, dtype: int64
Force that output float block to int, or force the expected int block to float, and the test passes.
Obviously, there are different senses of equality, and the sort of test that DataFrame.equals performs could be useful in some cases. Nonetheless, the disparity between == and DataFrame.equals is frustrating to me and seems like an inconsistency. In pseudo-code, I would expect its behavior to match:
(self.index == other.index).all() \
and (self.columns == other.columns).all() \
and (self.values.fillna(SOME_MAGICAL_VALUE) == other.values.fillna(SOME_MAGICAL_VALUE)).all().all()
However, it doesn't. Am I wrong in my thinking, or is this an inconsistency in the Pandas API? Moreover, what IS the test I should be performing for my purposes, given the possible presence of NaN?
.equals() does just what it says. It tests for exact equality among elements, positioning of NaNs (and NaTs), dtype equality, and index equality. Think of this as a df is df2 type of test, but they don't have to actually be the same object; in other words, df.equals(df.copy()) IS always True.
Your example fails because different dtypes are not equal (though they may be equivalent). So you can use com.array_equivalent for this, or (df == df2).all().all() if you don't have NaNs.
array_equivalent is a replacement for np.array_equal, which is broken for NaN positional detection (and object dtypes).
It is mostly used internally. That said, if you would like an enhancement for equivalence (e.g. the elements are equivalent in the == sense and NaN positions match), please open an issue on GitHub (and even better, submit a PR!).
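For the asker's purpose, a later addition to pandas (not part of this answer) covers it directly: pd.testing.assert_frame_equal treats NaNs in matching positions as equal and, with check_dtype=False, accepts equal values in different but comparable dtypes. A sketch:
import pandas as pd

output = pd.DataFrame({'r': [2048.0, 4096.0, 0.0], 't': [30, 90, 70]})
expected = pd.DataFrame({'r': [2048, 4096, 0], 't': [30, 90, 70]})

output.equals(expected)   # False: r is float64 in one frame, int64 in the other

# raises AssertionError on any mismatch; passes here
pd.testing.assert_frame_equal(output, expected, check_dtype=False)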
In a unit-test context, I used a workaround, digging into the MagicMock instance and comparing the captured DataFrame the same way:
assert mock_instance.call_count == 1
call_args = mock_instance.call_args[0]
call_kwargs = mock_instance.call_args[1]
pd.testing.assert_frame_equal(call_kwargs['dataframe'], pd.DataFrame())
