pandas DataFrames have some functions for computing across rows, such as diff (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.diff.html).
However, diff only works with numeric data (or at least objects that support the - operation).
Is there a way to perform a diff between strings and return a boolean indicating whether the strings are equal?
For example:
>>> s = pd.Series(list("ABCCDEF"))
>>> s.str_diff()
0 NaN
1 False
2 False
3 True
4 False
5 False
6 False
dtype: bool
Thanks to Quang Hoang for pointing out the answer.
You just need to create a new Series or DataFrame with the rows shifted and then compare.
>>> s = pd.Series(list("ABBCDDEF"))
# If you search for strings that are different
>>> s.ne(s.shift())
# If you search for strings that are equal
>>> s.eq(s.shift())
0 False
1 False
2 True
3 False
4 False
5 True
6 False
7 False
dtype: bool
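One caveat: shift() puts a NaN in row 0, so with ne the first row always comes back True even though there is nothing to compare it against. A minimal sketch that masks the first row back to NaN, mirroring diff's leading NaN:
import pandas as pd

s = pd.Series(list("ABBCDDEF"))
# Compare each element to the previous one, then blank out the
# first row, which has no predecessor to compare against.
equal_prev = s.eq(s.shift()).mask(s.shift().isna())
print(equal_prev)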
So I took the mean of a pandas DataFrame column that contains boolean values. I've done this multiple times in the past and understood it would return the proportion that is True. But when I wrote it in this particular instance, it didn't work: it returns the proportion that is False, and not only that, the denominator it uses doesn't seem to relate to anything. I have no idea where it pulls the denominator from. I discovered it works the way I want when I remove the second line of code (datadf = datadf[1:]):
# get current row value minus previous row value and returns True if > 0
datadf['increase'] = datadf.index.map(lambda x: datadf.loc[x]['price'] - datadf.loc[x-1]['price'] > 0 if x > 0 else None)
# remove first row because it gives 'None'
datadf = datadf[1:]
# calculate proportion that is True
accretionscore = datadf['increase'].mean()
This is the output
date price increase
1 2020-09-28 488.51 True
2 2020-09-29 489.33 True
3 2020-09-30 490.43 True
4 2020-10-01 499.51 True
5 2020-10-02 478.99 False
correct value: 0.8
value given: 0.2
When I try adding another sample, things get weirder:
date price increase
1 2020-09-27 479.78 False
2 2020-09-28 488.51 True
3 2020-09-29 489.33 True
4 2020-09-30 490.43 True
5 2020-10-01 499.51 True
6 2020-10-02 478.99 False
correct value: 0.6666666666666666
value given: 0.16666666666666666
They don't even add up to 1!
I'm so confused. Can anyone tell me what is going on? How does taking out the second line fix the problem?
Hint: if you want to convert from boolean to int, you can just use:
datadf['increase'] = datadf['increase'].astype(int)
and this way things will work fine.
If we run your code, you can see that datadf['increase'] is an object column instead of a boolean one, so taking the mean of it most likely converts the values in some unexpected way; basically something weird:
import pandas as pd
datadf = pd.DataFrame({'price':[470,488.51,489.33,490.43,499.51,478.99]})
datadf['increase'] = datadf.index.map(lambda x: datadf.loc[x]['price'] - datadf.loc[x-1]['price'] > 0 if x > 0 else None)
datadf['increase']
Out[8]:
0 None
1 True
2 True
3 True
4 True
5 False
Name: increase, dtype: object
datadf['increase'].dtype
dtype('O')
From what I can see, you want True / False on whether the row is larger than the preceding one, so do:
datadf['increase'] = datadf.price > datadf.price.shift(1)
datadf['increase'].dtype
dtype('bool')
And we just omit the first row by doing:
datadf['increase'][1:].mean()
0.8
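Putting the pieces together, a minimal end-to-end sketch (same hypothetical price data as above):
import pandas as pd

datadf = pd.DataFrame({'price': [470, 488.51, 489.33, 490.43, 499.51, 478.99]})
# The vectorised comparison yields a true bool column, not object dtype.
datadf['increase'] = datadf['price'] > datadf['price'].shift(1)
# Skip the first row (it has no previous price) and take the True-proportion.
print(datadf['increase'][1:].mean())  # 0.8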
I want to select values from a dataframe such as:
Vendor_1 Vendor_2 Vendor_3
0 1 0 0
1 0 20 0
2 0 0 300
3 4 0 0
4 0 50 0
5 0 0 500
The values I want to keep from Vendor_1, 2, 3 are all inside separate lists, i.e. v_1, v_2, v_3. For example, say v_1 = [1], v_2 = [20], v_3 = [500], meaning I want only these rows to stay.
I've tried something like:
df = df[(df['Vendor_1'].isin(v_1)) & (df['Vendor_2'].isin(v_2)) & ... ]
This gives me an empty dataframe. Is the problem with the above logic, or is it that no rows exist with these constraints (highly unlikely in my real dataframe)?
Cheers
EDIT:
OK, so I've realised a fundamental difference between my example and what my df actually looks like: if there is a value for Vendor_1, then Vendor_2 and Vendor_3 must be 0, etc. So my logic with the isin chain doesn't make sense, right? I'll update the example df.
So I feel like I need to make 3 subsets and then merge them or something?
isin accepts a dictionary:
d = {
    'Vendor_1': [1],
    'Vendor_2': [20],
    'Vendor_3': [500]
}
df.isin(d)
Output:
Vendor_1 Vendor_2 Vendor_3
0 True False False
1 False True False
2 False False False
3 False False False
4 False False False
5 False False True
And then depending on your logic, you want to check for any or all:
df[df.isin(d).any(axis=1)]
Output:
Vendor_1 Vendor_2 Vendor_3
0 1 0 0
1 0 20 0
5 0 0 500
If you used all instead, you would require that Vendor_1=1, Vendor_2=20, and Vendor_3=500 all occur in the same row, and only such rows would be kept.
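Given the constraint from the edit (only one vendor column is nonzero per row), any(axis=1) is exactly the disjunction you need, so there is no need to build three subsets and merge them. A minimal sketch that builds the dictionary from the question's lists (the names v_1, v_2, v_3 are taken from the question):
import pandas as pd

df = pd.DataFrame({'Vendor_1': [1, 0, 0, 4, 0, 0],
                   'Vendor_2': [0, 20, 0, 0, 50, 0],
                   'Vendor_3': [0, 0, 300, 0, 0, 500]})
v_1, v_2, v_3 = [1], [20], [500]

# Map each column to its list of allowed values, then keep rows where
# any column matches its own list.
d = {'Vendor_1': v_1, 'Vendor_2': v_2, 'Vendor_3': v_3}
print(df[df.isin(d).any(axis=1)])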
The example you're giving should work unless there are effectively no rows that match that condition.
Those expressions are a bit tricky with the parens so I'd rather split the line in two for easier debugging:
mask = (df['Vendor_1'].isin(v_1)) & (df['Vendor_2'].isin(v_2))
# sanity check that the mask is selecting something
assert mask.any()
df = df[mask]
Note that you must have parentheses around the comparisons joined by & because of operator precedence rules.
How can I check if a given value is NaN?
e.g. if (a == np.NaN) (doesn't work)
Please note that:
NumPy's isnan method throws errors on data types like strings.
The pandas docs only provide methods to drop rows containing NaNs, or ways to check if/when a DataFrame contains NaNs. I'm asking about checking if a specific value is NaN.
Relevant Stack Overflow questions and Google search results seem to be about checking "if any value is NaN" or "which values in a DataFrame are NaN".
There must be a clean way to check if a given value is NaN?
You can use the innate property that NaN != NaN,
so a == a will return False if a is NaN.
This works even for strings.
Example:
In[52]:
s = pd.Series([1, np.NaN, '', 1.0])
s
Out[52]:
0 1
1 NaN
2
3 1
dtype: object
for val in s:
    print(val == val)
True
False
True
True
This can be done in a vectorised manner:
In[54]:
s==s
Out[54]:
0 True
1 False
2 True
3 True
dtype: bool
but you can still use the method isnull on the whole series:
In[55]:
s.isnull()
Out[55]:
0 False
1 True
2 False
3 False
dtype: bool
UPDATE
As noted by @piRSquared, None == None returns True, so the a == a trick won't flag None, whereas pd.isnull(None) returns True. So depending on whether you want to treat None as NaN, you can still use == for comparison, or pd.isnull if you want to treat None as NaN.
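A small sketch of the difference on plain scalars:
import numpy as np
import pandas as pd

for val in [np.nan, None, '', 1.0]:
    # val != val is True only for NaN; pd.isnull also flags None.
    print(repr(val), val != val, pd.isnull(val))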
Pandas has isnull, notnull, isna, and notna
These functions work for arrays or scalars.
Setup
a = np.array([[1, np.nan],
              [None, '2']])
Pandas functions
pd.isna(a)
# same as
# pd.isnull(a)
array([[False,  True],
       [ True, False]])
pd.notnull(a)
# same as
# pd.notna(a)
array([[ True, False],
       [False,  True]])
DataFrame (or Series) methods
b = pd.DataFrame(a)
b.isnull()
# same as
# b.isna()
0 1
0 False True
1 True False
b.notna()
# same as
# b.notnull()
0 1
0 True False
1 False True
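Since these functions also accept scalars, a quick sketch of the scalar form:
import numpy as np
import pandas as pd

# Scalar form: handy when checking a single value rather than a frame.
print(pd.isna(np.nan))   # True
print(pd.isna(None))     # True
print(pd.notna('text'))  # True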
I have a pandas series, for example: x = pandas.Series([-1,20,"test"]).
Now I would like to test whether -1 is contained in x without looping over the whole series. I could convert the whole series to string and then test if "-1" in x, but sometimes I have -1.0 and sometimes -1, and so on, so this is not a good choice.
Is there another way to approach this?
What about
x.isin([-1])
output:
0 True
1 False
2 False
dtype: bool
Or if you want to have a count of how many instances:
x.isin([-1]).sum()
Output:
1
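And if you only need a single yes/no answer rather than the elementwise mask, you can reduce it with any():
import pandas as pd

x = pd.Series([-1, 20, "test"])
# Collapse the elementwise membership mask to one boolean.
print(x.isin([-1]).any())  # True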
I think you can do something like this to handle data that mixes string-like and integer-like values. A pandas Series holds a single datatype, so a mixed Series like this falls back to object dtype.
x = pd.Series([-1,20,"test","-1.0"])
print(x)
0 -1
1 20
2 test
3 -1.0
dtype: object
(pd.to_numeric(x, errors='coerce') == -1).sum()
Note: any value that cannot be cast to a number becomes NaN.
Output
2
If you just want to see if a -1 appears in x then you can use
(pd.to_numeric(x, errors='coerce') == -1).sum() > 0
Output:
True
I am attempting to search for matching values within a range within a given uncertainty in a pandas dataframe. For instance, if I have a dataframe:
A B C
0 12 12.6 111.20
1 14 23.4 112.20
2 16 45.6 112.30
3 18 56.6 112.40
4 27 34.5 121.60
5 29 65.2 223.23
6 34 45.5 654.50
7 44 65.6 343.50
How can I search for values that match 112.6 ± 0.4 without having to create long and difficult criteria like:
TargetVal_Max= 112.6+0.4
TargetVal_Min= 112.6-0.4
Basically, I want to create a "buffer window" so that all values falling inside the window are returned. I have the uncertainties package, but have yet to get it working for this.
Optimally, I'd like to be able to return all index values that match a value in both C and B within a given error range.
Edit
As pointed out by @MaxU, the np.isclose function works very well if you know the exact number. But is it possible to match a list of values, e.g. if I had a second dataframe and wanted to see whether the values in C from one match the values in C of the second dataframe within a tolerance? I have attempted to put them into a list and do it this way, but I run into problems when attempting it for more than one value at a time.
TEST= Dataframe_2["C"]
HopesNdreams = sample[sample["C"].apply(np.isclose,b=TEST, atol=1.0)]
Edit 2
I found through trying a couple of different work arounds that I can just do:
TEST1 = Dataframe_2["C"].tolist()
for i in TEST1:
    HopesNdreams = sample[sample["C"].apply(np.isclose, b=i, atol=1.0)]
And this returns the hits for the given column. Using the logic set forth in the first answer, I think this will work very well for what I need it to. Are there any hangups that I don't see with this method?
Cheers and thanks for the help!
IIUC you can use the np.isclose() function:
In [180]: df[['B','C']].apply(np.isclose, b=112.6, atol=0.4)
Out[180]:
B C
0 False False
1 False True
2 False True
3 False True
4 False False
5 False False
6 False False
7 False False
In [181]: df[['B','C']].apply(np.isclose, b=112.6, atol=0.4).any(axis=1)
Out[181]:
0 False
1 True
2 True
3 True
4 False
5 False
6 False
7 False
dtype: bool
In [182]: df[df[['B','C']].apply(np.isclose, b=112.6, atol=0.4).any(axis=1)]
Out[182]:
A B C
1 14 23.4 112.2
2 16 45.6 112.3
3 18 56.6 112.4
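For the follow-up in Edit 2 (matching against a list of target values), one hangup with the loop shown there is that HopesNdreams is overwritten on every iteration, so only the last target's matches survive. A sketch that ORs the per-target masks together instead (sample data and names assumed from the question):
import numpy as np
import pandas as pd

sample = pd.DataFrame({'C': [111.2, 112.2, 112.3, 112.4, 121.6, 223.23]})
targets = [112.6, 121.5]  # e.g. Dataframe_2['C'].tolist()

# Build one boolean mask per target value and combine with logical OR.
mask = np.zeros(len(sample), dtype=bool)
for t in targets:
    mask |= np.isclose(sample['C'], t, atol=1.0)

print(sample[mask])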
Use Series.between() (note the signature is between(left, right) with left <= right):
df['C'].between(112.6 - .4, 112.6 + .4)