Pandas: check whether rows are partly equal

I have a table:
   tel  size      1      2     3     4
0  123     1   Baby   Baby  None  None
1  234     1  Shave  Shave  None  None
2  222     1   Baby   Baby  None  None
3  333     1  Shave  Shave  None  None
I want to check whether rows are partly equal, i.e. whether their values in columns 1, 2, 3, 4, ... match, using 2 loops:
x = df_map.iloc[i,2:]
y = df_map.iloc[j,2:]
so df_map.iloc[0,2:] should be equal to df_map.iloc[2,2:],
and df_map.iloc[1,2:] should be equal to df_map.iloc[3,2:].
I tried:
x == y
and
y.eq(x)
but it raises an error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
If I use (x==y).all() or (x==y).any(), it returns the wrong result.
I need something like:
if x == y:
    counter += 1
Update:
The problem was the None values. I used fillna('') and (x == y).all().
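The fix described in the update can be sketched as follows. The frame is rebuilt by hand from the table in the question, and itertools.combinations stands in for the two nested loops so that each pair of rows is visited once; both are assumptions made for illustration, not part of the original code.

```python
from itertools import combinations

import pandas as pd

# Hypothetical reconstruction of the table shown in the question.
df_map = pd.DataFrame({
    "tel": [123, 234, 222, 333],
    "size": [1, 1, 1, 1],
    1: ["Baby", "Shave", "Baby", "Shave"],
    2: ["Baby", "Shave", "Baby", "Shave"],
    3: [None, None, None, None],
    4: [None, None, None, None],
})

# Replace missing values so that they compare equal to each other.
filled = df_map.fillna("")

counter = 0
matches = []
for i, j in combinations(range(len(filled)), 2):
    x = filled.iloc[i, 2:]
    y = filled.iloc[j, 2:]
    # .all() collapses the element-wise comparison into a single boolean.
    if (x == y).all():
        counter += 1
        matches.append((i, j))

print(counter, matches)
```

Using combinations rather than two full loops avoids comparing a row with itself and counting each matching pair twice.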

fillna('') because missing values (None/NaN) do not compare equal in pandas
use numpy broadcasting to evaluate == for every pair of rows
all(-1) to make sure the whole row matches
np.fill_diagonal because we don't need self matches
np.where to find where the matches are
v = df.fillna('').values[:, 2:]
match = ((v[None, :] == v[:, None]).all(-1))
np.fill_diagonal(match, False)
i, j = np.where(match)
pd.Series(i, j)
2 0
3 1
0 2
1 3
dtype: int64
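For reference, here is a self-contained sketch of the same broadcasting steps. The frame is a hand-built stand-in for the table in the question, and the final deduplication of symmetric pairs is an added convenience, not part of the answer above.

```python
import numpy as np
import pandas as pd

# Hypothetical reconstruction of the table shown in the question.
df = pd.DataFrame({
    "tel": [123, 234, 222, 333],
    "size": [1, 1, 1, 1],
    1: ["Baby", "Shave", "Baby", "Shave"],
    2: ["Baby", "Shave", "Baby", "Shave"],
    3: [None, None, None, None],
    4: [None, None, None, None],
})

# Keep only the columns to compare, with missing values replaced by ''.
v = df.fillna("").to_numpy()[:, 2:]

# Compare every row with every other row; all(-1) requires a full-row match.
match = (v[None, :] == v[:, None]).all(-1)
np.fill_diagonal(match, False)  # ignore self matches

i, j = np.where(match)
# Each match appears twice, as (i, j) and (j, i); keep one copy of each pair.
pairs = sorted({tuple(sorted(p)) for p in zip(i.tolist(), j.tolist())})
print(pairs)
```

The intermediate boolean array has shape (rows, rows, columns), so this approach trades memory for speed and suits moderately sized frames.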

x and y are pandas Series backed by NumPy arrays, so comparing them gives an element-by-element result, which is itself another array. To get a single answer you have to reduce that result again (for example by looping over it or calling .all()).
import numpy as np
x = df_map.iloc[i, 2:]
y = df_map.iloc[j, 2:]
np.equal(y.values, x.values)

Related

pandas fill the column values with min function

I have a dataframe with 2 columns and I need to add a 3rd column, 'start'. However, my code doesn't work and I am not sure why. Here is my code:
df.loc[df.type=='C', 'start']= min(-1+df['dq']-3,4)
df.loc[df.type=='A', 'start']= min(-3+df['dq']-3,4)
df.loc[df.type=='B', 'start']= min(-3+df['dq']-5,4)
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
and the dataset looks like this:
type dq
A 3
A 4
B 8
C 3

The error is raised because the first argument you pass to min() is a Series and the second (4) is an int. The built-in min() compares them with <, which yields a boolean Series, and Python cannot reduce that to a single True or False.
Since you're using min to cap values at 4, you can apply the cap once at the end using where:
df.loc[df.type=='C', 'start'] = -1+df['dq']-3
df.loc[df.type=='A', 'start'] = -3+df['dq']-3
df.loc[df.type=='B', 'start'] = -3+df['dq']-5
df["start"] = df["start"].where(df["start"]<4,other=4)
>>> df
type dq start
0 A 3 -3
1 A 4 -2
2 B 8 0
3 C 3 -1
Another (perhaps cleaner) way of getting your column would be to use numpy.select, like so:
import numpy as np
df["start"] = np.select([df["type"]=="A", df["type"]=="B", df["type"]=="C"],
                        [df['dq']-6, df["dq"]-8, df["dq"]-4])
df["start"] = df["start"].where(df["start"]<4, 4)

You cannot use the built-in min with a Series. Instead, you can do:
s = (df['dq'] - df['type'].map({'A': 6, 'B': 8, 'C': 4}))
df['start'] = s.where(s<4, 4)
output:
type dq start
0 A 3 -3
1 A 4 -2
2 B 8 0
3 C 3 -1
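As a further sketch, the cap at 4 can also be written with an element-wise minimum instead of where. The offsets dictionary below mirrors the mapping used in the answer above; Series.clip and numpy.minimum are swapped in here as alternatives and were not part of the original answers.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"type": ["A", "A", "B", "C"], "dq": [3, 4, 8, 3]})

# Per-type amount subtracted from dq (same numbers as in the answers above).
offsets = {"A": 6, "B": 8, "C": 4}
raw = df["dq"] - df["type"].map(offsets)

# clip(upper=4) is the element-wise equivalent of min(value, 4).
df["start"] = raw.clip(upper=4)

# numpy.minimum does the same and, unlike the built-in min, works on a Series.
same = np.minimum(raw, 4)

print(df)
```

Both calls operate on the whole column at once, which is why they avoid the ambiguous truth value error.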

compare value of the current index to the value of next index in Pandas df

I'm trying to compare the value of the current index to the value of the next index in my pandas data frame. I'm able to access the value with iloc, but when I write an if condition to validate the value, it gives me an error.
Code I tried:
df = pd.DataFrame({'Col1': [2.5, 1.5, 3, 3, 4.8, 4]})
trend = list()
for k in range(len(df)):
    if df.iloc[k+1] > df.iloc[k]:
        trend.append('up')
    if df.iloc[k+1] < df.iloc[k]:
        trend.append('down')
    if df.iloc[k+1] == df.iloc[k]:
        trend.append('nochange')
dftrend = pd.DataFrame(trend)
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I tried assigning the iloc[k] value to a variable "current" with astype=int, but I am still unable to use "current" in my if condition. I would appreciate any help on how to resolve this.

You are getting the error because
df.iloc[k] gives you a pd.Series (a whole row), not a single value.
You can use, say, df.iloc[k,0] to get the Col1 value.

I converted that particular column into a list. Instead of working directly with the Series object returned by the dataframe, I prefer converting it to a list or numpy array first and then performing basic operations on it.
The corrected, working code is given below.
import pandas as pd
df = pd.DataFrame({'Col1': [2.5, 1.5, 3, 3, 4.8, 4]})
trend = list()
temp = df['Col1'].tolist()
print(temp)
for k in range(len(df)-1):
    if temp[k+1] > temp[k]:
        trend.append('up')
    if temp[k+1] < temp[k]:
        trend.append('down')
    if temp[k+1] == temp[k]:
        trend.append('nochange')
dftrend = pd.DataFrame(trend)
print(trend)

Here is a more pandas-like approach. We can get the difference of two consecutive elements of a series easily via pandas.DataFrame.diff:
import pandas as pd
df = pd.DataFrame({'Col1': [2.5, 1.5, 3 , 3 ,4.8 , 4 ]})
df_diff = df.diff()
Col1
0 NaN
1 -1.0
2 1.5
3 0.0
4 1.8
5 -0.8
Now you can apply a function elementwise that only distinguishes >0, <0 or ==0, using pandas.DataFrame.applymap (renamed to DataFrame.map in pandas 2.1, where applymap is deprecated):
def direction(x):
    if x > 0:
        return 'up'
    elif x < 0:
        return 'down'
    elif x == 0:
        return 'nochange'
df_diff.applymap(direction)
Col1
0 None
1 down
2 up
3 nochange
4 up
5 down
Finally, it's a design decision what should happen to the first value of the series. Here the NaN value doesn't fit any case. You can also treat it separately in direction, or omit it from your result by slicing.
Edit: The same as a one-liner:
df.diff().applymap(lambda x: 'up' if x > 0 else ('down' if x < 0 else ('nochange' if x == 0 else None)))

You can use this:
df['trend'] = np.where(df.Col1.shift().isnull(), "N/A", np.where(df.Col1 == df.Col1.shift(), "nochange", np.where(df.Col1 < df.Col1.shift(), "down", "up")))
Col1 trend
0 2.5 N/A
1 1.5 down
2 3.0 up
3 3.0 nochange
4 4.8 up
5 4.0 down
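The nested np.where calls can also be flattened. The sketch below uses numpy.select on the differences between consecutive rows; the "N/A" label for the first row follows the answer above, and the variable name delta is only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Col1": [2.5, 1.5, 3, 3, 4.8, 4]})

# Difference to the previous row; the first entry is NaN.
delta = df["Col1"].diff().to_numpy()

# Conditions are checked in order; NaN satisfies none of them,
# so the first row falls through to the default label.
df["trend"] = np.select(
    [delta > 0, delta < 0, delta == 0],
    ["up", "down", "nochange"],
    default="N/A",
)

print(df)
```

Listing the conditions and labels side by side makes it easier to add or reorder cases than with nested np.where calls.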

Check if dataframe value +/- 1 exists anywhere else in a given column

Let's say I have a dataframe df that looks like this:
irrelevant location
0 1 0
1 2 0
2 3 1
3 4 3
How do I create a new true/false column "neighbor" that indicates whether the value in "location" +/- 1 (plus or minus 1) exists anywhere else in the "location" column, such that:
irrelevant location neighbor
0 1 0 True
1 2 0 True
2 3 1 True
3 4 3 False
The last row would be false, because neither 2 nor 4 appear anywhere in the df.location column.
I've tried these:
>>> df['neighbor']=np.where((df.location+1 in df.location.unique())|(df.location-1 in df.x.unique()), True, False)
ValueError: Lengths must match to compare
>>> df['tmp']=np.where((df.x+1 in df.x.tolist())|(df.x-1 in df.x.tolist()), 'true', 'false')
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Alternatively, thanks in advance for help directing me to earlier instances of this question being asked (I don't seem to have the right vocabulary to find them).

To find a neighbor anywhere in the column, create a list of all neighbor values then check isin.
import numpy as np
vals = np.unique([df.location+1, df.location-1])
#array([-1, 0, 1, 2, 4], dtype=int64)
df['neighbor'] = df['location'].isin(vals)
# irrelevant location neighbor
#0 1 0 True
#1 2 0 True
#2 3 1 True
#3 4 3 False
For completeness, this is also possible with pd.merge_asof, setting a tolerance to find the neighbors. We assign a value of True, which is brought in by the merge if a neighbor exists. Otherwise it's left as NaN, which we fill with False after the merge. Note that merge_asof requires the on column to be sorted.
(pd.merge_asof(df,
               df[['location']].assign(neighbor=True),
               on='location',
               allow_exact_matches=False,    # Don't match with same value
               direction='nearest',          # Either direction
               tolerance=1)                  # Within 1, inclusive
   .fillna(False))

You just need a little fix:
df['neighbor']=np.where(((df['location']+1).isin(df['location'].unique()))|((df['location']-1).isin(df['location'].unique())), True, False)
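The attempts in the question fail because the in operator expects a single value on its left, while isin tests every element of a Series. A minimal sketch of the same fix without np.where, assuming the frame from the question (the short name loc is only for readability):

```python
import pandas as pd

df = pd.DataFrame({"irrelevant": [1, 2, 3, 4], "location": [0, 0, 1, 3]})

loc = df["location"]

# isin already returns a boolean Series, so the two checks
# can be combined directly with | (element-wise OR).
df["neighbor"] = (loc + 1).isin(loc) | (loc - 1).isin(loc)

print(df)
```

Since the result of isin is already boolean, wrapping it in np.where(..., True, False) is not needed.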

A For loop with embedded if statement to update a dataframe

I am a newcomer to python. I want to implement a "For" loop on the elements of a dataframe, with an embedded "if" statement.
Code:
import numpy as np
import pandas as pd
#Dataframes
x = pd.DataFrame([1,-2,3])
y = pd.DataFrame()
for i in x.iterrows():
    for j in x.iteritems():
        if x>0:
            y = x*2
        else:
            y = 0
With the previous loop, I want to go through each item in the x dataframe and generate a new dataframe y based on the condition in the "if" statement. When I run the code, I get the following error message.
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Any help would be much appreciated.

In pandas it is best to avoid loops if a vectorized solution exists:
x = pd.DataFrame([1,-2,3], columns=['a'])
y = pd.DataFrame(np.where(x['a'] > 0, x['a'] * 2, 0), columns=['b'])
print (y)
b
0 2
1 0
2 6
Explanation:
First compare the column with the value to get a boolean mask:
print (x['a'] > 0)
0 True
1 False
2 True
Name: a, dtype: bool
Then use numpy.where to set values by condition:
print (np.where(x['a'] > 0, x['a'] * 2, 0))
[2 0 6]
Finally, use the DataFrame constructor or create a new column:
x['new'] = np.where(x['a'] > 0, x['a'] * 2, 0)
print (x)
a new
0 1 2
1 -2 0
2 3 6

You can try this:
y = (x[(x > 0)]*2).fillna(0)
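A closely related sketch uses DataFrame.where, which keeps values where the condition holds and substitutes the given replacement elsewhere. This is offered as an alternative to the one-liner above, not as the original author's method; it avoids the intermediate NaN values, which would otherwise turn the integer column into floats.

```python
import pandas as pd

x = pd.DataFrame([1, -2, 3])

# Double every value, then keep the doubled value only where x > 0
# and use 0 everywhere else.
y = (x * 2).where(x > 0, 0)

print(y)
```

The condition x > 0 is evaluated for the whole frame at once, so no explicit loop or if statement is involved.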

Drop entire row subject to value in column

I am facing a problem with a rather simple command. I have a DataFrame and want to delete a row if the value in column1 (in this row) exceeds e.g. 5.
First step, the if-condition:
if df['column1'] > 5:
Using this command, I always get the following ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()
Do you have an idea what this could be about?
Second step (drop row):
How do I specify that Python should delete the entire row? Do I have to work with a loop or is there a simple solution such as df.drop(df.index[?])?
I am still rather inexperienced with Python and would appreciate any support and suggestions!

The reason you're getting the error is that df['column1'] > 5 returns a series of booleans, equal in length to column1, and a Series can't be true or false, i.e. "The truth value of a Series is ambiguous".
That said, if you just need to select out rows fulfilling a specific condition, then you can use the returned series as a boolean index, for example
>>> from numpy.random import randn
>>> from pandas import DataFrame
#Create a data frame of 10 rows by 5 cols
>>> D = DataFrame(randn(10,5))
>>> D
0 1 2 3 4
0 0.686901 1.714871 0.809863 -1.162436 1.757198
1 -0.071436 -0.898714 0.062620 1.443304 -0.784341
2 0.597807 -0.705585 -0.019233 -0.552494 -1.881875
3 1.313344 -1.146257 1.189182 0.169836 -0.186611
4 0.081255 -0.168989 1.181580 0.366820 2.999468
5 -0.221144 1.222413 1.199573 0.988437 0.378026
6 1.481952 -2.143201 -0.747700 -0.597314 0.428769
7 0.006805 0.876228 0.884723 -0.899379 -0.270513
8 -0.222297 1.695049 0.638627 -1.500652 -1.088818
9 -0.646145 -0.188199 -1.363282 -1.386130 1.065585
#Making a comparison test against a whole column yields a boolean series
>>> D[2] >= 0
0 True
1 True
2 False
3 True
4 True
5 True
6 False
7 True
8 True
9 False
Name: 2, dtype: bool
#Which can be used directly to select rows, like so
>>> D[D[2] >=0]
#note rows 2, 6 and 9 are now missing.
0 1 2 3 4
0 0.686901 1.714871 0.809863 -1.162436 1.757198
1 -0.071436 -0.898714 0.062620 1.443304 -0.784341
3 1.313344 -1.146257 1.189182 0.169836 -0.186611
4 0.081255 -0.168989 1.181580 0.366820 2.999468
5 -0.221144 1.222413 1.199573 0.988437 0.378026
7 0.006805 0.876228 0.884723 -0.899379 -0.270513
8 -0.222297 1.695049 0.638627 -1.500652 -1.088818
#if you want, you can make a new data frame out of the result
>>> N = D[D[2] >= 0]
>>> N
0 1 2 3 4
0 0.686901 1.714871 0.809863 -1.162436 1.757198
1 -0.071436 -0.898714 0.062620 1.443304 -0.784341
3 1.313344 -1.146257 1.189182 0.169836 -0.186611
4 0.081255 -0.168989 1.181580 0.366820 2.999468
5 -0.221144 1.222413 1.199573 0.988437 0.378026
7 0.006805 0.876228 0.884723 -0.899379 -0.270513
8 -0.222297 1.695049 0.638627 -1.500652 -1.088818
For more, see the Pandas docs on boolean indexing; note that the dot syntax for column selection used in the docs only works for non-numeric column names, so in the example above D[D.2 >= 0] wouldn't work.
If you actually need to remove rows, then you would need to look into creating a deep copy of the dataframe containing only the specific rows. I'd have to dive quite deep into the docs to figure that out, because pandas tries its level best to do most things by reference, to avoid copying huge chunks of memory around.
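For the second step in the question, here is a minimal sketch under the assumption that getting a new, filtered frame back is acceptable. The sample data and the names kept and dropped are made up for illustration; column1 and the threshold 5 come from the question.

```python
import pandas as pd

# Hypothetical sample data; only column1 matters for the filter.
df = pd.DataFrame({"column1": [1, 7, 3, 9], "other": list("abcd")})

# Option 1: keep the rows that do NOT exceed 5 via boolean indexing.
kept = df[df["column1"] <= 5]

# Option 2: look up the index labels of the offending rows and drop them.
dropped = df.drop(df.index[df["column1"] > 5])

print(kept)
```

Both options return a new frame and leave df untouched; assigning the result back to df replaces the original.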
