Let's say I have a dataframe df that looks like this:
   irrelevant  location
0           1         0
1           2         0
2           3         1
3           4         3
How do I create a new true/false column "neighbor" to indicate whether the value in "location" plus or minus 1 exists anywhere else in the "location" column? Such that:
   irrelevant  location  neighbor
0           1         0      True
1           2         0      True
2           3         1      True
3           4         3     False
The last row would be False, because neither 2 nor 4 appears anywhere in the df.location column.
I've tried these:
>>> df['neighbor']=np.where((df.location+1 in df.location.unique())|(df.location-1 in df.x.unique()), True, False)
ValueError: Lengths must match to compare
>>> df['tmp']=np.where((df.x+1 in df.x.tolist())|(df.x-1 in df.x.tolist()), 'true', 'false')
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Alternatively, thanks in advance for help directing me to earlier instances of this question being asked (I don't seem to have the right vocabulary to find them).
To find a neighbor anywhere in the column, build an array of all candidate neighbor values, then check membership with isin.
import numpy as np
vals = np.unique([df.location+1, df.location-1])
#array([-1, 0, 1, 2, 4], dtype=int64)
df['neighbor'] = df['location'].isin(vals)
#    irrelevant  location  neighbor
# 0           1         0      True
# 1           2         0      True
# 2           3         1      True
# 3           4         3     False
Just because, this is also possible with pd.merge_asof, setting a tolerance to find the neighbors. We assign a value of True, which is brought in by the merge if a neighbor exists. Otherwise it's left NaN, which we fill with False after the merge.
(pd.merge_asof(df,
               df[['location']].assign(neighbor=True),
               on='location',
               allow_exact_matches=False,  # Don't match with same value
               direction='nearest',        # Either direction
               tolerance=1)                # Within 1, inclusive
   .fillna(False))
You just need a little fix:
df['neighbor'] = np.where((df['location']+1).isin(df['location'].unique())
                          | (df['location']-1).isin(df['location'].unique()),
                          True, False)
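As a side note, the np.where wrapper is redundant here, since the condition is already a boolean Series; a minimal sketch of the same fix without it:
vals = df['location'].unique()
df['neighbor'] = (df['location'] + 1).isin(vals) | (df['location'] - 1).isin(vals)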
I have a problem coding a loop to subset a Dataframe in Python.
This is my first post on Stack Overflow and I started to code a few months ago, so I am sorry if I am doing something wrong! I have looked all over the web for days now but couldn't find an answer (my keywords might have been poorly chosen).
To give some context, here is how I obtained my df from a csv file:
#Library
import pandas as pd
import numpy as np
#Assign spreadsheet filename and read file into a DataFrame
file_20 = '/Users/cortana/Desktop/Projet stage/DAT/dat_clean/donnees_assemblees_20.csv'
df_20_initial = pd.read_csv(file_20, sep=';', usecols=[0, 2, 3])
#Create dictionary with table names as keys
tables_names_20 = pd.DataFrame.dropna(df_20_initial.iloc[:,[0]])
tables_names_20 = tables_names_20.set_index('20').T.to_dict()
#Slice the global dataframe and store the subsets into the dictionary as values
df_20_initial['separators'] = df_20_initial['time'].isna() #add a new column that checks for missing values (separators)
print(df_20_initial)
Here is what my df looks like:
       20      time  velocity  separators
0    P1S1  6.158655  0.136731       False
1     NaN  6.179028  0.244889       False
2     NaN  6.199253  0.386443       False
3     NaN  6.219323  0.571861       False
4     NaN  6.239505  0.777680       False
..    ...       ...       ...         ...
520   NaN  7.008377  1.423408       False
521   NaN  7.028759  1.180113       False
522   NaN  7.048932  0.929300       False
523   NaN  7.068993  0.673909       False
524   NaN  7.089557  0.413527       False
[525 rows x 4 columns]
Based on the boolean values in the "separators" column, I would like to create new DataFrames containing the values of the "time" and "velocity" columns, sliced wherever the "separators" value is True.
To do so, I have unsuccessfully tried to code the following loop:
for lab, row in df_20_initial.iterrows():
    if df_20_initial.iloc[:, 3] == False:
        P1S1 = df_20_intermediate[['time', 'velocity']]
    else:
        break
... and got this error message from Python:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Any advice is welcome, and thank you all in advance for your time!
For my experiments I used your DataFrame with separators set to True in some rows:
     20      time  velocity  separators
0  P1S1  6.158655  0.136731       False
1   NaN  6.179028  0.244889       False
2   NaN  6.199253  0.386443       False
3   NaN  6.219323  0.571861        True
4   NaN  6.239505  0.777680       False
5   NaN  7.008377  1.423408       False
6   NaN  7.028759  1.180113       False
7   NaN  7.048932  0.929300        True
8   NaN  7.068993  0.673909       False
9   NaN  7.089557  0.413527       False
I assumed that the separators column is of bool type.
To generate a list of chunks you can use e.g. the following list comprehension:
dfList = [chunk[['time', 'velocity']] for _, chunk in
          df_20_initial.groupby(df_20_initial.separators.cumsum())]
Now when you e.g. print dfList[1] you will get:
       time  velocity
3  6.219323  0.571861
4  6.239505  0.777680
5  7.008377  1.423408
6  7.028759  1.180113
But if you want to drop separator rows, run:
dfList2 = [chunk[~chunk.separators][['time', 'velocity']] for _, chunk in
           df_20_initial.groupby(df_20_initial.separators.cumsum())]
(from each chunk leave only rows with separators == False).
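And if, instead of a list, you want the chunks in a dict keyed by chunk number (a small sketch along the same lines; dfDict is a name I made up):
dfDict = {k: chunk[['time', 'velocity']] for k, chunk in
          df_20_initial.groupby(df_20_initial.separators.cumsum())}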
Pandas is really good at Boolean slices. If I understand your question correctly, I think all you need is:
new_df = df_20_initial[df_20_initial['separators']]
If you want to remove the 'separators' column from the output, you can just select the remaining columns like so:
new_df = df_20_initial[df_20_initial['separators']][['time', 'velocity']]
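Note that this keeps only the rows where separators is True; if you want the opposite (drop the separator rows and keep the rest), invert the mask with ~:
new_df = df_20_initial[~df_20_initial['separators']][['time', 'velocity']]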
I have a dataframe with 2 columns and I need to add a 3rd column 'start'. However, my code for some reason doesn't work and I am not sure why. Here is my code:
df.loc[df.type=='C', 'start']= min(-1+df['dq']-3,4)
df.loc[df.type=='A', 'start']= min(-3+df['dq']-3,4)
df.loc[df.type=='B', 'start']= min(-3+df['dq']-5,4)
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
and the dataset looks like this:
type dq
A 3
A 4
B 8
C 3
The error is being raised because the first argument you pass to min() is a Series and the second (4) is an int: to pick the smaller one, min() compares them, and comparing a Series with an int yields a boolean Series whose truth value is ambiguous.
Since you're using min to replace values greater than 4 with 4, you can just do that once at the end using where:
df.loc[df.type=='C', 'start'] = -1+df['dq']-3
df.loc[df.type=='A', 'start'] = -3+df['dq']-3
df.loc[df.type=='B', 'start'] = -3+df['dq']-5
df["start"] = df["start"].where(df["start"]<4,other=4)
>>> df
  type  dq  start
0    A   3     -3
1    A   4     -2
2    B   8      0
3    C   3     -1
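As an aside, numpy's element-wise np.minimum sidesteps the min() problem entirely; a sketch with the same offsets folded in (dq-4, dq-6, dq-8, as above):
import numpy as np
df.loc[df.type=='C', 'start'] = np.minimum(df['dq'] - 4, 4)
df.loc[df.type=='A', 'start'] = np.minimum(df['dq'] - 6, 4)
df.loc[df.type=='B', 'start'] = np.minimum(df['dq'] - 8, 4)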
Another (perhaps cleaner) way of getting your column would be to use numpy.select, like so:
import numpy as np
df["start"] = np.select([df["type"]=="A", df["type"]=="B", df["type"]=="C"],
[df['dq']-6, df["dq"]-8, df["dq"]-4])
df["start"] = df["start"].where(df["start"]<4, 4)
You cannot pass a Series to the built-in min alongside a scalar. Instead, you can do:
s = (df['dq'] - df['type'].map({'A': 6, 'B': 8, 'C': 4}))
df['start'] = s.where(s<4, 4)
output:
  type  dq  start
0    A   3     -3
1    A   4     -2
2    B   8      0
3    C   3     -1
So I took the mean of a pandas DataFrame column that contains boolean values. I've done this multiple times in the past and understood that it would return the proportion that is True. But when I wrote it in this particular instance, it didn't work: it returns the proportion that is False, and not only that, the denominator it uses doesn't seem to relate to anything. I have no idea where it pulls the denominator from to calculate the proportion value. I discovered it works the way I want it to when I remove the second line of code (datadf = datadf[1:]).
# get current row value minus previous row value and returns True if > 0
datadf['increase'] = datadf.index.map(lambda x: datadf.loc[x]['price'] - datadf.loc[x-1]['price'] > 0 if x > 0 else None)
# remove first row because it gives 'None'
datadf = datadf[1:]
# calculate proportion that is True
accretionscore = datadf['increase'].mean()
This is the output
date price increase
1 2020-09-28 488.51 True
2 2020-09-29 489.33 True
3 2020-09-30 490.43 True
4 2020-10-01 499.51 True
5 2020-10-02 478.99 False
correct value: 0.8
value given: 0.2
When I try adding another sample, things get weirder:
date price increase
1 2020-09-27 479.78 False
2 2020-09-28 488.51 True
3 2020-09-29 489.33 True
4 2020-09-30 490.43 True
5 2020-10-01 499.51 True
6 2020-10-02 478.99 False
correct value: 0.6666666666666666
value given: 0.16666666666666666
they don't even add up to 1!
I'm so confused. Can anyone tell me what is going on? How does taking out the second line fix the problem?
Hint: if you want to convert from boolean to int, you can just use:
datadf['increase'] = datadf['increase'].astype(int)
and this way things will work fine.
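For example, with the five-row frame from the question (a sketch; the cast makes mean treat the values as numbers):
datadf = datadf[1:]                                     # drop the None row
accretionscore = datadf['increase'].astype(int).mean()  # 0.8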
If we run your code, you can see that datadf['increase'] is an object column instead of a boolean one, so taking the mean of it doesn't treat the values as booleans; basically, something weird happens:
import pandas as pd
datadf = pd.DataFrame({'price':[470,488.51,489.33,490.43,499.51,478.99]})
datadf['increase'] = datadf.index.map(lambda x: datadf.loc[x]['price'] - datadf.loc[x-1]['price'] > 0 if x > 0 else None)
datadf['increase']
Out[8]:
0 None
1 True
2 True
3 True
4 True
5 False
Name: increase, dtype: object
datadf['increase'].dtype
dtype('O')
From what I can see, you want True/False on whether each row is larger than the preceding one, so do:
datadf['increase'] = datadf.price > datadf.price.shift(1)
datadf['increase'].dtype
dtype('bool')
And we just omit the first row by doing:
datadf['increase'][1:].mean()
0.8
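Equivalently, diff expresses the same comparison (a minor variant, not required):
datadf['increase'] = datadf['price'].diff() > 0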
I have a table:
   tel  size      1      2     3     4
0  123     1   Baby   Baby  None  None
1  234     1  Shave  Shave  None  None
2  222     1   Baby   Baby  None  None
3  333     1  Shave  Shave  None  None
I want to check whether the values in columns 1, 2, 3, 4, ... are partly equal, using 2 loops:
x = df_map.iloc[i,2:]
y = df_map.iloc[j,2:]
so df_map.iloc[0, 2:] should be equal to df_map.iloc[2, 2:], and df_map.iloc[1, 2:] to df_map.iloc[3, 2:].
I tried:
x == y
and
y.eq(x)
but it returns error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
If I use (x == y).all() or (x == y).any(), it returns the wrong result.
I need something like:
if x== y:
counter += 1
Update:
The problem was the None values. I used fillna('') and then (x == y).all(); fillna('') is needed because in pandas None does not compare equal to None.
Use numpy broadcasting to evaluate ==,
all(-1) to make sure the whole row matches,
np.fill_diagonal because we don't need self-matches,
and np.where to find where the matches are.
v = df.fillna('').values[:, 2:]
match = ((v[None, :] == v[:, None]).all(-1))
np.fill_diagonal(match, False)
i, j = np.where(match)
pd.Series(i, j)
2 0
3 1
0 2
1 3
dtype: int64
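If, as in your loop, all you need is a count of matching row pairs, note that each pair appears twice in (i, j), so (a sketch):
counter = len(i) // 2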
The rows of a pandas DataFrame are backed by numpy arrays, so you can only compare them element-wise (column by column) unless you loop again over the results, which are themselves arrays:
import numpy as np
x = df_map.iloc[i, 2:]
y = df_map.iloc[j, 2:]
np.equal(y.values, x.values)
I am facing a problem with a rather simple command. I have a DataFrame and want to delete the respective row if the value in column1 (in that row) exceeds e.g. 5.
First step, the if-condition:
if df['column1'] > 5:
Using this command, I always get the following ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()
Do you have an idea what this could be about?
Second step (drop row):
How do I specify that Python shall delete the entire row? Do I have to work with a loop or is there a simple solution such as df.drop(df.index[?]).
I am still rather unexperienced with Python and would appreciate any support and suggestions!
The reason you're getting the error, is because df['column1'] > 5 returns a series of booleans, equal in length to column1, and a Series can't be true or false, i.e. "The truth value of a Series is ambiguous".
That said, if you just need to select out rows fulfilling a specific condition, then you can use the returned series as a boolean index, for example
>>> from numpy.random import randn
>>> from pandas import DataFrame
#Create a data frame of 10 rows by 5 cols
>>> D = DataFrame(randn(10,5))
>>> D
0 1 2 3 4
0 0.686901 1.714871 0.809863 -1.162436 1.757198
1 -0.071436 -0.898714 0.062620 1.443304 -0.784341
2 0.597807 -0.705585 -0.019233 -0.552494 -1.881875
3 1.313344 -1.146257 1.189182 0.169836 -0.186611
4 0.081255 -0.168989 1.181580 0.366820 2.999468
5 -0.221144 1.222413 1.199573 0.988437 0.378026
6 1.481952 -2.143201 -0.747700 -0.597314 0.428769
7 0.006805 0.876228 0.884723 -0.899379 -0.270513
8 -0.222297 1.695049 0.638627 -1.500652 -1.088818
9 -0.646145 -0.188199 -1.363282 -1.386130 1.065585
#Making a comparison test against a whole column yields a boolean series
>>> D[2] >= 0
0 True
1 True
2 False
3 True
4 True
5 True
6 False
7 True
8 True
9 False
Name: 2, dtype: bool
#Which can be used directly to select rows, like so
>>> D[D[2] >=0]
#note rows 2, 6 and 9 are now missing.
0 1 2 3 4
0 0.686901 1.714871 0.809863 -1.162436 1.757198
1 -0.071436 -0.898714 0.062620 1.443304 -0.784341
3 1.313344 -1.146257 1.189182 0.169836 -0.186611
4 0.081255 -0.168989 1.181580 0.366820 2.999468
5 -0.221144 1.222413 1.199573 0.988437 0.378026
7 0.006805 0.876228 0.884723 -0.899379 -0.270513
8 -0.222297 1.695049 0.638627 -1.500652 -1.088818
#if you want, you can make a new data frame out of the result
>>> N = D[D[2] >= 0]
>>> N
0 1 2 3 4
0 0.686901 1.714871 0.809863 -1.162436 1.757198
1 -0.071436 -0.898714 0.062620 1.443304 -0.784341
3 1.313344 -1.146257 1.189182 0.169836 -0.186611
4 0.081255 -0.168989 1.181580 0.366820 2.999468
5 -0.221144 1.222413 1.199573 0.988437 0.378026
7 0.006805 0.876228 0.884723 -0.899379 -0.270513
8 -0.222297 1.695049 0.638627 -1.500652 -1.088818
For more, see the pandas docs on boolean indexing; note that the dot syntax for column selection used in the docs only works for non-numeric column names, so in the example above D[D.2 >= 0] wouldn't work.
If you actually need to remove rows, then you would need to look into creating a deep copy DataFrame of only the specific rows. I'd have to dive quite deep into the docs to figure that out, because pandas tries its level best to do most things by reference, to avoid copying huge chunks of memory around.
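That said, for this particular case plain boolean indexing (or drop with the matching index labels) is enough; a minimal sketch, assuming the threshold of 5 from the question:
# keep only the rows where column1 does not exceed 5
df = df[df['column1'] <= 5]
# or, equivalently, drop by the index labels where the condition holds
df = df.drop(df.index[df['column1'] > 5])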