Assign values in Pandas series based on condition? - python

I have a dataframe df like
A  B
1  2
3  4
I then want to create 2 new series
t = pd.Series()
r = pd.Series()
I was able to assign values to t using the condition cond as below
t = "1+" + df.A.astype(str) + '+' + df.B.astype(str)
cond = df['A']<df['B']
t[cond] = "1+" + df.loc[cond,'B'].astype(str) + '+' + df.loc[cond,'A'].astype(str)
But I'm having problems with r. I just want r to contain values of 2 when cond is satisfied and 1 otherwise.
If I just try
r = 1
r[cond] = 2
Then I get TypeError: 'int' object does not support item assignment
I figure I could just run a for loop through df and check the cases in cond through each row of df, but I was wondering if Pandas offers a more efficient way instead?

You will laugh at how easy this is:
r = cond + 1
The reason is that cond is a boolean Series (True and False), and booleans evaluate to 1 and 0. Adding one coerces the booleans to integers, so True maps to 2 and False maps to 1.
df = pd.DataFrame({'A': [1, 3, 4],
                   'B': [2, 4, 3]})
cond = df['A'] < df['B']
>>> cond + 1
0    2
1    2
2    1
dtype: int64
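If you prefer to spell the mapping out explicitly, an equivalent sketch (np.where is my suggestion here, not part of the answer above; it assumes pd and the cond from the example):
import numpy as np
# Map True -> 2 and False -> 1 explicitly; wrapping in a Series keeps df's index.
r = pd.Series(np.where(cond, 2, 1), index=cond.index)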

When you assign 1 to r as in
r = 1
r now references the integer 1. So when you write r[cond] you're indexing an integer as if it were a Series.
You want to first create a Series of ones for r, the same size as cond. Something like:
r = pd.Series(np.ones(cond.shape))
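Putting that together, a minimal sketch of this approach (assumes the cond from the question; aligning on df's index is my addition):
import numpy as np
import pandas as pd

r = pd.Series(np.ones(cond.shape), index=cond.index)  # start with all ones
r[cond] = 2                                           # overwrite with 2 where the condition holds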

Related

Pandas: Determine if a string in one column is a substring of a string in another column

Consider these series:
>>> a = pd.Series('abc a abc c'.split())
>>> b = pd.Series('a abc abc a'.split())
>>> pd.concat((a, b), axis=1)
     0    1
0  abc    a
1    a  abc
2  abc  abc
3    c    a
>>> unknown_operation(a, b)
0    False
1     True
2     True
3    False
The desired logic is to determine if the string in the left column is a substring of the string in the right column. pd.Series.str.contains does not accept another Series, and pd.Series.isin checks if the value exists in the other series (not in the same row specifically). I'm interested to know if there's a vectorized solution (not using .apply or a loop), but it may be that there isn't one.
Let's try numpy's defchararray, which is vectorized:
from numpy.core.defchararray import find
find(df[1].values.astype(str), df[0].values.astype(str)) != -1
Out[740]: array([False, True, True, False])
IIUC,
df[1].str.split('', expand=True).eq(df[0], axis=0).any(axis=1) | df[1].eq(df[0])
Output:
0    False
1     True
2     True
3    False
dtype: bool
I tested various functions with a randomly generated DataFrame of 1,000,000 five-letter entries.
Running on my machine, the averages of 3 tests showed:
zip     > v_find > to_list > any    > apply
0.21s   > 0.79s  > 1s      > 3.55s  > 8.6s
Hence, I would recommend using zip:
[x[0] in x[1] for x in zip(df['A'], df['B'])]
or vectorized find (as proposed by BENY)
np.char.find(df['B'].values.astype(str), df['A'].values.astype(str)) != -1
My test setup:
import random
import string

n = 1_000_000  # 1,000,000 rows, as stated above

def generate_string(length):
    return ''.join(random.choices(string.ascii_uppercase + string.digits, k=length))

A = [generate_string(5) for x in range(n)]
B = [generate_string(5) for y in range(n)]
df = pd.DataFrame({"A": A, "B": B})

# The names below are just labels for the ranking above; note that `any` and `zip`
# shadow the builtins once assigned.
to_list = pd.Series([a in b for a, b in df[['A', 'B']].values.tolist()])
apply = df.apply(lambda s: s["A"] in s["B"], axis=1)
v_find = np.char.find(df['B'].values.astype(str), df['A'].values.astype(str)) != -1
any = df["B"].str.split('', expand=True).eq(df["A"], axis=0).any(axis=1) | df["B"].eq(df["A"])
zip = [x[0] in x[1] for x in zip(df['A'], df['B'])]

Python/ Pandas If statement inside a function explained

I have the following example and I cannot understand why it doesn't work.
import pandas as pd
d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
def balh(a, b):
    z = a + b
    if z.any() > 1:
        return z + 1
    else:
        return z

df['col3'] = balh(df.col1, df.col2)
Output:
   col1  col2  col3
0     1     3     4
1     2     4     6
My expected output would be to see 5 and 7, not 4 and 6, in col3, since 4 and 6 are greater than 1 and my intention is to add 1 if a + b is greater than 1.
The any method evaluates whether any element of the pandas.Series or pandas.DataFrame is True. A non-zero integer evaluates as True. So with if z.any() > 1 you are essentially comparing the True returned by the method with the integer 1.
You need to apply the condition directly to the pandas.Series, which returns a boolean pandas.Series to which you can safely apply the any method.
This will be the same for the all method.
def balh(a, b):
    z = a + b
    if (z > 1).any():
        return z + 1
    else:
        return z
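To see the difference concretely, a small sketch using the sums from the question (4 and 6):
z = pd.Series([4, 6])
z.any() > 1    # z.any() is True, and True > 1 is False
(z > 1).any()  # compare element-wise first, then reduce -> True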
As #arhr clearly explained, the issue was the incorrect use of z.any(), which returns True when there is at least one non-zero element in z. That resulted in True > 1, which evaluates to False.
A one line alternative to avoid the if statement and the custom function call would be the following:
df['col3'] = df.iloc[:, :2].sum(1).transform(lambda x: x + int(x > 1))
This takes the first two columns of the dataframe, sums the elements along each row, and transforms the resulting sums according to the lambda function.
The iloc can also be omitted because the dataframe is instantiated with only two columns col1 and col2, thus the line can be refactored to:
df['col3'] = df.sum(1).transform(lambda x: x + int(x > 1))
Example output:
   col1  col2  col3
0     1     3     5
1     2     4     7
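The same result can also be sketched without transform, by computing the row sum once and adding 1 where it exceeds 1 (np.where is my suggestion here, not part of the answer above):
import numpy as np

s = df['col1'] + df['col2']             # row sums: 4 and 6
df['col3'] = np.where(s > 1, s + 1, s)  # add 1 wherever the sum exceeds 1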

Pandas assign label based on index value

I have a dataframe with an index and multiple columns. Secondly, I have a few lists containing index values sampled according to certain criteria. Now I want to create columns with labels based on whether or not the index of a certain row is present in a specified list.
Now there are two situations where I am using it:
1) To create a column and give labels based on one list:
df['1_name'] = df.index.map(lambda ix: 'A' if ix in idx_1_model else 'B')
2) To create a column and give labels based on multiple lists:
def assignLabelsToSplit(ix_, random_m, random_y, model_m, model_y):
    if (ix_ in random_m) or (ix_ in model_m):
        return 'A'
    if (ix_ in random_y) or (ix_ in model_y):
        return 'B'
    else:
        return 'not_assigned'
df['2_name'] = df.index.map(lambda ix: assignLabelsToSplit(ix, idx_2_random_m, idx_2_random_y, idx_2_model_m, idx_2_model_y))
This is working, but it is quite slow. Each call takes about 3 minutes, and considering I have to execute the functions multiple times, it needs to be faster.
Thank you for any suggestions.
I think you need a double numpy.where with Index.isin:
df['2_name'] = np.where(df.index.isin(random_m + model_m), 'A',
                        np.where(df.index.isin(random_y + model_y), 'B', 'not_assigned'))
Sample:
np.random.seed(100)
df = pd.DataFrame(np.random.randint(10, size=(10,1)), columns=['A'])
#print (df)
random_m = [0,1]
random_y = [2,3]
model_m = [7,4]
model_y = [5,6]
print (type(random_m))
<class 'list'>
print (random_m + model_m)
[0, 1, 7, 4]
print (random_y + model_y)
[2, 3, 5, 6]
df['2_name'] = np.where(df.index.isin(random_m + model_m), 'A',
                        np.where(df.index.isin(random_y + model_y), 'B', 'not_assigned'))
print (df)
   A        2_name
0  8             A
1  8             A
2  3             B
3  7             B
4  7             A
5  0             B
6  4             B
7  2             A
8  5  not_assigned
9  2  not_assigned
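If more label groups get added, nested np.where calls become hard to read; a sketch of the same logic using np.select (my suggestion, not part of the answer above):
conditions = [df.index.isin(random_m + model_m),
              df.index.isin(random_y + model_y)]
choices = ['A', 'B']

df['2_name'] = np.select(conditions, choices, default='not_assigned')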

Collapsing identical adjacent rows in a Pandas Series

Basically if a column of my pandas dataframe looks like this:
[1 1 1 2 2 2 3 3 3 1 1]
I'd like it to be turned into the following:
[1 2 3 1]
You can write a simple function that loops through the elements of your series only storing the first element in a run.
As far as I know, there is no tool built in to pandas to do this. But it is not a lot of code to do it yourself.
import pandas
example_series = pandas.Series([1, 1, 1, 2, 2, 3])
def collapse(series):
    last = ""
    seen = []
    for element in series:
        if element != last:
            last = element
            seen.append(element)
    return seen

collapse(example_series)
In the code above, you will iterate through each element of a series and check if it is the same as the last element seen. If it is not, store it. If it is, ignore the value.
If you need to handle the return value as a series you can change the last line of the function to:
return pandas.Series(seen)
You could write a function that does the following:
x = pandas.Series([1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 1])
y = x - x.shift(1)
y[0] = 1
result = x[y != 0]
You can use DataFrame's diff and indexing:
>>> df = pd.DataFrame([1,1,2,2,2,2,3,3,3,3,1])
>>> df[df[0].diff()!=0]
    0
0   1
2   2
6   3
10  1
>>> df[df[0].diff()!=0].values.ravel() # If you need an array
array([1, 2, 3, 1])
Same works for Series:
>>> df = pd.Series([1,1,2,2,2,2,3,3,3,3,1])
>>> df[df.diff()!=0].values
array([1, 2, 3, 1])
You can use shift to create a boolean mask to compare the row against the previous row:
In [67]:
s = pd.Series([1,1,2,2,2,2,3,3,3,3,4,4,5])
s[s!=s.shift()]
Out[67]:
0     1
2     2
6     3
10    4
12    5
dtype: int64

Filter a pandas dataframe using values from a dict

I need to filter a data frame with a dict, constructed with the key being the column name and the value being the value that I want to filter:
filter_v = {'A':1, 'B':0, 'C':'This is right'}
# this would be the normal approach
df[(df['A'] == 1) & (df['B'] == 0) & (df['C'] == 'This is right')]
But I want to do something along the lines of
for column, value in filter_v.items():
    df[df[column] == value]
but this will filter the data frame several times, one value at a time, and not apply all filters at the same time. Is there a way to do it programmatically?
EDIT: an example:
df1 = pd.DataFrame({'A':[1,0,1,1, np.nan], 'B':[1,1,1,0,1], 'C':['right','right','wrong','right', 'right'],'D':[1,2,2,3,4]})
filter_v = {'A':1, 'B':0, 'C':'right'}
df1.loc[df1[filter_v.keys()].isin(filter_v.values()).all(axis=1), :]
gives
   A  B      C  D
0  1  1  right  1
1  0  1  right  2
3  1  0  right  3
but the expected result was
   A  B      C  D
3  1  0  right  3
only the last one should be selected.
IIUC, you should be able to do something like this:
>>> df1.loc[(df1[list(filter_v)] == pd.Series(filter_v)).all(axis=1)]
   A  B      C  D
3  1  0  right  3
This works by making a Series to compare against:
>>> pd.Series(filter_v)
A        1
B        0
C    right
dtype: object
Selecting the corresponding part of df1:
>>> df1[list(filter_v)]
     A      C  B
0    1  right  1
1    0  right  1
2    1  wrong  1
3    1  right  0
4  NaN  right  1
Finding where they match:
>>> df1[list(filter_v)] == pd.Series(filter_v)
       A      B      C
0   True  False   True
1  False  False   True
2   True  False  False
3   True   True   True
4  False  False   True
Finding where they all match:
>>> (df1[list(filter_v)] == pd.Series(filter_v)).all(axis=1)
0    False
1    False
2    False
3     True
4    False
dtype: bool
And finally using this to index into df1:
>>> df1.loc[(df1[list(filter_v)] == pd.Series(filter_v)).all(axis=1)]
   A  B      C  D
3  1  0  right  3
Abstraction of the above for the case of passing an array of filter values rather than a single value (analogous to pandas.core.series.Series.isin()). Using the same example:
df1 = pd.DataFrame({'A':[1,0,1,1, np.nan], 'B':[1,1,1,0,1], 'C':['right','right','wrong','right', 'right'],'D':[1,2,2,3,4]})
filter_v = {'A':[1], 'B':[1,0], 'C':['right']}
## Start with an array of all True
ind = [True] * len(df1)

## Loop through filters, updating the mask
for col, vals in filter_v.items():
    ind = ind & (df1[col].isin(vals))

## Return the filtered dataframe
df1[ind]

## Returns
     A  B      C  D
0  1.0  1  right  1
3  1.0  0  right  3
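The explicit loop can also be avoided, since DataFrame.isin accepts a dict mapping column names to allowed values; a sketch with the same df1 and filter_v as above (this is essentially the approach in the last answer below):
mask = df1[list(filter_v)].isin(filter_v).all(axis=1)
df1[mask]  # returns the same rows 0 and 3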
Here is a way to do it:
df.loc[df[filter_v.keys()].isin(filter_v.values()).all(axis=1), :]
UPDATE:
With values being the same across columns you could then do something like this:
# Create your filtering function:
def filter_dict(df, dic):
    return df[df[dic.keys()].apply(
        lambda x: x.equals(pd.Series(dic.values(), index=x.index, name=x.name)), axis=1)]

# Use it on your DataFrame:
filter_dict(df1, filter_v)
Which yields:
   A  B      C  D
3  1  0  right  3
If it is something that you do frequently, you could go as far as patching DataFrame for easy access to this filter:
pd.DataFrame.filter_dict_ = filter_dict
And then use this filter like this:
df1.filter_dict_(filter_v)
Which would yield the same result.
BUT, it is not the right way to do it, clearly.
I would use DSM's approach.
For Python 2, #primer's answer is OK. But you should be careful in Python 3 because of dict_keys. For instance:
>>> df.loc[df[filter_v.keys()].isin(filter_v.values()).all(axis=1), :]
TypeError: unhashable type: 'dict_keys'
The correct way in Python 3:
df.loc[df[list(filter_v.keys())].isin(list(filter_v.values())).all(axis=1), :]
Here's another way:
filterSeries = pd.Series(np.ones(df.shape[0], dtype=bool))
for column, value in filter_v.items():
    filterSeries = ((df[column] == value) & filterSeries)
This gives:
>>> df[filterSeries]
   A  B      C  D
3  1  0  right  3
To follow up on DSM's answer, you can also use any() to turn your query into an OR operation (instead of AND):
df1.loc[(df1[list(filter_v)] == pd.Series(filter_v)).any(axis=1)]
You can also create a query
query_string = ' and '.join(
    [f'({key} == "{val}")' if type(val) == str else f'({key} == {val})' for key, val in filter_v.items()]
)
df1.query(query_string)
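For the sample filter_v = {'A': 1, 'B': 0, 'C': 'right'} above, the generated string would look like this:
print(query_string)
# (A == 1) and (B == 0) and (C == "right")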
Combining previous answers, here's a function you can feed to df1.loc. Allows for AND/OR (using how='all'/'any'), plus it allows comparisons other than == using the op keyword, if desired.
import operator

def quick_mask(df, filters, how='all', op=operator.eq) -> pd.Series:
    if how == 'all':
        comb = pd.Series.all
    elif how == 'any':
        comb = pd.Series.any
    return comb(op(df[[*filters]], pd.Series(filters)), axis=1)

# Usage
df1.loc[quick_mask(df1, filter_v)]
I had an issue due to my dictionary having multiple values for the same key.
I was able to change DSM's query to:
df1.loc[df1[list(filter_v)].isin(filter_v).all(axis=1), :]
