I have a Pandas dataframe that looks like this:
df = pd.DataFrame({'gp_id': [1, 2, 1, 2], 'A': [1, 2, 3, 4]})
gp_id A
0 1 1
1 2 2
2 1 3
3 2 4
I want to assign the value -1 to the first row of the group with the id 2 (gp_id = 2), to get the following output:
gp_id A
0 1 1
1 2 -1
2 1 3
3 2 4
To do this, I've tried the following code:
df[df.gp_id == 2].A.iloc[0] = -1
But this doesn't do anything as I'm assigning a value in the sub-dataframe df[df.gp_id == 2] and I'm not modifying the original dataframe df.
Is there an easy way to solve this problem?
You could do:
df.loc[(df.gp_id == 2).argmax(), 'A'] = -1
since pd.Series.argmax returns the position of the first occurrence of the maximum, which for a boolean mask is the first True.
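Note that pd.Series.argmax returns a positional index, while .loc expects a label; the two coincide above only because the dataframe has the default RangeIndex. A minimal sketch that is safe for any index (assuming at least one match) translates the position back to a label via df.index:
import pandas as pd

df = pd.DataFrame({'gp_id': [1, 2, 1, 2], 'A': [1, 2, 3, 4]},
                  index=['w', 'x', 'y', 'z'])  # non-default index
pos = (df.gp_id == 2).argmax()   # position of the first True
df.loc[df.index[pos], 'A'] = -1  # translate position to label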
If you are not sure that the value is present in the dataframe, you could do:
cond = (df.gp_id == 2)
if cond.sum():
    df.loc[cond.argmax(), 'A'] = -1
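Equivalently, cond.any() states the "at least one match" check a little more directly than cond.sum():
cond = (df.gp_id == 2)
if cond.any():
    df.loc[cond.argmax(), 'A'] = -1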
A general solution, for the case where the mask may match no rows, is to chain a second mask built from the cumulative sum of the first (combined with & for bitwise AND) and set the values with DataFrame.loc:
m = df.gp_id == 2
df.loc[m & (m.cumsum() == 1), 'A'] = -1
This works well when there is no match: nothing is assigned, with no error and no incorrect assignment:
m = df.gp_id == 7
df.loc[m & (m.cumsum() == 1), 'A'] = -1
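To see why this picks out only the first match: m.cumsum() counts the Trues seen so far, so it equals 1 from the first match onward, and ANDing with m itself keeps only that first True. A quick demonstration on the sample data:
m = df.gp_id == 2
print(m.tolist())                        # [False, True, False, True]
print(m.cumsum().tolist())               # [0, 1, 1, 2]
print((m & (m.cumsum() == 1)).tolist())  # [False, True, False, False]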
If the mask is guaranteed to match at least one row, a simpler solution is:
idx = df[df.gp_id == 2].index[0]
df.loc[idx, 'A'] = -1
print (df)
gp_id A
0 1 1
1 2 -1
2 1 3
3 2 4
If there is no match, this solution raises an error rather than making an incorrect assignment.
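If you want the convenience of this form without the error, one option is to guard the lookup (a minimal sketch):
try:
    idx = df[df.gp_id == 7].index[0]
    df.loc[idx, 'A'] = -1
except IndexError:
    pass  # no matching row, leave df unchanged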
I have a DataFrame:
df =
a b c d e
0 0 1 2 3 4
1 1 2 3 0 4
2 2 3 1 4 0
I would like to get the values that occur N times in a certain column.
For example, if I want to get all the values that occur 2 times in column "e", I would get result = [4], and if I want to get all the values that occur 1 time in column "d", I would get result = [3, 0, 4].
I can do df['e'].value_counts() == 2, but that gives a True/False Series. I just want to get the values that are True.
What you did returns a True/False series, so we need to use this to get the index values!
col = 'd'
n = 1
df[col].value_counts() == n
# 3 True
# 0 True
# 4 True
# Name: d, dtype: bool
To get the indices that have True behind them, we can do:
df[col].value_counts().index[df[col].value_counts() == n]
# Int64Index([3, 0, 4], dtype='int64')
To create a list, we only need to use list():
list(df[col].value_counts().index[df[col].value_counts() == n])
# [3, 0, 4]
EDIT:
You can assign val_counts = df[col].value_counts() and use this like so (or see the answer from @jezrael):
list(val_counts.index[val_counts == n])
# [3, 0, 4]
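The same pattern answers the question's first example, values occurring 2 times in column "e" (a quick check on the sample data):
val_counts = df['e'].value_counts()
list(val_counts.index[val_counts == 2])
# [4]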
You can filter index values after Series.value_counts:
s = df['e'].value_counts()
L = s.index[s.eq(2)].tolist()
print (L)
[4]
s = df['d'].value_counts()
L = s.index[s.eq(1)].tolist()
print (L)
[0, 4, 3]
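Equivalently, since value_counts returns a plain Series of counts indexed by the column's values, you can filter the Series with boolean indexing first and take its index afterwards; a minimal, self-contained sketch:
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 2], 'b': [1, 2, 3], 'c': [2, 3, 1],
                   'd': [3, 0, 4], 'e': [4, 4, 0]})
s = df['d'].value_counts()
print(s[s.eq(1)].index.tolist())   # [3, 0, 4] (order among equal counts may vary)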
I have a column like this:
1
0
0
0
0
1
0
0
0
1
0
0
and need as result:
1
1
1
1
1
2
2
2
2
3
3
3
I need a method/algorithm that divides the column into ranges running from one 1 to the next and gives them successive values.
Any idea?
You can loop through the list and use a counter to update the column value, incrementing it every time you find the number 1.
def rank(lst):
    counter = 0
    for i, column in enumerate(lst):
        if column == 1:
            counter += 1
        lst[i] = counter
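A quick check of this approach on the sample column (note that rank modifies the list in place):
lst = [1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0]
rank(lst)
print(lst)   # [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3]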
def fill_arr(arr):
    curr = 1
    for i in range(1, len(arr)):
        arr[i] = curr
        if i < len(arr) - 1 and arr[i + 1] == 1:
            curr += 1
    return arr
A quick test
arr = [1,0,0,0,1,0,0,0,1,0,0]
fill_arr(arr)
[1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3]
The idea is as follows:
keep track of the number of 1s we have encountered in curr, incrementing it when looking ahead reveals a new 1.
set the elements at the current index to curr.
we start at index 1 since we know that there is a one at index zero. This helps us reduce edge cases and make the algorithm easier to manage.
What you are looking for is usually called the cumulative sum; or, as a verb, you're looking to accumulate the values in the list.
For a python list:
import itertools
l1 = [1,0,0,0,1,0,0,0,1,0,0]
l2 = list(itertools.accumulate(l1))
print(l2)
# [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3]
For a numpy array:
import numpy
a1 = numpy.array([1,0,0,0,1,0,0,0,1,0,0])
a2 = a1.cumsum()
print(a2)
# [1 1 1 1 2 2 2 2 3 3 3]
For a column in a pandas dataframe:
import pandas
df = pandas.DataFrame({'col1': [1,0,0,0,1,0,0,0,1,0,0]})
df['col2'] = df['col1'].cumsum()
print(df)
# col1 col2
# 0 1 1
# 1 0 1
# 2 0 1
# 3 0 1
# 4 1 2
# 5 0 2
# 6 0 2
# 7 0 2
# 8 1 3
# 9 0 3
# 10 0 3
Documentation:
itertools.accumulate;
numpy.cumsum;
numpy.ndarray.cumsum;
pandas.DataFrame.cumsum;
pandas.Series.cumsum.
I have created a new column by comparing two boolean columns. If both are positive, I assign a 1, otherwise a 0. This is my code below, but is there a way to be more pythonic? I tried list comprehension but failed.
lst = []
for i, k in zip(df['new_customer'], df['y']):
    if i == 1 & k == 1:
        lst.append(1)
    else:
        lst.append(0)
df['new_customer_subscription'] = lst
Use np.sign:
m = np.sign(df[['new_customer', 'y']]) >= 0
df['new_customer_subscription'] = m.all(axis=1).astype(int)
If you want to consider only positive non-zero values, change >= 0 to > 0 (since np.sign(0) is 0).
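For reference, np.sign maps negative numbers to -1, zero to 0, and positive numbers to 1:
import numpy as np

print(np.sign(np.array([-2.5, 0.0, 3.1])))   # [-1.  0.  1.]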
# Sample DataFrame.
df = pd.DataFrame(np.random.randn(5, 2), columns=['A', 'B'])
df
A B
0 0.511684 -0.512633
1 -1.254813 -1.721734
2 0.751830 0.285449
3 -0.934877 1.407998
4 -1.686066 -0.947015
# Get the sign of the numbers.
m = np.sign(df[['A', 'B']]) >= 0
m
A B
0 True False
1 False False
2 True True
3 False True
4 False False
# Find all rows where both columns are `True`.
m.all(axis=1).astype(int)
0 0
1 0
2 1
3 0
4 0
dtype: int64
Another solution if you have to deal with only two columns would be:
df['new_customer_subscription'] = (
    df['new_customer'].gt(0) & df['y'].gt(0)).astype(int)
To generalise to multiple columns, use logical_and.reduce:
df['new_customer_subscription'] = np.logical_and.reduce(
    df[['new_customer', 'y']] > 0, axis=1).astype(int)
Or,
df['new_customer_subscription'] = (df[['new_customer', 'y']] > 0).all(1).astype(int)
Another way to do this is using np.where from the numpy module:
df['Indicator'] = np.where((df.A > 0) & (df.B > 0), 1, 0)
Output
A B Indicator
0 -0.464992 0.418243 0
1 -0.902320 0.496530 0
2 0.219111 1.052536 1
3 -1.377076 0.207964 0
4 1.051078 2.041550 1
The np.where method works like this:
np.where(condition, true value, false value)
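For instance, on a small array (a minimal sketch with made-up data):
import numpy as np

arr = np.array([-1, 0, 2])
print(np.where(arr > 0, 1, 0))   # [0 0 1]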
I'm trying to convert to negative the values of rows in column 'nominal' where the corresponding value in column 'side' is equal to 'B'. I don't want to lose any rows that are not converted. I've tried the code below, but I get raise KeyError('%s not in index' % objarr[mask])
df[-df['nominal']].where(df['side']=='B')
Just use both conditions in a boolean index with &.
df[(df.side == 'B') & (df.nominal < 0)]
or if you intend on modifying,
df.loc[(df.side == 'B') & (df.nominal < 0), 'nominal']
Example
>>> df = pd.DataFrame(dict(side=['A']*3+['B']*3, nominal = [1, -2, -2, 2, 6, -5]))
>>> df
nominal side
0 1 A
1 -2 A
2 -2 A
3 2 B
4 6 B
5 -5 B
>>> df.loc[(df.side == 'B') & (df.nominal < 0), 'nominal'] = 1000
>>> df
nominal side
0 1 A
1 -2 A
2 -2 A
3 2 B
4 6 B
5 1000 B
This is a very standard way for filtering data in Pandas that you'll come across often. See Boolean Indexing in the Pandas docs.
Update
For your updated problem description, we can just use the augmented assignment operator *= to multiply our desired values by -1.
df.loc[(df.side == 'B'), 'nominal'] *= -1
Example
>>> df = pd.DataFrame(dict(nominal = [1, 2, 5, 3, 5, 3], side=['A']*3 + ['B']*3))
>>> df
nominal side
0 1 A
1 2 A
2 5 A
3 3 B
4 5 B
5 3 B
>>> df.loc[(df.side == 'B'), 'nominal'] *= -1
>>> df
nominal side
0 1 A
1 2 A
2 5 A
3 -3 B
4 -5 B
5 -3 B
You should try this:
df.loc[(df.side == 'B'), 'nominal'] *= -1
I have a pandas dataframe with a text column.
I'd like to create a new column in which values are conditional on the start of the text string from the text column.
So if the first 30 characters of the text column:
== 'xxx...xxx' then return value 1
== 'yyy...yyy' then return value 2
== 'zzz...zzz' then return value 3
if none of the above return 0
It is possible to use multiple numpy.where calls, but if there are more conditions it is better to use apply.
To select the start of each string, use indexing with str:
df = pd.DataFrame({'A': ['xxxss', 'yyyee', 'zzzswee', 'sss'],
                   'B': [4, 5, 6, 8]})
print (df)
A B
0 xxxss 4
1 yyyee 5
2 zzzswee 6
3 sss 8
#check the first 3 characters
a = df.A.str[:3]
df['new'] = np.where(a == 'xxx', 1,
            np.where(a == 'yyy', 2,
            np.where(a == 'zzz', 3, 0)))
print (df)
A B new
0 xxxss 4 1
1 yyyee 5 2
2 zzzswee 6 3
3 sss 8 0
def f(x):
    #print (x)
    if x == 'xxx':
        return 1
    elif x == 'yyy':
        return 2
    elif x == 'zzz':
        return 3
    else:
        return 0
df['new'] = df.A.str[:3].apply(f)
print (df)
A B new
0 xxxss 4 1
1 yyyee 5 2
2 zzzswee 6 3
3 sss 8 0
EDIT:
If the prefixes have different lengths, you only need:
df['new'] = np.where(df.A.str[:3] == 'xxx', 1,
            np.where(df.A.str[:2] == 'yy', 2,
            np.where(df.A.str[:1] == 'z', 3, 0)))
print (df)
A B new
0 xxxss 4 1
1 yyyee 5 2
2 zzzswee 6 3
3 sss 8 0
EDIT1:
Thanks to Quickbeam2k1 for the idea of using str.startswith to check the start of each string:
df['new'] = np.where(df.A.str.startswith('xxx'), 1,
            np.where(df.A.str.startswith('yy'), 2,
            np.where(df.A.str.startswith('z'), 3, 0)))
print (df)
A B new
0 xxxss 4 1
1 yyyee 5 2
2 zzzswee 6 3
3 sss 8 0
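If you prefer to avoid the nesting, np.select takes parallel lists of conditions and choices and behaves the same way, with the first matching condition winning (a sketch equivalent to the startswith version above):
conds = [df.A.str.startswith('xxx'),
         df.A.str.startswith('yy'),
         df.A.str.startswith('z')]
df['new'] = np.select(conds, [1, 2, 3], default=0)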
A different and slower solution; however, the advantage is that the mapping from patterns to values is a function parameter (with an implicit default value of 0):
def map_starts_with(pat_map):
    def map_string(t):
        pats = [pat for pat in pat_map.keys() if t.startswith(pat)]
        # return only the value of the "first" matching pattern, if any
        return pat_map.get(pats[0]) if len(pats) > 0 else 0
    return map_string
df = pd.DataFrame({'col':[ 'xx', 'aaaaaa', 'c']})
col
0 xx
1 aaaaaa
2 c
mapping = { 'aaa':4 ,'c':3}
df.col.apply(lambda x: map_starts_with(mapping)(x))
0 0
1 4
2 3
Note that we also used currying here. I'm wondering if this approach can be implemented using additional pandas or numpy functionality.
Note that the "first" pattern match may depend on the traversal order of the dict keys (since Python 3.7, dicts preserve insertion order, so it is the first key inserted). This is irrelevant if there is no overlap in the keys. (jezrael's solution, or a direct generalization thereof, will also choose one element for the match, but in a more predictable manner.)