Computing a new column based on other columns - python

I am trying to derive a new variable from an existing column (col1) in the df below. My new variable (col2) should only keep "a" where col1 is "a"; the rest of the values should be marked as "Other". Please help.
col1
a
b
c
a
b
c
a
Desired col2:
a
Other
Other
a
Other
Other
a

Use numpy.where:
import numpy as np

df['col2'] = np.where(df['col1'] == 'a', 'a', 'Other')
#alternative
#df['col2'] = df['col1'].where(df['col1'] == 'a', 'Other')
print (df)
  col1   col2
0    a      a
1    b  Other
2    c  Other
3    a      a
4    b  Other
5    c  Other
6    a      a

Method 1: np.where
This is the most direct method:
df['col2'] = np.where(df['col1'] == 'a', 'a', 'Other')
Method 2: pd.DataFrame.loc
df['col2'] = 'Other'
df.loc[df['col1'] == 'a', 'col2'] = 'a'
Method 3: pd.Series.map
df['col2'] = df['col1'].map({'a': 'a'}).fillna('Other')
Most of these methods can be optimized by operating on the underlying NumPy array, extracted via df['col1'].values.
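For example, a minimal sketch of Method 1 on the raw array (assuming numpy is imported as np and df holds the data above):
# compare against the ndarray instead of the Series to skip index alignment overhead
df['col2'] = np.where(df['col1'].values == 'a', 'a', 'Other')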

Since the question is tagged with neither pandas nor numpy, here is a solution without any additional library:
You can use a list comprehension with if and else:
col1 = ['a', 'b', 'c', 'a', 'b', 'c', 'a']
col2 = [x if x == 'a' else 'Other' for x in col1]
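Printing col2 then gives:
['a', 'Other', 'Other', 'a', 'Other', 'Other', 'a']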

Related

assign 0 when a value_counts() key is not found

I have a column that looks like this:
group
A
A
A
B
B
C
The value C sometimes exists in the column and sometimes does not. The code below works fine when C is present; however, if C does not occur in the column, it throws a KeyError.
value_counts = df.group.value_counts()
new_df["C"] = value_counts.C
I want to check whether C has a count or not. If not, I want to assign new_df["C"] a value of 0. I tried this, but I still get a KeyError. What else can I try?
value_counts = df.group.value_counts()
new_df["C"] = value_counts.C
if (df.group.value_counts()['consents']):
    new_df["C"] = value_counts.consents
else:
    new_df["C"] = 0
One way of doing it is to convert the Series into a dictionary and get the key, returning a default value (in your case 0) when it is not found:
df = pd.DataFrame({'group': ['A', 'A', 'B', 'B', 'D']})
new_df = {}
character = "C"
new_df[character] = df.group.value_counts().to_dict().get(character, 0)
Output of new_df:
{'C': 0}
However, I am not sure what new_df should be; it looks like a dictionary here, but it might be meant as a new DataFrame object.
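Whatever new_df is, a shorter sketch of the same lookup-with-default uses pandas' own Series.get, which accepts a fallback value:
# Series.get returns the given default when the key is missing
new_df[character] = df.group.value_counts().get(character, 0)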
One way could be to convert the group column to the Categorical type with specified categories, e.g.:
df = pd.DataFrame({'group': ['A', 'A', 'A', 'B', 'B']})
print(df)
#   group
# 0     A
# 1     A
# 2     A
# 3     B
# 4     B
categories = ['A', 'B', 'C']
df['group'] = pd.Categorical(df['group'], categories=categories)
df['group'].value_counts()
[out]
A    3
B    2
C    0
Name: group, dtype: int64
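With the categories declared up front, the original lookup from the question no longer raises, even when C is absent; a usage sketch on the frame above:
new_df = {}
new_df['C'] = df['group'].value_counts()['C']  # 0 instead of a KeyError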

Pandas - Find duplicated entries in one column within rows with equal values in another column

Assume a dataframe df like the following:
  col1 col2
0    a    A
1    b    A
2    c    A
3    c    B
4    a    B
5    b    B
6    a    C
7    a    C
8    c    C
I would like to find those values of col2 where there are duplicate a entries in col1. In this example the result should be ['C'], since for df['col2'] == 'C', col1 has two a entries.
I tried this approach
df[(df['col1'] == 'a') & (df['col2'].duplicated())]['col2'].to_list()
but this only works if the a within a block of rows defined by col2 is at the beginning or end of the block, depending on how you set the keep keyword of duplicated(). In this example, it returns ['B', 'C'], which is not what I want.
Use Series.duplicated only for filtered rows:
df1 = df[df['col1'] == 'a']
out = df1.loc[df1['col2'].duplicated(keep=False), 'col2'].unique().tolist()
print (out)
['C']
Another idea is to use DataFrame.duplicated on both columns and chain it with a mask for rows matching only 'a':
out = df.loc[df.duplicated(subset=['col1', 'col2'], keep=False) &
             (df['col1'] == 'a'), 'col2'].unique().tolist()
print (out)
['C']
You can group col1 by col2, concatenate each group's strings with sum, and count occurrences of 'a':
>>> s = df.col1.groupby(df.col2).sum().str.count('a').gt(1)
>>> s[s].index.values
array(['C'], dtype=object)
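For reference, the intermediate result of the sum concatenation, before str.count('a').gt(1) is applied:
>>> df.col1.groupby(df.col2).sum()
col2
A    abc
B    cab
C    aac
Name: col1, dtype: object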
A more generalised solution using Groupby.count and index.get_level_values:
In [2632]: x = df.groupby(['col1', 'col2']).col2.count().to_frame()
In [2642]: res = x[x.col2 > 1].index.get_level_values(1).tolist()
In [2643]: res
Out[2643]: ['C']

Pandas: how to add a new column to a dataframe based on values from all rows, with specific columns' values applied to the whole dataframe

I am working on a pandas DataFrame that needs a new column showing the count of specific values in specific columns.
I tried various combinations of groupby and pivot, but had problems applying them to the whole dataframe without errors.
df = pd.DataFrame([
    ['a', 'z'],
    ['a', 'x'],
    ['a', 'y'],
    ['b', 'v'],
    ['b', 'x'],
    ['b', 'v']],
    columns=['col1', 'col2'])
I need to add col3 that counts 'v' values in col2 for each value in col1. There is no 'v' in col2 for 'a' in col1, so col3 should be 0 for every 'a' row, while the expected count is 2 for every 'b' row, including the row where col2 equals 'x' instead of 'v'.
Expected output:
['a', 'z', 0]
['a', 'x', 0]
['a', 'y', 0]
['b', 'v', 2]
['b', 'x', 2]
['b', 'v', 2]
I'm looking for an idiomatic pandas solution, because the original dataframe is quite big, so things like row iteration are too time-expensive.
Create a Boolean Series checking the equality, then use groupby + transform + sum to count them:
df['col3'] = df.col2.eq('v').astype(int).groupby(df.col1).transform('sum')
#  col1 col2  col3
#0    a    z     0
#1    a    x     0
#2    a    y     0
#3    b    v     2
#4    b    x     2
#5    b    v     2
While ALollz's answer is a neat one-liner, here is another, two-step solution that introduces other concepts such as str.contains and np.where!
First, get the rows which have v using np.where and mark them with a flag:
df['col3'] = np.where(df['col2'].str.contains('v'), 1, 0)
Now perform a groupby on col1 and sum them:
df['col3'] = df.groupby('col1')['col3'].transform('sum')
Output:
  col1 col2  col3
0    a    z     0
1    a    x     0
2    a    y     0
3    b    v     2
4    b    x     2
5    b    v     2
All the answers above are fine. The only caveat is that transform can be slow when the group size is very large. Alternatively, you can try the workaround below:
(df.assign(mask=lambda x: x.col2.eq('v'))
   .pipe(lambda x: x.join(x.groupby('col1')['mask'].sum().map(int).rename('col3'),
                          on='col1')))
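A join-free sketch of the same idea, assuming the same df: count the 'v' rows per col1 value with value_counts, then map the counts back onto col1:
# rows where col2 == 'v', counted per col1 value, mapped back to every row
df['col3'] = (df['col1']
              .map(df.loc[df['col2'].eq('v'), 'col1'].value_counts())
              .fillna(0)
              .astype(int))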

FutureWarning in pandas dataframe

I have a sample python code:
import pandas as pd
ddf = pd.DataFrame({'col1': ['A', 'A', 'B'],
                    'Id': [3, 1, 2],
                    'col3': ['x', 'a', 'b']})
ddf.index=ddf['Id']
ddf.sort_values(by='Id')
The above snippet produces 'FutureWarning: 'Id' is both an index level and a column label. Defaulting to column, but this will raise an ambiguity error in a future version'. And it does become an error when I run this under a recent version of pandas. I am quite new to python and pandas. How do I resolve this issue?
The best approach here is to convert the Id column to the index with DataFrame.set_index, which avoids index.name being the same as one of the column names:
ddf = pd.DataFrame({'col1': ['A', 'A', 'B'],
                    'Id': [3, 1, 2],
                    'col3': ['x', 'a', 'b']})
ddf = ddf.set_index('Id')
print (ddf.index.name)
Id
print (ddf.columns)
Index(['col1', 'col3'], dtype='object')
For sorting by the index, DataFrame.sort_index is better:
print (ddf.sort_index())
   col1 col3
Id
1     A    a
2     B    b
3     A    x
Your solution works if you change index.name to something different:
ddf = pd.DataFrame({'col1': ['A', 'A', 'B'],
                    'Id': [3, 1, 2],
                    'col3': ['x', 'a', 'b']})
ddf.index=ddf['Id']
print (ddf.index.name)
Id
print (ddf.columns)
Index(['col1', 'Id', 'col3'], dtype='object')
Set a different index.name with DataFrame.rename_axis, or assign it directly:
ddf = ddf.rename_axis('newID')
#alternative
#ddf.index.name = 'newID'
print (ddf.index.name)
newID
print (ddf.columns)
Index(['col1', 'Id', 'col3'], dtype='object')
Now it is possible to distinguish between the index level and the column names, because sort_values works with both:
print(ddf.sort_values(by='Id'))
      col1  Id col3
newID
1        A   1    a
2        B   2    b
3        A   3    x
print (ddf.sort_values(by='newID'))
#same like sorting by index
#print (ddf.sort_index())
      col1  Id col3
newID
1        A   1    a
2        B   2    b
3        A   3    x
Simply add .values, so the index is built from a plain array and does not inherit the name 'Id':
ddf.index=ddf['Id'].values
ddf.sort_values(by='Id')
Out[314]:
  col1  Id col3
1    A   1    a
2    B   2    b
3    A   3    x
Both your columns and your row index contain 'Id'; a simple solution is to not set the (row) index to 'Id'.
import pandas as pd
ddf = pd.DataFrame({'col1': ['A', 'A', 'B'],
                    'Id': [3, 1, 2],
                    'col3': ['x', 'a', 'b']})
ddf.sort_values(by='Id')
Out[0]:
  col1  Id col3
1    A   1    a
2    B   2    b
0    A   3    x
Or set the index when you create the df:
ddf = pd.DataFrame({'col1': ['A', 'A', 'B'],
                    'col3': ['x', 'a', 'b']},
                   index=[3, 1, 2])
ddf.sort_index()
Out[1]:
  col1 col3
1    A    a
2    B    b
3    A    x

How to create row in pandas dataframe from another row plus new data

Suppose that I have a pandas dataframe, df1:
import pandas as pd
df1col = ['col1', 'col2']
df1 = pd.DataFrame(columns=df1col)
df1.loc[0] = 'a', 'b'
My goal is to create df2 where the first two columns are the same as those in df1. I take the list of df1's column names and append 'col3' and 'col4' to it (is there a better way to do this?), then create the dataframe df2:
df2col = df1col
df2col.append('col3')
df2col.append('col4')
df2 = pd.DataFrame(columns=df2col)
Now I simply want to add the first (and only) row of df1 to the first row of df2 AND I want to add two new entries (c and d) so that all the columns are filled. I've tried:
df2.loc[0] = df1.loc[0], 'c', 'd'
and
df2.loc[0] = [df1.loc[0], 'c', 'd']
But neither works. Any hints?
You can extract the list and then add 'c' and 'd' to it:
df2.loc[0] = df1.loc[0].tolist() + ['c', 'd']
>>> df2
  col1 col2 col3 col4
0    a    b    c    d
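As a side note on the question's setup: df2col = df1col binds the same list object, so the two append calls also mutate df1col. A sketch that builds a fresh list instead:
# concatenation creates a new list and leaves df1col untouched
df2col = df1col + ['col3', 'col4']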
The problem is that df1.loc[0] is a pandas Series, not a flat sequence of values. You can fix it by extracting the values from df1.loc[0] and extending them with 'c' and 'd' as follows:
row1 = [val for val in df1.loc[0]]
row1.extend(['c', 'd'])
df2.loc[0] = row1
You can copy a dataframe, and adding a column to a dataframe works like adding a key to a dictionary.
import pandas as pd
df1col = ['col1', 'col2']
df1 = pd.DataFrame(columns=df1col)
df1.loc[0] = 'a', 'b'
df2 = df1.copy()
df2['col3'] = 'c'
df2['col4'] = 'd'
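df2 then contains the copied row plus the two new values:
>>> df2
  col1 col2 col3 col4
0    a    b    c    d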
