Pandas conditional creation of a new dataframe column - python

This question is an extension of Pandas conditional creation of a series/dataframe column.
If we had this dataframe:
  Col1 Col2
1    A    Z
2    B    Z
3    B    X
4    C    Y
5    C    W
and we wanted to do the equivalent of:
if Col2 in ('Z','X') then Col3 = 'J'
else if Col2 = 'Y' then Col3 = 'K'
else Col3 = {value of Col1}
How could I do that?

You can use loc with isin for the first two conditions, then fill the remaining NaN values from Col1 with fillna:
df.loc[df.Col2.isin(['Z','X']), 'Col3'] = 'J'
df.loc[df.Col2 == 'Y', 'Col3'] = 'K'
df['Col3'] = df.Col3.fillna(df.Col1)
print(df)
  Col1 Col2 Col3
1    A    Z    J
2    B    Z    J
3    B    X    J
4    C    Y    K
5    C    W    C
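A one-liner sketch of the same logic, starting from Col1 and overwriting it where each condition holds (mask is standard pandas; this exact chain is just an illustration):
df['Col3'] = (df['Col1']
              .mask(df['Col2'].isin(['Z', 'X']), 'J')
              .mask(df['Col2'].eq('Y'), 'K'))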

Try np.where, which follows the pattern outcome = np.where(condition, value_if_true, value_if_false):
import numpy as np

df['Col3'] = np.where(df['Col2'].isin(['Z','X']), 'J',
             np.where(df['Col2'].eq('Y'), 'K', df['Col1']))
  Col1 Col2 Col3
1    A    Z    J
2    B    Z    J
3    B    X    J
4    C    Y    K
5    C    W    C
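With more than two branches, np.select keeps the conditions flat instead of nesting np.where calls. A minimal sketch of the same logic (assumes a NumPy version whose np.select accepts an array-like default):
conditions = [df['Col2'].isin(['Z', 'X']), df['Col2'].eq('Y')]
choices = ['J', 'K']
df['Col3'] = np.select(conditions, choices, default=df['Col1'])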

A simple (but likely inefficient) way can be useful when you have multiple if conditions, for example when you are trying to put values into four buckets based on quartiles. Suppose df holds your data, col1 has the values, col2 should receive the bucketized values (1, 2, 3, 4), and quart holds the 25%, 50% and 75% bounds.
Putting the steps together (build a dummy list, iterate with iterrows, append the right bucket under each condition, then attach the list as a column):
dummy = []
for index, row in df.iterrows():
    if row[col1] <= quart[0]:    # 25% bound
        dummy.append(1)
    elif row[col1] <= quart[1]:  # 50% bound
        dummy.append(2)
    elif row[col1] <= quart[2]:  # 75% bound
        dummy.append(3)
    else:
        dummy.append(4)
df[col2] = dummy
You can find the quartiles via A = df.describe() and then print(A[col1])
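A vectorized alternative sketch, assuming quart holds the three bounds in increasing order: pd.cut assigns the same buckets without an explicit loop.
df[col2] = pd.cut(df[col1],
                  bins=[-float('inf'), quart[0], quart[1], quart[2], float('inf')],
                  labels=[1, 2, 3, 4]).astype(int)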

Related

Creating a New Column in a Pandas Dataframe in a more pythonic way

I am trying to find a better, more pythonic way of accomplishing the following:
I want to add a new column to business_df called 'dot_prod', which is the dot product of a fixed vector (fixed_vector) and a vector from another data frame (rating_df). The rows of both business_df and rating_df have the same index values (business_id).
I have this loop which appears to work, however I know it's super clumsy (and takes forever). Essentially it loops through once for every row, calculates the dot product, then dumps it into the business_df dataframe.
n = 0
for i in range(business_df.shape[0]):
    dot_prod = np.dot(fixed_vector, rating_df.iloc[n])
    business_df['dot_prod'][n] = dot_prod
    n += 1
IIUC, you are looking for apply across axis=1 like:
business_df['dot_prod'] = rating_df.apply(lambda x: np.dot(fixed_vector, x), axis=1)
>>> fixed_vector = [1, 2, 3]
>>> df = pd.DataFrame({'col1' : [1,2], 'col2' : [3,4], 'col3' : [5,6]})
>>> df
col1 col2 col3
0 1 3 5
1 2 4 6
>>> df['col4'] = np.dot(fixed_vector, [df['col1'], df['col2'], df['col3']])
>>> df
col1 col2 col3 col4
0 1 3 5 22
1 2 4 6 28
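Since every row uses the same fixed vector, the whole column can also come from one matrix-vector product. A minimal sketch, assuming rating_df's rows are aligned with business_df as stated in the question:
business_df['dot_prod'] = rating_df.to_numpy() @ np.asarray(fixed_vector)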

Pandas - explode a column and set a specific value to a column for replicated rows

I would like to explode a column Col1 of a dataframe and for all the replicated rows, set a specific value z for a given column Col2.
For example if my dataframe df is:
Col1     Col2  Col3
[A,B,C]  x     y
I would like to find a way using df.explode("Col1") and achieve:
Col1  Col2  Col3
A     x     y
B     z     y
C     z     y
Thank you for any ideas.
You can try
out = (df.explode('Col1')
         .groupby(level=0)
         .apply(lambda g: g.assign(Col2=[g['Col2'].iloc[0]] + ['z']*(len(g)-1)))  # keep the first row of Col2, replace the rest with z
         .reset_index(drop=True))
print(out)
Col1 Col2 Col3
0 A x y
1 B z y
2 C z y
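A shorter sketch that relies on explode repeating the original index (so it assumes the index is unique before the explode): duplicated index labels mark exactly the replicated rows.
out = df.explode('Col1')
out.loc[out.index.duplicated(), 'Col2'] = 'z'  # every copy after the first gets z
out = out.reset_index(drop=True)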

df.iterrows() if condition not working on a dataframe?

I have a dataframe and I am trying, whenever the col1 string contains ":", to take the part after the colon and put it into col2, like this:
df['col1'] = df['col1'].astype(str)
df['col2'] = df['col1'].astype(str)
for i, row in df.iterrows():
    if ":" in row['col1']:
        row['col2'] = row['col1'].split(":")[1] + " " + "in Person"
        row['col1'] = 'j'
It produces the result I want on a sample dataframe like this, but it doesn't change the original dataframe:
import pandas as pd
d = {'col1': ['a:b', 'ac'], 'col2': ['z 26', 'y 25']}
df = pd.DataFrame(data=d)
print(df)
  col1         col2
0    j  b in Person
1   ac         y 25
What am I doing wrong, and what are the alternatives for this condition?
iterrows returns a copy of each row, so assigning to row never writes back to the original dataframe. For the extracting part, try:
df['col2'] = df.col1.str.extract(r':(.+)', expand=False).add(' ').add(df.col2, fill_value='')
# Output
col1 col2
0 a:b b z 26
1 ac y 25
I'm not sure if I understand the replacing correctly, but here is a try:
df.loc[df.col1.str.contains(':'), 'col1'] = 'j'
# Output
col1 col2
0 j b z 26
1 ac y 25
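Putting both parts together without a loop (order matters: extract from col1 before overwriting it), a minimal sketch that reproduces the loop's intended behavior:
mask = df.col1.str.contains(':')
df.loc[mask, 'col2'] = df.col1.str.split(':').str[1] + ' in Person'
df.loc[mask, 'col1'] = 'j'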

Extract regex match from pandas column and use as index to access list element

I have a pandas dataframe that looks like this:
col1 col2 col3
0 A,B,C 0|0 1|1
1 D,E,F 2|2 3|3
2 G,H,I 4|4 0|0
My goal is to apply a function on col2 through the last column of the dataframe that splits the corresponding string in col1, using the comma as the delimiter, and uses the first number as the index to get the corresponding list element. For numbers that are greater than the length of the list, I'd like to replace with the 0th element of the list.
Expected output:
col1 col2 col3
0 A,B,C A B
1 D,E,F F D
2 G,H,I G G
In reality, my dataframe has thousands of columns with millions of entries that need this replacement, so I need a method that doesn't refer to 'col2' and 'col3' explicitly (and preferably one that is computationally efficient).
You can use this code to create the original dataframe:
df = pd.DataFrame(
    {
        'col1': ['A,B,C', 'D,E,F', 'G,H,I'],
        'col2': ['0|0', '2|2', '4|4'],
        'col3': ['1|1', '3|3', '0|0']
    }
)
Taking into account that you could have a lot of columns and the length of the arrays in col1 could vary, you can use the following generalization, which only loops through the columns:
for col in df.columns[1:]:
    df[col] = (df['col1'] + ',' + df[col].str.split('|').str[0]).str.split(',') \
        .apply(lambda x: x[int(x[-1])] if int(x[-1]) < len(x[:-1]) else x[0])
which outputs for your example:
>>> print(df)
col1 col2 col3
0 A,B,C A B
1 D,E,F F D
2 G,H,I G G
Explanation:
First, take the index as a string from each colX and append it to the string in col1, so you get something like 'A,B,C,0'; splitting on the comma then gives a list whose last element is the index you need (['A', 'B', 'C', '0']):
(df['col1']+','+df[col].str.split('|').str[0]).str.split(',')
Then apply a function that returns the i-th element, where i is the last element of the list; if i is greater than the length of the list minus one, it returns the first element instead:
(df['col1']+','+df[col].str.split('|').str[0]).str.split(',') \
    .apply(lambda x: x[int(x[-1])] if int(x[-1]) < len(x[:-1]) else x[0])
Last but not least, put it in a loop over the desired columns.
I would first reduce your x|x format (both numbers are always equal) to a single number:
df['col2'] = df['col2'].str.split('|', expand=True).iloc[:, 0]
df['col3'] = df['col3'].str.split('|', expand=True).iloc[:, 0]
Then split the letter mappings while keeping them aligned by row.
ndf = pd.concat([df, df['col1'].str.split(',', expand=True)], axis=1)
After that, map them back by row while making sure to prevent overflows:
def bad_mapping(row, c):
    value = int(row[c])
    if value <= 2:  # 2 is the largest valid index here; adjust if col1 has more letters
        return row[value]
    else:
        return row[0]

for c in ['col2', 'col3']:
    ndf['mapped_' + c] = ndf.apply(lambda r: bad_mapping(r, c), axis=1)
Output looks like:
col1 col2 col3 0 1 2 mapped_col2 mapped_col3
0 A,B,C 0 1 A B C A B
1 D,E,F 2 3 D E F F D
2 G,H,I 4 0 G H I G G
Drop columns with df.drop(columns=['your', 'columns', 'here'], inplace=True) as needed.
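Since the lists in col1 can vary in length, here is a hedged variant that derives the bound per row instead of hardcoding 2 (safe_mapping is a hypothetical name for this sketch; it reads the letters straight from col1):
def safe_mapping(row, c):
    letters = row['col1'].split(',')
    i = int(row[c])
    return letters[i] if i < len(letters) else letters[0]

for c in ['col2', 'col3']:
    ndf['mapped_' + c] = ndf.apply(lambda r: safe_mapping(r, c), axis=1)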

Pandas - Find duplicated entries in one column within rows with equal values in another column

Assume a dataframe df like the following:
col1 col2
0 a A
1 b A
2 c A
3 c B
4 a B
5 b B
6 a C
7 a C
8 c C
I would like to find those values of col2 for which there are duplicate entries a in col1. In this example the result should be ['C'], since for df['col2'] == 'C', col1 contains the entry a twice.
I tried this approach
df[(df['col1'] == 'a') & (df['col2'].duplicated())]['col2'].to_list()
but this only works, if the a within a block of rows defined by col2 is at the beginning or end of the block, depending on how you define the keep keyword of duplicated(). In this example, it returns ['B', 'C'], which is not what I want.
Use Series.duplicated only for filtered rows:
df1 = df[df['col1'] == 'a']
out = df1.loc[df1['col2'].duplicated(keep=False), 'col2'].unique().tolist()
print(out)
['C']
Another idea is to use DataFrame.duplicated on both columns and chain it with a mask for the rows that equal a:
out = df.loc[df.duplicated(subset=['col1', 'col2'], keep=False) &
             (df['col1'] == 'a'), 'col2'].unique().tolist()
print(out)
['C']
You can group col1 by col2, concatenate the strings per group, and count occurrences of 'a':
>>> s = df.col1.groupby(df.col2).sum().str.count('a').gt(1)
>>> s[s].index.values
array(['C'], dtype=object)
A more generalised solution using Groupby.count and index.get_level_values:
In [2632]: x = df.groupby(['col1', 'col2']).col2.count().to_frame()
In [2642]: res = x[x.col2 > 1].index.get_level_values(1).tolist()
In [2643]: res
Out[2643]: ['C']
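One more hedged sketch of the same counting idea, using a boolean sum instead of string concatenation:
counts = df['col1'].eq('a').groupby(df['col2']).sum()
out = counts[counts > 1].index.tolist()
print(out)  # ['C']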
