Changing values in a dataframe column based off a different column (python)

Col1 Col2
0 APT UB0
1 AK0 UUP
2 IL2 PB2
3 OIU U5B
4 K29 AAA
My data frame looks similar to the above data. I'm trying to change the values in Col1 if the corresponding values in Col2 have the letter "B" in it. If the value in Col2 has "B", then I want to add "-B" to the end of the value in Col1.
Ultimately I want Col1 to look like this:
Col1
0 APT-B
1 AK0
2 IL2-B
.. ...
I have an idea of how to approach it, but I'm somewhat confused because I know my code is incorrect. In addition, there are NaN values in Col1 in my actual data, which will definitely give an error when I try to do val += "-B", since it's not possible to add a string and a float.
for value in dataframe['Col2']:
    if "B" in value:
        for val in dataframe['Col1']:
            val += "-B"
Does anyone know how to fix/solve this?

Rather than using a loop, let's use pandas directly:
import pandas as pd
df = pd.DataFrame({'Col1': ['APT', 'AK0', 'IL2', 'OIU', 'K29'], 'Col2': ['UB0', 'UUP', 'PB2', 'U5B', 'AAA']})
df.loc[df.Col2.str.contains('B'), 'Col1'] += '-B'
print(df)
Output:
Col1 Col2
0 APT-B UB0
1 AK0 UUP
2 IL2-B PB2
3 OIU-B U5B
4 K29 AAA

You have too many "for" loops in your code. You just need to iterate over the rows once, and for any row satisfying your condition you make the change.
for idx, row in df.iterrows():
    if 'B' in row['Col2']:
        df.loc[idx, 'Col1'] = str(df.loc[idx, 'Col1']) + '-B'
edit: I used str to convert the previous value in Col1 to a string before appending, since you said you sometimes have non-string values there. If this doesn't work for you, please post your test data and results.
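For the NaN values mentioned in the question, note that str(NaN) + '-B' produces the string 'nan-B'. A vectorized variant can leave missing entries untouched instead. This is only a minimal sketch: the sample data with NaN in both columns is invented for illustration, and the names has_b and target are hypothetical.

```python
import numpy as np
import pandas as pd

# Illustrative data (an assumption, not the asker's real data).
df = pd.DataFrame({
    "Col1": ["APT", np.nan, "IL2", "OIU", "K29"],
    "Col2": ["UB0", "UB1", "PB2", np.nan, "AAA"],
})

# Rows where Col2 contains "B"; na=False treats a missing Col2 as "no match".
has_b = df["Col2"].str.contains("B", na=False)

# Only touch rows that also have a real value in Col1, so NaN stays NaN.
target = has_b & df["Col1"].notna()
df.loc[target, "Col1"] = df.loc[target, "Col1"] + "-B"

print(df)
```

Rows 0 and 2 get the suffix; row 1 keeps its NaN in Col1, and row 3 is skipped because its Col2 value is missing.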

You can use a lambda expression. If 'B' is in Col2, then '-B' gets appended to Col1. The end result is assigned back to Col1.
df['Col1'] = df.apply(lambda x: x.Col1 + ('-B' if 'B' in x.Col2 else ''), axis=1)
>>> df
Col1 Col2
0 APT-B UB0
1 AK0 UUP
2 IL2-B PB2
3 OIU-B U5B
4 K29 AAA

Related

removing specific words from a dataset [duplicate]

I have a pandas data frame, which looks like the following:
col1 col2 col3 ...
field1:index1:value1 field2:index2:value2 field3:index3:value3 ...
field1:index4:value4 field2:index5:value5 field3:index5:value6 ...
The field is of int type, index is of int type and value could be int or float type.
I want to convert this data frame into the following expected output:
col1 col2 col3 ...
index1:value1 index2:value2 index3:value3 ...
index4:value4 index5:value5 index5:value6 ...
I want to remove the field: prefix from every cell. How can I do this?
EDIT: An example of a cell looks like: 1:1:1.0445731675303e-06 and I would like to reduce such strings to 1:1.0445731675303e-06, in all the cells.

Given
>>> df
col1 col2 col3
0 1:index1:value1 2:index2:value2 3:index3:value3
1 1:index4:value4 2:index5:value5 3:index5:value6
you can use
>>> df.apply(lambda s: s.str.replace(r'^\d+:', '', regex=True))
col1 col2 col3
0 index1:value1 index2:value2 index3:value3
1 index4:value4 index5:value5 index5:value6
The regex '^\d+:' matches a run of one or more digits followed by a colon, anchored at the start of the string, so only the leading field: prefix is removed.
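To see what the pattern does on the example cell from the question's edit, here is a small sketch using Python's re module and the equivalent pandas call. The variable names are illustrative only.

```python
import re

import pandas as pd

cell = "1:1:1.0445731675303e-06"

# ^ anchors at the start, \d+ matches the leading digits, ":" the first colon.
# Because of the anchor, later "digits:" groups in the string are left alone.
stripped = re.sub(r"^\d+:", "", cell)
print(stripped)

# The same pattern applied through pandas' string accessor.
series_result = pd.Series([cell]).str.replace(r"^\d+:", "", regex=True)
print(series_result.iloc[0])
```

Both calls reduce the cell to 1:1.0445731675303e-06.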

Try this:
df = df.applymap(lambda x: ':'.join(str(x).split(':')[1:]))
print(df)
col1 col2 col3
0 index1:value1 index2:value2 index3:value3
1 index4:value4 index5:value5 index5:value6

Another possible way is to split on a regex whose inner group captures everything after the first colon, then pick that captured piece with .str[index]:
df.apply(lambda s: s.str.split(r'(^[a-z0-9]+:(.*))').str[-2])

Another possible solution is to run the string processing in a list comprehension, and create a new dataframe, using the old dataframe's column names:
result = [[":".join(word.split(":")[1:])
           for word in ent]
          for ent in df.to_numpy()]
pd.DataFrame(result, columns = df.columns)
col1 col2 col3
0 index1:value1 index2:value2 index3:value3
1 index4:value4 index5:value5 index5:value6
This is typically faster than running applymap or apply; for string processing, plain Python is often faster than the pandas string methods, though it is worth timing on your own data.
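One way to check the speed claim on your own data is a quick timeit comparison. The sketch below is an assumption-laden illustration: the synthetic data, the row count, and the helper names with_str_replace and with_comprehension are all made up here, and the timings will vary by machine and pandas version.

```python
import timeit

import pandas as pd

# Synthetic data in the field:index:value shape described in the question.
df = pd.DataFrame({
    "col1": ["1:10:0.5"] * 2000,
    "col2": ["2:11:1.5"] * 2000,
})

def with_str_replace(frame):
    # pandas string method applied column by column
    return frame.apply(lambda s: s.str.replace(r"^\d+:", "", regex=True))

def with_comprehension(frame):
    # plain-Python string handling over the underlying array
    rows = [[":".join(cell.split(":")[1:]) for cell in row]
            for row in frame.to_numpy()]
    return pd.DataFrame(rows, columns=frame.columns, index=frame.index)

# Both approaches should produce the same cell values.
same = (with_str_replace(df).to_numpy().tolist()
        == with_comprehension(df).to_numpy().tolist())
print("same result:", same)

for fn in (with_str_replace, with_comprehension):
    seconds = timeit.timeit(lambda: fn(df), number=3)
    print(f"{fn.__name__}: {seconds:.4f}s for 3 runs")
```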

How to display a list pandas DataFrame cell as multiple lines

Not sure if this makes any sense but essentially I have a dataframe that looks something like this:
col 1 (str)   col 2 (int)   col 3 (list)
name1         num           [text(01),text(02),...,text(n)]
name2         num           [text(11),text(12),...,text(m)]
Where one of the columns is a list of strings, in this case col 3, and n!=m.
What I would like to know is if there is a way to display them in a more readable manner, such as:
col 1 (str)   col 2 (int)   col 3 (list)
name1         num           text(01)
                            ...
                            text(n)
name2         num           text(11)
                            ...
                            text(m)
I appreciate this looks messy but my intention is for all the texts to be displayed in one cell, just with line breaks, rather than being split across multiple rows as the table above shows.
Thank you in advance.

pandas has an explode function. It only partially solves the problem, because the values in col1 and col2 are duplicated for every exploded row.
The explode() function transforms each element of a list-like into a row, replicating the index values.
df1 = df1.explode('col3name')

Use explode on col3, the column holding the list values:
In [44]: df.explode('col3')
Out[44]:
col1 col2 col3
0 name1 num text(01)
0 name1 num text(02)
1 name2 num text(11)
1 name2 num text(12)
You could then use set_index:
In [53]: df.explode('col3').set_index(['col1', 'col2'])
Out[53]:
                 col3
col1  col2
name1 num    text(01)
      num    text(02)
name2 num    text(11)
      num    text(12)
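The explode behaviour described in these answers can be reproduced with a small self-contained sketch. The data below is invented for illustration (lists of different lengths, as in the question), and ignore_index=True is shown only as an option for when the replicated index is not wanted.

```python
import pandas as pd

# Illustrative frame: col3 holds lists of different lengths.
df = pd.DataFrame({
    "col1": ["name1", "name2"],
    "col2": [1, 2],
    "col3": [["a", "b", "c"], ["d", "e"]],
})

# Each list element becomes its own row; col1/col2 and the index are repeated.
exploded = df.explode("col3")
print(exploded)

# ignore_index=True renumbers the rows 0..n-1 instead of repeating the index.
renumbered = df.explode("col3", ignore_index=True)
print(renumbered.index.tolist())
```

Note that this changes the shape of the data (one row per list element) rather than showing the whole list inside a single cell with line breaks.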

Faster method of extracting characters for multiple columns in dataframe

I have a pandas DataFrame with multiple columns that hold string data in a format like this:
id col1 col2 col3
1 '1:correct' '0:incorrect' '1:correct'
2 '0:incorrect' '1:correct' '1:correct'
What I would like to do is to extract the numeric character before the colon : symbol. The resulting data should look like this:
id col1 col2 col3
1 1 0 1
2 0 1 1
What I have tried is using regex, like following:
colname = ['col1','col2','col3']
row = len(df)
for col in colname:
    df[col] = df[col].str.findall(r"(\d+):")
    for i in range(0,row):
        df[col].iloc[i] = df[col].iloc[i][0]
    df[col] = df[col].astype('int64')
The second loop selects the first and only element in the list created by the regex. I then convert the object dtype to integer. This code basically does what I want, but it is far too slow even for a small dataset with a few thousand rows. I have heard that loops are not very efficient in Python.
Is there a faster, more Pythonic way of extracting the digits from a string and converting them to integers?

Use Series.str.extract to get the first value before the :, and DataFrame.apply with a lambda function to process each column:
colname = ['col1','col2','col3']
f = lambda x: x.str.extract(r"(\d+):", expand=False)
df[colname] = df[colname].apply(f).astype('int64')
print (df)
id col1 col2 col3
0 1 1 0 1
1 2 0 1 1
Another solution uses split and selects the first value before the : (after stripping any surrounding quote characters):
colname = ['col1','col2','col3']
f = lambda x: x.str.strip("'").str.split(':').str[0]
df[colname] = df[colname].apply(f).astype('int64')
print (df)
id col1 col2 col3
0 1 1 0 1
1 2 0 1 1
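For reference, here is the extract approach run end to end on the sample from the question. This sketch assumes the quote characters shown in the question are display artifacts and not part of the stored strings; if they are part of the data, the regex still works because it only looks for digits followed by a colon.

```python
import pandas as pd

# Sample data rebuilt from the question (quotes assumed not to be stored).
df = pd.DataFrame({
    "id": [1, 2],
    "col1": ["1:correct", "0:incorrect"],
    "col2": ["0:incorrect", "1:correct"],
    "col3": ["1:correct", "1:correct"],
})
colname = ["col1", "col2", "col3"]

# expand=False makes extract return a Series (one capture group) per column.
extracted = df[colname].apply(lambda s: s.str.extract(r"(\d+):", expand=False))

# Every cell matched here, so the conversion to integers is safe.
df[colname] = extracted.astype("int64")
print(df)
```

If some cell had no digits before a colon, extract would return NaN for it and astype('int64') would raise, so such rows would need handling first.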

Another option is a list comprehension; since this involves strings, it is usually fast:
import re
pattern = re.compile(r"\d+(?=:)")  # one or more digits directly before a colon
result = {key: [int(pattern.search(arr).group(0))
                if isinstance(arr, str)
                else arr
                for arr in value.array]
          for key, value in df.items()}
pd.DataFrame(result)
id col1 col2 col3
0 1 1 0 1
1 2 0 1 1

Create a column out of the 2nd portion of text of two columns in pandas

I have a dataframe with two columns. I want to create a third column that is the "sum" of the first two columns, but without the first bit of each column. I think this is best shown in an example:
col1 col2 col3 (need to make)
abc_what_I_want1 abc_what_I_want1 what_I_want1what_I_want1
psdb_what_I_want2 what_I_want2
vxc_what_I_want3 vxc_what_I_want3 what_I_want3what_I_want3
qk_what_I_want4 qk_what_I_want4 what_I_want4what_I_want4
ertsa_what_I_want5 what_I_want5
abc_what_I_want6 abc_what_I_want6 what_I_want6what_I_want6
Note that what_I_want# will be different for every row, but the same between columns in the same row. The prefix will always be the same for each row but can differ/repeat between rows. Cells shown as blank are "" strings.
The code I have so far:
df["col3"] = df["col1"].str.split("_", 1) + df["col2"].str.split("_", 1)
From there I wanted just the 2nd (or last) element of the split so I tried both of the following:
df["col3"] = df["col1"].str.split("_", 1)[1] + df["col2"].str.split("_", 1)[1]
df["col3"] = df["col1"].str.split("_", 1)[-1] + df["col2"].str.split("_", 1)[-1]
Both of these raised errors. I think the first is because of replicated values (ValueError: cannot reindex from a duplicate axis). The second is a KeyError.

You were actually quite close; you just needed to select the second element of each split with .str[1] and use fillna for the empty cells:
m = df['col1'].str.split('_', n=1).str[1].fillna('') + df['col2'].str.split('_', n=1).str[1].fillna('')
df['col3'] = m
col1 col2 col3
0 abc_what_I_want1 abc_what_I_want1 what_I_want1what_I_want1
1 psdb_what_I_want2 what_I_want2
2 vxc_what_I_want3 vxc_what_I_want3 what_I_want3what_I_want3
3 qk_what_I_want4 qk_what_I_want4 what_I_want4what_I_want4
4 ertsa_what_I_want5 what_I_want5
5 abc_what_I_want6 abc_what_I_want6 what_I_want6what_I_want6
Another method is to use apply, which lets you run the split on multiple columns at once:
m = df[['col1', 'col2']].apply(lambda x: x.str.split('_', n=1).str[1]).fillna('')
df['col3'] = m['col1']+m['col2']
col1 col2 col3
0 abc_what_I_want1 abc_what_I_want1 what_I_want1what_I_want1
1 psdb_what_I_want2 what_I_want2
2 vxc_what_I_want3 vxc_what_I_want3 what_I_want3what_I_want3
3 qk_what_I_want4 qk_what_I_want4 what_I_want4what_I_want4
4 ertsa_what_I_want5 what_I_want5
5 abc_what_I_want6 abc_what_I_want6 what_I_want6what_I_want6
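The difference between indexing with [1] and with .str[1] is the key point here, and a short sketch makes it visible. The three-element Series is illustrative only; n is passed by keyword because recent pandas versions accept it only that way.

```python
import pandas as pd

s = pd.Series(["abc_what_I_want1", "", "qk_what_I_want4"])

# Split each string once at the first underscore; every cell becomes a list.
parts = s.str.split("_", n=1)

# parts[1] would select the ROW labelled 1 (the list for the empty string),
# whereas parts.str[1] takes element 1 of EACH list, row by row.
second = parts.str[1]

# The empty string splits into [''], which has no element 1, so that row is NaN.
print(second.isna().tolist())

# fillna('') turns the NaN back into an empty string so concatenation works.
print(second.fillna("").tolist())
```

Without the fillna step, adding the two columns would produce NaN for every row where either side was empty.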

You can replace() all characters up to and including the first underscore and then apply() a join() or sum() on axis=1:
df['Col3']=df.replace('^[^_]*_','',regex=True).fillna('').apply(''.join,axis=1)
Or:
df['Col3']=df.replace('^[^_]*_','',regex=True).fillna('').sum(axis=1)
Or:
df['Col3']=(pd.Series(df.replace('^[^_]*_','',regex=True).fillna('').values.tolist())
.str.join(''))
col1 col2 Col3
0 abc_what_I_want1 abc_what_I_want1 what_I_want1what_I_want1
1 psdb_what_I_want2 what_I_want2 what_I_want2I_want2
2 vxc_what_I_want3 vxc_what_I_want3 what_I_want3what_I_want3
3 qk_what_I_want4 qk_what_I_want4 what_I_want4what_I_want4
4 NaN ertsa_what_I_want5 what_I_want5
5 abc_what_I_want6 abc_what_I_want6 what_I_want6what_I_want6

Reorder DataFrame rows inplace

Is it possible to reorder pandas.DataFrame rows (based on an index) inplace?
I am looking for something like this:
df.reorder(idx, axis=0, inplace=True)
Here idx is not equal to df.index but is of the same type; it contains the same elements, but in a different order. The output is df reordered according to the new idx.
I have not found anything in the documentation, and I failed to get reindex_axis to work. The documentation made me hope it was possible because it says:
A new object is produced unless the new index is equivalent to the current one and copy=False
I might have misunderstood what equivalent index means in this context.

Try using the reindex function (note that this is not inplace):
>>> import pandas as pd
>>> df = pd.DataFrame({'col1':[1,2,3],'col2':['test','hi','hello']})
>>> df
col1 col2
0 1 test
1 2 hi
2 3 hello
>>> df = df.reindex([2,0,1])
>>> df
col1 col2
2 3 hello
0 1 test
1 2 hi
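As the answer notes, reindex is not in place: it returns a new object and leaves the original untouched, so the usual pattern is to rebind the name. A minimal sketch follows, with new_order and reordered as illustrative names; selecting with .loc is shown as an equivalent alternative when every label already exists in the index.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3], "col2": ["test", "hi", "hello"]})
new_order = [2, 0, 1]

# reindex returns a reordered copy; df itself keeps its original order.
reordered = df.reindex(new_order)
print(df.index.tolist())
print(reordered.index.tolist())

# Rebinding the name gives the practical effect of an "in place" reorder.
# .loc with a list of existing labels yields the same row order as reindex.
df = df.loc[new_order]
print(df["col2"].tolist())
```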
