removing specific words from a dataset [duplicate] - python

I have a pandas data frame, which looks like the following:
col1 col2 col3 ...
field1:index1:value1 field2:index2:value2 field3:index3:value3 ...
field1:index4:value4 field2:index5:value5 field3:index5:value6 ...
The field and index are of int type, and the value can be int or float.
I want to convert this data frame into the following expected output:
col1 col2 col3 ...
index1:value1 index2:value2 index3:value3 ...
index4:value4 index5:value5 index5:value6 ...
I want to remove all the field: prefixes from all the cells. How can I do this?
EDIT: An example of a cell looks like: 1:1:1.0445731675303e-06 and I would like to reduce such strings to 1:1.0445731675303e-06, in all the cells.

Given
>>> df
col1 col2 col3
0 1:index1:value1 2:index2:value2 3:index3:value3
1 1:index4:value4 2:index5:value5 3:index5:value6
you can use
>>> df.apply(lambda s: s.str.replace(r'^\d+:', '', regex=True))
col1 col2 col3
0 index1:value1 index2:value2 index3:value3
1 index4:value4 index5:value5 index5:value6
The regex r'^\d+:' matches a sequence of digits followed by a colon, anchored to the beginning of the string.
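A regex-free variant of the same idea (a sketch, assuming the same df as above) splits each string once on the first colon and keeps the remainder:
>>> df.apply(lambda s: s.str.split(':', n=1).str[1])
col1 col2 col3
0 index1:value1 index2:value2 index3:value3
1 index4:value4 index5:value5 index5:value6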

Try this:
df = df.applymap(lambda x: ':'.join(str(x).split(':')[1:]))
print(df)
col1 col2 col3
0 index1:value1 index2:value2 index3:value3
1 index4:value4 index5:value5 index5:value6
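Note that pandas 2.1 renamed DataFrame.applymap to DataFrame.map, so on newer versions the equivalent call (same logic, just the new name) is:
df = df.map(lambda x: ':'.join(str(x).split(':')[1:]))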

Another possible way is to split on a regex that captures everything after the first colon, then extract it using .str[index]:
# The capture groups make split() include the captured pieces in the
# result list; .str[-2] picks the inner group, i.e. everything after
# the first colon.
df.apply(lambda s: s.str.split(r'(^[a-z0-9]+:(.*))').str[-2])

Another possible solution is to run the string processing in a list comprehension, and create a new dataframe, using the old dataframe's column names :
result = [[":".join(word.split(":")[1:])
           for word in ent]
          for ent in df.to_numpy()]
pd.DataFrame(result, columns=df.columns)
col1 col2 col3
0 index1:value1 index2:value2 index3:value3
1 index4:value4 index5:value5 index5:value6
This is faster than running an applymap or apply, since plain-Python string processing is usually much faster than going through pandas string methods.
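To verify the speed claim on your own data, a rough micro-benchmark sketch (assumes the df and pd imports from above; the numbers will vary with machine and data size):
import timeit

t_apply = timeit.timeit(
    lambda: df.apply(lambda s: s.str.replace(r'^\d+:', '', regex=True)),
    number=100)
t_listcomp = timeit.timeit(
    lambda: pd.DataFrame(
        [[':'.join(w.split(':')[1:]) for w in row] for row in df.to_numpy()],
        columns=df.columns),
    number=100)
print(f'apply: {t_apply:.3f}s, list comprehension: {t_listcomp:.3f}s')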

Related

How to display a list pandas DataFrame cell as multiple lines

Not sure if this makes any sense but essentially I have a dataframe that looks something like this:
col 1 (str) col 2 (int) col 3 (list)
name1       num         [text(01),text(02),...,text(n)]
name2       num         [text(11),text(12),...,text(m)]
Where one of the columns is a list of strings, in this case col 3, and n!=m.
What I would like to know is if there is a way to display them in a more readable manner, such as:
col 1 (str) col 2 (int) col 3 (list)
name1       num         text(01)
                        ...
                        text(n)
name2       num         text(11)
                        ...
                        text(m)
I appreciate this looks messy but my intention is for all the texts to be displayed in one cell, just with line breaks, rather than being split across multiple rows as the table above shows.
Thank you in advance.
There is the explode function in pandas. It partially solves the problem, but in this case the col1 and col2 values will be duplicated.
The explode() function is used to transform each element of a list-like to a row, replicating the index values.
df1 = df1.explode('col3name')
Use explode on the col3 with list values like
In [44]: df.explode('col3')
Out[44]:
col1 col2 col3
0 name1 num text(01)
0 name1 num text(02)
1 name2 num text(11)
1 name2 num text(12)
You could then use set_index:
In [53]: df.explode('col3').set_index(['col1', 'col2'])
Out[53]:
col3
col1 col2
name1 num text(01)
num text(02)
name2 num text(11)
num text(12)
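If the goal is literally one cell per name with line breaks (e.g. in a Jupyter notebook), a hedged alternative is to join each list with newlines and preserve the whitespace when rendering; this is a sketch assuming the lists contain strings:
# Join each list into one newline-separated string, then render with
# white-space preserved (applies to the notebook/HTML display).
df2 = df.assign(col3=df['col3'].str.join('\n'))
df2.style.set_properties(**{'white-space': 'pre-wrap'})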

Faster method of extracting characters for multiple columns in dataframe

I have a pandas dataframe with multiple columns that has string data in a format like this:
id col1 col2 col3
1 '1:correct' '0:incorrect' '1:correct'
2 '0:incorrect' '1:correct' '1:correct'
What I would like to do is to extract the numeric character before the colon : symbol. The resulting data should look like this:
id col1 col2 col3
1 1 0 1
2 0 1 1
What I have tried is using regex, like the following:
colname = ['col1','col2','col3']
row = len(df)
for col in colname:
    df[col] = df[col].str.findall(r"(\d+):")
    for i in range(0,row):
        df[col].iloc[i] = df[col].iloc[i][0]
    df[col] = df[col].astype('int64')
The second loop selects the first and only element of the list created by the regex. I then convert the object dtype to integer. This code basically does what I want, but it is way too slow even for a small dataset with a few thousand rows. I have heard that loops are not very efficient in Python.
Is there a faster, more Pythonic way of extracting numerics in a string and converting it to integers?
Use Series.str.extract to get the first value before :, processing each column with a lambda function inside DataFrame.apply:
colname = ['col1','col2','col3']
f = lambda x: x.str.extract(r"(\d+):", expand=False)
df[colname] = df[colname].apply(f).astype('int64')
print (df)
id col1 col2 col3
0 1 1 0 1
1 2 0 1 1
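If some cells might not contain digits before a colon, a hedged variant (an assumption beyond the question's data) coerces failures to NaN and uses pandas' nullable integer dtype:
f = lambda x: pd.to_numeric(x.str.extract(r"(\d+):", expand=False),
                            errors='coerce')
df[colname] = df[colname].apply(f).astype('Int64')  # nullable integers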
Another solution uses split and selects the first value before the colon:
colname = ['col1','col2','col3']
f = lambda x: x.str.strip("'").str.split(':').str[0]
df[colname] = df[colname].apply(f).astype('int64')
print (df)
id col1 col2 col3
0 1 1 0 1
1 2 0 1 1
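A hedged variant that avoids apply altogether (same data assumptions as above) stacks the columns into one Series, extracts once, and unstacks back:
df[colname] = (df[colname].stack()
               .str.extract(r"(\d+):", expand=False)
               .astype('int64')
               .unstack())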
Another option is a list comprehension; since this involves strings, it should be fast:
import re
pattern = re.compile(r"\d(?=:)")
result = {key: [int(pattern.search(arr).group(0))
                if isinstance(arr, str)
                else arr
                for arr in value.array]
          for key, value in df.items()}
pd.DataFrame(result)
id col1 col2 col3
0 1 1 0 1
1 2 0 1 1

Concatenate value from 3 excel columns in Python

Very new to Python here.
I'm trying to concatenate values from 3 columns of an Excel sheet into 1 column.
I have about 300-400 rows to do.
Values are like this
COL1 COL2 COL3
CNMG 432 EMU
TNMG 332 ESU
...
Output should be
COL3
CNMG432EMU
TNMG332ESU
...
I tried about every pandas tutorial I could find, but nothing seems to work since I have both str and int values.
Thanks in advance
seems like some simple string concatenation should do the trick
df['concat'] = df['COL1'] + df['COL 2'].astype(str) + df['COL3']
If you have ints or floats, you'll need to cast them to strings with .astype(str); you can check which columns need it with a simple print(df.dtypes).
print(df)
COL1 COL2 COL3 concat
0 CNMG 432 EMU CNMG432EMU
1 TNMG 332 ESU TNMG332ESU
df["COL3"]=df["COL1"]+df["COL2"].astype(str)+df["COL3]
You can also do this pretty easily in pylightxl https://pylightxl.readthedocs.io/en/latest/
import pylightxl as xl
db = xl.readxl('excelfile.xlsx')
cat_3_columns = list(zip(db.ws('Sheet1').col(1), db.ws('Sheet1').col(2), db.ws('Sheet1').col(3)))
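zip only pairs the cells up, so as a hedged follow-up (assuming the same workbook layout) each tuple still needs to be joined into one string:
# Join each (COL1, COL2, COL3) tuple into a single string;
# str() handles the int values in the middle column.
concatenated = [''.join(str(v) for v in row) for row in cat_3_columns]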

Appending unique mixed string using pandas or python

I have a table, or df (if pandas has a better way), where one of the columns contains multiple mixed character-and-number strings. I need to count them and append a unique mixed string to each. What would be the best way: a Python loop, or does pandas have syntax for this? Example data:
col0 col1 col2
ENSG0001 E001 ENSG001:E001
ENSG0001 E002 ENSG001:E002
.
.
ENSG001 E028 ENSG001:E028
ENSG002 E001 ENSG002:E001
.
ENSG002 E012 ENSG002:E012
Edit:
I need to count the elements in col0 and, instead of a plain number, use the E001 format as the counter, concatenating col0 and the counter as in col2.
Add to the column a Series created by GroupBy.cumcount, converted to string with astype and zero-padded with Series.str.zfill:
df['col3'] = (df['col0'] + ':E' +
              df.groupby('col0').cumcount().add(1).astype(str).str.zfill(3))
print (df)
col0 col1 col2 col3
0 ENSG0001 E001 ENSG001:E001 ENSG0001:E001
1 ENSG0001 E002 ENSG001:E002 ENSG0001:E002
2 ENSG001 E028 ENSG001:E028 ENSG001:E001
3 ENSG002 E001 ENSG002:E001 ENSG002:E001
4 ENSG002 E012 ENSG002:E012 ENSG002:E002
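For intuition, cumcount numbers the rows within each group starting at 0, which is why .add(1) is needed before zero-padding; a minimal standalone sketch:
import pandas as pd

s = pd.Series(['a', 'a', 'a', 'b', 'b'])
# cumcount gives [0, 1, 2, 0, 1]; add(1) then zfill(3) formats the counter
print(s.groupby(s).cumcount().add(1).astype(str).str.zfill(3).tolist())
# ['001', '002', '003', '001', '002']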

Changing values in a dataframe column based off a different column (python)

Col1 Col2
0 APT UB0
1 AK0 UUP
2 IL2 PB2
3 OIU U5B
4 K29 AAA
My data frame looks similar to the above data. I'm trying to change the values in Col1 if the corresponding values in Col2 have the letter "B" in it. If the value in Col2 has "B", then I want to add "-B" to the end of the value in Col1.
Ultimately I want Col1 to look like this:
Col1
0 APT-B
1 AK0
2 IL2-B
.. ...
I have an idea of how to approach it, but I'm somewhat confused because I know my code is incorrect. In addition, there are NaN values in Col1 in my actual data, which will definitely give an error when I try val += "-B", since it's not possible to add a string and a float.
for value in dataframe['Col2']:
    if "Z" in value:
        for val in dataframe['Col1']:
            val += "-B"
Does anyone know how to fix/solve this?
Rather than using a loop, let's use pandas directly:
import pandas as pd
df = pd.DataFrame({'Col1': ['APT', 'AK0', 'IL2', 'OIU', 'K29'], 'Col2': ['UB0', 'UUP', 'PB2', 'U5B', 'AAA']})
df.loc[df.Col2.str.contains('B'), 'Col1'] += '-B'
print(df)
Output:
Col1 Col2
0 APT-B UB0
1 AK0 UUP
2 IL2-B PB2
3 OIU-B U5B
4 K29 AAA
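Since the question mentions NaN values in Col1, a hedged adjustment (assuming NaN rows should simply be left untouched) filters them out with an explicit mask:
# Append '-B' only where Col2 contains 'B' and Col1 is not NaN.
mask = df['Col2'].str.contains('B', na=False) & df['Col1'].notna()
df.loc[mask, 'Col1'] += '-B'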
You have too many "for" loops in your code. You just need to iterate over the rows once, and for any row satisfying your condition you make the change.
for idx, row in df.iterrows():
    if 'B' in row['Col2']:
        df.loc[idx, 'Col1'] = str(df.loc[idx, 'Col1']) + '-B'
edit: I used str to convert the previous value in Col1 to a string before appending, since you said you sometimes have non-string values there. If this doesn't work for you, please post your test data and results.
You can use a lambda expression. If 'B' is in Col2, then '-B' gets appended to Col1. The end result is assigned back to Col1.
df['Col1'] = df.apply(lambda x: x.Col1 + ('-B' if 'B' in x.Col2 else ''), axis=1)
>>> df
Col1 Col2
0 APT-B UB0
1 AK0 UUP
2 IL2-B PB2
3 OIU-B U5B
4 K29 AAA
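A vectorized equivalent (a sketch assuming the same df) avoids the row-wise apply, which gets slow on large frames, by using numpy.where:
import numpy as np

df['Col1'] = np.where(df['Col2'].str.contains('B'),
                      df['Col1'] + '-B',
                      df['Col1'])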
