Faster method of extracting characters for multiple columns in dataframe - python

I have a pandas dataframe with multiple columns that contain string data in a format like this:
id col1 col2 col3
1 '1:correct' '0:incorrect' '1:correct'
2 '0:incorrect' '1:correct' '1:correct'
What I would like to do is extract the numeric character before the colon (:). The resulting data should look like this:
id col1 col2 col3
1 1 0 1
2 0 1 1
What I have tried is using regex, like following:
colname = ['col1', 'col2', 'col3']
row = len(df)
for col in colname:
    df[col] = df[col].str.findall(r"(\d+):")
    for i in range(0, row):
        df[col].iloc[i] = df[col].iloc[i][0]
    df[col] = df[col].astype('int64')
The second loop selects the first and only element of the list created by the regex. I then convert the object dtype to integer. This code basically does what I want, but it is far too slow, even for a small dataset with a few thousand rows. I have heard that loops are not very efficient in Python.
Is there a faster, more Pythonic way of extracting the numeric part of a string and converting it to an integer?

Use Series.str.extract to get the first value before : within DataFrame.apply, processing each column with a lambda function:
colname = ['col1','col2','col3']
f = lambda x: x.str.extract(r"(\d+):", expand=False)
df[colname] = df[colname].apply(f).astype('int64')
print (df)
id col1 col2 col3
0 1 1 0 1
1 2 0 1 1
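For reference, a minimal reproduction of the question's frame (a sketch; the surrounding quotes are assumed to be literally part of each string, as shown in the question):
import pandas as pd

df = pd.DataFrame({'id': [1, 2],
                   'col1': ["'1:correct'", "'0:incorrect'"],
                   'col2': ["'0:incorrect'", "'1:correct'"],
                   'col3': ["'1:correct'", "'1:correct'"]})

colname = ['col1', 'col2', 'col3']
f = lambda x: x.str.extract(r"(\d+):", expand=False)
df[colname] = df[colname].apply(f).astype('int64')
print(df)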
Another solution uses split, selecting the first value before the colon:
colname = ['col1','col2','col3']
f = lambda x: x.str.strip("'").str.split(':').str[0]
df[colname] = df[colname].apply(f).astype('int64')
print (df)
id col1 col2 col3
0 1 1 0 1
1 2 0 1 1
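If some cells might not match the pattern or are missing, str.extract returns NaN and the int64 cast will fail; one hedged variation converts with pd.to_numeric so those rows simply stay NaN:
f = lambda x: pd.to_numeric(x.str.extract(r"(\d+):", expand=False))
df[colname] = df[colname].apply(f)  # non-matching cells become NaN (column dtype becomes float64 in that case)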

An option is using a list comprehension; since this involves strings, you should get good speed:
import re
import pandas as pd

pattern = re.compile(r"\d(?=:)")
result = {key: [int(pattern.search(arr).group(0))
                if isinstance(arr, str)
                else arr
                for arr in value.array]
          for key, value in df.items()}
pd.DataFrame(result)
id col1 col2 col3
0 1 1 0 1
1 2 0 1 1
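Note that \d(?=:) captures a single digit; if the number before the colon can have more than one digit (an assumption not stated in the question), widen the pattern:
pattern = re.compile(r"\d+(?=:)")  # one or more digits, still followed by ':'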

Related

Dropping column if more than half of the values are the same - Python

I have a pandas df (shown as an image in the original post).
I want to delete any column in which more than half of the values are the same, and I don't know how to do this.
I tried using pandas.Series.value_counts, but with no luck.
You can iterate over the columns, count the occurrences of values as you tried with value_counts, and check whether the most common value makes up more than 50% of the column's data.
n = len(df)
cols_to_drop = []
for e in list(df.columns):
    max_occ = df[e].value_counts().iloc[0]  # occurrences of the most common value
    if 2 * max_occ > n:  # check if it is more than half the length of the dataset
        cols_to_drop.append(e)
df = df.drop(cols_to_drop, axis=1)
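Applied to the sample frame used in the answers below, this keeps only col2 (the most common value in col1 and col3 appears in more than half the rows):
import pandas as pd

df = pd.DataFrame({'col1': [0, 1, 0, 0, 0, 1],
                   'col2': [0, 1, 0, 1, 2, 3],
                   'col3': [0, 0, 0, 0, 0, 0]})
# after running the loop above, df has a single column: col2
# (0 repeats 4/6 times in col1 and 6/6 times in col3, so both are dropped)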
You can use apply + value_counts and take the first value to get the max count:
count = df.apply(lambda s: s.value_counts().iat[0])
col1 4
col2 2
col3 6
dtype: int64
Then simply turn it into a boolean mask that keeps columns whose greatest count is at most half of len(df), and slice:
count = df.apply(lambda s: s.value_counts().iat[0])
df.loc[:, count.le(len(df)/2)] # use 'lt' if needed to drop if exactly half
output:
col2
0 0
1 1
2 0
3 1
4 2
5 3
Used input:
df = pd.DataFrame({'col1': [0, 1, 0, 0, 0, 1],
                   'col2': [0, 1, 0, 1, 2, 3],
                   'col3': [0, 0, 0, 0, 0, 0],
                   })
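A closely related spelling of the same check uses value_counts(normalize=True), which returns the share of the most common value directly (a sketch of the same idea):
share = df.apply(lambda s: s.value_counts(normalize=True).iat[0])
df.loc[:, share.le(0.5)]  # keep columns whose most common value covers at most half the rows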
Boolean slicing with a comprehension:
df.loc[:, [
    df.shape[0] // s.value_counts().max() >= 2
    for _, s in df.items()  # df.iteritems() in older pandas; it was removed in pandas 2.0
]]
col2
0 0
1 1
2 0
3 1
4 2
5 3
Credit to #mozway for input data.

How to display a list in a pandas DataFrame cell as multiple lines

Not sure if this makes any sense but essentially I have a dataframe that looks something like this:
col 1 (str)  col 2 (int)  col 3 (list)
name1        num          [text(01),text(02),...,text(n)]
name2        num          [text(11),text(12),...,text(m)]
Where one of the columns is a list of strings, in this case col 3, and n!=m.
What I would like to know is if there is a way to display them in a more readable manner, such as:
col 1 (str)  col 2 (int)  col 3 (list)
name1        num          text(01)
                          ...
                          text(n)
name2        num          text(11)
                          ...
                          text(m)
I appreciate this looks messy but my intention is for all the texts to be displayed in one cell, just with line breaks, rather than being split across multiple rows as the table above shows.
Thank you in advance.
There is an explode function in pandas. It partially solves the problem, but in this case the col1 and col2 values will be duplicated for each list element.
The explode() function is used to transform each element of a list-like to a row, replicating the index values.
df1 = df1.explode('col3name')
Use explode on col3, the column with list values:
In [44]: df.explode('col3')
Out[44]:
col1 col2 col3
0 name1 num text(01)
0 name1 num text(02)
1 name2 num text(11)
1 name2 num text(12)
You could then set_index:
In [53]: df.explode('col3').set_index(['col1', 'col2'])
Out[53]:
col3
col1 col2
name1 num text(01)
num text(02)
name2 num text(11)
num text(12)
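For reference, a minimal end-to-end sketch (hypothetical values standing in for the question's name/num/text(..) entries):
import pandas as pd

df = pd.DataFrame({'col1': ['name1', 'name2'],
                   'col2': [1, 2],
                   'col3': [['text(01)', 'text(02)'], ['text(11)', 'text(12)']]})

print(df.explode('col3').set_index(['col1', 'col2']))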

Create a column out of the 2nd portion of text of two columns in pandas

I have a dataframe with two columns. I want to create a third column that is the
"sum" of the first two columns, but without the first bit of each column. I think this is best shown in an example:
col1                col2                col3 (need to make)
abc_what_I_want1    abc_what_I_want1    what_I_want1what_I_want1
psdb_what_I_want2                       what_I_want2
vxc_what_I_want3    vxc_what_I_want3    what_I_want3what_I_want3
qk_what_I_want4     qk_what_I_want4     what_I_want4what_I_want4
                    ertsa_what_I_want5  what_I_want5
abc_what_I_want6    abc_what_I_want6    what_I_want6what_I_want6
Note that what_I_want# will be different for every row, but the same between columns in the same row. The prefix will always be the same for each row but can differ/repeat between rows. Cells shown as blank are "" strings.
The code I have so far:
df["col3"] = df["col1"].str.split("_", 1) + df["col2"].str.split("_", 1)
From there I wanted just the 2nd (or last) element of the split so I tried both of the following:
df["col3"] = df["col1"].str.split("_", 1)[1] + df["col2"].str.split("_", 1)[1]
df["col3"] = df["col1"].str.split("_", 1)[-1] + df["col2"].str.split("_", 1)[-1]
Both of these returned errors. The first, I think, is because of replicated values (ValueError: cannot reindex from a duplicate axis). The second is a KeyError.
You were actually quite close; you just needed to select the correct slice with str[1] and fillna the empty cells:
m = df['col1'].str.split('_', 1).str[1].fillna('') + df['col2'].str.split('_', 1).str[1].fillna('')
df['col3'] = m
                col1                col2                      col3
0   abc_what_I_want1    abc_what_I_want1  what_I_want1what_I_want1
1  psdb_what_I_want2                                  what_I_want2
2   vxc_what_I_want3    vxc_what_I_want3  what_I_want3what_I_want3
3    qk_what_I_want4     qk_what_I_want4  what_I_want4what_I_want4
4                     ertsa_what_I_want5              what_I_want5
5   abc_what_I_want6    abc_what_I_want6  what_I_want6what_I_want6
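For reference, the input frame can be reproduced as follows (a sketch; blank cells are empty strings, per the question):
import pandas as pd

df = pd.DataFrame({'col1': ['abc_what_I_want1', 'psdb_what_I_want2', 'vxc_what_I_want3',
                            'qk_what_I_want4', '', 'abc_what_I_want6'],
                   'col2': ['abc_what_I_want1', '', 'vxc_what_I_want3',
                            'qk_what_I_want4', 'ertsa_what_I_want5', 'abc_what_I_want6']})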
Another method would be to use apply where you can apply split on multiple columns at once:
m = df[['col1', 'col2']].apply(lambda x: x.str.split('_', 1).str[1]).fillna('')
df['col3'] = m['col1']+m['col2']
                col1                col2                      col3
0   abc_what_I_want1    abc_what_I_want1  what_I_want1what_I_want1
1  psdb_what_I_want2                                  what_I_want2
2   vxc_what_I_want3    vxc_what_I_want3  what_I_want3what_I_want3
3    qk_what_I_want4     qk_what_I_want4  what_I_want4what_I_want4
4                     ertsa_what_I_want5              what_I_want5
5   abc_what_I_want6    abc_what_I_want6  what_I_want6what_I_want6
You can replace() all characters up until the first underscore and then apply() a join(), or use sum() on axis=1:
df['Col3']=df.replace('^[^_]*_','',regex=True).fillna('').apply(''.join,axis=1)
Or:
df['Col3']=df.replace('^[^_]*_','',regex=True).fillna('').sum(axis=1)
Or:
df['Col3'] = (pd.Series(df.replace('^[^_]*_', '', regex=True).fillna('').values.tolist())
                .str.join(''))
                col1                col2                      Col3
0   abc_what_I_want1    abc_what_I_want1  what_I_want1what_I_want1
1  psdb_what_I_want2        what_I_want2       what_I_want2I_want2
2   vxc_what_I_want3    vxc_what_I_want3  what_I_want3what_I_want3
3    qk_what_I_want4     qk_what_I_want4  what_I_want4what_I_want4
4                NaN  ertsa_what_I_want5              what_I_want5
5   abc_what_I_want6    abc_what_I_want6  what_I_want6what_I_want6
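A hedged variant of the same idea uses str.extract to grab everything after the first underscore (cells with no underscore become NaN, hence the fillna):
part1 = df['col1'].str.extract(r'_(.*)', expand=False).fillna('')
part2 = df['col2'].str.extract(r'_(.*)', expand=False).fillna('')
df['col3'] = part1 + part2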

Pandas contains condition returns True but want to return as 1 or 0 - creation of a series/dataframe column

I have a data frame with string values in one of its columns. I want to iterate over each row in the specified column to see if the value contains the word I'm looking for. If it does, I want it to return the int value 1, and if it doesn't, 0.
df['2'] = df['Col2'].str.lower().str.contains('word')
I can only get it to return True or False
Col1  Col2                2
1     hello how are you   0
2     that is a big word  1
3     this word is bad    1
4     tonight tonight     0
This is easy to do: you can simply add .astype(int) to your boolean column, or use apply for a one-liner. See the following example.
df = pd.DataFrame(["hello how are you","that is a big word","this word is bad","tonight tonight"],columns=["Col1"])
# Method 1
df["Col2"] = df["Col1"].str.lower().str.contains('word')
df["Col2"] = df["Col2"].astype(int)
# Method 2
df["Col2"] = df["Col1"].apply(lambda x: 1 if "word" in x.lower() else 0)
df
Col1 Col2
0 hello how are you 0
1 that is a big word 1
2 this word is bad 1
3 tonight tonight 0
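If Col1 can contain missing values, str.contains returns NaN for them and the int cast fails; passing na=False treats them as non-matches (a sketch, assuming missing rows should count as 0):
df["Col2"] = df["Col1"].str.lower().str.contains("word", na=False).astype(int)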

Changing values in a dataframe column based off a different column (python)

Col1 Col2
0 APT UB0
1 AK0 UUP
2 IL2 PB2
3 OIU U5B
4 K29 AAA
My data frame looks similar to the above data. I'm trying to change the values in Col1 if the corresponding values in Col2 have the letter "B" in it. If the value in Col2 has "B", then I want to add "-B" to the end of the value in Col1.
Ultimately I want Col1 to look like this:
Col1
0 APT-B
1 AK0
2 IL2-B
.. ...
I have an idea of how to approach it... but I'm somewhat confused because I know my code is incorrect. In addition, there are NaN values in Col1 in my actual data, which will definitely give an error when I try val += "-B", since it's not possible to add a string and a float.
for value in dataframe['Col2']:
    if "Z" in value:
        for val in dataframe['Col1']:
            val += "-B"
Does anyone know how to fix/solve this?
Rather than using a loop, let's use pandas directly:
import pandas as pd
df = pd.DataFrame({'Col1': ['APT', 'AK0', 'IL2', 'OIU', 'K29'], 'Col2': ['UB0', 'UUP', 'PB2', 'U5B', 'AAA']})
df.loc[df.Col2.str.contains('B'), 'Col1'] += '-B'
print(df)
Output:
Col1 Col2
0 APT-B UB0
1 AK0 UUP
2 IL2-B PB2
3 OIU-B U5B
4 K29 AAA
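Since the question mentions NaN values in Col1, note that += on a float NaN raises a TypeError; masking out the missing rows first keeps the same approach working (a sketch under that assumption):
mask = df['Col2'].str.contains('B') & df['Col1'].notna()
df.loc[mask, 'Col1'] += '-B'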
You have too many "for" loops in your code. You just need to iterate over the rows once, and for any row satisfying your condition you make the change.
for idx, row in df.iterrows():
    if 'B' in row['Col2']:
        df.loc[idx, 'Col1'] = str(df.loc[idx, 'Col1']) + '-B'
edit: I used str to convert the previous value in Col1 to a string before appending, since you said you sometimes have non-string values there. If this doesn't work for you, please post your test data and results.
You can use a lambda expression. If 'B' is in Col2, then '-B' gets appended to Col1. The end result is assigned back to Col1.
df['Col1'] = df.apply(lambda x: x.Col1 + ('-B' if 'B' in x.Col2 else ''), axis=1)
>>> df
Col1 Col2
0 APT-B UB0
1 AK0 UUP
2 IL2-B PB2
3 OIU-B U5B
4 K29 AAA
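A vectorized alternative with numpy.where reads similarly (a sketch assuming Col1 holds only strings, since NaN + '-B' would raise):
import numpy as np

df['Col1'] = np.where(df['Col2'].str.contains('B'), df['Col1'] + '-B', df['Col1'])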
