Extract numbers, remove duplicates and separate with a comma in Python

I need help extracting multiple numbers from a column in a dataframe, removing duplicates, and separating them with a comma.
Col1
Abcde 10 hijk20
wewrw5 gagdhdh5
Mnbjgkh10,20, 30
Expected output:
Col2
10,20
5
10,20,30

Try this:
import re

punctuations = ['!','(',')','-','[',']','{','}',';',':','"','<','>','.','/','?']
for index, row in dataframe.iterrows():
    content = dataframe.iloc[index, column_index]
    # drop punctuation first (strictly optional, since the regex below
    # already replaces every non-digit with a space)
    for p in punctuations:
        content = content.replace(p, " ")
    # replace everything that is not a digit with a space
    only_numbers = re.sub("[^0-9]", " ", content)
    # split on whitespace (drops empty strings), remove duplicates, join
    numbers_found = only_numbers.split()
    no_duplicates = list(set(numbers_found))
    comma_separated = ",".join(no_duplicates)
    dataframe.iloc[index, column_index] = comma_separated

Does this answer your question? re.findall() with the regular expression r'\d+' returns a list of all non-overlapping matches of one or more consecutive decimal digits in the string. The built-in set() removes any duplicates from that list, and the sorted() built-in returns a sorted list of the set's elements. We also make use of numpy.vectorize, as it is faster than Pandas apply() for this particular application (at least on my system), though I have shown how to use apply() as well.
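For a single sample string, the building blocks behave like this (a small illustrative snippet, not part of the original answer):
import re

p = re.compile(r'\d+')           # one or more consecutive decimal digits

s = 'Mnbjgkh10,20, 30'
matches = p.findall(s)           # ['10', '20', '30']
unique = set(matches)            # duplicates removed
print(','.join(sorted(unique)))  # 10,20,30  (note: sorts as strings)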
Method 1
import pandas as pd
import numpy as np
import re
# compile RE - matches one or more decimal digits
p = re.compile(r'\d+')
# data
d = {'col1': ['Abcde 10 hijk20', 'wewrw5 gagdhdh5', 'Mnbjgkh10,20, 30'],
     'col2': [''] * 3}
# DataFrame
df = pd.DataFrame(d)
# modify col2 based on col1
df['col2'] = np.vectorize(
    lambda y: ','.join(sorted(set(p.findall(y)))),
)(df['col1'])
print(df)
Output
col1 col2
0 Abcde 10 hijk20 10,20
1 wewrw5 gagdhdh5 5
2 Mnbjgkh10,20, 30 10,20,30
If you can only use pandas and not numpy, you can do
Method 2
# modify col2 based on col1
df['col2'] = df.apply(
    lambda x: ','.join(sorted(set(p.findall(x['col1'])))), axis=1)
or even
Method 3
# modify col2 based on col1
for index, row in df.iterrows():
    df.loc[index, 'col2'] = ','.join(sorted(set(p.findall(row['col1']))))
Efficiency
On my system, vectorize (method 1) is fastest, method 3 is second fastest and method 2 is the slowest.
# Method 1
82.9 µs ± 170 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
# Method 2
399 µs ± 8.54 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# Method 3
117 µs ± 178 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
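For completeness, a pure-pandas sketch of the same idea using Series.str.findall (not one of the methods timed above) would look roughly like this:
# sketch: Series.str.findall returns a list of digit runs per row
df['col2'] = (
    df['col1']
    .str.findall(r'\d+')
    .map(lambda nums: ','.join(sorted(set(nums))))
)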

Related

How to conditionally concat 2 columns in Python Pandas Dataframe

I have the following dataframe:
d = {'col1': [1, "Avoid", 3, "Avoid"], 'col2': ["AA", "BB", "Avoid", "Avoid"]}
df = pd.DataFrame(data=d)
df
col1 col2
0 1 AA
1 Avoid BB
2 3 Avoid
3 Avoid Avoid
I have to conditionally concat col1 and col2 into col3. Conditions:
Only concat 2 columns as long as none of them is Avoid.
If any of col1 and col2 is Avoid, col3 will be equal to Avoid as well.
When performing concatenation, " & " needs to be added between the column values in col3. For instance, the first row of col3 will be "1 & AA".
The end result is supposed to look as the following:
col1 col2 col3
0 1 AA 1 & AA
1 Avoid BB Avoid
2 3 Avoid Avoid
3 Avoid Avoid Avoid
How can I do this without dealing with for loops?
Run a list comprehension on the strings in plain python:
out = [f"{l}&{r}"
if 'Avoid' not in {l, r}
else 'Avoid'
for l, r in zip(df.col1, df.col2)]
df.assign(col3 = out)
col1 col2 col3
0 1 AA 1&AA
1 Avoid BB Avoid
2 3 Avoid Avoid
3 Avoid Avoid Avoid
This is not an efficient way to work with pandas, but if you can't change the data structure, here are two solutions:
Solution 1:
def custom_merge(cols):
    if cols["col1"] == "Avoid" or cols["col2"] == "Avoid":
        return "Avoid"
    else:
        return f"{cols['col1']} & {cols['col2']}"

df['col3'] = df.apply(custom_merge, axis=1)
Solution 2:
df['col3'] = (df["col1"].astype(str) + " & " + df["col2"].astype(str)).apply(lambda x: "Avoid" if 'Avoid' in x else x)
Both solutions result in the following:
col1 col2 col3
0 1 AA 1 & AA
1 Avoid BB Avoid
2 3 Avoid Avoid
3 Avoid Avoid Avoid
Execution Time comparison
In this section I compare the execution times of the proposed solutions.
@mozway proposed two very tricky solutions in their answer, which I will call Solution 3a and Solution 3b. Another interesting solution is @sammywemmy's, which uses a list comprehension and then adds the resulting list to the dataframe; I will call this Solution 4.
The instances of the experiments will have the following structure:
import pandas as pd
n = 100000
d = {'col1': [1, "Avoid", 3, "Avoid"] * n, 'col2': ["AA", "BB", "Avoid", "Avoid"] * n}
df = pd.DataFrame(data=d)
Execution time:
Solution 1:
%timeit 3.56 s ± 71.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Solution 2:
%timeit 140 ms ± 1.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Solution 3a:
%timeit 3.44 s ± 77.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Solution 3b:
%timeit 893 ms ± 13.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Solution 4:
%timeit 191 ms ± 5.71 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Solution 1 and Solution 3a have similar execution times on the instance under consideration. Solution 3b runs about 4 times faster than Solutions 1 and 3a. The fastest solution is Solution 2, which is roughly 6 times faster than Solution 3b and about 25 times faster than Solutions 1 and 3a. This is because Solution 2 takes advantage of pandas vectorization. Solution 4 has a similar runtime to Solution 2, taking advantage of a list comprehension for the merge operation (without using pandas).
TIPS:
If you can change the data format, the advice is to structure the data so that you can use native pandas functions for the join operations. If you can't change the data format and can do without pandas, you may get a slight speed-up over Solution 2 by using dictionaries or lists and doing the merge with a list comprehension, as sketched below.
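A rough sketch of that tip, doing the merge on plain Python lists and only touching pandas at the start and end (column names as in the example above):
# pull the columns out as plain lists
col1 = df['col1'].tolist()
col2 = df['col2'].tolist()
# merge with a list comprehension, outside pandas
col3 = ['Avoid' if 'Avoid' in (l, r) else f'{l} & {r}'
        for l, r in zip(col1, col2)]
# assign the result back in one go
df['col3'] = col3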
Try this:
df["col3"] = df.apply(
    lambda x: "Avoid" if x["col1"] == "Avoid" or x["col2"] == "Avoid"
    else f"{x['col1']} & {x['col2']}",
    axis=1)
Or, building the string column first and then replacing any row that contains "Avoid":
df["col3"] = df["col1"].astype(str) + " & " + df["col2"].astype(str)
df["col3"] = df["col3"].apply(lambda x: "Avoid" if "Avoid" in x else x)
Use boolean operations; this enables you to use an arbitrary number of columns:
# is any value in the row "Avoid"?
m = df.eq('Avoid').any(axis=1)
# concatenate all columns unless there was an "Avoid"
df['col3'] = df.astype(str).agg(' & '.join, axis=1).mask(m, 'Avoid')
Alternative that should be faster if you have many rows and few with "Avoid":
m = df.ne('Avoid').all(axis=1)
df.loc[m, 'col3'] = df[m].astype(str).agg(' & '.join, axis=1)
df['col3'] = df['col3'].fillna('Avoid')
output:
col1 col2 col3
0 1 AA 1 & AA
1 Avoid BB Avoid
2 3 Avoid Avoid
3 Avoid Avoid Avoid
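For comparison, a sketch of the same boolean-mask idea written with numpy.where (not part of the original answer; it assumes only col1 and col2 are involved):
import numpy as np

# rows where either column is "Avoid"
m = df[['col1', 'col2']].eq('Avoid').any(axis=1)
# vectorized concatenation, overridden by "Avoid" where the mask is True
df['col3'] = np.where(m, 'Avoid',
                      df['col1'].astype(str) + ' & ' + df['col2'].astype(str))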

pandas better runtime, going through dataframe

I have a pandas dataframe where I want to search one column for numbers matching a pattern and put the matches in a new column.
import pandas as pd
import regex as re
import numpy as np

data = {'numbers': ['134.ABBC,189.DREB, 134.TEB', '256.EHBE, 134.RHECB, 345.DREBE', '456.RHN,256.REBN,864.TREBNSE', '256.DREB, 134.ETNHR,245.DEBHTECM'],
        'rate': [434, 456, 454256, 2334544]}
df = pd.DataFrame(data)
print(df)

pattern = '134.[A-Z]{2,}'
df['mynumbers'] = None
index_numbers = df.columns.get_loc('numbers')
index_mynumbers = df.columns.get_loc('mynumbers')
length = np.array([])
for row in range(0, len(df)):
    number = re.findall(pattern, df.iat[row, index_numbers])
    df.iat[row, index_mynumbers] = number
print(df)
I get my numbers: {'mynumbers': ['[134.ABBC, 134.TEB]', '[134.RHECB]', '[134.RHECB]']}. My dataframe is huge. Is there a better, faster method in pandas for going through my df?
Sure, use Series.str.findall instead of loops:
pattern = '134.[A-Z]{2,}'
df['mynumbers'] = df['numbers'].str.findall(pattern)
print(df)
numbers rate mynumbers
0 134.ABBC,189.DREB, 134.TEB 434 [134.ABBC, 134.TEB]
1 256.EHBE, 134.RHECB, 345.DREBE 456 [134.RHECB]
2 456.RHN,256.REBN,864.TREBNSE 454256 []
3 256.DREB, 134.ETNHR,245.DEBHTECM 2334544 [134.ETNHR]
If you want to use re.findall, it is possible, only about 2 times slower:
pattern = '134.[A-Z]{2,}'
df['mynumbers'] = df['numbers'].map(lambda x: re.findall(pattern, x))
# [40000 rows]
df = pd.concat([df] * 10000, ignore_index=True)
pattern = '134.[A-Z]{2,}'
In [46]: %timeit df['numbers'].map(lambda x: re.findall(pattern, x))
50 ms ± 491 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [47]: %timeit df['numbers'].str.findall(pattern)
21.2 ms ± 340 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
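One caveat worth noting (not raised in the answer): in the pattern '134.[A-Z]{2,}' the dot is unescaped and matches any character, so for example '134XABCD' would also match. If a literal dot is intended, escape it:
pattern = r'134\.[A-Z]{2,}'
df['mynumbers'] = df['numbers'].str.findall(pattern)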

How to get frequency of each element in column (having array of strings) of data frame with pandas?

I have a pandas data frame in Python as below:
df['column']
0    [abc, mno]
1    [mno, pqr]
2    [abc, mno]
3    [mno, pqr]
I want to get the count of each item, like below:
abc = 2,
mno = 4,
pqr = 2
I can iterate over each row to count, but this is not the kind of solution I'm looking for.
If there is any way I can use iloc or anything related to that, please suggest it.
I have looked at various solutions to similar problems, but none of them fit my scenario.
Here is how I'd solve it using .explode() and .value_counts(); you can then assign the result as a column or do as you please with the output:
In one line:
print(df.explode('column')['column'].value_counts())
Full example:
import pandas as pd
data_1 = {'index':[0,1,2,3],'column':[['abc','mno'],['mno','pqr'],['abc','mno'],['mno','pqr']]}
df = pd.DataFrame(data_1)
df = df.set_index('index')
print(df)
column
index
0 [abc, mno]
1 [mno, pqr]
2 [abc, mno]
3 [mno, pqr]
Here we use .explode() to create individual values from the lists and value_counts() to count occurrences of each unique value:
df_new = df.explode('column')
print(df_new['column'].value_counts())
Output:
mno 4
abc 2
pqr 2
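If you need the counts as a plain dictionary afterwards, a small usage sketch:
counts = df.explode('column')['column'].value_counts().to_dict()
print(counts)  # e.g. {'mno': 4, 'abc': 2, 'pqr': 2}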
Use collections.Counter
from collections import Counter
from itertools import chain
Counter(chain.from_iterable(df.column))
Out[196]: Counter({'abc': 2, 'mno': 4, 'pqr': 2})
%timeit
df1 = pd.concat([df]*10000, ignore_index=True)
In [227]: %timeit pd.Series(Counter(chain.from_iterable(df1.column)))
14.3 ms ± 279 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [228]: %timeit df1.column.explode().value_counts()
127 ms ± 3.06 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Apply a custom function on rows which takes two lists as input

I have a data frame like this,
col1 col2
[1,2] [3,4]
[5,6] [7,8]
[9,5] [1,3]
[8,4] [3,6]
and I have a function f which takes two lists as input and returns a single value. I want to add a column col3 by applying the function to the col1 and col2 values of each row; the output of the function provides the col3 values, so the final data frame would look like this:
col1 col2 col3
[1,2] [3,4] 3
[5,6] [7,8] 5
[9,5] [1,3] 8
[8,4] [3,6] 9
Using a for loop and passing the list values each time, I can calculate the col3 values, but the execution time will be long. I'm looking for a pythonic way to do the task more efficiently.
Working with lists in pandas is not well vectorized; a possible solution is a list comprehension:
df['col3'] = [func(a, b) for a,b in zip(df.col1, df.col2)]
Pandas apply solution (should be slower):
df['col3'] = df.apply(lambda x: func(x.col1, x.col2), axis=1)
But if the function can be vectorized and the lists in the columns all have the same length, it may be possible to rewrite it with numpy (see the sketch after the timings below).
If not, rewriting the function with numba may help.
Performance with custom function:
#[40000 rows x 2 columns]
df = pd.concat([df] * 10000, ignore_index=True)
#sample function
def func(x, y):
    return min(x + y)
In [144]: %timeit df['col31'] = [func(a, b) for a,b in zip(df.col1, df.col2)]
39.6 ms ± 331 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [145]: %timeit df['col32'] = df.apply(lambda x: func(x.col1, x.col2), axis=1)
2.25 s ± 13.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
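As a sketch of the numpy rewrite mentioned above, assuming every list in col1 and col2 has the same length and using the sample func (min over the values of both lists):
import numpy as np

# stack the lists into 2D arrays (requires equal-length lists per column)
a = np.array(df['col1'].tolist())
b = np.array(df['col2'].tolist())
# equivalent of func(x, y) = min(x + y) on Python lists
df['col33'] = np.concatenate([a, b], axis=1).min(axis=1)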

Pandas question: create two aggregations with one being conditionally created

I have a dataframe like the following:
label val
a 0
b -1
b 0
b 1
a 1
b 1
My goal here is to group by the label column and get two aggregated columns: one that shows the number of rows in each group (e.g. a: 2, b: 4) and a second with the proportion of rows in each group where val = 1. What is the best way to do this in pandas?
Finding the proportion of a column that satisfies a condition is equivalent to taking the mean of a Boolean Series. This allows for it to be done quickly. Since s and df share the same index, it's perfectly fine to use one to group the other.
To get multiple aggregations for a column, supply a list that specifies what you want to do.
s = df.val.eq(1)
s.groupby(df.label).agg(['size', 'mean'])
# size mean
#label
#a 2 0.5
#b 4 0.5
When the number of groups becomes large, using "tricks" like this can be significantly faster than using a lambda, because many of the basic groupby aggregations have cythonized versions that are extremely performant.
# Create a sample df with 20,000 unique groups
df = pd.concat([df]*10000, ignore_index=True)
df['label'] = df.index//3
%%timeit
s = df.val.eq(1)
s.groupby(df.label).agg(['size', 'mean'])
#10.8 ms ± 300 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
def portion(x): return (x.eq(1).sum())/len(x)
df.groupby('label').val.agg(['size', portion])
#7.93 s ± 82.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Try:
def portion(x): return (x.eq(1).sum())/len(x)
df.groupby('label').val.agg(['size', portion])
Output:
size portion
label
a 2 0.5
b 4 0.5
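For reference, the same table can also be produced with pandas named aggregation (a sketch, not part of the original answers):
out = (
    df.assign(is_one=df['val'].eq(1))      # Boolean helper column
      .groupby('label')
      .agg(size=('val', 'size'),           # rows per group
           portion=('is_one', 'mean'))     # proportion where val == 1
)
print(out)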
