I'm wondering about string lists in a DataFrame: how can I split the string values using Python?
I tried the replace method, but I can't find a way to delete only the node number.
dataframe
index article_id
0 ['#abc_172', '#abc_249', '#abc-32', '#def-1']
1 ['#az3_2', '#bwc_4', '#xc-34', '#xc-1']
2 ['#ac_12']
3 ['#ea457870a2d32453609f52e50f84abdc_15', '#bb_3']
4 ...
... ...
I want to get something like this:
index article_id article_id_unique_count
0 ['abc', 'abc', 'abc', 'def'] 2
1 ['az3', 'bwc', 'xc', 'xc'] 3
2 ['ac'] 1
3 ['#ea457870a2d32453609f52e50f84abdc', 'bb'] 2
...
Use re.findall:
df['article_id'] = df.article_id.apply(lambda x: re.findall('([#a-z0-9]+)', x)).apply(lambda x: [i for i in x if not i.isdigit()])
df['article_id_unique_count'] = df['article_id'].apply(lambda x: len(set(x)))
Output
article_id article_id_unique_count
0 [abc, abc, abc, def] 2
1 [az3, bwc, xc, xc] 3
2 [ac] 1
3 [#ea457870a2d32453609f52e50f84abdc, bb] 2
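For reference, here is a self-contained sketch of the same idea. The sample data and the strip_node_number helper are my own assumptions, not from the answer; if your column actually stores string representations of lists, parse them with ast.literal_eval first.

```python
import re
import pandas as pd

# Sample data assumed from the question; each cell is a list of tag strings.
df = pd.DataFrame({
    "article_id": [
        ["#abc_172", "#abc_249", "#abc-32", "#def-1"],
        ["#az3_2", "#bwc_4", "#xc-34", "#xc-1"],
        ["#ac_12"],
    ]
})

def strip_node_number(tag):
    # Drop the trailing -NNN / _NNN node number, then any leading '#'
    return re.sub(r'^#', '', re.sub(r'[-_]\d+$', '', tag))

df["article_id"] = df["article_id"].apply(lambda tags: [strip_node_number(t) for t in tags])
df["article_id_unique_count"] = df["article_id"].apply(lambda tags: len(set(tags)))
```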
Assuming the delimiters are either - or _:
df['article_id'].map(lambda x: [re.findall('#*(.+?)[-_]', s)[0] for s in x])
Output:
0 [abc, abc, abc, def]
1 [az3, bwc, xc, xc]
2 [ac]
3 [#ea457870a2d32453609f52e50f84abdc, bb]
You can then use apply(lambda x: len(set(x))) to get the unique counts.
Notice that the first element of row 1, az3, is also correctly extracted.
Apply a regex within apply, and use set to count the unique elements in each list:
import re
df = pd.DataFrame(data={"id":[0,1,2],
"article_id":[["abc_172", "#abc_249", "#abc-32", "#def-1"],
["#az3_2", "#bwc_4", "#xc-34", "#xc-1"],
["##ea457870a2d32453609f52e50f84abdc_15"]]})
df['article_id'] = df['article_id'].apply(lambda x: [re.sub('[!#$]', '', i).split("-")[0].split("_")[0] for i in x])
df['article_id_unique_count'] = df['article_id'].apply(lambda x : len(set(x)))
id article_id article_id_unique_count
0 0 [abc, abc, abc, def] 2
1 1 [az3, bwc, xc, xc] 3
2 2 [#ea457870a2d32453609f52e50f84abdc] 1
The other solutions use apply. I always try to find a solution without apply, and came up with this one: construct a DataFrame from the lists, stack it to a Series, then work with str.extract and agg:
(pd.DataFrame(df.article_id.tolist(), index=df.index).stack().str.extract(r'#?(.*)[_-]')
.groupby(level=0)[0].agg([list, 'nunique'])
.rename(columns={'list': 'article_id', 'nunique': 'article_id_unique_count'}))
Out[15]:
article_id article_id_unique_count
0 [abc, abc, abc, def] 2
1 [az3, bwc, xc, xc] 3
2 [ac] 1
3 [#ea457870a2d32453609f52e50f84abdc, bb] 2
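Run on a small sample, the pipeline above produces the expected lists and counts. This is just a verification sketch, with the data and column name assumed from the question:

```python
import pandas as pd

df = pd.DataFrame({"article_id": [
    ["#abc_172", "#abc_249", "#abc-32", "#def-1"],
    ["#az3_2", "#bwc_4", "#xc-34", "#xc-1"],
]})

out = (pd.DataFrame(df.article_id.tolist(), index=df.index)
         .stack()
         .str.extract(r'#?(.*)[_-]')          # strip optional '#' and the trailing node number
         .groupby(level=0)[0]
         .agg([list, 'nunique'])
         .rename(columns={'list': 'article_id',
                          'nunique': 'article_id_unique_count'}))
```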
Related
I need a way to extract all words that start with 'A' followed by a 6-digit numeric string right after (i.e. A112233, A000023).
Each cell contains sentences and there could potentially be a user error where they forget to put a space, so if you could account for that as well it would be greatly appreciated.
I've done research into using Python regex and Pandas, but I just don't know enough yet and am kind of on a time crunch.
Suppose your df is constructed with the following code:
import pandas as pd
df1=pd.DataFrame(
{
"columnA":["A194533","A4A556633 system01A484666","A4A556633","a987654A948323a882332A484666","A238B004867","pageA000023lol","a089923","something lol a484876A48466 emoji","A906633 A556633a556633"]
}
)
print(df1)
Output:
columnA
0 A194533
1 A4A556633 system01A484666
2 A4A556633
3 a987654A948323a882332A484666
4 A238B004867
5 pageA000023lol
6 a089923
7 something lol a484876A48466 emoji
8 A906633 A556633a556633
Now let's fetch the targets matching the regex pattern:
result = df1['columnA'].str.extractall(r'(A\d{6})')
Output:
0
match
0 0 A194533
1 0 A556633
1 A484666
2 0 A556633
3 0 A948323
1 A484666
5 0 A000023
8 0 A906633
1 A556633
And count them:
result.value_counts()
Output:
A556633 3
A484666 2
A000023 1
A194533 1
A906633 1
A948323 1
dtype: int64
Put the unique index values into a list:
unique_list = [i[0] for i in result.value_counts().index.tolist()]
Output:
['A556633', 'A484666', 'A000023', 'A194533', 'A906633', 'A948323']
And the value counts into a list:
unique_count_list = result.value_counts().values.tolist()
Output:
[3, 2, 1, 1, 1, 1]
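Putting the pieces together on a smaller assumed sample: extractall returns one row per match, indexed by original row number and match number, and the rest follows as above.

```python
import pandas as pd

df1 = pd.DataFrame({"columnA": ["A194533",
                                "A4A556633 system01A484666",
                                "pageA000023lol"]})

# One row per regex match, even when a cell has no surrounding spaces
result = df1["columnA"].str.extractall(r'(A\d{6})')

counts = result[0].value_counts()
unique_list = counts.index.tolist()
unique_count_list = counts.values.tolist()
```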
Suppose I have the following dataframe:
df = pd.DataFrame({'X':['AB_123_CD','EF_123CD','XY_Z'],'Y':[1,2,3]})
X Y
0 AB_123_CD 1
1 EF_123CD 2
2 XY_Z 3
I want to strip the first prefix so that I get:
X Y
0 123_CD 1
1 123CD 2
2 Z 3
I tried df.X.str.split('_').str[-1].str.strip(), but since the positions of the _'s differ, it returns a different result from the one desired above. How can I address this issue?
You're close, you can split once (n=1) from the left and keep the second one (str[1]):
df.X = df.X.str.split("_", n=1).str[1]
to get
>>> df
X Y
0 123_CD 1
1 123CD 2
2 Z 3
Try this instead:
df["X"] = df["X"].apply(lambda x: x[x.find("_")+1:])
>>> df
X Y
0 123_CD 1
1 123CD 2
2 Z 3
This keeps the entire string after the first occurrence of _.
The following code could do the job:
df['X'] = df.X.apply(lambda x: '_'.join(x.split('_')[1:]))
Your solution is very close. With some minor changes, it should work:
df.X.str.split('_').str[1:].str.join('_')
0 123_CD
1 123CD
2 Z
Name: X, dtype: object
You can define maxsplit in the str.split() function. It sounds like you just want to split with maxsplit 1 and take the last element:
df['X'] = df['X'].apply(lambda x: x.split('_',1)[-1])
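The answers above are all equivalent on this example; a quick side-by-side check (a sketch using the question's data):

```python
import pandas as pd

df = pd.DataFrame({'X': ['AB_123_CD', 'EF_123CD', 'XY_Z'], 'Y': [1, 2, 3]})

a = df['X'].str.split('_', n=1).str[1]              # vectorized split, keep the remainder
b = df['X'].apply(lambda x: x[x.find('_') + 1:])    # slice after the first '_'
c = df['X'].str.split('_').str[1:].str.join('_')    # drop the first piece, re-join
d = df['X'].apply(lambda x: x.split('_', 1)[-1])    # plain str.split with maxsplit=1
```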
I want to convert a dataframe which has tuples in cells into a dataframe with MultiIndex.
Here is an example of the table code:
d = {2:[(0,2),(0,4)], 3:[(826.0, 826.0),(4132.0, 4132.0)], 4:[(6019.0, 6019.0),(12037.0, 12037.0)], 6:[(18337.0, 18605.0),(36674.0, 37209.0)]}
test = pd.DataFrame(d)
This is what the dataframe looks like:
2 3 4 6
0 (0, 2) (826.0, 826.0) (6019.0, 6019.0) (18337.0, 18605.0)
1 (0, 4) (4132.0, 4132.0) (12037.0, 12037.0) (36674.0, 37209.0)
This is what I want it to look like
2 3 4 6
0 A 0 826.0 6019.0 18337.0
B 2 826.0 6019.0 18605.0
1 A 0 4132.0 12037.0 36674.0
B 4 4132.0 12037.0 37209.0
Thanks for your help!
I'm unsure about efficiency, because this relies on the apply method, but you could concat the dataframe with itself, adding an 'A' label to the first copy and a 'B' to the second. Then sort the resulting dataframe by its index, and use apply to map even rows to the first value of each tuple and odd rows to the second:
df = pd.concat([test.assign(X='A'), test.assign(X='B')]).set_index(
'X', append=True).sort_index().rename_axis(index=(None, None))
df.iloc[0:len(df):2] = df.iloc[0:len(df):2].apply(lambda x: x.apply(lambda y: y[0]))
df.iloc[1:len(df):2] = df.iloc[1:len(df):2].apply(lambda x: x.apply(lambda y: y[1]))
It gives as expected:
2 3 4 6
0 A 0 826 6019 18337
B 2 826 6019 18605
1 A 0 4132 12037 36674
B 4 4132 12037 37209
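The same result can also be sketched without the even/odd-row apply, by extracting tuple element i for each label up front. This is my own variant, not the answer's code (applymap emits a deprecation warning on pandas >= 2.1 but still works):

```python
import pandas as pd

d = {2: [(0, 2), (0, 4)], 3: [(826.0, 826.0), (4132.0, 4132.0)],
     4: [(6019.0, 6019.0), (12037.0, 12037.0)]}
test = pd.DataFrame(d)

# Build one frame per tuple position, label it 'A'/'B', then interleave by sorting
parts = [test.applymap(lambda t: t[i]).assign(X=label)
         for i, label in enumerate('AB')]
out = (pd.concat(parts)
         .set_index('X', append=True)
         .sort_index()
         .rename_axis(index=(None, None)))
```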
I am trying to count how many characters from the first column appear in the second one. They may appear in a different order, and they should not be counted twice.
For example, in this df
df = pd.DataFrame(data=[["AL0","CP1","NM3","PK9","RM2"],["AL0X24",
"CXP44",
"MLN",
"KKRR9",
"22MMRRS"]]).T
the result should be:
result = [3,2,2,2,3]
Looks like set.intersection after zipping the 2 columns:
[len(set(a).intersection(set(b))) for a,b in zip(df[0],df[1])]
#[3, 2, 2, 2, 3]
The other solutions will fail when you compare strings that both contain the same character multiple times, e.g. AAL0 and AAL0X24. The result there should be 4.
from collections import Counter
df = pd.DataFrame(data=[["AL0","CP1","NM3","PK9","RM2", "AAL0"],
["AL0X24", "CXP44", "MLN", "KKRR9", "22MMRRS", "AAL0X24"]]).T
def num_shared_chars(char_counter1, char_counter2):
shared_chars = set(char_counter1.keys()).intersection(char_counter2.keys())
return sum([min(char_counter1[k], char_counter2[k]) for k in shared_chars])
df_counter = df.applymap(Counter)
df['shared_chars'] = df_counter.apply(lambda row: num_shared_chars(row[0], row[1]), axis = 'columns')
Result:
0 1 shared_chars
0 AL0 AL0X24 3
1 CP1 CXP44 2
2 NM3 MLN 2
3 PK9 KKRR9 2
4 RM2 22MMRRS 3
5 AAL0 AAL0X24 4
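A quick check of the multiset behavior, using a condensed copy of the same function:

```python
from collections import Counter

def num_shared_chars(c1, c2):
    # Count each shared character up to its minimum multiplicity in the two strings
    shared = set(c1) & set(c2)
    return sum(min(c1[k], c2[k]) for k in shared)

# Plain set intersection would give 3 for 'AAL0' vs 'AAL0X24';
# the Counter version correctly counts the duplicated 'A'.
```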
Sticking to the dataframe data structure, you could do:
>>> def count_common(s1, s2):
... return len(set(s1) & set(s2))
...
>>> df["result"] = df.apply(lambda x: count_common(x[0], x[1]), axis=1)
>>> df
0 1 result
0 AL0 AL0X24 3
1 CP1 CXP44 2
2 NM3 MLN 2
3 PK9 KKRR9 2
4 RM2 22MMRRS 3
I have column's named like this:
1:Arnston 2:Berg 3:Carlson 53:Brown
and I want to strip all the characters before and including :. I know I can rename the columns, but that would be pretty tedious since my numbers go up to 100.
My desired output is:
Arnston Berg Carlson Brown
Assuming that you have a frame looking something like this:
>>> df
1:Arnston 2:Berg 3:Carlson 53:Brown
0 5 0 2 1
1 9 3 2 9
2 9 2 9 7
You can use the vectorized string operators to split each entry at the first colon and then take the second part:
>>> df.columns = df.columns.str.split(":", n=1).str[1]
>>> df
Arnston Berg Carlson Brown
0 5 0 2 1
1 9 3 2 9
2 9 2 9 7
import re
s = '1:Arnston 2:Berg 3:Carlson 53:Brown'
s_minus_numbers = re.sub(r'\d+:', '', s)
Gets you
'Arnston Berg Carlson Brown'
The best solution IMO is to use pandas' str attribute on the columns. This allows for the use of regular expressions without having to import re:
df.columns.str.extract(r'\d+:(.*)')
Where the regex means: select everything ((.*)) after one or more digits (\d+) and a colon (:).
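One caveat worth noting: str.extract on an Index returns a DataFrame by default, so to assign the result back to df.columns, pass expand=False to get an Index back. A minimal sketch, with the frame's values assumed:

```python
import pandas as pd

df = pd.DataFrame({'1:Arnston': [5], '2:Berg': [0], '3:Carlson': [2], '53:Brown': [1]})

# expand=False keeps the result as an Index, ready to assign back
df.columns = df.columns.str.extract(r'\d+:(.*)', expand=False)
```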
You can do it with a list comprehension:
columns = '1:Arnston 2:Berg 3:Carlson 53:Brown'.split()
print('Before: {!r}'.format(columns))
columns = [col.split(':')[1] for col in columns]
print('After: {!r}'.format(columns))
Output
Before: ['1:Arnston', '2:Berg', '3:Carlson', '53:Brown']
After: ['Arnston', 'Berg', 'Carlson', 'Brown']
Another way is with a regular expression using re.sub():
import re
columns = '1:Arnston 2:Berg 3:Carlson 53:Brown'.split()
pattern = re.compile(r'^.+:')
columns = [pattern.sub('', col) for col in columns]
print(columns)
Output
['Arnston', 'Berg', 'Carlson', 'Brown']
df = pd.DataFrame({'1:Arnston':[5,9,9],
'2:Berg':[0,3,2],
'3:Carlson':[2,2,9] ,
'53:Brown':[1,9,7]})
[x.split(':')[1] for x in df.columns.factorize()[1]]
output:
['Arnston', 'Berg', 'Carlson', 'Brown']
You could use str.replace and pass regex expression:
In [52]: df
Out[52]:
1:Arnston 2:Berg 3:Carlson 53:Brown
0 1.340711 1.261500 -0.512704 -0.064384
1 0.462526 -0.358382 0.168122 -0.660446
2 -0.089622 0.656828 -0.838688 -0.046186
3 1.041807 0.775830 -0.436045 0.162221
4 -0.422146 0.775747 0.106112 -0.044917
In [51]: df.columns.str.replace(r'\d+[:]', '', regex=True)
Out[51]: Index(['Arnston', 'Berg', 'Carlson', 'Brown'], dtype='object')