I have the following df.
data = [
['DWWWWD'],
['DWDW'],
['WDWWWWWWWWD'],
['DDW'],
['WWD'],
]
df = pd.DataFrame(data, columns=['letter_sequence'])
I want to subset the rows that contain the pattern 'D' + [any number of W's] + 'D'. Examples of rows I want in my output df: DWD, DWWWWWWWWWWWD, WWWWWDWDW...
I came up with the following, but it does not really handle 'any number of W's'.
df[df['letter_sequence'].str.contains(
'DWD|DWWD|DWWWD|DWWWWD|DWWWWWD|DWWWWWWD|DWWWWWWWD|DWWWWWWWWD', regex=True
)]
Desired output new_df:
letter_sequence
0 DWWWWD
1 DWDW
2 WDWWWWWWWWD
Any alternatives?
Use [W]{1,} for one or more W. regex=True is the default, so it can be omitted:
df = df[df['letter_sequence'].str.contains('D[W]{1,}D')]
print (df)
letter_sequence
0 DWWWWD
1 DWDW
2 WDWWWWWWWWD
You can use the regex: D\w+D (note the backslash; \w matches any word character).
The code is shown below:
df = df[df['letter_sequence'].str.contains(r'D\w+D')]
Please let me know if it helps.
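One caveat worth noting: \w matches any word character (letters, digits, underscore), so D\w+D would also match sequences with letters other than W between the D's; DW+D restricts the middle to W's only. A minimal sketch on the question's data:

```python
import pandas as pd

df = pd.DataFrame({'letter_sequence': ['DWWWWD', 'DWDW', 'WDWWWWWWWWD', 'DDW', 'WWD']})

# 'DW+D' requires a D, then one or more W's (nothing else), then a D
out = df[df['letter_sequence'].str.contains(r'DW+D')]
print(out)
```

This keeps exactly the three rows from the desired output.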
I want to search for names in column col_one, where I have a list of names in the variable list20. If the value of col_one matches an entry in list20, put the matched name in a new column named new_col.
Most of the time the name is at the front, such as ZEN, W, WICE, but some names have
a symbol after them, such as ZEN-R, ZEN-W2, ZEN13P2302A.
my data
import pandas as pd
list20 = ['ZEN', 'OOP', 'WICE', 'XO', 'WP', 'K', 'WGE', 'YGG', 'W', 'YUASA', 'XPG', 'ABC', 'WHA', 'WHAUP', 'WFX', 'WINNER', 'WIIK', 'WIN', 'YONG', 'WPH', 'KCE']
data = {
"col_one": ["ZEN", "WPH", "WICE", "YONG", "K", "XO", "WIN", "WP", "WIIK", "YGG-W1", "W-W5", "WINNER", "YUASA", "WGE", "WFX", "XPG", "WHAUP", "WHA", "KCE13P2302A", "OOP-R"],
}
df = pd.DataFrame(data)
# The code you provided gives the result in the picture below, and it's not right
# or--------
df['new_col'] = df['col_one'].str.extract('('+'|'.join(list20)+')')[0]
# or--------
import re
pattern = re.compile(r"|".join(x for x in list20))
df = (df
.assign(new=lambda x: [re.findall(pattern, string)[0] for string in x.col_one])
)
# or----------
def matcher(col_one):
    for i in list20:
        if i in col_one:
            return i
    return 'na'  # adjust as you see fit

df['new_col'] = df.apply(lambda x: matcher(x['col_one']), axis=1)
The result obtained from the code above is not right.
Expected Output
Try sorting the list by length first, so longer names are matched before shorter ones that are their prefixes:
pattern = re.compile(r"|".join(x for x in sorted(list20, reverse=True, key=len)))
(df
.assign(new=lambda x: [re.findall(pattern, string)[0] for string in x.col_one])
)
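The sorting matters because Python's re alternation takes the first alternative that matches, not the longest one. A small sketch with two overlapping names from list20:

```python
import re

names = ['WHA', 'WHAUP']

# unsorted: 'WHA' is tried first and wins, even inside 'WHAUP'
unsorted_pat = re.compile('|'.join(names))

# sorted longest-first: 'WHAUP' is tried before its prefix 'WHA'
sorted_pat = re.compile('|'.join(sorted(names, reverse=True, key=len)))

print(re.findall(unsorted_pat, 'WHAUP')[0])
print(re.findall(sorted_pat, 'WHAUP')[0])
```

The first print shows 'WHA' (the truncated match), the second 'WHAUP'.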
Try with str.extract
df['new'] = df['col_one'].str.extract('('+'|'.join(list20)+')')[0]
df
Out[121]:
col_one new
0 CFER CFER
1 ABCP6P45C9 ABC
2 LOU-W5 LOU
3 CFER-R CFER
4 ABC-W1 ABC
5 LOU13C2465 LOU
One way to do this, though less attractive in terms of efficiency, is to use a simple function with a lambda:
def matcher(col_one):
    for i in list20:
        if i in col_one:
            return i
    return 'na'  # adjust as you see fit

df['new_col'] = df.apply(lambda x: matcher(x['col_one']), axis=1)
df
expected results:
col_one new_col
0 CFER CFER
1 ABCP6P45C9 ABC
2 LOU-W5 LOU
3 CFER-R CFER
4 ABC-W1 ABC
5 LOU13C2465 LOU
Another approach:
pattern = re.compile(r"|".join(x for x in list20))
(df
.assign(new=lambda x: [re.findall(pattern, string)[0] for string in x.col_one])
)
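One caveat with the findall(...)[0] approach: it raises IndexError as soon as a row matches nothing, whereas str.extract simply leaves NaN. A small sketch (the value 'NOMATCH' is made up for illustration):

```python
import re
import pandas as pd

names = ['ZEN', 'WHA']
df = pd.DataFrame({'col_one': ['ZEN-R', 'NOMATCH']})

# str.extract leaves NaN where nothing matches
df['new'] = df['col_one'].str.extract('(' + '|'.join(names) + ')')[0]

# indexing findall(...)[0] instead blows up on the non-matching row
pattern = re.compile('|'.join(names))
try:
    [re.findall(pattern, s)[0] for s in df['col_one']]
    raised = False
except IndexError:
    raised = True

print(df)
```

So if the data may contain unmatched rows, prefer str.extract or add a fallback in the comprehension.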
I have a data frame which contains URLs, and I want to extract the part in between.
df
URL
https://storage.com/vision/Glass2020/2020-02-04_B8I8FZHl-xJ_2236301468348443721.jpg
https://storage.com/vision/Carpet5020/2020-02-04_B8I8FZHl-xJ_2236301468348443721.jpg
https://storage.com/vision/Metal8020/2020-02-04_B8I8FZHl-xJ_2236301468348443721.jpg
The desired output would be like this:
URL Type
https://storage.com/vision/Glass2020/2020-02-04_B8I8FZHl-xJ_2236301468348443721.jpg Glass2020
https://storage.com/vision/Carpet5020/2020-02-04_B8I8FZHl-xJ_2236301468348443721.jpg Carpet5020
https://storage.com/vision/Metal8020/2020-02-04_B8I8FZHl-xJ_2236301468348443721.jpg Metal8020
I would use df['URL'].str.extract, but I don't understand how to define what comes before and after the delimiter.
One idea is to use Series.str.split and select the second-to-last value by indexing:
df['Type'] = df['URL'].str.split('/').str[-2]
print (df)
URL Type
0 https://storage.com/vision/Glass2020/2020-02-0... Glass2020
1 https://storage.com/vision/Carpet5020/2020-02-... Carpet5020
2 https://storage.com/vision/Metal8020/2020-02-0... Metal8020
EDIT: To handle values that fall outside the expected output, use Series.str.extract:
df['Type'] = df['URL'].str.extract('vision/(.+)/2020')
print (df)
URL Type
0 https://storage.com/vision/Glass2020/2020-02-0... Glass2020
1 https://storage.com/vision/Carpet5020/2020-02-... Carpet5020
2 https://storage.com/vision/Metal8020/2020-02-0... Metal8020
Try str.split:
df['Type'] = df.URL.str.split('/').str[-2]
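If you prefer not to hand-roll the splitting, the standard-library urlsplit can isolate the path component first; a sketch assuming every URL keeps the same .../vision/<Type>/<file> shape:

```python
from urllib.parse import urlsplit

import pandas as pd

df = pd.DataFrame({'URL': [
    'https://storage.com/vision/Glass2020/2020-02-04_B8I8FZHl-xJ_2236301468348443721.jpg',
]})

# take the URL path (ignoring scheme and host) and grab the second-to-last segment
df['Type'] = df['URL'].map(lambda u: urlsplit(u).path.split('/')[-2])
print(df['Type'].tolist())
```

This behaves the same as str.split('/').str[-2] here, but would keep working if a query string were appended to the URL.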
I have a dataframe "bb" like this:
Response Unique Count
I love it so much! 246_0 1
This is not bad, but can be better. 246_1 2
Well done, let's do it. 247_0 1
If Count is larger than 1, I would like to split the string so that the dataframe "bb" becomes this (the result I expect):
Response Unique
I love it so much! 246_0
This is not bad 246_1_0
but can be better. 246_1_1
Well done, let's do it. 247_0
My code:
import numpy as np
bb = pd.DataFrame(bb[bb['Count'] > 1].Response.str.split(',').tolist(), index=bb[bb['Count'] > 1].Unique).stack()
bb = bb.reset_index()[[0, 'Unique']]
bb.columns = ['Response','Unique']
bb=bb.replace('', np.nan)
bb=bb.dropna()
print(bb)
But the result is like this:
Response Unique
0 This is not bad 246_1
1 but can be better. 246_1
How can I keep the original dataframe in this case?
First split only the values matching the condition into a new helper Series, then append counter values from GroupBy.cumcount, but only to duplicated index values identified by Index.duplicated:
s = df.loc[df.pop('Count') > 1, 'Response'].str.split(',', expand=True).stack()
df1 = df.join(s.reset_index(drop=True, level=1).rename('Response1'))
df1['Response'] = df1.pop('Response1').fillna(df1['Response'])
mask = df1.index.duplicated(keep=False)
df1.loc[mask, 'Unique'] += df1[mask].groupby(level=0).cumcount().astype(str).radd('_')
df1 = df1.reset_index(drop=True)
print (df1)
Response Unique
0 I love it so much! 246_0
1 This is not bad 246_1_0
2 but can be better. 246_1_1
3 Well done, let's do it. 247_0
EDIT: If you need _0 appended to all other values as well, remove the mask:
s = df.loc[df.pop('Count') > 1, 'Response'].str.split(',', expand=True).stack()
df1 = df.join(s.reset_index(drop=True, level=1).rename('Response1'))
df1['Response'] = df1.pop('Response1').fillna(df1['Response'])
df1['Unique'] += df1.groupby(level=0).cumcount().astype(str).radd('_')
df1 = df1.reset_index(drop=True)
print (df1)
Response Unique
0 I love it so much! 246_0_0
1 This is not bad 246_1_0
2 but can be better. 246_1_1
3 Well done, let's do it. 247_0_0
Step-wise, we can solve this problem as follows:
Split your dataframes by count
Use this function to explode the string to rows
We groupby on index and use cumcount to get the correct unique column values.
Finally we concat the dataframes together again.
df1 = df[df['Count'].ge(2)] # all rows which have a count 2 or higher
df2 = df[df['Count'].eq(1)] # all rows which have count 1
df1 = explode_str(df1, 'Response', ',') # explode the string to rows on comma delimiter
# Create the correct unique column
df1['Unique'] = df1['Unique'] + '_' + df1.groupby(df1.index).cumcount().astype(str)
df = pd.concat([df1, df2]).sort_index().drop('Count', axis=1).reset_index(drop=True)
Response Unique
0 I love it so much! 246_0
1 This is not bad 246_1_0
2 but can be better. 246_1_1
3 Well done, let's do it. 247_0
Function used from the linked answer (note it needs numpy):
import numpy as np

def explode_str(df, col, sep):
    s = df[col]
    i = np.arange(len(s)).repeat(s.str.count(sep) + 1)
    return df.iloc[i].assign(**{col: sep.join(s).split(sep)})
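For reference, here is a self-contained usage sketch of explode_str on a toy frame (the column values are made up):

```python
import numpy as np
import pandas as pd

def explode_str(df, col, sep):
    s = df[col]
    # repeat each row index once per separator-delimited piece in that row
    i = np.arange(len(s)).repeat(s.str.count(sep) + 1)
    # re-select the repeated rows, then overwrite the column with the pieces
    return df.iloc[i].assign(**{col: sep.join(s).split(sep)})

df = pd.DataFrame({'Response': ['a,b', 'c'], 'Unique': ['246_1', '247_0']})
out = explode_str(df, 'Response', ',')
print(out)
```

Row 'a,b' is duplicated into two rows carrying 'a' and 'b', each keeping its original index and Unique value, which is what the groupby-on-index cumcount step relies on.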
I use sort_values to sort a dataframe. The dataframe contains UTF-8 characters with accents. Here is an example:
>>> df = pd.DataFrame ( [ ['i'],['e'],['a'],['é'] ] )
>>> df.sort_values(by=[0])
0
2 a
1 e
0 i
3 é
As you can see, the "é" with an accent is at the end instead of being after the "e" without accent.
Note that the real dataframe has several columns !
This is one way. The simplest solution, as suggested by @JonClements:
df = df.iloc[df[0].str.normalize('NFKD').argsort()]
An alternative, long-winded solution, with normalization code courtesy of @EdChum:
df = pd.DataFrame([['i'],['e'],['a'],['é']])
df = df.iloc[df[0].str.normalize('NFKD').argsort()]
# remove accents
df[1] = df[0].str.normalize('NFKD')\
.str.encode('ascii', errors='ignore')\
.str.decode('utf-8')
# sort by new column, then drop
df = df.sort_values(1, ascending=True)\
.drop(1, axis=1)
print(df)
0
2 a
1 e
3 é
0 i
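Since pandas 1.1, sort_values accepts a key callable, which makes the normalize-then-sort idea a one-liner without any helper column; a sketch:

```python
import pandas as pd

df = pd.DataFrame([['i'], ['e'], ['a'], ['é']])

# the key callable receives each sort column as a Series;
# NFKD decomposes 'é' into 'e' + combining accent, so it sorts right after 'e'
out = df.sort_values(by=0, key=lambda s: s.str.normalize('NFKD'))
print(out)
```

This works on several columns at once (the key is applied to each column in by), which suits the note that the real dataframe has several columns.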
I am web-scraping tables from a website and putting them into an Excel file.
My goal is to split one column into 2 columns in the correct way.
The column I want to split: "FLIGHT"
I want this form:
First example: KL744 --> KL and 0744
Second example: BE1013 --> BE and 1013
So I need to separate the FIRST 2 characters (into the first column), and then the remaining characters, which are 1-2-3-4 characters long. If there are 4 it's OK, I keep them; if 3, I want to put a 0 before them; if 2, I want to put 00 before them (my goal is to get 4 characters/digits in the second column).
How can I do this?
Here is my relevant code, which already contains some formatting:
df2 = pd.DataFrame(datatable,columns = cols)
df2["UPLOAD_TIME"] = datetime.now()
mask = np.column_stack([df2[col].astype(str).str.contains(r"Scheduled", na=True) for col in df2])
df3 = df2.loc[~mask.any(axis=1)]
if os.path.isfile("output.csv"):
    df1 = pd.read_csv("output.csv", sep=";")
    df4 = pd.concat([df1, df3])
    df4.to_csv("output.csv", index=False, sep=";")
else:
    df3.to_csv("output.csv", index=False, sep=";")
Here the excel prt sc from my table:
You can use string indexing with str together with zfill:
df = pd.DataFrame({'FLIGHT':['KL744','BE1013']})
df['a'] = df['FLIGHT'].str[:2]
df['b'] = df['FLIGHT'].str[2:].str.zfill(4)
print (df)
FLIGHT a b
0 KL744 KL 0744
1 BE1013 BE 1013
I believe your code needs:
df2 = pd.DataFrame(datatable,columns = cols)
df2['a'] = df2['FLIGHT'].str[:2]
df2['b'] = df2['FLIGHT'].str[2:].str.zfill(4)
df2["UPLOAD_TIME"] = datetime.now()
...
...
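As a small check that zfill covers the shorter cases mentioned in the question as well, here is a sketch (AA7 is a hypothetical short flight code, not from the scraped data):

```python
import pandas as pd

df = pd.DataFrame({'FLIGHT': ['KL744', 'BE1013', 'AA7']})

# first 2 characters -> airline code; the rest zero-padded to width 4
df['a'] = df['FLIGHT'].str[:2]
df['b'] = df['FLIGHT'].str[2:].str.zfill(4)
print(df)
```

zfill always pads to the target width, so 1-, 2-, and 3-digit numbers all come out as 4 digits, and 4-digit numbers are left untouched.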