How to remove all string values that precede a character in pandas? - python
I have the following dataframe:
data = {'Name':['Square_Train_1', 'Stims1/Neut/32Neut1.jpg', 'Square_Train_2',
'Stims1/Neg/114Neg1.jpg', 'Square_Train_3',
'Stims1/Pos/129Pos1.jpg', 'Stims1/Neut/58Neut1.jpg',
'Stims1/Neg/13Neg1.jpg', 'Stims1/Pos/5Pos1.jpg',
'Stims1/Pos/25Pos1.jpg', 'Stims1/Neg/47Neg1.jpg',
'Stims1/Neut/8Neut1.jpg', 'Stims1/Neg/129Neg1.jpg',
'Stims1/Neut/42Neut1.jpg', 'Stims1/Pos/98Pos1.jpg',
'Stims1/Neut/24Neut1.jpg', 'Stims1/Neg/6Neg1.jpg',
'Stims1/Pos/107Pos1.jpg', 'Stims1/Neg/100Neg1.jpg',
'Stims1/Pos/77Pos1.jpg', 'Stims1/Neut/3Neut1.jpg',
'Stims1/Neg/53Neg1.jpg', 'Stims1/Pos/157Pos1.jpg',
'Stims1/Neut/13Neut1.jpg', 'Stims1/Neut/9Neut1.jpg',
'Stims1/Pos/104Pos1.jpg', 'Stims1/Neg/64Neg1.jpg',
'Stims1/Neut/30Neut1.jpg', 'Stims1/Pos/43Pos1.jpg',
'Stims1/Neg/1Neg1.jpg', 'Stims1/Neut/59Neut1.jpg',
'Stims1/Neg/172Neg1.jpg', 'Stims1/Pos/56Pos1.jpg',
'Stims1/Pos/44Pos1.jpg', 'Stims1/Neg/34Neg1.jpg',
'Stims1/Neut/16Neut1.jpg', 'Stims1/Neut/47Neut1.jpg',
'Stims1/Neg/21Neg1.jpg', 'Stims1/Pos/96Pos1.jpg',
'Stims1/Neg/50Neg1.jpg', 'Stims1/Pos/2Pos1.jpg',
'Stims1/Neut/21Neut1.jpg', 'Stims1/Neg/65Neg1.jpg',
'Stims1/Pos/35Pos1.jpg', 'Stims1/Neut/51Neut1.jpg',
'Stims1/Neut/55Neut1.jpg', 'Stims1/Pos/60Pos1.jpg',
'Stims1/Neg/30Neg1.jpg', 'Stims1/Neut/7Neut1.jpg',
'Stims1/Pos/9Pos1.jpg', 'Stims1/Neg/41Neg1.jpg',
'Stims1/Pos/31Pos1.jpg', 'Stims1/Neut/40Neut1.jpg',
'Stims1/Neg/156Neg1.jpg', 'Stims1/Neg/135Neg1.jpg',
'Stims1/Pos/71Pos1.jpg', 'Stims1/Neut/26Neut1.jpg',
'Stims1/Pos/105Pos1.jpg', 'Stims1/Neg/17Neg1.jpg',
'Stims1/Neut/44Neut1.jpg', 'Stims1/Pos/150Pos1.jpg',
'Stims1/Neut/57Neut1.jpg', 'Stims1/Neg/12Neg1.jpg',
'Stims1/Pos/24Pos1.jpg', 'Stims1/Neg/131Neg1.jpg',
'Stims1/Neut/31Neut1.jpg', 'Stims1/Pos/10Pos1.jpg',
'Stims1/Neut/11Neut1.jpg', 'Stims1/Neg/118Neg1.jpg',
'Stims1/Neg/51Neg1.jpg', 'Stims1/Pos/48Pos1.jpg',
'Stims1/Neut/34Neut1.jpg', 'Stims1/Pos/148Pos1.jpg',
'Stims1/Neut/22Neut1.jpg', 'Stims1/Neg/176Neg1.jpg',
'Stims1/Neut/5Neut1.jpg', 'Stims1/Neg/104Neg1.jpg',
'Stims1/Pos/68Pos1.jpg', 'Stims1/Neut/35Neut1.jpg',
'Stims1/Pos/14Pos1.jpg', 'Stims1/Neg/136Neg1.jpg',
'Stims1/Neut/54Neut1.jpg', 'Stims1/Neg/107Neg1.jpg',
'Stims1/Pos/47Pos1.jpg', 'Stims1/Neut/43Neut1.jpg',
'Stims1/Neg/58Neg1.jpg', 'Stims1/Pos/20Pos1.jpg',
'Stims1/Neut/6Neut1.jpg', 'Stims1/Neg/63Neg1.jpg',
'Stims1/Pos/135Pos1.jpg', 'Stims1/Neut/39Neut1.jpg',
'Stims1/Neg/164Neg1.jpg', 'Stims1/Pos/125Pos1.jpg',
'Stims1/Neg/117Neg1.jpg', 'Stims1/Neut/48Neut1.jpg',
'Stims1/Pos/69Pos1.jpg', 'Stims1/Pos/37Pos1.jpg',
'Stims1/Neg/159Neg1.jpg', 'Stims1/Neut/36Neut1.jpg',
'Stims1/Pos/75Pos1.jpg', 'Stims1/Neg/180Neg1.jpg',
'Stims1/Neut/50Neut1.jpg', 'Stims1/Neg/7Neg1.jpg',
'Stims1/Pos/11Pos1.jpg', 'Stims1/Neut/52Neut1.jpg',
'Stims1/Pos/29Pos1.jpg', 'Stims1/Neut/46Neut1.jpg',
'Stims1/Neg/115Neg1.jpg', 'Stims1/Neg/31Neg1.jpg',
'Stims1/Pos/66Pos1.jpg', 'Stims1/Neut/14Neut1.jpg',
'Stims1/Neut/53Neut1.jpg', 'Stims1/Neg/162Neg1.jpg',
'Stims1/Pos/97Pos1.jpg', 'Stims1/Neg/35Neg1.jpg',
'Stims1/Neut/45Neut1.jpg', 'Stims1/Pos/32Pos1.jpg',
'Stims1/Pos/81Pos1.jpg', 'Stims1/Neg/24Neg1.jpg',
'Stims1/Neut/1Neut1.jpg', 'Stims1/Neut/20Neut1.jpg',
'Stims1/Neg/69Neg1.jpg', 'Stims1/Pos/52Pos1.jpg',
'Stims2/Pos/35Pos2.jpg', 'Stims2/Neut/1Neut2.jpg',
'Stims2/Neg/30Neg2.jpg', 'Stims2/Neg/156Neg2.jpg',
'Stims2/Neut/59Neut2.jpg', 'Stims2/Pos/150Pos2.jpg',
'Stims2/Neg/114Neg2.jpg', 'Stims2/Neut/39Neut2.jpg',
'Stims2/Pos/98Pos2.jpg', 'Stims2/Pos/14Pos2.jpg',
'Stims2/Neg/24Neg2.jpg', 'Stims2/Neut/51Neut2.jpg',
'Stims2/Pos/48Pos2.jpg', 'Stims2/Neg/31Neg2.jpg',
'Stims2/Neut/26Neut2.jpg', 'Stims2/Neg/35Neg2.jpg',
'Stims2/Neut/40Neut2.jpg', 'Stims2/Pos/60Pos2.jpg',
'Stims2/Pos/77Pos2.jpg', 'Stims2/Neut/9Neut2.jpg',
'Stims2/Neg/47Neg2.jpg', 'Stims2/Neg/107Neg2.jpg',
'Stims2/Pos/66Pos2.jpg', 'Stims2/Neut/55Neut2.jpg',
'Stims2/Neut/14Neut2.jpg', 'Stims2/Pos/56Pos2.jpg',
'Stims2/Neg/34Neg2.jpg', 'Stims2/Neg/131Neg2.jpg',
'Stims2/Pos/97Pos2.jpg', 'Stims2/Neut/52Neut2.jpg',
'Stims2/Neut/45Neut2.jpg', 'Stims2/Neg/162Neg2.jpg',
'Stims2/Pos/129Pos2.jpg', 'Stims2/Pos/52Pos2.jpg',
'Stims2/Neg/104Neg2.jpg', 'Stims2/Neut/48Neut2.jpg',
'Stims2/Neut/21Neut2.jpg', 'Stims2/Pos/104Pos2.jpg',
'Stims2/Neg/50Neg2.jpg', 'Stims2/Pos/24Pos2.jpg',
'Stims2/Neut/34Neut2.jpg', 'Stims2/Neg/176Neg2.jpg',
'Stims2/Neg/129Neg2.jpg', 'Stims2/Pos/47Pos2.jpg',
'Stims2/Neut/36Neut2.jpg', 'Stims2/Pos/157Pos2.jpg',
'Stims2/Neg/58Neg2.jpg', 'Stims2/Neut/7Neut2.jpg',
'Stims2/Neut/53Neut2.jpg', 'Stims2/Pos/69Pos2.jpg',
'Stims2/Neg/172Neg2.jpg', 'Stims2/Pos/68Pos2.jpg',
'Stims2/Neut/35Neut2.jpg', 'Stims2/Neg/100Neg2.jpg',
'Stims2/Neg/17Neg2.jpg', 'Stims2/Pos/148Pos2.jpg',
'Stims2/Neut/46Neut2.jpg', 'Stims2/Neut/16Neut2.jpg',
'Stims2/Pos/105Pos2.jpg', 'Stims2/Neg/159Neg2.jpg',
'Stims2/Pos/29Pos2.jpg', 'Stims2/Neg/64Neg2.jpg',
'Stims2/Neut/58Neut2.jpg', 'Stims2/Neut/30Neut2.jpg',
'Stims2/Pos/71Pos2.jpg', 'Stims2/Neg/41Neg2.jpg',
'Stims2/Neut/20Neut2.jpg', 'Stims2/Neg/69Neg2.jpg',
'Stims2/Pos/9Pos2.jpg', 'Stims2/Pos/5Pos2.jpg',
'Stims2/Neut/13Neut2.jpg', 'Stims2/Neg/1Neg2.jpg',
'Stims2/Pos/31Pos2.jpg', 'Stims2/Neg/21Neg2.jpg',
'Stims2/Neut/32Neut2.jpg', 'Stims2/Pos/96Pos2.jpg',
'Stims2/Neg/118Neg2.jpg', 'Stims2/Neut/57Neut2.jpg',
'Stims2/Neut/3Neut2.jpg', 'Stims2/Pos/32Pos2.jpg',
'Stims2/Neg/117Neg2.jpg', 'Stims2/Neg/6Neg2.jpg',
'Stims2/Pos/10Pos2.jpg', 'Stims2/Neut/44Neut2.jpg',
'Stims2/Pos/25Pos2.jpg', 'Stims2/Neut/50Neut2.jpg',
'Stims2/Neg/51Neg2.jpg', 'Stims2/Neut/47Neut2.jpg',
'Stims2/Neg/135Neg2.jpg', 'Stims2/Pos/125Pos2.jpg',
'Stims2/Neut/43Neut2.jpg', 'Stims2/Neg/7Neg2.jpg',
'Stims2/Pos/11Pos2.jpg', 'Stims2/Neut/22Neut2.jpg',
'Stims2/Pos/20Pos2.jpg', 'Stims2/Neg/180Neg2.jpg',
'Stims2/Neut/31Neut2.jpg', 'Stims2/Neg/164Neg2.jpg',
'Stims2/Pos/37Pos2.jpg', 'Stims2/Neg/13Neg2.jpg',
'Stims2/Neut/5Neut2.jpg', 'Stims2/Pos/135Pos2.jpg',
'Stims2/Neg/53Neg2.jpg', 'Stims2/Neut/54Neut2.jpg',
'Stims2/Pos/81Pos2.jpg', 'Stims2/Pos/44Pos2.jpg',
'Stims2/Neut/11Neut2.jpg', 'Stims2/Neg/115Neg2.jpg',
'Stims2/Neut/6Neut2.jpg', 'Stims2/Pos/107Pos2.jpg',
'Stims2/Neg/136Neg2.jpg', 'Stims2/Pos/75Pos2.jpg',
'Stims2/Neg/65Neg2.jpg', 'Stims2/Neut/42Neut2.jpg',
'Stims2/Pos/43Pos2.jpg', 'Stims2/Neut/24Neut2.jpg',
'Stims2/Neg/12Neg2.jpg', 'Stims2/Neut/8Neut2.jpg',
'Stims2/Pos/2Pos2.jpg', 'Stims2/Neg/63Neg2.jpg']}
# Create DataFrame
df = pd.DataFrame(data)
# Print the output.
df
The goal is to remove all characters that precede the last '/' character.
I tried 'lstrip':
df['Name'] = df['Name'].map(lambda x: x.lstrip('Stims1/Neut/'))
df['Name'] = df['Name'].map(lambda x: x.lstrip('Stims1/Pos/'))
df['Name'] = df['Name'].map(lambda x: x.lstrip('Pos/'))
df['Name'] = df['Name'].map(lambda x: x.lstrip('2'))
df['Name'] = df['Name'].map(lambda x: x.lstrip('/Pos/'))
df['Name'] = df['Name'].map(lambda x: x.lstrip('Neg/'))
df['Name'] = df['Name'].map(lambda x: x.lstrip('/Neut/'))
df['Name'] = df['Name'].map(lambda x: x.lstrip('ut/'))
The problem with lstrip is that it treats its argument as a set of characters to strip rather than as a literal prefix, so it needs many different calls and often strips too much.
I would like to avoid 'replace', as it is even less efficient; it requires entering every single combination of strings. The same problem seems to apply to 're'.
Is there a way to remove all characters that precede the '/' efficiently?
What it really looks like you're trying to do is grab just the filename and drop the rest of the directory from the file path. If that is the case, I would use Series.apply with os.path.basename:
>>> import os
>>> df['Name'] = df['Name'].apply(os.path.basename)
Which results in
>>> df
Name
0 Square_Train_1
1 32Neut1.jpg
2 Square_Train_2
3 114Neg1.jpg
4 Square_Train_3
.. ...
238 24Neut2.jpg
239 12Neg2.jpg
240 8Neut2.jpg
241 2Pos2.jpg
242 63Neg2.jpg
[243 rows x 1 columns]
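If you'd rather stay inside pandas, a string-accessor variant of the same idea is to split on '/' and keep the last piece; rows without a '/' pass through unchanged. A minimal sketch on a few of the rows above:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Square_Train_1',
                            'Stims1/Neut/32Neut1.jpg',
                            'Stims2/Pos/35Pos2.jpg']})
# keep only the part after the last '/'; names without '/' are unchanged
df['Name'] = df['Name'].str.split('/').str[-1]
print(df['Name'].tolist())  # ['Square_Train_1', '32Neut1.jpg', '35Pos2.jpg']
```

Unlike os.path.basename this is separator-agnostic: pass whatever delimiter you need to split.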
Related
want to apply merge function on column A
How can I apply a merge function, or any other method, on column A? For example, in layman's terms, I want to convert the string "(A|B|C,D)|(A,B|C|D)|(B|C|D)" into "(D A|D B|D C)|(A B|A C|A D)|(B|C|D)". The group (B|C|D) will remain the same, as it doesn't have a comma value to merge into it. Basically I want to merge the comma-separated values into each of the other values. I have the below data frame:
import pandas as pd
data = {'A': ['(A|B|C,D)|(A,B|C|D)|(B|C|D)'],
        'B(Expected)': ['(D A|D B|D C)|(A B|A C|A D)|(B|C|D)']}
df = pd.DataFrame(data)
print(df)
My expected result is shown in column B(Expected). Below are the methods I tried:
(1)
df['B(Expected)'] = df['A'].apply(lambda x: x.replace("|", " ").replace(",", "|") if "|" in x and "," in x else x)
(2)
# Split the string by the pipe character
df['string'] = df['string'].str.split('|')
df['string'] = df['string'].apply(lambda x: '|'.join([' '.join(i.split(' ')) for i in x]))
You can use a regex to extract the values in parentheses, then a custom function with itertools.product to reorganize the values:
from itertools import product

def split(s):
    return '|'.join([' '.join(x) for x in product(*[x.split('|') for x in s.split(',')])])

df['B'] = df['A'].str.replace(r'([^()]+)', lambda m: split(m.group()), regex=True)
print(df)
Note that this requires non-nested parentheses.
Output:
                             A                                    B
0  (A|B|C,D)|(A,B|C|D)|(B|C|D)  (A D|B D|C D)|(A B|A C|A D)|(B|C|D)
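The inner helper can be checked on its own before wiring it into str.replace; a minimal sketch of the same expansion logic:

```python
from itertools import product

def split(s):
    # expand each comma-separated group into every pipe-alternative combination
    return '|'.join(' '.join(x) for x in
                    product(*[g.split('|') for g in s.split(',')]))

print(split('A|B|C,D'))  # A D|B D|C D
print(split('B|C|D'))    # B|C|D  (no comma, so nothing to expand)
```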
Search a column against a list of strings and, if a match is found, enter the matched string in a new column
I want to search for names in column col_one, where I have a list of names in the variable list20. When searching, if the value of col_one matches an entry in list20, put that name in a new column named new_col. Most of the time the name will be at the front, such as ZEN, W, WICE, but some names have a symbol after them, such as ZEN-R, ZEN-W2, ZEN13P2302A.
My data:
import pandas as pd

list20 = ['ZEN', 'OOP', 'WICE', 'XO', 'WP', 'K', 'WGE', 'YGG', 'W', 'YUASA',
          'XPG', 'ABC', 'WHA', 'WHAUP', 'WFX', 'WINNER', 'WIIK', 'WIN', 'YONG',
          'WPH', 'KCE']
data = {
    "col_one": ["ZEN", "WPH", "WICE", "YONG", "K", "XO", "WIN", "WP", "WIIK",
                "YGG-W1", "W-W5", "WINNER", "YUASA", "WGE", "WFX", "XPG",
                "WHAUP", "WHA", "KCE13P2302A", "OOP-R"],
}
df = pd.DataFrame(data)
The approaches below give the result shown in the picture, and it's not right:
# or --------
df['new_col'] = df['col_one'].str.extract('('+'|'.join(list20)+')')[0]
# or --------
import re
pattern = re.compile(r"|".join(x for x in list20))
df = (df
      .assign(new=lambda x: [re.findall(pattern, string)[0] for string in x.col_one])
      )
# or --------
def matcher(col_one):
    for i in list20:
        if i in col_one:
            return i
    return 'na'  # adjust as you see fit

df['new_col'] = df.apply(lambda x: matcher(x['col_one']), axis=1)
The result obtained from the code above is not right.
Expected Output
Try sorting the list first, longest names first:
pattern = re.compile(r"|".join(x for x in sorted(list20, reverse=True, key=len)))
(df
 .assign(new=lambda x: [re.findall(pattern, string)[0] for string in x.col_one])
 )
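The sort matters because re alternation is leftmost-alternative-first, not longest-match: if 'W' appears in the pattern before 'WINNER', 'W' wins. A minimal sketch with a made-up subset of the list:

```python
import re

names = ['W', 'WIN', 'WINNER']
unsorted_pat = re.compile('|'.join(names))
sorted_pat = re.compile('|'.join(sorted(names, key=len, reverse=True)))

print(unsorted_pat.match('WINNER-W1').group())  # 'W'  (first alternative wins)
print(sorted_pat.match('WINNER-W1').group())    # 'WINNER'
```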
Try with str.extract:
df['new'] = df['col_one'].str.extract('('+'|'.join(list20)+')')[0]
df
Out[121]:
      col_one   new
0        CFER  CFER
1  ABCP6P45C9   ABC
2      LOU-W5   LOU
3      CFER-R  CFER
4      ABC-W1   ABC
5  LOU13C2465   LOU
One way to do this, less attractive in terms of efficiency, is to use a simple function with a lambda:
def matcher(col_one):
    for i in list20:
        if i in col_one:
            return i
    return 'na'  # adjust as you see fit

df['new_col'] = df.apply(lambda x: matcher(x['col_one']), axis=1)
df
Expected results:
      col_one new_col
0        CFER    CFER
1  ABCP6P45C9     ABC
2      LOU-W5     LOU
3      CFER-R    CFER
4      ABC-W1     ABC
5  LOU13C2465     LOU
Another approach:
import re

pattern = re.compile(r"|".join(x for x in list20))
(df
 .assign(new=lambda x: [re.findall(pattern, string)[0] for string in x.col_one])
 )
Renaming index values in pandas dataframe
I need to change the names of my indices:
                                  Country  Date  (other columns)
/link1/subpath2/Text by Poe/
/link1/subpath2/Text by Wilde/
/link1/subpath2/Text by Whitman/
Country and Date are my indices. I would like to extract the words Poe, Wilde and Whitman from the index column Country, in order to have
Country  Date  (other columns)
Poe
Wilde
Whitman
Currently I am doing it one by one:
df = df.rename(index={'/link1/subpath2/Text by Poe/': 'Poe'})
df = df.rename(index={'/link1/subpath2/Text by Wilde/': 'Wilde'})
df = df.rename(index={'/link1/subpath2/Text by Whitman/': 'Whitman'})
It works, but since I have hundreds of datasets, as you can imagine this is not doable.
You can use str.replace:
df['Country'] = df['Country'].str.replace(r'/link1/subpath2/Text by ', '')
df['Country'] = df['Country'].str.replace(r'/', '')
If Country is an index, you can do as follows:
df = df.set_index('Country')
df.index = df.index.str.replace(r'/link1/subpath2/Text by ', '')
If it's a MultiIndex, you can use reset_index:
df = df.reset_index()
df['Country'] = df['Country'].str.replace(r'/link1/subpath2/Text by ', '')
You can always use a regex pattern if things get more complicated:
import re
import pandas as pd

df = pd.DataFrame(['foo', 'bar', 'z'],
                  index=['/link1/subpath2/Text by Poe/',
                         '/link1/subpath2/Text by Wilde/',
                         '/link1/subpath2/Text by Whitman/'])
name_pattern = re.compile(r'by (\w+)/')
df.index = [name_pattern.findall(idx)[0] for idx in df.index]
df
where name_pattern captures the group between 'by ' and '/'.
You can use str.extract with a pattern that catches the last word with (\w*), delimited by a whitespace \s before it and the character / at the end of the line $. Because it is a MultiIndex, you need to rebuild it with MultiIndex.from_arrays:
df.index = pd.MultiIndex.from_arrays(
    [df.index.get_level_values(0).str.extract(r'\s(\w*)/$')[0],
     df.index.get_level_values(1)],
    names=['Country', 'Dates'])
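For a flat (non-Multi) index, the same extraction can be done directly with str.extract and expand=False, which keeps the result as an Index; a small sketch assuming the 'Text by <name>/' layout from the question:

```python
import pandas as pd

df = pd.DataFrame({'val': [1, 2, 3]},
                  index=['/link1/subpath2/Text by Poe/',
                         '/link1/subpath2/Text by Wilde/',
                         '/link1/subpath2/Text by Whitman/'])
# capture the word between 'by ' and the trailing '/'
df.index = df.index.str.extract(r'by (\w+)/$', expand=False)
print(list(df.index))  # ['Poe', 'Wilde', 'Whitman']
```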
Add suffix to column names that don't already have a suffix
I have a data frame with columns like Name, Date, Date_x, Date_y, A, A_x, A_y, and I need to add _z to the columns (except the Name column) that don't already have _x or _y. So I want the output to be similar to Name, Date_z, Date_x, Date_y, A_z, A_x, A_y. I've tried:
df.iloc[:, ~df.columns.str.contains('x|y|Name')] = df.iloc[:, ~df.columns.str.contains('x|y|Name')].add_suffix("_z")
# doesn't add suffixes and replaces columns with all NaNs
df.columns = df.columns.map(lambda x: x + '_z' if "x" not in x or "y" not in x else x)
# many variations of this, but it seems to add _z to all of the column names
How about:
df.columns = [x if x == 'Name' or '_' in x else x + '_z' for x in df.columns]
You can also try:
df.rename(columns=lambda x: x if x == 'Name' or '_' in x else x + '_z')
stealing slightly from Quang Hoang ;)
Add '_z' where the column stub is duplicated and without a suffix:
m = (df.columns.str.split('_').str[0].duplicated(keep=False)
     & ~df.columns.str.contains('_'))
df.columns = df.columns.where(~m, df.columns + '_z')
I would use Index.putmask as follows:
m = (df.columns == 'Name') | df.columns.str[-2:].isin(['_x', '_y'])
df.columns = df.columns.putmask(~m, df.columns + '_z')

In [739]: df.columns
Out[739]: Index(['Name', 'Date_z', 'Date_x', 'Date_y', 'A_z', 'A_x', 'A_y'], dtype='object')
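A quick runnable check of the putmask variant against the column names from the question (here on an empty frame, since only the columns matter):

```python
import pandas as pd

df = pd.DataFrame(columns=['Name', 'Date', 'Date_x', 'Date_y', 'A', 'A_x', 'A_y'])
# keep 'Name' and anything ending in _x/_y; suffix everything else with _z
m = (df.columns == 'Name') | df.columns.str[-2:].isin(['_x', '_y'])
df.columns = df.columns.putmask(~m, df.columns + '_z')
print(list(df.columns))  # ['Name', 'Date_z', 'Date_x', 'Date_y', 'A_z', 'A_x', 'A_y']
```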
Search and replace dots and commas in pandas dataframe
This is my DataFrame:
d = {'col1': ['sku 1.1', 'sku 1.2', 'sku 1.3'],
     'col2': ['9.876.543,21', 654, '321,01']}
df = pd.DataFrame(data=d)
df
      col1          col2
0  sku 1.1  9.876.543,21
1  sku 1.2           654
2  sku 1.3        321,01
The data in col2 are numbers in a local format, which I would like to convert into:
        col2
  9876543.21
         654
      321.01
I tried df['col2'] = pd.to_numeric(df['col2'], downcast='float'), which returns ValueError: Unable to parse string "9.876.543,21" at position 0.
I also tried df = df.apply(lambda x: x.str.replace(',', '.')), which returns ValueError: could not convert string to float: '5.023.654.46'
The best option is to use the parameters of read_csv if possible:
df = pd.read_csv(file, thousands='.', decimal=',')
If that's not possible, then replace should help:
df['col2'] = (df['col2'].replace(r'\.', '', regex=True)
                        .replace(',', '.', regex=True)
                        .astype(float))
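If the column mixes real numbers with strings, as in the question's data, casting to str first makes the same replace-then-cast chain safe; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'col2': ['9.876.543,21', 654, '321,01']})
df['col2'] = (df['col2'].astype(str)
              .str.replace('.', '', regex=False)   # drop thousands separators
              .str.replace(',', '.', regex=False)  # comma becomes the decimal point
              .astype(float))
print(df['col2'].tolist())  # [9876543.21, 654.0, 321.01]
```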
You can swap the two separators through a placeholder character, operating on the string contents (Series.replace alone matches whole cell values, so use the str accessor):
df['col2'] = (df['col2'].astype(str)
              .str.replace(',', '&', regex=False)
              .str.replace('.', ',', regex=False)
              .str.replace('&', '.', regex=False))
You are always better off using standard system facilities where they exist. Knowing that some locales use commas and decimal points differently, I could not believe that pandas would not use the formats of the locale. Sure enough, a quick search revealed this gist, which explains how to make use of locales to convert strings to numbers. In essence you need to import locale and, after you've built the dataframe, call locale.setlocale to establish a locale that uses commas as decimal points and periods as separators, then apply a converter such as locale.atof with the dataframe's applymap method.