How to remove all string values that precede a character in pandas? - python
I have the following dataframe:
data = {'Name':['Square_Train_1', 'Stims1/Neut/32Neut1.jpg', 'Square_Train_2',
'Stims1/Neg/114Neg1.jpg', 'Square_Train_3',
'Stims1/Pos/129Pos1.jpg', 'Stims1/Neut/58Neut1.jpg',
'Stims1/Neg/13Neg1.jpg', 'Stims1/Pos/5Pos1.jpg',
'Stims1/Pos/25Pos1.jpg', 'Stims1/Neg/47Neg1.jpg',
'Stims1/Neut/8Neut1.jpg', 'Stims1/Neg/129Neg1.jpg',
'Stims1/Neut/42Neut1.jpg', 'Stims1/Pos/98Pos1.jpg',
'Stims1/Neut/24Neut1.jpg', 'Stims1/Neg/6Neg1.jpg',
'Stims1/Pos/107Pos1.jpg', 'Stims1/Neg/100Neg1.jpg',
'Stims1/Pos/77Pos1.jpg', 'Stims1/Neut/3Neut1.jpg',
'Stims1/Neg/53Neg1.jpg', 'Stims1/Pos/157Pos1.jpg',
'Stims1/Neut/13Neut1.jpg', 'Stims1/Neut/9Neut1.jpg',
'Stims1/Pos/104Pos1.jpg', 'Stims1/Neg/64Neg1.jpg',
'Stims1/Neut/30Neut1.jpg', 'Stims1/Pos/43Pos1.jpg',
'Stims1/Neg/1Neg1.jpg', 'Stims1/Neut/59Neut1.jpg',
'Stims1/Neg/172Neg1.jpg', 'Stims1/Pos/56Pos1.jpg',
'Stims1/Pos/44Pos1.jpg', 'Stims1/Neg/34Neg1.jpg',
'Stims1/Neut/16Neut1.jpg', 'Stims1/Neut/47Neut1.jpg',
'Stims1/Neg/21Neg1.jpg', 'Stims1/Pos/96Pos1.jpg',
'Stims1/Neg/50Neg1.jpg', 'Stims1/Pos/2Pos1.jpg',
'Stims1/Neut/21Neut1.jpg', 'Stims1/Neg/65Neg1.jpg',
'Stims1/Pos/35Pos1.jpg', 'Stims1/Neut/51Neut1.jpg',
'Stims1/Neut/55Neut1.jpg', 'Stims1/Pos/60Pos1.jpg',
'Stims1/Neg/30Neg1.jpg', 'Stims1/Neut/7Neut1.jpg',
'Stims1/Pos/9Pos1.jpg', 'Stims1/Neg/41Neg1.jpg',
'Stims1/Pos/31Pos1.jpg', 'Stims1/Neut/40Neut1.jpg',
'Stims1/Neg/156Neg1.jpg', 'Stims1/Neg/135Neg1.jpg',
'Stims1/Pos/71Pos1.jpg', 'Stims1/Neut/26Neut1.jpg',
'Stims1/Pos/105Pos1.jpg', 'Stims1/Neg/17Neg1.jpg',
'Stims1/Neut/44Neut1.jpg', 'Stims1/Pos/150Pos1.jpg',
'Stims1/Neut/57Neut1.jpg', 'Stims1/Neg/12Neg1.jpg',
'Stims1/Pos/24Pos1.jpg', 'Stims1/Neg/131Neg1.jpg',
'Stims1/Neut/31Neut1.jpg', 'Stims1/Pos/10Pos1.jpg',
'Stims1/Neut/11Neut1.jpg', 'Stims1/Neg/118Neg1.jpg',
'Stims1/Neg/51Neg1.jpg', 'Stims1/Pos/48Pos1.jpg',
'Stims1/Neut/34Neut1.jpg', 'Stims1/Pos/148Pos1.jpg',
'Stims1/Neut/22Neut1.jpg', 'Stims1/Neg/176Neg1.jpg',
'Stims1/Neut/5Neut1.jpg', 'Stims1/Neg/104Neg1.jpg',
'Stims1/Pos/68Pos1.jpg', 'Stims1/Neut/35Neut1.jpg',
'Stims1/Pos/14Pos1.jpg', 'Stims1/Neg/136Neg1.jpg',
'Stims1/Neut/54Neut1.jpg', 'Stims1/Neg/107Neg1.jpg',
'Stims1/Pos/47Pos1.jpg', 'Stims1/Neut/43Neut1.jpg',
'Stims1/Neg/58Neg1.jpg', 'Stims1/Pos/20Pos1.jpg',
'Stims1/Neut/6Neut1.jpg', 'Stims1/Neg/63Neg1.jpg',
'Stims1/Pos/135Pos1.jpg', 'Stims1/Neut/39Neut1.jpg',
'Stims1/Neg/164Neg1.jpg', 'Stims1/Pos/125Pos1.jpg',
'Stims1/Neg/117Neg1.jpg', 'Stims1/Neut/48Neut1.jpg',
'Stims1/Pos/69Pos1.jpg', 'Stims1/Pos/37Pos1.jpg',
'Stims1/Neg/159Neg1.jpg', 'Stims1/Neut/36Neut1.jpg',
'Stims1/Pos/75Pos1.jpg', 'Stims1/Neg/180Neg1.jpg',
'Stims1/Neut/50Neut1.jpg', 'Stims1/Neg/7Neg1.jpg',
'Stims1/Pos/11Pos1.jpg', 'Stims1/Neut/52Neut1.jpg',
'Stims1/Pos/29Pos1.jpg', 'Stims1/Neut/46Neut1.jpg',
'Stims1/Neg/115Neg1.jpg', 'Stims1/Neg/31Neg1.jpg',
'Stims1/Pos/66Pos1.jpg', 'Stims1/Neut/14Neut1.jpg',
'Stims1/Neut/53Neut1.jpg', 'Stims1/Neg/162Neg1.jpg',
'Stims1/Pos/97Pos1.jpg', 'Stims1/Neg/35Neg1.jpg',
'Stims1/Neut/45Neut1.jpg', 'Stims1/Pos/32Pos1.jpg',
'Stims1/Pos/81Pos1.jpg', 'Stims1/Neg/24Neg1.jpg',
'Stims1/Neut/1Neut1.jpg', 'Stims1/Neut/20Neut1.jpg',
'Stims1/Neg/69Neg1.jpg', 'Stims1/Pos/52Pos1.jpg',
'Stims2/Pos/35Pos2.jpg', 'Stims2/Neut/1Neut2.jpg',
'Stims2/Neg/30Neg2.jpg', 'Stims2/Neg/156Neg2.jpg',
'Stims2/Neut/59Neut2.jpg', 'Stims2/Pos/150Pos2.jpg',
'Stims2/Neg/114Neg2.jpg', 'Stims2/Neut/39Neut2.jpg',
'Stims2/Pos/98Pos2.jpg', 'Stims2/Pos/14Pos2.jpg',
'Stims2/Neg/24Neg2.jpg', 'Stims2/Neut/51Neut2.jpg',
'Stims2/Pos/48Pos2.jpg', 'Stims2/Neg/31Neg2.jpg',
'Stims2/Neut/26Neut2.jpg', 'Stims2/Neg/35Neg2.jpg',
'Stims2/Neut/40Neut2.jpg', 'Stims2/Pos/60Pos2.jpg',
'Stims2/Pos/77Pos2.jpg', 'Stims2/Neut/9Neut2.jpg',
'Stims2/Neg/47Neg2.jpg', 'Stims2/Neg/107Neg2.jpg',
'Stims2/Pos/66Pos2.jpg', 'Stims2/Neut/55Neut2.jpg',
'Stims2/Neut/14Neut2.jpg', 'Stims2/Pos/56Pos2.jpg',
'Stims2/Neg/34Neg2.jpg', 'Stims2/Neg/131Neg2.jpg',
'Stims2/Pos/97Pos2.jpg', 'Stims2/Neut/52Neut2.jpg',
'Stims2/Neut/45Neut2.jpg', 'Stims2/Neg/162Neg2.jpg',
'Stims2/Pos/129Pos2.jpg', 'Stims2/Pos/52Pos2.jpg',
'Stims2/Neg/104Neg2.jpg', 'Stims2/Neut/48Neut2.jpg',
'Stims2/Neut/21Neut2.jpg', 'Stims2/Pos/104Pos2.jpg',
'Stims2/Neg/50Neg2.jpg', 'Stims2/Pos/24Pos2.jpg',
'Stims2/Neut/34Neut2.jpg', 'Stims2/Neg/176Neg2.jpg',
'Stims2/Neg/129Neg2.jpg', 'Stims2/Pos/47Pos2.jpg',
'Stims2/Neut/36Neut2.jpg', 'Stims2/Pos/157Pos2.jpg',
'Stims2/Neg/58Neg2.jpg', 'Stims2/Neut/7Neut2.jpg',
'Stims2/Neut/53Neut2.jpg', 'Stims2/Pos/69Pos2.jpg',
'Stims2/Neg/172Neg2.jpg', 'Stims2/Pos/68Pos2.jpg',
'Stims2/Neut/35Neut2.jpg', 'Stims2/Neg/100Neg2.jpg',
'Stims2/Neg/17Neg2.jpg', 'Stims2/Pos/148Pos2.jpg',
'Stims2/Neut/46Neut2.jpg', 'Stims2/Neut/16Neut2.jpg',
'Stims2/Pos/105Pos2.jpg', 'Stims2/Neg/159Neg2.jpg',
'Stims2/Pos/29Pos2.jpg', 'Stims2/Neg/64Neg2.jpg',
'Stims2/Neut/58Neut2.jpg', 'Stims2/Neut/30Neut2.jpg',
'Stims2/Pos/71Pos2.jpg', 'Stims2/Neg/41Neg2.jpg',
'Stims2/Neut/20Neut2.jpg', 'Stims2/Neg/69Neg2.jpg',
'Stims2/Pos/9Pos2.jpg', 'Stims2/Pos/5Pos2.jpg',
'Stims2/Neut/13Neut2.jpg', 'Stims2/Neg/1Neg2.jpg',
'Stims2/Pos/31Pos2.jpg', 'Stims2/Neg/21Neg2.jpg',
'Stims2/Neut/32Neut2.jpg', 'Stims2/Pos/96Pos2.jpg',
'Stims2/Neg/118Neg2.jpg', 'Stims2/Neut/57Neut2.jpg',
'Stims2/Neut/3Neut2.jpg', 'Stims2/Pos/32Pos2.jpg',
'Stims2/Neg/117Neg2.jpg', 'Stims2/Neg/6Neg2.jpg',
'Stims2/Pos/10Pos2.jpg', 'Stims2/Neut/44Neut2.jpg',
'Stims2/Pos/25Pos2.jpg', 'Stims2/Neut/50Neut2.jpg',
'Stims2/Neg/51Neg2.jpg', 'Stims2/Neut/47Neut2.jpg',
'Stims2/Neg/135Neg2.jpg', 'Stims2/Pos/125Pos2.jpg',
'Stims2/Neut/43Neut2.jpg', 'Stims2/Neg/7Neg2.jpg',
'Stims2/Pos/11Pos2.jpg', 'Stims2/Neut/22Neut2.jpg',
'Stims2/Pos/20Pos2.jpg', 'Stims2/Neg/180Neg2.jpg',
'Stims2/Neut/31Neut2.jpg', 'Stims2/Neg/164Neg2.jpg',
'Stims2/Pos/37Pos2.jpg', 'Stims2/Neg/13Neg2.jpg',
'Stims2/Neut/5Neut2.jpg', 'Stims2/Pos/135Pos2.jpg',
'Stims2/Neg/53Neg2.jpg', 'Stims2/Neut/54Neut2.jpg',
'Stims2/Pos/81Pos2.jpg', 'Stims2/Pos/44Pos2.jpg',
'Stims2/Neut/11Neut2.jpg', 'Stims2/Neg/115Neg2.jpg',
'Stims2/Neut/6Neut2.jpg', 'Stims2/Pos/107Pos2.jpg',
'Stims2/Neg/136Neg2.jpg', 'Stims2/Pos/75Pos2.jpg',
'Stims2/Neg/65Neg2.jpg', 'Stims2/Neut/42Neut2.jpg',
'Stims2/Pos/43Pos2.jpg', 'Stims2/Neut/24Neut2.jpg',
'Stims2/Neg/12Neg2.jpg', 'Stims2/Neut/8Neut2.jpg',
'Stims2/Pos/2Pos2.jpg', 'Stims2/Neg/63Neg2.jpg']}
# Create DataFrame
df = pd.DataFrame(data)
# Print the output.
df
The goal is to remove all characters that precede the last '/' character.
I tried 'lstrip':
df['Name'] = df['Name'].map(lambda x: x.lstrip('Stims1/Neut/'))
df['Name'] = df['Name'].map(lambda x: x.lstrip('Stims1/Pos/'))
df['Name'] = df['Name'].map(lambda x: x.lstrip('Pos/'))
df['Name'] = df['Name'].map(lambda x: x.lstrip('2'))
df['Name'] = df['Name'].map(lambda x: x.lstrip('/Pos/'))
df['Name'] = df['Name'].map(lambda x: x.lstrip('Neg/'))
df['Name'] = df['Name'].map(lambda x: x.lstrip('/Neut/'))
df['Name'] = df['Name'].map(lambda x: x.lstrip('ut/'))
The problem with lstrip is that it treats its argument as a set of characters to strip rather than as a literal prefix, so it needs many different calls and often strips too much.
I would like to avoid 'replace', as it is even less efficient; it requires entering every single combination of strings. The same problem seems to apply to 're'.
Is there a way to remove all characters that precede the '/' efficiently?
What it really looks like you're trying to do is grab just the filename and drop the rest of the directory from the file path. If that is the case, I would use Series.apply with os.path.basename:
>>> import os
>>> df['Name'] = df['Name'].apply(os.path.basename)
Which results in
>>> df
Name
0 Square_Train_1
1 32Neut1.jpg
2 Square_Train_2
3 114Neg1.jpg
4 Square_Train_3
.. ...
238 24Neut2.jpg
239 12Neg2.jpg
240 8Neut2.jpg
241 2Pos2.jpg
242 63Neg2.jpg
[243 rows x 1 columns]
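If you'd rather stay inside pandas, a string-accessor variant of the same idea is to split on '/' and keep the last piece; rows without a '/' pass through unchanged. A minimal sketch on a few of the rows above:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Square_Train_1',
                            'Stims1/Neut/32Neut1.jpg',
                            'Stims2/Pos/35Pos2.jpg']})
# keep only the part after the last '/'; names without '/' are unchanged
df['Name'] = df['Name'].str.split('/').str[-1]
print(df['Name'].tolist())  # ['Square_Train_1', '32Neut1.jpg', '35Pos2.jpg']
```

Unlike os.path.basename this is separator-agnostic: pass whatever delimiter you need to split.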
Related
want to apply merge function on column A
How can I apply a merge function, or any other method, on column A? For example, in layman's terms, I want to convert the string "(A|B|C,D)|(A,B|C|D)|(B|C|D)" into "(D A|D B|D C)|(A B|A C|A D)|(B|C|D)". The group (B|C|D) will remain the same, as it doesn't have a comma value to merge into it. Basically I want to merge the comma-separated values into each of the other values. I have the below data frame:
import pandas as pd
data = {'A': ['(A|B|C,D)|(A,B|C|D)|(B|C|D)'],
        'B(Expected)': ['(D A|D B|D C)|(A B|A C|A D)|(B|C|D)']}
df = pd.DataFrame(data)
print(df)
My expected result is shown in column B(Expected). Below are the methods I tried:
(1)
df['B(Expected)'] = df['A'].apply(lambda x: x.replace("|", " ").replace(",", "|") if "|" in x and "," in x else x)
(2)
# Split the string by the pipe character
df['string'] = df['string'].str.split('|')
df['string'] = df['string'].apply(lambda x: '|'.join([' '.join(i.split(' ')) for i in x]))
You can use a regex to extract the values in parentheses, then a custom function with itertools.product to reorganize the values:
from itertools import product

def split(s):
    return '|'.join([' '.join(x) for x in product(*[x.split('|') for x in s.split(',')])])

df['B'] = df['A'].str.replace(r'([^()]+)', lambda m: split(m.group()), regex=True)
print(df)
Note that this requires non-nested parentheses.
Output:
                             A                                    B
0  (A|B|C,D)|(A,B|C|D)|(B|C|D)  (A D|B D|C D)|(A B|A C|A D)|(B|C|D)
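The inner helper can be checked on its own before wiring it into str.replace; a minimal sketch of the same expansion logic:

```python
from itertools import product

def split(s):
    # expand each comma-separated group into every pipe-alternative combination
    return '|'.join(' '.join(x) for x in
                    product(*[g.split('|') for g in s.split(',')]))

print(split('A|B|C,D'))  # A D|B D|C D
print(split('B|C|D'))    # B|C|D  (no comma, so nothing to expand)
```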
Search a column against a list of strings and, if a match is found, enter the matched string in a new column
I want to search for names in column col_one, where I have a list of names in the variable list20. When searching, if the value of col_one matches an entry in list20, put that name in a new column named new_col. Most of the time the name will be at the front, such as ZEN, W, WICE, but some names have a symbol after them, such as ZEN-R, ZEN-W2, ZEN13P2302A.
My data:
import pandas as pd

list20 = ['ZEN', 'OOP', 'WICE', 'XO', 'WP', 'K', 'WGE', 'YGG', 'W', 'YUASA',
          'XPG', 'ABC', 'WHA', 'WHAUP', 'WFX', 'WINNER', 'WIIK', 'WIN', 'YONG',
          'WPH', 'KCE']
data = {
    "col_one": ["ZEN", "WPH", "WICE", "YONG", "K", "XO", "WIN", "WP", "WIIK",
                "YGG-W1", "W-W5", "WINNER", "YUASA", "WGE", "WFX", "XPG",
                "WHAUP", "WHA", "KCE13P2302A", "OOP-R"],
}
df = pd.DataFrame(data)
The approaches below give the result shown in the picture, and it's not right:
# or --------
df['new_col'] = df['col_one'].str.extract('('+'|'.join(list20)+')')[0]
# or --------
import re
pattern = re.compile(r"|".join(x for x in list20))
df = (df
      .assign(new=lambda x: [re.findall(pattern, string)[0] for string in x.col_one])
      )
# or --------
def matcher(col_one):
    for i in list20:
        if i in col_one:
            return i
    return 'na'  # adjust as you see fit

df['new_col'] = df.apply(lambda x: matcher(x['col_one']), axis=1)
The result obtained from the code above is not right.
Expected Output
Try sorting the list first, longest names first:
pattern = re.compile(r"|".join(x for x in sorted(list20, reverse=True, key=len)))
(df
 .assign(new=lambda x: [re.findall(pattern, string)[0] for string in x.col_one])
 )
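The sort matters because re alternation is leftmost-alternative-first, not longest-match: if 'W' appears in the pattern before 'WINNER', 'W' wins. A minimal sketch with a made-up subset of the list:

```python
import re

names = ['W', 'WIN', 'WINNER']
unsorted_pat = re.compile('|'.join(names))
sorted_pat = re.compile('|'.join(sorted(names, key=len, reverse=True)))

print(unsorted_pat.match('WINNER-W1').group())  # 'W'  (first alternative wins)
print(sorted_pat.match('WINNER-W1').group())    # 'WINNER'
```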
Try with str.extract:
df['new'] = df['col_one'].str.extract('('+'|'.join(list20)+')')[0]
df
Out[121]:
      col_one   new
0        CFER  CFER
1  ABCP6P45C9   ABC
2      LOU-W5   LOU
3      CFER-R  CFER
4      ABC-W1   ABC
5  LOU13C2465   LOU
One way to do this, less attractive in terms of efficiency, is to use a simple function with a lambda:
def matcher(col_one):
    for i in list20:
        if i in col_one:
            return i
    return 'na'  # adjust as you see fit

df['new_col'] = df.apply(lambda x: matcher(x['col_one']), axis=1)
df
Expected results:
      col_one new_col
0        CFER    CFER
1  ABCP6P45C9     ABC
2      LOU-W5     LOU
3      CFER-R    CFER
4      ABC-W1     ABC
5  LOU13C2465     LOU
Another approach:
import re

pattern = re.compile(r"|".join(x for x in list20))
(df
 .assign(new=lambda x: [re.findall(pattern, string)[0] for string in x.col_one])
 )
Renaming index values in pandas dataframe
I need to change the names of my indices:
                                  Country  Date  (other columns)
/link1/subpath2/Text by Poe/
/link1/subpath2/Text by Wilde/
/link1/subpath2/Text by Whitman/
Country and Date are my indices. I would like to extract the words Poe, Wilde and Whitman from the index column Country, in order to have
Country  Date  (other columns)
Poe
Wilde
Whitman
Currently I am doing it one by one:
df = df.rename(index={'/link1/subpath2/Text by Poe/': 'Poe'})
df = df.rename(index={'/link1/subpath2/Text by Wilde/': 'Wilde'})
df = df.rename(index={'/link1/subpath2/Text by Whitman/': 'Whitman'})
It works, but since I have hundreds of datasets, as you can imagine this is not doable.
You can use str.replace:
df['Country'] = df['Country'].str.replace(r'/link1/subpath2/Text by ', '')
df['Country'] = df['Country'].str.replace(r'/', '')
If Country is an index, you can do as follows:
df = df.set_index('Country')
df.index = df.index.str.replace(r'/link1/subpath2/Text by ', '')
If it's a MultiIndex, you can use reset_index:
df = df.reset_index()
df['Country'] = df['Country'].str.replace(r'/link1/subpath2/Text by ', '')
You can always use a regex pattern if things get more complicated:
import re
import pandas as pd

df = pd.DataFrame(['foo', 'bar', 'z'],
                  index=['/link1/subpath2/Text by Poe/',
                         '/link1/subpath2/Text by Wilde/',
                         '/link1/subpath2/Text by Whitman/'])
name_pattern = re.compile(r'by (\w+)/')
df.index = [name_pattern.findall(idx)[0] for idx in df.index]
df
where name_pattern captures the group between 'by ' and '/'.
You can use str.extract with a pattern that catches the last word with (\w*), delimited by a whitespace \s before it and the character / at the end of the line $. Because it is a MultiIndex, you need to rebuild it with MultiIndex.from_arrays:
df.index = pd.MultiIndex.from_arrays(
    [df.index.get_level_values(0).str.extract(r'\s(\w*)/$')[0],
     df.index.get_level_values(1)],
    names=['Country', 'Dates'])
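For a flat (non-Multi) index, the same extraction can be done directly with str.extract and expand=False, which keeps the result as an Index; a small sketch assuming the 'Text by <name>/' layout from the question:

```python
import pandas as pd

df = pd.DataFrame({'val': [1, 2, 3]},
                  index=['/link1/subpath2/Text by Poe/',
                         '/link1/subpath2/Text by Wilde/',
                         '/link1/subpath2/Text by Whitman/'])
# capture the word between 'by ' and the trailing '/'
df.index = df.index.str.extract(r'by (\w+)/$', expand=False)
print(list(df.index))  # ['Poe', 'Wilde', 'Whitman']
```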
Add suffix to column names that don't already have a suffix
I have a data frame with columns like Name, Date, Date_x, Date_y, A, A_x, A_y, and I need to add _z to the columns (except the Name column) that don't already have _x or _y. So I want the output to be similar to Name, Date_z, Date_x, Date_y, A_z, A_x, A_y. I've tried:
df.iloc[:, ~df.columns.str.contains('x|y|Name')] = df.iloc[:, ~df.columns.str.contains('x|y|Name')].add_suffix("_z")
# doesn't add suffixes and replaces columns with all NaNs
df.columns = df.columns.map(lambda x: x + '_z' if "x" not in x or "y" not in x else x)
# many variations of this, but it seems to add _z to all of the column names
How about:
df.columns = [x if x == 'Name' or '_' in x else x + '_z' for x in df.columns]
You can also try:
df.rename(columns=lambda x: x if x == 'Name' or '_' in x else x + '_z')
stealing slightly from Quang Hoang ;)
Add '_z' where the column stub is duplicated and without a suffix:
m = (df.columns.str.split('_').str[0].duplicated(keep=False)
     & ~df.columns.str.contains('_'))
df.columns = df.columns.where(~m, df.columns + '_z')
I would use Index.putmask as follows:
m = (df.columns == 'Name') | df.columns.str[-2:].isin(['_x', '_y'])
df.columns = df.columns.putmask(~m, df.columns + '_z')

In [739]: df.columns
Out[739]: Index(['Name', 'Date_z', 'Date_x', 'Date_y', 'A_z', 'A_x', 'A_y'], dtype='object')
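A quick runnable check of the putmask variant against the column names from the question (here on an empty frame, since only the columns matter):

```python
import pandas as pd

df = pd.DataFrame(columns=['Name', 'Date', 'Date_x', 'Date_y', 'A', 'A_x', 'A_y'])
# keep 'Name' and anything ending in _x/_y; suffix everything else with _z
m = (df.columns == 'Name') | df.columns.str[-2:].isin(['_x', '_y'])
df.columns = df.columns.putmask(~m, df.columns + '_z')
print(list(df.columns))  # ['Name', 'Date_z', 'Date_x', 'Date_y', 'A_z', 'A_x', 'A_y']
```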
Search and replace dots and commas in pandas dataframe
This is my DataFrame:
d = {'col1': ['sku 1.1', 'sku 1.2', 'sku 1.3'],
     'col2': ['9.876.543,21', 654, '321,01']}
df = pd.DataFrame(data=d)
df
      col1          col2
0  sku 1.1  9.876.543,21
1  sku 1.2           654
2  sku 1.3        321,01
The data in col2 are numbers in a local format, which I would like to convert into:
        col2
  9876543.21
         654
      321.01
I tried df['col2'] = pd.to_numeric(df['col2'], downcast='float'), which returns ValueError: Unable to parse string "9.876.543,21" at position 0.
I also tried df = df.apply(lambda x: x.str.replace(',', '.')), which returns ValueError: could not convert string to float: '5.023.654.46'
The best option is to use the parameters of read_csv if possible:
df = pd.read_csv(file, thousands='.', decimal=',')
If that's not possible, then replace should help:
df['col2'] = (df['col2'].replace(r'\.', '', regex=True)
                        .replace(',', '.', regex=True)
                        .astype(float))
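If the column mixes real numbers with strings, as in the question's data, casting to str first makes the same replace-then-cast chain safe; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'col2': ['9.876.543,21', 654, '321,01']})
df['col2'] = (df['col2'].astype(str)
              .str.replace('.', '', regex=False)   # drop thousands separators
              .str.replace(',', '.', regex=False)  # comma becomes the decimal point
              .astype(float))
print(df['col2'].tolist())  # [9876543.21, 654.0, 321.01]
```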
You can swap the two separators through a placeholder character, operating on the string contents (Series.replace alone matches whole cell values, so use the str accessor):
df['col2'] = (df['col2'].astype(str)
              .str.replace(',', '&', regex=False)
              .str.replace('.', ',', regex=False)
              .str.replace('&', '.', regex=False))
You are always better off using standard system facilities where they exist. Knowing that some locales use commas and decimal points differently, I could not believe that pandas would not use the formats of the locale. Sure enough, a quick search revealed this gist, which explains how to make use of locales to convert strings to numbers. In essence you need to import locale and, after you've built the dataframe, call locale.setlocale to establish a locale that uses commas as decimal points and periods as separators, then apply a converter such as locale.atof with the dataframe's applymap method.