rstrip() the unwanted parts from string column - python

I have a column of strings consisting of the following values:
'20/25+1'
'9/200E'
'20/50+1'
'20/30 # 8 inches'
'20/60-2+1'
'20/20 !!'
'20/20(slow)'
'20/70-1 "slowly"'
And I only want the first fraction, so I am trying to find a way to get to the following values:
'20/25'
'9/200'
'20/50'
'20/30'
'20/60'
'20/20'
'20/20'
'20/70'
I have tried the following command but it doesn't seem to do the job:
df['colname'].apply(lambda x: x.rstrip(' .*')).unique()
How can I fix it? Thanks in advance!

Assuming that the fraction would always start the column's value, we can use str.extract here as follows:
df['pct'] = df['colname'].str.extract(r'^(\d+/\d+)')
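To see why the original attempt has no effect, note that rstrip() takes a set of characters to strip rather than a pattern, so ' .*' only removes trailing spaces, dots and asterisks. Below is a minimal sketch on the sample values from the question; the column name fraction and the use of expand=False (to get a Series back instead of a one-column DataFrame) are my own choices, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({'colname': [
    '20/25+1', '9/200E', '20/50+1', '20/30 # 8 inches',
    '20/60-2+1', '20/20 !!', '20/20(slow)', '20/70-1 "slowly"',
]})

# rstrip(' .*') strips trailing ' ', '.' and '*' characters only;
# none of the sample values end with one, so nothing changes.
stripped = df['colname'].apply(lambda x: x.rstrip(' .*'))

# Capture the leading "digits/digits" part of each value.
df['fraction'] = df['colname'].str.extract(r'^(\d+/\d+)', expand=False)

print(df['fraction'].tolist())
```

Values that do not start with a fraction would come back as NaN with this pattern.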

Related

How to replace a substring in one column based on the string from another column?

I'm working with a dataset of Magic: The Gathering cards. What I want is, if a card references its name in its rules text, for the name to be replaced with "This_Card". Here is what I've tried:
card_text['text_unnamed'] = card_text[['name', 'oracle_text']].apply(lambda x: x.oracle_text.replace(x.name, 'This_Card') if x.name in x.oracle_text else x, axis = 1)
This is giving me the error "TypeError: 'in <string>' requires string as left operand, not int"
I've tried with axis = 1, 0 and no axis. Still getting errors.
When I edited my code to output what x.name is, it turned out to be just the int 2. I'm not sure why this is happening. Everything in the name column is a string. What is causing this interaction and how can I prevent it?
Here is a sample of my data.
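Series.name is a built-in attribute, so x.name does not access the column; with axis=1 it returns the row's index label, which is why you see an int. Use x['name'] to access the name column instead.
It is more efficient to build a boolean mask and replace only in the matching rows than to apply the replacement to every row:
# str.contains cannot take a Series as its pattern, so build the mask row by row
m = [name in text for name, text in zip(card_text['name'], card_text['oracle_text'])]
# the names are passed as regex patterns here, and any listed name is replaced
card_text.loc[m, 'text_unnamed'] = card_text['oracle_text'].replace(card_text['name'].tolist(), 'This_Card', regex=True)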
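A small sketch of the x.name versus x['name'] difference, in case it helps. The card names and rules text below are made up for illustration and are not from the asker's dataset. Since str.replace() leaves the text unchanged when the name is absent, the sketch drops the explicit "if ... in ..." check.

```python
import pandas as pd

# Hypothetical sample data, only to illustrate the behaviour.
card_text = pd.DataFrame({
    'name': ['Card A', 'Card B'],
    'oracle_text': ['Card A deals 3 damage to any target.', 'Draw a card.'],
})

# With axis=1 each row arrives as a Series whose .name is the row's index label.
row_labels = card_text.apply(lambda x: x.name, axis=1)

# Bracket access reaches the actual 'name' column.
card_text['text_unnamed'] = card_text.apply(
    lambda x: x['oracle_text'].replace(x['name'], 'This_Card'), axis=1
)

print(row_labels.tolist())
print(card_text['text_unnamed'].tolist())
```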
x.name isn't always a string, so you can't perform <int> in <string>.
I can't say for sure without seeing the data,
but I guess adding this line before your code will do it:
card_text[['name', 'oracle_text']] = card_text[['name', 'oracle_text']].astype(str)
which simply converts all data in both columns to strings.

How to split based on string matching?

I have two lists, one that contains the user input and the other one that contains the mapping.
The user input looks like this:
The mapping looks like this:
I am trying to split the strings in the user input list. Sometimes they enter one record as CO109CO45, but in reality these are two codes and don't belong together. They need to be separated with a comma or space, as in CO109,CO45.
There are many examples with the same behavior, and I was thinking of using a mapping list to match and split. Is this something that can be done? What do you suggest? Thanks in advance for your help!
Use a combination of lookahead and lookbehind regex in the split.
import pandas as pd
df = pd.DataFrame({'RCode': ['CO109', 'CO109CO109']})
print(df)
        RCode
0       CO109
1  CO109CO109
# split at every position preceded by a digit and followed by a non-digit
df.RCode.str.split(r'(?<=\d)(?=\D)')
0           [CO109]
1    [CO109, CO109]
Name: RCode, dtype: object
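For anyone who wants to see what the lookbehind/lookahead pair does outside pandas, here is a rough sketch with the standard re module. It assumes Python 3.7 or later, where re.split() can split on zero-width matches; the sample codes are taken from this thread.

```python
import re

# (?<=\d) requires a digit just before the split point,
# (?=\D) requires a non-digit just after it; the match itself is empty.
pattern = re.compile(r'(?<=\d)(?=\D)')

codes = ['CO109', 'CO109CO45', 'CO134CO42CO685']
split_codes = [pattern.split(code) for code in codes]

print(split_codes)
```

Because the match is zero-width, no characters are consumed and every code keeps its full text.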
You can try with regex:
import pandas as pd
l = ['CO2740CO96', 'CO12', 'CO973', 'CO870CO397', 'CO584', 'CO134CO42CO685']
df = pd.DataFrame({'code': l})
# one or more letters followed by one or more digits form a single code
df.code = df.code.str.findall(r'[A-Za-z]+\d+')
print(df)
Output:
                   code
0        [CO2740, CO96]
1                [CO12]
2               [CO973]
3        [CO870, CO397]
4               [CO584]
5  [CO134, CO42, CO685]
I usually use something like this, for an input original_list:
output_list = [
    [
        ('CO' + target).strip(' ,')
        for target in item.split('CO')
        if target.strip(' ,')  # skip the empty piece before the first 'CO'
    ]
    for item in original_list
]
There are probably more efficient ways of doing it, but this avoids the overhead of dataframes / pandas and the hard-to-read aspects of regexes.
If you have a manageable number of prefixes ("CO", "PR", etc.), you can set up a recursive function that splits on each of them. Or you can use .find() with the full codes.
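As an alternative to a recursive function for several prefixes, here is a sketch that builds one pattern from a prefix list. It does use a regex, so it swaps in a different technique from the one described just above. The prefix list, the helper name split_codes and the sample inputs are assumptions for illustration, and it assumes every code is a known prefix followed by digits.

```python
import re

# Hypothetical prefix list; extend it with whatever the mapping contains.
prefixes = ['CO', 'PR']

# One alternation of escaped prefixes, each followed by one or more digits.
pattern = re.compile('(?:' + '|'.join(re.escape(p) for p in prefixes) + r')\d+')

def split_codes(item):
    """Return every prefix+digits code found in item, in order."""
    return pattern.findall(item)

original_list = ['CO109CO45', 'CO12', 'PR7CO3']
output_list = [split_codes(item) for item in original_list]

print(output_list)
```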

Using split function in Pandas bracket indexer

I'm trying to keep the rows in a data frame whose text contains a specific word. I have tried the following:
df['hello' in df['text_column'].split()]
and received the following error:
'Series' object has no attribute 'split'
Please note that I'm trying to check whether the rows contain a whole word, not just a character sequence, so df[df['text_column'].str.contains('hello')] is not a solution: in that case, 'helloss' or 'sshello' would also return True.
Another answer, in addition to the regex answer mentioned above, is to use split combined with the map function, like below:
df['keep_row'] = df['text_column'].map(lambda x: 'hello' in x.split())
df = df[df['keep_row']]
OR
df = df[df['text_column'].map(lambda x: 'hello' in x.split())]
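For comparison, here is a rough sketch of the split approach next to a word-boundary regex, which is one plausible form of the regex answer referred to above (that answer is not shown here, so this is a guess at it). The sample sentences are made up.

```python
import pandas as pd

df = pd.DataFrame({'text_column': [
    'hello world', 'helloss there', 'say sshello', 'well hello, friend',
]})

# Whitespace split: a token must equal 'hello' exactly.
by_split = df[df['text_column'].map(lambda x: 'hello' in x.split())]

# Word boundaries: 'hello' must not be glued to other word characters.
by_regex = df[df['text_column'].str.contains(r'\bhello\b', regex=True)]

print(by_split.index.tolist())
print(by_regex.index.tolist())
```

The two differ on punctuation: split() leaves 'hello,' as a single token, so the last row is kept only by the regex version.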

Pandas check for substring of x in column if found add string to x

My dataframe has a maximum of 2 variations of each string, e.g. if the string is 'USD' then sometimes another entry with 'LDUSD' is also present. The entries without 'LD' are always present.
I need to apply x[0:2]+'_'+x[2:] but ONLY if the column contains an exact match of x[2:].
It must be done this way to ensure the change only happens to the relevant entries, as there are also various items which either include 'LD' in their default name, e.g. ('EGLD','LDO','SLD'), or include the current x string, e.g. ('TUSD','USDT').
df['Asset'] = df['Asset'].apply(lambda x: x[0:2]+'_'+x[2:] if x[2:] in df['Asset'] else x)
The part after "in" doesn't work, and I'm at a loss as to how to proceed next.
How do I check if the column ['Asset'] holds an exact match of x[2:]?
Apologies for the title, I didn't really know what to call this one.
EDIT a few examples out of circa 400:
df['Asset'] = ['1INCH','AAVE','ADA','ALGO','EGLD','DASH','LDO','TUSD','USDT','LD1INCH','LDALGO','LDEGLD','LDDASH','LDLDO','LDTUSD','LDUSDT',]
What I need:
df['Asset'] = ['1INCH','AAVE','ADA','ALGO','EGLD','DASH','LDO','TUSD','USDT','LD_1INCH','LD_ALGO','LD_EGLD','LD_DASH','LD_LDO','LD_TUSD','LD_USDT',]
You can use str.contains() to test whether any() entry matches rf'^{x[2:]}$':
df['Asset'] = df['Asset'].apply(lambda x: x[:2]+'_'+x[2:]
    if df['Asset'].str.contains(rf'^{x[2:]}$', regex=True).any() else x)
For regex, add r to make it a raw string. In this case we also add f so we can interpolate x[2:] via an f-string:
^ - beginning of string
{x[2:]} - interpolate x[2:] inside the f-string
$ - end of string
Do you want something like this? If the string ends with 'USD' but contains more, then give it an underscore before USD?
df = pd.DataFrame(columns=['Asset'], data=['1INCH','AAVE','ADA','ALGO','EGLD','DASH','LDO','TUSD','USDT','LD1INCH','LDALGO','LDEGLD','LDDASH','LDLDO','LDTUSD','LDUSDT',])
df['Asset'].apply(lambda x: x[:2]+'_'+x[2:] if len(x) > 2 and x[2:] in df['Asset'].values else x)
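The reason "x[2:] in df['Asset']" fails in the question is that the in operator on a Series tests the index labels, not the values. Here is a short sketch of that, with a three-item sample of my own. Using a set for the lookup is my own variation, not something from the answers above.

```python
import pandas as pd

s = pd.Series(['USD', 'LDUSD', 'EGLD'])

in_series = 'USD' in s          # checks the index (0, 1, 2), not the values
label_in_series = 0 in s        # index label 0 exists
in_values = 'USD' in s.values   # checks the underlying values

# A set gives fast exact-match lookups when there are many entries.
known = set(s)
result = s.apply(
    lambda x: x[:2] + '_' + x[2:] if len(x) > 2 and x[2:] in known else x
)

print(in_series, label_in_series, in_values)
print(result.tolist())
```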

Replace s.str.startswith parameters only in a series

I have a df on which I want to filter a column and replace the str.startswith parameter. Example:
df = pd.DataFrame(data={'fname':['Anky','Anky','Tom','Harry','Harry','Harry'],'lname':['sur1','sur1','sur2','sur3','sur3','sur3'],'role':['','abc','def','ghi','','ijk'],'mobile':['08511663451212','','0851166346','','0851166347',''],'Pmobile':['085116634512','1234567890','8885116634','','+353051166347','0987654321'],'Isactive':['Active','','','','Active','']})
By executing the line below:
df['Pmobile'][df['Pmobile'].str.startswith(('08','8','+353'),na=False)]
I get:
0 085116634512
2 8885116634
4 +353051166347
How do I replace only the prefixes I passed to s.str.startswith(), for example ('08','8','+3538'), without touching any other part of the number except the starting digits in the tuple (on the fly)?
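I found this the most convenient and concise:
# regex=True is required, and the alternatives go in a group, not a character class
df.Pmobile = df.Pmobile.replace(r'^(?:08|88|\+3538)', '', regex=True)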
You can use pandas' replace with regex.
Below is sample code:
df.Pmobile.replace(regex={r'^08':'',r'^8':'',r'^[+]353':''})
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.replace.html
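If the prefixes should come straight from the tuple passed to startswith(), one possible sketch is to build the pattern from that tuple. The variable names are mine; the prefixes are the ones from the startswith() call in the question, and re.escape() takes care of the '+'.

```python
import re
import pandas as pd

pmobile = pd.Series(['085116634512', '1234567890', '8885116634', '',
                     '+353051166347', '0987654321'])

prefixes = ('08', '8', '+353')

# Anchor at the start and escape each prefix so '+' is taken literally.
pattern = '^(?:' + '|'.join(re.escape(p) for p in prefixes) + ')'

stripped = pmobile.str.replace(pattern, '', regex=True)

print(stripped.tolist())
```

Only the leading prefix is removed; numbers that start with none of the prefixes are left as they are.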
