Using split function in Pandas bracket indexer - python

I'm trying to keep the rows in a data frame whose text contains a specific word. I have tried the following:
df['hello' in df['text_column'].split()]
and received the following error:
'Series' object has no attribute 'split'
Note that I'm trying to check whether they contain a whole word, not a character sequence, so df[df['text_column'].str.contains('hello')] is not a solution, because in that case 'helloss' or 'sshello' would also be returned as True.

In addition to the regex answer mentioned above, another option is to use split combined with the map function, like below:
df['keep_row'] = df['text_column'].map(lambda x: 'hello' in x.split())
df = df[df['keep_row']]
OR
df = df[df['text_column'].map(lambda x: 'hello' in x.split())]
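For completeness, the regex-based alternative referred to above would use a word boundary with str.contains; a minimal sketch (the column name is taken from the question, the sample data is made up):

import pandas as pd

df = pd.DataFrame({'text_column': ['hello world', 'sshello there', 'say hello', 'helloss']})

# \b matches a word boundary, so only whole-word occurrences of 'hello' match
df = df[df['text_column'].str.contains(r'\bhello\b', regex=True)]

Note that \b treats punctuation as a boundary, so 'hello,' would still match here but not with split(); pick whichever behaviour you need.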

Related

How to replace a substring in one column based on the string from another column?

I'm working with a dataset of Magic: The Gathering cards. What I want is, if a card references its name in its rules text, for the name to be replaced with "This_Card". Here is what I've tried:
card_text['text_unnamed'] = card_text[['name', 'oracle_text']].apply(lambda x: x.oracle_text.replace(x.name, 'This_Card') if x.name in x.oracle_text else x, axis = 1)
This is giving me the error "TypeError: 'in ' requires string as left operand, not int"
I've tried with axis = 1, 0 and no axis. Still getting errors.
In editing my code to output what x.name is, it has revealed that it is just the int 2. I'm not sure why this is happening. Everything in the name column is a string. What is causing this interaction and how can I prevent it?
Here is a sample of my data.
Series.name is a built-in attribute, so it won't access the column when you call x.name. Instead, you need to use x['name'] to access the name column.
A more efficient approach is to replace conditionally with a mask rather than apply:
m = [name in text for name, text in zip(card_text['name'].astype(str), card_text['oracle_text'].astype(str))]
card_text.loc[m, 'text_unnamed'] = card_text['oracle_text'].replace(card_text['name'].tolist(), 'This_Card', regex=True)
x.name isn't always a string, so you can't perform <int> in <string>.
I can't say for sure without seeing the data, but I guess adding this line before your code will do it:
card_text[['name', 'oracle_text']] = card_text[['name', 'oracle_text']].astype(str)
which simply converts all the data in both columns to strings.
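Putting the two answers together, a minimal sketch of the corrected apply (the column names come from the question; everything else is an assumption about the data):

# Cast both columns to str so the substring check never sees an int
card_text[['name', 'oracle_text']] = card_text[['name', 'oracle_text']].astype(str)

# Use x['name'] (the column), not x.name (the row label), and fall back to the
# unchanged oracle text when the card does not mention its own name
card_text['text_unnamed'] = card_text.apply(
    lambda x: x['oracle_text'].replace(x['name'], 'This_Card')
    if x['name'] in x['oracle_text'] else x['oracle_text'],
    axis=1,
)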

rstrip() the unwanted parts from string column

I have a column of strings consisting of the following values:
'20/25+1'
'9/200E'
'20/50+1'
'20/30 # 8 inches'
'20/60-2+1'
'20/20 !!'
'20/20(slow)'
'20/70-1 "slowly"'
And I only want the first fraction, so I am trying to find a way to get to the following values:
'20/25'
'9/200'
'20/50'
'20/30'
'20/60'
'20/20'
'20/20'
'20/70'
I have tried the following command but it doesn't seem to do the job:
df['colname'].apply(lambda x: x.rstrip(' .*')).unique()
How can I fix it? Thanks in advance!
Assuming that the fraction always starts the column's value, we can use str.extract here as follows:
df['pct'] = df['colname'].str.extract(r'^(\d+/\d+)')
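For example, against the sample values from the question (a minimal, self-contained sketch):

import pandas as pd

df = pd.DataFrame({'colname': ['20/25+1', '9/200E', '20/50+1', '20/30 # 8 inches',
                               '20/60-2+1', '20/20 !!', '20/20(slow)', '20/70-1 "slowly"']})

# The anchored group keeps only the leading "digits/digits" fraction
df['pct'] = df['colname'].str.extract(r'^(\d+/\d+)')
print(df['pct'].tolist())
# ['20/25', '9/200', '20/50', '20/30', '20/60', '20/20', '20/20', '20/70']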

Pandas: how can I use map on a column in a pandas dataframe to create a new column? Having trouble using a lambda function to do this

I have a data set with strings in one column; for each string I want to find the most common character and put that character in a new column. I also want another column containing the proportion of the string that character represents.
The method I want to use on each string is as follows:
sequence = 'ACCCCTGGC'
char_i_want = collections.Counter(sequence).most_common(1)[0] # for the character
value_i_want = collections.Counter(sequence).most_common(1)[1] / len(sequence) # for the proportion
I understand the result of most_common is a tuple, but when I try this in a python shell, I need to do collections.Counter(sequence).most_common(1)[0][0] to access the 0th element of the tuple, the tuple being the 0th element of the returned list. When I tried implementing that, it still didn't work.
Here is how I attempted to do it:
def common_char(sequence):
    return Counter(sequence).most_common(1)[0][0]

def char_freq(sequence):
    return Counter(sequence).most_common(1)[0][1] / len(sequence)

data = pd.read_csv('final_file_noidx.csv')
data['most_common_ref'] = data['REF'].map(lambda x: common_char(x))
data['most_common_ref_frac'] = data['REF'].map(lambda x: char_freq(x))
I am greeted by this error message: TypeError: 'float' object is not iterable
data['most_common_ref'] = data['REF'].map(lambda x: common_char(x), na_action='ignore')
data['most_common_ref_frac'] = data['REF'].map(lambda x: char_freq(x), na_action='ignore')
Needed to ignore NaNs, thanks Andy L.
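For reference, a self-contained version of the fix on a toy frame (the REF column name comes from the question; the sample rows are made up):

from collections import Counter
import pandas as pd

def common_char(sequence):
    return Counter(sequence).most_common(1)[0][0]

def char_freq(sequence):
    return Counter(sequence).most_common(1)[0][1] / len(sequence)

data = pd.DataFrame({'REF': ['ACCCCTGGC', 'AATT', None]})

# na_action='ignore' leaves the missing value as NaN instead of feeding it to Counter
data['most_common_ref'] = data['REF'].map(common_char, na_action='ignore')
data['most_common_ref_frac'] = data['REF'].map(char_freq, na_action='ignore')
print(data)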

Replace s.str.startswith parameters only in a series

I have a df on which I want to filter a column and replace the str.startswith parameter. Example:
df = pd.DataFrame(data={'fname':['Anky','Anky','Tom','Harry','Harry','Harry'],'lname':['sur1','sur1','sur2','sur3','sur3','sur3'],'role':['','abc','def','ghi','','ijk'],'mobile':['08511663451212','','0851166346','','0851166347',''],'Pmobile':['085116634512','1234567890','8885116634','','+353051166347','0987654321'],'Isactive':['Active','','','','Active','']})
By executing the line below:
df['Pmobile'][df['Pmobile'].str.startswith(('08','8','+353'),na=False)]
I get :
0 085116634512
2 8885116634
4 +353051166347
How do I replace only the prefixes I passed to s.str.startswith() here, for example ('08', '8', '+3538'), and not touch any other part of the number except those starting digits (on the fly)?
I found this the most convenient and concise:
df.Pmobile = df.Pmobile.replace(r'^(08|88|\+3538)', '', regex=True)
You can use pandas' replace with regex; below is sample code:
df.Pmobile.replace(regex={r'^08':'',r'^8':'',r'^[+]353':''})
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.replace.html
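Applied to the sample frame from the question, that would look roughly like this (a sketch; assign back to keep the result):

df['Pmobile'] = df['Pmobile'].replace(regex={r'^08': '', r'^8': '', r'^[+]353': ''})
# e.g. '085116634512' -> '5116634512' and '+353051166347' -> '051166347';
# values that don't start with one of the prefixes are left untouched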

Lowercase sentences in lists in pandas dataframe

I have a pandas data frame like below. I want to convert all the text into lowercase. How can I do this in python?
Sample of data frame
[Nah I don't think he goes to usf, he lives around here though]
[Even my brother is not like to speak with me., They treat me like aids patent.]
[I HAVE A DATE ON SUNDAY WITH WILL!, !]
[As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers., Press *9 to copy your friends Callertune]
[WINNER!!, As a valued network customer you have been selected to receivea £900 prize reward!, To claim call 09061701461., Claim code KL341., Valid 12 hours only.]
What I tried
def toLowercase(fullCorpus):
    lowerCased = [sentences.lower() for sentences in fullCorpus['sentTokenized']]
    return lowerCased
I get this error
lowerCased = [sentences.lower()for sentences in fullCorpus['sentTokenized']]
AttributeError: 'list' object has no attribute 'lower'
It is easy:
df.applymap(str.lower)
or
df['col'].apply(str.lower)
df['col'].map(str.lower)
Okay, you have lists in rows. Then:
df['col'].map(lambda x: list(map(str.lower, x)))
You can also convert to string, use str.lower, and convert back to lists:
import ast
df.sentTokenized.astype(str).str.lower().transform(ast.literal_eval)
You can try using apply and map:
def toLowercase(fullCorpus):
    lowerCased = fullCorpus['sentTokenized'].apply(lambda row: list(map(str.lower, row)))
    return lowerCased
There is also a nice way to do it with numpy:
fullCorpus['sentTokenized'] = [np.char.lower(x) for x in fullCorpus['sentTokenized']]
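A tiny end-to-end example with list-valued rows, using the map approach from above (the column name comes from the question; the rows are two of the sample sentences):

import pandas as pd

fullCorpus = pd.DataFrame({'sentTokenized': [
    ["I HAVE A DATE ON SUNDAY WITH WILL!", "!"],
    ["WINNER!!", "Claim code KL341."],
]})

# Lower-case every sentence inside every list while keeping the list structure
fullCorpus['sentTokenized'] = fullCorpus['sentTokenized'].map(
    lambda row: [sentence.lower() for sentence in row]
)
print(fullCorpus['sentTokenized'].tolist())
# [['i have a date on sunday with will!', '!'], ['winner!!', 'claim code kl341.']]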
