remove words starting with "#" in a column from a dataframe - python

I have a dataframe called tweetscrypto and I am trying to remove all the words from the column "text" starting with the character "#" and gather the result in a new column "clean_text". The rest of the words should stay exactly the same:
tweetscrypto['clean_text'] = tweetscrypto['text'].apply(filter(lambda x:x[0]!='#', x.split()))
It does not seem to work. Can somebody help?
Thanks in advance

Use str.replace with a regular expression to remove the strings starting with #.
Sample Data
text
0 News via #livemint: #RBI bars banks from links
1 Newsfeed from #oayments_source: How Africa
2 is that bitcoin? not my thing
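tweetscrypto['clean_text'] = tweetscrypto['text'].str.replace(r'(\#\w+.*?)', '', regex=True)
Pass regex=True explicitly, since recent pandas versions treat the pattern as a literal string by default. You can also match # without escaping it, as noted by @baxx:
tweetscrypto['clean_text'] = tweetscrypto['text'].str.replace(r'(#\w+.*?)', '', regex=True)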
clean_text
0 News via : bars banks from links
1 Newsfeed from : How Africa
2 is that bitcoin? not my thing

In this case it might be better to define a function rather than use a lambda, mainly for readability.
def clean_text(X):
    # split the text into words and keep only those that do not start with "#"
    X = X.split()
    X_new = [x for x in X if not x.startswith("#")]
    return ' '.join(X_new)
tweetscrypto['clean_text'] = tweetscrypto['text'].apply(clean_text)
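The two answers above do not give identical results, which a small pure-Python sketch can show. The function names below are made up for illustration, and the regex version also collapses leftover spaces, which the original answer does not do.

```python
import re

def drop_hashtag_words(text):
    """Drop every whitespace-separated token that starts with '#'."""
    return " ".join(word for word in text.split() if not word.startswith("#"))

def drop_hashtags_regex(text):
    """Remove '#' plus the word characters after it, then collapse extra spaces."""
    without_tags = re.sub(r"#\w+", "", text)
    return re.sub(r"\s+", " ", without_tags).strip()

sample = "News via #livemint: #RBI bars banks from links"
print(drop_hashtag_words(sample))   # News via bars banks from links
print(drop_hashtags_regex(sample))  # News via : bars banks from links
```

The token-based version removes the trailing colon along with the hashtag, while the regex version keeps it. Either function could be passed to tweetscrypto['text'].apply(...).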

Related

How do you remove standalone alphabets from a column in python?

I am currently working with lyric data of several artists.
When I was working with BTS lyric data, I noticed that the data had names of the member who sang that line at the front as you can see in the example below.
Artist Lyric
bts(방탄소년단) jungkook 'cause i i i'm in the stars tonight s...
bts(방탄소년단) 방탄소년단의 fake love 가사 v jungkook 널 위해서라면 난 슬퍼도...
I tried removing their names with the str.replace() method.
However, one of their names seems to be "v", and when I try to remove "v" it removes every v in the column, as demonstrated below:
Artist Lyric
bts(방탄소년단) 'cause i i i'm in the stars tonight s...
bts(방탄소년단) 방탄소년단의 fake lo e 가사 널 위해서라면 난 슬퍼도...
It would be really appreciated if there would be any way to remove a "v" that stands alone, but maintain the v's that are actually inside a word such as the v in love.
Thank you in advance!!
If you want to remove all single letters in your data, suppose you have the following dataframe:
artist lyrics
0 sdsddssd fdgdg v sdsssdvsdd
1 sffxxvxv sddsdsdvdfdf
2 vxvxvxv fdagsgvs v v v
3 zcczc xdfsddfds
4 zcczc vxvxdfdvdvd
then
dg['lyrics'].map(lambda x: ' '.join(word for word in x.split() if len(word)>1))
returns:
0 fdgdg sdsssdvsdd
1 sddsdsdvdfdf
2 fdagsgvs
3 xdfsddfds
4 vxvxdfdvdvd
Name: lyrics, dtype: object
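If only one specific standalone token such as "v" should go, rather than every single-letter word, a word-boundary regex is one option. This is a minimal sketch; remove_standalone is a hypothetical helper name, not something from the question.

```python
import re

def remove_standalone(text, token="v"):
    """Delete `token` only where it stands alone as a whole word."""
    pattern = rf"\b{re.escape(token)}\b"
    cleaned = re.sub(pattern, "", text)
    # re-join on single spaces to tidy the gap left behind
    return " ".join(cleaned.split())

print(remove_standalone("방탄소년단의 fake love 가사 v jungkook"))
# 방탄소년단의 fake love 가사 jungkook
```

The v in love is untouched because it has word characters on both sides. Note that \b also treats apostrophes and hyphens as boundaries, so a token like v-neck would be affected. On a dataframe column, the same pattern should work with .str.replace(r'\bv\b', '', regex=True).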

Remove specific characters from a pandas column?

Hello I have a dataframe where I want to remove a specific set of characters 'fwd' from every row that starts with it. The issue I am facing is that the code I am using to execute this is removing anything that starts with the letter 'f'.
my dataframe looks like this:
summary
0 Fwd: Please look at the attached documents and take action
1 NSN for the ones who care
2 News for all team members
3 Fwd: Please take action on the action needed items
4 Fix all the mistakes please
When I used the code:
df['Clean Summary'] = individual_receivers['summary'].map(lambda x: x.lstrip('Fwd:'))
I end up with a dataframe that looks like this:
summary
0 Please look at the attached documents and take action
1 NSN for the ones who care
2 News for all team members
3 Please take action on the action needed items
4 ix all the mistakes please
I don't want the last row to lose the F in 'Fix'.
You should use a regex, remembering that ^ anchors the match to the start of the string:
df['Clean Summary'] = df['summary'].str.replace(r'^Fwd', '', regex=True)
Here's an example:
import pandas as pd
df = pd.DataFrame({'msg':['Fwd: o','oe','Fwd: oj'],'B':[1,2,3]})
df['clean_msg'] = df['msg'].str.replace(r'^Fwd: ', '', regex=True)
print(df)
Output:
msg B clean_msg
0 Fwd: o 1 o
1 oe 2 oe
2 Fwd: oj 3 oj
You are not only losing 'F' but also 'w', 'd', and ':'. This is how lstrip works: it treats the passed string as a set of characters and strips any of them from the left, in any order, until it reaches a character that is not in the set.
You should actually use x.replace('Fwd:', '', 1)
The final argument, 1, ensures that only the first occurrence of the string is removed.
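The difference between stripping a character set and removing a prefix can be seen in a short sketch. strip_prefix is a hypothetical helper written for illustration; on Python 3.9+ the built-in str.removeprefix does much the same job.

```python
def strip_prefix(text, prefix="Fwd:"):
    """Remove `prefix` only when the text actually starts with it."""
    if text.startswith(prefix):
        # drop the prefix, then any whitespace that followed it
        return text[len(prefix):].lstrip()
    return text

# lstrip removes any leading characters from the set {'F', 'w', 'd', ':'}
print("Fix all the mistakes please".lstrip("Fwd:"))  # ix all the mistakes please

# the prefix check leaves non-matching rows alone
print(strip_prefix("Fix all the mistakes please"))   # Fix all the mistakes please
print(strip_prefix("Fwd: Please take action"))       # Please take action
```

Unlike x.replace('Fwd:', '', 1), this only touches text that begins with the prefix, so an occurrence of Fwd: later in the string is left as it is.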

Issue Finding Substring

I'm trying to find the substring Chief in a df column. It's working fine with split() on text with spaces, but not working as expected with find().
sum(df['JobTitle'].apply(lambda x :'chief' in x.lower().split() ))
sum(df['JobTitle'].apply(lambda x : x.lower().find('chief') ==1))
Can you please highlight what the issue in find usage is here?
You can try with re:
import re
# if it appears, add 1, else add 0
sum(df['JobTitle'].apply(lambda x : int(bool(re.findall(r'\bchief\b', x.lower())))))
# add the number of times the word appears
sum(df['JobTitle'].apply(lambda x : len(re.findall(r'\bchief\b', x.lower()))))
EDIT
If you want to catch chief but not words with chief inside, like mischief, use r'\bchief\b'
Demo : https://regex101.com/r/jYOfM1/1
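As for why the find() version misbehaves: str.find returns the index where the substring starts, or -1 if it is absent, so comparing the result with 1 only matches titles where chief begins at the second character. A small sketch, with mentions_chief as a hypothetical helper name:

```python
import re

def mentions_chief(title):
    """Return True if 'chief' appears as a whole word, ignoring case."""
    return re.search(r"\bchief\b", title, flags=re.IGNORECASE) is not None

print("deputy chief of police".find("chief"))    # 7, the start index
print("accountant".find("chief"))                # -1, not found
print(mentions_chief("Deputy Chief of Police"))  # True
print(mentions_chief("Mischief Manager"))        # False
```

With find(), the test for presence would be != -1 rather than == 1, although that would also count words such as mischief.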

Replace specific words by user dictionary and others by 0

So I have a review dataset having reviews like
Simply the best. I bought this last year. Still using. No problems
faced till date.Amazing battery life. Works fine in darkness or broad
daylight. Best gift for any book lover.
(This is from the original dataset, I have removed all punctuation and have all lower case in my processed dataset)
What I want to do is replace some words by 1(as per my dictionary) and others by 0.
My dictionary is
dict = {"amazing":"1","super":"1","good":"1","useful":"1","nice":"1","awesome":"1","quality":"1","resolution":"1","perfect":"1","revolutionary":"1","and":"1","good":"1","purchase":"1","product":"1","impression":"1","watch":"1","quality":"1","weight":"1","stopped":"1","i":"1","easy":"1","read":"1","best":"1","better":"1","bad":"1"}
I want my output like:
0010000000000001000000000100000
I have used this code:
df['newreviews'] = df['reviews'].map(dict).fillna("0")
This always returns 0 as output. I did not want this so I took 1s and 0s as strings, but despite that I'm getting the same result.
Any suggestions how to solve this?
First, don't use dict as a variable name, because it shadows the Python builtin of the same name. Then use a list comprehension with get to replace unmatched words with 0.
Notice:
If the data contain text like date.Amazing, with no space after the punctuation, the punctuation must be replaced by whitespace first.
df = pd.DataFrame({'reviews':['Simply the best. I bought this last year. Still using. No problems faced till date.Amazing battery life. Works fine in darkness or broad daylight. Best gift for any book lover.']})
d = {"amazing":"1","super":"1","good":"1","useful":"1","nice":"1","awesome":"1","quality":"1","resolution":"1","perfect":"1","revolutionary":"1","and":"1","good":"1","purchase":"1","product":"1","impression":"1","watch":"1","quality":"1","weight":"1","stopped":"1","i":"1","easy":"1","read":"1","best":"1","better":"1","bad":"1"}
df['reviews'] = df['reviews'].str.replace(r'[^\w\s]+', ' ', regex=True).str.lower()
df['newreviews'] = [''.join(d.get(y, '0') for y in x.split()) for x in df['reviews']]
Alternative:
df['newreviews'] = df['reviews'].apply(lambda x: ''.join(d.get(y, '0') for y in x.split()))
print (df)
reviews \
0 simply the best i bought this last year stil...
newreviews
0 0011000000000001000000000100000
You can also do this in plain Python, where sent is a review string and mydict is your dictionary:
# clean the sentence
import re
sent = re.sub(r'\.','',sent)
# convert to list
sent = sent.lower().split()
# get values from dict using comprehension
new_sent = ''.join([str(1) if x in mydict else str(0) for x in sent])
print(new_sent)
'001100000000000000000000100000'
You can do it by
df.replace(repl, regex=True, inplace=True)
where df is your dataframe and repl is your dictionary.

Stopword removal with pandas

I would like to remove stopwords from a column of a data frame.
Inside the column there is text which needs to be split.
For example my data frame looks like this:
ID Text
1 eat launch with me
2 go outside have fun
I want to remove stopwords from the Text column, so the text needs to be split into words.
I tried this:
for item in cached_stop_words:
    if item in df_from_each_file[['text']]:
        print(item)
        df_from_each_file['text'] = df_from_each_file['text'].replace(item, '')
So my output should be like this:
ID Text
1 eat launch
2 go fun
That is, the stopwords should be deleted.
But my code does not work correctly. I also tried the reverse, converting my data frame to a series and then looping through that, but it also did not work.
Thanks for your help.
replace (by itself) isn't a good fit here, because you want to perform partial string replacement. You want regex based replacement.
One simple solution, when you have a manageable number of stop words, is using str.replace.
import re
# build one alternation pattern from the (escaped) stop words
p = re.compile("({})".format('|'.join(map(re.escape, cached_stop_words))))
df['Text'] = df['Text'].str.lower().str.replace(p, '', regex=True)
df
ID Text
0 1 eat launch
1 2 outside have fun
If performance is important, use a list comprehension.
cached_stop_words = set(cached_stop_words)
df['Text'] = [' '.join([w for w in x.lower().split() if w not in cached_stop_words])
              for x in df['Text'].tolist()]
df
ID Text
0 1 eat launch
1 2 outside have fun
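One caveat with the regex approach above is that a plain alternation also matches stop words inside longer words (me inside home, for example). A possible refinement, sketched here in pure Python, is to wrap the alternation in \b word boundaries. The stop-word list below is hypothetical and chosen only to match the sample rows in the question.

```python
import re

# Hypothetical stop-word list, chosen only to match the sample rows.
stop_words = {"with", "me", "outside", "have"}

# \b...\b restricts matches to whole words; (?:...) groups the alternatives.
pattern = re.compile(r"\b(?:{})\b".format("|".join(map(re.escape, sorted(stop_words)))))

def remove_stop_words(text):
    """Lower-case the text, delete whole-word stop words, and tidy the spacing."""
    return " ".join(pattern.sub("", text.lower()).split())

print(remove_stop_words("eat launch with me"))   # eat launch
print(remove_stop_words("go outside have fun"))  # go fun
print(remove_stop_words("come home"))            # come home
```

The function could then be applied with df['Text'].apply(remove_stop_words). For large stop-word lists, the set-based list comprehension is likely to remain the faster choice.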
