I previously asked this but I got the data type wrong!
I have a pandas DataFrame, which looks like this:
print(data)
text
0 FollowFriday for being top engaged members...
1 Hey James! How odd :/ Please call our Contact...
2 we had a listen last night :) As You Bleed is...
In this dataframe there are links, which all start with "http". I already have a function, below, which removes words starting with '#' along with other cleaning steps.
def cleanData(data):
    # Keep only ASCII characters
    data['text'] = data['text'].apply(lambda s: "".join(char for char in s if char.isascii()))
    # Remove any digit characters
    data['text'] = data['text'].apply(lambda s: "".join(char for char in s if not char.isdigit()))
    # Remove any words which start with #, which are replies
    data['text'] = data['text'].str.replace(r'(#\w+.*?)', "")
    # Remove any leftover characters
    data['text'] = data['text'].str.replace(r'[^\w\s]', '')
    # Return the cleaned data
    return data
Can anyone help me remove the words which start with 'http', please? I have already tried editing what I have, but no luck so far.
Thanks in advance!
Use Series.str.replace:
data['text'] = data['text'].str.replace(r'http[^\s]*', "")
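Note that pandas 2.0 changed the default of str.replace to regex=False, so on recent versions the pattern has to be flagged explicitly. A minimal, self-contained sketch (the sample rows here are made up):

import pandas as pd

data = pd.DataFrame({"text": [
    "FollowFriday http://example.com for being top engaged members",
    "Hey James! How odd :/ see http://help.example.com",
]})

# http[^\s]* consumes everything from "http" up to the next whitespace
data['text'] = data['text'].str.replace(r'http[^\s]*', '', regex=True)
print(data['text'].tolist())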
One option is to use the str.replace() method:
df = pd.DataFrame(dict(text=[
    r'FollowFridayhttphttp http http for being top engaged members.',
    r'James!http How odd http:/ Please call ou',
    r'httpe had a listen last night :) As You Bleed is...',
]))
df['text'] = df['text'].apply(lambda x: x.replace('http',''))
And you could do something like this in your function.
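A token-level variant (just a sketch, assuming the same 'text' column as in the question) drops whole words starting with 'http' rather than only the literal substring:

def remove_links(s):
    # keep only the whitespace-separated tokens that do not start with "http"
    return " ".join(tok for tok in s.split() if not tok.startswith("http"))

data['text'] = data['text'].apply(remove_links)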
It's my first time posting, but I have a question about writing a function in Python that searches a list of strings and returns any of the words I am looking for. Here is what I have so far:
def search_words(data, search_words):
    keep = []
    for data in data:
        if data in search_words:
            keep.append(data)
    return keep
Here is the data I am searching through and the words I am trying to find:
data = ['SHOP earnings for Q1 are up 5%',
'Subscriptions at SHOP have risen to all-time highs, boosting sales',
"Got a new Mazda, VROOM VROOM Y'ALL",
'I hate getting up at 8am FOR A STUPID ZOOM MEETING',
'TSLA execs hint at a decline in earnings following a capital expansion program']
words = ['earnings', 'sales']
Upon doing print(search_words(data=data, search_words=words)), my list keep comes back as empty brackets [], and I am unsure how to fix the issue. I know that searching for a word inside a string is different from looking for a number in a list, but I cannot figure out how to modify my code to account for that. Any help would be appreciated.
You can use the following. This will keep all the sentences in data that contain at least one of the words:
keep = [s for s in data if any(w in s for w in words)]
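For example, with the data and words lists from the question, printing keep gives:

print(keep)
# ['SHOP earnings for Q1 are up 5%',
#  'Subscriptions at SHOP have risen to all-time highs, boosting sales',
#  'TSLA execs hint at a decline in earnings following a capital expansion program']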
Since they are all strings, instead of looping over them all, just combine them and search that. Also make words a set so lookups are constant-time:

words = set(words)
[word for word in ' '.join(data).split() if word in words]
Using regex:

import re
re.findall('|'.join(words), ''.join(data))
['earnings', 'sales', 'earnings']
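One caveat: both in and this plain alternation match substrings, so 'sales' would also hit a word like 'salesman'. If whole-word matches are needed, a hedged variant with word boundaries (joining on a space so words from adjacent sentences do not fuse):

import re

pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, words)) + r')\b')
print(pattern.findall(' '.join(data)))
# ['earnings', 'sales', 'earnings']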
This beginner-friendly version likewise keeps every sentence in data that contains at least one of the words. Both data and search_words must be lists.
def search_words(data, search_words):
    keep = []
    for dt in data:
        for sw in search_words:
            if sw in dt:
                keep.append(dt)
                break  # stop after the first hit so a sentence is not added twice
    return keep
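Called as in the question, this version returns the three matching sentences instead of an empty list:

print(search_words(data=data, search_words=words))
# ['SHOP earnings for Q1 are up 5%',
#  'Subscriptions at SHOP have risen to all-time highs, boosting sales',
#  'TSLA execs hint at a decline in earnings following a capital expansion program']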
I am trying to extract emojis from a sentence and add them to a new column, but when I do, the new column ends up containing nothing and the emojis are still in the sentence.
For reference, my dataset looks like this - but contains over 70,000 sentences similar to this:
Sentence
You look 😌 good
Love you ❤️
I am so happy today 🤩
So far, I have tried this method:
import pandas as pd
import emoji
df['emojis'] = df['Sentence'].apply(lambda row: ''.join(c for c in row if c in emoji.UNICODE_EMOJI))
df
And this method:
def extract_emojis(text):
    return ''.join(c for c in text if c in emoji.UNICODE_EMOJI)
df['emojis'] = df['Sentence'].apply(extract_emojis)
df
However, when I try them, my output looks like this, with the Emojis column left empty and the emojis still in the sentences:

Sentence                 Emojis
You look 😌 good
Love you ❤️
I am so happy today 🤩
Hence, I want my output to look like this:

Sentence                 Emojis
You look good            😌
Love you                 ❤️
I am so happy today      🤩
As well as that, I have also tried this method, which is exactly what I want to do:
import pandas as pd
import emoji as emj
def extract_emoji(df):
    df["emoji"] = ""
    for index, row in df.iterrows():
        for emoji in EMOJIS:  # EMOJIS: a collection of emoji characters defined elsewhere
            if emoji in row["Sentence"]:
                row["Sentence"] = row["Sentence"].replace(emoji, "")
                row["emoji"] += emoji
extract_emoji(df)
print(df.to_string())
Though, with the method above, the code does not seem to finish executing; I think it cannot handle so many rows, since my dataset has over 70,000 sentences that need the emojis extracting.
As you can see, I am nearly there, but not fully.
These three methods have not fully worked for me, and I require some additional help.
In summary, I just want to extract the emojis from each sentence and add them into a new column - if this is possible.
Thank you very much.
Try:
import re
import emoji
pattern = re.compile(r"|".join(map(re.escape, emoji.UNICODE_EMOJI["en"])))
df["Emojis"] = df["Sentence"].apply(lambda x: "".join(pattern.findall(x)))
df["Sentence"] = df["Sentence"].apply(lambda x: pattern.sub("", x))
print(df)
Prints:
Sentence Emojis
0 You look good 😌
1 Love you ❤️
2 I am so happy today 🤩
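If you are on emoji 2.0 or newer, UNICODE_EMOJI has been removed; to the best of my knowledge its replacement is the emoji.EMOJI_DATA dict, so a sketch under that assumption:

import re
import emoji  # assumes emoji >= 2.0, where EMOJI_DATA replaced UNICODE_EMOJI

# sort longest-first so multi-codepoint emojis match before their single-codepoint prefixes
emojis = sorted(emoji.EMOJI_DATA, key=len, reverse=True)
pattern = re.compile("|".join(map(re.escape, emojis)))

df["Emojis"] = df["Sentence"].str.findall(pattern).str.join("")
df["Sentence"] = df["Sentence"].str.replace(pattern, "", regex=True)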
I have a dataframe called tweetscrypto and I am trying to remove all the words from the column "text" starting with the character "#" and gather the result in a new column "clean_text". The rest of the words should stay exactly the same:
tweetscrypto['clean_text'] = tweetscrypto['text'].apply(filter(lambda x:x[0]!='#', x.split()))
It does not seem to work. Can somebody help?
Thanks in advance
Use str.replace to remove the strings starting with #.
Sample Data
text
0 News via #livemint: #RBI bars banks from links
1 Newsfeed from #oayments_source: How Africa
2 is that bitcoin? not my thing
tweetscrypto['clean_text'] = tweetscrypto['text'].str.replace(r'(\#\w+.*?)', "")
Still, the # can be matched without escaping, as noted by baxx:
tweetscrypto['clean_text'] = tweetscrypto['text'].str.replace(r'(#\w+.*?)', "")
clean_text
0 News via : bars banks from links
1 Newsfeed from : How Africa
2 is that bitcoin? not my thing
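On pandas 2.0 and later the pattern is treated literally unless you pass regex=True, so there the call would be:

tweetscrypto['clean_text'] = tweetscrypto['text'].str.replace(r'#\w+', '', regex=True)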
In this case it might be better to define a method rather than use a lambda, mainly for readability.
def clean_text(X):
    X = X.split()
    X_new = [x for x in X if not x.startswith("#")]
    return ' '.join(X_new)
tweetscrypto['clean_text'] = tweetscrypto['text'].apply(clean_text)
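Applied to the sample data above, the hashtag tokens disappear while everything else is untouched:

print(tweetscrypto['clean_text'].tolist())
# ['News via bars banks from links', 'Newsfeed from How Africa', 'is that bitcoin? not my thing']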
So I have a review dataset having reviews like
Simply the best. I bought this last year. Still using. No problems
faced till date.Amazing battery life. Works fine in darkness or broad
daylight. Best gift for any book lover.
(This is from the original dataset; in my processed dataset I have removed all punctuation and lower-cased everything.)
What I want to do is replace some words by 1(as per my dictionary) and others by 0.
My dictionary is
dict = {"amazing":"1","super":"1","good":"1","useful":"1","nice":"1","awesome":"1","quality":"1","resolution":"1","perfect":"1","revolutionary":"1","and":"1","good":"1","purchase":"1","product":"1","impression":"1","watch":"1","quality":"1","weight":"1","stopped":"1","i":"1","easy":"1","read":"1","best":"1","better":"1","bad":"1"}
I want my output like:
0010000000000001000000000100000
I have used this code:
df['newreviews'] = df['reviews'].map(dict).fillna("0")
This always returns 0s as output. Suspecting the numeric types were the problem, I made the 1s and 0s strings, but despite that I'm getting the same result.
Any suggestions how to solve this?
First, don't use dict as a variable name, because it shadows the builtin. Then use a list comprehension with get to map words that are not matched to 0.
Notice: if the data contains runs like date.Amazing, with no space after the punctuation, it is necessary to replace the punctuation with whitespace rather than simply remove it.
df = pd.DataFrame({'reviews':['Simply the best. I bought this last year. Still using. No problems faced till date.Amazing battery life. Works fine in darkness or broad daylight. Best gift for any book lover.']})
d = {"amazing":"1","super":"1","good":"1","useful":"1","nice":"1","awesome":"1","quality":"1","resolution":"1","perfect":"1","revolutionary":"1","and":"1","good":"1","purchase":"1","product":"1","impression":"1","watch":"1","quality":"1","weight":"1","stopped":"1","i":"1","easy":"1","read":"1","best":"1","better":"1","bad":"1"}
df['reviews'] = df['reviews'].str.replace(r'[^\w\s]+', ' ', regex=True).str.lower()
df['newreviews'] = [''.join(d.get(y, '0') for y in x.split()) for x in df['reviews']]
Alternative:
df['newreviews'] = df['reviews'].apply(lambda x: ''.join(d.get(y, '0') for y in x.split()))
print (df)
reviews \
0 simply the best i bought this last year stil...
newreviews
0 0011000000000001000000000100000
You can do:

# assumes sent holds the raw review string and mydict the dictionary from the question
import re

# clean the sentence (strip the full stops)
sent = re.sub(r'\.', '', sent)
# convert to a list of lowercase words
sent = sent.lower().split()
# look each word up in the dict using a comprehension
new_sent = ''.join(['1' if x in mydict else '0' for x in sent])
print(new_sent)
'001100000000000000000000100000'
You can do it by
df.replace(repl, regex=True, inplace=True)
where df is your dataframe and repl is your dictionary.
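One caveat, shown with a hypothetical two-entry dictionary: with regex=True the keys are treated as patterns and substituted in place, so words missing from the dictionary remain as words rather than becoming 0:

import pandas as pd

df = pd.DataFrame({'reviews': ['amazing product bad weight']})
repl = {r'\bamazing\b': '1', r'\bbad\b': '1'}  # made-up dictionary for illustration

df.replace(repl, regex=True, inplace=True)
print(df['reviews'][0])
# 1 product 1 weight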
I have email messages in a pandas data frame. Before applying sent_tokenize, I could remove the punctuation like this:
def removePunctuation(fullCorpus):
    punctuationRemoved = fullCorpus['text'].str.replace(r'[^\w\s]+', '')
    return punctuationRemoved
After applying sent_tokenize the data frame looks like below. How can I remove the punctuation while keeping the sentences as tokenized in the lists?
def tokenizeSentences(fullCorpus):
    sent_tokenized = fullCorpus['body_text'].apply(sent_tokenize)
    return sent_tokenized
Sample of data frame after tokenizing into sentences
[Nah I don't think he goes to usf, he lives around here though]
[Even my brother is not like to speak with me., They treat me like aids patent.]
[I HAVE A DATE ON SUNDAY WITH WILL!, !]
[As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers., Press *9 to copy your friends Callertune]
[WINNER!!, As a valued network customer you have been selected to receivea £900 prize reward!, To claim call 09061701461., Claim code KL341., Valid 12 hours only.]
You can try the following function: apply iterates over each row, and for every sentence a comprehension walks its characters, keeping only those not in string.punctuation before a final .join. You also need map, since the cleaning function has to be applied to every sentence in a row's list:
def tokenizeSentences(fullCorpus):
    sent_tokenized = fullCorpus['body_text'].apply(sent_tokenize)
    # strip punctuation characters from a single sentence
    f = lambda sent: ''.join(ch for ch in sent if ch not in string.punctuation)
    # apply f to every sentence in each row's list
    sent_tokenized = sent_tokenized.apply(lambda row: list(map(f, row)))
    return sent_tokenized
Note you will need import string for string.punctuation.
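For a quick check, calling it on a tiny frame (assuming NLTK and its punkt tokenizer data are installed) gives sentence lists with the punctuation stripped:

import string
import pandas as pd
from nltk.tokenize import sent_tokenize  # needs nltk.download('punkt') once

fullCorpus = pd.DataFrame({'body_text': [
    "Even my brother is not like to speak with me. They treat me like aids patent."
]})
print(tokenizeSentences(fullCorpus)[0])
# ['Even my brother is not like to speak with me', 'They treat me like aids patent']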