I am trying to extract emojis from sentences and add them to a new column, but when I do, the new column contains nothing and the emojis are still in the sentence.
For reference, my dataset looks like this, but contains over 70,000 sentences similar to these:
Sentence
You look 😌 good
Love you ❤️
I am so happy today 🤩
So far, I have tried this method:
import pandas as pd
import emoji
df['emojis'] = df['Sentence'].apply(lambda row: ''.join(c for c in row if c in emoji.UNICODE_EMOJI))
df
And this method:
def extract_emojis(text):
    return ''.join(c for c in text if c in emoji.UNICODE_EMOJI)
df['emojis'] = df['Sentence'].apply(extract_emojis)
df
However, when I try them, my output looks like this (the Emojis column is empty):
                 Sentence Emojis
0        You look 😌 good
1             Love you ❤️
2  I am so happy today 🤩
Instead, I want my output to look like this:
              Sentence Emojis
0        You look good     😌
1             Love you     ❤️
2  I am so happy today     🤩
I have also tried the method below, which does exactly what I want:
import pandas as pd
import emoji as emj

def extract_emoji(df):
    df["emoji"] = ""
    for index, row in df.iterrows():
        for emoji in EMOJIS:
            if emoji in row["Sentence"]:
                row["Sentence"] = row["Sentence"].replace(emoji, "")
                row["emoji"] += emoji

extract_emoji(df)
print(df.to_string())
However, with the method above the code never seems to finish executing; I think it cannot handle so many rows, since I have over 70,000 sentences that need the emojis extracted.
As you can see, I am nearly there, but not quite: these three methods have not fully worked for me, and I need some additional help.
In summary, I just want to extract the emojis from each sentence and add them into a new column - if this is possible.
Thank you very much.
Try:
import re
import emoji
pattern = re.compile(r"|".join(map(re.escape, emoji.UNICODE_EMOJI["en"])))
df["Emojis"] = df["Sentence"].apply(lambda x: "".join(pattern.findall(x)))
df["Sentence"] = df["Sentence"].apply(lambda x: pattern.sub("", x))
print(df)
Prints:
              Sentence Emojis
0        You look good     😌
1             Love you     ❤️
2  I am so happy today     🤩
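Note: this uses the emoji 1.x API. In emoji 2.x, UNICODE_EMOJI was removed; a minimal sketch of the same idea against the newer EMOJI_DATA dict (assuming emoji >= 2.0) would be:
import re
import emoji

# emoji.EMOJI_DATA maps each emoji (including multi-codepoint sequences) to metadata;
# sorting keys longest-first makes sequences like ❤️ match before the bare ❤
pattern = re.compile("|".join(map(re.escape, sorted(emoji.EMOJI_DATA, key=len, reverse=True))))

df["Emojis"] = df["Sentence"].str.findall(pattern).str.join("")
df["Sentence"] = df["Sentence"].str.replace(pattern, "", regex=True).str.strip()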
In the following string
SHANTELL'S CHANNEL - https://www.youtube.com/shantellmartin\nCANDICE - https://www.lovebilly.com\n\nfilmed this video in 4k on this -- http://amzn.to/2sTDnRZ\nwith this lens -- http://amzn.to/2rUJOmD\nbig drone - http://amzn.to/2o3GLX5\nSony CAMERA http://amzn.to/2nOBmnv\nOLD CAMERA; http://amzn.to/2o2cQBT\nMAIN LENS; http://amzn.to/2od5gBJ\nBIG SONY CAMERA; http://amzn.to/2nrdJRO\nBIG Canon CAMERA; on http://instagram.com/caseyneistat\non https://www.facebook.com/cneistat\non https://twitter.com/CaseyNeistat\n\namazing intro song by https://soundcloud.com/discoteeth\n\nad disclosure. THIS IS NOT AN AD. not selling or promoting anything. but samsung did produce the Shantell Video as a 'GALAXY PROJECT' which is an initiative that enables creators like Shantell and me to make projects we might otherwise not have the opportunity to make. hope that's clear. if not ask in the comments and i'll answer any specifics.
I am trying to remove any \n. This string is accessed from a pandas df. The solution I have tried is:
i = str(i).replace("\n", "")
The original code looks like:
for i in data["description"]:
print(i)
i = str(i).replace("\n", "")
i = str(i).split(" ")
for x in i:
x = x.replace("\n", "")
print(x)
where data is the df that stores all of the data from the csv file, and description is the column the string is taken from.
I suspect that replace() fails because the string comes from a df, since when I try it with just a regular string
x = "a \n\n string"
.replace() works just fine. Any reason why taking strings from a df causes replace to fail? Thanks.
Pandas DataFrames keep their string methods a bit hidden behind the .str attribute. Something like df["column_name"].str.replace("\n", "") should work, and I'd recommend the pandas documentation below to learn more.
https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html#string-methods
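For example, applied to the column from your loop (a minimal sketch, assuming data is the DataFrame from the question; note the result must be assigned back):
# Vectorized replacement over the whole column, no per-row loop needed
data["description"] = data["description"].str.replace("\n", "", regex=False)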
This should work:
df["description"] = df["description"].str.replace("\n", "")
Or you could use either of the following if you want to do this for the entire df (regex=True is needed so that "\n" is replaced inside strings, rather than only matching cells that are exactly "\n"):
df = df.replace("\n", "", regex=True)
df.replace("\n", "", regex=True, inplace=True)
I'm trying to find and extract the date and time in a column that contains text sentences. The example data is as below.
df = {'Id': ['001', '002', ...],
      'Description': ['THERE IS AN INTERUPTION/FAILURE # 9.6AM ON 27.1.2020 FOR JB BRANCH. THE INTERUPTION ALSO INVOLVED A, B, C AND SOME OTHER TOWN AREAS. OTC AND SST SERVICES INTERRUPTED AS GENSET ALSO WORKING AT THAT TIME. WE CALL FOR SERVICE. THE TECHNICHIAN COME AT 10.30AM. THEN IT BECOME OK AROUND 10.45AM',
                      'today is 23/3/2013 #10:AM we have', ...],
      ...
     }
df = pd.DataFrame(df, columns=['Id', 'Description'])
I have tried the datefinder library below, but it gives today's date, which is wrong.
import datefinder as dtf

findDate = dtf.find_dates(le['Description'][0])
for dates in findDate:
    print(dates)
Does anyone know the best way to extract these and automatically put them into a new column? Or does anyone know of a library that can calculate the duration between the times and dates in a text string? Thank you.
So you have two issues here:

1. You want to know how to apply a function to a DataFrame column.
2. You want a function to extract a pattern from a bunch of text.
Here is how to apply a function to a Series (if you select only one column, as I did, you get a Series). Bonus points: read the DataFrame.apply() and Series.apply() documentation (30s) to become a Pandas-chad!
def do_something(x):
    # some code that transforms the value x
    return x

df['new_text_column'] = df['original_text_column'].apply(do_something)
And here is one way to extract patterns from a string using regexes. Read the regex doc (or follow a course) and play around with RegExr to become an omniscient god (that is, if you use a command line on Linux, along with your regex knowledge).
Modified from: How to extract the substring between two markers?
import re

text = 'gfgfdAAA1234ZZZuijjk'

# Searching for numbers.
m = re.search(r'\d+', text)
if m:
    found = m.group(0)
    # found: 1234
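Putting the two together for your data, a rough sketch (the patterns below are assumptions tuned to the two example descriptions, not a general-purpose date parser):
import re

# Dates like 27.1.2020 or 23/3/2013
date_re = re.compile(r'\b\d{1,2}[./]\d{1,2}[./]\d{2,4}\b')
# Times like 9.6AM, 10.30AM or 10:AM
time_re = re.compile(r'\b\d{1,2}[.:]?\d{0,2}\s?[AP]M\b', re.IGNORECASE)

df['Dates'] = df['Description'].str.findall(date_re).str.join(', ')
df['Times'] = df['Description'].str.findall(time_re).str.join(', ')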
I previously asked this but I got the data type wrong!
I have my Pandas Dataframe, which looks like this
print(data)
text
0 FollowFriday for being top engaged members...
1 Hey James! How odd :/ Please call our Contact...
2 we had a listen last night :) As You Bleed is...
In this dataframe there are links, which all start with "http". I already have a function, below, which removes words starting with '#' among other cleaning steps.
def cleanData(data):
    # Loop through the data, creating a new dataframe with only ascii characters
    data['text'] = data['text'].apply(lambda s: "".join(char for char in s if char.isascii()))
    # Remove any tokens with numbers, or digits
    data['text'] = data['text'].apply(lambda s: "".join(char for char in s if not char.isdigit()))
    # Removes any words which start with #, which are replies
    data['text'] = data['text'].str.replace('(#\w+.*?)', "")
    # Remove any left over characters
    data = data['text'].str.replace('[^\w\s]', '')
    # return the cleaned data
    return data
Can anyone help to remove words which start with 'http' please? I have already tried to edit what I have but no luck so far.
Thanks in advance!
Use Series.str.replace:
data['text'] = data['text'].str.replace('http[^\s]*',"")
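Note that in pandas 2.0+ the default for regex in Series.str.replace changed to False, so it is safer to pass it explicitly (and \S is equivalent to [^\s]):
data['text'] = data['text'].str.replace(r'http\S*', '', regex=True)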
One option is to use the str.replace() method:
df = pd.DataFrame(dict(text=[
    r'FollowFridayhttphttp http http for being top engaged members.',
    r'James!http How odd http:/ Please call ou',
    r'httpe had a listen last night :) As You Bleed is...'
]))
df['text'] = df['text'].apply(lambda x: x.replace('http',''))
And you could do something like this in your function.
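Keep in mind that replacing the literal substring 'http' leaves the rest of each URL behind. If the goal is to drop whole words that start with http, a token-level sketch:
# Split on whitespace, drop tokens beginning with 'http', then rejoin
df['text'] = df['text'].apply(lambda x: ' '.join(w for w in x.split() if not w.startswith('http')))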
I am trying to remove some whole words (case-insensitively) from a pyspark dataframe column.
import re

s = "I like the book. i'v seen it. Iv've"  # add a new phrase
exclude_words = ["I", "I'v", "I've"]
exclude_words_re = re.compile(r"\b(" + r"|".join(exclude_words) + r")\b|\s", re.I | re.M)
exclude_words_re.sub("", s)
I added "Iv've", but got:
'like the book. seen it.'
"Iv've" should not have been removed, because it does not match any whole word in exclude_words.
Two changes to implement in your code:

1. Use proper regex flags to ignore case.
2. Add \b (and start/end anchors) so that only whole words are matched.
import re

s = "I like the book. i'v seen it. Iv've I've"
exclude_words = ["I", "I'v", "I've"]
exclude_words_re = re.compile(r"(^|\b)((" + r"|".join(exclude_words) + r"))(\s|$)", re.I | re.M)
exclude_words_re.sub("", s)
"like the book. seen it. Iv've "
I would like to remove stopwords from a column of a data frame.
Inside the column there is text which needs to be split.
For example my data frame looks like this:
ID Text
1 eat launch with me
2 go outside have fun
I want to apply stopword removal to the text column, so the text needs to be split.
I tried this:
for item in cached_stop_words:
if item in df_from_each_file[['text']]:
print(item)
df_from_each_file['text'] = df_from_each_file['text'].replace(item, '')
So my output should be like this:
ID Text
1 eat launch
2 go fun
That is, the stopwords have been deleted.
However, it does not work correctly. I also tried it the other way around, turning my data frame into a series and then looping through that, but that also did not work.
Thanks for your help.
replace (by itself) isn't a good fit here, because you want to perform partial string replacement; what you want is regex-based replacement.
One simple solution, when you have a manageable number of stop words, is using str.replace.
import re

p = re.compile("({})".format('|'.join(map(re.escape, cached_stop_words))))
df['Text'] = df['Text'].str.lower().str.replace(p, '', regex=True)
df
ID Text
0 1 eat launch
1 2 outside have fun
If performance is important, use a list comprehension.
cached_stop_words = set(cached_stop_words)
df['Text'] = [' '.join([w for w in x.lower().split() if w not in cached_stop_words])
              for x in df['Text'].tolist()]
df
ID Text
0 1 eat launch
1 2 outside have fun
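If the stop words should only be removed as whole words (so that, say, 'go' is not stripped out of 'gone'), a word-boundary variant of the same regex approach (a sketch):
import re

p = re.compile(r"\b({})\b".format('|'.join(map(re.escape, cached_stop_words))))
df['Text'] = (df['Text'].str.lower()
                        .str.replace(p, '', regex=True)
                        .str.split().str.join(' '))  # collapse leftover whitespace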