Removing # mentions from pandas DataFrame column - python

I am working on a thesis project on smartworking. I downloaded some tweets using Python and wanted to get rid of users / mentions before building word clouds. However, I can't delete the usernames: the commands below remove only the "#".
df['token']=df['token'].apply(lambda x:re.sub(r"#mention","", x))
df['token']=df['token'].apply(lambda x:re.sub(r"#[A-Za-z0-9]+","", x))

Your second command should work; for efficiency, however, use str.replace (with a raw string, so that \s is not treated as a string escape):
df['token2'] = df['token'].str.replace(r'#[A-Za-z0-9]+\s?', '', regex=True)
# or for [a-zA-Z0-9_] use \w
# df['token2'] = df['token'].str.replace(r'#\w+\s?', '', regex=True)
Example:
                  token          token2
0  this is a #test case  this is a case
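To see how the two character classes differ, here is a minimal sketch using the standard re module on a made-up tweet (the sample text and variable names are assumptions, not taken from the question). The same patterns can be passed to Series.str.replace with regex=True.

```python
import re

# Hypothetical tweet text; the username contains an underscore.
tweet = "thanks #remote_team for the tips"

# [A-Za-z0-9]+ stops at the underscore, so part of the username survives.
alnum_only = re.sub(r"#[A-Za-z0-9]+\s?", "", tweet)

# \w also matches "_", so the whole mention is removed.
word_chars = re.sub(r"#\w+\s?", "", tweet)

print(alnum_only)
print(word_chars)
```

If usernames in the data can contain underscores, the \w variant is probably the safer choice.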

Related

.replace('\n','') not working to remove \n from string that is taken from pandas df

In the following string
SHANTELL'S CHANNEL - https://www.youtube.com/shantellmartin\nCANDICE - https://www.lovebilly.com\n\nfilmed this video in 4k on this -- http://amzn.to/2sTDnRZ\nwith this lens -- http://amzn.to/2rUJOmD\nbig drone - http://amzn.to/2o3GLX5\nSony CAMERA http://amzn.to/2nOBmnv\nOLD CAMERA; http://amzn.to/2o2cQBT\nMAIN LENS; http://amzn.to/2od5gBJ\nBIG SONY CAMERA; http://amzn.to/2nrdJRO\nBIG Canon CAMERA; on http://instagram.com/caseyneistat\non https://www.facebook.com/cneistat\non https://twitter.com/CaseyNeistat\n\namazing intro song by https://soundcloud.com/discoteeth\n\nad disclosure. THIS IS NOT AN AD. not selling or promoting anything. but samsung did produce the Shantell Video as a 'GALAXY PROJECT' which is an initiative that enables creators like Shantell and me to make projects we might otherwise not have the opportunity to make. hope that's clear. if not ask in the comments and i'll answer any specifics.
I am trying to remove any \n. This string is accessed from a pandas df. The solution I have tried is:
i = str(i).replace("\n", "")
The original code looks like:
for i in data["description"]:
    print(i)
    i = str(i).replace("\n", "")
    i = str(i).split(" ")
    for x in i:
        x = x.replace("\n", "")
        print(x)
where data is the df that stores all of the data from the csv file, and description is the column where the string is taken out of.
I suspect that the failure of replace() to work is due to the string being from a df, as when I try it with just a regular string
x = "a \n\n string"
.replace() works just fine. Any reason why taking strings from a df causes replace to fail? Thanks.
Pandas Dataframes keep their string methods a bit hidden behind the .str attribute. Something like df["column_name"].str.replace("\n", "") should work, and I'd recommend the pandas documentation below to learn more.
https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html#string-methods
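This should work (str.replace returns a new Series, so assign the result back):
df["description"] = df["description"].str.replace("\n", "")
Or, to do this for the entire df, you could use either of the following. Note that DataFrame.replace only matches whole cell values unless regex=True is passed:
df = df.replace("\n", "", regex=True)
df.replace("\n", "", regex=True, inplace=True)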
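One possible explanation for replace() seeming to fail, offered as an assumption since the CSV itself is not shown: the description may contain a literal backslash followed by "n" rather than a real newline character, which is how the text appears when printed in the question. A minimal sketch with a shortened, hypothetical copy of that text:

```python
# Hypothetical sample: the text holds a backslash followed by "n",
# not a real newline character.
raw = "CANDICE - https://www.lovebilly.com\\n\\nfilmed this video"

# Replacing the newline character finds nothing to replace.
unchanged = raw.replace("\n", "")

# Replacing the two-character sequence backslash + n does the job.
cleaned = raw.replace("\\n", " ")

print(unchanged == raw)
print(cleaned)
```

If that is the case, the column-wide equivalent would be df["description"].str.replace("\\n", " ", regex=False).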

Find date and time in a text column in a dataframe Python

I'm trying to find and extract the date and time in a column that contains text sentences. The example data is as below.
df = {'Id': ['001', '002', ...],
      'Description': ['THERE IS AN INTERUPTION/FAILURE # 9.6AM ON 27.1.2020 FOR JB BRANCH. THE INTERUPTION ALSO INVOLVED A, B, C AND SOME OTHER TOWN AREAS. OTC AND SST SERVICES INTERRUPTED AS GENSET ALSO WORKING AT THAT TIME. WE CALL FOR SERVICE. THE TECHNICHIAN COME AT 10.30AM. THEN IT BECOME OK AROUND 10.45AM',
                      'today is 23/3/2013 #10:AM we have', ...],
      ....
      }
df = pd.DataFrame(df, columns=['Id', 'Description'])
I have tried the datefinder library as below, but it returns today's date, which is wrong.
findDate = dtf.find_dates(le['Description'][0])
for dates in findDate:
    print(dates)
Does anyone know the best way to extract it and automatically put it into a new column? Or does anyone know of a library that can calculate the duration between times and dates in a text string? Thank you.
So you have two issues here: first, you want to know how to apply a function to a DataFrame; second, you want a function that extracts a pattern from a bunch of text.
Here is how to apply a function to a Series (if you select only one column, as I did, you get a Series). Bonus points: read the DataFrame.apply() and Series.apply() documentation (30s) to become a Pandas-chad!
def do_something(x):
    # placeholder: transform x here and return the result
    return x
df['new_text_column'] = df['original_text_column'].apply(do_something)
And here is one way to extract patterns from a string using regexes. Read the regex doc (or follow a course) and play around with RegExr to become an omniscient god (that is, if you use a command-line on Linux, along with your regex knowledge).
Modified from: How to extract the substring between two markers?
import re
text = 'gfgfdAAA1234ZZZuijjk'
# Searching numbers.
m = re.search(r'\d+', text)
if m:
    found = m.group(0)
    # found: 1234
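Applied to the question's data, the same idea could look like the sketch below. The two patterns are assumptions about the formats seen in the sample text (dates such as 27.1.2020 or 23/3/2013, times such as 10.30AM), and the function and variable names are made up for illustration.

```python
import re
from datetime import datetime

# Shortened copy of the sample description from the question.
description = (
    "THERE IS AN INTERUPTION/FAILURE # 9.6AM ON 27.1.2020 FOR JB BRANCH. "
    "THE TECHNICHIAN COME AT 10.30AM. THEN IT BECOME OK AROUND 10.45AM"
)

# Assumed formats: day.month.year or day/month/year with a 4-digit year.
DATE_PATTERN = re.compile(r"\b\d{1,2}[./]\d{1,2}[./]\d{4}\b")
# Assumed formats: hour.minute or hour:minute followed by AM/PM.
TIME_PATTERN = re.compile(r"\b\d{1,2}[.:]\d{1,2}\s?(?:AM|PM)\b")


def find_dates_and_times(text):
    """Return (dates, times) as lists of the matched substrings."""
    return DATE_PATTERN.findall(text), TIME_PATTERN.findall(text)


dates, times = find_dates_and_times(description)

# Duration between the last two times, parsed as 12-hour clock values.
start = datetime.strptime(times[-2], "%I.%M%p")
end = datetime.strptime(times[-1], "%I.%M%p")
minutes_elapsed = (end - start).total_seconds() / 60

print(dates, times, minutes_elapsed)
```

To fill a new column, the function can be passed to df['Description'].apply(...). Malformed values such as "#10:AM" in the second sample row have no minute digits and will not match the time pattern, so real data may need looser patterns.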

Pandas out of memory error when applying regex function

I want to apply a regex function to clean text in a dataframe column.
ie:
import html
import re
re1 = re.compile(r' +')
def fixup(x):
    x = x.replace('#39;', "'").replace('amp;', '&').replace('#146;', "'").replace(
        'nbsp;', ' ').replace('#36;', '$').replace('\\n', "\n").replace('quot;', "'").replace(
        '<br />', "\n").replace('\\"', '"').replace('<unk>','u_n').replace(' #.# ','.').replace(
        ' #-# ','-').replace('\\', ' \\ ')
    return re1.sub(' ', html.unescape(x))
df['text'] = df['text'].apply(fixup).values.astype(str)
However, when I run this I get a 'MemoryError' (in a Jupyter notebook).
I have 128GB of RAM, and the file used to create the dataframe was 4GB.
I can also see from the profiler that memory use is <20% when this error is thrown.
The error message gives no more detail than 'MemoryError:' at the line where I apply the fixup function.
Any ideas to help debug?
Break the replace chain into individual replace operations. Not only will that make your code more readable and maintainable, but the intermediate results will be discarded immediately after use, instead of being kept until all modifications are done:
replacements = ('#39;', "'"), ('amp;', '&'), ('#146;', "'"), ...
for replacement in replacements:
    x = x.replace(*replacement)
P.S. Shouldn't 'amp;' be '&amp;'?
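Spelled out in full, the loop-based version might look like the sketch below. The pairs are copied from the question's fixup() in their original order; the constant names are made up here, and whether this restructuring actually avoids the MemoryError is not confirmed.

```python
import html
import re

MULTI_SPACE = re.compile(r" +")

# (old, new) pairs copied from the question's fixup(); order is preserved
# because later replacements act on the output of earlier ones.
REPLACEMENTS = (
    ("#39;", "'"),
    ("amp;", "&"),
    ("#146;", "'"),
    ("nbsp;", " "),
    ("#36;", "$"),
    ("\\n", "\n"),
    ("quot;", "'"),
    ("<br />", "\n"),
    ('\\"', '"'),
    ("<unk>", "u_n"),
    (" #.# ", "."),
    (" #-# ", "-"),
    ("\\", " \\ "),
)


def fixup(text):
    """Apply each replacement in turn, then unescape HTML and collapse spaces."""
    for old, new in REPLACEMENTS:
        text = text.replace(old, new)
    return MULTI_SPACE.sub(" ", html.unescape(text))


print(fixup("it#39;s   here<br />now"))
```

It would be applied the same way as before, for example df['text'] = df['text'].apply(fixup).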

How to delete substrings with specific characters in a pandas dataframe?

I have a pandas dataframe that looks like this:
COL
hi A/P_90890 how A/P_True A/P_/93290 are AP_wueiwo A/P_|iwoeu you A/P_?9028k ?
...
Im fine, what A/P_49 A/P_0.0309 about you?
The expected result should be:
COL
hi how are you?
...
Im fine, what about you?
How can I efficiently remove, from a column and across the full pandas dataframe, all the strings that contain A/P_?
I tried with this regular expression:
A/P_(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+
However, I do not know if there's a simpler or more robust way of removing all those substrings from my dataframe. How can I remove all the strings that have A/P_ at the beginning?
UPDATE
I tried:
df_sess['COL'] = df_sess['COL'].str.replace(r'A/P(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', '')
And it works; however, I would like to know if there's a more robust way of doing this, possibly with a regular expression.
One way is to use \S* to match all non-whitespace characters after A/P_, and to add \s to remove the whitespace that follows the removed string (pass regex=True, since pandas 2.0 and later treat the pattern as a literal string by default):
df_sess['col'] = df_sess['col'].str.replace(r'A/P_\S*\s', '', regex=True)
In your input there seems to be a typo (AP_wueiwo instead of A/P_wueiwo, or at least I think so), so with this input:
df_sess = pd.DataFrame({'col':['hi A/P_90890 how A/P_True A/P_/93290 are A/P_wueiwo A/P_|iwoeu you A/P_?9028k ?',
                               'Im fine, what A/P_49 A/P_0.0309 about you?']})
print(df_sess['col'].str.replace(r'A/P_\S*\s', '', regex=True))
0            hi how are you ?
1    Im fine, what about you?
Name: col, dtype: object
you get the expected output.
How about:
# A/?P_ matches both A/P_ and AP_ ([/P|P] would be a character class, not an alternation)
(df['COL'].replace(r'A/?P_[^ ]+', '', regex=True)
          .replace(r'\s+', ' ', regex=True))
Full example:
import pandas as pd
df = pd.DataFrame({
    'COL':
    ["hi A/P_90890 how A/P_True A/P_/93290 AP_wueiwo A/P_|iwoeu you A/P_?9028k ?",
     "Im fine, what A/P_49 A/P_0.0309 about you?"]
})
df['COL'] = (df['COL'].replace(r'A/?P_[^ ]+', '', regex=True)
                      .replace(r'\s+', ' ', regex=True))
Returns (oh, there is an extra space before ?):
                        COL
0              hi how you ?
1  Im fine, what about you?
Because of a bug in the replace() function in pandas 0.23.0 (https://github.com/pandas-dev/pandas/issues/21159), an error occurs when trying to replace by a regex pattern:
df.COL.str.replace(regex_pat, '', regex=True)
...
--->
TypeError: Type aliases cannot be used with isinstance().
I would suggest using the pandas.Series.apply function with a precompiled regex pattern:
In [1170]: df4 = pd.DataFrame({'COL': ['hi A/P_90890 how A/P_True A/P_/93290 are AP_wueiwo A/P_|iwoeu you A/P_?9028k ?', 'Im fine, what A/P_49 A/P_0.0309 about you?']})
In [1171]: pat = re.compile(r'\s*A/?P_[^\s]*')
In [1172]: df4['COL']= df4.COL.apply(lambda x: pat.sub('', x))
In [1173]: df4
Out[1173]:
                        COL
0          hi how are you ?
1  Im fine, what about you?
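The leftover space before "?" in the outputs above can be tidied with a second substitution. This is a minimal sketch using only the re module; the function name and the punctuation set are assumptions, and the \b keeps the pattern from matching in the middle of a longer word.

```python
import re

# A/P_ or AP_ at the start of a token, plus any whitespace before it.
TOKEN = re.compile(r"\s*\bA/?P_\S*")
# Whitespace directly before common punctuation marks (assumed set).
SPACE_BEFORE_PUNCT = re.compile(r"\s+([?!.,])")


def strip_ap_tokens(text):
    """Remove A/P_... tokens, then close the gap left before punctuation."""
    without_tokens = TOKEN.sub("", text)
    return SPACE_BEFORE_PUNCT.sub(r"\1", without_tokens).strip()


rows = [
    "hi A/P_90890 how A/P_True A/P_/93290 are AP_wueiwo A/P_|iwoeu you A/P_?9028k ?",
    "Im fine, what A/P_49 A/P_0.0309 about you?",
]
cleaned = [strip_ap_tokens(row) for row in rows]
print(cleaned)
```

On a DataFrame this would be applied per cell, for example df['COL'].apply(strip_ap_tokens).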

Multiple distinct replaces using RegEx

I am trying to write some Python code that will replace some unwanted string using RegEx. The code I have written has been taken from another question on this site.
I have a text:
text_1=u'I\u2019m \u2018winning\u2019, I\u2019ve enjoyed none of it. That\u2019s why I\u2019m withdrawing from the market,\u201d wrote Arment.'
I want to remove all the \u2019m, \u2019s, \u2019ve, and so on.
The code that I've written is given below:
rep={"\n":" ","\n\n":" ","\n\n\n":" ","\n\n\n\n":" ",u"\u201c":"", u"\u201d":"", u"\u2019[a-z]":"", u"\u2013":"", u"\u2018":""}
rep = dict((re.escape(k), v) for k, v in rep.iteritems())
pattern = re.compile("|".join(rep.keys()))
text = pattern.sub(lambda m: rep[re.escape(m.group(0))], text_1)
The code works perfectly for:
u"\u201c":"", u"\u201d":"", u"\u2013":"" and u"\u2018":""
However, it doesn't work that well for:
u"\u2019[a-z]": re.escape turns the [a-z] in rep into \\[a\\-z\\], which doesn't match.
The output I am looking for is:
text_1=u'I winning, I enjoyed none of it. That why I withdrawing from the market,wrote Arment.'
How do I achieve this?
The information about the newlines completely changes the answer. For this, I think building the expression using a loop is actually less legible than just using better formatting in the pattern itself.
replacements = {'newlines': ' ',
                'deletions': ''}
# [a-z]* (rather than [a-z]?) so that multi-letter suffixes such as 've are removed in full
pattern = re.compile(u'(?P<newlines>\n+)|'
                     u'(?P<deletions>\u201c|\u201d|\u2019[a-z]*|\u2013|\u2018)')
def lookup(match):
    return replacements[match.lastgroup]
text = pattern.sub(lookup, text_1)
The problem here is actually the escaping; this code does what you want more directly:
remove = (u"\u201c", u"\u201d", u"\u2019[a-z]*", u"\u2013", u"\u2018")
pattern = re.compile("|".join(remove))
text = pattern.sub("", text_1)
I've added * after [a-z] in the u2019 match, so that the apostrophe is removed with or without trailing letters ('m, 's, 've), as I suppose that's what you want given your test string.
For completeness, I think I should also link to the Unidecode package which may actually be more closely what you're trying to achieve by removing these characters.
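For reference, here is a Python 3 sketch of the named-group approach run against the question's text. It uses [a-z]* after \u2019 so that multi-letter suffixes such as 've are removed in full; the uppercase constant names and the variable cleaned are choices made for this example.

```python
import re

text_1 = (
    "I\u2019m \u2018winning\u2019, I\u2019ve enjoyed none of it. "
    "That\u2019s why I\u2019m withdrawing from the market,\u201d wrote Arment."
)

# Replacement text per named group: newlines become a space,
# everything in the "deletions" group is dropped.
REPLACEMENTS = {"newlines": " ", "deletions": ""}

PATTERN = re.compile(
    "(?P<newlines>\n+)|"
    "(?P<deletions>\u201c|\u201d|\u2019[a-z]*|\u2013|\u2018)"
)


def lookup(match):
    # match.lastgroup is the name of the group that matched.
    return REPLACEMENTS[match.lastgroup]


cleaned = PATTERN.sub(lookup, text_1)
print(cleaned)
```

Note that the space after "market," is kept, because only the closing quotation mark is matched there.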
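The simplest way is this regex (note that it only matches when the text contains literal backslash escape sequences such as \u2019, not the decoded characters):
X = re.compile(r'(\\)(.*?) ')
text = re.sub(X, ' ', text_1)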
