I want to apply a regex-based function to clean the text in a dataframe column, i.e.:
import html
import re

re1 = re.compile(r' +')

def fixup(x):
    x = x.replace('#39;', "'").replace('amp;', '&').replace('#146;', "'").replace(
        'nbsp;', ' ').replace('#36;', '$').replace('\\n', "\n").replace('quot;', "'").replace(
        '<br />', "\n").replace('\\"', '"').replace('<unk>','u_n').replace(' #.# ','.').replace(
        ' #-# ','-').replace('\\', ' \\ ')
    return re1.sub(' ', html.unescape(x))

df['text'] = df['text'].apply(fixup).values.astype(str)
However, when I run this I get a 'MemoryError' (in a Jupyter notebook).
I have 128GB of RAM, and the file used to create the dataframe was 4GB.
I can also see from the profiler that memory use is <20% when this error is thrown.
The error message gives no more detail than 'MemoryError:' at the line where I apply the fixup function.
Any ideas to help debug?
Break the replace chain into individual replace operations. Not only will that make your code more readable and maintainable, but the intermediate results will also be discarded immediately after use, instead of being kept until all modifications are done:
replacements = ('#39;', "'"), ('amp;', '&'), ('#146;', "'"), ...
for replacement in replacements:
    x = x.replace(*replacement)
P.S. Shouldn't 'amp;' be '&amp;'?
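A minimal, self-contained sketch of the loop-based variant suggested above. The table holds only a few of the pairs from the question's chain, and the names REPLACEMENTS and MULTI_SPACE are illustrative, not from the original post. Order matters, so the pairs stay in the order the chain applied them.

```python
import html
import re

MULTI_SPACE = re.compile(r" +")

# A subset of the (old, new) pairs from the question, in the original order.
# Extend the tuple with the remaining pairs as needed.
REPLACEMENTS = (
    ("#39;", "'"),
    ("amp;", "&"),
    ("nbsp;", " "),
    ("<br />", "\n"),
    ('\\"', '"'),
)

def fixup(text):
    # Apply one replacement at a time; each intermediate string is
    # rebound to `text`, so the previous one can be freed right away.
    for old, new in REPLACEMENTS:
        text = text.replace(old, new)
    # Decode any remaining HTML entities, then collapse runs of spaces.
    return MULTI_SPACE.sub(" ", html.unescape(text))
```

It can then be applied exactly as in the question, e.g. df['text'].apply(fixup).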
I'm getting "TypeError: expected string or bytes-like object" when I process this dataset using the Vaex Python library. I've written the following code:
import pyarrow as pa
import vaex
import re
# Reading Data
anime = vaex.read_csv('/content/drive/MyDrive/Temp Datasets/Anime Recommendation Dataset/anime.csv')
user = vaex.read_csv('/content/drive/MyDrive/Temp Datasets/Anime Recommendation Dataset/rating.csv')
# Removing Non-Alphanumeric characters
@vaex.register_function()
def replacer(x):
    res = [re.sub('[^A-Za-z]', ' ', value) for value in x.tolist()]
    res = [re.sub(' +', ' ', value.lower()) for value in res] # Remove redundant whitespace
    return pa.array(res, pa.string())
anime['name_clean'] = anime.func.replacer(anime['name'])
anime = anime[anime['name_clean']!=' '] # Filter empty text
anime['name_clean']
# Merging anime and users
data = user[['user_id', 'anime_id']].join(
    anime[['anime_id', 'name_clean']], on='anime_id')['user_id', 'name_clean']
data['user_id'] = data['user_id'].astype('str')
The problem occurs when I do:
data['name_clean'].tolist()
When I process the same dataset using pandas, everything works fine:
import re
import pandas as pd
# Reading Data
anime = pd.read_csv('/content/drive/MyDrive/Temp Datasets/Anime Recommendation Dataset/anime.csv')
user = pd.read_csv('/content/drive/MyDrive/Temp Datasets/Anime Recommendation Dataset/rating.csv')
# Removing Non-Alphanumeric characters
def replacer(x):
    res = re.sub('[^A-Za-z]', ' ', x)
    res = re.sub(' +', ' ', res.lower()) # Remove redundant whitespace
    return res
anime['name'] = anime['name'].apply(replacer)
anime = anime[anime['name']!=' '] # Filter empty text
# Merging anime and users
data = user[['user_id', 'anime_id']].merge(
    anime[['anime_id', 'name']], on='anime_id')
data['user_id'] = data['user_id'].astype('str')
P.S. I think the problem is with using re.sub() with Vaex, because when I print data['name_clean'] the type shown is "string". I can't find a solution, or any other way (including the apply method), to remove non-alphanumeric characters in a Vaex dataframe without causing this problem.
I think what is going wrong in your example is that you assume x in the registered function replacer is a single sample. In fact, vaex processes data in chunks, so x is most likely an array, and its size might vary.
For your particular case, I would propose simply using the vaex string methods to achieve what you want. They are super fast and convenient. Consider this example, which is similar to yours:
import vaex
# Just as an example dataset, comes with vaex
df = vaex.datasets.titanic()
# Remove numeric characters from column cabin (adjust regex expression as needed)
df['cabin'] = df['cabin'].str.replace(r'\d+', '', regex=True)
# Remove white spaces from column cabin
df['cabin'] = df['cabin'].str.replace(r'\s+', '', regex=True)
# See the outcome
print(df)
Hope this helps.
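If a custom function is still preferred, the chunk-wise behaviour described above can be sketched in plain Python. This is only an illustration under assumptions: clean_name and clean_chunk are hypothetical helpers, a chunk is modelled as a plain list, and the presence of missing values (None) in the joined column is a guess rather than something the post confirms. Passing None to re.sub does raise "TypeError: expected string or bytes-like object", so the sketch skips such entries.

```python
import re

NON_LETTERS = re.compile(r"[^A-Za-z]")
SPACES = re.compile(r" +")

def clean_name(value):
    # Missing values are passed through untouched instead of reaching re.sub.
    if value is None:
        return None
    # Replace non-letters with spaces, lowercase, collapse repeated spaces.
    return SPACES.sub(" ", NON_LETTERS.sub(" ", value).lower())

def clean_chunk(values):
    # The function receives a whole chunk of values, not a single string.
    return [clean_name(v) for v in values]

print(clean_chunk(["Cowboy Bebop", None, "Steins;Gate"]))
```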
I am working on a thesis project on smart working. I downloaded some tweets using Python and wanted to get rid of users / mentions before building word clouds. However, I can't delete the users: the commands shown below delete only the "@".
df['token'] = df['token'].apply(lambda x: re.sub(r"@mention", "", x))
df['token'] = df['token'].apply(lambda x: re.sub(r"@[A-Za-z0-9]+", "", x))
Your second command should work; however, for efficiency use str.replace:
df['token2'] = df['token'].str.replace(r'@[A-Za-z0-9]+\s?', '', regex=True)
# or, to match [a-zA-Z0-9_], use \w
# df['token2'] = df['token'].str.replace(r'@\w+\s?', '', regex=True)
example:
                  token          token2
0  this is a @test case  this is a case
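The same pattern can be tried on plain strings with the standard re module before applying it to a whole column. This is a small sketch, assuming mentions start with "@" as on Twitter; strip_mentions is a hypothetical helper name.

```python
import re

# "@" followed by word characters, plus one optional trailing whitespace.
MENTION = re.compile(r"@\w+\s?")

def strip_mentions(text):
    return MENTION.sub("", text)

print(strip_mentions("this is a @test case"))
```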
I am trying to write some Python code that will replace some unwanted string using RegEx. The code I have written has been taken from another question on this site.
I have a text:
text_1=u'I\u2019m \u2018winning\u2019, I\u2019ve enjoyed none of it. That\u2019s why I\u2019m withdrawing from the market,\u201d wrote Arment.'
I want to remove all occurrences of \u2019m, \u2019s, \u2019ve, etc.
The code that I've written is given below:
rep={"\n":" ","\n\n":" ","\n\n\n":" ","\n\n\n\n":" ",u"\u201c":"", u"\u201d":"", u"\u2019[a-z]":"", u"\u2013":"", u"\u2018":""}
rep = dict((re.escape(k), v) for k, v in rep.iteritems())
pattern = re.compile("|".join(rep.keys()))
text = pattern.sub(lambda m: rep[re.escape(m.group(0))], text_1)
The code works perfectly for:
u"\u201c":"", u"\u201d":"", u"\u2013":"" and u"\u2018":""
However, it doesn't work for:
u"\u2019[a-z]": re.escape turns the [a-z] into \\[a\\-z\\], which doesn't match.
The output I am looking for is:
text_1=u'I winning, I enjoyed none of it. That why I withdrawing from the market,wrote Arment.'
How do I achieve this?
The information about the newlines completely changes the answer. For this, I think building the expression using a loop is actually less legible than just using better formatting in the pattern itself.
replacements = {'newlines': ' ',
                'deletions': ''}
pattern = re.compile(u'(?P<newlines>\n+)|'
                     u'(?P<deletions>\u201c|\u201d|\u2019[a-z]?|\u2013|\u2018)')

def lookup(match):
    # lastgroup is the name of the group that matched
    return replacements[match.lastgroup]

text = pattern.sub(lookup, text_1)
The problem here is actually the escaping. This code does what you want more directly:
remove = (u"\u201c", u"\u201d", u"\u2019[a-z]?", u"\u2013", u"\u2018")
pattern = re.compile("|".join(remove))
text = pattern.sub("", text_1)
I've added the ? to the u2019 match, as I suppose that's what you want as well given your test string.
For completeness, I should also link to the Unidecode package, which may be closer to what you're trying to achieve by removing these characters.
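For reference, a Python 3 sketch of the named-group approach. One deliberate difference from the snippets above: it uses [a-z]* instead of [a-z]?, because the optional single letter removes only the "v" of "\u2019ve" and leaves "Ie", whereas the requested output drops the whole suffix. The names REPLACEMENTS, PATTERN and clean are illustrative.

```python
import re

REPLACEMENTS = {"newlines": " ", "deletions": ""}

PATTERN = re.compile(
    "(?P<newlines>\n+)|"
    "(?P<deletions>\u201c|\u201d|\u2019[a-z]*|\u2013|\u2018)"
)

def clean(text):
    # m.lastgroup names the alternative that matched; use it to pick
    # the replacement string.
    return PATTERN.sub(lambda m: REPLACEMENTS[m.lastgroup], text)

text_1 = ("I\u2019m \u2018winning\u2019, I\u2019ve enjoyed none of it. "
          "That\u2019s why I\u2019m withdrawing from the market,\u201d wrote Arment.")
print(clean(text_1))
```

Note that the space after the closing quotation mark is kept, so the result reads "market, wrote" rather than "market,wrote".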
The simplest way is this regex:
X = re.compile(r'((\\)(.*?) )')
text = re.sub(X, ' ', text_1)
Basically, I print a long message, but I want to regroup its characters into strings that are 5 characters long.
For example "iPhone 6 isn’t simply bigger — it’s better in every way. Larger, yet dramatically thinner." I want to make that
"iPhon 6isn' tsimp lybig ger-i t'sbe terri never yway. Large r,yet drama tical lythi nner. "
As suggested by @vaultah, this is achieved by splitting the string on spaces and joining the parts back together without spaces, then using a for loop to append the result of a slice operation to an array. An elegant solution is to use a comprehension.
text = "iPhone 6 isn’t simply bigger — it’s better in every way. Larger, yet dramatically thinner."
joined_text = ''.join(text.split())
split_to_five = [joined_text[i:i + 5] for i in range(0, len(joined_text), 5)]
' '.join(split_to_five)
I'm sure you can use the re module to get the dashes and apostrophes back as they're meant to be.
Simply do the following.
import re

sentence = "iPhone 6 isn't simply bigger - it's better in every way. Larger, yet dramatically thinner."
sentence = re.sub(' ', '', sentence)  # drop all spaces
count = 0
new_sentence = ''
for i in sentence:
    # start a new group before every 5th character (except the first)
    if count % 5 == 0 and count != 0:
        new_sentence = new_sentence + ' '
    new_sentence = new_sentence + i
    count = count + 1
print(new_sentence)
Output:
iPhon e6isn 'tsim plybi gger- it'sb etter ineve ryway .Larg er,ye tdram atica llyth inner .
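The same grouping can be wrapped in a small reusable function. This is a sketch, not part of the original answers; group_chars and its size parameter are hypothetical names.

```python
def group_chars(text, size=5):
    # Remove all whitespace, then cut the result into fixed-size slices.
    compact = "".join(text.split())
    chunks = [compact[i:i + size] for i in range(0, len(compact), size)]
    return " ".join(chunks)

sentence = ("iPhone 6 isn't simply bigger - it's better in every way. "
            "Larger, yet dramatically thinner.")
print(group_chars(sentence))
```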
I have written a small Python program inside my Google App Engine app. I am using it to extract specific parts of a string like this:
"+CMGL: 14,"REC READ","+918000459019",,"11/11/04,18:27:53+22"
C
"
I am using the split function for this, but it's not splitting the string. Any clues why?
It gives me a result like this: [u'+CMGL: 14,"REC READ","+918000459019",,"11/11/04,18:27:53+22"\n C '].
def prog(self, strgs):
    self.response.out.write(strgs)
    temp1 = strgs
    self.response.out.write(temp1)
    message_split = temp1.split('\n')
    #self.response.out.write(message_split)
    temp = message_split
    self.response.out.write(temp)
    message_split_second = strgs.split(',')
    m_list = message_split[1:]
    self.response.out.write(message_split_second)
    collect_strings = ''
    for j in m_list:
        collect_strings = collect_strings + j
    message_txt = collect_strings
    message_date = message_split_second[0]
    message_date = message_date.replace('"', "")
    dates = message_date
    message_time = message_split_second[0]
    message_time = message_time.split('/n')
    message_time = message_time[0]
    message_time = message_time.replace('"', "")
    temp = message_time.split('+')
    message_time = temp[0]
    times = message_time
    cell_number = message_split_second[0]
    cell_number = cell_number.replace('"', "")
    cellnum = cell_number
    return message_txt, dates, times, cellnum
The splits in the first part of your function ought to work. Here's an experiment I just did in Python 2.6:
>>> s = '+CMGL: 14,"REC READ","+918000459019",,"11/11/04,18:27:53+22"\n C '
>>> s.split('\n')
['+CMGL: 14,"REC READ","+918000459019",,"11/11/04,18:27:53+22"', ' C ']
>>> s.split(',')
['+CMGL: 14', '"REC READ"', '"+918000459019"', '', '"11/11/04', '18:27:53+22"\n C ']
If your self.response.out.write calls aren't doing the same thing, try reducing the function to the very shortest thing that displays the odd behaviour. And check that you know exactly what's being passed in as the strgs argument.
I can't see much wrong with the rest, except that at one point you try to split on /n when you probably meant to use \n.
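As a rough illustration of how the four values could be pulled out, here is a sketch that lets the csv module handle the quoted fields, so the comma inside the quoted timestamp does not split it. The field positions (number in the third field, timestamp in the fifth) are assumptions based on the single sample line in the question, and parse_cmgl is a hypothetical name.

```python
import csv

def parse_cmgl(raw):
    # First line is the +CMGL header; everything after it is the message body.
    header, _, body = raw.partition("\n")
    # csv.reader understands the double-quoted fields in the header.
    fields = next(csv.reader([header]))
    cell_number = fields[2]
    date, _, time_with_zone = fields[4].partition(",")
    time = time_with_zone.split("+")[0]  # drop the "+22" suffix
    return body.strip(), date, time, cell_number

s = '+CMGL: 14,"REC READ","+918000459019",,"11/11/04,18:27:53+22"\n C '
print(parse_cmgl(s))
```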