I ran the code below on about 20k rows of data. The code works and I do get the output, but it runs very slowly: it took almost 45 minutes to finish. Can someone please suggest an appropriate way to speed it up?
Code:
import numpy as np
import pandas as pd
import re
def demoji(text):
    emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002500-\U00002BEF"  # chinese char
        u"\U00002702-\U000027B0"  # dingbats
        u"\U000024C2-\U0001F251"
        u"\U0001f926-\U0001f937"
        u"\U00010000-\U0010ffff"
        u"\u2640-\u2642"
        u"\u2600-\u2B55"
        u"\u200d"
        u"\u23cf"
        u"\u23e9"
        u"\u231a"
        u"\ufe0f"  # variation selector
        u"\u3030"
        "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)
df = pd.read_csv("data.csv")
print(df['Body'])
tweets=df.replace(to_replace=[r"\\t|\\n|\\r", "\t|/n|/r|w/|\n|w/|Quote::"], value=["",""], regex=True)
tweets[u'Body'] = tweets[u'Body'].astype(str)
tweets[u'Body'] = tweets[u'Body'].apply(lambda x:demoji(x))
#Preprocessing del RT #blablabla:
tweets['tweetos'] = ''
#add tweetos first part
for i in range(len(tweets['Body'])):
    try:
        tweets['tweetos'][i] = tweets['Body'].str.split(' ')[i][0]
    except AttributeError:
        tweets['tweetos'][i] = 'other'

#Preprocessing tweetos. select tweetos contains 'RT #'
for i in range(len(tweets['Body'])):
    if tweets['tweetos'].str.contains('#')[i] == False:
        tweets['tweetos'][i] = 'other'

# remove URLs, RTs, and twitter handles
for i in range(len(tweets['Body'])):
    tweets['Body'][i] = " ".join([word for word in tweets['Body'][i].split()
                                  if 'http' not in word and '#' not in word and '<' not in word])
This code removes special characters (like /n), Twitter mentions, and so on; basically, it's text cleaning.
Whenever you work with Pandas and start iterating over dataframe content, there's a good chance that your approach is lacking. Try to stick to the native Pandas tools/methods, which are highly optimized! Also, watch out for repetition: in your code you do some things over and over again. E.g. in every iteration of
the first loop you split df.Body (tweets['tweetos'][i] = tweets['Body'].str.split(' ')[i][0]), only to pick one item from the resulting frame;
the second loop you evaluate a complete column of the frame (tweets['tweetos'].str.contains('#')), only to pick one item from the result.
Your code could probably look like this:
import pandas as pd
import re
df = pd.read_csv("data.csv")
tweets = df.replace(to_replace=[r"\\t|\\n|\\r", "\t|/n|/r|w/|\n|w/|Quote::"], value=["",""], regex=True)
# Why not tweets = df.replace(r'\\t|\\n|\\r|\t|/n|/r|w/|\n|w/|Quote::', ',') ?
re_emoji = re.compile(...) # As in your code
tweets.Body = tweets.Body.astype(str).str.replace(re_emoji, '') # Is the astype(str) necessary?
body_split = tweets.Body.str.split()
tweets['tweetos'] = body_split.map(lambda l: 'other' if not l else l[0])
tweets.tweetos[~tweets.tweetos.str.contains('#')] = 'other'
re_discard = re.compile(r'http|#|<')
tweets.Body = (body_split.map(lambda l: [w for w in l if not re_discard.search(w)])
                         .str.join(' '))
Be aware that I don't have any real insight into the data you're working with - you haven't provided a sample. So there might be bugs in the code I've proposed.
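As a footnote (my own sketch, not part of the answer above): the last step could also stay entirely within pandas' vectorized string methods by pushing the word filter into a single regex, assuming a token should be dropped whenever it contains 'http', '#' or '<'. The sample frame below is made up, since no data was provided.

import pandas as pd

# hypothetical sample standing in for the real data.csv
tweets = pd.DataFrame({'Body': ['RT #news check http://x.co now', 'plain text only']})

# drop every whitespace-delimited token containing 'http', '#' or '<',
# then collapse the leftover whitespace - all with vectorized string calls
tweets['Body'] = (tweets['Body']
                  .str.replace(r'\S*(?:http|#|<)\S*', '', regex=True)
                  .str.split()
                  .str.join(' '))
print(tweets['Body'].tolist())  # ['RT check now', 'plain text only']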
Related
So, in this script I am writing to learn Python, I would like to just use a wildcard instead of rewriting this whole block every time just to change line 2. What would be the most efficient way to consolidate this into a loop, so that it just uses all of d.entries[0-99].content and repeats until finished: if, while, or for?
Also, my try/except does not perform as expected.
What gives?
import feedparser, base64
from urlextract import URLExtract
d = feedparser.parse('https://www.reddit.com/r/PkgLinks.rss')
print (d.entries[3].title)
sr = str(d.entries[3].content)
spl1 = sr.split("<p>")
ss = str(spl1)
spl2 = ss.split("</p>")
try:
    st = str(spl2[0])
    # print(st)
except:
    binascii.Error
    st = str(spl2[1])
print(st)
#st = str(spl2[0])
spl3 =st.split("', '")
stringnow=str(spl3[1])
b64s1 = stringnow.encode('ascii')
b64s2 = base64.b64decode(b64s1)
stringnew = b64s2.decode('ascii')
print(stringnew)
## but line 15 does nothing; how do I fix that, and also loop through all d.entries[?].content?
The loop part is done simply by doing the following:
import feedparser, base64
from urlextract import URLExtract
d = feedparser.parse('https://www.reddit.com/r/PkgLinks.rss')
# loop from 0 to 99
# range(100) goes from 0 up to, but not including, 100
for i in range(100):
    print(d.entries[i].title)
    sr = str(d.entries[i].content)
    # << the rest of your code here >>
The data returned from d.entries[i].content is a dictionary, but you are converting it to a string, so you may want to check whether you are doing what you really want to. Also, when you use .split() it produces a list of the split items, but you convert that back to a string again (a few times). You may want to take another look at that part of the code.
I haven't used regex much, but I decided to play with it and got this to work. I retrieved the contents of the 'value' key from the dictionary, then used regex to get the base64 info. I only tried it for the first 5 items (i.e., I changed range(100) to range(5)). Hope it helps; if not, I enjoyed doing this. Oh, I left in all of the print statements I used as I was working down the code.
import feedparser, base64
from urlextract import URLExtract
import re
d = feedparser.parse('https://www.reddit.com/r/PkgLinks.rss')
for i in range(100):
    print(d.entries[i].title)
    # .content is a list.
    # print("---------")
    # print(type(d.entries[i].content))
    print(d.entries[i].content)
    print("---------")
    # gets the contents of key 'value' in the dictionary that is the 1st item in the list.
    string_value = d.entries[i].content[0]['value']
    print(string_value)
    print("---------")
    # this assumes there is always a space between the 1st </p> and the 2nd <p>
    # grabs the text in between using re.search
    pattern = "<p>(.*?)</p>"
    substring = re.search(pattern, string_value).group(1)
    print(substring)
    print("---------")
    print("---------")
    print("---------")
    # rest of your code here
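To complete the original goal, something like the following could stand in for the "# rest of your code here" placeholder (indented inside the loop). This is my own untested sketch; it assumes the text captured between the <p> tags really is plain base64, as in the question's script, and it also shows the try/except form the question was aiming for.

import base64, binascii

try:
    # 'substring' is the text captured between <p> and </p> above
    decoded = base64.b64decode(substring.encode('ascii')).decode('ascii')
    print(decoded)
except binascii.Error:
    print("not valid base64:", substring)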
I want to make a simple spell-corrector system and I have a dataframe like this:
incorrect_word, correct_word
scoohl,school
watn,want
frienf,friend
"I watn to go scoohl"
I want to correct this sentence by replacing each incorrect sample from the "incorrect_word" column with the corresponding sample from the "correct_word" column (if it exists).
How can I do this?
The sample code I wrote, which does not work:
text = " شما رفتین مدرسه شون گفتین دستاشون رو بشورن"
# if "دستاشون رو" in text:
# print("yes")
from hazm import *
import pandas as pd
from src.config.config import *
# letters = word_tokenize(text)
# for text in word_tokenize(text):
# print(text)
df = pd.read_excel(FILL_DATA).astype(str)
text = str(text)
for idx, item in enumerate(df['informal']):
    if item in text:
        text = text.replace(item, df['formal1'].iloc[idx])
        # item = item.replace(df['informal'].iloc[idx], df['formal1'].iloc[idx])
print(text)
I would do it like this:
df = pd.DataFrame([['scoohl','school'], ['watn','want'], ['frienf','friend']], columns=['incorrect_word', 'correct_word'])
df.index = df['incorrect_word']
df.drop(columns=['incorrect_word'], inplace=True)
text_to_correct = "I watn to go scoohl"
words = text_to_correct.split(' ')
for c, w in enumerate(words):
    if w in df.index:
        words[c] = df.at[w,'correct_word']
words = ' '.join(words)
words
result :
'I want to go school'
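A small variation (my own sketch, using the same toy dataframe, which is already indexed by incorrect_word): for longer texts the per-word lookup can go through a plain dict, which avoids indexing the dataframe inside the loop.

corrections = dict(zip(df.index, df['correct_word']))   # {'scoohl': 'school', ...}
text_to_correct = "I watn to go scoohl"
corrected = ' '.join(corrections.get(w, w) for w in text_to_correct.split())
print(corrected)   # 'I want to go school'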
Hello, this is very basic Python; you can do it this way:
df['incorrect'] = [x for x in df['Correct'] if len(x) > 2]
You should read about lambda, list comprehensions, apply and map; a small illustration of apply follows below.
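For example, here is my own small illustration (not the poster's code) of what apply can do here; the column name 'sentence' and the sample rows are made up.

import pandas as pd

corrections = {'scoohl': 'school', 'watn': 'want', 'frienf': 'friend'}
df_sentences = pd.DataFrame({'sentence': ['I watn to go scoohl', 'he is my frienf']})

# run the per-sentence correction over the whole column at once
df_sentences['corrected'] = df_sentences['sentence'].apply(
    lambda s: ' '.join(corrections.get(w, w) for w in s.split()))
print(df_sentences['corrected'].tolist())  # ['I want to go school', 'he is my friend']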
Thank you.
df is unstructured, with no column or row headers. Every column contains strings in which there is a pattern that needs to be removed; the pattern is described below.
Input to one column of the unstructured df, as a string:
I am to be read ===start=== I am to be removed ===stop=== I have to be read again ===start=== remove me again ===stop=== continue reading
Output needed:
I am to be read I have to be read again continue reading
Here I have to remove everything from '===start===' to '===stop===' whenever it occurs. The df has thousands of entries. What is the most efficient way of using regex for this?
The code below works on a column but takes a long time to complete.
Is there a solution using regex that is more efficient / has lower time complexity?
df = pd.read_excel("sample_excel.xlsx", header=None)

def removeString(df):
    inf = df[0][1]
    infcopy = ''
    bol = False
    start = '*start*'
    end = '*stop*'
    inf.replace('* start *', start)  # in case of blank space around start
    inf.replace('* stop *', end)     # in case of blank space around stop
    for i in range(len(inf)):
        if inf[i] == "*" and inf[i:i+len(start)] == start:
            bol = True
        if inf[i] == '*' and inf[i+1-len(end):i+1] == end:
            bol = False
            continue
        if bol == False:
            infcopy += inf[i]
    df[0][1] = infcopy
I think it could look something like this.
import pandas as pd
import re
def removeString(df):
    pattern = r'(?:start(.*?)stop)'
    df[ColToRemove] = df[ColToRemove].apply(lambda x: re.sub(pattern, "", x))
E.g.
df = pd.DataFrame({'Col1':['startjustsomethingherestop']})
Output:
Col1
0 startjustsomethingherestop
And then,
pattern = r'(?:start(.*?)stop)'
df['Col1'] = df['Col1'].apply(lambda x: re.sub(pattern, "", x))
Output:
Col1
0
The regex pattern defined here will remove everything whenever a match is found for a substring that begins with "start" and ends with "stop", leaving whatever remains as the output.
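For the question's actual ===start===/===stop=== markers, the same idea can also be expressed with pandas' vectorized string replacement instead of a per-row apply. A sketch, under the assumption that the text sits in column 0 of the headerless frame:

import pandas as pd

# toy frame standing in for the real sample_excel.xlsx
df = pd.DataFrame({0: ['I am to be read ===start=== I am to be removed ===stop=== '
                       'I have to be read again ===start=== remove me again ===stop=== continue reading']})

# non-greedy match from ===start=== to ===stop===, plus one trailing space if present;
# prefix the pattern with (?s) if a block can span line breaks
df[0] = df[0].str.replace(r'===start===.*?===stop===\s?', '', regex=True)
print(df[0][0])  # I am to be read I have to be read again continue reading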
I am reading my data from a CSV file using pandas, and it works well with a range of 700. But as soon as I go above 700 and try to append to a list in Python, it shows me "list index out of range", even though the CSV has around 500K rows.
Can anyone help me understand why this is happening?
Thanks in advance.
import pandas as pd
df_email = pd.read_csv('emails.csv',nrows=800)
test_email = df_email.iloc[:,-1]
list_of_emails = []
for i in range(len(test_email)):
    var_email = test_email[i].split("\n") #this code takes one single email splits based on a new line giving a python list of all the strings in the email
    email = {}
    message_body = ''
    for _ in var_email:
        if ":" in _:
            var_sentence = _.split(":") #this part actually uses the ":" to find the elements in the list that have ":" present
            for j in range(len(var_sentence)):
                if var_sentence[j].lower().strip() == "from":
                    email['from'] = var_sentence[var_sentence.index(var_sentence[j+1])].lower().strip()
                elif var_sentence[j].lower().strip() == "to":
                    email['to'] = var_sentence[var_sentence.index(var_sentence[j+1])].lower().strip()
                elif var_sentence[j].lower().strip() == 'subject':
                    if var_sentence[var_sentence.index(var_sentence[j+1])].lower().strip() == 're':
                        email['subject'] = var_sentence[var_sentence.index(var_sentence[j+2])].lower().strip()
                    else:
                        email['subject'] = var_sentence[var_sentence.index(var_sentence[j+1])].lower().strip()
        elif ":" not in _:
            message_body += _.strip()
    email['body'] = message_body
    list_of_emails.append(email)
I am not sure what you are trying to say here (you might as well put example inputs and outputs here), but I came across a problem some weeks ago which might be of the same nature.
CSV files are comma-separated, which means the parser takes note of every comma in a line to separate the values into columns. If some dirty input from strings in your CSV file is present, then it will mess up the columns that you are expecting to have.
The best solution here is to have some code clean up your CSV file, change its delimiter to another character (probably '|', '&', or anything else that doesn't clash with the data), and revise your code to reflect these changes to the CSV. A sketch of how to check for this follows below.
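If a delimiter or quoting problem really is the cause, one way to make it visible is the following. This is my own sketch; the parameters are real pandas options, but whether they help depends on the file.

import pandas as pd

# flag rows that don't parse into the expected number of columns
# (on_bad_lines is available from pandas 1.3 on; older versions use error_bad_lines/warn_bad_lines)
df_email = pd.read_csv('emails.csv', nrows=800, on_bad_lines='warn')

# if commas inside the email text are the culprit, rewriting the file with a
# delimiter that cannot occur in the data (e.g. '|') is one way to sidestep it:
# df_email.to_csv('emails_pipe.csv', sep='|', index=False)
# df_email = pd.read_csv('emails_pipe.csv', sep='|')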
Use the pandas library to read the file.
It is very efficient and saves you the time of writing the parsing code yourself.
E.g.:
import pandas as pd
training_data = pd.read_csv( "train.csv", sep = ",", header = None )
I am using this function to read a config file.
import numpy as np
stream = np.genfromtxt(filepath, delimiter = '\n', comments='#', dtype= 'str')
It works pretty well, but I have a problem: the tab character.
For example, the output looks like this:
['\tvalue1 ', ' 1'] ['\t'] ['value2 ', ' 2']
Is there a way to ignore this special character?
My solution is something like this (it works for my purposes but it's a bit ugly):
import sys, logging  # needed by the error handling below

result = {}
for el in stream:
    row = el.split('=', 1)
    try:
        if len(row) == 2:
            row[0] = row[0].replace(' ','').replace('\t','') #clean the elements from unneeded spaces
            row[1] = row[1].replace(' ','').replace('\t','')
            result[row[0]] = eval(row[1])
    except:
        print >> sys.stderr, "FATAL ERROR: '"+filepath+"' is misconfigured"
        logging.exception(sys.stderr)
        sys.exit('')
To replace the tabs with nothing:
stream = [x.replace('\t','') for x in stream]
Or to replace tabs with a single space, and then remove duplicate spaces:
stream = [' '.join(x.replace('\t',' ').split()) for x in stream]
To remove empty strings:
stream = filter(None, stream)
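Putting those pieces together with the original parsing loop could look like this (my own sketch; it keeps the values as strings instead of eval-ing them the way the original code does):

# clean up tabs and extra spaces, then drop lines that became empty
stream = [' '.join(x.replace('\t', ' ').split()) for x in stream]
stream = [x for x in stream if x]

result = {}
for el in stream:
    key, sep, value = el.partition('=')   # split only on the first '='
    if sep:
        result[key.strip()] = value.strip()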
There doesn't seem to be a way to assign multiple delimiters or comment characters using numpy's genfromtxt, so I would recommend looking elsewhere. Try https://docs.python.org/2/library/configparser.html. Here's a link with a quick example so you can get a feel for how to work with the module: https://wiki.python.org/moin/ConfigParserExamples
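For example, a minimal configparser sketch (my own; the file name config.ini and the [settings] section are made up, and configparser does require section headers, which a plain key=value file would need to have prepended):

# Python 3; the module is named ConfigParser in Python 2.
# Assumes a config.ini like:
#   [settings]
#   # a comment
#   value1 = 1
#   value2 = 2
# Tabs and spaces around '=' are handled automatically.
import configparser

config = configparser.ConfigParser()
config.read('config.ini')
result = {key: config.getint('settings', key) for key in config['settings']}
print(result)  # {'value1': 1, 'value2': 2}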