I am trying to run a function pre_process on a list input k1_tweets_filtered['text'].
However, the function only seems to work on one input at a time, i.e. k1_tweets_filtered[1]['text'].
I want the function to run on all inputs of k1_tweets_filtered['text'].
I have tried using a loop, but it only outputs the words of the first input.
I am wondering whether this is the right approach and how I can apply it to the rest of the inputs.
This is the question I am trying to solve and what I have coded so far.
Write your code to pre-process and clean up all tweets
stored in the variable k1_tweets_filtered, k2_tweets_filtered and k3_tweets_filtered using the
function pre_process() to result in new variables k1_tweets_processed, k2_tweets_processed
and k3_tweets_processed.
for x in range(len(k1_tweets_filtered)):
    tweet_k1 = k1_tweets_filtered[x]['text']
    x += 1
    k1_tweets_processed = pre_process(tweet_k1)
The function pre_process is below. I know it is correct, as it was given to me.
import re

def remove_non_ascii(s): return "".join(i for i in s if ord(i) < 128)

def pre_process(doc):
    """
    pre-processes a doc
    * Converts the tweet into lower case,
    * removes the URLs,
    * removes the punctuations
    * tokenizes the tweet
    * removes words less than 3 characters
    """
    doc = doc.lower()
    # getting rid of non ascii codes
    doc = remove_non_ascii(doc)
    # replacing URLs
    url_pattern = "http://[^\s]+|https://[^\s]+|www.[^\s]+|[^\s]+\.com|bit.ly/[^\s]+"
    doc = re.sub(url_pattern, 'url', doc)
    # removing dollars and usernames and other unnecessary stuff
    userdoll_pattern = "\$[^\s]+|\#[^\s]+|\&[^\s]+|\*[^\s]+|[0-9][^\s]+|\~[^\s]+"
    doc = re.sub(userdoll_pattern, '', doc)
    # removing punctuation
    punctuation = r"\(|\)|#|\'|\"|-|:|\\|\/|!|\?|_|,|=|;|>|<|\.|\#"
    doc = re.sub(punctuation, ' ', doc)
    return [w for w in doc.split() if len(w) > 2]
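For reference, a quick sanity check with a made-up tweet string (hypothetical, not from the actual dataset) illustrates the kind of list pre_process returns:

sample = "Check out https://example.com #Python, it's GREAT!!!"  # hypothetical example string
print(pre_process(sample))
# expected output: ['check', 'out', 'url', 'great']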
k1_tweets_processed = []
for i in range(len(k1_tweets_filtered)):
    tweet_k1 = k1_tweets_filtered[i]['text']
    k1_tweets_processed.append(pre_process(tweet_k1))
When you iterate, it is conventional to use i or j as the loop variable name, and if you write "for i in range(10)" you should not increment i inside the loop, since range already advances it. Also, you previously set k1_tweets_processed to a single pre-processed tweet on every pass, instead of creating a list and appending each new result to it.
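A more compact variant, assuming each of k1_tweets_filtered, k2_tweets_filtered and k3_tweets_filtered is a list of dicts with a 'text' key, is to do the same thing with list comprehensions:

# one pre-processed token list per tweet, for each of the three collections
k1_tweets_processed = [pre_process(t['text']) for t in k1_tweets_filtered]
k2_tweets_processed = [pre_process(t['text']) for t in k2_tweets_filtered]
k3_tweets_processed = [pre_process(t['text']) for t in k3_tweets_filtered]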
I want to write a function that takes one element of my list and tells me its 10 key words using TF-IDF. I have seen code examples but could not implement them. Each element of my list is a long sentence.
I have written these two functions, and I do not know how to do what I described above.
def fit(train_data):
    cleaned_lst = []
    for element in train_data:
        # removing customized stop words
        cleaned = remove(element)
        cleaned_lst.append(cleaned)
    for sentence in cleaned_lst:
        vectorizer = TfidfVectorizer(tokenizer=word_tokenize)
        fitted_data = vectorizer.fit([sentence])
    return fitted_data

def transfom(test_data):
    transformed_data = fit(train_data).transform([element for element in test_data])
    return transformed_data
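For what it's worth, a minimal sketch of the usual approach (fit a single vectorizer on the whole cleaned list, then read the highest-scoring terms for one element) might look like the following; top_keywords is a hypothetical helper, and the custom stop-word removal is left out:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def top_keywords(corpus, index, n=10):
    # fit on the whole corpus so the IDF weights are meaningful
    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(corpus)        # shape: (n_docs, n_terms)
    terms = vectorizer.get_feature_names_out()      # older scikit-learn versions use get_feature_names()
    row = tfidf[index].toarray().ravel()            # TF-IDF scores for the chosen element
    top = np.argsort(row)[::-1][:n]                 # indices of the n highest-scoring terms
    return [terms[i] for i in top if row[i] > 0]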
I have a for-loop that calls a function. The function iterates over elements in a list and constructs vectors from them. I am able to print out each of these vectors, but since I need to perform operations on them, I need the loop to actually return each of them.
for text in corpus:
    text_vector = get_vector(lexicon, text)
    print(text_vector)
You can store them in a list.
You can create a list by doing:
output = []
Then in the loop you can add the values like this:
for text in corpus:
    text_vector = get_vector(lexicon, text)
    print(text_vector)
    output.append(text_vector)
Now you have saved each item from the for loop in the list.
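If this loop lives inside a function and you want it to "return each of them", a sketch of two common options (collect and return a list, or turn the function into a generator) could look like this, assuming lexicon and corpus are defined as in your code:

def vectorize_all(lexicon, corpus):
    # option 1: build a list and return it once the loop is done
    output = []
    for text in corpus:
        output.append(get_vector(lexicon, text))
    return output

def vectorize_lazily(lexicon, corpus):
    # option 2: yield each vector as it is produced (generator)
    for text in corpus:
        yield get_vector(lexicon, text)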
I think all you need is a list in the outer scope that you append to.
vector_texts = []
for text in corpus:
    text_vector = get_vector(lexicon, text)
    vector_texts.append(text_vector)
    print(text_vector)
I have two .txt files: one contains 200,000 words and the second contains 100 keywords (one per line). I want to calculate the cosine similarity between each of the 100 keywords and each of my 200,000 words, and display for every keyword the 50 words with the highest score.
Here's what I did; note that BertClient is what I'm using to extract vectors:
from sklearn.metrics.pairwise import cosine_similarity
from bert_serving.client import BertClient

bc = BertClient()

# Process words
with open("./words.txt", "r", encoding='utf8') as textfile:
    words = textfile.read().split()

with open("./100_keywords.txt", "r", encoding='utf8') as keyword_file:
    for keyword in keyword_file:
        vector_key = bc.encode([keyword])
        for w in words:
            vector_word = bc.encode([w])
            cosine_lib = cosine_similarity(vector_key, vector_word)
            print(cosine_lib)
This keeps running and never stops. Any idea how I can correct this?
I know nothing of BERT... but there's something fishy with the import and run. I don't think you have it installed correctly. I tried to pip install it and just run this:
from sklearn.metrics.pairwise import cosine_similarity
from bert_serving.client import BertClient
bc = BertClient()
print ('done importing')
and it never finished. Take a look at the docs for bert and see if something else needs to be done.
On your code: it is generally better to do ALL of the reading first, then the processing, so import both lists first, separately, and check a few values with something like:
# check first five
print(words[:5])
Also, you need to look at a different way to do your comparisons instead of the nested loops. Notice that you are re-encoding every word in words for EVERY keyword, which is unnecessary and probably really slow. I would recommend you either use a dictionary to pair each word with its encoding, or make a list of (word, encoding) tuples if you are more comfortable with that.
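For example, a rough sketch of the batched approach with the real BertClient (assuming the bert-serving server is running and that all the encodings fit in memory) might look like this:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from bert_serving.client import BertClient

bc = BertClient()

with open("./words.txt", "r", encoding='utf8') as textfile:
    words = textfile.read().split()
with open("./100_keywords.txt", "r", encoding='utf8') as keyword_file:
    keywords = [line.strip() for line in keyword_file if line.strip()]

word_vecs = bc.encode(words)          # encode each list once, in one batch
keyword_vecs = bc.encode(keywords)

sims = cosine_similarity(keyword_vecs, word_vecs)    # shape: (n_keywords, n_words)
for keyword, row in zip(keywords, sims):
    top50 = np.argsort(row)[::-1][:50]               # indices of the 50 closest words
    print(keyword, [words[i] for i in top50])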
Comment me back if that doesn't make sense after you get Bert up and running.
--Edit--
Here is a chunk of code that works similarly to what you want to do. There are a lot of options for how you can hold results, etc., depending on your needs, but this should get you started with "fake bert":
from operator import itemgetter

# fake bert ... just return something like length
def bert(word):
    return len(word)

# a fake compare function that will compare "bert" conversions
def bert_compare(x, y):
    return abs(x - y)

# Process words
with open("./word_data_file.txt", "r", encoding='utf8') as textfile:
    words = textfile.read().split()

# Process keywords
with open("./keywords.txt", "r", encoding='utf8') as keyword_file:
    keywords = keyword_file.read().split()

# encode the words and put result in dictionary
encoded_words = {}
for word in words:
    encoded_words[word] = bert(word)

encoded_keywords = {}
for word in keywords:
    encoded_keywords[word] = bert(word)

# let's use our bert conversions to find which keyword is most similar in
# length to the word
for word in encoded_words.keys():
    result = []  # make a new result set for each pass
    for kword in encoded_keywords.keys():
        similarity = bert_compare(encoded_words.get(word), encoded_keywords.get(kword))
        # stuff the answer into a tuple that can be sorted
        result.append((word, kword, similarity))
    result.sort(key=itemgetter(2))
    print(f'the keyword with the closest size to {result[0][0]} is {result[0][1]}')
I am trying to clean up the text of this website and get every word. But using generators gives me more words than using lists. Also, the differences are inconsistent: sometimes there is one extra word, sometimes none, sometimes more than 30 extra words. I have read about generators in the Python documentation and looked up some questions about generators. From what I understand, there shouldn't be a difference, and I don't understand what's going on underneath the hood. I am using Python 3.6. I have also read Generator Comprehension different output from list comprehension? but I can't understand the situation.
This is the first function, using generators.
def text_cleaner1(website):
    '''
    This function just cleans up the raw html so that I can look at it.
    Inputs: a URL to investigate
    Outputs: Cleaned text only
    '''
    try:
        site = requests.get(website).text  # Connect to the job posting
    except:
        return  # Need this in case the website isn't there anymore or some other weird connection problem

    soup_obj = BeautifulSoup(site, "lxml")  # Get the html from the site

    for script in soup_obj(["script", "style"]):
        script.extract()  # Remove these two elements from the BS4 object

    text = soup_obj.get_text()  # Get the text from this

    lines = (line.strip() for line in text.splitlines())  # break into lines
    print(type(lines))

    chunks = (phrase.strip() for line in lines for phrase in line.split(" "))  # break multi-headlines into a line each
    print(type(chunks))

    def chunk_space(chunk):
        chunk_out = chunk + ' '  # Need to fix spacing issue
        return chunk_out

    text = ''.join(chunk_space(chunk) for chunk in chunks if chunk).encode('utf-8')  # Get rid of all blank lines and ends of line

    # Now clean out all of the unicode junk (this line works great!!!)
    try:
        text = text.decode('unicode_escape').encode('ascii', 'ignore')  # Need this as some websites aren't formatted
    except:                                                             # in a way that this works, can occasionally throw
        return                                                          # an exception

    text = str(text)

    text = re.sub("[^a-zA-Z.+3]", " ", text)  # Now get rid of any terms that aren't words (include 3 for d3.js)
                                              # Also include + for C++

    text = text.lower().split()  # Go to lower case and split them apart

    stop_words = set(stopwords.words("english"))  # Filter out any stop words
    text = [w for w in text if not w in stop_words]

    text = set(text)  # Last, just get the set of these. Ignore counts (we are just looking at whether a term existed
                      # or not on the website)

    return text
This is the second function, using list comprehensions.
def text_cleaner2(website):
    '''
    This function just cleans up the raw html so that I can look at it.
    Inputs: a URL to investigate
    Outputs: Cleaned text only
    '''
    try:
        site = requests.get(website).text  # Connect to the job posting
    except:
        return  # Need this in case the website isn't there anymore or some other weird connection problem

    soup_obj = BeautifulSoup(site, "lxml")  # Get the html from the site

    for script in soup_obj(["script", "style"]):
        script.extract()  # Remove these two elements from the BS4 object

    text = soup_obj.get_text()  # Get the text from this

    lines = [line.strip() for line in text.splitlines()]  # break into lines

    chunks = [phrase.strip() for line in lines for phrase in line.split(" ")]  # break multi-headlines into a line each

    def chunk_space(chunk):
        chunk_out = chunk + ' '  # Need to fix spacing issue
        return chunk_out

    text = ''.join(chunk_space(chunk) for chunk in chunks if chunk).encode('utf-8')  # Get rid of all blank lines and ends of line

    # Now clean out all of the unicode junk (this line works great!!!)
    try:
        text = text.decode('unicode_escape').encode('ascii', 'ignore')  # Need this as some websites aren't formatted
    except:                                                             # in a way that this works, can occasionally throw
        return                                                          # an exception

    text = str(text)

    text = re.sub("[^a-zA-Z.+3]", " ", text)  # Now get rid of any terms that aren't words (include 3 for d3.js)
                                              # Also include + for C++

    text = text.lower().split()  # Go to lower case and split them apart

    stop_words = set(stopwords.words("english"))  # Filter out any stop words
    text = [w for w in text if not w in stop_words]

    text = set(text)  # Last, just get the set of these. Ignore counts (we are just looking at whether a term existed
                      # or not on the website)

    return text
And this code gives me different results each time I run it:
text_cleaner1("https://www.indeed.com/rc/clk?jk=02ecc871f377f959&fccid=c46d0116f6e69eae") - text_cleaner2("https://www.indeed.com/rc/clk?jk=02ecc871f377f959&fccid=c46d0116f6e69eae")
Generator is "lazy" - it doesn't execute code immediately but it executes it later when results will be needed. It means it doesn't get values from variables or functions immediately but it keeps references to variables and functions.
Example from link
all_configs = [
    {'a': 1, 'b': 3},
    {'a': 2, 'b': 2},
]
unique_keys = ['a', 'b']

for x in zip(*([c[k] for k in unique_keys] for c in all_configs)):
    print(x)

print('---')

for x in zip(*((c[k] for k in unique_keys) for c in all_configs)):
    print(list(x))
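If you run that snippet as-is, the list-comprehension version should print the original pairs, while the generator version prints the values from the last c both times:

(1, 2)
(3, 2)
---
[2, 2]
[2, 2]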
In the generator version there is a for loop inside another for loop.
The inner generator keeps a reference to c instead of the value of c, and it only looks that value up later.
Later (when the results are actually needed) execution starts with the outer generator, for c in all_configs. As the outer generator loops, it creates two inner generators that hold a reference to c, not the value of c, and the looping keeps changing the value of c. So in the end you have a "list" of two inner generators, and c holds {'a': 2, 'b': 2}.
After that the inner generators are executed, and they finally read the value of c, but by this point c already holds {'a': 2, 'b': 2}.
BTW: there is a similar problem with lambda in a for loop when you use it with Buttons in tkinter.
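A minimal illustration of that same late-binding effect with lambdas (not tied to tkinter):

funcs = [lambda: i for i in range(3)]
print([f() for f in funcs])          # [2, 2, 2]  - every lambda reads the final value of i
funcs = [lambda i=i: i for i in range(3)]
print([f() for f in funcs])          # [0, 1, 2]  - a default argument captures i at definition time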
I am trying to generate a sentence in the style of the Bible. But whenever I run it, it stops with a KeyError on the exact same word. This is confusing, as the code only uses its own keys, and it is the same word in the error every time despite the random.choice calls.
This is the txt file if you want to run it: ftp://ftp.cs.princeton.edu/pub/cs226/textfiles/bible.txt
import random

files = []
content = ""
output = ""
words = {}

files = ["bible.txt"]
sentence_length = 200

for file in files:
    file = open(file)
    content = content + " " + file.read()

content = content.split(" ")

for i in range(100):  # I didn't want to go through every word in the bible, so I'm just going through 100 words
    words[content[i]] = []
    words[content[i]].append(content[i + 1])

word = random.choice(list(words.keys()))
output = output + word

for i in range(int(sentence_length)):
    word = random.choice(words[word])
    output = output + word

print(output)
The KeyError happens on this line:
word = random.choice(words[word])
It always happens for the word "midst".
Why? "midst" is the word at index 100 in the text, and that is the first time it appears. The loop only creates keys for content[0] through content[99], so "midst" gets appended as a value but is never put in words as a key. Hence the KeyError.
Why does the program reach this word so fast? Partly because of a bug here:
for i in range(100):
    words[content[i]] = []
    words[content[i]].append(content[i+1])
The bug here is the words[content[i]] = [] statement: every time you see a word again, you recreate an empty list for it, discarding the successors collected so far. The word right before "midst" is "the". Because the list for "the" keeps getting reset, words["the"] ends up as just ["midst"]. And since "the" is very common, it appears in the successor lists of many other words, so the chain reaches "the" often and from there always jumps to "midst". That is why the problem happens so reliably, despite the randomness.
You can fix the bug in how words is built:
for i in range(100):
    if content[i] not in words:
        words[content[i]] = []
    words[content[i]].append(content[i+1])
Then, when you select words randomly, I suggest adding an if word in words check to handle the corner case where the chosen word never appears as a key (such as the last word taken from the input).
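A sketch of the generation loop with that guard, assuming you restart from a random key whenever the current word has no successors, could look like:

for i in range(int(sentence_length)):
    if word not in words:                         # e.g. a word that only ever appeared as a successor
        word = random.choice(list(words.keys()))  # restart the chain from a random key
    else:
        word = random.choice(words[word])
    output = output + " " + word                  # a space is added here for readability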
"midst" is the 101st word in your source text and it is the first time it shows up. When you do this:
words[content[i]].append(content[i+1])
you are adding a value to a key's list, but you aren't guaranteed that that value also exists as a key. So when you later use that value to look up a key, the key doesn't exist and you get a KeyError.
If you change your range to 101 instead of 100 you will see that your program almost works. That is because the 102nd word is "of" which has already occurred in your source text.
It's up to you how you want to deal with this edge case. You could do something like this:
if i == (100 - 1):
    words[content[i]].append(content[0])
else:
    words[content[i]].append(content[i+1])
which basically loops back around to the beginning of the source text when you get to the end.