How to optimize the overall processing time in this code? - python

I have written code that takes a set of documents as a list and another set of words as a list. For each document, it checks whether any word from the word list occurs in the document and builds a sentence from the words that are found.
import re

# Return a compiled regex's search function for whole-word, case-insensitive matching;
# calling it on a string returns None when the word is not present.
def findWholeWord(w):
    return re.compile(r'\b({0})\b'.format(w), flags=re.IGNORECASE).search

document = []
for data in dataset['abc']:
    mvc = ''
    for x in newdataset['Words']:
        y = findWholeWord(x)(data)
        if y is not None:
            mvc = mvc + " " + x
    document.append(mvc)
When I run this code on 10,000 documents with an average word count of 10, it takes a very long time. How can I optimize this code, or what alternatives achieve the same functionality?

Since you just want to check whether a word occurs in each entry of abc, you don't need to use re.
for raw_data in dataset['abc']:
    data = raw_data.lower()
    mvc = ''
    for x in newdataset['Words']:
        if x in data:  # simple substring test instead of a regex search per word
            mvc = mvc + " " + x
    document.append(mvc)
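Note that a plain substring test also matches inside longer words. If the whole-word behaviour of the original regex matters, a middle ground is to tokenize each document once and test set membership; a minimal sketch, assuming dataset['abc'] holds strings and newdataset['Words'] holds single words as in the question:

import re

token_re = re.compile(r'\w+')
document = []
for data in dataset['abc']:
    # split each document into lowercase word tokens once
    doc_words = set(token_re.findall(data.lower()))
    matched = [x for x in newdataset['Words'] if x.lower() in doc_words]
    document.append(' '.join(matched))

Because each document is tokenized only once, every membership test is a constant-time set lookup instead of a regex scan over the whole document.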

Are you sure that it is this code that is slow? I am not. I think most of the time is spent opening files. You need to profile your code, as Will says. You can also use multiprocessing to speed it up.

Related

Using python-docx to iterate over a document and change text

The Problem:
I am supposed to create a Word file with 250x5 placeholders.
The background is akin to a database export into a Word document.
So I created the first placeholder block:
Name_1
Adress_1
City_1
occupation_1
zipcode_1
and copied the block 250 times.
Now I need the number to be different for each block, so the second of the 250 blocks would read
Name_2
Adress_2
...etc.
I thought to myself, surely this is faster to automate than to individually type 1000+ numbers.
But here I am, stuck, because my Python expertise is very limited.
What I have so far:
from docx import Document
import os

document = Document("Mapping.docx")
paragraph = document.paragraphs[10]
print(len(document.paragraphs))
print(document.paragraphs[1].text)

def replaceNumbers(ReplaceNumber):
    number = 0
    for i in range(len(document.paragraphs)):
        if number == 5:
            replaceNumbers(ReplaceNumber + 1)
            break
        else:
            y = document.paragraphs[i].text
            if "1" in y:
                document.paragraphs[i].text = document.paragraphs[i].text.replace("1", str(ReplaceNumber))
                number = number + 1
    return

if __name__ == '__main__':
    replaceNumbers(2)
    document.save("edited.docx")
Now I am guessing there is a problem in my usage of the loop, as the program gets stuck, but manually debugging seems impossible with so many imported libraries.
I hope someone can help me here.
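A minimal non-recursive sketch, assuming every placeholder in the copied blocks still ends in the literal suffix "_1" (so only that suffix needs to change):

from docx import Document

document = Document("Mapping.docx")

block_size = 5  # Name, Adress, City, occupation, zipcode
# collect only the paragraphs that still carry the "_1" placeholder suffix
placeholders = [p for p in document.paragraphs if p.text.strip().endswith("_1")]

for index, paragraph in enumerate(placeholders):
    block_number = index // block_size + 1  # 1 for the first block, 2 for the second, ...
    paragraph.text = paragraph.text.replace("_1", "_" + str(block_number))

document.save("edited.docx")

Replacing only the "_1" suffix leaves the first block unchanged and avoids rewriting digits that were already renumbered.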

Getting word count of doc/docx files in R

I have a stream of doc/docx documents that I need to get the word count of.
The procedure so far is to manually open the document and write down the word count offered by MS Word itself, and I am trying to automate it using R.
This is what I tried:
library(textreadr)
library(stringr)
myDocx = read_docx(myDocxFile)
docText = str_c(myDocx, collapse = " ")
wordCount = str_count(docText, "\\s+") + 1
Unfortunately, wordCount is NOT what MS Word suggests.
For example, I noticed that MS Word counts the numbers in numbered lists, whereas textreadr does not even import them.
Is there a workaround? I don't mind trying something in Python, too, although I'm less experienced there.
Any help would be greatly appreciated.
This can be done using the tidytext package in R.
library(textreadr)
library(tidytext)
library(dplyr)
#read in word file without password protection
x <- read_docx(myDocxFile)
#convert string to dataframe
text_df <- tibble(line = 1:length(x), text = x)
#tokenize dataframe to isolate separate words
words_df <- text_df %>%
  unnest_tokens(word, text)
#calculate number of words in passage
word_count <- nrow(words_df)
I tried reading the docx files with a different library (officer) and, even though it doesn't agree with MS Word 100%, it does significantly better this time.
Another small fix would be to copy MS Word's strategy for what counts as a word and what doesn't. The naive method of counting all spaces can be improved by also ignoring the "En Dash" (U+2013) character.
Here is my improved function:
getDocxWordCount = function(docxFile) {
  # requires the officer, data.table and stringr packages
  docxObject = officer::read_docx(docxFile)
  myFixedText = as.data.table(officer::docx_summary(docxObject))[nchar(str_trim(text)) > 1, str_trim(text)]
  wordBd = sapply(as.list(myFixedText), function(z) 1 + str_count(z, "\\s+([\u{2013}]\\s+)?"))
  return(sum(wordBd))
}
This still has a weakness that prevents 100% accuracy:
The officer library doesn't read list separators (like bullets or hyphens), but MS Word counts those as words. So for any list, this function currently returns X fewer words, where X is the number of listed items. I haven't experimented much with the attributes of the docxObject, but if it somehow holds the number of listed items, then a definite improvement can be made.
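Since the question mentions being open to Python, here is a rough equivalent sketch using python-docx (the file name is a placeholder). It will not match MS Word's count exactly either; for example, document.paragraphs does not include text inside tables, and list bullet characters are not counted:

from docx import Document
import re

def docx_word_count(path):
    doc = Document(path)
    text = "\n".join(p.text for p in doc.paragraphs)
    # count maximal runs of non-whitespace characters as words
    return len(re.findall(r"\S+", text))

print(docx_word_count("my_document.docx"))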

Extract name entities from web domain address

I am working on an NLP problem (in Python 2.7) to extract the location of a news report from the text inside the report. For this task I am using the Clavin API which works well enough.
However I've noticed that the name of the location area is often mentioned in the URL of the report itself and I'd like to find a way to extract this entity from a domain name, to increase the level of accuracy from Clavin by providing an additional named entity in the request.
In an ideal world I'd like to be able to give this input:
www.britainnews.net
and return this, or a similar, output:
[www,britain,news,net]
Of course I can use .split('.') to separate the unimportant www and net tokens; however, I'm stumped as to how to split the middle phrase without an intensive dictionary lookup.
I'm not asking for someone to solve this problem or write any code for me - but this is an open call for suggestions as to the ideal NLP library (if one exists) or any ideas as to how to solve this problem.
Check out the Word Segmentation Task from Norvig's work.
from __future__ import division
from collections import Counter
import re, nltk

WORDS = nltk.corpus.reuters.words()
COUNTS = Counter(WORDS)

def pdist(counter):
    "Make a probability distribution, given evidence from a Counter."
    N = sum(counter.values())
    return lambda x: counter[x]/N

P = pdist(COUNTS)

def Pwords(words):
    "Probability of words, assuming each word is independent of others."
    return product(P(w) for w in words)

def product(nums):
    "Multiply the numbers together. (Like `sum`, but with multiplication.)"
    result = 1
    for x in nums:
        result *= x
    return result

def splits(text, start=0, L=20):
    "Return a list of all (first, rest) pairs; start <= len(first) <= L."
    return [(text[:i], text[i:])
            for i in range(start, min(len(text), L)+1)]

def segment(text):
    "Return a list of words that is the most probable segmentation of text."
    if not text:
        return []
    else:
        candidates = ([first] + segment(rest)
                      for (first, rest) in splits(text, 1))
        return max(candidates, key=Pwords)

print segment('britainnews')  # ['britain', 'news']
More examples at: Word Segmentation Task
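A minimal sketch of how segment could be wired up for the domain in the question; segment_domain is a hypothetical helper, and the list of tokens to drop is just an example:

def segment_domain(domain):
    # drop the unimportant prefix/suffix tokens, then segment what remains
    parts = domain.lower().split('.')
    keep = [p for p in parts if p not in ('www', 'net', 'com', 'org')]
    return [word for part in keep for word in segment(part)]

print segment_domain('www.britainnews.net')  # expected: ['britain', 'news'], as above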

Parallelize a nested for loop in python for finding the max value

I've been struggling for some time to improve the execution time of this piece of code. Since the calculations are really time-consuming, I think the best solution would be to parallelize the code.
The output could be also stored in memory, and written to a file afterwards.
I am new to both Python and parallelism, so I find it difficult to apply the concepts explained here and here. I also found this question, but I couldn't manage to figure out how to implement the same for my situation.
I am working on a Windows platform, using Python 3.4.
for i in range(0, len(unique_words)):
    max_similarity = 0
    max_similarity_word = ""
    for j in range(0, len(unique_words)):
        if not i == j:
            similarity = calculate_similarity(global_map[unique_words[i]], global_map[unique_words[j]])
            if similarity > max_similarity:
                max_similarity = similarity
                max_similarity_word = unique_words[j]
    file_co_occurring.write(
        unique_words[i] + "\t" + max_similarity_word + "\t" + str(max_similarity) + "\n")
If you need an explanation for the code (a small made-up example of these structures follows this list):
unique_words is a list of words (strings)
global_map is a dictionary whose keys are words (global_map.keys() contains the same elements as unique_words) and whose values are dictionaries of the format {word: value}, where the words are a subset of unique_words
for each word, I look for the most similar word based on its value in global_map. I'd prefer not to store every similarity in memory since the maps already take up too much.
calculate_similarity returns a value from 0 to 1
the result should contain the most similar word for each of the words in unique_words (the most similar word should be different from the word itself, which is why I added the condition if not i == j; this could also be done by checking whether max_similarity is different from 1)
if the max_similarity for a word is 0, it's OK if the most similar word is the empty string
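For concreteness, a tiny hypothetical example of the data shapes described above (all names and numbers are made up; this calculate_similarity is only a placeholder for whatever similarity function is actually used):

unique_words = ["apple", "banana", "cherry"]
global_map = {
    "apple":  {"banana": 0.4, "cherry": 0.1},
    "banana": {"apple": 0.4},
    "cherry": {"apple": 0.1},
}

def calculate_similarity(map1, map2):
    # placeholder: any function returning a value between 0 and 1
    shared = set(map1) & set(map2)
    return len(shared) / max(len(set(map1) | set(map2)), 1)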
Here is a solution that should work for you. I ended up changing a lot of your code so please ask if you have any questions.
This is far from the only way to accomplish this, and in particular this is not a memory efficient solution.
You will need to set max_workers to something that works for you. Usually the number of logical processors in your machine is a good starting point.
from concurrent.futures import ThreadPoolExecutor, Future
from itertools import permutations
from collections import namedtuple, defaultdict

Result = namedtuple('Result', ('value', 'word'))

def new_calculate_similarity(word1, word2):
    return Result(
        calculate_similarity(global_map[word1], global_map[word2]),
        word2)

with ThreadPoolExecutor(max_workers=4) as executer:
    futures = defaultdict(list)
    for word1, word2 in permutations(unique_words, r=2):
        futures[word1].append(
            executer.submit(new_calculate_similarity, word1, word2))

    for word in futures:
        # this will block until all calculations have completed for 'word'
        results = map(Future.result, futures[word])
        max_result = max(results, key=lambda r: r.value)
        print(word, max_result.word, max_result.value,
              sep='\t',
              file=file_co_occurring)
Here are the docs for the libraries I used:
Futures
collections
itertools

Python an open-source list of words by valence or categories for comparison

I tend to take notes quite regularly and since the great tablet revolution I've been taking them electronically. I've been trying to see if I can find any patterns in the way I take notes. So I've put together a small hack to load the notes and filter out proper nouns and fluff to leave a list of key words I employ.
import os
import re

dr = os.listdir('/home/notes')
dr = [i for i in dr if re.search('.*txt$', i)]
ignore = ['A', 'a', 'of', 'the', 'and', 'in', 'at', 'our', 'my', 'you', 'your', 'or', 'to', 'was', 'will', 'because', 'as', 'also', 'is', 'eg', 'e.g.', 'on', 'for', 'Not', 'not']
words = set()

d1 = open('/home/data/en_GB.dic', 'r')
dic = d1.read().lower()
dic = re.findall('[a-z]{2,}', dic)
sdic = set(dic)

for i in dr:
    a = open(os.path.join('/home/notes', i), 'r')
    atmp = a.read()
    atmp = atmp.lower()
    atmp = re.findall('[a-z]{3,}', atmp)
    atmp = set(atmp)
    atmp.intersection_update(sdic)
    atmp.difference_update(set(ignore))
    words.update(atmp)
    a.close()

words = sorted(words)
I now have a list of about 15,000 words I regularly use while taking notes. It would be a little unmanageable to sort by hand, and I wondered whether there is an open-source word list along some meaning scale
(positive-negative-neutral, optimistic-pessimistic-indifferent, or similar) that I could run my word list through.
In a perfect scenario I would also be able to run it through some kind of thesaurus, so I could group the words into meaning clusters and get a high-level view of which sense terms I've been employing most.
Does anyone know if there are any such lists out there and, if so, how would I go about employing them in Python?
Thanks
I found a list of words used for sentiment analysis of Twitter at: http://alexdavies.net/twitter-sentiment-analysis/
It includes example Python code for how to use it.
See also: Sentiment Analysis Dictionaries
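A minimal sketch of how such a lexicon could be applied to the words list built above, assuming a hypothetical file with one word<TAB>score pair per line (adjust the path and parsing to whatever format the downloaded list actually uses):

# load a hypothetical word<TAB>score sentiment lexicon
valence = {}
with open('/home/data/sentiment_lexicon.txt', 'r') as f:
    for line in f:
        word, score = line.strip().split('\t')
        valence[word] = float(score)

# bucket the note words by the sign of their valence score
positive = [w for w in words if valence.get(w, 0) > 0]
negative = [w for w in words if valence.get(w, 0) < 0]
neutral = [w for w in words if w in valence and valence[w] == 0]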
