Getting word count of doc/docx files in R - python

I have a stream of doc/docx documents that I need to get the word count of.
The procedure so far is to manually open the document and write down the word count offered by MS Word itself, and I am trying to automate it using R.
This is what I tried:
library(textreadr)
library(stringr)
myDocx = read_docx(myDocxFile)
docText = str_c(myDocx , collapse = " ")
wordCount = str_count(docText, "\\s+") + 1
Unfortunately, wordCount is NOT what MS Word suggests.
For example, I noticed that MS Word counts the numbers in numbered lists, whereas textreadr does not even import them.
Is there a workaround? I don't mind trying something in Python, too, although I'm less experienced there.
Any help would be greatly appreciated.

This can be done using the tidytext package in R.
library(textreadr)
library(tidytext)
library(dplyr)
#read in word file without password protection
x <- read_docx(myDocxFile)
#convert string to dataframe
text_df <- tibble(line = 1:length(x), text = x)
#tokenize dataframe to isolate separate words
words_df <- text_df %>%
  unnest_tokens(word, text)
#calculate number of words in passage
word_count <- nrow(words_df)

I tried reading the docx files with a different package (officer) and, even though it doesn't agree 100% with MS Word, it does significantly better this time.
Another small fix is to copy MS Word's strategy for deciding what counts as a word: the naive method of counting whitespace runs can be improved by not counting a free-standing "En Dash" (U+2013) as a word.
Here is my improved function:
getDocxWordCount = function(docxFile) {
  docxObject = officer::read_docx(docxFile)
  myFixedText = data.table::as.data.table(officer::docx_summary(docxObject))[nchar(stringr::str_trim(text)) > 1, stringr::str_trim(text)]
  wordBd = sapply(as.list(myFixedText), function(z) 1 + stringr::str_count(z, "\\s+([\u{2013}]\\s+)?"))
  return(sum(wordBd))
}
This still has a weakness that prevents 100% accuracy:
The officer package doesn't read list separators (like bullets or auto-generated numbers), but MS Word counts those as words. So for any list, this function currently returns X fewer words, where X is the number of listed items. I haven't experimented much with the attributes of the docxObject, but if it somehow holds the number of listed items, then a definite improvement can be made.
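Since the question mentions being open to a Python solution, here is a rough sketch of the same counting approach using the python-docx package (my own suggestion, not part of the answers above; "example.docx" is a placeholder path). Note that python-docx, like officer, does not expose auto-generated list numbers or bullets, so the count can still fall short of MS Word's for numbered lists.
import re
from docx import Document

def docx_word_count(path):
    # read the document body and count whitespace-separated tokens,
    # skipping free-standing en dashes the way the R function above does
    doc = Document(path)
    count = 0
    for para in doc.paragraphs:
        text = para.text.strip()
        if not text:
            continue
        tokens = [t for t in re.split(r"\s+", text) if t and t != "\u2013"]
        count += len(tokens)
    return count

print(docx_word_count("example.docx"))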

Related

Runtime Error (Python3) when you manipulate lists with very long strings

I wrote some Python 3 code to manipulate lists of strings, but it throws a runtime error for long strings. Here is my code for the problem:
string = "BANANA"
slist = list(string)
mark = list(range(len(slist)))
vowel_substrings = list()
consonants_substrings = list()
#print(mark)
for i in range(len(slist)):
    if slist[i] == 'A' or slist[i] == 'E' or slist[i] == 'I' or slist[i] == 'O' or slist[i] == 'U':
        mark[i] = 1
    else:
        mark[i] = 0
#print(mark)
for j in range(len(slist)):
    if mark[j] == 1:
        for l in range(j, len(string)):
            vowel_substrings.append(string[j:l+1])
            #print(string[j:l+1])
    else:
        for l in range(j, len(string)):
            consonants_substrings.append(string[j:l+1])
#print(consonants_substrings)
unique_consonants = list(set(consonants_substrings))
unique_vowels = list(set(vowel_substrings))
##add two lists
all_substrings = consonants_substrings + vowel_substrings
#print(all_substrings)
##Find points earned by vowel guy and consonant guy
vowel_guy_score = 0
consonant_guy_score = 0
for strng in unique_vowels:
    vowel_guy_score += vowel_substrings.count(strng)
for strng in unique_consonants:
    consonant_guy_score += consonants_substrings.count(strng)
#print(vowel_guy_score) #Kevin
#print(consonant_guy_score) #Stuart
if vowel_guy_score > consonant_guy_score:
    print("Kevin ", vowel_guy_score)
elif vowel_guy_score < consonant_guy_score:
    print("Stuart ", consonant_guy_score)
else:
    print("Draw")
This gives the right answer for the example above, but it fails for a long string like the one shown below.
NANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANAN
NANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANANNANAN
I think initialization or memory allocation might be a problem but I don't know how to allocate memory before even knowing how much memory the code will need. Thank you in advance for any help you can provide.
In the middle there, you generate a data structure of total size O(n³): there are roughly (number of starting positions × number of ending positions) substrings, each of length up to n. That's probably where your memory problems appear (you haven't posted a traceback).
One possible optimisation would be, instead of building lists of substrings and then generating sets from them, to use a collections.Counter. That lets you know how many times each substring appears without storing all the copies:
import collections

vowel_substrings = collections.Counter()
consonant_substrings = collections.Counter()
for j in range(len(slist)):
    if mark[j] == 1:
        for l in range(j, len(string)):
            vowel_substrings[string[j:l+1]] += 1
            #print(string[j:l+1])
    else:
        for l in range(j, len(string)):
            consonant_substrings[string[j:l+1]] += 1
Even better would be to calculate the scores as you go along, without storing any of the substrings. If I'm reading the code correctly, the substrings aren't actually used for anything — each letter is effectively scored based on its distance from the end of the string, and the scores are added up. This can be calculated in a single pass through the string, without making any additional copies or keeping track of anything other than the cumulative scores and the length of the string.
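Here is a minimal sketch of that single-pass idea (the player names mirror the original code, but this version is mine, not from the post above): a substring starting at index i can end at any of the len(string) - i later positions, so each starting letter contributes that many points to whichever player owns it.
string = "BANANA"
vowels = set("AEIOU")

vowel_guy_score = 0      # Kevin: substrings starting with a vowel
consonant_guy_score = 0  # Stuart: substrings starting with a consonant

for i, ch in enumerate(string):
    if ch in vowels:
        vowel_guy_score += len(string) - i
    else:
        consonant_guy_score += len(string) - i

if vowel_guy_score > consonant_guy_score:
    print("Kevin", vowel_guy_score)
elif vowel_guy_score < consonant_guy_score:
    print("Stuart", consonant_guy_score)
else:
    print("Draw")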

Parallelize a nested for loop in python for finding the max value

I've been struggling for some time to improve the execution time of this piece of code. Since the calculations are really time-consuming, I think the best solution would be to parallelize the code.
The output could also be stored in memory and written to a file afterwards.
I am new to both Python and parallelism, so I find it difficult to apply the concepts explained here and here. I also found this question, but I couldn't figure out how to adapt it to my situation.
I am working on a Windows platform, using Python 3.4.
for i in range(0, len(unique_words)):
    max_similarity = 0
    max_similarity_word = ""
    for j in range(0, len(unique_words)):
        if not i == j:
            similarity = calculate_similarity(global_map[unique_words[i]], global_map[unique_words[j]])
            if similarity > max_similarity:
                max_similarity = similarity
                max_similarity_word = unique_words[j]
    file_co_occurring.write(
        unique_words[i] + "\t" + max_similarity_word + "\t" + str(max_similarity) + "\n")
If you need an explanation for the code:
unique_words is a list of words (strings)
global_map is a dictionary whose keys are words (global_map.keys() contains the same elements as unique_words) and whose values are dictionaries of the format {word: value}, where these words are a subset of the words in unique_words
for each word, I look for the most similar word based on its value in global_map. I wouldn't prefer to store each similarity in memory since the maps already take too much.
calculate_similarity returns a value from 0 to 1
the result should contain the most similar word for each of the words in unique_words (the most similar word should be different from the word itself, which is why I added the condition if not i == j; this could also be done by checking whether max_similarity is different from 1)
if the max_similarity for a word is 0, it's OK if the most similar word is the empty string
Here is a solution that should work for you. I ended up changing a lot of your code, so please ask if you have any questions.
This is far from the only way to accomplish this, and in particular it is not a memory-efficient solution.
You will need to set max_workers to something that works for you. Usually the number of logical processors in your machine is a good starting point.
from concurrent.futures import ThreadPoolExecutor, Future
from itertools import permutations
from collections import namedtuple, defaultdict

Result = namedtuple('Result', ('value', 'word'))

def new_calculate_similarity(word1, word2):
    return Result(
        calculate_similarity(global_map[word1], global_map[word2]),
        word2)

with ThreadPoolExecutor(max_workers=4) as executor:
    futures = defaultdict(list)
    for word1, word2 in permutations(unique_words, r=2):
        futures[word1].append(
            executor.submit(new_calculate_similarity, word1, word2))
    for word in futures:
        # this will block until all calculations have completed for 'word'
        results = map(Future.result, futures[word])
        max_result = max(results, key=lambda r: r.value)
        print(word, max_result.word, max_result.value,
              sep='\t',
              file=file_co_occurring)
Here are the docs for the libraries I used:
Futures
collections
itertools

How to optimize the overall processing time in this code?

I have written code that takes a set of documents as a list and another set of words as a list. For each document, it checks whether any of the words from the word list occur in it, and builds a sentence from the words that are found.
import re

# find whether the whole word is in the sentence - returns None if it is not
def findWholeWord(w):
    return re.compile(r'\b({0})\b'.format(w), flags=re.IGNORECASE).search

for data in dataset['abc']:
    mvc = ''
    for x in newdataset['Words']:
        y = findWholeWord(x)(data)
        if y != None:
            mvc = mvc + " " + x
    document.append(mvc)
When I run this code for 10,000 documents with an average word count of 10, it takes a very long time. How can I optimize this code, or what alternatives would achieve the same functionality?
Since you just want to check if a word exists in the set of abc, you don't need to use re.
for raw_data in dataset['abc']:
    data = raw_data.lower()
    mvc = ''
    for x in newdataset['Words']:
        if x.lower() in data:
            mvc = mvc + " " + x
    document.append(mvc)
Are you sure it is this code that is slow? I am not. I suspect most of the time is spent opening files. You need to profile your code, as Will says. You can also use multiprocessing to improve the speed of your code, as sketched below.
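Here is a minimal multiprocessing sketch, assuming the documents are already loaded into a list called documents and the word list is a module-level list called words (both names are placeholders, not from the original post):
from multiprocessing import Pool

def extract_words(doc):
    # return the words from `words` that occur in a single document
    text = doc.lower()
    return " ".join(w for w in words if w.lower() in text)

if __name__ == "__main__":
    with Pool() as pool:  # defaults to one worker process per CPU core
        document = pool.map(extract_words, documents)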

Python: an open-source list of words by valence or categories for comparison

I tend to take notes quite regularly and since the great tablet revolution I've been taking them electronically. I've been trying to see if I can find any patterns in the way I take notes. So I've put together a small hack to load the notes and filter out proper nouns and fluff to leave a list of key words I employ.
import os
import re
dr = os.listdir('/home/notes')
dr = [i for i in dr if re.search('.*txt$',i)]
ignore = ['A','a','of','the','and','in','at','our','my','you','your','or','to','was','will','because','as','also','is','eg','e.g.','on','for','Not','not']
words = set()
d1 = open('/home/data/en_GB.dic','r')
dic = d1.read().lower()
dic = re.findall('[a-z]{2,}',dic)
sdic = set(dic)
for i in dr:
    a = open(os.path.join('/home/notes', i), 'r')
    atmp = a.read()
    atmp = atmp.lower()
    atmp = re.findall('[a-z]{3,}', atmp)
    atmp = set(atmp)
    atmp.intersection_update(sdic)
    atmp.difference_update(set(ignore))
    words.update(atmp)
    a.close()
words = sorted(words)
I now have a list of about 15,000 words I regularly use while taking notes. It would be a little unmanageable to sort by hand, and I wondered if there is an open-source library of positive-negative-neutral, optimistic-pessimistic-indifferent, or some other form of word list along a meaning scale that I could run the word list through.
In a perfect scenario I would also be able to run it through some kind of thesaurus, so I could group the words into meaning clusters and get a high-level view of what sense terms I've been employing most.
Does anyone know if there are any such lists out there and if so, how would I go about employing them in Python?
Thanks
I found a list of words used for sentiment analysis of Twitter at: http://alexdavies.net/twitter-sentiment-analysis/
It includes example Python code for how to use it.
See also: Sentiment Analysis Dictionaries
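If it helps, here is a minimal sketch of how such a list could be applied to the words collected by the script in the question (the file names are placeholders for whichever lexicon you download, and the format is assumed to be one word per line):
def load_word_list(path):
    # one lexicon word per line, lowercased for comparison
    with open(path, 'r') as f:
        return set(line.strip().lower() for line in f if line.strip())

positive = load_word_list('positive-words.txt')  # placeholder file name
negative = load_word_list('negative-words.txt')  # placeholder file name

# `words` is the sorted list built by the script in the question
positive_hits = [w for w in words if w in positive]
negative_hits = [w for w in words if w in negative]
neutral_hits = [w for w in words if w not in positive and w not in negative]

print(len(positive_hits), len(negative_hits), len(neutral_hits))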

Lucene: Fastest way to return the document occurrence of a phrase?

I am trying to use Lucene (actually PyLucene!) to find out how many documents contain my exact phrase. My code currently looks like this... but it runs rather slow. Does anyone know a faster way to return document counts?
phraseList = ["some phrase 1", "some phrase 2"] #etc, a list of phrases...
countsearcher = IndexSearcher(SimpleFSDirectory(File(STORE_DIR)), True)
analyzer = StandardAnalyzer(Version.LUCENE_CURRENT)
for phrase in phraseList:
    query = QueryParser(Version.LUCENE_CURRENT, "contents", analyzer).parse("\"" + phrase + "\"")
    scoreDocs = countsearcher.search(query, 200).scoreDocs
    print "count is: " + str(len(scoreDocs))
Typically, writing a custom hit collector that counts hits with a bitset is the fastest way, as illustrated in the javadoc of Collector.
Another method is to get a TopDocs with the number of results specified as one:
TopDocs topDocs = searcher.search(query, filter, 1);
topDocs.totalHits will give you the total number of results. I'm not sure whether this is as fast, since it involves calculating scores, which the aforementioned method skips.
These solutions are for Java; you will have to check for the equivalent technique in Python.
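For what it's worth, the TopDocs variant translates fairly directly to PyLucene; here is a rough sketch using the same countsearcher and analyzer from the question (unverified against a specific PyLucene version):
for phrase in phraseList:
    query = QueryParser(Version.LUCENE_CURRENT, "contents", analyzer).parse('"' + phrase + '"')
    topDocs = countsearcher.search(query, 1)  # fetch only one document
    print("count is: " + str(topDocs.totalHits))  # totalHits still counts all matches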
