I have a vector of strings (phrases with several words).
For reasons outside the scope of this question, I need to comply with a length limit of N characters per string.
The very first thing I thought of was to slice each string, but unfortunately the result of the operation will be facing the end user (the end users will have to read the truncated strings and make sense of them).
That means that I can't just slice the strings, because if I did so, the following:
This is a simple test with FOO
This is a simple test with BAR
will be converted to
This is a simple te...
This is a simple te...
Meaning that data will be lost and the users won't be able to distinguish between the two strings.
After thinking a little bit more, I figured the best possible solution is to abbreviate as few characters of as few words as possible, always in accordance with the max length constraint.
With such a behaviour the previous example would be converted to
This is a sim. te. with FOO
This is a sim. te. with BAR
I figured I'd ask here for an alternative/better solution before coding this.
Also, if there isn't any better alternative, what things should I keep in mind while implementing this? Can you give me any tips?
I have a few thoughts... which may or may not meet your needs. To begin, here are some additional forms of abbreviation that you may be able to programmatically implement.
Remove Vowels
If you remove vowels, you may be able to abbreviate words within the desired lengths while staying slightly more readable. Removing vowels is an accepted form of abbreviation. Keep in mind that you will need to keep the first and last letter of each word even if they are vowels. For example: organization = orgnztn
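For instance, here's a minimal sketch of interior-vowel removal (the shorten_word name and the length cutoff are just illustrative assumptions):

def shorten_word(word):
    # keep the first and last letters; drop interior vowels
    if len(word) <= 3:
        return word
    interior = [c for c in word[1:-1] if c.lower() not in "aeiou"]
    return word[0] + "".join(interior) + word[-1]

print(shorten_word("organization"))  # -> orgnztn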
Use Abbreviation API
https://Abbreviations.com has an API with abbreviations. This might be useful for abbreviating longer words. For example, to find the abbreviation of "organization": https://www.abbreviations.com/abbreviation/organization, which abbreviates it as ORG
It appears this user has attempted to do this in Python. If you know you will have frequent phrases, you can create a dictionary of their abbreviated forms.
Unfortunately, no matter where you truncate the data, there is a chance that two strings will end up looking the same to the end user. You could do some string comparison to determine where the differences are, then write some logic to truncate characters in other locations.
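As a rough sketch of that last idea (assuming just two strings and a hypothetical limit of 22 characters), you could keep the differing tail intact and shorten the shared prefix instead:

import os

def truncate_keep_diff(a, b, limit):
    # length of the prefix the two strings share
    common = len(os.path.commonprefix([a, b]))
    results = []
    for s in (a, b):
        if len(s) <= limit:
            results.append(s)
            continue
        tail = s[common:]                 # the part that distinguishes the strings
        head = s[:limit - len(tail) - 3]  # leave room for "..." plus the tail
        results.append(head + "..." + tail)
    return results

print(truncate_keep_diff("This is a simple test with FOO",
                         "This is a simple test with BAR", 22))
# -> ['This is a simple...FOO', 'This is a simple...BAR']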
Related
I have about 5 million documents. A document is composed of many sentences, and may be about one to five pages long. Each document is a text file.
I have to find the most common sentences / phrases (at least 5 words long) among all the documents. How should I achieve this?
For exactly 5-word-long phrases, this is relatively simple Python (which may require lots of memory). For variable-length longer phrases, it's a bit more complicated, and it may need additional clarification about what kinds of longer phrases you'd want to find.
For the 5-word (aka '5-gram') case:
In one pass over the corpus, you generate all 5-grams and tally their occurrences (say, into a Counter), then report the top-N.
For example, let's assume docs is a Python sequence of all your tokenized texts, where each individual item is a list-of-string-words. Then roughly:
from collections import Counter

ngram_size = 5
tallies = Counter()
for doc in docs:
    for i in range(len(doc) - ngram_size + 1):
        ngram = tuple(doc[i:i + ngram_size])
        tallies[ngram] += 1

# show the 10 most-common n-grams
print(tallies.most_common(10))
If you then wanted to also consider variable-length longer phrases, it's a little trickier, but note that any such phrase would have to start with one of the 5-grams you'd already found.
So you could consider gradually repeating the above, for 6-grams, 7-grams, etc.
But to optimize for memory/effort, you could add a step to ignore all n-grams that don't already start with one of the top-N candidates you chose from an earlier run. (For example, in a 6-gram run, the += line above would be conditional on the 6-gram starting with one of the few 5-grams you've already considered to be of interest.)
And further, you'd stop looking for ever-longer n-grams when (for example) the count of the top 8-grams is already below the relevant top-N counts of shorter n-grams. (That is, when any further longer n-grams are assured of being less frequent than your top-N of interest.)
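Here's a hedged sketch of that filtering step for 6-grams, reusing docs and tallies from above (the top-10 cutoff is just an illustrative choice):

from collections import Counter

# the 5-gram candidates kept from the previous pass
top_5grams = {ngram for ngram, count in tallies.most_common(10)}

six_gram_tallies = Counter()
for doc in docs:
    for i in range(len(doc) - 6 + 1):
        six_gram = tuple(doc[i:i + 6])
        # only count 6-grams that extend an already-interesting 5-gram
        if six_gram[:5] in top_5grams:
            six_gram_tallies[six_gram] += 1

print(six_gram_tallies.most_common(10))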
What is the benefit of using a list to represent a string, outside of the fact that it's mutable? Does it have better time complexity if used inside of a class?
Strings are immutable in Python, so it is a common practice to convert a string into a list of single-character strings, perform multiple index-based modifications, and then join the characters back into a string. Without such a conversion, one would have to resort to slicing the string and joining the fragments into a new string, with each modification costing O(n) rather than the O(1) of an in-place list modification.
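For example (a toy illustration, not from any particular codebase):

s = "hello world"

chars = list(s)          # one O(n) conversion up front
chars[0] = "H"           # each in-place edit is O(1)
chars[6] = "W"
s = "".join(chars)       # one O(n) join at the end
print(s)                 # Hello World

# versus rebuilding the string for every single edit, O(n) each time:
s2 = "hello world"
s2 = s2[:0] + "H" + s2[1:]
s2 = s2[:6] + "W" + s2[7:]
print(s2)                # Hello World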
I wouldn't say it has better time complexity; its big-O performance should be the same. But when you're indexing a string, Python needs to generate a new single-character string every time. For a list, it will just hand you a reference to the already existing single-character string.
I have millions of strings and I want to check the locations of exact matches of each string in another collection of billions of strings. What is the most efficient way to do this in Python?
There are multiple answers to this question.
You could use a hash function and traverse the whole human genome, trying to match the hash of each window the length of your sequence against the hash of your sequence; if they match, you have found your sequence at that index. The Rabin-Karp algorithm is O(n) on average, where n is the size of the human genome. Take special care that your sequence is not so long that the hash overflows your integers (or use a rolling hash with a modulus).
You could use a refinement of the brute-force approach to string matching, developed by James H. Morris, Vaughan Pratt and Donald Knuth. The Knuth-Morris-Pratt algorithm starts checking for a match at each index, and whenever it fails it consults a precomputed table that tells it where to resume matching. It is O(n) as well and has a better worst-case complexity than RK (read this article on Wikipedia).
You could use the Boyer-Moore algorithm, which is in a similar spirit to the previous algorithm. It precomputes some shift tables and then compares the pattern from right to left at each alignment, conveniently skipping over parts of the text. It is typically sublinear in practice, and with the standard refinements its worst case is linear, again better than RK's worst case (read the same article on Wikipedia).
I suggest using the Rabin-Karp algorithm, as it seems easier to grasp to me (however, I could be biased: NIH bias).
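For reference, here is a minimal single-pattern Rabin-Karp sketch; the base and modulus are illustrative choices, and a real multi-pattern version would hash all your query sequences into a set or dict first:

def rabin_karp(pattern, text, base=256, mod=1_000_000_007):
    m, n = len(pattern), len(text)
    if m > n:
        return []
    high = pow(base, m - 1, mod)          # weight of the leading character
    p_hash = t_hash = 0
    for i in range(m):                    # hash the pattern and the first window
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        t_hash = (t_hash * base + ord(text[i])) % mod
    hits = []
    for i in range(n - m + 1):
        # on a hash match, verify to rule out collisions
        if p_hash == t_hash and text[i:i + m] == pattern:
            hits.append(i)
        if i < n - m:                     # roll the window one character forward
            t_hash = ((t_hash - ord(text[i]) * high) * base + ord(text[i + m])) % mod
    return hits

print(rabin_karp("GATTACA", "ACGATTACATTGATTACA"))  # -> [2, 11]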
I am using NLTK in Python. I understand that it uses regular expressions in its word tokenization functions, such as TreebankWordTokenizer.tokenize(), but it uses trained models (pickle files) for sentence tokenization. I don't understand why they don't use trained models for word tokenization. Does this imply that sentence tokenization is a harder task?
I'm not sure if you can say that sentence splitting is harder than (word) tokenisation. But tokenisation depends on sentence splitting, so errors in sentence splitting will propagate to tokenisation. Therefore you'd want to have reliable sentence splitting, so that you don't have to make up for it in tokenisation. And it turns out that once you have good sentence splitting, tokenisation works pretty well with regexes.
Why is that? – One of the major ambiguities in tokenisation (in Latin script languages, at least) is the period ("."): It can be a full stop (thus a token of its own), an abbreviation mark (belonging to that abbreviation token), or something special (like part of a URL, a decimal fraction, ...). Once the sentence splitter has figured out the first case (full stops), the tokeniser can concentrate on the rest. And identifying stuff like URLs is exactly what you would use a regex for, isn't it?
The sentence splitter's main job, on the other hand, is to find abbreviations with a period. You can create a list for that by hand – or you can train it on a big text collection. The good thing is, it's unsupervised training – you just feed in the plain text, and the splitter collects abbreviations. The intuition is: If a token almost always appears with a period, then it's probably an abbreviation.
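To see that division of labour concretely, here is a minimal sketch (assuming NLTK and its pre-trained 'punkt' sentence model are available):

import nltk
from nltk.tokenize import sent_tokenize, TreebankWordTokenizer

# nltk.download('punkt')  # the unsupervised, pre-trained Punkt sentence model

text = "Dr. Smith arrived at 10 a.m. today. He stayed for 3.5 hours."
word_tokenizer = TreebankWordTokenizer()

for sentence in sent_tokenize(text):          # trained model should handle "Dr.", "a.m."
    print(word_tokenizer.tokenize(sentence))  # regexes handle the rest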
I want to write a Python 3 script to manage my expenses, and I'm going to have a rules filter that says 'if the description contains a particular string, categorize it as x', and these rules will be read in from a text file.
The only way I can think of doing this is to apply str.find() for each rule on the description of each transaction and break if a match is found, but this is an O(rules × transactions) solution; is there a better way of doing this?
Strip punctuation from the description, and split it into words. Make the words in the description into one set, and the rule keywords (the trigger strings from your text file) into another set.
Since sets, like dictionaries, are built on hash tables, average membership checking is O(1).
Only when a transaction is entered (or changed), intersect both sets to find the keywords that match (if any), map them to their categories, and add the categories to your transaction record (dict, namedtuple, whatever).
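A small sketch of that approach; the rules dict mapping keyword to category is a made-up example:

import string

rules = {"starbucks": "coffee", "shell": "fuel", "netflix": "entertainment"}
keywords = set(rules)

def categorize(description):
    # strip punctuation, lowercase, and split into a set of words
    cleaned = description.lower().translate(str.maketrans("", "", string.punctuation))
    words = set(cleaned.split())
    # set intersection; average O(1) membership checks per word
    return {rules[k] for k in words & keywords}

print(categorize("STARBUCKS #1234, Seattle WA"))  # -> {'coffee'}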