I'm new to python (and coding in general), and I ran into an issue while doing my first assignment for school. We need to implement several simple text analysis techniques, and after hours of hitting my head on my keyboard, I figured I better ask for some pointers here.
The problem lies within one of the tasks. I'm supposed to find the number of words per sentence in a given text, and then print out the number of words per sentence from the longest to the shortest. Thus far, I have been able to figure out how to find the longest and shortest sentence (and even the second longest sentence). However, I'm stuck on how to find the second shortest sentence, or the third longest, and so on.
My code looks like this:
length = sentences.split(". ")
tokenized_sentences = [sentence.split(" ") for sentence in length]
longest_sen = max(tokenized_sentences, key=len)
longest_sen_length = len(longest_sen)
shortest_sen = min(tokenized_sentences, key=len)
shortest_sen_length = len(shortest_sen)
print("The longest sentence is", (longest_sen_length), "words.")
print("The shortest sentence is", (shortest_sen_length), "words.")
I'm aware that the code is not robust, and I could save a lot of time using nltk or re. However, the paragraph isn't very long or complex, and I'm not certain my professor would be a-ok with me using an additional platform at this point.
Any pointers would be highly appreciated!
Edit: An example of the text: "Once more. Say, you are in the country, in some high land of lakes. Take almost any path you please, and ten to one it carries you down in a dale, and leaves you there by a pool in the stream. There is magic in it. Let the most absentminded of men be plunged in his deepest reveries--stand that man on his legs, set his feet a-going, and he will infallibly lead you to water, if water there be in all that region. Should you ever be athirst in the great American desert, try this experiment, if your caravan happen to be supplied with a metaphysical professor. Yes, as every one knows, meditation and water are wedded for ever."
If we assume that your entire text is stored in a variable named sentences, then you can do the following, which gives you the sentence lengths sorted in descending order.
l = sentences.split('. ')
m = [len(i.split()) for i in l]
m.sort(reverse=True)
You will then have all the sentence lengths, and you can play around with what you want to print.
You can use the sort() method to sort the list by length; since you want descending order, just pass reverse=True.
tokenized_sentences.sort(key=len, reverse=True)
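After that sort, printing every sentence length from longest to shortest could look like this (a small sketch reusing the variable names above):
for rank, sentence in enumerate(tokenized_sentences, start=1):
    print("Sentence", rank, "has", len(sentence), "words.")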
So I'm trying to match a regular expression to a paragraph in order to do sentiment analysis, but tqdm is saying this could take about 300 hours. I was wondering if anyone has a critique on what I could do to improve the way my RE functions
I'm trying to match stem endings to negative words for this analysis. Here is a small snippet of the expression for the match. I'm only showing a small snippet because the entire expression contains about 2800 terms, and is set up entirely the same all the way through, hence the ellipses.
regex_neg = "((a lie)|(abandon)|(abas)|(abattoir)|(abdicat)|(aberra)|(abhor)|(abject)|(abnormal)|(abolish)|(abominab)|(abominat)|(abrasiv)|(absent)|(abstrus)|(absurd)|(abus)|(accident)|(accost)|(accursed)|(accusation)|(accuse)|(accusing)|(acerbi)|(ache)|(aching)|(achy)|(acomia)|(acrimon)|(adactylism)|(addict)|(admonish)|(admonition)|(adulterat)|(adultery)|(advers)|(affectation)|(affected)|(affected manner)|(afflict)|(affright)...)"
Here is the function that I'm using to match the stems in the paragraphs
def neg_stems(paragraph):
    stem_list = []
    i = " ".join(paragraph)
    for n in re.finditer(regex_neg, i):
        if n.group():
            stem_list.append(n.group())
    return json.dumps(stem_list)
And finally, here is just the general output that I'm getting
neg_stems(["the king abdicated the throne in an argument where he was angry, but his son was pretty happy about it","I hate cats but love hedgehogs"])
> ["abdicat", "argument", "anger", "hate"]
I'm just trying to count the number of negative words as defined by the semantic dictionary in regex_neg, but ~300 hours is just way too long, and even then, that's simply an estimate.
Does anyone have a suggestion on what I could do to try and speed this process up?
Thank you in advance!
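One direction worth exploring (a hedged sketch, not the original code): since most alternatives in regex_neg are plain single-word stems, you can test each word's prefixes against a set instead of running one huge alternation over the whole paragraph. The names NEG_STEMS and find_neg_stems below are made up, the tiny stem set only stands in for the ~2800 real entries, and multi-word entries like "a lie" would still need separate handling.
import json

NEG_STEMS = {"abandon", "abdicat", "abhor", "argument", "hate"}  # stand-in for the real stem list
MAX_STEM_LEN = max(len(s) for s in NEG_STEMS)

def find_neg_stems(paragraph):
    """Collect stems by testing each word's prefixes against the stem set."""
    hits = []
    for word in " ".join(paragraph).lower().split():
        for length in range(1, min(len(word), MAX_STEM_LEN) + 1):
            if word[:length] in NEG_STEMS:
                hits.append(word[:length])
                break
    return json.dumps(hits)

print(find_neg_stems(["the king abdicated the throne", "I hate cats"]))
# ["abdicat", "hate"]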
I have created a Spanish-English dictionary in Python and I have stored it using the variable translation. I want to use that variable in order to translate a text from Spanish into English. This is the code I have used so far:
from nltk.corpus import swadesh
import my_books
es2en = swadesh.entries(['es', 'en'])
translation = dict(es2en)
for sentence in my_books.sents("book_1"):
    for word in my_books.words("book_1"):
        if word in es2en:
            print(translation, end=" ")
        else:
            print("unknown_word", end=" ")
    print("")
My problem is that none of the words in book_1 is actually translated into English, so I get a text full of unknown_word. I think I'm probably using translation in the wrong way... how could I achieve my desired result?
The .entries() method, when given more than one language, returns not a dictionary but a list of tuples.
You need to convert your list of pairs (2-tuples) into a dictionary. You are doing that with your translation = statement.
However, you then ignore the translation variable and instead check if word in es2en:
You need to check if the word is in translation, and subsequently look up the correct translation, instead of printing the entire dictionary.
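A minimal corrected version of the loop might look like this (a sketch: it assumes my_books.sents() yields each sentence as a list of words, and it iterates the sentence itself instead of re-reading the whole book on every pass):
for sentence in my_books.sents("book_1"):
    for word in sentence:
        if word in translation:
            print(translation[word], end=" ")
        else:
            print("unknown_word", end=" ")
    print("")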
It can also be a case-sensitivity issue.
For example: if the dict contains the key 'Bomb' and you look up 'bomb', it won't be found.
Lowercase all the keys in es2en and then check word.lower() in es2en.
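A quick sketch of that fix, lowercasing the Spanish keys when the dict is built:
translation = {spanish.lower(): english for spanish, english in es2en}
if word.lower() in translation:
    print(translation[word.lower()], end=" ")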
I am in the process of building a translation machine (a language dictionary) from Bahasa Indonesia to English and vice versa.
I am building it from scratch: I am collecting all the words in Bahasa together with their meanings,
then comparing them against a WordNet database (crawled).
Once you have a group of meanings and have paired the English meanings with the Bahasa ones, collect as much data as you can and separate it into scientific content and everyday content.
Tokenize all the data into sentences and calculate which words have a high probability of pairing with which other words (in both Bahasa and English). This is needed because every word can have several meanings, and this calculation is what lets you choose which word to use.
An example in Bahasa:
'bisa' can mean poison, and in that sense it pairs with high probability with 'snake' or 'bite'.
'bisa' can also mean being able to do something, in which case it pairs with high probability with verbs or expressions of willingness to do something.
So if the tokenized result pairs with 'snake' or 'bite', you look for the corresponding meaning by checking 'snake' and 'poison' on the English side; searching the English database, you will find that 'venom' (which has a similar meaning to toxin/poison) always pairs with 'snake'.
Another grouping can be done by word type (nouns, verbs, adjectives, etc.):
bisa == poison (noun)
bisa == can (verb)
That's it. Once you have the calculation, you no longer need the database; you only need the word-matching data.
You can do the calculation against online data (e.g. Wikipedia), download it, or use a Bible/book file or any other source that contains lots of sentences.
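For what it's worth, the pairing calculation described above can be sketched with a simple co-occurrence count; the sentences and words below are purely illustrative.
from collections import Counter
from itertools import combinations

pair_counts = Counter()
sentences = [["ular", "itu", "punya", "bisa"], ["dia", "bisa", "berenang"]]
for sentence in sentences:
    # Count each unordered word pair that appears together in a sentence.
    for a, b in combinations(sorted(set(sentence)), 2):
        pair_counts[(a, b)] += 1
# pair_counts now shows which words tend to appear together, e.g. ('bisa', 'ular')
# versus ('berenang', 'bisa'), which is the signal used to pick the right sense.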
I'm coding a game similar to Boggle where the gamer should find words inside a big string made of random letters.
For example, there are five arrays of strings like this: five rows, each made of six letters:
AMSDNS
MASDOM
ASDAAS
DSMMMS
OAKSDO
So, the users of the game should make words using the letters available with the following restrictions and rules in mind:
It's not possible to repeat the same letter to make a word. I'm talking about the "physical" letter; in the game that is a die, and it's not possible to use the same die twice or more within one word.
It's not possible to "jump" over any letter to make a word. The letters that make the word must be contiguous.
The user is able to move in any direction she wants, with no restriction beyond the two mentioned above. So it's possible to go to the top, then bottom, then to the right, then top again, and so on; the movements to look for words might be somewhat erratic.
I want to know how to go through all the strings to make words. To check the words I'm going to use a txt file with a word list.
I don't know how to design an algorithm that is able to perform the search, especially considering the erratic movements needed to find the words while respecting the restrictions.
I already implemented the UX, the logic to throw the dice and fill the board, and all the logic for the six-letter dice.
But this part isn't easy, and I would like to read your suggestions for this interesting challenge.
I'm using Python for this game because it is the language I use to code and the language that I like the most, but an explanation or suggestion of an algorithm itself would be nice too, independently of the language.
The basic algorithm is simple.
For each tile, do the following.
Start with an empty candidate word, then visit the current tile.
Visit a tile by following these steps.
Add the tile's position's letter to the candidate word.
Is the candidate word a known word? If so, add it to the found word list.
Is the candidate word a prefix to any known word?
If so, for each adjacent tile that has not been visited to form the candidate word, visit it (i.e., recurse).
If not, backtrack (stop considering new tiles for this candidate word).
To make things run smoothly when asking the question "is this word a prefix of any word in my dictionary", consider representing your dictionary as a trie. Tries offer fast lookup times for both words and prefixes.
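A minimal sketch of this search in Python, assuming the board is a list of equal-length strings and the dictionary is a plain word list (a set of prefixes stands in for a full trie here):
def find_words(board, words):
    """Depth-first search over the grid, never reusing a tile within one word."""
    rows, cols = len(board), len(board[0])
    word_set = set(words)
    prefixes = {w[:i] for w in words for i in range(1, len(w) + 1)}
    found = set()

    def visit(r, c, visited, candidate):
        candidate += board[r][c]
        if candidate not in prefixes:
            return  # no dictionary word starts this way, so backtrack
        if candidate in word_set:
            found.add(candidate)
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if (dr or dc) and 0 <= nr < rows and 0 <= nc < cols and (nr, nc) not in visited:
                    visit(nr, nc, visited | {(r, c)}, candidate)

    for r in range(rows):
        for c in range(cols):
            visit(r, c, set(), "")
    return found

board = ["AMSDNS", "MASDOM", "ASDAAS", "DSMMMS", "OAKSDO"]
print(find_words(board, ["MAS", "OAK", "DAM"]))  # hypothetical word list, uppercased to match the grid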
You might find a Trie useful - put all dictionary words into a Trie, then make another Trie from the Boggle grid, but only for as long as it keeps matching the dictionary Trie.
I.e. Dictionary trie:
S->T->A->C->K = stack
       \->R->K = stark
           \->T = start
Grid: (simplified)
STARKX
XXXTXX
XXXXXX
Grid trie: (only shown starting at S - also start at A for ART, etc)
S->X (no matches in dict Trie, so it stops)
 \->T->X
     \->A-R->K (match)
        |  |->T (match)
        |  \->X
        \->C->K (match)
            \->X
You could visualise your Tries with GraphViz.
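For the dictionary side, a minimal dict-of-dicts trie (a sketch, not the answerer's code) could be built like this:
def build_trie(words):
    """Nested-dict trie; the '$' key marks the end of a complete word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True
    return root

trie = build_trie(["stack", "stark", "start"])
# trie["s"]["t"]["a"] has children "c" and "r", mirroring the branching drawn above.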
I need to censor all occurrences of a list of words with *'s. I have about 400 words in the list and it's going to get hit with a lot of traffic, so I want to make it very efficient. What's an efficient algorithm/data structure to do this in? Preferably something already in Python.
Examples:
"piss off" => "**** off"
"hello" => "hello"
"go to hell" => "go to ****"
A case-insensitive trie-backed set implementation might fit the bill. For each word, you'll only process a minimum of characters. For example, you would only need to process the first letter of the word 'zoo' to know the word is not present in your list (assuming you have no 'z' expletives).
This is not something that is packaged with Python, however. You may observe better performance from a simple dictionary solution, since it's implemented in C.
(1) Let P be the set of phrases to censor.
(2) Precompute H = {h(w) | p in P, w is a word in p}, where h is a sensible hash function.
(3) For each word v that is input, test whether h(v) in H.
(4) If h(v) not in H, emit v.
(5) If h(v) in H, back off to any naive method that will check whether v and the words following form a phrase in P.
Step (5) is not a problem since we assume that P is (very) small compared to the quantity of input. Step (3) is an O(1) operation.
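A rough sketch of this scheme (hedged: Python's built-in set already gives the O(1) hashed membership test of steps (2)-(3), and the phrase list and censoring style below are made up for illustration):
PHRASES = {"piss off", "go to hell"}
# Step (2): every word that occurs in any phrase; a hit here triggers the slower phrase check.
SUSPECT_WORDS = {w for p in PHRASES for w in p.split()}

def censor(text):
    words = text.split()
    out, i = [], 0
    while i < len(words):
        if words[i] not in SUSPECT_WORDS:      # steps (3)-(4): fast path for normal words
            out.append(words[i])
            i += 1
            continue
        for phrase in PHRASES:                 # step (5): naive check for a phrase starting here
            plen = len(phrase.split())
            if " ".join(words[i:i + plen]) == phrase:
                out.extend("*" * len(w) for w in words[i:i + plen])
                i += plen
                break
        else:
            out.append(words[i])
            i += 1
    return " ".join(out)

print(censor("why don't you go to hell"))  # why don't you ** ** ****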
As cheeken mentioned, a trie may be what you need, but actually you should use the Aho-Corasick string matching algorithm, which is something more than a trie.
For every string S you need to process, the time complexity is approximately O(len(S)), i.e. linear.
You need to build the automaton once up front; its time complexity is O(sum of len(word) over all words), and its space complexity is at most about O(52 * sum of len(word)), where 52 is the size of the alphabet (taking it as ['a'..'z', 'A'..'Z']). You only need to do this once (or each time the system launches).
You might want to time a regexp based solution against others. I have used similar regexp based substitution of one to three thousand words on a text to change phrases into links before, but I am not serving those pages to many people.
I take the set of words (it could be phrases), and form a regular expression out of them that will match their occurrence as a complete word in the text because of the '\b'.
If you have a dictionary mapping words to their sanitized version then you could use that. I just swap every odd letter with '*' for convenience here.
The sanitizer function just returns the sanitized version of any matched swear word and is used in the regular expression substitution call on the text to return a sanitized version.
import re
swearwords = set("Holy Cow".split())
swear = re.compile(r'\b(%s)\b' % '|'.join(sorted(swearwords, key=lambda w: (-len(w), w))))
sanitized = {sw:''.join((ch if not i % 2 else '*' for i,ch in enumerate(sw))) for sw in swearwords}
def sanitizer(matchobj):
    return sanitized.get(matchobj.group(1), '????')
txt = 'twat prick Holy Cow ... hell hello shitter bonk'
swear.sub(sanitizer, txt)
# Out[1]: 'twat prick H*l* C*w ... hell hello shitter bonk'
You might want to use re.subn and the count argument to limit the number of substitutions done and just reject the whole text if it has too many profanities:
maxswear = 2
newtxt, scount = swear.subn(sanitizer, txt, count=maxswear)
if scount >= maxswear: newtxt = 'Ouch my ears hurt. Please tone it down'
print(newtxt)
# 'Ouch my ears hurt. Please tone it down'
If performance is what you want, I would suggest:
Get a sample of the input.
Calculate the average number of censored words per line.
Define a max number of words to filter per line (3, for example).
Calculate which censored words get the most hits in the sample.
Write a function that, given the censored words, generates a Python file with if statements to check each word, putting the 'most hits' words first; since you just want to match whole words it will be fairly simple. A toy version of such a generated file is sketched after this answer.
Once you hit the max number per line, exit the function.
I know this is not nice, and I'm only suggesting this approach because of the high-traffic scenario; looping over each word in your list will have a huge negative impact on performance.
Hope that helps, or at least gives you some out-of-the-box ideas on how to tackle the problem.
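A toy illustration of the kind of generated file this describes (the words and function name are hypothetical):
# Hypothetical auto-generated checks, ordered with the most frequently hit words first.
def censor_word(word):
    if word == "hell":
        return "****"
    if word == "piss":
        return "****"
    return word

print(" ".join(censor_word(w) for w in "go to hell".split()))  # go to ****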
I'm doing a pretty simple homework problem for a Python class involving all sorts of statistics on characters, words and their relative frequencies etc. At the moment I'm trying to analyse a string of text and get a list of every unique word in the text followed by the number of times it is used. I have very limited knowledge of Python (or any language for that matter) as this is an introductory course and so have only come up with the following code:
for k in (""",.’?/!":;«»"""):
text=text.replace(k,"")
text=text.split()
list1=[(text.count(text[n]),text[n]) for n in range(0,len(text))]
for item in sorted(list1, reverse=True):
print("%s : %s" % (item[1], item[0]))
This unfortunately prints out each individual word of the text (in order of appearance), followed by its frequency n, n times. Obviously this is extremely useless, and I'm wondering if I can add in a nifty little bit of code to what I've already written to make each word appear in this list only once, and then eventually in descending order. All the other questions like this I've seen use a lot of code we haven't learned, so I think the answer should be relatively simple.
Take a look at collections.Counter. You can use it to count your word frequencies, and it'll help you print out the list in sorted order, with the most_common method.
(No example code as this is a homework question, you'll have to do some work yourself).