Today I wrote my first program, which is essentially a vocabulary learning program! So naturally I have pretty huge lists of vocabulary and a couple of questions. I created a class with two parameters, one of which is the German vocab and the other the Spanish vocab. My first question is: is there any way to turn all the plain-text vocabulary that I copy from an internet vocab list into strings and separate them without adding the quotes and the commas manually?
And my second question:
I created another list to pair each German vocab word with each Spanish one, and it looks a little bit like this:
vocabs = [
    Vocabulary(spanish_word[0], german_word[0]),
    Vocabulary(spanish_word[1], german_word[1]),
    # etc.
]
Vocabulary is the class, spanish_word is the Spanish word list, and german_word is the German one, obviously.
But with a lot of vocab that's a lot of work too. Is there any way to automate the process of pairing each word from the Spanish word list with the corresponding one from the German list? I first tried it with
vocabs = [
for spanish word in german word
Vocabulary(spanish_word[0], german_word[0])
]
But that didn't work. Researching on the internet didn't help much either.
Please don't be rude if these are noob questions. I'm actually pretty happy that my program is running so well, and I would be thankful for any help to make it better.
Without knowing what it is you're looking to do with the result, it appears you're trying to do this:
vocabs = [Vocabulary(s, g) for s, g in zip(spanish_word, german_word)]
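For instance, with a couple of made-up word pairs (the bare-bones Vocabulary class below is just a stand-in for whatever your real class looks like):

class Vocabulary:
    def __init__(self, spanish, german):
        self.spanish = spanish
        self.german = german

spanish_word = ["perro", "gato", "casa"]
german_word = ["Hund", "Katze", "Haus"]

# zip pairs the two lists element by element and stops at the shorter one
vocabs = [Vocabulary(s, g) for s, g in zip(spanish_word, german_word)]
print(vocabs[0].spanish, "->", vocabs[0].german)  # perro -> Hund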
You didn't provide any code or example data around the "turn all the plain text vocabulary [..] into strings and separate them without adding the quotes and the commas manually" part. There's sure to be a way to do what you need, but you should probably ask a separate question, after first looking for a solution yourself and making an attempt. Ask a question if you can't get it to work.
I'm new to Python; however, after scouring the internet and going back over my study, I cannot seem to find how to count occurrences of a word within multiple sentences. My aim is to determine how many times the word "python" occurs within these strings. I have tried the split() method and count("python"), and even tried to make a dictionary and a word counter, which I was initially taught to do as part of the basics, but nothing in my study has shown me anything similar to this before. I need to be able to display the frequency of the word, e.g. "python occurs 4 times". Any help would be very appreciated.
python_occurs = ["welcome to our Python program", "Python is my favorite language!", "I am afraid of Pythons", "I love Python"]
A straightforward approach is to iterate over every word using split(). Each word is converted to lowercase, and the number of times "python" occurs in it is counted using count().
I guess the reason your approach didn't work might be that you forgot to convert the letters to lowercase.
python_occurs = ["welcome to our Python program", "Python is my favorite language!", "I am afraid of Pythons", "I love Python"]

count = 0
for sentence in python_occurs:
    for word in sentence.split():
        # lower() is necessary because we want to be case-insensitive
        count += word.lower().count("python")
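The same count can also be written as one expression; this is just a more compact spelling of the loop above, not a different algorithm:

# counting over the whole sentence is equivalent, since "python" cannot span a space
count = sum(sentence.lower().count("python") for sentence in python_occurs)
print(count)  # 4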
I am looking for something slightly more reliable for unpredictable strings than just checking if "word" in "check for word".
To paint an example, let's say I have the following sentence:
"Learning Python!"
If the sentence contains "Python", I'd want to evaluate to true, but what if it were:
"Learning #python!"
Doing a split with a space as a delimiter would give me ["learning", "#python"], which does not match "python".
(Note: While I do understand that I could remove the # for this particular case, the problem with this is that 1. I am tagging programming languages and don't want to strip out the # in C#, and 2. This is just an example case, there's a lot of different ways I could see human typed titles including these hints that I'd still like to catch.)
I'd basically like to check, beyond reasonable doubt, whether the sequence of characters I'm looking for is there, despite any weird ways someone might mention it. What are some ways to do this? I have looked at fuzzy search a bit, but I haven't seen any use cases for looking for single words.
The end goal here is that I have tags of programming languages, and I'd like to take in people's stream titles and tag the language if it's mentioned in the title.
This code prints True if the string contains "python", ignoring case.
import re

text = "Learning Python!"  # renamed from "input", which shadows a builtin
print(re.search("python", text, re.IGNORECASE) is not None)
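Note that a plain substring search like the one above also fires inside longer words ("Pythons", "pythonic"). One way to tighten it is sketched below, using lookarounds instead of \b so that terms ending in non-word characters like "c#" still work; the helper name and sample titles are made up for illustration:

import re

def mentions(term, title):
    # lookarounds act like word boundaries, but unlike \b they also
    # work for terms that end in non-word characters such as "c#"
    pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
    return re.search(pattern, title, re.IGNORECASE) is not None

print(mentions("python", "Learning #python!"))        # True
print(mentions("c#", "Building a game in C# today"))  # True
print(mentions("python", "My pythonic adventures"))   # False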
I have text with words like this: a n a l i z e, c l a s s, etc., but there are normal words as well. I need to remove all these spaces between the letters of such words.
import re

reg_let = re.compile(r'\s[А-Яа-яёЁa-zA-Z](\s)', re.DOTALL)
text = 'T h i s is exactly w h a t I needed'
text = re.sub(reg_let, '', text)
text
OUTPUT:
'Tiis exactlyhtneeded' (while I need 'This is exactly what I needed')
As far as I know, there is no easy way to do it, because your biggest problem is to distinguish the meaningful words; in other words, you need some semantic engine to tell you which split is meaningful in the sentence.
The only thing I can think of is a word-embedding model. Without anything like that you can clear as many spaces as you want, but you can't distinguish the words, meaning you'll never know which spaces not to remove.
I would be glad if someone corrected me if there's a simpler way I'm not aware of.
There is no easy solution to this problem.
The only solution that I can think of is one that uses a dictionary to check whether a word is correct or not (i.e. present in the English dictionary).
But even doing so you'll get a lot of false positives. For example, take the text:
a n a n a s
the words:
a
an
as
are all correct in the English dictionary. How do I split the text? For me, as a human who can read text, it is clear that the word here is ananas. But one could split the text like this:
an an as
Which is grammatically correct, but doesn't make sense in English. Correctness is given by context, and I, as a human, can understand the context. One could split and concatenate the string in different ways to check whether it makes sense, but unfortunately there is no library or simple procedure that can understand context.
Machine Learning could be a way, but there is no perfect solution.
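To make the dictionary idea concrete, here is a minimal sketch of that kind of check; the tiny WORDS set stands in for a real English word list, and the greedy longest-first recursion simply returns the first split it finds rather than the "best" one:

WORDS = {"a", "an", "as", "ananas", "this", "is", "exactly", "what", "i", "needed"}

def segment(letters):
    # Try to group a sequence of letters into dictionary words,
    # preferring longer candidates first; return None if impossible.
    if not letters:
        return []
    for end in range(len(letters), 0, -1):
        candidate = "".join(letters[:end]).lower()
        if candidate in WORDS:
            rest = segment(letters[end:])
            if rest is not None:
                return [candidate] + rest
    return None

print(segment(list("ananas")))  # ['ananas'], not ['an', 'an', 'as']

Even this toy version shows the ambiguity: remove "ananas" from the set and the same call returns ['an', 'an', 'as'].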
I'm having trouble figuring out how I would take a text file of a lengthy document and append each sentence within that text file to a list. Not all sentences will end in a period, so all end characters would have to be taken into consideration, but there could also be a '.' within a sentence, so I couldn't just cut off a sentence at a period. I'm assuming this could be handled by also requiring that the period be followed by a space, but I have no idea how to set this up so that each sentence from the text file is put into a list as an element.
The program I'm writing will essentially allow user input of a search keyword (key) and a number of sentences to return (value) before and after the sentence where the keyword is found. It's more or less a research assistant, so the user won't have to read a massive text file to find the information they want.
From what I've learned so far, putting the sentences into a list would be the easiest way to go about this, but I can't figure out the first part. If I could figure out this part, the rest should be easy to put together.
So I guess in short,
If I have a document of Sentence. Sentence. Sentence. Sentence. Sentence. Sentence. Sentence. Sentence. Sentence. Sentence. Sentence. Sentence.
I need a list of the document contents in the form of:
sentence_list = [Sentence, Sentence, Sentence, Sentence, Sentence, Sentence, Sentence, Sentence, Sentence, Sentence, Sentence, Sentence]
That's a pretty hard problem, and it doesn't have an easy answer. You could try to write a regular expression that captures all of the known cases, but complex regular expressions tend to be hard to maintain and debug. There are a number of existing libraries that may help you with this. Most notable is the Natural Language Toolkit (NLTK), which has many tokenizers built in. You can install it with pip, e.g.
pip install nltk
Then getting your sentences is a fairly straightforward (although highly customizable) affair. Here's a simple example using the provided sentence tokenizer:
import nltk
nltk.download('punkt')  # fetch the Punkt tokenizer models; only needed once

with open('text.txt', 'r') as in_file:
    text = in_file.read()

sents = nltk.sent_tokenize(text)
I'm not entirely clear how your sentences are delimited if not by normal punctuation, but running the above code on your text I get:
[
"I'm having trouble figuring out how I would take a text file of a lengthy document, and append each sentence within that text file to a list.",
"Not all sentences will end in a period, so all end characters would have to be taken into consideration, but there could also be a '.'",
"within a sentence, so I couldn't just cutoff searching through a sentence at a period.",
"I'm assuming this could be fixed by also adding a condition where after the period it should be followed by a space, but I have no idea how to set this up so I get each sentence from the text file put into a list as an element.\n\n"
]
But fails on inputs like: ["This is a sentence with.", "a period right in the middle."]
while passing on inputs like: ["This is a sentence wit.h a period right in the middle"]
I don't know if you're going to get much better than that right out of the box, though. From the nltk code:
A sentence tokenizer which uses an unsupervised algorithm to build
a model for abbreviation words, collocations, and words that start
sentences; and then uses that model to find sentence boundaries.
This approach has been shown to work well for many European
languages.
So the nltk solution is actually using machine learning to build a model of a sentence. Much better than a regular expression, but still not perfect. Damn natural languages. >:(
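If Punkt trips over specific abbreviations in your text, you can hand it your own abbreviation list instead of the trained model. A small sketch (the abbreviation set is made up; fill in whatever your documents actually use):

from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

params = PunktParameters()
params.abbrev_types = {"dr", "mr", "e.g"}  # lowercase, without the trailing period
tokenizer = PunktSentenceTokenizer(params)

print(tokenizer.tokenize("Dr. Smith arrived. He was early."))
# ['Dr. Smith arrived.', 'He was early.']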
Hope this helps :)
First read the text file into a container.
Then use regular expressions to parse the document.
This is just a sample of how re.split() can be used to break up the string:
import re

with open("test.txt", "r") as file:
    doclist = [line for line in file]

docstr = ''.join(doclist)
sentences = re.split(r'[.!?]', docstr)
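Building on the question's own observation that a sentence-ending period is usually followed by whitespace, a lookbehind keeps the punctuation attached and only splits where punctuation is followed by a space; the sample string is made up:

import re

text = "Sentence one. Sentence two! Version 2.0 stays intact. Sentence three?"
# split on whitespace that is preceded by ., ! or ?
sentences = re.split(r'(?<=[.!?])\s+', text)
print(sentences)
# ['Sentence one.', 'Sentence two!', 'Version 2.0 stays intact.', 'Sentence three?']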
I need to split/decompose German composed words in Python. An example:
Donaudampfschiffahrtsgesellschaftskapitän
should be decomposed to:
[Donau, Dampf, Schiff, Fahrt, Gesellschaft, Kapitän]
First I found wordaxe, but it did not work. Then I came across NLTK, but I still don't understand whether that is something I need.
A solution with an example would be really great!