I need to split/decompose German composed words in Python. An example:
Donaudampfschiffahrtsgesellschaftskapitän
should be decomposed to:
[Donau, Dampf, Schiff, Fahrt, Gesellschaft, Kapitän]
First I found wordaxe, but it did not work. Then I came across NLTK, but I still don't understand whether that is something I need.
A solution with an example would be really great!
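If you don't find a package that works for you, a plain dictionary-based splitter already gets you quite far for examples like this one. Below is a minimal sketch, assuming you have a set of known German base words (a real setup would use a large word/frequency list, or a dedicated package such as CharSplit): it greedily matches the longest known prefix, allows a linking "s" (Fugen-s), and restores the consonant that the pre-1996 spelling drops in "Schiffahrt".
def split_compound(word, lexicon, min_len=3):
    word = word.lower()
    if word in lexicon:
        return [word]
    # Try the longest known prefix first.
    for i in range(len(word) - min_len, min_len - 1, -1):
        head, tail = word[:i], word[i:]
        if head not in lexicon:
            continue
        candidates = [tail]
        if tail.startswith("s"):      # linking "s", e.g. "fahrts..."
            candidates.append(tail[1:])
        if head[-1] == head[-2]:      # restore dropped consonant: "schiff" + "ahrt" -> "fahrt"
            candidates.append(head[-1] + tail)
        for t in candidates:
            rest = split_compound(t, lexicon, min_len)
            if rest:
                return [head] + rest
    return None

lexicon = {"donau", "dampf", "schiff", "fahrt", "gesellschaft", "kapitän"}
print(split_compound("Donaudampfschiffahrtsgesellschaftskapitän", lexicon))
# ['donau', 'dampf', 'schiff', 'fahrt', 'gesellschaft', 'kapitän']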
I have a small issue with PyDictionary: when I enter a list of words, the printed words do NOT keep the order of the word list.
For example:
from PyDictionary import PyDictionary
dictionary = PyDictionary(
    "bad ",
    "omen",
    "azure ",
    "sky",
    "icy ",
    "smile")
print(dictionary.printMeanings())
This list prints Omen first, then Sky, and so on. What I need is to print the word list in its original order. I searched on Google but found nothing related, and I searched the posts in this forum and found nothing. I hope you can help me. Thank you in advance.
I found a workaround that gives me a full solution to my initial printing issues.
The main problem is that I am using an OLD laptop (older than 12 years), so I have not been able to use Python 3+. Using PyDictionary with Python 2.7 caused the problem of the initial word list being printed in random order.
The solution is to print a single word per call, BUT I have to do this about 25,000 times! Using Notepad++ I made a macro that turns each word into the code needed to look it up with Python. I was even able to add the Spanish translation to each English word, and printing each word individually has the added benefit that each word's definitions are kept separate.
Using Notepad++ and regex I am able to do the final clean-up of each word and its meaning.
So I am happy with this workaround... Thank you for your help.
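For the record, if the lookups can be done from Python itself, the same one-word-at-a-time idea can be kept entirely in code: iterate over your own word list, which preserves the order no matter how PyDictionary stores the words internally. This is only a sketch using PyDictionary's meaning() method, untested on Python 2.7:
from PyDictionary import PyDictionary

words = ["bad", "omen", "azure", "sky", "icy", "smile"]
dictionary = PyDictionary()

# Looping over our own list keeps the original order; each meaning() call
# looks up a single word.
for word in words:
    print(word)
    print(dictionary.meaning(word))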
Today I wrote my first program, which is essentially a vocabulary learning program! Naturally I have pretty huge lists of vocabulary and a couple of questions. I created a class with two parameters, one of which is the German vocab and the other the Spanish vocab. My first question is: is there any way to turn all the plain-text vocabulary that I copy from an internet vocab list into strings and separate them, without adding the quotes and the commas manually?
And my second question:
I created another list to assign each German word to its Spanish counterpart, and it looks a little bit like this:
vocabs = [
    Vocabulary(spanish_word[0], german_word[0]),
    Vocabulary(spanish_word[1], german_word[1]),
    # etc.
]
Vocabulary is the class, spanish_word is the first word list, and german_word is the other, obviously.
But with a lot of vocab that's a lot of work too. Is there any way to automate pairing each word from the Spanish word list with the corresponding German one? I first tried it with:
vocabs = [
for spanish word in german word
Vocabulary(spanish_word[0], german_word[0])
]
But that didn't work. Researching on the internet also didn't help much.
Please don't be rude if these are noob questions. I'm actually pretty happy that my program is running so well, and I would be thankful for any help to make it better.
Without knowing what it is you're looking to do with the result, it appears you're trying to do this:
vocabs = [Vocabulary(s, g) for s, g in zip(spanish_word, german_word)]
You didn't provide any code or example data for the part about "turn all the plain text vocabulary [..] into strings and separate them without adding the quotes and the commas manually". There is sure to be a way to do what you need, but you should probably ask a separate question, after first looking for a solution yourself and trying to implement it. Ask a question if you can't get it to work.
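For illustration, assuming a minimal Vocabulary class (the original wasn't shown), the comprehension pairs the two lists element by element:
class Vocabulary:
    def __init__(self, spanish, german):
        self.spanish = spanish
        self.german = german

spanish_word = ["hola", "gato", "perro"]
german_word = ["hallo", "Katze", "Hund"]

# zip() pairs the i-th Spanish word with the i-th German word, so no manual
# indexing is needed.
vocabs = [Vocabulary(s, g) for s, g in zip(spanish_word, german_word)]

for v in vocabs:
    print(v.spanish, "-", v.german)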
I am looking for something slightly more reliable for unpredictable strings than just checking if "word" in "check for word".
To paint an example, let's say I have the following sentence:
"Learning Python!"
If the sentence contains "Python", I'd want to evaluate to true, but what if it were:
"Learning #python!"
Doing a split with a space as a delimiter would give me ["learning", "#python"], which does not match "python".
(Note: while I do understand that I could remove the # for this particular case, the problem with this is that 1. I am tagging programming languages and don't want to strip out the # in C#, and 2. this is just an example case; there are a lot of different ways human-typed titles could include these hints that I'd still like to catch.)
I'd basically like to check whether, beyond reasonable doubt, the sequence of characters I'm looking for is there, despite any weird way it might be mentioned. What are some ways to do this? I have looked at fuzzy search a bit, but I haven't seen any use cases for looking for single words.
The end goal here is that I have tags of programming languages, and I'd like to take in people's stream titles and tag the language if it's mentioned in the title.
This code prints True if the string contains "python", ignoring case.
import re

text = "Learning Python!"
print(re.search("python", text, re.IGNORECASE) is not None)
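If you also want to avoid matching inside longer words (so "pythonic" doesn't count) while still catching "#python" and keeping the # in C#, one option is to define the word boundaries yourself instead of relying on \b. This is only a sketch with a hypothetical mentions() helper; treating "no adjacent letter or digit" as the boundary is an assumption you may want to adjust:
import re

def mentions(term, text):
    # Match the term only when it is not glued to other letters/digits,
    # so "#python" and "C#" count but "pythonic" does not.
    pattern = r"(?<![A-Za-z0-9])" + re.escape(term) + r"(?![A-Za-z0-9])"
    return re.search(pattern, text, re.IGNORECASE) is not None

print(mentions("python", "Learning #python!"))        # True
print(mentions("python", "pythonic ramblings"))       # False
print(mentions("c#", "Building a game in C# today"))  # True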
So there is an Excel file which I have read with pandas and stored in a DataFrame 'df'. That Excel file contains 24 columns ('questions') and 631 rows ('responses/answers').
I converted one such question column into a list so that I can tokenize it and apply further NLP-related tasks to it.
df_lst = df['Q8 Why do you say so ?'].values.tolist()
Now, this gives me a list that contains 631 sentences, out of which some are non-English. So I want to filter out the non-English sentences so that in the end I am left with a list containing only English sentences.
What I have:
df_lst = ["The executive should be able to understand the customer's problem", "Customers should get correct responses to their queries", "This text is in a random non-English language", ...]
Output (what I want):
english_words = ["The executive should be able to understand the customer's problem", "Customers should get correct responses to their queries", ...]
Also, I read about a Python library named PyEnchant which should be able to do this, but it's not compatible with Windows 64-bit and Python 3. Is there any other way this can be done?
Thanks!
There is another library (closely related to NLTK), TextBlob. It is mostly associated with sentiment analysis, but you can also use it for translation and language detection; see the docs here: https://textblob.readthedocs.io/en/dev/quickstart.html, section "Translation and Language Detection". Good luck.
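For example, here is a sketch of what that could look like. Note that detect_language() goes through the Google Translate API, so it needs an internet connection and has been deprecated in newer TextBlob releases, and the non-English sentence below is just made up:
from textblob import TextBlob

df_lst = [
    "The executive should be able to understand the customer's problem",
    "Dies ist ein Satz in einer anderen Sprache",  # made-up non-English answer
]

# Keep only the sentences TextBlob identifies as English.
english_only = [s for s in df_lst if TextBlob(s).detect_language() == "en"]
print(english_only)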
Have you considered taking advantage of the number of English "stopwords" in your sentences? Take a look at the nltk package. You can check the English stopwords with the following code:
import nltk
nltk.download('stopwords')  # if you just installed the package
from nltk.corpus import stopwords

print(set(stopwords.words('english')))
You could add a new column indicating the number of English stopwords present in each of your sentences. Presence of stopwords could be used as a predictor of English language.
Another thing that could work: if you know for a fact that most answers are in English to begin with, build a word frequency ranking (possibly for each question in your data). In your example, the word "customer" shows up quite consistently for the question under study, so you could engineer a variable that indicates the presence of very frequent words in an answer. That could also serve as a predictor. Do not forget to make all words lowercase (or uppercase) and to deal with plurals and possessive 's, so you don't rank "customer", "Customer", "customers", "Customers", "customer's" and "customers'" all as different words.
After engineering the variables above, you can set up a threshold above which you consider the sentence to be written in English, or you can go for something a bit more fancy in terms of unsupervised learning.
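As a rough sketch of the stopword idea (the threshold of two stopwords and the example sentences are arbitrary assumptions you would tune and replace with your own data):
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
nltk.download('punkt')  # needed by word_tokenize (newer NLTK versions may ask for 'punkt_tab')

english_stopwords = set(stopwords.words('english'))

def english_stopword_count(sentence):
    # Count how many tokens of the sentence are English stopwords.
    tokens = nltk.word_tokenize(sentence.lower())
    return sum(1 for token in tokens if token in english_stopwords)

df_lst = [
    "The executive should be able to understand the customer's problem",
    "Dies ist ein Satz in einer anderen Sprache",  # made-up non-English answer
]

# Keep sentences that contain at least two English stopwords.
english_sentences = [s for s in df_lst if english_stopword_count(s) >= 2]
print(english_sentences)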
Is there a way to find the most common synonym of a word with NLTK? I would like to simplify a sentence using the most common synonyms of each word on it.
If a word used in the sentence is already the most common word from its group of synonyms, it shouldn't be changed.
Let's say "Hi" is more common than "Hello"; "Dear" is more common than "Valued"; and "Friend" is already the most common word in its group of synonyms.
Input: "Hello my valued friend"
Return: "Hi my dear friend"
Synonyms are tricky, but if you are starting out with a synset from WordNet and you simply want to choose the most common member in the set, it's pretty straightforward: just build your own frequency list from a corpus, and look up each member of the synset to pick the maximum.
nltk will let you build a frequency table in just a few lines of code. Here's one based on the Brown corpus:
import nltk
from nltk.corpus import brown

# nltk.download('brown')  # uncomment if the Brown corpus is not installed yet
freqs = nltk.FreqDist(w.lower() for w in brown.words())
You can then look up the frequency of a word like this:
>>> print(freqs["valued"])
14
Of course you'll need to do a little more work: I would count words separately for each of the major parts of speech (wordnet provides n, v, a, and r, resp. noun, verb, adjective and adverb), and use these POS-specific frequencies (after adjusting for the different tagset notations) to choose the right substitute.
>>> freq2 = nltk.ConditionalFreqDist((tag, wrd.lower()) for wrd, tag in
...                                  brown.tagged_words(tagset="universal"))
>>> print(freq2["ADJ"]["valued"])
0
>>> print(freq2["ADJ"]["dear"])
45
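Putting those pieces together, a rough sketch could look like this; the most_common_synonym helper and the WordNet-to-universal tag mapping are my own assumptions for illustration, not an NLTK API:
import nltk
from nltk.corpus import brown, wordnet as wn

# nltk.download('brown'); nltk.download('wordnet'); nltk.download('universal_tagset')

# POS-specific frequencies from the Brown corpus, as built above.
freq2 = nltk.ConditionalFreqDist(
    (tag, wrd.lower()) for wrd, tag in brown.tagged_words(tagset="universal")
)

# Map WordNet's POS letters to the universal tagset used by Brown.
WN_TO_UNIVERSAL = {"n": "NOUN", "v": "VERB", "a": "ADJ", "s": "ADJ", "r": "ADV"}

def most_common_synonym(word, wn_pos):
    # Gather all lemma names from the word's synsets and pick the one that is
    # most frequent in the corpus for that part of speech.
    tag = WN_TO_UNIVERSAL[wn_pos]
    candidates = {
        lemma.name().replace("_", " ").lower()
        for synset in wn.synsets(word, pos=wn_pos)
        for lemma in synset.lemmas()
    }
    return max(candidates, key=lambda w: freq2[tag][w], default=word)

print(most_common_synonym("valued", "a"))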
Synonyms are a huge and open area of work in natural language processing.
In your example, how is the program supposed to know what the allowed synonyms are? One method might be to keep a dictionary of sets of synonyms for each word. However, this can run into problems due to overlaps in parts of speech: "dear" is an adjective, but "valued" can be an adjective or a past-tense verb.
Context is also important: the bigram "dear friend" might be more common than "valued friend", but "valued customer" would be more common than "dear customer". So, the sense of a given word needs to be accounted for too.
Another method might be to just look at everything and see what words appear in similar contexts. You need a huge corpus for this to be effective though, and you have to decide how large a window of n-grams you want to use (a bigram context? A 20-gram context?).
I recommend you take a look at applications of WordNet (https://wordnet.princeton.edu/), which was designed to help figure some of these things out. Unfortunately, I'm not sure you'll find a way to "solve" synonyms on your own, but keep looking and asking questions!
Edit: I should have included this link to an older question as well:
How to get synonyms from nltk WordNet Python
And the NLTK documentation on its interface with WordNet:
http://www.nltk.org/howto/wordnet.html
I don't think these address your question, however, since WordNet doesn't have usage statistics (which are dependent on the corpus you use). You should be able to apply its synsets in a method like above, though.
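For instance, here is a small sketch of pulling the actual synonym strings (lemma names) out of the WordNet synsets; you could then rank these against whatever corpus statistics you have:
from nltk.corpus import wordnet as wn

# Collect the lemma names from every adjective synset of "valued".
synonyms = {
    lemma.name().replace("_", " ")
    for synset in wn.synsets("valued", pos=wn.ADJ)
    for lemma in synset.lemmas()
}
print(synonyms)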
The other answer shows you how to get a word's synonyms:
wn.synsets('small')
[Synset('small.n.01'),
Synset('small.n.02'),
Synset('small.a.01'),
Synset('minor.s.10'),
Synset('little.s.03'),
Synset('small.s.04'),
Synset('humble.s.01'),
Synset('little.s.07'),
Synset('little.s.05'),
Synset('small.s.08'),
Synset('modest.s.02'),
Synset('belittled.s.01'),
Synset('small.r.01')]
You now know how to get all the synonyms for a word. That's not the hard part. The hard part is determining what's the most common synonym. This question is highly domain dependent. Most common synonym where? In literature? In common vernacular? In technical speak?
Like you, I wanted to get an idea of how the English language is used. I downloaded 15,000 entire books from Project Gutenberg and computed the word and letter-pair frequencies across all of them. After ingesting such a large corpus, I could see which words are used most commonly. Like I said above, though, it will depend on what you're trying to process. If it's something like Twitter posts, try ingesting a ton of tweets. Learn from what you will eventually have to process.