Text block comparison in Python

I have a Python project in mind but I'm not too sure where to start.
I want to do some text comparison between two blocks of text: a user should be able to input two blocks of text, and the program should identify the parts that are different.
I've seen this functionality in Git - when you make a change in a repo, it shows you the changes before you commit - which makes me think I should be able to build something with similar functionality.
Any kind of insight would be greatly appreciated!
EDIT:
While searching I came across this Git repo online, and it's exactly what I'm looking for: a simple GUI where a user can load two different files and see the similarities or differences between them!
For others looking for something similar: https://github.com/yebrahim/pydiff

From my point of view, you can take the user input and store it in two strings, say str1 and str2. Then you can use the split() method, or word_tokenize() from a natural language processing library such as NLTK, to get all the words in each string.
If you want, you can also remove stopwords for a better comparison.
Now you can run a loop comparing the words one by one and, for clearer presentation, underline the words (or the particular part of a word) that don't match.
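As a minimal sketch of that idea, Python's standard-library difflib module can do the word-by-word comparison for you (printing markers rather than underlining):

import difflib

str1 = input("First block of text: ")
str2 = input("Second block of text: ")

# ndiff yields tokens prefixed with "- " (only in str1),
# "+ " (only in str2), or "  " (common to both)
for token in difflib.ndiff(str1.split(), str2.split()):
    if token.startswith(("- ", "+ ")):
        print(token)

difflib.HtmlDiff can also render the comparison as a side-by-side HTML table if you want something closer to Git's view.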


Extract sentences from a string

I am working on a machine learning chatbot project which uses Google's speech recognition API.
My problem is that when I say two or more sentences in one command, the speech recognition API returns all the sentences in one string, without any full stops or commas. As a result, it has become harder to separate the sentences. For example, if I say,
Take a photo. Tell me about today's weather. Open Google Chrome.
the speech recognition API returns:
take a photo tell me about todays weather open Google Chrome
so my chatbot takes this full string as one sentence.
Is there any way to extract sentences from a string like the one above?
(BTW, I am using Python.)
If you are going to say multiple commands, use a connector word like "and" and split the command on that word. Then loop through the resulting list and pass each value to your execute function.
If the variable command stores your value, split it with command.split(" and ")
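A minimal sketch of that, with a stub standing in for your real execute function:

def execute(cmd):
    # Stand-in for your real command handler
    print(f"Executing: {cmd}")

command = "take a photo and tell me about today's weather"

# Split the utterance on the connector word and handle each piece separately
for part in command.split(" and "):
    execute(part.strip())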
I had previously answered a similar question; take a look at it:
https://stackoverflow.com/a/65872940/12279129
I think you could try different approaches to solve the problem:
A Naive solution
I don't know how your system works right now, but if you are just looking for certain subsentences, you can simply search the full string for the phrases you care about. For example:
input_str = "Take a photo turn on fan".lower()
if "take a photo" in input_str:
    print("Just took a photo!")
if "turn on fan" in input_str:
    print("Just turned the fan on!")
Of course you could also pick a separator word (like "and", "furthermore", ...) and split on it.
A more advanced solution
You could use an NLP library (e.g. spaCy) and perform entity recognition and part-of-speech tagging so that you can isolate verbs from nouns and so on.
After that you could eventually make use of stemming and lemmatization to further generalize the recognition.
You could also add intermediate steps with other NLP techniques, like stopword removal.
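A minimal sketch of the spaCy route, assuming the small English model is installed (pip install spacy, then python -m spacy download en_core_web_sm):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("take a photo tell me about todays weather open google chrome")

# Part-of-speech tags and lemmas hint at where a new imperative command starts
for token in doc:
    print(token.text, token.pos_, token.lemma_)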
Try auto punctuation from the API
Maybe you can try enabling automatic punctuation in the speech-to-text API and see if that works well enough for you.
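If you are on Google Cloud Speech-to-Text (rather than the free web API), the recognition config exposes a flag for this; a sketch, assuming the google-cloud-speech client library and credentials are set up:

from google.cloud import speech

client = speech.SpeechClient()
config = speech.RecognitionConfig(
    language_code="en-US",
    enable_automatic_punctuation=True,  # ask the API to insert punctuation
)
# response = client.recognize(config=config, audio=audio)  # audio: your RecognitionAudio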
That's because Google Cloud Speech doesn't provide natural language understanding, so you are stuck parsing text transcripts.
You can of course create the natural language understanding component yourself, either by using simple regular expressions or using something like Rasa, but there's a smarter way, too.
Speechly provides you with everything you need to create voice user interfaces on Android, iOS or the web. It returns not only the transcript but also actionable intents and entities, which make it a lot easier to create something a bit more complex. The best part is that it's free for up to 20 hours a month.
You can see a very simple example of how it works, for instance for creating search experiences, here. The basic idea is always the same: create a model and test that it returns the correct intents for your speech input. Once you're done, integrate it into your app by looping through the returned results; whenever you get the correct intent, react in your application as needed. It's actually very simple.
You can use the split method.
Suppose your string is A:
X = A.split('.')
This makes X a list whose items are the sentences (note this only helps if the string actually contains full stops).

How to create a dynamic form with Python using translated text as input?

I have an original text that I want to translate. I normally do it manually, but I know I could save a lot of time by automatically translating the most frequent words and expressions.
I will figure out how to translate simple words; the problem is not there. I have read some books on Python, and I think this can be done with string manipulation.
But I am lost about how to create the output file.
The output file will contain:
short empty forms ready to be filled wherever there is text that has not been translated
the translated words wherever they were in the original file
In the output file I will fill in the empty forms manually; after pressing Tab the cursor should jump to the next empty form.
I am lost here. I know how to do forms in HTML, but the language I am used to is Python.
I would like to know what modules from Python I could use. I need some guidance on this.
Can you recommend me a book or a tool that explains how to do something similar to this?
This is what I want to do, assuming I have managed to create a simple database to translate colors from Spanish to English.
The first step contains the original file.
The second step contains the automatic translation.
In the third step I complete the manual translation.
After finishing, everything is grouped into a normal txt file ready to be used.
I think it is quite clear. I don't expect people to tell me the code to do this, I just need to know what tools could be used to achieve my goal.
Thanks for editing.
To create an interface that works with a web browser, Flask is a good option for building web forms in Python. There are tutorials available.
One option for storing the data would be an SQLite file. That may be more than you need, so I'd recommend starting with a CSV file. Python's standard library covers both (the csv and sqlite3 modules).
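A minimal sketch of the Flask idea, with a made-up two-word dictionary standing in for your color database; known words are rendered as text and unknown words become input fields (the browser's Tab key already jumps between inputs):

from flask import Flask, render_template_string, request

app = Flask(__name__)

# Hypothetical mini-dictionary standing in for the color database
TRANSLATIONS = {"rojo": "red", "verde": "green"}

PAGE = """
<form method="post">
  {% for word, translated in words %}
    {% if translated %} {{ translated }}
    {% else %} <input name="{{ word }}" placeholder="{{ word }}">
    {% endif %}
  {% endfor %}
  <button type="submit">Save</button>
</form>
"""

@app.route("/", methods=["GET", "POST"])
def translate():
    if request.method == "POST":
        print(dict(request.form))  # the manually filled words arrive here
    original = "el coche rojo y la casa verde".split()
    words = [(w, TRANSLATIONS.get(w)) for w in original]
    return render_template_string(PAGE, words=words)

if __name__ == "__main__":
    app.run(debug=True)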

Run an autocorrect program on a text file in Python

I am very new to Python, and I am currently working on a project. This project would be to create (among other things) a program to correct a text. I am having difficulty combining two separate ideas and parts of code together. First of all, I have been experimenting with a code to correct a word that is inputted by a user.
The code can be found here.
So far, I am using this exact code without any modifications.
My goal is to be able to read a text file and go through it and find and propose corrections for the words which are wrong, as this spellchecker code does.
I would use something like:
with open('words.txt', 'r') as f:
    for line in f:
        for word in line.split():
            ...  # check each word against the spellchecker here
to go through the text file and split it into individual words.
Ideally, if my text said
"Wgat is the definiton" I would want to be able to recognize wgat and correct it to what, and recognize definiton and correct to definition.
How do I combine these two ideas? Thanks
Maybe you should look at this:
https://norvig.com/spell-correct.html
It uses word probabilities to pick the best correction without being connected to a database.
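To combine that with your file-reading loop, a sketch assuming you have saved Norvig's code (which defines a correction() function) as spell.py next to its big.txt corpus:

from spell import correction  # Norvig's spell-correct code saved locally

with open('words.txt') as f:
    for line in f:
        for word in line.split():
            fixed = correction(word.lower())
            if fixed != word.lower():
                print(f"{word} -> {fixed}")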
Otherwise, you can use urllib to fetch an English word list such as http://www.mieliestronk.com/corncob_lowercase.txt
Then find the word in that list that is most closely related to the one inputted and print it.
Hope it helps!

Python domain extraction from text - new TLD recognition issues

With the emergence of new TLDs (.club, .jobs, etc.), what is the current best practice for extracting/parsing domains from text? My typical approach is regex, but given that things like file names with extensions will trigger false positives, I need something more restrictive.
I noticed even Google sometimes does not properly recognize whether I'm searching for a file name or want to go to a domain. This appears to be a rather challenging problem. Machine learning could potentially be an approach to understanding the context surrounding a string, but unless there is a library that does this already, I won't bother getting too fancy.
One approach I'm thinking of is, after regexing, querying http://data.iana.org/TLD/tlds-alpha-by-domain.txt, which holds the current list of TLDs, and using it as a filter. Any suggestions?
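For reference, a rough sketch of that regex-plus-IANA-filter idea (the regex is deliberately loose and only illustrative):

import re
import urllib.request

# Fetch the authoritative TLD list once; the first line is a comment
TLD_URL = "http://data.iana.org/TLD/tlds-alpha-by-domain.txt"
with urllib.request.urlopen(TLD_URL) as resp:
    tlds = {line.strip().lower()
            for line in resp.read().decode().splitlines()
            if line and not line.startswith("#")}

pattern = re.compile(r"\b((?:[a-z0-9-]+\.)+[a-z]{2,})\b", re.IGNORECASE)

def extract_domains(text):
    hits = []
    for candidate in pattern.findall(text):
        tld = candidate.rsplit(".", 1)[1].lower()
        if tld in tlds:  # filters out file names like report.docx
            hits.append(candidate)
    return hits

print(extract_domains("visit soccer.club, not notes.docx"))

If you would rather not maintain the list yourself, the tldextract library works against the Public Suffix List and also handles multi-part suffixes like .co.uk.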
This is not an easy problem, and it depends on the context in which you need to extract the domain names and on the rate of false positives and negatives you can accept. You can indeed use the list of currently existing TLDs, but this list changes, so you need to make sure you are working from a recent enough copy of it.
You are hitting issues covered by the Universal Acceptance movement, which tries to make sure all TLDs (whatever their length, date of creation, or the characters they use) are treated equally.
They provide a document about "linkification", which has as a subproblem extracting links, and hence domains, among other things. Have a look at their documentation: https://uasg.tech/wp-content/uploads/2017/06/UASG010-Quick-Guide-to-Linkification.pdf
This could give you some ideas, as could their Quick Guide at https://uasg.tech/wp-content/uploads/2016/06/UASG005-160302-en-quickguide-digital.pdf

Automatically pick tags from context using Python

How can I pick tags from an article or a user's post using Python?
Is the following method OK?
Build a word-frequency list from the text and sort it.
Remove some common words and pick the top 10 remaining words as the tags.
If the above method is OK, what library can detect which words are common (like "the", "if", "you", etc.) and which are descriptive?
Here's an article on removing stop words. The link to the stop word list in the article is broken, but here's another one.
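As a rough sketch of the frequency-plus-stop-words method from the question, using NLTK's stop-word corpus (suggest_tags is a made-up name):

from collections import Counter

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)  # one-time corpus download

def suggest_tags(text, n=10):
    words = [w.lower() for w in text.split() if w.isalpha()]
    common = set(stopwords.words("english"))
    counts = Counter(w for w in words if w not in common)
    return [word for word, _ in counts.most_common(n)]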
The Natural Language Toolkit offers a broad variety of methods for this kind of task. I can't give you hands-on advice as I'm not familiar with this subject, but I think it's worth the effort to read a few articles about the topic before you start. Just picking words directly from the text won't get you very far, I think; you should probably try to find words similar to the ones for which tags already exist. And of course you need to filter out common words of the language like "the" and so on. Again, this Python library can help you with that, at least for a few common languages.
I'd suggest you download the Stack Overflow data dump. There you get a lot of real-world posts, with appropriate tags, to test different tag-selection algorithms.
But generally I doubt it will work too well. For your own question, "words" is the clear winner in word count, followed by a list of words with two appearances each, like "common", "list", "method", "pick" and "tags". Which of those would you automatically choose as tags? Also, the tags you chose manually contain "python" and "context", neither of which shows up with a high word frequency.
Train a Bayes or Fisher filter on already-tagged data (e.g. the Stack Overflow data dump suggested by sth) and use it to classify new posts. I'd recommend reading the excellent book Programming Collective Intelligence by Toby Segaran for more information and Python examples on this topic.
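The book walks through hand-rolled classifiers; as a minimal sketch of the same idea with scikit-learn instead (toy data standing in for the tagged dump):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data standing in for the tagged Stack Overflow dump
posts = ["how do I merge two dicts in python",
         "css flexbox centering a div",
         "python list comprehension syntax"]
tags = ["python", "css", "python"]

vectorizer = CountVectorizer(stop_words="english")
clf = MultinomialNB().fit(vectorizer.fit_transform(posts), tags)

print(clf.predict(vectorizer.transform(["python dict question"])))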
Instead of blacklisting words that shouldn't be tags, why don't you instead build a whitelist of words that would make for good tags?
Start with a handful of tags that you would like to have, like python, off-topic, football, rickroll or whatnot (depending on the kind of site you are building!), have the system only suggest from those, then let users hand-pick appropriate tags and also let them type in their own.
When enough users suggest a tag, it gets into the pool of "known good" tags for auto-suggestion - maybe after some sort of moderation, so that you can still blacklist silly tags like "the" or "lolol", or typoed tags like "objectoriented" when you have "object-oriented".
Only show a few suggestions. Offer autocompletion. Limit the number of tags per item. If this will be about coding, maybe some sort of language detection system (the Linux file command is not too shabby at this) will help your suggestion system.
