With the emergence of new TLDs (.club, .jobs, etc.), what is the current best practice for extracting/parsing domains from text? My typical approach is regex; however, given that things like file names with extensions trigger false positives, I need something more restrictive.
I noticed that even Google sometimes does not properly recognize whether I'm searching for a file name or want to go to a domain. This appears to be a rather challenging problem. Machine learning could potentially be an approach to understanding the context surrounding a string, but unless there is a library that already does this, I won't bother getting too fancy.
One approach I'm considering is, after regexing, querying http://data.iana.org/TLD/tlds-alpha-by-domain.txt, which holds the current list of TLDs, and using it as a filter. Any suggestions?
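A minimal sketch of that filtering approach (the candidate regex is deliberately loose, since the TLD check does the real filtering):

import re
import urllib.request

# Fetch the current TLD list from IANA (one TLD per line; the first line is a comment).
TLD_URL = "http://data.iana.org/TLD/tlds-alpha-by-domain.txt"
with urllib.request.urlopen(TLD_URL) as resp:
    tlds = {line.strip().lower()
            for line in resp.read().decode("ascii").splitlines()
            if line and not line.startswith("#")}

def extract_domains(text):
    # Loose candidate pattern: dotted labels; false positives are removed by the TLD check.
    candidates = re.findall(r"\b(?:[a-z0-9-]+\.)+[a-z]{2,}\b", text, re.IGNORECASE)
    return [c for c in candidates if c.rsplit(".", 1)[-1].lower() in tlds]

print(extract_domains("See report.docx and join us at example.club today"))
# ['example.club'] -- report.docx is dropped because 'docx' is not a TLD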
This is not an easy problem, and it depends on the context in which you need to extract the domain names and on the rate of false positives and negatives you can accept. You can indeed use the list of currently existing TLDs, but this list changes, so you need to make sure you are working from a recent enough copy of it.
You are hitting issues covered by the Universal Acceptance movement, which tries to make sure all TLDs (whatever their length, date of creation, or character set) are treated equally.
They provide a document about "linkification", which has as a sub-problem the extraction of links, and hence domains, among other things. Have a look at their documentation: https://uasg.tech/wp-content/uploads/2017/06/UASG010-Quick-Guide-to-Linkification.pdf
So this could give you some ideas, as well as their Quick Guide at https://uasg.tech/wp-content/uploads/2016/06/UASG005-160302-en-quickguide-digital.pdf
I have a Python project in mind but I'm not too sure where to start.
I want to do some text comparison: a user should be able to input two blocks of text, and the program should identify the parts that are different.
I've seen this functionality in Git: when you make a change in a repo, it shows you the changes before you commit. This makes me think that I should be able to build something with similar functionality.
Any kind of insight would be greatly appreciated!
EDIT:
While searching I came across this GitHub repo; it's exactly what I'm looking for: a simple GUI where a user can load two different files and see the similarities or differences between them!
For others looking for something similar: https://github.com/yebrahim/pydiff
From my point of view, you can take the user input and store it in two strings, say str1 and str2, then use the split() method or word_tokenize() (from NLTK) to get all the words in each string.
If you want, you can also remove stopwords for a better comparison.
Now you can run a loop comparing the words, and for clarity you can highlight the words (or the parts of a word) that don't match.
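A minimal sketch of that word-by-word comparison, using Python's standard difflib (the bracket markers are just one way to make the non-matching words stand out):

import difflib

def word_diff(text1, text2):
    # Compare two blocks of text word by word and mark the differences.
    words1, words2 = text1.split(), text2.split()
    matcher = difflib.SequenceMatcher(None, words1, words2)
    marked = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            marked.extend(words1[i1:i2])
        else:
            # Words removed from the first text and words added in the second.
            marked.extend("[-" + w + "-]" for w in words1[i1:i2])
            marked.extend("{+" + w + "+}" for w in words2[j1:j2])
    return " ".join(marked)

print(word_diff("the quick brown fox", "the quick red fox"))
# the quick [-brown-] {+red+} fox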
How can I generate a random yet valid website link, regardless of language? Actually, the more diverse the languages of the websites it generates, the better.
I've been doing it by using other people's scripts on their web pages; how can I stop relying on these random-site forwarding scripts and make my own? I've been doing it like this:
import webbrowser
from random import choice

# Third-party pages that redirect to a random website.
random_page_generator = ['http://www.randomwebsite.com/cgi-bin/random.pl',
                         'http://www.uroulette.com/visit']

# Open one of them in a new browser tab.
webbrowser.open(choice(random_page_generator), new=2)
There are two ways to do this:
Create your own spider that amasses a huge collection of websites, and pick from that collection.
Access some pre-existing collection of websites, and pick from that collection. For example, DMOZ/ODP lets you download their entire database;* Google used to have a customized random site URL;** etc.
There is no other way around it (short of randomly generating and testing valid strings of arbitrary characters, which would be a ridiculously bad idea).
Building a web spider for yourself can be a fun project. Link-driven scraping libraries like Scrapy can do a lot of the grunt work for you, leaving you to write the part you care about (see the sketch after the notes below).
* Note that ODP is a pretty small database compared to something like Google's or Yahoo's, because it's primarily a human-edited collection of significant websites rather than an auto-generated collection of everything anyone has put on the web.
** Google's random site feature was driven by both popularity and your own search history. However, by feeding it an empty search history, you could remove that part of the equation. Anyway, I don't think it exists anymore.
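If you go the spider route, a minimal sketch with Scrapy might look like this (the seed URL and output handling are placeholders; a real crawl needs politeness settings, deduplication, and limits):

import scrapy

class RandomSiteSpider(scrapy.Spider):
    name = "randomsites"
    start_urls = ["https://en.wikipedia.org/wiki/Main_Page"]  # arbitrary seed page

    def parse(self, response):
        # Record every page we reach, then keep following links.
        yield {"url": response.url}
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)

Run it with scrapy runspider spider.py -o sites.json, let it collect for a while, and then random.choice() over the collected URLs gives you a random site.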
A conceptual explanation, not a code one.
Their scripts are likely very large and comprehensive. If it's a random website selector, they have a huge, huge list of websites line by line, and the script just picks one. If it's a random URL generator, it probably generates a string of letters (e.g. "asljasldjkns"), plugs it between http:// and .com, tries to see if it is a valid URL, and if it is, sends you that URL.
The easiest way to design your own might be to ask to have a look at theirs, though I'm not certain of the success you'd have there.
The best way, as a programmer, is simply to understand how URLs are structured. Practice building strings and testing them, or compile a huge database of them yourself.
As a hybridization, you might try building two things. One script that, while you're away, searches for/tests URLs and adds them to a database. Another script that randomly selects a line out of this database to send you on your way. The longer you run the first, the better the second becomes.
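A rough sketch of that two-script idea, assuming a plain HTTP GET with the requests library is an acceptable "is this a real site?" test (it will miss most of the web, which is part of why the spider approach is better):

import random
import string
import requests

def random_candidate(length=8):
    # Random lowercase string plugged between http:// and .com.
    name = "".join(random.choice(string.ascii_lowercase) for _ in range(length))
    return "http://" + name + ".com"

def collect(attempts=100, db_path="sites.txt"):
    # Script one: test random URLs and append the live ones to a database file.
    with open(db_path, "a") as db:
        for _ in range(attempts):
            url = random_candidate()
            try:
                requests.get(url, timeout=3)
            except requests.RequestException:
                continue
            db.write(url + "\n")

def pick(db_path="sites.txt"):
    # Script two: send the user to a random line from the database.
    with open(db_path) as db:
        return random.choice(db.read().splitlines())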
EDIT: Do Abarnert's thing about spiders, that's much better than my answer.
The other answers suggest building large databases of URLs; there is another method, which I've used in the past and documented here:
http://41j.com/blog/2011/10/find-a-random-webserver-using-libcurl/
The idea is to create a random IP address and then try to grab a site from port 80 of that address. This method is not perfect with modern virtual-hosted sites, and of course it only fetches the top page, but it can be an easy and effective way of getting random sites. The code linked above is C, but it should be easily callable from Python, or the method can be adapted to Python directly.
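A rough Python adaptation of that method, using only the standard library (the timeout and retry limit are arbitrary, and no attempt is made to skip reserved address ranges):

import http.client
import random

def random_ip():
    return ".".join(str(random.randint(1, 254)) for _ in range(4))

def random_site(max_tries=200):
    # Keep trying random addresses until one answers on port 80.
    for _ in range(max_tries):
        ip = random_ip()
        try:
            conn = http.client.HTTPConnection(ip, 80, timeout=2)
            conn.request("GET", "/")
            response = conn.getresponse()
        except (OSError, http.client.HTTPException):
            continue
        if response.status < 400:
            return "http://" + ip + "/"
    return None

print(random_site())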
I have around 80,000 text files and I want to be able to do an advanced search on them.
Let's say I have two lists of keywords and I want to return all the files that include at least one of the keywords in the first list and at least one in the second list.
Is there already a library that does this? I don't want to rewrite it if it exists.
As you need to search the documents multiple times, you most likely want to index the text files to make such searches as fast as possible.
Implementing a reasonable index yourself is certainly possible, but a quick search led me to:
https://pypi.python.org/pypi/Whoosh/
http://pythonhosted.org/Whoosh/
Take a look at the documentation. It should hopefully be rather trivial to achieve the desired behaviour.
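A minimal sketch of that with Whoosh, assuming the text files live in a textfiles/ directory (the directory names and keyword lists are placeholders); the two lists are combined as (any of list one) AND (any of list two):

import os
from whoosh import index
from whoosh.fields import Schema, ID, TEXT
from whoosh.qparser import QueryParser

schema = Schema(path=ID(stored=True), content=TEXT)

# Build the index once.
os.makedirs("indexdir", exist_ok=True)
ix = index.create_in("indexdir", schema)
writer = ix.writer()
for name in os.listdir("textfiles"):
    with open(os.path.join("textfiles", name), encoding="utf-8") as f:
        writer.add_document(path=name, content=f.read())
writer.commit()

# Search: at least one keyword from each list.
list1 = ["alpha", "beta"]
list2 = ["gamma", "delta"]
query_text = "(%s) AND (%s)" % (" OR ".join(list1), " OR ".join(list2))
with ix.searcher() as searcher:
    query = QueryParser("content", ix.schema).parse(query_text)
    for hit in searcher.search(query, limit=None):
        print(hit["path"])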
I just get the feeling you may want MapReduce-style processing for the search. It should be very scalable, and there are MapReduce packages for Python.
I have been wanting to create an application using Microsoft Speech Recognition.
My application's users are expected to often say abbreviated things, such as 'LHC' for 'Large Hadron Collider' or 'CERN'. Given those two utterances in that order, my application returns
You said: At age C.
You said: Cern
While it did work for 'CERN', it failed very badly for 'LHC'.
However, if I could make my own custom training files, I could easily place the term 'LHC' somewhere in there. Then, I could make the user access the Speech Control Panel and run my training file.
All the links I have found for this have been frustratingly useless, as they just say things like 'This is ----, you should try going to the ---- forum instead'.
If it does help, here is a list of the links:
http://compgroups.net/comp.speech.users/add-my-own-training/153194
https://groups.google.com/forum/#!topic/microsoft.public.speech.server/v58SH1ov22s
http://social.msdn.microsoft.com/Forums/en/servercorefordevelopers/thread/f7a35f3f-b352-464a-b264-e16eb4afd049
Is my problem even possible? Or are the training files themselves in a special format? If so, can that format be reproduced?
A solution that can also work on Windows XP would be ideal.
Thanks in advance!
P.S. If there are any libraries or modules out there already for this, could anyone point me to some? A Python or C/C++ solution would be splendid. Also, since I'd rather not post another question about this, is it possible to use the training utilities from the command prompt (or without the GUI visible, but still having full command of all controls)?
Okay, pulling this from a thing I wrote three or four years ago now, but I believe you want to do something like this.
The grammar library is a trained system which can recognize words. You can create your own grammar library cued to specific words.
C#, sorry
// Reference the System.Speech assembly.
using System.Speech;
using System.Speech.Recognition;
using System.Speech.AudioFormat;

// Build a grammar from the exact phrases we expect and load it into the recognizer.
SpeechRecognitionEngine sre = new SpeechRecognitionEngine();
string[] words = { "L H C", "CERN" };
Choices choices = new Choices(words);
GrammarBuilder gb = new GrammarBuilder(choices);
Grammar grammar = new Grammar(gb);
sre.LoadGrammar(grammar);
That is as far as I can get you. From the docs it looks like you can define pronunciations somehow, so perhaps that way you could have 'LHC' map directly to a single word. Here are the docs on the Grammar class: http://msdn.microsoft.com/en-us/library/system.speech.recognition.grammar.aspx
Small update: see the example in their docs here: http://msdn.microsoft.com/en-us/library/ms554228.aspx
I am trying to work out a solution for detecting traceability between source code and documentation. The most important use case is that the user needs to see a collection of source code tokens (sorted by relevance to the documentation) that can be traced back to the documentation. She won't be bothered about the code format, but somehow needs to see an "identifier-documentation" mapping to get the idea of traceability.
I take the tokens from the source code files and somehow split the concatenated identifiers (SimpleMAXAnalyzer becomes "simple max analyzer"), which then act as search terms on the documentation. Search frameworks are well suited to this specific task: drilling down into documents to locate things using powerful information-retrieval algorithms. Whoosh looked like a really great Python search library, with a number of analyzers and filters.
Though the problem is similar to search, it differs in that the user is not actually performing any search. So am I solving the problem the right way? Given that everything is static and needs to be computed only once, am I using the wrong tool (a search framework) for the job?
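For reference, the identifier splitting mentioned above can be done with a small regex; this is just one possible heuristic, not tied to any particular framework:

import re

def split_identifier(name):
    # Break CamelCase / SimpleMAXAnalyzer-style names into lowercase words.
    parts = re.findall(r"[A-Z]+(?=[A-Z][a-z]|\b)|[A-Z]?[a-z]+|\d+", name)
    return " ".join(p.lower() for p in parts)

print(split_identifier("SimpleMAXAnalyzer"))  # simple max analyzer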
I'm not sure I understand your use case. The user sees the source code and has some way of jumping from a token to the appropriate part of the documentation, or to a listing of the possible parts, right?
Then a search tool seems to be the right tool for the job, although you could precompute every possible search (there is only a limited number of identifiers in the source, so you can calculate all possible references to the docs in advance).
Or are there any "canonical" parts of the documentation for every identifier? Then maybe some kind of index would be a better choice.
Maybe you could clarify your use case a bit further.
Edit: Maybe an alphabetical index of the documentation could be a step toward the solution. Then you can look up, for every token in the source, the pages/chapters/sections where all or most of its components are mentioned.
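A rough sketch of that index idea (the section structure, tokenization, and ranking are all simplifying assumptions):

import re
from collections import defaultdict

def words_of(text):
    return set(re.findall(r"[a-z]+", text.lower()))

def build_index(sections):
    # sections: {"section name": "section text", ...}
    idx = defaultdict(set)
    for name, text in sections.items():
        for word in words_of(text):
            idx[word].add(name)
    return idx

def lookup(identifier_words, idx):
    # Rank sections by how many of the identifier's components they mention.
    scores = defaultdict(int)
    for word in identifier_words.split():
        for section in idx.get(word, ()):
            scores[section] += 1
    return sorted(scores, key=scores.get, reverse=True)

sections = {"Analyzers": "The simple analyzer tokenizes text ...",
            "Filters": "Filters post-process tokens ..."}
idx = build_index(sections)
print(lookup("simple max analyzer", idx))  # ['Analyzers']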