Leveraging Spell Checker on local machine?


I notice that common applications on a given machine (Mac, Linux, or Windows) have their respective spell checkers: everything from various IDEs, to MS Word/Office, to note-taking software.
I am trying to use the built-in utility of our respective machines to analyze strings for syntactic correctness. It seems that I can't just use what is on the machine and would likely have to download a dictionary to compare against.
I was not sure if there was a better way to accomplish this. I was looking at doing things locally, but I am not opposed to making API or curl requests to determine whether the words in a string are spelled correctly.
I was looking at:
LanguageTool (hello wrold failed to return an error)
Google's tbproxy seems to not be functional
Dictionary / Merriam-Webster require API keys for automation.
I was looking at Node packages and noticed spell checker modules which encapsulate wordlists as well.
Is there a way to use the built-in machine dictionaries at all, or would it be better to download a dictionary / wordlist to compare against?
I am thinking a wordlist might be the best bet, but I didn't want to reinvent the wheel. What have others done to accomplish something similar?

Credit goes to Lukas Knuth. I want to give an explicit how-to for using dictionary and nspell.
Install the following 2 dependencies:
npm install nspell dictionary-en-us
Here is a sample file I wrote to solve the problem.
// Node File
// node spellcheck.js [path]
// path: [optional] either absolute or local path from pwd/cwd
// if you run the file from within Seg.Ui.Frontend/ it works as well.
// node utility/spellcheck.js
// OR from the utility directory using a path:
// node spellcheck.js ../src/assets/i18n/en.json
var fs = require("fs");
var dictionary = require("dictionary-en-us");
var nspell = require("nspell");
var process = require("process");
// path to use if not defined.
var path = "src/assets/i18n/en.json"
let strings = [];
function getStrings(json){
let keys = Object.keys(json);
for (let idx of keys){
let val = json[idx];
if (isObject(val)) getStrings(val);
if (isString(val)) strings.push(val)
}
}
function sanitizeStrings(strArr){
let set = new Set();
for (let sentence of strArr){
sentence.split(" ").forEach(word => {
word = word.trim().toLowerCase();
if (word.endsWith(".") || word.endsWith(":") || word.endsWith(",")) word = word.slice(0, -1);
if (ignoreThisString(word)) return;
if (word == "") return;
if (isNumber(word)) return;
set.add(word)
});
}
return [ ...set ];
}
function ignoreThisString(word){
// we need to ignore special cased strings, such as items with
// Brackets, Mustaches, Question Marks, Single Quotes, Double Quotes
let regex = new RegExp(/[\{\}\[\]\'\"\?]/, "gi");
return regex.test(word);
}
function spellcheck(err, dict){
if (err) throw err;
var spell = nspell(dict);
let misspelled_words = strings.filter( word => {
return !spell.correct(word)
});
misspelled_words.forEach( word => console.log(`Plausible Misspelled Word: ${word}`))
return misspelled_words;
}
function isObject(obj) { return obj instanceof Object }
function isString(obj) { return typeof obj === "string" }
function isNumber(obj) { return !Number.isNaN(parseInt(obj, 10)) } // NaN check so that "0" also counts as a number
function main(args){
//node file.js path
if (args.length >= 3) path = args[2]
if (!fs.existsSync(path)) {
console.log(`The path does not exist: ${process.cwd()}/${path}`);
return;
}
var content = fs.readFileSync(path)
var json = JSON.parse(content);
getStrings(json);
// console.log(`String Array (length: ${strings.length}): ${strings}`)
strings = sanitizeStrings(strings);
console.log(`String Array (length: ${strings.length}): ${strings}\n\n`)
dictionary(spellcheck);
}
main(process.argv);
This will return a subset of strings to look at; they may be misspelled or false positives.
A false positive can be:
An acronym
Non-US English variants of words
Unrecognized proper nouns, days of the week and months for example.
Strings which contain parentheses. These can be handled by trimming the parentheses off the word.
Obviously, this doesn't cover all cases, but I added an ignoreThisString function you can leverage if, say, a string contains a special word or phrase the developers want ignored.
This is meant to be run as a node script.

Your question is tagged as both NodeJS and Python. This is the NodeJS specific part, but I imagine it's very similar to python.
Windows (from Windows 8 onward) and Mac OS X do have built-in spellchecking engines.
Windows: The "Windows Spell Checking API" is a C/C++ API. To use it with NodeJS, you'll need to create a binding.
Mac OS X: "NSSpellChecker" is part of AppKit, used for GUI applications. This is an Objective-C API, so again you'll need to create a binding.
Linux: There is no "OS specific" API here. Most applications use Hunspell but there are alternatives. This again is a C/C++ library, so bindings are needed.
Fortunately, there is already a module called spellchecker which has bindings for all of the above. This will use the built-in system for the platform it's installed on, but there are multiple drawbacks:
1) Native extensions must be built. This one has prebuilt binaries via node-pre-gyp, but these need to be installed for specific platforms. If you develop on Mac OS X, run npm install to get the package, and then deploy your application on Linux (with the node_modules directory), it won't work.
2) Using built-in spellchecking will use defaults dictated by the OS, which might not be what you want. For example, the language used might be dictated by the selected OS language. For a UI application (for example one built with Electron) this might be fine, but if you want to do server-side spellchecking in languages other than the OS language, it might prove difficult.
At the basic level, spellchecking some text boils down to:
Tokenizing the string (e.g. by spaces)
Checking every token against a list of known correct words
(Bonus) Gather suggestions for wrong tokens and give the user options.
You can write part 1 yourself. Part 2 and 3 require a "list of known correct words" or a dictionary. Fortunately, there is a format and tools to work with it already:
simple-spellchecker can work with .dic-files.
nspell is a JS implementation of Hunspell, complete with its own dictionary packages.
Additional Dictionaries can be found for example in this repo
With this, you get to choose the language, you don't need to build/download any native code and your application will work the same on every platform. If you're spellchecking on the server, this might be your most flexible option.
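Since the question is also tagged Python, the tokenize-then-check steps described above can be sketched there too. This is only a minimal sketch, not a Hunspell implementation: KNOWN_WORDS is a tiny hypothetical word list standing in for a real dictionary, and the function names tokenize and misspelled are made up for illustration.

```python
import re

# Hypothetical stand-in for a real dictionary or word list.
KNOWN_WORDS = {"hello", "world", "this", "is", "a", "test"}


def tokenize(text):
    """Split text into lowercase word tokens, dropping punctuation and digits."""
    return re.findall(r"[a-z']+", text.lower())


def misspelled(text, known=KNOWN_WORDS):
    """Return tokens missing from the known-word set, in order, without duplicates."""
    seen = set()
    unknown = []
    for token in tokenize(text):
        if token not in known and token not in seen:
            seen.add(token)
            unknown.append(token)
    return unknown


print(misspelled("Hello wrold, this is a tset."))  # ['wrold', 'tset']
```

Swapping the set for words loaded from a downloaded wordlist file would keep the rest unchanged; suggestions (the bonus step) need more machinery than a plain set lookup.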

Related

Syntax for using config data in rules

Is there someplace that fully describes use of config data in snakemake rules?
There is an example in the user guide of this in a yaml file:
samples:
    A: data/samples/A.fastq
    B: data/samples/B.fastq
Then, it is used in a rule like this:
bam=expand("sorted_reads/{sample}.bam", sample=config["samples"]),
It seems like the above would replace {sample} to "data/samples/A.fastq" rather than by "A" (and "B" etc.) as it apparently does.
What is the right way to make use of config data in output rules, e.g. to help form the output filename? This form doesn't work:
output: "{config.dataFolder}/{ID}/{ID}.yyy"
I'm looking for syntax guidance if I define complex structured data in the yaml file - how do I make use of it in the snake rules? When do I use Python syntax and when do I use SnakeMake syntax?
The yaml and JSON config files are severely limited in that they cannot use values defined earlier in the file to define new values, right? And that's something that would often be done when setting configuration parameters.
What is the advantage of using a configfile? Why not instead just use include: and include a Python file to define parameters?
A useful thing would be a reference manual that describes the details of SnakeMake thoroughly. The current website is kind of scattered, takes a while to find things that you remember seeing previously somewhere in it.

How should config data be used in "output" rules? I found that the output string cannot contain {config.} values. However, they can be included using Python code, as follows:
output: config["OutputDir"] + "/myfile.txt"
But, this method does NOT work (in either output: or input:):
params: config=config
output: "{params.config[OutputDir]}/myfile.txt"
However, it DOES work in "shell:":
params: config=config
output: config["OutputDir"] + "/myfile.txt"
shell: echo "OutputDir is {params.config[OutputDir]}" > {output}
Notice that there are no quotes around OutputDir inside the [] in the shell cmd. The {} method of expanding values in a string does not use quotes around the keys.
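The unquoted-key behaviour matches Python's own str.format, which, as far as I know, Snakemake's {} expansion builds on. Here is a minimal sketch in plain Python, without Snakemake; params is an ordinary dict here, not Snakemake's params object, and the values are made up.

```python
# Plain Python stand-in for the values Snakemake would pass to the shell command.
params = {"config": {"OutputDir": "DATA_DIR", "B": [1, 2, 99]}}

# Inside a format field, dictionary keys are written without quotes.
line = "OutputDir is {params[config][OutputDir]}".format(params=params)
print(line)  # OutputDir is DATA_DIR

# All-digit subscripts are treated as integer indexes, so lists work too.
item = "third item: {params[config][B][2]}".format(params=params)
print(item)  # third item: 99
```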
Can config data be defined snakefile-wise OR python-wise? YES!
Parameters can be defined in a .yaml file included using 'configfile', or via a regular Python file included using 'include'. The latter is IMHO superior, since .yaml files don't allow definitions to reference previous ones, something that would be common in all but the simplest configuration files.
To define the "OutputDir" parameter above using yaml:
xxx.yaml:
OutputDir: DATA_DIR
snakefile:
configfile: 'xxx.yaml'
To define it using Python to be exactly compatible with above:
xxx.py:
config['OutputDir'] = "DATA_DIR"
snakefile:
include: 'xxx.py'
Or, to define a simple variable 'OutputDir' in a Python included configuration file and then use it in a rule:
xxx.py:
OutputDir = "DATA_DIR"
snakefile:
include: 'xxx.py'
rule:
    output: OutputDir + "/myfile.txt"
Multi-nested dictionaries and lists can be easily defined and accessed, both via .yaml files and python files. Example:
MACBOOK> cat cfgtest.yaml
cfgtestYAML:
    A: 10
    B: [1, 2, 99]
    C:
        nst1: "hello"
        nst2: ["big", "world"]
MACBOOK> cat cfgtest.py
cfgtestPY = {
    'X': -2,
    'Y': range(4,7),
    'Z': {
        'nest1': "bye",
        'nest2': ["A", "list"]
    }
}
MACBOOK> cat cfgtest
configfile: "cfgtest.yaml"
include: "cfgtest.py"
rule:
    output: 'cfgtest.txt'
    params: YAML=config["cfgtestYAML"], PY=cfgtestPY
    shell:
        """
        echo "params.YAML[A]: {params.YAML[A]}" >{output}
        echo "params.YAML[B]: {params.YAML[B]}" >>{output}
        echo "params.YAML[B][2]: {params.YAML[B][2]}" >>{output}
        echo "params.YAML[C]: {params.YAML[C]}" >>{output}
        echo "params.YAML[C][nst1]: {params.YAML[C][nst1]}" >>{output}
        echo "params.YAML[C][nst2]: {params.YAML[C][nst2]}" >>{output}
        echo "params.YAML[C][nst2][1]: {params.YAML[C][nst2][1]}" >>{output}
        echo "" >>{output}
        echo "params.PY[X]: {params.PY[X]}" >>{output}
        echo "params.PY[Y]: {params.PY[Y]}" >>{output}
        echo "params.PY[Y][2]: {params.PY[Y][2]}" >>{output}
        echo "params.PY[Z]: {params.PY[Z]}" >>{output}
        echo "params.PY[Z][nest1]: {params.PY[Z][nest1]}" >>{output}
        echo "params.PY[Z][nest2]: {params.PY[Z][nest2]}" >>{output}
        echo "params.PY[Z][nest2][1]: {params.PY[Z][nest2][1]}" >>{output}
        """
MACBOOK> snakemake -s cfgtest
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 1
1
rule 1:
output: cfgtest.txt
jobid: 0
Finished job 0.
1 of 1 steps (100%) done
MACBOOK> cat cfgtest.txt
params.YAML[A]: 10
params.YAML[B]: 1 2 99
params.YAML[B][2]: 99
params.YAML[C]: {'nst1': 'hello', 'nst2': ['big', 'world']}
params.YAML[C][nst1]: hello
params.YAML[C][nst2]: big world
params.YAML[C][nst2][1]: world
params.PY[X]: -2
params.PY[Y]: range(4, 7)
params.PY[Y][2]: 6
params.PY[Z]: {'nest1': 'bye', 'nest2': ['A', 'list']}
params.PY[Z][nest1]: bye
params.PY[Z][nest2]: A list
params.PY[Z][nest2][1]: list

YAML Configuration
This has to do with the nesting of YAML files, see an example here.
The config["samples"] request will return both 'A' and 'B'. In my head I think of it returning a list, but I am not positive about the variable type.
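On the type question: a YAML mapping like the tutorial's samples block is parsed into a Python dict, and iterating over a dict yields its keys, which would explain why {sample} becomes "A" and "B" rather than the file paths. A minimal sketch in plain Python, using a dict literal in place of the parsed config file and a list comprehension as a stand-in for Snakemake's expand:

```python
# Dict literal standing in for the parsed tutorial config file.
config = {"samples": {"A": "data/samples/A.fastq", "B": "data/samples/B.fastq"}}

# Iterating over a dict yields its keys, not its values.
bams = ["sorted_reads/{sample}.bam".format(sample=s) for s in config["samples"]]
print(bams)  # ['sorted_reads/A.bam', 'sorted_reads/B.bam']

# The paths are still available by key lookup.
print(config["samples"]["A"])  # data/samples/A.fastq
```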
By using the configfile as listed here:
https://snakemake.readthedocs.io/en/latest/tutorial/advanced.html
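You can link in configuration files in YAML format, such as the following. Note that list items need a leading dash; without it, YAML folds the lines into a single string.
settings/config.yaml:
samples:
    - A
    - B
OR
settings/config.yaml:
sampleID:
    - 123
    - 124
    - 125
baseDIR:
    data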
Resulting call with YAML config access
Snakefile:
configfile: "settings/config.yaml"
rule all:
    input:
        expand("{baseDIR}/{ID}.bam", baseDIR=config["baseDIR"], ID=config["sampleID"]),
rule fastq2bam:
    input:
        expand("{{baseDIR}}/{{ID}}.{readDirection}.fastq", readDirection=['1','2'])
    output:
        "{baseDIR}/{ID}.bam"
        #Note different number of {}, 1 for wildcards not in expand.
        #Equivalent line with 'useless' expand call would be:
        #expand("{{baseDIR}}/{{ID}}.bam")
    shell:
        """
        bwa mem {input[0]} {input[1]} > {output}
        """
Dummy examples, just trying to exemplify the use of different strings and config variables. I use wildcards in the fastq2bam rule. Typically I only use config variables to set things in my rule 'all', when possible this is best practice. I cannot say if the shell call actually works for bwa mem, but I think you get the idea of what I'm implying.
A larger version of a Snakefile can be seen here
Once the configfile is setup, to reference anything in it, use 'config'. It can be used to access deep into a YAML as needed. Here I'll go down 3 hypothetical levels, like so:
hypothetical_var = config["yamlVarLvl1"]["yamlVarLvl2"]["yamlVarLvl3"]
Equates to (I'm not POSITIVE about the typing, I think it converts to strings)
hypothetical_var = ['124', '125', '126', '127', '128', '129']
If the YAML is:
yamlVarLvl1:
    yamlVarLvl2:
        yamlVarLvl3:
            - '124'
            - '125'
            - '126'
            - '127'
            - '128'
            - '129'
Code Organization
Python and Snakemake code can, for the most part, be interleaved in certain places. I would advise against this, as it will make the code difficult to maintain. It's up to the user to decide how to implement this. E.g., using the run or the shell directive changes how to access the variables.
YAML and JSON files are the preferred files for configuration variables, as I believe they provide some support for editing and for command-line overriding of variables. This would not be as clean if it were implemented using externally imported Python variables. Also, it helps my brain, knowing Python files do things and YAML files store things.
YAML is always an external file, but...
If you are using a single Snakefile, put the supporting python at the top?
If you are using a multi-file system, consider having the supporting python scripts externalized.
Tutorials
I think a perfect vignette is difficult to design. I'm trying to teach my group about Snakemake, and I have over 40 pages of personally written documentation, I've provided three 1hr+ presentations with PowerPoint slideshows, I've read nearly the entire ReadTheDocs.io manual for Snakemake, and I just recently finished going through the list of additional resources, yet, I'm still learning too!
Side note, I found this tutorial to be very nice as well.
Does that provide enough context?

Is there someplace that fully describes use of config data in snakemake rules?
There is no limit to what you can put in the config file, provided it can be parsed into python objects. Basically, "your imagination is the limit".
What is the right way to make use of config data in output rules, e.g. to help form the output filename?
I extract things from the config outside the rules, in plain python.
Instead of output: "{config.dataFolder}/{ID}/{ID}.yyy", I would do:
import os
# config is a dict, so use key lookup rather than attribute access
data_folder = config["dataFolder"]
rule name_of_the_rule:
    output:
        os.path.join(data_folder, "{ID}", "{ID}.yyy")
I guess that with what you tried, Snakemake has problems formatting the string when there is a mix of things coming from the wildcards and from elsewhere. But maybe the following works in Python 3.6, using formatted string literals: output: f"{config['dataFolder']}/{{ID}}/{{ID}}.yyy" (the wildcard braces are doubled so that Python leaves them as a literal {ID}). I haven't checked.
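On the f-string idea: Python evaluates every single-brace field of an f-string immediately, so wildcard braces such as {ID} would need to be doubled to survive as literal text, and config, being a dict, would need key lookup rather than attribute access. A minimal sketch in plain Python; config is an ordinary dict here and its value is hypothetical. Whether Snakemake then treats the result like any other output pattern is untested.

```python
# Ordinary dict standing in for Snakemake's config; the value is hypothetical.
config = {"dataFolder": "results"}

# Doubled braces are emitted as literal braces, leaving {ID} in place as a wildcard.
pattern = f"{config['dataFolder']}/{{ID}}/{{ID}}.yyy"
print(pattern)  # results/{ID}/{ID}.yyy
```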
I'm looking for syntax guidance if I define complex structured data in the yaml file - how do I make use of it in the snake rules? When do I use Python syntax and when do I use SnakeMake syntax?
In the snakefile, I typically read the config file to extract configuration information before the rules. This is essentially pure Python, except that a config object is directly made available by Snakemake for convenience. You could probably just use plain standard Python instead, with config = json.load(open("config.json")) or config = yaml.safe_load(open("config.yaml")); both functions take an open file object rather than a path.
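A minimal sketch of reading a config by hand in plain Python and deriving values from it before any rules. Note that json.load expects an open file object rather than a path string. The file is written to a temporary location only to keep the example self-contained, and the keys dataFolder and sampleID are hypothetical.

```python
import json
import os
import tempfile

# Write a small hypothetical config file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as handle:
    json.dump({"dataFolder": "results", "sampleID": [123, 124, 125]}, handle)
    config_path = handle.name

# json.load takes a file object, so open the path first.
with open(config_path) as handle:
    config = json.load(handle)
os.remove(config_path)

# Ordinary Python can then derive values to be used later in rules.
targets = [os.path.join(config["dataFolder"], f"{i}.bam") for i in config["sampleID"]]
print(targets)
```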
In the snakefile, outside the rules, you can do whatever computations you want in python. This can be before reading the config as well as after. You can define functions that can be used in rules (for instance to generate rule's inputs), compute lists of things that will be used as wildcards. I think the only thing is that an object needs to be defined before the rules that use it.
Snakemake syntax seems mainly a means of describing the rules. Within the run part of a rule, you can use whatever python you want, knowing that you have access to a wildcards object to help you. Input and output of rules are lists of file paths, and you can use python in order to build them.

Vim and python - jump to definition key binding

The following youtube video shows that it is possible to jump to definition using vim for python.
However, when I try the same shortcut (Ctrl-G) it doesn't work...
How is it possible to perform the same "jump to definition"?
I installed the plugin Ctrl-P but not rope.

This does not directly answer your question, but it provides a better alternative. I use Jedi with Vim; as a static code analyser, it offers far better options than ctags. I use Spacemacs-style key bindings in Vim, with localleader set to ',':
" jedi
let g:jedi#use_tabs_not_buffers = 0 " use buffers instead of tabs
let g:jedi#show_call_signatures = "1"
let g:jedi#goto_command = "<localleader>gt"
let g:jedi#goto_assignments_command = "<localleader>ga"
let g:jedi#goto_definitions_command = "<localleader>gg"
let g:jedi#documentation_command = "K"
let g:jedi#usages_command = "<localleader>u"
let g:jedi#completions_command = "<C-Space>"
let g:jedi#rename_command = "<leader>r"

Vim's code navigation is based on a universal database called tags file. It needs to be generated (and updated) manually. :help ctags lists some applications that can do that. Exuberant ctags is a common one that supports many programming languages, but there are also specialized ones, like ptags.py (found in your Python source directory at Tools/scripts/ptags.py).
Plugins like easytags.vim provide more convenience by e.g. automatically updating the tags file on each save.
The default command for jumping to the definition is CTRL-] (not CTRL-G; that prints the current filename; see :help CTRL-G), or the Ex command :tag {identifier}; see all at :help tag-commands.

Some suggestions for people reading other answers to this question in the future:
The tags file has one limitation. If multiple objects in your code have the same name, you will have problems using ctrl-], as it jumps to the first match, which is not necessarily the correct one. In this situation you can use g ctrl-] (or the :tjump or :tselect command) to get a selection list. You may want to map ctrl-] to "g ctrl-]".
You may want to be able to jump straight to the correct object. In that case you might want to use jedi-vim, and if you are used to c-] you might want to use this mapping for jedi goto: let g:jedi#goto_command = ""
Lastly, you may want to use Universal Ctags instead of Exuberant Ctags because of its better support for new files (not necessarily Python).

If you're using YouCompleteMe there is a command for that
:YcmCompleter GoToDefinition
if you want to add a shortcut for doing that in a new tab
nnoremap <leader>d :tab split \| YcmCompleter GoToDefinition<CR>

Use Powershell to retrieve number from a sequence using web request

I have a very specific problem I have been trying to work out. I'm using a PowerShell script to name newly imaged computers during the imaging process, and I need to grab a newly generated number from a sequence. I use SCCM 2012 R2 for this, by the way.
For example, I have the script naming our computers by our convention using wmi query:
if ($ComputerVersion -eq "ThinkPad T400")
{
$OSDComputerName = "T400xxxx-11"
$TSEnv = New-Object -COMObject Microsoft.SMS.TSEnvironment
$TSEnv.Value("OSDComputerName") = "$OSDComputerName"
}
I set the $ComputerVersion variable, using WMI query, like so:
$ComputerVersion = (Get-WmiObject -Class Win32_ComputerSystemProduct | Select-Object Version).Version
So, the crux of my question is that I want to set another variable, probably something simple like $num, for the next number available to label our computers. This number will replace the "xxxx". I'll be doing that by:
if ($ComputerVersion -eq "ThinkPad T400")
{
$OSDComputerName = "T400" + $num + "-11"
$TSEnv = New-Object -COMObject Microsoft.SMS.TSEnvironment
$TSEnv.Value("OSDComputerName") = "$OSDComputerName"
}
This number is generated by a Linux server we have, which is already running a Python script to dish out the next available number in the sequence. I can post that Python script if needed, but it's 133 lines.
What I need to know is how to call for that web request via PowerShell, and set that returned number (the next available) as a new variable.
I've never used web-services or web-requests before and any help would be greatly appreciated. Thanks in advance!

Depends what the web request returns and whether or not you need to process any return data, but if it simply returns the number you could do this:
$webClient = New-Object System.Net.WebClient
$num = $webClient.downloadstring("http://yourwebservice.com/buildnumber")

how to implement python spell checker using google's "did you mean?"

I'm looking for a way to make a function in python where you pass in a string and it returns whether it's spelled correctly. I don't want to check against a dictionary. Instead, I want it to check Google's spelling suggestions. That way, celebrity names and other various proper nouns will count as being spelled correctly.
Here's where I'm at so far. It works most of the time, but it messes up with some celebrity names. For example, things like "cee lo green" or "posner" get marked as incorrect.
import httplib
import xml.dom.minidom
data = """
<spellrequest textalreadyclipped="0" ignoredups="0" ignoredigits="1" ignoreallcaps="1">
<text> %s </text>
</spellrequest>
"""
def spellCheck(word_to_spell):
    con = httplib.HTTPSConnection("www.google.com")
    con.request("POST", "/tbproxy/spell?lang=en", data % word_to_spell)
    response = con.getresponse()
    dom = xml.dom.minidom.parseString(response.read())
    dom_data = dom.getElementsByTagName('spellresult')[0]
    if dom_data.childNodes:
        for child_node in dom_data.childNodes:
            result = child_node.firstChild.data.split()
            for word in result:
                if word_to_spell.upper() == word.upper():
                    return True
        return False
    else:
        return True

Peter Norvig tells you how to implement a spell checker in Python.
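For a flavour of that approach, here is a heavily simplified sketch in the spirit of Norvig's corrector, not his actual code: it only looks one edit away, and CORPUS is a tiny hypothetical text standing in for the large corpus a real checker would count words from. Proper nouns would only be recognised if they appear in that corpus.

```python
import re
from collections import Counter

# Tiny hypothetical corpus; a real checker would count words from a large text.
CORPUS = "hello world the quick brown fox jumps over the lazy dog hello world"
WORDS = Counter(re.findall(r"[a-z]+", CORPUS.lower()))


def edits1(word):
    """All strings one delete, transpose, replace, or insert away from word."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def correction(word):
    """Return word if known, else the most frequent known word one edit away."""
    if word in WORDS:
        return word
    candidates = [w for w in edits1(word) if w in WORDS]
    return max(candidates, key=WORDS.get) if candidates else word


print(correction("wrold"))  # world
print(correction("quick"))  # quick
```

A word counts as correctly spelled here when correction(word) == word and the word is in WORDS; unknown words with no close match are returned unchanged.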

Rather than sticking to Mr. Google, try out the other big fellows.
If you really want to stick with search engines that count page requests, Yahoo and Bing provide some excellent features. Yahoo directly provides spell checking services using YQL tables (free: 5000 requests/day, non-commercial).
There are also a good number of Python APIs capable of much the same magic, including on the proper nouns you mentioned (they may sometimes get it wrong; after all, it is based on probability).
So, for this second option, here is a good list (totally free):
GNU Aspell (even has Python bindings)
PyEnchant
Whoosh (it does a lot more than spell checking, but I think it has some edge there)
I hope they give you a clear idea of how things work.
Actually, spell checking involves very complex mechanisms from areas such as machine learning, AI, and NLP. So companies like Google and Yahoo don't really offer their APIs entirely for free.

How can I use Microsoft Word's spelling/grammar checker programmatically?

I want to process a medium to large number of text snippets using a spelling/grammar checker to get a rough approximation and ranking of their "quality." Speed is not really of concern either, so I think the easiest way is to write a script that passes off the snippets to Microsoft Word (2007) and runs its spelling and grammar checker on them.
Is there a way to do this from a script (specifically, Python)? What is a good resource for learning about controlling Word programmatically?
If not, I suppose I can try something from Open Source Grammar Checker (SO).
Update
In response to Chris' answer, is there at least a way to a) open a file (containing the snippet(s)), b) run a VBA script from inside Word that calls the spelling and grammar checker, and c) return some indication of the "score" of the snippet(s)?
Update 2
I've added an answer which seems to work, but if anyone has other suggestions I'll keep this question open for some time.

It took some digging, but I think I found a useful solution. Following the advice at http://www.nabble.com/Edit-a-Word-document-programmatically-td19974320.html I'm using the win32com module (if the SourceForge link doesn't work, according to this Stack Overflow answer you can use pip to get the module), which allows access to Word's COM objects. The following code demonstrates this nicely:
import win32com.client, os
wdDoNotSaveChanges = 0
path = os.path.abspath('snippet.txt')
snippet = 'Jon Skeet lieks ponies. I can haz reputashunz? '
snippet += 'This is a correct sentence.'
file = open(path, 'w')
file.write(snippet)
file.close()
app = win32com.client.gencache.EnsureDispatch('Word.Application')
doc = app.Documents.Open(path)
print("Grammar: %d" % (doc.GrammaticalErrors.Count,))
print("Spelling: %d" % (doc.SpellingErrors.Count,))
app.Quit(wdDoNotSaveChanges)
which produces
Grammar: 2
Spelling: 3
which matches the results when invoking the check manually from Word.
