Parser generation

Parser generation - python

i am doing a project on SOFWARE PLAGIARISM DETECTION..i am intended to do it with language C..for that i am supposed to create a token generator, and a parser..but i dont know where to start..any one can help me out with this..
i created a database of tokens and i separated the tokens from my program.Next thing i wanna do is to compare two programs to find out whether it's plagiarized or not. For that i need to create a syntax analyzer.I don't know where to start from...
i.e I want to create a parser for c programs in python

If you want to create a parser in Python you can look at these libraries:
PLY
pyparsing
and Lepl - new but very powerful

Building a real C parser by yourself is a really big task.
I suggest you either find one that is already done, eg. pycparser or you define a really simple subset of C that is easily parsed.
You'll have plenty of work to do for your plagiarism detector after you are done parsing C.

I'm not sure you need to parse the token stream to detect the features you're looking for. In fact, it's probably going to complicate things more than anything.
what you're really looking for is sequences of original source code that have a very strong similarity with a suspect sample code being tested. This sounds very similar to the purpose of a Bayes classifier, like those used in spam filtering and language detection.

Related

NLP how to go beyond simple intent finding--using context and targeting objects

Edit: Apologies if this isn't the proper space for this kind of question. Would appreciate if you can please point me to a better forum
I'm manually applying NLP rules to a chatbot.
Currently, I've a simple set of rules--actions that follow certain trigger words.
Ex: "Create the match on saturday."
This has been working for relatively simple phrases like the above example, where I would expect words like "create" and appoint it as an action word, expect "match" then appoint it as an object entity; "saturday" as a time, etc.
When I try to expand the scope of what the bot can handle, it becomes more complicated. Here's an example of something I'm trying to handle:
"Update the memo title of the match on saturday to 'Game Day'."
I'm not sure how to move forward.
I considered manually expanding(expecting) the entities, then trying a method where I still parse for action words, but if a certain threshold of varying objects is passed, then I execute a subset of that action.
For example: update will obviously signal an action, but the addition of "memo", "title", "'Game day'", signals a subset of the action as there's more to this than a simpler "create match". Then, checking the additional objects like "title" against a predetermined set of entities will narrow down the intent to update + title.
I see many holes in this logic, esp. as the complexity even slightly increases.
This led me to the field of dependency parsing.
But upon looking into dep. parsing, I wonder if this is something feasible in manual implementation.
I'll be using python code, it won't be deep learning based.
What do you think of the basic rule-based method I've outlined? Is it something that sounds workable for a domain-specific bot?
Should I be looking into using NLP libraries offered in Python such as NLTK or spaCy and use their features? My concern with using these features such as spaCy's dependency parser was that it was overkill, or too much added complexity for handling my domain specific tasks. Furthermore, and likely because I haven't yet seen effective use of it in another project, I guess I have doubts the practicality of dependency parsers outside of academia or research.
Edit: Apologies if this isn't the proper space for this kind of question. Would appreciate if you can please point me in the right direction

Ways to search and tag a MongoDB database of academic papers

Apologies for the vague nature of this question but I'm honestly not quite sure where to start and thought I'd ask here for guidance.
As an exercise, I've downloaded several academic papers and stored them as plain text in a mongoDB database.
I'd like to do write a search feature (using Python, R, whatever) that when you enter text and returns the most relevant articles. Clearly, relevant is really hard -- that's what google got so right.
However, I'm not looking for it to be perfect. Just to get something. A few thoughts I had were:
1) Simple MongoDB full text search
2) Implement Lucene Search
3) Tag them (unsure how though) and then return them sorted by the most number of tags?
Is there a solution someone has used that's out of the box and works fairly well? I can always optimize the search feature later -- for now I just want all the pieces to move together...
Thanks!

Is there a solution someone has used that's out of the box and works fairly well?
It depends on how you define well, but in simple terms, I'd say no. There is just no single and accurate definition of fairly well. A lot of challenges intrinsic to a particular problem arise when one trying to implement a good search algorithm. Those challenges lies in:
users needs diversity. Users in different fields have different intentions and as a result different expectation from a search result page;
natural languages diversity, if you are trying to implement multi-language search (German has a lot of noun compounds, Russian has enormous flexion variability etc.);
There are some algorithms that are proven to work better than others though, thus are good to start from. TF*IDF and BM25 two most popular.
I can always optimize the search feature later -- for now I just want all the pieces to move together...
MongoDB or any RDBMS with fulltext indexing support is good enough for a proof-of-concept, but if you need to optimize for search performance, you will need an inverted index (Solr/Lucene). From Solr/Lucene you will get ability to manage:
how exactly words are stemmed (this is important to solve undersemming/overstemming problems);
what the word is. Is "supercomputer" one word? What about "stackoverflow" or "OutOfBoundsException"?
synonyms and word expansion (should "O2" be found for a "oxygen" query?)
how exactly search is performed. Which words could be ignored during search. Which ones are required to be found. Which one are required to be found near each other (think of search phrase: "not annealed" or "without expansion").
This is just what comes to mind first.
So if you are planning to work these things out I definitely recommend Lucene as a framework or Solr/ElasticSearch as a search system if you need to build proof-of-concept fast. If not, MongoDB/RDMS will work well.

fault-tolerant python based parser for WikiLeaks cables

Some time ago I started writing a BNF-based grammar for the cables which WikiLeaks released. However I now realized that my approach is maybe not the best and I'm looking for some improvement.
A cabe consists of three parts. The head has some RFC2822-style format. This parses usually correct. The text part has a more informal specification. For instance, there is a REF-line. This should start with REF:, but I found different versions. The following regex catches most cases: ^\s*[Rr][Ee][Ff][Ss: ]. So there are spaces in front, different cases and so on. The text part is mostly plain text with some special formatted headings.
We want to recognize each field (date, REF etc.) and put into a database. We chose Pythons SimpleParse. At the moment the parses stops at each field which it doesn't recognize. We are now looking for a more fault-tolerant solution. All fields have some kind of order. When the parser don't recognize a field, it should add some 'not recognized'-blob to the current field and go on. (Or maybe you have some better approach here).
What kind of parser or other kind of solution would you suggest? Is something better around?

Cablemap seems to do what you're searching for: http://pypi.python.org/pypi/cablemap.core/

I haven't looked at the cables but lets take a similar problem and consider the options: Lets say you wanted to write a parser for RFCs, there's an RFC for formatting of RFCs, but not all RFCs follow it.
If you wrote a strict parser, you'll run into the situation you've run into - the outliers will halt your progress - in this case you've got two options:
Split them into two groups, the ones that are strictly formatted and the ones that aren't. Write your strict parser so that it gets the low hanging fruit and figure out based on the number outliers what the best options are (hand processing, outlier parser, etc)
If the two groups are equally sized, or there are more outliers than standard formats - write a flexible parser. In this case regular expressions are going to be more beneficial to you as you can process an entire file looking for a series of flexible regexs, if one of the regexes fails you can easily generate the outlier list. But, since you can make the search against a series of regexes you could build a matrix of pass/fails for each regex.
For 'fuzzy' data where some follow the format and some do not, I much prefer using the regex approach. That's just me though. (Yes, it is slower, but having to engineer the relationship between each match segment so that you have a single query (or parser) that fits every corner case is a nightmare when dealing with human generated input.

Named Entity Recognition from personal Gazetter using Python

I try to do named entity recognition in python using NLTK.
I want to extract personal list of skills.
I have the list of skills and would like to search them in requisition and tag the skills.
I noticed that NLTK has NER tag for predefine tags like Person, Location etc.
Is there a external gazetter tagger in Python I can use?
any idea how to do it more sophisticated than search of terms ( sometimes multi words term )?
Thanks,
Assaf

I haven't used NLTK enough recently, but if you have words that you know are skills, you don't need to do NER- just a text search.
Maybe use Lucene or some other search library to find the text, and then annotate it? That's a lot of work but if you are working with a lot of data that might be ok. Alternatively, you could hack together a regex search which will be slower but probably work ok for smaller amounts of data and will be much easier to implement.

Have a look at RegexpTagger and eventually RegexpParser, I think that's exactly what you are looking for.
You can create your own POS tags, ie. map skills to a tag, and then easily define a grammar.
Some sample code for the tagger is in this pdf.

Parsing Meaning from Text

I realize this is a broad topic, but I'm looking for a good primer on parsing meaning from text, ideally in Python. As an example of what I'm looking to do, if a user makes a blog post like:
"Manny Ramirez makes his return for the Dodgers today against the Houston Astros",
what's a light-weight/ easy way of getting the nouns out of a sentence? To start, I think I'd limit it to proper nouns, but I wouldn't want to be limited to just that (and I don't want to rely on a simple regex that assumes anything Title Capped is a proper noun).
To make this question even worse, what are the things I'm not asking that I should be? Do I need a corpus of existing words to get started? What lexical analysis stuff do I need to know to make this work? I did come across one other question on the topic and I'm digging through those resources now.

You need to look at the Natural Language Toolkit, which is for exactly this sort of thing.
This section of the manual looks very relevant: Categorizing and Tagging Words - here's an extract:
>>> text = nltk.word_tokenize("And now for something completely different")
>>> nltk.pos_tag(text)
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'),
('completely', 'RB'), ('different', 'JJ')]
Here we see that and is CC, a coordinating conjunction; now and completely are RB, or adverbs; for is IN, a preposition; something is NN, a noun; and different is JJ, an adjective.

Use the NLTK, in particular chapter 7 on Information Extraction.
You say you want to extract meaning, and there are modules for semantic analysis, but I think IE is all you need--and honestly one of the only areas of NLP computers can handle right now.
See sections 7.5 and 7.6 on the subtopics of Named Entity Recognition (to chunk and categorize Manny Ramerez as a person, Dodgers as a sports organization, and Houston Astros as another sports organization, or whatever suits your domain) and Relationship Extraction. There is a NER chunker that you can plugin once you have the NLTK installed. From their examples, extracting a geo-political entity (GPE) and a person:
>>> sent = nltk.corpus.treebank.tagged_sents()[22]
>>> print nltk.ne_chunk(sent)
(S
The/DT
(GPE U.S./NNP)
is/VBZ
one/CD
...
according/VBG
to/TO
(PERSON Brooke/NNP T./NNP Mossman/NNP)
...)
Note you'll still need to know tokenization and tagging, as discussed in earlier chapters, to get your text in the right format for these IE tasks.

Natural Language Processing (NLP) is the name for parsing, well, natural language. Many algorithms and heuristics exist, and it's an active field of research. Whatever algorithm you will code, it will need to be trained on a corpus. Just like a human: we learn a language by reading text written by other people (and/or by listening to sentences uttered by other people).
In practical terms, have a look at the Natural Language Toolkit. For a theoretical underpinning of whatever you are going to code, you may want to check out Foundations of Statistical Natural Language Processing by Chris Manning and Hinrich Schütze.
(source: stanford.edu)

Here is the book I stumbled upon recently: Natural Language Processing with Python

What you want is called NP (noun phrase) chunking, or extraction.
Some links here
As pointed out, this is very problem domain specific stuff. The more you can narrow it down, the more effective it will be. And you're going to have to train your program on your specific domain.

This is a really really complicated topic. Generally, this sort of stuff falls under the rubric of Natural Language Processing, and tends to be tricky at best. The difficulty of this sort of stuff is precisely why there still is no completely automated system for handling customer service and the like.
Generally, the approach to this stuff REALLY depends on precisely what your problem domain is. If you're able to winnow down the problem domain, you can gain some very serious benefits; to use your example, if you're able to determine that your problem domain is baseball, then that gives you a really strong head start. Even then, it's a LOT of work to get anything particularly useful going.
For what it's worth, yes, an existing corpus of words is going to be useful. More importantly, determining the functional complexity expected of the system is going to be critical; do you need to parse simple sentences, or is there a need for parsing complex behavior? Can you constrain the inputs to a relatively simple set?

Regular expressions can help in some scenario. Here is a detailed example: What’s the Most Mentioned Scanner on CNET Forum, which used a regular expression to find all mentioned scanners in CNET forum posts.
In the post, a regular expression as such was used:
(?i)((?:\w+\s\w+\s(?:(?:(?:[0-9]+[a-z\-]|[a-z]+[0-9\-]|[0-9])[a-z0-9\-]*)|all-in-one|all in one)\s(\w+\s){0,1}(?:scanner|photo scanner|flatbed scanner|adf scanner|scanning|document scanner|printer scanner|portable scanner|handheld scanner|printer\/scanner))|(?:(?:scanner|photo scanner|flatbed scanner|adf scanner|scanning|document scanner|printer scanner|portable scanner|handheld scanner|printer\/scanner)\s(\w+\s){1,2}(?:(?:(?:[0-9]+[a-z\-]|[a-z]+[0-9\-]|[0-9])[a-z0-9\-]*)|all-in-one|all in one)))
in order to match either of the following:
two words, then model number (including all-in-one), then “scanner”
“scanner”, then one or two words, then model number (including
all-in-one)
As a result, the text extracted from the post was like,
discontinued HP C9900A photo scanner
scanning his old x-rays
new Epson V700 scanner
HP ScanJet 4850 scanner
Epson Perfection 3170 scanner
This regular expression solution worked in a way.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.