I am currently working on a study project where I have to transform the vehicle complaint descriptions from the NHTSA database (https://catalog.data.gov/dataset/nhtsas-office-of-defects-investigation-odi-complaints) into RDF Turtle and later into a knowledge graph representation, maybe using GraphDB or a similar triple store. A set of descriptions can be found in the attachment.
I have researched topics like NER, relation extraction, OpenIE, FRED, knowledge graph construction, RDFS, OWL, and ontologies in theory.
Now I have come to a point where I just don't know how to do it in practice.
Can someone help me and guide me through it a little?
Where should I start, and with what?
Thank you very much,
Dennis
Example customer complaints
Stanford CoreNLP has Open IE, which can extract triples from text.
https://nlp.stanford.edu/software/openie.html
If you want to do it in Python, look into Stanza:
https://github.com/stanfordnlp/stanza.
It has an official Python wrapper that fires up a Java-based CoreNLP server in the backend.
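To tie this back to the RDF Turtle goal in the question, here is a minimal sketch, assuming a local CoreNLP installation (with CORENLP_HOME set) plus the stanza and rdflib packages; the example.org namespace, the sample sentence, and the to_uri helper are illustrative assumptions, not part of either library:

```python
# Minimal sketch: OpenIE triples -> RDF Turtle via stanza + rdflib.
from stanza.server import CoreNLPClient
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/complaints/")  # hypothetical namespace

def to_uri(text):
    # Crude normalization for the sketch; a real pipeline would do
    # entity linking instead.
    return EX[text.strip().lower().replace(" ", "_")]

text = "THE VEHICLE STALLED WHILE DRIVING. THE DEALER REPLACED THE FUEL PUMP."

g = Graph()
g.bind("ex", EX)

with CoreNLPClient(annotators=["tokenize", "ssplit", "pos", "lemma",
                               "depparse", "natlog", "openie"],
                   memory="4G", be_quiet=True) as client:
    ann = client.annotate(text)
    for sentence in ann.sentence:
        for triple in sentence.openieTriple:
            g.add((to_uri(triple.subject),
                   to_uri(triple.relation),
                   to_uri(triple.object)))

print(g.serialize(format="turtle"))
```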
Related
I have a CSV file which includes people's hopes about an educational program. I want to classify people and match them to see who has similar hopes. I am using TF-IDF in Python for this. What kinds of technologies or repositories do you suggest?
A piece of the data
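Since the question already mentions TF-IDF in Python, here is a minimal sketch with scikit-learn that vectorizes the texts and groups similar ones with k-means; the file name hopes.csv, the column name hope, and the cluster count are assumptions about data I haven't seen:

```python
# Minimal sketch: TF-IDF vectors + k-means clustering of free-text "hopes".
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

df = pd.read_csv("hopes.csv")            # hypothetical file name
texts = df["hope"].fillna("").tolist()   # hypothetical column name

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(texts)

kmeans = KMeans(n_clusters=5, random_state=0, n_init=10)  # cluster count assumed
df["cluster"] = kmeans.fit_predict(X)

# People in the same cluster have similar hopes.
print(df.groupby("cluster").size())
```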
I am working on a project based on natural language understanding.
Currently, I am trying to resolve pronouns to their respective antecedents, for which I am building a model. I have worked out the basic part of it, but to complete the task I need to understand the narrative of the sentence. So what I want is to check, using a Python API, whether a noun and an object are associated with each other by a given verb.
Example:
method(laptop, have, operating-system) = yes
method(program, have, operating-system) = no
method("he"/"proper_noun", play, football) = yes
method("he"/"proper_noun", play, college) = no
I've heard about NLTK's WordNet API, but I am not sure whether I can use it to perform this. Can it be used?
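For the "X have Y" checks above, WordNet's part-whole (meronym) relations in NLTK can approximate an answer, though they won't cover arbitrary verbs like play; a rough sketch, where the hypernym walk is my own heuristic rather than a documented recipe:

```python
# Rough sketch: approximate method(noun, have, part) with WordNet meronyms.
# One-time setup: import nltk; nltk.download('wordnet')
from nltk.corpus import wordnet as wn

def has_part(whole, part):
    for syn in wn.synsets(whole, pos=wn.NOUN):
        # Parts are often recorded on a more general concept (e.g. "computer"),
        # so also walk up the hypernym chain (heuristic assumption).
        for s in [syn] + list(syn.closure(lambda x: x.hypernyms())):
            meronyms = (s.part_meronyms() + s.member_meronyms()
                        + s.substance_meronyms())
            if any(part in m.name() for m in meronyms):
                return True
    return False

print(has_part("laptop", "keyboard"))   # expected True via computer.n.01
print(has_part("program", "keyboard"))  # expected False
```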
Also, I am kind of on a clock.
Any suggestions are welcome and appreciated.
Note: I am using Parsey McParseface to parse the sentence. I could do the same with NLTK, but P-MPF is more accurate.
**Why isn't there an NLU tag available?**
Edit 1:
Thanks to alexis, the thing I am trying to do is called "anaphora resolution".
The name for what you want is "anaphora resolution", or "coreference resolution". It's a hard problem (probably harder than you realize; NLP tasks are like that), so unless your purpose is just to learn, I recommend you try some existing solutions. I don't know of an anaphora resolution module in NLTK itself, but you can find one as part of the Stanford CoreNLP suite.
See this question about how to interface with it from NLTK. (I haven't tried it myself.)
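If you go the CoreNLP route, here is a minimal sketch via Stanza's client, assuming a local CoreNLP installation with CORENLP_HOME set; the sample sentence is illustrative:

```python
# Minimal sketch: run CoreNLP's 'coref' annotator through Stanza's client
# and print each coreference chain (e.g. "John" <-> "He").
from stanza.server import CoreNLPClient

text = "John bought a laptop. He loves its operating system."

with CoreNLPClient(annotators=["tokenize", "ssplit", "pos", "lemma",
                               "ner", "parse", "coref"],
                   memory="4G", be_quiet=True) as client:
    ann = client.annotate(text)
    for chain in ann.corefChain:
        mentions = []
        for m in chain.mention:
            tokens = ann.sentence[m.sentenceIndex].token[m.beginIndex:m.endIndex]
            mentions.append(" ".join(t.word for t in tokens))
        print(" <-> ".join(mentions))
```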
I have heard about a programming language called AIML which can be used for programming intelligent robots.
I am a web developer and have a web crawler built using Python 2.7, and I have indexed Wikipedia ...
So I wanted to build an answering engine using Python which would use a string variable
(it is a HUGE variable containing the whole of Wikipedia) as a source of information and use AI to answer...
Finally, I wanted to put this up on my school website...
So can I do that in AIML?
Later on, I also want to modify it so as to give live scores and answers to questions like:
"What is the age of ~someperson~?" etc.
For that I'll send my web crawler to index some score pages etc.
Can I program this sort of answering agent in AIML?
If yes, please provide links to tutorials that tell me how to do that (using string variables as a source of information to parse queries and answer like a human).
Moreover, AIML uses syntax like:
```xml
<category>
  <pattern>WHAT ARE YOU</pattern>
  <template>
    <think><set name="topic">Me</set></think>
    I am the latest result in artificial intelligence,
    which can reproduce the capabilities of the human brain
    with greater speed and accuracy.
  </template>
</category>
```
where pattern is the query and template is the answer. So does that mean I have to sit and write these tags for all possible queries?
Or can I make it use its brains to figure out what the person wants and give answers
using the string variable as its source of information?
Thank you.
AIML
It looks like AIML is a form of pattern matching. Moreover, it looks like it is mainly meant for dialog-based agents. Therefore, to use AIML, you would likely need to manually write every question pattern and the correct response (answer).
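To illustrate, a tiny sketch with the third-party PyAIML package (import aiml), assuming the category from the question is saved in a file named greeting.aiml (a hypothetical file name):

```python
# Tiny sketch: load one AIML file and answer a matching query.
# Assumes `pip install python-aiml` and a greeting.aiml file containing
# the <category> shown in the question.
import aiml

kernel = aiml.Kernel()
kernel.learn("greeting.aiml")  # hypothetical file name
print(kernel.respond("WHAT ARE YOU"))
```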
Question answering
What it seems like you are really after is what we call a question answering system. Very briefly, a QA system generally has these components (see the sketch after this list):
1. Question analysis.
   - Extract keywords.
   - (Sometimes) determine the expected answer type (location, person, color, number, etc.).
2. Candidate document selection: a search on your knowledge base using an information retrieval system.
3. Candidate document analysis.
4. Answer extraction: select some part of the document (sentence(s), paragraph(s)).
5. Response generation.
   - Score and rank each answer.
   - Display the most confident answer(s).
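Here is a toy sketch of the retrieval and extraction steps, using TF-IDF cosine similarity from scikit-learn; the two-document "knowledge base" is a stand-in for something like Wikipedia:

```python
# Toy sketch: TF-IDF retrieval + naive answer extraction.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [  # stand-in for a real knowledge base like Wikipedia
    "Python is a programming language created by Guido van Rossum.",
    "The Eiffel Tower is located in Paris, France.",
]
question = "Who created Python?"

# Candidate document selection: rank documents against the question.
vectorizer = TfidfVectorizer(stop_words="english")
doc_vecs = vectorizer.fit_transform(documents)
q_vec = vectorizer.transform([question])
scores = cosine_similarity(q_vec, doc_vecs)[0]

# Answer extraction: here just the best document; a real system would
# rank sentences and match the expected answer type (a person, here).
print(documents[scores.argmax()])
```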
Research
If you really want to dig deeply into this area, I'd suggest using Google Scholar to search for some of the terms I've mentioned, which will give you some research papers that go into detail about many of these topics. Some papers to get you started:
Natural language question answering: The view from here
Answering complex, list and context questions with LCC's Question-Answering Server
The structure and performance of an open-domain question answering system
Learning surface text patterns for a question answering system
Learning question classifiers
What is not in the Bag of Words for Why-QA?
Shameless plug
I've recently taken a course on natural language processing and developed a rudimentary QA system that uses Wikipedia as a knowledge base. (Actually, I used the Simple English Wikipedia because it was much easier to work with; the system does work with the full version, just much more slowly.)
If you are interested in looking at some Python code as a reference, you may do so on the project's GitHub page: bwbaugh/causeofwhy. In addition, there is some more detailed documentation on what goes on in each step of the system components.
There is also a very basic working demo of the QA system in action that is (currently) available; however, bear in mind the system is a proof of concept and can take upwards of 30 seconds to respond to a question (depending on the question).
I am new to programming, and I am trying to understand transliteration, like the Google Input Tools that allow the user to type in one language and get the text in another script.
How does transliteration work? Specifically, if I am translating from English to Hindi or English to Russian, do I need to incorporate a dictionary of words for the English, Hindi, and Russian languages?
Does anyone know of any tutorials showing how to write the code for transliteration? I have tried searching, but no luck.
Also, does the code have to be in JavaScript/jQuery (client-side code)? My project is Python/Django. Can I write the transliteration code in Python/Django?
Thanks.
Direct dictionary-to-dictionary automatic translation produces poor results due to differences in grammar and the presence of idiomatic sentences. The starting point in Python, in my experience, should be the NLTK (Natural Language Toolkit) libraries and tutorials.
Then, to provide you a working example, you may start from here:
Machine Translation using babelize_shell() in NLTK
Translating human languages in Python
Google is your friend
Bing is your friend
The use of JavaScript/jQuery depends on the UI you are planning; maybe you want to trigger an automatic translation after a few keys are pressed, or on blur or change events on an input tag, but that is not relevant for the translation itself.
The process of translating is also really resource-consuming, so I discourage you from doing it inside a Django view. My suggestion is not to reinvent the wheel, and to use some already existing API like the Google or Bing ones.
I found that the better search term is "input method editor" (IME), not transliteration.
There is a project on GitHub, https://github.com/wikimedia/jquery.ime, that deals with IMEs and transliteration.
I hope this helps someone.
The typical way of implementing transliteration is to use a mapping dictionary. An example of this can be seen in the mapping.py file for the CyrTranslit Python package.
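As a bare illustration of that idea, here is a minimal sketch in the spirit of CyrTranslit's mapping.py; the table below is a tiny, non-authoritative subset of a Cyrillic-to-Latin mapping:

```python
# Minimal sketch: dictionary-based transliteration, character by character.
# The mapping is a small illustrative subset, not a complete standard table.
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d",
    "е": "e", "ж": "zh", "з": "z", "и": "i", "й": "j",
}

def transliterate(text, table=CYR_TO_LAT):
    result = []
    for ch in text:
        mapped = table.get(ch.lower(), ch)  # fall back to the original char
        result.append(mapped.capitalize() if ch.isupper() else mapped)
    return "".join(result)

print(transliterate("живи"))  # -> "zhivi"
```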
Word translation uses a database to convert an English word into a Hindi word.
Some apps are based on this concept, like:
English to Hindi Dictionary
I am planning to develop a web-based application which could crawl Wikipedia for finding relations and store them in a database. By relations, I mean searching for a name, say 'Bill Gates', finding his page, downloading it, and pulling the various pieces of information from the page to store in a database. The information may include his date of birth, his company, and a few other things. But I need to know if there is any way to find these unique data on the page, so that I could store them in a database. Any specific books or algorithms would be greatly appreciated. Mentions of good open-source libraries would also be helpful.
Thank You
If you haven't already, you should have a look at DBpedia. Many categories of wiki articles have "Infoboxes" for the kinds of information you describe, and they've made a database out of it:
http://en.wikipedia.org/wiki/DBpedia
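For example, the kind of infobox fact mentioned above can be pulled straight from DBpedia's public SPARQL endpoint; a minimal sketch using the SPARQLWrapper package (the property choice is an assumption about how DBpedia models birth dates):

```python
# Minimal sketch: query DBpedia for Bill Gates's birth date.
# Requires `pip install SPARQLWrapper` and network access.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX dbr: <http://dbpedia.org/resource/>
    SELECT ?birthDate WHERE { dbr:Bill_Gates dbo:birthDate ?birthDate . }
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["birthDate"]["value"])
```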
You might also leverage some of the information in Metaweb's Freebase (which overlaps with, and I believe may even integrate, the info from DBpedia). They have an API for querying their graph database, and there's a Python wrapper for it called freebase-python.
UPDATE: Freebase is no more; it was acquired by Google and eventually folded into the Google Knowledge Graph. There is an API, but I don't think they have anything like the formal syncing Freebase had with public sources like Wikipedia. I'm personally disappointed in how this looks to have turned out. :-/
As for the natural language processing bit, if you do make headway on that problem you might consider these databases as repositories for any information you do mine.
You mention Python and open source, so I would investigate NLTK (the Natural Language Toolkit). Text mining and natural language processing are areas where you can do a lot with a dumb algorithm (e.g. pattern matching), but if you want to go a step further and do something more sophisticated, i.e. trying to extract information that is stored in a flexible manner or trying to find information that might be interesting but is not known a priori, then natural language processing should be investigated.
NLTK is intended for teaching, so it is a toolkit. This approach suits Python very well. There are a couple of books for it as well. The O'Reilly book is also published online with an open license. See NLTK.org.
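To show what the NLTK route looks like for the 'Bill Gates' example, a small sketch that tags and chunks named entities; the downloads listed in the comment are a one-time setup:

```python
# Small sketch: named entity chunking with NLTK.
# One-time downloads via nltk.download(): punkt,
# averaged_perceptron_tagger, maxent_ne_chunker, words.
import nltk

sentence = "Bill Gates founded Microsoft in 1975."
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)
tree = nltk.ne_chunk(tagged)

for subtree in tree.subtrees():
    if subtree.label() in ("PERSON", "ORGANIZATION", "GPE"):
        print(subtree.label(), " ".join(word for word, tag in subtree.leaves()))
```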
Jvc, there are existing Python modules that can do everything you mentioned above.
For pulling information from webpages, I like to use Selenium, http://seleniumhq.org/projects/ide/. Basically, you can locate and retrieve information on any webpage using a number of identifiers (id, XPath, etc.).
However, like winwaed said, it can be inflexible if you are simply "pattern matching", especially since some websites use dynamic code, meaning the identifiers can change with each subsequent reload of the page. But this problem can be solved by adding regular expressions, i.e. (.*), to your code. Check out this YouTube video, http://www.youtube.com/watch?v=Ap_DlSrT-iE. Even though he is using BeautifulSoup to scrape the website, you can see how he uses regular expressions to pull the information from the page.
Also, I'm not sure what type of database you are working with, but pyodbc, http://code.google.com/p/pyodbc/, can work with SQL databases, and also mainstream ones like Microsoft Access.
So, my advice is to look into Selenium for finding the info on the webpage, pyodbc to store and retrieve it, and regular expressions when the identifiers are dynamic.
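Putting those pieces together, here is a brief sketch, where the URL and the regular expression are illustrative assumptions rather than a tested recipe:

```python
# Brief sketch: load a page with Selenium, then pull a field with a regex
# so the code tolerates shifting ids/classes. Requires `pip install selenium`
# and a Firefox driver available on the PATH.
import re
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("https://en.wikipedia.org/wiki/Bill_Gates")  # illustrative URL
html = driver.page_source
driver.quit()

# Illustrative pattern: grab the first four-digit year after "Born".
match = re.search(r"Born.*?(\d{4})", html, re.DOTALL)
if match:
    print("Birth year:", match.group(1))
```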