I am trying to apply the knowledge I learnt during my statistics courses to real-world datasets.
I am looking for some real databases/tables. It would be helpful if a link to the page were included as well. Format is not a constraint - I use Python and can easily convert to SQLite.
One example would be [one medium-sized table] for identifying the country for a given IP address: http://ip-to-country.webhosting.info/node/view/6
Well, since your profile says you're from India, I thought some Indian Government statistics would help, so a quick Google search yields this site:
http://mospi.nic.in/dwh/index.htm
Click on 'Tables', and you'll have a list of more data/tables than you could possibly need.
...these files all seem to be in Microsoft XLS format, but another quick Google search yields a free converter: http://download.cnet.com/XLS-Converter/3000-2077_4-10401513.html
...or you could use the Python library xlrd ( http://pypi.python.org/pypi/xlrd ) and read the files directly.
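For instance, here is a minimal sketch of reading one of those XLS files with xlrd and loading it into SQLite; the filename, table name and column count are placeholders, not something taken from the MOSPI site:

import sqlite3
import xlrd  # reads legacy .xls workbooks

# Placeholder filename; point this at one of the downloaded XLS tables.
book = xlrd.open_workbook("mospi_table.xls")
sheet = book.sheet_by_index(0)

conn = sqlite3.connect("stats.db")
conn.execute("CREATE TABLE IF NOT EXISTS stats (col1 TEXT, col2 TEXT, col3 TEXT)")

# Skip the header row and insert the rest; adjust the column count to the real sheet.
for row_idx in range(1, sheet.nrows):
    conn.execute("INSERT INTO stats VALUES (?, ?, ?)", sheet.row_values(row_idx)[:3])

conn.commit()
conn.close()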
I have thousands of scanned field books as PDFs. Each has a unique filename. In a spreadsheet I have metadata for each, where each row has:
index number, filename, info1, info2, info3, info4, etc.
filename is the exact file name of the PDF. info1 is just an example of a metadata field, such as 'Year' or whatever. There are only about 8 fields or so, and not every PDF is relevant to all of them.
I assume there should be a reasonable way to create a database, MySQL or other, by reading the spreadsheet (which I can just save as .csv or .txt or something). This part I am sure I can handle.
I want to be able to look up/search for a PDF file by entering various search terms against the metadata and get a list of results, in a web interface or a custom window, and be able to click on a result and open the file. Basically a typical search window with predefined fields you can fill in to get results - like at an old-school library terminal.
I have decent coding skills in Python, mostly math, but some file-handling skills as well. I am looking for guidance on what tools and approach I should take. My short-term goal is to be able to query, find files, and open whatever results come back. Long term, I want to be able to share this with the public so they can search and find stuff.
After trying to figure out what to search for online, I am obviously at a loss. How do you suggest I do this, and what tools or libraries should I use? I cannot find an example of this online; I am not sure how to word it.
The actual data stuff could be done with Pandas:
read the Excel file into Pandas
perform the search on the Pandas dataframe, e.g. using df.query() (see the sketch below)
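As a rough illustration (the file name and the column names like 'Year' and 'info2' are just assumptions based on your description):

import pandas as pd

# Placeholder file and column names, guessed from the description above.
df = pd.read_excel("fieldbook_metadata.xlsx")

# Example: every PDF from 1952 whose info2 field mentions "survey".
hits = df.query("Year == 1952")
hits = hits[hits["info2"].str.contains("survey", case=False, na=False)]

print(hits["filename"].tolist())  # the files to list or open for the user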
But this does not give you a GUI. For that you could build a web app using the Flask or Django framework. That, however, is not something one masters overnight :)
This is a good course to learn that kind of stuff: https://www.edx.org/course/cs50s-web-programming-with-python-and-javascript?index=product&queryID=01efddd992de28a8b1b27d136111a2a8&position=3
Can anyone help me with how to convert a PDF file to an XML file using Python code? My PDF contains:
Unstructured data
Images
Mathematical equations
Chemical equations
Table data
Logos, tags, etc.
I tried using PDFMiner, but my PDF data was not converted into .xml/.json file format. Are there any libraries other than PDFMiner? PyPDF2, Tabula-py, PDFQuery, Camelot, PyMuPDF, pdf to docx, pandas - none of these other libraries/utilities are suitable for my requirement.
Please advise me on any other options. Thank you.
The first thing I would recommend trying is GROBID (see here for the full documentation). You can play with an online demo here to see if it fits your needs (select TEI -> Process Fulltext Document, and upload a PDF). You can also check out this from the Allen Institute (it is based on GROBID and has a handy function for converting TEI XML to JSON).
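If GROBID looks like a fit, a rough sketch of calling a locally running GROBID service from Python could look like this (the port is GROBID's default, the filenames are placeholders):

import requests

# Assumes the GROBID service is running locally on its default port.
GROBID_URL = "http://localhost:8070/api/processFulltextDocument"

with open("paper.pdf", "rb") as pdf:  # placeholder input file
    response = requests.post(GROBID_URL, files={"input": pdf})
response.raise_for_status()

# GROBID returns TEI XML, which you can post-process or convert to JSON.
with open("paper.tei.xml", "w", encoding="utf-8") as out:
    out.write(response.text)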
The other package which, obviously, does a good job is the Adobe PDF Extract API (see here). It is of course a paid service, but when you register for an account you get 1,000 document transactions for free. It's easy to use from Python, well documented, and a good way to experiment and get a feel for the difficulties of reliable data extraction from PDFs.
I worked with both options to extract text, figures, tables etc. from scientific papers. Both yielded good results. The main problem with out-of-the-box solutions is that, when you work with complex formats (or badly formatted docs), erroneously identified document elements are quite common (for example a footnote or a header gets merged with the main text). Both options are based on machine learning models and, at least for GROBID, it is possible to retrain these models for your specific task (I haven't tried this so far, so I don't know how worthwhile it is).
However, if your target PDFs are all of the same (simple) format (or if you can control their format) you should be fine with either option.
I'm looking for a search engine that I can point at a column in my database and that supports advanced functions like spelling correction and "close to" results.
Right now I'm just using
SELECT <column> FROM <table> WHERE <colname> LIKE '%<searchterm>%'
and I'm missing some results particularly when users misspell items.
I've written some code to fix misspellings by running them through a spellchecker, but thought there may be a better out-of-the-box option to use. Google turns up lots of options for indexing and searching an entire site, whereas I really just need to index and search this one table column.
Apache Solr is a great search engine. It provides:
(1) N-gram indexing (you can search not just for complete strings but also for partial substrings, which helps greatly in getting similar results).
(2) An out-of-the-box spell corrector based on a distance metric/edit distance (which will get you a "did you mean chicago" when the user types in chicaog).
(3) A fuzzy search option out of the box (fuzzy searches help you get close matches for your query; for example, if a user types in GA-123 he would obtain VMDEO-123 as a result).
(4) A "More Like This" component, which helps in much the same way as the options above.
Solr (based on the Lucene search library) is open source and is slowly becoming the de facto standard in the (vertical) search industry, and it is excellent for database searches (you spoke about indexing a database column, which is a cakewalk for Solr). Lucene and Solr are used by many Fortune 500 companies as well as internet giants.
The Sphinx search engine is also great (I love it too, as it has a very low footprint for everything and is C++ based), but to put it simply, Solr is much more popular.
Python support and APIs are available for both. However, Sphinx runs as a standalone executable while Solr is accessed over HTTP. So for Solr you simply call a Solr URL from your Python program, which returns results that you can send to your front end for rendering - as simple as that.
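To make that concrete, a minimal sketch of such a call could look like the following; the core name "mytable", the field name "name" and the spellcheck setup are assumptions about your configuration:

import requests

# Assumes a Solr core named "mytable" on the default port, with the column indexed as "name".
SOLR_URL = "http://localhost:8983/solr/mytable/select"

params = {
    "q": "name:chicaog~",    # the trailing ~ asks Lucene for a fuzzy match
    "wt": "json",
    "spellcheck": "true",    # only useful if the spellcheck component is configured
}

response = requests.get(SOLR_URL, params=params)
print(response.json()["response"]["docs"])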
So far so good. Coming to your question:
First, you should ask yourself whether you really require a search engine. Search engines are good for all the use cases mentioned above, but they are really made for searching across huge amounts of full-text data or millions of rows of tabular data. Algorithms like "did you mean", similar records, spell correctors, etc. can be written on top. Before settling on Solr, please also search Google for (1) the Peter Norvig spell corrector and (2) n-gram indexing. It is quite possible that just by writing a few lines of code you will get exactly what you were looking for.
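As a taste of the "few lines of code" route, Python's standard library alone already catches simple misspellings; this is only an illustration with toy data, not a replacement for a real spell corrector:

import difflib

# Toy stand-in for the values of your database column.
names = ["chicago", "new york", "los angeles", "houston"]

def close_matches(term, candidates, cutoff=0.7):
    """Return candidates that look like misspellings of the search term."""
    return difflib.get_close_matches(term.lower(), candidates, n=5, cutoff=cutoff)

print(close_matches("chicaog", names))  # ['chicago']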
I leave it up to you to decide :)
I would suggest looking into open source technologies like Sphinx Search.
Before going down the Solr/Sphinx route for full-text indexing - which adds complexity and its own overhead - you can try the built-in full-text engine in PostgreSQL if you are using that database. It's easy to set up and performs better than LIKE queries.
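A minimal sketch of what that looks like from Python, assuming psycopg2 and placeholder table/column names; note that for misspellings specifically, the pg_trgm extension's similarity operator is closer to what you describe:

import psycopg2

# Placeholder connection string, table and column names.
conn = psycopg2.connect("dbname=mydb user=me")
cur = conn.cursor()

# to_tsvector/plainto_tsquery handle stemming, so "running" also matches "run";
# an index on the tsvector expression keeps this fast.
cur.execute(
    """
    SELECT name
    FROM items
    WHERE to_tsvector('english', name) @@ plainto_tsquery('english', %s)
    """,
    ("search term",),
)
print(cur.fetchall())

cur.close()
conn.close()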
Check out https://github.com/hcarvalhoalves/django-tsearch2
I am planning to develop a web-based application which could crawl Wikipedia to find relations and store them in a database. By relations, I mean searching for a name, say 'Bill Gates', finding his page, downloading it, and pulling the various pieces of information from the page into a database. The information may include his date of birth, his company, and a few other things. But I need to know whether there is any way to find these specific data points on the page so that I could store them in a database. Any specific books or algorithms would be greatly appreciated. Mentions of good open source libraries would also be helpful.
Thank You
If you haven't already, you should have a look at DBpedia. Many categories of wiki articles have "Infoboxes" for the kinds of information you describe, and they've made a database out of it:
http://en.wikipedia.org/wiki/DBpedia
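For example, a quick sketch of pulling one infobox fact from DBpedia's public SPARQL endpoint (the resource and property here are just an illustration of the kind of query you can run):

import requests

# Ask DBpedia for Bill Gates' date of birth, as extracted from the Wikipedia infobox.
query = """
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?birthDate WHERE { dbr:Bill_Gates dbo:birthDate ?birthDate }
"""

response = requests.get(
    "https://dbpedia.org/sparql",
    params={"query": query, "format": "application/sparql-results+json"},
)
for row in response.json()["results"]["bindings"]:
    print(row["birthDate"]["value"])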
You might also leverage some of the information in Metaweb's Freebase (which overlaps with, and I believe may even integrate, the info from DBpedia). They have an API for querying their graph database, and there's a Python wrapper for it called freebase-python.
UPDATE: Freebase is no more; they were acquired by Google and eventually folded into the Google Knowledge Graph. There is an API, but I don't think they have anything like the formal syncing Freebase had with public sources like Wikipedia. I'm personally disappointed in how this turned out. :-/
As for the natural language processing bit, if you do make headway on that problem you might consider these databases as repositories for any information you do mine.
You mention Python and open source, so I would investigate NLTK (the Natural Language Toolkit). Text mining and natural language processing are areas where you can do a lot with a dumb algorithm (e.g. pattern matching), but if you want to go a step further and do something more sophisticated - i.e. try to extract information that is stored in a flexible manner, or find information that might be interesting but is not known a priori - then natural language processing is worth investigating.
NLTK is intended for teaching, so it is structured as a toolkit. This approach suits Python very well. There are a couple of books for it as well; the O'Reilly book is also published online under an open license. See NLTK.org
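A minimal named-entity sketch with NLTK might look like the following; the exact names of the resources you need to download vary a bit between NLTK versions:

import nltk

# One-time model downloads; the resource names differ slightly across NLTK versions.
for pkg in ["punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"]:
    nltk.download(pkg, quiet=True)

text = "Bill Gates founded Microsoft in Albuquerque in 1975."
tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text)))

# Named entities show up as subtrees labelled PERSON, ORGANIZATION, GPE, etc.
for subtree in tree.subtrees():
    if subtree.label() != "S":
        print(subtree.label(), " ".join(word for word, tag in subtree.leaves()))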
Jvc, there are existing Python modules that can do everything you mentioned above.
For pulling information from webpages, I like to use Selenium, http://seleniumhq.org/projects/ide/. Basically, you can locate and retrieve information on any webpage using a number of identifiers (id, XPath, etc.).
However, like winwaed said, it can be inflexible if you are simply "pattern matching", especially since some websites use dynamic code - meaning the identifiers can change with each subsequent reload of the page. But this problem can be solved by adding regular expressions, e.g. (.*), to your code. Check out this YouTube video: http://www.youtube.com/watch?v=Ap_DlSrT-iE. Even though he is using BeautifulSoup to scrape the website, you can see how he uses regular expressions to pull the information from the page.
Also, I'm not sure what type of database you are working with, but pyodbc, http://code.google.com/p/pyodbc/, can work with SQL types, and also mainstream databases like Microsoft Access.
So, my advice is to look into Selenium for finding the info on the webpage, pyodbc to store and retrieve it, and regular expressions when the identifiers are dynamic.
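To make the scrape-and-store idea concrete, here is a rough sketch using requests and BeautifulSoup (as in the video above); the page, CSS classes and infobox keys are whatever Wikipedia happens to use, so treat it as illustrative rather than robust:

import re
import requests
from bs4 import BeautifulSoup

# Illustration only: pull the infobox of a single Wikipedia page into a dict.
url = "https://en.wikipedia.org/wiki/Bill_Gates"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

data = {}
infobox = soup.find("table", class_=re.compile("infobox"))
if infobox:
    for row in infobox.find_all("tr"):
        header, value = row.find("th"), row.find("td")
        if header and value:
            data[header.get_text(" ", strip=True)] = value.get_text(" ", strip=True)

print(data.get("Born"))  # the available keys depend entirely on the page's infobox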
I can extract and read messages from PST files using libpst, but I want to extract from EDB files too (not from an online Exchange server, but from offline files), and on Linux.
Any Python library or any kind of Linux command-line tool would help.
Thanks.
Take a look at Joachim Metz's work. He reverse engineered the EDB format and analyzed the Exchange database to a limited extent. It's open source and there's even some documentation about the tables and columns:
http://sourceforge.net/projects/libesedb/files/
However, it doesn't create a PST or anything similar. It merely extracts all tables into separate files and tries to decode some of the data. In order to extract the emails from your EDB file, you need to get into the documentation and do plenty of coding, as the data is rather scattered around within the database (of course it only looks scattered; Microsoft surely didn't set out just to make the lives of reverse engineers miserable).
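For orientation, here is a very rough sketch of walking an EDB file with the pyesedb bindings that come with libesedb; I am assuming the attribute names from libyal's usual binding conventions, so check the project's documentation for the exact API of your version:

import pyesedb  # Python bindings shipped with libesedb

# Assumed API, following libyal's usual binding conventions.
esedb = pyesedb.file()
esedb.open("mailbox.edb")  # placeholder path to the offline EDB file

# List the tables; the message data is spread across several of them.
for index in range(esedb.number_of_tables):
    table = esedb.get_table(index)
    print(table.name, table.number_of_records)

esedb.close()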
Good luck