I have a database full of reviews of various products. My task is to perform various calculations and "create" another "database/xml-export" with the aggregated data. I am thinking of writing command-line programs in Python to do that. But I know someone has done this before, and I know there is some open source Python solution (or similar) that probably produces a lot more interesting "aggregated data" than I could possibly think of.
The problem is I don't really know much about this area beyond basic data manipulation from the command line, nor do I know what terms I should use to even search for this. I am really not looking for scientific/visualization stuff (not that I'd mind if the tool provides it), just something simple to start with that lets me gradually see and develop what I need.
My only requirement is that the "end aggregated data" be in a database or exported as an XML file, with no proprietary formats. It needs to be a bit more robust than my Python scripts, as I have to deal with "lots" of data across 4 machines.
Any hint on where I should start my research?
Thanks.
Looks like you are looking for a Data Integration solution.
One suggestion is the open source Kettle project, part of the Pentaho suite.
For Python, a quick search yielded PyDI and SnapLogic.
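If you want to start lighter than a full ETL suite, here is a minimal sketch of the kind of aggregation-plus-XML-export you describe, using pandas (my choice of library, not something mentioned above; the database, table, and column names are hypothetical):

import sqlite3
import pandas as pd

# Hypothetical database, table, and column names -- adjust to your review schema.
conn = sqlite3.connect("reviews.db")
reviews = pd.read_sql("SELECT product_id, rating FROM reviews", conn)

# One row of aggregates per product, computed in a single pass.
summary = reviews.groupby("product_id")["rating"].agg(["count", "mean", "min", "max"])

# Export to XML (pandas >= 1.3, requires lxml).
summary.reset_index().to_xml("aggregated.xml", index=False)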
What kind of analysis are you trying to do?
If you're analyzing text, take a look at the Natural Language Toolkit (NLTK); there is a small sketch below.
If you want to index and search the data, take a look at the Whoosh search engine library.
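For example, a minimal sketch of NLTK on a single review (the sample sentence is invented, and the punkt tokenizer data is a one-time download):

import nltk

nltk.download("punkt")  # one-time download of the tokenizer model

review = "The battery life is great, but the screen scratches easily."
words = nltk.word_tokenize(review.lower())

# Frequency of each token -- a trivial first "aggregation" over review text.
print(nltk.FreqDist(words).most_common(5))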
Please provide some more detail on what kind of analysis you're looking to do.
This is my first question, so I will try to do everything as properly as possible.
I am currently using LaTeX to write my documents at my university because I want to use the powerful citing capabilities provided by BibTeX. For ease of use, I am writing scripts that make it easier to integrate my .bib files into my .tex files and to manage my .bib files. As I am using Arch Linux, I did this in bash, but it is a little clunky. Therefore I wanted to switch to Python, as I came across the TexSoup library for Python.
My issue now is that I cannot find resources on using TexSoup with .bib files; I can only find resources on .tex files. Does anybody know if, and if so how, I can use TexSoup to find books, articles, or other entries in my .bib files with Python (or the TexSoup library)?
with open("bib_complete.bib") as f:
soup = TexSoup(f)
print(soup)
This is a code sample I am trying to use, but I don't know how to look for entry names or entry types with the package. I would really appreciate it if someone could guide me to good resources if they exist.
I hope my writing was comprehensive enough and not too long.
Thanks everybody!
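One note: TexSoup is aimed at .tex markup rather than BibTeX syntax, so a dedicated .bib parser may be a better fit. Here is a minimal sketch using bibtexparser (my suggestion, not something from the question) that lists entry types and keys:

import bibtexparser  # pip install bibtexparser

with open("bib_complete.bib") as f:
    bib_database = bibtexparser.load(f)

# Each entry is a dict: ENTRYTYPE is e.g. "book" or "article",
# ID is the citation key, and the remaining keys are the BibTeX fields.
for entry in bib_database.entries:
    print(entry["ENTRYTYPE"], entry["ID"], entry.get("title", ""))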
This is my first post on Stack Overflow.
I'm looking to gather a large amount of data from a multitude of files on ProjectWise (PW) so I can quantify a few things about the records.
The directories I'm working with have unique numbers, and each contains files that are similar to the files in other folders.
Is there a Python library I can use, or any other useful tips for taking on this task?
It could potentially save many hours of work if I can do this with code.
A pseudocode example might look like this:

import os

extracted_data = []
for element in data_field:                    # each unique directory number
    folder = os.path.join(root_dir, str(element))
    if os.path.isdir(folder):                 # folder found
        for name in os.listdir(folder):
            path = os.path.join(folder, name)
            if os.path.isfile(path):          # file found
                data = extract(path)          # pull the fields of interest from the file
                extracted_data.append(data)
Thank you,
R
Based on a quick web search for projectwise api, there is a web-based REST API available, so you'll definitely want to look into that more. You'll need to read the docs carefully to figure out which endpoint does what, but once you know what information you need to send and what kind of data you'll receive, programming a basic Python interface shouldn't be too difficult. One may already exist; I didn't look too hard.
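As a very rough sketch of that kind of interface using the requests library (every URL, endpoint, and field below is hypothetical; the real ones come from the ProjectWise API docs):

import requests

# Hypothetical server URL and endpoint layout -- substitute values from
# your ProjectWise server and its API documentation.
BASE_URL = "https://your-pw-server.example.com/api"

session = requests.Session()
session.auth = ("username", "password")  # the docs will specify the real auth scheme

# List the documents in one folder, then inspect each record.
resp = session.get(f"{BASE_URL}/folders/12345/documents")
resp.raise_for_status()
for doc in resp.json():
    print(doc)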
Firstly, apologies for the very basic question. I have looked into other answers but they haven't quite answered what I'm after. I'm confident designing a site in HTML/CSS and have very very basic knowledge of Python.
I want to run a very basic Python script on my website. It analyses tweets about a specific topic, and then posts a sentiment analysis score. I want it to run this sentiment analysis every hour and cache the score.
I have a working Python script which does this in Jupyter Notebook. Could you give me an overview of how I would make this script function online and cache the results? I've read into using Python web frameworks, but from my limited understanding, they seem like overkill?
Thank you for your help!
Could you give me an overview of how I would make this script function online
The key thing would be to decouple the two parts of your system:
Producing the data
Showing it on a website
So the first thing to do is have your sentiment-analysis script push its value to a database. The database could be something as simple as a CSV file, or it could be a key/value store, or something like MySQL or CouchDB (or hundreds of other choices).
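As an illustration, here is a minimal sketch of the producing side (run_sentiment_analysis is a stand-in for your existing notebook code, and a CSV file is just one storage choice):

import csv
from datetime import datetime, timezone

def record_score(score, path="scores.csv"):
    # Append the latest score with a UTC timestamp so the site can show freshness.
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([datetime.now(timezone.utc).isoformat(), score])

score = run_sentiment_analysis()  # placeholder for your existing analysis code
record_score(score)

Running it every hour is then the job of a scheduler such as cron (a crontab line like 0 * * * * python3 /path/to/script.py), not of the script itself.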
Over on the website you have to make a decision between:
Server-side
Client-side
If the former, you could program in Python if that is what you are most familiar with. Whatever language/framework combination you go for, there will be an example tutorial of how to read a value from a database and display it: it is just about the most fundamental thing.
If client-side you will usually be programming in JavaScript. Again you need to choose a framework, but again you should easily be able to find a tutorial to follow.
(Unless you have a good reason to prefer server-side, such as familiarity with an existing framework, or security issues with accessing your database, I'd go with a client-side approach.)
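If you do go server-side in Python, here is a minimal sketch of serving the latest value (assuming Flask and the scores.csv file from the sketch above; both are illustrative choices, not requirements):

from flask import Flask, jsonify  # pip install flask
import csv

app = Flask(__name__)

@app.route("/score")
def latest_score():
    # Read the most recent row written by the hourly job.
    with open("scores.csv", newline="") as f:
        timestamp, score = list(csv.reader(f))[-1]
    return jsonify(timestamp=timestamp, score=float(score))

if __name__ == "__main__":
    app.run()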
I've read into using Python web frameworks... overkill?
Yes and no. You are going to need some kind of database, and some kind of framework. It would be good to understand the basics of web security, too. If the sentiment analysis is your major goal, all that is going to be a distraction, and it might be better to find a friend who already knows web programming to work with. Or just find a tutorial that is very close to what you want to do, and adapt that.
(P.S. I was going to flag your question as "too broad", but you did ask for an overview, so I hope this helps.)
To start I just want to state that I'm an Electrical Engineer with basic knowledge of programming.
My requirement is as follows:
I want to create an app where I can load and view PDF files that contain tables.
The tables in these PDFs are of irregular shapes and in a different position on every page (that's why tools like Tabula couldn't help me).
Each table entry is multiline and of irregular dimensions (I cannot select a whole row at a time; it has to be each element alone; simply copying the lines to Excel won't work either because it would need a lot of formatting).
So I want to be able to select each table entry individually from the table (like a selection or cropping box over the required text), and delete any newline within the text, keeping just spaces.
The generated Excel file (or Access database, I don't really mind which) should be reviewable and saveable (if those are even words XD).
I have good knowledge of Python and very elementary knowledge of Django, and I'm seeking an expert who can tell me what I really need to learn (and if possible where to learn it) to execute my project.
Is it too much for me to take on, and if I can dedicate 10 hours a week, how long would it take me to complete such a project?
Thanks all for your help in advance.
Don't use Python, use Word. Open the PDF in Word, then step through the Tables collection to collect the data and put it into Excel. See this for an example.
Here is the advice I can provide:
First of all, search the internet for your questions:
https://lmddgtfy.net/?q=python%20library%20tabular%20pdf
-> Camelot, which is mentioned multiple times, seems to be relevant.
For working with Excel sheets, I present one of the most famous libraries for manipulating DataFrames: pandas. A short sketch combining the two follows below.
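For instance, a minimal Camelot-to-Excel sketch (the file name and page range are placeholders, and whether Camelot copes with your irregular tables is something you would have to test):

import camelot  # pip install camelot-py[cv]

# Extract every table Camelot can detect; each result wraps a pandas DataFrame.
tables = camelot.read_pdf("file.pdf", pages="1-end")
print(tables[0].df)             # inspect the first detected table
tables[0].to_excel("out.xlsx")  # write that table to an Excel sheet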
Short online courses can quickly give you the ability to manage your project more easily.
For the application itself, you can easily find YouTube courses where someone explains how to build a basic application with a given library. That could offer you the entry point you are talking about. Then you can ask yourself what else you need, or simply want, to make it better.
As for the time needed, it depends on how long it takes you to understand the basics and how much time you spend building a deeper understanding. I think that working during your free time with real interest, you could have something working in one week (not perfect, but working, which is a good beginning).
PS: I am not sure your question is on topic for Stack Overflow. I suggest you read this page: https://stackoverflow.com/help/how-to-ask
I'm looking for a tool/library written in Python similar to logstash (Ruby + Java).
My goals are:
parse all system logs from syslog
parse application specific logs (apache, django, mysql etc.)
store results in something like elasticsearch
graph results based on different criteria
Thanks!
PS: regexes are one way to go, but I feel it will be quite a lot of work to start from scratch.
Shameless plug (I am the author of the library):
logtools does everything you mentioned and much, much more. I try to keep the documentation up to date and show a lot of examples, similar to the use cases you describe, in the README file. Hopefully it will fit what you have in mind; give it a try, and any feedback is welcome. I try to add/fix any issues brought up by users. Check it out at http://github.com/adamhadani/logtools or download the latest stable release at https://pypi.python.org/pypi/logtools