Python: Text localiser using Tensorflow

I would like to write a script that can locate a specific word or number in a financial statement. Financial statements contain roughly the same information, but they are not identical or organized in the same way. My thought is that I could use Tensorflow to train a neural network to locate the specific words or numbers for me. If I label the relevant text and numbers in 1,000 financial statements and use them as training data, the network should then be able to identify those numbers or words in any financial statement. For example, I would mark in all 1,000 training statements which number is the company's profit.
Is this doable? I have been coding in Python for a couple of months, and so far I have built some web scrapers and integrated them with Twitter, Slack and Google Sheets. I would be very grateful for your thoughts on this project, and for anyone steering me in the right direction by sharing relevant tutorials.
Thanks a lot!

It's great that you're getting started. Before thinking about the actual implementation in Tensorflow or any other library, I believe you should first try to understand the problem in terms of its underlying domain.
I'm not entirely sure what you are trying to achieve, but my rough guess is that you want to determine whether a statement turns out to be beneficial to the company or not, which sounds like a semantic-analysis type of problem.
So I strongly believe that you should first learn the various methodologies related to semantic analysis and find the most appropriate technique.
In short: theory and understanding before the actual code.
Finally, I would suggest asking such theoretical questions on the AI Stack Exchange; here on SO we generally deal with code or things closely related to code.
I hope that makes sense? ;)
Drop a comment if you have any doubts.
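That said, for a very rough baseline before any Tensorflow work, simple pattern matching over the extracted text is worth trying. The sketch below assumes the statements are available as plain text and that the target figure follows a predictable label such as "profit" (both are assumptions, not facts from the question):

import re

# Hypothetical plain-text excerpt of a statement (illustration only).
statement = "Operating expenses: 1,200,000\nProfit for the year: 350,000"

# Assumed pattern: the word "profit" followed by the nearest number.
pattern = re.compile(r"profit[^\d\-]*(-?[\d,]+)", re.IGNORECASE)

for match in pattern.finditer(statement):
    amount = int(match.group(1).replace(",", ""))
    print(f"Candidate profit figure: {amount}")

If a rule-based baseline like this fails on too many layouts, its failure rate is also a useful measure of how much a learned model has to gain.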

Related

Breaking down 3D models into lines and curves

I'm working on a project to break down 3D models, but I'm quite lost. I hope you can help me.
I'm getting a 3D model from Autodesk BIM, and the format could be a native or generic CAD format (.stp, .igs, .x_t, .stl). Then I need to somehow "measure" the maximum dimensions to model a raw-material body, which will always have the shape of a huge panel. Once I have both bodies, I will take the difference to extract the solids I need to analyze; and, from each of these bodies, I need to extract the faces, and then the lines or curves of each face.
This sounds like something really easy to do in CAD software, but the idea is to automate the process. I was looking into OpenSCAD, but it seems to work only for modelling geometry and doesn't handle imported solids well. I'm leaving a picture with the idea of what I need to do in the link below.
So, any idea how I can do this? Which language and library could help with this project?
I can see this automation being possible with a few in-between steps:
1. OpenSCAD can handle differences well, so your "Extract Bodies" step seems plausible.
1.5 Before going further, you'll have to explain how you "filtered out" the cylinder. Will you do this manually? If you don't, it will be considered in the analysis and you will end up with a lot of faces.
2. I don't think OpenSCAD provides you a vertex array. However, it can save to .STL, which is fairly easy to parse in the programming language of your choice; you'll have to study the .stl file structure a bit (this sounds much more frightening than it is: open an .stl file in a text editor and you will probably immediately see what's happening; a minimal parsing sketch follows this answer).
3. Once you've parsed the file, you can calculate the lines with high-school math.
This is not an easy, GUI-based way to do what you ask, but if you have a few programming skills you'll have your automation, and depending on the number of your projects it may be worth it.
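To make the parsing step concrete, here is a minimal sketch for ASCII .stl files (binary STL needs different handling; the filename is a placeholder):

# Collect the three vertices of every facet from an ASCII .stl file.
def parse_ascii_stl(path):
    triangles, current = [], []
    with open(path) as fh:
        for line in fh:
            parts = line.split()
            if parts and parts[0] == "vertex":
                current.append(tuple(float(v) for v in parts[1:4]))
                if len(current) == 3:  # each facet has exactly 3 vertices
                    triangles.append(tuple(current))
                    current = []
    return triangles

triangles = parse_ascii_stl("panel.stl")  # placeholder filename
print(f"{len(triangles)} facets parsed")

From the triangle list, extracting edges and lines is the "high school math" step mentioned above.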
I have been working on this project and found that the library "trimesh" is better suited to solving this problem. Give it a shot and save some time.
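For example, a quick sketch (assuming a local "panel.stl"; trimesh loads STL among other mesh formats):

import trimesh

mesh = trimesh.load("panel.stl")  # placeholder filename

# Axis-aligned bounding-box dimensions -- the "maximum dimensions"
# needed to model the raw-material body.
print("extents (x, y, z):", mesh.extents)

# Raw geometry for the face/line extraction steps.
print("vertex count:", len(mesh.vertices))
print("face count:", len(mesh.faces))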

Implementation of multi-variable, multi-type RNN in Python

I have a dataset which has items with the following layout/schema:
{
    words: "Hi! How are you? My name is Helennastica",
    ratio: 0.32,
    importantNum: 382,
    wordArray: ["dog", "cat", "friend"],
    isItCorrect: false,
    type: 2
}
where I have a lot of different types of data, including:
Arrays (of one type only, e.g. an array of strings or an array of numbers, never both)
Booleans
Numbers with a fixed min/max (i.e. on a scale of 0 to 1)
Unbounded integers (any integer from -∞ to ∞)
Strings, containing some dictionary words and some new words
The task is to create an RNN (well, more generally, a system that can quickly retrain when given one extra data point instead of reprocessing everything; I think an RNN is the best choice, see below for my reasoning) which can use all of these factors to categorise each item into one of 4 categories, labelled by the type key in the example above as a number 0-3.
I have a large set of examples in the above format (with the answer provided), and I have a database full of uncategorised examples. My intention is to run the ML model on that database and sort everything into categories. The reason I need to be able to retrain quickly is the feedback feature: if the AI gets something wrong, any user can report it, in which case that specific JSON is added to the dataset. Obviously, having to retrain on 1000+ JSONs just to add one more would take ages; if I am not mistaken, an RNN can get around this.
I have found many possible use cases for something like this, yet I have spent literal hours browsing GitHub trying to find an implementation or some Tensorflow module/add-on to make this easier, but to no avail.
I assume this would not be too difficult using Tensorflow, and I understand a bit of the maths and logic behind it (though I'm not formally educated, so I probably have gaps!), but unfortunately I have essentially no experience with Tensorflow or any other ML framework (beyond copy-pasting code for some other projects). If someone could point me in the right direction in the form of a GitHub repo or Python framework, or even write some demo code to help solve this problem, it would be greatly appreciated. And if you're just going to correct some of my technical knowledge or tell me where I've gone horrendously wrong, I'd appreciate that feedback too (just leave it as a comment).
Thanks in advance!
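For anyone who lands here: a minimal sketch of one way to wire mixed text and numeric inputs into a single 4-way classifier with Keras (this assumes TF 2.x; the field mapping is an illustrative assumption, and the wordArray field is left out for brevity):

import tensorflow as tf
from tensorflow.keras import layers, Model

# Text branch: raw strings in, LSTM encoding out.
text_in = layers.Input(shape=(1,), dtype=tf.string, name="words")
# Numeric branch: ratio, importantNum and isItCorrect packed as 3 floats
# (an assumed encoding, not from the question).
nums_in = layers.Input(shape=(3,), name="numeric")

# TextVectorization must be adapted to a real corpus before training;
# the two strings here are placeholders.
vectorize = layers.TextVectorization(max_tokens=10_000, output_sequence_length=32)
vectorize.adapt(["hi how are you", "dog cat friend"])

x = vectorize(text_in)
x = layers.Embedding(10_000, 32)(x)
x = layers.LSTM(32)(x)  # the recurrent part
x = layers.concatenate([x, nums_in])
out = layers.Dense(4, activation="softmax", name="type")(x)

model = Model([text_in, nums_in], out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()

Note that a model like this still retrains by gradient updates; the quick-retrain requirement is usually met by fine-tuning on the new example for a few steps rather than by the RNN architecture itself.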

FORE! Choosing a data type for my horrendous golf game

I've started learning Python and decided to give myself a golf-related project to work on. My question revolves around choosing the best data type to use. Now, I know the answer to this depends on requirements, but that isn't helping me.
Besides simple data like name, date, name of the course, etc., I'll also be generating 9- and 18-hole scores for multiple players in my local society.
While keeping a historical record of past scores is nice, I may also want to perform some analytics across my dataset to find handicaps, hardest holes, etc. And, yes, I know there are apps out there already; I'm doing this to learn. ;)
So... which data structure should I use: lists, dictionaries, NumPy arrays, objects, or a combination?
Many thanks!
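One possible shape, as an illustration rather than a verdict (assumes Python 3.9+ for the list[int] annotation): plain dataclasses for the entities, ordinary lists to hold them, and something like pandas later if the analytics outgrow plain Python.

from dataclasses import dataclass
from datetime import date

@dataclass
class Round:
    player: str
    course: str
    played_on: date
    scores: list[int]  # 9 or 18 per-hole scores

    @property
    def total(self) -> int:
        return sum(self.scores)

# Hypothetical data, purely for illustration.
rounds = [Round("Alice", "Sunnyvale GC", date(2023, 5, 1), [5, 4, 3] * 6)]
print(rounds[0].total)  # 72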

Sentiment analysis of various lines of data

I'm new to programming and do not have much experience yet. I understand some Python code, but not in detail.
I have an Excel file which contains log entries of problems people encountered. The description of each problem is pasted as an email (so it's a bunch of text). I want to analyze all of these texts (almost 1,000 rows in Excel) at once, and I think Python can do this.
The type of analysis I want to do is sentiment analysis (positive, neutral, negative), or extracting the main problem from the text. I don't know if the second one is possible.
I copied the emails listed in the Excel file to a .txt file, so now every line is one message. How can I use Python to analyze each line as one message and have it show me the sentiment or the main problem?
I'd appreciate the help
Sentiment analysis is a fairly large problem in computer science/natural language processing. How specific do you want to get?
I'd recommend looking into Text-Processing for simple sentiment analysis.
Their API docs are here: http://text-processing.com/docs/sentiment.html. The API returns a simple pos and neg score for your text.
If you want anything more specific, I'd recommend looking into IBM Watson, specifically Natural Language Understanding: https://www.ibm.com/watson/developercloud/natural-language-understanding.html
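A minimal sketch of the Text-Processing route, assuming the messages sit one per line in a local "messages.txt" (the endpoint is the one documented at the link above; the free tier is rate-limited):

import requests

with open("messages.txt", encoding="utf-8") as fh:
    for line in fh:
        text = line.strip()
        if not text:
            continue
        resp = requests.post(
            "http://text-processing.com/api/sentiment/",
            data={"text": text},
        )
        result = resp.json()  # e.g. {"label": "pos", "probability": {...}}
        print(result["label"], "-", text[:60])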

Ways to search and tag a MongoDB database of academic papers

Apologies for the vague nature of this question but I'm honestly not quite sure where to start and thought I'd ask here for guidance.
As an exercise, I've downloaded several academic papers and stored them as plain text in a MongoDB database.
I'd like to write a search feature (using Python, R, whatever) such that when you enter text, it returns the most relevant articles. Clearly, "relevant" is really hard; that's what Google got so right.
However, I'm not looking for it to be perfect. Just to get something. A few thoughts I had were:
1) Simple MongoDB full-text search
2) Implement Lucene search
3) Tag them (unsure how, though) and then return them sorted by the greatest number of matching tags?
Is there a solution someone has used that's out of the box and works fairly well? I can always optimize the search feature later -- for now I just want all the pieces to move together...
Thanks!
Is there a solution someone has used that's out of the box and works fairly well?
It depends on how you define "well", but in simple terms, I'd say no. There is just no single, accurate definition of "fairly well". A lot of challenges intrinsic to this particular problem arise when one tries to implement a good search algorithm. Those challenges lie in:
the diversity of user needs: users in different fields have different intentions and, as a result, different expectations of a search results page;
the diversity of natural languages, if you are trying to implement multi-language search (German has a lot of noun compounds, Russian has enormous inflectional variability, etc.).
There are some algorithms that are proven to work better than others, though, and are thus good starting points: TF-IDF and BM25 are the two most popular.
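As a tiny illustration of the TF-IDF route (scikit-learn assumed; the three "papers" are placeholders):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

papers = [
    "Deep learning for natural language processing",
    "Bayesian methods in clinical trials",
    "Graph algorithms for social network analysis",
]
vectorizer = TfidfVectorizer(stop_words="english")
doc_matrix = vectorizer.fit_transform(papers)

# Rank the stored papers against a free-text query.
query_vec = vectorizer.transform(["language models"])
scores = cosine_similarity(query_vec, doc_matrix).ravel()
for score, paper in sorted(zip(scores, papers), reverse=True):
    print(f"{score:.3f}  {paper}")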
I can always optimize the search feature later -- for now I just want all the pieces to move together...
MongoDB or any RDBMS with full-text indexing support is good enough for a proof of concept (a pymongo sketch follows at the end of this answer), but if you need to optimize for search performance, you will need an inverted index (Solr/Lucene). With Solr/Lucene you gain the ability to manage:
how exactly words are stemmed (this is important for solving understemming/overstemming problems);
what counts as a word. Is "supercomputer" one word? What about "stackoverflow" or "OutOfBoundsException"?
synonyms and word expansion (should "O2" be found for an "oxygen" query?);
how exactly the search is performed: which words can be ignored during search, which ones are required to be found, and which ones are required to be found near each other (think of the search phrases "not annealed" or "without expansion").
This is just what comes to mind first.
So if you are planning to work these things out, I definitely recommend Lucene as a framework, or Solr/Elasticsearch as a search system if you need to build a proof of concept fast. If not, MongoDB or an RDBMS will work well.
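For the proof-of-concept path, MongoDB's built-in text index takes a few lines with pymongo (the database, collection and field names here are assumptions):

from pymongo import MongoClient, TEXT

client = MongoClient("mongodb://localhost:27017")
papers = client["library"]["papers"]  # assumed database/collection

# One text index per collection; "body" holds the plain text.
papers.create_index([("body", TEXT)])

cursor = papers.find(
    {"$text": {"$search": "neural networks"}},
    {"score": {"$meta": "textScore"}, "title": 1},
).sort([("score", {"$meta": "textScore"})]).limit(10)

for doc in cursor:
    print(doc.get("title"), doc["score"])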
