Predict next time for repeating event with tensorflow - python

I have a dataset for store inventory management. For every product, I have the history of renewal orders. For example, for a product A, I have:
A,last_time_of_renawal,volume_order,time_of_order
A,last_time_of_renawal1,volume_order1,time_of_order1
For every line, I also have other information like category of product, sales number, stock_volume, and so on.
How can I use this dataset and TensorFlow (or another deep learning library) to predict the next time_of_order for a product, knowing the last_time_of_order?

That is too broad a question for Stack Overflow. Try to narrow it down using this guide.
But essentially you want to do a regression on the delta between time_of_order and last_time_of_order: that delta is your target y, and the other columns (category of product, etc.) are your features x. A minimal sketch is shown at the end of this answer.
Now you have a wide world of statistical analysis at your disposal.
If you insist on using deep-learning: Try setting up a "simple" neural network using a youtube playlist. When you succeed, you can try using your own data.
And if you encounter problems... come back to Stack Overflow with a specific programming question :) Have fun!
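For example, here is a minimal sketch of that regression setup with tf.keras, assuming the delta has already been computed in days and the other columns have been encoded as numeric features (the data below is a random placeholder):

import numpy as np
import tensorflow as tf

# Placeholder data: X holds per-line features (category id, sales number,
# stock_volume, ...); y holds the delta in days between last_time_of_renawal
# and time_of_order -- the quantity to predict.
X = np.random.rand(1000, 3).astype("float32")
y = np.random.rand(1000).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(X.shape[1],)),
    tf.keras.layers.Dense(1),  # single regression output: the delta
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=10, validation_split=0.2)

# The next time_of_order is last_time_of_order + the predicted delta.
predicted_delta = model.predict(X[:1])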

Related

Need suggestions for a NLP use case

I am trying to build a web scraper that can classify the content of a given URL into multiple categories, but I am currently confused about which method is best suited for my use case. Here's the overall use case:
I want to predict a researcher's interests from their biography and categorize them into one or more categories based on the 17 SDG goals. I have three data points to work with:
The biography of each researcher (can be scraped and tokenized)
A list of keywords that are often associated with each of the SDG categories/goals (here's an example of said keywords)
Hundreds of categorizations that were done manually by students in the form of binary data (here's an example of said data)
So far, we have students read each researcher's biography and decide which SDG category/goal each researcher belongs to. One researcher can belong to one or more SDG categories. We usually categorize based on how often the SDG keywords listed in our database appear in each researcher's bio.
I have looked up online machine learning models for NLP but couldn't decide on which method would work best with my use case. Any suggestions and references would be super appreciated because I'm a bit lost here.
The problem you have here is multi-label classification, and you can solve it by applying supervised learning, since you have a labelled dataset.
A labelled dataset should look something like this:
article 1 - sdg1, sdg2, sdg4
article 2 - sdg4
...
The implementation is explained in detail here - keras - multi-label-classification
This one has plenty of things abstracted and the implementation is kept simple - fasttext multi-label-classification
Deeper discussions of these libraries are here:
keras and fasttext
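As a rough illustration of the Keras route, here is a minimal multi-label sketch with placeholder data; the essential choices are the sigmoid output layer and the binary_crossentropy loss, which let each SDG label be predicted independently:

import numpy as np
import tensorflow as tf

# Placeholder data: X are vectorized biographies (e.g. TF-IDF vectors),
# Y is a binary label matrix with one column per SDG goal (17 in total).
X = np.random.rand(500, 300).astype("float32")
Y = (np.random.rand(500, 17) > 0.8).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(300,)),
    tf.keras.layers.Dense(17, activation="sigmoid"),  # one probability per label
])
# binary_crossentropy treats each output independently -> multi-label
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(X, Y, epochs=5, validation_split=0.2)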

What should be used between Doc2Vec and Word2Vec when analyzing product reviews?

I collected product reviews from a website across different users, and I'm trying to find similarities between products through embeddings of the words used by the reviewers.
I grouped the reviews per product, so that different reviews follow one another in my dataframe (i.e., different authors for one product). I also already tokenized the reviews (and applied all other pre-processing steps). Below is a mock-up dataframe of what I have (the list of tokens per product is actually very long, as is the number of products):
Product      reviews_tokenized
XGame3000    absolutely amazing simulator feel inaccessible ...
Poliamo      production value effect tend cover rather ...
Artemis      absolutely fantastic possibly good oil ...
Ratoiin      ability simulate emergency operator town ...
However, I'm not sure which of Doc2Vec and Word2Vec would be the most efficient. I would initially go for Doc2Vec, since it can find similarities by taking the whole paragraph/sentence into account and infer its topic (which I'd like to have, since I'm trying to cluster products by topic), but I'm a bit worried that the reviews coming from different authors might bias the embeddings. Note that I'm quite new to NLP and embeddings, so some notions may escape me. Below is my code for Doc2Vec, which is giving me a quite good silhouette score (~0.7).
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import numpy as np

# One TaggedDocument per product, tagged with its row index
product_doc = [TaggedDocument(doc.split(' '), [i]) for i, doc in enumerate(df.tokens)]
model3 = Doc2Vec(min_count=1, seed=SEED, ns_exponent=0.5)
model3.build_vocab(product_doc)
model3.train(product_doc, total_examples=model3.corpus_count, epochs=model3.epochs)
# Infer one vector per product for downstream clustering
product2vec = [model3.infer_vector(df['tokens'][i].split(' ')) for i in range(len(df['tokens']))]
dtv = np.array(product2vec)
What do you think would be the most efficient method to tackle this? If something is not clear enough, please tell me.
Thank you for your help.
EDIT: Below are the clusters I'm obtaining: [cluster plot omitted]
There's no way to tell which particular mix of methods will work best for a specific dataset and particular end-goal: you really have to try them against each other, in your own reusable pipeline for scoring them against your desired results.
It looks like you've already stripped the documents down to keywords rather than original natural text, which could hurt with these algorithms - you may want to try it both ways.
Depending on the size & format of your texts, you may also want to look at doing "Word Mover's Distance" (WMD) comparisons between sentences (or other small logical chunks of your data). Some work has demonstrated interesting results in finding "similar concerns" (even with different wording) in the review domain, e.g.: https://tech.opentable.com/2015/08/11/navigating-themes-in-restaurant-reviews-with-word-movers-distance/
Note, though, WMD gets quite costly to calculate in bulk with larger texts.
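For instance, a minimal WMD comparison with gensim might look like this (assuming a pretrained word2vec-format vector file; the filename is a placeholder, and wmdistance additionally needs the POT package installed):

from gensim.models import KeyedVectors

# Placeholder path: any word2vec-format vector file works here
kv = KeyedVectors.load_word2vec_format('vectors.bin', binary=True)

doc1 = "absolutely amazing simulator".split()
doc2 = "absolutely fantastic possibly good oil".split()

# Lower distance = more similar wording/concepts
print(kv.wmdistance(doc1, doc2))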

Python - Using pandas with reinforcement learning

I would like to create in Python some RL algorithm where the algorithm would interact with a very big DataFrame representing stock prices. The algorithm would tell us: knowing all of the prices and price changes in the market, what would be the best places to buy/sell (minimizing loss, maximizing reward)? It has to look at the entire DataFrame each step (or else it wouldn't have the full information from the market).
Is it possible to build such an algorithm (one that works relatively fast on a large DataFrame)? How should it be done? What should my environment look like, which algorithm (specifically) should I use for this type of RL, and which reward system? Where should I start?
I think you are a little confused here. What I think you want to do is check whether the stock price of a particular company will go up or not, or which company's stock price will shoot up, given a dataset for the problem statement.
As for RL: it does not simply run on a dataset. It is a technique that enables an agent to learn in an interactive environment by trial and error, using feedback from its own actions and experiences.
You can check this blog for some explanation, so you don't get confused:
https://towardsdatascience.com/types-of-machine-learning-algorithms-you-should-know-953a08248861
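That said, if you do want to frame your DataFrame as an interactive environment, a minimal hand-rolled sketch could look like the one below (the column name and reward are illustrative placeholders; a real project would typically build on a library such as Gymnasium):

import pandas as pd

class PriceEnv:
    # Minimal sketch: one step per row of the price DataFrame
    def __init__(self, df):
        self.df = df
        self.t = 0

    def reset(self):
        self.t = 0
        return self.df.iloc[self.t]  # initial observation

    def step(self, action):  # action: 0 = hold, 1 = buy, 2 = sell
        change = self.df['price'].iloc[self.t + 1] - self.df['price'].iloc[self.t]
        # Toy reward: profit for buying before a rise, selling before a fall
        reward = change if action == 1 else -change if action == 2 else 0.0
        self.t += 1
        done = self.t >= len(self.df) - 1
        return self.df.iloc[self.t], reward, done

env = PriceEnv(pd.DataFrame({'price': [1.0, 1.2, 1.1, 1.3]}))
obs = env.reset()
obs, reward, done = env.step(1)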

How to analyze music MFCC?

I'm trying to make a program for recognizing musical instruments and notes (like C, C#, B, ...) using machine learning in Python.
I got data from IRMAS and the Philharmonic Orchestra homepage.
How can I analyze the music? I want to remove noise and get MFCC values: from 20 seconds of music, I want to end up with about 20 feature values. I'm planning to train an SVM on these data.
Sorry for the too-broad question... If there is something else I should mention, let me know and I'll answer immediately.
I also have Mathematica. I tried its 'MFCC encoder', but I have no idea how to normalize these data and set a threshold.
Take a look at this Mathematica example of using Neural Networks and MFCC encoding to classify music genre.
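That example is in Mathematica; in Python, a minimal sketch with librosa (my suggestion, not part of the linked example; the filename is a placeholder) that reduces a clip to 20 averaged MFCC values could look like:

import librosa
import numpy as np

# Load ~20 seconds of audio (librosa resamples to 22050 Hz by default)
y, sr = librosa.load('clip.wav', duration=20.0)

# 20 MFCCs per frame -> array of shape (20, n_frames)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)

# Average over time to get one 20-dimensional feature vector per clip,
# then standardize so the SVM sees comparable scales
features = mfcc.mean(axis=1)
features = (features - features.mean()) / features.std()
print(features.shape)  # (20,)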

Machine learning algorithm which gives multiple outputs mapped from single input

I need some help. I am working on a problem where I have the OCR text of an invoice image, and I want to extract certain data from it, like the invoice number, amount, date, etc., all of which is present within the OCR output. I tried a classification model where I passed each sentence from the OCR individually to the model to predict whether it is the invoice number or the date or anything else, but this approach takes a lot of time and I don't think it is the right one.
So, I was wondering whether there is an algorithm where I can have an input string and have outputs mapped from that string, like the invoice number, date, and amount present within the string.
E.g:
Input string: The invoice 1234 is due on 12 oct 2018 with amount of 287
Output: Invoice Number: 1234, Date: 12 oct 2018, Amount: 287
So, my question is: is there an algorithm which I can train on several invoices and then use to make predictions?
Essentially you are looking for NER (named entity recognition). There are multiple free and paid tools available for intent and entity mapping. You can use Google DialogFlow, MS LUIS, or the open-source RASA for entity identification in a given text.
If you want to develop your own solution, you can look at OpenNLP too.
Please report back with your observations on these with respect to your problem.
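For illustration, here is a minimal NER pass with spaCy (one open-source option alongside the tools above; the pretrained model picks up dates and numbers out of the box, while invoice-specific entities usually require custom training):

import spacy

# Pretrained English pipeline; install with: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("The invoice 1234 is due on 12 oct 2018 with amount of 287")
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. DATE and CARDINAL entities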
What you are searching for is invoice data extraction with ML. There are plenty of ML algorithms available, but none of them is built for your use case out of the box. Why? Because it is a very special use case. You can't just hand raw sentences to TensorFlow, although a network can return multiple outputs.
You could use NLP (natural language processing) approaches to extract the data. This is what Taggun uses to extract data from receipts. In that case, you can work with the sentences alone, but you will still need to convert them into an NLP-friendly form (tokenization).
You could use deep learning (e.g. TensorFlow). In that case, you need to vectorize your sentences into vectors that can be fed into a neural network. This approach needs much more creativity, as there is no standard way to do it. The goal is to describe every sentence as well as possible. But there is still one problem: how to parse dates, amounts, etc. Would it help the NN if you marked sentences with contains_date True/False? Probably yes.
A similar approach is used in invoice data extraction services like:
rossum.ai
typless.com
So if you are doing it for fun/research, I suggest starting with a really simple invoice. Try to write a program that extracts the invoice number, issue date, supplier, and total amount with parsing and if statements (a rough sketch follows at the end of this answer). It will help you define properties for the feature-vector input of a NN, for example contains_date, contains_total_amount_word, etc. See this tutorial to start with NN.
If you are using it for work I suggest taking a look at one of the existing services for invoice data extraction.
Disclaimer: I am one of the creators of typless. Feel free to suggest edits.
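As a starting point for that parse-and-if-statements approach, here is a naive regex sketch over the example sentence from the question (the patterns are purely illustrative and will not generalize to real invoices):

import re

text = "The invoice 1234 is due on 12 oct 2018 with amount of 287"

# Illustrative patterns only -- real invoices need far more robust rules
invoice_number = re.search(r"invoice\s+(\d+)", text, re.IGNORECASE)
date = re.search(r"\b(\d{1,2}\s+\w{3}\s+\d{4})\b", text)
amount = re.search(r"amount of\s+([\d.]+)", text, re.IGNORECASE)

print("Invoice Number:", invoice_number.group(1) if invoice_number else None)
print("Date:", date.group(1) if date else None)
print("Amount:", amount.group(1) if amount else None)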
