Customer profiling and anomaly detection - Python

Thank you for taking the time to check this question.
I am interested in creating a profile of each customer's buying pattern.
Once a profile is created for everyone, we take unseen data and check it against the profile to see whether the customer followed their pattern; if not, we raise a flag. This way we do not create one fixed alert threshold for all buyers, but can detect anomalies per individual buyer by benchmarking against their own profile.
Any thoughts or input on how to approach this problem?
If you have a course or tutorial on this matter please feel free to suggest it.
Thanks in advance.

You can approach this with supervised learning, or with unsupervised anomaly detection. For buying patterns, I would suggest exploring the RFM framework, i.e. recency, frequency, and monetary value. It will help you create the features to model or profile customers.
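As a starting point for RFM features, a minimal pandas sketch (the transaction log and its column names are assumptions for illustration):

```python
import pandas as pd

# Hypothetical transaction log; column names are assumptions.
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "order_date": pd.to_datetime(
        ["2023-01-05", "2023-03-01", "2023-02-10", "2023-02-20", "2023-03-02"]),
    "amount": [50.0, 20.0, 10.0, 15.0, 30.0],
})

# Reference date: the day after the last observed transaction.
snapshot = transactions["order_date"].max() + pd.Timedelta(days=1)

# One row per customer: recency (days since last order),
# frequency (number of orders), monetary (total spend).
rfm = transactions.groupby("customer_id").agg(
    recency=("order_date", lambda d: (snapshot - d.max()).days),
    frequency=("order_date", "count"),
    monetary=("amount", "sum"),
)
print(rfm)
```

Each customer's RFM row becomes their "profile"; new transactions can then be scored against it (e.g. flag an order amount several standard deviations above that customer's history).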

Related

How can we use NLP on user-input text to generate relevant graphs

For example:
User: Compare sales between Hyderabad and Mumbai for the month of January 2022.
Output: a relevant graph showing this comparison (say, a bar chart).
Another example:
User: Top 10 products sold in Hyderabad during the New Year.
Output: tabular data showing the top 10 products sold in Hyderabad during the New Year.
What I tried:
I created template functions in Python that query data using pandas and SQL. The parameters of these functions are taken from the user's text input. So I extracted POS tags and named entities from the inputs, but this only works for certain queries, since some states are classified as ORG by spaCy.
I patched the entity labels whenever such discrepancies were found, but when scaling the project I feel there must be a better way to do it.
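One common fix for exactly this spaCy mislabelling is an EntityRuler in the pipeline: hard-coded patterns for the city/state names in your data override or supplement the statistical NER, so known locations are always tagged GPE. A minimal sketch (the place names come from the examples above; in a real pipeline you would load en_core_web_sm and insert the ruler before the "ner" component instead of using a blank pipeline):

```python
import spacy

# Blank pipeline with only an EntityRuler, so the sketch runs without a
# downloaded model; in practice: nlp.add_pipe("entity_ruler", before="ner").
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "GPE", "pattern": "Hyderabad"},
    {"label": "GPE", "pattern": "Mumbai"},
])

doc = nlp("Compare sales between Hyderabad and Mumbai for January 2022")
print([(ent.text, ent.label_) for ent in doc.ents])
```

The pattern list can be generated from the distinct city/state values in your database, so it scales with the data rather than with hand-patched corrections.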
My opinion:
We may use transfer learning, but how should I create the data for this?
If we go with a rule-based system, extracting context would help, but scaling the project would be an issue.
Using a DIET-classifier type of model: since I have worked with the RASA framework, I guess using their architecture would help. But can you suggest how I can use transfer learning with such models? If we choose RASA, the task gets moved there entirely and limits itself to a chatbot-style project, which I don't want as of now.
Any input/suggestion would be highly appreciated.

Python - Using pandas with reinforcement learning

I would like to create in Python some RL algorithm that would interact with a very big DataFrame representing stock prices. The algorithm would tell us: knowing all of the prices and price changes in the market, what would be the best places to buy/sell (minimizing loss, maximizing reward)? It has to look at the entire DataFrame each step (or else it wouldn't have all the information from the market).
Is it possible to build such an algorithm (one that works relatively fast on a large DataFrame)? How should it be done? What should my environment look like, which algorithm (specifically) should I use for this type of RL, and which reward system? Where should I start?
I think you are a little confused here. What I think you want to do is predict whether the stock price of a particular company will go up, using a dataset you already have for the problem.
RL does not simply run on a fixed dataset: it is a technique that enables an agent to learn in an interactive environment by trial and error, using feedback from its own actions and experiences.
You can check this blog for an explanation:
https://towardsdatascience.com/types-of-machine-learning-algorithms-you-should-know-953a08248861
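That said, if you do want to experiment, the usual pattern is to wrap the DataFrame in a gym-style environment with reset()/step() methods and give the agent only a window of recent prices per step (scanning the entire DataFrame each step is neither necessary nor tractable). A minimal toy sketch, not a trading strategy:

```python
import pandas as pd

# Toy gym-style environment over a price series. The agent observes a window
# of recent prices and chooses 0 = hold, 1 = buy, 2 = sell; the reward is the
# mark-to-market P&L of the current position over one tick.
class PriceEnv:
    def __init__(self, prices: pd.Series, window: int = 5):
        self.prices = prices.to_numpy(dtype=float)
        self.window = window
        self.reset()

    def reset(self):
        self.t = self.window          # current time step
        self.position = 0             # 0 = flat, 1 = long
        return self.prices[self.t - self.window:self.t]

    def step(self, action: int):
        if action == 1:
            self.position = 1         # open / keep a long position
        elif action == 2:
            self.position = 0         # close the position
        # Reward: price change over the last tick, earned only if long.
        reward = self.position * (self.prices[self.t] - self.prices[self.t - 1])
        self.t += 1
        done = self.t >= len(self.prices)
        obs = self.prices[self.t - self.window:self.t]
        return obs, reward, done

env = PriceEnv(pd.Series([100, 101, 103, 102, 104, 106, 105]), window=2)
obs = env.reset()
obs, reward, done = env.step(1)   # buy; reward is the next tick's move
```

An environment shaped like this can then be plugged into standard RL libraries (e.g. Stable-Baselines3 after adapting it to the gymnasium API), which is far more practical than handing the agent the whole DataFrame at once.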

is there a way to use pretrained doc2vec model to evaluate some document dataset

Lately I have been doing research aimed at unsupervised clustering of a huge text database. First I tried bag-of-words and then several clustering algorithms, which gave me good results, but now I am trying to move to a doc2vec representation and it does not seem to be working for me: I cannot load a prepared model and work with it, and training my own does not produce usable results.
I tried to train my model on 10k texts
model = gensim.models.doc2vec.Doc2Vec(vector_size=500, min_count=2, epochs=100, workers=8)
(around 20-50 words each), but the similarity score proposed by gensim, i.e.
sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
is working much worse than the same metric for bag-of-words on my data.
By much worse I mean that identical or almost identical texts get similarity scores comparable to texts that have no connection I can think of. So I decided to follow "Is there a pre-trained doc2vec model?" and use some pretrained model that might capture more connections between words. Sorry for the somewhat long preamble, but the question is: how do I plug it in? Can someone suggest how, using a gensim model loaded from https://github.com/jhlau/doc2vec, I can convert my own dataset of texts into vectors of the same length? My data is preprocessed (stemmed, no punctuation, lowercase, no nltk.corpus stopwords) and I can deliver it from a list, DataFrame, or file if needed. The code question is how to pass my own data to the pretrained model. Any help would be appreciated.
UPD: outputs that illustrate the problem
Train Document (6134): «use medium paper examination medium habit one
week must chart daily use medium radio television newspaper magazine
film video etc wake radio alarm listen traffic report commuting get
news watch sport soap opera watch tv use internet work home read book
see movie use data collect journal basis analysis examining
information using us gratification model discussed textbook us
gratification article provided perhaps carrying small notebook day
inputting material evening help stay organized smartphone use note app
track medium need turn diary trust tell tell immediately paper whether
actually kept one begin medium diary soon possible order give ample
time complete journal write paper completed diary need write page
paper use medium functional analysis theory say something best
understood understanding used us gratification model provides
framework individual use medium basis analysis especially category
discussed posted dominick article apply concept medium usage expected
le medium use cognitive social utility affiliation withdrawal must
draw conclusion use analyzing habit within framework idea discussed
text article concept must clearly included articulated paper common
mistake student make assignment tell medium habit fail analyze habit
within context us gratification model must include idea paper»
Similar Document (6130, 0.6926988363265991): «use medium paper examination medium habit one week must chart daily use medium radio
television newspaper magazine film video etc wake radio alarm listen
traffic report commuting get news watch sport soap opera watch tv use
internet work home read book see movie use data collect journal basis
analysis examining information using us gratification model discussed
textbook us gratification article provided perhaps carrying small
notebook day inputting material evening help stay organized smartphone
use note app track medium need turn diary trust tell tell immediately
paper whether actually kept one begin medium diary soon possible order
give ample time complete journal write paper completed diary need
write page paper use medium functional analysis theory say something
best understood understanding used us gratification model provides
framework individual use medium basis analysis especially category
discussed posted dominick article apply concept medium usage expected
le medium use cognitive social utility affiliation withdrawal must
draw conclusion use analyzing habit within framework idea discussed
text article concept must clearly included articulated paper common
mistake student make assignment tell medium habit fail analyze habit
within context us gratification model must include idea paper»
This looks perfectly ok, but looking on other outputs
Train Document (1185): «photography garry winogrand would like paper
life work garry winogrand famous street photographer also influenced
street photography aim towards thoughtful imaginative treatment detail
referencescite research material academic essay university level»
Similar Document (3449, 0.6901006698608398): «tang dynasty write page
essay tang dynasty essay discus buddhism tang dynasty name artifact
tang dynasty discus them history put heading paragraph information
tang dynasty discussed essay»
This shows that the similarity score between two practically identical texts (the most similar pair in the system) and between two completely unrelated texts is almost the same, which makes it hard to do anything with the data.
To get the most similar documents I use
sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
The models from https://github.com/jhlau/doc2vec are based on a custom fork of an older version of gensim, so you'd have to find/use that to make them usable.
Models from a generic dataset (like Wikipedia) may not understand the domain-specific words you need, and even where words are shared, the effective senses of those words may vary. Also, to use another model to infer vectors on your data, you should ensure you're preprocessing/tokenizing your text in the same way as the training data was processed.
Thus, it's best to use a model you've trained yourself – so you fully understand it – on domain-relevant data.
10k documents of 20-50 words each is a bit small compared to published Doc2Vec work, but might work. Trying to get 500-dimensional vectors from a smaller dataset could be a problem. (With less data, fewer vector dimensions and more training iterations may be necessary.)
If your results on your self-trained model are unsatisfactory, there could be other problems in your training and inference code (not shown in your question). It would also help to see more concrete examples/details of how your results are unsatisfactory, compared to a baseline (like the bag-of-words representations you mention). If you add these details to your question, it might be possible to offer other suggestions.

WALS Model Tensorflow - Get recommendations for new user

I wonder if there is any way I can get recommendations for a new user, using an already trained WALS model, and given the list of items the user liked.
Currently, to get a recommendation you must provide the id of the user, which must be among the users the model was trained with. I would like to get a recommendation by providing the list of items that were liked by a new user.
There is a similar feature in the implicit Python library.
What you are describing is called the "cold-start problem":
you want a recommendation with no information about the user's past behaviour.
Since matrix factorization needs historical data for the user, the best recommendation you can make in that case is a most-popular recommendation.
To tackle this problem, you can add another type of model to handle it. For example, you can incorporate user metadata into the algorithm so it is not entirely dependent on past data, or do online recommendation by analysing the user's behaviour during the session.
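That said, since the new user does supply a list of liked items, a common workaround is to "fold in" the new user: keep the trained item factors fixed and solve one regularized least-squares step for the new user's factor vector (essentially what the implicit library's recalculate_user option does). A minimal NumPy sketch, with a random matrix standing in for the trained WALS item factors:

```python
import numpy as np

# V stands in for the trained item-factor matrix (n_items x k) from WALS.
rng = np.random.default_rng(0)
n_items, k, reg = 6, 3, 0.1
V = rng.normal(size=(n_items, k))

liked = [0, 2, 5]                      # item ids the new user liked
r = np.zeros(n_items)
r[liked] = 1.0                         # implicit "ratings" vector

# One regularized least-squares step, the same per-user update WALS uses:
# u = (V^T V + reg*I)^-1 V^T r
u = np.linalg.solve(V.T @ V + reg * np.eye(k), V.T @ r)

scores = V @ u                         # score every item for the new user
recommended = np.argsort(-scores)      # items ranked best-first
print(recommended)
```

Because the item factors stay fixed, this costs one small k-by-k solve per new user and needs no retraining of the model.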

NLP project on Comment Summarization

I am planning to do my final year project on Natural Language Processing (using NLTK) and my area of interest is Comment Summarization from Social media websites such as Facebook. For example, I am trying to do something like this:
Random Facebook comments in a picture :
Wow! Beautiful.
Looking really beautiful.
Very pretty, Nice pic.
Now, all these comments will get mapped (using a template based comment summarization technique) into something like this:
3 people find this picture to be "beautiful".
The output will contain the word "beautiful" since it is used more often in the comments than the word "pretty" (and also because "beautiful" and "pretty" are synonyms). To accomplish this, I am going to use approaches like tracking keyword frequency and keyword scores (in this scenario, "beautiful" and "pretty" have very close scores).
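A toy sketch of that frequency-plus-synonym scoring (the synonym table here is a hard-coded stand-in for real WordNet lookups via nltk.corpus.wordnet):

```python
from collections import Counter
import re

# Toy synonym map: in practice this would come from WordNet synsets.
SYNONYMS = {"pretty": "beautiful", "gorgeous": "beautiful", "nice": "good"}

comments = [
    "Wow! Beautiful.",
    "Looking really beautiful.",
    "Very pretty, Nice pic.",
]

counts = Counter()
for comment in comments:
    for word in re.findall(r"[a-z]+", comment.lower()):
        counts[SYNONYMS.get(word, word)] += 1   # fold synonyms into one key

top_word, n = counts.most_common(1)[0]
print(f'{n} comments describe this picture as "{top_word}".')
```

Note this counts mentions rather than distinct commenters; a real system would also filter stopwords and restrict to sentiment-bearing adjectives before ranking.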
Is this the best way to do it?
So far with my research, I have been able to come up with the following papers but none of the papers address this kind of comment summarization :
Automatic Summarization of Events from Social Media
Social Context Summarization
What are the other papers in this field which address a similar issue?
Apart from this, I also want my summarizer to improve with every summarization task. How do I apply machine learning in this regard?
Topic model clustering is what you are looking for.
A search on Google Scholar for "topic model clustering" will give you lots of references on the subject.
To understand them, you need to be familiar with approaches for the following tasks, in addition to the basics of machine learning in general.
Clustering: Cosine distance clustering, k-means clustering
Ranking: PageRank, TF-IDF, Mutual Information Gain, Maximal Marginal Relevance
