I am looking for data sets and tutorials which specifically target business data analysis issues. I know about Kaggle, but its main focus is machine learning and associated problems. It would be great to know of a blog or data dump on data analysis issues, or maybe a good read/book.
The correct answer to this depends on how comfortable you currently are with machine learning. Business data analysis and prediction are so closely aligned with machine learning that most developers consider them a subset that more general ML skills will cover. So I will suggest two things. If you have no experience in ML, launch into the Data Science (Python) career track on DataCamp; it is excellent. It will help you get to grips with the overall ideas of cleaning and processing your data, as well as supervised and unsupervised learning.
If you are already comfortable with all that, I would suggest looking at pbpython.com. This site is entirely devoted to Python for business analysis, suggests a plethora of books specialized for certain topics, and covers individual topics very well itself.
For a recommender system, when we are using Surprise, we normally pass only UserID, ItemID and Rating via load_from_df.
But if I also have other features which I want to load from a DataFrame, how can I do it? I couldn't find any useful information or examples in the Surprise API documentation: https://surprise.readthedocs.io/en/stable/dataset.html.
Can someone point me in the right direction?
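For context, the usual three-column call looks roughly like this (the DataFrame, column names and rating scale below are just placeholders):

import pandas as pd
from surprise import Dataset, Reader

# Hypothetical ratings frame; only these three columns are passed to Surprise.
ratings_df = pd.DataFrame({
    "userID": [1, 1, 2, 3],
    "itemID": [10, 20, 10, 30],
    "rating": [4.0, 3.5, 5.0, 2.0],
})

reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(ratings_df[["userID", "itemID", "rating"]], reader)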
Surprise is a Python scikit for building and analyzing recommender systems that deal with explicit rating data. Surprise was designed with the following purposes in mind: Give users perfect control over their experiments. ... Provide tools to evaluate, analyse and compare the algorithms' performance
Recently I was given a task by a potential employer to do the following:
- transfer a data set to S3
- create metadata for the data set
- create a feature for the data set in Spark
Now this is a trainee position, and I am new to data engineering in terms of concepts, and I am having trouble understanding how, or even if, metadata is used to create a feature.
I have gone through numerous sites on feature engineering and metadata, but none of them really give me an indication of whether metadata is directly used to build a feature.
What I have gathered so far from those sites is that when you build a feature, you extract certain columns from a given data set and put that information into a feature vector for the ML algorithm to learn from. So to me, you could just build a feature directly from the data set and not be concerned with the metadata.
However, I am wondering whether it is common to use metadata to search for given information within multiple datasets to build a feature, i.e. you look in the metadata file for certain criteria that fit the feature you're building, then load the data referenced by the metadata and build the feature from there to train the model.
So as an example, say I have multiple files for certain car models from a manufacturer (VW Golf, VW Fox, etc.), each containing the year and the price of the car for that year, and I would like the ML algorithm to predict the future depreciation of the car, or of the newest model of that car, for years to come. Instead of going directly through all the data sets, you would check the metadata (tags, if that is the correct wording) for certain attributes, and then using those tags you would load the data in from the specific data sets to train the model.
I could very well be off base here, or the example I gave above may be completely wrong, but if anyone could explain how metadata can be used to build features (if it can), that would be appreciated, or even just link to data engineering websites that explain it. Over the last day or two of researching, I have found that there is more material on data science than on data engineering itself, and most data engineering info comes from blogs, so I feel like there is pre-existing knowledge I am supposed to have when reading them.
P.S. Though this is not a coding question, I have used the python tag as it seems most data engineers use Python.
I'll give a synopsis of this.
Here we need to understand two conditions:
1) Do we have features that relate directly to building ML models?
2) Are we facing data scarcity?
Always ask what the problem statement suggests about generating features.
There are many ways to generate features from a given dataset. Dimensionality-reduction techniques such as PCA, truncated SVD and t-SNE create new features from the given features, and feature-engineering techniques such as Fourier features, trigonometric features, etc. do the same. Then we move on to the metadata, such as a feature's type, its size, and the time when it was extracted. Metadata like this can also help us create features for building ML models, but how much it helps depends on how we have performed feature engineering on the data corpus of the respective problem.
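As a rough illustration of the metadata-driven idea from the question (all file names and columns below are hypothetical), a small metadata table can be used to pick out only the relevant datasets before the feature is built:

import pandas as pd

# Hypothetical metadata table: one row per data file, tagged with descriptive attributes.
metadata = pd.DataFrame({
    "file": ["vw_golf.csv", "vw_fox.csv", "ford_focus.csv"],
    "manufacturer": ["vw", "vw", "ford"],
    "model": ["golf", "fox", "focus"],
})

# Use the metadata tags to decide which files to load, instead of scanning every dataset.
frames = []
for _, row in metadata[metadata["manufacturer"] == "vw"].iterrows():
    df = pd.read_csv(row["file"])   # each file is assumed to hold 'year' and 'price' columns
    df["model"] = row["model"]      # carry the metadata tag into the data itself
    frames.append(df)

# Build the feature: year-over-year price change per model, as input to a depreciation model.
cars = pd.concat(frames, ignore_index=True).sort_values(["model", "year"])
cars["yearly_price_change"] = cars.groupby("model")["price"].diff()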
Lately I have been doing research with the purpose of unsupervised clustering of a huge text database. First I tried bag-of-words and then several clustering algorithms, which gave me a good result, but now I am trying to step into doc2vec representation and it does not seem to be working for me: I cannot load a prepared model and work with it, and training my own doesn't produce any useful result.
I tried to train my model on 10k texts
model = gensim.models.doc2vec.Doc2Vec(vector_size=500, min_count=2, epochs=100, workers=8)
(around 20-50 words each) but the similarity score which is proposed by gensim like
sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
is working much worse than the same for Bag-of-words with my model.
By much worse I mean that identical or almost identical texts have a similarity score comparable to texts that don't have any connection I can think of. So I decided to use a model from "Is there pre-trained doc2vec model?" in order to use some pretrained model which might have more connections between words. Sorry for the somewhat long preamble, but the question is: how do I plug it in? Can someone provide some ideas on how, using the loaded gensim model from https://github.com/jhlau/doc2vec, I can convert my own dataset of texts into vectors of the same length? My data is preprocessed (stemmed, no punctuation, lowercase, no nltk.corpus stopwords) and I can deliver it from a list, dataframe or file if needed; the code question is how to pass my own data to a pretrained model. Any help would be appreciated.
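For reference, with a standard recent gensim Doc2Vec model (the jhlau models are built on an older fork, so this may not apply to them directly), inferring a vector for new, identically preprocessed tokens would look roughly like this; the file path is a placeholder:

import gensim

# Placeholder path to a saved Doc2Vec model.
model = gensim.models.doc2vec.Doc2Vec.load("doc2vec_model.bin")

# The new text must be tokenized/preprocessed the same way as the model's training data.
tokens = "use medium paper examination medium habit".split()
inferred_vector = model.infer_vector(tokens)

# Then similar training documents can be queried exactly as in the question.
sims = model.docvecs.most_similar([inferred_vector], topn=10)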
UPD: outputs that make me feel bad
Train Document (6134): «use medium paper examination medium habit one
week must chart daily use medium radio television newspaper magazine
film video etc wake radio alarm listen traffic report commuting get
news watch sport soap opera watch tv use internet work home read book
see movie use data collect journal basis analysis examining
information using us gratification model discussed textbook us
gratification article provided perhaps carrying small notebook day
inputting material evening help stay organized smartphone use note app
track medium need turn diary trust tell tell immediately paper whether
actually kept one begin medium diary soon possible order give ample
time complete journal write paper completed diary need write page
paper use medium functional analysis theory say something best
understood understanding used us gratification model provides
framework individual use medium basis analysis especially category
discussed posted dominick article apply concept medium usage expected
le medium use cognitive social utility affiliation withdrawal must
draw conclusion use analyzing habit within framework idea discussed
text article concept must clearly included articulated paper common
mistake student make assignment tell medium habit fail analyze habit
within context us gratification model must include idea paper»
Similar Document (6130, 0.6926988363265991): «use medium paper examination medium habit one week must chart daily use medium radio
television newspaper magazine film video etc wake radio alarm listen
traffic report commuting get news watch sport soap opera watch tv use
internet work home read book see movie use data collect journal basis
analysis examining information using us gratification model discussed
textbook us gratification article provided perhaps carrying small
notebook day inputting material evening help stay organized smartphone
use note app track medium need turn diary trust tell tell immediately
paper whether actually kept one begin medium diary soon possible order
give ample time complete journal write paper completed diary need
write page paper use medium functional analysis theory say something
best understood understanding used us gratification model provides
framework individual use medium basis analysis especially category
discussed posted dominick article apply concept medium usage expected
le medium use cognitive social utility affiliation withdrawal must
draw conclusion use analyzing habit within framework idea discussed
text article concept must clearly included articulated paper common
mistake student make assignment tell medium habit fail analyze habit
within context us gratification model must include idea paper»
This looks perfectly OK, but looking at other outputs:
Train Document (1185): «photography garry winogrand would like paper
life work garry winogrand famous street photographer also influenced
street photography aim towards thoughtful imaginative treatment detail
referencescite research material academic essay university level»
Similar Document (3449, 0.6901006698608398): «tang dynasty write page
essay tang dynasty essay discus buddhism tang dynasty name artifact
tang dynasty discus them history put heading paragraph information
tang dynasty discussed essay»
This shows that the similarity score between two practically identical texts (the most similar pair in the system) and two completely unrelated ones is almost the same, which makes it problematic to do anything with the data.
To get the most similar documents I use
sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
The models from https://github.com/jhlau/doc2vec are based on a custom fork of an older version of gensim, so you'd have to find/use that to make them usable.
Models from a generic dataset (like Wikipedia) may not understand the domain-specific words you need, and even where words are shared, the effective senses of those words may vary. Also, to use another model to infer vectors on your data, you should ensure you're preprocessing/tokenizing your text in the same way as the training data was processed.
Thus, it's best to use a model you've trained yourself – so you fully understand it – on domain-relevant data.
10k documents of 20-50 words each is a bit small compared to published Doc2Vec work, but might work. Trying to get 500-dimensional vectors from a smaller dataset could be a problem. (With less data, fewer vector dimensions and more training iterations may be necessary.)
If your results on your self-trained model are unsatisfactory, there could be other problems in your training and inference code (not yet shown in your question). It would also help to see more concrete examples/details of how your results are unsatisfactory, compared to a baseline (like the bag-of-words representations you mention). If you add these details to your question, it might be possible to offer other suggestions.
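As a rough sketch of the self-training route suggested above (the placeholder texts, vector size and epoch count are only illustrative starting points for a small corpus):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Placeholder corpus: replace with your real preprocessed token lists.
texts = [
    ["use", "medium", "paper", "examination", "medium", "habit"],
    ["use", "medium", "radio", "television", "newspaper"],
]

documents = [TaggedDocument(words=tokens, tags=[i]) for i, tokens in enumerate(texts)]

# Fewer dimensions and more epochs tend to suit a small corpus better than 500 dimensions.
model = Doc2Vec(documents, vector_size=100, min_count=2, epochs=200, workers=8)

# Sanity check: infer a vector for a training document and look at its neighbours.
inferred = model.infer_vector(texts[0])
print(model.docvecs.most_similar([inferred], topn=len(model.docvecs)))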
I am planning to do my final year project on Natural Language Processing (using NLTK) and my area of interest is Comment Summarization from Social media websites such as Facebook. For example, I am trying to do something like this:
Random Facebook comments on a picture:
Wow! Beautiful.
Looking really beautiful.
Very pretty, Nice pic.
Now, all these comments will get mapped (using a template based comment summarization technique) into something like this:
3 people find this picture to be "beautiful".
The output will consist of the word "beautiful", since it is more commonly used in the comments than the word "pretty" (and also because "beautiful" and "pretty" are synonyms). In order to accomplish this task, I am going to use approaches like tracking keyword frequency and keyword scores (in this scenario, "beautiful" and "pretty" have very close scores).
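A rough sketch of that keyword-frequency idea, using NLTK's WordNet to look up candidate synonyms (the comment list is the example above; whether "pretty" and "beautiful" end up grouped depends on WordNet's coverage):

from collections import Counter
from nltk.corpus import wordnet as wn   # requires nltk.download("wordnet") beforehand

comments = ["Wow! Beautiful.", "Looking really beautiful.", "Very pretty, Nice pic."]

# Keyword frequency across comments, with very naive tokenization for illustration.
words = [w.strip(".,!").lower() for c in comments for w in c.split()]
counts = Counter(words)
print(counts.most_common(3))

# WordNet supplies candidate synonym groups whose scores could be merged.
synonyms_of_beautiful = {l.name() for s in wn.synsets("beautiful") for l in s.lemmas()}
print(synonyms_of_beautiful)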
Is this the best way to do it?
So far with my research, I have been able to come up with the following papers, but none of them addresses this kind of comment summarization:
Automatic Summarization of Events from Social Media
Social Context Summarization
What are the other papers in this field which address a similar issue?
Apart from this, I also want my summarizer to improve with every summarization task. How do I apply machine learning in this regard?
Topic model clustering is what you are looking for.
A search on Google Scholar for "topic model clustering" will give you lots of references on the subject.
To understand them, you need to be familiar with approaches for the following tasks, apart from the basics of machine learning in general:
Clustering: Cosine distance clustering, k-means clustering
Ranking: PageRank, TF-IDF, Mutual Information Gain, Maximal Marginal Relevance
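As a rough starting point (simple clustering rather than a full topic model), the comments can be vectorized with TF-IDF and grouped with k-means in scikit-learn; the comment list below is illustrative:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

comments = [
    "Wow! Beautiful.",
    "Looking really beautiful.",
    "Very pretty, Nice pic.",
    "Where was this taken?",
]

# TF-IDF turns each comment into a weighted term vector.
X = TfidfVectorizer(stop_words="english").fit_transform(comments)

# k-means groups comments with similar vocabulary; the number of clusters is a guess here.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)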
I want to build an analytics engine on top of an article publishing platform. More specifically, I want to track the users' reading behaviour (e.g. number of views of an article, time spent with the article open, rating, etc), as well as statistics on the articles themselves (e.g. number of paragraphs, author, etc).
This will have two purposes:
Present insights about users and articles
Provide recommendations to users
For the data analysis part I've been looking at cubes, pandas and pytables. There is a lot of data, and it is stored in MySQL tables; I'm not sure which of these packages would better handle such a backend.
For the recommendation part, I'm simply thinking about feeding data from the data analysis engine to a clustering model.
Any recommendations about how to put all this together, as well as cool python projects out there that can help me out?
Please let me know if I should give more information.
Thank you
Scikit-learn should make you happy for the data processing (clustering) part.
For the analysis and visualization side, you have Cubes as you mentioned, and for viz I use CubesViewer, which I wrote.
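A minimal sketch of that pipeline, assuming a SQLAlchemy connection string, table and column names that are purely illustrative:

import pandas as pd
from sqlalchemy import create_engine
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical MySQL connection string and query; adjust to your own schema.
engine = create_engine("mysql+pymysql://user:password@localhost/analytics")
df = pd.read_sql("SELECT user_id, views, time_spent, rating FROM reading_stats", engine)

# Scale the behavioural features, then cluster users into rough segments for recommendations.
features = StandardScaler().fit_transform(df[["views", "time_spent", "rating"]])
df["segment"] = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(features)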