I want to build an analytics engine on top of an article publishing platform. More specifically, I want to track users' reading behaviour (e.g. number of views of an article, time spent with the article open, rating, etc.), as well as statistics on the articles themselves (e.g. number of paragraphs, author, etc.).
This will have two purposes:
Present insights about users and articles
Provide recommendations to users
For the data analysis part I've been looking at Cubes, pandas and PyTables. There is a lot of data, and it is stored in MySQL tables; I'm not sure which of these packages would handle such a backend best.
For the recommendation part, I'm simply thinking about feeding data from the data analysis engine into a clustering model.
Any recommendations on how to put all this together, as well as cool Python projects out there that could help me?
Please let me know if I should give more information.
Thank you
scikit-learn should make you happy for the data processing (clustering) part.
For the analysis and visualization side, you have Cubes, as you mentioned, and for visualization I use CubesViewer, which I wrote.
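A minimal sketch of how those pieces could fit together, assuming a hypothetical article_views table (the connection string, table and column names below are all made up; only the pandas/scikit-learn calls themselves are real):

```python
import pandas as pd
from sqlalchemy import create_engine
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical connection string and schema -- adjust to your own MySQL setup.
engine = create_engine("mysql+pymysql://user:password@localhost/analytics")
df = pd.read_sql(
    """
    SELECT user_id,
           COUNT(*)            AS views,
           AVG(time_open_secs) AS avg_time_open,
           AVG(rating)         AS avg_rating
    FROM article_views
    GROUP BY user_id
    """,
    engine,
)

# Scale the features so no single column dominates the distance metric.
features = StandardScaler().fit_transform(df[["views", "avg_time_open", "avg_rating"]])

# Group users into, say, five behavioural clusters.
df["cluster"] = KMeans(n_clusters=5, random_state=0).fit_predict(features)
print(df.groupby("cluster").mean())
```

The cluster assignments can then feed the recommendation side: for instance, recommend articles that are popular within a user's cluster.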
For a recommender system, when using Surprise, we normally pass only UserID, ItemID and Rating using load_from_df.
But if I also have other features that I want to load from a DataFrame, how can I do it? I couldn't find any useful information or examples in the Surprise API docs: https://surprise.readthedocs.io/en/stable/dataset.html.
Can someone point me in the right direction?
Surprise is a Python scikit for building and analyzing recommender systems that deal with explicit rating data. Surprise was designed with the following purposes in mind: Give users perfect control over their experiments. ... Provide tools to evaluate, analyse and compare the algorithms' performance.
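That description hints at the limitation: as far as I can tell, Surprise's Dataset only ever consumes the (user, item, rating) triples, so extra per-user or per-item features cannot be passed through load_from_df; a hybrid library such as LightFM is the usual suggestion when you need side features. For the standard case, the call looks like this (the column names and toy data are placeholders):

```python
import pandas as pd
from surprise import Dataset, Reader, SVD
from surprise.model_selection import cross_validate

# Toy ratings frame -- in practice this is your own DataFrame.
ratings = pd.DataFrame({
    "userID": [1, 1, 2, 2, 3],
    "itemID": [10, 20, 10, 30, 20],
    "rating": [4.0, 3.0, 5.0, 2.0, 4.5],
})

# Surprise reads exactly these three columns, in this order.
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(ratings[["userID", "itemID", "rating"]], reader)

# Evaluate a plain matrix-factorization model on the rating data.
cross_validate(SVD(), data, measures=["RMSE"], cv=3, verbose=True)
```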
I have been trying to pull financial statements embedded in annual-report PDFs and export them in Excel/CSV format using Python, but I am encountering some problems:
1. A specific financial statement can be on any page in the report. If I were to process hundreds of PDFs, I would have to specify page numbers manually, which takes a lot of time. Is there any way for the scraper to work out where exactly the statement is?
2. Some statements span multiple pages, and the end result after scraping a PDF isn't what I want.
3. Different annual reports have different financial-statement formats. Is there any way to process them and convert them to a single standard format?
I would also appreciate it if anyone who has done something like this could share examples.
P.S. I am working with Python and have used tabula and Camelot.
I had a similar case where the problem was to extract specific form information from PDFs (name, date of birth and so on). I used the Tesseract open-source software with pytesseract to perform OCR on the files. Since I did not need the whole PDFs, but specific information from them, I designed an algorithm to find that information. In my case I used simple heuristics (specific fields, specific line numbers and some other domain-specific details), but you could also take a machine-learning approach and train a classifier to find the needed text parts. Domain-specific heuristics should work here too, since a financial statement is sure to have a special vocabulary or text markers that indicate where it begins and ends.
I hope this at least gives you some ideas on how to approach the problem.
P.S.: With Tesseract you can also process multi-page PDFs. Regarding 3): a machine-learning approach would need some samples to learn a good generalization of what a financial statement may look like.
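A minimal sketch of that page-finding idea, assuming the PDFs are scanned (I am adding pdf2image here to rasterize the pages, which the answer doesn't mention, and the marker keywords are just examples to tune for your reports):

```python
import pytesseract
from pdf2image import convert_from_path

# Example phrases that typically mark a financial-statement page.
MARKERS = ("consolidated balance sheet", "income statement", "cash flow")

def find_statement_pages(pdf_path):
    """OCR every page and return the page numbers whose text contains a marker."""
    pages = convert_from_path(pdf_path, dpi=300)  # one PIL image per page
    hits = []
    for number, image in enumerate(pages, start=1):
        text = pytesseract.image_to_string(image).lower()
        if any(marker in text for marker in MARKERS):
            hits.append(number)
    return hits

print(find_statement_pages("annual_report.pdf"))  # hypothetical input file
```

Once you know the page numbers, you can hand just those pages to Camelot or tabula for the actual table extraction.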
I would like to code a script that can locate a specific word or number in a financial statement. Financial statements roughly contain the same information; they are, however, not identical or organized in the same way. My thought is that with TensorFlow I could train a neural network to locate the specific words or numbers for me. I am thinking that if I label different text and numbers in 1,000 financial statements and use them to train the neural network, it will then be able to identify those numbers or words in any financial statement. For example, I would mark in all 1,000 training statements which number is the company's profit.
Is this doable? I have been coding in Python for a couple of months, and so far I've built some web scrapers and integrated them with Twitter, Slack and Google Sheets. I would be very grateful for all your thoughts on this project and if anyone could steer me in the right direction by sharing relevant tutorials.
Thanks a lot!
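Independent of TensorFlow, the labelling idea above can be framed as token classification: every token in a statement gets features, and a model scores whether it is, say, the profit figure. A deliberately tiny, illustrative sketch with scikit-learn and hand-crafted features (the sentence, labels and feature set are all made up; a real system would need thousands of labelled statements and probably a neural sequence tagger):

```python
import re
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def token_features(tokens, i):
    """Hand-crafted features for token i: its shape plus its neighbours."""
    return {
        "is_number": bool(re.fullmatch(r"[\d,.]+", tokens[i])),
        "prev_word": tokens[i - 1].lower() if i > 0 else "<start>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<end>",
    }

# One toy labelled sentence: 1 marks the token that is the profit figure.
tokens = ["Net", "profit", "for", "the", "year", "was", "1,250,000", "USD"]
labels = [0, 0, 0, 0, 0, 0, 1, 0]

vec = DictVectorizer()
X = vec.fit_transform(token_features(tokens, i) for i in range(len(tokens)))
clf = LogisticRegression().fit(X, labels)

# Score tokens from an unseen sentence.
new = ["Profit", "was", "980,000", "EUR"]
X_new = vec.transform(token_features(new, i) for i in range(len(new)))
for token, p in zip(new, clf.predict_proba(X_new)[:, 1]):
    print(f"{token}: {p:.2f}")  # higher score = more likely the profit figure
```

The same framing carries over directly to a TensorFlow sequence model once you have enough labelled data.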
It's great that you're getting started. Before thinking about the actual implementation with TensorFlow or any other library, I believe you should first try to understand the problem in terms of its underlying domain.
I'm not entirely sure what you're trying to achieve, but my rough guess is that you want to determine whether a statement turns out to be beneficial to the company or not, which is something like a semantic-analysis problem.
So I strongly believe you should first learn the various methodologies related to semantic analysis and find the most appropriate technique.
In short: theory and understanding before the actual code.
Finally, I would suggest asking such theoretical questions on the AI Stack Exchange; here on SO we generally deal with code or things close to code.
I hope that makes sense? ;)
Drop a comment if you have any doubts.
I am looking for data sets and tutorials that specifically target business data analysis issues. I know about Kaggle, but its main focus is on machine learning and associated problems. It would be great to know of a blog or data dump covering data-analysis issues, or maybe a good read/book?
The correct answer to this depends entirely on how comfortable you currently are with machine learning. Business data analysis and prediction are so closely aligned with machine learning that most developers consider them a subset that more general ML skills will cover. So I will suggest two things. If you have no experience in ML, launch into the Data Science (Python) career track on DataCamp - it is excellent! It will help you get to grips with the overall ideas of cleaning and processing your data, as well as supervised and unsupervised learning.
If you are already comfortable with all that, I would suggest looking at pbpython.com - this site is entirely devoted to Python for business-analysis use and recommends a plethora of books specialized in certain topics, as well as covering individual topics itself very well.
As part of our research group, we're collecting large amounts of location data. Our data essentially looks like (user id, lat/long co-ordinates, timestamp). There's other metadata involved too, but that's not relevant here.
We're collecting about 2-3 million records a week, and expect to collect about a year's worth of data in due time.
I'd really like some advice on techniques on storing and processing this data. We'd like to be able to answer queries similar to:
(1) For a given location, who was near that location (within a specified distance) over a specified period of time?
(2) Which locations are near each other?
That's the general idea. We don't need real-time responses, but what are good databases (or other data-storage software) for this? I've come across people talking about k-d trees - do those work at this scale? What kind of hardware do we need? I'm hoping for pointers towards general strategies: how do we store this data? Does it even make sense to store it all in a database? Which databases/software/packages lend themselves well to distance/radius calculations?
We're most familiar with Python/Linux, would prefer to stay away from Java, and prefer open-source/free software. We're new to all this, so pointers to books and papers would also be useful. Any and all advice would be greatly appreciated.
PostGIS is probably what you are looking for.
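For example, query (1) could look roughly like this from Python, assuming a hypothetical location_records table with user_id, geom and recorded_at columns (all names and values below are made up; ST_DWithin over geography operands measures distance in metres):

```python
import psycopg2

conn = psycopg2.connect("dbname=tracking")  # hypothetical database
cur = conn.cursor()

# Who was within 500 m of a given (lon, lat) during a given week?
cur.execute(
    """
    SELECT DISTINCT user_id
    FROM location_records
    WHERE ST_DWithin(
            geom::geography,
            ST_SetSRID(ST_MakePoint(%s, %s), 4326)::geography,
            %s)
      AND recorded_at BETWEEN %s AND %s
    """,
    (-0.1276, 51.5074, 500, "2023-01-01", "2023-01-08"),
)
print(cur.fetchall())
```

With a GiST index on the geometry column, radius queries like this stay fast even on very large tables. For purely in-memory experiments, scipy.spatial.cKDTree can also answer radius queries over a few million points, but a database gives you persistence and the timestamp filtering for free.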