I am a data analyst and was just assigned to build a real-time recommendation engine for our website.
I need to analyse visitor behavior and run real-time analysis on that input, so I have three questions about this project.
1) Users are not forced to sign up. Is there any methodology to capture user behavior such as search and visit history?
2) The recommendation models can be pre-trained, but the prediction process takes time. How can we improve the performance?
3) I only know how to write Python scripts. How can I implement the recommendation engine with my Python scripts?
Thanks.
===============
However, 90% of our customers purchase products during their first visit and will not come back soon.
We cannot have a ready-made model for new visitors.
And they prefer to use item-based collaborative filtering (itemCF) for the recommendation engine.
It sounds like mission impossible now...
This is quite a broad question; however, I will do my best to answer:
Visit history can be tracked by enabling some form of analytics tracking on your domain. This can be a pre-built solution that you implement, which will provide a detailed overview of all visitors to your domain, usually with some form of dashboard. Most pre-built solutions provide a way to obtain the analytics that have been collected.
Another way would be to use browser cookies to store information pertaining to each visit to your domain, including search/page history. This information will be available to the website whenever the user visits it within the same browser. When the user visits your website, you could send the information to a server/REST endpoint, which could analyse the information (IP/geolocation/number of visits/search/page history) and make recommendations based on that. Another common method is to track past purchases, etc.
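As a rough sketch of that cookie-plus-endpoint idea (the /track route, the visitor_id cookie name, and the in-memory store are assumptions made purely for illustration), a minimal Flask collector could look like this:

```python
# Minimal sketch of a tracking endpoint, assuming Flask and an in-memory store.
# In practice you would persist events to a database or analytics pipeline.
import uuid
from flask import Flask, request, jsonify, make_response

app = Flask(__name__)
events = {}  # visitor_id -> list of recorded events (illustrative only)

@app.route("/track", methods=["POST"])
def track():
    # Re-use the visitor cookie if present, otherwise issue a new anonymous id.
    visitor_id = request.cookies.get("visitor_id") or str(uuid.uuid4())
    data = request.get_json(silent=True) or {}
    event = {
        "page": data.get("page"),
        "search": data.get("search"),
        "ip": request.remote_addr,
    }
    events.setdefault(visitor_id, []).append(event)

    resp = make_response(jsonify(status="ok"))
    # Persist the anonymous id in the browser so later visits can be linked.
    resp.set_cookie("visitor_id", visitor_id, max_age=60 * 60 * 24 * 365)
    return resp
```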
To improve performance, one solution would be to always have the predictions for a particular user ready for when they next visit the site. That way, there is no delay. However, the first time a user visits you likely won't have enough information to make detailed predictions, so you will have to resort to providing options based on geolocation (which shouldn't take too long and won't impact performance).
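A sketch of that "have it ready" idea, assuming the slow model runs offline in a batch job (the precompute_for_user name, the model.predict call, and the geolocation fallback table are all made up for this illustration):

```python
# Sketch: precompute per-user recommendations offline so serving is a lookup.
recommendation_cache = {}   # user_id -> list of item ids, refreshed by a batch job
popular_by_country = {"US": [1, 2, 3], "DE": [4, 5, 6]}  # illustrative fallback data

def precompute_for_user(user_id, model, history):
    """Run the (slow) model offline and cache the result for instant serving."""
    recommendation_cache[user_id] = model.predict(history)  # hypothetical model API

def recommend(user_id, country):
    # Known visitor with a ready list -> no prediction delay at request time.
    if user_id in recommendation_cache:
        return recommendation_cache[user_id]
    # First-time visitor -> cheap geolocation-based fallback.
    return popular_by_country.get(country, [])
```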
There is another approach that can be taken; the above mainly talked about making predictions based on a user's behavior while browsing the website. Content-based filtering is another approach, which recommends things that are similar to an item the user is currently viewing. This approach is generally easier, as it just requires that you query a database for items that are similar in category, purpose/use, etc.
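For the content-based route, the query for items sharing the category of the item being viewed can be very simple. A minimal sketch (the table and column names, and the shop.db file, are assumptions):

```python
# Sketch: naive content-based lookup - "more items like the one being viewed".
import sqlite3

def similar_items(conn, item_id, limit=5):
    row = conn.execute(
        "SELECT category FROM items WHERE id = ?", (item_id,)
    ).fetchone()
    if row is None:
        return []
    # Recommend other items from the same category.
    return conn.execute(
        "SELECT id, name FROM items WHERE category = ? AND id != ? LIMIT ?",
        (row[0], item_id, limit),
    ).fetchall()

conn = sqlite3.connect("shop.db")  # assumed database file
print(similar_items(conn, item_id=42))
```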
There is no getting around using JavaScript for the client-side work; however, your recommendation engine can be built in Python (it could be a simple REST API endpoint with access to the items database). Most people use Flask, Django, or Eve to implement REST APIs in Python.
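Wiring that together could be as small as a single Flask endpoint that the page's JavaScript calls. Everything here (route name, the stub recommend helper) is illustrative, not a prescribed design:

```python
# Sketch: a REST endpoint the front-end JavaScript can call for recommendations.
from flask import Flask, request, jsonify

app = Flask(__name__)

def recommend(visitor_id, country):
    # Placeholder - in practice this would look up precomputed recommendations
    # (see the earlier sketch) or fall back to popular items for the country.
    return [1, 2, 3]

@app.route("/recommendations")
def recommendations():
    visitor_id = request.cookies.get("visitor_id")
    country = request.args.get("country", "US")
    return jsonify(items=recommend(visitor_id, country))

if __name__ == "__main__":
    app.run()
```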
I have been searching for a way to autotrade on Nadex
https://www.nadex.com
and came across this script https://github.com/FreeTheQuarks/NadexBot
It is an old script and I am not that experienced in Python.
Q1: Is this a good way to go about it, given that it is not an official API and is probably scraping data from the site, which would mean slower requests and trade execution?
There is also an unofficial API client https://github.com/knoguchi/nadex/tree/master/nadex
But again, not sure if good for live trading.
Q2: Are there better ways to go about this, and if so, where should I start?
I'm the author of the Nadex unofficial API Python client.
I still maintain it. Streamer support was recently added.
However, I suggest that you use the JavaScript from nadex.com. It's always up to date and works just like the official web site, obviously.
The JS code is professionally written and very readable. There are 100 JavaScript files, but only a handful are essential for API access.
Nadex is part of IG Group, hence the JS has a lot of IG namespaces. IG offers an API and documentation for developers. The Nadex message format is a little different from IG's, but the design is the same. Once you read the documentation, all the JavaScript code is really easy to understand.
A1: Measure Twice Before One Cut
Simply put, it is your money you trade with, so warnings like this one (from FreeTheQuarks):
(cit.:) "This was the first non-trivial program I wrote. It hasn't received any significant updates in years. I've only made minor readability updates after first putting this on git."
should give one a sufficient sign to re-think the risks before one puts the first dollar on the table.
This is a for-profit game, isn't it?
A2: Yes, there is a better way.
All quantitatively supported trading strategies need stable and consistent care - i.e. one needs to have
rock-solid historical data
stable API to work with ( Trade Execution & Management, Market Events )
reliable & performant testbed for validating one's own quantitative model of the trading strategy
Having shared this piece of some 100+ man-years of experience, one may decide on one's own whether to rely on, or rather forget about, starting any reasonable work from just some reverse-engineering efforts around an "un-official API client".
In case the profit-generation capability of the trading strategy supports the case, one may safely order the outsourced technical implementation and integration effort on a turn-key basis.
Epilogue: If there are quantitatively supported reasons to implement a trading strategy, the profit envelope thereof sets the ultimate economically viable model for having that strategy automated and operated in-vivo. Failure to decide in this order of precedence results in nothing but wasted time & money. Your money.
I was looking into the Nadex platform recently too. I wrote a small wrapper over the OANDA foreign-exchange broker API v1 in Python (they now have v2.0), so I have some experience.
Implementing an auto-trading bot is a big question, but to try and answer: you may either use a pre-existing wrapper for the Nadex API (it looks like Python or JavaScript are your choices), or write one yourself in a language of your preference.
If you want to start from scratch, I believe Nadex offers a RESTful service, which basically means you can make GET, POST, DELETE, and other types of requests via specific URLs (most of the time there is a 'base' URL from which other endpoints spawn). I would first try to find the endpoints to the Nadex servers - Kenji's unofficial API should point you in the right direction there, since he is using URL strings and has a class for making different requests. I was unsuccessful in finding any documentation for the Nadex API myself, but Kenji's wrapper and the JavaScript API both look promising. Depending on the depth of the market and the number of requests, I think you are correct in saying that you wouldn't want a web scraper for something like this; it would be very slow (and probably wasteful of time) compared to using an existing wrapper. I would start by writing classes and/or functions that make simple requests to the Nadex RESTful endpoints, for example a function that logs in and accesses account data. The next step would be to retrieve market data and eventually stream the market data into a trade-logic algorithm that makes decisions for you.
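To make the "start from scratch" route concrete, a first cut usually looks something like the sketch below. Note that the base URL, endpoint paths, and payload fields are pure placeholders - I don't know the actual Nadex endpoints, so you would have to lift the real ones from Kenji's wrapper or the site's JavaScript:

```python
# Sketch of a thin REST client, assuming hypothetical endpoint paths and fields.
# The real URLs/headers must be taken from the unofficial wrapper or the site's JS.
import requests

class NadexClient:
    BASE_URL = "https://trade.nadex.com/api"   # placeholder base URL

    def __init__(self):
        self.session = requests.Session()      # keeps cookies/auth between calls

    def login(self, username, password):
        resp = self.session.post(
            f"{self.BASE_URL}/login",           # hypothetical endpoint
            json={"username": username, "password": password},
        )
        resp.raise_for_status()
        return resp.json()

    def account(self):
        resp = self.session.get(f"{self.BASE_URL}/account")   # hypothetical endpoint
        resp.raise_for_status()
        return resp.json()

    def market_data(self, instrument):
        resp = self.session.get(
            f"{self.BASE_URL}/markets", params={"instrument": instrument}
        )
        resp.raise_for_status()
        return resp.json()
```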
If you want to build a trading bot easily and with most of the work cut out for you, I would recommend one of the other answers here. That way, you can use their predefined classes/functions and have the "boring" API access code written for you, ready to use.
Hope that helps or leads you in the right direction!
I'm about to start the development of a web analytics tool for an e-commerce website.
I'm going to log several different events, basically clicks on various elements of the page and page views.
These events carry metadata (the username of the logged-in user, their country, their age, etc.) and the page itself carries other metadata (category, subcategory, product, etc.).
My company would like something like an OLAP cube, to be able to answer questions like:
How many customers from country X visited category Y?
How many page views were there for category X in January 2012?
My understanding is that I should use an OLAP engine to record these events, and then build a reporting interface to allow my colleagues to use it.
Am I right? Do you have advice on the engine and frontend/reporting tool I should use? I'm a Python programmer, so anything Python-friendly would be nice.
Thank you!
The main question is how big your cube is going to be and whether you need an open-source OLAP solution or not.
If you're dealing with big cubes and want room for future features, you might go for a real OLAP server. A few are open source - Mondrian - and others have a 'limited' community edition - Palo, icCube. The important point here is being compatible with MDX and XMLA, the de facto OLAP standards, so you can plug in different reporting tools and/or use existing libraries. To my understanding, there is no Python XMLA library as there is for Java or .NET, so I'm not sure this is the way to go.
If your cubes are small, you can develop something on your own or go for other, quicker solutions, as Charlax's comment indicates.
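If you do roll your own for a small cube, the questions your colleagues want answered mostly reduce to grouped aggregations over the event log. A minimal sketch of that kind of roll-up (the table and column names, and the analytics.db file, are assumptions for illustration):

```python
# Sketch: answering "how many customers from country X visited category Y?"
# and "how many page views in January 2012?" with a plain SQL roll-up.
import sqlite3

conn = sqlite3.connect("analytics.db")   # assumed event store
rows = conn.execute(
    """
    SELECT country, category,
           COUNT(DISTINCT username) AS customers,
           COUNT(*) AS pageviews
    FROM pageviews
    WHERE viewed_at BETWEEN '2012-01-01' AND '2012-01-31'
    GROUP BY country, category
    """
).fetchall()

for country, category, customers, pageviews in rows:
    print(country, category, customers, pageviews)
```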
As mentioned in the selected answer, it depends on your data volume. However, if you run into a case where a lightweight Python OLAP framework would be sufficient, you might try Cubes; the sources are on GitHub. It contains a SQL backend (any other might be implemented as well) and provides a light HTTP OLAP server. An example of an application (PHP front-end with the HTTP Slicer OLAP server backend) using it can be found here. It does not contain a visualization layer or complex queries, though, but that is the trade-off for being small.
We are considering implementing a voting system (up, down votes) without using any type of credentials--no app accounts nor OpenID or anything of that sort.
Concerns in order:
Prevent robot votes
Allow individuals behind a NAT to vote without overriding/invalidating someone else's vote
Prevent users from voting more than once (or, at the very least, make it very difficult)
My questions:
If you've implemented something similar, any tips?
Any concerns that perhaps I'm overlooking?
Any tools that I should perhaps look into?
If you have any questions that would help for you in forming an answer to any of these questions, please ask in the comments!
To address your concerns:
1: A simple CAPTCHA would probably do the trick; if you Google "django captcha", there are a bunch of plugins. I've never used them myself, so I can't say which is the best.
2 & 3: Using Django's sessions addresses both of these problems - with it you can save a cookie in the user's browser to indicate that the person has already voted. This obviously allows people to vote via different browsers or by clearing their cache, so it depends on how important it is that people not be allowed to vote twice. I would imagine that only a small percentage of people would actually think to try clearing their cache, though. As far as I know, the only other way to limit users without a sign-in process would be to test IP addresses, but that would violate your second criterion, since people on the same network will show up as having the same IP address.
If you don't want multiple voting to be as simple as deleting browser cookies, you could also allow Facebook or Twitter login - the django-socialregistration plugin is pretty well documented and straightforward to implement.
Hope that helps!
Recaptcha is an excellent choice. For Django, here's the one that I've had the most success with, which actually uses images loaded from Recaptcha (as opposed to local images generated on the fly):
http://pypi.python.org/pypi/recaptcha-client#downloads
Instructions for installation are in this snippet:
http://djangosnippets.org/snippets/433/
If Recaptcha is a bit unwieldy for what you're doing, I've heard of people implementing a form that loads with a hidden input containing a timestamp value corresponding to when the form was loaded. Then, when the form is submitted, generate a new timestamp and get the difference between the two. If the difference in seconds is below a certain threshold that's unreasonable for a human visitor, chances are you have a bot. This works for contact forms with several fields; it usually takes a person more than 10 seconds to fill them out.
I can't speak to how effective this technique actually is in production; a lot of these spam bots these days are smarter than I am. But it might be something you'd consider looking into or testing.
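A bare-bones version of that timestamp check, framework-agnostic so it could be called from any Django or Flask view (the field name and the 10-second threshold are arbitrary choices for the sketch):

```python
# Sketch: reject submissions that come back faster than a human plausibly could.
import time

MIN_SECONDS = 10   # arbitrary threshold for this sketch

def render_form_context():
    # Embed the load time as a hidden field when rendering the form.
    return {"form_loaded_at": str(time.time())}

def looks_like_bot(posted_data):
    try:
        loaded_at = float(posted_data["form_loaded_at"])
    except (KeyError, ValueError):
        return True   # a missing or garbled field is suspicious in itself
    return (time.time() - loaded_at) < MIN_SECONDS
```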
We want to implement a "News feed" where a user can see messages broadcast by her friends, sorted with the newest message first. But the feed should reflect changes in her friends list. (If she adds new friends, messages from those should be included in the feed, and if she removes friends their messages should not be included.) If we use the pubsub-test example and attach a recipient list to each message, this means a lot of manipulation of the message recipient lists when users connect and disconnect friends.
We first modeled publish-subscribe "fan-out" using conventional RDBMS thinking. It seemed to work at first, but then, since the IN operator works the way it does, we quickly realized we couldn't continue on that path. We found Brett Slatkin's presentation from last year's Google I/O and we have now watched it a few times, but it isn't clear to us how to do it with "dynamic" recipient lists.
What we need are some hints on how to "think" when modeling this.
Pasting the answer I got for this question in the Google Group for Google App Engine (http://groups.google.com/group/google-appengine/browse_thread/thread/09a05c5f41163b4d), by Ikai L (Google):
A couple of thoughts here:
Is removing of friends a common event? Similarly, is adding of friends a common event? (All relative - relative to "reads" of the news feed.)
From what I remember, the only way to make heavy reads scale is to write the data multiple times into people's streams. Twitter does this, from what I remember, using an "eventually consistent" model. This is why your feed will not update for several minutes when they are under heavy load. The general consensus, though, is that a relational, normalized model simply will not work.
The Jaiku engine is open source for your study: http://code.google.com/p/jaikuengine. This runs on App Engine.
Hope these help when you're considering a design.
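A minimal sketch of that fan-out-on-write idea in plain Python (the storage is just dictionaries here; on App Engine the same shape would map to datastore entities keyed by recipient, which is not shown):

```python
# Sketch: fan-out on write - each message is copied into every friend's feed,
# so reading a feed is a cheap, pre-sorted lookup instead of an IN query.
import time
from collections import defaultdict

followers = defaultdict(set)   # author -> set of users who receive her messages
feeds = defaultdict(list)      # user -> list of (timestamp, author, text)

def broadcast(author, text):
    entry = (time.time(), author, text)
    for user in followers[author]:
        feeds[user].append(entry)          # write amplified across recipients

def read_feed(user, limit=20):
    return sorted(feeds[user], reverse=True)[:limit]

def add_friend(user, friend):
    followers[friend].add(user)            # user now receives friend's broadcasts
    # Optionally backfill recent messages from the new friend here.

def remove_friend(user, friend):
    followers[friend].discard(user)
    feeds[user] = [e for e in feeds[user] if e[1] != friend]   # drop their messages
```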
In Django, how can I see the number of current visitors? Or how do I determine the number of active sessions?
Is this a good method?
Use django.contrib.sessions.models.Session and set the expiry time short. Every time somebody does something on the site, update the expiry time. Then count the number of sessions that are not expired.
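A sketch of the counting part with the stock session model (note this counts anonymous sessions too, not just logged-in users):

```python
# Sketch: count sessions that have not yet expired.
from django.contrib.sessions.models import Session
from django.utils import timezone

def active_session_count():
    return Session.objects.filter(expire_date__gte=timezone.now()).count()
```

The "update expiry on every request" step roughly corresponds to setting a short SESSION_COOKIE_AGE together with SESSION_SAVE_EVERY_REQUEST = True in settings, so each request refreshes the session's expiry date.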
You might want to look into something like django-tracking for this.
django-tracking is a simple attempt at keeping track of visitors to Django-powered Web sites. It also offers basic blacklisting capabilities.
Edit: As for your updated question... [Answer redacted after being corrected by muhuk]
Alternatively, I liked the response to this question: How do I find out total number of sessions created i.e. number of logged in users?
You might want to try that instead.
django-tracking2 can be helpful for tracking visitors.
It is especially easy to configure in deployments like AWS, because it does not require any dependencies or environment variables.
django-tracking2 tracks the length of time visitors and registered users spend on your site. Although this will work for websites, it is more applicable to web applications with registered users. It does not replace (nor intend to replace) client-side analytics, which is great for understanding the aggregate flow of page views.
There is also a little application, django-visits, to track visits: https://bitbucket.org/jespino/django-visits
Edit: Added some more information about why I present this answer here. I found Chartbeat when I tried to answer this same question for my Django-based site. I don't work for them.
Not specifically Django, but chartbeat.com is very interesting to add to a website as well.
django-tracking is great, +1 for that answer, etc.
There are a couple of things I could not do with django-tracking that Chartbeat helped with: it tracked interactions with completely cached pages that never hit the django-tracking code, and with pages not delivered through Django (e.g. WordPress, etc.).