I'm currently exploring different ways to judge and predict the performance of various offers and marketing campaigns. I have a list of metrics to pull from which I'm currently using now to predict performance, such as:
Day the offer was sent
Month
Weather
Time of Day
+more
And for my performance metric, I use
Redemption Rate (For every offer sent, how many times was it redeemed) - This is how I judge success
But one of the most important metrics is the offer itself, which I know in the form of a text-string.
Here are a few user-generated examples.
Get $4.00 off a large pizza
Receive 20% off your next order
Buy any Chocolate Milkshake, get another one half price
Two wraps for $7.50
Free cookie with any purchase
..and hundred's more
Now, I know there's very important information in those text stings, but I don't know the best way to analyze it and extract key information. For example, in this text it shows the product its advertising, the discount, the dollar amount, the percentage off, etc. I need a generalized way to go through each string (I'm assuming through some tolkenized method), and extract relevant information.
I'm hoping to get some input on how I could analyze these strings, eventually with the purpose of generating a string-based dataset (along with the other aforementioned data points) that I can use for predictions.
I am writing my code using python 3.0.
Any advice is greatly appreciated. Thanks.
Im about to learn about Python in order to work with data analysis - and would like to know how do we apply slicing in real life cases. If someone has some examples from the real life
slicing is used to separate/get a specified part of something right?
so lets say i have some purchase data from my website
this data includes satisfaction level, average income, amount spent and time spent looking to purchase an item.
if i wanted to see if there was any correlation between income and amount spent on my website i could use slicing to only retrieve data from incomes that are above a certain point.
all in all slicing is used to manipulate data, and its crucial for data preparation in data analysis.
I have a reasonably technical background and have done a fair bit of node development, but I’m a bit of a novice when it comes to statistics and a complete novice with python, so any advice on a synthetic data generation experiment I’m trying my hand at would be very welcome :)
I’ve set myself the problem of generating some realistic(ish) sales data for a bricks and mortar store (old school, I know).
I’ve got a smallish real-world transactional dataset (~500k rows) from the internet that I was planning on analysing with a tool of some sort, to provide the input to a PRNG.
Hopefully if I explain my thinking across a couple of broad problem domains, someone(s?!) can help me:
PROBLEM 1
I think I should be able to use the real data I have to either:
a) generate a probability distribution curve or
b) identify an ‘out of the box’ distribution that’s the closest match to the actual data
I’m assuming there’s a tool or library in Python or Node that will do one or both of those things if fed the data and, further, give me the right values to plug in to a PRNG to produce a series of data points that not are not only distributed like the original's, but also within the same sort of ranges.
I suspect b) would be less expensive computationally and, also, better supported by tools - my need for absolute ‘realness’ here isn’t that high - it’s only an experiment :)
Which leads me to…
QUESTION 1: What tools could I use to do do the analysis and generate the data points? As I said, my maths is ok, but my statistics isn't great (and the docs for the tools I’ve seen are a little dense and, to me at least, somewhat impenetrable), so some guidance on using the tool would also be welcome :)
And then there’s my next, I think more fundamental, problem, which I’m not even sure how to approach…
PROBLEM 2
While I think the approach above will work well for generating timestamps for each row, I’m going round in circles a little bit on how to model what the transaction is actually for.
I’d like each transaction to be relatable to a specific product from a list of products.
Now the products don’t need to be ‘real’ (I reckon I can just use something like Faker to generate random words for the brand, product name etc), but ideally the distribution of what is being purchased should be a bit real-ey (if that’s a word).
My first thought was just to do the same analysis for price as I’m doing for timestamp and then ‘make up’ a product for each price that’s generated, but I discarded that for a couple of reasons: It might be consistent ‘within’ a produced dataset, but not ‘across’ data sets. And I imagine on largish sets would double count quite a bit.
So my next thought was I would create some sort of lookup table with a set of pre-defined products that persists across generation jobs, but Im struggling with two aspects of that:
I’d need to generate the list itself. I would imagine I could filter the original dataset to unique products (it has stock codes) and then use the spread of unit costs in that list to do the same thing as I would have done with the timestamp (i.e. generate a set of products that have a similar spread of unit cost to the original data and then Faker the rest of the data).
QUESTION 2: Is that a sensible approach? Is there something smarter I could do?
When generating the transactions, I would also need some way to work out what product to select. I thought maybe I could generate some sort of bucketed histogram to work out what the frequency of purchases was within a range of costs (say $0-1, 1-2$ etc). I could then use that frequency to define the probability that a given transaction's cost would fall within one those ranges, and then randomly select a product whose cost falls within that range...
QUESTION 3: Again, is that a sensible approach? Is there a way I could do that lookup with a reasonably easy to understand tool (or at least one that’s documented in plain English :))
This is all quite high level I know, but any help anyone could give me would be greatly appreciated as I’ve hit a wall with this.
Thanks in advance :)
The synthesised dataset would simply have timestamp, product_id and item_cost columns.
The source dataset looks like this:
InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/2010 8:26,2.55,17850,United Kingdom
536365,71053,WHITE METAL LANTERN,6,12/1/2010 8:26,3.39,17850,United Kingdom
536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,12/1/2010 8:26,2.75,17850,United Kingdom
536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,12/1/2010 8:26,3.39,17850,United Kingdom
536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,12/1/2010 8:26,3.39,17850,United Kingdom
536365,22752,SET 7 BABUSHKA NESTING BOXES,2,12/1/2010 8:26,7.65,17850,United Kingdom
536365,21730,GLASS STAR FROSTED T-LIGHT HOLDER,6,12/1/2010 8:26,4.25,17850,United Kingdom
536366,22633,HAND WARMER UNION JACK,6,12/1/2010 8:28,1.85,17850,United Kingdom
I would like to create in python some RL algorithm where the algorithm would interact with a very big DataFrame representing stock prices. The algorithm would tell us: Knowing all of the prices and price changes in the market, what would be the best places to buy/sell (minimizing loss maximizing reward). It has to look at the entire DataFrame each step (or else it wouldn't have the entire information from the market).
Is it possible to build such algorithm (which works relatively fast on a large df)? How should it be done? What should my environment look like and which algorithm (specifically) should I use for this type of RL and which reward system? Where should I start
I think you are a little confused with this .what I think you want to do is to check whether the stock prices of a particular company will go up or not or the stock price of which company will shoot up where you already have a dataset regarding the problem statement.
about RL it does not work on any dataset it's a technique that enables an agent to learn in an interactive environment by trial and error using feedback from its own actions and experiences.
you can check this blog for some explanation don't get confused.
https://towardsdatascience.com/types-of-machine-learning-algorithms-you-should-know-953a08248861
Situation
Our company generates waste from various locations in US. The waste is taken to different locations based on suppliers' treatment methods and facilities placed nationally.
Consider a waste stream A which is being generated from location X. Now the overall costs to take care of Stream A includes Transportation cost from our site as well treatment method. This data is tabulated.
What I want to achieve
I would like my python program to import excel table containing this data and plot the distance between our facility and treatment facility and also show in a hub-spoke type picture just like airlines do as well show data regarding treatment method as a color or something just like on google maps.
Can someone give me leads on where should I start or which python API or module that might best suite my scenario?
This is a rather broad question and perhaps not the best for SO.
Now to answer it: you can read excel's csv files with the csv module. Plotting is best done with matplotlib.pyplot.