Comparing two big product datasets - Python

I have two product datasets of 5.4 million and 4.5 million products, which were scraped from a competitor's website. Most products are non-branded and don't have any unique standard SKU. I want to compare 300K of our products with the similar products our competitor is selling and find out the price difference.
I have tried comparing the datasets using two different Sphinx indexes with similar-word matching, but was not able to get a good result because the titles of non-branded products are not similar and lack a standard brand name, title, or SKU.
Is there any way to get this result using ML or some big-data algorithm?

If you use Sphinx/Manticore you can:
take each of your products from dataset 1
convert it into a query using a quorum operator with a percentage threshold and a ranking formula of your choice
run the query against dataset 2
find results
take top K
There are some additional tricks that can help, such as:
IDF boosting
skipping stop-words
use of atc-based ranking
The tricks and the concept of finding similar content in general are described in this interactive course - https://play.manticoresearch.com/mlt/
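For illustration, here is a minimal Python sketch of the steps above. The index name products2, the title field, and the pymysql connection details are assumptions for the sketch, not part of the original answer; adjust them to your own Manticore/Sphinx setup.

# Minimal sketch: quorum query from Python over the MySQL protocol (default port 9306).
import pymysql

conn = pymysql.connect(host='127.0.0.1', port=9306, user='', password='')  # SQL listener

def find_similar(title, top_k=5, quorum=0.6):
    """Return the top_k best matches from dataset 2 for one product title from dataset 1."""
    # Quorum operator: match documents containing at least `quorum` share of the words.
    # (Real code should escape full-text operators more carefully than this.)
    words = title.replace('"', ' ').replace('/', ' ')
    query = f'"{words}"/{quorum}'
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, title, WEIGHT() AS w FROM products2 "
            "WHERE MATCH(%s) ORDER BY w DESC LIMIT %s",
            (query, top_k),
        )
        return cur.fetchall()

candidates = find_similar("wireless optical mouse 2.4 ghz usb receiver black")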

Related

Finding a big enough sample size by expanding search categories. Algorithmic clustering?

I'm interested in finding 50+ similar samples within a dataset of 3M+ rows and 200 columns.
Consider a .csv database of vehicles. Every row is one car, and the columns are features like brand, mileage, engine size, etc.
brand | year bin  | engine bin | mileage
Ford  | 2014-2016 | 1-2        | 20K-30K
The procedure to automate:
When I receive a new sample I want to find 50+ similar ones. If I can't find exact matches, I can drop or broaden some information. For example, the same Ford model between 2012 and 2016 is nearly the same car, so I would expand the search with a bigger year bin. I expect that if I expand the search across enough categories I will always find the required population.
After this, I get a "search query" like the one below, which returns 50+ samples, so it's maximally precise yet big enough to observe the mean, variance, etc.
brand | year bin  | engine bin | mileage
Ford  | 2010-2018 | 1-2        | 10K-40K
Is there anything like this already implemented?
I've tried k-means clustering the vehicles by those features, but it isn't precise enough and isn't easily interpretable for people without a data science background. I think distance-based metrics can't learn "hard" constraints such as never searching across different brands. But maybe there is a way of weighting the features?
I'm happy to receive every suggestion!
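For what it's worth, here is a rough pandas sketch of the broadening procedure described above. The DataFrame cars and its raw (un-binned) columns brand, year and mileage, as well as the step sizes and starting tolerances, are assumptions for illustration only.

import pandas as pd

def expanding_search(cars: pd.DataFrame, sample: dict, min_size: int = 50,
                     year_step: int = 2, mileage_step: int = 10_000,
                     max_rounds: int = 10) -> pd.DataFrame:
    """Widen the year and mileage windows until at least min_size rows match.
    The brand is treated as a hard constraint and is never relaxed."""
    year_tol, mileage_tol = 1, 5_000                 # initial half-widths of the bins
    for _ in range(max_rounds):
        mask = (
            (cars['brand'] == sample['brand'])       # hard constraint
            & cars['year'].between(sample['year'] - year_tol, sample['year'] + year_tol)
            & cars['mileage'].between(sample['mileage'] - mileage_tol,
                                      sample['mileage'] + mileage_tol)
        )
        matches = cars[mask]
        if len(matches) >= min_size:
            return matches
        year_tol += year_step                        # broaden the soft criteria
        mileage_tol += mileage_step
    return matches                                   # best effort if min_size is never reached

# population = expanding_search(cars, {'brand': 'Ford', 'year': 2015, 'mileage': 25_000})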

Text Analysis to determine Offer Performance

I'm currently exploring different ways to judge and predict the performance of various offers and marketing campaigns. I have a list of metrics I'm currently using to predict performance, such as:
Day the offer was sent
Month
Weather
Time of Day
+more
And for my performance metric, I use
Redemption Rate (For every offer sent, how many times was it redeemed) - This is how I judge success
But one of the most important metrics is the offer itself, which I know in the form of a text-string.
Here are a few user-generated examples.
Get $4.00 off a large pizza
Receive 20% off your next order
Buy any Chocolate Milkshake, get another one half price
Two wraps for $7.50
Free cookie with any purchase
...and hundreds more
Now, I know there's very important information in those text strings, but I don't know the best way to analyze them and extract the key information. For example, the text shows the product being advertised, the discount, the dollar amount, the percentage off, etc. I need a generalized way to go through each string (I'm assuming through some tokenized method) and extract the relevant information.
I'm hoping to get some input on how I could analyze these strings, eventually with the purpose of generating a string-based dataset (along with the other aforementioned data points) that I can use for predictions.
I am writing my code using python 3.0.
Any advice is greatly appreciated. Thanks.
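One hedged starting point (certainly not the only way) is to pull the obviously structured pieces out with regular expressions and let a bag-of-words / TF-IDF representation capture the remaining wording. The sketch below assumes the offers sit in a pandas column called offer_text; that name and the example rows are made up.

import re
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

offers = pd.DataFrame({'offer_text': [
    "Get $4.00 off a large pizza",
    "Receive 20% off your next order",
    "Two wraps for $7.50",
    "Free cookie with any purchase",
]})

def dollar_amount(text):
    m = re.search(r'\$(\d+(?:\.\d{1,2})?)', text)
    return float(m.group(1)) if m else 0.0

def percent_off(text):
    m = re.search(r'(\d+)\s*%', text)
    return float(m.group(1)) if m else 0.0

offers['dollar_amount'] = offers['offer_text'].apply(dollar_amount)
offers['percent_off'] = offers['offer_text'].apply(percent_off)
offers['is_free'] = offers['offer_text'].str.contains(r'\bfree\b', case=False).astype(int)

# TF-IDF features for the remaining wording (product, "half price", and so on).
vectorizer = TfidfVectorizer(lowercase=True, ngram_range=(1, 2), min_df=1)
text_features = vectorizer.fit_transform(offers['offer_text'])
# Concatenate text_features with the numeric columns (and day, weather, ...)
# before feeding everything to the prediction model of your choice.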

Machine learning: unsupervised approach to extract patterns from text data using Python?

I would like to know how to use an unsupervised approach to extract patterns from text data.
I have a dataset of product descriptions in the form of a title, a short description, and a long description. My goal is to find the values of product attributes using the available descriptions. The values I am trying to find appear in the descriptions in many variations.
Below are a few examples of attributes a product has:
1. Recommended minimum and maximum age for the product (get the values).
2. Is the product made from recycled material or not? (yes or no)
3. Is a remote control included with the product? (yes or no)
Currently I am using regular expressions to get the values / to check whether they are present in the data or not. But it's very hard to find the values because, as I mentioned, they appear in many variations. I can't write rules for all of them; more precisely, I can't generalize these patterns. When a new variation comes along, my regex fails.
I was wondering whether there is any fairly intuitive way to automatically build these regex patterns with some sort of algorithm.
How do I use a machine learning approach to build an intelligent model that can solve my problem?
Below is one example of a product description.
Example:
UVM1067 Features Quantity per Selling Unit: 1 Set **Total Recycled Content: 30pct** Product Keywords: Kleer-Fax, Inc., Indexes, 8 Color, 10 Color Binders Sets per Pack: 1 Tab Style: 15-Tab Color: Multicolor Country of Manufacture: United States Index Divider Style: Printed Numeric Dimensions Overall Height - Top to Bottom: 11'' Overall Width - Side to Side: 8.5'' Overall Product Weight: 0.3 lbs
You can see that the description above mentions "Total Recycled Content", which means the product is made from recycled material, so I would like to predict 'Y' as my output.
I can do this by searching for the word or with a regex, but I want to build some intelligent/automatic model to achieve this.
Thanks,
Niranjan
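One option worth noting, although it is supervised rather than unsupervised: label a small seed set of descriptions by hand (or with the regexes you already have) and train a simple text classifier that can generalize to new wordings. Below is a minimal sketch with made-up example data; the descriptions and labels are assumptions for illustration.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

descriptions = [
    "Total Recycled Content: 30pct, 15-Tab, Multicolor",
    "Made from 100% post-consumer recycled paper",
    "Durable plastic binder, assorted colors",
    "Ballpoint pens, blue ink, pack of 12",
]
is_recycled = ["Y", "Y", "N", "N"]   # small hand/regex-labeled seed set

model = make_pipeline(
    TfidfVectorizer(lowercase=True, ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
model.fit(descriptions, is_recycled)

print(model.predict(["Contains 50 percent recycled material"]))  # expect ['Y']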

Data preparation for scikit learn decision tree

I'm trying to prepare a dataset for scikit-learn, planning to build a pandas DataFrame to feed to a decision tree classifier.
The data represents different companies with varying criteria, but some criteria can have multiple values, such as "Customer segment", which, for any given company, could be any or all of SMB, mid-market, enterprise, etc. There are other criteria/columns like this with multiple possible values. I need decisions made upon individual values, not the aggregate: company A for SMB, and company A for mid-market, rather than the grouping "company A for SMB AND mid-market".
Is there guidance on how to handle this? Do I need to generate rows for every variant of a given company to be fed into the learning routine? Such that an input of:
Company,Segment
A,SMB:MM:ENT
becomes:
A, SMB
A, MM
A, ENT
As well as for any other variants that may come from additional criteria/columns, for example "customer vertical", which could also include multiple values? It seems like this will greatly increase the dataset size. Is there a better way to structure this data and/or handle this scenario?
My ultimate goal is to let users complete a short survey with simple questions, and map their responses to values to get a prediction of the "right" company, for a given segment, vertical, product category, etc. But I'm struggling to build the right learning dataset to accomplish that.
Let's try.
import pandas as pd

df = pd.DataFrame({'company': ['A', 'B'], 'segment': ['SMB:MM:ENT', 'SMB:MM']})

# split the multi-valued column into one column per value
expanded_segment = df.segment.str.split(':', expand=True)
expanded_segment.columns = ['segment' + str(i) for i in range(len(expanded_segment.columns))]

# put the company back next to the split columns, then unpivot to long format
wide_df = pd.concat([df.company, expanded_segment], axis=1)
result = pd.melt(wide_df, id_vars=['company'],
                 value_vars=[c for c in wide_df.columns if c != 'company'])
result = result.dropna()
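After the dropna, result holds one row per (company, segment) pair, A/SMB, A/MM, A/ENT, B/SMB and B/MM in this example, in the company and value columns (the variable column produced by melt can simply be dropped). Since a scikit-learn decision tree needs numeric input, a likely next step, which is an assumption about your pipeline rather than part of the answer above, is to one-hot encode those categoricals, e.g. with pd.get_dummies.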

Algorithm process is slow

Think about a platform where a user chooses which factors he gives more importance to. For example, 5 factors of criteria A, B, C, D, E.
Then each product review has a weighting for A1, B1, C1, D1, E1. So if he gave more importance to A, the weighting will take that into consideration. The result is that each review can have a different overall score for each user.
My problem is the algorithm for that. Currently the processing is slow.
For each category summary, I need to iterate over all companies of that category, and over all reviews of each company.
# step 1: find companies of category X with more than 1 review published
companies_X = [1, 2, 3, 5, n]

# step 2: iterate over all companies, and over all reviews of these companies
for company in companies:
    for review in company:
        # calculate the weighting of the review for the current user criteria
        # give more importance to recent reviews

# step 3: average of all reviews for each company

# step 4: average over all companies of this category to create the final score for category X
This works, but I can't have a page that takes 30 seconds to load.
I am thinking about caching this page, but in that case I would need to process it for all users in the background. Not a good solution, definitely.
Any ideas for improvements? Any insight will be welcome.
First option: using NumPy and pandas could improve your speed if leveraged in a smart way, i.e. by avoiding loops whenever possible. This can be done with vectorized operations or the pandas apply method, along with a condition or lambda function.
The nested loop

for company in companies:
    for review in company:
        ...

can be replaced by something like review_data["note"] = note_formula(review_data["number_reviews"]).
Edit: here note_formula is a function returning the weighting of the review, as indicated in the comments of the question:
# calculate the weighting of the review for the current user criteria
# give more importance to recent reviews
Your step 4 can be performed with the pandas groupby method combined with a mean aggregation.
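As a rough illustration of both ideas, here is a minimal sketch. It assumes the reviews already live in a pandas DataFrame named reviews with hypothetical columns company_id, category, age_days and one numeric column per criterion A through E, and that the current user's importance weights are in a dict; none of these names come from the question.

import numpy as np

# `reviews` is assumed to be an existing pandas DataFrame with one row per review
# and hypothetical columns: company_id, category, age_days, A, B, C, D, E.
user_weights = {'A': 0.4, 'B': 0.2, 'C': 0.2, 'D': 0.1, 'E': 0.1}

criteria = list(user_weights)
weights = np.array([user_weights[c] for c in criteria])

# step 2, vectorized: per-review score weighted by the current user's criteria
reviews['score'] = reviews[criteria].to_numpy() @ weights
# give more importance to recent reviews (a simple exponential decay, for illustration)
reviews['score'] *= np.exp(-reviews['age_days'] / 365)

# steps 3 and 4: average per company, then average per category
company_avg = reviews.groupby(['category', 'company_id'])['score'].mean()
category_score = company_avg.groupby(level='category').mean()

Everything runs as array operations, so the per-review Python loop disappears.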
Second option: where is your data stored? If it is in a database, a good rule for boosting performance is to move the data as little as possible, so perform the request directly in the database; I think all your operations can be written in SQL, and then only the result is sent back to the Python script. If your data is stored some other way, consider using a database engine, SQLite for instance, at the beginning if you don't aim to scale fast.
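If the data does end up in SQLite, a minimal sketch of that idea could look like the following; the reviews(company_id, category, score) table is hypothetical, and score is assumed to already hold the per-user weighted value.

import sqlite3

conn = sqlite3.connect('reviews.db')
row = conn.execute(
    """
    SELECT AVG(company_avg) AS category_score
    FROM (
        SELECT company_id, AVG(score) AS company_avg
        FROM reviews
        WHERE category = ?
        GROUP BY company_id
    )
    """,
    ('X',),
).fetchone()
category_score = row[0]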
