Getting profiled user groups from the ALS (Alternating Least Squares) algorithm - Python

We are using the ALS (Alternating Least Squares) method in our Google Cloud Spark environment to recommend companies to our users. To make the recommendations we use (userId, companyId, rating) tuples, where the rating value is a combination of the user's interactions, such as clicking the company page, adding a company to the favorites list, making an order from the company, etc. (Our method is very similar to this link.)
The results are pretty good and work for our business case; however, we are missing one thing that is important to us.
We need to learn which users are grouped as having similar interests (a.k.a. neighbors). Is there any way to get grouped users from PySpark's ALS algorithm?
That way we would be able to tag the users according to that grouping.
Edit:
I've tried the code from the answer below, but the results are strange. My data is paired like this: (userId, companyId, rating).
When I run the code below, it groups users with no common companyId into the same clusterId.
For example, one of the results of the code is:
(userId: 471, clusterId: 2)
(userId: 490, clusterId: 2)
However, users 471 and 490 have nothing in common. I think there is a mistake here:
from __future__ import print_function
import sys
from pyspark import SparkContext, SparkConf, SQLContext
from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating
from pyspark.sql.types import StructType
from pyspark.sql.types import StructField
from pyspark.sql.types import IntegerType
from pyspark.mllib.clustering import KMeans, KMeansModel
conf = SparkConf().setAppName("user_clustering")
sc = SparkContext(conf=conf)
sc.setCheckpointDir('checkpoint/')
sqlContext = SQLContext(sc)
CLOUDSQL_INSTANCE_IP = sys.argv[1]
CLOUDSQL_DB_NAME = sys.argv[2]
CLOUDSQL_USER = sys.argv[3]
CLOUDSQL_PWD = sys.argv[4]
BEST_RANK = int(sys.argv[5])
BEST_ITERATION = int(sys.argv[6])
BEST_REGULATION = float(sys.argv[7])
TABLE_ITEMS = "companies"
TABLE_RATINGS = "ml_ratings"
TABLE_RECOMMENDATIONS = "ml_reco"
TABLE_USER_CLUSTERS = "ml_user_clusters"
# Read the data from the Cloud SQL
# Create dataframes
#[START read_from_sql]
jdbcUrl = 'jdbc:mysql://%s:3306/%s?user=%s&password=%s' % (CLOUDSQL_INSTANCE_IP, CLOUDSQL_DB_NAME, CLOUDSQL_USER, CLOUDSQL_PWD)
dfAccos = sqlContext.read.jdbc(url=jdbcUrl, table=TABLE_ITEMS)
dfRates = sqlContext.read.jdbc(url=jdbcUrl, table=TABLE_RATINGS)
print("Start Clustering Users")
# print("User Ratings:")
# dfRates.show(100)
#[END read_from_sql]
# Get all the ratings rows of our user
# print("Filtered User Ratings For User:",USER_ID)
# print("------------------------------")
# for x in dfUserRatings:
# print(x)
#[START split_sets]
rddTraining, rddValidating, rddTesting = dfRates.rdd.randomSplit([6,2,2])
print("RDDTraining Size:",rddTraining.count()," RDDValidating Size:",rddValidating.count()," RDDTesting Size:",rddTesting.count())
print("Rank:",BEST_RANK," Iteration:",BEST_ITERATION," Regulation:",BEST_REGULATION)
#print("RDD Training Values:",rddTraining.collect())
#[END split_sets]
print("Start predicting")
#[START predict]
# Build our model with the best found values
# Rating, Rank, Iteration, Regulation
model = ALS.train(rddTraining, BEST_RANK, BEST_ITERATION, BEST_REGULATION)
# print("-----------------")
# print("User Groups Are Created")
# print("-----------------")
user_features = model.userFeatures().map(lambda x: x[1])
related_users = model.userFeatures().map(lambda x: x[0])
number_of_clusters = 10
model_kmm = KMeans.train(user_features, number_of_clusters, initializationMode = "random", runs = 3)
user_features_with_cluster_id = model_kmm.predict(user_features)
user_features_with_related_users = related_users.zip(user_features_with_cluster_id)
clusteredUsers = user_features_with_related_users.map(lambda x: (x[0],x[1]))
orderedUsers = clusteredUsers.takeOrdered(200,key = lambda x: x[1])
print("Ordered Users:")
print("--------------")
for x in orderedUsers:
    print(x)
#[START save user groups]
userGroupSchema = StructType([StructField("primaryUser", IntegerType(), True), StructField("groupId", IntegerType(), True)])
dfUserGroups = sqlContext.createDataFrame(orderedUsers,userGroupSchema)
try:
    dfUserGroups.write.jdbc(url=jdbcUrl, table=TABLE_USER_CLUSTERS, mode='append')
except:
    print("Data is already written to DB")
print("Written to DB and Finished Job")

Once you have trained your model, you can get each user's feature vector using userFeatures().
After that, you can calculate the distance between users with a distance function, or cluster them with a model like KMeans.
So if the model is already trained:
user_features = model.userFeatures().map(lambda x: x[1]).repartition(50)
number_of_clusters = 10
model_kmm = KMeans.train(user_features, number_of_clusters, initializationMode = "random", runs = 3)
user_features_with_cluster_id = model_kmm.predict(user_features).zip(user_features)
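A note on the edit above: two users can land in the same cluster even with no companyId in common, because KMeans groups them by their ALS latent-factor vectors, not by shared items. To make the (userId, clusterId) pairing itself safe, keep each userId together with its feature vector instead of zipping two separately derived RDDs. A minimal sketch, assuming the variables above:
# Sketch only: pair each userId with its features, then predict per record,
# so the cluster id is computed for exactly that user.
user_id_features = model.userFeatures()          # RDD of (userId, featureVector)
model_kmm = KMeans.train(user_id_features.map(lambda x: x[1]),
                         number_of_clusters, initializationMode="random")
user_clusters = user_id_features.map(lambda x: (x[0], model_kmm.predict(x[1])))
# user_clusters is an RDD of (userId, clusterId) pairs, e.g. (471, 2)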


How to resolve - ValueError: cannot set using a multi-index selection indexer with a different length than the value in Python

I have some sample code that I use to analyze entities and their sentiments using Google's Natural Language API. For every record in my Pandas dataframe, I want to return a list of dictionaries where each element is an entity. However, I am running into issues when trying to make it work on the production data. Here is the sample code:
from google.cloud import language_v1 # version 2.0.0
import os
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'path/to/json'
import pandas as pd
# establish client connection
client = language_v1.LanguageServiceClient()
# helper function
def custom_analyze_entity(text_content):
    global client
    #print("Accepted Input::" + text_content)
    document = language_v1.Document(content=text_content, type_=language_v1.Document.Type.PLAIN_TEXT, language='en')
    response = client.analyze_entity_sentiment(request={'document': document})
    # a document can have many entities
    # create a list of dictionaries, every element in the list is a dictionary that represents an entity
    # the dictionary is nested
    l = []
    #print("Entity response:" + str(response.entities))
    for entity in response.entities:
        #print('=' * 20)
        temp_dict = {}
        temp_meta_dict = {}
        temp_mentions = {}
        temp_dict['name'] = entity.name
        temp_dict['type'] = language_v1.Entity.Type(entity.type_).name
        temp_dict['salience'] = str(entity.salience)
        sentiment = entity.sentiment
        temp_dict['sentiment_score'] = str(sentiment.score)
        temp_dict['sentiment_magnitude'] = str(sentiment.magnitude)
        for metadata_name, metadata_value in entity.metadata.items():
            temp_meta_dict['metadata_name'] = metadata_name
            temp_meta_dict['metadata_value'] = metadata_value
        temp_dict['metadata'] = temp_meta_dict
        for mention in entity.mentions:
            temp_mentions['mention_text'] = str(mention.text.content)
            temp_mentions['mention_type'] = str(language_v1.EntityMention.Type(mention.type_).name)
        temp_dict['mentions'] = temp_mentions
        #print(u"Appended Entity::: {}".format(temp_dict))
        l.append(temp_dict)
    return l
I have tested it on sample data and it works fine:
# works on sample data
data= ['Grapes are good. Bananas are bad.', 'the weather is not good today', 'Michelangelo Caravaggio, Italian painter, is known for many arts','look i cannot articulate how i feel today but its amazing to be back on the field with runs under my belt.']
input_df = pd.DataFrame(data=data, columns = ['freeform_text'])
for i in range(len(input_df)):
    op = custom_analyze_entity(input_df.loc[i,'freeform_text'])
    input_df.loc[i, 'entity_object'] = op
But when I try to run it over the production data using the code below, it fails with a multi-index error. I am not able to reproduce the error with the sample Pandas dataframe.
for i in range(len(input_df)):
    op = custom_analyze_entity(input_df.loc[i,'freeform_text'])
    input_df.loc[i, 'entity_object'] = op
...
Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
  File "/opt/conda/default/lib/python3.6/site-packages/pandas/core/indexing.py", line 670, in __setitem__
    iloc._setitem_with_indexer(indexer, value)
  File "/opt/conda/default/lib/python3.6/site-packages/pandas/core/indexing.py", line 1667, in _setitem_with_indexer
    "cannot set using a multi-index "
ValueError: cannot set using a multi-index selection indexer with a different length than the value
Try doing this:
input_df.loc[0, 'entity_object'] = ""
for i in range(len(input_df)):
    op = custom_analyze_entity(input_df.loc[i,'freeform_text'])
    input_df.loc[i, 'entity_object'] = op
Or, for your specific case, you don't need to use the loc function:
input_df["entity_object"] = ""
for i in range(len(input_df)):
    op = custom_analyze_entity(input_df.loc[i,'freeform_text'])
    input_df["entity_object"][i] = op

Writing data from Pub/Sub to Bigtable via Cloud Functions

I am a beginner with Cloud Bigtable and am having big issues writing data from Pub/Sub to Bigtable with Cloud Functions.
The Cloud Function gets the messages from Pub/Sub, but the issue is in the next step: writing them into Bigtable.
The message is created in a Python script and sent to Pub/Sub.
One example of a message:
b'{"eda":2.015176,"temperature":33.39,"bvp":-0.49,"x_acc":-36.0,"y_acc":-38.0,"z_acc":-128.0,"heart_rate":83.78,"iddevice":15.0,"timestamp":"2019-12-01T20:01:36.927Z"}'
To write it into Bigtable I created a table:
from google.cloud import bigtable
from google.cloud.bigtable import column_family
client = bigtable.Client(project="projectid", admin=True)
instance = client.instance("bigtableinstance")
table = instance.table("bigtable1")
print('Creating the {} table.'.format(table))
print('Creating columnfamily cf1 with Max Version GC rule...')
max_versions_rule = column_family.MaxVersionsGCRule(2)
column_family_id = 'cf1'
column_families = {column_family_id: max_versions_rule}
if not table.exists():
    table.create(column_families=column_families)
    print("Table {} is created.".format(table))
else:
    print("Table {} already exists.".format(table))
This works without problems.
Now I tried to write the message from Pub/Sub to Bigtable with the following Python code in the Cloud Function, using the main method:
import json
import base64
import os
from google.cloud import bigtable
from google.cloud.bigtable import column_family, row_filters
project_id = os.environ.get('projetid', 'UNKNOWN')
INSTANCE = 'bigtableinstance'
TABLE = 'bigtable1'
client = bigtable.Client(project=project_id, admin=True)
instance = client.instance(INSTANCE)
colFamily = "cf1"
def writeToBigTable(table, data):
    # Parameters row_key (bytes) – The key for the row being created.
    # Returns A row owned by this table.
    row_key = data[colFamily]['iddevice'].value.encode()
    row = table.row(row_key)
    for colFamily in data.keys():
        for key in data[colFamily].keys():
            row.set_cell(colFamily,
                         key,
                         data[colFamily][key])
    table.mutate_rows([row])
    return data

def selectTable():
    stage = os.environ.get('stage', 'dev')
    table_id = TABLE + stage
    table = instance.table(table_id)
    return table

def main(event, context):
    data = base64.b64decode(event['data']).decode('utf-8')
    print("DATA: {}".format(data))
    eda, temperature, bvp, x_acc, y_acc, z_acc, heart_rate, iddevice, timestamp = data.split(',')
    table = selectTable()
    data = {'eda': eda,
            'temperature': temperature,
            'bvp': bvp,
            'x_acc': x_acc,
            'y_acc': y_acc,
            'z_acc': z_acc,
            'heart_rate': heart_rate,
            'iddevice': iddevice,
            'timestamp': timestamp}
    writeToBigTable(table, data)
    print("Data Written: {}".format(data))
I tried different versions but cannot find a solution.
Thanks for the help.
All the best
Dominik
I think this line is wrong:
row_key = data[colFamily]['iddevice'].value.encode()
You're passing in the data object, but it doesn't have a 'cf1' property. You also don't have to encode it. Give this a try:
row_key = data['iddevice']
Your for loop will also have the same issue. I think this is what you want instead:
for key in data.keys():
    row.set_cell(colFamily, key, data[key])
Also, I know you're just playing with it, but using a device id as the only value for a row key will end up poorly. It is recommended to combine the device id with the date or one of your other properties (depending on your queries) and use that as your row key. There is a document on Cloud Bigtable schema design that is helpful, and a codelab that uses a more realistic sample dataset and walks through how to pick a schema for that example. It's in Java, but you can still import the data and run your own queries.
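Pulling these suggestions together, here is a minimal sketch of the write path (not the exact code from the answer above). It assumes the Pub/Sub payload is the JSON message shown in the question, so json.loads replaces the manual comma split, and it combines the device id with the timestamp for the row key; write_row is an illustrative helper name.
import base64
import json

def write_row(table, data, col_family="cf1"):
    # Combine the device id with the timestamp, as suggested above, so every
    # reading gets its own row instead of overwriting the same device row.
    row_key = "{}#{}".format(data["iddevice"], data["timestamp"]).encode("utf-8")
    row = table.row(row_key)
    for key, value in data.items():
        # Bigtable cells store bytes, so stringify and encode each value.
        row.set_cell(col_family, key, str(value).encode("utf-8"))
    row.commit()

def main(event, context):
    # The payload shown in the question is JSON, so json.loads is assumed
    # here instead of splitting the string by hand.
    data = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    write_row(selectTable(), data)  # selectTable() as defined in the question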
First, thanks a lot for the help.
I tried to fix it with your code recommendation, but unfortunately it doesn't work now due to other errors.
AttributeError: 'DirectRow' object has no attribute 'append'
I guess this is within the following lines of code:
row.set_cell(colFamily,
             key,
             data[key])
I could imagine that the error's origin is in the split of the string "data":
eda, temperature, bvp, x_acc, y_acc, z_acc, heart_rate, iddevice, timestamp = data.split(',')
E.g. eda would look like this:
"'eda':2.015176"
which looks pretty wrong to me.
Especially when I insert it into the following dict:
data = {'eda': eda,....}
The error
AttributeError: 'DirectRow' object has no attribute 'append'
seems to say that there is a problem with the data I want to process with set_cell. The documentation mentions rows as a list or any other iterable of DirectRow instances. Shouldn't a dict fit there?
I tried a workaround with a list, but this seems to make it even worse.
client = bigtable.Client(project=project_id, admin=True)
instance = client.instance(INSTANCE)
colFamily = "cf1"
def writeToBigTable(table, dat):
    row_key = "{}-{}".format(dat[16], dat[17])
    row = table.row(row_key)
    for n in range(len(dat)):
        row.set_cell(colFamily,
                     dat[n],
                     dat[n+9])
    table.mutate_rows([row])
    return dat

def selectTable():
    stage = os.environ.get('stage', 'dev')
    table_id = TABLE + stage
    table = instance.table(table_id)
    return table

def main(event, context):
    data = base64.b64decode(event['data']).decode('utf-8')
    print("DATA: {}".format(data))
    var_1, eda, var_2, temperature, var_3, bvp, var_4, x_acc, var_5, y_acc, var_6, z_acc, var_7, heart_rate, var_8, iddevice, var_9, timestamp = data.replace(':', ',').split(',')
    table = selectTable()
    dat = [var_1, var_2, var_3, var_4, var_5, var_6, var_7, var_8, var_9, eda, temperature, bvp, x_acc, y_acc, z_acc, heart_rate, iddevice, timestamp]
    # data = {'eda': eda,
    #         'temperature': temperature,
    #         'bvp': bvp,
    #         'x_acc': x_acc,
    #         'y_acc': y_acc,
    #         'z_acc': z_acc,
    #         'heart_rate': heart_rate,
    #         'iddevice': iddevice,
    #         'timestamp': timestamp}
    writeToBigTable(table, dat)
    print("Data Written: {}".format(data))
I am really stuck on this problem and have no further ideas about how to solve it.

Spark Streaming: How to get a list like collect()

I am a beginner with Spark Streaming.
I want to load HBase records in a Spark Streaming app.
So I wrote the code below in Python.
My "load_records" function gets HBase records and returns them.
Spark Streaming cannot use collect(). sc.newAPIHadoopRDD() needs to be used in the driver program, but Spark Streaming does not have a method that brings objects from the workers back to the driver.
How can I get HBase records in Spark Streaming? Or how can I call sc.newAPIHadoopRDD()?
import json

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

def load_records(sc, table, keys):
    host = 'localhost'
    keyConv = "org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter"
    valueConv = "org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter"
    rdd_list = []
    for key in keys:
        if table == "user":
            conf = {"hbase.zookeeper.quorum": host, "hbase.mapreduce.inputtable": "user",
                    "hbase.mapreduce.scan.columns": "u:uid",
                    "hbase.mapreduce.scan.row.start": key, "hbase.mapreduce.scan.row.stop": key + "\x00"}
            rdd = sc.newAPIHadoopRDD("org.apache.hadoop.hbase.mapreduce.TableInputFormat",
                                     "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
                                     "org.apache.hadoop.hbase.client.Result",
                                     keyConverter=keyConv, valueConverter=valueConv, conf=conf)
            rdd_list.append(rdd)
    first_rdd = rdd_list.pop(0)
    for rdd in rdd_list:
        first_rdd = first_rdd.union(rdd)
    return first_rdd

sc = SparkContext(appName="UserStreaming")
ssc = StreamingContext(sc, 3)
topics = ["json"]
broker_list = "localhost:9092"
inputs = KafkaUtils.createDirectStream(ssc, topics, {"metadata.broker.list": broker_list})
jsons = inputs.map(lambda input: json.loads(input[1]))
user_id_rdd = jsons.map(lambda json: json["user_id"])
# the line below does not work. Is there another method?
user_id_list = user_id_rdd.collect()
user_record_rdd = load_records(sc, 'user', user_id_list)
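For reference, a sketch (not from the original post) of the usual workaround: process each micro-batch with foreachRDD, where the batch's RDD is available on the driver and collect() and sc.newAPIHadoopRDD() can be called.
def process_batch(time, rdd):
    # Runs on the driver once per micro-batch.
    user_id_list = rdd.collect()
    if user_id_list:
        user_record_rdd = load_records(rdd.context, 'user', user_id_list)
        print(time, user_record_rdd.count())

user_id_rdd.foreachRDD(process_batch)
ssc.start()
ssc.awaitTermination()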

Python loop inserting last row only in Cassandra

I typed a small demo loop to insert random values into Cassandra, but only the last record is persisted to the database. I am using the DataStax cassandra-driver and its object modeling library. The Cassandra version is 3.7 and Python is 3.4. Any idea what I am doing wrong?
#!/usr/bin/env python
import datetime
import uuid
from random import randint, uniform
from cassandra.cluster import Cluster
from cassandra.cqlengine import connection, columns
from cassandra.cqlengine.management import sync_table
from cassandra.cqlengine.models import Model
from cassandra.cqlengine.query import BatchQuery
class TestTable(Model):
    _table_name = 'test_table'
    key = columns.UUID(primary_key=True, default=uuid.uuid4())
    type = columns.Integer(index=True)
    value = columns.Float(required=False)
    created_time = columns.DateTime(default=datetime.datetime.now())

def main():
    connection.setup(['127.0.0.1'], 'test', protocol_version=3)
    sync_table(TestTable)
    for _ in range(10):
        type = randint(1, 3)
        value = uniform(-10, 10)
        row = TestTable.create(type=type, value=value)
        print("Inserted row: ", row.type, row.value)
    print("Done inserting")
    q = TestTable.objects.count()
    print("We have inserted " + str(q) + " rows.")

if __name__ == "__main__":
    main()
Many thanks!
The problem is in the definition of the key column:
key = columns.UUID(primary_key=True, default=uuid.uuid4())
For the default value it's going to call the uuid.uuid4 function once and use that result as the default for all future inserts. Because that's your primary key, all 10 writes will happen to the same primary key.
Instead, drop the parentheses so you are just passing a reference to uuid.uuid4 rather than calling it:
key = columns.UUID(primary_key=True, default=uuid.uuid4)
Now each time you create a row you'll get a new unique UUID value, and therefore a new row in Cassandra.
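For completeness, a sketch of the model with callable defaults. Note (an observation beyond the answer above) that created_time has the same pattern, so datetime.datetime.now is also passed without parentheses:
class TestTable(Model):
    _table_name = 'test_table'
    # Pass the callables themselves; cqlengine invokes them for each new row.
    key = columns.UUID(primary_key=True, default=uuid.uuid4)
    type = columns.Integer(index=True)
    value = columns.Float(required=False)
    created_time = columns.DateTime(default=datetime.datetime.now)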
You need to use the save method:
...
row = TestTable(type=type, value=value)
row.save()
...
http://cqlengine.readthedocs.io/en/latest/topics/models.html#cqlengine.models.Model.save

Unable to create a dataframe from json dstream using pyspark

I am attempting to create a dataframe from JSON in a DStream, but the code below does not seem to get the dataframe right:
import sys
import json
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.sql import SQLContext
def getSqlContextInstance(sparkContext):
    if ('sqlContextSingletonInstance' not in globals()):
        globals()['sqlContextSingletonInstance'] = SQLContext(sparkContext)
    return globals()['sqlContextSingletonInstance']

if __name__ == "__main__":
    if len(sys.argv) != 3:
        raise IOError("Invalid usage; the correct format is:\nquadrant_count.py <hostname> <port>")
    # Initialize a SparkContext with a name
    spc = SparkContext(appName="jsonread")
    sqlContext = SQLContext(spc)
    # Create a StreamingContext with a batch interval of 2 seconds
    stc = StreamingContext(spc, 2)
    # Checkpointing feature
    stc.checkpoint("checkpoint")
    # Creating a DStream to connect to hostname:port (like localhost:9999)
    lines = stc.socketTextStream(sys.argv[1], int(sys.argv[2]))
    lines.pprint()
    parsed = lines.map(lambda x: json.loads(x))

    def process(time, rdd):
        print("========= %s =========" % str(time))
        try:
            # Get the singleton instance of SQLContext
            sqlContext = getSqlContextInstance(rdd.context)
            # Convert RDD[String] to RDD[Row] to DataFrame
            rowRdd = rdd.map(lambda w: Row(word=w))
            wordsDataFrame = sqlContext.createDataFrame(rowRdd)
            # Register as table
            wordsDataFrame.registerTempTable("mytable")
            testDataFrame = sqlContext.sql("select summary from mytable")
            print(testDataFrame.show())
            print(testDataFrame.printSchema())
        except:
            pass

    parsed.foreachRDD(process)
    stc.start()
    # Wait for the computation to terminate
    stc.awaitTermination()
There are no errors, and when the script runs it does read the JSON from the streaming context successfully; however, it does not print the values in summary or the dataframe schema.
Example JSON I am attempting to read:
{"reviewerID": "A2IBPI20UZIR0U", "asin": "1384719342", "reviewerName":
"cassandra tu \"Yeah, well, that's just like, u...", "helpful": [0,
0], "reviewText": "Not much to write about here, but it does exactly
what it's supposed to. filters out the pop sounds. now my recordings
are much more crisp. it is one of the lowest prices pop filters on
amazon so might as well buy it, they honestly work the same despite
their pricing,", "overall": 5.0, "summary": "good", "unixReviewTime":
1393545600, "reviewTime": "02 28, 2014"}
I am an absolute newcomer to Spark Streaming and started working on pet projects by reading the documentation. Any help and guidance is greatly appreciated.
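One likely issue (an observation, not from the original post): process() wraps the whole JSON string in Row(word=w), so the resulting table only has a word column, select summary fails, and the bare except silently swallows the error. A sketch of building the rows from the parsed dicts instead:
from pyspark.sql import Row

def process(time, rdd):
    if rdd.isEmpty():
        return
    sqlContext = getSqlContextInstance(rdd.context)
    # 'parsed' already holds dicts from json.loads; expanding each dict into
    # a Row makes 'summary' a real column.
    df = sqlContext.createDataFrame(rdd.map(lambda d: Row(**d)))
    df.select("summary").show()

parsed.foreachRDD(process)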
