Iterating user-defined class objects inside a pyspark RDD


I am reading data from a CSV file and converting each row into a Python class object. But when I try to iterate over the RDD of user-defined class objects, I get errors like:
_pickle.PicklingError: Can't pickle <class '__main__.User'>: attribute lookup User on __main__ failed
Here is the relevant part of the code:
from pyspark import SparkConf, SparkContext
class User:
    def __init__(self, line):
        self.user_id = line[0]
        self.location = line[1]
        self.age = line[2]
def create_user(line):
    user = User(line)
    return user
def print_user(line):
    user = line
    print(user.user_id)
conf = (SparkConf().setMaster("local").setAppName("exercise_set_2").set("spark.executor.memory", "1g"))
sc = SparkContext(conf = conf)
users = sc.textFile("BX-Users.csv").map(lambda line: line.split(";"))
users_objs = users.map(lambda entry: create_user(entry))
users_objs.map(lambda entry: print_user(entry))
For the above code, I get results like,
PythonRDD[93] at RDD at PythonRDD.scala:43
CSV data source URL(Needs a zip extraction): HERE
UPDATE:
Changing the code to include collect() results in the error again. I still have to try pickle; I have never used it before, so if anyone has a sample, I can adapt it easily.
users_objs = users.map(lambda entry: create_user(entry)).collect()

When you use
def create_user(line):
    user = User(line)
    return user
directly in a map call, the User class has to be accessible to your worker nodes. In practice this means its instances need to be serializable (picklable), and pickle has to be able to look the class up by module and name. How would a node use that class, or know what it is (unless you have a common NFS mount or something)? That's why you got the pickle error. To make your User class picklable, please read this: https://docs.python.org/2/library/pickle.html.
Additionally, you aren't performing a collect() on your RDD, which is why you see PythonRDD[93] at RDD at PythonRDD.scala:43. It's still just an RDD; your data is out on the nodes.
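The point about collect() can be illustrated without Spark. Python's built-in map is also lazy, so it serves as a rough analogy here (an analogy only, not how Spark is implemented): nothing runs until the result is consumed. The names record_user_id, seen and lazy_ids are made up for this sketch.

```python
seen = []  # records which entries the mapped function has actually processed


def record_user_id(entry):
    seen.append(entry[0])
    return entry[0]


rows = [["1", "nyc", "25"], ["2", "stockton", "18"]]

# Like rdd.map(...), this only describes the work; nothing has run yet.
lazy_ids = map(record_user_id, rows)
processed_before = list(seen)

# Consuming the result plays the role of collect(): now the function runs.
user_ids = list(lazy_ids)
processed_after = list(seen)
```

Until something consumes the mapped result, all you hold is a description of the computation, which is what the PythonRDD[93] output is telling you.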

Okay, found an explanation. Storing the class in a separate file makes it picklable, because pickle can then look it up by module name. So I stored the User class inside user.py
and added the following import to my code.
from user import User
Contents of user.py:
class User:
    def __init__(self, line):
        self.user_id = line[0]
        self.location = line[1]
        self.age = line[2]
As mentioned in the earlier answer, I can use collect() (an RDD method) on the created User objects. So the following code prints all user ids, as I wanted.
for user_obj in users.map(lambda entry: create_user(entry)).collect():
    print_user(user_obj)
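Why moving the class into its own file helps can be checked with the standard library alone. The sketch below is an illustration rather than the Spark setup from the question: it writes a throwaway module (user_demo.py, a made-up name) to a temporary directory, imports it, and shows that the pickle payload refers to the class by module and name instead of carrying the class body.

```python
import importlib
import pickle
import sys
import tempfile
from pathlib import Path

MODULE_SOURCE = '''
class User:
    def __init__(self, line):
        self.user_id = line[0]
        self.location = line[1]
        self.age = line[2]
'''

# Write the class into its own module file (a stand-in for user.py).
module_dir = tempfile.mkdtemp()
Path(module_dir, "user_demo.py").write_text(MODULE_SOURCE)
sys.path.insert(0, module_dir)
importlib.invalidate_caches()
user_demo = importlib.import_module("user_demo")

original = user_demo.User(["1", "nyc, new york, usa", "NULL"])
payload = pickle.dumps(original)

# The payload names the module and the class; whoever unpickles it
# must be able to import that module.
names_module = b"user_demo" in payload
names_class = b"User" in payload

restored = pickle.loads(payload)
```

In local mode the driver and the workers share the same files, so the import succeeds everywhere. On a real cluster the module presumably also has to be made importable on the workers, for example by shipping it with SparkContext.addPyFile.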

Related

Store data without User interaction

I want to store network devices data through a Django model Device to my database.
Workflow
Host configuration needs to be setup by User inside a View (Host Model)
When the Host configuration is finished the Network should be scanned for devices (Device Model)
The data should be stored inside the DB
Problem:
The function create_devices() should only be called once the Host is configured, but the check if Host.objects.values(): isn't working.
How can I call create_devices() only if a Host instance exists?
Is it correct to use a view to store dynamic and static data in the DB without user interaction?
Models:
from django.db import models
class Host(models.Model):
    hostname = models.CharField(default="noads", max_length=6)
    ipv4_address = models.GenericIPAddressField('IPv4')
    ipv4_subnet = models.GenericIPAddressField('IPv4')
    gateway = models.GenericIPAddressField('IPv4')
class Device(models.Model):
    hostname = models.CharField(max_length=64)
    mac_address = models.CharField(max_length=64)
    ipv4_address = models.GenericIPAddressField('IPv4')
My View:
from webapp.models import Host, Device
from django.views import View
from django.views.generic.detail import DetailView
import multiprocessing.dummy
import multiprocessing
def create_devices():
    """
    Creates DB entry of devices if they don't already exist
    :return: List of multiple devices stored in objects
    :rtype: list ["Device", "Device", ...]
    """
    available_devices = get_available_devices_in_list()
    arp_table_of_all_hosts = get_arp_table_linux()
    dev_list = []
    for deviceip in available_devices:
        # If device already exists in DB continue
        if arp_table_of_all_hosts.get(deviceip):
            if arp_table_of_all_hosts[deviceip] in Device.objects.filter(mac_address = arp_table_of_all_hosts[deviceip]):
                continue
            else:
                devmac = arp_table_of_all_hosts[deviceip]
                devname = "unknown" #socket.gethostbyaddr(deviceip)
                dev = Device(hostname=devname, mac_address=devmac, ipv4_address=deviceip)
                dev.save()
                dev_list.append(dev)
    return dev_list
class DeviceGetAll(DetailView):
    if Host.objects.values():
        create_devices()
    model = Device
    pass

Your view DeviceGetAll is not written correctly. The if condition needs to be placed in one of the view's methods, not in the class definition, because code in the class body runs once when the module is imported rather than on each request.
I don't understand exactly what you are trying to do, so I can't say which method the code belongs in, but you can look at the base code of DetailView and see whether any of its methods are useful to you.
I'm not even sure the detail view is the best place for code that creates instances of the same model; maybe it could go right after the creation of the host instead.
But if you want to use the DetailView, then you could, for example, override the get_object() method from SingleObjectMixin (one of the parents of DetailView). Your code could then look like this:
class DeviceGetAll(DetailView):
    model = Device
    def get_object(self, *args, **kwargs):
        if Host.objects.exists():
            create_devices()
        obj = super().get_object(*args, **kwargs)
        return obj
Also, you probably mean to use the .exists() method in the if condition instead of .values().
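The difference between putting code in the class definition and putting it in a method can be seen in plain Python, without Django. This is only a sketch: DeviceView is a made-up stand-in for the view class, and the events list exists purely to record when each piece of code runs.

```python
events = []  # records when each piece of code runs


class DeviceView:  # plain stand-in for a Django view class
    # Class-body code runs once, while the class statement itself executes.
    events.append("class body")

    def get_object(self):
        # Method-body code runs every time the method is called.
        events.append("get_object")
        return "device"


events_after_definition = list(events)

view = DeviceView()
view.get_object()
view.get_object()
events_after_calls = list(events)
```

The class-body line has already run before any instance exists, and it never runs again, which is why a check placed there cannot react to a Host being created later.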

How to move graphene resolve methods to different files?

I have the following code. Query is my root schema.
If I have only one profile, it's fine to have the resolve method inside Query. But what if the schema gets too big?
Is there any way to move resolve_profile inside the Profile object type?
import graphene
class Profile(graphene.ObjectType):
    firstName = graphene.String()
    lastName = graphene.String()
class Query(graphene.ObjectType):
    profile = graphene.Field(Profile)
    def resolve_profile(self, info):
        return ...

No, you can't move resolve_profile into Profile, but there is another technique for handling a large schema. You can split your query into multiple files and have Query inherit from the class defined in each of them. In this example, I've broken Query into AQuery, BQuery and CQuery:
class Query(AQuery, BQuery, CQuery, graphene.ObjectType):
    pass
And then you could define AQuery in a different file like this:
class AQuery(graphene.ObjectType):
    profile = graphene.Field(Profile)
    def resolve_profile(self, info):
        return ...
and put other code in BQuery and CQuery.
You can also use the same technique to split up your mutations.
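The inheritance mechanics behind this can be seen without graphene. In the sketch below, plain Python classes stand in for the graphene types, and ProfileQueries, PostQueries and their resolver bodies are invented for illustration; in a real schema each base class would live in its own file and subclass graphene.ObjectType.

```python
class ProfileQueries:  # would live in its own file, e.g. next to the Profile type
    def resolve_profile(self, info):
        return {"firstName": "Ada", "lastName": "Lovelace"}


class PostQueries:  # would live in a second file
    def resolve_posts(self, info):
        return ["first post", "second post"]


class Query(ProfileQueries, PostQueries):
    # The root class has no body of its own; it only combines the pieces.
    pass


root = Query()
resolver_names = sorted(name for name in dir(root) if name.startswith("resolve_"))
profile = root.resolve_profile(info=None)
```

The root class ends up with every resolver from every base class, so each file only needs to know about its own part of the schema.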

How to resolve name error on variable in class

I am making my first model, and I'm creating an upload system which uploads to a folder with the name of the user uploading it.
For some reason, I get this error when I try to create an object from the model:
NameError at /admin/tracks/track/add/
name '_Track__user_name' is not defined
Here's my models.py
from django.core.exceptions import ValidationError
from django.db import models
from django.core.files.images import get_image_dimensions
# Create your models here.
class Track(models.Model):
    user_name = "no_user"
    def get_username():
        user_name = "no_user"
        if request.user.is_authenticated():
            user_name = request.user.username
        else:
            user_name = "DELETE"
    def generate_user_folder_tracks(instance, filename):
        return "uploads/users/%s/tracks/%s" % (user_name, filename)
    def is_mp3(value):
        if not value.name.endswith('.mp3'):
            raise ValidationError(u'You may only upload mp3 files for tracks!')
    def generate_user_folder_art(instance, filename):
        return "uploads/users/%s/art/%s" % (user_name, filename)
    def is_square_png(self):
        if not self.name.endswith('.png'):
            raise ValidationError("You may only upload png files for album art!")
        else:
            w, h = get_image_dimensions(self)
            if not h == w:
                raise ValidationError("This picture is not square! Your picture must be equally wide as its height.")
            else:
                if not (h + w) >= 1000:
                    raise ValidationError("This picture is too small! The minimum dimensions are 500 by 500 pixels.")
        return self
    # Variables
    track_type_choices = [
        ('ORG', 'Original'),
        ('RMX', 'Remix'),
        ('CLB', 'Collab'),
        ('LIV', 'Live'),
    ]
    # Model Fields
    name = models.CharField(max_length=100)
    desc = models.TextField(max_length=7500)
    track_type = models.CharField(max_length=3,
                                  choices=track_type_choices,
                                  default='ORG')
    track_type_content = models.CharField(max_length=100,blank=True)
    created = models.TimeField(auto_now=True,auto_now_add=False)
    upload = models.FileField(upload_to=generate_user_folder_tracks,validators=[is_mp3])
    albumart = models.ImageField(upload_to=generate_user_folder_art,validators=[is_square_png])
As you can see from the first line after the class is defined, there is clearly a variable called "user_name", and when using my upload functions, it is supposed to use this variable for the folder name.
I am very confused as to why this is throwing an error. What am I doing wrong?

You have some serious problems with variable scope here. Just defining an attribute called "user_name" at the top of the class does not automatically give you access to it elsewhere in the class; function bodies do not see names from the class body, so you would need to access it via the class or an instance. Usually you do that through the self variable that is the first parameter to every method.
However, many of your methods do not even accept a self parameter, so they would raise TypeError when called on an instance. On top of that, your user_name attribute is actually a class attribute, which would be shared by all instances of Track - this would clearly be a bad thing. You should really make it a Django field, like the other attributes.
Finally, your scope issues worsen when you try to access request in one of those methods. Again, you can't access a variable unless it has been passed to that method (or is available in global scope, which the request definitely is not). So get_username cannot work at all.
I must say though that all of that is beside the point, as the error you get does not even match your code: you must have referenced __user_name somewhere inside the Track class, since Python mangles such double-underscore names to _Track__user_name.
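Both effects can be reproduced in a few lines of plain Python. The class below is a stripped-down stand-in for the model (no Django involved), and the three method names are invented for this sketch.

```python
class Track:
    user_name = "no_user"  # class attribute, shared by every instance

    def folder_via_self(self):
        # Reaching the attribute through the instance works.
        return "uploads/users/%s" % self.user_name

    def folder_via_bare_name(self):
        # A bare name is never looked up in the class body,
        # so this raises NameError when called.
        return "uploads/users/%s" % user_name

    def folder_via_private_name(self):
        # Inside a class, __user_name is rewritten to _Track__user_name.
        return "uploads/users/%s" % __user_name


track = Track()
ok_path = track.folder_via_self()

errors = {}
for method in (track.folder_via_bare_name, track.folder_via_private_name):
    try:
        method()
    except NameError as exc:
        errors[method.__name__] = str(exc)
```

The message recorded for folder_via_private_name mentions _Track__user_name, the same mangled name that appears in the traceback from the question.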

You do have a variable user_name, but it's not a field, which means the queryset you appear to be building won't find it.
user_name = "no_user"
should be one of the following
user_name = models.CharField(max_length=100, default='no_user')
user = models.ForeignKey(settings.AUTH_USER_MODEL, null=True, on_delete=models.SET_NULL)
The only reason I've suggested a CharField here is in case you don't use some form of authorisation user model in your app. If you do, then you should use a foreign key to that model.

Use objects from a 3rd party library as models in Django Rest Framework development

I tried to describe it as best I could in the title. Basically, I want to write an API using the Django REST Framework, but instead of using the Django database and predefined models, I want my API to take an HTTP call from the user, use it to call another library's functions, take the objects the third-party library returns, build models based on what it gets back, serialize them to JSON, and return that to the caller.
Right now I'm using an extremely simple class and function to test this concept. It has an object definition and a function that reads from a text file and converts it into an object list:
class myObj:
    id = None
    port = None
    cust = None
    product = None
    def __init__(self, textLine):
        props = [x.strip() for x in textLine.split(',')]
        self.id = props[0]
        self.port = props[1]
        self.cust = props[2]
        self.product = props[3]
def getObjList():
    lines = [line.strip() for line in open("objFile.txt")]
    objList = [myObj(x) for x in lines]
    return objList
I want my Django REST project to call that getObjList function when I access the associated URL in a browser (or call it via curl or something), build a model based on the object it gets back, create a list of that model, serialize it, and give it back to me so I can view it in the browsable web interface. Is this possible or am I being an idiot?
Thanks. I've been a C# developer for a while, but working in Python and with this HTTP stuff is a bit overwhelming.

If anyone cares, I figured it out. I had to skip models entirely and build the serializer directly from the object I got back, and I had to revert to using more basic Django views.
Here is the view:
from rest_framework.decorators import api_view
from rest_framework.response import Response
@api_view(['GET'])
def ObjView(request):
    if request.method == 'GET':
        objList = myObj.getObjList()
        dynamic_serializer = SerializerFactory.first_level(objList)
        return Response(dynamic_serializer.data)
The getObjList function is the one posted in my question, but this should work with any function and any object that gets returned. Here is what goes on in the serializer factory:
from rest_framework import serializers
def first_level(cur_obj):
    isList = False
    ser_val = cur_obj
    if type(cur_obj) in {list, tuple}:
        isList = True
        ser_val = cur_obj[0]
    dynamic_serializer = create_serializer(ser_val)
    return dynamic_serializer(cur_obj, many=isList)
def create_serializer(cur_obj):
    if type(cur_obj) in {list, tuple}:
        if hasattr(cur_obj[0], "__dict__"):
            cur_ser = create_serializer(cur_obj[0])
            return cur_ser(many=True)
        else:
            return serializers.ListField(child=create_serializer(cur_obj[0]))
    elif type(cur_obj) == dict:
        # list(...) keeps this working on Python 3, where values() returns a view
        first_value = list(cur_obj.values())[0]
        if hasattr(first_value, "__dict__"):
            child_ser = create_serializer(first_value)
            return serializers.DictField(child=child_ser())
        else:
            return serializers.DictField(child=create_serializer(first_value))
    elif hasattr(cur_obj, "__dict__"):
        attrs = {}
        for key, val in cur_obj.__dict__.items():
            if "__" not in key:
                cur_field = create_serializer(val)
                if hasattr(val, "__dict__"):
                    attrs.update({key: cur_field()})
                else:
                    attrs.update({key: cur_field})
        # the class name lives on the type, not on the instance
        return type(type(cur_obj).__name__ + "Serializer", (serializers.Serializer,), attrs)
    else:
        return serializers.CharField(required=False, allow_blank=True, max_length=200)
As you can see, I had to break it into two pieces (I'm sure there's a way to make it one, but it wasn't worth spending time on). It involves a substantial amount of recursion and some experimenting with the different field types. This should be good for serializing any combination of objects, lists, dictionaries and simple data types. I've tested it on a pretty wide array of objects and it seems pretty solid.
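The recursive dispatch at the heart of the factory can be tried out without Django REST Framework. The following is a simplified sketch, not the factory itself: instead of building serializer classes it converts values straight to plain data, and to_plain, Customer and Port are hypothetical names.

```python
import json


def to_plain(value):
    """Recursively turn objects, lists, tuples and dicts into JSON-ready data."""
    if isinstance(value, (list, tuple)):
        return [to_plain(item) for item in value]
    if isinstance(value, dict):
        return {key: to_plain(item) for key, item in value.items()}
    if hasattr(value, "__dict__"):
        return {
            key: to_plain(item)
            for key, item in vars(value).items()
            if "__" not in key  # skip dunder-style attributes
        }
    return value  # simple data types are passed through unchanged


class Port:  # hypothetical stand-in for a third-party object
    def __init__(self, number):
        self.number = number


class Customer:  # hypothetical stand-in for a third-party object
    def __init__(self, name, ports):
        self.name = name
        self.ports = ports


payload = json.dumps(to_plain([Customer("acme", [Port(80), Port(443)])]))
```

Each branch handles one shape (sequence, mapping, object, scalar) and recurses into its children, which is the same structure create_serializer uses when it picks a field type.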

How to get a field from self.data in modelformset_factory in clean method

I am using modelformset_factory to edit multiple images in my interface.
Each image has the following fields:
Name
User
City
I allow the user to enter a new user that is not yet in the system (in that case I should get the text "Jack" in my
def clean_user(self) instead of an ID).
But with modelformset_factory I am getting some weird names in self.data, and when I try self.data.get('user') I get nothing. Obviously there is no key with this name;
the keys look like form-0-user etc.
fields = ['city', 'name']
Note: I do not have user in my fields. If I do, validation fails.
def clean(self):
    data = self.cleaned_data
    data['name'] = data.get('name', '').strip()
    return data
Works fine
pic_credits = self.data.get('user')
This does not.
pic_credits = self.data.get('form-0-name')
This works fine too.
Please help.

If you want to use self.data instead of self.cleaned_data, you can build the full key from the form's prefix, which the formset sets (for example to form-0) when it instantiates each form.
See _construct_form() https://docs.djangoproject.com/es/1.9/_modules/django/forms/formsets/
Your method will look like this:
def clean(self):
    # ...
    form_prefix = "%s-" % self.prefix
    pic_credits = self.data.get(form_prefix + 'name')
    # ...
Update:
It is a lot simpler to call the method self.add_prefix:
pic_credits = self.data.get(self.add_prefix('name'))
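How the prefixed keys line up with the submitted data can be sketched without Django. The add_prefix function below is a standalone imitation of what Form.add_prefix returns (prefix, a hyphen, then the field name), and the data dict is a made-up stand-in for self.data from a two-form formset.

```python
def add_prefix(prefix, field_name):
    # Imitates Form.add_prefix: "<prefix>-<field_name>", or the bare name without a prefix.
    return "%s-%s" % (prefix, field_name) if prefix else field_name


# Stand-in for self.data as submitted by a formset with two forms.
data = {
    "form-0-name": "Sunset",
    "form-0-user": "Jack",
    "form-1-name": "Harbour",
    "form-1-user": "7",
}

bare_lookup = data.get("user")  # no such key in the raw data
first_user = data.get(add_prefix("form-0", "user"))
second_name = data.get(add_prefix("form-1", "name"))
```

Because each form carries its own prefix, the same clean method finds the right keys for every form in the formset without hard-coding form-0.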
