I have a Python class with methods like below:
class Features():
    def __init__(self, json):
        self.json = json

    def get_email(self):
        email = self.json.get('fields', {}).get('email', None)
        return email
And I am trying to use the get_email function on a pyspark dataframe to create a new column based on another column, "raw_json", which contains a JSON string:
df = data.withColumn('email', (F.udf(lambda j: Features.get_email(json.loads(j)), t.StringType()))('raw_json'))
So the ideal pyspark dataframe looks like below:
+----------------+----------+
|raw_json        |email     |
+----------------+----------+
|                |          |
|                |          |
+----------------+----------+
But I am getting an error saying:
TypeError: unbound method get_email() must be called with Features instance as first argument (got dict instance instead)
What should I do to achieve this?
I have seen a similar question asked before but it was not resolved.
I guess you have misunderstood how classes are used in Python. You're probably looking for this instead:
udf = F.udf(lambda j: Features(json.loads(j)).get_email())
df = data.withColumn('email', udf('raw_json'))
where you instantiate a Features object and call the get_email method of the object.
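Alternatively, if get_email never needs per-instance state, you could make it a static method so that your original Features.get_email(...) call style works; a minimal sketch:
class Features():
    @staticmethod
    def get_email(json_dict):
        # operate directly on the parsed dict instead of instance state
        return json_dict.get('fields', {}).get('email', None)

udf_email = F.udf(lambda j: Features.get_email(json.loads(j)), t.StringType())
df = data.withColumn('email', udf_email('raw_json'))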
I am trying to write a class that takes a dictionary whose keys are dataframe IDs (strings) and whose values are DataFrames, and creates class attributes for accessing the data.
I was able to write a small example of a similar class where the accessor methods have to be written out one by one, but I would like to loop over the data instead, taking the keys of the dfs and allowing access to each df through an attribute.
Minimal working example:
from dataclasses import dataclass
import pandas as pd

# re-writing as dataclass
@dataclass
class Dataset:
    # data container dictionary as class attribute
    dict = {'df1_id': pd.DataFrame({'col1': [1, 1]}),
            'df2_id': pd.DataFrame({'col2': [2, 2]}),
            'df3_id': pd.DataFrame({'col3': [3, 3]})}

    def df1_id(self) -> pd.DataFrame:  # method exposing one entry of the dict
        return self.dict['df1_id']

    def df2_id(self) -> pd.DataFrame:  # same as the method above
        return self.dict['df2_id']

    def df3_id(self) -> pd.DataFrame:  # same as the method above
        return self.dict['df3_id']

    def dataframes_as_class_attributes(self):
        # store the dfs to access as attributes,
        # replacing the 3 methods above
        return
Result:
datasets = Dataset()
print(datasets.df1_id())
Expected result:
datasets = Dataset()
print(datasets.df1_id) # attribute created by looping through the dict object
Edit:
Similar to this: How to read the contents of a csv file into a class with each csv row as a class instance
You could use setattr like below:
from dataclasses import dataclass
import pandas as pd

@dataclass
class Dataset:
    dict_ = {'df1_id': pd.DataFrame({'col1': [1, 1]}),
             'df2_id': pd.DataFrame({'col2': [2, 2]}),
             'df3_id': pd.DataFrame({'col3': [3, 3]})}

    def __post_init__(self):
        for key, val in self.dict_.items():
            setattr(self, key, val)
To avoid shadowing the built-in dict, a single trailing underscore is added after the name, as PEP 8 recommends for such naming conflicts.
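A quick check that the attributes are there (the generated dataclass __init__ calls __post_init__ automatically):
datasets = Dataset()
print(datasets.df1_id)  # now a plain attribute, no method call needed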
"taking in the keys for the dfs and allow for access to each df using attributes."
It seems that the only purpose of the class is to have attribute access syntax. In that case, it would be simpler to just create a namespace object.
from types import SimpleNamespace
import pandas as pd

class Dataset(SimpleNamespace):
    # extend it here if needed
    pass

data = {
    'df1_id': pd.DataFrame({'col1': [1, 1]}),
    'df2_id': pd.DataFrame({'col2': [2, 2]}),
    'df3_id': pd.DataFrame({'col3': [3, 3]})
}

datasets = Dataset(**data)
Output:
>>> datasets.df1_id
   col1
0     1
1     1
>>> datasets.df2_id
   col2
0     2
1     2
>>> datasets.df3_id
   col3
0     3
1     3
I'm having issues using pyspark dataframes. I have a column called eventkey which is a concatenation of the following elements: account_type, counter_type and billable_item_sid. I have a function called apply_event_key_transform in which I want to break up the concatenated eventkey and create new columns for each of the elements.
def apply_event_key_transform(data_frame: DataFrame):
    output_df = data_frame.withColumn("account_type", getAccountTypeUDF(data_frame.eventkey)) \
        .withColumn("counter_type", getCounterTypeUDF(data_frame.eventkey)) \
        .withColumn("billable_item_sid", getBiSidUDF(data_frame.eventkey))
    output_df = output_df.drop("eventkey")  # drop returns a new DataFrame; reassign it
    return output_df
I've created UDF functions to retrieve the account_type, counter_type and billable_item_sid from a given eventkey value. I have a class called EventKey that takes the full eventkey string as a constructor param, and creates an object with data members to access the account_type, counter_type and billable_item_sid.
getAccountTypeUDF = udf(lambda x: get_account_type(x))
getCounterTypeUDF = udf(lambda x: get_counter_type(x))
getBiSidUDF = udf(lambda x: get_billable_item_sid(x))
def get_account_type(event_key: str):
    event_key_obj = EventKey(event_key)
    return event_key_obj.account_type.name

def get_counter_type(event_key: str):
    event_key_obj = EventKey(event_key)
    return event_key_obj.counter_type

def get_billable_item_sid(event_key: str):
    event_key_obj = EventKey(event_key)
    return event_key_obj.billable_item_sid
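For illustration only, a sketch of what such an EventKey class might look like; the ':' delimiter and the AccountType enum members here are assumptions, since the real class is not shown:
from enum import Enum

class AccountType(Enum):
    # hypothetical members; the real enum is not shown in the question
    PREPAID = 1
    POSTPAID = 2

class EventKey:
    # hypothetical sketch; assumes the key parts are joined with ':'
    def __init__(self, event_key: str):
        parts = event_key.split(':')
        self.account_type = AccountType[parts[0]]  # the UDF reads .name from this
        self.counter_type = parts[1]
        # the third part may be empty, in which case billable_item_sid is None
        self.billable_item_sid = parts[2] if len(parts) > 2 and parts[2] else None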
The issue that I'm running into is that a billable_item_sid can be null, but when the UDF returns None, the column ends up dropped from the frame by the time I aggregate the data later. Is there a way to create a new column that holds null values using withColumn and a UDF?
Things I've tried (for testing purposes):
.withColumn("billable_item_sid", lit(getBiSidUDF(data_frame.eventkey)))
.withColumn("billable_item_sid", lit(None).castString())
Tried a when/otherwise condition for billable_item_sid for null checking
Found out the issue was caused when writing the DataFrame to JSON: null fields are omitted by default. Fixed this by upgrading pyspark to 3.1.1, which has an option called ignoreNullFields that can be set to False.
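A minimal sketch of that fix (the output path here is a placeholder):
# keep null fields in the written JSON instead of omitting them
# (available after the upgrade to pyspark 3.1.1 mentioned above)
output_df.write.option("ignoreNullFields", "false").json("/some/output/path")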
Apologies if the title is a bit obscure; I am happy to change it.
Problem: I am trying to use a keyword argument's name in the following code to filter by column name in a dataframe using pandas.
@staticmethod
def filter_json(json, col_filter, **kwargs):
    '''
    Convert and filter a JSON object into a dataframe
    '''
    df = pd.read_json(json).drop(col_filter, axis=1)
    for arg in kwargs:
        df = df[(df.arg.isin(kwargs[arg]))]
    return df
However I get the error AttributeError: 'DataFrame' object has no attribute 'arg', because arg is not a valid column name (makes sense), at the line df = df[(df.arg.isin(kwargs[arg]))].
I am calling the method with the following...
filter_json(json_obj, MY_COL_FILTERS, IsOpen=['false', 0])
Meaning df.arg should essentially be df.IsOpen
Question: Is there a way to use arg as my column name (IsOpen) here, rather than having to write df.IsOpen manually?
You can access columns with dataframe[columnname] notation as well: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html
Try:
for arg in kwargs:  # arg is 'IsOpen'
    df = df[(df[arg].isin(kwargs[arg]))]  # df['IsOpen'] is the same as df.IsOpen
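A quick standalone check of that bracket notation, with made-up data:
import pandas as pd

df = pd.DataFrame({'IsOpen': ['false', 'true', 0]})
print(df[df['IsOpen'].isin(['false', 0])])  # keeps the rows where IsOpen is 'false' or 0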
I am writing a class containing pandas functionality. As input I have a pandas dataframe, but Python does not seem to recognize it correctly.
import pandas as pd

class box:
    def __init__(self, dataFrame, pers, limit):
        self.df = dataFrame,
        self.pers = pers,
        self.data = limit

    def cleanDataset(self):
        persDf = self.df.filter(regex=('^' + self.pers + r'[1-9]$'))
        persDf = persDf.replace({'-': None})
The call self.df.filter(...) gives me the warning: Instance of 'tuple' has no 'filter' member. I found this but cannot apply the solution, since my problem is not caused by Django.
Can anyone help me out here?
Your problem is the comma at the end of self.df = dataFrame, (and self.pers = pers,). The comma isn't necessary here.
The trailing comma makes Python define self.df as a tuple with one member. To check this, create a box object b and try print(type(b.df)). I'm guessing this will return <class 'tuple'>.
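The same effect in isolation:
x = 5,
print(type(x))  # <class 'tuple'> -- the comma, not parentheses, creates the tuple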
Remove the commas after the attribute definitions:
class box:
def __init__(self, dataFrame, pers, limit):
self.df = dataFrame
self.pers = pers
self.data = limit
Currently trying to implement a genetic algorithm. I have built a Python class Gene.
I am trying to load Gene objects from a dataframe df:
class Gene:
    def __init__(self, id, nb_trax, nb_days):
        self.id = id
        self.nb_trax = nb_trax
        self.nb_days = nb_days
and then create a second class, Chromosome, which holds 20 Gene objects as its property:
class Chromosome(object):
    def __init__(self):
        self.port = [Gene() for id in range(20)]
This is the dataframe:
   ID  nb_obj    nb_days
ECGYE   10259  62.965318
NLRTM    8007  46.550562
I successfully loaded the Gene objects using
tester=df.apply(lambda row: Gene(row['Injection Port'],row['Avg Daily Injection'],random.randint(1,10)), axis=1)
But I cannot construct the Chromosome using
f=Chromosome(tester)
I get this error
Traceback (most recent call last):
File "chrom.py", line 27, in <module>
f=Chromosome(tester)
TypeError: __init__() takes 1 positional argument but 2 were given
Any help please?
The error is misleading because the 1 positional argument it mentions is self, which Python passes implicitly from the Chromosome object; Chromosome(tester) therefore supplies a second argument that __init__ does not accept.
Secondly, what you are getting from the apply call on df in tester is actually a pandas Series, indexed like df, whose values are Gene objects.
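You can confirm this quickly:
print(type(tester))  # <class 'pandas.core.series.Series'>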
To solve this you would have to change the code along these lines:
class Chromosome(object):
    def __init__(self, genes):
        # genes is the Series of Gene objects built with df.apply
        self.port = list(genes)
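With that change, the call from the question works as intended:
f = Chromosome(tester)
print(len(f.port))  # one Gene per row of the dataframe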