Trouble translating from Pandas to PySpark - python

I'm having a lot of trouble translating a function that worked on a pandas DataFrame into a PySpark UDF. Mainly, PySpark is throwing an error that I don't really understand, because this is my first time using it. First, my dataset does contain some NaNs, which I didn't know would add complexity to my task. That said, the dataset contains the standard data types, i.e. categories and integers. Finally, I run my algorithm using pandas' groupby() method, applying a lambda function to each group, and I'm told that PySpark supports all of these methods.
Now let me tell you about the algorithm. It's pretty much a counting game that I run on one column, and it's written in vanilla Python. I mention this because it's a bit too long to post. It returns three lists, i.e. arrays, which from what I understand PySpark also supports. This is what a very short version of the algorithm looks like:
def algo(x, col):
    # you will be looking at a specific pandas column --- pd.Series
    x = x[col]
    # LOGIC GOES HERE...
    return list1, list2, list3
I'm running the algorithm using:
data = df.groupby("GROUPBY_THIS").apply(lambda x: algo(x, "COLUMN1"))
And everything works fine; I get back the three lists with the correct lengths. But when I try to run this algorithm in PySpark, I'm confused about whether to use a UDF or a pandas UDF. In addition, I'm getting an error that I can't quite understand. Can someone point me in the right direction here? Thanks!
Error:
ValueError: Invalid udf: the udf argument must be a pandas_udf of type GROUPED_MAP.
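For context on what the error is asking for: in Spark 2.3/2.4, groupBy(...).apply(...) only accepts a pandas UDF declared with PandasUDFType.GROUPED_MAP, and that UDF must return a pandas DataFrame rather than bare lists. A minimal sketch of that shape, assuming a Spark DataFrame sdf with the same columns as the pandas version (the schema fields and the integer list elements are illustrative guesses, not from the question):

import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import (StructType, StructField, StringType,
                               ArrayType, IntegerType)

# One output row per group, with the three lists stored as array columns.
schema = StructType([
    StructField("GROUPBY_THIS", StringType()),
    StructField("list1", ArrayType(IntegerType())),
    StructField("list2", ArrayType(IntegerType())),
    StructField("list3", ArrayType(IntegerType())),
])

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def run_algo(pdf):
    # pdf is a plain pandas DataFrame holding one group's rows
    list1, list2, list3 = algo(pdf, "COLUMN1")
    return pd.DataFrame({
        "GROUPBY_THIS": [pdf["GROUPBY_THIS"].iloc[0]],
        "list1": [list1],
        "list2": [list2],
        "list3": [list3],
    })

result = sdf.groupBy("GROUPBY_THIS").apply(run_algo)

On Spark 3.x the same idea is spelled sdf.groupBy("GROUPBY_THIS").applyInPandas(run_algo, schema), with a plain function instead of the decorator.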

When should I worry about using copy() with a pandas DataFrame?

I'm more of an R user and have recently been "switching" to Python. So that means I'm way more used to the R way of dealing with things. In Python, the whole concept of mutability and passing by assignment is kind of hard to grasp at first.
I can easily understand the issues that mutability may lead to when using lists or dictionaries. However, with pandas DataFrames, I find mutability especially difficult to understand.
For example: let's say I have a DataFrame (df) with some raw data. I want a function that receives df as a parameter and outputs a modified version of it, while keeping the original df untouched. If I wrote the function, I can inspect it and be assured that it copies the input before applying any manipulation. However, if it's a function I don't know (say, from some package), should I always pass my input df as df.copy()?
In my case, I'm trying to write some custom function that transforms a df using a WoE encoder. The data parameter is a DataFrame with feature columns and a label column. It kinda looks like this:
import category_encoders

def my_function(data, var_list, label_column):
    # var_list = cols to be encoded
    encoder = category_encoders.WOEEncoder(cols=var_list)
    fit_encoder = encoder.fit(
        X=data[var_list],
        y=data[label_column]
    )
    new_data = fit_encoder.transform(
        data[var_list]
    )
    new_data[label_column] = data[label_column]
    return new_data
So should I be passing data[var_list].copy() instead of data[var_list]? Should I assume that every function that receives a df might modify it in place, or will it return a different object? I mean, how can I be sure that fit_encoder.transform won't modify data itself? I also learned that pandas sometimes produces views and sometimes copies, depending on the operation you apply to whatever subset of the df. So I feel like there's too much uncertainty surrounding operations on DataFrames.
The exercise at https://www.statology.org/pandas-copy-dataframe/ shows that if you don't use .copy() when manipulating a subset of your dataframe, you can change values in your original dataframe as well. This is not what you want, so you should use .copy() when passing your dataframe to your function.
The example at that link really illustrates this concept well (and no, I'm not affiliated with their site lol, I was just searching for this answer myself).
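To make the view-versus-copy ambiguity concrete, here is a small demonstration of the failure mode the linked exercise describes (the frame and values are made up for illustration):

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

sub = df[df["a"] > 1]   # may be a view or a copy; pandas does not guarantee which
sub["b"] = 0            # may emit SettingWithCopyWarning; whether df is affected is not guaranteed

safe = df[df["a"] > 1].copy()   # explicitly an independent object
safe["b"] = 0                   # guaranteed never to touch df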

Pandas Dataframe to Apache Beam PCollection conversion problem

I'm trying to convert a pandas DataFrame to a PCollection from Apache Beam.
Unfortunately, when I use the to_pcollection() function, I get the following error:
AttributeError: 'DataFrame' object has no attribute '_expr'
Does anyone know how to solve it?
I'm using pandas=1.1.4, beam=2.25.0 and Python 3.6.9.
to_pcollection was only ever intended to apply to Beam's deferred DataFrames, but looking at this, it makes sense that it should work, and it isn't obvious how to do it manually. https://github.com/apache/beam/pull/14170 should fix this.
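For reference, the intended path (which also explains the missing _expr attribute) is to build a deferred Beam DataFrame inside the pipeline rather than handing over a native pandas one. A rough sketch, assuming a CSV input and a column named col (both placeholders):

import apache_beam as beam
from apache_beam.dataframe.io import read_csv
from apache_beam.dataframe.convert import to_pcollection

with beam.Pipeline() as p:
    df = p | read_csv("input.csv")    # a deferred Beam DataFrame, not native pandas
    filtered = df[df["col"] > 0]      # operations are recorded as deferred expressions
    pcoll = to_pcollection(filtered)  # works because deferred frames carry _expr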
I get this problem when I use a "native" pandas DataFrame instead of a DataFrame created by to_dataframe within Beam. I suspect that the DataFrame created by Beam wraps or subclasses a pandas DataFrame with new attributes (like _expr) that the native pandas DataFrame doesn't have.
The real answer involves knowing how to use apache_beam.dataframe.convert.to_dataframe, but I can't figure out how to set the proxy object correctly (I get Singleton errors when I later try to use to_pcollection). So since I can't get the "right" way to work in 2.25.0 (I'm new to Beam and pandas, and don't know how proxy objects work, so take all this with a grain of salt), I use this workaround:
import apache_beam as beam
import pandas as pd

class SomeDoFn(beam.DoFn):
    def process(self, pair):  # pair is a key/value tuple
        df = pd.DataFrame(pair[1])  # just the array of values
        ## do something with the dataframe
        ...
        records = df.to_dict('records')
        # return a tuple with the same shape as the one we received
        return [(rec["key"], rec) for rec in records]
which I invoke with something like this:
rows = (
    pcoll
    | beam.ParDo(SomeDoFn())
)
I hope others will give you a better answer than this workaround.

How to use parse from phonenumbers Python library on a pandas data frame?

How can I parse phone numbers from a pandas data frame, ideally using phonenumbers library?
I am trying to use a port of Google's libphonenumber library on Python,
https://pypi.org/project/phonenumbers/.
I have a data frame with 3 million phone numbers from many countries. I have a column with the phone number and a column with the country/region code. I'm trying to use the parse function from the package. My goal is to parse each row using the corresponding country code, but I can't find a way to do it efficiently.
I tried using apply, but it didn't work. I get a "(0) Missing or invalid default region." error, meaning the country code string isn't being passed in:
df['phone_number_clean'] = df.phone_number.apply(lambda x:
    phonenumbers.parse(str(df.phone_number), str(df.region_code)))
The line below works, but doesn't give me what I want, as my numbers come from about 120+ different countries:
df['phone_number_clean'] = df.phone_number.apply(lambda x:
    phonenumbers.parse(str(df.phone_number), "US"))
I tried doing this in a loop, but it is terribly slow. It took me more than an hour to parse 10,000 numbers, and I have about 300x that:
for i in range(n):
    df3['phone_number_std'][i] = phonenumbers.parse(str(df.phone_number[i]),
                                                    str(df.region_code[i]))
Is there a method I'm missing that could run this faster? The apply function works acceptably well but I'm unable to pass the data frame element into it.
I'm still a beginner in Python, so perhaps this has an easy solution. But I would greatly appreciate your help.
Your initial solution using apply is actually pretty close; the error happens because the syntax for a lambda function over multiple columns of a dataframe, rather than over the rows within a single column, is a bit different. Try this:
df['phone_number_clean'] = df.apply(lambda x:
    phonenumbers.parse(str(x.phone_number),
                       str(x.region_code)),
    axis='columns')
The differences:
You want to include multiple columns in your lambda function, so you want to apply it to the entire dataframe (i.e., df.apply) rather than to the Series (the single column) returned by df.phone_number.apply. (Print the output of df.phone_number to the console: what is returned is all the information your lambda function will be given.)
The argument axis='columns' (or axis=1, which is equivalent; see the docs) actually slices the data frame by rows, so apply 'sees' one record at a time (i.e., [index0, phonenumber0, regioncode0], [index1, phonenumber1, regioncode1], ...) as opposed to slicing in the other direction, which would give it ([phonenumber0, phonenumber1, phonenumber2, ...]).
Your lambda function only knows about the placeholder x, which, in this case, is the Series [index0, phonenumber0, regioncode0], so you need to specify all the values relative to the x that it knows, i.e., x.phone_number and x.region_code.
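As an aside, if the axis='columns' apply is still slow at 3 million rows, a plain list comprehension over zip of the two columns is often faster, because it skips building a Series for every row. A sketch using the same column names:

df['phone_number_clean'] = [
    phonenumbers.parse(str(num), str(region))
    for num, region in zip(df.phone_number, df.region_code)
]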
Love the solution of @katelie, but here's my code. I added a try/except block to keep the function from failing on strings that are too long, which format_number cannot handle.
import phonenumbers as phon

def formatE164(number):
    try:
        return phon.format_number(phon.parse(str(number), "NL"),
                                  phon.PhoneNumberFormat.E164)
    except:
        pass

df['column'] = df['column'].apply(formatE164)

Filtering a dataset on values not in another dataset

I am looking to filter a dataset based on whether a certain ID does not appear in a different dataframe.
I'm not super attached to the way I've decided to do this, so if there's a better way I'm not familiar with, I'm open to it. My plan is to apply a Boolean function to my dataset, put the results in a new column, and then filter the entire dataset on that True/False result.
My main dataframe is df, and my other dataframe with the IDs in it is called ID:
def groups():
    if df['owner_id'] not in ID['owner_id']:
        return True
    return False
This ends up being accepted (no syntax problems), so I then go to apply it to my dataframe, which fails:
df['ID Groups?'] = df.apply(lambda row: groups(), axis=1)
Result:
TypeError: ("'Series' objects are mutable, thus they cannot be hashed", 'occurred at index 0')
It seems that somewhere the data I'm trying to use (the IDs are both letters and numbers, so strings) is incorrectly formatted.
I have two questions:
Is my proposed method the best way of going about this?
How can I fix the error that I'm seeing?
My apologies if it's something super obvious; I have very limited exposure to Python and coding as a whole, and I wasn't able to find anywhere that this type of question had already been addressed.
The question asks for the rows in df whose owner_id does not appear in ID, so negate isin with ~:
df = df[~df['owner_id'].isin(ID['owner_id'])]
(Drop the ~ to keep only the matching rows instead.) A lambda expression is going to be way slower than this.
isin is the Pandas way. not in is the Python collections way.
The reason you are getting this error is that df['owner_id'] not in ID['owner_id'] hashes the left-hand side to figure out whether it is present in the right-hand side. df['owner_id'] is of type Series and is not hashable, as reported. Luckily, hashing is not needed here.
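A tiny self-contained example of both directions (the toy values are made up):

import pandas as pd

df = pd.DataFrame({"owner_id": ["a1", "b2", "c3"]})
ID = pd.DataFrame({"owner_id": ["b2"]})

print(df[~df["owner_id"].isin(ID["owner_id"])])  # keeps a1 and c3 (not in ID)
print(df[df["owner_id"].isin(ID["owner_id"])])   # keeps only b2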

Streamlining appending of boolean column in pandas dataframe

Disclaimer: My code is very amateurish as I am still undergoing course work activities. Please bear with me if my code is inefficient or of poor quality.
I have been learning the power of pandas in a recent Python tutorial and have been applying it to some of my course work. We have learnt how to use boolean filtering in pandas, so I decided to go one step further and try to append boolean values to a column in my data (efficiency).
The tutor has said we should focus on minimising code as much as we can, and I have attempted to do so for the efficiency column below.
The baseline efficiency value is 0.4805 (48.05%). Values above this are acceptable; values below it are a 'fail'.
I have appended this to my dataframe using the code below:
df['Classification'] = (df[['Efficiency_%']].sum(axis=1) > 0.4805)
df['Classification'] = (df['Classification'] == True).astype(int)
While this is only 2 lines of code, is there a way I can streamline it further into just one line?
I had considered using a lambda function, which I am currently reading up on. I am interested in any other alternatives I could consider.
The approaches I have tried:
For loops - advised against using these due to inefficiency.
If statements - I couldn't get this to work, as I can't append a '1' or '0' to the df['Classification'] column because it is a dataframe and not a series:
if i > 0.4805:
    df['Classification'].append('0')
else:
    df['Classification'].append('1')
Thank you.
This should do the same. It's unnecessary to sum a one-column data frame by row: df[['Efficiency_%']].sum(axis=1) is the same as df['Efficiency_%']. Also, comparing a Boolean Series == True is unnecessary, as it yields the same result as the Boolean Series itself.
df['Classification'] = (df['Efficiency_%'] > 0.4805).astype(int)
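If you prefer the 0/1 mapping to be explicit, an equivalent one-liner (assuming the same column names) uses numpy:

import numpy as np

df['Classification'] = np.where(df['Efficiency_%'] > 0.4805, 1, 0)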
