convert pandas apply to polars apply - python

I am new to python polars and trying to convert the following pandas code to polars.
df.apply(lambda x: x["obj"].compute(data), axis=1, expand=True)
Column obj in the dataframe df is composed of objects having a function property named compute. data is an external variable in the above code.
When I try the above code using polars,
dl.apply(lambda x: (x[0].compute(data)))
dl is the polars dataframe, where the objects are stored in the first column, i.e. column 0.
I received the following error message:
'Expr' object doesn't have compute property.
I am also not sure if polars has the expand feature.
Can you please help me convert the above pandas apply to polars apply?
Thank you.
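For reference, the pandas pattern from the question can be reproduced with a minimal sketch; the Obj class and the data value below are hypothetical stand-ins, and the expand/result_type handling is left out:

import pandas as pd

class Obj:
    """Hypothetical stand-in for the objects stored in the 'obj' column."""
    def __init__(self, factor):
        self.factor = factor

    def compute(self, data):
        return self.factor * data

data = 10  # external variable, as in the question
df = pd.DataFrame({"obj": [Obj(1), Obj(2), Obj(3)]})

# Row-wise apply: each row's object calls its compute() method against data.
result = df.apply(lambda row: row["obj"].compute(data), axis=1)
print(result)  # a Series of computed values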

Have you tried:
df.with_columns([pl.col('obj').apply(lambda d: d.compute(data))])
I don't think it's optimal, but I think it can work.
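A minimal, self-contained sketch of that translation, assuming each cell of the obj column holds an object exposing a compute(data) method (the Obj class below is a hypothetical stand-in); note that newer polars releases rename apply to map_elements:

import polars as pl

class Obj:
    """Hypothetical stand-in for the objects stored in the 'obj' column."""
    def __init__(self, factor):
        self.factor = factor

    def compute(self, data):
        return self.factor * data

data = 10  # external variable, as in the question
dl = pl.DataFrame({"obj": [Obj(1), Obj(2), Obj(3)]})

# Element-wise Python call over the object column; on older polars versions
# the same call is spelled .apply instead of .map_elements.
result = dl.with_columns(
    pl.col("obj")
    .map_elements(lambda d: d.compute(data), return_dtype=pl.Int64)
    .alias("computed")
)
print(result)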

Related

How to return a DataFrame when using pandas.apply()

I'm trying to get a concatenated DataFrame using pandas.apply(); there is a demo below:
As the code shows, apply() returns a Series instead of the concatenated DataFrame I expected. How can I solve this?
Have you tried using df.applymap()?
Technically they do the same thing, but applymap applies your function to every element of the DataFrame and returns a DataFrame no matter what.
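A minimal sketch with made-up data (in pandas 2.1+ the same method is also available as DataFrame.map):

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})  # hypothetical demo data

# applymap calls the function on every element and always returns a
# DataFrame of the same shape.
out = df.applymap(lambda x: x * 10)
print(out)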

Converting series from pandas to pyspark: need to use "groupby" and "size", but pyspark yields error

I am converting some code from Pandas to pyspark. In pandas, let's imagine I have the following mock dataframe, df:
And in pandas, I define a certain variable the following way:
value = df.groupby(["Age", "Siblings"]).size()
And the output is a series as follows:
However, when trying to convert this to pyspark, an error comes up: AttributeError: 'GroupedData' object has no attribute 'size'. Can anyone help me solve this?
The equivalent of size in pyspark is count:
df.groupby(["Age", "Siblings"]).count()
You can also use the agg method, which is more flexible, as it allows you to set a column alias or add other types of aggregations:
import pyspark.sql.functions as F
df.groupby('Age', 'Siblings').agg(F.count('*').alias('count'))
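A small end-to-end sketch with made-up data (the column names and values are assumptions based on the question):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-in for the mock dataframe from the question.
df = spark.createDataFrame([(20, 1), (20, 1), (25, 0)], ["Age", "Siblings"])

# count() is the PySpark analogue of pandas' size().
df.groupby("Age", "Siblings").count().show()

# agg() lets you alias the count column or add further aggregations.
df.groupby("Age", "Siblings").agg(F.count("*").alias("count")).show()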

Pandas one-line filtering for the entire dataset - how is it achieved?

I am just now diving into this wonderful library and am pretty baffled by how filtering, or even column manipulation, is done. I am trying to understand whether this is a feature of pandas or of Python itself. More precisely:
import pandas
df = pandas.read_csv('data.csv')
# Doing
df['Column'] # would display all values from Column for dataframe
# Even more so, doing
df.loc[df['Column'] > 10] # would display all values from Column greater than 10
# and is the same with
df.loc[df.Column > 10]
So columns are both attributes and keys, so a DataFrame is both a dict and an object? Or perhaps I am missing some basic Python functionality that I don't know about... And does accessing a column basically loop over the whole dataset? How is this achieved?
Column filtering, column manipulation, and data manipulation in general are features of the pandas library itself. Once you load your data with pd.read_csv, the data set is stored as a pandas DataFrame, which is a dictionary-like container, and every column of the DataFrame is a pandas Series object.

It depends on how you access the column: as an attribute (df.columnname) or as a key (df['columnname']). You can apply methods like .head(), .tail(), .shape or .isna() either way. When you access a column, pandas matches the name you passed against the dataframe's columns; if it matches, the column is returned, otherwise it throws a KeyError or an AttributeError, depending on how you accessed it.
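As for the one-line filtering itself: the comparison df['Column'] > 10 is evaluated first and yields a boolean Series, which df.loc then uses as a row mask. A minimal sketch with made-up data:

import pandas as pd

df = pd.DataFrame({"Column": [5, 12, 30]})  # hypothetical stand-in for data.csv

# The comparison is vectorised: it returns a boolean Series, one flag per row.
mask = df["Column"] > 10
print(mask)

# df.loc with a boolean Series keeps only the rows where the mask is True.
print(df.loc[mask])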

Pandas merge not working due to a wrong type

I'm trying to merge two dataframes using
grouped_data = pd.merge(grouped_data, df['Pattern'].str[7:11],
                        how='left', left_on='Calc_DRILLING_Holes',
                        right_on='Calc_DRILLING_Holes')
But I get an error saying can not merge DataFrame with instance of type <class 'pandas.core.series.Series'>
What could be the issue here. The original dataframe that I'm trying to merge to was created from a much larger dataset with the following code:
import pandas as pd

raw_data = pd.read_csv(r"C:\Users\cherp2\Desktop\test.csv")
data_drill = raw_data.query('Activity == "DRILL"')
grouped_data = (
    data_drill
    .groupby([data_drill['PeriodStartDate'].str[:10], 'Blast'])['Calc_DRILLING_Holes']
    .sum()
    .reset_index()
    .sort_values('PeriodStartDate')
)
What do I need to change here to make it a regular, normal dataframe?
If I try to convert either of them to a dataframe using .to_frame(), I get an error saying that 'DataFrame' object has no attribute 'to_frame'.
I'm so confused as to what kind of data type it is.
Both objects in a call to pd.merge need to be DataFrame objects. Is grouped_data a Series? If so, try promoting it to a DataFrame by passing pd.DataFrame(grouped_data) instead of just grouped_data.
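A minimal sketch with made-up data: promote whichever operand is a Series (here the sliced Pattern column) to a DataFrame before merging, e.g. with .to_frame() or pd.DataFrame(...); the column names and values below are assumptions:

import pandas as pd

grouped_data = pd.DataFrame(
    {"Calc_DRILLING_Holes": ["A01", "B02"], "total": [10, 20]}
)
pattern = pd.Series(["A01", "C03"], name="Calc_DRILLING_Holes")  # a plain Series

# Promote the Series to a one-column DataFrame so both merge operands match.
merged = pd.merge(grouped_data, pattern.to_frame(), how="left",
                  on="Calc_DRILLING_Holes")
print(merged)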

Passing PySpark pandas_udf data limit?

The problem is simple. Please observe the code below.
@pyf.pandas_udf(pyt.StructType(RESULTS_SCHEMA_LIST), pyf.PandasUDFType.GROUPED_MAP)
def train_udf(df):
    return train_ml_model(df=df)
results_df = complete_df.groupby('training-zone').apply(train_udf)
One of the columns of results_df is typically a very large string (>4e6 characters). That isn't a problem for a pandas.DataFrame, or for a spark.DataFrame when I convert the pandas dataframe to a spark dataframe myself, but it is a problem when pandas_udf() attempts the conversion. The error returned is: pyarrow.lib.ArrowInvalid: could not convert **string** with type pyarrow.lib.StringValue: did not recognize the Python value type when inferring an Arrow data type
This UDF does work if I don't return the problematic column or I make the problematic column only contain some small string such as 'wow that is cool', so I know the problem is not with the udf itself per se.
I know the function train_ml_model() works because when I get a random group from the spark dataframe then convert it to a pandas dataframe and pass it to train_ml_model() it produces the expected pandas dataframe with the column with a massive string.
I know spark can handle such large strings because when I convert the pandas dataframe to a spark dataframe using spark.createDataFrame() the spark dataframe contains the full expected value.
PS: Why is pyarrow even trying to infer the datatype when I pass the types to the pandas_udf()?
Any help would be very much appreciated!