Passing PySpark pandas_udf data limit? - python

The problem is simple. Please observe the code below.
@pyf.pandas_udf(pyt.StructType(RESULTS_SCHEMA_LIST), pyf.PandasUDFType.GROUPED_MAP)
def train_udf(df):
    return train_ml_model(df=df)
results_df = complete_df.groupby('training-zone').apply(train_udf)
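For context, a fuller sketch of the same grouped-map setup; the imports and the two-field schema below are assumptions, since the question doesn't show RESULTS_SCHEMA_LIST or what pyf/pyt alias (train_ml_model and complete_df are the question's own names):
import pyspark.sql.functions as pyf
import pyspark.sql.types as pyt
# Hypothetical schema; the question's actual RESULTS_SCHEMA_LIST is not shown.
RESULTS_SCHEMA_LIST = [
    pyt.StructField('training-zone', pyt.StringType()),
    pyt.StructField('model_text', pyt.StringType()),  # the very large string column
]
@pyf.pandas_udf(pyt.StructType(RESULTS_SCHEMA_LIST), pyf.PandasUDFType.GROUPED_MAP)
def train_udf(df):
    # df holds one 'training-zone' group as a pandas DataFrame; the returned
    # pandas DataFrame must match the declared schema, and Arrow serializes it
    # back to Spark; that serialization step is where the error below is raised.
    return train_ml_model(df=df)
results_df = complete_df.groupby('training-zone').apply(train_udf)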
One of the columns of results_df is typically a very large string (>4e6 characters). This isn't a problem for a pandas.DataFrame or a spark.DataFrame when I convert the pandas dataframe to a spark dataframe myself, but it is a problem when the pandas_udf() attempts that conversion. The error returned is:
pyarrow.lib.ArrowInvalid: could not convert **string** with type pyarrow.lib.StringValue: did not recognize the Python value type when inferring an Arrow data type
This UDF does work if I don't return the problematic column, or if I make the problematic column contain only a small string such as 'wow that is cool', so I know the problem is not with the UDF itself per se.
I know the function train_ml_model() works because when I take a random group from the spark dataframe, convert it to a pandas dataframe, and pass it to train_ml_model(), it produces the expected pandas dataframe, including the column with the massive string.
I know spark can handle such large strings because when I convert the pandas dataframe to a spark dataframe using spark.createDataFrame(), the spark dataframe contains the full expected value.
PS: Why is pyarrow even trying to infer the datatype when I pass the types to the pandas_udf()?
Any help would be very much appreciated!

Related

convert pandas apply to polars apply

I am new to python polars and trying to convert the following pandas code to polars.
df.apply(lambda x: x["obj"].compute(data), axis=1, expand=True)
Column obj in the dataframe df is composed of objects having a function property named compute. data is an external variable in the above code.
When I try the above code using polars,
dl.apply(lambda x: (x[0].compute(data)))
dl is the polars dataframe where the objects are stored in the first column, i.e 0.
I received the following error message:
'Expr' object doesn't have compute property.
I am also not sure if polars have the expand feature.
Can you please help me convert the above pandas apply to a polars apply?
Thank you.
Have you tried:
df.with_columns([pl.col('obj').apply(lambda d: d.compute(data))])
I don't think it's optimal, but I think it can work.
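A runnable sketch of that pattern with a hypothetical Model class standing in for the stored objects; this assumes a polars version where Expr.apply still exists and accepts Object-dtype columns (newer releases rename it to map_elements):
import polars as pl
class Model:
    # Hypothetical stand-in for the objects stored in the 'obj' column.
    def __init__(self, factor):
        self.factor = factor
    def compute(self, data):
        return self.factor * data
data = 10  # the external variable from the question
df = pl.DataFrame({'obj': [Model(1), Model(2), Model(3)]})
# Replaces 'obj' with the per-element result of compute(data).
out = df.with_columns([pl.col('obj').apply(lambda d: d.compute(data))])
print(out)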

Pandas one-line filtering for the entire dataset - how is it achieved?

I am just now diving into this wonderful library and am pretty baffled by how filtering, or even column manipulation, is done, and I am trying to understand whether this is a feature of pandas or of Python itself. More precisely:
import pandas
df = pandas.read_csv('data.csv')
# Doing
df['Column'] # would display all values from Column for the dataframe
# Even more so, doing
df.loc[df['Column'] > 10] # would display all values from Column greater than 10
# and the same works with
df.loc[df.Column > 10]
So columns are both attributes and keys, so a DataFrame is both a dict and an object? Or perhaps I am missing some basic Python functionality that I don't know about... And does accessing a column basically loop over the whole dataset? How is this achieved?
Column filtering, column manipulation, and data manipulation in general are features of the pandas library itself. Once you load your data using pd.read_csv, the data set is stored as a pandas DataFrame, a dictionary-like container in which every column is a pandas Series object. You can access a column either as an attribute (df.columnname) or as a key (df['columnname']), and methods like .head(), .tail(), .shape, or .isna() work either way. When you access a column, pandas looks the name up in the DataFrame's column index rather than scanning the data itself; if the name matches, the column is returned, otherwise a KeyError or AttributeError is raised, depending on how you accessed it.
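The dual access is plain Python under the hood: DataFrame implements __getitem__ (bracket access), and __getattr__ falls back to column lookup when normal attribute lookup fails. A toy sketch of the mechanism, not pandas' actual implementation:
class ToyFrame:
    def __init__(self, columns):
        self._columns = columns  # dict mapping column name -> list of values
    def __getitem__(self, name):
        # Enables tf['Column']: a dict lookup by name, KeyError if missing.
        return self._columns[name]
    def __getattr__(self, name):
        # Enables tf.Column; only called when normal attribute lookup fails.
        try:
            return self._columns[name]
        except KeyError:
            raise AttributeError(name)
tf = ToyFrame({'Column': [5, 12, 8, 20]})
print(tf['Column'])  # [5, 12, 8, 20]
print(tf.Column)     # the same list, via attribute access
Filtering works the same way: df['Column'] > 10 is an overloaded, vectorized comparison that returns a boolean Series (a mask), and df.loc[mask] keeps the rows where the mask is True, with no Python-level loop over the dataset.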

Save Pandas dataframe with numeric column as text in Excel

I am trying to export a Pandas dataframe to Excel where all columns are of text format. By default, the pandas.to_excel() function lets Excel decide the data type. Exporting a column with [1,2,'w'] results in the cells containing 1 and 2 being numeric and the cell containing 'w' being text. I'd like all rows in the column to be text (i.e. ['1','2','w']).
I was able to solve the problem by converting the column that needs to be text using .astype(str). However, if the data is large, I am concerned that I will run into performance issues. If I understand correctly, df[col] = df[col].astype(str) makes a copy of the data, which is not efficient.
import pandas as pd
df = pd.DataFrame({'a':[1,2,'w'], 'b':['x','y','z']})
df['a'] = df['a'].astype(str)
df.to_excel(r'c:\tmp\test.xlsx')
Is there a more efficient way to do this?
I searched SO several times and didn't see anything on this. Forgive me if this has been answered before. This is my first post, and I'm really happy to participate in this cool forum.
Edit: Thanks to the comments I've received, I see that Converting a series of ints to strings - Why is apply much faster than astype? gives me other options besides astype(str). This is really useful. I also wanted to know whether astype(str) was inefficient because it made a copy of the data, which I now see it does not.
I don't think you'll have performance issues with that approach, since the data is not copied but replaced. You may also convert the whole dataframe into string type using
df = df.astype(str)
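If you would rather not keep a string copy of df around, you can also convert only on the way out; a minimal sketch using the question's example data and its made-up path:
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 'w'], 'b': ['x', 'y', 'z']})
# Convert to text only for the write; df itself keeps its original dtypes.
df.astype(str).to_excel(r'c:\tmp\test.xlsx')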

How to run parsing logic of Pandas read_csv on custom data?

read_csv contains a lot of parsing logic to detect and convert CSV strings to numerical and datetime Python values. My question is: is there a way to call the same conversions on a DataFrame which contains columns of string data, but where the DataFrame is not stored in a CSV file and comes from a different (unparsed) source? So only an in-memory DataFrame object is available.
Saving such a DataFrame to a CSV file and reading it back would perform the conversion, but this looks very inefficient to me.
If you have, e.g., a column of string type that actually contains a date
(e.g. yyyy-mm-dd), you can use pd.to_datetime() to convert it to Timestamp.
Assuming that the column name is SomeDate, you can call:
df.SomeDate = pd.to_datetime(df.SomeDate)
Another option is to apply your own conversion function to any column
(search the Web for a description of apply).
You didn't give many details, so I can only offer this very general advice.
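A small sketch of the per-column approach on an in-memory DataFrame; the column names and values here are made up:
import pandas as pd
df = pd.DataFrame({
    'SomeDate': ['2020-01-01', '2020-02-15'],  # date strings
    'Amount': ['1.5', '2.25'],                 # numeric strings
})
df['SomeDate'] = pd.to_datetime(df['SomeDate'])  # -> datetime64[ns]
df['Amount'] = pd.to_numeric(df['Amount'])       # -> float64
print(df.dtypes)
If you specifically want read_csv's automatic inference, you can also round-trip through an in-memory io.StringIO buffer (df.to_csv into the buffer, then pd.read_csv back) instead of touching the file system, though this still serializes the data.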

exporting pandas dataframe of complex numbers to excel

I am working with a pandas DataFrame which has complex numbers as column data. I am trying to export this DataFrame to Excel using the DataFrame.to_excel method, which throws the following error.
raise ValueError("Cannot convert {0} to Excel".format(value))
ValueError: Cannot convert (1.044574-3496.069365j) to Excel
Is there any workaround for this? My DataFrame looks like this:
Freq lne_10720_15820_1-lne_10720_18229_1 lne_10720_15820_1 \
48 (1.044574-3496.069365j) (7.576632+64.778558j)
50 (1.049333-3355.448147j) (7.557604+67.544162j)
52 (1.054253-3225.613165j) (7.656567+70.317672j)
A simple solution to your problem is to typecast the variables in the dataframe to string type as follows:
df = df.astype(str)
After performing this operation, the type of each column is object, which can be verified using df.dtypes.
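A sketch of that fix using values shaped like the question's data (the column names are shortened here for readability, and the output file name is made up):
import pandas as pd
df = pd.DataFrame(
    {'col_a': [1.044574 - 3496.069365j, 1.049333 - 3355.448147j],
     'col_b': [7.576632 + 64.778558j, 7.557604 + 67.544162j]},
    index=[48, 50],
)
# Excel has no complex type, so store the values as text instead.
df = df.astype(str)
print(df.dtypes)            # every column is now object (strings)
df.to_excel('output.xlsx')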
Hope this helps :)
