How to solve this Pyspark Code Block using Regexp - python

I have this CSV file, but when I run my notebook the regex throws an error.
from pyspark.sql.functions import regexp_replace

path = "dbfs:/FileStore/df/test.csv"
dff = spark.read.option("header", "true").option("inferSchema", "true").option('multiline', 'true').option('encoding', 'UTF-8').option("delimiter", "‡‡,‡‡").csv(path)
dff.show(truncate=False)

dffs_headers = dff.dtypes
for i in dffs_headers:
    columnLabel = i[0]
    print(columnLabel)
    newColumnLabel = columnLabel.replace('‡‡', '').replace('‡‡', '')
    dff = dff.withColumn(newColumnLabel, regexp_replace(columnLabel, '^\\‡‡|\\‡‡$', '')).drop(newColumnLabel)
    if columnLabel != newColumnLabel:
        dff = dff.drop(columnLabel)
dff.show(truncate=False)
As a result I am getting this.
Can anyone improve this code? It would be a great help.
Expected output is:
|‡‡123456‡‡,‡‡Version2‡‡,‡‡All questions have been answered accurately and the guidance in the questionnaire was understood and followed‡‡,‡‡2010-12-16 00:01:48.020000000‡‡|
But I am getting:
‡‡Id‡‡,‡‡Version‡‡,‡‡Questionnaire‡‡,‡‡Date‡‡
The second column is also showing a truncated value.

You need to import the libraries you want to use before you use them. Adding the line below in a cell before the regexp_replace call should fix this issue:
from pyspark.sql.functions import regexp_replace
This is the working answer:
from pyspark.sql.functions import regexp_replace

path = "dbfs:/FileStore/df/test.csv"
dff = spark.read.option("header", "true").option("inferSchema", "true").option('multiline', 'true').option('encoding', 'UTF-8').option("delimiter", "‡‡,‡‡").csv(path)

dffs_headers = dff.dtypes
for i in dffs_headers:
    columnLabel = i[0]
    newColumnLabel = columnLabel.replace('‡‡', '').replace('‡‡', '')
    dff = dff.withColumn(newColumnLabel, regexp_replace(columnLabel, '^\\‡‡|\\‡‡$', ''))
    if columnLabel != newColumnLabel:
        dff = dff.drop(columnLabel)
dff.show(truncate=False)
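As an alternative sketch (assuming the same dff loaded above), the ‡‡ markers can be stripped from every column and the columns renamed in a single select, instead of mutating the frame column by column:

from pyspark.sql.functions import regexp_replace, col

# strip leading/trailing ‡‡ from every value and drop the markers from the column names
cleaned = dff.select([
    regexp_replace(col(c), '^‡‡|‡‡$', '').alias(c.replace('‡‡', ''))
    for c in dff.columns
])
cleaned.show(truncate=False)

Using show(truncate=False) also keeps long values such as the Questionnaire column from being cut off in the output.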

Related

Polars dataframe doesn't drop column

I have a function in a script that I am testing and the df.drop() function is not working as expected.
app.py
def area(df, value):
    df["area"] = df['geo'].apply(lambda row: to_area(row))
    df["area"] = df["area"].apply(lambda row: abs(row - mean))
    df = df.filter(pl.col("area") < value)
    df = df.drop("area")
    return df
test.py
def test():
    df = some df
    res = area(df, 2)
    res_2 = area(df, 4)
At res_2, I keep getting the "area" column back in the dataframe, which is causing me problems with type checking. Any ideas on what might be causing this? I know that using df.clone() works, but I don't understand what is causing this issue with how things are set up.
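The in-place column assignments (df["area"] = ...) modify the frame object that was passed in, which is why the second call sees leftovers and why df.clone() hides the problem. Below is a minimal sketch of a version that only builds new frames; it assumes to_area and mean are the same names from the question, and a recent Polars where per-element apply is called map_elements:

import polars as pl

def area(df: pl.DataFrame, value: float) -> pl.DataFrame:
    # build the helper column on a new frame instead of assigning into df;
    # to_area and mean come from the question's surrounding code
    out = df.with_columns(
        pl.col("geo").map_elements(to_area).alias("area")
    ).with_columns(
        (pl.col("area") - mean).abs().alias("area")
    )
    # filter, then drop the helper column; the caller's df is left untouched
    return out.filter(pl.col("area") < value).drop("area")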

Pandas Dataframe display total

Here is an example dataset, found via a Google search, that is close to the datasets in my environment.
I'm trying to get output like this:
import pandas as pd
import numpy as np
data = {'Product': ['Box', 'Bottles', 'Pen', 'Markers', 'Bottles', 'Pen', 'Markers', 'Bottles', 'Box', 'Markers', 'Markers', 'Pen'],
        'State': ['Alaska', 'California', 'Texas', 'North Carolina', 'California', 'Texas', 'Alaska', 'Texas', 'North Carolina', 'Alaska', 'California', 'Texas'],
        'Sales': [14, 24, 31, 12, 13, 7, 9, 31, 18, 16, 18, 14]}
df = pd.DataFrame(data, columns=['Product', 'State', 'Sales'])
df1 = df.sort_values('State')
#df1['Total'] = df1.groupby('State').count()
df1['line'] = df1.groupby('State').cumcount() + 1
print(df1.to_string(index=False))
The commented-out line throws this error:
ValueError: Columns must be same length as key
I tried with size(), but it gives NaN for all rows.
I hope someone can point me in the right direction.
Thanks in advance.
I think this should work for 'Total':
df1['Total']=df1.groupby('State')['Product'].transform(lambda x: x.count())
Try this:
df = pd.DataFrame(data).sort_values("State")
grp = df.groupby("State")
df["Total"] = grp["State"].transform("size")
df["line"] = grp.cumcount() + 1

pyspark UDF returns AttributeError: 'DataFrame' object has no attribute 'sort_values'

I'm having a hard time with my program. I'm trying to apply a UDF to a dataframe and I'm getting an error message, as per my title. Here is my code:
import pandas as pd
import datetime as dt
import numpy as np
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
df = pd.DataFrame({
    'ID': [1, 2, 2],
    'dt': [pd.Timestamp.now(), pd.Timestamp.now(), pd.Timestamp.now()]})
df.head()

def FlagUsers(df, ids, tm, gap):
    df = df.sort_values([ids, tm])
    df[ids] = df[ids].astype(str)
    df['timediff'] = df.groupby(ids)[tm].diff()
    df['prevtime'] = df.groupby(ids)[tm].shift()
    df['prevuser'] = df[ids].shift()
    df['prevuser'].fillna(0, inplace=True)
    df['timediff'] = df.timediff / pd.Timedelta('1 minute')
    df['timediff'].fillna(99, inplace=True)
    df['flagnew'] = np.where((df.timediff < gap) & (df['prevuser'] == df[ids]), 'existing', 'new')
    df.loc[df.flagnew == 'new', 'sessnum'] = df.groupby([ids, 'flagnew']).cumcount() + 1
    df['sessnum'] = df['sessnum'].fillna(method='ffill')
    df['session_key'] = df[ids].astype(str) + "_" + df['sessnum'].astype(str)
    df.drop(['prevtime', 'prevuser'], axis=1, inplace=True)
    arr = df['session_key'].values
    return arr
# Python Function works fine:
FlagUsers(df,'ID','dt',5)
s_df = spark.createDataFrame(df)
s_df.show()
spark.udf.register("FlagUsers", FlagUsers)
s_df = s_df.withColumn('session_key',FlagUsers(s_df,'ID','dt',5))
My function works fine in Python, but when I try to run it in Spark it does not work. I'm really sorry if this is a silly question! Thank you & best wishes.
A PySpark UDF is not the same as a native Python function; it has specific requirements (it operates on columns, not on a whole DataFrame).
Please experiment with a pandas UDF instead.
It is several times faster and better suited here:
https://docs.databricks.com/spark/latest/spark-sql/udf-python-pandas.html
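For illustration, here is a minimal sketch of the grouped-map pandas UDF route with applyInPandas. It is a simplified re-implementation of the session logic, not the exact FlagUsers function; the column names and the 5-minute gap come from the question, the rest is assumed:

import pandas as pd
from pyspark.sql.types import StructType, StructField, LongType, TimestampType, StringType

out_schema = StructType([
    StructField("ID", LongType()),
    StructField("dt", TimestampType()),
    StructField("session_key", StringType()),
])

def flag_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # pdf holds all rows for one ID as a plain pandas DataFrame
    pdf = pdf.sort_values("dt")
    gap = 5  # minutes, same threshold as in the question
    timediff = pdf["dt"].diff() / pd.Timedelta("1 minute")
    new_session = timediff.isna() | (timediff >= gap)
    sessnum = new_session.cumsum()  # 1, 2, ... per new session within this ID
    pdf["session_key"] = pdf["ID"].astype(str) + "_" + sessnum.astype(str)
    return pdf

result = s_df.groupBy("ID").applyInPandas(flag_group, schema=out_schema)
result.show(truncate=False)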

How to convert pandas code using .str and .split to Pyspark

I wrote the following code using pandas:
df['last_two'] = df['text'].str[-2:]
df['before_hyphen'] = df['text'].str.split('-').str[0]
df['new_text'] = df['before_hyphen'].astype(str) + "-" + df['last_two'].astype(str)
But when I run it on a spark dataframe I get the following error:
TypeError: startPos and length must be the same type
I know I could just convert the df to pandas, run the code, and then convert it back to a spark df, but I wonder if there's a better way? Thanks
You can try the string functions below:
import pyspark.sql.functions as F
df2 = df.withColumn(
    'last_two', F.expr('substring(text, -2)')
).withColumn(
    'before_hyphen', F.substring_index('text', '-', 1)
).withColumn(
    'new_text', F.concat_ws('-', 'before_hyphen', 'last_two')
)
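As a quick check with a made-up value (just to illustrate the three functions), a text value of 'abcdef-xyz' gives last_two = 'yz', before_hyphen = 'abcdef' and new_text = 'abcdef-yz':

import pyspark.sql.functions as F

df = spark.createDataFrame([('abcdef-xyz',)], ['text'])
df.withColumn('last_two', F.expr('substring(text, -2)')) \
  .withColumn('before_hyphen', F.substring_index('text', '-', 1)) \
  .withColumn('new_text', F.concat_ws('-', 'before_hyphen', 'last_two')) \
  .show()
# last_two = 'yz', before_hyphen = 'abcdef', new_text = 'abcdef-yz'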

Running Half life codes for a mean reverting series

I am currently trying to compute the half-life results for multiple columns of data. I have tried to incorporate the code I got from 'pythonforfinance.com' Link.
However, I seem to have missed a few edits, which is resulting in errors being thrown.
This is what my df looks like: Link
and the code I am running:
import pandas as pd
import numpy as np
import statsmodels.api as sm
df1=pd.read_excel('C:\\Users\Sai\Desktop\Test\Spreads.xlsx')
Halflife_results={}
for col in df1.columns.values:
    spread_lag = df1.shift(periods=1, axis=1)
    spread_lag.ix([0]) = spread_lag.ix([1])
    spread_ret = df1.columns - spread_lag
    spread_ret.ix([0]) = spread_ret.ix([1])
    spread_lag2 = sm.add_constant(spread_lag)
    md = sm.OLS(spread_ret, spread_lag2)
    mdf = md.fit()
    half_life = round(-np.log(2) / mdf.params[1], 0)
    print('half life:', half_life)
The error that is being thrown is:
File "C:/Users/Sai/Desktop/Test/Half life test 2.py", line 12
spread_lag.ix([0]) = spread_lag.ix([1])
^
SyntaxError: can't assign to function call
Based on the error message, I seem to have made a very basic mistake, but since I am a beginner I am not able to fix the issue. If not a solution to this code, an explanation of these lines of code would be of great help:
spread_lag = df1.shift(periods=1, axis=1)
spread_lag.ix([0]) = spread_lag.ix([1])
spread_ret = df1.columns - spread_lag
spread_ret.ix([0]) = spread_ret.ix([1])
spread_lag2 = sm.add_constant(spread_lag)
As explained by the error message, pd.Series.ix isn't callable: you should change spread_lag.ix([0]) to spread_lag.ix[0].
Also, you shouldn't shift on axis=1 (rows), since you're interested in differences along each column (axis=0, the default value).
Defining a get_halflife function then lets you apply it directly to each column, removing the need for a loop.
def get_halflife(s):
    s_lag = s.shift(1)
    s_lag.ix[0] = s_lag.ix[1]
    s_ret = s - s_lag
    s_ret.ix[0] = s_ret.ix[1]
    s_lag2 = sm.add_constant(s_lag)
    model = sm.OLS(s_ret, s_lag2)
    res = model.fit()
    halflife = round(-np.log(2) / res.params[1], 0)
    return halflife

df1.apply(get_halflife)
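Note that .ix has since been removed from pandas (1.0+); the same answer can be written with .iloc. A sketch assuming a current pandas and statsmodels:

def get_halflife(s):
    s_lag = s.shift(1)
    s_lag.iloc[0] = s_lag.iloc[1]
    s_ret = s - s_lag
    s_ret.iloc[0] = s_ret.iloc[1]
    s_lag2 = sm.add_constant(s_lag)
    res = sm.OLS(s_ret, s_lag2).fit()
    # params.iloc[1] is the coefficient on the lagged series
    return round(-np.log(2) / res.params.iloc[1], 0)

df1.apply(get_halflife)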
