Pyspark - Column transformation causes data shuffle - python

I am trying to transform data in Pyspark dataframe in order to export it.
I have arrays like [1,2,3], and I need to transform them into strings like "(1;2;3)".
The array elements need to be concatenated with ";", and parentheses should be added at the beginning and end of the result.
I also need to apply some regex.
Sample input would be:
col1      array1     col2
"First"   [1,2,3]    "a~"
"Second"  [4,5,6]    "b"
Expected output:
col1      array1     col2
"First"   "(1;2;3)"  "a"
"Second"  "(4;5;6)"  "b"
Actual (wrong) output:
col1      array1     col2
"First"   "(4;5;6)"  "a"
"Second"  "(X;X;X)"  "b"
where "(X;X;X)" would be data from another row.
I tried the following code:
from pyspark.sql import functions as F
from pyspark.sql.functions import col, concat, concat_ws, lit
from pyspark.sql.types import ArrayType

for c in df.columns:
    if isinstance(df.schema[c].dataType, ArrayType):
        print(c)
        df = df.withColumn(c, concat_ws(';', col(c))) \
               .withColumn(c, concat(lit("("), col(c), lit(")"))) \
               .withColumn(c, F.regexp_replace(c, '\n|\r|\\n|\\r|~|\\(\\)|', ''))
    else:
        df = df.withColumn(c, F.regexp_replace(c, '\n|\r|\\n|\\r|~|', ''))
I loop over every column of the PySpark DataFrame. If the column is an array, I concatenate it and apply the regexp; if not, I only apply the regexp.
The issue is that after those operations the data is shuffled across my columns, and I don't get the data I expected. For example, if column d had "b" as a value for a given row, it now has "c" or "d" for the same row.
How can I apply those transformations without "shuffling" the data?
I am not sure that looping over each column like this is good practice in PySpark, but I really need to apply my function to every column, checking whether it is an array or not to adapt the processing.

Based on your data, here is the dataframe:
a = [
    ("First", [1, 2, 3], "a~"),
    ("Second", [4, 5, 6], "b"),
]
b = "col1 array1 col2".split()
df = spark.createDataFrame(a, b)
df.show()
+------+---------+----+
| col1| array1|col2|
+------+---------+----+
| First|[1, 2, 3]| a~|
|Second|[4, 5, 6]| b|
+------+---------+----+
I tried your code. Nothing wrong:
from pyspark.sql import functions as F, types as T

for c in df.columns:
    if isinstance(df.schema[c].dataType, T.ArrayType):
        print(c)
        df = (
            df.withColumn(c, F.concat_ws(";", F.col(c)))
            .withColumn(c, F.concat(F.lit("("), F.col(c), F.lit(")")))
            .withColumn(c, F.regexp_replace(c, "\n|\r|\\n|\\r|~|\\(\\)|", ""))
        )
    else:
        df = df.withColumn(c, F.regexp_replace(c, "\n|\r|\\n|\\r|~|", ""))
df.show()
+------+-------+----+
| col1| array1|col2|
+------+-------+----+
| First|(1;2;3)| a|
|Second|(4;5;6)| b|
+------+-------+----+
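As a side note on the looping pattern the question asks about: chaining many withColumn calls does not reorder rows, but it does grow the query plan. An equivalent single-pass version built with one select might look like this (a sketch, reusing the F/T imports and the cleanup regex from above):
exprs = []
for c in df.columns:
    if isinstance(df.schema[c].dataType, T.ArrayType):
        # arrays: join with ";" and wrap in parentheses
        base = F.concat(F.lit("("), F.concat_ws(";", F.col(c)), F.lit(")"))
    else:
        base = F.col(c)
    # apply the cleanup regex to every column and keep the original name
    exprs.append(F.regexp_replace(base, "\n|\r|\\n|\\r|~", "").alias(c))

df.select(exprs).show()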

Related

Sort numpy arrays from .dat files in the right row of a pandas dataframe

I have a question about storing data from .dat files in the right row of a dataframe. Here is a minimal example.
I have already a dataframe like this:
data = {'col1': [1, 2, 3, 4],'col2': ["a", "b", "c", "d"]}
df = pd.DataFrame(data, index=['row_exp1','row_exp2','row_exp3','row_exp4'])
Now I want to add a new column called col3 with numpy arrays in each single cell. Thus, I will have 4 numpy arrays, one in every cell.
I get the numpy arrays from a .dat file.
The important part is that I have to find the right row. I have 4 .dat files, and every .dat file matches a row name. For example, the first .dat file has the name 230109_exp3_foo.dat, so this file matches the third row of my dataframe.
Then the algorithm has to put the data from the .dat file in the right cell:
          col1  col2  col3
row_exp1  1     a
row_exp2  2     b
row_exp3  3     c     [1,2,3,4,5,6]
row_exp4  4     d
The other entries should be NaN and I would fill them with the right numpy array in the next loop.
I think the difficult part is to select the right row and to match it with the file name of the .dat file.
If you're working with time series data, this isn't how you want to structure your dataframe. Read up on "tidy" data. (https://r4ds.had.co.nz/tidy-data.html)
Every column is a variable. Every row is an observation.
So let's assume you're loading your data with a function called load_data that accepts a file name:
def load_data(filename):
    # load the data, fill in your own details
    pass
Then you would build up your dataframe like this:
meta_data = {
    'col1': [1, 2, 3, 4],
    'col2': ["a", "b", "c", "d"],
}

list_of_dataframes = []
for n, fname in enumerate(filenames):
    this_array = load_data(fname)
    list_of_dataframes.append(
        pd.DataFrame({
            'row_num': list(range(len(this_array))),
            'col1': meta_data['col1'][n],
            'col2': meta_data['col2'][n],
            'values': this_array,
        })
    )

df = pd.concat(list_of_dataframes, ignore_index=True)
Maybe this helps:
# Do you have a similar pattern in each .dat file name? (I assume yes)
list_of_files = ['230109_exp3_foo.dat', '230109_exp2_foo.dat', '230109_exp1_foo.dat', '230109_exp4_foo.dat']

# for each index, try to find the value after "row_" in the file list
files_match = df.reset_index()['index'].map(lambda x: [y for y in list_of_files if x.replace('row_', '') in y])

# if I understand correctly, you know how to read a .dat file,
# so you can insert your function instead of function_for_reading_dat_file
df['col3'] = files_match.map(lambda x: function_for_reading_dat_file(x[0]) if len(x) != 0 else 'None')

how to use list comprehension variable names in Pyspark dataframes

I am trying to build a list comprehension that has an iteration built into it. However, I have not been able to get this to work. What am I doing wrong?
Here is a trivial representation of what I am trying to do.
dataframe columns = ["code_number_1", "code_number_2", "code_number_3", "code_number_4", "code_number_5", "code_number_6", "code_number_7", "code_number_8",
cols = [0,3,4]
result = df.select([code_number_{f"{x}" for x in cols])
Addendum:
My ultimate goal is to do something like this:
col_buckets = ["code_1", "code_2", "code_3"]
amt_buckets = ["code_1_amt", "code_2_amt", "code_3_amt" ]
result = df.withColumn("max_amt_{col_index}", max(df.select(max(**amt_buckets**) for col_indices of amt_buckets if ***any of col indices of col_buckets*** =='01')))
[code_number_{f"{x}" for x in cols] is not valid list comprehension syntax.
Instead, ["code_number_"+str(x) for x in cols] generates the list of column names ['code_number_0', 'code_number_3', 'code_number_4'].
.select accepts strings or columns as arguments to select the matching fields from the dataframe.
Example:
from pyspark.sql.functions import col

df = spark.createDataFrame([("a","b","c","d","e")], ["code_number_0","code_number_1","code_number_2","code_number_3","code_number_4"])
cols = [0, 3, 4]

# passing strings to select
result = df.select(["code_number_" + str(x) for x in cols])

# or passing columns to select
result = df.select([col("code_number_" + str(x)) for x in cols])

result.show()
#+-------------+-------------+-------------+
#|code_number_0|code_number_3|code_number_4|
#+-------------+-------------+-------------+
#| a| d| e|
#+-------------+-------------+-------------+
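As a small aside, on Python 3.6+ the same column names can be built with an f-string instead of string concatenation; this is purely a stylistic alternative:
result = df.select([f"code_number_{x}" for x in cols])
result.show()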

Apply a function to all cells in Spark DataFrame

I'm trying to convert some Pandas code to Spark for scaling. myfunc is a wrapper to a complex API that takes a string and returns a new string (meaning I can't use vectorized functions).
def myfunc(ds):
    for attribute, value in ds.items():
        value = api_function(attribute, value)
        ds[attribute] = value
    return ds

df = df.apply(myfunc, axis='columns')
myfunc takes a DataSeries, breaks it up into individual cells, calls the API for each cell, and builds a new DataSeries with the same column names. This effectively modifies all cells in the DataFrame.
I'm new to Spark and I want to translate this logic using pyspark. I've converted my pandas DataFrame to Spark:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName('My app').getOrCreate()
spark_schema = StructType([StructField(c, StringType(), True) for c in df.columns])
spark_df = spark.createDataFrame(df, schema=spark_schema)
This is where I get lost. Do I need a UDF, a pandas_udf? How do I iterate across all cells and return a new string for each using myfunc? spark_df.foreach() doesn't return anything and it doesn't have a map() function.
I can modify myfunc from DataSeries -> DataSeries to string -> string if necessary.
Option 1: Use a UDF on One Column at a Time
The simplest approach would be to rewrite your function to take a string as an argument (so that it is string -> string) and use a UDF. There's a nice example here. This works on one column at a time. So, if your DataFrame has a reasonable number of columns, you can apply the UDF to each column one at a time:
from pyspark.sql.functions import col
new_df = df.select(udf(col("col1")), udf(col("col2")), ...)
Example
df = sc.parallelize([[1, 4], [2,5], [3,6]]).toDF(["col1", "col2"])
df.show()
+----+----+
|col1|col2|
+----+----+
| 1| 4|
| 2| 5|
| 3| 6|
+----+----+
def plus1_udf(x):
    return x + 1
plus1 = spark.udf.register("plus1", plus1_udf)
new_df = df.select(plus1(col("col1")), plus1(col("col2")))
new_df.show()
+-----------+-----------+
|plus1(col1)|plus1(col2)|
+-----------+-----------+
| 2| 5|
| 3| 6|
| 4| 7|
+-----------+-----------+
Option 2: Map the entire DataFrame at once
map is available for Scala DataFrames, but, at the moment, not in PySpark.
The lower-level RDD API does have a map function in PySpark. So, if you have too many columns to transform one at a time, you could operate on every single cell in the DataFrame like this:
def map_fn(row):
    return [api_function(x) for (column, x) in row.asDict().items()]

column_names = df.columns
new_df = df.rdd.map(map_fn).toDF(column_names)
Example
df = sc.parallelize([[1, 4], [2,5], [3,6]]).toDF(["col1", "col2"])
def map_fn(row):
    return [value + 1 for (_, value) in row.asDict().items()]
columns = df.columns
new_df = df.rdd.map(map_fn).toDF(columns)
new_df.show()
+----+----+
|col1|col2|
+----+----+
| 2| 5|
| 3| 6|
| 4| 7|
+----+----+
Context
The documentation of foreach only gives the example of printing, but we can verify by looking at the code that it indeed does not return anything.
You can read about pandas_udf in this post, but it seems that it is most suited to vectorized functions, which, as you pointed out, you can't use because of api_function.
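For context, a pandas_udf receives whole pandas Series (one batch per call) rather than single values, which is why it mainly pays off for vectorized functions. A minimal sketch of the shape it expects, using an illustrative function rather than anything from the question:
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType

@pandas_udf(StringType())
def upper_udf(s: pd.Series) -> pd.Series:
    # the whole column arrives as a pandas Series at once
    return s.str.upper()

# illustrative usage: spark_df.select(upper_udf("col1"))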
The solution is:
udf_func = udf(func, StringType())
for col_name in spark_df.columns:
    spark_df = spark_df.withColumn(col_name, udf_func(lit(col_name), col_name))
return spark_df.toPandas()
There are 3 key insights that helped me figure this out:
If you use withColumn with the name of an existing column (col_name), Spark "overwrites"/shadows the original column. This essentially gives the appearance of editing the column directly as if it were mutable.
By creating a loop across the original columns and reusing the same DataFrame variable spark_df, I use the same principle to simulate a mutable DataFrame, creating a chain of column-wise transformations, each time "overwriting" a column (per #1 - see below)
Spark UDFs expect all parameters to be Column types, which means it attempts to resolve column values for each parameter. Because api_function's first parameter is a literal value that will be the same for all rows in the vector, you must use the lit() function. Simply passing col_name to the function will attempt to extract the column values for that column. As far as I could tell, passing col_name is equivalent to passing col(col_name).
Assuming 3 columns 'a', 'b' and 'c', unrolling this concept would look like this:
spark_df = spark_df.withColumn('a', udf_func(lit('a'), 'a')) \
                   .withColumn('b', udf_func(lit('b'), 'b')) \
                   .withColumn('c', udf_func(lit('c'), 'c'))
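The same chain can also be collapsed into a single select, which keeps the plan flatter when there are many columns (a sketch, assuming udf_func is defined as above):
from pyspark.sql.functions import col, lit

spark_df = spark_df.select(
    [udf_func(lit(c), col(c)).alias(c) for c in spark_df.columns]
)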

PySpark: list column names based on characters in values

In PySpark, I am trying to clean a dataset. Some of the columns have unwanted characters (=" ") in their values. I read the dataset as a DataFrame and I already created a User Defined Function which can remove the characters successfully, but now I am struggling to write a script that can identify on which columns I need to apply the UserDefinedFunction. I only use the last row of the dataset, assuming the columns always contain similar entries.
DataFrame (df):
id value1 value2 value3
="100010" 10 20 ="30"
In Python, the following works:
columns_to_fix = []
for col in df:
    value = df[col][0]
    if type(value) == str and value.startswith('='):
        columns_to_fix.append(col)
I tried the following in PySpark, but this returns all the column names:
columns_to_fix = []
for x in df.columns:
    if df[x].like('%="'):
        columns_to_fix.append(x)
Desired output:
columns_to_fix: ['id', 'value3']
Once I have the column names in a list, I can use a for loop to fix the entries in the columns. I am very new to PySpark, so my apologies if this is a too basic question. Thank you so much in advance for your advice!
"I only use the last row of the dataset, assuming the columns always contains similar entries." Under that assumption, you could collect a single row and test if the character you are looking for is in there.
Also, note that you do not need a udf to replace = in your columns; you can use regexp_replace. A working example is given below, hope this helps!
import pyspark.sql.functions as F
df = spark.createDataFrame([['=123','456','789'], ['=456','789','123']], ['a', 'b','c'])
df.show()
# +----+---+---+
# | a| b| c|
# +----+---+---+
# |=123|456|789|
# |=456|789|123|
# +----+---+---+
# list all columns with '=' in it.
row = df.limit(1).collect()[0].asDict()
columns_to_replace = [i for i,j in row.items() if '=' in j]
for col in columns_to_replace:
    df = df.withColumn(col, F.regexp_replace(col, '=', ''))
df.show()
# +---+---+---+
# | a| b| c|
# +---+---+---+
# |123|456|789|
# |456|789|123|
# +---+---+---+

Fastest method of finding and replacing row-specific data in a pandas DataFrame

Given an example pandas DataFrame:
Index | sometext | a | ff |
0 'asdff' 'b' 'g'
1 'asdff' 'c' 'hh'
2 'aaf' 'd' 'i'
What would be the fastest way to replace all instances of the column names in the [sometext] field with the data in that column, where the values to replace are row specific?
i.e. the desired result from the above input would be:
Index | sometext | a | ff |
0 'bsdg' 'b' 'g'
1 'csdhh' 'c' 'hh'
2 'ddf' 'd' 'i'
note: there is no chance the replacement values would include column names.
I have tried iterating over the rows, but the execution time blows out as the length of the DataFrame and the number of replacement columns increase.
The Series.str.replace method looks at single values as well, so it would need to be run over each row.
We can do this:
df.apply(lambda x : pd.Series(x['sometext']).replace({'a':x['a'],'ff':x['ff']},regex=True),1)
Out[773]:
0
0 bsdg
1 csdhh
2 ddf
This way seems quite fast. See below for a brief discussion.
import re
df['new'] = df['sometext']
for v in ['a', 'ff']:
    df['new'] = df.apply(lambda x: re.sub(v, x[v], x['new']), axis=1)
Results:
sometext a ff new
0 asdff b g bsdg
1 asdff c hh csdhh
2 aaf d i ddf
Discussion:
I expanded the sample to 15,000 rows and this was the fastest approach by around 10x or more compared to the existing answers (although I suspect there might be even faster ways).
The fact that you want to use the columns to make row-specific substitutions is what complicates this answer (otherwise you would just do a simpler version of #wen's df.replace). As it is, both my approach and wen's require further code, although I think they work in more or less the same way.
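If you want to run a similar comparison yourself, one option is to tile the 3-row sample into a larger frame and time each method on it. A rough sketch, where run_approach is a placeholder for whichever method you are timing:
import timeit
import pandas as pd

d = {'sometext': ['asdff', 'asdff', 'aaf'], 'a': ['b', 'c', 'd'], 'ff': ['g', 'hh', 'i']}
big_df = pd.concat([pd.DataFrame(d)] * 5000, ignore_index=True)  # roughly 15,000 rows

# run_approach is hypothetical: wrap the method under test in a function taking a DataFrame
print(timeit.timeit(lambda: run_approach(big_df.copy()), number=3))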
I have the following:
import timeit
import pandas as pd

d = {'sometext': ['asdff', 'asdff', 'aaf'], 'a': ['b', 'c', 'd'], 'ff': ['g', 'hh', 'i']}
df = pd.DataFrame(data=d)

start = timeit.timeit()

def replace_single_string(row_label, original_column, final_column):
    result_1 = df.get_value(row_label, original_column)
    result_2 = df.get_value(row_label, final_column)
    if 'a' in result_1:
        df.at[row_label, original_column] = result_1.replace('a', result_2)
    else:
        pass
    return df

for i in df.index.values:
    df = replace_single_string(i, 'sometext', 'a')
print(df)

end = timeit.timeit()
print(end - start)
This ran in 0.000404119491577 seconds in Terminal.
The fastest method I found was to use the apply function in tandem with a replacer function that uses the basic str.replace() method. It's very fast, even with a for loop inside it, and it also allows for a dynamic number of columns:
def value_replacement(df_to_replace, replace_col):
    """ replace the <replace_col> column of a dataframe with the values in all other columns """
    cols = [col for col in df_to_replace.columns if col != replace_col]

    def replacer(rep_df):
        """ function to be used in the apply function """
        for col in cols:
            rep_df[replace_col] = \
                str(rep_df[replace_col]).replace(col.lower(), str(rep_df[col]))
        return rep_df[replace_col]

    df_to_replace[replace_col] = df_to_replace.apply(replacer, axis=1)
    return df_to_replace
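For reference, exercising this function on the sample frame from the question (a quick sketch) should reproduce the expected output:
import pandas as pd

d = {'sometext': ['asdff', 'asdff', 'aaf'], 'a': ['b', 'c', 'd'], 'ff': ['g', 'hh', 'i']}
df = pd.DataFrame(d)

print(value_replacement(df, 'sometext')['sometext'].tolist())
# expected: ['bsdg', 'csdhh', 'ddf']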
