Complex functions from python to pyspark - EDIT: concatenation problem (I think)

I'm trying to transform a pandas function over two dataframes to a pyspark function.
In particular, I have a dataframe of Keys and functions as strings, namely:
> mv
| Keys | Formula | label |
| ---- | ------- | ----- |
| key1 | 'val1 + val2 - val3' | name1 |
| key2 | 'val3 + val4' | name2 |
| key3 | 'val1 - val4' | name3 |
and a dataframe df:
> df
| Keys | Datetime | names | values |
| ---- | -------- | ----- | ------ |
| key1 | tmstmp1 | val1 | 0.3 |
| key1 | tmstmp1 | val2 | 0.4 |
| key1 | tmstmp1 | val3 | 0.2 |
| key1 | tmstmp1 | val4 | 1.2 |
| key1 | tmstmp2 | val1 | 0.5 |
| key2 | tmstmp1 | val1 | 1.1 |
| key2 | tmstmp2 | val3 | 1.0 |
| key2 | tmstmp1 | val3 | 1.3 |
and so on.
I've created two functions that read the formula, evaluate the string of the measure expression, and return a list of pandas.DataFrame objects that I concatenate at the end.
import re
import pandas as pd

def evaluate_vm(x, regex):
    # find the placeholders that regex matches inside the formula string
    m = re.findall(regex, x)
    to_replace = ['#[' + i + ']' for i in m]
    replaces = [i.split(', ') for i in m]
    replacement = ["df.loc[(df.Keys == %s) & (df.names == %s), ['Datetime', 'values']].dropna().set_index('Datetime')" % tuple(i)
                   for i in replaces]
    # substitute every placeholder with the corresponding lookup on df, then evaluate the expression
    for i in range(len(to_replace)):
        x = x.replace(to_replace[i], replacement[i])
    return eval(x)

def _mv_(X):
    formula = evaluate_vm(X.Formula, regex)  # regex is defined elsewhere (code simplified for the question)
    formula['Keys'] = X.Keys
    formula.reset_index(inplace=True)
    formula.rename_axis({'Formula': 'Values'}, axis=1, inplace=True)
    return formula[['Keys', 'Datetime', 'names', 'Values']]
After that my code is
res = pd.concat([_mv_(mv.loc[i]) for i in mv.index])
and res is what I need to obtain.
NOTE: I've slightly modified the functions and inputs to make them understandable; in any case, I don't think the problem lies here.
Here's the thing. I'm trying to transform this into PySpark.
This is the code I've written so far.
from pyspark.sql.functions import pandas_udf, PandasUDFType, struct
from pyspark.sql.types import FloatType, StringType, IntegerType, TimestampType, StructType, StructField

EvaluateVM = pandas_udf(lambda x: _mv_(x),
                        functionType=PandasUDFType.GROUPED_MAP,
                        returnType=StructType([StructField("Keys", StringType(), False),
                                               StructField("Datetime", TimestampType(), False),
                                               StructField("names", StringType(), False),
                                               StructField("Values", FloatType(), False)])
                        )
res = EvaluateVM(struct([mv[i] for i in mv.columns]))
That is "almost" working: when I print res here's the result.
> res
Column<<lambda>(named_struct(Keys, Keys, Formula, Formula))>
And I can't see inside res: I think it created something like a python iterable, but I would like to have the same result I get in pandas.
What should I do? Did I get it all wrong?
EDIT: I think the problem might be that in pandas I build a list of dataframes that I concatenate after evaluating them, while in PySpark I'm using something like apply(_mv_, axis = 1). That syntax gave me an error even in pandas (something like "cannot concatenate a dataframe of dimension 192 into one of size 5"), and my workaround was pandas.concat([…]). I don't know whether that works in PySpark too, or whether there's a way to avoid it.
EDIT 2: Sorry, I didn't write the expected output:
| Keys | Datetime | label | values |
| ---- | -------- | ----- | ------ |
| key1 | tmstmp1 | name1 | 0.3 + 0.4 - 0.2 |
| key1 | tmstmp1 | name2 | 0.2 + 1.2 |
and so on. The values column should contain the numeric result; I wrote the operands here to make it easier to follow.
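For reference, in Spark 2.3/2.4 a GROUPED_MAP pandas UDF is not called like an ordinary column expression; it is passed to groupBy().apply(), which hands each group to the function as a pandas DataFrame and concatenates the returned frames, much like the pandas.concat above. Below is a minimal sketch of that pattern, assuming a Spark DataFrame sdf with the same columns as df; the per-group body is only a placeholder standing in for the real _mv_ logic.
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import DoubleType, StringType, StructField, StructType, TimestampType

# output schema of the grouped-map UDF (DoubleType chosen here to match pandas float64 data)
schema = StructType([
    StructField("Keys", StringType(), False),
    StructField("Datetime", TimestampType(), False),
    StructField("names", StringType(), False),
    StructField("Values", DoubleType(), False),
])

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def evaluate_vm_group(pdf):
    # pdf is a pandas DataFrame holding every row of one key.
    # The real per-key logic (e.g. _mv_) would go here and must return a
    # pandas DataFrame matching `schema`; this is just a placeholder.
    out = pdf.rename(columns={"values": "Values"})
    return out[["Keys", "Datetime", "names", "Values"]]

# sdf is assumed to be the Spark equivalent of df; the result is a regular Spark DataFrame
res = sdf.groupBy("Keys").apply(evaluate_vm_group)
res.show()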

Related

Comparing two Dataframes and creating a third one where certain conditions are met

I am trying to compare two different dataframes that have the same column names and indexes (not numerical), and I need to obtain a third df that, for each row and column, holds the bigger of the two values.
Example
df1 =
| | col_1 | col2 | col-3 |
| --- | --- | --- | --- |
| rft_12312 | 4 | 7 | 4 |
| rft_321321 | 3 | 4 | 1 |
df2 =
| | col_1 | col2 | col-3 |
| --- | --- | --- | --- |
| rft_12312 | 7 | 3 | 4 |
| rft_321321 | 3 | 7 | 6 |
Required result
| | col_1 | col2 | col-3 |
| --- | --- | --- | --- |
| rft_12312 | 7 (because df2's value in this [row : column] > df1's value) | 7 | 4 |
| rft_321321 | 3 (when they are equal it doesn't matter which dataframe the value comes from) | 7 | 6 |
I've already tried pd.update with filter_func defined as:
def filtration_function(val1, val2):
    if val1 >= val2:
        return val1
    else:
        return val2
but it is not working. I need the check to run for each pair of columns with the same name.
I also tried pd.compare, but it does not allow me to pick the right values.
Thank you in advance :)
I think one possibility would be to use combine. This method applies a function to each pair of same-named columns from the two dataframes, so it can be used to keep the maximum of each element.
Example:
import pandas as pd
import numpy as np

def filtration_function(col1, col2):
    # combine passes each pair of columns as Series, so take the element-wise maximum
    return np.maximum(col1, col2)

result = df1.combine(df2, filtration_function)
I think the where method can work too:
import pandas as pd
result = df1.where(df1 >= df2, df2)
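For a quick, self-contained check, here is the same idea run on the frames from the question (values and index labels copied from the tables above; numpy.maximum provides the element-wise maximum):
import numpy as np
import pandas as pd

idx = ["rft_12312", "rft_321321"]
df1 = pd.DataFrame({"col_1": [4, 3], "col2": [7, 4], "col-3": [4, 1]}, index=idx)
df2 = pd.DataFrame({"col_1": [7, 3], "col2": [3, 7], "col-3": [4, 6]}, index=idx)

print(df1.combine(df2, np.maximum))  # column-by-column, element-wise maximum
print(df1.where(df1 >= df2, df2))    # same result using where
#             col_1  col2  col-3
# rft_12312       7     7      4
# rft_321321      3     7      6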

Check if PySpark column values exist in another dataframe's column values

I'm trying to figure out the condition to check if the values of one PySpark dataframe exist in another PySpark dataframe, and if so extract the value and compare again. I was thinking of doing a multiple withColumn() with a when() function.
For example my two dataframes can be something like:
df1
| id | value |
| ----- | ---- |
| hello | 1111 |
| world | 2222 |
df2
| id | value |
| ------ | ---- |
| hello | 1111 |
| world | 3333 |
| people | 2222 |
And the result I wish to obtain is to check first whether the value of df1.id exists in df2.id and, if it does, return the df2.value. For example, I was trying something like:
df1 = df1.withColumn("df2_value", when(df1.id == df2.id, df2.value))
So I get something like:
df1
| id | value | df2_value |
| ----- | ---- | --------- |
| hello | 1111 | 1111 |
| world | 2222 | 3333 |
So that now I can do another check between these two value columns in the df1 dataframe and return a boolean column (1 or 0) in a new dataframe.
The result I wish to get would be something like:
df3
| id | value | df2_value | match |
| ----- | ---- | --------- | ----- |
| hello | 1111 | 1111 | 1 |
| world | 2222 | 3333 | 0 |
Left join df1 with df2 on id after prefixing all df2 columns except id with df2_*:
from pyspark.sql import functions as F

df1 = spark.createDataFrame([("hello", 1111), ("world", 2222)], ["id", "value"])
df2 = spark.createDataFrame([("hello", 1111), ("world", 3333), ("people", 2222)], ["id", "value"])

df = df1.join(
    df2.select("id", *[F.col(c).alias(f"df2_{c}") for c in df2.columns if c != 'id']),
    ["id"],
    "left"
)
Then using functools.reduce you can construct a boolean expression to check if columns match in the 2 dataframes like this:
from functools import reduce

check_expr = reduce(
    lambda acc, x: acc & (F.col(x) == F.col(f"df2_{x}")),
    [c for c in df1.columns if c != 'id'],
    F.lit(True)
)

df.withColumn("match", check_expr.cast("int")).show()
#+-----+-----+---------+-----+
#|   id|value|df2_value|match|
#+-----+-----+---------+-----+
#|hello| 1111|     1111|    1|
#|world| 2222|     3333|    0|
#+-----+-----+---------+-----+

Create a dataframe by iterating over column of list in another dataframe

In pyspark, I have a DataFrame with a column that contains a list of ordered nodes to go through:
osmDF.schema
Out[1]:
StructType(List(StructField(id,LongType,true),
StructField(nodes,ArrayType(LongType,true),true),
StructField(tags,MapType(StringType,StringType,true),true)))
osmDF.head(3)
Out[2]:
| id | nodes | tags |
|-----------|-----------------------------------------------------|---------------------|
| 62960871 | [783186590,783198852] | "{""foo"":""bar""}" |
| 211528816 | [2215187080,2215187140,2215187205,2215187256] | "{""foo"":""boo""}" |
| 62960872 | [783198772,783183397,783167527,783169067,783198772] | "{""foo"":""buh""}" |
I need to create a dataframe with a row for each consecutive pair of nodes in the list of nodes, then save it as parquet.
The expected result will have a length of n-1, with n = len(nodes) for each row. It would look like this (with other columns that I'll add):
| id | from | to | tags |
|-----------------------|------------|------------|---------------------|
| 783186590_783198852 | 783186590 | 783198852 | "{""foo"":""bar""}" |
| 2215187080_2215187140 | 2215187080 | 2215187140 | "{""foo"":""boo""}" |
| 2215187140_2215187205 | 2215187140 | 2215187205 | "{""foo"":""boo""}" |
| 2215187205_2215187256 | 2215187205 | 2215187256 | "{""foo"":""boo""}" |
| 783198772_783183397 | 783198772 | 783183397 | "{""foo"":""buh""}" |
| 783183397_783167527 | 783183397 | 783167527 | "{""foo"":""buh""}" |
| 783167527_783169067 | 783167527 | 783169067 | "{""foo"":""buh""}" |
| 783169067_783198772 | 783169067 | 783198772 | "{""foo"":""buh""}" |
I tried to start with the following:
from pyspark.sql.functions import udf

def split_ways_into_arcs(row):
    arcs = []
    for node in range(len(row['nodes']) - 1):
        arc = dict()
        arc['id'] = str(row['nodes'][node]) + "_" + str(row['nodes'][node + 1])
        arc['from'] = row['nodes'][node]
        arc['to'] = row['nodes'][node + 1]
        arc['tags'] = row['tags']
        arcs.append(arc)
    return arcs

# Declare function as udf
split = udf(lambda row: split_ways_into_arcs(row.asDict()))
The issue I'm having is I don't know how many nodes there are in each row of the original DataFrame.
I know how to apply a udf to add a column to an existing DataFrame, but not to create a new one from lists of dicts.
Iterate over the nodes array using transform and explode the array afterwards:
from pyspark.sql import functions as F
df = ...
df.withColumn("nodes", F.expr("transform(nodes, (n,i) -> named_struct('from', nodes[i], 'to', nodes[i+1]))")) \
.withColumn("nodes", F.explode("nodes")) \
.filter("not nodes.to is null") \
.selectExpr("concat_ws('_', nodes.to, nodes.from) as id", "nodes.*", "tags") \
.show(truncate=False)
Output:
+---------------------+----------+----------+-----------------+
|id |from |to |tags |
+---------------------+----------+----------+-----------------+
|783198852_783186590 |783186590 |783198852 |{""foo"":""bar""}|
|2215187140_2215187080|2215187080|2215187140|{""foo"":""boo""}|
|2215187205_2215187140|2215187140|2215187205|{""foo"":""boo""}|
|2215187256_2215187205|2215187205|2215187256|{""foo"":""boo""}|
|783183397_783198772 |783198772 |783183397 |{""foo"":""buh""}|
|783167527_783183397 |783183397 |783167527 |{""foo"":""buh""}|
|783169067_783167527 |783167527 |783169067 |{""foo"":""buh""}|
|783198772_783169067 |783169067 |783198772 |{""foo"":""buh""}|
+---------------------+----------+----------+-----------------+
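For completeness, the udf route started in the question can also be finished by declaring the return type as an array of structs and exploding it afterwards. A hedged sketch follows, reusing split_ways_into_arcs from the question; the tags type follows the MapType in the schema shown above:
from pyspark.sql import functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, LongType, MapType, StringType, StructField, StructType

# each arc is returned by split_ways_into_arcs as a dict, so the UDF can
# declare an array of matching structs as its return type
arc_type = ArrayType(StructType([
    StructField("id", StringType()),
    StructField("from", LongType()),
    StructField("to", LongType()),
    StructField("tags", MapType(StringType(), StringType())),
]))

split = udf(lambda row: split_ways_into_arcs(row.asDict()), arc_type)

# pack every column into a struct, explode the list of arcs, and expand the struct fields
arcs_df = (osmDF
           .withColumn("arc", F.explode(split(F.struct(*osmDF.columns))))
           .select("arc.*"))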

Replacing a string with a different string in pandas depending on value

I am practicing pandas and I have an exercise that I'm having a problem with.
I have an Excel file with a column where two types of URLs are stored.
df = pd.DataFrame({'id': [None, None, None, None],
                   'url': ['www.something/12312', 'www.something/12343', 'www.somethingelse/42312', 'www.somethingelse/62343']})
| id | url |
| -------- | -------------- |
| | 'www.something/12312' |
| | 'www.something/12343' |
| | 'www.somethingelse/42312' |
| | 'www.somethingelse/62343' |
I am supposed to transform this into ids, but only the number should be part of the id. The new id column should look like this:
df = pd.DataFrame({'id': ['id_12312', 'id_12343', 'diffid_42312', 'diffid_62343'], 'url': ['www.something/12312', 'www.something/12343', 'www.somethingelse/42312', 'www.somethingelse/62343']})
| id | url |
| -------- | -------------- |
| id_12312 | 'www.something/12312' |
| id_12343 | 'www.something/12343' |
| diffid_42312 | 'www.somethingelse/42312' |
| diffid_62343 | 'www.somethingelse/62343' |
My problem is how to get only the numbers and build that kind of id from them.
I have tried the replace and extract functions in pandas:
id_replaced = df.replace(regex={re.search('something', df['url']): 'id_' + str(re.search(r'\d+', i).group()), re.search('somethingelse', df['url']): 'diffid_' + str(re.search(r'\d+', i).group())})
df['id'] = df['url'].str.extract(re.search(r'\d+', df['url']).group())
However, they are throwing an error TypeError: expected string or bytes-like object.
Sorry for the tables in codeblock. The page was screaming that I have code that is not properly formatted when it was not in a codeblock.
Here is one solution, but I don't quite understand when you use the id prefix and when to use diffid.
>>> df.id = 'id_' + df.url.str.split('/', n=1, expand=True)[1]
>>> df
         id                      url
0  id_12312      www.something/12312
1  id_12343      www.something/12343
2  id_42312  www.somethingelse/42312
3  id_62343  www.somethingelse/62343
Or using str.extract:
>>> df.id = 'id_' + df.url.str.extract(r'/(\d+)$', expand=False)
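To also produce the two different prefixes asked about in the question, one possible sketch (assuming URLs containing "somethingelse" get diffid_ and the rest get id_):
import numpy as np
import pandas as pd

df = pd.DataFrame({"url": ["www.something/12312", "www.something/12343",
                           "www.somethingelse/42312", "www.somethingelse/62343"]})

# extract the trailing digits once, then pick the prefix per row
num = df["url"].str.extract(r"(\d+)$", expand=False)
df["id"] = np.where(df["url"].str.contains("somethingelse"), "diffid_" + num, "id_" + num)
print(df[["id", "url"]])
#              id                      url
# 0      id_12312      www.something/12312
# 1      id_12343      www.something/12343
# 2  diffid_42312  www.somethingelse/42312
# 3  diffid_62343  www.somethingelse/62343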

Pandas: Make the value of one column equal to the value of another

Hopefully a very simple question from a Pandas newbie.
How can I make the value of one column equal the value of another in a dataframe? Replace the value in every row. No conditionals, etc.
Context:
I have two CSV's, loaded into dataframe 'a' and dataframe 'b' respectively.
These CSVs are basically the same, except 'a' has a field that was improperly carried forward from another process - floats were rounded to ints. Not my script, can't influence it, I just have the CSVs now.
In reality I probably have 2mil rows and about 60-70 columns in the merged dataframe - so if it's possible to address the columns by their header (in the example these are Col1 and xyz_Col1), that would sure help.
I have joined the CSVs on their common field, so now I have a scenario where I have a dataframe that can be represented by the following:
+--------+------+--------+------------+----------+----------+
| CellID | Col1 | Col2 | xyz_CellID | xyz_Col1 | xyz_Col2 |
+--------+------+--------+------------+----------+----------+
| 1 | 0 | apple | 1 | 0.23 | apple |
| 2 | 0 | orange | 2 | 0.45 | orange |
| 3 | 1 | banana | 3 | 0.68 | banana |
+--------+------+--------+------------+----------+----------+
The result should be such that Col1 = xyz_Col1:
+--------+------+--------+------------+----------+----------+
| CellID | Col1 | Col2 | xyz_CellID | xyz_Col1 | xyz_Col2 |
+--------+------+--------+------------+----------+----------+
| 1 | 0.23 | apple | 1 | 0.23 | apple |
| 2 | 0.45 | orange | 2 | 0.45 | orange |
| 3 | 0.68 | banana | 3 | 0.68 | banana |
+--------+------+--------+------------+----------+----------+
What I have in code so far:
import pandas as pd

a = pd.read_csv('csv1.csv')
b = pd.read_csv('csv2.csv')
#b = b.dropna(axis=1)  # drop any unnamed fields

# define 'b' cols by adding an xyz_ prefix, as xyz is unique
b = b.add_prefix('xyz_')

# join the dataframes into a new dataframe named merged
merged = pd.merge(a, b, left_on='CellID', right_on='xyz_CellID')
merged.head(5)

# This is where the xyz_Col1 to Col1 code goes...

# drop unwanted cols
merged = merged[merged.columns.drop(list(merged.filter(regex='xyz')))]

# output to file
merged.to_csv("output.csv", index=False)
Thanks
merged['Col1'] = merged['xyz_Col1']
or
merged.loc[:, 'Col1'] = merged.loc[:, 'xyz_Col1']
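As a quick check, the assignment applied to the merged frame from the question gives the expected result:
import pandas as pd

merged = pd.DataFrame({
    "CellID": [1, 2, 3],
    "Col1": [0, 0, 1],
    "Col2": ["apple", "orange", "banana"],
    "xyz_CellID": [1, 2, 3],
    "xyz_Col1": [0.23, 0.45, 0.68],
    "xyz_Col2": ["apple", "orange", "banana"],
})

merged["Col1"] = merged["xyz_Col1"]
print(merged[["CellID", "Col1", "Col2"]])
#    CellID  Col1    Col2
# 0       1  0.23   apple
# 1       2  0.45  orange
# 2       3  0.68  banana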
