How can I replace a value with the value above it in a dataframe? - python

There is a dataframe as below:
|   | Name | Korean | English | Math | highschool |
|---|------|--------|---------|------|------------|
| 0 | YB   | 100    | 100     | 100  | A          |
| 1 | SW   | 90     | 90      | 90   | B          |
| 2 | EJ   | 80     | 80      | 80   | C          |
| 3 | EJ   | 70     | 70      | 70   | D          |
I would like to change the highschool value ("D") to the value above it ("C") when the Name ("EJ") is the same.
|   | Name | Korean | English | Math | highschool |
|---|------|--------|---------|------|------------|
| 0 | YB   | 100    | 100     | 100  | A          |
| 1 | SW   | 90     | 90      | 90   | B          |
| 2 | EJ   | 80     | 80      | 80   | C          |
| 3 | EJ   | 70     | 70      | 70   | C          |
How can I solve this?

If there are other repeated values in the Name column, you can use this method to replace all of the highschool values at once:
# take the first highschool value seen for each Name
df1 = df.groupby("Name")["highschool"].first()
# align on Name so update() can match rows, then write the values back
df.set_index('Name', inplace=True)
df.update(df1)
df.reset_index(inplace=True)
Replace df with the name of your dataframe.
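As a possible alternative (my own sketch, not part of the answer above), groupby().transform('first') returns a column aligned to the original index, so no index juggling is needed; this assumes the first row of each Name group carries the value you want to propagate:
# broadcast the first highschool value of each Name group to all rows of that group
df["highschool"] = df.groupby("Name")["highschool"].transform("first")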

Related

How to remove some rows in a dataframe which are not in a list

I have a pandas data frame, let's call it df, looking like this:
| Rank | Country   | All | Agr | Ind | Dom |
|------|-----------|-----|-----|-----|-----|
| 1    | Argentina | 2   | 3   | 1   | 5   |
| 4    | Chile     | 3   | 3   | 4   | 3   |
| 3    | Colombia  | 1   | 2   | 1   | 4   |
| 4    | Mexico    | 3   | 5   | 4   | 2   |
| 3    | Panama    | 2   | 1   | 5   | 4   |
| 2    | Peru      | 3   | 3   | 4   | 2   |
I want to remove the rows that are not in the following list:
paises = ["Colombia", "Peru", "Chile"]
For that I tried this code:
df = df.drop(df["Country"]= paises)
But it did not work, because they do not have the same length.
I'm quite new to Python. Can you help me?
Use isin:
df = df[~df['Country'].isin(paises)]
Output:
>>> df
Rank Country All Agr Ind Dom
0 1 Argentina 2 3 1 5
3 4 Mexico 3 5 4 2
4 3 Panama 2 1 5 4
Use pandas.DataFrame.isin().
Create an appropriate DataFrame mask using the list of countries you don't want in your DataFrame.
paises = ["Colombia", "Peru", "Chile"]
df[df['Country'].isin(paises) == False]
Use the mask to assign the filtered DataFrame back to df, which excludes the paises.
paises = ["Colombia", "Peru", "Chile"]
df = df[df['Country'].isin(paises) == False]
>>> df
Rank Country All Agr Ind Dom
0 1 Argentina 2 3 1 5
3 4 Mexico 3 5 4 2
4 3 Panama 2 1 5 4
You can also use the negation operator "~" to select the rows whose values are not in the list: df = df[~df['Country'].isin(paises)]
You don't even need the hand of Maradona for this one.
I would use the following
df = df[~df["Country"].isin(paises)]
This uses the negation operator ~.
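For completeness, here is a small self-contained sketch (my own, with the sample data typed in from the question) showing both directions of the filter; which one you want depends on whether paises lists the countries to keep or the countries to drop:
import pandas as pd

df = pd.DataFrame({
    "Rank":    [1, 4, 3, 4, 3, 2],
    "Country": ["Argentina", "Chile", "Colombia", "Mexico", "Panama", "Peru"],
    "All":     [2, 3, 1, 3, 2, 3],
    "Agr":     [3, 3, 2, 5, 1, 3],
    "Ind":     [1, 4, 1, 4, 5, 4],
    "Dom":     [5, 3, 4, 2, 4, 2],
})
paises = ["Colombia", "Peru", "Chile"]

kept = df[df["Country"].isin(paises)]       # keep only the listed countries
dropped = df[~df["Country"].isin(paises)]   # drop the listed countries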

Pyspark DataFrame Grouping by item that doesn't belong to the group

I am new to pyspark and am stuck in a situation; could you please help me obtain the result described below? My data looks like this:
| customer_id | item_id | amount |
|-------------|---------|--------|
| 1           | tx1     | 15     |
| 1           | tx2     | 10     |
| 1           | tx3     | 14     |
| 2           | tx1     | 15     |
| 2           | tx4     | 12     |
| 3           | tx2     | 10     |
| 2           | tx6     | 43     |
| 4           | tx4     | 12     |
| 5           | tx8     | 76     |
| 6           | tx6     | 43     |
| 5           | tx6     | 43     |
| 3           | tx6     | 43     |
And I want to know, for each item:
- the count of customers that didn't purchase this item
- the sum of the amounts spent by the customers that didn't purchase this item
So the final table would look like:
| item_id | target_cust | target_amount |
|---------|-------------|---------------|
| tx1     | 4           | 227           |
| tx2     | 4           | 201           |
| tx3     | 5           | 297           |
| tx4     | 4           | --            |
| tx6     | 3           | --            |
| tx8     | 5           | --            |
Please help me get a similar output; any suggestions or pointers in the right direction would be great.
First, group by customer_id to get the set of purchased item_ids along with the total amount spent, like this:
import pyspark.sql.functions as F
items_by_customer_df = df.groupBy("customer_id").agg(
    F.collect_set("item_id").alias("items"),
    F.sum("amount").alias("target_amount")
)
items_by_customer_df.show()
#+-----------+---------------+-------------+
#|customer_id|items |target_amount|
#+-----------+---------------+-------------+
#|1 |[tx1, tx2, tx3]|39 |
#|2 |[tx1, tx6, tx4]|70 |
#|3 |[tx2, tx6] |53 |
#|5 |[tx6, tx8] |119 |
#|4 |[tx4] |12 |
#|6 |[tx6] |43 |
#+-----------+---------------+-------------+
Now, join this grouped dataframe with the distinct item_ids from the original df, using the negation of array_contains as the join condition, then group by item_id and aggregate with count(customer_id) and sum(amount):
result = df.select("item_id").distinct().join(
    items_by_customer_df,
    ~F.array_contains("items", F.col("item_id"))
).groupBy("item_id").agg(
    F.count("customer_id").alias("target_cust"),
    F.sum("target_amount").alias("target_amount")
)
result.show()
#+-------+-----------+-------------+
#|item_id|target_cust|target_amount|
#+-------+-----------+-------------+
#| tx2| 4| 244|
#| tx4| 4| 254|
#| tx1| 4| 227|
#| tx8| 5| 217|
#| tx3| 5| 297|
#| tx6| 2| 51|
#+-------+-----------+-------------+
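If you want to reproduce the result, a minimal setup sketch (assuming an existing SparkSession named spark and the column names from the question) could look like this:
df = spark.createDataFrame(
    [(1, "tx1", 15), (1, "tx2", 10), (1, "tx3", 14),
     (2, "tx1", 15), (2, "tx4", 12), (3, "tx2", 10),
     (2, "tx6", 43), (4, "tx4", 12), (5, "tx8", 76),
     (6, "tx6", 43), (5, "tx6", 43), (3, "tx6", 43)],
    ["customer_id", "item_id", "amount"],
)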

Pyspark create column based on maximum of multiple columns that match a certain condition in corresponding columns

Let's say I have a Pyspark dataframe with id and 3 columns representing code buckets.
col_buckets = ["code_1", "code_2", "code_3"]
and 3 columns representing amounts for corresponding code buckets.
amt_buckets = ["code_1_amt", "code_2_amt", "code_3_amt" ]
Here is pseudocode for what I am trying to do:
for el in ['01', '06', '07']:
    df = df.withColumn(f"max_amt_{el}",
                       max(amt_buckets column at the indices where the corresponding col_buckets column == el))
How would I accomplish this?
Here is an example dataframe for this:
| Primary_id | Code_1 | Code_2 | Code_3 | Amt_1 | Amt_2 | Amt_3 | Max_01 | Max_07 | Max_06 |
|------------|--------|--------|--------|-------|-------|-------|--------|--------|--------|
| Xxxxx998   | Null   | 01     | 04     | 2000  | 1000  | 100   | 1000   | 0      | 0      |
| Xxxxx997   | 01     | 01     | 07     | 200   | 300   | 400   | 300    | 400    | 0      |
| Xxxxx996   | 07     | Null   | Null   | 100   | Null  | Null  | 0      | 100    | 0      |
| Xxxx910    | Null   | Null   | Null   | 300   | 100   | 200   | 0      | 0      | 0      |
I am trying to compute the Max_01, Max_07 and Max_06 columns.
For Spark 2.4+, you can try this.
df.show() #sample dataframe
#+----------+------+------+------+-----+-----+-----+
#|Primary_id|Code_1|Code_2|Code_3|Amt_1|Amt_2|Amt_3|
#+----------+------+------+------+-----+-----+-----+
#| Xxxxx998| null| 01| 04| 2000| 1000| 100|
#| Xxxxx997| 01| 01| 07| 200| 300| 400|
#| Xxxxx996| 07| null| null| 100| null| null|
#| Xxxx910| null| null| null| 300| 100| 200|
#+----------+------+------+------+-----+-----+-----+
from pyspark.sql import functions as F
dictionary = dict(zip(['Code_1','Code_2','Code_3'], ['Amt_1','Amt_2','Amt_3']))
df.withColumn("trial", F.array(*[F.array(F.col(x),F.col(y).cast("string"))\
for x,y in dictionary.items()]))\
.withColumn("Max_01",F.when(F.size(F.expr("""filter(trial,x-> exists(x,y->y='01'))"""))!=0,\
F.expr("""array_max(transform(filter(trial, x-> exists(x,y-> y='01')),z-> float(z[1])))"""))\
.otherwise(F.lit(0)))\
.withColumn("Max_06",F.when(F.size(F.expr("""filter(trial,x-> exists(x,y->y='06'))"""))!=0,\
F.expr("""array_max(transform(filter(trial, x-> exists(x,y-> y='06')),z-> float(z[1])))"""))\
.otherwise(F.lit(0)))\
.withColumn("Max_07",F.when(F.size(F.expr("""filter(trial,x-> exists(x,y->y='07'))"""))!=0,\
F.expr("""array_max(transform(filter(trial, x-> exists(x,y-> y='07')),z-> float(z[1])))"""))\
.otherwise(F.lit(0)))\
.drop("trial").show(truncate=False)
#+----------+------+------+------+-----+-----+-----+------+------+------+
#|Primary_id|Code_1|Code_2|Code_3|Amt_1|Amt_2|Amt_3|Max_01|Max_07|Max_06|
#+----------+------+------+------+-----+-----+-----+------+------+------+
#|Xxxxx998 |null |01 |04 |2000 |1000 |100 |1000 |0 |0 |
#|Xxxxx997 |01 |01 |07 |200 |300 |400 |300 |400 |0 |
#|Xxxxx996 |07 |null |null |100 |null |null |0 |100 |0 |
#|Xxxx910 |null |null |null |300 |100 |200 |0 |0 |0 |
#+----------+------+------+------+-----+-----+-----+------+------+------+
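If the Spark-SQL higher-order functions feel hard to read, an alternative sketch (my own, not the answer above) is to compare each code column to the target code with when() and take the row-wise maximum of the matching amounts with greatest(); this assumes the amounts can be cast to float and that 0 is an acceptable default when nothing matches:
from pyspark.sql import functions as F

pairs = list(zip(['Code_1', 'Code_2', 'Code_3'], ['Amt_1', 'Amt_2', 'Amt_3']))
for code in ['01', '06', '07']:
    # for each (code, amount) pair, keep the amount only if the code matches,
    # otherwise fall back to 0, then take the row-wise maximum of those values
    df = df.withColumn(
        "Max_" + code,
        F.greatest(*[F.when(F.col(c) == code, F.col(a).cast("float")).otherwise(F.lit(0))
                     for c, a in pairs])
    )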

get first numeric values from pyspark dataframe string column into new column

I have a pyspark dataframe like the input data below. I would like to create a new column, product1_num, that parses the first numeric value in each record of the productname column. Example output data is shown below. I'm not sure what's available in pyspark as far as string splitting and regex matching. Can anyone suggest how to do this with pyspark?
input data:
+------+-------------------+
|id |productname |
+------+-------------------+
|234832|EXTREME BERRY SAUCE|
|419836|BLUE KOSHER SAUCE |
|350022|GUAVA (1G) |
|123213|GUAVA 1G |
+------+-------------------+
output:
+------+-------------------+-------------+
|id |productname |product1_num |
+------+-------------------+-------------+
|234832|EXTREME BERRY SAUCE| |
|419836|BLUE KOSHER SAUCE | |
|350022|GUAVA (1G) |1 |
|123213|GUAVA G5 |5 |
|125513|3GULA G5 |3 |
|127143|GUAVA G50 |50 |
|124513|LAAVA C2L5 |2 |
+------+-------------------+-------------+
You can use regexp_extract:
from pyspark.sql import functions as F
df.withColumn("product1_num", F.regexp_extract("productname", "([0-9]+)",1)).show()
+------+-------------------+------------+
| id| productname|product1_num|
+------+-------------------+------------+
|234832|EXTREME BERRY SAUCE| |
|419836| BLUE KOSHER SAUCE| |
|350022| GUAVA (1G)| 1|
|123213| GUAVA G5| 5|
|125513| 3GULA G5| 3|
|127143| GUAVA G50| 50|
|124513| LAAVA C2L5| 2|
+------+-------------------+------------+
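One small follow-up worth noting: regexp_extract returns an empty string when there is no match, so if you need a numeric column you can cast the result, which turns those empty strings into nulls (a sketch, assuming the same df and column names):
df.withColumn("product1_num",
              F.regexp_extract("productname", r"(\d+)", 1).cast("int")).show()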

Pyspark: Concatenating values into lists within certain time range

I have a pyspark dataframe that contains id, timestamp and value columns. I am trying to create a dataframe that first groups rows with the same id, then separates the ones that are, say, more than 2 weeks apart, and finally concatenates their values into a list.
I have already tried the rangeBetween() window function, but it doesn't quite deliver what I want. I think the code below illustrates my question better:
My dataframe sdf:
+---+-------------------------+-----+
|id |tts |value|
+---+-------------------------+-----+
|0 |2019-01-01T00:00:00+00:00|a |
|0 |2019-01-02T00:00:00+00:00|b |
|0 |2019-01-20T00:00:00+00:00|c |
|0 |2019-01-25T00:00:00+00:00|d |
|1 |2019-01-02T00:00:00+00:00|a |
|1 |2019-01-29T00:00:00+00:00|b |
|2 |2019-01-01T00:00:00+00:00|a |
|2 |2019-01-30T00:00:00+00:00|b |
|2 |2019-02-02T00:00:00+00:00|c |
+---+-------------------------+-----+
My approach:
from pyspark.sql.window import Window
from pyspark.sql import functions as F
DAY_SECS = 3600 * 24
w_spec = Window \
    .partitionBy('id') \
    .orderBy(F.col('tts').cast('timestamp').cast('long')) \
    .rangeBetween(Window.currentRow - (14 * DAY_SECS), Window.currentRow)

out = sdf.withColumn('val_seq', F.collect_list('value').over(w_spec))
Output:
+---+-------------------------+-----+-------+
|id |tts |value|val_seq|
+---+-------------------------+-----+-------+
|0 |2019-01-01T00:00:00+00:00|a |[a] |
|0 |2019-01-02T00:00:00+00:00|b |[a, b] |
|0 |2019-01-20T00:00:00+00:00|c |[c] |
|0 |2019-01-25T00:00:00+00:00|d |[c, d] |
|1 |2019-01-02T00:00:00+00:00|a |[a] |
|1 |2019-01-29T00:00:00+00:00|b |[b] |
|2 |2019-01-01T00:00:00+00:00|a |[a] |
|2 |2019-01-30T00:00:00+00:00|b |[b] |
|2 |2019-02-02T00:00:00+00:00|c |[b, c] |
+---+-------------------------+-----+-------+
My desired output:
+---+-------------------------+---------+
|id |tts |val_seq|
+---+-------------------------+---------+
|0 |2019-01-02T00:00:00+00:00|[a, b] |
|0 |2019-01-25T00:00:00+00:00|[c, d] |
|1 |2019-01-02T00:00:00+00:00|[a] |
|1 |2019-01-29T00:00:00+00:00|[b] |
|2 |2019-01-30T00:00:00+00:00|[a] |
|2 |2019-02-02T00:00:00+00:00|[b, c] |
+---+-------------------------+---------+
To sum it up: I want to group rows in sdf with the same id, concatenate the values of rows that are no more than 2 weeks apart, and finally only show those rows.
I am really new to pyspark so any suggestions are appreciated!
The code below should work:
# reusing Window, F and DAY_SECS from the question above
w_spec = Window \
    .partitionBy('id') \
    .orderBy(F.col('tts').cast('timestamp').cast('long')) \
    .rangeBetween(Window.currentRow - (14 * DAY_SECS), Window.currentRow)
# rolling 14-day list plus a count of rows falling in each window
out1 = sdf.withColumn('val_seq', F.collect_list('value').over(w_spec)) \
          .withColumn('occurrences_in_window', F.count('tts').over(w_spec))
# keep, per id, the row(s) whose window contains the most values
w_spec2 = Window.partitionBy('id').orderBy(F.col('occurrences_in_window').desc())
out2 = out1.withColumn('rank', F.rank().over(w_spec2)).filter('rank == 1')
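If the goal is specifically the desired output above (one row per gap-separated group rather than per rolling window), a different approach may be worth sketching. The following is my own assumption-laden sketch, not the answer above: it flags a new group whenever the gap to the previous row exceeds 14 days, turns the flags into a per-id group number with a running sum, and then aggregates each group:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

DAY_SECS = 3600 * 24
ts = F.col('tts').cast('timestamp').cast('long')
w = Window.partitionBy('id').orderBy(ts)

# 1 when the gap to the previous row exceeds 14 days, else 0 (first row per id -> 0)
flagged = sdf.withColumn(
    'new_grp',
    F.coalesce((ts - F.lag(ts, 1).over(w) > 14 * DAY_SECS).cast('int'), F.lit(0))
)
# running sum of the flags gives a group number within each id
grouped = flagged.withColumn('grp', F.sum('new_grp').over(w))

# one row per group: the latest timestamp and the collected values
# (note: collect_list in a groupBy aggregation does not guarantee element order)
result = (grouped.groupBy('id', 'grp')
                 .agg(F.max('tts').alias('tts'),
                      F.collect_list('value').alias('val_seq'))
                 .drop('grp'))
result.show(truncate=False)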
