Row Operations in pyspark dataframe - python

I have a Spark DataFrame, like the one shown below.
I need an algorithm such that whenever I find 'M' in a row, I select the next two columns after the 'M' and create a new row from them. If there are two 'M's in a single row, I need to create two rows: one with the two columns following the first 'M' and a second with the two columns following the second 'M'.
Input Dataframe:
+------+---+------------+---+--------+--------+-----+---+------+---+----+----+----+
|rownum|_c0|_c1         |_c2|_c3     |_c4     |_c5  |_c6|_c7   |_c8|_c9 |_c10|_c11|
+------+---+------------+---+--------+--------+-----+---+------+---+----+----+----+
|1     |CS1|Jack Binions|19 |20191122|45796416|50021|M  |Drinks|30 |M   |Food|20  |
|2     |CS1|Jack Binions|19 |20191122|45794779|50022|M  |Food  |40 |M   |Bar |50  |M|Drinks|100|
+------+---+------------+---+--------+--------+-----+---+------+---+----+----+----+
New Output Dataframe:
+------+---+------------+---+--------+--------+-----+---+------+---+
|rownum|_c0|_c1         |_c2|_c3     |_c4     |_c5  |_c6|_c7   |_c8|
+------+---+------------+---+--------+--------+-----+---+------+---+
|1     |CS1|Jack Binions|19 |20191122|45796416|50021|M  |Drinks|30 |
|1     |CS1|Jack Binions|19 |20191122|45796416|50021|M  |Food  |20 |
|2     |CS1|Jack Binions|19 |20191122|45794779|50022|M  |Food  |40 |
|2     |CS1|Jack Binions|19 |20191122|45794779|50022|M  |Bar   |50 |
|2     |CS1|Jack Binions|19 |20191122|45794779|50022|M  |Drinks|100|
+------+---+------------+---+--------+--------+-----+---+------+---+
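A minimal sketch of one way to do this (not from the original thread; it assumes the fixed columns are rownum and _c0 to _c5, and that every (marker, item, amount) triple occupies three consecutive trailing columns): pack each triple into a struct, explode the array of structs, and keep only the triples whose marker is 'M'.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Toy version of the input; the real data would be read from its source.
df = spark.createDataFrame(
    [(1, "CS1", "Jack Binions", "19", "20191122", "45796416", "50021",
      "M", "Drinks", "30", "M", "Food", "20")],
    ["rownum", "_c0", "_c1", "_c2", "_c3", "_c4", "_c5",
     "_c6", "_c7", "_c8", "_c9", "_c10", "_c11"],
)

fixed = ["rownum", "_c0", "_c1", "_c2", "_c3", "_c4", "_c5"]
trailing = [c for c in df.columns if c not in fixed]

# Pack each (marker, item, amount) triple of trailing columns into a struct.
triples = [
    F.struct(
        F.col(trailing[i]).alias("marker"),
        F.col(trailing[i + 1]).alias("item"),
        F.col(trailing[i + 2]).alias("amount"),
    )
    for i in range(0, len(trailing) - 2, 3)
]

# One output row per triple whose marker is 'M'.
out = (
    df.withColumn("triple", F.explode(F.array(*triples)))
      .where(F.col("triple.marker") == "M")
      .select(*fixed,
              F.col("triple.marker").alias("_c6"),
              F.col("triple.item").alias("_c7"),
              F.col("triple.amount").alias("_c8"))
)
out.show(truncate=False)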

Related

How To Change Entire Row Background Color on Condition

I have 2 DataFrame objects, df and df2, each with its own set of data, and each has a column called 'Date Time'. df2 is a subset of df, and I am trying to highlight the entire row where df['Date Time'] == df2['Date Time'], because it is a huge set of data and I want to easily find where they match. I have already merged both sets of data (all df columns plus the df2 columns I want) into df; I removed the duplicated column in the merge, so the remaining columns sit in the correct rows.
Here is a little code snippet of what I have:
def highlight_row(rows):
    return ['background-color: green' for row in rows]

df[df['Date Time'] == df2['Date Time']].style.apply(highlight_row)
I'm getting the error ValueError: Can only compare identically-labeled Series objects, and I'm not sure what isn't identically labeled, or whether I have made a simple mistake. In theory it should just highlight every row of df that also appears in df2; further down the line I may need a more general condition, which is why I thought this approach could work.
As requested, here is some sample data:
df
| Date Time | Motion Detected | Alert |
| ---------------- | -------------- | ----- |
| 22-03-05 2:13:04 | False | No |
| 22-03-05 2:14:00 | True | Yes |
df2
| Date Time | WiFi Connection |
| --------- | --------------- |
| 22-03-05 2:14:00| Connected |
The actual data won't make sense to you, so here is example data with a similar purpose. Translated from my actual data into this example, the merged table in df looks like this:
| Date Time | Motion Detected | Alert | WiFi Connection |
| --------- | --------------- | ----- | --------------- |
| 22-03-05 2:13:04 | False | No | |
| 22-03-05 2:14:00 | True | Yes | Connected |
What I want is for any row (assume a larger set of data like this) where the 'Date Time' columns match (i.e., where there is data in 'WiFi Connection') to be highlighted a certain color.
Edit: In case the table formatting isn't showing properly, you can view the requested information here.
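One way to get there (a sketch, not from the original post, using the sample data above): apply the styling function row by row with axis=1 and test membership of each row's 'Date Time' in df2. This sidesteps the element-wise == between two differently-indexed Series, which is what raises the "identically-labeled" error.

import pandas as pd

df = pd.DataFrame({
    "Date Time": ["22-03-05 2:13:04", "22-03-05 2:14:00"],
    "Motion Detected": [False, True],
    "Alert": ["No", "Yes"],
})
df2 = pd.DataFrame({
    "Date Time": ["22-03-05 2:14:00"],
    "WiFi Connection": ["Connected"],
})

matches = set(df2["Date Time"])

def highlight_row(row):
    # Color every cell of the row if its 'Date Time' appears in df2.
    style = "background-color: green" if row["Date Time"] in matches else ""
    return [style] * len(row)

styled = df.style.apply(highlight_row, axis=1)
styled  # display in a notebook, or styled.to_html() in recent pandas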

Filtering dataframe based on other dataframe column on Python

I have two DataFrames. One contains multiple columns named after samples, with rows containing values. The second DataFrame contains a column called "Sample Name" which lists the names of the samples that pass quality control.
df1
| mz   | Sample 001 | Sample 002 | ... |
| ---- | ---------- | ---------- | --- |
| 234  | 3434       | 34545      |     |
| 4542 | 5656563    | 4545       |     |
df2
| Sample Name | RT |
| ----------- | -- |
| Sample001   | 8  |
| Sample002   | 8  |
...
df1 contains more than 2000 rows and 200 columns; df2 lists 180 sample names. I want to filter df1 to remove the columns that are NOT present in the df2 column "Sample Name".
The resulting DataFrame should be a version of df1 with only the 180 columns present in the df2 list.
See if this works:
for col in df1.columns:
    if col not in df2['Sample Name'].unique():
        df1.drop(columns=[col], inplace=True)
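A more concise alternative (a sketch, assuming the same df1/df2 names and that an identifier column such as 'mz' should also be kept): build the list of wanted columns once instead of dropping inside a loop.

# 'mz' is the identifier column from the example above; keep it plus every
# column whose name appears in df2['Sample Name'].
keep = set(df2['Sample Name'])
df1_filtered = df1[['mz'] + [c for c in df1.columns if c in keep]]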

AWS Glue add column value as a column in another DynamicFrame

I'm new to AWS Glue and PySpark, so I'm having some trouble with a transformation job. I have two DynamicFrames. One of them contains, in one of its columns, values that need to be added as separate columns in the other frame, and each value must go into the row whose id matches the id in the first table. Here's how it looks:
Table 1                 Table 2
+--+-----+-----+        +--+-----+-----+
|id|name |value|        |id|col1 |col2 |
+--+-----+-----+        +--+-----+-----+
| 1|name1|   10|        | 1|str1 |val1 |
| 2|name2|   20|        | 2|str2 |val2 |
+--+-----+-----+        +--+-----+-----+
I need the new format to be:
Table 2
+--+-----+-----+-----+-----+
|id|col1 |col2 |name1|name2|
+--+-----+-----+-----+-----+
| 1|str1 |val1 |   10|     |  <-- 10 only here, because id 1 in Table 1 matches this row's id
| 2|str2 |val2 |     |   20|  <-- 20 only here, because id 2 in Table 1 matches this row's id
+--+-----+-----+-----+-----+
Suppose the two DataFrames are named df1 and df2.
df3 = df1.groupBy('id').pivot('name').sum('value')
df4 = df2.join(df3, on='id', how='inner')
df4.show(truncate=False)
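To wire this into a Glue job (a sketch under assumptions: glueContext and the DynamicFrame variables dyf1 and dyf2 are hypothetical names standing in for whatever your script defines), convert the DynamicFrames to Spark DataFrames first, and convert the result back if the rest of the job expects a DynamicFrame:

from awsglue.dynamicframe import DynamicFrame

# dyf1, dyf2 and glueContext are assumed to already exist in the Glue job.
df1 = dyf1.toDF()   # Table 1: id, name, value
df2 = dyf2.toDF()   # Table 2: id, col1, col2

df3 = df1.groupBy('id').pivot('name').sum('value')
df4 = df2.join(df3, on='id', how='inner')

result_dyf = DynamicFrame.fromDF(df4, glueContext, "result_dyf")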

PySpark search inside very large dataframe

I have a very large dataframe in pyspark. It has over 10 million rows and over 30 columns.
What is the best and most efficient way to search the entire dataframe for a given list of values and remove the rows that contain any of those values?
The given list of values:
list=['1097192','10727550','1098754']
The dataframe(df) is :
+---------+--------------+---------------+---------+------------+
| id | first_name | last_name | Salary | Verifycode |
+---------+--------------+---------------+---------+------------+
| 1986 | Rollie | Lewin | 1097192 | 42254172 | -Remove Row
| 289743 | Karil | Sudron | 2785190 | 3703538 |
| 3864 | Massimiliano | Dallicott | 1194553 | 23292573 |
| 49074 | Gerry | Grinnov | 1506584 | 62291161 |
| 5087654 | Nat | Leatherborrow | 1781870 | 55183252 |
| 689 | Thaine | Tipple | 2150105 | 40583249 |
| 7907 | Myrlene | Croley | 2883250 | 70380540 |
| 887 | Nada | Redier | 2676139 | 10727550 | -Remove Row
| 96533 | Sonny | Bosden | 1050067 | 13110714 |
| 1098754 | Dennie | McGahy | 1804487 | 927935 | -Remove Row
+---------+--------------+---------------+---------+------------+
If it were a smaller dataframe I could use the collect() or toLocalIterator() functions, iterate over the rows, and remove them based on the list values. Since it is a very large dataframe, what is the best way to solve this?
I have come up with this solution for now, but is there a better way:
column_names = df.schema.names
for name in column_names:
    df = df.filter(~col(name).isin(list))
You have the correct approach: filter the DataFrame with filter and the isin function. isin is fine as long as the list is small (a few thousand values, not millions). Also make sure that your DataFrame is partitioned into at least 3x the number of CPU cores on the executors; without enough partitions, parallelism will suffer.
I am more comfortable with Scala, so take the concept from the code below. You need to build a single Column object by OR-ing together the isin conditions for all the columns to be filtered on, and then pass that final Column to dataframe.filter:
from pyspark.sql.functions import col

column_names = df.schema.names
# Initialize with the first column's condition, then OR in the rest.
col_final = col(column_names[0]).isin(list)
for name in column_names[1:]:
    col_final = col_final | col(name).isin(list)
df = df.filter(~col_final)  # apply the negation of the combined condition
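The same OR-chain can be built without the explicit loop (a sketch; values stands in for the question's list so Python's built-in name list is not shadowed):

from functools import reduce
from pyspark.sql.functions import col

values = ['1097192', '10727550', '1098754']
# Start from the first column's condition and OR in the rest.
combined = reduce(lambda acc, name: acc | col(name).isin(values),
                  df.schema.names[1:],
                  col(df.schema.names[0]).isin(values))
df_clean = df.filter(~combined)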

Append duplicate rows as single record - pandas

I am trying to append duplicated rows into a single row. Duplication is based on a set of columns that must match exactly for rows to be declared duplicates. The problem is that we can't say in advance how many repeated rows there will be, so the solution needs to be generic.
I have tried some other solutions, but they rely on a user_id or key column. In my case, all columns except one have to be compared exactly, and the appended columns should keep the same column names.
I have tried concat, merge, and join; nothing works.
I used the following code to find duplicated rows:
col = ['TITLE', 'ISSN', 'e-ISSN', 'ISBN', 'e-ISBN']
duplicated_data = data[data.duplicated(col, keep=False)]
Here is the input:
| DealName | Target | Category |
| -------- | ------ | -------- |
| ABC-XYZ  | ABC    | A        |
| ABC-XYZ  | ABC    | B        |
| ABC-XYZ  | None   | C        |
| ABC-XYZ  | ABC    | C        |
| ABC-XYZ  | None   | D        |
| ABC-XYZ  | Zab    | C        |
and want to convert it to something like this:
| DealName | Target | Category | DealName | Target | Category | DealName | Target | Category |
| -------- | ------ | -------- | -------- | ------ | -------- | -------- | ------ | -------- |
| ABC-XYZ  | ABC    | A        | ABC-XYZ  | ABC    | B        | ABC-XYZ  | ABC    | C        |
| ABC-XYZ  | None   | C        | ABC-XYZ  | None   | D        |          |        |          |
Suppose I have set the condition that the first two columns need to match in order to group rows. Then the first, second, and fourth rows get appended as a single record, and the third and fifth rows get appended as another record.
Any idea how to do this?
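One possible approach (a sketch, not from the original post, using the example data above; occ and match_cols are names I introduce): number each row within its group of exactly-matching columns with cumcount(), index by the match columns plus that number, and unstack so every occurrence becomes its own block of columns with the original column names repeated.

import pandas as pd

data = pd.DataFrame({
    "DealName": ["ABC-XYZ"] * 6,
    "Target":   ["ABC", "ABC", "None", "ABC", "None", "Zab"],
    "Category": ["A", "B", "C", "C", "D", "C"],
})

match_cols = ["DealName", "Target"]                      # must match exactly
occ = data.groupby(match_cols).cumcount().rename("occ")  # 0, 1, 2, ... within each group

# Index by the match columns plus the occurrence number, then unstack so each
# occurrence becomes its own block of columns.
wide = (
    data.set_index([data[c].rename(c + "_key") for c in match_cols] + [occ])
        .unstack("occ")
)

# Reorder so every block keeps the DealName | Target | Category layout,
# then flatten the column index back to the original (repeated) names.
n_occ = int(occ.max()) + 1
wide = wide[[(c, k) for k in range(n_occ) for c in data.columns]]
wide.columns = [c for c, _ in wide.columns]
wide = wide.reset_index(drop=True)
print(wide)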
