Extracting multiple columns from a column in a PySpark DataFrame using named regex - python

Suppose I have a DataFrame df in pySpark of the following form:
| id | type | description |
|:---|:-----|:------------|
| 1  | "A"  | "Date: 2018/01/01\nDescr: This is a test des\ncription\n" |
| 2  | "B"  | "Date: 2018/01/02\nDescr: Another test descr\niption\n" |
| 3  | "A"  | "Date: 2018/01/03\nWarning: This is a warnin\ng, watch out\n" |
which is of course a dummy set, but will suffice for this example.
I have made a regex-statement with named groups that can be used to extract the relevant information from the description-field, something along the lines of:
^(?:(?:Date: (?P<DATE>.+?)\n)|(?:Descr: (?P<DESCR>.+?)\n)|(?:Warning: (?P<WARNING>.+?)\n))+$
again, dummy regex, the actual regex is somewhat more elaborate, but the purpose is to capture three possible groups:
| DATE | DESCR | WARNING |
|:-----|:------|:--------|
| 2018/01/01 | This is a test des\ncription | None |
| 2018/01/02 | Another test descr\niption | None |
| 2018/01/03 | None | This is a warnin\ng, watch out |
Now I would want to add the columns that are the result of the regex match to the original DataFrame (i.e. combine the two dummy tables in this question into one).
I have tried several ways to accomplish this, but none have led to the full solution yet. One thing I've tried is:
import re

def extract_fields(string):
    patt = <ABOVE_PATTERN>
    result = re.match(patt, string, re.DOTALL).groupdict()
    # Actually, a slight work-around is needed to overcome the None problem when
    # no match can be made; I'm using pandas' .str.extract for this now
    return result
df.rdd.map(lambda x: extract_fields(x.description))
This will yield the second table, but I see no way to combine this with the original columns from df. I have tried to construct a new Row(), but then I run into problems with the column ordering that the Row() constructor needs (and the fact that I cannot hard-code the column names that will be added by the regex groups), resulting in a DataFrame that has the columns all jumbled up. How can I achieve what I want, i.e. one DataFrame with six columns: id, type, description, DATE, DESCR and WARNING?
Remark. Actually, the description field is not just one field, but several columns. Using concat_ws, I have concatenated these columns into a new column description with the description fields separated by \n, but maybe this can be incorporated in a nicer way.
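For reference, a minimal sketch of that concat_ws step, assuming hypothetical source columns date_col, descr_col and warning_col (the real column names are not given here):

from pyspark.sql import functions as F

# Hypothetical column names; substitute the actual description fields.
df = df.withColumn(
    "description",
    F.concat_ws("\n", F.col("date_col"), F.col("descr_col"), F.col("warning_col")),
)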

I think you can use Pandas for this case. First I convert df to an RDD to split the description field, then I build a Pandas DataFrame and create a Spark DataFrame from it. It works regardless of the number of fields in the description.
>>> import pandas as pd
>>> import re
>>>
>>> df.show(truncate=False)
+---+----+-----------------------------------------------------------+
|id |type|description |
+---+----+-----------------------------------------------------------+
|1 |A |Date: 2018/01/01\nDescr: This is a test des\ncription\n |
|2 |B |Date: 2018/01/02\nDescr: Another test desc\niption\n |
|3 |A |Date: 2018/01/03\nWarning: This is a warnin\ng, watch out\n|
+---+----+-----------------------------------------------------------+
>>> #convert df to rdd
>>> rdd = df.rdd.map(list)
>>> rdd.first()
[1, 'A', 'Date: 2018/01/01\\nDescr: This is a test des\\ncription\\n']
>>>
>>> #split description field
>>> rddSplit = rdd.map(lambda x: (x[0],x[1],re.split('\n(?=[A-Z])', x[2].encode().decode('unicode_escape'))))
>>> rddSplit.first()
(1, 'A', ['Date: 2018/01/01', 'Descr: This is a test des\ncription\n'])
>>>
>>> #create empty Pandas df
>>> df1 = pd.DataFrame()
>>>
>>> #insert rows
>>> for rdd in rddSplit.collect():
... a = {i.split(':')[0].strip():i.split(':')[1].strip('\n').replace('\n','\\n').strip() for i in rdd[2]}
... a['id'] = rdd[0]
... a['type'] = rdd[1]
... df2 = pd.DataFrame([a], columns=a.keys())
... df1 = pd.concat([df1, df2])
...
>>> df1
Date Descr Warning id type
0 2018/01/01 This is a test des\ncription NaN 1 A
0 2018/01/02 Another test desc\niption NaN 2 B
0 2018/01/03 NaN This is a warnin\ng, watch out 3 A
>>>
>>> #create spark df
>>> df3 = spark.createDataFrame(df1.fillna('')).replace('',None)
>>> df3.show(truncate=False)
+----------+----------------------------+------------------------------+---+----+
|Date |Descr |Warning |id |type|
+----------+----------------------------+------------------------------+---+----+
|2018/01/01|This is a test des\ncription|null |1 |A |
|2018/01/02|Another test desc\niption |null |2 |B |
|2018/01/03|null |This is a warnin\ng, watch out|3 |A |
+----------+----------------------------+------------------------------+---+----+
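As a side note, if collecting everything to the driver is a concern, a similar result can be kept entirely in Spark with regexp_extract, one call per named group. This is only a rough sketch, assuming description holds real newline-separated "Key: value" entries as in the dummy data (the patterns are simplified stand-ins for the question's actual regex):

from pyspark.sql import functions as F

# One pattern per field of interest.
fields = {
    "DATE": r"Date: (.*?)\n(?=[A-Z]|$)",
    "DESCR": r"(?s)Descr: (.*?)\n(?=[A-Z]|$)",
    "WARNING": r"(?s)Warning: (.*?)\n(?=[A-Z]|$)",
}
for name, patt in fields.items():
    extracted = F.regexp_extract("description", patt, 1)
    # regexp_extract returns an empty string on no match; map that to null
    df = df.withColumn(name, F.when(extracted == "", None).otherwise(extracted))

This keeps id, type and description and simply appends the DATE, DESCR and WARNING columns, which is the six-column layout asked for.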

Related

How to create a new column with values matching conditions from different dataframes?

Is there a simple function (in pandas or numpy) to create a new column with true or false values, based on matching criteria from different dataframes?
I'm actually trying to compare two dataframes that both have an email column and see, for example, which emails match the emails in the second dataframe. The goal is to print a table that looks like this (where hola#lorem.com is actually in both the first and second dataframe):
| id | email | match |
|:------|:------ |:-------|
| 1 | hola#lorem.com | true|
| 2 | adios#lorem.com | false|
| 3 | bye#lorem.com | false|
Thanks in advance for your help
You can use DataFrame.assign:
df1 = df1.assign(match=df1["email"].isin(df2["email"]))
You can for example use the function isin:
df1['match'] = df1['email'].isin(df2['email'])
df2['match'] = df2['email'].isin(df1['email'])
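A minimal end-to-end sketch of the isin approach, with made-up data along the lines of the question:

import pandas as pd

# Hypothetical data mirroring the example in the question
df1 = pd.DataFrame({"id": [1, 2, 3],
                    "email": ["hola#lorem.com", "adios#lorem.com", "bye#lorem.com"]})
df2 = pd.DataFrame({"email": ["hola#lorem.com", "otra#lorem.com"]})

df1["match"] = df1["email"].isin(df2["email"])
print(df1)
#    id            email  match
# 0   1   hola#lorem.com   True
# 1   2  adios#lorem.com  False
# 2   3    bye#lorem.com  False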

Python,Pandas,DataFrame, add new column doing SQL GROUP_CONCAT equivalent

My question is very similar to the one asked but unanswered here
Replicating GROUP_CONCAT for pandas.DataFrame
I have a Pandas DataFrame which I want to group-concat into a new data frame
+------+---------+
| team | user |
+------+---------+
| A | elmer |
| A | daffy |
| A | bugs |
| B | dawg |
| A | foghorn |
+------+---------+
Becoming
+------+---------------------------------------+
| team | group_concat(user) |
+------+---------------------------------------+
| A | elmer,daffy,bugs,foghorn |
| B | dawg |
+------+---------------------------------------+
As answered in the original topic, it can be done via any of these:
df.groupby('team').apply(lambda x: ','.join(x.user))
df.groupby('team').apply(lambda x: list(x.user))
df.groupby('team').agg({'user' : lambda x: ', '.join(x)})
But the resulting object is not a Pandas DataFrame anymore.
How can I get the GROUP_CONCAT results in the original Pandas DataFrame as a new column?
Cheers
You can apply list and join after grouping, then reset_index to get the DataFrame back.
output_df = df.groupby('team')['user'].apply(lambda x: ",".join(list(x))).reset_index()
output_df.rename(columns={'user': 'group_concat(user)'})
team group_concat(user)
0 A elmer,daffy,bugs,foghorn
1 B dawg
Let's break down the code below:
First, group by team and use apply on user to join its elements with a comma.
Then, reset the index and rename the resulting column (axis=1 refers to columns, not rows).
res = (df.groupby('team')['user']
         .apply(lambda x: ','.join(str(i) for i in x))
         .reset_index()
         .rename({'user': 'group_concat(user)'}, axis=1))
Output:
team group_concat(user)
0 A elmer,daffy,bugs,foghorn
1 B dawg
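If you prefer to avoid the lambda, an equivalent sketch using agg with as_index=False gives the same two-column DataFrame:

# Join the grouped users directly and keep 'team' as a regular column
output_df = (df.groupby('team', as_index=False)['user']
               .agg(','.join)
               .rename(columns={'user': 'group_concat(user)'}))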

How do I add two columns with different rows with special condition?

Hi, I have a PySpark DataFrame. I would like to add two columns from different rows under a special condition. One of the columns is a date type.
Here is the example of the data:
+------+------------+------+
| flag | date       | diff |
+------+------------+------+
| 1    | 2014-05-31 | 0    |
| 2    | 2014-06-02 | 2    |
| 3    | 2016-01-14 | 591  |
| 1    | 2016-07-08 | 0    |
| 2    | 2016-07-12 | 4    |
+------+------------+------+
Currently I only know how to add the two columns by using this code:
from pyspark.sql.functions import expr
dataframe.withColumn("new_column", expr("date_add(date_column, int_column)"))
The expected result:
There's a new column, called "new_date", which is the result of adding the "diff" column to the "date" column.
The catch is that there's a special condition: if the "flag" is 1, "date" and "diff" come from the same row; if not, the "date" comes from the previous row.
I am aware that in this scenario, my data has to be correctly sorted.
If anyone could help me, I would be very grateful. Thank you.
You just have to create a column with the previous date using a Window, and construct the new column depending on the value of 'flag':
import pyspark.sql.functions as F
from pyspark.sql.window import Window
w = Window().partitionBy().orderBy(F.col('date'))
dataframe = dataframe.withColumn('previous_date', F.lag('date', 1).over(w))
dataframe = dataframe.withColumn('new_date',
                                 F.when(F.col('flag') == 1,
                                        F.expr("date_add(previous_date, diff)")
                                 ).otherwise(F.expr("date_add(date, diff)"))
                                 ).drop('previous_date')
Just in case you have the same issue with Xavier's answer: the idea is the same, but I removed some unnecessary conditions from the Window and fixed the syntax error, as well as the date_add error I faced when I tried his version.
import datetime
from pyspark.sql import Window
from pyspark.sql.functions import *

df1 = spark.createDataFrame(
    [(1, datetime.date(2014, 5, 31), 0), (2, datetime.date(2014, 6, 2), 2),
     (3, datetime.date(2016, 1, 14), 591), (1, datetime.date(2016, 7, 8), 0),
     (2, datetime.date(2016, 7, 12), 4)],
    ["flag", "date", "diff"])
w = Window.orderBy(col("date"))
df1 = df1.withColumn('previous_date', lag('date', 1).over(w))
df1 = df1.withColumn('new_date',
                     when(col('flag') == 1, expr('date_add(date, diff)'))
                     .otherwise(expr('date_add(previous_date, diff)'))).drop('previous_date')
df1.show()
Output:
+----+----------+----+----------+
|flag| date|diff| new_date|
+----+----------+----+----------+
| 1|2014-05-31| 0|2014-05-31|
| 2|2014-06-02| 2|2014-06-02|
| 3|2016-01-14| 591|2016-01-14|
| 1|2016-07-08| 0|2016-07-08|
| 2|2016-07-12| 4|2016-07-12|
+----+----------+----+----------+
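The same idea can also be written by first choosing which date each row should use and then applying date_add once; a rough sketch reusing the column names above (note that a Window without partitionBy pulls all rows into a single partition):

from pyspark.sql import functions as F, Window

w = Window.orderBy("date")
df1 = (df1
       .withColumn("src_date",
                   F.when(F.col("flag") == 1, F.col("date"))
                    .otherwise(F.lag("date", 1).over(w)))
       .withColumn("new_date", F.expr("date_add(src_date, diff)"))
       .drop("src_date"))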

Efficient column processing in PySpark

I have a dataframe with a very large number of columns (>30000).
I'm filling it with 1 and 0 based on the first column like this:
from pyspark.sql.functions import array_contains, when

for column in list_of_column_names:
    df = df.withColumn(column, when(array_contains(df['list_column'], column), 1).otherwise(0))
However this process takes a lot of time. Is there a way to do this more efficiently? Something tells me that column processing can be parallelized.
Edit:
Sample input data
+----------------+-----+-----+-----+
| list_column    | Foo | Bar | Baz |
+----------------+-----+-----+-----+
| ['Foo', 'Bak'] |     |     |     |
| ['Bar', 'Baz'] |     |     |     |
| ['Foo']        |     |     |     |
+----------------+-----+-----+-----+
There is nothing specifically wrong with your code, other than very wide data:
for column in list_of_column_names:
    df = df.withColumn(...)
only generates the execution plan.
Actual data processing will be concurrent and parallelized once the result is evaluated.
It is, however, an expensive process, as it requires O(NMK) operations with N rows, M columns and K values in the list.
Additionally, execution plans on very wide data are very expensive to compute (though the cost is constant in terms of the number of records). If this becomes a limiting factor, you might be better off with RDDs:
Sort the column array using the sort_array function.
Convert the data to an RDD.
Apply the search for each column using binary search (a rough sketch follows).
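A rough sketch of that RDD route, assuming list_of_column_names is defined as in the question and list_column is an array of strings (the contains helper is hypothetical):

from bisect import bisect_left
from pyspark.sql import functions as F, Row

def contains(sorted_values, name):
    # Binary search in the pre-sorted array of values
    i = bisect_left(sorted_values, name)
    return 1 if i < len(sorted_values) and sorted_values[i] == name else 0

sorted_df = df.withColumn("list_column", F.sort_array("list_column"))
result = (sorted_df.rdd
          .map(lambda r: Row(list_column=r["list_column"],
                             **{c: contains(r["list_column"], c)
                                for c in list_of_column_names}))
          .toDF())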
You might approach it like this:
import pyspark.sql.functions as F
exprs = [F.when(F.array_contains(F.col('list_column'), column), 1).otherwise(0).alias(column)
         for column in list_of_column_names]
df = df.select(['list_column'] + exprs)
withColumn is already distributed, so a faster approach would be difficult to get other than what you already have. You can try defining a udf function as follows:
from pyspark.sql import functions as f
from pyspark.sql import types as t
def containsUdf(listColumn):
    row = {}
    for column in list_of_column_names:
        if column in listColumn:
            row.update({column: 1})
        else:
            row.update({column: 0})
    return row

callContainsUdf = f.udf(containsUdf, t.StructType([t.StructField(x, t.StringType(), True) for x in list_of_column_names]))
df.withColumn('struct', callContainsUdf(df['list_column']))\
.select(f.col('list_column'), f.col('struct.*'))\
.show(truncate=False)
which should give you
+-----------+---+---+---+
|list_column|Foo|Bar|Baz|
+-----------+---+---+---+
|[Foo, Bak] |1 |0 |0 |
|[Bar, Baz] |0 |1 |1 |
|[Foo] |1 |0 |0 |
+-----------+---+---+---+
Note: list_of_column_names = ["Foo","Bar","Baz"]

Read Excel sheet with multiple headers using Pandas

I have an Excel sheet with multiple header rows, like:
_________________________________________________________________________
____|_____| Header1 | Header2 | Header3 |
ColX|ColY |ColA|ColB|ColC|ColD||ColD|ColE|ColF|ColG||ColH|ColI|ColJ|ColDK|
1 | ds | 5 | 6 |9 |10 | .......................................
2 | dh | ..........................................................
3 | ge | ..........................................................
4 | ew | ..........................................................
5 | er | ..........................................................
Now here you can see that the first two columns do not have headers (they are blank), but the other columns have headers like Header1, Header2 and Header3. I want to read this sheet and merge it with another sheet that has a similar structure.
I want to merge them on the first column 'ColX'. Right now I am doing this:
import pandas as pd

totalMergedSheet = pd.DataFrame([1, 2, 3, 4, 5], columns=['ColX'])
file = pd.ExcelFile('ExcelFile.xlsx')
for i in range(1, len(file.sheet_names)):
    df1 = file.parse(file.sheet_names[i-1])
    df2 = file.parse(file.sheet_names[i])
    newMergedSheet = pd.merge(df1, df2, on='ColX')
    totalMergedSheet = pd.merge(totalMergedSheet, newMergedSheet, on='ColX')
But it's neither reading the columns correctly, nor do I think it will return the results the way I want. I want the resulting frame to look like this:
________________________________________________________________________________________________________
____|_____| Header1 | Header2 | Header3 | Header4 | Header5 |
ColX|ColY |ColA|ColB|ColC|ColD||ColD|ColE|ColF|ColG||ColH|ColI|ColJ|ColK| ColL|ColM|ColN|ColO||ColP|ColQ|ColR|ColS|
1 | ds | 5 | 6 |9 |10 | ..................................................................................
2 | dh | ...................................................................................
3 | ge | ....................................................................................
4 | ew | ...................................................................................
5 | er | ......................................................................................
[See comments for updates and corrections]
Pandas already has a function that will read in an entire Excel spreadsheet for you, so you don't need to manually parse/merge each sheet. Take a look at pandas.read_excel(). It not only lets you read in an Excel file in a single line, it also provides options to help solve the problem you're having.
Since you have subcolumns, what you're looking for is MultiIndexing. By default, pandas will read in the top row as the sole header row. You can pass a header argument into pandas.read_excel() that indicates how many rows are to be used as headers. In your particular case, you'd want header=[0, 1], indicating the first two rows. You might also have multiple sheets, so you can pass sheetname=None as well (this tells it to go through all sheets). The command would be:
df_dict = pandas.read_excel('ExcelFile.xlsx', header=[0, 1], sheetname=None)
This returns a dictionary where the keys are the sheet names, and the values are the DataFrames for each sheet. If you want to collapse it all into one DataFrame, you can simply use pandas.concat:
df = pandas.concat(df_dict.values(), axis=0)
Sometimes, indices are MultiIndex too (it is indeed the case in the OP). To account for that, pass the index_col= appropriately.
df_dict = pd.read_excel('Book1.xlsx', header=[0,1], index_col=[0,1], sheetname=None)
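For what it's worth, newer pandas releases renamed the argument from sheetname to sheet_name, so on a current install the equivalent call would look roughly like:

import pandas as pd

# sheet_name=None still means "read every sheet into a dict of DataFrames"
df_dict = pd.read_excel('ExcelFile.xlsx', header=[0, 1], index_col=[0, 1], sheet_name=None)
df = pd.concat(df_dict.values(), axis=0)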
