How to get referenced columns of a PySpark DataFrame? - python

Given a PySpark DataFrame, is it possible to obtain a list of the source columns that the DataFrame references?
Perhaps a more concrete example might help explain what I'm after. Say I have a DataFrame defined as:
import pyspark.sql.functions as func
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
source_df = spark.createDataFrame(
[("pru", 23, "finance"), ("paul", 26, "HR"), ("noel", 20, "HR")],
["name", "age", "department"],
)
source_df.createOrReplaceTempView("people")
sqlDF = spark.sql("SELECT name, age, department FROM people")
df = sqlDF.groupBy("department").agg(func.max("age").alias("max_age"))
df.show()
which returns:
+----------+--------+
|department|max_age |
+----------+--------+
| finance| 23|
| HR| 26|
+----------+--------+
The columns that are referenced by df are [department, age]. Is it possible to get that list of referenced columns programmatically?
Thanks to Capturing the result of explain() in pyspark I know I can extract the plan as a string:
df._sc._jvm.PythonSQLUtils.explainString(df._jdf.queryExecution(), "formatted")
which returns:
== Physical Plan ==
AdaptiveSparkPlan (6)
+- HashAggregate (5)
+- Exchange (4)
+- HashAggregate (3)
+- Project (2)
+- Scan ExistingRDD (1)
(1) Scan ExistingRDD
Output [3]: [name#0, age#1L, department#2]
Arguments: [name#0, age#1L, department#2], MapPartitionsRDD[4] at applySchemaToPythonRDD at NativeMethodAccessorImpl.java:0, ExistingRDD, UnknownPartitioning(0)
(2) Project
Output [2]: [age#1L, department#2]
Input [3]: [name#0, age#1L, department#2]
(3) HashAggregate
Input [2]: [age#1L, department#2]
Keys [1]: [department#2]
Functions [1]: [partial_max(age#1L)]
Aggregate Attributes [1]: [max#22L]
Results [2]: [department#2, max#23L]
(4) Exchange
Input [2]: [department#2, max#23L]
Arguments: hashpartitioning(department#2, 200), ENSURE_REQUIREMENTS, [plan_id=60]
(5) HashAggregate
Input [2]: [department#2, max#23L]
Keys [1]: [department#2]
Functions [1]: [max(age#1L)]
Aggregate Attributes [1]: [max(age#1L)#12L]
Results [2]: [department#2, max(age#1L)#12L AS max_age#13L]
(6) AdaptiveSparkPlan
Output [2]: [department#2, max_age#13L]
Arguments: isFinalPlan=false
which is useful, but it's not what I need. I need a list of the referenced columns. Is this possible?
Perhaps another way of asking the question is... is there a way to obtain the explain plan as an object that I can iterate over/explore?
UPDATE. Thanks to the reply from @matt-andruff I have gotten this:
df._jdf.queryExecution().executedPlan().treeString().split("+-")[-2]
which returns:
' Project [age#1L, department#2]\n '
from which I guess I could parse the information I'm after, but this is far from elegant and particularly error-prone.
What I'm really after is a failsafe, reliable, API-supported way to get this information. I'm starting to think it isn't possible.
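For illustration only, the string parsing mentioned above could be sketched as follows. The helper name `columns_from_plan_line` is hypothetical, and the sketch assumes the simple `Project [name#id, ...]` shape shown above; it would likely break on expressions that themselves contain commas or brackets, which is exactly why this route is fragile.

```python
import re


def columns_from_plan_line(line):
    """Return column names from a plan line such as 'Project [age#1L, department#2]'."""
    match = re.search(r"\[(.*)\]", line)
    if match is None:
        return []
    names = []
    for item in match.group(1).split(","):
        # Drop the '#<id>' suffix together with any type marker such as 'L'.
        name = item.strip().split("#")[0]
        if name:
            names.append(name)
    return names


print(columns_from_plan_line(" Project [age#1L, department#2]\n "))
```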

There is an object for that; unfortunately it's a Java object and is not exposed in PySpark.
You can still access it through Spark's internal constructs:
>>> df._jdf.queryExecution().executedPlan().apply(0).output().apply(0).toString()
u'department#1621'
>>> df._jdf.queryExecution().executedPlan().apply(0).output().apply(1).toString()
u'max_age#1632L'
You could loop through both the above apply to get the information you are looking for with something like:
plan = df._jdf.queryExecution().executedPlan()
steps = [ plan.apply(i) for i in range(1,100) if not isinstance(plan.apply(i), type(None)) ]
iterator = steps[0].inputSet().iterator()
>>> iterator.next().toString()
u'department#1621'
>>> iterator.next().toString()
u'max#1642L'
# Collect the JSON of every Project node, then load it into a DataFrame for inspection.
projections = [ (steps[0].p(i).toJSON().encode('ascii','ignore')) for i in range(1,100) if not( isinstance(steps[0].p(i), type(None) )) and steps[0].p(i).nodeName().encode('ascii','ignore') == 'Project' ]
rdd = spark.sparkContext.parallelize(projections)
df2 = spark.read.json(rdd)
>>> df2.show(1,False)
+-----+------------------------------------------+----+------------+------+--------------+------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----+
|child|class |name|num-children|output|outputOrdering|outputPartitioning|projectList |rdd |
+-----+------------------------------------------+----+------------+------+--------------+------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----+
|0 |org.apache.spark.sql.execution.ProjectExec|null|1 |null |null |null |[[[org.apache.spark.sql.catalyst.expressions.AttributeReference, long, [1620, 4ad48da6-03cf-45d4-9b35-76ac246fadac, org.apache.spark.sql.catalyst.expressions.ExprId], age, true, 0, [people]]], [[org.apache.spark.sql.catalyst.expressions.AttributeReference, string, [1621, 4ad48da6-03cf-45d4-9b35-76ac246fadac, org.apache.spark.sql.catalyst.expressions.ExprId], department, true, 0, [people]]]]|null|
+-----+------------------------------------------+----+------------+------+--------------+------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----+
df2.select(func.explode(func.col('projectList'))).select( func.col('col')[0]["name"] ) .show(100,False)
+-----------+
|col[0].name|
+-----------+
|age |
|department |
+-----------+
A note on range(1, 100): it is a bit of a hack, but apparently size doesn't work here. I'm sure that with more time I could refine it.
You can then use the JSON to pull the information out programmatically.
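As a rough sketch of pulling names out of the JSON without going through another DataFrame, a recursive walk in plain Python may be enough. The `SAMPLE_PLAN_JSON` string below is hand-written and only modelled on the `projectList` output shown above (it is an assumption, not real Spark output), and `referenced_names` is a hypothetical helper.

```python
import json

# Hand-written sample, modelled on the projectList column shown above (assumption).
SAMPLE_PLAN_JSON = """
[
  {
    "class": "org.apache.spark.sql.execution.ProjectExec",
    "num-children": 1,
    "projectList": [
      [{"class": "org.apache.spark.sql.catalyst.expressions.AttributeReference",
        "num-children": 0, "name": "age", "dataType": "long"}],
      [{"class": "org.apache.spark.sql.catalyst.expressions.AttributeReference",
        "num-children": 0, "name": "department", "dataType": "string"}]
    ],
    "child": 0
  }
]
"""


def referenced_names(node):
    """Collect the 'name' of every AttributeReference found anywhere in the tree."""
    names = []
    if isinstance(node, dict):
        is_attribute = str(node.get("class", "")).endswith(".AttributeReference")
        if is_attribute and "name" in node:
            names.append(node["name"])
        for value in node.values():
            names.extend(referenced_names(value))
    elif isinstance(node, list):
        for item in node:
            names.extend(referenced_names(item))
    return names


print(referenced_names(json.loads(SAMPLE_PLAN_JSON)))
```

In practice the input string would come from something like the `toJSON()` call above, whose exact layout may differ between Spark versions.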

I have something that, while not being an answer to my original question (see Matt Andruff's answer for that), could still be useful here. It's a way to get all the source columns referenced by a pyspark.sql.column.Column.
Simple repro:
from pyspark.sql import functions as f, SparkSession
SparkSession.builder.getOrCreate()
col = f.concat(f.col("A"), f.col("B"))
type(col)
col._jc.expr().references().toList().toString()
returns:
<class 'pyspark.sql.column.Column'>
"List('A, 'B)"
It's definitely not perfect: it still requires you to parse the column names out of the returned string, but at least the information I'm after is available. There might be more methods on the object returned from references() that make the parsing easier, but if there are, I haven't found them!
Here is a function I wrote to do the parsing:
def parse_references(references: str):
    return sorted(
        "".join(
            references.replace("'", "")
            .replace("List(", "")
            .replace(")", "")
            .split()
        ).split(",")
    )
assert parse_references("List('A, 'B)") == ["A", "B"]
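A slightly shorter variant of the same parsing, sketched with a regular expression. The name `parse_references_regex` is hypothetical, and the sketch assumes the unresolved-attribute format shown above, where each column is printed with a leading apostrophe; resolved attributes (printed like `A#12`) would need a different pattern.

```python
import re


def parse_references_regex(references):
    # Capture whatever follows each apostrophe up to a comma, ')' or whitespace.
    return sorted(re.findall(r"'([^,)\s]+)", references))


print(parse_references_regex("List('A, 'B)"))
```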

PySpark is not really designed for such lower-level tricks (they call for Scala, the language Spark itself is developed in, which exposes everything there is inside).
The step where you access QueryExecution is the main entry point to the machinery of Spark SQL's query execution engine.
The issue is that py4j (the bridge between the JVM and Python environments) makes it of little use on PySpark's side.
You can use the following if you need to access the final query plan (just before it's converted into RDDs):
df._jdf.queryExecution().executedPlan().prettyJson()
Review the QueryExecution API.
QueryExecutionListener
You should really consider Scala to intercept whatever you want about your queries and QueryExecutionListener seems a fairly viable starting point.
There is more but it's all over Scala :)
What I'm really after is a failsafe, reliable, API-supported way to get this information. I'm starting to think it isn't possible.
I'm not surprised since you're throwing away the best possible answer: Scala. I'd recommend using it for a PoC and see what you can get and only then (if you have to) look out for a Python solution (which I think is doable yet highly error-prone).

You can try the code below; it will give you the list of columns in the DataFrame along with their data types.
for field in df.schema.fields:
    print(field.name +" , "+str(field.dataType))

Related

PySpark DataFrame showing different results when using .select()

Why is .select() showing/parsing values differently from when I don't use it?
I have this CSV:
CompanyName, CompanyNumber,RegAddress.CareOf,RegAddress.POBox,RegAddress.AddressLine1, RegAddress.AddressLine2,RegAddress.PostTown,RegAddress.County,RegAddress.Country,RegAddress.PostCode,CompanyCategory,CompanyStatus,CountryOfOrigin,DissolutionDate,IncorporationDate,Accounts.AccountRefDay,Accounts.AccountRefMonth,Accounts.NextDueDate,Accounts.LastMadeUpDate,Accounts.AccountCategory,Returns.NextDueDate,Returns.LastMadeUpDate,Mortgages.NumMortCharges,Mortgages.NumMortOutstanding,Mortgages.NumMortPartSatisfied,Mortgages.NumMortSatisfied,SICCode.SicText_1,SICCode.SicText_2,SICCode.SicText_3,SICCode.SicText_4,LimitedPartnerships.NumGenPartners,LimitedPartnerships.NumLimPartners,URI,PreviousName_1.CONDATE, PreviousName_1.CompanyName, PreviousName_2.CONDATE, PreviousName_2.CompanyName,PreviousName_3.CONDATE, PreviousName_3.CompanyName,PreviousName_4.CONDATE, PreviousName_4.CompanyName,PreviousName_5.CONDATE, PreviousName_5.CompanyName,PreviousName_6.CONDATE, PreviousName_6.CompanyName,PreviousName_7.CONDATE, PreviousName_7.CompanyName,PreviousName_8.CONDATE, PreviousName_8.CompanyName,PreviousName_9.CONDATE, PreviousName_9.CompanyName,PreviousName_10.CONDATE, PreviousName_10.CompanyName,ConfStmtNextDueDate, ConfStmtLastMadeUpDate
"ATS CAR RENTALS LIMITED","10795807","","",", 1ST FLOOR ,WESTHILL HOUSE 2B DEVONSHIRE ROAD","ACCOUNTING FREEDOM","BEXLEYHEATH","","ENGLAND","DA6 8DS","Private Limited Company","Active","United Kingdom","","31/05/2017","31","5","28/02/2023","31/05/2021","TOTAL EXEMPTION FULL","28/06/2018","","0","0","0","0","49390 - Other passenger land transport","","","","0","0","http://business.data.gov.uk/id/company/10795807","","","","","","","","","","","","","","","","","","","","","12/06/2023","29/05/2022"
"ATS CARE LIMITED","10393661","","","UNIT 5 CO-OP BUILDINGS HIGH STREET","ABERSYCHAN","PONTYPOOL","TORFAEN","WALES","NP4 7AE","Private Limited Company","Active","United Kingdom","","26/09/2016","30","9","30/06/2023","30/09/2021","UNAUDITED ABRIDGED","24/10/2017","","0","0","0","0","87900 - Other residential care activities n.e.c.","","","","0","0","http://business.data.gov.uk/id/company/10393661","17/05/2018","ATS SUPPORT LIMITED","22/12/2017","ATS CARE LIMITED","","","","","","","","","","","","","","","","","09/10/2022","25/09/2021"
I'm reading the csv like so:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
_file = "/path/dir/BasicCompanyDataAsOneFile-2022-08-01.csv"
df = spark.read.csv(_file, header=True, quote='"', escape="\"")
Focusing on the CompanyCategory column, we should see Private Limited Company for both lines. But this is what I get instead when using select():
df.select("CompanyCategory").show(truncate=False)
+-----------------------+
|CompanyCategory |
+-----------------------+
|DA6 8DS |
|Private Limited Company|
+-----------------------+
df.select("CompanyCategory").collect()
[Row(CompanyCategory='DA6 8DS'),
Row(CompanyCategory='Private Limited Company')]
vs when not using select():
from pprint import pprint
for row in df.collect():
pprint(row.asDict())
{' CompanyNumber': '10795807',
...
'CompanyCategory': 'Private Limited Company',
'CompanyName': 'ATS CAR RENTALS LIMITED',
...}
{' CompanyNumber': '10393661',
...
'CompanyCategory': 'Private Limited Company',
'CompanyName': 'ATS CARE LIMITED',
...}
Using asDict() for readability.
SQL doing the same thing:
df.createOrReplaceTempView("companies")
spark.sql('select CompanyCategory from companies').show()
+--------------------+
| CompanyCategory|
+--------------------+
|Private Limited C...|
| DA6 8DS|
+--------------------+
As you can see when not using select() the CompanyCategory values are showing correctly. Why is this happening? What can I do to avoid this?
Context: I'm trying to create dimension tables, which is why I'm selecting a single column. The next phase is to drop duplicates, filter, sort, etc.
Edit:
Here are two example values in the actual CSV that are throwing things off:
CompanyName of """ BORA "" 2 LTD"
1st line address of ", 1ST FLOOR ,WESTHILL HOUSE 2B DEVONSHIRE ROAD"
Note:
These values are from two separate lines in the CSV.
These values are copied and pasted from the CSV opened in a text editor (like Notepad or VSCode).
Tried and failed:
df = spark.read.csv(_file, header=True) - completely picks up incorrect column.
df = spark.read.csv(_file, header=True, escape='\"') - exact same thing described in original question above. So same results.
df = spark.read.csv(_file, header=True, escape='""') - since the CSV escapes quotes using two double quotes, then I guess using two double quotes as escape param would do the trick? But getting following error:
Py4JJavaError: An error occurred while calling o276.csv.
: java.lang.RuntimeException: escape cannot be more than one character
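As a sanity check outside Spark, the two problematic values can be run through Python's built-in csv module, which treats a doubled quote inside a quoted field as one literal quote. The `sample` line below is hand-assembled from the two values quoted above plus an empty field, so it is an illustration rather than a real line from the file.

```python
import csv
import io

# Hand-assembled sample: the two values quoted above, with an empty field in between.
sample = '""" BORA "" 2 LTD","",", 1ST FLOOR ,WESTHILL HOUSE 2B DEVONSHIRE ROAD"\n'

# doublequote=True (the default) reads "" inside a quoted field as a single ".
rows = list(csv.reader(io.StringIO(sample), quotechar='"', doublequote=True))
fields = rows[0]
print(fields)
```

If the fields come out as expected here, the file itself is probably well-formed and the problem lies in how the reader options are combined.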
When reading the csv, the parameters quote and escape are set to the same value ('"'=="\"" returns True in Python).
I would guess that configuring both parameters in this way will somehow disturb the parser that Spark uses to separate the single fields. After removing the escape parameter you can process the remaining " with regexp_replace:
from pyspark.sql import functions as F
df = spark.read.csv(<filename>, header=True, quote='"')
cols = [F.regexp_replace(F.regexp_replace(
F.regexp_replace("`" + col + "`", '^"', ''),
'"$', ''), '""', '"').alias(col) for col in df.columns]
df.select(cols).show(truncate=False)
Probably there is a smart regexp that can combine all three replace operations into one...
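One candidate for such a combined pattern, sketched in plain Python rather than in Spark. The helper `unquote` is hypothetical; whether the same pattern behaves identically inside regexp_replace (which uses Java regular expressions) has not been verified here.

```python
import re

# One alternation: a leading quote, a trailing quote, or a doubled quote.
_QUOTE_RE = re.compile(r'^"|"$|"(")')


def unquote(value):
    # Leading and trailing quotes have no captured group and are dropped;
    # a doubled quote keeps its captured second character.
    return _QUOTE_RE.sub(lambda m: m.group(1) or "", value)


print(unquote('""" BORA "" 2 LTD"'))
```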
This is an issue when reading a single column from CSV file vs. when reading all the columns:
df = spark.read.csv('companies-house.csv', header=True, quote='"', escape="\"")
df.show() # correct output (all columns loaded)
df.toPandas() # same as pd.read_csv()
df.select('CompanyCategory').show() # wrong output (trying to load a single column)
df.cache() # all columns will be loaded and used in any subsequent call
df.select('CompanyCategory').show() # correct output
The first select() performs a different (optimized) read than the second, so one possible workaround would be to cache() the data immediately. This will however load all the columns, not just one (although pandas and COPY do the same).
The problematic part of the CSV is the RegAddress.POBox column where empty value is saved as ", instead of "",. You can check this by incrementally loading more columns:
df.unpersist() # undo cache() operation (for testing purposes only)
df.select(*[f"`{c}`" for c in df.columns[:3]], 'CompanyCategory').show() # wrong
df.select(*[f"`{c}`" for c in df.columns[:4]], 'CompanyCategory').show() # OK

itertools and combinations all coming back the same

I'm writing some code to generate a unique ID for each event that arrives in a given version. The value can repeat in a future version, as the version prefix will change. I have the version information but I'm struggling to generate the uid. I found some code that seems to produce what I need, found here, and have tried to adapt it to my needs, but I am facing an issue.
I have the information I need as a dataframe, and when I run the code it returns the same value for every row. I suspect that the issue stems from how I am using the used set from the example: it isn't being persisted between calls, which is why the same value comes back each time.
Can anyone provide a hint on where to look? I can't work out how to persist the information so that it changes for each row. Side note: I can't use Pandas, so I can't use its udf function, and the uuid module is no good because the requirement is to keep the ID short enough for a human to type when searching. I've posted the code below.
import itertools
import string
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
@udf(returnType=StringType())
def uid_generator(id_column):
    valid_chars = set(string.ascii_lowercase + string.digits) - set('lio01')
    used = set()
    unique_id_generator = itertools.combinations(valid_chars, 6)
    uid = "".join(next(unique_id_generator)).upper()
    while uid in used:
        uid = "".join(next(unique_id_generator))
    return uid
    used.add(uid)
#uuid_udf = udf(uuid_generator,)
df2 = df_uid_register_input.withColumn('uid', uid_generator(df_uid_register_input.record))
The output is:
In the function definition, you have the argument id_column, but you never use that argument in the function body. And it seems that you haven't tried to use the version column either.
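The repeated value can be reproduced in plain Python, without Spark. In the sketch below (the function names are hypothetical), the first function rebuilds the generator on every call, as the posted code does, so next() always yields the first combination; the second shares one generator across calls. Note that a shared module-level generator would not be shared between Spark executors, so this only illustrates the Python behaviour and is not a fix for the UDF.

```python
import itertools
import string

# Sorted so that the order of combinations is deterministic.
VALID_CHARS = sorted(set(string.ascii_lowercase + string.digits) - set("lio01"))


def uid_fresh_generator():
    # The generator is rebuilt on every call, so the first combination repeats.
    gen = itertools.combinations(VALID_CHARS, 6)
    return "".join(next(gen)).upper()


# Created once and reused, so every next() advances to a new combination.
_shared = itertools.combinations(VALID_CHARS, 6)


def uid_shared_generator():
    return "".join(next(_shared)).upper()


same = [uid_fresh_generator() for _ in range(3)]
distinct = [uid_shared_generator() for _ in range(3)]
print(same, distinct)
```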
What may be easier for you is not to aim for true uniqueness, but to use one of the hash functions. In theory they don't guarantee unique results, but in practice it is extremely unlikely that two different inputs would produce the same hash.
from pyspark.sql import functions as F
df = spark.createDataFrame(
[(1, 1, 2),
(2, 1, 2),
(3, 1, 2)],
['record', 'job_id', 'version'])
df = df.select(
'*',
F.sha1(F.concat_ws('_', 'record', 'version')).alias('uid1'),
F.sha2(F.concat_ws('_', 'record', 'version'), 0).alias('uid2'),
F.md5(F.concat_ws('_', 'record', 'version')).alias('uid3'),
)
df.show()
# +------+------+-------+--------------------+--------------------+--------------------+
# |record|job_id|version| uid1| uid2| uid3|
# +------+------+-------+--------------------+--------------------+--------------------+
# | 1| 1| 2|486cbd63f94d703d2...|0c79023f435b2e9e6...|ab35e84a215f0f711...|
# | 2| 1| 2|f5d7b663eea5f2e69...|48fccc7ee00b72959...|5229803558d4b7895...|
# | 3| 1| 2|982bde375462792cb...|ad9a5c5fb1bc135d8...|dfe3a334fc99f298a...|
# +------+------+-------+--------------------+--------------------+--------------------+
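Since the question asks for something short enough to type, one possible variation is to shorten the hash and map it onto an alphabet without easily confused characters. This is a plain-Python sketch under assumptions: `short_uid`, `ALPHABET` and the default length of 6 are hypothetical choices, and shortening a hash makes collisions far more likely than with the full digest, so uniqueness would still need to be checked.

```python
import hashlib
import string

# Upper-case letters and digits, minus characters that are easy to confuse.
ALPHABET = "".join(sorted(set(string.ascii_uppercase + string.digits) - set("LIO01")))


def short_uid(record, version, length=6):
    digest = hashlib.sha1(f"{record}_{version}".encode("utf-8")).digest()
    number = int.from_bytes(digest, "big")
    chars = []
    for _ in range(length):
        # Peel off one base-len(ALPHABET) digit at a time.
        number, remainder = divmod(number, len(ALPHABET))
        chars.append(ALPHABET[remainder])
    return "".join(chars)


print(short_uid(1, 2))
```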

How to filter with multiple contains in PySpark

I want to run a PySpark query that keeps only the rows containing at least one word from an array. For example, the dataframe is:
"content" "other"
My father is big. ...
My mother is beautiful. ...
I'm going to travel. ...
I have an array:
array=["mother","father"]
And the output must be this:
"content" "other"
My father is big. ...
My mother is beautiful. ...
A simple filter for word in array.
I think this solution works. Let me know what you think.
import pyspark.sql.functions as f
phrases = ['bc', 'ij']
df = spark.createDataFrame([
('abcd',),
('efgh',),
('ijkl',)
], ['col1'])
(df
.withColumn('phrases', f.array([f.lit(element) for element in phrases]))
.where(f.expr('exists(phrases, element -> col1 like concat("%", element, "%"))'))
.drop('phrases')
.show()
)
output
+----+
|col1|
+----+
|abcd|
|ijkl|
+----+
Had the same thoughts as @ARCrow but using instr.
lst=["mother","father"]
DataFrame
data= [
(1,"My father is big."),
(2, "My mother is beautiful"),
(3,"I'm going to travel.")
]
df=spark.createDataFrame(data, ("id",'content'))
Solution
import pyspark.sql.functions as f
df=(df
.withColumn('phrases', f.array([f.lit(element) for element in lst]))
.where(f.expr('exists(phrases, element -> instr (content, element)>=1)'))
.drop('phrases')
)
df.show()
Outcome
+---+--------------------+
| id| content|
+---+--------------------+
| 1| My father is big.|
| 2|My mother is beau...|
+---+--------------------+
Using the same setup as @wwnde,
data= [
(1,"My father is big."),
(2, "My mother is beautiful"),
(3,"I'm going to travel.")
]
df=spark.createDataFrame(data, ("id",'content'))
Solution
from pyspark.sql import functions as F
words = ["father", "mother"]
conditions = " or ".join([f"content like '%{word}%'" for word in words])
(
df
.filter(F.expr(conditions))
.show(truncate=False)
)
+---+----------------------+
|id |content |
+---+----------------------+
|1 |My father is big. |
|2 |My mother is beautiful|
+---+----------------------+
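Because the condition above is built by pasting the words into a SQL string, a word containing a quote or a % would change its meaning. One alternative, sketched here in plain Python, is to build a single escaped regular expression. The helper `build_any_word_pattern` is hypothetical; the resulting pattern could presumably be passed to Column.rlike, for example `df.filter(F.col("content").rlike(pattern))`, although differences between Python and Java regex escaping have not been checked here.

```python
import re


def build_any_word_pattern(words):
    # Escape each word so regex metacharacters are matched literally.
    return "|".join(re.escape(word) for word in words)


pattern = build_any_word_pattern(["father", "mother"])
sentences = ["My father is big.", "My mother is beautiful", "I'm going to travel."]
matches = [s for s in sentences if re.search(pattern, s)]
print(pattern, matches)
```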
We made the Fugue project to port native Python or Pandas code to Spark or Dask. This lets you keep the logic very readable by expressing it in native Python. Fugue can then port it to Spark for you with one function call.
First, we set up:
import pandas as pd
array=["mother","father"]
df = pd.DataFrame({"sentence": ["My father is big.", "My mother is beautiful.", "I'm going to travel. "]})
and then we can create a native Python function to express the logic:
from typing import List, Dict, Any, Iterable
def myfilter(df: List[Dict[str,Any]]) -> Iterable[Dict[str, Any]]:
    for row in df:
        for value in array:
            if value in row["sentence"]:
                yield row
                break  # yield each row at most once, even if several words match
and then test it on Pandas:
from fugue import transform
transform(df, myfilter, schema="*")
Because it works on Pandas, we can execute it on Spark by specifying the engine:
import fugue_spark
transform(df, myfilter, schema="*", engine="spark").show()
+---+--------------------+
| id| sentence|
+---+--------------------+
| 0| My father is big.|
| 1|My mother is beau...|
+---+--------------------+
Note we need .show() because Spark evaluates lazily. Schema is also a Spark requirement so Fugue interprets the "*" as all columns in = all columns out.
The fugue transform function can take both Pandas DataFrame inputs and Spark DataFrame inputs.
Edit:
You can replace the myfilter function above with a Pandas implementation like this:
def myfilter(df: pd.DataFrame) -> pd.DataFrame:
    res = df.loc[df["sentence"].str.contains("|".join(array))]
    return res
and Fugue will be able to port it to Spark the same way. Fugue knows how to adjust to the type hints and this will be faster than the native Python implementation because it takes advantage of Pandas being vectorized.

PySpark groupby elements with key of their occurrence

I have this data in a DATAFRAME:
id,col
65475383,acacia
63975914,acacia
65475383,excelsa
63975914,better
I want to have a dictionary that will contain the column "word" and every id that is associated with it, something like this:
word:key
acacia: 65475383,63975914
excelsa: 65475383
better: 63975914
I tried groupBy, but that is meant for aggregating data. How should I approach this problem?
I'm not sure if you intend to have the result as a Python dictionary or as a Dataframe (it is not clear from your question).
However, if you do want a Dataframe then one way to calculate that is:
from pyspark.sql.functions import collect_list
idsByWords = df \
.groupBy("col") \
.agg(collect_list("id").alias("ids")) \
.withColumnRenamed("col", "word")
This will result in:
idsByWords.show(truncate=False)
+-------+--------------------+
|word |ids |
+-------+--------------------+
|excelsa|[65475383] |
|better |[63975914] |
|acacia |[65475383, 63975914]|
+-------+--------------------+
Then you can turn that dataframe into a Python dictionary :
d = {r.asDict()["word"]: r.asDict()["ids"] for r in idsByWords.collect()}
To finally get:
{
'excelsa': [65475383],
'better': [63975914],
'acacia': [65475383, 63975914]
}
Note that collect may crash your driver program if it exceeds your driver memory.
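If the data is small enough to be collected to the driver anyway, the grouping itself can also be sketched in plain Python. The `rows` list below simply restates the sample data from the question as (id, col) tuples; this is an illustration of the logic, not a replacement for the Spark aggregation on large data.

```python
from collections import defaultdict

# The sample data from the question as (id, col) tuples.
rows = [
    (65475383, "acacia"),
    (63975914, "acacia"),
    (65475383, "excelsa"),
    (63975914, "better"),
]

ids_by_word = defaultdict(list)
for row_id, word in rows:
    # Append each id to the list kept for its word.
    ids_by_word[word].append(row_id)

result = dict(ids_by_word)
print(result)
```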

How to convert RDD into Dataframe with Pyspark?

I have an RDD below, which I have received from a client. How can I convert this RDD into a Dataframe?
["Row(Moid=2, Tripid='11', Tstart='2007-05-28 08:53:14.040', Tend='2007-05-28 08:53:16.040', Xstart='9738.73', Ystart='103.246', Xend='9743.73', Yend='114.553')"]
Note: This is not really an answer, but I don't understand what the OP is asking. This would not have fit in the comment section, but maybe we can take it forward from here.
The OP says that they receive an RDD (purportedly with a single element) from a client:
["Row(Moid=2, Tripid='11', Tstart='2007-05-28 08:53:14.040', Tend='2007-05-28 08:53:16.040', Xstart='9738.73', Ystart='103.246', Xend='9743.73', Yend='114.553')"]
Now, the OP wants to convert that to a DataFrame. To do so, one has to de-string the Row object, but the OP has to clarify what they need.
from pyspark.sql import Row
rdd_from_client = [Row(Moid=2, Tripid='11', Tstart='2007-05-28 08:53:14.040', Tend='2007-05-28 08:53:16.040', Xstart='9738.73', Ystart='103.246', Xend='9743.73', Yend='114.553')]
df = sqlContext.createDataFrame(rdd_from_client)
df.show(truncate=False)
+----+-----------------------+------+-----------------------+-------+-------+-------+-------+
|Moid|Tend |Tripid|Tstart |Xend |Xstart |Yend |Ystart |
+----+-----------------------+------+-----------------------+-------+-------+-------+-------+
|2 |2007-05-28 08:53:16.040|11 |2007-05-28 08:53:14.040|9743.73|9738.73|114.553|103.246|
+----+-----------------------+------+-----------------------+-------+-------+-------+-------+
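If the element really is a string, the "de-stringing" step could be sketched in plain Python with the ast module, which reads the keyword arguments without evaluating arbitrary code. The helper `parse_row_string` is hypothetical and assumes every value is a simple literal, as in the example; the resulting dict could then presumably be turned into a Row with `Row(**fields)` before calling createDataFrame.

```python
import ast


def parse_row_string(text):
    """Turn "Row(a=1, b='x')" into {'a': 1, 'b': 'x'} without using eval()."""
    call = ast.parse(text, mode="eval").body
    if not (isinstance(call, ast.Call) and getattr(call.func, "id", None) == "Row"):
        raise ValueError("expected a string of the form Row(key=value, ...)")
    # literal_eval only accepts literals, so nothing in the string is executed.
    return {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}


fields = parse_row_string(
    "Row(Moid=2, Tripid='11', Tstart='2007-05-28 08:53:14.040', Xstart='9738.73')"
)
print(fields)
```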
