Given a PySpark DataFrame, is it possible to obtain a list of the source columns that are being referenced by the DataFrame?
Perhaps a more concrete example might help explain what I'm after. Say I have a DataFrame defined as:
import pyspark.sql.functions as func
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
source_df = spark.createDataFrame(
    [("pru", 23, "finance"), ("paul", 26, "HR"), ("noel", 20, "HR")],
    ["name", "age", "department"],
)
source_df.createOrReplaceTempView("people")
sqlDF = spark.sql("SELECT name, age, department FROM people")
df = sqlDF.groupBy("department").agg(func.max("age").alias("max_age"))
df.show()
which returns:
+----------+--------+
|department|max_age |
+----------+--------+
| finance| 23|
| HR| 26|
+----------+--------+
The columns that are referenced by df are [department, age]. Is it possible to get that list of referenced columns programmatically?
Thanks to Capturing the result of explain() in pyspark I know I can extract the plan as a string:
df._sc._jvm.PythonSQLUtils.explainString(df._jdf.queryExecution(), "formatted")
which returns:
== Physical Plan ==
AdaptiveSparkPlan (6)
+- HashAggregate (5)
+- Exchange (4)
+- HashAggregate (3)
+- Project (2)
+- Scan ExistingRDD (1)
(1) Scan ExistingRDD
Output [3]: [name#0, age#1L, department#2]
Arguments: [name#0, age#1L, department#2], MapPartitionsRDD[4] at applySchemaToPythonRDD at NativeMethodAccessorImpl.java:0, ExistingRDD, UnknownPartitioning(0)
(2) Project
Output [2]: [age#1L, department#2]
Input [3]: [name#0, age#1L, department#2]
(3) HashAggregate
Input [2]: [age#1L, department#2]
Keys [1]: [department#2]
Functions [1]: [partial_max(age#1L)]
Aggregate Attributes [1]: [max#22L]
Results [2]: [department#2, max#23L]
(4) Exchange
Input [2]: [department#2, max#23L]
Arguments: hashpartitioning(department#2, 200), ENSURE_REQUIREMENTS, [plan_id=60]
(5) HashAggregate
Input [2]: [department#2, max#23L]
Keys [1]: [department#2]
Functions [1]: [max(age#1L)]
Aggregate Attributes [1]: [max(age#1L)#12L]
Results [2]: [department#2, max(age#1L)#12L AS max_age#13L]
(6) AdaptiveSparkPlan
Output [2]: [department#2, max_age#13L]
Arguments: isFinalPlan=false
which is useful; however, it's not what I need. I need a list of the referenced columns. Is this possible?
Perhaps another way of asking the question is... is there a way to obtain the explain plan as an object that I can iterate over/explore?
UPDATE. Thanks to the reply from Matt Andruff I have gotten this:
df._jdf.queryExecution().executedPlan().treeString().split("+-")[-2]
which returns:
' Project [age#1L, department#2]\n '
from which I guess I could parse out the information I'm after, but this is far from an elegant way to do it, and it is particularly error-prone.
What I'm really after is a failsafe, reliable, API-supported way to get this information. I'm starting to think it isn't possible.
There is an object for that; unfortunately, it's a Java object and it isn't translated to PySpark.
You can still access it with Spark constructs:
>>> df._jdf.queryExecution().executedPlan().apply(0).output().apply(0).toString()
u'department#1621'
>>> df._jdf.queryExecution().executedPlan().apply(0).output().apply(1).toString()
u'max_age#1632L'
You could loop through both the above apply to get the information you are looking for with something like:
plan = df._jdf.queryExecution().executedPlan()
steps = [plan.apply(i) for i in range(1, 100) if plan.apply(i) is not None]
iterator = steps[0].inputSet().iterator()
>>> iterator.next().toString()
u'department#1621'
>>> iterator.next().toString()
u'max#1642L'
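A slightly safer variant of the same loop, as a hedged sketch: drain the Scala iterator with hasNext() instead of calling next() blindly (the method names assume the usual py4j view of scala.collection.Iterator):
it = steps[0].inputSet().iterator()
refs = []
while it.hasNext():  # scala.collection.Iterator exposed through py4j
    refs.append(it.next().toString())
print(refs)  # e.g. ['department#1621', 'max#1642L']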
steps = [plan.apply(i) for i in range(1, 100) if plan.apply(i) is not None]
projections = [
    steps[0].p(i).toJSON().encode('ascii', 'ignore')
    for i in range(1, 100)
    if steps[0].p(i) is not None and steps[0].p(i).nodeName().encode('ascii', 'ignore') == 'Project'
]
rdd = spark.sparkContext.parallelize(projections)
df2 = spark.read.json(rdd)
>>> df2.show(1,False)
+-----+------------------------------------------+----+------------+------+--------------+------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----+
|child|class |name|num-children|output|outputOrdering|outputPartitioning|projectList |rdd |
+-----+------------------------------------------+----+------------+------+--------------+------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----+
|0 |org.apache.spark.sql.execution.ProjectExec|null|1 |null |null |null |[[[org.apache.spark.sql.catalyst.expressions.AttributeReference, long, [1620, 4ad48da6-03cf-45d4-9b35-76ac246fadac, org.apache.spark.sql.catalyst.expressions.ExprId], age, true, 0, [people]]], [[org.apache.spark.sql.catalyst.expressions.AttributeReference, string, [1621, 4ad48da6-03cf-45d4-9b35-76ac246fadac, org.apache.spark.sql.catalyst.expressions.ExprId], department, true, 0, [people]]]]|null|
+-----+------------------------------------------+----+------------+------+--------------+------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----+
df2.select(func.explode(func.col('projectList'))).select(func.col('col')[0]['name']).show(100, False)
+-----------+
|col[0].name|
+-----------+
|age |
|department |
+-----------+
range --> bit of a hack, but apparently size() doesn't work. I'm sure with more time I could refine the range hack.
You can then use json to pull the information programmatically.
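For example, here is a hedged sketch that skips the Spark round-trip and decodes the plan nodes with Python's json module directly (the key layout is inferred from the show() output above and may vary across Spark versions):
import json

plan = df._jdf.queryExecution().executedPlan()
# Reuse the range hack above: p(i) yields nothing once we walk past the last node.
project_nodes = [
    json.loads(plan.p(i).toJSON())
    for i in range(1, 100)
    if plan.p(i) is not None and plan.p(i).nodeName() == 'Project'
]
# Each node's JSON is a list whose head describes the node itself; pull the
# attribute names out of its projectList.
referenced = sorted({
    attr['name']
    for node in project_nodes
    for expr in node[0].get('projectList', [])
    for attr in expr
})
print(referenced)  # e.g. ['age', 'department']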
I have something that, while not being an answer to my original question (see Matt Andruff's answer for that), could still be useful here. It's a way to get all the source columns referenced by a pyspark.sql.column.Column.
Simple repro:
from pyspark.sql import functions as f, SparkSession
spark = SparkSession.builder.getOrCreate()
col = f.concat(f.col("A"), f.col("B"))
type(col)
col._jc.expr().references().toList().toString()
returns:
<class 'pyspark.sql.column.Column'>
"List('A, 'B)"
It's definitely not perfect; it still requires you to parse the column names out of the string that is returned, but at least the information I'm after is available. There might be some more methods on the object returned from references() that make it easier to parse the returned string, but if there are, I haven't found them!
Here is a function I wrote to do the parsing:
def parse_references(references: str):
    return sorted(
        "".join(
            references.replace("'", "")
            .replace("List(", "")
            .replace(")", "")
            .split()
        ).split(",")
    )

assert parse_references("List('A, 'B)") == ["A", "B"]
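Alternatively, here is a hedged sketch that avoids string parsing altogether by walking the underlying Scala collection through py4j; it assumes references() yields Catalyst attributes that expose name(), which is worth verifying on your Spark version:
def column_references(column):
    # references() -> AttributeSet; toList() -> a Scala List we can index via py4j
    refs = column._jc.expr().references().toList()
    return sorted(refs.apply(i).name() for i in range(refs.size()))

assert column_references(f.concat(f.col("A"), f.col("B"))) == ["A", "B"]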
PySpark is not really designed for such low-level tricks; they beg for Scala, the language Spark is developed in, which as such offers access to all there is inside.
The step where you access QueryExecution is the main entry point to the machinery of Spark SQL's query execution engine.
The issue is that py4j (which is used as a bridge between the JVM and Python environments) renders it of little use on PySpark's side.
You can use the following if you need to access the final query plan (just before it's converted into RDDs):
df._jdf.queryExecution().executedPlan().prettyJson()
Review the QueryExecution API.
QueryExecutionListener
You should really consider Scala to intercept whatever you want about your queries, and QueryExecutionListener seems a fairly viable starting point.
There is more, but it's all in Scala :)
What I'm really after is a failsafe, reliable, API-supported way to get this information. I'm starting to think it isn't possible.
I'm not surprised, since you're throwing away the best possible answer: Scala. I'd recommend using it for a PoC to see what you can get, and only then (if you have to) look for a Python solution (which I think is doable yet highly error-prone).
You can try the code below; it will give you the list of columns and their data types in the DataFrame.
for field in df.schema.fields:
    print(field.name + " , " + str(field.dataType))
Why is .select() showing/parsing values differently to when I don't use it?
I have this CSV:
CompanyName, CompanyNumber,RegAddress.CareOf,RegAddress.POBox,RegAddress.AddressLine1, RegAddress.AddressLine2,RegAddress.PostTown,RegAddress.County,RegAddress.Country,RegAddress.PostCode,CompanyCategory,CompanyStatus,CountryOfOrigin,DissolutionDate,IncorporationDate,Accounts.AccountRefDay,Accounts.AccountRefMonth,Accounts.NextDueDate,Accounts.LastMadeUpDate,Accounts.AccountCategory,Returns.NextDueDate,Returns.LastMadeUpDate,Mortgages.NumMortCharges,Mortgages.NumMortOutstanding,Mortgages.NumMortPartSatisfied,Mortgages.NumMortSatisfied,SICCode.SicText_1,SICCode.SicText_2,SICCode.SicText_3,SICCode.SicText_4,LimitedPartnerships.NumGenPartners,LimitedPartnerships.NumLimPartners,URI,PreviousName_1.CONDATE, PreviousName_1.CompanyName, PreviousName_2.CONDATE, PreviousName_2.CompanyName,PreviousName_3.CONDATE, PreviousName_3.CompanyName,PreviousName_4.CONDATE, PreviousName_4.CompanyName,PreviousName_5.CONDATE, PreviousName_5.CompanyName,PreviousName_6.CONDATE, PreviousName_6.CompanyName,PreviousName_7.CONDATE, PreviousName_7.CompanyName,PreviousName_8.CONDATE, PreviousName_8.CompanyName,PreviousName_9.CONDATE, PreviousName_9.CompanyName,PreviousName_10.CONDATE, PreviousName_10.CompanyName,ConfStmtNextDueDate, ConfStmtLastMadeUpDate
"ATS CAR RENTALS LIMITED","10795807","","",", 1ST FLOOR ,WESTHILL HOUSE 2B DEVONSHIRE ROAD","ACCOUNTING FREEDOM","BEXLEYHEATH","","ENGLAND","DA6 8DS","Private Limited Company","Active","United Kingdom","","31/05/2017","31","5","28/02/2023","31/05/2021","TOTAL EXEMPTION FULL","28/06/2018","","0","0","0","0","49390 - Other passenger land transport","","","","0","0","http://business.data.gov.uk/id/company/10795807","","","","","","","","","","","","","","","","","","","","","12/06/2023","29/05/2022"
"ATS CARE LIMITED","10393661","","","UNIT 5 CO-OP BUILDINGS HIGH STREET","ABERSYCHAN","PONTYPOOL","TORFAEN","WALES","NP4 7AE","Private Limited Company","Active","United Kingdom","","26/09/2016","30","9","30/06/2023","30/09/2021","UNAUDITED ABRIDGED","24/10/2017","","0","0","0","0","87900 - Other residential care activities n.e.c.","","","","0","0","http://business.data.gov.uk/id/company/10393661","17/05/2018","ATS SUPPORT LIMITED","22/12/2017","ATS CARE LIMITED","","","","","","","","","","","","","","","","","09/10/2022","25/09/2021"
I'm reading the csv like so:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
_file = "/path/dir/BasicCompanyDataAsOneFile-2022-08-01.csv"
df = spark.read.csv(_file, header=True, quote='"', escape="\"")
Focusing on the CompanyCategory column, we should see Private Limited Company for both lines. But this is what I get instead when using select():
df.select("CompanyCategory").show(truncate=False)
+-----------------------+
|CompanyCategory |
+-----------------------+
|DA6 8DS |
|Private Limited Company|
+-----------------------+
df.select("CompanyCategory").collect()
[Row(CompanyCategory='DA6 8DS'),
Row(CompanyCategory='Private Limited Company')]
vs when not using select():
from pprint import pprint
for row in df.collect():
pprint(row.asDict())
{' CompanyNumber': '10795807',
...
'CompanyCategory': 'Private Limited Company',
'CompanyName': 'ATS CAR RENTALS LIMITED',
...}
{' CompanyNumber': '10393661',
...
'CompanyCategory': 'Private Limited Company',
'CompanyName': 'ATS CARE LIMITED',
...}
Using asDict() for readability.
SQL doing the same thing:
df.createOrReplaceTempView("companies")
spark.sql('select CompanyCategory from companies').show()
+--------------------+
| CompanyCategory|
+--------------------+
|Private Limited C...|
| DA6 8DS|
+--------------------+
As you can see, when not using select() the CompanyCategory values show correctly. Why is this happening? What can I do to avoid this?
Context: I'm trying to create dimension tables, which is why I'm selecting a single column. The next phase is to drop duplicates, filter, sort, etc.
Edit:
Here are two example values in the actual CSV that are throwing things off:
CompanyName of """ BORA "" 2 LTD"
1st line address of ", 1ST FLOOR ,WESTHILL HOUSE 2B DEVONSHIRE ROAD"
Note:
These values are from two separate, distinct lines in the CSV.
These values are copied and pasted from the CSV opened in a text editor (like Notepad or VSCode).
Tried and failed:
df = spark.read.csv(_file, header=True) - picks up completely incorrect columns.
df = spark.read.csv(_file, header=True, escape='\"') - the exact same thing described in the original question above, so same results.
df = spark.read.csv(_file, header=True, escape='""') - since the CSV escapes quotes using two double quotes, I guessed that using two double quotes as the escape param would do the trick, but I get the following error:
Py4JJavaError: An error occurred while calling o276.csv.
: java.lang.RuntimeException: escape cannot be more than one character
When reading the CSV, the parameters quote and escape are set to the same value ('"' == "\"" returns True in Python).
I would guess that configuring both parameters in this way somehow disturbs the parser that Spark uses to separate the individual fields. After removing the escape parameter, you can process the remaining " characters with regexp_replace:
from pyspark.sql import functions as F
df = spark.read.csv(<filename>, header=True, quote='"')
cols = [
    F.regexp_replace(
        F.regexp_replace(
            F.regexp_replace("`" + col + "`", '^"', ''),
            '"$', ''),
        '""', '"'
    ).alias(col)
    for col in df.columns
]
df.select(cols).show(truncate=False)
Probably there is a smart regexp that can combine all three replace operations into one...
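As a hedged refinement of that idea, the first two replacements can at least be collapsed into one alternation, leaving two passes instead of three:
# '^"|"$' strips a leading or trailing quote in one pass, then '""' is unescaped to '"'.
cols = [
    F.regexp_replace(
        F.regexp_replace("`" + c + "`", '^"|"$', ''),
        '""', '"'
    ).alias(c)
    for c in df.columns
]
df.select(cols).show(truncate=False)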
This is an issue when reading a single column from CSV file vs. when reading all the columns:
df = spark.read.csv('companies-house.csv', header=True, quote='"', escape="\"")
df.show() # correct output (all columns loaded)
df.toPandas() # same as pd.read_csv()
df.select('CompanyCategory').show() # wrong output (trying to load a single column)
df.cache() # all columns will be loaded and used in any subsequent call
df.select('CompanyCategory').show() # correct output
The first select() performs a different (optimized) read than the second, so one possible workaround would be to cache() the data immediately. This will, however, load all the columns, not just one (although pandas and COPY do the same).
The problematic part of the CSV is the RegAddress.POBox column, where an empty value is saved as ", instead of "",. You can check this by incrementally loading more columns:
df.unpersist() # undo cache() operation (for testing purposes only)
df.select(*[f"`{c}`" for c in df.columns[:3]], 'CompanyCategory').show() # wrong
df.select(*[f"`{c}`" for c in df.columns[:4]], 'CompanyCategory').show() # OK
I have a CSV that has been returned with the data in a god-awful state; I need to parse both the header and the data out of each row.
This is an example of one row:
+--------------+------------+--------------------+--------------+------------+-------------+--------------------+----------+--------------+----------+----------+-----------+-------------+-------------+----------+--------------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+--------------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+--------------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+--------------------+--------------+----------+------------+----------+--------------+---------------+
| _c0| _c1| _c2| _c3| _c4| _c5| _c6| _c7| _c8| _c9| _c10| _c11| _c12| _c13| _c14| _c15| _c16| _c17| _c18| _c19| _c20| _c21| _c22| _c23| _c24| _c25| _c26| _c27| _c28| _c29| _c30| _c31| _c32| _c33| _c34| _c35| _c36| _c37| _c38| _c39| _c40| _c41| _c42| _c43| _c44| _c45| _c46| _c47| _c48| _c49| _c50| _c51| _c52| _c53| _c54| _c55| _c56| _c57| _c58| _c59| _c60| _c61| _c62| _c63| _c64| _c65| _c66| _c67| _c68| _c69| _c70| _c71| _c72| _c73| _c74| _c75| _c76| _c77| _c78| _c79| _c80| _c81| _c82| _c83| _c84| _c85| _c86| _c87| _c88| _c89| _c90| _c91| _c92| _c93| _c94| _c95| _c96| _c97| _c98| _c99| _c100| _c101| _c102| _c103| _c104| _c105| _c106| _c107| _c108| _c109| _c110| _c111| _c112| _c113| _c114| _c115| _c116| _c117| _c118| _c119| _c120| _c121| _c122| _c123| _c124| _c125| _c126| _c127| _c128| _c129| _c130| _c131| _c132| _c133| _c134| _c135| _c136| _c137| _c138| _c139| _c140| _c141| _c142| _c143| _c144| _c145| _c146| _c147| _c148| _c149| _c150|
+--------------+------------+--------------------+--------------+------------+-------------+--------------------+----------+--------------+----------+----------+-----------+-------------+-------------+----------+--------------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+--------------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+--------------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+--------------------+--------------+----------+------------+----------+--------------+---------------+
|{"MANDT":"400"|"LEDNR":"00"|"OBJNR":"KS660000...|"GJAHR":"2022"|"WRTTP":"04"|"VERSN":"000"|"KSTAR":"0051040100"|"HRKFT":""|"VRGNG":"COIN"|"VBUND":""|"PARGB":""|"BEKNZ":"H"|"TWAER":"THB"|"PERBL":"016"|"MEINH":""|"WTG001":-1854554.89|"WTG002":0.00|"WTG003":0.00|"WTG004":0.00|"WTG005":0.00|"WTG006":0.00|"WTG007":0.00|"WTG008":0.00|"WTG009":0.00|"WTG010":0.00|"WTG011":0.00|"WTG012":0.00|"WTG013":0.00|"WTG014":0.00|"WTG015":0.00|"WTG016":0.00|"WOG001":-1854554.89|"WOG002":0.00|"WOG003":0.00|"WOG004":0.00|"WOG005":0.00|"WOG006":0.00|"WOG007":0.00|"WOG008":0.00|"WOG009":0.00|"WOG010":0.00|"WOG011":0.00|"WOG012":0.00|"WOG013":0.00|"WOG014":0.00|"WOG015":0.00|"WOG016":0.00|"WKG001":-1854554.89|"WKG002":0.00|"WKG003":0.00|"WKG004":0.00|"WKG005":0.00|"WKG006":0.00|"WKG007":0.00|"WKG008":0.00|"WKG009":0.00|"WKG010":0.00|"WKG011":0.00|"WKG012":0.00|"WKG013":0.00|"WKG014":0.00|"WKG015":0.00|"WKG016":0.00|"WKF001":0.00|"WKF002":0.00|"WKF003":0.00|"WKF004":0.00|"WKF005":0.00|"WKF006":0.00|"WKF007":0.00|"WKF008":0.00|"WKF009":0.00|"WKF010":0.00|"WKF011":0.00|"WKF012":0.00|"WKF013":0.00|"WKF014":0.00|"WKF015":0.00|"WKF016":0.00|"PAG001":0.00|"PAG002":0.00|"PAG003":0.00|"PAG004":0.00|"PAG005":0.00|"PAG006":0.00|"PAG007":0.00|"PAG008":0.00|"PAG009":0.00|"PAG010":0.00|"PAG011":0.00|"PAG012":0.00|"PAG013":0.00|"PAG014":0.00|"PAG015":0.00|"PAG016":0.00|"MEG001":0.000|"MEG002":0.000|"MEG003":0.000|"MEG004":0.000|"MEG005":0.000|"MEG006":0.000|"MEG007":0.000|"MEG008":0.000|"MEG009":0.000|"MEG010":0.000|"MEG011":0.000|"MEG012":0.000|"MEG013":0.000|"MEG014":0.000|"MEG015":0.000|"MEG016":0.000|"MEF001":0.000|"MEF002":0.000|"MEF003":0.000|"MEF004":0.000|"MEF005":0.000|"MEF006":0.000|"MEF007":0.000|"MEF008":0.000|"MEF009":0.000|"MEF010":0.000|"MEF011":0.000|"MEF012":0.000|"MEF013":0.000|"MEF014":0.000|"MEF015":0.000|"MEF016":0.000|"MUV001":""|"MUV002":""|"MUV003":""|"MUV004":""|"MUV005":""|"MUV006":""|"MUV007":""|"MUV008":""|"MUV009":""|"MUV010":""|"MUV011":""|"MUV012":""|"MUV013":""|"MUV014":""|"MUV015":""|"MUV016":""|"BELTP":"1"|"TIMESTMP":101246...|"BUKRS":"6611"|"FKBER":""|"SEGMENT":""|"GEBER":""|"GRANT_NBR":""|"BUDGET_PD":""}|
+--------------+------------+--------------------+--------------+------------+-------------+--------------------+----------+--------------+----------+----------+-----------+-------------+-------------+----------+--------------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+--------------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+--------------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+--------------------+--------------+----------+------------+----------+--------------+---------------+
The first part, for example MANDT, is the column header, and the bit after the : is the value. I basically need to:
A) loop over all the columns and change the headers so they match the bit prior to the :
B) then populate the rows with the part after it.
I've attempted a small piece of code just to edit all the columns, like below:
from pyspark.sql.functions import split
for colname in COSPDF.columns:
print(colname)
COSPDF = COSPDF.withColumn(col(colname), lower(colname))
and I receive the error TypeError: 'str' object is not callable.
I've then done the "lazy" thing and found some code like below
from pyspark.sql.functions import split
split_df = COSPDF.select(split(COSPDF._c0, ':').alias('split_text'))
split_df.selectExpr("split_text[0] as left").show() # left of delim
split_df.selectExpr("split_text[1] as right").show() # right of delim
However, this code only works on one column that I have to specify, which doesn't work when the CSV has 123 columns; I'm not doing it 123 times. Any assistance would really help with this, please; it's had me stuck for hours.
UPDATED
Some rows from the original file:
"{""MANDT"":""400""","""LEDNR"":""00""","""OBJNR"":""KS66000011001070""","""GJAHR"":""2022""","""WRTTP"":""04""","""VERSN"":""000""","""KSTAR"":""0051040100""","""HRKFT"":""""","""VRGNG"":""COIN""","""VBUND"":""""","""PARGB"":""""","""BEKNZ"":""H""","""TWAER"":""THB""","""PERBL"":""016""","""MEINH"":""""","""WTG001"":-1854554.89","""WTG002"":0.00","""WTG003"":0.00","""WTG004"":0.00","""WTG005"":0.00","""WTG006"":0.00","""WTG007"":0.00","""WTG008"":0.00","""WTG009"":0.00","""WTG010"":0.00","""WTG011"":0.00","""WTG012"":0.00","""WTG013"":0.00","""WTG014"":0.00","""WTG015"":0.00","""WTG016"":0.00","""WOG001"":-1854554.89","""WOG002"":0.00","""WOG003"":0.00","""WOG004"":0.00","""WOG005"":0.00","""WOG006"":0.00","""WOG007"":0.00","""WOG008"":0.00","""WOG009"":0.00","""WOG010"":0.00","""WOG011"":0.00","""WOG012"":0.00","""WOG013"":0.00","""WOG014"":0.00","""WOG015"":0.00","""WOG016"":0.00","""WKG001"":-1854554.89","""WKG002"":0.00","""WKG003"":0.00","""WKG004"":0.00","""WKG005"":0.00","""WKG006"":0.00","""WKG007"":0.00","""WKG008"":0.00","""WKG009"":0.00","""WKG010"":0.00","""WKG011"":0.00","""WKG012"":0.00","""WKG013"":0.00","""WKG014"":0.00","""WKG015"":0.00","""WKG016"":0.00","""WKF001"":0.00","""WKF002"":0.00","""WKF003"":0.00","""WKF004"":0.00","""WKF005"":0.00","""WKF006"":0.00","""WKF007"":0.00","""WKF008"":0.00","""WKF009"":0.00","""WKF010"":0.00","""WKF011"":0.00","""WKF012"":0.00","""WKF013"":0.00","""WKF014"":0.00","""WKF015"":0.00","""WKF016"":0.00","""PAG001"":0.00","""PAG002"":0.00","""PAG003"":0.00","""PAG004"":0.00","""PAG005"":0.00","""PAG006"":0.00","""PAG007"":0.00","""PAG008"":0.00","""PAG009"":0.00","""PAG010"":0.00","""PAG011"":0.00","""PAG012"":0.00","""PAG013"":0.00","""PAG014"":0.00","""PAG015"":0.00","""PAG016"":0.00","""MEG001"":0.000","""MEG002"":0.000","""MEG003"":0.000","""MEG004"":0.000","""MEG005"":0.000","""MEG006"":0.000","""MEG007"":0.000","""MEG008"":0.000","""MEG009"":0.000","""MEG010"":0.000","""MEG011"":0.000","""MEG012"":0.000","""MEG013"":0.000","""MEG014"":0.000","""MEG015"":0.000","""MEG016"":0.000","""MEF001"":0.000","""MEF002"":0.000","""MEF003"":0.000","""MEF004"":0.000","""MEF005"":0.000","""MEF006"":0.000","""MEF007"":0.000","""MEF008"":0.000","""MEF009"":0.000","""MEF010"":0.000","""MEF011"":0.000","""MEF012"":0.000","""MEF013"":0.000","""MEF014"":0.000","""MEF015"":0.000","""MEF016"":0.000","""MUV001"":""""","""MUV002"":""""","""MUV003"":""""","""MUV004"":""""","""MUV005"":""""","""MUV006"":""""","""MUV007"":""""","""MUV008"":""""","""MUV009"":""""","""MUV010"":""""","""MUV011"":""""","""MUV012"":""""","""MUV013"":""""","""MUV014"":""""","""MUV015"":""""","""MUV016"":""""","""BELTP"":""1""","""TIMESTMP"":10124662650000.0","""BUKRS"":""6611""","""FKBER"":""""","""SEGMENT"":""""","""GEBER"":""""","""GRANT_NBR"":""""","""BUDGET_PD"":""""}"
"{""MANDT"":""400""","""LEDNR"":""00""","""OBJNR"":""KS66000011001070""","""GJAHR"":""2022""","""WRTTP"":""04""","""VERSN"":""000""","""KSTAR"":""0051040100""","""HRKFT"":""""","""VRGNG"":""COIN""","""VBUND"":""""","""PARGB"":""""","""BEKNZ"":""S""","""TWAER"":""THB""","""PERBL"":""016""","""MEINH"":""""","""WTG001"":7424891.07","""WTG002"":0.00","""WTG003"":0.00","""WTG004"":0.00","""WTG005"":0.00","""WTG006"":0.00","""WTG007"":0.00","""WTG008"":0.00","""WTG009"":0.00","""WTG010"":0.00","""WTG011"":0.00","""WTG012"":0.00","""WTG013"":0.00","""WTG014"":0.00","""WTG015"":0.00","""WTG016"":0.00","""WOG001"":7424891.07","""WOG002"":0.00","""WOG003"":0.00","""WOG004"":0.00","""WOG005"":0.00","""WOG006"":0.00","""WOG007"":0.00","""WOG008"":0.00","""WOG009"":0.00","""WOG010"":0.00","""WOG011"":0.00","""WOG012"":0.00","""WOG013"":0.00","""WOG014"":0.00","""WOG015"":0.00","""WOG016"":0.00","""WKG001"":7424891.07","""WKG002"":0.00","""WKG003"":0.00","""WKG004"":0.00","""WKG005"":0.00","""WKG006"":0.00","""WKG007"":0.00","""WKG008"":0.00","""WKG009"":0.00","""WKG010"":0.00","""WKG011"":0.00","""WKG012"":0.00","""WKG013"":0.00","""WKG014"":0.00","""WKG015"":0.00","""WKG016"":0.00","""WKF001"":0.00","""WKF002"":0.00","""WKF003"":0.00","""WKF004"":0.00","""WKF005"":0.00","""WKF006"":0.00","""WKF007"":0.00","""WKF008"":0.00","""WKF009"":0.00","""WKF010"":0.00","""WKF011"":0.00","""WKF012"":0.00","""WKF013"":0.00","""WKF014"":0.00","""WKF015"":0.00","""WKF016"":0.00","""PAG001"":0.00","""PAG002"":0.00","""PAG003"":0.00","""PAG004"":0.00","""PAG005"":0.00","""PAG006"":0.00","""PAG007"":0.00","""PAG008"":0.00","""PAG009"":0.00","""PAG010"":0.00","""PAG011"":0.00","""PAG012"":0.00","""PAG013"":0.00","""PAG014"":0.00","""PAG015"":0.00","""PAG016"":0.00","""MEG001"":0.000","""MEG002"":0.000","""MEG003"":0.000","""MEG004"":0.000","""MEG005"":0.000","""MEG006"":0.000","""MEG007"":0.000","""MEG008"":0.000","""MEG009"":0.000","""MEG010"":0.000","""MEG011"":0.000","""MEG012"":0.000","""MEG013"":0.000","""MEG014"":0.000","""MEG015"":0.000","""MEG016"":0.000","""MEF001"":0.000","""MEF002"":0.000","""MEF003"":0.000","""MEF004"":0.000","""MEF005"":0.000","""MEF006"":0.000","""MEF007"":0.000","""MEF008"":0.000","""MEF009"":0.000","""MEF010"":0.000","""MEF011"":0.000","""MEF012"":0.000","""MEF013"":0.000","""MEF014"":0.000","""MEF015"":0.000","""MEF016"":0.000","""MUV001"":""""","""MUV002"":""""","""MUV003"":""""","""MUV004"":""""","""MUV005"":""""","""MUV006"":""""","""MUV007"":""""","""MUV008"":""""","""MUV009"":""""","""MUV010"":""""","""MUV011"":""""","""MUV012"":""""","""MUV013"":""""","""MUV014"":""""","""MUV015"":""""","""MUV016"":""""","""BELTP"":""1""","""TIMESTMP"":10160936750000.0","""BUKRS"":""6611""","""FKBER"":""""","""SEGMENT"":""""","""GEBER"":""""","""GRANT_NBR"":""""","""BUDGET_PD"":""""}"
"{""MANDT"":""400""","""LEDNR"":""00""","""OBJNR"":""KS66000011001070""","""GJAHR"":""2022""","""WRTTP"":""04""","""VERSN"":""000""","""KSTAR"":""0051040105""","""HRKFT"":""""","""VRGNG"":""COIN""","""VBUND"":""""","""PARGB"":""""","""BEKNZ"":""H""","""TWAER"":""THB""","""PERBL"":""016""","""MEINH"":""""","""WTG001"":-509518.63","""WTG002"":0.00","""WTG003"":0.00","""WTG004"":0.00","""WTG005"":0.00","""WTG006"":0.00","""WTG007"":0.00","""WTG008"":0.00","""WTG009"":0.00","""WTG010"":0.00","""WTG011"":0.00","""WTG012"":0.00","""WTG013"":0.00","""WTG014"":0.00","""WTG015"":0.00","""WTG016"":0.00","""WOG001"":-509518.63","""WOG002"":0.00","""WOG003"":0.00","""WOG004"":0.00","""WOG005"":0.00","""WOG006"":0.00","""WOG007"":0.00","""WOG008"":0.00","""WOG009"":0.00","""WOG010"":0.00","""WOG011"":0.00","""WOG012"":0.00","""WOG013"":0.00","""WOG014"":0.00","""WOG015"":0.00","""WOG016"":0.00","""WKG001"":-509518.63","""WKG002"":0.00","""WKG003"":0.00","""WKG004"":0.00","""WKG005"":0.00","""WKG006"":0.00","""WKG007"":0.00","""WKG008"":0.00","""WKG009"":0.00","""WKG010"":0.00","""WKG011"":0.00","""WKG012"":0.00","""WKG013"":0.00","""WKG014"":0.00","""WKG015"":0.00","""WKG016"":0.00","""WKF001"":0.00","""WKF002"":0.00","""WKF003"":0.00","""WKF004"":0.00","""WKF005"":0.00","""WKF006"":0.00","""WKF007"":0.00","""WKF008"":0.00","""WKF009"":0.00","""WKF010"":0.00","""WKF011"":0.00","""WKF012"":0.00","""WKF013"":0.00","""WKF014"":0.00","""WKF015"":0.00","""WKF016"":0.00","""PAG001"":0.00","""PAG002"":0.00","""PAG003"":0.00","""PAG004"":0.00","""PAG005"":0.00","""PAG006"":0.00","""PAG007"":0.00","""PAG008"":0.00","""PAG009"":0.00","""PAG010"":0.00","""PAG011"":0.00","""PAG012"":0.00","""PAG013"":0.00","""PAG014"":0.00","""PAG015"":0.00","""PAG016"":0.00","""MEG001"":0.000","""MEG002"":0.000","""MEG003"":0.000","""MEG004"":0.000","""MEG005"":0.000","""MEG006"":0.000","""MEG007"":0.000","""MEG008"":0.000","""MEG009"":0.000","""MEG010"":0.000","""MEG011"":0.000","""MEG012"":0.000","""MEG013"":0.000","""MEG014"":0.000","""MEG015"":0.000","""MEG016"":0.000","""MEF001"":0.000","""MEF002"":0.000","""MEF003"":0.000","""MEF004"":0.000","""MEF005"":0.000","""MEF006"":0.000","""MEF007"":0.000","""MEF008"":0.000","""MEF009"":0.000","""MEF010"":0.000","""MEF011"":0.000","""MEF012"":0.000","""MEF013"":0.000","""MEF014"":0.000","""MEF015"":0.000","""MEF016"":0.000","""MUV001"":""""","""MUV002"":""""","""MUV003"":""""","""MUV004"":""""","""MUV005"":""""","""MUV006"":""""","""MUV007"":""""","""MUV008"":""""","""MUV009"":""""","""MUV010"":""""","""MUV011"":""""","""MUV012"":""""","""MUV013"":""""","""MUV014"":""""","""MUV015"":""""","""MUV016"":""""","""BELTP"":""1""","""TIMESTMP"":10124662700000.0","""BUKRS"":""6611""","""FKBER"":""""","""SEGMENT"":""""","""GEBER"":""""","""GRANT_NBR"":""""","""BUDGET_PD"":""""}"
Simply put, you need to set the header names on the pandas DataFrame, like:
df.columns = ["Column_Name1", "Column_Name2", "Column_Name3", "Column_Name4", and so on...]
And if you want to use a loop to assign a name to each column, you need to iterate over the list and assign based on the index and the length of the list.
First, read the CSV and get each key-value pair by iterating over the columns:
import pandas as pd
read_df = pd.read_csv(<your csv file path>)
dict_of_pairs = {pairs: read_df[pairs] for pairs in read_df}
Then write it to another file:
write_df = pd.DataFrame({k: pd.Series(v) for k, v in dict_of_pairs.items()})  # this will allow you to write even if some column has no values in it
writer = pd.ExcelWriter(write_path, engine='xlsxwriter')
write_df.to_excel(writer, sheet_name='Somename for your sheet', index=False)
Hope this answers your question.
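Since the question itself is in PySpark, here is a hedged sketch of the same idea without pandas: recover each header from the text before the : in the first row, then keep only the cleaned value after it. It assumes every cell follows the "KEY":VALUE shape shown in the sample rows, and F.split's limit argument needs Spark 3.0+:
from pyspark.sql import functions as F

# One row is enough to recover the header names (assumes a uniform shape).
first = COSPDF.first()
new_names = [str(first[c]).split(":", 1)[0].strip('{}" ') for c in COSPDF.columns]

# Keep the part after the first ':' and strip the JSON-ish braces/quotes around it.
cleaned = COSPDF.select([
    F.regexp_replace(F.split(F.col(c), ":", 2).getItem(1), '[{}"]', '').alias(name)
    for c, name in zip(COSPDF.columns, new_names)
])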
I have a pyspark dataframe, this is what it looks like
+------------------------------------+-------------------+-------------+--------------------------------+---------+
|member_uuid |Timestamp |updated |member_id |easy_id |
+------------------------------------+-------------------+-------------+--------------------------------+---------+
|027130fe-584d-4d8e-9fb0-b87c984a0c20|2020-02-11 19:15:32|password_hash|ajuypjtnlzmk4na047cgav27jma6_STG|993269700|
I transformed the above dataframe to this,
+---------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+
|attribute|operation|params |timestamp |
+---------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+
|profile |UPDATE |{"member_uuid":"027130fe-584d-4d8e-9fb0-b87c984a0c20","member_id":"ajuypjtnlzmk4na047cgav27jma6_STG","easy_id":993269700,"field":"password_hash"}|2020-02-11 19:15:32|
Using the following code,
from pyspark.sql import functions as F
from pyspark.sql.functions import col, lit, struct

ll = ['member_uuid', 'member_id', 'easy_id', 'field']
df = df.withColumn('timestamp', col('Timestamp')) \
    .withColumn('attribute', lit('profile')) \
    .withColumn('operation', lit(col_name)) \
    .withColumn('field', col('updated')) \
    .withColumn('params', F.to_json(struct([x for x in ll])))
df = df.select('attribute', 'operation', 'params', 'timestamp')
I have to save this DataFrame df to a text file after converting it to JSON.
I tried using the following code to do the same,
df_final.toJSON().coalesce(1).saveAsTextFile('file')
The file contains,
{"attribute":"profile","operation":"UPDATE","params":"{\"member_uuid\":\"027130fe-584d-4d8e-9fb0-b87c984a0c20\",\"member_id\":\"ajuypjtnlzmk4na047cgav27jma6_STG\",\"easy_id\":993269700,\"field\":\"password_hash\"}","timestamp":"2020-02-11T19:15:32.000Z"}
I want it to save in this format,
{"attribute":"profile","operation":"UPDATE","params":{"member_uuid":"027130fe-584d-4d8e-9fb0-b87c984a0c20","member_id":"ajuypjtnlzmk4na047cgav27jma6_STG","easy_id":993269700,"field":"password_hash"},"timestamp":"2020-02-11T19:15:32.000Z"}
to_json saves the value in the params column as a string; is there a way to keep the JSON structure here so I can save it in the desired format?
Don't use to_json to create the params column in the DataFrame.
The trick here is to just create the struct and write to the file (using .saveAsTextFile() or .write.json()); Spark will create the JSON for the struct field.
If we have already created a JSON string and then write it in JSON format, Spark will add \ to escape the quotes that already exist in the JSON string.
Example:
from pyspark.sql.functions import *
#sample data
df=spark.createDataFrame([("027130fe-584d-4d8e-9fb0-b87c984a0c20","2020-02-11 19:15:32","password_hash","ajuypjtnlzmk4na047cgav27jma6_STG","993269700")],["member_uuid","Timestamp","updated","member_id","easy_id"])
df1=df.withColumn("attribute",lit("profile")).withColumn("operation",lit("UPDATE"))
df1.selectExpr("struct(member_uuid,member_id,easy_id) as params","attribute","operation","timestamp").write.format("json").mode("overwrite").save("<path>")
#{"params":{"member_uuid":"027130fe-584d-4d8e-9fb0-b87c984a0c20","member_id":"ajuypjtnlzmk4na047cgav27jma6_STG","easy_id":"993269700"},"attribute":"profile","operation":"UPDATE","timestamp":"2020-02-11 19:15:32"}
df1.selectExpr("struct(member_uuid,member_id,easy_id) as params","attribute","operation","timestamp").toJSON().saveAsTextFile("<path>")
#{"params":{"member_uuid":"027130fe-584d-4d8e-9fb0-b87c984a0c20","member_id":"ajuypjtnlzmk4na047cgav27jma6_STG","easy_id":"993269700"},"attribute":"profile","operation":"UPDATE","timestamp":"2020-02-11 19:15:32"}
A simple way to handle it is to just do a replace operation on the file:
sourceData = open('file').read().replace('"{', '{').replace('}"', '}').replace('\\', '')
with open('file', 'w') as final:
    final.write(sourceData)
This might not be what you are looking for, but will achieve the end result.
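A slightly more defensive variant of the same post-processing, as a hedged sketch using the json module so quoting inside values can't be mangled (it assumes one JSON object per line, which is what toJSON() produces):
import json

# Decode each line, un-stringify the nested params field, and rewrite the file.
with open('file') as src:
    rows = [json.loads(line) for line in src if line.strip()]
for row in rows:
    row['params'] = json.loads(row['params'])  # params arrives as a JSON string
with open('file', 'w') as out:
    for row in rows:
        out.write(json.dumps(row) + '\n')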