PySpark DataFrame showing different results when using .select()


Why does .select() show/parse values differently from when I don't use it?
I have this CSV:
CompanyName, CompanyNumber,RegAddress.CareOf,RegAddress.POBox,RegAddress.AddressLine1, RegAddress.AddressLine2,RegAddress.PostTown,RegAddress.County,RegAddress.Country,RegAddress.PostCode,CompanyCategory,CompanyStatus,CountryOfOrigin,DissolutionDate,IncorporationDate,Accounts.AccountRefDay,Accounts.AccountRefMonth,Accounts.NextDueDate,Accounts.LastMadeUpDate,Accounts.AccountCategory,Returns.NextDueDate,Returns.LastMadeUpDate,Mortgages.NumMortCharges,Mortgages.NumMortOutstanding,Mortgages.NumMortPartSatisfied,Mortgages.NumMortSatisfied,SICCode.SicText_1,SICCode.SicText_2,SICCode.SicText_3,SICCode.SicText_4,LimitedPartnerships.NumGenPartners,LimitedPartnerships.NumLimPartners,URI,PreviousName_1.CONDATE, PreviousName_1.CompanyName, PreviousName_2.CONDATE, PreviousName_2.CompanyName,PreviousName_3.CONDATE, PreviousName_3.CompanyName,PreviousName_4.CONDATE, PreviousName_4.CompanyName,PreviousName_5.CONDATE, PreviousName_5.CompanyName,PreviousName_6.CONDATE, PreviousName_6.CompanyName,PreviousName_7.CONDATE, PreviousName_7.CompanyName,PreviousName_8.CONDATE, PreviousName_8.CompanyName,PreviousName_9.CONDATE, PreviousName_9.CompanyName,PreviousName_10.CONDATE, PreviousName_10.CompanyName,ConfStmtNextDueDate, ConfStmtLastMadeUpDate
"ATS CAR RENTALS LIMITED","10795807","","",", 1ST FLOOR ,WESTHILL HOUSE 2B DEVONSHIRE ROAD","ACCOUNTING FREEDOM","BEXLEYHEATH","","ENGLAND","DA6 8DS","Private Limited Company","Active","United Kingdom","","31/05/2017","31","5","28/02/2023","31/05/2021","TOTAL EXEMPTION FULL","28/06/2018","","0","0","0","0","49390 - Other passenger land transport","","","","0","0","http://business.data.gov.uk/id/company/10795807","","","","","","","","","","","","","","","","","","","","","12/06/2023","29/05/2022"
"ATS CARE LIMITED","10393661","","","UNIT 5 CO-OP BUILDINGS HIGH STREET","ABERSYCHAN","PONTYPOOL","TORFAEN","WALES","NP4 7AE","Private Limited Company","Active","United Kingdom","","26/09/2016","30","9","30/06/2023","30/09/2021","UNAUDITED ABRIDGED","24/10/2017","","0","0","0","0","87900 - Other residential care activities n.e.c.","","","","0","0","http://business.data.gov.uk/id/company/10393661","17/05/2018","ATS SUPPORT LIMITED","22/12/2017","ATS CARE LIMITED","","","","","","","","","","","","","","","","","09/10/2022","25/09/2021"
I'm reading the csv like so:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
_file = "/path/dir/BasicCompanyDataAsOneFile-2022-08-01.csv"
df = spark.read.csv(_file, header=True, quote='"', escape="\"")
Focusing on the CompanyCategory column, we should see Private Limited Company for both lines. But this is what I get instead when using select():
df.select("CompanyCategory").show(truncate=False)
+-----------------------+
|CompanyCategory |
+-----------------------+
|DA6 8DS |
|Private Limited Company|
+-----------------------+
df.select("CompanyCategory").collect()
[Row(CompanyCategory='DA6 8DS'),
Row(CompanyCategory='Private Limited Company')]
vs when not using select():
from pprint import pprint
for row in df.collect():
    pprint(row.asDict())
{' CompanyNumber': '10795807',
...
'CompanyCategory': 'Private Limited Company',
'CompanyName': 'ATS CAR RENTALS LIMITED',
...}
{' CompanyNumber': '10393661',
...
'CompanyCategory': 'Private Limited Company',
'CompanyName': 'ATS CARE LIMITED',
...}
Using asDict() for readability.
SQL doing the same thing:
df.createOrReplaceTempView("companies")
spark.sql('select CompanyCategory from companies').show()
+--------------------+
| CompanyCategory|
+--------------------+
|Private Limited C...|
| DA6 8DS|
+--------------------+
As you can see when not using select() the CompanyCategory values are showing correctly. Why is this happening? What can I do to avoid this?
Context: I'm trying to create dimension tables, which is why I'm selecting a single column. The next phase is to drop duplicates, filter, sort, etc.
Edit:
Here are two example values in the actual CSV that are throwing things off:
CompanyName of """ BORA "" 2 LTD"
1st line address of ", 1ST FLOOR ,WESTHILL HOUSE 2B DEVONSHIRE ROAD"
Note:
These values are from two separate lines in the CSV.
These values are copied and pasted from the CSV opened in a text editor (like Notepad or VSCode).
Tried and failed:
df = spark.read.csv(_file, header=True) - completely picks up incorrect column.
df = spark.read.csv(_file, header=True, escape='\"') - exact same thing described in original question above. So same results.
df = spark.read.csv(_file, header=True, escape='""') - since the CSV escapes quotes using two double quotes, then I guess using two double quotes as escape param would do the trick? But getting following error:
Py4JJavaError: An error occurred while calling o276.csv.
: java.lang.RuntimeException: escape cannot be more than one character

When reading the csv, the parameters quote and escape are set to the same value ('"'=="\"" returns True in Python).
I would guess that configuring both parameters in this way will somehow disturb the parser that Spark uses to separate the single fields. After removing the escape parameter you can process the remaining " with regexp_replace:
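from pyspark.sql import functions as F
# _file is the CSV path defined in the question
df = spark.read.csv(_file, header=True, quote='"')
# For every column: strip a leading quote, strip a trailing quote,
# then turn doubled quotes into single ones
cols = [
    F.regexp_replace(
        F.regexp_replace(
            F.regexp_replace("`" + col + "`", '^"', ''),
            '"$', ''),
        '""', '"').alias(col)
    for col in df.columns
]
df.select(cols).show(truncate=False)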
Probably there is a smart regexp that can combine all three replace operations into one...
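To see what the three chained replacements do to a single value, here is a plain-Python sketch that uses the standard re module instead of Spark. The helper name unquote_field is hypothetical, and the two sample strings are the values quoted in the question's edit. It only mirrors the regex logic and says nothing about how Spark's reader splits the line into fields.

```python
import re

def unquote_field(value):
    """Mirror the three regexp_replace calls on one plain string."""
    value = re.sub(r'^"', '', value)   # drop a leading quote
    value = re.sub(r'"$', '', value)   # drop a trailing quote
    return value.replace('""', '"')    # collapse doubled quotes

print(unquote_field('""" BORA "" 2 LTD"'))
print(unquote_field('", 1ST FLOOR ,WESTHILL HOUSE 2B DEVONSHIRE ROAD"'))
```

The first call prints the company name with single quotes around BORA, the second prints the address without its surrounding quotes.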

This is an issue when reading a single column from CSV file vs. when reading all the columns:
df = spark.read.csv('companies-house.csv', header=True, quote='"', escape="\"")
df.show() # correct output (all columns loaded)
df.toPandas() # same as pd.read_csv()
df.select('CompanyCategory').show() # wrong output (trying to load a single column)
df.cache() # all columns will be loaded and used in any subsequent call
df.select('CompanyCategory').show() # correct output
The first select() performs a different (optimized) read than the second, so one possible workaround would be to cache() the data immediately. This will however load all the columns, not just one (although pandas and COPY do the same).
The problematic part of the CSV is the RegAddress.POBox column, where an empty value is saved as ", instead of "",. You can check this by incrementally loading more columns:
df.unpersist() # undo cache() operation (for testing purposes only)
df.select(*[f"`{c}`" for c in df.columns[:3]], 'CompanyCategory').show() # wrong
df.select(*[f"`{c}`" for c in df.columns[:4]], 'CompanyCategory').show() # OK

Related

How to get referenced columns of a PySpark DataFrame?

Given a PySpark DataFrame is it possible to obtain a list of source columns that are being referenced by the DataFrame?
Perhaps a more concrete example might help explain what I'm after. Say I have a DataFrame defined as:
import pyspark.sql.functions as func
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
source_df = spark.createDataFrame(
[("pru", 23, "finance"), ("paul", 26, "HR"), ("noel", 20, "HR")],
["name", "age", "department"],
)
source_df.createOrReplaceTempView("people")
sqlDF = spark.sql("SELECT name, age, department FROM people")
df = sqlDF.groupBy("department").agg(func.max("age").alias("max_age"))
df.show()
which returns:
+----------+--------+
|department|max_age |
+----------+--------+
| finance| 23|
| HR| 26|
+----------+--------+
The columns that are referenced by df are [department, age]. Is it possible to get that list of referenced columns programmatically?
Thanks to Capturing the result of explain() in pyspark I know I can extract the plan as a string:
df._sc._jvm.PythonSQLUtils.explainString(df._jdf.queryExecution(), "formatted")
which returns:
== Physical Plan ==
AdaptiveSparkPlan (6)
+- HashAggregate (5)
   +- Exchange (4)
      +- HashAggregate (3)
         +- Project (2)
            +- Scan ExistingRDD (1)
(1) Scan ExistingRDD
Output [3]: [name#0, age#1L, department#2]
Arguments: [name#0, age#1L, department#2], MapPartitionsRDD[4] at applySchemaToPythonRDD at NativeMethodAccessorImpl.java:0, ExistingRDD, UnknownPartitioning(0)
(2) Project
Output [2]: [age#1L, department#2]
Input [3]: [name#0, age#1L, department#2]
(3) HashAggregate
Input [2]: [age#1L, department#2]
Keys [1]: [department#2]
Functions [1]: [partial_max(age#1L)]
Aggregate Attributes [1]: [max#22L]
Results [2]: [department#2, max#23L]
(4) Exchange
Input [2]: [department#2, max#23L]
Arguments: hashpartitioning(department#2, 200), ENSURE_REQUIREMENTS, [plan_id=60]
(5) HashAggregate
Input [2]: [department#2, max#23L]
Keys [1]: [department#2]
Functions [1]: [max(age#1L)]
Aggregate Attributes [1]: [max(age#1L)#12L]
Results [2]: [department#2, max(age#1L)#12L AS max_age#13L]
(6) AdaptiveSparkPlan
Output [2]: [department#2, max_age#13L]
Arguments: isFinalPlan=false
which is useful, however it's not what I need. I need a list of the referenced columns. Is this possible?
Perhaps another way of asking the question is... is there a way to obtain the explain plan as an object that I can iterate over/explore?
UPDATE. Thanks to the reply from #matt-andruff I have gotten this:
df._jdf.queryExecution().executedPlan().treeString().split("+-")[-2]
which returns:
' Project [age#1L, department#2]\n '
from which I guess I could parse the information I'm after but this is a far from elegant way to do it, and is particularly error prone.
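A minimal sketch of that string parsing, assuming the plan line always has the shape shown above (a node name followed by a bracketed list of name#id attributes). The function name columns_from_plan_line is hypothetical, and the regex will miss or mangle anything more complex, such as aliases or function calls, which is exactly why this route is fragile.

```python
import re

def columns_from_plan_line(line):
    """Extract column names from a plan line like 'Project [age#1L, department#2]'."""
    match = re.search(r"\[(.*)\]", line)
    if match is None:
        return []
    # Each attribute looks like name#<id>, optionally followed by a type suffix such as L
    return re.findall(r"(\w+)#\d+\w*", match.group(1))

print(columns_from_plan_line(" Project [age#1L, department#2]\n "))
```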
What I'm really after is a failsafe, reliable, API-supported way to get this information. I'm starting to think it isn't possible.

There is an object for that; unfortunately it's a Java object and is not translated to PySpark.
You can still access it with Spark constructs:
>>> df._jdf.queryExecution().executedPlan().apply(0).output().apply(0).toString()
u'department#1621'
>>> df._jdf.queryExecution().executedPlan().apply(0).output().apply(1).toString()
u'max_age#1632L'
You could loop through both the above apply to get the information you are looking for with something like:
plan = df._jdf.queryExecution().executedPlan()
steps = [ plan.apply(i) for i in range(1,100) if not isinstance(plan.apply(i), type(None)) ]
iterator = steps[0].inputSet().iterator()
>>> iterator.next().toString()
u'department#1621'
>>> iterator.next().toString()
u'max#1642L'
steps = [ plan.apply(i) for i in range(1,100) if not isinstance(plan.apply(i), type(None)) ]
projections = [ (steps[0].p(i).toJSON().encode('ascii','ignore')) for i in range(1,100) if not( isinstance(steps[0].p(i), type(None) )) and steps[0].p(i).nodeName().encode('ascii','ignore') == 'Project' ]
rdd = spark.sparkContext.parallelize(projections)
df2 = spark.read.json(rdd)
>>> df2.show(1,False)
+-----+------------------------------------------+----+------------+------+--------------+------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----+
|child|class |name|num-children|output|outputOrdering|outputPartitioning|projectList |rdd |
+-----+------------------------------------------+----+------------+------+--------------+------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----+
|0 |org.apache.spark.sql.execution.ProjectExec|null|1 |null |null |null |[[[org.apache.spark.sql.catalyst.expressions.AttributeReference, long, [1620, 4ad48da6-03cf-45d4-9b35-76ac246fadac, org.apache.spark.sql.catalyst.expressions.ExprId], age, true, 0, [people]]], [[org.apache.spark.sql.catalyst.expressions.AttributeReference, string, [1621, 4ad48da6-03cf-45d4-9b35-76ac246fadac, org.apache.spark.sql.catalyst.expressions.ExprId], department, true, 0, [people]]]]|null|
+-----+------------------------------------------+----+------------+------+--------------+------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----+
df2.select(func.explode(func.col('projectList'))).select( func.col('col')[0]["name"] ) .show(100,False)
+-----------+
|col[0].name|
+-----------+
|age |
|department |
+-----------+
The range(1, 100) is a bit of a hack, but apparently size doesn't work. I'm sure with more time I could refine it.
You can then use json to pull the information programmatically.
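One way to do that is to walk the parsed JSON and collect the name of every node whose class is AttributeReference. The sketch below is plain Python. The helper attribute_names is hypothetical, and sample is a hand-written, abbreviated stand-in modelled on the projectList row shown above, not real Spark output.

```python
import json

ATTRIBUTE_CLASS = "org.apache.spark.sql.catalyst.expressions.AttributeReference"

def attribute_names(node):
    """Recursively collect 'name' from every AttributeReference dict in parsed plan JSON."""
    names = []
    if isinstance(node, dict):
        if node.get("class") == ATTRIBUTE_CLASS and "name" in node:
            names.append(node["name"])
        for value in node.values():
            names.extend(attribute_names(value))
    elif isinstance(node, list):
        for item in node:
            names.extend(attribute_names(item))
    return names

# Hand-made sample (assumption): real plan JSON carries many more fields
sample = json.loads("""
[{"class": "org.apache.spark.sql.execution.ProjectExec",
  "num-children": 1,
  "projectList": [
    [{"class": "org.apache.spark.sql.catalyst.expressions.AttributeReference",
      "name": "age", "dataType": "long"}],
    [{"class": "org.apache.spark.sql.catalyst.expressions.AttributeReference",
      "name": "department", "dataType": "string"}]
  ]}]
""")

print(attribute_names(sample))
```

With a real DataFrame, the input would presumably come from json.loads() applied to the toJSON() string of a plan node, as in the snippet above. That combination was not tested here.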

I have something that, while not being an answer to my original question (see Matt Andruff's answer for that), could still be useful here. It's a way to get all the source columns referenced by a pyspark.sql.column.Column.
Simple repro:
from pyspark.sql import functions as f, SparkSession
SparkSession.builder.getOrCreate()
col = f.concat(f.col("A"), f.col("B"))
type(col)
col._jc.expr().references().toList().toString()
returns:
<class 'pyspark.sql.column.Column'>
"List('A, 'B)"
It's definitely not perfect: it still requires you to parse the column names out of the string that is returned, but at least the information I'm after is available. There might be more methods on the object returned from references() that make the result easier to parse, but if there are, I haven't found them!
Here is a function I wrote to do the parsing:
def parse_references(references: str):
    # Strip the "List(...)" wrapper and the leading quote marks, then split on commas
    return sorted(
        "".join(
            references.replace("'", "")
            .replace("List(", "")
            .replace(")", "")
            .split()
        ).split(",")
    )

assert parse_references("List('A, 'B)") == ["A", "B"]

PySpark is not really designed for such lower-level tricks. They call for Scala, the language Spark is developed in, which exposes everything there is inside.
This step where you access QueryExecution is the main entry point to the machinery of Spark SQL's query execution engine.
The issue is that py4j (that is used as a bridge between JVM and Python environments) makes it of no use on PySpark's side.
You can use the following if you need to access the final query plan (just before it's converted into RDDs):
df._jdf.queryExecution().executedPlan().prettyJson()
Review the QueryExecution API.
QueryExecutionListener
You should really consider Scala to intercept whatever you want about your queries and QueryExecutionListener seems a fairly viable starting point.
There is more but it's all over Scala :)
Quoting the question: "What I'm really after is a failsafe, reliable, API-supported way to get this information. I'm starting to think it isn't possible."
I'm not surprised since you're throwing away the best possible answer: Scala. I'd recommend using it for a PoC and see what you can get and only then (if you have to) look out for a Python solution (which I think is doable yet highly error-prone).

You can try the code below; it gives you the list of columns in the DataFrame along with their data types.
for field in df.schema.fields:
    print(field.name + " , " + str(field.dataType))

How to add column values based on another, dissimilar dataframe? - python, pandas

I have an Excel file (.xlsx) with content similar to the 'xlsx' dataframe below, and a csv-file ('csv' below).
Over the last week (or so) a fair amount of web-searching leads me to believe that python & pandas might be a good way to add the "os" info from the csv to the xlsx.
Unfortunately I've never done any python coding and, by extension, neither have I any experience with pandas :)
Nevertheless, inspired by many, many examples (the latest being https://stackoverflow.com/a/46550754) I've tried my luck.
Unfortunately my luck seems to have deserted me in this case.
I think that one of the issues I'm facing is that there's no direct correlation between the 2 files. One has (uppercased argh) fqdns and the other (nicely lowercased) bare hostnames.
Also, the xlsx data has a few lines of "pre-amble" in the first columns followed by some headings on row 10.
So, I've tried various incantations involving .lower() and startswith(), and also tried to wrap my mind around map() and lambdas, but, being rather unfamiliar with the vagaries of python (and pandas), I keep garnering all sorts of less-than-ideal exceptions that I'm also struggling with. I may be taking on a bit much, what with trying to get a grasp on python, pandas and jq (for a different issue) at the same time?
Out of shame, I'll refrain from exposing my amateurish python-hacking.
What I have:
xlsx = pd.DataFrame([
["FQDN1", "4.3.2.1", "finfo1", "FINFO1"],
["FQDN2", "4.3.2.2", "finfo2", "FINFO2"],
["FQDN3", "4.3.2.3", "finfo3", "FINFO3"],
["FQDN4", "4.3.2.4", "finfo4", "FINFO4"],
])
csv = pd.DataFrame([
["host1", "hw1", "OS1"],
["host2", "hw2", "OS2"],
["host3", "hw3", "OS3"],
],
columns=[ "Host", "hwinfo", "os"]
)
What I'm attempting to achieve:
new = pd.DataFrame([
["FQDN1", "4.3.2.1", "finfo1", "FINFO1", "OS1"],
["FQDN2", "4.3.2.2", "finfo2", "FINFO2", "OS2"],
["FQDN3", "4.3.2.3", "finfo3", "FINFO3", "OS3"],
["FQDN4", "4.3.2.4", "finfo4", "FINFO4", ""],
],
columns=["Host", "IP", "info1", "info2", "OS"]
)
Edit:
The numerals in the various fields are merely an exhibition of my lack of imagination. The host names (and FQDNs) are (more or less) all over the place.
Also fixed the missing column in the new DF.
The join(?) condition would be that the xlsx FQDN starts with the csv hostname (case insensitive).
As an exhibit here's a couple of my (probably rather naive) attempts:
xlsx.loc[
( xlsx['Unnamed: 1'].str.lower().str.startswith(csv['Host']) | xlsx['Unnamed: 2'].str.lower().str.startswith(csv['Host'] )
) ] = csv.loc[
( xlsx['Unnamed: 1'].str.lower().str.startswith(csv['Host']) | xlsx['Unnamed: 2'].str.lower().str.startswith(csv['Host'] )
), csv['os']].values
and
xlsx['Unnamed: 5'] = np.where( (xlsx['Unnamed: 1'].str.lower().str.startswith(csv['Host']) | xlsx['Unnamed: 2'].str.lower().str.startswith(csv['Host'])), csv['os'],"")
(strictly speaking, the "magic" should leave non-matching rows untouched - or return the original content)
An additional plot twist is that the csv "Host" column occasionally has an IP# instead of a host name - hence my attempts to check for both.

It looks like you want to create a key using the digit from both Host values. You can use the .str accessor to take the last character (index -1) of each Host value and use it as a key for joining the os from the second df.
import pandas as pd
xlsx = pd.DataFrame([
["FQDN1", "4.3.2.1", "finfo1", "FINFO1"],
["FQDN2", "4.3.2.2", "finfo2", "FINFO2"],
["FQDN3", "4.3.2.3", "finfo3", "FINFO3"],
["FQDN4", "4.3.2.4", "finfo4", "FINFO4"]],
columns=["Host", "IP", "info1", "info2"]
)
csv = pd.DataFrame([
["host1", "hw1", "OS1"],
["host2", "hw2", "OS2"],
["host3", "hw3", "OS3"],
],
columns=[ "Host", "hwinfo", "os"]
)
new = (
    xlsx.assign(key=xlsx['Host'].str[-1])
    .merge(csv.assign(key=csv['Host'].str[-1])[['key', 'os']],
           on='key',
           how='left')
)
print(new)
Output
Host IP info1 info2 key os
0 FQDN1 4.3.2.1 finfo1 FINFO1 1 OS1
1 FQDN2 4.3.2.2 finfo2 FINFO2 2 OS2
2 FQDN3 4.3.2.3 finfo3 FINFO3 3 OS3
3 FQDN4 4.3.2.4 finfo4 FINFO4 4 NaN
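The question's edit says the real join condition is that the FQDN starts with the csv host name, case-insensitively, so a last-character key only fits the toy data. Below is a minimal plain-Python sketch of that prefix rule. The function find_os and the sample host names are hypothetical. The extra requirement that the prefix ends at a dot is an added assumption, not something the question states, meant to keep a host like web1 from matching web10.

```python
def find_os(fqdn, host_to_os, default=""):
    """Return the os of the first csv host that the FQDN starts with (case-insensitive)."""
    name = fqdn.lower()
    for host, os_name in host_to_os.items():
        h = host.lower()
        # Match either the bare name / IP itself or the name followed by a dot
        if name == h or name.startswith(h + "."):
            return os_name
    return default

# Hypothetical csv contents: host name (or IP) mapped to os
host_to_os = {"web1": "OS1", "db2": "OS2", "4.3.2.3": "OS3"}

print(find_os("WEB1.EXAMPLE.COM", host_to_os))
print(find_os("WEB10.EXAMPLE.COM", host_to_os))
print(find_os("4.3.2.3", host_to_os))
```

With pandas, the mapping could be built once with dict(zip(csv["Host"], csv["os"])) and applied with xlsx["Host"].map(lambda f: find_os(f, host_to_os)), assuming the xlsx frame has a Host column as in the answer above. For many hosts this linear scan may be slow.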

Parse data in a new dataframe with correct headers taken from within the data

I have a CSV that has been returned and the data is in a god-awful state. I need to parse both the header and the data out of each row.
This is an example of one row:
+--------------+------------+--------------------+--------------+------------+-------------+--------------------+----------+--------------+----------+----------+-----------+-------------+-------------+----------+--------------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+--------------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+--------------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------
+-----------+-----------+-----------+--------------------+--------------+----------+------------+----------+--------------+---------------+
| _c0| _c1| _c2| _c3| _c4| _c5| _c6| _c7| _c8| _c9| _c10| _c11| _c12| _c13| _c14| _c15| _c16| _c17| _c18| _c19| _c20| _c21| _c22| _c23| _c24| _c25| _c26| _c27| _c28| _c29| _c30| _c31| _c32| _c33| _c34| _c35| _c36| _c37| _c38| _c39| _c40| _c41| _c42| _c43| _c44| _c45| _c46| _c47| _c48| _c49| _c50| _c51| _c52| _c53| _c54| _c55| _c56| _c57| _c58| _c59| _c60| _c61| _c62| _c63| _c64| _c65| _c66| _c67| _c68| _c69| _c70| _c71| _c72| _c73| _c74| _c75| _c76| _c77| _c78| _c79| _c80| _c81| _c82| _c83| _c84| _c85| _c86| _c87| _c88| _c89| _c90| _c91| _c92| _c93| _c94| _c95| _c96| _c97| _c98| _c99| _c100| _c101| _c102| _c103| _c104| _c105| _c106| _c107| _c108| _c109| _c110| _c111| _c112| _c113| _c114| _c115| _c116| _c117| _c118| _c119| _c120| _c121| _c122| _c123| _c124| _c125| _c126| _c127| _c128| _c129| _c130| _c131| _c132| _c133| _c134| _c135| _c136| _c137| _c138| _c139| _c140| _c141| _c142| _c143| _c144| _c145| _c146| _c147| _c148| _c149| _c150|
+--------------+------------+--------------------+--------------+------------+-------------+--------------------+----------+--------------+----------+----------+-----------+-------------+-------------+----------+--------------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+--------------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+--------------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------
+-----------+-----------+-----------+--------------------+--------------+----------+------------+----------+--------------+---------------+
|{"MANDT":"400"|"LEDNR":"00"|"OBJNR":"KS660000...|"GJAHR":"2022"|"WRTTP":"04"|"VERSN":"000"|"KSTAR":"0051040100"|"HRKFT":""|"VRGNG":"COIN"|"VBUND":""|"PARGB":""|"BEKNZ":"H"|"TWAER":"THB"|"PERBL":"016"|"MEINH":""|"WTG001":-1854554.89|"WTG002":0.00|"WTG003":0.00|"WTG004":0.00|"WTG005":0.00|"WTG006":0.00|"WTG007":0.00|"WTG008":0.00|"WTG009":0.00|"WTG010":0.00|"WTG011":0.00|"WTG012":0.00|"WTG013":0.00|"WTG014":0.00|"WTG015":0.00|"WTG016":0.00|"WOG001":-1854554.89|"WOG002":0.00|"WOG003":0.00|"WOG004":0.00|"WOG005":0.00|"WOG006":0.00|"WOG007":0.00|"WOG008":0.00|"WOG009":0.00|"WOG010":0.00|"WOG011":0.00|"WOG012":0.00|"WOG013":0.00|"WOG014":0.00|"WOG015":0.00|"WOG016":0.00|"WKG001":-1854554.89|"WKG002":0.00|"WKG003":0.00|"WKG004":0.00|"WKG005":0.00|"WKG006":0.00|"WKG007":0.00|"WKG008":0.00|"WKG009":0.00|"WKG010":0.00|"WKG011":0.00|"WKG012":0.00|"WKG013":0.00|"WKG014":0.00|"WKG015":0.00|"WKG016":0.00|"WKF001":0.00|"WKF002":0.00|"WKF003":0.00|"WKF004":0.00|"WKF005":0.00|"WKF006":0.00|"WKF007":0.00|"WKF008":0.00|"WKF009":0.00|"WKF010":0.00|"WKF011":0.00|"WKF012":0.00|"WKF013":0.00|"WKF014":0.00|"WKF015":0.00|"WKF016":0.00|"PAG001":0.00|"PAG002":0.00|"PAG003":0.00|"PAG004":0.00|"PAG005":0.00|"PAG006":0.00|"PAG007":0.00|"PAG008":0.00|"PAG009":0.00|"PAG010":0.00|"PAG011":0.00|"PAG012":0.00|"PAG013":0.00|"PAG014":0.00|"PAG015":0.00|"PAG016":0.00|"MEG001":0.000|"MEG002":0.000|"MEG003":0.000|"MEG004":0.000|"MEG005":0.000|"MEG006":0.000|"MEG007":0.000|"MEG008":0.000|"MEG009":0.000|"MEG010":0.000|"MEG011":0.000|"MEG012":0.000|"MEG013":0.000|"MEG014":0.000|"MEG015":0.000|"MEG016":0.000|"MEF001":0.000|"MEF002":0.000|"MEF003":0.000|"MEF004":0.000|"MEF005":0.000|"MEF006":0.000|"MEF007":0.000|"MEF008":0.000|"MEF009":0.000|"MEF010":0.000|"MEF011":0.000|"MEF012":0.000|"MEF013":0.000|"MEF014":0.000|"MEF015":0.000|"MEF016":0.000|"MUV001":""|"MUV002":""|"MUV003":""|"MUV004":""|"MUV005":""|"MUV006":""|"MUV007":""|"MUV008":""|"MUV009":""|"MUV010":""|"MUV011":""|"MUV012":""|"MUV013":""|"MUV014":""
|"MUV015":""|"MUV016":""|"BELTP":"1"|"TIMESTMP":101246...|"BUKRS":"6611"|"FKBER":""|"SEGMENT":""|"GEBER":""|"GRANT_NBR":""|"BUDGET_PD":""}|
+--------------+------------+--------------------+--------------+------------+-------------+--------------------+----------+--------------+----------+----------+-----------+-------------+-------------+----------+--------------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+--------------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+--------------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------
+-----------+-----------+-----------+--------------------+--------------+----------+------------+----------+--------------+---------------+
The first part for example MANDT is the column header and the bit after the : is the value. I basically need to
A) Loop all the columns and change the headers so they relate to the bit prior to the :
B) then populate the rows with the second part after.
I've attempted a small piece of code just to edit all the columns like below
from pyspark.sql.functions import split
for colname in COSPDF.columns:
    print(colname)
    COSPDF = COSPDF.withColumn(col(colname), lower(colname))
and I receive an error TypeError: 'str' object is not callable
I've then done the "lazy" thing and found some code like below
from pyspark.sql.functions import split
split_df = COSPDF.select(split(COSPDF._c0, ':').alias('split_text'))
split_df.selectExpr("split_text[0] as left").show() # left of delim
split_df.selectExpr("split_text[1] as right").show() # right of delim
However, this code only works on one column, which I have to specify by name. That doesn't scale when the CSV has 123 columns; I'm not doing it 123 times. Any assistance would really help with this please, it's had me stuck for hours.
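If the file really is one JSON object per line that was CSV-quoted and split on its commas (an assumption based on the row shown above, where every cell looks like "KEY":value), one possible approach is to undo the CSV quoting and parse each line as JSON, which yields headers and values in one step. The sketch uses only the standard library. rows_from_lines is a hypothetical helper, and the sample line is a shortened, hand-made version of such a row.

```python
import csv
import json

def rows_from_lines(lines):
    """Turn CSV-quoted JSON fragments back into one dict per line."""
    rows = []
    for fields in csv.reader(lines):
        if not fields:
            continue  # skip blank lines
        # Re-joining the cells with commas restores the original JSON text
        rows.append(json.loads(",".join(fields)))
    return rows

# Hand-made sample line (assumption): same quoting style, far fewer fields
sample = ['"{""MANDT"":""400""","""LEDNR"":""00""","""WTG001"":-1854554.89","""BUDGET_PD"":""""}"']

rows = rows_from_lines(sample)
print(list(rows[0]))       # keys, usable as column headers
print(rows[0]["WTG001"])   # numeric values keep their type
```

The resulting list of dicts could then be handed to spark.createDataFrame, or the re-joined strings to spark.read.json. Neither was tried against the real file.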
UPDATED
Some rows from the original file:
"{""MANDT"":""400""","""LEDNR"":""00""","""OBJNR"":""KS66000011001070""","""GJAHR"":""2022""","""WRTTP"":""04""","""VERSN"":""000""","""KSTAR"":""0051040100""","""HRKFT"":""""","""VRGNG"":""COIN""","""VBUND"":""""","""PARGB"":""""","""BEKNZ"":""H""","""TWAER"":""THB""","""PERBL"":""016""","""MEINH"":""""","""WTG001"":-1854554.89","""WTG002"":0.00","""WTG003"":0.00","""WTG004"":0.00","""WTG005"":0.00","""WTG006"":0.00","""WTG007"":0.00","""WTG008"":0.00","""WTG009"":0.00","""WTG010"":0.00","""WTG011"":0.00","""WTG012"":0.00","""WTG013"":0.00","""WTG014"":0.00","""WTG015"":0.00","""WTG016"":0.00","""WOG001"":-1854554.89","""WOG002"":0.00","""WOG003"":0.00","""WOG004"":0.00","""WOG005"":0.00","""WOG006"":0.00","""WOG007"":0.00","""WOG008"":0.00","""WOG009"":0.00","""WOG010"":0.00","""WOG011"":0.00","""WOG012"":0.00","""WOG013"":0.00","""WOG014"":0.00","""WOG015"":0.00","""WOG016"":0.00","""WKG001"":-1854554.89","""WKG002"":0.00","""WKG003"":0.00","""WKG004"":0.00","""WKG005"":0.00","""WKG006"":0.00","""WKG007"":0.00","""WKG008"":0.00","""WKG009"":0.00","""WKG010"":0.00","""WKG011"":0.00","""WKG012"":0.00","""WKG013"":0.00","""WKG014"":0.00","""WKG015"":0.00","""WKG016"":0.00","""WKF001"":0.00","""WKF002"":0.00","""WKF003"":0.00","""WKF004"":0.00","""WKF005"":0.00","""WKF006"":0.00","""WKF007"":0.00","""WKF008"":0.00","""WKF009"":0.00","""WKF010"":0.00","""WKF011"":0.00","""WKF012"":0.00","""WKF013"":0.00","""WKF014"":0.00","""WKF015"":0.00","""WKF016"":0.00","""PAG001"":0.00","""PAG002"":0.00","""PAG003"":0.00","""PAG004"":0.00","""PAG005"":0.00","""PAG006"":0.00","""PAG007"":0.00","""PAG008"":0.00","""PAG009"":0.00","""PAG010"":0.00","""PAG011"":0.00","""PAG012"":0.00","""PAG013"":0.00","""PAG014"":0.00","""PAG015"":0.00","""PAG016"":0.00","""MEG001"":0.000","""MEG002"":0.000","""MEG003"":0.000","""MEG004"":0.000","""MEG005"":0.000","""MEG006"":0.000","""MEG007"":0.000","""MEG008"":0.000","""MEG009"":0.000","""MEG010"":0.000","""MEG011"":0.000","""MEG012"":0.000","""M
EG013"":0.000","""MEG014"":0.000","""MEG015"":0.000","""MEG016"":0.000","""MEF001"":0.000","""MEF002"":0.000","""MEF003"":0.000","""MEF004"":0.000","""MEF005"":0.000","""MEF006"":0.000","""MEF007"":0.000","""MEF008"":0.000","""MEF009"":0.000","""MEF010"":0.000","""MEF011"":0.000","""MEF012"":0.000","""MEF013"":0.000","""MEF014"":0.000","""MEF015"":0.000","""MEF016"":0.000","""MUV001"":""""","""MUV002"":""""","""MUV003"":""""","""MUV004"":""""","""MUV005"":""""","""MUV006"":""""","""MUV007"":""""","""MUV008"":""""","""MUV009"":""""","""MUV010"":""""","""MUV011"":""""","""MUV012"":""""","""MUV013"":""""","""MUV014"":""""","""MUV015"":""""","""MUV016"":""""","""BELTP"":""1""","""TIMESTMP"":10124662650000.0","""BUKRS"":""6611""","""FKBER"":""""","""SEGMENT"":""""","""GEBER"":""""","""GRANT_NBR"":""""","""BUDGET_PD"":""""}"
"{""MANDT"":""400""","""LEDNR"":""00""","""OBJNR"":""KS66000011001070""","""GJAHR"":""2022""","""WRTTP"":""04""","""VERSN"":""000""","""KSTAR"":""0051040100""","""HRKFT"":""""","""VRGNG"":""COIN""","""VBUND"":""""","""PARGB"":""""","""BEKNZ"":""S""","""TWAER"":""THB""","""PERBL"":""016""","""MEINH"":""""","""WTG001"":7424891.07","""WTG002"":0.00","""WTG003"":0.00","""WTG004"":0.00","""WTG005"":0.00","""WTG006"":0.00","""WTG007"":0.00","""WTG008"":0.00","""WTG009"":0.00","""WTG010"":0.00","""WTG011"":0.00","""WTG012"":0.00","""WTG013"":0.00","""WTG014"":0.00","""WTG015"":0.00","""WTG016"":0.00","""WOG001"":7424891.07","""WOG002"":0.00","""WOG003"":0.00","""WOG004"":0.00","""WOG005"":0.00","""WOG006"":0.00","""WOG007"":0.00","""WOG008"":0.00","""WOG009"":0.00","""WOG010"":0.00","""WOG011"":0.00","""WOG012"":0.00","""WOG013"":0.00","""WOG014"":0.00","""WOG015"":0.00","""WOG016"":0.00","""WKG001"":7424891.07","""WKG002"":0.00","""WKG003"":0.00","""WKG004"":0.00","""WKG005"":0.00","""WKG006"":0.00","""WKG007"":0.00","""WKG008"":0.00","""WKG009"":0.00","""WKG010"":0.00","""WKG011"":0.00","""WKG012"":0.00","""WKG013"":0.00","""WKG014"":0.00","""WKG015"":0.00","""WKG016"":0.00","""WKF001"":0.00","""WKF002"":0.00","""WKF003"":0.00","""WKF004"":0.00","""WKF005"":0.00","""WKF006"":0.00","""WKF007"":0.00","""WKF008"":0.00","""WKF009"":0.00","""WKF010"":0.00","""WKF011"":0.00","""WKF012"":0.00","""WKF013"":0.00","""WKF014"":0.00","""WKF015"":0.00","""WKF016"":0.00","""PAG001"":0.00","""PAG002"":0.00","""PAG003"":0.00","""PAG004"":0.00","""PAG005"":0.00","""PAG006"":0.00","""PAG007"":0.00","""PAG008"":0.00","""PAG009"":0.00","""PAG010"":0.00","""PAG011"":0.00","""PAG012"":0.00","""PAG013"":0.00","""PAG014"":0.00","""PAG015"":0.00","""PAG016"":0.00","""MEG001"":0.000","""MEG002"":0.000","""MEG003"":0.000","""MEG004"":0.000","""MEG005"":0.000","""MEG006"":0.000","""MEG007"":0.000","""MEG008"":0.000","""MEG009"":0.000","""MEG010"":0.000","""MEG011"":0.000","""MEG012"":0.000","""MEG0
13"":0.000","""MEG014"":0.000","""MEG015"":0.000","""MEG016"":0.000","""MEF001"":0.000","""MEF002"":0.000","""MEF003"":0.000","""MEF004"":0.000","""MEF005"":0.000","""MEF006"":0.000","""MEF007"":0.000","""MEF008"":0.000","""MEF009"":0.000","""MEF010"":0.000","""MEF011"":0.000","""MEF012"":0.000","""MEF013"":0.000","""MEF014"":0.000","""MEF015"":0.000","""MEF016"":0.000","""MUV001"":""""","""MUV002"":""""","""MUV003"":""""","""MUV004"":""""","""MUV005"":""""","""MUV006"":""""","""MUV007"":""""","""MUV008"":""""","""MUV009"":""""","""MUV010"":""""","""MUV011"":""""","""MUV012"":""""","""MUV013"":""""","""MUV014"":""""","""MUV015"":""""","""MUV016"":""""","""BELTP"":""1""","""TIMESTMP"":10160936750000.0","""BUKRS"":""6611""","""FKBER"":""""","""SEGMENT"":""""","""GEBER"":""""","""GRANT_NBR"":""""","""BUDGET_PD"":""""}"
"{""MANDT"":""400""","""LEDNR"":""00""","""OBJNR"":""KS66000011001070""","""GJAHR"":""2022""","""WRTTP"":""04""","""VERSN"":""000""","""KSTAR"":""0051040105""","""HRKFT"":""""","""VRGNG"":""COIN""","""VBUND"":""""","""PARGB"":""""","""BEKNZ"":""H""","""TWAER"":""THB""","""PERBL"":""016""","""MEINH"":""""","""WTG001"":-509518.63","""WTG002"":0.00","""WTG003"":0.00","""WTG004"":0.00","""WTG005"":0.00","""WTG006"":0.00","""WTG007"":0.00","""WTG008"":0.00","""WTG009"":0.00","""WTG010"":0.00","""WTG011"":0.00","""WTG012"":0.00","""WTG013"":0.00","""WTG014"":0.00","""WTG015"":0.00","""WTG016"":0.00","""WOG001"":-509518.63","""WOG002"":0.00","""WOG003"":0.00","""WOG004"":0.00","""WOG005"":0.00","""WOG006"":0.00","""WOG007"":0.00","""WOG008"":0.00","""WOG009"":0.00","""WOG010"":0.00","""WOG011"":0.00","""WOG012"":0.00","""WOG013"":0.00","""WOG014"":0.00","""WOG015"":0.00","""WOG016"":0.00","""WKG001"":-509518.63","""WKG002"":0.00","""WKG003"":0.00","""WKG004"":0.00","""WKG005"":0.00","""WKG006"":0.00","""WKG007"":0.00","""WKG008"":0.00","""WKG009"":0.00","""WKG010"":0.00","""WKG011"":0.00","""WKG012"":0.00","""WKG013"":0.00","""WKG014"":0.00","""WKG015"":0.00","""WKG016"":0.00","""WKF001"":0.00","""WKF002"":0.00","""WKF003"":0.00","""WKF004"":0.00","""WKF005"":0.00","""WKF006"":0.00","""WKF007"":0.00","""WKF008"":0.00","""WKF009"":0.00","""WKF010"":0.00","""WKF011"":0.00","""WKF012"":0.00","""WKF013"":0.00","""WKF014"":0.00","""WKF015"":0.00","""WKF016"":0.00","""PAG001"":0.00","""PAG002"":0.00","""PAG003"":0.00","""PAG004"":0.00","""PAG005"":0.00","""PAG006"":0.00","""PAG007"":0.00","""PAG008"":0.00","""PAG009"":0.00","""PAG010"":0.00","""PAG011"":0.00","""PAG012"":0.00","""PAG013"":0.00","""PAG014"":0.00","""PAG015"":0.00","""PAG016"":0.00","""MEG001"":0.000","""MEG002"":0.000","""MEG003"":0.000","""MEG004"":0.000","""MEG005"":0.000","""MEG006"":0.000","""MEG007"":0.000","""MEG008"":0.000","""MEG009"":0.000","""MEG010"":0.000","""MEG011"":0.000","""MEG012"":0.000","""MEG0
13"":0.000","""MEG014"":0.000","""MEG015"":0.000","""MEG016"":0.000","""MEF001"":0.000","""MEF002"":0.000","""MEF003"":0.000","""MEF004"":0.000","""MEF005"":0.000","""MEF006"":0.000","""MEF007"":0.000","""MEF008"":0.000","""MEF009"":0.000","""MEF010"":0.000","""MEF011"":0.000","""MEF012"":0.000","""MEF013"":0.000","""MEF014"":0.000","""MEF015"":0.000","""MEF016"":0.000","""MUV001"":""""","""MUV002"":""""","""MUV003"":""""","""MUV004"":""""","""MUV005"":""""","""MUV006"":""""","""MUV007"":""""","""MUV008"":""""","""MUV009"":""""","""MUV010"":""""","""MUV011"":""""","""MUV012"":""""","""MUV013"":""""","""MUV014"":""""","""MUV015"":""""","""MUV016"":""""","""BELTP"":""1""","""TIMESTMP"":10124662700000.0","""BUKRS"":""6611""","""FKBER"":""""","""SEGMENT"":""""","""GEBER"":""""","""GRANT_NBR"":""""","""BUDGET_PD"":""""}"

Simply put, you need to set the header names on the pandas DataFrame, like this:
df.columns = ["Column_Name1", "Column_Name2", "Column_Name3", "Column_Name4"]  # and so on, one name per column
And if you want to use a loop to build the name for each column, iterate over the list and append each name based on its index and the length of the list.
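The loop-based variant could look roughly like the sketch below. The tiny headerless DataFrame and the `Column_Name<N>` naming scheme are assumptions made for illustration; substitute your own data and names.

```python
import pandas as pd

# Hypothetical headerless data: pandas labels the columns 0, 1, 2 by default.
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]])

# Build one name per column, based on the column's position.
names = []
for i in range(len(df.columns)):
    names.append(f"Column_Name{i + 1}")

# Assigning a list of the right length replaces all column labels at once.
df.columns = names
print(list(df.columns))
```

The list must contain exactly as many names as the DataFrame has columns, otherwise pandas raises a length-mismatch error.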

First, read the CSV and collect each column as a key/value pair by iterating over the columns:
import pandas as pd
read_df = pd.read_csv(csv_path)  # csv_path: the path to your CSV file
dict_of_pairs = {col: read_df[col] for col in read_df}
Then write it to another file:
# Wrapping each value in pd.Series lets you write even if some column has no values in it
write_df = pd.DataFrame({k: pd.Series(v) for k, v in dict_of_pairs.items()})
# The context manager saves and closes the workbook on exit
with pd.ExcelWriter(write_path, engine='xlsxwriter') as writer:
    write_df.to_excel(writer, sheet_name='Somename for your sheet', index=False)
Hope this answers your question.

Pyspark: Regex_replace commas between quotes

I'm struggling to do a replacement with regexp_replace in PySpark. I have the following string column:
"1233455666, 'ThisIsMyAdress, 1234AB', 24234234234"
A better overview of the string:
Id          Address                   Code
1233455666  'ThisIsMyAdress, 1234AB'  24234234234
The full string that I receive and process is comma-separated, like the example at the beginning. Unfortunately, I can't change the format of the delivered data. To handle the data properly, I want to remove the comma between the quotes.
The only requirement is to use regexp_replace.
I've tried the code below, and many other variants, but with this code the comma separation breaks as well: the result is one big string with all commas removed.
.withColumn("ColCommasRemoved" , regexp_replace( col("X"), "[,]", ""))
which gave me this output:
"1233455666 'ThisIsMyAdress 1234AB' 24234234234"
The output I want to achieve:
"1233455666, 'ThisIsMyAdress 1234AB', 24234234234"

Using regexp_replace:
from pyspark.sql import functions as F
df = spark.createDataFrame([("1233455666, 'ThisIsMyAdress, 1234AB', 24234234234",)], ["X"])
result = df.withColumn(
    "ColCommasRemoved",
    # Remove only the commas followed by an odd number of single quotes (i.e. commas inside a quoted part), then split
    F.split(F.regexp_replace("X", ",(?=[^']*'[^']*(?:'[^']*'[^']*)*$)", ""), ",")
).select(
    F.col("ColCommasRemoved")[0].alias("ID"),
    F.col("ColCommasRemoved")[1].alias("Address"),
    F.col("ColCommasRemoved")[2].alias("Code")
)
result.show(truncate=False)
#+----------+------------------------+------------+
#|ID        |Address                 |Code        |
#+----------+------------------------+------------+
#|1233455666| 'ThisIsMyAdress 1234AB'| 24234234234|
#+----------+------------------------+------------+
Or, if you want to split the original column directly on commas while ignoring the ones inside quotes:
result = df.withColumn(
    "split",
    F.split(F.col("X"), ",(?=(?:[^']*'[^']*')*[^']*$)")
)
result.show(truncate=False)
#+-------------------------------------------------+-----------------------------------------------------+
#|X                                                |split                                                |
#+-------------------------------------------------+-----------------------------------------------------+
#|1233455666, 'ThisIsMyAdress, 1234AB', 24234234234|[1233455666,  'ThisIsMyAdress, 1234AB',  24234234234]|
#+-------------------------------------------------+-----------------------------------------------------+
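To see why the two lookahead patterns behave differently, it can help to try them outside Spark. The sketch below uses Python's built-in re module as a stand-in; Spark evaluates regexp_replace and split with Java regular expressions, but these particular patterns use only constructs that work the same way in both engines. The first pattern matches a comma followed by an odd number of single quotes up to the end of the string (so the comma sits inside a quoted part); the second matches a comma followed by an even number of quotes (so the comma sits outside any quoted part).

```python
import re

s = "1233455666, 'ThisIsMyAdress, 1234AB', 24234234234"

# Comma followed by an ODD number of quotes: it is inside a quoted part.
inside_quotes = r",(?=[^']*'[^']*(?:'[^']*'[^']*)*$)"
# Comma followed by an EVEN number of quotes: it is a real separator.
outside_quotes = r",(?=(?:[^']*'[^']*')*[^']*$)"

# Remove only the comma inside the quoted address.
cleaned = re.sub(inside_quotes, "", s)
print(cleaned)

# Split only on the separator commas, leaving the quoted comma alone.
parts = re.split(outside_quotes, s)
print(parts)
```

Both patterns assume the quotes are balanced and that a quoted part never contains an escaped quote; input that violates this would need a real CSV parser rather than a regex.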

Pandas KeyError: value not in index

I have the following code,
df = pd.read_csv(CsvFileName)
p = df.pivot_table(index=['Hour'], columns='DOW', values='Changes', aggfunc=np.mean).round(0)
p.fillna(0, inplace=True)
p[["1Sun", "2Mon", "3Tue", "4Wed", "5Thu", "6Fri", "7Sat"]] = p[["1Sun", "2Mon", "3Tue", "4Wed", "5Thu", "6Fri", "7Sat"]].astype(int)
It always worked until the CSV file didn't have enough coverage (of all weekdays). For example, with the following .csv file,
DOW,Hour,Changes
4Wed,01,237
3Tue,07,2533
1Sun,01,240
3Tue,12,4407
1Sun,09,2204
1Sun,01,240
1Sun,01,241
1Sun,01,241
3Tue,11,662
4Wed,01,4
2Mon,18,4737
1Sun,15,240
2Mon,02,4
6Fri,01,1
1Sun,01,240
2Mon,19,2300
2Mon,19,2532
I'll get the following error:
KeyError: "['5Thu' '7Sat'] not in index"
It seems to have a very easy fix, but I'm just too new to Python to know how to fix it.

Use reindex to get all the columns you need. It preserves the ones that are already there and adds any missing ones as empty (NaN) columns.
p = p.reindex(columns=['1Sun', '2Mon', '3Tue', '4Wed', '5Thu', '6Fri', '7Sat'])
So, your entire code example should look like this:
df = pd.read_csv(CsvFileName)
p = df.pivot_table(index=['Hour'], columns='DOW', values='Changes', aggfunc=np.mean).round(0)
columns = ["1Sun", "2Mon", "3Tue", "4Wed", "5Thu", "6Fri", "7Sat"]
p = p.reindex(columns=columns)
# Fill after reindexing so the newly added columns are 0 rather than NaN (astype(int) fails on NaN)
p.fillna(0, inplace=True)
p[columns] = p[columns].astype(int)
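As a quick self-contained check of the reindex approach, the sketch below builds a small DataFrame in memory instead of reading a file. The three sample rows are made up for illustration, and passing fill_value=0 to reindex is one possible variation, not part of the original code: it fills the newly added columns with 0 right away, while fillna(0) takes care of gaps that already existed in the pivot.

```python
import pandas as pd

# Hypothetical sample in which only two of the seven weekdays occur.
df = pd.DataFrame({
    "DOW": ["1Sun", "3Tue", "1Sun"],
    "Hour": [1, 7, 1],
    "Changes": [240, 2533, 242],
})

p = df.pivot_table(index="Hour", columns="DOW", values="Changes", aggfunc="mean").round(0)

columns = ["1Sun", "2Mon", "3Tue", "4Wed", "5Thu", "6Fri", "7Sat"]
# reindex adds the missing weekday columns (filled with 0 here); fillna covers
# the NaN cells the pivot itself produced, so the integer cast cannot hit a NaN.
p = p.reindex(columns=columns, fill_value=0).fillna(0).astype(int)

print(p)
```

The important detail is the order: every NaN has to be gone before the cast to int, whichever of the two mechanisms removes it.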

I had a very similar issue. I got the same error because the CSV contained spaces in the header. My CSV contained the header "Gender " (with a trailing space) and I was selecting it as:
[['Gender']]
If it's easy enough for you to edit your CSV, you can use the Excel TRIM() function to strip the extra spaces from the cells.
Or remove them in pandas like this:
df.columns = df.columns.to_series().apply(lambda x: x.strip())
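A minimal reproduction of this situation might look like the sketch below. The CSV text and the column names are invented for the example, and it uses the shorter df.columns.str.strip() form, which should be equivalent to the apply-based line for plain string headers.

```python
import io
import pandas as pd

# Hypothetical CSV whose second header carries a trailing space ("Gender ").
csv_text = "Name,Gender \nAnn,F\nBob,M\n"
df = pd.read_csv(io.StringIO(csv_text))

# Selecting "Gender" fails because the actual label is "Gender ".
try:
    df[["Gender"]]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# Strip surrounding whitespace from every column label.
df.columns = df.columns.str.strip()
print(lookup_failed, list(df.columns))
```

Printing list(df.columns) before selecting is a cheap way to spot this kind of invisible difference.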

Please try this to clean and format your column names:
df.columns = (df.columns.str.strip().str.upper()
              .str.replace(' ', '_', regex=False)
              .str.replace('(', '', regex=False)   # regex=False: treat the parenthesis as a literal character
              .str.replace(')', '', regex=False))

I had the same issue.
During initial development I used a .csv file (comma as separator) that I modified a bit before saving it.
After saving, the commas had become semicolons.
On Windows this depends on the "Regional and Language Options" customize screen, where you find a List separator setting. This is the character Windows applications expect to be the CSV separator.
I encountered the issue when testing with a brand-new file.
I removed the 'sep' argument from the read_csv call.
Before:
df1 = pd.read_csv('myfile.csv', sep=',')
After:
df1 = pd.read_csv('myfile.csv')
That way, the issue disappeared.
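Note that read_csv already uses a comma as its default separator, so a file that really contains semicolons would still be read as a single wide column. One option in that case, sketched below with made-up file content, is to pass sep=None together with engine="python", which asks pandas to detect the delimiter itself.

```python
import io
import pandas as pd

# Hypothetical file content that was saved with semicolons instead of commas.
csv_text = "DOW;Hour;Changes\n4Wed;01;237\n3Tue;07;2533\n"

# With the default sep=",", the whole header ends up as one column name.
naive = pd.read_csv(io.StringIO(csv_text))

# sep=None (Python engine only) lets pandas sniff the delimiter.
sniffed = pd.read_csv(io.StringIO(csv_text), sep=None, engine="python")

print(list(naive.columns))
print(list(sniffed.columns))
```

If the separator is known in advance, passing it explicitly (for example sep=";") is more predictable than relying on detection.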
