Why is .select() showing/parsing values differently to when I don't use it?
I have this CSV:
CompanyName, CompanyNumber,RegAddress.CareOf,RegAddress.POBox,RegAddress.AddressLine1, RegAddress.AddressLine2,RegAddress.PostTown,RegAddress.County,RegAddress.Country,RegAddress.PostCode,CompanyCategory,CompanyStatus,CountryOfOrigin,DissolutionDate,IncorporationDate,Accounts.AccountRefDay,Accounts.AccountRefMonth,Accounts.NextDueDate,Accounts.LastMadeUpDate,Accounts.AccountCategory,Returns.NextDueDate,Returns.LastMadeUpDate,Mortgages.NumMortCharges,Mortgages.NumMortOutstanding,Mortgages.NumMortPartSatisfied,Mortgages.NumMortSatisfied,SICCode.SicText_1,SICCode.SicText_2,SICCode.SicText_3,SICCode.SicText_4,LimitedPartnerships.NumGenPartners,LimitedPartnerships.NumLimPartners,URI,PreviousName_1.CONDATE, PreviousName_1.CompanyName, PreviousName_2.CONDATE, PreviousName_2.CompanyName,PreviousName_3.CONDATE, PreviousName_3.CompanyName,PreviousName_4.CONDATE, PreviousName_4.CompanyName,PreviousName_5.CONDATE, PreviousName_5.CompanyName,PreviousName_6.CONDATE, PreviousName_6.CompanyName,PreviousName_7.CONDATE, PreviousName_7.CompanyName,PreviousName_8.CONDATE, PreviousName_8.CompanyName,PreviousName_9.CONDATE, PreviousName_9.CompanyName,PreviousName_10.CONDATE, PreviousName_10.CompanyName,ConfStmtNextDueDate, ConfStmtLastMadeUpDate
"ATS CAR RENTALS LIMITED","10795807","","",", 1ST FLOOR ,WESTHILL HOUSE 2B DEVONSHIRE ROAD","ACCOUNTING FREEDOM","BEXLEYHEATH","","ENGLAND","DA6 8DS","Private Limited Company","Active","United Kingdom","","31/05/2017","31","5","28/02/2023","31/05/2021","TOTAL EXEMPTION FULL","28/06/2018","","0","0","0","0","49390 - Other passenger land transport","","","","0","0","http://business.data.gov.uk/id/company/10795807","","","","","","","","","","","","","","","","","","","","","12/06/2023","29/05/2022"
"ATS CARE LIMITED","10393661","","","UNIT 5 CO-OP BUILDINGS HIGH STREET","ABERSYCHAN","PONTYPOOL","TORFAEN","WALES","NP4 7AE","Private Limited Company","Active","United Kingdom","","26/09/2016","30","9","30/06/2023","30/09/2021","UNAUDITED ABRIDGED","24/10/2017","","0","0","0","0","87900 - Other residential care activities n.e.c.","","","","0","0","http://business.data.gov.uk/id/company/10393661","17/05/2018","ATS SUPPORT LIMITED","22/12/2017","ATS CARE LIMITED","","","","","","","","","","","","","","","","","09/10/2022","25/09/2021"
I'm reading the csv like so:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
_file = "/path/dir/BasicCompanyDataAsOneFile-2022-08-01.csv"
df = spark.read.csv(_file, header=True, quote='"', escape="\"")
Focusing on the CompanyCategory column, we should see Private Limited Company for both lines. But this is what I get instead when using select():
df.select("CompanyCategory").show(truncate=False)
+-----------------------+
|CompanyCategory |
+-----------------------+
|DA6 8DS |
|Private Limited Company|
+-----------------------+
df.select("CompanyCategory").collect()
[Row(CompanyCategory='DA6 8DS'),
Row(CompanyCategory='Private Limited Company')]
vs when not using select():
from pprint import pprint
for row in df.collect():
    pprint(row.asDict())
{' CompanyNumber': '10795807',
...
'CompanyCategory': 'Private Limited Company',
'CompanyName': 'ATS CAR RENTALS LIMITED',
...}
{' CompanyNumber': '10393661',
...
'CompanyCategory': 'Private Limited Company',
'CompanyName': 'ATS CARE LIMITED',
...}
Using asDict() for readability.
SQL doing the same thing:
df.createOrReplaceTempView("companies")
spark.sql('select CompanyCategory from companies').show()
+--------------------+
| CompanyCategory|
+--------------------+
|Private Limited C...|
| DA6 8DS|
+--------------------+
As you can see, when not using select() the CompanyCategory values show correctly. Why is this happening? What can I do to avoid this?
Context: I'm trying to create dimension tables, which is why I'm selecting a single column. The next phase is to drop duplicates, filter, sort, etc.
Edit:
Here are two example values in the actual CSV that are throwing things off:
CompanyName of """ BORA "" 2 LTD"
1st line address of ", 1ST FLOOR ,WESTHILL HOUSE 2B DEVONSHIRE ROAD"
Note:
These values are from two separate, distinct lines in the CSV.
These values are copied and pasted from the CSV opened in a text editor (Notepad or VSCode).
Tried and failed:
df = spark.read.csv(_file, header=True) - picks up completely incorrect columns.
df = spark.read.csv(_file, header=True, escape='\"') - exact same thing described in original question above. So same results.
df = spark.read.csv(_file, header=True, escape='""') - since the CSV escapes quotes using two double quotes, then I guess using two double quotes as escape param would do the trick? But getting following error:
Py4JJavaError: An error occurred while calling o276.csv.
: java.lang.RuntimeException: escape cannot be more than one character
When reading the csv, the parameters quote and escape are set to the same value ('"'=="\"" returns True in Python).
I would guess that configuring both parameters in this way will somehow disturb the parser that Spark uses to separate the individual fields. After removing the escape parameter you can process the remaining " with regexp_replace:
from pyspark.sql import functions as F
df = spark.read.csv(<filename>, header=True, quote='"')
cols = [
    F.regexp_replace(
        F.regexp_replace(
            F.regexp_replace("`" + col + "`", '^"', ''),
            '"$', ''),
        '""', '"').alias(col)
    for col in df.columns
]
df.select(cols).show(truncate=False)
Probably there is a smart regexp that can combine all three replace operations into one...
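One hedged sketch of such a single pattern (untested; it assumes the only stray quotes are a leading ", a trailing ", or doubled "" pairs, and edge cases such as a value ending in "" may behave differently from the three-step version above):
from pyspark.sql import functions as F
# single-pass sketch: drop a leading ", a trailing ", and the first quote of
# every doubled "" pair (the lookahead keeps the second quote of each pair)
one_pass = [
    F.regexp_replace(F.col(f"`{c}`"), '^"|"$|"(?=")', '').alias(c)
    for c in df.columns
]
df.select(one_pass).show(truncate=False)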
This is an issue when reading a single column from the CSV file vs. reading all the columns:
df = spark.read.csv('companies-house.csv', header=True, quote='"', escape="\"")
df.show() # correct output (all columns loaded)
df.toPandas() # same as pd.read_csv()
df.select('CompanyCategory').show() # wrong output (trying to load a single column)
df.cache() # all columns will be loaded and used in any subsequent call
df.select('CompanyCategory').show() # correct output
The first select() performs a different (optimized) read than the second, so one possible workaround is to cache() the data immediately. This will, however, load all the columns, not just one (although pandas and COPY do the same).
The problematic part of the CSV is the RegAddress.POBox column, where an empty value is saved as ", instead of "",. You can check this by incrementally loading more columns:
df.unpersist() # undo cache() operation (for testing purposes only)
df.select(*[f"`{c}`" for c in df.columns[:3]], 'CompanyCategory').show() # wrong
df.select(*[f"`{c}`" for c in df.columns[:4]], 'CompanyCategory').show() # OK
I have a pandas DataFrame that's being read in from a CSV containing hostnames of computers, including the domain they belong to, along with a bunch of other columns. I'm trying to strip out the domain information so that I'm left with ONLY the hostname.
DataFrame ex:
name
domain1\computername1
domain1\computername45
dmain3\servername1
dmain3\computername3
domain1\servername64
....
I've tried using both str.strip() and str.replace() with a regex as well as a string literal, but I can't seem to target the domain information correctly.
Examples of what I've tried thus far:
df['name'].str.strip('.*\\')
df['name'].str.replace('.*\\', '', regex = True)
df['name'].str.replace(r'[.*\\]', '', regex = True)
df['name'].str.replace('domain1\\\\', '', regex = False)
df['name'].str.replace('dmain3\\\\', '', regex = False)
None of these seem to make any changes when I spit the DataFrame out using logging.debug(df)
You are already close to the answer, just use:
df['name'] = df['name'].str.replace(r'.*\\', '', regex = True)
which just adds an r-string to one of the variants you already tried.
Without the r-string, the Python string '.*\\' contains only a single backslash (.*\), leaving a dangling escape at the end of the regex. With the r-string, the string keeps both backslashes (.*\\), and the regex engine reads that pair as one literal \, which is exactly what you need to match the separator in the data.
Output:
0 computername1
1 computername45
2 servername1
3 computername3
4 servername64
Name: name, dtype: object
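For reference, a quick way to see the escaping difference in plain Python (just string lengths, nothing pandas-specific):
# the plain string ends in a single backslash: an incomplete escape in the regex
print(len('.*\\'))    # 3 characters: .  *  \
# the raw string keeps both backslashes: the regex engine reads them as one literal \
print(len(r'.*\\'))   # 4 characters: .  *  \  \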
You can use .str.split:
df["name"] = df["name"].str.split("\\", n=1).str[-1]
print(df)
Prints:
name
0 computername1
1 computername45
2 servername1
3 computername3
4 servername64
A no-regex approach with ntpath.basename:
import pandas as pd
import ntpath
df = pd.DataFrame({'name':[r'domain1\computername1']})
df["name"] = df["name"].apply(lambda x: ntpath.basename(x))
Results: computername1.
With rsplit:
df["name"] = df["name"].str.rsplit('\\').str[-1]
I am very new to Python/PySpark and currently using it with Databricks.
I have the following list
dummyJson= [
('{"name":"leo", "object" : ["191.168.192.96", "191.168.192.99"]}',),
('{"name":"anne", "object" : ["191.168.192.103", "191.168.192.107"]}',),
]
When I tried to
jsonRDD = sc.parallelize(dummyJson)
then
put it in dataframe
spark.read.json(jsonRDD)
it does not parse the JSON correctly. The resulting dataframe is one column with _corrupt_record as the header.
Looking at the elements in dummyJson, it looks like there is an extra/unnecessary comma just before the closing parenthesis of each element/record.
How can I remove this comma from each element of this list?
Thanks
If you can fix the input format at the source, that would be ideal.
But for your given case, you can fix it by taking the JSON strings out of the tuples.
>>> dJson = [i[0] for i in dummyJson]
>>> jsonRDD = sc.parallelize(dJson)
>>> jsonDF = spark.read.json(jsonRDD)
>>> jsonDF.show()
+----+--------------------+
|name| object|
+----+--------------------+
| leo|[191.168.192.96, ...|
|anne|[191.168.192.103,...|
+----+--------------------+
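If you would rather keep the list of one-element tuples as-is, a hedged alternative (assuming Spark 2.1+ for from_json) is to build a one-column DataFrame and parse the JSON strings explicitly:
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, ArrayType

# schema matching the JSON strings shown in the question
schema = StructType([
    StructField("name", StringType()),
    StructField("object", ArrayType(StringType())),
])

raw = spark.createDataFrame(dummyJson, ["value"])   # one string column per tuple
parsed = raw.select(F.from_json("value", schema).alias("j")).select("j.*")
parsed.show(truncate=False)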
I have a dataframe (in PySpark) that has one of its column values as a dictionary:
df.show()
And it looks like:
+----+---+-----------------------------+
|name|age|info |
+----+---+-----------------------------+
|rob |26 |{color: red, car: volkswagen}|
|evan|25 |{color: blue, car: mazda} |
+----+---+-----------------------------+
Based on the comments, to give more detail:
df.printSchema()
The types are strings
root
|-- name: string (nullable = true)
|-- age: string (nullable = true)
|-- dict: string (nullable = true)
Is it possible to take the keys from the dictionary (color and car) and make them columns in the dataframe, and have the values be the rows for those columns?
Expected Result:
+----+---+-----------------------------+
|name|age|color |car |
+----+---+-----------------------------+
|rob |26 |red |volkswagen |
|evan|25 |blue |mazda |
+----+---+-----------------------------+
Do I have to use df.withColumn() and somehow iterate through the dictionary to pick each key and then make a column out of it? I've tried to find some answers so far, but most were using Pandas and not Spark, so I'm not sure if I can apply the same logic.
Your strings:
"{color: red, car: volkswagen}"
"{color: blue, car: mazda}"
are not in a Python-friendly format. They can't be parsed using json.loads, nor can they be evaluated using ast.literal_eval.
However, if you know the keys ahead of time and can assume that the strings are always in this format, you should be able to use pyspark.sql.functions.regexp_extract:
For example:
from pyspark.sql.functions import regexp_extract
df.withColumn("color", regexp_extract("info", "(?<=color: )\w+(?=(,|}))", 0))\
.withColumn("car", regexp_extract("info", "(?<=car: )\w+(?=(,|}))", 0))\
.show(truncate=False)
#+----+---+-----------------------------+-----+----------+
#|name|age|info |color|car |
#+----+---+-----------------------------+-----+----------+
#|rob |26 |{color: red, car: volkswagen}|red |volkswagen|
#|evan|25 |{color: blue, car: mazda} |blue |mazda |
#+----+---+-----------------------------+-----+----------+
The pattern is:
(?<=color: ): A positive look-behind for the literal string "color: "
\w+: One or more word characters
(?=(,|})): A positive look-ahead for either a literal comma or close curly brace.
Here is how to generalize this for more than two keys, and handle the case where the key does not exist in the string.
from pyspark.sql.functions import regexp_extract, when, col
from functools import reduce
keys = ["color", "car", "year"]
pat = "(?<=%s: )\w+(?=(,|}))"
df = reduce(
lambda df, c: df.withColumn(
c,
when(
col("info").rlike(pat%c),
regexp_extract("info", pat%c, 0)
)
),
keys,
df
)
df.drop("info").show(truncate=False)
#+----+---+-----+----------+----+
#|name|age|color|car |year|
#+----+---+-----+----------+----+
#|rob |26 |red |volkswagen|null|
#|evan|25 |blue |mazda |null|
#+----+---+-----+----------+----+
In this case, we use pyspark.sql.functions.when and pyspark.sql.Column.rlike to test to see if the string contains the pattern, before we try to extract the match.
If you don't know the keys ahead of time, you'll either have to write your own parser or try to modify the data upstream.
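As a hedged sketch of one such parser that stays inside Spark SQL (assuming the keys and values never contain ',' or ':' themselves), you could strip the braces, let str_to_map build a proper MapType column, and then discover the keys from it:
from pyspark.sql import functions as F

# build a MapType column from the brace-stripped string
df2 = df.withColumn(
    "info_map",
    F.expr("str_to_map(regexp_replace(info, '[{}]', ''), ', ', ': ')")
)

# collect the distinct keys, then turn each key into its own column
keys = [r[0] for r in
        df2.select(F.explode(F.map_keys("info_map")).alias("k")).distinct().collect()]
df2.select("name", "age", *[F.col("info_map")[k].alias(k) for k in keys]).show()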
As you can see from the printSchema output, your dictionary is understood by Spark as a string. The function that slices a string and creates new columns is split(), so a simple solution to this problem could be:
Create a UDF that is capable of:
Convert the dictionary string into a comma separated string (removing the keys from the dictionary but keeping the order of the values)
Apply a split and create two new columns from the new format of our dictionary
The code:
from pyspark.sql.functions import udf, split

@udf()
def transform_dict(dict_str):
    str_of_dict_values = dict_str.\
        replace("}", "").\
        replace("{", "").\
        replace("color:", "").\
        replace(" car: ", "").\
        strip()
    # output example: 'red,volkswagen'
    return str_of_dict_values

# Create a new column with our UDF, with the dict values converted to a str
df = df.withColumn('info_clean', transform_dict("info"))
# Split these values and store in a tmp variable
split_col = split(df['info_clean'], ',')
# Create new columns with the split values
df = df.withColumn('color', split_col.getItem(0))
df = df.withColumn('car', split_col.getItem(1))
This solution is only correct if we assume that the dictionary elements always come in the same order, and also the keys are fixed.
For other more complex cases we could create a dictionary inside the UDF and form the string of values by explicitly invoking each of the dictionary keys, which ensures that the order in the output string is maintained; a sketch of that variant follows.
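A hedged sketch of that more robust variant (the key list and helper name are illustrative; it still assumes the values themselves contain no ',' or ':'):
from pyspark.sql.functions import udf, split
from pyspark.sql.types import StringType

KEYS = ["color", "car"]  # explicit, fixed output order

@udf(StringType())
def dict_values_as_csv(dict_str):
    # parse "{color: red, car: volkswagen}" into a real dict, then emit the
    # values in the fixed KEYS order so the later split() stays predictable
    body = dict_str.strip().strip("{}")
    pairs = (p.split(":", 1) for p in body.split(",") if ":" in p)
    parsed = {k.strip(): v.strip() for k, v in pairs}
    return ",".join(parsed.get(k, "") for k in KEYS)

df = df.withColumn("info_clean", dict_values_as_csv("info"))
split_col = split(df["info_clean"], ",")
df = df.withColumn("color", split_col.getItem(0)).withColumn("car", split_col.getItem(1))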
I feel the most scalable solution is the following one, which discovers the keys dynamically and passes them through a lambda function (note that this assumes info is already a MapType column; if it is a string, convert it to a map first):
from pyspark.sql.functions import explode,map_keys,col
keysDF = df.select(explode(map_keys(df.info))).distinct()
keysList = keysDF.rdd.map(lambda x:x[0]).collect()
keyCols = list(map(lambda x: col("info").getItem(x).alias(str(x)), keysList))
df.select(df.name, df.age, *keyCols).show()
Suppose I have a spark dataframe,
data.show()
ID URL
1 https://www.sitename.com/&q=To+Be+Parsed+out&oq=Dont+Need+to+be+parsed
2 https://www.sitename.com/&q=To+Be+Parsed+out&oq=Dont+Need+to+be+parsed
3 https://www.sitename.com/&q=To+Be+Parsed+out&oq=Dont+Need+to+be+parsed
4 https://www.sitename.com/&q=To+Be+Parsed+out&oq=Dont+Need+to+be+parsed
5 None
I want to write a regex operation for it, where I want to parse the URL for a particular scenario. The scenario would be to parse out the text after &q= and before the next &. I am able to write this in Python for a pandas dataframe as follows,
re.sub(r"\s+", " ", re.search(r'/?q=([^&]*)', data['url'][i]).group(1).replace('+', ' '))
I want to write the same in pyspark.
If I write something like,
re.sub(r"\s+", " ", re.search(r'/?q=([^&]*)', data.select(data.url.alias("url")).collect()).group(1).replace('+', ' '))
or
re.sub(r"\s+", " ", re.search(r'/?q=([^&]*)', data.select(data['url']).collect()).group(1).replace('+', ' '))
I am getting the following error,
TypeError: expected string or buffer
One option is to convert the data to pandas using data.toPandas() and then do the operations. But my data is huge and converting it to pandas makes it slow. Is there a way I can write this directly to a new column in the Spark dataframe, so I can have something like,
ID URL word
1 https://www.sitename.com/&q=To+Be+Parsed+out&oq=Dont+Need+to+be+parsed To Be Parsed out
2 https://www.sitename.com/&q=To+Be+Parsed+out&oq=Dont+Need+to+be+parsed To Be Parsed out
3 https://www.sitename.com/&q=To+Be+Parsed+out&oq=Dont+Need+to+be+parsed To Be Parsed out
4 https://www.sitename.com/&q=To+Be+Parsed+out&oq=Dont+Need+to+be+parsed To Be Parsed out
5 None None
How can I do this, adding it as a new column in the PySpark dataframe that applies to every row of the dataframe?
As mentioned by @David in the comment, you could use udf and withColumn:
Scala code:
import org.apache.spark.sql.functions._
val getWord: (String => String) = (url: String) => {
  if (url != null) {
    """/?q=([^&]*)""".r
      .findFirstIn(url)
      .get
      .replaceAll("q=", "")
      .replaceAll("\\+", " ")
  }
  else
    null
}
val udfGetWord = udf(getWord)
df.withColumn("word", udfGetWord($"url")).show()
Pyspark Code:
#Create dataframe with sample data
df = spark.createDataFrame([(1,'https://www.sitename.com/&q=To+Be+Parsed+out&oq=Dont+Need+to+be+parsed'),(2,'https://www.sitename.com/&q=To+Be+Parsed+out&oq=Dont+Need+to+be+parsed'),(3,'https://www.sitename.com/&q=To+Be+Parsed+out&oq=Dont+Need+to+be+parsed'),(4,'https://www.sitename.com/&q=To+Be+Parsed+out&oq=Dont+Need+to+be+parsed'),(5,'None')],['id','url'])
Use substr to cut out the desired string using location indexes, and instr to identify the location of the search pattern.
regexp_replace is used to replace the '+' signs with spaces.
df.selectExpr("id",
"url",
"regexp_replace(substr(url,instr(url,'&q')+3, instr(url,'&oq') - instr(url,'&q') - 3 ),'\\\+',' ') AS word")\
.show()
#+---+--------------------+----------------+
#| id| url| word|
#+---+--------------------+----------------+
#| 1|https://www.siten...|To Be Parsed out|
#| 2|https://www.siten...|To Be Parsed out|
#| 3|https://www.siten...|To Be Parsed out|
#| 4|https://www.siten...|To Be Parsed out|
#| 5| None| |
#+---+--------------------+----------------+
If the search pattern does not exist in the search string, a blank is returned. This can be handled using a case statement, as sketched below.
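For reference, a hedged DataFrame-API version of that idea (when() plays the role of the case statement; the pattern assumes the query of interest always follows '&q=' and runs up to the next '&'):
from pyspark.sql import functions as F

extracted = F.regexp_replace(
    F.regexp_extract("url", r"[&?]q=([^&]*)", 1),  # text between &q= and the next &
    r"\+", " ")                                    # turn '+' back into spaces

# empty matches (no &q= in the URL) are mapped to null instead of a blank
df.withColumn("word", F.when(extracted != "", extracted)).show(truncate=False)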