I created data frame in PySpark by reading data from HDFS like this:
df = spark.read.parquet('path/to/parquet')
I expect the data frame to have two column of strings:
+------------+------------------+
|my_column |my_other_column |
+------------+------------------+
|my_string_1 |my_other_string_1 |
|my_string_2 |my_other_string_2 |
|my_string_3 |my_other_string_3 |
|my_string_4 |my_other_string_4 |
|my_string_5 |my_other_string_5 |
|my_string_6 |my_other_string_6 |
|my_string_7 |my_other_string_7 |
|my_string_8 |my_other_string_8 |
+------------+------------------+
However, I get my_column column with some strings starting with [Ljava.lang.Object;, looking like this:
>> df.show(truncate=False)
+-----------------------------+------------------+
|my_column |my_other_column |
+-----------------------------+------------------+
|[Ljava.lang.Object;#7abeeeb6 |my_other_string_1 |
|[Ljava.lang.Object;#5c1bbb1c |my_other_string_2 |
|[Ljava.lang.Object;#6be335ee |my_other_string_3 |
|[Ljava.lang.Object;#153bdb33 |my_other_string_4 |
|[Ljava.lang.Object;#1a23b57f |my_other_string_5 |
|[Ljava.lang.Object;#3a101a1a |my_other_string_6 |
|[Ljava.lang.Object;#33846636 |my_other_string_7 |
|[Ljava.lang.Object;#521a0a3d |my_other_string_8 |
+-----------------------------+------------------+
>> df.printSchema()
root
|-- my_column: string (nullable = true)
|-- my_other_column: string (nullable = true)
As you can see, my_other_column column is looking as expected. Is there any way, how to convert objects in my_column column to humanly readable strings?
Jaroslav,
I tried with the following code, and have used a sample parquet file from here. I am able to get the desired output from the dataframe, can u please chk your code using the code snippet below and also sample file referred above to see if there's any other issue:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Read a Parquet file").getOrCreate()
df = spark.read.parquet('E:\\...\\..\\userdata1.parquet')
df.show(10)
df.printSchema()
Replace the path to your HDFS location.
Dataframe output for your reference:
Related
I currently have a geopandas dataframe that looks like this
|----|-------|-----|------------------------------------------------|
| id | name | ... | geometry |
|----|-------|-----|------------------------------------------------|
| 1 | poly1 | ... | 0101000020E6100000A6D52A40F1E16690764A7D... |
|----|-------|-----|------------------------------------------------|
| 2 | poly2 | ... | 0101000020E610000065H7D2A459A295J0A67AD2... |
|----|-------|-----|------------------------------------------------|
And when getting ready to write it to postgis, I am getting the following error:
/python3.7/site-packages/geopandas/geodataframe.py:1321: UserWarning: Geometry column does not contain geometry.
warnings.warn("Geometry column does not contain geometry.")
Is there a way to convert this geometry column to a geometry type so that when it is appending to the existing table with geometry type column errors can be avoided. I've tried:
df['geometry'] = gpd.GeoSeries.to_wkt(df['geometry'])
But there are errors parsing the existing geometry column. Is there a correct way I am missing?
The syntax needs to be changed as below
df['geometry'] = df.geometry.apply(lambda x: x.wkt).apply(lambda x: re.sub('"(.*)"', '\\1', x))
I have a pyspark dataframe, with text column.
I wanted to map the values which with a regex expression.
df = df.withColumn('mapped_col', regexp_replace('mapped_col', '.*-RH', 'RH'))
df = df.withColumn('mapped_col', regexp_replace('mapped_col', '.*-FI, 'FI'))
Plus I wanted to map specifics values according to a dictionnary, I did the following (mapper is from create_map()):
df = df.withColumn("mapped_col",mapper.getItem(F.col("action")))
Finaly the values which has not been mapped by the dictionnary or the regex expression, will be set null. I do not know how to do this part in accordance to the two others.
Is it possible to have like a dictionnary of regex expression so I can regroup the two 'functions'?
{".*-RH": "RH", ".*FI" : "FI"}
Original Output Example
+-----------------------------+
|message |
+-----------------------------+
|GDF2009 |
|GDF2014 |
|ADS-set |
|ADS-set |
|XSQXQXQSDZADAA5454546a45a4-FI|
|dadaccpjpifjpsjfefspolamml-FI|
|dqdazdaapijiejoajojp565656-RH|
|kijipiadoa
+-----------------------------+
Expected Output Example
+-----------------------------+-----------------------------+
|message |status|
+-----------------------------+-----------------------------+
|GDF2009 | GDF
|GDF2014 | GDF
|ADS/set | ADS
|ADS-set | ADS
|XSQXQXQSDZADAA5454546a45a4-FI| FI
|dadaccpjpifjpsjfefspolamml-FI| FI
|dqdazdaapijiejoajojp565656-RH| RH
|kijipiadoa | null or ??
So first 4th line are mapped with a dict, and the other are mapped using regex. Unmapped are null or ??
Thank you,
You can achieve it using contains function:
from pyspark.sql.types import StringType
df = spark.createDataFrame(
["GDF2009", "GDF2014", "ADS-set", "ADS-set", "XSQXQXQSDZADAA5454546a45a4-FI", "dadaccpjpifjpsjfefspolamml-FI",
"dqdazdaapijiejoajojp565656-RH", "kijipiadoa"], StringType()).toDF("message")
df.show()
names = ("GDF", "ADS", "FI", "RH")
def c(col, names):
return [f.when(f.col(col).contains(i), i).otherwise("") for i in names]
df.select("message", f.concat_ws("", f.array_remove(f.array(*c("message", names)), "")).alias("status")).show()
output:
+--------------------+
| message|
+--------------------+
| GDF2009|
| GDF2014|
| ADS-set|
| ADS-set|
|XSQXQXQSDZADAA545...|
|dadaccpjpifjpsjfe...|
|dqdazdaapijiejoaj...|
| kijipiadoa|
+--------------------+
+--------------------+------+
| message|status|
+--------------------+------+
| GDF2009| GDF|
| GDF2014| GDF|
| ADS-set| ADS|
| ADS-set| ADS|
|XSQXQXQSDZADAA545...| FI|
|dadaccpjpifjpsjfe...| FI|
|dqdazdaapijiejoaj...| RH|
| kijipiadoa| |
+--------------------+------+
I have a Pyspark data frame that contains a date column "Reported Date"(type:string). I would like to get the count of another column after extracting the year from the date.
I can get the count if I use the string date column.
crimeFile_date.groupBy("Reported Date").sum("Offence Count").show()
and I get this output
+-------------+------------------+
|Reported Date|sum(Offence Count)|
+-------------+------------------+
| 13/08/2010| 342|
| 6/10/2011| 334|
| 27/11/2011| 269|
| 12/01/2012| 303|
| 22/02/2012| 286|
| 31/07/2012| 276|
| 25/04/2013| 222|
+-------------+------------------+
To extract the year from "Reported Date" I have converted it to a date format (using this approach) and named the column "Date".
However, when I try to use the same code to group by the new column and do the count I get an error message.
crimeFile_date.groupBy(year("Date").alias("year")).sum("Offence Count").show()
TypeError: strptime() argument 1 must be str, not None
This is the data schema:
root
|-- Offence Count: integer (nullable = true)
|-- Reported Date: string (nullable = true)
|-- Date: date (nullable = true)
Is there a way to fix this error? or extract the year using another method?
Thank you
If I understand correctly then you want to extract the year from String date column. Of course, one way is using regex but sometimes it can throw your logic off if regex is not handling all scenarios.
here is the date data type approach.
Imports
import pyspark.sql.functions as f
Creating your Dataframe
l1 = [('13/08/2010',342),('6/10/2011',334),('27/11/2011',269),('12/01/2012',303),('22/02/2012',286),('31/07/2012',276),('25/04/2013',222)]
dfl1 = spark.createDataFrame(l1).toDF("dates","sum")
dfl1.show()
+----------+---+
| dates|sum|
+----------+---+
|13/08/2010|342|
| 6/10/2011|334|
|27/11/2011|269|
|12/01/2012|303|
|22/02/2012|286|
|31/07/2012|276|
|25/04/2013|222|
+----------+---+
Now, You can use to_timestamp or to_date apis of functions package
dfl2 = dfl1.withColumn('years',f.year(f.to_timestamp('dates', 'dd/MM/yyyy')))
dfl2.show()
+----------+---+-----+
| dates|sum|years|
+----------+---+-----+
|13/08/2010|342| 2010|
| 6/10/2011|334| 2011|
|27/11/2011|269| 2011|
|12/01/2012|303| 2012|
|22/02/2012|286| 2012|
|31/07/2012|276| 2012|
|25/04/2013|222| 2013|
+----------+---+-----+
Now, group by on years.
dfl2.groupBy('years').sum('sum').show()
+-----+--------+
|years|sum(sum)|
+-----+--------+
| 2013| 222|
| 2012| 865|
| 2010| 342|
| 2011| 603|
+-----+--------+
Showing into multiple steps for understanding but you can combine extract year and group by in one step.
Happy to extend if you need some other help.
I wonder what is the most efficient way to extract a column in pyspark dataframe and turn them into a new dataframe? The following code runs without any problem with small datasets, but runs very slow and even causes out-of-memory error. I wonder how can I improve the efficiency of this code?
pdf_edges = sdf_grp.rdd.flatMap(lambda x: x).collect()
edgelist = reduce(lambda a, b: a + b, pdf_edges, [])
sdf_edges = spark.createDataFrame(edgelist)
In pyspark dataframe sdf_grp, The column "pairs" contains information as below
+-------------------------------------------------------------------+
|pairs |
+-------------------------------------------------------------------+
|[[39169813, 24907492], [39169813, 19650174]] |
|[[10876191, 139604770]] |
|[[6481958, 22689674]] |
|[[73450939, 114203936], [73450939, 21226555], [73450939, 24367554]]|
|[[66306616, 32911686], [66306616, 19319140], [66306616, 48712544]] |
+-------------------------------------------------------------------+
with a schema of
root
|-- pairs: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- node1: integer (nullable = false)
| | |-- node2: integer (nullable = false)
I'd like to convert them into a new dataframe sdf_edges looks like below
+---------+---------+
| node1| node2|
+---------+---------+
| 39169813| 24907492|
| 39169813| 19650174|
| 10876191|139604770|
| 6481958| 22689674|
| 73450939|114203936|
| 73450939| 21226555|
| 73450939| 24367554|
| 66306616| 32911686|
| 66306616| 19319140|
| 66306616| 48712544|
+---------+---------+
The most efficient way to extract columns is avoiding collect(). When you call collect(), all the data is transfered to the driver and processed there. At better way to achieve what you want is using the explode() function. Have a look at the example below:
from pyspark.sql import types as T
import pyspark.sql.functions as F
schema = T.StructType([
T.StructField("pairs", T.ArrayType(
T.StructType([
T.StructField("node1", T.IntegerType()),
T.StructField("node2", T.IntegerType())
])
)
)
])
df = spark.createDataFrame(
[
([[39169813, 24907492], [39169813, 19650174]],),
([[10876191, 139604770]], ) ,
([[6481958, 22689674]] , ) ,
([[73450939, 114203936], [73450939, 21226555], [73450939, 24367554]],),
([[66306616, 32911686], [66306616, 19319140], [66306616, 48712544]],)
], schema)
df = df.select(F.explode('pairs').alias('exploded')).select('exploded.node1', 'exploded.node2')
df.show(truncate=False)
Output:
+--------+---------+
| node1 | node2 |
+--------+---------+
|39169813|24907492 |
|39169813|19650174 |
|10876191|139604770|
|6481958 |22689674 |
|73450939|114203936|
|73450939|21226555 |
|73450939|24367554 |
|66306616|32911686 |
|66306616|19319140 |
|66306616|48712544 |
+--------+---------+
Well, I just solve it with the below
sdf_edges = sdf_grp.select('pairs').rdd.flatMap(lambda x: x[0]).toDF()
I'm trying to concatenate two data frames and write said data-frame to an excel file. The concatenation is performed somewhat successfully, but I'm having a difficult time eliminating the index row that also gets appended.
I would appreciate it if someone could highlight what it is I'm doing wrong. I thought providing the "index = False" argument at every excel call would eliminate the issue, but it has not.
enter image description here
Hopefully you can see the image, if not please let me know.
# filenames
file_name = "C:\\Users\\ga395e\\Desktop\\TEST_FILE.xlsx"
file_name2 = "C:\\Users\\ga395e\\Desktop\\TEST_FILE_2.xlsx"
#create data frames
df = pd.read_excel(file_name, index = False)
df2 = pd.read_excel(file_name2,index =False)
#filter frame
df3 = df2[['WDDT', 'Part Name', 'Remove SN']]
#concatenate values
df4 = df3['WDDT'].map(str) + '-' +df3['Part Name'].map(str) + '-' + 'SN:'+ df3['Remove SN'].map(str)
test=pd.DataFrame(df4)
test=test.transpose()
df = pd.concat([df, test], axis=1)
df.to_excel("C:\\Users\\ga395e\\Desktop\\c.xlsx", index=False)
Thanks
so as the other users also wrote I dont see the index in your image as well because in this case you would have an output which would be like the following:
| Index | Column1 | Column2 |
|-------+----------+----------|
| 0 | Entry1_1 | Entry1_2 |
| 1 | Entry2_1 | Entry2_2 |
| 2 | Entry3_1 | Entry3_2 |
if you pass the index=False option the index will be removed:
| Column1 | Column2 |
|----------+----------|
| Entry1_1 | Entry1_2 |
| Entry2_1 | Entry2_2 |
| Entry3_1 | Entry3_2 |
| | |
which looks like it your case. Your problem be could related to the concatenation and the transposed matrix.
Did you check here you temporary dataframe before exporting it?
You might want to check if pandas imports the time column as a time index
if you want to delete those time columns you could use df.drop and pass an array of columns into this function, e.g. with df.drop(df.columns[:3]). Does this maybe solve your problem?