I have a PySpark DataFrame to which I want to add a column: take the value from the Section_1 column, look it up as a key in a Python dictionary, and fill the new column with the dictionary's value, like below.
Original dataframe:

DataId    ObjId      Name         Object   Section_1
My data   Data name  Object name  rd.111   rd.123
Python Dictionary
object_map = {'rd.123': 'rd.567'}
Section_1 has the value rd.123, so I want to search the dictionary for the key 'rd.123', get back the value rd.567, and place it in the new column.
Desired DataFrame:

DataId    ObjId      Name         Object   Section_1  Section_2
My data   Data name  Object name  rd.111   rd.123     rd.567
Right now I get the error below with my current code, and I don't really know what I did wrong, as I am not too familiar with PySpark.
There is an incorrect call to a Column object in your code. Please
review your code.
Here is my code that I am currently using where object_map is the python dictionary.
test_df = output.withColumn('Section_2', object_map.get(output.Section_1.collect()))
You can try this (adapted from this answer, with added null handling):

from itertools import chain
from pyspark.sql.functions import create_map, lit

object_map = {'rd.123': 'rd.567'}
mapping_expr = create_map([lit(x) for x in chain(*object_map.items())])

df1 = df.filter(df['Section_1'].isNull()).withColumn('Section_2', lit(None))
df2 = df.filter(df['Section_1'].isNotNull()).withColumn(
    'Section_2',
    mapping_expr[df['Section_1']]
)
result = df1.unionAll(df2)
The following code is supposed to create a dataframe df2 with two columns: the first storing the name of each column of df, and the second storing the max length of the values in that column. But I'm getting the error shown below:
Question: What might I be doing wrong here, and how can I fix the error?
NameError: name 'row' is not defined
from pyspark.sql.functions import col, length, max
from pyspark.sql import Row
df = df.select([max(length(col(name))).alias(name) for name in df.schema.names])
df2 = spark.createDataFrame([Row(col=name, length=row[name]) for name in df.schema.names], ['col', 'length'])
Apologies, Nam. Please find the working snippet below; a line was missing from the original answer, which I have now added.
df = df.select([max(length(col(name))).alias(name) for name in df.schema.names])
row=df.first().asDict()
df2 = spark.createDataFrame([Row(col=name, length=row[name]) for name in df.schema.names], ['col', 'length'])
Let me know if you face any other issues.
I have a dataframe containing a column like:
df['metrics'] =
[{id=1,name=XYZ,value=3}, {id=2,name=KJH,value=2}]
[{id=4,name=ABC,value=7}, {id=8,name=HGS,value=9}]
The column is a String type, and I am trying to explode the column using:
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType
array_item_schema = spark.read.json(df.rdd.map(lambda row: row['metrics'])).schema
json_array_schema = ArrayType(array_item_schema, True)
arrays_df = df.select(F.from_json('metrics', json_array_schema).alias('json_arrays'))
objects_df = arrays_df.select(F.explode('json_arrays').alias('objects'))
However, a null value is returned when I try
objects_df.show()
The output I am looking for is a separate row for each element of the 'metrics' column, with columns named id, name, and value, in the same dataframe. I don't know where to start decoding it. Thanks for the help!
You can use the schema_of_json function to get the schema from a sample JSON string, and pass it to the from_json function to get a struct type:

from pyspark.sql.functions import from_json, schema_of_json

json_array_schema = schema_of_json(str(df.select("metrics").first()[0]))
arrays_df = df.select(from_json('metrics', json_array_schema).alias('json_arrays'))

Note that from_json returns null when a string is not valid JSON (e.g. id=1 instead of "id": 1), which is a likely cause of the nulls you are seeing.
I am having below scenario in python code:
valuesarray = ['None','FULL'] --> values of the column 'GLASS' of dataframe df
keysarray = ['GLASS','SUBJECT']
I have the value 'GLASS' coming from code further up, and I am assigning it to keyvalue:
keyvalue = keysarray[0]
I am trying to get from the dataframe all records where the GLASS column matches 'None' or 'FULL'.
df= df[df.keyvalue.isin(valuesarray)]
But I am getting an error: keyvalue is not a column of the dataframe.
How can I access the value of the variable keyvalue in this scenario? Any ideas?
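The attribute form df.keyvalue looks for a column literally named keyvalue; bracket indexing df[keyvalue] evaluates the variable first. A minimal sketch with made-up data for the two columns:

```python
import pandas as pd

df = pd.DataFrame({"GLASS": ["None", "FULL", "HALF"], "SUBJECT": ["s1", "s2", "s3"]})

valuesarray = ["None", "FULL"]
keysarray = ["GLASS", "SUBJECT"]
keyvalue = keysarray[0]

# df.keyvalue fails because attribute access takes the name literally;
# df[keyvalue] substitutes the variable's value, i.e. df["GLASS"].
filtered = df[df[keyvalue].isin(valuesarray)]
```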
I have some data to clean: keys with six leading zeros that I want to get rid of. If a key does not end with "ABC" or "DEFG", I also need to remove the currency code in its last 3 characters. If the key doesn't start with leading zeros, just return it as is.
To achieve this I wrote a function that deals with string as below:
def cleanAttainKey(dirtyAttainKey):
    if dirtyAttainKey[0] != "0":
        return dirtyAttainKey
    # lstrip removes leading zeros only (strip would also eat trailing zeros)
    dirtyAttainKey = dirtyAttainKey.lstrip("0")
    # "DEFG" is four characters, so compare it with the last four
    if dirtyAttainKey[-3:] != "ABC" and dirtyAttainKey[-4:] != "DEFG":
        dirtyAttainKey = dirtyAttainKey[:-3]
    return dirtyAttainKey
Now I build a dummy dataframe to test it, but it reports errors:

df = pd.DataFrame({'dirtyKey': ["00000012345ABC", "0000012345DEFG", "0000023456DEFGUSD"],
                   'amount': [100, 101, 102]},
                  columns=["dirtyKey", "amount"])
I need a new column in df called "cleanAttainKey": run each value of "dirtyKey" through the cleanAttainKey function and assign the cleaned key to the new column. However, it seems Pandas doesn't support this type of modification.
# add a new column in df called cleanAttainKey
df['cleanAttainKey'] = ""

# clean the keys and put them into the new cleanAttainKey column
dirtyAttainKeyList = df['dirtyKey'].tolist()
for i in range(len(df['cleanAttainKey'])):
    df['cleanAttainKey'][i] = cleanAttainKey(dirtyAttainKeyList[i])
I am getting the below error message:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
The result should be the same as the df2 below:
df2 = pd.DataFrame({'dirtyKey': ["00000012345ABC", "0000012345DEFG", "0000023456DEFGUSD"],
                    'amount': [100, 101, 102],
                    'cleanAttainKey': ["12345ABC", "12345DEFG", "23456DEFG"]},
                   columns=["dirtyKey", "cleanAttainKey", "amount"])
df2
Is there any better way to modify the dirty keys and get a new column with the clean keys in Pandas?
Thanks
Here is the culprit:

df['cleanAttainKey'][i] = cleanAttainKey(dirtyAttainKeyList[i])

When you take an extract of a dataframe, Pandas reserves the right to return either a copy or a view. That does not matter if you are only reading the data, but it means you should never write through it.

The idiomatic way is to use loc (or iloc, or at/iat):

df.loc[i, 'cleanAttainKey'] = cleanAttainKey(dirtyAttainKeyList[i])

(The above assumes a natural range index.)
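An alternative that sidesteps chained assignment entirely is Series.apply, which builds the whole column and assigns it in one go. A self-contained sketch, with the length checks adjusted so the four-character "DEFG" suffix is compared correctly:

```python
import pandas as pd

def cleanAttainKey(dirtyAttainKey):
    if dirtyAttainKey[0] != "0":
        return dirtyAttainKey
    dirtyAttainKey = dirtyAttainKey.lstrip("0")  # drop leading zeros only
    # keep keys ending in "ABC" or "DEFG"; otherwise strip the currency code
    if dirtyAttainKey[-3:] != "ABC" and dirtyAttainKey[-4:] != "DEFG":
        dirtyAttainKey = dirtyAttainKey[:-3]
    return dirtyAttainKey

df = pd.DataFrame({"dirtyKey": ["00000012345ABC", "0000012345DEFG", "0000023456DEFGUSD"],
                   "amount": [100, 101, 102]})

# apply maps the function over the column and assigns the result once,
# so there is no chained indexing and no SettingWithCopyWarning.
df["cleanAttainKey"] = df["dirtyKey"].apply(cleanAttainKey)
```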
I have an index(list) called newSeries0 and I'd like to do the following.
for seriesName in newSeries0:
    seriesName = fred.get_series_first_release(seriesName)
    seriesName = pd.DataFrame(seriesName)
    seriesName = seriesName.resample('D').fillna('ffill')
    seriesName.rename(columns={'value': str(seriesName)}, inplace=True)
In other words, I'd like to create a dataframe for each name in newSeries0 (using this fred api), named after that series. Each dataframe is forward-filled to daily frequency, and its data column is renamed to the name of the series.
Is zip or map involved?
In the end I'd like to have
a=dataframe of a
b=dataframe of b
c=dataframe of c
...
where a,b,c... are the names of the data series in my index(list) newSeries0, so when I call a I get the dataframe of a.
Just use a dictionary:

dataframe_dict = {}
for name in newSeries0:
    series = fred.get_series_first_release(name)
    series = pd.DataFrame(series)
    series = series.resample('D').fillna('ffill')
    dataframe_dict[name] = series

df = pd.DataFrame()
for name in dataframe_dict:
    df[name] = dataframe_dict[name]['value']

Note that the loop variable must stay separate from the result: in your snippet, seriesName is overwritten with the DataFrame itself, so by the time you would assign dataframe_dict[seriesName], the key is no longer the series name.
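A self-contained sketch of the dictionary approach; the fred call is replaced by a hypothetical stand-in (get_series_stub) that returns a sparse daily Series named 'value', since the real API needs a key:

```python
import pandas as pd

def get_series_stub(name):
    # Hypothetical stand-in for fred.get_series_first_release(name).
    idx = pd.to_datetime(["2020-01-01", "2020-01-03"])
    return pd.Series([1.0, 2.0], index=idx, name="value")

newSeries0 = ["a", "b"]

dataframe_dict = {}
for name in newSeries0:
    series = get_series_stub(name)                 # keep the name separate
    frame = pd.DataFrame(series)
    frame = frame.resample("D").ffill()            # daily, forward-filled
    frame = frame.rename(columns={"value": name})  # column named after the series
    dataframe_dict[name] = frame
```

Looking up dataframe_dict["a"] then gives the forward-filled dataframe for series "a", which is as close as Python gets to "a = dataframe of a" without dynamically creating variables.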