The following code is supposed to create a dataframe df2 with two columns: the first storing the name of each column of df, and the second storing the max length of each column of df. But I'm getting the error shown below:
Question: What am I doing wrong here, and how can I fix the error?
NameError: name 'row' is not defined
from pyspark.sql.functions import col, length, max
from pyspark.sql import Row
df = df.select([max(length(col(name))).alias(name) for name in df.schema.names])
df2 = spark.createDataFrame([Row(col=name, length=row[name]) for name in df.schema.names], ['col', 'length'])
Apologies, Nam. Please find the working snippet below. A line was missing in the original answer; I've updated it.
df = df.select([max(length(col(name))).alias(name) for name in df.schema.names])
row=df.first().asDict()
df2 = spark.createDataFrame([Row(col=name, length=row[name]) for name in df.schema.names], ['col', 'length'])
Let me know if you face any other issues.
Can someone please help me with this? I want to access rows by name, so I used set_index on the first column of the dataframe to index the rows by name instead of by integer.
# Set 'Name' column as index on a Dataframe
df1 = df1.set_index("Name", inplace = True)
df1
Output:
AttributeError: 'NoneType' object has no attribute 'set_index'
Then I run the following code:
result = df1.loc["ABC4"]
result
Output:
AttributeError: 'NoneType' object has no attribute 'loc'
I don't usually run code that depends on an earlier failing step before fixing the error, but originally I ran both together in one Jupyter notebook cell. Now I see that both code cells have problems.
Please let me know where I went wrong. Thank you!
Maybe you should define your dataframe?
import pandas as pd
df1 = pd.DataFrame(...)  # build df1 from your actual data first
df1 = df1.set_index("Name")
or just
import pandas as pd
df1 = pd.DataFrame(...).set_index("Name")
df1
Your variable "df1" is not defined anywhere before doing something with it.
Try this:
# Set 'Name' column as index on a DataFrame
df1 = pd.DataFrame(...)  # df1 must hold a DataFrame, not None
df1.set_index("Name", inplace=True)  # with inplace=True, don't assign the result
If it is defined earlier, its value is None, so check that variable first; assigning the result of a call made with inplace=True rebinds the name to None.
The rest of the code should work afterwards.
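The 'NoneType' errors are consistent with df1 having been rebound to None by an earlier inplace assignment. A minimal sketch of the pitfall, using made-up data:

```python
import pandas as pd

df1 = pd.DataFrame({"Name": ["ABC4", "XYZ1"], "Score": [10, 20]})

# set_index(..., inplace=True) mutates df1 and returns None,
# so assigning the result back would leave df1 bound to None.
returned = df1.set_index("Name", inplace=True)
print(returned)         # None
print(df1.loc["ABC4"])  # works: df1 itself was modified in place
```

Either call inplace=True without assigning the result (as above), or drop inplace and assign: df1 = df1.set_index("Name").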
I have a PySpark dataframe to which I want to add another column, using the value from the Section_1 column to look up its corresponding value in a Python dictionary. Basically, use the value from the Section_1 cell as the key, and fill the new column with the matching value from the dictionary, like below.
Original dataframe:
DataId  | ObjId     | Name        | Object | Section_1
My data | Data name | Object name | rd.111 | rd.123
Python dictionary:
object_map= {'rd.123' : 'rd.567'}
Where Section_1 has a value of rd.123, I will search the dictionary for the key 'rd.123', and I want to return its value 'rd.567' and place that in the new column.
Desired DataFrame:
DataId  | ObjId     | Name        | Object | Section_1 | Section_2
My data | Data name | Object name | rd.111 | rd.123    | rd.567
Right now I get this error with my current code, and I don't really know what I did wrong, as I am not too familiar with PySpark:
There is an incorrect call to a Column object in your code. Please review your code.
Here is the code I am currently using, where object_map is the Python dictionary:
test_df = output.withColumn('Section_2', object_map.get(output.Section_1.collect()))
You can try this (adapted from this answer with added null handling):
from itertools import chain
from pyspark.sql.functions import create_map, lit

object_map = {'rd.123': 'rd.567'}

# flatten the dict into alternating key/value literals for create_map
mapping_expr = create_map([lit(x) for x in chain(*object_map.items())])

# null keys get a null Section_2; the cast keeps the union schemas compatible
df1 = df.filter(df['Section_1'].isNull()).withColumn(
    'Section_2', lit(None).cast('string'))
df2 = df.filter(df['Section_1'].isNotNull()).withColumn(
    'Section_2', mapping_expr[df['Section_1']])
result = df1.unionAll(df2)
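The create_map trick works because chain(*object_map.items()) flattens the dictionary into alternating key/value entries; in plain Python the flattening step looks like this:

```python
from itertools import chain

object_map = {'rd.123': 'rd.567'}
flat = list(chain(*object_map.items()))
print(flat)  # ['rd.123', 'rd.567']
```

Each adjacent pair then becomes a key/value literal in the map expression.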
I am quite new to Python programming.
I am working with the following dataframe:
Before
Note that in the column "FBgn" there is a mix of FBgn and FBtr string values. I would like to replace the FBtr-containing values with the FBgn values provided in the adjacent column called "## FlyBase_FBgn", while keeping the existing FBgn values in column "FBgn". Keep in mind that I am showing only a portion of the dataframe (in reality: 1432 rows). How would I do that? I tried the replace() method from pandas, but it did not work.
This is actually what I would like to have:
After
Thanks a lot!
With Pandas, you could try:
df.loc[df["FBgn"].str.contains("FBtr"), "FBgn"] = df["## FlyBase_FBgn"]
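To illustrate on a tiny frame mimicking the question's columns (the values here are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "FBgn": ["FBtr389394949", "FBgn3093840"],
    "## FlyBase_FBgn": ["FBgn546466646", "FBgn3093840"],
})

# rows whose FBgn value contains "FBtr" get overwritten from the adjacent column
df.loc[df["FBgn"].str.contains("FBtr"), "FBgn"] = df["## FlyBase_FBgn"]
print(df["FBgn"].tolist())  # ['FBgn546466646', 'FBgn3093840']
```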
Welcome to Stack Overflow. Next time, please provide more info, including your code; it is always helpful.
Please see the code below; I think you need something similar.
import pandas as pd

# ignore dict1, I just wanted to recreate your df
dict1 = {"FBgn": ['FBtr389394949', 'FBgn3093840', 'FBtr000025'],
         "FBtr": ['FBgn546466646', '', 'FBgn15565555']}
df = pd.DataFrame(dict1)  # recreating your dataframe
print(df)

# function to replace the values
def replace_values(df):
    for i in range(len(df)):
        if 'tr' in df.loc[i, 'FBgn']:
            # .loc avoids pandas' chained-assignment warning
            df.loc[i, 'FBgn'] = df.loc[i, 'FBtr']
    return df

df = replace_values(df)
print(df)  # print new df
I have the following dataframe:
I tried to drop the -1 column using:
df = df.drop(columns=['-1'])
However, it is giving me the following error:
I was able to drop a column whose name is a string using this same approach, but not one named with a number. What am I doing wrong?
You can check the real column names by converting them to a list:
print (df.columns.tolist())
I think you need to drop the number -1 instead of the string '-1':
df = df.drop(columns=[-1])
Or another solution with the same output:
df = df.drop(-1, axis=1)
EDIT:
If you need to select all columns except the first, use DataFrame.iloc to select by position: the first : selects all rows, and the second 1: selects every column except the first:
df = df.iloc[:, 1:]
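You can check both behaviours on a toy frame whose column labels really are ints (as the question's seem to be):

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=[-1, 0])
print(df.columns.tolist())  # [-1, 0]

dropped = df.drop(columns=[-1])  # matches the int label, so it works
# df.drop(columns=['-1']) would raise a KeyError here, since '-1' != -1
```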
If you are just trying to remove the first column, another approach that would be independent of the column name is this:
df = df[df.columns[1:]]
You can do it simply with the following code.
First, check the column names:
df.columns
Then, if the output is like:
Index(['-1', '0'], dtype='object')
use the drop command to delete the column:
df.drop(['-1'], axis=1, inplace=True)
This should help for the future as well.
Could someone please look at the code below and advise what I have done wrong?
I have 2 pandas dataframes, df and x1.
Both have the same columns and column names.
I have to execute the code below for df.Date_Appointment and x1.Date_Appointment, and similarly for df.Date_Scheduled and x1.Date_Scheduled. As such, I created a list of dataframes and a list of columns.
I am trying to write this as a single function, but obviously I am doing something wrong. Please advise.
import pandas as pd
df = pd.read_csv(file1.csv)
x1 = pd.read_csv(file2.csv)
# x1 is a dataframe created after filtering on one column.
# df and x1 have same number of columns and column names
# x1 is a subset of df
dataframe = ['df','x1']
column = ['Date_Appointment', 'Date_Scheduled']
def df_det (dataframe.column):
(for df_det in dataframe.column :
d_da = df_det.describe()
mean_da = df_det.value_counts().mean()
median_da = df_det.value_counts().median()
mode_da = df_det.value_counts().mode()
print('Details of all appointments', '\n',
d_da, '\n',
'Mean = ', mean_da,'\n',
'Median = ', median_da,'\n',
'Mode = ',mode_da,'\n'))
Please indicate the steps.
Thank you in advance.
It looks like your function should have two arguments -- dataframes and columns -- both of which are lists, so I made the names plural.
Then you need to loop over each argument. Note that you also assign a dataframe inside the function the same name as the function itself, so I changed the function's name.
dataframes = [dataframe1, dataframe2]
columns = ['Date_Appointment', 'Date_Scheduled']

def summary_stats(dataframes, columns):
    for df in dataframes:
        for col in columns:
            df_det = df.loc[:, col]
            # print summary stats about df_det
            print(df_det.describe(), '\n',
                  'Mean = ', df_det.value_counts().mean(), '\n',
                  'Median = ', df_det.value_counts().median(), '\n',
                  'Mode = ', df_det.value_counts().mode(), '\n')
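For reference, here is how such a function behaves on toy data (the dates below are invented, standing in for file1.csv and file2.csv):

```python
import pandas as pd

df = pd.DataFrame({
    "Date_Appointment": ["2016-04-29", "2016-04-29", "2016-05-03"],
    "Date_Scheduled":   ["2016-04-25", "2016-04-26", "2016-04-26"],
})
x1 = df.head(2)  # a filtered subset of df, as in the question

def summary_stats(dataframes, columns):
    for frame in dataframes:
        for col in columns:
            counts = frame[col].value_counts()
            print(col, '-> mean:', counts.mean(),
                  'median:', counts.median(),
                  'mode:', counts.mode().tolist())

summary_stats([df, x1], ["Date_Appointment", "Date_Scheduled"])
```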