I am trying to add an if condition inside F.when for a PySpark column.
My code:
df = df.withColumn("column_fruits",F.when(F.col('column_fruits') == "Berries"
if("fruit_color")== "red":
return "cherries"
elif("fruit_color") == "pink":
return "strawberries"
else:
return "balackberries").otherwise("column_fruits")
I want to first filter out berries and change fruit names according to color. And all the remaining fruits remain the same.
Can anyone tell me if this is a valid way of writing withColumn code?
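This would work. Chain the conditions with .when, and fall back to the original column in .otherwise so that all other fruits stay unchanged:
df.withColumn("column_fruits",
    F.when((F.col('column_fruits') == "Berries") & (F.col('fruit_color') == "red"), "cherries")
     .when((F.col('column_fruits') == "Berries") & (F.col('fruit_color') == "pink"), "strawberries")
     .when(F.col('column_fruits') == "Berries", "blackberries")  # any other berry colour
     .otherwise(F.col('column_fruits')))\
  .show()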
I have the following pandas dataframe:
I am trying to write some conditional Python statements: if we have an issue_status of 10 or 40 AND a market_phase of 0 AND a blank tade_state (which is what we have in all of the cases in the above screenshot), then I want to call a function called resolve_collision_mp(...).
Can I write the conditional in Python as follows?
# Collision for issue_status == 10
if market_info_df['issue_status'].eq('10').all() and market_info_df['market_phase'].eq('0').all() \
        and market_info_df['trading_state'] == ' ': # need to change this, can't have equality for dataframe, need loc[...]
    return resolve_collision_mp_10(market_info_df)
# Collision for issue_status == 40
if market_info_df['issue_status'].eq('40').all() and market_info_df['market_phase'].eq('0').all() \
        and not market_info_df['trading_state']:
    return resolve_collision_mp_40(market_info_df)
I don't think the above is correct, any help would be much appreciated!
You can use .apply() with the relevant conditions (note that comparison uses ==, not =):
df['new_col'] = df.apply(lambda row: resolve_collision_mp_10(row) if (row['issue_status'] == 10 and row['market_phase'] == 0 and row['tade_state'] == '') else None, axis=1)
df['new_col'] = df.apply(lambda row: resolve_collision_mp_40(row) if (row['issue_status'] == 40 and row['market_phase'] == 0 and row['tade_state'] == '') else None, axis=1)
Note: I am assuming that you are trying to create a new column with the return values of the resolve_collision_mp_10 and resolve_collision_mp_40 functions. Both statements assign to new_col, so if you need both results, write the second one to a different column; otherwise it overwrites the first.
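Since each .apply() call builds the whole column, a single row-level function can handle both status codes in one pass. This is only a minimal sketch: the two resolver functions are hypothetical stubs, the sample rows are made up, and it assumes integer codes and the column name tade_state as used in the answer.

```python
import pandas as pd

# Hypothetical stand-ins for the asker's real resolver functions.
def resolve_collision_mp_10(row):
    return "resolved_10"

def resolve_collision_mp_40(row):
    return "resolved_40"

def resolve_row(row):
    """Dispatch one row to the matching resolver, or return None."""
    # Assumed: both cases require market_phase == 0 and a blank tade_state.
    if row["market_phase"] != 0 or row["tade_state"].strip() != "":
        return None
    if row["issue_status"] == 10:
        return resolve_collision_mp_10(row)
    if row["issue_status"] == 40:
        return resolve_collision_mp_40(row)
    return None

# Made-up sample data for illustration only.
market_info_df = pd.DataFrame({
    "issue_status": [10, 40, 10, 20],
    "market_phase": [0, 0, 1, 0],
    "tade_state": ["", " ", "", ""],
})
market_info_df["new_col"] = market_info_df.apply(resolve_row, axis=1)
```

Rows that match neither case end up with a missing value in new_col.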
I would like to compare a group of values with each other using a moving window. To explain it better: I have a column in a pandas dataframe and I would like to test whether 5 consecutive rows are the same, but with a moving window. That is, I would compare rows 0 to 5, then rows 1 to 6, and so on, in order to make certain changes. I would like to know how to do this in a better way than mine, because I used the iterrows method.
my method:
for idx, row in df[2:-2].iterrows():
    previous2 = df.loc[idx-2, 'speed_limit']
    previous1 = df.loc[idx-1, 'speed_limit']
    now = row['speed_limit']
    next1 = df.loc[idx+1, 'speed_limit']
    next2 = df.loc[idx+2, 'speed_limit']
    if (next1==next2) & (previous1 == previous2) & (previous1 == next1) & (now!=previous1):
        df.at[idx, 'speed_limit'] = previous1
Thank you for your patience. I would appreciate any suggestions. I wish you a great day.
I want to post my solution based on numpy select and pandas shift, which is faster than the previous one.
import numpy as np

def noise_remove(df, speed_limit_column):
    speed_data_column = df[speed_limit_column]
    # shift(+k) looks k rows back, shift(-k) looks k rows ahead
    previous_1 = speed_data_column.shift(+1)
    previous_2 = speed_data_column.shift(+2)
    next_1 = speed_data_column.shift(-1)
    next_2 = speed_data_column.shift(-2)
    # a row is noise when its four neighbours agree but the row itself differs
    conditions = [(previous_1 == previous_2) &
                  (next_1 == next_2) &
                  (previous_1 == next_1) &
                  (speed_data_column != previous_1)]
    choices = [previous_1]
    # every other row keeps its current value through the default
    df[speed_limit_column] = np.select(conditions, choices, default=speed_data_column)
    return df
If you have any suggestions, I'd appreciate them. Have a great day!
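To check the shift-based idea on a small input, here is a minimal sketch with made-up speed limits. It uses Series.where in place of np.select, and the helper name isolated_outlier_mask is hypothetical.

```python
import pandas as pd

def isolated_outlier_mask(series):
    """True where a value differs from four identical neighbours (two on each side)."""
    previous_1, previous_2 = series.shift(1), series.shift(2)
    next_1, next_2 = series.shift(-1), series.shift(-2)
    return ((previous_1 == previous_2) & (next_1 == next_2)
            & (previous_1 == next_1) & (series != previous_1))

# Made-up speed limits with a single glitch (30) at position 3.
df = pd.DataFrame({"speed_limit": [50, 50, 50, 30, 50, 50, 70, 70]})
mask = isolated_outlier_mask(df["speed_limit"])
# Series.where keeps values where the condition is True and takes the
# replacement (here the previous row's value) everywhere else.
df["speed_limit"] = df["speed_limit"].where(~mask, df["speed_limit"].shift(1))
```

Only the glitch at position 3 is flagged and replaced by 50. The first and last two rows can never be flagged, because their shifted neighbours are NaN and comparisons with NaN are False. Unlike the iterrows loop, all rows are evaluated against the original values, so a correction never influences the check on a later row.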
Every month I get a dataframe, and every month I have to make the same adjustments to it. I would like to create a function that I can apply to each dataframe without writing the code again.
I have for the first dataframe, called enero:
for i in range(0,len(enero)):
    if enero.loc[i,"VENDEDOR_CLIENTE"] == "ARTURO":
        enero.loc[i,"MARCA"]="MAQUILA PINTUCO"
    elif enero.loc[i,"PROVEEDOR"] == "PEPITO" and enero.loc[i,"VENDEDOR_CLIENTE"] != "ARTURO":
        enero.loc[i,"MARCA"]="PINTURAS"
For the second dataframe, called febrero:
for i in range(0,len(febrero)):
    if febrero.loc[i,"VENDEDOR_CLIENTE"] == "ARTURO":
        febrero.loc[i,"MARCA"]="MAQUILA PINTUCO"
    elif febrero.loc[i,"PROVEEDOR"] == "PEPITO" and febrero.loc[i,"VENDEDOR_CLIENTE"] != "ARTURO":
        febrero.loc[i,"MARCA"]="PINTURAS"
So, as not to repeat the code every month, I would like to create a function:
def ajustemarca(df,VENDEDOR_CLIENTE,MARCA,PROVEEDOR):
    for i in range(0,len(df)):
        if df.loc[i,"VENDEDOR_CLIENTE"] == "ARTURO":
            df.loc[i,"MARCA"]="MAQUILA PINTUCO"
        elif df.loc[i,"PROVEEDOR"] == "PEPITO" and df.loc[i,"VENDEDOR_CLIENTE"] != "ARTURO":
            df.loc[i,"MARCA"]="PINTURAS"
    return df.loc[i,"MARCA"]
Then, I am calling the function:
enero.apply(ajustemarca)
febrero.apply(ajustemarca)
But it does not work. How can I write this function?
I am sharing an answer that someone wrote here, but it was deleted :(
The answer was perfect and now the code works:
def ajustemarca(df):
    for i in range(0,len(df)):
        if df.loc[i,"VENDEDOR_CLIENTE"] == "ARTURO":
            df.loc[i,"MARCA"]="MAQUILA PINTUCO"
        elif df.loc[i,"PROVEEDOR"] == "PEPITO." and df.loc[i,"VENDEDOR_CLIENTE"] != "ARTURO":
            df.loc[i,"MARCA"]="PINTURAS"
ajustemarca(enero)
ajustemarca(febrero)
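The same rule can also be written without a row loop. The sketch below is one possible variant, not the deleted answer: the function name ajustemarca_vectorized and the sample rows are made up, and the literal "PEPITO" follows the question, so adjust it to whatever the real data contains.

```python
import pandas as pd

def ajustemarca_vectorized(df):
    """Set MARCA from VENDEDOR_CLIENTE and PROVEEDOR using boolean masks."""
    is_arturo = df["VENDEDOR_CLIENTE"] == "ARTURO"
    # "PEPITO" follows the question; change the literal to match the real data.
    is_pepito = (df["PROVEEDOR"] == "PEPITO") & ~is_arturo
    df.loc[is_arturo, "MARCA"] = "MAQUILA PINTUCO"
    df.loc[is_pepito, "MARCA"] = "PINTURAS"
    return df

# Made-up rows for illustration only.
enero = pd.DataFrame({
    "VENDEDOR_CLIENTE": ["ARTURO", "LUIS", "ARTURO", "ANA"],
    "PROVEEDOR": ["PEPITO", "PEPITO", "OTRO", "OTRO"],
    "MARCA": ["A", "B", "C", "D"],
})
enero = ajustemarca_vectorized(enero)
```

The loop version indexes with df.loc[i, ...] for i in range(len(df)), so it relies on the index labels being 0 to len(df) - 1. The mask version has no such requirement.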
I would like to learn how to use df.loc and a for-loop to calculate new columns for the dataframe below.
Problem: from df_G, for T = 400, take the value of each Go_j as input.
Then add a new column "G_ads_400" to dataframe df, computed as df['Adsorption_energy_eV'] - Go_j for the matching adsorbate (e.g. Go_h2o).
df_G
df
here is my code for each Temperature
Go_co2 = df_G.loc[df_G.index == "400" & df_G.Go_CO2]
Go_o2= df_G.loc[df_G.index == "400" & df_G.Go_O2]
Go_co= df_G.loc[df_G.index == "400" & df_G.Go_CO]
df.loc[df['Adsorbates'] == "CO2", "G_ads_400"] = df.Adsorption_energy_eV-Go_co2
df.loc[df['Adsorbates'] == "CO", "G_ads_400"] = df.Adsorption_energy_eV-Go_co
df.loc[df['Adsorbates'] == "O2", "G_ads_400"] = df.Adsorption_energy_eV-Go_o2
I am not sure why I keep getting an error, and I would like to know how to put this in a for-loop so it looks less messy.
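One likely cause of the error is operator precedence: & binds more tightly than ==, so "400" & df_G.Go_CO2 is evaluated first. Also, .loc takes the row label and the column label as separate arguments, for example df_G.loc[400, "Go_CO2"]. The sketch below shows one way to do the lookup per temperature in a loop. The tables are not shown in the question, so all numbers here are made up, and it assumes that df_G is indexed by integer temperatures with columns named Go_CO2, Go_O2 and Go_CO. The helper name add_g_ads is hypothetical.

```python
import pandas as pd

# Made-up numbers; the real df_G and df are not shown in the question.
df_G = pd.DataFrame(
    {"Go_CO2": [-1.0, -1.5], "Go_O2": [-0.5, -0.75], "Go_CO": [-0.25, -0.5]},
    index=[300, 400],  # assumed: temperatures as integer index labels
)
df = pd.DataFrame({
    "Adsorbates": ["CO2", "CO", "O2", "CO2"],
    "Adsorption_energy_eV": [-2.0, -1.0, -3.0, -2.5],
})

def add_g_ads(df, df_G, temperature):
    """Add G_ads_<T> = Adsorption_energy_eV - Go_<adsorbate> at temperature T."""
    # One row of df_G, relabelled from Go_CO2/Go_O2/Go_CO to CO2/O2/CO.
    go_at_t = df_G.loc[temperature].rename(lambda name: name.replace("Go_", ""))
    # Look up the matching Go value for every row's adsorbate.
    go_per_row = df["Adsorbates"].map(go_at_t)
    df[f"G_ads_{temperature}"] = df["Adsorption_energy_eV"] - go_per_row
    return df

for temperature in [300, 400]:
    df = add_g_ads(df, df_G, temperature)
```

If the index of df_G holds strings such as "400", pass the temperature as a string instead.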
I have an Access Query which I want to convert into Python Script:
SELECT
[Functional_Details].Customer_No,
Sum([Functional_Details].[SUM(Incoming_Hours)]) AS [SumOfSUM(Incoming_Hours)],
Sum([Functional_Details].[SUM(Incoming_Minutes)]) AS [SumOfSUM(Incoming_Minutes)],
Sum([Functional_Details].[SUM(Incoming_Seconds)]) AS [SumOfSUM(Incoming_Seconds)],
[Functional_Details].Rate,
[Functional_Details].Customer_Type
FROM [Functional_Details]
WHERE(
(([Functional_Details].User_ID) Not In ("IND"))
AND
(([Functional_Details].Incoming_ID)="Airtel")
AND
(([Functional_Details].Incoming_Category)="Foreign")
AND
(([Functional_Details].Outgoing_ID)="Airtel")
AND
(([Functional_Details].Outgoing_Category)="Foreign")
AND
(([Functional_Details].Current_Operation)="NO")
AND
(([Functional_Details].Active)="NO")
)
GROUP BY [Functional_Details].Customer_No, [Functional_Details].Rate, [Functional_Details].Customer_Type
HAVING ((([Functional_Details].Customer_Type)="Check"));
I have Functional_Details stored in a dataframe: df_functional_details
I am not able to understand how to proceed with the python script.
So far I have tried:
df_fd_temp=df_functional_details.copy()
if(df_fd_temp['User_ID'] != 'IND'
and df_fd_temp['Incoming_ID'] == 'Airtel'
and df_fd_temp['Incoming_Category'] == 'Foreign'
and df_fd_temp['Outgoing_ID'] == 'Airtel'
and df_fd_temp['Outgoing_Category'] == 'Foreign'
and df_fd_temp['Current_Operation'] == 'NO'
and df_fd_temp['Active'] == 'NO'):
df_fd_temp.groupby(['Customer_No','Rate','Customer_Type']).groups
df_fd_temp[df_fd_temp['Customer_Type'].str.contains("Check")]
First, select the rows where the conditions apply (note the parentheses and & instead of and):
df_fd_temp = df_fd_temp[(df_fd_temp['User_ID'] != 'IND') &
(df_fd_temp['Incoming_ID'] == 'Airtel') &
(df_fd_temp['Incoming_Category'] == 'Foreign') &
(df_fd_temp['Outgoing_ID'] == 'Airtel') &
(df_fd_temp['Outgoing_Category'] == 'Foreign') &
(df_fd_temp['Current_Operation'] == 'NO') &
(df_fd_temp['Active'] == 'NO')]
Then, do the group-by logic:
df_grouped = df_fd_temp.groupby(['Customer_No','Rate','Customer_Type'])
You now have a groupby object, which you can further manipulate and filter. Compare the values explicitly, because the in operator on a Series tests its index labels rather than its values:
df_grouped.filter(lambda x: (x['Customer_Type'] == "Check").all())
You might need to tweak the group filtering based on what your actual dataset looks like.
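Putting the steps together, including the Sum(...) columns from the query, could look like the sketch below. The column names are taken from the Access query and the rows are made up. Because Customer_Type is one of the group keys, the HAVING condition is applied here as an ordinary row filter before grouping, which gives the same groups.

```python
import pandas as pd

# Made-up rows; column names are taken from the Access query.
df_functional_details = pd.DataFrame({
    "Customer_No": [1, 1, 2, 3],
    "User_ID": ["ABC", "ABC", "IND", "XYZ"],
    "Incoming_ID": ["Airtel"] * 4,
    "Incoming_Category": ["Foreign"] * 4,
    "Outgoing_ID": ["Airtel"] * 4,
    "Outgoing_Category": ["Foreign"] * 4,
    "Current_Operation": ["NO"] * 4,
    "Active": ["NO"] * 4,
    "Rate": [0.5, 0.5, 0.5, 0.75],
    "Customer_Type": ["Check", "Check", "Check", "Other"],
    "SUM(Incoming_Hours)": [1, 2, 4, 8],
    "SUM(Incoming_Minutes)": [10, 20, 30, 40],
    "SUM(Incoming_Seconds)": [5, 6, 7, 8],
})

d = df_functional_details
# WHERE clause, plus the HAVING condition on the group key Customer_Type.
keep = (
    (d["User_ID"] != "IND")
    & (d["Incoming_ID"] == "Airtel")
    & (d["Incoming_Category"] == "Foreign")
    & (d["Outgoing_ID"] == "Airtel")
    & (d["Outgoing_Category"] == "Foreign")
    & (d["Current_Operation"] == "NO")
    & (d["Active"] == "NO")
    & (d["Customer_Type"] == "Check")
)
sum_cols = ["SUM(Incoming_Hours)", "SUM(Incoming_Minutes)", "SUM(Incoming_Seconds)"]

# GROUP BY with Sum(...) over the three columns, renamed like the query aliases.
result = (
    d[keep]
    .groupby(["Customer_No", "Rate", "Customer_Type"], as_index=False)[sum_cols]
    .sum()
    .rename(columns={col: f"SumOf{col}" for col in sum_cols})
)
```

With the sample rows, only customer 1 survives the filters, and its two rows are summed into one.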
Further reading:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.DataFrameGroupBy.filter.html