How can I use multiple .contains() inside a .when() in pySpark? [duplicate] - python

This question already has answers here:
PySpark: multiple conditions in when clause
(4 answers)
Closed 3 years ago.
I am trying to create classes in a new column, based on existing words in another column. For that, I need to include multiple .contains() conditions, but none of the attempts below work.
def classes_creation(data):
    df = data.withColumn("classes", when(data.where(F.col("MISP_RFW_Title").like('galleys') | F.col("MISP_RFW_Title").like('coffee')),"galleys") ).otherwise(lit(na))
    return df
# RETURNS ERROR
def classes_creation(data):
    df = data.withColumn("classes", when(col("MISP_RFW_Title").contains("galleys").contains("word"), 'galleys').otherwise(lit(na))
    return df
# RETURNS COLUMN OF NA ONLY
def classes_creation(data):
    df = data.withColumn("classes", when(col("MISP_RFW_Title").contains("galleys" | "word"), 'galleys').otherwise(lit(na))
    return df
# RETURNS COLUMN OF NA ONLY

If I understood your requirements correctly, you can use a regex and match it with rlike:
data.withColumn("classes", when(col("MISP_RFW_Title").rlike("galleys|word"), 'galleys').otherwise('a'))
Or, if the conditions apply to different columns, you can combine separate .contains() calls with |, wrapping the combined condition in parentheses:
data.withColumn("classes", when((col("MISP_RFW_Title").contains("galleys")|col("MISP_RFW_Title").contains("word")), 'galleys').otherwise('a'))

Related

Multiple special character transformations on dataframe using Pandas [duplicate]

This question already has answers here:
Faster method of extracting characters for multiple columns in dataframe
(2 answers)
How to extract part of a string in Pandas column and make a new column
(3 answers)
Reference - What does this regex mean?
(1 answer)
Closed 2 months ago.
I wish to keep everything before the hyphen in one column, and keep everything before the colon in another column using Pandas.
Data
ID Type Stat
AA - type2 AAB:AB33:77:000 Y
CC - type3 CCC:AB33:77:000 N
Desired
ID Type
AA AAB
CC CCC
Doing
separator = '-'
result_1 = my_str.split(separator, 1)[0]
Any suggestion is appreciated
We can try using str.extract here:
df["ID"] = df["ID"].str.extract(r'(\w+)')
df["Type"] = df["Type"].str.extract(r'(\w+)')
I would say
# assign() passes the whole DataFrame to each callable, so use the .str accessor
func1 = lambda _: _['ID'].str.split(' - ').str[0]
func2 = lambda _: _['Type'].str.split(':').str[0]
data\
    .assign(ID=func1)\
    .assign(Type=func2)
References
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.assign.html
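For reference, here is a minimal sketch of the split-based approach on the sample data from the question. The frame is rebuilt by hand from the "Data" table, and the name result is only illustrative. It assumes each ID contains a hyphen and each Type contains a colon.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "ID": ["AA - type2", "CC - type3"],
    "Type": ["AAB:AB33:77:000", "CCC:AB33:77:000"],
    "Stat": ["Y", "N"],
})

result = df.assign(
    # keep the text before the first hyphen, then drop the trailing space
    ID=df["ID"].str.split("-", n=1).str[0].str.strip(),
    # keep the text before the first colon
    Type=df["Type"].str.split(":", n=1).str[0],
)[["ID", "Type"]]

print(result)
```

Unlike str.extract(r'(\w+)'), splitting on the separator keeps any non-word characters that appear before it, which may or may not be what you want.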

change a value based on other value in dataframe [duplicate]

This question already has answers here:
Pandas DataFrame: replace all values in a column, based on condition
(8 answers)
Conditional Replace Pandas
(7 answers)
Closed 1 year ago.
If product type == option, I replace the value in the PRICE column with the value of the STRIKE column.
How can I do this without using the for loop? (to make it faster)
Now I have the following but it's slow:
for i in range(df.shape[0]):
    if df.loc[i,'type'] == 'Option':
        df.loc[i,'PRICE'] = df.loc[i,'STRIKE']
Use .loc in a vectorized fashion
df.loc[df['type'] == 'Option', 'PRICE'] = df['STRIKE']
mask = (df['type'] == 'Option')
# assign through .loc; df[mask].PRICE = ... would only modify a temporary copy
df.loc[mask, 'PRICE'] = df.loc[mask, 'STRIKE']
see:
https://www.geeksforgeeks.org/boolean-indexing-in-pandas/
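A small self-contained sketch of the vectorized replacement, using made-up prices and strikes (the values are assumptions, only the column names come from the question):

```python
import pandas as pd

# Hypothetical data; only the column names match the question
df = pd.DataFrame({
    "type": ["Option", "Stock", "Option"],
    "PRICE": [1.0, 2.0, 3.0],
    "STRIKE": [10.0, 20.0, 30.0],
})

# Boolean mask selecting the rows to overwrite
mask = df["type"] == "Option"

# A single .loc call selects rows and column together, so the write
# goes to df itself and not to an intermediate copy
df.loc[mask, "PRICE"] = df.loc[mask, "STRIKE"]

print(df)
```

Assigning through a chained selection such as df[mask].PRICE = ... writes to a temporary copy, so the original frame would stay unchanged.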

Overwriting values using .loc [duplicate]

This question already has answers here:
Try to replace a specific value in a dataframe, but does not overwritte it
(1 answer)
Changing values in pandas dataframe does not work
(1 answer)
Closed 2 years ago.
I want to conditionally overwrite some values for a given column in my DataFrame using this command
enq.dropna().loc[q16.apply(lambda x: x[:3].lower()) == 'oui', q16_] = 'OUI' # q16 = enq[column_name].dropna()
which has the form
df.dropna().loc[something == something_else, column_name] = new_value
I don't get any error but when I check the result, I see that nothing has changed.
Thanks for reading and helping.
The problem is that dropna() returns a new DataFrame, a copy of enq, so the assignment modifies that temporary copy, which is then discarded. You have to do it in two steps:
enq.dropna(inplace=True)
enq.loc[q16.apply(lambda x: x[:3].lower()) == 'oui', q16_] = 'OUI'
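If you would rather not drop rows from enq at all, one alternative (a sketch, not the asker's code) is to build the mask on the original frame and let missing values compare as False. The column name q16 and the sample answers below are assumptions for illustration.

```python
import pandas as pd

# Hypothetical survey data with a missing answer
enq = pd.DataFrame({
    "q16": ["Oui, souvent", None, "non", "OUI"],
    "other": [1, 2, 3, 4],
})

# First three characters, lower-cased; missing values stay missing
# and compare as False in .eq(), so no dropna() is needed
mask = enq["q16"].str[:3].str.lower().eq("oui")

# Write directly into enq through .loc
enq.loc[mask, "q16"] = "OUI"

print(enq)
```

This keeps the rows with missing values in place, whereas dropna(inplace=True) removes them from enq permanently.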

remove rows from dataframe where contents could be a choice of strings [duplicate]

This question already has answers here:
dropping rows from dataframe based on a "not in" condition [duplicate]
(2 answers)
Closed 4 years ago.
I can do something like:
data = df[ df['Proposal'] != 'C000' ]
to remove all Proposals with string C000, but how can I do something like:
data = df[ df['Proposal'] not in ['C000','C0001'] ]
to remove all proposals that match either C000 or C0001 (etc.)?
You can try this,
df = df.drop(df[df['Proposal'].isin(['C000','C0001'])].index)
Or to select the required ones,
df = df[~df['Proposal'].isin(['C000','C0001'])]
import numpy as np
data = df.loc[np.logical_not(df['Proposal'].isin({'C000','C0001'})), :]
# or
data = df.loc[ ~df['Proposal'].isin({'C000','C0001'}) , :]
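A runnable sketch of the "not in" filter with a few made-up proposal codes (the sample values and the name excluded are assumptions):

```python
import pandas as pd

# Hypothetical proposals
df = pd.DataFrame({"Proposal": ["C000", "C0001", "C0002", "D100"]})

# Values to remove
excluded = ["C000", "C0001"]

# isin() gives an element-wise membership mask; ~ inverts it
data = df[~df["Proposal"].isin(excluded)]

print(data)
```

Python's own `not in` cannot be used here because it asks for a single True/False answer, while the filter needs one boolean per row.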

Python/Pandas - Query a MultiIndex Column [duplicate]

This question already has answers here:
Select columns using pandas dataframe.query()
(5 answers)
Closed 4 years ago.
I'm trying to use query on a MultiIndex column. It works on a MultiIndex row, but not the column. Is there a reason for this? The documentation shows examples like the first one below, but it doesn't indicate that it won't work for a MultiIndex column.
I know there are other ways to do this, but I'm specifically trying to do it with the query function
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.random((4,4)))
df.index = pd.MultiIndex.from_product([[1,2],['A','B']])
df.index.names = ['RowInd1', 'RowInd2']
# This works
print(df.query('RowInd2 in ["A"]'))
df = pd.DataFrame(np.random.random((4,4)))
df.columns = pd.MultiIndex.from_product([[1,2],['A','B']])
df.columns.names = ['ColInd1', 'ColInd2']
# query on index works, but not on the multiindexed column
print(df.query('index < 2'))
print(df.query('ColInd2 in ["A"]'))
To answer my own question, it looks like query shouldn't be used at all (regardless of using MultiIndex columns) for selecting certain columns, based on the answer(s) here:
Select columns using pandas dataframe.query()
query only filters rows; to select on a column level you can use IndexSlice:
df.query('ilevel_0>2')
Out[327]:
ColInd1         1                   2
ColInd2         A         B         A         B
3        0.652576  0.639522  0.52087  0.446931
df.loc[:,pd.IndexSlice[:,'A']]
Out[328]:
ColInd1         1         2
ColInd2         A         A
0        0.092394  0.427668
1        0.326748  0.383632
2        0.717328  0.354294
3        0.652576  0.520870
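For completeness, a sketch comparing IndexSlice with DataFrame.xs for picking one label from the second column level. It uses deterministic values instead of the random data above so the result is easy to check. The names by_slice and by_xs are illustrative.

```python
import numpy as np
import pandas as pd

# 4x4 frame with a two-level column index, as in the question
df = pd.DataFrame(np.arange(16).reshape(4, 4))
df.columns = pd.MultiIndex.from_product(
    [[1, 2], ["A", "B"]], names=["ColInd1", "ColInd2"]
)

# Select every column whose second level is "A"
by_slice = df.loc[:, pd.IndexSlice[:, "A"]]

# Same selection by level name; drop_level=False keeps both levels
by_xs = df.xs("A", axis=1, level="ColInd2", drop_level=False)

print(by_slice)
```

xs lets you refer to the level by name, which can read more clearly than counting positions inside IndexSlice.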
