Using casefold() with dataframe Column Names and .contains method - python

How do I look for rows in the dataframe where the 'Campaign' column contains 'b0'?
I would like to not alter the dataframe values but instead just view them as if they were lowercase.
df.loc.str.casefold()[df['Campaign'].str.casefold().contains('b0')]
I recently asked about doing this when matching a specific string exactly, like below, but what I am asking above is proving more difficult.
df['Record Type'].str.lower() == 'keyword'

Try with
df.loc[df['Campaign'].str.contains('b0', case=False)]

Alternatively, if you want to create a subset of the dataframe:
df_subset = df[df['Campaign'].str.casefold().str.contains('b0', na=False)]
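As a minimal sketch with invented data (only the column name 'Campaign' and the substring 'b0' come from the question), both forms pick out the same rows:
import pandas as pd
df = pd.DataFrame({'Campaign': ['Brand_B0_Search', 'generic_b0_ads', 'Display_X1']})
# case=False makes str.contains ignore case
df.loc[df['Campaign'].str.contains('b0', case=False)]
# equivalent: casefold the column first, then match the lowercase pattern
df[df['Campaign'].str.casefold().str.contains('b0', na=False)]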

Related

Python function that evaluates each row by a list

I am using Python to clean address data and standardize abbreviations, etc. so that it can be compared against other address data. I finally have 2 dataframes in Pandas. I would like to compare each row in the first df, named df, against a list of addresses built from a second df of similar structure, second_df. If the address from df is on the list, then I would like to create a column to note this, maybe a boolean, but ideally the string 'found'. I have tried isin and it did not work.
For example, suppose my data looks like the sample data below. I would like to compare each value in df['concat'] to the entire list to see if the address in the df['concat'] column appears in the second_df list.
import pandas as pd
read = pd.read_excel('fullfilepath.xlsx')
second_df = pd.read_excel('anotherfilepath.xlsx')
df = read[['column1','column2', 'concat']]
list = second_df.concat.tolist()
EDIT: based on tdy's comment, as my original answer didn't have a value for the False branch of the where statement.
Try something like this:
df["isFound"] = np.where(df['concat'].isin(second_df["concat"]), "found", "notfound")
Should be exactly what you need
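As a quick, self-contained illustration with invented addresses (only the column name 'concat' comes from the question):
import numpy as np
import pandas as pd
df = pd.DataFrame({'concat': ['1 MAIN ST', '2 OAK AVE', '3 ELM RD']})
second_df = pd.DataFrame({'concat': ['2 OAK AVE', '9 PINE LN']})
# isin gives a boolean Series; np.where maps True/False to the two labels
df["isFound"] = np.where(df['concat'].isin(second_df["concat"]), "found", "notfound")
# '2 OAK AVE' is marked 'found', the other rows 'notfound'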

Using describe() method to exclude a column

I am new to using python with data sets and am trying to exclude a column ("id") from being shown in the output. Wondering how to go about this using the describe() and exclude functions.
describe() works on datatypes: you can include or exclude based on dtype, not based on column names. If your id column is the only one of its datatype, then
df.describe(exclude=[datatype])
or if you just want to remove the column(s) in describe, then try this
cols = set(df.columns) - {'id'}
df1 = df[list(cols)]
df1.describe()
Ta-da, it's done. For more info, see the pandas documentation for describe.
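For example, with a small invented frame where 'id' happens to be the only integer column, both routes leave 'id' out of the summary (the set-based route may reorder the remaining columns, since sets are unordered):
import pandas as pd
df = pd.DataFrame({'id': [1, 2, 3],
                   'height': [1.7, 1.6, 1.8],
                   'weight': [70.0, 60.0, 80.0]})
# exclude by dtype: drops every int64 column, which here is just 'id'
df.describe(exclude=['int64'])
# or drop the column by name and describe the rest
cols = set(df.columns) - {'id'}
df[list(cols)].describe()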
You can do that by slicing your original DataFrame to remove the 'id' column. One way is through .iloc. Suppose the column 'id' is the first column of your DataFrame; then you could do this:
df.iloc[:,1:].describe()
The first colon represents the rows, the second the columns.
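If 'id' is not the first column, a positional slice will not work; a boolean mask over the column names (a sketch, not part of the answer above) keeps everything except 'id' regardless of its position:
import pandas as pd
df = pd.DataFrame({'height': [1.7, 1.6], 'id': [1, 2], 'weight': [70.0, 60.0]})
# keep every column whose name is not 'id', wherever it sits
df.loc[:, df.columns != 'id'].describe()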
Although somebody already responded with an example from the official docs, which is more than enough, I'd just like to add this, since it might help a few people:
If your DataFrame is large (say, hundreds of columns), removing one or two columns at describe time might not be enough; instead, create a smaller DataFrame holding only the columns you're interested in and go from there.
Example of removing 2+ columns:
# columns to keep = all columns minus the ones you don't want
columns_to_keep = set(your_bigger_data_frame.columns) - {'column_1', 'column_2', 'column_3'}
your_new_smaller_data_frame = your_bigger_data_frame[list(columns_to_keep)]
your_new_smaller_data_frame.describe()
If your DataFrame is small or medium-sized, you already know every column, and you only need a few of them, just create a new DataFrame and then apply describe():
Here's an example that reads a .csv file and then keeps only the smaller portion of that DataFrame which holds what you need:
import pandas as pd
# use a raw string so the backslashes in the Windows path are not treated as escapes
df = pd.read_csv(r'.\docs\project\file.csv')
df = df[['column_1', 'column_2', 'column_3']]
df.describe()
Use output.drop(columns='id').describe() (note that describe's exclude argument takes dtypes, not column names, so exclude=['id'] will not work).

Filtering Pandas Dataframe Based on List of Column Names

I have a pandas dataframe which has maybe 1000 columns. However, I do not need so many columns; I need a column only if its name matches/starts with/contains specific strings.
So let's say I have dataframe columns like
df.columns =
HYTY, ABNH, CDKL, GHY#UIKI, BYUJI##hy, BYUJI#tt, BBNNII#5, FGATAY#J, ...
I want to select only the columns whose names are like HYTY, CDKL, BYUJI* & BBNNI*
So what I was trying to do is to create a list of regular expressions like:
import re
relst = ['HYTY', 'CDKL*', 'BYUJI*', 'BBNI*']
my_w_lst = [re.escape(s) for s in relst]
mask_pattrn = '|'.join(my_w_lst)
Then I create the logical vector to give me a list of TRUE/FALSE saying whether each string is present or not. However, I don't understand how to get the dataframe with only the selected (True) columns from this.
Any help will be appreciated.
Using what you already have, you can pass your mask to filter like:
df.filter(regex=mask_pattrn)
Use re.findall() on the column names. It will give you a list of columns to pass to df[mylist].
We can do it with startswith:
relst = ['CDKL', 'BYUJI', 'BBNI']
subdf = df.loc[:, df.columns.str.startswith(tuple(relst)) | df.columns.isin(['HYTY'])]
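As a sketch using the column names from the question (data omitted), both the regex route and the startswith route select the same columns; the pattern below is just one way to write it:
import pandas as pd
df = pd.DataFrame(columns=['HYTY', 'ABNH', 'CDKL', 'GHY#UIKI', 'BYUJI##hy', 'BYUJI#tt', 'BBNNII#5', 'FGATAY#J'])
# exact HYTY, or names starting with CDKL / BYUJI / BBNNI
mask_pattrn = '^(HYTY$|CDKL|BYUJI|BBNNI)'
df.filter(regex=mask_pattrn).columns.tolist()
# ['HYTY', 'CDKL', 'BYUJI##hy', 'BYUJI#tt', 'BBNNII#5']
relst = ['CDKL', 'BYUJI', 'BBNNI']
df.loc[:, df.columns.str.startswith(tuple(relst)) | df.columns.isin(['HYTY'])].columns.tolist()
# same columns selected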

How to add an integer-represented column in Pandas dataframe

I need to add an integer-represented column in a pandas dataframe. For example, if I have a dataframe with names and genders as the following:
I would need to add a new column with an integer value depending on the gender. Expected output would be as follows:
df['Gender_code']=df['Gender'].transform(lambda gender: 1 if gender=='Female' else 0)
Explanation: Using transform(), you can apply a function to all values of any column. Here, I applied the function defined using lambda to column 'Gender'
For just two genders you can do a comparison:
df['Gender_code'] = df['Gender'].eq('Female').astype(int)
In the general case, you can resort to factorize:
df['Gender_code'] = df['Gender'].factorize()[0]
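A short sketch with invented names showing both encodings (factorize assigns codes in order of first appearance, so its mapping may differ from the eq-based one):
import pandas as pd
df = pd.DataFrame({'Name': ['Ann', 'Bob', 'Eve'],
                   'Gender': ['Female', 'Male', 'Female']})
df['Gender_code'] = df['Gender'].eq('Female').astype(int)
# Female -> 1, anything else -> 0
df['Gender_code'] = df['Gender'].factorize()[0]
# here Female -> 0, Male -> 1 (first value seen gets code 0)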

Collecting the result of PySpark Dataframe filter into a variable

I am using the PySpark dataframe. My dataset contains three attributes, id, name and address. I am trying to delete the corresponding row based on the name value. What I've been trying is to get the unique id of the row I want to delete:
ID = df.filter(df["name"] == "Bruce").select(df["id"]).collect()
The output I am getting is the following: [Row(id='382')]
I am wondering how I can use the id to delete a row. Also, how can I replace a certain value in a dataframe with another? For example, replacing all values == "Bruce" with "John".
From the docs for pyspark.sql.DataFrame.collect(), the function:
Returns all the records as a list of Row.
The fields in a pyspark.sql.Row can be accessed like dictionary values.
So for your example:
ID = df.filter(df["name"] == "Bruce").select(df["id"]).collect()
#[Row(id='382')]
You can access the id field by doing:
id_vals = [r['id'] for r in ID]
#['382']
But looking up one value at a time is generally a bad use for spark DataFrames. You should think about your end goal, and see if there's a better way to do it.
EDIT
Based on your comments, it seems you want to replace the values in the name column with another value. One way to do this is by using pyspark.sql.functions.when().
This function takes a boolean column expression as the first argument. I am using f.col("name") == "Bruce". The second argument is what should be returned if the boolean expression is True. For this example, I am using f.lit(replacement_value).
For example:
import pyspark.sql.functions as f
replacement_value = "Wayne"
df = df.withColumn(
    "name",
    f.when(f.col("name") == "Bruce", f.lit(replacement_value)).otherwise(f.col("name"))
)
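For the deletion part of the question, one straightforward option (a sketch, not covered in the answer above) is to keep only the rows that do not match, rather than deleting by the collected id:
# keep every row whose name is not "Bruce"
df = df.filter(f.col("name") != "Bruce")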
