Python - keep rows in dataframe based on partial string match

I have 2 dataframes:
df1 is a list of mailboxes and email ids
df2 shows a list of approved domains
I read both dataframes from an Excel sheet:
xls = pd.ExcelFile(input_file_shared_mailbox)
df = pd.read_excel(xls, sheet_name=sheet_name_shared_mailbox)
I want to keep only the records in df1 where df1['Email_Id'] contains one of the values in df2['approved_domain'].
print(df1)
Mailbox Email_Id
0 mailbox1 abc#gmail.com
1 mailbox2 def#yahoo.com
2 mailbox3 ghi#msn.com
print(df2)
approved_domain
0 msn.com
1 gmail.com
and i want df3 which basically shows
print (df3)
Mailbox Email_Id
0 mailbox1 abc#gmail.com
1 mailbox3 ghi#msn.com
This is the code I have right now, which I think is close, but I can't figure out the exact problem in the syntax:
df3 = df1[df1['Email_Id'].apply(lambda x: [item for item in x if item in df2['Approved_Domains'].tolist()])]
But I get this error:
TypeError: unhashable type: 'list'
I spent a lot of time researching the forum for a solution but could not find what I was looking for. I appreciate all the help.

These are the steps you will need to follow to do what you want with your two data frames:
1. Split your Email_Id column into two separate columns
df1['add'], df1['domain'] = df1['Email_Id'].str.split('#', 1).str
2. Then drop the add column to keep your data frame clean
df1 = df1.drop('add', axis=1)
3. Get a new data frame with only the rows you want by keeping only the rows whose 'domain' value appears in the 'approved_domain' column
df_new = df1[df1['domain'].isin(df2['approved_domain'])]
4. Drop the 'domain' column in df_new
df_new = df_new.drop('domain', axis=1)
This is what the result will be
    Mailbox       Email_Id
0  mailbox1  abc#gmail.com
2  mailbox3    ghi#msn.com
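For reference, a more compact sketch of the same idea that skips the helper columns (assuming the column names Email_Id and approved_domain from the question):
import re

# one regex that matches any approved domain at the end of the address
pattern = '|'.join(re.escape(d) + '$' for d in df2['approved_domain'])
df3 = df1[df1['Email_Id'].str.contains(pattern, case=False, regex=True)]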

You can use a dynamically created regular expression to search for the valid domains in the list and filter on them.
Here is the code for reference.
# -*- coding: utf-8 -*-
import pandas as pd
import re

mailbox_list = [
    ['mailbox1', 'abc#gmail.com'],
    ['mailbox2', 'def#yahoo.com'],
    ['mailbox3', 'ghi#msn.com']]
valid_domains = ['msn.com', 'gmail.com']

df1 = pd.DataFrame(mailbox_list, columns=['Mailbox', 'EmailID'])
df2 = pd.DataFrame(valid_domains)

valid_list = []
for index, row in df1.iterrows():
    for idx, record in df2.iterrows():
        # re.escape stops the dot in the domain from matching any character
        if re.search(rf"#{re.escape(record[0])}", row['EmailID'], re.IGNORECASE):
            valid_list.append([row['Mailbox'], row['EmailID']])

df3 = pd.DataFrame(valid_list, columns=['Mailbox', 'EmailID'])
print(df3)
The output of this is:
Mailbox EmailID
0 mailbox1 abc#gmail.com
1 mailbox3 ghi#msn.com

Solution
df1 = {'MailBox': ['mailbox1', 'mailbox2', 'mailbox3'], 'Email_Id': ['abc#gmail.com', 'def#yahoo.com', 'ghi#msn.com']}
df2 = {'approved_domain': ['msn.com', 'gmail.com']}
mailboxes, emails = zip(  # unzip the columns
    *filter(  # filter
        lambda i: any([  # i = ('mailbox1', 'abc#gmail.com')
            approved_domain in i[1] for approved_domain in df2['approved_domain']
        ]),
        zip(df1['MailBox'], df1['Email_Id'])  # zip the columns
    )
)
df3 = {
    'MailBox': mailboxes,
    'Email_Id': emails
}
print(df3)
Output:
> {'MailBox': ('mailbox1', 'mailbox3'), 'Email_Id': ('abc#gmail.com', 'ghi#msn.com')}
Some notes:
A big chunk of this code is just for reshaping the data structure. The zipping and unzipping are only there to convert the list of columns to a list of rows and back. If you already have a list of rows, you only need the filtering part.
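If you want the result back as a pandas DataFrame rather than a plain dict, the filtered columns can be wrapped directly (a small sketch, assuming pandas is imported as pd):
df3 = pd.DataFrame({'MailBox': mailboxes, 'Email_Id': emails})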

Related

Concatenate specific columns in pandas

I'm trying to concatenate 4 different datasets with pandas. I can concatenate them, but it results in several columns with the same name. How do I produce only one column per name, instead of multiples?
concatenated_dataframes = pd.concat(
    [
        dice.reset_index(drop=True),
        json.reset_index(drop=True),
        flexjobs.reset_index(drop=True),
        indeed.reset_index(drop=True),
        simply.reset_index(drop=True),
    ],
    axis=1,
    ignore_index=True,
)
concatenated_dataframes_columns = [
    list(dice.columns),
    list(json.columns),
    list(flexjobs.columns),
    list(indeed.columns),
    list(simply.columns)
]
flatten = lambda nested_lists: [item for sublist in nested_lists for item in sublist]
concatenated_dataframes.columns = flatten(concatenated_dataframes_columns)
df = concatenated_dataframes
This results in
UNNAMED: 0 TITLE COMPANY DESCRIPTION LOCATION TITLE JOBLOCATION POSTEDDATE DETAILSPAGEURL COMPANYPAGEURL COMPANYLOGOURL SALARY CLIENTBRANDID COMPANYNAME EMPLOYMENTTYPE SUMMARY SCORE EASYAPPLY EMPLOYERTYPE WORKFROMHOMEAVAILABILITY ISREMOTE UNNAMED: 0 TITLE SALARY JOBTYPE LOCATION DESCRIPTION UNNAMED: 0 TITLE SALARY JOBTYPE DESCRIPTION LOCATION UNNAMED: 0 COMPANY DESCRIPTION LOCATION SALARY TITLE
Again, how do I combine all the 'TITLE' values into one column, all the 'LOCATION' values into one column, and so on, instead of having multiples of them?
I think we can get away with making a blank dataframe that just has the columns we will want at the end and then concat() everything onto it.
import numpy as np
import pandas as pd
all_columns = list(dice.columns) + list(json.columns) + list(flexjobs.columns) + list(indeed.columns) + list(simply.columns)
all_unique_columns = np.unique(np.array(all_columns)) # the unique set of column names; you can run print(all_unique_columns) to make sure it has what you want
df = pd.DataFrame(columns=all_unique_columns) # blank dataframe with just the final set of columns
df = pd.concat([df, dice, json, flexjobs, indeed, simply], axis=0) # rows align by column name, so duplicate names collapse into one column
It's a little tricky not having reproducible examples of the dataframes that you have. I tested this on a small mock-up example I put together, but let me know if it works for your more complex example.
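For illustration, here is a tiny self-contained sketch (with hypothetical frames standing in for dice, json, etc.) showing how concat along axis=0 lines up same-named columns and fills the gaps with NaN:
import pandas as pd

a = pd.DataFrame({'TITLE': ['dev'], 'COMPANY': ['Acme']})
b = pd.DataFrame({'TITLE': ['analyst'], 'LOCATION': ['NYC']})

combined = pd.concat([a, b], axis=0, ignore_index=True)
print(combined)
#      TITLE COMPANY LOCATION
# 0      dev    Acme      NaN
# 1  analyst     NaN      NYC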

How to use the filtered data in Pandas?

I am new to Pandas. Below is part of my code. I am trying to use df_filtered, which is the data filtered to rows where the codenum column equals AB123. However, from Line 14 onwards, if I use df_filtered instead of excel_data_df, it gives no results. The desired columns are picked correctly, but the value filtering (codenum should equal AB123) is not happening, and I get the entire Excel sheet converted to JSON with the chosen columns. Please help me understand how to use the df_filtered data from Line 14 onwards.
PathLink = os.path.join(path, 'test' + '.json') #Path Name Formation
excel_data_df = pandas.read_excel('input.xlsx',
                                  sheet_name='input_sheet1',
                                  usecols=[4, 5, 6, 18, 19], #index starts from 0
                                  names=['codenum', 'name',
                                         'class',
                                         'school',
                                         'city'],
                                  dtype={'codenum': str,
                                         'name': str,
                                         'school': str,
                                         'city': str}) # Excel Read and Column Filtering
df_filtered = excel_data_df.loc[lambda x: x['codenum'] == 'AB123'] # Row Filtering -- need to use this further
excel_data_df.columns = ['Code', 'Student Name', 'Class', 'School Name', 'City Name'] #renaming columns -- Line Num 14
cols_to_keep = ['Student Name', 'Class', 'School Name', 'City Name'] # columns to keep
excel_data_df = excel_data_df[cols_to_keep] # columns to keep
excel_data_df # columns to keep
json_str = excel_data_df.to_json(PathLink, orient='records', indent=2) #json converted file
First, a small tip: you can get rid of the lambda by doing
df_filtered = excel_data_df.loc[excel_data_df["codenum"] == "AB123"]
Afterwards, as pointed out in the comments, make sure that it contains samples after the filtering:
df_filtered = excel_data_df.loc[excel_data_df["codenum"] == "AB123"]
if df_filtered.shape[0]: #it contains samples
    cols_to_keep = ['Student Name', 'Class', 'School Name', 'City Name'] # columns to keep
    excel_data_df = excel_data_df[cols_to_keep] # columns to keep
    excel_data_df # columns to keep
    json_str = excel_data_df.to_json(PathLink, orient='records', indent=2) #json converted file
else: #it does not contain any samples i.e. empty dataframe
    print("Filtered data does not contain data")
Try the following code:
df_filtered = excel_data_df[excel_data_df['codenum'] == 'AB123']
If it is still not working, then "codenum" may not contain the value you are trying to filter on.
Thanks all for your inputs. Initially it was returning an empty dataframe, as suggested in the answers and comments above. Posting the edited working code based on your inputs for anyone's future reference.
PathLink = os.path.join(path, 'test' + '.json') #Path Name Formation
excel_data_df = pandas.read_excel('input.xlsx',
                                  sheet_name='input_sheet1',
                                  usecols=[3, 5, 6, 18, 19], #index starts from 0 ## edit 1: corrected index to right column index
                                  names=['codenum', 'name',
                                         'class',
                                         'school',
                                         'city'],
                                  dtype={'codenum': str,
                                         'name': str,
                                         'school': str,
                                         'city': str}) # Excel Read and Column Filtering
print(excel_data_df['codenum'].unique()) ##edit 1: returns unique values including AB123
df_filtered = excel_data_df.loc[excel_data_df["codenum"] == 'AB123'] # Row Filtering ##edit 1
print(df_filtered) ##edit 1 - to check if expected results are present in filtered data
df_filtered.columns = ['Code', 'Student Name', 'Class', 'School Name', 'City Name'] #renaming columns
if df_filtered.shape[0]: #it contains samples ## edit 1
    cols_to_keep = ['Student Name', 'Class', 'School Name', 'City Name'] # columns to keep
    df_filtered = df_filtered[cols_to_keep] # columns to keep
    df_filtered # columns to keep
    json_str = df_filtered.to_json(PathLink, orient='records', indent=2) #json converted file
else: #it does not contain any samples i.e. empty dataframe ##edit 1
    print("Filtered data does not contain data")
Pandas df.loc returns the filtered result. In your code you tried to use df.loc to set up a filter, but df.loc does not modify the original dataframe in place; it returns a new, filtered dataframe, which you then need to use.
See the example: df.loc returns the filtered rows from the original df.
import pandas as pd
df = pd.DataFrame([[1, "AB123"], [4, "BC123"], [7, "CD123"]],columns=['A', 'B'])
print(df)
# A B
#0 1 AB123
#1 4 BC123
#2 7 CD123
print(df.loc[lambda x: x["B"] == "AB123"])
# A B
#0 1 AB123

Pandas DataFrame adding more columns to .csv when editing certain values

I am using a Pandas DataFrame to store some information in my code.
Initial State of csv:
...............
ID,Name
...............
Adding Data into dataframe:
name_desc = {"ID": 23523223, "Name": BlahBlah}
df = df.append(name_desc, ignore_index=True)
This was my panda dataframe upon creating the database:
....................
,ID,Name
0,23523223,BlahBlah
....................
Below is my code that searches through the ID column to locate the row with the stated ID (name_desc["ID"]).
df.loc[df["ID"] == name_desc["ID"], "Name"] = name_desc["Name"]
The problem I encountered was after I have edited the name, I get a resultant db that looks like:
................................
Unnamed: 0 ID Name
0 0 23523223 BlahBlah
................................
If I continously execute:
df.loc[df["ID"] == name_desc["ID"], "Name"] = name_desc["Name"]
I get this db:
..................................
,Unnamed: 0,Unnamed: 0.1,ID,Name
0,0,0,235283335,Dinese
..................................
I can't figure out why I have extra columns being added in the front of my database as I make edits.
I think you have a problem that is related to the df creation. The example you provided here does not return what you are showing:
BlahBlah = 'foo'
name_desc = {"ID": 23523223, "Name": BlahBlah}
df = pd.DataFrame(data=name_desc, index=[0])
print(df.columns) # it returns an Index(['ID', 'Name'], dtype='object')
print(len(df.columns)) # it returns 2, the number of your df columns
If you can, try to find which instruction adds the extra column in your code. Otherwise, you can remove the column whose name is '' using drop. inplace=True is used to actually modify the dataframe; if inplace is not passed, drop just returns a modified copy without touching the original:
df.drop(columns=[''], inplace=True)
Finally, here is the full example. My assumption is that your df is somehow created with the empty column at the beginning, so I also add it to the dictionary:
BlahBlah = 'foo'
name_desc = {'':'',"ID": 23523223, "Name": BlahBlah} # I added an empty column
df = pd.DataFrame(data=name_desc, index = [0])
print(df.columns) # Index(['', 'ID', 'Name'], dtype='object')
df.drop(columns = [''],inplace = True)
df.loc[df["ID"] == name_desc["ID"], "Name"] = name_desc["Name"]
print(df.columns) # Index(['ID', 'Name'], dtype='object')
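As a side note, if the dataframe is being written back to the .csv and read again between edits (that read/write code is not shown in the question, so this is only an assumption), the extra Unnamed: 0 columns are typically the old index being saved as a data column on every write. A small sketch of how to avoid that:
import pandas as pd

# Hypothetical round-trip; 'names.csv' stands in for the real file.
df = pd.DataFrame({"ID": [23523223], "Name": ["BlahBlah"]})

df.to_csv('names.csv', index=False)   # don't write the index as a data column
df = pd.read_csv('names.csv')         # reads back without an 'Unnamed: 0' column

# Alternatively, if an existing file already contains the index column:
# df = pd.read_csv('names.csv', index_col=0)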

First row to header with pandas

I have the following pandas dataframe df:
import pandas as pd
from io import StringIO
s = '''\
"Unnamed: 0","Unnamed: 1"
Objet,"Unités vendues"
Chaise,3
Table,2
Tabouret,1
'''
df = pd.read_csv(StringIO(s))
which looks like:
Unnamed: 0 Unnamed: 1
0 Objet Unités vendues
1 Chaise 3
2 Table 2
3 Tabouret 1
My goal is to make the first row the header.
I use:
headers = df.iloc[0]
df.columns = [headers]
However, a "0" appears as the column index name (which is normal, because this 0 is the index label of the first row that was used as the header).
0 Objet Unités vendues
1 Chaise 3
2 Table 2
I tried to delete it in many ways, but nothing worked:
neither del df.index.name from this post,
nor df.columns.name = None from this post or this one (which is the same situation).
How can I get this expected output:
Objet Unités vendues
1 Chaise 3
2 Table 2
What about defining that when you load your table in the first place?
pd.read_csv('filename', header=1)
Otherwise I guess you can just do this:
df.drop('0', axis=1)
What worked for me.
Replace:
headers = df.iloc[0]
df.columns = [headers]
with:
headers = df.iloc[0].values
df.columns = headers
df.drop(index=0, axis=0, inplace=True)
Using .values returns the values from the row Series as a plain array, which does not carry the index/name information along.
Reassigning the column headers then works as expected, without the 0.
Row 0 still exists so it should be removed with df.drop.
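An equivalent, slightly more compact sketch (using the same df from the question) that promotes the first row to the header and drops it in one chain; df.index[0] is the label of the old header row, so it is removed from the data:
df = df.rename(columns=df.iloc[0]).drop(df.index[0])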
With my data in U and my column names in Un, I came up with this algorithm.
If you can shorten it, please do so.
U = pd.read_csv('U.csv', header=None)
Un = pd.read_csv('namesU.csv', header=None).T # Read your names csv, in my case they are in one column
Un = pd.concat([Un, U]) # append the data U to the names (DataFrame.append is deprecated)
Un.reset_index(inplace=True, drop=True) # reset the index and drop the old one, so you don't have duplicated indices
Un.columns = Un.iloc[0] # take the names from the first row
Un.drop(index=0, inplace=True) # drop the first row
Un.reset_index(inplace=True, drop=True) # return the index counter to start from 0
Another option:
Un = pd.read_csv('namesU.csv', header=None) # Read your names csv, in my case they are in one column
Un = list(Un[0])
Un = pd.DataFrame(U.values, columns=Un)
Using the skiprows parameter did the job for me: i.e. skiprows=N
where N = the number of rows to skip (in the above example, 1), so:
df = pd.read_csv('filename', skiprows=1)
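Applied to the StringIO example from the question, a quick sketch to confirm the idea (skiprows=1 and header=1 both work here, because only the first line needs to be skipped):
import pandas as pd
from io import StringIO

s = '''\
"Unnamed: 0","Unnamed: 1"
Objet,"Unités vendues"
Chaise,3
Table,2
Tabouret,1
'''

df = pd.read_csv(StringIO(s), skiprows=1)   # or: header=1
print(df.columns.tolist())                  # ['Objet', 'Unités vendues']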

Parsing JSON in Pandas

I need to extract the following json:
{"PhysicalDisks":[{"Status":"SMART Passed","Name":"/dev/sda"}]}
{"PhysicalDisks":[{"Status":"SMART Passed","Name":"/dev/sda"},{"Status":"SMART Passed","Name":"/dev/sdb"}]}
{"PhysicalDisks":[{"Status":"SMART Passed","Name":"/dev/sda"},{"Status":"SMART Passed","Name":"/dev/sdb"}]}
{"PhysicalDisks":[{"Name":"disk0","Status":"Passed"},{"Name":"disk1","Status":"Passed"}]}
{"PhysicalDisks":[{"Name":"disk0","Status":"Failed"},{"Name":"disk1","Status":"not supported"}]}
{"PhysicalDisks":[{"Name":"disk0","Status":"Passed"}]}
Name: raw_results, dtype: object
Into separate columns. I don't know how many disks per result there might be in future. What would be the best way here?
I tried the following:
d = raw_res['raw_results'].map(json.loads).apply(pd.Series).add_prefix('raw_results.')
Gives me the parsed JSON, but not split out the way I want. A better way would be to add each disk check as an additional row in the dataframe, with the same checkid as the row it was extracted from. So for 3 disks in a result it would generate 3 rows, one per disk.
UPDATE
This code
# This works
import numpy as np

dfs = []

def json_to_df(row, json_col):
    json_df = pd.read_json(row[json_col])
    dfs.append(json_df.assign(**row.drop(json_col)))

df['raw_results'].replace("{}", np.nan, inplace=True)  # pd.np is removed in newer pandas
df = df.dropna()
df.apply(json_to_df, axis=1, json_col='raw_results')
df = pd.concat(dfs)
df.head()
This adds an extra row for each disk (sda, sdb, etc.).
So now I need to split this column into two: Status and Name.
df1 = df["PhysicalDisks"].apply(pd.Series)
df_final = pd.concat([df, df1], axis = 1).drop('PhysicalDisks', axis = 1)
df_final.head()
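For reference, a sketch of an alternative that skips the per-row apply by using explode and json_normalize (the checkid column and the sample strings below are assumptions based on the question):
import json
import pandas as pd

# Hypothetical input frame shaped like the one described in the question
raw_res = pd.DataFrame({
    'checkid': [1, 2],
    'raw_results': [
        '{"PhysicalDisks":[{"Status":"SMART Passed","Name":"/dev/sda"}]}',
        '{"PhysicalDisks":[{"Name":"disk0","Status":"Passed"},{"Name":"disk1","Status":"Passed"}]}',
    ],
})

# One row per disk, keeping checkid from the original row
exploded = (
    raw_res
    .assign(PhysicalDisks=raw_res['raw_results'].map(lambda s: json.loads(s)['PhysicalDisks']))
    .drop(columns='raw_results')
    .explode('PhysicalDisks', ignore_index=True)
)

# Expand the per-disk dicts into Status / Name columns
df_final = pd.concat(
    [exploded.drop(columns='PhysicalDisks'),
     pd.json_normalize(exploded['PhysicalDisks'].tolist())],
    axis=1,
)
print(df_final)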
