I have a DataFrame with a mail id column and a result column.
   Mail id        result
0  xyz#gmail.com  fail
1  xyz#yahoo.com  pass
2  pqr#gmail.com  not attempted
3  tuv#gmail.com  not attempted
4  123#gmail.com  fail
5  ABC#gmail.com  not attempted
From the above data, I need to collect the mail ids into a list, grouped by result.
For example, if the result equals 'fail', the failing mail ids should go into a list called failed; similarly for 'not attempted'.
failed = ['xyz#gmail.com', '123#gmail.com']
not attempted = ['pqr#gmail.com', 'tuv#gmail.com', 'ABC#gmail.com']
You can filter values separately:
failed = df.loc[df['result'].eq('fail'), 'Mail id'].tolist()
notattempted = df.loc[df['result'].eq('not attempted'), 'Mail id'].tolist()
Or create Series with aggregate lists and then select by index:
s = df.groupby('result')['Mail id'].agg(list)
failed = s.loc['fail']
notattempted = s.loc['not attempted']
Or equivalently:
failed = s['fail']
notattempted = s['not attempted']
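Putting both approaches together on the sample data (a minimal sketch; the emails keep the # separator used in the question):

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'Mail id': ['xyz#gmail.com', 'xyz#yahoo.com', 'pqr#gmail.com',
                'tuv#gmail.com', '123#gmail.com', 'ABC#gmail.com'],
    'result': ['fail', 'pass', 'not attempted',
               'not attempted', 'fail', 'not attempted'],
})

# Approach 1: boolean filtering, one call per result value
failed = df.loc[df['result'].eq('fail'), 'Mail id'].tolist()
notattempted = df.loc[df['result'].eq('not attempted'), 'Mail id'].tolist()

# Approach 2: a single groupby that builds one list per result value
s = df.groupby('result')['Mail id'].agg(list)

print(failed)              # ['xyz#gmail.com', '123#gmail.com']
print(s['not attempted'])  # ['pqr#gmail.com', 'tuv#gmail.com', 'ABC#gmail.com']
```

Both approaches give identical lists; the groupby version is handier when every result value is needed at once.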
Please try this.
import pandas as pd
# let df be the original dataframe
result_df = pd.DataFrame()
# get all distinct result values
result_df['idx_str'] = list(set(df['result']))
def separate_by_result(idx_str):
    str_df = df[df['result'] == idx_str]
    return list(str_df['Mail id'])
# show each result's mail ids as one new column
result_df['result_list'] = result_df['idx_str'].apply(separate_by_result)
result_df will be what you want.
I am having trouble filtering data based on multiple conditions.
[dataframe image][1]
[1]: https://i.stack.imgur.com/TN9Nd.png
When I filter it based on multiple conditions, I am getting an empty DataFrame.
user_ID_existing = input("Enter User ID:")
print("Available categories are:\n Vehicle\tGadgets")
user_Category_existing = str(input("Choose from the above category:"))
info = pd.read_excel("Test.xlsx")
data = pd.DataFrame(info)
df = data[((data.ID == user_ID_existing) & (data.Category == user_Category_existing))]
print(df)
If I replace the variables user_ID_existing and user_Category_existing with literal values, I get the rows. I even tried with numpy, but still only get an empty DataFrame:
filtered_values = np.where((data['ID'] == user_ID_existing) & (data['Category'].str.contains(user_Category_existing)))
print(filtered_values)
print(data.loc[filtered_values])
input always returns a string, but since the ID column read by pandas has a numeric dtype, filtering it with a string gives you an empty DataFrame.
You need to use int to convert the value/ID (entered by the user) to a number.
Try this:
user_ID_existing = int(input("Enter User ID:"))
print("Available categories are:\n Vehicle\tGadgets")
user_Category_existing = input("Choose from the above category:")
data = pd.read_excel("Test.xlsx")
df = data[(data["ID"].eq(user_ID_existing))
& (data["Category"].eq(user_Category_existing))].copy()
print(df)
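The dtype mismatch can be reproduced without Excel (a sketch where an in-memory frame stands in for Test.xlsx):

```python
import pandas as pd

# stand-in for pd.read_excel("Test.xlsx"); ID is a numeric column
data = pd.DataFrame({'ID': [101, 102], 'Category': ['Vehicle', 'Gadgets']})

# comparing the numeric column to a string finds nothing...
assert data[data['ID'] == '101'].empty

# ...while comparing to an int matches as expected
print(data[data['ID'] == 101])
```

This is exactly why wrapping input() in int() fixes the filter.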
I'm trying to use a reference table to take action based on the headers of a file I am importing into a dataframe.
print(df_reference_table)
file_type  source  col_name1      col_name2       col_name3  ...
Status     G081    TAIL           MDS             BASE       ...
LIMS-EV            Serial Number  Mission Design  Location   ...
IMDS               ACFT           Designator      CMD        ...
print(df_import_table.columns.values)
['TAIL' 'MDS' 'BASE']
cols_in = df_import_table.columns.values
I'm looking for something that will return [Status, G081]; the script would then add/delete/rename columns as needed so they match. My source documents have different numbers of columns, and I have no control over the format, names, or length before the data gets to me.
I've tried the following:
In:
t = df_import_table.columns.values
df_reference_table.loc[t]
Out:
KeyError:['TAIL' 'MDS' 'BASE'] not in index
In:
l = list(df_import_table.columns.values)
df_reference_table.loc[l]
Out:
KeyError:['TAIL' 'MDS' 'BASE'] not in index
In:
t = df_import_table.columns.values
df_reference_table.index[df_reference_table.columns == t].tolist()
Out:
ValueError: Lengths must match to compare
Basically, I want to do the reverse of -
df_format.loc['Status','G081'].tolist()
Use a boolean mask:
# Set 'file_type' and 'source' as index if it's not already the case
df_reference_table = df_reference_table.set_index(['file_type', 'source'])
cols = df_import_table.columns.tolist()
mask = df_reference_table.eq(cols).any(axis=1)
print(df_reference_table[mask].index.to_flat_index()[0])
# Output:
('Status', 'G081')
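A self-contained sketch of that lookup, assuming a small version of the reference table (the 'source' values for the LIMS-EV and IMDS rows are hypothetical placeholders, since they are blank in the question's image):

```python
import pandas as pd

df_reference_table = pd.DataFrame({
    'file_type': ['Status', 'LIMS-EV', 'IMDS'],
    'source': ['G081', 'src2', 'src3'],  # last two are hypothetical
    'col_name1': ['TAIL', 'Serial Number', 'ACFT'],
    'col_name2': ['MDS', 'Mission Design', 'Designator'],
    'col_name3': ['BASE', 'Location', 'CMD'],
})
df_import_table = pd.DataFrame(columns=['TAIL', 'MDS', 'BASE'])

df_reference_table = df_reference_table.set_index(['file_type', 'source'])
cols = df_import_table.columns.tolist()

# a row matches when its col_name entries equal the import columns
mask = df_reference_table.eq(cols).any(axis=1)
result = df_reference_table[mask].index.to_flat_index()[0]
print(result)  # ('Status', 'G081')
```

Note that .eq(cols) compares positionally, so it assumes the reference table has exactly as many col_name columns as the import file has columns; with ragged widths, .isin(cols) would be a more forgiving test.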
I have 4 dataframes for 4 newspapers (newspaper1, newspaper2, newspaper3, newspaper4), each with a single column for the author name.
Now I'd like to merge these 4 dataframes into one with 5 columns: author, plus newspaper1, newspaper2, newspaper3, newspaper4, which contain 1/0 values (1 when the author writes for that newspaper).
import pandas as pd
listOfMedia =[newspaper1,newspaper2,newspaper3,newspaper4]
merged = pd.DataFrame(columns=['author', 'newspaper1', 'newspaper2', 'newspaper3', 'newspaper4'])
while this loop does what I intended (fills the merged df author columns with the name):
for item in listOfMedia:
    merged.author = item.author
I can't figure out how to fill the newspapers columns with the 1/0 values...
for item in listOfMedia:
    if item == newspaper1:
        merged['newspaper1'] = '1'
    elif item == newspaper2:
        merged['newspaper2'] = '1'
    elif item == newspaper3:
        merged['newspaper3'] = '1'
    else:
        merged['newspaper4'] = '1'
I keep getting this error:
During handling of the above exception, another exception occurred:
TypeError: attrib() got an unexpected keyword argument 'convert'
I tried to google that error, but it didn't help me identify the problem.
What am I missing here? I also think there must be a smarter way to fill the newspaper/author matrix, but I can't even get this simple way to work. I am using a Jupyter notebook.
Actually, you are setting all rows to 1, so use:
for col in merged.columns:
    merged[col].values[:] = 1
I've taken a guess at what I think your dataframes look like.
newspaper1 = pd.DataFrame({'author': ['author1', 'author2', 'author3']})
newspaper2 = pd.DataFrame({'author': ['author1', 'author2', 'author4']})
newspaper3 = pd.DataFrame({'author': ['author1', 'author2', 'author5']})
newspaper4 = pd.DataFrame({'author': ['author1', 'author2', 'author6']})
Firstly we will copy the dataframes so we don't affect the originals:
newspaper1_temp = newspaper1.copy()
newspaper2_temp = newspaper2.copy()
newspaper3_temp = newspaper3.copy()
newspaper4_temp = newspaper4.copy()
Next we replace the index of each dataframe with the author name:
newspaper1_temp.index = newspaper1['author']
newspaper2_temp.index = newspaper2['author']
newspaper3_temp.index = newspaper3['author']
newspaper4_temp.index = newspaper4['author']
Then we concatenate these dataframes (matching them together by the index we set):
merged = pd.concat([newspaper1_temp, newspaper2_temp, newspaper3_temp, newspaper4_temp], axis=1)
merged.columns = ['newspaper1', 'newspaper2', 'newspaper3', 'newspaper4']
And finally we replace NaNs with 0, then set the non-zero entries (which still hold the author names) to 1:
merged = merged.fillna(0)
merged[merged != 0] = 1
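The same steps can be collapsed into one loop over the frames (a sketch using the same guessed dataframes):

```python
import pandas as pd

papers = {
    'newspaper1': pd.DataFrame({'author': ['author1', 'author2', 'author3']}),
    'newspaper2': pd.DataFrame({'author': ['author1', 'author2', 'author4']}),
    'newspaper3': pd.DataFrame({'author': ['author1', 'author2', 'author5']}),
    'newspaper4': pd.DataFrame({'author': ['author1', 'author2', 'author6']}),
}

# index each frame by author, then align them side by side
merged = pd.concat(
    [df.set_index(df['author']) for df in papers.values()], axis=1
)
merged.columns = list(papers)

# NaN -> 0, any surviving author name -> 1
merged = merged.fillna(0)
merged[merged != 0] = 1
print(merged)
```

set_index(df['author']) plays the role of the *_temp copies above: it re-indexes without mutating the original frames.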
I have 2 dataframes:
df1 is a list of mailboxes and email ids
df2 shows a list of approved domains
I read both the dataframes from an excel sheet
xls = pd.ExcelFile(input_file_shared_mailbox)
df = pd.read_excel(xls, sheet_name = sheet_name_shared_mailbox)
I want to keep only the records in df1 where df1['Email_Id'] contains one of the values in df2['approved_domain'].
print(df1)
Mailbox Email_Id
0 mailbox1 abc#gmail.com
1 mailbox2 def#yahoo.com
2 mailbox3 ghi#msn.com
print(df2)
approved_domain
0 msn.com
1 gmail.com
And I want df3, which shows:
print (df3)
Mailbox Email_Id
0 mailbox1 abc#gmail.com
1 mailbox3 ghi#msn.com
This is the code I have right now, which I think is close, but I can't figure out the exact problem in the syntax:
df3 = df1[df1['Email_Id'].apply(lambda x: [item for item in x if item in df2['Approved_Domains'].tolist()])]
But I get this error:
TypeError: unhashable type: 'list'
I spent a lot of time searching the forum for a solution but could not find what I was looking for. I appreciate all the help.
These are the steps to follow for your two data frames:
1. Split your email_address column into two separate columns:
df1['add'], df1['domain'] = df1['email_address'].str.split('#', 1).str
2. Drop the add column to keep your data frame clean:
df1 = df1.drop('add', axis=1)
3. Get a new data frame keeping only the rows whose 'domain' value appears in the 'approved_domain' column:
df_new = df1[df1['domain'].isin(df2['approved_domain'])]
4. Drop the 'domain' column in df_new:
df_new = df_new.drop('domain', axis=1)
This is what the result will be:
    mailbox  email_address
0  mailbox1  abc#gmail.com
2  mailbox3  ghi#msn.com
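A runnable version of the split-and-filter idea, keeping only rows whose domain is approved (a sketch; the emails keep the # separator from the question's data):

```python
import pandas as pd

df1 = pd.DataFrame({
    'mailbox': ['mailbox1', 'mailbox2', 'mailbox3'],
    'email_address': ['abc#gmail.com', 'def#yahoo.com', 'ghi#msn.com'],
})
df2 = pd.DataFrame({'approved_domain': ['msn.com', 'gmail.com']})

# take the part after the separator as the domain
df1['domain'] = df1['email_address'].str.split('#', n=1).str[1]

# keep approved rows, then drop the helper column
df_new = df1[df1['domain'].isin(df2['approved_domain'])].drop(columns='domain')
print(df_new)
```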
You can use a dynamically created regular expression to search for the valid domains in the list and filter out everything else.
Here is the code for your reference:
# -*- coding: utf-8 -*-
import pandas as pd
import re
mailbox_list = [
    ['mailbox1', 'abc#gmail.com'],
    ['mailbox2', 'def#yahoo.com'],
    ['mailbox3', 'ghi#msn.com'],
]
valid_domains = ['msn.com', 'gmail.com']
df1 = pd.DataFrame(mailbox_list, columns=['Mailbox', 'EmailID'])
df2 = pd.DataFrame(valid_domains)
valid_list = []
for index, row in df1.iterrows():
    for idx, record in df2.iterrows():
        if re.search(rf"#{record[0]}", row[1], re.IGNORECASE):
            valid_list.append([row[0], row[1]])
df3 = pd.DataFrame(valid_list, columns=['Mailbox', 'EmailID'])
print(df3)
The output of this is:
Mailbox EmailID
0 mailbox1 abc#gmail.com
1 mailbox3 ghi#msn.com
Solution
df1 = {'MailBox': ['mailbox1', 'mailbox2', 'mailbox3'], 'Email_Id': ['abc#gmail.com', 'def#yahoo.com', 'ghi#msn.com']}
df2 = {'approved_domain':['msn.com', 'gmail.com']}
mailboxes, emails = zip(  # unzip the columns
    *filter(  # filter the rows
        lambda i: any([  # i = ('mailbox1', 'abc#gmail.com')
            approved_domain in i[1] for approved_domain in df2['approved_domain']
        ]),
        zip(df1['MailBox'], df1['Email_Id'])  # zip the columns
    )
)
df3 = {
    'MailBox': mailboxes,
    'Email_Id': emails
}
print(df3)
Output:
> {'MailBox': ('mailbox1', 'mailbox3'), 'Email_Id': ('abc#gmail.com', 'ghi#msn.com')}
Some notes:
A big chunk of this code is just parsing the data structure: the zipping and unzipping only convert the list of columns to a list of rows and back. If you already have a list of rows, you only need the filtering part.
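If the columns live in an actual DataFrame rather than plain dicts, the same membership test can be written in one vectorized line with str.contains (a sketch; like the solution above, it matches the domain anywhere in the address):

```python
import re
import pandas as pd

df1 = pd.DataFrame({
    'MailBox': ['mailbox1', 'mailbox2', 'mailbox3'],
    'Email_Id': ['abc#gmail.com', 'def#yahoo.com', 'ghi#msn.com'],
})
approved = ['msn.com', 'gmail.com']

# one regex that matches any approved domain; re.escape keeps dots literal
pattern = '|'.join(re.escape(d) for d in approved)
df3 = df1[df1['Email_Id'].str.contains(pattern)]
print(df3)
```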
I have a clickstream data frame with about 4 million rows. It has many columns, two of which are Domain and URL. I have a dictionary and want to use it as a condition: for example, if the domain equals amazon.de and the URL contains the keyword pillow, then the new column should get the value pillow, and so on.
dictionary_keywords = {"amazon.de": "pillow", "rewe.de": "apple"}
ID  Domain     URL
1   amazon.de  www.amazon.de/ssssssss/exapmle/pillow
2   rewe.de    www.rewe.de/apple
The expected output should be the new column:
ID  Domain     URL                                    New_Col
1   amazon.de  www.amazon.de/ssssssss/exapmle/pillow  pillow
2   rewe.de    www.rewe.de/apple                      apple
I can manually use the .str.contains method, but I need to define a function which takes the dictionary key and value as the condition. Something like this:
df[df['domain'] == 'amazon.de'] & df[df['url'].str.contains('pillow')]
But I am not sure; I am new to this.
The way I prefer to solve this kind of problem is by using df.apply() by row (axis=1) with a custom function to deal with the logic.
import pandas as pd
dictionary_keywords = {"amazon.de": "Pillow", "rewe.de": "Apple"}
df = pd.DataFrame({
    'Domain': ['amazon.de', 'rewe.de'],
    'URL': ['www.amazon.de/ssssssss/exapmle/pillow', 'www.rewe.de/apple']
})
def f(row):
    global dictionary_keywords
    try:
        url = row['URL'].lower()
        domain = url.split('/')[0].strip('www.')
        if dictionary_keywords[domain].lower() in url:
            return dictionary_keywords[domain]
    except Exception as e:
        print(row.name, e)
    return None  # or False, or np.nan
df['New_Col'] = df.apply(f, axis=1)
Output:
print(df)
Domain URL New_Col
0 amazon.de www.amazon.de/ssssssss/exapmle/pillow Pillow
1 rewe.de www.rewe.de/apple Apple
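Since the frame already has a Domain column, the lookup can also be done without apply by mapping that column against the dictionary (a sketch over the same sample frame; this avoids re-deriving the domain from the URL):

```python
import pandas as pd

dictionary_keywords = {"amazon.de": "Pillow", "rewe.de": "Apple"}
df = pd.DataFrame({
    'Domain': ['amazon.de', 'rewe.de'],
    'URL': ['www.amazon.de/ssssssss/exapmle/pillow', 'www.rewe.de/apple'],
})

# map each row's Domain to its keyword (NaN when the domain is unknown)
keyword = df['Domain'].map(dictionary_keywords)

# keep the keyword only where the URL actually contains it (case-insensitive)
contains = [pd.notna(k) and k.lower() in u.lower()
            for k, u in zip(keyword, df['URL'])]
df['New_Col'] = keyword.where(contains)
print(df)
```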