Pandas DataFrame adding more columns to .csv when editing certain values - python

I am using a pandas DataFrame to store some information for my code.
Initial State of csv:
...............
ID,Name
...............
Adding Data into dataframe:
name_desc = {"ID": 23523223, "Name": BlahBlah}
df = df.append(name_desc, ignore_index=True)
This was my pandas DataFrame upon creating the database:
....................
,ID,Name
0,23523223,BlahBlah
....................
Below is my code that searches through the ID column to locate the row with the stated ID (name_desc["ID"]).
df.loc[df["ID"] == name_desc["ID"], "Name"] = name_desc["Name"]
The problem I encountered was that after I edited the name, I got a resultant db that looks like:
................................
Unnamed: 0 ID Name
0 0 23523223 BlahBlah
................................
If I continuously execute:
df.loc[df["ID"] == name_desc["ID"], "Name"] = name_desc["Name"]
I get this db:
..................................
,Unnamed: 0,Unnamed: 0.1,ID,Name
0,0,0,235283335,Dinese
..................................
I can't figure out why extra columns keep being added at the front of my database as I make edits.

I think your problem is related to the df creation. The example you provided does not return what you are showing:
import pandas as pd

BlahBlah = 'foo'
name_desc = {"ID": 23523223, "Name": BlahBlah}
df = pd.DataFrame(data=name_desc, index=[0])
print(df.columns) # it returns an Index(['ID', 'Name'], dtype='object')
print(len(df.columns)) # it returns 2, the number of your df columns
If you can, find which instruction in your code adds the extra column. Otherwise, you can remove the column named '' using drop. inplace=True actually modifies the dataframe; without it, drop returns a new dataframe and leaves the original untouched:
df.drop(columns=[''], inplace=True)
Finally, here is the full example. My assumption is that your df is somehow created with the empty column at the beginning, so I also add it in the dictionary:
BlahBlah = 'foo'
name_desc = {'': '', "ID": 23523223, "Name": BlahBlah} # I added an empty column
df = pd.DataFrame(data=name_desc, index=[0])
print(df.columns) # Index(['', 'ID', 'Name'], dtype='object')
df.drop(columns=[''], inplace=True)
df.loc[df["ID"] == name_desc["ID"], "Name"] = name_desc["Name"]
print(df.columns) # Index(['ID', 'Name'], dtype='object')
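As a side note (an addition, not part of the original answer): a common way such a column creeps in is a to_csv/read_csv round trip that writes the row index on every save, producing Unnamed: 0, then Unnamed: 0.1, and so on. A minimal sketch, assuming the file is named data.csv:
import pandas as pd

df = pd.DataFrame({"ID": [23523223], "Name": ["BlahBlah"]})
df.to_csv("data.csv") # default index=True writes the index as an unnamed first column
df = pd.read_csv("data.csv")
print(df.columns) # Index(['Unnamed: 0', 'ID', 'Name'], dtype='object')

df.drop(columns=["Unnamed: 0"], inplace=True)
df.to_csv("data.csv", index=False) # index=False keeps the extra column out
print(pd.read_csv("data.csv").columns) # Index(['ID', 'Name'], dtype='object')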

Related

How to use the filtered data in Pandas?

I am new to Pandas. Below is part of my code. I am trying to use df_filtered, which is the filtered data where the codenum column value is AB123. However, from Line 14 onward, if I use df_filtered instead of excel_data_df, it gives no results. The desired columns are getting picked correctly, but the value filtering is not happening: the codenum column value should be AB123, yet I get the entire excel converted to JSON with the chosen columns. Please help me understand how to use the df_filtered data from Line 14 onward.
PathLink = os.path.join(path, 'test' + '.json') # Path Name Formation
excel_data_df = pandas.read_excel('input.xlsx',
                                  sheet_name='input_sheet1',
                                  usecols=[4, 5, 6, 18, 19], # index starts from 0
                                  names=['codenum', 'name', 'class', 'school', 'city'],
                                  dtype={'codenum': str,
                                         'name': str,
                                         'school': str,
                                         'city': str}) # Excel Read and Column Filtering
df_filtered = excel_data_df.loc[lambda x: x['codenum'] == 'AB123'] # Row Filtering -- need to use this further
excel_data_df.columns = ['Code', 'Student Name', 'Class', 'School Name', 'City Name'] # renaming columns -- Line Num 14
cols_to_keep = ['Student Name', 'Class', 'School Name', 'City Name'] # columns to keep
excel_data_df = excel_data_df[cols_to_keep] # columns to keep
excel_data_df # columns to keep
json_str = excel_data_df.to_json(PathLink, orient='records', indent=2) # json converted file
First, a small tip: if you want to get rid of the lambda, you can write the filter without it:
df_filtered = excel_data_df.loc[excel_data_df["codenum"] == "AB123"]
Afterwards, as pointed out in the comments, make sure the dataframe still contains samples after the filtering, and keep working on df_filtered rather than excel_data_df:
df_filtered = excel_data_df.loc[excel_data_df["codenum"] == "AB123"]
if df_filtered.shape[0]: # it contains samples
    df_filtered.columns = ['Code', 'Student Name', 'Class', 'School Name', 'City Name'] # rename before selecting
    cols_to_keep = ['Student Name', 'Class', 'School Name', 'City Name'] # columns to keep
    df_filtered = df_filtered[cols_to_keep] # columns to keep
    json_str = df_filtered.to_json(PathLink, orient='records', indent=2) # json converted file
else: # it does not contain any samples, i.e. an empty dataframe
    print("Filtered data does not contain data")
Try the following code:
df_filtered = excel_data_df[excel_data_df['codenum'] == 'AB123']
If it is still not working, then "codenum" may not contain the value you are trying to filter on.
Thanks all for your inputs. Initially it was returning an empty dataframe, as suggested in the above answers and comments. Posting the edited working code based on your inputs for anyone's future reference.
PathLink = os.path.join(path, 'test' + '.json') # Path Name Formation
excel_data_df = pandas.read_excel('input.xlsx',
                                  sheet_name='input_sheet1',
                                  usecols=[3, 5, 6, 18, 19], # index starts from 0 ## edit 1: corrected index to right column index
                                  names=['codenum', 'name', 'class', 'school', 'city'],
                                  dtype={'codenum': str,
                                         'name': str,
                                         'school': str,
                                         'city': str}) # Excel Read and Column Filtering
print(excel_data_df['codenum'].unique()) ## edit 1: returns unique values including AB123
df_filtered = excel_data_df.loc[excel_data_df["codenum"] == 'AB123'] # Row Filtering ## edit 1
print(df_filtered) ## edit 1: to check if expected results are present in filtered data
df_filtered.columns = ['Code', 'Student Name', 'Class', 'School Name', 'City Name'] # renaming columns
if df_filtered.shape[0]: # it contains samples ## edit 1
    cols_to_keep = ['Student Name', 'Class', 'School Name', 'City Name'] # columns to keep
    df_filtered = df_filtered[cols_to_keep] # columns to keep
    df_filtered # columns to keep
    json_str = df_filtered.to_json(PathLink, orient='records', indent=2) # json converted file
else: # it does not contain any samples, i.e. an empty dataframe ## edit 1
    print("Filtered data does not contain data")
pandas df.loc returns the filtered result; it does not filter the dataframe in place. In your code you treated df.loc as an in-place filter, so you need to assign its result and keep working with that. See the example: df.loc returns the filtered rows from the original df.
import pandas as pd
df = pd.DataFrame([[1, "AB123"], [4, "BC123"], [7, "CD123"]], columns=['A', 'B'])
print(df)
# A B
#0 1 AB123
#1 4 BC123
#2 7 CD123
print(df.loc[lambda x: x["B"] == "AB123"])
# A B
#0 1 AB123
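Continuing the example, to actually use the filtered data afterwards, assign the returned frame to a variable:
df_filtered = df.loc[df["B"] == "AB123"] # keep the filtered rows in a new frame
print(df_filtered)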

Parse JSON in a Pandas DataFrame

I have some data in a pandas DataFrame, but one of the columns contains multi-line JSON. I am trying to parse that JSON out into a separate DataFrame along with the CustomerId. Here you will see my DataFrame...
df
Out[1]:
Id object
CustomerId object
CallInfo object
Within the CallInfo column, the data looks like this...
[{"CallDate":"2021-06-21","CallLength":362},{"CallDate":"2021-06-24","CallLength":402}]
I want to create a new DataFrame called df_norm which contains the CustomerId, CallDate, and CallLength.
I have tried several ways but couldn't find a working solution. Can anyone help me with this?
Mock up code example...
import pandas as pd
import json
Id = [1, 2, 3]
CustomerId = [700001, 700002, 700003]
CallInfo = ['[{"CallDate":"2021-06-21","CallLength":362},{"CallDate":"2021-06-24","CallLength":402}]', '[{"CallDate":"2021-07-09","CallLength":102}]', '[{"CallDate":"2021-07-11","CallLength":226},{"CallDate":"2021-07-11","CallLength":216}]']
# Reconstruct sample DataFrame
df = pd.DataFrame({
    "Id": Id,
    "CustomerId": CustomerId,
    "CallInfo": CallInfo
})
print(df)
This should work. Create a new list of rows and then toss that into the pd.DataFrame constructor:
new_rows = [{
    'Id': row['Id'],
    'CustomerId': row['CustomerId'],
    'CallDate': item['CallDate'],
    'CallLength': item['CallLength']}
    for _, row in df.iterrows() for item in json.loads(row['CallInfo'])]
new_df = pd.DataFrame(new_rows)
print(new_df)
EDIT: to account for None values in the CallInfo column, while still emitting one row per call:
new_rows = []
for _, row in df.iterrows():
    if row['CallInfo'] is not None: # or additional checks, e.g. == "" or something...
        for item in json.loads(row['CallInfo']):
            new_rows.append({
                'Id': row['Id'],
                'CustomerId': row['CustomerId'],
                'CallDate': item['CallDate'],
                'CallLength': item['CallLength']})
    else: # keep the customer, with empty call fields
        new_rows.append({
            'Id': row['Id'],
            'CustomerId': row['CustomerId'],
            'CallDate': None,
            'CallLength': None})
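For reference, an alternative using pandas built-ins explode and json_normalize (my addition, not part of the original answer; it assumes CallInfo holds JSON strings with no None entries, as in the mock-up):
import json
import pandas as pd

# parse each JSON string into a list of dicts, then give each call its own row
parsed = df.assign(CallInfo=df['CallInfo'].map(json.loads)).explode('CallInfo')
parsed = parsed.reset_index(drop=True)
# flatten the per-call dicts into CallDate / CallLength columns
df_norm = pd.concat([parsed[['CustomerId']],
                     pd.json_normalize(parsed['CallInfo'].tolist())], axis=1)
print(df_norm)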

convert a list in rows of dataframe in one column to simple string

I have a dataframe with a list in one column that I want to convert into a simple string:
id data_words_nostops
26561364 [andrographolide, major, labdane, diterpenoid]
26561979 [dgat, plays, critical, role, hepatic, triglyc]
26562217 [despite, success, imatinib, inhibiting, bcr]
DESIRED OUTPUT
id data_words_nostops
26561364 andrographolide, major, labdane, diterpenoid
26561979 dgat, plays, critical, role, hepatic, triglyc
26562217 despite, success, imatinib, inhibiting, bcr
Try this:
df['data_words_nostops'] = df['data_words_nostops'].apply(lambda row: ','.join(row))
Complete code:
import pandas as pd
l1 = ['26561364', '26561979', '26562217']
l2 = [['andrographolide', 'major', 'labdane', 'diterpenoid'],['dgat', 'plays', 'critical', 'role', 'hepatic', 'triglyc'],['despite', 'success', 'imatinib', 'inhibiting', 'bcr']]
df = pd.DataFrame(list(zip(l1, l2)),
                  columns=['id', 'data_words_nostops'])
df['data_words_nostops'] = df['data_words_nostops'].apply(lambda row: ','.join(row))
Output:
id data_words_nostops
0 26561364 andrographolide,major,labdane,diterpenoid
1 26561979 dgat,plays,critical,role,hepatic,triglyc
2 26562217 despite,success,imatinib,inhibiting,bcr
df["data_words_nostops"] = df.apply(lambda row: row["data_words_nostops"][0], axis=1)
You can use pandas str join for this:
df["data_words_nostops"] = df["data_words_nostops"].str.join(",")
df
id data_words_nostops
0 26561364 andrographolide,major,labdane,diterpenoid
1 26561979 dgat,plays,critical,role,hepatic,triglyc
2 26562217 despite,success,imatinib,inhibiting,bcr
I tried the following as well:
df_ready['data_words_nostops_Joined'] = df_ready.data_words_nostops.apply(', '.join)

Python - keep rows in dataframe based on partial string match

I have 2 dataframes :
df1 is a list of mailboxes and email ids
df2 shows a list of approved domains
I read both dataframes from an Excel sheet:
xls = pd.ExcelFile(input_file_shared_mailbox)
df = pd.read_excel(xls, sheet_name = sheet_name_shared_mailbox)
I want to only keep records in df1 where df1['Email_Id'] contains one of the df2['approved_domain'] values.
print(df1)
Mailbox Email_Id
0 mailbox1 abc#gmail.com
1 mailbox2 def#yahoo.com
2 mailbox3 ghi#msn.com
print(df2)
approved_domain
0 msn.com
1 gmail.com
and I want df3, which basically shows
print (df3)
Mailbox Email_Id
0 mailbox1 abc#gmail.com
1 mailbox3 ghi#msn.com
This is the code I have right now. I think it is close, but I can't figure out the exact problem in the syntax:
df3 = df1[df1['Email_Id'].apply(lambda x: [item for item in x if item in df2['Approved_Domains'].tolist()])]
But I get this error:
TypeError: unhashable type: 'list'
I spent a lot of time researching the forum for a solution but could not find what I was looking for. I appreciate all the help.
So these are the steps you will need to follow to do what you want with your two data frames:
1. Split your email_address column into two separate columns:
df1['add'], df1['domain'] = df1['email_address'].str.split('#', 1).str
2. Then drop your add column to keep your data frame clean:
df1 = df1.drop('add', axis=1)
3. Get a new data frame with only the values you want, by selecting the rows whose 'domain' value appears in the 'approved_domain' column:
df_new = df1[df1['domain'].isin(df2['approved_domain'])]
4. Drop the 'domain' column in df_new:
df_new = df_new.drop('domain', axis=1)
This is what the result will be:
mailbox email_address
0 mailbox1 abc#gmail.com
2 mailbox3 ghi#msn.com
You can use a dynamically created regular expression to search for the valid domains and filter out everything else. Here is the code for reference:
# -*- coding: utf-8 -*-
import pandas as pd
import re

mailbox_list = [
    ['mailbox1', 'abc#gmail.com'],
    ['mailbox2', 'def#yahoo.com'],
    ['mailbox3', 'ghi#msn.com']]
valid_domains = ['msn.com', 'gmail.com']
df1 = pd.DataFrame(mailbox_list, columns=['Mailbox', 'EmailID'])
df2 = pd.DataFrame(valid_domains)
valid_list = []
for index, row in df1.iterrows():
    for idx, record in df2.iterrows():
        if re.search(rf"#{record[0]}", row[1], re.IGNORECASE):
            valid_list.append([row[0], row[1]])
df3 = pd.DataFrame(valid_list, columns=['Mailbox', 'EmailID'])
print(df3)
The output of this is:
Mailbox EmailID
0 mailbox1 abc#gmail.com
1 mailbox3 ghi#msn.com
Solution
df1 = {'MailBox': ['mailbox1', 'mailbox2', 'mailbox3'], 'Email_Id': ['abc#gmail.com', 'def#yahoo.com', 'ghi#msn.com']}
df2 = {'approved_domain': ['msn.com', 'gmail.com']}
mailboxes, emails = zip( # unzip the filtered pairs back into columns
    *filter( # keep only pairs whose email contains an approved domain
        lambda i: any([ # i = ('mailbox1', 'abc#gmail.com')
            approved_domain in i[1] for approved_domain in df2['approved_domain']
        ]),
        zip(df1['MailBox'], df1['Email_Id']) # zip the columns into pairs
    )
)
df3 = {
    'MailBox': mailboxes,
    'Email_Id': emails
}
print(df3)
Output:
> {'MailBox': ('mailbox1', 'mailbox3'), 'Email_Id': ('abc#gmail.com', 'ghi#msn.com')}
Some notes:
A big chunk of this code is just parsing the data structure. The zipping and unzipping are only there to convert the list of columns into a list of rows and back. If you already have a list of rows, you only have to do the filtering part.
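For comparison, here is a pandas-native version of the same filter (my addition, assuming df1 and df2 are DataFrames built from the question's data):
import pandas as pd

df1 = pd.DataFrame({'Mailbox': ['mailbox1', 'mailbox2', 'mailbox3'],
                    'Email_Id': ['abc#gmail.com', 'def#yahoo.com', 'ghi#msn.com']})
df2 = pd.DataFrame({'approved_domain': ['msn.com', 'gmail.com']})

# keep rows whose domain (the part after the '#') is in the approved list
domains = df1['Email_Id'].str.split('#').str[-1]
df3 = df1[domains.isin(df2['approved_domain'])]
print(df3)
#     Mailbox       Email_Id
# 0  mailbox1  abc#gmail.com
# 2  mailbox3    ghi#msn.com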

Parsing JSON in Pandas

I need to extract the following json:
{"PhysicalDisks":[{"Status":"SMART Passed","Name":"/dev/sda"}]}
{"PhysicalDisks":[{"Status":"SMART Passed","Name":"/dev/sda"},{"Status":"SMART Passed","Name":"/dev/sdb"}]}
{"PhysicalDisks":[{"Status":"SMART Passed","Name":"/dev/sda"},{"Status":"SMART Passed","Name":"/dev/sdb"}]}
{"PhysicalDisks":[{"Name":"disk0","Status":"Passed"},{"Name":"disk1","Status":"Passed"}]}
{"PhysicalDisks":[{"Name":"disk0","Status":"Failed"},{"Name":"disk1","Status":"not supported"}]}
{"PhysicalDisks":[{"Name":"disk0","Status":"Passed"}]}
Name: raw_results, dtype: object
Into separate columns. I don't know how many disks per result there might be in the future. What would be the best way here?
I tried the following:
d = raw_res['raw_results'].map(json.loads).apply(pd.Series).add_prefix('raw_results.')
Gives me a single raw_results.PhysicalDisks column holding each result's list of disks (example output omitted). A better way would be to add each disk check as an additional row in the dataframe, with the same checkid as the row it was extracted from. So for 3 disks in a result it will generate 3 rows, 1 per disk.
UPDATE
This code
# This works
import numpy as np # for np.nan (pd.np is deprecated and was removed in pandas 2.0)

dfs = []
def json_to_df(row, json_col):
    json_df = pd.read_json(row[json_col])
    dfs.append(json_df.assign(**row.drop(json_col)))

df['raw_results'].replace("{}", np.nan, inplace=True)
df = df.dropna()
df.apply(json_to_df, axis=1, json_col='raw_results')
df = pd.concat(dfs)
df.head()
adds an extra row for each disk (sda, sdb, etc.). So now I would need to split this column into two: Status and Name.
df1 = df["PhysicalDisks"].apply(pd.Series)
df_final = pd.concat([df, df1], axis = 1).drop('PhysicalDisks', axis = 1)
df_final.head()
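As a side note (my addition; it assumes the PhysicalDisks cells are plain dicts at this point, as produced by the code above): building the new columns from the values directly is usually faster than .apply(pd.Series):
# equivalent to df["PhysicalDisks"].apply(pd.Series), but built from the dicts directly
df1 = pd.DataFrame(df["PhysicalDisks"].tolist(), index=df.index)
df_final = pd.concat([df, df1], axis=1).drop('PhysicalDisks', axis=1)
df_final.head()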
