Parsing JSON in Pandas

I need to extract the following JSON:
{"PhysicalDisks":[{"Status":"SMART Passed","Name":"/dev/sda"}]}
{"PhysicalDisks":[{"Status":"SMART Passed","Name":"/dev/sda"},{"Status":"SMART Passed","Name":"/dev/sdb"}]}
{"PhysicalDisks":[{"Status":"SMART Passed","Name":"/dev/sda"},{"Status":"SMART Passed","Name":"/dev/sdb"}]}
{"PhysicalDisks":[{"Name":"disk0","Status":"Passed"},{"Name":"disk1","Status":"Passed"}]}
{"PhysicalDisks":[{"Name":"disk0","Status":"Failed"},{"Name":"disk1","Status":"not supported"}]}
{"PhysicalDisks":[{"Name":"disk0","Status":"Passed"}]}
Name: raw_results, dtype: object
into separate columns. I don't know how many disks per result there might be in the future. What would be the best way here?
I tried the following:
import json
d = raw_res['raw_results'].map(json.loads).apply(pd.Series).add_prefix('raw_results.')
This gives me: (output screenshot omitted)
An example of the desired output might be something like: (screenshot omitted)
A better way would be to add each disk check as an additional row in the dataframe, with the same checkid as the row it was extracted from. So for 3 disks in the results it will generate 3 rows, 1 per disk.
UPDATE
This code
# This works
import numpy as np
import pandas as pd

dfs = []

def json_to_df(row, json_col):
    # Parse this row's JSON and carry the remaining columns alongside it
    json_df = pd.read_json(row[json_col])
    dfs.append(json_df.assign(**row.drop(json_col)))

# Empty payloads become NaN so dropna() can remove them
# (pd.np is deprecated; use numpy directly)
df['raw_results'].replace("{}", np.nan, inplace=True)
df = df.dropna()
df.apply(json_to_df, axis=1, json_col='raw_results')
df = pd.concat(dfs)
df.head()
This adds an extra row for each disk (sda, sdb, etc.).
So now I would need to split this column into 2: Status and Name.

df1 = df["PhysicalDisks"].apply(pd.Series)
df_final = pd.concat([df, df1], axis = 1).drop('PhysicalDisks', axis = 1)
df_final.head()
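For reference, a more compact route with pd.json_normalize (a sketch, assuming a checkid column as described and pandas >= 1.0; the frame below is hypothetical):

import json
import pandas as pd

# Hypothetical frame mirroring the question: a checkid plus the raw JSON string.
raw_res = pd.DataFrame({
    'checkid': [1, 2],
    'raw_results': [
        '{"PhysicalDisks":[{"Status":"SMART Passed","Name":"/dev/sda"}]}',
        '{"PhysicalDisks":[{"Name":"disk0","Status":"Passed"},{"Name":"disk1","Status":"Passed"}]}',
    ],
})
# Parse each JSON string and explode the disk list into one row per disk...
disks = raw_res.assign(
    disk=raw_res['raw_results'].map(lambda s: json.loads(s)['PhysicalDisks'])
).explode('disk')
# ...then normalize the dicts into Status/Name columns, keeping the checkid.
out = pd.concat(
    [disks.drop(columns=['raw_results', 'disk']).reset_index(drop=True),
     pd.json_normalize(disks['disk'].tolist())],
    axis=1,
)
print(out)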

KEGG Drug database Python script

I have a drug database saved in a SINGLE column of a CSV file that I can read with Pandas. The file contains 750,000 rows, and its elements are divided by "///". The column also ends with "///", and every row seems to end with ";".
I would like to split it into multiple columns in order to create a structured database. Capitalized words (drug information) like "ENTRY", "NAME", etc. will be the headers of these new columns.
So it has some structure, although the elements can be described by a different number and sort of fields, meaning some elements will just have NaN in some cells. I have never worked with such an SQL-like format, and it is difficult to reproduce it as Pandas code.
An example of desired output would look like this:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ENTRY": ["001", "002", "003"],
    "NAME": ["water", "ibuprofen", "paralen"],
    "FORMULA": ["H2O", "C5H16O85", "C14H24O8"],
    "COMPONENT": [np.nan, np.nan, "paracetamol"]})
I am guessing there will be .split() involved, based on the CAPITALIZED words? A Python 3 solution would be appreciated; it could help a lot of people. Thanks!
Here's what I could do to help:
import pandas as pd
cols = ['ENTRY', 'NAME', 'FORMULA', 'COMPONENT']
# We create an additional dataframe.
dfi = pd.DataFrame()
# We read the file, get two columns and leave only the necessary lines.
df = pd.read_fwf(r'drug', header=None, names=['Key', 'Value'])
df = df[df['Key'].isin(cols)]
# To "flip" the dataframe, we first prepare an additional column
# with indexing by groups from one 'ENTRY' row to another.
dfi['Key1'] = dfi['Key'] = df[(df['Key'] == 'ENTRY')].index
dfi = dfi.set_index('Key1')
df = df.join(dfi, lsuffix='_caller', rsuffix='_other')
df.fillna(method="ffill", inplace=True)
df = df.astype({"Key_other": "Int64"})
# Change the shape of the table.
df = df.pivot(index='Key_other', columns='Key_caller', values='Value')
df = df.reindex(columns=cols)
# We clean up the resulting dataframe a little.
df['ENTRY'] = df['ENTRY'].str.split(r'\s+', expand=True)[0]
df.reset_index(drop=True, inplace=True)
pd.set_option('display.max_columns', 10)
Small code refactoring:
import pandas as pd
cols = ['ENTRY', 'NAME', 'FORMULA', 'COMPONENT']
# We read the file, get two columns and leave only the necessary lines.
df = pd.read_fwf(r'C:\Users\ф\drug\drug', header=None, names=['Key', 'Value'])
df = df[df['Key'].isin(cols)]
# To "flip" the dataframe, we first prepare an additional column
# with indexing by groups from one 'ENTRY' row to another.
df['Key_other'] = None
df.loc[(df['Key'] == 'ENTRY'), 'Key_other'] = df[(df['Key'] == 'ENTRY')].index
df['Key_other'].fillna(method="ffill", inplace=True)
# Change the shape of the table.
df = df.pivot(index='Key_other', columns='Key', values='Value')
df = df.reindex(columns=cols)
# We clean up the resulting dataframe a little.
df['ENTRY'] = df['ENTRY'].str.split(r'\s+', expand=True)[0]
df['NAME'] = df['NAME'].str.split(r'\(', expand=True)[0]
df.reset_index(drop=True, inplace=True)
pd.set_option('display.max_columns', 10)
print(df)
Key ENTRY NAME FORMULA \
0 D00001 Water H2O
1 D00002 Nadide C21H28N7O14P2
2 D00003 Oxygen O2
3 D00004 Carbon dioxide CO2
4 D00005 Flavin adenine dinucleotide C27H33N9O15P2
... ... ... ...
11983 D12452 Fostroxacitabine bralpamide hydrochloride C22H30BrN4O8P. HCl
11984 D12453 Guretolimod C24H34F3N5O4
11985 D12454 Icenticaftor C12H13F6N3O3
11986 D12455 Lirafugratinib C28H24FN7O2
11987 D12456 Lirafugratinib hydrochloride C28H24FN7O2. HCl
Key COMPONENT
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
... ...
11983 NaN
11984 NaN
11985 NaN
11986 NaN
11987 NaN
[11988 rows x 4 columns]
It needs a little more work to get it fully into shape; I leave that to you.
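For comparison, a sketch that parses the raw text directly instead of going through read_fwf, assuming the layout described in the question: records separated by '///' and each field starting with a capitalized keyword at the beginning of a line (the 'drug' path is the question's):

import re
import pandas as pd

cols = ['ENTRY', 'NAME', 'FORMULA', 'COMPONENT']
records = []
with open('drug', encoding='utf-8') as fh:
    for chunk in fh.read().split('///'):
        # Capture the first line of each keyword field; continuation lines
        # start with whitespace and are ignored in this sketch.
        fields = dict(re.findall(r'^([A-Z]+)\s+(.+)$', chunk, flags=re.MULTILINE))
        if 'ENTRY' in fields:
            records.append({c: fields.get(c) for c in cols})
df = pd.DataFrame(records)
# Keep just the ID from the ENTRY line, as in the read_fwf version above.
df['ENTRY'] = df['ENTRY'].str.split(r'\s+', expand=True)[0]
print(df.head())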

Why do I concat() 3 DataFrames and still get NaN values?

My code:
dfs = [df_uk_rfmt, df_uk_clv, df_uk_prod_pen]
final_df = pd.concat(dfs, axis = 1)
final_df.head()
And my new df looks like this: (screenshot omitted)
But when I use Microsoft Query, some of those NaN cells do have values, for example for CustomerID 12748 (pic omitted).
PS. All df indexes = CustomerID
My goal is to join the 3 data frames with a full outer join.
Thank you so much for your help.
Before defining dfs, you need to make sure you do not have a MultiIndex. So, do this:
df_uk_rfmt = df_uk_rfmt.reset_index()
df_uk_clv = df_uk_clv.reset_index()
df_uk_prod_pen = df_uk_prod_pen.reset_index()
Then
dfs = [df_uk_rfmt, df_uk_clv, df_uk_prod_pen]
final_df = pd.concat(dfs, axis = 1)
final_df.head()
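A minimal sketch of why this works (toy frames with hypothetical values): pd.concat(axis=1) aligns rows by index, so making CustomerID the index on each frame produces the full outer join directly:

import pandas as pd

df_uk_rfmt = pd.DataFrame({'CustomerID': [12748, 12749], 'rfm': [1, 2]})
df_uk_clv = pd.DataFrame({'CustomerID': [12748], 'clv': [100.0]})
df_uk_prod_pen = pd.DataFrame({'CustomerID': [12749], 'pen': [0.4]})

# Align on CustomerID; join='outer' is the default for pd.concat.
dfs = [d.set_index('CustomerID') for d in (df_uk_rfmt, df_uk_clv, df_uk_prod_pen)]
final_df = pd.concat(dfs, axis=1, join='outer')
print(final_df)

NaN then appears only for customers genuinely missing from one of the frames.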

dataframe works if I take data from another dataframe but not if I add it directly

I am hoping someone can help me understand why a dataframe works one way but not the other. I am relatively new to Python but have a decent understanding of Pandas. However, I can't figure out why I can't add data to the empty dataframe directly first, without getting data from another dataframe. I know there are plenty of ways around this. I am just curious as to why.
Thank you for any light you may be able to shed on this.
import pandas as pd
source_df = pd.DataFrame({'iset':['1001']})
print(source_df)
column_names = ['Import Set No','col2']
df = pd.DataFrame(columns = column_names)
df.loc[:,'col2'] = 2
df.loc[:,'Import Set No'] = source_df.loc[:,'iset']
print(df)
Produces:
iset
0 1001
Empty DataFrame
Columns: [Import Set No, col2]
Index: []
And:
import pandas as pd
source_df = pd.DataFrame({'iset':['1001']})
print(source_df)
column_names = ['Import Set No','col2']
df = pd.DataFrame(columns = column_names)
df.loc[:,'Import Set No'] = source_df.loc[:,'iset']
df.loc[:,'col2'] = 2
print(df)
Produces:
iset
0 1001
Index(['Import Set No', 'col2'], dtype='object')
Import Set No col2
0 1001 2
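The likely culprit is index alignment: .loc[:, col] assignment aligns the right-hand side with the frame's existing index, and an empty frame has an empty index, so the first version has nothing to align to (whether the second version enlarges the frame is pandas-version-dependent). A sketch of a pattern that sidesteps the alignment entirely (names taken from the question):

import pandas as pd

source_df = pd.DataFrame({'iset': ['1001']})

# .to_numpy() strips the index from the source column, so nothing is
# aligned away; the scalar 2 is then broadcast to every row.
df = pd.DataFrame({'Import Set No': source_df['iset'].to_numpy()})
df['col2'] = 2
print(df)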

Pandas concat axis 1 failing (infinite loading)

I am trying to concat a CSV file with MA_3 and MA_5 columns (they contain NaN).
csv file = https://drive.google.com/file/d/1-219dqlmhFA6-YtD8xRigo_ZVoAIJ21v/view?usp=sharing
The code is this:
df = pd.read_csv('/content/mydrive/MyDrive/data/005930.KS.csv')
MA_3, MA_5 = pd.Series([]), pd.Series([])
for i, adj_close in enumerate(df['Adj Close']):
    MA_3 = MA_3.append(pd.Series([pd.Series.mean(ds['Adj Close'][i:i+3])]))
    MA_5 = MA_5.append(pd.Series([pd.Series.mean(ds['Adj Close'][i:i+5])]))
MA_3 = pd.concat([pd.DataFrame({'MA3':['','']}), MA_3.to_frame('MA3').iloc[:-2,:]])
MA_5 = pd.concat([pd.DataFrame({'MA5':['','','','']}), MA_5.to_frame('MA5').iloc[:-4,:]])
MA = pd.concat([MA_3, MA_5], axis=1, ignore_index=True)
df = pd.concat([df, MA], axis=1, ignore_index=True)
MA_3.shape is the same as MA_5.shape, but it doesn't work. It doesn't raise an error, but it loads infinitely (axis=0 does work). I want to solve this problem. Thank you.
Testing your code, I get an InvalidIndexError: Reindexing only valid with uniquely valued Index objects (no infinite loading).
The error occurs because the indices of the MA_3 and MA_5 dataframes are not unique. You can simply reset the indices before the concat operation:
MA_3.reset_index(drop=True, inplace=True)
MA_5.reset_index(drop=True, inplace=True)
MA = pd.concat([MA_3, MA_5], axis=1, ignore_index=True)
The option drop=True is set, so that the old indices (which are not required in this case) are not added as a new column.
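A minimal reproduction of that failure mode (toy data, not the original series): once both objects carry duplicate index labels, concat(axis=1) has no way to pair the rows up:

import pandas as pd

# Duplicate labels, as MA_3/MA_5 have after the repeated appends.
a = pd.Series([1, 2, 3], index=[0, 1, 0])
b = pd.Series([4, 5, 6], index=[0, 0, 1])
# Raises InvalidIndexError (the exact exception class varies by pandas
# version), because rows cannot be aligned on a duplicated index.
pd.concat([a, b], axis=1)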
Note: If I understand correctly, your goal is to add two columns with a running mean of Adj Close to your dataframe. A simpler way to achieve this is to use the pandas rolling function:
df = pd.read_csv(in_dir + file)
df['MA_3'] = df['Adj Close'].rolling(window=3).mean()
df['MA_5'] = df['Adj Close'].rolling(window=5).mean()
Note 2: Above I assume that it should be df['Adj Close'][i:i+3] in the for loop (not ds['Adj Close'][i:i+3]; same for the next line).

Python - keep rows in dataframe based on partial string match

I have 2 dataframes :
df1 is a list of mailboxes and email ids
df2 shows a list of approved domains
I read both dataframes from an Excel sheet:
xls = pd.ExcelFile(input_file_shared_mailbox)
df = pd.read_excel(xls, sheet_name = sheet_name_shared_mailbox)
I want to only keep the records in df1 where df1['Email_Id'] contains one of the df2['approved_domain'] values.
print(df1)
Mailbox Email_Id
0 mailbox1 abc@gmail.com
1 mailbox2 def@yahoo.com
2 mailbox3 ghi@msn.com
print(df2)
approved_domain
0 msn.com
1 gmail.com
And I want df3, which basically shows:
print(df3)
Mailbox Email_Id
0 mailbox1 abc@gmail.com
1 mailbox3 ghi@msn.com
This is the code I have right now, which I think is close, but I can't figure out the exact problem with the syntax:
df3 = df1[df1['Email_Id'].apply(lambda x: [item for item in x if item in df2['Approved_Domains'].tolist()])]
But get this error
TypeError: unhashable type: 'list'
I spent a lot of time researching the forum for a solution but could not find what I was looking for. I appreciate all the help.
These are the steps you will need to follow to do what you want with your two data frames:
1. Split your email_address column into two separate columns:
df1[['add', 'domain']] = df1['email_address'].str.split('@', n=1, expand=True)
2. Then drop the add column to keep your data frame clean:
df1 = df1.drop('add', axis=1)
3. Get a new data frame with only the rows you want by keeping the rows whose 'domain' value appears in the 'approved_domain' column:
df_new = df1[df1['domain'].isin(df2['approved_domain'])]
4. Drop the 'domain' column in df_new:
df_new = df_new.drop('domain', axis=1)
This is what the result will be:
mailbox email_address
0 mailbox1 abc@gmail.com
2 mailbox3 ghi@msn.com
You can use a dynamically created regular expression to search each address for the valid domains and keep the matching rows.
Here is the code for reference:
# -*- coding: utf-8 -*-
import pandas as pd
import re

mailbox_list = [
    ['mailbox1', 'abc@gmail.com'],
    ['mailbox2', 'def@yahoo.com'],
    ['mailbox3', 'ghi@msn.com']]
valid_domains = ['msn.com', 'gmail.com']
df1 = pd.DataFrame(mailbox_list, columns=['Mailbox', 'EmailID'])
df2 = pd.DataFrame(valid_domains)
valid_list = []
for index, row in df1.iterrows():
    for idx, record in df2.iterrows():
        # re.escape stops the '.' in the domain from matching any character
        if re.search(rf"@{re.escape(record[0])}", row[1], re.IGNORECASE):
            valid_list.append([row[0], row[1]])
df3 = pd.DataFrame(valid_list, columns=['Mailbox', 'EmailID'])
print(df3)
The output of this is:
Mailbox EmailID
0 mailbox1 abc@gmail.com
1 mailbox3 ghi@msn.com
Solution
df1 = {'MailBox': ['mailbox1', 'mailbox2', 'mailbox3'], 'Email_Id': ['abc@gmail.com', 'def@yahoo.com', 'ghi@msn.com']}
df2 = {'approved_domain':['msn.com', 'gmail.com']}
mailboxes, emails = zip(  # unzip the columns
    *filter(  # filter rows
        lambda i: any([  # i = ('mailbox1', 'abc@gmail.com')
            approved_domain in i[1] for approved_domain in df2['approved_domain']
        ]),
        zip(df1['MailBox'], df1['Email_Id'])  # zip the columns
    )
)
df3 = {
    'MailBox': mailboxes,
    'Email_Id': emails
}
print(df3)
Output:
> {'MailBox': ('mailbox1', 'mailbox3'), 'Email_Id': ('abc@gmail.com', 'ghi@msn.com')}
Some notes:
A big chunk of this code is basically just parsing the data structure. The zipping and unzipping are only there to convert the list of columns to a list of rows and back. If you already have a list of rows, you only have to do the filtering part.
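For comparison, a vectorized pandas sketch of the same filter, using the frames from the question:

import pandas as pd

df1 = pd.DataFrame({'Mailbox': ['mailbox1', 'mailbox2', 'mailbox3'],
                    'Email_Id': ['abc@gmail.com', 'def@yahoo.com', 'ghi@msn.com']})
df2 = pd.DataFrame({'approved_domain': ['msn.com', 'gmail.com']})

# Take everything after the '@' and keep rows whose domain is approved.
domains = df1['Email_Id'].str.split('@').str[1]
df3 = df1[domains.isin(df2['approved_domain'])]
print(df3)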
