How to select only few columns from pandas dataframe - python

I have a JSON file containing Ansible inventory facts. I need to select a few of its columns as a DataFrame and send them in an email notification.
This is the code I tried:
import json
import pandas as pd
from pandas.io.json import json_normalize
with open('d:/facts.json') as f:
    d = json.load(f)
mydata = json_normalize(d['ansible_facts'])
mydata.head(1)
It prints the entire record (each JSON file has only one record), but I need to show only two columns from the DataFrame. Can someone suggest how to view the DataFrame with only selected columns?
Update 1:
I can now select the required columns, but only some of them work. For certain column names I get a "not in index" error.
Also, can I use my own custom column header labels when printing?
Working:
import json
import pandas as pd
from pandas.io.json import json_normalize
with open('d:/facts.json') as f:
    d = json.load(f)
mydata = json_normalize(d['ansible_facts'])
mydata.columns = mydata.columns.to_series().apply(lambda x: x.strip())
df1=mydata[['ansible_architecture','ansible_distribution']]
But when I select the columns hostname and ansible_distribution, it says "not in index".
Not working:
import json
import pandas as pd
from pandas.io.json import json_normalize
with open('d:/facts.json') as f:
    d = json.load(f)
mydata = json_normalize(d['ansible_facts'])
mydata.columns = mydata.columns.to_series().apply(lambda x: x.strip())
df1=mydata[['hostname','ansible_distribution']]
Error:
KeyError: "['hostname'] not in index"
Update 2:
I fixed that issue with the code below, but I need custom labels in the output. How can I do that?
import json
import pandas as pd
from pandas.io.json import json_normalize
with open('d:/facts.json') as f:
    d = json.load(f)
mydata = json_normalize(d['ansible_facts'])
mydata.columns = mydata.columns.to_series().apply(lambda x: x.strip())
df1=mydata[['ansible_env.HOSTNAME','ansible_distribution']]
I need custom column labels in the final output, such as Host and OSversion for the columns above. How can I do that?
Update 3: I am now trying to rename the columns before printing. I tried the following code, but it raises a KeyError ("not in index"):
import json
import pandas as pd
from tabulate import tabulate
from pandas.io.json import json_normalize
with open('/home/cloud-user/facts.json') as f:
    d = json.load(f)
mydata = json_normalize(d['ansible_facts'])
mydata.columns = mydata.columns.to_series().apply(lambda x: x.strip())
mydata=mydata.rename(columns={"ansible_env.HOSTNAME": "HOSTNAME", "ansible_disrribution": "OSType"})
df1=mydata[['HOSTNAME','OSType']]
print(tabulate(df1, headers='keys', tablefmt='psql'))
Traceback (most recent call last):
  File "ab7.py", line 21, in <module>
    df1=mydata[['HOSTNAME','OSType']]
  File "/usr/lib64/python2.7/site-packages/pandas/core/frame.py", line 2682, in __getitem__
    return self._getitem_array(key)
  File "/usr/lib64/python2.7/site-packages/pandas/core/frame.py", line 2726, in _getitem_array
    indexer = self.loc._convert_to_indexer(key, axis=1)
  File "/usr/lib64/python2.7/site-packages/pandas/core/indexing.py", line 1327, in _convert_to_indexer
    .format(mask=objarr[mask]))
KeyError: "['HOSTNAME' 'OSType'] not in index"
If I don't rename, it works perfectly, but I need more readable column labels. Any suggestions, please?
Without the rename, the code works and prints the following on the console:
+----+------------------------+------------------------+
|    | ansible_env.HOSTNAME   | ansible_distribution   |
|----+------------------------+------------------------|
|  0 | ip-xx-xx-xx-xx         | SLES                   |
+----+------------------------+------------------------+
Instead of ansible_env.HOSTNAME I need the label HOSTNAME, and instead of ansible_distribution I need OSType. Any suggestions, please?
Update 4:
I fixed the issue with the following:
df.rename(columns={'ansible_hostname':'HOSTNAME','ansible_distribution':'OS Version','ansible_ip_addresses':'Private IP','ansible_windows_domain':'FQDN'},inplace=True)

Select multiple columns as a DataFrame by passing a list of column names inside the indexing brackets:
df[['col_name1', 'col_name2']]
For more information try this link:
https://medium.com/dunder-data/selecting-subsets-of-data-in-pandas-6fcd0170be9c
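As a minimal sketch of the list-based selection combined with a rename, the example below uses a small made-up facts dictionary (its keys and values are assumptions, not real Ansible output). It also illustrates why 'hostname' was not found in the question: json_normalize flattens nested keys into dotted column names such as ansible_env.HOSTNAME. It assumes pandas 1.0 or later, where the function is available as pd.json_normalize.

```python
import pandas as pd

# Hypothetical stand-in for the contents of facts.json
d = {
    "ansible_facts": {
        "ansible_architecture": "x86_64",
        "ansible_distribution": "SLES",
        "ansible_env": {"HOSTNAME": "ip-xx-xx-xx-xx"},
    }
}

mydata = pd.json_normalize(d["ansible_facts"])

# Nested keys become dotted column names
print(list(mydata.columns))

# Select the columns with a list, then give them readable labels
df1 = mydata[["ansible_env.HOSTNAME", "ansible_distribution"]].rename(
    columns={"ansible_env.HOSTNAME": "HOSTNAME", "ansible_distribution": "OSType"}
)
print(df1)
```

Note that rename ignores keys that do not match an existing column by default, so a misspelled source name leaves that column unchanged, and a later selection by the new label then fails with a KeyError.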

Related

How to handle a .json file in tabular form in python?

By using this code:
import pandas as pd
patients_df = pd.read_json('/content/students.json',lines=True)
patients_df.head()
the data is shown in tabular form like this:
The main JSON file looks like this:
data = []
for line in open('/content/students.json', 'r'):
    data.append(json.loads(line))
How can I split the scores column of the table into separate, organized columns named Exam, Quiz, and Homework?
Possible solution could be the following:
# pip install pandas
import pandas as pd
import json
def separate_column(row):
    # Copy each entry of the "scores" list into its own column, named after its type
    for e in row["scores"]:
        row[e["type"]] = e["score"]
    return row
with open('/content/students.json', 'r') as file:
    data = [json.loads(line.rstrip()) for line in file]
df = pd.json_normalize(data)
df = df.apply(separate_column, axis=1)
df = df.drop(['scores'], axis=1)
print(df)
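An alternative sketch, not the method used in the answer above, is to let pd.json_normalize unpack the nested list with record_path and then pivot the result into one column per score type. The sample records below are made up and only assume the shape implied by the code above (a "scores" list of dicts with "type" and "score" keys); it also assumes each name occurs once, since pivot requires unique index/column pairs.

```python
import pandas as pd

# Hypothetical records in the shape implied by the question
data = [
    {"_id": 0, "name": "Ann", "scores": [
        {"type": "exam", "score": 90.0},
        {"type": "quiz", "score": 80.0},
        {"type": "homework", "score": 70.0},
    ]},
    {"_id": 1, "name": "Bob", "scores": [
        {"type": "exam", "score": 60.0},
        {"type": "quiz", "score": 65.0},
        {"type": "homework", "score": 75.0},
    ]},
]

# One row per (student, score entry), carrying the student's name along
long_df = pd.json_normalize(data, record_path="scores", meta=["_id", "name"])

# One row per student, one column per score type
wide_df = long_df.pivot(index="name", columns="type", values="score")
print(wide_df)
```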

Using pandas with praw

I'm messing around learning to work with APIs, and I figured I'd make a Reddit bot. I'm trying to adapt some code from a different script. That script used requests, converted the response to JSON, added it to a pandas DataFrame, and then wrote a CSV.
I'm trying to do about the same here, but I don't know how to get the Reddit data into the DataFrame. What I've tried below throws errors.
#!/usr/bin/python
import praw
import pandas as pd
reddit = praw.Reddit('my_bot')
subreddit = reddit.subreddit("askreddit")
for submission in subreddit.hot(limit=5):
    print("Title: ", submission.title)
    print("Score: ", submission.score)
    print("Link: ", submission.url)
    print("---------------------------------\n")
csv_file = f"/home/robothead/scripts/python/reddit/reddit-data.csv"
# start with empty dataframe
df = pd.DataFrame()
#j_data = subreddit.json()
#parse_data = j_data['data']
# append to the dataframe
#df = df.append(pd.DataFrame.from_dict(pd.json_normalize(parse_data), orient='columns'))
# append to the dataframe
df = df.append(pd.DataFrame.from_dict(pd(submission), orient='columns'))
# write the whole CSV at once
df.to_csv(csv_file, index=False, encoding='utf-8')
error:
Traceback (most recent call last):
  File "bot.py", line 21, in <module>
    df = df.append(pd.DataFrame.from_dict(pd(submission), orient='columns'))
TypeError: 'module' object is not callable
This is how I've done it in the past:
df = pd.DataFrame([ vars(post) for post in subreddit.hot(limit=5) ])
vars converts a praw.Submission to a dict, and the pandas DataFrame constructor can take a list of dictionaries. This works well when the dicts share the same keys, which is the case here. Of course, you get a giant DataFrame with ALL the columns. Some even have praw objects in them (that you can work with!). You'll probably want to pare that down by keeping just the columns you want before writing to a file.
Edit:
Just so there's no confusion, here is the full script example:
#!/usr/bin/python
import praw
import pandas as pd
reddit = praw.Reddit('my_bot')
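subreddit = reddit.subreddit("askreddit")
# output path, as defined in the question
csv_file = "/home/robothead/scripts/python/reddit/reddit-data.csv"
df = pd.DataFrame([ vars(post) for post in subreddit.hot(limit=5) ])
# keep only the columns of interest
df = df[["title","score","url"]]
df.to_csv(csv_file, index=False, encoding='utf-8')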
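If the full set of attributes is not needed, one option is to build each row dict explicitly instead of calling vars. The sketch below uses SimpleNamespace objects as hypothetical stand-ins for praw submissions so that it runs without Reddit credentials; with praw, the same comprehension would presumably iterate over subreddit.hot(limit=5) instead.

```python
from types import SimpleNamespace

import pandas as pd

# Hypothetical stand-ins for praw submissions (made-up values)
posts = [
    SimpleNamespace(title="First post", score=10, url="https://example.com/1"),
    SimpleNamespace(title="Second post", score=25, url="https://example.com/2"),
]

# One dict per post, holding only the attributes of interest
rows = [{"title": p.title, "score": p.score, "url": p.url} for p in posts]

# The DataFrame constructor accepts a list of dicts with matching keys
df = pd.DataFrame(rows)
print(df)
```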

Python deleting rows with a specific value in existing csv file

In Python I am using an existing CSV file for a project. One of its columns is sex, and its values are m, f, sex, or ' ' (blank).
I only want the rows with m and f. How do I delete the rows that contain the word sex or have no value?
You can read the CSV file into a pandas DataFrame, then select only the rows whose value is m or f.
import pandas as pd
inFile = "path/to/your/csv/file"
sep = ','
df = pd.read_csv(filepath_or_buffer=inFile, low_memory=False, encoding='utf-8', sep=sep)
df_mf = df.loc[(df['Sex']=='m') | (df['Sex']=='f')]
Here is another way to do it in pandas, which also writes the filtered rows back to a CSV file:
import pandas as pd
df= pd.read_csv('your file path')
filt = (df['sex'] =='m') | (df['sex'] == 'f')
updated_df = df.loc[filt,['other','columns','list']]
updated_df.to_csv(r'Path where you want to store the exported CSV file\File Name.csv', index = False)
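The two equality checks joined with | can also be written with Series.isin, which may read more clearly when the list of allowed values grows. The sketch below uses made-up CSV text read from memory (the name column and the row values are assumptions for illustration).

```python
import io

import pandas as pd

# Hypothetical CSV content: valid rows, a repeated header row, and a blank value
csv_text = """name,sex
Ann,f
Bob,m
name,sex
Cat,
"""

df = pd.read_csv(io.StringIO(csv_text))

# Keep only the rows whose 'sex' value is exactly 'm' or 'f'
df_mf = df[df["sex"].isin(["m", "f"])]
print(df_mf)
```

Rows holding the word sex and rows with a missing value are both dropped, because neither matches the allowed list.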

Reading Json file and converting it to columns in python

I am trying to read this json file in python using this code (I want to have all the data in a data frame):
import numpy as np
import pandas as pd
import json
from pandas.io.json import json_normalize
df = pd.read_json('short_desc.json')
df.head()
Data frame head screenshot
Using this code, I am able to convert only the first row into separate columns:
json_normalize(df.short_desc.iloc[0])
First row screenshot
I want to do the same for whole df using this code:
df.apply(lambda x : json_normalize(x.iloc[0]))
but I get this error:
ValueError: If using all scalar values, you must pass an index
What am I doing wrong?
Thank you in advance
After reading the json file with json.load, you can use pd.DataFrame.from_records. This should create the DataFrame you are looking for.
import json
import pandas as pd
with open('short_desc.json') as f:
    d = json.load(f)
df = pd.DataFrame.from_records(d)
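If the DataFrame has already been loaded and each cell of the short_desc column holds a dict, another possible approach is to pass the whole column to json_normalize as a list rather than applying it row by row. The structure of short_desc.json is not shown in the question, so the sample data and key names below are purely hypothetical; the sketch also assumes pandas 1.0 or later for pd.json_normalize.

```python
import pandas as pd

# Hypothetical frame: one dict per cell in the 'short_desc' column
df = pd.DataFrame({
    "short_desc": [
        {"what": "first", "meta": {"who": "alice"}},
        {"what": "second", "meta": {"who": "bob"}},
    ]
})

# Normalize all rows at once; nested keys become dotted column names
expanded = pd.json_normalize(df["short_desc"].tolist())
print(expanded)
```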

JSON from API call to pandas dataframe

I'm trying to make an API call and save the response as a DataFrame.
The problem is that I need the data from the 'result' column,
and I haven't managed to extract it.
I'm basically just trying to save the API response as a CSV file so I can work with it.
P.S. When I do this with a web-based "JSON to CSV converter", it produces what I want (example: https://konklone.io/json/).
import requests
import pandas as pd
import json
res = requests.get("http://api.etherscan.io/api?module=account&action=txlist&address=0xddbd2b932c763ba5b1b7ae3b362eac3e8d40121a&startblock=0&endblock=99999999&sort=asc&apikey=YourApiKeyToken")
j = res.json()
j
df = pd.DataFrame(j)
df.head()
output example picture
Try this
import requests
import pandas as pd
import json
res = requests.get("http://api.etherscan.io/api?module=account&action=txlist&address=0xddbd2b932c763ba5b1b7ae3b362eac3e8d40121a&startblock=0&endblock=99999999&sort=asc&apikey=YourApiKeyToken")
j = res.json()
# print(j)
filename ="temp.csv"
df = pd.DataFrame(j['result'])
print(df.head())
df.to_csv(filename)
It looks like you need:
df = pd.DataFrame(j["result"])
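To see why indexing into 'result' helps, the sketch below uses a made-up response dictionary instead of a live API call (the field names and values are assumptions, not the real Etherscan payload). Passing the whole dict to pd.DataFrame turns the top-level keys into columns, leaving each record as a dict inside the result column, whereas passing the list under 'result' yields one column per record field.

```python
import pandas as pd

# Hypothetical response shaped like the one in the question
j = {
    "status": "1",
    "message": "OK",
    "result": [
        {"blockNumber": "1", "hash": "0xaaa", "value": "100"},
        {"blockNumber": "2", "hash": "0xbbb", "value": "250"},
    ],
}

# The list of records under 'result' becomes the rows of the DataFrame
df = pd.DataFrame(j["result"])
print(df.head())

# With no path argument, to_csv returns the CSV text instead of writing a file
csv_text = df.to_csv(index=False)
print(csv_text)
```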
