webjobs json data to pandas dataframe - python

I am trying to read Azure WebJobs log data as JSON using the REST API. I can get the data into a dataframe with columns, but I need the latest_run column (one of the columns) to be available in tabular format, where the data is in key:value pairs, as shown in the example below.
Example:
latest_run
0,"{'id': '202011160826295419', 'name': '202011160826295419','job_name': 'failjob','}"
1,"{'id': '202011160826295419', 'name': '202011160826295419','job_name': 'passjob','}"
Now I want to display all the id and job_name values in dataframe format. Any help please, thanks in advance.
Below is my code
import json
import pandas as pd

# 'response' is the requests response from the WebJobs REST API call
data = response.json()
# print(data)
df = pd.read_json(json.dumps(data), orient='records')
# df = json.loads(json.dumps(data))
df = pd.DataFrame(df)
df = df["latest_run"]
df.to_csv('file1.csv')
print(df)

First things first: your JSON is not correct JSON. Each value has a stray trailing comma and quote before the closing brace, and JSON requires double quotes throughout, not single quotes. Pretending it's correct JSON for now, this is how you could load it:
import csv
import io
import json
import pandas

# NOTE: You can also open a CSV file directly
csv_content = """latest_run
0,"{'id': '202011160826295419', 'name': '202011160826295419','job_name': 'failjob'}"
1,"{'id': '202011160826295419', 'name': '202011160826295419','job_name': 'passjob'}"
"""
csv_file = io.StringIO(csv_content)

# Create a CSV reader (this can also be a file using 'open("myfile.csv", "r")')
reader = csv.reader(csv_file, delimiter=",")
# Skip the first line (header)
next(reader)
# Load the rest of the data
data = [row for row in reader]
# Create a dataframe, parsing the JSON as it goes in
df = pandas.DataFrame(
    [json.loads(row[1].replace("'", "\"")) for row in data],
    index=[int(row[0]) for row in data],
)
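Alternatively, if the data is already sitting in a DataFrame column (as in the question), a minimal sketch of another route: parse each single-quoted dict string with ast.literal_eval (which, unlike json.loads, accepts Python-style quoting) and expand the resulting dicts into columns with pandas.json_normalize. The sample values below are the question's, minus the stray trailing comma-quote:
import ast
import pandas as pd

df = pd.DataFrame({"latest_run": [
    "{'id': '202011160826295419', 'name': '202011160826295419', 'job_name': 'failjob'}",
    "{'id': '202011160826295419', 'name': '202011160826295419', 'job_name': 'passjob'}",
]})

# Parse each string into a dict, then expand the dicts into columns
expanded = pd.json_normalize(df["latest_run"].apply(ast.literal_eval).tolist())
print(expanded[["id", "job_name"]])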

Related

Extract only ids from json files and read them into a csv

I have a folder containing multiple JSON files. Here is a sample JSON file (all the JSON files have the same structure):
{
    "url": "http://www.lulu.com/shop/alfred-d-byrd/in-the-fire-of-dawn/paperback/product-1108729.html",
    "label": "true",
    "body": "SOME TEXT HERE",
    "ids": [
        "360175950098468864",
        "394147879201148929"
    ]
}
I'd like to extract only ids and write them into a CSV file. Here is my code:
import pandas as pd
import os
from os import path
import glob
import csv
import json
input_path = "TEST/True_JSON"
for file in glob.glob(os.path.join(input_path, '*.json')):
    with open(file, 'rt') as json_file:
        json_data = pd.read_json(json_file)  # reading json into a pandas dataframe
        ids = json_data[['ids']]  # select only "ids"
        ids.to_csv('TEST/ids.csv', encoding='utf-8', header=False, index=False)
        print(ids)
PROBLEM: The above code writes some ids into a CSV file. However, it doesn't return all ids. Also, there are some ids in the output CSV file (ids.csv) that didn't exist in any of my JSON files!
I'd really appreciate it if someone could help me understand where the problem is.
Thank you,
Another way is to create a common list of all the ids in the folder and write it to the output file only once. Here is an example:
input_path = "TEST/True_JSON"
ids = []
for file in glob.glob(os.path.join(input_path,'*.json')):
with open(file,'rt') as json_file:
json_data = pd.read_json(json_file) #reading json into a pandas dataframe
ids.extend(json_data['ids'].to_list()) #select only "response_tweet_ids"
pd.DataFrame(
ids, colums=('ids', )
).to_csv('TEST/ids.csv',encoding='utf-8', header=False, index=False)
print(ids)
Please read the answer by @lemonhead for more details.
I think you have two main issues here:
1. pandas seems to read some ids in off-by-one, probably because it internally parses the value as a float (which cannot represent integers above 2^53 exactly) before converting to int64. See here for a similar issue encountered.
To see this:
> x = '''
{
    "url": "http://www.lulu.com/shop/alfred-d-byrd/in-the-fire-of-dawn/paperback/product-1108729.html",
    "label": "true",
    "body": "SOME TEXT HERE",
    "ids": [
        "360175950098468864",
        "394147879201148929"
    ]
}
'''
> print(pd.read_json(io.StringIO(x)))
# outputs:
url label body ids
0 http://www.lulu.com/shop/alfred-d-byrd/in-the-... true SOME TEXT HERE 360175950098468864
1 http://www.lulu.com/shop/alfred-d-byrd/in-the-... true SOME TEXT HERE 394147879201148928
Note the off-by-one error with 394147879201148929! AFAIK, one quick way to avoid this in your case is just to tell pandas to read everything in as a string, e.g.
pd.read_json(json_file, dtype='string')
2. You are looping through your JSON files and writing each one to the same CSV file. However, by default, pandas opens the file in 'w' mode, which overwrites any previous data in the file. If you open in append mode ('a') instead, that should do what you intended:
ids.to_csv('TEST/ids.csv',encoding='utf-8', header=False, index=False, mode='a')
In context:
for file in glob.glob(os.path.join(input_path, '*.json')):
    with open(file, 'rt') as json_file:
        json_data = pd.read_json(json_file, dtype='string')  # reading json into a pandas dataframe
        ids = json_data[['ids']]  # select only the "ids" column
        ids.to_csv('TEST/ids.csv', encoding='utf-8', header=False, index=False, mode='a')
Overall though, unless you are getting something else from pandas here, why not just use the raw json and csv libraries? The following would do the same without the pandas dependency:
import os
from os import path
import glob
import csv
import json

input_path = "TEST/True_JSON"
all_ids = []
for file in glob.glob(os.path.join(input_path, '*.json')):
    with open(file, 'rt') as json_file:
        json_data = json.load(json_file)
        ids = json_data['ids']
        all_ids.extend(ids)
print(all_ids)

# write all ids to a csv file
# you could also remove duplicates or other post-processing at this point
with open('TEST/ids.csv', mode='wt', newline='') as fobj:
    writer = csv.writer(fobj)
    for row in all_ids:
        writer.writerow([row])
By default, dataframe.to_csv() overwrites the file. So each time through the loop you replace the file with the IDs from that input file, and the final result is the IDs from the last file.
Use the mode='a' argument to append to the CSV file instead of overwriting.
ids.to_csv(
    'TEST/ids.csv', encoding='utf-8', header=False, index=False,
    mode='a'
)

Problems running 'botometer-python' script over multiple user accounts & saving to CSV

I'm new to Python, having mostly used R, but I'm attempting to use the code below to run around 90 Twitter accounts/handles (saved as a one-column CSV file called '1' in the code below) through the Botometer V4 API. The API's GitHub page says that you can run through a sequence of accounts with 'check_accounts_in' without upgrading to the paid-for BotometerLite.
However, I'm stuck on how to loop through all the accounts/handles in the spreadsheet and then save the individual results to a new csv. Any help or suggestions much appreciated.
import botometer
import csv
import pandas as pd

rapidapi_key = "xxxxx"
twitter_app_auth = {
    'consumer_key': 'xxxxx',
    'consumer_secret': 'xxxxx',
    'access_token': 'xxxxx',
    'access_token_secret': 'xxxxx',
}
bom = botometer.Botometer(wait_on_ratelimit=True,
                          rapidapi_key=rapidapi_key,
                          **twitter_app_auth)

# read in csv of account names with pandas
data = pd.read_csv("1.csv")

for screen_name, result in bom.check_accounts_in(data):
    # add output to csv
    with open('output.csv', 'w') as csvfile:
        csvwriter = csv.writer(csvfile)
        csvwriter.writerow(['Account Name', 'Astroturf Score', 'Fake Follower Score'])
        csvwriter.writerow([
            result['user']['user_data']['screen_name'],
            result['display_scores']['universal']['astroturf'],
            result['display_scores']['universal']['fake_follower']
        ])
I'm not sure what the API returns, but you need to loop through your CSV data and send each item to the API; with the returned results you can append to the CSV. You can loop through the CSV without pandas, but I kept pandas in place because you are already using it.
I added a dummy function to demonstrate some returned data being saved to a CSV.
CSV I used:
names
name1
name2
name3
name4
import pandas as pd
import csv

def sample(x):
    return x + " Some new Data"

df = pd.read_csv("1.csv", header=0)
output = open('NewCSV.csv', 'w+')
for name in df['names'].values:
    api_data = sample(name)
    csvfile = csv.writer(output)
    csvfile.writerow([api_data])
output.close()
To read the one-column CSV directly without pandas (you may need to adjust this based on your CSV):
with open('1.csv', 'r') as f:
    content = f.readlines()

for name in content[1:]:  # skips the header row - remove [1:] if the file doesn't have one
    api_data = sample(name.replace('\n', ""))
Making some assumptions about your API. This may work:
This assumes the API is returning a dictionary:
{"cap":
{
"english": 0.8018818614025648,
"universal": 0.5557322218336633
}
import pandas as pd
import csv

df = pd.read_csv("1.csv", header=0)
output = open('NewCSV.csv', 'w+')
for name in df['names'].values:
    api_data = bom.check_accounts_in(name)
    csvfile = csv.writer(output)
    csvfile.writerow([api_data['cap']['english'], api_data['cap']['universal']])
output.close()
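For completeness, a hedged sketch stitching the pieces together: per the Botometer README, check_accounts_in takes an iterable of handles and yields (screen_name, result) pairs, so the header can be written once before the loop rather than on every iteration (which is what kept overwriting the file in the question's code). The 'names' header and the score fields are assumptions carried over from the question and the dummy CSV above:
import csv
import pandas as pd

# 'bom' is the botometer.Botometer instance from the question
accounts = pd.read_csv("1.csv")['names'].tolist()  # assumes a 'names' header

with open('output.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['Account Name', 'Astroturf Score', 'Fake Follower Score'])  # header once
    for screen_name, result in bom.check_accounts_in(accounts):
        writer.writerow([
            result['user']['user_data']['screen_name'],
            result['display_scores']['universal']['astroturf'],
            result['display_scores']['universal']['fake_follower'],
        ])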

Parse a json file to get the right columns to insert into bigquery

I'm relatively new to Python and I am trying to get some exchange rate data from the ECB free api:
GET https://api.exchangeratesapi.io/latest?base=GBP
I want to ultimately end up with this data in a BigQuery table. Loading the data into BQ is fine, but getting it into the right column/row format before sending it to BQ is the problem.
I want to end up with a table like this:
Currency Rate Date
CAD 1.629.. 2019-08-27
HKD 9.593.. 2019-08-27
ISK 152.6.. 2019-08-27
... ... ...
I've tried a few things but not quite got there yet:
import json
import requests

# api-endpoint
URL = "https://api.exchangeratesapi.io/latest?base=GBP"

# sending get request and saving the response as response object
r = requests.get(url=URL)

# extracting data in json format
data = r.json()

with open('data.json', 'w') as outfile:
    json.dump(data['rates'], outfile)

a_dict = {'date': '2019-08-26'}

with open('data.json') as f:
    data = json.load(f)

data.update(a_dict)

with open('data.json', 'w') as f:
    json.dump(data, f)

print(data)
Here is the original json file:
{
    "rates": {
        "CAD": 1.6296861353,
        "HKD": 9.593490542,
        "ISK": 152.6759753684,
        "PHP": 64.1305429339,
        "DKK": 8.2428443501,
        "HUF": 363.2604778172,
        "CZK": 28.4888284523,
        "GBP": 1.0,
        "RON": 5.2195062629,
        "SEK": 11.8475893558,
        "IDR": 17385.9684034803,
        "INR": 87.6742617713,
        "BRL": 4.9997236134,
        "RUB": 80.646191945,
        "HRK": 8.1744110201,
        "JPY": 130.2223254066,
        "THB": 37.5852652759,
        "CHF": 1.2042718318,
        "EUR": 1.1055465269,
        "MYR": 5.1255348081,
        "BGN": 2.1622278974,
        "TRY": 7.0550451616,
        "CNY": 8.6717964026,
        "NOK": 11.0104695256,
        "NZD": 1.9192287707,
        "ZAR": 18.6217151449,
        "USD": 1.223287232,
        "MXN": 24.3265563331,
        "SGD": 1.6981194654,
        "AUD": 1.8126540855,
        "ILS": 4.3032293014,
        "KRW": 1482.7479464473,
        "PLN": 4.8146551248
    },
    "base": "GBP",
    "date": "2019-08-23"
}
Welcome! How about this, as one way to tackle your problem.
# import the pandas library so we can use its from_dict function:
import pandas as pd

# subset the json to a dict of exchange rates keyed by currency code:
d = data['rates']

# create a dataframe from this data, using pandas' from_dict function:
df = pd.DataFrame.from_dict(d, orient='index')

# add a column for date (this value is taken from the json data):
df['date'] = data['date']

# name our columns, to keep things clean
df.columns = ['rate', 'date']
This gives you:
rate date
CAD 1.629686 2019-08-23
HKD 9.593491 2019-08-23
ISK 152.675975 2019-08-23
PHP 64.130543 2019-08-23
...
In this case the currency is the index of the dataframe; if you'd prefer it as a column of its own, just add:
df['currency'] = df.index
You can then write this dataframe out to a .csv file, or write it into BigQuery.
For this I'd recommend you take a look at the BigQuery client library. It can be a little hard to get your head around at first, so you may also want to check out pandas.DataFrame.to_gbq, which is easier but less robust (see this link for more detail on the client library vs. the pandas function).
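As a taste of the easier route, a minimal sketch using DataFrame.to_gbq (this assumes the pandas-gbq package is installed; 'your-project-id' is a placeholder, and the dataset/table names are illustrative):
# write the dataframe straight to a BigQuery table, replacing any existing rows
df.to_gbq('Testing.Exchange_Rates', project_id='your-project-id', if_exists='replace')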
Thanks Ben P for the help.
Here is my script that works, for those interested. It uses an internal library my team uses for the BQ load, but the rest is pandas and requests:
from aa.py.gcp import GCPAuth, GCPBigQueryClient
from aa.py.log import StandardLogger
import requests, os, pandas as pd
# Connect to BigQuery
logger = StandardLogger('test').logger
auth = GCPAuth(logger=logger)
credentials_path = 'XXX'
credentials = auth.get_credentials(credentials_path)
gcp_bigquery = GCPBigQueryClient(logger=logger)
gcp_bigquery.connect(credentials)
# api-endpoint
URL = "https://api.exchangeratesapi.io/latest?base=GBP"
# sending get request and saving the response as response object
r = requests.get(url=URL)
# extracting data in json format
data = r.json()
# extract rates object from json
d = data['rates']
# split currency and rate for dataframe
df = pd.DataFrame.from_dict(d,orient='index')
# add date element to dataframe
df['date'] = data['date']
#column names
df.columns = ['rate', 'date']
# print dataframe
print(df)
# write dataframe to csv
df.to_csv('data.csv', sep='\t', encoding='utf-8')
#########################################
# write csv to BQ table
file_path = os.getcwd()
file_name = 'data.csv'
dataset_id = 'Testing'
table_id = 'Exchange_Rates'
response = gcp_bigquery.load_file_into_table(file_path, file_name, dataset_id, table_id, source_format='CSV', field_delimiter="\t", create_disposition='CREATE_NEVER', write_disposition='WRITE_TRUNCATE',skip_leading_rows=1)

Python convert dictionary to CSV

I am trying to convert a dictionary to CSV so that it is readable (each value under its respective key).
import csv
import json
from urllib.request import urlopen

x = 0
id_num = [848649491, 883560475, 431495539, 883481767, 851341658, 42842466, 173114302, 900616370, 1042383097, 859872672]

for bilangan in id_num:
    with urlopen("https://shopee.com.my/api/v2/item/get?itemid=" + str(bilangan) + "&shopid=1883827") as response:
        source = response.read()
    data = json.loads(source)
    # print(json.dumps(data, indent=2))
    data_list = {x: {'title': productName(), 'price': price(), 'description': description(),
                     'preorder': checkPreorder(), 'estimate delivery': estimateDelivery(),
                     'variation': variation(), 'category': categories(),
                     'brand': brand(), 'image': image_link()}}
    # print(data_list[x])
    x += 1
I store an index in x, so it loops from 0 to 1, 2 and so on. I have tried many things but still cannot find a way to make it look like this, or close to this:
https://i.stack.imgur.com/WoOpe.jpg
Using DictWriter from the csv module
Demo:
import csv

data_list = {'x': {'title': 'productName()', 'price': 'price()', 'description': 'description()',
                   'preorder': 'checkPreorder()', 'estimate delivery': 'estimateDelivery()',
                   'variation': 'variation()', 'category': 'categories()',
                   'brand': 'brand()', 'image': 'image_link()'}}

filename = "output.csv"
with open(filename, "w", newline="") as outfile:
    writer = csv.DictWriter(outfile, fieldnames=data_list["x"].keys())
    writer.writeheader()
    writer.writerow(data_list["x"])
I think maybe you just want to merge some cells, like Excel does?
If so, I think this is not possible in CSV, because the CSV format does not carry cell-style information the way Excel does.
Some possible solutions:
use openpyxl to generate an Excel file instead of a CSV; then you can merge cells with the worksheet.merge_cells() function (see the sketch after this list).
do not try to merge cells; just repeat the title, price and other fields on each line, so the data looks like:
first line: {'title':'test_title', 'price': 22, 'image': 'image_link_1'}
second line: {'title':'test_title', 'price': 22, 'image': 'image_link_2'}
do not try to merge cells, but set the repeated title, price and other fields to a blank string, so they do not show in your CSV file.
use a line break inside the cell, which merges multiple lines with the same title into a single line.
hope that helps.
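For the first option above, a minimal sketch of merging cells with openpyxl (assumes openpyxl is installed; the cell values are placeholders):
from openpyxl import Workbook

wb = Workbook()
ws = wb.active
ws["A1"] = "test_title"  # one title shared by two image rows
ws.merge_cells("A1:A2")  # merge the title cells, as Excel would
ws["B1"] = "image_link_1"
ws["B2"] = "image_link_2"
wb.save("merged.xlsx")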
If I were you, I would have done this a bit differently. I do not like that you are calling so many functions when this website offers a beautiful JSON response back :) Moreover, I would use the pandas library so that I have total control over my data. I am not a CSV lover. This is a silly prototype:
import requests
import pandas as pd

# Create our dictionary with our items lists
data_list = {'title': [], 'price': [], 'description': [], 'preorder': [],
             'estimate delivery': [], 'variation': [], 'categories': [],
             'brand': [], 'image': []}

# API url
url = 'https://shopee.com.my/api/v2/item/get'

id_nums = [848649491, 883560475, 431495539, 883481767, 851341658,
           42842466, 173114302, 900616370, 1042383097, 859872672]

shop_id = 1883827

# Loop through id_nums and collect the goodies
for id_num in id_nums:
    params = {
        'itemid': id_num,  # take values from id_nums
        'shopid': shop_id}
    r = requests.get(url, params=params)
    # Check if we got something :)
    if r.ok:
        data_json = r.json()
        # This web site returns a beautiful JSON we can slice :)
        product = data_json['item']
        # Let's populate our data_list with the items we got. We could simply
        # create one function to do this, but for now this will do
        data_list['title'].append(product['name'])
        data_list['price'].append(product['price'])
        data_list['description'].append(product['description'])
        data_list['preorder'].append(product['is_pre_order'])
        data_list['estimate delivery'].append(product['estimated_days'])
        data_list['variation'].append(product['tier_variations'])
        data_list['categories'].append([c['display_name'] for c in product['categories']])
        data_list['brand'].append(product['brand'])
        data_list['image'].append(product['image'])
    else:
        # Do something if we hit a connection error or the like,
        # e.g. retry or ignore
        pass

# Put the dictionary into a DataFrame and order the columns :)
df = pd.DataFrame(data_list)
df = df[['title', 'price', 'description', 'preorder', 'estimate delivery',
         'variation', 'categories', 'brand', 'image']]

# df.to ...? There are dozens of different ways to store your data
# that are far better than CSV, e.g. MongoDB, HDF5 or compressed pickle
df.to_csv('my_data.csv', sep=';', encoding='utf-8', index=False)

Python : For Loop iterating through a csv list to scrape from CSV website

I'm trying to create a loop that uses a column of names from a CSV file in my path to pull data from a website that serves its data as CSV. I need to iterate through each name and pass it to the website, retrieving one specific five-column row related to each name. I feel as though I have searched every thread on scraping a CSV-backed website using a CSV list.
I have tested the code that is in the for loop, and it works independently to gather the specific row and columns based upon the name, with one value as the parameter:
i = 'City'
url = url_template.format(i) # get the url
url = pd.read_csv(url)
url_df = pd.DataFrame(url)
my_data = url_df.iloc[[13]]
# my_df = pd.DataFrame(my_data) ## greyed out for testing
print my_data
However, when I attempt to run a loop:
import pandas as pd
url_template = "http://foo.html?t={}&spam=green&eggs=yellow"
# CSV file where retrieving column list
need_conv = (pd.read_csv('pract.csv')) # csv file
# column to use in the loop
need = need_conv['column']
# empty DataFrame
test_df = pd.DataFrame()
# loop to retrieve data and print
for i in need:  # for each name
    url = url_template.format(i)  # get the url
    url = pd.read_csv(url)  # read the data from the url
    url_df = pd.DataFrame(url)  # create DataFrame from parsed data
    my_data = url_df.iloc[[13]]  # select the specific row of data
    test_df = test_df.append(my_data)  # append data to the empty dataframe
print test_df.head()
I receive an error:
pandas.io.common.EmptyDataError: No columns to parse from file
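One likely cause, offered as a guess since this question has no posted answer: a missing, blank, or whitespace-padded value in the name column produces a URL whose response contains no CSV content, which raises exactly this EmptyDataError. A hedged debugging sketch that cleans each name and reports the offender (keeping the question's Python 2 style):
for i in need.dropna():  # skip missing names
    i = str(i).strip()  # stray whitespace in a name breaks the URL
    if not i:
        continue
    url = url_template.format(i)
    try:
        my_data = pd.read_csv(url).iloc[[13]]
    except pd.io.common.EmptyDataError:
        print "no data returned for " + repr(i)  # identify the bad name
        continue
    test_df = test_df.append(my_data)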
Help and feedback is very much appreciated!
