Extract only ids from json files and read them into a csv - python

I have a folder including multiple JSON files. Here is a sample JSON file (all JSON files have the same structure):
{
"url": "http://www.lulu.com/shop/alfred-d-byrd/in-the-fire-of-dawn/paperback/product-1108729.html",
"label": "true",
"body": "SOME TEXT HERE",
"ids": [
"360175950098468864",
"394147879201148929"
]
}
I'd like to extract only ids and write them into a CSV file. Here is my code:
import pandas as pd
import os
from os import path
import glob
import csv
import json
input_path = "TEST/True_JSON"
for file in glob.glob(os.path.join(input_path,'*.json')):
    with open(file,'rt') as json_file:
        json_data = pd.read_json(json_file) #reading json into a pandas dataframe
        ids = json_data[['ids']] #select only "response_tweet_ids"
        ids.to_csv('TEST/ids.csv',encoding='utf-8', header=False, index=False)
        print(ids)
PROBLEM: The above code writes some ids into the CSV file, but not all of them. Also, the output CSV file (ids.csv) contains some ids that don't exist in any of my JSON files!
I would really appreciate it if someone could help me understand where the problem is.
Thank you,

Another way is to build one common list of all ids in the folder and write it to the output file only once. Here is an example:
input_path = "TEST/True_JSON"
ids = []
for file in glob.glob(os.path.join(input_path,'*.json')):
    with open(file,'rt') as json_file:
        json_data = pd.read_json(json_file) #reading json into a pandas dataframe
        ids.extend(json_data['ids'].to_list()) #collect the ids from this file
pd.DataFrame(
    ids, columns=['ids']
).to_csv('TEST/ids.csv',encoding='utf-8', header=False, index=False)
print(ids)
Please read the answer by @lemonhead for more details.

I think you have two main issues here:
pandas seems to read some ids in off by one, probably because it parses them as floats internally and then converts them to int64, losing precision along the way. See here for a similar issue.
To see this:
> x = '''
{
"url": "http://www.lulu.com/shop/alfred-d-byrd/in-the-fire-of-dawn/paperback/product-1108729.html",
"label": "true",
"body": "SOME TEXT HERE",
"ids": [
"360175950098468864",
"394147879201148929"
]
}
'''
> print(pd.read_json(io.StringIO(x)))
# outputs:
url label body ids
0 http://www.lulu.com/shop/alfred-d-byrd/in-the-... true SOME TEXT HERE 360175950098468864
1 http://www.lulu.com/shop/alfred-d-byrd/in-the-... true SOME TEXT HERE 394147879201148928
Note the off by one error with 394147879201148929! AFAIK, one quick way to obviate this in your case is just to tell pandas to read everything in as a string, e.g.
pd.read_json(json_file, dtype='string')
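The rounding can be reproduced without pandas. This is a minimal sketch, assuming the id passes through a 64-bit float somewhere during parsing (as suspected above): near 3.9e17, neighbouring float64 values are 64 apart, so most 18-digit ids cannot be stored exactly.

```python
# Near 3.9e17, consecutive float64 values are 64 apart, so an id that is
# not a multiple of 64 cannot survive a round trip through a float.
original = "394147879201148929"
via_float = int(float(original))   # what a float-based parser ends up with
via_int = int(original)            # exact: Python ints have arbitrary precision

print(via_float)  # 394147879201148928
print(via_int)    # 394147879201148929
```

The first sample id, 360175950098468864, happens to be a multiple of 64, which is why it comes through unchanged.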
You are looping through your json files and writing each one to the same csv file. However, by default, pandas is opening the file in 'w' mode, which will overwrite any previous data in the file. If you open in append mode ('a') instead, that should do what you intended
ids.to_csv('TEST/ids.csv',encoding='utf-8', header=False, index=False, mode='a')
In context:
for file in glob.glob(os.path.join(input_path,'*.json')):
    with open(file,'rt') as json_file:
        json_data = pd.read_json(json_file, dtype='string') #reading json into a pandas dataframe
        ids = json_data[['ids']] #select only "response_tweet_ids"
        ids.to_csv('TEST/ids.csv',encoding='utf-8', header=False, index=False, mode='a')
Overall though, unless you are getting something else from pandas here, why not just use the raw json and csv libraries? The following would do the same without the pandas dependency:
import os
from os import path
import glob
import csv
import json
input_path = "TEST/True_JSON"
all_ids = []
for file in glob.glob(os.path.join(input_path,'*.json')):
    with open(file,'rt') as json_file:
        json_data = json.load(json_file)
        ids = json_data['ids']
        all_ids.extend(ids)
print(all_ids)
# write all ids to a csv file
# you could also remove duplicates or other post-processing at this point
with open('TEST/ids.csv', mode='wt', newline='') as fobj:
    writer = csv.writer(fobj)
    for row in all_ids:
        writer.writerow([row])

By default, dataframe.to_csv() overwrites the file. So each time through the loop you replace the file with the IDs from that input file, and the final result is the IDs from the last file.
Use the mode='a' argument to append to the CSV file instead of overwriting.
ids.to_csv(
'TEST/ids.csv', encoding='utf-8', header=False, index=False,
mode='a'
)
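One thing to watch with append mode is that rows also accumulate across runs of the script. The sketch below is one way to guard against that, assuming it is acceptable to delete the previous output first; the temporary folder and the `batches` list are hypothetical stand-ins for the TEST folder and the per-file ids.

```python
import os
import tempfile

import pandas as pd

out_dir = tempfile.mkdtemp()                 # stand-in for the TEST folder
out_path = os.path.join(out_dir, "ids.csv")

# Hypothetical per-file results; in the real script these come from the JSON files.
batches = [["1", "2"], ["3"]]

def run_once():
    # Start from a clean file so mode='a' only accumulates within this run.
    if os.path.exists(out_path):
        os.remove(out_path)
    for batch in batches:
        pd.DataFrame({"ids": batch}).to_csv(
            out_path, header=False, index=False, mode="a"
        )

run_once()
run_once()  # a second run does not duplicate rows

with open(out_path) as f:
    lines = f.read().split()
print(lines)  # ['1', '2', '3']
```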

Related

multiple csv files data to single json file

I have two CSV files named mortality1 and mortality2, and I want to insert the data from both into a single JSON file. I can't work out how to pass the two files to the JSON file at the same time. This is my code:
import csv
import json
import pandas as pd
from glob import glob
csvfile1 = open('C:/Users/DELL/Desktop/data/mortality1.csv', 'r')
csvfile2 = open('C:/Users/DELL/Desktop/data/mortality2.csv', 'r')
jsonfile = open('C:/Users/DELL/Desktop/data/cvstojson.json', 'w')
df = pd.read_csv(csvfile1)
df.to_json(jsonfile)
I want to insert the data from both CSV files into the JSON file at the same time.
If both of your CSV files have a similar structure, you can combine the dataframes and then convert the result to JSON.
For example:
import csv
import json
import pandas as pd
from glob import glob
csvfile1 = open('C:/Users/DELL/Desktop/data/mortality1.csv', 'r')
csvfile2 = open('C:/Users/DELL/Desktop/data/mortality2.csv', 'r')
jsonfile = open('C:/Users/DELL/Desktop/data/cvstojson.json', 'w')
# Read both files and combine them into a single dataframe
df = pd.concat([pd.read_csv(csvfile1), pd.read_csv(csvfile2)], ignore_index=True)
# Create the json representation of all rows together.
df.to_json(jsonfile, orient="records")
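Note that `DataFrame.append` was removed in pandas 2.0, so on current versions `pd.concat` is the way to stack the two frames. The sketch below shows the shape of the result; the column names and values are made up for illustration and are not taken from the mortality files.

```python
import io
import json

import pandas as pd

# Hypothetical stand-ins for mortality1.csv and mortality2.csv.
csv1 = io.StringIO("year,deaths\n2000,10\n2001,12\n")
csv2 = io.StringIO("year,deaths\n2002,9\n")

# Stack the rows of both files and renumber the index.
df = pd.concat([pd.read_csv(csv1), pd.read_csv(csv2)], ignore_index=True)

# orient="records" produces one JSON object per row, in a single array.
json_text = df.to_json(orient="records")
records = json.loads(json_text)
print(records)
```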

How to convert multiple json files to csv files

Hello, I have multiple JSON files in a folder and I want to convert each of them to a separate CSV file. Here is what I have tried so far, which converts just one JSON file to a CSV file.
import csv
import json
data = []
with open('/Users/hh/MyDataSet/traceJSON-663-661-A0-25449-7.json') as f:
    for line in f:
        data.append(json.loads(line))
csv_file=open('/Users/hh/MyDataSet/GTruth/traceJSON-663-661-A0-25449-7.csv','w')
write=csv.writer(csv_file)
# write.writerow(["row number","type","rcvTime","pos_x","pos_y","pos_z","spd_x","spd_y","spd_z","acl_x","acl_y","acl_z"
# ,"hed_x","hed_y","hed_z"])
write.writerow(["row number","type","rcvTime","sender","pos_x","pos_y","pos_z","spd_x","spd_y","spd_z","acl_x","acl_y","acl_z"
                ,"hed_x","hed_y","hed_z"])
for elem in range(len(data)):
    if data[elem]['type']==2:
        write.writerow([elem,data[elem]['type'],round(data[elem]['rcvTime'],2),'663',round(data[elem]['pos'][0],2),round(data[elem]['pos'][1],2)
                        ,round(data[elem]['pos'][2],2),round(data[elem]['spd'][0],2),round(data[elem]['spd'][1],2),round(data[elem]['spd'][2],2),
                        round(data[elem]['acl'][0],2),round(data[elem]['acl'][1],2),round(data[elem]['acl'][2],2),round(data[elem]['hed'][0],2),
                        round(data[elem]['hed'][1],2),round(data[elem]['hed'][2],2)])
    elif data[elem]['type']==3:
        write.writerow([elem,data[elem]['type'],round(data[elem]['rcvTime'],2),round(data[elem]['sender'],2),round(data[elem]['pos'][0],2),round(data[elem]['pos'][1],2)
                        ,round(data[elem]['pos'][2],2),round(data[elem]['spd'][0],2),round(data[elem]['spd'][1],2),round(data[elem]['spd'][2],2),
                        round(data[elem]['acl'][0],2),round(data[elem]['acl'][1],2),round(data[elem]['acl'][2],2),round(data[elem]['hed'][0],2),
                        round(data[elem]['hed'][1],2),round(data[elem]['hed'][2],2)])
# json_file.close()
print('done!')
csv_file.close()
I would appreciate it if anyone could show me how to do this. Also, each JSON file has a name like "traceJSON-663-661-A0-25449-7", and the first number in the name (663 in the code above) should be written to the CSV file when the type is 2, as in this line:
write.writerow([elem,data[elem]['type'],round(data[elem]['rcvTime'],2),'663',....
My json file names are like traceJSON-51-49-A16-25217-7, traceJSON-57-55-A0-25223-7, ....
I suggest using pandas for this:
from glob import glob
import pandas as pd
import os
filepaths = glob('/Users/hh/MyDataSet/*.json') # get list of json files in folder
for f in filepaths:
    filename = os.path.basename(f).rsplit('.', 1)[0] # extract filename without extension
    nr = int(filename.split('-')[1]) # extract the number from the filename - assuming that all filenames are formatted similarly, use regex otherwise
    df = pd.read_json(f) # read the json file as a pandas dataframe, assuming the json file isn't nested
    df['type'] = df['type'].replace(2, nr) # replace 2 in 'type' column with the number in the filename
    df.to_csv(f'{filename}.csv') # save as csv
If you want to round columns, you can also do this with pandas
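As a rough illustration of the rounding step, `DataFrame.round` can be applied to the whole frame; integer columns are left as they are. The column names and values below are hypothetical and only mimic the kind of fields in the trace files.

```python
import pandas as pd

# Hypothetical frame with the kind of columns found in the trace files.
df = pd.DataFrame({
    "type": [2, 3],
    "rcvTime": [25449.123456, 25450.987654],
    "pos_x": [1.23456, 7.891],
})

# Round every float column to 2 decimals; the integer 'type' column is unchanged.
df = df.round(2)
print(df)
```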
import csv
import glob
import json
import os.path
for src_path in glob.glob('/Users/hh/MyDataSet/*.json'):
    src_name = os.path.splitext(os.path.basename(src_path))[0]
    data = []
    with open(src_path) as f:
        for line in f:
            data.append(json.loads(line))
    dest_path = '/Users/hh/MyDataSet/GTruth/' + src_name + '.csv'
    csv_file=open(dest_path,'w')
    write=csv.writer(csv_file)
    write.writerow(["row number","type","rcvTime","sender","pos_x","pos_y","pos_z","spd_x","spd_y","spd_z","acl_x","acl_y","acl_z"
                    ,"hed_x","hed_y","hed_z"])
    for elem in range(len(data)):
        if data[elem]['type']==2:
            sender = src_name.split('-')[1]
            write.writerow([elem,data[elem]['type'],round(data[elem]['rcvTime'],2),sender,round(data[elem]['pos'][0],2),round(data[elem]['pos'][1],2)
                            ,round(data[elem]['pos'][2],2),round(data[elem]['spd'][0],2),round(data[elem]['spd'][1],2),round(data[elem]['spd'][2],2),
                            round(data[elem]['acl'][0],2),round(data[elem]['acl'][1],2),round(data[elem]['acl'][2],2),round(data[elem]['hed'][0],2),
                            round(data[elem]['hed'][1],2),round(data[elem]['hed'][2],2)])
        elif data[elem]['type']==3:
            write.writerow([elem,data[elem]['type'],round(data[elem]['rcvTime'],2),round(data[elem]['sender'],2),round(data[elem]['pos'][0],2),round(data[elem]['pos'][1],2)
                            ,round(data[elem]['pos'][2],2),round(data[elem]['spd'][0],2),round(data[elem]['spd'][1],2),round(data[elem]['spd'][2],2),
                            round(data[elem]['acl'][0],2),round(data[elem]['acl'][1],2),round(data[elem]['acl'][2],2),round(data[elem]['hed'][0],2),
                            round(data[elem]['hed'][1],2),round(data[elem]['hed'][2],2)])
    csv_file.close()
print('done!')

Problems running 'botometer-python' script over multiple user accounts & saving to CSV

I'm new to python, having mostly used R, but I'm attempting to use the code below to run around 90 twitter accounts/handles (saved as a one-column csv file called '1' in the code below) through the Botometer V4 API. The API github says that you can run through a sequence of accounts with 'check_accounts_in' without upgrading to the paid-for BotometerLite.
However, I'm stuck on how to loop through all the accounts/handles in the spreadsheet and then save the individual results to a new csv. Any help or suggestions much appreciated.
import botometer
import csv
import pandas as pd
rapidapi_key = "xxxxx"
twitter_app_auth = {
    'consumer_key': 'xxxxx',
    'consumer_secret': 'xxxxx',
    'access_token': 'xxxxx',
    'access_token_secret': 'xxxxx',
}
bom = botometer.Botometer(wait_on_ratelimit=True,
                          rapidapi_key=rapidapi_key,
                          **twitter_app_auth)
#read in csv of account names with pandas
data = pd.read_csv("1.csv")
for screen_name, result in bom.check_accounts_in(data):
    #add output to csv
    with open('output.csv', 'w') as csvfile:
        csvwriter = csv.writer(csvfile)
        csvwriter.writerow(['Account Name','Astroturf Score', 'Fake Follower Score'])
        csvwriter.writerow([
            result['user']['user_data']['screen_name'],
            result['display_scores']['universal']['astroturf'],
            result['display_scores']['universal']['fake_follower']
        ])
I'm not sure what the API returns, but you need to loop through your CSV data and send each item to the API. You can then append the returned results to the CSV. You could loop through the CSV without pandas, but I kept it in place because you are already using it.
I added a dummy function to demonstrate saving some returned data to a CSV.
CSV I used:
names
name1
name2
name3
name4
import pandas as pd
import csv
def sample(x):
    return x + " Some new Data"
df = pd.read_csv("1.csv", header=0)
output = open('NewCSV.csv', 'w+')
for name in df['names'].values:
    api_data = sample(name)
    csvfile = csv.writer(output)
    csvfile.writerow([api_data])
output.close()
To read the one-column CSV directly without pandas (you may need to adjust this based on your CSV):
with open('1.csv', 'r') as names_file:  # not named 'csv', to avoid shadowing the csv module
    content = names_file.readlines()
for name in content[1:]: # skips the header row - remove [1:] if the file doesn't have one
    api_data = sample(name.replace('\n', ""))
Making some assumptions about your API, this may work. It assumes the API returns a dictionary like this for each account:
{"cap":
    {
    "english": 0.8018818614025648,
    "universal": 0.5557322218336633
    }
}
import pandas as pd
import csv
# bom is the botometer.Botometer client created in the question
df = pd.read_csv("1.csv", header=0)
output = open('NewCSV.csv', 'w+')
for name in df['names'].values:
    api_data = bom.check_account(name)  # check_account scores a single handle
    csvfile = csv.writer(output)
    csvfile.writerow([api_data['cap']['english'],api_data['cap']['universal']])
output.close()
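The code in the question opens output.csv in 'w' mode inside the loop, so each account overwrites the previous one. A possible sketch of the alternative is to open the file once, write the header once, and then add one row per result. The `write_scores` helper and the `fake_results` list are hypothetical; the list only imitates the (screen_name, result) pairs and score keys used in the question so that the example runs offline.

```python
import csv
import io

def write_scores(results, fobj):
    """Write one header row, then one row per (screen_name, result) pair."""
    writer = csv.writer(fobj)
    writer.writerow(["Account Name", "Astroturf Score", "Fake Follower Score"])
    for screen_name, result in results:
        scores = result["display_scores"]["universal"]
        writer.writerow([screen_name, scores["astroturf"], scores["fake_follower"]])

# Hypothetical stand-in for the pairs yielded by bom.check_accounts_in(...).
fake_results = [
    ("@name1", {"display_scores": {"universal": {"astroturf": 0.4, "fake_follower": 1.2}}}),
    ("@name2", {"display_scores": {"universal": {"astroturf": 2.0, "fake_follower": 0.1}}}),
]

buffer = io.StringIO()          # an open file object would work the same way
write_scores(fake_results, buffer)
rows = list(csv.reader(io.StringIO(buffer.getvalue())))
print(rows)
```

With a real file, open it once with `open('output.csv', 'w', newline='')` and pass the file object in place of `buffer`.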

Loading a series of JSON objects in pandas dataframe

I have downloaded a sample dataset from here that is a series of JSON objects.
{...}
{...}
I need to load them to a pandas dataframe. I have tried below code
import pandas as pd
import json
filename = "sample-S2-records"
df = pd.DataFrame.from_records(map(json.loads, "sample-S2-records"))
But there seems to be parsing error
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
What am I missing?
You can try pandas.read_json method:
import pandas as pd
data = pd.read_json('/path/to/file.json', lines=True)
print(data)
I have tested it with this file, it works fine
The function needs a list of JSON objects. For example,
data = [ json_obj_1,json_obj_2,....]
The file does not contain the syntax for list and just has series of JSON objects. Following would solve the issue:
import pandas as pd
import json
# Load content to a variable
with open('../sample-S2-records/sample-S2-records', 'r') as content_file:
    content = content_file.read().strip()
# Split content by new line
content = content.split('\n')
# Read each line which has a json obj and store json obj in a list
json_list = []
for each_line in content:
    json_list.append(json.loads(each_line))
# Load the json list in form of a string
df = pd.read_json(json.dumps(json_list))
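The same idea can be sketched more compactly by parsing each line and handing the resulting list of dicts straight to `pd.DataFrame`, which avoids serialising back to a string. The two records below are invented stand-ins for the sample file, and the blank-line check is an assumption about how the file might end.

```python
import io
import json

import pandas as pd

# Hypothetical two-record stand-in for the sample-S2-records file.
raw = io.StringIO('{"id": "a1", "year": 2015}\n{"id": "b2", "year": 2016}\n\n')

# One JSON object per line; skip blank lines so json.loads never sees ''.
records = [json.loads(line) for line in raw if line.strip()]
df = pd.DataFrame(records)
print(df.shape)  # (2, 2)
```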

Read json file as pandas dataframe?

I am using python 3.6 and trying to download json file (350 MB) as pandas dataframe using the code below. However, I get the following error:
data_json_str = "[" + ",".join(data) + "]"
TypeError: sequence item 0: expected str instance, bytes found
How can I fix the error?
import pandas as pd
# read the entire file into a python array
with open('C:/Users/Alberto/nutrients.json', 'rb') as f:
    data = f.readlines()
# remove the trailing "\n" from each line
data = map(lambda x: x.rstrip(), data)
# each element of 'data' is an individual JSON object.
# i want to convert it into an *array* of JSON objects
# which, in and of itself, is one large JSON object
# basically... add square brackets to the beginning
# and end, and have all the individual business JSON objects
# separated by a comma
data_json_str = "[" + ",".join(data) + "]"
# now, load it into pandas
data_df = pd.read_json(data_json_str)
From your code, it looks like you're loading a JSON file which has JSON data on each separate line. read_json supports a lines argument for data like this:
data_df = pd.read_json('C:/Users/Alberto/nutrients.json', lines=True)
Note
Remove lines=True if you have a single JSON object instead of individual JSON objects on each line.
Using the json module you can parse the json into a python object, then create a dataframe from that:
import json
import pandas as pd
with open('C:/Users/Alberto/nutrients.json', 'r') as f:
    data = json.load(f)
df = pd.DataFrame(data)
If you open the file as binary ('rb'), you will get bytes. How about:
with open('C:/Users/Alberto/nutrients.json', 'r') as f:  # text mode; the old 'rU' mode was removed in Python 3.11
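The error itself is easy to reproduce on a small scale. In this sketch the two byte strings are made-up stand-ins for what `readlines()` returns in 'rb' mode; decoding each line (UTF-8 is assumed here) before joining is one way around it if the file has to stay open in binary mode.

```python
import json

lines = [b'{"a": 1}\n', b'{"a": 2}\n']   # what readlines() returns in 'rb' mode

# Joining bytes with a str separator fails with the TypeError from the question.
try:
    ",".join(lines)
except TypeError as exc:
    error = str(exc)

# Decode each line to str first, then build the JSON array text.
decoded = [line.decode("utf-8").rstrip() for line in lines]
data_json_str = "[" + ",".join(decoded) + "]"
print(data_json_str)  # [{"a": 1},{"a": 2}]
parsed = json.loads(data_json_str)
```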
Also as noted in this answer you can also use pandas directly like:
df = pd.read_json('C:/Users/Alberto/nutrients.json', lines=True)
if you want to convert it into an array of JSON objects, I think this one will do what you want
import json
data = []
with open('nutrients.json', errors='ignore') as f:
    for line in f:
        data.append(json.loads(line))
print(data[0])
The easiest way to read json file using pandas is:
pd.read_json("sample.json",lines=True,orient='columns')
To deal with nested JSON like this:
[[{Value1:1},{value2:2}],[{value3:3},{value4:4}],.....]
use plain Python indexing on the column:
value1 = df['column_name'][0][0].get('Value1')
Please see the code below:
#call the pandas library
import pandas as pd
#set the file location as URL or filepath of the json file
url = 'https://www.something.com/data.json'
#load the json data from the file to a pandas dataframe
df = pd.read_json(url, orient='columns')
#display the top 10 rows from the dataframe (this is to test only)
df.head(10)
Please review the code and modify based on your need. I have added comments to explain each line of code. Hope this helps!
