I am using Python 3.6 and trying to load a JSON file (350 MB) into a pandas DataFrame using the code below. However, I get the following error:
data_json_str = "[" + ",".join(data) + "]"
TypeError: sequence item 0: expected str instance, bytes found
How can I fix the error?
import pandas as pd
# read the entire file into a python array
with open('C:/Users/Alberto/nutrients.json', 'rb') as f:
    data = f.readlines()
# remove the trailing "\n" from each line
data = map(lambda x: x.rstrip(), data)
# each element of 'data' is an individual JSON object.
# i want to convert it into an *array* of JSON objects
# which, in and of itself, is one large JSON object
# basically... add square brackets to the beginning
# and end, and have all the individual business JSON objects
# separated by a comma
data_json_str = "[" + ",".join(data) + "]"
# now, load it into pandas
data_df = pd.read_json(data_json_str)
From your code, it looks like you're loading a JSON file which has JSON data on each separate line. read_json supports a lines argument for data like this:
data_df = pd.read_json('C:/Users/Alberto/nutrients.json', lines=True)
Note
Remove lines=True if you have a single JSON object instead of individual JSON objects on each line.
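As a minimal sketch of what lines=True does, the snippet below parses a small in-memory sample instead of the real file. The two records and their field names are made up for illustration, and io.StringIO merely stands in for the file path.

```python
import io
import pandas as pd

# Hypothetical two-record sample in JSON Lines form (one object per line).
sample = '{"id": 1, "name": "apple"}\n{"id": 2, "name": "banana"}\n'

# lines=True tells read_json to parse each line as a separate JSON object.
data_df = pd.read_json(io.StringIO(sample), lines=True)
print(data_df)
```

Each line becomes one row, and the keys of the objects become the columns.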
Using the json module you can parse the json into a python object, then create a dataframe from that:
import json
import pandas as pd
with open('C:/Users/Alberto/nutrients.json', 'r') as f:
    data = json.load(f)
df = pd.DataFrame(data)
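If you open the file in binary mode ('rb'), you get bytes rather than str. Open it in text mode instead (the old 'rU' mode is deprecated and was removed in Python 3.11, so plain 'r' is enough):
with open('C:/Users/Alberto/nutrients.json', 'r') as f:
As noted in another answer, you can also use pandas directly: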
df = pd.read_json('C:/Users/Alberto/nutrients.json', lines=True)
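To make the bytes-versus-str point concrete, here is a small sketch that reproduces the error on in-memory data and then fixes it by decoding. The two sample lines are hypothetical, and UTF-8 is an assumption about the file's encoding.

```python
# Hypothetical stand-in for what f.readlines() returns in binary mode.
raw_lines = [b'{"id": 1}\n', b'{"id": 2}\n']

try:
    ",".join(raw_lines)  # str.join rejects bytes items
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

# Decoding each line to str (UTF-8 is an assumption) makes the join work.
text_lines = [line.decode("utf-8").rstrip() for line in raw_lines]
data_json_str = "[" + ",".join(text_lines) + "]"
print(data_json_str)
```

Opening the file in text mode does the decoding for you, which is why switching from 'rb' to 'r' removes the error.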
If you want to convert it into an array of JSON objects, I think this will do what you want:
import json
data = []
with open('nutrients.json', errors='ignore') as f:
    for line in f:
        data.append(json.loads(line))
print(data[0])
The easiest way to read a JSON file with pandas is:
pd.read_json("sample.json",lines=True,orient='columns')
To deal with nested JSON like this:
[[{"Value1": 1}, {"value2": 2}], [{"value3": 3}, {"value4": 4}], ...]
use plain Python indexing on the cell value (note that the key must be a quoted string):
value1 = df['column_name'][0][0].get('Value1')
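A small sketch of that lookup on made-up data follows. The column name column_name and the nested keys are taken from the example above; the DataFrame itself is a hypothetical stand-in for whatever read_json produced.

```python
import pandas as pd

# Hypothetical nested data: each cell of the column holds a list of dicts.
df = pd.DataFrame({"column_name": [[{"Value1": 1}, {"value2": 2}],
                                   [{"value3": 3}, {"value4": 4}]]})

# Row 0 -> first dict in that list -> value stored under "Value1".
value1 = df["column_name"][0][0].get("Value1")

# dict.get returns None (instead of raising KeyError) for a missing key.
missing = df["column_name"][1][0].get("Value1")
print(value1, missing)
```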
Please see the code below:
#call the pandas library
import pandas as pd
#set the file location as URL or filepath of the json file
url = 'https://www.something.com/data.json'
#load the json data from the file to a pandas dataframe
df = pd.read_json(url, orient='columns')
#display the top 10 rows from the dataframe (this is to test only)
df.head(10)
Please review the code and modify based on your need. I have added comments to explain each line of code. Hope this helps!
Related
I want to read a text file. The file looks like this:
17430147 17277121 17767569 17352501 17567841 17650342 17572001
I want the result:
17430147
17277121
17767569
17352501
17567841
17650342
17572001
So I tried the following:
data = pd.read_csv('train.txt', header=None, delimiter=r"\s+")
or
data = pd.read_csv('train.txt', header=None, delim_whitespace=True)
Both attempts raise this error:
ParserError: Too many columns specified: expected 75262 and found 154
Then I tried this code:
file = open("train.txt", "r")
data = []
for i in file:
    i = i.replace("\n", "")
    data.append(i.split(" "))
But I think there are missing values in the result:
'2847',
'2848',
'2849',
'1947',
'2850',
'2851',
'2729',
''],
['2852',
'2853',
'2036',
Thank you!
The first step would be to read the text file as a string of values.
with open('train.txt', 'r') as f:
    # split() with no argument splits on any whitespace (spaces and newlines)
    # and never produces empty strings
    list_of_values = f.read().split()
Here, list_of_values looks like:
['17430147',
'17277121',
'17767569',
'17352501',
'17567841',
'17650342',
'17572001']
Now, to create a DataFrame out of this list, simply execute:
import pandas as pd
pd.DataFrame(list_of_values)
This will give a pandas DataFrame with a single column with values read from the text file.
If you only need the individual values from the text file and not a DataFrame, you can use the list list_of_values directly.
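Putting those steps together on in-memory text, a minimal sketch might look like this. The sample string is hypothetical (it reuses the numbers from the question and deliberately includes a doubled space and a line break), and the column name value is an arbitrary choice.

```python
import pandas as pd

# Hypothetical file content; the doubled space and newlines are deliberate.
text = "17430147 17277121  17767569 17352501\n17567841 17650342 17572001\n"

# str.split() with no argument splits on any run of whitespace,
# so no empty strings end up in the list.
list_of_values = text.split()

# One column, converted from strings to integers.
df = pd.DataFrame({"value": list_of_values}).astype({"value": "int64"})
print(df)
```

Splitting on a single space, as in split(" "), is what produces the empty '' entries shown in the question.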
You can use the .T attribute to transpose your DataFrame. Pass header=None so the first row is not consumed as column names:
data = pd.read_csv("train.txt", header=None, delim_whitespace=True).T
I am trying to rewrite a JSON file to add missing data values, but I can't get the code to write the data back to the JSON file. Here is the code that fills in the missing data:
import pandas as pd
import json
data_df = pd.read_json("Data_test.json")
#replacing empty strings with nan
df2 = data_df.mask(data_df == "")
#filling the nan with data from above.
df2["Food_cat"].fillna(method="ffill", inplace=True,)
"Data_test.json" is the file containing the list of dictionaries, and I am trying to either edit this JSON file or create a new one with the missing data filled in.
I have tried using
with open('complete_data', 'w') as f:
    json.dump(df2, f)
but it does not seem to work. Is there a way to edit the current data or create a new JSON file with the completed data?
This is the original data; I would like to keep this format.
Try to do this
import pandas as pd
import json
data_df = pd.read_json("Data_test.json")
#replacing empty strings with nan
df2 = data_df.mask(data_df == "")
#filling the nan with data from above.
df2["Food_cat"] = df2["Food_cat"].ffill()
#orient='records' writes a list of dictionaries, matching the original format
df2.to_json('path_of_file.json', orient='records')
Tell me if it works.
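As a sketch of the whole round trip without touching a file, the example below builds the DataFrame from made-up records and serializes it to a string. Only the column name Food_cat comes from the question; the other field and all values are hypothetical.

```python
import json
import pandas as pd

# Hypothetical records standing in for the contents of Data_test.json.
records = [{"Food_cat": "Fruit", "item": "apple"},
           {"Food_cat": "", "item": "pear"},
           {"Food_cat": "Dairy", "item": "milk"},
           {"Food_cat": "", "item": "cheese"}]
data_df = pd.DataFrame(records)

# Turn empty strings into NaN, then carry the last seen category forward.
df2 = data_df.mask(data_df == "")
df2["Food_cat"] = df2["Food_cat"].ffill()

# With no path argument, to_json returns the JSON text;
# orient="records" yields a list of dictionaries, one per row.
json_text = df2.to_json(orient="records")
print(json_text)
```

json.dump(df2, f) fails because a DataFrame is not JSON serializable by the json module; to_json is the pandas method meant for this.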
I am writing a script to read a txt file using pandas.
I need to query on particular headers.
Reading the Excel file works, but I cannot read the txt file.
import pandas as pd
#df=pd.read_excel('All.xlsx','Sheet1',dtype={'num1':str},index=False) #works
df=pd.read_csv('read.txt',dtype={'PHONE_NUMBER_1':str}) #doesn't work
array=['A','C']
a = df['NAME'].isin(array)
b = df[a]
print(b)
Try using this syntax; you are not using the correct key in dtype:
df=pd.read_csv('read.txt',dtype={'BRAND_NAME_1':str})
You can try this:
import pandas as pd
df = pd.read_table("input.txt", sep=" ", names=['BRAND_NAME_1'], dtype={'BRAND_NAME_1': str})
You can read the txt file first and then use astype on the column.
Read the file:
df = pd.read_csv('file.txt', names=['PHONE_NUMBER_1', 'BRAND_NAME_1'])
names is the list of column names.
Assign the type:
df['PHONE_NUMBER_1'] = df['PHONE_NUMBER_1'].astype(str)
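One thing worth checking is whether the string type is applied while reading or afterwards. The sketch below uses hypothetical comma-separated content with a header row (the real read.txt may use a different delimiter or have no header) and passes dtype to read_csv so leading zeros survive.

```python
import io
import pandas as pd

# Hypothetical comma-separated content; a real read.txt may differ.
sample = "NAME,PHONE_NUMBER_1\nA,0123456789\nB,0987654321\nC,0555000111\n"

# dtype=str for the phone column keeps the leading zeros intact.
df = pd.read_csv(io.StringIO(sample), dtype={"PHONE_NUMBER_1": str})

# Keep only the rows whose NAME is in the list.
selected = df[df["NAME"].isin(["A", "C"])]
print(selected)
```

Calling astype(str) after the read would be too late for this data, since the column would already have been parsed as integers and the leading zeros dropped.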
I am trying to read this json file in python using this code (I want to have all the data in a data frame):
import numpy as np
import pandas as pd
import json
from pandas.io.json import json_normalize
df = pd.read_json('short_desc.json')
df.head()
Data frame head screenshot
using this code I am able to convert only the first row to separated columns:
json_normalize(df.short_desc.iloc[0])
First row screenshot
I want to do the same for whole df using this code:
df.apply(lambda x : json_normalize(x.iloc[0]))
but I get this error:
ValueError: If using all scalar values, you must pass an index
What am I doing wrong?
Thank you in advance
After reading the JSON file with json.load, you can use pd.DataFrame.from_records. This should create the DataFrame you are looking for.
import json
import pandas as pd
with open('short_desc.json') as f:
    d = json.load(f)
df = pd.DataFrame.from_records(d)
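If the records still contain nested objects after loading, pd.json_normalize (the top-level successor to pandas.io.json.json_normalize, available since pandas 1.0) can flatten all rows at once. The sketch below assumes each record has a short_desc object; the inner field names are invented, since the real file's structure is only visible in the screenshots.

```python
import pandas as pd

# Hypothetical records with a nested "short_desc" object; the inner
# field names are invented for illustration.
records = [{"short_desc": {"id": 1, "text": "first"}},
           {"short_desc": {"id": 2, "text": "second"}}]

# json_normalize flattens nested dicts into dotted column names,
# handling every record in one call.
df = pd.json_normalize(records)
print(df.columns.tolist())
```

Passing the whole list avoids calling json_normalize row by row through df.apply, which is where the ValueError in the question comes from.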
I have downloaded a sample dataset from here that is a series of JSON objects.
{...}
{...}
I need to load them into a pandas DataFrame. I have tried the code below:
import pandas as pd
import json
filename = "sample-S2-records"
df = pd.DataFrame.from_records(map(json.loads, "sample-S2-records"))
But there seems to be a parsing error:
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
What am I missing?
You can try pandas.read_json method:
import pandas as pd
data = pd.read_json('/path/to/file.json', lines=True)
print(data)
I have tested it with this file, and it works fine.
The function needs a list of JSON objects. For example,
data = [ json_obj_1,json_obj_2,....]
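The file does not contain list syntax; it is just a series of JSON objects, one per line. Note also that map(json.loads, "sample-S2-records") iterates over the characters of the filename string, not over the lines of the file. The following would solve the issue: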
import pandas as pd
import json
# Load content to a variable
with open('../sample-S2-records/sample-S2-records', 'r') as content_file:
    content = content_file.read().strip()
# Split content by new line
content = content.split('\n')
# Read each line which has a json obj and store json obj in a list
json_list = []
for each_line in content:
    json_list.append(json.loads(each_line))
# Build the DataFrame directly from the list of parsed objects
df = pd.DataFrame(json_list)
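A more compact variant of the same idea is sketched below. It iterates over the file object line by line and skips blank lines, which would otherwise make json.loads raise JSONDecodeError. The in-memory content and its field names are hypothetical; io.StringIO stands in for the opened file.

```python
import io
import json
import pandas as pd

# Hypothetical file content, with a blank line to show why the check matters.
content_file = io.StringIO('{"a": 1, "b": "x"}\n\n{"a": 2, "b": "y"}\n')

# Parse each non-empty line as one JSON object.
json_list = [json.loads(line) for line in content_file if line.strip()]

df = pd.DataFrame(json_list)
print(df)
```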