I am having a bit of trouble loading a JSON file into a pandas DataFrame. I have gone through the different iterations of code listed below with no luck:
import pandas as pd
fileName = "ipl2015_auction_data.json"
jsonData = pd.read_json(fileName, orient="records")
ERROR: ValueError: arrays must all be same length
import pandas as pd
fileName = "ipl2015_auction_data.json"
#jsonData = pd.read_json(fileName, orient="records")
with open(fileName, "r") as jsonFile:
    data = jsonFile.read()
df = pd.DataFrame(data)
print(df.head())
ERROR: pandas.core.common.PandasError: DataFrame constructor not properly called!
I can't figure out what I am doing wrong. Here is the JSON data I am trying to load.
SOLUTION:
It turns out there was an issue with my pandas installation. Re-installing it did the trick.
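For reference, the second error happens because the raw file contents (a single str) were passed straight to the DataFrame constructor. Parsing the text first avoids it; a minimal sketch, assuming the file holds a JSON array of records:

import json
import pandas as pd

fileName = "ipl2015_auction_data.json"
with open(fileName, "r") as jsonFile:
    data = json.load(jsonFile)  # parse the JSON text into Python objects

# A list of dicts can go straight into the DataFrame constructor
df = pd.DataFrame(data)
print(df.head())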
import json
import pandas as pd
with open('sample.json', 'r') as f:
    data = json.load(f)
df = pd.DataFrame(data)
print(df.to_string()[str('GET'):int('502')])
I need the list of GET requests with 502 and 404 errors from the local JSON using Python pandas.
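A minimal filtering sketch, assuming hypothetical "method" and "status" columns in the JSON records (adjust the names to match the actual file):

import json
import pandas as pd

with open('sample.json', 'r') as f:
    data = json.load(f)

df = pd.DataFrame(data)

# "method" and "status" are hypothetical column names, used for illustration
errors = df[(df["method"] == "GET") & (df["status"].isin([502, 404]))]
print(errors.to_string())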
I have a bunch of code for reading multiple pickle files using Pandas:
import glob
import os
import pandas as pd

dfs = []
for filename in glob.glob(os.path.join(path, "../data/simulated-data-raw/", "*.pkl")):
    with open(filename, 'rb') as f:
        temp = pd.read_pickle(f)
        dfs.append(temp)
df = pd.DataFrame()
df = df.append(dfs)
How can I read the files using pyarrow? Meanwhile, this approach does not work and raises an error:
dfs = []
for filename in glob.glob(os.path.join(path, "../data/simulated-data-raw/", "*.pkl")):
    with open(filename, 'rb') as f:
        temp = pa.read_serialized(f)
        dfs.append(temp)
df = pd.DataFrame()
df = df.append(dfs)
FYI, pyarrow.read_serialized is deprecated; you should use Arrow IPC or the standard Python pickle module when you want to serialize data.
Anyway, I'm not sure what you are trying to achieve. Objects saved with pickle are deserialized with the exact same type they had on save, so even if you don't use pandas to load the object back, you will still get a pandas DataFrame (as that's what you pickled) and will still need pandas installed to be able to create one.
For example, you can easily get rid of pandas.read_pickle and replace it with just pickle.load, but what you get back will still be a pandas.DataFrame:
import pickle
import pandas as pd

original_df = pd.DataFrame({"foo": range(5), "bar": range(5, 10)})
pd.to_pickle(original_df, "./dummy.pkl")

# Load the pickle back with the standard pickle module instead of pandas
with open("./dummy.pkl", "rb") as f:
    loaded_back = pickle.load(f)
print(loaded_back)
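Regarding the Arrow IPC route mentioned above, here is a minimal sketch (the file name and layout are only for illustration) that round-trips a DataFrame through the Arrow IPC file format instead of pickle:

import pandas as pd
import pyarrow as pa
import pyarrow.ipc

df = pd.DataFrame({"foo": range(5), "bar": range(5, 10)})
table = pa.Table.from_pandas(df)

# Write the table to an Arrow IPC file
with pa.OSFile("dummy.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)

# Read it back and convert to pandas
loaded = pa.ipc.open_file("dummy.arrow").read_all().to_pandas()
print(loaded)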
I am trying to read this JSON file in Python using this code (I want to have all the data in a DataFrame):
import numpy as np
import pandas as pd
import json
from pandas.io.json import json_normalize
df = pd.read_json('short_desc.json')
df.head()
[Screenshot: DataFrame head]
Using this code I am able to convert only the first row to separate columns:
json_normalize(df.short_desc.iloc[0])
[Screenshot: first row after json_normalize]
I want to do the same for the whole df using this code:
df.apply(lambda x : json_normalize(x.iloc[0]))
but I get this error:
ValueError: If using all scalar values, you must pass an index
What am I doing wrong?
Thank you in advance.
After reading the JSON file with json.load, you can use pd.DataFrame.from_records. This should create the DataFrame you are looking for:
with open('short_desc.json') as f:
    d = json.load(f)
df = pd.DataFrame.from_records(d)
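As an aside, if every row of short_desc holds a dict (as the screenshot suggests), json_normalize can flatten the whole column in one call instead of row by row. A minimal sketch, assuming that structure (pd.json_normalize on pandas 1.0+, json_normalize from pandas.io.json on older versions):

import pandas as pd

df = pd.read_json('short_desc.json')

# Assumes each entry of short_desc is a dict; flatten them all at once
flat = pd.json_normalize(df['short_desc'].tolist())
print(flat.head())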
I have downloaded a sample dataset from here that is a series of JSON objects.
{...}
{...}
I need to load them into a pandas DataFrame. I have tried the code below:
import pandas as pd
import json
filename = "sample-S2-records"
df = pd.DataFrame.from_records(map(json.loads, "sample-S2-records"))
But there seems to be a parsing error:
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
What am I missing?
You can try the pandas.read_json method:
import pandas as pd
data = pd.read_json('/path/to/file.json', lines=True)
print(data)
I have tested it with this file and it works fine.
The function needs a list of JSON objects. For example,
data = [json_obj_1, json_obj_2, ...]
The file does not contain list syntax; it just has a series of JSON objects. The following would solve the issue:
import pandas as pd
import json

# Load content to a variable
with open('../sample-S2-records/sample-S2-records', 'r') as content_file:
    content = content_file.read().strip()

# Split content by new line
content = content.split('\n')

# Read each line, which holds a json obj, and store the json obj in a list
json_list = []
for each_line in content:
    json_list.append(json.loads(each_line))

# Load the json list in the form of a string
df = pd.read_json(json.dumps(json_list))
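The round-trip through json.dumps is not strictly needed: json_list is already a list of dicts, which the DataFrame constructor accepts directly. A minimal sketch of that shortcut:

import json
import pandas as pd

with open('../sample-S2-records/sample-S2-records', 'r') as content_file:
    # One JSON object per line; skip blank lines
    json_list = [json.loads(line) for line in content_file if line.strip()]

df = pd.DataFrame(json_list)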
I am trying the kaggle challenge here, and unfortunately I am stuck at a very basic step.
I am trying to read the datasets into a pandas dataframe by executing following command:
test = pd.DataFrame.from_csv("C:/Name/DataMining/hillary/data/output/emails.csv")
The problem is that this file, as you will find out, has over 300,000 records, but I am reading only 7,945:
print(test.shape)
(7945, 21)
Now I have double-checked the file and I cannot find anything special about line number 7945. Any pointers on why this could be happening?
I think it is better to use the read_csv function with the parameters quoting=csv.QUOTE_NONE and error_bad_lines=False (link).
import pandas as pd
import csv
test = pd.read_csv("output/Emails.csv", quoting=csv.QUOTE_NONE, error_bad_lines=False)
print(test.shape)
#(381422, 22)
But some problematic rows will be skipped.
If you want to skip the email body data, you can use:
import pandas as pd
import csv
test = pd.read_csv(
"output/Emails.csv",
quoting=csv.QUOTE_NONE,
sep=',',
error_bad_lines=False,
header=None,
names=[
"Id", "DocNumber", "MetadataSubject", "MetadataTo", "MetadataFrom",
"SenderPersonId", "MetadataDateSent", "MetadataDateReleased",
"MetadataPdfLink", "MetadataCaseNumber", "MetadataDocumentClass",
"ExtractedSubject", "ExtractedTo", "ExtractedFrom", "ExtractedCc",
"ExtractedDateSent", "ExtractedCaseNumber", "ExtractedDocNumber",
"ExtractedDateReleased", "ExtractedReleaseInPartOrFull",
"ExtractedBodyText", "RawText"])
print(test.shape)
# delete rows with NaN in the MetadataFrom column
test = test.dropna(subset=['MetadataFrom'])
# delete stray header rows that were read as data
test = test[test.MetadataFrom != 'MetadataFrom']
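Note that on recent pandas versions (1.3 and later), error_bad_lines is deprecated in favor of on_bad_lines; a minimal equivalent of the first snippet under that API:

import csv
import pandas as pd

# on_bad_lines="skip" replaces error_bad_lines=False on pandas >= 1.3
test = pd.read_csv("output/Emails.csv", quoting=csv.QUOTE_NONE, on_bad_lines="skip")
print(test.shape)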