I have never worked with JSON files before. I have this News Classification dataset. I wanted to get this in a Pandas dataframe.
It looks like this:
{"content": "Unions representing workers at Turner Newall say they are 'disappointed' after talks with stricken parent firm Federal Mogul.","annotation":{"notes":"","label":["Business"]},"extras":null,"metadata":{"first_done_at":1521027375000,"last_updated_at":1521027375000,"sec_taken":0,"last_updated_by":"nlYZXxNBQefF2u9VX52CdONFp0C3","status":"done","evaluation":"NONE"}}
{"content": "SPACE.com - TORONTO, Canada -- A second\\team of rocketeers competing for the #36;10 million Ansari X Prize, a contest for\\privately funded suborbital space flight, has officially announced the first\\launch date for its manned rocket.","annotation":{"notes":"","label":["SciTech"]},"extras":null,"metadata":{"first_done_at":1521027375000,"last_updated_at":1521027375000,"sec_taken":0,"last_updated_by":"nlYZXxNBQefF2u9VX52CdONFp0C3","status":"done","evaluation":"NONE"}}
There are more entries but I have posted just two of them. Each entry is wrapped in {}. Each entry has 4 keys: 'content', 'annotation', 'extras', 'metadata'. I would like to have this in a dataframe with those keys as columns.
I tried the json library and the pandas.read_json function but both gave me errors.
import json

with open('News-Classification-DataSet.json') as data_file:
    df = json.load(data_file)
This gave an error: JSONDecodeError: Extra data: line 2 column 1 (char 378)
I believe you have to read this file in line by line: the file as a whole isn't valid JSON, but each line on its own is a separate JSON object (the JSON Lines format).
So to read that in:
import json

data = []
with open('News-Classification-DataSet.json') as f:
    for line in f:
        # each line is a complete JSON object
        data.append(json.loads(line))
Now you should be able to work with that. However, what do you want as your dataframe output?
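If all you need is one column per top-level key, a minimal sketch (using the data list built above):

import pandas as pd

# each dict becomes a row; the four top-level keys become the columns
df = pd.DataFrame(data)
print(df.columns.tolist())  # ['content', 'annotation', 'extras', 'metadata']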
If you want to go straight to a dataframe, you can do as suggested:
df = pd.read_json("News-Classification-DataSet.json", lines=True)
But you have nested columns, and I don't know how you want to deal with those.
To load line-delimited JSON into a dataframe:
import pandas as pd
df = pd.read_json("News-Classification-DataSet.json", lines=True)
To parse the dicts inside those columns into separate columns:
pd.concat(
    [
        df["annotation"].apply(pd.Series),
        df[["content", "extras"]],
        df["metadata"].apply(pd.Series),
    ],
    axis=1,
)
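Alternatively, if you have already read the file into a list of dicts (as in the json.loads loop above), pd.json_normalize flattens the nested fields in one call. A sketch:

import json
import pandas as pd

with open('News-Classification-DataSet.json') as f:
    records = [json.loads(line) for line in f]

# nested dicts become dotted column names, e.g. 'annotation.notes', 'metadata.status'
flat = pd.json_normalize(records)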
My csv starts out like this:
,index,spotify_id,artist_name,track_name,album_name,duration_ms,lyrics,lyrics_bert_embeddings
0,0,5Jk0vfT81ltt2rYyrWDzZ5,Hundred Waters,Xtalk - Kodak to Graph Remix,The Moon Rang Like a Bell,285327,not fetched,"[ 0.00722605 -0.23726921 0.15163635 -0.28774077 0.07081255 0.26606813
each row ends like this in a new line:
0.03439684 -0.29289168 0.13590978 0.2332756 -0.24305075 0.2034984 ]"
These values are from a big numpy array encoded with np.array2string() and span multiple lines in the csv.
When using pd.read_csv it throws a "ParserError: Error tokenizing data. C error: EOF inside string starting at row 90607". When using the parameter engine="python" it throws a "ParserError: unexpected end of data". When using the separator sep='\t+' it just puts each line in a new row of the dataframe. The same happens when reading the file with csv.reader via with open(file_path) and iterating through each line.
Is there a way to automatically append each row to the original row it belongs to or do I have to preprocess this by hand?
I tried it with your csv data to check. The code is below, along with the output.
import pandas as pd
import csv
data_path = 'dt.csv'
df = pd.read_csv(data_path, header=None, quoting=csv.QUOTE_NONE, encoding='utf-8')
dt_json = pd.DataFrame.to_json(df)
print(dt_json)
As an example, I just converted the data from CSV to JSON via a pandas dataframe:
{"0":{"0":null,"1":0.0},
"1":{"0":"index","1":"0"},
"2":{"0":"spotify_id","1":"5Jk0vfT81ltt2rYyrWDzZ5"},
"3":{"0":"artist_name","1":"Hundred Waters"},
"4":{"0":"track_name","1":"Xtalk - Kodak to Graph Remix"},
"5":{"0":"album_name","1":"The Moon Rang Like a Bell"},
"6":{"0":"duration_ms","1":"285327"},
"7":{"0":"lyrics","1":"not fetched"},
"8":{"0":"lyrics_bert_embeddings","1":"[ 0.00722605 -0.23726921 0.15163635 -0.28774077 0.07081255 0.26606813\r\n 0.03439684 -0.29289168 0.13590978 0.2332756 -0.24305075 0.2034984 ]"}}
https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html?highlight=read_csv
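Once the rows load correctly, the embedding strings written by np.array2string can be converted back into arrays. A sketch (the variable s is hypothetical, standing for one cell of the lyrics_bert_embeddings column):

import numpy as np

# one cell's string, as produced by np.array2string (may span several lines)
s = "[ 0.00722605 -0.23726921  0.15163635 ]"
arr = np.array(s.strip('[]').split(), dtype=float)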
Name,abc
Title,teacher
Email,abc.edu
Phone,000-000-0000
Office,21building
About,"abc is teacher"
Name,def
Title,plumber
Email,plumber@plumber.com
Phone,111-111-1111
Office,22building
About,"The best plumber in the town"
Name,ghi
Title,producer
Phone,333-333-3333
Office,33building
About,"The best producer"
I would use the pandas library to read the .csv data (foo.csv in this example) and then convert it to JSON using to_json.
In this case you get a dictionary:
import pandas as pd
pd.read_csv('foo.csv', header=None, index_col=0, squeeze=True)\
.to_json(orient='columns')
If you want to export a .json file
import pandas as pd
with open('exported_file.json', 'w') as f:
    pd.read_csv('foo.csv', header=None, index_col=0, squeeze=True)\
      .to_json(f, orient='columns')
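Note that the squeeze argument was deprecated in pandas 1.4 and removed in 2.0. On current versions, a sketch of the equivalent:

import pandas as pd

# pandas >= 2.0: read first, then collapse the one-column frame to a Series
s = pd.read_csv('foo.csv', header=None, index_col=0).squeeze('columns')
print(s.to_json(orient='columns'))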
I suppose the CSV file contains sequential records about personnel in a "Label,Value" format, and you'd like to reorganize them into separate records for each person, with the labels along the second dimension as column names. The output is to be stored as a JSON file.
If this is the case, then we can use pandas.DataFrame.pivot to change the structure of the data. But before that we have to group the labels by person. For this purpose, I will assume that the Name label is obligatory for each person and that each unique label appears at most once between Names:
import pandas as pd
from io import StringIO

data = '''Name,abc
Title,teacher
Email,abc.edu
Phone,000-000-0000
Office,21building
About,"abc is teacher"
Name,def
Title,plumber
Email,plumber@plumber.com
Phone,111-111-1111
Office,22building
About,"The best plumber in the town"
Name,ghi
Title,producer
Phone,333-333-3333
Office,33building
About,"The best producer"'''
df = pd.read_csv(StringIO(data), names=['label', 'value'])
# start a new group every time a 'Name' row appears
df['grouper'] = (df['label'] == 'Name').cumsum()
df = df.pivot(index='grouper', columns='label', values='value')
Having this data we can save it as:
df.to_json('test.json', orient='records', lines=True)
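For reference, test.json should then contain one JSON object per person, something like this (pivot sorts the labels alphabetically, and the missing Email for the third person becomes null):

{"About":"abc is teacher","Email":"abc.edu","Name":"abc","Office":"21building","Phone":"000-000-0000","Title":"teacher"}
{"About":"The best plumber in the town","Email":"plumber@plumber.com","Name":"def","Office":"22building","Phone":"111-111-1111","Title":"plumber"}
{"About":"The best producer","Email":null,"Name":"ghi","Office":"33building","Phone":"333-333-3333","Title":"producer"}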
I've read all the related topics but couldn't get a solution to work.
I have an input csv file like this:
ItemId,Content
i0000008,{"Title":"Edison Kinetoscopic Record of a Sneeze","Year":"1894","Rated":"N/A"}
i0000010,{"Title":"Employees, Leaving the Lumiére, Factory","Year":"1895","Rated":"N/A"}
I've tried several different approaches but couldn't get it to work. I want to read this csv file into a Dataframe like this:
ItemId Content
-------- -------------------------------------------------------------------------------
i0000008 {"Title":"Edison Kinetoscopic Record of a Sneeze","Year":"1894","Rated":"N/A"}
i0000010 {"Title":"Employees, Leaving the Lumiére, Factory","Year":"1895","Rated":"N/A"}
With the following code (Python 3.9):
df = pd.read_csv('test.csv', sep=',', skipinitialspace = True, quotechar = '"')
As far as I understand, the commas inside the dictionary column are being treated as regular separators, even inside quotation marks, so it raises the following error:
pandas.errors.ParserError: Error tokenizing data. C error: Expected 4 fields in line 3, saw 6
Is it possible to produce desired result? Thanks.
The problem is that the commas in the Content column are interpreted as separators. You can solve this by using pd.read_fwf to manually set the column boundaries by character position:
df = pd.read_fwf('test.csv', colspecs=[(0, 8),(9,100)], header=0, names=['ItemId', 'Content'])
Result:
     ItemId                                                                           Content
0  i0000008   {"Title":"Edison Kinetoscopic Record of a Sneeze","Year":"1894","Rated":"N/A"}
1  i0000010  {"Title":"Employees, Leaving the Lumiére, Factory","Year":"1895","Rated":"N/A"}
I don't think you'll be able to read it normally with pandas, because the delimiter appears multiple times inside a single value. However, by reading it with plain Python and doing some processing, you can convert it to a pandas dataframe:
import pandas as pd

def splitValues(x):
    # split only on the first comma so the commas inside the JSON stay intact
    index = x.find(',')
    return x[:index], x[index+1:].strip()

with open('file.csv') as data:
    columns = next(data).strip().split(',')
    df = pd.DataFrame(columns=columns, data=(splitValues(row) for row in data))
OUTPUT:
ItemId Content
0 i0000008 {"Title":"Edison Kinetoscopic Record of a Sneeze","Year":"1894","Rated":"N/A"}
1 i0000010 {"Title":"Employees, Leaving the Lumiére, Factory","Year":"1895","Rated":"N/A"}
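The same first-comma split can also be written with str.split and its maxsplit argument; a sketch of the equivalent:

import pandas as pd

with open('file.csv') as f:
    header = next(f).strip().split(',')
    # maxsplit=1 splits each row at the first comma only
    rows = (line.rstrip('\n').split(',', 1) for line in f)
    df = pd.DataFrame(rows, columns=header)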
I'm new to python (and posting on SO), and I'm trying to use some code I wrote that worked in another similar context to import data from a file into a MySQL table. To do that, I need to convert it to a dataframe. In this particular instance I'm using Federal Election Comission data that is pipe-delimited (It's the "Committee Master" data here). It looks like this.
C00000059|HALLMARK CARDS PAC|SARAH MOE|2501 MCGEE|MD #500|KANSAS CITY|MO|64108|U|Q|UNK|M|C||
C00000422|AMERICAN MEDICAL ASSOCIATION POLITICAL ACTION COMMITTEE|WALKER, KEVIN MR.|25 MASSACHUSETTS AVE, NW|SUITE 600|WASHINGTON|DC|200017400|B|Q||M|M|ALABAMA MEDICAL PAC|
C00000489|D R I V E POLITICAL FUND CHAPTER 886|JERRY SIMS JR|3528 W RENO||OKLAHOMA CITY|OK|73107|U|N||Q|L||
C00000547|KANSAS MEDICAL SOCIETY POLITICAL ACTION COMMITTEE|JERRY SLAUGHTER|623 SW 10TH AVE||TOPEKA|KS|666121627|U|Q|UNK|Q|M|KANSAS MEDICAL SOCIETY|
C00000729|AMERICAN DENTAL ASSOCIATION POLITICAL ACTION COMMITTEE|DI VINCENZO, GIORGIO T. DR.|1111 14TH STREET, NW|SUITE 1100|WASHINGTON|DC|200055627|B|Q|UNK|M|M|INDIANA DENTAL PAC|
When I run this code, all of the records come back "NaN."
import pandas as pd
import pymysql
print('convert CSV to dataframe')
data = pd.read_csv ('Desktop/Python/FECupdates/cm.txt', delimiter='|')
df = pd.DataFrame(data, columns=['CMTE_ID','CMTE_NM','TRES_NM','CMTE_ST1','CMTE_ST2','CMTE_CITY','CMTE_ST','CMTE_ZIP','CMTE_DSGN','CMTE_TP','CMTE_PTY_AFFILIATION','CMTE_FILING_FREQ','ORG_TP','CONNECTED_ORG_NM','CAND_ID'])
print(df.head(10))
If I remove the dataframe part and just do this, it displays the data, so it doesn't seem like it's a problem with the file itself (but what do I know?):
import pandas as pd
import pymysql
print('convert CSV to dataframe')
data = pd.read_csv ('Desktop/Python/FECupdates/cm.txt', delimiter='|')
print(data.head(10))
I've spent hours looking at different questions here that seem to address similar issues, where the problems apparently stemmed from things like the encoding or different kinds of delimiters, but each time I try to make the same changes to my code I get the same result. I've also converted the whole thing to a csv by changing all the commas in fields to "$" and then changing the pipes to commas. It still shows up as all NaN, even though the number of records is correct if I upload it to MySQL (they're just all empty).
You made typos in the columns list. pd.DataFrame(data, columns=[...]) keeps only the listed columns, and any name that doesn't match a column of data produces a column full of NaN. Pandas can recognize the columns automatically:
import pandas as pd
import pymysql
print('convert CSV to dataframe')
data = pd.read_csv('cm.txt', delimiter='|')
df = pd.DataFrame(data)
print(df.head(10))
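Note that if the file has no header row (FEC bulk files usually ship the column names separately), read_csv will consume the first record as a header. A sketch that passes the names explicitly instead:

import pandas as pd

cols = ['CMTE_ID', 'CMTE_NM', 'TRES_NM', 'CMTE_ST1', 'CMTE_ST2', 'CMTE_CITY', 'CMTE_ST',
        'CMTE_ZIP', 'CMTE_DSGN', 'CMTE_TP', 'CMTE_PTY_AFFILIATION', 'CMTE_FILING_FREQ',
        'ORG_TP', 'CONNECTED_ORG_NM', 'CAND_ID']
# header=None stops the first data row from being eaten as a header
df = pd.read_csv('cm.txt', delimiter='|', header=None, names=cols)
print(df.head(10))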
Also, you can create an empty dataframe and concatenate the file you read:
import pandas as pd
import pymysql
print('convert CSV to dataframe')
data = pd.read_csv('cm.txt', delimiter='|')
data2 = pd.DataFrame()
df = pd.concat([data,data2],ignore_index=True)
print(df.head(10))
Try this, it worked for me:
import pandas as pd

path = 'Desktop/Python/FECupdates/'
df = pd.read_csv(path + 'cm.txt', encoding='unicode_escape', sep='|', header=None)
df = df.apply(lambda x: x.str.strip() if x.dtype == "object" else x)
df.columns = ['CMTE_ID','CMTE_NM','TRES_NM','CMTE_ST1','CMTE_ST2','CMTE_CITY','CMTE_ST','CMTE_ZIP','CMTE_DSGN','CMTE_TP','CMTE_PTY_AFFILIATION','CMTE_FILING_FREQ','ORG_TP','CONNECTED_ORG_NM','CAND_ID']
df.head(200)
I have a number of .csv files that I download into a directory.
Each .csv is supposed to have 3 columns of information. The head of one of these files looks like:
17/07/2014,637580,10.755
18/07/2014,61996,10.8497
21/07/2014,126758,10.8208
22/07/2014,520926,10.8201
23/07/2014,370843,9.2883
The code that I am using to read the .csv into a dataframe (df) is:
df = pd.read_csv(adj_directory+'\\'+filename, error_bad_lines=False,names=['DATE', 'PX', 'RAW'])
Where I name the three columns (DATE, PX and RAW).
This works fine when the file is formatted correctly. However I have noticed that sometimes the .csv has a slightly different format and can look like for example:
09/07/2014,26268315,,
10/07/2014,6601181,16.3857
11/07/2014,916651,12.5879
14/07/2014,213357,,
15/07/2014,205019,10.8607
where there is a column value missing and an extra comma appears in the values place. This means that the file fails to load into the dataframe (the df dataframe is empty).
Is there a way to read the data into a dataframe with the extra comma (ignoring the offending row) so the df would look like:
09/07/2014,26268315,NaN
10/07/2014,6601181,16.3857
11/07/2014,916651,12.5879
14/07/2014,213357,NaN
15/07/2014,205019,10.8607
It's probably best to fix the file upstream so that missing values aren't written as an extra ,. But if necessary you can correct the file in Python by replacing ,, with just ,, line by line. Taking your bad file as test.csv:
import re
import csv

patt = re.compile(r",,")

with open('test.csv') as f, open('corrected.csv', 'w') as f2:
    # collapse the doubled comma so every row has exactly three fields
    for line in csv.reader(patt.sub(',', s) for s in f):
        f2.write(','.join(str(x) for x in line))
        f2.write('\n')
Output: corrected.csv
09/07/2014,26268315,
10/07/2014,6601181,16.3857
11/07/2014,916651,12.5879
14/07/2014,213357,
15/07/2014,205019,10.8607
Then you should be able to read in this file without issue
import pandas as pd
df = pd.read_csv('corrected.csv', names=['DATE', 'PX', 'RAW'])
DATE PX RAW
0 09/07/2014 26268315 NaN
1 10/07/2014 6601181 16.3857
2 11/07/2014 916651 12.5879
3 14/07/2014 213357 NaN
4 15/07/2014 205019 10.8607
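As an alternative that avoids rewriting the file at all, you can let read_csv absorb the extra comma into a throwaway fourth column and then drop it; a sketch (not from the original answer):

import pandas as pd

df = pd.read_csv('test.csv', names=['DATE', 'PX', 'RAW', 'EXTRA'])
df = df.drop(columns='EXTRA')  # rows without the extra comma just get NaN here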
Had this problem yesterday.
Have you tried:
pd.read_csv(adj_directory+'\\'+filename,
            error_bad_lines=False, names=['DATE', 'PX', 'RAW'],
            keep_default_na=False,
            na_values=[''])
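One caveat: error_bad_lines was deprecated in pandas 1.3 and removed in 2.0, so on current versions the equivalent call would be something like this (adj_directory and filename are the variables from the question):

import pandas as pd

df = pd.read_csv(adj_directory + '\\' + filename,
                 on_bad_lines='skip',  # replaces error_bad_lines=False
                 names=['DATE', 'PX', 'RAW'],
                 keep_default_na=False,
                 na_values=[''])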