How to store this JSON file in a Pandas dataframe?

I have never worked with JSON files before. I have this News Classification dataset, and I want to load it into a Pandas dataframe.
It looks like this:
{"content": "Unions representing workers at Turner Newall say they are 'disappointed' after talks with stricken parent firm Federal Mogul.","annotation":{"notes":"","label":["Business"]},"extras":null,"metadata":{"first_done_at":1521027375000,"last_updated_at":1521027375000,"sec_taken":0,"last_updated_by":"nlYZXxNBQefF2u9VX52CdONFp0C3","status":"done","evaluation":"NONE"}}
{"content": "SPACE.com - TORONTO, Canada -- A second\\team of rocketeers competing for the #36;10 million Ansari X Prize, a contest for\\privately funded suborbital space flight, has officially announced the first\\launch date for its manned rocket.","annotation":{"notes":"","label":["SciTech"]},"extras":null,"metadata":{"first_done_at":1521027375000,"last_updated_at":1521027375000,"sec_taken":0,"last_updated_by":"nlYZXxNBQefF2u9VX52CdONFp0C3","status":"done","evaluation":"NONE"}}
There are more entries, but I have posted just two of them. Each entry is wrapped in braces {}. Each entry has 4 keys: 'content', 'annotation', 'extras', 'metadata'. I would like to have this in a dataframe with those keys as columns.
I tried the json library and the pandas.read_json function, but both gave me errors.
with open('News-Classification-DataSet.json') as data_file:
    df = json.load(data_file)
This gave an error: JSONDecodeError: Extra data: line 2 column 1 (char 378)

I believe you have to read this file in line by line; as it stands, the file as a whole isn't valid JSON (each line is a separate JSON object).
So to read that in:
import json

data = []
with open('News-Classification-DataSet.json') as f:
    for line in f:
        data.append(json.loads(line))
Now you should be able to work with that. However, what do you want as your dataframe output?
If you want to go straight to a dataframe, you can do as suggested:
df = pd.read_json("News-Classification-DataSet.json", lines=True)
But you have nested columns, and it isn't clear how you want to handle those.

To load line-delimited JSON into a dataframe:
import pandas as pd
df = pd.read_json("News-Classification-DataSet.json", lines=True)
To expand the dicts inside the annotation and metadata columns into their own columns:
pd.concat(
    [
        df["annotation"].apply(pd.Series),
        df[["content", "extras"]],
        df["metadata"].apply(pd.Series),
    ],
    axis=1,
)
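A shorter alternative for the flattening step, assuming a reasonably recent pandas, is pd.json_normalize, which expands the nested dicts into dotted column names such as annotation.label and metadata.status:
import json
import pandas as pd

# Parse each line, then flatten the nested dicts in one pass.
with open('News-Classification-DataSet.json') as f:
    records = [json.loads(line) for line in f]
df = pd.json_normalize(records)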

Related

pandas reading csv with one row spanning multiple lines

My csv starts out like this:
,index,spotify_id,artist_name,track_name,album_name,duration_ms,lyrics,lyrics_bert_embeddings
0,0,5Jk0vfT81ltt2rYyrWDzZ5,Hundred Waters,Xtalk - Kodak to Graph Remix,The Moon Rang Like a Bell,285327,not fetched,"[ 0.00722605 -0.23726921 0.15163635 -0.28774077 0.07081255 0.26606813
Each row ends like this, on a new line:
0.03439684 -0.29289168 0.13590978 0.2332756 -0.24305075 0.2034984 ]"
These values are from a big numpy array encoded with np.array2string() and span multiple lines in the csv.
When using pd.read_csv, it throws "ParserError: Error tokenizing data. C error: EOF inside string starting at row 90607". When using the parameter engine="python", it throws "ParserError: unexpected end of data". When using the separator sep='\t+', it just puts each line in a new row in the dataframe. The same happens when iterating through the file line by line with csv.reader (via with open(file_path)).
Is there a way to automatically append each row to the original row it belongs to or do I have to preprocess this by hand?
I tried your CSV data to check it; the code and its output are below.
import pandas as pd

data_path = 'dt.csv'
# The default quoting (csv.QUOTE_MINIMAL) lets the parser keep reading a
# quoted field across line breaks, so the embedded newlines survive.
df = pd.read_csv(data_path, header=None, encoding='utf-8')
dt_json = df.to_json()
print(dt_json)
As an example, I just converted the data from CSV to JSON using the pandas dataframe:
{"0":{"0":null,"1":0.0},
"1":{"0":"index","1":"0"},
"2":{"0":"spotify_id","1":"5Jk0vfT81ltt2rYyrWDzZ5"},
"3":{"0":"artist_name","1":"Hundred Waters"},
"4":{"0":"track_name","1":"Xtalk - Kodak to Graph Remix"},
"5":{"0":"album_name","1":"The Moon Rang Like a Bell"},
"6":{"0":"duration_ms","1":"285327"},
"7":{"0":"lyrics","1":"not fetched"},
"8":{"0":"lyrics_bert_embeddings","1":"[ 0.00722605 -0.23726921 0.15163635 -0.28774077 0.07081255 0.26606813\r\n 0.03439684 -0.29289168 0.13590978 0.2332756 -0.24305075 0.2034984 ]"}}
https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html?highlight=read_csv

How to create a json file from csv file where csv file has following format

Name,abc
Title,teacher
Email,abc.edu
Phone,000-000-0000
Office,21building
About,"abc is teacher"
Name,def
Title,plumber
Email,plumber#plumber.com
Phone,111-111-1111
Office,22building
About,"The best plumber in the town"
Name,ghi
Title,producer
Phone,333-333-3333
Office,33building
About,"The best producer"
I would use the pandas library to read the .csv data (foo.csv in this example) and then convert it to JSON using to_json.
In this case you get a dictionary:
import pandas as pd

# Note: the squeeze keyword was removed in pandas 2.0; on newer
# versions call .squeeze('columns') on the result of read_csv instead.
pd.read_csv('foo.csv', header=None, index_col=0, squeeze=True)\
  .to_json(orient='columns')
If you want to export a .json file
import pandas as pd

with open('exported_file.json', 'w') as f:
    pd.read_csv('foo.csv', header=None, index_col=0, squeeze=True)\
        .to_json(f, orient='columns')
I suppose that the CSV file contains sequential records about personnel in a "Label,Value" format, and you'd like to reorganize it into a separate record for each person, with the labels along the second dimension as column names. The output is to be stored as a JSON file.
If this is the case, then we can use pandas.DataFrame.pivot to change the structure of the data. But before that we have to group the labels by person. For this purpose, I will assume that the Name label is obligatory for each person, and each unique label appears at most once between Names:
import pandas as pd
from io import StringIO

data = '''Name,abc
Title,teacher
Email,abc.edu
Phone,000-000-0000
Office,21building
About,"abc is teacher"
Name,def
Title,plumber
Email,plumber#plumber.com
Phone,111-111-1111
Office,22building
About,"The best plumber in the town"
Name,ghi
Title,producer
Phone,333-333-3333
Office,33building
About,"The best producer"'''
df = pd.read_csv(StringIO(data), names=['label','value'])
df['grouper'] = (df['label'] == 'Name').cumsum()
df = df.pivot(index='grouper', columns='label', values='value')
Having the data in this shape, we can save it as:
df.to_json('test.json', orient='records', lines=True)
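For reference, the first line of test.json should look roughly like this (pivot sorts the column labels alphabetically, and a missing value such as ghi's Email comes out as null):
{"About":"abc is teacher","Email":"abc.edu","Name":"abc","Office":"21building","Phone":"000-000-0000","Title":"teacher"}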

Pandas error reading csv with double quotes

I've read all related topics - like this, this and this - but couldn't get a solution to work.
I have an input csv file like this:
ItemId,Content
i0000008,{"Title":"Edison Kinetoscopic Record of a Sneeze","Year":"1894","Rated":"N/A"}
i0000010,{"Title":"Employees, Leaving the Lumiére, Factory","Year":"1895","Rated":"N/A"}
I've tried several different approaches but couldn't get it to work. I want to read this csv file into a Dataframe like this:
ItemId Content
-------- -------------------------------------------------------------------------------
i0000008 {"Title":"Edison Kinetoscopic Record of a Sneeze","Year":"1894","Rated":"N/A"}
i0000010 {"Title":"Employees, Leaving the Lumiére, Factory","Year":"1895","Rated":"N/A"}
With following code (Python 3.9):
df = pd.read_csv('test.csv', sep=',', skipinitialspace = True, quotechar = '"')
As far as I understand, the commas inside the dictionary column are treated as regular separators even inside quotation marks, so it raises the following error:
pandas.errors.ParserError: Error tokenizing data. C error: Expected 4 fields in line 3, saw 6
Is it possible to produce desired result? Thanks.
The problem is that the commas in the Content column are interpreted as separators. You can work around this by using pd.read_fwf and manually setting the character positions on which to split:
df = pd.read_fwf('test.csv', colspecs=[(0, 8),(9,100)], header=0, names=['ItemId', 'Content'])
Result:
     ItemId                                                                           Content
0  i0000008  {"Title":"Edison Kinetoscopic Record of a Sneeze","Year":"1894","Rated":"N/A"}
1  i0000010  {"Title":"Employees, Leaving the Lumiére, Factory","Year":"1895","Rated":"N/A"}
I don't think you'll be able to read it directly with pandas, because the delimiter appears multiple times inside a single value; however, by reading it with plain Python and doing some processing, you can convert it to a pandas dataframe:
import pandas as pd

def splitValues(x):
    # Split only on the first comma; everything after it is the Content value.
    index = x.find(',')
    return x[:index], x[index+1:].strip()

with open('file.csv') as data:
    columns = next(data).strip().split(',')
    df = pd.DataFrame(columns=columns, data=(splitValues(row) for row in data))
OUTPUT:
ItemId Content
0 i0000008 {"Title":"Edison Kinetoscopic Record of a Sneeze","Year":"1894","Rated":"N/A"}
1 i0000010 {"Title":"Employees, Leaving the Lumiére, Factory","Year":"1895","Rated":"N/A"}
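If you then want the Content JSON expanded into real columns, here is a minimal sketch (assuming the df built above):
import json

# Parse each JSON string and spread its keys into columns.
content = pd.json_normalize(df['Content'].apply(json.loads).tolist())
df = pd.concat([df[['ItemId']], content], axis=1)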

Loading CSV into dataframe results in all records becoming "NaN"

I'm new to Python (and posting on SO), and I'm trying to use some code I wrote that worked in another, similar context to import data from a file into a MySQL table. To do that, I need to convert it to a dataframe. In this particular instance I'm using Federal Election Commission data that is pipe-delimited (it's the "Committee Master" data here). It looks like this:
C00000059|HALLMARK CARDS PAC|SARAH MOE|2501 MCGEE|MD #500|KANSAS CITY|MO|64108|U|Q|UNK|M|C||
C00000422|AMERICAN MEDICAL ASSOCIATION POLITICAL ACTION COMMITTEE|WALKER, KEVIN MR.|25 MASSACHUSETTS AVE, NW|SUITE 600|WASHINGTON|DC|200017400|B|Q||M|M|ALABAMA MEDICAL PAC|
C00000489|D R I V E POLITICAL FUND CHAPTER 886|JERRY SIMS JR|3528 W RENO||OKLAHOMA CITY|OK|73107|U|N||Q|L||
C00000547|KANSAS MEDICAL SOCIETY POLITICAL ACTION COMMITTEE|JERRY SLAUGHTER|623 SW 10TH AVE||TOPEKA|KS|666121627|U|Q|UNK|Q|M|KANSAS MEDICAL SOCIETY|
C00000729|AMERICAN DENTAL ASSOCIATION POLITICAL ACTION COMMITTEE|DI VINCENZO, GIORGIO T. DR.|1111 14TH STREET, NW|SUITE 1100|WASHINGTON|DC|200055627|B|Q|UNK|M|M|INDIANA DENTAL PAC|
When I run this code, all of the records come back "NaN."
import pandas as pd
import pymysql
print('convert CSV to dataframe')
data = pd.read_csv ('Desktop/Python/FECupdates/cm.txt', delimiter='|')
df = pd.DataFrame(data, columns=['CMTE_ID','CMTE_NM','TRES_NM','CMTE_ST1','CMTE_ST2','CMTE_CITY','CMTE_ST','CMTE_ZIP','CMTE_DSGN','CMTE_TP','CMTE_PTY_AFFILIATION','CMTE_FILING_FREQ','ORG_TP','CONNECTED_ORG_NM','CAND_ID'])
print(df.head(10))
If I remove the dataframe part and just do this, it displays the data, so it doesn't seem like it's a problem with the file itself (but what do I know?):
import pandas as pd
import pymysql
print('convert CSV to dataframe')
data = pd.read_csv ('Desktop/Python/FECupdates/cm.txt', delimiter='|')
print(data.head(10))
I've spent hours looking at different questions here that seem to address similar issues, where the problems apparently stemmed from things like the encoding or different kinds of delimiters, but each time I try to make the same changes to my code I get the same result. I've also converted the whole thing to a CSV by changing all the commas in fields to "$" and then changing the pipes to commas. It still shows up as all "NaN", even though the number of records is correct if I upload it to MySQL (they're just all empty).
The names in your columns list don't match the labels pandas inferred from the file, so pd.DataFrame(data, columns=...) reindexes to those unknown labels and fills every column with NaN. Pandas can recognize the columns automatically:
import pandas as pd

print('convert CSV to dataframe')
data = pd.read_csv('cm.txt', delimiter='|')
df = pd.DataFrame(data)
print(df.head(10))
Also, you can create an empty dataframe and concatenate the file you read:
import pandas as pd

print('convert CSV to dataframe')
data = pd.read_csv('cm.txt', delimiter='|')
data2 = pd.DataFrame()
df = pd.concat([data, data2], ignore_index=True)
print(df.head(10))
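If you do want the FEC column names, note that cm.txt ships without a header row, so a cleaner route (a sketch, reusing the names from the question) is to pass them straight to read_csv:
import pandas as pd

cols = ['CMTE_ID','CMTE_NM','TRES_NM','CMTE_ST1','CMTE_ST2','CMTE_CITY',
        'CMTE_ST','CMTE_ZIP','CMTE_DSGN','CMTE_TP','CMTE_PTY_AFFILIATION',
        'CMTE_FILING_FREQ','ORG_TP','CONNECTED_ORG_NM','CAND_ID']
# header=None stops pandas from consuming the first data row as a header.
df = pd.read_csv('Desktop/Python/FECupdates/cm.txt', delimiter='|',
                 header=None, names=cols)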
Try this, it worked for me:
path = 'Desktop/Python/FECupdates/'
# header=None: the FEC file has no header row
df = pd.read_csv(path + 'cm.txt', header=None, encoding='unicode_escape', sep='|')
df = df.apply(lambda x: x.str.strip() if x.dtype == 'object' else x)
df.columns = ['CMTE_ID','CMTE_NM','TRES_NM','CMTE_ST1','CMTE_ST2','CMTE_CITY','CMTE_ST','CMTE_ZIP','CMTE_DSGN','CMTE_TP','CMTE_PTY_AFFILIATION','CMTE_FILING_FREQ','ORG_TP','CONNECTED_ORG_NM','CAND_ID']
df.head(200)

Reading data with more columns than expected into a dataframe

I have a number of .csv files that I download into a directory.
Each .csv is supposed to have 3 columns of information. The head of one of these files looks like:
17/07/2014,637580,10.755
18/07/2014,61996,10.8497
21/07/2014,126758,10.8208
22/07/2014,520926,10.8201
23/07/2014,370843,9.2883
The code that I am using to read the .csv into a dataframe (df) is:
df = pd.read_csv(adj_directory+'\\'+filename, error_bad_lines=False,names=['DATE', 'PX', 'RAW'])
Where I name the three columns (DATE, PX and RAW).
This works fine when the file is formatted correctly. However, I have noticed that sometimes the .csv has a slightly different format, for example:
09/07/2014,26268315,,
10/07/2014,6601181,16.3857
11/07/2014,916651,12.5879
14/07/2014,213357,,
15/07/2014,205019,10.8607
where there is a column value missing and an extra comma appears in the values place. This means that the file fails to load into the dataframe (the df dataframe is empty).
Is there a way to read the data into a dataframe with the extra comma (ignoring the offending row) so the df would look like:
09/07/2014,26268315,NaN
10/07/2014,6601181,16.3857
11/07/2014,916651,12.5879
14/07/2014,213357,NaN
15/07/2014,205019,10.8607
Probably best to fix the file upstream so that missing values aren't filled with a ,. But if necessary, you can correct the file in Python by replacing ,, with just , line by line. Taking your bad file as test.csv:
import csv
import re

patt = re.compile(r",,")

# Collapse the doubled comma left by the missing value, reparse each
# line, and write the corrected rows back out.
with open('test.csv') as f, open('corrected.csv', 'w') as f2:
    for line in csv.reader(map(lambda s: patt.sub(',', s), f)):
        f2.write(','.join(str(x) for x in line))
        f2.write('\n')
Output: corrected.csv
09/07/2014,26268315,
10/07/2014,6601181,16.3857
11/07/2014,916651,12.5879
14/07/2014,213357,
15/07/2014,205019,10.8607
Then you should be able to read in this file without issue
import pandas as pd
df = pd.read_csv('corrected.csv', names=['DATE', 'PX', 'RAW'])
DATE PX RAW
0 09/07/2014 26268315 NaN
1 10/07/2014 6601181 16.3857
2 11/07/2014 916651 12.5879
3 14/07/2014 213357 NaN
4 15/07/2014 205019 10.8607
Had this problem yesterday.
Have you tried:
pd.read_csv(adj_directory+'\\'+filename,
            error_bad_lines=False, names=['DATE', 'PX', 'RAW'],
            keep_default_na=False,
            na_values=[''])
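Note that error_bad_lines was deprecated in pandas 1.3 and removed in 2.0; on newer versions the equivalent call (a sketch with the same file and column names as above) is:
import pandas as pd

df = pd.read_csv(adj_directory + '\\' + filename,
                 on_bad_lines='skip',   # replaces error_bad_lines=False
                 names=['DATE', 'PX', 'RAW'],
                 keep_default_na=False,
                 na_values=[''])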
