How to transform a JSON SList to a pandas DataFrame? - python

a = ['{"type": "book",',
'"title": "sometitle",',
'"author": [{"name": "somename"}],',
'"year": "2000",',
'"identifier": [{"type": "ISBN", "id": "1234567890"}],',
'"publisher": "somepublisher"}', '',
'{"type": "book",', '
'"title": "sometitle2",',
'"author": [{"name": "somename2"}],',
'"year": "2001",',
'"identifier": [{"type": "ISBN", "id": "1234567890"}],',
'"publisher": "somepublisher"}', '']
I have this convoluted SList (IPython's string-list type) and ultimately I would like to get it into a tidy pandas DataFrame.
I have tried a number of things, for example:
i = iter(a)
b = dict(zip(i, i))
Unfortunately, this creates a dictionary that looks even worse:
{'{"type": "book",':
...
Where I had an SList of dictionaries, I now have a dictionary of dictionaries.
I also tried
pd.json_normalize(a)
but this throws an error message AttributeError: 'str' object has no attribute 'values'
I also tried
r = json.dumps(a.l)
loaded_r = json.loads(r)
print(loaded_r)
but this yields a list
['{"type": "book",',
...
Again, in the end I'd like to have a pandas dataframe like this
type  title       author     year  ...
book  sometitle   somename   2000  ...
book  sometitle2  somename2  2001
Obviously, I haven't really gotten to the point where I can feed the data to a pandas function. Every time I tried, the functions screamed at me...

a = ['{"type": "book",',
'"title": "sometitle",',
'"author": [{"name": "somename"}],',
'"year": "2000",',
'"identifier": [{"type": "ISBN", "id": "1234567890"}],',
'"publisher": "somepublisher"}', '',
'{"type": "book",',
'"title": "sometitle2",',
'"author": [{"name": "somename2"}],',
'"year": "2001",',
'"identifier": [{"type": "ISBN", "id": "1234567890"}],',
'"publisher": "somepublisher"}', '']
b = "[%s]" % ''.join([',' if i == '' else i for i in a ]).strip(',')
data = json.loads(b)
df = pd.DataFrame(data)
print(df)
type title author year \
0 book sometitle [{'name': 'somename'}] 2000
1 book sometitle2 [{'name': 'somename2'}] 2001
identifier publisher
0 [{'type': 'ISBN', 'id': '1234567890'}] somepublisher
1 [{'type': 'ISBN', 'id': '1234567890'}] somepublisher
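If you also want the nested author and identifier fields flattened into plain columns (so the frame matches the tidy layout in the question), here is a minimal sketch, assuming each book has exactly one author and one identifier:
import json
import pandas as pd

b = "[%s]" % ''.join([',' if i == '' else i for i in a]).strip(',')
data = json.loads(b)

# pull the single-element lists up to scalars, then normalize
flat = [{**d,
         'author': d['author'][0]['name'],
         'identifier': d['identifier'][0]['id']}
        for d in data]
print(pd.json_normalize(flat))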

Related

How do I convert my text file to nested JSON in Python

I have a text file which I want to convert to a nested JSON structure. The text file is:
Report_for Reconciliation
Execution_of application_1673496470638_0001
Spark_version 2.4.7-amzn-0
Java_version 1.8.0_352 (Amazon.com Inc.)
Start_time 2023-01-12 09:45:13.360000
Spark Properties:
Job_ID 0
Submission_time 2023-01-12 09:47:20.148000
Run_time 73957ms
Result JobSucceeded
Number_of_stages 1
Stage_ID 0
Number_of_tasks 16907
Number_of_executed_tasks 16907
Completion_time 73207ms
Stage_executed parquet at RawDataPublisher.scala:53
Job_ID 1
Submission_time 2023-01-12 09:48:34.177000
Run_time 11525ms
Result JobSucceeded
Number_of_stages 2
Stage_ID 1
Number_of_tasks 16907
Number_of_executed_tasks 0
Completion_time 0ms
Stage_executed parquet at RawDataPublisher.scala:53
Stage_ID 2
Number_of_tasks 300
Number_of_executed_tasks 300
Completion_time 11520ms
Stage_executed parquet at RawDataPublisher.scala:53
Job_ID 2
Submission_time 2023-01-12 09:48:46.908000
Run_time 218358ms
Result JobSucceeded
Number_of_stages 1
Stage_ID 3
Number_of_tasks 1135
Number_of_executed_tasks 1135
Completion_time 218299ms
Stage_executed parquet at RawDataPublisher.scala:53
I want the output to be:
{
    "Report_for": "Reconciliation",
    "Execution_of": "application_1673496470638_0001",
    "Spark_version": "2.4.7-amzn-0",
    "Java_version": "1.8.0_352 (Amazon.com Inc.)",
    "Start_time": "2023-01-12 09:45:13.360000",
    "Job_ID 0": {
        "Submission_time": "2023-01-12 09:47:20.148000",
        "Run_time": "73957ms",
        "Result": "JobSucceeded",
        "Number_of_stages": "1",
        "Stage_ID 0": {
            "Number_of_tasks": "16907",
            "Number_of_executed_tasks": "16907",
            "Completion_time": "73207ms",
            "Stage_executed": "parquet at RawDataPublisher.scala:53"
        }
    }
}
I tried the defaultdict approach, but it generated JSON with lists as values, which I can't build a table on. Here's what I did:
import json
from collections import defaultdict

INPUT = 'demofile.txt'
dict1 = defaultdict(list)

def convert():
    with open(INPUT) as f:
        for line in f:
            command, description = line.strip().split(None, 1)
            dict1[command].append(description.strip())

convert()
OUTPUT = open("demo1file.json", "w")
json.dump(dict1, OUTPUT, indent=4, sort_keys=False)
and was getting this:
"Report_for": [ "Reconciliation" ],
"Execution_of": [ "application_1673496470638_0001" ],
"Spark_version": [ "2.4.7-amzn-0" ],
"Java_version": [ "1.8.0_352 (Amazon.com Inc.)" ],
"Start_time": [ "2023-01-12 09:45:13.360000" ],
"Job_ID": [
"0",
"1",
"2", ....
]]]
I just want to convert my text to the above json format so that I can build a table on top of it.
There's no way Python or one of its libraries can figure out your nesting requirements if it's given flat text as input. How should it know that Stages are inside Jobs, for example?
You will have to tell your application programmatically how the structure works.
I hacked together an example which should work; you can go from there (assuming input_str is what you posted as your file content):
# define your nesting structure
nesting = {'Job_ID': {'Stage_ID': {}}}
upper_nestings = []
upper_nesting_keys = []

# your resulting dictionary
result_dict = {}

# your "working" dictionaries
current_nesting = nesting
working_dict = result_dict

# parse each line of the input string
for line_str in input_str.split('\n'):
    # the key is the first word; the value is all following words
    line = line_str.split(' ')
    # if the key is in nesting, create a new sub-dict; all following entries belong to it
    if line[0] in current_nesting.keys():
        current_nesting = current_nesting[line[0]]
        upper_nestings.append(line[0])
        upper_nesting_keys.append(line[1])
        working_dict[line_str] = {}
        working_dict = working_dict[line_str]
    else:
        # if a new "parallel" or "upper" nesting is detected, reset your nesting structure
        if line[0] in upper_nestings:
            nests = upper_nestings[:upper_nestings.index(line[0])]
            keys = upper_nesting_keys[:upper_nestings.index(line[0])]
            working_dict = result_dict
            for nest in nests:
                working_dict = working_dict[' '.join([nest, keys[nests.index(nest)]])]
            upper_nestings = upper_nestings[:upper_nestings.index(line[0]) + 1]
            upper_nesting_keys = upper_nesting_keys[:upper_nestings.index(line[0])]
            upper_nesting_keys.append(line[1])
            current_nesting = nesting
            for nest in upper_nestings:
                current_nesting = current_nesting[nest]
            working_dict[line_str] = {}
            working_dict = working_dict[line_str]
            continue
        working_dict[line[0]] = ' '.join(line[1:])

print(result_dict)
Results in:
{
    'Report_for': 'Reconciliation',
    'Execution_of': 'application_1673496470638_0001',
    'Spark_version': '2.4.7-amzn-0',
    'Java_version': '1.8.0_352 (Amazon.com Inc.)',
    'Start_time': '2023-01-12 09:45:13.360000',
    'Spark': 'Properties: ',
    'Job_ID 0': {
        'Submission_time': '2023-01-12 09:47:20.148000',
        'Run_time': '73957ms',
        'Result': 'JobSucceeded',
        'Number_of_stages': '1',
        'Stage_ID 0': {
            'Number_of_tasks': '16907',
            'Number_of_executed_tasks': '16907',
            'Completion_time': '73207ms',
            'Stage_executed': 'parquet at RawDataPublisher.scala:53'
        }
    },
    'Job_ID 1': {
        'Submission_time': '2023-01-12 09:48:34.177000',
        'Run_time': '11525ms',
        'Result': 'JobSucceeded',
        'Number_of_stages': '2',
        'Stage_ID 1': {
            'Number_of_tasks': '16907',
            'Number_of_executed_tasks': '0',
            'Completion_time': '0ms',
            'Stage_executed': 'parquet at RawDataPublisher.scala:53'
        },
        'Stage_ID 2': {
            'Number_of_tasks': '300',
            'Number_of_executed_tasks': '300',
            'Completion_time': '11520ms',
            'Stage_executed': 'parquet at RawDataPublisher.scala:53'
        }
    },
    'Job_ID 2': {
        'Submission_time': '2023-01-12 09:48:46.908000',
        'Run_time': '218358ms',
        'Result': 'JobSucceeded',
        'Number_of_stages': '1',
        'Stage_ID 3': {
            'Number_of_tasks': '1135',
            'Number_of_executed_tasks': '1135',
            'Completion_time': '218299ms',
            'Stage_executed': 'parquet at RawDataPublisher.scala:53'
        }
    }
}
and should pretty much be generically usable for all kinds of nesting definitions from a flat input. Let me know if it works for you!
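Since the end goal is a table, here is a minimal sketch of my own (not part of the original answer) that flattens result_dict into one row per stage with pandas, assuming the 'Job_ID n' / 'Stage_ID m' key naming produced above:
import pandas as pd

rows = []
for job_key, job in result_dict.items():
    if not job_key.startswith('Job_ID'):
        continue  # skip the top-level scalar entries
    for stage_key, stage in job.items():
        if not stage_key.startswith('Stage_ID'):
            continue  # skip the job-level scalar entries
        rows.append({'Job_ID': job_key.split()[1],
                     'Stage_ID': stage_key.split()[1],
                     **stage})

df = pd.DataFrame(rows)
print(df)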

Flattening multi-nested JSON into a pandas DataFrame

I'm trying to flatten this JSON response into a pandas DataFrame to export to CSV.
It looks like this:
j = [
    {
        "id": 401281949,
        "teams": [
            {
                "school": "Louisiana Tech",
                "conference": "Conference USA",
                "homeAway": "away",
                "points": 34,
                "stats": [
                    {"category": "rushingTDs", "stat": "1"},
                    {"category": "puntReturnYards", "stat": "24"},
                    {"category": "puntReturnTDs", "stat": "0"},
                    {"category": "puntReturns", "stat": "3"},
                ],
            }
        ],
    }
]
...Many more items in the stats area.
If I run this and flatten to the teams level:
multiple_level_data = pd.json_normalize(j, record_path=['teams'])
I get:
school conference homeAway points stats
0 Louisiana Tech Conference USA away 34 [{'category': 'rushingTDs', 'stat': '1'}, {'ca...
How do I flatten it twice so that all of the stats are on their own column in each row?
If I do this:
multiple_level_data = pd.json_normalize(j, record_path=['teams'])
multiple_level_data = multiple_level_data.explode('stats').reset_index(drop=True)
multiple_level_data = multiple_level_data.join(pd.json_normalize(multiple_level_data.pop('stats')))
I end up with multiple rows instead of more columns.
You can try:
df = pd.DataFrame(j).explode("teams")
df = pd.concat([df, df.pop("teams").apply(pd.Series)], axis=1)
df["stats"] = df["stats"].apply(lambda x: {d["category"]: d["stat"] for d in x})
df = pd.concat(
[
df,
df.pop("stats").apply(pd.Series),
],
axis=1,
)
print(df)
Prints:
id school conference homeAway points rushingTDs puntReturnYards puntReturnTDs puntReturns
0 401281949 Louisiana Tech Conference USA away 34 1 24 0 3
Can you try this:
multiple_level_data = pd.json_normalize(j, record_path=['teams'])
multiple_level_data = multiple_level_data.explode('stats').reset_index(drop=True)
multiple_level_data = multiple_level_data.join(pd.json_normalize(multiple_level_data.pop('stats')))
# convert rows to columns
multiple_level_data = multiple_level_data.set_index(multiple_level_data.columns[0:4].to_list())
dfx = multiple_level_data.pivot_table(values='stat', columns='category', aggfunc=list).apply(pd.Series.explode).reset_index(drop=True)
multiple_level_data = multiple_level_data.reset_index().drop(['stat', 'category'], axis=1).drop_duplicates().reset_index(drop=True)
multiple_level_data = multiple_level_data.join(dfx)
Output:
           school      conference homeAway  points  puntReturnTDs  puntReturnYards  puntReturns  rushingTDs
0  Louisiana Tech  Conference USA     away      34              0               24            3           1
Instead of calling explode() on the output of json_normalize(), you can explicitly pass the paths to the metadata for each column in a single json_normalize() call. For example, ['teams', 'school'] would be one path, ['teams', 'conference'] another, etc. This creates a long dataframe similar to what you already have.
Then you can call pivot() to reshape this output into the correct shape.
# normalize json
df = pd.json_normalize(
    j, record_path=['teams', 'stats'],
    meta=['id', *(['teams', c] for c in ('school', 'conference', 'homeAway', 'points'))]
)
# column names contain a 'teams.' prefix; remove it
df.columns = [c.split('.')[1] if '.' in c else c for c in df]
# pivot the intermediate result (keyword arguments are required since pandas 2.0)
df = (
    df.astype({'points': int, 'id': int})
    .pivot(index=['id', 'school', 'conference', 'homeAway', 'points'],
           columns='category', values='stat')
    .reset_index()
)
# remove index name
df.columns.name = None
df

How to pass a list-like to .reindex now that doing it with .loc is deprecated?

I have a dataframe with multiple fields, and I want to use some column values to recreate a new dataframe as a JSON object:
Street  City         State  Zip_Code
24 St.  Kansas City  KS     12345-213
...     ...          ...    ...
In order to do so, I was using .loc and .apply like this in Python:
def address_x(vals):
    # state() and postal_code() are helper functions defined elsewhere
    val = {
        'street': None if not str(vals[0]) else vals[0],
        'city': None if not str(vals[1]) else vals[1],
        'state': None if not str(vals[2]) else state(vals[2]),
        'postal_code': postal_code(str(vals[3]))
    }
    return val

def transform(dataset):
    df = pd.DataFrame()
    df['address'] = dataset.loc[['Street', 'City', 'State', 'Zip_Code']].apply(address_x, axis=1)
    return df

obj = s3client.get_object(Bucket=bucket, Key=key)
new_df = transform(pd.read_csv(io.BytesIO(obj['Body'].read()), delimiter='|'))
new_df.to_json('TEST.json', orient='records', lines=True)
That gives me this error message: KeyError: 'Passing list-likes to .loc or [] with any missing labels is no longer supported, see https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike'
I tried df['address'] = dataset.reindex(['STREET', 'CITY', 'STATE', 'ZIP CODE']).apply(lambda x: address_x(x)), but it just stores all values as null instead of this:
{"address":{
"street": "24 St.",
"city": "Kansas City",
"state": "Kansas",
"postal_code": 12345-213}
}
The input is a regular CSV file that uses '|' as the separator; it has more columns than the four shown in the example above.
I then store it as JSON, and currently the output looks like {"address":{"street":null,"city":null,"state":null,"postal_code":null}} for each record, instead of being populated with the CSV values.
Change to the following. The error comes from dataset.loc[['Street', 'City', 'State', 'Zip_Code']], which looks those labels up in the row index; select the columns instead, and index vals by column name:
def address_x(vals):
    val = {
        'street': None if not str(vals['Street']) else vals['Street'],
        'city': None if not str(vals['City']) else vals['City'],
        'state': None if not str(vals['State']) else state(vals['State']),
        'postal_code': postal_code(str(vals['Zip_Code']))
    }
    return val

df['address'] = dataset[['Street', 'City', 'State', 'Zip_Code']].apply(address_x, axis=1)
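If you specifically want .reindex, as in the question title, pass the labels as columns; a short sketch of the equivalent (with no axis given, .reindex treats the list as row labels, which is one reason everything came back null):
# select on the column axis; labels that don't exist become all-NaN columns
subset = dataset.reindex(columns=['Street', 'City', 'State', 'Zip_Code'])
df['address'] = subset.apply(address_x, axis=1)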

Creating a column for each output received in one field in Python

I am implementing emotion analysis using an LSTM. I have already finished the training and prediction parts, but my predictions all end up in one column, as shown below.
Here is my code:
import json
import re
import string

import keras
import pandas as pd

# (word2id, id2label, label2id, max_words, remove_stopwords and
#  model_with_attentions are defined earlier in the script)

with open('output1.json', 'w') as f:
    json.dump(new_data, f)

selection1 = new_data['selection1']

# creating empty lists to be able to create a dataframe
names = []
dates = []
commentss = []
labels = []
hotelname = []

for item in selection1:
    name = item['name']
    hotelname.append(name)
    for d in item['reviews']:
        names.append(name)
        # convert date from 'January 12, 2020' to 2020-01-12
        date = pd.to_datetime(d['date']).strftime("%Y-%m-%d")
        dates.append(date)
    for com in item['reviews']:
        comment = com['review']
        lcomment = comment.lower()  # converting all to lowercase
        result = re.sub(r'\d+', '', lcomment)  # remove numbers
        results = (result.translate(
            str.maketrans('', '', string.punctuation))).strip()  # remove punctuation and whitespace
        comments = remove_stopwords(results)
        commentss.append(comment)
        # add the words in comments that are already present in the keys of the dictionary
        encoded_samples = [[word2id[word] for word in comments if word in word2id.keys()]]
        # padding
        encoded_samples = keras.preprocessing.sequence.pad_sequences(encoded_samples, maxlen=max_words)
        # make predictions
        label_probs, attentions = model_with_attentions.predict(encoded_samples)
        label_probs = {id2label[_id]: prob for (label, _id), prob in zip(label2id.items(), label_probs[0])}
        labels.append(label_probs)

# creating dataframe
dataframe = {'name': names, 'date': dates, 'comment': commentss, 'classification': labels}
table = pd.DataFrame(dataframe, columns=['name', 'date', 'comment', 'classification'])
table.to_json('hotel.json', orient='records')  # to_json returns None when given a path
Here are the results I obtain:
[
    {
        "name": "Radisson Blu Azuri Resort & Spa",
        "date": "February 02, 2020",
        "comment": [
            "enjoy",
            "daily",
            "package",
            "start",
            "welcoming",
            "end",
            "recommend",
            "hotel"
        ],
        "label": {
            "joy": 0.0791392997,
            "surprise": 0.0002606699,
            "love": 0.4324670732,
            "sadness": 0.2866959572,
            "fear": 0.0002588668,
            "anger": 0.2011781186
        }
    },
you can find the complete output on this link: https://jsonblob.com/a9b4035c-5576-11ea-afe8-1d95b3a2e3fd
Is it possible to break the label field into separate fields like below?
[
    {
        "name": "Radisson Blu Azuri Resort & Spa",
        "date": "February 02, 2020",
        "comment": [
            "enjoy",
            "daily",
            "package",
            "start",
            "welcoming",
            "end",
            "recommend",
            "hotel"
        ],
        "joy": 0.0791392997,
        "surprise": 0.0002606699,
        "love": 0.4324670732,
        "sadness": 0.2866959572,
        "fear": 0.0002588668,
        "anger": 0.2011781186
    },
Can someone please explain how I need to modify my code to make this possible?
If you can't do it before you produce the result, you can easily manipulate that dictionary like so:
def move_labels_to_dict_root(result):
    labels = result["labels"]
    meta_data = result
    del meta_data["labels"]
    result = {**meta_data, **labels}
    return result
and then call move_labels_to_dict_root in a list comprehension like [move_labels_to_dict_root(result) for result in results].
However, I would ask why you want to do this?
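Since the records already come out of a DataFrame, an alternative sketch (my addition, assuming the table and its classification column built above) is to expand the dict column into real columns before exporting:
import pandas as pd

# each cell in 'classification' is an {emotion: probability} dict;
# apply(pd.Series) turns those keys into their own columns
table = table.join(table.pop('classification').apply(pd.Series))
table.to_json('hotel.json', orient='records')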

How to conditionally select elements in a list comprehension?

I couldn't find any examples that match my use case. I'm still working my way through Python lists and dictionaries.
Problem:
all_cars = {'total_count': 3,
            'cars': [{'name': 'audi', 'model': 'S7'},
                     {'name': 'honda', 'model': 'accord'},
                     {'name': 'jeep', 'model': 'wrangler'}]}
owners = {'users': [{'owner': 'Nick', 'car': 'audi'},
                    {'owner': 'Jim', 'car': 'ford'},
                    {'owner': 'Mike', 'car': 'mercedes'}]}
def duplicate():
    for c in all_cars['cars']:
        if c['name'] == [c['users'] for c in owners['users']]:
            pass
        else:
            res = print(c['name'])
    return res

output = ['honda', 'jeep', 'audi']
and
def duplicate():
    for c in all_cars['cars']:
        if c['name'] == 'audi':
            pass
        else:
            res = print(c['name'])
    return res

output = ['honda', 'jeep']
I am trying to find matching values in both dictionaries, using list comprehension, then return non-matching values only.
Solution: Using 'in' rather than the '==' operator, I was able to compare values between both lists and skip duplicates.
def duplicate():
    for c in all_cars['cars']:
        # compare against the owners' car names (note: the key is 'car')
        if c['name'] in [o['car'] for o in owners['users']]:
            pass
        else:
            res = print(c['name'])
    return res
To answer the question in your title, you can conditionally add elements during a list comprehension using the syntax [x for y in z if y == a], where y == a is any condition you need - if the condition evaluates to True, then the element y will be added to the list, otherwise it will not.
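Applied to the data in the question, the whole duplicate() function collapses into one comprehension; a minimal sketch (using a set of the owners' cars so the membership test is cheap):
owned = {o['car'] for o in owners['users']}
non_matching = [c['name'] for c in all_cars['cars'] if c['name'] not in owned]
print(non_matching)  # ['honda', 'jeep']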
I would just keep a dictionary of all of the owner data together:
ownerData = { "Shaft" : {
"carMake" : "Audi",
"carModel" : "A8",
"year" : "2015" },
"JamesBond" : {
"carMake" : "Aston",
"carModel" : "DB8",
"year" : "2012" },
"JeffBezos" : {
"carMake" : "Honda",
"carModel" : "Accord"
"year" : "1989"}
}
Now you can loop through and query it something like this:
for owner, info in ownerData.items():
    if "Audi" in info["carMake"]:
        print("Owner %s drives a %s %s %s" % (owner, info["year"], info["carMake"], info["carModel"]))
Should output:
"Owner Shaft drives a 2015 Audi A8"
This way you can expand your data set for owners without creating multiple lists.
OK, based on your feedback on the solution above, here is how I would tackle your problem. Drop your common items into lists and then use "set" to print out the diff.
all_cars = {'total_count': 3,
            'cars': [{'name': 'audi', 'model': 'S7'},
                     {'name': 'honda', 'model': 'accord'},
                     {'name': 'jeep', 'model': 'wrangler'}]}
owners = {'users': [{'owner': 'Nick', 'car': 'audi'},
                    {'owner': 'Jim', 'car': 'ford'},
                    {'owner': 'Mike', 'car': 'mercedes'}]}

allCarList = []
ownerCarList = []
for auto in all_cars['cars']:
    thisCar = auto['name']
    if thisCar not in allCarList:
        allCarList.append(thisCar)
for o in owners['users']:
    thisCar = o['car']
    if thisCar not in ownerCarList:
        ownerCarList.append(thisCar)

diff = list(set(allCarList) - set(ownerCarList))
print(diff)
I put this in and ran it and came up with this output:
['jeep', 'honda']
Hope that helps!
