Convert http text response to pandas dataframe [duplicate] - python

This question already has answers here: Convert Python dict into a dataframe and JSON to pandas DataFrame.
I want to convert the text below into a pandas DataFrame. Is there a pre-built or built-in pandas parser I can use for the conversion? I can write a custom parsing function, but I'd like to know whether a pre-built and/or fast solution already exists.
In this example, the DataFrame should end up with two rows, one each for ABC and PQR:
{
"data": [
{
"ID": "ABC",
"Col1": "ABC_C1",
"Col2": "ABC_C2"
},
{
"ID": "PQR",
"Col1": "PQR_C1",
"Col2": "PQR_C2"
}
]
}

You've listed everything you need as tags. Use json.loads to get a dict from the string, then build the DataFrame from the "data" list:
import json
import pandas as pd
d = json.loads('''{
"data": [
{
"ID": "ABC",
"Col1": "ABC_C1",
"Col2": "ABC_C2"
},
{
"ID": "PQR",
"Col1": "PQR_C1",
"Col2": "PQR_C2"
}
]
}''')
df = pd.DataFrame(d['data'])
print(df)
Output:
ID Col1 Col2
0 ABC ABC_C1 ABC_C2
1 PQR PQR_C1 PQR_C2
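Since the text comes from an HTTP response, here is a minimal sketch of going straight from the response to a DataFrame; it assumes the requests library and a hypothetical URL, so adjust both to your setup.
import pandas as pd
import requests

# hypothetical endpoint returning the JSON shown above
response = requests.get("https://example.com/api/data")

# response.json() already parses the body into a dict, so json.loads isn't needed
payload = response.json()

# build the DataFrame from the list under the "data" key
df = pd.DataFrame(payload["data"])
# pd.json_normalize(payload, record_path="data") gives the same two-row result
print(df)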

Related

Explode function

This is my first question on here. I have searched around on here and throughout the web and I seem unable to find the answer to my question. I'm trying to explode a list in a JSON file out into multiple columns and rows. Everything I have tried so far has proven unsuccessful.
I am doing this over multiple JSON files within a directory, so that the dataframe prints out like this.
Goal: one row per fusage entry, each row repeating the top-level fields, with the columns
did  Version  Nodes  rds  time  c  sc  f  uc
Instead I get this in my dataframe: one row per file with the columns
did  Version  Nodes  rds  fusage
where everything in fusage ends up crammed into the single fusage column.
Example of the JSON I'm working with (the JSON structure will not change):
{
"did": "123456789",
"mId": "1a2b3cjsks",
"timestamp": "2021-11-26T11:10:58.322000",
"beat": {
"did": "123456789",
"collectionTime": "2010-05-26 11:10:58.004783+00",
"Nodes": 6,
"Version": "v1.4.6-2",
"rds": "0.00B",
"fusage": [
{
"time": "2010-05-25",
"c": "string",
"sc": "string",
"f": "string",
"uc": "int"
},
{
"time": "2010-05-19",
"c": "string",
"sc": "string",
"f": "string",
"uc": "int"
},
{
"t": "2010-05-23",
"c": "string",
"sc": "string",
"f": "string",
"uc": "int"
},
{
"time": "2010-05-23",
"c": "string",
"sc": "string",
"f": "string",
"uc": "int"
}
]
}
}
My end goal is to get the dataframe out to a CSV in order to be ingested. I appreciate everyone's help looking at this.
Using Python 3.8.10 & pandas 1.3.4.
Python code below:
import csv
import glob
import json
import os
import pandas as pd

tempdir = '/dir/to/files/json_temp'
json_files = os.path.join(tempdir, '*.json')
file_list = glob.glob(json_files)
dfs = []
for file in file_list:
    with open(file) as f:
        data = pd.json_normalize(json.loads(f.read()))
        dfs.append(data)
df = pd.concat(dfs, ignore_index=True)
df.explode('fusage')
print(df)
If you're going to use the explode function, then after exploding, apply pd.Series over the column containing the fusage list (beat.fusage) to obtain a Series for each list item.
/dir/to/files
├── example-v1.4.6-2.json
└── example-v2.2.2-2.json
...
for file in file_list:
    with open(file) as f:
        data = pd.json_normalize(json.loads(f.read()))
        dfs.append(data)
df = pd.concat(dfs, ignore_index=True)

fusage_list = df.explode('beat.fusage')['beat.fusage'].apply(pd.Series)
df = pd.concat([df, fusage_list], axis=1)
# show desired columns
df = df[['did', 'beat.Version', 'beat.Nodes', 'beat.rds', 'time', 'c', 'sc', 'f', 'uc']]
print(df)
Output from df
did beat.Version beat.Nodes beat.rds time c sc f uc
0 123456789 v1.4.6-2 6 0.00B 2010-05-25 string string string int
0 123456789 v1.4.6-2 6 0.00B 2010-05-19 string string string int
0 123456789 v1.4.6-2 6 0.00B NaN string string string int
0 123456789 v1.4.6-2 6 0.00B 2010-05-23 string string string int
1 123777777 v2.2.2-2 4 0.00B 2010-05-25 string string string int
1 123777777 v2.2.2-2 4 0.00B 2010-05-19 string string string int
1 123777777 v2.2.2-2 4 0.00B NaN string string string int
1 123777777 v2.2.2-2 4 0.00B 2010-05-23 string string string int
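An alternative sketch that skips the explode step entirely by pointing pd.json_normalize at the nested list. record_path and meta are standard pandas arguments, but the column names assume the exact structure of the example file, and 'output.csv' is just a placeholder name.
import glob
import json
import os
import pandas as pd

tempdir = '/dir/to/files/json_temp'
file_list = glob.glob(os.path.join(tempdir, '*.json'))

dfs = []
for file in file_list:
    with open(file) as f:
        record = json.load(f)
    # record_path turns each fusage entry into its own row,
    # meta repeats the top-level fields alongside it
    dfs.append(pd.json_normalize(
        record,
        record_path=['beat', 'fusage'],
        meta=['did', ['beat', 'Version'], ['beat', 'Nodes'], ['beat', 'rds']],
        errors='ignore',
    ))

df = pd.concat(dfs, ignore_index=True)
df.to_csv('output.csv', index=False)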

Multi-level Python Dict to Pandas DataFrame only processes one level out of many

I'm parsing some XML data, doing some logic on it, and trying to display the results in an HTML table. The dictionary, after filling, looks like this:
{
"general_info": {
"name": "xxx",
"description": "xxx",
"language": "xxx",
"prefix": "xxx",
"version": "xxx"
},
"element_count": {
"folders": 23,
"conditions": 72,
"listeners": 1,
"outputs": 47
},
"external_resource_count": {
"total": 9,
"extensions": {
"jar": 8,
"json": 1
},
"paths": {
"/lib": 9
}
},
"complexity": {
"over_1_transition": {
"number": 4,
"percentage": 30.769
},
"over_1_trigger": {
"number": 2,
"percentage": 15.385
},
"over_1_output": {
"number": 4,
"percentage": 30.769
}
}
}
Then I'm using pandas to convert the dictionary into a table, like so:
data_frame = pandas.DataFrame.from_dict(data=extracted_metrics, orient='index').stack().to_frame()
The result is a table that is mostly correct:
While the first and second levels seem to render correctly, those categories with a sub-sub category get written as a string in the cell, rather than as a further column. I've also tried using stack(level=1) but it raises an error "IndexError: Too many levels: Index has only 1 level, not 2". I've also tried making it into a series with no luck. It seems like it only renders "complete" columns. Is there a way of filling up the empty spaces in the dictionary before processing?
How can I get, for example, external_resource_count -> extensions to have two daughter rows jar and json, with an additional column for the values, so that the final table looks like this:
Extra credit if anyone can tell me how to get rid of the first row with the index numbers. Thanks!
The way you load the dataframe is correct, but you should rename the 0 column to some meaningful column name.
# this function extracts all the keys from your nested dicts
def explode_and_filter(df, filterdict):
    return [df[col].apply(lambda x: x.get(k) if type(x) == dict else x).rename(f'{k}')
            for col, nested in filterdict.items()
            for k in nested]

data_frame = pd.DataFrame.from_dict(data=extracted_metrics, orient='index').stack().to_frame(name='somecol')

# let's separate the rows where a dict is present & explode only those rows
mask = data_frame.somecol.apply(lambda x: type(x) == dict)
expp = explode_and_filter(data_frame[mask],
                          {'somecol': ['jar', 'json', '/lib', 'number', 'percentage']})

# here we concat the exploded series to a frame
exploded_df = pd.concat(expp, axis=1).stack().to_frame(name='somecol2').reset_index(level=2) \
    .rename(columns={'level_2': 'somecol'})

# and now we concat the rows with dict elements with the rows with non dict elements
out = pd.concat([data_frame[~mask], exploded_df])
The output dataframe looks like this
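A simpler sketch, separate from the answer above, is to flatten every level with pd.json_normalize and then reshape into category / sub-category / value columns; it assumes the dict from the question is bound to extracted_metrics.
import pandas as pd

# json_normalize flattens nested keys into dotted column names,
# e.g. 'external_resource_count.extensions.jar'
flat = pd.json_normalize(extracted_metrics, sep='.')

# one row per leaf value instead of one dotted column per leaf
table = flat.melt(var_name='metric', value_name='value')

# split the dotted path into separate level columns
table[['category', 'subcategory', 'subsubcategory']] = (
    table['metric'].str.split('.', n=2, expand=True)
)
print(table.drop(columns='metric'))
For the extra-credit part, both to_csv and to_html accept index=False to drop the numeric index column.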

Get intersection of 4 JSON files based on 1-2 common key values? (Python)

Below are 4 JSON files:
3 JSON files have 3 key fields: name, rating, and year
1 JSON has only 2 key fields: name, rating (no year)
[
{
"name": "Apple",
"year": "2014",
"rating": "21"
},
{
"name": "Pear",
"year": "2003",
"rating": ""
},
{
"name": "Pineapple",
"year": "1967",
"rating": "60"
}
]
[
{
"name": "Pineapple",
"year": "1967",
"rating": "5.7"
},
{
"name": "Apple",
"year": "1915",
"rating": "2.3"
},
{
"name": "Apple",
"year": "2014",
"rating": "3.7"
}
]
[
{
"name": "Apple",
"year": "2014",
"rating": "2.55"
}
]
[
{
"name": "APPLE",
"rating": "+4"
},
{
"name": "LEMON",
"rating": "+3"
}
]
When searching for 'Apple' across all 4 files, I want to return 1 name, 1 year, and 4 ratings:
name: Apple (closest match to search term across all 4 files)
year: 2014 (the MOST COMMON year for Apple across first 3 JSONs)
rating: 21 (from JSON1)
3.7 (from JSON2)
2.55 (from JSON3)
+4 (from JSON4)
Now pretend JSON3 (or any JSON) has no match for 'name: Apple'. In that case, instead return the following. Assume there will be at least one match in at least one file.
name: Apple (closest match to search term across all 4 files)
year: 2014 (the MOST COMMON year for Apple across first 3 JSONs)
rating: 21 (from JSON1)
3.7 (from JSON2)
Not Found (from JSON3)
+4 (from JSON4)
How would you get this output in Python?
This question is similar to the example code in Python - Getting the intersection of two Json-Files , except there are 4 files, 1 file is missing the year key, and we don't need the intersection of the rating key's value.
Here's what I have so far, just for two sets of JSON above:
import json

with open('1.json', 'r') as f:
    json1 = json.load(f)
with open('2.json', 'r') as f:
    json2 = json.load(f)

json2[0]['name'] = list(set(json2[0]['name']) - set(json1[0]['name']))
print(json.dumps(json2, indent=2))
I get output from this, but it doesn't match what I'm trying to achieve. For example, this is part of the output:
{
"name": [
"a",
"n",
"i",
"P"
],
"year": "1967",
"rating": "5.7"
},
When you are creating a set with the set constructor, it expects an iterable object and will iterate through the values of this object to make your set. So when you try to make a set directly from a string you end up with
name = set('Apple')
# name = {'A', 'p', 'p', 'l', 'e'}
since the string is an iterable object made up of characters. Instead, you would want to wrap the string into a list or tuple like so
name = set(['Apple'])
# name = {'Apple'}
which in your case would look like
json2[0]['name'] = list(set([json2[0]['name']]) - set([json1[0]['name']]))
but I still don't think that this is really what you are trying to achieve. Instead, I would suggest that you iterate through each of your JSON files, building your own dictionary indexed on the names from the JSON files. Each value in that dictionary would be another dictionary with two keys, rating and year, each holding a list of values. Once you're done building it up, you would have a rating list and a year list for each name, and you could collapse each year list to a single value by choosing the most frequent year in the list (see the sketch after the example below).
Here's an example of how your dictionary might look
{
"Apple": { "rating": [21, 3.7, ...], "year": [1915, 2014, 2014] }
"Pineapple": ...
...
}
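A minimal sketch of that approach. The file names 1.json-4.json and the 'Not Found' placeholder are assumptions taken from the question, the fuzzy "closest match" step is reduced to a simple case-insensitive comparison, and preferring the entry for the most common year is just one way to pick a single rating per file.
import json
from collections import Counter

files = ['1.json', '2.json', '3.json', '4.json']
search = 'Apple'

per_file = []   # matching records, one list per file
years = []      # every year seen for the search term
for path in files:
    with open(path) as f:
        records = json.load(f)
    # case-insensitive name match; a real "closest match" could use difflib
    matches = [r for r in records if r['name'].lower() == search.lower()]
    per_file.append(matches)
    years.extend(r['year'] for r in matches if r.get('year'))

year = Counter(years).most_common(1)[0][0] if years else 'Not Found'

print('name:', search)
print('year:', year)
for i, matches in enumerate(per_file, start=1):
    if not matches:
        print(f'rating ({i}.json): Not Found')
        continue
    # prefer the entry for the most common year, otherwise take the first match
    best = next((r for r in matches if r.get('year') == year), matches[0])
    print(f'rating ({i}.json):', best['rating'])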

Can I store a Parquet file with a dictionary column having mixed types in their values?

I am trying to store a Python Pandas DataFrame as a Parquet file, but I am experiencing some issues. One of the columns of my Pandas DF contains dictionaries as such:
import pandas as pd
df = pd.DataFrame({
"ColA": [1, 2, 3],
"ColB": ["X", "Y", "Z"],
"ColC": [
{ "Field": "Value" },
{ "Field": "Value2" },
{ "Field": "Value3" }
]
})
df.to_parquet("test.parquet")
Now, that works perfectly fine, the problem is when one of the nested values of the dictionary has a different type than the rest. For instance:
import pandas as pd
df = pd.DataFrame({
"ColA": [1, 2, 3],
"ColB": ["X", "Y", "Z"],
"ColC": [
{ "Field": "Value" },
{ "Field": "Value2" },
{ "Field": ["Value3"] }
]
})
df.to_parquet("test.parquet")
This throws the following error:
ArrowInvalid: ('cannot mix list and non-list, non-null values', 'Conversion failed for column ColC with type object')
Notice how, for the last row of the DF, the Field property of the ColC dictionary is a list instead of a string.
Is there any workaround to be able to store this DF as a Parquet file?
ColC is a UDT (user defined type) with one field called Field of type Union of String, List of String.
In theory arrow supports it, but in practice it has a hard time figuring out what the type of ColC is. Even if you were providing the schema of your data frame explicitly, it wouldn't work because this type of conversion (converting unions from pandas to arrow/parquet) isn't supported yet.
import pyarrow as pa

union_type = pa.union(
    [pa.field("0", pa.string()), pa.field("1", pa.list_(pa.string()))],
    'dense'
)
col_c_type = pa.struct(
    [
        pa.field('Field', union_type)
    ]
)
schema = pa.schema(
    [
        pa.field('ColA', pa.int32()),
        pa.field('ColB', pa.string()),
        pa.field('ColC', col_c_type),
    ]
)
df = pd.DataFrame({
"ColA": [1, 2, 3],
"ColB": ["X", "Y", "Z"],
"ColC": [
{ "Field": "Value" },
{ "Field": "Value2" },
{ "Field": ["Value3"] }
]
})
pa.Table.from_pandas(df, schema)
This gives you this error:
('Sequence converter for type union[dense]<0: string=0, 1: list<item: string>=1> not implemented', 'Conversion failed for column ColC with type object')
Even if you create the arrow table manually, it won't be able to convert it to parquet (again, unions are not supported).
import io
import pyarrow as pa
import pyarrow.parquet as pq

col_a = pa.array([1, 2, 3], pa.int32())
col_b = pa.array(["X", "Y", "Z"], pa.string())

xs = pa.array(["Value", "Value2", None], type=pa.string())
ys = pa.array([None, None, ["value3"]], type=pa.list_(pa.string()))
types = pa.array([0, 0, 1], type=pa.int8())
col_c = pa.UnionArray.from_sparse(types, [xs, ys])

table = pa.Table.from_arrays(
    [col_a, col_b, col_c],
    schema=pa.schema([
        pa.field('ColA', col_a.type),
        pa.field('ColB', col_b.type),
        pa.field('ColC', col_c.type),
    ])
)
with io.BytesIO() as buffer:
    pq.write_table(table, buffer)
Unhandled type for Arrow to Parquet schema conversion: sparse_union<0: string=0, 1: list<item: string>=1>
I think your only option for now is to use a struct whose fields have different names for the string value and the list-of-strings value.
df = pd.DataFrame({
"ColA": [1, 2, 3],
"ColB": ["X", "Y", "Z"],
"ColC": [
{ "Field1": "Value" },
{ "Field1": "Value2" },
{ "Field2": ["Value3"] }
]
})
df.to_parquet('/tmp/hello')
I just had the same problem and fixed it by converting ColC to string:
df['ColC'] = df['ColC'].astype(str)
I am not sure whether this could cause a problem in the future, so don't quote me on it.
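If you go the string route, one further sketch (my own suggestion, not from the answers above) is to serialize ColC with json.dumps instead of str, so the column can be parsed back into dicts/lists after reading the Parquet file.
import json
import pandas as pd

df = pd.DataFrame({
    "ColA": [1, 2, 3],
    "ColB": ["X", "Y", "Z"],
    "ColC": [
        {"Field": "Value"},
        {"Field": "Value2"},
        {"Field": ["Value3"]},
    ],
})

# store the mixed-type column as JSON text, which Parquet handles as plain strings
df["ColC"] = df["ColC"].apply(json.dumps)
df.to_parquet("test.parquet")

# round-trip: read back and restore the original dicts
restored = pd.read_parquet("test.parquet")
restored["ColC"] = restored["ColC"].apply(json.loads)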

Load a dataframe from a single json object

I have the following json object:
{
"Name": "David",
"Gender": "M",
"Date": "2014-01-01",
"Address": {
"Street": "429 Ford",
"City": "Oxford",
"State": "DE",
"Zip": 1009
}
}
How would I load this into a pandas dataframe so that it orients itself as:
name gender date address
David M 2014-01-01 {...}
What I'm trying now is:
pd.read_json(file)
But it orients it as four records instead of one.
You should read it as a Series and then (optionally) convert to a DataFrame:
df = pd.DataFrame(pd.read_json(file, typ='series')).T
df.shape
#(1, 4)
if your JSON file is composed of 1 JSON object per line (not an array, not a pretty printed JSON object)
then you can use:
df = pd.read_json(file, lines=True)
and it will do what you want
if file contains:
{"Name": "David","Gender": "M","Date": "2014-01-01","Address": {"Street": "429 Ford","City": "Oxford","State": "DE","Zip": 1009}}
on 1 line, then you get a single-row DataFrame with one column per key.
If you use
df = pd.read_json(file, orient='records')
you can load as 1 key per column, but the sub-keys will be split up into multiple rows.
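Another option, as a small sketch rather than a drop-in answer, is to parse the object yourself and use pd.json_normalize, which keeps one row and also flattens the nested Address into its own columns; file here stands for the same path used in the question.
import json
import pandas as pd

with open(file) as f:       # `file` is the JSON path from the question
    obj = json.load(f)

# one row, Address kept as a single dict column
df = pd.DataFrame([obj])

# or flatten Address into Address.Street, Address.City, Address.State, Address.Zip
flat = pd.json_normalize(obj)
print(flat)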
