Parse JSON in a Pandas DataFrame - python

I have some data in a pandas DataFrame, but one of the columns contains multi-line JSON. I am trying to parse that JSON out into a separate DataFrame along with the CustomerId. Here you will see my DataFrame...
df
Out[1]:
Id object
CustomerId object
CallInfo object
Within the CallInfo column, the data looks like this...
[{"CallDate":"2021-06-21","CallLength":362},{"CallDate":"2021-06-24","CallLength":402}]
I want to create a new DataFrame called df_norm which contains the CustomerId, CallDate, and CallLength.
I have tried several ways but couldn't find a working solution. Can anyone help me with this?
Mock up code example...
import pandas as pd
import json
Id = [1, 2, 3]
CustomerId = [700001, 700002, 700003]
CallInfo = ['[{"CallDate":"2021-06-21","CallLength":362},{"CallDate":"2021-06-24","CallLength":402}]', '[{"CallDate":"2021-07-09","CallLength":102}]', '[{"CallDate":"2021-07-11","CallLength":226},{"CallDate":"2021-07-11","CallLength":216}]']
# Reconstruct sample DataFrame
df = pd.DataFrame({
    "Id": Id,
    "CustomerId": CustomerId,
    "CallInfo": CallInfo
})
print(df)

This should work. Create a new list of rows and then toss that into the pd.DataFrame constructor:
new_rows = [{'Id': row['Id'],
             'CustomerId': row['CustomerId'],
             'CallDate': item['CallDate'],
             'CallLength': item['CallLength']}
            for _, row in df.iterrows()
            for item in json.loads(row['CallInfo'])]
new_df = pd.DataFrame(new_rows)
print(new_df)
EDIT: to account for None values in the CallInfo column:
new_rows = []
for _, row in df.iterrows():
    if row['CallInfo'] is not None:  # Or additional checks, e.g. == "" or something...
        for item in json.loads(row['CallInfo']):
            new_rows.append({
                'Id': row['Id'],
                'CustomerId': row['CustomerId'],
                'CallDate': item['CallDate'],
                'CallLength': item['CallLength']})
    else:
        new_rows.append({
            'Id': row['Id'],
            'CustomerId': row['CustomerId'],
            'CallDate': None,
            'CallLength': None})
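For completeness, the same normalization can be done without an explicit loop using explode and json_normalize; a minimal sketch on a cut-down version of the sample data (assuming pandas ≥ 1.1 for explode's ignore_index):

```python
import json
import pandas as pd

df = pd.DataFrame({
    "Id": [1, 2],
    "CustomerId": [700001, 700002],
    "CallInfo": [
        '[{"CallDate":"2021-06-21","CallLength":362},{"CallDate":"2021-06-24","CallLength":402}]',
        '[{"CallDate":"2021-07-09","CallLength":102}]',
    ],
})

# Parse each JSON string into a list of dicts, then give every dict its own row.
parsed = df.assign(CallInfo=df["CallInfo"].map(json.loads)).explode("CallInfo", ignore_index=True)

# Flatten the per-row dicts into CallDate/CallLength columns.
df_norm = pd.concat(
    [parsed[["Id", "CustomerId"]], pd.json_normalize(parsed["CallInfo"].tolist())],
    axis=1,
)
print(df_norm)
```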


Ungroup pandas dataframe after bfill

I'm trying to write a function that will backfill columns in a dataframe, adhering to a condition. The backfill should only be done within groups. I am however having a hard time getting the group object to ungroup. I have tried reset_index as in the example below, but that gets an AttributeError.
Accessing the original df through result.obj doesn't lead to the updated values because there is no inplace for the groupby bfill.
def upfill(df: DataFrameGroupBy) -> DataFrameGroupBy:
    for column in df.obj.columns:
        if column.startswith("x"):
            df[column].bfill(axis="rows", inplace=True)
    return df
Assigning the dataframe column in the function doesn't work because the groupby object doesn't support item assignment.
def upfill(df: DataFrameGroupBy) -> DataFrameGroupBy:
    for column in df.obj.columns:
        if column.startswith("x"):
            df[column] = df[column].bfill()
    return df
The test I'm trying to get to pass:
def test_upfill():
    df = DataFrame({
        "id": [1, 2, 3, 4, 5],
        "group": [1, 2, 2, 3, 3],
        "x_value": [4, 4, None, None, 5],
    })
    grouped_df = df.groupby("group")
    result = upfill(grouped_df)
    result.reset_index()
    assert result["x_value"].equals(Series([4, 4, None, 5, 5]))
You should use the transform method on the grouped DataFrame, like this:
import pandas as pd
def test_upfill():
    df = pd.DataFrame({
        "id": [1, 2, 3, 4, 5],
        "group": [1, 2, 2, 3, 3],
        "x_value": [4, 4, None, None, 5],
    })
    result = df.groupby("group").transform(lambda x: x.bfill())
    assert result["x_value"].equals(pd.Series([4, 4, None, 5, 5]))
test_upfill()
Here you can find more information about the transform method on GroupBy objects.
Based on the accepted answer, this is the full solution I got to, although I have read elsewhere that there are issues with using the obj attribute.
def upfill(df: DataFrameGroupBy) -> DataFrameGroupBy:
    columns = [column for column in df.obj.columns if column.startswith("x")]
    df.obj[columns] = df[columns].transform(lambda x: x.bfill())
    return df
def test_upfill():
    df = DataFrame({
        "id": [1, 2, 3, 4, 5],
        "group": [1, 2, 2, 3, 3],
        "x_value": [4, 4, None, None, 5],
    })
    grouped_df = df.groupby("group")
    result = upfill(grouped_df)
    assert df["x_value"].equals(Series([4, 4, None, 5, 5]))
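For what it's worth, groupby objects also expose bfill directly, so the helper function may not be needed at all; a minimal sketch of that approach on the same test data:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "group": [1, 2, 2, 3, 3],
    "x_value": [4, 4, None, None, 5],
})

# Backfill only the columns that start with "x", within each group;
# GroupBy.bfill preserves the original row order and index.
x_cols = [c for c in df.columns if c.startswith("x")]
df[x_cols] = df.groupby("group")[x_cols].bfill()
print(df)
```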

How do I find the uniques and the count of rows for multiple columns?

I have a file with 136 columns. I was trying to find the unique values of each column and from there, I need to find the number of rows for the unique values.
I tried using df and dict for the unique values. However, when I export it back to csv file, the unique values are exported as a list in one cell for each column.
Is there any way I can do to simplify the counting process of the unique values in each column?
df = pd.read_excel(filename)
column_headers = list(df.columns.values)
df_unique = {}
df_count = {}
def approach_1(data):
    count = 0
    for entry in data:
        if not entry == 'nan' or not entry == 'NaN':
            count += 1
    return count
for unique in column_headers:
    new = df.drop_duplicates(subset=unique, keep='first')
    df_unique[unique] = new[unique].tolist()
csv_unique = pd.DataFrame(df_unique.items(), columns=['Data Source Field', 'First Row'])
csv_unique.to_csv('Unique.csv', index=False)
for count in df_unique:
    not_nan = approach_1(df_unique[count])
    df_count[count] = not_nan
csv_count = pd.DataFrame(df_count.items(), columns=['Data Source Field', 'Count'])
.unique() is simpler: len(df[col].unique()) is the count.
import pandas as pd
records = [
    {"col1": "0", "col2": "a"},
    {"col1": "1", "col2": "a"},
    {"col1": "2", "col2": "a"},
    {"col1": "3", "col2": "a"},
    {"col1": "4", "col2": "a"},
    {"col2": "a"}
]
df = pd.DataFrame.from_dict(records)
result_dict = {}
for col in df.columns:
    result_dict[col] = len(df[col].dropna().unique())
print(result_dict)
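The per-column loop above can also be collapsed into a single call: DataFrame.nunique ignores NaN by default and returns one count per column. A minimal sketch on a small frame (the same idea scales to all 136 columns):

```python
import pandas as pd

df = pd.DataFrame([
    {"col1": "0", "col2": "a"},
    {"col1": "1", "col2": "a"},
    {"col2": "a"},  # col1 missing here -> NaN, excluded from the count
])

# One unique-value count per column, NaN excluded by default.
counts = df.nunique()
print(counts.to_dict())
```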

Converting API output from a dictionary to a dataframe (Python)

I have fed some data into a TravelTime (https://github.com/traveltime-dev/traveltime-python-sdk) API which calculates the time it takes to drive between 2 locations. The result of this (called out) is a dictionary that looks like this:
{'results': [{'search_id': 'arrival_one_to_many',
              'locations': [{'id': 'KA8 0EU', 'properties': {'travel_time': 2646}},
                            {'id': 'KA21 5DT', 'properties': {'travel_time': 392}}],
              'unreachable': []}]}
However, I need a table that would look a bit like this:
search_id            id        Travel Time
arrival_one_to_many  KA8 0EU   2646
arrival_one_to_many  KA21 5DT  392
I've tried converting this dictionary to a dataframe using
out_2 = pd.DataFrame.from_dict(out)
This shows as a dataframe with one column called results, so I tried out_2['results'].str.split(',', expand=True) to split this into multiple columns at the comma delimiters, but that just returned NaN:
     0
0  NaN
Is anyone able to help me to get this dictionary to a readable and useable dataframe/table?
Thanks
@MICHAELKM22, since you are not using all the keys from the dictionary, you won't be able to convert it directly to a dataframe.
First extract the required keys, then convert them into a dataframe.
df_list = []
for res in data['results']:
    search_id = res['search_id']
    for loc in res['locations']:
        temp_df = {}
        temp_df['search_id'] = search_id
        temp_df['id'] = loc["id"]
        temp_df['travel_time'] = loc["properties"]['travel_time']
        df_list.append(temp_df)
df = pd.DataFrame(df_list)
             search_id        id  travel_time
0  arrival_one_to_many   KA8 0EU         2646
1  arrival_one_to_many  KA21 5DT          392
First, this JSON needs to be parsed to fetch the required values. Once those values are fetched, we can store them in a DataFrame.
Below is the code to parse this JSON (PS: I have saved the JSON in a file) and add these values to a DataFrame.
import json
import pandas as pd
with open('Json_file.json') as f:
    req_file = json.load(f)
rows = []
for i in req_file['results']:
    for j in i['locations']:
        rows.append({
            'searchid': i['search_id'],
            'location_id': j['id'],
            'travel_time': j['properties']['travel_time']})
# Build the frame in one go; DataFrame.append was deprecated and removed in pandas 2.0
df = pd.DataFrame(rows)
print(df)
Below is the output of the above code:
              searchid location_id  travel_time
0  arrival_one_to_many     KA8 0EU         2646
1  arrival_one_to_many    KA21 5DT          392
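pandas can also do this flattening itself with json_normalize; a minimal sketch on the dictionary from the question, where record_path walks into each 'locations' list and meta carries search_id along:

```python
import pandas as pd

out = {'results': [{'search_id': 'arrival_one_to_many',
                    'locations': [{'id': 'KA8 0EU', 'properties': {'travel_time': 2646}},
                                  {'id': 'KA21 5DT', 'properties': {'travel_time': 392}}],
                    'unreachable': []}]}

# One row per entry in 'locations'; nested dicts become dotted column names.
df = pd.json_normalize(out['results'], record_path='locations', meta=['search_id'])
df = df.rename(columns={'properties.travel_time': 'travel_time'})
print(df)
```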

Pandas DataFrame adding more columns to .csv when editing certain values

I am using a pandas DataFrame to store some information for my code.
Initial state of the csv:
...............
ID,Name
...............
Adding data into the dataframe:
name_desc = {"ID": 23523223, "Name": "BlahBlah"}
df = df.append(name_desc, ignore_index=True)
This was my pandas dataframe upon creating the database:
....................
,ID,Name
0,23523223,BlahBlah
....................
Below is my code that searches through the ID column to locate the row with the stated ID (name_desc["ID"]).
df.loc[df["ID"] == name_desc["ID"], "Name"] = name_desc["Name"]
The problem I encountered was after I have edited the name, I get a resultant db that looks like:
................................
Unnamed: 0 ID Name
0 0 23523223 BlahBlah
................................
If I continuously execute:
df.loc[df["ID"] == name_desc["ID"], "Name"] = name_desc["Name"]
I get this db:
..................................
,Unnamed: 0,Unnamed: 0.1,ID,Name
0,0,0,235283335,Dinese
..................................
I can't figure out why I have extra columns being added in the front of my database as I make edits.
I think you have a problem that is related to the df creation. The example you provided here does not return what you are showing:
BlahBlah = 'foo'
name_desc = {"ID": 23523223, "Name": BlahBlah}
df = pd.DataFrame(data=name_desc, index=[0])
print(df.columns) # it returns an Index(['ID', 'Name'], dtype='object')
print(len(df.columns)) # it returns 2, the number of your df columns
If you can, try to find which instruction adds the extra column to your code. Otherwise, you can remove it using drop, targeting the column whose name is ''. inplace is used to actually modify the dataframe; if inplace is not added, drop just returns a modified copy without changing the original dataframe:
df.drop(columns = [''], inplace = True)
Finally, I post the full example in the following. My assumption is that your df is somehow created with the empty column at the beginning, so I also add it in the dictionary:
BlahBlah = 'foo'
name_desc = {'':'',"ID": 23523223, "Name": BlahBlah} # I added an empty column
df = pd.DataFrame(data=name_desc, index = [0])
print(df.columns) # Index(['', 'ID', 'Name'], dtype='object')
df.drop(columns = [''],inplace = True)
df.loc[df["ID"] == name_desc["ID"], "Name"] = name_desc["Name"]
print(df.columns) # Index(['ID', 'Name'], dtype='object')
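Since the full code isn't shown, this is only an assumption, but Unnamed: 0 columns very often come from round-tripping through to_csv/read_csv with the default index; each write adds one more unnamed column, matching the symptom above. A minimal sketch of that failure mode and the fix:

```python
import io
import pandas as pd

df = pd.DataFrame({"ID": [23523223], "Name": ["BlahBlah"]})

# to_csv writes the index as an unnamed first column by default ...
buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)
# ... and read_csv then surfaces it as 'Unnamed: 0'.
reread = pd.read_csv(buf)
print(reread.columns.tolist())

# Fix: write without the index (or read with index_col=0 instead).
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
clean = pd.read_csv(buf)
print(clean.columns.tolist())
```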

How to read this JSON file in Python?

I'm trying to read the following JSON file in Python, to save only two of the values of each response part:
{
  "responseHeader": {
    "status": 0,
    "time": 2,
    "params": {
      "q": "query",
      "rows": "2",
      "wt": "json"}},
  "response": {"results": 2, "start": 0, "docs": [
    {
      "name": ["Peter"],
      "country": ["England"],
      "age": ["23"]},
    {
      "name": ["Harry"],
      "country": ["Wales"],
      "age": ["30"]}]
  }}
For example, I want to put the name and the age in a table. I already tried it this way (based on this topic), but it's not working for me.
import json
import pandas as pd
file = open("myfile.json")
data = json.loads(file)
columns = [dct['name', 'age'] for dct in data['response']]
df = pd.DataFrame(data['response'], columns=columns)
print(df)
I also have seen more solutions of reading a JSON file, but that all were solutions of a JSON file with no other header values at the top, like responseHeader in this case. I don't know how to handle that. Anyone who can help me out?
import json
with open("myfile.json") as f:
    columns = [(dic["name"], dic["age"]) for dic in json.load(f)["response"]["docs"]]
print(columns)
result:
[(['Peter'], ['23']), (['Harry'], ['30'])]
You can pass the list data["response"]["docs"] to pandas directly as it's a recordset.
df = pd.DataFrame(data["response"]["docs"])
print(df)
>>> name country age
0 [Peter] [England] [23]
1 [Harry] [Wales] [30]
The data in your DataFrame will be bracketed though, as you can see. If you want to remove the brackets, you can consider the following:
for column in df.columns:
    df.loc[:, column] = df.loc[:, column].str.get(0)
    if column == 'age':
        df.loc[:, column] = df.loc[:, column].astype(int)
sample = {"responseHeader": {
              "status": 0,
              "time": 2,
              "params": {
                  "q": "query",
                  "rows": "2",
                  "wt": "json"}},
          "response": {"results": 2, "start": 0, "docs": [
              {"name": ["Peter"],
               "country": ["England"],
               "age": ["23"]},
              {"name": ["Harry"],
               "country": ["Wales"],
               "age": ["30"]}]
          }}
data = [(x['name'][0], x['age'][0]) for x in sample['response']['docs']]
df = pd.DataFrame(data, columns=['name', 'age'])
