I have the following telegram export JSON dataset:
import pandas as pd
df = pd.read_json("data/result.json")
>>> df.columns
Index(['name', 'type', 'id', 'messages'], dtype='object')
>>> type(df)
<class 'pandas.core.frame.DataFrame'>
# Sample output
sample_df = pd.DataFrame({"messages": [
{"id": 11, "from": "user3984", "text": "Do you like soccer?"},
{"id": 312, "from": "user837", "text": ['Not sure', {'type': 'hashtag', 'text': '#confused'}]},
{"id": 4324, "from": "user3984", "text": ['O ', {'type': 'mention', 'text': '#user87324'}, ' really?']}
]})
Within df, there's a "messages" column, which has the following output:
>>> df["messages"]
0 {'id': -999713937, 'type': 'service', 'date': ...
1 {'id': -999713936, 'type': 'service', 'date': ...
2 {'id': -999713935, 'type': 'message', 'date': ...
3 {'id': -999713934, 'type': 'message', 'date': ...
4 {'id': -999713933, 'type': 'message', 'date': ...
...
22377 {'id': 22102, 'type': 'message', 'date': '2022...
22378 {'id': 22103, 'type': 'message', 'date': '2022...
22379 {'id': 22104, 'type': 'message', 'date': '2022...
22380 {'id': 22105, 'type': 'message', 'date': '2022...
22381 {'id': 22106, 'type': 'message', 'date': '2022...
Name: messages, Length: 22382, dtype: object
Within each message there's a key named "text", and that's what I want to focus on. When you explore the data, the "text" value can be:
A single text:
>>> df["messages"][5]["text"]
'JAJAJAJAJAJAJA'
>>> df["messages"][22262]["text"]
'No creo'
But sometimes it's nested. Like the following:
>>> df["messages"][22373]["text"]
['O ', {'type': 'mention', 'text': '#user87324'}, ' really?']
>>> df["messages"][22189]["text"]
['The average married couple has sex roughly once a week. ', {'type': 'mention', 'text': '#googlefactss'}, ' ', {'type': 'hashtag', 'text': '#funfact'}]
>>> df["messages"][22345]["text"]
[{'type': 'mention', 'text': '#user817430'}]
For nested data, if I want to grab the main text, I can do the following:
>>> df["messages"][22373]["text"][0]
'O '
>>> df["messages"][22189]["text"][0]
'The average married couple has sex roughly once a week. '
>>>
Up to here, everything seems OK. However, the problem arises when I use a for loop. If I try the following:
for item in df["messages"]:
    tg_id = item.get("id", "None")
    tg_type = item.get("type", "None")
    tg_date = item.get("date", "None")
    tg_from = item.get("from", "None")
    tg_text = item.get("text", "None")
    print(tg_id, tg_from, tg_text)
A sample output is:
21263 user3984 jajajajaja
21264 user837 ['Not sure', {'type': 'hashtag', 'text': '#confused'}]
21265 user3984 What time is it?✋
MY ASK: How do I flatten the rows? I need the following (stored in a data frame):
21263 user3984 jajajajaja
21264 user837 Not sure
21265 user837 type: hashtag
21266 user837 text: #confused
21267 user3984 What time is it?✋
I tried to detect "text" type like this:
for item in df["messages"]:
    tg_id = item.get("id", "None")
    tg_type = item.get("type", "None")
    tg_date = item.get("date", "None")
    tg_from = item.get("from", "None")
    tg_text = item.get("text", "None")
    if type(tg_text) == list:
        tg_text = tg_text[0]
    print(tg_id, tg_from, tg_text)
With this I only grab the first text, but I'm expecting to grab the other fields as well or to 'flatten' the data.
I also tried:
for item in df["messages"]:
    tg_id = item.get("id", "None")
    tg_type = item.get("type", "None")
    tg_date = item.get("date", "None")
    tg_from = item.get("from", "None")
    tg_text = item.get("text", "None")
    if type(tg_text) == list:
        tg_text = tg_text[0]
        tg_second = tg_text[1]["text"]
    print(tg_id, tg_from, tg_text, tg_second)
But no luck, because the indices vary and the length of each message's text varies too.
In addition, even though the output wasn't close to my desired solution, I also tried:
for item in df["messages"]:
    tg_text = item.get("text", "None")
    if type(tg_text) == list:
        for i in tg_text:
            print(item, i)
mydict = {}
for k, v in df.items():
    print(k, v)
    mydict[k] = v
# Used df["text"].explode()
# Used json_normalize but no luck
Any thoughts?
Assuming a dataframe like the following:
df = pd.DataFrame({"messages": [
{"id": 21263, "from": "user3984", "text": "jajajajaja"},
{"id": 21264, "from": "user837", "text": ['Not sure', {'type': 'hashtag', 'text': '#confused'}]},
{"id": 21265, "from": "user3984", "text": ['O ', {'type': 'mention', 'text': '#user87324'}, ' really?']}
]})
First, expand the messages dictionaries into separate id, from and text columns.
expanded = pd.concat([df.drop("messages", axis=1), pd.json_normalize(df["messages"])], axis=1)
Then explode the dataframe to have a row for each entry in text:
exploded = expanded.explode("text")
Then expand the dictionaries that are in some of the entries, converting them to lists of text:
def convert_dict(entry):
    if type(entry) is dict:
        return [f"{k}: {v}" for k, v in entry.items()]
    else:
        return entry
exploded["text"] = exploded["text"].apply(convert_dict)
Finally, explode again to separate the converted dicts to separate rows.
final = exploded.explode("text")
The resulting output should look like this:
   id     from      text
0  21263  user3984  jajajajaja
1  21264  user837   Not sure
1  21264  user837   type: hashtag
1  21264  user837   text: #confused
2  21265  user3984  O
2  21265  user3984  type: mention
2  21265  user3984  text: #user87324
2  21265  user3984  really?
Just to share some ideas for flattening your list:
def flatlist(srclist):
    flat = []
    if isinstance(srclist, str):  # a plain-string message is a single entry
        return [srclist]
    if srclist:  # skip None or an empty list
        for item in srclist:
            if isinstance(item, str):  # plain text fragment
                flat.append(item)
            elif isinstance(item, dict):  # entity such as a hashtag or mention
                for key in item:
                    flat.append(f"{key} {item[key]}")  # combine key and value
    return flat
for item in df["messages"]:
    tg_text = item.get("text", "None")
    flat_list = flatlist(tg_text)  # get the flattened list
    for tg in flat_list:  # loop through the list and get the data you want
        tg_id = item.get("id", "None")
        tg_from = item.get("from", "None")
        print(tg_id, tg_from, tg)
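Since the question asks for the result in a data frame rather than printed, here is a minimal sketch of the same idea that collects the rows first. The names flatten_text and messages_to_frame are made up for this example, and the "key: value" formatting for entities is only one possible choice.

```python
import pandas as pd

def flatten_text(text):
    """Return a list of strings for one Telegram "text" value (hypothetical helper)."""
    if isinstance(text, str):
        return [text]  # plain message: one row
    parts = []
    for piece in text or []:  # None or an empty list gives no rows
        if isinstance(piece, dict):
            # entity such as a hashtag or mention: one row per key
            parts.extend(f"{key}: {value}" for key, value in piece.items())
        else:
            parts.append(str(piece))
    return parts

def messages_to_frame(messages):
    """Build one row per flattened text fragment (hypothetical helper)."""
    rows = [
        {"id": msg.get("id"), "from": msg.get("from"), "text": part}
        for msg in messages
        for part in flatten_text(msg.get("text"))
    ]
    return pd.DataFrame(rows, columns=["id", "from", "text"])

sample = [
    {"id": 21263, "from": "user3984", "text": "jajajajaja"},
    {"id": 21264, "from": "user837",
     "text": ["Not sure", {"type": "hashtag", "text": "#confused"}]},
]
flat_df = messages_to_frame(sample)
print(flat_df)
```

With the real export you would pass df["messages"] instead of sample.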
Related
Trying to seed a database in django app. I have a csv file that I converted to json and now I need to reformat it to match the django serialization required format found here
This is what the json format needs to look like to be acceptable to django (Which looks an awful lot like a dictionary with 3 keys, the third having a value which is a dictionary itself):
[
{
"pk": "4b678b301dfd8a4e0dad910de3ae245b",
"model": "sessions.session",
"fields": {
"expire_date": "2013-01-16T08:16:59.844Z",
...
}
}
]
My json data looks like this after converting it from csv with pandas:
[{'model': 'homepage.territorymanager', 'pk': 1, 'Name': 'Aaron ##', 'Distributor': 'National Energy', 'State': 'BC', 'Brand': 'Trane', 'Cell': '778-###-####', 'email address': None, 'Notes': None, 'Unnamed: 9': None}, {'model': 'homepage.territorymanager', 'pk': 2, 'Name': 'Aaron Martin ', 'Distributor': 'Pierce ###', 'State': 'PA', 'Brand': 'Bryant/Carrier', 'Cell': '267-###-####', 'email address': None, 'Notes': None, 'Unnamed: 9': None},...]
I am using this function to try and reformat
def re_serialize_reg_json(d, jsonFilePath):
    for i in d:
        d2 = {'Name': d[i]['Name'], 'Distributor' : d[i]['Distributor'], 'State' : d[i]['State'], 'Brand' : d[i]['Brand'], 'Cell' : d[i]['Cell'], 'EmailAddress' : d[i]['email address'], 'Notes' : d[i]['Notes']}
        d[i] = {'pk': d[i]['pk'],'model' : d[i]['model'], 'fields' : d2}
    print(d)
and it returns this error which doesn't make any sense because the format that django requires has a dictionary as the value of the third key:
d2 = {'Name': d[i]['Name'], 'Distributor' : d[i]['Distributor'], 'State' : d[i]['State'], 'Brand' : d[i]['Brand'], 'Cell' : d[i]['Cell'], 'EmailAddress' : d[i]['email address'], 'Notes' : d[i]['Notes']}
TypeError: list indices must be integers or slices, not dict
Any help appreciated!
Here is what I did to get d:
df = pandas.read_csv('/Users/justinbenfit/territorymanagerpython/territory managers - Sheet1.csv')
df.to_json('/Users/justinbenfit/territorymanagerpython/territorymanagers.json', orient='records')
jsonFilePath = '/Users/justinbenfit/territorymanagerpython/territorymanagers.json'
def load_file(file_path):
    with open(file_path) as f:
        d = json.load(f)
        return d
d = load_file(jsonFilePath)
print(d)
d is actually a list containing multiple dictionaries, so for i in d yields each dictionary rather than its index, and d[i] then raises that TypeError. To make it work, change the for i in d part to: for i in range(len(d)).
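As an alternative to indexing, here is a small sketch that iterates over the dictionaries directly and returns a new list instead of modifying d in place. FIELD_MAP and the sample record are assumptions for illustration (the record is copied from the question); the function signature differs from the original because the file path was not used.

```python
# Maps the field name wanted in the fixture to the column name in the CSV export.
FIELD_MAP = {
    "Name": "Name",
    "Distributor": "Distributor",
    "State": "State",
    "Brand": "Brand",
    "Cell": "Cell",
    "EmailAddress": "email address",
    "Notes": "Notes",
}

def re_serialize_reg_json(records):
    """Return records reshaped into the pk / model / fields layout."""
    reshaped = []
    for row in records:  # row is a dict, not an index
        fields = {target: row[source] for target, source in FIELD_MAP.items()}
        reshaped.append({"pk": row["pk"], "model": row["model"], "fields": fields})
    return reshaped

d = [
    {"model": "homepage.territorymanager", "pk": 1, "Name": "Aaron ##",
     "Distributor": "National Energy", "State": "BC", "Brand": "Trane",
     "Cell": "778-###-####", "email address": None, "Notes": None,
     "Unnamed: 9": None},
]
fixture = re_serialize_reg_json(d)
print(fixture)
```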
I have a dictionary like this:
{"Topic":"text","title":"texttitle","abstract":"textabs","year":"textyear","authors":"authors"}
I want to create another list as follows:
[{"label":{"Topic":"text","title":"texttitle","abstract":"textabs","year":"textyear","authors":"authors"},"value":
{"Topic":"text","title":"texttitle","abstract":"textabs","year":"textyear","authors":"authors"}}]
I have tried some methods with .items() but none of them gives the desired result.
Is that what you want?
dict_ = {"Topic":"text","title":"texttitle","abstract":"textabs","year":"textyear","authors":"authors"}
output = [{"label": dict_ , "value": dict_ }]
print(output)
[{"label":{"Topic":"text","title":"texttitle","abstract":"textabs","year":"textyear","authors":"authors"},"value":
{"Topic":"text","title":"texttitle","abstract":"textabs","year":"textyear","authors":"authors"}}] == [{"label": dict_ , "value": dict_ }]
Gives True
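One thing to keep in mind with this approach: "label" and "value" point at the same dictionary object, so changing one changes the other. If that matters for your use case, a sketch using copy.deepcopy (the variable names shared and independent are just for this example) would be:

```python
import copy

dict_ = {"Topic": "text", "title": "texttitle", "abstract": "textabs",
         "year": "textyear", "authors": "authors"}

# Both keys refer to the very same dict object.
shared = [{"label": dict_, "value": dict_}]
# Each key gets its own copy.
independent = [{"label": copy.deepcopy(dict_), "value": copy.deepcopy(dict_)}]

shared[0]["label"]["year"] = "2020"       # also visible through shared[0]["value"]
independent[0]["label"]["year"] = "1999"  # independent[0]["value"] is untouched

print(shared[0]["value"]["year"])
print(independent[0]["value"]["year"])
```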
Following my comment, below is the code I would use, assuming the keys and output shown:
# The keys could come from somewhere else
vals = ["1", "2", "3", "4"]
# Probably also coming from an external source
example_op = {"Topic":"text","title":"texttitle","abstract":"textabs","year":"textyear","authors":"authors"}
# Global list
item_list = []
temp_dict = {}
for key in vals:
    temp_dict[key] = example_op
item_list.append(temp_dict)
Final output of the list would be as:
Out[9]:
[{'1': {'Topic': 'text',
'title': 'texttitle',
'abstract': 'textabs',
'year': 'textyear',
'authors': 'authors'},
'2': {'Topic': 'text',
'title': 'texttitle',
'abstract': 'textabs',
'year': 'textyear',
'authors': 'authors'},
'3': {'Topic': 'text',
'title': 'texttitle',
'abstract': 'textabs',
'year': 'textyear',
'authors': 'authors'},
'4': {'Topic': 'text',
'title': 'texttitle',
'abstract': 'textabs',
'year': 'textyear',
'authors': 'authors'}}]
Suppose I have a list of dicts (where each dict has the same keys) like this:
list_of_dicts = [
{'Id': 4726, 'Body': 'Hello from John', 'Title': None, 'Comments': 'Dallas. '},
{'Id': 4726, 'Body': 'Hello from Mary', 'Title': None, 'Comments': "Austin"},
{'Id': 4726, 'Body': 'Hello from Dylan', 'Title': None, 'Comments': "Boston"},
]
I need to concat only the Body, Title and Comments part and return a single dict, like this:
{'Id': 4726, 'Body': 'Hello from John Hello from Mary Hello from Dylan', 'Title': None, 'Comments': 'Dallas. Austin Boston'}
Please note that Title is None, so we have to be careful there. This is what I have done so far, but it fails somewhere and I cannot see where:
keys = set().union(*list_of_dicts)
print(keys)
k_value = list_of_dicts[0]['Id']
d_dict = {k: " ".join(str(dic.get(k, '')) for dic in list_of_dicts) for k in keys if k != 'Id'}
merged_dict = {'Id': k_value}
merged_dict.update(d_dict)
But, the above returns this ...which I do not like:
Final Merged Dict: {'Id': 4726, 'Body': 'Hello from John Hello from Mary Hello from Dylan', 'Title': 'None None None', 'Comments': 'Dallas. Austin Boston'}
First, I'd remove Id from keys to avoid having to skip it in the dictionary comprehension, and use a simple assignment rather than .update() at the end.
In the argument to join, filter out when dic[k] is None. And if the join results in an empty string (because all the values are None), convert that to None in the final result.
keys = set().union(*list_of_dicts)
keys.remove('Id')
print(keys)
k_value = list_of_dicts[0]['Id']
d_dict = {k: (" ".join(str(dic[k]) for dic in list_of_dicts if k in dic and dic[k] is not None) or None) for k in keys}
d_dict['Id'] = k_value
print(d_dict)
As you parse your list of dictionaries, you can store the intermediate results in defaultdict objects to hold a list of the string values. Once all the dictionaries have been parsed, you can then join together the strings.
from collections import defaultdict
dd_body = defaultdict(list)
dd_comments = defaultdict(list)
dd_titles = defaultdict(list)
for row in list_of_dicts:
    dd_body[row['Id']].append(row['Body'])
    dd_comments[row['Id']].append(row['Comments'])
    dd_titles[row['Id']].append(row['Title'] or '')  # Effectively removes `None`.
result = []
for id_ in dd_body:  # All three dictionaries have the same keys.
    body = ' '.join(dd_body[id_]).strip()
    comments = ' '.join(dd_comments[id_]).strip()
    titles = ' '.join(dd_titles[id_]).strip() or None
    result.append({'Id': id_, 'Body': body, 'Title': titles, 'Comments': comments})
>>> result
[{'Id': 4726,
'Body': 'Hello from John Hello from Mary Hello from Dylan',
'Title': None,
'Comments': 'Dallas. Austin Boston'}]
Less Pythonic than the other answers, but I like to think that it is easy to understand.
body, title, comments = "", "", ""
list_of_dicts = [
    {'Id': 4726, 'Body': 'Hello from John', 'Title': None, 'Comments': 'Dallas. '},
    {'Id': 4726, 'Body': 'Hello from Mary', 'Title': None, 'Comments': "Austin"},
    {'Id': 4726, 'Body': 'Hello from Dylan', 'Title': None, 'Comments': "Boston"},
]
record_id = list_of_dicts[0]['Id']  # avoids shadowing the built-in id()
for row in list_of_dicts:  # `row` rather than `dict`, which shadows the built-in
    if row['Body'] is not None:
        body = (body + " " + row['Body'].strip()).strip()  # join with a single space
    if row['Title'] is not None:
        title = (title + " " + row['Title'].strip()).strip()
    if row['Comments'] is not None:
        comments = (comments + " " + row['Comments'].strip()).strip()
if title == "":
    title = None
if body == "":
    body = None
if comments == "":
    comments = None
record = {'Id': record_id, 'Body': body, 'Title': title, 'Comments': comments}
If only the Title field can be None, then it can be shortened by removing the checks on the other fields.
body, title, comments = "", "", ""
list_of_dicts = [
    {'Id': 4726, 'Body': 'Hello from John', 'Title': None, 'Comments': 'Dallas. '},
    {'Id': 4726, 'Body': 'Hello from Mary', 'Title': None, 'Comments': "Austin"},
    {'Id': 4726, 'Body': 'Hello from Dylan', 'Title': None, 'Comments': "Boston"}]
record_id = list_of_dicts[0]['Id']  # avoids shadowing the built-in id()
for row in list_of_dicts:  # `row` rather than `dict`, which shadows the built-in
    body = (body + " " + row['Body'].strip()).strip()  # join with a single space
    comments = (comments + " " + row['Comments'].strip()).strip()
    if row['Title'] is not None:
        title = (title + " " + row['Title'].strip()).strip()
if title == "":
    title = None
record = {'Id': record_id, 'Body': body, 'Title': title, 'Comments': comments}
For this type of data manipulation, pandas is your friend.
import pandas as pd
# Your list of dictionaries.
list_of_dicts = [
{'Id': 4726, 'Body': 'Hello from John', 'Title': None, 'Comments': 'Dallas. '},
{'Id': 4726, 'Body': 'Hello from Mary', 'Title': None, 'Comments': "Austin"},
{'Id': 4726, 'Body': 'Hello from Dylan', 'Title': None, 'Comments': "Boston"},
]
# Can be read into a pandas dataframe
df = pd.DataFrame(list_of_dicts)
# Do a database style groupby() and apply the function that you want to each group
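group_transformed_df = df.groupby('Id').agg(lambda x: ' '.join(s.strip() for s in x.dropna())).reset_index()  # dropna() skips None values, so an all-None column such as Title becomes ''; reset_index() gives a normal DataFrame back.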
# DataFrame() -> dict
output_dict = group_transformed_df.to_dict('records')
There are many types of dicts you can get from a DataFrame. You want the records option.
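To illustrate the difference between orientations, here is a tiny sketch comparing "records" with "list" on a made-up one-row frame (the frame contents are only for the example):

```python
import pandas as pd

frame = pd.DataFrame({"Id": [4726], "Body": ["Hello from John Hello from Mary"]})

as_records = frame.to_dict("records")  # a list with one dict per row
as_lists = frame.to_dict("list")       # one dict mapping each column to a list of values

print(as_records)
print(as_lists)
```

Since the question wants a single dict, you would take the first element of the records list.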
I want to convert this nested json into a df.
Tried different functions but none works correctly.
The encoding that worked for me was:
encoding = "utf-8-sig"
[{'replayableActionOperationState': 'SKIPPED',
'replayableActionOperationGuid': 'RAO_1037351',
'failedMessage': 'Cannot replay action: RAO_1037351: com.ebay.sd.catedor.core.model.DTOEntityPropertyChange; local class incompatible: stream classdesc serialVersionUID = 7777212484705611612, local class serialVersionUID = -1785129380151507142',
'userMessage': 'Skip all mode',
'username': 'gfannon',
'sourceAuditData': [{'guid': '24696601-b73e-43e4-bce9-28bc741ac117',
'operationName': 'UPDATE_CATEGORY_ATTRIBUTE_PROPERTY',
'creationTimestamp': 1563439725240,
'auditCanvasInfo': {'id': '165059', 'name': '165059'},
'auditUserInfo': {'id': 1, 'name': 'gfannon'},
'externalId': None,
'comment': None,
'transactionId': '0f135909-66a7-46b1-98f6-baf1608ffd6a',
'data': {'entity': {'guid': 'CA_2511202',
'tagType': 'BOTH',
'description': None,
'name': 'Number of Shelves'},
'propertyChanges': [{'propertyName': 'EntityProperty',
'oldEntity': {'guid': 'CAP_35',
'name': 'DisableAsVariant',
'group': None,
'action': 'SET',
'value': 'true',
'tagType': 'SELLER'},
'newEntity': {'guid': 'CAP_35',
'name': 'DisableAsVariant',
'group': None,
'action': 'SET',
'value': 'false',
'tagType': 'SELLER'}}],
'entityChanges': None,
'primary': True}}],
'targetAuditData': None,
'conflictedGuids': None,
'fatal': False}]
This is what I have tried so far. There were more attempts, but this got me as close as I could get.
with open(r"Desktop\Ann's json parsing\report.tsv", encoding='utf-8-sig') as data_file:
    data = json.load(data_file)
df = json_normalize(data)
print(df)
pd.DataFrame(df)  # The nested lists are shown as a whole column; I'm trying to parse those columns: 'failedMessage' and 'sourceAuditData'.
# I also tried json.loads/json(df), but the output isn't correct.
pd.DataFrame.from_dict(a['sourceAuditData'][0]['data']['propertyChanges'][0])  # This line retrieves one of the outputs I need, but I don't know how to apply it to the whole file.
The expected result should be a csv/xlsx file with a column and value for each row.
For your particular example:
def unroll_dict(d):
    data = []
    for k, v in d.items():
        if isinstance(v, list):
            data.append((k, ''))
            data.extend(unroll_dict(v[0]))
        elif isinstance(v, dict):
            data.append((k, ''))
            data.extend(unroll_dict(v))
        else:
            data.append((k, v))
    return data
And given the data in your question is stored in the variable example:
df = pd.DataFrame(unroll_dict(example[0])).set_index(0).transpose()
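If you would rather have one row per entry of sourceAuditData, pd.json_normalize can do that with its record_path and meta arguments. This is only a sketch on a trimmed-down copy of the record from the question; the choice of meta columns and the "audit." prefix are assumptions, and records where sourceAuditData is None would need to be filtered out first.

```python
import pandas as pd

# Trimmed-down version of the record shown in the question.
example = [{
    "username": "gfannon",
    "replayableActionOperationGuid": "RAO_1037351",
    "sourceAuditData": [{
        "guid": "24696601-b73e-43e4-bce9-28bc741ac117",
        "operationName": "UPDATE_CATEGORY_ATTRIBUTE_PROPERTY",
        "data": {"entity": {"guid": "CA_2511202", "name": "Number of Shelves"}},
    }],
}]

audit_df = pd.json_normalize(
    example,
    record_path="sourceAuditData",                      # one row per list entry
    meta=["username", "replayableActionOperationGuid"],  # repeated on every row
    record_prefix="audit.",                              # avoids clashes with meta names
)
print(audit_df.columns.tolist())
```

Nested dictionaries inside each entry become dotted column names, and audit_df.to_csv(...) would then write the flat table to a file.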
My question is very similar to Reorganize Dictionary.
My dictionary comes in a form very similar to the one presented there, so I want to replicate that approach.
But I get the error 'str' object cannot be interpreted as an integer. Here is my code.
appended_data = pd.DataFrame()
for item in data:
    for ii in list(item.keys()):
        df = pd.DataFrame.from_dict(item[ii], orient='index')
        df['date'] = int(ii)
        df['code'] = item[ii]
        appended_data = appended_data.append(df)
dic = [appended_data.to_dict(orient='records')]
# [[{'comp': '삼성자산운용', 'name': 'KODEX 레버리지', 'base': '코스피 200', 'lstdt': '2010/02/22', 'tax': '배당소득세(보유기간과세)', 'earn': '55.59', 'bosu': '0.64', 'ocha': '3.87', 'gap': '-0.48', 'nav': '18,494', 'volt': '높음', 'bun': '주식-시장대표', 'repli': '실물', 'pdf': '', 'info': '보기', 'date': 20180102, 'code': 'A278240'}, {'comp': '미래에셋자산운용', 'name': 'TIGER 레버리지', 'base': '코스피 200', 'lstdt': '2010/04/09', 'tax': '배당소득세(보유기간과세)', 'earn': '57.98', 'bosu': '0.09', 'ocha': '3.66', 'gap': '-0.27', 'nav': '295', 'volt': '보통', 'bun': '주식-시장대표', 'repli': '실물', 'pdf': '', 'info': '보기', 'date': 20180102, 'code': 'A267770'}]
output = {}
for entry in dic:
    entry = entry.copy()
    date = entry.pop('date') #Here is the error
    code = entry.pop('code') #Here is the error
    output.setdefault(code, {})[data] = entry
Thank you so much
Finally, I found the answer, but it requires a lot of memory in the computer.
import collections
import itertools
import pandas as pd
def expand_grid(data_dict):
    rows = itertools.product(*data_dict.values())
    return pd.DataFrame.from_records(rows, columns=data_dict.keys())
appended_data = pd.DataFrame()
for item in re_data2:
    for ii in list(item.keys()):
        df = pd.DataFrame.from_dict(item[ii], orient='index')
        df['date'] = ii
        df['code'] = item[ii]
        # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
        appended_data = pd.concat([appended_data, df])
dic = [appended_data.to_dict(orient='records')]
d = collections.defaultdict(dict)
for entry in dic:  # dic wraps the records in an outer list
    for item in entry:  # each item is one record dict
        key1 = item['code']
        key2 = item['date']
        value = {item['comp'], item['name'], item['base'], item['lstdt'], item['tax'], item['earn'], \
                 item['bosu'], item['ocha'], item['gap'], item['nav'], item['volt'], item['bun'], \
                 item['repli'], item['pdf'], item['info']}
        d[key1][key2] = value
print(d)
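For what it's worth, the original error most likely came from dic being a list that wraps another list: each entry was itself a list, and list.pop expects an integer index, so entry.pop('date') raises that TypeError. If memory is the concern, a sketch that skips the intermediate DataFrame might look like the following. It assumes the raw data is a list of dictionaries shaped like {date: {code: {field: value}}}; the function name reorganize and the shortened sample are made up for illustration.

```python
from collections import defaultdict

def reorganize(data):
    """Regroup {date: {code: fields}} items into {code: {date: fields}}."""
    by_code = defaultdict(dict)
    for item in data:
        for date, codes in item.items():
            for code, fields in codes.items():
                by_code[code][int(date)] = dict(fields)  # copy so the input is not shared
    return dict(by_code)

# Hypothetical, shortened sample in the assumed input shape.
sample = [
    {"20180102": {
        "A278240": {"name": "KODEX 레버리지", "nav": "18,494"},
        "A267770": {"name": "TIGER 레버리지", "nav": "295"},
    }},
]
output = reorganize(sample)
print(output)
```

Unlike the set used above, this keeps each field under its own name.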