How to flatten columns in pandas dataframe with some columns as json? - python

I am trying to flatten some columns in my dataframe, but unfortunately it does not work.
What would be the correct way of doing this?
created_at | tweet_hashtag | tweet_cashtag
2022-07-23 | [{'start': 16, 'end': 27, 'tag': 'blockchain'}, {'start': 28, 'end': 32, 'tag': 'btc'}, {'start': 33, 'end': 37, 'tag': 'eth'}, {'start': 38, 'end': 42, 'tag': 'eth'}] | [{'start': 0, 'end': 4, 'tag': 'Act'}, {'start': 7, 'end': 11, 'tag': 'jar'}]
2022-04-24 | [{'start': 6, 'end': 7, 'tag': 'chain'}, {'start': 8, 'end': 3, 'tag': 'btc'}, {'start': 3, 'end': 7, 'tag': 'eth'}] | [{'start': 4, 'end': 8, 'tag': 'Act'}, {'start': 7, 'end': 9, 'tag': 'aapl'}]
And my preferred result would be:
created_at | tweet_hashtag.tag | tweet_cashtag.tag
2022-07-23 | blockchain, btc, eth, eth | Act, jar
2022-04-24 | chain, btc, eth | Act, aapl
Thanks in advance!
I tried to flatten with this solution, but it does not work: How to apply json_normalize on entire pandas column

You can use:
def get_values(a, b):
    x_values = []
    for i in range(0, len(a)):
        x_values.append(a[i]['tag'])
    y_values = []
    for j in range(0, len(b)):
        y_values.append(b[j]['tag'])
    return ','.join(x_values), ','.join(y_values)

# result_type='expand' spreads the returned tuple over the two target columns
df[['tweet_hashtag','tweet_cashtag']] = df[['tweet_hashtag','tweet_cashtag']].apply(lambda x: get_values(x['tweet_hashtag'], x['tweet_cashtag']), axis=1, result_type='expand')
Or:
def get_hashtags(a):
    x_values = []
    for i in range(0, len(a)):
        x_values.append(a[i]['tag'])
    return ','.join(x_values)

def get_cashtags(b):
    y_values = []
    for i in range(0, len(b)):
        y_values.append(b[i]['tag'])
    return ','.join(y_values)

df['tweet_hashtag'] = df['tweet_hashtag'].apply(get_hashtags)
df['tweet_cashtag'] = df['tweet_cashtag'].apply(get_cashtags)
print(df)
'''
created_at tweet_hashtag tweet_cashtag
0 2022-07-23 blockchain,btc,eth,eth Act,jar
1 2022-04-24 chain,btc,eth Act,aapl
'''
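Both versions above repeat the same loop for each column. As a rough sketch, a single helper can cover both. The name join_tags is made up for illustration, and the ast.literal_eval branch assumes the cells might be stored as strings (for example after reading a CSV), which the question does not state.

```python
import ast


def join_tags(cell, sep=", "):
    """Return the 'tag' values of a list of dicts as one joined string."""
    # Assumption: a cell is either a list of dicts or the string repr of one.
    if isinstance(cell, str):
        cell = ast.literal_eval(cell)
    return sep.join(entry["tag"] for entry in cell)


# Plain-Python stand-in for the first dataframe row from the question.
row = {
    "created_at": "2022-07-23",
    "tweet_hashtag": [
        {"start": 16, "end": 27, "tag": "blockchain"},
        {"start": 28, "end": 32, "tag": "btc"},
        {"start": 33, "end": 37, "tag": "eth"},
        {"start": 38, "end": 42, "tag": "eth"},
    ],
    "tweet_cashtag": [
        {"start": 0, "end": 4, "tag": "Act"},
        {"start": 7, "end": 11, "tag": "jar"},
    ],
}

flat = {
    "created_at": row["created_at"],
    "tweet_hashtag": join_tags(row["tweet_hashtag"]),
    "tweet_cashtag": join_tags(row["tweet_cashtag"]),
}
print(flat)
```

With a dataframe, the same helper could be passed to Series.apply, for example df['tweet_hashtag'].apply(join_tags).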

Related

Python create list of dicts

I am new to python and I am trying to construct data structure from existing data.
I have the following:
[
{'UserName': 'aaa', 'AccessKeyId': 'AKIAYWQTISJD6X27YVK', 'Status': 'Active', 'CreateDate': datetime.datetime(2022, 9, 8, 15, 56, 39, tzinfo=tzutc())},
{'UserName': 'eee', 'AccessKeyId': 'AKIAYWQTISJD6QXMAKY', 'Status': 'Active', 'CreateDate': datetime.datetime(2023, 1, 24, 12, 30, 59, tzinfo=tzutc())},
{'UserName': 'eee', 'AccessKeyId': 'AKIAYWQTISJDUARK6FV', 'Status': 'Active', 'CreateDate': datetime.datetime(2023, 1, 24, 16, 58, 38, tzinfo=tzutc())}
]
I need to get this:
{
"aaa": [
{'AccessKeyId': 'AKIAYWQTISJD6X27YVK', 'Status': 'Active', 'CreateDate': datetime.datetime(2022, 9, 8, 15, 56, 39, tzinfo=tzutc())}],
"eee": [
{'AccessKeyId': 'AKIAYWQTISJD6QXMAKY', 'Status': 'Active', 'CreateDate': datetime.datetime(2023, 1, 24, 12, 30, 59, tzinfo=tzutc())},
{'AccessKeyId': 'AKIAYWQTISJDUARK6FV', 'Status': 'Active', 'CreateDate': datetime.datetime(2023, 1, 24, 16, 58, 38, tzinfo=tzutc())}
]
}
I tried the following:
list_per_user = {i['UserName']: copy.deepcopy(i) for i in key_list}
for obj in list_per_user:
    del list_per_user[obj]['UserName']
but this does not give me a list per user. If a user has two keys, only the last one is kept. I don't know how to build the list I need for each user.
Thanks!
Create an external dict that maps username -> list of entries.
data = [
{'UserName': 'aaa', 'AccessKeyId': 'AKIAYWQTISJD6X27YVK', 'Status': 'Active', 'CreateDate': datetime.datetime(2022, 9, 8, 15, 56, 39, tzinfo=tzutc())},
{'UserName': 'eee', 'AccessKeyId': 'AKIAYWQTISJD6QXMAKY', 'Status': 'Active', 'CreateDate': datetime.datetime(2023, 1, 24, 12, 30, 59, tzinfo=tzutc())},
{'UserName': 'eee', 'AccessKeyId': 'AKIAYWQTISJDUARK6FV', 'Status': 'Active', 'CreateDate': datetime.datetime(2023, 1, 24, 16, 58, 38, tzinfo=tzutc())}
]
new_data = {}
for entry in data:
    # setdefault creates the user's list on first sight, then returns it
    new_data.setdefault(entry["UserName"], []).append(
        {k: v for k, v in entry.items() if k != "UserName"}
    )
print(new_data)
Output (some fields hidden because I don't want to import those libraries in my repl, but they'll be there when you run it)
{'aaa': [{'AccessKeyId': 'AKIAYWQTISJD6X27YVK', 'Status': 'Active'}],
'eee': [{'AccessKeyId': 'AKIAYWQTISJD6QXMAKY', 'Status': 'Active'},
{'AccessKeyId': 'AKIAYWQTISJDUARK6FV', 'Status': 'Active'}]}
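The same grouping can also be written with collections.defaultdict, which creates the empty list for a new user automatically. This is only a sketch: the function name group_by_user is made up, and the CreateDate field is left out of the sample data to keep it short.

```python
from collections import defaultdict


def group_by_user(entries):
    """Map each UserName to the list of its entries, minus the UserName key."""
    grouped = defaultdict(list)
    for entry in entries:
        # Copy every field except the one used as the grouping key.
        grouped[entry["UserName"]].append(
            {k: v for k, v in entry.items() if k != "UserName"}
        )
    # Convert back to a plain dict so missing users raise KeyError again.
    return dict(grouped)


key_list = [
    {"UserName": "aaa", "AccessKeyId": "AKIAYWQTISJD6X27YVK", "Status": "Active"},
    {"UserName": "eee", "AccessKeyId": "AKIAYWQTISJD6QXMAKY", "Status": "Active"},
    {"UserName": "eee", "AccessKeyId": "AKIAYWQTISJDUARK6FV", "Status": "Active"},
]

per_user = group_by_user(key_list)
print(per_user)
```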

Append a list of dictionaries to the value in another dictionary

I am trying to create nested dictionaries as I loop through tokens output by my NER model. This is the code that I have so far:
token_classifier = pipeline('ner', model='./fine_tune_nerbert_output/', tokenizer = './fine_tune_nerbert_output/', aggregation_strategy="average")
sentence = "alisa brown i live in san diego, california and sometimes in kansas city, missouri"
tokens = token_classifier(sentence)
which outputs:
[{'entity_group': 'LABEL_1',
'score': 0.99938214,
'word': 'alisa',
'start': 0,
'end': 5},
{'entity_group': 'LABEL_2',
'score': 0.9972813,
'word': 'brown',
'start': 6,
'end': 11},
{'entity_group': 'LABEL_0',
'score': 0.99798816,
'word': 'i live in',
'start': 12,
'end': 21},
{'entity_group': 'LABEL_3',
'score': 0.9993938,
'word': 'san',
'start': 22,
'end': 25},
{'entity_group': 'LABEL_4',
'score': 0.9988097,
'word': 'diego',
'start': 26,
'end': 31},
{'entity_group': 'LABEL_0',
'score': 0.9996742,
'word': ',',
'start': 31,
'end': 32},
{'entity_group': 'LABEL_3',
'score': 0.9985813,
'word': 'california',
'start': 33,
'end': 43},
{'entity_group': 'LABEL_0',
'score': 0.9997311,
'word': 'and sometimes in',
'start': 44,
'end': 60},
{'entity_group': 'LABEL_3',
'score': 0.9995384,
'word': 'kansas',
'start': 61,
'end': 67},
{'entity_group': 'LABEL_4',
'score': 0.9988242,
'word': 'city',
'start': 68,
'end': 72},
{'entity_group': 'LABEL_0',
'score': 0.99949193,
'word': ',',
'start': 72,
'end': 73},
{'entity_group': 'LABEL_3',
'score': 0.99960154,
'word': 'missouri',
'start': 74,
'end': 82}]
I then run a for loop:
ner_dict = dict()
nested_dict = dict()
for token in tokens:
    if token['entity_group'] != 'LABEL_0':
        if token['entity_group'] in ner_dict:
            nested_dict[token['entity_group']] = {}
            nested_dict[token['entity_group']][token['word']] = token['score']
            ner_dict.update({token['entity_group']: (ner_dict[token['entity_group']], nested_dict[token['entity_group']])})
        else:
            ner_dict[token['entity_group']] = {}
            ner_dict[token['entity_group']][token['word']] = token['score']
This outputs:
{'LABEL_1': {'devyn': 0.9995816},
'LABEL_2': {'donahue': 0.9996502},
'LABEL_3': ((({'san': 0.9994766}, {'california': 0.998961}),
{'san': 0.99925905}),
{'california': 0.9987863}),
'LABEL_4': ({'francisco': 0.99923646}, {'diego': 0.9992399})}
which is close to what I want but this is my ideal output:
{'LABEL_1': {'devyn': 0.9995816},
'LABEL_2': {'donahue': 0.9996502},
'LABEL_3': ({'san': 0.9994766}, {'california': 0.998961}, {'san': 0.99925905},
{'california': 0.9987863}),
'LABEL_4': ({'francisco': 0.99923646}, {'diego': 0.9992399})}
How would I do this without each entry ending up in a separate nested tuple? Thanks in advance.
Based on the input provided, your output for LABEL_4 should contain diego and city. Something like the following:
{
'LABEL_1': {'alisa': 0.99938214},
'LABEL_2': {'brown': 0.9972813},
'LABEL_3': {'san': 0.9993938, 'california': 0.9985813, 'kansas': 0.9995384},
'LABEL_4': {'diego': 0.9988097, 'city': 0.9988242}
}
If the above output is what you desire, change the code to
ner_dict = dict()
for token in tokens:
    if token['entity_group'] != 'LABEL_0':
        nested_dict = ner_dict.setdefault(token['entity_group'], {})
        nested_dict[token['word']] = token['score']
Here is an example that you can use with your code:
ner_dict = {}
for token in tokens:
    if token['entity_group'] != 'LABEL_0':
        ner_dict.setdefault(token['entity_group'], {})[token['word']] = token['score']
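A dict keyed by word keeps only one score per word, so a word that occurs twice under the same label is overwritten. If every occurrence should be kept, as the repeated 'san' entries in the asker's ideal output suggest, one option is to collect a list of single-entry dicts per label. The token list below is a hypothetical reconstruction using scores from that ideal output, not real pipeline output.

```python
# Hypothetical tokens; words and scores are taken from the asker's ideal output.
tokens = [
    {"entity_group": "LABEL_3", "word": "san", "score": 0.9994766},
    {"entity_group": "LABEL_0", "word": ",", "score": 0.9996742},
    {"entity_group": "LABEL_3", "word": "california", "score": 0.998961},
    {"entity_group": "LABEL_3", "word": "san", "score": 0.99925905},
]

ner_lists = {}
for token in tokens:
    if token["entity_group"] != "LABEL_0":
        # One flat list per label, so repeated words are all preserved.
        ner_lists.setdefault(token["entity_group"], []).append(
            {token["word"]: token["score"]}
        )

print(ner_lists)
```

Each label then maps to a flat list, which avoids the nested tuples from the original loop.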

Flatten/merge a list of dictionaries in python

I have a list of dictionaries:
data = [{'average': 2, 'day': '2022-01-01'},
{'average': 3, 'day': '2022-01-02'},
{'average': 5, 'day': '2022-01-03'},
{'sum': 8, 'day': '2022-01-01'},
{'sum': 15, 'day': '2022-01-02'},
{'sum': 9, 'day': '2022-01-03'},
{'total_value': 19, 'day': '2022-01-01'},
{'total_value': 99, 'day': '2022-01-02'},
{'total_value': 15, 'day': '2022-01-03'}]
I want my output as:
output = [{'average': 2, 'sum': 8, 'total_value': 19, 'day': '2022-01-01'},
{'average': 3, 'sum': 15, 'total_value': 99, 'day': '2022-01-02'},
{'average': 5, 'sum': 9, 'total_value': 15, 'day': '2022-01-03'}]
The output puts the values together based off their date. My approaches so far have been to try and separate everything out into different dictionaries (date_dict, sum_dict, etc.) and then bringing them all together, but that doesn't seem to work and is extremely sloppy.
You could iterate over data and create a dictionary using day as key:
data = [{'average': 2, 'day': '2022-01-01'},
{'average': 3, 'day': '2022-01-02'},
{'average': 5, 'day': '2022-01-03'},
{'sum': 8, 'day': '2022-01-01'},
{'sum': 15, 'day': '2022-01-02'},
{'sum': 9, 'day': '2022-01-03'},
{'total_value': 19, 'day': '2022-01-01'},
{'total_value': 99, 'day': '2022-01-02'},
{'total_value': 15, 'day': '2022-01-03'}]
output = {}
for item in data:
    if item['day'] not in output:
        # Copy the dict so the later update() does not modify the entries in data
        output[item['day']] = dict(item)
    else:
        output[item['day']].update(item)
print(list(output.values()))
Out:
[
{'average': 2, 'day': '2022-01-01', 'sum': 8, 'total_value': 19},
{'average': 3, 'day': '2022-01-02', 'sum': 15, 'total_value': 99},
{'average': 5, 'day': '2022-01-03', 'sum': 9, 'total_value': 15}
]
Had a bit of fun and made it with dict/list comprehension. Check out that neat | operator in python 3.9+ :-)
Python <3.9
from collections import ChainMap
data_grouped_by_day = {
    day: dict(ChainMap(*[d for d in data if d["day"] == day]))
    for day in {d["day"] for d in data}
}
for day, group_data in data_grouped_by_day.items():
    group_data.update(day=day)
result = list(data_grouped_by_day.values())
Python 3.9+
from collections import ChainMap
result = [
    dict(ChainMap(*[d for d in data if d["day"] == day])) | {"day": day}
    for day in {d["day"] for d in data}
]
The output in both cases is (keys order may vary)
[{'total_value': 99, 'day': '2022-01-02', 'sum': 15, 'average': 3},
{'total_value': 15, 'day': '2022-01-03', 'sum': 9, 'average': 5},
{'total_value': 19, 'day': '2022-01-01', 'sum': 8, 'average': 2}]
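The two merge tools used above resolve duplicate keys in opposite directions, which matters as soon as two dicts for the same day share a key other than 'day'. A small illustration with made-up values (the | operator needs Python 3.9+):

```python
from collections import ChainMap

# Made-up dicts that disagree on 'average'.
first = {"day": "2022-01-01", "average": 2}
second = {"day": "2022-01-01", "average": 7, "sum": 8}

# ChainMap looks keys up from left to right, so the first mapping wins.
merged_chain = dict(ChainMap(first, second))

# Dict union keeps the value from the right-hand operand.
merged_union = first | second

print(merged_chain["average"], merged_union["average"])
```

In the question's data each key other than 'day' appears once per day, so both approaches give the same result there.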

Altering string using a list of dictionaries

Background
I am using NeuroNER http://neuroner.com/ to label text data sample_string as seen below.
sample_string = 'Patient Jane Candy was seen by Dr. Smith on 12/1/2000 and her number is 1111112222'
Output (using NeuroNER)
My output is a list of dictionaries, dic_list:
dic_list = [
{'id': 'T1', 'type': 'PATIENT', 'start': 8, 'end': 11, 'text': 'Jane'},
{'id': 'T2', 'type': 'PATIENT', 'start': 13, 'end': 17, 'text': 'Candy'},
{'id': 'T3', 'type': 'DOCTOR', 'start': 35, 'end': 39, 'text': 'Smith'},
{'id': 'T4', 'type': 'DATE', 'start': 44, 'end': 52, 'text': '12/1/2000'},
{'id': 'T5', 'type': 'PHONE', 'start': 72, 'end': 81, 'text': '1111112222'}]
Legend
id = text ID
type = type of text being identified
start = starting position of identified text
end = ending position of identified text
text = text that is identified
Goal
Since the location of each text (e.g. Jane) is given by start and end, I would like to replace each text from dic_list with **BLOCK** in my string sample_string
Desired Output
sample_string = 'Patient **BLOCK** **BLOCK** was seen by Dr. **BLOCK** on **BLOCK** and her number is **BLOCK**'
Question
I have tried Replacing a character from a certain index and Edit the values in a list of dictionaries? but they are not quite what I am looking for
How do I achieve my desired output?
If you want a solution based on the start and end indexes, you can use the intervals between the entries in dic_list to work out which parts of the string to keep, then join them with **BLOCK**.
Try this:
sample_string = 'Patient Jane Candy was seen by Dr. Smith on 12/1/2018 and her number is 5041112222'
dic_list = [
{'id': 'T1', 'type': 'PATIENT', 'start': 8, 'end': 11, 'text': 'Jane'},
{'id': 'T2', 'type': 'PATIENT', 'start': 13, 'end': 17, 'text': 'Candy'},
{'id': 'T3', 'type': 'DOCTOR', 'start': 35, 'end': 39, 'text': 'Smith'},
{'id': 'T4', 'type': 'DATE', 'start': 44, 'end': 52, 'text': '12/1/2018'},
{'id': 'T5', 'type': 'PHONE', 'start': 72, 'end': 81, 'text': '5041112222'}]
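# Keep the text before the first entity, between consecutive entities, and after the last one.
# 'end' is inclusive, so each kept part starts at end + 1.
parts_to_take = (
    [(0, dic_list[0]['start'])]
    + [(first['end'] + 1, second['start']) for first, second in zip(dic_list, dic_list[1:])]
    + [(dic_list[-1]['end'] + 1, len(sample_string))]
)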
parts = [sample_string[start:end] for start, end in parts_to_take]
sample_string = '**BLOCK**'.join(parts)
print(sample_string)
I may be missing something but you can just use .replace():
sample_string = 'Patient Jane Candy was seen by Dr. Smith on 12/1/2018 and her number is 5041112222'
dic_list = [
{'id': 'T1', 'type': 'PATIENT', 'start': 0, 'end': 6, 'text': 'Jane'},
{'id': 'T2', 'type': 'PATIENT', 'start': 8, 'end': 11, 'text': 'Candy'},
{'id': 'T3', 'type': 'DOCTOR', 'start': 35, 'end': 39, 'text': 'Smith'},
{'id': 'T4', 'type': 'DATE', 'start': 44, 'end': 52, 'text': '12/1/2018'},
{'id': 'T5', 'type': 'PHONE', 'start': 72, 'end': 81, 'text': '5041112222'}]
for dic in dic_list:
    # Note: replace() changes every occurrence of the text, not only the labelled one
    sample_string = sample_string.replace(dic['text'], '**BLOCK**')
print(sample_string)
Though regex will probably be faster:
import re
sample_string = 'Patient Jane Candy was seen by Dr. Smith on 12/1/2018 and her number is 5041112222'
dic_list = [
{'id': 'T1', 'type': 'PATIENT', 'start': 0, 'end': 6, 'text': 'Jane'},
{'id': 'T2', 'type': 'PATIENT', 'start': 8, 'end': 11, 'text': 'Candy'},
{'id': 'T3', 'type': 'DOCTOR', 'start': 35, 'end': 39, 'text': 'Smith'},
{'id': 'T4', 'type': 'DATE', 'start': 44, 'end': 52, 'text': '12/1/2018'},
{'id': 'T5', 'type': 'PHONE', 'start': 72, 'end': 81, 'text': '5041112222'}]
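# re.escape keeps characters such as '.' or '+' in the texts from being read as regex syntax
pattern = re.compile('|'.join(re.escape(dic['text']) for dic in dic_list))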
result = pattern.sub('**BLOCK**', sample_string)
print(result)
Both output:
Patient **BLOCK** **BLOCK** was seen by Dr. **BLOCK** on **BLOCK** and her number is **BLOCK**
Per the suggestion of @Error - Syntactical Remorse, here is a version that uses the start and end indexes:
sample_string = 'Patient Jane Candy was seen by Dr. Smith on 12/1/2018 and her number is 5041112222'
dic_list = [
{'id': 'T1', 'type': 'PATIENT', 'start': 8, 'end': 11, 'text': 'Jane'},
{'id': 'T2', 'type': 'PATIENT', 'start': 13, 'end': 17, 'text': 'Candy'},
{'id': 'T3', 'type': 'DOCTOR', 'start': 35, 'end': 39, 'text': 'Smith'},
{'id': 'T4', 'type': 'DATE', 'start': 44, 'end': 52, 'text': '12/1/2018'},
{'id': 'T5', 'type': 'PHONE', 'start': 72, 'end': 81, 'text': '5041112222'}]
offset = 0
filler = '**BLOCK**'
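for dic in dic_list:
    # 'end' is inclusive, so the text after the entity starts at end + 1
    sample_string = sample_string[:dic['start'] + offset] + filler + sample_string[dic['end'] + offset + 1:]
    # Track how much the string has grown or shrunk so later indexes still line up
    offset += dic['start'] - dic['end'] + len(filler) - 1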
print(sample_string)
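The offset bookkeeping can be avoided by replacing the spans from right to left, because a replacement never shifts the indexes of text to its left. A minimal sketch, assuming the spans do not overlap and that 'end' is inclusive as in the sample data; the function name redact is made up.

```python
def redact(text, spans, filler="**BLOCK**"):
    """Replace each span of text (inclusive 'start'/'end' indexes) with filler."""
    # Go from the last span to the first so earlier indexes stay valid.
    for span in sorted(spans, key=lambda s: s["start"], reverse=True):
        text = text[: span["start"]] + filler + text[span["end"] + 1 :]
    return text


sample_string = (
    "Patient Jane Candy was seen by Dr. Smith on 12/1/2018 "
    "and her number is 5041112222"
)
dic_list = [
    {"id": "T1", "type": "PATIENT", "start": 8, "end": 11, "text": "Jane"},
    {"id": "T2", "type": "PATIENT", "start": 13, "end": 17, "text": "Candy"},
    {"id": "T3", "type": "DOCTOR", "start": 35, "end": 39, "text": "Smith"},
    {"id": "T4", "type": "DATE", "start": 44, "end": 52, "text": "12/1/2018"},
    {"id": "T5", "type": "PHONE", "start": 72, "end": 81, "text": "5041112222"},
]

redacted = redact(sample_string, dic_list)
print(redacted)
```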

Trying To Sort A List Of Dicts After Combining Django Query

I have a Django view that queries several different models and combines the results, which then need to be ordered by date ('created_at'). Right now, combining the models gives me a list of dicts like the one below. How can I sort it by date?
[{'content': u'Just another another message', 'created_at': datetime.datetime(2018, 4, 22, 15, 35, 11, 577175, tzinfo=<UTC>), u'successmatch_id': 5, u'id': 8, 'reciever': u'UserA'},
{'content': u'testing blah', 'created_at': datetime.datetime(2018, 4, 22, 15, 33, 28, 84469, tzinfo=<UTC>), u'successmatch_id': 5, u'id': 7, 'reciever': u'UserB'},
{'content': u'Hi how are you', 'created_at': datetime.datetime(2018, 4, 22, 13, 29, 49, 516701, tzinfo=<UTC>), u'successmatch_id': 5, u'id': 6, 'reciever': u'UserA'}]
Python's built-in sorting has the ability to specify what metric to sort by:
x = [{"test": 1}, {"test": 2}, {"test": 0}]
x.sort(key=lambda item: item["test"])
x is edited in place, and is now:
[{'test': 0}, {'test': 1}, {'test': 2}]
So, in your case, assuming your list is called my_list, you'd want to do:
my_list.sort(key=lambda item: item["created_at"])
Or, if you wanted the newest dicts to occur first,
my_list.sort(key=lambda item: item["created_at"], reverse=True)
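If the original list should stay untouched, the built-in sorted() returns a new list instead of sorting in place, and operator.itemgetter can stand in for the lambda. A short sketch with trimmed-down dicts; the name messages and the reduced set of keys are only for illustration.

```python
from datetime import datetime, timezone
from operator import itemgetter

# Trimmed-down version of the list from the question.
messages = [
    {"id": 8, "created_at": datetime(2018, 4, 22, 15, 35, 11, 577175, tzinfo=timezone.utc)},
    {"id": 7, "created_at": datetime(2018, 4, 22, 15, 33, 28, 84469, tzinfo=timezone.utc)},
    {"id": 6, "created_at": datetime(2018, 4, 22, 13, 29, 49, 516701, tzinfo=timezone.utc)},
]

# sorted() builds a new list; 'messages' keeps its original order.
oldest_first = sorted(messages, key=itemgetter("created_at"))
newest_first = sorted(messages, key=itemgetter("created_at"), reverse=True)

print([m["id"] for m in oldest_first])
print([m["id"] for m in newest_first])
```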
If you are happy using a 3rd party library, you can use pandas, which accepts a list of dictionaries.
But note that datetime objects may be converted to pandas.Timestamp objects.
import pandas as pd
import datetime
lst = [{'content': u'Just another another message',
'created_at': datetime.datetime(2018, 4, 22, 15, 35, 11, 577175, tzinfo=None),
u'successmatch_id': 5, u'id': 8, 'reciever': u'UserA'},
{'content': u'testing blah',
'created_at': datetime.datetime(2018, 4, 22, 15, 33, 28, 84469, tzinfo=None),
u'successmatch_id': 5, u'id': 7, 'reciever': u'UserB'},
{'content': u'Hi how are you',
'created_at': datetime.datetime(2018, 4, 22, 13, 29, 49, 516701, tzinfo=None),
u'successmatch_id': 5, u'id': 6, 'reciever': u'UserA'}]
res = pd.DataFrame(lst).sort_values('created_at').T.to_dict().values()
Result:
dict_values([{'content': 'Hi how are you', 'created_at': Timestamp('2018-04-22 13:29:49.516701'),
'id': 6, 'reciever': 'UserA', 'successmatch_id': 5},
{'content': 'testing blah', 'created_at': Timestamp('2018-04-22 15:33:28.084469'),
'id': 7, 'reciever': 'UserB', 'successmatch_id': 5},
{'content': 'Just another another message', 'created_at': Timestamp('2018-04-22 15:35:11.577175'),
'id': 8, 'reciever': 'UserA', 'successmatch_id': 5}])
