Python: most efficient way to categorize transactions

I have a large list of transactions that I want to categorize.
It looks like this:
"transactions": [
    {
        "id": "20200117-16045-0",
        "date": "2020-01-17",
        "creationTime": null,
        "text": "SuperB Vesterbro T 74637",
        "originalText": "SuperB Vesterbro T 74637",
        "details": null,
        "category": null,
        "amount": {
            "value": -160.45,
            "currency": "DKK"
        },
        "balance": {
            "value": 12572.68,
            "currency": "DKK"
        },
        "type": "Card",
        "state": "Booked"
    },
    {
        "id": "20200117-4800-0",
        "date": "2020-01-17",
        "creationTime": null,
        "text": "Rent 45228",
        "originalText": "Rent 45228",
        "details": null,
        "category": null,
        "amount": {
            "value": -48.00,
            "currency": "DKK"
        },
        "balance": {
            "value": 12733.13,
            "currency": "DKK"
        },
        "type": "Card",
        "state": "Booked"
    },
    {
        "id": "20200114-1200-0",
        "date": "2020-01-14",
        "creationTime": null,
        "text": "Superbest 86125",
        "originalText": "SUPERBEST 86125",
        "details": null,
        "category": null,
        "amount": {
            "value": -12.00,
            "currency": "DKK"
        },
        "balance": {
            "value": 12781.13,
            "currency": "DKK"
        },
        "type": "Card",
        "state": "Booked"
    }
]
I loaded in the data like this:
import json

import pandas as pd

def load_transactions():
    with open('transactions.json') as transactions:
        file = json.load(transactions)
    data = pd.json_normalize(file)['transactions'][0]
    return pd.DataFrame(data)
And I have the following categories so far that I want to group the transactions by:
CATEGORIES = {
    'Groceries': ['SuperB', 'Superbest'],
    'Housing': ['Insurance', 'Rent']
}
Now I would like to loop through each row in the DataFrame and categorize each transaction by checking whether text contains one of the values from the CATEGORIES dictionary.
If it does, that transaction should be categorized under the corresponding key of the CATEGORIES dictionary, for instance Groceries.
How do I do this most efficiently?

IIUC, we can create a pipe-delimited pattern from your dictionary and do some assignment with .loc:
for k, v in CATEGORIES.items():
    pat = '|'.join(v)
    df.loc[df['text'].str.contains(pat), 'category'] = k

print(df[['text', 'category']])
                       text   category
0  SuperB Vesterbro T 74637  Groceries
1                Rent 45228    Housing
2           Superbest 86125  Groceries
A more efficient solution:
We create a single list of all your values and extract them with str.extract; at the same time we re-create your dictionary so that each value becomes a key, which we then map onto your target dataframe.
words = []
mapping_dict = {}
for k, v in CATEGORIES.items():
    for item in v:
        words.append(item)
        mapping_dict[item] = k

ext = df['text'].str.extract(f"({'|'.join(words)})")
df['category'] = ext[0].map(mapping_dict)
print(df[['text', 'category']])

                       text   category
0  SuperB Vesterbro T 74637  Groceries
1                Rent 45228    Housing
2           Superbest 86125  Groceries
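One caveat worth noting: both str.contains and str.extract treat the joined pattern as a regular expression, so literal names containing regex metacharacters should be escaped, and matching is case-sensitive by default (the sample data contains both "Superbest" and "SUPERBEST"). A minimal case-insensitive variant, assuming you want those casing variants to land in the same category:
import re

# escape the literal names so regex metacharacters cannot break the pattern
pat = '|'.join(re.escape(w) for w in words)
# extract case-insensitively (assumption: "SUPERBEST" should match "Superbest")
ext = df['text'].str.extract(f"({pat})", flags=re.IGNORECASE)

# map through casefolded keys so any casing of a match finds its category
mapping_ci = {k.casefold(): v for k, v in mapping_dict.items()}
df['category'] = ext[0].str.casefold().map(mapping_ci)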

Related

Create/re-create a list of dictionaries from a dictionary via a Python recursion function

So, I'm trying to parse this JSON object into multiple events, as it's the expected input for an ETL tool. I know this is quite straightforward if we do it via loops, if statements and explicitly defined search fields for the given events. That method is not feasible, because I have multiple heavily nested JSON objects, and I would prefer to let Python recursion handle the heavy lifting. The following is a sample object, which consists of strings, lists and dicts (basically covering most use cases in the data I have).
{
    "event_name": "restaurants",
    "properties": {
        "_id": "5a9909384309cf90b5739342",
        "name": "Mangal Kebab Turkish Restaurant",
        "restaurant_id": "41009112",
        "borough": "Queens",
        "cuisine": "Turkish",
        "address": {
            "building": "4620",
            "coord": {
                "0": -73.9180155,
                "1": 40.7427742
            },
            "street": "Queens Boulevard",
            "zipcode": "11104"
        },
        "grades": [
            {
                "date": 1414540800000,
                "grade": "A",
                "score": 12
            },
            {
                "date": 1397692800000,
                "grade": "A",
                "score": 10
            },
            {
                "date": 1381276800000,
                "grade": "A",
                "score": 12
            }
        ]
    }
}
And I want to convert it to the following list of dictionaries:
[
    {
        "event_name": "restaurants",
        "properties": {
            "restaurant_id": "41009112",
            "name": "Mangal Kebab Turkish Restaurant",
            "cuisine": "Turkish",
            "_id": "5a9909384309cf90b5739342",
            "borough": "Queens"
        }
    },
    {
        "event_name": "restaurant_address",
        "properties": {
            "zipcode": "11104",
            "ref_id": "41009112",
            "street": "Queens Boulevard",
            "building": "4620"
        }
    },
    {
        "event_name": "restaurant_address_coord",
        "ref_id": "41009112",
        "0": -73.9180155,
        "1": 40.7427742
    },
    {
        "event_name": "restaurant_grades",
        "properties": {
            "date": 1414540800000,
            "ref_id": "41009112",
            "score": 12,
            "grade": "A",
            "index": "0"
        }
    },
    {
        "event_name": "restaurant_grades",
        "properties": {
            "date": 1397692800000,
            "ref_id": "41009112",
            "score": 10,
            "grade": "A",
            "index": "1"
        }
    },
    {
        "event_name": "restaurant_grades",
        "properties": {
            "date": 1381276800000,
            "ref_id": "41009112",
            "score": 12,
            "grade": "A",
            "index": "2"
        }
    }
]
And most importantly, these events will be broken up into independent structured tables to conduct joins, so we need to create primary keys/unique identifiers. The deeply nested dictionaries should therefore carry their parent's id as a ref_id field; in this case ref_id = restaurant_id from the parent dictionary.
Most of the examples on the internet flatten the whole object to normalize it into a dataframe, but to utilise this ETL tool to its full potential it would be ideal to solve this problem via recursion and output a list of dictionaries.
This is what one might call a brute force method. Create a translator function to move each item into the correct part of the new structure (like a schema).
# input dict
d = {
    "event_name": "demo",
    "properties": {
        "_id": "5a9909384309cf90b5739342",
        "name": "Mangal Kebab Turkish Restaurant",
        "restaurant_id": "41009112",
        "borough": "Queens",
        "cuisine": "Turkish",
        "address": {
            "building": "4620",
            "coord": {
                "0": -73.9180155,
                "1": 40.7427742
            },
            "street": "Queens Boulevard",
            "zipcode": "11104"
        },
        "grades": [
            {
                "date": 1414540800000,
                "grade": "A",
                "score": 12
            },
            {
                "date": 1397692800000,
                "grade": "A",
                "score": 10
            },
            {
                "date": 1381276800000,
                "grade": "A",
                "score": 12
            }
        ]
    }
}
def convert_structure(d: dict):
    '''Function to convert to the new structure.'''
    # the new dict
    e = {}
    e['event_name'] = d['event_name']
    e['properties'] = {}
    e['properties']['restaurant_id'] = d['properties']['restaurant_id']
    # and so forth...
    # keep building the new structure / template
    # return a list
    return [e]

# run & print
x = convert_structure(d)
print(x)
The result (for the part done so far) looks like this:
[{'event_name': 'demo', 'properties': {'restaurant_id': '41009112'}}]
If a pattern is identified, then the above could be improved...
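One way to exploit the pattern is a recursive walk. The sketch below is mine, not the answer above, and rests on a few assumptions of my own: the parent id always lives at properties['restaurant_id'], a nested dict or list item under some key becomes an event named '<parent>_<key>', list positions become a string "index", and every event (including the coord one, unlike the sample output) wraps its fields in "properties".
def explode(name, node, ref_id, depth=0):
    """Recursively split a nested dict into a flat list of events."""
    # the root event (depth 0) carries no ref_id; children reference the parent
    flat = {} if depth == 0 else {"ref_id": ref_id}
    events = []
    for key, value in node.items():
        if isinstance(value, dict):
            events += explode(f"{name}_{key}", value, ref_id, depth + 1)
        elif isinstance(value, list):  # assumption: list items are dicts
            for i, item in enumerate(value):
                sub = explode(f"{name}_{key}", item, ref_id, depth + 1)
                sub[0]["properties"]["index"] = str(i)  # position in the list
                events += sub
        else:
            flat[key] = value
    return [{"event_name": name, "properties": flat}] + events

ref = d["properties"]["restaurant_id"]      # assumed location of the unique id
events = explode("restaurant", d["properties"], ref)
events[0]["event_name"] = d["event_name"]   # root keeps its original name
print(events)
On the sample above this yields the root event followed by restaurant_address, restaurant_address_coord and the three restaurant_grades events, each tagged with ref_id.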

Best way to convert the data into the desired format

I have a data structure that is something like this:
my_data = [
    ('Continent1', 'Country1', 'State1'),
    ('Continent1', 'Country1', 'State2'),
    ('Continent1', 'Country2', 'State1'),
    ('Continent1', 'Country2', 'State2'),
    ('Continent1', 'Country2', 'State3', 'City1', 11111)
]
The input is not limited to State; it can be narrowed down further to something like Continent ==> Country ==> State ==> City ==> Zip, with State, City and Zip being optional fields.
I wish to convert it to JSON like the following, based on the fields shared in the payload:
{
    "Regions": [{
        "Continent": "Continent1",
        "Country": "Country1",
        "State": "state1"
    }, {
        "Continent": "Continent1",
        "Country": "Country1",
        "State": "state2"
    }, {
        "Continent": "Continent1",
        "Country": "Country2",
        "State": "state1"
    }, {
        "Continent": "Continent1",
        "Country": "Country2",
        "State": "state2"
    }, {
        "Continent": "Continent1",
        "Country": "Country2",
        "State": "state3",
        "City": "City1",
        "zip": "11111"
    }]
}
Any pseudocode/approach that supports this output for multiple inputs would be appreciated.
keys = ["Continent", "Country", "State", "City", "Zip"]

transformed_data = {
    "Regions": [dict(zip(keys, row)) for row in my_data]
}
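This works because zip stops at the shorter of its two arguments: a three-field tuple only pairs with the first three keys, so State, City and Zip become optional automatically. For example:
keys = ["Continent", "Country", "State", "City", "Zip"]

# a short and a long row from the sample input
print(dict(zip(keys, ('Continent1', 'Country1', 'State1'))))
# {'Continent': 'Continent1', 'Country': 'Country1', 'State': 'State1'}
print(dict(zip(keys, ('Continent1', 'Country2', 'State3', 'City1', 11111))))
# {'Continent': 'Continent1', 'Country': 'Country2', 'State': 'State3',
#  'City': 'City1', 'Zip': 11111}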

How can I find a specific key in a Python dict and then get a value from that key in Python

I have a Python list of dictionaries that looks something like this:
[
    {
        "timestamp": 1621559698154,
        "user": {
            "uri": "spotify:user:xxxxxxxxxxxxxxxxxxxx",
            "name": "Panda",
            "imageUrl": "https://i.scdn.co/image/ab67757000003b82b54c68ed19f1047912529ef4"
        },
        "track": {
            "uri": "spotify:track:6SJSOont5dooK2IXQoolNQ",
            "name": "Dirty",
            "imageUrl": "http://i.scdn.co/image/ab67616d0000b273a36e3d46e406deebdd5eafb0",
            "album": {
                "uri": "spotify:album:0NMpswZbEcswI3OIe6ml3Y",
                "name": "Dirty (Live)"
            },
            "artist": {
                "uri": "spotify:artist:4ZgQDCtRqZlhLswVS6MHN4",
                "name": "grandson"
            },
            "context": {
                "uri": "spotify:artist:4ZgQDCtRqZlhLswVS6MHN4",
                "name": "grandson",
                "index": 0
            }
        }
    },
    {
        "timestamp": 1621816159299,
        "user": {
            "uri": "spotify:user:xxxxxxxxxxxxxxxxxxxxxxxx",
            "name": "maja",
            "imageUrl": "https://i.scdn.co/image/ab67757000003b8286459151d5426f5a9e77cfee"
        },
        "track": {
            "uri": "spotify:track:172rW45GEnGoJUuWfm1drt",
            "name": "Your Best American Girl",
            "imageUrl": "http://i.scdn.co/image/ab67616d0000b27351630f0f26aff5bbf9e10835",
            "album": {
                "uri": "spotify:album:16i5KnBjWgUtwOO7sVMnJB",
                "name": "Puberty 2"
            },
            "artist": {
                "uri": "spotify:artist:2uYWxilOVlUdk4oV9DvwqK",
                "name": "Mitski"
            },
            "context": {
                "uri": "spotify:playlist:0tol7yRYYfiPJ17BuJQKu2",
                "name": "I Bet on Losing Dogs",
                "index": 0
            }
        }
    }
]
How can I get, for example, the group of values for user.name "Panda" and then get that specific "track" list? I can't parse through the list by index because the list order changes randomly.
If you are only looking for "Panda", then you can just loop over the list, check whether the name is "Panda", and then retrieve the track list accordingly.
Otherwise, that would be inefficient if you want to do that for many different users. I would first make a dict that maps user to its index in the list, and then use that for each user (I am assuming that the list does not get modified while you execute the code, although it can be modified between executions.)
user_to_id = {data[i]['user']['name']: i for i in range(len(data))}  # {'Panda': 0, 'maja': 1}

def get_track(user):
    return data[user_to_id[user]]['track']

print(get_track('maja'))
print(get_track('Panda'))
where data is the list you provided.
Or, perhaps just make a dictionary of tracks directly:
tracks = {item['user']['name']: item['track'] for item in data}
print(tracks['Panda'])
If you want to get the list of tracks for user Panda:
tracks = [entry['track'] for entry in data if entry['user']['name'] == 'Panda']
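If the same user can appear more than once, note that the tracks dictionary above keeps only that user's last entry. A small variant, assuming you want every track per user, groups them into lists with a defaultdict:
from collections import defaultdict

# collect every track under its user's name instead of keeping just one
tracks_by_user = defaultdict(list)
for entry in data:
    tracks_by_user[entry['user']['name']].append(entry['track'])

print(tracks_by_user['Panda'])  # all tracks logged for Panda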

Convert JSON with nested objects to Pandas Dataframe

I am trying to load JSON from a URL and convert it to a Pandas dataframe, so that the dataframe looks like the sample below.
I've tried json_normalize, but it duplicates the columns, one for each data type (value and stringValue). Is there a simpler way than this method followed by dropping and renaming columns after creating the dataframe? I want to keep the stringValue.
   Person ID  Position ID  Job ID  Manager
0        192          936      93      Tom
my_json = {
    "columns": [
        {
            "alias": "c3",
            "label": "Person ID",
            "dataType": "integer"
        },
        {
            "alias": "c36",
            "label": "Position ID",
            "dataType": "string"
        },
        {
            "alias": "c40",
            "label": "Job ID",
            "dataType": "integer",
            "entityType": "job"
        },
        {
            "alias": "c19",
            "label": "Manager",
            "dataType": "integer"
        },
    ],
    "data": [
        {
            "c3": {
                "value": 192,
                "stringValue": "192"
            },
            "c36": {
                "value": "936",
                "stringValue": "936"
            },
            "c40": {
                "value": 93,
                "stringValue": "93"
            },
            "c19": {
                "value": 12412453,
                "stringValue": "Tom"
            }
        }
    ]
}
If c19 is of type string, this should work:
import pandas as pd

alias_to_label = {x['alias']: x['label'] for x in my_json["columns"]}
is_str = {x['alias']: ('string' == x['dataType']) for x in my_json["columns"]}

data = []
for x in my_json["data"]:
    data.append({
        k: v["stringValue" if is_str[k] else 'value']
        for k, v in x.items()
    })

df = pd.DataFrame(data).rename(columns=alias_to_label)
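As a quick check of the caveat: with the sample my_json as given, c19 is declared "integer", so the code keeps value (12412453) rather than stringValue ("Tom") for the Manager column. Flipping that one dataType first, purely for illustration, yields the desired frame:
my_json["columns"][3]["dataType"] = "string"  # treat c19 (Manager) as a string

is_str = {x['alias']: x['dataType'] == 'string' for x in my_json["columns"]}
rows = [{k: v["stringValue" if is_str[k] else "value"] for k, v in x.items()}
        for x in my_json["data"]]
df = pd.DataFrame(rows).rename(columns=alias_to_label)

print(df)
#    Person ID  Position ID  Job ID  Manager
# 0        192          936      93      Tom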

Python: Finding commonalities in multiple arrays of dicts

I have an array of dicts. I don't know how many dicts will be inside this list, because the result varies with the data.
I have to find commonalities among their values. Once I find the common things, I have to merge the dicts that share the same values and figure out the frequency of those values.
This is the sample data:
[
    {
        "id": 100,
        "category": null,
        "mid": null
    },
    {
        "id": 100,
        "city": "roma"
    },
    {
        "id": 100,
        "category": null,
        "mid": null
    },
    {
        "id": 100,
        "city": "roma"
    },
    {
        "id": 200,
        "category": "red",
        "mid": null
    },
    {
        "id": 200,
        "region": "toscany"
    },
    {
        "id": 300,
        "category": "blue",
        "mid": "cold",
        "sub": null
    },
    {
        "id": 400,
        "category": "yellow",
        "mid": "warm"
    },
    {
        "id": 400,
        "city": "milano"
    }
]
and the expected result should look like this:
[
    {
        "id": 100,
        "category": null,
        "mid": null,
        "city": "roma",
        "count": 2
    },
    {
        "id": 200,
        "category": "red",
        "mid": null,
        "region": "toscany",
        "count": 1
    },
    {
        "id": 300,
        "category": "blue",
        "mid": "cold",
        "sub": null,
        "count": 1
    },
    {
        "id": 400,
        "category": "yellow",
        "mid": "warm",
        "city": "milano",
        "count": 1
    }
]
I know how to find commonalities between two dicts but have no idea how to do it with multiple dicts. Maybe I can use items() to find matching values and ChainMap() to merge, but so far I have failed to get the expected result.
Edit
What I did when I had only two dicts:
from itertools import groupby
from operator import itemgetter

a = {
    "id": 100,
    "category": None,
    "mid": None
}
b = {
    "id": 100,
    "city": "roma"
}
rows = [a, b]  # assumption: the two dicts form the input rows

def grouping_records():
    rows.sort(key=itemgetter('id'))
    for group_id, items in groupby(rows, key=itemgetter('id')):
        print(group_id)
        for i in items:
            print(' ', i)

if __name__ == "__main__":
    grouping_records()
groupby is a bit complex for many of us; try this naive solution:
# remove duplicate dictionaries if needed (frozenset of items is hashable)
mylist = [dict(s) for s in set(frozenset(d.items()) for d in original)]

# count how many of the deduplicated records share each id
id_list = [d['id'] for d in mylist]
id_cnt = {i: {"count": id_list.count(i)} for i in set(id_list)}

# merge every record with the same id into one dict
for d in mylist:
    id_cnt[d['id']].update(d)

result = list(id_cnt.values())
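Note that counting over the deduplicated list yields the number of distinct records per id, which is not quite the "count" in the expected output (2 for id 100, i.e. how many times the duplicated records repeat). A minimal sketch that reproduces those counts, assuming the duplication factor is uniform within each id group, counts the records before deduplicating:
from collections import Counter

# how many times each distinct record occurs in the raw input
record_counts = Counter(frozenset(d.items()) for d in original)

merged = {}
for record, n in record_counts.items():
    d = dict(record)
    # the first record seen for an id sets the count; later ones merge fields
    entry = merged.setdefault(d['id'], {'count': n})
    entry.update(d)

result = list(merged.values())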
