I have data stored in a pandas DataFrame and I want to convert it into JSON format. The example data can be replicated using the following code:
data = {'Product':['A', 'B', 'A'],
'Zone':['E/A', 'A/N', 'E/A'],
'start':['08:00:00', '09:00:00', '12:00:00'],
'end':['12:30:00', '17:00:00', '17:40:00'],
'seq':['0, 1, 2 ,3 ,4','0, 1, 2 ,3 ,4', '0, 1, 2 ,3 ,4'],
'store':['Z',"'AS', 'S'", 'Z']
}
df = pd.DataFrame(data)
I've tried converting it into JSON format using the following code:
df_parsed = json.loads(df.to_json(orient="records"))
Output generated by the above:
[{'Product': 'A', 'Zone': 'E/A', 'start': '08:00:00', 'end': '17:40:00', 'seq': '0, 1, 2 ,3 ,4', 'store': 'Z'}, {'Product': 'B', 'Zone': 'A/N', 'start': '09:00:00', 'end': '17:00:00', 'seq': '0, 1, 2 ,3 ,4', 'store': 'AS'}, {'Product': 'A', 'Zone': 'E/A', 'start': '08:00:00', 'end': '17:40:00', 'seq': '0, 1, 2 ,3 ,4', 'store': 'Z'}]
Desired Result:
{
'A': {'Zone': 'E/A',
'tp': [{'start': [8, 0], 'end': [12, 0], 'seq': [0, 1, 2 ,3 ,4]},
{'start': [12, 30], 'end': [17, 40], 'seq': [0, 1, 2 ,3 ,4]}],
'store': ['Z']
},
'B': {'Zone': 'A/N',
'tp': [{'start': [9, 0], 'end': [17, 0], 'seq': [0, 1, 2 ,3 ,4]}],
'store': ['AS', 'S']
}
}
If a product belongs to the same store, the results for the start, end, and seq columns should be clubbed together as shown in the desired output. Also, start and end times should be represented like [9, 0] for "09:00:00": only the hour and minute are needed, so the seconds value can be discarded from the time columns.
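For the hour/minute conversion in isolation, here is a minimal sketch (the helper name to_hm is made up for illustration):

```python
def to_hm(t):
    """Convert an 'HH:MM:SS' string to [hour, minute], discarding seconds."""
    h, m, _ = t.split(':')
    return [int(h), int(m)]

print(to_hm('09:00:00'))  # [9, 0]
print(to_hm('17:40:00'))  # [17, 40]
```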
This is a bit complicated, so do it step by step:
def funct(row):
    row['start'] = row['start'].str.split(':').str[0:2]
    row['end'] = row['end'].str.split(':').str[0:2]
    row['store'] = row['store'].str.replace("'", "", regex=False).str.split(', ')
    d = (row.groupby('Zone')[row.columns[1:]]
         .apply(lambda x: x.to_dict(orient='records'))
         .reset_index(name='tp')
         .to_dict(orient='records'))
    return d
di = df.groupby(['Product'])[df.columns[1:]].apply(funct).to_dict()
di:
{'A': [{'Zone': 'E/A',
'tp': [{'start': ['08', '00'],
'end': ['12', '30'],
'seq': '0, 1, 2 ,3 ,4',
'store': ['Z']},
{'start': ['12', '00'],
'end': ['17', '40'],
'seq': '0, 1, 2 ,3 ,4',
'store': ['Z']}]}],
'B': [{'Zone': 'A/N',
'tp': [{'start': ['09', '00'],
'end': ['17', '00'],
'seq': '0, 1, 2 ,3 ,4',
'store': ['AS', 'S']}]}]}
Explanation:
1. First, create your own custom function.
2. Change the start and end columns to list form.
3. Group by Zone and apply to_dict to the rest of the columns.
4. Reset the index and name the column holding the [{'start': ['08', '00'], 'end': ['12', '30'], 'seq': '0, 1, 2 ,3 ,4', ...}] dicts tp.
5. Apply to_dict to the whole result and return it.
Ultimately, you need to convert your dataframe into the format below; once you are able to do that, the rest becomes easy:
Zone tp
E/A [{'start': ['08', '00'], 'end': ['12', '30'], ...
A/N [{'start': ['09', '00'], 'end': ['17', '00'], ...
EDIT:
import pandas as pd
import ast
def funct(row):
    row['start'] = row['start'].str.split(':').str[0:2].apply(lambda x: list(map(int, x)))
    row['end'] = row['end'].str.split(':').str[0:2].apply(lambda x: list(map(int, x)))
    row['seq'] = row['seq'].apply(lambda x: list(map(int, ast.literal_eval(x))))
    row['store'] = row['store'].str.replace("'", "", regex=False)
    d = (row.groupby('Zone')[row.columns[1:-1]]
         .apply(lambda x: x.to_dict(orient='records'))
         .reset_index(name='tp'))
    # For store, create a different dataframe and then merge it into the other df
    d1 = row.groupby('Zone').agg({'store': pd.Series.unique})
    d1['store'] = d1['store'].str.split(",")
    d_merged = pd.merge(d, d1, on='Zone', how='left').to_dict(orient='records')[0]
    return d_merged
di = df.groupby(['Product'])[df.columns[1:]].apply(funct).to_dict()
di:
{'A': {'Zone': 'E/A',
'tp': [{'start': [8, 0], 'end': [12, 30], 'seq': [0, 1, 2, 3, 4]},
{'start': [12, 0], 'end': [17, 40], 'seq': [0, 1, 2, 3, 4]}],
'store': ['Z']},
'B': {'Zone': 'A/N',
'tp': [{'start': [9, 0], 'end': [17, 0], 'seq': [0, 1, 2, 3, 4]}],
'store': ['AS', ' S']}}
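Note that di is still a Python dict; if an actual JSON string (double quotes, valid JSON syntax) is required, it can be passed through json.dumps. A small sketch with a trimmed-down version of the result:

```python
import json

# Trimmed-down version of the nested dict produced above
di = {'A': {'Zone': 'E/A',
            'tp': [{'start': [8, 0], 'end': [12, 30], 'seq': [0, 1, 2, 3, 4]}],
            'store': ['Z']}}

# Serialize the nested dict to a JSON string
json_str = json.dumps(di, indent=2)
print(json_str)
```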
I am trying to flatten some columns in my dataframe, but unfortunately it does not work.
What would be the correct way of doing this?
created_at    tweet_hashtag                                                                                                                                             tweet_cashtag
2022-07-23    [{'start': 16, 'end': 27, 'tag': 'blockchain'}, {'start': 28, 'end': 32, 'tag': 'btc'}, {'start': 33, 'end': 37, 'tag': 'eth'}, {'start': 38, 'end': 42, 'tag': 'eth'}]    [{'start': 0, 'end': 4, 'tag': 'Act'}, {'start': 7, 'end': 11, 'tag': 'jar'}]
2022-04-24    [{'start': 6, 'end': 7, 'tag': 'chain'}, {'start': 8, 'end': 3, 'tag': 'btc'}, {'start': 3, 'end': 7, 'tag': 'eth'}]    [{'start': 4, 'end': 8, 'tag': 'Act'}, {'start': 7, 'end': 9, 'tag': 'aapl'}]
And my preferred result would be:
created_at    tweet_hashtag.tag            tweet_cashtag.tag
2022-07-23    blockchain, btc, eth, eth    Act, jar
2022-04-24    chain, btc, eth              Act, aapl
Thanks in advance!
I tried to flatten with this solution, but it does not work: How to apply json_normalize on entire pandas column
You can use:
def get_values(a, b):
    x_values = [d['tag'] for d in a]
    y_values = [d['tag'] for d in b]
    return ','.join(x_values), ','.join(y_values)

# result_type='expand' spreads the returned tuple across the two columns
df[['tweet_hashtag', 'tweet_cashtag']] = df[['tweet_hashtag', 'tweet_cashtag']].apply(
    lambda x: get_values(x['tweet_hashtag'], x['tweet_cashtag']),
    axis=1, result_type='expand')
or:
def get_tags(cell):
    # the two original helpers were identical, so a single function covers both columns
    return ','.join(d['tag'] for d in cell)

df['tweet_hashtag'] = df['tweet_hashtag'].apply(get_tags)
df['tweet_cashtag'] = df['tweet_cashtag'].apply(get_tags)
print(df)
'''
created_at tweet_hashtag tweet_cashtag
0 2022-07-23 blockchain,btc,eth,eth Act,jar
1 2022-04-24 chain,btc,eth Act,aapl
'''
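The linked json_normalize idea can also be approximated with explode: flatten the list column to one dict per row, pull out 'tag', then join the tags back together per original row. A sketch on a trimmed-down frame (this assumes every cell is a list of dicts with a 'tag' key):

```python
import pandas as pd

df = pd.DataFrame({
    'created_at': ['2022-07-23', '2022-04-24'],
    'tweet_hashtag': [
        [{'start': 16, 'end': 27, 'tag': 'blockchain'}, {'start': 28, 'end': 32, 'tag': 'btc'}],
        [{'start': 6, 'end': 7, 'tag': 'chain'}],
    ],
})

tags = (df['tweet_hashtag'].explode()          # one row per dict, original index preserved
        .apply(lambda d: d['tag'])             # pull out the tag value
        .groupby(level=0).agg(','.join))       # re-join the tags per original row
df['tweet_hashtag.tag'] = tags
print(df[['created_at', 'tweet_hashtag.tag']])
```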
I have the following dataframe:
columns = pd.date_range(start="2022-05-21", end="2022-06-30")
data = [
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5],
[5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5]
]
df = pd.DataFrame(data, columns=columns)
2022-05-21 2022-05-22 2022-05-23 ... 2022-06-28 2022-06-29 2022-06-30
0 0 0 0 ... 5 5 5
1 5 5 5 ... 1 1 1
2 5 5 5 ... 5 5 5
I need to take the first and last column index for every run of a distinct value, in the order in which the runs appear. The correct output for this dataframe would be:
[
[
{'value': 0, 'start': '2022-05-21', 'end': '2022-05-31'},
{'value': 2, 'start': '2022-06-01', 'end': '2022-06-19'},
{'value': 5, 'start': '2022-06-20', 'end': '2022-06-30'}
],
[
{'value': 5, 'start': '2022-05-21', 'end': '2022-05-31'},
{'value': 2, 'start': '2022-06-01', 'end': '2022-06-19'},
{'value': 1, 'start': '2022-06-20', 'end': '2022-06-30'}
],
[
{'value': 5, 'start': '2022-05-21', 'end': '2022-05-31'},
{'value': 2, 'start': '2022-06-01', 'end': '2022-06-19'},
{'value': 5, 'start': '2022-06-20', 'end': '2022-06-30'}
]
]
My best approach for the moment is:
series_set = df.apply(frozenset, axis=1)
container = []
for index in range(len(df.index)):
    row = df.iloc[[index]]
    values = series_set.iloc[[index]]
    inner_container = []
    for value in values[index]:
        single_value_series = row[row.columns[row.isin([value]).all()]]
        dates = single_value_series.columns
        result = dict(value=value, start=dates[0].strftime("%Y-%m-%d"), end=dates[-1].strftime("%Y-%m-%d"))
        inner_container.append(result)
    container.append(inner_container)
The result is:
[
[
{'value': 0, 'start': '2022-05-21', 'end': '2022-05-31'},
{'value': 2, 'start': '2022-06-01', 'end': '2022-06-19'},
{'value': 5, 'start': '2022-06-20', 'end': '2022-06-30'}
],
[
{'value': 1, 'start': '2022-06-20', 'end': '2022-06-30'},
{'value': 2, 'start': '2022-06-01', 'end': '2022-06-19'},
{'value': 5, 'start': '2022-05-21', 'end': '2022-05-31'}
],
[
{'value': 2, 'start': '2022-06-01', 'end': '2022-06-19'},
{'value': 5, 'start': '2022-05-21', 'end': '2022-06-30'}
]
]
It has several problems; only the first array is correct :)
When I convert a row to a frozenset, it gets sorted so the original order is lost, and values that appear more than once are collapsed into one.
I would appreciate any ideas and guidance. What I want to avoid is iterating over the dataframe.
Thank you!
You can first transpose the DataFrame with DataFrame.T, then aggregate the minimal and maximal index per group, convert the dates to strings with Series.dt.strftime, and finally convert to dictionaries with DataFrame.to_dict.
To get the consecutive groups, compare each value with its shifted neighbour and take a Series.cumsum.
df1 = df.T.reset_index()
L = [df1.groupby(df1[x].ne(df1[x].shift()).cumsum())
.agg(value=(x, 'first'),
start=('index', 'min'),
end=('index', 'max'))
.assign(start=lambda x: x['start'].dt.strftime('%Y-%m-%d'),
end=lambda x: x['end'].dt.strftime('%Y-%m-%d'))
.to_dict(orient='records') for x in df1.columns.drop('index')]
print(L)
[[{'value': 0, 'start': '2022-05-21', 'end': '2022-05-31'},
{'value': 2, 'start': '2022-06-01', 'end': '2022-06-19'},
{'value': 5, 'start': '2022-06-20', 'end': '2022-06-30'}],
[{'value': 5, 'start': '2022-05-21', 'end': '2022-05-31'},
{'value': 2, 'start': '2022-06-01', 'end': '2022-06-19'},
{'value': 1, 'start': '2022-06-20', 'end': '2022-06-30'}],
[{'value': 5, 'start': '2022-05-21', 'end': '2022-05-31'},
{'value': 2, 'start': '2022-06-01', 'end': '2022-06-19'},
{'value': 5, 'start': '2022-06-20', 'end': '2022-06-30'}]]
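The core of this answer is the shift/cumsum trick for labelling consecutive runs; isolated on a plain Series, it looks like this:

```python
import pandas as pd

s = pd.Series([0, 0, 2, 2, 2, 5, 5])
# A new group starts wherever a value differs from the previous one;
# cumsum turns those change points into consecutive group ids.
group_id = s.ne(s.shift()).cumsum()
print(group_id.tolist())  # [1, 1, 2, 2, 2, 3, 3]
```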
I need to group an input dictionary based on two keys and return each group as part of a list of dictionaries. For example:
data = {
'name': ['A', 'C', 'B', 'B'],
'tag': [13, 26, 13, 3],
'id': [234, 235, 236, 237],
'values': [[1, 3, 3], [1, 2, 1], [1, 2, 3], [1, 1, 1]],
}
I can use defaultdict to do the subsetting and return one key of the dict pretty easily; for example, that approach returns a list of dicts grouped by data['name'].
Without using pandas (the dataset is too big), how can I group by one or more keys (say, by=['name', 'tag']) and return a list of dicts?
Edit: Expected output can be a list of dicts:
[
{'name': 'A', 'tag': 13, 'id': 234, 'values': [1, 3, 3]},
{'name': 'C', 'tag': 26, 'id': 235, 'values': [1, 2, 1]},
{'name': 'B', 'tag': 13, 'id': 236, 'values': [1, 2, 3]},
{'name': 'B', 'tag': 3, 'id': 237, 'values': [1, 1, 1]}
]
or a dict of dicts:
{
('A', 13): {'id': 234, 'values': [1, 3, 3]},
('C', 26): {'id': 235, 'values': [1, 2, 1]},
('B', 13): {'id': 236, 'values': [1, 2, 3]},
('B', 3): {'id': 237, 'values': [1, 1, 1]}
}
It's actually a lot easier than it might seem:
{(n, t): {'id': i, 'values': vs} for n, t, i, vs in zip(*data.values())}
Once you zip the 4 values together, it's just a matter of
iterating over the resulting sequence of tuples,
unpacking each tuple and
constructing the desired key/value pair from the unpacked values.
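Putting the steps above together with the sample data from the question:

```python
data = {
    'name': ['A', 'C', 'B', 'B'],
    'tag': [13, 26, 13, 3],
    'id': [234, 235, 236, 237],
    'values': [[1, 3, 3], [1, 2, 1], [1, 2, 3], [1, 1, 1]],
}

# zip(*data.values()) yields one (name, tag, id, values) tuple per "row"
result = {(n, t): {'id': i, 'values': vs}
          for n, t, i, vs in zip(*data.values())}
print(result[('B', 3)])  # {'id': 237, 'values': [1, 1, 1]}
```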
If there is any concern over the order in which the 4 list values will be returned by data.values(), you can be more explicit:
from operator import itemgetter
# g(data) == (data['name'], data['tag'], data['id'], data['values'])
g = itemgetter('name', 'tag', 'id', 'values')
result = {(n, t): {'id': i, 'values': vs} for n, t, i, vs in zip(*g(data))}
Python 3.6
Task:
Given a sorted list of linear features (like in a linear referencing system),
combine adjacent linear features belonging to the same key (linear_feature[0]['key'] == linear_feature[1]['key'] and linear_feature[0]['end'] == linear_feature[1]['start'])
until the combined linear feature has (end - start) ≥ THRESHOLD.
If a feature cannot be combined with subsequent adjacent features such that (end - start) ≥ THRESHOLD, combine it with the previous adjacent feature of the same key, or return it as-is.
EDIT: Added a solution below in an answer post.
THRESHOLD = 3
linear_features = sorted([
{'key': 1, 'start': 0, 'end': 2, 'count': 1},
{'key': 1, 'start': 2, 'end': 4, 'count': 1},
{'key': 1, 'start': 4, 'end': 5, 'count': 1},
{'key': 2, 'start': 0, 'end': 3, 'count': 1},
{'key': 2, 'start': 3, 'end': 4, 'count': 1},
{'key': 2, 'start': 4, 'end': 5, 'count': 1},
{'key': 3, 'start': 0, 'end': 1, 'count': 1},
], key=lambda x: (x['key'], x['start']))
# This isn't necessarily an intermediate step, just here for visualization
intermediate = [
{'key': 1, 'start': 0, 'end': 4, 'count': 2}, # Adjacent features combined
{'key': 1, 'start': 4, 'end': 5, 'count': 1}, # This can't be made into a feature with (end - start) gte THRESHOLD; combine with previous
{'key': 2, 'start': 0, 'end': 3, 'count': 1},
{'key': 2, 'start': 3, 'end': 5, 'count': 2}, # This can't be made into a feature with (end - start) gte THRESHOLD; combine with previous
{'key': 3, 'start': 0, 'end': 1, 'count': 1}, # This can't be made into a new feature, and there is no previous, so self
]
desired_output = [
{'key': 1, 'start': 0, 'end': 5, 'count': 3},
{'key': 2, 'start': 0, 'end': 5, 'count': 3},
{'key': 3, 'start': 0, 'end': 1, 'count': 1},
]
I figured out a solution:
def reducer(x, THRESHOLD):
    x = add_until(x, THRESHOLD)
    if len(x) == 1:
        return x
    if len(x) == 2:
        if length(x[1]) < THRESHOLD:
            x[0]['end'] = x[1]['end']
            x[0]['count'] += x[1]['count']
            return [x[0]]
        else:
            return x
    first, rest = x[0], x[1:]
    return [first] + reducer(rest, THRESHOLD)

def add_until(x, THRESHOLD):
    if len(x) == 1:
        return x
    first, rest = x[0], x[1:]
    if length(first) >= THRESHOLD:
        return [first] + add_until(rest, THRESHOLD)
    else:
        rest[0]['start'] = first['start']
        rest[0]['count'] += first['count']
        return add_until(rest, THRESHOLD)
from itertools import groupby
THRESHOLD = 3
linear_features = sorted([
{'key': 1, 'start': 0, 'end': 2, 'count': 1},
{'key': 1, 'start': 2, 'end': 4, 'count': 1},
{'key': 1, 'start': 4, 'end': 5, 'count': 1},
{'key': 2, 'start': 0, 'end': 3, 'count': 1},
{'key': 2, 'start': 3, 'end': 4, 'count': 1},
{'key': 2, 'start': 4, 'end': 5, 'count': 1},
{'key': 3, 'start': 0, 'end': 1, 'count': 1},
{'key': 4, 'start': 0, 'end': 3, 'count': 1},
{'key': 4, 'start': 3, 'end': 4, 'count': 1},
{'key': 4, 'start': 4, 'end': 5, 'count': 1},
{'key': 4, 'start': 5, 'end': 6, 'count': 1},
{'key': 4, 'start': 6, 'end': 9, 'count': 1},
], key=lambda x: (x['key'], x['start']))
def length(x):
    """x is a dict with 'start' and 'end' keys"""
    return x['end'] - x['start']

results = []
for key, sites in groupby(linear_features, lambda x: x['key']):
    sites = list(sites)
    results += reducer(sites, THRESHOLD)
print(results)
[
{'key': 1, 'start': 0, 'end': 5, 'count': 3},
{'key': 2, 'start': 0, 'end': 5, 'count': 3},
{'key': 3, 'start': 0, 'end': 1, 'count': 1},
{'key': 4, 'start': 0, 'end': 3, 'count': 1},
{'key': 4, 'start': 3, 'end': 6, 'count': 3},
{'key': 4, 'start': 6, 'end': 9, 'count': 1}
]
You want something like this:
PSEUDOCODE
f = 1; max = count of features
while f < max:
    if features[f-1]['key'] == features[f]['key'] and
       features[f-1]['end'] == features[f]['start']:
        # combine
        features[f-1]['end'] = features[f]['end']
        features[f-1]['count'] += features[f]['count']
        del features[f]; max -= 1
    else:
        f += 1
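The pseudocode above can be turned into runnable Python roughly like this (a sketch with a made-up name combine_adjacent; note that, like the pseudocode, it merges unconditionally and does not apply THRESHOLD):

```python
def combine_adjacent(features):
    """Merge adjacent features that share a key and touch end-to-start."""
    features = [dict(f) for f in features]  # work on copies, don't mutate the input
    f = 1
    while f < len(features):
        prev, cur = features[f - 1], features[f]
        if prev['key'] == cur['key'] and prev['end'] == cur['start']:
            # combine cur into prev and drop cur
            prev['end'] = cur['end']
            prev['count'] += cur['count']
            del features[f]
        else:
            f += 1
    return features

combined = combine_adjacent([
    {'key': 1, 'start': 0, 'end': 2, 'count': 1},
    {'key': 1, 'start': 2, 'end': 4, 'count': 1},
    {'key': 2, 'start': 0, 'end': 3, 'count': 1},
])
print(combined)
# [{'key': 1, 'start': 0, 'end': 4, 'count': 2}, {'key': 2, 'start': 0, 'end': 3, 'count': 1}]
```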
I have a list of dictionaries like so:
dictlist = [{'day': 0, 'start': '8:00am', 'end': '5:00pm'},
{'day': 1, 'start': '10:00am', 'end': '7:00pm'},
{'day': 2, 'start': '8:00am', 'end': '5:00pm'},
{'day': 3, 'start': '10:00am', 'end': '7:00pm'},
{'day': 4, 'start': '8:00am', 'end': '5:00pm'},
{'day': 5, 'start': '11:00am', 'end': '1:00pm'}]
I want to summarize days that share the same 'start' and 'end' times.
For example,
summarylist = [([0,2, 4], '8:00am', '5:00pm'),
([1, 3], '10:00am', '7:00pm')
([5], '11:00am', '1:00pm')]
I have tried to adapt some other StackOverflow solutions re: sets and intersections to achieve this, with no luck; in particular, I was trying to re-purpose the solution to this question. Hoping someone can point me in the right direction.
If you don't need the exact format that you provided, you could use defaultdict:
dictlist = [{'day': 0, 'start': '8:00am', 'end': '5:00pm'},
{'day': 1, 'start': '10:00am', 'end': '7:00pm'},
{'day': 2, 'start': '8:00am', 'end': '5:00pm'},
{'day': 3, 'start': '10:00am', 'end': '7:00pm'},
{'day': 4, 'start': '8:00am', 'end': '5:00pm'},
{'day': 5, 'start': '11:00am', 'end': '1:00pm'}]
from collections import defaultdict
dd = defaultdict(list)
for d in dictlist:
    dd[(d['start'], d['end'])].append(d['day'])
Result:
>>> dd
defaultdict(<class 'list'>, {('11:00am', '1:00pm'): [5], ('10:00am', '7:00pm'): [1, 3], ('8:00am', '5:00pm'): [0, 2, 4]})
And if the format is important to you, you could do:
>>> my_list = [(v, k[0], k[1]) for k, v in dd.items()]
>>> my_list
[([5], '11:00am', '1:00pm'), ([1, 3], '10:00am', '7:00pm'), ([0, 2, 4], '8:00am', '5:00pm')]
>>> # If you need the output sorted:
>>> sorted_my_list = sorted(my_list, key = lambda k : len(k[0]), reverse=True)
>>> sorted_my_list
[([0, 2, 4], '8:00am', '5:00pm'), ([1, 3], '10:00am', '7:00pm'), ([5], '11:00am', '1:00pm')]
With itertools.groupby:
In [1]: %paste
dictlist = [{'day': 0, 'start': '8:00am', 'end': '5:00pm'},
{'day': 1, 'start': '10:00am', 'end': '7:00pm'},
{'day': 2, 'start': '8:00am', 'end': '5:00pm'},
{'day': 3, 'start': '10:00am', 'end': '7:00pm'},
{'day': 4, 'start': '8:00am', 'end': '5:00pm'},
{'day': 5, 'start': '11:00am', 'end': '1:00pm'}]
## -- End pasted text --
In [2]: from itertools import groupby
In [3]: tuplist = [(d['day'], (d['start'], d['end'])) for d in dictlist]
In [4]: key = lambda x: x[1]
In [5]: summarylist = [(sorted(e[0] for e in g),) + k
...: for k, g in groupby(sorted(tuplist, key=key), key=key)]
In [6]: summarylist
Out[6]:
[([1, 3], '10:00am', '7:00pm'),
([5], '11:00am', '1:00pm'),
([0, 2, 4], '8:00am', '5:00pm')]
You can use itertools.groupby like this.
source code:
from itertools import groupby
key = lambda x: (x['start'], x['end'])
for k, grp in groupby(sorted(dictlist, key=key), key=key):
    print([i['day'] for i in grp], k)
output:
[1, 3] ('10:00am', '7:00pm')
[5] ('11:00am', '1:00pm')
[0, 2, 4] ('8:00am', '5:00pm')
But I think using defaultdict (@Akavall's answer) is the right way in this particular case.