Inconsistent behaviour from json_normalize with deeply nested json data - python

I am trying to import deeply nested json into pandas (v0.24.2) using json_normalize and am coming across a few inconsistencies that I am struggling to resolve.
An example json is as follows. It is inconsistently formatted, as indicated by 'Missing keyEB':
json = [ {'keyA': 1,
'keyB': 2,
'keyC': [{
'keyCA': 3,
'keyCB': {'keyCBA':4,
'keyCBB':5,
'keyCBC': [{'keyCBCA':6, 'keyCBCB':7, 'keyCBCC':8},
{'keyCBCA':9, 'keyCBCB':10, 'keyCBCC':11},
{'keyCBCA':12, 'keyCBCB':13, 'keyCBCC':14}],
'keyCBD':15},
'keyCC':16}],
'keyD':17,
'keyE': [{
'keyEA':18,
'keyEB': {'keyEBA':19,'keyEBB':20}
}]
},{
'keyA': 31,
'keyB': 32,
'keyC': [{
'keyCA': 33,
'keyCB': {'keyCBA': 34,
'keyCBB': 35,
'keyCBC': [{'keyCBCA': 36, 'keyCBCB': 37, 'keyCBCC': 38},
{'keyCBCA': 39, 'keyCBCB': 40, 'keyCBCC': 41},
{'keyCBCA': 42, 'keyCBCB': 43, 'keyCBCC': 44}],
'keyCBD': 45},
'keyCC': 46}],
'keyD': 47,
'keyE': [{
'keyEA': 48,
'Missing keyEB': 49
}]
}]
The following code gives the expected behavior of json_normalize, extracting the correctly normalised data :
First level json correctly normalized
from pandas.io.json import json_normalize
json_normalize(data = json)
keyA keyB keyC keyD keyE
0 1 2 [{'key... 17 [{'key...
1 31 32 [{'key... 47 [{'key...
Second level KeyC correctly normalized
json_normalize(data = json, record_path = ['keyC'], meta = ['keyA'])
keyCA keyCB keyCC keyA
0 3 {'keyC... 16 1
1 33 {'keyC... 46 31
Fourth level keyCBC correctly normalized
json_normalize(data = json, record_path = ['keyC', 'keyCB', 'keyCBC'], meta = ['keyA'])
keyCBCA keyCBCB keyCBCC keyA
0 6 7 8 1
1 9 10 11 1
2 12 13 14 1
3 36 37 38 31
4 39 40 41 31
5 42 43 44 31
However, other branches seem to be inconsistently normalized.
Third level keyCB ......
json_normalize(data = json, record_path = ['keyC', 'keyCB'], meta = ['keyA'])
0 keyA
0 keyCBA 1
1 keyCBB 1
2 keyCBC 1
3 keyCBD 1
4 keyCBA 31
5 keyCBB 31
6 keyCBC 31
7 keyCBD 31
#Uhhhh ! I was expecting
# keyCBA keyCBB keyCBC keyCBD KeyA
# 0 4 5 [{'key.. 15 1
# 1 34 35 [{'key.. 45 31
and the following completely bombs with a KeyError because of the missing keyEB:
json_normalize(data = json, record_path = ['keyE', 'keyEB'], meta = ['keyA'])
Traceback (most recent call last):......
KeyError: 'keyEB'
#I was expecting
# keyEBA keyEBB keyA
# 0 19 20 1
# 1 NaN NaN 31
Are there any easy ways around this to get consistent behavior from json_normalize?
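A possible workaround, offered as a sketch rather than as the behaviour json_normalize is meant to have: record_path seems to expect each step to end in a list of records, so pointing it at a dict such as keyCB or keyEB iterates over the dict's keys (hence the key names in the output above), and a missing key raises KeyError. The example below sidesteps record_path for the dict-valued branch by walking the structure in plain Python and using dict.get with a default. The trimmed-down `data` variable and the `rows` list are introduced only for illustration.

```python
import pandas as pd

# Trimmed-down version of the question's data (only the keyE branch is kept)
data = [
    {"keyA": 1, "keyE": [{"keyEA": 18, "keyEB": {"keyEBA": 19, "keyEBB": 20}}]},
    {"keyA": 31, "keyE": [{"keyEA": 48, "Missing keyEB": 49}]},
]

rows = []
for record in data:
    for item in record.get("keyE", []):
        # An absent 'keyEB' falls back to an empty dict instead of raising KeyError
        row = dict(item.get("keyEB", {}))
        row["keyA"] = record["keyA"]
        rows.append(row)

# Naming the columns explicitly makes missing values show up as NaN
df = pd.DataFrame(rows, columns=["keyEBA", "keyEBB", "keyA"])
print(df)
```

The same pattern should apply to keyCB: collect one dict per parent record, then hand the list to pd.DataFrame.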

Related

Need to add all the difference values but for a specific index value range

I have a data frame:
I have to sum the differences separately for each event. In the data frame you can see that index 8 is followed by index 12, which marks the start of a new event, so its differences should be summed separately. In other words, whenever index_col jumps (here by 4), a new event starts and its differences should be summed on their own.
So the sums per event should look like this, e.g.:
index_col 1-8 sum of Difference should be 20.96 (belongs to the first event)
index_col 12-17 sum of Difference should be 16.17 (belongs to the second event)
and so on ...
index_col Depth(nm) Load(µN) Time (s) Difference
1 42.478033 432.482376 5.460979 8.70957
2 44.217959 432.163277 5.461261 1.73993
3 44.517313 432.764691 5.461824 3.36262
4 44.602024 433.754851 5.462669 2.37831
5 44.452232 434.808104 5.463514 1.8221
6 44.785705 435.698639 5.464358 1.1552
7 44.008191 436.724050 5.464922 1.02758
8 44.104820 438.753727 5.466611 1.04814
12 39.918249 390.597846 5.476275 7.61717
13 40.939905 391.229950 5.477120 2.66319
14 40.709209 392.333573 5.477965 1.99305
15 40.975959 393.208349 5.478810 1.88325
16 40.415786 395.135862 5.480218 1.00294
17 40.748377 396.057784 5.481062 1.13622
21 45.101152 441.052546 5.554368 5.64005
22 43.096024 442.489659 5.554931 2.13311
23 44.581075 442.264911 5.555213 1.48505
24 43.757947 443.295160 5.555776 2.34133
25 44.020544 444.209317 5.556621 2.15143
26 44.457026 445.121651 5.557466 2.2784
27 44.332075 446.131261 5.558310 1.36814
28 43.853956 447.344522 5.559155 1.0139
32 38.420457 381.697812 5.462362 5.80165
33 39.247295 382.417916 5.463206 2.51963
34 38.910364 383.542124 5.464051 1.67136
38 45.939504 467.899009 5.564736 6.58783
39 44.251143 469.194422 5.565299 1.40849
40 46.242257 468.823029 5.565581 1.99111
41 45.032736 469.930914 5.566144 1.95164
42 45.540791 470.765236 5.566989 2.50574
43 45.520035 471.821972 5.567834 1.91457
44 45.593076 472.835489 5.568678 1.24077
45 45.267980 474.618237 5.570086 1.05416
46 45.238412 475.640147 5.570931 1.038062
49 38.193023 392.286042 5.490368 8.13389
50 41.444420 391.411630 5.490650 3.2514
The way you add the data as plain text is very unhelpful. It would be much easier and faster if you add the data in the form index_col = ..., load = ... and so on.
That aside, this is my code:
import pandas as pd
index_col = [1, 2, 3, 4, 5, 6, 7, 8, 12, 13, 14, 15, 16, 17, 21, 22, 23, 24, 25, 26, 27, 28, 32, 33, 34, 38, 39, 40, 41, 42, 43, 44, 45, 46, 49, 50]
depth = [42.478033, 44.217959, 44.517313, 44.602024, 44.452232, 44.785705, 44.008191, 44.10482, 39.918249, 40.939905, 40.709209, 40.975959, 40.415786, 40.748377, 45.101152, 43.096024, 44.581075, 43.757947, 44.020544, 44.457026, 44.332075, 43.853956, 38.420457, 39.247295, 38.910364, 45.939504, 44.251143, 46.242257, 45.032736, 45.540791, 45.520035, 45.593076, 45.26798, 45.238412, 38.193023, 41.44442]
load = [432.482376, 432.163277, 432.764691, 433.754851, 434.808104, 435.698639, 436.72405, 438.753727, 390.597846, 391.22995, 392.333573, 393.208349, 395.135862, 396.057784, 441.052546, 442.489659, 442.264911, 443.29516, 444.209317, 445.121651, 446.131261, 447.344522, 381.697812, 382.417916, 383.542124, 467.899009, 469.194422, 468.823029, 469.930914, 470.765236, 471.821972, 472.835489, 474.618237, 475.640147, 392.286042, 391.41163]
time = [5.460979, 5.461261, 5.461824, 5.462669, 5.463514, 5.464358, 5.464922, 5.466611, 5.476275, 5.47712, 5.477965, 5.47881, 5.480218, 5.481062, 5.554368, 5.554931, 5.555213, 5.555776, 5.556621, 5.557466, 5.55831, 5.559155, 5.462362, 5.463206, 5.464051, 5.564736, 5.565299, 5.565581, 5.566144, 5.566989, 5.567834, 5.568678, 5.570086, 5.570931, 5.490368, 5.49065]
difference = [8.70957, 1.73993, 3.36262, 2.37831, 1.8221, 1.1552, 1.02758, 1.04814, 7.61717, 2.66319, 1.99305, 1.88325, 1.00294, 1.13622, 5.64005, 2.13311, 1.48505, 2.34133, 2.15143, 2.2784, 1.36814, 1.0139, 5.80165, 2.51963, 1.67136, 6.58783, 1.40849, 1.99111, 1.95164, 2.50574, 1.91457, 1.24077, 1.05416, 1.03806, 8.13389, 3.2514]
df = pd.DataFrame({'index': index_col, 'depth': depth, 'load': load, 'time': time, 'difference': difference})
sum_diff = []
start = 0
for i in range(len(df)):
    if i == len(df) - 1:
        end = i + 1
        sum_diff.append(sum(df['difference'][start:end]))
    else:
        if df['index'][i] + 1 != df['index'][i + 1]:
            end = i + 1
            sum_diff.append(sum(df['difference'][start:end]))
            start = end
print(sum_diff)
Output:
[21.24345, 16.295820000000003, 18.411409999999997, 9.99264, 19.69237, 11.38529]
I checked if the calculation is correct by doing this manually:
print(sum(df['difference'][0:8]))
print(sum(df['difference'][8:14]))
print(sum(df['difference'][14:22]))
print(sum(df['difference'][22:25]))
print(sum(df['difference'][25:34]))
print(sum(df['difference'][34:36]))
and yes, I got the same output:
21.24345
16.295820000000003
18.411409999999997
9.99264
19.69237
11.38529
An elegant solution is to group the dataframe by the jumps in index_col and build a dict of the groups, which keeps the summing flexible. Start with an empty dataframe and use it to store the summed results.
You can do as follows:
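# `data` is assumed to hold the question's columns, including 'index_col' and 'difference'
df = pd.DataFrame(data)
result = pd.DataFrame(columns=['event_no', 'sum'])
# One sub-frame per event: a new event starts wherever index_col jumps by more than 1
grouped_dict = dict(tuple(df.groupby(df['index_col'].diff().gt(1).cumsum())))
for index in grouped_dict:
    # Note: DataFrame.append was removed in pandas 2.0; on newer versions collect the rows in a list instead
    result = result.append({'event_no': index + 1, 'sum': grouped_dict[index]['difference'].sum()}, ignore_index=True)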
And this will give you exactly what you want:
event_no sum
0 1.0 21.243450
1 2.0 16.295820
2 3.0 18.411410
3 4.0 9.992640
4 5.0 19.692372
5 6.0 11.385290
What does df.groupby(df['index_col'].diff().gt(1).cumsum()) do?
diff() calculates the difference between consecutive values in df['index_col']. gt(1) returns whether each of those differences is greater than 1. cumsum() then takes the running sum of these boolean results. Rows 0-7 are False, so cumsum is 0 for each of them. Row 8 is True, so cumsum becomes 1 and stays at 1 for the following consecutive rows, since they return False for gt(1).
The calculation proceeds the same way for the remaining segments. So df.groupby() receives group labels from 0 to 5, as follows:
0
0
0
0
0
0
0
0
1
1
1
1
1
1
2
2
2
2
2
2
2
2
3
3
3
4
4
4
4
4
4
4
4
4
5
5
Hence the grouping is done on these 6 labels (0 to 5) for your given input.
Hope that's clear now!
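The grouping key explained above can also be passed straight to groupby, which gives the per-event sums without a dict or an explicit loop. This is a minimal sketch on a made-up six-row frame (not the question's data); `event_id` and `event_sums` are names chosen here for illustration.

```python
import pandas as pd

# Hypothetical miniature of the question's data: three events separated by gaps in index_col
df = pd.DataFrame({
    "index_col": [1, 2, 3, 7, 8, 12],
    "difference": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# diff()   -> gap to the previous row
# gt(1)    -> True on the first row of each new event
# cumsum() -> running count of event starts, i.e. one label per row
event_id = df["index_col"].diff().gt(1).cumsum()

# Sum 'difference' within each event
event_sums = df.groupby(event_id)["difference"].sum()

print(event_id.tolist())    # [0, 0, 0, 1, 1, 2]
print(event_sums.tolist())  # [6.0, 9.0, 6.0]
```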

Getting date field from JSON url as pandas DataFrame

I am trying to load the data from this API URL into a pandas DataFrame. I get the values, but I still need to add the date as a column alongside the other values:
import pandas as pd
from pandas.io.json import json_normalize
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
df = pd.read_json("https://covidapi.info/api/v1/country/DOM")
df = pd.DataFrame(df['result'].values.tolist())
print (df)
Getting this output:
confirmed deaths recovered
0 0 0 0
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
.. ... ... ...
72 1488 68 16
73 1488 68 16
74 1745 82 17
75 1828 86 33
76 1956 98 36
You need to pass the index from your dataframe as well as the data itself:
df = pd.DataFrame(index=df.index, data=df['result'].values.tolist())
The line above creates the same columns, but keeps the original date index from the API call.
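If the date is wanted as a regular column rather than as the index, the sketch below shows one way to get there. It runs offline on a made-up `payload` dict that is only assumed to mirror the API's shape (dates as keys under 'result'); the real response may differ.

```python
import pandas as pd

# Made-up payload, assumed to mirror the API's shape: dates as keys under 'result'
payload = {
    "count": 2,
    "result": {
        "2020-03-01": {"confirmed": 1, "deaths": 0, "recovered": 0},
        "2020-03-02": {"confirmed": 2, "deaths": 0, "recovered": 1},
    },
}

# orient='index' turns the outer keys (the dates) into the row index
df = pd.DataFrame.from_dict(payload["result"], orient="index")

# Name the index 'date' and move it into an ordinary column
df = df.rename_axis("date").reset_index()
df["date"] = pd.to_datetime(df["date"])
print(df)
```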

Convert dict constructor to Pandas MultiIndex dataframe

I have a lot of data that I'd like to structure in a Pandas dataframe. However, I need a multi-index format for this. The Pandas MultiIndex feature has always confused me and also this time I can't get my head around it.
I built the structure as I want it as a dict, but because my actual data is much larger, I want to use Pandas instead. The code below is the dict variant. Note that the original data has a lot more labels and more rows as well.
The idea is that the original data contains rows of a task with index Task_n that has been performed by a participant with index Participant_n. Each row is a segment. Even though the original data does not have this distinction, I want to add this to my dataframe. In other words:
Participant_n | Task_n | val | dur
----------------------------------
1 | 1 | 12 | 2
1 | 1 | 3 | 4
1 | 1 | 4 | 12
1 | 2 | 11 | 11
1 | 2 | 34 | 4
The above example contains one participant and two tasks, with three and two segments (rows) respectively.
In Python, with a dict structure this looks like this:
import pandas as pd
cols = ['Participant_n', 'Task_n', 'val', 'dur']
data = [[1,1,25,83],
[1,1,4,68],
[1,1,9,987],
[1,2,98,98],
[1,2,84,4],
[2,1,9,21],
[2,2,15,6],
[2,2,185,6],
[2,2,18,4],
[2,3,8,12],
[3,1,7,78],
[3,1,12,88],
[3,2,12,48]]
d = pd.DataFrame(data, columns=cols)
part_d = {}
for row in d.itertuples():
    participant_n = row.Participant_n
    participant = "participant" + str(participant_n)
    task = "task" + str(row.Task_n)
    if participant in part_d:
        part_d[participant]['all_sum']['val'] += int(row.val)
        part_d[participant]['all_sum']['dur'] += int(row.dur)
    else:
        part_d[participant] = {
            'prof': 0 if participant_n < 20 else 1,
            'all_sum': {
                'val': int(row.val),
                'dur': int(row.dur),
            }
        }
    if task in part_d[participant]:
        # Get already existing keys
        k = list(part_d[participant][task].keys())
        k_int = []
        # Only get the ints (i.e. not all_sum etc.)
        for n in k:
            # Get digit from e.g. seg1
            n = n[3:]
            try:
                k_int.append(int(n))
            except ValueError:
                pass
        # Increment max by 1
        i = max(k_int) + 1
        part_d[participant][task][f"seg{i}"] = {
            'val': int(row.val),
            'dur': int(row.dur),
        }
        part_d[participant][task]['task_sum']['val'] += int(row.val)
        part_d[participant][task]['task_sum']['dur'] += int(row.dur)
    else:
        part_d[participant][task] = {
            'seg1': {
                'val': int(row.val),
                'dur': int(row.dur),
            },
            'task_sum': {
                'val': int(row.val),
                'dur': int(row.dur),
            }
        }
print(part_d)
In the end result here I have some additional variables such as: task_sum (the sum over the task of a participant), all_sum (sum of all a participant's actions), and also prof which is an arbitrary boolean flag. The resulting dict looks like this (not beautified to save space. If you want to inspect, open in text editor as JSON or Python dict and beautify):
{'participant1': {'prof': 0, 'all_sum': {'val': 220, 'dur': 1240}, 'task1': {'seg1': {'val': 25, 'dur': 83}, 'task_sum': {'val': 38, 'dur': 1138}, 'seg2': {'val': 4, 'dur': 68}, 'seg3': {'val': 9, 'dur': 987}}, 'task2': {'seg1': {'val': 98, 'dur': 98}, 'task_sum': {'val': 182, 'dur': 102}, 'seg2': {'val': 84, 'dur': 4}}}, 'participant2': {'prof': 0, 'all_sum': {'val': 235, 'dur': 49}, 'task1': {'seg1': {'val': 9, 'dur': 21}, 'task_sum': {'val': 9, 'dur': 21}}, 'task2': {'seg1': {'val': 15, 'dur': 6}, 'task_sum': {'val': 218, 'dur': 16}, 'seg2': {'val': 185, 'dur': 6}, 'seg3': {'val': 18, 'dur': 4}}, 'task3': {'seg1': {'val': 8, 'dur': 12}, 'task_sum': {'val': 8, 'dur': 12}}}, 'participant3': {'prof': 0, 'all_sum': {'val': 31, 'dur': 214}, 'task1': {'seg1': {'val': 7, 'dur': 78}, 'task_sum': {'val': 19, 'dur': 166}, 'seg2': {'val': 12, 'dur': 88}}, 'task2': {'seg1': {'val': 12, 'dur': 48}, 'task_sum': {'val': 12, 'dur': 48}}}}
Instead of a dictionary, I would like this to end up in a pd.DataFrame with multiple indexes that looks like the representation below, or similar. (For simplicity's sake, instead of task1 or seg1 I just used the indices.)
Participant Prof all_sum Task Task_sum Seg val dur
val dur val dur
====================================================================
participant1 0 220 1240 1 38 1138 1 25 83
2 4 68
3 9 987
2 182 102 1 98 98
2 84 4
--------------------------------------------------------------------
participant2 0 235 49 1 9 21 1 9 21
2 218 16 1 15 6
2 185 6
3 18 4
3 8 12 1 8 12
--------------------------------------------------------------------
participant3 0 31 214 1 19 166 1 7 78
2 12 88
2 12 48 1 12 48
Is this a structure that is possible in Pandas? If not, what are reasonable alternatives?
Again I have to emphasise that in reality there is a lot more data and possibly more sub-levels. The solution thus has to be flexible, and efficient. If it makes things a lot simpler, I am willing to only have multi-index on one axis, and change the header to:
Participant Prof all_sum_val all_sum_dur Task Task_sum_val Task_sum_dur Seg
The main issue I am having is that I do not understand how I can build a multi-index df if I don't know the dimensions in advance. I don't know in advance how many tasks or segments there will be. So I am pretty sure I can keep the loop construct from my initial dict approach, and I guess I'd then have to append/concat to an initially empty DataFrame, but the question is then what the structure has to look like. It can't be a simple Series, because that does not take the multi-index into account. So how?
For the people who have read this far and want to try their hand at this, I think that my original code can be re-used for the most part (loop and variable assignment), but instead of a dict it has to use accessors to the DataFrame. That is an important aspect: data should be easily readable with getters/setters, just as with a regular DataFrame. E.g. it should be easy to get the duration value for participant 2, task 2, segment 2, and so on. Getting a subset of the data (e.g. where prof == 0) should also work without problems.
My only suggestion is to get rid of all your dictionary stuff. All of that code can be re-written in Pandas without much effort. This will likely speed up the transformation process as well but will take some time. To help you in the process I have rewritten the section you provided. The rest is up to you.
import pandas as pd
cols = ['Participant_n', 'Task_n', 'val', 'dur']
data = [[1,1,25,83],
[1,1,4,68],
[1,1,9,987],
[1,2,98,98],
[1,2,84,4],
[2,1,9,21],
[2,2,15,6],
[2,2,185,6],
[2,2,18,4],
[2,3,8,12],
[3,1,7,78],
[3,1,12,88],
[3,2,12,48]]
df = pd.DataFrame(data, columns=cols)
df["Task Sum val"] = df.groupby(["Participant_n","Task_n"])["val"].transform("sum")
df["Task Sum dur"] = df.groupby(["Participant_n","Task_n"])["dur"].transform("sum")
df["seg"] =df.groupby(["Participant_n","Task_n"]).cumcount() + 1
df["All Sum val"] = df.groupby("Participant_n")["val"].transform("sum")
df["All Sum dur"] = df.groupby("Participant_n")["dur"].transform("sum")
df = df.set_index(["Participant_n","All Sum val","All Sum dur","Task_n","Task Sum val","Task Sum dur"])[["seg","val","dur"]]
df = df.sort_index()
df
Output
seg val dur
Participant_n All Sum val All Sum dur Task_n Task Sum val Task Sum dur
1 220 1240 1 38 1138 1 25 83
1138 2 4 68
1138 3 9 987
2 182 102 1 98 98
102 2 84 4
2 235 49 1 9 21 1 9 21
2 218 16 1 15 6
16 2 185 6
16 3 18 4
3 8 12 1 8 12
3 31 214 1 19 166 1 7 78
166 2 12 88
2 12 48 1 12 48
Try to run this code and let me know what you think. Comment with any questions.
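The question also asks how to read single values back out. The sketch below shows one way to do that, under a simplifying assumption that differs from the answer above: only Participant_n, Task_n and seg go into the index, and the subtotal stays an ordinary column (named `task_sum_dur` here for illustration). The data is a shortened copy of the question's rows.

```python
import pandas as pd

cols = ['Participant_n', 'Task_n', 'val', 'dur']
data = [[1, 1, 25, 83], [1, 1, 4, 68], [1, 2, 98, 98],
        [2, 1, 9, 21], [2, 2, 15, 6], [2, 2, 185, 6]]
df = pd.DataFrame(data, columns=cols)

# Segment counter within each (participant, task) pair
df['seg'] = df.groupby(['Participant_n', 'Task_n']).cumcount() + 1
# Subtotal kept as a plain column so the index stays small (assumption of this sketch)
df['task_sum_dur'] = df.groupby(['Participant_n', 'Task_n'])['dur'].transform('sum')

df = df.set_index(['Participant_n', 'Task_n', 'seg']).sort_index()

# Scalar lookup: duration for participant 2, task 2, segment 2
dur_222 = df.loc[(2, 2, 2), 'dur']

# Cross-section: all rows of task 2, across participants
task2 = df.xs(2, level='Task_n')

print(dur_222)
print(task2)
```

Keeping the index to the three identifying levels means a lookup needs only those three keys, not the subtotal values as well.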
I faced a similar issue with data presentation and came up with the following helper functions for groupby with subtotals.
With this process it's possible to generate subtotals for an arbitrary number of group by columns, however the output data has a different format. Instead of the subtotals being put in their own columns, each subtotal adds an extra row to the data frame.
For interactive data exploration and analysis, I find this very helpful, as it's possible to get the subtotals with just a couple of lines of code:
import numpy as np
import pandas as pd

def get_subtotals(frame, columns, aggvalues, subtotal_level):
    if subtotal_level == 0:
        # Finest level: aggregate over all group-by columns
        return frame.groupby(columns, as_index=False).agg(aggvalues)
    elif subtotal_level == len(columns):
        # Grand total: aggregate the whole frame, group columns left as NaN
        return pd.DataFrame(frame.agg(aggvalues)).transpose().assign(
            **{c: np.nan for c in columns}
        )
    # Subtotal: group by the first `subtotal_level` columns, the rest become NaN
    return frame.groupby(
        columns[:subtotal_level],
        as_index=False
    ).agg(aggvalues).assign(
        **{c: np.nan for c in columns[subtotal_level:]}
    )

def groupby_with_subtotals(frame, columns, aggvalues, grand_totals=False, totals_position='last'):
    gt = 1 if grand_totals else 0
    out = pd.concat(
        # Use the `frame` argument here rather than a global `df`
        [get_subtotals(frame, columns, aggvalues, i)
         for i in range(len(columns) + gt)]
    ).sort_values(columns, na_position=totals_position)
    # NaN in a group column marks a subtotal row; label it 'total'
    out[columns] = out[columns].fillna('total')
    return out.set_index(columns)
Reusing the dataframe creation code from Gabriel A's answer:
cols = ['Participant_n', 'Task_n', 'val', 'dur']
data = [[1,1,25,83],
[1,1,4,68],
[1,1,9,987],
[1,2,98,98],
[1,2,84,4],
[2,1,9,21],
[2,2,15,6],
[2,2,185,6],
[2,2,18,4],
[2,3,8,12],
[3,1,7,78],
[3,1,12,88],
[3,2,12,48]]
df = pd.DataFrame(data, columns=cols)
It is first necessary to add the seg column
df['seg'] = df.groupby(['Participant_n', 'Task_n']).cumcount() + 1
Then we can use groupby_with_subtotals like this. Additionally, note that you can place the subtotals at the top and also include grand_totals by passing in grand_totals=True, totals_position='first'
groupby_columns = ['Participant_n', 'Task_n', 'seg']
groupby_aggs = {'val': 'sum', 'dur': 'sum'}
aggdf = groupby_with_subtotals(df, groupby_columns, groupby_aggs)
aggdf
# outputs
dur val
Participant_n Task_n seg
1 1.0 1.0 83 25
2.0 68 4
3.0 987 9
total 1138 38
2.0 1.0 98 98
2.0 4 84
total 102 182
total total 1240 220
2 1.0 1.0 21 9
total 21 9
2.0 1.0 6 15
2.0 6 185
3.0 4 18
total 16 218
3.0 1.0 12 8
total 12 8
total total 49 235
3 1.0 1.0 78 7
2.0 88 12
total 166 19
2.0 1.0 48 12
total 48 12
total total 214 31
Here, the subtotals rows are marked with total, and the left most total indicates the subtotal level.
Once the aggregate data frame is created, it's possible to access the subtotals using loc. For example:
aggdf.loc[1,'total','total']
# outputs:
dur 1240
val 220
Name: (1, total, total), dtype: int64

How to make Pandas unpack JSON data into proper DataFrame instead of list of dicts

I'm trying to parse the data at http://dev.hsl.fi/tmp/citybikes/stations_20170503T071501Z into a Pandas DataFrame. Using read_json gives me a list of dicts instead of a proper DataFrame with the variable names as columns:
In [1]:
data = pd.read_json("http://dev.hsl.fi/tmp/citybikes/stations_20170503T071501Z")
print(data)
Out[1]:
result
0 {'name': '001 Kaivopuisto', 'coordinates': '60...
1 {'name': '002 Laivasillankatu', 'coordinates':...
.. ...
149 {'name': '160 Nokkala', 'coordinates': '60.147...
150 {'name': '997 Workshop Helsinki', 'coordinates...
[151 rows x 1 columns]
This happens with all orient options. I've tried json_normalize() to no avail as well, and a few other things I found here. How could I make this into a sensible DataFrame? Thanks!
Option 1
Use pd.DataFrame on the list of dictionaries
pd.DataFrame(data['result'].values.tolist())
avl_bikes coordinates free_slots name operative style total_slots
0 12 60.155411,24.950391 18 001 Kaivopuisto True CB 30
1 3 60.159715,24.955212 9 002 Laivasillankatu True 12
2 0 60.158172,24.944808 16 003 Kapteeninpuistikko True 16
3 0 60.160944,24.941859 14 004 Viiskulma True 14
4 16 60.157935,24.936083 16 005 Sepänkatu True 32
Option 2
Use apply
data.result.apply(pd.Series)
avl_bikes coordinates free_slots name operative style total_slots
0 12 60.155411,24.950391 18 001 Kaivopuisto True CB 30
1 3 60.159715,24.955212 9 002 Laivasillankatu True 12
2 0 60.158172,24.944808 16 003 Kapteeninpuistikko True 16
3 0 60.160944,24.941859 14 004 Viiskulma True 14
4 16 60.157935,24.936083 16 005 Sepänkatu True 32
Option 3
Or you could fetch the json yourself and strip out the results
import json
import urllib.request
url = "http://dev.hsl.fi/tmp/citybikes/stations_20170503T071501Z"
response = urllib.request.urlopen(url)
data = json.loads(response.read())
df = pd.DataFrame(data['result'])
df
avl_bikes coordinates free_slots name operative style total_slots
0 12 60.155411,24.950391 18 001 Kaivopuisto True CB 30
1 3 60.159715,24.955212 9 002 Laivasillankatu True 12
2 0 60.158172,24.944808 16 003 Kapteeninpuistikko True 16
3 0 60.160944,24.941859 14 004 Viiskulma True 14
4 16 60.157935,24.936083 16 005 Sepänkatu True 32
The approaches in the accepted answer work great, so this is just a more recent (2022) FYI:
In later versions of Pandas (1.0 and later), you can also use pd.json_normalize:
json_obj = {
'key': 123,
'field1': 'blah',
'info': {
'contacts': {
'email': {
'foo': 'foo#abc.com',
'bar': 'bar#abc.com'
},
'tel': '123456789',
}
}
}
pd.json_normalize(json_obj)
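To make the effect visible, here is a small sketch of what the call produces for the object above, using two optional parameters of pd.json_normalize (sep and max_level). The variable names `flat` and `shallow` are chosen here for illustration.

```python
import pandas as pd

json_obj = {
    'key': 123,
    'field1': 'blah',
    'info': {
        'contacts': {
            'email': {'foo': 'foo#abc.com', 'bar': 'bar#abc.com'},
            'tel': '123456789',
        }
    }
}

# Nested keys are joined into column names; sep replaces the default '.'
flat = pd.json_normalize(json_obj, sep='_')
print(flat.columns.tolist())

# max_level limits how deep the flattening goes; deeper dicts stay as dict values
shallow = pd.json_normalize(json_obj, max_level=1)
print(shallow.columns.tolist())
```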

output multiple files based on column value python pandas

I have a sample pandas data frame:
import pandas as pd
df = {'ID': [73, 68,1,94,42,22, 28,70,47, 46,17, 19, 56, 33 ],
'CloneID': [1, 1, 1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4 ],
'VGene': ['64D', '64D', '64D', 61, 61, 61, 311, 311, 311, 311, 311, 311, 311, 311]}
df = pd.DataFrame(df)
it looks like this:
df
Out[7]:
CloneID ID VGene
0 1 73 64D
1 1 68 64D
2 1 1 64D
3 1 94 61
4 1 42 61
5 2 22 61
6 2 28 311
7 3 70 311
8 3 47 311
9 3 46 311
10 4 17 311
11 4 19 311
12 4 56 311
13 4 33 311
I want to write a simple script to output each CloneID to a different output file, so in this case there would be 4 different files.
the first file would be named 'CloneID1.txt' and it would look like this:
CloneID ID VGene
1 73 64D
1 68 64D
1 1 64D
1 94 61
1 42 61
second file would be named 'CloneID2.txt':
CloneID ID VGene
2 22 61
2 28 311
third file would be named 'CloneID3.txt':
CloneID ID VGene
3 70 311
3 47 311
3 46 311
and last file would be 'CloneID4.txt':
CloneID ID VGene
4 17 311
4 19 311
4 56 311
4 33 311
the code i found online was:
import pandas as pd
data = pd.read_excel('data.xlsx')
for group_name, data in data.groupby('CloneID'):
    with open('results.csv', 'a') as f:
        data.to_csv(f)
but it outputs everything to one file instead of multiple files.
You can do something like the following:
In [19]:
gp = df.groupby('CloneID')
for g in gp.groups:
    print('CloneID' + str(g) + '.txt')
    print(gp.get_group(g).to_csv())
CloneID1.txt
,CloneID,ID,VGene
0,1,73,64D
1,1,68,64D
2,1,1,64D
3,1,94,61
4,1,42,61
CloneID2.txt
,CloneID,ID,VGene
5,2,22,61
6,2,28,311
CloneID3.txt
,CloneID,ID,VGene
7,3,70,311
8,3,47,311
9,3,46,311
CloneID4.txt
,CloneID,ID,VGene
10,4,17,311
11,4,19,311
12,4,56,311
13,4,33,311
So here we iterate over the groups with for g in gp.groups:, use each group key to build the output file name, and call to_csv on the group. The following should work for you:
gp = df.groupby('CloneID')
for g in gp.groups:
    path = 'CloneID' + str(g) + '.txt'
    gp.get_group(g).to_csv(path)
Actually the following would be even simpler:
gp = df.groupby('CloneID')
gp.apply(lambda x: x.to_csv('CloneID' + str(x.name) + '.txt'))
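Another variant, sketched here on a shortened copy of the question's data, iterates over the groupby object directly, which yields the key and the sub-frame together. Tab separation and index=False are assumptions made to get closer to the layout shown in the question, and the temporary output folder is used only so the sketch does not write into the working directory.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Shortened copy of the question's data
df = pd.DataFrame({
    "ID": [73, 68, 22, 70],
    "CloneID": [1, 1, 2, 3],
    "VGene": ["64D", "64D", "61", "311"],
})

out_dir = Path(tempfile.mkdtemp())  # temporary folder, chosen only for this sketch

written = []
for clone_id, group in df.groupby("CloneID"):
    path = out_dir / f"CloneID{clone_id}.txt"
    # index=False drops the row labels; sep="\t" writes tab-separated columns
    group.to_csv(path, sep="\t", index=False)
    written.append(path.name)

print(written)
```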
