I have a lot of data that I'd like to structure in a Pandas dataframe. However, I need a multi-index format for this. The Pandas MultiIndex feature has always confused me, and this time too I can't get my head around it.
I built the structure as I want it as a dict, but because my actual data is much larger, I want to use Pandas instead. The code below is the dict variant. Note that the original data has a lot more labels and more rows as well.
The idea is that the original data contains rows of a task with index Task_n that has been performed by a participant with index Participant_n. Each row is a segment. Even though the original data does not have this distinction, I want to add this to my dataframe. In other words:
Participant_n | Task_n | val | dur
----------------------------------
1 | 1 | 12 | 2
1 | 1 | 3 | 4
1 | 1 | 4 | 12
1 | 2 | 11 | 11
1 | 2 | 34 | 4
The above example contains one participant and two tasks, with three and two segments (rows) respectively.
In Python, with a dict structure this looks like this:
import pandas as pd
cols = ['Participant_n', 'Task_n', 'val', 'dur']
data = [[1,1,25,83],
[1,1,4,68],
[1,1,9,987],
[1,2,98,98],
[1,2,84,4],
[2,1,9,21],
[2,2,15,6],
[2,2,185,6],
[2,2,18,4],
[2,3,8,12],
[3,1,7,78],
[3,1,12,88],
[3,2,12,48]]
d = pd.DataFrame(data, columns=cols)
part_d = {}
for row in d.itertuples():
    participant_n = row.Participant_n
    participant = "participant" + str(participant_n)
    task = "task" + str(row.Task_n)
    if participant in part_d:
        part_d[participant]['all_sum']['val'] += int(row.val)
        part_d[participant]['all_sum']['dur'] += int(row.dur)
    else:
        part_d[participant] = {
            'prof': 0 if participant_n < 20 else 1,
            'all_sum': {
                'val': int(row.val),
                'dur': int(row.dur),
            }
        }
    if task in part_d[participant]:
        # Get already existing keys
        k = list(part_d[participant][task].keys())
        k_int = []
        # Only get the ints (i.e. not all_sum etc.)
        for n in k:
            # Get digit from e.g. seg1
            n = n[3:]
            try:
                k_int.append(int(n))
            except ValueError:
                pass
        # Increment max by 1
        i = max(k_int) + 1
        part_d[participant][task][f"seg{i}"] = {
            'val': int(row.val),
            'dur': int(row.dur),
        }
        part_d[participant][task]['task_sum']['val'] += int(row.val)
        part_d[participant][task]['task_sum']['dur'] += int(row.dur)
    else:
        part_d[participant][task] = {
            'seg1': {
                'val': int(row.val),
                'dur': int(row.dur),
            },
            'task_sum': {
                'val': int(row.val),
                'dur': int(row.dur),
            }
        }
print(part_d)
In the end result here I have some additional variables such as: task_sum (the sum over the task of a participant), all_sum (sum of all a participant's actions), and also prof which is an arbitrary boolean flag. The resulting dict looks like this (not beautified to save space. If you want to inspect, open in text editor as JSON or Python dict and beautify):
{'participant1': {'prof': 0, 'all_sum': {'val': 220, 'dur': 1240}, 'task1': {'seg1': {'val': 25, 'dur': 83}, 'task_sum': {'val': 38, 'dur': 1138}, 'seg2': {'val': 4, 'dur': 68}, 'seg3': {'val': 9, 'dur': 987}}, 'task2': {'seg1': {'val': 98, 'dur': 98}, 'task_sum': {'val': 182, 'dur': 102}, 'seg2': {'val': 84, 'dur': 4}}}, 'participant2': {'prof': 0, 'all_sum': {'val': 235, 'dur': 49}, 'task1': {'seg1': {'val': 9, 'dur': 21}, 'task_sum': {'val': 9, 'dur': 21}}, 'task2': {'seg1': {'val': 15, 'dur': 6}, 'task_sum': {'val': 218, 'dur': 16}, 'seg2': {'val': 185, 'dur': 6}, 'seg3': {'val': 18, 'dur': 4}}, 'task3': {'seg1': {'val': 8, 'dur': 12}, 'task_sum': {'val': 8, 'dur': 12}}}, 'participant3': {'prof': 0, 'all_sum': {'val': 31, 'dur': 214}, 'task1': {'seg1': {'val': 7, 'dur': 78}, 'task_sum': {'val': 19, 'dur': 166}, 'seg2': {'val': 12, 'dur': 88}}, 'task2': {'seg1': {'val': 12, 'dur': 48}, 'task_sum': {'val': 12, 'dur': 48}}}}
Instead of a dictionary, I would like this to end up in a pd.DataFrame with multiple indexes that looks like the representation below, or similar. (For simplicity's sake, instead of task1 or seg1 I just used the indices.)
Participant   Prof  all_sum      Task  Task_sum     Seg  val  dur
                    val   dur          val   dur
====================================================================
participant1  0     220   1240   1     38    1138   1    25   83
                                                    2    4    68
                                                    3    9    987
                                 2     182   102    1    98   98
                                                    2    84   4
--------------------------------------------------------------------
participant2  0     235   49     1     9     21     1    9    21
                                 2     218   16     1    15   6
                                                    2    185  6
                                                    3    18   4
                                 3     8     12     1    8    12
--------------------------------------------------------------------
participant3  0     31    214    1     19    166    1    7    78
                                                    2    12   88
                                 2     12    48     1    12   48
Is this a structure that is possible in Pandas? If not, which reasonable alternatives are?
Again I have to emphasise that in reality there is a lot more data and possibly more sub-levels. The solution thus has to be flexible, and efficient. If it makes things a lot simpler, I am willing to only have multi-index on one axis, and change the header to:
Participant Prof all_sum_val all_sum_dur Task Task_sum_val Task_sum_dur Seg
The main issue I am having is that I do not understand how I can build a multi-index df if I don't know the dimensions in advance. I don't know in advance how many tasks or segments there will be. So I am pretty sure I can keep the loop construct from my initial dict approach, and I guess I'd then have to append/concat to an initially empty DataFrame, but the question is then what the structure has to look like. It can't be a simple Series, because that does not take the multi-index into account. So how?
For the people who have read this far and want to try their hand at this: I think that my original code can be re-used for the most part (loop and variable assignment), but instead of a dict it has to use accessors to the DataFrame. That is an important aspect: data should be easily readable with getters/setters, just as in a regular DataFrame. E.g. it should be easy to get the duration value for participant 2, task 2, segment 2, and so on. Getting a subset of the data (e.g. where prof == 0) should also work without problems.
My only suggestion is to get rid of all your dictionary stuff. All of that code can be re-written in Pandas without much effort. This will likely speed up the transformation process as well but will take some time. To help you in the process I have rewritten the section you provided. The rest is up to you.
import pandas as pd
cols = ['Participant_n', 'Task_n', 'val', 'dur']
data = [[1,1,25,83],
[1,1,4,68],
[1,1,9,987],
[1,2,98,98],
[1,2,84,4],
[2,1,9,21],
[2,2,15,6],
[2,2,185,6],
[2,2,18,4],
[2,3,8,12],
[3,1,7,78],
[3,1,12,88],
[3,2,12,48]]
df = pd.DataFrame(data, columns=cols)
df["Task Sum val"] = df.groupby(["Participant_n","Task_n"])["val"].transform("sum")
df["Task Sum dur"] = df.groupby(["Participant_n","Task_n"])["dur"].transform("sum")
df["seg"] =df.groupby(["Participant_n","Task_n"]).cumcount() + 1
df["All Sum val"] = df.groupby("Participant_n")["val"].transform("sum")
df["All Sum dur"] = df.groupby("Participant_n")["dur"].transform("sum")
df = df.set_index(["Participant_n","All Sum val","All Sum dur","Task_n","Task Sum val","Task Sum dur"])[["seg","val","dur"]]
df = df.sort_index()
df
Output
                                                                       seg  val  dur
Participant_n All Sum val All Sum dur Task_n Task Sum val Task Sum dur
1             220         1240        1      38           1138         1    25   83
                                                          1138         2    4    68
                                                          1138         3    9    987
                                      2      182          102          1    98   98
                                                          102          2    84   4
2             235         49          1      9            21           1    9    21
                                      2      218          16           1    15   6
                                                          16           2    185  6
                                                          16           3    18   4
                                      3      8            12           1    8    12
3             31          214         1      19           166          1    7    78
                                                          166          2    12   88
                                      2      12           48           1    12   48
Try to run this code and let me know what you think. Comment with any questions.
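As for the getter requirement in the question (duration for participant 2, task 2, segment 2, or a subset by prof), here is a minimal sketch. It assumes a simpler layout than the one above: only Participant_n, Task_n and seg go into the index, and the sums are computed on demand instead of being stored in index levels. The names indexed, task_sums and non_prof are only illustrative.

```python
import pandas as pd

cols = ['Participant_n', 'Task_n', 'val', 'dur']
data = [[1, 1, 25, 83], [1, 1, 4, 68], [1, 1, 9, 987], [1, 2, 98, 98], [1, 2, 84, 4],
        [2, 1, 9, 21], [2, 2, 15, 6], [2, 2, 185, 6], [2, 2, 18, 4], [2, 3, 8, 12],
        [3, 1, 7, 78], [3, 1, 12, 88], [3, 2, 12, 48]]
df = pd.DataFrame(data, columns=cols)

# Segment counter within each (participant, task) pair
df['seg'] = df.groupby(['Participant_n', 'Task_n']).cumcount() + 1
# Same arbitrary flag as in the question's dict code
df['prof'] = (df['Participant_n'] >= 20).astype(int)

# Assumption: a three-level row index is enough for lookups
indexed = df.set_index(['Participant_n', 'Task_n', 'seg']).sort_index()

# Single value: duration of participant 2, task 2, segment 2
dur_p2_t2_s2 = indexed.loc[(2, 2, 2), 'dur']

# All segments of participant 1, task 1 (partial indexing on the first two levels)
p1_t1 = indexed.loc[(1, 1), :]

# Per-task sums computed when needed
task_sums = indexed.groupby(level=['Participant_n', 'Task_n'])[['val', 'dur']].sum()

# Boolean filtering works as on any DataFrame
non_prof = indexed[indexed['prof'] == 0]
```

Keeping the sums out of the index avoids repeating them on every row; whether that fits depends on how you want to display the data.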
I faced a similar issue with data presentation and came up with the following helper functions for groupby with subtotals.
With this process it's possible to generate subtotals for an arbitrary number of group-by columns; however, the output data has a different format. Instead of the subtotals being put in their own columns, each subtotal adds an extra row to the data frame.
For interactive data exploration and analysis I find this very helpful, as it's possible to get the subtotals with just a couple of lines of code:
import numpy as np
import pandas as pd

def get_subtotals(frame, columns, aggvalues, subtotal_level):
    if subtotal_level == 0:
        # Finest level: group by every column
        return frame.groupby(columns, as_index=False).agg(aggvalues)
    elif subtotal_level == len(columns):
        # Grand total: aggregate the whole frame, all group columns are NaN
        return pd.DataFrame(frame.agg(aggvalues)).transpose().assign(
            **{c: np.nan for c in columns}
        )
    # Subtotal: group by the leading columns, the remaining ones are NaN
    return frame.groupby(
        columns[:subtotal_level],
        as_index=False
    ).agg(aggvalues).assign(
        **{c: np.nan for c in columns[subtotal_level:]}
    )

def groupby_with_subtotals(frame, columns, aggvalues, grand_totals=False, totals_position='last'):
    gt = 1 if grand_totals else 0
    out = pd.concat(
        # Use the frame argument here, not a global df
        [get_subtotals(frame, columns, aggvalues, i)
         for i in range(len(columns)+gt)]
    ).sort_values(columns, na_position=totals_position)
    out[columns] = out[columns].fillna('total')
    return out.set_index(columns)
Reusing the dataframe creation code from Gabriel A's answer:
cols = ['Participant_n', 'Task_n', 'val', 'dur']
data = [[1,1,25,83],
[1,1,4,68],
[1,1,9,987],
[1,2,98,98],
[1,2,84,4],
[2,1,9,21],
[2,2,15,6],
[2,2,185,6],
[2,2,18,4],
[2,3,8,12],
[3,1,7,78],
[3,1,12,88],
[3,2,12,48]]
df = pd.DataFrame(data, columns=cols)
It is first necessary to add the seg column
df['seg'] = df.groupby(['Participant_n', 'Task_n']).cumcount() + 1
Then we can use groupby_with_subtotals like this. Additionally, note that you can place the subtotals at the top and also include grand_totals by passing in grand_totals=True, totals_position='first'
groupby_columns = ['Participant_n', 'Task_n', 'seg']
groupby_aggs = {'val': 'sum', 'dur': 'sum'}
aggdf = groupby_with_subtotals(df, groupby_columns, groupby_aggs)
aggdf
# outputs
                            dur   val
Participant_n Task_n seg
1             1.0    1.0    83    25
                     2.0    68    4
                     3.0    987   9
                     total  1138  38
              2.0    1.0    98    98
                     2.0    4     84
                     total  102   182
              total  total  1240  220
2             1.0    1.0    21    9
                     total  21    9
              2.0    1.0    6     15
                     2.0    6     185
                     3.0    4     18
                     total  16    218
              3.0    1.0    12    8
                     total  12    8
              total  total  49    235
3             1.0    1.0    78    7
                     2.0    88    12
                     total  166   19
              2.0    1.0    48    12
                     total  48    12
              total  total  214   31
Here, the subtotal rows are marked with total, and the leftmost total indicates the subtotal level.
Once the aggregate data frame is created, it's possible to access the subtotals using loc. Example:
aggdf.loc[1,'total','total']
# outputs:
dur 1240
val 220
Name: (1, total, total), dtype: int64
I have a pandas dataset that looks at the number of n cases of an instance over time.
I have sorted the dataset in ascending order from the first recorded date and have created a new column called 'change'.
I am unsure however how to take the data from column n and map it onto the 'change' column such that each cell in the 'change' column represents the difference from the previous day.
For example, if on day 334 there were n = 14000 and on day 335 there were n = 14500 cases, in that corresponding 'change' cell I would want it to say '500'.
I have been trying things out for the past couple of hours but to no avail so have come here for some help.
I know this is wordier than I would like, but if you need any clarification let me know.
import pandas as pd
df = pd.DataFrame({
'date': [1,2,3,4,5,6,7,8,9,10],
'cases': [100, 120, 129, 231, 243, 212, 375, 412, 440, 1]
})
df['change'] = df.cases.diff()
OUTPUT
date cases change
0 1 100 NaN
1 2 120 20.0
2 3 129 9.0
3 4 231 102.0
4 5 243 12.0
5 6 212 -31.0
6 7 375 163.0
7 8 412 37.0
8 9 440 28.0
9 10 1 -439.0
I have a data frame:
I have to sum the differences separately for each event. In the data frame you can see that index 12 follows index 8, which marks the start of a new event: whenever index_col jumps by more than 1 (here by 4), a new event starts, and its differences should be summed separately.
So the sums per event should look like this, e.g.:
index_col 1-8: sum of Difference should be 20.96 (belongs to the first event)
index_col 12-17: sum of Difference should be 16.17 (belongs to the second event)
and so on ...
index_col Depth(nm) Load(µN) Time (s) Difference
1 42.478033 432.482376 5.460979 8.70957
2 44.217959 432.163277 5.461261 1.73993
3 44.517313 432.764691 5.461824 3.36262
4 44.602024 433.754851 5.462669 2.37831
5 44.452232 434.808104 5.463514 1.8221
6 44.785705 435.698639 5.464358 1.1552
7 44.008191 436.724050 5.464922 1.02758
8 44.104820 438.753727 5.466611 1.04814
12 39.918249 390.597846 5.476275 7.61717
13 40.939905 391.229950 5.477120 2.66319
14 40.709209 392.333573 5.477965 1.99305
15 40.975959 393.208349 5.478810 1.88325
16 40.415786 395.135862 5.480218 1.00294
17 40.748377 396.057784 5.481062 1.13622
21 45.101152 441.052546 5.554368 5.64005
22 43.096024 442.489659 5.554931 2.13311
23 44.581075 442.264911 5.555213 1.48505
24 43.757947 443.295160 5.555776 2.34133
25 44.020544 444.209317 5.556621 2.15143
26 44.457026 445.121651 5.557466 2.2784
27 44.332075 446.131261 5.558310 1.36814
28 43.853956 447.344522 5.559155 1.0139
32 38.420457 381.697812 5.462362 5.80165
33 39.247295 382.417916 5.463206 2.51963
34 38.910364 383.542124 5.464051 1.67136
38 45.939504 467.899009 5.564736 6.58783
39 44.251143 469.194422 5.565299 1.40849
40 46.242257 468.823029 5.565581 1.99111
41 45.032736 469.930914 5.566144 1.95164
42 45.540791 470.765236 5.566989 2.50574
43 45.520035 471.821972 5.567834 1.91457
44 45.593076 472.835489 5.568678 1.24077
45 45.267980 474.618237 5.570086 1.05416
46 45.238412 475.640147 5.570931 1.038062
49 38.193023 392.286042 5.490368 8.13389
50 41.444420 391.411630 5.490650 3.2514
The way you add the data as plain text is very unhelpful. It would be much easier and faster if you add the data in the form index_col = ..., load = ... and so on.
That aside, this is my code:
import pandas as pd
index_col = [1, 2, 3, 4, 5, 6, 7, 8, 12, 13, 14, 15, 16, 17, 21, 22, 23, 24, 25, 26, 27, 28, 32, 33, 34, 38, 39, 40, 41, 42, 43, 44, 45, 46, 49, 50]
depth = [42.478033, 44.217959, 44.517313, 44.602024, 44.452232, 44.785705, 44.008191, 44.10482, 39.918249, 40.939905, 40.709209, 40.975959, 40.415786, 40.748377, 45.101152, 43.096024, 44.581075, 43.757947, 44.020544, 44.457026, 44.332075, 43.853956, 38.420457, 39.247295, 38.910364, 45.939504, 44.251143, 46.242257, 45.032736, 45.540791, 45.520035, 45.593076, 45.26798, 45.238412, 38.193023, 41.44442]
load = [432.482376, 432.163277, 432.764691, 433.754851, 434.808104, 435.698639, 436.72405, 438.753727, 390.597846, 391.22995, 392.333573, 393.208349, 395.135862, 396.057784, 441.052546, 442.489659, 442.264911, 443.29516, 444.209317, 445.121651, 446.131261, 447.344522, 381.697812, 382.417916, 383.542124, 467.899009, 469.194422, 468.823029, 469.930914, 470.765236, 471.821972, 472.835489, 474.618237, 475.640147, 392.286042, 391.41163]
time = [5.460979, 5.461261, 5.461824, 5.462669, 5.463514, 5.464358, 5.464922, 5.466611, 5.476275, 5.47712, 5.477965, 5.47881, 5.480218, 5.481062, 5.554368, 5.554931, 5.555213, 5.555776, 5.556621, 5.557466, 5.55831, 5.559155, 5.462362, 5.463206, 5.464051, 5.564736, 5.565299, 5.565581, 5.566144, 5.566989, 5.567834, 5.568678, 5.570086, 5.570931, 5.490368, 5.49065]
difference = [8.70957, 1.73993, 3.36262, 2.37831, 1.8221, 1.1552, 1.02758, 1.04814, 7.61717, 2.66319, 1.99305, 1.88325, 1.00294, 1.13622, 5.64005, 2.13311, 1.48505, 2.34133, 2.15143, 2.2784, 1.36814, 1.0139, 5.80165, 2.51963, 1.67136, 6.58783, 1.40849, 1.99111, 1.95164, 2.50574, 1.91457, 1.24077, 1.05416, 1.03806, 8.13389, 3.2514]
df = pd.DataFrame({'index': index_col, 'depth': depth, 'load': load, 'time': time, 'difference': difference})
sum_diff = []
start = 0
for i in range(len(df)):
    if i == len(df) - 1:
        # Last row: close the final event
        end = i+1
        sum_diff.append(sum(df['difference'][start:end]))
    else:
        # A gap in the index marks the end of the current event
        if df['index'][i] + 1 != df['index'][i + 1]:
            end = i+1
            sum_diff.append(sum(df['difference'][start:end]))
            start = end
print(sum_diff)
Output:
[21.24345, 16.295820000000003, 18.411409999999997, 9.99264, 19.69237, 11.38529]
I checked if the calculation is correct by doing this manually:
print(sum(df['difference'][0:8]))
print(sum(df['difference'][8:14]))
print(sum(df['difference'][14:22]))
print(sum(df['difference'][22:25]))
print(sum(df['difference'][25:34]))
print(sum(df['difference'][34:36]))
and yes, I got the same output:
21.24345
16.295820000000003
18.411409999999997
9.99264
19.69237
11.38529
An elegant solution is to group the dataframe by the jumps in index_col and build a dict of the groups, which keeps the summing flexible. The per-event sums are then collected into a result dataframe.
You can do as follows:
# `data` is assumed to hold the question's table, with columns 'index_col' and 'difference'
df = pd.DataFrame(data)
# A new event starts wherever index_col jumps by more than 1
grouped_dict = dict(tuple(df.groupby(df['index_col'].diff().gt(1).cumsum())))
# DataFrame.append was removed in pandas 2.0, so collect the rows in a list first
rows = [{'event_no': index + 1, 'sum': group['difference'].sum()}
        for index, group in grouped_dict.items()]
result = pd.DataFrame(rows, columns=['event_no', 'sum'])
And this will give you exactly what you want:
event_no sum
0 1 21.243450
1 2 16.295820
2 3 18.411410
3 4 9.992640
4 5 19.692372
5 6 11.385290
What does df.groupby(df['index_col'].diff().gt(1).cumsum()) do?
diff() calculates the difference between consecutive values in df['index_col']. gt(1) returns whether each of those differences is greater than 1. cumsum() then sums these boolean results cumulatively. For rows 0-7, gt(1) is False, so cumsum is 0 for each of them. At row 8 it is True, so cumsum becomes 1 and stays 1 for the following consecutive rows, as they return False for gt(1).
The calculation continues in the same way for the remaining segments. So df.groupby() receives group labels from 0 to 5 as follows:
0
0
0
0
0
0
0
0
1
1
1
1
1
1
2
2
2
2
2
2
2
2
3
3
3
4
4
4
4
4
4
4
4
4
5
5
Hence the group-by is done based on these six labels (0 through 5) for your given input.
Hope that's clear now!
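To see each step on its own, here is a small sketch on a made-up index column (the values are not taken from the question, and the variable names are only illustrative):

```python
import pandas as pd

# Hypothetical index column with gaps after 3 and after 8
index_col = pd.Series([1, 2, 3, 7, 8, 12])
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

steps = index_col.diff()          # NaN, 1, 1, 4, 1, 4
is_new_event = steps.gt(1)        # NaN compares as False, so the first row stays in event 0
event_id = is_new_event.cumsum()  # running count of gaps seen so far

# Sum the values separately for each event label
sums = values.groupby(event_id).sum()
print(event_id.tolist())  # [0, 0, 0, 1, 1, 2]
print(sums.tolist())      # [6.0, 9.0, 6.0]
```

The same groupby(...).sum() call would give the per-event sums directly, without building the intermediate dict.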
I want to convert a JSON object to a DataFrame.
This is my JSON object:
data = {'situation': {'OpenPlay': {'shots': 282,
'goals': 33,
'xG': 36.38206055667251,
'against': {'shots': 276, 'goals': 29, 'xG': 33.0840025995858}},
'FromCorner': {'shots': 46,
'goals': 2,
'xG': 2.861613758839667,
'against': {'shots': 46, 'goals': 4, 'xG': 3.420699148438871}},
'DirectFreekick': {'shots': 19,
'goals': 1,
'xG': 1.0674087516963482,
'against': {'shots': 10, 'goals': 0, 'xG': 0.6329493299126625}},
'SetPiece': {'shots': 14,
'goals': 1,
'xG': 0.6052199145779014,
'against': {'shots': 21, 'goals': 1, 'xG': 2.118571280501783}},
'Penalty': {'shots': 6,
'goals': 6,
'xG': 4.5670130252838135,
'against': {'shots': 2, 'goals': 1, 'xG': 1.5222634673118591}}}
Want output:
My code:
df = pd.json_normalize(data['situation']['OpenPlay'])
for i in range(1,4):
    df = df.append(pd.json_normalize(data['situation'][type_of_play[i]]))
df = df.reset_index()
Is there a more efficient way of doing this?
First of all, your data is missing a closing '}' at the end.
Try this code:
obj = [pd.json_normalize(data['situation'][e]) for e in data['situation']]
pd.concat(obj, ignore_index=True)
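If you also want to keep the situation names, one possible variant is to turn the dict into a list of records first and let json_normalize flatten the nested against dict in a single call. This is only a sketch on a cut-down copy of the data (two situations, no xG); the names situation, records and flat are mine.

```python
import pandas as pd

# Cut-down copy of data['situation'] from the question
situation = {
    'OpenPlay': {'shots': 282, 'goals': 33,
                 'against': {'shots': 276, 'goals': 29}},
    'Penalty': {'shots': 6, 'goals': 6,
                'against': {'shots': 2, 'goals': 1}},
}

# One record per situation, with the dict key stored as a regular field
records = [{'situation': name, **stats} for name, stats in situation.items()]

# sep='_' turns the nested keys into against_shots, against_goals
flat = pd.json_normalize(records, sep='_')
print(flat)
```

This avoids building one small dataframe per situation and concatenating them.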
You can load the data into a dataframe as usual, then run json_normalize on the column that contains the remaining dicts, and join the result with the main dataframe:
df = pd.DataFrame(data['situation']).T.reset_index()
# rsuffix labels the columns that come from the nested 'against' dicts
df = df.join(pd.json_normalize(df.against), rsuffix='_against', how='left').drop(columns=['against'])
Result:
   index           shots  goals  xG       shots_against  goals_against  xG_against
0  OpenPlay        282    33     36.3821  276            29             33.084
1  FromCorner      46     2      2.86161  46             4              3.4207
2  DirectFreekick  19     1      1.06741  10             0              0.632949
3  SetPiece        14     1      0.60522  21             1              2.11857
4  Penalty         6      6      4.56701  2              1              1.52226
For efficiency, it is best to process the data outside Pandas, within the dictionary, and then create the dataframe.
You could use jmespath to extract the data before passing it into pandas. It should be more efficient, but you could run tests to check the speed.
The basic idea with jmespath: to access a key, use a dot (.); to access an array/list, use []:
import jmespath
expression = """{shots: *.*.shots[],
goals: *.*.goals[],
xG : *.*.xG[],
against_shots: *.*.against.shots[],
against_goals: *.*.against.goals[],
against_XG: *.*.against.xG[]
}"""
expression = jmespath.compile(expression)
expression = expression.search(data)
#dataframe
pd.DataFrame(expression)
shots goals xG against_shots against_goals against_XG
0 282 33 36.382061 276 29 33.084003
1 46 2 2.861614 46 4 3.420699
2 19 1 1.067409 10 0 0.632949
3 14 1 0.605220 21 1 2.118571
4 6 6 4.567013 2 1 1.522263
jmespath can be convenient, especially as the nesting in the dict/json becomes more convoluted; most efficient however, would be to use the dictionary data structure directly:
from collections import defaultdict
df = defaultdict(list)
for key, value in data['situation'].items():
    df['shots'].append(value['shots'])
    df['goals'].append(value['goals'])
    df['xG'].append(value['xG'])
    df['against_shots'].append(value['against']['shots'])
    df['against_goals'].append(value['against']['goals'])
    df['against_xG'].append(value['against']['xG'])
# create dataframe
pd.DataFrame(df)
shots goals xG against_shots against_goals against_xG
0 282 33 36.382061 276 29 33.084003
1 46 2 2.861614 46 4 3.420699
2 19 1 1.067409 10 0 0.632949
3 14 1 0.605220 21 1 2.118571
4 6 6 4.567013 2 1 1.522263
I can't seem to get this right... here's what I'm trying to do:
import pandas as pd
df = pd.DataFrame({
'item_id': [1,1,3,3,3],
'contributor_id': [1,2,1,4,5],
'contributor_role': ['sing', 'laugh', 'laugh', 'sing', 'sing'],
'metric_1': [80, 90, 100, 92, 50],
'metric_2': [180, 190, 200, 192, 150]
})
--->
item_id contributor_id contributor_role metric_1 metric_2
0 1 1 sing 80 180
1 1 2 laugh 90 190
2 3 1 laugh 100 200
3 3 4 sing 92 192
4 3 5 sing 50 150
And I want to reshape it into:
item_id SING_1_contributor_id SING_1_metric_1 SING_1_metric_2 SING_2_contributor_id SING_2_metric_1 SING_2_metric_2 ... LAUGH_1_contributor_id LAUGH_1_metric_1 LAUGH_1_metric_2 ... <LAUGH_2_...>
0 1 1 80 180 N/A N/A N/A ... 2 90 190 ... N/A..
1 3 4 92 192 5 50 150 ... 1 100 200 ... N/A..
Basically, for each item_id, I want to collect all relevant data into a single row. Each item could have multiple types of contributors, and there is a max for each type (e.g. max SING contributor = A per item, max LAUGH contributor = B per item). There are a set of metrics tied to each contributor (but for the same contributor, the values could be different across different items / contributor types).
I can probably achieve this through some seemingly inefficient methods (e.g. looping and matching then populating a template df), but I was wondering if there is a more efficient way to achieve this, potentially through cleverly specifying the index / values / columns in the pivot operation (or any other method..).
Thanks in advance for any suggestions!
EDIT:
Ended up adapting Ben's script below into the following:
df['role_count'] = df.groupby(['item_id', 'contributor_role']).cumcount().add(1).astype(str)
df['contributor_role'] = df.apply(lambda row: row['contributor_role'] + '_' + row['role_count'], axis=1)
df = df.set_index(['item_id','contributor_role']).unstack()
df.columns = ['_'.join(x) for x in df.columns.values]
You can create the additional key with cumcount then do unstack
df['newkey']=df.groupby('item_id').cumcount().add(1).astype(str)
df['contributor_id']=df['contributor_id'].astype(str)
s = df.set_index(['item_id','newkey']).unstack().sort_index(level=1,axis=1)
s.columns=s.columns.map('_'.join)
s
Out[38]:
contributor_id_1 contributor_role_1 ... metric_1_3 metric_2_3
item_id ...
1 1 sing ... NaN NaN
3 1 messaround ... 50.0 150.0
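To get column names in the role-first layout from the question (SING_1_..., LAUGH_1_...), the counter can be taken per item and role rather than per item alone. A minimal sketch, restricted to metric_1 to keep it short; the helper columns n and key are only illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'item_id': [1, 1, 3, 3, 3],
    'contributor_role': ['sing', 'laugh', 'laugh', 'sing', 'sing'],
    'metric_1': [80, 90, 100, 92, 50],
})

# Running number of each role within an item: sing -> 1, 2, ...
df['n'] = df.groupby(['item_id', 'contributor_role']).cumcount().add(1).astype(str)
# Combined key such as SING_1 or LAUGH_1
df['key'] = df['contributor_role'].str.upper() + '_' + df['n']

# One row per item, one column per (metric, key) pair
wide = df.set_index(['item_id', 'key'])[['metric_1']].unstack()
# Flatten the column MultiIndex to e.g. SING_1_metric_1
wide.columns = [f'{key}_{metric}' for metric, key in wide.columns]
print(wide)
```

Items that lack a given slot (here item 1 has no second sing contributor) end up with NaN in that column.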
I have a problem with dataframes in Python. I am trying to copy certain rows to a new dataframe but I can't figure it out.
There are 2 arrays:
pokemon_data
# HP Attack Defense Sp. Atk Sp. Def Speed
0 1 45 49 49 65 65 45
1 2 60 62 63 80 80 60
2 3 80 82 83 100 100 80
3 4 80 100 123 122 120 80
4 5 39 52 43 60 50 65
... ... ... ... ... ... ... ...
795 796 50 100 150 100 150 50
796 797 50 160 110 160 110 110
797 798 80 110 60 150 130 70
798 799 80 160 60 170 130 80
799 800 80 110 120 130 90 70
800 rows × 7 columns
combats_data
First_pokemon Second_pokemon Winner
0 266 298 1
1 702 701 1
2 191 668 1
3 237 683 1
4 151 231 0
... ... ... ...
49995 707 126 0
49996 589 664 0
49997 303 368 1
49998 109 89 0
49999 9 73 0
50000 rows × 3 columns
I created a third dataset with columns:
output1
HP0 Attack0 Defense0 Sp. Atk0 Sp. Def0 Speed0 HP1 Attack1 Defense1 Sp. Atk1 Sp. Def1 Speed1 Winner
What I'm trying to do is copy attributes from pokemon_data to output1 in the order given by combats_data.
HP0 and HP1 are respectively the HP of the first Pokemon and the HP of the second Pokemon.
I want to use that data in neural networks with TensorFlow to predict what Pokemon would win.
For this type of wrangling, you should first "melt" or "tidy" the combats_data so each ID has its own row, then do a "join" or "merge" of the two dataframes.
You didn't provide a minimal reproducible example, so here's mine:
import pandas as pd
df1 = pd.DataFrame({'id': [1,2,3,4,5],
'var1': [10,20,30,40,50],
'var2': [15,25,35,45,55]})
df2 = pd.DataFrame({'id1': [1,2],
'id2': [3,4],
'outcome': [1,4]})
df2tidy = pd.melt(df2, id_vars=['outcome'], value_vars=['id1', 'id2'],
var_name='name', value_name='id')
df2tidy
# outcome name id
# 0 1 id1 1
# 1 4 id1 2
# 2 1 id2 3
# 3 4 id2 4
output = pd.merge(df2tidy, df1, on='id')
output
# outcome name id var1 var2
# 0 1 id1 1 10 15
# 1 4 id1 2 20 25
# 2 1 id2 3 30 35
# 3 4 id2 4 40 45
You could then train some sort of classifier on outcome using this table.
(Btw, you should make outcome a 0 or 1 (for pokemon1 vs pokemon2) instead of the actual ID of the winner.)
So I would like to create a new array based on these two arrays. For example:
#ids represent pokemons and their attributes
pokemons = pd.DataFrame({'id': [1,2,3,4,5],
'HP': [10,20,30,40,50],
'Attack': [15,25,35,45,55],
'Defense' : [25,15,45,15,35]})
#here 0 or 1 represents whether first or second pokemon won
combats = pd.DataFrame({'id1': [1,2],
'id2': [3,4],
'winner': [0,1]})
#in output data i want to replace ids with attributes, the order is based on combats array
output = pd.DataFrame({'HP1': [10,20],
'Attack1': [15,25],
'Defense1': [25,15],
'HP2': [30,40],
'Attack2': [35,45],
'Defense2': [45,15],
'winner': [0,1]})
I'm not sure if this is the right way to think about it. I want to train a neural network to figure out which Pokemon will win.
This is a solution from user part on the 4programmers.net forum.
import pandas as pd
if __name__ == "__main__":
    pokemon_data = pd.DataFrame({
        "Id": [1, 2, 3, 4, 5],
        "HP": [45, 60, 80, 80, 39],
        "Attack": [49, 62, 82, 100, 52],
        "Defense": [49, 63, 83, 123, 43],
        "Sp. Atk": [65, 80, 100, 122, 60],
        "Sp. Def": [65, 80, 100, 120, 50],
        "Speed": [45, 60, 80, 80, 65]})
    combats_data = pd.DataFrame({
        "First_pokemon": [1, 2, 3],
        "Second_pokemon": [2, 3, 4],
        "Winner": [1, 0, 1]})
    # Attach the first Pokemon's attributes to each combat
    output = pokemon_data.merge(combats_data, left_on="Id", right_on="First_pokemon")
    # Attach the second Pokemon's attributes; suffixes keep both sets apart
    output = output.merge(pokemon_data, left_on="Second_pokemon", right_on="Id",
                          suffixes=("_pokemon1", "_pokemon2"))
    print(output)
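A variant of the same double merge, sketched on the small pokemons/combats example from the follow-up above: renaming the attribute columns with add_suffix before merging gives the HP1 ... Defense2 layout directly. The names first and second are only illustrative, and how='left' is my choice so that the row order of combats is kept.

```python
import pandas as pd

pokemons = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                         'HP': [10, 20, 30, 40, 50],
                         'Attack': [15, 25, 35, 45, 55],
                         'Defense': [25, 15, 45, 15, 35]})
# 0 or 1 says whether the first or the second pokemon won
combats = pd.DataFrame({'id1': [1, 2],
                        'id2': [3, 4],
                        'winner': [0, 1]})

# Two renamed copies of the attribute table: id1/HP1/... and id2/HP2/...
first = pokemons.add_suffix('1')
second = pokemons.add_suffix('2')

# Left merges keep one row per combat, in the order of combats
output = (combats.merge(first, on='id1', how='left')
                 .merge(second, on='id2', how='left')
                 .drop(columns=['id1', 'id2']))

# Put the columns in the order asked for in the question
output = output[['HP1', 'Attack1', 'Defense1',
                 'HP2', 'Attack2', 'Defense2', 'winner']]
print(output)
```

With the real data, the id columns would be First_pokemon and Second_pokemon, and the attribute table would need its # column renamed accordingly.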