I have a DataFrame in which I have already defined which rows are to be summed up, with the results stored in a new row.
For example in Year 1990:
Category    A    B    C    D  Year
E         147   78  476  531  1990
F         914  356  337  781  1990
G         117  874   15   69  1990
H          45  682  247   65  1990
I          20  255  465   19  1990
Here, the rows G and H should be summed up and the results stored in a new row. The same categories repeat every year from 1990 to 2019.
I have already tried it with .iloc, e.g. [4:8], [50:54], [96:100] and so on, but with iloc I cannot specify multiple indices, and I can't manage to write a loop over the individual years.
Is there a way to sum the values in categories G-H for each year (1990-2019)?
I'm not sure what you mean by "multiple index". A MultiIndex usually appears after a groupby/aggregation; your table just looks like an ordinary DataFrame with several columns.
So, if I understand correctly, here is a complete example showing how to combine multiple conditions on a DataFrame:
import io
import pandas as pd

data = """Category A B C D Year
E 147 78 476 531 1990
F 914 356 337 781 1990
G 117 874 15 69 1990
H 45 682 247 65 1990
I 20 255 465 19 1990"""

# The sample here is whitespace-separated; use delimiter="\t" for real tab-separated data.
table = pd.read_csv(io.StringIO(data), sep=r"\s+")

years = table["Year"].unique()
for year in years:
    # Keep only rows where Category is G or H and Year matches.
    row = table[((table["Category"] == "G") | (table["Category"] == "H")) & (table["Year"] == year)]
    row = row[["A", "B", "C", "D"]].sum()
    row["Category"], row["Year"] = "sum", year
    # DataFrame.append was removed in pandas 2.0; concatenate the summed row instead.
    table = pd.concat([table, row.to_frame().T], ignore_index=True)
If you are only interested in G/H, you can slice with isin combined with boolean indexing, then sum:
df[df['Category'].isin(['G', 'H'])].sum()
output:
Category      GH
A            162
B           1556
C            262
D            134
Year        3980
dtype: object
NB: note the side effect of sum here, which concatenates the two "G"/"H" strings into one "GH".
Or, better, set Category as index and slice with loc:
df.set_index('Category').loc[['G', 'H']].sum()
output:
A        162
B       1556
C        262
D        134
Year    3980
dtype: int64
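If you want the G/H sum for every year (1990-2019) at once rather than one year at a time, here is a minimal sketch, assuming the full frame is named df: filter the two categories, then group by year:
gh = df[df['Category'].isin(['G', 'H'])]
# One summed row per year for the numeric columns.
sums = gh.groupby('Year')[['A', 'B', 'C', 'D']].sum().reset_index()
sums['Category'] = 'G+H'
# Optionally append the per-year sums back onto the original frame:
df = pd.concat([df, sums], ignore_index=True)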
I am looking for a way to aggregate over a large dataframe, possibly using groupby. Each group would be based on either pre-specified columns or regex, and the aggregation should produce a named output.
This produces a sample dataframe:
import itertools

import numpy as np
import pandas as pd

# Build all 'X.n.Y' column names from the cartesian product of the three parts.
col = "A,B,C".split(',')
col1 = "1,2,3,4,5,6,7,8,9".split(',')
col2 = "E,F,G".split(',')
all_dims = [col, col1, col2]
all_keys = ['.'.join(i) for i in itertools.product(*all_dims)]

# Twelve month-end dates ending today ('M' is spelled 'ME' in pandas >= 2.2).
rng = pd.date_range(end=pd.Timestamp.today().date(), periods=12, freq='M')
df = pd.DataFrame(np.random.randint(0, 1000, size=(len(rng), len(all_keys))), columns=all_keys, index=rng)
The above produces a dataframe with one year's worth of monthly data and 81 columns with the following names:
['A.1.E', 'A.1.F', 'A.1.G', 'A.2.E', 'A.2.F', 'A.2.G', 'A.3.E', 'A.3.F',
'A.3.G', 'A.4.E', 'A.4.F', 'A.4.G', 'A.5.E', 'A.5.F', 'A.5.G', 'A.6.E',
'A.6.F', 'A.6.G', 'A.7.E', 'A.7.F', 'A.7.G', 'A.8.E', 'A.8.F', 'A.8.G',
'A.9.E', 'A.9.F', 'A.9.G', 'B.1.E', 'B.1.F', 'B.1.G', 'B.2.E', 'B.2.F',
'B.2.G', 'B.3.E', 'B.3.F', 'B.3.G', 'B.4.E', 'B.4.F', 'B.4.G', 'B.5.E',
'B.5.F', 'B.5.G', 'B.6.E', 'B.6.F', 'B.6.G', 'B.7.E', 'B.7.F', 'B.7.G',
'B.8.E', 'B.8.F', 'B.8.G', 'B.9.E', 'B.9.F', 'B.9.G', 'C.1.E', 'C.1.F',
'C.1.G', 'C.2.E', 'C.2.F', 'C.2.G', 'C.3.E', 'C.3.F', 'C.3.G', 'C.4.E',
'C.4.F', 'C.4.G', 'C.5.E', 'C.5.F', 'C.5.G', 'C.6.E', 'C.6.F', 'C.6.G',
'C.7.E', 'C.7.F', 'C.7.G', 'C.8.E', 'C.8.F', 'C.8.G', 'C.9.E', 'C.9.F',
'C.9.G']
What I would like now is to be able to aggregate over the dataframe, taking certain column combinations and producing named outputs. For example, one rule might be: take all 'A.*.E' columns (any number in the middle), sum them, and produce a named output column called 'A.SUM.E'. Then do the same for 'A.*.F', 'A.*.G' and so on.
I have looked into pandas 0.25 named aggregation, which allows me to name my outputs, but I couldn't see how to simultaneously capture the right column combinations and produce the right output names.
If you need to reshape the dataframe to make a workable solution, that is fine as well.
Note, I am aware I could do something like this in a Python loop but I am looking for a pandas way to do it.
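For reference, the named-aggregation syntax I mean looks like this (a generic sketch with placeholder names, not tied to the frame above):
# The keyword on the left becomes the output column name.
out = some_df.groupby('key').agg(total=('value', 'sum'))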
Not a groupby solution, and it uses a loop, but I think it's nonetheless rather elegant: first get a list of unique (first part, last part) column combinations using a set, and then do the sums using filter:
# Unique (first, last) combinations of the dot-separated column name parts.
cols = sorted(set((c.split('.')[0], c.split('.')[-1]) for c in df.columns))

for c0, c1 in cols:
    # Raw string avoids invalid-escape warnings for \. and \d in the regex.
    df[f'{c0}.SUM.{c1}'] = df.filter(regex=rf'{c0}\.\d+\.{c1}').sum(axis=1)
Result:
A.1.E A.1.F A.1.G A.2.E ... B.SUM.G C.SUM.E C.SUM.F C.SUM.G
2018-08-31 978 746 408 109 ... 4061 5413 4102 4908
2018-09-30 923 649 488 447 ... 5585 3634 3857 4228
2018-10-31 911 359 897 425 ... 5039 2961 5246 4126
2018-11-30 77 479 536 509 ... 4634 4325 2975 4249
2018-12-31 608 995 114 603 ... 5377 5277 4509 3499
2019-01-31 138 612 363 218 ... 4514 5088 4599 4835
2019-02-28 994 148 933 990 ... 3907 4310 3906 3552
2019-03-31 950 931 209 915 ... 4354 5877 4677 5557
2019-04-30 255 168 357 800 ... 5267 5200 3689 5001
2019-05-31 593 594 824 986 ... 4221 2108 4636 3606
2019-06-30 975 396 919 242 ... 3841 4787 4556 3141
2019-07-31 350 312 104 113 ... 4071 5073 4829 3717
If you want to have the result in a new DataFrame, just create an empty one and add the columns to it:
result = pd.DataFrame()
for c0, c1 in cols:
    result[f'{c0}.SUM.{c1}'] = df.filter(regex=rf'{c0}\.\d+\.{c1}').sum(axis=1)
Update: using a simple groupby (which is even simpler in this particular case):
def grouper(col):
    # Map e.g. 'A.3.E' to the group label 'A.SUM.E'.
    c = col.split('.')
    return f'{c[0]}.SUM.{c[-1]}'

df.groupby(grouper, axis=1).sum()
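Note that groupby(..., axis=1) is deprecated in pandas 2.1+; an equivalent sketch transposes, groups the former column labels, and transposes back:
# Same result without the deprecated axis=1 argument:
result = df.T.groupby(grouper).sum().T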
I have a large data frame with many columns. One of these columns is what's supposed to be a unique ID and the other is a year. Unfortunately, there are duplicates in the unique ID column.
I know how to generate a list of all duplicates, but what I actually want to do is extract them so that only the first entry (by year) remains. For example, the dataframe currently looks something like this (with a bunch of other columns):
ID Year
----------
123 1213
123 1314
123 1516
154 1415
154 1718
233 1314
233 1415
233 1516
And what I want to do is transform this dataframe into:
ID Year
----------
123 1213
154 1415
233 1314
While storing just those duplicates in another dataframe:
ID Year
----------
123 1314
123 1516
154 1718
233 1415
233 1516
I could drop duplicates by year to keep the oldest entry, but I am not sure how to get just the duplicates into a list that I can store as another dataframe.
How would I do this?
Use duplicated:
In [187]: d = df.duplicated(subset=['ID'], keep='first')

In [188]: df[~d]
Out[188]:
    ID  Year
0  123  1213
3  154  1415
5  233  1314

In [189]: df[d]
Out[189]:
    ID  Year
1  123  1314
2  123  1516
4  154  1718
6  233  1415
7  233  1516
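Note that duplicated(keep='first') keeps the first occurrence in the current row order. If the frame is not already sorted by year within each ID, sort first; a small sketch:
# Sort so 'first' really is the earliest year per ID, then split.
df = df.sort_values(['ID', 'Year'])
d = df.duplicated(subset=['ID'], keep='first')
firsts, dupes = df[~d], df[d]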
I have a variable in the following format: fg = '2017-20'. It's a string. I also have a dataframe:
   flag    №
2017-18  389
2017-19  390
2017-20  391
2017-21  392
2017-22  393
2017-23  394
...
I need to find this value (fg) in the column "flag" and select the corresponding value in the column "№" (in the example it will be 391). Then create a new dataframe, which will also have a column "№", add this value to it, and iterate 53 times. The result should look like this:
№_new
391
392
393
394
395
...
442
443
444
It does not look difficult, but I cannot find anything suitable among other questions. Can someone advise, please?
You need boolean indexing with loc for filtering, then convert the one-item Series to a scalar by converting it to a NumPy array with values and selecting the first element with [0].
Last, create the new DataFrame with numpy.arange.
import numpy as np
import pandas as pd

fg = '2017-20'
# Filter the matching row, then pull the scalar out of the one-item Series.
val = df.loc[df['flag'] == fg, '№'].values[0]
print(val)
391
df1 = pd.DataFrame({'№_new': np.arange(val, val + 53)})
print(df1)
    №_new
0     391
1     392
2     393
3     394
4     395
5     396
6     397
7     398
8     399
9     400
10    401
11    402
..
..
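If fg might be absent from the flag column, .values[0] raises an IndexError on the empty selection; a guarded variant (just a sketch) checks first:
match = df.loc[df['flag'] == fg, '№']
if match.empty:
    # No row carries this flag; fail loudly instead of with an IndexError.
    raise ValueError(f'flag {fg!r} not found')
val = match.iat[0]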
I have a data frame like below:
                                           i_id  q_id
month category_bucket
Aug   Algebra Tutoring                      187    64
      Balloon Artistry                      459   401
      Carpet Installation or Replacement    427   243
      Dance Lessons                         181    46
      Landscaping                           166    60
      Others                               9344  4987
      Tennis Instruction                    161    61
      Tree and Shrub Service                383   269
      Wedding Photography                   161    49
      Window Repair                         140    80
      Wiring                                439   206
July  Algebra Tutoring                      555   222
      Balloon Artistry                      229   202
      Carpet Installation or Replacement    140   106
      Dance Lessons                         354   115
      Landscaping                           511   243
      Others                               9019  4470
      Tennis Instruction                    613   324
      Tree and Shrub Service                130   100
      Wedding Photography                   425   191
      Window Repair                         444   282
      Wiring                                154    98
It's a multi-index data frame with month and category_bucket as the index and i_id, q_id as columns.
I got this by doing a groupby operation on a normal data frame, like below:
invites_combined.groupby(['month', 'category_bucket'])[["i_id","q_id"]].count()
I basically want a data frame with four columns, two each for i_id and q_id for both months, plus a column for category_bucket. So basically converting the above multi-index data frame to a single index so that I can access the values.
Currently it's difficult for me to access the values of i_id and q_id for a particular month and category value.
If you feel there is an easier way to access the i_id and q_id values for each category and month without having to convert to a single index, that is fine too.
A single index would be easier to loop over for each combination of month and category, though.
It seems you need reset_index to convert the MultiIndex to columns:
df = df.reset_index()
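If you literally want the two i_id/q_id columns per month side by side, with one row per category_bucket, a sketch using unstack on the grouped frame (before reset_index) could look like this:
# Move the 'month' index level into the columns, then flatten the
# resulting (column, month) MultiIndex into single-level names.
wide = df.unstack('month')
wide.columns = [f'{col}_{month}' for col, month in wide.columns]
wide = wide.reset_index()
# Direct access on the original MultiIndex also works without reshaping, e.g.:
# df.loc[('Aug', 'Landscaping'), 'i_id']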