Aggregations over specific columns of a large dataframe, with named output - python
I am looking for a way to aggregate over a large dataframe, possibly using groupby. Each group would be based on either pre-specified columns or regex, and the aggregation should produce a named output.
This produces a sample dataframe:
import pandas as pd
import itertools
import numpy as np
col = "A,B,C".split(',')
col1 = "1,2,3,4,5,6,7,8,9".split(',')
col2 = "E,F,G".split(',')
all_dims = [col, col1, col2]
all_keys = ['.'.join(i) for i in itertools.product(*all_dims)]
rng = pd.date_range(end=pd.Timestamp.today().date(), periods=12, freq='M')
df = pd.DataFrame(np.random.randint(0, 1000, size=(len(rng), len(all_keys))), columns=all_keys, index=rng)
The above produces a dataframe with one year's worth of monthly data and 81 columns (3 × 9 × 3 combinations) with the following names:
['A.1.E', 'A.1.F', 'A.1.G', 'A.2.E', 'A.2.F', 'A.2.G', 'A.3.E', 'A.3.F',
'A.3.G', 'A.4.E', 'A.4.F', 'A.4.G', 'A.5.E', 'A.5.F', 'A.5.G', 'A.6.E',
'A.6.F', 'A.6.G', 'A.7.E', 'A.7.F', 'A.7.G', 'A.8.E', 'A.8.F', 'A.8.G',
'A.9.E', 'A.9.F', 'A.9.G', 'B.1.E', 'B.1.F', 'B.1.G', 'B.2.E', 'B.2.F',
'B.2.G', 'B.3.E', 'B.3.F', 'B.3.G', 'B.4.E', 'B.4.F', 'B.4.G', 'B.5.E',
'B.5.F', 'B.5.G', 'B.6.E', 'B.6.F', 'B.6.G', 'B.7.E', 'B.7.F', 'B.7.G',
'B.8.E', 'B.8.F', 'B.8.G', 'B.9.E', 'B.9.F', 'B.9.G', 'C.1.E', 'C.1.F',
'C.1.G', 'C.2.E', 'C.2.F', 'C.2.G', 'C.3.E', 'C.3.F', 'C.3.G', 'C.4.E',
'C.4.F', 'C.4.G', 'C.5.E', 'C.5.F', 'C.5.G', 'C.6.E', 'C.6.F', 'C.6.G',
'C.7.E', 'C.7.F', 'C.7.G', 'C.8.E', 'C.8.F', 'C.8.G', 'C.9.E', 'C.9.F',
'C.9.G']
What I would like now is to be able to aggregate over the dataframe, take certain column combinations and produce named outputs. For example, one rule might be that I take all 'A.*.E' columns (that have any number in the middle), sum them and produce a named output column called 'A.SUM.E'. And then do the same for 'A.*.F', 'A.*.G' and so on.
I have looked into pandas 0.25 named aggregation, which allows me to name my outputs, but I couldn't see how to simultaneously capture the right column combinations and produce the right output names.
If you need to reshape the dataframe to make a workable solution, that is fine as well.
Note, I am aware I could do something like this in a Python loop but I am looking for a pandas way to do it.
Not a groupby solution, and it uses a loop, but I think it's nonetheless rather elegant: first get a list of the unique (first, last) column-name combinations using a set, and then do the sums using filter:
# unique (first, last) parts of the column names, e.g. ('A', 'E')
cols = sorted({(c.split('.')[0], c.split('.')[-1]) for c in df.columns})
for c0, c1 in cols:
    # sum all columns with any number in the middle, e.g. A.<n>.E
    df[f'{c0}.SUM.{c1}'] = df.filter(regex=rf'{c0}\.\d+\.{c1}').sum(axis=1)
Result:
A.1.E A.1.F A.1.G A.2.E ... B.SUM.G C.SUM.E C.SUM.F C.SUM.G
2018-08-31 978 746 408 109 ... 4061 5413 4102 4908
2018-09-30 923 649 488 447 ... 5585 3634 3857 4228
2018-10-31 911 359 897 425 ... 5039 2961 5246 4126
2018-11-30 77 479 536 509 ... 4634 4325 2975 4249
2018-12-31 608 995 114 603 ... 5377 5277 4509 3499
2019-01-31 138 612 363 218 ... 4514 5088 4599 4835
2019-02-28 994 148 933 990 ... 3907 4310 3906 3552
2019-03-31 950 931 209 915 ... 4354 5877 4677 5557
2019-04-30 255 168 357 800 ... 5267 5200 3689 5001
2019-05-31 593 594 824 986 ... 4221 2108 4636 3606
2019-06-30 975 396 919 242 ... 3841 4787 4556 3141
2019-07-31 350 312 104 113 ... 4071 5073 4829 3717
If you want to have the result in a new DataFrame, just create an empty one and add the columns to it:
result = pd.DataFrame()
for c0, c1 in cols:
    result[f'{c0}.SUM.{c1}'] = df.filter(regex=rf'{c0}\.\d+\.{c1}').sum(axis=1)
Update: using a simple groupby (which is even simpler in this particular case):
def grouper(col):
    # 'A.1.E' -> 'A.SUM.E'
    c = col.split('.')
    return f'{c[0]}.SUM.{c[-1]}'

df.groupby(grouper, axis=1).sum()
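If reshaping is acceptable (as the question allows), another way to get the same result is to split the column names into a MultiIndex and sum over the outer and inner levels. This is only a sketch, not part of the original answer, and it also sidesteps groupby(..., axis=1), which newer pandas versions deprecate:

# Sketch: turn 'A.1.E' into a 3-level MultiIndex, then sum over levels 0 and 2.
tmp = df.copy()
tmp.columns = tmp.columns.str.split('.', expand=True)
summed = tmp.T.groupby(level=[0, 2]).sum().T
summed.columns = [f'{a}.SUM.{b}' for a, b in summed.columns]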
Related
Interpolate and match missing values between two dataframes of different dimensions
I'm new to pandas and Python in general. Currently I'm trying to interpolate and make the coordinates of two different dataframes match. The data comes from two different GeoTIFF files from the same source, one being temperature and the other being radiation. The files were converted to pandas with georasters. The radiation dataframe has more points and data; I want to upscale the temperature dataframe so it has the same coordinates as the radiation one.

Radiation dataframe:

   row   col  value        x        y
0  197  2427  5.755 -83.9325  17.5075
1  197  2428  5.755   -83.93  17.5075
2  197  2429  5.755 -83.9275  17.5075
3  197  2430  5.755  -83.925  17.5075
4  197  2431  5.755 -83.9225  17.5075

1850011 rows × 5 columns

Temperature dataframe:

   row  col  value        x        y
0   59  725   26.8 -83.9583  17.5083
1   59  726   26.8   -83.95  17.5083
2   59  727   26.8 -83.9417  17.5083
3   59  728   26.8 -83.9333  17.5083
4   59  729   26.8  -83.925  17.5083

167791 rows × 5 columns

Source of data: "Gis data - LTAym_AvgDailyTotals (GeoTIFF)" (temperature map and radiation (GHI) map images omitted).
To change the values in a column, you have to use iloc. Here I select the fourth column from the left (index 3, which is the same as column x), assign your values to it, and then print the result.

import pandas as pd

Radiation = {'row': ["197", "197", "197", "197", "197"],
             'col': ["2427", "2428", "2429", "2430", "2431"],
             'value': ['5.755', '5.755', '5.755', '5.755', '5.755'],
             'x': ['-83.9325', '-83.93', '-83.9275', '-83.925', '-83.9225'],
             'y': ['17.5075', '17.5075', '17.5075', '17.5075', '17.5075']}

Temperature = {'row': ["59", "59", "59", "59", "59"],
               'col': ["725", "726", "727", "728", "729"],
               'value': ["26.8", "26.8", "26.8", "26.8", "26.8"],
               'x': ["-83.9583", "-83.95", "-83.9417", "-83.9333", "-83.925"],
               'y': ["17.5083", "17.5083", "17.5083", "17.5083", "17.5083"]}

df1 = pd.DataFrame(Radiation)
df2 = pd.DataFrame(Temperature)

# overwrite 'x' (column index 3) of the last row in each frame
df1.iloc[4:, 3] = '1850011'
df2.iloc[4:, 3] = '167791'

Comparison = df1.compare(df2, keep_shape=True, keep_equal=True)
print(df1)
print(df2)
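The snippet above only compares the two small frames; it does not resample one grid onto the other. For the actual interpolation, one common approach (shown here only as a sketch; radiation_df and temperature_df are hypothetical names for the two full dataframes, with numeric x, y and value columns) is scipy.interpolate.griddata:

from scipy.interpolate import griddata

# Interpolate the coarser temperature grid onto the radiation coordinates.
temp_at_radiation_points = griddata(
    points=temperature_df[['x', 'y']].to_numpy(dtype=float),
    values=temperature_df['value'].to_numpy(dtype=float),
    xi=radiation_df[['x', 'y']].to_numpy(dtype=float),
    method='linear',  # 'nearest' avoids NaNs outside the convex hull
)
radiation_df['temperature'] = temp_at_radiation_points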
Create new Pandas.DataFrame with .groupby(...).agg(sum) then recover unsummed columns
I'm starting with a dataframe of baseball seasons, a section of which looks similar to this:

                    Name  Season   AB    H  SB  playerid
13047    A.J. Pierzynski    2013  503  137   1       746
6891     A.J. Pierzynski    2006  509  150   1       746
1374           Rod Carew    1977  616  239  23   1001942
1422         Stan Musial    1948  611  230   7   1009405
1507         Todd Helton    2000  580  216   5       432
1508   Nomar Garciaparra    2000  529  197   5       190
1509       Ichiro Suzuki    2004  704  262  36      1101

From these seasons, I want to create a dataframe of career stats; that is, one row for each player which is a sum of their AB, H, etc. This dataframe should still include the names of the players. The playerid in the above is a unique key for each player and should either be an index or an unchanged value in a column after creating the career stats dataframe.

My hypothetical starting point is

df_careers = df_seasons.groupby('playerid').agg(sum)

but this leaves out all the non-numeric data. With numeric_only=False I can get some sort of mess in the Name column like 'Ichiro SuzukiIchiro SuzukiIchiro Suzuki' from concatenation, but that just requires a bunch of cleaning.

This is something I'd like to be able to do with other data sets, and the actual data I have is more like 25 columns, so I'd rather understand a general routine for getting the Name data back or preserving it from the outset than write a specific function and use groupby('playerid').agg(func) (or a similar process) to do it, if possible. I'm guessing there's a fairly simple way to do this, but I only started learning pandas a week ago, so there are gaps in my knowledge.
You can write your own condition for how you want to include the non-summed columns:

col = df.columns.tolist()
col.remove('playerid')
df.groupby('playerid').agg({i: lambda x: x.iloc[0] if x.dtypes == 'object' else x.sum() for i in df.columns})

df:

                        Name  Season    AB    H  SB  playerid
playerid
190        Nomar_Garciaparra    2000   529  197   5       190
432              Todd_Helton    2000   580  216   5       432
746          A.J._Pierzynski    4019  1012  287   2      1492
1101           Ichiro_Suzuki    2004   704  262  36      1101
1001942            Rod_Carew    1977   616  239  23   1001942
1009405          Stan_Musial    1948   611  230   7   1009405
If there is a one-to-one relationship between 'playerid' and 'Name', as appears to be the case, you can just include 'Name' in the groupby columns:

stat_cols = ['AB', 'H', 'SB']
groupby_cols = ['playerid', 'Name']
results = df.groupby(groupby_cols)[stat_cols].sum()

Results:

                              AB    H  SB
playerid Name
190      Nomar Garciaparra   529  197   5
432      Todd Helton         580  216   5
746      A.J. Pierzynski    1012  287   2
1101     Ichiro Suzuki       704  262  36
1001942  Rod Carew           616  239  23
1009405  Stan Musial         611  230   7

If you'd prefer to group only by 'playerid' and add the 'Name' data back in afterwards, you can instead create a 'playerid' to 'Name' mapping as a dictionary and look it up using map:

results = df.groupby('playerid')[stat_cols].sum()
name_map = pd.Series(df.Name.to_numpy(), df.playerid).to_dict()
results['Name'] = results.index.map(name_map)

Results:

            AB    H  SB               Name
playerid
190        529  197   5  Nomar Garciaparra
432        580  216   5        Todd Helton
746       1012  287   2    A.J. Pierzynski
1101       704  262  36      Ichiro Suzuki
1001942    616  239  23          Rod Carew
1009405    611  230   7        Stan Musial
groupby.agg() can accept a dictionary that maps column names to functions, so one solution is to pass a dictionary to agg specifying which function to apply to each column. Using the sample data above, one might use:

mapping = {'AB': sum, 'H': sum, 'SB': sum, 'Season': max, 'Name': max}
df_1 = df.groupby('playerid').agg(mapping)

The choice of max for the columns that shouldn't be summed is arbitrary; you could define a lambda function to apply to a column if you want to handle it in a certain way. DataFrameGroupBy.agg can work with any function that will work with DataFrame.apply.

To expand this to larger data sets, you might use a dictionary comprehension. This would work well:

dictionary = {x: sum for x in df.columns}
dont_sum = {'Name': max, 'Season': max}
dictionary.update(dont_sum)
df_1 = df.groupby('playerid').agg(dictionary)
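On pandas 0.25 or later, named aggregation expresses the same idea a bit more explicitly. This is only a sketch using the sample columns above; 'first' keeps one Name per player instead of concatenating them:

df_careers = df.groupby('playerid').agg(
    Name=('Name', 'first'),
    AB=('AB', 'sum'),
    H=('H', 'sum'),
    SB=('SB', 'sum'),
).reset_index()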
Extract duplicates into new dataframe with Pandas
I have a large dataframe with many columns. One of these columns is what's supposed to be a unique ID and another is a year. Unfortunately, there are duplicates in the unique ID column. I know how to generate a list of all duplicates, but what I actually want to do is extract them out so that only the first entry (by year) remains.

For example, the dataframe currently looks something like this (with a bunch of other columns):

ID   Year
----------
123  1213
123  1314
123  1516
154  1415
154  1718
233  1314
233  1415
233  1516

And what I want to do is transform this dataframe into:

ID   Year
----------
123  1213
154  1415
233  1314

While storing just those duplicates in another dataframe:

ID   Year
----------
123  1314
123  1516
154  1718
233  1415
233  1516

I could drop duplicates by year to keep the oldest entry, but I am not sure how to get just the duplicates into a list that I can store as another dataframe. How would I do this?
Use duplicated:

In [187]: d = df.duplicated(subset=['ID'], keep='first')

In [188]: df[~d]
Out[188]:
    ID  Year
0  123  1213
3  154  1415
5  233  1314

In [189]: df[d]
Out[189]:
    ID  Year
1  123  1314
2  123  1516
4  154  1718
6  233  1415
7  233  1516
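One caveat, added here as a sketch rather than part of the original answer: duplicated keeps the first occurrence in row order, so if the frame is not already sorted by year, sort it first so that "first" really means the oldest entry:

df_sorted = df.sort_values(['ID', 'Year'])
d = df_sorted.duplicated(subset=['ID'], keep='first')
firsts = df_sorted[~d]  # earliest year per ID
dupes = df_sorted[d]    # everything else, for the second dataframe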
find value in column and based on it create a new dataframe in pandas
I have a variable in the following format: fg = 2017-20. It's a string. And I also have a dataframe:

   flag    №
2017-18  389
2017-19  390
2017-20  391
2017-21  392
2017-22  393
2017-23  394
...

I need to find this value (fg) in the column "flag" and select the corresponding value (in the example it would be 391) from the column "№". Then I want to create a new dataframe which also has a column "№", add this value to it, and iterate 53 times. The result should look like this:

№_new
391
392
393
394
395
...
442
443
444

It does not look difficult, but I cannot find anything suitable based on other questions. Can someone advise anything, please?
You need boolean indexing with loc for filtering, then convert the one-item Series to a scalar by converting it to a NumPy array with values and selecting the first value with [0]. Last, create the new DataFrame with numpy.arange:

fg = '2017-20'
val = df.loc[df['flag'] == fg, '№'].values[0]
print (val)
391

df1 = pd.DataFrame({'№_new': np.arange(val, val + 53)})
print (df1)
    №_new
0     391
1     392
2     393
3     394
4     395
5     396
6     397
7     398
8     399
9     400
10    401
11    402
..    ...
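If fg might not be present in the "flag" column, .values[0] raises an IndexError on the empty selection. A guarded variant (the check is an addition for illustration, not part of the original answer):

match = df.loc[df['flag'] == fg, '№']
if match.empty:
    raise ValueError(f'{fg!r} not found in column "flag"')
val = int(match.iloc[0])
df1 = pd.DataFrame({'№_new': np.arange(val, val + 53)})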
How to shift a column in Pandas DataFrame without losing value
I would like to shift a column in a Pandas DataFrame, but I haven't been able to find a method to do it without losing values. (This post is quite similar to How to shift a column in Pandas DataFrame, but the accepted answer doesn't give the desired output and I can't comment on it.) Does anyone know how to do it?

##    x1   x2
##0  206  214
##1  226  234
##2  245  253
##3  265  272
##4  283  291

Desired output:

##    x1   x2
##0  206  nan
##1  226  214
##2  245  234
##3  265  253
##4  283  272
##5  nan  291
Use loc to add a new blank row to the DataFrame, then perform the shift:

df.loc[max(df.index) + 1, :] = None
df.x2 = df.x2.shift(1)

The code above assumes that your index is integer-based, which is the pandas default. If you're using a non-integer-based index, replace max(df.index)+1 with whatever you want the new last index to be.
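A variant that does not rely on an integer index is to append the blank row with concat instead (a sketch, assuming the same two-column frame as above):

import numpy as np
import pandas as pd

df = pd.concat([df, pd.DataFrame({'x1': [np.nan], 'x2': [np.nan]})], ignore_index=True)
df['x2'] = df['x2'].shift(1)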