Selectively remove deprecated rows in a pandas dataframe - python
I have a DataFrame containing data that looks like this:
p,g,a,s,v
15,196,1399,16,5
15,196,948,5,1
15,196,1894,5,1
15,196,1616,5,1
15,196,1742,3,1
15,196,1742,4,4
15,196,1742,5,1
15,195,732,9,2
15,195,1765,11,7
15,196,1815,9,1
15,196,1399,11,8
15,196,1958,0,1
15,195,767,9,1
15,195,1765,11,8
15,195,886,9,1
15,195,1765,11,9
15,196,1958,5,1
15,196,1697,1,1
15,196,1697,4,1
Given multiple entries that have the same p, g, a, and s, I need to drop all but the one with the highest v. The reason is that the original source of this data is a kind of event log, and each line corresponds to a "new total". If it matters, the source data is ordered by time and includes a timestamp index, which I removed for brevity. The entry with the latest date would be the same as the entry with the highest v, as v only increases.
Pulling an example out of the above data, given this:
p,g,a,s,v
15,195,1765,11,7
15,195,1765,11,8
15,195,1765,11,9
I need to drop the first two rows and keep the last one.
If I understand correctly, I think you want the following: group by the columns of interest, take the max of column 'v' for each group, and then call reset_index:
In [103]:
df.groupby(['p', 'g', 'a', 's'])['v'].max().reset_index()
Out[103]:
p g a s v
0 15 195 732 9 2
1 15 195 767 9 1
2 15 195 886 9 1
3 15 195 1765 11 9
4 15 196 948 5 1
5 15 196 1399 11 8
6 15 196 1399 16 5
7 15 196 1616 5 1
8 15 196 1697 1 1
9 15 196 1697 4 1
10 15 196 1742 3 1
11 15 196 1742 4 4
12 15 196 1742 5 1
13 15 196 1815 9 1
14 15 196 1894 5 1
15 15 196 1958 0 1
16 15 196 1958 5 1
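
If you also need to keep other columns from the winning row (the timestamp index, for instance), an alternative sketch is to sort by v and keep the last row of each (p, g, a, s) group. The toy frame below rebuilds only a few of the rows from the question, purely for illustration:

import io
import pandas as pd

csv = """p,g,a,s,v
15,196,1399,16,5
15,195,1765,11,7
15,195,1765,11,8
15,195,1765,11,9"""
df = pd.read_csv(io.StringIO(csv))

# Keep only the row with the highest v per (p, g, a, s) group,
# then restore the original row order
deduped = (df.sort_values('v')
             .drop_duplicates(subset=['p', 'g', 'a', 's'], keep='last')
             .sort_index())

Because whole rows are kept rather than aggregated, any extra columns survive untouched.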
Related
How to group and sum rows by ID and subtract from group of rows with same ID? [python]
I have the following dataframe:

    ID_A  ID_B  ID_C  Value  ID_B_Value_Sum
-----------------------------------------------
0     22     5     1     54             208
1     23     5     2     34             208
2     24     6     1     44             268
3     25     6     1     64             268
4     26     5     2     35             208
5     27     7     3     45             229
6     28     7     2     66             229
7     29     8     1     76             161
8     30     8     2     25             161
9     31     6     2     27             268
10    32     5     3     14             208
11    33     5     3     17             208
12    34     6     2     43             268
13    35     6     2     53             268
14    36     8     1     22             161
15    37     7     3     65             229
16    38     7     1     53             229
17    39     8     2     23             161
18    40     8     3     15             161
19    41     6     3     37             268
20    42     5     2     54             208

Each row contains a unique "ID_A", while different rows can have the same "ID_B" and "ID_C". Each row corresponds to its own "Value", and this "Value" can be the same between different rows. The "ID_B_Value_Sum" column contains the sum of all values from the "Value" column for all rows sharing the same "ID_B". Calculating this sum is straightforward with python and pandas.

What I want to do is, for each row, take the "ID_B_Value_Sum" value but subtract all values corresponding to rows with the same "ID_C", exclusive of the target row. For example, taking "ID_B" = 6, the sum of all the "Value" values in this "ID_B" = 6 group is 268, as shown in all corresponding rows of the "ID_B_Value_Sum" column. Now, two of the rows in this group contain "ID_C" = 1, three rows contain "ID_C" = 2, and one row contains "ID_C" = 3. Starting with row 2, which has "ID_C" = 1, this means taking its "ID_B_Value_Sum" value and subtracting the "Value" values of all other rows containing both "ID_B" = 6 and "ID_C" = 1, exclusive of the target row. So for row 2 I take 268 - 64 = 204. For another example, for row 4, this means 208 - 34 - 54 = 120. And for row 7, this means 161 - 22 = 139. These new values will go in a new "Value_Sum_New" column for each row. So I want to produce the following output dataframe:

    ID_A  ID_B  ID_C  Value  ID_B_Value_Sum  Value_Sum_New
---------------------------------------------------------------
0     22     5     1     54             208             XX
1     23     5     2     34             208             XX
2     24     6     1     44             268            204
3     25     6     1     64             268             XX
4     26     5     2     35             208            120
5     27     7     3     45             229             XX
6     28     7     2     66             229             XX
7     29     8     1     76             161            139
8     30     8     2     25             161             XX
9     31     6     2     27             268             XX
10    32     5     3     14             208             XX
11    33     5     3     17             208             XX
12    34     6     2     43             268             XX
13    35     6     2     53             268             XX
14    36     8     1     22             161             XX
15    37     7     3     65             229             XX
16    38     7     1     53             229             XX
17    39     8     2     23             161             XX
18    40     8     3     15             161             XX
19    41     6     3     37             268             XX
20    42     5     2     54             208             XX

What I am having trouble conceptualizing is how to, for each row, group together all rows with the same "ID_B", then sub-group the rows with the same "ID_C", subtract that sub-group's sum from the target row's "ID_B_Value_Sum" while still keeping the target row's own "Value", and so produce the final "Value_Sum_New". It seems like so many actions and sub-actions that I am confused about how to organize and order the workflow in a simple, streamlined way. How might I approach calculating this sum in python?
IIUC, you need:

df['Value_Sum_New'] = (df['ID_B_Value_Sum']
                       - df.groupby(['ID_B', 'ID_C'])['Value'].transform('sum')
                       + df['Value'])

Output:

    ID_A  ID_B  ID_C  Value  ID_B_Value_Sum  Value_Sum_New
0     22     5     1     54             208            208
1     23     5     2     34             208            119
2     24     6     1     44             268            204
3     25     6     1     64             268            224
4     26     5     2     35             208            120
5     27     7     3     45             229            164
6     28     7     2     66             229            229
7     29     8     1     76             161            139
8     30     8     2     25             161            138
9     31     6     2     27             268            172
10    32     5     3     14             208            191
11    33     5     3     17             208            194
12    34     6     2     43             268            188
13    35     6     2     53             268            198
14    36     8     1     22             161             85
15    37     7     3     65             229            184
16    38     7     1     53             229            229
17    39     8     2     23             161            136
18    40     8     3     15             161            161
19    41     6     3     37             268            268
20    42     5     2     54             208            139

Explanation:

As you said, computing a sum per group is easy in pandas. You can actually compute ID_B_Value_Sum with:

df['ID_B_Value_Sum'] = df.groupby('ID_B')['Value'].transform('sum')

Now we do the same for groups of ID_B + ID_C, subtract it from ID_B_Value_Sum, and, since we want to exclude only the other rows of the group, add back the row's own Value.
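
A minimal, self-contained sketch of the same idea on a small made-up frame (same column names, but hypothetical values chosen only for illustration), in case it helps to see the transform steps in isolation:

import pandas as pd

# Toy data, not the frame from the question
df = pd.DataFrame({
    'ID_B':  [5, 5, 6, 6, 5],
    'ID_C':  [1, 2, 1, 1, 2],
    'Value': [54, 34, 44, 64, 35],
})

# Sum per ID_B group, broadcast back to every row
df['ID_B_Value_Sum'] = df.groupby('ID_B')['Value'].transform('sum')

# Subtract the (ID_B, ID_C) group sum, then add the row's own Value back
df['Value_Sum_New'] = (df['ID_B_Value_Sum']
                       - df.groupby(['ID_B', 'ID_C'])['Value'].transform('sum')
                       + df['Value'])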
Pandas Dataframe Reshape/Alteration Question
I feel like this should be an easy solution, but it has eluded me a bit (long week). Say I have the following Pandas DataFrame (df):

day  x_count  x_max  y_count  y_max
  1        8    230       18    127
  1        6    174       12    121
  1        5    218       21    184
  1       11     91       32    162
  2       11    128       17    151
  2       13    156       16    148
  2       18    191       22    120

Etc. How can I collapse it down so that I have one row per day, with each of the other columns summed across all rows for that day? For example:

day  x_count  x_max  y_count  y_max
  1       40    713       93    594
  2       42    475       55    419

Is it best to reshape it or simply create a new one?
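
One way to do this (a sketch only, assuming the frame is called df and that every non-'day' column should simply be summed) is a groupby on day:

import pandas as pd

# Rebuild the sample rows shown above
df = pd.DataFrame({
    'day':     [1, 1, 1, 1, 2, 2, 2],
    'x_count': [8, 6, 5, 11, 11, 13, 18],
    'x_max':   [230, 174, 218, 91, 128, 156, 191],
    'y_count': [18, 12, 21, 32, 17, 16, 22],
    'y_max':   [127, 121, 184, 162, 151, 148, 120],
})

# One row per day, all other columns summed
collapsed = df.groupby('day', as_index=False).sum()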
Read 4 lines of data into one row of pandas data frame
I have a txt file with values like this:

108,612,620,900
168,960,680,1248
312,264,768,564
516,1332,888,1596

I need to read all of this into a single row of a data frame:

    0    1    2    3    4    5    6     7    8    9   10   11   12    13   14    15
0  108  612  620  900  168  960  680  1248  312  264  768  564  516  1332  888  1596

I have many such files and so I'll keep appending rows to this data frame. I believe we need some kind of regex but I'm not able to figure it out. For now this is what I have:

df = pd.read_csv(f, sep=",| ", header=None)

But this takes , and (space) as separators, whereas I want it to take the newline as a separator.
First, read the data:

df = pd.read_csv('test/t.txt', header=None)

It gives you a DataFrame shaped like the CSV. Then concatenate:

s = pd.concat((df.loc[i] for i in df.index), ignore_index=True)

It gives you a Series:

0      108
1      612
2      620
3      900
4      168
5      960
6      680
7     1248
8      312
9      264
10     768
11     564
12     516
13    1332
14     888
15    1596
dtype: int64

Finally, if you really want a horizontal DataFrame:

pd.DataFrame([s])

Gives you:

    0    1    2    3    4    5    6     7    8    9   10   11   12    13   14    15
0  108  612  620  900  168  960  680  1248  312  264  768  564  516  1332  888  1596

Since you've mentioned in a comment that you have many such files, you should simply store all the Series in a list, and construct a DataFrame with all of them at once when you're finished loading them all.
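
A minimal sketch of that last suggestion, assuming the files match a hypothetical pattern test/*.txt (adjust the glob to wherever your files actually live):

import glob
import pandas as pd

rows = []
for path in sorted(glob.glob('test/*.txt')):   # hypothetical location/pattern
    df = pd.read_csv(path, header=None)        # one small frame per file
    # flatten it row-wise into a single Series, as above
    rows.append(pd.concat((df.loc[i] for i in df.index), ignore_index=True))

result = pd.DataFrame(rows)                    # one row per file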
selecting a column from pandas pivot table
I have the below pivot table, which I created from a dataframe using the following code:

table = pd.pivot_table(df, values='count', index=['days'], columns=['movements'], aggfunc=np.sum)

movements 0 1 2 3 4 5 6 7
days
0   2777 51 2
1   6279 200 7 3
2   5609 110 32 4
3   4109 118 101 8 3
4   3034 129 109 6 2 2
5   2288 131 131 9 2 1
6   1918 139 109 13 1 1
7   1442 109 153 13 10 1
8   1085 76 111 13 7 1
9   845 81 86 8 8
10  646 70 83 1 2 1 1

As you can see, the pivot table has 8 columns, 0-7, and now I want to plot some specific columns instead of all of them. I could not manage to select columns. Let's say I want to plot column 0 and column 2 against the index. What should I use for y to select column 0 and column 2?

plt.plot(x=table.index, y=??)

I tried y = table.value['0', '2'] and y = table['0','2'] but nothing works.
You cannot select an ndarray for y. If you need those two column values in a single plot you can use:

plt.plot(table['0'])
plt.plot(table['2'])

If the column names are integers then:

plt.plot(table[0])
plt.plot(table[2])
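
As an alternative sketch (assuming the column labels really are the integers 0 and 2, as in the pivot table above), pandas' own plot method selects the columns and puts the 'days' index on the x-axis for you:

import matplotlib.pyplot as plt

ax = table[[0, 2]].plot()   # one line per selected column, index on the x-axis
plt.show()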
Issue with reindexing a multiindex
I am struggling to reindex a multiindex. Example code below:

rng = pd.date_range('01/01/2000 00:00', '31/12/2004 23:00', freq='H')
ts = pd.Series([h.dayofyear for h in rng], index=rng)
daygrouped = ts.groupby(lambda x: x.dayofyear)
daymean = daygrouped.mean()
myindex = np.arange(1, 367)
myindex = np.concatenate((myindex[183:], myindex[:183]))
daymean.reindex(myindex)

gives (as expected):

184    184
185    185
186    186
187    187
...
180    180
181    181
182    182
183    183
Length: 366, dtype: int64

BUT if I create a multiindex:

hourgrouped = ts.groupby([lambda x: x.dayofyear, lambda x: x.hour])
hourmean = hourgrouped.mean()
myindex = np.arange(1, 367)
myindex = np.concatenate((myindex[183:], myindex[:183]))
hourmean.reindex(myindex, level=1)

I get:

1    1     1
     2     1
     3     1
     4     1
...
366  20    366
     21    366
     22    366
     23    366
Length: 8418, dtype: int64

Any ideas on my mistake? Thanks. Bevan
First, you have to specify level=0 instead of 1 (as it is the first level -> zero-based indexing -> 0).

But there is still a problem: the reindexing works, but does not seem to preserve the order of the provided index in the case of a MultiIndex:

In [54]: hourmean.reindex([5,4], level=0)
Out[54]:
4  0     4
   1     4
   2     4
   3     4
   4     4
   ...
   20    4
   21    4
   22    4
   23    4
5  0     5
   1     5
   2     5
   3     5
   4     5
   ...
   20    5
   21    5
   22    5
   23    5
dtype: int64

So getting a new subset of the index works, but it comes back in the original order, not in the order of the newly provided index. This is possibly a bug with reindex on a certain level (I opened an issue to discuss this: https://github.com/pydata/pandas/issues/8241).

A solution for now is to create a full MultiIndex and reindex with that (so not on a specified level, but with the full index, which does preserve the order). Doing this is very easy with MultiIndex.from_product, as you already have myindex:

In [79]: myindex2 = pd.MultiIndex.from_product([myindex, range(24)])

In [82]: hourmean.reindex(myindex2)
Out[82]:
184  0     184
     1     184
     2     184
     3     184
     4     184
     5     184
     6     184
     7     184
     8     184
     9     184
     10    184
     11    184
     12    184
     13    184
     14    184
...
183  9     183
     10    183
     11    183
     12    183
     13    183
     14    183
     15    183
     16    183
     17    183
     18    183
     19    183
     20    183
     21    183
     22    183
     23    183
Length: 8784, dtype: int64
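
For completeness, a self-contained sketch of that workaround, condensing the question's setup and the answer's fix into one runnable piece (this reflects pandas behaviour at the time of the linked issue; newer versions may order level-based reindexes differently):

import numpy as np
import pandas as pd

rng = pd.date_range('01/01/2000 00:00', '31/12/2004 23:00', freq='H')
ts = pd.Series([h.dayofyear for h in rng], index=rng)

# Mean per (day of year, hour of day) -> MultiIndex Series
hourmean = ts.groupby([ts.index.dayofyear, ts.index.hour]).mean()

# Desired day order starting mid-year, expanded to the full (day, hour) MultiIndex
myindex = np.concatenate((np.arange(184, 367), np.arange(1, 184)))
myindex2 = pd.MultiIndex.from_product([myindex, range(24)])

reordered = hourmean.reindex(myindex2)   # full index, so the requested order is kept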