Pandas: select by bigger than a value - python

My dataframe has a column called dir with several values, and I want to know how many of those values occur more than a certain number of times. For example:
df['dir'].value_counts().sort_index()
It returns a Series
0 855
20 881
40 2786
70 3777
90 3964
100 4
110 2115
130 3040
140 1
160 1697
180 1734
190 3
200 618
210 3
220 1451
250 895
270 2167
280 1
290 1643
300 1
310 1894
330 1
340 965
350 1
Name: dir, dtype: int64
Here, I want to know how many values have a count above 500. In this case, it's all of them except 100, 140, 190, 210, 280, 300, 330, 350.
How can I do that?
I can get away with df['dir'].value_counts()[df['dir'].value_counts() > 500]

(df['dir'].value_counts() > 500).sum()
This gets the value counts and compares them against the threshold, producing a Series of boolean values. The parentheses let you call .sum() on that whole comparison result, which counts each True as 1 and each False as 0.
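As a minimal, self-contained sketch of the same idea (with made-up data rather than the asker's real column):

```python
import pandas as pd

# Hypothetical 'dir' column: 0 appears three times, 40 twice, 20 once
df = pd.DataFrame({"dir": [0, 0, 0, 20, 40, 40]})

counts = df["dir"].value_counts()    # Series mapping value -> frequency
over_threshold = (counts > 2).sum()  # how many values occur more than twice
print(over_threshold)  # 1 (only 0 occurs more than twice)
```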


How to find an optimal solution for 2 teams playing against each other?

I am given a table for teams A and B where there is a number for each pair of players. The rows represent the players of team A and the columns the players of team B. If a number is positive, it means that the player from team A is better than the player from team B, and vice versa if it is negative.
For example:
-710 415 527 -641 175 48
-447 -799 253 626 304 895
509 -523 -758 -678 -689 92
24 -318 -61 -9 174 255
487 408 696 861 -394 -67
Both teams know this table.
Now, what happens is that team A reports 5 players first; team B can look at them and then choose the best 5 players in response.
To compare the teams, we sum up the numbers at the given positions in the table, knowing that each team has a captain who is counted twice (as if a team had 6 players with the captain appearing twice). If the sum is positive, team A is better.
The input are numbers a (the number of rows/players A) and b (columns/players B) and the table like this:
6
6
-54 -927 428 -510 911 93
-710 415 527 -641 175 48
-447 -799 253 626 304 895
509 -523 -758 -678 -689 92
24 -318 -61 -9 174 255
487 408 696 861 -394 -67
The output should be 1282.
So, what I did was that I put the numbers into a matrix like this:
a, b = int(input()), int(input())
matrix = [list(map(int,input().split())) for _ in range(a)]
I used a MinHeap and a MaxHeap for this. I put the rows into the MaxHeap, because team A wants the biggest sums; then I get the 5 best A players from it as follows:
for player, values in enumerate(matrix):
    maxheap.enqueue(sum(values), player)

playersA = []
overallA = 0
for i in range(5):
    ov, pl = maxheap.remove_max()
    if i == 0:  # it is a captain, counted twice
        playersA.append(pl)
        overallA += ov
    playersA.append(pl)
    overallA += ov
Team B, knowing the A players, then uses the MinHeap to find its best 5 players:
for i in range(b):
    player = []
    ov = 0
    for j in range(a):  # take out a column of the matrix
        player.append(matrix[j][i])
    for rival in playersA:  # counting only players already chosen by A
        ov += player[rival]
    minheap.enqueue(ov, i)

playersB = []
overallB = 0
for i in range(5):
    ov, pl = minheap.remove_min()
    if i == 0:  # the captain, counted twice
        playersB.append(pl)
        overallB += ov
    playersB.append(pl)
    overallB += ov
Having both teams' players, I then compute the sum from the matrix:
out = 0
for a in playersA:
    for b in playersB:
        out += matrix[a][b]
print(out)
However, this solution doesn't always give the right answer. For example, it does for the input:
10
10
-802 -781 826 997 -403 243 -533 -694 195 182
103 182 -14 130 953 -900 43 334 -724 716
-350 506 184 691 -785 742 -303 -682 186 -520
25 -815 475 -407 -78 509 -512 714 898 243
758 -743 -504 -160 855 -792 -177 747 188 -190
333 -439 529 795 -500 112 625 -2 -994 282
824 498 -899 158 453 644 117 598 432 310
-799 594 933 -15 47 -687 68 480 -933 -631
741 400 979 -52 -78 -744 -573 -170 882 -610
-376 -928 -324 658 -538 811 -724 848 344 -308
But it doesn't for
11
11
279 475 -894 -641 -716 687 253 -451 580 -727 -509
880 -778 -867 -527 816 -458 -136 -517 217 58 740
360 -841 492 -3 940 754 -584 715 -389 438 -887
-739 664 972 838 -974 -802 799 258 628 3 815
952 -404 -273 -323 -948 674 687 233 62 -339 352
285 -535 -812 -452 -335 -452 -799 -902 691 195 -837
-78 56 459 -178 631 -348 481 608 -131 -575 732
-212 -826 -547 440 -399 -994 486 -382 -509 483 -786
-94 -983 785 -8 445 -462 -138 804 749 890 -890
-184 872 -341 776 447 -573 405 462 -76 -69 906
-617 704 292 287 464 -711 354 428 444 -42 45
So the question is: can it be done like this, or is there another fast algorithm (O(n^2) / O(n^3) etc.), or do I just have to try all the possible combinations using brute force in O(n!) time complexity?
There is a way to do this with polynomial complexity.
To show you why your solution doesn't work, let's consider another, simpler problem. Let's say each team only chooses 2 players and there is no captain.
Let's also take a simple score matrix:
1 1 1 2 1
1 1 1 1 1
0 3 0 2 0
0 0 0 0 4
0 0 0 0 4
Here you can see that team A has no chance to win (as there are no negative numbers), but they are still going to try their best. Who should they pick?
Using your algorithm, team A should pick their best players and their ranking would be:
pa0 < pa1 = pa2 < pa3 = pa4
If they choose pa3 and pa4, who both have a score of 4 (which is bad, but not as bad as pa0's score of 6), team B will win by 8 (they will choose pb4 plus another player, who doesn't matter).
On the other hand, if team A chooses pa0 and pa1 (who are worse than pa3 and pa4 by your metric), the best team B can do is win by 5 (if they choose pb3 and any other player).
Basically, your approximation fails to take into consideration that team B can only choose two players and thus can't take advantage of the pa0+pa1 weakness while it can easily exploit pa3+pa4's one.
A better solution would be for team A to evaluate each player's score only by taking into account their 2 worst scores (or 5 if 5 players are to be selected): this would make the ranking as such:
pa2 < pa3 = pa4 < pa0 < pa1
Still it would be an approximation: some combinations like pa2+pa3 are actually not as bad as they sound as, once again, the weaknesses are spread enough that team B can't exploit them all (although for this very example the approximation yields the best result).
What we really need to pick is not the two best players, but the best combination of two players, and sadly I know of no way to do that other than trying all the $s!/(k!(s-k)!)$ combinations of k players among s (the size of the team). It is not so bad, though: for k=2 that's only $s(s-1)/2$, and for k=5 that's $s(s-1)(s-2)(s-3)(s-4)/5!$, which is still polynomial despite being in $O(s^5)$. Adding a captain to the mix only multiplies the number of combinations by k. It also requires a twist in how the score is calculated, but you should be able to find that.
Now that team A has selected its players, team B has the easy job of selecting theirs. This is much simpler, as here each player can be chosen individually.
Here is an example of how this last algorithm works with the score matrix provided at the beginning.
team A has 10 possible combinations: pa0+pa1, pa0+pa2, pa0+pa3, pa0+pa4, pa1+pa2, pa1+pa3, pa1+pa4, pa2+pa3, pa2+pa4, pa3+pa4. Their respective scores are: 5, 8, 7, 7, 7, 6, 6, 7, 7, 8.
The best combination is pa0+pa1, so that's what they send to team B.
Team B calculates each of its players' scores against pa0+pa1: pb0:2, pb1:2, pb2:2, pb3:3, pb4:2. pb3 is the best and all the others are equal, so team B sends pb3+pb4 (for example), and the "answer" is 5.
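The exhaustive enumeration described here can be sketched as follows, assuming the simplified setting of this answer (2 players per team, no captain) and its small score matrix; `pair_score` is a helper name I made up:

```python
from itertools import combinations

# Score matrix from the simplified example above
matrix = [
    [1, 1, 1, 2, 1],
    [1, 1, 1, 1, 1],
    [0, 3, 0, 2, 0],
    [0, 0, 0, 0, 4],
    [0, 0, 0, 0, 4],
]

def pair_score(rows):
    """Score of a team-A pair: team B answers with its two best columns."""
    cols = [sum(matrix[r][c] for r in rows) for c in range(len(matrix[0]))]
    return sum(sorted(cols, reverse=True)[:2])

# Team A minimizes over all C(5, 2) = 10 pairs
best = min(combinations(range(len(matrix)), 2), key=pair_score)
print(best, pair_score(best))  # (0, 1) 5
```

The same structure extends to k=5 by changing the combination size, and to the captain rule by additionally enumerating which chosen player is the captain and weighting that row twice.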

Filtering a labeled image by particle area

I have a labeled image of detected particles and a dataframe with the corresponding area of each labeled particle. What I want to do is filter out every particle on the image with an area smaller than a specified value.
I got it working with the example below, but I know there must be a smarter and, especially, faster way; for example, skipping the loop by comparing the image against the whole label array at once.
Thanks for your help!
Example:
labels = df["label"][df.area > 5000].to_numpy()
mask = np.zeros(labeled_image.shape)
for label in labels:
    mask[labeled_image == label] = 1
Dataframe:
label centroid-0 centroid-1 area
0 1 15 3681 191
1 2 13 1345 390
2 3 43 3746 885
3 4 32 3616 817
4 5 20 4250 137
... ... ... ...
3827 3828 4149 1620 130
3828 3829 4151 852 62
3829 3830 4155 330 236
3830 3831 4157 530 377
3831 3832 4159 3975 81
You can use isin to check equality to several labels. The resulting boolean array can be directly used as the mask after casting to the required type (e.g. int):
labels = df.loc[df.area.gt(5000), 'label']
mask = np.isin(labeled_image, labels).astype(int)
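A self-contained toy version of this (the labeled image and the area values are made up):

```python
import numpy as np
import pandas as pd

# Toy labeled image: 0 is background, 1-3 are particle labels
labeled_image = np.array([[1, 1, 2],
                          [3, 3, 3],
                          [0, 2, 2]])
df = pd.DataFrame({"label": [1, 2, 3], "area": [200, 6500, 9000]})

# Keep only particles with area > 5000 (labels 2 and 3 here)
labels = df.loc[df.area.gt(5000), "label"]
mask = np.isin(labeled_image, labels).astype(int)
```

Unlike the loop, np.isin checks all labels in a single vectorized pass over the image.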

Create new Pandas.DataFrame with .groupby(...).agg(sum) then recover unsummed columns

I'm starting with a dataframe of baseball seasons, a section of which looks similar to this:
Name Season AB H SB playerid
13047 A.J. Pierzynski 2013 503 137 1 746
6891 A.J. Pierzynski 2006 509 150 1 746
1374 Rod Carew 1977 616 239 23 1001942
1422 Stan Musial 1948 611 230 7 1009405
1507 Todd Helton 2000 580 216 5 432
1508 Nomar Garciaparra 2000 529 197 5 190
1509 Ichiro Suzuki 2004 704 262 36 1101
From these seasons, I want to create a dataframe of career stats; that is, one row for each player which is a sum of their AB, H, etc. This dataframe should still include the names of the players. The playerid in the above is a unique key for each player and should either be an index or an unchanged value in a column after creating the career stats dataframe.
My hypothetical starting point is df_careers = df_seasons.groupby('playerid').agg(sum), but this leaves out all the non-numeric data. With numeric_only = False, I get a mess in the name column like 'Ichiro SuzukiIchiro SuzukiIchiro Suzuki' from string concatenation, which just requires a bunch of cleaning. I'd like to be able to do this with other data sets too, and the actual data I have is more like 25 columns, so I'd rather learn a general routine for getting the Name data back (or preserving it from the outset) than write a data-set-specific function to pass to groupby('playerid').agg(func) (or a similar process), if possible.
I'm guessing there's a fairly simply way to do this, but I only started learning Pandas a week ago, so there are gaps in my knowledge.
You can write your own condition for how to handle the non-summed columns, e.g. take the first value of each object (string) column and the sum of everything else, leaving the grouping key out of the aggregation:
col = df.columns.tolist()
col.remove('playerid')
df.groupby('playerid').agg({i: lambda x: x.iloc[0] if x.dtypes == 'object' else x.sum() for i in col})
df:
                       Name  Season    AB    H  SB
playerid
190       Nomar Garciaparra    2000   529  197   5
432             Todd Helton    2000   580  216   5
746         A.J. Pierzynski    4019  1012  287   2
1101          Ichiro Suzuki    2004   704  262  36
1001942           Rod Carew    1977   616  239  23
1009405         Stan Musial    1948   611  230   7
If there is a one-to-one relationship between 'playerid' and 'Name', as appears to be the case, you can just include 'Name' in the groupby columns:
stat_cols = ['AB', 'H', 'SB']
groupby_cols = ['playerid', 'Name']
results = df.groupby(groupby_cols)[stat_cols].sum()
Results:
AB H SB
playerid Name
190 Nomar Garciaparra 529 197 5
432 Todd Helton 580 216 5
746 A.J. Pierzynski 1012 287 2
1101 Ichiro Suzuki 704 262 36
1001942 Rod Carew 616 239 23
1009405 Stan Musial 611 230 7
If you'd prefer to group only by 'playerid' and add the 'Name' data back in afterwards, you can instead create a 'playerid'-to-'Name' mapping as a dictionary and look it up using map:
results = df.groupby('playerid')[stat_cols].sum()
name_map = pd.Series(df.Name.to_numpy(), df.playerid).to_dict()
results['Name'] = results.index.map(name_map)
Results:
AB H SB Name
playerid
190 529 197 5 Nomar Garciaparra
432 580 216 5 Todd Helton
746 1012 287 2 A.J. Pierzynski
1101 704 262 36 Ichiro Suzuki
1001942 616 239 23 Rod Carew
1009405 611 230 7 Stan Musial
groupby.agg() can accept a dictionary that maps column names to functions, so one solution is to pass such a dictionary, specifying which function to apply to each column.
Using the sample data above, one might use
mapping = { 'AB': sum,'H': sum, 'SB': sum, 'Season': max, 'Name': max }
df_1 = df.groupby('playerid').agg(mapping)
The choice to use 'max' for those that shouldn't be summed is arbitrary. You could define a lambda function to apply to a column if you want to handle it in a certain way. DataFrameGroupBy.agg can work with any function that will work with DataFrame.apply.
To expand this to larger data sets, you might use a dictionary comprehension, excluding the grouping key (which cannot appear in the aggregation dictionary):
dictionary = {x: sum for x in df.columns if x != 'playerid'}
dont_sum = {'Name': max, 'Season': max}
dictionary.update(dont_sum)
df_1 = df.groupby('playerid').agg(dictionary)
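A runnable sketch of the dictionary approach, using 'first' for the name column instead of the max trick above (the data is a made-up subset of the question's table):

```python
import pandas as pd

# Tiny sample in the shape of the question's data
df = pd.DataFrame({
    "playerid": [746, 746, 1101],
    "Name": ["A.J. Pierzynski", "A.J. Pierzynski", "Ichiro Suzuki"],
    "AB": [503, 509, 704],
    "H": [137, 150, 262],
})

# Sum the stat columns, keep the first Name seen for each player
agg_map = {"AB": "sum", "H": "sum", "Name": "first"}
careers = df.groupby("playerid").agg(agg_map)
```

Since 'Name' is constant within each 'playerid' group, 'first' recovers it without any string concatenation.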

Handling Zeros or NaNs in a Pandas DataFrame operations

I have a DataFrame (df) like shown below where each column is sorted from largest to smallest for frequency analysis. That leaves some values either zeros or NaN values as each column has a different length.
08FB006 08FC001 08FC003 08FC005 08GD004
----------------------------------------------
0 253 872 256 11.80 2660
1 250 850 255 10.60 2510
2 246 850 241 10.30 2130
3 241 827 235 9.32 1970
4 241 821 229 9.17 1900
5 232 0 228 8.93 1840
6 231 0 225 8.05 1710
7 0 0 225 0 1610
8 0 0 224 0 1590
9 0 0 0 0 1590
10 0 0 0 0 1550
I need to perform the following calculation as if each column had a different length or number of records (i.e. ignoring the zero values). I have tried using NaN, but for some reason operations on NaN values are not possible.
Here is what I am trying to do with my df columns:
from scipy import stats

shape_list1 = []
location_list1 = []
scale_list1 = []
for column in df.columns:
    shape1, location1, scale1 = stats.genpareto.fit(df[column])
    shape_list1.append(shape1)
    location_list1.append(location1)
    scale_list1.append(scale1)
Assuming all values are positive (as seems from your example and description), try:
stats.genpareto.fit(df[df[column] > 0][column])
This filters every column to operate just on the positive values.
Or, if negative values are allowed,
stats.genpareto.fit(df[df[column] != 0][column])
The syntax is messy, but change
shape1, location1, scale1=stats.genpareto.fit(df[column])
to
shape1, location1, scale1=stats.genpareto.fit(df[column][df[column].nonzero()[0]])
Explanation: df[column].nonzero() returns a tuple of size (1,) whose only element, element [0], is a NumPy array holding the integer positions where df[column] is nonzero. With the default RangeIndex these positions coincide with the index labels, so you can index df[column] by them as df[column][df[column].nonzero()[0]]. (Note that Series.nonzero is deprecated in recent pandas versions; df[column][df[column] != 0] achieves the same thing.)
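To see the per-column filtering in isolation, here is a small sketch with made-up numbers; each cleaned series is what you would then pass to stats.genpareto.fit:

```python
import pandas as pd

# Sample frame with zero padding, mirroring the question's layout
df = pd.DataFrame({
    "08FB006": [253, 250, 246, 0, 0],
    "08FC001": [872, 850, 0, 0, 0],
})

# Keep only the positive entries of each column before fitting
cleaned = {col: df.loc[df[col] > 0, col] for col in df.columns}
print(len(cleaned["08FB006"]), len(cleaned["08FC001"]))  # 3 2
```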

find value in column and based on it create a new dataframe in pandas

I have a variable in the following format fg = 2017-20. It's a string. And also I have a dataframe:
flag №
2017-18 389
2017-19 390
2017-20 391
2017-21 392
2017-22 393
2017-23 394
...
I need to find this value (fg) in the column "flag" and select the corresponding value in the column "№" (in the example it will be 391). Then I need to create a new dataframe which also has a column "№" and fill it with 53 consecutive values starting from this one. The result should look like this:
№_new
391
392
393
394
395
...
442
443
444
It does not look difficult, but I cannot find anything suitable in other questions. Can someone advise, please?
You need boolean indexing with loc for filtering; then convert the one-item Series to a scalar by converting it to a NumPy array with values and selecting the first element with [0].
Last, create the new DataFrame with numpy.arange.
fg = '2017-20'
val = df.loc[df['flag'] == fg, '№'].values[0]
print (val)
391
df1 = pd.DataFrame({'№_new':np.arange(val, val+53)})
print (df1)
№_new
0 391
1 392
2 393
3 394
4 395
5 396
6 397
7 398
8 399
9 400
10 401
11 402
..
..
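End to end, the answer's two steps can be run as follows (with a minimal reconstruction of the question's frame):

```python
import numpy as np
import pandas as pd

# Minimal reconstruction of the frame from the question
df = pd.DataFrame({"flag": ["2017-18", "2017-19", "2017-20"],
                   "№": [389, 390, 391]})

fg = "2017-20"
# Look up the scalar matching the flag, then build 53 consecutive values
val = df.loc[df["flag"] == fg, "№"].values[0]
df1 = pd.DataFrame({"№_new": np.arange(val, val + 53)})
```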
