Comparing Columns in a Pandas Dataframe

I have a pandas data frame with racing results.
    Place  BibNum   Time
0       1       2   5:50
1       2       4   8:09
2       3       7  10:27
3       4       3  11:12
4       5       1  12:13
...
34      1       5   2:03
35      2       9   4:35
36      3       7   5:36
What I would like to know is: how can I get a count of how many times each BibNum showed up where the Place was 1, 2, 3, etc.?
I know that I can do a "value_counts", but that counts occurrences within a single column. I also looked into numpy.where, but that takes a conditional like greater-than or less-than.

IIUC (if I understand correctly), this is what you need:
out = df.groupby(['Place','BibNum']).size()
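For the sample data above, that produces one count per (Place, BibNum) pair; note that BibNum 7 placed 3rd twice. A minimal sketch on a small slice of the data:

import pandas as pd

# A small slice of the racing results shown in the question
df = pd.DataFrame({
    "Place":  [1, 2, 3, 4, 5, 1, 2, 3],
    "BibNum": [2, 4, 7, 3, 1, 5, 9, 7],
    "Time":   ["5:50", "8:09", "10:27", "11:12", "12:13", "2:03", "4:35", "5:36"],
})

# size() counts the rows in each (Place, BibNum) group
out = df.groupby(["Place", "BibNum"]).size()
print(out)
# Place  BibNum
# 1      2         1
#        5         1
# 2      4         1
#        9         1
# 3      7         2
# 4      3         1
# 5      1         1
# dtype: int64

In pandas 1.1+, df[["Place", "BibNum"]].value_counts() gives the same counts, sorted by frequency instead of by key.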

How to create new column in Pandas dataframe where each row is product of previous rows

I have the following DataFrame dt:
   a
0  1
1  2
2  3
3  4
4  5
How do I create a new column where each row is a function of previous rows?
For instance, say the formula is:
B_row(t) = A_row(t-1) + A_row(t-2) + 3
Such that:
   a   b
0  1   /
1  2   /
2  3   6
3  4   8
4  5  10
Also, I hear a lot that we mustn't loop through rows in Pandas; however, it seems to me that I should go at it by looping through each row in a sort of recursive loop, as I would in regular Python.
You could use cumprod:
dt['b'] = dt['a'].cumprod()
Output:
   a    b
0  1    1
1  2    2
2  3    6
3  4   24
4  5  120
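Note that cumprod matches the title (a running product), but not the specific formula in the question. For B_row(t) = A_row(t-1) + A_row(t-2) + 3, a vectorized sketch using shift avoids looping over rows:

import pandas as pd

dt = pd.DataFrame({"a": [1, 2, 3, 4, 5]})

# shift(1) is the previous row, shift(2) the one before that;
# the first two rows have no such predecessors and become NaN,
# matching the "/" placeholders in the question's expected output.
dt["b"] = dt["a"].shift(1) + dt["a"].shift(2) + 3
print(dt)
#    a     b
# 0  1   NaN
# 1  2   NaN
# 2  3   6.0
# 3  4   8.0
# 4  5  10.0

For genuinely recursive formulas, where b depends on earlier values of b itself, shift won't help and a plain loop is hard to avoid.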

Pandas groupby results assign to a new column

Hi, I'm trying to create a new column in my dataframe, with values based on a calculation: each Student's share of the total Score within their Class. There are two different students with the same name in different classes, which is why the first groupby below is on both Class and Student.
df['share'] = df.groupby(['Class', 'Student'])['Score'].agg('sum')/df.groupby(['Class'])['Score'].agg('sum')
With the code above, I get the error "incompatible index of inserted column with frame index".
Can someone please help. Thanks
The problem is that a groupby aggregate is indexed by the unique values of the columns you group on, not by your frame's row index, so it can't be assigned directly as a column.
I understood your problem this way: build the share scores as a separate dataframe, one row per (Class, Student), and reset the index:
ndf = df.groupby(['Class', 'Student'])['Score'].agg('sum')/df.groupby(['Class'])['Score'].agg('sum')
ndf = ndf.reset_index()
ndf
If I understood you correctly, given an example df like the following:
    Class  Student  Score
1       1        1     99
2       1        2     60
3       1        3     90
4       1        4     50
5       2        1     93
6       2        2     93
7       2        3     67
8       2        4     58
9       3        1     54
10      3        2     29
11      3        3     34
12      3        4     46
Do you need the following result?
    Class  Student  Score  Score_Share
1       1        1     99     0.331104
2       1        2     60     0.200669
3       1        3     90     0.301003
4       1        4     50     0.167224
5       2        1     93     0.299035
6       2        2     93     0.299035
7       2        3     67     0.215434
8       2        4     58     0.186495
9       3        1     54     0.331288
10      3        2     29     0.177914
11      3        3     34     0.208589
12      3        4     46     0.282209
If so, that can be achieved straightforwardly with:
df['Score_Share'] = df.groupby('Class')['Score'].apply(lambda x: x / x.sum())
You can apply operations within each group's scope like that.
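A variant using transform is worth knowing as well (a sketch on the first two classes from the example above): transform('sum') returns a Series aligned with the frame's own index, holding each row's class total, so the division and the assignment line up row-for-row without any index gymnastics.

import pandas as pd

# The first two classes of four students each, from the example above
df = pd.DataFrame({
    "Class":   [1, 1, 1, 1, 2, 2, 2, 2],
    "Student": [1, 2, 3, 4, 1, 2, 3, 4],
    "Score":   [99, 60, 90, 50, 93, 93, 67, 58],
})

# Each row's Score divided by its class's total Score
df["Score_Share"] = df["Score"] / df.groupby("Class")["Score"].transform("sum")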
PS. I don't know why a student with the same name in a different class would be a problem, so maybe I'm not getting something right. I'll edit this according to your response. Can't make a comment because I'm a newbie here :)

Pandas Count values across rows that are greater than another value in a different column

I have a pandas dataframe like this:
X  a  b  c
1  1  0  2
5  4  7  3
6  7  8  9
I want a new column called 'count' which holds, for each row, the number of values in that row greater than the value in the first column ('X' in my case). The output should look like:
X  a  b  c  Count
1  1  0  2      2
5  4  7  3      1
6  7  8  9      3
I would like to refrain from using lambda functions or 'for' loops or any other looping technique, since my dataframe has a large number of rows. I tried something like this but I couldn't get what I wanted:
df['count'] = df[df.iloc[:, 1:] > df.iloc[:, 0]].count(axis=1)
I also tried numpy.where() and didn't have any luck with that either, so any help will be appreciated. I also have NaN values in my dataframe, and I would like those ignored when counting.
Thanks for your help in advance!
You can use ge (>=) with sum. The axis=0 argument aligns the comparison row-wise against the first column, which is what the plain > attempt was missing. Note that the desired output counts values equal to X (in the first row, a == X is counted), so ge rather than gt is the right fit here; NaN compares as False, so missing values are ignored automatically.
df.iloc[:, 1:].ge(df.iloc[:, 0], axis=0).sum(axis=1)
Out[784]:
0    2
1    1
2    3
dtype: int64
Then assign it back:
df['Count'] = df.iloc[:, 1:].ge(df.iloc[:, 0], axis=0).sum(axis=1)
df
Out[786]:
   X  a  b  c  Count
0  1  1  0  2      2
1  5  4  7  3      1
2  6  7  8  9      3
In case anyone needs such a solution: you can just add the outputs you get from .le and .ge in one line. Thanks to @Wen for the answer to my question, though!
df['count'] = (df.iloc[:, 2:5].le(df.iloc[:, 0], axis=0).sum(axis=1) + df.iloc[:, 2:5].ge(df.iloc[:, 1], axis=0).sum(axis=1))

Python Pandas value-dependent column creation

I have a pandas DataFrame with columns "Time" and "A". For each row, df["Time"] is an integer timestamp and df["A"] is a float. I want to create a new column "B" which has the value of df["A"], but the one that occurs at or immediately before five seconds in the future. I can do this iteratively as:
for i in df.index:
    df["B"][i] = df["A"][max(df[df["Time"] <= df["Time"][i] + 5].index)]
However, the df has tens of thousands of records so this takes far too long, and I need to run this a few hundred times so my solution isn't really an option. I am somewhat new to pandas (and only somewhat less new to programming in general) so I'm not sure if there's an obvious solution to this supported by pandas.
It would help if I had a way of referencing the specific value of df["Time"] in each row while creating the column, so I could do something like:
df["B"] = df["A"][max(df[df["Time"] <= df["Time"][corresponding_row]+5].index)]
Thanks.
Edit: Here's an example of what my goal is. If the dataframe is as follows:
Time  A
  0   0
  1   1
  4   2
  7   3
  8   4
 10   5
 12   6
 15   7
 18   8
 20   9
then I would like the result to be:
Time  A  B
  0   0  2
  1   1  2
  4   2  4
  7   3  6
  8   4  6
 10   5  7
 12   6  7
 15   7  9
 18   8  9
 20   9  9
where each entry in B comes from the value of A in the row whose Time is greater by at most 5. So if Time is also the index, then df["B"][0] = df["A"][4], since 4 is the largest Time which is at most 5 greater than 0. In code, 4 = max(df["Time"][df["Time"] <= 0 + 5]), which is why df["B"][0] is df["A"][4].
Use tshift. You may need to resample first to fill in any missing values. (Note: in modern pandas, tshift and the how= argument to resample have since been removed; use .resample('s').ffill() and .shift(5, freq='s') instead.) I don't have time to test this, but try this:
df['B'] = df.resample('s', how='ffill').tshift(5, freq='s').reindex_like(df)
And a tip for getting help here: if you provide a few rows of sample data and an example of your desired result, it's easy for us to copy/paste and try out a solution for you.
Edit
OK, looking at your example data, let's leave your Time column as integers.
In [59]: df
Out[59]:
      A
Time
0     0
1     1
4     2
7     3
8     4
10    5
12    6
15    7
18    8
20    9
Make an array containing the first and last Time values and all the integers in between.
In [60]: index = np.arange(df.index.values.min(), df.index.values.max() + 1)
Make a new DataFrame with all the gaps filled in.
In [61]: df1 = df.reindex(index, method='ffill')
Make a new column with the same data shifted up by 5 -- that is, looking forward in time by 5 seconds.
In [62]: df1['B'] = df1['A'].shift(-5)
And now drop all the filled-in times we added, taking only values from the original Time index.
In [63]: df1.reindex(df.index)
Out[63]:
      A    B
Time
0     0    2
1     1    2
4     2    4
7     3    6
8     4    6
10    5    7
12    6    7
15    7    9
18    8  NaN
20    9  NaN
How you fill in the last values, for which there is no "five seconds later", is up to you. Judging from your desired output, maybe use fillna with a constant value set to the last value in column A.
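For completeness, a vectorized alternative that also fills those tail rows (a sketch, assuming Time is sorted ascending): Series.asof returns the last value at or before each requested label, which is exactly the "at or immediately before five seconds later" lookup.

import pandas as pd

df = pd.DataFrame({
    "Time": [0, 1, 4, 7, 8, 10, 12, 15, 18, 20],
    "A":    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
})

# For each row, look up A at the largest Time that is <= Time + 5.
lookup = df.set_index("Time")["A"]
df["B"] = lookup.asof(df["Time"] + 5).to_numpy()

For Time 18 and 20 the lookup falls back to the last available row (Time 20), giving 9 in both places, which matches the desired output.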

Sorting rows in csv file using Python Pandas

I have a quick question regarding sorting rows in a csv file using Pandas. The csv file I have contains data that looks like:
quarter  week  Value
      5     1    200
      3     2    100
      2     1     50
      2     2    125
      4     2    175
      2     3    195
      3     1     10
      5     2    190
I need to sort in the following way: by quarter and, within each quarter, by week. The output should look like the following:
quarter  week  Value
      2     1     50
      2     2    125
      2     3    195
      3     1     10
      3     2    100
      4     2    175
      5     1    200
      5     2    190
My attempt:
df = df.sort('quarter', 'week')
But this does not produce the correct result. Any help/suggestions?
Thanks!
New answer, as of 14 March 2019
df.sort_values(by=["COLUMN"], ascending=False)
This returns a new sorted data frame; it doesn't update the original one.
Note: you can change the ascending parameter according to your needs; without passing it, it defaults to ascending=True.
Note: sort has been deprecated in favour of sort_values, which you should use in Pandas 0.17+.
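Applied to the question's data, a minimal sketch:

import pandas as pd

df = pd.DataFrame({
    "quarter": [5, 3, 2, 2, 4, 2, 3, 5],
    "week":    [1, 2, 1, 2, 2, 3, 1, 2],
    "Value":   [200, 100, 50, 125, 175, 195, 10, 190],
})

# Sort by quarter first, then by week within each quarter.
df = df.sort_values(["quarter", "week"])

(The original answer, using the old sort API, follows.)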
Typing help(df.sort) gives:
sort(self, columns=None, column=None, axis=0, ascending=True, inplace=False) method of pandas.core.frame.DataFrame instance
Sort DataFrame either by labels (along either axis) or by the values in
column(s)
Parameters
----------
columns : object
Column name(s) in frame. Accepts a column name or a list or tuple
for a nested sort.
[...]
Examples
--------
>>> result = df.sort(['A', 'B'], ascending=[1, 0])
[...]
and so you pass the columns you want to sort as a list:
>>> df
   quarter  week  Value
0        5     1    200
1        3     2    100
2        2     1     50
3        2     2    125
4        4     2    175
5        2     3    195
6        3     1     10
7        5     2    190
>>> df.sort(["quarter", "week"])
   quarter  week  Value
2        2     1     50
3        2     2    125
5        2     3    195
6        3     1     10
1        3     2    100
4        4     2    175
0        5     1    200
7        5     2    190