Convert python pandas rows to columns - python

Decade difference (kg) Version
0 1510 - 1500 -0.346051 v1.0h
1 1510 - 1500 -3.553251 A2011
2 1520 - 1510 -0.356409 v1.0h
3 1520 - 1510 -2.797978 A2011
4 1530 - 1520 -0.358922 v1.0h
I want to transform the pandas dataframe so that the two unique entries in the Version column become columns. How do I do that?
The resulting dataframe should not have a MultiIndex.

In [28]: df.pivot(index='Decade', columns='Version', values='difference (kg)')
Out[28]:
Version A2011 v1.0h
Decade
1510 - 1500 -3.553251 -0.346051
1520 - 1510 -2.797978 -0.356409
1530 - 1520 NaN -0.358922
or
In [31]: df.pivot(index='difference (kg)', columns='Version', values='Decade')
Out[31]:
Version A2011 v1.0h
difference (kg)
-3.553251 1510 - 1500 None
-2.797978 1520 - 1510 None
-0.358922 None 1530 - 1520
-0.356409 None 1520 - 1510
-0.346051 None 1510 - 1500
both satisfy your requirements.
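If you also want to get rid of the leftover index and columns names from the pivot (so the frame is completely flat), you can clean it up afterwards; a small sketch using the first pivot above:
out = df.pivot(index='Decade', columns='Version', values='difference (kg)')
out = out.reset_index()        # turn 'Decade' back into an ordinary column
out.columns.name = None        # drop the leftover 'Version' columns name
# out.columns is now just ['Decade', 'A2011', 'v1.0h']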

Related

How to find the latest instance between two columns and get a value from a different column for that instance?

I have a match history dataset and I want to find the latest up to date elo for the players. I'm working with python & pandas and a sample of the dataset is this:
tournament_date winner_id loser_id winner_elo loser_elo winner_delta loser_delta other_columns
----------------- ------------ ---------- ------------- ------------ -------------- --------------- ---------------
2017-08-24 512 543 1128 1102 6 -6 ...
2017-08-24 100 517 1153 1062 0.4 -0.4 ...
2017-08-24 512 547 1128 1114 3.4 -3.4 ...
2017-08-24 543 517 1102 1062 4.8 -4.8 ...
2017-08-24 547 100 1114 1153 11.2 -11.2 ...
2017-08-24 517 512 1062 1128 9.9 -9.9 ...
2017-08-24 543 100 1102 1153 9.1 -9.1 ...
2017-08-24 517 547 1062 1114 9.1 -9.1 ...
2017-08-26 543 517 1103 1089 5.2 -5.2 ...
2017-08-26 547 551 1119 1165 8.8 -8.8 ...
2017-08-26 543 557 1103 1214 8.5 -8.5 ...
2017-08-26 551 517 1165 1089 1 -1 ...
2017-08-26 557 547 1089 1119 7.8 -7.8 ...
2017-08-26 551 543 1165 1103 3 -3 ...
winner_elo and loser_elo are updated daily in my dataset but for every match there is a column for the delta change for winners and losers.
I want to find the latest entry for each player id (whether it appears in winner_id or loser_id). If that latest instance is in the winner_id column, I want to calculate winner_elo + winner_delta to get the up-to-date elo; if it is in the loser_id column, I want to calculate loser_elo + loser_delta instead.
There are around 1000 unique player ids (500 unique winner_ids and 508 loser_ids). I tried grouping by winner_id, sorting by date and taking the max, and similarly grouping by loser_id, but I don't know how to compare the two results, work out which one is the latest, and then do the required calculation.
I can only think of solutions that involve for loops and ifs, but I guess there must be a better way.
Edit: this is part of a web scraping project and I'm getting new data daily, so I would prefer a solution that also handles newer entries.
I would attack this by splitting the data frame in two: drop the loser data from one and the winner data from the other, and rename the columns to simply "id", "elo" and "delta". Then concatenate the two frames, sort by date (most recent first), and group by player ID.
Now, for each player, simply skim off the top (most recent) row and apply it to get the current rating.
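A minimal sketch of that approach, assuming the column names from the sample above (tournament_date, winner_id/loser_id, winner_elo/loser_elo, winner_delta/loser_delta):
import pandas as pd

# Split the frame into a winner view and a loser view with common column names.
winners = df[['tournament_date', 'winner_id', 'winner_elo', 'winner_delta']].rename(
    columns={'winner_id': 'id', 'winner_elo': 'elo', 'winner_delta': 'delta'})
losers = df[['tournament_date', 'loser_id', 'loser_elo', 'loser_delta']].rename(
    columns={'loser_id': 'id', 'loser_elo': 'elo', 'loser_delta': 'delta'})

# Stack the two halves, sort most recent first, keep the top row per player,
# then apply that row's delta to get the current rating.
both = pd.concat([winners, losers], ignore_index=True)
latest = (both.sort_values('tournament_date', ascending=False)
              .groupby('id')
              .head(1)
              .assign(current_elo=lambda d: d['elo'] + d['delta']))
Since the whole thing is a single concat/sort/groupby, it can simply be re-run as newly scraped rows are appended.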

Aggregations over specific columns of a large dataframe, with named output

I am looking for a way to aggregate over a large dataframe, possibly using groupby. Each group would be based on either pre-specified columns or regex, and the aggregation should produce a named output.
This produces a sample dataframe:
import pandas as pd
import itertools
import numpy as np
col = "A,B,C".split(',')
col1 = "1,2,3,4,5,6,7,8,9".split(',')
col2 = "E,F,G".split(',')
all_dims = [col, col1, col2]
all_keys = ['.'.join(i) for i in itertools.product(*all_dims)]
rng = pd.date_range(end=pd.Timestamp.today().date(), periods=12, freq='M')
df = pd.DataFrame(np.random.randint(0, 1000, size=(len(rng), len(all_keys))), columns=all_keys, index=rng)
The above produces a dataframe with one year's worth of monthly data and 81 columns with the following names:
['A.1.E', 'A.1.F', 'A.1.G', 'A.2.E', 'A.2.F', 'A.2.G', 'A.3.E', 'A.3.F',
'A.3.G', 'A.4.E', 'A.4.F', 'A.4.G', 'A.5.E', 'A.5.F', 'A.5.G', 'A.6.E',
'A.6.F', 'A.6.G', 'A.7.E', 'A.7.F', 'A.7.G', 'A.8.E', 'A.8.F', 'A.8.G',
'A.9.E', 'A.9.F', 'A.9.G', 'B.1.E', 'B.1.F', 'B.1.G', 'B.2.E', 'B.2.F',
'B.2.G', 'B.3.E', 'B.3.F', 'B.3.G', 'B.4.E', 'B.4.F', 'B.4.G', 'B.5.E',
'B.5.F', 'B.5.G', 'B.6.E', 'B.6.F', 'B.6.G', 'B.7.E', 'B.7.F', 'B.7.G',
'B.8.E', 'B.8.F', 'B.8.G', 'B.9.E', 'B.9.F', 'B.9.G', 'C.1.E', 'C.1.F',
'C.1.G', 'C.2.E', 'C.2.F', 'C.2.G', 'C.3.E', 'C.3.F', 'C.3.G', 'C.4.E',
'C.4.F', 'C.4.G', 'C.5.E', 'C.5.F', 'C.5.G', 'C.6.E', 'C.6.F', 'C.6.G',
'C.7.E', 'C.7.F', 'C.7.G', 'C.8.E', 'C.8.F', 'C.8.G', 'C.9.E', 'C.9.F',
'C.9.G']
What I would like now is to be able to aggregate over the dataframe, take certain column combinations, and produce named outputs. For example, one rule might be that I take all 'A.*.E' columns (with any number in the middle), sum them, and produce a named output column called 'A.SUM.E'. And then do the same for 'A.*.F', 'A.*.G' and so on.
I have looked into pandas 0.25 named aggregation, which allows me to name my outputs, but I couldn't see how to simultaneously capture the right column combinations and produce the right output names.
If you need to reshape the dataframe to make a workable solution, that is fine as well.
Note, I am aware I could do something like this in a Python loop but I am looking for a pandas way to do it.
Not a groupby solution, and it uses a loop, but I think it's nonetheless rather elegant: first get a list of the unique (first, last) column-name combinations using a set, and then do the sums using filter:
cols = sorted(set((x.split('.')[0], x.split('.')[-1]) for x in df.columns))
for c0, c1 in cols:
    df[f'{c0}.SUM.{c1}'] = df.filter(regex=rf'{c0}\.\d+\.{c1}').sum(axis=1)
Result:
A.1.E A.1.F A.1.G A.2.E ... B.SUM.G C.SUM.E C.SUM.F C.SUM.G
2018-08-31 978 746 408 109 ... 4061 5413 4102 4908
2018-09-30 923 649 488 447 ... 5585 3634 3857 4228
2018-10-31 911 359 897 425 ... 5039 2961 5246 4126
2018-11-30 77 479 536 509 ... 4634 4325 2975 4249
2018-12-31 608 995 114 603 ... 5377 5277 4509 3499
2019-01-31 138 612 363 218 ... 4514 5088 4599 4835
2019-02-28 994 148 933 990 ... 3907 4310 3906 3552
2019-03-31 950 931 209 915 ... 4354 5877 4677 5557
2019-04-30 255 168 357 800 ... 5267 5200 3689 5001
2019-05-31 593 594 824 986 ... 4221 2108 4636 3606
2019-06-30 975 396 919 242 ... 3841 4787 4556 3141
2019-07-31 350 312 104 113 ... 4071 5073 4829 3717
If you want to have the result in a new DataFrame, just create an empty one and add the columns to it:
result = pd.DataFrame()
for c0, c1 in cols:
    result[f'{c0}.SUM.{c1}'] = df.filter(regex=rf'{c0}\.\d+\.{c1}').sum(axis=1)
Update: using a plain groupby (which is even simpler in this particular case):
def grouper(col):
    c = col.split('.')
    return f'{c[0]}.SUM.{c[-1]}'

df.groupby(grouper, axis=1).sum()
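As a side note, axis=1 in groupby is deprecated in recent pandas releases; if that is a concern, the same grouper can be applied to the transposed frame instead (a sketch, equivalent up to the final transpose):
# Group the transposed frame by the mapped column names, then transpose back.
df.T.groupby(grouper).sum().T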

Handling Zeros or NaNs in Pandas DataFrame operations

I have a DataFrame (df) like the one shown below, where each column is sorted from largest to smallest for frequency analysis. That leaves some values as either zeros or NaNs, since each column has a different length.
08FB006 08FC001 08FC003 08FC005 08GD004
----------------------------------------------
0 253 872 256 11.80 2660
1 250 850 255 10.60 2510
2 246 850 241 10.30 2130
3 241 827 235 9.32 1970
4 241 821 229 9.17 1900
5 232 0 228 8.93 1840
6 231 0 225 8.05 1710
7 0 0 225 0 1610
8 0 0 224 0 1590
9 0 0 0 0 1590
10 0 0 0 0 1550
I need to perform the following calculation as if each column has a different length or number of records (ignoring zero values). I have tried using NaN, but for some reason operations on NaN values are not possible.
Here is what I am trying to do with my df columns:
from scipy import stats

shape_list1 = []
location_list1 = []
scale_list1 = []
for column in df.columns:
    shape1, location1, scale1 = stats.genpareto.fit(df[column])
    shape_list1.append(shape1)
    location_list1.append(location1)
    scale_list1.append(scale1)
Assuming all values are positive (as your example and description suggest), try:
stats.genpareto.fit(df[df[column] > 0][column])
This filters every column to operate just on the positive values.
Or, if negative values are allowed,
stats.genpareto.fit(df[df[column] != 0][column])
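For example, plugging that filter into the original loop might look like this (a sketch, assuming stats is scipy.stats as in the question):
from scipy import stats

shape_list1, location_list1, scale_list1 = [], [], []
for column in df.columns:
    # Fit only the nonzero entries of this column; append .dropna() as well
    # if the column can also contain NaNs.
    values = df.loc[df[column] != 0, column]
    shape1, location1, scale1 = stats.genpareto.fit(values)
    shape_list1.append(shape1)
    location_list1.append(location1)
    scale_list1.append(scale1)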
The syntax is messy, but change
shape1, location1, scale1=stats.genpareto.fit(df[column])
to
shape1, location1, scale1=stats.genpareto.fit(df[column][df[column].nonzero()[0]])
Explanation: df[column].nonzero() returns a tuple of length 1 whose only element, element [0], is a NumPy array holding the integer positions where df[column] is nonzero (these coincide with the index labels here, since the index is the default 0, 1, 2, ...). To index df[column] by these nonzero positions, you can use df[column][df[column].nonzero()[0]].

How to calculate the duration of events using Pandas given as strings?

After finding the following link regarding calculating time differences using Pandas, I'm still stuck attempting to fit that knowledge to my own data. Here's what my dataset looks like:
In [10]: df
Out[10]:
id time
0 420 1/3/2018 8:32
1 420 1/3/2018 8:36
2 420 1/3/2018 8:42
3 425 1/7/2018 12:35
4 425 1/7/2018 14:29
5 425 1/7/2018 16:15
6 425 1/7/2018 16:36
7 427 1/11/2018 20:50
8 428 1/13/2018 16:35
9 428 1/13/2018 17:36
I'd like to perform a groupby or another function on ID where the output is:
In [11]: pd.groupby(df[id])
Out [11]:
id time (duration)
0 420 0:10
1 425 4:01
2 427 0:00
3 428 1:01
The types for id and time are int64 and object respectively. Using python3 and pandas 0.20.
Edit:
Coming from SQL, this appears that it would be functionally equivalent to:
select id, max(time) - min(time)
from df
group by id
Edit 2:
Thank you all for the quick responses. All of the solutions give me some version of the following error; I'm not sure what it is about my particular dataset that I'm missing here:
TypeError: unsupported operand type(s) for -: 'str' and 'str'
groupby with np.ptp
df.groupby('id').time.apply(np.ptp)
id
420 00:10:00
425 04:01:00
427 00:00:00
428 01:01:00
Name: time, dtype: timedelta64[ns]
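If the time column is still stored as strings (which the TypeError in the question's edit suggests), it presumably needs to be converted to datetimes before any of these approaches will work:
df['time'] = pd.to_datetime(df['time'])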
Group the dataframe by event IDs and select the smallest and the largest times:
df1 = df.groupby('id').agg([max, min])
Find the difference:
(df1[('time','max')] - df1[('time','min')]).reset_index()
# id 0
#0 420 00:10:00
#1 425 04:01:00
#2 427 00:00:00
#3 428 01:01:00
You need to sort the dataframe by time and group by id before getting the difference between time in each group.
df['time'] = pd.to_datetime(df['time'])
df.sort_values(by='time').groupby('id')['time'].apply(lambda g: g.max() - g.min()).reset_index(name='duration')
Output:
id duration
0 420 00:10:00
1 425 04:01:00
2 427 00:00:00
3 428 01:01:00

Pandas: select by bigger than a value

My dataframe has a column called dir; it has several values, and I want to know how many of the values pass a certain threshold. For example:
df['dir'].value_counts().sort_index()
It returns a Series
0 855
20 881
40 2786
70 3777
90 3964
100 4
110 2115
130 3040
140 1
160 1697
180 1734
190 3
200 618
210 3
220 1451
250 895
270 2167
280 1
290 1643
300 1
310 1894
330 1
340 965
350 1
Name: dir, dtype: int64
Here, I want to know the number of values whose count passes 500. In this case, it's all except 100, 140, 190, 210, 280, 300, 330, 350.
How can I do that?
I can get away with df['dir'].value_counts()[df['dir'].value_counts() > 500]
(df['dir'].value_counts() > 500).sum()
The comparison gets the value counts and returns them as a Series of truth values. The parentheses let the whole expression be treated as a Series, and .sum() counts the True values as 1 and the False values as 0.
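Putting the two together, a small usage sketch that reports both which dir values pass the threshold and how many of them there are:
counts = df['dir'].value_counts()

passed = counts[counts > 500]      # counts for the dir values that pass 500
print(passed.index.tolist())       # which dir values those are
print((counts > 500).sum())        # how many of them there are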
