I want to pivot this DataFrame and convert the original columns into the second level of a MultiIndex (or into regular columns).
Original dataframe:
Type VC C B Security
0 Standard 2 2 2 A
1 Standard 16 13 0 B
2 Standard 52 35 2 C
3 RI 10 10 0 A
4 RI 10 15 31 B
5 RI 10 15 31 C
Desired dataframe:
Type A B C
0 Standard VC 2 16 52
1 Standard C 2 13 35
2 Standard B 2 0 2
3 RI VC 10 10 10
11 RI C 10 15 15
12 RI B 0 31 31
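For reference, the steps below assume the original frame can be constructed like this (a minimal sketch of the sample data above):

import pandas as pd

df = pd.DataFrame({
    'Type': ['Standard', 'Standard', 'Standard', 'RI', 'RI', 'RI'],
    'VC': [2, 16, 52, 10, 10, 10],
    'C': [2, 13, 35, 10, 15, 15],
    'B': [2, 0, 2, 0, 31, 31],
    'Security': ['A', 'B', 'C', 'A', 'B', 'C'],
})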
You could try as follows:
Use df.pivot and then transpose using df.T.
Next, chain df.sort_index to rearrange the entries, and apply df.swaplevel to change the order of the MultiIndex.
Lastly, consider getting rid of Security as the columns.name, and adding an index.name for the unnamed level, e.g. Subtype here.
If you want the MultiIndex as columns, you can of course simply use df.reset_index at this stage (see the snippet after the output below).
res = (df.pivot(index='Security', columns='Type')  # columns become a (variable, Type) MultiIndex
       .T                                          # transpose: that MultiIndex is now the row index
       .sort_index(level=[1, 0], ascending=[False, False])  # sort by Type, then variable, descending
       .swaplevel(0))                              # reorder the index levels to (Type, variable)
res.columns.name = None                 # drop the leftover 'Security' columns name
res.index.names = ['Type', 'Subtype']   # name the previously unnamed level
print(res)
A B C
Type Subtype
Standard VC 2 16 52
C 2 13 35
B 2 0 2
RI VC 10 10 10
C 10 15 15
B 0 31 31
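If you prefer the MultiIndex as regular columns (per the note above), a final reset_index does it:

res = res.reset_index()  # 'Type' and 'Subtype' become ordinary columns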
I'm trying to get the unique Available values for each site. The original pandas DataFrame has three columns:
Site  Available  Capacity
A     7          20
A     7          20
A     8          20
B     15         35
B     15         35
C     12         25
C     12         25
C     11         25
and I want to get the unique Available values for each site. The desired table looks like this:
Site  Unique Available
A     7
A     8
B     15
C     12
C     11
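For reproducibility, the answers below assume the sample data is loaded like this (a minimal sketch):

import pandas as pd

df = pd.DataFrame({
    'Site': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C'],
    'Available': [7, 7, 8, 15, 15, 12, 12, 11],
    'Capacity': [20, 20, 20, 35, 35, 25, 25, 25],
})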
You can get the lists of unique Available values per site with GroupBy.unique():
>>> df.groupby('Site')['Available'].unique()
Site
A [7, 8]
B [15]
C [12, 11]
Name: Available, dtype: object
Then with explode() you can expand these lists and with reset_index() get the index back to a column:
>>> df.groupby('Site')['Available'].unique().explode().reset_index()
Site Available
0 A 7
1 A 8
2 B 15
3 C 12
4 C 11
Otherwise simply get both columns and remove duplicates:
>>> df[['Site', 'Available']].drop_duplicates()
Site Available
0 A 7
2 A 8
3 B 15
5 C 12
7 C 11
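If you want a clean 0..n index like in the desired table, chain reset_index(drop=True):
>>> df[['Site', 'Available']].drop_duplicates().reset_index(drop=True)
  Site  Available
0    A          7
1    A          8
2    B         15
3    C         12
4    C         11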
Approach with: GroupBy.apply() + Series.drop_duplicates()
(df.groupby('Site')['Available']
   .apply(lambda s: s.drop_duplicates())  # drop duplicate values within each Site group
   .reset_index(level=1, drop=True)       # discard the original row-index level
   .reset_index(name='Unique Available')  # turn the Site index back into a column
)
Result:
Site Unique Available
0 A 7
1 A 8
2 B 15
3 C 12
4 C 11
I have this table:
a b c d e f 19-08-06 19-08-07 19-08-08 g h i
1 2 3 4 5 6 7 8 9 10 11 12
I have 34 date columns, and I want to melt them into a single column.
How can I do this in Python?
Thanks in advance
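For reference, the snippets below assume the sample table is built like this (a minimal sketch):

import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]],
    columns=['a', 'b', 'c', 'd', 'e', 'f',
             '19-08-06', '19-08-07', '19-08-08', 'g', 'h', 'i'],
)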
You can use .str.fullmatch on the columns Index to create a boolean mask that selects the date columns, then use df.melt:
m = df.columns.str.fullmatch(r"\d{2}-\d{2}-\d{2}")
cols = df.columns[m]
df.melt(value_vars=cols, var_name='date', value_name='vals')
date vals
0 19-08-06 7
1 19-08-07 8
2 19-08-08 9
If you want to melt while keeping the other columns, try this:
df.melt(
    id_vars=df.columns.difference(cols), var_name="date", value_name="vals"
)
a b c d e f g h i date vals
0 1 2 3 4 5 6 10 11 12 19-08-06 7
1 1 2 3 4 5 6 10 11 12 19-08-07 8
2 1 2 3 4 5 6 10 11 12 19-08-08 9
Here I did not use value_vars=cols, as it's done implicitly:
value_vars: tuple, list, or ndarray, optional
Column(s) to unpivot. If not specified, uses all columns that are
not set as id_vars.
I saw a primitive version of this question here,
but my dataframe has different names and I want to do the calculation separately for each of them:
A B C
0 a 3 5
1 a 6 9
2 b 3 8
3 b 11 19
I want to group by A and then find the difference between alternate B and C values, something like this:
A B C dA
0 a 3 5 6
1 a 6 9 NaN
2 b 3 8 16
3 b 11 19 NaN
I tried:
df['dA']=df.groupby('A')(['C']-['B'])
df['dA']=df.groupby('A')['C']-df.groupby('A')['B']
Neither of them helped.
What mistake am I making?
IIUC, here is one way to perform the calculation:
# create the data frame
from io import StringIO
import pandas as pd
data = '''idx A B C
0 a 3 5
1 a 6 9
2 b 3 8
3 b 11 19
'''
df = pd.read_csv(StringIO(data), sep=r'\s+', engine='python').set_index('idx')
Now, compute dA. I take the last value of C minus the first value of B, as grouped by A. (Is this right? Or is it max(C) minus min(B)?) If you're guaranteed to have the A values in pairs, then @BenT's shift() would be more concise (see the sketch at the end of this answer).
dA = (
    (df.groupby('A')['C'].transform('last') -
     df.groupby('A')['B'].transform('first'))
    .drop_duplicates()
    .rename('dA'))
print(pd.concat([df, dA], axis=1))
A B C dA
idx
0 a 3 5 6.0
1 a 6 9 NaN
2 b 3 8 16.0
3 b 11 19 NaN
I used groupby().transform() to preserve index values, to support the concat operation.
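For completeness, a sketch of the shift()-based alternative mentioned above, assuming every A value appears in exactly two consecutive rows:

# pair each row's B with the next row's C within the same group;
# the second row of each pair gets NaN
df['dA'] = df.groupby('A')['C'].shift(-1) - df['B']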
I have a number of string-based columns in a pandas dataframe that I'm looking to use in scikit-learn classification models. I know I have to use OneHotEncoder to properly encode the variables, but first I want to reduce the variation in the columns by taking out either the strings that appear less than x% of the time in the column, or those that are not among the top x strings by count in the column.
Here's an example:
df1 = pd.DataFrame({'a':range(22), 'b':list('aaaaaaaabbbbbbbcccdefg'), 'c':range(22)})
df1
a b c
0 0 a 0
1 1 a 1
2 2 a 2
3 3 a 3
4 4 a 4
5 5 a 5
6 6 a 6
7 7 a 7
8 8 b 8
9 9 b 9
10 10 b 10
11 11 b 11
12 12 b 12
13 13 b 13
14 14 b 14
15 15 c 15
16 16 c 16
17 17 c 17
18 18 d 18
19 19 e 19
20 20 f 20
21 21 g 21
As you can see, a, b, and c appear in column b more than 10% of the time, so I'd like to keep them. On the other hand, d, e, f, and g appear less than 10% (actually about 5% of the time), so I'd like to bucket these by changing them into 'other':
df1['b']
0 a
1 a
2 a
3 a
4 a
5 a
6 a
7 a
8 b
9 b
10 b
11 b
12 b
13 b
14 b
15 c
16 c
17 c
18 other
19 other
20 other
21 other
I'd similarly like to be able to say that I only want to keep the values that appear in the top 2 in terms of frequency, so that column b looks like this:
df1['b']
0 a
1 a
2 a
3 a
4 a
5 a
6 a
7 a
8 b
9 b
10 b
11 b
12 b
13 b
14 b
15 other
16 other
17 other
18 other
19 other
20 other
21 other
I don't see an obvious way to do this in Pandas, although I admittedly know a lot more about this in R. Any ideas? Any thoughts on how to make this robust to Nones, which may appear more than 10% of the time or sit in the top x number of values?
This is kinda contorted, but it's kind of a complicated question.
First, get the counts:
In [24]: sizes = df1["b"].value_counts()
In [25]: sizes
Out[25]:
b
a 8
b 7
c 3
d 1
e 1
f 1
g 1
dtype: int64
Now, pick the indices you don't like:
In [27]: bad = sizes.index[sizes < df1.shape[0]*0.1]
In [28]: bad
Out[28]: Index([u'd', u'e', u'f', u'g'], dtype='object')
Finally, assign "other" to those rows containing bad indices:
In [34]: df1.loc[df1["b"].isin(bad), "b"] = "other"
In [36]: df1
Out[36]:
a b c
0 0 a 0
1 1 a 1
2 2 a 2
3 3 a 3
4 4 a 4
5 5 a 5
6 6 a 6
7 7 a 7
8 8 b 8
9 9 b 9
10 10 b 10
11 11 b 11
12 12 b 12
13 13 b 13
14 14 b 14
15 15 c 15
16 16 c 16
17 17 c 17
18 18 other 18
19 19 other 19
20 20 other 20
21 21 other 21
[22 rows x 3 columns]
value_counts() already sorts the counts in descending order, so you can take the first n index values to keep just the top n categories (see the snippet below).
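For instance, a sketch of that top-n variant (assuming a fresh df1 and sizes as computed in In [24]):

top2 = sizes.index[:2]                        # or sizes.nlargest(2).index
df1.loc[~df1["b"].isin(top2), "b"] = "other"  # bucket everything else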
Edit: you should be able to do something like this, replacing all instances of "b" with filterByColumn:
def filterDataFrame(df1, filterByColumn):
    sizes = df1[filterByColumn].value_counts()
    ...
Here's my solution:
def cleanupData(inputCol, fillString, cutoffPercent=None, cutoffNum=None):
    col = inputCol
    col.fillna(fillString, inplace=True)
    valueCounts = col.value_counts()
    totalAmount = valueCounts.sum()
    if cutoffPercent is not None and cutoffNum is not None:
        raise ValueError("both cutoffPercent and cutoffNum have values; please only give one of them")
    if cutoffPercent is not None:
        # keep only the values whose share of the column exceeds the percentage cutoff
        cutoffAmount = cutoffPercent * totalAmount
        valuesToKeep = valueCounts[valueCounts > cutoffAmount].index.tolist()
        print("keeping " + str(len(valuesToKeep)) + " unique values in the returned column")
    if cutoffNum is not None:
        # value_counts is sorted descending, so the first cutoffNum entries are the top values
        valuesToKeep = valueCounts.index.tolist()[0:cutoffNum]
    newlist = []
    for row in col:
        if row in valuesToKeep:  # exact match rather than substring containment
            newlist.append(row)
        else:
            newlist.append("Other")
    return newlist
##
cleanupData(df1['b'], "Other", cutoffNum=2)
['a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'Other', 'Other', 'Other', 'Other', 'Other', 'Other', 'Other']
Assign your frame to a variable (I call it x), then count how often each value occurs in column "b":
d = {s: (x["b"].values == s).sum() for s in set(x["b"].values)}
You can then use a boolean mask to assign a new value to column "b" wherever d[s] is below a certain threshold:
x.loc[x["b"].values == s, "b"] = "Call me Peter Maffay"
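Putting it together, a hedged, self-contained version of this idea (reusing df1 from the question and an assumed 10% cutoff):

x = df1.copy()
counts = x["b"].value_counts()
threshold = 0.1 * len(x)  # assumed 10% cutoff
for s, n in counts.items():
    if n < threshold:
        x.loc[x["b"] == s, "b"] = "other"  # replace rare values in column "b" only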