I have a dataframe that looks like this:
'0' '1' '2'
0 5 4 0
1 3 0 0
2 1 0 2
Where the name of the columns ('0', '1', '2', ...) represent user ids, the index represents movie ids, and each entry denotes the rating given by the user to that movie.
I would like to make a new dataframe, based on the previous one, that is like this:
user_id movie_id rating
0 0 0 5
1 0 1 3
2 0 2 1
3 1 0 4
4 1 1 0
5 1 2 0
6 2 0 0
7 2 1 0
8 2 2 2
I am new to pandas and was wondering how to do this without iterating through all the entries.
You can get it with stack(), and then reset_index():
df = df.stack().reset_index()
df.columns = ['user_id','movie_id','rating']
print(df)
user_id movie_id rating
0 0 0 5
1 0 1 4
2 0 2 0
3 1 0 3
4 1 1 0
5 1 2 0
6 2 0 1
7 2 1 0
8 2 2 2
I have a dataframe which consists of PartialRoutes (which result together in full routes) and a treatment variable and I am trying to reduce the dataframe to the full routes by grouping these together and keeping the treatment variable.
To make this more clear, the df looks like
PartialRoute Treatment
0 1
1 0
0 0
0 0
1 0
2 0
3 0
0 0
1 1
2 0
where every 0 in 'Partial Route' starts a new group, which means I always want to group all values until a new route starts/ a new 0 in index.
So in this example there exists 4 groups
PartialRoute Treatment
0 1
1 0
-----------------
0 0
-----------------
0 0
1 0
2 0
3 0
-----------------
0 0
1 1
2 0
-----------------
and the result should look like
Route Treatment
0 1
1 0
2 0
3 1
Is there any solution to solve this elegant?
Create groups by comparing by Series.eq with cumulative sum by Series.cumsum and then aggregate per groups, e.g. by sum or max:
df1 = df.groupby(df['PartialRoute'].eq(0).cumsum())['Treatment'].sum().reset_index()
print (df1)
PartialRoute Treatment
0 1 1
1 2 0
2 3 0
3 4 1
Detail:
print (df['PartialRoute'].eq(0).cumsum())
0 1
1 1
2 2
3 3
4 3
5 3
6 3
7 4
8 4
9 4
Name: PartialRoute, dtype: int32
If first value of DataFrame is not 0 get different groups - starting by 0:
print (df)
PartialRoute Treatment
0 1 1
1 1 0
2 0 0
3 0 0
4 1 0
5 2 0
6 3 0
7 0 0
8 1 1
9 2 0
print (df['PartialRoute'].eq(0).cumsum())
0 0
1 0
2 1
3 2
4 2
5 2
6 2
7 3
8 3
9 3
Name: PartialRoute, dtype: int32
df1 = df.groupby(df['PartialRoute'].eq(0).cumsum())['Treatment'].sum().reset_index()
print (df1)
PartialRoute Treatment
0 0 1
1 1 0
2 2 0
3 3 1
I have a dataset like:
Id Status
1 0
1 0
1 0
1 0
1 1
2 0
1 0 # --> gets removed since this row appears after id 1 already had a status of 1
2 0
3 0
3 0
I want to drop all rows of an id after its status became 1, i.e. my new dataset will be:
Id Status
1 0
1 0
1 0
1 0
1 1
2 0
2 0
3 0
3 0
I want to learn how to implement this computation efficiently since I have a very large (200 GB+) dataset.
The solution I currently have is to find the index of the first 1 and slice each group that way. In cases where no 1 exists, return the group unchanged:
def remove(series):
indexless = series.reset_index(drop=True)
ones = indexless[indexless['Status'] == 1]
if len(ones) > 0:
return indexless.iloc[:ones.index[0] + 1]
else:
return indexless
df.groupby('Id').apply(remove).reset_index(drop=True)
However, this runs very slowly, any way to fix this or to alternatively speed up the computation?
First idea is create cumulative sum per groups with boolean mask, but also necessary shift for avoid lost first 1:
#pandas 0.24+
s = (df['Status'] == 1).groupby(df['Id']).apply(lambda x: x.shift(fill_value=0).cumsum())
#pandas below
#s = (df['Status'] == 1).groupby(df['Id']).apply(lambda x: x.shift().fillna(0).cumsum())
df = df[s == 0]
print (df)
Id Status
0 1 0
1 1 0
2 1 0
3 1 0
4 1 1
5 2 0
7 2 0
8 3 0
9 3 0
Another solution is use custom lambda function with Series.idxmax:
def f(x):
if x['new'].any():
return x.iloc[:x['new'].idxmax()+1, :]
else:
return x
df1 = (df.assign(new=(df['Status'] == 1))
.groupby(df['Id'], group_keys=False)
.apply(f).drop('new', axis=1))
print (df1)
Id Status
0 1 0
1 1 0
2 1 0
3 1 0
4 1 1
5 2 0
8 2 0
9 3 0
10 3 0
Or a bit modified first solution - filter only groups with 1 and apply solutyion only there:
m = df['Status'].eq(1)
ids = df.loc[m, 'Id'].unique()
print (ids)
[1]
m1 = df['Id'].isin(m)
m2 = (m[m1].groupby(df['Id'])
.apply(lambda x: x.shift(fill_value=0).cumsum())
.eq(0))
df = df[m2.reindex(df.index, fill_value=True)]
print (df)
Id Status
0 1 0
1 1 0
2 1 0
3 1 0
4 1 1
5 2 0
8 2 0
9 3 0
10 3 0
Let's start with this dataset.
l =[[1,0],[1,0],[1,0],[1,0],[1,1],[2,0],[1,0], [2,0], [2,1],[3,0],[2,0], [3,0]]
df_ = pd.DataFrame(l, columns = ['id', 'status'])
We will find the status=1 index for each id.
status_1_indice = df_[df_['status']==1].reset_index()[['index', 'id']].set_index('id')
index
id
1 4
2 8
Now we join over df_ with status_1_indice
join_table = df_.join(status_1_indice, on='id').reset_index().fillna(np.inf)
Notice .fillna(np.inf) for id's that dont have status=1. Result:
level_0 id status index
0 0 1 0 4.000000
1 1 1 0 4.000000
2 2 1 0 4.000000
3 3 1 0 4.000000
4 4 1 1 4.000000
5 5 2 0 8.000000
6 6 1 0 4.000000
7 7 2 0 8.000000
8 8 2 1 8.000000
9 9 3 0 inf
10 10 2 0 8.000000
11 11 3 0 inf
Required dataframe can be obtained by:
join_table.query('level_0 <= index')[['id', 'status']]
Together:
status_1_indice = df_[df_['status']==1].reset_index()[['index', 'id']].set_index('id')
join_table = df_.join(status_1_indice, on='id').reset_index().fillna(np.inf)
required_df = join_table.query('level_0 <= index')[['id', 'status']]
id status
0 1 0
1 1 0
2 1 0
3 1 0
4 1 1
5 2 0
7 2 0
8 2 1
9 3 0
11 3 0
I cant vouch for the performance but this is more straight forward than the method in question.
Following is the dataframe I have. The 'Target' column is the desired output.
Group Item Value Target
1 0 5 0
1 1 4 0
1 0 6 0
1 0 3 1
1 1 2 0
1 0 1 1
2 1 8 0
2 0 9 0
2 0 7 1
In a given Group, if Item == 1, then I am trying to find the first future/next row where the Value is less than the corresponding Value for Item == 1. For example, in the second row, the Item == 1 and the corresponding Value is 4. The first future row where Value is less than 4 is the 4th row which has a Value of 3. Thereby, Target column specifies the find with a 1. It could be possible where two Item==1 has the same future row where conditions satisfy. In that case, we can also have a 1 in Target.
import pandas as pd
df = pd.DataFrame({'Group1': [1,1,1,1,1,1,2,2,2], 'Item': [0,1,0,0,1,0,1,0,0], 'Value': [5,4,6,3,2,1,8,9,7]})
df['next_Value'] = df.groupby(['Group'])['Value'].shift(-1)
Create a help key with cumsum , then we try to get the first value of each group by using transform, and compare each value within the group with the first value , if it is less we should return 1
df['helpkey']=df.groupby('Group').Item.cumsum()
df['New']=(df.Value<df.groupby(['Group','helpkey']).Value.transform('first')).astype(int)
df
Out[51]:
Group Item Value Target helpkey New
0 1 0 5 0 0 0
1 1 1 4 0 1 0
2 1 0 6 0 1 0
3 1 0 3 1 1 1
4 1 1 2 0 2 0
5 1 0 1 1 2 1
6 2 1 8 0 1 0
7 2 0 9 0 1 0
8 2 0 7 1 1 1
Given a pandas dataframe with a row per individual/record. A row includes a property value and its evolution across time (0 to N).
A schedule includes the estimated values of a variable 'property' for a number of entities from day 1 to day 10 in the following example.
I want to filter entities with unique values for a given period and get those values
csv=',property,1,2,3,4,5,6,7,8,9,10\n0,100011,0,0,0,0,3,3,3,3,3,0\n1,100012,0,0,0,0,2,2,2,8,8,0\n2, \
100012,0,0,0,0,2,2,2,2,2,0\n3,100012,0,0,0,0,0,0,0,0,0,0\n4,100011,0,0,0,0,2,2,2,2,2,0\n5, \
180011,0,0,0,0,2,2,2,2,2,0\n6,110012,0,0,0,0,0,0,0,0,0,0\n7,110011,0,0,0,0,3,3,3,3,3,0\n8, \
110012,0,0,0,0,3,3,3,3,3,0\n9,110013,0,0,0,0,0,0,0,0,0,0\n10,100011,0,0,0,0,3,3,3,3,4,0'
from StringIO import StringIO
import numpy as np
schedule = pd.read_csv(StringIO(csv), index_col=0)
print schedule
property 1 2 3 4 5 6 7 8 9 10
0 100011 0 0 0 0 3 3 3 3 3 0
1 100012 0 0 0 0 2 2 2 8 8 0
2 100012 0 0 0 0 2 2 2 2 2 0
3 100012 0 0 0 0 0 0 0 0 0 0
4 100011 0 0 0 0 2 2 2 2 2 0
5 180011 0 0 0 0 2 2 2 2 2 0
6 110012 0 0 0 0 0 0 0 0 0 0
7 110011 0 0 0 0 3 3 3 3 3 0
8 110012 0 0 0 0 3 3 3 3 3 0
9 110013 0 0 0 0 0 0 0 0 0 0
10 100011 0 0 0 0 3 3 3 3 4 0
I want to find records/individuals for who property has not changed during a given period and the corresponding unique values
Here is what i came with : I want to locate individuals with property in [100011, 100012, 1100012] between days 7 and 10
props = [100011, 100012, 1100012]
begin = 7
end = 10
res = schedule['property'].isin(props)
df = schedule.ix[res, begin:end]
print "df \n%s " %df
We have :
df
7 8 9
0 3 3 3
1 2 8 8
2 2 2 2
3 0 0 0
4 2 2 2
10 3 3 4
res = df.apply(lambda x: np.unique(x).size == 1, axis=1)
print "res : %s\n" %res
df_f = df.ix[res,]
print "df filtered %s \n" % df_f
res = pd.Series(df_f.values.ravel()).unique().tolist()
print "unique values : %s " %res
Giving :
res :
0 True
1 False
2 True
3 True
4 True
10 False
dtype: bool
df filtered
7 8 9
0 3 3 3
2 2 2 2
3 0 0 0
4 2 2 2
unique values : [3, 2, 0]
As those operations need to be run many times (in millions) on a million rows dataframe, i need to be able to run it as quickly as possible.
(#MaxU) : schedule can be seen as a database/repository updated many times. The repository is then requested as well many times for unique values
Would you have some ideas for improvements/ alternate ways ?
Given your df
7 8 9
0 3 3 3
1 2 8 8
2 2 2 2
3 0 0 0
4 2 2 2
10 3 3 4
You can simplify your code to:
df_f = df[df.apply(pd.Series.nunique, axis=1) == 1]
print(df_f)
7 8 9
0 3 3 3
2 2 2 2
3 0 0 0
4 2 2 2
And the final step to:
res = df_f.iloc[:,0].unique().tolist()
print(res)
[3, 2, 0]
It's not fully vectorised, but maybe this clarifies things a bit towards that?