Add a column to a dataset whose values are filled by groups

I have a dataset that contains the columns week, shop, Item number and price. I also have an array of unique numbers, which are the same as the Item numbers but in a different order.
I want to add new columns to this dataset based on these unique numbers. First, I need to group the dataset by week and shop. Then, within each week and shop, I need to find the Item number that equals the new column's name (an element of the array of unique numbers). If there is no such item, the field should be filled with a null value (0 in the example below).
Then I should fill all rows of that week and shop with the price of this Item number.
Here is some code that I've tried, but it runs very slowly because the number of rows is very large.
# real dataset
data2
weeks = data2['Week'].unique()
for k in range(len(Unique_number)):
    for i in range(len(weeks)):
        temp_array = data2.loc[data2["Week"] == weeks[i]]
        stores = temp_array['Shop'].unique()
        for j in range(len(stores)):
            temp_array2 = temp_array.loc[data2["Shop"] == stores[j]]
            price = temp_array2.loc[temp_array2["Item number"] == Unique_number[k], "Price"]
            if (price.empty):
                price = 0
            else:
                price = price.values[0]
            data2.loc[(data2["Week"] == weeks[i]) & (data2["Shop"] == stores[j]), Unique_number[k]] = price
I want something like this
Unique_numbers = [0,1,2,3]
dataframe before
week; shop; Item number; price
1 1 0 2
1 2 1 3
2 1 3 4
2 1 2 5
3 4 1 6
3 1 2 7
dataframe after
week; shop; Item number; price; 0; 1; 2; 3
1 1 0 2 2 0 0 0
1 2 1 3 0 3 0 0
2 1 3 4 0 0 5 4
2 1 2 5 0 0 5 4
3 4 1 6 0 6 0 0
3 1 2 7 0 0 7 0

Setup
u = df['Item number'].to_numpy()
w = np.asarray(Unique_numbers)
g = [df.week, df.shop]
Using a broadcasted comparison here (this assumes that all of your price values are greater than 0, because 0 marks "no match" before taking the per-group maximum).
pd.DataFrame(
    np.equal.outer(u, w) * df['price'].to_numpy()[:, None]).groupby(g).transform('max')
0 1 2 3
0 2 0 0 0
1 0 3 0 0
2 0 0 5 4
3 0 0 5 4
4 0 6 0 0
5 0 0 7 0
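As a minimal, self-contained sketch of the broadcasting idea above, the snippet below rebuilds the sample data from the question and attaches the new columns to the original frame. The names `mask`, `wide` and `result` are illustrative and not part of the original answer, and the same assumption applies: all prices must be greater than 0.

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "week": [1, 1, 2, 2, 3, 3],
    "shop": [1, 2, 1, 1, 4, 1],
    "Item number": [0, 1, 3, 2, 1, 2],
    "price": [2, 3, 4, 5, 6, 7],
})
Unique_numbers = [0, 1, 2, 3]

u = df["Item number"].to_numpy()
w = np.asarray(Unique_numbers)

# mask[i, j] is True when row i holds item Unique_numbers[j]
mask = np.equal.outer(u, w)

# Keep the row's price where the item matches, 0 elsewhere
wide = pd.DataFrame(mask * df["price"].to_numpy()[:, None],
                    columns=Unique_numbers, index=df.index)

# Spread each item's price over every row of the same (week, shop) group
wide = wide.groupby([df["week"], df["shop"]]).transform("max")

# Attach the new columns to the original frame
result = pd.concat([df, wide], axis=1)
print(result)
```

Passing `index=df.index` keeps the intermediate frame aligned with `df`, so the final `concat` lines up row by row even if `df` does not have a default index.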

This turns out to be a combination of pivot and merge:
df.merge(df.pivot_table(index=['week', 'shop'],
                        columns='Item number',
                        values='price',
                        fill_value=0)
           .reindex(Unique_numbers, axis=1),
         left_on=['week', 'shop'],
         right_index=True,
         how='left'
        )
Output:
week shop Item number price 0 1 2 3
0 1 1 0 2 2 0 0 0
1 1 2 1 3 0 3 0 0
2 2 1 3 4 0 0 5 4
3 2 1 2 5 0 0 5 4
4 3 4 1 6 0 6 0 0
5 3 1 2 7 0 0 7 0
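A runnable variant of the pivot-and-merge approach is sketched below on the question's sample data. It differs from the answer in two ways that are assumptions of this sketch: `aggfunc="first"` is used so that prices are not averaged if an item appears twice in the same week and shop, and missing values are filled after the `reindex`, so items in `Unique_numbers` that never occur in the data also get 0 rather than NaN.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "week": [1, 1, 2, 2, 3, 3],
    "shop": [1, 2, 1, 1, 4, 1],
    "Item number": [0, 1, 3, 2, 1, 2],
    "price": [2, 3, 4, 5, 6, 7],
})
Unique_numbers = [0, 1, 2, 3]

# One row per (week, shop), one column per item number
wide = (
    df.pivot_table(index=["week", "shop"],
                   columns="Item number",
                   values="price",
                   aggfunc="first")
      .reindex(columns=Unique_numbers)  # enforce the requested column order
      .fillna(0)                        # items absent from a group become 0
      .astype(int)
)

# Broadcast the group-level prices back onto every original row
result = df.merge(wide, left_on=["week", "shop"], right_index=True, how="left")
print(result)
```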

Related

How to separate entries based on rows and columns in pandas dataframe

I have a dataframe that looks like this:
'0' '1' '2'
0 5 4 0
1 3 0 0
2 1 0 2
Where the name of the columns ('0', '1', '2', ...) represent user ids, the index represents movie ids, and each entry denotes the rating given by the user to that movie.
I would like to make a new dataframe, based on the previous one, that is like this:
user_id movie_id rating
0 0 0 5
1 0 1 3
2 0 2 1
3 1 0 4
4 1 1 0
5 1 2 0
6 2 0 0
7 2 1 0
8 2 2 2
I am new to pandas and was wondering how to do this without iterating through all the entries.

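You can get it with stack() on the transposed frame, and then reset_index(). Transposing first puts the user ids (the original column names) in the outer level, so the rows come out ordered by user and then by movie, as in the desired output:
df = df.T.stack().reset_index()
df.columns = ['user_id','movie_id','rating']
print(df)
user_id movie_id rating
0 0 0 5
1 0 1 3
2 0 2 1
3 1 0 4
4 1 1 0
5 1 2 0
6 2 0 0
7 2 1 0
8 2 2 2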
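The same wide-to-long reshape can also be sketched with `melt`, which names the output columns explicitly. The frame name `ratings` and the variable `long_df` are illustrative; the data is the example from the question, where column labels are user ids (strings) and the index holds movie ids.

```python
import pandas as pd

# Columns are user ids (as strings), the index is the movie id
ratings = pd.DataFrame({"0": [5, 3, 1], "1": [4, 0, 0], "2": [0, 0, 2]})

long_df = (
    ratings.rename_axis("movie_id")   # name the index so it survives reset_index
           .reset_index()
           .melt(id_vars="movie_id", var_name="user_id", value_name="rating")
           [["user_id", "movie_id", "rating"]]
)
print(long_df)
```

Note that `user_id` stays a string here because the original column labels are strings; convert with `astype(int)` if integers are needed.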

Group one column of dataframe by variable index

I have a dataframe which consists of PartialRoutes (which result together in full routes) and a treatment variable and I am trying to reduce the dataframe to the full routes by grouping these together and keeping the treatment variable.
To make this more clear, the df looks like
PartialRoute Treatment
0 1
1 0
0 0
0 0
1 0
2 0
3 0
0 0
1 1
2 0
where every 0 in 'PartialRoute' starts a new group, which means I always want to group all rows together until a new route starts (that is, until the next 0 in 'PartialRoute').
So in this example there are 4 groups
PartialRoute Treatment
0 1
1 0
-----------------
0 0
-----------------
0 0
1 0
2 0
3 0
-----------------
0 0
1 1
2 0
-----------------
and the result should look like
Route Treatment
0 1
1 0
2 0
3 1
Is there an elegant way to solve this?

Create the groups by comparing the column with 0 using Series.eq and taking the cumulative sum with Series.cumsum, then aggregate per group, e.g. by sum or max:
df1 = df.groupby(df['PartialRoute'].eq(0).cumsum())['Treatment'].sum().reset_index()
print (df1)
PartialRoute Treatment
0 1 1
1 2 0
2 3 0
3 4 1
Detail:
print (df['PartialRoute'].eq(0).cumsum())
0 1
1 1
2 2
3 3
4 3
5 3
6 3
7 4
8 4
9 4
Name: PartialRoute, dtype: int32
If the first value of the DataFrame is not 0, you get different group labels, starting from 0:
print (df)
PartialRoute Treatment
0 1 1
1 1 0
2 0 0
3 0 0
4 1 0
5 2 0
6 3 0
7 0 0
8 1 1
9 2 0
print (df['PartialRoute'].eq(0).cumsum())
0 0
1 0
2 1
3 2
4 2
5 2
6 2
7 3
8 3
9 3
Name: PartialRoute, dtype: int32
df1 = df.groupby(df['PartialRoute'].eq(0).cumsum())['Treatment'].sum().reset_index()
print (df1)
PartialRoute Treatment
0 0 1
1 1 0
2 2 0
3 3 1
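For reference, here is a small self-contained sketch of the `eq(0).cumsum()` grouping on the question's data. It uses `max` as the aggregation (so a route is flagged if any of its partial routes was treated) and subtracts 1 so that routes are numbered from 0 as in the desired output; the names `route_id` and `routes` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "PartialRoute": [0, 1, 0, 0, 1, 2, 3, 0, 1, 2],
    "Treatment":    [1, 0, 0, 0, 0, 0, 0, 0, 1, 0],
})

# Each 0 in PartialRoute starts a new route; the running count of zeros labels the route
route_id = df["PartialRoute"].eq(0).cumsum().sub(1).rename("Route")

# A route counts as treated if any of its partial routes was treated
routes = df.groupby(route_id)["Treatment"].max().reset_index()
print(routes)
```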

Efficiently Drop Rows in a Pandas Dataframe

I have a dataset like:
Id Status
1 0
1 0
1 0
1 0
1 1
2 0
1 0 # --> gets removed since this row appears after id 1 already had a status of 1
2 0
3 0
3 0
I want to drop all rows of an id after its status became 1, i.e. my new dataset will be:
Id Status
1 0
1 0
1 0
1 0
1 1
2 0
2 0
3 0
3 0
I want to learn how to implement this computation efficiently since I have a very large (200 GB+) dataset.
The solution I currently have is to find the index of the first 1 and slice each group that way. In cases where no 1 exists, return the group unchanged:
def remove(series):
indexless = series.reset_index(drop=True)
ones = indexless[indexless['Status'] == 1]
if len(ones) > 0:
return indexless.iloc[:ones.index[0] + 1]
else:
return indexless
df.groupby('Id').apply(remove).reset_index(drop=True)
However, this runs very slowly, any way to fix this or to alternatively speed up the computation?

The first idea is to create a cumulative sum per group from a boolean mask; a shift is also necessary so that the first 1 itself is not lost:
#pandas 0.24+
s = (df['Status'] == 1).groupby(df['Id'], group_keys=False).apply(lambda x: x.shift(fill_value=False).cumsum())
#older pandas
#s = (df['Status'] == 1).groupby(df['Id']).apply(lambda x: x.shift().fillna(0).cumsum())
df = df[s == 0]
print (df)
Id Status
0 1 0
1 1 0
2 1 0
3 1 0
4 1 1
5 2 0
7 2 0
8 3 0
9 3 0
Another solution is to use a custom function with Series.idxmax. Because idxmax returns an index label, the slice uses loc (label-based and inclusive) rather than iloc:
def f(x):
    if x['new'].any():
        return x.loc[:x['new'].idxmax(), :]
    else:
        return x
df1 = (df.assign(new=(df['Status'] == 1))
         .groupby(df['Id'], group_keys=False)
         .apply(f).drop('new', axis=1))
print (df1)
Id Status
0 1 0
1 1 0
2 1 0
3 1 0
4 1 1
5 2 0
7 2 0
8 3 0
9 3 0
Or a slightly modified version of the first solution: filter only the groups that contain a 1 and apply the solution only there:
m = df['Status'].eq(1)
ids = df.loc[m, 'Id'].unique()
print (ids)
[1]
m1 = df['Id'].isin(ids)
m2 = (m[m1].groupby(df['Id'], group_keys=False)
           .apply(lambda x: x.shift(fill_value=False).cumsum())
           .eq(0))
df = df[m2.reindex(df.index, fill_value=True)]
print (df)
Id Status
0 1 0
1 1 0
2 1 0
3 1 0
4 1 1
5 2 0
7 2 0
8 3 0
9 3 0
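The cumulative-sum idea can also be sketched without `apply`, using the built-in grouped `cumsum`, which may matter on very large data. This is only a sketch on the question's sample data; the names `is_one`, `ones_before` and `kept` are illustrative. Subtracting the current row's flag from the running count gives the number of 1s strictly before each row, which plays the role of the shift.

```python
import pandas as pd

df = pd.DataFrame({
    "Id":     [1, 1, 1, 1, 1, 2, 1, 2, 3, 3],
    "Status": [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
})

# 1 where Status is 1, else 0
is_one = df["Status"].eq(1).astype(int)

# Number of 1s seen for this Id strictly before the current row
ones_before = is_one.groupby(df["Id"]).cumsum() - is_one

# Keep a row only if its Id had no earlier Status == 1
kept = df[ones_before.eq(0)]
print(kept)
```

For data that does not fit in memory, the same logic would still need to be applied chunk by chunk while carrying over the set of ids that have already reached status 1; that part is not shown here.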

Let's start with this dataset.
import numpy as np
import pandas as pd
l =[[1,0],[1,0],[1,0],[1,0],[1,1],[2,0],[1,0], [2,0], [2,1],[3,0],[2,0], [3,0]]
df_ = pd.DataFrame(l, columns = ['id', 'status'])
We will find the status=1 index for each id.
status_1_indice = df_[df_['status']==1].reset_index()[['index', 'id']].set_index('id')
index
id
1 4
2 8
Now we join over df_ with status_1_indice
join_table = df_.join(status_1_indice, on='id').reset_index().fillna(np.inf)
Notice the .fillna(np.inf) for ids that don't have status=1. Result:
level_0 id status index
0 0 1 0 4.000000
1 1 1 0 4.000000
2 2 1 0 4.000000
3 3 1 0 4.000000
4 4 1 1 4.000000
5 5 2 0 8.000000
6 6 1 0 4.000000
7 7 2 0 8.000000
8 8 2 1 8.000000
9 9 3 0 inf
10 10 2 0 8.000000
11 11 3 0 inf
Required dataframe can be obtained by:
join_table.query('level_0 <= index')[['id', 'status']]
Together:
status_1_indice = df_[df_['status']==1].reset_index()[['index', 'id']].set_index('id')
join_table = df_.join(status_1_indice, on='id').reset_index().fillna(np.inf)
required_df = join_table.query('level_0 <= index')[['id', 'status']]
id status
0 1 0
1 1 0
2 1 0
3 1 0
4 1 1
5 2 0
7 2 0
8 2 1
9 3 0
11 3 0
I can't vouch for the performance, but this is more straightforward than the method in the question.
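A variant of this cutoff idea is sketched below using `map` instead of a join. It takes only the first status=1 row per id, so an id with several status=1 rows does not produce duplicated rows. The sketch assumes a default integer index that reflects row order; the names `first_one`, `cutoff` and `required_df` are illustrative.

```python
import numpy as np
import pandas as pd

rows = [[1, 0], [1, 0], [1, 0], [1, 0], [1, 1], [2, 0],
        [1, 0], [2, 0], [2, 1], [3, 0], [2, 0], [3, 0]]
df_ = pd.DataFrame(rows, columns=["id", "status"])

# Index of the first status == 1 row for each id that has one
first_one = (
    df_.loc[df_["status"].eq(1)]
       .reset_index()               # default index becomes a column named "index"
       .groupby("id")["index"]
       .first()
)

# Per-row cutoff; ids that never reach status 1 get an infinite cutoff
cutoff = df_["id"].map(first_one).fillna(np.inf)

# Keep rows at or before their id's cutoff
required_df = df_[df_.index.to_numpy() <= cutoff.to_numpy()]
print(required_df)
```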

Conditional search in future rows within groupby in pandas

Following is the dataframe I have. The 'Target' column is the desired output.
Group Item Value Target
1 0 5 0
1 1 4 0
1 0 6 0
1 0 3 1
1 1 2 0
1 0 1 1
2 1 8 0
2 0 9 0
2 0 7 1
In a given Group, if Item == 1, I am trying to find the first future (next) row whose Value is less than the Value of that Item == 1 row. For example, in the second row Item == 1 and the corresponding Value is 4. The first future row where Value is less than 4 is the 4th row, which has a Value of 3, so the Target column marks that row with a 1. It is possible for two Item == 1 rows to share the same future row that satisfies the condition; in that case that row also gets a 1 in Target.
import pandas as pd
df = pd.DataFrame({'Group': [1,1,1,1,1,1,2,2,2], 'Item': [0,1,0,0,1,0,1,0,0], 'Value': [5,4,6,3,2,1,8,9,7]})
df['next_Value'] = df.groupby(['Group'])['Value'].shift(-1)

Create a helper key with cumsum, then get the first value of each (Group, helpkey) group using transform, and compare each value within the group with that first value; if it is less, return 1:
df['helpkey']=df.groupby('Group').Item.cumsum()
df['New']=(df.Value<df.groupby(['Group','helpkey']).Value.transform('first')).astype(int)
df
Out[51]:
Group Item Value Target helpkey New
0 1 0 5 0 0 0
1 1 1 4 0 1 0
2 1 0 6 0 1 0
3 1 0 3 1 1 1
4 1 1 2 0 2 0
5 1 0 1 1 2 1
6 2 1 8 0 1 0
7 2 0 9 0 1 0
8 2 0 7 1 1 1
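The helper-key approach above can be sketched as a self-contained script with two refinements that are assumptions of this sketch, not part of the answer: rows that come before the first Item == 1 in a group are never flagged, and only the first qualifying row after each Item == 1 is marked. As in the answer, the search for a given Item == 1 row stops at the next Item == 1 row of the same group. The column name `segment` and the variables `ref`, `hit` and `first_hit` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Group": [1, 1, 1, 1, 1, 1, 2, 2, 2],
    "Item":  [0, 1, 0, 0, 1, 0, 1, 0, 0],
    "Value": [5, 4, 6, 3, 2, 1, 8, 9, 7],
})

# Segment counter: increases by one at every Item == 1 within a group
df["segment"] = df.groupby("Group")["Item"].cumsum()

# Reference value: the Value of the row that opened each segment
ref = df.groupby(["Group", "segment"])["Value"].transform("first")

# Candidate rows: lower than the reference and located after some Item == 1
hit = (df["Value"] < ref) & (df["segment"] > 0)

# Keep only the first candidate in each segment
running = hit.astype(int).groupby([df["Group"], df["segment"]]).cumsum()
first_hit = hit & running.eq(1)

df["Target"] = first_hit.astype(int)
print(df)
```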

Finding efficiently pandas (part of) rows with unique values

Given a pandas dataframe with one row per individual/record. Each row includes a property value and its evolution over time (0 to N).
In the following example, a schedule holds the estimated values of a variable 'property' for a number of entities from day 1 to day 10.
I want to filter the entities that have a single unique value over a given period and get those values.
csv=',property,1,2,3,4,5,6,7,8,9,10\n0,100011,0,0,0,0,3,3,3,3,3,0\n1,100012,0,0,0,0,2,2,2,8,8,0\n2, \
100012,0,0,0,0,2,2,2,2,2,0\n3,100012,0,0,0,0,0,0,0,0,0,0\n4,100011,0,0,0,0,2,2,2,2,2,0\n5, \
180011,0,0,0,0,2,2,2,2,2,0\n6,110012,0,0,0,0,0,0,0,0,0,0\n7,110011,0,0,0,0,3,3,3,3,3,0\n8, \
110012,0,0,0,0,3,3,3,3,3,0\n9,110013,0,0,0,0,0,0,0,0,0,0\n10,100011,0,0,0,0,3,3,3,3,4,0'
from io import StringIO  # Python 3 location of StringIO
import numpy as np
import pandas as pd
schedule = pd.read_csv(StringIO(csv), index_col=0)
print(schedule)
property 1 2 3 4 5 6 7 8 9 10
0 100011 0 0 0 0 3 3 3 3 3 0
1 100012 0 0 0 0 2 2 2 8 8 0
2 100012 0 0 0 0 2 2 2 2 2 0
3 100012 0 0 0 0 0 0 0 0 0 0
4 100011 0 0 0 0 2 2 2 2 2 0
5 180011 0 0 0 0 2 2 2 2 2 0
6 110012 0 0 0 0 0 0 0 0 0 0
7 110011 0 0 0 0 3 3 3 3 3 0
8 110012 0 0 0 0 3 3 3 3 3 0
9 110013 0 0 0 0 0 0 0 0 0 0
10 100011 0 0 0 0 3 3 3 3 4 0
I want to find the records/individuals for whom the property has not changed during a given period, along with the corresponding unique values.
Here is what I came up with: I want to locate individuals with property in [100011, 100012, 1100012] between days 7 and 10.
props = [100011, 100012, 1100012]
begin = 7
end = 10
res = schedule['property'].isin(props)
# .ix has been removed from pandas; select the rows by mask and the columns by position
df = schedule.loc[res].iloc[:, begin:end]
print("df \n%s " % df)
We have :
df
7 8 9
0 3 3 3
1 2 8 8
2 2 2 2
3 0 0 0
4 2 2 2
10 3 3 4
res = df.apply(lambda x: np.unique(x).size == 1, axis=1)
print("res : %s\n" % res)
df_f = df.loc[res]
print("df filtered %s \n" % df_f)
res = pd.Series(df_f.values.ravel()).unique().tolist()
print("unique values : %s " % res)
Giving :
res :
0 True
1 False
2 True
3 True
4 True
10 False
dtype: bool
df filtered
7 8 9
0 3 3 3
2 2 2 2
3 0 0 0
4 2 2 2
unique values : [3, 2, 0]
As these operations need to be run many times (millions) on a dataframe with a million rows, I need them to run as quickly as possible.
(#MaxU) : schedule can be seen as a database/repository that is updated many times. The repository is then also queried many times for unique values.
Would you have any ideas for improvements or alternative approaches?

Given your df
7 8 9
0 3 3 3
1 2 8 8
2 2 2 2
3 0 0 0
4 2 2 2
10 3 3 4
You can simplify your code to:
df_f = df[df.apply(pd.Series.nunique, axis=1) == 1]
print(df_f)
7 8 9
0 3 3 3
2 2 2 2
3 0 0 0
4 2 2 2
And the final step to:
res = df_f.iloc[:,0].unique().tolist()
print(res)
[3, 2, 0]
It's not fully vectorised, but maybe this clarifies things a bit towards that?
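A fully vectorised check can be sketched by comparing every column with the first column of the selected period in NumPy: a row has a single unique value exactly when all of its entries equal its first entry. The frame below reproduces the `df` shown above; the names `values` and `constant` are illustrative.

```python
import pandas as pd

# The selected period from the example above (days 7 to 9)
df = pd.DataFrame(
    {"7": [3, 2, 2, 0, 2, 3],
     "8": [3, 8, 2, 0, 2, 3],
     "9": [3, 8, 2, 0, 2, 4]},
    index=[0, 1, 2, 3, 4, 10],
)

values = df.to_numpy()

# True for rows whose entries all equal the row's first entry
constant = (values == values[:, [0]]).all(axis=1)

df_f = df[constant]
res = df_f.iloc[:, 0].unique().tolist()
print(df_f)
print(res)
```

This avoids the per-row Python call made by `apply(..., axis=1)`; whether it is fast enough for millions of repeated queries would need to be measured on the real data.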
