How to separate entries based on rows and columns in pandas dataframe - python

I have a dataframe that looks like this:
   '0'  '1'  '2'
0    5    4    0
1    3    0    0
2    1    0    2
The column names ('0', '1', '2', ...) are user ids, the index holds movie ids, and each entry is the rating that user gave that movie.
I would like to make a new dataframe, based on the previous one, that is like this:
   user_id  movie_id  rating
0        0         0       5
1        0         1       3
2        0         2       1
3        1         0       4
4        1         1       0
5        1         2       0
6        2         0       0
7        2         1       0
8        2         2       2
I am new to pandas and was wondering how to do this without iterating through all the entries.

You can get it with stack() followed by reset_index(). Note that stack() pairs the index (movie ids) with the columns (user ids), so stack the transpose to get user_id as the first column, in the desired order:
df = df.T.stack().reset_index()
df.columns = ['user_id', 'movie_id', 'rating']
print(df)
  user_id  movie_id  rating
0       0         0       5
1       0         1       3
2       0         2       1
3       1         0       4
4       1         1       0
5       1         2       0
6       2         0       0
7       2         1       0
8       2         2       2
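An equivalent reshape (a sketch, assuming the same frame with movie ids in the index) uses melt instead of stack:
import pandas as pd

df = pd.DataFrame({'0': [5, 3, 1], '1': [4, 0, 0], '2': [0, 0, 2]})

# melt turns each (movie, user) cell into its own row
out = (df.reset_index()
         .melt(id_vars='index', var_name='user_id', value_name='rating')
         .rename(columns={'index': 'movie_id'})
         [['user_id', 'movie_id', 'rating']]
         .sort_values(['user_id', 'movie_id'])
         .reset_index(drop=True))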

Related

replacing the value of one column conditional on two other columns in pandas

I have a dataframe df:
year  ID  category
1     1   0
2     1   1
3     1   1
4     1   0
1     2   0
2     2   0
3     2   1
4     2   0
I want to create a new column such that, within each ID, once 'category' is 1 in a particular year, 'new_category' will be 1 for that year and all upcoming years:
year  ID  category  new_category
1     1   0         0
2     1   1         1
3     1   1         1
4     1   0         1
1     2   0         0
2     2   0         0
3     2   1         1
4     2   0         1
I have tried an if-else condition, but I am getting the same 'category' column back:
for row in range(1, df.category[i-1]):
    df['new_category'] = df['category'].replace('0', df['category'].shift(1))
But I am not getting the desired column.
TRY:
df['new_category'] = df.groupby('ID')['category'].cummax()
OUTPUT:
   year  ID  category  new_category
0     1   1         0             0
1     2   1         1             1
2     3   1         1             1
3     4   1         0             1
4     1   2         0             0
5     2   2         0             0
6     3   2         1             1
7     4   2         0             1
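A minimal runnable sketch, with the frame rebuilt from the sample above; within each ID, the running maximum flips to 1 at the first 1 and stays there:
import pandas as pd

df = pd.DataFrame({'year': [1, 2, 3, 4, 1, 2, 3, 4],
                   'ID': [1, 1, 1, 1, 2, 2, 2, 2],
                   'category': [0, 1, 1, 0, 0, 0, 1, 0]})

# cummax propagates the first 1 forward within each ID group
df['new_category'] = df.groupby('ID')['category'].cummax()
print(df)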

Creating a column that assigns max value of set of rows by condition to all rows in that group

I have a dataframe that looks like this:
data  metadata
A     0
A     1
A     2
A     3
A     4
B     0
B     1
B     2
A     0
A     1
B     0
A     0
A     1
B     0
df.data contains two categories, A and B. df.metadata stores a running count of the number of times a category has appeared consecutively before the category changes. I want to create a column consecutive_count that assigns the max value of metadata per consecutive group to every row in that group. It should look like this:
data  metadata  consecutive_count
A     0         4
A     1         4
A     2         4
A     3         4
A     4         4
B     0         2
B     1         2
B     2         2
A     0         1
A     1         1
B     0         0
A     0         1
A     1         1
B     0         0
Please advise. Thank you.
Method 1:
You may try transform('max') on a groupby over each consecutive run of data:
s = df.data.ne(df.data.shift()).cumsum()
df['consecutive_count'] = df.groupby(s).metadata.transform('max')
Out[96]:
   data  metadata  consecutive_count
0     A         0                  4
1     A         1                  4
2     A         2                  4
3     A         3                  4
4     A         4                  4
5     B         0                  2
6     B         1                  2
7     B         2                  2
8     A         0                  1
9     A         1                  1
10    B         0                  0
11    A         0                  1
12    A         1                  1
13    B         0                  0
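Detail: the grouper s labels each consecutive run with its own id (values derived from the sample frame). The ne/shift comparison marks the rows where data changes, and cumsum turns those change points into run ids.
print(s)
0     1
1     1
2     1
3     1
4     1
5     2
6     2
7     2
8     3
9     3
10    4
11    5
12    5
13    6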
Method 2:
Since metadata is already sorted in ascending order within each run, you may reverse the dataframe and take a groupby cummax:
s = df.data.ne(df.data.shift()).cumsum()
df['consecutive_count'] = df[::-1].groupby(s).metadata.cummax()
Out[101]:
   data  metadata  consecutive_count
0     A         0                  4
1     A         1                  4
2     A         2                  4
3     A         3                  4
4     A         4                  4
5     B         0                  2
6     B         1                  2
7     B         2                  2
8     A         0                  1
9     A         1                  1
10    B         0                  0
11    A         0                  1
12    A         1                  1
13    B         0                  0
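A self-contained sketch of Method 1, with the sample frame rebuilt from the values above:
import pandas as pd

df = pd.DataFrame({'data': list('AAAAABBBAABAAB'),
                   'metadata': [0, 1, 2, 3, 4, 0, 1, 2, 0, 1, 0, 0, 1, 0]})

# run ids: increment whenever data differs from the previous row
s = df['data'].ne(df['data'].shift()).cumsum()
df['consecutive_count'] = df.groupby(s)['metadata'].transform('max')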

Group one column of dataframe by variable index

I have a dataframe which consists of PartialRoutes (which together make up full routes) and a treatment variable, and I am trying to reduce the dataframe to the full routes by grouping the partial routes together while keeping the treatment variable.
To make this more clear, the df looks like
PartialRoute  Treatment
0             1
1             0
0             0
0             0
1             0
2             0
3             0
0             0
1             1
2             0
where every 0 in 'PartialRoute' starts a new group, meaning I always want to group all values until a new route starts, i.e. until the next 0 appears.
So in this example there are 4 groups:
PartialRoute  Treatment
0             1
1             0
-----------------
0             0
-----------------
0             0
1             0
2             0
3             0
-----------------
0             0
1             1
2             0
-----------------
and the result should look like
Route  Treatment
0      1
1      0
2      0
3      1
Is there an elegant solution for this?
Create groups by comparing against 0 with Series.eq, take the cumulative sum with Series.cumsum, and then aggregate per group, e.g. by sum or max:
df1 = df.groupby(df['PartialRoute'].eq(0).cumsum())['Treatment'].sum().reset_index()
print (df1)
   PartialRoute  Treatment
0             1          1
1             2          0
2             3          0
3             4          1
Detail:
print (df['PartialRoute'].eq(0).cumsum())
0    1
1    1
2    2
3    3
4    3
5    3
6    3
7    4
8    4
9    4
Name: PartialRoute, dtype: int32
If the first value of the DataFrame is not 0, you get different group labels, starting at 0:
print (df)
   PartialRoute  Treatment
0             1          1
1             1          0
2             0          0
3             0          0
4             1          0
5             2          0
6             3          0
7             0          0
8             1          1
9             2          0
print (df['PartialRoute'].eq(0).cumsum())
0    0
1    0
2    1
3    2
4    2
5    2
6    2
7    3
8    3
9    3
Name: PartialRoute, dtype: int32
df1 = df.groupby(df['PartialRoute'].eq(0).cumsum())['Treatment'].sum().reset_index()
print (df1)
   PartialRoute  Treatment
0             0          1
1             1          0
2             2          0
3             3          1
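If Treatment is a 0/1 flag and a route should simply count as treated when any of its rows is treated, max works as the aggregator instead of sum (a sketch under that assumption; with the sample data both give the same result):
df1 = (df.groupby(df['PartialRoute'].eq(0).cumsum())['Treatment']
         .max()
         .reset_index())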

Add a column to a dataset whose values are filled by groups

I have a dataset with the columns week, shop, Item number, and price. I also have an array of unique numbers, which match the Item numbers but in a different order.
I want to add new columns to this dataset based on these unique numbers. First, I need to group the dataset by week and shop. Then, within a particular week and shop, I need to find the Item number equal to the new column name (an element of the array of unique numbers); if there is no such row, fill with null.
Then I should fill all fields in that particular week and shop with the price of this Item number.
Here is some code that I've tried, but it works very slowly because the number of rows is very large.
# real dataset
data2
weeks = data2['Week'].unique()
for k in range(len(Unique_number)):
    for i in range(len(weeks)):
        temp_array = data2.loc[data2["Week"] == weeks[i]]
        stores = temp_array['Shop'].unique()
        for j in range(len(stores)):
            temp_array2 = temp_array.loc[data2["Shop"] == stores[j]]
            price = temp_array2.loc[temp_array2["Item number"] == Unique_number[k], "Price"]
            if price.empty:
                price = 0
            else:
                price = price.values[0]
            data2.loc[(data2["Week"] == weeks[i]) & (data2["Shop"] == stores[j]), Unique_number[k]] = price
I want something like this
Unique_numbers = [0,1,2,3]
dataframe before
week  shop  Item number  price
1     1     0            2
1     2     1            3
2     1     3            4
2     1     2            5
3     4     1            6
3     1     2            7
dataframe after
week  shop  Item number  price  0  1  2  3
1     1     0            2      2  0  0  0
1     2     1            3      0  3  0  0
2     1     3            4      0  0  5  4
2     1     2            5      0  0  5  4
3     4     1            6      0  6  0  0
3     1     2            7      0  0  7  0
Setup
import numpy as np

u = df['Item number'].to_numpy()
w = np.asarray(Unique_numbers)
g = [df.week, df.shop]
Using a broadcasted comparison here (this assumes that all of your price values are greater than 0):
(pd.DataFrame(np.equal.outer(u, w) * df['price'].to_numpy()[:, None])
   .groupby(g)
   .transform('max'))
   0  1  2  3
0  2  0  0  0
1  0  3  0  0
2  0  0  5  4
3  0  0  5  4
4  0  6  0  0
5  0  0  7  0
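The result keeps the same RangeIndex as df, so it can be joined back; a sketch (passing columns=w names the new columns after Unique_numbers explicitly, rather than relying on them happening to be 0..3):
out = (pd.DataFrame(np.equal.outer(u, w) * df['price'].to_numpy()[:, None], columns=w)
         .groupby(g)
         .transform('max'))
df = df.join(out)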
This turns out to be a combination of pivot and merge:
df.merge(df.pivot_table(index=['week', 'shop'],
                        columns='Item number',
                        values='price',
                        fill_value=0)
           .reindex(Unique_numbers, axis=1),
         left_on=['week', 'shop'],
         right_index=True,
         how='left')
Output:
   week  shop  Item number  price  0  1  2  3
0     1     1            0      2  2  0  0  0
1     1     2            1      3  0  3  0  0
2     2     1            3      4  0  0  5  4
3     2     1            2      5  0  0  5  4
4     3     4            1      6  0  6  0  0
5     3     1            2      7  0  0  7  0
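One caveat (an edge case not covered by the sample data): if some value in Unique_numbers never occurs as an Item number at all, reindex introduces an all-NaN column, since pivot_table's fill_value only applies inside the pivot. Passing fill_value=0 to reindex as well keeps the output numeric:
df.merge(df.pivot_table(index=['week', 'shop'],
                        columns='Item number',
                        values='price',
                        fill_value=0)
           .reindex(Unique_numbers, axis=1, fill_value=0),
         left_on=['week', 'shop'],
         right_index=True,
         how='left')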

Python: how to get unique values over 2 different columns?

I have a dataframe like the following
df
   idA  idB  yA  yB
0    3    2   0   1
1    0    1   0   0
2    0    4   0   1
3    0    2   0   1
4    0    3   0   0
I would like to have a unique y for each id. So
df
   id  y
0   0  0
1   1  0
2   2  1
3   3  0
4   4  1
First create a new DataFrame by flattening the columns selected by iloc with numpy.ravel, then sort_values and drop_duplicates by the id column:
df2 = (pd.DataFrame({'id': df.iloc[:, :2].values.ravel(),
                     'y': df.iloc[:, 2:4].values.ravel()})
         .sort_values('id')
         .drop_duplicates(subset=['id'])
         .reset_index(drop=True))
print (df2)
   id  y
0   0  0
1   1  0
2   2  1
3   3  0
4   4  1
Detail:
print (pd.DataFrame({'id': df.iloc[:, :2].values.ravel(),
                     'y': df.iloc[:, 2:4].values.ravel()}))
   id  y
0   3  0
1   2  1
2   0  0
3   1  0
4   0  0
5   4  1
6   0  0
7   2  1
8   0  0
9   3  0
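Note that drop_duplicates keeps whichever row came first, so if an id appears with conflicting y values the result depends on the original order. A groupby alternative (a sketch) makes that choice explicit, e.g. taking the max per id:
df2 = (pd.DataFrame({'id': df.iloc[:, :2].values.ravel(),
                     'y': df.iloc[:, 2:4].values.ravel()})
         .groupby('id', as_index=False)['y']
         .max())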
