I have a pandas DataFrame that looks similar to the one below:
df = pd.DataFrame({
'label': [0, 0, 2, 3, 8, 8, 9],
'value1': [2, 1, 9, 8, 7, 4, 2],
'value2': [0, 1, 9, 4, 2, 3, 1],
})
>>> df
label value1 value2
0 0 2 0
1 0 1 1
2 2 9 9
3 3 8 4
4 8 7 2
5 8 4 3
6 9 2 1
The values in the label column are not contiguous (not range(0, n, 1)) due to previous slicing. I would like to relabel this column with a sequential range of ascending values so that it becomes:
>>> df
label value1 value2
0 1 2 0
1 1 1 1
2 2 9 9
3 3 8 4
4 4 7 2
5 4 4 3
6 5 2 1
I currently use the code below. Because my real DataFrame has thousands of unique values, any suggestion for doing this more efficiently (i.e., without looping over every unique value) would be appreciated.
for new_idx, idx in enumerate(df.label.unique(), start=1):
    df.loc[df['label'] == idx, 'label'] = new_idx
Thanks in advance
Use factorize for improved performance:
df['label'] = pd.factorize(df['label'])[0] + 1
print(df)
label value1 value2
0 1 2 0
1 1 1 1
2 2 9 9
3 3 8 4
4 4 7 2
5 4 4 3
6 5 2 1
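As a side note, pd.factorize also returns the unique values in code order, so the mapping back to the original labels is preserved (a small sketch on the original data):
codes, uniques = pd.factorize([0, 0, 2, 3, 8, 8, 9])
# codes   -> array([0, 0, 1, 2, 3, 3, 4]); add 1 for the 1-based labels above
# uniques -> array([0, 2, 3, 8, 9]); uniques[code] recovers the original label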
Another idea with Series.rank:
df['label'] = df['label'].rank(method='dense').astype(int)
print(df)
label value1 value2
0 1 2 0
1 1 1 1
2 2 9 9
3 3 8 4
4 4 7 2
5 4 4 3
6 5 2 1
Both give the same result only if the labels already appear in ascending order, because factorize numbers by order of appearance while rank numbers by sorted value:
# data changed to see the difference
df = pd.DataFrame({
'label': [0, 10, 10, 3, 8, 8, 9],
'value1': [2, 1, 9, 8, 7, 4, 2],
'value2': [0, 1, 9, 4, 2, 3, 1],
})
df['label1'] = pd.factorize(df['label'])[0] + 1
df['label2'] = df['label'].rank(method='dense').astype(int)
print(df)
label value1 value2 label1 label2
0 0 2 0 1 1
1 10 1 1 2 5
2 10 9 9 2 5
3 3 8 4 3 2
4 8 7 2 4 3
5 8 4 3 4 3
6 9 2 1 5 4
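For completeness, the same order-of-appearance relabelling can also be written with an explicit mapping and Series.map (a sketch on the original frame; factorize remains the faster vectorized route):
mapping = {old: new for new, old in enumerate(df['label'].unique(), start=1)}
df['label'] = df['label'].map(mapping)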
I have a small subset of data here:
import pandas as pd
days = [1, 2, 3]
time = [2, 4, 2, 4, 2, 4, 2, 4, 2]
df1 = pd.DataFrame(days)
df2 = pd.Series(time)
df2 = df2.transpose()
df3 = df1*df2
df1 is a column of data and df2 is a row of data. I need a 3x9 DataFrame where the row is multiplied by each value in the column to make one large DataFrame.
The end result should look like:
df3 = [2 4 2 4 2 4 2 4 2
4 8 4 8 4 8 4 8 4
6 12 6 12 6 12 6 12 6 ]
The way I currently have it set up for my larger dataset, only a few data points are correctly multiplied and most are NaNs.
Multiplying df1 * df2 fails because pandas aligns the Series index of df2 against the columns of df1, leaving everything but the overlapping label 0 as NaN. The dot product is one solution to this problem:
import pandas as pd
days = [1, 2, 3]
time = [2, 4, 2, 4, 2, 4, 2, 4, 2]
df1 = pd.DataFrame(days)
df2 = pd.DataFrame(time)
# use dot
df3 = df1.dot(df2.T)
df3
Output
0 1 2 3 4 5 6 7 8
0 2 4 2 4 2 4 2 4 2
1 4 8 4 8 4 8 4 8 4
2 6 12 6 12 6 12 6 12 6
Try this:
df1.dot(df2.to_frame().T)
Output:
0 1 2 3 4 5 6 7 8
0 2 4 2 4 2 4 2 4 2
1 4 8 4 8 4 8 4 8 4
2 6 12 6 12 6 12 6 12 6
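Equivalently, since this is an outer product, NumPy's np.outer gives the same result (a sketch built from the lists in the question):
import numpy as np
import pandas as pd

days = [1, 2, 3]
time = [2, 4, 2, 4, 2, 4, 2, 4, 2]
# multiply every element of days by the whole time row at once
df3 = pd.DataFrame(np.outer(days, time))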
I have a carid and I would like to see all buyers who had something to do with this carid. So I would like all rows for the buyers who have bought carid 3.
How do I do that?
import pandas as pd
d = {'Buyerid': [1,1,2,2,3,3,3,4,5,5,5],
'Carid': [1,2,3,4,4,1,2,4,1,3,5]}
df = pd.DataFrame(data=d)
print(df)
Buyerid Carid
0 1 1
1 1 2
2 2 3
3 2 4
4 3 4
5 3 1
6 3 2
7 4 4
8 5 1
9 5 3
10 5 5
# What I want
Buyerid Carid
2 2 3
3 2 4
8 5 1
9 5 3
10 5 5
I have already tested df = df.loc[df['Carid']==3,'Buyerid'], but this only gives me the rows with Carid 3, not all rows for those buyers.
How to select rows from a DataFrame based on column values
I looked at that, but I only get this:
Buyerid Carid
2 2 3
9 5 3
Do the following:
import pandas as pd
d = {'Buyerid': [1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 5],
'Carid': [1, 2, 3, 4, 4, 1, 2, 4, 1, 3, 5]}
df = pd.DataFrame(data=d)
# all buyers who bought Carid 3
buyers = set(df.loc[df['Carid'] == 3, 'Buyerid'])
# boolean mask for filtering
mask = df['Buyerid'].isin(buyers)
print(df[mask])
Output
Buyerid Carid
2 2 3
3 2 4
8 5 1
9 5 3
10 5 5
You can use df.loc to get the matching buyer ids and then keep all their rows with isin:
buyers = df.loc[df['Carid'] == 3, 'Buyerid']
df.loc[df['Buyerid'].isin(buyers)]
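Alternatively, a one-liner with GroupBy.filter (a sketch; the Python-level lambda makes it slower than isin when there are many groups):
# keep every buyer group that contains at least one purchase of Carid 3
df.groupby('Buyerid').filter(lambda g: (g['Carid'] == 3).any())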
I have the following DataFrame:
>>> df = pd.DataFrame({"a": [1, 1, 1, 1, 2, 2, 3, 3, 3], "b": [1, 5, 7, 9, 2, 4, 6, 14, 5], "c": [1, 0, 0, 1, 1, 1, 1, 0, 1]})
>>> df
a b c
0 1 1 1
1 1 5 0
2 1 7 0
3 1 9 1
4 2 2 1
5 2 4 1
6 3 6 1
7 3 14 0
8 3 5 1
I want to calculate the mode of column c for every unique value in a and then select the rows where c has this value.
This is my own solution:
>>> major_types = df.groupby(['a'])['c'].apply(lambda x: pd.Series.mode(x)[0])
>>> df = df.merge(major_types, how="left", right_index=True, left_on="a", suffixes=("", "_major"))
>>> df = df[df['c'] == df['c_major']].drop(columns="c_major")
Which would output the following:
>>> df
a b c
1 1 5 0
2 1 7 0
4 2 2 1
5 2 4 1
6 3 6 1
8 3 5 1
It is very inefficient for large DataFrames. Any idea on what to do?
IIUC, use GroupBy.transform instead of apply + merge:
df.loc[df['c'].eq(df.groupby('a')['c'].transform(lambda x: x.mode()[0]))]
a b c
1 1 5 0
2 1 7 0
4 2 2 1
5 2 4 1
6 3 6 1
8 3 5 1
Or with group sizes (the maximum must be taken within each a group; note that under ties, as with a == 1 here, this keeps every tied mode rather than only mode()[0]):
s = df.groupby(['a','c'])['c'].transform('size')
df.loc[s.eq(s.groupby(df['a']).transform('max'))]
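A further variant computes each group's mode once and broadcasts it back with map (a sketch; it selects the same rows as the transform above):
modes = df.groupby('a')['c'].agg(lambda x: x.mode()[0])
df.loc[df['c'].eq(df['a'].map(modes))]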
import pandas as pd
import numpy as np
columns = ['A', 'B', 'C']
data = [[0, 10, 5], [0, 12, 5], [2, 34, 13], [2, 3, 13], [4, 5, 8], [2, 4, 8], [1, 2, 4], [1, 3, 4], [3, 8, 12],[4,10,12],[6,7,12]]
df = pd.DataFrame(data, columns=columns)
print(df)
# A B C
# 0 0 10 5
# 1 0 12 5
# 2 2 34 13
# 3 2 3 13
# 4 4 5 8
# 5 2 4 8
# 6 1 2 4
# 7 1 3 4
# 8 3 8 12
# 9 4 10 12
# 10 6 7 12
Now I want to create two DataFrames, df_train and df_test, such that rows sharing a value in column 'C' never end up in different sets. E.g., the element 5 in column C should go to either the training set or the test set, so the rows [0, 10, 5] and [0, 12, 5] end up in the same set, never split between both. Which C values go to which set should be chosen randomly.
I am stuck on this step and cannot proceed.
First shuffle your df with sample, then group by C and use cumcount to distinguish the duplicated values within each group:
s=df.sample(len(df)).groupby('C').cumcount()
s
Out[481]:
5 0
7 0
2 0
1 0
0 1
6 1
10 0
4 1
3 1
8 1
9 2
dtype: int64
test=df.loc[s[s==1].index]
train=df.loc[s[s==0].index]
test
Out[483]:
A B C
0 0 10 5
6 1 2 4
4 4 5 8
3 2 3 13
8 3 8 12
train
Out[484]:
A B C
5 2 4 8
7 1 3 4
2 2 34 13
1 0 12 5
10 6 7 12
Note that a C value occurring three or more times gets cumcount values of 2 and up, so such rows (index 9 above, where s == 2) land in neither set, and rows sharing a C value are spread across both sets.
The question is not so clear about what the expected output of the train and test set DataFrames should look like. Anyway, I will try to answer.
I think you can first sort the dataframe values:
df_sorted = df.sort_values(['C'], ascending=[True])
print(df_sorted)
Out[1]:
A B C
6 1 2 4
7 1 3 4
0 0 10 5
1 0 12 5
4 4 5 8
5 2 4 8
8 3 8 12
9 4 10 12
10 6 7 12
2 2 34 13
3 2 3 13
Then split the sorted dataframe:
unique_c = df_sorted['C'].unique().tolist()
print(unique_c)
Out[2]:
[4, 5, 8, 12, 13]
train_set = df[df['C'] <= unique_c[2]]
val_set = df[df['C'] > unique_c[2]]
print(train_set)
# Train set dataframe
Out[3]:
A B C
0 0 10 5
1 0 12 5
4 4 5 8
5 2 4 8
6 1 2 4
7 1 3 4
print(val_set)
# Test set dataframe
Out[4]:
A B C
2 2 34 13
3 2 3 13
8 3 8 12
9 4 10 12
10 6 7 12
Of the 11 samples, 6 go to the train set and 5 to the validation set after the split, so no samples are missing when the two DataFrames are combined.
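If a random split is needed while still keeping every row with the same C value in one set, a minimal sketch (the 60/40 ratio and the fixed seed are assumptions):
import numpy as np

rng = np.random.default_rng(0)        # seeded only for reproducibility
unique_c = df['C'].unique()
rng.shuffle(unique_c)                 # random assignment of C values to the sets
n_train = int(len(unique_c) * 0.6)    # assumed 60/40 split over unique C values
df_train = df[df['C'].isin(unique_c[:n_train])]
df_test = df[df['C'].isin(unique_c[n_train:])]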
I have a dataframe with two columns:
x y
0 1
1 1
2 2
0 5
1 6
2 8
0 1
1 8
2 4
0 1
1 7
2 3
What I want is:
x val1 val2 val3 val4
0 1 5 1 1
1 1 6 8 7
2 2 8 4 3
I know that the values in column x repeat in the same order N times.
You could use groupby/cumcount to assign column numbers and then call pivot:
import pandas as pd
df = pd.DataFrame({'x': [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2],
'y': [1, 1, 2, 5, 6, 8, 1, 8, 4, 1, 7, 3]})
df['columns'] = df.groupby('x')['y'].cumcount()
# x y columns
# 0 0 1 0
# 1 1 1 0
# 2 2 2 0
# 3 0 5 1
# 4 1 6 1
# 5 2 8 1
# 6 0 1 2
# 7 1 8 2
# 8 2 4 2
# 9 0 1 3
# 10 1 7 3
# 11 2 3 3
result = df.pivot(index='x', columns='columns')
print(result)
yields
y
columns 0 1 2 3
x
0 1 5 1 1
1 1 6 8 7
2 2 8 4 3
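To get the val1..val4 column names from the question, the pivoted columns could be renamed afterwards (a sketch):
result = df.pivot(index='x', columns='columns')['y']
result.columns = [f'val{c + 1}' for c in result.columns]
result = result.reset_index()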
Or, if you can really rely on the values in x being repeated in order N times,
N = 3
result = pd.DataFrame(df['y'].values.reshape(-1, N).T)
yields
0 1 2 3
0 1 5 1 1
1 1 6 8 7
2 2 8 4 3
Using reshape is quicker than calling groupby/cumcount and pivot, but it
is less robust since it relies on the values in y appearing in the right order.
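If you go the reshape route, a cheap guard against silently wrong output is to check that x really does repeat in the same order (a sketch):
N = 3
blocks = df['x'].values.reshape(-1, N)
# every cycle of x must match the first one; otherwise pivot is the safe choice
assert (blocks == blocks[0]).all()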