I am looking for an efficient and elegant way in pandas to remove "duplicate" rows in a DataFrame that have exactly the same value set but in different columns.
Ideally I am looking for a vectorized way to do this, as I can already identify very inefficient approaches using the pandas.DataFrame.iterrows() method.
Say my DataFrame is:
| source | target |
|--------|--------|
| 1      | 2      |
| 2      | 1      |
| 4      | 3      |
| 2      | 7      |
| 3      | 4      |
I want it to become:
| source | target |
|--------|--------|
| 1      | 2      |
| 4      | 3      |
| 2      | 7      |
One vectorized approach:
df = df[~pd.DataFrame(np.sort(df.values, axis=1)).duplicated()]
   source  target
0       1       2
2       4       3
3       2       7
Explanation:
np.sort(df.values, axis=1) sorts the values within each row (along axis=1):
array([[1, 2],
       [1, 2],
       [3, 4],
       [2, 7],
       [3, 4]], dtype=int64)
Then we build a DataFrame from the sorted array and invert duplicated() with the ~ prefix, so the first occurrence of each value set is marked True:
~pd.DataFrame(np.sort(df.values, axis=1)).duplicated()
0     True
1    False
2     True
3     True
4    False
dtype: bool
Using this as a boolean mask on df gives the final output:
   source  target
0       1       2
2       4       3
3       2       7
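One caveat: the helper frame built from the sorted array gets a fresh 0..n-1 index, which lines up with df here only because df itself has the default index. If your frame has a non-default index, pass it through explicitly so the boolean mask aligns:
df = df[~pd.DataFrame(np.sort(df.values, axis=1), index=df.index).duplicated()]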
I have the following dataframe, where I would like to sort the columns according to their names.
1 | 13_1 | 13_10 | 13_2 | 2 | 3
9 | 31   | 2     | 1    | 3 | 4
I am trying to sort the columns in the following way:
1 | 2 | 3 | 13_1 | 13_2 | 13_10
9 | 3 | 4 | 31   | 1    | 2
I've been trying to solve this using df.sort_index(axis=1, inplace=True); however, the result turns out to be the same as my initial dataframe, i.e.:
1 | 13_1 | 13_10 | 13_2 | 2 | 3
9 | 31   | 2     | 1    | 3 | 4
It seems the columns are sorted lexicographically as strings, so 13_1 sorts before 2. Furthermore, I tried converting the column names from string to float, but that treats 13_1 and 13_10 both as 13.1, giving me duplicate column names.
Use natsort:
from natsort import natsorted
df = df.reindex(natsorted(df.columns), axis=1)
#    1  2  3  13_1  13_2  13_10
# 0  9  3  4    31     1      2
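Note that natsort is a third-party package, so you may need to install it first:
pip install natsort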
First of all, natsort from the other answers looks awesome; I'd totally use that.
In case you don't want to install a new package:
It seems you want to sort numerically, first by the number before the _ and then by the number after it as a tie break. In other words, you want a tuple sort order, splitting each name into a tuple on _.
Try this:
df = df[sorted(df.columns, key=lambda x: tuple(map(int, x.split('_'))))]
Output:
1  2  3  13_1  13_2  13_10
9  3  4    31     1      2
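To see what the key does, here is the same sort on plain strings. Names without an underscore become 1-tuples, which Python compares just fine against the 2-tuples:
names = ['1', '13_1', '13_10', '13_2', '2', '3']
print(sorted(names, key=lambda x: tuple(map(int, x.split('_')))))
# ['1', '2', '3', '13_1', '13_2', '13_10']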
Here is one way using natsorted:
from natsort import natsorted
df = df.reindex(columns=natsorted(df.columns))
Out[337]:
   1  2  3  13_1  13_2  13_10
0  9  3  4    31     1      2
Another way, using only pandas (no third-party library): split the column names on _, convert the two parts to float, and sort on them:
idx = (df.columns.to_series()
         .str.split('_', expand=True)
         .astype(float)
         .reset_index(drop=True)
         .sort_values([0, 1])
         .index)
df = df.iloc[:, idx]
Out[355]:
   1  2  3  13_1  13_2  13_10
0  9  3  4    31     1      2
I have the following type of data, on which I am making predictions:
Input:  1 | 2 | 3 | 4 | 5
        1 | 2 | 3 | 4 | 5
        1 | 2 | 3 | 4 | 5
Output: 6
        7
        8
I want to predict one step at a time and feed each prediction back into the input as the value of the last column. I use this function, but it is not working well:
def moving_window(num_future_pred):
    preds_moving = []
    moving_test_window = [test_X[0, :].tolist()]
    moving_test_window = np.array(moving_test_window)
    for j in range(1, len(test_Y)):
        moving_test_window = [test_X[j, :].tolist()]
        moving_test_window = np.array(moving_test_window)
        pred_one_step = model.predict(moving_test_window[:, :, :])
        preds_moving.append(pred_one_step[0, 0])
        pred_one_step = pred_one_step.reshape((1, 1, 1))
        moving_test_window = np.concatenate((moving_test_window[:, :4, :], pred_one_step), axis=1)
    return preds_moving
preds_moving = moving_window(len(test_Y))
What I want:
Input:  1 | 2 | 3 | 4 | 5
        1 | 2 | 3 | 4 | 6
        1 | 2 | 3 | 4 | 17
Output: 6
        17
        18
Basically, I want to make the first prediction [1, 2, 3, 4, 5] --> 6, then for the next input drop the last value (5) and put the predicted value in its place, and so on at each step.
What it does now is just take all the inputs as they are and make a prediction for each row. Any idea appreciated!
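A minimal sketch of the feedback loop being described, assuming model, test_X (shape (n, 5, 1)) and test_Y as in the question; the key change from the function above is that the window is seeded once from the first test row and afterwards updated only with predictions:
import numpy as np

def moving_window(num_future_pred):
    preds_moving = []
    # seed the window with the first test row only
    window = test_X[0:1, :, :]                 # shape (1, 5, 1)
    for _ in range(num_future_pred):
        pred_one_step = model.predict(window)  # shape (1, 1)
        preds_moving.append(pred_one_step[0, 0])
        # keep the first four values, replace the fifth with the prediction
        window = np.concatenate((window[:, :4, :],
                                 pred_one_step.reshape((1, 1, 1))), axis=1)
    return preds_moving

preds_moving = moving_window(len(test_Y))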
I am trying to convert a table containing string columns and array columns into a table with string columns only.
Here is what the current table looks like:
+------+---------+----------+
| col1 | col2    | col3     |
+------+---------+----------+
| 1    | [2,3]   | [4,5]    |
| 2    | [6,7,8] | [8,9,10] |
+------+---------+----------+
How can I get the expected result, like this:
+------+------+------+
| col1 | col2 | col3 |
+------+------+------+
| 1    | 2    | 4    |
| 1    | 3    | 5    |
| 2    | 6    | 8    |
| 2    | 7    | 9    |
| 2    | 8    | 10   |
+------+------+------+
The confusion comes from mixing scalar columns and list columns.
Under the assumption that, for every row, col2 and col3 have the same length, we can first turn every scalar column into a list column and then concatenate:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 2],
                   'col2': [[2, 3], [6, 7, 8]],
                   'col3': [[4, 5], [8, 9, 10]]})

# First, we turn the scalar column into a list column,
# repeating each value to match that row's list length
df['col1'] = df['col1'].apply(lambda x: [x]) * df['col2'].apply(len)

# Then we concatenate the lists in each column
df.apply(np.concatenate)
Output:
   col1  col2  col3
0     1     2     4
1     1     3     5
2     2     6     8
3     2     7     9
4     2     8    10
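As a side note, if your pandas is recent enough (1.3 or newer), DataFrame.explode accepts a list of columns and does this in one call:
df.explode(['col2', 'col3'], ignore_index=True)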
Convert the list columns to flat numpy arrays, repeat col1 to match each row's list length (the lists here have different lengths, so a fixed repeat count would not work), and finally build a new DataFrame:
lens = df.col2.str.len()                # list length per row: 2, 3
col1 = np.repeat(df.col1.values, lens)
vals1 = np.concatenate(df.col2.values)
vals2 = np.concatenate(df.col3.values)
df = pd.DataFrame(np.column_stack((col1, vals1, vals2)), columns=df.columns)
print(df)
   col1  col2  col3
0     1     2     4
1     1     3     5
2     2     6     8
3     2     7     9
4     2     8    10
I have a problem in Python working with a pandas DataFrame: I'm trying to build a machine learning model that predicts the surface. I have the surface column in the train DataFrame and I don't have it in the test DataFrame, so I would like to create some features based on the surface in train, like:
train['error_cat1'] = abs(train.groupby(train['cat1'])['surface'].transform('mean') - train.surface.mean())
Here I have set, for each group of the cat1 feature, the absolute difference between the group's mean surface and the overall mean surface. Cool.
Now I must add it to the test too, so I use this method to map each group's value from the train onto the test rows:
mp = {k: g['error_cat1'].tolist()[0] for k,g in train.groupby('cat1')}
test['error_cat1'] = test['cat1'].map(mp)
So far, there is no problem. Now I would like to use two columns in the groupby:
train['error_cat1_cat2'] = abs(train.groupby(['cat1', 'cat2'])['surface'].transform('mean') - train.surface.mean())
But I don't know how to map it onto the test DataFrame. Can you help me handle this problem, or suggest some other methods I can use?
Thanks
For example, my train set is:
+------+------+---------+
| Cat1 | Cat2 | surface |
+------+------+---------+
|  1   |  3   |      10 |
|  2   |  2   |      12 |
|  3   |  1   |      12 |
|  1   |  3   |       5 |
|  2   |  2   |      10 |
|  3   |  2   |      13 |
+------+------+---------+
My test set is:
+------+------+
| Cat1 | Cat2 |
+------+------+
|  1   |  2   |
|  2   |  1   |
|  3   |  1   |
|  1   |  3   |
|  2   |  3   |
|  3   |  1   |
+------+------+
Now I would like to do a groupby mean of surface on cat1 and cat2. For example, the mean surface on (cat1, cat2) = (1, 3) is (10 + 5) / 2 = 7.5.
Then I must go to the test set and map this value onto the (cat1, cat2) = (1, 3) rows.
I hope you get what I mean.
You can use:
groupby().mean() to calculate the means,
reset_index() to convert the indexes Cat1, Cat2 into columns again, and
merge(how='left') to join the two dataframes like tables in a database (LEFT JOIN in SQL).
headers = ['Cat1', 'Cat2', 'surface']
train_data = [
[1, 3, 10],
[2, 2, 12],
[3, 1, 12],
[1, 3, 5],
[2, 2, 10],
[3, 2, 13],
]
test_data = [
[1, 2],
[2, 1],
[3, 1],
[1, 3],
[2, 3],
[3, 1],
]
import pandas as pd
train = pd.DataFrame(train_data, columns=headers)
test = pd.DataFrame(test_data, columns=headers[:-1])
print('--- train ---')
print(train)
print('--- test ---')
print(test)
print('--- means ---')
means = train.groupby(['Cat1', 'Cat2']).mean()
print(means)
print('--- means (dataframe) ---')
means = means.reset_index(level=['Cat1', 'Cat2'])
print(means)
print('--- result ----')
result = pd.merge(test, means, on=['Cat1', 'Cat2'], how='left')
print(result)
print('--- result (fillna)---')
result = result.fillna(0)
print(result)
Result:
--- train ---
   Cat1  Cat2  surface
0     1     3       10
1     2     2       12
2     3     1       12
3     1     3        5
4     2     2       10
5     3     2       13
--- test ---
   Cat1  Cat2
0     1     2
1     2     1
2     3     1
3     1     3
4     2     3
5     3     1
--- means ---
           surface
Cat1 Cat2
1    3         7.5
2    2        11.0
3    1        12.0
     2        13.0
--- means (dataframe) ---
   Cat1  Cat2  surface
0     1     3      7.5
1     2     2     11.0
2     3     1     12.0
3     3     2     13.0
--- result ----
   Cat1  Cat2  surface
0     1     2      NaN
1     2     1      NaN
2     3     1     12.0
3     1     3      7.5
4     2     3      NaN
5     3     1     12.0
--- result (fillna)---
   Cat1  Cat2  surface
0     1     2      0.0
1     2     1      0.0
2     3     1     12.0
3     1     3      7.5
4     2     3      0.0
5     3     1     12.0
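If you prefer the map-based style from the question, here is a sketch under the same assumptions: to_dict() on the grouped means gives a dict keyed by (Cat1, Cat2) tuples, which you can look up row by row (defaulting to 0 like the fillna above):
mp = train.groupby(['Cat1', 'Cat2'])['surface'].mean().to_dict()
test['surface'] = [mp.get((c1, c2), 0) for c1, c2 in zip(test['Cat1'], test['Cat2'])]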
I am trying to work through a problem in pandas, being more accustomed to R.
I have a data frame df with three columns: person, period, value
df.head() or the top few rows look like:
  | person | period | value
0 | P22    | 1      | 0
1 | P23    | 1      | 0
2 | P24    | 1      | 1
3 | P25    | 1      | 0
4 | P26    | 1      | 1
5 | P22    | 2      | 1
Notice the last row records a value for period 2 for person P22.
I would now like to add a new column that provides the value from the previous period. So if for P22 the value in period 1 is 0, then this new column would look like:
  | person | period | value | lastperiod
5 | P22    | 2      | 1     | 0
I believe I need to do something like the following command, having loaded pandas:
for p in df.period.unique():
    df['lastperiod'] = [???]
How should this be formulated?
You could groupby person and then apply a shift to the values:
In [11]: g = df.groupby('person')
In [12]: g['value'].apply(lambda s: s.shift())
Out[12]:
0   NaN
1   NaN
2   NaN
3   NaN
4   NaN
5     0
dtype: float64
Adding this as a column:
In [13]: df['lastPeriod'] = g['value'].apply(lambda s: s.shift())
In [14]: df
Out[14]:
  person  period  value  lastPeriod
0    P22       1      0         NaN
1    P23       1      0         NaN
2    P24       1      1         NaN
3    P25       1      0         NaN
4    P26       1      1         NaN
5    P22       2      1           0
Here the NaNs signify missing data (i.e. there was no entry for that person in a previous period).
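As a side note, groupby objects support shift directly, so the lambda isn't needed; this is equivalent:
df['lastPeriod'] = df.groupby('person')['value'].shift()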