python pandas elegant dataframe access rows 2:end - python

I have a dataframe, dF = pd.DataFrame(X), where X is a numpy array of doubles. I want to remove the last row from the dataframe. I know that for the first row I can do something like dF.ix[1:]. I want to do something similar for the last row. In MATLAB you could write something like dF(1:end-1,:). What is a good and readable way to do this with pandas?
The end goal is to achieve this:
first matrix
1 2 3
4 5 6
7 8 9
second matrix
a b c
d e f
g h i
Now get rid of the first row of the first matrix and the last row of the second matrix, and horizontally concatenate them like so:
4 5 6 a b c
7 8 9 d e f
Done. In MATLAB, with a = firstMatrix and b = secondMatrix, this is c = [a(2:end,:) b(1:end-1,:)], where c is the resulting matrix.

you can do it this way:
In [129]: df1
Out[129]:
c1 c2 c3
0 1 2 3
1 4 5 6
2 7 8 9
In [130]: df2
Out[130]:
c1 c2 c3
0 a b c
1 d e f
2 g h i
In [131]: df1.iloc[1:].reset_index(drop=True).join(df2.iloc[:-1].reset_index(drop=True), rsuffix='_2')
Out[131]:
c1 c2 c3 c1_2 c2_2 c3_2
0 4 5 6 a b c
1 7 8 9 d e f
Or a pure NumPy solution:
In [132]: a1 = df1.values
In [133]: a1
Out[133]:
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]], dtype=int64)
In [134]: a2 = df2.values
In [135]: a2
Out[135]:
array([['a', 'b', 'c'],
['d', 'e', 'f'],
['g', 'h', 'i']], dtype=object)
In [136]: a1[1:]
Out[136]:
array([[4, 5, 6],
[7, 8, 9]], dtype=int64)
In [137]: a2[:-1]
Out[137]:
array([['a', 'b', 'c'],
['d', 'e', 'f']], dtype=object)
In [138]: np.concatenate((a1[1:], a2[:-1]), axis=1)
Out[138]:
array([[4, 5, 6, 'a', 'b', 'c'],
[7, 8, 9, 'd', 'e', 'f']], dtype=object)
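For reference, the join-based approach above can be put together as a self-contained script. This is only a minimal sketch: the frames are rebuilt from the values shown in the outputs, and the names shifted_up and trimmed are made up for illustration.

```python
import pandas as pd

# Rebuild the two example frames from the values shown above.
df1 = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=["c1", "c2", "c3"])
df2 = pd.DataFrame([list("abc"), list("def"), list("ghi")], columns=["c1", "c2", "c3"])

# Drop the first row of df1 and the last row of df2, then renumber both
# so the remaining rows share the same 0..n-1 index.
shifted_up = df1.iloc[1:].reset_index(drop=True)  # hypothetical name
trimmed = df2.iloc[:-1].reset_index(drop=True)    # hypothetical name

# join aligns on the index; rsuffix disambiguates the duplicated column names.
result = shifted_up.join(trimmed, rsuffix="_2")
print(result)
```

Without reset_index, the two slices would keep the labels 1..2 and 0..1, and join would align rows on those labels rather than on position.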

Related

Create and populate dataframe column simulating (excel) vlookup function

I am trying to create a new column in a dataframe and populate it with a value from a column of another dataframe, matching rows on a column that both dataframes have in common.
DF1 DF2
A B W B
——— ———
Y 2 X 2
N 4 F 4
Y 5 T 5
I thought the following could do the trick.
df2['new_col'] = df1['A'] if df1['B'] == df2['B'] else "Not found"
So result should be:
DF2
W B new_col
X 2 Y -> Because DF1[‘B’] == 2 and value in same row is Y
F 4 N
T 5 Y
But I get the error below. I believe that is because the dataframes are different sizes?
raise ValueError("Can only compare identically-labeled Series objects")
Can you help me understand what I am doing wrong and what the best way is to achieve what I am after?
Thank you in advance.
UPDATE 1
Trying Corralien solution I still get the below:
ValueError: You are trying to merge on int64 and object columns. If you wish to proceed you should use pd.concat
This is the code I wrote
df1 = pd.DataFrame(np.array([['x', 2, 3], ['y', 5, 6], ['z', 8, 9]]),
columns=['One', 'b', 'Three'])
df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
columns=['a', 'b', 'c'])
df2.reset_index().merge(df1.reset_index(), on=['b'], how='left') \
.drop(columns='index').rename(columns={'One': 'new_col'})
UPDATE 2
Here is the second option, but it does not seem to add columns in df2.
df1 = pd.DataFrame(np.array([['x', 2, 3], ['y', 5, 6], ['z', 8, 9]]),
columns=['One', 'b', 'Three'])
df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
columns=['a', 'b', 'c'])
df2 = df2.set_index('b', append=True).join(df1.set_index('b', append=True)) \
.reset_index('b').rename(columns={'One': 'new_col'})
print(df2)
b a c new_col Three
0 2 1 3 NaN NaN
1 5 4 6 NaN NaN
2 8 7 9 NaN NaN
Why is the code above not working?
Your question is not clear: why is F associated with N and T with Y? Why not F with Y and T with N?
Using merge:
>>> df2.merge(df1, on='B', how='left')
W B A
0 X 2 Y
1 F 4 N # What you want
2 F 4 Y # Another solution
3 T 4 N # What you want
4 T 4 Y # Another solution
How do you decide on the right value? With row index?
Update
So you need to use the index position:
>>> df2.reset_index().merge(df1.reset_index(), on=['index', 'B'], how='left') \
.drop(columns='index').rename(columns={'A': 'new_col'})
W B new_col
0 X 2 Y
1 F 4 N
2 T 4 Y
In fact you can consider the column B as an additional index of each dataframe.
Using join
>>> df2.set_index('B', append=True).join(df1.set_index('B', append=True)) \
.reset_index('B').rename(columns={'A': 'new_col'})
B W new_col
0 2 X Y
1 4 F N
2 4 T Y
Setup:
df1 = pd.DataFrame([['x', 2, 3], ['y', 5, 6], ['z', 8, 9]],
columns=['One', 'b', 'Three'])
df2 = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
columns=['a', 'b', 'c'])
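The difference between this setup and the code in the updates is worth spelling out. A NumPy array holds a single dtype, so building df1 from np.array with mixed strings and integers stores the integers as strings, and the key column 'b' no longer matches the integer 'b' in df2. The sketch below illustrates this; the names df1_from_array and df1_from_lists are hypothetical, and the plain merge on 'b' is just one way to finish once the dtypes agree.

```python
import numpy as np
import pandas as pd

rows = [['x', 2, 3], ['y', 5, 6], ['z', 8, 9]]

# Going through np.array coerces every element to a string.
df1_from_array = pd.DataFrame(np.array(rows), columns=['One', 'b', 'Three'])  # hypothetical name
# Passing the nested list directly keeps the integers as integers.
df1_from_lists = pd.DataFrame(rows, columns=['One', 'b', 'Three'])            # hypothetical name

df2 = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=['a', 'b', 'c'])

print(df1_from_array['b'].tolist())  # string values
print(df1_from_lists['b'].tolist())  # integer values

# With matching dtypes, a left merge on 'b' brings the lookup value across.
merged = (df2.merge(df1_from_lists[['b', 'One']], on='b', how='left')
             .rename(columns={'One': 'new_col'}))
print(merged)
```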

Pandas dataframe remove rows by aggregated data

I have a dataframe like this
test1 = pd.DataFrame(np.array([[1, 9, 3], [1, 5, 6], [2, 1, 9]]),
columns=['a', 'b', 'c'])
a b c
0 1 9 3
1 1 5 6
2 2 1 9
I want to keep 'a' iff the sum of 'b's under the same 'a' is greater than 10.
For this case, the desired output is:
a b c
0 1 9 3
1 1 5 6
My solution is as below:
test1 = pd.DataFrame(np.array([[1, 9, 3], [1, 5, 6], [2, 1, 9]]),
columns=['a', 'b', 'c'])
tmp_ = test1.groupby("a").sum().reset_index()
test1[test1["a"].isin(tmp_[tmp_["b"]>10]["a"].to_list())]
I am just wondering if there is a more elegant way to do that?
Group by 'a' and use transform
test1 = pd.DataFrame(np.array([[1, 9, 3], [1, 5, 6], [2, 1, 9]]),
columns=['a', 'b', 'c'])
b_sum_by_a = test1.groupby('a')['b'].transform('sum')
test1 = test1[b_sum_by_a > 10]
>>> test1
a b c
0 1 9 3
1 1 5 6
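To see why this works, it helps to look at what transform returns: a Series with the same index as the original frame, in which every row carries the sum of 'b' for its own group. A small sketch using the same data (the name filtered is made up here):

```python
import pandas as pd

test1 = pd.DataFrame([[1, 9, 3], [1, 5, 6], [2, 1, 9]], columns=['a', 'b', 'c'])

# One value per original row: the total of 'b' within that row's 'a' group.
b_sum_by_a = test1.groupby('a')['b'].transform('sum')
print(b_sum_by_a.tolist())

# Because the Series is aligned with test1, it can be used directly as a mask.
filtered = test1[b_sum_by_a > 10]  # hypothetical name
print(filtered)
```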

column is not getting dropped

Why is column A not getting dropped from the train, valid and test dataframes?
import pandas as pd
train = pd.DataFrame({'A': [0, 1, 2, 3, 4],'B': [5, 6, 7, 8, 9],'C': ['a', 'b', 'c', 'd', 'e']})
test = pd.DataFrame({'A': [0, 1, 2, 3, 4],'B': [5, 6, 7, 8, 9],'C': ['a', 'b', 'c', 'd', 'e']})
valid = pd.DataFrame({'A': [0, 1, 2, 3, 4],'B': [5, 6, 7, 8, 9],'C': ['a', 'b', 'c', 'd', 'e']})
for df in [train,valid,test]:
    df = df.drop(['A'],axis=1)
print('A' in train.columns)
print('A' in test.columns)
print('A' in valid.columns)
#True
#True
#True
You can use the inplace=True parameter, because DataFrame.drop can also work in place:
for df in [train,valid,test]:
    df.drop(['A'],axis=1, inplace=True)
print('A' in train.columns)
False
print('A' in test.columns)
False
print('A' in valid.columns)
False
The column is not removed because the result of drop is assigned only to the loop variable df, not back to the original DataFrames, so they are left unchanged.
Another idea is to create a list of DataFrames and assign each changed DataFrame back to the list:
L = [train,valid,test]
for i in range(len(L)):
    L[i] = L[i].drop(['A'],axis=1)
print(L)
[ B C
0 5 a
1 6 b
2 7 c
3 8 d
4 9 e, B C
0 5 a
1 6 b
2 7 c
3 8 d
4 9 e, B C
0 5 a
1 6 b
2 7 c
3 8 d
4 9 e]
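The name-rebinding behaviour described above can be checked with a short script. This is a sketch on smaller frames than in the question; make_frame and still_there are hypothetical names, and rebinding through tuple unpacking is just one more option alongside the two shown above.

```python
import pandas as pd

def make_frame():
    # Hypothetical helper producing a small frame like the ones in the question.
    return pd.DataFrame({'A': [0, 1, 2], 'B': [5, 6, 7], 'C': ['a', 'b', 'c']})

train, valid, test = make_frame(), make_frame(), make_frame()

for df in [train, valid, test]:
    # This only rebinds the loop variable df to a new object;
    # train, valid and test still refer to the original frames.
    df = df.drop(columns=['A'])

still_there = 'A' in train.columns
print(still_there)

# Rebind the original names by unpacking a list of the dropped copies.
train, valid, test = [d.drop(columns=['A']) for d in (train, valid, test)]
print(list(train.columns))
```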

Dataframe groupby when specific values are encountered on a given row

I have a dataframe and I would like to group (or slice) it. The dataframe has the form
A B C
a b 1
a b 0
a b 1
a b 2
a b 0
a e 3
a e 3
f g 6
f g 7
f g 0
I would like to first group the dataframe on columns A and B. Then each group is further split at a certain value into smaller groups of consecutive rows. For example, after grouping the dataframe by columns A and B, I would like to refine the grouping on a third level each time I encounter a 0 in column C. So the grouped dataframe looks like
A B C
a b 1
a b 0
a b 1
a b 2
a b 0
a e 3
a e 3
f g 6
f g 7
f g 0
Grouping a dataframe by column values, like columns A and B in the example, is simple, but I don't know how to further group on level 3 into consecutive rows with certain cut points. Thank you in advance if you could help.
The approach for this is always the same: create an extra column (or sometimes several) that represents your specific grouping logic, then group on it:
df['cut_point'] = (df.C==0).cumsum().shift().fillna(0)
df.groupby(['A', 'B', 'cut_point']).groups
Out[141]:
{('a', 'b', 0.0): Int64Index([0, 1], dtype='int64'),
('a', 'b', 1.0): Int64Index([2, 3, 4], dtype='int64'),
('a', 'e', 2.0): Int64Index([5, 6], dtype='int64'),
('f', 'g', 2.0): Int64Index([7, 8, 9], dtype='int64')}
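The cut_point column can be traced step by step on the example data. In this sketch the frame is rebuilt from the table in the question, and the names is_cut and groups are made up for illustration.

```python
import pandas as pd

# The example frame from the question.
df = pd.DataFrame({
    'A': list('aaaaaaafff'),
    'B': list('bbbbbeeggg'),
    'C': [1, 0, 1, 2, 0, 3, 3, 6, 7, 0],
})

# True on every row where C is 0, i.e. where a sub-group ends.
is_cut = df['C'] == 0  # hypothetical name

# cumsum counts the zeros seen so far; shift moves the counter down one row
# so that the row holding the 0 still belongs to the group it closes.
df['cut_point'] = is_cut.cumsum().shift().fillna(0)
print(df['cut_point'].tolist())

# Collect the row labels of each (A, B, cut_point) group.
groups = {key: list(idx)
          for key, idx in df.groupby(['A', 'B', 'cut_point']).groups.items()}
print(groups)
```

Note that the counter runs over the whole frame rather than restarting in each (A, B) group, which is why both ('a', 'e') and ('f', 'g') carry the value 2.0; the groups are still distinct because A and B are part of the key.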

Pandas keep column after multiple aggregations

I'm trying to do multiple aggregations over a pandas dataframe. The problem is that I want to keep the column I aggregate over as a regular column.
df3 = pd.DataFrame({'X' : ['A', 'B', 'A', 'B'], 'Y' : [1, 4, 3, 2]})
df3.groupby('X', as_index=False).agg('sum')
X Y
0 A 4
1 B 6
That's good but what I want is multiple aggregations like this
df3 = pd.DataFrame({'X' : ['A', 'B', 'A', 'B'], 'Y' : [1, 4, 3, 2]})
df3.groupby('X', as_index=False).agg(['sum', 'mean'])
It gives me
Y
sum mean
X
A 4 2
B 6 3
But I want this
X Y
sum mean
0 A 4 2
1 B 6 3
To move X from the index to a column use reset_index:
In [4]: df3 = pd.DataFrame({'X' : ['A', 'B', 'A', 'B'], 'Y' : [1, 4, 3, 2]})
In [5]: df3.groupby('X', as_index=False).agg(['sum', 'mean']).reset_index()
Out[5]:
X Y
sum mean
0 A 4 2
1 B 6 3
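As a sketch of what reset_index does to the column labels here: the version below groups with the default as_index=True (an assumption that differs from the call above, chosen so that X is certain to start out in the index), and the names stats and flat are made up.

```python
import pandas as pd

df3 = pd.DataFrame({'X': ['A', 'B', 'A', 'B'], 'Y': [1, 4, 3, 2]})

# With a list of aggregations the result has two-level column labels
# ('Y', 'sum') and ('Y', 'mean'), and X is the row index.
stats = df3.groupby('X').agg(['sum', 'mean'])  # hypothetical name

# reset_index moves X into the columns; its second level is an empty string.
flat = stats.reset_index()  # hypothetical name
print(list(flat.columns))
print(flat)
```

The columns stay a two-level MultiIndex, so the aggregated values are addressed with tuples such as flat[('Y', 'sum')].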
