Delete duplicate values from pandas DataFrame [duplicate] - python

I am trying to remove duplicate rows based on the values in column 'new'. I have tried two methods, but df.shape is the same before and after, which suggests that the duplicates are not being removed.
import pandas
import numpy as np
import random
df = pandas.DataFrame(np.random.randn(10, 4), columns=list('ABCD'))
df['new'] = [1, 1, 3, 4, 5, 1, 7, 8, 1, 10]
df['new2'] = [1, 1, 2, 4, 5, 3, 7, 8, 9, 5]
print(df.shape)
df.drop_duplicates('new', keep='first')
df.groupby('new').max()
print(df.shape)
# output
(10, 6)
(10, 6)
[Finished in 1.0s]

You need to assign the result of drop_duplicates. By default inplace=False, so the method returns a de-duplicated copy; because you don't pass inplace=True, your original df is left unmodified:
In [106]:
df = df.drop_duplicates('new', keep='first')
df.groupby('new').max()
Out[106]:
A B C D new2
new
1 -1.698741 -0.550839 -0.073692 0.618410 1
3 0.519596 1.686003 1.395585 1.298783 2
4 1.557550 1.249577 0.214546 -0.077569 4
5 -0.183454 -0.789351 -0.374092 -1.824240 5
7 -1.176468 0.546904 0.666383 -0.315945 7
8 -1.224640 -0.650131 -0.394125 0.765916 8
10 -1.045131 0.726485 -0.194906 -0.558927 5
Alternatively, if you pass inplace=True, df is modified directly and no assignment is needed:
In [108]:
df.drop_duplicates('new', keep='first', inplace=True)
df.groupby('new').max()
Out[108]:
A B C D new2
new
1 0.334352 -0.355528 0.098418 -0.464126 1
3 -0.394350 0.662889 -1.012554 -0.004122 2
4 -0.288626 0.839906 1.335405 0.701339 4
5 0.973462 -0.818985 1.020348 -0.306149 5
7 -0.710495 0.580081 0.251572 -0.855066 7
8 -1.524862 -0.323492 -0.292751 1.395512 8
10 -1.164393 0.455825 -0.483537 1.357744 5
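To make the difference concrete, here is a minimal sketch using only the 'new' and 'new2' columns from the question. The names shape_unassigned and deduped are introduced here for illustration, and keep='first' is assumed to be the behaviour the question wanted.

```python
import pandas as pd

df = pd.DataFrame({
    "new": [1, 1, 3, 4, 5, 1, 7, 8, 1, 10],
    "new2": [1, 1, 2, 4, 5, 3, 7, 8, 9, 5],
})

# Calling the method without using its return value leaves df untouched.
df.drop_duplicates("new")
shape_unassigned = df.shape

# Assigning the result keeps the first row for each distinct value of 'new'.
deduped = df.drop_duplicates("new", keep="first")

print(shape_unassigned)  # (10, 2)
print(deduped.shape)     # (7, 2)
```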

Related

Creating a pandas column of values with a calculation, but change the calculation every x times to a different one

I'm currently creating a new column in my pandas dataframe, which calculates a value based on a simple calculation using a value in another column, and a simple value subtracting from it. This is my current code, which almost gives me the output I desire (example shortened for reproduction):
import pandas as pd
subtraction_value = 3
data = pd.DataFrame({"test":[12, 4, 5, 4, 1, 3, 2, 5, 10, 9]})
data['new_column'] = data['test'][::-1] - subtraction_value
When run, this gives me the current output:
print(data['new_column'])
[9,1,2,1,-2,0,-1,2,7,6]
However, if I wanted to use a different value to subtract on the column, from position [0], then use the original subtraction value on positions [1:3] of the column, before using the second value on position [4] again, and repeat this pattern, how would I do this iteratively? I realize I could use a for loop to achieve this, but for performance reasons I'd like to do this another way. My new output would ideally look like this:
subtraction_value_2 = 6
print(data['new_column'])
[6,1,2,1,-5,0,-1,2,4,6]
You can use positional indexing:
subtraction_value_2 = 6
col = data.columns.get_loc('new_column')
data.iloc[0::4, col] = data['test'].iloc[0::4].sub(subtraction_value_2)
or with numpy.where:
data['new_column'] = np.where(data.index%4,
                              data['test']-subtraction_value,
                              data['test']-subtraction_value_2)
output:
test new_column
0 12 6
1 4 1
2 5 2
3 4 1
4 1 -5
5 3 0
6 2 -1
7 5 2
8 10 4
9 9 6
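import pandas as pd
subtraction_value = 3
subtraction_value_2 = 6
data = pd.DataFrame({"test":[12, 4, 5, 4, 1, 3, 2, 5, 10, 9]})
data['new_column'] = data.test - subtraction_value
# use .loc rather than chained indexing so the assignment reaches the DataFrame itself
data.loc[::4, 'new_column'] = data.test[::4] - subtraction_value_2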
print(list(data.new_column))
Output:
[6, 1, 2, 1, -5, 0, -1, 2, 4, 6]
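Note that data.index % 4 in the np.where variant relies on the default 0..n-1 index. A hedged sketch that derives row positions with np.arange instead, so it should also work when the index holds other labels (the name positions is introduced here for illustration):

```python
import numpy as np
import pandas as pd

subtraction_value = 3
subtraction_value_2 = 6
data = pd.DataFrame({"test": [12, 4, 5, 4, 1, 3, 2, 5, 10, 9]})

# Row positions 0, 4, 8, ... get the second value; all other rows get the first.
positions = np.arange(len(data))
data["new_column"] = np.where(positions % 4 == 0,
                              data["test"] - subtraction_value_2,
                              data["test"] - subtraction_value)

print(data["new_column"].tolist())
# [6, 1, 2, 1, -5, 0, -1, 2, 4, 6]
```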

Unique values in pandas

Hi, I've just started learning Python and am trying to learn pandas. I'd like to know how to find the unique start and stop values in a DataFrame. Can someone help me out?
As you did not provide an example dataset, let's assume this one:
import numpy as np
import pandas as pd
np.random.seed(1)
df = pd.DataFrame({'start': np.random.randint(0,10,5),
                   'stop': np.random.randint(0,10,5),
                  }).T.apply(sorted).T
start stop
0 0 5
1 1 8
2 7 9
3 5 6
4 0 9
To get unique values for a given column (here start):
>>> df['start'].unique()
array([0, 1, 7, 5])
For all columns at once:
>>> df.apply(pd.unique, result_type='reduce')
start [0, 1, 7, 5]
stop [5, 8, 9, 6]
dtype: object
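If "unique start and stop values" instead means the distinct values across both columns taken together, one possible sketch is below. It hard-codes the example table shown above, and the names unique_start and all_unique are introduced here for illustration.

```python
import pandas as pd

# Same values as the example table above, written out explicitly.
df = pd.DataFrame({"start": [0, 1, 7, 5, 0], "stop": [5, 8, 9, 6, 9]})

# Unique values of one column, in order of first appearance.
unique_start = df["start"].unique().tolist()

# Unique values across both columns: flatten row by row, then de-duplicate.
all_unique = pd.unique(df[["start", "stop"]].to_numpy().ravel()).tolist()

print(unique_start)  # [0, 1, 7, 5]
print(all_unique)    # [0, 5, 1, 8, 7, 9, 6]
```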

matching two different arrays and making a new array in python

I have two two-dimensional arrays, and I need to create a new array by keeping the rows of the 2nd array whose first-column value also appears in the first column of the 1st array. The arrays are of different sizes.
Basically, the idea is as follows:
file A
#x y
1 2
3 4
2 2
5 4
6 4
7 4
file B
#x1 y1
0 1
1 1
11 1
5 1
7 1
My expected output 2D array should look like
#newx newy
1 1
5 1
7 1
I tried it following way:
match = []
for i in range(len(x)):
    if x[i] == x1[i]:
        new_array = x1[i]
        match.append(new_array)
print(match)
This does not seem to work. Please suggest a way to create the new 2D array
Try np.isin.
arr1 = np.array([[1,3,2,5,6,7], [2,4,2,4,4,4]])
arr2 = np.array([[0,1,11,5,7], [1,1,1,1,1]])
arr2[:,np.isin(arr2[0], arr1[0])]
array([[1, 5, 7],
[1, 1, 1]])
np.isin(arr2[0], arr1[0]) checks whether each element of arr2[0] is in arr1[0]. Then, we use the result as the boolean index array to select elements in arr2.
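The same idea can be sketched with the arrays laid out row-wise, one (x, y) pair per row, as in the files from the question. That layout is an assumption about how the files are loaded, and the names a, b, mask and matched are introduced here for illustration.

```python
import numpy as np

# One (x, y) pair per row, mirroring file A and file B.
a = np.array([[1, 2], [3, 4], [2, 2], [5, 4], [6, 4], [7, 4]])
b = np.array([[0, 1], [1, 1], [11, 1], [5, 1], [7, 1]])

# True for each row of b whose first column also occurs in a's first column.
mask = np.isin(b[:, 0], a[:, 0])

# Boolean indexing keeps only the matching rows.
matched = b[mask]

print(mask.tolist())     # [False, True, False, True, True]
print(matched.tolist())  # [[1, 1], [5, 1], [7, 1]]
```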
If you build a set from the first element of each pair in A, it is fairly easy to find the pairs in B to keep:
Code:
a = ((1, 2), (3, 4), (2, 2), (5, 4), (6, 4), (7, 4))
b = ((0, 1), (1, 1), (11, 1), (5, 1), (7, 1))
in_a = {i[0] for i in a}
new_b = [i for i in b if i[0] in in_a]
print(new_b)
Results:
[(1, 1), (5, 1), (7, 1)]
Output results to file as:
with open('output.txt', 'w') as f:
    for value in new_b:
        f.write(' '.join(str(v) for v in value) + '\n')
#!/usr/bin/env python3
from io import StringIO
import pandas as pd
fileA = """x y
1 2
3 4
2 2
5 4
6 4
7 4
"""
fileB = """x1 y1
0 1
1 1
11 1
5 1
7 1
"""
df1 = pd.read_csv(StringIO(fileA), sep=r"\s+", index_col="x")
df2 = pd.read_csv(StringIO(fileB), sep=r"\s+", index_col="x1")
df = pd.merge(df1, df2, left_index=True, right_index=True)
print(df["y1"])
# 1 1
# 5 1
# 7 1
https://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging
If you use pandas:
import pandas as pd
A = pd.DataFrame({'x': pd.Series([1,3,2,5,6,7]), 'y': pd.Series([2,4,2,4,4,4])})
B = pd.DataFrame({'x1': pd.Series([0,1,11,5,7]), 'y1': 1})
C = A.join(B.set_index('x1'), on='x')
Then, if you want to drop the unmatched rows and the unneeded column and rename the columns:
C = A.join(B.set_index('x1'), on='x')
C = C.dropna()  # the join is a left join, so unmatched rows hold NaN in y1
C = C.drop(['y'], axis=1)
C.columns = ['newx', 'newy']
which gives you:
>>> C
newx newy
0 1 1.0
3 5 1.0
5 7 1.0
If you are going to work with arrays, dataframes, etc - pandas is definitely worth a look: https://pandas.pydata.org/pandas-docs/stable/10min.html
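As an alternative to the join above (not the answerer's exact method), an inner merge keeps only the matching rows directly, so no NaN values appear and the integer dtype is preserved. A minimal sketch under that assumption:

```python
import pandas as pd

A = pd.DataFrame({"x": [1, 3, 2, 5, 6, 7], "y": [2, 4, 2, 4, 4, 4]})
B = pd.DataFrame({"x1": [0, 1, 11, 5, 7], "y1": [1, 1, 1, 1, 1]})

# how="inner" keeps only rows whose key appears in both frames.
C = A.merge(B, left_on="x", right_on="x1", how="inner")

# Keep B's columns and rename them to the names used in the question.
C = C[["x1", "y1"]].rename(columns={"x1": "newx", "y1": "newy"})

print(C.values.tolist())  # [[1, 1], [5, 1], [7, 1]]
```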
Assuming that you have (x, y) pairs in your 2-D arrays, a simple loop may work:
arr1 = [[1, 2], [3, 4], [2, 2]]
arr2 = [[0, 1], [1, 1], [11, 1]]
result = []
for pair1 in arr1:
    for pair2 in arr2:
        if pair1[0] == pair2[0]:
            result.append(pair2)
print(result)
This is not the best solution for small arrays, but it works fast for really large ones:
import numpy as np
import pandas as pd
n1 = np.transpose(np.array([[1,3,2,5,6,7], [2,4,2,4,4,4]]))
n2 = np.transpose(np.array([[0,1,11,5, 7], [1,1,1,1,1]]))
np.array(pd.DataFrame(n1).merge(pd.DataFrame(n2), on=0, how='inner').drop('1_x', axis=1))

how to get a 2d numpy array from a pandas dataframe? - wrong shape

I want to get a 2D NumPy array from a column of a pandas DataFrame df that holds a NumPy vector in each row. But if I do
df.values.shape
I get (3,) instead of (3, 5)
(assuming that each NumPy vector in the DataFrame has 5 elements and that the DataFrame has 3 rows).
What is the correct method?
Ideally, avoid getting into this situation by finding a different way to define the DataFrame in the first place. However, if your DataFrame looks like this:
s = pd.Series([np.random.randint(20, size=(5,)) for i in range(3)])
df = pd.DataFrame(s, columns=['foo'])
# foo
# 0 [4, 14, 9, 16, 5]
# 1 [16, 16, 5, 4, 19]
# 2 [7, 10, 15, 13, 2]
then you could convert it to a DataFrame of shape (3,5) by calling pd.DataFrame on a list of arrays:
pd.DataFrame(df['foo'].tolist())
# 0 1 2 3 4
# 0 4 14 9 16 5
# 1 16 16 5 4 19
# 2 7 10 15 13 2
pd.DataFrame(df['foo'].tolist()).values.shape
# (3, 5)
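If only the NumPy array is needed and not an intermediate DataFrame, np.stack on the list of row vectors is another possible route. This is a sketch with made-up data; the names vectors and stacked are introduced here for illustration, and it assumes every row holds a vector of the same length.

```python
import numpy as np
import pandas as pd

# A column whose cells each hold a length-5 vector.
vectors = pd.Series([np.arange(5), np.arange(5, 10), np.arange(10, 15)])
df = vectors.to_frame("foo")

# The column itself is a 1-D object array of arrays.
print(df["foo"].to_numpy().shape)  # (3,)

# Stacking the row vectors gives a regular 2-D array.
stacked = np.stack(df["foo"].tolist())
print(stacked.shape)  # (3, 5)
```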
I am not sure what you want. But df.values.shape seems to be giving the correct result.
import pandas as pd
import numpy as np
from pandas import DataFrame
df3 = DataFrame(np.random.randn(3, 5), columns=['a', 'b', 'c', 'd', 'e'])
print(df3)
# a b c d e
#0 -0.221059 1.206064 -1.359214 0.674061 0.547711
#1 0.246188 0.628944 0.528552 0.179939 -0.019213
#2 0.080049 0.579549 1.790376 -1.301700 1.372702
df3.values.shape
#(3, 5)
df3["a"]
#0 -0.221059
#1 0.246188
#2 0.080049
df3[:1]
# a b c d e
#0 -0.221059 1.206064 -1.359214 0.674061 0.547711

Loading a table in numpy with row- and column-indices, like in R?

I would like to load a table in numpy, so that the first row and first column would be considered text labels. Something equivalent to this R code:
read.table("filename.txt", row.header=T)
Where the file is a delimited text file like this:
A B C D
X 5 4 3 2
Y 1 0 9 9
Z 8 7 6 5
So that read in I will have an array:
[[5,4,3,2],
[1,0,9,9],
[8,7,6,5]]
With some sort of:
rownames ["X","Y","Z"]
colnames ["A","B","C","D"]
Is there such a class / mechanism?
Numpy arrays aren't perfectly suited to table-like structures. However, pandas.DataFrames are.
For what you're wanting, use pandas.
For your example, you'd do
data = pandas.read_csv('filename.txt', sep=r'\s+', index_col=0)
As a more complete example (using StringIO to simulate your file):
from io import StringIO
import pandas as pd
f = StringIO("""A B C D
X 5 4 3 2
Y 1 0 9 9
Z 8 7 6 5""")
x = pd.read_csv(f, sep=r'\s+', index_col=0)
print('The DataFrame:')
print(x)
print('Selecting a column')
print(x['D']) # or "x.D" if there aren't spaces in the name
print('Selecting a row')
print(x.loc['Y'])
This yields:
The DataFrame:
A B C D
X 5 4 3 2
Y 1 0 9 9
Z 8 7 6 5
Selecting a column
X 2
Y 9
Z 5
Name: D, dtype: int64
Selecting a row
A 1
B 0
C 9
D 9
Name: Y, dtype: int64
Also, as @DSM pointed out, it's very useful to know about things like DataFrame.values or DataFrame.to_records() if you do need a "raw" numpy array. (pandas is built on top of numpy. In a simple, non-strict sense, each column of a DataFrame is stored as a 1D numpy array.)
For example:
In [2]: x.values
Out[2]:
array([[5, 4, 3, 2],
[1, 0, 9, 9],
[8, 7, 6, 5]])
In [3]: x.to_records()
Out[3]:
rec.array([('X', 5, 4, 3, 2), ('Y', 1, 0, 9, 9), ('Z', 8, 7, 6, 5)],
dtype=[('index', 'O'), ('A', '<i8'), ('B', '<i8'), ('C', '<i8'), ('D', '<i8')])
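To get exactly the three pieces the question asks for (a plain array plus row and column labels), a minimal sketch is below. It uses DataFrame.to_numpy() in place of .values, and the names table, values, rownames and colnames are introduced here for illustration.

```python
from io import StringIO

import pandas as pd

text = """A B C D
X 5 4 3 2
Y 1 0 9 9
Z 8 7 6 5"""

# First column becomes the row labels, first line the column labels.
table = pd.read_csv(StringIO(text), sep=r"\s+", index_col=0)

values = table.to_numpy()          # plain 2-D NumPy array of the numbers
rownames = table.index.tolist()    # ['X', 'Y', 'Z']
colnames = table.columns.tolist()  # ['A', 'B', 'C', 'D']

print(values.shape, rownames, colnames)
```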
