The copy() method in pandas does not work properly

I have a pandas DataFrame that I would like to duplicate and then operate on without affecting the original. I use the .copy() method, but for some reason it doesn't work! Here is my code:
import pandas as pd
import numpy as np
x = np.array([1,2])
df = pd.DataFrame({'A': [x, x, x], 'B': [4, 5, 6]})
duplicate = df.copy()
duplicate['A'].values[0][[0,1]] = 0
print(duplicate)
print(df)
        A  B
0  [0, 0]  4
1  [0, 0]  5
2  [0, 0]  6
        A  B
0  [0, 0]  4
1  [0, 0]  5
2  [0, 0]  6
As you can see, "df" (the original dataframe) gets affected as well. Does anyone know why, and how this should be done correctly?

The problem is actually in the array values rather than the DataFrame itself. df.copy() is a deep copy by default, but it does not deep-copy the objects stored inside the cells; when a cell holds an array (or list), only the reference is copied. You can tell because, even though you only tried to modify the first row, all values of A in your duplicate changed: every row references the same underlying array x.
The proper way is probably:
import pandas as pd
import numpy as np
from copy import deepcopy # <- **
x = np.array([1,2])
df = pd.DataFrame({'A': [x, x, x], 'B': [4, 5, 6]})
duplicate = df.copy()
duplicate['A'] = duplicate["A"].apply(deepcopy) # <- **
duplicate['A'].values[0][[0,1]] = 0
print(duplicate)
print(df)
        A  B
0  [0, 0]  4
1  [1, 2]  5
2  [1, 2]  6
        A  B
0  [1, 2]  4
1  [1, 2]  5
2  [1, 2]  6
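A sketch of an alternative that avoids the shared references entirely: give each row its own copy of the array at construction time, then deep-copy the object column after df.copy() as above (variable names as in the question):

```python
import pandas as pd
import numpy as np
from copy import deepcopy

x = np.array([1, 2])
# each row gets an independent copy instead of three references to x
df = pd.DataFrame({'A': [x.copy() for _ in range(3)], 'B': [4, 5, 6]})
duplicate = df.copy()
# deep-copy the object column so the duplicate owns its arrays
duplicate['A'] = duplicate['A'].apply(deepcopy)
duplicate['A'].values[0][[0, 1]] = 0
print(df['A'].values[0])         # the original row is untouched: [1 2]
print(duplicate['A'].values[0])  # [0 0]
```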

Related

replacing values with zeros

I have a numpy array, and I want to replace all of its values with zeros except those in some range of indices.
1
2
3
4
5
I tried
import numpy as np
data=np.loadtxt('data.txt')
print(data)
expected output
0
0
3
0
0
You can traverse the array with a for loop and check whether each element is in a list of values to keep:
import numpy as np
a = np.array([1, 2, 3, 4, 5])
nums = [3]
for i in range(len(a)):
    if a[i] not in nums:
        a[i] = 0
print(a)
Output:
[0 0 3 0 0]
As you're working with a numpy array, use vectorized methods. Here, np.isin forms a boolean mask for the replacement:
import numpy as np

data = np.array([1, 2, 3, 4, 5])
keep = [3]
data[~np.isin(data, keep)] = 0
data
# array([0, 0, 3, 0, 0])
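The same mask can also be used without in-place mutation, e.g. with np.where (a sketch):

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5])
keep = [3]
# keep masked values, replace everything else with 0
result = np.where(np.isin(data, keep), data, 0)
print(result)  # [0 0 3 0 0]
```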

How to assign a certain value into a new column in a pandas dataframe depending on index

Suppose I have two pandas data frames, one actually more like a series
import pandas as pd
import numpy as np
A = pd.DataFrame(index=[0, 1, 2], data=[[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=["I", "L", "P"])
B = pd.DataFrame(index=[1, 3, 4], data=[[10], [40] ,[70]])
I would like to add a new column "B" to A, with values depending on the index: if an index value is shared by both A and B, the value of B's row at that index should be used; otherwise 0. The result should look like this:
A = pd.DataFrame(index=[0, 1, 2], data=[[1, 2, 3, 0], [4, 5, 6, 10], [7, 8, 9, 0]], columns=["I", "L", "P", "B"])
A
How can this be achieved efficiently in Python / pandas?
Use reindex with assignment:
A['B'] = B[0].reindex(A.index, fill_value=0)
A
Out[55]:
   I  L  P   B
0  1  2  3   0
1  4  5  6  10
2  7  8  9   0
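For reference, a join-based sketch of the same alignment (the column name 'B' is chosen here to match the desired output):

```python
import pandas as pd

A = pd.DataFrame(index=[0, 1, 2],
                 data=[[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                 columns=["I", "L", "P"])
B = pd.DataFrame(index=[1, 3, 4], data=[[10], [40], [70]])

# join aligns on the index; rows of A with no match in B get NaN
out = A.join(B.rename(columns={0: 'B'}))
out['B'] = out['B'].fillna(0).astype(int)
print(out)
```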

Check an array if there are two or more of same elements in a row

I have an input array
1 2 2
3 2 1
1 2 2
1 2 3
1 1 3
My output should be
3 2 1
1 2 3
This means all the elements repeated in the rows should be deleted. Is it possible to do that in numpy?
You can first sort each row and then look at the differences between consecutive elements per row: if there is any 0 in a row, it means that row has duplicates and needs to be dropped:
import numpy as np

arr = np.array([[1, 2, 2],
                [3, 2, 1],
                [1, 2, 2],
                [1, 2, 3],
                [1, 1, 3]])
# sort and take the difference within rows
sorted_arr = np.sort(arr)
diffs = np.diff(sorted_arr)
# form a boolean mask: does a row contain any duplicates?
to_drop = (diffs == 0).any(axis=1)
# invert the mask and index into the original array
result = arr[~to_drop]
to get
>>> result
array([[3, 2, 1],
       [1, 2, 3]])
Similar to @Hichkas's answer, but implemented with numpy:
import numpy as np
x = np.array([[1, 2, 2],
              [3, 2, 1],
              [1, 2, 2],
              [1, 2, 3],
              [1, 1, 3]])
ans = np.empty((0, 3), int)
for row in x:
    # keep the row only if all of its elements are unique
    if row.shape[0] <= np.unique(row).shape[0]:
        ans = np.vstack((ans, row))
print(ans)
x = [[1, 2, 2],
     [3, 2, 1],
     [1, 2, 2],
     [1, 2, 3],
     [1, 1, 3]]
# keep only rows whose elements are all distinct
[*filter(lambda row: len(set(row)) == len(row), x)]
# [[3, 2, 1], [1, 2, 3]]
# or rewritten for numpy
import numpy as np
x = np.array([[1, 2, 2],
              [3, 2, 1],
              [1, 2, 2],
              [1, 2, 3],
              [1, 1, 3]])
np.array([*filter(lambda row: len(set(row.tolist())) == len(row), x)])
# array([[3, 2, 1], [1, 2, 3]])
Use this (filter into a new list rather than calling remove() while iterating, which skips elements):
list_test = [[1, 2, 2],
             [3, 2, 1],
             [1, 2, 2],
             [1, 2, 3],
             [1, 1, 3]]
# a row has duplicates when its set is shorter than the row itself
list_test = [x for x in list_test if len(x) == len(set(x))]

Make a Pandas MultiIndex from a product of iterables?

I have a utility function for creating a Pandas MultiIndex when I have two or more iterables and I want an index key for each unique pairing of the values in those iterables. It looks like this
import pandas as pd
import itertools
def product_index(values, names=None):
    """Make a MultiIndex from the combinatorial product of the values."""
    iterable = itertools.product(*values)
    idx = pd.MultiIndex.from_tuples(list(iterable), names=names)
    return idx
And could be used like:
a = range(3)
b = list("ab")
product_index([a, b])
To create
MultiIndex(levels=[[0, 1, 2], [u'a', u'b']],
           labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])
This works perfectly fine, but it seems like a common usecase and I am surprised I had to implement it myself. So, my question is, what have I missed/misunderstood in the Pandas library itself that offers this functionality?
Edit to add: This function has been added to Pandas as MultiIndex.from_product for the 0.13.1 release.
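Since the edit above mentions it, the built-in now looks like this (the level names are illustrative):

```python
import pandas as pd

# MultiIndex.from_product (available since pandas 0.13.1)
idx = pd.MultiIndex.from_product([range(3), list("ab")], names=["num", "letter"])
print(list(idx))  # [(0, 'a'), (0, 'b'), (1, 'a'), (1, 'b'), (2, 'a'), (2, 'b')]
```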
This is a very similar construction (but using cartesian_product, which is faster than itertools.product for larger arrays):
In [2]: from pandas.tools.util import cartesian_product
In [3]: MultiIndex.from_arrays(cartesian_product([range(3),list('ab')]))
Out[3]:
MultiIndex(levels=[[0, 1, 2], [u'a', u'b']],
           labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])
could be added as a convenience method, maybe MultiIndex.from_iterables(...)
please open an issue (and a PR if you'd like)
FYI, I very rarely construct a MultiIndex 'manually'; it is almost always easier to construct a frame and just set_index.
In [10]: df = DataFrame(dict(A=np.arange(6),
                             B=['foo'] * 3 + ['bar'] * 3,
                             C=np.ones(6) + np.arange(6) % 2)
                        ).set_index(['C', 'B']).sortlevel()
In [11]: df
Out[11]:
         A
C B
1 bar    4
  foo    0
  foo    2
2 bar    3
  bar    5
  foo    1

[6 rows x 1 columns]

multi-column factorize in pandas

The pandas factorize function assigns each unique value in a series to a sequential, 0-based index, and calculates which index each series entry belongs to.
I'd like to accomplish the equivalent of pandas.factorize on multiple columns:
import pandas as pd
df = pd.DataFrame({'x': [1, 1, 2, 2, 1, 1], 'y':[1, 2, 2, 2, 2, 1]})
pd.factorize(df)[0] # would like [0, 1, 2, 2, 1, 0]
That is, I want to determine each unique tuple of values in several columns of a data frame, assign a sequential index to each, and compute which index each row in the data frame belongs to.
Factorize only works on single columns. Is there a multi-column equivalent function in pandas?
You need to create an ndarray of tuples first; pandas.lib.fast_zip can do this very fast in a Cython loop.
import pandas as pd
df = pd.DataFrame({'x': [1, 1, 2, 2, 1, 1], 'y':[1, 2, 2, 2, 2, 1]})
print(pd.factorize(pd.lib.fast_zip([df.x, df.y]))[0])
(Note: pd.lib.fast_zip was removed in later pandas releases; the other answers use only the public API.)
the output is:
[0 1 2 2 1 0]
I am not sure if this is an efficient solution; there might be better ones.
arr = []  # this will hold the unique rows of the dataframe
for i in df.index:
    if list(df.iloc[i]) not in arr:
        arr.append(list(df.iloc[i]))
so printing arr would give you
>>> print(arr)
[[1, 1], [1, 2], [2, 2]]
to hold the indices, I would declare an ind array
ind = []
for i in df.index:
    ind.append(arr.index(list(df.iloc[i])))
printing ind would give
>>> print(ind)
[0, 1, 2, 2, 1, 0]
You can use drop_duplicates to drop those duplicated rows
In [23]: df.drop_duplicates()
Out[23]:
   x  y
0  1  1
1  1  2
2  2  2
EDIT
To achieve your goal, you can join your original df to the drop_duplicated one:
In [46]: df.join(df.drop_duplicates().reset_index().set_index(['x', 'y']), on=['x', 'y'])
Out[46]:
   x  y  index
0  1  1      0
1  1  2      1
2  2  2      2
3  2  2      2
4  1  2      1
5  1  1      0
df = pd.DataFrame({'x': [1, 1, 2, 2, 1, 1], 'y':[1, 2, 2, 2, 2, 1]})
tuples = df[['x', 'y']].apply(tuple, axis=1)
df['newID'] = pd.factorize(tuples)[0]
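In recent pandas, groupby(...).ngroup() produces the same multi-column labels directly; a sketch:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 1, 2, 2, 1, 1], 'y': [1, 2, 2, 2, 2, 1]})
# sort=False numbers each unique (x, y) pair in order of first appearance
labels = df.groupby(['x', 'y'], sort=False).ngroup()
print(labels.tolist())  # [0, 1, 2, 2, 1, 0]
```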
