I am modifying a dataframe within a function but I do not want it to change the global variable.
I use two different ways to modify my dataframe, and they affect the global variable differently. The first method, adding a new column by assigning to a non-existent column, modifies the global dataframe. With the second, concatenating a new column, the global dataframe remains unchanged.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                  columns=['a', 'b', 'c'])

def mutation(data):
    data['d'] = [1, 2, 3]

mutation(df)
print(df)
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                  columns=['a', 'b', 'c'])

def mutation(data):
    data = pd.concat([data, pd.DataFrame([1, 2, 3], columns=['d'])], axis=1)

mutation(df)
print(df)
I expect that when I print df after calling the function, I will see columns a, b and c. But the first method also shows column d.
When you pass the data object to the function, you are actually passing a reference to it. So when you do in-place mutations on the object it points to, those mutations are visible outside of the function as well.
If you want to keep your original data un-mutated, pass a clone of the original data frame as follows:
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                  columns=['a', 'b', 'c'])

def mutation(data):
    data['d'] = [1, 2, 3]

mutation(df.copy())
print(df)
Output:
a b c
0 1 2 3
1 4 5 6
2 7 8 9
The function operated on the clone, so the original data frame is unmodified.
The second example is not an in-place operation on the original data frame: pd.concat creates a new data frame, and rebinding the local name data to it has no effect on the caller. So in the second example, your original df is not modified.
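An alternative to copying at the call site is to copy inside the function and return the new frame; the function name `with_column` below is just illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                  columns=['a', 'b', 'c'])

def with_column(data):
    out = data.copy()      # work on a copy, leave the argument alone
    out['d'] = [1, 2, 3]
    return out

df2 = with_column(df)
print(list(df.columns))    # ['a', 'b', 'c'] -- original untouched
print(list(df2.columns))   # ['a', 'b', 'c', 'd']
```

This keeps the mutation explicit at the call site: the caller decides what to do with the returned frame.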
Given a random df:
df = pd.DataFrame([[1,2,3,4],[4,5,6,7],[7,8,9,10],[10,11,12,13],[14,15,16,17]], columns=['A', 'B','C','D'])
cols_in = list(df)[0:2]+list(df)[4:]
now:
x = []
for i in range(df.shape[0]):
    x.append(df.iloc[i, cols_in])
Obviously the loop raises an error, because of the cols_in argument passed to iloc (it holds labels, while iloc expects positions).
How could a mixed-style slicing of df like this be applied inside the append call?
It seems like you want to exclude one column? There is no column index 4, so depending on which columns you need, something like this might be what you are after:
df = pd.DataFrame([[1,2,3,4],[4,5,6,7],[7,8,9,10],[10,11,12,13],[14,15,16,17]], columns=['A', 'B','C','D'])
If you want to get the column indices from column names you can do:
import numpy as np

cols = ['A', 'B', 'D']
cols_in = np.nonzero(df.columns.isin(cols))[0]

x = []
for i in range(df.shape[0]):
    x.append(df.iloc[i, cols_in].to_list())
x
Output:
[[1, 2, 4], [4, 5, 7], [7, 8, 10], [10, 11, 13], [14, 15, 17]]
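If the goal is just the list of row slices, the loop can be avoided entirely by selecting the wanted columns by label first and converting all rows at once (a sketch, assuming the same column names):

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4], [4, 5, 6, 7], [7, 8, 9, 10],
                   [10, 11, 12, 13], [14, 15, 16, 17]],
                  columns=['A', 'B', 'C', 'D'])

# Label-based column selection, then one bulk conversion -- no Python loop.
x = df[['A', 'B', 'D']].values.tolist()
print(x)  # [[1, 2, 4], [4, 5, 7], [7, 8, 10], [10, 11, 13], [14, 15, 17]]
```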
I have two pandas dataframes that do not have the same length. df1 has unique ids in column id. These ids occur (multiple times) in df2.colA. I'd like to add, as a new column of df1, a list of all occurrences of df1.id in df2.colA (together with the value of another column at the matching rows), either with the index of the match in df2.colA or additionally with other row entries of all matches.
Example:
df1.id = [1, 2, 3, 4]
df2.colA = [3, 4, 4, 2, 1, 1]
df2.colB = [5, 9, 6, 5, 8, 7]
So that my operation creates something like:
df1.colAB = [ [[1,8],[1,7]], [[2,5]], [[3,5]], [[4,9],[4,6]] ]
I've tried a bunch of approaches with mapping, explicit looping (super slow), checking with isin, etc.
You could use pandas apply to iterate over each row of df1 while building the list of matches. Inside the apply, df2[df2.colA == row].index gives the indices in df2.colA that match the id, loc pulls the corresponding colB values, and a list comprehension assembles the [id, value] pairs.
import pandas as pd

# setup
df1 = pd.DataFrame({'id': [1, 2, 3, 4]})
print(df1)

df2 = pd.DataFrame({
    'colA': [3, 4, 4, 2, 1, 1],
    'colB': [5, 9, 6, 5, 8, 7]
})
print(df2)

# code
df1['colAB'] = df1['id'].apply(lambda row:
    [[row, idx] for idx in df2.loc[df2[df2.colA == row].index, 'colB']])
print(df1)
Output from df1
id colAB
0 1 [[1, 8], [1, 7]]
1 2 [[2, 5]]
2 3 [[3, 5]]
3 4 [[4, 9], [4, 6]]
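The same result can also be built without scanning df2 once per id, by grouping the [colA, colB] pairs first and mapping the groups onto df1.id (a sketch):

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3, 4]})
df2 = pd.DataFrame({'colA': [3, 4, 4, 2, 1, 1],
                    'colB': [5, 9, 6, 5, 8, 7]})

# Build the [colA, colB] pairs once, collect them per colA value,
# then map each id to its list of pairs.
pairs = pd.Series(df2[['colA', 'colB']].values.tolist(), index=df2.index)
grouped = pairs.groupby(df2['colA']).agg(list)
df1['colAB'] = df1['id'].map(grouped)
print(df1)
```

This does one pass over df2 instead of one per row of df1, which matters when df1 has many ids.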
I have a pandas dataframe df. And let's say I wanted to share df with you guys here to allow you to easily recreate df in your own notebook.
Is there a command or function that will generate the pandas dataframe create statement? I realize that for a lot of data the statement would be quite large seeing that it must include the actual data, so a header would be ideal.
Essentially, a command that I can run on df and get something like this:
d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
or
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                  columns=['a', 'b', 'c'])
I'm not sure how to even phrase this question. Like taking a dataframe and deconstructing the top 5 rows or something?
We usually use read_clipboard:
pd.read_clipboard()
Out[328]:
col1 col2
0 1 3
1 2 4
Or, if you have the df, save it into a dict so that it can easily be converted back into the sample we need:
df.head(5).to_dict()
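The dict round-trip looks like this (a minimal sketch with a toy frame):

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

# to_dict() produces a literal you can paste into a question...
d = df.head(5).to_dict()
print(d)  # {'col1': {0: 1, 1: 2}, 'col2': {0: 3, 1: 4}}

# ...and readers can rebuild the frame from it.
df_rebuilt = pd.DataFrame(d)
print(df_rebuilt.equals(df))  # True
```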
Let's consider a dataframe A with three columns: a, b and c. Suppose we also have a Series B of the same length as A. In each row it contains the name of one of A's columns. I want to construct the Series that contains the values from A at the columns specified by B.
The simplest example would be the following:
idxs = np.arange(0, 5)
A = pd.DataFrame({
'a': [3, 1, 5, 7, 8],
'b': [5, 6, 7, 3, 1],
'c': [2, 7, 8, 2, 1],
}, index=idxs)
B = pd.Series(['b', 'c', 'c', 'a', 'a'], index=idxs)
I need to apply some operation that gives a result identical to the following series:
C = pd.Series([5, 7, 8, 7, 8], index=idxs)
In such a simple example one can perform 'broadcasting' as follows on pure NumPy arrays:
d = {'a': 0, 'b': 1, 'c': 2}
AA = A.rename(columns=d).to_numpy()  # as_matrix() was removed; to_numpy() is the replacement
BB = B.apply(lambda x: d[x]).to_numpy()
CC = AA[idxs, BB]
That works, but in my real problem I have multiindexed Dataframe, and things become more complicated.
Is it possible to do so, using pandas tools?
The first thing that comes into my mind is:
A['idx'] = B
C = A.apply(lambda x: x[x['idx']], axis=1)
It works!
You can use DataFrame.lookup:
pd.Series(A.lookup(B.index, B), index=B.index)
0 5
1 7
2 8
3 7
4 8
dtype: int64
A NumPy solution involving broadcasting is:
A.values[B.index, (A.columns.values == B.to_numpy()[:, None]).argmax(1)]
# array([5, 7, 8, 7, 8])
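Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. On current pandas, a replacement sketch using Index.get_indexer to turn the labels in B into column positions:

```python
import numpy as np
import pandas as pd

idxs = np.arange(0, 5)
A = pd.DataFrame({'a': [3, 1, 5, 7, 8],
                  'b': [5, 6, 7, 3, 1],
                  'c': [2, 7, 8, 2, 1]}, index=idxs)
B = pd.Series(['b', 'c', 'c', 'a', 'a'], index=idxs)

# Map the column labels in B to positional indices, then fancy-index.
rows = np.arange(len(A))
cols = A.columns.get_indexer(B)
C = pd.Series(A.to_numpy()[rows, cols], index=B.index)
print(C.tolist())  # [5, 7, 8, 7, 8]
```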
I want to convert a pandas dataframe into a list.
For example, I have the dataframe below, and I want to make a list from all columns.
Dataframe (df)
A B C
0 4 8
1 5 9
2 6 10
3 7 11
Expected result
[[0,1,2,3], [4,5,6,7], [8,9,10,11]]
If I use df.values.tolist(), it returns a row-based list like below.
[[0,4,8], [1,5,9], [2,6,10], [3,7,11]]
It is possible to transpose the dataframe, but I want to know whether there are better solutions.
I think the simplest is to transpose.
Use T or numpy.ndarray.transpose:
df1 = df.T.values.tolist()
print (df1)
[[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
Or:
df1 = df.values.transpose().tolist()
print (df1)
[[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
Another answer with a list comprehension, thank you John Galt:
L = [df[x].tolist() for x in df.columns]
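Equivalently, to_dict with the 'list' orientation gives the column-ordered lists directly, without transposing:

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 2, 3],
                   'B': [4, 5, 6, 7],
                   'C': [8, 9, 10, 11]})

# 'list' orientation maps each column label to its values as a list;
# taking the dict's values yields one list per column, in column order.
L = list(df.to_dict('list').values())
print(L)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
```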