I am using pandas v0.25.3 and am inexperienced but learning.
I have a dataframe and would like to swap the contents of two columns, leaving the column labels and their order intact.
df = pd.DataFrame({"A": [1, 2, 3, 4],
                   "B": [5, 6, 7, 8],
                   "C": [9, 10, 11, 12]})
This yields a dataframe:
A B C
0 1 5 9
1 2 6 10
2 3 7 11
3 4 8 12
I want to swap column contents B and C to get
A B C
0 1 9 5
1 2 10 6
2 3 11 7
3 4 12 8
I have tried looking at pd.DataFrame.values, which sent me to NumPy arrays and advanced slicing, and got lost.
What's the simplest way to do this?
You can assign a NumPy array:
#pandas 0.24+
df[['B','C']] = df[['C','B']].to_numpy()
#older pandas versions
df[['B','C']] = df[['C','B']].values
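The conversion matters: assigning the DataFrame slice directly does nothing, because pandas aligns the assignment on column labels, so the 'C' data lands back in 'C'. Stripping the labels with .to_numpy() (or .values) forces a purely positional assignment. A quick sketch of the pitfall, using the sample frame from the question:
df[['B','C']] = df[['C','B']]             # no-op: pandas aligns on column labels
df[['B','C']] = df[['C','B']].to_numpy()  # positional: B and C are swapped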
Or use DataFrame.assign:
df = df.assign(B=df.C, C=df.B)
print (df)
A B C
0 1 9 5
1 2 10 6
2 3 11 7
3 4 12 8
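This works because both keyword arguments are evaluated against the original df before assign runs, so B receives the old C and C receives the old B.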
Or just use:
df['B'], df['C'] = df['C'], df['B'].copy()
print(df)
Output:
A B C
0 1 9 5
1 2 10 6
2 3 11 7
3 4 12 8
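The .copy() on the second Series matters here: the right-hand side is evaluated first, but the first column assignment can mutate the shared underlying data in place on some pandas versions, so without the copy B and C may end up identical.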
You can also swap the labels:
df.columns = ['A','C','B']
If your DataFrame is very large, I believe relabeling requires less work from your computer than copying all the data.
If the order of the columns is important, you can then reorder them:
df = df.reindex(['A','B','C'], axis=1)
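Putting the two steps together on the sample frame from the question (a quick sketch):
df.columns = ['A','C','B']                # relabel: B's data is now called C, and vice versa
df = df.reindex(['A','B','C'], axis=1)    # restore the original label order
# df now matches the wanted output above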
Is there a way to pass columns to value_vars by index position rather than by name?
Every example that I see is written with column names in value_vars. I need to use the column index.
For instance, instead of:
df2 = pd.melt(df,value_vars=['asset1','asset2'])
Using something similar to:
df2 = pd.melt(df,value_vars=[0,1])
Select the column names by indexing:
df = pd.DataFrame({
    'asset1': list('acacac'),
    'asset2': [4]*6,
    'A': [7,8,9,4,2,3],
    'D': [1,3,5,7,1,0],
    'E': [5,3,6,9,2,4]
})
df2 = pd.melt(df,
              id_vars=df.columns[[0,1]],
              value_vars=df.columns[[2,3]],
              var_name='c_name',
              value_name='Value')
print (df2)
asset1 asset2 c_name Value
0 a 4 A 7
1 c 4 A 8
2 a 4 A 9
3 c 4 A 4
4 a 4 A 2
5 c 4 A 3
6 a 4 D 1
7 c 4 D 3
8 a 4 D 5
9 c 4 D 7
10 a 4 D 1
11 c 4 D 0
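If you want to melt every column after the first two instead of picking them individually, slicing the column Index works the same way (a variant sketch; with this sample it would also include column E):
df2 = pd.melt(df,
              id_vars=df.columns[:2],
              value_vars=df.columns[2:],
              var_name='c_name',
              value_name='Value')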
I'm working on a project where my original dataframe is:
A B C label
0 1 2 2 NaN
1 2 4 5 7
2 3 6 5 NaN
3 4 8 7 NaN
4 5 10 3 8
5 6 12 4 8
But I have an array with new labels for certain points in the original dataframe (for those I only used columns A and B). Something like this:
X_labeled = [[2, 4], [3,6]]
y_labeled = [5,9]
My goal is to add the new labels to the original dataframe. I know that the combination of A and B is unique. What is the fastest way to assign each new label to the correct row?
This is my try:
y_labeled = np.array(y_labeled).astype('float64')
current_position = 0
for point in X_labeled:
    row = df.loc[(df['A'] == point[0]) & (df['B'] == point[1])]
    df.loc[row.index, 'label'] = y_labeled[current_position]
    current_position += 1
Wanted output (rows with index 1 and 2 are changed):
A B C label
0 1 2 2 NaN
1 2 4 5 5
2 3 6 5 9
3 4 8 7 NaN
4 5 10 3 8
5 6 12 4 8
For small datasets this may be okay, but I'm currently using it for datasets with more than 25,000 labels. Is there a faster way?
Also, in some cases I used all columns except the column 'label'. That dataframe consists of 64 columns, so my method cannot be used there. Does someone have an idea to improve this?
Thanks in advance
The best solution is to make your arrays into a dataframe and use df.update():
new = pd.DataFrame(X_labeled, columns=['A', 'B'])
new['label'] = y_labeled
new = new.set_index(['A', 'B'])
df = df.set_index(['A', 'B'])
df.update(new)
df = df.reset_index()
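This works because update aligns the two frames on the shared (A, B) index, so only the rows present in new are overwritten and everything else is left untouched. Note that the label column will be float (assuming it holds real NaN, not the string 'Nan'), so the new labels come out as 5.0 and 9.0.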
Here's a NumPy-based approach aimed at performance. To vectorize this, we need a way to check membership of the rows of X_labeled in columns A and B. We can view these two columns as a single 1D array (based on this answer) and then use np.in1d to index the dataframe and assign the values in y_labeled:
import numpy as np
X_labeled = [[2, 4], [3,6]]
y_labeled = [5,9]
a = df.values[:,:2].astype(int) #indexing on A and B
def view_as_1d(a):
    a = np.ascontiguousarray(a)
    return a.view(np.dtype((np.void, a.dtype.itemsize * a.shape[-1])))
ix = np.in1d(view_as_1d(a), view_as_1d(X_labeled))
df.loc[ix, 'label'] = y_labeled
print(df)
A B C label
0 1 2 2 NaN
1 2 4 5 5
2 3 6 5 9
3 4 8 7 NaN
4 5 10 3 8
5 6 12 4 8
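One caveat with this approach: df.loc[ix, 'label'] = y_labeled assigns the values in the row order of the dataframe, so it assumes the points in X_labeled appear in the same relative order as the matching rows of df. If that ordering is not guaranteed, the update-based answer above is the safer choice.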
This sounds a bit weird, but I think it's exactly what I need right now:
I got several pandas dataframes that contains columns with float numbers, for example:
a b c
0 0 1 2
1 3 4 5
2 6 7 8
Now I want to add a column with only one row, whose value equals the average of column 'a', in this case 3.0. So the new dataframe will look like this:
a b c average
0 0 1 2 3.0
1 3 4 5
2 6 7 8
And all the rows below are empty.
I've tried things like df['average'] = np.mean(df['a']), but that gives me a whole column of 3.0. Any help will be appreciated.
Assign a Series; this is cleaner.
df['average'] = pd.Series(df['a'].mean(), index=df.index[[0]])
Or, even better, assign with loc:
df.loc[df.index[0], 'average'] = df['a'].mean().item()
Filling the NaNs is straightforward; you can do
df['average'] = df['average'].fillna('')
df
a b c average
0 0 1 2 3
1 3 4 5
2 6 7 8
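The reason df['average'] = np.mean(df['a']) fills the whole column is that a scalar is broadcast to every row, whereas a Series is aligned on the index, so rows missing from the Series are left as NaN.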
You can do something like:
df['average'] = [np.mean(df['a'])]+['']*(len(df)-1)
Here is a full example:
import pandas as pd
import numpy as np
df = pd.DataFrame(
    [(0,1,2), (3,4,5), (6,7,8)],
    columns=['a', 'b', 'c'])
print(df)
a b c
0 0 1 2
1 3 4 5
2 6 7 8
df['average'] = ''
df.loc[0, 'average'] = df['a'].mean()
print(df)
a b c average
0 0 1 2 3
1 3 4 5
2 6 7 8
So I have a file of 500 columns by 600 rows and want to take the average across all columns for rows 200-400:
df = pd.read_csv('file.csv', sep=r'\s+')
sliced_df=df.iloc[200:400]
Then I create a new column of the averages of all rows across all columns, and extract only that newly created column:
sliced_df['mean'] = sliced_df.mean(axis=1)
final_df = sliced_df['mean']
But how can I prevent the index from resetting when I extract the new column?
I think it is not necessary to create a new column in sliced_df; just rename the Series, and if you need the output as a DataFrame, add to_frame. The index is not reset; see the sample below:
#random dataframe
np.random.seed(100)
df = pd.DataFrame(np.random.randint(10, size=(5,5)), columns=list('ABCDE'))
print (df)
A B C D E
0 8 8 3 7 7
1 0 4 2 5 2
2 2 2 1 0 8
3 4 0 9 6 2
4 4 1 5 3 4
#in real data use df.iloc[200:400]
sliced_df=df.iloc[2:4]
print (sliced_df)
A B C D E
2 2 2 1 0 8
3 4 0 9 6 2
final_ser = sliced_df.mean(axis=1).rename('mean')
print (final_ser)
2 2.6
3 4.2
Name: mean, dtype: float64
final_df = sliced_df.mean(axis=1).rename('mean').to_frame()
print (final_df)
mean
2 2.6
3 4.2
Python counts from 0, so if rows 200-400 are counted from 1 you may need to change the slice from 200:400 to 199:399; see the difference:
sliced_df=df.iloc[1:3]
print (sliced_df)
A B C D E
1 0 4 2 5 2
2 2 2 1 0 8
final_ser = sliced_df.mean(axis=1).rename('mean')
print (final_ser)
1 2.6
2 2.6
Name: mean, dtype: float64
final_df = sliced_df.mean(axis=1).rename('mean').to_frame()
print (final_df)
mean
1 2.6
2 2.6
Use the copy() function as follows:
df = pd.read_csv('file.csv', sep=r'\s+')
sliced_df=df.iloc[200:400].copy()
sliced_df['mean'] = sliced_df.mean(axis=1)
final_df = sliced_df['mean'].copy()
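The copy() matters because df.iloc[200:400] may return a view of the original dataframe; assigning a new column to such a slice triggers pandas' SettingWithCopyWarning and may not modify what you expect. Taking an explicit copy makes sliced_df an independent frame that is safe to write to.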
Currently, I'm using:
csvdata.update(data, overwrite=True)
How can I make it update and overwrite a specific column but not another? Small but simple question; is there a simple answer?
Rather than update with the entire DataFrame, just update with the sub-DataFrame of the columns you are interested in. For example:
In [11]: df1
Out[11]:
A B
0 1 99
1 3 99
2 5 6
In [12]: df2
Out[12]:
A B
0 a 2
1 b 4
2 c 6
In [13]: df1.update(df2[['B']]) # subset of cols = ['B']
In [14]: df1
Out[14]:
A B
0 1 2
1 3 4
2 5 6
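Note that update modifies df1 in place (it returns None), and by default it only overwrites positions where df2 has non-NaN values.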
If you want to do it for a single column:
import pandas
import numpy
csvdata = pandas.DataFrame({"a":range(12), "b":range(12)})
other = pandas.Series(list("abcdefghijk")+[numpy.nan])
csvdata["a"].update(other)
print(csvdata)
a b
0 a 0
1 b 1
2 c 2
3 d 3
4 e 4
5 f 5
6 g 6
7 h 7
8 i 8
9 j 9
10 k 10
11 11 11
Or, as long as the column names match, you can do this:
other = pandas.DataFrame({"a":list("abcdefghijk")+[numpy.nan], "b":list("abcdefghijk")+[numpy.nan]})
csvdata.update(other["a"])