I am currently working on a dataframe from a cross-tab operation.
pd.crosstab(data['One'], data['two'], margins=True).apply(lambda r: r / len(data) * 100, axis=1)
The columns come out in the following order:
A B C D E All
B
C
D
E
All 100
But I want the columns ordered as shown below:
A C D B E All
B
C
D
E
All 100
Is there an easy way to reorder the columns?
When I use colnames=['C', 'D', 'B', 'E'] it raises an error:
AssertionError: arrays and names must have the same length
You can use reindex (or the older reindex_axis), or change the order by subsetting:
colnames=['C', 'D','B','E']
new_cols = colnames + ['All']
#solution 1: reindex_axis (deprecated and removed in pandas 1.0, use reindex instead)
df1 = df.reindex_axis(new_cols, axis=1)
#solution 2: reindex
df1 = df.reindex(columns=new_cols)
#solution 3: change order by subset
df1 = df[new_cols]
print (df1)
C D B E All
0 NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN 100.0
To reorder the columns of any pandas dataframe, just index with a list of the columns in the order you want:
columns = ['A', 'C', 'D', 'B', 'E', 'All']
df2 = df.loc[:, columns]
print(df2)
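Putting this together for the crosstab case, a minimal sketch (the dataframe here is a made-up stand-in for the question's data, so the values are an assumption):

```python
import pandas as pd

# hypothetical stand-in for the question's `data`
data = pd.DataFrame({'One': list('xxyyzz'), 'two': list('ABCDAB')})

ct = pd.crosstab(data['One'], data['two'], margins=True).apply(
    lambda r: r / len(data) * 100, axis=1)

# reorder the value columns, keeping 'All' last
ct = ct.reindex(columns=['A', 'C', 'D', 'B', 'All'])
print(ct.columns.tolist())
```

Because reindex works on labels, any column name that does not exist in the crosstab would simply come back as a NaN column, so double-check the spelling of the list.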
I have a dataframe with a multiindex, as per the following example:
import datetime
import numpy
import pandas

dates = pandas.date_range(datetime.date(2020, 1, 1), datetime.date(2020, 1, 4))
columns = ['a', 'b', 'c']
index = pandas.MultiIndex.from_product([dates, columns])
panel = pandas.DataFrame(index=index, columns=columns)
This gives me a dataframe like this:
a b c
2020-01-01 a NaN NaN NaN
b NaN NaN NaN
c NaN NaN NaN
2020-01-02 a NaN NaN NaN
b NaN NaN NaN
c NaN NaN NaN
2020-01-03 a NaN NaN NaN
b NaN NaN NaN
c NaN NaN NaN
2020-01-04 a NaN NaN NaN
b NaN NaN NaN
c NaN NaN NaN
I have another 2-dimensional dataframe, as follows:
df = pandas.DataFrame(index=dates, columns=columns, data=numpy.random.rand(len(dates), len(columns)))
Resulting in the following:
a b c
2020-01-01 0.540867 0.426181 0.220182
2020-01-02 0.864340 0.432873 0.487878
2020-01-03 0.017099 0.181050 0.373139
2020-01-04 0.764557 0.097839 0.499788
I would like to assign to the [a, a] cell across all dates, then the [a, b] cell across all dates, and so on.
Something akin to the following:
for i in df.columns:
    for j in df.columns:
        panel.xs(i, level=1).loc[j] = df[i] * df[j]
Of course this doesn't work, because I am attempting to set a value on a copy of a slice
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
I tried several variations:
panel.loc[:,'a'] # selects all rows, and column a
panel.loc[(:, 'a'), 'a'] # invalid syntax
etc...
How can I select index level 1 (eg: row 'a'), column 'a', across all index level 0 - and be able to set the values?
Try broadcasting on the values:
a = df.to_numpy()
# row-wise outer products: (n, c, 1) * (n, 1, c) -> (n, c, c), stacked to (n*c, c)
panel = pd.DataFrame((a[..., None] * a[:, None, :]).reshape(-1, df.shape[1]),
                     index=panel.index, columns=panel.columns)
Output:
a b c
2020-01-01 a 0.292537 0.230507 0.119089
b 0.230507 0.181630 0.093837
c 0.119089 0.093837 0.048480
2020-01-02 a 0.747084 0.374149 0.421692
b 0.374149 0.187379 0.211189
c 0.421692 0.211189 0.238025
2020-01-03 a 0.000292 0.003096 0.006380
b 0.003096 0.032779 0.067557
c 0.006380 0.067557 0.139233
2020-01-04 a 0.584547 0.074803 0.382116
b 0.074803 0.009572 0.048899
c 0.382116 0.048899 0.249788
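If you would rather keep the nested loop and set values in place, label-based assignment through pd.IndexSlice avoids the SettingWithCopyWarning, because it writes through .loc on the original frame rather than on a slice. A sketch, reusing the question's setup:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('2020-01-01', '2020-01-04')
columns = ['a', 'b', 'c']
panel = pd.DataFrame(index=pd.MultiIndex.from_product([dates, columns]),
                     columns=columns, dtype=float)
df = pd.DataFrame(np.random.rand(len(dates), len(columns)),
                  index=dates, columns=columns)

idx = pd.IndexSlice
for i in df.columns:
    for j in df.columns:
        # all dates at level 0, row label j at level 1, column i;
        # .to_numpy() sidesteps alignment against the MultiIndex rows
        panel.loc[idx[:, j], i] = (df[i] * df[j]).to_numpy()
```

The broadcasting answer above is much faster for large frames; the loop form is easier to adapt when the per-cell computation is not a plain product.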
df1 = pd.DataFrame(columns=['a','B','c','D'])
df2 = pd.DataFrame({'a':[1,2],'B':[3,4]})
I'd like to update the empty columns in df1 with df2 while keeping the original column order. I tried
df1.combine_first(df2)
but this changes the order of the columns:
B D a c
0 3 NaN 1 NaN
1 4 NaN 2 NaN
Try reindex:
df2.reindex(df1.columns, axis=1)
Out[44]:
a B c D
0 1 3 NaN NaN
1 2 4 NaN NaN
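Since df1 is what defines the desired column order, another sketch is to combine first and then subset with df1's columns:

```python
import pandas as pd

df1 = pd.DataFrame(columns=['a', 'B', 'c', 'D'])
df2 = pd.DataFrame({'a': [1, 2], 'B': [3, 4]})

# combine, then restore df1's original column order by subsetting
out = df1.combine_first(df2)[df1.columns]
print(out)
```

This also works when df1 is not empty, because combine_first keeps df1's non-null values and only fills the gaps from df2.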
I have an empty pandas dataframe (df), a list of (index, column) pairs (pair_list), and a list of corresponding values (value_list). I want to assign the value in value_list to the corresponding position in df according to pair_list. The following code is what I am using currently, but it is slow. Is there any faster way to do it?
import pandas as pd
import numpy as np
df = pd.DataFrame(index=[0,1,2,3], columns=['a', 'b','c','d'])
pair_list = [(0,'a'),(1,'c'),(0,'d')]
value_list = np.array([3,2,4])
for pos, item in enumerate(pair_list):
    df.at[item] = value_list[pos]
The output of the code should be:
a b c d
0 3 NaN NaN 4
1 NaN NaN 2 NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
One idea is to create a MultiIndex with MultiIndex.from_tuples, build a Series from it, reshape with Series.unstack, and then add the missing columns and index values with DataFrame.reindex:
pair_list = [(0,'a'),(1,'c'),(0,'d')]
value_list = np.array([3,2,4])
mux = pd.MultiIndex.from_tuples(pair_list)
cols = ['a', 'b','c','d']
idx = [0,1,2,3]
df = pd.Series(value_list, index=mux).unstack().reindex(index=idx, columns=cols)
print (df)
a b c d
0 3.0 NaN NaN 4.0
1 NaN NaN 2.0 NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
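Alternatively, you can skip the per-cell loop entirely: translate the labels to integer positions once with Index.get_indexer, then fill a pre-allocated NumPy array in a single vectorised assignment. A sketch under the question's setup:

```python
import numpy as np
import pandas as pd

index = [0, 1, 2, 3]
cols = ['a', 'b', 'c', 'd']
pair_list = [(0, 'a'), (1, 'c'), (0, 'd')]
value_list = np.array([3, 2, 4])

# map row and column labels to integer positions
r = pd.Index(index).get_indexer([p[0] for p in pair_list])
c = pd.Index(cols).get_indexer([p[1] for p in pair_list])

# one vectorised write into a pre-allocated float array
arr = np.full((len(index), len(cols)), np.nan)
arr[r, c] = value_list
df = pd.DataFrame(arr, index=index, columns=cols)
print(df)
```

Fancy indexing writes all values in one NumPy call, so this avoids both the Python-level loop and repeated .at lookups.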
I have a df1, example:
B A C
B 1
A 1
C 2
and a df2, example:
C E D
C 2 3
E 1
D 2
The column and row 'C' is common in both dataframes.
I would like to combine these dataframes such that I get,
B A C D E
B 1
A 1
C 2 2 3
D 1
E 2
Is there an easy way to do this? pd.concat and DataFrame.append do not seem to work. Thanks!
Edit: df1.combine_first(df2) works (thanks #jezarel), but can we keep the original ordering?
The problem is that combine_first always sorts the column and index names, so you need to reindex with the combined column names:
idx = df1.columns.append(df2.columns).unique()
print (idx)
Index(['B', 'A', 'C', 'E', 'D'], dtype='object')
df = df1.combine_first(df2).reindex(index=idx, columns=idx)
print (df)
B A C E D
B NaN 1.0 NaN NaN NaN
A NaN NaN 1.0 NaN NaN
C 2.0 NaN NaN 2.0 3.0
E NaN NaN NaN NaN 1.0
D NaN NaN 2.0 NaN NaN
More general solution:
c = df1.columns.append(df2.columns).unique()
i = df1.index.append(df2.index).unique()
df = df1.combine_first(df2).reindex(index=i, columns=c)
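A quick check of the general solution on the question's frames (reconstructed here from the sample tables, so the exact values are an assumption):

```python
import numpy as np
import pandas as pd

# df1: columns/index B A C, as in the question's first table
df1 = pd.DataFrame({'B': [np.nan, np.nan, 2.0],
                    'A': [1.0, np.nan, np.nan],
                    'C': [np.nan, 1.0, np.nan]}, index=list('BAC'))
# df2: columns/index C E D, as in the second table
df2 = pd.DataFrame({'C': [np.nan, np.nan, 2.0],
                    'E': [2.0, np.nan, np.nan],
                    'D': [3.0, 1.0, np.nan]}, index=list('CED'))

c = df1.columns.append(df2.columns).unique()
i = df1.index.append(df2.index).unique()
df = df1.combine_first(df2).reindex(index=i, columns=c)
print(df)
```

The shared row/column 'C' merges the values from both frames, and the reindex restores the first-seen ordering instead of the alphabetical one.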
The following is an example of the data I have in an Excel sheet.
A B C
1 2 3
4 5 6
I am trying to get the column names using the following code:
p1 = list(df1t.columns.values)
The output is like this:
[A, B, C, 'Unnamed: 3', 'Unnamed: 4', 'Unnamed: 5', .....]
I checked the Excel sheet; there are only three columns, named A, B, and C. The other columns are blank. Any suggestions?
Just in case anybody stumbles over this problem: the issue can also arise if the Excel sheet contains empty cells that are formatted with a background color:
import pandas as pd
df1t = pd.read_excel('test.xlsx')
print(df1t)
A B C Unnamed: 3
0 1 2 3 NaN
1 4 5 6 NaN
One option is to drop the 'Unnamed' columns as described here:
https://stackoverflow.com/a/44272830/11826257
df1t = df1t[df1t.columns.drop(list(df1t.filter(regex='Unnamed:')))]
print(df1t)
A B C
0 1 2 3
1 4 5 6
The problem is that some cells are not empty but contain whitespace.
If you need the column names with the Unnamed columns filtered out:
cols = [col for col in df if not col.startswith('Unnamed:')]
print (cols)
['A', 'B', 'C']
Sample with a file:
df = pd.read_excel('https://dl.dropboxusercontent.com/u/84444599/file_unnamed_cols.xlsx')
print (df)
A B C Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7
0 4.0 6.0 8.0 NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN
cols = [col for col in df if not col.startswith('Unnamed:')]
print (cols)
['A', 'B', 'C']
Another solution:
cols = df.columns[~df.columns.str.startswith('Unnamed:')]
print (cols)
Index(['A', 'B', 'C'], dtype='object')
And to select all of those columns from the dataframe, use:
print (df[cols])
A B C
0 4.0 6.0 8.0
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
And if necessary, remove rows that are all NaN:
print (df[cols].dropna(how='all'))
A B C
0 4.0 6.0 8.0
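The filtering and row-dropping steps can be chained in one go; a sketch on a constructed frame that mimics the sample sheet (no Excel file needed, so the values here are a stand-in):

```python
import numpy as np
import pandas as pd

# mimic what read_excel returns for a sheet with stray formatted cells
df = pd.DataFrame({'A': [4.0, np.nan], 'B': [6.0, np.nan], 'C': [8.0, np.nan],
                   'Unnamed: 3': [np.nan, np.nan], 'Unnamed: 4': [np.nan, np.nan]})

# drop the auto-generated Unnamed columns, then the all-NaN rows
clean = df.loc[:, ~df.columns.str.startswith('Unnamed:')].dropna(how='all')
print(clean)
```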