In [142]:
import pandas as pd
df = pd.DataFrame([[0,1,2,3]], columns=['a', 'b', 'c', 'd'])
df1 = pd.DataFrame()
d = {'name' : pd.Panel(items=['x', 'y', 'z'])}
d['name']['x']
Out[142]:
Index([], dtype='object') Empty DataFrame
0 rows × 0 columns
This doesn't seem to work:
In [143]:
d['name']['x'] = df
d['name']['x']
Out[143]:
Index([], dtype='object') Empty DataFrame
0 rows × 0 columns
But this does:
In [144]:
df1 = df
df1
Out[144]:
a b c d
0 0 1 2 3
1 rows × 4 columns
Is there something about Panels that I'm missing?
Suppose I have the following dataset (2 rows, 2 columns, headers are Char0 and Char1):
dataset = [['A', 'B'], ['B', 'C']]
columns = ['Char0', 'Char1']
df = pd.DataFrame(dataset, columns=columns)
I would like to one-hot encode the columns Char0 and Char1, so:
df = pd.concat([df, pd.get_dummies(df["Char0"], prefix='Char0')], axis=1)
df = pd.concat([df, pd.get_dummies(df["Char1"], prefix='Char1')], axis=1)
df.drop(['Char0', "Char1"], axis=1, inplace=True)
which results in a dataframe with column headers Char0_A, Char0_B, Char1_B, Char1_C.
Now, for each original column, I would like an indicator column for every one of A, B, C, and D (even though there is currently no 'D' in the dataset). In this case, that means 8 columns: Char0_A, Char0_B, Char0_C, Char0_D, Char1_A, Char1_B, Char1_C, Char1_D.
Can somebody help me out?
Use get_dummies on all columns, then call DataFrame.reindex with every possible column name, built with itertools.product:
dataset = [['A', 'B'], ['B', 'C']]
columns = ['Char0', 'Char1']
df = pd.DataFrame(dataset, columns=columns)
vals = ['A','B','C','D']
from itertools import product
cols = ['_'.join(x) for x in product(df.columns, vals)]
print (cols)
['Char0_A', 'Char0_B', 'Char0_C', 'Char0_D', 'Char1_A', 'Char1_B', 'Char1_C', 'Char1_D']
df1 = pd.get_dummies(df).reindex(cols, axis=1, fill_value=0)
print (df1)
Char0_A Char0_B Char0_C Char0_D Char1_A Char1_B Char1_C Char1_D
0 1 0 0 0 0 1 0 0
1 0 1 0 0 0 0 1 0
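The same approach can be wrapped in a small helper, as a rough sketch. The name one_hot_fixed_levels and its parameters are hypothetical, not part of pandas. dtype=int is passed to get_dummies on the assumption that 0/1 integers are wanted, since the default dummy dtype differs between pandas versions.

```python
from itertools import product

import pandas as pd


def one_hot_fixed_levels(frame, levels):
    """One-hot encode every column of `frame` against a fixed list of levels.

    Hypothetical helper for illustration: columns for levels that never
    occur in the data are still created and filled with 0.
    """
    # Every column/level combination, e.g. 'Char0_A', ..., 'Char1_D'.
    cols = ['_'.join(pair) for pair in product(frame.columns, levels)]
    dummies = pd.get_dummies(frame, dtype=int)
    # reindex adds the missing columns (filled with 0) and fixes the order.
    return dummies.reindex(columns=cols, fill_value=0)


df = pd.DataFrame([['A', 'B'], ['B', 'C']], columns=['Char0', 'Char1'])
encoded = one_hot_fixed_levels(df, ['A', 'B', 'C', 'D'])
print(encoded)
```

Note that reindex also silently drops dummy columns for any value that is not in the list of levels, so the list should cover everything that can occur in the data.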
I have a sample python code:
import pandas as pd
ddf = pd.DataFrame({'col1' : ['A', 'A', 'B'],
'Id' : [3,1,2],
'col3': ['x','a','b']})
ddf.index=ddf['Id']
ddf.sort_values(by='Id')
The snippet above produces the warning "FutureWarning: 'Id' is both an index level and a column label. Defaulting to column, but this will raise an ambiguity error in a future version". It does become an error when I try this under a recent version of Python. I am quite new to Python and pandas. How do I resolve this issue?
The best approach here is to convert the Id column to the index with DataFrame.set_index, which avoids having index.name equal to one of the column names:
ddf = pd.DataFrame({'col1' : ['A', 'A', 'B'],
'Id' : [3,1,2],
'col3': ['x','a','b']})
ddf = ddf.set_index('Id')
print (ddf.index.name)
Id
print (ddf.columns)
Index(['col1', 'col3'], dtype='object')
To sort by the index, DataFrame.sort_index is the better choice:
print (ddf.sort_index())
col1 col3
Id
1 A a
2 B b
3 A x
Your solution works if you change index.name to something different:
ddf = pd.DataFrame({'col1' : ['A', 'A', 'B'],
'Id' : [3,1,2],
'col3': ['x','a','b']})
ddf.index=ddf['Id']
print (ddf.index.name)
Id
print (ddf.columns)
Index(['col1', 'Id', 'col3'], dtype='object')
Set a different index.name with DataFrame.rename_axis, or assign a scalar to it directly:
ddf = ddf.rename_axis('newID')
#alternative
#ddf.index.name = 'newID'
print (ddf.index.name)
newID
print (ddf.columns)
Index(['col1', 'Id', 'col3'], dtype='object')
Now it is possible to distinguish between the index level and the column names, because sort_values works with both:
print(ddf.sort_values(by='Id'))
col1 Id col3
newID
1 A 1 a
2 B 2 b
3 A 3 x
print (ddf.sort_values(by='newID'))
#same like sorting by index
#print (ddf.sort_index())
col1 Id col3
newID
1 A 1 a
2 B 2 b
3 A 3 x
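To see the ambiguity and the fix side by side, here is a minimal sketch. It assumes a pandas version recent enough that the ambiguous label raises a ValueError; on older versions the try block simply emits the FutureWarning and the except branch is never reached.

```python
import pandas as pd

ddf = pd.DataFrame({'col1': ['A', 'A', 'B'],
                    'Id': [3, 1, 2],
                    'col3': ['x', 'a', 'b']})

# Assigning a Series as the index copies its name, so index.name becomes 'Id'
# while the 'Id' column is still present.
ddf.index = ddf['Id']

try:
    ddf.sort_values(by='Id')
except ValueError as exc:
    # Recent pandas versions reject the ambiguous label.
    print('sort_values failed:', exc)

# Renaming the index level removes the ambiguity.
fixed = ddf.rename_axis('newID')
print(fixed.sort_values(by='Id'))
```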
Simply add .values:
ddf.index=ddf['Id'].values
ddf.sort_values(by='Id')
Out[314]:
col1 Id col3
1 A 1 a
2 B 2 b
3 A 3 x
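A short sketch of why this works, as far as I can tell: .values returns a plain NumPy array, which has no name, so the new index is unnamed and 'Id' refers only to the column. (Series.to_numpy() is the newer spelling and should behave the same way here.)

```python
import pandas as pd

ddf = pd.DataFrame({'col1': ['A', 'A', 'B'],
                    'Id': [3, 1, 2],
                    'col3': ['x', 'a', 'b']})

# A bare NumPy array carries no name, so the index stays unnamed.
ddf.index = ddf['Id'].values
print(ddf.index.name)  # None

# 'Id' now matches only the column, so there is nothing ambiguous to resolve.
out = ddf.sort_values(by='Id')
print(out)
```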
Both your columns and your row index contain 'Id'. A simple solution is to not set the (row) index to 'Id' at all:
import pandas as pd
ddf = pd.DataFrame({'col1' : ['A', 'A', 'B'],
'Id' : [3,1,2],
'col3': ['x','a','b']})
ddf.sort_values(by='Id')
Out[0]:
col1 Id col3
1 A 1 a
2 B 2 b
0 A 3 x
Or set the index when you create the df:
ddf = pd.DataFrame({'col1' : ['A', 'A', 'B'],
'col3': ['x','a','b']},
index=[3,1,2])
ddf.sort_index()
Out[1]:
col1 col3
1 A a
2 B b
3 A x
I have two pandas dataframes, df1 and df2. I would like to combine these into a single dataframe (df) but drop any rows whose 'A' value does not appear in the 'A' column of both df1 and df2.
Input:
[in] df1 = A B
0 i y
1 ii y
[in] df2 = A B
0 ii x
1 i y
2 iii z
3 iii z
Desired output:
[out] df = A B
0 i y
1 ii y
2 ii x
3 i y
In the example above, all rows were added to df except those in df2 with 'iii' in the 'A' column, because 'iii' does not appear anywhere in column 'A' of df1.
To take this a step further, the initial number of dataframes is not limited to two. There could be three or more, and I would want to drop any column 'A' values that do not appear in ALL of the dataframes.
How can I make this happen?
Thanks in advance!
This will work for any generic list of dataframes. Also, order of dataframes does not matter.
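import pandas as pd
df1 = pd.DataFrame([['i', 'y'], ['ii', 'y']], columns=['A', 'B'])
df2 = pd.DataFrame([['ii', 'x'], ['i', 'y'], ['iii', 'z'], ['iii', 'z']], columns=['A', 'B'])
dfs = [df1, df2]
# values of 'A' that occur in every dataframe
set_A = set.intersection(*[set(dfi.A.tolist()) for dfi in dfs])
# keep only the matching rows; ignore_index=True renumbers them 0..n-1 as in the desired output
df = pd.concat([dfi[dfi.A.isin(set_A)] for dfi in dfs], ignore_index=True)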
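Since the question also asks about three or more dataframes, here is the same idea as a small function, applied to a third frame. The names concat_common and df3 are made up for this illustration.

```python
import pandas as pd


def concat_common(dfs, key='A'):
    """Concatenate frames, keeping only rows whose `key` value occurs in every frame."""
    # Intersection of the key values across all frames.
    common = set.intersection(*(set(d[key]) for d in dfs))
    kept = [d[d[key].isin(common)] for d in dfs]
    return pd.concat(kept, ignore_index=True)


df1 = pd.DataFrame([['i', 'y'], ['ii', 'y']], columns=['A', 'B'])
df2 = pd.DataFrame([['ii', 'x'], ['i', 'y'], ['iii', 'z'], ['iii', 'z']], columns=['A', 'B'])
df3 = pd.DataFrame([['i', 'q'], ['iii', 'z']], columns=['A', 'B'])  # hypothetical third frame

two = concat_common([df1, df2])
three = concat_common([df1, df2, df3])
print(two)
print(three)
```

With all three frames only 'i' survives, because 'ii' is missing from df3 and 'iii' is missing from df1. The function expects at least one dataframe; an empty list makes set.intersection raise a TypeError.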
How do I merge the following datasets:
df = A
date abc
1 a
1 b
1 c
2 d
2 dd
3 ee
3 df
df = B
date ZZZ
1 a
2 b
3 c
I want to get something like this:
date abc ZZZ
1 a a
1 b a
1 c a
2 d b
2 dd b
3 ee c
3 df c
I tried this code:
aa = pd.merge(A, B, left_on="date", right_on="date", how="left", validate="m:1")
But I get the following error:
TypeError: merge() got an unexpected keyword argument 'validate'
I updated pandas using conda update pandas, but I still get the same error.
Please advise on this issue.
According to the df.merge docs, validate was added in version 0.21.0. You are using an older version, so you should update pandas.
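A quick way to confirm which pandas version the interpreter actually imports, followed by the merge from the question. This is only a sketch: the frames A and B are rebuilt from the tables in the question, and B_dup is a made-up frame with a duplicated key to show what validate guards against.

```python
import pandas as pd

# validate needs pandas 0.21.0 or later; check what is actually being imported.
print(pd.__version__)

A = pd.DataFrame({'date': [1, 1, 1, 2, 2, 3, 3],
                  'abc': ['a', 'b', 'c', 'd', 'dd', 'ee', 'df']})
B = pd.DataFrame({'date': [1, 2, 3], 'ZZZ': ['a', 'b', 'c']})

# Many rows of A may match one row of B, but keys in B must be unique.
aa = pd.merge(A, B, on='date', how='left', validate='m:1')
print(aa)

# Hypothetical right-hand frame with a repeated key, to trigger the check.
B_dup = pd.concat([B, B.iloc[[0]]], ignore_index=True)
try:
    pd.merge(A, B_dup, on='date', how='left', validate='m:1')
except pd.errors.MergeError as exc:
    print('validation failed:', exc)
```

If the printed version is still old after conda update pandas, the script may be running in a different environment from the one that was updated.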
As @DeepSpace mentioned, you may need to upgrade your pandas.
To replicate the check in earlier versions, you can do something like this:
import pandas as pd
df1 = pd.DataFrame(index=['a', 'a', 'b', 'b', 'c'])
df2 = pd.DataFrame(index=['a', 'b', 'c'])
x = [i for i in df2.index if i in set(df1.index)]
len(x) == len(set(x)) # True
df1 = pd.DataFrame(index=['a', 'a', 'b', 'b', 'c'])
df2 = pd.DataFrame(index=['a', 'b', 'c', 'a'])
y = [i for i in df2.index if i in set(df1.index)]
len(y) == len(set(y)) # False
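Applied to the frames in the question, the check could look roughly like the sketch below. check_many_to_one is a hypothetical helper, not a pandas function, and it tests all keys on the right-hand side for uniqueness, which is slightly stricter than the snippet above (that one only looks at keys that also occur on the left).

```python
import pandas as pd


def check_many_to_one(left, right, on):
    """Left-merge `left` with `right`, refusing to merge if `right` has duplicate keys.

    Hypothetical stand-in for validate='m:1' on pandas versions without it.
    """
    dupes = right.loc[right[on].duplicated(), on].unique().tolist()
    if dupes:
        raise ValueError('right keys are not unique: {}'.format(dupes))
    return pd.merge(left, right, on=on, how='left')


A = pd.DataFrame({'date': [1, 1, 1, 2, 2, 3, 3],
                  'abc': ['a', 'b', 'c', 'd', 'dd', 'ee', 'df']})
B = pd.DataFrame({'date': [1, 2, 3], 'ZZZ': ['a', 'b', 'c']})

merged = check_many_to_one(A, B, 'date')
print(merged)
```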
Consider the following DataFrame:
import numpy as np
import pandas as pd
arrays = [['foo', 'bar', 'bar', 'bar'],
          ['A', 'B', 'C', 'D']]
tuples = list(zip(*arrays))
columnValues = pd.MultiIndex.from_tuples(tuples)
df = pd.DataFrame(np.random.rand(4,4), columns = columnValues)
print(df)
foo bar
A B C D
0 0.859664 0.671857 0.685368 0.939156
1 0.155301 0.495899 0.733943 0.585682
2 0.124663 0.467614 0.622972 0.567858
3 0.789442 0.048050 0.630039 0.722298
Say I want to remove the first column, like so:
df.drop(df.columns[[0]], axis = 1, inplace = True)
print(df)
bar
B C D
0 0.671857 0.685368 0.939156
1 0.495899 0.733943 0.585682
2 0.467614 0.622972 0.567858
3 0.048050 0.630039 0.722298
This produces the expected result. However, the column labels foo and A are retained:
print(df.columns.levels)
[['bar', 'foo'], ['A', 'B', 'C', 'D']]
Is there a way to completely drop a column, including its labels, from a MultiIndex DataFrame?
EDIT: As suggested by John, I had a look at https://github.com/pydata/pandas/issues/12822. What I got from it is that this is not a bug. However, the suggested solution (https://github.com/pydata/pandas/issues/2770#issuecomment-76500001) does not seem to work for me. Am I missing something here?
df2 = df.drop(df.columns[[0]], axis = 1)
print(df2)
bar
B C D
0 0.969674 0.068575 0.688838
1 0.650791 0.122194 0.289639
2 0.373423 0.470032 0.749777
3 0.707488 0.734461 0.252820
print(df2.columns[[0]])
MultiIndex(levels=[['bar', 'foo'], ['A', 'B', 'C', 'D']],
labels=[[0], [1]])
df2.set_index(pd.MultiIndex.from_tuples(df2.columns.values))
ValueError: Length mismatch: Expected axis has 4 elements, new values have 3 elements
New Answer
As of pandas 0.20, pd.MultiIndex has a method for this, pd.MultiIndex.remove_unused_levels:
df.columns = df.columns.remove_unused_levels()
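A self-contained sketch of this, using the frame from the question. The levels printed before the call may still list the dropped labels; after it, only the labels actually in use should remain.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_tuples(
    [('foo', 'A'), ('bar', 'B'), ('bar', 'C'), ('bar', 'D')])
df = pd.DataFrame(np.random.rand(4, 4), columns=columns)

# Drop the first column, as in the question.
df = df.drop(df.columns[[0]], axis=1)
print(df.columns.levels)  # 'foo' and 'A' may still be listed here

# Rebuild the levels so that only labels still in use remain.
df.columns = df.columns.remove_unused_levels()
print(df.columns.levels)
```

The data itself is untouched; only the index metadata changes.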
Old Answer
Our savior is pd.MultiIndex.to_series(). It returns a Series of tuples restricted to what is actually in the DataFrame, so rebuilding the MultiIndex from it leaves out the unused labels:
df.columns = pd.MultiIndex.from_tuples(df.columns.to_series())
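This also hints at why the set_index attempt in the question failed: set_index replaces the row index, so it appears to have compared 3 column labels against 4 rows. Assigning to .columns avoids that. Below is a sketch of the rebuild; it uses columns.tolist() instead of to_series(), a small variation that should yield the same tuples.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_tuples(
    [('foo', 'A'), ('bar', 'B'), ('bar', 'C'), ('bar', 'D')])
df = pd.DataFrame(np.arange(16).reshape(4, 4), columns=columns)
df2 = df.drop(df.columns[[0]], axis=1)

# Rebuild the column MultiIndex from the tuples that are still present and
# assign it to .columns (not set_index, which targets the rows).
df2.columns = pd.MultiIndex.from_tuples(df2.columns.tolist())
print(df2.columns.levels)
```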