I have some sample Python code:
import pandas as pd

ddf = pd.DataFrame({'col1': ['A', 'A', 'B'],
                    'Id': [3, 1, 2],
                    'col3': ['x', 'a', 'b']})
ddf.index = ddf['Id']
ddf.sort_values(by='Id')
The above snippet produces "FutureWarning: 'Id' is both an index level and a column label. Defaulting to column, but this will raise an ambiguity error in a future version", and it does become an error when I try this under a recent version of Python. I am quite new to Python and pandas. How do I resolve this issue?
The best fix here is to convert the Id column to the index with DataFrame.set_index, which avoids having an index name identical to one of the column names:
ddf = pd.DataFrame({'col1': ['A', 'A', 'B'],
                    'Id': [3, 1, 2],
                    'col3': ['x', 'a', 'b']})
ddf = ddf.set_index('Id')
print(ddf.index.name)
Id
print(ddf.columns)
Index(['col1', 'col3'], dtype='object')
For sorting by the index, DataFrame.sort_index is the better tool:
print(ddf.sort_index())
col1 col3
Id
1 A a
2 B b
3 A x
Your solution works if you change index.name to something different:
ddf = pd.DataFrame({'col1': ['A', 'A', 'B'],
                    'Id': [3, 1, 2],
                    'col3': ['x', 'a', 'b']})
ddf.index = ddf['Id']
print(ddf.index.name)
Id
print(ddf.columns)
Index(['col1', 'Id', 'col3'], dtype='object')
Set a different index.name with DataFrame.rename_axis, or assign it directly:
ddf = ddf.rename_axis('newID')
#alternative
#ddf.index.name = 'newID'
print(ddf.index.name)
newID
print(ddf.columns)
Index(['col1', 'Id', 'col3'], dtype='object')
Now it is possible to distinguish between the index level and the column names, and sort_values works with both:
print(ddf.sort_values(by='Id'))
col1 Id col3
newID
1 A 1 a
2 B 2 b
3 A 3 x
print(ddf.sort_values(by='newID'))
#same like sorting by index
#print (ddf.sort_index())
col1 Id col3
newID
1 A 1 a
2 B 2 b
3 A 3 x
Simply add .values, which extracts the underlying array so the new index has no name and no longer collides with the 'Id' column:
ddf.index=ddf['Id'].values
ddf.sort_values(by='Id')
Out[314]:
col1 Id col3
1 A 1 a
2 B 2 b
3 A 3 x
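Because the index was built from a bare array, it has no name left to clash with the 'Id' column, which you can verify:
print(ddf.index.name)
None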
Both your columns and your row index contain 'Id'; a simple solution is to not set the (row) index to 'Id' at all.
import pandas as pd
ddf = pd.DataFrame({'col1': ['A', 'A', 'B'],
                    'Id': [3, 1, 2],
                    'col3': ['x', 'a', 'b']})
ddf.sort_values(by='Id')
Out[0]:
col1 Id col3
1 A 1 a
2 B 2 b
0 A 3 x
Or set the index when you create the df:
ddf = pd.DataFrame({'col1': ['A', 'A', 'B'],
                    'col3': ['x', 'a', 'b']},
                   index=[3, 1, 2])
ddf.sort_index()
Out[1]:
col1 col3
1 A a
2 B b
3 A x
I have a huge dataframe that I get from a .csv file. After defining the columns, I only want to use the ones I need. Under Python 3.8.1 this worked great, although it raised the warning "FutureWarning:
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative."
If I try to do the same in Python 3.10.x, I now get a KeyError: "['empty'] not in index".
In order to slice out / get rid of the columns I don't need, I use .loc like this:
df = df.loc[:, ['laenge','Timestamp', 'Nick']]
How can I get the same result with the .reindex function (or any other) without getting the KeyError?
Thanks
If you need only the columns that actually exist in the DataFrame, use numpy.intersect1d (note that it returns the common labels sorted alphabetically):
import numpy as np
df = df[np.intersect1d(['laenge', 'Timestamp', 'Nick'], df.columns)]
The same output can be obtained with DataFrame.reindex followed by dropping the all-missing columns (be aware this also removes any pre-existing column that is entirely NaN):
df = df.reindex(['laenge','Timestamp', 'Nick'], axis=1).dropna(how='all', axis=1)
Sample:
df = pd.DataFrame({'laenge': [0,5], 'col': [1,7], 'Nick': [2,8]})
print(df)
laenge col Nick
0 0 1 2
1 5 7 8
df = df[np.intersect1d(['laenge','Timestamp', 'Nick'], df.columns)]
print(df)
Nick laenge
0 2 0
1 8 5
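For comparison, applying the reindex variant to the same sample (re-creating it first, since df was overwritten above) keeps the requested column order:
df = pd.DataFrame({'laenge': [0,5], 'col': [1,7], 'Nick': [2,8]})
df = df.reindex(['laenge', 'Timestamp', 'Nick'], axis=1).dropna(how='all', axis=1)
print(df)
   laenge  Nick
0       0     2
1       5     8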
Use reindex:
df = pd.DataFrame({'A': [0], 'B': [1], 'C': [2]})
# A B C
# 0 0 1 2
df.reindex(['A', 'C', 'D'], axis=1)
output:
A C D
0 0 2 NaN
If you need to get only the common columns, you can use Index.intersection:
cols = ['A', 'C', 'E']
df[df.columns.intersection(cols)]
output:
A C
0 0 2
If I have these dataframes:
df1 = pd.DataFrame({'index': [1, 2, 3, 4],
                    'col1': ['a', 'b', 'c', 'd'],
                    'col2': ['h', 'e', 'l', 'p']})
df2 = pd.DataFrame({'index': [1, 2, 3, 4],
                    'col1': ['a', 'e', 'f', 'd'],
                    'col2': ['h', 'e', 'lp', 'p']})
df1
index col1 col2
0 1 a h
1 2 b e
2 3 c l
3 4 d p
df2
index col1 col2
0 1 a h
1 2 e e
2 3 f lp
3 4 d p
I want to merge them and see whether or not the rows are different, and get an output like this:
index col1 col1_validation col2 col2_validation
0 1 a True h True
1 2 b False e True
2 3 c False l False
3 4 d True p True
How can I achieve that?
It looks like col1 and col2 in your "merged" dataframe are simply taken from df1. In that case, you can compare col1 and col2 between the original data frames and add the results as new columns:
cols = ["col1", "col2"]
val_cols = ["col1_validation", "col2_validation"]
# (optional) new dataframe, so you don't mutate df1
df = df1.copy()
new_cols = (df1[cols] == df2[cols])
df[val_cols] = new_cols
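With the sample frames above, printing df should reproduce the requested validation columns (plus the original index column):
print(df)
   index col1 col2  col1_validation  col2_validation
0      1    a    h             True             True
1      2    b    e            False             True
2      3    c    l            False            False
3      4    d    p             True             True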
You can merge and compare the two data frames with something similar to the following:
df1 = pd.DataFrame({'index': [1, 2, 3, 4],
                    'col1': ['a', 'b', 'c', 'd'],
                    'col2': ['h', 'e', 'l', 'p']})
df2 = pd.DataFrame({'index': [1, 2, 3, 4],
                    'col1': ['a', 'e', 'f', 'd'],
                    'col2': ['h', 'e', 'lp', 'p']})
# give columns unique name when merging
df1.columns = df1.columns + '_df1'
df2.columns = df2.columns + '_df2'
# merge/combine data frames
combined = pd.concat([df1, df2], axis = 1)
# add calculated columns
combined['col1_validation'] = combined['col1_df1'] == combined['col1_df2']
combined['col2_validation'] = combined['col2_df1'] == combined['col2_df2']
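A quick check of the two calculated columns (the names follow from the renaming step above):
print(combined[['col1_validation', 'col2_validation']])
   col1_validation  col2_validation
0             True             True
1            False             True
2            False            False
3             True             True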
I have this dataframe:
df = pd.DataFrame({'col1': ['A', 'A', 'B', 'B', 'B'], 'col2': ['A1', 'B1', 'B1', 'B1', 'A1']})
col1 col2
0 A A1
1 A B1
2 B B1
3 B B1
4 B A1
I did a groupby; the result has MultiIndex columns:
df = df.groupby(['col1']).agg({'col2': ['nunique','count']})
col2
nunique count
col1
A 2 2
B 2 3
Then I did a jointplot from the seaborn library:
import seaborn as sns
sns.jointplot(x=['col2','nunique'], y=['col2','count'], data=df, kind='scatter')
I got this error
TypeError: only integer scalar arrays can be converted to a scalar index
My question is:
Is there a way to split the MultiIndex columns into two separate columns, like this?
col1 col2_unique col2_count
A 2 2
B 2 3
or
Is there a way to jointplot a MultiIndex column?
Thank you for the help!
You can change the aggregation: select the column col2 first and pass only a list of aggregation functions to agg, which avoids MultiIndex columns:
df = df.groupby(['col1'])['col2'].agg(['nunique','count'])
print(df)
nunique count
col1
A 2 2
B 2 3
sns.jointplot(x='nunique', y='count', data=df, kind='scatter')
Or flatten the MultiIndex if you need a dictionary in agg, e.g. to aggregate another column as well:
df = df.groupby(['col1']).agg({'col2': ['nunique','count'], 'col1':['min']})
df.columns = df.columns.map('_'.join)
print(df)
col1_min col2_nunique col2_count
col1
A A 2 2
B B 2 3
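Another option, if your pandas is 0.25 or newer, is named aggregation, which yields flat, custom-named columns directly (the names below are chosen to match the desired output):
df = df.groupby('col1').agg(col2_nunique=('col2', 'nunique'),
                            col2_count=('col2', 'count'))
print(df)
      col2_nunique  col2_count
col1
A                2           2
B                2           3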
I have 3 dataframes that I'd like to combine. They look like this:
df1          | df2          | df3
col1  col2   | col1  col2   | col1  col3
1     5      | 2     9      | 1     some
             |              | 2     data
I'd like the first two df-s to be merged into the third df based on col1, so the desired output is:
df3
col1 col3 col2
1 some 5
2 data 9
How can I achieve this? I'm trying:
df3['col2'] = df1[df1.col1 == df3.col1].col2 if df1[df1.col1 == df3.col1].col2 is not None else df2[df2.col1 == df3.col1].col2
For this I get ValueError: Series lengths must match to compare
It is guaranteed, that df3's col1 values are present either in df1 or df2. What's the way to do this? PLEASE NOTE, that a simple concat will not work, since there is other data in df3, not just col1.
If df1 and df2 don't have duplicates in col1, you can try this:
pd.concat([df1, df2]).merge(df3)
Data:
df1 = pd.DataFrame({'col1': [1], 'col2': [5]})
df2 = pd.DataFrame({'col1': [2], 'col2': [9]})
df3 = pd.DataFrame({'col1': [1,2], 'col3': ['some', 'data']})
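For reference, with this sample data the concat-then-merge line yields:
print(pd.concat([df1, df2]).merge(df3))
   col1  col2  col3
0     1     5  some
1     2     9  data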
I've found a behavior in pandas DataFrames that I don't understand.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(1, 10, (3, 3)),
                  index=['one', 'one', 'two'],
                  columns=['col1', 'col2', 'col3'])
new_data = pd.Series({'col1': 'new', 'col2': 'new', 'col3': 'new'})
df.iloc[0] = new_data
# resulting df looks like:
# col1 col2 col3
#one new new new
#one 9 6 1
#two 8 3 7
But if I try to add a dictionary instead, I get this:
new_data = {'col1': 'new', 'col2': 'new', 'col3': 'new'}
df.iloc[0] = new_data
#
# col1 col2 col3
#one col2 col3 col1
#one 2 1 7
#two 5 8 6
Why is this happening? In the process of writing up this question, I realized that most likely df.loc is only taking the keys from new_data, which also explains why the values are out of order. But, again, why is this the case? If I try to create a DataFrame from a dictionary, it handles the keys as if they were columns:
pd.DataFrame([new_data])
# col1 col2 col3
#0 new new new
Why is that not the default behavior in df.loc?
It's the difference between how a dictionary is iterated and how a pandas Series is treated.
A pandas Series matches its index against the columns when being assigned to a row, and against the index when being assigned to a column. After that, it assigns the value that corresponds to each matched column or index label.
When the object is not a pandas object with an index to align on, pandas falls back to iterating through it. A dictionary iterates over its keys, which is why you see the dictionary's keys in that row's slots. (In the Python version this output came from, dictionaries did not preserve insertion order, hence the shuffled keys; since Python 3.7 dicts keep insertion order, but the keys-rather-than-values behavior is the same.)
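A minimal sketch of the alignment behaviour (the Series keys are deliberately given in reverse order, to show that alignment, not position, decides where each value lands):
import pandas as pd
df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
# the Series index is matched against df's columns, so key order is irrelevant
df.iloc[0] = pd.Series({'col2': 20, 'col1': 10})
print(df)
   col1  col2
0    10    20
1     2     4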
Just how to do it
This is a compact way to accomplish the task. I removed the index of your df, as "one" appeared twice, which prevents unique indexing.
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(np.random.randint(1, 10, (3, 3)), columns=['col1', 'col2', 'col3'])
>>> new_data = {'col1': 'new', 'col2': 'new', 'col3': 'new'}
>>>
>>> df
col1 col2 col3
0 1 6 1
1 4 2 3
2 6 2 3
>>> new_data
{'col1': 'new', 'col2': 'new', 'col3': 'new'}
>>>
>>> df.loc[0, new_data.keys()] = new_data.values()
>>> df
col1 col2 col3
0 new new new
1 4 2 3
2 6 2 3
A compact way
Using an intermediate cast to pd.Series:
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(np.random.randint(1, 10, (3, 3)), columns=['col1', 'col2', 'col3'])
>>> new_data = {'col1': 'new1', 'col2': 'new2', 'col3': 'new3'}
>>>
>>> df
col1 col2 col3
0 5 7 9
1 8 7 8
2 5 3 3
>>> new_data
{'col1': 'new1', 'col2': 'new2', 'col3': 'new3'}
>>>
>>> df.loc[0] = pd.Series(new_data)
>>> df
col1 col2 col3
0 new1 new2 new3
1 8 7 8
2 5 3 3