Updating a pandas DataFrame row with a dictionary - python

I've found a behavior in pandas DataFrames that I don't understand.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(1, 10, (3, 3)),
                  index=['one', 'one', 'two'],
                  columns=['col1', 'col2', 'col3'])
new_data = pd.Series({'col1': 'new', 'col2': 'new', 'col3': 'new'})
df.iloc[0] = new_data
# resulting df looks like:
#      col1 col2 col3
# one   new  new  new
# one     9    6    1
# two     8    3    7
But if I try to add a dictionary instead, I get this:
new_data = {'col1': 'new', 'col2': 'new', 'col3': 'new'}
df.iloc[0] = new_data
# resulting df looks like:
#      col1  col2  col3
# one  col2  col3  col1
# one     2     1     7
# two     5     8     6
Why is this happening? While writing up this question, I realized that df.iloc is most likely taking only the keys from new_data, which would also explain why the values appear out of order. But, again, why is this the case? If I create a DataFrame from a dictionary, it treats the keys as columns:
pd.DataFrame([new_data])
#   col1 col2 col3
# 0  new  new  new
Why is that not the default behavior when assigning with df.iloc?

It's the difference between how a dictionary iterates and how a pandas Series is treated.
A pandas Series matches its index against the columns when being assigned to a row, and against the DataFrame's index when being assigned to a column. After that, it assigns the value that corresponds to each matched label.
When the assigned object is not a pandas object with a convenient index to align on, pandas iterates through it. A dictionary iterates through its keys, and that's why you see the dictionary keys in that row's slots. In older Python versions dictionaries were unordered, and that's why the keys appear shuffled in that row.
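A minimal sketch (my own illustration, not from the answer) of that label alignment: a Series with shuffled keys still lands in the right columns, because assignment aligns on the Series index rather than on position.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((2, 3)), columns=['col1', 'col2', 'col3'])
# the Series index is matched against the column labels, so key order is irrelevant
df.iloc[0] = pd.Series({'col3': 3, 'col1': 1, 'col2': 2})
print(df.iloc[0].tolist())  # [1.0, 2.0, 3.0] -- values matched by label, not position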

Just how to do it
This is a compact way to accomplish the task. I removed the custom index of your df, as "one" appeared twice, which prevents unique indexing.
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(np.random.randint(1, 10, (3, 3)), columns=['col1', 'col2', 'col3'])
>>> new_data = {'col1': 'new', 'col2': 'new', 'col3': 'new'}
>>>
>>> df
   col1  col2  col3
0     1     6     1
1     4     2     3
2     6     2     3
>>> new_data
{'col1': 'new', 'col2': 'new', 'col3': 'new'}
>>>
>>> df.loc[0, new_data.keys()] = new_data.values()
>>> df
  col1 col2 col3
0  new  new  new
1    4    2    3
2    6    2    3
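A hedged aside (my addition, not part of the original answer): depending on the pandas version, it can be safer to convert the dict views to plain lists before indexing, since older releases did not accept dict_keys/dict_values everywhere:
>>> # same df and new_data as above; list() makes the indexer version-proof
>>> df.loc[0, list(new_data.keys())] = list(new_data.values())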

A compact way
Using an intermediate cast to pd.Series:
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(np.random.randint(1, 10, (3, 3)), columns=['col1', 'col2', 'col3'])
>>> new_data = {'col1': 'new1', 'col2': 'new2', 'col3': 'new3'}
>>>
>>> df
   col1  col2  col3
0     5     7     9
1     8     7     8
2     5     3     3
>>> new_data
{'col1': 'new1', 'col2': 'new2', 'col3': 'new3'}
>>>
>>> df.loc[0] = pd.Series(new_data)
>>> df
   col1  col2  col3
0  new1  new2  new3
1     8     7     8
2     5     3     3
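One more hedged note (mine, not from the answer): if the dict covers only some of the columns, the Series alignment fills the missing columns of that row with NaN:
>>> df = pd.DataFrame({'col1': ['a', 'b'], 'col2': ['c', 'd'], 'col3': ['e', 'f']})
>>> df.loc[0] = pd.Series({'col1': 'new1'})  # col2/col3 of row 0 become NaN
>>> df.loc[0].tolist()
['new1', nan, nan]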

Related

pandas subtract rows in dataframe according to a few columns

I have the following dataframe
data = [
    {'col1': 11, 'col2': 111, 'col3': 1111},
    {'col1': 22, 'col2': 222, 'col3': 2222},
    {'col1': 33, 'col2': 333, 'col3': 3333},
    {'col1': 44, 'col2': 444, 'col3': 4444}
]
and the following list:
lst = [(11, 111), (22, 222), (99, 999)]
I would like to keep only the rows whose (col1, col2) pair does not exist in lst.
result for above example would be:
[
    {'col1': 33, 'col2': 333, 'col3': 3333},
    {'col1': 44, 'col2': 444, 'col3': 4444}
]
how can I achieve that?
import pandas as pd
df = pd.DataFrame(data)
list_df = pd.DataFrame(lst)
# command like ??
# df.subtract(list_df)
If you need to test by pairs, you can build a MultiIndex from both columns and compare it against lst with Index.isin, inverting the mask with ~ for boolean indexing:
df = df[~df.set_index(['col1','col2']).index.isin(lst)]
print (df)
   col1  col2  col3
2    33   333  3333
3    44   444  4444
Or use a left join via merge with the indicator parameter:
mask = df.merge(list_df,
                left_on=['col1', 'col2'],
                right_on=[0, 1],
                indicator=True,
                how='left')['_merge'].eq('left_only')
df = df[mask]
print (df)
   col1  col2  col3
2    33   333  3333
3    44   444  4444
You can create a tuple out of your col1 and col2 columns and then check whether those tuples are in the lst list. Then drop the rows with True values.
df.drop(df.apply(lambda x: (x['col1'], x['col2']), axis=1)
          .isin(lst)
          .loc[lambda x: x == True]
          .index)
With this solution you don't even have to turn the second list into a DataFrame.
You can create tuples of col1 and col2 with .apply() and tuple, then test whether these tuples are in lst with .isin() (add ~ for the negated condition).
Finally, locate the rows with .loc, as follows:
df.loc[~df[['col1', 'col2']].apply(tuple, axis=1).isin(lst)]
Result:
   col1  col2  col3
2    33   333  3333
3    44   444  4444
You can extract the lists of values using zip and slice using a mask generated with isin:
a, b = zip(*lst)
df[~(df['col1'].isin(a) | df['col2'].isin(b))]
output:
   col1  col2  col3
2    33   333  3333
3    44   444  4444
Or, if you need both conditions to be true to drop:
df[~(df['col1'].isin(a) & df['col2'].isin(b))]
NB: if you have many columns, you can automate the process:
mask = sum(df[col].isin(v) for col, v in zip(df, zip(*lst))).eq(0)
df[mask]
Or use polars's anti join:
import polars as pl
(pl.from_pandas(df)
   .join(pl.DataFrame(pd.DataFrame(lst, columns=['col1', 'col2'])),
         on=['col1', 'col2'], how='anti')
   .to_pandas())
Result:
   col1  col2  col3
2    33   333  3333
3    44   444  4444

FutureWarning in pandas dataframe

I have a sample python code:
import pandas as pd
ddf = pd.DataFrame({'col1': ['A', 'A', 'B'],
                    'Id': [3, 1, 2],
                    'col3': ['x', 'a', 'b']})
ddf.index = ddf['Id']
ddf.sort_values(by='Id')
The above snippet produces: FutureWarning: 'Id' is both an index level and a column label. Defaulting to column, but this will raise an ambiguity error in a future version. And it does become an error when I try this under a recent pandas version. I am quite new to Python and pandas. How do I resolve this issue?
Here the best option is to convert the column Id to the index with DataFrame.set_index, which avoids an index.name identical to one of the column names:
ddf = pd.DataFrame({'col1': ['A', 'A', 'B'],
                    'Id': [3, 1, 2],
                    'col3': ['x', 'a', 'b']})
ddf = ddf.set_index('Id')
print (ddf.index.name)
Id
print (ddf.columns)
Index(['col1', 'col3'], dtype='object')
For sorting by the index, DataFrame.sort_index is the better choice:
print (ddf.sort_index())
    col1 col3
Id
1      A    a
2      B    b
3      A    x
Your solution works if you change index.name to something different:
ddf = pd.DataFrame({'col1': ['A', 'A', 'B'],
                    'Id': [3, 1, 2],
                    'col3': ['x', 'a', 'b']})
ddf.index = ddf['Id']
print (ddf.index.name)
Id
print (ddf.columns)
Index(['col1', 'Id', 'col3'], dtype='object')
Set a different index.name with DataFrame.rename_axis, or assign a scalar directly:
ddf = ddf.rename_axis('newID')
#alternative
#ddf.index.name = 'newID'
print (ddf.index.name)
newID
print (ddf.columns)
Index(['col1', 'Id', 'col3'], dtype='object')
Now it is possible to distinguish between the index level and the column names, because sort_values works with both:
print(ddf.sort_values(by='Id'))
      col1  Id col3
newID
1        A   1    a
2        B   2    b
3        A   3    x
print (ddf.sort_values(by='newID'))
#same as sorting by index
#print (ddf.sort_index())
      col1  Id col3
newID
1        A   1    a
2        B   2    b
3        A   3    x
Simply add .values; it stores a plain array as the index, so the index no longer carries the name 'Id' and the ambiguity disappears.
ddf.index = ddf['Id'].values
ddf.sort_values(by='Id')
Out[314]:
  col1  Id col3
1    A   1    a
2    B   2    b
3    A   3    x
Both your columns and your row index contain 'Id'. A simple solution is to not set the (row) index to 'Id' at all.
import pandas as pd
ddf = pd.DataFrame({'col1': ['A', 'A', 'B'],
                    'Id': [3, 1, 2],
                    'col3': ['x', 'a', 'b']})
ddf.sort_values(by='Id')
Out[0]:
  col1  Id col3
1    A   1    a
2    B   2    b
0    A   3    x
Or set the index when you create the df:
ddf = pd.DataFrame({'col1': ['A', 'A', 'B'],
                    'col3': ['x', 'a', 'b']},
                   index=[3, 1, 2])
ddf.sort_index()
Out[1]:
  col1 col3
1    A    a
2    B    b
3    A    x

Pandas DataFrame filter

My question is about the pandas.DataFrame.filter command. It seems that pandas creates a copy of the data frame when writing any changes. How can I write to the data frame itself?
In other words:
import pandas as pd
d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
df.filter(regex='col1').iloc[0] = 10
Output:
   col1  col2
0     1     3
1     2     4
Desired Output:
   col1  col2
0    10     3
1     2     4
I think you need to extract the column names and then use the loc or iloc functions:
cols = df.filter(regex='col1').columns
df.loc[0, cols] = 10
Or:
df.iloc[0, df.columns.get_indexer(cols)] = 10
print (df)
   col1  col2
0    10     3
1     2     4
You cannot use the filter function here, because the subset it returns is a Series/DataFrame whose data may be a view. That's why a SettingWithCopyWarning is possible there (or an error is raised, if you set the option).
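A minimal sketch (my addition) of why the original attempt has no effect: filter returns a new object, so writing to it leaves df untouched.
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
subset = df.filter(regex='col1')  # a new object, not a writable view of df
subset.iloc[0] = 10               # may emit SettingWithCopyWarning
print(df)                         # df is unchanged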

Count unique symbols per column in Pandas

I was wondering how to calculate the number of unique symbols that occur in a single column in a dataframe. For example:
df = pd.DataFrame({'col1': ['a', 'bbb', 'cc', ''], 'col2': ['ddd', 'eeeee', 'ff', 'ggggggg']})
df
  col1     col2
0    a      ddd
1  bbb    eeeee
2   cc       ff
3       ggggggg
It should calculate that col1 contains 3 unique symbols, and col2 contains 4 unique symbols.
My code so far (but this might be wrong):
unique_symbols = [0] * 203
i = 0
for col in df.columns:
    observed_symbols = []
    df_temp = df[[col]]
    df_temp = df_temp.astype('str')
    # This part is where I am not so sure
    for index, row in df_temp.iterrows():
        for symbol in row[col]:
            if symbol not in observed_symbols:
                observed_symbols.append(symbol)
    unique_symbols[i] = len(observed_symbols)
    i += 1
Thanks in advance
Option 1
str.join + set inside a dict comprehension
For problems like this, I'd prefer falling back to python, because it's so much faster.
{c : len(set(''.join(df[c]))) for c in df.columns}
{'col1': 3, 'col2': 4}
Option 2
agg
If you want to stay in pandas space.
df.agg(lambda x: set(''.join(x)), axis=0).str.len()
Or,
df.agg(lambda x: len(set(''.join(x))), axis=0)
col1    3
col2    4
dtype: int64
Here is one way:
df.apply(lambda x: len(set(''.join(x.astype(str)))))
col1    3
col2    4
Maybe
df.sum().apply(set).str.len()
Out[673]:
col1    3
col2    4
dtype: int64
One more option:
In [38]: df.applymap(lambda x: len(set(x))).sum()
Out[38]:
col1    3
col2    4
dtype: int64
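A hedged caution (my note, not from that answer): applymap counts unique symbols per cell and then sums the counts, so it matches the per-column count here only because every cell repeats a single character. With mixed cells it overcounts:
In [39]: pd.DataFrame({'col1': ['ab', 'ab']}).applymap(lambda x: len(set(x))).sum()
Out[39]:
col1    4
dtype: int64
The column has only 2 unique symbols ('a' and 'b'), but the per-cell sum reports 4.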

Python - Combining pandas dataframes

I have 3 dataframes that I'd like to combine. They look like this:
df1         |df2         |df3
col1 col2   |col1 col2   |col1 col3
1    5      |2    9      |1    some
            |            |2    data
I'd like the first two df-s to be merged into the third df based on col1, so the desired output is:
df3
col1 col3 col2
1    some 5
2    data 9
How can I achieve this? I'm trying:
df3['col2'] = df1[df1.col1 == df3.col1].col2 if df1[df1.col1 == df3.col1].col2 is not None else df2[df2.col1 == df3.col1].col2
For this I get ValueError: Series lengths must match to compare
It is guaranteed that df3's col1 values are present in either df1 or df2. What's the way to do this? PLEASE NOTE that a simple concat will not work, since there is other data in df3, not just col1.
If df1 and df2 don't have duplicates in col1, you can try this:
pd.concat([df1, df2]).merge(df3)
Data:
df1 = pd.DataFrame({'col1': [1], 'col2': [5]})
df2 = pd.DataFrame({'col1': [2], 'col2': [9]})
df3 = pd.DataFrame({'col1': [1,2], 'col3': ['some', 'data']})
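For reference, a quick check of that one-liner with the sample data (my addition; merge aligns on the shared col1 column, so the column order differs from the desired output but the content matches):
import pandas as pd

df1 = pd.DataFrame({'col1': [1], 'col2': [5]})
df2 = pd.DataFrame({'col1': [2], 'col2': [9]})
df3 = pd.DataFrame({'col1': [1, 2], 'col3': ['some', 'data']})

print(pd.concat([df1, df2]).merge(df3, on='col1'))
#    col1  col2  col3
# 0     1     5  some
# 1     2     9  data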
