I want to replace certain values in a dataframe containing multiple categoricals.
df = pd.DataFrame({'s1': ['a', 'b', 'c'], 's2': ['a', 'c', 'd']}, dtype='category')
If I apply .replace on a single column, the result is as expected:
>>> df.s1.replace('a', 1)
0 1
1 b
2 c
Name: s1, dtype: object
If I apply the same operation to the whole dataframe, an error is shown (short version):
>>> df.replace('a', 1)
ValueError: Cannot setitem on a Categorical with a new category, set the categories first
During handling of the above exception, another exception occurred:
ValueError: Wrong number of dimensions
If the dataframe contains integers as categories, the following happens:
df = pd.DataFrame({'s1': [1, 2, 3], 's2': [1, 3, 4]}, dtype='category')
>>> df.replace(1, 3)
s1 s2
0 3 3
1 2 3
2 3 4
But,
>>> df.replace(1, 2)
ValueError: Wrong number of dimensions
What am I missing?
Without digging into the source, this looks like a bug to me.
My Work Around
pd.DataFrame.apply with pd.Series.replace
This has the advantage that you don't need to mess with changing any types.
df = pd.DataFrame({'s1': [1, 2, 3], 's2': [1, 3, 4]}, dtype='category')
df.apply(pd.Series.replace, to_replace=1, value=2)
s1 s2
0 2 2
1 2 3
2 3 4
Or
df = pd.DataFrame({'s1': ['a', 'b', 'c'], 's2': ['a', 'c', 'd']}, dtype='category')
df.apply(pd.Series.replace, to_replace='a', value=1)
s1 s2
0 1 1
1 b c
2 c d
@cᴏʟᴅsᴘᴇᴇᴅ's Work Around
df = pd.DataFrame({'s1': ['a', 'b', 'c'], 's2': ['a', 'c', 'd']}, dtype='category')
df.applymap(str).replace('a', 1)
s1 s2
0 1 1
1 b c
2 c d
The reason for this behavior is that each column has a different set of categories:
In [224]: df.s1.cat.categories
Out[224]: Index(['a', 'b', 'c'], dtype='object')
In [225]: df.s2.cat.categories
Out[225]: Index(['a', 'c', 'd'], dtype='object')
So if the replacement value is already a category in both columns, it works:
In [226]: df.replace('d','a')
Out[226]:
s1 s2
0 a a
1 b c
2 c a
As a solution, you might want to make your columns categorical manually, using:
pd.Categorical(..., categories=[...])
where categories contains every value that can occur in any of the columns.
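A minimal sketch of that suggestion, assuming the aim is simply to give both columns one shared set of categories. The names `raw`, `all_values` and `shared` are illustrative, and building the category list from the union of the observed values is an assumption. This only shows how to build the shared categories; how replace behaves afterwards may vary with the pandas version.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

raw = pd.DataFrame({'s1': ['a', 'b', 'c'], 's2': ['a', 'c', 'd']})

# Union of the values seen in either column (illustrative helper name).
all_values = sorted(set(raw['s1']) | set(raw['s2']))

# One dtype shared by all columns, so every column has identical categories.
shared = CategoricalDtype(categories=all_values)
df = raw.astype(shared)

print(df.s1.cat.categories.tolist())
print(df.s2.cat.categories.tolist())
```

If the replacement value is not among the observed values, it would have to be added to `all_values` as well.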
Related
I am trying to create a dictionary from a dataframe.
from pandas import util
df= util.testing.makeDataFrame()
df.index.name = 'name'
A B C D
name
qSfQX3rj48 0.184091 -1.195861 0.998988 -0.970523
KSYYLUGiJB -0.998997 -0.387378 -0.303704 0.833731
PmsVVmRbQX -1.510940 -1.062814 0.934954 0.970467
oHjAqjAv1P -1.366054 0.595680 -1.039310 -0.126625
a1cU5c4psT -0.486282 -0.369012 -0.284495 -1.263010
qnqmltdFGR -0.041243 -0.792538 0.234809 0.894919
df.to_dict()
{'A': {'qSfQX3rj48': 0.1840905950693832,
'KSYYLUGiJB': -0.9989969426889559,
'PmsVVmRbQX': -1.5109402125881068,
'oHjAqjAv1P': -1.3660539127241154,
'a1cU5c4psT': -0.48628192605203563,
'qnqmltdFGR': -0.04124312561281138,
The to_dict call above uses the column names as keys.
dict_keys(['A', 'B', 'C', 'D'])
How can I turn it into a dict where the values of columns A, B, C and D are stored under each entry of the name column? Each row would then have just one key.
A B C D
name
qSfQX3rj48 0.184091 -1.195861 0.998988 -0.970523
Should produce a dictionary with a list of values.
{'qSfQX3rj48': [0.184091, -1.195861, 0.998988, -0.970523],
'KSYYLUGiJB': [-0.998997, -0.387378 , -0.303704, 0.833731],
And the column names are stored under the index name, thus:
{'name': [A, B, C, D],
d = df.T.to_dict('list')
d[df.index.name] = df.columns.tolist()
Example
df = pd.DataFrame(np.arange(12).reshape(3, 4),
columns=['A', 'B', 'C', 'D'],
index=['one', 'two', 'three'])
df.index.name = 'name'
df:
A B C D
name
one 0 1 2 3
two 4 5 6 7
three 8 9 10 11
d:
{'one': [0, 1, 2, 3],
'two': [4, 5, 6, 7],
'three': [8, 9, 10, 11],
'name': ['A', 'B', 'C', 'D']}
Code:
dict([(nm,[a,b,c,d ]) for nm, a,b,c,d in zip(df.index, df.A, df.B, df.C, df.D)])
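The comprehension above names each column explicitly. Here is a sketch of the same idea that does not depend on the column names, assuming the small example frame from the earlier answer:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  columns=['A', 'B', 'C', 'D'],
                  index=['one', 'two', 'three'])
df.index.name = 'name'

# Pair each index label with that row's values as a plain list.
d = dict(zip(df.index, df.to_numpy().tolist()))

# Store the column labels under the index name, as in the desired output.
d[df.index.name] = df.columns.tolist()
print(d)
```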
I have two dataframes:
df1 = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']}
df2 = {'col_1': [3, 2, 1, 3]}
I want the result as follows:
df3 = {'col_1': [3, 2, 1, 3], 'col_2': ['a', 'b', 'c', 'a']}
col_2 of the new df is taken from col_2 of df1, matched on the value of col_1.
Add the new column by mapping the values from df1 after setting its first column as index:
df3 = df2.copy()
df3['col_2'] = df2['col_1'].map(df1.set_index('col_1')['col_2'])
output:
col_1 col_2
0 3 a
1 2 b
2 1 c
3 3 a
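Since the question shows df1 and df2 as plain dicts, here is a self-contained sketch of the mapping approach that first wraps them in DataFrames. The name `lookup` is illustrative, and the sketch assumes the col_1 values in df1 are unique, since duplicate keys would make the lookup ambiguous.

```python
import pandas as pd

df1 = pd.DataFrame({'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']})
df2 = pd.DataFrame({'col_1': [3, 2, 1, 3]})

# Series indexed by col_1 whose values are col_2: this acts as the lookup table.
lookup = df1.set_index('col_1')['col_2']

df3 = df2.copy()
df3['col_2'] = df3['col_1'].map(lookup)
print(df3)
```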
You can do it with merge after converting the dicts to df with pd.DataFrame():
output = pd.DataFrame(df2)
output = output.merge(pd.DataFrame(df1),on='col_1',how='left')
Or in a one-liner:
output = pd.DataFrame(df2).merge(pd.DataFrame(df1),on='col_1',how='left')
Outputs:
col_1 col_2
0 3 a
1 2 b
2 1 c
3 3 a
This could be a simple way of doing it.
# use df1 to create a lookup dictionary
lookup = df1.set_index("col_1").to_dict()["col_2"]
# look up each value from df2's "col_1" in the lookup dict
df2["col_2"] = df2["col_1"].apply(lambda d: lookup[d])
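One difference worth noting: `lookup[d]` raises a KeyError for a value that does not occur in df1, whereas Series.map with a dict leaves NaN in that position. A sketch, where the value 9 is added purely for illustration:

```python
import pandas as pd

df1 = pd.DataFrame({'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']})
# The 9 has no match in df1; it is included here only for illustration.
df2 = pd.DataFrame({'col_1': [3, 2, 1, 9]})

# Plain dict from col_1 values to col_2 values.
lookup = df1.set_index('col_1')['col_2'].to_dict()

# map with a dict leaves unmatched keys as NaN instead of raising KeyError.
df2['col_2'] = df2['col_1'].map(lookup)
print(df2)
```

Which behavior is preferable depends on whether a missing key should be treated as an error.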
I am working in Python 2.7. I have a data frame and want the average of the column called 'c', restricted to the rows where another column equals a given value.
When I run the code, the mean is not what I expect, but the median computed the same way is correct.
Why is the output of the mean incorrect?
The code is the following:
df = pd.DataFrame(
np.array([['A', 1, 2, 3], ['A', 4, 5, np.nan], ['A', 7, 8, 9], ['B', 3, 2, np.nan], ['B', 5, 6, np.nan], ['B',5, 6, np.nan]]),
columns=['a', 'b', 'c', 'd']
)
df
mean1 = df[df.a == 'A'].c.mean()
mean2 = df[df.a == 'B'].c.mean()
median1 = df[df.a == 'A'].c.median()
median2 = df[df.a == 'B'].c.median()
The output:
df
Out[1]:
a b c d
0 A 1 2 3
1 A 4 5 nan
2 A 7 8 9
3 B 3 2 nan
4 B 5 6 nan
5 B 5 6 nan
mean1
Out[2]: 86.0
mean2
Out[3]: 88.66666666666667
median1
Out[4]: 5.0
median2
Out[5]: 6.0
It is obvious that the output of the mean is incorrect.
Thanks.
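Pandas is doing string concatenation for the "sum" when calculating the mean; this is plain to see from your example frame. For group B, the strings '2', '6' and '6' are joined into '266', which is then divided by the count of 3: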
>>> df[df.a == 'B'].c
3 2
4 6
5 6
Name: c, dtype: object
>>> 266 / 3
88.66666666666667
If you look at the dtypes of your DataFrame, you'll notice that all of them are object, even though no single Series was meant to contain mixed types. This is due to the declaration of your numpy array. Arrays are not meant to contain heterogeneous types, so NumPy coerces every element to a string (note that nan is printed as text in the frame above), and the DataFrame constructor then stores those strings in object columns. You can avoid this behavior by passing the constructor a list instead, which can hold differing dtypes with no issues.
df = pd.DataFrame(
[['A', 1, 2, 3], ['A', 4, 5, np.nan], ['A', 7, 8, 9], ['B', 3, 2, np.nan], ['B', 5, 6, np.nan], ['B',5, 6, np.nan]],
columns=['a', 'b', 'c', 'd']
)
df[df.a == 'B'].c.mean()
4.666666666666667
In [17]: df.dtypes
Out[17]:
a object
b int64
c int64
d float64
dtype: object
I still can't imagine that this behavior is intended, so I believe it's worth opening an issue report on the pandas development page, but in general, you shouldn't be using object dtype Series for numeric calculations.
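If the frame has already been built from such an array, one option is to check the dtype and convert the numeric columns with pd.to_numeric. A sketch on a shortened version of the frame above (the column subset and the name `mean_b` are illustrative):

```python
import numpy as np
import pandas as pd

# Mixing strings and numbers makes NumPy store every element as a string.
arr = np.array([['A', 1, 2], ['B', 3, 2], ['B', 5, 6], ['B', 5, 6]])
print(arr.dtype.kind)  # 'U' marks a unicode string array

df = pd.DataFrame(arr, columns=['a', 'b', 'c'])

# Convert the numeric columns back to numbers before aggregating.
for col in ['b', 'c']:
    df[col] = pd.to_numeric(df[col])

mean_b = df[df.a == 'B'].c.mean()
print(mean_b)
```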
I am trying to find the record with the maximum value among the first records of each group after a groupby, and delete it from the original dataframe.
import pandas as pd
df = pd.DataFrame({'item_id': ['a', 'a', 'b', 'b', 'b', 'c', 'd'],
'cost': [1, 2, 1, 1, 3, 1, 5]})
print(df)
t = df.groupby('item_id').first() #lost track of the index
desired_row = t[t.cost == t.cost.max()]
#delete this row from df
cost
item_id
d 5
I need to keep track of desired_row and delete this row from df and repeat the process.
What is the best way to find and delete the desired_row?
I am not sure of a general way, but this will work in your case since you are taking the first item of each group (it would work just as easily for the last). In fact, because of the general nature of split-apply-combine, I don't think this is easily achievable without doing it yourself.
gb = df.groupby('item_id', as_index=False)
>>> gb.groups # Index locations of each group.
{'a': [0, 1], 'b': [2, 3, 4], 'c': [5], 'd': [6]}
# Get the first index location from each group using a dictionary comprehension.
subset = {k: v[0] for k, v in gb.groups.items()}  # use .iteritems() on Python 2
df2 = df.iloc[list(subset.values())]
# These are the first items in each groupby.
>>> df2
cost item_id
0 1 a
5 1 c
2 1 b
6 5 d
# Exclude any items from above where the cost is equal to the max cost across the first item in each group.
>>> df[~df.index.isin(df2[df2.cost == df2.cost.max()].index)]
cost item_id
0 1 a
1 2 a
2 1 b
3 1 b
4 3 b
5 1 c
Try this:
import pandas as pd
df = pd.DataFrame({'item_id': ['a', 'a', 'b', 'b', 'b', 'c', 'd'],
'cost': [1, 2, 1, 1, 3, 1, 5]})
t=df.drop_duplicates(subset=['item_id'],keep='first')
desired_row = t[t.cost == t.cost.max()]
df[~df.index.isin([desired_row.index[0]])]
Out[186]:
cost item_id
0 1 a
1 2 a
2 1 b
3 1 b
4 3 b
5 1 c
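Since the question also asks to repeat the process, here is a sketch that wraps one round in a function. The name `drop_max_first` is hypothetical, and it assumes that when several first rows tie for the maximum, removing only the earliest one is acceptable.

```python
import pandas as pd

def drop_max_first(df):
    """Drop the row with the highest cost among each group's first row."""
    # head(1) keeps the original index labels, unlike first().
    firsts = df.groupby('item_id').head(1)
    # idxmax returns the label of the first occurrence of the maximum.
    label = firsts['cost'].idxmax()
    return df.drop(label), df.loc[label]

df = pd.DataFrame({'item_id': ['a', 'a', 'b', 'b', 'b', 'c', 'd'],
                   'cost': [1, 2, 1, 1, 3, 1, 5]})

# First round removes the row for item d; the second round runs on the remainder.
df, removed = drop_max_first(df)
df, removed_again = drop_max_first(df)
print(df)
```

Returning the removed row alongside the reduced frame keeps track of desired_row between rounds.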
Or filter with a negated isin.
Consider this df with a few more rows:
df = pd.DataFrame({'item_id': ['a', 'a', 'b', 'b', 'b', 'c', 'd', 'd','d'],
'cost': [1, 2, 1, 1, 3, 1, 5,1,7]})
df[~df.cost.isin(df.groupby('item_id').first().max().tolist())]
cost item_id
0 1 a
1 2 a
2 1 b
3 1 b
4 3 b
5 1 c
7 1 d
8 7 d
Overview: Create a dataframe from a dictionary. Group by item_id and find the max cost in each group. Enumerate over the grouped result and use the numeric position to look up the matching item_id in its index. Collect the results in a result_df dataframe if desired.
df_temp = pd.DataFrame({'item_id': ['a', 'a', 'b', 'b', 'b', 'c', 'd'],
'cost': [1, 2, 1, 1, 3, 1, 5]})
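grouped = df_temp.groupby(['item_id'])['cost'].max()
rows = []
for key, value in enumerate(grouped):
    # key is the position, so look up the item_id label in the index
    index = grouped.index[key]
    rows.append({'item_id': index, 'cost': value})
# DataFrame.append was removed in pandas 2.0, so build the frame from the collected rows
result_df = pd.DataFrame(rows, columns=['item_id', 'cost'])
print(result_df.head(5))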
I am working with a pandas dataframe and trying to concatenate multiple strings and numbers into one string.
This works
df1 = pd.DataFrame({'Col1': ['a', 'b', 'c'], 'Col2': ['a', 'b', 'c']})
df1.apply(lambda x: ', '.join(x), axis=1)
0 a, a
1 b, b
2 c, c
How can I make this work just like df1?
df2 = pd.DataFrame({'Col1': ['a', 'b', 1], 'Col2': ['a', 'b', 1]})
df2.apply(lambda x: ', '.join(x), axis=1)
TypeError: ('sequence item 0: expected str instance, int found', 'occurred at index 2')
Consider the dataframe df
np.random.seed([3,1415])
df = pd.DataFrame(
np.random.randint(10, size=(3, 3)),
columns=list('abc')
)
print(df)
a b c
0 0 2 7
1 3 8 7
2 0 6 8
You can call astype(str) before applying the join:
df.astype(str).apply(', '.join, 1)
0 0, 2, 7
1 3, 8, 7
2 0, 6, 8
dtype: object
Using a comprehension
pd.Series([', '.join(l) for l in df.values.astype(str).tolist()], df.index)
0 0, 2, 7
1 3, 8, 7
2 0, 6, 8
dtype: object
In [75]: df2
Out[75]:
Col1 Col2 Col3
0 a a x
1 b b y
2 1 1 2
In [76]: df2.astype(str).add(', ').sum(1).str[:-2]
Out[76]:
0 a, a, x
1 b, b, y
2 1, 1, 2
dtype: object
You have to convert column types to strings.
import pandas as pd
df2 = pd.DataFrame({'Col1': ['a', 'b', 1], 'Col2': ['a', 'b', 1]})
df2.apply(lambda x: ', '.join(x.astype('str')), axis=1)
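For a fixed pair of columns, the conversion and the join can also be done column-wise rather than row by row. A sketch using Series.str.cat on the df2 from the question (the name `joined` is illustrative):

```python
import pandas as pd

df2 = pd.DataFrame({'Col1': ['a', 'b', 1], 'Col2': ['a', 'b', 1]})

# Cast both columns to str, then concatenate them element-wise with a separator.
joined = df2['Col1'].astype(str).str.cat(df2['Col2'].astype(str), sep=', ')
print(joined.tolist())
```

With many columns, the row-wise join shown above is usually more convenient than chaining str.cat calls.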