I would like to replace row values in pandas.
For example:
import pandas as pd
import numpy as np
a = np.array(([100, 100, 101, 101, 102, 102],
              np.arange(6)))
pd.DataFrame(a.T)
Result:
array([[100,   0],
       [100,   1],
       [101,   2],
       [101,   3],
       [102,   4],
       [102,   5]])
Here, I would like to replace the rows with the values [101, 3] with [200, 10] and the result should therefore be:
array([[100,   0],
       [100,   1],
       [101,   2],
       [200,  10],
       [102,   4],
       [102,   5]])
Update
In a more general case I would like to replace multiple rows.
The old and new row values are therefore represented by n×2 matrices (where n is the number of rows to replace). For example:
old_vals = np.array([[101, 3],
                     [100, 0],
                     [102, 5]])
new_vals = np.array([[200, 10],
                     [300, 20],
                     [400, 30]])
And the result is:
array([[300,  20],
       [100,   1],
       [101,   2],
       [200,  10],
       [102,   4],
       [400,  30]])
For the single row case:
In [35]:
df.loc[(df[0]==101) & (df[1]==3)] = [[200,10]]
df
Out[35]:
     0   1
0  100   0
1  100   1
2  101   2
3  200  10
4  102   4
5  102   5
For the multiple row-case the following would work:
In [60]:
a = np.array(([100, 100, 101, 101, 102, 102],
              [0, 1, 3, 3, 3, 4]))
df = pd.DataFrame(a.T)
df
Out[60]:
     0  1
0  100  0
1  100  1
2  101  3
3  101  3
4  102  3
5  102  4
In [61]:
df.loc[(df[0]==101) & (df[1]==3)] = 200,10
df
Out[61]:
     0   1
0  100   0
1  100   1
2  200  10
3  200  10
4  102   3
5  102   4
For a multi-row update like the one you propose, the following works when each replacement target is a single row. First construct a dict, using the old values to search for as keys and the new values as replacement values:
In [78]:
old_keys = [(x[0], x[1]) for x in old_vals]
replace_vals = dict(zip(old_keys, new_vals))
replace_vals
Out[78]:
{(100, 0): array([300, 20]),
 (101, 3): array([200, 10]),
 (102, 5): array([400, 30])}
We can then iterate over the dict and set the matching rows using the same method as in the single-row case:
In [93]:
for k, v in replace_vals.items():
    df.loc[(df[0] == k[0]) & (df[1] == k[1])] = [[v[0], v[1]]]
df
Out[93]:
     0   1
0  300  20
1  100   1
2  101   2
3  200  10
4  102   4
5  400  30
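The loop can also be avoided entirely. Below is a minimal vectorised sketch (my own alternative, not part of the answer above), assuming old_vals and new_vals are the n×2 arrays from the question; it keys every row by its (column 0, column 1) pair and matches all rows to replace in one step (pd.MultiIndex.from_frame needs pandas 0.24+):
import numpy as np
import pandas as pd

a = np.array(([100, 100, 101, 101, 102, 102], np.arange(6)))
df = pd.DataFrame(a.T)

old_vals = np.array([[101, 3], [100, 0], [102, 5]])
new_vals = np.array([[200, 10], [300, 20], [400, 30]])

# key every row of df and every row to replace by its value pair
row_keys = pd.MultiIndex.from_frame(df)
old_keys = pd.MultiIndex.from_arrays(list(old_vals.T))

# indexer[i] is the position of df row i in old_keys, or -1 if unmatched
indexer = old_keys.get_indexer(row_keys)
mask = indexer != -1
df.loc[mask] = new_vals[indexer[mask]]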
The simplest way should be this one:
df.loc[[3], 0:1] = 200, 10
In this case, 3 is the row label (the fourth row of the data frame, since the default index starts at 0), while 0 and 1 are the columns.
This code instead allows you to iterate over each row, check its content and replace it with what you want:
target = [101, 3]
mod = [200, 10]
for index, row in df.iterrows():
    if row[0] == target[0] and row[1] == target[1]:
        # write back through the frame: mutating `row` only changes
        # a copy and is not guaranteed to update df
        df.loc[index] = mod
print(df)
Replace 'A' with 1 and 'B' with 2.
df = df.replace(['A', 'B'],[1, 2])
This is done over the entire DataFrame no matter the column.
However, we can target a single column this way:
df[column] = df[column].replace(['A', 'B'],[1, 2])
More in-depth examples are available in the pandas documentation for DataFrame.replace.
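DataFrame.replace also accepts a nested dictionary, which is handy when different columns need different mappings. A small sketch (the column names here are hypothetical):
import pandas as pd

df = pd.DataFrame({'grade': ['A', 'B', 'A'], 'team': ['B', 'A', 'B']})

# {column: {old: new}} restricts each mapping to the named column
df = df.replace({'grade': {'A': 1, 'B': 2}})  # 'team' is left untouched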
Another possibility is:
import io

import numpy as np
import pandas as pd

a = np.array(([100, 100, 101, 101, 102, 102],
              np.arange(6)))
df = pd.DataFrame(a.T)

# serialise the frame to plain text, one row per line
string = df.to_string(header=False, index=False, index_names=False)

dictionary = {'100 0': '300 20',
              '101 3': '200 10',
              '102 5': '400 30'}

def replace_all(text, dic):
    for i, j in dic.items():
        text = text.replace(i, j)
    return text

string = replace_all(string, dictionary)
# parse the edited text back; header=None keeps the first row as data
df = pd.read_csv(io.StringIO(string), sep=r'\s+', header=None)
I found this solution better, since when dealing with a large amount of data to replace, it runs faster than EdChum's solution.
Related
I have a pandas dataframe that consists of columns of values and a pandas series that consists of column names. I need to get the value of the nth row of the column corresponding to the nth index of the column name in the series.
Note that the column name is constructed by appending col to the value in the series.
I have looked to see if there is a fast way of doing this (vectorization or a list comprehension) but seem to hit a roadblock where I index into the dataframe using the index position of the series.
dataframe : {
'col1': [1, 2, 3, 4, 5],
'col2': [10, 20, 30, 40, 50],
'col3': [100, 200, 300, 400, 500]
}
series : [
'1', '2', '1', '3', '8'
]
output is a series : [
1, 20, 3, 400, numpy.nan
]
I am able to do this using a straightforward iterrows, but would like something faster (preferably vectorization, but if not list comprehension).
import pandas as pd

def test_cols():
    stub_data_df = pd.DataFrame({
        'col1': [1, 2, 3, 4, 5],
        'col2': [10, 20, 30, 40, 50],
        'col3': [100, 200, 300, 400, 500]
    })
    cols = pd.Series([
        '1', '2', '1', '3', '8'
    ])
    rates = []
    for i, row in stub_data_df.iterrows():
        rates.append(row.get('col' + cols[i]))
    print(pd.Series(rates))
Output:
0      1.0
1     20.0
2      3.0
3    400.0
4      NaN
dtype: float64
Here is a way to do this with a list comprehension:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4, 5],
                   'col2': [10, 20, 30, 40, 50],
                   'col3': [100, 200, 300, 400, 500]})
s = pd.Series(['1', '2', '1', '3', '8'])

s = s.astype(int) - 1  # so these values can be used for integer indexing
result = s.copy()
legal_ix = s < len(df.columns)  # only columns that exist can be indexed
s = s[legal_ix]
result[legal_ix] = [df.iloc[i, j] for i, j in zip(s.index, s.values)]
result[~legal_ix] = np.nan
print(result)
0      1.0
1     20.0
2      3.0
3    400.0
4      NaN
dtype: float64
The docs have an example related to this:
idx, cols = ('col' + cols).factorize()
array = stub_data_df.reindex(cols, axis=1).to_numpy()
array = array[np.arange(len(stub_data_df)), idx]
pd.Series(array)
0      1.0
1     20.0
2      3.0
3    400.0
4      NaN
dtype: float64
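For context, this factorize/reindex pattern is essentially what the pandas docs recommend as the replacement for DataFrame.lookup, which was deprecated in pandas 1.2 and later removed.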
Let's say I have data about actual sales by sales person like so:
df = pd.DataFrame({'Salesperson id': [1, 2, 3, 4], "Q3 sales": [105, 82, 230, 58]})
   Salesperson id  Q3 sales
0               1       105
1               2        82
2               3       230
3               4        58
I also have their sales quotas like so:
quotas_df = pd.DataFrame({'Salesperson id': [1, 2, 3, 4], "Quota": [88, 95, 200, 65]})
quotas_df = quotas_df.set_index('Salesperson id')
                Quota
Salesperson id
1                  88
2                  95
3                 200
4                  65
I'd like to get a subset of df with only the rows where the sales person has exceeded their sales quota. I try the following:
filtered_df = df[(df['Q3 sales'] > quotas_df.loc[df['Salesperson id']]['Quota'])]
This fails with:
ValueError: Can only compare identically-labeled Series objects
Any pointers for the best way to do this?
You got the error because the two DataFrames' indexes are not aligned.
(
    df.set_index('Salesperson id')
    .loc[lambda x: x['Q3 sales'] > quotas_df['Quota']]
)
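With the sample data this keeps salespersons 1 and 3 (105 > 88 and 230 > 200), with Salesperson id as the index of the result.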
Use Series.map:
df = pd.DataFrame({'Salesperson id': [1, 2, 3, 4], "Q3 sales": [105, 82, 230, 58]})
quotas_df = pd.DataFrame({'Salesperson id': [1, 2, 3, 4], "Quota": [88, 95, 200, 65]})
s = df['Salesperson id'].map(quotas_df.set_index('Salesperson id')['Quota'])
filtered_df = df[df['Q3 sales'] > s]
print(filtered_df)
   Salesperson id  Q3 sales
0               1       105
2               3       230
You could merge the two dataframes and then filter normally:
df = pd.DataFrame({'Salesperson id': [1, 2, 3, 4], "Q3 sales": [105, 82, 230, 58]})
quotas_df = pd.DataFrame({'Salesperson id': [1, 2, 3, 4], "Quota": [88, 95, 200, 65]})
filtered_df = df.merge(quotas_df, on='Salesperson id')
filtered_df[filtered_df['Q3 sales'] > filtered_df['Quota']]
Output:
   Salesperson id  Q3 sales  Quota
0               1       105     88
2               3       230    200
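If you don't want the extra Quota column in the result, drop it after filtering, e.g. with .drop(columns='Quota').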
I was going through this link: Return top N largest values per group using pandas
and found multiple ways to find the topN values per group.
However, I prefer the dictionary method with the agg function, and would like to know if it is possible to get the equivalent of the dictionary method for the following problem.
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [1, 1, 1, 2, 2],
                   'B': [1, 1, 2, 2, 1],
                   'C': [10, 20, 30, 40, 50],
                   'D': ['X', 'Y', 'X', 'Y', 'Y']})
print(df)
   A  B   C  D
0  1  1  10  X
1  1  1  20  Y
2  1  2  30  X
3  2  2  40  Y
4  2  1  50  Y
I can do this:
df1 = df.groupby(['A'])['C'].nlargest(2).droplevel(-1).reset_index()
print(df1)
   A   C
0  1  30
1  1  20
2  2  50
3  2  40
# also this
df1 = df.sort_values('C', ascending=False).groupby('A', sort=False).head(2)
print(df1)
# also this
df.set_index('C').groupby('A')['B'].nlargest(2).reset_index()
Required
df.groupby('A', as_index=False).agg(
    {'C': lambda ser: ser.nlargest(2)}  # something like this
)
Is it possible to use the dictionary here?
If you want to get a dictionary mapping each value of A to the 2 top values from C, you can run:
df.groupby(['A'])['C'].apply(lambda x: x.nlargest(2).tolist()).to_dict()
For your DataFrame, the result is:
{1: [30, 20], 2: [50, 40]}
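If you specifically want the dictionary-with-agg form from the question, a lambda that returns a list per group appears to work as well in recent pandas versions; a sketch (the C cells then hold lists rather than scalars):
df.groupby('A', as_index=False).agg({'C': lambda ser: ser.nlargest(2).tolist()})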
I want to select rows from a dataframe based on values in the index combined with values in a specific column:
df = pd.DataFrame([[0, 2, 3], [0, 4, 1], [0, 20, 30], [40, 20, 30]],
                  index=[4, 5, 6, 7], columns=['A', 'B', 'C'])
    A   B   C
4   0   2   3
5   0   4   1
6   0  20  30
7  40  20  30
With
df.loc[df['A'] == 0, 'C'] = 99
I can select all rows with column A = 0 and replace the value in column C with 99. But how can I select all rows with column A = 0 and the index < 6 (I want to combine selection on the index with selection on the column)?
You can use multiple conditions in your loc statement:
df.loc[(df.index < 6) & (df.A == 0), 'C'] = 99
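With the example frame above, only rows 4 and 5 satisfy both conditions (row 6 has A == 0 but its index is not below 6), so only their C values become 99.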
I have a pandas Data Frame having one column containing arrays. I'd like to "flatten" it by repeating the values of the other columns for each element of the arrays.
I managed to do it by building a temporary list of values by iterating over every row, but that is "pure Python" and slow.
Is there a way to do this in pandas/numpy? In other words, I try to improve the flatten function in the example below.
Thanks a lot.
import pandas as pd

toConvert = pd.DataFrame({
    'x': [1, 2],
    'y': [10, 20],
    'z': [(101, 102, 103), (201, 202)]
})

def flatten(df):
    tmp = []
    def backend(r):
        x = r['x']
        y = r['y']
        zz = r['z']
        for z in zz:
            tmp.append({'x': x, 'y': y, 'z': z})
    df.apply(backend, axis=1)
    return pd.DataFrame(tmp)

print(flatten(toConvert).to_string(index=False))
Which gives:
x y z
1 10 101
1 10 102
1 10 103
2 20 201
2 20 202
Here's a NumPy-based solution:
np.column_stack((toConvert[['x', 'y']].values.repeat(
    list(map(len, toConvert.z)), axis=0), np.hstack(toConvert.z)))
Note the list() around map: on Python 3, map returns an iterator, which np.repeat cannot consume directly.
Sample run -
In [78]: toConvert
Out[78]:
   x   y                z
0  1  10  (101, 102, 103)
1  2  20       (201, 202)
In [79]: np.column_stack((toConvert[['x', 'y']].values.repeat(
    ...:     list(map(len, toConvert.z)), axis=0), np.hstack(toConvert.z)))
Out[79]:
array([[  1,  10, 101],
       [  1,  10, 102],
       [  1,  10, 103],
       [  2,  20, 201],
       [  2,  20, 202]])
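If you want a DataFrame rather than a raw array, the stacked result can be wrapped back up, e.g. pd.DataFrame(out, columns=['x', 'y', 'z']) where out is the array above.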
You need numpy.repeat with str.len to create columns x and y; for z, flatten the tuples with itertools.chain:
import pandas as pd
import numpy as np
from itertools import chain
df = pd.DataFrame({
    "x": np.repeat(toConvert.x.values, toConvert.z.str.len()),
    "y": np.repeat(toConvert.y.values, toConvert.z.str.len()),
    "z": list(chain.from_iterable(toConvert.z))})
print(df)
   x   y    z
0  1  10  101
1  1  10  102
2  1  10  103
3  2  20  201
4  2  20  202
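For completeness: newer pandas versions (0.25+) have a built-in for exactly this operation, which is likely the simplest option if you can rely on it. A minimal sketch:
# DataFrame.explode repeats the other columns for each element of z;
# reset_index renumbers the resulting rows
out = toConvert.explode('z').reset_index(drop=True)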