I would like to replace row values in pandas.
For example:
import pandas as pd
import numpy as np
a = np.array(([100, 100, 101, 101, 102, 102],
              np.arange(6)))
pd.DataFrame(a.T)
Result:
array([[100,   0],
       [100,   1],
       [101,   2],
       [101,   3],
       [102,   4],
       [102,   5]])
Here, I would like to replace the rows with the values [101, 3] with [200, 10] and the result should therefore be:
array([[100,   0],
       [100,   1],
       [101,   2],
       [200,  10],
       [102,   4],
       [102,   5]])
Update
In a more general case I would like to replace multiple rows.
The old and new row values are therefore represented by n×2 matrices (where n is the number of rows to replace). For example:
old_vals = np.array([[101, 3],
                     [100, 0],
                     [102, 5]])
new_vals = np.array([[200, 10],
                     [300, 20],
                     [400, 30]])
And the result is:
array([[300,  20],
       [100,   1],
       [101,   2],
       [200,  10],
       [102,   4],
       [400,  30]])
For the single row case:
In [35]:
df.loc[(df[0]==101) & (df[1]==3)] = [[200,10]]
df
Out[35]:
     0   1
0  100   0
1  100   1
2  101   2
3  200  10
4  102   4
5  102   5
For the multiple row-case the following would work:
In [60]:
a = np.array(([100, 100, 101, 101, 102, 102],
              [0, 1, 3, 3, 3, 4]))
df = pd.DataFrame(a.T)
df
Out[60]:
     0  1
0  100  0
1  100  1
2  101  3
3  101  3
4  102  3
5  102  4
In [61]:
df.loc[(df[0]==101) & (df[1]==3)] = 200,10
df
Out[61]:
     0   1
0  100   0
1  100   1
2  200  10
3  200  10
4  102   3
5  102   4
For a multi-row update like the one you propose, the following works when each replacement target is a single row. First construct a dict, using the old values to search for as keys and the new values as replacement values:
In [78]:
old_keys = [(x[0], x[1]) for x in old_vals]
replace_vals = dict(zip(old_keys, new_vals))
replace_vals
Out[78]:
{(100, 0): array([300, 20]),
 (101, 3): array([200, 10]),
 (102, 5): array([400, 30])}
We can then iterate over the dict and set the matching rows using the same method as in the single-row case:
In [93]:
for k, v in replace_vals.items():
    df.loc[(df[0] == k[0]) & (df[1] == k[1])] = [[v[0], v[1]]]
df
Out[93]:
     0   1
0  300  20
1  100   1
2  101   2
3  200  10
4  102   4
5  400  30
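The loop can also be avoided entirely. Below is a minimal vectorised sketch (my own alternative, not part of the answer above), assuming old_vals and new_vals are the n×2 arrays from the question; it keys every row by its (column 0, column 1) pair and matches all rows to replace in one step (pd.MultiIndex.from_frame needs pandas 0.24+):
import numpy as np
import pandas as pd

a = np.array(([100, 100, 101, 101, 102, 102], np.arange(6)))
df = pd.DataFrame(a.T)

old_vals = np.array([[101, 3], [100, 0], [102, 5]])
new_vals = np.array([[200, 10], [300, 20], [400, 30]])

# key every row of df and every row to replace by its value pair
row_keys = pd.MultiIndex.from_frame(df)
old_keys = pd.MultiIndex.from_arrays(list(old_vals.T))

# indexer[i] is the position of df row i in old_keys, or -1 if unmatched
indexer = old_keys.get_indexer(row_keys)
mask = indexer != -1
df.loc[mask] = new_vals[indexer[mask]]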
The simplest way should be this one:
df.loc[[3], 0:1] = 200, 10
In this case, 3 is the row label (the fourth row of the data frame, since the default index starts at 0), while 0 and 1 are the columns.
This code instead allows you to iterate over each row, check its content and replace it with what you want:
target = [101, 3]
mod = [200, 10]
for index, row in df.iterrows():
    if row[0] == target[0] and row[1] == target[1]:
        # write back through the frame: mutating `row` only changes
        # a copy and is not guaranteed to update df
        df.loc[index] = mod
print(df)
Replace 'A' with 1 and 'B' with 2.
df = df.replace(['A', 'B'],[1, 2])
This is done over the entire DataFrame no matter the column.
However, we can target a single column this way:
df[column] = df[column].replace(['A', 'B'],[1, 2])
More in-depth examples are available in the pandas documentation for DataFrame.replace.
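DataFrame.replace also accepts a nested dictionary, which is handy when different columns need different mappings. A small sketch (the column names here are hypothetical):
import pandas as pd

df = pd.DataFrame({'grade': ['A', 'B', 'A'], 'team': ['B', 'A', 'B']})

# {column: {old: new}} restricts each mapping to the named column
df = df.replace({'grade': {'A': 1, 'B': 2}})  # 'team' is left untouched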
Another possibility is:
import io

import numpy as np
import pandas as pd

a = np.array(([100, 100, 101, 101, 102, 102],
              np.arange(6)))
df = pd.DataFrame(a.T)

# serialise the frame to plain text, one row per line
string = df.to_string(header=False, index=False, index_names=False)

dictionary = {'100 0': '300 20',
              '101 3': '200 10',
              '102 5': '400 30'}

def replace_all(text, dic):
    for i, j in dic.items():
        text = text.replace(i, j)
    return text

string = replace_all(string, dictionary)
# parse the edited text back; header=None keeps the first row as data
df = pd.read_csv(io.StringIO(string), sep=r'\s+', header=None)
I found this solution better, since when dealing with a large amount of data to replace, it runs faster than EdChum's solution.
Related
I have a pandas dataframe that consists of columns of values and a pandas series that consists of column names. I need to get the value of the nth row of the column corresponding to the nth index of the column name in the series.
Note that the column name is constructed by appending col to the value in the series.
I have looked to see if there is a fast way of doing this (vectorization or a list comprehension) but seem to hit a roadblock where I index into the dataframe using the index position of the series.
dataframe : {
'col1': [1, 2, 3, 4, 5],
'col2': [10, 20, 30, 40, 50],
'col3': [100, 200, 300, 400, 500]
}
series : [
'1', '2', '1', '3', '8'
]
output is a series : [
1, 20, 3, 400, numpy.nan
]
I am able to do this using a straightforward iterrows, but would like something faster (preferably vectorization, but if not list comprehension).
import pandas as pd

def test_cols():
    stub_data_df = pd.DataFrame({
        'col1': [1, 2, 3, 4, 5],
        'col2': [10, 20, 30, 40, 50],
        'col3': [100, 200, 300, 400, 500]
    })
    cols = pd.Series([
        '1', '2', '1', '3', '8'
    ])
    rates = []
    for i, row in stub_data_df.iterrows():
        rates.append(row.get('col' + cols[i]))
    print(pd.Series(rates))
Output:
0      1.0
1     20.0
2      3.0
3    400.0
4      NaN
dtype: float64
Here is a way to do this with a list comprehension:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4, 5],
                   'col2': [10, 20, 30, 40, 50],
                   'col3': [100, 200, 300, 400, 500]})
s = pd.Series(['1', '2', '1', '3', '8'])

s = s.astype(int) - 1  # so these values can be used for integer indexing
result = s.copy()
legal_ix = s < len(df.columns)  # only columns that exist can be indexed
s = s[legal_ix]
result[legal_ix] = [df.iloc[i, j] for i, j in zip(s.index, s.values)]
result[~legal_ix] = np.nan
print(result)
0      1.0
1     20.0
2      3.0
3    400.0
4      NaN
dtype: float64
The docs have an example related to this:
idx, cols = ('col' + cols).factorize()
array = stub_data_df.reindex(cols, axis=1).to_numpy()
array = array[np.arange(len(stub_data_df)), idx]
pd.Series(array)
0      1.0
1     20.0
2      3.0
3    400.0
4      NaN
dtype: float64
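For context, this factorize/reindex pattern is essentially what the pandas docs recommend as the replacement for DataFrame.lookup, which was deprecated in pandas 1.2 and later removed.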
Let's say I have data about actual sales by sales person like so:
df = pd.DataFrame({'Salesperson id': [1, 2, 3, 4], "Q3 sales": [105, 82, 230, 58]})
   Salesperson id  Q3 sales
0               1       105
1               2        82
2               3       230
3               4        58
I also have their sales quotas like so:
quotas_df = pd.DataFrame({'Salesperson id': [1, 2, 3, 4], "Quota": [88, 95, 200, 65]})
quotas_df = quotas_df.set_index('Salesperson id')
                Quota
Salesperson id
1                  88
2                  95
3                 200
4                  65
I'd like to get a subset of df with only the rows where the sales person has exceeded their sales quota. I try the following:
filtered_df = df[(df['Q3 sales'] > quotas_df.loc[df['Salesperson id']]['Quota'])]
This fails with:
ValueError: Can only compare identically-labeled Series objects
Any pointers for the best way to do this?
You got the error because the two DataFrames' indexes are not aligned.
(
    df.set_index('Salesperson id')
    .loc[lambda x: x['Q3 sales'] > quotas_df['Quota']]
)
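With the sample data this keeps salespersons 1 and 3 (105 > 88 and 230 > 200), with Salesperson id as the index of the result.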
Use Series.map:
df = pd.DataFrame({'Salesperson id': [1, 2, 3, 4], "Q3 sales": [105, 82, 230, 58]})
quotas_df = pd.DataFrame({'Salesperson id': [1, 2, 3, 4], "Quota": [88, 95, 200, 65]})
s = df['Salesperson id'].map(quotas_df.set_index('Salesperson id')['Quota'])
filtered_df = df[df['Q3 sales'] > s]
print(filtered_df)
   Salesperson id  Q3 sales
0               1       105
2               3       230
You could merge the two dataframes and then filter normally:
df = pd.DataFrame({'Salesperson id': [1, 2, 3, 4], "Q3 sales": [105, 82, 230, 58]})
quotas_df = pd.DataFrame({'Salesperson id': [1, 2, 3, 4], "Quota": [88, 95, 200, 65]})
filtered_df = df.merge(quotas_df, on='Salesperson id')
filtered_df[filtered_df['Q3 sales'] > filtered_df['Quota']]
Output:
   Salesperson id  Q3 sales  Quota
0               1       105     88
2               3       230    200
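If you don't want the extra Quota column in the result, drop it after filtering, e.g. with .drop(columns='Quota').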
I was going through this link: Return top N largest values per group using pandas
and found multiple ways to find the topN values per group.
However, I prefer the dictionary method with the agg function, and would like to know if it is possible to get the equivalent of the dictionary method for the following problem.
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [1, 1, 1, 2, 2],
                   'B': [1, 1, 2, 2, 1],
                   'C': [10, 20, 30, 40, 50],
                   'D': ['X', 'Y', 'X', 'Y', 'Y']})
print(df)
   A  B   C  D
0  1  1  10  X
1  1  1  20  Y
2  1  2  30  X
3  2  2  40  Y
4  2  1  50  Y
I can do this:
df1 = df.groupby(['A'])['C'].nlargest(2).droplevel(-1).reset_index()
print(df1)
   A   C
0  1  30
1  1  20
2  2  50
3  2  40
# also this
df1 = df.sort_values('C', ascending=False).groupby('A', sort=False).head(2)
print(df1)
# also this
df.set_index('C').groupby('A')['B'].nlargest(2).reset_index()
Required
df.groupby('A', as_index=False).agg(
    {'C': lambda ser: ser.nlargest(2)}  # something like this
)
Is it possible to use the dictionary here?
If you want to get a dictionary mapping each value of A to the 2 top values from C, you can run:
df.groupby(['A'])['C'].apply(lambda x: x.nlargest(2).tolist()).to_dict()
For your DataFrame, the result is:
{1: [30, 20], 2: [50, 40]}
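If you specifically want the dictionary-with-agg form from the question, a lambda that returns a list per group appears to work as well in recent pandas versions; a sketch (the C cells then hold lists rather than scalars):
df.groupby('A', as_index=False).agg({'C': lambda ser: ser.nlargest(2).tolist()})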
I want to select rows from a dataframe based on values in the index combined with values in a specific column:
df = pd.DataFrame([[0, 2, 3], [0, 4, 1], [0, 20, 30], [40, 20, 30]],
                  index=[4, 5, 6, 7], columns=['A', 'B', 'C'])
    A   B   C
4   0   2   3
5   0   4   1
6   0  20  30
7  40  20  30
With
df.loc[df['A'] == 0, 'C'] = 99
I can select all rows with column A = 0 and replace the value in column C with 99. But how can I select all rows with column A = 0 and the index < 6 (I want to combine selection on the index with selection on the column)?
You can use multiple conditions in your loc statement:
df.loc[(df.index < 6) & (df.A == 0), 'C'] = 99
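With the example frame above, only rows 4 and 5 satisfy both conditions (row 6 has A == 0 but its index is not below 6), so only their C values become 99.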
I have a pandas Data Frame having one column containing arrays. I'd like to "flatten" it by repeating the values of the other columns for each element of the arrays.
I managed to do it by building a temporary list of values by iterating over every row, but that is "pure Python" and slow.
Is there a way to do this in pandas/numpy? In other words, I try to improve the flatten function in the example below.
Thanks a lot.
import pandas as pd

toConvert = pd.DataFrame({
    'x': [1, 2],
    'y': [10, 20],
    'z': [(101, 102, 103), (201, 202)]
})

def flatten(df):
    tmp = []
    def backend(r):
        x = r['x']
        y = r['y']
        zz = r['z']
        for z in zz:
            tmp.append({'x': x, 'y': y, 'z': z})
    df.apply(backend, axis=1)
    return pd.DataFrame(tmp)

print(flatten(toConvert).to_string(index=False))
Which gives:
x y z
1 10 101
1 10 102
1 10 103
2 20 201
2 20 202
Here's a NumPy-based solution:
np.column_stack((toConvert[['x', 'y']].values.repeat(
    list(map(len, toConvert.z)), axis=0), np.hstack(toConvert.z)))
Note the list() around map: on Python 3, map returns an iterator, which np.repeat cannot consume directly.
Sample run -
In [78]: toConvert
Out[78]:
   x   y                z
0  1  10  (101, 102, 103)
1  2  20       (201, 202)
In [79]: np.column_stack((toConvert[['x', 'y']].values.repeat(
    ...:     list(map(len, toConvert.z)), axis=0), np.hstack(toConvert.z)))
Out[79]:
array([[  1,  10, 101],
       [  1,  10, 102],
       [  1,  10, 103],
       [  2,  20, 201],
       [  2,  20, 202]])
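If you want a DataFrame rather than a raw array, the stacked result can be wrapped back up, e.g. pd.DataFrame(out, columns=['x', 'y', 'z']) where out is the array above.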
You need numpy.repeat with str.len to create columns x and y; for z, flatten the tuples with itertools.chain:
import pandas as pd
import numpy as np
from itertools import chain
df = pd.DataFrame({
    "x": np.repeat(toConvert.x.values, toConvert.z.str.len()),
    "y": np.repeat(toConvert.y.values, toConvert.z.str.len()),
    "z": list(chain.from_iterable(toConvert.z))})
print(df)
   x   y    z
0  1  10  101
1  1  10  102
2  1  10  103
3  2  20  201
4  2  20  202
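For completeness: newer pandas versions (0.25+) have a built-in for exactly this operation, which is likely the simplest option if you can rely on it. A minimal sketch:
# DataFrame.explode repeats the other columns for each element of z;
# reset_index renumbers the resulting rows
out = toConvert.explode('z').reset_index(drop=True)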