I am new to pandas and I am facing an issue with replacing values. I am creating a function which replaces the values of a column of a DataFrame based on the parameters. The condition is that it should replace all values of the column with only one value, as shown below.
When I tried, I got an error saying the lengths didn't match.
def replace(df, column, condition):
    for i in column:
        for j in condition:
            df[i] = j
    return df
column = ['A','C']
condition = 11,34
df
A B C
0 12 5 1
1 13 6 5
2 14 7 7
replace(df,column,condition)
My expected output:
A B C
0 11 5 34
1 11 6 34
2 11 7 34
Edit: I initially suggested using apply but then realized that is not necessary, since you are ignoring the existing values in the Series. This is simpler and should serve your purposes.
Example:
import pandas as pd
data = [[12, 5, 1], [13, 5, 5], [14, 7, 7]]
df = pd.DataFrame(data, columns = ["A", "B", "C"])
def replace(df, columns, values):
    for one_column, one_value in zip(columns, values):
        df[one_column] = one_value
    return df
print(replace(df, ["A", "C"], [11, 34]))
Output:
A B C
0 11 5 34
1 11 5 34
2 11 7 34
Using key:value pairs, convert column and condition into a dict, then unpack and assign the values:
df.assign(**dict(zip(column, condition)))
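For example, with the df from the question (a minimal sketch; note that assign returns a new DataFrame rather than modifying df in place):
column = ['A', 'C']
condition = (11, 34)
out = df.assign(**dict(zip(column, condition)))  # scalars broadcast down each column
print(out)
Output:
    A  B   C
0  11  5  34
1  11  6  34
2  11  7  34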
Here I have a working solution, but my question focuses on how to do this the pandas way. I assume pandas offers better solutions for this.
I use groupby() and then apply() to compare the values in the rows of each group, and while doing this I decide which row to delete.
The rule doesn't matter! In this example the rule is that when values in column A differ only by 1 (the values are "near"), the second one is deleted. How the decision is made is not part of the question. There could also be a list of color names, and I would say that darkblue and marineblue are "near" and one of them should be deleted.
This is the initial DataFrame:
X A B
0 A 9 0 <--- DELETE
1 A 14 1
2 A 8 2
3 A 1 3
4 A 18 4
5 B 10 5
6 B 20 6
7 B 11 7 <--- DELETE
8 B 30 8
Row index 0 should be deleted because its value 9 is near the value 8 in row index 2. The same with row index 7: its value 11 is "near" 10 in row index 5.
This is the code:
#!/usr/bin/env python3
import pandas as pd
df = pd.DataFrame(
    {
        'X': list('AAAAABBBB'),
        'A': [9, 14, 8, 1, 18, 10, 20, 11, 30],
        'B': range(9)
    }
)
print(df)
def mark_near_neighbors(group):
    # I snip the decision process here.
    # Delete 9 because it is "near" 8.
    default_result = pd.Series(
        data=[False] * len(group),
        index=['Delete'] * len(group)
    )
    if group.X.iloc[0] == 'A':
        # the 9
        default_result.iloc[0] = True
    else:
        # the 11
        default_result.iloc[2] = True
    return default_result
result = df.groupby('X').apply(mark_near_neighbors)
result = result.reset_index(drop=True)
print(result)
df = df.loc[~result]
print(df)
So in the end I use "boolean indexing" to solve this:
0 True
1 False
2 False
3 False
4 False
5 False
6 False
7 True
8 False
dtype: bool
But is there a better way to do this?
Initialize the dataframe
df = pd.DataFrame([
    ['A', 9, 0],
    ['A', 14, 1],
    ['A', 8, 2],
    ['A', 1, 3],
    ['A', 18, 4],
    ['B', 10, 5],
    ['B', 20, 6],
    ['B', 11, 7],
    ['B', 30, 8],
], columns=['X', 'A', 'B'])
Sort the dataframe based on A column
df = df.sort_values('A')
Find the difference between consecutive values within each group
df["diff"] = df.groupby('X')['A'].diff()
Select the rows where the difference is not 1
result = df[df["diff"] != 1.0]
Drop the helper column (assigning back rather than using inplace on the filtered copy) and sort by index to restore the original row order
result = result.drop("diff", axis=1)
result = result.sort_index()
Sample output
X A B
1 A 14 1
2 A 8 2
3 A 1 3
4 A 18 4
5 B 10 5
6 B 20 6
8 B 30 8
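Put together, the steps above can be chained into one pipeline (a sketch of the same logic, without mutating intermediate frames):
result = (
    df.sort_values('A')
      .assign(diff=lambda d: d.groupby('X')['A'].diff())  # gap to the previous value in the group
      .query('diff != 1')                                 # NaN != 1 is True, so the first row of each group survives
      .drop(columns='diff')
      .sort_index()
)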
IIUC, you can use numpy broadcasting to compare all values within a group. Everything is kept inside apply here, since that seems to be what you want:
import numpy as np

def mark_near_neighbors(group, thresh=1):
    a = group.to_numpy().astype(float)
    idx = np.argsort(a)
    b = a[idx]                                   # values sorted ascending
    d = abs(b - b[:, None])                      # pairwise absolute differences
    d[np.triu_indices(d.shape[0])] = thresh + 1  # mask the diagonal and later pairs
    # keep a value only if it is farther than thresh from every earlier (smaller) value;
    # np.argsort(idx) restores the original row order
    return pd.Series((d > thresh).all(1)[np.argsort(idx)], index=group.index)
out = df[df.groupby('X')['A'].apply(mark_near_neighbors)]
Output:
X A B
1 A 14 1
2 A 8 2
3 A 1 3
4 A 18 4
5 B 10 5
6 B 20 6
8 B 30 8
So I have a dataframe as follows:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([[1, 2, 3], [4, 3, 6], [7, 2, 9]]),
columns=['a', 'b', 'c'])
df
Output:
   a  b  c
0  1  2  3
1  4  3  6
2  7  2  9
I want to select or keep the two columns with the highest values in the last row. What is the best way to approach this?
So in fact I just want to keep column 'a' due to value 7 and column 'c' due to value 9.
Try:
df = df[df.iloc[-1].nlargest(2).index]
Output:
c a
0 3 1
1 6 4
2 9 7
If you want to keep the original column order as well, you can use Index.intersection() together with nlargest(), as follows:
df[df.columns.intersection(df.iloc[-1].nlargest(2).index, sort=False)]
Result:
a c
0 1 3
1 4 6
2 7 9
I have a pandas Series with the following content.
import pandas as pd

s = pd.Series(
    data=[True, False, True, True],
    index=['A', 'B', 'C', 'D']
)
s.index.name = 'my_id'
print(s)
my_id
A True
B False
C True
D True
dtype: bool
and a DataFrame like this.
df = pd.DataFrame({
    'A': [1, 2, 9, 4],
    'B': [9, 6, 7, 8],
    'C': [10, 91, 32, 13],
    'D': [43, 12, 7, 9]
})
print(df)
A B C D
0 1 9 10 43
1 2 6 91 12
2 9 7 32 7
3 4 8 13 9
s has A, B, C, and D as its indices. df also has A, B, C, and D as its column names.
True in s means that the corresponding column in df will be preserved. False in s means that the corresponding column in df will be removed.
How can I generate another DataFrame with column B removed using s?
I mean I want to create the following DataFrame using s and df.
A C D
0 1 10 43
1 2 91 12
2 9 32 7
3 4 13 9
Use boolean indexing with DataFrame.loc. The : selects all rows; the columns are filtered by the boolean Series mask:
df1 = df.loc[:, s]
print(df1)
A C D
0 1 10 43
1 2 91 12
2 9 32 7
3 4 13 9
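The same mask also works in reverse; inverting it with ~ selects the columns that were dropped (a quick sketch):
df2 = df.loc[:, ~s]
print(df2)
   B
0  9
1  6
2  7
3  8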
I have a DataFrame in this format:
a b c
0 1 2 3
1 4 5 6
2 7 8 9
3 10 11 12
4 13 14 15
and an array like this, with column names:
['a', 'a', 'b', 'c', 'b']
and I’m hoping to extract an array of data, one value from each row. The array of column names specifies which column I want from each row. Here, the result would be:
[1, 4, 8, 12, 14]
Is this possible as a single command with Pandas, or do I need to iterate? I tried using indexing
i = pd.Index(['a', 'a', 'b', 'c', 'b'])
i.choose(df)
but I got a segfault, which I couldn’t diagnose because the documentation is lacking.
You could use lookup, e.g.
>>> i = pd.Series(['a', 'a', 'b', 'c', 'b'])
>>> df.lookup(i.index, i.values)
array([ 1, 4, 8, 12, 14])
where i.index could be different from range(len(i)) if you wanted.
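For instance, picking only some rows by label (a sketch; note lookup was deprecated in 1.2.0 and later removed, so this only runs on older pandas):
>>> i = pd.Series(['b', 'c'], index=[2, 3])
>>> df.lookup(i.index, i.values)
array([ 8, 12])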
For large datasets, you can use indexing on the base numpy data, if you're prepared to transform your column names into a numerical index (simple in this case):
df.values[np.arange(5), [0, 0, 1, 2, 1]]
out: array([ 1, 4, 8, 12, 14])
This will be much more efficient than list comprehensions or other explicit iteration.
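If the column names aren't already numeric positions, Index.get_indexer can do the conversion for you (a sketch, assuming numpy is imported as np):
cols = df.columns.get_indexer(['a', 'a', 'b', 'c', 'b'])  # array([0, 0, 1, 2, 1])
df.values[np.arange(len(df)), cols]
out: array([ 1,  4,  8, 12, 14])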
As MorningGlory stated in the comments, lookup has been deprecated in version 1.2.0.
The documentation states that the same can be achieved using melt and loc, but I didn't think it was very obvious, so here it goes.
First, use melt to create a look-up DataFrame:
i = pd.Series(["a", "a", "b", "c", "b"], name="col")
melted = pd.melt(
    pd.concat([i, df], axis=1),
    id_vars="col",
    value_vars=df.columns,
    ignore_index=False,
)
col variable value
0 a a 1
1 a a 4
2 b a 7
3 c a 10
4 b a 13
0 a b 2
1 a b 5
2 b b 8
3 c b 11
4 b b 14
0 a c 3
1 a c 6
2 b c 9
3 c c 12
4 b c 15
Then, use loc to only get relevant values:
result = melted.loc[melted["col"] == melted["variable"], "value"]
0 1
1 4
2 8
4 14
3 12
Name: value, dtype: int64
Finally - if needed - to get the same index order as before:
result.loc[df.index]
0 1
1 4
2 8
3 12
4 14
Name: value, dtype: int64
Pandas also provides a different solution in the documentation using factorize and numpy indexing:
import numpy as np

df = pd.concat([i, df], axis=1)      # prepend the column-name Series as 'col'
idx, cols = pd.factorize(df['col'])  # integer codes plus the unique column names
df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
[ 1 4 8 12 14]
You can always use a list comprehension:
[df.loc[idx, col] for idx, col in enumerate(['a', 'a', 'b', 'c', 'b'])]
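This assumes the default RangeIndex; with an arbitrary index, zipping over df.index keeps the lookup label-safe (a small sketch):
cols = ['a', 'a', 'b', 'c', 'b']
[df.loc[idx, col] for idx, col in zip(df.index, cols)]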
I have a large dataframe. When it was created, 'None' was used as the value where a number could not be calculated (instead of 'nan').
How can I delete all rows that have 'None' in any of their columns? I thought I could use df.dropna and set the value of na, but I can't seem to be able to.
Thanks
I think this is a good representation of the dataframe:
temp = pd.DataFrame(data=[['str1','str2',2,3,5,6,76,8],['str3','str4',2,3,'None',6,76,8]])
Setup
Borrowed @MaxU's df
df = pd.DataFrame([
    [1, 2, 3],
    [4, None, 6],
    [None, 7, 8],
    [9, 10, 11]
], dtype=object)
Solution
You can just use pd.DataFrame.dropna as is
df.dropna()
0 1 2
0 1 2 3
3 9 10 11
Supposing you have None strings like in this df:
df = pd.DataFrame([
    [1, 2, 3],
    [4, 'None', 6],
    ['None', 7, 8],
    [9, 10, 11]
], dtype=object)
Then combine dropna with mask
df.mask(df.eq('None')).dropna()
0 1 2
0 1 2 3
3 9 10 11
You can ensure that the entire dataframe is treated as object dtype when you compare:
df.mask(df.astype(object).eq('None')).dropna()
0 1 2
0 1 2 3
3 9 10 11
Thanks for all your help. In the end I was able to get
df = df.replace(to_replace='None', value=np.nan).dropna()
to work. I'm not sure why your suggestions didn't work for me.
UPDATE:
In [70]: temp[temp.astype(str).ne('None').all(1)]
Out[70]:
0 1 2 3 4 5 6 7
0 str1 str2 2 3 5 6 76 8
Old answer:
In [35]: x
Out[35]:
a b c
0 1 2 3
1 4 None 6
2 None 7 8
3 9 10 11
In [36]: x = x[~x.astype(str).eq('None').any(1)]
In [37]: x
Out[37]:
a b c
0 1 2 3
3 9 10 11
or a bit nicer variant from @roganjosh:
In [47]: x = x[x.astype(str).ne('None').all(1)]
In [48]: x
Out[48]:
a b c
0 1 2 3
3 9 10 11
I'm a bit late to the party, but this is probably the simplest method:
df.dropna(axis=0, how='any')
Parameters:
axis='index'/'columns', how='any'/'all'
axis=0 drops rows (the most common case) and axis=1 drops columns instead.
The how parameter drops the row/column if 'any' of its values are None/NaN, or only if 'all' of them are (how='all').
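A quick sketch of the difference on a tiny frame (dropna looks for NaN/None values):
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1, np.nan], 'b': [np.nan, np.nan]})
print(df.dropna(axis=0, how='any'))  # drops both rows: each contains at least one NaN
print(df.dropna(axis=0, how='all'))  # drops only row 1, the all-NaN row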
If None is still not removed, we can do
df = df.replace(to_replace='None', value=np.nan).dropna()
The above solution worked partially for me: the None was converted to NaN but not removed (thanks to the above answer, as it helped me move further),
so I then added one more line of code, taking the particular column:
df['column'] = df['column'].apply(lambda x: str(x))
This changed the NaN to 'nan'; now remove the nan:
df = df[df['column'] != 'nan']
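For what it's worth, the string round-trip can usually be avoided; after the replace step, filtering on notna() gives the same effect (a sketch, assuming numpy is imported as np):
df = df.replace(to_replace='None', value=np.nan)
df = df[df['column'].notna()]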