df = pd.DataFrame({'Tissues':['a1','x2','y3','b','c1','v2','w3'], 'M':[1,2,'a',4,'b','a',7]})
df.set_index('Tissues')
The dataframe looks like:
M
Tissues
a1 1
x2 2
y3 a
b 4
c1 b
v2 a
w3 7
How can I replace every 'a' in column M with a specific value, say 2, and every 'b' with 3?
I tried:
replace_values = {'a':2, 'b':3}
df['M'] = df['M'].map(replace_values)
but that changed every value that is not a key in replace_values to NaN:
Tissues M
0 a1 NaN
1 x2 NaN
2 y3 2.0
3 b NaN
4 c1 3.0
5 v2 2.0
6 w3 NaN
I see that I can do
df.loc[df['M'] == 'a', 'M'] = 2
but can I do this efficiently for a, b and so on, instead of one by one?
Use df.replace:
df = pd.DataFrame({'Tissues':['a1','x2','y3','b','c1','v2','w3'], 'M':[1,2,'a',4,'b','a',7]})
df.set_index('Tissues')
replace_values = {'a':2, 'b':3}
df['M'] = df['M'].replace(replace_values)
Output:
>>> df
Tissues M
0 a1 1
1 x2 2
2 y3 2
3 b 4
4 c1 3
5 v2 2
6 w3 7
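To make the difference concrete, here is a minimal sketch comparing the two calls on the same data. The variable names s, mapped and replaced are only illustrative; the values are those of column M above.

```python
import pandas as pd

s = pd.Series([1, 2, 'a', 4, 'b', 'a', 7])
replace_values = {'a': 2, 'b': 3}

# map looks every value up in the dict; values without a key become NaN
mapped = s.map(replace_values)

# replace only touches values that appear as keys and leaves the rest alone
replaced = s.replace(replace_values)

print(mapped.isna().sum())   # number of values that map could not find
print(replaced.tolist())
```

So map is a full lookup, while replace is a selective substitution, which is what the question asks for.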
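Fix your code by adding fillna, which restores the original value wherever map returned NaN (note that the column becomes float):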
df['M'] = df['M'].map(replace_values).fillna(df.M)
df
Tissues M
0 a1 1.0
1 x2 2.0
2 y3 2.0
3 b 4.0
4 c1 3.0
5 v2 2.0
6 w3 7.0
Use df.replace with a nested dictionary keyed by column name:
replace_values = {'a':2, 'b':3}
df = df.replace({"M": replace_values})
Results:
Tissues M
0 a1 1
1 x2 2
2 y3 2
3 b 4
4 c1 3
5 v2 2
6 w3 7
I want to build a data frame with m columns and n rows.
Each row starts at 1 and increments by 1 up to m.
I've tried to find a solution, but I only found one that does this for the columns.
I have also added a figure of a simple case.
Using assign to broadcast the rows in an empty DataFrame:
df = (
pd.DataFrame(index=range(3))
.assign(**{f'c{i}': i+1 for i in range(4)})
)
Output:
c0 c1 c2 c3
0 1 2 3 4
1 1 2 3 4
2 1 2 3 4
You can use np.tile:
import numpy as np
import pandas as pd
m = 4  # number of columns
n = 3  # number of rows
out = pd.DataFrame(np.tile(np.arange(1, m + 1), (n, 1)), columns=[f'c{num}' for num in range(m)])
Output:
c0 c1 c2 c3
0 1 2 3 4
1 1 2 3 4
2 1 2 3 4
Try this (no additional libraries needed), with m columns and n rows:
df = pd.DataFrame({f'c{i}': [i + 1] * n for i in range(m)})
Result with m = 4 and n = 3:
c0 c1 c2 c3
0 1 2 3 4
1 1 2 3 4
2 1 2 3 4
Use np.ones with broadcasting (here m is the number of rows and n the number of columns):
m = 3
n = 4
out = pd.DataFrame(np.ones((m,n))*(np.arange(n)+1))
Out[139]:
0 1 2 3
0 1.0 2.0 3.0 4.0
1 1.0 2.0 3.0 4.0
2 1.0 2.0 3.0 4.0
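The np.ones approach returns floats. If integer values are wanted, a possible variant (a sketch, not taken from any of the answers above) is to broadcast the integer range directly with np.broadcast_to. Here m is the number of columns and n the number of rows, as in the question; the c0, c1, ... column names are borrowed from the other answers.

```python
import numpy as np
import pandas as pd

m = 4  # number of columns
n = 3  # number of rows

# broadcast the row [1, ..., m] to n rows; copy() gives a writable array
values = np.broadcast_to(np.arange(1, m + 1), (n, m)).copy()
out = pd.DataFrame(values, columns=[f'c{i}' for i in range(m)])
print(out)
```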
I need to combine multiple rows into a single row with pandas, depending on the column 'hash'
View of my dataframe:
hash a b
0 1 1 6
1 1 2 7
2 1 3 8
3 2 4 9
4 2 5 10
I want the dataframe to be converted like this:
hash a a1 a3 b b1 b2
0 1 1 2 3 6 7 8
1 2 4 5 nan 9 10 nan
I have tried some code based on groupby and on transposing the whole dataframe, but I can't figure out how to do it. Could anyone help me out?
Create a MultiIndex with set_index, using a counter column built with cumcount, reshape with unstack, and flatten the MultiIndex columns with map and join:
df1 = df.set_index(['hash', df.groupby('hash').cumcount().add(1).astype(str)]).unstack()
df1.columns = df1.columns.map(''.join)
df1 = df1.reset_index()
print (df1)
hash a1 a2 a3 b1 b2 b3
0 1 1.0 2.0 3.0 6.0 7.0 8.0
1 2 4.0 5.0 NaN 9.0 10.0 NaN
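The key step is the per-group counter from cumcount. The sketch below rebuilds the sample frame, prints that counter, and reaches the same wide layout through pivot instead of set_index/unstack. The helper column name n is an assumption made for illustration only.

```python
import pandas as pd

df = pd.DataFrame({'hash': [1, 1, 1, 2, 2],
                   'a': [1, 2, 3, 4, 5],
                   'b': [6, 7, 8, 9, 10]})

# position of each row within its 'hash' group, starting at 1
counter = df.groupby('hash').cumcount().add(1)
print(counter.tolist())

# pivot on the counter, then flatten the (value, counter) column pairs
wide = df.assign(n=counter).pivot(index='hash', columns='n', values=['a', 'b'])
wide.columns = [f'{col}{num}' for col, num in wide.columns]
wide = wide.reset_index()
print(wide)
```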
I have a huge dataframe, with hundreds of thousands of rows and columns.
My data looks like this:
df
MAC T_1 X_1 Y_1 T_2 X_2 Y_2 T_3 X_3 Y_3 T_4 X_4 Y_4 T_5 X_5 Y_5 T_6 X_6 Y_6 T_7 X_7 Y_7
ID1 1 1 1 1 1 1 2 1 2 3 1 3 3 1 3 4 1 4 5 1 5
ID2 6 2 5 6 2 5 7 3 5 7 3 5 8 4 5 9 5 5 10 5 4
ID3 1 1 1 2 1 2 3 1 3 3 1 3 4 1 4 5 1 5 6 2 5
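I want to calculate the speed between consecutive points, V_i = sqrt((X_{i+1} - X_i)^2 + (Y_{i+1} - Y_i)^2) / (T_{i+1} - T_i), taken as 0 when T_{i+1} - T_i is not positive.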
I used this code:
import math
import pandas as pd

df = pd.read_csv("data.csv")
df_result = pd.DataFrame()

def v_2(i):
    # squared distance between points i and i+1 of row x
    return (df.iloc[x, 5+3*(i-1)] - df.iloc[x, 2+3*(i-1)])**2 + (df.iloc[x, 6+3*(i-1)] - df.iloc[x, 3+3*(i-1)])**2

def v(i):
    # time difference between points i and i+1 of row x
    dt = df.iloc[x, 4+3*(i-1)] - df.iloc[x, 1+3*(i-1)]
    if dt <= 0:
        return 0
    return math.sqrt(v_2(i)) / dt

for i in range(1, int((len(df.columns)-1)/3)):
    v_result = list()
    for x in range(len(df.index)):
        v_result.append(v(i))
    df_result[i] = v_result
my expected result:
MAC V1 V2 V3 V4 V5 V6
ID1 0 1 1 0 1 1
ID2 0 1 0 1 1 1
ID3 1 1 0 1 1 1
but this code takes a very long time to run.
Could you suggest a simpler and faster approach, or a way to use the multiprocessing module?
Thank you.
The calculation can be sped up quite a bit through reshaping the data first, so that efficient pandas methods can be used. If that is not fast enough, you can then go down to the numpy array and apply the functions there.
First reshape the data from wide to long format so that there are only three columns: T, X and Y. The column suffixes (_1, _2, etc.) are split out into a new index level.
df = df.set_index('MAC')
df.columns = pd.MultiIndex.from_arrays(list(zip(*df.columns.str.split('_'))))
df = df.stack()
This produces the following data frame:
T X Y
MAC
ID1 1 1 1 1
2 1 1 1
3 2 1 2
4 3 1 3
5 3 1 3
6 4 1 4
7 5 1 5
ID2 1 6 2 5
2 6 2 5
3 7 3 5
4 7 3 5
5 8 4 5
6 9 5 5
7 10 5 4
ID3 1 1 1 1
2 2 1 2
3 3 1 3
4 3 1 3
5 4 1 4
6 5 1 5
7 6 2 5
Next calculate the del_X^2, del_Y^2 & del_t (I hope the usage of prefix del is unambiguous). This is easier done using these two utility functions to avoid repetition.
def f(x):
    return x.shift(-1) - x
def f2(x):
    return f(x)**2
update: description of functions
The first function calculates F(W,n) = W(n+1) - W(n), for all n, where n is the index of the array W. The second function squares its argument. These functions are composed to calculate the distance squared. See the documentation for pd.Series.shift for more information & examples.
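As a small worked example of the first function, here is f applied to the first four Y values of ID1 (1, 1, 2, 3). The series name w is only illustrative.

```python
import pandas as pd

def f(x):
    # forward difference: value at n+1 minus value at n
    return x.shift(-1) - x

w = pd.Series([1, 1, 2, 3])
diffs = f(w)
print(diffs.tolist())
```

The last entry is NaN because there is no next element to subtract from, which is why the final row per group has to be filtered out later.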
Using lower-case column names for the del prefix above and the suffix 2 to mean squared (this needs import numpy as np):
df['x2'] = df.groupby(level=0).X.transform(f2)
df['y2'] = df.groupby(level=0).Y.transform(f2)
# use ['T'] rather than .T, which would clash with the transpose attribute
df['t'] = df.groupby(level=0)['T'].transform(f)
df['v'] = np.sqrt(df.x2 + df.y2) / df.t
df.v.unstack(0)
This produces the following, which is similar to your output but transposed.
MAC ID1 ID2 ID3
1 NaN NaN 1.0
2 1.0 1.0 1.0
3 1.0 NaN NaN
4 NaN 1.0 1.0
5 1.0 1.0 1.0
6 1.0 1.0 1.0
7 NaN NaN NaN
You can filter out the last row (where the computed columns t, x2 and y2 are null), fill the np.nan values in v with 0, transpose, rename the columns and reset the index to get your desired result.
result = df[pd.notnull(df.t)].v.unstack(0).fillna(0).T
result.columns = ['V'+x for x in result.columns]
result.reset_index()
# outputs:
MAC V1 V2 V3 V4 V5 V6
0 ID1 0.0 1.0 1.0 0.0 1.0 1.0
1 ID2 0.0 1.0 0.0 1.0 1.0 1.0
2 ID3 1.0 1.0 0.0 1.0 1.0 1.0
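If the pandas version is still too slow, the same calculation can be pushed down to NumPy, as mentioned at the start of this answer. The following is only a sketch under the assumption that the columns are ordered T_1, X_1, Y_1, T_2, ... exactly as in the sample. The frame is rebuilt inline here instead of being read from data.csv, and the names arr, dist and result are illustrative.

```python
import numpy as np
import pandas as pd

cols = ['MAC'] + [f'{c}_{i}' for i in range(1, 8) for c in 'TXY']
df = pd.DataFrame(
    [['ID1', 1, 1, 1, 1, 1, 1, 2, 1, 2, 3, 1, 3, 3, 1, 3, 4, 1, 4, 5, 1, 5],
     ['ID2', 6, 2, 5, 6, 2, 5, 7, 3, 5, 7, 3, 5, 8, 4, 5, 9, 5, 5, 10, 5, 4],
     ['ID3', 1, 1, 1, 2, 1, 2, 3, 1, 3, 3, 1, 3, 4, 1, 4, 5, 1, 5, 6, 2, 5]],
    columns=cols)

# shape (rows, points, 3), where the last axis holds T, X, Y
arr = df.drop(columns='MAC').to_numpy(dtype=float).reshape(len(df), -1, 3)

# differences between consecutive points, split into dt, dx, dy
dt, dx, dy = np.diff(arr, axis=1).transpose(2, 0, 1)
dist = np.sqrt(dx**2 + dy**2)

# speed is 0 wherever the time difference is not positive
v = np.divide(dist, dt, out=np.zeros_like(dist), where=dt > 0)

result = pd.DataFrame(v, columns=[f'V{i}' for i in range(1, v.shape[1] + 1)])
result.insert(0, 'MAC', df['MAC'])
print(result)
```

This avoids both the Python loops and the reshape/groupby overhead, at the cost of relying on the column order.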
I suggest you use Apache Spark if you need real speed at this scale.
You can do that by passing your function to Spark, as described in this documentation:
Passing function to Spark
I have an unusual requirement: I need the mean of the common columns (per row) of two dataframes.
I cannot think of a pythonic way of doing this. I know I can loop through the two data frames, find the common columns and then take the mean of the rows where the key matches.
Assuming I have below Data Frames:
DF1:
Key A B C D E
K1 2 3 4 5 8
K2 2 3 4 5 8
K3 2 3 4 5 8
K4 2 3 4 5 8
DF2:
Key A B C D
K1 4 7 4 7
K2 4 7 4 7
K3 4 7 4 7
K4 4 7 4 7
The result DF should be the mean values of the two DF , each column per row where Key matches.
ResultDF:
Key A B C D
K1 3 5 4 6
K2 3 5 4 6
K3 3 5 4 6
K4 3 5 4 6
I know I should put sample code here, but I have not been able to come up with any logic for this so far.
Use DataFrame.add using Key as the indexes:
df1.set_index('Key').add(df2.set_index('Key')).dropna(axis=1) / 2
A B C D
Key
K1 3 5 4 6
K2 3 5 4 6
K3 3 5 4 6
K4 3 5 4 6
Alternative with concat + groupby.
pd.concat([df1, df2], axis=0).dropna(axis=1).groupby('Key').mean()
A B C D
Key
K1 3 5 4 6
K2 3 5 4 6
K3 3 5 4 6
K4 3 5 4 6
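Both snippets rely on dropna(axis=1) to discard the column that exists in only one frame, which would also discard a shared column that happens to contain a missing value. A possible alternative (a sketch on a cut-down version of the sample data) is to select the shared columns explicitly with Index.intersection. The names a, b and common are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'Key': ['K1', 'K2'], 'A': [2, 2], 'B': [3, 3], 'E': [8, 8]})
df2 = pd.DataFrame({'Key': ['K1', 'K2'], 'A': [4, 4], 'B': [7, 7]})

a = df1.set_index('Key')
b = df2.set_index('Key')

# columns present in both frames, in the order they appear in df1
common = a.columns.intersection(b.columns)

# rows are aligned on Key, then averaged element by element
result = (a[common] + b[common]) / 2
print(result)
```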
Try adding the two data frames together, then use the pandas apply function with a lambda that divides each value by two:
import pandas as pd
df1 = pd.DataFrame({'A': [2,2]})
df2 = pd.DataFrame({'A': [4,4]})
print((df1+df2).apply(lambda x: x/2))
Output:
A
0 3.0
1 3.0
Note: this is just with a dummy data frame
I've been working on a DataFrame, like the following extract and I want to know when the value changes:
A M C
0 2.0 1 C1
1 2.0 1 C1
2 2.0 2 C1
3 2.0 2 C1
4 2.0 3 C1
5 2.0 3 C1
6 2.0 1 C2
7 2.0 1 C2
8 2.0 2 C2
9 2.0 2 C3
10 2.0 3 C3
11 2.0 3 C3
13 2.1 1 C3
14 2.1 1 C3
15 2.1 2 C3
16 2.1 2 C3
17 2.1 3 C3
18 2.1 3 C3
I know that A or C always changes when M restarts at 1. The question is: how can I get the position every time the M value restarts at 1?
I don't know if your entire data set is built the same way as the extract you are showing, but from what I can see you are looking for a transition from 3 to 1 in the M column, which gives a difference of -2:
df[df['M'].diff()==-2].index
Out[101]: Int64Index([6, 13], dtype='int64')
If your M column always increases but can go higher than 3, you could just look for a negative difference, such as:
df[df['M'].diff()<0].index
Out[103]: Int64Index([6, 13], dtype='int64')
If there is no such pattern, you could simply do:
df[(df['M'].diff()!=0) & (df['M']==1)].index
Out[104]: Int64Index([0, 6, 13], dtype='int64')
This adds index 0 because .diff() returns NaN for the first row of the dataframe, which is != 0, and df['M'] == 1 there.
Another way to determine when a new M set starts is to find where M is 1 and the previous M isn't:
In [18]: (df['M'] == 1) & (df["M"].shift() != 1)
Out[18]:
0 True
1 False
2 False
3 False
4 False
5 False
6 True
7 False
[.. and so on]
Name: M, dtype: bool
This includes the first element, which often makes sense. Once you have this, you can take its cumulative sum to get a group number associated with each group (because True == 1 and False == 0):
In [19]: df["group_index"] = ((df['M'] == 1) & (df["M"].shift() != 1)).cumsum()
In [20]: df
Out[20]:
A M C group_index
0 2.0 1 C1 1
1 2.0 1 C1 1
2 2.0 2 C1 1
3 2.0 2 C1 1
4 2.0 3 C1 1
5 2.0 3 C1 1
6 2.0 1 C2 2
7 2.0 1 C2 2
[.. and so on]
which is convenient because then you can use groupby to perform operations on the different clusters.
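As a sketch of that last step, the snippet below rebuilds a shortened version of the sample (only the M and C columns, two runs of six rows), extracts the start positions, and then uses groupby on the group number. The names starts, start_positions and sizes are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'M': [1, 1, 2, 2, 3, 3, 1, 1, 2, 2, 3, 3],
                   'C': ['C1'] * 6 + ['C2'] * 6})

# True where M is 1 and the previous M is not 1
starts = (df['M'] == 1) & (df['M'].shift() != 1)
df['group_index'] = starts.cumsum()

# index labels where a new run of M begins
start_positions = df.index[starts].tolist()
print(start_positions)

# one example of a per-group operation: number of rows in each run
sizes = df.groupby('group_index').size()
print(sizes)
```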