I've a scientist dataframe
radius date spin atom
0 12,50 YYYY/MM 0 he
1 11,23 YYYY/MM 2 c
2 45,2 YYYY/MM 1 z
3 11,1 YYYY/MM 1 p
I want select for each row, all rows where the difference between the radius is under, for exemple 5
I've define a function to calc (simple,it's an example):
def diff_radius (a,b)
return a-b
Is-it possible for each rows to find some rows which check the condition in calling an external function?
I try some way, not working:
for i in range(df.shape[0]):
....
df_in_radius=df.apply(lambda x : diff_radius(df[i]['radius'],x['radius']))
Can you help me?
I am assuming that the datatype of the radius column is a tuple. You can keep the diff_radius method like
def diff_radius(x):
a, b = x
return a-b
Then, you can use loc method in pandas to select the rows which matches the condition of radius differece less than 5.
df.loc[df.radius.apply(diff_radius) < 5]
Edit #1
If the datatype of the radius column is a string, then split them and typecast. The logic will go in the diff_radius method. In case of string
def diff_radius(x):
x_split = x.split(',')
a,b = int(x_split[0]), int(x_split[-1])
return a-b
I misspoke.
My dataframe is :
radius of my atom date spin atom
0 12.50 YYYY/MM 0 he
1 11.23 YYYY/MM 2 c
2 45.2 YYYY/MM 1 z
3 11.1 YYYY/MM 1 p
I do a loop , to apply on one row a special calcul of each row whose respond condition.
Example:
def diff_radius(current_row,x):
current_row['radius']-x['radius']
return a-b
df=pd.read_csv(csvfile,delimiter=";",names=('radius','date','spin','atom'))
# for each row of original dataframe
for i in range(df.shape[0]):
# first build a new and tmp dataframe with row
# which have a radius less 5 than df.iloc[i]['radius] (level of loop)
df_tmp=df[diff_radius(df.iloc[i]['radius],df['radius']) <5]
....
# start of special calc, with the df_tmp which contains all of rows
# less 5 than the current row **(i)**
I thank you sincerely for your answers
Related
I'm trying to calculate the centre of mass of 20 objects, where each object has it's own different mass.
These objects are represented in a dataframe cm_x, and their associated masses in a list. Below I show an example of just 3 of those 20 objects, for the sake of saving space. Each object has an x, y, z coordinate, but I'll just show the x and then I can apply the same technique to the rest. Below is the head of the dataframe.
bar_head_x bar_hip_centre_x bar_left_ankle_x
0 -203.3502 -195.4573 -293.262
1 -203.4280 -195.4720 -293.251
2 -203.4954 -195.4675 -293.248
3 -203.5022 -195.9193 -293.219
4 -203.5014 -195.9092 -293.328
m_head = 0.081
m_hipc = 0.139
m_lank = 0.0465
m = [m_head,m_hipc,m_lank]
I saw in another similar question, someone has suggested this method, however this doesn't incorporate the masses, and that is where I'm having an issue:
def series_sum(pd_series):
return np.sum(np.dot(pd_series.values, np.asarray(range(1, len(pd_series)+1)))/np.sum(pd_series))
cm_x.apply(series_sum, axis=1)
Basically I want for each row, to have an associated centre of mass, using the formula for centre of mass which is sum(x_i * m_i) / sum(m_i).
The desired result would be a new column in the dataframe like so:
cm_x
0 -214.92
1 ...
2 ...
3 ...
4 ...
Any help?
If I understand correctly, you can compute the desired column like this:
>>> df.mul(m).sum(axis=1)/sum(m)
0 -214.921628
1 -214.951023
2 -214.968638
3 -215.201292
4 -215.214800
Use DataFrame.dot and divide by sum of list m:
s = df.dot(m).div(sum(m))
print (s)
0 -214.921628
1 -214.951023
2 -214.968638
3 -215.201292
4 -215.214800
7441 -245.078910
7442 -244.943961
7443 -244.806606
7444 -244.665285
7445 -244.533503
dtype: float64
If need DataFrame add Series.to_frame:
df1 = df.dot(m).div(sum(m)).to_frame('cm_x')
print (df1)
cm_x
0 -214.921628
1 -214.951023
2 -214.968638
3 -215.201292
4 -215.214800
7441 -245.078910
7442 -244.943961
7443 -244.806606
7444 -244.665285
7445 -244.533503
I have a task that is completely driving me mad. Lets suppose we have this df:
import pandas as pd
k = {'random_col':{0:'a',1:'b',2:'c'},'isin':{0:'ES0140074008', 1:'ES0140074008ES0140074010', 2:'ES0140074008ES0140074016ES0140074024'},'n_isins':{0:1,1:2,2:3}}
k = pd.DataFrame(k)
What I want to do is to double or triple a row a number of times goberned by col n_isins which is a number obtained by dividing the lentgh of col isin didived by 12, as isins are always strings of 12 characters.
So, I need 1 time row 0, 2 times row 1 and 3 times row 2. My real numbers are up-limited by 6 so it is a hard task. I began by using booleans and slicing the col isin but that does not take me to nothing. Hopefully my explanation is good enough. Also I need the col isin sliced like this [0:11] + ' ' + [12:23]... splitting by the 'E' but I think I know how to do that, I just post it cause is the criteria that rules the number of times I have to copy each row. Thanks in advance!
I think you need numpy.repeat with loc, last remove duplicates in index by reset_index. Last for new column use custom splitting function with numpy.concatenate:
n = np.repeat(k.index, k['n_isins'])
k = k.loc[n].reset_index(drop=True)
print (k)
isin n_isins random_col
0 ES0140074008 1 a
1 ES0140074008ES0140074010 2 b
2 ES0140074008ES0140074010 2 b
3 ES0140074008ES0140074016ES0140074024 3 c
4 ES0140074008ES0140074016ES0140074024 3 c
5 ES0140074008ES0140074016ES0140074024 3 c
#https://stackoverflow.com/a/7111143/2901002
def chunks(s, n):
"""Produce `n`-character chunks from `s`."""
for start in range(0, len(s), n):
yield s[start:start+n]
s = np.concatenate(k['isin'].apply(lambda x: list(chunks(x, 12))))
df['new'] = pd.Series(s, index = df.index)
print (df)
isin n_isins random_col new
0 ES0140074008 1 a ES0140074008
1 ES0140074008ES0140074010 2 b ES0140074008
2 ES0140074008ES0140074010 2 b ES0140074010
3 ES0140074008ES0140074016ES0140074024 3 c ES0140074008
4 ES0140074008ES0140074016ES0140074024 3 c ES0140074016
5 ES0140074008ES0140074016ES0140074024 3 c ES0140074024
This is a continuation of my question. Fastest way to compare rows of two pandas dataframes?
I have two dataframes A and B:
A is 1000 rows x 500 columns, filled with binary values indicating either presence or absence.
For a condensed example:
A B C D E
0 0 0 0 1 0
1 1 1 1 1 0
2 1 0 0 1 1
3 0 1 1 1 0
B is 1024 rows x 10 columns, and is a full iteration from 0 to 1023 in binary form.
Example:
0 1 2
0 0 0 0
1 0 0 1
2 0 1 0
3 0 1 1
4 1 0 0
5 1 0 1
6 1 1 0
7 1 1 1
I am trying to find which rows in A, at a particular 10 columns of A, correspond with each row of B.
Each row of A[My_Columns_List] is guaranteed to be somewhere in B, but not every row of B will match up with a row in A[My_Columns_List]
For example, I want to show that for columns [B,D,E] of A,
rows [1,3] of A match up with row [6] of B,
row [0] of A matches up with row [2] of B,
row [2] of A matches up with row [3] of B.
I have tried using:
pd.merge(B.reset_index(), A.reset_index(),
left_on = B.columns.tolist(),
right_on =A.columns[My_Columns_List].tolist(),
suffixes = ('_B','_A')))
This works, but I was hoping that this method would be faster:
S = 2**np.arange(10)
A_ID = np.dot(A[My_Columns_List],S)
B_ID = np.dot(B,S)
out_row_idx = np.where(np.in1d(A_ID,B_ID))[0]
But when I do this, out_row_idx returns an array containing all the indices of A, which doesn't tell me anything.
I think this method will be faster, but I don't know why it returns an array from 0 to 999.
Any input would be appreciated!
Also, credit goes to #jezrael and #Divakar for these methods.
I'll stick by my initial answer but maybe explain better.
You are asking to compare 2 pandas dataframes. Because of that, I'm going to build dataframes. I may use numpy, but my inputs and outputs will be dataframes.
Setup
You said we have a a 1000 x 500 array of ones and zeros. Let's build that.
A_init = pd.DataFrame(np.random.binomial(1, .5, (1000, 500)))
A_init.columns = pd.MultiIndex.from_product([range(A_init.shape[1]/10), range(10)])
A = A_init
In addition, I gave A a MultiIndex to easily group by columns of 10.
Solution
This is very similar to #Divakar's answer with one minor difference that I'll point out.
For one group of 10 ones and zeros, we can treat it as a bit array of length 8. We can then calculate what it's integer value is by taking the dot product with an array of powers of 2.
twos = 2 ** np.arange(10)
I can execute this for every group of 10 ones and zeros in one go like this
AtB = A.stack(0).dot(twos).unstack()
I stack to get a row of 50 groups of 10 into columns in order to do the dot product more elegantly. I then brought it back with the unstack.
I now have a 1000 x 50 dataframe of numbers that range from 0-1023.
Assume B is a dataframe with each row one of 1024 unique combinations of ones and zeros. B should be sorted like B = B.sort_values().reset_index(drop=True).
This is the part I think I failed at explaining last time. Look at
AtB.loc[:2, :2]
That value in the (0, 0) position, 951 means that the first group of 10 ones and zeros in the first row of A matches the row in B with the index 951. That's what you want!!! Funny thing is, I never looked at B. You know why, B is irrelevant!!! It's just a goofy way of representing the numbers from 0 to 1023. This is the difference with my answer, I'm ignoring B. Ignoring this useless step should save time.
These are all functions that take two dataframes A and B and returns a dataframe of indices where A matches B. Spoiler alert, I'll ignore B completely.
def FindAinB(A, B):
assert A.shape[1] % 10 == 0, 'Number of columns in A is not a multiple of 10'
rng = np.arange(A.shape[1])
A.columns = pd.MultiIndex.from_product([range(A.shape[1]/10), range(10)])
twos = 2 ** np.arange(10)
return A.stack(0).dot(twos).unstack()
def FindAinB2(A, B):
assert A.shape[1] % 10 == 0, 'Number of columns in A is not a multiple of 10'
rng = np.arange(A.shape[1])
A.columns = pd.MultiIndex.from_product([range(A.shape[1]/10), range(10)])
# use clever bit shifting instead of dot product with powers
# questionable improvement
return (A.stack(0) << np.arange(10)).sum(1).unstack()
I'm channelling my inner #Divakar (read, this is stuff I've learned from Divakar)
def FindAinB3(A, B):
assert A.shape[1] % 10 == 0, 'Number of columns in A is not a multiple of 10'
a = A.values.reshape(-1, 10)
a = np.einsum('ij->i', a << np.arange(10))
return pd.DataFrame(a.reshape(A.shape[0], -1), A.index)
Minimalist One Liner
f = lambda A: pd.DataFrame(np.einsum('ij->i', A.values.reshape(-1, 10) << np.arange(10)).reshape(A.shape[0], -1), A.index)
Use it like
f(A)
Timing
FindAinB3 is an order of magnitude faster
Apology if the problemis trivial but as a python newby I wasn't able to find the right solution.
I have two dataframes and I need to add a column to the first dataframe that is true if a certain value of the first dataframe is between two values of the second dataframe otherwise false.
for example:
first_df = pd.DataFrame({'code1':[1,1,2,2,3,1,1],'code2':[10,22,15,15,7,130,2]})
second_df = pd.DataFrame({'code1':[1,1,2,2,3,1,1],'code2_start':[5,20,11,11,5,110,220],'code2_end':[15,25,20,20,10,120,230]})
first_df
code1 code2
0 1 10
1 1 22
2 2 15
3 2 15
4 3 7
5 1 130
6 1 2
second_df
code1 code2_end code2_start
0 1 15 5
1 1 25 20
2 2 20 11
3 2 20 11
4 3 10 5
5 1 120 110
6 1 230 220
For each row in the first dataframe I should check if the value reported in the code2 columne is between one of the possible range identified by the row of the second dataframe second_df for example:
in row 1 of first_df code1=1 and code2=22
checking second_df I have 4 rows with code1=1, rows 0,1,5 and 6, the value code2=22 is in the interval identified by code2_start=20 and code2_end=25 so the function should return True.
Considering an example where the function should return False,
in row 5 of first_df code1=1 and code2=130
but there is no interval containing 130 where code1=1
I have tried to use this function
def check(first_df,second_df):
for i in range(len(first_df):
return ((second_df.code2_start <= first_df.code2[i]) & (second_df.code2_end <= first_df.code2[i]) & (second_df.code1 == first_df.code1[i])).any()
and to vectorize it
first_df['output'] = np.vectorize(check)(first_df, second_df)
but obviously with no success.
I would be happy for any input you could provide.
thx.
A.
As a practical example:
first_df.code1[0] = 1
therefore I need to search on second_df all the istances where
second_df.code1 == first_df.code1[0]
0 True
1 True
2 False
3 False
4 False
5 True
6 True
for the instances 0,1,5,6 where the status is True I need to check if the value
first_df.code2[0]
10
is between one of the range identified by
second_df[second_df.code1 == first_df.code1[0]][['code2_start','code2_end']]
code2_start code2_end
0 5 15
1 20 25
5 110 120
6 220 230
since the value of first_df.code2[0] is 10 it is between 5 and 15 so the range identified by row 0 therefore my function should return True. In case of first_df.code1[6] the value vould still be 1 therefore the range table would be still the same above but first_df.code2[6] is 2 in this case and there is no interval containing 2 therefore the resut should be False.
first_df['output'] = (second_df.code2_start <= first_df.code2) & (second_df.code2_end <= first_df.code2)
This works because when you do something like: second_df.code2_start <= first_df.code2
You get a boolean Series. If you then perform a logical AND on two of these boolean series, you get a Series which has value True where both Series were True and False otherwise.
Here's an example:
>>> import pandas as pd
>>> a = pd.DataFrame([{1:2,2:4,3:6},{1:3,2:6,3:9},{1:4,2:8,3:10}])
>>> a['output'] = (a[2] <= a[3]) & (a[2] >= a[1])
>>> a
1 2 3 output
0 2 4 6 True
1 3 6 9 True
2 4 8 10 True
EDIT:
So based on your updated question and my new interpretation of your problem, I would do something like this:
import pandas as pd
# Define some data to work with
df_1 = pd.DataFrame([{'c1':1,'c2':5},{'c1':1,'c2':10},{'c1':1,'c2':20},{'c1':2,'c2':8}])
df_2 = pd.DataFrame([{'c1':1,'start':3,'end':6},{'c1':1,'start':7,'end':15},{'c1':2,'start':5,'end':15}])
# Function checks if c2 value is within any range matching c1 value
def checkRange(x, code_range):
idx = code_range.c1 == x.c1
code_range = code_range.loc[idx]
check = (code_range.start <= x.c2) & (code_range.end >= x.c2)
return check.any()
# Apply the checkRange function to each row of the DataFrame
df_1['output'] = df_1.apply(lambda x: checkRange(x, df_2), axis=1)
What I do here is define a function called checkRange which takes as input x, a single row of df_1 and code_range, the entire df_2 DataFrame. It first finds the rows of code_range which have the same c1 value as the given row, x.c1. Then the non matching rows are discarded. This is done in the first 2 lines:
idx = code_range.c1 == x.c1
code_range = code_range.loc[idx]
Next, we get a boolean Series which tells us if x.c2 falls within any of the ranges given in the reduced code_range DataFrame:
check = (code_range.start <= x.c2) & (code_range.end >= x.c2)
Finally, since we only care that the x.c2 falls within one of the ranges, we return the value of check.any(). When we call any() on a boolean Series, it will return True if any of the values in the Series are True.
To call the checkRange function on each row of df_1, we can use apply(). I define a lambda expression in order to send the checkRange function the row as well as df_2. axis=1 means that the function will be called on each row (instead of each column) for the DataFrame.
Suppose I have a dataframe
ID1 ID2 x y time
0 0 1 34.337735 -76.3319716667 1446797582
1 0 1 34.3841816667 -76.2837666667 1446796183
2 0 2 34.49157 -76.1661133333 1446792969
3 0 3 34.5275266667 -76.1151866667 1446791765
4 0 3 34.5624816667 -76.0633883333 1446790559
What I would like is to capture the distance moved by each member, identified uniquely by the ID1,ID2 pair.
Is there anyway I can perform row operations on a dataframe? My initial idea was to convert the dataframe to a matrix using df.as_matrix(), pick out the unique IDs, an compute distances from the matrix.
This seems really inefficient. Is there a better way I could do this with dataframes?
if you want to calculate the distance for each time step you can do the following
df[['x' , 'y']].apply(lambda x : np.linalg.norm(x) , axis = 1 )
on the other hand if you want to calculate the distance by each member you can do the following
In [38]:
df.groupby([df.ID1 , df.ID2])[['x' , 'y']].
apply(lambda x : np.linalg.norm(x.diff().dropna()) if len(x) > 1 else 0 )
Out[38]:
ID1 ID2
0 1 0.066940
2 0.000000
3 0.062489
dtype: float64
first you will group by your ID columns and then check for the members length if the length is greater than 1 so this means the member has moved other wise the member didn't .
you can calculate the difference between x and y by using the diff function which will produce na for the first columns but you can drop it easily using dropna function .
then to calculate the vector length you can easily use the function np.linalg.norm
you can also use x.diff().iloc[1] instead of x.diff().dropna()
If you need to get the length of a total path for each unique pair you could do
pd.DataFrame(df.groupby(['ID1','ID2']).apply(lambda z:pathlength(z.x.values,z.y.values)))
Where pathlength is
from math import sqrt
def pathlength(x,y):
n = len(x)
lv = [sqrt((x[i]-x[i-1])**2 + (y[i]-y[i-1])**2) for i in range (1,n)]
L = sum(lv)
return L
That gives us
0
ID1 ID2
0 1 0.066940
2 0.000000
3 0.062489