I want to get a specific number of rows before and after a specific index. However, when the requested range runs past either end of the dataframe, nothing is returned. In that case I would like the selection to wrap around and continue from the other end, as shown below:
df = pd.DataFrame({'column': range(1, 6)})
column
0 1
1 2
2 3
3 4
4 5
index = 2
df.iloc[index]
3
# Now I want to get three values before and after that index.
# Something like this:
def get_before_after_rows(index):
    rows_before = df[(index-1): (index-1)-2]
    rows_after = df[(index+1): (index+1)-2]
    return rows_before, rows_after
rows_before, rows_after = get_before_after_rows(index)
rows_before
column
0 1
1 2
4 5
rows_after
column
0 1
3 4
4 5
You are mixing iloc and loc, which is very dangerous. It works in your example only because the index is numbered sequentially from zero, so the two indexers behave identically.
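To make the difference concrete, here is a minimal sketch. The dataframe is hypothetical: it reuses the values from the question, but its labels 10 to 14 are invented for illustration so that labels and positions no longer coincide.

```python
import pandas as pd

# Hypothetical frame: same values as in the question, but labelled 10..14
df = pd.DataFrame({'column': range(1, 6)}, index=range(10, 15))

by_position = df.iloc[2]['column']  # third row, whatever its label is
by_label = df.loc[12]['column']     # the row whose label is 12

# Label 2 does not exist here, so loc fails where iloc succeeds
try:
    df.loc[2]
    label_2_exists = True
except KeyError:
    label_2_exists = False

print(by_position, by_label, label_2_exists)
```

Both lookups reach the same row, but only because position 2 happens to carry label 12; asking loc for label 2 raises a KeyError.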
Anyhow, what you want is basically taking rows with wrap-around:
import numpy as np
import pandas as pd

def get_around(df: pd.DataFrame, index: int, n: int) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Return n rows before and n rows after the specified positional index"""
    offsets = np.arange(1, n + 1)
    # The modulo wraps positions that fall off either end of the dataframe
    before = df.iloc[(index - offsets) % len(df)].sort_index()
    after = df.iloc[(index + offsets) % len(df)].sort_index()
    return before, after
# Get 3 rows before and 3 rows after the *positional index* 2
before, after = get_around(df, 2, 3)
I want to count the number of common elements between rows (each row has 6 elements/ columns)
My dataframe (df) looks something like this:
>>> df
Customer Number Most Frequent Called 1 Most Frequent Called 2 Most Frequent Called 3 Most Frequent Called 4 Most Frequent Called 5
0 552711620 161359852 611336215 884140437 804548991 135953430
1 561712520 186359312 666336115 855140357 899548041 134953530
2 331112180 316659812 436926115 545220357 117748041 984213530
3 873212120 196357673 331112180 565777359 174348053 554212940
4 113219540 733352993 975632166 569117345 175888077 364212923
...
I have tried this code:
connection_df = pd.DataFrame()
for i in range(len(df)):
    connection_list = []
    for j in range(len(df)):
        intersection = set(df.iloc[i]).intersection(df.iloc[j])
        connection_list.append(len(intersection))
    connection_df.insert(loc=i, column = str(i), value = connection_list)
This will give me a dataframe of a form of a matrix like this:
>>> connection_df
0 1 2 3 4
0 6 0 0 0 0
1 0 6 0 0 0
2 0 0 6 1 0
3 0 0 1 6 0
4 0 0 0 0 6
This piece of code does what I want, but because it uses nested loops it is very inefficient. There could potentially be millions of rows, so I would appreciate any suggestions for optimizing it. Thanks.
An efficient solution consists in performing all the operations with NumPy (by converting the whole dataframe to a NumPy matrix), computing only the upper part of the matrix since the intersection of two sets is symmetric, and precomputing all the sets.
def fastConnectionDf(df):
    size = len(df)
    # np.int was removed in NumPy 1.24; the builtin int is equivalent here
    connection_mat = np.zeros((size, size), dtype=int)
    df_mat = df.to_numpy()
    uniqueSets = [np.unique(df_mat[i]) for i in range(size)] # Precompute all the sets
    for i in range(size):
        connection_mat[i,i] = len(uniqueSets[i])
        for j in range(i+1, size):
            intersection = np.intersect1d(uniqueSets[i], uniqueSets[j], assume_unique=True)
            connection_mat[i,j] = len(intersection)
    # Mirror the upper triangle into the lower one
    connection_mat = np.maximum(connection_mat, connection_mat.T)
    connection_df = pd.DataFrame(connection_mat)
    return connection_df
On my machine, this solution is 28 times faster on the example dataframe (and up to 50 times faster on bigger dataframes).
Note that it is possible to improve the algorithm by:
just counting the number of intersecting elements rather than creating an array with all the items
using a smarter implementation that sorts the arrays first to speed up the set intersections (see np.searchsorted)
using Numba to speed up the computation on big dataframes
The first two improvements are hard (impossible?) to perform efficiently with NumPy alone, but they are possible with Numba, although this is a bit complex to do.
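As one way to count shared elements without building each intersection array, here is a minimal sketch of a different technique from the one above: an indicator-matrix product. It is not the answer's method, and the name connection_counts is hypothetical. It assumes the dense indicator matrix (rows by distinct values) fits in memory, which will not hold for millions of rows.

```python
import numpy as np
import pandas as pd

def connection_counts(df: pd.DataFrame) -> pd.DataFrame:
    """Count the distinct values shared by every pair of rows."""
    values = df.to_numpy()
    n_rows = values.shape[0]
    # Map every distinct value to a column id 0..n_unique-1
    uniques, codes = np.unique(values, return_inverse=True)
    codes = codes.reshape(values.shape)
    # indicator[i, c] is 1 when row i contains the value with id c
    indicator = np.zeros((n_rows, len(uniques)), dtype=np.int64)
    indicator[np.arange(n_rows)[:, None], codes] = 1
    # The product counts, for each pair of rows, the ids present in both
    counts = indicator @ indicator.T
    return pd.DataFrame(counts, index=df.index, columns=df.index)

# Tiny made-up example: rows 0 and 1 share the value 3
small = pd.DataFrame([[1, 2, 3], [3, 4, 5], [6, 7, 7]])
result = connection_counts(small)
print(result)
```

As with the set-based code in the question, the diagonal holds the number of distinct values in each row, so a row with a repeated value scores less than its column count.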
I have a task that is driving me completely mad. Let's suppose we have this df:
import pandas as pd
k = {'random_col':{0:'a',1:'b',2:'c'},'isin':{0:'ES0140074008', 1:'ES0140074008ES0140074010', 2:'ES0140074008ES0140074016ES0140074024'},'n_isins':{0:1,1:2,2:3}}
k = pd.DataFrame(k)
What I want to do is repeat each row a number of times governed by the column n_isins, which is obtained by dividing the length of the column isin by 12, since ISINs are always strings of 12 characters.
So I need row 0 once, row 1 twice, and row 2 three times. In my real data the count is capped at 6, so it is a hard task. I began by using booleans and slicing the column isin, but that got me nowhere. Hopefully my explanation is good enough. I also need the column isin sliced like this: [0:11] + ' ' + [12:23]..., splitting at the 'E', but I think I know how to do that; I only mention it because it is the criterion that determines how many times each row has to be copied. Thanks in advance!
I think you need numpy.repeat with loc, then remove the duplicated index values with reset_index. Finally, for the new column, use a custom splitting function with numpy.concatenate:
import numpy as np
n = np.repeat(k.index, k['n_isins'])
# Keep the original k untouched; it is still needed below to build the chunks
df = k.loc[n].reset_index(drop=True)
print (df)
isin n_isins random_col
0 ES0140074008 1 a
1 ES0140074008ES0140074010 2 b
2 ES0140074008ES0140074010 2 b
3 ES0140074008ES0140074016ES0140074024 3 c
4 ES0140074008ES0140074016ES0140074024 3 c
5 ES0140074008ES0140074016ES0140074024 3 c
#https://stackoverflow.com/a/7111143/2901002
def chunks(s, n):
    """Produce `n`-character chunks from `s`."""
    for start in range(0, len(s), n):
        yield s[start:start+n]
s = np.concatenate(k['isin'].apply(lambda x: list(chunks(x, 12))))
df['new'] = pd.Series(s, index = df.index)
print (df)
isin n_isins random_col new
0 ES0140074008 1 a ES0140074008
1 ES0140074008ES0140074010 2 b ES0140074008
2 ES0140074008ES0140074010 2 b ES0140074010
3 ES0140074008ES0140074016ES0140074024 3 c ES0140074008
4 ES0140074008ES0140074016ES0140074024 3 c ES0140074016
5 ES0140074008ES0140074016ES0140074024 3 c ES0140074024
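For comparison, here is a minimal sketch of another route to the same result using DataFrame.explode (available since pandas 0.25). This is a swapped-in technique, not the answer's method; the variable names pieces and out are made up for illustration.

```python
import pandas as pd

k = pd.DataFrame({
    'random_col': ['a', 'b', 'c'],
    'isin': ['ES0140074008',
             'ES0140074008ES0140074010',
             'ES0140074008ES0140074016ES0140074024'],
    'n_isins': [1, 2, 3],
})

# Split every isin string into a list of 12-character pieces
pieces = k['isin'].apply(lambda s: [s[i:i + 12] for i in range(0, len(s), 12)])

# explode repeats each row once per list element, so n_isins is not needed
out = k.assign(new=pieces).explode('new').reset_index(drop=True)
print(out)
```

Because the number of repeats comes from the list length, the repetition and the splitting cannot drift out of sync.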
So I have two pandas dataframes, A and B.
A is 1000 rows x 500 columns, filled with binary values indicating either presence or absence.
B is 1024 rows x 10 columns and enumerates every combination of 0's and 1's, hence the 1024 rows.
I am trying to find which rows in A, at a particular 10 columns of A, correspond with a given row in B. I need the whole row to match up, rather than element by element.
For example, I would want
A[(A.iloc[:, 1:11] == (1,0,1,0,1,0,0,1,0,0)).all(axis=1)]
to tell me that rows (3,5,8,11,15) in A match the (1,0,1,0,1,0,0,1,0,0) row of B at those particular columns (1,2,3,4,5,6,7,8,9,10).
And I want to do this over every row in B.
The best way I could figure out to do this was:
import numpy as np
# Iterating over B directly yields column labels, so use iterrows for the rows
for _, row in B.iterrows():
    B_array = row.to_numpy()
    Matching_Rows = A[(A.iloc[:, 1:11] == B_array).all(axis=1)]
    Matching_Rows_Index = Matching_Rows.index
This isn't terrible for one instance, but I use it in a while loop that runs around 20,000 times; therefore, it slows it down quite a bit.
I have been messing around with DataFrame.apply to no avail. Could map work better?
I was just hoping someone would see something obviously more efficient, as I am fairly new to Python.
Thanks and best regards!
We can exploit the fact that both dataframes hold only the binary values 0 and 1: treat each row as a binary number and convert it to its decimal equivalent, which collapses the relevant columns of A and all columns of B into one 1D array each. This reduces the problem size considerably, which should help performance. With those 1D arrays in hand, we can use np.in1d (np.isin in current NumPy) to look for matches from B in A and then np.where to get the matching indices.
Thus, we would have an implementation like so -
# Setup 1D arrays corresponding to selected cols from A and entire B
S = 2**np.arange(10)
A_ID = np.dot(A.iloc[:, 1:11],S)
B_ID = np.dot(B,S)
# Look for matches that exist from B_ID in A_ID, whose indices
# would be desired row indices that have matched from B
# (np.isin is the current name for np.in1d, which is deprecated since NumPy 2.0)
out_row_idx = np.where(np.isin(A_ID,B_ID))[0]
Sample run -
In [157]: # Setup dataframes A and B with rows 0, 4 in A having matches from B
...: A_arr = np.random.randint(0,2,(10,14))
...: B_arr = np.random.randint(0,2,(7,10))
...:
...: B_arr[2] = A_arr[4,1:11]
...: B_arr[4] = A_arr[4,1:11]
...: B_arr[5] = A_arr[0,1:11]
...:
...: A = pd.DataFrame(A_arr)
...: B = pd.DataFrame(B_arr)
...:
In [158]: S = 2**np.arange(10)
...: A_ID = np.dot(A[range(1,11)],S)
...: B_ID = np.dot(B,S)
...: out_row_idx = np.where(np.in1d(A_ID,B_ID))[0]
...:
In [159]: out_row_idx
Out[159]: array([0, 4])
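The question also asks which rows of A match each individual row of B. A minimal sketch of that, building on the same binary-to-integer encoding, could group A's row labels by their encoded value. The random data, the planted match, and the names rows_by_id and matches are all assumptions made for illustration.

```python
from collections import defaultdict

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
A = pd.DataFrame(rng.integers(0, 2, size=(20, 14)))
B = pd.DataFrame(rng.integers(0, 2, size=(5, 10)))
B.iloc[0] = A.iloc[3, 1:11].to_numpy()  # plant one guaranteed match

# Encode each 10-bit row as a single integer
weights = 2 ** np.arange(10)
A_id = A.iloc[:, 1:11].to_numpy() @ weights
B_id = B.to_numpy() @ weights

# Group the row labels of A by their encoded value
rows_by_id = defaultdict(list)
for label, row_id in zip(A.index, A_id):
    rows_by_id[int(row_id)].append(label)

# For every row of B, look up the rows of A with the same encoding
matches = {b_label: rows_by_id.get(int(b_id), [])
           for b_label, b_id in zip(B.index, B_id)}
print(matches)
```

Each lookup is a dictionary access, so the cost per row of B no longer depends on the number of rows in A once the grouping has been built.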
You can use merge with reset_index. The output contains the indexes of the rows of B that equal rows of A in the chosen columns:
A = pd.DataFrame({'A':[1,0,1,1],
'B':[0,0,1,1],
'C':[1,0,1,1],
'D':[1,1,1,0],
'E':[1,1,0,1]})
print (A)
A B C D E
0 1 0 1 1 1
1 0 0 0 1 1
2 1 1 1 1 0
3 1 1 1 0 1
B = pd.DataFrame({'0':[1,0,1],
'1':[1,0,1],
'2':[1,0,0]})
print (B)
0 1 2
0 1 1 1
1 0 0 0
2 1 1 0
print (pd.merge(B.reset_index(),
A.reset_index(),
left_on=B.columns.tolist(),
right_on=A.columns[[0,1,2]].tolist(),
suffixes=('_B','_A')))
index_B 0 1 2 index_A A B C D E
0 0 1 1 1 2 1 1 1 1 0
1 0 1 1 1 3 1 1 1 0 1
2 1 0 0 0 1 0 0 0 1 1
print (pd.merge(B.reset_index(),
A.reset_index(),
left_on=B.columns.tolist(),
right_on=A.columns[[0,1,2]].tolist(),
suffixes=('_B','_A'))[['index_B','index_A']])
index_B index_A
0 0 2
1 0 3
2 1 1
You can do it in pandas by using loc and telling it to find the rows where the ten columns are all equal. Here B must be a single row (a Series of scalars keyed by the same column labels), not the whole dataframe. Like this:
A.loc[(A[1]==B[1]) & (A[2]==B[2]) & (A[3]==B[3]) & (A[4]==B[4]) & (A[5]==B[5]) & (A[6]==B[6]) & (A[7]==B[7]) & (A[8]==B[8]) & (A[9]==B[9]) & (A[10]==B[10])]
This is quite ugly in my opinion but it will work and gets rid of the loop so it should be significantly faster. I wouldn't be surprised if someone could come up with a more elegant way of coding the same operation.
In this special case, your rows of 10 zeros and ones can be interpreted as 10-digit binary numbers. If B is in order, then it can be interpreted as a range from 0 to 1023. In this case, all we need to do is take A's rows in 10-column chunks and calculate the binary equivalent of each chunk.
I'll start by defining a range of powers of two so I can do matrix multiplication with it.
twos = pd.Series(np.power(2, np.arange(10)))
Next, I'll relabel A's columns into a MultiIndex and stack to get my chunks of 10.
A = pd.DataFrame(np.random.binomial(1, .5, (1000, 500)))
# Integer division (//) is needed in Python 3; / would produce float labels
A.columns = pd.MultiIndex.from_arrays([A.columns // 10, A.columns % 10])
A_ = A.stack(0)
A_.head()
Finally, I'll multiply A_ with twos to get integer representation of each row and unstack.
A_.dot(twos).unstack()
This is now a 1000 x 50 DataFrame where each cell represents which of B's rows we matched for that particular 10 column chunk for that particular row of A. There isn't even a need for B.
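The claim that B is not needed rests on B being "in order". A minimal sketch of what that could mean, assuming column 0 holds the least significant bit (an assumption that matches the powers in twos but is not stated in the question):

```python
import numpy as np
import pandas as pd

n_bits = 10

# Hypothetical B: row r holds the bits of r, least significant bit in column 0
B = pd.DataFrame((np.arange(2 ** n_bits)[:, None] >> np.arange(n_bits)) & 1)

twos = pd.Series(np.power(2, np.arange(n_bits)))

# Encoding each row of B gives back its own row number
row_numbers = B.dot(twos)

# The pattern from the question: bits 0, 2, 4 and 7 are set
example = np.array([1, 0, 1, 0, 1, 0, 0, 1, 0, 0]) @ twos.to_numpy()
print(example)  # 1 + 4 + 16 + 128 = 149
```

Under that ordering, an encoded chunk of A is directly the row number of the matching row in B. If B is ordered differently, the encoded values of B's rows would be needed as a lookup.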
My basic task is to take a vector x=[x1,x2,x3,x4] (which in my case is a row of a pandas dataframe, let's say the row with index = 1), multiply it by a scalar k, and sum up the results -> x1*k + x2*k + x3*k + x4*k.
I did not find a function that would do it in one step (is there such a function/operation?), so I do it in two steps. First I multiply my vector x by the scalar k, and then I sum up the results:
x_by_k = my_df.loc[[1]]*k
sum = x_by_k.sum(axis=1)
One of the problems I have here is that the resulting sum is a Series, although effectively it is a single number.
Is there a way to perform this sum operation with a number as the output?
Can I do the above in one step?
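IIUC, select the row in df with loc (ix has been removed from pandas), then sum and multiply by k:
df = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9]})
print (df)
A B C
0 1 4 7
1 2 5 8
2 3 6 9
k = 2
# Named total rather than sum to avoid shadowing the builtin
total = df.loc[1].sum() * k
print (total)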
30
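The Series result in the question comes from selecting with a list. A minimal sketch of the difference, reusing the dataframe above (the names as_series, as_scalar and from_series are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [4, 5, 6],
                   'C': [7, 8, 9]})
k = 2

# loc[[1]] keeps a one-row DataFrame, so sum(axis=1) returns a Series
as_series = (df.loc[[1]] * k).sum(axis=1)

# loc[1] returns the row as a Series, so sum() returns a scalar
as_scalar = (df.loc[1] * k).sum()

# A one-element Series can still be unwrapped if needed
from_series = as_series.iloc[0]

print(type(as_series).__name__, as_scalar, from_series)
```

Multiplying before or after summing gives the same number, since k is a common factor of every term.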