How to iterate over rows and assign values to a new column - python

I have a dataframe with over 75k rows and about 13 pre-existing columns. I want to create a new column based on an if statement, such that:
if a row has the same value as the next row in a certain column, the new column gets 0 for that row, otherwise 1.
The if statement checks two equalities (the columns are tags_list and gateway_id).
The code snippet below is what I have tried:
for i in range(1, len(df_sort['date']) - 1):
    if (df_sort.iloc[i]['tags_list'] == df_sort.iloc[i+1]['tags_list']) & (df_sort.iloc[i]['gateway_id'] == df_sort[i+1]['gateway_id']):
        df_sort.iloc[i]['Transit'] = 0
    else:
        df_sort.iloc[i]['Transit'] = 1
Getting a KeyError: 2 in this case.
PS: All of the columns have the same number of rows

df_sort[i+1]['gateway_id'] should be df_sort.iloc[i+1]['gateway_id'], i.e.:
if (df_sort.iloc[i]['tags_list'] == df_sort.iloc[i+1]['tags_list']) & \
        (df_sort.iloc[i]['gateway_id'] == df_sort.iloc[i+1]['gateway_id']):
Also, are you sure you want to iterate from 1 and not from 0?
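A sketch of a corrected loop with that fix applied (my reading of the intent): the 'Transit' column is created up front, and the assignment goes through a single .iloc call, because chained assignments like df_sort.iloc[i]['Transit'] = 0 typically write to a temporary copy and are a likely second bug here.
df_sort['Transit'] = 1  # default: row differs from the next one
transit_col = df_sort.columns.get_loc('Transit')
for i in range(len(df_sort) - 1):
    same = (df_sort.iloc[i]['tags_list'] == df_sort.iloc[i + 1]['tags_list']) and \
           (df_sort.iloc[i]['gateway_id'] == df_sort.iloc[i + 1]['gateway_id'])
    if same:
        df_sort.iloc[i, transit_col] = 0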

There is numpy machinery for this, namely numpy.diff. Consider a DataFrame that already has some generic column 'x' populated.
In [48]: df['x'].values
Out[48]: array([0, 0, 0, 0, 1, 1, 1, 2, 2, 3])
In [49]: df['x_diff'] = (np.diff(df['x'], prepend=0) != 0) * 1
In [50]: df['x_diff'].values
Out[50]: array([0, 0, 0, 0, 1, 0, 0, 1, 0, 1])
If you need the zeros and ones flipped, just change != to ==.
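For the two-column condition in the question itself, a vectorized pandas sketch using shift (assuming the tags_list and gateway_id values compare element-wise with ==) avoids the Python loop entirely:
same_as_next = ((df_sort['tags_list'] == df_sort['tags_list'].shift(-1)) &
                (df_sort['gateway_id'] == df_sort['gateway_id'].shift(-1)))
df_sort['Transit'] = (~same_as_next).astype(int)  # 0 where the next row matches, else 1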

Related

Filtering IDs which have the same elements in a list - pandas

I have grouped columns and got the following sample dataframe. Now I would like to filter the IDs which have identical elements in their lists.
df
ID Value
1 [0,0,50,0,0]
2 [0,0,0,0,0,0,0]
3 [0,100,0,0,50]
4 [0,0,0,0,0]
I would like to filter the IDs whose lists in the Value column contain all-identical elements, and those elements must be only 0s.
The expected output is just the IDs, and it should be 2 and 4.
Can anyone help with this?
You can map each list to a set and compare it with set([0]):
df1 = df[df['Value'].map(set).eq(set([0]))]
print (df1)
ID Value
1 2 [0, 0, 0, 0, 0, 0, 0]
3 4 [0, 0, 0, 0, 0]
If you only need to filter for lists whose values are all the same (not necessarily zero), compare the number of distinct elements via Series.str.len:
df2 = df[df['Value'].map(set).str.len().eq(1)]
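A minimal end-to-end sketch of the set-based filter, reconstructing the sample frame from the question:
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4],
                   'Value': [[0, 0, 50, 0, 0],
                             [0, 0, 0, 0, 0, 0, 0],
                             [0, 100, 0, 0, 50],
                             [0, 0, 0, 0, 0]]})
print(df[df['Value'].map(set).eq(set([0]))]['ID'].tolist())  # [2, 4]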

Calculate sum of Nth column of numpy array grouped by the indices in the first two columns

I would like to loop over the following check_matrix in such a way that the code recognizes whether the first and second elements are 1 and 1, 1 and 2, etc. Then, for each separate class of pair, i.e. (1,1), (1,2) or (2,2), the code should store in new matrices the sum of the last element (which in this case has index 8) times exp(-i * q · (check_matrix[k][2:5] - check_matrix[k][5:8])), where i is the imaginary unit, k is the running index over check_matrix, and q is a vector defined below. So there are 20 q vectors.
import numpy as np
q = []
for i in np.linspace(0, 10, 20):
    q.append(np.array((0, 0, i)))
q = np.array(q)
check_matrix = np.array([[1, 1, 0, 0, 0, 0, 0, -0.7977, -0.243293],
[1, 1, 0, 0, 0, 0, 0, 1.5954, 0.004567],
[1, 2, 0, 0, 0, -1, 0, 0, 1.126557],
[2, 1, 0, 0, 0, 0.5, 0.86603, 1.5954, 0.038934],
[2, 1, 0, 0, 0, 2, 0, -0.7977, -0.015192],
[2, 2, 0, 0, 0, -0.5, 0.86603, 1.5954, 0.21394]])
This means in principle I will have 20 matrices of shape 2x2, one for each q vector.
At the moment my code gives only one matrix, which appears to be the last one, even though I am appending to Matrices. My code looks like this:
for i in range(2):
    i = i + 1
    for j in range(2):
        j = j + 1
        j_list = []
        Matrices = []
        for k in range(len(check_matrix)):
            if check_matrix[k][0] == i and check_matrix[k][1] == j:
                j_list.append(check_matrix[k][8] * np.exp(-1J * np.dot(q, (np.subtract(check_matrix[k][2:5], check_matrix[k][5:8])))))
        j_11 = np.sum(j_list)
        I_matrix[i-1][j-1] = j_11
        Matrices.append(I_matrix)
I_matrix is defined as below:
I_matrix= np.zeros((2,2),dtype=np.complex_)
At the moment I get the following output:
Matrices = [array([[-0.66071446-0.77603624j, -0.29038112+2.34855023j], [-0.31387562-0.08116629j, 4.2788 +0.j ]])]
But I want a matrix for each q value, meaning that in total there should be 20 matrices in this case, where each element of a 2x2 matrix contains the sum for the corresponding pair, i.e. (1,1), (1,2), (2,1) and (2,2), arranged like this:
array([[11., 12.],
[21., 22.]])
I would highly appreciate your suggestions for correcting it. Thanks in advance!
I am pretty sure you can solve this problem in an easier way, and I am not 100% sure that I understood you correctly, but here is some code that does what I think you want. If you have a way to check whether the results are valid, I suggest you do so.
import numpy as np

n = 20
q = np.zeros((20, 3))
q[:, -1] = np.linspace(0, 10, n)
check_matrix = np.array([[1, 1, 0, 0, 0, 0, 0, -0.7977, -0.243293],
                         [1, 1, 0, 0, 0, 0, 0, 1.5954, 0.004567],
                         [1, 2, 0, 0, 0, -1, 0, 0, 1.126557],
                         [2, 1, 0, 0, 0, 0.5, 0.86603, 1.5954, 0.038934],
                         [2, 1, 0, 0, 0, 2, 0, -0.7977, -0.015192],
                         [2, 2, 0, 0, 0, -0.5, 0.86603, 1.5954, 0.21394]])
check_matrix[:, :2] -= 1  # python indexing is zero based

matrices = np.zeros((n, 2, 2), dtype=np.complex_)
for i in range(2):
    for j in range(2):
        k_list = []
        for k in range(len(check_matrix)):
            if check_matrix[k][0] == i and check_matrix[k][1] == j:
                k_list.append(check_matrix[k][8] *
                              np.exp(-1J * np.dot(q, check_matrix[k][2:5]
                                                  - check_matrix[k][5:8])))
        matrices[:, i, j] = np.sum(k_list, axis=0)
NOTE: I changed your indices to have consistent zero-based indexing.
Here is another approach where I replaced the k-loop with a vectored version:
for i in range(2):
    for j in range(2):
        k = np.logical_and(check_matrix[:, 0] == i, check_matrix[:, 1] == j)
        temp = np.dot(check_matrix[k, 2:5] - check_matrix[k, 5:8],
                      q[:, :, np.newaxis])[..., 0]
        temp = check_matrix[k, 8:] * np.exp(-1J * temp)
        matrices[:, i, j] = np.sum(temp, axis=0)
3 line solution
You asked for an efficient solution in your original title, so how about this three-liner that avoids nested loops and if statements, and is thus hopefully faster?
fac = 2 * (check_matrix[:, 0] - 1) + (check_matrix[:, 1] - 1)
grp = np.split(check_matrix[:, 8], np.cumsum(np.unique(fac, return_counts=True)[1])[:-1])
[np.sum(x) for x in grp]
output:
[-0.23872600000000002, 1.126557, 0.023742000000000003, 0.21394]
How does it work?
I combine the first two columns into a single index, treating each as "bits" (i.e. base 2):
fac = 2 * (check_matrix[:, 0] - 1) + (check_matrix[:, 1] - 1)
(If you have indices that exceed 2, you can still use this technique, but you will need a different base to combine the columns; i.e. if your indices go from 1 to 18, you would need to multiply column 0 by a number equal to or larger than 18 instead of 2.)
So the result of the first line is
array([0., 0., 1., 2., 2., 3.])
Note as well that this assumes the data is ordered, with one column changing fastest; if that is not the case you will need an extra step to sort both the index and the original check_matrix. In your example the data is ordered.
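If your data were not ordered, a stable argsort on fac is one way to add that step (a sketch, reusing fac from above):
order = np.argsort(fac, kind='stable')  # stable sort keeps the original order within each group
fac = fac[order]
check_matrix = check_matrix[order]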
The next step groups the data according to the index, and uses the solution posted here.
np.split(check_matrix[:, 8], np.cumsum(np.unique(fac, return_counts=True)[1])[:-1])
[array([-0.243293, 0.004567]), array([1.126557]), array([ 0.038934, -0.015192]), array([0.21394])]
i.e. it outputs column 8 of check_matrix grouped according to fac.
Then the last line simply sums those groups. Knowing how the first two columns were combined into the single index lets you map the result back; or you could simply append the group sums to check_matrix as an extra column if you wanted.
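For instance, a sketch of that mapping back, assuming every (i, j) pair actually occurs in fac so the group sums fill a full 2x2 grid:
sums = np.array([np.sum(x) for x in grp])
I_matrix = sums.reshape(2, 2)  # rows: first index (1, 2); columns: second index (1, 2)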

selecting from a pandas.dataframe based on a column of arrays

I have a data frame with a column containing arrays (all 1x9 arrays). For all rows in that column, I wish to find the ones where the third element is 1 and pick out the values from another column in the corresponding rows.
For example, I wish to pick out the 'cal_nCa' value (116) where the second element of info_trig is 0:
info_trig cal_nCa
0 [0, 1, 0, 0, 0, 0, 0, 0, 0] 128
1 [0, 1, 0, 0, 0, 0, 0, 0, 0] 79
2 [0, 0, 0, 1, 0, 0, 0, 1, 0] 116
3 [0, 1, 0, 0, 0, 0, 0, 0, 0] 82
I tried something along the lines of df["A"][(df["B"] > 50)], based on "Selecting with complex criteria from pandas.DataFrame".
When selecting the desired rows with
data["info_trig"][:][3]
I only succeed in selecting a specific row and the third element of that row, but I am unable to select the third element of every row. A loop could work, but I hope there is a cleaner way.
Use the .str accessor to get the value at position 3 of each row's array:
data["info_trig"].str[3]
data.apply(lambda x: x['cal_nCa'] if x['info_trig'][1] == 0 else 0, axis = 1)
This returns a Series that keeps the cal_nCa value only where the second element of info_trig is 0:
0 0
1 0
2 116
3 0
dtype: int64
Or you can select only the rows you want like this:
data[data.apply(lambda x: True if x['info_trig'][1] == 0 else False, axis = 1)]
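Both ideas also combine into a single boolean-mask sketch (assuming info_trig holds list-like values, as above):
mask = data['info_trig'].str[1] == 0  # second element of each array
print(data.loc[mask, 'cal_nCa'])      # 2    116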
Hope this helps.

Efficient way: find row after which nearly no zeros appear in a column

I have a problem that has to be solved as efficiently as possible. My current approach kind of works, but is extremely slow.
I have a dataframe with multiple columns; in this case I only care about one of them. It contains positive continuous numbers and some zeros.
My goal is to find the row after which nearly no zeros appear in the following rows.
To make clear what I mean I wrote this example to replicate my problem:
import pandas as pd

df = pd.DataFrame([0,0,0,0,1,0,1,0,0,2,0,0,0,1,1,0,1,2,3,4,0,4,0,5,1,0,1,2,3,4,
                   0,0,1,2,1,1,1,1,2,2,1,3,6,1,1,5,1,2,3,4,4,4,3,5,1,2,1,2,3,4],
                  index=pd.date_range('2018-01-01', periods=60, freq='15T'))
There are some zeros at the beginning, but they get less after some time.
Here comes my unoptimized code to visualize the number of zeros:
zerosum = 0  # counter for all zeros that have appeared so far
for i in range(len(df)):
    if df[0][i] == 0.0:
        df.loc[df.index[i], 'zerosum'] = zerosum
        zerosum += 1
    else:
        df.loc[df.index[i], 'zerosum'] = zerosum
df['zerosum'].plot()
With that unoptimized code I can see the distribution of zeros over time.
My expected output in this example would be the date 01-Jan-2018 08:00, because no zeros appear after that date.
The problem I have with my real data is that isolated zeros can still appear later on. Therefore I can't just pick the last row that contains a zero; I have to somehow inspect the distribution of zeros and ignore late outliers.
Note: The visualization is not necessary to solve my problem; I just included it to explain the problem as well as possible. Thanks
OK, second go:
import pandas as pd
import numpy as np
import math
df = pd.DataFrame([0,0,0,0,1,0,1,0,0,2,0,0,0,1,1,0,1,2,3,4,0,4,0,5,1,0,1,2,3,4,
                   0,0,1,2,1,1,1,1,2,2,1,3,6,1,1,5,1,2,3,4,4,4,3,5,1,2,1,2,3,4],
                  index=pd.date_range('2018-01-01', periods=60, freq='15T'),
                  columns=['values'])
We create a column that contains the rank of each zero, and 0 where the value is non-zero:
df['zero_idx'] = np.where(df['values']==0,np.cumsum(np.where(df['values']==0,1,0)), 0)
We can use this column to get the location of a zero of any rank. I don't know what your criterion is for calling a zero an outlier, but let's say we want to make sure that we are past at least 90% of all zeros:
# Total number of zeros
n_zeros = max(df['zero_idx'])
# Get past at least this percentage
tolerance = 0.9
# The rank of the abovementioned zero
rank_tolerance = math.ceil(tolerance * n_zeros)
df[df['zero_idx']==rank_tolerance].index
Out[44]: DatetimeIndex(['2018-01-01 07:30:00'], dtype='datetime64[ns]', freq='15T')
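If you then want the first row after that zero rather than the zero itself, a small follow-on sketch (assuming that zero is not the last row):
pos = df.index.get_loc(df[df['zero_idx'] == rank_tolerance].index[0])
print(df.index[pos + 1])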
Okay, if you need to get the index after the last zero occurred, you can try this:
last = 0
for i in range(len(df)):
    if df[0][i] == 0:
        last = i
print(df.iloc[last + 1])
or by filtering:
new = df.loc[df[0]==0]
last = df.index.get_loc(new.index[-1])
print(df.iloc[last+1])
Here is my solution using a filter and cumsum:
df = pd.DataFrame([0, 0, 0, 0, 1, 0, 1, 0, 0, 2, 0, 0, 0, 1, 1, 0, 1, 2, 3, 4, 0, 4, 0, 5, 1, 0, 1, 2, 3, 4,
0, 0, 1, 2, 1, 1, 1, 1, 2, 2, 1, 3, 6, 1, 1, 5, 1, 2, 3, 4, 4, 4, 3, 5, 1, 2, 1, 2, 3, 4],
index=pd.date_range('2018-01-01', periods=60, freq='15T'))
a = df[0] == 0
df['zerosum'] = a.cumsum()
maxval = max(df['zerosum'])
firstdate = df[df['zerosum'] == maxval].index[1]
print(firstdate)
output:
2018-01-01 08:00:00
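For completeness, the last-zero lookup can also be done without any Python loop; a sketch assuming at least one zero and at least one row after it:
import numpy as np

zero_positions = np.flatnonzero(df[0].values == 0)  # integer positions of all zeros
print(df.index[zero_positions[-1] + 1])             # 2018-01-01 08:00:00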

Iterate over two nested 2D lists where list2 has list1's row numbers

I'm new to Python, so I want to get this done with loops, without using fancy stuff like generators. I have two 2D arrays, one an integer array and the other a string array, like this:
Integer 2D list:
Here, dataset2d[0][0] is the number of rows in the table and dataset2d[0][1] is the number of columns. So the 2D list below has 6 rows and 4 columns:
dataset2d = [
[6, 4],
[0, 0, 0, 1],
[1, 0, 2, 0],
[2, 2, 0, 1],
[1, 1, 1, 0],
[0, 0, 1, 1],
[1, 0, 2, 1]
]
String 2D list:
partition2d = [
['A', '1', '2', '4'],
['B', '3', '5'],
['C', '6']
]
partition2d[*][0], i.e. the first column, is a label. For group A, 1, 2 and 4 are the row numbers that I need to pick up from dataset2d and apply a formula to. So I will read 1, go to row 1 of dataset2d and read the first column value, i.e. dataset2d[1][0]; then I will read 2 from partition2d, go to row 2 of dataset2d and read the first column, i.e. dataset2d[2][0]. Similarly, next I'll read dataset2d[4][0].
Then I will do some calculations, get a value, store it in a 2D list, and move to the next column of dataset2d for those rows. So in this example the next column values read would be dataset2d[1][1], dataset2d[2][1], dataset2d[4][1]. Again I do some calculation, get one value for that column, and store it. I do this until I reach the last column of dataset2d.
The next row in partition2d is [B, 3, 5]. So I'll start with dataset2d[3][0], dataset2d[5][0], get a value for that column by the formula, then read dataset2d[3][1], dataset2d[5][1], etc. until I reach the last column. I do this until all rows in partition2d are read.
What I tried:
for partitionRow in partition2d:
    for partitionCol in partitionRow:
        for colDataset in dataset2d:
            print dataset2d[partitionCol][colDataset]
What problem I'm facing:
partition2d is a string array where I need to skip the first column, which has characters like A, B, C.
I want to iterate over dataset2d column-wise, only over the row numbers given in partition2d. So colDataset should increment only after I'm done with that column.
Update 1: I'm reading the contents from a text file, and the data in the 2D lists can vary depending on file content and size, but the structure of file1 (i.e. dataset2d) and file2 (i.e. partition2d) will be the same.
Update 2: Since Eric asked how the output should look:
0.842322 0.94322 0.34232 0.900009 (For A)
0.642322 0.44322 0.24232 0.800009 (For B)
This is just an example and the numbers are randomly typed by me.
So the first number, 0.842322, is the result of applying the formula to column 0 of dataset2d, i.e. dataset2d[partitionCol][0], for group A, having considered rows 1, 2, 4.
The second number, 0.94322, is the result of applying the formula to column 1 of dataset2d, i.e. dataset2d[partitionCol][1], for group A, having considered rows 1, 2, 4.
The third number, 0.34232, is the result of applying the formula to column 2 of dataset2d, i.e. dataset2d[partitionCol][2], for group A, having considered rows 1, 2, 4. Similarly we get 0.900009.
The first number in the second row, i.e. 0.642322, is the result of applying the formula to column 0 of dataset2d, i.e. dataset2d[partitionCol][0], for group B, having considered rows 3, 5. And so on.
You can use Numpy (I hope this is not too fancy for you):
import numpy

dataset2D = [[6, 4], [0, 0, 0, 1], [1, 0, 2, 0], [2, 2, 0, 1], [1, 1, 1, 0], [0, 0, 1, 1], [1, 0, 2, 1]]
dataset2D_size = dataset2D[0]
dataset2D = numpy.array(dataset2D)
partition2D = [['A', '1', '2', '4'], ['B', '3', '5'], ['C', '6']]

for partition in partition2D:
    label = partition[0]
    row_indices = [int(i) for i in partition[1:]]
    # Take the specified rows
    rows = dataset2D[row_indices]
    # Iterate the columns (this is the power of Python!)
    for column in zip(*rows):
        # Now, column will contain one column of data from the specified row indices
        print column,  # Apply your formula here
    print
or if you don't want to install Numpy, here is what you can do (this is what you want, actually):
dataset2D = [[6, 4], [0, 0, 0, 1], [1, 0, 2, 0], [2, 2, 0, 1], [1, 1, 1, 0], [0, 0, 1, 1], [1, 0, 2, 1]]
partition2D = [['A', '1', '2', '4'], ['B', '3', '5'], ['C', '6']]
dataset2D_size = dataset2D[0]

for partition in partition2D:
    label = partition[0]
    row_indices = [int(i) for i in partition[1:]]
    rows = [dataset2D[row_idx] for row_idx in row_indices]
    for column in zip(*rows):
        print column,
    print
both will print:
(0, 1, 1) (0, 0, 1) (0, 2, 1) (1, 0, 0)
(2, 0) (2, 0) (0, 1) (1, 1)
(1,) (0,) (2,) (1,)
Explanation of second code (without Numpy):
[dataset2D[row_idx] for row_idx in row_indices]
Basically, you take each row (dataset2D[row_idx]) and collect them together in a list, so the result of this expression is a list of lists (built from the specified row indices).
for column in zip(*rows):
Then zip(*rows) will iterate column-wise (the direction you want). This works by taking the first element of each row and combining them into a tuple, then the second elements, and so on. In each iteration the result is stored in the variable column.
So inside the for column in zip(*rows): loop you already have your intended column-wise elements from the specified rows!
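For instance, a tiny illustration of the transposition zip performs (a hypothetical two-row input; in Python 3 you would wrap the call in list()):
rows = [[0, 0, 0, 1], [1, 0, 2, 0]]
print zip(*rows)  # [(0, 1), (0, 0), (0, 2), (1, 0)]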
To apply your formula, just replace the print column, with whatever you want to do. For example, I modified the code to include the row and column numbers:
print 'Processing partition %s' % label
for (col_num, column) in enumerate(zip(*rows)):
    print 'Column number: %d' % col_num
    for (row_num, element) in enumerate(column):
        print '[%d,%d]: %d' % (row_indices[row_num], col_num, element)
which will result in:
Processing partition A
Column number: 0
[1,0]: 0
[2,0]: 1
[4,0]: 1
Column number: 1
[1,1]: 0
[2,1]: 0
[4,1]: 1
Column number: 2
[1,2]: 0
[2,2]: 2
[4,2]: 1
Column number: 3
[1,3]: 1
[2,3]: 0
[4,3]: 0
Processing partition B
Column number: 0
[3,0]: 2
[5,0]: 0
Column number: 1
[3,1]: 2
[5,1]: 0
Column number: 2
[3,2]: 0
[5,2]: 1
Column number: 3
[3,3]: 1
[5,3]: 1
Processing partition C
Column number: 0
[6,0]: 1
Column number: 1
[6,1]: 0
Column number: 2
[6,2]: 2
Column number: 3
[6,3]: 1
I hope this helps.
Here's an extensible solution using an iterator:
def partitions(data, p):
    for partition in p:
        label = partition[0]
        row_indices = [int(i) for i in partition[1:]]
        rows = [data[row_idx] for row_idx in row_indices]
        columns = zip(*rows)
        yield label, columns

for label, columns in partitions(dataset2D, partition2D):
    print "Processing", label
    for column in columns:
        print column
To address your problems:
What problem I'm facing:
partition2d is a string array where I need to skip the first column, which has characters like A, B, C.
I want to iterate over dataset2d column-wise, only over the row numbers given in partition2d. So colDataset should increment only after I'm done with that column.
Problem 1 can be solved using slicing: if you want to iterate over partition2d from the second element only, you can do something like for partitionCol in partitionRow[1:]. This slices the row from the second element to the end.
So something like:
for partitionRow in partition2d:
    for partitionCol in partitionRow[1:]:
        for colDataset in dataset2d:
            print dataset2d[partitionCol][colDataset]
Problem 2: I didn't understand what you want :)
partition2d is a string array where I need to skip the first column, which has characters like A, B, C.
This is called slicing:
for partitionCol in partitionRow[1:]:
the above snippet will skip the first column.
for colDataset in dataset2d:
already does what you want. There is no loop-counter structure here like in C++ loops. Although you could do things in a very unpythonic way:
i = 0
for i in range(len(dataset2d)):
    print dataset2d[partitionCol][i]
    i += 1
This is a very bad way of doing things. For arrays and matrices, I suggest you don't re-invent the wheel (that is also Pythonic advice): look at Numpy, and especially at numpy.loadtxt.
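A minimal sketch of that suggestion, assuming the numeric table lives in a whitespace-separated file named dataset.txt (a hypothetical name):
import numpy
dataset2d = numpy.loadtxt('dataset.txt')  # returns a 2-D float array, one row per line of the file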
Setup:
d = [[6,4],[0,0,0,1],[1,0,2,0],[2,2,0,1],[1,1,1,0],[0,0,1,1],[1,0,2,1]]
s = [['A',1,2,4],['B',3,5],['C',6]]
The results are put into a list l:
l = []
for r in s:  # go over each [character, index0, index1, ...]
    new_r = [r[0]]  # create a new list for the values given by each indexN; add the character by default
    for i, c in enumerate(r[1:]):  # go over each indexN, using enumerate to keep track of what N is
        new_r.append(d[c][i])  # i is now the N in indexN; c is the row to pick from
    l.append(new_r)  # add that new list to l
Resulting in
>>> l
[['A', 0, 0, 1], ['B', 2, 0], ['C', 1]]
The execution of the first iteration would look like:
for r in s:
    # -> r = ['A', 1, 2, 4]
    new_r = [r[0]]  # = ['A']
    for i, c in enumerate(r[1:]):  # r[1:] = [1, 2, 4]
        # -> i = 0, c = 1
        new_r.append(d[1][i])
        # -> i = 1, c = 2
        # ...
