How to iterate only certain positions with itertools combinations - Python

I am working on a Python project which iterates through all the possible combinations of entries in a row of Excel data to find which combination produces the correct output.
To achieve this, I iterate through different combinations of 0 and 1 to choose whether each entry is required for the combination: 1 means the data point is included in the calculation and 0 means it is not.
The number of combinations is thus equal to 2 ^ (Number of Excel columns).
Example Excel Data:
1, 22, 7, 11, 2, 4
Example Iteration:
(1, 0, 0, 0, 1, 0)
For example, I could be looking for which combination of the Excel data results in an output of 3, the only correct combination being the iteration above.
However, I know that any value greater than 3 could not be part of a combination that sums to 3. As such, I would like to fix the values of those columns to 0 and iterate over the other columns only. This would in turn reduce the number of combinations:
Combinations = 2 ^ (Number of Excel columns - Fixed columns)
At the moment I am using itertools.product to get all the combinations I need:
Numbers = ["0","1"]
for item in itertools.product(Numbers, repeat=len(df.columns)):
Iteration = pd.DataFrame(item) # Iteration e.g (0,1,1,1,0,0,1)
Data = df.iloc[0] # Excel data row
Data = Data.to_numpy()
Iteration = Iteration.astype(float)
Answer = np.dot(Data, Iteration) # Get the result of (Iteration * Data) to check if answer is correct
This results in iterating through combinations which I know will not work.
Is there a way to only iterate 0's and 1's in certain positions of the combination while keeping the known entries a fixed value (either 0 or 1) to reduce the possible combinations?
Some Excel files have over 25 columns, which would mean 33,554,432 combinations. As such, I am trying to reduce the number of columns I need to iterate by fixing values for the columns that I do know.
If you need further clarification, please let me know. I am a novice programmer, so I may be overlooking or overcomplicating a simple solution.

Find which columns meet your criteria for exclusion. Then just get the product combinations for the other columns.
One possible method:
from itertools import product

LIMIT = 10
column_data = [1, 22, 7, 11, 2, 4]
changeable_indexes = [i for i, x in enumerate(column_data) if x <= LIMIT]

for item in product([0, 1], repeat=len(changeable_indexes)):
    row_iteration = [0] * len(column_data)
    for index, value in zip(changeable_indexes, item):
        row_iteration[index] = value
    print(row_iteration)
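To tie this back to the dot-product check in the question, here is a minimal sketch (an assumption: it uses the example row from the question with a target of 3, and builds the full 0/1 vector with NumPy instead of a one-column DataFrame):
import numpy as np
from itertools import product

TARGET = 3                                            # value we are searching for (the question's example)
data = np.array([1, 22, 7, 11, 2, 4], dtype=float)    # Excel data row
# columns whose value already exceeds TARGET can never be part of the answer
changeable_indexes = [i for i, x in enumerate(data) if x <= TARGET]

for item in product([0, 1], repeat=len(changeable_indexes)):
    iteration = np.zeros(len(data))                   # fixed columns stay 0
    iteration[changeable_indexes] = item
    if np.dot(data, iteration) == TARGET:
        print("Match:", iteration)                    # [1. 0. 0. 0. 1. 0.]
For this example row, only 2^2 = 4 combinations are iterated instead of 2^6 = 64.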

Related

How to efficiently count the number of smaller elements for every element in another column?

I have the following df:
         name  created_utc
0  t1_cqug90j   1430438400
1  t1_cqug90k   1430438400
2  t1_cqug90z   1430438400
3  t1_cqug91c   1430438401
4  t1_cqug91e   1430438401
...       ...          ...
in which column name contains only unique values. I would like to create a dictionary whose keys are the elements of column name. The value for each key is the number of elements in column created_utc strictly smaller than the key's created_utc. My expected result is something like
{'t1_cqug90j': 6, 't1_cqug90k': 0, 't1_cqug90z': 3, ...}
In this case, there are 6 elements in column created_utc strictly smaller than 1430438400, which is the corresponding value of t1_cqug90j. I can write a loop to generate such a dictionary, but it is not efficient in my case with more than 3 million rows.
Could you please elaborate on a more efficient way?
import pandas as pd
import numpy as np
df = pd.read_csv('https://raw.githubusercontent.com/leanhdung1994/WebMining/main/df1.csv', header = 0)[['name', 'created_utc']]
df
Update: I posted the question How to efficiently count the number of larger elements for every element in another column? and received a great answer there. However, I'm not able to adapt the code to this case. It would be great if there is an efficient solution that can be adapted for both cases, i.e. "strictly larger" and "strictly smaller".
I think you need sort_index with descending sorting, building on your previous answer:
count_utc = df.groupby('created_utc').size().sort_index(ascending=False)
print (count_utc)
created_utc
1430438401 2
1430438400 3
dtype: int64
cumulative_counts = count_utc.shift(fill_value=0).cumsum()
output = dict(zip(df['name'], df['created_utc'].map(cumulative_counts)))
print (output)
{'t1_cqug90j': 2, 't1_cqug90k': 2, 't1_cqug90z': 2, 't1_cqug91c': 0, 't1_cqug91e': 0}
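If you need the "strictly smaller" counts instead, the same shift/cumsum pattern with an ascending sort should work (a sketch on the five sample rows shown above):
count_utc = df.groupby('created_utc').size().sort_index(ascending=True)
cumulative_counts = count_utc.shift(fill_value=0).cumsum()
smaller = dict(zip(df['name'], df['created_utc'].map(cumulative_counts)))
# {'t1_cqug90j': 0, 't1_cqug90k': 0, 't1_cqug90z': 0, 't1_cqug91c': 3, 't1_cqug91e': 3}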

Pandas - slice sections of dataframe into multiple dataframes

I have a Pandas dataframe with 3000+ rows that looks like this:
        t090:   c0S/m:     pr:       timeJ:  potemp090C:   sal00:  depSM:
407   19.3574  4.16649   1.836   189.617454      19.3571  30.3949   1.824
408   19.3519  4.47521   1.381   189.617512      19.3517  32.9250   1.372
409   19.3712  4.44736   0.710   189.617569      19.3711  32.6810   0.705
410   19.3602  4.26486   0.264   189.617627      19.3602  31.1949   0.262
411   19.3616  3.55025   0.084   189.617685      19.3616  25.4410   0.083
412   19.2559  0.13710   0.071   189.617743      19.2559   0.7783   0.071
413   19.2092  0.03000   0.068   189.617801      19.2092   0.1630   0.068
414   19.4396  0.00522   0.068   189.617859      19.4396   0.0321   0.068
What I want to do is create individual dataframes from each portion of the dataframe in which the values in column 'c0S/m:' exceed 0.1 (e.g. rows 407-412 in the example above).
So let's say that I have 7 sections in my 3000+ row dataframe in which a series of rows exceed 0.1 in the second column. My if/for/while statement will slice these sections and create 7 separate dataframes.
I tried researching the best I could but could not find a question that would address this problem. Any help is appreciated.
Thank you.
You can try this:
First add a column of 0s and 1s based on whether the value is greater than 1 or not:
df['splitter'] = np.where(df['c0S/m:'] > 1, 1, 0)
Now group by the cumulative sum of this column's diff():
df.groupby((df['splitter'].diff(1) != 0).astype('int').cumsum()).apply(lambda x: [x.index.min(),x.index.max()])
You get the required blocks of indices:
splitter
1    [407, 411]
2    [412, 414]
3    [415, 415]
Now you can create the dataframes using loc:
df.loc[407:411]
Note: I added a line to your sample df using:
df.loc[415] = [19.01, 5.005, 0.09, 189.62, 19.01, 0.026, 0.09]
to be able to test better, hence the split into 3 groups.
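If you want the actual DataFrames rather than just the index ranges, one possible follow-up (a sketch building on the splitter column above) is to group on the same key and keep only the blocks where the condition holds:
groups = (df['splitter'].diff(1) != 0).astype('int').cumsum()
sub_frames = [g for _, g in df[df['splitter'] == 1].groupby(groups)]
# sub_frames[0] is df.loc[407:411], sub_frames[1] is df.loc[415:415]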
Here's another way.
sub_set = df[df['c0S/m:'] > 0.1]
last = None
for i in sub_set.index:
    if last is None:
        start = i
    else:
        if i - last > 1:       # a gap in the index marks the end of a block
            print(start, last)
            start = i
    last = i
I think it works. (Instead of print(start, last) you could insert code to create the slices you wanted from the original data frame; the final block also needs to be handled after the loop. A sketch follows below.)
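A sketch of what that could look like, collecting each contiguous run of indices into its own DataFrame (this assumes the index consists of consecutive integers, as in the sample):
sub_set = df[df['c0S/m:'] > 0.1]
slices = []
start = last = None
for i in sub_set.index:
    if last is None:
        start = i
    elif i - last > 1:                 # gap in the index: close the current block
        slices.append(df.loc[start:last])
        start = i
    last = i
if last is not None:                   # flush the final block
    slices.append(df.loc[start:last])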
Some neat tricks here that do an even better job.

Compute the row and column totals for an iterable of tuples

This requires yielding the results as they are computed; it should not store all of the data at any point, and it should support streams of data larger than memory.
For each row, add an int at the beginning that is the total of the row. Once the entire input has been processed, add a final row with the totals of every column in the input. This should include the initial total column, and columns that are missing on a given row should be treated as zeros.
The row totals are the first column instead of the last (as is more common) because it makes rows of different length easier to handle.
For example, func([(1,2,3), (4,5)]) should produce (6,1,2,3), (9,4,5) and finally (15,5,7,3).
Hopefully you will learn something from this:
from itertools import zip_longest   # izip_longest in Python 2

def func(rows):
    totals = []
    for row in rows:
        row = (sum(row),) + row
        totals = [sum(col) for col in zip_longest(totals, row, fillvalue=0)]
        yield row
    yield tuple(totals)

>>> list(func([(1,2,3), (4,5)]))
[(6, 1, 2, 3), (9, 4, 5), (15, 5, 7, 3)]
This code iterates over all of the rows, yielding for each one a tuple comprising the row total followed by the original columns.
zip_longest() (izip_longest() in Python 2) pairs items in the current row with the corresponding item in totals to maintain a running total of each column. It was chosen because it can handle rows of different lengths and you can supply a fill value (0 in this case) for missing items.
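Because func is a generator, it can also consume rows lazily from another generator, so only the running totals are kept in memory (a small sketch with a hypothetical row source standing in for reading from disk):
def rows_from_somewhere():
    yield (1, 2, 3)        # stand-in for lazily read rows, e.g. from a file
    yield (4, 5)

for row in func(rows_from_somewhere()):
    print(row)
# (6, 1, 2, 3)
# (9, 4, 5)
# (15, 5, 7, 3)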

How to classify values in a column of a pandas data frame according to their value?

I have a data frame with a column that contains real values.
I would like to have an additional column that classifies these values according to their size. For example, I would like to know whether a value belongs to the group of the smallest values or to the group of the largest values. I would like these groups to have the same number of elements.
For example, if I have the following values:
[1,2,3,4,40,50]
I would like to map 1, 2 and 3 to 1, and 4, 40 and 50 to 2. Is there an easy way to do this in a data frame?
In the above example I have used only two groups, but I would like to keep it flexible. For example, for three groups I would like to map 1 and 2 to 1, 3 and 4 to 2, and 40 and 50 to 3.
import heapq
import random

x = list(range(100000))
random.shuffle(x)
print(heapq.nlargest(2, x))
Gives: [99999, 99998]
Now just do something like:
max_column = heapq.nlargest(len(x) // 2, x)
That should give you half of your list in a "large" pile, and do the same for the small pile.
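For the DataFrame case with a flexible number of equally sized groups, pandas' qcut may also be worth a look as an alternative (a sketch, assuming the column is named 'value'):
import pandas as pd

df = pd.DataFrame({'value': [1, 2, 3, 4, 40, 50]})
df['group'] = pd.qcut(df['value'], q=2, labels=[1, 2]).astype(int)
# values 1, 2, 3 fall into group 1 and 4, 40, 50 into group 2;
# q=3 with labels=[1, 2, 3] gives the three-group split from the question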

Counting non-zero elements within each row and within each column of a 2D NumPy array

I have a NumPy matrix that contains mostly non-zero values, but occasionally will contain a zero value. I need to be able to:
Count the non-zero values in each row and put that count into a variable that I can use in subsequent operations, perhaps by iterating through row indices and performing the calculations during the iterative process.
Count the non-zero values in each column and put that count into a variable that I can use in subsequent operations, perhaps by iterating through column indices and performing the calculations during the iterative process.
For example, one thing I need to do is to sum each row and then divide each row sum by the number of non-zero values in each row, reporting a separate result for each row index. And then I need to sum each column and then divide the column sum by the number of non-zero values in the column, also reporting a separate result for each column index. I need to do other things as well, but they should be easy after I figure out how to do the things that I am listing here.
The code I am working with is below. You can see that I am creating an array of zeros and then populating it from a csv file. Some of the rows will contain values for all the columns, but other rows will still have some zeros remaining in some of the last columns, thus creating the problem described above.
The last five lines of the code below are from another posting on this forum. These last five lines of code return a printed list of row/column indices for the zeros. However, I do not know how to use that resulting information to create the non-zero row counts and non-zero column counts described above.
ANOVAInputMatrixValuesArray = zeros([len(TestIDs), 9], float)
j = 0
for j in range(0, len(TestIDs)):
    TestID = str(TestIDs[j])
    ReadOrWrite = 'Read'
    fileName = inputFileName
    directory = GetCurrentDirectory(arguments that return correct directory)
    inputfile = open(directory, 'r')
    reader = csv.reader(inputfile)
    m = 0
    for row in reader:
        if m < 9:
            if row[0] != 'TestID':
                ANOVAInputMatrixValuesArray[(j-1), m] = row[2]
                m += 1
    inputfile.close()

IndicesOfZeros = indices(ANOVAInputMatrixValuesArray.shape)
locs = IndicesOfZeros[:, ANOVAInputMatrixValuesArray == 0]
pts = hsplit(locs, len(locs[0]))
for pt in pts:
    print(', '.join(str(p[0]) for p in pt))
Can anyone help me with this?
import numpy as np
a = np.array([[1, 0, 1],
              [2, 3, 4],
              [0, 0, 7]])
columns = (a != 0).sum(0)
rows = (a != 0).sum(1)
The expression (a != 0) produces a boolean array of the same shape as the original a, containing True for all non-zero elements.
The .sum(x) method sums the elements over axis x, and the sum of True/False elements is the number of True elements.
The variables columns and rows contain the number of non-zero (element != 0) values in each column/row of your original array:
columns = np.array([2, 1, 3])
rows = np.array([2, 3, 1])
EDIT: The whole code could look like this (with a few simplifications in your original code):
ANOVAInputMatrixValuesArray = zeros([len(TestIDs), 9], float)
for j, TestID in enumerate(TestIDs):
    ReadOrWrite = 'Read'
    fileName = inputFileName
    directory = GetCurrentDirectory(arguments that return correct directory)
    # use directory or filename to get the CSV file?
    with open(directory, 'r') as csvfile:
        ANOVAInputMatrixValuesArray[j, :] = loadtxt(csvfile, comments='TestId', delimiter=';', usecols=(2,))[:9]

nonZeroCols = (ANOVAInputMatrixValuesArray != 0).sum(0)
nonZeroRows = (ANOVAInputMatrixValuesArray != 0).sum(1)
EDIT 2:
To get the mean value of all columns/rows, use the following:
colMean = a.sum(0) / (a != 0).sum(0)
rowMean = a.sum(1) / (a != 0).sum(1)
What do you want to do if there are no non-zero elements in a column/row? Then we can adapt the code to solve such a problem.
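If a row or column can be entirely zero, one possible guard (a sketch building on the counts above) is to clamp the divisor and report 0 for those means instead of dividing by zero:
col_counts = (a != 0).sum(0)
row_counts = (a != 0).sum(1)
colMean = np.where(col_counts != 0, a.sum(0) / np.maximum(col_counts, 1), 0)
rowMean = np.where(row_counts != 0, a.sum(1) / np.maximum(row_counts, 1), 0)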
A fast way to count nonzero elements per row in a scipy sparse matrix m is:
np.diff(m.tocsr().indptr)
The indptr attribute of a CSR matrix indicates the indices within the data corresponding to the boundaries between rows, so taking the difference between consecutive entries gives the number of non-zero elements in each row.
Similarly, for the number of nonzero elements in each column, use:
np.diff(m.tocsc().indptr)
If the data is already in the appropriate form, these will run in O(m.shape[0]) and O(m.shape[1]) respectively, rather than O(m.getnnz()) in Marat and Finn's solutions.
If you need both row and column nonzero counts, and, say, m is already a CSR, you might use:
row_nonzeros = np.diff(m.indptr)
col_nonzeros = np.bincount(m.indices)
which is not asymptotically faster than first converting to CSC (which is O(m.getnnz())) to get col_nonzeros, but is faster because of implementation details.
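A quick sketch showing both counts on a small CSR matrix:
import numpy as np
from scipy.sparse import csr_matrix

m = csr_matrix([[1, 0, 1],
                [2, 3, 4],
                [0, 0, 7]])
row_nonzeros = np.diff(m.indptr)                              # array([2, 3, 1])
col_nonzeros = np.bincount(m.indices, minlength=m.shape[1])   # array([2, 1, 3]); minlength keeps trailing empty columns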
A faster way is to clone your matrix with ones instead of the real values. Then just sum up by rows or columns:
X_clone = X.tocsc()
X_clone.data = np.ones( X_clone.data.shape )
NumNonZeroElementsByColumn = X_clone.sum(0)
NumNonZeroElementsByRow = X_clone.sum(1)
That worked 50 times faster for me than Finn Årup Nielsen's solution (1 second against 53)
edit: Perhaps you will need to convert NumNonZeroElementsByColumn into a 1-dimensional array with
np.array(NumNonZeroElementsByColumn)[0]
For sparse matrices, use the getnnz() function supported by CSR/CSC matrix.
E.g.
a = scipy.sparse.csr_matrix([[0, 1, 1], [0, 1, 0]])
a.getnnz(axis=0)
array([0, 2, 1])
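The per-row counts work the same way with axis=1 (for the same matrix):
a.getnnz(axis=1)
array([2, 1])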
(a != 0) does not work for sparse matrices (scipy.sparse.lil_matrix) in my present version of scipy.
For sparse matrices I did:
(i, j) = X.nonzero()
column_sums = np.zeros(X.shape[1])
for n in np.asarray(j).ravel():
    column_sums[n] += 1.
I wonder if there is a more elegant way.
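One possibly more elegant equivalent (a sketch) is to count the column indices directly with bincount instead of the explicit loop:
(i, j) = X.nonzero()
column_sums = np.bincount(np.asarray(j).ravel(), minlength=X.shape[1])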
