Split output of loop by columns used as input - python

Hi, I'm relatively new to Python and am currently trying to measure the width of features in an image. The resolution of my image is 1 m, so measuring the width should be straightforward. I've managed to select certain columns or rows of the image and extract the necessary data using loops. My code is below:
subset = imarray[:,::500]#(imarray.shape[1]/2):(imarray.shape[1]/2)+1]
subset[(subset > 0) & (subset <= 17)] = 1
subset[(subset > 17)] = 0
width = []
count = 0
for i in np.arange(subset.shape[1]):
    column = subset[:,i]
    for value in column:
        if (value == 1):
            count += 1
            width.append(count)
            width_arr = np.array(width).astype('uint8')
        else:
            count = 0
final = np.split(width_arr, np.argwhere(width_arr == 1).flatten())
final2 = [x for x in final if x != []]
width2 = []
for array in final2:
    width2.append(max(array))
width2 = np.array(width2).astype('uint8')
print(width2)
I can't figure out how to split the output up so it shows the results for each column or row individually. Instead all I've been able to do is to append the data to an empty list and here's the output for that:
[ 70 35 4 2 5 36 4 5 2 51 97 4 228 3 21 47 7 21
23 58 126 4 111 2 2 5 3 2 18 15 6 19 3 3 12 15
6 8 2 4 6 88 122 24 14 49 73 57 74 6 179 8 3 2
6 3 184 9 3 19 24 3 2 2 3 255 30 8 191 33 127 5
3 27 112 2 24 2 5 2 10 30 10 6 37 2 38 6 12 17
44 67 23 5 101 10 9 4 6 4 255 136 5 255 255 255 255 26
255 235 148 4 255 199 3 2 114 87 255 109 69 12 41 20 30 57
72 89 32]
So these are the widths of the features in all the columns appended together. How do I use my loop or another method to split these up into individual numpy arrays representing each column I've sliced out of the original?
It seems like I am almost there but I can't seem to figure that last step out and it's driving me nuts.
Thanks in advance for your help!
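One possible way to keep the results separate is to compute the run lengths one column at a time and store each column's result as its own array, instead of appending everything to a single list. The following is only a sketch under assumptions: `run_lengths`, `demo` and `widths_per_column` are hypothetical names, and `demo` is a tiny stand-in for the thresholded `subset` array of 0s and 1s.

```python
import numpy as np

def run_lengths(column):
    """Lengths of each consecutive run of 1s in a 1-D array of 0s and 1s."""
    # Pad with zeros so runs touching either edge still have a start and an end.
    # Casting to int avoids unsigned wrap-around when taking differences.
    padded = np.concatenate(([0], np.asarray(column).astype(int), [0]))
    steps = np.diff(padded)
    starts = np.flatnonzero(steps == 1)   # 0 -> 1 transitions
    ends = np.flatnonzero(steps == -1)    # 1 -> 0 transitions
    return ends - starts

# Hypothetical stand-in for the thresholded `subset` array from the question.
demo = np.array([[1, 0],
                 [1, 1],
                 [0, 1],
                 [1, 1]])

# One array of widths per sliced column, kept in a list.
# The default integer dtype is kept because uint8 cannot hold widths above 255.
widths_per_column = [run_lengths(demo[:, i]) for i in range(demo.shape[1])]

for i, widths in enumerate(widths_per_column):
    print(i, widths)
```

Here the first column contains runs of length 2 and 1, and the second column a single run of length 3, so each column's widths stay in their own array.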

Related

How to group and sum rows by ID and subtract from group of rows with same ID? [python]

I have the following dataframe:
ID_A ID_B ID_C Value ID_B_Value_Sum
-----------------------------------------------
0 22 5 1 54 208
1 23 5 2 34 208
2 24 6 1 44 268
3 25 6 1 64 268
4 26 5 2 35 208
5 27 7 3 45 229
6 28 7 2 66 229
7 29 8 1 76 161
8 30 8 2 25 161
9 31 6 2 27 268
10 32 5 3 14 208
11 33 5 3 17 208
12 34 6 2 43 268
13 35 6 2 53 268
14 36 8 1 22 161
15 37 7 3 65 229
16 38 7 1 53 229
17 39 8 2 23 161
18 40 8 3 15 161
19 41 6 3 37 268
20 42 5 2 54 208
Each row contains a unique "ID_A", while different rows can have the same "ID_B" and "ID_C". Each row corresponds to its own "Value", where this "Value" can be the same between different rows. The "ID_B_Value_Sum" column contains the sums of all values from the "Value" column for all rows containing the same "ID_B". Calculating this sum is straightforward with python and pandas.
What I want to do is, for each row, take the "ID_B_Value_Sum" column, but subtract all values corresponding to rows containing the same "ID_C", exclusive of the target row. For example, taking "ID_B" = 6, we see the sum of all the "Value" values from this "ID_B" = 6 group = 268, as shown in all corresponding rows in the "ID_B_Value_Sum" column. Now, two of the rows in this group contain "ID_C" = 1, three rows in this group contain "ID_C" = 2, and one row in this group contains "ID_C" = 3. Starting with row 2, with "ID_C" = 1, this means taking the corresponding "ID_B_Value_Sum" value and subtracting the "Value" values from all other rows containing both "ID_B" = 6 and "ID_C" = 1, exclusive of the target row. So for row 2 I take 268 - 64 = 204. As another example, for row 4, this means 208 - 34 - 54 = 120. And for row 7, this means 161 - 22 = 139. These new values will go in a new "Value_Sum_New" column for each row.
And so I want to produce the following output dataframe:
ID_A ID_B ID_C Value ID_B_Value_Sum Value_Sum_New
---------------------------------------------------------------
0 22 5 1 54 208 XX
1 23 5 2 34 208 XX
2 24 6 1 44 268 204
3 25 6 1 64 268 XX
4 26 5 2 35 208 120
5 27 7 3 45 229 XX
6 28 7 2 66 229 XX
7 29 8 1 76 161 139
8 30 8 2 25 161 XX
9 31 6 2 27 268 XX
10 32 5 3 14 208 XX
11 33 5 3 17 208 XX
12 34 6 2 43 268 XX
13 35 6 2 53 268 XX
14 36 8 1 22 161 XX
15 37 7 3 65 229 XX
16 38 7 1 53 229 XX
17 39 8 2 23 161 XX
18 40 8 3 15 161 XX
19 41 6 3 37 268 XX
20 42 5 2 54 208 XX
What I am having trouble conceptualizing is how to, for each row, group together all rows with the same "ID_B", then within that group sub-group all rows with the same "ID_C", and subtract the sum of that sub-group from "ID_B_Value_Sum" while still keeping the "Value" of the target row, to create the final "Value_Sum_New". It seems like so many actions and sub-actions, and I am confused about how to organize and order the workflow in a simple and streamlined manner. How might I approach calculating this sum in Python?
IIUC, you need:
df['Value_Sum_New'] = (df['ID_B_Value_Sum']
                       - df.groupby(['ID_B', 'ID_C'])['Value'].transform('sum')
                       + df['Value']
                       )
output:
ID_A ID_B ID_C Value ID_B_Value_Sum Value_Sum_New
0 22 5 1 54 208 208
1 23 5 2 34 208 119
2 24 6 1 44 268 204
3 25 6 1 64 268 224
4 26 5 2 35 208 120
5 27 7 3 45 229 164
6 28 7 2 66 229 229
7 29 8 1 76 161 139
8 30 8 2 25 161 138
9 31 6 2 27 268 172
10 32 5 3 14 208 191
11 33 5 3 17 208 194
12 34 6 2 43 268 188
13 35 6 2 53 268 198
14 36 8 1 22 161 85
15 37 7 3 65 229 184
16 38 7 1 53 229 229
17 39 8 2 23 161 136
18 40 8 3 15 161 161
19 41 6 3 37 268 268
20 42 5 2 54 208 139
explanation
As you said, computing a sum per group is easy in pandas. You can actually compute ID_B_Value_Sum with:
df['ID_B_Value_Sum'] = df.groupby('ID_B')['Value'].transform('sum')
Now we do the same for groups of ID_B + ID_C and subtract the result from ID_B_Value_Sum. Since we want to exclude only the other rows in the group, we add back the row's own Value.
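To make the three terms concrete, here is a small self-contained sketch using only the six "ID_B" = 6 rows from the question. The intermediate names `b_sum` and `bc_sum` are introduced here for illustration and are not part of the original answer.

```python
import pandas as pd

# Only the rows with ID_B == 6 from the question's dataframe.
df = pd.DataFrame({
    "ID_A": [24, 25, 31, 34, 35, 41],
    "ID_B": [6, 6, 6, 6, 6, 6],
    "ID_C": [1, 1, 2, 2, 2, 3],
    "Value": [44, 64, 27, 43, 53, 37],
})

# Sum of Value over each ID_B group, broadcast back to every row (268 here).
b_sum = df.groupby("ID_B")["Value"].transform("sum")

# Sum of Value over each (ID_B, ID_C) sub-group, broadcast back to every row.
bc_sum = df.groupby(["ID_B", "ID_C"])["Value"].transform("sum")

# Remove the whole sub-group, then add the row's own Value back.
df["Value_Sum_New"] = b_sum - bc_sum + df["Value"]
print(df)
```

For the first row this gives 268 - (44 + 64) + 44 = 204, which matches the value the question expects for row 2.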

Sum row values of all columns where column names meet string match condition

I have the following dataset:
ID Length Width Range_CAP Capacity_CAP
0 1 33 25 16 50
1 2 34 22 11 66
2 3 22 12 15 42
3 4 46 45 66 54
4 5 16 6 23 75
5 6 21 42 433 50
I basically want to sum the row values of the columns only where the columns match a string (in this case, all columns with _CAP at the end of their name). And store the sum of the result in a new column.
So that I end up with a dataframe that looks something like this:
ID Length Width Range_CAP Capacity_CAP CAP_SUM
0 1 33 25 16 50 66
1 2 34 22 11 66 77
2 3 22 12 15 42 57
3 4 46 45 66 54 120
4 5 16 6 23 75 98
5 6 21 42 433 50 483
I first tried to use the solution recommended in this question here:
Summing columns in Dataframe that have matching column headers
However, the solution doesn't work for me since they are summing up columns that have the same exact name so a simple groupby can accomplish the result whereas I am trying to sum columns with specific string matches only.
Code to recreate above sample dataset:
import pandas as pd
data1 = [['1', 33, 25, 16, 50], ['2', 34, 22, 11, 66],
         ['3', 22, 12, 15, 42], ['4', 46, 45, 66, 54],
         ['5', 16, 6, 23, 75], ['6', 21, 42, 433, 50]]
df = pd.DataFrame(data1, columns=['ID', 'Length', 'Width', 'Range_CAP', 'Capacity_CAP'])
Use filter to select the columns whose names contain CAP, then sum across each row:
df['CAP_SUM'] = df.filter(like='CAP').sum(axis=1)
Out[86]:
0 66
1 77
2 57
3 120
4 98
5 483
dtype: int64
If other column names contain CAP somewhere other than at the end, match the _CAP suffix with a regex instead:
df.filter(regex='_CAP$').sum(axis=1)
Out[92]:
0 66
1 77
2 57
3 120
4 98
5 483
dtype: int64
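The difference between the two filters only shows up when some other column name happens to contain CAP. A minimal sketch, where the `CAPTION_ID` column is a hypothetical addition that is not in the original dataset:

```python
import pandas as pd

df = pd.DataFrame({
    "Range_CAP": [16, 11],
    "Capacity_CAP": [50, 66],
    "CAPTION_ID": [7, 8],  # hypothetical column, not part of the question's data
})

# like='CAP' keeps every column whose name contains the substring "CAP".
loose = df.filter(like="CAP").columns.tolist()

# regex='_CAP$' keeps only the columns whose name ends with "_CAP".
strict = df.filter(regex="_CAP$").columns.tolist()

print(loose)
print(strict)
```

With this data, `loose` includes `CAPTION_ID` while `strict` does not, so the regex version is the safer choice when the suffix is what matters.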
One approach is:
df['CAP_SUM'] = df.loc[:, df.columns.str.endswith('_CAP')].sum(1)
print(df)
Output
ID Length Width Range_CAP Capacity_CAP CAP_SUM
0 1 33 25 16 50 66
1 2 34 22 11 66 77
2 3 22 12 15 42 57
3 4 46 45 66 54 120
4 5 16 6 23 75 98
5 6 21 42 433 50 483
The expression:
df.columns.str.endswith('_CAP')
creates a boolean mask whose values are True if and only if the column name ends with _CAP. As an alternative, use filter with the following regex:
df['CAP_SUM'] = df.filter(regex='_CAP$').sum(1)
print(df)
Output (of filter)
ID Length Width Range_CAP Capacity_CAP CAP_SUM
0 1 33 25 16 50 66
1 2 34 22 11 66 77
2 3 22 12 15 42 57
3 4 46 45 66 54 120
4 5 16 6 23 75 98
5 6 21 42 433 50 483
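To see what the boolean mask looks like before it is passed to `.loc`, here is a short sketch on the first three rows of the sample data. The variable names `mask` and `cap_sum` are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    [["1", 33, 25, 16, 50], ["2", 34, 22, 11, 66], ["3", 22, 12, 15, 42]],
    columns=["ID", "Length", "Width", "Range_CAP", "Capacity_CAP"],
)

# One boolean per column: True where the column name ends with "_CAP".
mask = df.columns.str.endswith("_CAP")
print(mask)

# Select all rows and only the masked columns, then sum across each row.
cap_sum = df.loc[:, mask].sum(axis=1)
print(cap_sum)
```

Because the mask is computed from the column names only, the string-valued `ID` column never takes part in the sum.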
You may try this:
columnstxt = df.columns
df['sum'] = 0
for i in columnstxt:
    if i.find('_CAP') != -1:
        df['sum'] = df['sum'] + df[i]

Labeling by period

My dataset:
name day value
A 7 88
A 15 101
A 21 121
A 29 56
B 21 131
B 30 78
B 35 102
C 8 80
C 16 101
...
I am trying to plot the values against these days, but there are too many unique day numbers, so I want to group them under a consistent label.
Is there a quick way to label the days by cutting them into 7-day periods (weeks)?
For example, days 1 to 7 = week 1, days 8 to 14 = week 2, and so on.
The output I want:
name day value week
A 7 88 1
A 15 101 3
A 21 121 3
A 29 56 5
B 21 131 3
B 30 78 5
B 35 102 5
C 8 80 2
C 16 101 3
Thank you for reading.
Subtract 1, then use integer division by 7, and finally add 1:
df['week'] = (df['day'] - 1) // 7 + 1
print(df)
name day value week
0 A 7 88 1
1 A 15 101 3
2 A 21 121 3
3 A 29 56 5
4 B 21 131 3
5 B 30 78 5
6 B 35 102 5
7 C 8 80 2
8 C 16 101 3
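The reason for subtracting 1 first is that plain `day // 7 + 1` would push day 7 into week 2. A self-contained sketch with the data from the question (the `naive` column is added here only for comparison):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["A", "A", "A", "A", "B", "B", "B", "C", "C"],
    "day": [7, 15, 21, 29, 21, 30, 35, 8, 16],
    "value": [88, 101, 121, 56, 131, 78, 102, 80, 101],
})

# Days 1-7 -> week 1, days 8-14 -> week 2, and so on.
df["week"] = (df["day"] - 1) // 7 + 1

# Without the "- 1", multiples of 7 land one week too late (day 7 -> week 2).
df["naive"] = df["day"] // 7 + 1

print(df)
```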

Python - Convert String to an Object or Array

I have the data below stored in a variable data_str, which is of type string.
level page_num block_num par_num line_num word_num left top width height conf text
1 1 0 0 0 0 0 0 500 659 -1
2 1 1 0 0 0 35 41 422 560 -1
3 1 1 1 0 0 44 41 406 203 -1
4 1 1 1 1 0 98 41 341 10 -1
5 1 1 1 1 1 98 42 31 8 70 ‘When
5 1 1 1 1 2 135 42 17 8 75 Dr.
5 1 1 1 1 3 160 41 32 9 92 Umali
5 1 1 1 1 4 197 44 25 6 96 rose
5 1 1 1 1 5 227 42 11 8 96 to
5 1 1 1 1 6 243 41 17 9 93 the
5 1 1 1 1 7 265 41 52 10 91 deanship
5 1 1 1 1 8 322 41 11 9 96 of
5 1 1 1 1 9 337 41 18 8 96 the
5 1 1 1 1 10 361 41 27 9 80 U.P.
5 1 1 1 1 11 394 41 45 10 85 College
Every time I access data_str[0], it returns l. I want to access the first line and every cell in it. In other words, I want to turn the data into an object so that I can access every cell easily. How can I do this in Python? Please help.
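If it's stored in a string:
# first data line (index 0 is the header row); split() with no argument splits on any whitespace
cells = data_str.split('\n')[1].split()
# all lines
lines = [line.split() for line in data_str.split('\n')]
Or use the csv module to handle the whole string:
from io import StringIO # Python 3
import csv
f = StringIO(data_str)
# csv needs a single-character delimiter, so '\s' is not valid here.
# Assuming space-separated fields; use delimiter='\t' if the data is tab-separated.
reader = csv.reader(f, delimiter=' ', skipinitialspace=True)
lines = [row for row in reader]
# first data line (lines[0] is the header row)
cells = lines[1]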
Strings are sequences of characters in Python, so data_str[0] returns the character at position 0, which is l (the first letter of the header).
So you can:
Split the string into a list of rows.
Split each row to obtain its individual items.
Something like this:
rows = data_str.splitlines()
arr = [row.split() for row in rows]
# now you can access item at row 1, column 2 like arr[1][2]
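If access by column name is more convenient than by position, the header row can be paired with each data row. This is a sketch under assumptions: the `data_str` below is a shortened copy of the data from the question, the fields are assumed to be whitespace-separated, and `records` is a name introduced here for illustration.

```python
# Shortened sample of the data from the question.
data_str = """level page_num block_num par_num line_num word_num left top width height conf text
1 1 0 0 0 0 0 0 500 659 -1
5 1 1 1 1 1 98 42 31 8 70 When
5 1 1 1 1 2 135 42 17 8 75 Dr."""

rows = [line.split() for line in data_str.splitlines()]
header, body = rows[0], rows[1:]

# zip() stops at the shorter sequence, so rows without a "text" field
# simply have no "text" key. All values are still strings at this point.
records = [dict(zip(header, row)) for row in body]

print(records[1]["text"])   # word on the second data row
print(records[2]["left"])   # "left" coordinate on the third data row
```

Convert individual fields with `int(...)` where numbers are needed, for example `int(records[2]["left"])`.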

Comparing two consecutive rows and creating a new column based on a specific logical operation

I have a data frame with two columns
df = ['xPos', 'lineNum']
import pandas as pd
data = '''\
xPos lineNum
40 1
50 1
75 1
90 1
42 2
75 2
110 2
45 3
70 3
95 3
125 3
38 4
56 4
74 4'''
I have created the aggregate data frame for this with the command
aggrDF = df.describe(include='all')
and I am interested in the minimum xPos value, which I get with
minxPos = aggrDF.loc['min', 'xPos']  # .ix was removed in pandas 1.0; .loc is the label-based equivalent
Desired output
data = '''\
xPos lineNum xDiff
40 1 2
50 1 10
75 1 25
90 1 15
42 2 4
75 2 33
110 2 35
45 3 7
70 3 25
95 3 25
125 3 30
38 4 0
56 4 18
74 4 18'''
The logic
I want to compare each pair of consecutive rows of the data frame and calculate a new column based on this logic:
if( df['LineNum'] != df['LineNum'].shift(1) ):
    df['xDiff'] = df['xPos'] - minxPos
else:
    df['xDiff'] = df['xPos'].shift(1)
Essentially, I want the new column to have the difference of the two consecutive rows in the df, as long as the line number is the same.
If the line number changes, then, the xDiff column should have the difference with the minimum xPos value that I have from the aggregate data frame.
Can you please help? thanks,
These two lines should do it:
df['xDiff'] = df.groupby('lineNum').diff()['xPos']
df.loc[df['xDiff'].isnull(), 'xDiff'] = df['xPos'] - minxPos
>>> df
xPos lineNum xDiff
0 40 1 2.0
1 50 1 10.0
2 75 1 25.0
3 90 1 15.0
4 42 2 4.0
5 75 2 33.0
6 110 2 35.0
7 45 3 7.0
8 70 3 25.0
9 95 3 25.0
10 125 3 30.0
11 38 4 0.0
12 56 4 18.0
13 74 4 18.0
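For reference, the same idea as a self-contained sketch that starts from the raw data and keeps integer output. It uses `fillna` instead of the `.loc` assignment, and the names `min_x_pos` and `diff_in_line` are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "xPos": [40, 50, 75, 90, 42, 75, 110, 45, 70, 95, 125, 38, 56, 74],
    "lineNum": [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4],
})

# Global minimum of xPos (38 for this data).
min_x_pos = df["xPos"].min()

# Difference to the previous row within the same line; NaN on each line's first row.
diff_in_line = df.groupby("lineNum")["xPos"].diff()

# Fill each line's first row with its distance from the global minimum.
df["xDiff"] = diff_in_line.fillna(df["xPos"] - min_x_pos).astype(int)
print(df)
```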
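You just need to group by lineNum and apply the condition you already wrote down. Note that this follows the posted logic literally (the group's own minimum for the first row of each line, the previous xPos otherwise), so the result differs from the desired output: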
df['xDiff']=np.concatenate(df.groupby('lineNum').apply(lambda x : np.where(x['lineNum'] != x['lineNum'].shift(1),x['xPos'] - x['xPos'].min(),x['xPos'].shift(1)).astype(int)).values)
df
Out[76]:
xPos lineNum xDiff
0 40 1 0
1 50 1 40
2 75 1 50
3 90 1 75
4 42 2 0
5 75 2 42
6 110 2 75
7 45 3 0
8 70 3 45
9 95 3 70
10 125 3 95
11 38 4 0
12 56 4 38
13 74 4 56
