I have a large CSV table with ~1,200 objects. I am narrowing down those to a volume limited sample (VLS) of 326 objects by setting certain parameters (only certain distances, etc.)
Within this VLS I am using a for loop to count the number of specific types of objects. I don't want to count the entire VLS at once, though; instead, the loop counts in "sections" (think of drawing boxes on a scatter plot and counting what's in each box).
I'm pretty sure my issue is because of the way pandas reads in the columns of my CSV table and the "box" array I have can't talk to the columns that are "dtype: object."
I don't expect someone to have a perfect fix for this, but even pointing me to some specific and relevant information on pandas would be helpful and appreciated. I try reading the documentation for pandas, but I don't understand much.
This is how I read in the CSV table and my columns in case it's relevant:
file = pd.read_csv(r'~/Downloads/CSV')
#more columns than this, but they're all defined like this in my code
blend = file["blend"]
dec = file["dec"]
When I define my VLS inside the definition of the section I'm looking at (named 'box') the code does work, and the for loop properly counts the objects.
This is what it looks like when it works:
color = np.array([-1,0,1])
for i in color:
    box1 = np.where((constant box parameters) & (variable par >= i) &
                    (variable par < i+1) & ('Volume-limited parameters I wont list'))[0]
    binaries = np.where(blend[box1].str[:1].eq('Y'))[0]
    candidates = np.where(blend[box1].str[0].eq('?'))[0]
    singles = np.where(blend[box1].str[0].eq('N'))[0]
    print ("from", i, "to", i+1, "there are", len(binaries), "binaries,", len(candidates), "candidates,", len(singles), "singles.")
# Correct Output:
"from -1 to 0 there are 7 binaries, 1 candidates, 78 singles."
"from 0 to 1 there are 3 binaries, 1 candidates, 24 singles."
"from 1 to 2 there are 13 binaries, 6 candidates, 69 singles."
The problem is that I don't want to include the parameters for my VLS in the np.where() for "box". This is how I would like my code to look:
vollim = np.where((dec >= -30) & (dec <= 60) & (p_anglemas/err_p_anglemas >= 5) &
                  (dist <= 25) & (err_j_k_mag < 0.2))[0]
j_k_mag_vl = j_k_mag[vollim]
abs_jmag_vl = abs_jmag[vollim]
blend_vl = blend[vollim]
hires_vl = hires[vollim]
#%%
color = np.array([-1,0,1])
for i in color:
    box2 = np.where((abs_jmag_vl >= 13) & (abs_jmag_vl <= 16) &
                    (j_k_mag_vl >= i) & (j_k_mag_vl < i+1))[0]
    binaries = np.where(blend_vl[box2].str[:1].eq('Y'))[0]
    candidates = np.where(blend_vl[box2].str[0].eq('?'))[0]
    singles = np.where(blend_vl[box2].str[0].eq('N'))[0]
    print ("from", i, "to", i+1, "there are", len(binaries), "binaries,", len(candidates), "candidates,", len(singles), "singles.")
#Wrong Output:
"from -1 to 0 there are 4 binaries, 1 candidates, 22 singles."
"from 0 to 1 there are 1 binaries, 0 candidates, 5 singles."
"from 1 to 2 there are 4 binaries, 0 candidates, 14 singles."
When I print blend_vl[box2] a lot of the elements for blend_vl have been changed from their regular strings to "NaN" which I do not understand.
When I print box1 and box2 they are the same lengths but they are different indices.
I think blend_vl[box2] would work properly if I changed blend_vl into a flat array?
I know this is a lot of information at once, but I appreciate any input, even if it is just some more information about how pandas and arrays work together. TIA!!
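One possible reading of the NaN values described above, offered as a sketch rather than a confirmed diagnosis: `np.where(...)[0]` returns positions, while `blend_vl[box2]` looks entries up by index label, and `blend_vl` still carries the row labels of the full table. The toy columns below (`dist`, `mag`, and all their values) are invented for illustration; `.iloc` is the pandas positional indexer.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the CSV columns (values are made up).
blend = pd.Series(["N", "Y", "N", "?", "Y", "N"])
dist = pd.Series([10, 40, 20, 50, 5, 15])
mag = pd.Series([0.5, 0.2, 1.5, 0.1, 0.7, 0.3])

# Positions of the rows that pass the volume-limit cut.
vollim = np.where(dist <= 25)[0]

# The subsets keep the ORIGINAL row labels (0, 2, 4, 5), not 0..3.
blend_vl = blend.iloc[vollim]
mag_vl = mag.iloc[vollim]
print(list(blend_vl.index))

# Positions inside the subset that fall in the box.
box = np.where(mag_vl < 1)[0]

# Select by position, so the subset's leftover labels do not matter.
picked = blend_vl.iloc[box]
n_binaries = int(picked.str[0].eq("Y").sum())
n_singles = int(picked.str[0].eq("N").sum())
print(n_binaries, "binaries,", n_singles, "singles")
```

An alternative under the same assumptions would be to call `.reset_index(drop=True)` on each `_vl` column, or to convert the columns with `.to_numpy()` so that all indexing is positional.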
Luckily I found this site:
https://www.linuxtut.com/en/150745ae0cc17cb5c866/
(There are many line types defined in the
Excel enum XlLineStyle)
(xlContinuous = 1
xlDashDot = 4
xlDashDotDot = 5
xlSlantDashDot = 13
xlDash = -4115
xlDot = -4118
xlDouble = -4119
xlLineStyleNone = -4142)
I ran a try/except loop that set the line style roughly +/- 100,000 times, because I thought the [index] number for the line in my picture had to be somewhere in that range, but it wasn't. Why not?
How can I set this line?
Why are some line indices such large negative numbers and not just 1, 2, 3...?
How can I discover the "number" needed for things like this?
Why is it even possible to send data to particular positions in an application? I want to dig a little deeper into that; where can I learn more about it?
(1) You can't find the medium dashed in the linestyle enum because there is none. The line that is drawn as border is a combination of lineStyle and Weight. The lineStyle is xlDash, the weight is xlThin for value 03 in your table and xlMedium for value 08.
(2) To figure out how to set something like this in VBA, use the Macro recorder, it will reveal that lineStyle, Weight (and color) are set when setting a border.
(3) There are a lot of pages describing all the constants, e.g. have a look at the one #FaneDuru linked to in the comments. They can also be found at Microsoft itself: https://learn.microsoft.com/en-us/office/vba/api/excel.xllinestyle and https://learn.microsoft.com/en-us/office/vba/api/excel.xlborderweight. It seems that someone translated them to Python constants on the linuxTut page.
(4) Don't ask why the enums are not continuous values. I assume especially the constants with negative numbers serve more than one purpose. Just never use the values directly, always use the defined constants.
(5) You can assume that numeric values that have no defined constant can work, but the results are kind of unpredictable. It's unlikely that there are values without a constant that result in something "new" (e.g. a different border style).
As you can see in the following table, not all combinations give different borders. Setting the weight to xlHairline will ignore the lineStyle. Setting it to xlThick will also ignore the lineStyle, except for xlDouble. On the other hand, xlDouble will be ignored when the weight is not xlThick.
Sub border()
    With ThisWorkbook.Sheets(1)
        With .Range("A1:J18")
            .Clear
            .Interior.Color = vbWhite
        End With
        Dim lStyles(), lWeights(), lStyleNames(), lWeightNames
        lStyles() = Array(xlContinuous, xlDash, xlDashDot, xlDashDotDot, xlDot, xlDouble, xlLineStyleNone, xlSlantDashDot)
        lStyleNames() = Array("xlContinuous", "xlDash", "xlDashDot", "xlDashDotDot", "xlDot", "xlDouble", "xlLineStyleNone", "xlSlantDashDot")
        lWeights = Array(xlHairline, xlThin, xlMedium, xlThick)
        lWeightNames = Array("xlHairline", "xlThin", "xlMedium", "xlThick")
        Dim x As Long, y As Long
        For x = LBound(lStyles) To UBound(lStyles)
            Dim row As Long
            row = x * 2 + 3
            .Cells(row, 1) = lStyleNames(x) & vbLf & "(" & lStyles(x) & ")"
            For y = LBound(lWeights) To UBound(lWeights)
                Dim col As Long
                col = y * 2 + 3
                If x = 1 Then .Cells(1, col) = lWeightNames(y) & vbLf & "(" & lWeights(y) & ")"
                With .Cells(row, col).Borders
                    .LineStyle = lStyles(x)
                    .Weight = lWeights(y)
                End With
            Next
        Next
    End With
End Sub
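Since the question came from the Python side, here is a rough Python sketch of points (1) and (4): keep the constants in named enums and set style and weight together. The `XlLineStyle` values are the ones quoted in the question; the `XlBorderWeight` values come from the Microsoft page linked in (3). `set_border` is a hypothetical helper, and the sketch assumes the object passed in exposes `.Borders.LineStyle` and `.Borders.Weight` the way an Excel Range does through COM automation; a stand-in object is used so it runs without Excel.

```python
from enum import IntEnum
from types import SimpleNamespace


class XlLineStyle(IntEnum):
    xlContinuous = 1
    xlDashDot = 4
    xlDashDotDot = 5
    xlSlantDashDot = 13
    xlDash = -4115
    xlDot = -4118
    xlDouble = -4119
    xlLineStyleNone = -4142


class XlBorderWeight(IntEnum):
    xlHairline = 1
    xlThin = 2
    xlThick = 4
    xlMedium = -4138


def set_border(cell_range, style, weight):
    """Hypothetical helper: apply a line style and a weight together."""
    borders = cell_range.Borders
    borders.LineStyle = int(style)
    borders.Weight = int(weight)


# Stand-in for an Excel Range, only so the sketch can run anywhere.
fake_range = SimpleNamespace(Borders=SimpleNamespace(LineStyle=None, Weight=None))

# "Medium dashed" is xlDash combined with xlMedium, not a style of its own.
set_border(fake_range, XlLineStyle.xlDash, XlBorderWeight.xlMedium)
print(fake_range.Borders.LineStyle, fake_range.Borders.Weight)
```

Using named members instead of raw integers follows the advice in (4): the numbers never appear in the calling code.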
I was wondering if anyone else has ever experienced value_counts() returning incorrect counts. I have two variables, Pass and Fail, and when I use value_counts() it is returning the correct total but the wrong number for each variable.
The data in the data frame is for samples made with different sample preparation methods (A-G) and then tested on different testing machines (numbered 1-5; they run the same test we just have 5 different ones so we can run more tests) and I am trying to compare both the method and testers by putting the pass % into a pivot table. I would like to be able to do this for different sample materials as well so I have been trying to write the pass % function in a separate script so that I can call it to each material's script if that makes sense.
The pass % function is as follows:
def pass_percent(df_copy):
    pds = df_copy.value_counts()
    p = pds['PASS']
    try:
        f = pds['FAIL']
    except:
        f = 0
    print(pds)
    print(p)
    print(f)
    pass_pc = p/(p+f) *100
    print(pass_pc)
    return pass_pc
And then within the individual material script (e.g. material 1A) I have (among a few other things to tidy up the data frame before this - essentially getting rid of columns I don't need from the testing outputs):
from pass_pc_function import pass_percent
mat_1A = pd.pivot_table(df_copy, index='Prep_Method', columns='Test_Machine', aggfunc=pass_percent)
An example of what is happening is, for Material 1A I have 100 tests of Prep_Method A on Test_Machine 1 of which 65 passed and 35 failed, so a 65% pass rate. But value_counts() is returning 56 passes and 44 fails (so the total is still 100 which is correct but for some reason it is counting 9 passes as fails). This is just an example, I have much larger data sets than this but this is essentially what is happening.
I thought perhaps it could be a white space issue so I also have the line:
df_copy.columns = [x.strip() for x in df_copy.columns]
in my M1A script. However I am still getting this strange error.
Any advice would be appreciated. Thanks!
EDIT:
Results example as requested
PASS 31
FAIL 27
Name: Result, dtype: int64
31
27
53.44827586206896
Result
Test_Machine 1 2 3 4
Prep_Method
A 53.448276 89.655172 93.478261 97.916667
B 87.050360 90.833333 91.596639 97.468354
C 83.333333 93.150685 98.305085 100.000000
D 85.207101 94.339623 95.652174 97.163121
E 87.901701 96.310680 95.961538 98.655462
F 73.958333 82.178218 86.166008 93.750000
G 80.000000 91.743119 89.622642 98.529412
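This is not a diagnosis of the miscount, but one thing visible in the code above is that `.strip()` is applied to the column names, not to the values in the `Result` column. A minimal sketch of a way to cross-check the numbers, assuming the column is called `Result` (as in the printed output) and using made-up rows: normalise the values first, then let `pivot_table` average a 0/1 flag instead of going through `value_counts()`.

```python
import pandas as pd

# Made-up rows; note the stray spaces in some Result values.
df = pd.DataFrame({
    "Prep_Method": ["A", "A", "A", "A", "B", "B"],
    "Test_Machine": [1, 1, 1, 2, 1, 1],
    "Result": ["PASS", " PASS", "FAIL", "PASS", "FAIL ", "PASS"],
})

# Clean the VALUES, not just the column names.
df["Result"] = df["Result"].str.strip().str.upper()

# 1.0 for a pass, 0.0 otherwise; the mean of this flag is the pass rate.
df["passed"] = df["Result"].eq("PASS").astype(float)

table = pd.pivot_table(
    df,
    index="Prep_Method",
    columns="Test_Machine",
    values="passed",
    aggfunc="mean",
) * 100
print(table)
```

Passing `values=` explicitly also makes sure only the one column of interest is aggregated.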
The Problem:
I need a generic approach for the following problem. For one of many files, I have been able to grab a large block of text which takes the form:
Index
1 2 3 4 5 6
eigenvalues: -15.439 -1.127 -0.616 -0.616 -0.397 0.272
1 H 1 s 0.00077 -0.03644 0.03644 0.08129 -0.00540 0.00971
2 H 1 s 0.00894 -0.06056 0.06056 0.06085 0.04012 0.03791
3 N s 0.98804 -0.11806 0.11806 -0.11806 0.15166 0.03098
4 N s 0.09555 0.16636 -0.16636 0.16636 -0.30582 -0.67869
5 N px 0.00318 -0.21790 -0.50442 0.02287 0.27385 0.37400
7 8 9 10 11 12
eigenvalues: 0.373 0.373 1.168 1.168 1.321 1.415
1 H 1 s -0.77268 0.00312 -0.00312 -0.06776 0.06776 0.69619
2 H 1 s -0.52651 -0.03358 0.03358 0.02777 -0.02777 0.78110
3 N s -0.06684 0.06684 -0.06684 -0.01918 0.01918 0.01918
4 N s 0.23960 -0.23960 0.23961 -0.87672 0.87672 0.87672
5 N px 0.01104 -0.52127 -0.24407 -0.67837 -0.35571 -0.01102
13 14 15
eigenvalues: 1.592 1.592 2.588
1 H 1 s 0.01433 0.01433 -0.94568
2 H 1 s -0.18881 -0.18881 1.84419
3 N s 0.00813 0.00813 0.00813
4 N s 0.23298 0.23298 0.23299
5 N px -0.08906 0.12679 -0.01711
The problem is that I need extract only the coefficients, and I need to be able to reformat the table so that the coefficients can be read in rows not columns. The resulting array would have the form:
[[0.00077, 0.00894, 0.98804, 0.09555, 0.00318]
[-0.03644, -0.06056, -0.11806, 0.16636, -0.21790]
[0.03644, 0.06056, 0.11806, -0.16636, -0.50442]
[-0.00540, 0.04012, 0.15166, -0.30582, 0.27385]
[0.00971, 0.03791, 0.03098, -0.67869, 0.37400]
[-0.77268, -0.52651, -0.06684, 0.23960, 0.01104]
[0.00312, -0.03358, 0.06684, -0.23960, -0.52127]
...
[0.01433, -0.18881, 0.00813, 0.23298, 0.12679]
[-0.94568, 1.84419, 0.00813, 0.23299, -0.01711]]
This would be manageable for me if it wasn't for the fact that the number of columns changes with different files.
What I have tried:
I had earlier managed to get the eigenvalues by:
eigenvalues = []
with open('text', 'r+') as f:
    for n, line in enumerate(f):
        if (n >= start_section) and (n <= end_section):
            if 'eigenvalues' in line:
                eigenvalues.append(line.split()[1:])
flatten = [item for sublist in eigenvalues for item in sublist]
$ ['-15.439', '-1.127', '-0.616', '-0.616', '-0.397', '0.272', '0.373', '0.373', '1.168', '1.168', '1.321', '1.415', '1.592', '1.592', '2.588']
I attempted several variants of this; in the most recent approach I tried:
dir = {}
with open('text', 'r+') as f:
    for n, line in enumerate(f):
        if (n >= start_section) and (n <= end_section):
            for i in range(1, number_of_coefficients+1):
                if str(i) in line.split()[0]:
                    if line.split()[1].isdigit() == False:
                        if line.split()[3] in ['s', 'px', 'py', 'pz']:
                            dir[str(i)].append(line.split()[4:])
                        else:
                            dir[str(i)].append(line.split()[3:])
Which seemed to get me close, however, I got a strange duplication of numbers in random orders.
The idea was that I would then be able to convert the dictionary into the array.
Please HELP!!
EDIT:
The letters in the 3rd and sometimes 4th column are also variable (changing from, s, px, py, pz).
Here's one way to do it. This approach has a few noteworthy aspects.
First -- and this is key -- it processes the data section-by-section rather than line by line. To do that, you have to write some code to read the input lines and then yield them to the rest of the program in meaningful sections. Quite often, this preliminary step will radically simplify a parsing problem.
Second, once we have a section's worth of "rows" of coefficients, the other challenge is to reorient the data -- specifically to transpose it. I figured that someone smarter than I had already figured out a slick way to do this in Python, and StackOverflow did not disappoint.
Third, there are various ways to grab the coefficients from a section of input lines, but this type of fixed-width, report-style data output has a useful characteristic that can help with parsing: everything is vertically aligned. So rather than thinking of a clever way to grab the coefficients, we just grab the columns of interest -- line[20:].
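The transposition step from the second point can be illustrated on its own. This is a small sketch using the first three coefficient columns of the first three rows from the question; the variable names are only for illustration.

```python
# Each inner list is one input row (one basis function),
# holding its coefficients for orbitals 1, 2 and 3.
rows = [
    [0.00077, -0.03644, 0.03644],
    [0.00894, -0.06056, 0.06056],
    [0.98804, -0.11806, 0.11806],
]

# zip(*rows) pairs up the first items, then the second items, and so on,
# which turns columns into rows; map(list, ...) converts the tuples to lists.
transposed = list(map(list, zip(*rows)))

for column in transposed:
    print(column)
```

Because `zip` works on however many rows and columns it is given, the same line handles sections of different widths.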
import sys
def get_section(fh):
    # Takes an open file handle.
    # Yields each section of lines having coefficients.
    lines = []
    start = False
    for line in fh:
        if 'eigenvalues' in line:
            start = True
            if lines:
                yield lines
                lines = []
        elif start:
            lines.append(line)
            if 'px' in line:
                start = False
    if lines:
        yield lines

def main():
    coeffs = []
    with open(sys.argv[1]) as fh:
        for sect in get_section(fh):
            # Grab the rows from a section.
            rows = [
                [float(c) for c in line[20:].split()]
                for line in sect
            ]
            # Transpose them. See https://stackoverflow.com/questions/6473679
            transposed = list(map(list, zip(*rows)))
            # Add to the list-of-lists of coefficients.
            coeffs.extend(transposed)
    # Check.
    for cs in coeffs:
        print(cs)

main()
Output:
[0.00077, 0.00894, 0.98804, 0.09555, 0.00318]
[-0.03644, -0.06056, -0.11806, 0.16636, -0.2179]
[0.03644, 0.06056, 0.11806, -0.16636, -0.50442]
[0.08129, 0.06085, -0.11806, 0.16636, 0.02287]
[-0.0054, 0.04012, 0.15166, -0.30582, 0.27385]
[0.00971, 0.03791, 0.03098, -0.67869, 0.374]
[-0.77268, -0.52651, -0.06684, 0.2396, 0.01104]
[0.00312, -0.03358, 0.06684, -0.2396, -0.52127]
[-0.00312, 0.03358, -0.06684, 0.23961, -0.24407]
[-0.06776, 0.02777, -0.01918, -0.87672, -0.67837]
[0.06776, -0.02777, 0.01918, 0.87672, -0.35571]
[0.69619, 0.7811, 0.01918, 0.87672, -0.01102]
[0.01433, -0.18881, 0.00813, 0.23298, -0.08906]
[0.01433, -0.18881, 0.00813, 0.23298, 0.12679]
[-0.94568, 1.84419, 0.00813, 0.23299, -0.01711]
This question is related to HECRAS if anyone has experience, but in general it's just a question about writing text files to a very particular format to be read by the HECRAS software.
Basically I'm reading some files and altering some numbers, then writing it back out but I can't seem to match the initial format perfectly.
Here is what the original file looks like:
Type RM Length L Ch R = 1 ,229.41 ,21276,21276,21276
Node Last Edited Time=Oct-17-2019 15:52:28
#Sta/Elev= 452
0 20.097 67.042 9.137 67.43 9.139 68.208 9.073 68.598 9.129
68.986 9.086 70.538 9.071 70.926 9.042 71.984 9.046 72.48 9.025
73.646 9.056 74.368 9.034 75.586 9.042 76.55 9.017 77.138 9.047
78.304 8.989 79.47 9.025 80.19 9.001 81.41 9.003 81.974 8.978
83.83 9.005 85.284 9.079 85.682 9.068 86.97 9.118 88.012 9.223
88.79 9.239 89.65 9.316 90.342 9.324 91.134 9.475 91.966 9.525
92.282 9.589 93.346 9.546 94.222 9.557 94.922 9.594 95.71 9.591
96.546 9.64 97.286 9.574 98.87 9.688 99.258 9.673 99.642 9.712
#Mann= 3 , 0 , 0
0 .09 0 246.4 .028 0 286.4 .09 0
Bank Sta=246.4,286.4
XS Rating Curve= 0 ,0
XS HTab Starting El and Incr=1.708,0.1, 500
XS HTab Horizontal Distribution= 5 , 5 , 5
Exp/Cntr=0.3,0.1
I'm interested in the Sta/Elev data. It looks like a right-justified, tab- or space-delimited format with five station/elevation pairs per line, maybe 16 characters per pair?
I've tried a bunch of different things, my current code is:
with open('C:/Users/deden/Desktop/t/test.g01','w') as out:
    out.write(txt[:idx[0][0]])
    out.write(txt[idx[0][0]:idx[0][0]+bounds[0]])
    out.write('#'+raw_SE.split('\n')[0]+'\n')
    i = 0
    while i <= len(new_SE):
        out.write('\t'.join(new_SE[i:i+10])+'\n')
        i+=10
    out.write(txt[idx[0][0]+bounds[1]:idx[1][0]])
it's a little hacky atm, still trying to work it out, the important part is just:
while i <= len(new_SE):
    out.write('\t'.join(new_SE[i:i+10])+'\n')
    i+=10
new_SE is just a list of station/elevation:
['0', '30.097', '67.042', '19.137', '67.43', '19.139', '68.208', '19.073', '68.598', '19.128999999999998' ...]
I also tried playing around with justified side with something like:
'%8s %8s' % (tmp[0], tmp[1])
to basically have 8 spaces between the text but right justify them
honestly struggling...if anyone can recreate the original text in between #Sta/Elev= 452 and #Mann I would be eternally grateful, here is the full list if someone wants to give it a go:
new_SE = ['0', '30.097', '67.042', '19.137', '67.43', '19.139', '68.208', '19.073', '68.598', '19.128999999999998', '68.986', '19.086', '70.538', '19.070999999999998', '70.926', '19.042', '71.984', '19.046', '72.48', '19.025', '73.646', '19.055999999999997', '74.368', '19.034', '75.586', '19.042', '76.55', '19.017', '77.138', '19.047', '78.304', '18.989', '79.47', '19.025', '80.19', '19.000999999999998', '81.41', '19.003', '81.974', '18.978', '83.83', '19.005000000000003', '85.284', '19.079', '85.682', '19.067999999999998', '86.97', '19.118000000000002', '88.012', '19.223', '88.79', '19.239', '89.65', '19.316000000000003', '90.342', '19.323999999999998', '91.134', '19.475', '91.966', '19.525', '92.282', '19.589', '93.346', '19.546', '94.222', '19.557000000000002', '94.922', '19.594', '95.71', '19.591', '96.546', '19.64', '97.286', '19.573999999999998', '98.87', '19.688000000000002', '99.258', '19.673000000000002', '99.642', '19.712']
I'm not really sure I understood correctly, but please consider having a look at this:
# with open('C:/Users/deden/Desktop/t/test.g01','w') as out:
    for i in range(0, len(new_SE), 10):
        row = [f'{float(v):8.3f}' for v in new_SE[i:i+10]]
        out.write(''.join(row) + '\n')
# 0.000 30.097 67.042 19.137 67.430 19.139 68.208 19.073 68.598 19.129
# 68.986 19.086 70.538 19.071 70.926 19.042 71.984 19.046 72.480 19.025
# 73.646 19.056 74.368 19.034 75.586 19.042 76.550 19.017 77.138 19.047
# 78.304 18.989 79.470 19.025 80.190 19.001 81.410 19.003 81.974 18.978
# 83.830 19.005 85.284 19.079 85.682 19.068 86.970 19.118 88.012 19.223
# 88.790 19.239 89.650 19.316 90.342 19.324 91.134 19.475 91.966 19.525
# 92.282 19.589 93.346 19.546 94.222 19.557 94.922 19.594 95.710 19.591
# 96.546 19.640 97.286 19.574 98.870 19.688 99.258 19.673 99.642 19.712
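The original file appears to drop trailing zeros (72.48 rather than 72.480, and a bare 0). If that matters to the reader, a variant along these lines might get closer. It is only a sketch: the 8-character field width is a guess based on the sample in the question (16 characters per pair), not something taken from HEC-RAS documentation, and `fmt` and `format_rows` are names made up here.

```python
def fmt(value, width=8, decimals=3):
    """Round to `decimals` places, drop trailing zeros, right-justify."""
    text = f"{float(value):.{decimals}f}"
    if "." in text:
        text = text.rstrip("0").rstrip(".")
    return text.rjust(width)


def format_rows(values, per_line=10):
    """Join fixed-width fields, `per_line` values (5 pairs) per line."""
    lines = []
    for i in range(0, len(values), per_line):
        lines.append("".join(fmt(v) for v in values[i:i + per_line]))
    return lines


# First twelve values of new_SE from the question.
sample = ['0', '30.097', '67.042', '19.137', '67.43', '19.139',
          '68.208', '19.073', '68.598', '19.128999999999998',
          '68.986', '19.086']

for line in format_rows(sample):
    print(line)
```

Each returned line can then be written with `out.write(line + '\n')` in place of the tab-joined version.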
Here is what I did:
w = zeros(1,28);
e = zeros(1,63) + 1;
r = zeros(1,90) + 2;
t = zeros(1,100) + 3;
y = zeros(1,90) + 4;
u = zeros(1,63) + 5;
i = zeros(1,28) + 6;
qa = horzcat(w,e,r,t,y,u,i);
hist(qa,25,0.5)
h = findobj(gca,'Type','patch');
set(h,'FaceColor',[.955 0 0],'EdgeColor','w');
I would like to achieve the same effect, but in a more succinct way. This is my attempt:
v= zeros(1,28);
for i=2:8
v(i) = horzcat(v(i-1) + (i-1));
end
And the error I receive is "Cell contents assignment to a non-cell array object."
Also, would anyone know what the python equivalent would be, if it is not too much to ask?
You can also achieve this without a for loop, albeit somewhat less intuitively. But hey, it's without loops! Besides, it gives you the freedom to pick a different set of values.
v=[0;1;2;3;4;5;6]; %values
r=[28 63 90 100 90 63 28]; %number of repeats
qa=zeros(sum(r),1);
qa(cumsum([1 r(1:end-1)]))=1;
qa=v(cumsum(qa));
You don't need a cell array to concatenate vectors when one of the dimensions always remains the same (here, the number of rows).
You can define the sizes in a separate array and then use a for loop as follows.
szArray=[28 63 90 100 90 63 28];
qa=[];
for i=0:length(szArray)-1
    %multiplying by i replicates the addition of a scalar you have done.
    qa=[qa i*ones(1,szArray(i+1))];
end
This is still hardcoding. It will only apply to the exact problem you have mentioned above.
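As for the Python equivalent asked about in the question, one option (a sketch, assuming NumPy is available) is `numpy.repeat`, which takes the values and a matching list of repeat counts. The counts below are the ones from the question.

```python
import numpy as np

values = np.arange(7)                      # 0, 1, ..., 6
repeats = [28, 63, 90, 100, 90, 63, 28]    # how many times each value occurs

# Repeat each value the given number of times and concatenate the results.
qa = np.repeat(values, repeats)

print(qa.size)
print(np.bincount(qa))   # count of each value, as a sanity check
```

Changing `values` or `repeats` is enough to build a different vector, so nothing else in the snippet is tied to this particular case.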