Renumbering a Sequence of Numbers With Gaps using Python - python
I am trying to figure out how to renumber a certain file format and struggling to get it right.
First, a little background may help: There is a certain file format used in computational chemistry to describe the structure of a molecule with the extension .xyz. The first column is the number used to identify a specific atom (carbon, hydrogen, etc.), and the subsequent columns show what other atom numbers it is connected to. Below is a small sample of this file, but the usual file is significantly larger.
259 252
260 254
261 255
262 256
264 248 265 268
265 264 266 269 270
266 265 267 282
267 266
268 264
269 265
270 265 271 276 277
271 270 272 273
272 271 274 278
273 271 275 279
274 272 275 280
275 273 274 281
276 270
277 270
278 272
279 273
280 274
282 266 283 286
283 282 284 287 288
284 283 285 289
285 284
286 282
287 283
288 283
289 284 290 293
290 289 291 294 295
291 290 292 304
As you can see, the numbers 263 and 281 are missing. Of course, there could be many more missing numbers, so my script needs to account for this. Below is the code I have so far. The lists missing_nums and missing_nums2 are given as well, although I would normally obtain them from an earlier part of the script. The last element of missing_nums2 is where I want the numbering to finish, so in this case: 289.
missing_nums = ['263', '281']
missing_nums2 = ['281', '289']
with open("atom_nums.xyz", "r") as f2:
    lines = f2.read()
for i in range(0, len(missing_nums) - 1):
    if i == 0:
        with open("atom_nums_out.xyz", "w") as f2:
            replacement = int(missing_nums[i])
            for number in range(int(missing_nums[i]) + 1, int(missing_nums2[i])):
                lines = lines.replace(str(number), str(replacement))
                replacement += 1
            f2.write(lines)
    else:
        with open("atom_nums_out.xyz", "r") as f2:
            lines = f2.read()
        with open("atom_nums_out.xyz", "w") as f2:
            replacement = int(missing_nums[i]) - (i + 1)
            print(replacement)
            for number in range(int(missing_nums[i]), int(missing_nums2[i])):
                lines = lines.replace(str(number), str(replacement))
                replacement += 1
            f2.write(lines)
The problem is that as the file gets larger, repeated numbers appear for reasons I cannot figure out. I hope somebody can help me here.
EDIT: The desired output of the code using the above sample would be
259 252
260 254
261 255
262 256
263 248 264 267
264 263 265 268 269
265 264 266 280
266 265
267 263
268 264
269 264 270 275 276
270 269 271 272
271 270 273 277
272 270 274 278
273 271 274 279
274 272 273 279
275 269
276 269
277 271
278 272
279 273
280 265 281 284
281 280 282 285 286
282 281 283 287
283 282
284 280
285 281
286 281
287 282 288 291
288 287 289 292 293
289 288 290 302
Which is, indeed, what I get as the output for this small sample, but as the missing numbers increase it seems to not work and I get duplicate numbers. I can provide the whole file if anyone wants.
Thanks!
Assuming my interpretation of the lists missing_nums and missing_nums2 is correct, this is how I would perform the operation.
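from bisect import bisect_left
from os import replace

def fixFile(fn, mn1, mn2):
    # mn2 is kept for the original call signature but is not needed here:
    # every number simply drops by the count of missing numbers below it.
    missing = sorted(int(n) for n in mn1)
    with open(fn, "r") as fin:
        with open('tmp.txt', "w") as fout:
            for line in fin:
                # Renumber whole tokens, so "264" is never matched inside "1264".
                nums = [int(tok) for tok in line.split()]
                fout.write(" ".join(str(n - bisect_left(missing, n)) for n in nums) + "\n")
    # Swap the temp file in only after every line has been processed.
    replace('tmp.txt', fn)

missing_nums = ['263', '281']
missing_nums2 = ['281', '289']
fn = "atom_nums_out.xyz"
fixFile(fn, missing_nums, missing_nums2)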
Note that I read the file only once, one line at a time, and write the result out one line at a time. I then rename the temp file to the original filename after all the data is processed. This means that significantly longer files will not chew up memory.
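The renumbering rule can also be checked in memory before any file is touched. The sketch below is a minimal illustration, not necessarily the method either poster had in mind: find_gaps and renumber are hypothetical helper names, and it assumes that the gaps can be read off the first column and that every number should drop by the count of gaps below it. Working on whole tokens, instead of calling str.replace on the full text, avoids matching "264" inside a longer number such as "1264", which is one likely source of the duplicates in larger files.

```python
from bisect import bisect_left


def find_gaps(lines):
    """Return the sorted atom numbers missing from the first column."""
    ids = sorted(int(line.split()[0]) for line in lines if line.strip())
    present = set(ids)
    return [n for n in range(ids[0], ids[-1] + 1) if n not in present]


def renumber(lines, gaps):
    """Shift every number down by the count of gaps below it."""
    out = []
    for line in lines:
        nums = [int(tok) for tok in line.split()]
        out.append(" ".join(str(n - bisect_left(gaps, n)) for n in nums))
    return out


# Three lines taken from the sample in the question.
sample = ["262 256", "264 248 265 268", "265 264 266 269 270"]
gaps = find_gaps(sample)
print(gaps)
print(renumber(sample, gaps))
```

Because the shift for each number is computed from the original values, the order in which gaps are handled does not matter and no number is rewritten twice.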
Related
How to convert image Run length Encoded Pixels mask to binary mask and reshape in python?
I have a file containing EncodedPixels masks of different sizes. I want to convert these EncodedPixels to binary masks, resize them all to 1024, and then convert them back to EncodedPixels.

Explanation: The file holds image masks in EncodedPixels form, and the images have different dimensions (5000x5000, 260x260, etc.). I resized all the images to 1024x1024, and now I want to resize each image mask to match its 1024x1024 image. The only solution I can think of (there may be others) is to first convert the run-length-encoded pixels to a binary mask, which can then be resized easily.

File Link: link here

This code will be used to resize the binary mask:

from PIL import Image
import numpy as np
pil_image = Image.fromarray(binary_mask)
pil_image = pil_image.resize((new_width, new_height), Image.NEAREST)
resized_binary_mask = np.asarray(pil_image)

Encoded Pixels Example:

['6068157 7 6073371 20 6078584 34 6083797 48 6089010 62 6094223 72 6099436 76 6104649 80 6109862 85 6115075 89 6120288 93 6125501 98 6130714 102 6135927 106 6141140 111 6146354 114 6151567 118 6156780 123 6161993 127 6167206 131 6172419 136 6177632 140 6182845 144 6188058 149 6193271 153 6198484 157 6203697 162 6208910 166 6214124 169 6219337 174 6224550 178 6229763 182 6234976 187 6240189 191 6245402 195 6250615 200 6255828 204 6261041 208 6266254 213 6271467 218 6276680 224 6281893 229 6287107 233 6292320 238 6297533 244 6302746 249 6307959 254 6313172 259 6318385 265 6323598 270 6328811 275 6334024 280 6339237 286 6344450 291 6349663 296 6354877 300 6360090 306 6365303 311 6370516 316 6375729 322 6380942 327 6386155 332 6391368 337 6396581 343 6401794 348 6407007 353 6412220 358 6417433 364 6422647 368 6427860 373 6433073 378 6438286 384 6443499 389 6448712 394 6453925 399 6459138 405 6464351 410 6469564 415 6474777 420 6479990 426 17204187 78 17208797 227 17209412 56 17214025 203 17214637 34 17219253 179 17219862 11 17224481 155 17229709 131 17234937 107 17240165 83 17245393 60 17250621 36 17255849 12']
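The decode, resize, encode round trip described in the question could be sketched as follows. This is only an illustration under stated assumptions: the question does not say how the pixels are numbered, so the code assumes 1-indexed start positions counted down each column first (column-major). The helper names rle_decode, resize_nearest and rle_encode are hypothetical, and the pure-Python nearest-neighbour resize is a stand-in for the PIL call shown in the question, so the sketch runs without extra packages.

```python
def rle_decode(rle, height, width):
    """Decode 'start length start length ...' into a 2D list of 0/1.

    Assumption: starts are 1-indexed and pixels are numbered down each
    column first (column-major).
    """
    flat = [0] * (height * width)
    nums = [int(tok) for tok in rle.split()]
    for start, length in zip(nums[0::2], nums[1::2]):
        for pos in range(start - 1, start - 1 + length):
            flat[pos] = 1
    # flat[c * height + r] holds pixel (r, c) in column-major order.
    return [[flat[c * height + r] for c in range(width)] for r in range(height)]


def resize_nearest(mask, new_height, new_width):
    """Nearest-neighbour resize of a 2D list."""
    old_height, old_width = len(mask), len(mask[0])
    return [[mask[r * old_height // new_height][c * old_width // new_width]
             for c in range(new_width)]
            for r in range(new_height)]


def rle_encode(mask):
    """Encode a 2D list of 0/1 back into the same column-major RLE string."""
    height, width = len(mask), len(mask[0])
    flat = [mask[r][c] for c in range(width) for r in range(height)]
    runs = []
    run_start = None
    for pos, value in enumerate(flat):
        if value and run_start is None:
            run_start = pos                      # a new run begins here
        elif not value and run_start is not None:
            runs += [run_start + 1, pos - run_start]
            run_start = None
    if run_start is not None:                    # run reaching the last pixel
        runs += [run_start + 1, len(flat) - run_start]
    return " ".join(str(n) for n in runs)


# Tiny 2x2 mask whose left column is set, scaled up to 4x4.
small = rle_decode("1 2", 2, 2)
print(rle_encode(resize_nearest(small, 4, 4)))
```

For masks as large as 5000x5000 the pure-Python loops will be slow; the same three steps map directly onto NumPy arrays and the PIL resize from the question.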
extract only integer from txt
I have a txt file which contains lots of information. I do not want its head and tail; I need only the numbers in the middle, which form a 1x11200 matrix.

[txtpda]
LT=5.6
DATE=21.06.2018
TIME=14:11
CNT=11200
RES=0.00854518
N=5
VB=350
VT=0.5
LS=0
MEASTIME=201806211412
PICKUP=BFW-2
LC=0.8
[PROFILE]
255 256 258 264 269 273 267 258 251 255 259 262 260 256 255 260 264 266 265 263 261 263 267 275 280 280 280 280 283 284 283 277 279 280 283 285 283 282 280 280 286 288 298 299 299 299 304 303 300 297 295 296 299 301 303 301 299 296 298 299 302 303 304 307 308 312 313 314 312 311 311 310 312 310 309 305 303 299 297 294 288 280 270 266 250 242 222 213 199 180 173 ... -1062 -1063
[VALUES]
Ra;2;3;2;0.769;0;0;-1;0;-1;0
Rz;2;2;2;5.137;0;0;-1;0;-1;0
Pt;0;0;0;26.25;0;0;-1;0;-1;0
Wt;0;0;0;24.3;0;0;-1;0;-1;0

Right now I am using the following method to extract the numbers:

def OpenFile():
    name=askopenfilename(parent=root)
    f=open(name,'r')
    originalyvec1=[]
    yvec1=[]
    if f==0:
        print("fail to open the file")
    else:
        print("file successfully opened")
    data=f.readlines()
    for i in range(0,14):
        del data[0]  # delete its head (string)
    del data[11204]  # delete its tail (string)
    del data[11203]  # delete its tail (string)
    del data[11202]  # delete its tail (string)
    del data[11201]  # delete its tail (string)
    del data[11200]  # delete its tail (string)
    for line in data:
        for nbr in line.split():  # delete \n
            yvec1.append(int(nbr))
    if f.close()==0:
        print("fail to close file")
    else:
        print("file closed")

I want to use numpy to manage it in an easy way. Is that possible? Like np.array or something like that.
You can use an alternative form of iter(), where you pass iter() a function and it keeps calling that function until it sees the sentinel value (the 2nd arg). You can use this to skip lines until you see [PROFILE]\n and then use that same form of iter() to read until [VALUES]\n.

The function is just the one called by next(iterable), which is iterable.__next__, e.g.:

with open(name) as f:
    for _ in iter(f.__next__, '[PROFILE]\n'):  # Skip until PROFILE
        pass
    yvec1 = [int(d) for d in iter(f.__next__, '[VALUES]\n')]

yvec1 will now contain all values between [PROFILE] and [VALUES].

An alternative and potentially quicker way to consume the first iter() is to use collections.deque() instead of the for loop, but this is likely overkill for this problem, e.g.:

deque(iter(f.__next__, '[PROFILE]\n'), maxlen=0)

Note: using with will automatically close f at the end of the block.
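The two-argument iter() approach can be tried without a real file. The following is a minimal sketch: read_profile is a hypothetical helper name, io.StringIO stands in for the opened file, and the sample text assumes one number per line between the markers.

```python
import io


def read_profile(f):
    """Return the integers between the [PROFILE] and [VALUES] lines."""
    # Consume lines up to and including the [PROFILE] marker.
    for _ in iter(f.__next__, "[PROFILE]\n"):
        pass
    # Collect lines until the [VALUES] marker is reached.
    return [int(line) for line in iter(f.__next__, "[VALUES]\n")]


# Shortened stand-in for the file layout shown in the question.
sample = io.StringIO(
    "[txtpda]\nLT=5.6\nCNT=3\n[PROFILE]\n255\n256\n-1062\n[VALUES]\nRa;2;3\n"
)
print(read_profile(sample))
```

int() ignores the trailing newline on each line, so no explicit strip is needed.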
You can simply replace everything from the line data=f.readlines() and below with:

data = [int(line) for line in map(str.strip, f.readlines())
        if line.isdigit() or line.startswith('-') and line[1:].isdigit()]

And data will be the list of integers you're looking for.
Just to give you the idea, this may help. s3[0] will be all the numbers between [PROFILE] and [VALUES]:

# s = your data
s = 'sjlkf slflsafj[PROFILE]9723,2974982,2987492,886[VALUES]skjlfsajlsjal'
s2 = s.split('[PROFILE]')
s3 = s2[1].split('[VALUES]')
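Applied to the whole file contents, the split idea could look like the sketch below. It is an illustration only: between_markers is a hypothetical name, and it assumes the file is small enough to read into one string and that the values between the markers are separated by whitespace.

```python
def between_markers(text, start="[PROFILE]", end="[VALUES]"):
    """Return the integers found between the start and end markers."""
    after_start = text.split(start, 1)[1]   # everything after [PROFILE]
    middle = after_start.split(end, 1)[0]   # everything before [VALUES]
    return [int(tok) for tok in middle.split()]


text = "[txtpda]\nLT=5.6\n[PROFILE]\n255\n256\n-1062\n[VALUES]\nRa;2;3\n"
print(between_markers(text))
```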
How to display a sequence of numbers in column-major order?
Program description: Find all the prime numbers between 1 and 4,027 and print them in a table which "reads down", using as few rows as possible, and using as few sheets of paper as possible. (This is because I have to print them out on paper to turn it in.) All numbers should be right-justified in their column. The height of the columns should all be the same, except for perhaps the last column, which might have a few blank entries towards its bottom row.

The plan for my first function is to find all prime numbers in the range above and put them in a list. Then I want my second function to display the list in a table that reads from top to bottom.

 2 23 59
 3 29 61
 5 31 67
 7 37 71
11 41 73
13 43 79
17 47 83
19 53 89
etc...

This is all I've been able to come up with myself:

def findPrimes(n):
    """ Adds calculated prime numbers to a list. """
    prime_list = list()
    for number in range(1, n + 1):
        prime = True
        for i in range(2, number):
            if(number % i == 0):
                prime = False
        if prime:
            prime_list.append(number)
    return prime_list

def displayPrimes():
    pass

print(findPrimes(4027))

I'm not sure how to make a row/column display in Python. I remember using Java in my previous class and we had to use a for loop inside a for loop, I believe. Do I have to do something similar to that?
Although I frequently don't answer questions where the original poster hasn't even made an attempt to solve the problem themselves, I decided to make an exception of yours, mostly because I found it an interesting (and surprisingly challenging) problem that required solving a number of somewhat tricky sub-problems.

I also optimized your find_primes() function slightly by taking advantage of some relatively well-known computational shortcuts for calculating them.

For testing and demo purposes, I made the tables only 15 rows high to force more than one page to be generated, as shown in the output at the end.

from itertools import zip_longest
import locale
import math

locale.setlocale(locale.LC_ALL, '')  # enable locale-specific formatting

def zip_discard(*iterables, _NULL=object()):
    """ Like zip_longest() but doesn't fill out all rows to equal length.
        https://stackoverflow.com/questions/38054593/zip-longest-without-fillvalue
    """
    return [[entry for entry in iterable if entry is not _NULL]
                for iterable in zip_longest(*iterables, fillvalue=_NULL)]

def grouper(n, seq):
    """ Group elements in sequence into groups of "n" items. """
    for i in range(0, len(seq), n):
        yield seq[i:i+n]

def tabularize(width, height, numbers):
    """ Print list of numbers in column-major tabular form given the dimensions
        of the table in characters (rows and columns). Will create multiple
        tables if required to display all numbers.
    """
    # Determine number of chars needed to hold longest formatted numeric value
    gap = 2  # including space between numbers
    col_width = len('{:n}'.format(max(numbers))) + gap

    # Determine number of columns that will fit within the table's width.
    num_cols = width // col_width
    chunk_size = num_cols * height  # maximum numbers in each table

    for i, chunk in enumerate(grouper(chunk_size, numbers), start=1):
        print('---- Page {} ----'.format(i))
        num_rows = int(math.ceil(len(chunk) / num_cols))  # rounded up
        table = zip_discard(*grouper(num_rows, chunk))
        for row in table:
            print(''.join(('{:{width}n}'.format(num, width=col_width)
                            for num in row)))

def find_primes(n):
    """ Create list of prime numbers from 1 to n. """
    prime_list = []
    for number in range(1, n+1):
        for i in range(2, int(math.sqrt(number)) + 1):
            if not number % i:  # Evenly divisible?
                break  # Not prime.
        else:
            prime_list.append(number)
    return prime_list

primes = find_primes(4027)
tabularize(80, 15, primes)

Output:

---- Page 1 ----
      1     47    113    197    281    379    463    571    659    761    863
      2     53    127    199    283    383    467    577    661    769    877
      3     59    131    211    293    389    479    587    673    773    881
      5     61    137    223    307    397    487    593    677    787    883
      7     67    139    227    311    401    491    599    683    797    887
     11     71    149    229    313    409    499    601    691    809    907
     13     73    151    233    317    419    503    607    701    811    911
     17     79    157    239    331    421    509    613    709    821    919
     19     83    163    241    337    431    521    617    719    823    929
     23     89    167    251    347    433    523    619    727    827    937
     29     97    173    257    349    439    541    631    733    829    941
     31    101    179    263    353    443    547    641    739    839    947
     37    103    181    269    359    449    557    643    743    853    953
     41    107    191    271    367    457    563    647    751    857    967
     43    109    193    277    373    461    569    653    757    859    971
---- Page 2 ----
    977  1,069  1,187  1,291  1,427  1,511  1,613  1,733  1,867  1,987  2,087
    983  1,087  1,193  1,297  1,429  1,523  1,619  1,741  1,871  1,993  2,089
    991  1,091  1,201  1,301  1,433  1,531  1,621  1,747  1,873  1,997  2,099
    997  1,093  1,213  1,303  1,439  1,543  1,627  1,753  1,877  1,999  2,111
  1,009  1,097  1,217  1,307  1,447  1,549  1,637  1,759  1,879  2,003  2,113
  1,013  1,103  1,223  1,319  1,451  1,553  1,657  1,777  1,889  2,011  2,129
  1,019  1,109  1,229  1,321  1,453  1,559  1,663  1,783  1,901  2,017  2,131
  1,021  1,117  1,231  1,327  1,459  1,567  1,667  1,787  1,907  2,027  2,137
  1,031  1,123  1,237  1,361  1,471  1,571  1,669  1,789  1,913  2,029  2,141
  1,033  1,129  1,249  1,367  1,481  1,579  1,693  1,801  1,931  2,039  2,143
  1,039  1,151  1,259  1,373  1,483  1,583  1,697  1,811  1,933  2,053  2,153
  1,049  1,153  1,277  1,381  1,487  1,597  1,699  1,823  1,949  2,063  2,161
  1,051  1,163  1,279  1,399  1,489  1,601  1,709  1,831  1,951  2,069  2,179
  1,061  1,171  1,283  1,409  1,493  1,607  1,721  1,847  1,973  2,081  2,203
  1,063  1,181  1,289  1,423  1,499  1,609  1,723  1,861  1,979  2,083  2,207
---- Page 3 ----
  2,213  2,333  2,423  2,557  2,687  2,789  2,903  3,037  3,181  3,307  3,413
  2,221  2,339  2,437  2,579  2,689  2,791  2,909  3,041  3,187  3,313  3,433
  2,237  2,341  2,441  2,591  2,693  2,797  2,917  3,049  3,191  3,319  3,449
  2,239  2,347  2,447  2,593  2,699  2,801  2,927  3,061  3,203  3,323  3,457
  2,243  2,351  2,459  2,609  2,707  2,803  2,939  3,067  3,209  3,329  3,461
  2,251  2,357  2,467  2,617  2,711  2,819  2,953  3,079  3,217  3,331  3,463
  2,267  2,371  2,473  2,621  2,713  2,833  2,957  3,083  3,221  3,343  3,467
  2,269  2,377  2,477  2,633  2,719  2,837  2,963  3,089  3,229  3,347  3,469
  2,273  2,381  2,503  2,647  2,729  2,843  2,969  3,109  3,251  3,359  3,491
  2,281  2,383  2,521  2,657  2,731  2,851  2,971  3,119  3,253  3,361  3,499
  2,287  2,389  2,531  2,659  2,741  2,857  2,999  3,121  3,257  3,371  3,511
  2,293  2,393  2,539  2,663  2,749  2,861  3,001  3,137  3,259  3,373  3,517
  2,297  2,399  2,543  2,671  2,753  2,879  3,011  3,163  3,271  3,389  3,527
  2,309  2,411  2,549  2,677  2,767  2,887  3,019  3,167  3,299  3,391  3,529
  2,311  2,417  2,551  2,683  2,777  2,897  3,023  3,169  3,301  3,407  3,533
---- Page 4 ----
  3,539  3,581  3,623  3,673  3,719  3,769  3,823  3,877  3,919  3,967  4,019
  3,541  3,583  3,631  3,677  3,727  3,779  3,833  3,881  3,923  3,989  4,021
  3,547  3,593  3,637  3,691  3,733  3,793  3,847  3,889  3,929  4,001  4,027
  3,557  3,607  3,643  3,697  3,739  3,797  3,851  3,907  3,931  4,003
  3,559  3,613  3,659  3,701  3,761  3,803  3,853  3,911  3,943  4,007
  3,571  3,617  3,671  3,709  3,767  3,821  3,863  3,917  3,947  4,013
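The heart of a "reads down" table is the step that turns a flat list into rows. A stripped-down sketch of just that step is given below; column_major_rows and format_table are hypothetical names, and the slicing approach is a simpler alternative to the zip_longest-based code, not a restatement of it. It ignores paging and locale formatting.

```python
def column_major_rows(items, num_rows):
    """Arrange items so that reading down each column gives the original order."""
    # Row r holds every num_rows-th item starting at index r.
    return [items[r::num_rows] for r in range(num_rows)]


def format_table(items, num_rows):
    """Right-justify each entry in a column wide enough for the largest item."""
    width = len(str(max(items))) + 2
    return "\n".join(
        "".join(f"{n:>{width}}" for n in row)
        for row in column_major_rows(items, num_rows)
    )


primes = [2, 3, 5, 7, 11, 13, 17]
print(format_table(primes, 3))
```

With seven items and three rows, the last column is one entry tall and the shorter rows simply end early, matching the "last column may have blanks" requirement from the question.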
Separate specific value in a dataframe
I have a large dataset that I am trying to read with a Pandas DataFrame. I want to separate some values from one of the columns. Assume the column is named "A" and holds values ranging from 90 to 300. I want to select the values between 270 and 280. I tried the code below, but it is wrong!

%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('....csv')
df2 = df[ 270 < df['A'] < 280]
Use between with boolean indexing:

df = pd.DataFrame({'A':range(90,300)})
# pandas 1.3+ takes a string here; older versions used inclusive=False
df2 = df[df['A'].between(270,280, inclusive='neither')]
print (df2)
       A
181  271
182  272
183  273
184  274
185  275
186  276
187  277
188  278
189  279

Or:

df2 = df[(df['A'] > 270) & (df['A'] < 280)]
print (df2)
       A
181  271
182  272
183  273
184  274
185  275
186  276
187  277
188  278
189  279
Using numpy to speed things up and reconstruct a new dataframe. Assuming we use jezrael's sample data:

a = df.A.values
m = (a > 270) & (a < 280)
pd.DataFrame(a[m], df.index[m], df.columns)

       A
181  271
182  272
183  273
184  274
185  275
186  276
187  277
188  278
189  279
You can also use the query() method:

df2 = df.query("270 < A < 280")

Demo:

In [40]: df = pd.DataFrame({'A':range(90,300)})

In [41]: df.query("270 < A < 280")
Out[41]:
       A
181  271
182  272
183  273
184  274
185  275
186  276
187  277
188  278
189  279
Removing a recurrent regular expression in a string - Python
I have the following collection of items. I would like to add a comma followed by a space at the end of each item so I can create a list out of them. I am assuming the best way to do this is to form a string out of the items and then replace 3 spaces between each item with a comma, using regular expressions? I would like to do this with python, which I am new to. 179 181 191 193 197 199 211 223 227 229 233 239 241 251 257 263 269 271 277 281 283 293 307 311 313 317 331 337 347 349 353 359 367 373 379 383 389 397 401 409 419 421 431 433 439 443 449 457 461 463
Instead of a regular expression, how about this (assuming you have it in a file somewhere):

items = open('your_file.txt').read().split()

If it's just in a string variable:

items = your_input.split()

To combine them again with a comma in between:

print(', '.join(items))
import re

data = """179 181 191 193 197 199 211 223 227 229 233 239 241 251 257 263 269 271 277 281 """

To get the list out of it:

lst = re.findall(r"(\d+)", data)
print(lst)

To add a comma after each item, replace each run of spaces with a comma and a space:

data = re.sub("[ ]+", ", ", data)
print(data)
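Both approaches can be compared side by side in Python 3. The sketch below is a small illustration; the sample string is a shortened, made-up excerpt with three spaces between items, as described in the question.

```python
import re

data = "179   181   191   193\n197   199   211"

# split() with no arguments splits on any run of whitespace.
items = data.split()
joined = ", ".join(items)

# Regex variant: collapse each whitespace run into ", ".
via_regex = re.sub(r"\s+", ", ", data.strip())

print(items)
print(joined)
print(via_regex == joined)
```

The split/join version also yields the list itself, which is usually what is wanted; the regex version only produces the comma-separated string.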