Removing a recurrent regular expression in a string - Python

I have the following collection of items. I would like to add a comma followed by a space at the end of each item so I can create a list out of them. I am assuming the best way to do this is to form a string out of the items and then replace 3 spaces between each item with a comma, using regular expressions?
I would like to do this with Python, which I am new to.
179 181 191 193 197 199 211 223 227 229
233 239 241 251 257 263 269 271 277 281
283 293 307 311 313 317 331 337 347 349
353 359 367 373 379 383 389 397 401 409
419 421 431 433 439 443 449 457 461 463

Instead of a regular expression, how about this (assuming you have it in a file somewhere):
items = open('your_file.txt').read().split()
If it's just in a string variable:
items = your_input.split()
To combine them again with a comma in between:
print(', '.join(items))

import re
data = """179 181 191 193 197 199 211 223 227 229
233 239 241 251 257 263 269 271 277 281 """
To get the list out of it:
lst = re.findall(r"\d+", data)
print(lst)
To add a comma after each item, replace each run of whitespace with a comma and a space:
data = re.sub(r"\s+", ", ", data.strip())
print(data)
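If the goal is an actual list of numbers rather than a comma-separated string, a minimal sketch might look like the following. It assumes the items are whitespace-separated integers; the variable name `raw` is a hypothetical stand-in for the text from the question.

```python
# `raw` is a hypothetical stand-in for the text from the question.
raw = """179 181 191 193 197
233 239 241 251 257"""

# split() with no argument splits on any run of whitespace, newlines included.
numbers = [int(token) for token in raw.split()]

print(numbers)
print(", ".join(str(n) for n in numbers))
```

No regular expression is needed here, because split() already treats one space, three spaces, and line breaks the same way.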

Related

Renumbering a Sequence of Numbers With Gaps using Python

I am trying to figure out how to renumber a certain file format and struggling to get it right.
First, a little background may help: There is a certain file format used in computational chemistry to describe the structure of a molecule with the extension .xyz. The first column is the number used to identify a specific atom (carbon, hydrogen, etc.), and the subsequent columns show what other atom numbers it is connected to. Below is a small sample of this file, but the usual file is significantly larger.
259 252
260 254
261 255
262 256
264 248 265 268
265 264 266 269 270
266 265 267 282
267 266
268 264
269 265
270 265 271 276 277
271 270 272 273
272 271 274 278
273 271 275 279
274 272 275 280
275 273 274 281
276 270
277 270
278 272
279 273
280 274
282 266 283 286
283 282 284 287 288
284 283 285 289
285 284
286 282
287 283
288 283
289 284 290 293
290 289 291 294 295
291 290 292 304
As you can see, the numbers 263 and 281 are missing. Of course, there could be many more missing numbers, so my script needs to account for this. Below is the code I have so far. The lists missing_nums and missing_nums2 are given here as well, although I would normally obtain them from an earlier part of the script. The last element of the list missing_nums2 is where I want the numbering to finish, so in this case: 289.
missing_nums = ['263', '281']
missing_nums2 = ['281', '289']
with open("atom_nums.xyz", "r") as f2:
    lines = f2.read()
for i in range(0, len(missing_nums) - 1):
    if i == 0:
        with open("atom_nums_out.xyz", "w") as f2:
            replacement = int(missing_nums[i])
            for number in range(int(missing_nums[i]) + 1, int(missing_nums2[i])):
                lines = lines.replace(str(number), str(replacement))
                replacement += 1
            f2.write(lines)
    else:
        with open("atom_nums_out.xyz", "r") as f2:
            lines = f2.read()
        with open("atom_nums_out.xyz", "w") as f2:
            replacement = int(missing_nums[i]) - (i + 1)
            print(replacement)
            for number in range(int(missing_nums[i]), int(missing_nums2[i])):
                lines = lines.replace(str(number), str(replacement))
                replacement += 1
            f2.write(lines)
The problem is that as the file gets larger, numbers start to repeat for reasons I cannot figure out. I hope somebody can help me here.
EDIT: The desired output of the code using the above sample would be
259 252
260 254
261 255
262 256
263 248 264 267
264 263 265 268 269
265 264 266 280
266 265
267 263
268 264
269 264 270 275 276
270 269 271 272
271 270 273 277
272 270 274 278
273 271 274 279
274 272 273 279
275 269
276 269
277 271
278 272
279 273
280 265 281 284
281 280 282 285 286
282 281 283 287
283 282
284 280
285 281
286 281
287 282 288 291
288 287 289 292 293
289 288 290 302
Which is, indeed, what I get as the output for this small sample, but as the missing numbers increase it seems to not work and I get duplicate numbers. I can provide the whole file if anyone wants.
Thanks!
Assuming my interpretation of the lists missing_nums and missing_nums2 is correct, this is how I would perform the operation.
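from os import rename
def fixFile(fn, mn1, mn2):
    with open(fn, "r") as fin:
        with open('tmp.txt', "w") as fout:
            for line in fin:
                for i in range(len(mn1)):
                    minN = int(mn1[i])
                    maxN = int(mn2[i])
                    for nxtn in range(minN, maxN):
                        # str.replace returns a new string, so keep the result;
                        # each number above the gap moves down by one
                        line = line.replace(str(nxtn + 1), str(nxtn))
                fout.write(line)
    # swap the finished temp file in for the original
    rename('tmp.txt', fn)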
missing_nums = ['263', '281']
missing_nums2 = ['281', '289']
fn = "atom_nums_out.xyz"
fixFile(fn, missing_nums, missing_nums2)
Note that I read the file only once, one line at a time, and write the result out one line at a time. After all the data is processed, I rename the temp file to the original filename. This means that significantly longer files will not chew up memory.
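One thing to watch with str.replace is that it works on substrings, not whole numbers, so replacing 26 would also alter 264. A hedged alternative sketch is to split each line into tokens and shift every number down by the count of missing numbers below it. This assumes the missing numbers are already known (as in missing_nums) and that every token is a whitespace-separated integer; the function name `renumber` and the in-memory `sample` list are illustrative, not part of the original answer.

```python
from bisect import bisect_left

def renumber(lines, missing):
    """Shift every atom number down by the count of missing numbers below it."""
    gaps = sorted(int(m) for m in missing)
    out = []
    for line in lines:
        tokens = [int(tok) for tok in line.split()]
        # bisect_left counts how many gaps are strictly smaller than the token.
        shifted = [tok - bisect_left(gaps, tok) for tok in tokens]
        out.append(" ".join(str(tok) for tok in shifted))
    return out

# A few lines taken from the sample in the question.
sample = ["262 256", "264 248 265 268", "282 266 283 286", "291 290 292 304"]
for row in renumber(sample, ["263", "281"]):
    print(row)
```

Because each token is converted exactly once, a renumbered value can never be picked up and changed again by a later replacement, which is one plausible source of the duplicates described in the question.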

Python print string alignment

I am printing some values in a loop in Python. My current output is as follows:
0 Data Count: 249 7348 249 4469 2768 261 20 126
1 Data Count: 288 11 288 48 2284 598 137 408
2 Data Count: 808 999 808 2896 32739 138 202 678
3 Data Count: 140 26 140 2688 8054 884 433 987
What I'd like is for all values in each column to align, despite differing character/number counts in some, to make it easier to read.
The pseudo code behind this is as follows:
for i in range(0,3):
    print i, " Data Count: ", Count_A, " ", Count_B, " ", Count_C, " ", Count_D, " ", Count_E, " ", Count_F, " ", Count_G, " ", Count_H
Thanks in advance everyone!
You could use format string justification:
from random import randint
for i in range(5):
    data = [randint(0, 1000) for j in range(5)]
    print("{:5} {:5} {:5} {:5}".format(*data))
output:
   92   460    72   630
  837   214   118   677
  906   328   102   320
  895   998   177   922
  651   742   215   938
According to the format specification from Python docs
With the % string formatting operator, the minimum width of output is specified in a placeholder as a number before the data type (the full format of a placeholder is %[key][flags][width][.precision][length type]conversion type). If the result is shorter, it will be left-padded to the specified length:
from random import randint
for i in range(5):
    data = [randint(0, 1000) for j in range(5)]
    print("%5d %5d %5d %5d %5d" % tuple(data))
gives:
  946   937   544   636   871
  232   860   704   877   716
  868   849   851   488   739
  419   381   695   909   518
  570   756   467   351   537
(code adapted from @andreihondrari's answer)
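Applied to the rows in the question, the same idea could be sketched as below. The helper name `format_row`, the `rows` list, and the column width of 6 are assumptions chosen for illustration; the width only needs to be at least as large as the longest value plus one space.

```python
# Two rows of counts copied from the question's output.
rows = [
    [249, 7348, 249, 4469, 2768, 261, 20, 126],
    [288, 11, 288, 48, 2284, 598, 137, 408],
]

def format_row(index, counts, width=6):
    """Return one line with every count right-aligned in `width` characters."""
    cells = "".join("{:>{w}}".format(c, w=width) for c in counts)
    return "{} Data Count:{}".format(index, cells)

for i, counts in enumerate(rows):
    print(format_row(i, counts))
```

Passing the width as a nested format field keeps it in one place, so it can be widened if larger counts appear.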

extract only integer from txt

I have a txt file which contains lots of information. I do not want its header and footer; I need only the numbers in the middle, which form a 1x11200 matrix.
[txtpda]
LT=5.6
DATE=21.06.2018
TIME=14:11
CNT=11200
RES=0.00854518
N=5
VB=350
VT=0.5
LS=0
MEASTIME=201806211412
PICKUP=BFW-2
LC=0.8
[PROFILE]
255
256
258
264
269
273
267
258
251
255
259
262
260
256
255
260
264
266
265
263
261
263
267
275
280
280
280
280
283
284
283
277
279
280
283
285
283
282
280
280
286
288
298
299
299
299
304
303
300
297
295
296
299
301
303
301
299
296
298
299
302
303
304
307
308
312
313
314
312
311
311
310
312
310
309
305
303
299
297
294
288
280
270
266
250
242
222
213
199
180
173
...
-1062
-1063
[VALUES]
Ra;2;3;2;0.769;0;0;-1;0;-1;0
Rz;2;2;2;5.137;0;0;-1;0;-1;0
Pt;0;0;0;26.25;0;0;-1;0;-1;0
Wt;0;0;0;24.3;0;0;-1;0;-1;0
Now I am using the following method to extract the numbers:
def OpenFile():
    name=askopenfilename(parent=root)
    f=open(name,'r')
    originalyvec1=[]
    yvec1=[]
    if f==0:
        print("fail to open the file")
    else:
        print("file successfully opened")
        data=f.readlines()
        for i in range(0,14):
            del data[0]  # delete its head (strings)
        del data[11204]  # delete its tail (strings)
        del data[11203]  # delete its tail (strings)
        del data[11202]  # delete its tail (strings)
        del data[11201]  # delete its tail (strings)
        del data[11200]  # delete its tail (strings)
        for line in data:
            for nbr in line.split():  # split() also drops the trailing \n
                yvec1.append(int(nbr))
    if f.close()==0:
        print("fail to close file")
    else:
        print("file closed")
I want to use numpy to manage it in an easy way. Is that possible, with np.array or something like that?
You can use an alternative form of iter(), where you pass iter() a function and it keeps calling that function until it sees the sentinel value (the 2nd argument). You can use this to skip until you see [PROFILE]\n and then use that same form of iter() to read until [VALUES]\n. The function is just the one called by next(iterable), which is iterable.__next__, e.g.:
with open(name) as f:
    for _ in iter(f.__next__, '[PROFILE]\n'):  # Skip until PROFILE
        pass
    yvec1 = [int(d) for d in iter(f.__next__, '[VALUES]\n')]
yvec1 will now contain all values between [PROFILE] and [VALUES].
An alternative and potentially quicker way to consume the first iter() is to use collections.deque() instead of the for loop, but this is likely overkill for this problem, e.g.:
from collections import deque
deque(iter(f.__next__, '[PROFILE]\n'), maxlen=0)
Note: using with will automatically close f at the end of the block.
You can simply replace everything from the line data=f.readlines() and below with:
data = [int(line) for line in map(str.strip, f.readlines()) if line.isdigit() or line.startswith('-') and line[1:].isdigit()]
And data will be the list of integers you're looking for.
Just to give you the idea, this may help.
s3[0] will be everything between [PROFILE] and [VALUES]:
# s = your data
s = 'sjlkf slflsafj[PROFILE]9723,2974982,2987492,886[VALUES]skjlfsajlsjal'
s2 = s.split('[PROFILE]')
s3 = s2[1].split('[VALUES]')
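Another way to express "everything between the two markers" is with itertools. This is a minimal sketch, assuming the markers appear on their own lines exactly as `[PROFILE]` and `[VALUES]` and that every line between them is a single integer; the function name `read_profile` and the `sample` list are hypothetical.

```python
from itertools import dropwhile, takewhile

def read_profile(lines):
    """Return the integers found between the [PROFILE] and [VALUES] markers."""
    stripped = (line.strip() for line in lines)
    # Skip everything before the [PROFILE] marker, then drop the marker itself.
    after_header = dropwhile(lambda s: s != "[PROFILE]", stripped)
    next(after_header, None)
    # Collect lines until the [VALUES] marker, ignoring blank lines.
    body = takewhile(lambda s: s != "[VALUES]", after_header)
    return [int(s) for s in body if s]

# A shortened stand-in for the file contents shown in the question.
sample = ["[txtpda]\n", "LT=5.6\n", "[PROFILE]\n", "255\n", "256\n",
          "-1062\n", "[VALUES]\n", "Ra;2;3\n"]
print(read_profile(sample))
```

An open file object can be passed in place of `sample`, since iterating over a file yields its lines. If a numpy array is wanted, the returned list can be handed to numpy.array.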

How to display a sequence of numbers in column-major order?

Program description:
Find all the prime numbers between 1 and 4,027 and print them in a table which
"reads down", using as few rows as possible, and using as few sheets of paper
as possible. (This is because I have to print them out on paper to turn it in.) All numbers should be right-justified in their column. The height
of the columns should all be the same, except for perhaps the last column,
which might have a few blank entries towards its bottom row.
The plan for my first function is to find all prime numbers between the range above and put them in a list. Then I want my second function to display the list in a table that reads up to down.
2 23 59
3 29 61
5 31 67
7 37 71
11 41 73
13 43 79
17 47 83
19 53 89
etc.
This is all I've been able to come up with myself:
def findPrimes(n):
    """ Adds calculated prime numbers to a list. """
    prime_list = list()
    for number in range(1, n + 1):
        prime = True
        for i in range(2, number):
            if(number % i == 0):
                prime = False
        if prime:
            prime_list.append(number)
    return prime_list
def displayPrimes():
    pass
print(findPrimes(4027))
I'm not sure how to make a row/column display in Python. I remember using Java in my previous class and we had to use a for loop inside a for loop I believe. Do I have to do something similar to that?
Although I frequently don't answer questions where the original poster hasn't even made an attempt to solve the problem themselves, I decided to make an exception of yours—mostly because I found it an interesting (and surprisingly challenging) problem that required solving a number of somewhat tricky sub-problems.
I also optimized your find_primes() function slightly by taking advantage of some relatively well-known computational shortcuts for calculating them.
For testing and demo purposes, I made the tables only 15 rows high to force more than one page to be generated as shown in the output at the end.
from itertools import zip_longest
import locale
import math
locale.setlocale(locale.LC_ALL, '')  # enable locale-specific formatting
def zip_discard(*iterables, _NULL=object()):
    """ Like zip_longest() but doesn't fill out all rows to equal length.
        https://stackoverflow.com/questions/38054593/zip-longest-without-fillvalue
    """
    return [[entry for entry in iterable if entry is not _NULL]
            for iterable in zip_longest(*iterables, fillvalue=_NULL)]
def grouper(n, seq):
    """ Group elements in sequence into groups of "n" items. """
    for i in range(0, len(seq), n):
        yield seq[i:i+n]
def tabularize(width, height, numbers):
    """ Print list of numbers in column-major tabular form given the dimensions
        of the table in characters (rows and columns). Will create multiple
        tables if required to display all numbers.
    """
    # Determine number of chars needed to hold longest formatted numeric value
    gap = 2  # including space between numbers
    col_width = len('{:n}'.format(max(numbers))) + gap
    # Determine number of columns that will fit within the table's width.
    num_cols = width // col_width
    chunk_size = num_cols * height  # maximum numbers in each table
    for i, chunk in enumerate(grouper(chunk_size, numbers), start=1):
        print('---- Page {} ----'.format(i))
        num_rows = int(math.ceil(len(chunk) / num_cols))  # rounded up
        table = zip_discard(*grouper(num_rows, chunk))
        for row in table:
            print(''.join(('{:{width}n}'.format(num, width=col_width)
                           for num in row)))
def find_primes(n):
    """ Create list of prime numbers from 1 to n. """
    prime_list = []
    for number in range(1, n+1):
        for i in range(2, int(math.sqrt(number)) + 1):
            if not number % i:  # Evenly divisible?
                break  # Not prime.
        else:
            prime_list.append(number)
    return prime_list
primes = find_primes(4027)
tabularize(80, 15, primes)
Output:
---- Page 1 ----
1 47 113 197 281 379 463 571 659 761 863
2 53 127 199 283 383 467 577 661 769 877
3 59 131 211 293 389 479 587 673 773 881
5 61 137 223 307 397 487 593 677 787 883
7 67 139 227 311 401 491 599 683 797 887
11 71 149 229 313 409 499 601 691 809 907
13 73 151 233 317 419 503 607 701 811 911
17 79 157 239 331 421 509 613 709 821 919
19 83 163 241 337 431 521 617 719 823 929
23 89 167 251 347 433 523 619 727 827 937
29 97 173 257 349 439 541 631 733 829 941
31 101 179 263 353 443 547 641 739 839 947
37 103 181 269 359 449 557 643 743 853 953
41 107 191 271 367 457 563 647 751 857 967
43 109 193 277 373 461 569 653 757 859 971
---- Page 2 ----
977 1,069 1,187 1,291 1,427 1,511 1,613 1,733 1,867 1,987 2,087
983 1,087 1,193 1,297 1,429 1,523 1,619 1,741 1,871 1,993 2,089
991 1,091 1,201 1,301 1,433 1,531 1,621 1,747 1,873 1,997 2,099
997 1,093 1,213 1,303 1,439 1,543 1,627 1,753 1,877 1,999 2,111
1,009 1,097 1,217 1,307 1,447 1,549 1,637 1,759 1,879 2,003 2,113
1,013 1,103 1,223 1,319 1,451 1,553 1,657 1,777 1,889 2,011 2,129
1,019 1,109 1,229 1,321 1,453 1,559 1,663 1,783 1,901 2,017 2,131
1,021 1,117 1,231 1,327 1,459 1,567 1,667 1,787 1,907 2,027 2,137
1,031 1,123 1,237 1,361 1,471 1,571 1,669 1,789 1,913 2,029 2,141
1,033 1,129 1,249 1,367 1,481 1,579 1,693 1,801 1,931 2,039 2,143
1,039 1,151 1,259 1,373 1,483 1,583 1,697 1,811 1,933 2,053 2,153
1,049 1,153 1,277 1,381 1,487 1,597 1,699 1,823 1,949 2,063 2,161
1,051 1,163 1,279 1,399 1,489 1,601 1,709 1,831 1,951 2,069 2,179
1,061 1,171 1,283 1,409 1,493 1,607 1,721 1,847 1,973 2,081 2,203
1,063 1,181 1,289 1,423 1,499 1,609 1,723 1,861 1,979 2,083 2,207
---- Page 3 ----
2,213 2,333 2,423 2,557 2,687 2,789 2,903 3,037 3,181 3,307 3,413
2,221 2,339 2,437 2,579 2,689 2,791 2,909 3,041 3,187 3,313 3,433
2,237 2,341 2,441 2,591 2,693 2,797 2,917 3,049 3,191 3,319 3,449
2,239 2,347 2,447 2,593 2,699 2,801 2,927 3,061 3,203 3,323 3,457
2,243 2,351 2,459 2,609 2,707 2,803 2,939 3,067 3,209 3,329 3,461
2,251 2,357 2,467 2,617 2,711 2,819 2,953 3,079 3,217 3,331 3,463
2,267 2,371 2,473 2,621 2,713 2,833 2,957 3,083 3,221 3,343 3,467
2,269 2,377 2,477 2,633 2,719 2,837 2,963 3,089 3,229 3,347 3,469
2,273 2,381 2,503 2,647 2,729 2,843 2,969 3,109 3,251 3,359 3,491
2,281 2,383 2,521 2,657 2,731 2,851 2,971 3,119 3,253 3,361 3,499
2,287 2,389 2,531 2,659 2,741 2,857 2,999 3,121 3,257 3,371 3,511
2,293 2,393 2,539 2,663 2,749 2,861 3,001 3,137 3,259 3,373 3,517
2,297 2,399 2,543 2,671 2,753 2,879 3,011 3,163 3,271 3,389 3,527
2,309 2,411 2,549 2,677 2,767 2,887 3,019 3,167 3,299 3,391 3,529
2,311 2,417 2,551 2,683 2,777 2,897 3,023 3,169 3,301 3,407 3,533
---- Page 4 ----
3,539 3,581 3,623 3,673 3,719 3,769 3,823 3,877 3,919 3,967 4,019
3,541 3,583 3,631 3,677 3,727 3,779 3,833 3,881 3,923 3,989 4,021
3,547 3,593 3,637 3,691 3,733 3,793 3,847 3,889 3,929 4,001 4,027
3,557 3,607 3,643 3,697 3,739 3,797 3,851 3,907 3,931 4,003
3,559 3,613 3,659 3,701 3,761 3,803 3,853 3,911 3,943 4,007
3,571 3,617 3,671 3,709 3,767 3,821 3,863 3,917 3,947 4,013
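The core of the column-major layout can also be seen in isolation. The following is a simplified sketch, not the answer's implementation: it assumes a single page and a fixed maximum number of columns, and the function name `column_major_rows` and the short `primes` list are illustrative.

```python
import math

def column_major_rows(values, num_cols):
    """Arrange values so that reading down each column gives the original order."""
    num_rows = math.ceil(len(values) / num_cols)
    # Row r holds every num_rows-th value starting at index r.
    return [values[r::num_rows] for r in range(num_rows)]

primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
for row in column_major_rows(primes, 3):
    # Right-justify each number in a 4-character column.
    print("".join("{:>4}".format(p) for p in row))
```

Slicing with a step of `num_rows` picks out exactly the values that sit side by side in one printed row, so only the last column can end up shorter than the others.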

Separate specific value in a dataframe

I have a large dataset that I am reading into a pandas DataFrame. I want to separate some values from one of the columns. Assuming the name of the column is "A", it holds values ranging from 90 to 300, and I want to select the values between 270 and 280. I tried the code below, but it does not work:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('....csv')
df2 = df[ 270 < df['A'] < 280]
Use between with boolean indexing:
df = pd.DataFrame({'A':range(90,300)})
df2 = df[df['A'].between(270, 280, inclusive='neither')]  # pandas >= 1.3; older versions used inclusive=False
print (df2)
A
181 271
182 272
183 273
184 274
185 275
186 276
187 277
188 278
189 279
Or:
df2 = df[(df['A'] > 270) & (df['A'] < 280)]
print (df2)
A
181 271
182 272
183 273
184 274
185 275
186 276
187 277
188 278
189 279
Using numpy to speed things up and reconstruct a new dataframe.
Assuming we use jezrael's sample data
a = df.A.values
m = (a > 270) & (a < 280)
pd.DataFrame(a[m], df.index[m], df.columns)
A
181 271
182 272
183 273
184 274
185 275
186 276
187 277
188 278
189 279
You can also use the query() method:
df2 = df.query("270 < A < 280")
Demo:
In [40]: df = pd.DataFrame({'A':range(90,300)})
In [41]: df.query("270 < A < 280")
Out[41]:
A
181 271
182 272
183 273
184 274
185 275
186 276
187 277
188 278
189 279
