How to convert from String to List/Array? - Python

I have 3 strings:
a = 38 186 298 345 0.93345
27 198 277 389 0.86006
33 127 293 354 0.89782
type(a) shows str, and len(a) is 22, which counts the spaces between the numbers.
I want to convert them to a nested list, like this:
b=[[38 186 298 345 0.93345][27 198 277 389 0.86006][33 127 293 354 0.89782]]

Is this what you aim for:
a = '''38 186 298 345 0.93345
27 198 277 389 0.86006
33 127 293 354 0.89782'''
b = [line.split() for line in a.split('\n')]
b
#[['38', '186', '298', '345', '0.93345'],
# ['27', '198', '277', '389', '0.86006'],
# ['33', '127', '293', '354', '0.89782']]

Split them by newlines and spaces. Use Python's built-in string method:
string_name.split(sep=None)
For more info: https://www.tutorialspoint.com/python/string_split.htm

This is one solution specific to your data.
Note that your inputs are not valid Python; I have corrected that below.
a1 = '38 186 298 345 0.93345'
a2 = '27 198 277 389 0.86006'
a3 = '33 127 293 354 0.89782'
res = [[float(j) if float(j) < 1 else int(j) for j in i.split()]
       for i in [a1, a2, a3]]
# [[38, 186, 298, 345, 0.93345],
#  [27, 198, 277, 389, 0.86006],
#  [33, 127, 293, 354, 0.89782]]
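The `float(j) < 1` check above works for this particular data but breaks as soon as a float of 1.0 or greater appears. A more general sketch tries int() first and falls back to float():

```python
a = '''38 186 298 345 0.93345
27 198 277 389 0.86006
33 127 293 354 0.89782'''

def to_number(token):
    # int() raises ValueError on tokens with a decimal point,
    # so fall back to float() for those.
    try:
        return int(token)
    except ValueError:
        return float(token)

b = [[to_number(tok) for tok in line.split()] for line in a.splitlines()]
# b[0] == [38, 186, 298, 345, 0.93345]
```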

Related

Pandas apply function on each row with a condition

I have a pandas DataFrame column, Code, that is of type int. I would like to extract the last 5 digits for rows where the length of this field is greater than 5. Sample data and my attempt below:
df['Code']
144 602000
145 602000
146 602000
147 602000
148 602000
...
571 84410
572 84410
573 84410
574 84410
575 684410
df['Code5'] = df['Code'].apply(lambda row: row['Code'].astype(str).str[-5:] if len(row['Code']) > 5 else row['Code'])
Error:
TypeError: 'int' object is not subscriptable
Try:
df['Code5'] = df['Code'].astype(str).str[-5:]
>>> df
Code Code5
144 602000 02000
145 602000 02000
146 602000 02000
147 602000 02000
148 602000 02000
571 84410 84410
572 84410 84410
573 84410 84410
574 84410 84410
575 684410 84410
999 1234 1234 # <- I added this sample
Another option is to fill value with less than 5 digits:
>>> df['Code'].astype(str).str[-5:].str.zfill(5)
144 02000
145 02000
146 02000
147 02000
148 02000
571 84410
572 84410
573 84410
574 84410
575 84410
999 01234 # <- padded with fillchar '0'
Name: Code, dtype: object
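If you really do want the conditional from the question (leave codes of 5 or fewer digits untouched, trim longer ones), one vectorized way is np.where on the string length; a small sketch on made-up sample values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Code': [602000, 84410, 684410, 1234]})
s = df['Code'].astype(str)
# Keep codes of 5 digits or fewer as-is; otherwise take the last 5 characters.
df['Code5'] = np.where(s.str.len() > 5, s.str[-5:], s)
# ['02000', '84410', '84410', '1234']
```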

Renumbering a Sequence of Numbers With Gaps using Python

I am trying to figure out how to renumber a certain file format and struggling to get it right.
First, a little background may help: There is a certain file format used in computational chemistry to describe the structure of a molecule with the extension .xyz. The first column is the number used to identify a specific atom (carbon, hydrogen, etc.), and the subsequent columns show what other atom numbers it is connected to. Below is a small sample of this file, but the usual file is significantly larger.
259 252
260 254
261 255
262 256
264 248 265 268
265 264 266 269 270
266 265 267 282
267 266
268 264
269 265
270 265 271 276 277
271 270 272 273
272 271 274 278
273 271 275 279
274 272 275 280
275 273 274 281
276 270
277 270
278 272
279 273
280 274
282 266 283 286
283 282 284 287 288
284 283 285 289
285 284
286 282
287 283
288 283
289 284 290 293
290 289 291 294 295
291 290 292 304
As you can see, the numbers 263 and 281 are missing. Of course, there could be many more missing numbers so I need my script to be able to account for this. Below is the code I have thus far, and the lists missing_nums and missing_nums2 are given as well, however, I would normally obtain them from an earlier part of the script. The last element of the list missing_nums2 is where I want numbering to finish, so in this case: 289.
missing_nums = ['263', '281']
missing_nums2 = ['281', '289']
with open("atom_nums.xyz", "r") as f2:
    lines = f2.read()

for i in range(0, len(missing_nums) - 1):
    if i == 0:
        with open("atom_nums_out.xyz", "w") as f2:
            replacement = int(missing_nums[i])
            for number in range(int(missing_nums[i]) + 1, int(missing_nums2[i])):
                lines = lines.replace(str(number), str(replacement))
                replacement += 1
            f2.write(lines)
    else:
        with open("atom_nums_out.xyz", "r") as f2:
            lines = f2.read()
        with open("atom_nums_out.xyz", "w") as f2:
            replacement = int(missing_nums[i]) - (i + 1)
            print(replacement)
            for number in range(int(missing_nums[i]), int(missing_nums2[i])):
                lines = lines.replace(str(number), str(replacement))
                replacement += 1
            f2.write(lines)
The problem lies in the fact that as the file gets larger, there seems to be repeats of numbers for reasons I cannot figure out. I hope somebody can help me here.
EDIT: The desired output of the code using the above sample would be
259 252
260 254
261 255
262 256
263 248 264 267
264 263 265 268 269
265 264 266 280
266 265
267 263
268 264
269 264 270 275 276
270 269 271 272
271 270 273 277
272 270 274 278
273 271 274 279
274 272 273 279
275 269
276 269
277 271
278 272
279 273
280 265 281 284
281 280 282 285 286
282 281 283 287
283 282
284 280
285 281
286 281
287 282 288 291
288 287 289 292 293
289 288 290 302
Which is, indeed, what I get as the output for this small sample, but as the missing numbers increase it seems to not work and I get duplicate numbers. I can provide the whole file if anyone wants.
Thanks!
Assuming my interpretation of the lists missing_nums and missing_nums2 is correct, this is how I would perform the operation.
from os import rename

def fixFile(fn, mn1, mn2):
    with open(fn, "r") as fin:
        with open('tmp.txt', "w") as fout:
            for line in fin:
                for i in range(len(mn1)):
                    minN = int(mn1[i])
                    maxN = int(mn2[i])
                    for nxtn in range(minN, maxN):
                        # str.replace returns a new string, so reassign it
                        line = line.replace(str(nxtn), str(nxtn + 1))
                fout.write(line)
    rename('tmp.txt', fn)

missing_nums = ['263', '281']
missing_nums2 = ['281', '289']
fn = "atom_nums_out.xyz"
fixFile(fn, missing_nums, missing_nums2)
Note that I read the file in only once, a line at a time, and write the result out a line at a time, then rename the temp file to the original filename after all data is processed. This means significantly longer files will not chew up memory.
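One likely source of the duplicate numbers is that substring-based replace can collide: replacing "26" also hits the "26" inside "263". A sketch that instead remaps whole whitespace-separated tokens, building a gap-free mapping from the numbers actually present in the file (assuming every number in the file should be renumbered consistently):

```python
def renumber_tokens(in_path, out_path):
    # Read all rows, collect the set of atom numbers actually used.
    with open(in_path) as f:
        rows = [line.split() for line in f if line.strip()]
    used = sorted({int(tok) for row in rows for tok in row})
    # Map each used number onto a gap-free sequence starting at the minimum,
    # e.g. with 263 and 281 missing, 264 -> 263, 282 -> 280, and so on.
    mapping = {old: new for new, old in enumerate(used, start=used[0])}
    with open(out_path, 'w') as f:
        for row in rows:
            f.write(' '.join(str(mapping[int(tok)]) for tok in row) + '\n')
```

Because each token is converted and remapped as a whole number, no replacement can ever touch part of another number.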

Splitting data into subsamples

I have a huge dataset which contains coordinates of particles. In order to split the data into test and training sets, I want to divide the space into many subspaces. I did this with a for-loop in every direction (x, y, z), but running the code takes very long and is not efficient enough, especially for large datasets:
particle_boxes = []
init = 0
final = 50
number_box = 5
for i in range(number_box):
    for j in range(number_box):
        for k in range(number_box):
            index_particle = (df_particles['X'].between(init+i*final, final+final*i) &
                              df_particles['Y'].between(init+j*final, final+final*j) &
                              df_particles['Z'].between(init+k*final, final+final*k))
            particle_boxes.append(df_particles[index_particle])
where init and final define the box size, df_particles contains every particle coordinate (x,y,z).
After running this particle_boxes contains 125 (number_box^3) equal spaced subboxes.
Is there any way to write this code more efficiently?
Note on efficiency
I conducted a number of tests using other tricks and nothing changed substantially. This is roughly as good as any other technique I used.
I'm curious to see if anyone else comes up with something order of magnitude faster.
Sample data
np.random.seed([3, 1415])
df_particles = pd.DataFrame(
np.random.randint(250, size=(1000, 3)),
columns=['X', 'Y', 'Z']
)
Solution
Construct an array a that represents your boundaries
a = np.array([50, 100, 150, 200, 250])
Then use searchsorted to create the individual dimensional bins
x_bin = a.searchsorted(df_particles['X'].to_numpy())
y_bin = a.searchsorted(df_particles['Y'].to_numpy())
z_bin = a.searchsorted(df_particles['Z'].to_numpy())
Use groupby on the three bins. I used trickery to get that into a dict
g = dict((*df_particles.groupby([x_bin, y_bin, z_bin]),))
We can see the first zone
g[(0, 0, 0)]
X Y Z
30 2 36 47
194 0 34 45
276 46 37 34
364 10 16 21
378 4 15 4
429 12 34 13
645 36 17 5
743 18 36 13
876 46 11 34
and the last
g[(4, 4, 4)]
X Y Z
87 223 236 213
125 206 241 249
174 218 247 221
234 222 204 237
298 208 211 225
461 234 204 238
596 209 229 241
731 210 220 242
761 225 215 231
762 206 241 240
840 211 241 238
846 212 242 241
899 249 203 228
970 214 217 232
981 236 216 248
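For completeness, the pieces above assembled into one runnable snippet, with a sanity check that every particle lands in exactly one box:

```python
import numpy as np
import pandas as pd

np.random.seed([3, 1415])
df_particles = pd.DataFrame(
    np.random.randint(250, size=(1000, 3)),
    columns=['X', 'Y', 'Z']
)

# Upper boundaries of the 5 bins along each axis.
a = np.array([50, 100, 150, 200, 250])
x_bin = a.searchsorted(df_particles['X'].to_numpy())
y_bin = a.searchsorted(df_particles['Y'].to_numpy())
z_bin = a.searchsorted(df_particles['Z'].to_numpy())

g = dict((*df_particles.groupby([x_bin, y_bin, z_bin]),))
# Every particle falls into exactly one of the (up to) 125 boxes.
assert sum(len(v) for v in g.values()) == len(df_particles)
```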
Instead of multiple nested for loops, consider one loop using itertools.product. But of course avoid any loops if possible, as @piRSquared shows:
from itertools import product

particle_boxes = []
for i, j, k in product(range(number_box), range(number_box), range(number_box)):
    index_particle = (df_particles['X'].between(init+i*final, final+final*i) &
                      df_particles['Y'].between(init+j*final, final+final*j) &
                      df_particles['Z'].between(init+k*final, final+final*k))
    particle_boxes.append(df_particles[index_particle])
Alternatively, with a list comprehension:
def sub_df(i, j, k):
    index_particle = (df_particles['X'].between(init+i*final, final+final*i) &
                      df_particles['Y'].between(init+j*final, final+final*j) &
                      df_particles['Z'].between(init+k*final, final+final*k))
    return df_particles[index_particle]

particle_boxes = [sub_df(i, j, k) for i, j, k in product(range(number_box), range(number_box), range(number_box))]
Have a look at the train_test_split function available in the scikit-learn lib. I think it is almost the kind of functionality that you need. The code is available on GitHub.
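If scikit-learn is not installed, the same kind of random train/test split can be sketched with NumPy alone (the tiny DataFrame here is just for illustration):

```python
import numpy as np
import pandas as pd

df_particles = pd.DataFrame({'X': range(10), 'Y': range(10), 'Z': range(10)})

rng = np.random.default_rng(0)
shuffled = rng.permutation(len(df_particles))  # random row order
n_test = int(0.2 * len(df_particles))          # 20% held out for testing
test = df_particles.iloc[shuffled[:n_test]]
train = df_particles.iloc[shuffled[n_test:]]
```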

Can't reshape to right shape size

My original dataset is 7049 images(96x96) with following format:
train_x.shape= (7049,)
train_x[:3]
0 238 236 237 238 240 240 239 241 241 243 240 23...
1 219 215 204 196 204 211 212 200 180 168 178 19...
2 144 142 159 180 188 188 184 180 167 132 84 59 ...
Name: Image, dtype: object
I want to split image-string into 96x96 and get the (7049,96,96) array.
I try this method:
def split_reshape(row):
    return np.array(row.split(' ')).reshape(96, 96)

result = train_x.apply(split_reshape)
Then I still got result.shape=(7049,)
How to reshape to (7049,96,96) ?
Demo:
Source Series:
In [129]: train_X
Out[129]:
0 238 236 237 238 240 240 239 241 241
1 219 215 204 196 204 211 212 200 180
2 144 142 159 180 188 188 184 180 167
Name: 1, dtype: object
In [130]: type(train_X)
Out[130]: pandas.core.series.Series
In [131]: train_X.shape
Out[131]: (3,)
Solution:
In [132]: X = train_X.str \
.split(expand=True) \
.astype(np.int16) \
.values.reshape(len(train_X), 3, 3)
In [133]: X
Out[133]:
array([[[238, 236, 237],
[238, 240, 240],
[239, 241, 241]],
[[219, 215, 204],
[196, 204, 211],
[212, 200, 180]],
[[144, 142, 159],
[180, 188, 188],
[184, 180, 167]]], dtype=int16)
In [134]: X.shape
Out[134]: (3, 3, 3)
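Note that the apply approach from the question does work per row; it just leaves you with a Series of arrays, which is why result.shape stays (7049,). Stacking those per-row arrays gives the 3-D array. A sketch on 3x3 toy data:

```python
import numpy as np
import pandas as pd

train_x = pd.Series([
    '238 236 237 238 240 240 239 241 241',
    '219 215 204 196 204 211 212 200 180',
    '144 142 159 180 188 188 184 180 167',
])

# Each row becomes its own 3x3 array inside an object Series...
result = train_x.apply(lambda s: np.array(s.split(), dtype=np.int16).reshape(3, 3))
# ...and np.stack combines them into one (3, 3, 3) array.
X = np.stack(result.tolist())
# X.shape == (3, 3, 3)
```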

Removing a recurrent regular expression in a string - Python

I have the following collection of items. I would like to add a comma followed by a space after each item so I can create a list out of them. I am assuming the best way to do this is to form a string out of the items and then replace the spaces between items with a comma, using regular expressions?
I would like to do this with Python, which I am new to.
179 181 191 193 197 199 211 223 227 229
233 239 241 251 257 263 269 271 277 281
283 293 307 311 313 317 331 337 347 349
353 359 367 373 379 383 389 397 401 409
419 421 431 433 439 443 449 457 461 463
Instead of a regular expression, how about this (assuming you have it in a file somewhere):
with open('your_file.txt') as f:
    items = f.read().split()
If it's just in a string variable:
items = your_input.split()
To combine them again with a comma in between:
print(', '.join(items))
import re

data = """179 181 191 193 197 199 211 223 227 229
233 239 241 251 257 263 269 271 277 281 """
To get the list out of it:
lst = re.findall(r"\d+", data)
print(lst)
To add a comma after each item, replace each run of whitespace (spaces or newlines) with ", ":
data = re.sub(r"\s+", ", ", data.strip())
print(data)
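If the goal is actually a Python list of numbers rather than a comma-separated string, str.split plus int gets there directly, no regex needed:

```python
data = """179 181 191 193 197 199 211 223 227 229
233 239 241 251 257 263 269 271 277 281"""

# split() with no arguments handles any mix of spaces and newlines
nums = [int(tok) for tok in data.split()]
# nums[:3] == [179, 181, 191]
```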
