I have a large dataset that I am reading into a pandas DataFrame. I want to extract some values from one of the columns. Assuming the column is named "A", it contains values ranging from 90 to 300, and I want to select the values between 270 and 280. I tried the code below, but it is wrong!
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('....csv')
df2 = df[ 270 < df['A'] < 280]
Use between with boolean indexing:
df = pd.DataFrame({'A':range(90,300)})
df2 = df[df['A'].between(270,280, inclusive=False)]
print (df2)
A
181 271
182 272
183 273
184 274
185 275
186 276
187 277
188 278
189 279
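Note that in pandas 1.3 and later the boolean form of inclusive is deprecated in favour of a string, so the exclusive version there would be:
df2 = df[df['A'].between(270, 280, inclusive='neither')]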
Or:
df2 = df[(df['A'] > 270) & (df['A'] < 280)]
print (df2)
A
181 271
182 272
183 273
184 274
185 275
186 276
187 277
188 278
189 279
Using numpy to speed things up and reconstruct a new DataFrame, assuming we use jezrael's sample data:
a = df.A.values
m = (a > 270) & (a < 280)
pd.DataFrame(a[m], df.index[m], df.columns)
A
181 271
182 272
183 273
184 274
185 275
186 276
187 277
188 278
189 279
You can also use the query() method:
df2 = df.query("270 < A < 280")
Demo:
In [40]: df = pd.DataFrame({'A':range(90,300)})
In [41]: df.query("270 < A < 280")
Out[41]:
A
181 271
182 272
183 273
184 274
185 275
186 276
187 277
188 278
189 279
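If the bounds are stored in Python variables, query() can also reference them with the @ prefix (a small sketch using hypothetical variables low and high):
low, high = 270, 280
df2 = df.query("@low < A < @high")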
I created a dataframe with the help of a pivot, and I have:
name x y z All
A 155 202 218 575
C 206 149 45 400
B 368 215 275 858
Total 729 566 538 1833
I would like to sort by column "All" without taking the "Total" row into account. I am using:
df.sort_values(by = ["All"], ascending = False)
Thank you in advance!
If the Total row is the last one, you can sort the other rows and then concat the last row back:
df = pd.concat([df.iloc[:-1, :].sort_values(by="All"), df.iloc[-1:, :]])
print(df)
Prints:
name x y z All
C 206 149 45 400
A 155 202 218 575
B 368 215 275 858
Total 729 566 538 1833
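If you want the descending order you used in your attempt, pass ascending=False to sort_values:
df = pd.concat([df.iloc[:-1, :].sort_values(by="All", ascending=False), df.iloc[-1:, :]])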
You can try the following, although it raises a FutureWarning you should be careful of:
df = df.iloc[:-1,:].sort_values('All',ascending=False).append(df.iloc[-1,:])
This outputs:
name x y z All
2 B 368 215 275 858
0 A 155 202 218 575
1 C 206 149 45 400
3 Total 729 566 538 1833
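Note that DataFrame.append was removed in pandas 2.0, so on current versions the same idea has to be written with pd.concat, for example:
df = pd.concat([df.iloc[:-1, :].sort_values('All', ascending=False), df.iloc[[-1], :]])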
You can get the sorted order without Total (assuming here it is the last row), then index by position:
import numpy as np
idx = np.argsort(df['All'].iloc[:-1])
df2 = df.iloc[np.r_[idx[::-1], len(df)-1]]
NB: as we are only sorting an indexer here, this should be very fast.
output:
name x y z All
2 B 368 215 275 858
0 A 155 202 218 575
1 C 206 149 45 400
3 Total 729 566 538 1833
You can just ignore the last row (note that this drops the Total row from the result):
df.iloc[:-1].sort_values(by = ["All"], ascending = False)
I am trying to figure out how to renumber a certain file format and struggling to get it right.
First, a little background may help: there is a file format used in computational chemistry to describe the structure of a molecule, with the extension .xyz. The first column is the number identifying a specific atom (carbon, hydrogen, etc.), and the subsequent columns list the atom numbers it is connected to. Below is a small sample of this file; the usual file is significantly larger.
259 252
260 254
261 255
262 256
264 248 265 268
265 264 266 269 270
266 265 267 282
267 266
268 264
269 265
270 265 271 276 277
271 270 272 273
272 271 274 278
273 271 275 279
274 272 275 280
275 273 274 281
276 270
277 270
278 272
279 273
280 274
282 266 283 286
283 282 284 287 288
284 283 285 289
285 284
286 282
287 283
288 283
289 284 290 293
290 289 291 294 295
291 290 292 304
As you can see, the numbers 263 and 281 are missing. Of course, there could be many more missing numbers, so I need my script to account for this. Below is the code I have so far; the lists missing_nums and missing_nums2 are given explicitly here, but I would normally obtain them from an earlier part of the script. The last element of missing_nums2 is where I want the numbering to finish, so in this case 289.
missing_nums = ['263', '281']
missing_nums2 = ['281', '289']

with open("atom_nums.xyz", "r") as f2:
    lines = f2.read()

for i in range(0, len(missing_nums) - 1):
    if i == 0:
        with open("atom_nums_out.xyz", "w") as f2:
            replacement = int(missing_nums[i])
            for number in range(int(missing_nums[i]) + 1, int(missing_nums2[i])):
                lines = lines.replace(str(number), str(replacement))
                replacement += 1
            f2.write(lines)
    else:
        with open("atom_nums_out.xyz", "r") as f2:
            lines = f2.read()
        with open("atom_nums_out.xyz", "w") as f2:
            replacement = int(missing_nums[i]) - (i + 1)
            print(replacement)
            for number in range(int(missing_nums[i]), int(missing_nums2[i])):
                lines = lines.replace(str(number), str(replacement))
                replacement += 1
            f2.write(lines)
The problem is that as the file gets larger, there seem to be repeated numbers, for reasons I cannot figure out. I hope somebody can help me here.
EDIT: The desired output of the code using the above sample would be
259 252
260 254
261 255
262 256
263 248 264 267
264 263 265 268 269
265 264 266 280
266 265
267 263
268 264
269 264 270 275 276
270 269 271 272
271 270 273 277
272 270 274 278
273 271 274 279
274 272 273 279
275 269
276 269
277 271
278 272
279 273
280 265 281 284
281 280 282 285 286
282 281 283 287
283 282
284 280
285 281
286 281
287 282 288 291
288 287 289 292 293
289 288 290 302
This is, indeed, what I get as the output for this small sample, but as the number of missing values increases it stops working and I get duplicate numbers. I can provide the whole file if anyone wants.
Thanks!
Assuming my interpretation of the lists missing_nums and missing_nums2 is correct, this is how I would perform the operation:
from os import rename

def fixFile(fn, mn1, mn2):
    with open(fn, "r") as fin:
        with open('tmp.txt', "w") as fout:
            for line in fin:
                for i in range(len(mn1)):
                    minN = int(mn1[i])  # was mn1[1] in my first draft, an indexing typo
                    maxN = int(mn2[i])
                    for nxtn in range(minN, maxN):
                        # str.replace returns a new string, so it must be reassigned;
                        # shift each number in the gap down by one to match the desired output
                        line = line.replace(str(nxtn + 1), str(nxtn))
                fout.write(line)
    rename('tmp.txt', fn)

missing_nums = ['263', '281']
missing_nums2 = ['281', '289']
fn = "atom_nums_out.xyz"
fixFile(fn, missing_nums, missing_nums2)
Note that I am only reading the file once, a line at a time, and writing the result out a line at a time, then renaming the temp file to the original filename after all data is processed. This means significantly longer files will not chew up memory.
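As an alternative, here is a minimal sketch of a mapping-based approach (assuming every whitespace-separated token in the file is an atom number and that the missing numbers are already known, as above): each new number is the old number minus the count of missing numbers below it, which avoids any cascading string replacements.
import bisect

missing = [263, 281]  # gaps found earlier in the script, sorted ascending

def renumber(n):
    # shift n down by the number of missing atom numbers below it
    return n - bisect.bisect_left(missing, n)

with open("atom_nums.xyz") as fin, open("atom_nums_out.xyz", "w") as fout:
    for line in fin:
        fout.write(" ".join(str(renumber(int(tok))) for tok in line.split()) + "\n")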
I have a file containing EncodedPixels masks of different sizes.
I want to convert these EncodedPixels to binary masks, resize them all to 1024, and then convert them back to EncodedPixels.
Explanation:
The file contains image masks in EncodedPixels form, and the images have different dimensions (5000x5000, 260x260, etc.). So I resized all the images to 1024x1024, and now I want to resize each image mask to match the 1024x1024 images.
In my mind there is only one possible solution (there might be more): first convert the run-length-encoded pixels to a binary mask, after which the mask can be resized easily.
File Link: link here
This is the code I will use to resize the binary mask:
from PIL import Image
import numpy as np
pil_image = Image.fromarray(binary_mask)
pil_image = pil_image.resize((new_width, new_height), Image.NEAREST)
resized_binary_mask = np.asarray(pil_image)
Encoded Pixels Example
['6068157 7 6073371 20 6078584 34 6083797 48 6089010 62 6094223 72 6099436 76 6104649 80
6109862 85 6115075 89 6120288 93 6125501 98 6130714 102 6135927 106 6141140 111 6146354 114 6151567 118 6156780 123 6161993 127 6167206 131 6172419 136 6177632 140 6182845 144 6188058 149 6193271 153 6198484 157 6203697 162 6208910 166 6214124 169 6219337 174 6224550 178 6229763 182 6234976 187 6240189 191 6245402 195 6250615 200 6255828 204 6261041 208 6266254 213 6271467 218 6276680 224 6281893 229 6287107 233 6292320 238 6297533 244 6302746 249 6307959 254 6313172 259 6318385 265 6323598 270 6328811 275 6334024 280 6339237 286 6344450 291 6349663 296 6354877 300 6360090 306 6365303 311 6370516 316 6375729 322 6380942 327 6386155 332 6391368 337 6396581 343 6401794 348 6407007 353 6412220 358 6417433 364 6422647 368 6427860 373 6433073 378 6438286 384 6443499 389 6448712 394 6453925 399 6459138 405 6464351 410 6469564 415 6474777 420 6479990 426 17204187 78 17208797 227 17209412 56 17214025 203 17214637 34 17219253 179 17219862 11 17224481 155 17229709 131 17234937 107 17240165 83 17245393 60 17250621 36 17255849 12']
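This can be done by decoding each mask, resizing it, and re-encoding it. Below is a minimal sketch, assuming the EncodedPixels string is the usual run-length encoding of "start length" pairs with 1-based starts over the flattened mask in column-major (Fortran) order, as commonly used for segmentation masks on Kaggle; check which order and origin your file actually uses. The name rle_string and the (5000, 5000) shape are placeholders for one mask and its original image size.
from PIL import Image
import numpy as np

def rle_decode(rle, shape, order='F'):
    # "start length" pairs, 1-based starts over the flattened mask
    s = np.asarray(rle.split(), dtype=int)
    starts, lengths = s[0::2] - 1, s[1::2]
    mask = np.zeros(shape[0] * shape[1], dtype=np.uint8)
    for start, length in zip(starts, lengths):
        mask[start:start + length] = 1
    return mask.reshape(shape, order=order)

def rle_encode(mask, order='F'):
    # inverse of rle_decode: flatten, find the runs of 1s, emit "start length" pairs
    pixels = np.concatenate([[0], mask.flatten(order=order), [0]])
    runs = np.where(pixels[1:] != pixels[:-1])[0] + 1
    runs[1::2] -= runs[0::2]
    return ' '.join(str(x) for x in runs)

binary_mask = rle_decode(rle_string, (5000, 5000))  # one mask at its original size
pil_image = Image.fromarray(binary_mask * 255).resize((1024, 1024), Image.NEAREST)
resized_binary_mask = (np.asarray(pil_image) > 0).astype(np.uint8)
new_rle = rle_encode(resized_binary_mask)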
I have the following collection of items. I would like to add a comma followed by a space at the end of each item so I can create a list out of them. I am assuming the best way to do this is to form a string out of the items and then replace 3 spaces between each item with a comma, using regular expressions?
I would like to do this with python, which I am new to.
179 181 191 193 197 199 211 223 227 229
233 239 241 251 257 263 269 271 277 281
283 293 307 311 313 317 331 337 347 349
353 359 367 373 379 383 389 397 401 409
419 421 431 433 439 443 449 457 461 463
Instead of a regular expression, how about this (assuming you have it in a file somewhere):
items = open('your_file.txt').read().split()
If it's just in a string variable:
items = your_input.split()
To combine them again with a comma in between:
print(', '.join(items))
data = """179 181 191 193 197 199 211 223 227 229
233 239 241 251 257 263 269 271 277 281 """
To get the list out of it:
import re

lst = re.findall(r"(\d+)", data)
print(lst)
To add a comma after each item, replace each run of whitespace (including the line break) with a comma and a space:
data = re.sub(r"\s+", ", ", data.strip())
print(data)
I am looking for a clean way to reorder the index in a group.
Example code:
import numpy as np
import pandas as pd
mydates = pd.date_range('1/1/2012', periods=1000, freq='D')
myts = pd.Series(np.random.randn(len(mydates)), index=mydates)
grouped = myts.groupby(lambda x: x.timetuple()[7])
mymin = grouped.min()
mymax = grouped.max()
The above gives me what I want, aggregate stats by Julian day of the year, BUT I would then like to reorder the groups so the last half (183 days) is placed in front of the first half.
With a normal numpy array:
myindex = np.arange(1,367)
myindex = np.concatenate((myindex[183:],myindex[:183]))
But I can't do this with the groupby result; it raises a NotImplementedError.
Note: this is a cross post from google-groups. Also I have been reading on comp.lang.python, unfortunately people tend to ignore some posts e.g. from google groups.
Thanks in advance,
Bevan
Why not just reindex the result?
In [7]: mymin.reindex(myindex)
Out[7]:
184 -0.788140
185 -2.206314
186 0.284884
187 -2.197727
188 -0.714634
189 -1.082745
190 -0.789286
191 -1.489837
192 -1.278941
193 -0.795507
194 -0.661476
195 0.582994
196 -1.634310
197 0.104332
198 -0.602378
...
169 -1.150616
170 -0.315325
171 -2.233139
172 -1.081528
173 -1.316668
174 -0.963783
175 -0.215260
176 -2.723446
177 -0.493480
178 -0.706771
179 -2.082051
180 -1.066649
181 -1.455419
182 -0.332383
183 -1.277424
I'm not aware of a specific pandas function for this, but you could consider the np.roll() function:
myindex = np.arange(1,367)
myindex = np.roll(myindex, int(len(myindex)/2.))
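Then, as in the other answer, apply the rolled index with reindex; a small sketch combining the two suggestions:
myindex = np.arange(1, 367)
mymin.reindex(np.roll(myindex, len(myindex) // 2))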