Refactor python code: dynamic unpacking of list - python

I am given a frequency range start,end and an integer n:1<=n<=5. The task is to print all 5 linear-spaced mid points in a certain format (for other tools downstream to process it further .. this part is not imp here)
In case n is lesser than 5 the last frequency variables should be defaulted to 0 in the print. I have a working code given below.. I wanted to get more "pythonic" refactoring. Parts of the snippet that I particularly dislike:
No inherent support for dynamic unpacking of list into lesser no of variables. I have used a solution from here - which is over 2 yrs old so looking for a fresh perspective
I don't like having to convert np array to list .. Can np be bye-passed altogether .. is there a standard library range/linspace equivalent?
Code snippet:-
import sys
import numpy as np
max = 5
n=int(sys.argv[1])
if n>max:
print("No of frequency larger than "+ str(max) + " ..resetting")
n=max
if n<1:
print("No of frequency less than 1 resetting to 1")
n=1
fr1=0
fr2=0
fr3=0
fr4=0
fr5=0
start = 5060000
end = 6165000
range = end-start
inc = range/(n+1)
ret_list = np.arange(start+inc,end,inc).tolist()[:n]
ret_list = ret_list + [0]*(max-n)
fr1,fr2,fr3,fr4,fr5 = ret_list
print(".fr1.",fr1, sep = '')
print(".fr2.",fr2, sep = '')
print(".fr3.",fr3, sep = '')
print(".fr4.",fr4, sep = '')
print(".fr5.",fr5, sep = '')

Related

Very large pandas dataframe - keeping count

Assume this is a sample of my data: dataframe
the entire dataframe is stored in a csv file (dataframe.csv) that is 40GBs so I can't open all of it at once.
I am hoping to find the most dominant 25 names for all genders. My instinct is to create a for loop that runs through the file (because I can't open it at once), and have a python dictionary that holds the counter for each name (that I will increment as I go through the data).
To be honest, I'm confused on where to even start with this (how to create the dictionary, since to_dict() does not appear to do what I'm looking for). And also, if this is even a good solution? Is there a more efficient way someone can think of?
SUMMARY -- sorry if the question is a bit long:
the csv file storing the data is very big and I can't open it at once, but I'd like to find the top 25 dominant names in the data. Any ideas on what to do and how to do it?
I'd appreciate any help I can get! :)
Thanks for your interesting task! I've implemented pure numpy + pandas solution. It uses sorted array to keep names and counts. Hence algorithm should be around O(n * log n) complexity.
I didn't any hash table in numpy, hash table definitely would be faster (O(n)). Hence I used existing sorting/inserting routines of numpy.
Also I used .read_csv() from pandas with iterator = True, chunksize = 1 << 24 params, this allows reading file in chunks and producing pandas dataframes of fixed size from each chunk.
Note! In the first runs (until program is debugged) set limit_chunks (number of chunks to process) in code to small value (like 5). This is to check that whole program runs correctly on partial data.
Program needs to run one time command python -m pip install pandas numpy to install these 2 packages if you don't have them.
Progress is printed once in a while, total megabytes done plus speed.
Result will be printed to console plus saved to res_fname file name, all constants configuring script are placed in the beginning of script. topk constant controls how many top names will be outputed to file/console.
Interesting how fast is my solution. If it is to slow maybe I devote some time to write nice HashTable class using pure numpy.
You can also try and run next code here online.
import os, math, time, sys
# Needs: python -m pip install pandas numpy
import pandas as pd, numpy as np
import pandas, numpy
fname = 'test.csv'
fname_res = 'test.res'
chunk_size = 1 << 24
limit_chunks = None # Number of chunks to process, set to None if to process whole file
all_genders = ['Male', 'Female']
topk = 1000 # How many top names to output
progress_step = 1 << 23 # in bytes
fsize = os.path.getsize(fname)
#el_man = enlighten.get_manager() as el_man
#el_ctr = el_man.counter(color = 'green', total = math.ceil(fsize / 2 ** 20), unit = 'MiB', leave = False)
tables = {g : {
'vals': np.full([1], chr(0x10FFFF), dtype = np.str_),
'cnts': np.zeros([1], dtype = np.int64),
} for g in all_genders}
tb = time.time()
def Progress(
done, total = min([fsize] + ([chunk_size * limit_chunks] if limit_chunks is not None else [])),
cfg = {'progressed': 0, 'done': False},
):
if not cfg['done'] and (done - cfg['progressed'] >= progress_step or done >= total):
if done < total:
while cfg['progressed'] + progress_step <= done:
cfg['progressed'] += progress_step
else:
cfg['progressed'] = total
sys.stdout.write(
f'{str(round(cfg["progressed"] / 2 ** 20)).rjust(5)} MiB of ' +
f'{str(round(total / 2 ** 20)).rjust(5)} MiB ' +
f'speed {round(cfg["progressed"] / 2 ** 20 / (time.time() - tb), 4)} MiB/sec\n'
)
sys.stdout.flush()
if done >= total:
cfg['done'] = True
with open(fname, 'rb', buffering = 1 << 26) as f:
for i, df in enumerate(pd.read_csv(f, iterator = True, chunksize = chunk_size)):
if limit_chunks is not None and i >= limit_chunks:
break
if i == 0:
name_col = df.columns.get_loc('First Name')
gender_col = df.columns.get_loc('Gender')
names = np.array(df.iloc[:, name_col]).astype('str')
genders = np.array(df.iloc[:, gender_col]).astype('str')
for g in all_genders:
ctab = tables[g]
gnames = names[genders == g]
vals, cnts = np.unique(gnames, return_counts = True)
if vals.size == 0:
continue
if ctab['vals'].dtype.itemsize < names.dtype.itemsize:
ctab['vals'] = ctab['vals'].astype(names.dtype)
poss = np.searchsorted(ctab['vals'], vals)
exist = ctab['vals'][poss] == vals
ctab['cnts'][poss[exist]] += cnts[exist]
nexist = np.flatnonzero(exist == False)
ctab['vals'] = np.insert(ctab['vals'], poss[nexist], vals[nexist])
ctab['cnts'] = np.insert(ctab['cnts'], poss[nexist], cnts[nexist])
Progress(f.tell())
Progress(fsize)
with open(fname_res, 'w', encoding = 'utf-8') as f:
for g in all_genders:
f.write(f'{g}:\n\n')
print(g, '\n')
order = np.flip(np.argsort(tables[g]['cnts']))[:topk]
snames, scnts = tables[g]['vals'][order], tables[g]['cnts'][order]
if snames.size > 0:
for n, c in zip(np.nditer(snames), np.nditer(scnts)):
n, c = str(n), int(c)
if c == 0:
continue
f.write(f'{c} {n}\n')
print(c, n.encode('ascii', 'replace').decode('ascii'))
f.write(f'\n')
print()
import pandas as pd
df = pd.read_csv("sample_data.csv")
print(df['First Name'].value_counts())
The second line will convert your csv into a pandas dataframe and the third line should print the occurances of each name.
https://dfrieds.com/data-analysis/value-counts-python-pandas.html
This doesn't seem to be a case where pandas is really going to be an advantage. But if you're committed to going down that route, change the read_csv chunksize paramater, then filter out the useless columns.
Perhaps consider using a different set of tooling such as a database or even vanilla python using a generator to populate a dict in the form of name:count.

Why does concatenation in Python appear to be getting slower?

Why does it appear that concatenation in Python 3 is slower in some cases than in Python 2?
The most impacted method of concatenation appears to be successive concatenation of bytes objects, which has gone from an O(n) to O(n²) operation.
The bulk of my profiling code is here:
#!/usr/bin/env python
from operator import concat
from sys import version, version_info
from timeit import timeit # Compatibility: ver >= 2.6
# ver = version.partition('\n')[0].rstrip()
ver = '.'.join(str(v) for v in version_info[:3])
print(ver)
if version_info[0] == 2:
from StringIO import StringIO
else:
from io import StringIO
from functools import reduce
xrange = range
def build_plus():
output = ''
for _ in xrange(input_len):
output += 'a'
return output
def build_join():
return ''.join('a' for _ in xrange(input_len))
def build_bytes_plus():
output = b''
for _ in xrange(input_len):
output += b'a'
return output
def build_stringio():
output = StringIO()
for _ in xrange(input_len):
output.write('a')
return output.getvalue()
def build_reduce():
return reduce(concat, ('a' for _ in xrange(input_len)))
builds = {'str+': build_plus,
'join': build_join,
'reduce': build_reduce,
'bytes+': build_bytes_plus,
'StringIO': build_stringio}
if version_info[0] == 2:
import cStringIO
def build_cstringio():
output = cStringIO.StringIO()
for _ in xrange(input_len):
output.write('a')
return output.getvalue()
builds['cStringIO'] = build_cstringio
else:
from io import BytesIO
def build_bytesio():
output = BytesIO()
for _ in xrange(input_len):
output.write(b'a')
return output.getvalue()
builds['BytesIO'] = build_bytesio
resfile = open('times.csv', 'a')
size_range = 50 # Number of points over the size axis
min_order = 1.0 # 10^x byte input min
max_order = 5.0 # 10^x byte input max
for allow_gc in (False, True):
setup = 'gc.enable()' if allow_gc else 'pass'
for build_name, build_fun in builds.items():
# For a roughly constant confidence interval, aim for uniform sample density across the
# (logarithmic) input size axis.
for size_index in range(size_range+1):
input_len = int(10**((max_order-min_order)*size_index/size_range + min_order))
# Rather than repeating many measurements at one input size, perform one measurement
# per input size for a continuous range of input sizes and apply smoothing later.
dur = timeit(build_fun, setup, number=1)
resfile.write('"%s",%s,"%s",%d,%.6g\n' % (ver, str(allow_gc).upper(), build_name,
input_len, dur))
Some graphs from my R script shown here:
Concatenating strings with + or += in a loop was never a good idea. It only seemed efficient because there was a weird, controversial special case in the bytecode interpreter loop which would attempt to concatenate strings mutatively if it could prove no one else had a reference to the string it was messing with. There was no efficient resize policy in place; it just called realloc and hoped for the best, so it could still end up O(n^2) if realloc needed to copy.
In Python 3, that weird special case now handles unicode strings instead of bytestrings. Bytestring concatenation goes back to building a new string object each time, so your loop goes back to O(n^2).

Vectorize python code for improved performance

I am writing a scientific code in python to calculate the energy of a system.
Here is my function : cte1, cte2, cte3, cte4 are constants previously computed; pii is np.pi (calculated beforehand, since it slows the loop otherwise). I calculate the 3 components of the total energy, then sum them up.
def calc_energy(diam):
Energy1 = cte2*((pii*diam**2/4)*t)
Energy2 = cte4*(pii*diam)*t
d=diam/t
u=np.sqrt((d)**2/(1+d**2))
cc= u**2
E = sp.special.ellipe(cc)
K = sp.special.ellipk(cc)
Id=cte3*d*(d**2+(1-d**2)*E/u-K/u)
Energy3 = cte*t**3*Id
total_energy = Energy1+Energy2+Energy3
return (total_energy,Energy1)
My first idea was to simply loop over all values of the diameter :
start_diam, stop_diam, step_diam = 1e-10, 500e-6, 1e-9 #Diametre
diametres = np.arange(start_diam,stop_diam,step_diam)
for d in diametres:
res1,res2 = calc_energy(d)
totalEnergy.append(res1)
Energy1.append(res2)
In an attempt to speed up calculations, I decided to use numpy to vectorize, as shown below :
diams = diametres.reshape(-1,1) #If not reshaped, calculations won't run
r1 = np.apply_along_axis(calc_energy,1,diams)
However, the "vectorized" solution does not properly work. When timing I get 5 seconds for the first solution and 18 seconds for the second one.
I guess I'm doing something the wrong way but can't figure out what.
With your current approach, you're applying a Python function to each element of your array, which carries additional overhead. Instead, you can pass the whole array to your function and get an array of answers back. Your existing function appears to work fine without any modification.
import numpy as np
from scipy import special
cte = 2
cte1 = 2
cte2 = 2
cte3 = 2
cte4 = 2
pii = np.pi
t = 2
def calc_energy(diam):
Energy1 = cte2*((pii*diam**2/4)*t)
Energy2 = cte4*(pii*diam)*t
d=diam/t
u=np.sqrt((d)**2/(1+d**2))
cc= u**2
E = special.ellipe(cc)
K = special.ellipk(cc)
Id=cte3*d*(d**2+(1-d**2)*E/u-K/u)
Energy3 = cte*t**3*Id
total_energy = Energy1+Energy2+Energy3
return (total_energy,Energy1)
start_diam, stop_diam, step_diam = 1e-10, 500e-6, 1e-9 #Diametre
diametres = np.arange(start_diam,stop_diam,step_diam)
a = calc_energy(diametres) # Pass the whole array

Optimizing Algorithm of large dataset calculations

Once again I find myself stumped with pandas, and how to best perform a 'vector operation'. My code works, however it will take a long time to iterate through everything.
What the code is trying to do is loop through shapes.cv and determine which shape_pt_sequence is a stop_id, and then assigns the stop_lat and stop_lon to shape_pt_lat and shape_pt_lon, while also marking the shape_pt_sequence as is_stop.
GISTS
stop_times.csv LINK
trips.csv LINK
shapes.csv LINK
Here is my code:
import pandas as pd
from haversine import *
'''
iterate through shapes and match stops along a shape_pt_sequence within
x amount of distance. for shape_pt_sequence that is closest, replace the stop
lat/lon to the shape_pt_lat/shape_pt_lon, and mark is_stop column with 1.
'''
# readability assignments for shapes.csv
shapes = pd.read_csv('csv/shapes.csv')
shapes_index = list(set(shapes['shape_id']))
shapes_index.sort(key=int)
shapes.set_index(['shape_id', 'shape_pt_sequence'], inplace=True)
# readability assignments for trips.csv
trips = pd.read_csv('csv/trips.csv')
trips_index = list(set(trips['trip_id']))
trips.set_index(['trip_id'], inplace=True)
# readability assignments for stops_times.csv
stop_times = pd.read_csv('csv/stop_times.csv')
stop_times.set_index(['trip_id','stop_sequence'], inplace=True)
print(len(stop_times.loc[1423492]))
# readability assginments for stops.csv
stops = pd.read_csv('csv/stops.csv')
stops.set_index(['stop_id'], inplace=True)
# for each trip_id
for i in trips_index:
print('******NEW TRIP_ID******')
print(i)
i = i.astype(int)
# for each stop_sequence in stop_times
for x in range(len(stop_times.loc[i])):
stop_lat = stop_times.loc[i,['stop_lat','stop_lon']].iloc[x,[0,1]][0]
stop_lon = stop_times.loc[i,['stop_lat','stop_lon']].iloc[x,[0,1]][1]
stop_coordinate = (stop_lat, stop_lon)
print(stop_coordinate)
# shape_id that matches trip_id
print('**SHAPE_ID**')
trips_shape_id = trips.loc[i,['shape_id']].iloc[0]
trips_shape_id = int(trips_shape_id)
print(trips_shape_id)
smallest = 0
for y in range(len(shapes.loc[trips_shape_id])):
shape_lat = shapes.loc[trips_shape_id].iloc[y,[0,1]][0]
shape_lon = shapes.loc[trips_shape_id].iloc[y,[0,1]][1]
shape_coordinate = (shape_lat, shape_lon)
haversined = haversine_mi(stop_coordinate, shape_coordinate)
if smallest == 0 or haversined < smallest:
smallest = haversined
smallest_shape_pt_indexer = y
else:
pass
print(haversined)
print('{0:.20f}'.format(smallest))
print('{0:.20f}'.format(smallest))
print(smallest_shape_pt_indexer)
# mark is_stop as 1
shapes.iloc[smallest_shape_pt_indexer,[2]] = 1
# replace coordinate value
shapes.loc[trips_shape_id].iloc[y,[0,1]][0] = stop_lat
shapes.loc[trips_shape_id].iloc[y,[0,1]][1] = stop_lon
shapes.to_csv('csv/shapes.csv', index=False)
What you could do to optmizing this code is use some threads/workers instead those for.
I recommend using the Pool of Workes as its very simple to use.
In:
for i in trips_index:
You could use something like:
from multiprocessing import Pool
pool = Pool(processes=4)
result = pool.apply_async(func, trips_index)
And than the method func would be like:
def func(i):
#code here
And you could simply put the whole for loop inside this method.
It would make it work with 4 subprocess in this example, git it a nice improvment.
One thing to consider is that a collection of trips will often have the same sequence of stops and the same shape data (the only difference between trips is the timing). So it might make sense to cache the find-closest-point-on-shape operation for (stop_id, shape_id). I bet that would reduce your runtime by an order-of-magnitude.

Importing big tecplot block files in python as fast as possible

I want to import in python some ascii file ( from tecplot, software for cfd post processing).
Rules for those files are (at least, for those that I need to import):
The file is divided in several section
Each section has two lines as header like:
VARIABLES = "x" "y" "z" "ro" "rovx" "rovy" "rovz" "roE" "M" "p" "Pi" "tsta" "tgen"
ZONE T="Window(s) : E_W_Block0002_ALL", I=29, J=17, K=25, F=BLOCK
Each section has a set of variable given by the first line. When a section ends, a new section starts with two similar lines.
For each variable there are I*J*K values.
Each variable is a continous block of values.
There are a fixed number of values per row (6).
When a variable ends, the next one starts in a new line.
Variables are "IJK ordered data".The I-index varies the fastest; the J-index the next fastest; the K-index the slowest. The I-index should be the inner loop, the K-index shoould be the outer loop, and the J-index the loop in between.
Here is an example of data:
VARIABLES = "x" "y" "z" "ro" "rovx" "rovy" "rovz" "roE" "M" "p" "Pi" "tsta" "tgen"
ZONE T="Window(s) : E_W_Block0002_ALL", I=29, J=17, K=25, F=BLOCK
-3.9999999E+00 -3.3327306E+00 -2.7760824E+00 -2.3117116E+00 -1.9243209E+00 -1.6011492E+00
[...]
0.0000000E+00 #fin first variable
-4.3532482E-02 -4.3584235E-02 -4.3627592E-02 -4.3663762E-02 -4.3693815E-02 -4.3718831E-02 #second variable, 'y'
[...]
1.0738781E-01 #end of second variable
[...]
[...]
VARIABLES = "x" "y" "z" "ro" "rovx" "rovy" "rovz" "roE" "M" "p" "Pi" "tsta" "tgen" #next zone
ZONE T="Window(s) : E_W_Block0003_ALL", I=17, J=17, K=25, F=BLOCK
I am quite new at python and I have written a code to import the data to a dictionary, writing the variables as 3D numpy.array . Those files could be very big, (up to Gb). How can I make this code faster? (or more generally, how can I import such files as fast as possible)?
import re
from numpy import zeros, array, prod
def vectorr(I, J, K):
"""function"""
vect = []
for k in range(0, K):
for j in range(0, J):
for i in range(0, I):
vect.append([i, j, k])
return vect
a = open('E:\u.dat')
filelist = a.readlines()
NumberCol = 6
count = 0
data = dict()
leng = len(filelist)
countzone = 0
while count < leng:
strVARIABLES = re.findall('VARIABLES', filelist[count])
variables = re.findall(r'"(.*?)"', filelist[count])
countzone = countzone+1
data[countzone] = {key:[] for key in variables}
count = count+1
strI = re.findall('I=....', filelist[count])
strI = re.findall('\d+', strI[0])
I = int(strI[0])
##
strJ = re.findall('J=....', filelist[count])
strJ = re.findall('\d+', strJ[0])
J = int(strJ[0])
##
strK = re.findall('K=....', filelist[count])
strK = re.findall('\d+', strK[0])
K = int(strK[0])
data[countzone]['indmax'] = array([I, J, K])
pr = prod(data[countzone]['indmax'])
lin = pr // NumberCol
if pr%NumberCol != 0:
lin = lin+1
vect = vectorr(I, J, K)
for key in variables:
init = zeros((I, J, K))
for ii in range(0, lin):
count = count+1
temp = map(float, filelist[count].split())
for iii in range(0, len(temp)):
init.itemset(tuple(vect[ii*6+iii]), temp[iii])
data[countzone][key] = init
count = count+1
Ps. In python, no cython or other languages
Converting a large bunch of strings to numbers is always going to be a little slow, but assuming the triple-nested for-loop is the bottleneck here maybe changing it to the following gives you a sufficient speedup:
# add this line to your imports
from numpy import fromstring
# replace the nested for-loop with:
count += 1
for key in variables:
str_vector = ' '.join(filelist[count:count+lin])
ar = fromstring(str_vector, sep=' ')
ar = ar.reshape((I, J, K), order='F')
data[countzone][key] = ar
count += lin
Unfortunately at the moment I only have access to my smartphone (no pc) so I can't test how fast this is or even if it works correctly or at all!
Update
Finally I got around to doing some testing:
My code contained a small error, but it does seem to work correctly now.
The code with the proposed changes runs about 4 times faster than the original
Your code spends most of its time on ndarray.itemset and probably loop overhead and float conversion. Unfortunately cProfile doesn't show this in much detail..
The improved code spends about 70% of time in numpy.fromstring, which, in my view, indicates that this method is reasonably fast for what you can achieve with Python / NumPy.
Update 2
Of course even better would be to iterate over the file instead of loading everything all at once. In this case this is slightly faster (I tried it) and significantly reduces memory use. You could also try to use multiple CPU cores to do the loading and conversion to floats, but then it becomes difficult to have all the data under one variable. Finally a word of warning: the fromstring method that I used scales rather bad with the length of the string. E.g. from a certain string length it becomes more efficient to use something like np.fromiter(itertools.imap(float, str_vector.split()), dtype=float).
If you use regular expressions here, there's two things that I would change:
Compile REs which are used more often (which applies to all REs in your example, I guess). Do regex=re.compile("<pattern>") on them, and use the resulting object with match=regex.match(), as described in the Python documentation.
For the I, J, K REs, consider reducing two REs to one, using the grouping feature (also described above), by searching for a pattern of the form "I=(\d+)", and grabbing the part matched inside the parentheses using regex.group(1). Taking this further, you can define a single regex to capture all three variables in one step.
At least for starting the sections, REs seem a bit overkill: There's no variation in the string you need to look for, and string.find() is sufficient and probably faster in that case.
EDIT: I just saw you use grouping already for the variables...

Categories

Resources