Pandas .loc[] is extremely slow compared to dict - python

I have a DataFrame that has around 10 columns and 100K rows. I want to get every row in a loop using .loc[] on the index. However, .loc[] is extremely slow compared to Python's dict.
Here is the code to reproduce:
import pandas as pd
import random
import time
data = {}
for i in range(100000):
    data[i] = {
        'id': i,
        'a': random.randint(1, 40000),
        'b': random.randint(1, 40000),
        'c': random.randint(1, 40000),
        'd': random.randint(1, 40000),
        'e': random.randint(1, 40000),
        'f': random.randint(1, 40000),
    }
df = pd.DataFrame.from_dict(
    data=data,
    orient="index",
    dtype=int,
)
df.set_index('id', inplace=True)
dict_objs = df.to_dict('index')

start_time_dataframe = time.time()
for i in range(100000):
    obj = df.loc[i]
end_time_dataframe = time.time() - start_time_dataframe

start_time_dict = time.time()
for i in range(100000):
    obj = dict_objs[i]
end_time_dict = time.time() - start_time_dict
print(f"Time needed for DataFrame: {end_time_dataframe}") # 12.08s
print(f"Time needed for Dict: {end_time_dict}") # 0.01s
Why is DataFrame's .loc[] running so slow?
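In short: every df.loc[i] call has to resolve the label through the index machinery and then construct a brand-new Series object for the row, so the fixed per-call overhead dwarfs a plain dict lookup, which is a single hash probe. If you need many per-row accesses, the usual advice is to keep .loc[] for vectorised slicing and pre-convert for row-at-a-time work. A minimal sketch, assuming the df built above (timings are indicative only):
# iterate all rows once: itertuples() avoids per-lookup index resolution
for row in df.itertuples():
    obj = row  # namedtuple; row.Index is the id, row.a .. row.f the columns

# many random lookups: pre-convert once, then use plain dict access
dict_objs = df.to_dict('index')
obj = dict_objs[42]  # single hash lookup, no Series construction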


How to style pandas dataframe using for loop

I have a dataset where I need to display different values with different colors. Only some of the cells should be highlighted; the rest stay unstyled.
Here are some of the colors:
dict_colors = {'a': 'red', 'b': 'blue', 'e': 'tomato'}
How can I highlight all these cells with given colors?
MWE
# data
import pandas as pd
df = pd.DataFrame({'A': list('abcdef'), 'B': list('aabbcc'), 'C': list('aaabbb')})
# without for loop
(df.style
 .apply(lambda dfx: ['background: red' if val == 'a' else '' for val in dfx], axis=1)
 .apply(lambda dfx: ['background: blue' if val == 'b' else '' for val in dfx], axis=1)
)
# How to do this using a for loop? (I have so many values and different colors for them)
# My attempt
dict_colors = {'a': 'red', 'b': 'blue', 'e': 'tomato'}
s = df.style
for key, color in dict_colors.items():
    s = s.apply(lambda dfx: [f'background: {color}' if cell == key else '' for cell in dfx], axis=1)
display(s)
You can try this:
import pandas as pd

df = pd.DataFrame({'A': list('abcdef'), 'B': list('aabbcc'), 'C': list('aaabbb')})
dict_colors = {'a': 'red', 'b': 'blue', 'e': 'tomato'}

# create a Styler object for the DataFrame
s = df.style

def apply_color(val):
    if val in dict_colors:
        return f'background: {dict_colors[val]}'
    return ''

# apply the style to each cell
s = s.applymap(apply_color)

# display the styled DataFrame
display(s)
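Note: in recent pandas versions (2.1+), Styler.applymap has been renamed to Styler.map; applymap still works but emits a deprecation warning, so s = s.map(apply_color) is the forward-compatible spelling.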
I found a way using the eval method; it is not the most elegant approach, but it works.
import pandas as pd

df = pd.DataFrame({'A': list('abcdef'), 'B': list('aabbcc'), 'C': list('aaabbb')})
dict_colors = {'a': 'red', 'b': 'blue', 'e': 'tomato'}

lst = ['df.style']
for key, color in dict_colors.items():
    text = f".apply(lambda dfx: ['background: {color}' if cell == '{key}' else '' for cell in dfx], axis=1)"
    lst.append(text)
s = ''.join(lst)
display(eval(s))
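For the record, eval can be avoided entirely. The reason the plain for loop in the question misbehaves is Python's late binding of closure variables: by the time the styles are rendered, every lambda sees only the last key/color of the loop. Binding them as default arguments fixes that; a minimal sketch, assuming df and dict_colors as above:
s = df.style
for key, color in dict_colors.items():
    # default arguments are evaluated now, so each lambda keeps its own key/color
    s = s.apply(lambda dfx, key=key, color=color:
                [f'background: {color}' if cell == key else '' for cell in dfx],
                axis=1)
display(s)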

Get closest time from data

Below is the sample data:
{
    "a": "05:32",
    "b": "12:15",
    "c": "15:42",
    "d": "18:23"
}
Using this data, I want to get the closest next value to the current time.
I.e., if it is 15:30 right now, the query should return c.
I tried to do this with a for loop and it didn't seem very efficient.
You can use min with a custom key function:
d = {'a': '05:32', 'b': '12:15', 'c': '15:42', 'd': '18:23'}

def closest(c, t2=[15, 30]):
    a, b = map(int, d[c].split(':'))
    return abs((t2[-1] + 60 * t2[0]) - (b + 60 * a))

new_d = min(d, key=closest)
Output:
c
In general, you can replace t2 = [15, 30] (used only for demo purposes) with results from datetime.datetime.now:
from datetime import datetime

def closest(c, t2=[(n := datetime.now()).hour, n.minute]):
    ...
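Two hedged caveats about that last snippet: default arguments are evaluated once, at function definition time, so datetime.now() there is frozen when the def runs, which is fine for a one-off script but not for a long-running process. Also, min over an absolute difference returns the nearest time in either direction, while the question asks for the closest next value. A minimal sketch addressing both, assuming the same d as above (the name next_key is mine):
from datetime import datetime

def next_key(d):
    # current time in minutes since midnight, computed at call time
    now = datetime.now()
    now_min = now.hour * 60 + now.minute

    def to_min(key):
        h, m = map(int, d[key].split(':'))
        return h * 60 + m

    # keep only times at or after now; if none remain, wrap around to all keys
    future = [k for k in d if to_min(k) >= now_min] or list(d)
    return min(future, key=to_min)

d = {'a': '05:32', 'b': '12:15', 'c': '15:42', 'd': '18:23'}
print(next_key(d))  # at 15:30 this prints 'c'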

How can I create a stream of data from a pandas dataframe?

I am looking for a way to produce a stream of data from static data, e.g. I want to create a source where each row of data arrives after 10 ms. Is there a way to do it?
You could just iterate with a timer to wait, using yield to create a generator. I used itertuples, but you can change how you iterate the data:
import time
import pandas as pd

def yield_wait(frame, ms):
    for v in frame.itertuples():
        yield v
        time.sleep(ms / 1000)

if __name__ == '__main__':
    inp = [{'c1': 10, 'c2': 100}, {'c1': 11, 'c2': 110}, {'c1': 12, 'c2': 120}]
    df = pd.DataFrame(inp)
    for v in yield_wait(df, 1000):  # print every 1 sec
        print(v)
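If the consumer expects dict-like records rather than namedtuples (an assumption about your use case, not part of the original answer), the same pattern works with to_dict('records'):
def yield_wait_records(frame, ms):
    # each row arrives as a plain dict, e.g. {'c1': 10, 'c2': 100}
    for rec in frame.to_dict('records'):
        yield rec
        time.sleep(ms / 1000)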

Optimising quartiling of columns of a pandas dataframe?

I have multiple columns in a data frame that have numerical data. I want to quartile each column, changing each value to either q1, q2, q3 or q4.
I currently loop through each column and change them using the pandas qcut function:
for column_name in df.columns:
    df[column_name] = pd.qcut(df[column_name].astype('float'), 4, ['q1', 'q2', 'q3', 'q4'])
This is very slow! Is there a faster way to do this?
I played around with the following example a little. It looks like converting to float from a string is what increases the time, though a working example was not provided, so the original dtype can't be known. df[column].astype(..., copy=...) appears to perform the same whether it copies or not. Not much else to go after.
import pandas as pd
import numpy as np
import random
import time

random.seed(2)
indexes = [i for i in range(1, 10000) for _ in range(10)]
df = pd.DataFrame({'A': indexes,
                   'B': [str(random.randint(1, 99)) for e in indexes],
                   'C': [str(random.randint(1, 99)) for e in indexes],
                   'D': [str(random.randint(1, 99)) for e in indexes]})
#df = pd.DataFrame({'A': indexes, 'B': [random.randint(1, 99) for e in indexes], 'C': [random.randint(1, 99) for e in indexes], 'D': [random.randint(1, 99) for e in indexes]})
df_result = pd.DataFrame({'A': indexes,
                          'B': [random.randint(1, 99) for e in indexes],
                          'C': [random.randint(1, 99) for e in indexes],
                          'D': [random.randint(1, 99) for e in indexes]})

def qcut(copy, x):
    for i, column_name in enumerate(df.columns):
        s = pd.qcut(df[column_name].astype('float', copy=copy), 4, ['q1', 'q2', 'q3', 'q4'])
        df_result["col %d %d" % (x, i)] = s.values

times = []
for x in range(0, 10):
    a = time.perf_counter()  # time.clock() was removed in Python 3.8
    qcut(True, x)
    b = time.perf_counter()
    times.append(b - a)
print(np.mean(times))

for x in range(10, 20):
    a = time.perf_counter()
    qcut(False, x)
    b = time.perf_counter()
    times.append(b - a)
print(np.mean(times))
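Beyond the timing experiment: since the string-to-float conversion seems to dominate, one hedged simplification is to convert the whole frame to float once and let apply run qcut per column. This avoids a repeated astype per loop iteration, though qcut itself still runs once per column:
labels = ['q1', 'q2', 'q3', 'q4']
# one bulk dtype conversion, then qcut applied column by column
df_q = df.astype('float').apply(pd.qcut, q=4, labels=labels)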

Python: multiply csv data together with dict()

I posted below some code that works fine. What it does at the moment is:
it opens two .csv files, 'CMF.csv' and 'D65.csv', and then
performs some math on them.
Here's the simple structure of those files :
'CMF.csv' (wavelength, x, y, z)
400,1.879338E-02,2.589775E-03,8.508254E-02
410,8.277331E-02,1.041303E-02,3.832822E-01
420,2.077647E-01,2.576133E-02,9.933444E-01
...etc
'D65.csv': (wavelength, a, b)
400,82.7549,14.708
410,91.486,17.6753
420,93.4318,20.995
...etc
I have a third file, data.csv, with this structure (serialNumber, wavelength, measurement, name):
0,400,2.21,1
0,410,2.22,1
0,420,2.22,1
...
1,400,2.21,2
1,410,2.22,2
1,420,2.22,2
...etc
What I would like to do is write a few lines of code to perform
math on all the series of the last file (series are defined by their serial number and their name).
For example, I need a loop that will perform, for each name or serial number, and for each wavelength, the operation:
x * a * measurement
I tried to load data.csv in the csv reader like the other files, but I couldn't.
Any ideas?
Thanks
import csv

with open('CMF.csv') as cmf:
    reader = csv.reader(cmf)
    dict_cmf = dict()
    for row in reader:
        dict_cmf[float(row[0])] = row

with open('D65.csv') as d65:
    reader = csv.reader(d65)
    dict_d65 = dict()
    for row in reader:
        dict_d65[float(row[0])] = row

with open('data.csv') as sp:
    reader = csv.reader(sp)
    dict_sp = dict()
    for row in reader:
        dict_sp[float(row[0])] = row

X_total = 0
Y_total = 0
Z_total = 0
for i in range(400, 700, 10):
    X = float(dict_cmf[i][1]) * float(dict_d65[i][1])
    X_total = X_total + X
    Y = float(dict_cmf[i][2]) * float(dict_d65[i][1])
    Y_total = Y_total + Y
    Z = float(dict_cmf[i][3]) * float(dict_d65[i][1])
    Z_total = Z_total + Z

wp_X = 100 * X_total / Y_total
wp_Y = 100 * Y_total / Y_total
wp_Z = 100 * Z_total / Y_total

print(Y_total)
print("D65_CMF_2006_10_deg white point = ")
print(wp_X, wp_Y, wp_Z)
I get this:
Traceback (most recent call last):
  File "C:\Users\gary\Documents\eclipse\Spectro\1illum_XYZ2006_D65_numpy.py", line 24, in <module>
    dict_sp[row[0]] = row
IndexError: list index out of range
You need pandas. You can read the files into pandas tables, then join them to replace your code with this:
import pandas

cmf = pandas.read_csv('CMF.csv', names=['wavelength', 'x', 'y', 'z'])
d65 = pandas.read_csv('D65.csv', names=['wavelength', 'a', 'b'])
data = pandas.read_csv('data.csv', names=['serialNumber', 'wavelength', 'measurement', 'name'])

lookup = pandas.merge(cmf, d65, on='wavelength')
merged = pandas.merge(data, lookup, on='wavelength')

totals = ((lookup[['x', 'y', 'z']].T * lookup['a']).T).sum()
wps = 100 * totals / totals['y']  # scaled by 100, matching the wp_X/wp_Y/wp_Z formulas above

print(totals['y'])
print("D65_CMF_2006_10_deg white point = ")
print(wps)
Now, that doesn't do the last bit where you want to calculate extra values for each measurement. You can do this by adding a column to merged, like this:
merged['newcol'] = merged.x * merged.a * merged.measurement
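From there, if each series should be reduced to one number (an assumption; the question doesn't say which aggregate is wanted), you can group the new column by serial number or name:
# sum the x * a * measurement products per series
per_series = merged.groupby('serialNumber')['newcol'].sum()
print(per_series)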
One or more of the lines in data.csv does not contain what you think it does. Try putting your statement inside a try...except block to see what the problem is:
with open('data.csv') as sp:
    reader = csv.reader(sp)
    dict_sp = dict()
    for row in reader:
        try:
            dict_sp[float(row[0])] = row
        except IndexError:
            print('The problematic row is:')
            print(row)
            raise
A proper debugger would also be helpful in this kind of situation.
pandas is probably the better way to go, but if you'd like to see it done in vanilla Python, you can have a look at this example:
import csv
from collections import defaultdict

d = defaultdict(dict)
for fname, cols in [('CMF.csv', ('x', 'y', 'z')), ('D65.csv', ('a', 'b'))]:
    with open(fname) as ifile:
        reader = csv.reader(ifile)
        for row in reader:
            wl, values = int(row[0]), row[1:]
            d[wl].update(zip(cols, map(float, values)))

measurements = defaultdict(dict)
with open('data.csv') as ifile:
    reader = csv.reader(ifile)
    cols = ('measurement', 'name')
    for serial, wl, me, name in reader:
        measurements[int(serial)][int(wl)] = dict(zip(cols, (float(me), str(name))))

for serial in sorted(measurements.keys()):
    for wl in sorted(measurements[serial].keys()):
        me = measurements[serial][wl]['measurement']
        print(me * d[wl]['x'] * d[wl]['a'])
This will store x, y, z, a, and b in a dictionary inside a dictionary, with wavelength as the key (there is no apparent reason to store these values in separate dicts).
The measurements are stored in a two-level-deep dictionary with keys serial and wavelength. This way you can iterate over all serials and all corresponding wavelengths, as shown in the latter part of the code.
As for your specific calculations on the data in your example, this can be done quite easily with this structure:
tot_x = sum(v['x'] * v['a'] for v in d.values())
tot_y = sum(v['y'] * v['a'] for v in d.values())
tot_z = sum(v['z'] * v['a'] for v in d.values())

wp_x = 100 * tot_x / tot_y
wp_y = 100 * tot_y / tot_y  # Sure this is correct? It will always be 100
wp_z = 100 * tot_z / tot_y

print(wp_x, wp_y, wp_z)  # 798.56037811 100.0 3775.04316468
These are the dictionaries, given the input files in your question:
>>> from pprint import pprint
>>> pprint(dict(d))
{400: {'a': 82.7549,
       'b': 14.708,
       'x': 0.01879338,
       'y': 0.002589775,
       'z': 0.08508254},
 410: {'a': 91.486,
       'b': 17.6753,
       'x': 0.08277331,
       'y': 0.01041303,
       'z': 0.3832822},
 420: {'a': 93.4318,
       'b': 20.995,
       'x': 0.2077647,
       'y': 0.02576133,
       'z': 0.9933444}}
>>> pprint(dict(measurements))
{0: {400: {'measurement': 2.21, 'name': '1'},
     410: {'measurement': 2.22, 'name': '1'},
     420: {'measurement': 2.22, 'name': '1'}},
 1: {400: {'measurement': 2.21, 'name': '2'},
     410: {'measurement': 2.22, 'name': '2'},
     420: {'measurement': 2.22, 'name': '2'}}}
