Related
I am new to python/pandas/numpy and I need to create the following Dataframe:
DF = pd.concat([pd.Series(x[2]).apply(lambda r: pd.Series(re.split('\#|/',r))).assign(id=x[0]) for x in hDF])
where hDF is a dataframe that has been created by:
hDF=pd.DataFrame(h.DF)
and h.DF is a list whose elements looks like this:
['5203906',
['highway=primary',
'maxspeed=30',
'oneway=yes',
'ref=N 22',
'surface=asphalt'],
['3655224911#1.735928/42.543651',
'3655224917#1.735766/42.543561',
'3655224916#1.735694/42.543523',
'3655224915#1.735597/42.543474',
'4817024439#1.735581/42.543469']]
However, in some cases the list is very long (O(10^7)) and also the list in h.DF[*][2] is very long, so I run out of memory.
I can obtain the same result, avoiding the use of the lambda function, like so:
DF = pd.concat([pd.Series(x[2]).str.split('\#|/', expand=True).assign(id=x[0]) for x in hDF])
But I am still running out of memory in the cases where the lists are very long.
Can you think of a possible solution to obtain the same results without starving resources?
I managed to make it work using the following code:
bl = []
for x in h.DF:
data = np.loadtxt(
np.loadtxt(x[2], dtype=str, delimiter="#")[:, 1], dtype=float, delimiter="/"
).tolist()
[i.append(x[0]) for i in data]
bl.append(data)
bbl = list(itertools.chain.from_iterable(bl))
DF = pd.DataFrame(bbl).rename(columns={0: "lon", 1: "lat", 2: "wayid"})
Now it's super fast :)
Please be nice - I'm not a proper programmer, I'm a scientist and I've read as many docs on this as I can find (they're a bit sparse).
I'm trying to convert this pandas code into dash because my input file is ~0.5TB with gz and it loads too slowly in native pandas. I have a 3 TB machine, btw.
This is an example of what I'm doing with pandas:
df = pd.DataFrame([['chr1',33329,17,'''33)'6'4?1&AB=?+..''','''X%&=E&!%,0("&"Y&!'''],
['chr1',33330,15,'''6+'/7=1#><C1*'*''','''X%=E!%,("&"Y&&!'''],
['chr1',33331,13,'''2*3A#/9#CC3--''','''X%E!%,("&"Y&!'''],
['chr1',33332,1,'''4**(,:3)+7-#<(0-''','''X%&E&!%,0("&"Y&!'''],
['chr1',33333,2,'''66(/C=*42A:.&*''','''X%=&!%0("&"&&!''']],
columns = ['chrom','pos','depth','phred','map'])
df.loc[:,'phred'] = [(sum(map(ord,i))-len(i)*33)/len(i) for i in df.loc[:,"phred"]]
df.loc[:,"map"] = [(sum(map(ord,i)))/len(i) for i in df.loc[:,"map"]]
df = df.astype({'phred': 'int32', 'map': 'int32'})
df.query('(depth < 10) | (phred < 7) | (map < 10)', inplace=True)
for chrom, df_tmp in df.groupby('chrom'):
df_end = df_tmp[~((df_tmp.pos.shift(0) == df_tmp.pos.shift(-1)-1))]
df_start = df_tmp[~((df_tmp.pos.shift(0) == df_tmp.pos.shift(+1)+1))]
for start, end in zip(df_start.pos, df_end.pos):
print (start, end)
Gives
33332 33333
This works (to find regions of a cancer genome with no data) and it's optimised as much as I know how.
I load the real thing like:
df = pd.read_csv(
'/Users/liamm/Downloads/test_head33333.tsv.gz',
sep='\t',
header=None,
index_col=None,
usecols=[0,1,3,5,6],
names = ['chrom','pos','depth','phred','map']
)
and I can do the same with Dask (way faster!):
df = dd.read_csv(
'/Users/liamm/Downloads/test_head33333.tsv.gz',
sep='\t',
header=None,
usecols=[0,1,3,5,6],
compression='gzip',
blocksize=None,
names = ['chrom','pos','depth','phred','map']
)
but i'm stuck here:
ff=[(sum(map(ord,i))-len(i)*33)/len(i) for i in df.loc[:,"phred"]]
df['phred'] = ff
Error: Column assignment doesn't support type list
Question - is this sort of thing possible? If so are there good tutes somewhere? I need to convert the whole block of pandas code above.
Thanks in advance!
You created list comprehensions to transform 'Fred' and 'map'; I converted these list comps to functions, and wrapped the functions in np.vectorize().
def func_p(p):
return (sum(map(ord, p)) - len(p) * 33) / len(p)
def func_m(m):
return (sum(map(ord, m))) / len(m)
vec_func_p = np.vectorize(func_p)
vec_func_m = np.vectorize(func_m)
np.vectorize() does not make code faster, but it does let you write a function with scalar inputs and outputs, and convert it to a function that takes array inputs and outputs.
The benefit is that we can now pass pandas Series to these functions (I also added the type conversion to this step):
df.loc[:, 'phred'] = vec_func_p( df.loc[:, 'phred']).astype(np.int32)
df.loc[:, 'map'] = vec_func_m( df.loc[:, 'map']).astype(np.int32)
Replacing the list comprehensions with these new functions gives the same results as your version (33332 33333).
#rpanai noted that you could eliminate the for loops. The following example uses groupby() (and a couple helper columns) to find the start and end position for each contiguous sequence of positions.
Using only pandas built-in functions should be compatible with Dask (and fast).
First, create demo data frame with multiple chromosomes and multiple contiguous blocks of positions:
data1 = {
'chrom' : 'chrom_1',
'pos' : [1000, 1001, 1002,
2000, 2001, 2002, 2003]}
data2 = {
'chrom' : 'chrom_2',
'pos' : [30000, 30001, 30002, 30003, 30004,
40000, 40001, 40002, 40003, 40004, 40005]}
df = pd.DataFrame(data1).append( pd.DataFrame(data2) )
Second, create two helper functions:
rank is a sequential counter for each group;
key is constant for positions in a contiguous 'run' of positions.
df['rank'] = df.groupby('chrom')['pos'].rank(method='first')
df['key'] = df['pos'] - df['rank']
Third, group by chrom and key to create a groupby object for each contiguous block of positions, then use min and max to find start and end value for the positions.
result = (df.groupby(['chrom', 'key'])['pos']
.agg(['min', 'max'])
.droplevel('key')
.rename(columns={'min': 'start', 'max': 'end'})
)
print(result)
start end
chrom
chrom_1 1000 1002
chrom_1 2000 2003
chrom_2 30000 30004
chrom_2 40000 40005
I am running a fairly complex filter on a dataframe in pandas (I am filtering for passing test results against 67 different thresholds via a dictionary). In order to do this I have the following:
query_string = ' | '.join([f'{k} > {v}' for k , v in dictionary.items()])
test_passes = df.query(query_string, engine='python')
Where k is the test name and v is the threshold value.
This is working nicely and I am able to export the rows with test passes to csv.
I am wondering though if there is a way to also attach a column which counts the number of test passes. So for example if the particular row recorded 1-67 test passes.
So I finally 'solved' with the following, starting after the pandas query originally posted. The original question was for test passes by my use case if actually for test failures which....
test_failures = data.query(query_string, engine='python').copy()
The copy is to prevent unintentional data manipulation and chaining error messages.
for k, row in test_failures.iterrows():
failure_count=0
test_count=0
for key, val in threshold_dict.items():
test_count +=1
if row[key] > val:
failure_count +=1
test_failures.at[k, 'Test Count'] = test_count
test_failures.at[k, 'Failure Count'] = failure_count
From what I have read iterrows() is not the fastest iterative method but it does provide the index (k) and a data dictionary (row) separately which I found more useful for these purposes than the tuple returned in itertuples().
sorted_test_failures = test_failures.sort_values('Failure Count', ascending=False)
sorted_test_failures.to_csv('failures.csv', encoding='utf8')
A little sorting and saving to finish.
I have tested on a dummy data set of (8000 x 66) - it doesn't provide groundbreaking speed but it does the job. Any improvements would be great!
This was answered here:
https://stackoverflow.com/a/24516612/6815750
But to give an example you could do the following:
new_df = df.apply(pd.Series.value_counts, axis = 1) #where df is your current dataframe holding the pass/fails
df[new_df.columns] = new_df
You can use the following approach instead:
dictionary = {'a':'b', 'b': 'c'}
data = pd.DataFrame({'a': [1,2,3], 'b': [ 2,1,2], 'c': [2,1,1] })
test_components = pd.DataFrame([df.loc[:, k] > df.loc[:, v] for k , v in dictionary.items()]).T
# now can inspect what conditions were met in `test_components` variable
condition = test_components.any(axis=1)
data_filtered = data.loc[common_condition, :]
I have a huge set of data. Something like 100k lines and I am trying to drop a row from a dataframe if the row, which contains a list, contains a value from another dataframe. Here's a small time example.
has = [['#a'], ['#b'], ['#c, #d, #e, #f'], ['#g']]
use = [1,2,3,5]
z = ['#d','#a']
df = pd.DataFrame({'user': use, 'tweet': has})
df2 = pd.DataFrame({'z': z})
tweet user
0 [#a] 1
1 [#b] 2
2 [#c, #d, #e, #f] 3
3 [#g] 5
z
0 #d
1 #a
The desired outcome would be
tweet user
0 [#b] 2
1 [#g] 5
Things i've tried
#this seems to work for dropping #a but not #d
for a in range(df.tweet.size):
for search in df2.z:
if search in df.loc[a].tweet:
df.drop(a)
#this works for my small scale example but throws an error on my big data
df['tweet'] = df.tweet.apply(', '.join)
test = df[~df.tweet.str.contains('|'.join(df2['z'].astype(str)))]
#the error being "unterminated character set at position 1343770"
#i went to check what was on that line and it returned this
basket.iloc[1343770]
user_id 17060480
tweet [#IfTheyWereBlackOrBrownPeople, #WTF]
Name: 4612505, dtype: object
Any help would be greatly appreciated.
is ['#c, #d, #e, #f'] 1 string or a list like this ['#c', '#d', '#e', '#f'] ?
has = [['#a'], ['#b'], ['#c', '#d', '#e', '#f'], ['#g']]
use = [1,2,3,5]
z = ['#d','#a']
df = pd.DataFrame({'user': use, 'tweet': has})
df2 = pd.DataFrame({'z': z})
simple solution would be
screen = set(df2.z.tolist())
to_delete = list() # this will speed things up doing only 1 delete
for id, row in df.iterrows():
if set(row.tweet).intersection(screen):
to_delete.append(id)
df.drop(to_delete, inplace=True)
speed comparaison (for 10 000 rows):
st = time.time()
screen = set(df2.z.tolist())
to_delete = list()
for id, row in df.iterrows():
if set(row.tweet).intersection(screen):
to_delete.append(id)
df.drop(to_delete, inplace=True)
print(time.time()-st)
2.142000198364258
st = time.time()
for a in df.tweet.index:
for search in df2.z:
if search in df.loc[a].tweet:
df.drop(a, inplace=True)
break
print(time.time()-st)
43.99799990653992
For me, your code works if I make several adjustments.
First, you're missing the last line when putting range(df.tweet.size), either increase this or (more robust, if you don't have an increasing index), use df.tweet.index.
Second, you don't apply your dropping, use inplace=True for that.
Third, you have #d in a string, the following is not a list: '#c, #d, #e, #f' and you have to change it to a list so it works.
So if you change that, the following code works fine:
has = [['#a'], ['#b'], ['#c', '#d', '#e', '#f'], ['#g']]
use = [1,2,3,5]
z = ['#d','#a']
df = pd.DataFrame({'user': use, 'tweet': has})
df2 = pd.DataFrame({'z': z})
for a in df.tweet.index:
for search in df2.z:
if search in df.loc[a].tweet:
df.drop(a, inplace=True)
break # so if we already dropped it we no longer look whether we should drop this line
This will provide the desired result. Be aware of this potentially being not optimal due to missing vectorization.
EDIT:
you can achieve the string being a list with the following:
from itertools import chain
df.tweet = df.tweet.apply(lambda l: list(chain(*map(lambda lelem: lelem.split(","), l))))
This applies a function to each line (assuming each line contains a list with one or more elements): Split each element (should be a string) by comma into a new list and "flatten" all the lists in one line (if there are multiple) together.
EDIT2:
Yes, this is not really performant But basically does what was asked. Keep that in mind and after having it working, try to improve your code (less for iterations, do tricks like collecting the indices and then drop all of them).
I want to perform my own complex operations on financial data in dataframes in a sequential manner.
For example I am using the following MSFT CSV file taken from Yahoo Finance:
Date,Open,High,Low,Close,Volume,Adj Close
2011-10-19,27.37,27.47,27.01,27.13,42880000,27.13
2011-10-18,26.94,27.40,26.80,27.31,52487900,27.31
2011-10-17,27.11,27.42,26.85,26.98,39433400,26.98
2011-10-14,27.31,27.50,27.02,27.27,50947700,27.27
....
I then do the following:
#!/usr/bin/env python
from pandas import *
df = read_csv('table.csv')
for i, row in enumerate(df.values):
date = df.index[i]
open, high, low, close, adjclose = row
#now perform analysis on open/close based on date, etc..
Is that the most efficient way? Given the focus on speed in pandas, I would assume there must be some special function to iterate through the values in a manner that one also retrieves the index (possibly through a generator to be memory efficient)? df.iteritems unfortunately only iterates column by column.
The newest versions of pandas now include a built-in function for iterating over rows.
for index, row in df.iterrows():
# do some logic here
Or, if you want it faster use itertuples()
But, unutbu's suggestion to use numpy functions to avoid iterating over rows will produce the fastest code.
Pandas is based on NumPy arrays.
The key to speed with NumPy arrays is to perform your operations on the whole array at once, never row-by-row or item-by-item.
For example, if close is a 1-d array, and you want the day-over-day percent change,
pct_change = close[1:]/close[:-1]
This computes the entire array of percent changes as one statement, instead of
pct_change = []
for row in close:
pct_change.append(...)
So try to avoid the Python loop for i, row in enumerate(...) entirely, and
think about how to perform your calculations with operations on the entire array (or dataframe) as a whole, rather than row-by-row.
Like what has been mentioned before, pandas object is most efficient when process the whole array at once. However for those who really need to loop through a pandas DataFrame to perform something, like me, I found at least three ways to do it. I have done a short test to see which one of the three is the least time consuming.
t = pd.DataFrame({'a': range(0, 10000), 'b': range(10000, 20000)})
B = []
C = []
A = time.time()
for i,r in t.iterrows():
C.append((r['a'], r['b']))
B.append(time.time()-A)
C = []
A = time.time()
for ir in t.itertuples():
C.append((ir[1], ir[2]))
B.append(time.time()-A)
C = []
A = time.time()
for r in zip(t['a'], t['b']):
C.append((r[0], r[1]))
B.append(time.time()-A)
print B
Result:
[0.5639059543609619, 0.017839908599853516, 0.005645036697387695]
This is probably not the best way to measure the time consumption but it's quick for me.
Here are some pros and cons IMHO:
.iterrows(): return index and row items in separate variables, but significantly slower
.itertuples(): faster than .iterrows(), but return index together with row items, ir[0] is the index
zip: quickest, but no access to index of the row
EDIT 2020/11/10
For what it is worth, here is an updated benchmark with some other alternatives (perf with MacBookPro 2,4 GHz Intel Core i9 8 cores 32 Go 2667 MHz DDR4)
import sys
import tqdm
import time
import pandas as pd
B = []
t = pd.DataFrame({'a': range(0, 10000), 'b': range(10000, 20000)})
for _ in tqdm.tqdm(range(10)):
C = []
A = time.time()
for i,r in t.iterrows():
C.append((r['a'], r['b']))
B.append({"method": "iterrows", "time": time.time()-A})
C = []
A = time.time()
for ir in t.itertuples():
C.append((ir[1], ir[2]))
B.append({"method": "itertuples", "time": time.time()-A})
C = []
A = time.time()
for r in zip(t['a'], t['b']):
C.append((r[0], r[1]))
B.append({"method": "zip", "time": time.time()-A})
C = []
A = time.time()
for r in zip(*t.to_dict("list").values()):
C.append((r[0], r[1]))
B.append({"method": "zip + to_dict('list')", "time": time.time()-A})
C = []
A = time.time()
for r in t.to_dict("records"):
C.append((r["a"], r["b"]))
B.append({"method": "to_dict('records')", "time": time.time()-A})
A = time.time()
t.agg(tuple, axis=1).tolist()
B.append({"method": "agg", "time": time.time()-A})
A = time.time()
t.apply(tuple, axis=1).tolist()
B.append({"method": "apply", "time": time.time()-A})
print(f'Python {sys.version} on {sys.platform}')
print(f"Pandas version {pd.__version__}")
print(
pd.DataFrame(B).groupby("method").agg(["mean", "std"]).xs("time", axis=1).sort_values("mean")
)
## Output
Python 3.7.9 (default, Oct 13 2020, 10:58:24)
[Clang 12.0.0 (clang-1200.0.32.2)] on darwin
Pandas version 1.1.4
mean std
method
zip + to_dict('list') 0.002353 0.000168
zip 0.003381 0.000250
itertuples 0.007659 0.000728
to_dict('records') 0.025838 0.001458
agg 0.066391 0.007044
apply 0.067753 0.006997
iterrows 0.647215 0.019600
You can loop through the rows by transposing and then calling iteritems:
for date, row in df.T.iteritems():
# do some logic here
I am not certain about efficiency in that case. To get the best possible performance in an iterative algorithm, you might want to explore writing it in Cython, so you could do something like:
def my_algo(ndarray[object] dates, ndarray[float64_t] open,
ndarray[float64_t] low, ndarray[float64_t] high,
ndarray[float64_t] close, ndarray[float64_t] volume):
cdef:
Py_ssize_t i, n
float64_t foo
n = len(dates)
for i from 0 <= i < n:
foo = close[i] - open[i] # will be extremely fast
I would recommend writing the algorithm in pure Python first, make sure it works and see how fast it is-- if it's not fast enough, convert things to Cython like this with minimal work to get something that's about as fast as hand-coded C/C++.
You have three options:
By index (simplest):
>>> for index in df.index:
... print ("df[" + str(index) + "]['B']=" + str(df['B'][index]))
With iterrows (most used):
>>> for index, row in df.iterrows():
... print ("df[" + str(index) + "]['B']=" + str(row['B']))
With itertuples (fastest):
>>> for row in df.itertuples():
... print ("df[" + str(row.Index) + "]['B']=" + str(row.B))
Three options display something like:
df[0]['B']=125
df[1]['B']=415
df[2]['B']=23
df[3]['B']=456
df[4]['B']=189
df[5]['B']=456
df[6]['B']=12
Source: alphons.io
I checked out iterrows after noticing Nick Crawford's answer, but found that it yields (index, Series) tuples. Not sure which would work best for you, but I ended up using the itertuples method for my problem, which yields (index, row_value1...) tuples.
There's also iterkv, which iterates through (column, series) tuples.
Just as a small addition, you can also do an apply if you have a complex function that you apply to a single column:
http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.apply.html
df[b] = df[a].apply(lambda col: do stuff with col here)
As #joris pointed out, iterrows is much slower than itertuples and itertuples is approximately 100 times faster than iterrows, and I tested the speed of both methods in a DataFrame with 5 million records the result is for iterrows, it is 1200it/s, and itertuples is 120000it/s.
If you use itertuples, note that every element in the for loop is a namedtuple, so to get the value in each column, you can refer to the following example code
>>> df = pd.DataFrame({'col1': [1, 2], 'col2': [0.1, 0.2]},
index=['a', 'b'])
>>> df
col1 col2
a 1 0.1
b 2 0.2
>>> for row in df.itertuples():
... print(row.col1, row.col2)
...
1, 0.1
2, 0.2
For sure, the fastest way to iterate over a dataframe is to access the underlying numpy ndarray either via df.values (as you do) or by accessing each column separately df.column_name.values. Since you want to have access to the index too, you can use df.index.values for that.
index = df.index.values
column_of_interest1 = df.column_name1.values
...
column_of_interestk = df.column_namek.values
for i in range(df.shape[0]):
index_value = index[i]
...
column_value_k = column_of_interest_k[i]
Not pythonic? Sure. But fast.
If you want to squeeze more juice out of the loop you will want to look into cython. Cython will let you gain huge speedups (think 10x-100x). For maximum performance check memory views for cython.
Another suggestion would be to combine groupby with vectorized calculations if subsets of the rows shared characteristics which allowed you to do so.
look at last one
t = pd.DataFrame({'a': range(0, 10000), 'b': range(10000, 20000)})
B = []
C = []
A = time.time()
for i,r in t.iterrows():
C.append((r['a'], r['b']))
B.append(round(time.time()-A,5))
C = []
A = time.time()
for ir in t.itertuples():
C.append((ir[1], ir[2]))
B.append(round(time.time()-A,5))
C = []
A = time.time()
for r in zip(t['a'], t['b']):
C.append((r[0], r[1]))
B.append(round(time.time()-A,5))
C = []
A = time.time()
for r in range(len(t)):
C.append((t.loc[r, 'a'], t.loc[r, 'b']))
B.append(round(time.time()-A,5))
C = []
A = time.time()
[C.append((x,y)) for x,y in zip(t['a'], t['b'])]
B.append(round(time.time()-A,5))
B
0.46424
0.00505
0.00245
0.09879
0.00209
I believe the most simple and efficient way to loop through DataFrames is using numpy and numba. In that case, looping can be approximately as fast as vectorized operations in many cases. If numba is not an option, plain numpy is likely to be the next best option. As has been noted many times, your default should be vectorization, but this answer merely considers efficient looping, given the decision to loop, for whatever reason.
For a test case, let's use the example from #DSM's answer of calculating a percentage change. This is a very simple situation and as a practical matter you would not write a loop to calculate it, but as such it provides a reasonable baseline for timing vectorized approaches vs loops.
Let's set up the 4 approaches with a small DataFrame, and we'll time them on a larger dataset below.
import pandas as pd
import numpy as np
import numba as nb
df = pd.DataFrame( { 'close':[100,105,95,105] } )
pandas_vectorized = df.close.pct_change()[1:]
x = df.close.to_numpy()
numpy_vectorized = ( x[1:] - x[:-1] ) / x[:-1]
def test_numpy(x):
pct_chng = np.zeros(len(x))
for i in range(1,len(x)):
pct_chng[i] = ( x[i] - x[i-1] ) / x[i-1]
return pct_chng
numpy_loop = test_numpy(df.close.to_numpy())[1:]
#nb.jit(nopython=True)
def test_numba(x):
pct_chng = np.zeros(len(x))
for i in range(1,len(x)):
pct_chng[i] = ( x[i] - x[i-1] ) / x[i-1]
return pct_chng
numba_loop = test_numba(df.close.to_numpy())[1:]
And here are the timings on a DataFrame with 100,000 rows (timings performed with Jupyter's %timeit function, collapsed to a summary table for readability):
pandas/vectorized 1,130 micro-seconds
numpy/vectorized 382 micro-seconds
numpy/looped 72,800 micro-seconds
numba/looped 455 micro-seconds
Summary: for simple cases, like this one, you would go with (vectorized) pandas for simplicity and readability, and (vectorized) numpy for speed. If you really need to use a loop, do it in numpy. If numba is available, combine it with numpy for additional speed. In this case, numpy + numba is almost as fast as vectorized numpy code.
Other details:
Not shown are various options like iterrows, itertuples, etc. which are orders of magnitude slower and really should never be used.
The timings here are fairly typical: numpy is faster than pandas and vectorized is faster than loops, but adding numba to numpy will often speed numpy up dramatically.
Everything except the pandas option requires converting the DataFrame column to a numpy array. That conversion is included in the timings.
The time to define/compile the numpy/numba functions was not included in the timings, but would generally be a negligible component of the timing for any large dataframe.