Fastest approach to read and process 10k Excel cells in Python/Pandas?

I want to read and process real-time DDE data from a trading platform, using Excel as a 'bridge' between the trading platform (which sends out the data) and Python (which processes it), and write the results back to Excel as a front-end 'GUI'. SPEED IS CRUCIAL. I need to:
read 6 to 10 thousand cells in Excel as fast as possible
sum ticks that arrive at the same time (same h:m:s)
check if the DataFrame contains any value from a static array (e.g. large quantities)
write the output to the same Excel file (different sheet), used as a front-end output 'GUI'.
I imported the 'xlwings' library and use it to read data from one sheet, calculate the needed values in Python, and then write the results to another sheet of the same file. I want to keep Excel open and visible so that it functions as an 'output dashboard'. This function runs in an infinite loop, reading real-time stock prices.
import xlwings as xw
import numpy as np
import pandas as pd
...
...
tickdf = pd.DataFrame(xw.Book('datafile.xlsx').sheets['raw_data'].range((1, 5), (1500, 8)).value)
tickdf.columns = ['time', 'price', 'all-tick','symb']
tickdf = tickdf[['time','symb', 'price', 'all-tick']]
#read data and fill a pandas.df with values, then re-order columns
try:
    global ttt #this is used as temporary global pandas.df
    global tttout #this is used as output global pandas.df copy
    #they are global as they can be zeroed with another function
    ttt = pd.concat([ttt, tickdf], ignore_index=False)  # DataFrame.append was removed in pandas 2.0
    #at each loop, newly read ticks are added as rows to the end of ttt global.df.
    ttt.drop_duplicates(inplace=True)
    tttout = ttt.copy()
    #to prevent outputting incomplete data, for extra safety, I use a copy of ttt as the DF to be printed out on the excel file
    tttout = tttout.groupby(['time','symb'], as_index=False).agg({'all-tick':'sum', 'price':'first'})
    tttout = tttout.set_index('time')
    #aggregate by time/name and set time as index
    tttout = tttout.loc[tttout['all-tick'].isin(target_ticker)]
    #find matching values comparing an array of a dozen values
    tttout = tttout.sort_values(by = ['time', 'symb'], ascending = [False, True])
    xw.Book(file_path).sheets['OUTPUT'].range('B2').value = tttout
I run this on an i5 @ 4.2 GHz, and this function, together with some other small pieces of code, runs in 500-600 ms per loop, which is fairly good (but not fantastic!). I would like to know if there is a better approach and which step(s) might be bottlenecks.
The code reads 1500 rows, one per listed stock in alphabetical order. Each row is the 'last tick' traded on the market for that specific stock, and it looks like this:
'10:00:04 | ABC | 10.33 | 50000'
'09:45:20 | XYZ | 5.260 | 200 '
'....
being time, stock symbol, price, quantity.
I want to check whether specific quantities are traded on the market, such as 1.000.000 (as it represents a huge order), or maybe just '1', which is often used as a market 'heartbeat', a sort of fake order.
My approach is to use Pandas, xlwings and the 'isin' method. Is there a more efficient approach that might improve my script's performance?
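To find out which step is the bottleneck, it may help to run the aggregation and the 'isin' filter on their own, without Excel in the loop. The following is only a minimal sketch of those two steps on made-up ticks; the sample rows, the name summarize and the target quantities are assumptions for illustration, not part of the script above.

```python
import pandas as pd

# Hypothetical sample ticks in the layout described above
# (time, symbol, price, quantity).
ticks = pd.DataFrame(
    {
        "time": ["10:00:04", "10:00:04", "09:45:20", "09:45:21"],
        "symb": ["ABC", "ABC", "XYZ", "XYZ"],
        "price": [10.33, 10.34, 5.26, 5.27],
        "all-tick": [500000, 500000, 200, 1],
    }
)

# Quantities of interest (assumed values, taken from the examples in the text).
target_quantities = [1, 1_000_000]


def summarize(ticks, targets):
    """Sum quantities per (time, symbol), then keep rows whose sum is a target."""
    out = ticks.groupby(["time", "symb"], as_index=False).agg(
        {"all-tick": "sum", "price": "first"}
    )
    out = out[out["all-tick"].isin(targets)]
    out = out.sort_values(["time", "symb"], ascending=[False, True])
    return out.set_index("time")


result = summarize(ticks, target_quantities)
print(result)
```

Timing summarize separately from the xlwings read and write calls (for example with time.perf_counter) should show whether the pandas part or the Excel round trip dominates the 500-600 ms.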

It would be faster to use a UDF written with PyXLL as that would avoid going via COM and an external process. You would have a formula in Excel with the input set to your range of data, and that would be called each time the input data updated. This would avoid the need to keep polling the data in an infinite loop, and should be much faster than running Python outside of Excel.
See https://www.pyxll.com/docs/introduction.html if you're not already familiar with PyXLL.
PyXLL could convert the input range to a pandas DataFrame for you (see https://www.pyxll.com/docs/userguide/pandas.html), but that might not be the fastest way to do it.
The quickest way to transfer data from Excel to Python is via a floating point numpy array using the "numpy_array" type in PyXLL (see https://www.pyxll.com/docs/userguide/udfs/argtypes.html#numpy-array-types).
As speed is a concern, maybe you could split the data up and have some functions that take mostly static data (eg rows and column headers), and other functions that take variable data as numpy_arrays where possible or other types where not, and then a final function to combine them all.
PyXLL can return Python objects to Excel as object handles. If you need to return intermediate results then it is generally faster to do that instead of expanding the whole dataset to an Excel range.

@Tony Roberts, thank you
I have one doubt and one observation.
DOUBT: The data gets updated very fast, every 50-100 ms. Would it be feasible to call a UDF function that often? Would it be lean? I have little experience with this.
OBSERVATION: PyXLL is for sure extremely powerful, well done and well maintained, but IMHO, at $25/month, it goes against the free nature of the Python language. I do understand, though, that quality has a price.

Related

Advance processing multiple Data Frames in python

I have a few (15) data frames. They contain values based on one map, but in fragmentary form.
The list of samples looks like A1 - 3k records, A2 - 6k records, B1 - 12k records, B2 - 1k records, B3 - 3k records, C1... etc.
All files have the same format, which looks like this:
name sample position position_ID
String1 String1 num1 num1
String2 String2 num2 num2
...
All files come from a variety of biological microarrays. Different companies have different matrices, hence the scatter in the size of files. But each of them is based on one common, whole database. Just some of the data from the main database is selected. Therefore, individual records can be repeated between files. I want to see if they are compatible.
What do I want to achieve in this task?
I want to check that all records with the same name have the same position and pos_ID values in all files.
If a record with the same name differs in values in any file, it must be written to error.csv.
If it is the same everywhere, it goes to result.csv.
To be honest, I do not know how to approach this, so I am hoping that someone here can give me good advice. I want to do it in Python.
I have two ideas.
Load all files in Pandas as one data frame and try to write a function filtering the whole DF record by record (for loop with if statements?).
Open all files separately with Python's file reading and add unique rows to a new list. When the read function encounters the same recordName again, it checks it against the previous one. If all the other values are the same, it passes without writing; if not, the record is written to error.csv.
I am afraid, however, that these may not be the best methods, hence I am asking for advice and for pointers to something better. I have read about numpy but have not studied it yet; maybe it is worth it in the context of this task? Maybe there is a function that has already been created for this, and I do not know about it?
Can someone suggest a more sensible (maybe easier) solution?

I think I have a rough idea of where you are going. This is how I would approach it
import pandas as pd
df1 = pd.read_csv("file1.csv")
df2 = pd.read_csv("file2.csv")
df1["filename"] ="file1.csv"
df2["filename"] ="file2.csv"
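df_total = pd.concat([df1, df2], axis=0, ignore_index=True) # stacks them vertically
# drops rows that are identical apart from the file they came from
df_total_no_dupes = df_total.drop_duplicates(subset=["name", "position", "position_ID"])
# this gives you the cases where a name occurs more than once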
name_counts = df_total_no_dupes.groupby("name").size().reset_index(name='counts')
names_which_appear_more_than_once = name_counts[name_counts["counts"] > 1]["name"].unique()
filter_condition = df_total_no_dupes["name"].isin(names_which_appear_more_than_once)
# this should be your dataframe where there are at least two rows with same name but different values.
print(df_total_no_dupes[filter_condition].sort_values("name"))
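To get from there to the two output files asked for in the question, one possible continuation is sketched below. It is only an illustration on two made-up fragments; the sample values and the variable names (combined, pairs, bad_names) are assumptions, while the column names follow the format shown in the question.

```python
import pandas as pd

# Two hypothetical fragments that share the records "r1" and "r2".
df1 = pd.DataFrame(
    {"name": ["r1", "r2"], "sample": ["A1", "A1"],
     "position": [10, 20], "position_ID": [1, 2]}
)
df2 = pd.DataFrame(
    {"name": ["r1", "r2", "r3"], "sample": ["B1", "B1", "B1"],
     "position": [10, 99, 30], "position_ID": [1, 2, 3]}
)

combined = pd.concat([df1, df2], ignore_index=True)

# One row per distinct (name, position, position_ID) combination.
pairs = combined[["name", "position", "position_ID"]].drop_duplicates()

# A name with more than one distinct combination is inconsistent.
variants_per_name = pairs.groupby("name").size()
bad_names = variants_per_name[variants_per_name > 1].index

errors = combined[combined["name"].isin(bad_names)]
consistent = pairs[~pairs["name"].isin(bad_names)]

print(errors)
print(consistent)
```

The two frames can then be saved with errors.to_csv("error.csv", index=False) and consistent.to_csv("result.csv", index=False). With 15 files the same code applies once all of them are in the list passed to pd.concat.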

HDFStore get column names

I have some problems with pandas' HDFStore being far too slow, and unfortunately I'm unable to put together a satisfying solution from other questions here.
Situation
I have a big DataFrame, containing mostly floats and sometimes integer columns which goes through multiple processing steps (renaming, removing bad entries, aggregating by 30min). Each row has a timestamp associated to it. I would like to save some middle steps to a HDF file, so that the user can do a single step iteratively without starting from scratch each time.
Additionally the user should be able to plot certain column from these saves in order to select bad data. Therefore I would like to retrieve only the column names without reading the data in the HDFStore.
Concretely the user should get a list of all columns of all dataframes stored in the HDF then they should select which columns they would like to see whereafter I use matplotlib to present them the corresponding data.
Data
shape == (5730000, 339) does not seem large at all, that's why I'm confused... (Might get far more rows over time, columns should stay fixed)
In the first step I append iteratively rows and columns (that runs okay), but once that's done I always process the entire DataFrame at once, only grouping or removing data.
My approach
I do all manipulations in memory since pandas seems to be rather fast and I/O is slower (HDF is on different physical server, I think)
I use datetime index and automatically selected float or integer columns
I save the steps with hdf.put('/name', df, format='fixed') since hdf.put('/name'.format(grp), df, format='table', data_columns=True) seemed to be far too slow.
I use e.g. df.groupby(df.index).first() and df.groupby(pd.Grouper(freq='30Min')).agg(agg_dict) to process the data, where agg_dict is a dictionary with one function per column. This is incredibly slow as well.
For plotting, I have to read-in the entire dataframe and then get the columns: hdfstore.get('/name').columns
Question
How can I retrieve all columns without reading any data from the HDFStore?
What would be the most efficient way of storing my data? Is HDF the right option? Table or fixed?
Does it matter in terms of efficiency if the index is a datetime index? Does a more efficient format exist in general (e.g. all columns the same, fixed dtype)?
Is there a faster way to aggregate instead of groupby (df.groupby(pd.Grouper(freq='30Min')).agg(agg_dict))
similar questions
How to access single columns using .select
I see that I can use this to retrieve only certain columns but only after I know the column names, I think.
Thank you for any advice!

You may simply load 0 rows of the DataFrame by specifying the same start and stop arguments, and leave all internal index/column processing to pandas itself:
import numpy as np
import pandas as pd
idx = pd.MultiIndex.from_product([('A', 'B'), range(2)], names=('Alpha', 'Int'))
df = pd.DataFrame(np.random.randn(len(idx), 3), index=idx, columns=('I', 'II', 'III'))
df
>>> I II III
>>> Alpha Int
>>> A 0 -0.472412 0.436486 0.354592
>>> 1 -0.095776 -0.598585 -0.847514
>>> B 0 0.107897 1.236039 -0.196927
>>> 1 -0.154014 0.821511 0.092220
The following works for both fixed and table formats:
with pd.HDFStore('test.h5') as store:
    store.put('df', df, format='f')
    meta = store.select('df', start=1, stop=1)
meta
meta.index
meta.columns
>>> I II III
>>> Alpha Int
>>>
>>> MultiIndex(levels=[[], []],
>>> codes=[[], []],
>>> names=['Alpha', 'Int'])
>>>
>>> Index(['I', 'II', 'III'], dtype='object')
As for the other questions:
As long as your data is mostly homogeneous (almost all float columns, as you mentioned) and you are able to store it in a single file without needing to distribute data across machines, HDF is the first thing to try.
If you need to append/delete/query data, you must use the table format. If you only need to write once and read many times, fixed will improve performance.
As for the datetime index, I think the same idea as in point 1 applies here. If you are able to convert all data into a single type, it should increase your performance.
Nothing else comes to mind beyond what was proposed in the comments to your question.

For a HDFStore hdf and a key (from hdf.keys()) you can get the column names with:
# Table stored with hdf.put(..., format='table')
columns = hdf.get_node('{}/table'.format(key)).description._v_names
# Table stored with hdf.put(..., format='fixed')
columns = list(hdf.get_node('{}/axis0'.format(key)).read().astype(str))
Note that hdf.get(key).columns works as well, but it reads all the data into memory, while the approach above only reads the column names.
Full working example:
#!/usr/bin/env python
import pandas as pd
data = pd.DataFrame({'a': [1,1,1,2,3,4,5], 'b': [2,3,4,1,3,2,1]})
with pd.HDFStore(path='store.h5', mode='a') as hdf:
    hdf.put('/DATA/fixed_store', data, format='fixed')
    hdf.put('/DATA/table_store', data, format='table', data_columns=True)
    for key in hdf.keys():
        try:
            # column names of table store
            print(hdf.get_node('{}/table'.format(key)).description._v_names)
        except AttributeError:
            try:
                # column names of fixed store
                print(list(hdf.get_node('{}/axis0'.format(key)).read().astype(str)))
            except AttributeError:
                # e.g. a dataset created by h5py instead of pandas.
                print('unknown node in HDF.')

Columns without reading any data:
store.get_storer('df').ncols # substitute 'df' with your key
# you can also access nrows and other useful fields
From the docs (fixed format, table format): (important points in bold)
[fixed] These types of stores are not appendable once written (though you can simply remove them and rewrite). Nor are they queryable; they must be retrieved in their entirety. They also do not support dataframes with non-unique column names. The fixed format stores offer very fast writing and slightly faster reading than table stores.
[table] Conceptually a table is shaped very much like a DataFrame, with rows and columns. A table may be appended to in the same or other sessions. In addition, delete and query type operations are supported.
You may try to use epochms (or epochns) (milliseconds or nanoseconds since epoch) in place of datetimes. This way, you are just dealing with integer indices.
You may have a look at this answer if what you need is grouping by on large data.
A piece of advice: if you have 4 questions to ask, it may be better to ask 4 separate questions on SO. This way, you'll get a higher number of (higher quality) answers, since each one is easier to tackle. Each will also deal with a specific topic, making it easier to find for people who are looking for specific answers.
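The epoch suggestion above can be sketched as follows. This is only an illustration on a tiny made-up frame; the column name value and the 30-minute bucket size are assumptions chosen to match the question, and whether integer bucketing is actually faster than pd.Grouper on the real data would have to be measured.

```python
import pandas as pd

# Tiny hypothetical frame with a datetime index (15-minute steps).
idx = pd.date_range("2020-01-01", periods=4, freq="15min")
df = pd.DataFrame({"value": [1.0, 2.0, 3.0, 4.0]}, index=idx)

# Convert the datetime index to integer milliseconds since the epoch.
epoch = pd.Timestamp("1970-01-01")
df_ms = df.copy()
df_ms.index = (df.index - epoch) // pd.Timedelta("1ms")

# Group into 30-minute buckets using plain integer arithmetic.
bucket_ms = 30 * 60 * 1000
agg = df_ms.groupby(df_ms.index // bucket_ms).sum()

# The integer index can be converted back to datetimes when needed.
restored = pd.to_datetime(df_ms.index, unit="ms")

print(df_ms)
print(agg)
```

The group keys in agg are bucket numbers; multiplying them by bucket_ms and passing the result to pd.to_datetime(..., unit="ms") gives the start time of each bucket.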

Streaming Large (5gb) CSV’s FAST (in parallel?) on 16-core Machine?

I have 30 CSV files (saved as .txt files) ranging from 2GB to 11GB each on a server machine with 16 cores.
Each row of each CSV contains a date, a time, and an ID.
I need to construct a dense matrix of size datetime x ID (roughly 35,000 x 2000), where each cell is count of rows that had this datetime and ID (so each CSV row’s datetime and ID are used as matrix indices to update this matrix). Each file contains a unique range of datetimes, so this job is embarrassingly parallel across files.
Question: What is a faster/fastest way to accomplish this & (possibly) parallelize it? I am partial to Python, but could work in C++ if there is a better solution there. Should I re-write with MapReduce or MPI? Look into Dask or Pandas? Compile my python script somehow? Something else entirely?
My current approach (which I would happily discard for something faster):
Currently, I am doing this serially (one CSV at a time) in Python and saving the output matrix in h5 format. I stream a CSV line-by-line from the command line using:
cat one_csv.txt | my_script.py > outputfile.h5
And my python script works like:
# initialize matrix
…
for line in sys.stdin:
    # Split the line into data columns
    split = line.replace('\n','').split(',')
    ...(extract & process datetime; extract ID)...
    # Update matrix
    matrix[datetime, ID] = matrix[datetime, ID] +1
EDIT Below are a few example lines from one of the CSVs. The only relevant columns are 'dateYMDD' (formatted so that '80101' means Jan. 1, 2008), 'time', and 'ID'. So, for example, the code should use the first row of the CSV below to add 1 to the matrix cell corresponding to (Jan_1_2008_00_00_00, 12).
Also: There are many more unique times than unique ID's, and the CSV's are time-sorted.
Type|Number|dateYMDD|time|ID
2|519275|80101|0:00:00|12
5|525491|80101|0:05:00|25
2|624094|80101|0:12:00|75
5|623044|80102|0:01:00|75
6|658787|80102|0:03:00|4

First of all, you should probably profile your script to make sure the bottleneck is actually where you think.
That said, Python's Global Interpreter Lock will make parallelizing it difficult, unless you use multiprocessing, and I expect it will be faster to simply process them separately and merge the results: feed each Python script one CSV and output to one table, then merge the tables. If the tables are much smaller than the CSVs (as one would expect if the cells have high values) then this should be relatively efficient.
I don't think that will get you all-caps full-throttle FAST like you mentioned, though. If that doesn't meet your expectations I would think of writing it in C++, Rust or Cython.
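The "one process per CSV, then merge" idea could look roughly like the sketch below. It is an assumption-laden illustration rather than a tuned solution: the function names are made up, the '|' delimiter and the header row are taken from the sample lines in the question, and counts are kept in a Counter keyed by (date, time, ID) instead of a dense matrix.

```python
from collections import Counter
from multiprocessing import Pool


def count_lines(lines, delimiter="|"):
    """Count rows per (date, time, ID) for one file's lines; the header is skipped."""
    counts = Counter()
    it = iter(lines)
    next(it, None)  # assumed: the first line is the header row
    for line in it:
        fields = line.rstrip("\n").split(delimiter)
        if len(fields) < 5:
            continue  # ignore blank or malformed lines
        _type, _number, date, time, id_ = fields[:5]
        counts[(date, time, id_)] += 1
    return counts


def count_file(path):
    """Stream one file from disk and count its rows."""
    with open(path) as fh:
        return count_lines(fh)


def count_files_parallel(paths, workers=4):
    """Run count_file in separate processes and merge the partial counts."""
    total = Counter()
    with Pool(workers) as pool:
        for partial in pool.imap_unordered(count_file, paths):
            total.update(partial)  # Counter.update adds the counts
    return total
```

On platforms that spawn worker processes, count_files_parallel should be called from under an if __name__ == "__main__": guard. Because each file covers a unique range of datetimes, the merged Counter can be turned into the dense datetime x ID matrix in a single pass at the end.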

Sorting in pandas for large datasets

I would like to sort my data by a given column, specifically p-values. However, the issue is that I am not able to load my entire data into memory. Thus, the following doesn't work or rather works for only small datasets.
data = data.sort_values(by=["P_VALUE"], ascending=True)
Is there a quick way to sort my data by a given column that only takes chunks into account and doesn't require loading entire datasets in memory?

In the past, I've used Linux's pair of venerable sort and split utilities, to sort massive files that choked pandas.
I don't want to disparage the other answer on this page. However, since your data is text format (as you indicated in the comments), I think it's a tremendous complication to start transferring it into other formats (HDF, SQL, etc.), for something that GNU/Linux utilities have been solving very efficiently for the last 30-40 years.
Say your file is called stuff.csv, and looks like this:
4.9,3.0,1.4,0.6
4.8,2.8,1.3,1.2
Then the following command will sort it by the 3rd column:
sort --parallel=8 -t , -nrk3,3 stuff.csv
Note that the number of threads here is set to 8, that -t , sets the comma as the field separator, and that -r sorts in descending order.
The above will work with files that fit into main memory. When your file is too large, you would first split it into a number of parts. So
split -l 100000 stuff.csv stuff
would split the file into files of at most 100000 lines each.
Now you would sort each file individually, as above. Finally, you would merge the sorted parts, again through (wait for it...) sort, passing the same key options so that the merge uses the same ordering:
sort -m -t , -nrk3,3 sorted_stuff_* > final_sorted_stuff.csv
Finally, if your file is not in CSV (say it is a tgz file), then you should find a way to pipe a CSV version of it into split.
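For anyone who wants to stay in Python, the same split / sort / merge idea can be sketched with the standard library alone. This is an illustration and not part of the answer above: the function name sort_large_csv and its parameters are made up, the file is assumed to have no header and a numeric key column, and the output is ascending (the shell example above uses -r for descending).

```python
import contextlib
import csv
import heapq
import os
import tempfile


def sort_large_csv(src, dst, key_col, chunk_rows=100_000):
    """Sort a headerless CSV by one numeric column: sorted chunks plus a k-way merge."""

    def key(row):
        return float(row[key_col])

    chunk_paths = []

    def flush(rows):
        # Sort one chunk in memory and write it to a temporary file.
        rows.sort(key=key)
        fd, path = tempfile.mkstemp(suffix=".csv")
        with os.fdopen(fd, "w", newline="") as fh:
            csv.writer(fh).writerows(rows)
        chunk_paths.append(path)

    # Pass 1: read the input in chunks of at most chunk_rows rows.
    with open(src, newline="") as fh:
        rows = []
        for row in csv.reader(fh):
            rows.append(row)
            if len(rows) >= chunk_rows:
                flush(rows)
                rows = []
        if rows:
            flush(rows)

    # Pass 2: merge the sorted chunks lazily, one row per chunk in memory.
    with contextlib.ExitStack() as stack, open(dst, "w", newline="") as out:
        readers = [
            csv.reader(stack.enter_context(open(p, newline="")))
            for p in chunk_paths
        ]
        csv.writer(out).writerows(heapq.merge(*readers, key=key))

    for p in chunk_paths:
        os.remove(p)
```

A call such as sort_large_csv("stuff.csv", "final_sorted_stuff.csv", key_col=2) would sort by the 3rd column. For very large files the GNU sort route above is likely to be faster; this version mainly shows what the split and merge steps are doing.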

As I referred in the comments, this answer already provides a possible solution. It is based on the HDF format.
About the sorting problem, there are at least three possible ways to solve it with that approach.
First, you can try to use pandas directly, querying the HDF-stored-DataFrame.
Second, you can use PyTables, which pandas uses under the hood.
Francesc Alted gives a hint in the PyTables mailing list:
The simplest way is by setting the sortby parameter to true in the
Table.copy() method. This triggers an on-disk sorting operation, so you
don't have to be afraid of your available memory. You will need the Pro
version for getting this capability.
In the docs, it says:
sortby :
If specified, and sortby corresponds to a column with an index, then the copy will be sorted by this index. If you want to ensure a fully sorted order, the index must be a CSI one. A reverse sorted copy can be achieved by specifying a negative value for the step keyword. If sortby is omitted or None, the original table order is used
Third, still with PyTables, you can use the method Table.itersorted().
From the docs:
Table.itersorted(sortby, checkCSI=False, start=None, stop=None, step=None)
Iterate table data following the order of the index of sortby column. The sortby column must have associated a full index.
Another approach consists in using a database in between. The detailed workflow can be seen in this IPython Notebook published at plot.ly.
This makes it possible to solve the sorting problem, along with other data analyses that are possible with pandas. It looks like it was created by the user chris, so all the credit goes to him. I am copying the relevant parts here.
Introduction
This notebook explores a 3.9Gb CSV file.
This notebook is a primer on out-of-memory data analysis with
pandas: A library with easy-to-use data structures and data analysis tools. Also, interfaces to out-of-memory databases like SQLite.
IPython notebook: An interface for writing and sharing python code, text, and plots.
SQLite: A self-contained, server-less database that's easy to set up and query from Pandas.
Plotly: A platform for publishing beautiful, interactive graphs from Python to the web.
Requirements
import pandas as pd
from sqlalchemy import create_engine # database connection
Import the CSV data into SQLite
Load the CSV, chunk-by-chunk, into a DataFrame
Process the data a bit, strip out uninteresting columns
Append it to the SQLite database
disk_engine = create_engine('sqlite:///311_8M.db') # Initializes database with filename 311_8M.db in current directory
chunksize = 20000
index_start = 1
for df in pd.read_csv('311_100M.csv', chunksize=chunksize, iterator=True, encoding='utf-8'):
    # do stuff
    df.index += index_start
    df.to_sql('data', disk_engine, if_exists='append')
    index_start = df.index[-1] + 1
Query value counts and order the results
Housing and Development Dept receives the most complaints
df = pd.read_sql_query('SELECT Agency, COUNT(*) as `num_complaints` '
'FROM data '
'GROUP BY Agency '
'ORDER BY -num_complaints', disk_engine)
Limiting the number of sorted entries
What are the 10 most common complaints in each city?
df = pd.read_sql_query('SELECT City, COUNT(*) as `num_complaints` '
'FROM data '
'GROUP BY `City` '
'ORDER BY -num_complaints '
'LIMIT 10 ', disk_engine)
Possibly related and useful links
Pandas: in memory sorting hdf5 files
ptrepack sortby needs 'full' index
http://pandas.pydata.org/pandas-docs/stable/cookbook.html#hdfstore
http://www.pytables.org/usersguide/optimization.html

Blaze might be the tool for you with the ability to work with pandas and csv files out of core.
http://blaze.readthedocs.org/en/latest/ooc.html
import blaze
import pandas as pd
d = blaze.Data('my-large-file.csv')
d.P_VALUE.sort() # Uses Chunked Pandas
For faster processing, load it into a database first which blaze can control. But if this is a one off and you have some time then the posted code should do it.

If your CSV file contains only structured data, I would suggest an approach using only Linux commands.
Assume csv file contains two columns, COL_1 and P_VALUE:
map.py:
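#!/usr/bin/env python3
import sys
for line in sys.stdin:
    col_1, p_value = line.rstrip('\n').split(',')
    # put the p-value first so that sort orders the lines by it
    print("%f,%s" % (float(p_value), col_1))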
then the following linux command will generate the csv file with p_value sorted:
cat input.csv | ./map.py | sort > output.csv
If you're familiar with hadoop, using the above map.py also adding a simple reduce.py will generate the sorted csv file via hadoop streaming system.

Here is my honest suggestion: three options you can try.
I like Pandas for its rich documentation and features, but I have been advised to use NumPy, as it feels comparatively faster for larger datasets. You can consider other tools as well to make the job easier.
If you are using Python 3, you can break your big data into chunks and process them with concurrent threads. I have not tried this myself; I believe Pandas, NumPy and SciPy are already built with hardware-level parallelism in mind.
This is the option I prefer, as it is the easiest. Check the documentation at http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort.html
You can also use the 'kind' parameter of the pandas sort function you are using.
Godspeed my friend.

Running update of Pandas dataframe

I'd like to use a DataFrame to manage data from many trials of an experiment I'm controlling with Python code. Ideally I will have one master dataframe with a row for each trial that lives in the main function namespace, and then a separate dict (or dataframe) returned from the function that I call to execute the important bits of code for each trial.
What is the best way to do a running update of the master dataframe with this returned set of data? So far I've come up with:
df = df.append(df_trial, ignore_index=True)
or
df = pd.concat([df, df_trial])
But neither seem ideal (and both take a relatively long time according to %timeit). Is there a more Pandonic way?

You should build a list of the pieces and concatenate them all in one shot at the end.
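A minimal sketch of that pattern is shown below. The function run_trial and its fields are hypothetical stand-ins for the per-trial code in the question; only the collect-then-build structure is the point.

```python
import pandas as pd


def run_trial(i):
    """Hypothetical stand-in for the per-trial function; returns one row as a dict."""
    return {"trial": i, "response": i * 0.5}


# Variant 1: each trial returns a dict, so build the frame once from a list of dicts.
rows = [run_trial(i) for i in range(5)]
df = pd.DataFrame(rows)

# Variant 2: each trial returns a small DataFrame, so concatenate once at the end.
pieces = [pd.DataFrame([run_trial(i)]) for i in range(5)]
df_from_pieces = pd.concat(pieces, ignore_index=True)

print(df)
```

Appending to a Python list is cheap, whereas every df.append or pd.concat call inside the loop copies all rows collected so far, which is why the per-trial update gets slower as the frame grows.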
