Sorting in pandas for large datasets - python

I would like to sort my data by a given column, specifically p-values. However, the issue is that I am not able to load my entire data into memory. Thus, the following doesn't work, or rather works only for small datasets.
data = data.sort(columns=["P_VALUE"], ascending=True, axis=0)
Is there a quick way to sort my data by a given column that only takes chunks into account and doesn't require loading entire datasets in memory?

In the past, I've used Linux's venerable sort and split utilities to sort massive files that choked pandas.
I don't want to disparage the other answer on this page. However, since your data is in text format (as you indicated in the comments), I think it's a tremendous complication to start converting it into other formats (HDF, SQL, etc.) for something that GNU/Linux utilities have been solving very efficiently for the last 30-40 years.
Say your file is called stuff.csv, and looks like this:
4.9,3.0,1.4,0.6
4.8,2.8,1.3,1.2
Then the following command will sort it numerically, in descending order, by the 3rd column:
sort --parallel=8 -t , -nrk3 stuff.csv
Note that the number of threads here is set to 8.
The above will work with files that fit into the main memory. When your file is too large, you would first split it into a number of parts. So
split -l 100000 stuff.csv stuff
would split the file into files of length at most 100000 lines.
Now you would sort each file individually, as above. Finally, you would use mergesort, again through (wait for it...) sort:
sort -m sorted_stuff_* > final_sorted_stuff.csv
Finally, if your file is not in CSV (say it is a tgz file), then you should find a way to pipe a CSV version of it into split.

As I mentioned in the comments, this answer already provides a possible solution. It is based on the HDF format.
About the sorting problem, there are at least three possible ways to solve it with that approach.
First, you can try to use pandas directly, querying the HDF-stored DataFrame.
Second, you can use PyTables, which pandas uses under the hood.
Francesc Alted gives a hint in the PyTables mailing list:
The simplest way is by setting the sortby parameter to true in the
Table.copy() method. This triggers an on-disk sorting operation, so you
don't have to be afraid of your available memory. You will need the Pro
version for getting this capability.
In the docs, it says:
sortby :
If specified, and sortby corresponds to a column with an index, then the copy will be sorted by this index. If you want to ensure a fully sorted order, the index must be a CSI one. A reverse sorted copy can be achieved by specifying a negative value for the step keyword. If sortby is omitted or None, the original table order is used
Third, still with PyTables, you can use the method Table.itersorted().
From the docs:
Table.itersorted(sortby, checkCSI=False, start=None, stop=None, step=None)
Iterate table data following the order of the index of sortby column. The sortby column must have associated a full index.
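For concreteness, here is a minimal sketch of the second and third options (my own assumption of how the pieces fit together, not code from the mailing list or docs): it supposes the data already lives in an HDF5 table at /data with a P_VALUE column; the file and node names are invented.
import tables

with tables.open_file('data.h5', mode='a') as h5:
    table = h5.root.data

    # A completely sorted index (CSI) on the sort column is needed for a
    # fully sorted copy or iteration.
    if not table.cols.P_VALUE.is_indexed:
        table.cols.P_VALUE.create_csindex()

    # Option 1: materialize a sorted copy of the table on disk.
    table.copy(newname='data_sorted', sortby='P_VALUE', checkCSI=True)

    # Option 2: iterate the rows in sorted order without copying.
    for row in table.itersorted('P_VALUE'):
        pass  # process each row here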
Another approach consists of using a database in between. The detailed workflow can be seen in this IPython Notebook published at plot.ly.
This makes it possible to solve the sorting problem, along with other data analyses that are possible with pandas. It looks like it was created by the user chris, so all the credit goes to him. I am copying here the relevant parts.
Introduction
This notebook explores a 3.9Gb CSV file.
This notebook is a primer on out-of-memory data analysis with
pandas: A library with easy-to-use data structures and data analysis tools. Also, interfaces to out-of-memory databases like SQLite.
IPython notebook: An interface for writing and sharing python code, text, and plots.
SQLite: A self-contained, serverless database that's easy to set up and query from pandas.
Plotly: A platform for publishing beautiful, interactive graphs from Python to the web.
Requirements
import pandas as pd
from sqlalchemy import create_engine # database connection
Import the CSV data into SQLite
Load the CSV, chunk-by-chunk, into a DataFrame
Process the data a bit, strip out uninteresting columns
Append it to the SQLite database
disk_engine = create_engine('sqlite:///311_8M.db') # Initializes database with filename 311_8M.db in current directory
chunksize = 20000
index_start = 1
for df in pd.read_csv('311_100M.csv', chunksize=chunksize, iterator=True, encoding='utf-8'):
    # do stuff
    df.index += index_start
    df.to_sql('data', disk_engine, if_exists='append')
    index_start = df.index[-1] + 1
Query value counts and order the results
Housing and Development Dept receives the most complaints
df = pd.read_sql_query('SELECT Agency, COUNT(*) as `num_complaints` '
                       'FROM data '
                       'GROUP BY Agency '
                       'ORDER BY -num_complaints', disk_engine)
Limiting the number of sorted entries
Which 10 cities generate the most complaints?
df = pd.read_sql_query('SELECT City, COUNT(*) as `num_complaints` '
                       'FROM data '
                       'GROUP BY `City` '
                       'ORDER BY -num_complaints '
                       'LIMIT 10 ', disk_engine)
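Back to the original sorting question, a minimal sketch of the same idea (assuming the p-value CSV has been loaded into a table named data in the same chunked way, and that it has a P_VALUE column): let SQLite do the ordering and stream the result back in manageable pieces.
for chunk in pd.read_sql_query('SELECT * FROM data ORDER BY P_VALUE ASC',
                               disk_engine, chunksize=20000):
    # each chunk arrives already globally sorted by P_VALUE
    print(chunk.head())  # placeholder: replace with whatever you do with the sorted rows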
Possibly related and useful links
Pandas: in memory sorting hdf5 files
ptrepack sortby needs 'full' index
http://pandas.pydata.org/pandas-docs/stable/cookbook.html#hdfstore
http://www.pytables.org/usersguide/optimization.html

Blaze might be the tool for you with the ability to work with pandas and csv files out of core.
http://blaze.readthedocs.org/en/latest/ooc.html
import blaze
import pandas as pd
d = blaze.Data('my-large-file.csv')
d.P_VALUE.sort() # Uses Chunked Pandas
For faster processing, load it into a database first, which Blaze can control. But if this is a one-off and you have some time, then the posted code should do it.
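A sketch of the "load it into a database first" route (my assumption, using the odo companion project; the database file and table name are invented):
import blaze
from odo import odo

odo('my-large-file.csv', 'sqlite:///data.db::data')   # stream the CSV into an SQLite table
d = blaze.Data('sqlite:///data.db::data')             # Blaze now pushes work down to SQLite
d.P_VALUE.sort()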

If your csv file contains only structured data, I would suggest an approach using only Linux commands.
Assume the csv file contains two columns, COL_1 and P_VALUE:
map.py:
#!/usr/bin/env python3
import sys

for line in sys.stdin:
    col_1, p_value = line.rstrip('\n').split(',')
    # emit the p-value first so that sort orders the lines by it
    print("%s,%s" % (p_value, col_1))
then the following Linux command will generate the csv file sorted by p_value (use -g for a general numeric sort, since p-values may be in scientific notation):
cat input.csv | ./map.py | sort -g > output.csv
If you're familiar with Hadoop, the above map.py plus a simple reduce.py (a sketch follows below) will generate the sorted csv file via the Hadoop streaming system.
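The "simple reduce.py" can be just an identity reducer, since Hadoop streaming sorts the mapper output by key before the reduce step (a sketch; key-handling details depend on your streaming configuration):
#!/usr/bin/env python3
# reduce.py - identity reducer: pass the already-sorted lines straight through
import sys

for line in sys.stdin:
    sys.stdout.write(line)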

Here is my honest suggestion. Three options you can try.
I like pandas for its rich documentation and features, but it has been suggested to me to use NumPy instead, as it can feel comparatively faster for larger datasets. You can think about using other tools as well for an easier job.
In case you are using Python 3, you can break your big data into chunks and process them concurrently. I am too lazy for this and it does not look elegant; as far as I know, pandas, NumPy and SciPy are already built with hardware design in mind to enable multi-threading.
I prefer this one: it is the easy and lazy technique, in my view. Check the documentation at http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort.html
You can also use the 'kind' parameter in the pandas sort function you are using.
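For example, a one-line sketch (note that the DataFrame.sort method in the linked docs was later replaced by sort_values; 'mergesort' is a stable sort):
data = data.sort_values(by="P_VALUE", ascending=True, kind="mergesort")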
Godspeed my friend.

Related

Fastest approach to read and process 10k Excel cells in Python/Pandas?

I want to read and process realtime DDE data from a trading platform, using Excel as a 'bridge' between the trading platform (which sends out data) and Python, which processes it and prints it back to Excel as a front-end 'gui'. SPEED IS CRUCIAL. I need to:
read 6-10 thousand cells in Excel as fast as possible
sum ticks passed at same time (same h:m:sec)
check if DataFrame contains any value in a static array (eg. large quantities)
write output on the same excel file (different sheet), used as front-end output 'gui'.
I imported the 'xlwings' library and use it to read data from one sheet, calculate the needed values in Python, and then print out the results in another sheet of the same file. I want to have Excel open and visible so it functions as an 'output dashboard'. This function is run in an infinite loop reading realtime stock prices.
import xlwings as xw
import numpy as np
import pandas as pd
...
...
tickdf = pd.DataFrame(xw.Book('datafile.xlsx').sheets['raw_data'].range((1, 5), (1500, 8)).value)
tickdf.columns = ['time', 'price', 'all-tick', 'symb']
tickdf = tickdf[['time', 'symb', 'price', 'all-tick']]
#read data and fill a pandas.df with values, then re-order columns

try:
    global ttt      #this is used as temporary global pandas.df
    global tttout   #this is used as output global pandas.df copy
    #they are global as they can be zeroed with another function

    ttt = ttt.append(tickdf, ignore_index=False)
    #at each loop, newly read ticks are added as rows to the end of the ttt global df

    ttt.drop_duplicates(inplace=True)

    tttout = ttt.copy()
    #to prevent outputting incomplete data, for extra safety, I use a copy of ttt as the DF to be printed out on the excel file; I find this an extra-safety step

    tttout = tttout.groupby(['time', 'symb'], as_index=False).agg({'all-tick': 'sum', 'price': 'first'})
    tttout = tttout.set_index('time')
    #sort it by time/name and set time as index

    tttout = tttout.loc[tttout['all-tick'].isin(target_ticker)]
    #find matching values comparing an array of a dozen values

    tttout = tttout.sort_values(by=['time', 'symb'], ascending=[False, True])

    xw.Book(file_path).sheets['OUTPUT'].range('B2').value = tttout
except Exception:
    pass  # (error handling elided in the original snippet)
I run this on an i5 @ 4.2 GHz, and this function, together with some other small code, runs in 500-600 ms per loop, which is fairly good (but not fantastic!) - I would like to know if there is a better approach and which step(s) might be bottlenecks.
The code reads 1500 rows, one per listed stock in alphabetical order; each is the 'last tick' passed on the market for that specific stock, and it looks like this:
'10:00:04 | ABC | 10.33 | 50000'
'09:45:20 | XYZ | 5.260 | 200 '
'....
being time, stock symbol, price, quantity.
I want to investigate whether there are some specific quantities traded on the market, such as 1.000.000 (as it represents a huge order), or maybe just '1', which is often used as a market 'heartbeat', a sort of fake order.
My approach is to use Pandas/Xlwings/ and 'isin' method. Is there a more efficient approach that might improve my script performance?
It would be faster to use a UDF written with PyXLL as that would avoid going via COM and an external process. You would have a formula in Excel with the input set to your range of data, and that would be called each time the input data updated. This would avoid the need to keep polling the data in an infinite loop, and should be much faster than running Python outside of Excel.
See https://www.pyxll.com/docs/introduction.html if you're not already familiar with PyXLL.
PyXLL could convert the input range to a pandas DataFrame for you (see https://www.pyxll.com/docs/userguide/pandas.html), but that might not be the fastest way to do it.
The quickest way to transfer data from Excel to Python is via a floating point numpy array using the "numpy_array" type in PyXLL (see https://www.pyxll.com/docs/userguide/udfs/argtypes.html#numpy-array-types).
As speed is a concern, maybe you could split the data up and have some functions that take mostly static data (eg rows and column headers), and other functions that take variable data as numpy_arrays where possible or other types where not, and then a final function to combine them all.
PyXLL can return Python objects to Excel as object handles. If you need to return intermediate results then it is generally faster to do that instead of expanding the whole dataset to an Excel range.
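To make the suggestion above concrete, here is a very rough sketch (not code from the answer; the function name and aggregation are invented, and PyXLL's signature-string syntax for a numpy_array argument returned as an object handle is assumed):
from pyxll import xl_func
import numpy as np

@xl_func("numpy_array<float> ticks: object")
def aggregate_ticks(ticks):
    # Recalculated by Excel whenever the input range changes; returning an
    # object handle avoids expanding the intermediate result onto the sheet.
    return np.nansum(ticks, axis=0)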
@Tony Roberts, thank you
I have one doubt and one observation.
DOUBT: the data gets updated very fast, every 50-100 ms. Would it be feasible to have a UDF function called so often? Would it be lean? I have little experience with this.
OBSERVATION: PyXLL is for sure extremely powerful, well done, and well maintained, but IMHO, at $25/month it goes beyond the pure nature of the free Python language. I do understand, though, that quality has a price.

Reading Parquet File with Array<Map<String,String>> Column

I'm using Dask to read a Parquet file that was generated by PySpark, and one of the columns is a list of dictionaries (i.e. array<map<string,string>>). An example of the df would be:
import pandas as pd
df = pd.DataFrame.from_records([
    (1, [{'job_id': 1, 'started': '2019-07-04'}, {'job_id': 2, 'started': '2019-05-04'}], 100),
    (5, [{'job_id': 3, 'started': '2015-06-04'}, {'job_id': 9, 'started': '2019-02-02'}], 540)],
    columns=['uid', 'job_history', 'latency']
)
When using engine='fastparquet', Dask reads all other columns fine but returns a column of Nones for the column with the complex type. When I set engine='pyarrow', I get the following exception:
ArrowNotImplementedError: lists with structs are not supported.
A lot of googling has made it clear that reading a column with a Nested Array just isn't really supported right now, and I'm not totally sure what the best way to handle this is. I figure my options are:
Somehow tell dask/fastparquet to parse the column using the standard json library. The schema is simple and that would do the job if possible
See if I can possibly re-run the Spark job that generated the output and save it as something else, though this almost isn't an acceptable solution since my company uses parquet everywhere
Turn the keys of the map into columns and break the data up across several columns with dtype list and note that the data across these columns are related/map to each other by index (e.g. the elements in idx 0 across these keys/columns all came from the same source). This would work, but frankly, breaks my heart :(
I'd love to hear how others have navigated around this limitation. My company uses nested arrays in their parquet frequently, and I'd hate to have to let go of using Dask because of this.
It would be fairer to say that pandas does not support non-simple types very well (currently). It may be the case that pyarrow will, without conversion to pandas, and that at some future point, pandas will use these arrow structures directly.
Indeed, the most direct method that I can think for you to use, is to rewrite the columns as B/JSON-encoded text, and then load with fastparquet, specifying to load using B/JSON. You should get lists of dicts in the column, but the performance will be slow.
Note that the old project oamap and its successor awkward provide a way to iterate and aggregate over nested list/map/struct trees using Python syntax, but compiled with Numba, such that you never need to instantiate the intermediate Python objects. They were not designed for parquet, but had parquet compatibility, so they might just be useful to you.
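To illustrate the mechanism (a sketch from the pandas side rather than the answer's exact Spark-side rewrite; it assumes fastparquet's object_encoding option and reuses the example df from the question, with an invented file name):
from fastparquet import write, ParquetFile

# Ask fastparquet to JSON-encode the nested column when writing...
write('jobs.parquet', df, object_encoding={'job_history': 'json'})

# ...and it is decoded back into lists of dicts on read (at Python speed, so slowly).
df2 = ParquetFile('jobs.parquet').to_pandas()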
I'm dealing with pyarrow.lib.ArrowNotImplementedError: Reading lists of structs from Parquet files not yet supported when I try to read using Pandas; however, when I read using pyspark and then convert to pandas, the data at least loads:
import pyspark
spark = pyspark.sql.SparkSession.builder.getOrCreate()
df = spark.read.load(path)
pdf = df.toPandas()
and the offending field is now rendered as a pyspark Row object, which has some structured parsing, but you would probably have to write custom pandas functions to extract data from it:
>>> pdf["user"][0]["sessions"][0]["views"]
[Row(is_search=True, price=None, search_string='ABC', segment='listing', time=1571250719.393951), Row(is_search=True, price=None, search_string='ZYX', segment='homepage', time=1571250791.588197), Row(is_search=True, price=None, search_string='XYZ', segment='listing', time=1571250824.106184)]
An individual record can be rendered as a dictionary: simply call .asDict(recursive=True) on the Row object you would like.
Unfortunately, it takes ~5 seconds to start the SparkSession context and every spark action also takes much longer than pandas operations (for small to medium datasets) so I would greatly prefer a more python-native option

How to keep null values when writing to csv

I'm writing data from sql server into a csv file using Python's csv module and then uploading the csv file to a postgres database using the copy command. The issue is that Python's csv writer automatically converts Nulls into an empty string "" and it fails my job when the column is an int or float datatype and it tries to insert this "" when it should be a None or null value.
To make it as easy as possible to interface with modules which
implement the DB API, the value None is written as the empty string.
https://docs.python.org/3.4/library/csv.html?highlight=csv#csv.writer
What is the best way to keep the null value? Is there a better way to write csvs in Python? I'm open to all suggestions.
Example:
I have lat and long values:
42.313270000 -71.116240000
42.377010000 -71.064770000
NULL NULL
When writing to csv it converts nulls to "":
with file_path.open(mode='w', newline='') as outfile:
    csv_writer = csv.writer(outfile, delimiter=',', quoting=csv.QUOTE_NONNUMERIC)
    if include_headers:
        csv_writer.writerow(col[0] for col in self.cursor.description)
    for row in self.cursor:
        csv_writer.writerow(row)
The resulting csv:
42.313270000,-71.116240000
42.377010000,-71.064770000
"",""
NULL
Specifies the string that represents a null value. The default is \N
(backslash-N) in text format, and an unquoted empty string in CSV
format. You might prefer an empty string even in text format for cases
where you don't want to distinguish nulls from empty strings. This
option is not allowed when using binary format.
https://www.postgresql.org/docs/9.2/sql-copy.html
ANSWER:
What solved the problem for me was changing the quoting to csv.QUOTE_MINIMAL.
csv.QUOTE_MINIMAL Instructs writer objects to only quote those fields
which contain special characters such as delimiter, quotechar or any
of the characters in lineterminator.
Related questions:
- Postgresql COPY empty string as NULL not work
You have two options here: change the csv.writer() quoting option in Python, or tell PostgreSQL to accept quoted strings as possible NULLs (requires PostgreSQL 9.4 or newer)
Python csv.writer() and quoting
On the Python side, you are telling the csv.writer() object to add quotes, because you configured it to use csv.QUOTE_NONNUMERIC:
Instructs writer objects to quote all non-numeric fields.
None values are non-numeric, so result in "" being written.
Switch to using csv.QUOTE_MINIMAL or csv.QUOTE_NONE:
csv.QUOTE_MINIMAL
Instructs writer objects to only quote those fields which contain special characters such as delimiter, quotechar or any of the characters in lineterminator.
csv.QUOTE_NONE
Instructs writer objects to never quote fields. When the current delimiter occurs in output data it is preceded by the current escapechar character.
Since all you are writing is longitude and latitude values, you don't need any quoting here; there are no delimiters or quote characters present in your data.
With either option, the CSV output for None values is simply an empty string:
>>> import csv
>>> from io import StringIO
>>> def test_csv_writing(rows, quoting):
...     outfile = StringIO()
...     csv_writer = csv.writer(outfile, delimiter=',', quoting=quoting)
...     csv_writer.writerows(rows)
...     return outfile.getvalue()
...
>>> rows = [
...     [42.313270000, -71.116240000],
...     [42.377010000, -71.064770000],
...     [None, None],
... ]
>>> print(test_csv_writing(rows, csv.QUOTE_NONNUMERIC))
42.31327,-71.11624
42.37701,-71.06477
"",""
>>> print(test_csv_writing(rows, csv.QUOTE_MINIMAL))
42.31327,-71.11624
42.37701,-71.06477
,
>>> print(test_csv_writing(rows, csv.QUOTE_NONE))
42.31327,-71.11624
42.37701,-71.06477
,
PostgreSQL 9.4 COPY FROM, NULL values and FORCE_NULL
As of PostgreSQL 9.4, you can also force PostgreSQL to accept quoted empty strings as NULLs, when you use the FORCE_NULL option. From the COPY FROM documentation:
FORCE_NULL
Match the specified columns' values against the null string, even if it has been quoted, and if a match is found set the value to NULL. In the default case where the null string is empty, this converts a quoted empty string into NULL. This option is allowed only in COPY FROM, and only when using CSV format.
Naming the columns in a FORCE_NULL option lets PostgreSQL accept both the empty column and "" as NULL values for those columns, e.g.:
COPY position (
    lon,
    lat
)
FROM 'filename'
WITH (
    FORMAT csv,
    NULL '',
    DELIMITER ',',
    FORCE_NULL(lon, lat)
);
at which point it doesn't matter anymore what quoting options you used on the Python side.
Other options to consider
For simple data transformation tasks from other databases, don't use Python
If you are already querying databases to collate data to go into PostgreSQL, consider directly inserting into Postgres. If the data comes from other sources, using the foreign data wrapper (fdw) module lets you cut out the middle-man and directly pull data into PostgreSQL from other sources.
Numpy data? Consider using COPY FROM as binary, directly from Python
Numpy data can more efficiently be inserted via binary COPY FROM; the linked answer augments a numpy structured array with the required extra metadata and byte ordering, then efficiently creates a binary copy of the data and inserts it into PostgreSQL using COPY FROM STDIN WITH BINARY and the psycopg2.copy_expert() method. This neatly avoids number -> text -> number conversions.
Persisting data to handle large datasets in a pipeline?
Don't re-invent the data pipeline wheels. Consider using existing projects such as Apache Spark, which have already solved the efficiency problems. Spark lets you treat data as a structured stream, and includes the infrastructure to run data analysis steps in parallel, and you can treat distributed, structured data as Pandas dataframes.
Another option might be to look at Dask to help share datasets between distributed tasks to process large amounts of data.
Even if converting an already running project to Spark might be a step too far, at least consider using Apache Arrow, the data exchange platform Spark builds on top of. The pyarrow project would let you exchange data via Parquet files, or exchange data over IPC.
The Pandas and Numpy teams are quite heavily invested in supporting the needs of Arrow and Dask (there is considerable overlap in core members between these projects) and are actively working to make Python data exchange as efficient as possible, including extending Python's pickle module to allow for out-of-band data streams to avoid unnecessary memory copying when sharing data.
Your code
for row in self.cursor:
    csv_writer.writerow(row)
uses the writer as-is, but you don't have to. You can filter the values and replace particular ones using a generator expression and a ternary conditional:
for row in self.cursor:
    csv_writer.writerow("null" if x is None else x for x in row)
You are asking for csv.QUOTE_NONNUMERIC. This will turn everything that is not a number into a quoted string. You should consider using csv.QUOTE_MINIMAL, as it might be more what you are after:
Test Code:
import csv

test_data = (None, 0, '', 'data')

for name, quotes in (('test1.csv', csv.QUOTE_NONNUMERIC),
                     ('test2.csv', csv.QUOTE_MINIMAL)):
    with open(name, mode='w') as outfile:
        csv_writer = csv.writer(outfile, delimiter=',', quoting=quotes)
        csv_writer.writerow(test_data)
Results:
test1.csv:
"",0,"","data"
test2.csv:
,0,,data
I'm writing data from sql server into a csv file using Python's csv module and then uploading the csv file to a postgres database using the copy command.
I believe your true requirement is you need to hop data rows through the filesystem, and as both the sentence above and the question title make clear, you are currently doing that with a csv file.
Trouble is that csv format offers poor support for the RDBMS notion of NULL.
Let me solve your problem for you by changing the question slightly.
I'd like to introduce you to parquet format.
Given a set of table rows in memory, it allows you to very quickly persist them to a compressed binary file, and recover them, with metadata and NULLs intact, no text quoting hassles.
Here is an example, using the pyarrow 0.12.1 parquet engine:
import pandas as pd
import pyarrow


def round_trip(fspec='/tmp/locations.parquet'):
    rows = [
        dict(lat=42.313, lng=-71.116),
        dict(lat=42.377, lng=-71.065),
        dict(lat=None, lng=None),
    ]
    df = pd.DataFrame(rows)
    df.to_parquet(fspec)
    del df
    df2 = pd.read_parquet(fspec)
    print(df2)


if __name__ == '__main__':
    round_trip()
Output:
      lat     lng
0  42.313 -71.116
1  42.377 -71.065
2     NaN     NaN
Once you've recovered the rows in a dataframe you're free to call df2.to_sql() or use some other favorite technique to put numbers and NULLs into a DB table.
EDIT:
If you're able to run .to_sql() on the PG server, or on same LAN, then do that.
Otherwise your favorite technique will likely involve .copy_expert().
Why?
The summary is that with psycopg2, "bulk INSERT is slow".
Middle layers like sqlalchemy and pandas, and well-written apps that care about insert performance, will use .executemany().
The idea is to send lots of rows all at once, without waiting for individual result status, because we're not worried about unique index violations.
So TCP gets a giant buffer of SQL text and sends it all at once, saturating the end-to-end channel's bandwidth,
much as copy_expert sends a big buffer to TCP to achieve high bandwidth.
In contrast, the psycopg2 driver lacks support for high-performance executemany.
As of 2.7.4 it just executes items one at a time, sending a SQL command across the WAN and waiting a round-trip time for the result before sending the next command.
Ping your server;
if ping times suggest you could get a dozen round trips per second,
then plan on only inserting about a dozen rows per second.
Most of the time is spent waiting for a reply packet, rather than spent processing DB rows.
It would be lovely if at some future date psycopg2 would offer better support for this.
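For completeness, a minimal copy_expert() sketch (connection details, file name, and the table/column names are assumed), reusing the FORCE_NULL idea from the earlier answer so that quoted empty strings still become NULLs:
import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")          # assumed credentials
with conn, conn.cursor() as cur, open('positions.csv') as f:
    cur.copy_expert(
        "COPY position (lon, lat) "
        "FROM STDIN WITH (FORMAT csv, NULL '', FORCE_NULL (lon, lat))",
        f,
    )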
I would use pandas, psycopg2, and sqlalchemy. Make sure they are installed. Coming from your current workflow, and to avoid writing to csv:
#no need to import psycopg2
import pandas as pd
from sqlalchemy import create_engine
#create connection to postgres
engine = create_engine('postgres://.....')
#get column names from cursor.description
columns = [col[0] for col in self.cursor.description]
#convert data into dataframe
df = pd.DataFrame(self.cursor.fetchall(), columns=columns)
#send dataframe to postgres
df.to_sql('name_of_table',engine,if_exists='append',index=False)
#if you still need to write to csv
df.to_csv('your_file.csv')

Streaming Large (5gb) CSV’s FAST (in parallel?) on 16-core Machine?

I have 30 CSV files (saved as .txt files) ranging from 2GB to 11GB each on a server machine with 16 cores.
Each row of each CSV contains a date, a time, and an ID.
I need to construct a dense matrix of size datetime x ID (roughly 35,000 x 2000), where each cell is the count of rows that had that datetime and ID (so each CSV row's datetime and ID are used as matrix indices to update this matrix). Each file contains a unique range of datetimes, so this job is embarrassingly parallel across files.
Question: What is a faster/fastest way to accomplish this & (possibly) parallelize it? I am partial to Python, but could work in C++ if there is a better solution there. Should I re-write with MapReduce or MPI? Look into Dask or Pandas? Compile my python script somehow? Something else entirely?
My current approach (which I would happily discard for something faster):
Currently, I am doing this serially (one CSV at a time) in Python and saving the output matrix in h5 format. I stream a CSV line-by-line from the command line using:
cat one_csv.txt | my_script.py > outputfile.h5
And my python script works like:
# initialize matrix
…
for line in sys.stdin:
    # Split the line into data columns
    split = line.replace('\n', '').split(',')
    ...(extract & process datetime; extract ID)...
    # Update matrix
    matrix[datetime, ID] = matrix[datetime, ID] + 1
EDIT Below are a few example lines from one of the CSVs. The only relevant columns are 'dateYMDD' (formatted so that '80101' means Jan. 1, 2008), 'time', and 'ID'. So, for example, the code should use the first row of the CSV below to add 1 to the matrix cell corresponding to (Jan_1_2008_00_00_00, 12).
Also: There are many more unique times than unique ID's, and the CSV's are time-sorted.
Type|Number|dateYMDD|time|ID
2|519275|80101|0:00:00|12
5|525491|80101|0:05:00|25
2|624094|80101|0:12:00|75
5|623044|80102|0:01:00|75
6|658787|80102|0:03:00|4
First of all, you should probably profile your script to make sure the bottleneck is actually where you think.
That said, Python's Global Interpreter Lock will make parallelizing it difficult, unless you use multiprocessing, and I expect it will be faster to simply process them separately and merge the results: feed each Python script one CSV and output to one table, then merge the tables. If the tables are much smaller than the CSVs (as one would expect if the cells have high values) then this should be relatively efficient.
I don't think that will get you all-caps full-throttle FAST like you mentioned, though. If that doesn't meet your expectations I would think of writing it in C++, Rust or Cython.
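To make the "process each file separately and merge" suggestion concrete, here is a rough sketch (not the asker's code): it assumes the pipe-delimited layout from the sample, that IDs can be used directly as column indices, and an invented datetime_to_row() mapping; adapt both to the real layout.
import glob
import numpy as np
from datetime import datetime
from multiprocessing import Pool

N_TIMES, N_IDS = 35000, 2000          # matrix shape from the question
EPOCH = datetime(2008, 1, 1)

def datetime_to_row(date_ymdd, time_str):
    # Hypothetical mapping: minutes elapsed since 2008-01-01
    # ('80101' in the sample means Jan 1, 2008).
    dt = datetime.strptime(date_ymdd.zfill(6) + ' ' + time_str, '%y%m%d %H:%M:%S')
    return int((dt - EPOCH).total_seconds() // 60)

def count_file(path):
    counts = np.zeros((N_TIMES, N_IDS), dtype=np.int64)
    with open(path) as f:
        next(f)                        # skip the header row
        for line in f:
            _, _, date, time_, id_ = line.rstrip('\n').split('|')
            counts[datetime_to_row(date, time_), int(id_)] += 1
    return counts

if __name__ == '__main__':
    with Pool(16) as pool:             # one worker per core
        partials = pool.map(count_file, glob.glob('*.txt'))
    matrix = sum(partials)             # files cover disjoint datetimes, so just add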

Is .loc the best way to build a pandas DataFrame?

I have a fairly large csv file (700mb) which is assembled as follows:
qCode Date Value
A_EVENTS 11/17/2014 202901
A_EVENTS 11/4/2014 801
A_EVENTS 11/3/2014 2.02E+14
A_EVENTS 10/17/2014 203901
etc.
I am parsing this file to get specific values, and then using DF.loc to populate a pre-existing DataFrame, i.e. the code:
for line in fileParse:
    code = line[0]
    for point in fields:
        if point == code[code.find('_') + 1:len(code)]:
            date = line[1]
            year, quarter = quarter_map(date)
            value = float(line[2])
            pos = line[0].find('_')
            ticker = line[0][0:pos]
            i = ticker + str(int(float(year))) + str(int(float(quarter)))
            df.loc[i, point] = value
        else:
            pass
The question I have: is .loc the most efficient way to add values to an existing DataFrame? This operation seems to take over 10 hours...
FYI, fields are the columns that are in the DF (the values I'm interested in) and the index (i) is a string...
thanks
No, you should never build a dataframe row-by-row. Each time you do this the entire dataframe has to be copied (it's not extended inplace) so you are using n + (n - 1) + (n - 2) + ... + 1, O(n^2), memory (which has to be garbage collected)... which is terrible, hence it's taking hours!
You want to use read_csv, and you have a few options:
read in the entire file in one go (this should be fine with 700mb even with just a few gig of ram).
pd.read_csv('your_file.csv')
read in the csv in chunks and then glue them together (in memory)... tbh I don't think this will actually use less memory than the above, but is often useful if you are doing some munging at this step.
pd.concat(pd.read_csv('foo.csv', chunksize=100000))  # not sure what optimum value is for chunksize
read the csv in chunks and save them into pytables (rather than in memory), if you have more data than memory (and you've already bought more memory) then use pytables/hdf5!
store = pd.HDFStore('store.h5')
for df in pd.read_csv('foo.csv', chunksize=100000):
store.append('df', df)
If I understand correctly, I think it would be much faster to:
Import the whole csv into a dataframe using pandas.read_csv.
Select the rows of interest from the dataframe.
Append the rows to your other dataframe using df.append(other_df).
If you provide more information about what criteria you are using in step 2 I can provide code there as well.
A couple of options that come to mind
1) Parse the file as you are currently doing, but build a dict instead of appending to your DataFrame. After you're done with that, convert the dict to a DataFrame and then use concat() to combine it with the existing DataFrame (see the sketch after this list)
2) Bring that csv into pandas using read_csv() and then filter/parse on what you want then do a concat() with the existing dataframe
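A rough sketch of option 1, reusing the variables from the question (fileParse, fields, quarter_map, and the existing df); the nested-dict layout here is one possible choice, not the only one:
import pandas as pd

rows = {}                                   # {index label: {column: value}}
for line in fileParse:
    code = line[0]
    for point in fields:
        if point == code[code.find('_') + 1:]:
            year, quarter = quarter_map(line[1])
            ticker = code[:code.find('_')]
            i = ticker + str(int(float(year))) + str(int(float(quarter)))
            rows.setdefault(i, {})[point] = float(line[2])

new_df = pd.DataFrame.from_dict(rows, orient='index')
df = pd.concat([df, new_df])                # combine with the existing DataFrame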
