I have a CSV file of about 175 million lines (2.86 GB), composed of three columns as shown below:
I need to get the value in column "val" given "ID1" and "ID2". I query this dataframe constantly with varying combinations of ID1 and ID2, which are unique in the whole file.
I have tried to use pandas as shown below, but the results take a lot of time.
def is_av(Qterm, Cterm, df):
    try:
        return df.loc[(Qterm, Cterm), 'val']
    except KeyError:
        return 0
Is there a faster way to access CSV values, knowing that each value is located in one single row of the whole file?
If not, could you check this function and tell me what might be the cause of the slow processing?
for nc in L:  # ID1
    score = 0.0
    for ni in id_list:  # ID2
        e = is_av(ni, nc, df_g)
        InDegree = df1.loc[ni].values[0]
        SumInMap = df2.loc[nc].values[0]
        score = score + term_score(InDegree, SumInMap, e)  # compute a score
    key = pd_df3.loc[nc].values[0]
    tmt[key] = score
TL;DR: Use a DBMS (I suggest MySQL or PostgreSQL). Pandas is definitely not suited for this sort of work. Dask is better, but not as good as a traditional DBMS.
The absolute best way of doing this would be to use SQL; consider MySQL or PostgreSQL for starters (both free and very efficient for your current use case). While Pandas is an incredibly strong library, indexing and quick reading are not things it excels at, since it needs to either load the data into memory or stream over it with little control compared to a DBMS.
Consider your use case where you have many values and you want to skip specific rows: say you're looking for (ID1, ID2) with values of (3108, 4813). You want to skip over every row that starts with anything other than 3, then anything other than 31, and so on, and then skip any row starting with anything other than 3108,4 (assuming your CSV delimiter is a comma), and so on until you get exactly the ID1 and ID2 you're looking for; this is reading the data at a character level.
Pandas does not allow you to do this (as far as I know; someone can correct this response if it does). The other example uses Dask, which is a library designed by default to handle data much larger than RAM at scale, but it is not optimized for index management the way a DBMS is. Don't get me wrong, Dask is good, but not for your use case.
Another very basic alternative would be to index your data based on ID1 and ID2, store them indexed, and only look up your data through actual file reading, skipping lines that do not start with your designated ID1 and then skipping lines that do not match your ID2, and so on. However, the best practice would be to use a DBMS, since caching, read optimization, and many other serious advantages become available, reducing the I/O read time from your disk.
You can get started with MySQL here: https://dev.mysql.com/doc/mysql-getting-started/en/
You can get started with PostgreSQL here: https://www.postgresqltutorial.com/postgresql-getting-started/
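To make the indexed-lookup idea concrete, here is a rough sketch using Python's built-in sqlite3 module, chosen only because it needs no server; the same pattern applies to MySQL/PostgreSQL. The table name, database file and the one-off load step are illustrative, not taken from your code:
import sqlite3

conn = sqlite3.connect('lookup.db')  # hypothetical database file
conn.execute('CREATE TABLE IF NOT EXISTS vals (ID1 TEXT, ID2 TEXT, val REAL)')
# ... one-off load of the CSV here (e.g. csv.reader + executemany) ...
conn.execute('CREATE INDEX IF NOT EXISTS idx_ids ON vals (ID1, ID2)')

def is_av(qterm, cterm):
    # With the composite index this is a B-tree lookup, not a full scan.
    row = conn.execute('SELECT val FROM vals WHERE ID1 = ? AND ID2 = ?',
                       (qterm, cterm)).fetchone()
    return row[0] if row else 0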
import os
os.system('pip install dask')  # install Dask if it is not already available

import dask.dataframe as dd

# Lazily read the CSV; Dask only loads chunks as needed.
dd_data = dd.read_csv('sample.csv')

# Build a boolean filter for the (ID1, ID2) pair of interest.
bool_filter_conditions = (dd_data['ID1'] == 'a') & (dd_data['ID2'] == 'b')

# Select the matching row(s) and keep only the 'val' column.
dd_result = dd_data[bool_filter_conditions][['val']]

# Trigger the actual computation.
dd_output = dd_result.compute()
dd_output
csv data:
>c1,v1,c2,v2,Time
>13.9,412.1,29.7,177.2,14:42:01
>13.9,412.1,29.7,177.2,14:42:02
>13.9,412.1,29.7,177.2,14:42:03
>13.9,412.1,29.7,177.2,14:42:04
>13.9,412.1,29.7,177.2,14:42:05
>0.1,415.1,1.3,-0.9,14:42:06
>0.1,408.5,1.2,-0.9,14:42:07
>13.9,412.1,29.7,177.2,14:42:08
>0.1,413.4,1.3,-0.9,14:42:09
>0.1,413.8,1.3,-0.9,14:42:10
My current code:
import pandas as pd
import csv
import datetime as dt
#Read .csv file, get timestamp and split it into date and time separately
Data = pd.read_csv('filedata.csv', parse_dates=['Time_Stamp'], infer_datetime_format=True)
Data['Date'] = Data.Time_Stamp.dt.date
Data['Time'] = Data.Time_Stamp.dt.time
#print (Data)
print (Data['Time_Stamp'])
Data['Time_Stamp'] = pd.to_datetime(Data['Time_Stamp'])
#Read timestamp within a certain range
mask = (Data['Time_Stamp'] > '2017-06-12 10:48:00') & (Data['Time_Stamp']<= '2017-06-12 11:48:00')
june13 = Data.loc[mask]
#print (june13)
What I'm trying to do is to read the data in 5-second blocks, and if any value of c1 within those 5 seconds is 10.0 or above, replace that value of c1 with 0.
I'm still new to Python and I could not find examples for this. May I have some assistance, as this problem is way beyond my Python programming skills for now? Thank you!
I don't know the modules around CSV files, so my answer might look primitive, and I'm not quite sure what you are trying to accomplish here, but have you thought of dealing with the file textually?
From what I understand, you want to read every c1 value, check it, and modify it if needed.
To read and modify the file, you could do:
with open('filedata.csv', 'r+') as csv_file:
    lines = csv_file.readlines()
    # For each line, isolate the first value and modify it if needed.
    # I'm seriously not sure, you might have wanted to read only one out of five lines.
    # For that, just do a while loop with an index, which increments through lines by 5.
    for i, line in enumerate(lines):
        fields = line.split(',')  # split comma-separated values
        # Check the condition and apply the needed change.
        try:
            if float(fields[0]) >= 10:
                fields[0] = "0"  # directly as a string
        except ValueError:
            pass  # skip the header line (or any malformed value)
        # Transform the list back into a single string and store it back.
        lines[i] = ",".join(fields)
    # Rewrite the file.
    csv_file.seek(0)
    csv_file.writelines(lines)
    csv_file.truncate()
    # Here you are ready to use the file just like you were already doing.
    # Of course, the above code could be put in a function for known advantages.
(I don't have python here, so I couldn't test it and typos might be there.)
If you only need the dataframe without the file being modified:
Pretty much the same to be honest.
Instead of the file-writing at the end (the seek/writelines/truncate part), you could do:
from io import StringIO  # pandas needs a file-like StringIO instead of a plain string.

# Above code here, but without the file-rewriting lines.
Data = pd.read_csv(
    StringIO("".join(lines)),  # the lines already end with '\n'
    parse_dates=['Time_Stamp'],
    infer_datetime_format=True
)
This should give you the Data you have, with changed values where needed.
Hope this wasn't completely off. Also, some people might find this approach horrible; we already have working modules to do that kind of thing, so why bother dealing with the rough raw data ourselves? Personally, I think it's often easier to understand how the text representation of a file can be used than to learn every external module I'll ever need. Your opinion might differ.
Also, this code might perform worse, as we need to iterate through the text twice (pandas does it again when reading). However, I don't think you'd get a faster result by reading the CSV like you already do and then iterating through the data anyway to check the condition. (You might save a cast per checked c1 value, but the difference is small, and iterating through a pandas dataframe might well be slower than iterating through a list, depending on the state of its current optimisation.)
Of course, if you don't really need the pandas dataframe format, you could do it completely manually; it would take only a few more lines (or not, to be honest) and shouldn't be slower, as the number of iterations would be minimized: you could check conditions on the data at the same time as you read it. It's getting late, so only a rough sketch follows (typed in my great editor, known as stackoverflow); ask if there's anything!
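Something along these lines, untested and assuming plain comma-separated data with c1 as the first column, as in your sample:
rows = []
with open('filedata.csv', 'r') as csv_file:
    header = csv_file.readline().rstrip('\n').split(',')
    for line in csv_file:
        fields = line.rstrip('\n').split(',')
        # Apply the condition while reading: c1 >= 10.0 becomes 0.
        if float(fields[0]) >= 10:
            fields[0] = "0"
        rows.append(fields)
# 'rows' now holds the cleaned data as lists of strings, ready for whatever
# per-5-second logic you want to apply next.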
I have a gz file, and I want to extract the unique values from each column of the file; the field separator is |. I tried using Python as below.
import sys, os, csv, gzip
from sets import Set

ig = 0
max_d = 1
with gzip.open("fundamentals.20170724.gz", "rb") as f:
    reader = csv.reader(f, delimiter="|")
    for i in range(0, 400):
        unique = Set()
        print "Unique_value for column " + str(i+1)
        flag = 0
        for line in reader:
            try:
                unique.add(line[i])
                max_d += 1
                if len(unique) >= 10:
                    print unique
                    flag = 1
                    break
            except:
                continue
        if flag == 0: print unique
I don't find this efficient for large files, although it is working somehow; I would like to approach the problem from a bash point of view.
Is there any shell script solution?
For example, I have the data in my file as:
5C4423,COMP,ISIN,CA2372051094,2016-04-19,
41C528,COMP,ISIN,US2333774071,2000-01-01,
B62545,COMP,ISIN,NL0000344265,2000-01-01,2007-05-11
9E7F41,COMP,ISIN,CA39260W1023,2013-02-13,2013-08-09
129DC8,COMP,ISIN,US37253A1034,2012-09-07,
4DE8CD,COMP,ISIN,QA000A0NCQB1,2008-03-06,
and I want all unique values from each column.
With the gunzipped file, you could do:
awk -F, 'END { for (i=1;i<=NF;i++) { print "cut -d\",\" -f "i" filename | uniq" } }' filename | sh
Set the field separator to , and then, for each field in the file, construct a cut command piping through uniq, and finally pipe the whole awk output through sh. The use of cut, uniq and sh will slow things down, and note that uniq only collapses adjacent duplicates (sort -u would be more robust); there is probably a more efficient way, but it's worth a go.
A shell-built pipeline could indeed do this job faster, though likely less memory efficiently. The primary reasons are two: parallelism and native code.
First, since we have little description of the task, I'll have to read the Python code and figure out what it does.
from sets import Set is an odd line; sets are part of the standard library, and I don't know what your sets module contains. I'll have to guess it's at best another name for the standard set type, or at least a less efficient variant of the same concept.
gzip.open lets the script read a gzipped file. We can replace this with a zcat process.
csv.reader reads character-separated values, in this case splitting on '|'. Deeper inside the code we find that only one column (line[i]) is read at a time, so we could replace it with cut or awk ... until i changes. awk can handle that case too, but it's a little trickier.
The trickiest part is the end logic. Every time 10 unique values are found in a column, the program outputs those values and switches to the next column. By the way, Python's for has an else clause specifically for this case, so you don't need a flag variable.
One of the odder parts of the code is how you catch all exceptions from the inner data processing block. Why is this? There are basically only two sources of exceptions in there: Firstly, the indexing could fail if there aren't that many columns. Secondly, the unknown Set type could be throwing exceptions; the standard set type would not.
So, the analysis of your function is: in a diagonal manner (since the file is never rewound and columns are not processed in parallel), collect unique values from each column until ten are found, and print them. This means, for instance, that if the first column has fewer than ten unique items, nothing is ever printed for any other column. I'm not sure this is the logic you intended.
With such complicated logic, Python's set functionality actually is a good choice; if we could partition the data more easily then uniq might have been better. What throws us off is how the program moves from column to column and only wants a specific number of values.
Thus, the two big time wasters in the Python program are decompressing in the same thread as we do all the parsing, and splitting into all columns when we only need one. The former can be addressed using a thread, and the latter is probably best done using a regular expression such as r'^(?:[^|]*\|){3}([^|]*)'. That expression would skip three columns and the fourth can be read as group 1. It gets more complicated if the CSV has quoting to contain the separator within some column. We could do the line parsing itself in a separate thread, but that wouldn't solve the issue of the many unneeded string allocations.
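To make that concrete, here is a rough Python 3 sketch of the zcat-plus-regex idea; the column index and the cap of ten values are taken from (or assumed from) the question, the file name is yours, and the regex ignores any quoting in the data:
import re
import subprocess

COLUMN = 3        # zero-based index of the column we want (assumed)
MAX_UNIQUE = 10   # stop after this many distinct values, as in the question

# Skip COLUMN fields, capture the next one; no quoting support.
pattern = re.compile(r'^(?:[^|]*\|){%d}([^|]*)' % COLUMN)

# Decompress in a separate process so parsing and decompression overlap.
zcat = subprocess.Popen(['zcat', 'fundamentals.20170724.gz'],
                        stdout=subprocess.PIPE, text=True)

unique = set()
for line in zcat.stdout:
    m = pattern.match(line)
    if m:
        unique.add(m.group(1))
        if len(unique) >= MAX_UNIQUE:
            break
zcat.stdout.close()
zcat.wait()
print(unique)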
Note that the problem actually becomes considerably different if what you really want is to process all columns from the start of the file. I also don't know why you specifically process 400 columns regardless of how many exist. If we remove those two constraints, the logic would be more like:
firstline = next(reader)
sets = [{column} for column in firstline]
for line in reader:
    for column, columnset in zip(line, sets):
        columnset.add(column)
This is a pure Python version based on your idea:
from io import StringIO
from csv import reader
txt = '''5C4423,COMP,ISIN,CA2372051094,2016-04-19,
41C528,COMP,ISIN,US2333774071,2000-01-01,
B62545,COMP,ISIN,NL0000344265,2000-01-01,2007-05-11
9E7F41,COMP,ISIN,CA39260W1023,2013-02-13,2013-08-09
129DC8,COMP,ISIN,US37253A1034,2012-09-07,
4DE8CD,COMP,ISIN,QA000A0NCQB1,2008-03-06,'''
with StringIO(txt) as file:
    rows = reader(file)
    first_row = next(rows)
    unique = [{item} for item in first_row]
    for row in rows:
        for item, s in zip(row, unique):
            s.add(item)
which yields for your input:
[{'129DC8', '41C528', '4DE8CD', '5C4423', '9E7F41', 'B62545'},
{'COMP'},
{'ISIN'},
{'CA2372051094',
'CA39260W1023',
'NL0000344265',
'QA000A0NCQB1',
'US2333774071',
'US37253A1034'},
{'2000-01-01', '2008-03-06', '2012-09-07', '2013-02-13', '2016-04-19'},
{'', '2007-05-11', '2013-08-09'}]
Oops, now that I have posted my answer I see that this is exactly what Yann Vernier proposes at the end of his answer. Please upvote his answer, which was here way earlier than mine...
If you want to limit the number of unique values collected, you can cap the size of each set:
from io import StringIO
from csv import reader

MAX_LEN = 3

with StringIO(txt) as file:
    rows = reader(file)
    first_row = next(rows)
    unique = [{item} for item in first_row]
    for row in rows:
        for item, s in zip(row, unique):
            if len(s) < MAX_LEN:
                s.add(item)

print(unique)
with the result:
[{'41C528', '5C4423', 'B62545'},
{'COMP'},
{'ISIN'},
{'CA2372051094', 'NL0000344265', 'US2333774071'},
{'2000-01-01', '2013-02-13', '2016-04-19'},
{'', '2007-05-11', '2013-08-09'}]
This way you would save some memory if one of your columns holds only unique values.
I am parsing huge text files (3GB each) with millions of rows.
I am reading the files using pandas' read_table with an iterator (chunksize) and without specifying a delimiter, because sep=" " keeps giving me the following error:
CParserError: Error tokenizing data
A typical row for example:
<www.blabla.com> <mmm> "Hello" <A.C> .
I wrote a function that will return a list with the following elements:
www.blabla.com mmm Hello A.C
But it gets complicated because text outside <> or "" must be ignored. Sometimes there are double quotes inside the quotes that are escaped with a backslash (\"), and sometimes the brackets are replaced by _:mexx for a mysterious reason I do not understand, but in that case the element still counts.
Writing the above conditions into a function made the script very, very slow. It took me more than two hours to process ten million rows, and I must process approximately 200 million rows.
My goal is not the text per se, but to count the elements on each row.
It could be either three or four, so I decided to use only native pandas functions and avoid apply. Here is the relevant code so far:
tdf = pd.DataFrame(columns=['TOPIC', 'COUNTER'])
chunkk = 50000
i = 0  # running row counter
for ii, f in enumerate(files):
    reader = pd.read_table(f, header=None, chunksize=chunkk)
    for df in reader:
        df = df[0].str.split(" ", expand=True)
        df['TOPIC'] = df[0]  # first element retrieved from split
        # count here across the row the number of elements
        tdf = tdf.append(df[['TOPIC', 'COUNTER']], ignore_index=True)
        tdf = tdf.groupby('TOPIC', as_index=False).sum()
        i += chunkk
        print("Completed " + str(i) + " rows from file #" + str(ii + 1))
I think I need to use .count(axis=1), but I am not sure how to do it. Looking at the pandas documentation http://pandas.pydata.org/pandas-docs/stable/text.html,
I think regex could be a key part of the solution.
Any advice on how to count the number of valid elements is greatly appreciated. Also, ANY advice to make the code run faster would be great.
There might be a better way to do it using a database and SQL, so I am tagging them below.
I see a few options you might have, and I am not sure which is best for your use case. Distributing and processing with Pyspark would probably be my pick. However...
I am not sure putting this in a SQL database would give you anything. You would still have to parse the data and process it, so even if we could optimize the querying, you have other issues.
You may want to take a close look at disk access vs memory. If the file is small enough to easily fit in memory, you may get better performance by reading the whole thing and then processing it line by line. This reduces disk I/O but at a steep memory cost.
You could distribute, load at once, and process in Pyspark. This may be the best option since you are just looking at analyzing data and getting an answer back.
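For illustration, a minimal PySpark sketch of that last option; the path is a placeholder, and the tokenisation is the naive split from your current code, so the bracket/quote rules from the question would still need a proper parsing step:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("count-elements").getOrCreate()

lines = spark.sparkContext.textFile("path/to/files/*")  # placeholder path

# (topic, element_count) for every row, then sum the counts per topic.
pairs = lines.map(lambda line: line.split(" ")) \
             .map(lambda parts: (parts[0], len(parts)))
totals = pairs.reduceByKey(lambda a, b: a + b)

for topic, count in totals.take(20):
    print(topic, count)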
I have generated a giant SQLite database and need to get some data out of it. I wrote a script to do so, and profiling led to the unfortunate conclusion that the write process would take approx. 3 days with the current setup. I wrote the script as simply as possible to make it as fast as possible.
I am wondering if you have some trick to speed up the whole process. The database has a unique index, but the columns I am querying don't (because of duplicate rows in those columns).
Would it make sense to use any multiprocessing Python library here?
The script would be like this:
import sqlite3

def write_from_query(db_name, table_name, condition, content_column, out_file):
    '''
    Writes contents from a SQLite database column to an output file.

    Keyword arguments:
    db_name (str): Path of the .sqlite database file.
    table_name (str): Name of the target table in the SQLite file.
    condition (str): Condition for querying the SQLite database table.
    content_column (str): Name of the column that contains the content for the output file.
    out_file (str): Path of the output file that will be written.
    '''
    # Connecting to the database file
    conn = sqlite3.connect(db_name)
    c = conn.cursor()

    # Querying the database and writing the output file
    c.execute('SELECT {} FROM {} WHERE {}'.format(content_column, table_name, condition))
    with open(out_file, 'w') as outf:
        for row in c:
            outf.write(row[0])

    # Closing the connection to the database
    conn.close()

if __name__ == '__main__':
    write_from_query(
        db_name='my_db.sqlite',
        table_name='my_table',
        condition='variable1=1 AND variable2<=5 AND variable3="Zinc_Plus"',
        content_column='variable4',
        out_file='sqlite_out.txt'
    )
Link to this script on GitHub
Thanks for your help, I am looking forward to your suggestions!
EDIT:
more information about the database:
I assume that you are running the write_from_query function for a huge number of queries.
If so, the problem is the missing indices on your filter criteria.
This results in the following: for each query you execute, SQLite will loop through the whole 50GB of data and check whether your conditions hold true. That is VERY inefficient.
The easiest fix would be to slap indices on your columns, as sketched below.
An alternative would be to formulate fewer queries that cover multiple of your cases and then loop over that data again to split it into different files. How well this can be done, however, depends on how your data is structured.
I'm not sure about multiprocessing/threading; SQLite is not really made for concurrency, but I guess it could work out since you only read data...
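As a rough sketch, using the table and column names from the example call in the question (adjust to your actual schema), a composite index covering the filter columns could be created once like this:
import sqlite3

conn = sqlite3.connect('my_db.sqlite')
# One composite index over the three columns used in the WHERE clause.
conn.execute('CREATE INDEX IF NOT EXISTS idx_my_table_filter '
             'ON my_table (variable1, variable2, variable3)')
conn.commit()
conn.close()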
Either you dump the content and filter in your own program, or you add indices to all columns you use in your conditions.
Adding indices to all the columns will take a long, long time.
But for many different queries there is no alternative.
No, multiprocessing will probably not help. An SSD might, or 64 GiB of RAM. But they are not needed with indices; queries will be fast on normal disks too.
In conclusion, you created a database without creating indices for the columns you want to query. With 8 million rows this won't work.
Whilst the process of actually writing this data to a file will take a while, I would expect it to be more like minutes than days; e.g. at a 50 MB/s sequential write speed, 15GB works out at around 5 minutes.
I suspect that the issue is with the queries / lack of indexes. I would suggest trying to build composite indexes based on the combinations of columns that you need to filter on. As you will see from the documentation here, you can add as many columns as you want to an index.
Just to make you aware, adding indexes will slow down inserts/updates to your database, as every insert now also needs to find the appropriate place in the relevant indexes as well as appending data to the end of the table, but this is probably your only option to speed up the queries.
I will look at the unique indices! But meanwhile another thing I just stumbled upon... Sorry for writing my own answer to my question here, but I thought it is better for the organization...
I was thinking that fetching the rows in bulk (.fetchall() / .fetchmany()) could also speed up the whole process, but I find the sqlite3 documentation on this a little bit brief... Would something like
with open(out_file, 'w') as outf:
    c.execute('SELECT * ...')
    results = c.fetchmany(10000)
    while results:
        for row in results:
            outf.write(row[0])
        results = c.fetchmany(10000)
make sense?
I have a .dat file with no delimiters that I am trying to read into an array. Each new line represents one person, and the variables in each line occupy a fixed number of characters: e.g. the first variable "year" is the first four characters and the second variable "age" is the next two characters (no delimiters within the line), e.g.:
201219\n
201220\n
201256\n
Here is what I am doing right now:
data_file = 'filename.dat'
file = open(data_file, 'r')

year = []
age = []

for line in file:
    year.append(line[0:4])
    age.append(line[4:])
This works fine for a small number of lines and variables, but when I try loading the full data file (500Mb with 10 million lines and 20 variables) I get a MemoryError. Is there a more efficient way to load this type of data into arrays?
First off, you're probably better off with a list of class instances than a bunch of parallel lists, from a software engineering standpoint. If you try this, you probably should look into __slots__ to decrease the memory overhead.
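For illustration, a minimal sketch of that idea; the class name is made up, and the field layout is taken from the year/age example in the question:
class Person:
    __slots__ = ('year', 'age')  # no per-instance __dict__, so less memory

    def __init__(self, year, age):
        self.year = year
        self.age = age

people = []
with open('filename.dat', 'r') as f:
    for line in f:
        people.append(Person(int(line[0:4]), int(line[4:6])))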
You could also try pypy - it has some memory optimizations for homogeneous lists.
I'd probably use gdbm or bsddb rather than sqlite if you want an on-disk solution. gdbm and bsddb look like dicts, except you index them (the keys) by a string and the values are strings too. So your class (the one I mentioned above) would have a __str__ and/or __repr__ method that converts it to a string (you could use pickle) for storage in the table, and your constructor would be made to reverse the process somehow. A rough sketch of that idea follows.
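This sketch uses Python 3's generic dbm module (standing in for gdbm/bsddb) and pickle instead of a hand-written __str__; the file name and the line-number key scheme are illustrative assumptions:
import dbm
import pickle

# Write: one pickled record per line, keyed by the line number.
with dbm.open('people.db', 'c') as db:
    with open('filename.dat', 'r') as f:
        for i, line in enumerate(f):
            record = {'year': int(line[0:4]), 'age': int(line[4:6])}
            db[str(i)] = pickle.dumps(record)

# Later, look a single record up without loading everything into memory.
with dbm.open('people.db', 'r') as db:
    print(pickle.loads(db['42']))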
If you ever get to such large data that a gdbm or bsddb is too slow, you could try just writing to a flat file - that'll not be as nice for jumping around obviously, but it eliminates a lot of seek()'ing which can be very advantageous sometimes.
HTH
The problem here doesn't appear to be reading the file so much as fitting it into memory. When you're talking about 200 million of anything in memory, you're going to have some issues.
Try storing it as a list of strings (i.e. trade memory for CPU), or, if you can, just don't store it at all.
Another option to try is dumping it into a sqlite database. If you use an in-memory db you might end up with the same issue, but maybe not.
If you go for the string style, do something like this:
def get_age(person):
    return int(person[4:])

people = file.readlines()  # Wait a while....
for person in people:
    print(get_age(person)*2)  # Or something else
Here's an example of getting mean income for a particular age in a particular year:
def get_mean_income_by_age_and_year(people, target_age, target_year):
    count = 0
    total = 0.0
    for person in people:
        income, age, year = get_income(person), get_age(person), get_year(person)
        if age == target_age and year == target_year:
            total += income
            count += 1
    if count:
        return total / count
    else:
        return 0.0
Really, though, this basically does what storing it in a sqlite database would do for you. If there are only a couple of very specific things you want to do, then going this way is probably reasonable. But it sounds like there are probably several things you want to be doing with this info - if so a sqlite database is probably what you want.
A more efficient data structure for lots of uniform numeric data is the array. Depending on how much memory you have, using an array may work.
import array

year = array.array('i')    # int
age = array.array('i')     # int
income = array.array('f')  # float

with open('data.txt', 'r') as f:
    for line in f:
        year.append(int(line[0:4]))
        age.append(int(line[4:6]))
        income.append(float(line[6:12]))