Processing large Pandas Dataframes (fuzzy matching) - python

I would like to do fuzzy matching where I match strings from a column of a large dataframe (130,000 rows) to a list (400 entries).
The code I wrote was tested on a small sample (matching 3,000 rows to 400 rows) and works fine. It is too long to copy here, but it roughly works like this:
1) data normalization of columns
2) create the Cartesian product of the columns and calculate the Levenshtein distance
3) select the highest-scoring matches and store 'large_csv_name' in a separate list.
4) compare list of 'large_csv_names' to 'large_csv', pull out all the intersecting data and write to a csv.
Because the Cartesian product contains over 50 million records, I quickly run into memory errors.
That's why I would like to know how to divide the large dataset into chunks on which I can then run my script.
So far I have tried:
df_split = np.array_split(df, x)  # e.g. x = 50 or 500
for i in df_split:
    # steps 1-4 as above
As well as:
for chunk in pd.read_csv('large_csv.csv', chunksize=x):  # e.g. x = 50 or 500
    # steps 1-4 as above
None of these methods seems to work. I would like to know how to run the fuzzy matching in chunks, that is, cut the large csv into pieces: take a piece, run the code, take the next piece, run the code, and so on.

In the meantime I wrote a script that slices a dataframe into chunks, each of which is then ready to be processed further. Since I'm new to Python the code is probably a bit messy, but I still wanted to share it with anyone who might be stuck on the same problem I was.
import pandas as pd
import math

partitions = 3                     # number of ways to split df
length = len(df)
list_index = list(df.index.values)
counter = 0                        # var that will be used to stop slicing when df ends
block_counter0 = 0                 # var which will indicate the begin index of the slice
block_counter1 = block_counter0 + math.ceil(length / partitions)  # likewise, the end index

while counter < int(len(list_index)):             # stop slicing when df ends
    df1 = df.iloc[block_counter0:block_counter1]  # temp df that forms the chunk
    for i in range(block_counter0, block_counter1):
        # insert operations on rows of df1 here
        counter += 1                               # increase counter by 1 to stop slicing in time
    block_counter0 = block_counter1                # when the for loop ends, indices are updated
    if block_counter0 + math.ceil(length / partitions) > int(len(list_index)):
        block_counter1 = len(list_index)
        counter += 1
    else:
        block_counter1 = block_counter0 + math.ceil(length / partitions)
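For comparison, here is a minimal sketch of the same chunked workflow built on pd.read_csv with chunksize. The fuzzy_match_chunk function, the reference_list variable and the 'large_csv_name' column name are placeholders standing in for steps 1-3 of the pipeline described above, not actual code from it:

import pandas as pd

def fuzzy_match_chunk(chunk, reference_list):
    """Hypothetical stand-in for steps 1-3: normalize the chunk, compute
    Levenshtein scores against reference_list and return the best-matching
    'large_csv_name' values."""
    raise NotImplementedError

reference_list = []  # placeholder for the 400-entry list described above

# Steps 1-3, one chunk at a time: only the matched names are kept in memory
matched_names = []
for chunk in pd.read_csv('large_csv.csv', chunksize=5000):
    matched_names.extend(fuzzy_match_chunk(chunk, reference_list))

# Step 4: pull the intersecting rows out of the large csv, again chunk by chunk
first = True
for chunk in pd.read_csv('large_csv.csv', chunksize=5000):
    hits = chunk[chunk['large_csv_name'].isin(matched_names)]
    hits.to_csv('matches.csv', mode='w' if first else 'a', header=first, index=False)
    first = False

This way the Cartesian product is only ever built chunk-by-chunk against the 400-entry list, so memory use stays bounded by the chunk size.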

Related

'For' loop which relies heavily on subsetting is too slow. Alternatives to optimize?

I'm switching from R to Python. Unfortunately I keep stumbling upon a variety of loops that run fast in my R scripts but far too slowly in Python (at least in my literal translations of those scripts). This code sample is one of them.
I'm slowly getting used to the idea that, when it comes to pandas, it's advisable to drop for loops and instead use vectorized functions and apply.
I need a few examples of how exactly to do this, since unfortunately my loops rely heavily on classic subsetting, matching and appending, operations that are too slow in their raw form.
import numpy as np
import pandas as pd

# Create two empty lists to append results to during the loop
values = []
occurrences = []

# Create sample dataset: a sorted time column (time series) and a column of random values
time = np.arange(0, 5000000, 1)
variable = np.random.uniform(1, 1000, 5000000).round()
data = pd.DataFrame({'time': time, 'variable': variable})

# Time datapoints to match
time_datapoints_to_match = np.random.uniform(0, 5000000, 200).round()

for i in time_datapoints_to_match:
    time_window = data[(data['time'] > i) & (data['time'] <= i + 1000)]  # subset a time window
    first_value_1pct = time_window['variable'].iloc[0] * 0.01  # 1/100 of the first value in the window
    try:  # check whether a value lower than this 1/100 value exists within the time window
        first_occurence = time_window.loc[time_window['variable'] < first_value_1pct, 'time'].iloc[0]
    except IndexError:  # in case there are no matches, return NaN
        first_occurence = float('nan')
    values.append(first_value_1pct)
    occurrences.append(first_occurence)

# Create a DataFrame out of the two output lists
final_report = pd.DataFrame({'values': values, 'first_occurence': occurrences})
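No answer is attached here, but one common way to speed up this particular pattern is to replace the per-iteration boolean mask over all five million rows with a single searchsorted call and positional slicing. The sketch below assumes, as in the sample data, that 'time' is sorted with unit spacing, so a 1000-unit window is simply 1000 consecutive rows:

import numpy as np
import pandas as pd

# Same sample data as above
time = np.arange(0, 5000000, 1)
variable = np.random.uniform(1, 1000, 5000000).round()
data = pd.DataFrame({'time': time, 'variable': variable})
points = np.random.uniform(0, 5000000, 200).round()

# Locate each window start once instead of masking the full frame per iteration
starts = data['time'].searchsorted(points, side='right')

values, occurrences = [], []
for start in starts:
    window = data.iloc[start:start + 1000]
    if window.empty:  # point fell at the very end of the series
        values.append(float('nan'))
        occurrences.append(float('nan'))
        continue
    threshold = window['variable'].iat[0] * 0.01
    hits = window.loc[window['variable'] < threshold, 'time']
    values.append(threshold)
    occurrences.append(hits.iat[0] if len(hits) else float('nan'))

final_report = pd.DataFrame({'values': values, 'first_occurence': occurrences})

The remaining loop only touches 1000-row slices, which is usually far cheaper than scanning the full frame 200 times.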

What is a more efficient way to load 1 column with 1 000 000+ rows than pandas read_csv()?

I'm trying to import large files (.tab/.txt, 300+ columns and 1,000,000+ rows) in Python. The files are tab-separated and the columns are filled with integer values. One of my goals is to compute the sum of each column. However, the files are too large to import with pandas.read_csv() as it consumes too much RAM.
Therefore I wrote the following code to import one column, compute the sum of that column, store the result in a dataframe (summed_cols), delete the column, and move on to the next column of the file:
import pandas as pd

x = 10  # the columns I'm interested in start at column 11
# empty dataframe to fill
summed_cols = pd.DataFrame(columns=["sample", "read sum"])
while x < 352:
    x = x + 1
    sample_col = pd.read_csv("file.txt", sep="\t", usecols=[x])
    summed_cols = summed_cols.append(
        pd.DataFrame({"sample": [sample_col.columns[0]],
                      "read sum": [sum(sample_col[sample_col.columns[0]])]}))
    del sample_col
Each column represents a sample and the 'read sum' is the sum of that column. So the output of this code is a dataframe with two columns: the first contains one sample per row, and the second the corresponding read sum.
This code does exactly what I want, but it is not efficient: for this large file it takes about 1-2 hours to complete the calculations. Loading just one column in particular takes quite a long time.
My question: Is there a faster way to import just one column of this large tab file and perform the same calculations as I'm doing with the code above?
You can try something like this:
samples = []
sums = []
with open('file.txt', 'r') as f:
    for i, line in enumerate(f):
        columns = line.strip().split('\t')[10:]  # from column 10 onward
        if i == 0:  # supposing the sample names are in the first row of each column
            samples = columns               # save sample names
            sums = [0 for s in samples]     # init the sums to 0
        else:
            for n, v in enumerate(columns):
                sums[n] += float(v)

result = dict(zip(samples, sums))  # {sample_name: sum, ...}
I am not sure this will work, since I don't know the content of your input file, but it describes the general procedure: you open the file only once, iterate over each line, split it to get the columns, and store the data you need.
Mind that this code does not deal with missing values.
The else block can be improved using numpy:
import numpy as np
...
        else:
            sums = np.add(sums, list(map(float, columns)))  # list() so this also works on Python 3
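An alternative sketch, assuming the columns of interest sit at 0-based positions 11-352 exactly as in the while loop above: let pandas stream the file in row chunks and accumulate the per-column sums, which avoids re-reading the file once per column:

import pandas as pd

total = None
for chunk in pd.read_csv('file.txt', sep='\t', usecols=range(11, 353), chunksize=100000):
    chunk_sum = chunk.sum()  # per-column sum of this chunk of rows
    total = chunk_sum if total is None else total.add(chunk_sum, fill_value=0)

# Same shape of output as summed_cols above: one row per sample column
summed_cols = total.rename_axis('sample').reset_index(name='read sum')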

Python fast DataFrame concatenation

I wrote some code that concatenates parts of a DataFrame back onto the same DataFrame, so as to normalize the occurrence of rows according to a certain column.
import random
import pandas

def normalize(data, expectation):
    """Normalize data by duplicating existing rows"""
    counts = data[expectation].value_counts()
    max_count = int(counts.max())
    for tag, group in data.groupby(expectation, sort=False):
        array = pandas.DataFrame(columns=data.columns.values)
        i = 0
        while i < (max_count // int(counts[tag])):
            array = pandas.concat([array, group])
            i += 1
        i = max_count % counts[tag]
        if i > 0:
            array = pandas.concat([array, group.ix[random.sample(group.index, i)]])
        data = pandas.concat([data, array])
    return data
and this is unbelievably slow. Is there a way to concatenate DataFrames quickly, without creating copies of them over and over?
There are a couple of things that stand out.
To begin with, the loop
i = 0
while i < (max_count // int(counts[tag])):
    array = pandas.concat([array, group])
    i += 1
is going to be very slow. Pandas is not built for these dynamic concatenations, and I suspect the performance is quadratic for what you're doing.
Instead, perhaps you could try
pandas.concat([group] * (max_count // int(counts[tag])))
which just creates the list first and then calls concat for a one-shot concatenation of the entire list. This should bring the complexity down to linear, and I suspect it will have lower constants in any case.
Another thing which would reduce these small concats is calling groupby-apply. Instead of iterating over the result of groupby, write the loop body as a function, and call apply on it. Let Pandas figure out best how to concat all of the results into a single DataFrame.
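A rough sketch of that groupby-apply idea (a hypothetical rewrite, using DataFrame.sample in place of the random.sample/.ix selection above; data and expectation are the same objects as in the question):

import pandas

def expand(group, max_count):
    # Repeat the whole group, then top up with a random sample for the remainder
    reps, rem = divmod(max_count, len(group))
    pieces = [group] * reps
    if rem:
        pieces.append(group.sample(n=rem))
    return pandas.concat(pieces)

max_count = int(data[expectation].value_counts().max())
expanded = data.groupby(expectation, sort=False, group_keys=False).apply(expand, max_count=max_count)
result = pandas.concat([data, expanded])  # same output shape as the original normalize()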
However, even if you prefer to keep the explicit loop, I'd append the pieces to a list and concat everything once at the end:
stuff = []
for tag, group in data.groupby(expectation, sort=False):
    # Call stuff.append for any DataFrame you were going to concat.
    ...
pandas.concat(stuff)
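Filled in, that skeleton might look roughly like this (a sketch, again using DataFrame.sample rather than the original random.sample/.ix selection):

max_count = int(data[expectation].value_counts().max())
stuff = [data]
for tag, group in data.groupby(expectation, sort=False):
    reps, rem = divmod(max_count, len(group))
    stuff.extend([group] * reps)
    if rem:
        stuff.append(group.sample(n=rem))  # random top-up, as in the original
result = pandas.concat(stuff)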

Creation of large pandas DataFrames from Series

I'm dealing with data on a fairly large scale. For reference, a given sample will have ~75,000,000 rows and 15,000-20,000 columns.
As of now, to conserve memory I've taken the approach of creating a list of Series (each column is a series, so ~15K-20K Series each containing ~250K rows). Then I create a SparseDataFrame containing every index within these series (because as you notice, this is a large but not very dense dataset). The issue is this becomes extremely slow, and appending each column to the dataset takes several minutes. To overcome this I've tried batching the merges as well (select a subset of the data, merge these to a DataFrame, which is then merged into my main DataFrame), but this approach is still too slow. Slow meaning it only processed ~4000 columns in a day, with each append causing subsequent appends to take longer as well.
One part that struck me as odd is why the column count of the main DataFrame affects the append speed. Because my main index already contains every entry it will ever see, I shouldn't have to lose time on re-indexing.
In any case, here is my code:
import time
import sys
import numpy as np
import pandas as pd

precision = 6
df = []
for index, i in enumerate(raw):
    if i is None:
        break
    if index % 1000 == 0:
        sys.stderr.write('Processed %s...\n' % index)
    df.append(pd.Series(dict([(np.round(mz, precision), int(intensity))
                              for mz, intensity in i.scans]),
                        dtype='uint16', name=i.rt))

all_indices = set([])
for j in df:
    all_indices |= set(j.index.tolist())
print len(all_indices)

t = time.time()
main_df = pd.DataFrame(index=all_indices)
first = True
del all_indices
while df:
    subset = [df.pop() for i in xrange(10) if df]
    all_indices = set([])
    for j in subset:
        all_indices |= set(j.index.tolist())
    df2 = pd.DataFrame(index=all_indices)
    df2.sort_index(inplace=True, axis=0)
    df2.sort_index(inplace=True, axis=1)
    del all_indices
    ind = 0
    while subset:
        t2 = time.time()
        ind += 1
        arr = subset.pop()
        df2[arr.name] = arr
        print ind, time.time() - t, time.time() - t2
    df2.reindex(main_df.index)
    t2 = time.time()
    for i in df2.columns:
        main_df[i] = df2[i]
    if first:
        main_df = main_df.to_sparse()
        first = False
    print 'join time', time.time() - t, time.time() - t2
    print len(df), 'entries remain'
Any advice on how I can load this large dataset quickly is appreciated, even if it means first writing it to disk in some other format.
Some additional info:
1) Because of the number of columns, I can't use most traditional on-disk stores such as HDF.
2) The data will be queried across columns and rows when it is in use, i.e. main_df.loc[row:row_end, col:col_end]. These aren't predictable block sizes, so chunking isn't really an option. These lookups also need to be fast, on the order of ~10 per second, to be realistically useful.
3) I have 32G of memory, so a SparseDataFrame I think is the best option since it fits in memory and allows fast lookups as needed. Just the creation of it is a pain at the moment.
Update:
I ended up using scipy sparse matrices and handling the indexing on my own for the time being. This results in appends at a constant rate of ~0.2 seconds each, which is acceptable (versus Pandas taking ~150 seconds per append for my full dataset). I'd love to know how to make Pandas match this speed.
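For anyone curious, a sketch of that scipy-sparse approach might look like the following. It assumes series_list holds the pandas Series built in the first loop of the question (one Series per eventual column); the names are placeholders, not the asker's actual code:

import numpy as np
from scipy import sparse

# Union of all row labels, sorted so positions can later be found with searchsorted
all_indices = np.array(sorted(set().union(*(s.index for s in series_list))))
row_of = {label: i for i, label in enumerate(all_indices)}

rows, cols, vals = [], [], []
for col, s in enumerate(series_list):
    rows.extend(row_of[label] for label in s.index)
    cols.extend([col] * len(s))
    vals.extend(s.values)

# One column per Series; CSC gives cheap column slicing for loc-style lookups
matrix = sparse.coo_matrix((vals, (rows, cols)),
                           shape=(len(all_indices), len(series_list))).tocsc()
col_names = [s.name for s in series_list]  # keep the column-position -> name mapping

Label-based lookups then translate row labels to positions with np.searchsorted(all_indices, label) and column names to positions via col_names.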

Split pandas dataframe in two if it has more than 10 rows

I have a huge CSV containing many tables, each with many rows. I would like to simply split each dataframe into two if it contains more than 10 rows.
If so, I would like the first dataframe to contain the first 10 rows and the second dataframe the rest.
Is there a convenient function for this? I've looked around but found nothing useful...
i.e. split_dataframe(df, 2(if > 10))?
I used a list comprehension to cut a huge DataFrame into blocks of 100,000 rows:
size = 100000
list_of_dfs = [df.loc[i:i+size-1,:] for i in range(0, len(df),size)]
or as a generator:
list_of_dfs = (df.loc[i:i+size-1,:] for i in range(0, len(df),size))
This will return the two split DataFrames if the condition is met, and otherwise return the original and None (which you would then need to handle separately). Note that this assumes the splitting only has to happen once per df, and that it is fine for the second part of the split to itself be longer than 10 rows (which happens when the original is longer than 20 rows).
df_new1, df_new2 = (df.iloc[:10, :], df.iloc[10:, :]) if len(df) > 10 else (df, None)
Note you can also use df.head(10) and df.tail(len(df) - 10) to get the front and back according to your needs. You can also use various indexing approaches: you can just provide the first dimension's index if you want, such as df.iloc[:10] instead of df.iloc[:10, :] (though I like to code explicitly about the dimensions being taken). In older pandas versions df.ix allowed similar indexing, but it has since been removed.
Be careful about using df.loc however, since it is label-based and the input will never be interpreted as an integer position. .loc would only work "accidentally" in the case when you happen to have index labels that are integers starting at 0 with no gaps.
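A tiny illustration of the difference:

import pandas as pd

df = pd.DataFrame({'a': list('vwxyz')}, index=[100, 200, 300, 400, 500])
df.iloc[:2]   # positional: the first two rows (labels 100 and 200)
df.loc[:300]  # label-based and inclusive: rows labelled 100, 200 and 300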
But you should also consider the various options that pandas provides for dumping the contents of the DataFrame into HTML and possibly also LaTeX to make better designed tables for the presentation (instead of just copying and pasting). Simply Googling how to convert the DataFrame to these formats turns up lots of tutorials and advice for exactly this application.
There is no specific convenience function.
You'd have to do something like:
first_ten = pd.DataFrame()
rest = pd.DataFrame()

if df.shape[0] > 10:  # len(df) > 10 would also work
    first_ten = df[:10]
    rest = df[10:]
A method based on np.split:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [2, 4, 6, 8, 10, 2, 4, 6, 8, 10],
                   'B': [10, -10, 0, 20, -10, 10, -10, 0, 20, -10],
                   'C': [4, 12, 8, 0, 0, 4, 12, 8, 0, 0],
                   'D': [9, 10, 0, 1, 3, np.nan, np.nan, np.nan, np.nan, np.nan]})

listOfDfs = [df.loc[idx] for idx in np.split(df.index, 5)]
A small function that uses a modulo could take care of cases where the split is not even (e.g. np.split(df.index,4) will throw an error).
(Yes, I am aware that the original question was somewhat more specific than this. However, this is supposed to answer the question in the title.)
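A possible sketch of such a modulo-based helper (hypothetical, assuming unique index labels):

import numpy as np

def split_evenly(df, n):
    """Split df into n equal chunks plus, if needed, one smaller tail chunk."""
    remainder = len(df) % n
    splittable = df.index[:len(df) - remainder] if remainder else df.index
    chunks = [df.loc[idx] for idx in np.split(splittable, n)]
    if remainder:
        chunks.append(df.loc[df.index[-remainder:]])
    return chunks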
Below is a simple function that splits a DataFrame into chunks, followed by a few usage examples:
import pandas as pd

def split_dataframe_to_chunks(df, n):
    df_len = len(df)
    count = 0
    dfs = []
    while True:
        if count > df_len - 1:
            break
        start = count
        count += n
        # print("%s : %s" % (start, count))
        dfs.append(df.iloc[start:count])
    return dfs
# Create a DataFrame with 10 rows
df = pd.DataFrame([i for i in range(10)])
# Split the DataFrame to chunks of maximum size 2
split_df_to_chunks_of_2 = split_dataframe_to_chunks(df, 2)
print([len(i) for i in split_df_to_chunks_of_2])
# prints: [2, 2, 2, 2, 2]
# Split the DataFrame to chunks of maximum size 3
split_df_to_chunks_of_3 = split_dataframe_to_chunks(df, 3)
print([len(i) for i in split_df_to_chunks_of_3])
# prints [3, 3, 3, 1]
If you have a large dataframe and need to divide it into sub-dataframes with a bounded number of rows, for example a maximum of 4500 rows each, this script could help:
max_rows = 4500
dataframes = []

while len(df) > max_rows:
    top = df[:max_rows]
    dataframes.append(top)
    df = df[max_rows:]
else:
    dataframes.append(df)
You could then save out these data frames:
for i, frame in enumerate(dataframes):
    frame.to_csv(str(i) + '.csv', index=False)
Hope this helps someone!
import os

def split_and_save_df(df, name, size, output_dir):
    """
    Split a df and save each chunk in a different csv file.

    Parameters:
        df : pandas df to be split
        name : name to give to the output files
        size : chunk size
        output_dir : directory where to write the divided df
    """
    for i in range(0, df.shape[0], size):
        start = i
        end = min(i + size - 1, df.shape[0])
        subset = df.loc[start:end]
        output_path = os.path.join(output_dir, f"{name}_{start}_{end}.csv")
        print(f"Going to write into {output_path}")
        subset.to_csv(output_path)
        output_size = os.stat(output_path).st_size
        print(f"Wrote {output_size} bytes")
You can use the DataFrame head and tail methods as syntactic sugar instead of slicing/loc here. I use a split size of 3; for your example use headSize=10
import numpy as np
import pandas as pd

def split(df, headSize):
    hd = df.head(headSize)
    tl = df.tail(len(df) - headSize)
    return hd, tl

df = pd.DataFrame({'A': [2, 4, 6, 8, 10, 2, 4, 6, 8, 10],
                   'B': [10, -10, 0, 20, -10, 10, -10, 0, 20, -10],
                   'C': [4, 12, 8, 0, 0, 4, 12, 8, 0, 0],
                   'D': [9, 10, 0, 1, 3, np.nan, np.nan, np.nan, np.nan, np.nan]})

# Split the dataframe into the top 3 rows (first) and the rest (second)
first, second = split(df, 3)
Another method, based on a list comprehension and groupby, stores all the split dataframes in a list variable, which can then be accessed by index.
Example:
ans = [pd.DataFrame(y) for x, y in DF.groupby('column_name', as_index=False)]
ans[0]
ans[0].column_name
