Q: PySpark, how to iterate over rows in a large dataframe? - python

I have a large dataframe containing at least half a million records, and I need to convert each string to a custom object (I should end up with a list of objects).
collect() is not acceptable because the dataframe is too large.
I had trouble iterating over all the rows in the dataframe. At first I tried to do it like this (mapping over the rows, then turning the map into a list of objects):
result_list = map(lambda row: row, test_df.collect())
But collect() turned out to be unacceptable because the dataframe is too big.
Could you please tell me what other options there are for iterating over each row in the dataframe?

If collect() for your DataFrame doesn't fit into memory, it's unlikely your transformed DataFrame would fit either. However, if you just need to stream over your DataFrame and process each row one at a time, you could use test_df.rdd.toLocalIterator() to iterate over it. https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.toLocalIterator.html
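For example, a minimal sketch of that approach (MyObject and the column name are hypothetical placeholders for your own conversion logic):
result_list = []
for row in test_df.rdd.toLocalIterator():
    # rows are fetched partition by partition, so only one partition
    # needs to fit in driver memory at a time
    result_list.append(MyObject(row['some_column']))  # hypothetical custom object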

Related

Can't understand a code snippet using Pandas library in Python 3 involving slicing tabular data in chunks

I'm reading a guide where we slice the tabular data into chunks of 14 rows and then process them, but I'm having trouble understanding the code. I don't have access to a teacher, so I'm asking here; additionally, the internet doesn't have a specific solution for my problem.
Thank You
This is the code; could someone please explain how exactly it's able to take each value in a loop and calculate the mean?
data_chunks = pd.read_csv("../data/microbiome.csv", chunksize=14)
mean_tissue = pd.Series({chunk.iloc[0].Taxon:chunk.Tissue.mean() for chunk in data_chunks})
I assume the different types of "Taxon" are 14 records long, which is why they iterate through the DataFrame in 14-row chunks (see the first line: chunksize=14). A chunk is a part of the DataFrame, similar to slicing and indexing: here a chunk will have 14 rows and as many columns as the original DataFrame. The first chunk will contain the first 14 rows, the second will contain the next 14 rows, and so on.
Next it creates a pandas Series; you can think of this as a fancy 1-D array, a list, or a 1-D version of the DataFrame (see pd.Series(...)).
What will be in this Series instance? A dictionary (see the curly brackets!). The keys in this dictionary will be the Taxon type the chunk contains, and the value will be the mean of Tissue values in the current chunk (14 rows):
chunk.iloc[0].Taxon selects the first row of the chunk, then its value in the column Taxon.
chunk.Tissue.mean() selects every Tissue value in the chunk (14 values) and then takes their average.
chunk.iloc[0].Taxon : chunk.Tissue.mean() pairs up the aforementioned values in key : value fashion.
They do this for every chunk, adding a new key-value pair to the dictionary each time (like a list comprehension, only this is a dict comprehension). The pandas Series constructor accepts a dictionary.
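If it helps, the dict comprehension is equivalent to this explicit loop (same file and columns as above):
import pandas as pd
data_chunks = pd.read_csv("../data/microbiome.csv", chunksize=14)
means = {}
for chunk in data_chunks:             # each chunk is a 14-row DataFrame
    key = chunk.iloc[0].Taxon         # Taxon value in the chunk's first row
    means[key] = chunk.Tissue.mean()  # mean of the 14 Tissue values
mean_tissue = pd.Series(means)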
If you have any remaining questions, feel free to ask and I will try to answer them.

How to properly save each large chunk of data as a pandas dataframe and concatenate them with each other

I have a dataframe that has over 400K rows and several hundred columns, which I have decided to read in chunks because it does not fit into memory and gives me a MemoryError.
I have managed to read it in in chunks like this:
x = pd.read_csv('Training.csv', chunksize=10000)
and afterwards I can get each of the chunks by doing this:
a = x.get_chunk()
b = x.get_chunk()
and so on, repeating this over 40 times, which is obviously slow and bad programming practice.
When I try doing the following in an attempt to create a loop that can save each chunk into a dataframe and somehow concatenate them:
for x in pd.read_csv('Training.csv', chunksize=500):
    x.get_chunk()
I get:
AttributeError: 'DataFrame' object has no attribute 'get_chunk'
What is the easiest way I can read in my file and concatenate all my chunks during the import?
Also, how do I do further manipulation on my dataset to avoid memory error issues (particularly imputing null values, standardizing/normalizing the dataframe, and then running machine learning models on it using scikit-learn)?
When you specify chunksize in a call to pandas.read_csv you get back a pandas.io.parsers.TextFileReader object rather than a DataFrame. Try this to go through the chunks:
reader = pd.read_csv('Training.csv',chunksize=500)
for chunk in reader:
    print(type(chunk))  # chunk is a DataFrame
Or grab all the chunks (which probably won't solve your problem!):
reader = pd.read_csv('Training.csv',chunksize=500)
chunks = [chunk for chunk in reader] # list of DataFrames
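If you really do want one DataFrame at the end (which only helps if the combined result fits in memory, e.g. after dropping columns per chunk), you can concatenate the chunks:
import pandas as pd
reader = pd.read_csv('Training.csv', chunksize=500)
chunks = [chunk for chunk in reader]       # list of DataFrames, as above
df = pd.concat(chunks, ignore_index=True)  # stitched back into one DataFrame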
Depending on what is in your dataset, a great way of reducing memory use is to identify columns that can be converted to categorical data. Any column where the number of distinct values is much lower than the number of rows is a candidate for this. Suppose a column contains some sort of status with limited values (e.g. 'Open', 'Closed', 'On hold'); then do this:
chunk = chunk.assign(Status=lambda x: pd.Categorical(x['Status']))
This will now store just an integer for each row, and the DataFrame will hold a mapping (e.g. 0 = 'Open', 1 = 'Closed', etc.).
You should also look at whether any of your data columns are redundant (i.e. they effectively contain the same information) - if any are, then delete them. I've seen spreadsheets containing dates where people have generated extra columns for year, week, and day because they find them easier to work with. Get rid of them!
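As a rough, hedged sketch of the categorical savings (the 'Status' values are just the illustrative ones from above):
import pandas as pd
df = pd.DataFrame({'Status': ['Open', 'Closed', 'On hold'] * 100000})
before = df['Status'].memory_usage(deep=True)
df['Status'] = pd.Categorical(df['Status'])
after = df['Status'].memory_usage(deep=True)
print(before, after)  # the categorical column is typically several times smaller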

How to add rows to pandas dataframe with reasonable performance

I have an empty dataframe with about 120 columns that I want to fill using data I have in a file.
I'm iterating over a file that has about 1.8 million lines.
(The lines are unstructured, I can't load them to a dataframe directly)
For each line in the file I do the following:
Extract the data I need from the current line
Copy the last row in the dataframe and append it to the end: df = df.append(df.iloc[-1]). The copy is critical; most of the data in the previous row won't be changed.
Change several values in the last row according to the data I've extracted: df.iloc[-1, df.columns.get_loc('column_name')] = some_extracted_value
This is very slow; I assume the fault is in the append.
What is the correct approach to speed things up? Preallocate the dataframe?
EDIT:
After reading the answers I did the following:
I preallocated the dataframe (saved about 10% of the time).
I replaced this: df = df.append(df.iloc[-1]) with this: df.iloc[i] = df.iloc[i-1], where i is the current iteration of the loop (saved about 10% of the time).
I did some profiling; even though I removed the append, the main issue is copying the previous row, i.e. df.iloc[i] = df.iloc[i-1] takes about 95% of the time.
You may need plenty of memory, whichever option you choose.
However, what you should certainly avoid is using pd.DataFrame.append within a loop. This is expensive versus list.append.
Instead, aggregate to a list of lists, then feed into a dataframe. Since you haven't provided an example, here's some pseudo-code:
# initialize empty list
L = []
for line in my_binary_file:
    # extract components required from each line to a list of Python types
    line_vars = [line['var1'], line['var2'], line['var3']]
    # append to list of results
    L.append(line_vars)
# create dataframe from list of lists
df = pd.DataFrame(L, columns=['var1', 'var2', 'var3'])
The fastest way would be to load the dataframe directly via pd.read_csv().
Try separating the logic that turns the unstructured data into structured data, and then use pd.read_csv to load the dataframe.
If you can share a sample unstructured line and the logic for extracting the structured data, that might provide some more insight.
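As a hedged sketch of that idea (clean_line, the file name, and the delimiter logic are all placeholders for your own cleaning rules):
import io
import pandas as pd
def clean_line(raw):
    # placeholder: turn one unstructured line into a comma-separated record
    return ','.join(raw.split())
with open('raw_data.txt') as f:                     # hypothetical input file
    cleaned = '\n'.join(clean_line(line) for line in f)
df = pd.read_csv(io.StringIO(cleaned), header=None)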
Where you use append you end up copying the dataframe which is inefficient. Try this whole thing again but avoiding this line:
df = df.append(df.iloc[-1])
You could do something like this to copy the last row to a new row (only do this if the last row contains information that you want in the new row):
df.iloc[...calculate the next available index...] = df.iloc[-1]
Then edit the last row accordingly as you have done
df.iloc[-1, df.columns.get_loc('column_name')] = some_extracted_value
You could try some multiprocessing to speed things up:
from multiprocessing.dummy import Pool as ThreadPool
import pandas as pd
def YourCleaningFunction(line):
    # placeholder: do whatever cleaning each line needs and return the
    # extracted fields as a list (or use the kind of function jpp just provided)
    return line.split(',')
pool = ThreadPool(8)  # size of the thread pool (often set to your number of cores)
lines = open('your_big_csv.csv').read().split('\n')  # your csv as a list of lines
rows = pool.map(YourCleaningFunction, lines)
df = pd.DataFrame(rows)
pool.close()
pool.join()
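Note that multiprocessing.dummy gives you a thread pool, so CPU-heavy parsing won't gain much because of the GIL. If the cleaning is CPU-bound, a hedged sketch of the same idea with a process-based pool (reusing the placeholder cleaning function above) would be:
from multiprocessing import Pool
import pandas as pd
if __name__ == '__main__':
    with open('your_big_csv.csv') as f:
        lines = f.read().split('\n')
    with Pool(8) as pool:  # process pool, roughly one worker per core
        rows = pool.map(YourCleaningFunction, lines)
    df = pd.DataFrame(rows)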

Will numpy read the whole data every time it iterates?

I have a very large data file consisting of N*100 real numbers, where N is very large. I want to read the data by columns. I can read it as a whole and then manipulate it column by column:
data=np.loadtxt(fname='data.txt')
for i in range(100):
    np.sum(data[:,i])
Or I can read it column by column, expecting this to save memory and be fast:
for i in range(100):
    col = np.loadtxt(fname='data.txt', usecols=(i,))
    np.sum(col)
However, the second approach does not seem to be faster. Is it because the code reads the whole file every time and extracts the desired column, so it is 100 times slower than the first one? Is there any method to read one column after another, but much faster?
If I just want the 100 numbers in the last row of the file, reading each whole column and taking its last element is not a wise choice. How can I achieve this?
If I understand your question right, you want only the last row. For a file with N rows, this would read only the last row:
data = np.loadtxt(fname='data.txt', skiprows=N-1)
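If N isn't known up front, one hedged way is to count the lines first (still a pass over the file, but without parsing every value):
import numpy as np
with open('data.txt') as f:
    N = sum(1 for _ in f)  # number of rows in the file
last_row = np.loadtxt(fname='data.txt', skiprows=N - 1)  # parses only the final row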
You are asking two things: how to sum across all rows, and how to read the last row.
data = np.loadtxt(fname='data.txt')
for i in range(100):
    np.sum(data[:,i])
data is an (N,100) 2d array. You don't need to iterate to sum along each column:
np.sum(data, axis=0)
gives you a (100,) array, one sum per column.
for i in range(100):
    col = np.loadtxt(fname='data.txt', usecols=(i,))
    np.sum(col)  # just throwing this away??
With this you read the file 100 times. In each loadtxt call it has to read every line, select the ith string, interpret it, and collect it in col. It might be faster if data were so large that the machine bogged down with memory swapping. Otherwise, array operations on data will be a lot faster than file reads.
As the other answer shows, loadtxt lets you specify a skiprows parameter. It will still read all the lines (i.e. f.readline() calls), but it just doesn't process them or collect their values in a list.
Do some of your own time tests.
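For instance, a minimal timing sketch along those lines (file name as in the question; absolute numbers will depend on your disk and N):
import timeit
whole_file = timeit.timeit(
    "np.loadtxt('data.txt').sum(axis=0)",
    setup="import numpy as np", number=3)
per_column = timeit.timeit(
    "[np.loadtxt('data.txt', usecols=(i,)).sum() for i in range(100)]",
    setup="import numpy as np", number=3)
print(whole_file, per_column)  # expect the per-column version to be far slower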

Python Pandas: .apply taking forever?

I have a DataFrame 'clicks' created by parsing a CSV of size 1.4 GB. I'm trying to create a new column 'bought' using the apply function.
clicks['bought'] = clicks['session'].apply(getBoughtItemIDs)
In getBoughtItemIDs, I'm checking whether the 'buys' dataframe has values I want, and if so, I return a string concatenating them. The first line in getBoughtItemIDs is taking forever. What are the ways to make it faster?
def getBoughtItemIDs(val):
    boughtSessions = buys[buys['session'] == val].values
    output = ''
    for row in boughtSessions:
        output += str(row[1]) + ","
    return output
There are a couple of things that make this code run slowly.
apply is essentially just syntactic sugar for a for loop over the rows of a column. There's also an explicit for loop over a NumPy array in your function (the for row in boughtSessions part). Looping in this (non-vectorised) way is best avoided whenever possible as it impacts performance heavily.
buys[buys['session'] == val].values is looking up val across an entire column for each row of clicks, then returning a sub-DataFrame and then creating a new NumPy array. Repeatedly looking for values in this way is expensive (O(n) complexity each lookup). Creating new arrays is going to be expensive since memory has to be allocated and the data copied across each time.
If I understand what you're trying to do, you could try the following approach to get your new column.
First use groupby to group the rows of buys by the values in 'session'. apply is used to join up the strings for each value:
boughtSessions = buys.groupby('session')[col_to_join].apply(lambda x: ','.join(x))
where col_to_join is the column from buys which contains the values you want to join together into a string.
groupby means that only one pass through the DataFrame is needed and is pretty well-optimised in Pandas. The use of apply to join the strings is unavoidable here, but only one pass through the grouped values is needed.
boughtSessions is now a Series of strings indexed by the unique values in the 'session' column. This is useful because lookups to Pandas indexes are O(1) in complexity.
To match each string in boughtSessions to the appropriate value in clicks['session'] you can use map. Unlike apply, map is fully vectorised and should be very fast:
clicks['bought'] = clicks['session'].map(boughtSessions)
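Putting the two steps together, a minimal sketch (here 'item_id' stands in for col_to_join; use whatever column in buys actually holds the IDs you want joined):
# 'item_id' is an assumed stand-in for the real column name in buys
boughtSessions = (buys.groupby('session')['item_id']
                      .apply(lambda x: ','.join(x.astype(str))))
clicks['bought'] = clicks['session'].map(boughtSessions)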
