Accessing data in chunks using Python Pandas

I have a large text file whose values are separated by semicolons. I am trying to retrieve the values of a column (e.g. the second column) and work on them iteratively using numpy. An example of the data contained in the text file is given below:
10862;2;1;1;0;0;0;3571;0;
10868;2;1;1;1;0;0;3571;0;
10875;2;1;1;1;0;0;3571;0;
10883;2;1;1;1;0;0;3571;0;
...
11565;2;1;1;1;0;0;3571;0;
11572;2;1;1;1;0;0;3571;0;
11579;2;1;1;1;0;0;3571;0;
11598;2;1;1;1;0;0;3571;0;
11606;2;1;1;
Please note that the last line may not contain the same number of values as the previous ones.
I am trying to use pandas.read_csv to read this large file by chunks. For the purpose of the example, let's assume that the chunk size is 40.
So far I have tried 2 different approaches:
1)
Set nrows, and iteratively increase the skiprows so as to read the entire file by chunk.
nrows_set = 40
n_it = 0
while True:
    df = pd.read_csv(filename, nrows=nrows_set, sep=';', skiprows=n_it * nrows_set)
    vect2 = df[1]  # trying to access the values of the second column -- works
    n_it = n_it + 1
Issue when accessing the end of the file: pandas generates an error when one tries to read more rows than the file contains.
For example, if the file contains 20 lines and nrows is set to 40, the file cannot be read. My first approach therefore broke when I tried to read the last 40 lines of my file while fewer than 40 lines were remaining.
I do not know how to check for the end of the file before trying to read from it, and I do not want to load the entire file just to obtain the total number of rows, since the file is large. Hence, I tried a second approach.
2)
Set chunksize. This works well, however I have an issue when I then try to access the data in each chunk:
reader = pd.read_csv(filename, chunksize=40, sep=';')
for chunk in reader:
    print(chunk)  # displays data -- the data looks correct
    chunk[1]      # trying to access the values of the second column -- generates an error
What is the data type of chunk, and how can I convert it so that this operation works?
Alternatively, how can I retrieve the number of lines contained in a file without loading it entirely into memory (so that solution 1 would work)?
Thank you for your help!
Gaelle

chunk is a DataFrame,
so you can access it using indexers (accessors) like .loc / .iloc / .at, etc. (.ix has been deprecated and removed in recent pandas versions):
chunk.loc[:, 'col_name']
chunk.iloc[:, 1]  # second column
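For your file specifically, a minimal sketch of the chunked loop might look like this (assuming the file is named data.txt and has no header row, hence header=None; otherwise the first data row of each chunk would be consumed as column names):

import numpy as np
import pandas as pd

reader = pd.read_csv('data.txt', sep=';', header=None, chunksize=40)
for chunk in reader:
    second_col = chunk.iloc[:, 1].to_numpy()  # second column of this chunk as a numpy array
    print(np.sum(second_col))                 # work on it iteratively with numpy

This also sidesteps the end-of-file problem from approach 1, since the reader simply stops when the file is exhausted.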

Related

Use different filters on different pandas data frames on each loop

I am learning Python in order to automate some tasks at my work. What I need to do is read many big TSV files, filter each one of them on a list of keys that may differ from one TSV to another, and then return the result in an Excel report using a certain template. To do that, I first load all the data common to every file: the list of TSV files to process, some QC values, and the filter to be used for each TSV. Then, in a loop, I read the files to process one by one, load each TSV into a pandas data frame, filter it, and fill the Excel report.
The problem is that the filter does not seem to work and I am getting Excel files of the same length (number of rows) each time! I suspect that the data frame does not get reinitialized on each round, but trying to del the df and release the memory does not work either. How can I solve this issue? Is there a better data structure for my problem? Below is a part of the code.
Thanks
from openpyxl import load_workbook
import pandas as pd

# This is how I load the filter for each file in each round of the loop (as an array)
target = load_workbook(target_input)
target_sheet = target.active
targetProbes = []
for k in range(2, target_sheet.max_row + 1):
    targetProbes.append(target_sheet.cell(k, 1).value)

# This is how I load each TSV file in each round of the loop (as a df)
tsv_reader = pd.read_csv(calls, delimiter='\t', comment='#', low_memory=False)

# This is how I filter the df based on the filter
tsv_reader = tsv_reader[tsv_reader['probeset_id'].isin(targetProbes)]
tsv_reader.reset_index(drop=True, inplace=True)

# This is how I try to del the df and clear the current filter so I can re-use them for another TSV file with different values in the next round of the loop
del tsv_reader
targetProbes.clear()
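One way to make sure the filter and the data frame really are rebuilt on every round is to create them inside the loop body itself, rather than deleting and clearing them at the end. A rough sketch along those lines, where files_to_process, the paths, and the report-filling step are placeholders based on the description above:

import pandas as pd
from openpyxl import load_workbook

# hypothetical list of (filter workbook, TSV file) pairs, one per report
for target_input, calls in files_to_process:
    # rebuild the filter list from scratch on every round
    target_sheet = load_workbook(target_input).active
    targetProbes = [target_sheet.cell(k, 1).value
                    for k in range(2, target_sheet.max_row + 1)]

    # load this round's TSV into a fresh data frame and filter it
    tsv_reader = pd.read_csv(calls, delimiter='\t', comment='#', low_memory=False)
    tsv_reader = tsv_reader[tsv_reader['probeset_id'].isin(targetProbes)]
    tsv_reader = tsv_reader.reset_index(drop=True)

    # ... fill the Excel report from tsv_reader here ...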

How can I periodically skip rows reading txt with pandas?

I need to process data measured every 20 seconds during the whole of 2018. The raw file has the following structure:
date time a lot of trash
in several rows
amount of samples trash again
data
date time a lot of trash
etc.
I want to make one pandas dataframe out of it, or at least one dataframe per block of data (the block size is encoded as "amount of samples"), while keeping the time of measurement.
How can I ignore all the other trash? I know that it appears periodically (period = amount of samples), but:
- I don't know how many lines are in the file
- I don't want to call file.readline() explicitly in a loop, because it would take forever (especially in Python) and I don't have enough computing power for that
Is there any way to skip rows periodically in pandas or another library? Or how else can I solve this?
There is an example of my data:
https://drive.google.com/file/d/1OefLwpTaytL7L3WFqtnxg0mDXAljc56p/view?usp=sharing
I want to get a dataframe similar to the data table in the picture, plus an additional column with the date-time, and without the technical rows.
Use itertools.islice, where N below means read every Nth line:
from itertools import islice
import pandas as pd

N = 3
sep = ','
with open(file_path, 'r') as f:
    lines_gen = islice(f, None, None, N)
    df = pd.DataFrame([x.strip().split(sep) for x in lines_gen])
I repeated your data three times. It sounds like you need every 4th row (not starting at 0), because that is where your data lies. In the documentation for skiprows it says:
If callable, the callable function will be evaluated against the row indices, returning True if the row should be skipped and False otherwise. An example of a valid callable argument would be lambda x: x in [0, 2].
So what if we pass a not in check to the lambda function? That is what I am doing below.
I am creating a list of the row indices I want to keep and passing the not in check to the skiprows argument. In English: skip every row that is not a 4th line.
import pandas as pd
# creating a list of all the 4th row indexes. If you need more than 1 million, just up the range number
list_of_rows_to_keep = list(range(0,1000000))[3::4]
# passing this list to the lambda function using not in.
df = pd.read_csv(r'PATH_To_CSV.csv', skiprows=lambda x: x not in list_of_rows_to_keep)
df.head()
#output
0 data
1 data
2 data
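Equivalently, since the pattern is periodic, the skiprows callable can compute the condition directly instead of building a large index list; a small variant of the same idea (same placeholder CSV path):

import pandas as pd

# keep only rows whose index is 3, 7, 11, ... i.e. every 4th line, not starting at 0
df = pd.read_csv(r'PATH_To_CSV.csv', skiprows=lambda x: x % 4 != 3)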
Just count how many lines are in the file and build the list of those that should be skipped (call it useless_rows), then pass it to pandas.read_csv(..., skiprows=useless_rows).
The tricky part for me was counting the rows cheaply.
There are a few ways to do it:
On Linux, the "wc -l" command (here is how to put it into your code: Running "wc -l <filename>" within Python Code; see also the sketch after this answer)
Generators. My relevant rows contain a key in the last column. Not really informative, but it was a rescue for me, since I can count the lines with it: it turns out there are about 500000 lines, and it took 0.00011 to count them
def rows_without_key(filename):  # hypothetical wrapper name; yield needs to live inside a function
    with open(filename) as f:
        for row in f:
            if '2147483647' in row:  # the key that marks my relevant rows
                continue
            yield row
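For the first option above (counting lines with wc -l), a minimal sketch of calling it from Python could look like this; count_lines is just a hypothetical helper name, the file name is a placeholder, and this is Linux/macOS only:

import subprocess

def count_lines(path):
    # hypothetical helper: let 'wc -l' count the lines so Python never loads the file
    result = subprocess.run(['wc', '-l', path], capture_output=True, text=True, check=True)
    return int(result.stdout.split()[0])

n_lines = count_lines('data.txt')  # placeholder file name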

How to properly save each large chunk of data as a pandas dataframe and concatenate them with each other

I have a dataset with over 400K rows and several hundred columns that I have decided to read in chunks, because it does not fit into memory and gives me a MemoryError.
I have managed to read it in in chunks like this:
x = pd.read_csv('Training.csv', chunksize=10000)
and afterwards I can get each of the chunks by doing this:
a = x.get_chunk()
b = x.get_chunk()
and so on, repeating this over 40 times, which is obviously slow and bad programming practice.
When I try doing the following in an attempt to create a loop that can save each chunk into a dataframe and somehow concatenate them:
for x in pd.read_csv('Training.csv', chunksize=500):
    x.get_chunk()
I get:
AttributeError: 'DataFrame' object has no attribute 'get_chunk'
What is the easiest way I can read in my file and concatenate all my chunks during the import?
Also, how do I do further manipulation on my dataset to avoid memory error issues (particularly, imputing null values, standardizing/normalizing the dataframe, and then running machine learning models on it using scikit-learn)?
When you specify chunksize in a call to pandas.read_csv you get back a pandas.io.parsers.TextFileReader object rather than a DataFrame. Try this to go through the chunks:
reader = pd.read_csv('Training.csv', chunksize=500)
for chunk in reader:
    print(type(chunk))  # chunk is a DataFrame
Or grab all the chunks (which probably won't solve your problem!):
reader = pd.read_csv('Training.csv',chunksize=500)
chunks = [chunk for chunk in reader] # list of DataFrames
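If you do want the whole dataset back as a single dataframe (and it fits in memory once loaded), the usual pattern is to concatenate the chunks, for example:

import pandas as pd

reader = pd.read_csv('Training.csv', chunksize=500)
df = pd.concat(reader, ignore_index=True)  # stitch all chunks back into one dataframe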
Depending on what is in your dataset, a great way of reducing memory use is to identify columns that can be converted to categorical data. Any column where the number of distinct values is much lower than the number of rows is a candidate for this. Suppose a column contains some sort of status with a limited set of values (e.g. 'Open', 'Closed', 'On hold'); then do this:
chunk['Status'] = pd.Categorical(chunk['Status'])
This will now store just an integer for each row, and the DataFrame will hold a mapping (e.g. 0 = 'Open', 1 = 'Closed', etc.).
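To see how much this saves, a small self-contained check with made-up data might look like the following (memory_usage(deep=True) reports the actual bytes used):

import pandas as pd

# toy 'Status' column with only three distinct values
df = pd.DataFrame({'Status': ['Open', 'Closed', 'On hold'] * 100000})
print(df['Status'].memory_usage(deep=True))   # object dtype: one Python string per row

df['Status'] = pd.Categorical(df['Status'])
print(df['Status'].memory_usage(deep=True))   # categorical: small integer codes + 3 categories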
You should also look at whether or not any of your data columns are redundant (they effectively contain the same information) - if any are then delete them. I've seen spreadsheets containing dates where people have generated columns for year, week, day as they find it easier to work with. Get rid of them!

How to add rows to pandas dataframe with reasonable performance

I have an empty data frame with about 120 columns that I want to fill using data I have in a file.
I'm iterating over a file that has about 1.8 million lines.
(The lines are unstructured, I can't load them to a dataframe directly)
For each line in the file I do the following:
Extract the data I need from the current line
Copy the last row of the data frame and append it to the end: df = df.append(df.iloc[-1]). The copy is critical; most of the data in the previous row won't be changed.
Change several values in the last row according to the data I've extracted df.iloc[-1, df.columns.get_loc('column_name')] = some_extracted_value
This is very slow, I assume the fault is in the append.
What is the correct approach to speed things up? Preallocate the dataframe?
EDIT:
After reading the answers I did the following:
I preallocated the dataframe (saved about 10% of the time)
I replaced df = df.append(df.iloc[-1]) with df.iloc[i] = df.iloc[i-1], where i is the current iteration of the loop (saved about 10% of the time)
I did some profiling; even with the append removed, the main issue is copying the previous line, i.e. df.iloc[i] = df.iloc[i-1] takes about 95% of the time
You may need plenty of memory, whichever option you choose.
However, what you should certainly avoid is using pd.DataFrame.append within a loop. This is expensive versus list.append.
Instead, aggregate to a list of lists, then feed into a dataframe. Since you haven't provided an example, here's some pseudo-code:
# initialize empty list
L = []

for line in my_binary_file:
    # extract components required from each line to a list of Python types
    line_vars = [line['var1'], line['var2'], line['var3']]
    # append to list of results
    L.append(line_vars)

# create dataframe from list of lists
df = pd.DataFrame(L, columns=['var1', 'var2', 'var3'])
The fastest way would be to load the dataframe directly via pd.read_csv().
Try separating out the logic that turns the unstructured data into structured data, then use pd.read_csv to load the dataframe.
If you can share a sample unstructured line and the logic used to extract the structured data from it, that might provide some more insight.
Where you use append you end up copying the dataframe, which is inefficient. Try this whole thing again, but avoid this line:
df = df.append(df.iloc[-1])
You could do something like this to copy the last row to a new row (only do this if the last row contains information that you want in the new row):
df.iloc[...calculate the next available index...] = df.iloc[-1]
Then edit the last row accordingly as you have done
df.iloc[-1, df.columns.get_loc('column_name')] = some_extracted_value
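A rough sketch of the preallocation idea with an explicit row counter is below; the sizes, column names, file path, and parse_line helper are all placeholders, and the row-copy line is only needed if the new row really should start as a copy of the previous one (the question's edit notes that this copy dominates the runtime):

import numpy as np
import pandas as pd

n_rows = 1800000                          # known (or over-estimated) number of lines
columns = ['col_a', 'col_b', 'col_c']     # placeholder column names

# preallocate the whole frame once instead of appending row by row
df = pd.DataFrame(np.zeros((n_rows, len(columns))), columns=columns)

with open('data.txt') as f:               # placeholder file name
    for i, line in enumerate(f):
        if i > 0:
            df.iloc[i] = df.iloc[i - 1]   # carry the previous row forward (optional, and slow)
        df.iloc[i, df.columns.get_loc('col_a')] = parse_line(line)  # hypothetical parser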
You could try some multiprocessing to speed things up
from multiprocessing.dummy import Pool as ThreadPool
import pandas

def YourCleaningFunction(line):
    # for each line do the following
    # blablabla
    return formatted_line  # your formatted line, joined with ',' -- or use the kind of function jpp just provided

pool = ThreadPool(8)  # your number of cores
lines = open('your_big_csv.csv').read().split('\n')  # your csv as a list of lines
df = pool.map(YourCleaningFunction, lines)
df = pandas.DataFrame(df)
pool.close()
pool.join()

Will numpy read the whole data every time it iterates?

I have a very large data file consisting of N*100 real numbers, where N is very large. I want to read the data by columns. I can read it as a whole and then manipulate it column by column:
data = np.loadtxt(fname='data.txt')
for i in range(100):
    np.sum(data[:, i])
Or I can read it column by column, expecting this to save memory and be fast:
for i in range(100):
    col = np.loadtxt(fname='data.txt', usecols=(i,))
    np.sum(col)
However, the second approach does not seem to be faster. Is it because each call reads the whole file and then extracts only the desired column, so it ends up 100 times slower than the first one? Is there any method to read one column after another, but much faster?
If I just want to get the 100 numbers in the last row of the file, reading whole columns and taking their last elements is not a wise choice. How can I achieve this?
If I understand your question right, you want only the last row. This would read only the last row of an N-row file:
data = np.loadtxt(fname='data.txt', skiprows=N-1)
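Since N is usually not known in advance, one option is to count the lines first without parsing any numbers, and then skip all but the last one; a small sketch with a placeholder file name:

import numpy as np

# count lines cheaply, without converting anything to numbers
with open('data.txt') as f:
    n_rows = sum(1 for _ in f)

# loadtxt still scans the skipped lines, but only parses the final one
last_row = np.loadtxt(fname='data.txt', skiprows=n_rows - 1)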
You are asking two things: how to sum across all rows, and how to read the last row.
data = np.loadtxt(fname='data.txt')
for i in range(100):
    np.sum(data[:, i])
data is an (N, 100) 2d array. You don't need to iterate to sum along each column:
np.sum(data, axis=0)
gives you a (100,) array, one sum per column.
for i in range(100):
    col = np.loadtxt(fname='data.txt', usecols=(i,))
    np.sum(col)  # just throwing this away??
With this you read the file 100 times. In each loadtxt call it has to read every line, select the ith string, interpret it, and collect it in col. It might be faster IF data were so large that the machine bogged down with memory swapping; otherwise, array operations on data will be a lot faster than file reads.
As the other answer shows, loadtxt lets you specify a skiprows parameter. It will still read all the lines (i.e. make f.readline() calls), but it just doesn't process them or collect their values in a list.
Do some of your own time tests.
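For example, a rough timing comparison of the two approaches (with a placeholder file name) could look like this:

import time
import numpy as np

t0 = time.perf_counter()
data = np.loadtxt(fname='data.txt')           # read the file once
col_sums = data.sum(axis=0)                   # one sum per column
t1 = time.perf_counter()

per_col = np.array([np.loadtxt(fname='data.txt', usecols=(i,)).sum()
                    for i in range(100)])     # re-reads the file 100 times
t2 = time.perf_counter()

print('single read: %.2f s, per-column reads: %.2f s' % (t1 - t0, t2 - t1))
print(np.allclose(col_sums, per_col))         # same results either way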
