I have a csv file which is too large to completely fit into my laptop's memory (about 10GB). Is there a way to truncate the file such that only the first n entries are saved in a new file? I started by trying
df = pandas.read_csv("path/data.csv").as_matrix()
but this doesn't work since the memory is too small.
Any help will be appreciated!
Leon
Use nrows:
df = pandas.read_csv("path/data.csv", nrows=1000)
The nrows docs say:
Number of rows of file to read. Useful for reading pieces of large files
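To get from there to a truncated copy on disk, a minimal sketch (the function name truncate_csv and the file paths are placeholders of mine, not from the question) could read the first n rows and write them straight back out:

```python
import pandas as pd

def truncate_csv(src_path, dst_path, n_rows):
    """Copy the header and the first n_rows data rows of src_path to dst_path."""
    # nrows stops parsing after n_rows data rows, so the rest of the file is never loaded
    head = pd.read_csv(src_path, nrows=n_rows)
    # index=False avoids adding an extra index column to the new file
    head.to_csv(dst_path, index=False)
    return len(head)
```

Called as truncate_csv("path/data.csv", "path/data_head.csv", 1000), this should only ever hold the first 1000 rows in memory.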
Related
I am having trouble with reading and writing moderately sized excel files in Pandas. I have 5 files each around 300 MB large. I need to combine these files into one, do some processing and then save it (as excel preferably):
import pandas as pd
f1 = pd.read_excel('File_1.xlsx')
f2 = pd.read_excel('File_2.xlsx')
f3 = pd.read_excel('File_3.xlsx')
f4 = pd.read_excel('File_4.xlsx')
f5 = pd.read_excel('File_5.xlsx')
FULL = pd.concat([f1,f2,f3,f4,f5], axis=0, ignore_index=True, sort=False)
FULL.to_excel('filename.xlsx', index=False)
But unfortunately the read takes far too long (around 15 minutes), and the write used up 100% of memory (on my PC with 16 GB of RAM) and took so long that I was forced to interrupt the program.
Is there any way I could accelerate both read/write?
This post defines a nice helper function, append_df_to_excel().
You can use that function to read the files one by one and append their content to the final excel file. This will save you RAM since you are not going to keep all the files in memory at once.
files = ['File_1.xlsx','File_2.xlsx',...]
for file in files:
    df = pd.read_excel(file)
    append_df_to_excel('filename.xlsx', df)
Depending on your input files, you may need to pass some extra arguments to the function. Check the linked post for extra info.
Note that you could use df.to_csv() with mode='a' to append to a csv file. Most of the time you can swap excel files for csv easily. If this is also your case, I would suggest this method instead of the custom function.
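A minimal sketch of the CSV variant, assuming all input files share the same columns (the function name append_frames_to_csv and the output path are my own placeholders):

```python
import os
import pandas as pd

def append_frames_to_csv(frames, out_path):
    """Append each DataFrame in `frames` to out_path, writing the header only once."""
    for df in frames:
        # Write the header only if the output file does not exist yet or is still empty
        write_header = not os.path.exists(out_path) or os.path.getsize(out_path) == 0
        df.to_csv(out_path, mode="a", index=False, header=write_header)
```

Passing a generator such as (pd.read_excel(f) for f in files) as `frames` keeps only one input file in memory at a time. Note that the function appends, so a leftover output file from an earlier run should be deleted first.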
Not ideal (and dependent on use case), but I've always found it much quicker to load up the XLSX (in Excel) and save it as a CSV file, just because I tend to do multiple reads on the data and in the long run the time taken to wait for the XLSX load outweighs the amount of time it takes to convert the file.
I'm trying to read a rather large CSV (2 GB) with pandas to do some datatype manipulation and to join it with other dataframes I have already loaded. As I want to be a little careful with memory, I decided to read it in chunks. For the purpose of the question, here is an extract of my CSV layout with dummy data (I can't really share the real data, sorry!):
institution_id,person_id,first_name,last_name,confidence,institution_name
1141414141,4141414141,JOHN,SMITH,0.7,TEMP PLACE TOWN
10123131114,4141414141,JOHN,SMITH,0.7,TEMP PLACE CITY
1003131313188,4141414141,JOHN,SMITH,0.7,"TEMP PLACE,TOWN"
18613131314,1473131313,JOHN,SMITH,0.7,OTHER TEMP PLACE
192213131313152,1234242383,JANE,SMITH,0.7,"OTHER TEMP INC, LLC"
My pandas code to read the files:
inst_map = pd.read_csv("data/hugefile.csv",
                       engine="python",
                       chunksize=1000000,
                       index_col=False)
print("processing institution chunks")
chunk_list = []  # append each chunk df here
for chunk in inst_map:
    # perform data filtering
    chunk['person_id'] = chunk['person_id'].progress_apply(zip_check)
    chunk['institution_id'] = chunk['institution_id'].progress_apply(zip_check)
    # Once the data filtering is done, append the chunk to list
    chunk_list.append(chunk)
ins_processed = pd.concat(chunk_list)
The zip check function that I'm applying is basically performing some datatype checks and then converting the value that it gets into an integer.
Whenever I read the CSV it will only ever read the institution_id column and generates an index. The other columns in the CSV are just silently dropped.
When I don't use index_col=False as an option, it just sets 1141414141/4141414141/JOHN/SMITH/0.7 (basically the first 5 values in the row) as the index and uses institution_id as the only header, while reading only institution_name into the dataframe as a value.
I have honestly no clue what is going on here, and after 2 hours of SO / google search I decided to just ask this as a question. Hope someone can help me, thanks!
The issue turned out to be that something went wrong while transferring the large CSV file to my remote processing server (which has sufficient RAM to handle in-memory editing). Processing the chunks on my local computer seems to work.
After reuploading the file it worked fine on the remote server.
I am new to Python and I attempt to read a large .csv file (with hundreds of thousands or possibly few millions of rows; and about 15.000 columns) using pandas.
What I thought I could do is create and save each chunk in a new .csv file, iterating across all chunks. I am currently using a laptop with relatively limited memory (about 4 GB, in the process of upgrading it), but I was wondering whether I could do this without changing my setup now. Alternatively, I could move this process to a PC with more RAM and attempt larger chunks, but I wanted to get this in place even for shorter row chunks.
I have seen that I can process quickly chunks of data (e.g. 10.000 rows and all columns), using the code below. But due to me being a Python beginner, I have only managed to order the first chunk. I would like to loop iteratively across chunks and save them.
import pandas as pd
import os
print(os.getcwd())
print(os.listdir(os.getcwd()))
chunksize = 10000
data = pd.read_csv('ukb35190.csv', chunksize=chunksize)
df = data.get_chunk(chunksize)
print(df)
export_csv1 = df.to_csv (r'/home/user/PycharmProjects/PROJECT/export_csv_1.csv', index = None, header=True)
If you are not doing any processing on the data, then you don't even have to store it in a variable. You can write each chunk out directly. See the code below. Hope this helps.
import pandas as pd
import os
chunksize = 10000
batch_no = 1
# Each iteration reads the next 10000 rows and writes them to their own file
for chunk in pd.read_csv(r'ukb35190.csv', chunksize=chunksize):
    chunk.to_csv(r'ukb35190.csv' + str(batch_no) + '.csv', index=False)
    batch_no += 1
Let us say I have an Excel file with 100k rows. My code reads it row by row and does a computation for each row (including a benchmark of how long each row takes). It then produces an array of results with 100k rows. My Python code works, but it is not efficient: it takes several days, and the benchmark results get worse over time, I guess because of high memory consumption. Please see my attempt and let me know how to improve it.
My code saves results=[] and only writes it at the end. Also, at the start I store the whole Excel file in worksheet. I think this will cause memory issues, since my Excel file has very large text in its cells (not only numbers).
import pandas as pd
import xlrd

ExcelFileName = 'Data.xlsx'
workbook = xlrd.open_workbook(ExcelFileName)
worksheet = workbook.sheet_by_name("Sheet1")  # We need to read the data
num_rows = worksheet.nrows  # Number of rows
num_cols = worksheet.ncols  # Number of columns
results = []
for curr_row in range(1, num_rows, 1):
    row_data = []
    for curr_col in range(0, num_cols, 1):
        data = worksheet.cell_value(curr_row, curr_col)  # Read the data in the current cell
        row_data.append(data)
    #### do computation here ####
    ## save results like results+=[]
### save results array in dataframe and then print it to excel
df = pd.DataFrame(results)
writer = pd.ExcelWriter("XX.xlsx", engine="xlsxwriter")
df.to_excel(writer, sheet_name='results')
writer.save()
What I would like is to read the first row from Excel and store it in memory, do the calculation, get the result and save it into Excel, then go on to the second row, and so on, without keeping memory so busy. That way I would not have a results array containing 100k rows, since I erase it on each loop.
To solve the issue about loading the input file into memory, I would look into using a generator. A Generator works by iterating over any iterable, but only returning the next element instead of the entire iterable. In your case, this would return only the next row from your .xlsx file, instead of keeping the entire file in memory.
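As a rough sketch of the idea, here is a generator over a CSV export of the workbook, using the standard-library csv module. The CSV export is my assumption (it is the simplest thing to stream), and iter_rows is a hypothetical name:

```python
import csv

def iter_rows(csv_path):
    """Yield one data row at a time as a list of strings, skipping the header."""
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row; does nothing if the file is empty
        for row in reader:
            # Only the current row is held in memory at this point
            yield row
```

A loop such as `for row_data in iter_rows("Data.csv"):` would then replace the two nested cell-by-cell loops, with the computation done inside the loop body.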
However, this will not solve the issue of having a very large "results" array. Unfortunately, updating a .csv or .xlsx file as you go will take a very long time, significantly longer than updating the object in memory. There is a trade-off here: you can either use up lots of memory by updating your "results" array and writing it all to a file at the end, or you can update a file on disk with the results as you go, at the cost of much slower execution.
For this kind of operation you are probably better off loading the csv directly into a DataFrame. There are several methods for dealing with large files in pandas, which are detailed here: How to read a 6 GB csv file with pandas. Which method you choose will depend a lot on the type of computation you need to do. Since you seem to be processing one row at a time, using chunks will probably be the way to go.
Pandas has a lot of built-in optimization for operations on large sets of data, so most of the time you will get better performance working with data in a DataFrame or Series than you will with pure Python. For the best performance, consider vectorizing your function; where that is not possible, the apply method lets pandas run the function over every row without an explicit Python loop.
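To illustrate the difference, here is a small sketch with made-up columns a and b and a made-up formula (none of these come from the question). Both functions compute the same result: the first works on whole columns at once, the second calls a Python function once per row and is usually noticeably slower on large inputs.

```python
import pandas as pd

def add_score(chunk):
    """Vectorized: one expression evaluated over whole columns at once."""
    out = chunk.copy()
    out["score"] = out["a"] * 2 + out["b"]
    return out

def add_score_apply(chunk):
    """Same computation with a row-wise apply (one Python call per row)."""
    out = chunk.copy()
    out["score"] = out.apply(lambda row: row["a"] * 2 + row["b"], axis=1)
    return out
```

Either function can be called on each chunk from pd.read_csv(..., chunksize=...), so only one chunk's worth of rows is in memory at a time.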
Suppose I have a csv file containing 5 rows.
Now I iterate over this file using a chunksize of 2.
data = pd.read_csv(data_name, header=None, iterator=True, chunksize=2)
Suppose I am doing some magic on this data chunk and appending it to another csv file.
processed_data.to_csv(fname, index=None, mode="a")
Problem: The last row is not written.
I do not know how to solve this problem. Can someone help?
I need to use chunks because I don't have enough RAM.
I can not use chunksize=1, because opening/closing a file is too time consuming.
If you are running out of memory I would use blaze for this type of data.
https://blaze.readthedocs.io/en/latest/ooc.html
Then you don't have to mess with the chunksize.
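For comparison, here is a minimal sketch of the chunked loop in plain pandas (the function name copy_in_chunks and the paths are placeholders, and the "magic" step is left out). Iterating over the reader yields the final, shorter chunk as well, so a 5-row file with chunksize=2 gives chunks of 2, 2 and 1 rows. With header=None on the read, the column labels are just 0, 1, ..., and to_csv writes them as an extra row on every call unless header=False is passed.

```python
import pandas as pd

def copy_in_chunks(src_path, dst_path, chunksize=2):
    """Append every chunk of src_path to dst_path and return the size of each chunk."""
    sizes = []
    for chunk in pd.read_csv(src_path, header=None, chunksize=chunksize):
        # header=False keeps pandas from writing the integer column labels as a data row
        chunk.to_csv(dst_path, mode="a", index=False, header=False)
        sizes.append(len(chunk))
    return sizes
```

The returned list makes it easy to check that the number of rows written matches the number of rows in the input.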