I have a large CSV file (about a million records). I want to process it and write each record into a DB.
Since loading the complete file into RAM makes no sense, I need to read the file in chunks (or in any other, better way).
So I wrote this code:
import csv
with open('/home/praful/Desktop/a.csv') as csvfile:
    config_file = csv.reader(csvfile, delimiter=',', quotechar='|')
    print config_file
    for row in config_file:
        print row
I guess it loads everything into memory first and then processes it.
After looking at this thread and many others, I didn't see any difference between my code and the solution there. Kindly advise: is this the only method for efficient processing of CSV files?
No, the csv module produces an iterator; rows are produced on demand. Unless you keep references to row elsewhere the file will not be loaded into memory in its entirety.
Note that that is exactly what I am saying in the other answer you linked to; the problem there is that the OP was building a list (data) holding all rows after reading instead of processing the rows as they were being read.
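To make that concrete, here is a minimal sketch of streaming the rows straight into a database as they are read, so only one small batch is ever held in memory. The sqlite3 target, the table name and the two-column layout are illustrative assumptions, not from the question; substitute your own DB driver and schema.

import csv
import sqlite3

conn = sqlite3.connect('records.db')  # hypothetical target database
conn.execute('CREATE TABLE IF NOT EXISTS records (col_a TEXT, col_b TEXT)')

with open('/home/praful/Desktop/a.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='|')
    batch = []
    for row in reader:              # rows are yielded one at a time
        batch.append(row[:2])       # keep only the two columns we insert
        if len(batch) >= 10000:     # flush in batches to bound memory use
            conn.executemany('INSERT INTO records VALUES (?, ?)', batch)
            batch.clear()
    if batch:                       # insert the final partial batch
        conn.executemany('INSERT INTO records VALUES (?, ?)', batch)

conn.commit()
conn.close()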
Related
I am trying to perform analysis on dozens of very large CSV files, each with hundreds of thousands of rows of time-series data and each about 5 GB in size.
My goal is to read each of these CSV files as a dataframe, perform calculations on it, append some new columns based on those calculations, and then write the new dataframe to a unique output CSV file for each input CSV file. The whole process happens inside a for loop iterating over a folder containing all of these large CSV files, so it is very memory-intensive, and when I try to run my code I am met with this error message: MemoryError: Unable to allocate XX MiB for an array with shape (XX,) and data type int64
So I want to explore a way to make reading in my CSVs much less memory-intensive, which is why I want to try out the pickle module in Python.
To "pickle" each CSV and then read it back in, I tried the following:
# Pickle the CSV's dataframe and read it back in as a pickle
import pickle
import pandas as pd

df = pd.read_csv(path_to_csv)       # still loads the whole CSV into memory
filename = "pickle.csv"

with open(filename, 'wb') as file:
    pickle.dump(df, file)           # serialize the dataframe to disk

with open(filename, 'rb') as file:
    pickled_df = pickle.load(file)  # deserialize it back into a dataframe

print(pickled_df)
However, after including this pickling code in my larger script, I get the same error message as above. I suspect this is because I am still reading the file in with pandas before pickling it and then reading that pickle back. My question is: how do I avoid the memory-intensive process of reading my data into a pandas dataframe by just reading in the CSV with pickle? Most instructions I find tell me to pickle the CSV and then read in that pickle, but I do not understand how to pickle the CSV without first reading it in with pandas, which is what is causing my code to crash. I am also confused about whether reading in my data as a pickle would still give me a dataframe I can perform calculations on.
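For what it's worth, the per-file workflow described above can also be written with pandas' chunked reader so that only one chunk is held in memory at a time. This is just a sketch; the folder names, chunk size, the 'value' column and the calculation itself are placeholders, not taken from the question.

import os
import pandas as pd

input_dir = 'csv_inputs'    # hypothetical folder of large input CSVs
output_dir = 'csv_outputs'  # one output CSV per input CSV
os.makedirs(output_dir, exist_ok=True)

for name in os.listdir(input_dir):
    if not name.endswith('.csv'):
        continue
    out_path = os.path.join(output_dir, name)
    first_chunk = True
    for chunk in pd.read_csv(os.path.join(input_dir, name), chunksize=200_000):
        # placeholder calculation that appends a new column
        chunk['value_scaled'] = chunk['value'] * 2
        # append each processed chunk to the per-file output CSV
        chunk.to_csv(out_path, mode='w' if first_chunk else 'a',
                     header=first_chunk, index=False)
        first_chunk = False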
I have a very large dataframe with many million rows, and it is normally not feasible to load the entire file into memory to work with. Recently some bad data got in, and I need to remove it from the database. So far what I've done is:
import pandas as pd

file = '/path to database'
rf = pd.read_csv(f'{file}.csv', chunksize=3000000, index_col=False)
# keep only the rows before the bad data, then overwrite the original file
res = pd.concat([chunk[chunk['timestamp'] < 1.6636434764745E+018] for chunk in rf])
res.to_csv(f'{file}.csv', index=False)
Basically it opens the database, keeps the part I want, and overwrites the original file.
However, the data has gotten so large that this no longer fits in memory. Is there a better way to truncate part of the dataframe based on a simple query?
The truncated part is usually very small compared to the rest, say 100k rows, and always at the end.
I would avoid using pandas in this case and just edit the csv file directly. For example:
import csv

with open("test_big.csv", "r", newline="") as f_in, open("test_out.csv", "w", newline="") as f_out:
    reader = csv.reader(f_in)
    writer = csv.writer(f_out)
    for row in reader:
        if int(row[-1]) > 9900:  # your condition here
            writer.writerow(row)
For context, test_big.csv looks like this:
1,2,3,4,5891
1,2,3,4,7286
1,2,3,4,7917
1,2,3,4,937
...
And is 400,000 records long. Execution took 0.2s.
Edit: Ran with 40,000,000 records and took 15s.
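Applied to the timestamp cutoff from the question, roughly the same approach with csv.DictReader might look like the sketch below; the "timestamp" column name and the threshold are taken from the pandas snippet above, and it assumes the file has a header row.

import csv

with open('database.csv', newline='') as f_in, \
     open('database_trimmed.csv', 'w', newline='') as f_out:
    reader = csv.DictReader(f_in)
    writer = csv.DictWriter(f_out, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        # keep only rows before the cutoff, mirroring chunk['timestamp'] < ...
        if float(row['timestamp']) < 1.6636434764745e18:
            writer.writerow(row)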
I'm trying to read a rather large CSV (2 GB) with pandas to do some datatype manipulation and join it with other dataframes I have already loaded. As I want to be a little careful with memory, I decided to read it in chunks. For the purpose of the question, here is an extract of my CSV layout with dummy data (can't really share the real data, sorry!):
institution_id,person_id,first_name,last_name,confidence,institution_name
1141414141,4141414141,JOHN,SMITH,0.7,TEMP PLACE TOWN
10123131114,4141414141,JOHN,SMITH,0.7,TEMP PLACE CITY
1003131313188,4141414141,JOHN,SMITH,0.7,"TEMP PLACE,TOWN"
18613131314,1473131313,JOHN,SMITH,0.7,OTHER TEMP PLACE
192213131313152,1234242383,JANE,SMITH,0.7,"OTHER TEMP INC, LLC"
My pandas code to read the files:
inst_map = pd.read_csv("data/hugefile.csv",
                       engine="python",
                       chunksize=1000000,
                       index_col=False)
print("processing institution chunks")
chunk_list = [] # append each chunk df here
for chunk in inst_map:
# perform data filtering
chunk['person_id'] = chunk['person_id'].progress_apply(zip_check)
chunk['institution_id'] = chunk['institution_id'].progress_apply(zip_check)
# Once the data filtering is done, append the chunk to list
chunk_list.append(chunk)
ins_processed = pd.concat(chunk_list)
The zip_check function that I'm applying basically performs some datatype checks and then converts the value it gets into an integer.
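(For illustration, a zip_check-style helper along those lines might look like the following; this is a guess from that description, not the question author's actual function.)

def zip_check(value):
    # hypothetical reimplementation: type-check, then convert to int
    if isinstance(value, int):
        return value
    try:
        return int(str(value).strip())
    except (TypeError, ValueError):
        return -1  # placeholder for values that cannot be parsed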
Whenever I read the CSV, it only ever reads the institution_id column and generates an index; the other columns in the CSV are just silently dropped.
When I don't use index_col=False as an option, it just sets 1141414141/4141414141/JOHN/SMITH/0.7 (basically the first 5 values in the row) as the index and only institution_id as the header, while reading only the institution_name into the dataframe as a value.
I honestly have no clue what is going on here, and after 2 hours of SO / Google searching I decided to just ask this as a question. Hope someone can help me, thanks!
The issue turned out to be that something went wrong while transferring the large CSV file to my remote processing server (which has sufficient RAM to handle in-memory editing). Processing the chunks on my local computer works.
After re-uploading the file, it worked fine on the remote server.
I have an Excel file with more than 1 million rows. Now I need to split it every n rows and save each part in a new file. I am very new to Python. Any help is much appreciated and needed.
As suggested by OhAuth, you can save the Excel document to a csv file. That would be a good start for processing your data.
For the processing you can use the Python csv library, which requires no installation since it ships with Python.
If you want something more "powerful", you might want to look into pandas. However, that requires installing the module.
If you do not want to use Python's csv module nor the pandas module because you do not want to read their docs, you could also do something like this:
f = open("myCSVfile", "r")
for row in f:
singleRow = row.split(",") #replace the "," with the delimiter you chose to seperate your columns
print singleRow
> [value1, value2, value3, ...] #it returns a list and list comprehension is well documented and easy to understand, thus, further processing wont be difficult
However, I strongly recommend looking into the modules, since they handle csv data better and more efficiently and will, in the long run, save you time and trouble.
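Sticking with the csv module recommended above, splitting into files of n rows each could look roughly like this sketch; the input filename, the value of n and the part-naming pattern are assumptions.

import csv

n = 100000                               # rows per output file
with open('big_input.csv', newline='') as f_in:
    reader = csv.reader(f_in)
    header = next(reader)                # repeat the header in every part
    part, writer, f_out = 0, None, None
    for i, row in enumerate(reader):
        if i % n == 0:                   # start a new output file every n rows
            if f_out:
                f_out.close()
            part += 1
            f_out = open(f'part_{part}.csv', 'w', newline='')
            writer = csv.writer(f_out)
            writer.writerow(header)
        writer.writerow(row)
    if f_out:
        f_out.close()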
Simple problem, but maybe a tricky answer:
The problem is how to handle a huge .txt file with PyTables.
I have a big .txt file, with MILLIONS of lines and short lines, for example:
line 1 23458739
line 2 47395736
...........
...........
The content of this .txt file must be saved into a pytable. OK, that part is easy: nothing else needs to be done with the info in the txt file, just copy it into a pytable, and we end up with a table of, for example, 10 columns and millions of rows.
The problem comes up when, from the content of the .txt file, 10 columns x millions of lines are generated directly in the pytable BUT, depending on the data on each line of the .txt file, new columns must be created in the pytable. So how can this be handled efficiently?
Solution 1: first copy the whole text file, line by line, into the pytable (millions of rows), then iterate over each row of the pytable (millions again) and, depending on the values, generate the new columns needed.
Solution 2: read the .txt file line by line, do whatever is needed, calculate the new values, and only then send all the info to the pytable.
Solution 3: ... any other more efficient and faster solution?
I think the basic problem here is one of conceptual model. PyTables' tables only handle regular (or structured) data. However, your data is irregular or unstructured, in that the structure is determined as you read it. Said another way, PyTables needs the column description to be known completely by the time create_table() is called. There is no way around this.
Since in your problem statement any line may add a new column, you have no choice but to do this in two full passes through the data: (1) read through the data and determine the columns, and (2) write the data to the table. In pseudocode:
import tables as tb

cols = {}
# pass 1: discover the columns
d = open('data.txt')
for line in d:
    for col in columns_in(line):        # columns_in() is pseudocode: the column names found on this line
        if col not in cols:
            cols[col] = tb.Float64Col() # pick a Col type appropriate to the data

# pass 2: write the table
d.seek(0)
f = tb.open_file(...)
t = f.create_table(..., description=cols)
for line in d:
    row = line_to_row(line)             # also pseudocode: convert a line into a row
    t.append(row)

d.close()
f.close()
Obviously, if you knew the table structure ahead of time you could skip the first loop and this would be much faster.
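For that known-structure case, a single-pass version might look roughly like the sketch below; the two-column description and the way each line is parsed are assumptions based on the sample lines in the question.

import tables as tb

class Reading(tb.IsDescription):
    label = tb.StringCol(16)   # e.g. "line 1"
    value = tb.Int64Col()      # e.g. 23458739

with tb.open_file('data.h5', mode='w') as h5, open('data.txt') as txt:
    table = h5.create_table('/', 'readings', Reading)
    row = table.row
    for line in txt:
        parts = line.split()
        if not parts:
            continue
        row['label'] = ' '.join(parts[:-1]).encode()
        row['value'] = int(parts[-1])
        row.append()           # buffered append; PyTables flushes periodically
    table.flush()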