read data from a huge CSV file efficiently - python

I was trying to process my huge CSV file (more than 20G), but the process was killed when reading the whole CSV file into memory. To avoid this issue, I am trying to read the second column line by line.
For example, the 2nd column contains data like
xxx, computer is good
xxx, build algorithm
import collections

wordcount = collections.Counter()
with open('desc.csv', 'rb') as infile:
    for line in infile:
        wordcount.update(line.split())
My code works, but it counts the words from every column; how can I read only the second column without using the CSV reader?

As far as I know, calling csv.reader(infile) opens and reads the whole file...which is where your problem lies.
You can just read line-by-line and parse manually:
X = []
with open('desc.csv', 'r') as infile:
    for line in infile:
        # Split on comma first
        cols = [x.strip() for x in line.split(',')]
        # Grab 2nd "column"
        col2 = cols[1]
        # Split on spaces
        words = [x.strip() for x in col2.split(' ')]
        for word in words:
            if word not in X:
                X.append(word)

for w in X:
    print w
That keeps only a small chunk of the file (one line) in memory at a time. However, the list X may still grow quite large, to the point where the program errors out due to memory limits; it depends on how many unique words are in your "vocabulary" list.
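As a small variation on the above (not part of the original answer), a set makes the uniqueness check constant-time instead of scanning the whole list for every word:

unique_words = set()
with open('desc.csv', 'r') as infile:
    for line in infile:
        cols = [x.strip() for x in line.split(',')]
        if len(cols) > 1:  # skip lines without a second column
            unique_words.update(cols[1].split())

for word in unique_words:
    print(word)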

It looks like the code in your question reads the 20 GB file, splits each line into space-separated tokens, and builds a counter that keeps a count of every unique token. I'd say that is where your memory is going.
From the manual, csv.reader returns:
a reader object which will iterate over lines in the given csvfile.
csvfile can be any object which supports the iterator protocol and
returns a string each time its next() method is called
so it is fine to iterate through a huge file using csv.reader.
import csv
import collections

wordcount = collections.Counter()
with open('desc.csv', 'rb') as infile:
    for row in csv.reader(infile):
        # count words in strings from second column
        wordcount.update(row[1].split())

Related

Running out of Memory when trying to find and replace in a CSV

When trying to find and replace on a 12 MB CSV, I am running out of memory.
This code checks a CSV file against a list of 5000 names and replaces each occurrence with the word 'REDACTED'.
I've tried running this on an AWS XL instance and still ran out of memory.
import csv

input_file = csv.DictReader(open("names.csv"))
newword = 'REDACTED'

with open('new.txt', 'w') as outfile, open('test.txt') as infile:
    for line in infile:
        for oldword, newword in input_file:
            line = line.replace(oldword, newword)
            print('Replaced')
        outfile.write(line)
I expect it to output new.txt with the replacements made. I'm currently getting a MemoryError.
There are multiple problems with your code before we can even look at what is causing the MemoryError.
for oldword, newword in input_file: overwrites newword = 'REDACTED'.
Then, as far as I know, you cannot iterate over a DictReader multiple times:
input_file = csv.DictReader(open("names.csv"))
for line in infile:
    for oldword, newword in input_file:
And finally, I assume "names.csv" contains all the possible names, so why read it with a DictReader at all? What is the structure of the names file? If it is a CSV file, shouldn't you take only the values of one column rather than the whole row?
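A minimal corrected sketch (assuming names.csv simply has one name per row in its first column; adjust the column index to your actual layout):

import csv

REPLACEMENT = 'REDACTED'

# read the 5000 names once into a list instead of re-reading the CSV for every line
with open('names.csv', newline='') as names_file:
    names = [row[0] for row in csv.reader(names_file) if row]

with open('test.txt') as infile, open('new.txt', 'w') as outfile:
    for line in infile:
        for name in names:
            line = line.replace(name, REPLACEMENT)
        outfile.write(line)

This keeps only the list of names and one line of the input in memory at a time.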

Handle huge bz2-file

I need to work with a huge bz2 file (5+ GB) using Python. With my current code, I always get a memory error. Somewhere I read that I could use sqlite3 to handle the problem. Is this right? If yes, how should I adapt my code?
(I'm not very experienced with sqlite3...)
Here is the current beginning of my code:
import csv, bz2

names = ('ID', 'FORM')
filename = "huge-file.bz2"

with open(filename) as f:
    f = bz2.BZ2File(f, 'rb')
    reader = csv.DictReader(f, fieldnames=names, delimiter='\t')
    tokens = [sentence for sentence in reader]
After this, I need to go through the 'tokens'. It would be great if I could handle this huge bz2 file, so any help is very welcome. Thank you very much for any advice!
The file is huge, and reading the whole file won't work because your process will run out of memory.
The solution is to read the file in chunks/lines and process each one before reading the next.
The list comprehension line
tokens = [sentence for sentence in reader]
is reading the whole file into tokens, and that can cause the process to run out of memory.
csv.DictReader can read the CSV records line by line, meaning that on each iteration only one line of data is loaded into memory.
Like this:
with open(filename) as f:
    f = bz2.BZ2File(f, 'rb')
    reader = csv.DictReader(f, fieldnames=names, delimiter='\t')
    for sentence in reader:
        # do something with sentence (process/aggregate/store/etc.)
        pass
Please note that if, inside that loop, the data from each sentence is stored in another variable (like tokens), a lot of memory may still be consumed, depending on how big the data is. So it's better to aggregate the results as you go, or to use some other type of storage for that data.
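Since you mentioned sqlite3: one possible "other type of storage" is to stream the rows into an SQLite database instead of building a Python list. This is only a rough sketch, assuming Python 3, the tab-separated two-column layout from your code, and a database file name (tokens.db) chosen here just for illustration:

import bz2
import csv
import sqlite3

names = ('ID', 'FORM')
filename = "huge-file.bz2"

conn = sqlite3.connect('tokens.db')
conn.execute('CREATE TABLE IF NOT EXISTS tokens (id TEXT, form TEXT)')

with bz2.open(filename, 'rt') as f:  # 'rt' yields text lines, which csv expects
    reader = csv.DictReader(f, fieldnames=names, delimiter='\t')
    batch = []
    for sentence in reader:
        batch.append((sentence['ID'], sentence['FORM']))
        if len(batch) >= 10000:  # insert in chunks to keep memory bounded
            conn.executemany('INSERT INTO tokens VALUES (?, ?)', batch)
            conn.commit()
            batch.clear()
    if batch:
        conn.executemany('INSERT INTO tokens VALUES (?, ?)', batch)
        conn.commit()

conn.close()

You can then query the table with ordinary SQL (or iterate over a cursor) instead of holding all the tokens in a Python list.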
Update
About keeping some of the previous lines available while you process (as discussed in the comments), you can do something like this:
You can store the previous line in another variable, which gets replaced on each iteration.
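For example (a minimal sketch; what you do with prev depends on your processing):

prev = None
for sentence in reader:
    # ... process `sentence`, with `prev` holding the previous row (None on the first pass) ...
    prev = sentence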
Or, if you need multiple previous lines, you can keep a list of the last n lines.
How
Use a collections.deque with a maxlen to keep track of the last n lines. Import deque from the collections standard module at the top of your file.
from collections import deque

# rest of the code ...

last_sentences = deque(maxlen=5)  # keeps only the previous lines we need for processing new lines
for sentence in reader:
    # process the sentence
    last_sentences.append(sentence)
I suggest the above solution, but you can also implement it yourself with a plain list and keep track of its size manually: define an empty list before the loop, append the current line at the end of each iteration, and whenever the list grows larger than you need, drop the older items.
last_sentences = []  # keeps the previous lines we need for processing new lines
for sentence in reader:
    # process the sentence
    last_sentences.append(sentence)
    if len(last_sentences) > 5:  # make sure we don't keep all the previous sentences
        last_sentences = last_sentences[-5:]

CSV.Reader importing a list of lists

I am running the following on a csv of UIDs:
import csv

with open('C:/uid_sample.csv', newline='') as f:
    reader = csv.reader(f, delimiter=' ')
    uidlist = list(reader)
but the list returned is actually a list of lists:
[['27465307'], ['27459855'], ['27451353']...]
I'm using this workaround to get individual strings within one list:
for r in reader:
    print(' '.join(r))
i.e.
['27465307','27459855','27451353',...]
Am I missing something that would let me do this automatically with csv.reader, or is there perhaps an issue with the formatting of my csv?
A CSV file is a file where each line, or row, contains columns that are usually delimited by commas. In your case, you told csv.reader() that your columns are delimited by a space. Since there aren't any spaces in any of the lines, each row of the csv.reader object has only one item. The problem here is that you aren't looking for a row with a single column; you are looking for a single item.
Really, you just want a list of the lines in the file. You could use f.readlines(), but that would include the newline character in each line. That actually isn't a problem if all you need to do with each line is convert it to an integer, but you might want to remove those characters. That can be done quite easily with a list comprehension:
newlist = [line.strip() for line in f]
If you are merely iterating through the lines (with a for loop, for example), you probably don't need a list. If you don't mind the newline characters, you can iterate through the file object directly:
for line in f:
    uid = int(line)
    print(uid)
If the newline characters need to go, you could either take them out per line:
for line in f:
    line = line.strip()
    ...
or create a generator object:
uids = (line.strip() for line in f)
Note that reading a file is like reading a book: you can't read it again until you turn back to the first page, so remember to use f.seek(0) if you want to read the file more than once.
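If you do want to keep csv.reader, a small variation (not part of the original answer) flattens its single-column rows into plain strings in one pass:

import csv

with open('C:/uid_sample.csv', newline='') as f:
    uidlist = [row[0] for row in csv.reader(f) if row]  # take the only column of each row

print(uidlist[:3])  # e.g. ['27465307', '27459855', '27451353']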

Compare CSV Values Against Another CSV And Output Results

I have a CSV containing various columns (full_log.csv). One of the columns is labeled "HASH" and contains the hash value of the file shown in that row. For example, my columns would have the following headers:
Filename - Hash - Hostname - Date
I need my Python script to take another CSV (hashes.csv) containing only one column of multiple hash values, and compare those hash values against the HASH column in my full_log.csv.
Anytime it finds a match I want it to output the entire row that contains the hash to an additional CSV (output.csv). So my output.csv will contain only the rows of full_log.csv that contain any of the hash values found in hashes.csv, if that makes sense.
So far I have the following. It works for a hash value that I manually enter in the script, but now I need it to read the hashes from hashes.csv instead, and to export the matching rows to output.csv instead of printing them.
import csv

with open('full_log.csv', 'rb') as input_file1:
    reader = csv.DictReader(input_file1)
    rows = [row for row in reader if row['HASH'] == 'FB7D9605D1A38E38AA4C14C6F3622E5C3C832683']

for row in rows:
    print row
I would generate a set from the hashes.csv file. Using membership in that set as a filter, I would iterate over the full_log.csv file, outputting only those lines that match.
import csv

with open('hashes.csv') as hashes:
    hashes = csv.reader(hashes)
    hashes = set(row[0] for row in hashes)

with open('full_log.csv') as input_file:
    reader = csv.DictReader(input_file)
    with open('output.csv', 'w') as output_file:
        writer = csv.DictWriter(output_file, reader.fieldnames)
        writer.writeheader()
        writer.writerows(row for row in reader if row['HASH'] in hashes)
Look at the pandas library for Python:
http://pandas.pydata.org/pandas-docs/stable/
It has various helpful functions for this kind of problem: it can easily read, transform, and write CSV files.
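For instance, you could read full_log.csv in chunks and filter each chunk against the hash set. This is only a sketch, assuming the log's hash column is literally named 'HASH' and that hashes.csv is a single headerless column:

import pandas as pd

hashes = set(pd.read_csv('hashes.csv', header=None)[0].astype(str))

first_chunk = True
for chunk in pd.read_csv('full_log.csv', chunksize=100000):  # bounded memory per chunk
    matches = chunk[chunk['HASH'].isin(hashes)]
    matches.to_csv('output.csv', mode='w' if first_chunk else 'a',
                   header=first_chunk, index=False)
    first_chunk = False

Each filtered chunk is appended to output.csv, so only one chunk is ever held in memory.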
Iterate through the rows of the log file and the hashes, using a filter with any() to return the rows whose hash appears in the collection of hashes:
import csv

with open('full_log.csv') as file1, open('hashes.csv') as file2:
    reader = csv.DictReader(file1)
    hash_rows = list(csv.DictReader(file2))  # materialise once so it can be re-checked for every log row
    matching_rows = [row for row in reader
                     if any(row['HASH'] == r['HASH'] for r in hash_rows)]
    fieldnames = reader.fieldnames  # assumes hashes.csv has a 'HASH' header row

with open('output.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)  # DictWriter needs the fieldnames
    writer.writeheader()
    writer.writerows(matching_rows)
I am a bit unclear as to exactly how much help you need in solving this. I will assume that you do not need a full solution, but rather tips on how to craft your own.
First question: which file is larger? If you know that hashes.csv is not too large, meaning it will fit in memory with no problem, then I would simply read that file in one line at a time and store each hash entry in a set. I won't provide full code, but the general structure is as follows:
hashes = set()
with open('hashes.csv') as hash_file:
    for line in hash_file:
        hashes.add(line.strip())  # the hash from the line
Now, I believe you already know how to read a CSV file, since you have an example above. What you want to do now is iterate through each row of the full log CSV file. For each of those rows, do not check whether the hash equals a specific value; instead, check whether that value is contained in the hashes set. If it is, use the CSV writer to write that single row to a file.
The biggest gotcha, I think, is knowing whether the hashes will always be in the same case so that the comparison works. For example, if one file uses uppercase hashes and the other uses lowercase, you need to convert both to the same case before comparing.
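Putting those tips together, a rough sketch might look like this (assuming the log's hash column is named 'HASH' and that hashes.csv is a single headerless column of hashes; check both assumptions against your actual files):

import csv

# build the set of known hashes, normalising case up front
with open('hashes.csv') as f:
    hashes = set(line.strip().upper() for line in f if line.strip())

with open('full_log.csv') as infile, open('output.csv', 'w', newline='') as outfile:
    reader = csv.DictReader(infile)
    writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        if row['HASH'].upper() in hashes:  # case-insensitive membership test
            writer.writerow(row)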

Memory issues with splitting lines in huge files in Python

I'm trying to read from disk a huge file (~2GB) and split each line into multiple strings:
def get_split_lines(file_path):
    with open(file_path, 'r') as f:
        split_lines = [line.rstrip().split() for line in f]
    return split_lines
The problem is that it tries to allocate tens and tens of GB of memory. I found out that this doesn't happen if I change my code in the following way:
def get_split_lines(file_path):
    with open(file_path, 'r') as f:
        split_lines = [line.rstrip() for line in f]  # no splitting
    return split_lines
I.e., if I do not split the lines, memory usage drastically goes down.
Is there any way to handle this problem, maybe some smart way to store split lines without filling up the main memory?
Thank you for your time.
After the split, you have multiple objects per line: a list plus some number of string objects. Each object has its own overhead in addition to the actual characters that made up the original string.
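You can see that per-object overhead directly with sys.getsizeof (the exact numbers vary across Python versions and builds):

import sys

line = "the quick brown fox jumps over the lazy dog"
words = line.split()

print(sys.getsizeof(line))                     # one string object
print(sys.getsizeof(words)                     # the list itself...
      + sum(sys.getsizeof(w) for w in words))  # ...plus many small string objects

The split version is noticeably larger, and that difference is multiplied by every line you keep in memory.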
Rather than reading the entire file into memory, use a generator.
def get_split_lines(file_path):
    with open(file_path, 'r') as f:
        for line in f:
            yield line.rstrip().split()

for t in get_split_lines(file_path):
    # Do something with the list t
    pass
This does not preclude you from writing something like
lines = list(get_split_lines(file_path))
if you really need to read the entire file into memory.
In the end, I stored a list of stripped lines:
with open(file_path, 'r') as f:
    split_lines = [line.rstrip() for line in f]
And, in each iteration of my algorithm, I simply recomputed the split line on the fly:
for line in split_lines:
    split_line = line.split()
    # do something with the split line
If you can afford to keep all the lines in memory, as I did, and you have to go through the whole file more than once, this approach is faster than the one proposed by @chepner, since you read the file from disk only once.
