Deleting/rearranging/adding columns in very large TSV files with Python

I have a very large tsv file (1.2GB, 5 columns, 38m lines). I want to delete a column, add a column of ID's (1 to 38m), and rearrange the column order. How can I do this without using a ridiculous amount of memory?
Language of choice is Python, though open to other solutions.

You can read, manipulate, and write one row at a time. Since the entire file is never loaded into memory, this approach has a very small memory footprint.
import csv
with open(fileinpath, 'rb') as fin, open(fileoutpath, 'wb') as fout:
    freader = csv.reader(fin, delimiter='\t')
    fwriter = csv.writer(fout, delimiter='\t')
    idx = 1
    for line in freader:
        line[4], line[0] = line[0], line[4]  # switches position between first and last column
        del line[3]  # delete fourth column
        line.insert(0, idx)
        fwriter.writerow(line)
        idx += 1
(This is written in Python 2.7, and deletes the fourth column as an example.)
Regarding rearranging the order - I assume you mean the order of columns - this can be done in the manipulation part; the example above switches the first and last columns.
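For Python 3, a minimal adaptation might look like the sketch below; the file names are placeholders, the files are opened in text mode with newline='' as the csv module expects, and enumerate replaces the manual counter.
import csv

# A minimal Python 3 sketch, assuming the same 5-column layout;
# 'input.tsv' and 'output.tsv' are placeholder file names.
with open('input.tsv', newline='') as fin, \
     open('output.tsv', 'w', newline='') as fout:
    freader = csv.reader(fin, delimiter='\t')
    fwriter = csv.writer(fout, delimiter='\t')
    for idx, line in enumerate(freader, start=1):
        line[4], line[0] = line[0], line[4]  # swap first and last columns
        del line[3]                          # drop the fourth column
        fwriter.writerow([idx] + line)       # prepend the running ID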

You can use awk to do this; since it processes the file line by line, even a 1.2GB file will not take a huge amount of memory.
If you want to delete column c3:
awk -F"\t" 'BEGIN{OFS="\t"}{print $1,$2,$4,$5,NR}' input.txt > output.txt
The raw output is
c1 c2 c4 c5 columnId(1 to 38m)
$1 is column 1, $2 is column 2, and so on; NR is the line number.
If you want to rearrange the columns, just change the order of $1,$2,$4,$5 and NR.

The answer depends enormously on how much context is needed to rewrite the lines and to determine the new ordering.
If it's possible to rewrite the individual lines without regard to context (which depends on how the ID number is derived), then you can use the csv module to read the file line by line as @Tal Kremerman illustrates, and write it out line by line in the same order. If you can determine the correct ordering of the lines at this point, you can add an extra field indicating the new order they should appear in.
Then you can do a second pass to sort/rearrange the lines into the correct order. There are many recent threads on "how to sort huge files with Python", e.g. How to sort huge files with Python? I think Tal Kremerman is right that the OP only wants to rearrange columns, and not rows.

Related

All of my data from columns of one file goes into one column in my output file. How do I keep it the same?

I'm trying to delete some number of data rows from a file, essentially just because there are too many data points. I can easily print them to IDLE but when I try to write the lines to a file, all of the data from one row goes into one column. I'm definitely a noob but it seems like this should be "trivial"
I've tried it with writerow and writerows, zip(), with and without [], I've changed the delimiter and line terminator.
import csv
filename = "velocity_result.csv"
with open(filename, "r") as source:
    for i, line in enumerate(source):
        if i % 2 == 0:
            with open("result.csv", "ab") as result:
                result_writer = csv.writer(result, quoting=csv.QUOTE_ALL, delimiter=',', lineterminator='\n')
                result_writer.writerow([line])
This is what happens:
input =  |a|b|c|d|   <- row
         |e|f|g|h|
output = |abcd|      <- every other row deleted (just one column)
My expectation is:
input =  |a|b|c|d|   <- row
         |e|f|g|h|
output = |a|b|c|d|   <- every other row deleted
Once you've read the line, it becomes a single item as far as Python is concerned. Sure, maybe it is a string which has comma-separated values in it, but it is still a single item. So [line] is a list of 1 item, no matter how it is formatted.
If you want to make sure the line is recognized as a list of separate values, you need to make it such, perhaps with split:
result_writer.writerow(line.split('<input file delimiter here>'))
Now the line becomes a list of 4 items, so it makes sense for the csv writer to write them as 4 separate values in the file.
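Putting that together, a minimal sketch (assuming the input is comma-delimited and reusing the file names from the question) that keeps every other row and lets csv.reader do the splitting:
import csv

# A minimal sketch, assuming a comma-delimited input file;
# the file names are taken from the question.
with open("velocity_result.csv", newline="") as source, \
     open("result.csv", "w", newline="") as result:
    reader = csv.reader(source)
    writer = csv.writer(result, quoting=csv.QUOTE_ALL)
    for i, row in enumerate(reader):
        if i % 2 == 0:            # keep every other row
            writer.writerow(row)  # row is already a list of values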

Compare 2 csv files with the same header and output a third csv with some calculations

I want to compare 2 csv files and store the results in a new csv file.
I have 2 csv (old.csv and new.csv) with the same headers.
How can I compare the values of each and do calculations based on those?
with open('new.csv') as new_csv, open('old.csv') as old_csv:
    reader_old = csv.DictReader(old_csv)
    reader_new = csv.DictReader(new_csv)
    for row_o in reader_old:
        for row_n in reader_new:
            if row_n['Account'] == row_o['Account']:
                amt_diff = float(row_n['Number']) - float(row_o['Number'])
                print(amt_diff)
Python has a module called csv that will let you do all sorts of reading and writing of csv files, without having to go through the tedious task of manually taking strings and breaking them up along commas, etc. For example, you can use csv.DictReader() to read lines into a dictionary where the keys are the same as your column names:
import csv
with open('new.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        ranking = row['Ranking']
        percentage = row['Percentage']
        print("The percentage in this row is", percentage)
After extracting what you need and doing the calculations, you can use csv.DictWriter to write data to your new, third csv file. A search on the web for python csv module should give you a number of examples.
EDIT: I read your comment and saw your updated code. Let's look at what your nested loop does, as far as I can tell:
Take the first line of the old CSV data
Take the first line of the new CSV data
Compare their values for "Account". If they're the same, then print their difference (which should be zero if the two numbers are the same, right?)
Do the same with line #1 of the old and line #2 of the new.
Do the same with line #1 of the old and line #3 of the new.
Continue until you compare line #1 of the old and the last line of the new.
Repeat all of the above with line #2 of the old and line #1 of the new, then line #2 of the old and line #2 of the new, line #2 of the old and line #3 of the new, etc.
Is that what you want? Or are you just trying to compare them line by line and write the differences?
EDIT #2:
I don't know if this will make a difference, but try this instead:
reader_old = csv.DictReader(open("old.csv"))
reader_new = csv.DictReader(open("new.csv"))
for row_o in reader_old:
    for row_n in reader_new:
        amt_diff = float(row_n['Number']) - float(row_o['Number'])
        print(amt_diff)
If you want to write this to a new file instead of just printing the results, see csv.DictWriter().
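As a rough sketch of that (assuming the 'Account' and 'Number' columns from the question, and a made-up diff.csv output file): note that in the nested-loop version reader_new is a single-pass iterator and is exhausted after the first outer row, so one way around that is to load the old file into a dictionary keyed by account and then stream the new file once.
import csv

# A sketch under the question's assumptions: both files have
# 'Account' and 'Number' columns; 'diff.csv' is a hypothetical output name.
with open('old.csv', newline='') as old_csv:
    old_by_account = {row['Account']: row for row in csv.DictReader(old_csv)}

with open('new.csv', newline='') as new_csv, \
     open('diff.csv', 'w', newline='') as out_csv:
    reader_new = csv.DictReader(new_csv)
    writer = csv.DictWriter(out_csv, fieldnames=['Account', 'Difference'])
    writer.writeheader()
    for row_n in reader_new:
        row_o = old_by_account.get(row_n['Account'])
        if row_o is not None:
            amt_diff = float(row_n['Number']) - float(row_o['Number'])
            writer.writerow({'Account': row_n['Account'], 'Difference': amt_diff})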

Python getting exact cell from csv file

import csv
filename = str(input("Give the file name: "))
file = open(filename, "r")
with file as f:
    size = sum(1 for _ in f)
print("File", filename, "has been read, and it has", size, "lines.", size - 1, "rows has been analyzed.")
I pretty much type the csv file path to analyze and do different things with it.
First question is: How can I print the exact cell from the CSV file? I have tried different methods, but I can't seem to get it working.
For example I want to print the info of those two cells
The other question is: Can I automate it to print the very first cell (1 A) and the first cell of the very last row (1099 A), without me needing to type the cell locations?
Thank you
A small portion of the data:
Time       Solar Carport   Solar Fixed   SolarFlatroof   Solar Single
1.1.2016   317             1715          6548            2131
2.1.2016   6443            1223          1213            23121
3.1.2016   0               12213         0               122
You import csv at the very top but then decided not to use it. I wonder why – it seems just what you need here. So after a brief peek at the official documentation, I got this:
import csv
data = []
with open('../Downloads/htviope2016.csv') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=';')
    for row in spamreader:
        data.append(row)
print("File has been read, and it has ", len(data), " lines.")
That is all you need to read in the entire file. You don't need to – for some operations, it is sufficient to process one line at a time – but with the full data loaded and ready in memory, you can play around with it.
print (f'First row length: {len(data[0])}')
The number of cells per row. Note that this first row contains the header, and you probably don't have any use for it. Let's ditch it.
print ('Discarding 1st row NOW. Please wait.')
data.pop(0)
Done. A plain pop() removes the last item but you can also use an index. Alternatively, you could use the more pythonic (because "slicing") data = data[1:] but I assume this could involve copying and moving around large amounts of data.
print ('First 10 rows are ...')
for i in range(10):
    print('\t'.join(data[i]) + '(end)')
Look, there is data in memory! I pasted on the (end) because of the following:
print (f'First row, first cell contains "{data[0][0]}"')
print (f'First row, last cell contains "{data[0][-1]}"')
which shows
First row, first cell contains "2016-01-01 00:00:00"
First row, last cell contains ""
because each line ends with a ;. This empty 'cell' can trivially be removed during reading (ideally), or afterwards (as we still have it in memory):
data = [row[:-1] for row in data]
and then you get
First row, last cell contains "0"
and now you can use data[row][column] to address any cell that you want (in valid ranges only, of course).
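To answer the second question directly, here is a short sketch (using the data list built above, with the header already removed) that prints the first cell and the first cell of the last row without typing any cell locations:
# A short sketch using the data list built above (header already removed).
first_cell = data[0][0]             # first row, column A
last_row_first_cell = data[-1][0]   # last row, column A
print(f'First cell: "{first_cell}"')
print(f'Last row, first cell: "{last_row_first_cell}"')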
Disclaimer: this is my very first look at the csv module. Some operations could possibly be done more efficiently. Practically all examples verbatim from the official documentation, which proves it's always worth taking a look there first.

Removing unwanted characters in each line of a file then matching what is left to another file in Python

I would like to write a python script that addresses the following problem:
I have two tab-separated files. One has just a single column of a variety of words. The other file has one column that contains similar words, as well as columns of other information. However, within the first file, some lines contain multiple words, separated by " /// ". The other file has a similar problem, but the separator is " | ".
File #1
RED
BLUE /// GREEN
YELLOW /// PINK /// PURPLE
ORANGE
BROWN /// BLACK
File #2 (Which contains additional columns of other measurements)
RED|PINK
ORANGE
BROWN|BLACK|GREEN|PURPLE
YELLOW|MAGENTA
I want to parse through each file and match the words that are the same, and then append the columns of additional measurements too. But I want to ignore the /// in the first file, and the | in the second, so that each word will be compared to the other list on its own. The output file should have just one column of any words that appear in both lists, and then the appended additional information from file 2. Any help?
Addition info / update:
Here are 8 lines of File #1. I used color names above to keep it simple, but this is what the words really are; these are the "symbols":
ANKRD38
ANKRD57
ANKRD57
ANXA8 /// ANXA8L1 /// ANXA8L2
AOF1
AOF2
AP1GBP1
APOBEC3F /// APOBEC3G
Here is one line of file #2. What I need to do is take each symbol from file1 and see if it matches any one of the "synonyms" found in column 5 of file2 (here the synonyms are A1B|ABG|GAP|HYST2477). If any symbol from file1 matches ANY of the synonyms from column 5 of file2, then I need to append the additional information (the other columns in file2) onto the symbol in file1 and create one big output file.
9606 '\t' 1 '\t' A1BG '\t' - '\t' A1B|ABG|GAB|HYST2477'\t' HGNC:5|MIM:138670|Ensembl:ENSG00000121410|HPRD:00726 '\t' 19 '\t' 19q13.4'\t' alpha-1-B glycoprotein '\t' protein-coding '\t' A1BG'\t' alpha-1-B glycoprotein'\t' O '\t' alpha-1B-glycoprotein '\t' 20120726
File2 is 22,000 KB, file 1 is much smaller. I have thought of creating a dict much like has been suggested, but I keep getting held up with the different separators in each of the files. Thank you all for questions and help thus far.
EDIT
After your comments below, I think this is what you want to do. I've left the original post below in case anything in that was useful to you.
So, I think you want to do the following. Firstly, this code will read every separate synonym from file1 into a set - this is a useful structure because it will automatically remove any duplicates, and is very fast to look things up. It's like a dictionary but with only keys, no values. If you don't want to remove duplicates, we'll need to change things slightly.
file1_data = set()
with open("file1.txt", "r") as fd:
    for line in fd:
        file1_data.update(i.strip() for i in line.split("///") if i.strip())
Then you want to run through file2 looking for matches:
with open("file2.txt", "r") as in_fd:
with open("output.txt", "w") as out_fd:
for line in in_fd:
items = line.split("\t")
if len(items) < 5:
# This is so we don't crash if we find a line that's too short
continue
synonyms = set(i.strip() for i in items[4].split("|"))
overlap = synonyms & file1_data
if overlap:
# Build string of columns from file2, stripping out 5th column.
output_str = "\t".join(items[:4] + items[5:])
for item in overlap:
out_fd.write("\t".join((item, output_str)))
So what this does is open file2 and an output file. It goes through each line in file2, and first checks it has enough columns to at least have a column 5 - if not, it ignores that line (you might want to print an error).
Then it splits column 5 by | and builds a set from that list (called synonyms). The set is useful because we can find the intersection of this with the previous set of all the synonyms from file1 very fast - this intersection is stored in overlap.
What we do then is check if there was any overlap - if not, we ignore this line because no synonym was found in file1. This check is mostly for speed, so we don't bother building the output string if we're not going to use it for this line.
If there was an overlap, we build a string which is the full list of columns we're going to append to the synonym - we can build this as a string once even if there's multiple matches because it's the same for each match, because it all comes from the line in file2. This is faster than building it as a string each time.
Then, for each synonym that matched in file1, we write to the output a line which is the synonym, then a tab, then the rest of the line from file2. Because we split by tabs we have to put them back in with "\t".join(...). This is assuming I am correct you want to remove column 5 - if you do not want to remove it, then it's even easier because you can just use the line from file2 having stripped off the newline at the end.
Hopefully that's closer to what you need?
ORIGINAL POST
You don't give any indication of the size of the files, but I'm going to assume they're small enough to fit into memory - if not, your problem becomes slightly trickier.
So, the first step is probably to open file #2 and read in the data. You can do it with code something like this:
file2_data = {}
with open("file2.txt", "r") as fd:
    for line in fd:
        items = line.split("\t")
        file2_data[frozenset(i.strip() for i in items[0].split("|"))] = items[1:]
This will create file2_data as a dictionary which maps a word on to a list of the remaining items on that line. You also should consider whether words can repeat and how you wish to handle that, as I mentioned in my earlier comment.
After this, you can then read the first file and attach the data to each word in that file:
with open("file1.txt", "r") as fd:
with open("output.txt", "w") as fd_out:
for line in fd:
words = set(i.strip() for i in line.split("///"))
for file2_words, file2_cols in file2_data.iteritems():
overlap = file2_words & words
if overlap:
fd_out.write("///".join(overlap) + "\t" + "\t".join(file2_cols))
What you should end up with is each row in output.txt being one where the list of words in the two files had at least one word in common and the first item is the words in common separated by ///. The other columns in that output file will be the other columns from the matched row in file #2.
If that's not what you want, you'll need to be a little more specific.
As an aside, there are probably more efficient ways to do this than the O(N^2) approach I outlined above (i.e. it runs across one entire file as many times as there are rows in the other), but that requires more detailed information on how you want to match the lines.
For example, you could construct a dictionary mapping a word to a list of the rows in which that word occurs - this makes it a lot faster to check for matching rows than the complete scan performed above. This is rendered slightly fiddly by the fact you seem to want the overlaps between the rows, however, so I thought the simple approach outlined above would be sufficient without more specifics.
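As a rough sketch of that idea (following the original post's layout above, where the synonyms sit in the first column of file2; written for Python 3, with the same placeholder file names):
# A rough sketch of the word-to-rows index mentioned above.
word_to_rows = {}
with open("file2.txt", "r") as fd:
    for line in fd:
        items = line.rstrip("\n").split("\t")
        for word in (w.strip() for w in items[0].split("|")):
            # Map every individual synonym to the rest of its row.
            word_to_rows.setdefault(word, []).append(items[1:])

with open("file1.txt", "r") as fd, open("output.txt", "w") as fd_out:
    for line in fd:
        for word in (w.strip() for w in line.split("///")):
            for cols in word_to_rows.get(word, []):
                fd_out.write(word + "\t" + "\t".join(cols) + "\n")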
Look at http://docs.python.org/2/tutorial/inputoutput.html for file i/o
Loop through each line in each file
file1set = set(file1line.split(' /// '))
file2set = set(file2line.split('|'))
wordsineach = list(file1set & file2set)
split will create an array of the color names
set() turns it into a set so we can easily compare differences in each line
Loop over 'wordsineach' and write to your new file
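A minimal sketch of that loop, assuming the two files are compared line by line (pairing line 1 with line 1, and so on) and writing matches to a placeholder common_words.txt:
# A minimal sketch of the per-line comparison described above;
# 'common_words.txt' is a placeholder output name.
with open("file1.txt") as f1, open("file2.txt") as f2, \
     open("common_words.txt", "w") as out:
    for file1line, file2line in zip(f1, f2):
        file1set = set(file1line.strip().split(' /// '))
        file2set = set(file2line.strip().split('|'))
        wordsineach = list(file1set & file2set)
        for word in wordsineach:
            out.write(word + "\n")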
Use the str.replace function
with open('file1.txt', 'r') as f1:
    content1 = f1.read()
content1 = content1.replace(' /// ', '\n').split('\n')

with open('file2.txt', 'r') as f2:
    content2 = f2.read()
content2 = content2.replace('|', '\n').split('\n')
Then use a list comprehension
common_words = [i for i in content1 if i in content2]
However, if you already know that no words are repeated within each file, you can use set intersection to make life easier:
common_words = list(set(content1) & set(content2))
Then to output the remainder to another file:
common_words = [i + '\n' for i in common_words] #so that we print each word on a new line
with open('common_words.txt', 'w') as f:
f.writelines(common_words)
As to your 'additional information', I cannot help you unless you tell us how it is formatted, etc.

sorting large text data

I have a large file (100 million lines of tab separated values - about 1.5GB in size). What is the fastest known way to sort this based on one of the fields?
I have tried Hive. I would like to see if this can be done faster using Python.
Have you considered using the *nix sort program? In raw terms, it'll probably be faster than most Python scripts.
Use -t $'\t' to specify that it's tab-separated, -k n to specify the field, where n is the field number, and -o outputfile if you want to output the result to a new file.
Example:
sort -t $'\t' -k 4 -o sorted.txt input.txt
Will sort input.txt on its 4th field, and output the result to sorted.txt
You want to build an in-memory index for the file:
create an empty list
open the file
read it line by line using f.readline(), and store in the list a tuple consisting of the value on which you want to sort (extracted with line.split('\t')[sort_col].strip()) and the offset of the line in the file (which you can get by calling f.tell() before calling f.readline())
close the file
sort the list
Then to print the sorted file, reopen the file and for each element of your list, use f.seek(offset) to move the file pointer to the beginning of the line, f.readline() to read the line and print the line.
Optimization: you may want to store the length of the line in the list, so that you can use f.read(length) in the printing phase.
Sample code (optimized for readability, not speed):
def build_index(filename, sort_col):
    index = []
    f = open(filename)
    while True:
        offset = f.tell()
        line = f.readline()
        if not line:
            break
        length = len(line)
        col = line.split('\t')[sort_col].strip()
        index.append((col, offset, length))
    f.close()
    index.sort()
    return index

def print_sorted(filename, col_sort):
    index = build_index(filename, col_sort)
    f = open(filename)
    for col, offset, length in index:
        f.seek(offset)
        print f.read(length).rstrip('\n')

if __name__ == '__main__':
    filename = 'somefile.txt'
    sort_col = 2
    print_sorted(filename, sort_col)
Split up into files that can be sorted in memory. Sort each file in memory. Then merge the resulting files.
Merge by reading a portion of each of the files to be merged - the same amount from each file, leaving enough space in memory for the merged result. Save each merged portion as it is produced, repeatedly appending blocks of merged data onto the output file.
This minimises file I/O and seeking around the file on disk.
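As a rough sketch of that split-sort-merge approach (the chunk size, sort column, and file names are illustrative assumptions), heapq.merge can do the k-way merge lazily over the already-sorted chunks:
import heapq
import itertools
import tempfile

# A rough sketch of an external merge sort; the chunk size, sort column,
# and file names ('input.tsv', 'sorted.tsv') are illustrative assumptions.
CHUNK_LINES = 1000000
SORT_COL = 3

def sort_key(line):
    return line.split('\t')[SORT_COL]

chunk_files = []
with open('input.tsv') as f:
    while True:
        chunk = list(itertools.islice(f, CHUNK_LINES))
        if not chunk:
            break
        chunk.sort(key=sort_key)              # sort each chunk in memory
        tmp = tempfile.TemporaryFile(mode='w+')
        tmp.writelines(chunk)
        tmp.seek(0)
        chunk_files.append(tmp)

# heapq.merge lazily merges the already-sorted chunk files.
with open('sorted.tsv', 'w') as out:
    out.writelines(heapq.merge(*chunk_files, key=sort_key))

for tmp in chunk_files:
    tmp.close()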
I would store the file in a good relational database, index it on the field you are interested in, and then read the ordered items.
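For instance, a sketch of that approach using the standard-library sqlite3 module (the table layout, column names, and file names are made up for illustration):
import csv
import sqlite3

# A sketch of the database approach with sqlite3; the table layout and
# file names ('data.db', 'input.tsv') are illustrative assumptions.
conn = sqlite3.connect('data.db')
conn.execute('CREATE TABLE IF NOT EXISTS rows (c1 TEXT, c2 TEXT, c3 TEXT, c4 TEXT, c5 TEXT)')

with open('input.tsv', newline='') as f:
    reader = csv.reader(f, delimiter='\t')
    conn.executemany('INSERT INTO rows VALUES (?, ?, ?, ?, ?)', reader)
conn.commit()

# Index on the sort field, then read the rows back in order.
conn.execute('CREATE INDEX IF NOT EXISTS idx_c4 ON rows (c4)')
for row in conn.execute('SELECT * FROM rows ORDER BY c4'):
    print('\t'.join(row))
conn.close()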
