I was wondering if anyone could help me come up with a better way of doing this.
Basically, I have text files that are formatted like this (some have more columns, some have fewer; each column is separated by spaces):
AA BB CC DD Col1 Col2 Col3
XX XX XX Total 1234 1234 1234
Aaaa OO0 LAHB TEXT 111 41 99
Aaaa OO0 BLAH XETT 112 35 176
Aaaa OO0 BALH TXET 131 52 133
Aaaa OO0 HALB EXTT 144 32 193
These text files range in size from a few hundred KB to around 100MB for the newest and largest files. What I need to do is combine two or more files, checking for duplicate data first: if AA, BB, CC and DD from a row match a row from another file, I append that row's Col1, Col2, Col3 (etc.) values onto the existing row; if not, I fill the new columns with zeros. Then I calculate the top 100 rows based on each row's total and output those top 100 results to a webpage.
Here is the Python code I'm using:
import operator

def combine(dataFolder, getName, sort):
    files = getName.split(",")
    longestHeader = 0
    longestHeaderFile = []
    dataHeaders = []
    dataHeaderCode = []
    fileNumber = 1
    combinedFile = {}
    for fileName in files:
        lines = []
        file = dataFolder+"/tableFile/"+fileName+".txt"
        with open(file) as f:
            x = 0
            for line in f:
                lines.append(line.upper().split())
                if x == 1:
                    break
        splitLine = lines[1].index("TOTAL")+1
        dataHeaders.extend(lines[0][splitLine:])
        headerNumber = 1
        for name in lines[0][splitLine:]:
            dataHeaderCode.append(str(fileNumber)+"_"+str(headerNumber))
            headerNumber += 1
        if splitLine > longestHeader:
            longestHeader = splitLine
            longestHeaderFile = lines[0][:splitLine]
        fileNumber += 1
    for fileName in files:
        lines = []
        file = dataFolder+"/tableFile/"+fileName+".txt"
        with open(file) as f:
            for line in f:
                lines.append(line.upper().split())
        splitLine = lines[1].index("TOTAL")+1
        headers = lines[0][:splitLine]
        data = lines[0][splitLine:]
        for x in range(2, len(lines)):
            normalizedLine = {}
            lineName = ""
            total = 0
            for header in longestHeaderFile:
                try:
                    if header == "EE" or header == "DD":
                        index = splitLine-1
                    else:
                        index = headers.index(header)
                    normalizedLine[header] = lines[x][index]
                except ValueError:
                    normalizedLine[header] = "XX"
                lineName += normalizedLine[header]
            combinedFile[lineName] = normalizedLine
            for header in dataHeaders:
                headIndex = dataHeaders.index(header)
                name = dataHeaderCode[headIndex]
                try:
                    index = splitLine+data.index(header)
                    value = int(lines[x][index])
                except ValueError:
                    value = 0
                except IndexError:
                    value = 0
                try:
                    value = combinedFile[lineName][header]
                    combinedFile[lineName][name] = int(value)
                except KeyError:
                    combinedFile[lineName][name] = int(value)
                total += int(value)
            combinedFile[lineName]["TOTAL"] = total
    combined = sorted(combinedFile.values(), key=operator.itemgetter(sort), reverse=True)
    return combined
I'm pretty new to Python, so this may not be the most "Pythonic" way of doing it. Anyway, this works, but it's slow (about 12 seconds for two files of about 6MB each), and when we uploaded the code to our AWS server we got a 500 error saying the headers were too large (when we tried to combine larger files). Can anyone help me refine this into something a bit quicker and more suited to a web environment? Also, just to clarify, I don't have access to the AWS server or its settings; that goes through our Lead Developer, so I have no actual clue how it's set up. I do most of my dev work through localhost and then commit to GitHub.
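For what it's worth, here is a minimal sketch of how the combine step described above could be expressed with pandas instead of hand-rolled dictionaries. It assumes whitespace-delimited files with a single header row and the same key columns in every file; the key names AA/BB/CC/DD, the function name, and the column-suffix scheme are illustrative, not the project's real ones:

import pandas as pd
from functools import reduce

def combine_files(paths, key_cols=("AA", "BB", "CC", "DD"), top_n=100):
    frames = []
    for i, path in enumerate(paths, start=1):
        df = pd.read_csv(path, sep=r"\s+")
        # Suffix the data columns so the same header from different files stays distinct.
        data_cols = [c for c in df.columns if c not in key_cols]
        frames.append(df.rename(columns={c: "{}_{}".format(i, c) for c in data_cols}))
    # Outer-join on the key columns; rows missing from a file get zeros in its columns.
    merged = reduce(lambda a, b: a.merge(b, on=list(key_cols), how="outer"), frames)
    merged = merged.fillna(0)
    merged["TOTAL"] = merged.drop(columns=list(key_cols)).sum(axis=1)
    return merged.nlargest(top_n, "TOTAL")

The outer merge plus fillna(0) mirrors the "append or fill with zeros" rule, and nlargest replaces the manual sort of the whole combined table.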
Related
I'm fairly new to programming and Python in general. I've a big CSV file that I need to split into multiple CSV files based on the target values of the target column (last column).
Here's a simplified version of the CSV file data that I want to split.
1254.00 1364.00 4562.33 4595.32 1
1235.45 1765.22 4563.45 4862.54 1
6235.23 4563.00 7832.31 5320.36 1
8623.75 5632.09 4586.25 9361.86 0
5659.92 5278.21 8632.02 4567.92 0
4965.25 1983.78 4326.50 7901.10 1
7453.12 4993.20 4573.30 8632.08 1
8963.51 7496.56 4219.36 7456.46 1
9632.23 7591.63 8612.37 4591.00 1
7632.08 4563.85 4632.09 6321.27 0
4693.12 7621.93 5201.37 7693.48 0
6351.96 7216.35 795.52 4109.05 0
I want to split it so that the data is extracted into different CSV files, like below:
sample1.csv
1254.00 1364.00 4562.33 4595.32 1
1235.45 1765.22 4563.45 4862.54 1
6235.23 4563.00 7832.31 5320.36 1
sample2.csv
8623.75 5632.09 4586.25 9361.86 0
5659.92 5278.21 8632.02 4567.92 0
sample3.csv
4965.25 1983.78 4326.50 7901.10 1
7453.12 4993.20 4573.30 8632.08 1
8963.51 7496.56 4219.36 7456.46 1
9632.23 7591.63 8612.37 4591.00 1
sample4.csv
7632.08 4563.85 4632.09 6321.27 0
4693.12 7621.93 5201.37 7693.48 0
6351.96 7216.35 795.52 4109.05 0
I tried pandas and some groupby functions, but they merge all the 1s and 0s into just two separate files, one containing all rows with 1 and the other all rows with 0, which is not the output I need.
Any help would be appreciated.
What you can do is get the value of the last column in each row. If the value is the same as in the previous row, add that row to the current list; if it's not, write the current list out and start a new list with that row. A simple list of lines per chunk is enough as the data structure.
Assume the file 'input.csv' contains the original data.
1254.00 1364.00 4562.33 4595.32 1
1235.45 1765.22 4563.45 4862.54 1
6235.23 4563.00 7832.31 5320.36 1
8623.75 5632.09 4586.25 9361.86 0
5659.92 5278.21 8632.02 4567.92 0
4965.25 1983.78 4326.50 7901.10 1
7453.12 4993.20 4573.30 8632.08 1
8963.51 7496.56 4219.36 7456.46 1
9632.23 7591.63 8612.37 4591.00 1
7632.08 4563.85 4632.09 6321.27 0
4693.12 7621.93 5201.37 7693.48 0
6351.96 7216.35 795.52 4109.05 0
The code is below:
target = None
counter = 0
with open('input.csv', 'r') as file_in:
    lines = file_in.readlines()
    tmp = []
    for idx, line in enumerate(lines):
        _target = line.split(' ')[-1].strip()
        if idx == 0:
            tmp.append(line)
            target = _target
            continue
        else:
            last_line = idx + 1 == len(lines)
            if _target != target or last_line:
                if last_line:
                    tmp.append(line)
                counter += 1
                with open('sample{}.csv'.format(counter), 'w') as file_out:
                    file_out.writelines(tmp)
                tmp = [line]
            else:
                tmp.append(line)
        target = _target
Perhaps you want something like this:
from itertools import groupby
from operator import itemgetter

sep = ' '

with open('data.csv') as f:
    data = f.read()

split_data = [row.split(sep) for row in data.split('\n')]
gb = groupby(split_data, key=itemgetter(4))

for index, (key, group) in enumerate(gb):
    with open('sample{}.csv'.format(index), 'w') as f:
        write_data = '\n'.join(sep.join(cell) for cell in group)
        f.write(write_data)
Unlike pd.groupby, itertools.groupby doesn't sort the source beforehand. This parses the input CSV into a list of lists and performs a groupby on the outer list based on the 5th column, which contains the target. The groupby object is an iterator over the groups; by writing each group to a different file, the result you want can be achieved.
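A small aside to illustrate that point: itertools.groupby only groups consecutive equal keys, so a target value that reappears later starts a new group rather than being merged into the earlier one:

from itertools import groupby

targets = ['1', '1', '1', '0', '0', '1', '1', '1', '1', '0', '0', '0']
runs = [(key, len(list(group))) for key, group in groupby(targets)]
print(runs)  # [('1', 3), ('0', 2), ('1', 4), ('0', 3)] -- four groups, hence four files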
I propose using a function to do what was asked.
We could leave the file objects we open for writing unreferenced, so that they are closed automatically when garbage collected, but here I prefer to explicitly close every output file before opening another one.
The script is heavily commented, so no further explanation is needed:
def split_data(data_fname, key_len=1, basename='file%03d.txt'):
    data = open(data_fname)
    current_output = None  # because we have not yet opened an output file
    prev_key = int(1)      # because a string is always different from an int
    count = 0              # because we want to count the output files
    for line in data:
        # line has a trailing newline, so to extract the key
        # we have to take that into account
        key = line[-key_len-1:-1]
        if key != prev_key:  # key has changed!
            count += 1       # a new file is going to be opened
            prev_key = key   # remember the new key
            if current_output:  # if a file was opened, close it
                current_output.close()
            # open a new output file, its name derived from the variable count
            current_output = open(basename % count, 'w')
        # now we can write to the output file
        current_output.write(line)
        # note that line is already newline terminated
    # clean up what is still open
    current_output.close()
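A hypothetical call, assuming the question's data lives in input.csv and the key is the single last character of each line:

# Writes sample1.csv, sample2.csv, ... one file per run of equal trailing keys.
split_data('input.csv', key_len=1, basename='sample%d.csv')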
This answer has a history.
I have a function that iterates over the rows of a csv for the Age column and if an age is negative, it will print the Key and the Age value to a text file.
import pandas as pd

def neg_check():
    results = []
    file_path = input('Enter file path: ')
    file_data = pd.read_csv(file_path, encoding='utf-8')
    for index, row in file_data.iterrows():
        if row['Age'] < 0:
            results.append((row['Key'], row['Age']))
    with open('results.txt', 'w') as outfile:
        outfile.write("\n".join(map(str, results)))
In order to make this code reusable, how can I modify it so it will iterate over the rows if the column starts with "Age"? My files have many columns that start with "Age" but end differently. I tried the following...
if row.startswith['Age'] < 0:
and
if row[row.startswith('Age')] < 0:
but both throw AttributeError: 'Series' object has no attribute 'startswith'.
My csv files:
sample 1
Key Sex Age
1 Male 46
2 Female 34
sample 2
Key Sex AgeLast
1 Male 46
2 Female 34
sample 3
Key Sex AgeFirst
1 Male 46
2 Female 34
I would do this in one step, but there are a few options. One is filter:
v = df[df.filter(like='Age').iloc[:, 0] < 0]
Or,
c = df.columns[df.columns.str.startswith('Age')][0]
v = df[df[c] < 0]
Finally, to write to CSV, use
if not v.empty:
v.to_csv('invalid.csv')
Looping over your data is not necessary with pandas.
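To tie this back to the question's neg_check, here is a rough sketch of a loop-free version using the startswith idea above; the column and file names mirror the question, so treat it as an assumption-laden example rather than a drop-in replacement:

import pandas as pd

def neg_check():
    file_path = input('Enter file path: ')
    df = pd.read_csv(file_path, encoding='utf-8')
    # Pick the first column whose name starts with 'Age' (Age, AgeLast, AgeFirst, ...).
    age_col = df.columns[df.columns.str.startswith('Age')][0]
    bad = df.loc[df[age_col] < 0, ['Key', age_col]]
    if not bad.empty:
        bad.to_csv('results.txt', index=False, header=False)

This writes the offending Key/Age pairs in one shot instead of iterating row by row.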
I have a very big file (1.5 billion lines) in the following form:
1 67108547 67109226 gene1$transcript1 0 + 1 0
1 67108547 67109226 gene1$transcript1 0 + 2 1
1 67108547 67109226 gene1$transcript1 0 + 3 3
1 67108547 67109226 gene1$transcript1 0 + 4 4
.
.
.
1 33547109 33557650 gene2$transcript1 0 + 239 2
1 33547109 33557650 gene2$transcript1 0 + 240 0
.
.
.
1 69109226 69109999 gene1$transcript1 0 + 351 1
1 69109226 69109999 gene1$transcript1 0 + 352 0
What I want to do is to reorganize/sort this file based on the identifier in column 4. The file consists of blocks. If you concatenate columns 4, 1, 2 and 3 you create the unique identifier for each block. This is the key for the dictionary all_exons, and the value is a numpy array containing all the values of column 8. Then I have a second dictionary, unique_identifiers, that has as key the attributes from column 4 and as value a list of the corresponding block identifiers. As output I write a file in the following form:
>gene1
0
1
3
4
1
0
>gene2
2
0
I already wrote some code (see below) that does this, but my implementation is very slow. It takes around 18 hours to run.
import os
import sys
import time
from contextlib import contextmanager
import pandas as pd
import numpy as np

def parse_blocks(bedtools_file):
    unique_identifiers = {}  # Dictionary with key: gene, value: list of exons
    all_exons = {}           # Dictionary containing all exons
    # Parse file and ...
    with open(bedtools_file) as fp:
        sp_line = []
        for line in fp:
            sp_line = line.strip().split("\t")
            current_id = sp_line[3].split("$")[0]
            identifier = "$".join([sp_line[3], sp_line[0], sp_line[1], sp_line[2]])
            if identifier in all_exons:
                item = float(sp_line[7])
                all_exons[identifier] = np.append(all_exons[identifier], item)
            else:
                all_exons[identifier] = np.array([sp_line[7]], float)
            if current_id in unique_identifiers:
                unique_identifiers[current_id].add(identifier)
            else:
                unique_identifiers[current_id] = set([identifier])
    return unique_identifiers, all_exons

identifiers, introns = parse_blocks(options.bed)
w = open(options.out, 'w')
for gene in sorted(list(identifiers)):
    w.write(">" + str(gene) + "\n")
    for intron in sorted(list(identifiers[gene])):
        for base in introns[intron]:
            w.write(str(base) + "\n")
w.close()
How can I improve the above code so that it runs faster?
Since you already import pandas, I provide a pandas solution which basically requires only two lines of code.
However, I do not know how it performs on large data sets and whether that is faster than your approach (but I am pretty sure it is).
In the example below, the data you provide is stored in table.txt. I then use groupby to get all the values in your 8th column, store them in a list for the respective identifier in your column 4 (note that my indices start at 0) and convert this data structure into a dictionary which can then be printed easily.
import pandas as pd
df = pd.read_csv("table.txt", header=None, sep=r"\s+")  # replace the separator by e.g. '\t'
op = dict(df.groupby(3)[7].apply(lambda x: x.tolist()))
So in this case op looks like this:
{'gene1$transcript1': [0, 1, 3, 4, 1, 0], 'gene2$transcript1': [2, 0]}
Now you could print the output like this and pipe it into a file:
for k, v in op.items():
    print(k.split('$')[0])
    for val in v:
        print(val)
This gives you the desired output:
gene1
0
1
3
4
1
0
gene2
2
0
Maybe you can give it a try and let me know how it compares to your solution!?
Edit 2:
In the comments you mentioned that you would like to print the genes in the correct order. You can do this as follows:
# add some fake genes to op from above
op['gene0$stuff'] = [7, 9]
op['gene4$stuff'] = [5, 9]

# print using 'sorted'
for k, v in sorted(op.items()):
    print(k.split('$')[0])
    for val in v:
        print(val)
which gives you:
gene0
7
9
gene1
0
1
3
4
1
0
gene2
2
0
gene4
5
9
Edit 1:
I am not sure whether duplicates are intended but you could easily get rid of them by doing the following:
op2 = dict(df.groupby(3)[7].apply(lambda x: set(x)))
Now op2 would look like this:
{'gene1$transcript1': {0, 1, 3, 4}, 'gene2$transcript1': {0, 2}}
You print the output as before:
for k, v in op2.items():
    print(k.split('$')[0])
    for val in v:
        print(val)
which gives you
gene1
0
1
3
4
gene2
0
2
I'll try to simplify your problem; my solution works like this:
First, scan over the big file. For every different current_id, open a temporary file and append the value of column 8 to that file.
After the full scan, concatenate all the chunks into a result file.
Here's the code:
# -*- coding: utf-8 -*-
import os
import tempfile
import subprocess
class ChunkBoss(object):
"""Boss for file chunks"""
def __init__(self):
self.opened_files = {}
def write_chunk(self, current_id, value):
if current_id not in self.opened_files:
self.opened_files[current_id] = open(tempfile.mktemp(), 'wb')
self.opened_files[current_id].write('>%s\n' % current_id)
self.opened_files[current_id].write('%s\n' % value)
def cat_result(self, filename):
"""Catenate chunks to one big file
"""
# Sort the chunks
chunk_file_list = []
for current_id in sorted(self.opened_files.keys()):
chunk_file_list.append(self.opened_files[current_id].name)
# Flush chunks
[chunk.flush() for chunk in self.opened_files.values()]
# By calling cat command
with open(filename, 'wb') as fp:
subprocess.call(['cat', ] + chunk_file_list, stdout=fp, stderr=fp)
def clean_up(self):
[os.unlink(chunk.name) for chunk in self.opened_files.values()]
def main():
boss = ChunkBoss()
with open('bigfile.data') as fp:
for line in fp:
data = line.strip().split()
current_id = data[3].split("$")[0]
value = data[7]
# Write value to temp chunk
boss.write_chunk(current_id, value)
boss.cat_result('result.txt')
boss.clean_up()
if __name__ == '__main__':
main()
I tested the performance of my script, with bigfile.data containing about 150k lines. It took about 0.5s to finish on my laptop. Maybe you can give it a try.
I need help separating the CSV into lists. Here are the input file and the output file that I need.
I have a CSV file which looks like this (line by line):
1-6
97
153,315,341,535
15,~1510,~1533,~1534,~1535,~1590
I need my output to be:
Col 1 Col 2
1 ~1510
2 ~1533
3 ~1534
4 ~1535
5 ~1590
6
97
153
315
341
535
15
Meaning: when I detect a "-" sign, for example 1-6, it should expand to the numbers 1 through 6, and the numbers with and without "~" should be separated into 2 different columns.
However, the result I get with my code is as below:
Col1 Col2 Col3 Col4 Col5 Col6
6-Jan
97
153 315 341 535
15 ~1510 ~1533 ~1534 ~1535 ~1590
my code:
import csv

with open('testing.csv') as f, open("testing1.csv", "w") as outfile:
    writer = csv.writer(outfile)
    f.readline()  # these are headings, should remove them
    csv_reader = csv.reader(f, delimiter=",")
    for line_list in csv_reader:
        skills_list = [line_list[0].split(',')]
        for skill in skills_list:
            writer.writerow(skill)
Please help. Thanks a lot.
This is how I would do it: read all the data first and construct your columns, then iterate over the columns and build your CSV.
Here is the code for building the columns.
import csv

fin = open('testing.csv', 'r')
column_1 = []
column_2 = []
for line in fin:
    items = line.split(',')
    for item in items:
        if '-' in item:
            num_range = item.split('-')
            column_1 += range(int(num_range[0]), int(num_range[1])+1)
        elif '~' in item:
            column_2.append(item.strip())
        else:
            column_1.append(item.strip())
fin.close()
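The code above stops at building column_1 and column_2; a minimal sketch of the writing step it describes, assuming zip_longest to pad the shorter column with blanks (the output name testing1.csv is taken from the question):

import csv
from itertools import zip_longest

with open('testing1.csv', 'w', newline='') as fout:
    writer = csv.writer(fout)
    writer.writerow(['Col 1', 'Col 2'])
    for left, right in zip_longest(column_1, column_2, fillvalue=''):
        writer.writerow([left, right])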
You cannot write output until you have read the required input. So the first output line can be written when you have obtained the input ~1510.
The simplest solution is to read the entire input file into memory, then write. I would maintain two lists, then push to the first if no tilde, otherwise to the other one. For output, then, simply loop over these lists, supplying empty values if one of them runs out.
If you need to optimize memory usage (e.g. if there is more input than you can fit into memory), maybe write each line as soon as you can and free up its memory; but this is more challenging to get right.
import itertools as it

results = {
    'col1': [],
    'col2': [],
}

with open('data.txt') as f:
    for line in f:
        line = line.rstrip()
        entries = line.split(",")
        for entry in entries:
            if entry.startswith('~'):
                column = 'col2'
                entry = entry[1:]
            else:
                column = 'col1'
            if '-' in entry:
                start, end = entry.split('-')
                results[column].extend(
                    list(range(int(start), int(end)+1))
                )
            else:
                results[column].append(entry)

print("{} \t {}".format('Col 1', 'Col 2'))

column_iter = it.zip_longest(
    results['col1'],
    ["~{}".format(num) for num in results['col2']],
    fillvalue=''
)

for col1_num, col2_num in column_iter:
    print(
        "{} \t {}".format(col1_num, col2_num)
    )
--output:--
Col 1 Col 2
1 ~1510
2 ~1533
3 ~1534
4 ~1535
5 ~1590
6
97
153
315
341
535
15
And with this data.txt:
1-6
~7-10,97
153,315,341,535
15,~1510,~1533,~1534,~1535,~1590
output:
Col 1 Col 2
1 ~7
2 ~8
3 ~9
4 ~10
5 ~1510
6 ~1533
97 ~1534
153 ~1535
315 ~1590
341
535
15
I have a number set which contains 2375013 unique numbers in a txt file. The data structure looks like this:
11009
900221
2
3
4930568
293
102
I want to match a number in each line from another data set against this number set, to extract the data I need. So I coded it like this:
def get_US_users_IDs(filepath, mode):
    IDs = []
    with open(filepath, mode) as f:
        for line in f:
            sp = line.strip()
            for id in sp:
                IDs.append(id.lower())
    return IDs

# ... later in the script ...
IDs = "|".join(get_US_users_IDs('/nas/USAuserlist.txt', 'r'))
matcher = re.compile(IDs)
if matcher.match(user_id):
    number_of_US_user += 1
    text = tweet.split('\t')[3]
But it takes a lot of time to run. Are there any ideas to reduce the run time?
What I understood is that you have a huge number of ids in a file and you want to know if a specific user_id is in this file.
You can use a python set.
fd = open(filepath, mode)
IDs = set(int(id) for id in fd)
...
if user_id in IDs:
    number_of_US_user += 1
...
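A small self-contained sketch of the same idea; it keeps the ids as strings so the membership test matches string user_ids (the incoming user_ids below are made up):

def load_ids(filepath):
    # One id per line; strip whitespace and keep them as strings.
    with open(filepath) as f:
        return {line.strip() for line in f}

IDs = load_ids('/nas/USAuserlist.txt')   # path from the question

number_of_US_user = 0
for user_id in ('11009', '42', '4930568'):   # hypothetical incoming ids
    if user_id in IDs:
        number_of_US_user += 1
print(number_of_US_user)   # 2 for the sample id file shown above

Set membership is an O(1) hash lookup per id, which avoids compiling and matching one enormous alternation regex.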