Splitting a CSV file into multiple CSVs by target column values - python

I'm fairly new to programming and Python in general. I have a big CSV file that I need to split into multiple CSV files based on the values of the target column (the last column).
Here's a simplified version of the CSV file data that I want to split.
1254.00 1364.00 4562.33 4595.32 1
1235.45 1765.22 4563.45 4862.54 1
6235.23 4563.00 7832.31 5320.36 1
8623.75 5632.09 4586.25 9361.86 0
5659.92 5278.21 8632.02 4567.92 0
4965.25 1983.78 4326.50 7901.10 1
7453.12 4993.20 4573.30 8632.08 1
8963.51 7496.56 4219.36 7456.46 1
9632.23 7591.63 8612.37 4591.00 1
7632.08 4563.85 4632.09 6321.27 0
4693.12 7621.93 5201.37 7693.48 0
6351.96 7216.35 795.52 4109.05 0
I want to split the data so that each consecutive run of identical target values goes to its own CSV file, like below:
sample1.csv
1254.00 1364.00 4562.33 4595.32 1
1235.45 1765.22 4563.45 4862.54 1
6235.23 4563.00 7832.31 5320.36 1
sample2.csv
8623.75 5632.09 4586.25 9361.86 0
5659.92 5278.21 8632.02 4567.92 0
sample3.csv
4965.25 1983.78 4326.50 7901.10 1
7453.12 4993.20 4573.30 8632.08 1
8963.51 7496.56 4219.36 7456.46 1
9632.23 7591.63 8612.37 4591.00 1
sample4.csv
7632.08 4563.85 4632.09 6321.27 0
4693.12 7621.93 5201.37 7693.48 0
6351.96 7216.35 795.52 4109.05 0
I tried pandas with some groupby functions, but it merges all the 1s and 0s together into just two files, one containing every row with 1 and the other every row with 0, which is not the output I need.
Any help would be appreciated.

What you can do is read the value of the last column in each row. If it is the same as the value in the previous row, add the row to the current list; if it is not, start a new list and add the row there. A list of lists works well as the data structure.

Assume the file 'input.csv' contains the original data.
1254.00 1364.00 4562.33 4595.32 1
1235.45 1765.22 4563.45 4862.54 1
6235.23 4563.00 7832.31 5320.36 1
8623.75 5632.09 4586.25 9361.86 0
5659.92 5278.21 8632.02 4567.92 0
4965.25 1983.78 4326.50 7901.10 1
7453.12 4993.20 4573.30 8632.08 1
8963.51 7496.56 4219.36 7456.46 1
9632.23 7591.63 8612.37 4591.00 1
7632.08 4563.85 4632.09 6321.27 0
4693.12 7621.93 5201.37 7693.48 0
6351.96 7216.35 795.52 4109.05 0
The code below collects consecutive rows that share the same target value and writes each run to its own sample file:
target = None
counter = 0
tmp = []

with open('input.csv', 'r') as file_in:
    lines = file_in.readlines()

for line in lines:
    # the target is the last whitespace-separated field on the line
    _target = line.split(' ')[-1].strip()
    if target is None:
        target = _target
    elif _target != target:
        # target changed: write the rows collected so far to a new file
        counter += 1
        with open('sample{}.csv'.format(counter), 'w') as file_out:
            file_out.writelines(tmp)
        tmp = []
        target = _target
    tmp.append(line)

# write the final group
if tmp:
    counter += 1
    with open('sample{}.csv'.format(counter), 'w') as file_out:
        file_out.writelines(tmp)

Perhaps you want something like this:
from itertools import groupby
from operator import itemgetter

sep = ' '
with open('data.csv') as f:
    data = f.read()

# splitlines() avoids a trailing empty row if the file ends with a newline
split_data = [row.split(sep) for row in data.splitlines()]
gb = groupby(split_data, key=itemgetter(4))

for index, (key, group) in enumerate(gb, start=1):
    with open('sample{}.csv'.format(index), 'w') as f:
        write_data = '\n'.join(sep.join(row) for row in group)
        f.write(write_data)
Unlike pandas' groupby, itertools.groupby does not sort its input first, so it groups only consecutive rows that share the same key. This parses the input CSV into a list of lists and groups the outer list on the 5th column (index 4), which holds the target. The groupby object is an iterator over the groups; writing each group to a different file gives the result you want.
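If you would rather stay with pandas, a common trick (a sketch, reusing the data.csv name from above and assuming whitespace-separated data with no header, so the target ends up in column 4) is to group on the cumulative count of changes in the target column rather than on its value:
import pandas as pd

df = pd.read_csv('data.csv', sep=r'\s+', header=None)

# a new group starts every time the value in column 4 differs from the row above
group_id = (df[4] != df[4].shift()).cumsum()

for i, (_, chunk) in enumerate(df.groupby(group_id), start=1):
    chunk.to_csv('sample{}.csv'.format(i), sep=' ', header=False, index=False)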

I propose using a function to do what was asked for.
We could leave the file objects we open for writing unreferenced, so that they are closed automatically when garbage collected, but here I prefer to explicitly close every output file before opening another one.
The script is heavily commented, so no further explanation is needed:
def split_data(data_fname, key_len=1, basename='file%03d.txt'):
    data = open(data_fname)
    current_output = None  # because we have not yet opened an output file
    prev_key = int(1)      # because a string is always different from an int
    count = 0              # because we want to count the output files
    for line in data:
        # line has a trailing newline, so to extract the key
        # we have to take that into account
        key = line[-key_len-1:-1]
        if key != prev_key:      # key has changed!
            count += 1           # a new file is going to be opened
            prev_key = key       # remember the new key
            if current_output:   # if a file was opened, close it
                current_output.close()
            # open a new output file, its name derived from the variable count
            current_output = open(basename % count, 'w')
        # now we can write to the output file
        current_output.write(line)
        # note that line is already newline terminated
    # clean up what is still open
    if current_output:
        current_output.close()
    data.close()
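A possible usage example matching the question's naming scheme (a sketch; the sample%d.csv basename is an assumption about the desired output names):
split_data('input.csv', key_len=1, basename='sample%d.csv')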
This answer has a history.

Related

Python & Pandas: appending data to new column

With Python and Pandas, I'm writing a script that passes text data from a csv through the pylanguagetool library to calculate the number of grammatical errors in a text. The script successfully runs, but appends the data to the end of the csv instead of to a new column.
The structure of the csv is:
The working code is:
import pandas as pd
from pylanguagetool import api
df = pd.read_csv("Streamlit\stack.csv")
text_data = df["text"].fillna('')
length1 = len(text_data)
for i, x in enumerate(range(length1)):
    # this is the pylanguagetool operation
    errors = api.check(text_data, api_url='https://languagetool.org/api/v2/', lang='en-US')
    result = str(errors)
    # this pulls the error count "message" from the pylanguagetool json
    error_count = result.count("message")
    output_df = pd.DataFrame({"error_count": [error_count]})
    output_df.to_csv("Streamlit\stack.csv", mode="a", header=(i == 0), index=False)
The output is:
Expected output:
What changes are necessary to append the output like this?
Instead of using a loop, you might consider apply with a lambda, which accomplishes what you want in one line:
df["error_count"] = df["text"].fillna("").apply(lambda x: len(api.check(x, api_url='https://languagetool.org/api/v2/', lang='en-US')["matches"]))
>>> df
user_id ... error_count
0 10 ... 2
1 11 ... 0
2 12 ... 0
3 13 ... 0
4 14 ... 0
5 15 ... 2
Edit:
You can write the above to a .csv file with:
df.to_csv("Streamlit\stack.csv", index=False)
You don't want to use mode="a" as that opens the file in append mode whereas you want (the default) write mode.
My strategy would be to keep the error counts in a list, then create a separate column in the original DataFrame, and finally write that DataFrame to csv:
text_data = df["text"].fillna('')
error_count_lst = []
for text in text_data:
    errors = api.check(text, api_url='https://languagetool.org/api/v2/', lang='en-US')
    result = str(errors)
    error_count = result.count("message")
    error_count_lst.append(error_count)
df["error_count"] = error_count_lst
df.to_csv('file.csv', index=False)

Python Remove duplicates from csv if value in column duplicated

I am trying to write a CSV parser so that if the same name appears more than once in the name column, only the first line with that name is kept. For example:
['CSE_MAIN\\LC-CSEWS61', 'DEREGISTERED', '2018-04-18-192446'],
['CSE_MAIN\\IT-Laptop12', 'DEREGISTERED', '2018-03-28-144236'],
['CSE_MAIN\\LC-CSEWS61', 'DEREGISTERED', '2018-03-28-144236']]
The last line needs to be deleted because it has the same name as the first one.
What I wrote is:
file2 = str(sys.argv[2])
print("The first file is:" + file2)
reader2 = csv.reader(open(file2))
with open("result2.csv", 'wb') as result2:
    wtr2 = csv.writer(result2)
    for r in reader2:
        wtr2.writerow((r[0], r[6], r[9]))
newreader2 = csv.reader(open("result2.csv"))
sortedlist2 = sorted(newreader2, key=lambda col: col[2], reverse=True)
for i in range(len(sortedlist2)):
    for j in range(len(sortedlist2)-1):
        if (sortedlist2[i][0] == sortedlist2[j+1][0] and sortedlist2[i][1] != sortedlist2[j+1][1]):
            if (sortedlist2[i][1] > sortedlist2[j+1][1]):
                del sortedlist2[i][0-2]
            else:
                del sortedlist2[j+1][0-2]
Thanks.
Try with pandas:
import pandas as pd
df = pd.read_csv('path/name_file.csv', header=None)  # header=None so the columns are labelled 0, 1, 2
df = df.drop_duplicates(subset=[0])  # 0 is the column to compare on
df.to_csv('New_file.csv')  # save to csv
This drops every duplicate based on column 0 (the first column), keeping the first occurrence.
If you just need to delete a specific row, you can use the drop method.
# Your file after loading with pandas (print(df)):
0 1 2
0 CSE_MAIN\LC-CSEWS61 DEREGISTERED 2018-04-18-192446
1 CSE_MAIN\IT-Laptop12 DEREGISTERED 2018-03-28-144236
2 CSE_MAIN\LC-CSEWS61 DEREGISTERED 2018-03-28-144236
For example, to delete row 2:
df.drop(2, axis=0, inplace=True)  # axis=0 means rows; axis=1 would mean columns
Output:
0 1 2
0 CSE_MAIN\LC-CSEWS61 DEREGISTERED 2018-04-18-192446
1 CSE_MAIN\IT-Laptop12 DEREGISTERED 2018-03-28-144236
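If you would rather not use pandas, a minimal csv-module sketch (Python 3; the input/output file names here are placeholders) that keeps only the first occurrence of each name in the first column could look like this:
import csv

seen = set()
with open('input.csv', newline='') as src, open('result.csv', 'w', newline='') as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src):
        # write the row only the first time its name (first column) appears
        if row and row[0] not in seen:
            seen.add(row[0])
            writer.writerow(row)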

compare values in a dictionary and do stuff with each based on value

I have a CSV with a bunch of columns. I only want three of the columns.
I imported this into my Python script and turned the three columns into three lists.
Then I added them to a dictionary, with list 1 as the keys and the other two lists as the values (maybe there's a better way to do this?).
key is a transaction id
value1 is a filename
value2 is a date
In the end, what I want is this:
run through the dict and find all duplicate file names (there will be multiple sets of duplicates)
for each set of the duplicate filenames, find the one id(key) that has the latest(most recent) date value (if time and date are the same, then highest id(key))
print key of the latest date (all i need is the id)
for each of the other duplicates print "this is a duplicate"+ (key) (again just need the id of each)
I want to repeat that for all keys until I essentially get the ids (keys) of only the latest items in the list. There could be 5 duplicates of filename x, 100 for filename y, 30 for filename t, and so on.
I'm using an API to actually move data, which is why I need to get the latest and move that ID to "x" and all duplicates to "y" in this external system.
Here's what I have in terms of building the dict (assuming it's building in the correct order), but I don't really know where to go from here:
import csv

def readcsv(filename, column):
    file = open(filename, "rU")
    reader = csv.reader(file, delimiter=",")
    list = []
    for row in reader:
        list.append(row[column])
    file.close()
    return list

def makeDict(id, fileName, detDate):
    iList = {z[0]: list(z[1:]) for z in zip(id, fileName, detDate)}
    return iList

id = readcsv("jul.csv", 2)
fileName = readcsv("jul.csv", 1)
detDate = readcsv("jul.csv", 0)
mainDict = makeDict(id, fileName, detDate)
sample data (extracted the columns into a simpler sheet for testing)
Date fileURL ID
7/24/2018 16:04 https://localhost/file1.docx 2599302
7/24/2018 16:03 https://localhost/file3.docx 2349302
7/24/2018 16:01 https://localhost/file1.docx 2599302
7/24/2018 16:04 https://localhost/fil232.xml 2599303
7/24/2018 16:03 https://localhost/file1.docx 2349333
7/24/2018 16:01 https://localhost/file3.docx 2529374
UPDATE:
Using the answer from below, this is what I ended up with that made it work:
import csv

def readcsv(filename, column):
    file = open(filename, "rU")
    reader = csv.reader(file, delimiter=",")
    list = []
    for row in reader:
        list.append(row[column])
    file.close()
    return list

def makeDict(id, fileName, detDate):
    iList = {z[0]: list(z[1:]) for z in zip(id, fileName, detDate)}
    return iList
## Group keys by like file names ##
def groupKeys(mainDict):
    same_filename = {}
    for key, line in mainDict.items():
        name, date = line
        if name not in same_filename:
            same_filename[name] = [key]
        else:
            same_filename[name].append(key)
    return same_filename

########################################### Get latest ID ##################
def getLatestID(same_filename, mainDict):
    ## for each file
    for k in same_filename.keys():
        curDate = ""  # empty string compares lower than any date string
        curId = 0
        ## get each id value (aka matching ids holding the same file)
        for v in same_filename.get(k):
            moveDupeList.append(v)  ## add to a list of dupes
            ## if this id's date equals the highest found so far (date already set since it's the same)
            if mainDict.get(v)[1] == curDate:
                ## check which id is highest and keep the new high if found
                if v > curId:
                    curId = v
            ## else if this date is later than the latest found so far, remember the new date and id
            elif mainDict.get(v)[1] > curDate:
                curDate = mainDict.get(v)[1]
                curId = v
        if curId in moveDupeList:
            moveDupeList.remove(curId)    # remove the latest from the dupe list
            moveProperList.append(curId)  # add the latest to the proper list
########################################### Get latest ID ##################

moveDupeList = []
moveProperList = []
id = readcsv("jul.csv", 2)
fileName = readcsv("jul.csv", 1)
detDate = readcsv("jul.csv", 0)
mainDict = makeDict(id, fileName, detDate)
same_filename = groupKeys(mainDict)
getLatestID(same_filename, mainDict)
A starting point could be to build another dictionary giving, for each filename, the list of all corresponding keys (ids):
data = {2349302: ['7/24/2018 16:03', 'https://localhost/file3.docx'],
2349333: ['7/24/2018 16:03', 'https://localhost/file1.docx'],
2529374: ['7/24/2018 16:01', 'https://localhost/file3.docx'],
2599302: ['7/24/2018 16:01', 'https://localhost/file1.docx'],
2599303: ['7/24/2018 16:04', 'https://localhost/fil232.xml']}
similar_filename = {}
for key, line in data.items():
    date, name = line
    if name not in similar_filename:
        similar_filename[name] = [key]
    else:
        similar_filename[name].append(key)
similar_filename
>>> {'https://localhost/fil232.xml': [2599303],
'https://localhost/file1.docx': [2599302, 2349333],
'https://localhost/file3.docx': [2529374, 2349302]}
This is your first point.
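For the second point, one possible continuation (a sketch; it reuses the data and similar_filename structures above and assumes the dates parse with datetime.strptime in the month/day/year format shown) is to pick, for each filename, the key with the latest date, breaking ties on the key itself:
from datetime import datetime

def sort_key(key):
    date, _name = data[key]
    # sort by parsed date first, then by the key itself to break ties
    return (datetime.strptime(date, '%m/%d/%Y %H:%M'), key)

for name, keys in similar_filename.items():
    latest = max(keys, key=sort_key)
    print(latest)  # the id to keep
    for dup in keys:
        if dup != latest:
            print("this is a duplicate", dup)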

Improve python code in terms of speed

I have a very big file (1.5 billion lines) in the following form:
1 67108547 67109226 gene1$transcript1 0 + 1 0
1 67108547 67109226 gene1$transcript1 0 + 2 1
1 67108547 67109226 gene1$transcript1 0 + 3 3
1 67108547 67109226 gene1$transcript1 0 + 4 4
.
.
.
1 33547109 33557650 gene2$transcript1 0 + 239 2
1 33547109 33557650 gene2$transcript1 0 + 240 0
.
.
.
1 69109226 69109999 gene1$transcript1 0 + 351 1
1 69109226 69109999 gene1$transcript1 0 + 352 0
What I want to do is reorganize/sort this file based on the identifier in column 4. The file consists of blocks. If you concatenate columns 4, 1, 2 and 3 you create the unique identifier for each block. This is the key for the dictionary all_exons and the value is a numpy array containing all the values of column 8. Then I have a second dictionary, unique_identifiers, whose keys are the attributes from column 4 and whose values are sets of the corresponding block identifiers. As output I write a file in the following form:
>gene1
0
1
3
4
1
0
>gene2
2
0
I already wrote some code (see below) that does this, but my implementation is very slow. It takes around 18 hours to run.
import os
import sys
import time
from contextlib import contextmanager
import pandas as pd
import numpy as np

def parse_blocks(bedtools_file):
    unique_identifiers = {}  # Dictionary with key: gene, value: set of block identifiers
    all_exons = {}           # Dictionary containing all exons
    # Parse file and ...
    with open(bedtools_file) as fp:
        sp_line = []
        for line in fp:
            sp_line = line.strip().split("\t")
            current_id = sp_line[3].split("$")[0]
            identifier = "$".join([sp_line[3], sp_line[0], sp_line[1], sp_line[2]])
            if identifier in all_exons:
                item = float(sp_line[7])
                all_exons[identifier] = np.append(all_exons[identifier], item)
            else:
                all_exons[identifier] = np.array([sp_line[7]], float)
            if current_id in unique_identifiers:
                unique_identifiers[current_id].add(identifier)
            else:
                unique_identifiers[current_id] = set([identifier])
    return unique_identifiers, all_exons

identifiers, introns = parse_blocks(options.bed)
w = open(options.out, 'w')
for gene in sorted(list(identifiers)):
    w.write(">" + str(gene) + "\n")
    for intron in sorted(list(identifiers[gene])):
        for base in introns[intron]:
            w.write(str(base) + "\n")
w.close()
How can I improve the above code so that it runs faster?
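As a side note on the code above: np.append copies the whole array on every call, so building each block's array this way is quadratic in the block length. A minimal sketch of the parsing function that accumulates plain Python lists instead (untested on data of this size):
def parse_blocks(bedtools_file):
    unique_identifiers = {}
    all_exons = {}
    with open(bedtools_file) as fp:
        for line in fp:
            sp_line = line.strip().split("\t")
            current_id = sp_line[3].split("$")[0]
            identifier = "$".join([sp_line[3], sp_line[0], sp_line[1], sp_line[2]])
            # list.append is O(1); np.append copies the whole array each call
            all_exons.setdefault(identifier, []).append(sp_line[7])
            unique_identifiers.setdefault(current_id, set()).add(identifier)
    return unique_identifiers, all_exons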
You also import pandas, so I provide a pandas solution which requires basically only two lines of code.
However, I do not know how it performs on large data sets or whether it is faster than your approach (but I am pretty sure it is).
In the example below, the data you provide is stored in table.txt. I then use groupby to get all the values in your 8th column, store them in a list for the respective identifier in your column 4 (note that my indices start at 0) and convert this data structure into a dictionary which can then be printed easily.
import pandas as pd
df = pd.read_csv("table.txt", header=None, sep=r"\s+")  # replace the separator by e.g. '\t' if needed
op = dict(df.groupby(3)[7].apply(lambda x: x.tolist()))
So in this case op looks like this:
{'gene1$transcript1': [0, 1, 3, 4, 1, 0], 'gene2$transcript1': [2, 0]}
Now you could print the output like this and pipeline it in a certain file:
for k, v in op.iteritems():
    print k.split('$')[0]
    for val in v:
        print val
This gives you the desired output:
gene1
0
1
3
4
1
0
gene2
2
0
Maybe you can give it a try and let me know how it compares to your solution!?
Edit2:
In the comments you mentioned that you would like to print the genes in the correct order. You can do this as follows:
# add some fake genes to op from above
op['gene0$stuff'] = [7,9]
op['gene4$stuff'] = [5,9]
# print using 'sorted'
for k, v in sorted(op.iteritems()):
    print k.split('$')[0]
    for val in v:
        print val
which gives you:
gene0
7
9
gene1
0
1
3
4
1
0
gene2
2
0
gene4
5
9
EDIT1:
I am not sure whether duplicates are intended but you could easily get rid of them by doing the following:
op2 = dict(df.groupby(3)[7].apply(lambda x: set(x)))
Now op2 would look like this:
{'gene1$transcript1': {0, 1, 3, 4}, 'gene2$transcript1': {0, 2}}
You print the output as before:
for k, v in op2.iteritems():
    print k.split('$')[0]
    for val in v:
        print val
which gives you
gene1
0
1
3
4
gene2
0
2
I'll try to simplify your problem; my solution works like this:
First, scan over the big file. For every different current_id, open a temporary file and append the value of column 8 to that file.
After the full scan, concatenate all chunks into a result file.
Here's the code:
# -*- coding: utf-8 -*-
import os
import tempfile
import subprocess

class ChunkBoss(object):
    """Boss for file chunks"""
    def __init__(self):
        self.opened_files = {}

    def write_chunk(self, current_id, value):
        if current_id not in self.opened_files:
            self.opened_files[current_id] = open(tempfile.mktemp(), 'wb')
            self.opened_files[current_id].write('>%s\n' % current_id)
        self.opened_files[current_id].write('%s\n' % value)

    def cat_result(self, filename):
        """Catenate chunks to one big file
        """
        # Sort the chunks
        chunk_file_list = []
        for current_id in sorted(self.opened_files.keys()):
            chunk_file_list.append(self.opened_files[current_id].name)
        # Flush chunks
        [chunk.flush() for chunk in self.opened_files.values()]
        # By calling the cat command
        with open(filename, 'wb') as fp:
            subprocess.call(['cat', ] + chunk_file_list, stdout=fp, stderr=fp)

    def clean_up(self):
        [os.unlink(chunk.name) for chunk in self.opened_files.values()]

def main():
    boss = ChunkBoss()
    with open('bigfile.data') as fp:
        for line in fp:
            data = line.strip().split()
            current_id = data[3].split("$")[0]
            value = data[7]
            # Write value to temp chunk
            boss.write_chunk(current_id, value)
    boss.cat_result('result.txt')
    boss.clean_up()

if __name__ == '__main__':
    main()
I tested the performance of my script, with bigfile.data containing about 150k lines. It took about 0.5s to finish on my laptop. Maybe you can give it a try.

Sorting a hash table and printing key and value at the same time

I have written a program in Python that uses a hash table to read data from a file and sum the last-column values that share the same value in the 2nd column. For example, for all entries in column 2 with the same value, the corresponding last-column values are added together.
I have implemented the above successfully. Now I want to sort the table in descending order of the summed values and print each value along with the corresponding 2nd-column key. I am not able to figure out how to do this. Can anyone please help?
pmt txt file is of the form
0.418705 2 3 1985 20 0
0.420657 4 5 119 3849 5
0.430000 2 3 1985 20 500
and so on...
So, for example, for the number 2 in column 2, I have added all the last-column data corresponding to every '2' in the 2nd column. This process continues for the next set of numbers, like 4, 5, etc., in column 2.
I'm using python 3
import math

source_ip = {}
f = open("pmt.txt", "r", 1)
lines = f.readlines()
for line in lines:
    s_ip = line.split()[1]
    bit_rate = int(line.split()[-1]) + 40
    if s_ip in source_ip.keys():
        source_ip[s_ip] = source_ip[s_ip] + bit_rate
        print(source_ip[s_ip])
    else:
        source_ip[s_ip] = bit_rate
f.close()

for k in source_ip.keys():
    print(str(k) + ": " + str(source_ip[k]))
print("-----------")
It sounds like you want to use the sorted function with a key parameter that gets the value from the key/value tuple, plus reverse=True for descending order:
sorted_items = sorted(source_ip.items(), key=lambda x: x[1], reverse=True)
You could also use itemgetter from the operator module, rather than a lambda function:
import operator
sorted_items = sorted(source_ip.items(), key=operator.itemgetter(1), reverse=True)
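To print each key with its summed value in that order, something like this works (a short sketch reusing the names above):
for key, value in sorted_items:
    print("{}: {}".format(key, value))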
How about something like this?
#!/usr/local/cpython-3.4/bin/python

import collections

source_ip = collections.defaultdict(int)

with open("pmt.txt", "r", 1) as file_:
    for line in file_:
        fields = line.split()
        s_ip = fields[1]
        bit_rate = int(fields[-1]) + 40
        source_ip[s_ip] += bit_rate
        print(source_ip[s_ip])

for key, value in sorted(source_ip.items()):
    print('{}: {}'.format(key, value))

print("-----------")
