How to organise columnar data into a network table - python

I have a tab-delimited text file with two columns. I need to find a way to print all values that "hit" each other on one line.
For example, my input looks like this:
A B
A C
A D
B C
B D
C D
B E
D E
B F
C F
F G
F H
H I
K L
My desired output should look like this:
A B C D
B D E
B C F
F G H
H I
K L
My actual data file is much larger than this, if that makes any difference. I would prefer to do this in Unix or Python where possible.
Can anybody help?
Thanks in advance!

Is there no way to supply the input file as .csv? The delimiters would be easier to parse.
If that isn't possible, try the following example:
from itertools import groupby
from operator import itemgetter

cleaned = []
with open('example.txt', 'rb') as txtfile:
    # store file information in a list of lists
    for line in txtfile.readlines():
        cleaned.append(line.split())

# group by first element of nested list
for elt, items in groupby(cleaned, itemgetter(0)):
    row = [elt]
    for item in items:
        row.append(item[1])
    print row
Hope it helps you.
Solution using a .csv file:
from itertools import groupby
from operator import itemgetter
import csv

cleaned = []
with open('example.csv', 'rb') as csvfile:
    reader = csv.reader(csvfile, delimiter='\t')
    for row in reader:
        cleaned.append(row)

# group by first element of nested list
for elt, items in groupby(cleaned, itemgetter(0)):
    row = [elt]
    for item in items:
        row.append(item[1])
    print row
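On Python 3 the same approach works with print() as a function; here is a minimal self-contained sketch with a few of the question's pairs inlined instead of read from the file. Note the caveat that groupby only merges consecutive rows sharing a key, so the input must already be sorted by the first column (as the sample data is):

```python
from itertools import groupby
from operator import itemgetter

# A few pairs from the question, inlined instead of read from the file
pairs = [['A', 'B'], ['A', 'C'], ['A', 'D'], ['B', 'C'], ['B', 'D']]

grouped = []
for key, items in groupby(pairs, itemgetter(0)):
    # one output row per key: the key followed by all of its partners
    grouped.append([key] + [item[1] for item in items])

print(grouped)  # [['A', 'B', 'C', 'D'], ['B', 'C', 'D']]
```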

Related

How to save a dict to CSV in column form, where the first line is the keys and the following lines are the vectors?

I have a dict which is in the following form:
Sigs_dict['a']=[1,2,3,4,5]
Sigs_dict['b']=[6,7,8,9,0]
Sigs_dict['c']=[1,2,3,4,5]
I would like to have a csv file where the first line is the keys of the dict and the following lines contain the vectors as columns.
Like:
a b c
1 6 1
2 7 2
3 8 3
4 9 4
5 0 5
What I have so far produces the first line, but I don't understand how to write the vectors properly.
with open(fileName[0], 'w') as f:
    writer = csv.writer(f, delimiter=' ')
    writer.writerow(Sigs_dict.keys())
    # missing something here
All vectors have the same length.
A simple option using the zip built-in (no pandas dependency, even though pandas allows for fewer lines of code):
import csv

my_dict = {}
my_dict['a'] = [1, 2, 3, 4, 5]
my_dict['b'] = [6, 7, 8, 9, 0]
my_dict['c'] = [1, 2, 3, 4, 5]

# get keys and values in the same order, in this case, sorted by key
# key_list = sorted(my_dict.keys())
key_list = [k for k, _ in sorted(my_dict.items())]
val_list = [v for _, v in sorted(my_dict.items())]

with open(fileName[0], 'w') as f:
    writer = csv.writer(f, delimiter=' ')
    writer.writerow(key_list)
    for row in zip(*val_list):
        writer.writerow(row)
Note how I unpack val_list using the asterisk (*); without that unpacking this won't work as expected.
I edited my code to use a fixed ordering, to prevent a mismatch between the column headers and the content.
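The transpose that zip(*val_list) performs can be seen in isolation, using values matching the question's dict:

```python
val_list = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 0], [1, 2, 3, 4, 5]]

# Each output row takes one element from every column list
rows = list(zip(*val_list))
print(rows)  # [(1, 6, 1), (2, 7, 2), (3, 8, 3), (4, 9, 4), (5, 0, 5)]
```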
If you are OK with using pandas:
import pandas as pd
pd.DataFrame(Sigs_dict).to_csv('your_csv_file_path.csv', index=False, sep=' ')
This will produce:
a b c
1 6 1
2 7 2
3 8 3
4 9 4
5 0 5
Even though the input is a dictionary, you cannot use csv.DictWriter here: it expects one key per data row, not per column.
Just zip the dict values (in a fixed order) to "transpose" them into rows that the csv module can write properly.
Also, sort your dict keys so the order of the columns is the same every time:
import csv, sys

title = sorted(Sigs_dict)
cw = csv.writer(sys.stdout)
cw.writerow(title)  # write header
cw.writerows(zip(*(Sigs_dict[k] for k in title)))
result:
a,b,c
1,6,1
2,7,2
3,8,3
4,9,4
5,0,5
To write to a file, don't forget newline="" on Python 3 (or "wb" mode on Python 2) to avoid the infamous extra-newline issue on Windows:
with open(fileName[0], 'w', newline="") as f:  # or just open(fileName[0], 'wb') for Python 2
    cw = csv.writer(f)
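Putting the pieces together on Python 3, an end-to-end sketch; io.StringIO stands in here for the real open(path, 'w', newline='') so the output can be inspected directly:

```python
import csv
import io

Sigs_dict = {'a': [1, 2, 3], 'b': [6, 7, 8], 'c': [1, 2, 3]}

buf = io.StringIO()  # stands in for open(path, 'w', newline='')
title = sorted(Sigs_dict)        # fixed column order
cw = csv.writer(buf)
cw.writerow(title)               # header row
cw.writerows(zip(*(Sigs_dict[k] for k in title)))  # transpose values into rows

print(buf.getvalue().splitlines())  # ['a,b,c', '1,6,1', '2,7,2', '3,8,3']
```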

Compare two csv files

I am trying to compare two csv files to look for common values in column 1.
import csv
f_d1 = open('test1.csv')
f_d2 = open('test2.csv')
f1_csv = csv.reader(f_d1)
f2_csv = csv.reader(f_d2)
for x, y in zip(f1_csv, f2_csv):
    print(x, y)
I am trying to compare x[0] with y[0]. I am fairly new to Python and trying to find the most Pythonic way to achieve this. Here are the csv files.
test1.csv
Hadrosaurus,1.2
Struthiomimus,0.92
Velociraptor,1.0
Triceratops,0.87
Euoplocephalus,1.6
Stegosaurus,1.4
Tyrannosaurus Rex,2.5
test2.csv
Euoplocephalus,1.87
Stegosaurus,1.9
Tyrannosaurus Rex,5.76
Hadrosaurus,1.4
Deinonychus,1.21
Struthiomimus,1.34
Velociraptor,2.72
I believe you're looking for the set intersection:
import csv
f_d1 = open('test1.csv')
f_d2 = open('test2.csv')
f1_csv = csv.reader(f_d1)
f2_csv = csv.reader(f_d2)
x = set([item[0] for item in f1_csv])
y = set([item[0] for item in f2_csv])
print(x & y)
Assuming that the files are not prohibitively large, you can read both of them with a CSV reader, convert the first columns to sets, and calculate the set intersection:
with open('test1.csv') as f:
    set1 = set(x[0] for x in csv.reader(f))
with open('test2.csv') as f:
    set2 = set(x[0] for x in csv.reader(f))

print(set1 & set2)
# {'Hadrosaurus', 'Euoplocephalus', 'Tyrannosaurus Rex', 'Struthiomimus',
#  'Velociraptor', 'Stegosaurus'}
I added a line to test whether the numerical values in each row are the same. You can modify this to test whether, for instance, the values are within some distance of each other:
import csv
f_d1 = open('test1.csv')
f_d2 = open('test2.csv')
f1_csv = csv.reader(f_d1)
f2_csv = csv.reader(f_d2)
for x, y in zip(f1_csv, f2_csv):
    if x[1] == y[1]:
        print('they match!')
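The "within some distance" variant mentioned above can be sketched with in-memory data; io.StringIO replaces the real files, and the same row order in both files is assumed (which zip requires anyway):

```python
import csv
import io

f1 = io.StringIO("Hadrosaurus,1.2\nStegosaurus,1.4\n")
f2 = io.StringIO("Hadrosaurus,1.4\nStegosaurus,1.9\n")

close = []
for x, y in zip(csv.reader(f1), csv.reader(f2)):
    # match if the numeric values differ by at most 0.3
    if abs(float(x[1]) - float(y[1])) <= 0.3:
        close.append(x[0])

print(close)  # ['Hadrosaurus']
```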
Take advantage of defaultdict in Python: you can iterate over both files and collect the values in a dictionary like this:
from collections import defaultdict

d = defaultdict(list)
for row in f1_csv:
    d[row[0]].append(row[1])
for row in f2_csv:
    d[row[0]].append(row[1])

d = {k: d[k] for k in d if len(d[k]) > 1}
print(d)
Output:
{'Hadrosaurus': ['1.2', '1.4'], 'Struthiomimus': ['0.92', '1.34'], 'Velociraptor': ['1.0', '2.72'],
'Euoplocephalus': ['1.6', '1.87'], 'Stegosaurus': ['1.4', '1.9'], 'Tyrannosaurus Rex': ['2.5', '5.76']}

Output the rows with certain initial string of some specific column

I have a tab-delimited txt file like this:
A B aaaKP
C D bbbZ
E F cccLL
This is tab-delimited.
If
phrase = aaa or bbb
column = 3
then I would like only those rows whose 3rd column starts with aaa or bbb
The output will be
A B aaaKP
C D bbbZ
I have code for the case where there is only one phrase.
phrase, column = 'aaa', 3
fn = lambda l : len(l) >= column and len(l[column-1]) >= len(phrase) and phrase == l[column-1][:len(phrase)]
fp = open('output.txt', 'w')
fp.write(''.join(row for row in open('input.txt') if fn(row.split('\t'))))
fp.close()
But if there are multiple phrases.. I tried
phrase, column = {'aaa','bbb'}, 3
but it didn't work.
In the general case you can use regular expressions with branches for quick matching and searching:
import re

phrases = ['aaa', 'bbb']
column = 3

pattern = re.compile('|'.join(re.escape(i) for i in phrases))
column -= 1

with open('input.txt') as inf, open('output.txt', 'w') as outf:
    for line in inf:
        row = line.split('\t')
        if pattern.match(row[column]):
            outf.write(line)
The code builds a regular expression from all the possible phrases, using re.escape to escape special characters. The resulting expression in this case is aaa|bbb. pattern.match matches the beginning of the string against the pattern (the match must start from the first character).
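The anchoring behaviour of match() and the effect of re.escape can be checked in isolation; 'b.b' is a hypothetical phrase containing a regex metacharacter:

```python
import re

phrases = ['aaa', 'b.b']  # 'b.b' contains a metacharacter
pattern = re.compile('|'.join(re.escape(p) for p in phrases))

print(bool(pattern.match('aaaKP')))  # True: the match starts at position 0
print(bool(pattern.match('Xaaa')))   # False: match() only anchors at the start
print(bool(pattern.match('bXb')))    # False: the dot was escaped, so it is literal
```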
However, if you only need to match the beginning of the string against fixed phrases, note that startswith accepts a tuple; this is the fastest code:
phrases = ['aaa', 'bbb']
column = 3

phrase_tuple = tuple(phrases)
column -= 1

with open('input.txt') as inf, open('output.txt', 'w') as outf:
    for line in inf:
        row = line.split('\t')
        if row[column].startswith(phrase_tuple):
            outf.write(line)
It also demonstrates the use of context managers for opening the files, and opens input.txt before output.txt so that if the former does not exist, the latter does not get created. Finally, it shows that this looks nicest without any generators or lambdas.
You could use python's re module for this,
>>> import re
>>> data = """A B aaaKP
... C D bbbZ
... E F cccLL"""
>>> m = re.findall(r'^(?=\S+\s+\S+\s+(?:aaa|bbb)).*$', data, re.M)
>>> for i in m:
... print i
...
A B aaaKP
C D bbbZ
A positive lookahead is used to check whether the line contains the particular string. The above regex matches lines whose third column starts with aaa or bbb; those lines are then printed.
You could also try this regex, which requires literal tabs between the columns:
>>> s = """A B aaaKP
... C D bbbZ
... E F cccLL
... """
>>> m = re.findall(r'^(?=\S+\t\S+\t(?:aaa|bbb)).*$', s, re.M)
>>> for i in m:
... print i
...
A B aaaKP
C D bbbZ
Solution:
#!/usr/bin/env python
import csv
from pprint import pprint

def read_phrases(filename, phrases):
    with open(filename, "r") as fd:
        reader = csv.reader(fd, delimiter="\t")
        for row in reader:
            if any(row[2].startswith(phrase) for phrase in phrases):
                yield row

pprint(list(read_phrases("foo.txt", ["aaa"])))
pprint(list(read_phrases("foo.txt", ["aaa", "bbb"])))
Example:
$ python foo.py
[['A', 'B', 'aaaKP']]
[['A', 'B', 'aaaKP'], ['C', 'D', 'bbbZ']]

Python 2: IndexError: list index out of range

I have an "asin.txt" document:
in,Huawei1,DE
out,Huawei2,UK
out,Huawei3,none
in,Huawei4,FR
in,Huawei5,none
in,Huawei6,none
out,Huawei7,IT
I'm opening this file and building two OrderedDicts:
import csv
from collections import OrderedDict

reader = csv.reader(open('asin.txt', 'r'), delimiter=',')
reader1 = csv.reader(open('asin.txt', 'r'), delimiter=',')
d = OrderedDict((row[0], row[1].strip()) for row in reader)
d1 = OrderedDict((row[1], row[2].strip()) for row in reader1)
Then I want to create variables (a, b, c, d) so that, taking the first line of asin.txt, a = in; b = Huawei1; c = Huawei1; d = DE. To do this I'm using a for loop:
from itertools import izip

for (a, b), (c, d) in izip(d.items(), d1.items()):  # here
    try:
        .......
It worked before, but now, for some reason, it raises an error:
d = OrderedDict((row[0], row[1].strip()) for row in reader)
IndexError: list index out of range
How do I fix that?
You probably have a row in your text file that does not have at least two fields delimited by ",". E.g.:
in,Huawei1
Try a solution along these lines:
d = OrderedDict((row[0], row[1].strip()) for row in reader if len(row) >= 2)
or
l = []
for row in reader:
    if len(row) >= 2:
        l.append((row[0], row[1].strip()))
d = OrderedDict(l)
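The failure and the guard can be reproduced without the file; the short row below is a hypothetical malformed line of the kind that triggers the IndexError:

```python
from collections import OrderedDict

# middle row is malformed: row[1] would raise IndexError without the guard
rows = [['in', 'Huawei1', 'DE'], ['out'], ['out', 'Huawei7', 'IT']]

d = OrderedDict((row[0], row[1].strip()) for row in rows if len(row) >= 2)
print(d)  # the short row is skipped instead of raising IndexError
```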

Write multiple lists to CSV

I have two lists:
x = [['a','b','c'], ['d','e','f'], ['g','h','i']]
y = [['j','k','l'], ['m','n','o'], ['p','q','r']]
I'd like to write lists x and y to a CSV file such that it reads in columns:
Col 1:
a
b
c
Col 2:
j
k
l
Col 3:
d
e
f
Col 4:
m
n
o
etc. I'm not really sure how to do this.
You can use zip to do the transpose and csv to create your output file, eg:
from itertools import chain
import csv

x = [['a','b','c'], ['d','e','f'], ['g','h','i']]
y = [['j','k','l'], ['m','n','o'], ['p','q','r']]

res = zip(*list(chain.from_iterable(zip(x, y))))
with open(r'yourfile.csv', 'wb') as fout:
    csvout = csv.writer(fout)
    csvout.writerows(res)
If the lists have unequal lengths, you may wish to look at itertools.izip_longest and specify a suitable fillvalue= instead of using the builtin zip.
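A sketch of the unequal-length case using zip_longest (named izip_longest on Python 2), with hypothetical columns of differing lengths:

```python
from itertools import zip_longest  # izip_longest on Python 2

cols = [['a', 'b', 'c'], ['j', 'k'], ['d', 'e', 'f', 'g']]

# Shorter columns are padded with fillvalue instead of being truncated
rows = list(zip_longest(*cols, fillvalue=''))
print(rows)  # [('a', 'j', 'd'), ('b', 'k', 'e'), ('c', '', 'f'), ('', '', 'g')]
```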
