Converting coordinate representation to adjacency list representation - python

What is the most efficient way to convert this file:
10 3
10 5
12 6
12 19
19 12
19 14
19 10
to this:
10 3 5
12 6 19
19 12 14 10
First column of the input file is numerically sorted in increasing order.
Any solutions using Python, AWK, etc. are welcome.

from itertools import groupby

lines, op_file = [line.split() for line in open("In.txt")], open("Out.txt", "w")
for key, grp in groupby(lines, key=lambda x: x[0]):
    print >> op_file, "{} {}".format(key, " ".join([i[1] for i in grp]))
op_file.close()
Output
10 3 5
12 6 19
19 12 14 10

Since you mentioned awk:
$ awk '{a[$1]=a[$1]" "$2}END{for (i in a){print i a[i]}}' input
19 12 14 10
10 3 5
12 6 19
Pipe it to sort (add -n for numeric order) to have it, well, sorted:
$ awk '...' input | sort
10 3 5
12 6 19
19 12 14 10

In Python 2:
import itertools, operator

with open(infilename) as infile:
    input = (line.split() for line in infile)
    output = itertools.groupby(input, operator.itemgetter(0))
    with open(outfilename, 'w') as outfile:
        for key, line in output:
            print >>outfile, key, ' '.join(val[1] for val in line)
This assumes that the input and output files are different: you could just write the output to standard out and leave it as the user's problem to save it.
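For comparison, a minimal Python 3 sketch of the same groupby idea that writes to standard output instead (the file name "In.txt" is taken from the first answer and is only a placeholder):
import itertools, operator

with open("In.txt") as infile:
    rows = (line.split() for line in infile)
    for key, group in itertools.groupby(rows, operator.itemgetter(0)):
        print(key, ' '.join(row[1] for row in group))
Redirect it to a file (python script.py > Out.txt) if you want it saved.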

Try out this code:
fp = open('/tmp/test.txt')
list_dict = {}
for line in fp.readlines():
    split_values = line.split()
    if split_values[0] in list_dict:
        # Key already seen: append the remaining columns.
        list_dict[split_values[0]].extend(split_values[1:])
    else:
        # First occurrence: store the whole row, key included.
        list_dict[split_values[0]] = split_values
for val in list_dict.values():
    print " ".join(val)
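If you need the groups printed in the numeric order of the first column (as in the question), keep in mind that plain dicts in older Python versions don't preserve insertion order; a small sketch that simply sorts the keys when printing, reusing list_dict from the code above:
for key in sorted(list_dict, key=int):
    print " ".join(list_dict[key])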

Related

Print lines in file until blank line

I have a file "testread.txt" containing the data below.
A
1
2
3
4
BA
5
6
7
8
CB
9
10
11
D
12
13
14
15
I want to read the data section by section and write each section to a different file. E.g.:
1
2
3
4
Write it to File "a.txt"
5
6
7
8
Write it to File "b.txt"
9
10
11
Write it to File "c.txt"
and so on...
A (rough) solution can be obtained using:
collections.defaultdict to split and store the items;
numpy.savetxt to save them to files.
import numpy as np
from collections import defaultdict

with open('testread.txt', 'r') as f:
    content = f.readlines()

d = defaultdict(list)
i = 0
for line in content:
    if line == '\n':
        i += 1
    else:
        d[i].append(line.strip())

for k, v in d.items():
    np.savetxt('file{}.txt'.format(k), v[1:], delimiter=",", fmt='%s')
and you get:
file0.txt
1
2
3
4
file1.txt:
5
6
7
8
file2.txt:
9
10
11
file3.txt
12
13
14
15
The idea is to switch to the next output file whenever an empty line is encountered. The code below should do the trick.
files_list = ['a.txt', 'b.txt', 'c.txt']
fpr = open('input.txt')
for f in files_list:
    with open(f, 'w') as fpw:
        for i, line in enumerate(fpr):
            if i == 0:  # skip the section header line ("A", "BA", ...)
                continue
            if line.strip():
                fpw.write(line)
            else:
                break
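An alternative sketch, assuming the sections in input.txt are separated by blank lines and the first line of each section is a header to drop, using itertools.groupby to split the sections:
from itertools import groupby

files_list = ['a.txt', 'b.txt', 'c.txt']  # target names taken from the question
with open('input.txt') as fpr:
    # Group consecutive non-blank lines; blank-line groups are skipped.
    sections = (list(grp) for is_blank, grp in groupby(fpr, key=lambda l: not l.strip()) if not is_blank)
    for fname, section in zip(files_list, sections):
        with open(fname, 'w') as fpw:
            fpw.writelines(section[1:])  # drop the header line ("A", "BA", ...)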

Merging files in Python

I want to merge 2 files. The first file (co60.txt) contains only integer values and the second file (bins.txt) contains floating-point numbers.
co60.txt:
11
14
12
14
18
15
18
9
bins.txt:
0.00017777777777777795
0.0003555555555555559
0.0005333333333333338
0.0007111111111111118
0.0008888888888888898
0.0010666666666666676
0.0012444444444444456
0.0014222222222222236
When I merge those two files with this code:
import re

with open("co60.txt", 'r') as a:
    a1 = [re.findall(r"[\w']+", line) for line in a]
with open("bins.txt", 'r') as b:
    b1 = [re.findall(r"[\w']+", line) for line in b]
with open("spectrum.txt", 'w') as c:
    for x, y in zip(a1, b1):
        c.write("{} {}\n".format(" ".join(x), y[0]))
I get:
11 0
14 0
12 0
14 0
18 0
15 0
18 0
9 0
It appears that when I merge these two files, this code only keeps the part of the bins.txt values before the decimal point.
How do I get the files to merge like this:
11 0.00017777777777777795
14 0.0003555555555555559
12 0.0005333333333333338
14 0.0007111111111111118
18 0.0008888888888888898
15 0.0010666666666666676
18 0.0012444444444444456
9 0.0014222222222222236
You can do it without regex:
with open("co60.txt") as a, open("bins.txt") as b, \
     open("spectrum.txt", 'w') as c:
    for x, y in zip(a, b):
        c.write("{} {}\n".format(x.strip(), y.strip()))
Content of spectrum.txt:
11 0.00017777777777777795
14 0.0003555555555555559
12 0.0005333333333333338
14 0.0007111111111111118
18 0.0008888888888888898
15 0.0010666666666666676
18 0.0012444444444444456
9 0.0014222222222222236
As mentioned by #immortal, if you want to use regex then use:
b1 = [re.findall(r"[0-9\.]+", line) for line in b]
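For reference, a quick interpreter check of why the original pattern loses the fractional part: \w matches letters, digits and underscore but not the dot, so each float is split at the decimal point and y[0] keeps only the leading 0.
>>> import re
>>> re.findall(r"[\w']+", "0.00017777777777777795")
['0', '00017777777777777795']
>>> re.findall(r"[0-9.]+", "0.00017777777777777795")
['0.00017777777777777795']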

python script to find the unique value

I have a script to find the unique values from 2 files.
1.csv
11 12 13 14
21 22 23 24
11 32 33 34
2.csv
41 42 43 44 45
51 52 53 54 55
41 62 63 64 65
The script is:
import csv
import sys

# Count all first-column numbers.
counts = {}

# Loop over all input files.
for a in sys.argv[1:]:
    # Open the file for reading.
    with open(a) as c:
        # Read it as a CSV file.
        reader = csv.reader(c, delimiter=' ')
        for row in reader:
            count = counts.get(row[0], 0)
            # Increment the count by 1.
            counts[row[0]] = count + 1

# Print only those numbers that have a count of 1.
print([i for i, c in counts.items() if c == 1])
Usage:
$ python 1.py 1.csv 2.csv
The output is:
['51', '21']
but I want each value on its own row, like:
51
21
Use str.join to join the list items with a newline:
l = ['51', '21']
print("\n".join(l))
Edit:
In your code (which actually is from an answer I gave you yesterday), do this:
print("\n".join([i for i, c in counts.items() if c == 1]))
Replace the last line with the following:
for result, count in counts.items():
    if count == 1:
        print(result)
It's not the most concise way to do it, but at least it's quite readable.

Python combine rows from different files into one data file

I have information distributed over multiple large csv files.
I want to combine all the files into one new file so that the first row from the first file is combined with the first row from the other files, and so on.
file1.csv
A,B
A,C
A,D
file2.csv
F,G
H,I
J,K
expected result:
output.csv
A,B,F,G
A,C,H,I
A,D,J,K
So, assuming I have a list ['file1.csv', 'file2.csv', ...], how do I go from here?
I tried loading each file into memory and combining them with np.column_stack, but my files are too large to fit in memory.
Not pretty code, but this should work.
I'm not using with open('filename', 'r') as myfile for the inputs. It could get a bit messy with 50 files, so these are opened and closed explicitly.
It opens each file and places the handle in a list. The first handle is taken as the master file; we then iterate through it line by line, each time reading one line from all the other open files, joining them with ',', and writing the result to the output file.
Note that if the other files have more lines, the extra lines won't be included. If any have fewer lines, readline() will just return empty strings and you'll get empty trailing fields. I'll leave it to you to deal with these situations gracefully.
Note also that you can use glob to create filelist if the names follow a logical pattern (thanks to N. Wouda, below); a short sketch is shown after the code.
filelist = ['book1.csv', 'book2.csv', 'book3.csv', 'book4.csv']

openfiles = []
for filename in filelist:
    openfiles.append(open(filename, 'rb'))

# Use first file in the list as the master
# All files must have same number of lines (or greater)
masterfile = openfiles.pop(0)

with open('output.csv', 'w') as outputfile:
    for line in masterfile:
        outputlist = [line.strip()]
        for openfile in openfiles:
            outputlist.append(openfile.readline().strip())
        outputfile.write(str.join(',', outputlist) + '\n')

masterfile.close()
for openfile in openfiles:
    openfile.close()
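As noted above, if the names follow a pattern, glob can build filelist for you (the pattern below is only an assumption about the naming):
import glob

filelist = sorted(glob.glob('book*.csv'))  # e.g. book1.csv, book2.csv, ...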
Input Files
a b c d e f
1 2 3 4 5 6
7 8 9 10 11 12
13 14 15 16 17 18
Output
a b c d e f a b c d e f a b c d e f a b c d e f
1 2 3 4 5 6 1 2 3 4 5 6 1 2 3 4 5 6 1 2 3 4 5 6
7 8 9 10 11 12 7 8 9 10 11 12 7 8 9 10 11 12 7 8 9 10 11 12
13 14 15 16 17 18 13 14 15 16 17 18 13 14 15 16 17 18 13 14 15 16 17 18
Instead of reading the files completely into memory, you can iterate over them line by line.
from itertools import izip  # like zip but gives us an iterator

with open('file1.csv') as f1, open('file2.csv') as f2, open('output.csv', 'w') as out:
    for f1line, f2line in izip(f1, f2):
        out.write('{},{}'.format(f1line.strip(), f2line))
Demo:
$ cat file1.csv
A,B
A,C
A,D
$ cat file2.csv
F,G
H,I
J,K
$ python2.7 merge.py
$ cat output.csv
A,B,F,G
A,C,H,I
A,D,J,K
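For an arbitrary list of files in Python 3 (where zip is already lazy, so izip is not needed), a sketch along the same lines using contextlib.ExitStack to manage the open handles; the file names are just placeholders:
from contextlib import ExitStack

filelist = ['file1.csv', 'file2.csv']  # extend as needed

with ExitStack() as stack, open('output.csv', 'w') as out:
    files = [stack.enter_context(open(name)) for name in filelist]
    for rows in zip(*files):  # stops at the shortest file
        out.write(','.join(row.strip() for row in rows) + '\n')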

Group and Count unique string values and lengths

I would like to print the count of each unique string value, its length in characters, and the string itself. Python is fine, but I am open to suggestions using other tools. If a specific output format is required, tab-separated or something similar that can be readily parsed would work. This is a followup to Parsing URI parameter and keyword value pairs.
Example source:
date=2012-11-20
test=
y=5
page=http%3A//domain.com/page.html&unique=123456
refer=http%3A//domain2.net/results.aspx%3Fq%3Dbob+test+1.21+some%26file%3Dname
test=
refer=http%3A//domain2.net/results.aspx%3Fq%3Dbob+test+1.21+some%26file%3Dname
refer=http%3A//domain2.net/results.aspx%3Fq%3Dbob+test+1.21+some%26file%3Dname
y=5
page=http%3A//support.domain.com/downloads/index.asp
page=http%3A//support.domain.com/downloads/index.asp
view=month
y=5
y=5
y=5
Example output:
5 3 y=5
3 78 refer=http%3A//domain2.net/results.aspx%3Fq%3Dbob+test+1.21+some%26file%3Dname
2 52 page=http%3A//support.domain.com/downloads/index.asp
2 5 test=
1 15 date=2012-11-20
1 10 view=month
Here is an example where I was able to use a shell one-liner, but I assume it may be easier to come up with something in Python that can handle this plus the length counting.
$ sort test | uniq -c | sort -nr
5 y=5
3 refer=http%3A//domain2.net/results.aspx%3Fq%3Dbob+test+1.21+some%26file%3Dname
2 test=
2 page=http%3A//support.domain.com/downloads/index.asp
1 view=month
1 page=http%3A//domain.com/page.html&unique=123456
1 date=2012-11-20
Yes, you can easily do it with Python. Usually people tend to use a dictionary to keep track of duplicates:
>>> from collections import defaultdict
>>> group = defaultdict(list)
>>> with open("test.txt") as fin:
        for line in fin:
            group[len(line.rstrip())].append(line)

>>> for k, g in group.items():
        print k, len(g), g[0].strip()
3 5 y=5
5 2 test=
10 1 view=month
78 3 refer=http%3A//domain2.net/results.aspx%3Fq%3Dbob+test+1.21+some%26file%3Dname
15 1 date=2012-11-20
48 1 page=http%3A//domain.com/page.html&unique=123456
52 2 page=http%3A//support.domain.com/downloads/index.asp
Instead, if you would like to mimic your shell command, a similar thing can be achieved using itertools.groupby, which behaves similarly to uniq:
>>> from itertools import groupby
>>> with open("test.txt") as fin:
        file_it = (e.rstrip() for e in fin)
        for k, g in groupby(sorted(file_it, key=len), len):
            first_elem = next(g).strip()
            print k, sum(1 for _ in g) + 1, first_elem
3 5 y=5
5 2 test=
10 1 view=month
15 1 date=2012-11-20
48 1 page=http%3A//domain.com/page.html&unique=123456
52 2 page=http%3A//support.domain.com/downloads/index.asp
78 3 refer=http%3A//domain2.net/results.aspx%3Fq%3Dbob+test+1.21+some%26file%3Dname
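If you want the output in exactly the order shown in the question (count, then length, then the string, sorted by count descending), a small sketch using collections.Counter, assuming the same test.txt:
from collections import Counter

with open("test.txt") as fin:
    counts = Counter(line.rstrip() for line in fin)

for value, count in counts.most_common():
    print(count, len(value), value)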
