Group and Count unique string values and lengths - python

I would like to print, for each unique string value, its count, its length in characters, and the string itself. Python is fine, but I am open to suggestions using other tools. If a specific output format is required, tab-separated or anything similar that can be readily parsed would work. This is a follow-up to Parsing URI parameter and keyword value pairs.
Example source:
date=2012-11-20
test=
y=5
page=http%3A//domain.com/page.html&unique=123456
refer=http%3A//domain2.net/results.aspx%3Fq%3Dbob+test+1.21+some%26file%3Dname
test=
refer=http%3A//domain2.net/results.aspx%3Fq%3Dbob+test+1.21+some%26file%3Dname
refer=http%3A//domain2.net/results.aspx%3Fq%3Dbob+test+1.21+some%26file%3Dname
y=5
page=http%3A//support.domain.com/downloads/index.asp
page=http%3A//support.domain.com/downloads/index.asp
view=month
y=5
y=5
y=5
Example output:
5 3 y=5
3 78 refer=http%3A//domain2.net/results.aspx%3Fq%3Dbob+test+1.21+some%26file%3Dname
2 52 page=http%3A//support.domain.com/downloads/index.asp
2 5 test=
1 15 date=2012-11-20
1 10 view=month
Here is an example where I was able to use a one-liner, but I assume it may be easier to come up with something in Python that can handle both this and the length counting.
$ sort test | uniq -c | sort -nr
5 y=5
3 refer=http%3A//domain2.net/results.aspx%3Fq%3Dbob+test+1.21+some%26file%3Dname
2 test=
2 page=http%3A//support.domain.com/downloads/index.asp
1 view=month
1 page=http%3A//domain.com/page.html&unique=123456
1 date=2012-11-20

Yes, you can easily do it with Python. The usual approach is to use a dictionary to keep track of duplicates:
>>> from collections import defaultdict
>>> group = defaultdict(list)
>>> with open("test.txt") as fin:
...     for line in fin:
...         group[len(line.rstrip())].append(line)
...
>>> for k, g in group.items():
...     print(k, len(g), g[0].strip())
...
3 5 y=5
5 2 test=
10 1 view=month
78 3 refer=http%3A//domain2.net/results.aspx%3Fq%3Dbob+test+1.21+some%26file%3Dname
15 1 date=2012-11-20
48 1 page=http%3A//domain.com/page.html&unique=123456
52 2 page=http%3A//support.domain.com/downloads/index.asp
If you would instead like to mimic what your shell command does, a similar result can be achieved using itertools.groupby, which behaves much like uniq:
>>> from itertools import groupby
>>> with open("test.txt") as fin:
...     file_it = (e.rstrip() for e in fin)
...     for k, g in groupby(sorted(file_it, key=len), len):
...         first_elem = next(g).strip()
...         print(k, sum(1 for _ in g) + 1, first_elem)
...
3 5 y=5
5 2 test=
10 1 view=month
15 1 date=2012-11-20
48 1 page=http%3A//domain.com/page.html&unique=123456
52 2 page=http%3A//support.domain.com/downloads/index.asp
78 3 refer=http%3A//domain2.net/results.aspx%3Fq%3Dbob+test+1.21+some%26file%3Dname
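Note that both snippets above group lines by their length, so two different strings that happen to have the same length would end up in one group. A minimal sketch that counts each distinct string instead, assuming Python 3 and that the input is available as an iterable of lines (the name count_unique_lines and the short sample list are made up for illustration), might look like this:

```python
from collections import Counter


def count_unique_lines(lines):
    """Return (count, length, text) tuples, most frequent first."""
    # Strip the trailing newline so it does not count towards the length.
    counts = Counter(line.rstrip("\n") for line in lines)
    # Sort by descending count; ties are broken alphabetically by text.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(count, len(text), text) for text, count in ordered]


# Hypothetical sample, a shortened version of the data in the question.
sample = ["y=5\n", "test=\n", "y=5\n", "view=month\n", "test=\n", "y=5\n"]
for count, length, text in count_unique_lines(sample):
    # Tab-separated output is easy to parse downstream.
    print("{}\t{}\t{}".format(count, length, text))
```

To run it on the real file, pass the open file object instead of the sample list.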

Related

How to sort a dictionary alphabetically?

def wordCount(inPath):
    inFile = open(inPath, 'r')
    lineList = inFile.readlines()
    counter = {}
    for line in range(len(lineList)):
        currentLine = lineList[line].rstrip("\n")
        for letter in range(len(currentLine)):
            if currentLine[letter] in counter:
                counter[currentLine[letter]] += 1
            else:
                counter[currentLine[letter]] = 1
    sorted(counter.keys(), key=lambda counter: counter[0])
    for letter in counter:
        print('{:3}{}'.format(letter, counter[letter]))
inPath = "file.txt"
wordCount(inPath)
This is the output:
a 1
k 1
u 1
l 2
12
h 5
T 1
r 4
c 2
d 1
s 5
i 6
o 3
f 2
H 1
A 1
e 10
n 5
x 1
t 5
This is the output I want:
12
A 1
H 1
T 1
a 1
c 2
d 1
e 10
f 2
h 5
i 6
k 1
l 2
n 5
o 3
r 4
s 5
t 5
u 1
x 1
How do I sort the "counter" alphabetically?
I've tried simply sorting by keys and values but it doesn't return it alphabetically starting with capitals first
Thank you for your help!
The call
sorted(counter.keys(), key=lambda counter: counter[0])
on its own does nothing: it returns a new sorted list that is never used (in the interactive interpreter you could recall it with _, but that does not help in a script).
Unlike a list, which can be sorted in place with its .sort() method, a dictionary's keys cannot be sorted in place. What you can do is iterate over a sorted copy of the keys:
for letter in sorted(counter.keys()):
The key=lambda counter: counter[0] argument is unnecessary here: every key is a single character.
Aside: your whole code could be simplified a great deal by using collections.Counter to count the letters.
import collections
c = collections.Counter("This is a Sentence")
for k, v in sorted(c.items()):
    print("{} {}".format(k, v))
result (including space char):
3
S 1
T 1
a 1
c 1
e 3
h 1
i 2
n 2
s 2
t 1
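To illustrate the point about iterating over sorted keys, here is a small sketch, assuming Python 3 and a made-up counter dictionary. Plain sorted() compares strings by code point, which is why the space and the capital letters come out before the lowercase letters. The key=str.lower variant is only an optional alternative for the case where capitalisation should be ignored.

```python
# Hypothetical counts, standing in for the counter built from the file.
counter = {"b": 2, "A": 1, "a": 3, "B": 4}

# sorted() returns a new list; the dictionary itself is left untouched.
by_code_point = sorted(counter)
# Optional: ignore case when ordering, so "A" and "a" end up next to each other.
ignoring_case = sorted(counter, key=str.lower)

for letter in by_code_point:
    print("{:3}{}".format(letter, counter[letter]))
```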

Python script for copying a column

This seems like a simple question, but I can't find an answer.
Input:
a 3 4
b 1 4
c 8 3
d 3 8
Wanted output:
a a 3 4
b b 1 4
c c 8 3
d d 3 8
Note: the file .txt input has many rows in the first column.
You didn't ask for it, but would you want awk? You could do:
awk '{$1=$1 OFS $1}1' Input
or the more obvious but less flexible:
awk '{print $1, $1, $2, $3}' Input
Assuming you've read your results in an array, you want:
values = ["a",1,2,3]
values.insert(0,values[0])
This inserts the value of index 0 (in this case "a") at position 0, moving all the other contents of values to the right.
This will also work on strings, so if your results are read as a string you can do the following - please note that I am including the spaces after each digit and am doing it a bit differently:
values="a 1 2 3"
values = values[:2] + values
In this example we take the first two characters of the string (values[:2], which is the same as values[0:2]) and append the existing string to them.
Hope this helps!
Try this:
fin = open("text.txt")
content = fin.readlines()
fin.close()
for elem in content:
    print(elem[0],elem[0]+elem[1:-1])
Output:
a a 3 4
b b 1 4
c c 8 3
d d 3 8
with open("sample.csv") as inputs:
    for line in inputs:
        trimed_line = line.strip()
        parts = trimed_line.split()
        print("{0} {1}".format(parts[0], trimed_line))
output:
a a 3 4
b b 1 4
c c 8 3
d d 3 8
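The slicing answers above rely on the first column being a single character. A minimal sketch that works for a first column of any width and tolerates blank lines, assuming Python 3 and whitespace-separated columns (the function name duplicate_first_column is made up for illustration), could be:

```python
def duplicate_first_column(line):
    """Return the line with its first whitespace-separated field repeated."""
    stripped = line.strip()
    if not stripped:
        return stripped  # leave blank lines empty
    # Split only once: the rest of the line is kept exactly as it was.
    first = stripped.split(None, 1)[0]
    return "{} {}".format(first, stripped)


for row in ["a 3 4\n", "b 1 4\n", "c 8 3\n", "d 3 8\n"]:
    print(duplicate_first_column(row))
```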

Finding the most frequent items in a dataset

I am working with a big dataset and thus I only want to use the items that are most frequent.
Simple example of a dataset:
1 2 3 4 5 6 7
1 2
3 4 5
4 5
4
8 9 10 11 12 13 14
15 16 17 18 19 20
4 has 4 occurrences,
1 has 2 occurrences,
2 has 2 occurrences,
5 has 2 occurrences,
I want to be able to generate a new dataset just with the most frequent items, in this case the 4 most common:
The wanted result:
1 2 3 4 5
1 2
3 4 5
4 5
4
I am finding the first 50 most common items, but I am failing to print them out in a correct way. (my output is resulting in the same dataset)
Here is my code:
from collections import Counter
with open('dataset.dat', 'r') as f:
    lines = []
    for line in f:
        lines.append(line.split())
c = Counter(sum(lines, []))
p = c.most_common(50)
with open('dataset-mostcommon.txt', 'w') as output:
    ..............
Can someone please help me on how I can achieve it?
You have to iterate over the dataset again and, for each line, output only those items that are in the set of most common items.
If the items within each input line are sorted, you can simply do a set intersection and print the result in sorted order. If they are not, iterate over each line and check each item:
for line in dataset:
    for element in line.split():
        if element in most_common_elements:
            print(element, end=' ')
    print()
PS: For Python 2, add from __future__ import print_function on top of your script
According to the documentation, c.most_common returns a list of (item, count) tuples, so you can get the desired output as follows:
with open('dataset-mostcommon.txt', 'w') as output:
    for item, occurrence in p:
        # the items are strings (they come from line.split()), hence %s
        output.write("%s has %d occurrences,\n" % (item, occurrence))
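Putting the two ideas together, the filtering step could be sketched as follows. This assumes Python 3 and rows that are already split into lists of strings; the function name keep_most_common is hypothetical, and dropping rows that end up empty is an assumption based on the wanted result in the question. With n=5 on the example data it reproduces the five wanted lines.

```python
from collections import Counter


def keep_most_common(rows, n):
    """Drop every item that is not among the n most frequent ones."""
    counts = Counter(item for row in rows for item in row)
    # A set makes the membership test below cheap.
    wanted = {item for item, _ in counts.most_common(n)}
    filtered = ([item for item in row if item in wanted] for row in rows)
    # Rows that end up empty are omitted from the result.
    return [row for row in filtered if row]


rows = [line.split() for line in [
    "1 2 3 4 5 6 7", "1 2", "3 4 5", "4 5", "4",
    "8 9 10 11 12 13 14", "15 16 17 18 19 20",
]]
for row in keep_most_common(rows, 5):
    print(" ".join(row))
```

For the real dataset, rows would be built from the file (as in the question's code) and each filtered row written to the output file instead of printed.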

Finding the same numbers in an input and summing the paired numbers

I want to find number pairs in an input, then sum up those pairs, but leave out unpaired numbers. By that I mean
8 8 8 = 16
8 8 8 8 8 = 32
so numbers with a pair of two will get counted but a number that doesn't have a pair won't get counted. Sorry if I worded this weird I don't know how to explain it, but the example will help.
For example:
8 3 4 4 5 9 9 5 2
Would output:
36
4+4+5+5+9+9 = 36
In Python.
Use collections.Counter
>>> import collections
>>> s = "8 3 4 4 5 9 9 5 2"
>>> l = s.split()
>>> sum([int(item)*count for item, count in collections.Counter(l).items() if count > 1])
36
or
>>> s = "8 3 4 4 5 9 9 5 2"
>>> l = s.split()
>>> sum([int(item)*count for item, count in collections.Counter(l).items() if count%2 == 0])
36
As a correction to the answer @avinash-raj gave:
import collections
s = "8 3 4 4 5 9 9 5 2"
l = s.split()
print(sum([int(item) * 2 * (count // 2) for item, count in collections.Counter(l).items()]))
As explanation:
we vacuum up all the numbers into a Counter, which tells us how many times each key has been seen
the expression (count // 2) is integer division, and gives us the number of complete pairs. Thus if we've seen a key 9 times, (count // 2) -> 9 // 2 -> 4, and multiplying by 2 counts the 8 occurrences that belong to a pair.
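The same idea wrapped in a function, so that the two cases from the question (8 8 8 and 8 8 8 8 8) can be checked alongside the main example. This is a small sketch assuming Python 3 and space-separated integers; the name sum_pairs is made up.

```python
from collections import Counter


def sum_pairs(text):
    """Sum every number that belongs to a complete pair."""
    counts = Counter(int(token) for token in text.split())
    # count // 2 complete pairs, each contributing the value twice.
    return sum(value * 2 * (count // 2) for value, count in counts.items())


print(sum_pairs("8 3 4 4 5 9 9 5 2"))  # 36
print(sum_pairs("8 8 8"))  # 16
print(sum_pairs("8 8 8 8 8"))  # 32
```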

Converting coordinate representation to adjacency list representation

What is the most efficient way to convert this file:
10 3
10 5
12 6
12 19
19 12
19 14
19 10
to this:
10 3 5
12 6 19
19 12 14 10
First column of the input file is numerically sorted in increasing order.
Any solutions using Python, AWK, etc. are welcome.
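from itertools import groupby

# groupby only merges consecutive rows with the same first column,
# so the input must already be ordered by that column.
with open("In.txt") as in_file, open("Out.txt", "w") as op_file:
    lines = [line.split() for line in in_file]
    for key, grp in groupby(lines, key=lambda x: x[0]):
        op_file.write("{} {}\n".format(key, " ".join(i[1] for i in grp)))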
Output
10 3 5
12 6 19
19 12 14 10
Since you mentioned awk:
$ awk '{a[$1]=a[$1]" "$2}END{for (i in a){print i a[i]}}' input
19 12 14 10
10 3 5
12 6 19
The order of for (i in a) is unspecified in awk, so pipe the output to sort to have it, well, sorted:
$ awk '...' input | sort
10 3 5
12 6 19
19 12 14 10
In Python 2:
import itertools, operator
with open(infilename) as infile:
    input = (line.split() for line in infile)
    output = itertools.groupby(input, operator.itemgetter(0))
    with open(outfilename, 'w') as outfile:
        for key, line in output:
            print >>outfile, key, ' '.join(val[1] for val in line)
This assumes that the input and output files are different: you could just write the output to standard out and leave it as the user's problem to save it.
Try out this code
fp = open('/tmp/test.txt')
list_dict = {}
for line in fp.readlines():
    split_values = line.split()
    if split_values[0] in list_dict:
        # key seen before: append only the new neighbours
        list_dict[split_values[0]].extend(split_values[1:])
    else:
        # first occurrence: store the key together with its first neighbour
        list_dict[split_values[0]] = split_values
fp.close()
for val in list_dict.values():
    print(" ".join(val))
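A dictionary-based variant that does not depend on the input being sorted can be sketched like this. It assumes Python 3.7 or later, where a plain dict keeps insertion order, so keys come out in the order they first appear; the name to_adjacency and the inline sample rows are made up for illustration.

```python
def to_adjacency(pairs):
    """Group second-column values under their first-column key."""
    adjacency = {}
    for source, target in pairs:
        # setdefault creates the list the first time a key is seen.
        adjacency.setdefault(source, []).append(target)
    return adjacency


pairs = [line.split() for line in
         ["10 3", "10 5", "12 6", "12 19", "19 12", "19 14", "19 10"]]
for source, targets in to_adjacency(pairs).items():
    print(source, " ".join(targets))
```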
