Finding the most frequent items in a dataset

I am working with a big dataset and thus I only want to use the items that are most frequent.
Simple example of a dataset:
1 2 3 4 5 6 7
1 2
3 4 5
4 5
4
8 9 10 11 12 13 14
15 16 17 18 19 20
4 has 4 occurrences,
5 has 3 occurrences,
1 has 2 occurrences,
2 has 2 occurrences,
3 has 2 occurrences,
I want to be able to generate a new dataset with just the most frequent items, in this case the 5 most common.
The wanted result:
1 2 3 4 5
1 2
3 4 5
4 5
4
I am finding the 50 most common items, but I am failing to write them out correctly (my output is the same as the original dataset).
Here is my code:
from collections import Counter
with open('dataset.dat', 'r') as f:
    lines = []
    for line in f:
        lines.append(line.split())
c = Counter(sum(lines, []))
p = c.most_common(50)
with open('dataset-mostcommon.txt', 'w') as output:
    ..............
Can someone please help me on how I can achieve it?

You have to iterate over the dataset again and, for each line, output only the items that are in the set of most common items.
If the items on each input line are sorted, you can just do a set intersection and print the result in sorted order. If they are not, iterate over each line and check every item:
for line in dataset:
    for element in line.split():
        if element in most_common_elements:
            print(element, end=' ')
    print()
PS: For Python 2, add from __future__ import print_function at the top of your script.
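
Putting the pieces together, a minimal sketch of the whole filter might look like the following. The function name filter_most_common and the decision to drop lines that end up empty are assumptions made for illustration (the wanted result in the question omits such lines). Ties between equally frequent items are broken by first appearance, which is how Counter.most_common orders equal counts on Python 3.7+.

```python
from collections import Counter


def filter_most_common(lines, n):
    """Keep only the n most frequent items on each line; drop lines left empty."""
    rows = [line.split() for line in lines]
    # Count every item across the whole dataset.
    counts = Counter(item for row in rows for item in row)
    # A set gives fast membership tests while filtering.
    keep = {item for item, _ in counts.most_common(n)}
    filtered = []
    for row in rows:
        kept = [item for item in row if item in keep]
        if kept:
            filtered.append(' '.join(kept))
    return filtered


# The sample dataset from the question.
sample = [
    "1 2 3 4 5 6 7",
    "1 2",
    "3 4 5",
    "4 5",
    "4",
    "8 9 10 11 12 13 14",
    "15 16 17 18 19 20",
]

result = filter_most_common(sample, 5)
for line in result:
    print(line)
```

To work on files instead, the list of lines can be read from the opened input file and each entry of the result written to the output file followed by a newline.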

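According to the documentation, c.most_common returns a list of (item, count) tuples, so you can get the desired output as follows. Note that the items are strings here (they come from line.split()), so they are formatted with %s rather than %d:
with open('dataset-mostcommon.txt', 'w') as output:
    for item, occurrence in p:
        output.write("%s has %d occurrences,\n" % (item, occurrence))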

Related

How to make a grid of the size a rows x b columns from a list containing exactly a*b items? Python grid, list, matrix?

How do I make a 3x5 grid out of a list containing 15 items/strings?
I have a list containing 15 symbols, but it could just as well be a list such as mylist = list(range(15)), that I want to display as a grid with 3 rows and 5 columns. How can I do that without importing another module?
I've been playing around with the for loop a bit to try to find a way, but it's not very intuitive yet, so I've been printing long lines of 0 0 0 0 0 1 1 1 1 1 2 2 2 2 2 etc. I do apologize for this 'dumb' question, but I'm an absolute beginner, as you can tell, and I don't know how to move forward with this simple problem.
This is the output I was expecting. I want to slowly work my way up to making a playing field or a tic-tac-toe game, but I want to understand how to display grids, lists, etc. as well as possible first:
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15

An m x n grid? There are several ways to do it. One is to slice the list into chunks of n elements and print each chunk on its own line.
mylist = list(range(15))
n = 5
chunks = (mylist[i:i+n] for i in range(0, len(mylist), n))
for chunk in chunks:
    print(*chunk)
This gives a 3x5 grid:
0 1 2 3 4
5 6 7 8 9
10 11 12 13 14
Method 2
If you want nicer-looking output, you can try the third-party tabulate package:
pip install tabulate
Code
from tabulate import tabulate
mylist = list(range(15))
wrap = [mylist[x:x+5] for x in range(0, len(mylist), 5)]
print(tabulate(wrap))
Gives
-- -- -- -- --
0 1 2 3 4
5 6 7 8 9
10 11 12 13 14
-- -- -- -- --
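
Since the question asks for a rows x columns grid without importing anything, here is a small sketch that keeps the grid as a nested list (handy later for a game board) and pads each cell so the columns line up. The helper names make_grid and format_grid are made up for this example.

```python
def make_grid(items, rows, cols):
    """Split a flat list of exactly rows * cols items into a list of rows."""
    if len(items) != rows * cols:
        raise ValueError("expected exactly rows * cols items")
    return [items[r * cols:(r + 1) * cols] for r in range(rows)]


def format_grid(grid):
    """Return the grid as text, right-aligning every cell to the widest one."""
    width = max(len(str(cell)) for row in grid for cell in row)
    return "\n".join(
        " ".join(str(cell).rjust(width) for cell in row) for row in grid
    )


grid = make_grid(list(range(1, 16)), 3, 5)
print(format_grid(grid))
# grid[1][2] is the cell in the second row, third column (here 8)
```

Keeping the nested list separate from the printing step means a cell can be read or changed by grid[row][col] before the grid is displayed again.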

Python - Create multiple lists and zip

I am looking to produce multiple lists based on the same function which randomises data based on a list. I want to be able to easily change how many of these new lists I want to have and then combine. The code which creates each list is the following:
def normal_transform(R):
    R_ensemble=[]
    for i in range(0,len(R)):
        if R[i]==0:
            R_ensemble.append(0)
        else:
            R_ensemble.append(np.random.normal(loc=R[i],scale=R[i]/4,size=None))
    return R_ensemble
This perturbs each value from the list based on a normal distribution.
To combine them is fine when I just want a handful of lists:
"""
"""
ensemble_form_1,ensemble_form_2,ensemble_form_3 = [],[],[]
ensemble_form_1 = normal_transform(R)
ensemble_form_2 = normal_transform(R)
ensemble_form_3 = normal_transform(R)
zipped_ensemble = list(zip(ensemble_form_1,ensemble_form_2,ensemble_form_3))
df_ensemble = pd.DataFrame(zipped_ensemble, columns = ['Ensemble_1', 'Ensemble_2','Ensemble_3'])
return ensemble_form_1, ensemble_form_2, ensemble_form_3
How could I repeat the same randomisation process to create a fixed number of lists (say 50 or 100), and then combine them into a table? Is there an easy way to do this with a for loop, or any other method? I'd need to be able to pick out each new list/column individually, as I would be combining the results in some way.
Any help would be greatly appreciated.

You can construct multiple lists and a table like this:
import pandas as pd
import numpy as np
# Your function for creating the individual lists
def normal_transform(R):
    R_ensemble=[]
    for i in range(0,len(R)):
        if R[i]==0:
            R_ensemble.append(0)
        else:
            R_ensemble.append(np.random.normal(loc=R[i],scale=R[i]/4,size=None))
    return R_ensemble
# Construction of multiple lists and the dataframe
NUM_LISTS = 50
R = list(range(100))
data = dict()
for i in range(NUM_LISTS):
    data['Ensemble_' + str(i)] = normal_transform(R)
df_ensemble = pd.DataFrame(data)
You can access the individual lists/columns like this:
df_ensemble['Ensemble_42']
df_ensemble[df_ensemble.columns[42]]
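
The same dictionary-of-columns pattern can be tried without any third-party packages. The sketch below deliberately swaps np.random.normal for the standard library's random.gauss, so it is not the code from the question; build_ensembles, the seed parameter and the short R list are assumptions for illustration.

```python
import random


def normal_transform(R, rng):
    """Perturb each non-zero value with Gaussian noise (std = value / 4)."""
    return [0 if r == 0 else rng.gauss(r, r / 4) for r in R]


def build_ensembles(R, num_lists, seed=None):
    """Return a dict mapping 'Ensemble_<i>' to one perturbed copy of R."""
    rng = random.Random(seed)  # a seed makes the run repeatable
    return {f"Ensemble_{i}": normal_transform(R, rng) for i in range(num_lists)}


R = [0, 4, 8, 12]
columns = build_ensembles(R, num_lists=50, seed=0)

# Pick out one list by name.
one_column = columns["Ensemble_42"]

# zip(*...) transposes the columns into rows, one tuple per entry of R.
rows = list(zip(*columns.values()))
print(len(columns), len(rows), len(rows[0]))
```

A dict of equal-length lists like this can also be passed straight to pd.DataFrame, as in the answer above.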

You can use zip() with * to create a dataframe with a variable number of columns. For example:
import pandas as pd
def generate_list(n):
    #... generate your list here
    return [*range(n)]
def get_dataframe(n_columns, n):
    return pd.DataFrame(zip(*[generate_list(n) for _ in range(n_columns)]), columns=['Ensemble_{}'.format(i) for i in range(1, n_columns+1)])
print(get_dataframe(8, 10))
Prints (8 columns, 10 rows):
Ensemble_1 Ensemble_2 Ensemble_3 Ensemble_4 Ensemble_5 Ensemble_6 Ensemble_7 Ensemble_8
0 0 0 0 0 0 0 0 0
1 1 1 1 1 1 1 1 1
2 2 2 2 2 2 2 2 2
3 3 3 3 3 3 3 3 3
4 4 4 4 4 4 4 4 4
5 5 5 5 5 5 5 5 5
6 6 6 6 6 6 6 6 6
7 7 7 7 7 7 7 7 7
8 8 8 8 8 8 8 8 8
9 9 9 9 9 9 9 9 9

reading a file that is detected as being one column

I have a file full of numbers in the form;
010101228522 0 31010 3 3 7 7 43 0 2 4 4 2 2 3 3 20.00 89165.30
01010222852313 3 0 0 7 31027 63 5 2 0 0 3 2 4 12 40.10 94170.20
0101032285242337232323 7 710153 9 22 9 9 9 3 3 4 80.52 88164.20
0101042285252313302330302323197 9 5 15 9 15 15 9 9 110.63 98168.80
01010522852617 7 7 3 7 31330 87 6 3 3 2 3 2 5 15 50.21110170.50
...
...
I am trying to read this file, but I am not sure how to go about it. When I use the built-in open function, loadtxt from NumPy, or even pandas, the file is read as a single column, that is, its shape is (364 x 1). I want the numbers to be separated into columns and the blank spaces to be replaced by zeros. Any help would be appreciated. Note: in some places there are two consecutive spaces.

If the column's content is a string, have you tried str.split()? It turns the string into a list, with the numbers separated at each gap. You could then loop over the items in that list to build a table from them. I'm not quite sure this answers the question; sorry if not.

So I finally solved my problem. I had to strip each line and then read each "letter" from it; in my case I pick the individual characters from the stripped line and append them to a list. Here is the code for my solution:
import pandas as pd
arr = []
with open('Kp2001', 'r') as f:
    for ii, line in enumerate(f):
        arr.append([]) #Start a new row for this line
        cnt = line.strip() #Strip the line
        for letter in cnt: #Get each 'letter' from the line, in my case it's the individual characters
            arr[ii].append(letter) #Append them individually so Python does not read them as one string
df = pd.DataFrame(arr) #Converting to a DataFrame then gives proper columns and keeps the spaces in their respective columns
df2 = df.replace(' ', 0) #Replace the spaces with whatever you like
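
If the goal is whole numbers per column rather than single characters, one option is to slice each line by fixed column widths. The sketch below is only illustrative: the widths and the sample line are invented, since the real column layout of the file is not given in the question.

```python
def parse_fixed_width(line, widths):
    """Cut a line into fields of the given widths; blank fields become '0'."""
    fields = []
    pos = 0
    for width in widths:
        chunk = line[pos:pos + width].strip()
        fields.append(chunk if chunk else "0")
        pos += width
    return fields


# Hypothetical layout: three 2-character fields followed by one 4-character field.
WIDTHS = [2, 2, 2, 4]
sample_line = "01  0322.5"

print(parse_fixed_width(sample_line, WIDTHS))
```

Applied to every line of the file, this yields a list of rows that can be passed to pd.DataFrame. pandas also offers read_fwf for fixed-width files, which may be worth trying once the actual widths are known.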

Why is set not calculating my unique integers?

I just started teaching myself Python last night via Python documentation, tutorials and SO questions.
So far I can ask a user for a file, open and read the file, remove all # and beginning \n in the file, read each line into an array, and count the number of integers per line.
I want to calculate the number of unique integers per line. I realized that Python has a set type, which I thought would work perfectly for this calculation. However, I always get a value one greater than the previous one (I will show you). I looked at other SO posts related to sets and cannot see what I am missing; I have been stumped for a while.
Here is the code:
with open(filename, 'r') as file:
    for line in file:
        if line.strip() and not line.startswith("#"):
            #calculate the number of integers per line
            names_list.append(line)
            #print "There are ", len(line.split()), " numbers on this line"
            #print names_list
            #calculate the number of unique integers
            myset = set(names_list)
            print myset
            myset_count = len(myset)
            print "unique:",myset_count
For further explanation:
names_list is:
['1 2 3 4 5 6 5 4 5\n', '14 62 48 14\n', '1 3 5 7 9\n', '123 456 789 1234 5678\n', '34 34 34 34 34\n', '1\n', '1 2 2 2 2 2 3 3 4 4 4 4 5 5 6 7 7 7 1 1\n']
and myset is:
set(['1 2 3 4 5 6 5 4 5\n', '1 3 5 7 9\n', '34 34 34 34 34\n', '14 62 48 14\n', '1\n', '1 2 2 2 2 2 3 3 4 4 4 4 5 5 6 7 7 7 1 1\n', '123 456 789 1234 5678\n'])
The output I receive is:
unique: 1
unique: 2
unique: 3
unique: 4
unique: 5
unique: 6
unique: 7
The output that should occur is:
unique: 6
unique: 3
unique: 5
unique: 5
unique: 1
unique: 1
unique: 7
Any suggestions as to why my set per line is not calculating the correct number of unique integers per line? I would also like any suggestions on how to improve my code in general (if you would like) because I just started learning Python by myself last night and would love tips. Thank you.

The problem is that as you are iterating over your file you are appending each line to the list names_list. After that, you build a set out of these lines. Your text file does not seem to have any duplicate lines, so printing the length of your set just displays the current number of lines you have processed.
Here's a commented fix:
with open(filename, 'r') as file:
    for line in file:
        if line.strip() and not line.startswith("#"):
            numbers = line.split() # splits the string by whitespace and gives you a list
            unique_numbers = set(numbers) # builds a set of the strings in numbers
            print(len(unique_numbers)) # prints number of items in the set
Note that we are using the currently processed line and build a set from it (after splitting the line). Your original code stores all lines and then builds a set from the lines in each loop.
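
As a quick check, the same idea can be run against the names_list shown in the question. This is a Python 3 sketch; the function name unique_counts is made up, and the tokens are converted to int so that, for example, '7' and '07' would count as the same number.

```python
def unique_counts(lines):
    """Return the number of distinct integers on each non-blank, non-comment line."""
    counts = []
    for line in lines:
        if line.strip() and not line.startswith("#"):
            # A set comprehension keeps one copy of each integer on this line only.
            counts.append(len({int(token) for token in line.split()}))
    return counts


names_list = [
    '1 2 3 4 5 6 5 4 5\n',
    '14 62 48 14\n',
    '1 3 5 7 9\n',
    '123 456 789 1234 5678\n',
    '34 34 34 34 34\n',
    '1\n',
    '1 2 2 2 2 2 3 3 4 4 4 4 5 5 6 7 7 7 1 1\n',
]

for count in unique_counts(names_list):
    print("unique:", count)
```

This reproduces the expected output from the question: 6, 3, 5, 5, 1, 1, 7.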

myset = set(names_list)
should be
myset = set(line.split())

How to iterate through an intersection properly

I am trying to iterate through a series of intersections, where each iteration is the intersection of a new set of rows. I have code that looks somewhat like the following:
for liness in range(len(NNCatelogue)):
    for iii in [iii for iii, y in enumerate(NNCatelogue[iii]) if y in set(NNCatelogue[liness]).intersection(catid)]:
        print iii, y
NNCatelogue is essentially a 1268 X 12 matrix, and each new iteration of liness calls a new row. If I simply put in the row number that I want (ie: 0, 1, 2...) then I get the expected output (without the for loop in front). The code that is written above gives the following output:
10 C-18-1064
4 C-18-1122
4 C-18-1122
5 C-18-1122
5 C-18-1122
7 C-18-1122
8 C-18-1122
9 C-18-1122
10 C-18-1122
11 C-18-1122
6 C-18-1122
...
The expected output should be:
0 C-18-1
1 C-18-259
2 C-18-303
3 C-18-304
4 C-18-309
5 C-18-324
6 C-18-335
7 C-18-351
8 C-18-372
9 C-18-373
10 C-18-518
11 C-18-8
Any idea where I might be going wrong? Any help is greatly appreciated!
UPDATE:
I tried a variation of one of the answers, and while it is closer to what I am expecting, it isn't quite there. Here is what I tried:
counter = 0
for row in NNCatelogue:
for value in row:
if value in set(NNCatelogue[counter]).intersection(catid):
print counter, value
counter += 1
The resultant output is:
0 C-18-1
1 C-18-324
2 C-18-351
3 C-18-4
4 C-18-5
5 C-18-6
6 C-18-7
7 C-18-8
8 C-18-9
9 C-18-10
10 C-18-11
11 C-18-12
12 C-18-13
...
So some of the intersections are correct, though it isn't my desired output... Any ideas from here?

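You use iii for too many things: it is the loop variable of the inner for, the loop variable inside the list comprehension, and the row index in enumerate(NNCatelogue[iii]), so it is very hard to tell which row is actually being read. Note also that the list comprehension is fully evaluated before the loop body runs, so in Python 2 (where comprehension variables leak into the enclosing scope) the y in print iii, y is always the last value the comprehension looked at. Give your variables meaningful names and your problem is probably solved.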

As I understand what you need:
counter = 0
for row in NNCatelogue:
for value in row:
if value in catid:
print counter, value
counter += 1
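
If the aim is to list, for every row, the column index and value of each entry that also appears in catid, enumerate can supply the indices directly. The following Python 3 sketch uses small invented stand-ins for NNCatelogue and catid, and the function name matches_per_row is hypothetical.

```python
def matches_per_row(catalogue, wanted_ids):
    """For each row, list (column_index, value) pairs whose value is in wanted_ids."""
    wanted = set(wanted_ids)  # build the lookup set once, not per value
    return [
        [(col, value) for col, value in enumerate(row) if value in wanted]
        for row in catalogue
    ]


# Invented stand-ins for NNCatelogue and catid.
catalogue = [
    ["C-18-1", "C-18-259", "X-1"],
    ["X-2", "C-18-303", "C-18-1"],
]
catid = ["C-18-1", "C-18-259", "C-18-303"]

for row_index, matches in enumerate(matches_per_row(catalogue, catid)):
    for col, value in matches:
        print(row_index, col, value)
```

Because each value comes from the row being scanned, testing membership in catid alone is equivalent to testing membership in the intersection of that row with catid.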

It appears that the intersection offers a non-numeric sort... Do you get the proper set (just the wrong permutation)?
