Python: Count occurrences of each number in a pandas DataFrame

I have a dataset for itemset mining. I want to find the occurrences of each unique number, i.e. the candidate 1-itemsets.
The shape of the data is 3000x1. I'm unable to figure out how to count the unique occurrences.
The list of distinct values in the data is stored.
Using the ndarray of distinct values, how can I find the frequency of each item in the dataset?
Update
Got the solution with @jojo's help.
import numpy as np
import pandas as pd
dataset = pd.read_csv('sample.csv', sep=',', header=None)
all_values = dataset.values.ravel()
notNan = np.logical_not(np.isnan(all_values))
distinct, counts = np.unique(all_values[notNan], return_counts=True)

First note that if you have a normal csv (comma separated) you should use sep=','; sep='\t' assumes TAB as the delimiter.
Also, consider adding header=None in your read_csv call, as otherwise the first line will be taken as column names in your data-frame.
Lastly, since the columns appear to have different lengths, you will have NaN values in all columns that are shorter than the longest one. To remove them, you can mask out the NaN values when getting the unique values, with something like values[np.logical_not(np.isnan(values))], but see below.
Putting things together:
dataset = pd.read_csv('dataset.csv', sep=',', header=None)
all_values = dataset.values.ravel()
You can directly use unique from NumPy, which can also return the count of each unique value:
import numpy as np
notNan = np.logical_not(np.isnan(all_values))
distinct, counts = np.unique(all_values[notNan], return_counts=True)
If you care about the frequency rather than the raw count, simply divide counts by all_values[notNan].size.
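For instance, a minimal sketch of that division, reusing the distinct and counts arrays from above:
freqs = counts / all_values[notNan].size  # relative frequency of each value
for value, f in zip(distinct, freqs):
    print(value, f)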
Here is a simple example (from the docs linked above) to highlight how np.unique works:
>>> import numpy as np
>>> a = np.array([1, 2, 6, 4, 2, 3, 2])
>>> values, counts = np.unique(a, return_counts=True)
>>> values # list of all unique values in a
array([1, 2, 3, 4, 6])
>>> counts # count of the occurrences of each value in values
array([1, 3, 1, 1, 1])
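As an aside, if you would rather stay inside pandas, Series.value_counts gives the same information; a sketch, building on the raveled all_values from above (value_counts skips NaN by default):
import pandas as pd
s = pd.Series(all_values)                # wrap the raveled values in a Series
counts = s.value_counts()                # count per unique value (NaN skipped)
freqs = s.value_counts(normalize=True)   # relative frequencies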

Related

How to create a new column in pandas dataframe based on a condition?

I have a data frame with the following columns:
d = {'find_no': [1, 2, 3], 'zip_code': [32351, 19207, 8723]}
df = pd.DataFrame(data=d)
When there are 5 digits in the zip_code column, I want to return True. When there are not 5 digits, I want to return the "find_no".
Sample output would have the results in a column added to the dataframe, corresponding to the row each result references.
You could try np.where:
import numpy as np
df['result'] = np.where(df['zip_code'].astype(str).str.len() == 5, True, df['find_no'])
The only downside with this approach is that NumPy will convert your True values to 1, which could be confusing. One approach to keep the values you want is:
import numpy as np
df['result'] = np.where(df['zip_code'].astype(str).str.len() == 5, 'True', df['find_no'].astype(str))
The downside here is that you lose the meaning of those values by casting them to strings. I guess it all depends on what you're hoping to accomplish.
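A possible middle ground, if keeping the original types matters, is Series.mask, which swaps in True only where the condition holds and leaves the find_no integers untouched (at the cost of an object-dtype column); a sketch using the question's frame:
import pandas as pd
df = pd.DataFrame({'find_no': [1, 2, 3], 'zip_code': [32351, 19207, 8723]})
# Replace entries with True where zip_code has 5 digits, keep find_no otherwise.
df['result'] = df['find_no'].mask(df['zip_code'].astype(str).str.len() == 5, True)
print(df['result'].tolist())  # [True, True, 3]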

Select values from different DataFrame columns based on a list of indexes

Here is my issue, I have a dataframe, let's say:
df = DataFrame({'A' : [5,6,3,4], 'B' : [1,2,3, 5]})
I also have a list of index:
idx = [1, 2]
I would like to store, in a list, the corresponding value in each column, meaning I want the first value of the first column and the second value of the second column.
I'm sure there is a simple answer to my issue; however, I'm mixing everything up with iloc and cannot find a way to develop an optimized method for my case (I have 1000 rows and 4 columns).
IIUC, you can extract the complete rows and then pick the diagonal elements:
result = np.diag(df.values[idx])
Alternative: convert the DataFrame to a NumPy array, then use NumPy indexing to access the required values.
result = df.values[idx, range(len(df.columns))]
OUTPUT:
array([6, 3])
Use:
list(df.values[idx, range(len(idx))])
Output:
[6, 3]
Here is a different way:
df.stack().loc[list(zip(idx, df.columns[:len(idx)]))].to_numpy()
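For reference, here is the fancy-indexing approach end to end on the sample frame (a sketch; to_numpy() is the modern spelling of .values):
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [5, 6, 3, 4], 'B': [1, 2, 3, 5]})
idx = [1, 2]
# Pair row position idx[k] with column position k, picking df.iloc[1, 0]
# and df.iloc[2, 1] in a single vectorized step.
result = df.to_numpy()[idx, np.arange(len(idx))]
print(result)  # [6 3]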

Python: Extract unique index values and use them in a loop

I would like to apply the loop below where for each index value the unique values of a column called SERIAL_NUMBER will be returned. Essentially I want to confirm that for each index there is a unique serial number.
index_values = df.index.levels
for i in index_values:
    x = df.loc[[i]]
    x["SERIAL_NUMBER"].unique()
The problem, however, is that my dataset has a multi-index, and as you can see below it is stored in a FrozenList. I am just interested in the index values that contain a long number. The word "vehicle", which is also an index level, can be removed as it is repeated all over the dataset.
How can I extract these values into a list so I can use them in the loop?
index_values
>>
FrozenList([['0557bf98-c3e0-4955-a23f-2394635ab531', '074705a3-a96a-418c-9bfe-14c37f5c4e6f', '0f47e260-0fa2-40ba-a417-7c00ea74248c', '17342ca2-6246-4150-8080-96d6125cf2b5', '26c6c0d1-0134-4b3a-a149-61dd93afab3b', '7600be43-5d0a-49b3-a1ee-fd107db5822f', 'a07f2b0c-447c-4143-a361-d7ddbffdcc77', 'b929801c-2f32-4a95-bfc4-48a05b48ee01', 'cc912023-0113-42cd-8fe7-4df4005127c2', 'e424bd02-e188-462e-a1a6-2f4ed8fe0a2d'], ['vehicle']])
Without an example it's hard to judge, but I think you need:
df.index.get_level_values(0).unique() # add .tolist() if you want a list
import pandas as pd
df = pd.DataFrame({'A' : [5]*5, 'B' : [6]*5})
df = df.set_index('A',append=True)
df.index.get_level_values(0).unique()
Int64Index([0, 1, 2, 3, 4], dtype='int64')
df.index.get_level_values(1).unique()
Int64Index([5], dtype='int64', name='A')
To drop duplicates from an index level, use the .duplicated() method:
df[~df.index.get_level_values(1).duplicated(keep='first')]
     B
  A
0 5  6
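With that in place, the original goal can be written as a short loop; a sketch, assuming df has the two-level MultiIndex from the question and a SERIAL_NUMBER column:
for i in df.index.get_level_values(0).unique():
    serials = df.loc[i, 'SERIAL_NUMBER'].unique()
    print(i, serials)  # ideally exactly one serial number per index value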

How can I choose rows and columns if the index/header contains a certain integer in a pandas DataFrame?

I have input-output data where the index and header contain numbers that represent different types of industries. I want to create new columns and rows that hold the sum of the columns and rows belonging to a certain industry group. To give an example (please refer to the example that I made manually below), I would want to create a new row/column with the index/header US_industry_135/CAN_industry_135 that sums the rows/columns whose industry number is 1, 3, or 5. The example below is a small set that I created manually, but I wanted to know if there is a way to put that condition into the summation, so that I sum the rows/columns whose index/header contains specific numbers. I could extract the numbers from the header/index and create a separate row/column, but I was wondering if there is a way to check the index/headers directly without creating new columns. Thank you in advance for your help!
import pandas as pd
data = {'US1': [3, 2, 1, 4, 3, 2, 1, 4, 2, 3, 7, 9],
        'US2': [8, 4, 9, 2, 1, 3, 4, 2, 5, 6, 18, 11],
        'US3': [2, 4, 2, 2, 3, 2, 4, 2, 3, 2, 7, 6],
        'US4': [7, 4, 8, 2, 2, 3, 2, 4, 6, 8, 17, 15],
        'US5': [2, 4, 3, 2, 2, 4, 1, 3, 2, 4, 7, 11],
        'CAN1': [3, 2, 1, 4, 6, 2, 3, 1, 4, 2, 10, 5],
        'CAN2': [8, 4, 9, 2, 5, 7, 3, 5, 7, 1, 22, 13],
        'CAN3': [2, 4, 2, 2, 4, 5, 2, 3, 3, 2, 8, 10],
        'CAN4': [7, 4, 8, 2, 2, 3, 1, 3, 2, 4, 17, 10],
        'CAN5': [2, 4, 3, 2, 6, 7, 5, 4, 0, 9, 11, 20],
        'US_IND_135': [7, 10, 6, 8, 8, 8, 6, 9, 7, 9, 21, 26],
        'CAN_IND_135': [7, 10, 6, 8, 16, 14, 10, 8, 7, 13, 29, 35]}
df = pd.DataFrame(data, index=['US1','US2','US3','US4','US5','CAN1','CAN2','CAN3','CAN4','CAN5','US_IND_135','CAN_IND_135'])
df
Let's define the list of indexes of interest:
idx = [1, 3, 5]
Do the summation using specified columns:
df[['US' + str(i) for i in idx]].sum(axis = 1)
Alternatively, if you want to join the summation column to the DataFrame, assign the result to a variable:
s1 = df[['US' + str(i) for i in idx]].sum(axis = 1)
s1.name = 'NEW_US_IND_' + ''.join(str(i) for i in idx)
And add new column:
df.join(s1)
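The question also asks for matching row sums; the same pattern works along the index (a sketch, assuming the row labels follow the same 'US<number>' scheme):
s2 = df.loc[['US' + str(i) for i in idx]].sum(axis=0)
s2.name = 'NEW_US_IND_' + ''.join(str(i) for i in idx)
# Append the summary row by transposing the Series into a one-row frame.
df = pd.concat([df, s2.to_frame().T])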

How to iterate over two dataframe columns and add values from a list based on the values in those two columns

I have a dataframe with three columns like this:
Subject{1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, ...} datetime{6/4/16 3:04:30, 6/5/16 6:02:15, ...} markers{}
It is sorted by subject and then by datetime, and the markers column is empty.
I also have a dictionary which maps subject numbers to lists of datetimes. These datetimes are not exactly the same as the ones already in the dataframe. I want to add all these datetimes to the markers column in their corresponding subject and date row for comparison purposes. So a dictionary with the key (subject) 1 and a list of values like {6/4/16 5:00:15, 6/5/16 6:10:30} would have its first value added to row 1, because the subject and date match, and its second value added to row 2 for the same reason.
I thought of looping through each dictionary key and all its corresponding datetimes, but then finding the corresponding row in the map for each datetime within the nested loops would be very inefficient. It would be something like this:
for subject in df.iloc[:, 0]:
    # go to subject in the dictionary and loop through datetimes in the corresponding list,
    # adding the matching datetime to the current row
    # O(n^2) time!
Is there a more efficient way to do this?
Thanks!
Try this; you will have to customize the answer somewhat to meet your specific needs, but the logic is basically the same.
import pandas as pd
df = pd.DataFrame({'colA': [100, 200], 'colB': ['NaN', 'NaN']})
dict1 = {100: ['rat','cat','hat'], 200: ['hen','men','den']}
df = pd.concat([df['colA'],df['colA'].map(dict1).apply(pd.Series)], axis = 1)
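To see what this does: map pulls each list out of dict1, and apply(pd.Series) spreads it into one column per list element, so the result looks like this:
print(df)
#    colA    0    1    2
# 0   100  rat  cat  hat
# 1   200  hen  men  den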
