Count of values of all categorical variables using Python - python

I have a dataset with a large number of columns. How do I calculate the frequency of values of all categorical variables in Python? I don't want the frequency for just one or two specific columns; rather, I need the frequencies for all variables of type="category".

Use select_dtypes() to select the columns with type="category", and use the count() method to get the number of non-null values in each:
df.select_dtypes(include='category').count()
Output:
col_cat1     9
col_cat2    21
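If you need the frequency of each individual value rather than a per-column total, you can apply value_counts to every categorical column. A minimal sketch, assuming a small made-up DataFrame since the question does not include one:

import pandas as pd

df = pd.DataFrame({'col_cat1': pd.Categorical(['a', 'b', 'a']),
                   'col_cat2': pd.Categorical(['x', 'x', 'y'])})

# one value_counts Series per categorical column, combined into a DataFrame;
# each row is an observed value, NaN means the value never occurs in that column
freq = df.select_dtypes(include='category').apply(pd.Series.value_counts)
print(freq)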

Not entirely sure I know what you mean, but if you just want to keep a running count of frequencies, dictionaries are a great way to do this.
E.g. if we use the dummy data ['A', 'A', 'B', 'A', 'C', 'C']:
categories = ['A', 'A', 'B', 'A', 'C', 'C']
category_counts = {}
for category in categories:
    try:
        category_counts[category] += 1
    except KeyError:
        category_counts[category] = 1
print(category_counts)
returns:
{'A': 3, 'B': 1, 'C': 2}
EDIT: if you want a count of the categories of each column, the code only changes slightly:
table = [['Male/Female', 'M', 'M', 'F', 'M', 'F'],
         ['Age', '10-20', '30-40', '10-20', '20-30', '10-20']]
category_counts = {}
for column in table:
    category_counts[column[0]] = {}   # the first entry of each column is its header
    for data in column[1:]:
        try:
            category_counts[column[0]][data] += 1
        except KeyError:
            category_counts[column[0]][data] = 1
print(category_counts)
Which prints:
{'Male/Female': {'M': 3, 'F': 2}, 'Age': {'10-20': 3, '30-40': 1, '20-30': 1}}
But I'm unsure how you're currently storing your data.
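As a side note, the standard library's collections.Counter does the same bookkeeping without the try/except; a minimal sketch on the same dummy data:

from collections import Counter

categories = ['A', 'A', 'B', 'A', 'C', 'C']
print(Counter(categories))
# Counter({'A': 3, 'C': 2, 'B': 1})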

Related

Best Way to Count Occurrences of Each Character in a Large Dataset

I am trying to count the number of occurrences of each character within a large dataset. For example, if the data were the numpy array ['A', 'AB', 'ABC'], then I would want {'A': 3, 'B': 2, 'C': 1} as the output. I currently have an implementation that looks like this:
import string
import numpy as np

char_count = {}
for c in string.printable:
    char_count[c] = np.char.count(data, c).sum()
The issue I am having is that this takes too long for my data. I have ~14,000,000 different strings that I would like to count and this implementation is not efficient for that amount of data. Any help is appreciated!
Another way:
import collections

c = collections.Counter()
for thing in data:
    c.update(thing)   # updating with a string counts each of its characters
Same basic advantage: it only iterates the data once.
One approach:
import numpy as np
from collections import defaultdict

data = np.array(['A', 'AB', 'ABC'])

counts = defaultdict(int)
for e in data:
    for c in e:
        counts[c] += 1
print(counts)
Output
defaultdict(<class 'int'>, {'A': 3, 'B': 2, 'C': 1})
Note that your code iterates over data len(string.printable) times; in contrast, my proposal iterates over it once.
One alternative using a plain dictionary:
data = np.array(['A', 'AB', 'ABC'])

counts = dict()
for e in data:
    for c in e:
        counts[c] = counts.get(c, 0) + 1   # get() supplies 0 for unseen keys
print(counts)
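For completeness, the same single-pass idea condenses into one Counter call. A minimal sketch, assuming data is any iterable of strings:

from collections import Counter
from itertools import chain

data = ['A', 'AB', 'ABC']
counts = Counter(chain.from_iterable(data))   # chain flattens the strings into one stream of characters
print(counts)
# Counter({'A': 3, 'B': 2, 'C': 1})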

Can I join two data frames using one column in df1 and any one of several values in a cell in df2?

I'm working with some geospatial data, df_geo, and have a CSV of values, df_data, that I'd like to join to the location data frame.
My issue, however, is that there are multiple ways to spell the values in the column I'd like to join the two data frames on (region names). Look at the Catalonia example below in df_geo: there are 6 different ways to spell the region name, depending on the language.
My question is this: if the row is named "Catalonia" in df_data, how would I go about joining df_data to df_geo?
Since the rows are unique to a region, you can create a dictionary that maps any name in 'VARNAME_1' to the index from df_geo.
Then use this to map the names in df_data to a dummy column, and you can do a simple merge on the index in df_geo and the mapped column in df_data.
To get the dictionary do:
d = dict((y, ids) for ids, val in df_geo.VARNAME_1.str.split(r'\\').items()
         for y in val)
Sample Data:
import pandas as pd

# raw strings keep the backslashes literal
df_geo = pd.DataFrame({'VARNAME_1': [r'Catalogna\Catalogne\Catalonia', r'A\B\C\D\E\F\G']})
df_data = pd.DataFrame({'Name': ['Catalogna', 'Seven', 'E'],
                        'Vals': [1, 2, 3]})
Code
d = dict((y, ids) for ids, val in df_geo.VARNAME_1.str.split(r'\\').items()
         for y in val)
#{'A': 1,
# 'B': 1,
# 'C': 1,
# 'Catalogna': 0,
# 'Catalogne': 0,
# 'Catalonia': 0,
# 'D': 1,
# 'E': 1,
# 'F': 1,
# 'G': 1}
df_data['ID'] = df_data.Name.map(d)
df_data.merge(df_geo, left_on='ID', right_index=True, how='left').drop(columns='ID')
Output:
Name Vals VARNAME_1
0 Catalogna 1 Catalogna\Catalogne\Catalonia
1 Seven 2 NaN
2 E 3 A\B\C\D\E\F\G
How the dictionary works.
df_geo.VARNAME_1.str.split(r'\\') splits the string in VARNAME_1 on the '\' character and places the separated values in a Series of lists. Using .items() on the Series gives you tuples (which we unpack into two separate values): the first value is the index, which is the same as the index of the original DataFrame, and the second is the list of split names.
for ids, val in df_geo.VARNAME_1.str.split(r'\\').items():
    print(f'id:{ids} and val:{val}')
#id:0 and val:['Catalogna', 'Catalogne', 'Catalonia']
#id:1 and val:['A', 'B', 'C', 'D', 'E', 'F', 'G']
So now val is a list, which we again want to iterate over to create our dictionary.
for ids, val in df_geo.VARNAME_1.str.split(r'\\').items():
    for y in val:
        print(f'id:{ids} and y:{y}')
#id:0 and y:Catalogna
#id:0 and y:Catalogne
#id:0 and y:Catalonia
#id:1 and y:A
#id:1 and y:B
#id:1 and y:C
#id:1 and y:D
#id:1 and y:E
#id:1 and y:F
#id:1 and y:G
And so the dictionary I created has y as the key and the original DataFrame index ids as the value.
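As a possible alternative (a sketch of my own, not part of the answer above): pandas' Series.explode, available since pandas 0.25, can build the same lookup without a manual dictionary. The names below match the sample data; the split pattern is a one-character literal backslash, so it behaves the same across pandas versions:

lookup = (df_geo['VARNAME_1']
          .str.split('\\')   # '\\' is one literal backslash character
          .explode()         # one row per alternative spelling
          .reset_index())    # keep df_geo's index as a column
lookup.columns = ['ID', 'Name']

merged = (df_data.merge(lookup, on='Name', how='left')
                 .merge(df_geo, left_on='ID', right_index=True, how='left')
                 .drop(columns='ID'))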

Extend dictionary values by same key for SVM training data

Hi, I'm quite new to Python and machine learning. I want to extract an SVM's x and y from two dictionaries.
The two dictionaries look like:
DIC_01
{'A': ['Low'],
'B': ['High']}
DIC_02
{'A': [2623.83740234375,
-1392.9608154296875,
416.20831298828125],
'B': [1231.1268310546875,
-963.231201171875,
1823.742431640625]}
About the data: the keys of the dictionaries are my 'keywords'. DIC_01 was converted from a dataframe; its values are the keyword's probability of sales. DIC_02 holds the vectors that represent the keywords.
I want to organise these dictionaries into SVM training-data format: x is the value from DIC_02, y is the value from DIC_01.
I don't know the most efficient way to do this task. At the moment I'm thinking...
step 1: merge values with the same keys
{'A': [2623.83740234375,
-1392.9608154296875,
416.20831298828125],['Low'],
'B': [1231.1268310546875,
-963.231201171875,
1823.742431640625],['High']}
step 2: extract the first and second value as SVM's x and y then train the model.
Thank you!
Hi, is this what you want to do?
DIC_01 = {'A': ['Low'],
          'B': ['High']}
DIC_02 = {'A': [2623.83740234375,
                -1392.9608154296875,
                416.20831298828125],
          'B': [1231.1268310546875,
                -963.231201171875,
                1823.742431640625]}

smv_X = []
smv_Y = []
for e in DIC_01:
    smv_X.append(DIC_02[e])      # feature vector for keyword e
    smv_Y.append(DIC_01[e][0])   # label for keyword e

print(smv_X)  # [[2623.83740234375, -1392.9608154296875, 416.20831298828125], [1231.1268310546875, -963.231201171875, 1823.742431640625]]
print(smv_Y)  # ['Low', 'High']
Equivalently, you can iterate over keys and values together:
for k, v in DIC_01.items():
    smv_X.append(DIC_02[k])   # k = key (the keyword)
    smv_Y.append(v[0])        # v = value (the one-element label list)

Optimization of data input

I have an array whose elements I would like to increment when a new user votes.
For example, if there are 10 options, numbered 1 to 10, and one user voted i=3, it would be easy to write:
A[i] = A[i] + 1
In the case where the options run from 'A' to 'I', how can I do this? I can't use a letter to index a specific array element.
With a few thousand users, I don't want a nested loop that scans the whole array to find which element the choice 'i' corresponds to each time.
Can I do this in O(n) time?
It is as simple as using a dictionary: a dictionary is like an array where, instead of an index, you have a key, and each key maps to a value. For more information visit https://www.tutorialspoint.com/python/python_dictionary.htm
So in your example just define a dictionary like this:
data = {'A': 0, 'B': 0, 'C': 0 .....}
then find the letter that you want to upvote and increment it:
data['A'] += 1
print(data)
>>> {'A': 1, 'B': 0, 'C': 0}
You can even have dictionaries inside dictionaries, for example:
data = {'A': {'Votes': 0, 'Description': ''}, 'B': {'Votes': 0, 'Description': ''}, 'C': {'Votes': 0, 'Description': ''} .....}
data['A']['Votes'] += 1
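As a side note (my own sketch, not part of the answer above): if the votes arrive as a list of letters, collections.Counter saves you from pre-initializing every key; the vote list here is made up for illustration:

from collections import Counter

votes = ['A', 'C', 'A', 'B', 'A']   # hypothetical stream of ballots
tally = Counter(votes)
print(tally['A'])            # 3
print(tally.most_common(1))  # [('A', 3)]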

Link two dictionaries lists and calculate average values

I am not used to coding with Python, but I have to do this one with it. What I am trying to do is reproduce the result of a SQL statement like this:
SELECT T2.item, AVG(T1.Value) AS MEAN FROM TABLE_DATA T1 INNER JOIN TABLE_ITEMS T2 ON T1.ptid = T2.ptid GROUP BY T2.item
In Python, I have two lists of dictionaries with the common key 'ptid'. My dctData contains around 100,000 ptids and dctItems around 7,000. Using a comprehension like [i for i in dctData for j in dctItems if i['ptid'] == j['ptid']] takes forever:
ptid = 1
dctData = []
for line in lines[6:]:   # skipping header
    data = line.split()
    for d in data:
        dctData.append({'ptid': ptid, 'Value': float(d)})
        ptid += 1

dctData = [{'ptid': 1, 'Value': 0}, {'ptid': 2, 'Value': 2}, {'ptid': 3, 'Value': 2}, {'ptid': 4, 'Value': 5}, {'ptid': 5, 'Value': 3}, {'ptid': 6, 'Value': 2}]

dctItems = []
for line in lines[1:]:   # skipping header
    data = line.split(';')
    dctItems.append({'ptid': int(data[1]), 'item': data[3]})

dctItems = [{'item': 21, 'ptid': 1}, {'item': 21, 'ptid': 2}, {'item': 21, 'ptid': 6}, {'item': 22, 'ptid': 2}, {'item': 22, 'ptid': 5}, {'item': 23, 'ptid': 4}]
Now, what I would like to get as a result is a third list presenting the average value for each item in the dctItems dictionary, where the link between the two dictionaries is the 'ptid' value.
For example, with item 21, it would calculate a mean value of about 1.33 from the values (0, 2, 2) of ptids 1, 2 and 6.
And finally, the result would look something like this, where the key Value represents the calculated mean:
dctResults = [{'id': 21, 'Value': 1.33}, {'id': 22, 'Value': 2.5}, {'id': 23, 'Value': 5}]
How can I achieve this?
Thanks you all for your help.
Given the data structures you use, this is not trivial, but it becomes much easier if you use a single dictionary mapping items to their values instead.
First, let's restructure your data in that way:
values = {entry['ptid']: entry['Value'] for entry in dctData}
items = {}
for item in dctItems:
    items.setdefault(item['item'], []).append(values[item['ptid']])
Now, items has the form {21: [0, 2, 2], 22: [2, 3], 23: [5]}. Of course, it would be even better if you could create the dictionary in this form in the first place.
Now, we can pretty easily calculate the average for all those lists of values:
avg = lambda lst: float(sum(lst))/len(lst)
result = {item: avg(values) for item, values in items.items()}
This way, result is {21: 1.3333333333333333, 22: 2.5, 23: 5.0}
Or if you prefer your "list of dictionaries" style:
dctResult = [{'id': item, 'Value': avg(values)} for item, values in items.items()]
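Since the question starts from a SQL statement, a pandas sketch (my own suggestion, not part of the answer above) maps onto it almost line for line, using the sample lists given in the question:

import pandas as pd

df = pd.DataFrame(dctData).merge(pd.DataFrame(dctItems), on='ptid')   # INNER JOIN ... ON ptid
dctResults = (df.groupby('item', as_index=False)['Value'].mean()      # GROUP BY item, AVG(Value)
                .rename(columns={'item': 'id'})
                .to_dict('records'))
print(dctResults)
# [{'id': 21, 'Value': 1.3333333333333333}, {'id': 22, 'Value': 2.5}, {'id': 23, 'Value': 5.0}]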
