Applying the Counter from collections to a column in a dataframe - python

I have a column where each row is a list of strings. I want to count the elements over the column in its entirety, not per row, which is what one gets with value_counts() in pandas.
I want to apply Counter() from the collections module, but that runs only on a list. My column in the DataFrame looks like this:
[['FollowFriday', 'Awesome'],
['Covid_19', 'corona', 'Notagain'],
['Awesome'],
['FollowFriday', 'Awesome'],
[],
['corona', 'Notagain'],
....]
I want to get the counts, such as
[('FollowFriday', 2),
('Awesome', 3),
('corona', 2),
('Covid_19', 1),
('Notagain', 2),
.....]
The basic command that I am using is:
from collections import Counter
Counter(df['column'])
OR
from collections import Counter
Counter(" ".join(df['column']).split()).most_common()
Any help would be greatly appreciated!

IIUC, your comparison to pandas was only to explain your goal and you want to work with lists?
You can use:
l = [
    ['FollowFriday', 'Awesome'],
    ['Covid_19', 'corona', 'Notagain'],
    ['Awesome'],
    ['FollowFriday', 'Awesome'],
    [],
    ['corona', 'Notagain'],
]
from collections import Counter
from itertools import chain
out = Counter(chain.from_iterable(l))
or if you have a Series of lists, use explode:
out = Counter(df['column'].explode())
# OR
out = df['column'].explode().value_counts()
output:
Counter({'FollowFriday': 2,
'Awesome': 3,
'Covid_19': 1,
'corona': 2,
'Notagain': 2})
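If you want the list of (element, count) tuples shown in the question rather than a Counter, most_common() gives exactly that:
out.most_common()
# [('Awesome', 3), ('FollowFriday', 2), ('corona', 2), ('Notagain', 2), ('Covid_19', 1)]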

Related

Python converting nested dictionary with list containing float elements as individual elements

I'm collecting values from different arrays and a nested dictionary containing list values, like below. The lists contain millions of rows; I tried pandas DataFrame concatenation but ran out of memory, so I resorted to a for loop.
array1_str = ['user_1', 'user_2', 'user_3','user_4' , 'user_5']
array2_int = [3,3,1,2,4]
nested_dict_w_list = {'outer_dict': {'inner_dict': [[1.0001], [2.0033], [1.3434], [2.3434], [0.44224]]}}
final_out = [[array1_str[i], array2_int[i], nested_dict_w_list['outer_dict']['inner_dict'][array2_int[i]]] for i in range(len(array2_int))]
I'm getting the output as
user_1, 3, [2.3434]
user_2, 3, [2.3434]
user_3, 1, [1.0001]
user_4, 2, [1.3434]
user_5, 4, [0.44224]
But I want the output as
user_1, 3, 2.3434
user_2, 3, 2.3434
user_3, 1, 1.0001
user_4, 2, 1.3434
user_5, 4, 0.44224
I eventually need to convert this to a parquet file. I'm using a Spark dataframe for the conversion, but the schema comes out as array(double) when I need just double. Any input is appreciated.
The for loop below works, but is there a more efficient and elegant solution?
final_output = []
for i in range(len(array2_int)):
    index = nested_dict_w_list['outer_dict']['inner_dict'][array2_int[i]]
    final_output.append([array1_str[i], array2_int[i], index[0]])
You can modify your original list comprehension, by indexing to item zero:
final_out = [
    (array1_str[i], array2_int[i], nested_dict_w_list['outer_dict']['inner_dict'][array2_int[i]][0])
    for i in range(len(array2_int))
]
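As a quick sanity check (reusing the sample data above): each third element is now a plain float, so Spark should infer double rather than array(double).
print(type(final_out[0][2]))  # <class 'float'>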

Pandas get count of value stored in an array in a column

I have a pandas data frame where one of the columns is an array of keywords; one row in the data frame would look like
id, jobtitle, company, url, keywords
1, Software Engineer, Facebook, http://xx.xx, [javascript, java, python]
However, the number of keywords per row can range from 1 to 40.
I would like to do some data analysis, like:
what keyword appears most often across the whole dataset
what keywords appear most often for each job title/company
Apart from giving each keyword its own column and dealing with lots of NaN values, is there an easy way to answer these questions with Python (presumably pandas, as it's a dataframe)?
You can do something like this:
import pandas as pd

keyword_dict = {}

def count_keywords(keyword):
    for item in keyword:
        if item in keyword_dict:
            keyword_dict[item] += 1
        else:
            keyword_dict[item] = 1

def new_function():
    data = {'keywords':
            [['hello', 'test'], ['test', 'other'], ['test', 'hello']]
            }
    df = pd.DataFrame(data)
    df.keywords.map(count_keywords)
    print(keyword_dict)

if __name__ == '__main__':
    new_function()
output
{'hello': 2, 'test': 3, 'other': 1}
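For the per-job-title/company part of the question, a more pandas-native sketch is to explode the list column so that value_counts and groupby apply directly (the frame below is a stand-in for the real one):
import pandas as pd

df = pd.DataFrame({
    'jobtitle': ['Software Engineer', 'Data Scientist'],
    'keywords': [['javascript', 'java', 'python'], ['python', 'r']],
})
exploded = df.explode('keywords')
print(exploded['keywords'].value_counts())                      # most common overall
print(exploded.groupby('jobtitle')['keywords'].value_counts())  # most common per job title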

Create a nested dict containing list from a file

For example, for the txt file of
Math, Calculus, 5
Math, Vector, 3
Language, English, 4
Language, Spanish, 4
into the dictionary of:
data = {'Math': {'name': ['Calculus', 'Vector'], 'score': [5, 3]}, 'Language': {'name': ['English', 'Spanish'], 'score': [4, 4]}}
I am having trouble appending values to build the lists inside the inner dicts. I'm very new to this and don't yet understand import commands. Thank you so much for all your help!
For each line, find the 3 values, then add them to a dict structure
from pathlib import Path

result = {}
for row in Path("test.txt").read_text().splitlines():
    subject_type, subject, score = row.split(", ")
    if subject_type not in result:
        result[subject_type] = {'name': [], 'score': []}
    result[subject_type]['name'].append(subject)
    result[subject_type]['score'].append(int(score))
You can simplify it with a defaultdict that creates the mapping if the key isn't already present:
from collections import defaultdict

result = defaultdict(lambda: {'name': [], 'score': []})
for row in Path("test.txt").read_text().splitlines():
    subject_type, subject, score = row.split(", ")
    result[subject_type]['name'].append(subject)
    result[subject_type]['score'].append(int(score))
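A quick check with the test.txt contents from the question shows the structure matches the one asked for:
print(dict(result))
# {'Math': {'name': ['Calculus', 'Vector'], 'score': [5, 3]},
#  'Language': {'name': ['English', 'Spanish'], 'score': [4, 4]}}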
With pandas you can directly read the formatted data and output the format you want:
import pandas as pd
df = pd.read_csv("test.txt", sep=", ", engine="python", names=['key', 'name', 'score'])
df = df.groupby('key').agg(list)
result = df.to_dict(orient='index')
From your data:
data={'Math':{'name':['Calculus', 'Vector'], 'score':[5,3]},
'Language':{'name':['English', 'Spanish'], 'score':[4,4]}}
If you want to append to the list inside your dictionary, you can do:
data['Math']['name'].append('Algebra')
data['Math']['score'].append(4)
If you want to add a new dictionary, you can do:
data['Science'] = {'name': ['Chemistry', 'Biology'], 'score': [2, 3]}
I am not sure if that is what you wanted but I hope it helps!

What is the fastest way to dedupe multivariate data?

Let's assume a very simple data structure. In the below example, IDs are unique. "date" and "id" are strings, and "amount" is an integer.
data = [[date1, id1, amount1], [date2, id2, amount2], etc.]
If date1 == date2 and id1 == id2, I'd like to merge the two entries into one and basically add up amount1 and amount2 so that data becomes:
data = [[date1, id1, amount1 + amount2], etc.]
There are many duplicates.
As data is very big (over 100,000 entries), I'd like to do this as efficiently as possible. What I did was create a new "common" field that is basically date + id combined into one string, with metadata allowing me to split it later: date + id + "_" + str(len(date)).
In terms of complexity, I have four loops:
Parse and load data from external source (it doesn't come in lists) | O(n)
Loop over data and create and store "common" string (date + id + metadata) - I call this "prepared data" where "common" is my encoded field | O(n)
Use the Counter() object to dedupe "prepared data" | O(n)
Decode "common" | O(n)
I don't care about memory here, I only care about speed. I could write a nested loop and avoid steps 2, 3 and 4, but that would be a time-complexity disaster (O(n²)).
What is the fastest way to do this?
Consider a defaultdict for aggregating data by a unique key:
Given
Some random data
import random
import collections as ct
random.seed(123)
# Random data
dates = ["2018-04-24", "2018-05-04", "2018-07-06"]
ids = "A B C D".split()
amounts = lambda: random.randrange(1, 100)
ch = random.choice
data = [[ch(dates), ch(ids), amounts()] for _ in range(10)]
data
Output
[['2018-04-24', 'C', 12],
['2018-05-04', 'C', 14],
['2018-04-24', 'D', 69],
['2018-07-06', 'C', 44],
['2018-04-24', 'B', 18],
['2018-05-04', 'C', 90],
['2018-04-24', 'B', 1],
['2018-05-04', 'A', 77],
['2018-05-04', 'A', 1],
['2018-05-04', 'D', 14]]
Code
dd = ct.defaultdict(int)
for date, id_, amt in data:
    key = "{}{}_{}".format(date, id_, len(date))
    dd[key] += amt
dd
Output
defaultdict(int,
{'2018-04-24B_10': 19,
'2018-04-24C_10': 12,
'2018-04-24D_10': 69,
'2018-05-04A_10': 78,
'2018-05-04C_10': 104,
'2018-05-04D_10': 14,
'2018-07-06C_10': 44})
Details
A defaultdict is a dictionary that calls a default factory (a specified function) for any missing keys. In this case, every date + id combination is uniquely added to the dict. The amounts are added to the values of existing keys; otherwise an integer (0) initializes a new entry in the dict.
For illustration, you can visualize the aggregated values using a list as the default factory.
dd = ct.defaultdict(list)
for date, id_, val in data:
    key = "{}{}_{}".format(date, id_, len(date))
    dd[key].append(val)
dd
Output
defaultdict(list,
{'2018-04-24B_10': [18, 1],
'2018-04-24C_10': [12],
'2018-04-24D_10': [69],
'2018-05-04A_10': [77, 1],
'2018-05-04C_10': [14, 90],
'2018-05-04D_10': [14],
'2018-07-06C_10': [44]})
We see three occurrences of duplicate keys where the values were appropriately summed. Regarding efficiency, notice:
keys are made with format(), which should be a bit better than string concatenation and calling str()
every key and value is computed in the same iteration
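The question's step 4 (decoding "common") is then a single string operation per key; a minimal sketch that recovers (date, id, amount) rows from the summed defaultdict(int) above using the "_<len>" metadata:
decoded = []
for key, amt in dd.items():
    prefix, _, n = key.rpartition("_")  # split off the length metadata
    n = int(n)
    decoded.append([prefix[:n], prefix[n:], amt])
# e.g. ['2018-04-24', 'B', 19]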
Using pandas makes this really easy:
import pandas as pd
df = pd.DataFrame(data, columns=['date', 'id', 'amount'])
df.groupby(['date','id']).sum().reset_index()
For more control you can use agg instead of sum():
df.groupby(['date','id']).agg({'amount':'sum'})
Depending on what you are doing with the data, it may be easier/faster to go this way just because so much of pandas is built on compiled C extensions and optimized routines that make it super easy to transform and manipulate.
You could import the data into a structure that prevents duplicates and then convert it to a list.
data = {
    date1: {
        id1: amount1,
        id2: amount2,
    },
    date2: {
        id3: amount3,
        id4: amount4,
        ....
    },
}
The program's skeleton:
ddata = collections.defaultdict(dict)
for date, id_, amount in DATASOURCE:
    # accumulate rather than assign, so duplicate (date, id) amounts are summed
    ddata[date][id_] = ddata[date].get(id_, 0) + amount
data = [[d, i, a] for d, subd in ddata.items() for i, a in subd.items()]

Change order of list of lists according to another list

I have a bunch of CSV files where the first line holds the column names, and now I want to change the column order according to another list.
Example:
[
['date','index','name','position'],
['2003-02-04','23445','Steiner, James','98886'],
['2003-02-04','23446','Holm, Derek','2233'],
...
]
The above order differs slightly between the files, but the same column-names are always available.
So I want the columns to be re-arranged as:
['index','date','name','position']
I can solve it by comparing the first row, making an index for each column, then re-mapping each row into a new list of lists using a for loop.
And while it works, it feels so ugly even my blind old aunt would yell at me if she saw it.
Someone on IRC told me to look at map() and operator, but I'm just not experienced enough to puzzle those together. :/
Thanks.
Plain Python
You could use zip to transpose your data:
data = [
['date','index','name','position'],
['2003-02-04','23445','Steiner, James','98886'],
['2003-02-04','23446','Holm, Derek','2233']
]
columns = list(zip(*data))
print(columns)
# [('date', '2003-02-04', '2003-02-04'), ('index', '23445', '23446'), ('name', 'Steiner, James', 'Holm, Derek'), ('position', '98886', '2233')]
It becomes much easier to modify the columns order now.
To calculate the needed permutation, you can use:
old = data[0]
new = ['index','date','name','position']
mapping = {i: old.index(v) for i, v in enumerate(new)}  # new position -> old position
# {0: 1, 1: 0, 2: 2, 3: 3}
You can apply the permutation to the columns:
columns = [columns[mapping[i]] for i in range(len(columns))]
# [('index', '23445', '23446'), ('date', '2003-02-04', '2003-02-04'), ('name', 'Steiner, James', 'Holm, Derek'), ('position', '98886', '2233')]
and transpose them back:
list(zip(*columns))
# [('index', 'date', 'name', 'position'), ('23445', '2003-02-04', 'Steiner, James', '98886'), ('23446', '2003-02-04', 'Holm, Derek', '2233')]
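The whole round-trip can also be done without transposing, by permuting each row directly (a compact sketch over the same data and target order):
order = [data[0].index(name) for name in ['index', 'date', 'name', 'position']]
reordered = [[row[p] for p in order] for row in data]
# [['index', 'date', 'name', 'position'], ['23445', '2003-02-04', 'Steiner, James', '98886'], ...]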
With Pandas
For this kind of task, you should use pandas.
It can parse CSVs, reorder columns, sort them and keep an index.
If you have already imported the data, you can use the following to build a DataFrame, take the first row as the header, and set the index column as the index.
import pandas as pd
df = pd.DataFrame(data[1:], columns=data[0]).set_index('index')
df then becomes:
date name position
index
23445 2003-02-04 Steiner, James 98886
23446 2003-02-04 Holm, Derek 2233
You can avoid those steps by importing the CSV directly with pandas.read_csv. Note that usecols=['index','date','name','position'] only selects which columns to keep; it does not reorder them, so you still need to reindex the result to get the desired order.
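A minimal sketch ('file.csv' is a stand-in filename):
import pandas as pd

df = pd.read_csv('file.csv', usecols=['index', 'date', 'name', 'position'])
df = df[['index', 'date', 'name', 'position']]  # enforce the desired column order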
Simple and stupid:
LIST = [
    ['date', 'index', 'name', 'position'],
    ['2003-02-04', '23445', 'Steiner, James', '98886'],
    ['2003-02-04', '23446', 'Holm, Derek', '2233'],
]
NEW_HEADER = ['index', 'date', 'name', 'position']

def swap(lists, new_header):
    mapping = {}
    for lst in lists:
        if not mapping:
            # build old position -> new position from the header row
            mapping = {
                old_pos: new_pos
                for new_pos, new_field in enumerate(new_header)
                for old_pos, old_field in enumerate(lst)
                if new_field == old_field}
        yield [item for _, item in sorted(
            [(mapping[index], item) for index, item in enumerate(lst)])]

if __name__ == '__main__':
    print(LIST)
    print(list(swap(LIST, NEW_HEADER)))
To rearrange your data, you can use a dictionary:
import csv

s = [
    ['date', 'index', 'name', 'position'],
    ['2003-02-04', '23445', 'Steiner, James', '98886'],
    ['2003-02-04', '23446', 'Holm, Derek', '2233'],
]

new_data = [{a: b for a, b in zip(s[0], i)} for i in s[1:]]
final_data = [[b[c] for c in ['index', 'date', 'name', 'position']] for b in new_data]
with open('filename.csv', 'w', newline='') as f:  # open for writing; the default mode 'r' cannot write
    write = csv.writer(f)
    write.writerows(final_data)
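The same reordering is also available out of the box via csv.DictReader/csv.DictWriter, which order columns by fieldnames; a sketch assuming the input lives in 'input.csv' with a header row:
import csv

fields = ['index', 'date', 'name', 'position']
with open('input.csv', newline='') as src, open('filename.csv', 'w', newline='') as dst:
    reader = csv.DictReader(src)                     # yields each row as a dict keyed by the header
    writer = csv.DictWriter(dst, fieldnames=fields)  # writes columns in the order given by fields
    writer.writeheader()
    writer.writerows(reader)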
