group number of counts by category - python

I wrote a script that goes over my data and checks each sentence for emoticons using a regex; when an emoticon is found, a counter is updated. The number of emoticons per category should then be written to a list, for example category ne has 25 emoticons, category fr has 45, and so on. This is where it goes wrong. The results I get are:
[1, 'ag', 2, 'dg', 3, 'dg', 4, 'fr', 5, 'fr', 6, 'fr', 7, 'fr', 8, 'hp', 9, 'hp', 10, 'hp', 11, 'hp', 12, 'hp', 13, 'hp', 14, 'hp', 15, 'hp', 16, 'hp', 17, 'hp', 18, 'hp', 19, 'hp', 20, 'hp', 21, 'hp', 22, 'hp', 23, 'hp', 24, 'hp', 25, 'ne', 26, 'ne', 27, 'ne', 28, 'ne', 29, 'ne', 30, 'ne', 31, 'ne', 32, 'ne', 33, 'ne', 34, 'ne', 35, 'ne', 36, 'ne', 37, 'ne', 38]
The fileids have the form shown below. One big folder contains 7 smaller folders (one per category), and each category folder holds around 100 files:
data/ne/567.txt
The data in each of the .txt files is just one sentence, and looks like this
I am so happy today :)
This is my script:
counter = 0
lijst = []
for fileid in corpus.fileids():
    for sentence in corpus.sents(fileid):
        cat = str(fileid.split('/')[0])
        s = " ".join(sentence)
        m = re.search('(:\)|:\(|:\s|:\D|:\o|:\#)+', s)
        if m is not None:
            counter +=1
            lijst += [counter] + [cat]

You should do:
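import collections
import re
# Number of emoticon matches per category, e.g. {'ne': 25, 'fr': 45}
counts = collections.defaultdict(int)
for fileid in corpus.fileids():
    # The category is the first path component of the fileid
    cat = fileid.split('/')[0]
    for sentence in corpus.sents(fileid):
        s = " ".join(sentence)
        # Raw string; ':o' replaces ':\o', which Python 3 rejects as a bad escape
        counts[cat] += len(re.findall(r'(:\)|:\(|:\s|:\D|:o|:\#)+', s))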
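The same per-category counting idea can be tried without the corpus object. This is only a minimal sketch: the `samples` list is a made-up stand-in for the corpus, and the `EMOTICON` pattern is a deliberately simplified assumption that matches just `:)` and `:(`, not the pattern from the question.

```python
import re
from collections import Counter

# Hypothetical stand-in for the corpus: (fileid, sentence) pairs.
samples = [
    ("ne/567.txt", "I am so happy today :)"),
    ("ne/568.txt", "Nothing to report."),
    ("fr/12.txt", "Oh no :( :("),
    ("hp/3.txt", "Great :) really great :)"),
]

# Simplified illustrative pattern (an assumption): only :) and :(
EMOTICON = re.compile(r":\)|:\(")

counts = Counter()
for fileid, sentence in samples:
    # The category is the first path component of the fileid
    category = fileid.split("/")[0]
    counts[category] += len(EMOTICON.findall(sentence))

print(dict(counts))  # {'ne': 1, 'fr': 2, 'hp': 2}
```

A Counter keeps one running total per category, so there is no need to interleave counters and category names in a flat list.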

Related

Python - filter list of dicts by top value in key

I'm trying to narrow down list of dicts by filtering it by value in one of the keys.
My current code does this, but I don't know how to retain the entire dictionary rather than only the fields I filter by.
final_list = []
jobs = [glue_client.job_status(e) for e in j]
for e in jobs:
    for page in e:
        final_list.append(page["JobRuns"])
flat_list = [item for sublist in final_list for item in sublist]
sorted_list = sorted(flat_list, key=lambda k: (k['JobName'], k['StartedOn']), reverse=True)
# need to have following keys: "JobName", "JobRunState", "StartedOn" and "Id"
latest_jobs = [
    {'JobName': key, 'StartedOn': max(item['StartedOn'] for item in values)}
    for key, values in groupby(flat_list, lambda dct: dct['JobName'])
]
print(latest_jobs)
The data in the sorted_list variable looks like this:
list_of_dicts = [
{'JobName': 'a', 'StartedOn': datetime.datetime(2022, 10, 18, 13, 0, 47, 306000, tzinfo=tzlocal()), 'JobRunState': 'fail', 'id': 'xyz'},
{'JobName': 'a', 'StartedOn': datetime.datetime(2021, 10, 18, 13, 0, 47, 306000, tzinfo=tzlocal()), 'JobRunState': 'ok', 'id': 'xyz'},
{'JobName': 'b', 'StartedOn': datetime.datetime(2022, 10, 18, 13, 0, 47, 306000, tzinfo=tzlocal()), 'JobRunState': 'fail', 'id': 'xyz'},
{'JobName': 'a', 'StartedOn': datetime.datetime(2020, 10, 18, 13, 0, 47, 306000, tzinfo=tzlocal()), 'JobRunState': 'fai;', 'id': 'xyz'},
{'JobName': 'b', 'StartedOn': datetime.datetime(2021, 10, 18, 13, 0, 47, 306000, tzinfo=tzlocal()), 'JobRunState': 'ok', 'id': 'xyz'}
]
Expected output:
filtered_list = [
{'JobName': 'a', 'StartedOn': datetime.datetime(2022, 10, 18, 13, 0, 47, 306000, tzinfo=tzlocal()), 'JobRunState': 'fail', 'id': 'xyz'},
{'JobName': 'b', 'StartedOn': datetime.datetime(2022, 10, 18, 13, 0, 47, 306000, tzinfo=tzlocal()), 'JobRunState': 'fail', 'id': 'xyz'}
]
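This can be done with some judicious use of itertools.groupby, sorted, and max. Sort by JobName first, because groupby only groups consecutive items that share a key:
import datetime
list_of_dicts = [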
{'JobName': 'a', 'StartedOn': datetime.datetime(2022, 10, 18, 13, 0, 47, 306000), 'JobRunState': 'fail', 'id': 'xyz'},
{'JobName': 'a', 'StartedOn': datetime.datetime(2021, 10, 18, 13, 0, 47, 306000), 'JobRunState': 'ok', 'id': 'xyz'},
{'JobName': 'b', 'StartedOn': datetime.datetime(2022, 10, 18, 13, 0, 47, 306000), 'JobRunState': 'fail', 'id': 'xyz'},
{'JobName': 'a', 'StartedOn': datetime.datetime(2020, 10, 18, 13, 0, 47, 306000), 'JobRunState': 'fai;', 'id': 'xyz'},
{'JobName': 'b', 'StartedOn': datetime.datetime(2021, 10, 18, 13, 0, 47, 306000), 'JobRunState': 'ok', 'id': 'xyz'}
]
from itertools import groupby
from operator import itemgetter
lst = sorted(list_of_dicts, key=itemgetter('JobName'))
[max(jobs, key=itemgetter('StartedOn'))
 for jn, jobs in groupby(lst, key=itemgetter('JobName'))]
# [{'JobName': 'a', 'StartedOn': datetime.datetime(2022, 10, 18, 13, 0, 47, 306000), 'JobRunState': 'fail', 'id': 'xyz'},
# {'JobName': 'b', 'StartedOn': datetime.datetime(2022, 10, 18, 13, 0, 47, 306000), 'JobRunState': 'fail', 'id': 'xyz'}]
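As an alternative to the groupby approach, the latest run per job can also be collected in a single pass with a plain dict, which avoids the sort. This is a different technique from the answer above, shown only as a sketch; the function name `latest_per_job` and the shortened sample data (including the `id` values) are made up for illustration.

```python
import datetime

def latest_per_job(runs):
    """Return, for each 'JobName', the whole dict with the greatest 'StartedOn'."""
    latest = {}
    for run in runs:
        name = run["JobName"]
        # Keep the run if it is the first one seen for this job or is newer
        if name not in latest or run["StartedOn"] > latest[name]["StartedOn"]:
            latest[name] = run
    return list(latest.values())

# Hypothetical sample data in the same shape as the question's list
runs = [
    {"JobName": "a", "StartedOn": datetime.datetime(2022, 10, 18), "JobRunState": "fail", "id": "x1"},
    {"JobName": "a", "StartedOn": datetime.datetime(2021, 10, 18), "JobRunState": "ok", "id": "x2"},
    {"JobName": "b", "StartedOn": datetime.datetime(2022, 10, 18), "JobRunState": "fail", "id": "x3"},
    {"JobName": "b", "StartedOn": datetime.datetime(2021, 10, 18), "JobRunState": "ok", "id": "x4"},
]

filtered = latest_per_job(runs)
print([(r["JobName"], r["id"]) for r in filtered])  # [('a', 'x1'), ('b', 'x3')]
```

Because each value in `latest` is the original dictionary, all keys (JobName, JobRunState, StartedOn, id) are retained.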

How to create a dictionary from one list, making the first item mapped to the second, the third mapped to the fourth, and so on

I have a list of strings and integers:
students = ['Janet', 21, 'Bill', 19, 'Amanda', 22, 'Mike', 25, 'Susan', 24, 'Jen', 29, 'Sara', 30, 'Maria', 18, 'Kathy', 20, 'Andrew', 27]
I need to make a dictionary called people that maps each name to that person's age, which is the integer after it. I thought I would have to iterate over the list, but I've had no luck. Here is what I have so far:
students = ['Janet', 21, 'Bill', 19, 'Amanda', 22, 'Mike', 25, 'Susan', 24, 'Jen', 29, 'Sara', 30, 'Maria', 18, 'Kathy', 20, 'Andrew', 27]
people = {}
for i in students:
    if not isinstance(i, int):
        # here I would take i and make it a key in the dictionary, then map the following integer to its value
        pass
You can zip the items at even positions (the names) with the items at odd positions (the ages):
students = ['Janet', 21, 'Bill', 19, 'Amanda', 22, 'Mike', 25, 'Susan', 24, 'Jen', 29, 'Sara', 30, 'Maria', 18, 'Kathy', 20, 'Andrew', 27]
print(dict(zip(students[::2], students[1::2])))
Prints:
{'Janet': 21, 'Bill': 19, 'Amanda': 22, 'Mike': 25, 'Susan': 24, 'Jen': 29, 'Sara': 30, 'Maria': 18, 'Kathy': 20, 'Andrew': 27}
Alternatively, pair up consecutive items by zipping one iterator with itself:
dict(zip(*[iter(students)] * 2))
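Or use an explicit loop that steps through the list two items at a time:
people = {}
for i in range(0, len(students), 2):
    # students[i] is the name, students[i + 1] is the age that follows it
    people[students[i]] = students[i + 1]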
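The pairing approaches above can be wrapped in a small helper that also guards against a list with a dangling name and no age. This is a sketch only; the name `pairs_to_dict` is made up for illustration, and the length check is an addition that the answers above do not include.

```python
def pairs_to_dict(flat):
    """Map each item at an even position to the item that follows it."""
    if len(flat) % 2 != 0:
        raise ValueError("expected an even number of items")
    it = iter(flat)
    # zip pulls two items per step from the same iterator: (key, value)
    return dict(zip(it, it))

students = ['Janet', 21, 'Bill', 19, 'Amanda', 22]
people = pairs_to_dict(students)
print(people)  # {'Janet': 21, 'Bill': 19, 'Amanda': 22}
```

Without the length check, zip would silently drop a trailing unpaired item.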

Searching for elements within Python list using conditional statements

main_col = ['Name', 'Age', 'Gender']
main_row = [['Peter', 18, 'M'], ['Sam', 20, 'M'], ['Carol', 19, 'F'], ['Malcom', 21, 'M'], ['Oliver', 25, 'M'], ['Mellisa', 21, 'F'], ['Minreva', 18, 'F'], ['Bruce', 23, 'M'], ['Clarke', 24, 'M'], ['Zuck', 22, 'M'], ['Slade', 23, 'M'], ['Wade', 21, 'M'], ['Felicity', 22, 'F'], ['Selena', 23, 'F'], ['Ra\'s Al Gul',700, 'M']]
I am trying to make a program where main_col holds the column names and main_row holds the row data for those columns (as a 2D list).
How can I write a search query that finds the rows where:
Name = 'Carol' and Age = 19
Name = 'Carol' and Gender = 'F'
Age = 22 or Gender = 'M'
The following code gives the result for the third query:
search = {'Age' : 22, 'Gender' : 'M'}
for i in search:
    idx = main_col.index(i)
    for j in main_row:
        if(j[idx] == search[i]):
            print(j)
You could give this a try; it's somewhat complicated but should get the job done:
AND = 'and'
OR = 'or'
# Check if the row is a match
def is_found(value, aggregator, search_terms):
    if aggregator == AND:
        # Every search term must match
        found = True
        for col, val in search_terms.items():
            if value[val['idx']] != val['val']:
                found = False
                break
    else:
        # At least one search term must match
        found = False
        for col, val in search_terms.items():
            if value[val['idx']] == val['val']:
                found = True
                break
    return found
# Perform the search
def search(columns, values, aggregator, search_filters):
    # Format the search values into something we can use
    # {
    #   'col': { 'idx': <column index>, 'val': <search value> }
    # }
    search_terms = {
        col: { 'idx': columns.index(col), 'val': val }
        for col, val in search_filters.items()
    }
    return [
        val
        for val in values
        if is_found(val, aggregator, search_terms)
    ]
if __name__ == "__main__":
    main_col = ['Name', 'Age', 'Gender']
    main_row = [['Peter', 18, 'M'], ['Sam', 20, 'M'], ['Carol', 19, 'F'], ['Malcom', 21, 'M'], ['Oliver', 25, 'M'], ['Mellisa', 21, 'F'], ['Minreva', 18, 'F'], ['Bruce', 23, 'M'], ['Clarke', 24, 'M'], ['Zuck', 22, 'M'], ['Slade', 23, 'M'], ['Wade', 21, 'M'], ['Felicity', 22, 'F'], ['Selena', 23, 'F'], ['Ra\'s Al Gul',700, 'M']]
    search_filter = {
        'Age': 22, 'Gender': 'M'
    }
    print(search(main_col, main_row, OR ,search_filter))
    search_filter = {
        'Name': 'Carol', 'Age': 19
    }
    print(search(main_col, main_row, AND ,search_filter))
If you want to stick with your pattern, this is an option:
search = {'Age' : 21, 'Gender' : 'M'}
idxs = [ (main_col.index(key), val) for key, val in search.items()]
tmp = [ set(tuple(person) for person in main_row if person[i] == v) for i, v in idxs ]
res = set.intersection(*tmp)
#=> {('Wade', 21, 'M'), ('Malcom', 21, 'M')}
NOTE: I used intersection to get AND, but you can switch to any of the operations available on set (https://docs.python.org/3.7/library/stdtypes.html#set): union, intersection, difference, ...
You can wrap this in a handy function:
def lookup(search, main_row, main_col):
    idxs = [ (main_col.index(key), val) for key, val in search.items()]
    tmp = [ set(tuple(person) for person in main_row if person[i] == v) for i, v in idxs ]
    return set.intersection(*tmp)
lookup({'Age' : 21}, main_row, main_col)
#=> {('Wade', 21, 'M'), ('Mellisa', 21, 'F'), ('Malcom', 21, 'M')}
lookup({'Age' : 21, 'Gender' : 'M'}, main_row, main_col)
#=> {('Malcom', 21, 'M'), ('Wade', 21, 'M')}
lookup({'Age' : 21, 'Gender' : 'M', 'Name': 'Malcom'}, main_row, main_col)
#=> {('Malcom', 21, 'M')}
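The note above about swapping in other set operations can be sketched for the OR case (Age = 22 or Gender = 'M') by using union instead of intersection. The function name `lookup_any` and the shortened data are made up for illustration; like the AND version, it assumes the search dict has at least one condition.

```python
main_col = ['Name', 'Age', 'Gender']
# Shortened sample of the rows from the question
main_row = [['Peter', 18, 'M'], ['Carol', 19, 'F'], ['Zuck', 22, 'M'], ['Felicity', 22, 'F']]

def lookup_any(search, main_row, main_col):
    """Return rows (as tuples) that match at least one search condition."""
    idxs = [(main_col.index(key), val) for key, val in search.items()]
    # One set of matching rows per condition
    per_condition = [
        {tuple(person) for person in main_row if person[i] == v}
        for i, v in idxs
    ]
    # Union of the per-condition sets gives OR semantics
    return set.union(*per_condition)

result = lookup_any({'Age': 22, 'Gender': 'M'}, main_row, main_col)
print(sorted(result))
# [('Felicity', 22, 'F'), ('Peter', 18, 'M'), ('Zuck', 22, 'M')]
```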
In any case, I'd suggest building a list of dicts from main_row:
main_row = [['Peter', 18, 'M'], ['Sam', 20, 'M'], ['Carol', 19, 'F'], ['Malcom', 21, 'M'], ['Oliver', 25, 'M'], ['Mellisa', 21, 'F'], ['Minreva', 18, 'F'], ['Bruce', 23, 'M'], ['Clarke', 24, 'M'], ['Zuck', 22, 'M'], ['Slade', 23, 'M'], ['Wade', 21, 'M'], ['Felicity', 22, 'F'], ['Selena', 23, 'F'], ['Ra\'s Al Gul',700, 'M'], ['Oliver', 31, 'M']]
This builds people, a list of dictionaries (one per row), leaving the header list aside:
people = [ {'name':name, 'age':age, 'gender':gender} for name, age, gender in main_row]
#=> [{'name': 'Peter', 'age': 18, 'gender': 'M'}, {'name': 'Sam', 'age': 20, 'gender': 'M'}, ....
Then you can query for example in this way:
next(person for person in people if person['name'] == "Oliver" and person['age'] == 31 )
#=> {'name': 'Oliver', 'age': 31, 'gender': 'M'}
the_21_years_old = [ person for person in people if person['age'] == 21 ]
#=> [{'name': 'Malcom', 'age': 21, 'gender': 'M'}, {'name': 'Mellisa', 'age': 21, 'gender': 'F'}, {'name': 'Wade', 'age': 21, 'gender': 'M'}]
You can then do whatever you need with the returned "records":
for person in the_21_years_old:
    print(person['name'], person['age'])
# Malcom 21
# Mellisa 21
# Wade 21

Python string matching and give repeated numbers for unmatched strings

I have a set of words in list1: "management consultancy services better financial health"
user_search="management consultancy services better financial health"
user_split = nltk.word_tokenize(user_search)
user_length=len(user_split)
Each word is assigned its position: management=1, consultancy=2, services=3, better=4, financial=5, health=6.
Then I compare this with another list of words.
list2: ['us',
'paleri',
'home',
'us',
'consulting',
'services',
'market',
'research',
'analysis',
'project',
'feasibility',
'studies',
'market',
'strategy',
'business',
'plan',
'model',
'health',
'human' etc..]
Wherever a match occurs, the output should show the corresponding number (1, 2, 3, etc.) at that position. Unmatched positions are filled with consecutive numbers greater than 6.
Expected output example:
[1] 7 8 9 10 11 3 12 13 14 15 16 17 18 19 20 21 22 6 23 24
This means strings 3 and 6, i.e. services and health, are in this list (matched). The other numbers indicate unmatched words. Since user_length=6, the unmatched positions start from 7. How can I get this expected result in Python?
You can use itertools.count to create a counter and iterate via next:
from itertools import count
user_search = "management consultancy services better financial health"
words = {v: k for k, v in enumerate(user_search.split(), 1)}
# {'better': 4, 'consultancy': 2, 'financial': 5,
# 'health': 6, 'management': 1, 'services': 3}
L = ['us', 'paleri', 'home', 'us', 'consulting', 'services',
     'market', 'research', 'analysis', 'project', 'feasibility',
     'studies', 'market', 'strategy', 'business', 'plan',
     'model', 'health', 'human']
c = count(start=len(words)+1)
res = [next(c) if word not in words else words[word] for word in L]
# [7, 8, 9, 10, 11, 3, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 6, 23]

How to search in list

Getting a list from mongodb and sorting it:
from operator import itemgetter
results = list(db1.zaklad.find({"name": "cola", "stav": '+'}))
print(results)
sorted_results = sorted(results, key=itemgetter('weight'), reverse=True)
I'm getting: [{'_id': ObjectId('5a13a8c396fb3488bb6a0648'), 'name': 'cola', 'weight': '3', 'url': 'goo.gl/2BgLmm', 'stav': '+', 'time_exp': datetime.datetime(2017, 11, 17, 23, 37, 31, 946000)}, {'_id': ObjectId('5a13a8bc96fb3488bb6a0647'), 'name': 'cola', 'weight': '2', 'url': 'goo.gl/2BgLmm', 'stav': '+', 'time_exp': datetime.datetime(2017, 11, 17, 23, 37, 31, 946000)}, {'_id': ObjectId('5a13a8ca96fb3488bb6a0649'), 'name': 'cola', 'weight': '2', 'url': 'goo.gl/2BgLmm', 'stav': '+', 'time_exp': datetime.datetime(2017, 11, 17, 23, 37, 31, 946000)}]
From this list I want to get all the distinct weights (in the example above: 3, 2).
So, how do I search in this list?
Or is it better to build a dictionary with dict(enumerate(results))?
Thanks for your help.
If the list is already sorted, create a new list, put the first weight from sorted_results in it, remember that weight, and iterate over the remaining items:
If an item has the same weight as the remembered weight, ignore it. If it has a different weight, add that weight to the new list and remember it in place of the previous one.
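The steps described above can be sketched as follows. The function name `unique_sorted_weights` and the trimmed-down sample documents are made up for illustration, and the input is assumed to be already sorted by weight.

```python
def unique_sorted_weights(sorted_docs):
    """Collect each weight once from a list already sorted by 'weight'."""
    unique = []
    remembered = object()  # sentinel that equals no real weight
    for doc in sorted_docs:
        weight = doc["weight"]
        if weight != remembered:
            # New weight: record it and remember it for the next comparison
            unique.append(weight)
            remembered = weight
    return unique

# Hypothetical documents, trimmed to the field that matters here
sorted_results = [
    {"name": "cola", "weight": "3"},
    {"name": "cola", "weight": "2"},
    {"name": "cola", "weight": "2"},
]
print(unique_sorted_weights(sorted_results))  # ['3', '2']
```

This only removes adjacent repeats, which is why the list has to be sorted first.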
Get the weights:
weights = [dic["weight"] for dic in results]
Remove duplicates by converting to a set (note that a set does not preserve order):
set(weights)
If you want to look up results by their distinct weight value:
from collections import OrderedDict
ordered_results = OrderedDict({k['weight']:k for k in sorted(results, key=lambda x:x['weight'], reverse=True)})
This gives you an ordered result:
OrderedDict([('3', {'stav': '+', 'name': 'cola', 'weight': '3', 'url': 'goo.gl/2BgLmm', 'time_exp': datetime.datetime(2017, 11, 17, 23, 37, 31, 946000), '_id': ObjectId('5a13a8c396fb3488bb6a0648')}), ('2', {'stav': '+', 'name': 'cola', 'weight': '2', 'url': 'goo.gl/2BgLmm', 'time_exp': datetime.datetime(2017, 11, 17, 23, 37, 31, 946000), '_id': ObjectId('5a13a8ca96fb3488bb6a0649')})])
You can get the entry whose 'weight' is '3' with ordered_results['3'].
