How can I get the average marks for each student from the following dataframe using the Pandas groupby() and mean() methods?
The aim is to list the average marks of all students in ascending order.
import pandas as pd
# Marks of students in class 4A and 4B
data = {
    'S4A': {
        'Name': ['Amy', 'Mandy', 'Daisy', 'Ben', 'Peter', 'John'],
        'Maths': [99, 87, 88, 70, 88, 76],
        'Chemistry': [89, 90, 90, 90, 89, 82],
        'Physics': [79, 97, 68, 80, 72, 95],
        'English': [90, 65, 56, 67, 86, 82],
        'Biology': [79, 89, 59, 70, 79, 78],
        'History': [75, 81, 78, 55, 68, 84]
    },
    'S4B': {
        'Name': ['Allen', 'Gordon', 'Jimmy', 'Nancy', 'Sammy', 'William'],
        'Maths': [90, 86, 88, 80, 85, 86],
        'Chemistry': [89, 78, 88, 90, 79, 82],
        'Physics': [89, 97, 78, 81, 82, 55],
        'English': [80, 85, 86, 77, 86, 82],
        'Biology': [75, 89, 69, 70, 79, 78],
        'History': [79, 81, 80, 65, 68, 84]
    }
}
# list of subjects
subjects = ['Maths', 'Chemistry', 'Physics', 'English', 'Biology', 'History']
# create dataframe
df = pd.DataFrame(data)
Either create a dataframe for each class and compute the mean of each, or concatenate all the classes into one dataframe and compute the mean row by row:
df = pd.concat([pd.DataFrame(data[k]) for k in data], ignore_index=True)
mean_df = df.set_index('Name').mean(axis=1)
print(mean_df)
Name
Amy 85.166667
Mandy 84.833333
Daisy 73.166667
Ben 72.000000
Peter 80.333333
John 82.833333
Allen 83.666667
Gordon 86.000000
Jimmy 81.500000
Nancy 77.166667
Sammy 79.833333
William 77.833333
dtype: float64
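The question also asks for the averages in ascending order, which the output above is not. A minimal sketch, assuming every name is unique across both classes: `sort_values()` orders the row-wise means, and the `melt`/`groupby` variant is one possible way to bring in groupby() and mean() as the question mentions. The names `avg_direct`, `avg_groupby`, `Subject` and `Mark` are chosen here for illustration only.

```python
import pandas as pd

data = {
    'S4A': {
        'Name': ['Amy', 'Mandy', 'Daisy', 'Ben', 'Peter', 'John'],
        'Maths': [99, 87, 88, 70, 88, 76],
        'Chemistry': [89, 90, 90, 90, 89, 82],
        'Physics': [79, 97, 68, 80, 72, 95],
        'English': [90, 65, 56, 67, 86, 82],
        'Biology': [79, 89, 59, 70, 79, 78],
        'History': [75, 81, 78, 55, 68, 84],
    },
    'S4B': {
        'Name': ['Allen', 'Gordon', 'Jimmy', 'Nancy', 'Sammy', 'William'],
        'Maths': [90, 86, 88, 80, 85, 86],
        'Chemistry': [89, 78, 88, 90, 79, 82],
        'Physics': [89, 97, 78, 81, 82, 55],
        'English': [80, 85, 86, 77, 86, 82],
        'Biology': [75, 89, 69, 70, 79, 78],
        'History': [79, 81, 80, 65, 68, 84],
    },
}

# One row per student, one column per subject.
df = pd.concat([pd.DataFrame(marks) for marks in data.values()], ignore_index=True)

# Row-wise mean per student, then sorted ascending.
avg_direct = df.set_index('Name').mean(axis=1).sort_values()

# Same result via groupby(): reshape to one row per (student, subject) first.
long_df = df.melt(id_vars='Name', var_name='Subject', value_name='Mark')
avg_groupby = long_df.groupby('Name')['Mark'].mean().sort_values()

print(avg_direct)
```

Both series hold the same averages; the direct version is shorter, while the groupby version would also cope with a student appearing in more than one row.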
I have a list and I would like to convert it into a dictionary whose key:value pairs look like
{'apple':87, 'fan':88 ,'jackal':89,...}
This is the list:
values_list = ['apple', 87, 'fan', 88, 'jackal', 89, 'bat', 98, 'car', 84, 'ice', 80, 'car', 86, 'apple', 82, 'goat', 80, 'dog', 81, 'cat', 80, 'eagle', 90, 'eagle', 98, 'hawk', 89, 'dog', 79, 'fan', 89, 'goat', 85, 'car', 81, 'hawk', 90, 'ice', 85, 'cat', 78, 'goat', 84, 'jackal', 90, 'apple', 80, 'ice', 87, 'bat', 94, 'bat', 92, 'jackal', 91, 'eagle', 93, 'fan', 85]
This is the Python script I wrote for the task:
for i in range(0,length(values_list),2):
    value_count_dict = {values_list[i] : values_list[i+1]}
    print(value_count_dict)
values_count_dict = dict(value_count_dict)
print(values_count_dict)
Output of the script:
But I am expecting a single dictionary with all the key:value pairs in it.
Thank you in advance!
You've misspelled len as length.
The most Pythonic way of doing this is likely with a list comprehension and range using the step argument.
[{values_list[i]: values_list[i+1]} for i in range(0, len(values_list), 2)]
# [{'apple': 87}, {'fan': 88}, {'jackal': 89}, {'bat': 98}, {'car': 84}, {'ice': 80}, {'car': 86}, {'apple': 82}, {'goat': 80}, {'dog': 81}, {'cat': 80}, {'eagle': 90}, {'eagle': 98}, {'hawk': 89}, {'dog': 79}, {'fan': 89}, {'goat': 85}, {'car': 81}, {'hawk': 90}, {'ice': 85}, {'cat': 78}, {'goat': 84}, {'jackal': 90}, {'apple': 80}, {'ice': 87}, {'bat': 94}, {'bat': 92}, {'jackal': 91}, {'eagle': 93}, {'fan': 85}]
In your code you create a new dictionary on each iteration, but you don't store them anywhere, so value_count_dict at the end of the loop is just the last pair.
value_counts = []
for i in range(0, len(values_list), 2):
    value_count_dict = {values_list[i]: values_list[i+1]}
    print(value_count_dict)
    value_counts.append(value_count_dict)
Here the for loop starts at 0, stops at the length of the list, and uses a step of 2, because each key sits two positions after the previous one. The key is at index x of the list and its value at index x+1. Each pair is assigned into a dictionary that starts out empty.
values_list = ['apple', 87, 'fan', 88, 'jackal', 89, 'bat', 98, 'car', 84, 'ice', 80, 'car', 86, 'apple', 82, 'goat', 80, 'dog', 81, 'cat', 80, 'eagle', 90, 'eagle', 98, 'hawk', 89, 'dog', 79, 'fan', 89, 'goat', 85, 'car', 81, 'hawk', 90, 'ice', 85, 'cat', 78, 'goat', 84, 'jackal', 90, 'apple', 80, 'ice', 87, 'bat', 94, 'bat', 92, 'jackal', 91, 'eagle', 93, 'fan', 85]
final_dict = {}
for x in range(0, len(values_list), 2):
    final_dict[values_list[x]] = values_list[x+1]
print(final_dict)
Try zip:
dct = dict(
    zip(
        values_list[0::2],
        values_list[1::2],
    )
)
For duplicate keys in your list, the last value will be taken.
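To make the duplicate-key behaviour concrete, here is a small sketch on a shortened list (`pairs` is a made-up example, not the full list from the question):

```python
# 'apple' appears twice; the later value overwrites the earlier one.
pairs = ['apple', 87, 'fan', 88, 'apple', 82]

# Even positions are keys, odd positions are values.
dct = dict(zip(pairs[0::2], pairs[1::2]))
print(dct)  # {'apple': 82, 'fan': 88}
```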
A dictionary cannot have duplicate keys, as mentioned in the comments above, but you can store all the values for each key in a list:
result = {}
for i in range(0, len(values_list), 2):
    result.setdefault(values_list[i], []).append(values_list[i+1])
print(result)
and your output would look like:
{'apple': [87, 82, 80], 'fan': [88, 89, 85], 'jackal': [89, 90, 91], 'bat': [98, 94, 92], 'car': [84, 86, 81], 'ice': [80, 85, 87], 'goat': [80, 85, 84], 'dog': [81, 79], 'cat': [80, 78], 'eagle': [90, 98, 93], 'hawk': [89, 90]}
I have an example annotation file:
{'text': "BELGIE BELGIQUE BELGIEN\nIDENTITEITSKAART CARTE D'IDENTITE PERSONALAUSWEIS\nBELGIUM\nIDENTITY CARD\nNaam / Name\nDermrive\nVoornamen / Given names\nBrando Jerom L\nGeslacht / Nationaliteit /\nGeboortedatum /\nSex\nNationality\nDate of birth\nM/M\nBEL\n19 05 1982\nRijksregisternr. 7 National Register Nº\n85.08.23-562.77\nKaartnr. / Card Nº\n752-0465474-34\nVervalt op / Expires on\n23 07 2025\n", 'spans': [{'start': 24, 'end': 40, 'token_start': 16, 'token_end': 16, 'label': 'CardType'}, {'start': 41, 'end': 57, 'token_start': 16, 'token_end': 16, 'label': 'CardType'}, {'start': 58, 'end': 73, 'token_start': 15, 'token_end': 15, 'label': 'CardType'}, {'start': 108, 'end': 116, 'token_start': 8, 'token_end': 8, 'label': 'LastName'}, {'start': 141, 'end': 155, 'token_start': 14, 'token_end': 14, 'label': 'FirstName'}, {'start': 229, 'end': 232, 'token_start': 3, 'token_end': 3, 'label': 'Gender_nid'}, {'start': 233, 'end': 236, 'token_start': 3, 'token_end': 3, 'label': 'Nationality_nid'}, {'start': 237, 'end': 247, 'token_start': 10, 'token_end': 10, 'label': 'DateOfBirth_nid'}, {'start': 288, 'end': 303, 'token_start': 15, 'token_end': 15, 'label': 'Ssn'}, {'start': 323, 'end': 337, 'token_start': 14, 'token_end': 14, 'label': 'CardNumber'}, {'start': 362, 'end': 372, 'token_start': 10, 'token_end': 10, 'label': 'ValidUntil_nid'}]}
I have the start and end position of each entity. For the "LastName" entity, which is "Dermrive" in the example, I generate another LastName that may be shorter or longer, for example "Brad". I then need to shift all the following spans by the difference in length between the two words, so that the other labels stay in the correct position. It works perfectly with one entity, but when I try to change all of them, the output is messy and the labels are no longer correct.
def replace_text_by_index_and_type(self, new_text, type):
    label_position = self.search_label_position_in_spans(self.annotation['spans'], type.value)
    label = self.annotation['spans'][label_position]
    begin_new_string = self.annotation["text"][:label["start"]]
    end_new_string = self.annotation["text"][label["end"]:]
    new_string = begin_new_string + new_text + end_new_string
    for to_change_ent in self.annotation['spans'][label_position+1:]:
        diff = len(new_text) - (label["end"] - label["start"])
        self.annotation['spans'][label_position]["end"] = self.annotation['spans'][label_position]["end"] + diff
        #print(f"Diff between original {to_change_ent} and new_string: {diff}")
        to_change_ent["start"] += diff
        to_change_ent["end"] += diff
    return new_string
I start shifting from the entity after the one being replaced, so that its own start position is kept, and I add diff to its end position. As a result the first name and last name are correct, but the other entities are shifted incorrectly.
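One likely cause, judging only from the code above: `label` is the same dict object as `self.annotation['spans'][label_position]`, so after the first pass through the loop `label["end"]` has already been extended, and `diff` evaluates to 0 for every later span. The function also returns `new_string` without storing it, so if repeated calls read `self.annotation["text"]` again they would slice the old text. A minimal standalone sketch of a fix follows. The function `replace_span_text` and the small `annotation` sample are hypothetical, the spans are assumed to be sorted by `start`, and `token_start`/`token_end` are left out.

```python
def replace_span_text(annotation, label_name, new_text):
    """Replace the text of the first span labelled `label_name` and shift later spans."""
    spans = annotation["spans"]
    # Index of the first span carrying the requested label.
    idx = next(i for i, span in enumerate(spans) if span["label"] == label_name)
    target = spans[idx]
    # Compute the length difference once, before any offset is modified.
    diff = len(new_text) - (target["end"] - target["start"])
    text = annotation["text"]
    # Store the new text so that the next call works on up-to-date content.
    annotation["text"] = text[:target["start"]] + new_text + text[target["end"]:]
    target["end"] += diff
    # Shift every span that comes after the replaced one.
    for span in spans[idx + 1:]:
        span["start"] += diff
        span["end"] += diff
    return annotation["text"]


# Hypothetical, shortened annotation used only to exercise the function.
annotation = {
    "text": "Name\nDermrive\nSex\nM\nBorn\n1982\n",
    "spans": [
        {"start": 5, "end": 13, "label": "LastName"},
        {"start": 18, "end": 19, "label": "Gender"},
        {"start": 25, "end": 29, "label": "DateOfBirth"},
    ],
}

replace_span_text(annotation, "LastName", "Brad")
replace_span_text(annotation, "Gender", "Female")
for span in annotation["spans"]:
    print(span["label"], repr(annotation["text"][span["start"]:span["end"]]))
```

After both replacements each span still slices out its own entity, which is a quick way to check the offsets after every change.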
I have two lists:
names: ['Mary', 'Jack', 'Rose', 'Mary', 'Carl', 'Fred', 'Meg', 'Phil', 'Carl', 'Jack', 'Fred', 'Mary', 'Phil', 'Jack', 'Mary', 'Fred', 'Meg']
grades: [80, 88, 53, 80, 64, 61, 75, 80, 91, 82, 68, 76, 95, 58, 89, 51, 81, 78]
I want to take the average of each person's test scores. For example, Mary appears in the names list 4 times, and I want to take the test scores that are mapped to her and average them.
The issue is how to match the duplicate names with the test scores.
Note: I do know that the grades list is longer than the names list, but these are the two lists that were given to me.
Here is what I have done so far:
def average_grades(names, grades):
    averages = dict()
    name_counter = 0
    for name in names:
        # if the name is the same
        if name == names:
            # count the occurrence of the name
            name_counter += 1
            print(name_counter)
    # cycle through the grades
    # for grade in grades:
    #     print(grade)
Here's a way:
from collections import defaultdict, Counter
names = ['Mary', 'Jack', 'Rose', 'Mary', 'Carl', 'Fred', 'Meg', 'Phil', 'Carl', 'Jack', 'Fred', 'Mary', 'Phil', 'Jack', 'Mary', 'Fred', 'Meg']
grades = [80, 88, 53, 80, 64, 61, 75, 80, 91, 82, 68, 76, 95, 58, 89, 51, 81, 78]
score = defaultdict(int)
# this line initializes a defaultdict whose missing keys default to 0
frequency = Counter(names)
# this yields: Counter({'Mary': 4, 'Jack': 3, 'Fred': 3, 'Carl': 2, 'Meg': 2,'Phil': 2, 'Rose': 1})
for name, grade in zip(names, grades):
    score[name] = score.get(name, 0) + (grade / frequency[name])
    # here you add (grade of name / count of name) to the running total for each name;
    # score.get(name, 0) returns 0 if the key does not exist yet
print(score)
Output:
defaultdict(<class 'int'>, {'Mary': 81.25, 'Jack': 76.0, 'Rose': 53.0, 'Carl': 77.5, 'Fred': 60.0, 'Meg': 78.0, 'Phil': 87.5})
NOTE: zip stops at the end of the shorter list, so the last grade is ignored; I have no idea what to do with it.
You can pair each name with its grade, sort and group the pairs by name, and store each group's average in a dictionary:
from itertools import groupby
from collections import defaultdict
names = ['Mary', 'Jack', 'Rose', 'Mary', 'Carl', 'Fred', 'Meg', 'Phil', 'Carl', 'Jack', 'Fred', 'Mary', 'Phil', 'Jack', 'Mary', 'Fred', 'Meg']
grades = [80, 88, 53, 80, 64, 61, 75, 80, 91, 82, 68, 76, 95, 58, 89, 51, 81, 78]
d = defaultdict(int)
f = lambda x: x[0]
for k, g in groupby(sorted(zip(names, grades), key=f), key=f):
    grp = list(g)
    d[k] = sum(x[1] for x in grp) / len(grp)
print(d)
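For comparison, here is a sketch that fills in the `average_grades` function from the question by collecting each person's grades in a list first and averaging at the end. It assumes, like the answers above, that the unmatched extra grade is simply dropped; `grouped` and `averages` are names chosen for illustration.

```python
from collections import defaultdict

names = ['Mary', 'Jack', 'Rose', 'Mary', 'Carl', 'Fred', 'Meg', 'Phil', 'Carl',
         'Jack', 'Fred', 'Mary', 'Phil', 'Jack', 'Mary', 'Fred', 'Meg']
grades = [80, 88, 53, 80, 64, 61, 75, 80, 91, 82, 68, 76, 95, 58, 89, 51, 81, 78]


def average_grades(names, grades):
    # Collect every grade under the name it belongs to.
    grouped = defaultdict(list)
    # zip stops at the shorter list, so the extra last grade is dropped.
    for name, grade in zip(names, grades):
        grouped[name].append(grade)
    # Average each person's grades once all of them are collected.
    return {name: sum(marks) / len(marks) for name, marks in grouped.items()}


averages = average_grades(names, grades)
print(averages)
```

Dividing once per person, rather than adding `grade / count` pieces, avoids accumulating small floating-point rounding differences.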
I am working with a data frame which consists of a column with numbers in the format:
[[45, 45, 'D'],[46, 49, 'C'],[50, 66, 'S'],[67, 101, 'C'],[102, 103, 'S'],[104, 106, 'C'],[107, 108, 'S'],[109, 120, 'C'],[121, 121, 'S'],[122, 123, 'C'],[124, 140, 'S'],[141, 149, 'C'],[150, 176, 'S'],[177, 178, 'C'],[179, 181, 'S'],[182, 194, 'C'],[195, 213, 'S'],[214, 217, 'C']]
These numbers correspond to the positions of characters in a string, such as:
'MGILSFLPVLATESDWADCKSPQPWGHMLLWTAVLFLAPVAGTPAAPPKAVLKLEPQWINVLQEDSVTLTCRGTHSPESDSIQWFHNGNLIPTHTQPSYRFKANNNDSGEYTCQTGQTSLSDPVHLTVLSEWLVLQTPHLEFQEGETIVLRCHSWKDKPLVKVTFFQNGKSKKFSRSDPNFSIPQANHSHSGDYHCTGNIGYTLYSSKPVTITVQAPSSSPMGIIVAVVTGIAVAAIVAAVVALIYCRKKRISALPGYPECREMGETLPEKPANPTNPDEADKVGAENTITYSLLMHPDALEEPDDQNRI'
As you can see, some characters in the string are not covered by any range in the number list (for example, nothing before position 45 is listed). The characters at those positions have to be removed to create a shorter sequence of letters.
I am able to do this for one line, but I am struggling to do it for every line in the data frame.
This is the code for doing it for one line:
new_s = ''
for item in res:
    new_s += strSeq[item[0]-1:item[1]]
print(len(new_s), new_s)
And this is what I have been trying in order to do it for all lines:
shortenedSeq_list = []
counter = 0
stringstring = []
for rows in df.itertuples():
    strSeq2 = [rows.sequence]
    strremove2 = [rows.shortened_mobidb_consensus]
    for item in strremove2:
        res = ast.literal_eval(item)
        for item in res:
            stringstring.append(strSeq2[item[0]-1:item[1]])
stringstring
But this results in the output:
[],
[],
[],
[],
[],
[],
[],
[],
[],
['MGKGKPRGLNSARKLRVHRRNNRWAETTYKKRLLGTAFKSSPFGGSSHAKGIVLEKIGIESKQPNSAIRKCVRVQLIKNGKKVTAFVPNDGCLNFVDENDEVLLAGFGRKGKAKGDIPGVRFKVVKVSGVSLLALWKEKKEKPRS'],
[],
[],
Whereas I want each entry in the list to be the shortened sequence for one row.
I ultimately want to add this list as a column in the dataframe.
UPDATE
The numbers are stored as a string rather than a list, so res is that string parsed into a list. This is the output of the working single-line code:
173 AAPPKAVLKLEPQWINVLQEDSVTLTCRGTHSPESDSIQWFHNGNLIPTHTQPSYRFKANNNDSGEYTCQTGQTSLSDPVHLTVLSEWLVLQTPHLEFQEGETIVLRCHSWKDKPLVKVTFFQNGKSKKFSRSDPNFSIPQANHSHSGDYHCTGNIGYTLYSSKPVTITVQAP
Here 173 is the length of the shortened sequence, followed by the sequence itself.
df sample:
shortened_mobidb_consensus;sequence
[[45, 45, 'D'], [46, 49, 'C'], [50, 66, 'S'], [67, 101, 'C'], [102, 103, 'S'], [104, 106, 'C'], [107, 108, 'S'], [109, 120, 'C'], [121, 121, 'S'], [122, 123, 'C'], [124, 140, 'S'], [141, 149, 'C'], [150, 176, 'S'], [177, 178, 'C'], [179, 181, 'S'], [182, 194, 'C'], [195, 213, 'S'], [214, 217, 'C']];MGILSFLPVLATESDWADCKSPQPWGHMLLWTAVLFLAPVAGTPAAPPKAVLKLEPQWINVLQEDSVTLTCRGTHSPESDSIQWFHNGNLIPTHTQPSYRFKANNNDSGEYTCQTGQTSLSDPVHLTVLSEWLVLQTPHLEFQEGETIVLRCHSWKDKPLVKVTFFQNGKSKKFSRSDPNFSIPQANHSHSGDYHCTGNIGYTLYSSKPVTITVQAPSSSPMGIIVAVVTGIAVAAIVAAVVALIYCRKKRISALPGYPECREMGETLPEKPANPTNPDEADKVGAENTITYSLLMHPDALEEPDDQNRI
[[1, 1, 'D'], [2, 143, 'S'], [144, 145, 'C']];MGKGKPRGLNSARKLRVHRRNNRWAETTYKKRLLGTAFKSSPFGGSSHAKGIVLEKIGIESKQPNSAIRKCVRVQLIKNGKKVTAFVPNDGCLNFVDENDEVLLAGFGRKGKAKGDIPGVRFKVVKVSGVSLLALWKEKKEKPRS
[[1, 145, 'S']];MGKGKPRGLNSARKLRVHRRNNRWAETTYKKRLLGTAFKSSPFGGSSHAKGIVLEKIGIESKQPNSAIRKCVRVQLIKNGKKVTAFVPNDGCLNFVDENDEVLLAGFGRKGKAKGDIPGVRFKVVKVSGVSLLALWKEKKEKPRS
[[1, 1, 'D'], [2, 2, 'C'], [3, 37, 'S'], [38, 39, 'C'], [40, 40, 'S'], [41, 41, 'C'], [42, 62, 'S'], [63, 65, 'C'], [66, 231, 'S']];MSKNILVLGGSGALGAEVVKFFKSKSWNTISIDFRENPNADHSFTIKDSGEEEIKSVIEKINSKSIKVDTFVCAAGGWSGGNASSDEFLKSVKGMIDMNLYSAFASAHIGAKLLNQGGLFVLTGASAALNRTSGMIAYGATKAATHHIIKDLASENGGLPAGSTSLGILPVTLDTPTNRKYMSDANFDDWTPLSEVAEKLFEWSTNSDSRPTNGSLVKFETKSKVTTWTNL
[[24, 29, 'D'], [30, 91, 'S'], [92, 92, 'D']];MKVSTTALAVLLCTMTLCNQVFSAPYGADTPTACCFSYSRKIPRQFIVDYFETSSLCSQPGVIFLTKRNRQICADSKETWVQEYITDLELNA
Solution 1:
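import ast
import pandas as pd
# parse the first column into real lists while reading the file
df = pd.read_csv('stringsample.txt', sep=';', converters={0: ast.literal_eval})
for index, row in df.iterrows():
    new_s = ''
    res = row.shortened_mobidb_consensus
    for item in res:
        # positions are 1-based and inclusive, hence the -1 on the start
        new_s += row.sequence[item[0]-1:item[1]]
    df.loc[index, 'output'] = new_s
df['output']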
0 AAPPKAVLKLEPQWINVLQEDSVTLTCRGTHSPESDSIQWFHNGNL...
1 MGKGKPRGLNSARKLRVHRRNNRWAETTYKKRLLGTAFKSSPFGGS...
2 MGKGKPRGLNSARKLRVHRRNNRWAETTYKKRLLGTAFKSSPFGGS...
3 MSKNILVLGGSGALGAEVVKFFKSKSWNTISIDFRENPNADHSFTI...
4 APYGADTPTACCFSYSRKIPRQFIVDYFETSSLCSQPGVIFLTKRN...
Name: output, dtype: object
Solution 2: (Fixing your code)
import ast
import pandas as pd
df = pd.read_csv('stringsample.txt', sep=';')
shortenedSeq_list = []
counter = 0
stringstring = []
for rows in df.itertuples():
    # keep the sequence as a plain string, not wrapped in a list
    strSeq2 = rows.sequence
    strremove2 = rows.shortened_mobidb_consensus
    res = ast.literal_eval(strremove2)
    # build one shortened sequence per row
    new_s = ''
    for item in res:
        new_s += strSeq2[item[0]-1:item[1]]
    stringstring.append(new_s)
stringstring
['AAPPKAVLKLEPQWINVLQEDSVTLTCRGTHSPESDSIQWFHNGNLIPTHTQPSYRFKANNNDSGEYTCQTGQTSLSDPVHLTVLSEWLVLQTPHLEFQEGETIVLRCHSWKDKPLVKVTFFQNGKSKKFSRSDPNFSIPQANHSHSGDYHCTGNIGYTLYSSKPVTITVQAP',
'MGKGKPRGLNSARKLRVHRRNNRWAETTYKKRLLGTAFKSSPFGGSSHAKGIVLEKIGIESKQPNSAIRKCVRVQLIKNGKKVTAFVPNDGCLNFVDENDEVLLAGFGRKGKAKGDIPGVRFKVVKVSGVSLLALWKEKKEKPRS',
'MGKGKPRGLNSARKLRVHRRNNRWAETTYKKRLLGTAFKSSPFGGSSHAKGIVLEKIGIESKQPNSAIRKCVRVQLIKNGKKVTAFVPNDGCLNFVDENDEVLLAGFGRKGKAKGDIPGVRFKVVKVSGVSLLALWKEKKEKPRS',
'MSKNILVLGGSGALGAEVVKFFKSKSWNTISIDFRENPNADHSFTIKDSGEEEIKSVIEKINSKSIKVDTFVCAAGGWSGGNASSDEFLKSVKGMIDMNLYSAFASAHIGAKLLNQGGLFVLTGASAALNRTSGMIAYGATKAATHHIIKDLASENGGLPAGSTSLGILPVTLDTPTNRKYMSDANFDDWTPLSEVAEKLFEWSTNSDSRPTNGSLVKFETKSKVTTWTNL',
'APYGADTPTACCFSYSRKIPRQFIVDYFETSSLCSQPGVIFLTKRNRQICADSKETWVQEYITDLELNA']
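Since the goal is to add the result as a column, the per-row logic can also be wrapped in a small helper and assigned in one step. This is only a sketch: `shorten` is a hypothetical helper, and the two-row dataframe uses made-up short sequences instead of the real file so that it runs on its own.

```python
import ast

import pandas as pd


def shorten(sequence, regions):
    """Keep only the characters covered by the 1-based, inclusive [start, end, code] ranges."""
    # The column may hold the ranges as a string; parse it if so.
    if isinstance(regions, str):
        regions = ast.literal_eval(regions)
    return ''.join(sequence[start - 1:end] for start, end, _ in regions)


# Made-up sample with the same two columns as the question's dataframe.
df = pd.DataFrame({
    'shortened_mobidb_consensus': ["[[3, 4, 'D'], [5, 6, 'S']]", "[[1, 2, 'S'], [5, 5, 'C']]"],
    'sequence': ['ABCDEFGH', 'MGKGK'],
})

# Build the new column from both columns, one row at a time.
df['output'] = [
    shorten(seq, regions)
    for seq, regions in zip(df['sequence'], df['shortened_mobidb_consensus'])
]
print(df['output'].tolist())  # ['CDEF', 'MGK']
```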