Python Pandas DataFrame pivot_table bizarre values

"Bizarre" is such an emotionally charged word.
Assume that I have 5 students: A, B, C, D, and E. Each of these students grades two of their peers on a writing assignment. The data is as follows:
import pandas as pd

peer_review = pd.DataFrame({
    'Student': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'D', 'D', 'D', 'D', 'E', 'E'],
    'Assessor': ['B', 'C', 'A', 'D', 'D', 'D', 'B', 'D', 'D', 'D', 'A', 'A', 'A', 'E', 'C', 'E'],
    'Score': [72, 53, 92, 100, 2, 90, 75, 50, 50, 47, 97, 86, 41, 17, 47, 29]})
Now, in some cases an assessor graded the student's assignment more than once. Maybe the student turned it in and revised several times. Maybe the assessor was drunk and didn't remember that he had already graded this student's assignment. In any case, I would like to be able to see a list of all scores that each assessor gave to each student. I tried to do this as follows:
peer_review.pivot_table(
    index='Student',
    columns='Assessor',
    values='Score',
    aggfunc=identity)
I can already hear you asking --- What is the "identity" function? It's this:
def identity(x):
    return x
However, when I run the pivot_table function repeatedly, it gives me different answers each time for the cells that have multiple values.
So, here are the questions:
What is the significance of the numbers that seem to change randomly as I run the pivot_table function repeatedly?
How do I fix the identity function so that it returns a simple list of all the scores when an assessor graded the same assignment more than once?
------------------UPDATE #1:------------------
I found that it is a pandas Series object that is being passed to the identity function. I changed the identity function to this:
def identity(x):
    return x.values
This still gives me the bizarre random numbers. Realizing that x.values is a numpy.ndarray, I then tried this:
def identity(x):
    return x.values.tolist()
This results in a ValueError exception. ("Function does not reduce.")
------------------UPDATE #2:------------------
The workaround proposed by ZJS works perfectly. Still wondering why pivot_table has failed me.

This will work every time...
groups = peer_review.groupby(['Assessor', 'Student'])  # group into (Assessor, Student) combos
peer_review = groups.apply(lambda x: list(x['Score']))  # collect each group's scores into a list
peer_review = peer_review.unstack('Student')  # move the Student index level to the columns
I'm still investigating why pivot_table doesn't work
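A compact variant of the same groupby workaround, oriented like the original pivot_table call (students as rows, assessors as columns), might look like the sketch below. It assumes a pandas version in which applying `list` to a grouped Series returns one list per group; the name `score_lists` is introduced here only for illustration.

```python
import pandas as pd

peer_review = pd.DataFrame({
    'Student': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'D', 'D', 'D', 'D', 'E', 'E'],
    'Assessor': ['B', 'C', 'A', 'D', 'D', 'D', 'B', 'D', 'D', 'D', 'A', 'A', 'A', 'E', 'C', 'E'],
    'Score': [72, 53, 92, 100, 2, 90, 75, 50, 50, 47, 97, 86, 41, 17, 47, 29]})

# Build one list of scores per (Student, Assessor) pair, then pivot the
# Assessor level of the index into columns. Pairs that never occur are NaN.
score_lists = (
    peer_review.groupby(['Student', 'Assessor'])['Score']
    .apply(list)
    .unstack('Assessor')
)
print(score_lists)
```

Cells hold plain Python lists here, so the result is meant for inspection rather than further numeric aggregation.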

Related

Function for creating a random order list in python

I am new to Python. For an experiment, I need to build a random selector function that determines the order of runs the athlete will perform. We will have four courses (A, B, C, D) and we want the athlete to perform these in random order. There will be a total of 12 runs for each athlete and each course must have 3 runs each session. How can I build this function?
This is what I have tried so far. It works, but I need to run the script several times before I get what I want. So if someone has a better idea, I would be really happy.
Best
Christian
import random
runs = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
diffCourses = ['A', 'B', 'C', 'D']
myRandom = []
for run in runs:
    x = random.choice(diffCourses)
    myRandom.append(x)
if myRandom.count('A') != 3 or myRandom.count('B') != 3 or myRandom.count('C') != 3 or myRandom.count('D') != 3:
    print('The run order does not satisfy the requirement')
else:
    print('satisfied')
    print(myRandom)

To keep things simple, I would create the full list of runs first and then shuffle it:
from random import shuffle
diffCourses = ['A', 'B', 'C', 'D']
runs = diffCourses*3
shuffle(runs)
print(runs)
for example it produces
['C', 'C', 'D', 'C', 'D', 'A', 'A', 'D', 'B', 'B', 'B', 'A']
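Since the question asks for a function, the shuffle approach above could be wrapped as in the following sketch. The function name `make_run_order` and its `rng` parameter are hypothetical additions for illustration; `rng` only needs a `shuffle` method, so either the `random` module or a seeded `random.Random` instance works.

```python
import random
from collections import Counter

def make_run_order(courses, runs_per_course, rng=random):
    """Return a shuffled list in which every course appears runs_per_course times."""
    runs = list(courses) * runs_per_course  # e.g. A B C D A B C D A B C D
    rng.shuffle(runs)                       # shuffle in place
    return runs

order = make_run_order(['A', 'B', 'C', 'D'], 3)
print(order)
print(Counter(order))  # every course is counted exactly 3 times
```

Passing `random.Random(some_seed)` as `rng` makes the order reproducible, which may be useful for documenting the experiment.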

You choose a random ordering of A,B,C,D three times and collect them into a run:
import random
diffCourses = ['A', 'B', 'C', 'D']
runs = [ a for b in (random.sample(diffCourses, k=4) for _ in range (3)) for a in b]
print(runs)
Output (one possible result; each consecutive group of four is one sample):
['A', 'D', 'C', 'B', 'A', 'B', 'D', 'C', 'B', 'A', 'D', 'C']
The random.sample(diffCourses, k=4) part shuffles ABCD in a random fashion and the nested list comprehension creates a flat list from the three sublists.
This automatically ensures you get every letter three times, in random order. You might still get the same course twice in a row (for example A A) if one block of four ends with A and the next block starts with A.
See "What does "list comprehension" mean? How does it work and how can I use it?" and "Understanding nested list comprehension" for how list comprehensions work.
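If back-to-back repeats at block boundaries are unwanted, one possible variation is to re-draw a block whenever it would start with the course the previous block ended on. This is only a sketch: the function name `blocked_run_order` is hypothetical, and re-drawing slightly changes which orders are possible compared with the plain version above.

```python
import random

def blocked_run_order(courses, blocks, rng=random):
    """Concatenate `blocks` random permutations of `courses`,
    never repeating a course across a block boundary."""
    runs = []
    for _ in range(blocks):
        block = rng.sample(courses, k=len(courses))
        # Re-draw while the new block starts with the previous block's last course.
        while runs and block[0] == runs[-1]:
            block = rng.sample(courses, k=len(courses))
        runs.extend(block)
    return runs

runs = blocked_run_order(['A', 'B', 'C', 'D'], 3)
print(runs)
```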

Remove some duplicates from list in python

UPDATE: I believe I found the solution. I've put it at the end.
Let’s say we have this list:
a = ['a', 'a', 'b', 'b', 'a', 'a', 'c', 'c']
I want to create another list to remove the duplicates from list a, but at the same time, keep the ratio approximately intact AND maintain order.
The output should be:
b = ['a', 'b', 'a', 'c']
EDIT: To explain better, the ratio doesn't need to be exactly intact. All that's required is the output of ONE single letter for all letters in the data. However, two letters might be the same but represent two different things. The counts are important to identify this as I say later. Letters representing ONE unique variable appear in counts between 3000-3400 so when I divide the total count by 3500 and round it, I know how many time it should appear in the end, but the problem is I don't know what order they should be in.
To illustrate this I'll include one more input and desired output:
Input: ['a', 'a', 'a', 'a', 'b', 'b', 'c', 'c', 'c', 'a', 'a', 'd', 'd', 'a', 'a']
Desired Output: ['a', 'a', 'b', 'c', 'a', 'd', 'a']
Note that 'c' is repeated three times. The ratio need not be preserved exactly; all I need to represent is how many times that variable occurs, and because it occurs only 3 times in this example, that isn't enough for it to count as two.
The only difference is that here I'm assuming all letters repeating exactly twice are unique, although in the data-set, again, uniqueness is dependent on the appearance of 3000-3400 times.
Note(1): This doesn't necessarily need to be considered, but there's a possibility that not all letters will be grouped together nicely. For example, taking 4 letters per unique item to keep it short: ['a','a','b','a','a','b','b','b','b'] should still be represented as ['a','b']. This is a minor problem in this case, however.
EDIT:
Example of what I've tried and successfully done:
full_list = ['a', 'a', 'b', 'b', 'a', 'a', 'c', 'c']
#full_list is a list containing around 10k items, just using this as example
rep = 2 # number of estimated repetitions for unique item,
# in the real list this was set to 3500
quant = {'a': 0, "b" : 0, "c" : 0, "d" : 0, "e" : 0, "f" : 0, "g": 0}
for x in set(full_list):
    quant[x] = round(full_list.count(x)/rep)
final = []
for x in range(len(full_list)):
    if full_list[x] in final:
        lastindex = len(full_list) - 1 - full_list[::-1].index(full_list[x])
        if lastindex == x and final.count(full_list[x]) < quant[full_list[x]]:
            final.append(full_list[x])
    else:
        final.append(full_list[x])
print(final)
My problem with the above code is two-fold:
If there are more than 2 repetitions of the same data, it will not count them correctly. For example: ['a', 'a', 'b', 'b', 'a', 'a', 'c', 'c', 'a', 'a'] should become ['a','b','a','c','a'] but instead it becomes ['a','b','c','a']
It takes a very long time to finish, as I'm sure it's a very inefficient way to do this.
Final remark: The code I've tried was more of a little hack to achieve the desired output on the most common input, however it doesn't do exactly what I intended it to. It's also important to note that the input changes over time. Repetitions of single letters aren't always the same, although I believe they're always grouped together, so I was thinking of making a flag that is True when it hits a letter and becomes false as soon as it changes to a different one, but this also has the problem of not being able to account for the fact that two letters that are the same might be put right next to each other. The count for each letter as an individual is always between 3000-3400, so I know that if the count is above that, there are more than 1.
UPDATE: Solution
Following hiro protagonist's suggestion with minor modifications, the following code seems to work:
full = ['a', 'a', 'b', 'b', 'a', 'a', 'c', 'c', 'a', 'a']
from itertools import groupby
letters_pre = [key for key, _group in groupby(full)]
letters_post = []
for x in range(len(letters_pre)):
    if x > 0 and letters_pre[x] != letters_pre[x-1]:
        letters_post.append(letters_pre[x])
    if x == 0:
        letters_post.append(letters_pre[x])
print(letters_post)
The only problem is that it doesn't consider that sometimes letters can appear in between unique ones, as described in "Note(1)", but that's only a very minor issue. The bigger issue is that it doesn't handle two separate occurrences of the same letter that are consecutive. For example (using two repetitions per unique item): ['a','a','a','a','b','b'] gets turned into ['a','b'] when the desired output is ['a','a','b'].

this is where itertools.groupby may come in handy:
from itertools import groupby
a = ["a", "a", "b", "b", "a", "a", "c", "c"]
res = [key for key, _group in groupby(a)]
print(res) # ['a', 'b', 'a', 'c']
this is a version where you could 'scale' down the unique keys (but are guaranteed to have at least one in the result):
from itertools import groupby, repeat, chain
a = ['a', 'a', 'a', 'a', 'b', 'b', 'c', 'c', 'c', 'c', 'c', 'a', 'a',
     'd', 'd', 'a', 'a']
scale = 0.4
key_count = tuple((key, sum(1 for _item in group)) for key, group in groupby(a))
# (('a', 4), ('b', 2), ('c', 5), ('a', 2), ('d', 2), ('a', 2))
res = tuple(
    chain.from_iterable(
        (repeat(key, round(scale * count) or 1)) for key, count in key_count
    )
)
# ('a', 'a', 'b', 'c', 'c', 'a', 'd', 'a')
there may be smarter ways to determine the scale (probably based on the length of the input list a and the average group length).
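When the expected run length of one unique item is known (2 in the question's toy examples, about 3500 in the real data), the scale can be expressed through that number directly. The sketch below is one possible reading of the question: the function name `collapse_runs` is hypothetical, and rounding ties downward is an assumption made so that a run of three 'c' values with an expected length of 2 counts as one item, as in the question's second example.

```python
from itertools import groupby
from math import ceil

def collapse_runs(items, rep):
    """Replace each run of equal items by roughly run_length / rep copies (at least one)."""
    result = []
    for key, group in groupby(items):
        count = sum(1 for _ in group)
        # Round count / rep to the nearest integer, with ties rounding down
        # (assumption), and never emit fewer than one copy per run.
        copies = max(1, ceil(count / rep - 0.5))
        result.extend([key] * copies)
    return result

print(collapse_runs(['a', 'a', 'b', 'b', 'a', 'a', 'c', 'c'], 2))
print(collapse_runs(['a', 'a', 'a', 'a', 'b', 'b', 'c', 'c', 'c',
                     'a', 'a', 'd', 'd', 'a', 'a'], 2))
```

Like the groupby version above, this treats every run separately, so stray items interleaved inside a run (the situation in "Note(1)") still split it.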

Might be a strange one, but:
b = []
for i in a:
    if next(iter(b[::-1]), None) != i:
        b.append(i)
print(b)
Output:
['a', 'b', 'a', 'c']

How can I get the list to split how i want automatically?

I have some code here:
lsp_rows = ['a', 'b', 'c', 'd', 'e', 'b', 'c', 'd', 'e', 'a', 'c',
'd', 'e', 'a', 'b', 'd', 'e', 'a', 'b', 'c', 'e', 'a',
'b', 'c', 'd']
n = int(width/length)
x = [a+b+c+d+e for a,b,c,d,e in zip(*[iter(lsp_rows)]*n)]
Currently, this will split my list "lsp_rows" in groups of 5 all the time as my n = 5. But I need it to split differently depending on "n" as it will change depending on the values of width and length.
So if n is 4, I need the list to split into groups of 4.
I can see that the problem is with the "a+b+c+d+e for a,b,c,d,e" part, but I don't know how to make it adapt without manual input. Is there a way to solve this?
If you could explain as thoroughly as possible, I'd really appreciate it, as I'm pretty new to Python. Thanks in advance!

With strings only you can:
[''.join(t) for t in zip(*[iter(lsp_rows)]*n)]
Or slightly more succinct and possibly less memory usage:
map(''.join, zip(*[iter(lsp_rows)]*n))
The answer provided by @hpaulj is more useful in the general case.
And, on the off-chance that you're just trying to generate the cycles of a string, the following will produce the same output.
s = 'abcde'
[s[i:] + s[:i] for i in range(len(s))]

I believe this will generalize your expression to n items:
import functools
import operator
[functools.reduce(operator.add,abc) for abc in zip(*[iter(x)]*n)]
though I'd still like to see a test case.
For example if x is a list of lists, the result is a list of x flattened.
A list of numbers or a string look better:
In [394]: [functools.reduce(operator.add,abc) for abc in zip(*[iter('abcdefghij')]*4)]
Out[394]: ['abcd', 'efgh']
In [395]: [functools.reduce(operator.add,abc) for abc in zip(*[iter('abcdefghij')]*5)]
Out[395]: ['abcde', 'fghij']
In [396]: [functools.reduce(operator.add,abc) for abc in zip(*[iter(range(20))]*5)]
Out[396]: [10, 35, 60, 85]
with your list of characters
In [400]: [functools.reduce(operator.add,abc) for abc in zip(*[iter(lsp_rows)]*5)]
Out[400]: ['abcde', 'bcdea', 'cdeab', 'deabc', 'eabcd']
In [401]: [functools.reduce(operator.add,abc) for abc in zip(*[iter(lsp_rows)]*6)]
Out[401]: ['abcdeb', 'cdeacd', 'eabdea', 'bceabc']
All these imports can be replaced with join if the items are strings.
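Note that the `zip(*[iter(...)]*n)` idiom silently drops a trailing group with fewer than n items, as the n=6 output above shows (24 of the 25 items are used). If the leftover items matter, a slicing-based alternative is sketched below; this is a different technique from the zip idiom, it assumes the items are strings, and the helper name `join_chunks` is hypothetical.

```python
lsp_rows = ['a', 'b', 'c', 'd', 'e', 'b', 'c', 'd', 'e', 'a', 'c',
            'd', 'e', 'a', 'b', 'd', 'e', 'a', 'b', 'c', 'e', 'a',
            'b', 'c', 'd']

def join_chunks(items, n):
    """Join consecutive groups of n strings; the last group may be shorter."""
    return [''.join(items[i:i + n]) for i in range(0, len(items), n)]

print(join_chunks(lsp_rows, 5))
print(join_chunks(lsp_rows, 4))

# For comparison: the zip idiom keeps only the complete groups.
zipped = [''.join(t) for t in zip(*[iter(lsp_rows)] * 4)]
print(zipped)
```

The zip idiom works because the list holds n references to one shared iterator, so each output tuple consumes n consecutive items.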

Generating a list of random lists

I'm new to Python, so I might be doing basic errors, so apologies first.
Here is the kind of result I'm trying to obtain :
foo = [
["B","C","E","A","D"],
["E","B","A","C","D"],
["D","B","A","E","C"],
["C","D","E","B","A"]
]
So basically, a list of lists of randomly permuted letters, with no letter repeated within a list.
Here is the look of what I can get so far :
foo = ['BDCEA', 'BDCEA', 'BDCEA', 'BDCEA']
The main problem is that the same permutation comes out every time. This is my code so far:
import random
import numpy as np
letters = ["A", "B", "C", "D", "E"]
nblines = 4
foo = np.repeat(''.join(random.sample(letters, len(letters))), nblines)
Help appreciated. Thanks

The problem with your code is that the line
foo = np.repeat(''.join(random.sample(letters, len(letters))), nblines)
will first create a single random permutation, and then repeat that same permutation nblines times. numpy.repeat does not repeatedly invoke a function; it repeats the elements of an already existing array, which in this case holds the one string you created with random.sample.
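The same effect can be reproduced without NumPy, which may make the cause easier to see: function arguments are evaluated once, before the repeating operation ever runs. The sketch below uses list repetition as a stand-in for numpy.repeat; the helper `random_permutation` and the `calls` counter are hypothetical names added for illustration.

```python
import random

letters = ["A", "B", "C", "D", "E"]
calls = []  # one entry is appended per call, so len(calls) counts the calls

def random_permutation():
    calls.append(1)
    return ''.join(random.sample(letters, len(letters)))

# The function runs once; its single result is then repeated 4 times.
repeated = [random_permutation()] * 4
calls_for_repeat = len(calls)

# The function runs once per iteration, giving an independent draw each time.
fresh = [random_permutation() for _ in range(4)]
calls_for_fresh = len(calls) - calls_for_repeat

print(repeated, calls_for_repeat)
print(fresh, calls_for_fresh)
```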
Another thing is that numpy is designed to work with numbers, not strings. Here is a short code snippet (without using numpy) to obtain your desired result:
[random.sample(letters,len(letters)) for i in range(nblines)]
Result: similar to this:
foo = [
["B","C","E","A","D"],
["E","B","A","C","D"],
["D","B","A","E","C"],
["C","D","E","B","A"]
]
I hope this helped ;)
PS: I see that others gave similar answers to this while I was writing it.

np.repeat repeats the same array. Your approach would work if you changed it to:
[''.join(random.sample(letters, len(letters))) for _ in range(nblines)]
Out: ['EBCAD', 'BCEAD', 'EBDCA', 'DBACE']
This is a short way of writing this:
foo = []
for _ in range(nblines):
    foo.append(''.join(random.sample(letters, len(letters))))
foo
Out: ['DBACE', 'CBAED', 'ACDEB', 'ADBCE']

Here's a plain Python solution using a "traditional" style for loop.
from random import shuffle
nblines = 4
letters = list("ABCDE")
foo = []
for _ in range(nblines):
    shuffle(letters)
    foo.append(letters[:])
print(foo)
typical output
[['E', 'C', 'D', 'A', 'B'], ['A', 'B', 'D', 'C', 'E'], ['A', 'C', 'B', 'E', 'D'], ['C', 'A', 'E', 'B', 'D']]
The random.shuffle function shuffles the list in-place. We append a copy of the list to foo using letters[:], otherwise foo would just end up containing 4 references to the one list object.
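To see why the copy matters, the following small sketch builds both variants side by side. The names `aliased` and `copied` are introduced here only for the demonstration.

```python
from random import shuffle

letters = list("ABCDE")
aliased = []  # stores the same list object every time
copied = []   # stores an independent snapshot every time

for _ in range(4):
    shuffle(letters)
    aliased.append(letters)
    copied.append(letters[:])

# Every row of `aliased` is the one `letters` object, so all rows show
# whatever the final shuffle produced.
print(aliased)
# Each row of `copied` keeps the order it had when it was appended.
print(copied)
```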
Here's a slightly more advanced version, using a generator function to handle the shuffling. Each time we call next(sh) it shuffles the lst list stored in the generator and returns a copy of it. So we can call next(sh) in a list comprehension to build the list, which is a little neater than using a traditional for loop. Also, list comprehensions can be slightly faster than using .append in a traditional for loop.
from random import shuffle
def shuffler(seq):
    lst = list(seq)
    while True:
        shuffle(lst)
        yield lst[:]
sh = shuffler('ABCDE')
foo = [next(sh) for _ in range(10)]
for row in foo:
    print(row)
typical output
['C', 'B', 'A', 'E', 'D']
['C', 'A', 'E', 'B', 'D']
['D', 'B', 'C', 'A', 'E']
['E', 'D', 'A', 'B', 'C']
['B', 'A', 'E', 'C', 'D']
['B', 'D', 'C', 'E', 'A']
['A', 'B', 'C', 'E', 'D']
['D', 'C', 'A', 'B', 'E']
['D', 'C', 'B', 'E', 'A']
['E', 'D', 'A', 'C', 'B']

python: compare lists in a sequence using nested for loops

so I have two lists where I compare a person's answers to the correct answers:
correct_answers = ['A', 'C', 'A', 'B', 'D']
user_answers = ['B', 'A', 'C', 'B', 'D']
I need to compare the two of them (without using sets, if that's even possible) and keep track of how many of the person's answers are wrong - in this case, 3
I tried using the following for loops to count how many were correct:
correct = 0
for i in correct_answers:
    for j in user_answers:
        if i == j:
            correct += 1
print(correct)
but this doesn't work and I'm not sure what I need to change to make it work.

Just count them:
correct_answers = ['A', 'C', 'A', 'B', 'D']
user_answers = ['B', 'A', 'C', 'B', 'D']
incorrect = sum(1 if correct != user else 0
                for correct, user in zip(correct_answers, user_answers))
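For context on why the nested loops in the question miscount: they compare every correct answer with every user answer (5 x 5 pairs), whereas zip pairs up only the answers at the same position. A short sketch of the difference, using hypothetical variable names:

```python
correct_answers = ['A', 'C', 'A', 'B', 'D']
user_answers = ['B', 'A', 'C', 'B', 'D']

# Nested loops: counts matches across all 25 (i, j) combinations.
all_pairs_equal = 0
for i in correct_answers:
    for j in user_answers:
        if i == j:
            all_pairs_equal += 1

# zip: compares question 1 with question 1, question 2 with question 2, ...
correct = sum(1 for c, u in zip(correct_answers, user_answers) if c == u)
incorrect = len(correct_answers) - correct

print(all_pairs_equal)     # 6
print(correct, incorrect)  # 2 3
```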

I blame @alecxe for convincing me to post this, the ultra-efficient solution:
# Python 2 only: uncomment the next line to get an iterator-based map and avoid intermediate lists; on Python 3, map is already lazy
# from future_builtins import map
from operator import ne
numincorrect = sum(map(ne, correct_answers, user_answers))
Pushes all the work to the C layer (making it crazy fast, modulo the initial cost of setting it all up; no byte code is executed if the values processed are Python built-in types, which removes a lot of overhead), and one-lines it without getting too cryptic.

The less pythonic, more generic (and readable) solution is pretty simple too.
correct_answers = ['A', 'C', 'A', 'B', 'D']
user_answers = ['B', 'A', 'C', 'B', 'D']
incorrect = 0
for i in range(len(correct_answers)):
    if correct_answers[i] != user_answers[i]:
        incorrect += 1
This assumes your lists are the same length. If you need to validate that, you can do it before running this code.
EDIT: The following code does the same thing, provided you are familiar with zip
correct_answers = ['A', 'C', 'A', 'B', 'D']
user_answers = ['B', 'A', 'C', 'B', 'D']
incorrect = 0
for answer_tuple in zip(correct_answers, user_answers):
    if answer_tuple[0] != answer_tuple[1]:
        incorrect += 1
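Both versions assume the lists have the same length; with zip, extra items in the longer list are silently ignored. If that should be treated as an error, a length check could be added first, as in this sketch. The function name `count_incorrect` and the choice of ValueError are assumptions for illustration.

```python
def count_incorrect(correct_answers, user_answers):
    """Count positions where the user's answer differs from the correct one."""
    if len(correct_answers) != len(user_answers):
        raise ValueError("answer lists must have the same length")
    # Each comparison is True (1) or False (0), so sum counts the mismatches.
    return sum(c != u for c, u in zip(correct_answers, user_answers))

print(count_incorrect(['A', 'C', 'A', 'B', 'D'], ['B', 'A', 'C', 'B', 'D']))  # 3
```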
