Grouping, summing, sorting and selecting within a DataFrame

Grouping, summing, sorting and selecting within a DataFrame - python

I have a DataFrame like this one:
df=pd.DataFrame({'State' : ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
'County' : ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i'],
'Population': [10, 11, 12, 13, 17, 16, 15, 18, 14]})
Looking at the two most populous counties for each state, what are the two most populous states (in order of highest population to lowest population)?
I solved it by using a loop, and now I'm trying to get the same result grouping, summing, sorting and selecting.
The following code works, but I'm sure there are many differnt and more elegant way to do it.
df.groupby(['State'])['Population'].nlargest(2).groupby(['State']).sum()\
.sort_values(ascending=False)[:2].to_frame()\
.reset_index()['State'].tolist()

You can't shorten this alittle.
df.groupby(['State'])['Population'].nlargest(2)\
.sum(level=0).sort_values(ascending=False).index[:2].tolist()
No need to convert back to dataframe to retreive states, just get the states from the index directly. Using sum with level parameter is just short syntax that over groupby again.

(df.sort_values('Population', ascending=False) # order by highest population per country
.groupby('State').head(2) # get two most populous counties per state
.groupby('State').sum() # get population of two largest counties per state
.sort_values('Population', ascending = False)[:2] # get top 2 states by population
.index # get states names
.tolist() # convert to list
)
Here's an alternate way to do it with explanations of each operation

Related

how to automatically drop index levels that only have single value?

I have a dataframe that has A to M columns for example. Then I did:
groups = df.groupby(['E', 'D', 'B', 'G', 'I'])
stats = pd.concat(
[
groups['N'].mean().rename("N_mean"),
groups['H'].median().rename('H_median')
]
)
stats = stats[stats['N']>0]
now if I print stats, the index is ('E', 'D', 'B', 'G', 'I'). However, many of them only contain single value which means they are insignificant. I knew I can determine which level is insignificant then stats.index.droplevel(...). But I wonder is there already builtin method to automatically do this?

Highlighting specific columns in bar chart in python using altair

I want to highlight specific data points in a bar chart in python based on my requirement. using altair, I am able to achieve results for one data point (for e.g 'A' in the code). Here's the sample data frame and code
import pandas as pd
import altair as alt
df = pd.DataFrame({
'Rank': [25, 20, 40, 10, 50, 35],
'Name': ['A', 'B', 'C', 'D', 'E', 'F'],
})
bar1 =alt.Chart(df).mark_bar().encode( y = alt.Y('Rank'),
x = alt.X('Name:O', sort ='x'),
color=alt.condition(
alt.datum.Name == 'A',
alt.value('red'),
alt.value('blue')
))
bar1
How can I highlight two or more datapoints (eg. A & B) with the same color and others with a different one? I tried passing the names as a list Select = ['A', 'B'] and then passing the alt.datum.Name == Select but that does not work?
How can i get this done?
Also, trying to understand why passing as a list did not work?
thank you

You could use the FieldOneOfPredicate to check if the Name column is one of the items in the list:
import pandas as pd
import altair as alt
df = pd.DataFrame({
'Rank': [25, 20, 40, 10, 50, 35],
'Name': ['A', 'B', 'C', 'D', 'E', 'F'],
})
bar1 =alt.Chart(df).mark_bar().encode( y = alt.Y('Rank'),
x = alt.X('Name:O', sort ='x'),
color=alt.condition(
alt.FieldOneOfPredicate('Name', ['A', 'B']),
alt.value('red'),
alt.value('blue')
))
bar1
You can read more about it in the VegaLite docs. You could also use two expression string with an "or" operator:
color=alt.condition(
"datum.Name == 'A'"
"|| datum.Name == 'B'", # splitting on two rows for readability
alt.value('red'),
alt.value('blue')
)
I don't think there is a single Vega expression operator that you can use for checking membership like Python's in. This answer doesn't mention it either.

Remove some duplicates from list in python

UPDATE: I believe I found the solution. I've put it at the end.
Let’s say we have this list:
a = ['a', 'a', 'b', 'b', 'a', 'a', 'c', 'c']
I want to create another list to remove the duplicates from list a, but at the same time, keep the ratio approximately intact AND maintain order.
The output should be:
b = ['a', 'b', 'a', 'c']
EDIT: To explain better, the ratio doesn't need to be exactly intact. All that's required is the output of ONE single letter for all letters in the data. However, two letters might be the same but represent two different things. The counts are important to identify this as I say later. Letters representing ONE unique variable appear in counts between 3000-3400 so when I divide the total count by 3500 and round it, I know how many time it should appear in the end, but the problem is I don't know what order they should be in.
To illustrate this I'll include one more input and desired output:
Input: ['a', 'a', 'a', 'a', 'b', 'b', 'c', 'c', 'c', 'a', 'a', 'd', 'd', 'a', 'a']
Desired Output: ['a', 'a', 'b', 'c', 'a', 'd', 'a']
Note that 'C' has been repeated three times. The ratio needs not be preserved exactly, all I need to represent is how many times that variable is represented and because it's represented 3 times only in this example, it isn't considered enough for it to count as two.
The only difference is that here I'm assuming all letters repeating exactly twice are unique, although in the data-set, again, uniqueness is dependent on the appearance of 3000-3400 times.
Note(1): This doesn't necessarily need to be considered but there's a possibility that not all letters will be grouped together nicely, for example, considering 4 letters for uniqueness to make it short: ['a','a',''b','a','a','b','b','b','b'] should still be represented as ['a','b']. This is a minor problem in this case, however.
EDIT:
Example of what I've tried and successfully done:
full_list = ['a', 'a', 'b', 'b', 'a', 'a', 'c', 'c']
#full_list is a list containing around 10k items, just using this as example
rep = 2 # number of estimated repetitions for unique item,
# in the real list this was set to 3500
quant = {'a': 0, "b" : 0, "c" : 0, "d" : 0, "e" : 0, "f" : 0, "g": 0}
for x in set(full_list):
quant[x] = round(full_list.count(x)/rep)
final = []
for x in range(len(full_list)):
if full_list[x] in final:
lastindex = len(full_list) - 1 - full_list[::-1].index(full_list[x])
if lastindex == x and final.count(full_list[x]) < quant[full_list[x]]:
final.append(full_list[x])
else:
final.append(full_list[x])
print(final)
My problem with the above code is two-fold:
If there are more than 2 repetitions of the same data, it will not count them correctly. For example: ['a', 'a', 'b', 'b', 'a', 'a', 'c', 'c', 'a', 'a'] should become ['a','b','a','c','a'] but instead it becomes ['a','b,'c','a']
It takes a very log time to finish as I'm sure it's a very
inefficient way to do this.
Final remark: The code I've tried was more of a little hack to achieve the desired output on the most common input, however it doesn't do exactly what I intended it to. It's also important to note that the input changes over time. Repetitions of single letters aren't always the same, although I believe they're always grouped together, so I was thinking of making a flag that is True when it hits a letter and becomes false as soon as it changes to a different one, but this also has the problem of not being able to account for the fact that two letters that are the same might be put right next to each other. The count for each letter as an individual is always between 3000-3400, so I know that if the count is above that, there are more than 1.
UPDATE: Solution
Following hiro protagonist's suggestion with minor modifications, the following code seems to work:
full = ['a', 'a', 'b', 'b', 'a', 'a', 'c', 'c', 'a', 'a']
from itertools import groupby
letters_pre = [key for key, _group in groupby(full)]
letters_post = []
for x in range(len(letters_pre)):
if x>0 and letters_pre[x] != letters_pre[x-1]:
letters_post.append(letters_pre[x])
if x == 0:
letters_post.append(letters_pre [x])
print(letters_post)
The only problem is that it doesn't consider that sometimes letters can appear in between unique ones, as described in "Note(1)", but that's only a very minor issue. The bigger issue is that it doesn't consider when two separate occurances of the same letter are consecutive, for example (two for uniqueness as example): ['a','a','a','a','b','b'] gets turned to ['a','b'] when desired output should be ['a','a','b']

this is where itertools.groupby may come in handy:
from itertools import groupby
a = ["a", "a", "b", "b", "a", "a", "c", "c"]
res = [key for key, _group in groupby(a)]
print(res) # ['a', 'b', 'a', 'c']
this is a version where you could 'scale' down the unique keys (but are guaranteed to have at leas one in the result):
from itertools import groupby, repeat, chain
a = ['a', 'a', 'a', 'a', 'b', 'b', 'c', 'c', 'c', 'c', 'c', 'a', 'a',
'd', 'd', 'a', 'a']
scale = 0.4
key_count = tuple((key, sum(1 for _item in group)) for key, group in groupby(a))
# (('a', 4), ('b', 2), ('c', 5), ('a', 2), ('d', 2), ('a', 2))
res = tuple(
chain.from_iterable(
(repeat(key, round(scale * count) or 1)) for key, count in key_count
)
)
# ('a', 'a', 'b', 'c', 'c', 'a', 'd', 'a')
there may be smarter ways to determine the scale (probably based on the length of the input list a and the average group length).

Might be a strange one, but:
b = []
for i in a:
if next(iter(b[::-1]), None) != i:
b.append(i)
print(b)
Output:
['a', 'b', 'a', 'c']

Sorting a txt-file in a descending order in Python

I have a text file with two columns.
I want to sort this file in a descending order, based on the second column.
In the following example, I have tree rows and two columns.
So my input is the following array:
array([['A', 82512.09],
['B', 4036.5],
['C', 1187798.0]])
My output that I want to achieve is:
array([['C', 1187798.0],
['A', 82512.09],
['B', 4036.5]])
Is there an efficient way of achieving this?
Thanks in advance,
Steven

sorted has some nice features to help your sorting. You can define a key via lambda and finally reverse the order to sort in descending order.
Have a look at: https://wiki.python.org/moin/HowTo/Sorting
mylist = [['A', 82512.09], ['B', 4036.5], ['C', 1187798.0]]
result = sorted(mylist, key=lambda second_col: second_col[1], reverse=True)
# output: [['C', 1187798.0], ['A', 82512.09], ['B', 4036.5]]

Python Pandas DataFrame pivot_table bizarre values

"Bizarre" is such an emotionally charged word.
Assume that I have 5 students: A, B, C, D, and E. Each of these students grades two of their peers on a writing assignment. The data is as follows:
peer_review = pd.DataFrame({
'Student': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'D', 'D', 'D', 'D', 'E', 'E'],
'Assessor': ['B', 'C', 'A', 'D', 'D', 'D', 'B', 'D', 'D', 'D', 'A', 'A', 'A', 'E', 'C', 'E'],
'Score': [72, 53, 92, 100, 2, 90, 75, 50, 50, 47, 97, 86, 41, 17, 47, 29]})
Now, in some cases an assessor graded the student's assignment more than once. Maybe the student turned it in and revised several times. Maybe the assessor was drunk and didn't remember that he had already graded this student's assignment. In any case, I would like to be able to see a list of all scores that each assessor gave to each student. I tried to do this as follows:
peer_review.pivot_table(
index='Student',
columns='Assessor',
values='Score',
aggfunc=identity)
I can already hear you asking --- What is the "identity" function? It's this:
def identity(x):
return x
However, when I run this the pivot_table function repeatedly, it gives me different answers each time for the cells that have multiple values.
So, here are the questions:
What is the significance of the numbers that seem to change randomly as I run the pivot_table function repeatedly?
How do I fix the identity function so that it returns a simple list of all the scores when an assessor graded the same assignment more than once?
------------------UPDATE #1:------------------
I found that it is a pandas Series object that is being passed to the identity function. I changed the identity function to this:
def identity(x):
return x.values
This still gives me the bizarre random numbers. Realizing that x.values is a numpy.ndarray, I then tried this:
def identity(x):
return x.values.tolist()
This results in a ValueError exception. ("Function does not reduce.")
------------------UPDATE #2:------------------
The workaround proposed by ZJS works perfectly. Still wondering why pivot_table has failed me.

This will work every time...
groups = peer_review.groupby(['Assessor','Student']) #groups into Assessor,Student combos
peer_review = groups.apply(lambda x:list(x['Score'])) #apply your group function
peer_review =peer_review.unstack('Student') #Set student index as the columns
I'm still investigating why pivot_table doesn't work

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Grouping, summing, sorting and selecting within a DataFrame - python

Related

how to automatically drop index levels that only have single value?

Highlighting specific columns in bar chart in python using altair

Remove some duplicates from list in python

Sorting a txt-file in a descending order in Python

Python Pandas DataFrame pivot_table bizarre values

Categories

Resources