Sort CSV file by count on column values - python

May I know how I can sort a CSV file by a certain column, not by the values in the column themselves, but by how often each value occurs, so that the rows whose value occurs most often appear first (or last)?
Is it possible to do this using the csv package or pandas? If I can see both, that will be great.
I hope I have described the problem in an understandable manner.

With pandas you can use the key parameter of sort_values() together with a lambda that effectively computes the frequency of each value:
import numpy as np
import pandas as pd

df = pd.DataFrame({"col": np.random.choice(list("abcd"), 20, p=(.46, .46, .04, .04))})
# sort by how often each value occurs (least frequent first)
df.sort_values("col", key=lambda s: s.groupby(s).transform("size"))
output:
    col
0     c
2     d
1     a
16    a
5     a
15    a
8     a
13    a
11    a
17    b
14    b
12    b
9     b
18    b
7     b
6     b
4     b
3     b
10    b
19    b

Related

Drop all the subsequent columns after a particular column with some string match python

I have a very big dataset and I work by selecting subsets of the entire data. In each of these subsets I want to drop all the columns that come after the column whose name contains the string randomnr.
My df columns look like this:
A B C D E randomnr H I J K
If this is the subset I am working on, I want to drop the H I J K columns, which come after my common string randomnr; this string match is common to all the subsets.
For example, in the first subset the column can be 'randomnr_abc' and in the second subset 'randomnr_123', but all subsets contain 'randomnr'.
I am specifically looking to drop these columns for the subset I am working on, so that I can use the same code for all the other subsets.
Please help me on this. Thanks in advance.
IIUC, use pandas.Index.str.find with argmax (assuming your keyword exists in exactly one column):
print(df.iloc[:, :df.columns.str.find("randomnr").argmax()+1])
Sample:
# df
A B C D E randomnr_123 H I J K
0 1 2 3 4 5 6 7 8 9 10
# df2
A B randomnr_abc H I J K
0 1 2 6 7 8 9 10
Output:
print(df.iloc[:, :df.columns.str.find("randomnr").argmax()+1])
A B C D E randomnr_123
0 1 2 3 4 5 6
print(df2.iloc[:, :df2.columns.str.find("randomnr").argmax()+1])
A B randomnr_abc
0 1 2 6
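Since the goal is to reuse the same code for every subset, one possible way to wrap this up is a small helper function (a sketch only; the name drop_after_keyword is made up here, and it assumes exactly one column per subset contains the keyword):

def drop_after_keyword(frame, keyword="randomnr"):
    # position of the first column whose name contains the keyword
    stop = frame.columns.str.contains(keyword).argmax()
    return frame.iloc[:, :stop + 1]

drop_after_keyword(df)   # A B C D E randomnr_123
drop_after_keyword(df2)  # A B randomnr_abc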

Getting lowest valued duplicated columns only

I have a dataframe with 2 columns: value and product. There will be duplicated products, but with different values. What I want to do is to get all products, but remove any duplication. The condition to remove duplication will be to get the row with the lowest value and drop the rest. For example, I want something like this:
Before:
product value
A 25
B 45
C 15
C 14
C 13
B 22
After
product value
A 25
B 22
C 13
How can I make it so that only the lowest valued duplicated rows get kept in the new dataframe?
df.sort_values('value').groupby('product').first()
# value
#product
#A 25
#B 22
#C 13
You can sort_values and then drop_duplicates:
res = df.sort_values('value').drop_duplicates('product')
Looking at the requirement, you actually don't need drop_duplicates or sort_values at all, since we are simply looking for the minimum value per product in the DataFrame. There are a couple of ways of doing it, as follows.
I believe one of the shortest ways is to look up the relevant rows by index using idxmin:
>>> df
product value
0 A 25
1 B 45
2 C 15
3 C 14
4 C 13
5 B 22
>>> df.loc[df.groupby('product')['value'].idxmin()]
product value
0 A 25
5 B 22
4 C 13
OR
Another short and elegant way is to compute the min of the group values using groupby.min():
>>> df
product value
0 A 25
1 B 45
2 C 15
3 C 14
4 C 13
5 B 22
>>> df.groupby('product').min()
value
product
A 25
B 22
C 13
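Note that groupby('product').min() (like the .first() approach above) leaves product as the index. If you want product back as an ordinary column, a small variation of the same idea is:

df.groupby('product', as_index=False)['value'].min()
#   product  value
# 0       A     25
# 1       B     22
# 2       C     13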

Create a new dataframe by aggregating repeated origin and destination values by a separate count column in a pandas dataframe

I am having trouble analysing origin-destination values in a pandas dataframe which contains origin/destination columns and a count column of the frequency of these. I want to transform this into a dataframe with the count of how many are leaving and entering:
Initial:
Origin Destination Count
A B 7
A C 1
B A 1
B C 4
C A 3
C B 10
For example, this simplified dataframe has 7 leaving from A to B and 1 from A to C, so overall leaving place A would be 8, and entering place A would be 4 (B to A is 1, C to A is 3), etc. The new dataframe would look something like this.
Goal:
Place Entering Leaving
A 4 8
B 17 5
C 5 13
I have tried several techniques such as .groupby() but have not yet created my intended dataframe. How can I handle the repeated values in the origin/destination columns and assign them to a new dataframe with aggregated values of just the count of leaving and entering?
Thank you!
Use double groupby + concat:
a = df.groupby('Destination')['Count'].sum()
b = df.groupby('Origin')['Count'].sum()
df = pd.concat([a,b], axis=1, keys=('Entering','Leaving')).rename_axis('Place').reset_index()
print (df)
Place Entering Leaving
0 A 4 8
1 B 17 5
2 C 5 13
pivot_table, then sum:
df = pd.pivot_table(df, index='Origin', columns='Destination', values='Count', aggfunc='sum')
pd.concat([df.sum(axis=0), df.sum(axis=1)], axis=1)
Out[428]:
0 1
A 4.0 8.0
B 17.0 5.0
C 5.0 13.0
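If some places only ever appear as an origin or only as a destination, the two sums will not line up exactly; a sketch of a slightly more defensive version of the groupby answer (starting again from the original dataframe) fills the missing side with 0:

entering = df.groupby('Destination')['Count'].sum()
leaving = df.groupby('Origin')['Count'].sum()
out = (pd.concat([entering, leaving], axis=1, keys=('Entering', 'Leaving'))
         .fillna(0)
         .rename_axis('Place')
         .reset_index())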

How does pandas argsort work? How do I interpret the result?

I have the following pandas series:
>>>ser
num let
0 a 12
b 11
c 18
1 a 10
b 8
c 5
2 a 8
b 9
c 6
3 a 15
b 10
c 11
When I use argsort, I get this:
>>>ser.argsort()
num let
0 a 5
b 8
c 4
1 a 6
b 7
c 3
2 a 10
b 1
c 11
3 a 0
b 9
c 2
Which I don't really understand. Shouldn't ser[(1, 'c')] get the lowest value from argsort?
I am further confused by how ordering ser according to ser.argsort() works like a charm:
>>>ser.iloc[ser.argsort()]
num let
1 c 5
2 c 6
1 b 8
2 a 8
b 9
1 a 10
3 b 10
0 b 11
3 c 11
0 a 12
3 a 15
0 c 18
Will appreciate any input to help me sort this out.
Per the documentation: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.argsort.html
pd.Series.argsort()
does the same job as np.ndarray.argsort(), namely (https://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html#numpy-argsort)
"Returns the indices that would sort an array."
So it returns a Series whose values are the integer positions that would sort the original Series. This is why, when you call ser.iloc[ser.argsort()], you get a sorted Series.
If you're looking for a simple way to sort the series by values, why not just use ser.sort_values()?
The confusion over what ser.argsort()[(1, 'c')] returns is understandable.
You might expect it to return the position of ser[(1, 'c')] after the sort, but that's not what it's doing.
What ser.argsort()[(1, 'c')] tells you is:
once we've performed the argsort, what the original positional index was of the value that now resides at the position of (1, 'c').
(1, 'c') is the sixth entry of the series; after sorting, the sixth entry is (1, 'a') with value 10, which is ser.iloc[3] in the original series, hence you get 3.
It's not at all intuitive, but that's what it is!
argsort returns a series with the same index as the initial series (so you can use .iloc as you have), but with the values replaced by the original positions of the values in sorted order.
No, that's not how argsort works. argsort tells you where each element of the sorted result comes from in the original series. If you look at the argsorted series, the first value is 5, meaning the first element of the sorted series comes from position 5 of the original series. And if you look at position 5 of the original series, you'll see the value 5, which is indeed the smallest value. And so on.
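A small self-contained example (the values here are made up purely for illustration) may make this easier to see:

import pandas as pd

s = pd.Series([30, 10, 20], index=list("xyz"))

print(s.argsort())
# x    1
# y    2
# z    0
# the smallest value sits at position 1, the next at position 2, the largest at position 0

print(s.iloc[s.argsort()])
# y    10
# z    20
# x    30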

Aggregating dataframe to give sum of elements and string of grouped indices

I'm trying to use groupby to give me the sum or mean of a number of elements, and a string of the original row indices for each group. So for instance, the dataframe:
>>> df = pd.DataFrame([[1,2,3],[1,3,4],[2,3,4],[2,5,6],[7,8,3],[11,12,13],[11,2,3]],index = ['p','q','r','s','t','u','v'],columns =['a','b','c'])
a b c
p 1 2 3
q 1 3 4
r 2 3 4
s 2 5 6
t 7 8 3
u 11 12 13
v 11 2 3
I would then like df to be grouped by 'a', to give:
b c indices
1 5 7 p,q
2 8 10 r,s
7 8 3 t
11 14 16 u,v
So far, I've tried:
df.groupby('a').agg({'score' : np.sum, 'indices' : lambda x: ",".join(list(x.index.values))})
But I am receiving an error because 'indices' does not exist; can anyone advise how to accomplish what I'm trying to do?
Thanks
The way aggregation works is that you give a key and a value, where the key is a pre-existing column name and the value is a function to apply to that column.
So to get the sums the way you want, you do the following:
>>> grouped = df.groupby('a')
>>> grouped.agg({'b' : np.sum, 'c' : np.sum}).head()
c b
a
1 7 5
2 10 8
7 3 8
11 16 14
But you want to know the rows that have been combined in a third column. So you actually need to add this column before you groupby! Here is the full code:
# placeholder column so that 'indices' is a valid key in the agg dict;
# its values are never used - the lambda joins the original index labels
df['indices'] = range(len(df))
grouped = df.groupby('a')
final = grouped.agg({'b': np.sum, 'c': np.sum, 'indices': lambda x: ",".join(list(x.index.values))})
then you get the following result:
>>> final.head()
indices c b
a
1 p,q 7 5
2 r,s 10 8
7 t 3 8
11 u,v 16 14
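On newer pandas versions (0.25+) you can avoid the placeholder column entirely by resetting the index and using named aggregation; a sketch of the same result:

out = (df.reset_index()
         .groupby('a')
         .agg(b=('b', 'sum'),
              c=('c', 'sum'),
              indices=('index', ','.join)))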
If you have any further questions, feel free to comment.
