Python Pandas: Get dataframe.value_counts() result as list

I have a DataFrame and I want to get both the group names and the corresponding group counts as a list or numpy array. However, when I convert the output to a matrix, I only get the group counts; I don't get the names. Like in the example below:
df = pd.DataFrame({'a':[0.5, 0.4, 5 , 0.4, 0.5, 0.6 ]})
b = df['a'].value_counts()
print(b)
output:
0.4    2
0.5    2
0.6    1
5.0    1
Name: a, dtype: int64
What I tried is print([b.as_matrix()]). Output:
[array([2, 2, 1, 1])]
In this case I do not get the corresponding group names, which I also need. Thank you.

Convert it to a dict:
bd = dict(b)
print(bd)
# {0.40000000000000002: 2, 0.5: 2, 0.59999999999999998: 1, 5.0: 1}
Don't worry about the long decimals. They're just a result of floating point representation; you still get what you expect from the dict.
bd[0.4]
# 2
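If a list of (value, count) pairs is more convenient than a dict, the same object converts directly (a small sketch; the pair order may differ depending on the Python/pandas version):
pairs = list(bd.items())
# e.g. [(0.4, 2), (0.5, 2), (0.6, 1), (5.0, 1)]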

The simplest way:
list(df['a'].value_counts())
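Note this only gives the counts. If the group names are needed alongside them, one option (a sketch, not part of the original answer) is to reset the index first; the counts come back as floats because of the mixed-type array conversion:
df['a'].value_counts().reset_index().values.tolist()
# e.g. [[0.4, 2.0], [0.5, 2.0], [0.6, 1.0], [5.0, 1.0]]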

One approach with np.unique -
np.c_[np.unique(df.a, return_counts=1)]
Sample run -
In [270]: df
Out[270]:
     a
0  0.5
1  0.4
2  5.0
3  0.4
4  0.5
5  0.6
In [271]: np.c_[np.unique(df.a, return_counts=1)]
Out[271]:
array([[ 0.4,  2. ],
       [ 0.5,  2. ],
       [ 0.6,  1. ],
       [ 5. ,  1. ]])
We can zip the outputs from np.unique for list output -
In [283]: zip(*np.unique(df.a, return_counts=1))
Out[283]: [(0.40000000000000002, 2), (0.5, 2), (0.59999999999999998, 1), (5.0, 1)]
Or use zip directly on the value_counts() output -
In [338]: b = df['a'].value_counts()
In [339]: zip(b.index, b.values)
Out[339]: [(0.40000000000000002, 2), (0.5, 2), (0.59999999999999998, 1), (5.0, 1)]
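Note that on Python 3 zip returns an iterator rather than a list, so wrap the call in list() to get the output shown above:
list(zip(b.index, b.values))
# [(0.4, 2), (0.5, 2), (0.6, 1), (5.0, 1)]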

Related

Creating dataframe based on conditions on other dataframes

I have two dataframes: s (1 column) and d (3 columns).
s = {0: [0, 0.3, 0.5, -0.1, -0.2, 0.7, 0]}
d = {0: [0.1, 0.2, -0.2, 0, 0, 0, 0], 1: [0.3, 0.4, -0.7, 0, 0.8, 0, 0.1], 2: [-0.5, 0.4, -0.1, 0.5, 0.5, 0, 0]}
sd = pd.DataFrame(data=s)
dd = pd.DataFrame(data=d)
result = pd.DataFrame()
I want to get the result dataframe (1 column) based on values in those two:
1. When the value in sd is 0, the result is 0.
2. When the value in sd is not 0, check whether that row has at least one non-zero value in dd; if yes, take the average of the non-zero values, if not, return OK.
Here is what I would like to get:
results:
0    0
1    0.333
2   -0.333
3    0.5
4    0.65
5    OK
6    0
I know I can use dd[dd != 0].mean(axis=1) to calculate the mean of the non-zero values per row, but I don't know how to connect all three conditions together.
Using np.where twice
np.where(sd[0]==0,0,np.where(dd.eq(0).all(1),'OK',dd.mask(dd==0).mean(1)))
Out[232]:
array(['0', '0.3333333333333333', '-0.3333333333333333', '0.5', '0.65',
       'OK', '0'], dtype='<U32')
Using numpy.select:
c1 = sd[0].eq(0)
c2 = dd.eq(0).all(1)
res = np.select([c1, c2], [0, 'OK'], dd.where(dd.ne(0)).mean(1))
pd.Series(res)
0 0
1 0.3333333333333333
2 -0.3333333333333333
3 0.5
4 0.65
5 OK
6 0
dtype: object
Thank you for your help. I managed to do it in a slightly different way.
I used:
res1 = pd.Series(np.where(sd[0]==0, 0, dd[dd != 0].mean(axis=1))).fillna('OK')
The difference is that it returns float values (for rows that are not 'OK') rather than strings. It also appears to be a little bit faster.
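Broken into steps, that one-liner does the following (a sketch reusing the sd and dd frames defined above):
nonzero_mean = dd[dd != 0].mean(axis=1)                  # NaN where a dd row is all zeros
res1 = pd.Series(np.where(sd[0] == 0, 0, nonzero_mean))  # force 0 wherever sd is 0
res1 = res1.fillna('OK')                                 # any remaining NaN -> 'OK'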

Search a list of values in a csv file in python

I have a list of lists as follows.
[[0, 0.1, 0.3], [0.5, 0.2, 0.8]]
I also have a csv file as follows.
No, heading1, heading2, heading3
0, 0, 0.7, 0.3
1, 0, 0.1, 0.3
Now I want to search for these lists of values using only the values in 'heading1', 'heading2' and 'heading3', and if a row matches, return the corresponding 'No'.
Can we do this using pandas?
You can use merge on all common columns:
L = [[0, 0.1, 0.3], [0.5, 0.2, 0.8]]
# helper df with the same columns, except the first one
df1 = pd.DataFrame(L, columns=df.columns[1:])
print (df1)
   heading1  heading2  heading3
0       0.0       0.1       0.3
1       0.5       0.2       0.8
# after the merge, select column No
d = pd.merge(df, df1)['No']
print (d)
0 1
Name: No, dtype: int64
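One caveat: merging on float columns can miss matches that differ only in floating-point representation. If that happens, a common workaround (a sketch, not part of the original answer) is to round both frames before the merge:
d = pd.merge(df.round(6), df1.round(6))['No']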

How to select the first 3 rows of every group in pandas?

I get a pandas dataframe like this:
id prob
0 1 0.5
1 1 0.6
2 1 0.4
3 1 0.2
4 2 0.3
6 2 0.5
...
I want to group it by 'id', sort in descending order, and get the first 3 prob values of every group. Note that some groups contain fewer than 3 rows.
Finally I want to get a 2D array like:
[[1, [0.6, 0.5, 0.4]], [2, [0.5, 0.3]], ...]
How can I do that with pandas?
Thanks!
Use sort_values, groupby, and head:
df.sort_values(by=['id','prob'], ascending=[True,False]).groupby('id').head(3).values
Output:
array([[ 1. ,  0.6],
       [ 1. ,  0.5],
       [ 1. ,  0.4],
       [ 2. ,  0.5],
       [ 2. ,  0.3]])
Following @COLDSPEED's lead:
df.sort_values(by=['id','prob'], ascending=[True,False])\
.groupby('id').agg(lambda x: x.head(3).tolist())\
.reset_index().values.tolist()
Output:
[[1, [0.6, 0.5, 0.4]], [2, [0.5, 0.3]]]
You can use groupby and nlargest
df.groupby('id').prob.nlargest(3).reset_index(1,drop = True)
id
1 0.6
1 0.5
1 0.4
2 0.5
2 0.3
For the array:
df1 = df.groupby('id').prob.nlargest(3).unstack(1)
np.column_stack((df1.index.values, df1.values))
You get
array([[ 1. ,  0.5,  0.6,  0.4,  nan,  nan],
       [ 2. ,  nan,  nan,  nan,  0.3,  0.5]])
If you're looking for a Series of arrays per id, you can use np.sort:
df = df.groupby('id').prob.apply(lambda x: np.sort(x.values)[:-4:-1])
df
id
1 [0.6, 0.5, 0.4]
2 [0.5, 0.3]
To retrieve the values, reset_index and access:
df.reset_index().values
array([[1, array([ 0.6,  0.5,  0.4])],
       [2, array([ 0.5,  0.3])]], dtype=object)
Or with a plain list comprehension:
[[n, g.nlargest(3).tolist()] for n, g in df.groupby('id').prob]
[[1, [0.6, 0.5, 0.4]], [2, [0.5, 0.3]]]
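If a lookup by id is more convenient than a list of pairs, the same groupby iteration also builds a dict (a small variation on the comprehension above):
{n: g.nlargest(3).tolist() for n, g in df.groupby('id').prob}
# {1: [0.6, 0.5, 0.4], 2: [0.5, 0.3]}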

Efficient way for calculating selected differences in array

I have two arrays as output from a simulation script, where one contains IDs and the other contains times, i.e. something like:
ids = np.array([2, 0, 1, 0, 1, 1, 2])
times = np.array([.1, .3, .3, .5, .6, 1.2, 1.3])
These arrays are always of the same size. Now I need to calculate the differences of the times, but only between times with the same id. Of course, I can simply loop over the different ids and do
for id in np.unique(ids):
    diffs = np.diff(times[ids == id])
    print diffs
    # do stuff with diffs
However, this is quite inefficient and the two arrays can be very large. Does anyone have a good idea on how to do that more efficiently?
You can use array.argsort() and ignore the values corresponding to a change in ids:
>>> id_ind = ids.argsort(kind='mergesort')
>>> times_diffs = np.diff(times[id_ind])
array([ 0.2, -0.2, 0.3, 0.6, -1.1, 1.2])
To see which values you need to discard, you could use a Counter to count the number of times per id (from collections import Counter),
or just sort ids and see where its diff is nonzero: these are the indices where the id changes, and where your time diffs are irrelevant:
times_diffs[np.diff(ids[id_ind]) == 0]  # ids[id_ind] being the sorted ids sequence
and finally you can split this array with np.split and np.where:
np.split(times_diffs, np.where(np.diff(ids[id_ind]) != 0)[0])
As you mentioned in your comment, argsort()'s default algorithm (quicksort) might not preserve order between equal times, so the argsort(kind='mergesort') option must be used.
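Assembled into one runnable sketch (here the sorted times are split per id before diffing, which avoids the stray boundary values that land at the start of each chunk when the concatenated diffs are split directly):
id_ind = ids.argsort(kind='mergesort')   # stable sort by id
sorted_ids = ids[id_ind]
sorted_times = times[id_ind]
# split wherever the id changes, then diff within each group
groups = np.split(sorted_times, np.where(np.diff(sorted_ids) != 0)[0] + 1)
per_id_diffs = [np.diff(g) for g in groups]
# [array([ 0.2]), array([ 0.3,  0.6]), array([ 1.2])]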
Say you np.argsort by ids:
inds = np.argsort(ids, kind='mergesort')
>>> array([1, 3, 2, 4, 5, 0, 6])
Now sort times by this, take np.diff, and prepend a nan:
diffs = np.concatenate(([np.nan], np.diff(times[inds])))
>>> diffs
array([ nan, 0.2, -0.2, 0.3, 0.6, -1.1, 1.2])
These differences are correct except for the boundaries. Let's calculate those
boundaries = np.concatenate(([False], ids[inds][1: ] == ids[inds][: -1]))
>>> boundaries
array([False, True, False, True, True, False, True], dtype=bool)
Now we can just do
diffs[~boundaries] = np.nan
Let's see what we got:
>>> ids[inds]
array([0, 0, 1, 1, 1, 2, 2])
>>> times[inds]
array([ 0.3, 0.5, 0.3, 0.6, 1.2, 0.1, 1.3])
>>> diffs
array([ nan, 0.2, nan, 0.3, 0.6, nan, 1.2])
I'm adding another answer, since, even though these things are possible in numpy, I think that the higher-level pandas is much more natural for them.
In pandas, you could do this in one step, after creating a DataFrame:
df = pd.DataFrame({'ids': ids, 'times': times})
df['diffs'] = df.groupby(df.ids).transform(pd.Series.diff)
This gives:
>>> df
   ids  times  diffs
0    2    0.1    NaN
1    0    0.3    NaN
2    1    0.3    NaN
3    0    0.5    0.2
4    1    0.6    0.3
5    1    1.2    0.6
6    2    1.3    1.2
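A roughly equivalent and slightly shorter spelling (a sketch; same NaN at the start of each group) uses the groupby diff directly:
df['diffs'] = df.groupby('ids')['times'].diff()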
The numpy_indexed package (disclaimer: I am its author) contains efficient and flexible functionality for these kinds of grouping operations:
import numpy_indexed as npi
unique_ids, diffed_time_groups = npi.group_by(keys=ids, values=times, reduction=np.diff)
Unlike pandas, it does not require a specialized data structure just to perform this kind of rather elementary operation.

Python - delete object that points to another object

I have a list of tuples like below:
In [136]: z
Out[136]:
[(0, array([ 0.71428571, 2.92857143, 1.64285714, 1.07142857])),
(1, array([ 2.89473684, 1.68421053, 0.52631579, 3.21052632])),
(2, array([ 1.03571429, 1.5 , 2.75 , 2.96428571])),
(3, array([ 3.35714286, 2.07142857, 3. , 1.28571429])),
(0, array([ 5.234324 , 3.234324 , 4. , 2.34534534])),
(4, array([ 0.6, 0.1, 2.6, 0.4]))]
and a list of strings like below:
In [138]: b
Out[138]: ['Sam', 'Rachel', 'Mosses', 'Roth', 'Wilhelm']
The integer in z points to a string in b.
For example, the vectors (0, array([ 0.71428571, 2.92857143, 1.64285714, 1.07142857])) and (0, array([ 5.234324 , 3.234324 , 4. , 2.34534534])) both represent 'Sam' (which is b[0]).
I want to delete an entry from b. As a result, all vectors in z pointing to the removed entry should be deleted as well.
For example, if I delete 'Sam', I want my new z to become:
In [136]: z
Out[136]:
[(0, array([ 2.89473684, 1.68421053, 0.52631579, 3.21052632])),
(1, array([ 1.03571429, 1.5 , 2.75 , 2.96428571])),
(2, array([ 3.35714286, 2.07142857, 3. , 1.28571429])),
(3, array([ 0.6, 0.1, 2.6, 0.4]))]
In [138]: b
Out[138]: ['Rachel', 'Mosses', 'Roth', 'Wilhelm']
I didn't try it, but probably something like this works (where k is the key to be removed; in your example k=0):
z = [ (e[0] - (e[0]>k), e[1]) for e in z if e[0] != k ]
Explanation: you get a filter effect in a list comprehension with the for e in z if ... syntax, and you can subtract 1 from the original key when it is greater than k by using the arithmetic value of (e[0] > k).
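A small runnable sketch of the same idea that also removes the name from b (remove_entry is a hypothetical helper name, not from the answer above):
def remove_entry(z, b, name):
    # drop every (key, vector) whose key points at `name`,
    # shift the keys above it down by one, and drop the name from b
    k = b.index(name)
    z = [(key - (key > k), vec) for key, vec in z if key != k]
    b = [s for s in b if s != name]
    return z, b

z, b = remove_entry(z, b, 'Sam')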
