select/filter bins after qcut decile - python

I am trying to access the labels (i.e. positional indicator) after binning my data by decile:
q = pd.qcut(df["revenue"], 10)
q.head():
7 (317.942, 500.424]
81 (317.942, 500.424]
83 (150.65, 317.942]
84 [0.19, 150.65]
85 (317.942, 500.424]
Name: revenue, dtype: category
Categories (10, object): [[0.19, 150.65] < (150.65, 317.942] < (317.942, 500.424] < (500.424, 734.916] ... (1268.306, 1648.35]
< (1648.35, 1968.758] < (1968.758, 2527.675] < (2527.675, 18690.2]]
In [233]:
This post link shows that you can do the following to access the labels:
>>> q.labels
But when I do that I get:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-246-e806c96b1ab2> in <module>()
----> 1 q.labels
C:\Users\blah\Anaconda3\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
2666 if (name in self._internal_names_set or name in self._metadata or
2667 name in self._accessors):
-> 2668 return object.__getattribute__(self, name)
2669 else:
2670 if name in self._info_axis:
AttributeError: 'Series' object has no attribute 'labels'
In any case, what I want to do is use the labels to filter my data - likely by adding a new column in df which represents the positional label of the result of the decile (or quantile).

I personally like using the labels parameter in pd.qcut to specify clean looking and consistent labels.
np.random.seed([3,1415])
df = pd.DataFrame(dict(revenue=np.random.randint(1000000, 99999999, 100)))
df['decile'] = pd.qcut(df.revenue, 10, labels=range(10))
print(df.head())
As #jeremycg pointed out, you access category information via the cat accessor attribute
df.decile.cat.categories
Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')
You can quickly describe each bin
df.groupby('decile').describe().unstack()
You can filter
df.query('decile >= 8')
revenue decile
4 98274570 9
6 99418302 9
19 89598752 8
20 88877661 8
22 90789485 9
29 83126518 8
31 90700517 9
33 96816407 9
40 89937348 8
54 83041116 8
65 83399066 8
66 97055576 9
79 87700403 8
81 88592657 8
82 91963755 9
83 82443566 8
84 84880509 8
88 98603752 9
95 92548497 9
98 98963891 9
you can z-score within deciles
df = df.join(df.groupby('decile').revenue.agg(dict(Mean='mean', Std='std')), on='decile')
df['revenue_zscore_by_decile'] = df.revenue.sub(df.Mean).div(df.Std)
df.head()
revenue decile Std Mean revenue_zscore_by_decile
0 32951600 2 2.503325e+06 29649669 1.319018
1 70565451 6 9.639336e+05 71677761 -1.153928
2 6602402 0 5.395453e+06 11166286 -0.845876
3 82040251 7 2.976992e+06 78299299 1.256621
4 98274570 9 3.578865e+06 95513475 0.771500

Related

Python Pandas: Shape of passed values is (126, 5), indices imply (84, 5) [duplicate]

This question already has answers here:
Pandas concat: ValueError: Shape of passed values is blah, indices imply blah2
(7 answers)
Closed 2 years ago.
I have 2 dataframes with 84 rows, clearly the same lengths, but when i want to concat them to 1 df (concat by column - to have the name, Edge and Offset to the right of Latitude and Longitude), i get this error,.
what is going on?
Latitude Longitude
0 45.403538 -75.735729
1 45.403506 -75.735699
2 45.409095 -75.722588
3 45.409069 -75.722552
4 45.413496 -75.714184
.. ... ...
79 45.415609 -75.644769
80 45.416073 -75.645726
81 45.416193 -75.638802
82 45.416172 -75.638223
[84 rows x 2 columns]
name Edge Offset
0 TUN-W 1 3000
1 TUN-E 2 3000
2 BAY-W 5 102510
3 BAY-E 6 102579
4 PIM-W 5 186035
.. ... ... ...
37 PTSTTW 33 52710
38 PTSTTE 34 18997
39 PAG11 40 24362
40 PAG14 50 9927
41 PHND15 177 11662
[84 rows x 3 columns]
This is my code liner
output_df = pd.concat( [output_df, input_df], axis=1)
I got it:
name Edge Offset
0 TUN-W 1 3000
1 TUN-E 2 3000
2 BAY-W 5 102510
3 BAY-E 6 102579
4 PIM-W 5 186035
.. ... ... ...
37 PTSTTW 33 52710
38 PTSTTE 34 18997
39 PAG11 40 24362
40 PAG14 50 9927
41 PHND15 177 11662
The indexing was messed up. Somewhere in the middle, the index resets to 0 instead of counting up to 84.
So i did a segments_df = segments_df_start.append(segments_df_end).reset_index() (was in earlier part of my code) to fix the indexing for that dataframe before I pass it over.
So ALWAYS remember to check your indexes and .reset_index() , when troubleshooting!!

Pandas groupby throws: TypeError: unhashable type: 'numpy.ndarray'

I have a dataframe as shown in the picture:
problem dataframe: attdf
I would like to group the data by Source class and Destination class, count the number of rows in each group and sum up Attention values.
While trying to achieve that, I am unable to get past this type error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-100-6f2c8b3de8f2> in <module>()
----> 1 attdf.groupby(['Source Class', 'Destination Class']).count()
8 frames
pandas/_libs/properties.pyx in pandas._libs.properties.CachedProperty.__get__()
/usr/local/lib/python3.6/dist-packages/pandas/core/algorithms.py in _factorize_array(values, na_sentinel, size_hint, na_value)
458 table = hash_klass(size_hint or len(values))
459 uniques, labels = table.factorize(values, na_sentinel=na_sentinel,
--> 460 na_value=na_value)
461
462 labels = ensure_platform_int(labels)
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.factorize()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable._unique()
TypeError: unhashable type: 'numpy.ndarray'
attdf.groupby(['Source Class', 'Destination Class'])
gives me a <pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f1e720f2080> which I'm not sure how to use to get what I want.
Dataframe attdf can be imported from : https://drive.google.com/open?id=1t_h4b8FQd9soVgYeiXQasY-EbnhfOEYi
Please advise.
#Adam.Er8 and #jezarael helped me with their inputs. The unhashable type error in my case was because of the datatypes of the columns in my dataframe.
Original df and df imported from csv
It turned out that the original dataframe had two object columns which i was trying to use up in the groupby. Hence the unhashable type error. But on importing the data into a new dataframe right out of a csv fixed the datatypes. Consequently, no type errors faced anymore.
try using .agg as follows:
import pandas as pd
attdf = pd.read_csv("attdf.csv")
print(attdf.groupby(['Source Class', 'Destination Class']).agg({"Attention": ['sum', 'count']}))
Output:
Attention
sum count
Source Class Destination Class
0 0 282.368908 1419
1 7.251101 32
2 3.361009 23
3 22.482438 161
4 14.020189 88
5 10.138409 75
6 11.377947 80
1 0 6.172269 32
1 181.582437 1035
2 9.440956 62
3 12.007303 67
4 3.025752 20
5 4.491725 28
6 0.279559 2
2 0 3.349921 23
1 8.521828 62
2 391.116034 2072
3 9.937170 53
4 0.412747 2
5 4.441985 30
6 0.220316 2
3 0 33.156251 161
1 11.944373 67
2 9.176584 53
3 722.685180 3168
4 29.776050 137
5 8.827215 54
6 2.434347 16
4 0 17.431855 88
1 4.195519 20
2 0.457089 2
3 20.401789 137
4 378.802604 1746
5 3.616083 19
6 1.095061 6
5 0 13.525333 75
1 4.289306 28
2 6.424412 30
3 10.911705 54
4 3.896328 19
5 250.309764 1132
6 8.643153 46
6 0 15.249959 80
1 0.150240 2
2 0.413639 2
3 3.108417 16
4 0.850280 6
5 8.655959 46
6 151.571505 686

How can I Extract only numbers from this columns?

Suppose, you have a column in excel, with values like this... there are only 5500 numbers present but it show length 5602 means that 102 strings are present
4 SELECTIO
6 N NO
14 37001
26 37002
38 37003
47 37004
60 37005
73 37006
82 37007
92 37008
105 37009
119 37010
132 37011
143 37012
157 37013
168 37014
184 37015
196 37016
207 37017
220 37018
236 37019
253 37020
267 37021
280 37022
287 Krishan
290 37023
300 37024
316 37025
337 37026
365 37027
...
74141 42471
74154 42472
74169 42473
74184 42474
74200 42475
74216 42476
74233 42477
74242 42478
74256 42479
74271 42480
74290 42481
74309 42482
74323 42483
74336 42484
74350 42485
74365 42486
74378 42487
74389 42488
74398 42489
74413 42490
74430 42491
74446 42492
74459 42493
74474 42494
74491 42495
74504 42496
74516 42497
74530 42498
74544 42499
74558 42500
Name: Selection No., Length: 5602, dtype: object
and I want to get only numeric values like this in python using pandas
37001
37002
37003
37004
37005
how can I do this? I have attached my code in python using pandas..............................................
def selection(sle):
if sle in re.match('[3-4][0-9]{4}',sle):
return 1
else:
return 0
select['status'] = select['Selection No.'].apply(selection)
and now I am geting an "argument of type 'NoneType' is not iterable" error.
Try using Numpy with np.isreal and only select numbers..
import pandas as pd
import numpy as np
df = pd.DataFrame({'SELECTIO':['N NO',37002,37003,'Krishan',37004,'singh',37005], 'some_col':[4,6,14,26,38,47,60]})
df
SELECTIO some_col
0 N NO 4
1 37002 6
2 37003 14
3 Krishan 26
4 37004 38
5 singh 47
6 37005 60
>>> df[df[['SELECTIO']].applymap(np.isreal).all(1)]
SELECTIO some_col
1 37002 6
2 37003 14
4 37004 38
6 37005 60
result:
Specific to column SELECTIO ..
df[df[['SELECTIO']].applymap(np.isreal).all(1)]
SELECTIO some_col
1 37002 6
2 37003 14
4 37004 38
6 37005 60
OR just another approach importing numbers + lambda :
import numbers
df[df[['SELECTIO']].applymap(lambda x: isinstance(x, numbers.Number)).all(1)]
SELECTIO some_col
1 37002 6
2 37003 14
4 37004 38
6 37005 60
Note: there is problem when you are extracting a column you are using ['Selection No.'] but indeed you have a Space in the name it will be like ['Selection No. '] that's the reason you are getting KeyError while executing it, try and see!
Your function contains wrong expression: if sle in re.match('[3-4][0-9]{4}',sle): - it tries to find a column value sle IN match object which "always have a boolean value of True" (re.match returns None when there's no match)
I would suggest to proceed with pd.Series.str.isnumeric function:
In [544]: df
Out[544]:
Selection No.
0 37001
1 37002
2 37003
3 asnsh
4 37004
5 singh
6 37005
In [545]: df['Status'] = df['Selection No.'].str.isnumeric().astype(int)
In [546]: df
Out[546]:
Selection No. Status
0 37001 1
1 37002 1
2 37003 1
3 asnsh 0
4 37004 1
5 singh 0
6 37005 1
If a strict regex pattern is required - use pd.Series.str.contains function:
df['Status'] = df['Selection No.'].str.contains('^[3-4][0-9]{4}$', regex=True).astype(int)

output multiple files based on column value python pandas

i have a sample pandas data frame:
import pandas as pd
df = {'ID': [73, 68,1,94,42,22, 28,70,47, 46,17, 19, 56, 33 ],
'CloneID': [1, 1, 1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4 ],
'VGene': ['64D', '64D', '64D', 61, 61, 61, 311, 311, 311, 311, 311, 311, 311, 311]}
df = pd.DataFrame(df)
it looks like this:
df
Out[7]:
CloneID ID VGene
0 1 73 64D
1 1 68 64D
2 1 1 64D
3 1 94 61
4 1 42 61
5 2 22 61
6 2 28 311
7 3 70 311
8 3 47 311
9 3 46 311
10 4 17 311
11 4 19 311
12 4 56 311
13 4 33 311
i want to write a simple script to output each cloneID to a different output file. so in this case there would be 4 different files.
the first file would be named 'CloneID1.txt' and it would look like this:
CloneID ID VGene
1 73 64D
1 68 64D
1 1 64D
1 94 61
1 42 61
second file would be named 'CloneID2.txt':
CloneID ID VGene
2 22 61
2 28 311
third file would be named 'CloneID3.txt':
CloneID ID VGene
3 70 311
3 47 311
3 46 311
and last file would be 'CloneID4.txt':
CloneID ID VGene
4 17 311
4 19 311
4 56 311
4 33 311
the code i found online was:
import pandas as pd
data = pd.read_excel('data.xlsx')
for group_name, data in data.groupby('CloneID'):
with open('results.csv', 'a') as f:
data.to_csv(f)
but it outputs everything to one file instead of multiple files.
You can do something like the following:
In [19]:
gp = df.groupby('CloneID')
for g in gp.groups:
print('CloneID' + str(g) + '.txt')
print(gp.get_group(g).to_csv())
CloneID1.txt
,CloneID,ID,VGene
0,1,73,64D
1,1,68,64D
2,1,1,64D
3,1,94,61
4,1,42,61
CloneID2.txt
,CloneID,ID,VGene
5,2,22,61
6,2,28,311
CloneID3.txt
,CloneID,ID,VGene
7,3,70,311
8,3,47,311
9,3,46,311
CloneID4.txt
,CloneID,ID,VGene
10,4,17,311
11,4,19,311
12,4,56,311
13,4,33,311
So here we iterate over the groups in for g in gp.groups: and we use this to create the result file path name and call to_csv on the group so the following should work for you:
gp = df.groupby('CloneID')
for g in gp.groups:
path = 'CloneID' + str(g) + '.txt'
gp.get_group(g).to_csv(path)
Actually the following would be even simpler:
gp = df.groupby('CloneID')
gp.apply(lambda x: x.to_csv('CloneID' + str(x.name) + '.txt'))

Not getting 0 index from pandas value_counts()

total_val_count = dataset[attr].value_counts()
for i in range(len(total_val_count.index)):
print total_val_count[i]
I have written this piece of code which counts occurrences of all distinct values of an attribute in a dataframe. The problem I am facing is that I am unable to access the first value by using index 0. I get a KeyError: 0 error in the first loop run itself.
The total_val_count contains proper values as shown below:
34 2887
4 2708
13 2523
35 2507
33 2407
3 2404
36 2382
26 2378
16 2282
22 2187
21 2141
12 2104
25 2073
5 2052
15 2044
17 2040
14 2027
28 1984
27 1980
23 1979
24 1960
30 1953
29 1936
31 1884
18 1877
7 1858
37 1767
20 1762
11 1740
8 1722
6 1693
32 1692
10 1662
9 1576
19 1308
2 1266
1 175
38 63
dtype: int64
total_val_count is a Series. The index of the Series are values in dataset[attr],
and the values in the Series are the number of times the associated value in dataset[attr] appears.
When you index a Series with total_val_count[i], Pandas looks for i in the index and returns the assocated value. In other words, total_val_count[i] is indexing by index value, not by ordinal.
Think of a Series as a mapping from the index to the values. When using plain indexing, e.g. total_val_count[i], it behaves more like a dict than a list.
You are getting a KeyError because 0 is not a value in the index.
To index by ordinal, use total_val_count.iloc[i].
Having said that, using for i in range(len(total_val_count.index)) -- or, what amounts to the same thing, for i in range(len(total_val_count)) -- is not recommended. Instead of
for i in range(len(total_val_count)):
print(total_val_count.iloc[i])
you could use
for value in total_val_count.values:
print(value)
This is more readable, and allows you to access the desired value as a variable, value, instead of the more cumbersome total_val_count.iloc[i].
Here is an example which shows how to iterate over the values, the keys, both the keys and values:
import pandas as pd
s = pd.Series([1, 2, 3, 2, 2])
total_val_count = s.value_counts()
print(total_val_count)
# 2 3
# 3 1
# 1 1
# dtype: int64
for value in total_val_count.values:
print(value)
# 3
# 1
# 1
for key in total_val_count.keys():
print(key)
# 2
# 3
# 1
for key, value in total_val_count.iteritems():
print(key, value)
# (2, 3)
# (3, 1)
# (1, 1)
for i in range(len(total_val_count)):
print(total_val_count.iloc[i])
# 3
# 1
# 1

Categories

Resources