Not getting 0 index from pandas value_counts() - python

total_val_count = dataset[attr].value_counts()
for i in range(len(total_val_count.index)):
    print total_val_count[i]
I have written this piece of code, which counts the occurrences of all distinct values of an attribute in a dataframe. The problem is that I am unable to access the first value by using index 0: I get KeyError: 0 on the very first iteration of the loop.
The total_val_count contains proper values as shown below:
34 2887
4 2708
13 2523
35 2507
33 2407
3 2404
36 2382
26 2378
16 2282
22 2187
21 2141
12 2104
25 2073
5 2052
15 2044
17 2040
14 2027
28 1984
27 1980
23 1979
24 1960
30 1953
29 1936
31 1884
18 1877
7 1858
37 1767
20 1762
11 1740
8 1722
6 1693
32 1692
10 1662
9 1576
19 1308
2 1266
1 175
38 63
dtype: int64

total_val_count is a Series. Its index holds the distinct values in dataset[attr],
and its values are the number of times each of those values appears in dataset[attr].
When you index a Series with total_val_count[i], Pandas looks for i in the index and returns the associated value. In other words, total_val_count[i] indexes by index label, not by ordinal position.
Think of a Series as a mapping from the index to the values. When using plain indexing, e.g. total_val_count[i], it behaves more like a dict than a list.
You are getting a KeyError because 0 is not a value in the index.
To index by ordinal, use total_val_count.iloc[i].
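A minimal sketch of the difference between the two kinds of lookup. The sample values below are made up for illustration and are not the data from the question:

```python
import pandas as pd

# Hypothetical sample data: 34 appears three times, 4 twice, 13 once.
total_val_count = pd.Series([34, 4, 13, 34, 4, 34]).value_counts()

# Plain indexing looks up the label 0, which is not in the index.
try:
    total_val_count[0]
    label_lookup_failed = False
except KeyError:
    label_lookup_failed = True

first_by_position = total_val_count.iloc[0]  # first row, whatever its label is
count_for_34 = total_val_count.loc[34]       # row whose label is 34

print(label_lookup_failed, first_by_position, count_for_34)
```

Here .iloc[0] and .loc[34] happen to return the same count, because 34 is the most frequent value and therefore sits in the first row.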
Having said that, using for i in range(len(total_val_count.index)) -- or, what amounts to the same thing, for i in range(len(total_val_count)) -- is not recommended. Instead of
for i in range(len(total_val_count)):
    print(total_val_count.iloc[i])
you could use
for value in total_val_count.values:
    print(value)
This is more readable, and allows you to access the desired value as a variable, value, instead of the more cumbersome total_val_count.iloc[i].
Here is an example which shows how to iterate over the values, the keys, or both the keys and values:
import pandas as pd
s = pd.Series([1, 2, 3, 2, 2])
total_val_count = s.value_counts()
print(total_val_count)
# 2 3
# 3 1
# 1 1
# dtype: int64
for value in total_val_count.values:
    print(value)
# 3
# 1
# 1
for key in total_val_count.keys():
    print(key)
# 2
# 3
# 1
# .items() replaces .iteritems(), which was removed in pandas 2.0
for key, value in total_val_count.items():
    print(key, value)
# 2 3
# 3 1
# 1 1
for i in range(len(total_val_count)):
    print(total_val_count.iloc[i])
# 3
# 1
# 1

Related

Fastest way to access a dataframe cell by column values?

I have the following dataframe :
   time  bk1_lvl0_id  bk2_lvl0_id  pr_ss  order_upto_level  initial_inventory  leadtime1  leadtime2  adjusted_leadtime
0  2020         1000            3     16                18                 17          3   0.100000                  1
1  2020        10043            3     65                78                 72         12   0.400000                  1
2  2020         1005            3      0                 1                  1          9   0.300000                  1
3  2020         1009            3    325               363                344         21   0.700000                  1
4  2020          102            3      0                 1                  1          7   0.233333                  1
I want a function that returns pr_ss for a given pair of keys, for example (bk1_lvl0_id=1000, bk2_lvl0_id=3).
This is the code I have tried, but it is slow:
def get_safety_stock(df,bk1,bk2):
    ##a function that returns the safety stock for any given (bk1,bk2)
    for index,row in df.iterrows():
        if (row["bk1_lvl0_id"]==bk1) and (row["bk2_lvl0_id"]==bk2):
            return int(row["pr_ss"])
            break
If your dataframe has no duplicate (bk1_lvl0_id, bk2_lvl0_id) pairs, you can write the function as follows:
def get_safety_stock(df,bk1,bk2):
    return df.loc[df.bk1_lvl0_id.eq(bk1) & df.bk2_lvl0_id.eq(bk2), 'pr_ss'].iloc[0]
Note that .iloc[0] takes the first matching value by position, which isn't an issue if there are no duplicates in the data. (A plain [0] would look up the index label 0 and raise a KeyError whenever the matching row is not the one labelled 0.) If you want all matches, just remove the .iloc[0] from the end and it will give you the whole Series. The function can be called as follows:
get_safety_stock(df, 1000,3)
>>>16
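If the lookup is called many times, one option is to build an index on the two key columns once and reuse it. The sketch below is an alternative to the boolean mask, not part of the answer above; it assumes the key columns hold integers and that each (bk1, bk2) pair is unique, and the name pr_ss_lookup is made up for illustration:

```python
import pandas as pd

# First rows of the question's dataframe (only the columns needed here).
df = pd.DataFrame({
    "bk1_lvl0_id": [1000, 10043, 1005, 1009, 102],
    "bk2_lvl0_id": [3, 3, 3, 3, 3],
    "pr_ss": [16, 65, 0, 325, 0],
})

def get_safety_stock(df, bk1, bk2):
    # Boolean-mask lookup: scans both key columns on every call.
    mask = df["bk1_lvl0_id"].eq(bk1) & df["bk2_lvl0_id"].eq(bk2)
    return int(df.loc[mask, "pr_ss"].iloc[0])

# Hypothetical helper Series: built once, keyed by (bk1, bk2).
pr_ss_lookup = df.set_index(["bk1_lvl0_id", "bk2_lvl0_id"])["pr_ss"]

print(get_safety_stock(df, 1009, 3))
print(int(pr_ss_lookup.loc[(1009, 3)]))
```

Both calls return the same value; the indexed Series only pays off when the index is built once and queried repeatedly.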

Pandas: calculating mean value of multiple columns using datetime and Grouper removes columns or doesn't return correct Dataframe

As part of a larger task, I want to calculate the monthly mean values for each specific station. This is already difficult to do, but I am getting close.
The dataframe has many columns, but ultimately I only use the following information:
Date Value Station_Name
0 2006-01-03 18 2
1 2006-01-04 12 2
2 2006-01-05 11 2
3 2006-01-06 10 2
4 2006-01-09 22 2
... ... ...
3510 2006-12-23 47 45
3511 2006-12-24 46 45
3512 2006-12-26 35 45
3513 2006-12-27 35 45
3514 2006-12-30 28 45
I am running into two issues, using:
df.groupby(['Station_Name', pd.Grouper(freq='M')])['Value'].mean()
It results in something like:
Station_Name Date
2 2003-01-31 29.448387
2003-02-28 30.617857
2003-03-31 28.758065
2003-04-30 28.392593
2003-05-31 30.318519
...
45 2003-09-30 16.160000
2003-10-31 18.906452
2003-11-30 26.296667
2003-12-31 30.306667
2004-01-31 29.330000
I can't seem to use this as a regular dataframe, and the date is not what I want: instead of the month, it shows the last day of each month. Also, the station name appears only once per group in the index rather than in every row, and the mean value doesn't have a "column name" at all. This isn't a dataframe, but a pandas.core.series.Series. I can't convert this again because it's not correct, and using the .to_frame() method shows that it is still indeed a Dataframe. I don't get this part.
I found that, in order to get a normal dataframe back, I can pass
as_index = False
to the groupby method. But this results in the months not being shown:
df.groupby(['Station_Name', pd.Grouper(freq='M')], as_index = False)['Value'].mean()
Gives:
Station_Name Value
0 2 29.448387
1 2 30.617857
2 2 28.758065
3 2 28.392593
4 2 30.318519
... ... ...
142 45 16.160000
143 45 18.906452
144 45 26.296667
145 45 30.306667
146 45 29.330000
I can't just simply add the month later, as not every station has an observation in every month.
I've tried using other methods, such as
df.resample("M").mean()
But it doesn't seem possible to do this on multiple columns. It returns the mean value of everything.
Edit: This is ultimately what I would want.
Station_Name Date Value
0 2 2003-01 29.448387
1 2 2003-02 30.617857
2 2 2003-03 28.758065
3 2 2003-04 28.392593
4 2 2003-05 30.318519
... ... ...
142 45 2003-08 16.160000
143 45 2003-09 18.906452
144 45 2003-10 26.296667
145 45 2003-11 30.306667
146 45 2003-12 29.330000
OK, how about this:
df = df.groupby(['Station_Name',df['Date'].dt.to_period('M')])['Value'].mean().reset_index()
Output:
   Station_Name     Date  Value
0             2  2006-01   14.6
1            45  2006-12   38.2
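For a reproducible version of that one-liner, here is a small sketch built from the ten sample rows shown in the question. It assumes the Date column has already been converted to datetime (done here with pd.to_datetime), and the result is stored in a new name, monthly, rather than overwriting df:

```python
import pandas as pd

# The ten sample rows shown in the question.
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2006-01-03", "2006-01-04", "2006-01-05", "2006-01-06", "2006-01-09",
        "2006-12-23", "2006-12-24", "2006-12-26", "2006-12-27", "2006-12-30",
    ]),
    "Value": [18, 12, 11, 10, 22, 47, 46, 35, 35, 28],
    "Station_Name": [2] * 5 + [45] * 5,
})

# Group by station and by calendar month (a Period such as 2006-01),
# then turn both group keys back into ordinary columns.
monthly = (
    df.groupby(["Station_Name", df["Date"].dt.to_period("M")])["Value"]
    .mean()
    .reset_index()
)
print(monthly)
```

Only months that actually contain observations for a station appear in the result, which matches the requirement that not every station has data in every month.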

function calculation based on condition [duplicate]

This question already has an answer here:
Pandas Rolling Python to create new Columns
(1 answer)
Closed 2 years ago.
I am practising and am new to writing functions in Python with conditions. The task is:
Create a function that takes an integer as input (for example m, where m is between 2 and n, and n is the maximum number of rows). The function calculates 'Sum A' and 'Sum B' over the last m days. There will be no value for the first m days.
The original data:
Day     V         TP            A            B        Sum A        Sum B
  1  3509      47.81
  2  4862  48.406667  235353.2133
  3  1810      49.26      89160.6
  4  3824  49.263333  188382.9867
  5  2209  47.386667               104677.1467
  6  4558  45.573333               207723.2533
  7  3832  44.396667               170128.0267
  8  3778      43.75                  165287.5
  9  1005      44.64      44863.2
 10  4047      43.76                 177096.72
 11  2201  44.383333  97687.71667               655447.7167  824912.6467
 12  2507  45.156667  113207.7633               533302.2667  824912.6467
 13  4392    44.4333                  195151.2  444141.6667  1020063.847
 14  3497  43.296667               151408.4433    255758.68   1171472.29
 15  1181      43.07                  50865.67    255758.68  1117660.813
 16  1971      42.89                  84536.19    255758.68    994473.75
 17  4994  43.563333  217555.2867               473313.9667  824345.7233
 18  2017  44.816667  90395.21667               563709.1833  659058.2233
 19  2823  44.936667    126856.21               645702.1933  659058.2233
 20  2774      45.13    125190.62               770892.8133  481961.5033
My attempt so far is below; it raises KeyError: 'A':
curret_period = int(input("enter days: "))
sumA = curret_period * ((df["A"] < df["A"]),'')
sumB = curret_period * ((df["B"] >= df["B"]),'')
print(sumA)
print(sumB)
Is there a better way to write the function? I also wonder whether the skeleton below is what I need:
def function_name():
    print()
Expected result when m= 10:
                     A                   B        Sum A        Sum B
 0
 1  235353.21333333332
 2   89160.59999999999
 3  188382.98666666663
 4                       104677.1466666667
 5                      207723.25333333333
 6                      170128.02666666667
 7                                165287.5
 8  44863.200000000004
 9                               177096.72
10   97687.71666666666                      655447.7167  824912.6467
11  113207.76333333334                      533302.2667  824912.6467
12                                195151.2  444141.6667  1020063.847
13                       151408.4433333333    255758.68   1171472.29
14                       50865.66999999999    255758.68  1117660.813
15                       84536.19000000002    255758.68    994473.75
16  217555.28666666665                      473313.9667  824345.7233
17   90395.21666666666                      563709.1833  659058.2233
18           126856.21                      645702.1933  659058.2233
19  125190.61999999998                      770892.8133  481961.5033
Any suggestion? Thank you in advance.
You can use df.tail() to get the last m rows of the dataframe and then simply sum() each column.
The function also checks that m is not greater than the length of the dataframe; without this check, tail(m) would simply return every row and the sum would cover the entire dataframe.
def sumof(df, m):
    if m <= len(df.index):
        rows = df.tail(m)
        print(rows['A'].sum())
        print(rows['B'].sum())
    else:
        print("'m' can not be greater than length of dataframe")
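The tail() approach yields one pair of totals for the final m rows. If the goal is the full 'Sum A' and 'Sum B' columns from the expected result, a rolling window is one possible alternative. The sketch below is not part of the answer above: the function name rolling_sums is made up, and the assignment of each value to column A or B is inferred from the sums shown in the question.

```python
import pandas as pd

nan = float("nan")

# A and B as read from the question's table (placement inferred from its sums).
df = pd.DataFrame({
    "A": [nan, 235353.2133, 89160.6, 188382.9867, nan, nan, nan, nan,
          44863.2, nan, 97687.71667, 113207.7633, nan, nan, nan, nan,
          217555.2867, 90395.21667, 126856.21, 125190.62],
    "B": [nan, nan, nan, nan, 104677.1467, 207723.2533, 170128.0267, 165287.5,
          nan, 177096.72, nan, nan, 195151.2, 151408.4433, 50865.67, 84536.19,
          nan, nan, nan, nan],
})

def rolling_sums(df, m):
    # Treat missing entries as 0, then sum each row together with the m - 1 rows before it.
    sums = df[["A", "B"]].fillna(0).rolling(window=m).sum()
    # The expected table leaves the first m rows empty, so blank them out.
    sums.iloc[:m] = nan
    return df.assign(**{"Sum A": sums["A"], "Sum B": sums["B"]})

result = rolling_sums(df, 10)
print(result.iloc[9:13])
```

With m = 10, the first filled row is the eleventh one, which is what the expected result in the question shows.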

How can I extract only numbers from this column?

Suppose you have a column in Excel with values like the ones below. Only 5500 numbers are present, but the length is 5602, which means that 102 strings are present:
4 SELECTIO
6 N NO
14 37001
26 37002
38 37003
47 37004
60 37005
73 37006
82 37007
92 37008
105 37009
119 37010
132 37011
143 37012
157 37013
168 37014
184 37015
196 37016
207 37017
220 37018
236 37019
253 37020
267 37021
280 37022
287 Krishan
290 37023
300 37024
316 37025
337 37026
365 37027
...
74141 42471
74154 42472
74169 42473
74184 42474
74200 42475
74216 42476
74233 42477
74242 42478
74256 42479
74271 42480
74290 42481
74309 42482
74323 42483
74336 42484
74350 42485
74365 42486
74378 42487
74389 42488
74398 42489
74413 42490
74430 42491
74446 42492
74459 42493
74474 42494
74491 42495
74504 42496
74516 42497
74530 42498
74544 42499
74558 42500
Name: Selection No., Length: 5602, dtype: object
I want to get only the numeric values, like this, in Python using pandas:
37001
37002
37003
37004
37005
How can I do this? Here is my code in Python using pandas:
def selection(sle):
    if sle in re.match('[3-4][0-9]{4}',sle):
        return 1
    else:
        return 0
select['status'] = select['Selection No.'].apply(selection)
Now I am getting an "argument of type 'NoneType' is not iterable" error.
Try using NumPy's np.isreal to select only the numbers:
import pandas as pd
import numpy as np
df = pd.DataFrame({'SELECTIO':['N NO',37002,37003,'Krishan',37004,'singh',37005], 'some_col':[4,6,14,26,38,47,60]})
df
SELECTIO some_col
0 N NO 4
1 37002 6
2 37003 14
3 Krishan 26
4 37004 38
5 singh 47
6 37005 60
>>> df[df[['SELECTIO']].applymap(np.isreal).all(1)]
SELECTIO some_col
1 37002 6
2 37003 14
4 37004 38
6 37005 60
Or, as another approach, use the numbers module with a lambda:
import numbers
df[df[['SELECTIO']].applymap(lambda x: isinstance(x, numbers.Number)).all(1)]
SELECTIO some_col
1 37002 6
2 37003 14
4 37004 38
6 37005 60
Note: there is a problem when you extract the column. You are using ['Selection No.'], but the name actually has a trailing space, so it should be ['Selection No. ']; that is the reason you are getting a KeyError while executing it. Try it and see!
Your function contains a wrong expression: if sle in re.match('[3-4][0-9]{4}',sle): tests whether the column value sle is IN the object returned by re.match. A match object "always has a boolean value of True", and re.match returns None when there's no match, which is where the "'NoneType' is not iterable" error comes from.
I would suggest using the pd.Series.str.isnumeric function:
In [544]: df
Out[544]:
Selection No.
0 37001
1 37002
2 37003
3 asnsh
4 37004
5 singh
6 37005
In [545]: df['Status'] = df['Selection No.'].str.isnumeric().astype(int)
In [546]: df
Out[546]:
Selection No. Status
0 37001 1
1 37002 1
2 37003 1
3 asnsh 0
4 37004 1
5 singh 0
6 37005 1
If a strict regex pattern is required, use the pd.Series.str.contains function:
df['Status'] = df['Selection No.'].str.contains('^[3-4][0-9]{4}$', regex=True).astype(int)
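The function from the question can also be repaired directly by testing the result of the regex call for truthiness instead of using in. The sketch below assumes the column may mix strings and integers, so each value is passed through str() first; the sample frame and the names PATTERN and numbers_only are made up for illustration:

```python
import re

import pandas as pd

# Hypothetical sample column mixing header fragments, names and numbers.
select = pd.DataFrame(
    {"Selection No.": ["SELECTIO", "N NO", "37001", "37002", "Krishan", 37003]}
)

PATTERN = re.compile(r"[3-4][0-9]{4}")

def selection(sle):
    # fullmatch returns a match object (truthy) or None (falsy).
    return 1 if PATTERN.fullmatch(str(sle)) else 0

select["status"] = select["Selection No."].apply(selection)

# Keep only the rows flagged as numbers and convert them to integers.
numbers_only = pd.to_numeric(select.loc[select["status"] == 1, "Selection No."])
print(numbers_only.tolist())
```

re.fullmatch is used so that a longer string which merely starts with five digits is not accepted.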

delete matplotlib Line2D object from address

I'm currently trying to reduce the memory usage of a script that produces figures which become very "heavy" over time.
Before creating the figures:
('Before: heap:', Partition of a set of 337 objects. Total size = 82832 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 75 22 32520 39 32520 39 dict (no owner)
1 39 12 20904 25 53424 64 dict of guppy.etc.Glue.Interface
2 8 2 8384 10 61808 75 dict of guppy.etc.Glue.Share
3 16 5 4480 5 66288 80 dict of guppy.etc.Glue.Owner
4 84 25 4280 5 70568 85 str
5 23 7 3128 4 73696 89 list
6 39 12 2496 3 76192 92 guppy.etc.Glue.Interface
7 16 5 1152 1 77344 93 guppy.etc.Glue.Owner
8 1 0 1048 1 78392 95 dict of guppy.heapy.Classifiers.ByUnity
9 1 0 1048 1 79440 96 dict of guppy.heapy.Use._GLUECLAMP_
<15 more rows. Type e.g. '_.more' to view.>)
And after creating them :
('After : heap:', Partition of a set of 89339 objects. Total size = 32584064 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 2340 3 7843680 24 7843680 24 dict of matplotlib.lines.Line2D
1 1569 2 5259288 16 13102968 40 dict of matplotlib.text.Text
2 10137 11 3208536 10 16311504 50 dict (no owner)
3 2340 3 2452320 8 18763824 58 dict of matplotlib.markers.MarkerStyle
4 2261 3 2369528 7 21133352 65 dict of matplotlib.path.Path
5 662 1 2219024 7 23352376 72 dict of matplotlib.axis.XTick
6 1569 2 1644312 5 24996688 77 dict of matplotlib.font_manager.FontProperties
7 10806 12 856816 3 25853504 79 list
8 8861 10 708880 2 26562384 82 numpy.ndarray
9 1703 2 476840 1 27039224 83 dict of matplotlib.transforms.Affine2D
<181 more rows. Type e.g. '_.more' to view.>)
then, I do :
figures=[manager.canvas.figure for manager in matplotlib._pylab_helpers.Gcf.get_all_fig_managers()]
for i, figure in enumerate(figures): figure.clf(); plt.close(figure)
figures=[manager.canvas.figure for manager in matplotlib._pylab_helpers.Gcf.get_all_fig_managers()]#here, figures==[]
del figures
hp.heap()
This prints :
Partition of a set of 71966 objects. Total size = 23491976 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 1581 2 5299512 23 5299512 23 dict of matplotlib.lines.Line2D
1 1063 1 3563176 15 8862688 38 dict of matplotlib.text.Text
2 7337 10 2356952 10 11219640 48 dict (no owner)
3 1584 2 1660032 7 12879672 55 dict of matplotlib.path.Path
4 1581 2 1656888 7 14536560 62 dict of matplotlib.markers.MarkerStyle
5 441 1 1478232 6 16014792 68 dict of matplotlib.axis.XTick
6 1063 1 1114024 5 17128816 73 dict of matplotlib.font_manager.FontProperties
7 7583 11 619384 3 17748200 76 list
8 6500 9 572000 2 18320200 78 __builtin__.weakref
9 6479 9 518320 2 18838520 80 numpy.ndarray
<199 more rows. Type e.g. '_.more' to view.>
So apparently a handful of matplotlib objects have been deleted, but not all of them.
To begin with, I want to look at all the Line2D objects that are left:
objs = [obj for obj in gc.get_objects() if isinstance(obj, matplotlib.lines.Line2D)]
#[... very long list with e.g., <matplotlib.lines.Line2D object at 0x1375ede590>, <matplotlib.lines.Line2D object at 0x1375ede4d0>, <matplotlib.lines.Line2D object at 0x1375eec390>, <matplotlib.lines.Line2D object at 0x1375ef6350>, <matplotlib.lines.Line2D object at 0x1375eece10>, <matplotlib.lines.Line2D object at 0x1375eec690>, <matplotlib.lines.Line2D object at 0x1375eec610>, <matplotlib.lines.Line2D object at 0x1375eec590>, <matplotlib.lines.Line2D object at 0x1375eecb10>, <matplotlib.lines.Line2D object at 0x1375ef6850>, <matplotlib.lines.Line2D object at 0x1375eec350>]
print len(objs)#29199 (!!!)
So now I would like to be able to access all these objects to be able to delete them and clear the memory, but I don't know how I could do that...
Thanks for your help!
