Fastest way to access a dataframe cell by column values?


I have the following dataframe :
time bk1_lvl0_id bk2_lvl0_id pr_ss order_upto_level initial_inventory leadtime1 leadtime2 adjusted_leadtime
0 2020 1000 3 16 18 17 3 0.100000 1
1 2020 10043 3 65 78 72 12 0.400000 1
2 2020 1005 3 0 1 1 9 0.300000 1
3 2020 1009 3 325 363 344 21 0.700000 1
4 2020 102 3 0 1 1 7 0.233333 1
I want a function that returns pr_ss for a given pair of keys, for example (bk1_lvl0_id=1000, bk2_lvl0_id=3).
This is the code I've tried, but it is slow:
def get_safety_stock(df, bk1, bk2):
    # Return the safety stock for a given (bk1, bk2) pair
    for index, row in df.iterrows():
        if (row["bk1_lvl0_id"] == bk1) and (row["bk2_lvl0_id"] == bk2):
            return int(row["pr_ss"])

If your dataframe has no duplicate (bk1_lvl0_id, bk2_lvl0_id) pairs, you can write the function as follows:
def get_safety_stock(df, bk1, bk2):
    # Boolean mask on both key columns, then take the first match by position
    return df.loc[df.bk1_lvl0_id.eq(bk1) & df.bk2_lvl0_id.eq(bk2), 'pr_ss'].iloc[0]
Note that .iloc[0] takes the first value of the resulting Series by position, which is not an issue if there are no duplicates in the data. (A plain [0] would look up the index label 0 and fail for any row whose label is not 0.) If you want all matches, remove the .iloc[0] from the end and you get the whole Series. The function can be called as follows:
get_safety_stock(df, 1000, 3)
>>> 16
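When the same frame is queried many times, one option is to index the two key columns once and use label lookups afterwards, so each call avoids scanning both columns. This is a minimal sketch, assuming the key pairs are unique; the names indexed and get_safety_stock_indexed are made up for illustration, and the data is the sample from the question.

```python
import pandas as pd

# Sample rows from the question (only the columns needed here)
df = pd.DataFrame({
    "bk1_lvl0_id": [1000, 10043, 1005, 1009, 102],
    "bk2_lvl0_id": [3, 3, 3, 3, 3],
    "pr_ss": [16, 65, 0, 325, 0],
})

# Build the two-level index once; assumes (bk1, bk2) pairs are unique
indexed = df.set_index(["bk1_lvl0_id", "bk2_lvl0_id"])["pr_ss"].sort_index()

def get_safety_stock_indexed(series, bk1, bk2):
    # Hypothetical helper: label lookup on the two-level index
    return int(series.loc[(bk1, bk2)])

print(get_safety_stock_indexed(indexed, 1009, 3))  # prints 325
```

A pair that is not in the index raises a KeyError, so callers that expect missing keys need to handle that case.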

Related

Selecting data in pandas based on conditions

I am importing data from excel using Pandas and it looks like below,
time Column1 Column2 Column3 ID
0 1.0 181.359 -1.207 9.734 10
1 2.0 181.357 -1.179 9.729 10
2 3.0 181.357 -0.713 9.732 10
3 602.0 179.148 505.520 17.774 1810
4 603.0 179.153 506.824 17.765 1810
5 604.0 179.128 506.169 17.773 1810
6 605.0 179.129 504.141 17.776 1810
7 606.0 179.165 505.214 17.774 1810
8 3003.0 180.032 278.810 17.748 2010
9 3004.0 180.025 279.382 17.749 2010
10 16955.0 450.377 7.271 17.710 4510
11 16956.0 450.375 6.806 17.720 4510
12 16957.0 450.368 7.428 17.710 4510
13 16958.0 450.372 7.892 17.723 4510
14 16959.0 450.359 8.085 17.714 4510
I want to pick values from Column1, Column2 and Column3 based on a given value of ID.
For example, if I give ID=1810, I should get the values from Column1, Column2 and Column3 that correspond to 1810 (rows 3 to 7).
I am using the numpy.where function to get the matching row numbers:
a = np.where(data['ID'] == 1810)
but I could not work out how to select the column data based on that. Thank you in advance for your help!

Use pandas.DataFrame.loc with a boolean mask for the rows and a list of column names: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html
df.loc[df['ID'] == 1810, ['Column1', 'Column2', 'Column3']]
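As a small illustration of how the boolean mask relates to the np.where attempt in the question, the sketch below selects the same rows both ways. It uses a shortened copy of the question's data; the variable names by_mask, positions and by_position are made up for the example.

```python
import numpy as np
import pandas as pd

# Shortened copy of the data from the question
data = pd.DataFrame({
    "time": [1.0, 602.0, 603.0, 3003.0],
    "Column1": [181.359, 179.148, 179.153, 180.032],
    "Column2": [-1.207, 505.520, 506.824, 278.810],
    "Column3": [9.734, 17.774, 17.765, 17.748],
    "ID": [10, 1810, 1810, 2010],
})
cols = ["Column1", "Column2", "Column3"]

# Label-based selection: boolean mask for rows, list of names for columns
by_mask = data.loc[data["ID"] == 1810, cols]

# np.where returns a tuple of position arrays, so pair it with .iloc
positions = np.where(data["ID"] == 1810)[0]
by_position = data.iloc[positions][cols]

print(by_mask.equals(by_position))  # prints True
```

The mask form is usually the simpler choice, because it does not depend on row positions.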

Pandas: calculating mean value of multiple columns using datetime and Grouper removes columns or doesn't return correct Dataframe

As part of a larger task, I want to calculate the monthly mean values for each specific station. This is already difficult to do, but I am getting close.
The dataframe has many columns, but ultimately I only use the following information:
Date Value Station_Name
0 2006-01-03 18 2
1 2006-01-04 12 2
2 2006-01-05 11 2
3 2006-01-06 10 2
4 2006-01-09 22 2
... ... ...
3510 2006-12-23 47 45
3511 2006-12-24 46 45
3512 2006-12-26 35 45
3513 2006-12-27 35 45
3514 2006-12-30 28 45
I am running into two issues, using:
df.groupby(['Station_Name', pd.Grouper(freq='M')])['Value'].mean()
It results in something like:
Station_Name Date
2 2003-01-31 29.448387
2003-02-28 30.617857
2003-03-31 28.758065
2003-04-30 28.392593
2003-05-31 30.318519
...
45 2003-09-30 16.160000
2003-10-31 18.906452
2003-11-30 26.296667
2003-12-31 30.306667
2004-01-31 29.330000
Which I can't seem to use as a regular dataframe, and the datetime is messed up as it doesn't show the monthly mean but gives the last day back. Also the station name is a single index, and not for the whole column. Plus the mean value doesn't have a "column name" at all. This isn't a dataframe, but a pandas.core.series.Series. I can't convert this again because it's not correct, and using the .to_frame() method shows that it is still indeed a Dataframe. I don't get this part.
I found that, to get a normal dataframe back, I can pass
as_index = False
to the groupby method. But then the months are not shown:
df.groupby(['Station_Name', pd.Grouper(freq='M')], as_index = False)['Value'].mean()
Gives:
Station_Name Value
0 2 29.448387
1 2 30.617857
2 2 28.758065
3 2 28.392593
4 2 30.318519
... ... ...
142 45 16.160000
143 45 18.906452
144 45 26.296667
145 45 30.306667
146 45 29.330000
I can't just simply add the month later, as not every station has an observation in every month.
I've tried using other methods, such as
df.resample("M").mean()
But it doesn't seem possible to do this on multiple columns. It returns the mean value of everything.
Edit: This is ultimately what I would want.
Station_Name Date Value
0 2 2003-01 29.448387
1 2 2003-02 30.617857
2 2 2003-03 28.758065
3 2 2003-04 28.392593
4 2 2003-05 30.318519
... ... ...
142 45 2003-08 16.160000
143 45 2003-09 18.906452
144 45 2003-10 26.296667
145 45 2003-11 30.306667
146 45 2003-12 29.330000

How about this:
df = df.groupby(['Station_Name',df['Date'].dt.to_period('M')])['Value'].mean().reset_index()
Output:
Station_Name Date Value
0 2 2006-01 14.6
1 45 2006-12 38.2
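The following self-contained sketch shows the same approach on a few made-up rows, so that a station with observations in only some months can be seen in the result. It assumes Date is already a datetime column; the final astype(str) step is optional and only turns the monthly periods into plain text such as 2006-01.

```python
import pandas as pd

# Made-up rows in the same layout as the question
df = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2006-01-03", "2006-01-04", "2006-02-01", "2006-12-23", "2006-12-24"]
    ),
    "Value": [18, 12, 10, 47, 46],
    "Station_Name": [2, 2, 2, 45, 45],
})

# Group by station and by calendar month, then flatten back to columns
monthly = (
    df.groupby(["Station_Name", df["Date"].dt.to_period("M")])["Value"]
      .mean()
      .reset_index()
)

# Optional: show the month as text instead of a Period
monthly["Date"] = monthly["Date"].astype(str)
print(monthly)
```

Only the (station, month) pairs that actually occur in the data appear in the result, which addresses the concern that not every station has an observation in every month.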

Calculating current, min, max, mean monthly growth from pandas dataframe

I have a dataset similar to the one below:
product_ID month amount_sold
1 1 23
1 2 34
1 3 85
2 1 47
2 2 28
2 3 9
3 1 73
3 2 84
3 3 12
I want the output to be like this:
For example, for product 1:
-avg_monthly_growth is calculated as ((85-34)/34*100 + (34-23)/23*100)/2 = 98.91%
-lowest_monthly_growth is (34-23)/23*100 = 47.83%
-highest_monthly_growth is (85-34)/34*100 = 150%
-current_monthly_growth is the growth between the latest two months (in this case, the growth from month 2 to month 3, as the months range from 1 to 3 for each product)
product_ID avg_monthly_growth lowest_monthly_growth highest_monthly_growth current_monthly_growth
1 98.91% 47.83% 150% 150%
2 ... ... ... ...
3 ... ... ... ...
I've tried df.loc[df.groupby('product_ID')['amount_sold'].idxmax(), :].reset_index() which gets me the max (and similarly the min), but I'm not too sure how to get the percentage growths.

You can use pivot_table with pct_change() on axis=1, then build a dictionary of the desired Series and create a dataframe from it:
m=df.pivot_table(index='product_ID',columns='month',values='amount_sold').pct_change(axis=1)
d={'avg_monthly_growth':m.mean(axis=1)*100,'lowest_monthly_growth':m.min(1)*100,
'highest_monthly_growth':m.max(1)*100,'current_monthly_growth':m.iloc[:,-1]*100}
final=pd.DataFrame(d)
print(final)
            avg_monthly_growth  lowest_monthly_growth  highest_monthly_growth  current_monthly_growth
product_ID
1                    98.913043              47.826087              150.000000              150.000000
2                   -54.141337             -67.857143              -40.425532              -67.857143
3                   -35.322896             -85.714286               15.068493              -85.714286
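An alternative that stays in the long format is to compute the month-over-month change within each product and then aggregate it. This is a sketch of that variant, not the method from the answer above; it assumes the rows can be sorted by product_ID and month, and the names growth and summary are made up for the example.

```python
import pandas as pd

# Data from the question
df = pd.DataFrame({
    "product_ID": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "month": [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "amount_sold": [23, 34, 85, 47, 28, 9, 73, 84, 12],
})

# Growth is only meaningful in chronological order within each product
df = df.sort_values(["product_ID", "month"])

# Percentage change versus the previous month of the same product;
# the first month of each product has no predecessor and becomes NaN
growth = df.groupby("product_ID")["amount_sold"].pct_change() * 100

# Summarise the growth values per product (NaN values are skipped)
summary = growth.groupby(df["product_ID"]).agg(
    avg_monthly_growth="mean",
    lowest_monthly_growth="min",
    highest_monthly_growth="max",
    current_monthly_growth="last",
)
print(summary.round(2))
```

For product 1 this gives 98.91, 47.83, 150.0 and 150.0, matching the worked numbers in the question.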

How can I Extract only numbers from this columns?

Suppose you have a column in Excel with values like the ones below. Only 5500 numbers are present, but the reported length is 5602, which means that 102 strings are also present.
4 SELECTIO
6 N NO
14 37001
26 37002
38 37003
47 37004
60 37005
73 37006
82 37007
92 37008
105 37009
119 37010
132 37011
143 37012
157 37013
168 37014
184 37015
196 37016
207 37017
220 37018
236 37019
253 37020
267 37021
280 37022
287 Krishan
290 37023
300 37024
316 37025
337 37026
365 37027
...
74141 42471
74154 42472
74169 42473
74184 42474
74200 42475
74216 42476
74233 42477
74242 42478
74256 42479
74271 42480
74290 42481
74309 42482
74323 42483
74336 42484
74350 42485
74365 42486
74378 42487
74389 42488
74398 42489
74413 42490
74430 42491
74446 42492
74459 42493
74474 42494
74491 42495
74504 42496
74516 42497
74530 42498
74544 42499
74558 42500
Name: Selection No., Length: 5602, dtype: object
I want to get only the numeric values, like this, in Python using pandas:
37001
37002
37003
37004
37005
How can I do this? Here is my code in Python using pandas:
def selection(sle):
    if sle in re.match('[3-4][0-9]{4}',sle):
        return 1
    else:
        return 0
select['status'] = select['Selection No.'].apply(selection)
Now I am getting an "argument of type 'NoneType' is not iterable" error.

Try using NumPy's np.isreal to select only the rows that hold numbers:
import pandas as pd
import numpy as np
df = pd.DataFrame({'SELECTIO':['N NO',37002,37003,'Krishan',37004,'singh',37005], 'some_col':[4,6,14,26,38,47,60]})
df
SELECTIO some_col
0 N NO 4
1 37002 6
2 37003 14
3 Krishan 26
4 37004 38
5 singh 47
6 37005 60
>>> df[df[['SELECTIO']].applymap(np.isreal).all(1)]
SELECTIO some_col
1 37002 6
2 37003 14
4 37004 38
6 37005 60
Or, as another approach, use the numbers module with a lambda:
import numbers
df[df[['SELECTIO']].applymap(lambda x: isinstance(x, numbers.Number)).all(1)]
SELECTIO some_col
1 37002 6
2 37003 14
4 37004 38
6 37005 60
Note: when you select the column you use ['Selection No.'], but the actual name may contain a trailing space, as in ['Selection No. ']. That would raise a KeyError when you run the code, so check the exact column name.

Your function contains an incorrect expression: if sle in re.match('[3-4][0-9]{4}',sle): tests whether the column value sle is contained IN the result of re.match. A match object "always has a boolean value of True", but re.match returns None when there is no match, and testing membership in None raises the error you are seeing.
I would suggest using the pd.Series.str.isnumeric function instead:
In [544]: df
Out[544]:
Selection No.
0 37001
1 37002
2 37003
3 asnsh
4 37004
5 singh
6 37005
In [545]: df['Status'] = df['Selection No.'].str.isnumeric().astype(int)
In [546]: df
Out[546]:
Selection No. Status
0 37001 1
1 37002 1
2 37003 1
3 asnsh 0
4 37004 1
5 singh 0
6 37005 1
If a strict regex pattern is required - use pd.Series.str.contains function:
df['Status'] = df['Selection No.'].str.contains('^[3-4][0-9]{4}$', regex=True).astype(int)
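If you would rather keep a per-value function like the one in the question, the check can compare the result of the match against None instead of using in. This is a minimal sketch with a small made-up column; it assumes the values may be a mix of strings and integers, so each value is converted with str() first, and the names PATTERN and numbers_only are illustrative.

```python
import re
import pandas as pd

# Five digits, the first of which is 3 or 4 (pattern from the question)
PATTERN = re.compile(r"[3-4][0-9]{4}")

def selection(value):
    # fullmatch returns None when the whole value does not fit the pattern
    return int(PATTERN.fullmatch(str(value)) is not None)

# Made-up sample mixing text and numbers, as in the question
select = pd.DataFrame(
    {"Selection No.": ["SELECTIO", "N NO", "37001", 37002, "Krishan", "42500"]}
)
select["status"] = select["Selection No."].apply(selection)

# Keep only the rows flagged as numeric and convert them to integers
numbers_only = select.loc[select["status"] == 1, "Selection No."].astype(int)
print(numbers_only.tolist())  # prints [37001, 37002, 42500]
```

Using fullmatch rather than match also rejects values that merely start with five matching digits.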

Changing of Data format from Pivoted data in Dataframes using Pandas Python

The Scenario
My dataset was in format as follows:
Which I refer as ACTUAL FORMAT
uid iid rat tmp
196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
244 51 2 880606923
166 346 1 886397596
298 474 4 884182806
115 265 2 881171488
253 465 5 891628467
305 451 3 886324817
6 86 3 883603013
Another function (KMeans clustering) requires the data in the following format, which I created using a pivot:
I refer to this as the MATRIX FORMAT
uid 1 2 3 4
4 4.3320762062 4.3407749532 4.3111995162 4.3411425423
5 4 3 2.1952622349 3.1913491995
6 4 3.4233243638 3.8255108621 3.948791424
7 4.4983411706 4.0477240538 4.0241460801 5
8 4.1773004578 4.0191412859 4.0442369862 4.1754642909
9 4.2733984521 4.2797130861 4.2682723131 4.2816986988
15 1 3.0554789259 3.2279546684 3.1282278957
16 5 4.3473697565 4.0675394438 5
The Problem:
Since I need to pass the result (the MATRIX FORMAT data) back to the first algorithm, I need to convert it to the OLD FORMAT, that is, the ACTUAL FORMAT shown above.
Conversion:
To convert from the OLD FORMAT to the MATRIX FORMAT I did:
Pivot_Matrix = source_data.pivot(values='rat', index='uid', columns='iid')
I tried reversing and interchanging the values to get back to the OLD FORMAT, which failed. Is there any way to convert the MATRIX FORMAT back to the OLD FORMAT?

You need stack, then rename_axis to set the names of the two index levels, and finally reset_index:
df = df.stack().rename_axis(('uid','iid')).reset_index(name='rat')
print (df.head())
uid iid rat
0 4 1 4.332076
1 4 2 4.340775
2 4 3 4.311200
3 4 4 4.341143
4 5 1 4.000000
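To see the round trip end to end, the sketch below pivots a small made-up ratings table into the matrix layout and then converts it back. The data and the names matrix and long_format are assumptions for illustration. Cells that have no rating are NaN in the matrix; the explicit dropna() removes them regardless of whether the installed pandas version drops them in stack() by itself.

```python
import pandas as pd

# Made-up ratings in the long (uid, iid, rat) layout
source_data = pd.DataFrame({
    "uid": [4, 4, 5, 5, 6],
    "iid": [1, 2, 1, 2, 2],
    "rat": [4.0, 3.0, 5.0, 2.0, 1.0],
})

# Long -> matrix: one row per uid, one column per iid
matrix = source_data.pivot(values="rat", index="uid", columns="iid")

# Matrix -> long: stack the columns into rows and drop the empty cells
long_format = (
    matrix.stack()
          .dropna()
          .rename_axis(["uid", "iid"])
          .reset_index(name="rat")
)
print(long_format)
```

The tmp column from the original data is not part of the matrix, so it cannot be recovered by this step.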
