Division of two dataframes with group by of a column in Pandas - Python

I have a dataframe df_F1:
df_F1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 0 to 1
Data columns (total 7 columns):
class_energy 2 non-null object
ACT_TIME_AERATEUR_1_F1 2 non-null float64
ACT_TIME_AERATEUR_1_F3 2 non-null float64
ACT_TIME_AERATEUR_1_F5 2 non-null float64
ACT_TIME_AERATEUR_1_F8 2 non-null float64
ACT_TIME_AERATEUR_1_F7 2 non-null float64
ACT_TIME_AERATEUR_1_F8 2 non-null float64
dtypes: float64(6), object(1)
memory usage: 128.0+ bytes
df_F1.head()
class_energy ACT_TIME_AERATEUR_1_F1 ACT_TIME_AERATEUR_1_F3 ACT_TIME_AERATEUR_1_F5
low 5.875550 431.000000 856.666667
medium 856.666667 856.666667 856.666667
I am trying to create a dataframe Ratio which contains, for each class_energy, the energy value of each ACT_TIME_AERATEUR_1_Fx divided by the sum of the energy of all the ACT_TIME_AERATEUR_1_Fx columns.
For example:
ACT_TIME_AERATEUR_1_F1 ACT_TIME_AERATEUR_1_F3 ACT_TIME_AERATEUR_1_F5
low 5.875550/(5.875550 + 431.000000+856.666667) 431.000000/(5.875550+431.000000+856.666667) 856.666667/(5.875550+431.000000+856.666667)
medium 856.666667/(856.666667+856.666667+856.666667) 856.666667/(856.666667+856.666667+856.666667) 856.666667/(856.666667+856.666667+856.666667)
Any idea how to do this, please?

You could use DataFrame.divide to divide the required columns by their row-wise sum over those same columns, as shown:
df.iloc[:, 1:4] = df.iloc[:, 1:4].divide(df.iloc[:, 1:4].sum(axis=1), axis=0)
print(df)
class_energy ACT_TIME_AERATEUR_1_F1 ACT_TIME_AERATEUR_1_F3 \
0 low 0.004542 0.333194
1 medium 0.333333 0.333333
ACT_TIME_AERATEUR_1_F5
0 0.662264
1 0.333333
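If you would rather pick the ACT_TIME_AERATEUR_1_Fx columns by name instead of a fixed positional slice, a small sketch along the same lines (assuming the column names shown in df_F1.info()) could be:
import pandas as pd

# select every ACT_TIME_AERATEUR_1_Fx column by its common prefix
fx_cols = [c for c in df_F1.columns if c.startswith('ACT_TIME_AERATEUR_1_')]

# divide each value by the row-wise sum over those columns only
ratio = df_F1[fx_cols].div(df_F1[fx_cols].sum(axis=1), axis=0)
ratio.insert(0, 'class_energy', df_F1['class_energy'])
print(ratio)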


pandas.DataFrame.convert_dtypes increasing memory usage

Question to discuss and understand a bit more about pandas.DataFrame.convert_dtypes.
I have this DF imported from a SAS table:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 857613 entries, 0 to 857612
Data columns (total 27 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 cd_unco_tab 857613 non-null object
1 cd_ref_cnv 856389 non-null object
2 cd_cli 849637 non-null object
3 nm_prd 857613 non-null object
4 nm_ctgr_cpr 857613 non-null object
5 ts_cpr 857229 non-null datetime64[ns]
6 ts_cnfc 857613 non-null datetime64[ns]
7 ts_incl 857613 non-null datetime64[ns]
8 vl_cmss_rec 857613 non-null float64
9 qt_prd 857613 non-null float64
10 pc_cmss_rec 857242 non-null float64
11 nm_loja 857242 non-null object
12 vl_brto_cpr 857242 non-null float64
13 vl_cpr 857242 non-null float64
14 qt_dvlc 857613 non-null float64
15 cd_in_evt_espl 857613 non-null float64
16 cd_mm_aa_ref 840959 non-null object
17 nr_est_ctbc_evt 857613 non-null float64
18 nr_est_cnfc_pcr 18963 non-null float64
19 cd_tran_pcr 0 non-null object
20 ts_est 18963 non-null datetime64[ns]
21 tx_est_tran 18963 non-null object
22 vl_tran 18963 non-null float64
23 cd_pcr 0 non-null float64
24 vl_cbac_cli 653563 non-null float64
25 pc_cbac_cli 653563 non-null float64
26 cd_vndr 18963 non-null float64
dtypes: datetime64[ns](4), float64(14), object(9)
memory usage: 176.7+ MB
Basically, the DF is composed of datetime64, float64 and object dtypes, none of which are memory efficient (as far as I know).
I read a bit about DataFrame.convert_dtypes for optimizing memory usage; this is the result:
dfcompras = dfcompras.convert_dtypes(infer_objects=True, convert_string=True, convert_integer=True, convert_boolean=True, convert_floating=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 857613 entries, 0 to 857612
Data columns (total 27 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 cd_unco_tab 857613 non-null string
1 cd_ref_cnv 856389 non-null string
2 cd_cli 849637 non-null string
3 nm_prd 857613 non-null string
4 nm_ctgr_cpr 857613 non-null string
5 ts_cpr 857229 non-null datetime64[ns]
6 ts_cnfc 857613 non-null datetime64[ns]
7 ts_incl 857613 non-null datetime64[ns]
8 vl_cmss_rec 857613 non-null Float64
9 qt_prd 857613 non-null Int64
10 pc_cmss_rec 857242 non-null Float64
11 nm_loja 857242 non-null string
12 vl_brto_cpr 857242 non-null Float64
13 vl_cpr 857242 non-null Float64
14 qt_dvlc 857613 non-null Int64
15 cd_in_evt_espl 857613 non-null Int64
16 cd_mm_aa_ref 840959 non-null string
17 nr_est_ctbc_evt 857613 non-null Int64
18 nr_est_cnfc_pcr 18963 non-null Int64
19 cd_tran_pcr 0 non-null Int64
20 ts_est 18963 non-null datetime64[ns]
21 tx_est_tran 18963 non-null string
22 vl_tran 18963 non-null Float64
23 cd_pcr 0 non-null Int64
24 vl_cbac_cli 653563 non-null Float64
25 pc_cbac_cli 653563 non-null Float64
26 cd_vndr 18963 non-null Int64
dtypes: Float64(7), Int64(8), datetime64[ns](4), string(8)
memory usage: 188.9 MB
Most columns were changed from object to string and from float64 to Int64/Float64, so I expected memory usage to go down, but as we can see, it increased!
Any guess?
After doing some analysis, it seems there is additional memory overhead when using the new nullable Int64/Float64 dtypes: they take approximately 9 bytes to store a single value, while int64/float64 take 8 bytes.
Here is a small example to demonstrate this:
pd.DataFrame({'col': range(10)}).astype('float64').memory_usage()
Index 128
col 80 # 8 byte per item * 10 items
dtype: int64
pd.DataFrame({'col': range(10)}).astype('Float64').memory_usage()
Index 128
col 90 # 9 byte per item * 10 items
dtype: int64
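The extra byte per value comes from the validity mask that the nullable arrays carry alongside the data: one boolean flag per element marking whether it is missing. A minimal sketch of that byte accounting (plain NumPy arithmetic, not pandas internals):
import numpy as np

n = 10
data_bytes = np.empty(n, dtype='float64').nbytes  # 80 bytes of values
mask_bytes = np.empty(n, dtype=bool).nbytes       # 10 bytes, one flag per value
print(data_bytes + mask_bytes)                    # 90, matching the Float64 column above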
Now, coming back to your example: after executing convert_dtypes, around 15 columns were converted from float64 to Int64/Float64 dtypes. Let's calculate the extra bytes required to store the data with the new types. The formula is fairly simple: n_columns * n_rows * overhead_in_bytes
>>> extra_bytes = 15 * 857613 * 1
>>> extra_mega_bytes = extra_bytes / 1024 ** 2
>>> extra_mega_bytes
12.2682523727417
It turns out extra_mega_bytes is around 12.26 MB, which is approximately the same as the difference between the memory usage of your new and old dataframes.
Some details about the new nullable integer datatype:
Int64/Float64 (notice the capital first letter) are among the new nullable types introduced in pandas version >= 0.24. At a high level they let you use pd.NA instead of np.nan to represent missing values; the implication of this is easier to see in the following example:
s = pd.Series([1, 2, np.nan])
print(s)
0 1.0
1 2.0
2 NaN
dtype: float64
Let's say you have the series s above. When you check the dtype, pandas has automatically cast it to float64 because of the presence of null values. This is not a problem in most cases, but if a column acts as an identifier, the automatic conversion to float is undesirable. To prevent this, pandas introduced the new nullable integer types:
s = pd.Series([1, 2, np.nan], dtype='Int64')
print(s)
0 1
1 2
2 <NA>
dtype: Int64
Some details on the string dtype:
As of now there isn't much of a performance or memory difference when using the new string type, but this may change in the near future. See this quote from the pandas docs:
Currently, the performance of object dtype arrays of strings and
StringArray are about the same. We expect future enhancements to
significantly increase the performance and lower the memory overhead
of StringArray.

How to convert the (%) sign in a float data type, or delete it

When I read the data from an Excel file, I found that the column called "TOT_SALES" has data type float64 and all the values appear with a % sign.
I want to remove this sign and divide all the values by 100, even though the values in that column in the Excel file itself look regular, as I show below.
Any help on how to remove the (%) from the values?
df= pd.read_excel("QVI_transaction_data_1.xlsx")
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 264836 entries, 0 to 264835
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 DATE 264836 non-null datetime64[ns]
1 STORE_NBR 264836 non-null int64
2 LYLTY_CARD_NBR 264836 non-null int64
3 TXN_ID 264836 non-null int64
4 PROD_NBR 264836 non-null int64
5 PROD_NAME 264836 non-null object
6 PROD_QTY 264836 non-null int64
7 TOT_SALES 264836 non-null float64
dtypes: datetime64[ns](1), float64(1), int64(5), object(1)
memory usage: 16.2+ MB
df.head()
This is the result that appears in the DataFrame:
TOT_SALES
1180.00%
740.00%
420.00%
1080.00%
660.00%
These are the values in the Excel file, without the (%) sign:
TOT_SALES
11.80
7.40
4.20
10.80
6.60
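A minimal sketch of one way to handle this, assuming the % values either come through as text like "1180.00%" or as plain numbers that are 100 times too large (the column and file names are taken from the question):
import pandas as pd

df = pd.read_excel("QVI_transaction_data_1.xlsx")

# If TOT_SALES was read as text with a % sign, strip it and rescale;
# if it is already numeric, only the division by 100 is needed.
if df["TOT_SALES"].dtype == object:
    df["TOT_SALES"] = df["TOT_SALES"].astype(str).str.rstrip("%").astype(float) / 100
else:
    df["TOT_SALES"] = df["TOT_SALES"] / 100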

python pandas | replacing the date and time string with only time

price   quantity   high time
10.4    3          2021-11-08 14:26:00-05:00
The dataframe is ddg, and the datatype of the high time column is datetime64[ns, America/New_York].
I want the high time to be only 14:26:00 (getting rid of 2021-11-08 and -05:00), but I got an error when using the code below:
ddg['high_time'] = ddg['high_time'].dt.strftime('%H:%M')
I think it's because 'high_time' is not the right column name:
# Your code
>>> ddg['high_time'].dt.strftime('%H:%M')
...
KeyError: 'high_time'
# With right column name
>>> ddg['high time'].dt.strftime('%H:%M')
0 14:26
Name: high time, dtype: object
# My dataframe:
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 price 1 non-null float64
1 quantity 1 non-null int64
2 high time 1 non-null datetime64[ns, America/New_York]
dtypes: datetime64[ns, America/New_York](1), float64(1), int64(1)
memory usage: 152.0 bytes
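If the seconds are wanted as well (14:26:00 rather than 14:26), the same approach works with a fuller format string; renaming the column first is optional and only assumed here for convenience:
# optional: replace the space in the column name
ddg = ddg.rename(columns={'high time': 'high_time'})

# keep hours, minutes and seconds, dropping the date and the UTC offset
ddg['high_time'] = ddg['high_time'].dt.strftime('%H:%M:%S')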

Exponentially Weighted Covariance Matrix in Python

I have weekly return data in ascending order. Suppose I want to calculate an exponentially weighted moving covariance matrix with a half-life of 5 years, or 260 weeks. I am trying to follow pandas.DataFrame.ewm, but I'm not really clear about how to implement this in my case, given that I can't pass halflife: "Only applicable to mean() and halflife value will not apply to the other functions."
Using alpha = 1-exp(ln(0.5)/HalfLife) = 2/(span + 1), I used span = 750.
ret is my return data frame with DatetimeIndex.
ret.shape is (895, 11)
I tried SigmaExpW = ret.ewm(span=750).cov(), but the result is not what I expected: SigmaExpW.shape is (9845, 11).
For context -
ret.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 895 entries, 2003-06-04 to 2020-07-22
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 GE 895 non-null float64
1 IGM 895 non-null float64
2 USLC 895 non-null float64
3 USSC 895 non-null float64
4 ExUS 895 non-null float64
5 ExUSHedged 895 non-null float64
6 EM 895 non-null float64
7 Commodities 895 non-null float64
8 HYM 895 non-null float64
9 Government 895 non-null float64
10 Cash 895 non-null float64
dtypes: float64(11)
I am clearly making some elementary mistake. Would appreciate help.
It creates an 11x11 covariance matrix for every date, stacked on top of each other: 895 * 11 = 9845 rows.
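If you only need a single covariance matrix, a sketch (assuming ret is the weekly return dataframe above) would slice the MultiIndexed result on one date, for example the last one:
# ewm(...).cov() stacks one 11x11 block per date, indexed by (date, column)
SigmaExpW = ret.ewm(span=750).cov()

# pull the 11x11 matrix for the last date in the sample
Sigma_latest = SigmaExpW.loc[ret.index[-1]]
print(Sigma_latest.shape)  # (11, 11)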

pandas update a dataframe column based on prefiltered groupby object

given dataframe d such as this:
index col1
1 a
2 a
3 b
4 b
Create a prefiltered group object with new values:
g = d[prefilter].groupby(['some cols']).apply( somefunc )
index col1
2 c
4 d
Now I want to update df to this:
index col1
1 a
2 c
3 b
4 d
I've been hacking away with update, ix, filtering, where, etc. I am guessing there is an obvious solution I am not seeing here.
Stuff like this is not working:
d[d.index == db.index]['alert_v'] = db['alert_v']
q90 = g.transform( somefunc )
d.ix[ d['alert_v'] >=q90, 'alert_v'] = 1
d.ix[ d['alert_v'] < q90, 'alert_v'] = 0
d['alert_v'] = np.where( d.index==db.index, db['alert_v'], d['alert_v'] )
Any help is appreciated, thank you.
--edit--
The two dataframes are in the same form: one is simply a filtered version of the other, with different values, that I want to apply back to the original.
ValueError: cannot reindex from a duplicate axis
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2186 entries, 1984-12-12 13:33:00 to 1939-03-19 22:54:00
Data columns (total 9 columns):
source 2186 non-null object
subject_id 2186 non-null float64
alert_t 2186 non-null object
variable 2186 non-null object
timeindex 2186 non-null datetime64[ns]
alert_v 2105 non-null float64
value 2186 non-null float64
tavg 54 non-null timedelta64[ns]
iqt 61 non-null object
dtypes: datetime64[ns](1), float64(3), object(4), timedelta64[ns](1)
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1982 entries, 1984-12-12 13:33:00 to 1939-03-19 22:54:00
Data columns (total 9 columns):
source 1982 non-null object
subject_id 1982 non-null float64
alert_t 1982 non-null object
variable 1982 non-null object
timeindex 1982 non-null datetime64[ns]
alert_v 1982 non-null int64
value 1982 non-null float64
tavg 0 non-null timedelta64[ns]
iqt 0 non-null object
dtypes: datetime64[ns](1), float64(2), int64(1), object(4), timedelta64[ns](1)
You want the df.update() function.
Try something like this:
import pandas as pd
df1 = pd.DataFrame({'Index':[1,2,3,4],'Col1':['A', 'B', 'C', 'D']}).set_index('Index')
df2 = pd.DataFrame({'Index':[2,4],'Col1':['E', 'F']}).set_index('Index')
print(df1)
Col1
Index
1 A
2 B
3 C
4 D
df1.update(df2)
print(df1)
Col1
Index
1 A
2 E
3 C
4 F
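DataFrame.update aligns on the index and column names, so for the frames in the question something like the sketch below should work, assuming alert_v is the column to refresh; note that with duplicate index labels the alignment can raise the same "cannot reindex from a duplicate axis" error quoted above, so duplicates would have to be removed or the index reset first.
# db is the prefiltered/recomputed frame, d is the original
d.update(db[['alert_v']])  # overwrites alert_v in d wherever db's index matches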
