I have a problem using Pandas.
When I execute autos.info() it returns:
RangeIndex: 371528 entries, 0 to 371527
Data columns (total 20 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 dateCrawled 371528 non-null object
1 name 371528 non-null object
2 seller 371528 non-null object
3 offerType 371528 non-null object
4 price 371528 non-null int64
5 abtest 371528 non-null object
6 vehicleType 333659 non-null object
7 yearOfRegistration 371528 non-null int64
8 gearbox 351319 non-null object
9 powerPS 371528 non-null int64
10 model 351044 non-null object
11 kilometer 371528 non-null int64
12 monthOfRegistration 371528 non-null int64
13 fuelType 338142 non-null object
14 brand 371528 non-null object
15 notRepairedDamage 299468 non-null object
16 dateCreated 371528 non-null object
17 nrOfPictures 371528 non-null int64
18 postalCode 371528 non-null int64
19 lastSeen 371528 non-null object
dtypes: int64(7), object(13)
memory usage: 56.7+ MB
But when I execute autos["price"].describe() it returns:
count 3.715280e+05
mean 1.729514e+04
std 3.587954e+06
min 0.000000e+00
25% 1.150000e+03
50% 2.950000e+03
75% 7.200000e+03
max 2.147484e+09
Name: price, dtype: float64
I don't understand why there is this dtype discrepancy for the price column.
Any suggestions?
The return value of Series.describe() is a Series with the descriptive statistics. The dtype you see in the Series is not the dtype of the original column but the dtype of the statistics - which is float.
The name of the result is price because that is set as the name of the Series autos["price"].
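You can verify this with any integer column; statistics such as the mean and standard deviation are fractional, so pandas returns all of them as floats. A minimal sketch:
import pandas as pd

s = pd.Series([1, 2, 3], name='price')  # an int64 Series
print(s.dtype)             # int64
print(s.describe().dtype)  # float64: the dtype of the statistics, not of the data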
If it is only a display issue, can I control the number of digits shown to get the output I want?
pd.set_option('display.float_format', lambda x: '%.5f' % x)  # global display option
df['X'].describe().apply("{0:.5f}".format)  # or format just this one result
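Both lines only change how the numbers are printed; the underlying statistics remain float64. If you set the global option, pd.reset_option('display.float_format') restores the default afterwards.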
A question to discuss and understand pandas.DataFrame.convert_dtypes a bit better.
I have this DF imported from a SAS table:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 857613 entries, 0 to 857612
Data columns (total 27 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 cd_unco_tab 857613 non-null object
1 cd_ref_cnv 856389 non-null object
2 cd_cli 849637 non-null object
3 nm_prd 857613 non-null object
4 nm_ctgr_cpr 857613 non-null object
5 ts_cpr 857229 non-null datetime64[ns]
6 ts_cnfc 857613 non-null datetime64[ns]
7 ts_incl 857613 non-null datetime64[ns]
8 vl_cmss_rec 857613 non-null float64
9 qt_prd 857613 non-null float64
10 pc_cmss_rec 857242 non-null float64
11 nm_loja 857242 non-null object
12 vl_brto_cpr 857242 non-null float64
13 vl_cpr 857242 non-null float64
14 qt_dvlc 857613 non-null float64
15 cd_in_evt_espl 857613 non-null float64
16 cd_mm_aa_ref 840959 non-null object
17 nr_est_ctbc_evt 857613 non-null float64
18 nr_est_cnfc_pcr 18963 non-null float64
19 cd_tran_pcr 0 non-null object
20 ts_est 18963 non-null datetime64[ns]
21 tx_est_tran 18963 non-null object
22 vl_tran 18963 non-null float64
23 cd_pcr 0 non-null float64
24 vl_cbac_cli 653563 non-null float64
25 pc_cbac_cli 653563 non-null float64
26 cd_vndr 18963 non-null float64
dtypes: datetime64[ns](4), float64(14), object(9)
memory usage: 176.7+ MB
Basically, the DF is composed of datetime64, float64 and object dtypes, none of which is memory efficient (as far as I know).
I read a bit about DataFrame.convert_dtypes to optimize memory usage, this is the result:
dfcompras = dfcompras.convert_dtypes(infer_objects=True, convert_string=True, convert_integer=True, convert_boolean=True, convert_floating=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 857613 entries, 0 to 857612
Data columns (total 27 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 cd_unco_tab 857613 non-null string
1 cd_ref_cnv 856389 non-null string
2 cd_cli 849637 non-null string
3 nm_prd 857613 non-null string
4 nm_ctgr_cpr 857613 non-null string
5 ts_cpr 857229 non-null datetime64[ns]
6 ts_cnfc 857613 non-null datetime64[ns]
7 ts_incl 857613 non-null datetime64[ns]
8 vl_cmss_rec 857613 non-null Float64
9 qt_prd 857613 non-null Int64
10 pc_cmss_rec 857242 non-null Float64
11 nm_loja 857242 non-null string
12 vl_brto_cpr 857242 non-null Float64
13 vl_cpr 857242 non-null Float64
14 qt_dvlc 857613 non-null Int64
15 cd_in_evt_espl 857613 non-null Int64
16 cd_mm_aa_ref 840959 non-null string
17 nr_est_ctbc_evt 857613 non-null Int64
18 nr_est_cnfc_pcr 18963 non-null Int64
19 cd_tran_pcr 0 non-null Int64
20 ts_est 18963 non-null datetime64[ns]
21 tx_est_tran 18963 non-null string
22 vl_tran 18963 non-null Float64
23 cd_pcr 0 non-null Int64
24 vl_cbac_cli 653563 non-null Float64
25 pc_cbac_cli 653563 non-null Float64
26 cd_vndr 18963 non-null Int64
dtypes: Float64(7), Int64(8), datetime64[ns](4), string(8)
memory usage: 188.9 MB
Most columns were changed from object to string and from float64 to Int64, so I expected memory usage to drop, but as we can see, it increased!
Any guess why?
After doing some analysis, it seems there is additional memory overhead when using the new nullable Int64/Float64 dtypes: Int64/Float64 take approximately 9 bytes to store a single value, while int64/float64 take 8 bytes.
Here is a small example to demonstrate this:
pd.DataFrame({'col': range(10)}).astype('float64').memory_usage()
Index 128
col 80 # 8 byte per item * 10 items
dtype: int64
pd.DataFrame({'col': range(10)}).astype('Float64').memory_usage()
Index 128
col 90 # 9 byte per item * 10 items
dtype: int64
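The extra byte per value comes from the boolean validity mask that the nullable arrays store alongside the data (one bool per element, marking whether the value is pd.NA).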
Now, coming back to your example: after executing convert_dtypes, around 15 columns were converted from float64 to Int64/Float64. Let's calculate the amount of extra bytes required to store the data with the new types. The formula is fairly simple: n_columns * n_rows * overhead_in_bytes.
>>> extra_bytes = 15 * 857613 * 1
>>> extra_mega_bytes = extra_bytes / 1024 ** 2
>>> extra_mega_bytes
12.2682523727417
Turns out extra_mega_bytes is around 12.26 MB, which is approximately the same as the difference between the memory usage of your new and old dataframes.
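As a rough cross-check, you can compute the same difference directly (a sketch, assuming dfcompras still holds the original dtypes at this point):
before = dfcompras.memory_usage().sum()
after = dfcompras.convert_dtypes().memory_usage().sum()
print((after - before) / 1024 ** 2)  # roughly the ~12 MB of validity masks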
Some details about the new nullable integer dtype:
Int64/Float64 (note the capital first letter) are among the new nullable dtypes introduced in pandas version >= 0.24. At a high level, they allow you to use pd.NA instead of np.nan to represent missing values; the implication is easier to see in the following example:
s = pd.Series([1, 2, np.nan])
print(s)
0 1.0
1 2.0
2 NaN
dtype: float64
Say you have the series s above. When you check the dtype, you see that pandas automatically cast it to float64 because of the presence of null values. This is not problematic in most cases, but if you have a column that acts as an identifier, the automatic conversion to float is undesirable. To prevent it, pandas introduced these new nullable integer types.
s = pd.Series([1, 2, np.nan], dtype='Int64')
print(s)
0 1
1 2
2 <NA>
dtype: Int64
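Arithmetic and comparisons on such a series keep the nullable dtype, and missing values propagate as pd.NA rather than being silently cast to float.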
Some details on the string dtype
As of now there isn't much of a performance or memory difference when using the new string dtype, but this may change in the near future. See this quote from the pandas docs:
Currently, the performance of object dtype arrays of strings and
StringArray are about the same. We expect future enhancements to
significantly increase the performance and lower the memory overhead
of StringArray.
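For completeness, a minimal sketch of the string dtype; like Int64, it represents missing values as pd.NA:
s = pd.Series(['a', 'b', None], dtype='string')
print(s)
0       a
1       b
2    <NA>
dtype: string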
When I read the data from an Excel file, I found that
the column called "TOT_SALES" has data type float64 but all of its values appear with a % sign.
I want to remove this sign and divide all values by 100. At the same time, the values in the column in the Excel file itself look regular, as I show below.
Any help on how to remove the (%) from the values?
df= pd.read_excel("QVI_transaction_data_1.xlsx")
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 264836 entries, 0 to 264835
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 DATE 264836 non-null datetime64[ns]
1 STORE_NBR 264836 non-null int64
2 LYLTY_CARD_NBR 264836 non-null int64
3 TXN_ID 264836 non-null int64
4 PROD_NBR 264836 non-null int64
5 PROD_NAME 264836 non-null object
6 PROD_QTY 264836 non-null int64
7 TOT_SALES 264836 non-null float64
dtypes: datetime64[ns](1), float64(1), int64(5), object(1)
memory usage: 16.2+ MB
df.head()
This is the result that appears in the DataFrame:
TOT_SALES
1180.00%
740.00%
420.00%
1080.00%
660.00%
These are the values in the Excel file, without the (%) sign:
TOT_SALES
11.80
7.40
4.20
10.80
6.60
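A guess based on the info output above: since the dtype is float64, the stored numbers carry no % sign, so the percent most likely comes from a display.float_format option set earlier in the session; for example, pd.set_option('display.float_format', '{:.2%}'.format) renders 11.80 as exactly 1180.00%. If that is the cause, no division by 100 is needed; resetting the option restores the plain display:
pd.reset_option('display.float_format')
df['TOT_SALES'].head()  # should now show 11.80, 7.40, ...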
I want to fill the null values in the features of my dataframe. But when I fill all of the features, the data type of every one I fill is changed to object.
I have dataframe with data type:
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 umur 7832 non-null float64
1 jenis_kelamin 7840 non-null object
2 pekerjaan 7760 non-null object
3 provinsi 7831 non-null object
4 gaji 7843 non-null float64
5 is_menikah 7917 non-null object
6 is_keturunan 7917 non-null object
7 berat 7861 non-null float64
8 tinggi 7843 non-null float64
9 sampo 7858 non-null object
10 is_merokok 7917 non-null object
11 pendidikan 7847 non-null object
12 stress 7853 non-null float64
And I use fillna() to fill the null values in every feature:
# Categorical feature imputation
df['jenis_kelamin'].fillna(df['jenis_kelamin'].mode()[0], inplace = True)
df['pekerjaan'].fillna(df['pekerjaan'].mode()[0], inplace = True)
df['provinsi'].fillna(df['provinsi'].mode()[0], inplace = True)
df['sampo'].fillna(df['sampo'].mode()[0], inplace = True)
df['pendidikan'].fillna(df['pendidikan'].mode()[0], inplace = True)
# Numeric feature imputation
df['umur'].fillna(df['umur'].mean, inplace = True)
df['gaji'].fillna(df['gaji'].mean, inplace = True)
df['berat'].fillna(df['berat'].mean, inplace = True)
df['tinggi'].fillna(df['tinggi'].mean, inplace = True)
df['stress'].fillna(df['stress'].mean, inplace = True)
But after that, every feature's data type has been changed to object:
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 umur 7917 non-null object
1 jenis_kelamin 7917 non-null object
2 pekerjaan 7917 non-null object
3 provinsi 7917 non-null object
4 gaji 7917 non-null object
5 is_menikah 7917 non-null object
6 is_keturunan 7917 non-null object
7 berat 7917 non-null object
8 tinggi 7917 non-null object
9 sampo 7917 non-null object
10 is_merokok 7917 non-null object
11 pendidikan 7917 non-null object
12 stress 7917 non-null object
I know it would work to convert every feature back with astype(), but is there a more efficient way to fill the null values without changing the data types?
I think you are missing the parentheses on .mean(), so you are filling the series with the method object itself instead of the actual mean values.
You want, for example:
df['umur'].fillna(df['umur'].mean(), inplace = True)
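As a side note, a more compact variant that avoids repeating the pattern per column (a sketch reusing the column names from your info output):
# Numeric features: fill with each column's mean; dtypes stay float64
num_cols = ['umur', 'gaji', 'berat', 'tinggi', 'stress']
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())

# Categorical features: fill with each column's first mode
cat_cols = ['jenis_kelamin', 'pekerjaan', 'provinsi', 'sampo', 'pendidikan']
df[cat_cols] = df[cat_cols].fillna(df[cat_cols].mode().iloc[0])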
I have a dataframe with the columns shown below. I created a boxplot with plotly.express using facets, with the code shown, and I have embedded a sample of the plot the code produces.
df.columns
>>> Index(['crops', 'category', 'sand', 'clay', 'soil_text_3', 'org_mat', 'org_mat_characterisations', 'pH', 'pH_characterisation', 'ca', 'ca_characterisation', 'N_ppm', 'N_ppm_characterisation',
'N_dose', 'residual_coef', 'fev'],
dtype='object')
import plotly.express as px
import plotly.io as pio
pio.renderers.default = 'browser'
fig = px.box(data_frame = df,
x = 'N_ppm', y = 'N_dose',
color = 'pH_characterisation',
points = False,
facet_row = 'soil_text_3',
facet_col = 'org_mat_characterisations')
fig.show()
My question is whether it is possible to have a second x-axis below the primary one showing 'N_ppm_characterisation', so that both the numeric values and, below them, the categorical values are visible at the same time.
I also provide the dataframe's info with the current dtypes, in case it is necessary to perform any changes.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 302016 entries, 0 to 302015
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 crops 302016 non-null object
1 category 302016 non-null object
2 sand 302016 non-null int64
3 clay 302016 non-null int64
4 soil_text_3 302016 non-null object
5 org_mat 302016 non-null float64
6 org_mat_characterisations 302016 non-null object
7 pH 302016 non-null float64
8 pH_characterisation 302016 non-null object
9 ca 302016 non-null float64
10 ca_characterisation 302016 non-null object
11 N_ppm 302016 non-null int64
12 N_ppm_characterisation 302016 non-null object
13 N_dose 302016 non-null float64
14 residual_coef 302016 non-null float64
15 fev 302016 non-null float64
dtypes: float64(6), int64(3), object(7)
memory usage: 36.9+ MB
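One direction that may get you close: plotly's cartesian axes support a two-level 'multicategory' mode when x is passed as a list of two parallel arrays, which draws the second array's labels in a row below the first. A single-panel sketch with plotly.graph_objects (without the facets and colour split, which would each need their own traces), assuming the same column names as above:
import plotly.graph_objects as go

fig = go.Figure(go.Box(
    # two parallel arrays -> multicategory x-axis: N_ppm on top, its characterisation below
    x=[df['N_ppm'], df['N_ppm_characterisation']],
    y=df['N_dose'],
))
fig.show()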
I have an object named air_list and it contains the following in the first column:
['John_F_Kennedy_International_Airport', 'Charlotte_Douglas_International_Airport', 'OHare_International_Airport']
In the second and third columns, I have Lat and Long coordinates.
I have another object named cust_loc; it contains Lat, Lon, a coordinates column, and one column per airport holding the distance between the coordinates. Now I'm trying to use a lambda function to basically say: if the distance is less than 500 miles, the Condition is 'In', otherwise it's 'Out'.
Here's the function that I am testing out.
for i in air_list:
    cust_loc.loc['Condition']=cust_loc.loc[cust_loc.Condition=='Out'][i].apply(lambda x: 'In' if x<=500 else 'Out')
The In flags seem to be fine, but all Out flags come in as NaNs. All coordinates and distances are float and coor is an object column. Any idea what's wrong with my setup?
These are two Pandas Dataframes:
cust_loc.info()
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Longitude 150961 non-null float64
1 Latitude 150961 non-null float64
2 coor 150961 non-null object
3 John_F_Kennedy_International_Airport 150961 non-null float64
4 Charlotte_Douglas_International_Airport 150961 non-null float64
5 OHare_International_Airport 150961 non-null float64
6 Tucson_International_Airport 150961 non-null float64
7 Candy_Kitchen_Ranch_Airport 150961 non-null float64
8 Canandaigua_Airport 150961 non-null float64
9 Asheville_Regional_Airport 150961 non-null float64
10 Dallas_Love_Field_Airport 150961 non-null float64
11 Fly_Barts 150961 non-null float64
12 Tampa_International_Airport 150961 non-null float64
13 Condition 150961 non-null object
air_list.info()
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Place_Name 10 non-null object
1 Latitude 10 non-null float64
2 Longitude 10 non-null float64
3 coor 10 non-null object
Thanks.
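A few things stand out. Iterating over a DataFrame (for i in air_list) yields its column names ('Place_Name', 'Latitude', ...), not the airport names; cust_loc.loc['Condition'] addresses a row label rather than the Condition column; and assigning from the filtered frame cust_loc.loc[cust_loc.Condition=='Out'] aligns on the filtered index, so every row outside the filter becomes NaN. A sketch of one possible rewrite, assuming air_list['Place_Name'] holds the same names as the distance columns of cust_loc:
# Start every row as 'Out', then flag rows within 500 miles of any airport
cust_loc['Condition'] = 'Out'
for col in air_list['Place_Name']:
    cust_loc.loc[cust_loc[col] <= 500, 'Condition'] = 'In'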