Trying to use a lambda function and getting NaN - python

I have an object named air_list and it contains the following in the first column:
['John_F_Kennedy_International_Airport', 'Charlotte_Douglas_International_Airport', 'OHare_International_Airport']
In the second and third columns, I have Lat and Long coordinates.
I have another object named cust_loc that contains Lat, Lon, the airport names, coordinates, and the distances between coordinates. Now I'm trying to use a lambda function to basically say: if the distance is less than 500 miles, the Condition is 'In', otherwise it's 'Out'.
Here's the function that I am testing out.
for i in air_list:
    cust_loc.loc['Condition'] = cust_loc.loc[cust_loc.Condition == 'Out'][i].apply(lambda x: 'In' if x <= 500 else 'Out')
The 'In' flags seem to be fine, but all the 'Out' flags come in as NaNs. All coordinates and distances are float and the coor column is an object. Any idea what's wrong with my setup?
These are two Pandas Dataframes:
cust_loc.info()
 #   Column                                    Non-Null Count   Dtype
---  ------                                    --------------   -----
 0   Longitude                                 150961 non-null  float64
 1   Latitude                                  150961 non-null  float64
 2   coor                                      150961 non-null  object
 3   John_F_Kennedy_International_Airport      150961 non-null  float64
 4   Charlotte_Douglas_International_Airport   150961 non-null  float64
 5   OHare_International_Airport               150961 non-null  float64
 6   Tucson_International_Airport              150961 non-null  float64
 7   Candy_Kitchen_Ranch_Airport               150961 non-null  float64
 8   Canandaigua_Airport                       150961 non-null  float64
 9   Asheville_Regional_Airport                150961 non-null  float64
 10  Dallas_Love_Field_Airport                 150961 non-null  float64
 11  Fly_Barts                                 150961 non-null  float64
 12  Tampa_International_Airport               150961 non-null  float64
 13  Condition                                 150961 non-null  object
air_list.info()
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Place_Name  10 non-null     object
 1   Latitude    10 non-null     float64
 2   Longitude   10 non-null     float64
 3   coor        10 non-null     object
Thanks.
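For reference, here is a minimal sketch (not part of the original post) of one vectorized way to implement the described rule, assuming cust_loc holds one distance column per airport and air_list['Place_Name'] lists those column names:

import numpy as np

# Columns of cust_loc that hold the distance to each airport
airport_cols = list(air_list['Place_Name'])

# 'In' if the distance to any airport is 500 miles or less, otherwise 'Out'
within_500 = cust_loc[airport_cols].le(500).any(axis=1)
cust_loc['Condition'] = np.where(within_500, 'In', 'Out')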

Related

pandas.DataFrame.convert_dtypes increasing memory usage

Question to discuss and understand a bit more about pandas.DataFrame.convert_dtypes.
I have this DF imported from a SAS table:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 857613 entries, 0 to 857612
Data columns (total 27 columns):
 #   Column           Non-Null Count   Dtype
---  ------           --------------   -----
 0   cd_unco_tab      857613 non-null  object
 1   cd_ref_cnv       856389 non-null  object
 2   cd_cli           849637 non-null  object
 3   nm_prd           857613 non-null  object
 4   nm_ctgr_cpr      857613 non-null  object
 5   ts_cpr           857229 non-null  datetime64[ns]
 6   ts_cnfc          857613 non-null  datetime64[ns]
 7   ts_incl          857613 non-null  datetime64[ns]
 8   vl_cmss_rec      857613 non-null  float64
 9   qt_prd           857613 non-null  float64
 10  pc_cmss_rec      857242 non-null  float64
 11  nm_loja          857242 non-null  object
 12  vl_brto_cpr      857242 non-null  float64
 13  vl_cpr           857242 non-null  float64
 14  qt_dvlc          857613 non-null  float64
 15  cd_in_evt_espl   857613 non-null  float64
 16  cd_mm_aa_ref     840959 non-null  object
 17  nr_est_ctbc_evt  857613 non-null  float64
 18  nr_est_cnfc_pcr  18963 non-null   float64
 19  cd_tran_pcr      0 non-null       object
 20  ts_est           18963 non-null   datetime64[ns]
 21  tx_est_tran      18963 non-null   object
 22  vl_tran          18963 non-null   float64
 23  cd_pcr           0 non-null       float64
 24  vl_cbac_cli      653563 non-null  float64
 25  pc_cbac_cli      653563 non-null  float64
 26  cd_vndr          18963 non-null   float64
dtypes: datetime64[ns](4), float64(14), object(9)
memory usage: 176.7+ MB
Basically, the DF is composed of datetime64, float64 and object dtypes, none of which are memory efficient (as far as I know).
I read a bit about DataFrame.convert_dtypes to optimize memory usage; this is the result:
dfcompras = dfcompras.convert_dtypes(infer_objects=True, convert_string=True, convert_integer=True, convert_boolean=True, convert_floating=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 857613 entries, 0 to 857612
Data columns (total 27 columns):
 #   Column           Non-Null Count   Dtype
---  ------           --------------   -----
 0   cd_unco_tab      857613 non-null  string
 1   cd_ref_cnv       856389 non-null  string
 2   cd_cli           849637 non-null  string
 3   nm_prd           857613 non-null  string
 4   nm_ctgr_cpr      857613 non-null  string
 5   ts_cpr           857229 non-null  datetime64[ns]
 6   ts_cnfc          857613 non-null  datetime64[ns]
 7   ts_incl          857613 non-null  datetime64[ns]
 8   vl_cmss_rec      857613 non-null  Float64
 9   qt_prd           857613 non-null  Int64
 10  pc_cmss_rec      857242 non-null  Float64
 11  nm_loja          857242 non-null  string
 12  vl_brto_cpr      857242 non-null  Float64
 13  vl_cpr           857242 non-null  Float64
 14  qt_dvlc          857613 non-null  Int64
 15  cd_in_evt_espl   857613 non-null  Int64
 16  cd_mm_aa_ref     840959 non-null  string
 17  nr_est_ctbc_evt  857613 non-null  Int64
 18  nr_est_cnfc_pcr  18963 non-null   Int64
 19  cd_tran_pcr      0 non-null       Int64
 20  ts_est           18963 non-null   datetime64[ns]
 21  tx_est_tran      18963 non-null   string
 22  vl_tran          18963 non-null   Float64
 23  cd_pcr           0 non-null       Int64
 24  vl_cbac_cli      653563 non-null  Float64
 25  pc_cbac_cli      653563 non-null  Float64
 26  cd_vndr          18963 non-null   Int64
dtypes: Float64(7), Int64(8), datetime64[ns](4), string(8)
memory usage: 188.9 MB
Most columns were changed from object to string and from float64 to Int64/Float64, so memory usage should have gone down, but as we can see, the memory usage increased!
Any guess?
After doing some analysis, it seems there is an additional memory overhead when using the new Int64/Float64 nullable dtypes. The Int64/Float64 dtypes take approximately 9 bytes to store a single value, while int64/float64 take 8 bytes.
Here is a small example to demonstrate this:
pd.DataFrame({'col': range(10)}).astype('float64').memory_usage()
Index    128
col       80    # 8 bytes per item * 10 items
dtype: int64

pd.DataFrame({'col': range(10)}).astype('Float64').memory_usage()
Index    128
col       90    # 9 bytes per item * 10 items
dtype: int64
Now, coming back to your example: after executing convert_dtypes, around 15 columns were converted from float64 to Int64/Float64 dtypes. Let's calculate the amount of extra bytes required to store the data with the new types. The formula is fairly simple: n_columns * n_rows * overhead_in_bytes
>>> extra_bytes = 15 * 857613 * 1
>>> extra_mega_bytes = extra_bytes / 1024 ** 2
>>> extra_mega_bytes
12.2682523727417
It turns out extra_mega_bytes is around 12.26 MB, which is approximately the same as the difference between the memory usage of your new and old dataframes.
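If you want to verify this overhead on your own data, a minimal sketch (with a small stand-in DataFrame) is to compare memory_usage(deep=True) before and after convert_dtypes:

import pandas as pd

df = pd.DataFrame({'a': range(1000), 'b': [1.5] * 1000})  # stand-in data

before = df.memory_usage(deep=True)
after = df.convert_dtypes().memory_usage(deep=True)

# Positive differences show columns that grew after conversion
print((after - before).sort_values(ascending=False))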
Some details about the new nullable integer dtype:
Int64/Float64 (notice the capital first letter) are some of the new nullable dtypes, first introduced in pandas version >= 0.24. At a high level, they allow you to use pd.NA instead of np.nan to represent missing values, and the implication of this is better understood with the following example:
s = pd.Series([1, 2, np.nan])
print(s)
0 1.0
1 2.0
2 NaN
dtype: float64
Let's say you have a series s. When you check the dtype, you'll see that pandas automatically cast it to float64 because of the presence of null values. This is not problematic in most cases, but if you have a column which acts as an identifier, the automatic conversion to float is undesirable. To prevent this, pandas introduced these new nullable integer types.
s = pd.Series([1, 2, np.nan], dtype='Int64')
print(s)
0 1
1 2
2 <NA>
dtype: Int64
Some details on string dtype
As of now there isn't much of a performance or memory difference when using the new string dtype, but this may change in the near future. See this quote from the pandas docs:
Currently, the performance of object dtype arrays of strings and
StringArray are about the same. We expect future enhancements to
significantly increase the performance and lower the memory overhead
of StringArray.
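For what it's worth, a quick sketch (hypothetical data) to compare the two yourself:

import pandas as pd

s_obj = pd.Series(['a', 'bb', 'ccc'] * 1000, dtype='object')
s_str = s_obj.astype('string')

# deep=True counts the actual string payloads, not just the object pointers
print(s_obj.memory_usage(deep=True))
print(s_str.memory_usage(deep=True))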

How to add secondary x-axes with plotly boxplot?

I have a dataframe with the columns shown below, and I created a faceted boxplot with plotly.express using the code shown below.
df.columns
>>> Index(['crops', 'category', 'sand', 'clay', 'soil_text_3', 'org_mat', 'org_mat_characterisations', 'pH', 'pH_characterisation', 'ca', 'ca_characterisation', 'N_ppm', 'N_ppm_characterisation',
'N_dose', 'residual_coef', 'fev'],
dtype='object')
import plotly.express as px
import plotly.io as pio
pio.renderers.default = 'browser'
fig = px.box(data_frame=df,
             x='N_ppm', y='N_dose',
             color='pH_characterisation',
             points=False,
             facet_row='soil_text_3',
             facet_col='org_mat_characterisations')
fig.show()
My question is whether it is possible to have a second x-axis below the primary one showing 'N_ppm_characterisation', so that the numeric values and, below them, the categorical values are displayed at the same time.
I also provide the dataframe's info() output with the current dtypes, in case it is necessary to perform any changes.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 302016 entries, 0 to 302015
Data columns (total 16 columns):
 #   Column                     Non-Null Count   Dtype
---  ------                     --------------   -----
 0   crops                      302016 non-null  object
 1   category                   302016 non-null  object
 2   sand                       302016 non-null  int64
 3   clay                       302016 non-null  int64
 4   soil_text_3                302016 non-null  object
 5   org_mat                    302016 non-null  float64
 6   org_mat_characterisations  302016 non-null  object
 7   pH                         302016 non-null  float64
 8   pH_characterisation        302016 non-null  object
 9   ca                         302016 non-null  float64
 10  ca_characterisation        302016 non-null  object
 11  N_ppm                      302016 non-null  int64
 12  N_ppm_characterisation     302016 non-null  object
 13  N_dose                     302016 non-null  float64
 14  residual_coef              302016 non-null  float64
 15  fev                        302016 non-null  float64
dtypes: float64(6), int64(3), object(7)
memory usage: 36.9+ MB
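One possible workaround (a sketch with hypothetical stand-in data, not from the original thread) is to combine the numeric value and its category into a single tick label, since '<br>' forces a line break in plotly tick text:

import pandas as pd
import plotly.express as px

# Hypothetical miniature dataset standing in for df
df = pd.DataFrame({
    'N_ppm': [10, 10, 25, 25],
    'N_ppm_characterisation': ['low', 'low', 'high', 'high'],
    'N_dose': [1.2, 1.5, 0.8, 0.9],
    'pH_characterisation': ['acid', 'basic', 'acid', 'basic'],
})

# Show the numeric value with its category underneath on each tick
df['N_ppm_label'] = df['N_ppm'].astype(str) + '<br>' + df['N_ppm_characterisation']

fig = px.box(df, x='N_ppm_label', y='N_dose',
             color='pH_characterisation', points=False)
fig.show()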

copy output from describe and info commands to different dataframes python

I am reading a csv file as a dataframe in Python. Then I use the two commands below to get more information about those files.
Is there a way to copy output of these two commands into separate data frames?
data.describe(include='all')
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53940 entries, 0 to 53939
Data columns (total 10 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   carat    53940 non-null  float64
 1   cut      53940 non-null  object
 2   color    53940 non-null  object
 3   clarity  53940 non-null  object
 4   depth    53940 non-null  float64
 5   table    53940 non-null  float64
 6   price    53940 non-null  int64
 7   x        53940 non-null  float64
 8   y        53940 non-null  float64
 9   z        53940 non-null  float64
dtypes: float64(6), int64(1), object(3)
memory usage: 4.1+ MB
Regarding df.describe(): since its return value is a DataFrame itself, you can either assign it to a new dataframe directly or save it to csv, as below:
des = df.describe()   # describe() already returns a DataFrame, so no wrapping is needed
or
df.describe().to_csv('describe_output.csv')   # the filename here is just a placeholder
Regarding df.info(): its return value is NoneType (the summary is printed to stdout), which means it cannot be saved directly. You can check some alternative solutions here:
Is there a way to export pandas dataframe info -- df.info() into an excel file?
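One approach along those lines (a minimal sketch, using a small stand-in DataFrame) is to pass a text buffer to info() and then work with the captured string:

import io
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})  # stand-in data

buf = io.StringIO()
df.info(buf=buf)                   # info() writes its summary to the buffer instead of stdout
info_lines = buf.getvalue().splitlines()
info_df = pd.DataFrame({'info': info_lines})
print(info_df)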

DataFrame.info() differs from Series.describe()

I have a problem using Pandas.
When I execute autos.info() it returns:
RangeIndex: 371528 entries, 0 to 371527
Data columns (total 20 columns):
 #   Column               Non-Null Count   Dtype
---  ------               --------------   -----
 0   dateCrawled          371528 non-null  object
 1   name                 371528 non-null  object
 2   seller               371528 non-null  object
 3   offerType            371528 non-null  object
 4   price                371528 non-null  int64
 5   abtest               371528 non-null  object
 6   vehicleType          333659 non-null  object
 7   yearOfRegistration   371528 non-null  int64
 8   gearbox              351319 non-null  object
 9   powerPS              371528 non-null  int64
 10  model                351044 non-null  object
 11  kilometer            371528 non-null  int64
 12  monthOfRegistration  371528 non-null  int64
 13  fuelType             338142 non-null  object
 14  brand                371528 non-null  object
 15  notRepairedDamage    299468 non-null  object
 16  dateCreated          371528 non-null  object
 17  nrOfPictures         371528 non-null  int64
 18  postalCode           371528 non-null  int64
 19  lastSeen             371528 non-null  object
dtypes: int64(7), object(13)
memory usage: 56.7+ MB
But when I execute autos["price"].describe() it returns:
count    3.715280e+05
mean     1.729514e+04
std      3.587954e+06
min      0.000000e+00
25%      1.150000e+03
50%      2.950000e+03
75%      7.200000e+03
max      2.147484e+09
Name: price, dtype: float64
I don't understand why there is this dtype incongruence for the price column (int64 in info() but float64 in describe()).
Any suggestions?
The return value of Series.describe() is a Series with the descriptive statistics. The dtype you see in the Series is not the dtype of the original column but the dtype of the statistics - which is float.
The name of the result is price because that is set as the name of the Series autos["price"].
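A minimal illustration of this (hypothetical data):

import pandas as pd

s = pd.Series([1, 2, 3], dtype='int64')
print(s.dtype)             # int64
print(s.describe().dtype)  # float64, because mean, std and the quantiles are floats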
If I control the number of display digits, will I get the data I want?
pd.set_option('display.float_format', lambda x: '%.5f' % x)   # changes how floats are displayed, not how they are stored
df['X'].describe().apply("{0:.5f}".format)                    # formats each statistic as a 5-decimal string

Python - Input contains NaN, infinity or a value too large for dtype('float64')

I am new to Python. I am trying to use sklearn.cluster.
Here is my code:
from sklearn.cluster import MiniBatchKMeans
kmeans = MiniBatchKMeans(n_clusters=2)
kmeans.fit(df)
But I get the following error:
50 and not np.isfinite(X).all()):
51 raise ValueError("Input contains NaN, infinity"
---> 52 " or a value too large for %r." % X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64')
I checked and there are no NaN or infinity values, so there is only one option left. However, my data's info tells me that all variables are float64, so I don't understand where the problem comes from.
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 362358 entries, 135 to 4747145
Data columns (total 8 columns):
User         362358 non-null float64
Hour         362352 non-null float64
Minute       362352 non-null float64
Day          362352 non-null float64
Month        362352 non-null float64
Year         362352 non-null float64
Latitude     362352 non-null float64
Longitude    362352 non-null float64
dtypes: float64(8)
memory usage: 24.9 MB
Thanks a lot,
By looking at your df.info(), it appears that there are 6 more non-null User values than there are for any other column. This indicates that you have 6 nulls in each of the other columns, and that is the reason for the error.
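To confirm and handle the missing values, a minimal sketch (using a small stand-in DataFrame rather than the original data):

import numpy as np
import pandas as pd

# Stand-in for the question's DataFrame, with a missing value
df = pd.DataFrame({'User': [1.0, 2.0, 3.0],
                   'Hour': [10.0, np.nan, 14.0]})

print(df.isna().sum())               # count missing values per column

df_clean = df.dropna()               # drop the rows that contain NaN ...
df_filled = df.fillna(df.median())   # ... or fill them, e.g. with the column median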
I think that fit() accepts only "array-like, shape = [n_samples, n_features]", not pandas dataframes. So try passing the values of the dataframe into it:
kmeans = MiniBatchKMeans(n_clusters=2)
kmeans.fit(df.values)
Or reshape them so the function runs correctly. Hope that helps.
As noted above, the 6 nulls in each column other than User are what trigger the error.
So you can slice your data down to the rows you want with iloc, for example:
df = pd.read_csv(location1, encoding = "ISO-8859-1").iloc[2:20]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18 entries, 2 to 19
Data columns (total 6 columns):
zip_code     18 non-null int64
latitude     18 non-null float64
longitude    18 non-null float64
city         18 non-null object
state        18 non-null object
county       18 non-null object
dtypes: float64(2), int64(1), object(3)
