I am reading a CSV file into a DataFrame in Python. I then use the two commands below to get more information about the data.
Is there a way to copy the output of these two commands into separate DataFrames?
data.describe(include='all')
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53940 entries, 0 to 53939
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 carat 53940 non-null float64
1 cut 53940 non-null object
2 color 53940 non-null object
3 clarity 53940 non-null object
4 depth 53940 non-null float64
5 table 53940 non-null float64
6 price 53940 non-null int64
7 x 53940 non-null float64
8 y 53940 non-null float64
9 z 53940 non-null float64
dtypes: float64(6), int64(1), object(3)
memory usage: 4.1+ MB
Regarding df.describe(): its result is itself a DataFrame, so you can either keep it as a new DataFrame or save it to CSV as below:
des = pd.DataFrame(df.describe())
or
df.describe().to_csv('describe.csv')  # hypothetical file name; without a path, to_csv() only returns the CSV text
Regarding df.info(): it prints its report and returns None, which means the result cannot be saved directly. You can check some alternative solutions here:
Is there a way to export pandas dataframe info -- df.info() into an excel file?
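As a rough sketch of one such alternative: info() accepts a buf argument, so its text can be captured in a string buffer, and a small DataFrame with similar facts can be assembled from count() and dtypes. The sample data and the names describe_df, info_text and info_df are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical sample data standing in for the CSV from the question.
df = pd.DataFrame({"carat": [0.23, 0.21, None], "cut": ["Ideal", "Premium", "Good"]})

# describe() already returns a DataFrame, so it can be kept as-is.
describe_df = df.describe(include="all")

# info() prints and returns None; passing buf= captures the text instead.
buffer = io.StringIO()
df.info(buf=buffer)
info_text = buffer.getvalue()

# A DataFrame holding roughly the same facts that info() reports per column.
info_df = pd.DataFrame({
    "non_null_count": df.count(),
    "dtype": df.dtypes.astype(str),
})
print(info_df)
```

The captured info_text is plain text, so it can be written to a file, while info_df can be exported like any other DataFrame.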
Question to discuss and understand a bit more about pandas.DataFrame.convert_dtypes.
I have this DF imported from a SAS table:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 857613 entries, 0 to 857612
Data columns (total 27 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 cd_unco_tab 857613 non-null object
1 cd_ref_cnv 856389 non-null object
2 cd_cli 849637 non-null object
3 nm_prd 857613 non-null object
4 nm_ctgr_cpr 857613 non-null object
5 ts_cpr 857229 non-null datetime64[ns]
6 ts_cnfc 857613 non-null datetime64[ns]
7 ts_incl 857613 non-null datetime64[ns]
8 vl_cmss_rec 857613 non-null float64
9 qt_prd 857613 non-null float64
10 pc_cmss_rec 857242 non-null float64
11 nm_loja 857242 non-null object
12 vl_brto_cpr 857242 non-null float64
13 vl_cpr 857242 non-null float64
14 qt_dvlc 857613 non-null float64
15 cd_in_evt_espl 857613 non-null float64
16 cd_mm_aa_ref 840959 non-null object
17 nr_est_ctbc_evt 857613 non-null float64
18 nr_est_cnfc_pcr 18963 non-null float64
19 cd_tran_pcr 0 non-null object
20 ts_est 18963 non-null datetime64[ns]
21 tx_est_tran 18963 non-null object
22 vl_tran 18963 non-null float64
23 cd_pcr 0 non-null float64
24 vl_cbac_cli 653563 non-null float64
25 pc_cbac_cli 653563 non-null float64
26 cd_vndr 18963 non-null float64
dtypes: datetime64[ns](4), float64(14), object(9)
memory usage: 176.7+ MB
Basically, the DF is composed of datetime64, float64 and object types. None of these is memory efficient (as far as I know).
I read a bit about DataFrame.convert_dtypes as a way to optimize memory usage; this is the result:
dfcompras = dfcompras.convert_dtypes(infer_objects=True, convert_string=True, convert_integer=True, convert_boolean=True, convert_floating=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 857613 entries, 0 to 857612
Data columns (total 27 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 cd_unco_tab 857613 non-null string
1 cd_ref_cnv 856389 non-null string
2 cd_cli 849637 non-null string
3 nm_prd 857613 non-null string
4 nm_ctgr_cpr 857613 non-null string
5 ts_cpr 857229 non-null datetime64[ns]
6 ts_cnfc 857613 non-null datetime64[ns]
7 ts_incl 857613 non-null datetime64[ns]
8 vl_cmss_rec 857613 non-null Float64
9 qt_prd 857613 non-null Int64
10 pc_cmss_rec 857242 non-null Float64
11 nm_loja 857242 non-null string
12 vl_brto_cpr 857242 non-null Float64
13 vl_cpr 857242 non-null Float64
14 qt_dvlc 857613 non-null Int64
15 cd_in_evt_espl 857613 non-null Int64
16 cd_mm_aa_ref 840959 non-null string
17 nr_est_ctbc_evt 857613 non-null Int64
18 nr_est_cnfc_pcr 18963 non-null Int64
19 cd_tran_pcr 0 non-null Int64
20 ts_est 18963 non-null datetime64[ns]
21 tx_est_tran 18963 non-null string
22 vl_tran 18963 non-null Float64
23 cd_pcr 0 non-null Int64
24 vl_cbac_cli 653563 non-null Float64
25 pc_cbac_cli 653563 non-null Float64
26 cd_vndr 18963 non-null Int64
dtypes: Float64(7), Int64(8), datetime64[ns](4), string(8)
memory usage: 188.9 MB
Most columns were changed from object to string and from float64 to Int64/Float64, so I expected memory usage to drop, but as we can see, it increased!
Any guesses?
After some analysis, it seems there is additional memory overhead when using the new nullable Int64/Float64 dtypes. An Int64/Float64 value takes approximately 9 bytes, while an int64/float64 value takes 8 bytes.
Here is a small example to demonstrate this:
pd.DataFrame({'col': range(10)}).astype('float64').memory_usage()
Index 128
col 80 # 8 byte per item * 10 items
dtype: int64
pd.DataFrame({'col': range(10)}).astype('Float64').memory_usage()
Index 128
col 90 # 9 byte per item * 10 items
dtype: int64
Now, coming back to your example. After executing convert_dtypes, 15 columns were converted to the Int64/Float64 dtypes. Now let's calculate the number of extra bytes required to store the data with the new types. The formula is fairly simple: n_columns * n_rows * overhead_in_bytes
>>> extra_bytes = 15 * 857613 * 1
>>> extra_mega_bytes = extra_bytes / 1024 ** 2
>>> extra_mega_bytes
12.2682523727417
It turns out extra_mega_bytes is around 12.27 MB, which is approximately the same as the difference between the memory usage of your new and old dataframes (188.9 MB vs. 176.7 MB).
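The one-byte overhead can also be measured directly. The sketch below assumes the extra byte comes from the boolean mask that the nullable arrays keep next to the values; the variable names and the row count are arbitrary.

```python
import pandas as pd

n_rows = 1_000
plain = pd.Series(range(n_rows)).astype("float64")
nullable = plain.astype("Float64")

# index=False leaves out the index so only the column data is counted.
plain_bytes = plain.memory_usage(index=False)
nullable_bytes = nullable.memory_usage(index=False)
overhead_per_value = (nullable_bytes - plain_bytes) / n_rows

# Same estimate as above: 15 converted columns, 857613 rows.
extra_mega_bytes = 15 * 857613 * overhead_per_value / 1024 ** 2
print(plain_bytes, nullable_bytes, overhead_per_value, extra_mega_bytes)
```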
Some details about new nullable integer datatype:
Int64/Float64 (notice the capital first letter) are among the nullable dtypes that pandas has added: nullable integers first appeared in pandas 0.24, and since pandas 1.0 they use pd.NA instead of np.nan to represent missing values. The implication is easiest to see in the following example:
s = pd.Series([1, 2, np.nan])
print(s)
0 1.0
1 2.0
2 NaN
dtype: float64
Say you create a series s as above. When you check the dtype, you see that pandas has automatically cast it to float64 because of the missing value. This is not a problem in most cases, but if the column acts as an identifier, the automatic conversion to float is undesirable. To prevent this, pandas introduced the nullable integer types.
s = pd.Series([1, 2, np.nan], dtype='Int64')
print(s)
0 1
1 2
2 <NA>
dtype: Int64
Some details on string dtype
As of now there isn't much of a performance or memory difference when using the new string type, but this may change in the near future. See this quote from the pandas docs:
Currently, the performance of object dtype arrays of strings and
StringArray are about the same. We expect future enhancements to
significantly increase the performance and lower the memory overhead
of StringArray.
When I read the data from an Excel file, I found that
the column called "TOT_SALES" has data type float64 and all of its values are shown with a % sign.
I want to remove this sign and divide all values by 100. In the Excel file itself, the values in this column are plain numbers, as shown below.
Any help on how to remove the (%) from the values would be appreciated.
df= pd.read_excel("QVI_transaction_data_1.xlsx")
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 264836 entries, 0 to 264835
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 DATE 264836 non-null datetime64[ns]
1 STORE_NBR 264836 non-null int64
2 LYLTY_CARD_NBR 264836 non-null int64
3 TXN_ID 264836 non-null int64
4 PROD_NBR 264836 non-null int64
5 PROD_NAME 264836 non-null object
6 PROD_QTY 264836 non-null int64
7 TOT_SALES 264836 non-null float64
dtypes: datetime64[ns](1), float64(1), int64(5), object(1)
memory usage: 16.2+ MB
df.head()
This is the result that appears in the DataFrame:
TOT_SALES
1180.00%
740.00%
420.00%
1080.00%
660.00%
These are the values in the Excel file, without the (%) sign:
TOT_SALES
11.80
7.40
4.20
10.80
6.60
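A minimal sketch of the two likely cases, using made-up values that mimic the ones shown above (the real file is not available here): if the column is already float64, dividing by 100 is enough; if it holds text with a trailing percent sign, the sign has to be stripped first.

```python
import pandas as pd

# Hypothetical values mimicking the question.
as_numbers = pd.Series([1180.0, 740.0, 420.0], name="TOT_SALES")
as_text = pd.Series(["1180.00%", "740.00%", "420.00%"], name="TOT_SALES")

# Case 1: the column is already float64, so only the scale needs fixing.
scaled_numbers = as_numbers / 100

# Case 2: the column holds strings with a trailing percent sign.
scaled_text = as_text.str.rstrip("%").astype(float) / 100

print(scaled_numbers.tolist(), scaled_text.tolist())
```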
Hello, I am having trouble converting an entire column from object to integer.
I have a data frame and I tried to convert some columns that are detected as object into integer (or float), but none of the answers I have found work for me.
First status
Then I tried to apply the to_numeric method, but it doesn't work.
To numeric method
Then I tried a custom method that you can find here: Pandas: convert dtype 'object' to int
but it doesn't work either: data3['Title'].astype(str).astype(int)
(I cannot attach the image anymore. You have to trust me that it doesn't work.)
I tried to use the inplace argument, but it doesn't seem to be supported by those methods.
I am pretty sure that the answer is dumb, but I cannot find it.
You need to assign the output back:
# omitting astype(str) may also work
data3['Title'] = data3['Title'].astype(str).astype(int)
Or:
data3['Title'] = pd.to_numeric(data3['Title'])
Sample:
data3 = pd.DataFrame({'Title':['15','12','10']})
print (data3)
Title
0 15
1 12
2 10
print (data3.dtypes)
Title object
dtype: object
data3['Title'] = pd.to_numeric(data3['Title'])
print (data3.dtypes)
Title int64
dtype: object
data3['Title'] = data3['Title'].astype(int)
print (data3.dtypes)
Title int32
dtype: object
As python_enthusiast said,
this command works for me too:
data3.Title = data3.Title.str.replace(',', '').astype(float).astype(int)
but this also works fine:
data3.Title = data3.Title.str.replace(',', '').astype(int)
You have to call .str before replace to get rid of the commas, and only then change the type to int/float; otherwise you will get an error.
2 years and 11 months later, but here I go.
It's important to check if your data has any spaces, special characters (like commas, dots, or whatever else) first. If yes, then you need to basically remove those and then convert your string data into float and then into an integer (this is what worked for me for the case where my data was numerical values but with commas, like 4,118,662).
data3.Title = data3.Title.str.replace(',', '').astype(float).astype(int)
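A small sketch of that cleaning step on made-up data. It swaps in pd.to_numeric with errors="coerce", so entries that still cannot be parsed become NaN and can be inspected instead of raising an error; the sample values and variable names are illustrative only.

```python
import pandas as pd

# Hypothetical messy column: thousands separators, stray spaces, a bad entry.
data3 = pd.DataFrame({"Title": ["4,118,662", " 15 ", "12", "n/a"]})

# Remove commas and surrounding whitespace before parsing.
cleaned = data3["Title"].str.replace(",", "", regex=False).str.strip()

# errors="coerce" turns unparseable entries into NaN instead of raising.
numeric = pd.to_numeric(cleaned, errors="coerce")

# Rows that failed to parse, shown with their original text.
failed = data3.loc[numeric.isna(), "Title"]
print(numeric)
print(failed)
```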
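You can also try this code, which worked fine for me (note that factorize assigns an integer code to each distinct value rather than parsing the numbers):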
data3.Title= pd.factorize(data3.Title)[0]
Version that works with Nulls
Older versions of pandas had no missing-value representation for integer columns, but newer versions offer Int64, which uses pd.NA.
So to go from object to int with missing data, you can do this:
df['col'] = df['col'].astype(float)
df['col'] = df['col'].astype('Int64')
By switching to float first, you avoid the "object cannot be converted to an IntegerDtype" error.
Note the capital 'I' in Int64.
More info here https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html
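The two-step conversion can be tried on a small made-up column of numeric strings with one missing entry; the data here is purely illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical object column: numeric strings plus one missing value.
df = pd.DataFrame({"col": pd.Series(["1", "2", np.nan], dtype=object)})

# Step 1: object -> float64 (the missing value stays NaN).
df["col"] = df["col"].astype(float)

# Step 2: float64 -> nullable Int64 (NaN becomes <NA>).
df["col"] = df["col"].astype("Int64")
print(df["col"])
```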
Working with pd.NA
In pandas 1.0 the new pd.NA value was introduced; the goal of pd.NA is to provide a “missing” indicator that can be used consistently across data types (instead of np.nan, None or pd.NaT depending on the data type).
With this in mind, the DataFrame.convert_dtypes() and Series.convert_dtypes() functions were created, which convert to dtypes that support pd.NA. This is currently considered experimental but looks promising.
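A short sketch of convert_dtypes() on a made-up frame: a float column that only holds whole numbers is expected to become Int64, and a column with fractional values Float64, with the gaps shown as <NA>. The column names are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "qty": [1.0, 2.0, np.nan],      # whole-number floats with a gap
    "price": [1.5, 2.25, np.nan],   # genuinely fractional values
})

converted = df.convert_dtypes()
print(converted.dtypes)
print(converted)
```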
I had a dataset like this
dataset.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 79902 entries, 0 to 79901
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Query 79902 non-null object
1 Video Title 79902 non-null object
2 Video ID 79902 non-null object
3 Video Views 79902 non-null object
4 Comment ID 79902 non-null object
5 cleaned_comments 79902 non-null object
dtypes: object(6)
memory usage: 5.5+ MB
Removed the None, NaN entries using
dataset = dataset.replace(to_replace='None', value=np.nan).dropna()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 79868 entries, 0 to 79901
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Query 79868 non-null object
1 Video Title 79868 non-null object
2 Video ID 79868 non-null object
3 Video Views 79868 non-null object
4 Comment ID 79868 non-null object
5 cleaned_comments 79868 non-null object
dtypes: object(6)
memory usage: 6.1+ MB
Notice the reduced number of entries.
But the Video Views were floats, as shown in dataset.head().
Then I used:
dataset['Video Views'] = pd.to_numeric(dataset['Video Views'])
dataset['Video Views'] = dataset['Video Views'].astype(int)
Now,
<class 'pandas.core.frame.DataFrame'>
Int64Index: 79868 entries, 0 to 79901
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Query 79868 non-null object
1 Video Title 79868 non-null object
2 Video ID 79868 non-null object
3 Video Views 79868 non-null int64
4 Comment ID 79868 non-null object
5 cleaned_comments 79868 non-null object
dtypes: int64(1), object(5)
memory usage: 6.1+ MB
I would like to display all information about my data frame, which contains more than 100 columns, with .info() from pandas, but it only prints a summary:
data_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85529 entries, 0 to 85528
Columns: 110 entries, ID to TARGET
dtypes: float64(40), int64(19), object(51)
memory usage: 71.8+ MB
I would like it to display like this:
data_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 12 columns):
datetime 10886 non-null object
season 10886 non-null int64
holiday 10886 non-null int64
workingday 10886 non-null int64
weather 10886 non-null int64
temp 10886 non-null float64
atemp 10886 non-null float64
humidity 10886 non-null int64
windspeed 10886 non-null float64
casual 10886 non-null int64
registered 10886 non-null int64
count 10886 non-null int64
dtypes: float64(3), int64(8), object(1)
memory usage: 1020.6+ KB
The problem seems to be the high number of columns in my data frame. I would like to show every column, including its non-null count.
You can pass the optional arguments verbose=True and show_counts=True (null_counts=True is deprecated since pandas 1.2.0) to the .info() method to output information for all of the columns:
pandas >=1.2.0: data_train.info(verbose=True, show_counts=True)
pandas <1.2.0: data_train.info(verbose=True, null_counts=True)
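To see the difference, here is a sketch with a made-up 110-column frame. It assumes pandas >= 1.2.0 and the default display.max_info_columns setting of 100, above which info() falls back to the short summary; info_text is a hypothetical helper that captures the printed report.

```python
import io
import numpy as np
import pandas as pd

# Hypothetical wide frame with more columns than the default info() limit.
wide = pd.DataFrame(np.zeros((3, 110)), columns=[f"col_{i}" for i in range(110)])

def info_text(frame, **kwargs):
    """Return the text that frame.info() would print."""
    buffer = io.StringIO()
    frame.info(buf=buffer, **kwargs)
    return buffer.getvalue()

summary = info_text(wide)                                # short form
full = info_text(wide, verbose=True, show_counts=True)   # one line per column

print(summary)
print(len(full.splitlines()))
```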
I am new to Python. I am trying to use sklearn.cluster.
Here is my code:
from sklearn.cluster import MiniBatchKMeans
kmeans=MiniBatchKMeans(n_clusters=2)
kmeans.fit(df)
But I get the following error:
50 and not np.isfinite(X).all()):
51 raise ValueError("Input contains NaN, infinity"
---> 52 " or a value too large for %r." % X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64')
I checked that there is no NaN or infinity value, so there is only one option left. However, my data info tells me that all variables are float64, so I don't understand where the problem comes from.
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 362358 entries, 135 to 4747145
Data columns (total 8 columns):
User 362358 non-null float64
Hour 362352 non-null float64
Minute 362352 non-null float64
Day 362352 non-null float64
Month 362352 non-null float64
Year 362352 non-null float64
Latitude 362352 non-null float64
Longitude 362352 non-null float64
dtypes: float64(8)
memory usage: 24.9 MB
Thanks a lot,
Looking at your df.info(), there are 6 more non-null User values than there are values in any other column. This indicates that you have 6 nulls in each of the other columns, and that is the reason for the error.
<class 'pandas.core.frame.DataFrame'>
Int64Index: 362358 entries, 135 to 4747145
Data columns (total 8 columns):
User 362358 non-null float64
Hour 362352 non-null float64
Minute 362352 non-null float64
Day 362352 non-null float64
Month 362352 non-null float64
Year 362352 non-null float64
Latitude 362352 non-null float64
Longitude 362352 non-null float64
dtypes: float64(8)
memory usage: 24.9 MB
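One way to confirm this before fitting is to count the nulls per column and check for non-finite values. The sketch below uses a small made-up frame with the same kind of columns; dropping the offending rows is only one option, chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one NaN and one infinite value.
df = pd.DataFrame({
    "User": [1.0, 2.0, 3.0, 4.0],
    "Hour": [10.0, np.nan, 12.0, 13.0],
    "Latitude": [48.1, 48.2, np.inf, 48.4],
})

# Nulls per column: anything above zero will make fit() raise.
null_counts = df.isna().sum()

# True only if every value is neither NaN nor +/-infinity.
all_finite = bool(np.isfinite(df.to_numpy()).all())

# Treat infinities as missing, then drop incomplete rows.
clean = df.replace([np.inf, -np.inf], np.nan).dropna()
print(null_counts, all_finite, len(clean))
```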
I think that fit() accepts only "array-like, shape = [n_samples, n_features]", not pandas dataframes. So try to pass the values of the dataframe into it as:
kmeans=MiniBatchKMeans(n_clusters=2)
kmeans.fit(df.values)
Or shape them in order to run the function correctly. Hope that helps.
As noted above, the other columns each contain 6 nulls. You can also slice your data down to the rows you need with iloc:
df = pd.read_csv(location1, encoding = "ISO-8859-1").iloc[2:20]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18 entries, 2 to 19
Data columns (total 6 columns):
zip_code 18 non-null int64
latitude 18 non-null float64
longitude 18 non-null float64
city 18 non-null object
state 18 non-null object
county 18 non-null object
dtypes: float64(2), int64(1), object(3)