I am having trouble with an assignment. One column of my DataFrame holds its values as text strings (dtype object). I want to convert it to a numeric type, but every time I get an error saying the string cannot be converted to a float.
I want to try using a regex to convert the string '-1 203.45' into the value -1203.45. How should the code be written in pandas?
I have tried virtually every hint I could find on the forums and none of them work. Please give me a hint.
Here is what I did first.
I read the CSV file in two different ways:
df = pd.read_csv('dane_navision.csv', delimiter=";", decimal= ",",
thousands=" " )
and
df = pd.read_table('dane_navision.csv', delimiter=";", thousands=" ", decimal=',')
I get the following table:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1905 entries, 0 to 1904
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Data księgowania 1905 non-null object
1 Typ zapisu 1905 non-null object
2 Nr zapisu 1905 non-null int64
3 Nr dokumentu 1905 non-null object
4 Nr zapasu 1905 non-null object
5 Opis 1905 non-null object
6 Kod lokalizacji 1905 non-null object
7 Ilość 1905 non-null object
8 Ilość zafakturowana 1905 non-null object
9 MPK - kod 1905 non-null object
10 Nr dok.zewn. 0 non-null float64
11 Kod kategorii zapasu 1826 non-null object
12 Ilość na jednostkę miary 1905 non-null int64
13 Kwota kosztu (rzeczywista) 1905 non-null object
dtypes: float64(1), int64(2), object(11)
memory usage: 208.5+ KB
The column that matters to me is number 13. I want to change it from object to float, but I can't.
I get the following error every time:
ValueError: could not convert string to float: '-1\xa0032.02'
I tried several approaches:
df['column_name'].astype(float) - does not work.
df['column_name'].str.replace(" ", "") - does not work.
Maybe someone has an idea how to strip the space from the string, so that it is easier to convert it to a number.
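The error message hints that the separator is a non-breaking space (\xa0) rather than a regular space, which would explain why replacing " " has no effect. A minimal sketch under that assumption (the sample values below are made up for illustration; only the column name comes from the question):

```python
import pandas as pd

col = "Kwota kosztu (rzeczywista)"

# Hypothetical sample data: '\xa0' is a non-breaking space used as thousands separator.
df = pd.DataFrame({col: ["-1\xa0032.02", "1 203,45", "15,00"]})

cleaned = (
    df[col]
    .str.replace("\xa0", "", regex=False)  # drop non-breaking spaces
    .str.replace(" ", "", regex=False)     # drop regular spaces
    .str.replace(",", ".", regex=False)    # decimal comma -> decimal point
)

# Convert the cleaned strings to float64.
df[col] = pd.to_numeric(cleaned)
print(df[col])
```

pd.to_numeric raises on anything it still cannot parse; passing errors="coerce" would turn such values into NaN instead.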
This is a question to discuss and better understand pandas.DataFrame.convert_dtypes.
I have this DF imported from a SAS table:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 857613 entries, 0 to 857612
Data columns (total 27 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 cd_unco_tab 857613 non-null object
1 cd_ref_cnv 856389 non-null object
2 cd_cli 849637 non-null object
3 nm_prd 857613 non-null object
4 nm_ctgr_cpr 857613 non-null object
5 ts_cpr 857229 non-null datetime64[ns]
6 ts_cnfc 857613 non-null datetime64[ns]
7 ts_incl 857613 non-null datetime64[ns]
8 vl_cmss_rec 857613 non-null float64
9 qt_prd 857613 non-null float64
10 pc_cmss_rec 857242 non-null float64
11 nm_loja 857242 non-null object
12 vl_brto_cpr 857242 non-null float64
13 vl_cpr 857242 non-null float64
14 qt_dvlc 857613 non-null float64
15 cd_in_evt_espl 857613 non-null float64
16 cd_mm_aa_ref 840959 non-null object
17 nr_est_ctbc_evt 857613 non-null float64
18 nr_est_cnfc_pcr 18963 non-null float64
19 cd_tran_pcr 0 non-null object
20 ts_est 18963 non-null datetime64[ns]
21 tx_est_tran 18963 non-null object
22 vl_tran 18963 non-null float64
23 cd_pcr 0 non-null float64
24 vl_cbac_cli 653563 non-null float64
25 pc_cbac_cli 653563 non-null float64
26 cd_vndr 18963 non-null float64
dtypes: datetime64[ns](4), float64(14), object(9)
memory usage: 176.7+ MB
Basically, the DF is composed of datetime64, float64 and object columns, none of which are memory efficient (as far as I know).
I read a bit about DataFrame.convert_dtypes as a way to optimize memory usage. This is the result:
dfcompras = dfcompras.convert_dtypes(infer_objects=True, convert_string=True, convert_integer=True, convert_boolean=True, convert_floating=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 857613 entries, 0 to 857612
Data columns (total 27 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 cd_unco_tab 857613 non-null string
1 cd_ref_cnv 856389 non-null string
2 cd_cli 849637 non-null string
3 nm_prd 857613 non-null string
4 nm_ctgr_cpr 857613 non-null string
5 ts_cpr 857229 non-null datetime64[ns]
6 ts_cnfc 857613 non-null datetime64[ns]
7 ts_incl 857613 non-null datetime64[ns]
8 vl_cmss_rec 857613 non-null Float64
9 qt_prd 857613 non-null Int64
10 pc_cmss_rec 857242 non-null Float64
11 nm_loja 857242 non-null string
12 vl_brto_cpr 857242 non-null Float64
13 vl_cpr 857242 non-null Float64
14 qt_dvlc 857613 non-null Int64
15 cd_in_evt_espl 857613 non-null Int64
16 cd_mm_aa_ref 840959 non-null string
17 nr_est_ctbc_evt 857613 non-null Int64
18 nr_est_cnfc_pcr 18963 non-null Int64
19 cd_tran_pcr 0 non-null Int64
20 ts_est 18963 non-null datetime64[ns]
21 tx_est_tran 18963 non-null string
22 vl_tran 18963 non-null Float64
23 cd_pcr 0 non-null Int64
24 vl_cbac_cli 653563 non-null Float64
25 pc_cbac_cli 653563 non-null Float64
26 cd_vndr 18963 non-null Int64
dtypes: Float64(7), Int64(8), datetime64[ns](4), string(8)
memory usage: 188.9 MB
Most columns were changed from object to string and from float64 to Int64, so I expected memory usage to go down, but as we can see, it went up!
Any idea why?
After some analysis, it seems there is additional memory overhead when using the new nullable Int64/Float64 dtypes. A single value takes approximately 9 bytes with Int64/Float64 (8 bytes for the value plus 1 byte for the missing-value mask), versus 8 bytes with int64/float64.
Here is a small example to demonstrate this:
pd.DataFrame({'col': range(10)}).astype('float64').memory_usage()
Index 128
col 80 # 8 byte per item * 10 items
dtype: int64
pd.DataFrame({'col': range(10)}).astype('Float64').memory_usage()
Index 128
col 90 # 9 byte per item * 10 items
dtype: int64
Now, coming back to your example. After executing convert_dtypes, around 15 columns were converted from float64 to the Int64/Float64 dtypes. Now let's calculate the number of extra bytes required to store the data with the new types. The formula is fairly simple: n_columns * n_rows * overhead_in_bytes
>>> extra_bytes = 15 * 857613 * 1
>>> extra_mega_bytes = extra_bytes / 1024 ** 2
>>> extra_mega_bytes
12.2682523727417
It turns out extra_mega_bytes is around 12.27 MB, which is approximately the same as the difference between the memory usage of your new and old dataframes (188.9 MB vs. 176.7+ MB).
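The per-value overhead can be checked directly. The following is a small sketch, not a benchmark; the row count is an arbitrary assumption, and memory_usage(index=False) is used so that only the column data is counted:

```python
import pandas as pd

n_rows = 1_000  # arbitrary size, chosen only for illustration

plain = pd.Series(range(n_rows), dtype="float64")
nullable = plain.astype("Float64")

# float64 stores 8 bytes per value.
plain_bytes = plain.memory_usage(index=False)
# Float64 stores the same 8 bytes plus a 1-byte boolean mask entry per value.
nullable_bytes = nullable.memory_usage(index=False)

overhead_per_value = (nullable_bytes - plain_bytes) / n_rows
print(plain_bytes, nullable_bytes, overhead_per_value)
```

Multiplying that per-value overhead by the number of rows and converted columns gives the same estimate as the formula above.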
Some details about the new nullable integer datatype:
Int64/Float64 (notice the capital first letter) are some of the new nullable types. Nullable integer types first appeared in pandas 0.24, and nullable float types followed in pandas 1.2. At a high level, they allow you to use pd.NA instead of np.nan to represent missing values. The implication of this can be better understood from the following example:
s = pd.Series([1, 2, np.nan])
print(s)
0 1.0
1 2.0
2 NaN
dtype: float64
Let's say you have a series s. When you check its dtype, you will see that pandas has automatically cast it to float64 because of the null value. This is not a problem in most cases, but if the column acts as an identifier, the automatic conversion to float is undesirable. To prevent this, pandas introduced the new nullable integer types.
s = pd.Series([1, 2, np.nan], dtype='Int64')
print(s)
0 1
1 2
2 <NA>
dtype: Int64
Some details on the string dtype
As of now there isn't much of a performance or memory difference when using the new string type, but this may change in the near future. See this quote from the pandas docs:
Currently, the performance of object dtype arrays of strings and
StringArray are about the same. We expect future enhancements to
significantly increase the performance and lower the memory overhead
of StringArray.
When I read the data from an Excel file, I found that
the column called "TOT_SALES" has dtype float64, yet all of its values are shown with a % sign.
I want to remove this sign and divide all values by 100. In the Excel file itself, the values in this column are plain numbers, as shown further down.
Any help on how to remove the (%) from the values?
df= pd.read_excel("QVI_transaction_data_1.xlsx")
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 264836 entries, 0 to 264835
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 DATE 264836 non-null datetime64[ns]
1 STORE_NBR 264836 non-null int64
2 LYLTY_CARD_NBR 264836 non-null int64
3 TXN_ID 264836 non-null int64
4 PROD_NBR 264836 non-null int64
5 PROD_NAME 264836 non-null object
6 PROD_QTY 264836 non-null int64
7 TOT_SALES 264836 non-null float64
dtypes: datetime64[ns](1), float64(1), int64(5), object(1)
memory usage: 16.2+ MB
df.head()
This is the result that appears in the DataFrame:
TOT_SALES
1180.00%
740.00%
420.00%
1080.00%
660.00%
These are the values in the Excel file, without the (%) sign:
TOT_SALES
11.80
7.40
4.20
10.80
6.60
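Since the dtype is float64, the % sign cannot be part of the stored values; it may come from how pandas displays floats. One possible cause (an assumption, nothing in the question confirms it) is a display.float_format option such as '{:.2%}'.format set earlier in the session, which multiplies by 100 for display only. A sketch with made-up sample values:

```python
import pandas as pd

# Hypothetical sample mirroring the first rows of TOT_SALES.
df = pd.DataFrame({"TOT_SALES": [11.80, 7.40, 4.20]})

# Assumption: an option like this was set earlier in the session.
pd.set_option("display.float_format", "{:.2%}".format)
shown_as_percent = df.to_string()  # values are rendered as percentages

# Restoring the default brings back the plain display.
pd.reset_option("display.float_format")
shown_plain = df.to_string()

print(shown_as_percent)
print(shown_plain)
```

If this is what happened, the stored values were never changed, so no division by 100 is needed; resetting the option is enough.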
I am trying to convert all the cell values (except the date) to floating-point numbers. I can successfully convert the first 3 columns, but I get an error on the last one.
Here is my code:
df['Market Cap_'+str(coin)] = df['Market Cap_'+str(coin)].str.replace(',','').str.replace('$', '').astype(float)
df['Volume_'+str(coin)] = df['Volume_'+str(coin)].str.replace(',','').str.replace('$', '').astype(float)
df['Open_'+str(coin)] = df['Open_'+str(coin)].str.replace(',','').str.replace('$', '').astype(float)
df['Close_'+str(coin)] = df['Close_'+str(coin)].str.replace(',','').str.replace('$', '').astype(float)
Here is df.info():
<class 'pandas.core.frame.DataFrame'>
Int64Index: 30 entries, 1 to 30
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date_ETHEREUM 30 non-null datetime64[ns]
1 Market Cap_ETHEREUM 30 non-null float64
2 Volume_ETHEREUM 30 non-null float64
3 Open_ETHEREUM 30 non-null float64
4 Close_ETHEREUM 30 non-null object
dtypes: datetime64[ns](1), float64(3), object(1)
memory usage: 1.4+ KB
And here is the error:
AttributeError: Can only use .str accessor with string values!
As you can see, the column type is object (the same as the others were before conversion), yet I'm getting an error on this one.
Note: coin is just a string that is added dynamically from the URL for each particular coin table.
I would appreciate any help or an alternative solution.
You have a $ sign, so the value cannot be parsed as a float. Remove it before converting the column to a float type.
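The df.info() output in the question shows three of the four columns already as float64, which suggests the conversion may have been run more than once; calling .str on a column that is already float raises exactly this AttributeError. A sketch of a conversion that is safe to re-run (the helper name to_float and the sample data are hypothetical):

```python
import pandas as pd

coin = "ETHEREUM"  # hypothetical value; in the question it comes from the URL

df = pd.DataFrame({
    f"Open_{coin}": [1234.5, 1300.0],             # already converted to float
    f"Close_{coin}": ["$1,250.10", "$1,310.00"],  # still strings
})

def to_float(series):
    """Convert a price column to float, skipping columns that are already numeric."""
    if pd.api.types.is_numeric_dtype(series):
        return series.astype(float)
    cleaned = (
        series.astype(str)
        .str.replace(",", "", regex=False)  # thousands separator
        .str.replace("$", "", regex=False)  # literal dollar sign, not a regex anchor
    )
    return pd.to_numeric(cleaned, errors="coerce")  # unparseable values become NaN

for name in ("Open", "Close"):
    column = f"{name}_{coin}"
    df[column] = to_float(df[column])

print(df.dtypes)
```

Passing regex=False makes '$' a literal character regardless of the pandas version's default for str.replace.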
So I have two spreadsheets in CSV format that I've been provided with for my master's university course.
Part of the processing of the data involved merging the files, followed by running some reports off the merged content using dates. This I've completed successfully, however...
I'm led to believe the current date format is epoch; for example, the first date on the spreadsheet is 43471.
So, firstly, I ran this code to check what format it was looking at:
pd.read_csv('bookloans_merged.csv')
df.info()
This returned the result
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1958 entries, 0 to 1957
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Number 1958 non-null int64
1 Title 1958 non-null object
2 Author 1854 non-null object
3 Genre 1958 non-null object
4 SubGenre 1958 non-null object
5 Publisher 1845 non-null object
6 member_number 1958 non-null int64
7 date_of_loan 1958 non-null int64
8 date_of_return 1958 non-null int64
dtypes: int64(4), object(5)
memory usage: 137.8+ KB
I then ran the following code:
# parsing date values
df = pd.read_csv('bookloans_merged.csv')
df[['date_of_loan','date_of_return']] = df[['date_of_loan','date_of_return']].apply(pd.to_datetime, format='%Y-%m-%d %H:%M:%S.%f')
df.to_csv('bookloans_merged_dates.csv', index=False)
Running this again:
pd.read_csv('bookloans_merged_dates.csv')
df.info()
I get:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1958 entries, 0 to 1957
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Number 1958 non-null int64
1 Title 1958 non-null object
2 Author 1854 non-null object
3 Genre 1958 non-null object
4 SubGenre 1958 non-null object
5 Publisher 1845 non-null object
6 member_number 1958 non-null int64
7 date_of_loan 1958 non-null datetime64[ns]
8 date_of_return 1958 non-null datetime64[ns]
dtypes: datetime64[ns](2), int64(2), object(5)
memory usage: 137.8+ KB
So I can see that date_of_loan and date_of_return are now datetime64.
The trouble is, all the dates now show as 1970-01-01 00:00:00.000043471
How do I get to the 01/03/2019 format, please?
Thanks
David.
I managed to figure this out with a little help. Here is the answer:
from datetime import datetime
excel_date = 43139
d_time = datetime.fromordinal(datetime(1900, 1, 1).toordinal() + excel_date - 2)
t_time = d_time.timetuple()
print(d_time)
print(t_time)
Here is how I applied that premise in my program:
import pandas as pd
df1 = pd.DataFrame(data_frame, columns=['Title','Author','date_of_loan'])
# Excel serial dates count days from 1899-12-30, which matches the "- 2" adjustment above
df1['date_of_loan'] = pd.to_datetime(df1['date_of_loan'], unit='D', origin='1899-12-30')
df1 = df1.sort_values('date_of_loan', ascending=True)
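For reference, a self-contained sketch of the same conversion, assuming the integers are Excel serial dates (the two sample values are 43471 from the question and 43139 from the snippet above). The origin '1899-12-30' is the conventional base for Excel serials and corresponds to the "- 2" adjustment; the dd/mm/yyyy output pattern is an assumption about the format wanted.

```python
import pandas as pd

# Without a unit, an integer is read as nanoseconds since 1970-01-01,
# which is why every date in the question landed in 1970.
misread = pd.to_datetime(43471)

# Hypothetical sample column of Excel serial dates.
loans = pd.DataFrame({"date_of_loan": [43471, 43139]})

# Interpret the integers as days counted from the Excel base date.
loans["date_of_loan"] = pd.to_datetime(
    loans["date_of_loan"], unit="D", origin="1899-12-30"
)

# Render as text in day/month/year form.
loans["date_text"] = loans["date_of_loan"].dt.strftime("%d/%m/%Y")
print(loans)
```

The column stays datetime64 for sorting and reporting; the strftime step is only needed when a specific text format has to be written out.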