Changing string to integer in Pandas - python

The data set has "deaths" as object and I need to convert it to an integer. I tried the approach from another thread, but it doesn't seem to work.
**Input:**
data.info()
**Output:**
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1270 entries, 0 to 1271
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 year 1270 non-null object
1 leading_cause 1270 non-null object
2 sex 1270 non-null object
3 race_ethnicity 1270 non-null object
4 deaths 1270 non-null object
dtypes: object(5)
memory usage: 59.5+ KB
**Input:**
df = pd.DataFrame({'deaths': ['50', '30', '28']})
print(df)
**Output:**
deaths
0 50
1 30
2 28
**Input:**
print(pd.to_numeric(df.deaths, errors='coerce'))
**Output:**
0 50
1 30
2 28
Name: deaths, dtype: int64
**Input:**
df.deaths = pd.to_numeric(df.deaths, errors='coerce').astype('Int64')
print(df)
**Output:**
deaths
0 50
1 30
2 28
**Input:**
data.info()
**Output:**
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1270 entries, 0 to 1271
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 year 1270 non-null object
1 leading_cause 1270 non-null object
2 sex 1270 non-null object
3 race_ethnicity 1270 non-null object
4 deaths 1270 non-null object
dtypes: object(5)
memory usage: 59.5+ KB

Note first that in the transcript above the conversion was applied to the toy df, not to data, which is why data.info() still shows deaths as object — the same pd.to_numeric conversion needs to be run on data itself.
Beyond that: if you have nulls (np.nan) in the column, it will not convert to a plain int type, because NaN is a float with no integer representation.
You need to deal with the nulls first.
1. Either replace them with an int value:
df.deaths = df.deaths.fillna(0)
df.deaths = df.deaths.astype(int)
2. Or drop the null values:
df = df[df.deaths.notna()]
df.deaths = df.deaths.astype(int)
3. Or (preferred) learn to live with them:
# make your other function accept null values, e.g. via the nullable 'Int64' dtype shown above
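Putting the answer together end to end — a minimal sketch with toy data standing in for the asker's data frame (the 'n/a' value is invented here to show how errors='coerce' behaves):

```python
import pandas as pd

# Toy stand-in for the asker's DataFrame: 'deaths' arrives as strings,
# including one unparseable value.
data = pd.DataFrame({'deaths': ['50', '30', 'n/a', '28']})

# Coerce bad strings to NaN, then use the nullable Int64 dtype so the
# column stays integer-typed even with missing values present.
data['deaths'] = pd.to_numeric(data['deaths'], errors='coerce').astype('Int64')

print(data['deaths'].dtype)         # Int64
print(data['deaths'].isna().sum())  # 1
```

The nullable Int64 dtype is what makes option 3 workable: the column is genuinely integer-typed while still holding missing values.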

Related

Sorting alphanumeric Dataframe in Pandas

My plot's y-axis is ordered 1, 11, 12, etc.; I would like it to be 1, 2, 3, 4, ..., 10, 11.
import pandas as pd
import seaborn as sns
from bubble_plot.bubble_plot import bubble_plot  # third-party bubble_plot package

df = pd.read_csv("numerical_subset_cleaned.csv",
                 names=["age", "fnlwgt", "educational-num", "capital-gain", "capital-loss", "hours-per-week"])
sns.set_style("darkgrid")
bubble_plot(df, x='age', y='educational-num', fontsize=16, figsize=(15, 10), normalization_by_all=True)
My df.info():
<class 'pandas.core.frame.DataFrame'>
Index: 32420 entries, 0.0 to string
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 32420 non-null object
1 fnlwgt 32420 non-null object
2 educational-num 32420 non-null object
3 capital-gain 32420 non-null object
4 capital-loss 32420 non-null object
5 hours-per-week 32420 non-null object
dtypes: object(6)
memory usage: 1.7+ MB
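No answer is attached here, but the df.info() output points at the likely cause: every column is dtype object, so axis values sort lexicographically ('1' < '11' < '2'). A hedged sketch of the fix, with toy data standing in for the CSV, converting the columns to numeric before plotting:

```python
import pandas as pd

# Toy stand-in for the CSV: numeric values that were read in as strings
df = pd.DataFrame({'age': ['1', '11', '2'], 'educational-num': ['9', '10', '13']})
print(sorted(df['age']))   # ['1', '11', '2'] -- lexicographic string order

# Convert every column to a numeric dtype; unparseable cells become NaN
df = df.apply(pd.to_numeric, errors='coerce')
print(sorted(df['age']))   # [1, 2, 11] -- true numeric order
```

With the real file, any rows that fail to parse (e.g. the stray "string" index entry visible in df.info()) become NaN and can be inspected or dropped before plotting.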

AttributeError: Can only use .str accessor with string values! python pandas

I am trying to convert all the cell values (except the date) to floating-point numbers. I can successfully convert the first three columns but get an error on the last one.
Here is my code:
# regex=False so '$' is treated literally, not as a regex end-of-string anchor
df['Market Cap_'+str(coin)] = df['Market Cap_'+str(coin)].str.replace(',', '', regex=False).str.replace('$', '', regex=False).astype(float)
df['Volume_'+str(coin)] = df['Volume_'+str(coin)].str.replace(',', '', regex=False).str.replace('$', '', regex=False).astype(float)
df['Open_'+str(coin)] = df['Open_'+str(coin)].str.replace(',', '', regex=False).str.replace('$', '', regex=False).astype(float)
df['Close_'+str(coin)] = df['Close_'+str(coin)].str.replace(',', '', regex=False).str.replace('$', '', regex=False).astype(float)
Here is df.info():
<class 'pandas.core.frame.DataFrame'>
Int64Index: 30 entries, 1 to 30
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date_ETHEREUM 30 non-null datetime64[ns]
1 Market Cap_ETHEREUM 30 non-null float64
2 Volume_ETHEREUM 30 non-null float64
3 Open_ETHEREUM 30 non-null float64
4 Close_ETHEREUM 30 non-null object
dtypes: datetime64[ns](1), float64(3), object(1)
memory usage: 1.4+ KB
And here is the error:
AttributeError: Can only use .str accessor with string values!
As you can see, the column's dtype is object (the same as the others were before conversion), but I'm getting an error only on this one.
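The error message suggests the Close column holds a mix of strings and non-string values — an object column can contain both, and .str then refuses to work. One hedged workaround, sketched with toy data: cast everything to str first, strip the formatting, and let pd.to_numeric do the conversion:

```python
import pandas as pd

# Toy mixed column: some cells are formatted strings, one is already a float
s = pd.Series(['$1,234.5', 250.0, '$99'])   # dtype object

# Calling s.str.replace(...) directly on a column with non-string cells is
# what triggers the .str accessor error, so normalise to str first.
cleaned = pd.to_numeric(
    s.astype(str).str.replace('[$,]', '', regex=True),
    errors='coerce',
)
print(cleaned.tolist())   # [1234.5, 250.0, 99.0]
```

The errors='coerce' fallback turns any cell that still fails to parse into NaN instead of raising.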

python pandas | replacing the date and time string with only time

price | quantity | high time
10.4  | 3        | 2021-11-08 14:26:00-05:00
The dataframe is named ddg, and the datatype of the "high time" column is datetime64[ns, America/New_York].
I want the high time to be only 14:26:00 (getting rid of 2021-11-08 and -05:00), but I got an error when using the code below:
ddg['high_time'] = ddg['high_time'].dt.strftime('%H:%M')
The KeyError is because that's not the right column name — the column is "high time", with a space, not "high_time":
# Your code
>>> ddg['high_time'].dt.strftime('%H:%M')
...
KeyError: 'high_time'
# With right column name
>>> ddg['high time'].dt.strftime('%H:%M')
0 14:26
Name: high time, dtype: object
# My dataframe:
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 price 1 non-null float64
1 quantity 1 non-null int64
2 high time 1 non-null datetime64[ns, America/New_York]
dtypes: datetime64[ns, America/New_York](1), float64(1), int64(1)
memory usage: 152.0 bytes
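A self-contained sketch of the whole fix (toy data standing in for ddg), also including seconds so the output matches the 14:26:00 the asker wanted:

```python
import pandas as pd

ddg = pd.DataFrame({
    'price': [10.4],
    'quantity': [3],
    'high time': pd.to_datetime(['2021-11-08 14:26:00-05:00']),
})
# Match the real column's dtype: datetime64[ns, America/New_York]
ddg['high time'] = ddg['high time'].dt.tz_convert('America/New_York')

# Note the space in the column name, and %H:%M:%S to keep the seconds
ddg['high time'] = ddg['high time'].dt.strftime('%H:%M:%S')
print(ddg['high time'].iloc[0])   # 14:26:00
```

Keep in mind that strftime turns the column into plain strings (dtype object), so do this only for display, after any time-based computation is finished.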

Parse correct datetime using Python and pandas

So I have two spreadsheets in CSV format that I've been provided with for my master's uni course.
Part of the processing of the data involved merging the files, followed by running some reports off the merged content using dates. This I've completed successfully; however...
The current date format, I'm led to believe, is an epoch-style serial; for example, the first date on the spreadsheet is 43471.
So, firstly, I ran this code to check what format it was looking at:
df = pd.read_csv('bookloans_merged.csv')
df.info()
This returned the result:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1958 entries, 0 to 1957
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Number 1958 non-null int64
1 Title 1958 non-null object
2 Author 1854 non-null object
3 Genre 1958 non-null object
4 SubGenre 1958 non-null object
5 Publisher 1845 non-null object
6 member_number 1958 non-null int64
7 date_of_loan 1958 non-null int64
8 date_of_return 1958 non-null int64
dtypes: int64(4), object(5)
memory usage: 137.8+ KB
I then ran the following code:
# parsing date values
df = pd.read_csv('bookloans_merged.csv')
df[['date_of_loan','date_of_return']] = df[['date_of_loan','date_of_return']].apply(pd.to_datetime, format='%Y-%m-%d %H:%M:%S.%f')
df.to_csv('bookloans_merged_dates.csv', index=False)
Running this again:
df = pd.read_csv('bookloans_merged_dates.csv')
df.info()
I get:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1958 entries, 0 to 1957
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Number 1958 non-null int64
1 Title 1958 non-null object
2 Author 1854 non-null object
3 Genre 1958 non-null object
4 SubGenre 1958 non-null object
5 Publisher 1845 non-null object
6 member_number 1958 non-null int64
7 date_of_loan 1958 non-null datetime64[ns]
8 date_of_return 1958 non-null datetime64[ns]
dtypes: datetime64[ns](2), int64(2), object(5)
memory usage: 137.8+ KB
So I can see that date_of_loan and date_of_return are now datetime64.
Trouble is, all the dates now show as 1970-01-01 00:00:00.000043471 — the integer serials were interpreted as nanoseconds since the Unix epoch.
How do I get to a 01/03/2019 format, please?
Thanks,
David.
So I managed to get this figured out, with a little help. The 43471-style values are Excel serial dates (a count of days from Excel's 1900 epoch), not Unix epoch values, so they need a day-based conversion rather than a nanosecond one. First, the principle:
from datetime import datetime
excel_date = 43139
d_time = datetime.fromordinal(datetime(1900, 1, 1).toordinal() + excel_date - 2)
t_time = d_time.timetuple()
print(d_time)
print(t_time)
So here is how I was able to use that premise in my program:
import pandas as pd

df1 = pd.DataFrame(data_frame, columns=['Title', 'Author', 'date_of_loan'])
# 1899-12-30 is the effective Excel epoch once the off-by-two quirk above is accounted for
df1['date_of_loan'] = pd.to_datetime(df1['date_of_loan'], unit='D', origin=pd.Timestamp('1899-12-30'))
df1 = df1.sort_values('date_of_loan', ascending=True)
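As a sanity check of the origin handling — pd.Timestamp('1899-12-30'), rather than 1900-01-01, is what reproduces the "- 2" correction from the fromordinal demo above — the spreadsheet's first serial converts and formats like this:

```python
import pandas as pd

serials = pd.Series([43471])   # first date from the spreadsheet

# unit='D' counts whole days; 1899-12-30 is the effective Excel epoch
dates = pd.to_datetime(serials, unit='D', origin=pd.Timestamp('1899-12-30'))
print(dates.dt.strftime('%d/%m/%Y').iloc[0])   # 06/01/2019
```

strftime with '%d/%m/%Y' gives the dd/mm/yyyy display format the question asked for.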

Joining two data frames that appear to be same type gives error 'ValueError: You are trying to merge on object and int64 columns'

I have two data frames, sessions1 and sessions2 that I would like to join on field 'ga:dimension1'.
sessions1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15775 entries, 0 to 15774
Data columns (total 9 columns):
ga:dimension1 15775 non-null object
ga:date 15775 non-null object
ga:deviceCategory 15775 non-null object
ga:landingPagePath 15775 non-null object
ga:userType 15775 non-null object
ga:operatingSystem 15775 non-null object
ga:operatingSystemVersion 15775 non-null object
ga:sessions 15775 non-null int64
ga:bounces 15775 non-null int64
dtypes: int64(2), object(7)
memory usage: 1.1+ MB
sessions2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15774 entries, 0 to 15773
Data columns (total 9 columns):
ga:dimension1 15774 non-null object
ga:source 15774 non-null object
ga:medium 15774 non-null object
ga:campaign 15774 non-null object
ga:adContent 15774 non-null object
ga:keyword 15774 non-null object
ga:channelGrouping 15774 non-null object
ga:sessions 15774 non-null int64
ga:bounces 15774 non-null int64
dtypes: int64(2), object(7)
memory usage: 1.1+ MB
Looking at the first few rows, they at least look the same:
sessions1.head()
ga:dimension1 ga:date ... ga:sessions ga:bounces
0 1567331564026.evxjzuot 20190901 ... 1 1
1 1567331572999.vtnsczsj 20190901 ... 1 1
2 1567331693070.fkdbmcj6 20190901 ... 1 1
3 1567335919816.ctz12xcl 20190901 ... 1 0
4 1567345181556.b3yowmbh 20190901 ... 1 1
sessions2.head()
ga:dimension1 ga:source ... ga:sessions ga:bounces
0 1567331564026.evxjzuot (direct) ... 1 1
1 1567331572999.vtnsczsj (direct) ... 1 1
2 1567331693070.fkdbmcj6 (direct) ... 1 1
3 1567335919816.ctz12xcl (direct) ... 1 0
4 1567345181556.b3yowmbh (direct) ... 1 1
However, when I try this:
sessions_combined = sessions1.join(sessions2,
                                   on='ga:dimension1',
                                   how='left')
I get an error message:
ValueError: You are trying to merge on object and int64 columns. If
you wish to proceed you should use pd.concat
Why is this and how should I join the two data frames together?
Use merge. DataFrame.join matches the on column of the left frame against the index of the right frame — here sessions2's default integer RangeIndex — which is why an object column ends up compared with int64. DataFrame.merge matches column to column:
sessions_combined = sessions1.merge(sessions2,
                                    on='ga:dimension1',
                                    how='left')
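A minimal sketch of the difference, with toy frames standing in for sessions1 and sessions2: join would compare the object-typed ga:dimension1 column with sessions2's integer index, while merge compares the columns directly:

```python
import pandas as pd

s1 = pd.DataFrame({'ga:dimension1': ['a.1', 'b.2'], 'ga:sessions': [1, 1]})
s2 = pd.DataFrame({'ga:dimension1': ['a.1', 'b.2'], 'ga:source': ['(direct)', 'google']})

# s1.join(s2, on='ga:dimension1') would match the string column against
# s2's int64 RangeIndex, producing the ValueError from the question.
combined = s1.merge(s2, on='ga:dimension1', how='left')
print(combined['ga:source'].tolist())   # ['(direct)', 'google']
```

If sessions2 had been indexed by ga:dimension1 (via set_index), join would also have worked; merge just avoids needing that step.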
