Sorting an alphanumeric DataFrame in pandas - python

The y-axis of my plot is ordered 1, 11, 12, etc. I would like it to be 1, 2, 3, 4, ..., 10, 11.
df = pd.read_csv("numerical_subset_cleaned.csv",
                 names=["age", "fnlwgt", "educational-num", "capital-gain",
                        "capital-loss", "hours-per-week"])
sns.set_style("darkgrid")
bubble_plot(df, x='age', y='educational-num', fontsize=16, figsize=(15, 10),
            normalization_by_all=True)
My df.info():
<class 'pandas.core.frame.DataFrame'>
Index: 32420 entries, 0.0 to string
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 32420 non-null object
1 fnlwgt 32420 non-null object
2 educational-num 32420 non-null object
3 capital-gain 32420 non-null object
4 capital-loss 32420 non-null object
5 hours-per-week 32420 non-null object
dtypes: object(6)
memory usage: 1.7+ MB
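All six columns are object dtype (note the `Index: ... 0.0 to string` line, which suggests a stray header or text row), so the axis values sort lexicographically: "1", "11", "12", "2". A minimal sketch of the usual fix, converting the column to numeric before plotting; `errors="coerce"` is an assumption here, to absorb any non-numeric rows as NaN:

```python
import pandas as pd

# Same symptom on a small made-up sample of "educational-num"
s = pd.Series(["1", "11", "12", "2"], name="educational-num")
print(sorted(s))  # ['1', '11', '12', '2']  (string order)

# Converting to a numeric dtype restores the natural order on the axis
numeric = pd.to_numeric(s, errors="coerce")
print(numeric.sort_values().tolist())  # [1, 2, 11, 12]
```

Applying `pd.to_numeric(..., errors="coerce")` to each column of the real dataframe (and dropping the resulting NaN rows) before plotting should make the y-axis sort numerically.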

Changing string to integer in Pandas

The data set has "deaths" as object and I need to convert it to an integer. I tried the code from another thread and it doesn't seem to work.
Input:
data.info()
Output:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1270 entries, 0 to 1271
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 year 1270 non-null object
1 leading_cause 1270 non-null object
2 sex 1270 non-null object
3 race_ethnicity 1270 non-null object
4 deaths 1270 non-null object
dtypes: object(5)
memory usage: 59.5+ KB
Input:
df = pd.DataFrame({'deaths':['50','30','28']})
print (df)
Output:
  deaths
0     50
1     30
2     28
Input:
print (pd.to_numeric(df.deaths, errors='coerce'))
Output:
0    50
1    30
2    28
Name: deaths, dtype: int64
Input:
df.deaths = pd.to_numeric(df.deaths, errors='coerce').astype('Int64')
print (df)
Output:
  deaths
0     50
1     30
2     28
Input:
data.info()
Output:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1270 entries, 0 to 1271
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 year 1270 non-null object
1 leading_cause 1270 non-null object
2 sex 1270 non-null object
3 race_ethnicity 1270 non-null object
4 deaths 1270 non-null object
dtypes: object(5)
memory usage: 59.5+ KB
If you have nulls (np.NaN) in the column, it will not convert to the plain int type; you need to deal with the nulls first.
1. Either replace them with an int value:
df.deaths = df.deaths.fillna(0)
df.deaths = df.deaths.astype(int)
2. Or drop the null values:
df = df[df.deaths.notna()]
df.deaths = df.deaths.astype(int)
3. Or (preferred) learn to live with them:
# make your other function accept null values
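Putting the answer's pieces together on a small made-up column (with one bad value and one missing value) shows why the nullable Int64 dtype helps when you cannot drop or replace the nulls:

```python
import pandas as pd

df = pd.DataFrame({"deaths": ["50", "30", "bad", None]})

# errors='coerce' turns unparseable strings into NaN, which forces
# a float64 column if we stop here
deaths = pd.to_numeric(df["deaths"], errors="coerce")
print(deaths.dtype)  # float64

# The nullable Int64 dtype keeps integer values alongside missing ones
df["deaths"] = deaths.astype("Int64")
print(df["deaths"].dtype)  # Int64
```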

Python: Pandas df.fillna() function changes all data types to object

I want to fill the null values of each feature in my dataframe, but after filling every feature, all the data types changed to "object".
I have dataframe with data type:
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 umur 7832 non-null float64
1 jenis_kelamin 7840 non-null object
2 pekerjaan 7760 non-null object
3 provinsi 7831 non-null object
4 gaji 7843 non-null float64
5 is_menikah 7917 non-null object
6 is_keturunan 7917 non-null object
7 berat 7861 non-null float64
8 tinggi 7843 non-null float64
9 sampo 7858 non-null object
10 is_merokok 7917 non-null object
11 pendidikan 7847 non-null object
12 stress 7853 non-null float64
I use fillna() to fill the null values of every feature:
# Feature categoric type inputation
df['jenis_kelamin'].fillna(df['jenis_kelamin'].mode()[0], inplace = True)
df['pekerjaan'].fillna(df['pekerjaan'].mode()[0], inplace = True)
df['provinsi'].fillna(df['provinsi'].mode()[0], inplace = True)
df['sampo'].fillna(df['sampo'].mode()[0], inplace = True)
df['pendidikan'].fillna(df['pendidikan'].mode()[0], inplace = True)
# Feature numeric type inputation
df['umur'].fillna(df['umur'].mean, inplace = True)
df['gaji'].fillna(df['gaji'].mean, inplace = True)
df['berat'].fillna(df['berat'].mean, inplace = True)
df['tinggi'].fillna(df['tinggi'].mean, inplace = True)
df['stress'].fillna(df['stress'].mean, inplace = True)
But after that, every feature's data type has changed to object:
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 umur 7917 non-null object
1 jenis_kelamin 7917 non-null object
2 pekerjaan 7917 non-null object
3 provinsi 7917 non-null object
4 gaji 7917 non-null object
5 is_menikah 7917 non-null object
6 is_keturunan 7917 non-null object
7 berat 7917 non-null object
8 tinggi 7917 non-null object
9 sampo 7917 non-null object
10 is_merokok 7917 non-null object
11 pendidikan 7917 non-null object
12 stress 7917 non-null object
I think it could work to convert every feature back with astype(), but is there a more efficient way to fill the null values without changing the data types?
I think you are missing the parentheses on .mean(), so you are filling the series with the method object instead of the actual mean value.
You want, for example:
df['umur'].fillna(df['umur'].mean(), inplace = True)
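A tiny demonstration with made-up values: with the parentheses, the computed mean is filled in and the float64 dtype is preserved.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"umur": [20.0, np.nan, 40.0]})

# df['umur'].mean is the method object; df['umur'].mean() is the value
filled = df["umur"].fillna(df["umur"].mean())
print(filled.tolist())  # [20.0, 30.0, 40.0]
print(filled.dtype)     # float64
```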

How to add secondary x-axes with plotly boxplot?

I have a dataframe with the following columns as shown below. I created a boxplot with plotly.express with the shown code using facets and I have embedded a sample of the plot produced by the code.
df.columns
>>> Index(['crops', 'category', 'sand', 'clay', 'soil_text_3', 'org_mat', 'org_mat_characterisations', 'pH', 'pH_characterisation', 'ca', 'ca_characterisation', 'N_ppm', 'N_ppm_characterisation',
'N_dose', 'residual_coef', 'fev'],
dtype='object')
import plotly.express as px
import plotly.io as pio

pio.renderers.default = 'browser'
fig = px.box(data_frame=df,
             x='N_ppm', y='N_dose',
             color='pH_characterisation',
             points=False,
             facet_row='soil_text_3',
             facet_col='org_mat_characterisations')
fig.show()
My question is whether it is possible to have a second x-axis below the primary one with 'N_ppm_characterisation', to show both the numeric values and, below them, the categorical values at the same time.
I also provide the dataframe's info with the current dtypes in case any changes are needed.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 302016 entries, 0 to 302015
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 crops 302016 non-null object
1 category 302016 non-null object
2 sand 302016 non-null int64
3 clay 302016 non-null int64
4 soil_text_3 302016 non-null object
5 org_mat 302016 non-null float64
6 org_mat_characterisations 302016 non-null object
7 pH 302016 non-null float64
8 pH_characterisation 302016 non-null object
9 ca 302016 non-null float64
10 ca_characterisation 302016 non-null object
11 N_ppm 302016 non-null int64
12 N_ppm_characterisation 302016 non-null object
13 N_dose 302016 non-null float64
14 residual_coef 302016 non-null float64
15 fev 302016 non-null float64
dtypes: float64(6), int64(3), object(7)
memory usage: 36.9+ MB

Adding new column to dataframe and change it to list

Based on the given dataset, I must create a function that takes the original dataframe as input and returns the same dataframe with one additional column called subr_faved_by_as_list, which holds the same information as subr_faved_by but as a Python list instead of a string.
That's my code:
from urllib import request
import pandas as pd

module_url = "https://raw.githubusercontent.com/luisespinosaanke/cmt309-portfolio/master/data_portfolio_21.csv"
module_name = module_url.split('/')[-1]
print(f'Fetching {module_url}')
with request.urlopen(module_url) as f, open(module_name, 'w') as outf:
    a = f.read()
    outf.write(a.decode('utf-8'))
df = pd.read_csv('data_portfolio_21.csv')
# this fills empty cells with empty strings
df = df.fillna('')
df.info()
Data columns (total 17 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 author 19940 non-null object
1 posted_at 19940 non-null object
2 num_comments 19940 non-null int64
3 score 19940 non-null int64
4 selftext 19940 non-null object
5 subr_created_at 19940 non-null object
6 subr_description 19940 non-null object
7 subr_faved_by 19940 non-null object
8 subr_numb_members 19940 non-null int64
9 subr_numb_posts 19940 non-null int64
10 subreddit 19940 non-null object
11 title 19940 non-null object
12 total_awards_received 19940 non-null int64
13 upvote_ratio 19940 non-null float64
14 user_num_posts 19940 non-null int64
15 user_registered_at 19940 non-null object
16 user_upvote_ratio 19940 non-null float64
dtypes: float64(2), int64(6), object(9)
And this is the function:
def transform_faves(df):
    df['subr_faved_by_as_list'] = df['subr_faved_by'].str.split(' ', n=1, expand=True)
    return df

df = transform_faves(df)
I am getting the following error:
list indices must be integers or slices, not str
If you use expand=True it returns a DataFrame, so you need to omit it; it is also possible to omit the space string passed to split:
def transform_faves(df):
    df['subr_faved_by_as_list'] = df['subr_faved_by'].str.split(n=1)
    return df

df = transform_faves(df)
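On a made-up two-row frame, the corrected call stores an actual Python list per row (n=1 in the question stops after the first space; omit it, as below, to split on every space):

```python
import pandas as pd

df = pd.DataFrame({"subr_faved_by": ["alice bob carol", "dave"]})

# Without expand=True, str.split returns one Python list per row
df["subr_faved_by_as_list"] = df["subr_faved_by"].str.split()
print(df["subr_faved_by_as_list"].tolist())
# [['alice', 'bob', 'carol'], ['dave']]
```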

Joining two data frames that appear to be same type gives error 'ValueError: You are trying to merge on object and int64 columns'

I have two data frames, sessions1 and sessions2 that I would like to join on field 'ga:dimension1'.
sessions1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15775 entries, 0 to 15774
Data columns (total 9 columns):
ga:dimension1 15775 non-null object
ga:date 15775 non-null object
ga:deviceCategory 15775 non-null object
ga:landingPagePath 15775 non-null object
ga:userType 15775 non-null object
ga:operatingSystem 15775 non-null object
ga:operatingSystemVersion 15775 non-null object
ga:sessions 15775 non-null int64
ga:bounces 15775 non-null int64
dtypes: int64(2), object(7)
memory usage: 1.1+ MB
sessions2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15774 entries, 0 to 15773
Data columns (total 9 columns):
ga:dimension1 15774 non-null object
ga:source 15774 non-null object
ga:medium 15774 non-null object
ga:campaign 15774 non-null object
ga:adContent 15774 non-null object
ga:keyword 15774 non-null object
ga:channelGrouping 15774 non-null object
ga:sessions 15774 non-null int64
ga:bounces 15774 non-null int64
dtypes: int64(2), object(7)
memory usage: 1.1+ MB
Looking at the first few rows they look the same at least:
sessions1.head()
ga:dimension1 ga:date ... ga:sessions ga:bounces
0 1567331564026.evxjzuot 20190901 ... 1 1
1 1567331572999.vtnsczsj 20190901 ... 1 1
2 1567331693070.fkdbmcj6 20190901 ... 1 1
3 1567335919816.ctz12xcl 20190901 ... 1 0
4 1567345181556.b3yowmbh 20190901 ... 1 1
sessions2.head()
ga:dimension1 ga:source ... ga:sessions ga:bounces
0 1567331564026.evxjzuot (direct) ... 1 1
1 1567331572999.vtnsczsj (direct) ... 1 1
2 1567331693070.fkdbmcj6 (direct) ... 1 1
3 1567335919816.ctz12xcl (direct) ... 1 0
4 1567345181556.b3yowmbh (direct) ... 1 1
However, when I try this:
sessions_combined = sessions1.join(sessions2,
                                   on='ga:dimension1',
                                   how='left')
I get an error message:
ValueError: You are trying to merge on object and int64 columns. If
you wish to proceed you should use pd.concat
Why is this and how should I join the two data frames together?
Use merge. DataFrame.join aligns the other frame on its index, so on='ga:dimension1' compares your object column against sessions2's integer RangeIndex, which raises the object-vs-int64 error; merge matches the column in both frames:
sessions_combined = sessions1.merge(sessions2,
                                    on='ga:dimension1',
                                    how='left')
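A toy reproduction with made-up session IDs, showing the column-to-column match succeeding:

```python
import pandas as pd

s1 = pd.DataFrame({"ga:dimension1": ["a.x", "b.y"],
                   "ga:sessions": [1, 1]})
s2 = pd.DataFrame({"ga:dimension1": ["a.x", "b.y"],
                   "ga:source": ["(direct)", "google"]})

# merge matches the named column in both frames, not the right index
combined = s1.merge(s2, on="ga:dimension1", how="left")
print(combined["ga:source"].tolist())  # ['(direct)', 'google']
```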
