Background: I am trying to learn from a notebook written for the Kaggle House Price Prediction dataset.
I am trying to use a Pipeline to transform the numerical and categorical columns of a DataFrame. It fails on the names of my categorical variables, which are stored as a list in the variable categ_cols_names. The error says that those categorical columns are not unique in the DataFrame, and I'm not sure what that means.
categ_cols_names = ['MSZoning','Street','LotShape','LandContour','Utilities','LotConfig','LandSlope','Neighborhood','Condition1','Condition2','BldgType','HouseStyle','OverallQual','OverallCond','YearBuilt','YearRemodAdd','RoofStyle','RoofMatl','Exterior1st','Exterior2nd','MasVnrType','ExterQual','ExterCond','Foundation','BsmtQual','BsmtCond','BsmtExposure','BsmtFinType1','BsmtFinType2','Heating','HeatingQC','CentralAir','Electrical','BsmtFullBath','BsmtHalfBath','FullBath','HalfBath','BedroomAbvGr','KitchenAbvGr','KitchenQual','Functional','Fireplaces','GarageType','GarageYrBlt','GarageFinish','GarageCars','GarageQual','GarageCond','PavedDrive','MoSold','YrSold','SaleType','SaleCondition','OverallQual','GarageCars','FullBath','YearBuilt']
Below is my code:
# Get numerical columns names
num_cols_names = X_train.columns[X_train.dtypes != object].to_list()
# Numerical columns with missing values
num_nan_cols = X_train[num_cols_names].columns[X_train[num_cols_names].isna().sum() > 0]
# Assign np.nan type to NaN values in numerical features
# in order to ensure detectability in posterior methods
X_train[num_nan_cols] = X_train[num_nan_cols].fillna(value = np.nan, axis = 1)
# Define pipeline for imputation of the numerical features
num_pipeline = Pipeline(steps = [
('Simple Imputer', SimpleImputer(strategy = 'median')),
('Robust Scaler', RobustScaler()),
('Power Transformer', PowerTransformer())
]
)
# Get categorical columns names
categ_cols_names = X_train.columns[X_train.dtypes == object].to_list()
# Categorical columns with missing values
categ_nan_cols = X_train[categ_cols_names].columns[X_train[categ_cols_names].isna().sum() > 0]
# Assign np.nan type to NaN values in categorical features
# in order to ensure detectability in posterior methods
X_train[categ_nan_cols] = X_train[categ_nan_cols].fillna(value = np.nan, axis = 1)
# Define pipeline for imputation and encoding of the categorical features
categ_pipeline = Pipeline(steps = [
('Categorical Imputer', SimpleImputer(strategy = 'most_frequent')),
('One Hot Encoder', OneHotEncoder(drop = 'first'))
])
ct = ColumnTransformer([
('Categorical Pipeline', categ_pipeline, categ_cols_names),
('Numerical Pipeline', num_pipeline, num_cols_names)],
remainder = 'passthrough',
sparse_threshold = 0,
n_jobs = -1)
pipe = Pipeline(steps = [('Column Transformer', ct)])
pipe.fit_transform(X_train)
The ValueError occurs on the .fit_transform() line:
Here is a sample of my X_train:
{'MSZoning': {0: 'RL', 1: 'RL', 2: 'RL', 3: 'RL', 4: 'RL'},
'Street': {0: 'Pave', 1: 'Pave', 2: 'Pave', 3: 'Pave', 4: 'Pave'},
'LotShape': {0: 'Reg', 1: 'Reg', 2: 'IR1', 3: 'IR1', 4: 'IR1'},
'LandContour': {0: 'Lvl', 1: 'Lvl', 2: 'Lvl', 3: 'Lvl', 4: 'Lvl'},
'Utilities': {0: 'AllPub',
1: 'AllPub',
2: 'AllPub',
3: 'AllPub',
4: 'AllPub'},
'LotConfig': {0: 'Inside', 1: 'FR2', 2: 'Inside', 3: 'Corner', 4: 'FR2'},
'LandSlope': {0: 'Gtl', 1: 'Gtl', 2: 'Gtl', 3: 'Gtl', 4: 'Gtl'},
'Neighborhood': {0: 'CollgCr',
1: 'Veenker',
2: 'CollgCr',
3: 'Crawfor',
4: 'NoRidge'},
'Condition1': {0: 'Norm', 1: 'Feedr', 2: 'Norm', 3: 'Norm', 4: 'Norm'},
'Condition2': {0: 'Norm', 1: 'Norm', 2: 'Norm', 3: 'Norm', 4: 'Norm'},
'BldgType': {0: '1Fam', 1: '1Fam', 2: '1Fam', 3: '1Fam', 4: '1Fam'},
'HouseStyle': {0: '2Story',
1: '1Story',
2: '2Story',
3: '2Story',
4: '2Story'},
'OverallQual': {0: '7', 1: '6', 2: '7', 3: '7', 4: '8'},
'OverallCond': {0: '5', 1: '8', 2: '5', 3: '5', 4: '5'},
'YearBuilt': {0: '2003', 1: '1976', 2: '2001', 3: '1915', 4: '2000'},
'YearRemodAdd': {0: '2003', 1: '1976', 2: '2002', 3: '1970', 4: '2000'},
'RoofStyle': {0: 'Gable', 1: 'Gable', 2: 'Gable', 3: 'Gable', 4: 'Gable'},
'RoofMatl': {0: 'CompShg',
1: 'CompShg',
2: 'CompShg',
3: 'CompShg',
4: 'CompShg'},
'Exterior1st': {0: 'VinylSd',
1: 'MetalSd',
2: 'VinylSd',
3: 'Wd Sdng',
4: 'VinylSd'},
'Exterior2nd': {0: 'VinylSd',
1: 'MetalSd',
2: 'VinylSd',
3: 'Wd Shng',
4: 'VinylSd'},
'MasVnrType': {0: 'BrkFace',
1: 'None',
2: 'BrkFace',
3: 'None',
4: 'BrkFace'},
'ExterQual': {0: 'Gd', 1: 'TA', 2: 'Gd', 3: 'TA', 4: 'Gd'},
'ExterCond': {0: 'TA', 1: 'TA', 2: 'TA', 3: 'TA', 4: 'TA'},
'Foundation': {0: 'PConc', 1: 'CBlock', 2: 'PConc', 3: 'BrkTil', 4: 'PConc'},
'BsmtQual': {0: 'Gd', 1: 'Gd', 2: 'Gd', 3: 'TA', 4: 'Gd'},
'BsmtCond': {0: 'TA', 1: 'TA', 2: 'TA', 3: 'Gd', 4: 'TA'},
'BsmtExposure': {0: 'No', 1: 'Gd', 2: 'Mn', 3: 'No', 4: 'Av'},
'BsmtFinType1': {0: 'GLQ', 1: 'ALQ', 2: 'GLQ', 3: 'ALQ', 4: 'GLQ'},
'BsmtFinType2': {0: 'Unf', 1: 'Unf', 2: 'Unf', 3: 'Unf', 4: 'Unf'},
'Heating': {0: 'GasA', 1: 'GasA', 2: 'GasA', 3: 'GasA', 4: 'GasA'},
'HeatingQC': {0: 'Ex', 1: 'Ex', 2: 'Ex', 3: 'Gd', 4: 'Ex'},
'CentralAir': {0: 'Y', 1: 'Y', 2: 'Y', 3: 'Y', 4: 'Y'},
'Electrical': {0: 'SBrkr', 1: 'SBrkr', 2: 'SBrkr', 3: 'SBrkr', 4: 'SBrkr'},
'BsmtFullBath': {0: '1', 1: '0', 2: '1', 3: '1', 4: '1'},
'BsmtHalfBath': {0: '0', 1: '1', 2: '0', 3: '0', 4: '0'},
'FullBath': {0: '2', 1: '2', 2: '2', 3: '1', 4: '2'},
'HalfBath': {0: '1', 1: '0', 2: '1', 3: '0', 4: '1'},
'BedroomAbvGr': {0: '3', 1: '3', 2: '3', 3: '3', 4: '4'},
'KitchenAbvGr': {0: '1', 1: '1', 2: '1', 3: '1', 4: '1'},
'KitchenQual': {0: 'Gd', 1: 'TA', 2: 'Gd', 3: 'Gd', 4: 'Gd'},
'Functional': {0: 'Typ', 1: 'Typ', 2: 'Typ', 3: 'Typ', 4: 'Typ'},
'Fireplaces': {0: '0', 1: '1', 2: '1', 3: '1', 4: '1'},
'GarageType': {0: 'Attchd',
1: 'Attchd',
2: 'Attchd',
3: 'Detchd',
4: 'Attchd'},
'GarageYrBlt': {0: '2003.0',
1: '1976.0',
2: '2001.0',
3: '1998.0',
4: '2000.0'},
'GarageFinish': {0: 'RFn', 1: 'RFn', 2: 'RFn', 3: 'Unf', 4: 'RFn'},
'GarageCars': {0: '2', 1: '2', 2: '2', 3: '3', 4: '3'},
'GarageQual': {0: 'TA', 1: 'TA', 2: 'TA', 3: 'TA', 4: 'TA'},
'GarageCond': {0: 'TA', 1: 'TA', 2: 'TA', 3: 'TA', 4: 'TA'},
'PavedDrive': {0: 'Y', 1: 'Y', 2: 'Y', 3: 'Y', 4: 'Y'},
'MoSold': {0: '2', 1: '5', 2: '9', 3: '2', 4: '12'},
'YrSold': {0: '2008', 1: '2007', 2: '2008', 3: '2006', 4: '2008'},
'SaleType': {0: 'WD', 1: 'WD', 2: 'WD', 3: 'WD', 4: 'WD'},
'SaleCondition': {0: 'Normal',
1: 'Normal',
2: 'Normal',
3: 'Abnorml',
4: 'Normal'},
'GrLivArea': {0: 1710, 1: 1262, 2: 1786, 3: 1717, 4: 2198},
'GarageArea': {0: 548, 1: 460, 2: 608, 3: 642, 4: 836},
'TotalBsmtSF': {0: 856, 1: 1262, 2: 920, 3: 756, 4: 1145},
'1stFlrSF': {0: 856, 1: 1262, 2: 920, 3: 961, 4: 1145},
'TotRmsAbvGrd': {0: 8, 1: 6, 2: 6, 3: 7, 4: 9}}
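A possible way to investigate, sketched under assumptions: as far as I know, scikit-learn raises the "not unique in dataframe" error when a selected column name matches more than one column of the DataFrame. The list shown at the top also repeats 'OverallQual', 'GarageCars', 'FullBath' and 'YearBuilt' at its end. The tiny frame `X` below is a hypothetical stand-in for X_train, built only to show how duplicates could be detected and dropped with pandas.

```python
import pandas as pd

# Hypothetical stand-in for X_train with one duplicated column name.
X = pd.DataFrame(
    [["RL", "7", 1710, "7"]],
    columns=["MSZoning", "OverallQual", "GrLivArea", "OverallQual"],
)

# Column names that occur more than once in the frame.
dupes = X.columns[X.columns.duplicated()].unique().tolist()
print(dupes)

# Keep only the first occurrence of every column name.
X_unique = X.loc[:, ~X.columns.duplicated()]

# De-duplicate a plain list of names while preserving its order.
cols = ["MSZoning", "OverallQual", "OverallQual"]
unique_cols = list(dict.fromkeys(cols))
```

If `dupes` is non-empty for the real X_train, dropping the repeated columns before building the ColumnTransformer may be enough to make the selection unambiguous.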
How do you convert the number 1.425887B to 1.4B in a Plotly choropleth?
data2022 = dict(type = 'choropleth',
colorscale = 'agsunset',
reversescale = True,
locations = df['Country/Territory'],
locationmode = 'country names',
z = df['2022 Population'],
text = df['CCA3' ],
marker = dict(line = dict(color = 'rgb(12, 12, 12)', width=1)),
colorbar = {'title': 'Population'})
layout2022 = dict(title = '<b>World Population 2022</b>',
geo = dict(showframe = True,
showland = True, landcolor = 'rgb(198, 197, 198)',
showlakes = True, lakecolor = 'rgb(85, 173, 240)',
showrivers = True, rivercolor = 'rgb(173, 216, 230)',
showocean = True, oceancolor = 'rgb(173, 216, 230)',
projection = {'type': 'natural earth'}))
choromap2022 = go.Figure(data=[data2022], layout=layout2022)
choromap2022.update_geos(lataxis_showgrid = True, lonaxis_showgrid = True)
choromap2022.update_layout(height = 600,
title_x = 0.5,
title_font_color = 'red',
title_font_family = 'Times New Roman',
title_font_size = 30,
margin=dict(t=80, r=50, l=50))
iplot(choromap2022)
This is an image of the result I got. I want to display the population of China as 1.4B instead of 1.425887B.
I tried looking through the Plotly documentation but could not find anything.
This is the output of df.head().to_dict()
{'CCA3': {0: 'AFG', 1: 'ALB', 2: 'DZA', 3: 'ASM', 4: 'AND'},
'Country/Territory': {0: 'Afghanistan',
1: 'Albania',
2: 'Algeria',
3: 'American Samoa',
4: 'Andorra'},
'Capital': {0: 'Kabul',
1: 'Tirana',
2: 'Algiers',
3: 'Pago Pago',
4: 'Andorra la Vella'},
'Continent': {0: 'Asia', 1: 'Europe', 2: 'Africa', 3: 'Oceania', 4: 'Europe'},
'2022 Population': {0: 41128771, 1: 2842321, 2: 44903225, 3: 44273, 4: 79824},
'2020 Population': {0: 38972230, 1: 2866849, 2: 43451666, 3: 46189, 4: 77700},
'2015 Population': {0: 33753499, 1: 2882481, 2: 39543154, 3: 51368, 4: 71746},
'2010 Population': {0: 28189672, 1: 2913399, 2: 35856344, 3: 54849, 4: 71519},
'2000 Population': {0: 19542982, 1: 3182021, 2: 30774621, 3: 58230, 4: 66097},
'1990 Population': {0: 10694796, 1: 3295066, 2: 25518074, 3: 47818, 4: 53569},
'1980 Population': {0: 12486631, 1: 2941651, 2: 18739378, 3: 32886, 4: 35611},
'1970 Population': {0: 10752971, 1: 2324731, 2: 13795915, 3: 27075, 4: 19860},
'Area (km²)': {0: 652230, 1: 28748, 2: 2381741, 3: 199, 4: 468},
'Density (per km²)': {0: 63.0587,
1: 98.8702,
2: 18.8531,
3: 222.4774,
4: 170.5641},
'Growth Rate': {0: 1.0257, 1: 0.9957, 2: 1.0164, 3: 0.9831, 4: 1.01},
'World Population Percentage': {0: 0.52, 1: 0.04, 2: 0.56, 3: 0.0, 4: 0.0}}
This is trickier than it appears. Plotly uses d3-format, but I believe it adds its own metric abbreviations on top, which is why numbers larger than 1000 are displayed by default in a form like 1.425887B.
My original idea was to round to two significant digits in the hovertemplate with something like:
data2022 = dict(..., hovertemplate = "%{z:.2r}<br>%{text}<extra></extra>")
However, this removes the default metric abbreviation and causes the entire long-form number to be displayed: the population of China would show up as 1400000000 instead of 1.4B.
So one possible workaround is to create a new column in your DataFrame called "2022 Population Text" and format the number with a custom function that rounds and abbreviates it (credit goes to @rtaft for their function, which does exactly that). Then you can pass this column to customdata and display customdata in your hovertemplate (instead of z).
import pandas as pd
import plotly.graph_objects as go
data = {'CCA3': {0: 'AFG', 1: 'ALB', 2: 'DZA', 3: 'ASM', 4: 'AND'},
'Country/Territory': {0: 'Afghanistan',
1: 'Albania',
2: 'Algeria',
3: 'American Samoa',
4: 'Andorra'},
'Capital': {0: 'Kabul',
1: 'Tirana',
2: 'Algiers',
3: 'Pago Pago',
4: 'Andorra la Vella'},
'Continent': {0: 'Asia', 1: 'Europe', 2: 'Africa', 3: 'Oceania', 4: 'Europe'},
'2022 Population': {0: 1412000000, 1: 2842321, 2: 44903225, 3: 44273, 4: 79824},
'2020 Population': {0: 38972230, 1: 2866849, 2: 43451666, 3: 46189, 4: 77700},
'2015 Population': {0: 33753499, 1: 2882481, 2: 39543154, 3: 51368, 4: 71746},
'2010 Population': {0: 28189672, 1: 2913399, 2: 35856344, 3: 54849, 4: 71519},
'2000 Population': {0: 19542982, 1: 3182021, 2: 30774621, 3: 58230, 4: 66097},
'1990 Population': {0: 10694796, 1: 3295066, 2: 25518074, 3: 47818, 4: 53569},
'1980 Population': {0: 12486631, 1: 2941651, 2: 18739378, 3: 32886, 4: 35611},
'1970 Population': {0: 10752971, 1: 2324731, 2: 13795915, 3: 27075, 4: 19860},
'Area (km²)': {0: 652230, 1: 28748, 2: 2381741, 3: 199, 4: 468},
'Density (per km²)': {0: 63.0587,
1: 98.8702,
2: 18.8531,
3: 222.4774,
4: 170.5641},
'Growth Rate': {0: 1.0257, 1: 0.9957, 2: 1.0164, 3: 0.9831, 4: 1.01},
'World Population Percentage': {0: 0.52, 1: 0.04, 2: 0.56, 3: 0.0, 4: 0.0}
}
## rounds a number to the specified precision, and adds metrics abbreviations
## i.e. 14230000000 --> 14B
## reference: https://stackoverflow.com/a/45846841/5327068
def human_format(num):
num = float('{:.2g}'.format(num))
magnitude = 0
while abs(num) >= 1000:
magnitude += 1
num /= 1000.0
return '{}{}'.format('{:f}'.format(num).rstrip('0').rstrip('.'), ['', 'K', 'M', 'B', 'T'][magnitude])
df = pd.DataFrame(data=data)
df['2022 Population Text'] = df['2022 Population'].apply(human_format)
data2022 = dict(type = 'choropleth',
colorscale = 'agsunset',
reversescale = True,
locations = df['Country/Territory'],
locationmode = 'country names',
z = df['2022 Population'],
text = df['CCA3'],
customdata = df['2022 Population Text'],
marker = dict(line = dict(color = 'rgb(12, 12, 12)', width=1)),
colorbar = {'title': 'Population'},
hovertemplate = "%{customdata}<br>%{text}<extra></extra>"
)
layout2022 = dict(title = '<b>World Population 2022</b>',
geo = dict(showframe = True,
showland = True, landcolor = 'rgb(198, 197, 198)',
showlakes = True, lakecolor = 'rgb(85, 173, 240)',
showrivers = True, rivercolor = 'rgb(173, 216, 230)',
showocean = True, oceancolor = 'rgb(173, 216, 230)',
projection = {'type': 'natural earth'}))
choromap2022 = go.Figure(data=[data2022], layout=layout2022)
choromap2022.update_geos(lataxis_showgrid = True, lonaxis_showgrid = True)
choromap2022.update_layout(height = 600,
title_x = 0.5,
title_font_color = 'red',
title_font_family = 'Times New Roman',
title_font_size = 30,
margin=dict(t=80, r=50, l=50),
)
choromap2022.show()
Note: Since China wasn't included in your sample data, I changed the population of AFG to 1412000000 to test that the hovertemplate would display it as '1.4B'.
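To make the rounding step easier to check in isolation, here is a small variant of the formatting helper, written as a sketch rather than a drop-in replacement. The name `abbreviate` and the `sig_digits` parameter are my own additions for illustration; the logic follows the same idea as human_format above (round to a number of significant digits, then divide by 1000 until the value fits a suffix).

```python
def abbreviate(num, sig_digits=2):
    """Round to `sig_digits` significant digits and append K/M/B/T."""
    suffixes = ["", "K", "M", "B", "T"]
    # Round to the requested number of significant digits first.
    num = float(f"{num:.{sig_digits}g}")
    magnitude = 0
    # Stop at the last suffix so very large numbers do not overflow the list.
    while abs(num) >= 1000 and magnitude < len(suffixes) - 1:
        magnitude += 1
        num /= 1000.0
    # Drop trailing zeros and a dangling decimal point.
    text = f"{num:f}".rstrip("0").rstrip(".")
    return text + suffixes[magnitude]


print(abbreviate(1425887337))      # China's population from the question
print(abbreviate(1425887337, 3))   # one more significant digit
print(abbreviate(44273))
```

With the default of two significant digits, the first call gives '1.4B', which is the string that would then be passed through customdata.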
In my table, built from a dataset, I need to highlight in bold the rows that contain "All" in the columns Building, Floor or Team:
My code :
headerColor = 'darkgrey'
rowEvenColor = 'lightgrey'
rowOddColor = 'white'
fig_occ_fl_team = go.Figure(data=[go.Table(
header=dict(
values=list(final_table_occ_fl_team.columns),
line_color='black',
fill_color=headerColor,
align=['left','left','left','left','left','left','left','left','left','left'],
font=dict(color='black', size=9)
),
cells=dict(
values=[final_table_occ_fl_team['Building'],
final_table_occ_fl_team['Floor'],
final_table_occ_fl_team['Team'],
final_table_occ_fl_team['Number of Desks'],
final_table_occ_fl_team['Avg Occu (#)'],
final_table_occ_fl_team['Avg Occu (%)'],
final_table_occ_fl_team['Avg Occu 10-4 (#)'],
final_table_occ_fl_team['Avg Occu 10-4 (%)'],
final_table_occ_fl_team['Max Occu (#)'],
final_table_occ_fl_team['Max Occu (%)'],
],
line_color='black',
# 2-D list of colors for alternating rows
fill_color = [[rowOddColor,rowEvenColor]*56],
align = ['left','left','left','left','left','left','left','left','left','left'],
font = dict(color = 'black', size = 7)
))
])
fig_occ_fl_team.show()
Dataset head :
data = {'Building': {0: 'All',
1: '1LWP',
2: '1LWP',
3: '1LWP',
4: '1LWP',
5: '1LWP',
6: '1LWP',
7: '1LWP',
8: '1LWP',
9: '1LWP'},
'Floor': {0: 'All',
1: 'All',
2: '2nd',
3: '2nd',
4: '2nd',
5: '2nd',
6: '2nd',
7: '2nd',
8: '2nd',
9: '2nd'},
'Team': {0: 'All',
1: 'All',
2: 'All',
3: 'Anderson/Money',
4: 'Banking & Treasury',
5: 'Charities',
6: 'Client Management',
7: 'Compliance, Legal & Risk',
8: 'DFM',
9: 'Emmerson'},
'Number of Desks': {0: 2297,
1: 2008,
2: 381,
3: 22,
4: 8,
5: 19,
6: 9,
7: 41,
8: 20,
9: 33},
'Avg Occu (#)': {0: 1261,
1: 1126,
2: 195,
3: 14,
4: 4,
5: 9,
6: 5,
7: 21,
8: 13,
9: 18},
'Avg Occu (%)': {0: '55%',
1: '56%',
2: '51%',
3: '64%',
4: '50%',
5: '48%',
6: '56%',
7: '52%',
8: '65%',
9: '55%'},
'Avg Occu 10-4 (#)': {0: 851,
1: 759,
2: 132,
3: 8,
4: 3,
5: 6,
6: 3,
7: 14,
8: 9,
9: 12},
'Avg Occu 10-4 (%)': {0: '37%',
1: '38%',
2: '35%',
3: '37%',
4: '38%',
5: '32%',
6: '34%',
7: '35%',
8: '45%',
9: '37%'},
'Max Occu (#)': {0: 1901,
1: 1680,
2: 274,
3: 22,
4: 6,
5: 13,
6: 7,
7: 27,
8: 17,
9: 25},
'Max Occu (%)': {0: '83%',
1: '84%',
2: '72%',
3: '100%',
4: '75%',
5: '69%',
6: '78%',
7: '66%',
8: '85%',
9: '76%'}}
You can add the bold style to your dataframe prior to creating the table as follows:
import pandas as pd
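df = pd.DataFrame.from_dict(data)
# Rows where Building, Floor and Team are all equal to "All"
indices = df.index[(df[["Building","Floor","Team"]] == "All").all(axis=1)]
# Wrap every cell of those rows in <b>...</b> so that Plotly renders it in bold
for i in indices:
    for j in range(len(df.columns)):
        df.iloc[i,j] = "<b>{}</b>".format(df.iloc[i,j])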
You can now create the table. I increased the cell font size to 12:
import plotly.graph_objects as go
headerColor = 'darkgrey'
rowEvenColor = 'lightgrey'
rowOddColor = 'white'
fig_occ_fl_team = go.Figure(data=[go.Table(
header=dict(
values=list(df.columns),
line_color='black',
fill_color=headerColor,
align=['left','left','left','left','left','left','left','left','left','left'],
font=dict(color='black', size=9)
),
cells=dict(
values=[df['Building'],
df['Floor'],
df['Team'],
df['Number of Desks'],
df['Avg Occu (#)'],
df['Avg Occu (%)'],
df['Avg Occu 10-4 (#)'],
df['Avg Occu 10-4 (%)'],
df['Max Occu (#)'],
df['Max Occu (%)'],
],
line_color='black',
# 2-D list of colors for alternating rows
fill_color = [[rowOddColor,rowEvenColor]*56],
align = ['left','left','left','left','left','left','left','left','left','left'],
font = dict(color = 'black', size = 12)
))
])
fig_occ_fl_team.show()
Output:
You will notice that the first and fourth rows are bold. If you want to keep the original DataFrame unchanged, apply the formatting to a copy instead, e.g. df2 = df.copy().
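As a variation, the bold markup can be applied without an explicit double loop and without touching the original DataFrame. This is only a sketch under a few assumptions: `styled` is a hypothetical name, the frame is cut down to three columns for brevity, and the mask uses any(axis=1) to match the "Building, Floor or Team" wording of the question, whereas the loop above uses all(axis=1).

```python
import pandas as pd

# Cut-down version of the dataset head from the question.
df = pd.DataFrame({
    "Building": ["All", "1LWP", "1LWP", "1LWP"],
    "Floor": ["All", "All", "2nd", "2nd"],
    "Team": ["All", "All", "All", "Anderson/Money"],
    "Number of Desks": [2297, 2008, 381, 22],
})

# True for rows where at least one of the three columns equals "All".
mask = (df[["Building", "Floor", "Team"]] == "All").any(axis=1)

# Work on a string copy so numeric columns can hold the <b> markup
# and the original frame stays unchanged.
styled = df.astype(str)
styled.loc[mask] = styled.loc[mask].apply(lambda col: "<b>" + col + "</b>")

print(styled)
```

The `styled` frame can then be passed to go.Table in place of df, while df keeps its numeric dtypes for any further computation.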
How can I select all the values in the 'Funding' column that end with "M", then remove the "M" and "$" and add "0," before the value?
e.g. from $535M to 0,535
That's because I have both billion and million values. I've decided to format the column in billions, so values in millions must become 0,...
df.head(10).to_dict()
{'Company': {0: 'Bytedance',
1: 'SpaceX',
2: 'SHEIN',
3: 'Stripe',
4: 'Klarna',
5: 'Canva',
6: 'Checkout.com',
7: 'Instacart',
8: 'JUUL Labs',
9: 'Databricks'},
'Valuation': {0: '$180B',
1: '$100B',
2: '$100B',
3: '$95B',
4: '$46B',
5: '$40B',
6: '$40B',
7: '$39B',
8: '$38B',
9: '$38B'},
'Date Joined': {0: '2017-04-07',
1: '2012-12-01',
2: '2018-07-03',
3: '2014-01-23',
4: '2011-12-12',
5: '2018-01-08',
6: '2019-05-02',
7: '2014-12-30',
8: '2017-12-20',
9: '2019-02-05'},
'Industry': {0: 'Artificial intelligence',
1: 'Other',
2: 'E-commerce & direct-to-consumer',
3: 'Fintech',
4: 'Fintech',
5: 'Internet software & services',
6: 'Fintech',
7: 'Supply chain, logistics, & delivery',
8: 'Consumer & retail',
9: 'Data management & analytics'},
'City': {0: 'Beijing',
1: 'Hawthorne',
2: 'Shenzhen',
3: 'San Francisco',
4: 'Stockholm',
5: 'Surry Hills',
6: 'London',
7: 'San Francisco',
8: 'San Francisco',
9: 'San Francisco'},
'Country': {0: 'China',
1: 'United States',
2: 'China',
3: 'United States',
4: 'Sweden',
5: 'Australia',
6: 'United Kingdom',
7: 'United States',
8: 'United States',
9: 'United States'},
'Continent': {0: 'Asia',
1: 'North America',
2: 'Asia',
3: 'North America',
4: 'Europe',
5: 'Oceania',
6: 'Europe',
7: 'North America',
8: 'North America',
9: 'North America'},
'Year Founded': {0: 2012,
1: 2002,
2: 2008,
3: 2010,
4: 2005,
5: 2012,
6: 2012,
7: 2012,
8: 2015,
9: 2013},
'Funding': {0: '$8B',
1: '$7B',
2: '$2B',
3: '$2B',
4: '$4B',
5: '$572M',
6: '$2B',
7: '$3B',
8: '$14B',
9: '$3B'},
'Select Investors': {0: 'Sequoia Capital China, SIG Asia Investments, Sina Weibo, Softbank Group',
1: 'Founders Fund, Draper Fisher Jurvetson, Rothenberg Ventures',
2: 'Tiger Global Management, Sequoia Capital China, Shunwei Capital Partners',
3: 'Khosla Ventures, LowercaseCapital, capitalG',
4: 'Institutional Venture Partners, Sequoia Capital, General Atlantic',
5: 'Sequoia Capital China, Blackbird Ventures, Matrix Partners',
6: 'Tiger Global Management, Insight Partners, DST Global',
7: 'Khosla Ventures, Kleiner Perkins Caufield & Byers, Collaborative Fund',
8: 'Tiger Global Management',
9: 'Andreessen Horowitz, New Enterprise Associates, Battery Ventures'}}
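I did a similar manipulation with Valuation; here is how I did it. I hope it's right.
# regex=False treats "$" as a literal character rather than a regex anchor
df['Valuation'] = df['Valuation'].str.replace(
"B","", regex=False).str.replace(
"$","", regex=False).astype(int)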
I've tried several approaches, but none of them works. Here are some of them:
df['Funding'] = np.where(df.Funding.str.contain("M"),
df['Funding'] = ('0,'+ df['Funding']),
pass)
df['Funding'] = df['Funding'].str.replace(
"B", "").str.replace(
"$","").str.replace(
"M","0,")
if df['Funding'].str.contains("M").any():
df['Funding'] = df['Funding'].str.replace("M", "")
asd = "M"
if any(("M" in asd) for M in df['Funding']):
df['Funding'].join((df['Funding'][:0],'0,',df['Funding'][0:])) and replace("M", "")
Thanks to everyone who wants to help me. It's my first time with Python; I'm more familiar with R.
If you want all your column values in billions, you can use:
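# Strip the leading "$" and the trailing unit letter, then convert to int
amount = df["Funding"].str[1:-1].astype(int)
# Keep values in billions as they are; divide values in millions by 1000
df["Valuation"] = amount.where(df["Funding"].str.endswith("B"), amount.div(1000))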
>>> df
Funding Valuation
0 $8B 8.000
1 $2B 2.000
2 $535M 0.535
Input df:
df = pd.DataFrame({"Funding": ["$8B", "$2B", "$535M"]})
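An alternative that makes the unit handling explicit is to split each string into an amount and a unit with a regular expression and then look up a divisor per unit. This is a sketch, not the method used above: the column name "Funding (B)", the helper names `parts` and `divisor`, and the assumption that every value looks like "$<number><M or B>" are mine.

```python
import pandas as pd

df = pd.DataFrame({"Funding": ["$8B", "$2B", "$535M"]})

# Split "$535M" into amount "535" and unit "M".
parts = df["Funding"].str.extract(r"^\$(?P<amount>\d+(?:\.\d+)?)(?P<unit>[MB])$")

# Divisor that converts each unit to billions.
divisor = parts["unit"].map({"B": 1, "M": 1000})

df["Funding (B)"] = parts["amount"].astype(float) / divisor
print(df)
```

Values that do not match the pattern end up as NaN instead of raising an error, which makes unexpected formats easy to spot with df["Funding (B)"].isna().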
I have the following dataframe:
df = pd.DataFrame({'REC2': {0: '18-24',
1: '18-24',
2: '25-34',
3: '25-34',
4: '35-44',
5: '35-44',
6: '45-54',
7: '45-54',
8: '55-64',
9: '55-64',
10: '65+',
11: '65+'},
'Q8_1': {0: 'No',
1: 'Yes',
2: 'No',
3: 'Yes',
4: 'No',
5: 'Yes',
6: 'No',
7: 'Yes',
8: 'No',
9: 'Yes',
10: 'No',
11: 'Yes'},
'val': {0: 0.9642857142857143,
1: 0.03571428571428571,
2: 0.8208955223880597,
3: 0.1791044776119403,
4: 0.8507462686567164,
5: 0.14925373134328357,
6: 0.8484848484848485,
7: 0.15151515151515152,
8: 0.8653846153846154,
9: 0.1346153846153846,
10: 0.9375,
11: 0.0625}})
which looks like this:
I am trying to create a separate pie chart for each age bin. Currently I am using a hardcoded version, where I need to type in all the available bins. However, I am looking for a solution that does this within a loop or automatically assigns the correct bins. This is my current solution:
df = df.pivot_table(values="val",index=["REC2","Q8_1"])
rcParams['figure.figsize'] = (6,10)
f, a = plt.subplots(3,2)
df.xs('18-24').plot(kind='pie',ax=a[0,0],y="val")
df.xs('25-34').plot(kind='pie',ax=a[1,0],y="val")
df.xs('35-44').plot(kind='pie',ax=a[2,0],y="val")
df.xs('45-54').plot(kind='pie',ax=a[0,1],y="val")
df.xs('55-64').plot(kind='pie',ax=a[1,1],y="val")
df.xs('65+').plot(kind='pie',ax=a[2,1],y="val")
Output:
I think you want:
df.groupby('REC2').plot.pie(x='Q8_1', y='val', layout=(2,3))
Update: I took a look, and it turns out that groupby.plot does something different. So you can try a for loop instead:
df = df.set_index("Q8_1")
f, a = plt.subplots(3,2)
# unique() keeps the bins in order of appearance; set() would give an arbitrary order
for age, ax in zip(df.REC2.unique(), a.ravel()):
df[df.REC2.eq(age)].plot.pie( y='val', ax=ax)
plt.show()
which yields:
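The same loop can also be written with groupby so that the grid size follows the number of age bins instead of being fixed at 3x2. This is a sketch under assumptions: the data is shortened to three bins, the two-column layout is an arbitrary choice, and the Agg backend is selected only so that the example runs without a display (drop that line in a notebook).

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Shortened version of the DataFrame from the question.
df = pd.DataFrame({
    "REC2": ["18-24", "18-24", "25-34", "25-34", "65+", "65+"],
    "Q8_1": ["No", "Yes"] * 3,
    "val": [0.96, 0.04, 0.82, 0.18, 0.94, 0.06],
})

# One (age bin, sub-frame) pair per group, sorted by bin label.
groups = list(df.groupby("REC2", sort=True))

ncols = 2
nrows = -(-len(groups) // ncols)  # ceiling division
fig, axes = plt.subplots(nrows, ncols, figsize=(6, 3 * nrows), squeeze=False)

for ax, (age, grp) in zip(axes.ravel(), groups):
    ax.pie(grp["val"].to_numpy(), labels=grp["Q8_1"].tolist(), autopct="%1.0f%%")
    ax.set_title(age)

# Hide any unused axes when the number of bins does not fill the grid.
for ax in axes.ravel()[len(groups):]:
    ax.set_visible(False)

fig.savefig("pies.png")
```

Titling each axis with its bin label avoids having to remember which subplot belongs to which age group.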