Python - making scatterplot from non-numeric values

I have a CSV file whose data I have imported into a dataframe:
RI_df = pd.read_csv("../Week15/police.csv")
Using .head(), my data looks like this:
state stop_date stop_time county_name driver_gender driver_race violation_raw violation search_conducted search_type stop_outcome is_arrested stop_duration drugs_related_stop district
0 RI 2005-01-04 12:55 NaN M White Equipment/Inspection Violation Equipment False NaN Citation False 0-15 Min False Zone X4
1 RI 2005-01-23 23:15 NaN M White Speeding Speeding False NaN Citation False 0-15 Min False Zone K3
2 RI 2005-02-17 04:15 NaN M White Speeding Speeding False NaN Citation False 0-15 Min False Zone X4
3 RI 2005-02-20 17:15 NaN M White Call for Service Other False NaN Arrest Driver True 16-30 Min False Zone X1
4 RI 2005-02-24 01:20 NaN F White Speeding Speeding False NaN Citation False 0-15 Min False Zone X3
RI_df.head().to_dict()
Out[55]:
{'state': {0: 'RI', 1: 'RI', 2: 'RI', 3: 'RI', 4: 'RI'},
'stop_date': {0: '2005-01-04',
1: '2005-01-23',
2: '2005-02-17',
3: '2005-02-20',
4: '2005-02-24'},
'stop_time': {0: '12:55', 1: '23:15', 2: '04:15', 3: '17:15', 4: '01:20'},
'county_name': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'driver_gender': {0: 'M', 1: 'M', 2: 'M', 3: 'M', 4: 'F'},
'driver_race': {0: 'White', 1: 'White', 2: 'White', 3: 'White', 4: 'White'},
'violation_raw': {0: 'Equipment/Inspection Violation',
1: 'Speeding',
2: 'Speeding',
3: 'Call for Service',
4: 'Speeding'},
'violation': {0: 'Equipment',
1: 'Speeding',
2: 'Speeding',
3: 'Other',
4: 'Speeding'},
'search_conducted': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0},
'search_type': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'stop_outcome': {0: 'Citation',
1: 'Citation',
2: 'Citation',
3: 'Arrest Driver',
4: 'Citation'},
'is_arrested': {0: False, 1: False, 2: False, 3: True, 4: False},
'stop_duration': {0: '0-15 Min',
1: '0-15 Min',
2: '0-15 Min',
3: '16-30 Min',
4: '0-15 Min'},
'drugs_related_stop': {0: False, 1: False, 2: False, 3: False, 4: False},
'district': {0: 'Zone X4',
1: 'Zone K3',
2: 'Zone X4',
3: 'Zone X1',
4: 'Zone X3'}}
RI_df['drugs_related_stop'].value_counts()
Out[27]:
False 90879
True 862
Name: drugs_related_stop, dtype: int64
I am trying to take the True value counts of drugs_related_stop and put them on a line graph, in order to see whether drug-related stops have been increasing over time.
ax = RI_df['drugs_related_stop'].value_counts().plot(kind='line',
figsize=(10,8),
title="Drug stops")
ax.set_xlabel("drug stops")
ax.set_ylabel("number of stops")

You should just use groupby().count():
ax = df.groupby('stop_date', as_index=False).count().plot(kind='line',
figsize=(10,8), title="Drug stops", x='stop_date',
y='district')
Here is the complete code so you can double-check:
import pandas as pd
import numpy as np
df = pd.DataFrame({'state': {0: 'RI', 1: 'RI', 2: 'RI', 3: 'RI', 4: 'RI'},
'stop_date': {0: '2005-01-23',
1: '2005-01-23',
2: '2005-02-17',
3: '2005-02-17',
4: '2005-02-24'},
'stop_time': {0: '12:55', 1: '23:15', 2: '04:15', 3: '17:15', 4: '01:20'},
'county_name': {0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan, 4: np.nan},
'driver_gender': {0: 'M', 1: 'M', 2: 'M', 3: 'M', 4: 'F'},
'driver_race': {0: 'White', 1: 'White', 2: 'White', 3: 'White', 4: 'White'},
'violation_raw': {0: 'Equipment/Inspection Violation',
1: 'Speeding',
2: 'Speeding',
3: 'Call for Service',
4: 'Speeding'},
'violation': {0: 'Equipment',
1: 'Speeding',
2: 'Speeding',
3: 'Other',
4: 'Speeding'},
'search_conducted': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0},
'search_type': {0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan, 4: np.nan},
'stop_outcome': {0: 'Citation',
1: 'Citation',
2: 'Citation',
3: 'Arrest Driver',
4: 'Citation'},
'is_arrested': {0: False, 1: False, 2: False, 3: True, 4: False},
'stop_duration': {0: '0-15 Min',
1: '0-15 Min',
2: '0-15 Min',
3: '16-30 Min',
4: '0-15 Min'},
'drugs_related_stop': {0: False, 1: False, 2: False, 3: False, 4: False},
'district': {0: 'Zone X4',
1: 'Zone K3',
2: 'Zone X4',
3: 'Zone X1',
4: 'Zone X3'}})
ax = df.groupby('stop_date', as_index=False).count().plot(kind='line',
figsize=(10,8), title="Drug stops", x='stop_date',
y='district')

This is what I'm getting with the code below...
ax = df.groupby('stop_date', as_index=False).count().plot(kind='line',
figsize=(10,8), title="Drug stops", x='stop_date',
y='district')
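Note that groupby('stop_date').count() counts all stops per date, not just the drug-related ones. If the goal is the trend of drug-related stops over time, one option is to filter on drugs_related_stop first and then count per year; a minimal sketch, assuming stop_date parses cleanly as dates:
import pandas as pd
RI_df = pd.read_csv("../Week15/police.csv")
RI_df['stop_date'] = pd.to_datetime(RI_df['stop_date'])
# Keep only the drug-related stops, then count them per year
drug_stops = RI_df[RI_df['drugs_related_stop']]
yearly = drug_stops.groupby(drug_stops['stop_date'].dt.year).size()
ax = yearly.plot(kind='line', figsize=(10, 8), title="Drug stops per year")
ax.set_xlabel("year")
ax.set_ylabel("number of drug-related stops")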

Pipeline & ColumnTransformer: ValueError: Selected columns are not unique in dataframe

Background: I am trying to learn from a notebook used for the Kaggle House Price Prediction dataset.
I am trying to use a Pipeline to transform the numerical and categorical columns in a dataframe. It is having issues with my categorical variables' names, which are stored as a list in the variable categ_cols_names. It says that those categorical columns are not unique in the dataframe, and I'm not sure what that means.
categ_cols_names = ['MSZoning','Street','LotShape','LandContour','Utilities','LotConfig','LandSlope','Neighborhood','Condition1','Condition2','BldgType','HouseStyle','OverallQual','OverallCond','YearBuilt','YearRemodAdd','RoofStyle','RoofMatl','Exterior1st','Exterior2nd','MasVnrType','ExterQual','ExterCond','Foundation','BsmtQual','BsmtCond','BsmtExposure','BsmtFinType1','BsmtFinType2','Heating','HeatingQC','CentralAir','Electrical','BsmtFullBath','BsmtHalfBath','FullBath','HalfBath','BedroomAbvGr','KitchenAbvGr','KitchenQual','Functional','Fireplaces','GarageType','GarageYrBlt','GarageFinish','GarageCars','GarageQual','GarageCond','PavedDrive','MoSold','YrSold','SaleType','SaleCondition','OverallQual','GarageCars','FullBath','YearBuilt']
Below is my code:
# Imports assumed by the snippet below
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler, PowerTransformer, OneHotEncoder
from sklearn.compose import ColumnTransformer
# Get numerical columns names
num_cols_names = X_train.columns[X_train.dtypes != object].to_list()
# Numerical columns with missing values
num_nan_cols = X_train[num_cols_names].columns[X_train[num_cols_names].isna().sum() > 0]
# Assign np.nan to NaN values in the numerical features
# so they are reliably detected in later steps
X_train[num_nan_cols] = X_train[num_nan_cols].fillna(value = np.nan, axis = 1)
# Define pipeline for imputation of the numerical features
num_pipeline = Pipeline(steps = [
('Simple Imputer', SimpleImputer(strategy = 'median')),
('Robust Scaler', RobustScaler()),
('Power Transformer', PowerTransformer())
]
)
# Get categorical columns names
categ_cols_names = X_train.columns[X_train.dtypes == object].to_list()
# Categorical columns with missing values
categ_nan_cols = X_train[categ_cols_names].columns[X_train[categ_cols_names].isna().sum() > 0]
# Assign np.nan to NaN values in the categorical features
# so they are reliably detected in later steps
X_train[categ_nan_cols] = X_train[categ_nan_cols].fillna(value = np.nan, axis = 1)
# Define pipeline for imputation and encoding of the categorical features
categ_pipeline = Pipeline(steps = [
('Categorical Imputer', SimpleImputer(strategy = 'most_frequent')),
('One Hot Encoder', OneHotEncoder(drop = 'first'))
])
ct = ColumnTransformer([
('Categorical Pipeline', categ_pipeline, categ_cols_names),
('Numerical Pipeline', num_pipeline, num_cols_names)],
remainder = 'passthrough',
sparse_threshold = 0,
n_jobs = -1)
pipe = Pipeline(steps = [('Column Transformer', ct)])
pipe.fit_transform(X_train)
The ValueError occurs on the .fit_transform() line.
Here is a sample of my X_train:
{'MSZoning': {0: 'RL', 1: 'RL', 2: 'RL', 3: 'RL', 4: 'RL'},
'Street': {0: 'Pave', 1: 'Pave', 2: 'Pave', 3: 'Pave', 4: 'Pave'},
'LotShape': {0: 'Reg', 1: 'Reg', 2: 'IR1', 3: 'IR1', 4: 'IR1'},
'LandContour': {0: 'Lvl', 1: 'Lvl', 2: 'Lvl', 3: 'Lvl', 4: 'Lvl'},
'Utilities': {0: 'AllPub',
1: 'AllPub',
2: 'AllPub',
3: 'AllPub',
4: 'AllPub'},
'LotConfig': {0: 'Inside', 1: 'FR2', 2: 'Inside', 3: 'Corner', 4: 'FR2'},
'LandSlope': {0: 'Gtl', 1: 'Gtl', 2: 'Gtl', 3: 'Gtl', 4: 'Gtl'},
'Neighborhood': {0: 'CollgCr',
1: 'Veenker',
2: 'CollgCr',
3: 'Crawfor',
4: 'NoRidge'},
'Condition1': {0: 'Norm', 1: 'Feedr', 2: 'Norm', 3: 'Norm', 4: 'Norm'},
'Condition2': {0: 'Norm', 1: 'Norm', 2: 'Norm', 3: 'Norm', 4: 'Norm'},
'BldgType': {0: '1Fam', 1: '1Fam', 2: '1Fam', 3: '1Fam', 4: '1Fam'},
'HouseStyle': {0: '2Story',
1: '1Story',
2: '2Story',
3: '2Story',
4: '2Story'},
'OverallQual': {0: '7', 1: '6', 2: '7', 3: '7', 4: '8'},
'OverallCond': {0: '5', 1: '8', 2: '5', 3: '5', 4: '5'},
'YearBuilt': {0: '2003', 1: '1976', 2: '2001', 3: '1915', 4: '2000'},
'YearRemodAdd': {0: '2003', 1: '1976', 2: '2002', 3: '1970', 4: '2000'},
'RoofStyle': {0: 'Gable', 1: 'Gable', 2: 'Gable', 3: 'Gable', 4: 'Gable'},
'RoofMatl': {0: 'CompShg',
1: 'CompShg',
2: 'CompShg',
3: 'CompShg',
4: 'CompShg'},
'Exterior1st': {0: 'VinylSd',
1: 'MetalSd',
2: 'VinylSd',
3: 'Wd Sdng',
4: 'VinylSd'},
'Exterior2nd': {0: 'VinylSd',
1: 'MetalSd',
2: 'VinylSd',
3: 'Wd Shng',
4: 'VinylSd'},
'MasVnrType': {0: 'BrkFace',
1: 'None',
2: 'BrkFace',
3: 'None',
4: 'BrkFace'},
'ExterQual': {0: 'Gd', 1: 'TA', 2: 'Gd', 3: 'TA', 4: 'Gd'},
'ExterCond': {0: 'TA', 1: 'TA', 2: 'TA', 3: 'TA', 4: 'TA'},
'Foundation': {0: 'PConc', 1: 'CBlock', 2: 'PConc', 3: 'BrkTil', 4: 'PConc'},
'BsmtQual': {0: 'Gd', 1: 'Gd', 2: 'Gd', 3: 'TA', 4: 'Gd'},
'BsmtCond': {0: 'TA', 1: 'TA', 2: 'TA', 3: 'Gd', 4: 'TA'},
'BsmtExposure': {0: 'No', 1: 'Gd', 2: 'Mn', 3: 'No', 4: 'Av'},
'BsmtFinType1': {0: 'GLQ', 1: 'ALQ', 2: 'GLQ', 3: 'ALQ', 4: 'GLQ'},
'BsmtFinType2': {0: 'Unf', 1: 'Unf', 2: 'Unf', 3: 'Unf', 4: 'Unf'},
'Heating': {0: 'GasA', 1: 'GasA', 2: 'GasA', 3: 'GasA', 4: 'GasA'},
'HeatingQC': {0: 'Ex', 1: 'Ex', 2: 'Ex', 3: 'Gd', 4: 'Ex'},
'CentralAir': {0: 'Y', 1: 'Y', 2: 'Y', 3: 'Y', 4: 'Y'},
'Electrical': {0: 'SBrkr', 1: 'SBrkr', 2: 'SBrkr', 3: 'SBrkr', 4: 'SBrkr'},
'BsmtFullBath': {0: '1', 1: '0', 2: '1', 3: '1', 4: '1'},
'BsmtHalfBath': {0: '0', 1: '1', 2: '0', 3: '0', 4: '0'},
'FullBath': {0: '2', 1: '2', 2: '2', 3: '1', 4: '2'},
'HalfBath': {0: '1', 1: '0', 2: '1', 3: '0', 4: '1'},
'BedroomAbvGr': {0: '3', 1: '3', 2: '3', 3: '3', 4: '4'},
'KitchenAbvGr': {0: '1', 1: '1', 2: '1', 3: '1', 4: '1'},
'KitchenQual': {0: 'Gd', 1: 'TA', 2: 'Gd', 3: 'Gd', 4: 'Gd'},
'Functional': {0: 'Typ', 1: 'Typ', 2: 'Typ', 3: 'Typ', 4: 'Typ'},
'Fireplaces': {0: '0', 1: '1', 2: '1', 3: '1', 4: '1'},
'GarageType': {0: 'Attchd',
1: 'Attchd',
2: 'Attchd',
3: 'Detchd',
4: 'Attchd'},
'GarageYrBlt': {0: '2003.0',
1: '1976.0',
2: '2001.0',
3: '1998.0',
4: '2000.0'},
'GarageFinish': {0: 'RFn', 1: 'RFn', 2: 'RFn', 3: 'Unf', 4: 'RFn'},
'GarageCars': {0: '2', 1: '2', 2: '2', 3: '3', 4: '3'},
'GarageQual': {0: 'TA', 1: 'TA', 2: 'TA', 3: 'TA', 4: 'TA'},
'GarageCond': {0: 'TA', 1: 'TA', 2: 'TA', 3: 'TA', 4: 'TA'},
'PavedDrive': {0: 'Y', 1: 'Y', 2: 'Y', 3: 'Y', 4: 'Y'},
'MoSold': {0: '2', 1: '5', 2: '9', 3: '2', 4: '12'},
'YrSold': {0: '2008', 1: '2007', 2: '2008', 3: '2006', 4: '2008'},
'SaleType': {0: 'WD', 1: 'WD', 2: 'WD', 3: 'WD', 4: 'WD'},
'SaleCondition': {0: 'Normal',
1: 'Normal',
2: 'Normal',
3: 'Abnorml',
4: 'Normal'},
'GrLivArea': {0: 1710, 1: 1262, 2: 1786, 3: 1717, 4: 2198},
'GarageArea': {0: 548, 1: 460, 2: 608, 3: 642, 4: 836},
'TotalBsmtSF': {0: 856, 1: 1262, 2: 920, 3: 756, 4: 1145},
'1stFlrSF': {0: 856, 1: 1262, 2: 920, 3: 961, 4: 1145},
'TotRmsAbvGrd': {0: 8, 1: 6, 2: 6, 3: 7, 4: 9}}
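The error message is pointing at duplicate column names. Notice that the categ_cols_names list shown at the top (built from X_train.columns) repeats 'OverallQual', 'GarageCars', 'FullBath' and 'YearBuilt', which suggests X_train itself carries those columns twice, and ColumnTransformer cannot index a column name that occurs more than once. A quick check and fix (a minimal sketch against your X_train):
# Which column names occur more than once in the training frame?
dupes = X_train.columns[X_train.columns.duplicated()].unique()
print(dupes)  # expected here: ['OverallQual', 'GarageCars', 'FullBath', 'YearBuilt'] (assumption)
# Keep only the first occurrence of each column before building the pipelines
X_train = X_train.loc[:, ~X_train.columns.duplicated()]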

Plotly - set decimal place in choropleth

How do you convert the number 1.425887B to 1.4B in a Plotly choropleth?
data2022 = dict(type = 'choropleth',
colorscale = 'agsunset',
reversescale = True,
locations = df['Country/Territory'],
locationmode = 'country names',
z = df['2022 Population'],
text = df['CCA3'],
marker = dict(line = dict(color = 'rgb(12, 12, 12)', width=1)),
colorbar = {'title': 'Population'})
layout2022 = dict(title = '<b>World Population 2022</b>',
geo = dict(showframe = True,
showland = True, landcolor = 'rgb(198, 197, 198)',
showlakes = True, lakecolor = 'rgb(85, 173, 240)',
showrivers = True, rivercolor = 'rgb(173, 216, 230)',
showocean = True, oceancolor = 'rgb(173, 216, 230)',
projection = {'type': 'natural earth'}))
choromap2022 = go.Figure(data=[data2022], layout=layout2022)
choromap2022.update_geos(lataxis_showgrid = True, lonaxis_showgrid = True)
choromap2022.update_layout(height = 600,
title_x = 0.5,
title_font_color = 'red',
title_font_family = 'Times New Roman',
title_font_size = 30,
margin=dict(t=80, r=50, l=50))
iplot(choromap2022)
This is the image of the result I got; I want to convert the population of China from 1.425887B to 1.4B.
I tried to look this up in the Plotly documentation but could not find anything.
This is the output of df.head().to_dict()
{'CCA3': {0: 'AFG', 1: 'ALB', 2: 'DZA', 3: 'ASM', 4: 'AND'},
'Country/Territory': {0: 'Afghanistan',
1: 'Albania',
2: 'Algeria',
3: 'American Samoa',
4: 'Andorra'},
'Capital': {0: 'Kabul',
1: 'Tirana',
2: 'Algiers',
3: 'Pago Pago',
4: 'Andorra la Vella'},
'Continent': {0: 'Asia', 1: 'Europe', 2: 'Africa', 3: 'Oceania', 4: 'Europe'},
'2022 Population': {0: 41128771, 1: 2842321, 2: 44903225, 3: 44273, 4: 79824},
'2020 Population': {0: 38972230, 1: 2866849, 2: 43451666, 3: 46189, 4: 77700},
'2015 Population': {0: 33753499, 1: 2882481, 2: 39543154, 3: 51368, 4: 71746},
'2010 Population': {0: 28189672, 1: 2913399, 2: 35856344, 3: 54849, 4: 71519},
'2000 Population': {0: 19542982, 1: 3182021, 2: 30774621, 3: 58230, 4: 66097},
'1990 Population': {0: 10694796, 1: 3295066, 2: 25518074, 3: 47818, 4: 53569},
'1980 Population': {0: 12486631, 1: 2941651, 2: 18739378, 3: 32886, 4: 35611},
'1970 Population': {0: 10752971, 1: 2324731, 2: 13795915, 3: 27075, 4: 19860},
'Area (km²)': {0: 652230, 1: 28748, 2: 2381741, 3: 199, 4: 468},
'Density (per km²)': {0: 63.0587,
1: 98.8702,
2: 18.8531,
3: 222.4774,
4: 170.5641},
'Growth Rate': {0: 1.0257, 1: 0.9957, 2: 1.0164, 3: 0.9831, 4: 1.01},
'World Population Percentage': {0: 0.52, 1: 0.04, 2: 0.56, 3: 0.0, 4: 0.0}}
This is trickier than it appears: Plotly uses d3-format, but I believe it layers additional metric abbreviations on top of that formatting so that, by default, numbers larger than 1000 display in the form 1.425887B.
My original idea was to round to the nearest 2 digits in the hovertemplate with something like:
data2022 = dict(..., hovertemplate = "%{z:.2r}<br>%{text}<extra></extra>")
However, this removes the default metric abbreviation and causes the entire long-form number to display: the population of China shows up as 1400000000 instead of 1.4B.
So one possible workaround is to create a new column in your DataFrame called "2022 Population Text" and format the number with a custom function that rounds and abbreviates it (credit goes to @rtaft for their function, which does exactly that). Then you can pass this column to customdata and display customdata in your hovertemplate (instead of z).
import pandas as pd
import plotly.graph_objects as go
data = {'CCA3': {0: 'AFG', 1: 'ALB', 2: 'DZA', 3: 'ASM', 4: 'AND'},
'Country/Territory': {0: 'Afghanistan',
1: 'Albania',
2: 'Algeria',
3: 'American Samoa',
4: 'Andorra'},
'Capital': {0: 'Kabul',
1: 'Tirana',
2: 'Algiers',
3: 'Pago Pago',
4: 'Andorra la Vella'},
'Continent': {0: 'Asia', 1: 'Europe', 2: 'Africa', 3: 'Oceania', 4: 'Europe'},
'2022 Population': {0: 1412000000, 1: 2842321, 2: 44903225, 3: 44273, 4: 79824},
'2020 Population': {0: 38972230, 1: 2866849, 2: 43451666, 3: 46189, 4: 77700},
'2015 Population': {0: 33753499, 1: 2882481, 2: 39543154, 3: 51368, 4: 71746},
'2010 Population': {0: 28189672, 1: 2913399, 2: 35856344, 3: 54849, 4: 71519},
'2000 Population': {0: 19542982, 1: 3182021, 2: 30774621, 3: 58230, 4: 66097},
'1990 Population': {0: 10694796, 1: 3295066, 2: 25518074, 3: 47818, 4: 53569},
'1980 Population': {0: 12486631, 1: 2941651, 2: 18739378, 3: 32886, 4: 35611},
'1970 Population': {0: 10752971, 1: 2324731, 2: 13795915, 3: 27075, 4: 19860},
'Area (km²)': {0: 652230, 1: 28748, 2: 2381741, 3: 199, 4: 468},
'Density (per km²)': {0: 63.0587,
1: 98.8702,
2: 18.8531,
3: 222.4774,
4: 170.5641},
'Growth Rate': {0: 1.0257, 1: 0.9957, 2: 1.0164, 3: 0.9831, 4: 1.01},
'World Population Percentage': {0: 0.52, 1: 0.04, 2: 0.56, 3: 0.0, 4: 0.0}
}
## Rounds a number to the specified precision and adds metric abbreviations,
## e.g. 14230000000 --> 14B
## reference: https://stackoverflow.com/a/45846841/5327068
def human_format(num):
    num = float('{:.2g}'.format(num))
    magnitude = 0
    while abs(num) >= 1000:
        magnitude += 1
        num /= 1000.0
    return '{}{}'.format('{:f}'.format(num).rstrip('0').rstrip('.'), ['', 'K', 'M', 'B', 'T'][magnitude])
df = pd.DataFrame(data=data)
df['2022 Population Text'] = df['2022 Population'].apply(lambda x: human_format(x))
data2022 = dict(type = 'choropleth',
colorscale = 'agsunset',
reversescale = True,
locations = df['Country/Territory'],
locationmode = 'country names',
z = df['2022 Population'],
text = df['CCA3'],
customdata = df['2022 Population Text'],
marker = dict(line = dict(color = 'rgb(12, 12, 12)', width=1)),
colorbar = {'title': 'Population'},
hovertemplate = "%{customdata}<br>%{text}<extra></extra>"
)
layout2022 = dict(title = '<b>World Population 2022</b>',
geo = dict(showframe = True,
showland = True, landcolor = 'rgb(198, 197, 198)',
showlakes = True, lakecolor = 'rgb(85, 173, 240)',
showrivers = True, rivercolor = 'rgb(173, 216, 230)',
showocean = True, oceancolor = 'rgb(173, 216, 230)',
projection = {'type': 'natural earth'}))
choromap2022 = go.Figure(data=[data2022], layout=layout2022)
choromap2022.update_geos(lataxis_showgrid = True, lonaxis_showgrid = True)
choromap2022.update_layout(height = 600,
title_x = 0.5,
title_font_color = 'red',
title_font_family = 'Times New Roman',
title_font_size = 30,
margin=dict(t=80, r=50, l=50),
)
choromap2022.show()
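As a quick sanity check on human_format (values taken from the sample data above):
print(human_format(1412000000))  # -> '1.4B'
print(human_format(44273))       # -> '44K'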
Note: Since China wasn't included in your sample data, I changed the population of AFG to 1412000000 to test that the hovertemplate would display it as '1.4B'.

How to plot only True signals with a Plotly candlestick chart

Here is the code I am using:
df['C'] = np.where((df['spread'] > 60) & (df['volume'] > df['Ma_mult_high']),'green','red')
fig = go.Figure()
# add OHLC trace
fig.add_trace(go.Candlestick(x=df.index,
open=df['open'],
high=df['high'],
low=df['low'],
close=df['close'],
showlegend=False))
# add moving average traces
fig.add_trace(go.Scatter(x=df.index,
y=df['ma'],
opacity=0.7,
line=dict(color='blue', width=2),
name='MA 5'))
fig.add_trace(go.Scatter(
x = df.index,
y = df['close'],
mode = 'markers',
marker_color=df.C
))
fig.update_layout(xaxis_rangeslider_visible=False).show()
The output:
In the image, you can see that it plots both True and False signals, maybe because of marker_color=df.C. But if I change that and use only a color name, it plots nothing; even if I change y = df['close'], I get the same problem.
data:
{'timeStamp': {0: 1657220400000, 1: 1657222200000, 2: 1657224000000, 3: 1657225800000, 4: 1657227600000},
'open': {0: 21357.7, 1: 21495.84, 2: 21812.46, 3: 21641.56, 4: 21624.03},
'high': {0: 21499.87, 1: 21837.74, 2: 21838.1, 3: 21659.99, 4: 21727.87},
'low': {0: 21325.0, 1: 21439.13, 2: 21526.4, 3: 21541.96, 4: 21567.56},
'close': {0: 21495.83, 1: 21812.47, 2: 21641.56, 3: 21624.03, 4: 21619.57},
'volume': {0: 3663.2089, 1: 7199.91652, 2: 4367.94336, 3: 1841.10043, 4: 1786.17022},
'quoteVolume': {0: 78386481.2224664, 1: 155885063.7202956, 2: 94605455.6190078, 3: 39756576.8814698, 4: 38684342.7232105},
'tradesCount': {0: 59053, 1: 111142, 2: 81136, 3: 56148, 4: 53122},
'date': {0: Timestamp('2022-07-07 19:00:00'), 1: Timestamp('2022-07-07 19:30:00'), 2: Timestamp('2022-07-07 20:00:00'), 3: Timestamp('2022-07-07 20:30:00'), 4: Timestamp('2022-07-07 21:00:00')},
'Avg_Volume': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'Ma_mult_high': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'Ma_mult_mid': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'spread': {0: 78.9901069365825, 1: 79.43353152203923, 2: 54.82836060314386, 3: 14.85215623146836, 4: 2.782109662528346},
'Marker': {0: 21502.87, 1: 21840.74, 2: 21523.4, 3: 21538.96, 4: 21564.56},
'Symbol': {0: 'triangle-up', 1: 'triangle-up', 2: 'triangle-down', 3: 'triangle-down', 4: 'triangle-down'},
'ma': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0},
'C': {0: 'red', 1: 'red', 2: 'red', 3: 'red', 4: 'red'}}
It seems to me that the issue is in your np.where() statement: the NaN values in Ma_mult_high make df['volume'] > df['Ma_mult_high'] evaluate to False, which results in 'red'.
Try this:
df['C'] = np.where((df['spread'] > 60) & (df['volume'] > df['Ma_mult_high'].fillna(0)),'green','red')
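If the goal is to draw markers only where the signal is actually True, another option is to filter the rows before adding the scatter trace; a minimal sketch, reusing df, fig and the go import from above:
# Plot markers only for the rows flagged 'green' (the True signals)
signals = df[df['C'] == 'green']
fig.add_trace(go.Scatter(x = signals.index,
y = signals['close'],
mode = 'markers',
marker_color = 'green',
name = 'signal'))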

Join pandas dataframes according to common index value only

I have the following dataframes (this is just test data). In real samples, I have index values that are repeated a few times inside dataframe 1 and dataframe 2 - this causes the repeated/duplicate rows inside the final dataframe.
DataFrame 1:
pd.DataFrame({'id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10},
'first_name': {0: 'Jennee',
1: 'Dagny',
2: 'Correy',
3: 'Pall',
4: 'Julie',
5: 'Janene',
6: 'Lemmy',
7: 'Coleman',
8: 'Beck',
9: 'Che'},
'last_name': {0: 'Strelitzki',
1: 'Dunsire',
2: 'Wickrath',
3: 'Jopp',
4: 'Gheeraert',
5: 'Gawith',
6: 'Farrow',
7: 'Legging',
8: 'Beckwith',
9: 'Burgoin'},
'email': {0: 'jstrelitzki0@google.de',
1: 'ddunsire1@geocities.com',
2: 'cwickrath2@github.com',
3: 'pjopp3@infoseek.co.jp',
4: 'jgheeraert4@theatlantic.com',
5: 'jgawith5@sciencedirect.com',
6: 'lfarrow6@wikimedia.org',
7: 'clegging7@businessinsider.com',
8: 'bbeckwith8@zdnet.com',
9: 'cburgoin9@reference.com'},
'gender': {0: 'Male',
1: 'Female',
2: 'Female',
3: 'Female',
4: 'Female',
5: 'Female',
6: 'Male',
7: 'Female',
8: 'Polygender',
9: 'Male'},
'ip_address': {0: '8.99.68.120',
1: '188.238.129.48',
2: '87.159.243.249',
3: '66.37.174.94',
4: '233.77.128.104',
5: '190.202.131.98',
6: '84.175.231.196',
7: '140.178.100.5',
8: '81.211.179.167',
9: '31.219.69.206'},
'Boolean': {0: False,
1: False,
2: True,
3: True,
4: False,
5: True,
6: True,
7: False,
8: False,
9: False}})
DataFrame 2:
pd.DataFrame({'id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10},
'Model': {0: 2005,
1: 2007,
2: 2011,
3: 2003,
4: 1998,
5: 1992,
6: 1992,
7: 1992,
8: 2008,
9: 1996},
'Make': {0: 'Cadillac',
1: 'Lexus',
2: 'Dodge',
3: 'Dodge',
4: 'Oldsmobile',
5: 'Volkswagen',
6: 'Chevrolet',
7: 'Suzuki',
8: 'Ford',
9: 'Mazda'},
'Colour': {0: 'Red',
1: 'Red',
2: 'Crimson',
3: 'Red',
4: 'Purple',
5: 'Crimson',
6: 'Red',
7: 'Aquamarine',
8: 'Puce',
9: 'Maroon'}})
The two dataframes should be connected based only on index values found in both dataframes. That means any index values that don't match between the two dataframes should not appear in the final combined/merged dataframe.
I want to ensure that the final dataframe is unique and only captures combinations of columns based on unique index values.
When I try the following code, the output is supposed to be an 'inner join' based on the index found in both dataframes.
final = pd.merge(df1, df2, left_index=True, right_index=True)
However, when I apply the above merge technique to my larger (other) pandas dataframes, many rows end up repeated/duplicated multiple times. When the merging happens a few times with more dataframes, the rows get repeated very frequently, with the same index value.
I am expecting to see one Index value returned per row (with all the column combinations from each dataframe).
I am not sure why this happens. I can confirm that there is nothing wrong with the datasets.
Is there a better technique for merging those two dataframes based only on common index values, while ensuring that I don't repeat any rows (with the same index) in my final dataframe? This merging often creates a giant final CSV file, around 20GB in size, even though the source files are only around 15MB in total.
Any help is much appreciated.
My end output should look like this (please copy and use this as Pandas DF):
pd.DataFrame({'id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10},
'first_name': {0: 'Jennee',
1: 'Dagny',
2: 'Correy',
3: 'Pall',
4: 'Julie',
5: 'Janene',
6: 'Lemmy',
7: 'Coleman',
8: 'Beck',
9: 'Che'},
'last_name': {0: 'Strelitzki',
1: 'Dunsire',
2: 'Wickrath',
3: 'Jopp',
4: 'Gheeraert',
5: 'Gawith',
6: 'Farrow',
7: 'Legging',
8: 'Beckwith',
9: 'Burgoin'},
'email': {0: 'jstrelitzki0@google.de',
1: 'ddunsire1@geocities.com',
2: 'cwickrath2@github.com',
3: 'pjopp3@infoseek.co.jp',
4: 'jgheeraert4@theatlantic.com',
5: 'jgawith5@sciencedirect.com',
6: 'lfarrow6@wikimedia.org',
7: 'clegging7@businessinsider.com',
8: 'bbeckwith8@zdnet.com',
9: 'cburgoin9@reference.com'},
'gender': {0: 'Male',
1: 'Female',
2: 'Female',
3: 'Female',
4: 'Female',
5: 'Female',
6: 'Male',
7: 'Female',
8: 'Polygender',
9: 'Male'},
'ip_address': {0: '8.99.68.120',
1: '188.238.129.48',
2: '87.159.243.249',
3: '66.37.174.94',
4: '233.77.128.104',
5: '190.202.131.98',
6: '84.175.231.196',
7: '140.178.100.5',
8: '81.211.179.167',
9: '31.219.69.206'},
'Boolean': {0: False,
1: False,
2: True,
3: True,
4: False,
5: True,
6: True,
7: False,
8: False,
9: False},
'Model': {0: 2005,
1: 2007,
2: 2011,
3: 2003,
4: 1998,
5: 1992,
6: 1992,
7: 1992,
8: 2008,
9: 1996},
'Make': {0: 'Cadillac',
1: 'Lexus',
2: 'Dodge',
3: 'Dodge',
4: 'Oldsmobile',
5: 'Volkswagen',
6: 'Chevrolet',
7: 'Suzuki',
8: 'Ford',
9: 'Mazda'},
'Colour': {0: 'Red',
1: 'Red',
2: 'Crimson',
3: 'Red',
4: 'Purple',
5: 'Crimson',
6: 'Red',
7: 'Aquamarine',
8: 'Puce',
9: 'Maroon'}})
This is expected behavior with non-unique index values. Since you have 3 ID1 rows in one df and 2 ID1 rows in the other, you end up with 6 ID1 rows in your merged df. If you add validate="one_to_one" to pd.merge(), you will get this error: MergeError: Merge keys are not unique in either left or right dataset; not a one-to-one merge. All other validations fail except many-to-many.
If it makes sense for your data, you can use the left_on and right_on parameters to merge on a combination of columns that is unique, giving you a one-to-one merge if that's what you're after.
Edit after your new data:
Now that you have unique ids, this should work for you. Notice it doesn't throw a validation error.
final = pd.merge(df1, df2, left_on=['id'], right_on=['id'], validate='one_to_one')
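If your real data still has duplicated ids on both sides and you only want one row per id, a further option (a sketch, with the caveat that it silently keeps the first occurrence of each id) is to de-duplicate the key before merging, so the inner join cannot multiply rows:
# Keep the first occurrence of each id on both sides, then inner-join on id
final = pd.merge(df1.drop_duplicates(subset='id'),
                 df2.drop_duplicates(subset='id'),
                 on='id', validate='one_to_one')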

I am trying to map the countries on the world map. The problem is that only the USA is being shown in the output

fig = px.scatter_geo(df, locations="country", color = "country",
projection="natural earth")
fig.show()
On the output side, I am able to get the world map, and all the countries appear in the legend. The problem is that the countries are not shown on the map.
Here is the snap of the sample data:
{'id': {0: '72b83200-4881-4806-b910-af86905256c4',
1: '5db5df19-c06b-489a-b2f4-c2ffc26643ba',
2: '6c9e4f0d-ef87-497f-97af-df207a25331d',
3: '004bf779-368d-47ae-b3cc-07b0ecad2464',
4: '8a2265d9-1f81-4c47-953f-0d4bfab326c0'},
'name': {0: 'BALCO BRANDS PTY LTD',
1: 'Bambury',
2: 'Bata Shoe Company of Australia',
3: 'Bean Body Care',
4: 'Caprice Australia '},
'canonical_name': {0: 'balcobrands',
1: 'bambury',
2: 'batashoecompanyofaustralia',
3: 'beanbodycare',
4: 'capriceaustralia'},
'url': {0: 'http://www.balcobrands.com',
1: 'http://www.bambury.com.au',
2: 'http://www.bataindustrials.com.au',
3: 'https://global.beanbodycare.com',
4: 'http://www.caprice.com.au'},
'type': {0: 3, 1: 3, 2: 3, 3: 3, 4: 3},
'address': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'city': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'state': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'country': {0: 'Australia',
1: 'Australia',
2: 'Australia',
3: 'Australia',
4: 'Australia'},
'country_code': {0: 'AU', 1: 'AU', 2: 'AU', 3: 'AU', 4: 'AU'},
'created_at': {0: '2020-04-01 20:52:38.098099',
1: '2020-04-01 20:52:38.364935',
2: '2020-04-01 20:52:38.636768',
3: '2020-04-01 20:52:38.951573',
4: '2020-04-01 20:52:39.271376'},
'created_by': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'updated_at': {0: '2020-04-01 20:52:38.098099',
1: '2020-04-01 20:52:38.364935',
2: '2020-04-01 20:52:38.636768',
3: '2020-04-01 20:52:38.951573',
4: '2020-04-01 20:52:39.271376'},
'updated_by': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan}}
The data did not contain three-letter (ISO-3) country codes, which px.scatter_geo expects by default for locations. When the data was merged with another dataset having three-letter country codes, the required output was obtained.
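A minimal sketch of both routes; codes_df and its iso_alpha column are illustrative assumptions, not part of the original data:
import plotly.express as px
# Option 1: merge in ISO-3 codes (codes_df with columns ['country', 'iso_alpha'] is assumed)
df = df.merge(codes_df, on='country', how='left')
fig = px.scatter_geo(df, locations='iso_alpha', color='country', projection='natural earth')
fig.show()
# Option 2: skip the merge and plot by full country name, as the choropleth example above does
fig = px.scatter_geo(df, locations='country', locationmode='country names', color='country', projection='natural earth')
fig.show()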
