Python Empty series when filtering data

I have the following DataFrame:
df = pd.DataFrame({'Fecha': {0: '2022-05-01',
1: '2022-04-24',
2: '2022-04-21',
3: '2022-04-16',
4: '2022-04-10'},
'team': {0: 'América ',
1: 'Tigres UANL ',
2: 'América ',
3: 'Club Tijuana ',
4: 'América '},
'opponent': {0: 'Cruz Azul',
1: 'América',
2: 'León',
3: 'América',
4: 'Juárez'},
'variable': {0: 'xG_for', 1: 'xG_for', 2: 'xG_for', 3: 'xG_for', 4: 'xG_for'},
'value': {0: 1.53, 1: 0.47, 2: 1.4, 3: 0.65, 4: 1.58},
'venue': {0: 'H', 1: 'H', 2: 'H', 3: 'H', 4: 'H'}})
I want to filter the data to create a rolling plot with the following code:
Y_for = df[(df["team"] == "América") & (df["variable"] == "xG_for")]["value"].reset_index(drop = True)
But when I run the code I get an empty series:
Series([], Name: value, dtype: float64)
What am I doing wrong?

== requires an exact match, but the values in the team column have trailing spaces ('América '). Strip them with str.strip:
Y_for = df[(df["team"].str.strip() == "América")
& (df["variable"] == "xG_for")]["value"].reset_index(drop = True)
Y_for
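Alternatively, use str.contains. It matches substrings (and treats the pattern as a regular expression by default), so it would also match any longer team name that contains "América":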
Y_for = df[ df["team"].str.contains("América")
& (df["variable"] == "xG_for")]["value"].reset_index(drop = True)
Y_for
output:
0 1.53
1 1.40
2 1.58
Name: value, dtype: float64
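If the column is going to be filtered more than once, it may be simpler to clean it once up front instead of stripping inside every comparison. The following is a minimal sketch on a cut-down copy of the sample data (only the team, variable and value columns are kept, which is an assumption made for brevity):

```python
import pandas as pd

# Cut-down copy of the sample data, including the trailing spaces in "team".
df = pd.DataFrame({
    "team": ["América ", "Tigres UANL ", "América ", "Club Tijuana ", "América "],
    "variable": ["xG_for"] * 5,
    "value": [1.53, 0.47, 1.4, 0.65, 1.58],
})

# Normalise the column once; later exact comparisons then behave as expected.
df["team"] = df["team"].str.strip()

mask = (df["team"] == "América") & (df["variable"] == "xG_for")
Y_for = df.loc[mask, "value"].reset_index(drop=True)
print(Y_for)
```

Using a single df.loc[mask, column] selection also avoids the chained df[...][...] indexing of the original line.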

Related

Plotly - set decimal place in choropleth

How do you convert the number 1.425887B to 1.4B in a plotly choropleth?
data2022 = dict(type = 'choropleth',
colorscale = 'agsunset',
reversescale = True,
locations = df['Country/Territory'],
locationmode = 'country names',
z = df['2022 Population'],
text = df['CCA3' ],
marker = dict(line = dict(color = 'rgb(12, 12, 12)', width=1)),
colorbar = {'title': 'Population'})
layout2022 = dict(title = '<b>World Population 2022</b>',
geo = dict(showframe = True,
showland = True, landcolor = 'rgb(198, 197, 198)',
showlakes = True, lakecolor = 'rgb(85, 173, 240)',
showrivers = True, rivercolor = 'rgb(173, 216, 230)',
showocean = True, oceancolor = 'rgb(173, 216, 230)',
projection = {'type': 'natural earth'}))
choromap2022 = go.Figure(data=[data2022], layout=layout2022)
choromap2022.update_geos(lataxis_showgrid = True, lonaxis_showgrid = True)
choromap2022.update_layout(height = 600,
title_x = 0.5,
title_font_color = 'red',
title_font_family = 'Times New Roman',
title_font_size = 30,
margin=dict(t=80, r=50, l=50))
iplot(choromap2022)
This is an image of the result I got. I want to display the population of China as 1.4B instead of 1.425887B.
I tried looking it up in the plotly documentation but could not find anything.
This is the output of df.head().to_dict():
{'CCA3': {0: 'AFG', 1: 'ALB', 2: 'DZA', 3: 'ASM', 4: 'AND'},
'Country/Territory': {0: 'Afghanistan',
1: 'Albania',
2: 'Algeria',
3: 'American Samoa',
4: 'Andorra'},
'Capital': {0: 'Kabul',
1: 'Tirana',
2: 'Algiers',
3: 'Pago Pago',
4: 'Andorra la Vella'},
'Continent': {0: 'Asia', 1: 'Europe', 2: 'Africa', 3: 'Oceania', 4: 'Europe'},
'2022 Population': {0: 41128771, 1: 2842321, 2: 44903225, 3: 44273, 4: 79824},
'2020 Population': {0: 38972230, 1: 2866849, 2: 43451666, 3: 46189, 4: 77700},
'2015 Population': {0: 33753499, 1: 2882481, 2: 39543154, 3: 51368, 4: 71746},
'2010 Population': {0: 28189672, 1: 2913399, 2: 35856344, 3: 54849, 4: 71519},
'2000 Population': {0: 19542982, 1: 3182021, 2: 30774621, 3: 58230, 4: 66097},
'1990 Population': {0: 10694796, 1: 3295066, 2: 25518074, 3: 47818, 4: 53569},
'1980 Population': {0: 12486631, 1: 2941651, 2: 18739378, 3: 32886, 4: 35611},
'1970 Population': {0: 10752971, 1: 2324731, 2: 13795915, 3: 27075, 4: 19860},
'Area (km²)': {0: 652230, 1: 28748, 2: 2381741, 3: 199, 4: 468},
'Density (per km²)': {0: 63.0587,
1: 98.8702,
2: 18.8531,
3: 222.4774,
4: 170.5641},
'Growth Rate': {0: 1.0257, 1: 0.9957, 2: 1.0164, 3: 0.9831, 4: 1.01},
'World Population Percentage': {0: 0.52, 1: 0.04, 2: 0.56, 3: 0.0, 4: 0.0}}

This is trickier than it appears. Plotly formats numbers with d3-format, but I believe it adds extra metric abbreviations on top of that, which is why numbers larger than 1000 are displayed by default in a form like 1.425887B.
My original idea was to round to two significant digits in the hovertemplate with something like:
data2022 = dict(..., hovertemplate = "%{z:.2r}<br>%{text}<extra></extra>")
However, this removes the default metric abbreviation and displays the full number: the population of China would show up as 1400000000 instead of 1.4B.
So one possible workaround is to create a new column in your DataFrame called "2022 Population Text" and format the number with a custom function that rounds and abbreviates it (credit goes to @rtaft for their function, which does exactly that). Then you can pass this column to customdata and display customdata in your hovertemplate (instead of z).
import pandas as pd
import plotly.graph_objects as go
data = {'CCA3': {0: 'AFG', 1: 'ALB', 2: 'DZA', 3: 'ASM', 4: 'AND'},
'Country/Territory': {0: 'Afghanistan',
1: 'Albania',
2: 'Algeria',
3: 'American Samoa',
4: 'Andorra'},
'Capital': {0: 'Kabul',
1: 'Tirana',
2: 'Algiers',
3: 'Pago Pago',
4: 'Andorra la Vella'},
'Continent': {0: 'Asia', 1: 'Europe', 2: 'Africa', 3: 'Oceania', 4: 'Europe'},
'2022 Population': {0: 1412000000, 1: 2842321, 2: 44903225, 3: 44273, 4: 79824},
'2020 Population': {0: 38972230, 1: 2866849, 2: 43451666, 3: 46189, 4: 77700},
'2015 Population': {0: 33753499, 1: 2882481, 2: 39543154, 3: 51368, 4: 71746},
'2010 Population': {0: 28189672, 1: 2913399, 2: 35856344, 3: 54849, 4: 71519},
'2000 Population': {0: 19542982, 1: 3182021, 2: 30774621, 3: 58230, 4: 66097},
'1990 Population': {0: 10694796, 1: 3295066, 2: 25518074, 3: 47818, 4: 53569},
'1980 Population': {0: 12486631, 1: 2941651, 2: 18739378, 3: 32886, 4: 35611},
'1970 Population': {0: 10752971, 1: 2324731, 2: 13795915, 3: 27075, 4: 19860},
'Area (km²)': {0: 652230, 1: 28748, 2: 2381741, 3: 199, 4: 468},
'Density (per km²)': {0: 63.0587,
1: 98.8702,
2: 18.8531,
3: 222.4774,
4: 170.5641},
'Growth Rate': {0: 1.0257, 1: 0.9957, 2: 1.0164, 3: 0.9831, 4: 1.01},
'World Population Percentage': {0: 0.52, 1: 0.04, 2: 0.56, 3: 0.0, 4: 0.0}
}
## rounds a number to two significant digits, and adds metric abbreviations
## e.g. 14230000000 --> 14B
## reference: https://stackoverflow.com/a/45846841/5327068
def human_format(num):
    num = float('{:.2g}'.format(num))
    magnitude = 0
    while abs(num) >= 1000:
        magnitude += 1
        num /= 1000.0
    return '{}{}'.format('{:f}'.format(num).rstrip('0').rstrip('.'), ['', 'K', 'M', 'B', 'T'][magnitude])
df = pd.DataFrame(data=data)
df['2022 Population Text'] = df['2022 Population'].apply(lambda x: human_format(x))
data2022 = dict(type = 'choropleth',
colorscale = 'agsunset',
reversescale = True,
locations = df['Country/Territory'],
locationmode = 'country names',
z = df['2022 Population'],
text = df['CCA3'],
customdata = df['2022 Population Text'],
marker = dict(line = dict(color = 'rgb(12, 12, 12)', width=1)),
colorbar = {'title': 'Population'},
hovertemplate = "%{customdata}<br>%{text}<extra></extra>"
)
layout2022 = dict(title = '<b>World Population 2022</b>',
geo = dict(showframe = True,
showland = True, landcolor = 'rgb(198, 197, 198)',
showlakes = True, lakecolor = 'rgb(85, 173, 240)',
showrivers = True, rivercolor = 'rgb(173, 216, 230)',
showocean = True, oceancolor = 'rgb(173, 216, 230)',
projection = {'type': 'natural earth'}))
choromap2022 = go.Figure(data=[data2022], layout=layout2022)
choromap2022.update_geos(lataxis_showgrid = True, lonaxis_showgrid = True)
choromap2022.update_layout(height = 600,
title_x = 0.5,
title_font_color = 'red',
title_font_family = 'Times New Roman',
title_font_size = 30,
margin=dict(t=80, r=50, l=50),
)
choromap2022.show()
Note: Since China wasn't included in your sample data, I changed the population of AFG to 1412000000 to test that the hovertemplate would display it as '1.4B'.
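As a quick sanity check, the formatting helper can be run on its own, without plotly, to see what strings end up in customdata. The sketch below repeats human_format from the answer and applies it to a few of the sample populations; the choice of test values is mine.

```python
def human_format(num):
    # Round to two significant digits, then scale down by thousands.
    num = float('{:.2g}'.format(num))
    magnitude = 0
    while abs(num) >= 1000:
        magnitude += 1
        num /= 1000.0
    # Drop trailing zeros and a dangling decimal point before adding the suffix.
    text = '{:f}'.format(num).rstrip('0').rstrip('.')
    return '{}{}'.format(text, ['', 'K', 'M', 'B', 'T'][magnitude])

samples = [1412000000, 44903225, 2842321, 44273]
labels = [human_format(n) for n in samples]
print(labels)  # ['1.4B', '45M', '2.8M', '44K']
```

Note that the suffix list stops at 'T', so values of 1e15 or more would raise an IndexError in this version.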

Resample for each group hourly with duration value spent on each hour

As in the title, I want to resample the data hourly for each unique group, so that Task_Duration_Hours represents the time spent on the task within each hourly bucket (elapsed until the hour is complete), for each assignment, task name and agent name.
I have the following df:
import numpy as np
import pandas as pd
from datetime import datetime, timedelta
df = pd.DataFrame( {'Assignement': {0: 'a', 1: 'a', 2: 'a', 3: 'a', 4: 'b', 5: 'b'},
'Task_Name': {0: 'Task1', 1: 'Task1', 2: 'Task1', 3: 'Task1', 4: 'Task2', 5: 'Task2'},
'Agent_Name': {0: 'name1', 1: 'name1', 2: 'name1', 3: 'name1', 4: 'name2', 5: 'name2'},
'Task_Duration_Hours': {0: 0.5, 1: 0.5, 2: 1.0, 3: 1.0, 4: 1.5, 5: 1.5},
'id': {0: 1, 1: 1, 2: 2, 3: 2, 4: 3, 5: 3},
'variable': {0: 'Task_Start_DateTime', 1: 'Task_End_DateTime', 2: 'Task_Start_DateTime', 3: 'Task_End_DateTime', 4: 'Task_Start_DateTime', 5: 'Task_End_DateTime'},
'datetime': {0: '2022-04-01 00:05:00', 1: '2022-04-01 00:35:00', 2: '2022-04-01 00:00:00', 3: '2022-04-01 01:00:00', 4: '2022-04-01 01:30:00', 5: '2022-04-01 03:00:00'}} )
df['datetime'] = pd.to_datetime(df['datetime'].astype(str).str.strip(), format="%Y-%m-%d %H:%M:%S")
Desired df:
from pandas import Timestamp
df1 = pd.DataFrame( {'Task_Duration_Hours': {
(Timestamp('2022-04-01 00:00:00'), 'a', 'Task1', 'name1'): 0.585,
(Timestamp('2022-04-01 01:00:00'), 'a', 'Task1', 'name1'): 0.915,
(Timestamp('2022-04-01 01:00:00'), 'b', 'Task2', 'name2'): 0.5,
(Timestamp('2022-04-01 02:00:00'), 'b', 'Task2', 'name2'): 1}} )
What I tried until now:
df = df[df['variable']=='Task_Start_DateTime']
df.drop('variable', axis=1, inplace=True)
df.drop('id', axis=1, inplace=True)
df.set_index(['datetime','Assignement', 'Task_Name','Agent_Name'], inplace = True)
df = df.groupby([pd.Grouper(level='datetime', freq='1h'),
pd.Grouper(level='Assignement'),
pd.Grouper(level='Task_Name'),
pd.Grouper(level='Agent_Name')]).sum()
Result:
df = pd.DataFrame( {'Task_Duration_Hours': {
(Timestamp('2022-04-01 00:00:00'), 'a', 'Task1', 'name1'): 0.75,
(Timestamp('2022-04-01 01:00:00'), 'b', 'Task2', 'name2'): 1.5}} )
The duration is not divided across the hourly buckets. The logic here sums the duration of the tasks that started in that hour, instead of summing the elapsed duration until the hour is complete for each task, name assig...
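One way to read the "elapsed until the hour is complete" logic is to cut each task's start-to-end interval at every hour boundary and credit each piece to the hour it falls in. The sketch below follows that reading only. It assumes the long frame has already been reshaped to one row per task with hypothetical start and end columns, and it does not attempt to reproduce the 0.585/0.915 split in the desired frame, whose derivation the question does not spell out.

```python
import pandas as pd

def split_by_hour(start, end):
    """Yield (hour_bucket, hours_spent) pieces covering the interval [start, end)."""
    cursor = start
    while cursor < end:
        bucket = cursor.floor("h")
        piece_end = min(end, bucket + pd.Timedelta(hours=1))
        yield bucket, (piece_end - cursor) / pd.Timedelta(hours=1)
        cursor = piece_end

# Hypothetical wide layout: one row per task id with its start and end time.
tasks = pd.DataFrame({
    "Assignement": ["a", "a", "b"],
    "Task_Name": ["Task1", "Task1", "Task2"],
    "Agent_Name": ["name1", "name1", "name2"],
    "start": pd.to_datetime(["2022-04-01 00:05:00", "2022-04-01 00:00:00", "2022-04-01 01:30:00"]),
    "end": pd.to_datetime(["2022-04-01 00:35:00", "2022-04-01 01:00:00", "2022-04-01 03:00:00"]),
})

keys = ["datetime", "Assignement", "Task_Name", "Agent_Name"]
rows = []
for task in tasks.itertuples(index=False):
    for bucket, hours in split_by_hour(task.start, task.end):
        rows.append((bucket, task.Assignement, task.Task_Name, task.Agent_Name, hours))

# Sum the pieces that land in the same hour for the same group.
hourly = pd.DataFrame(rows, columns=keys + ["Task_Duration_Hours"]).groupby(keys).sum()
print(hourly)
```

For the Task2 row (01:30 to 03:00) this yields 0.5 hours in the 01:00 bucket and 1.0 hour in the 02:00 bucket, which matches the group b rows of the desired frame.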

In python pandas, count the integers in a particular column and also count all the elements in particular column

I have a huge df with multiple columns, but I only want to read one specific column that interests me.
In the data below, I would like to read only the column 'Type 1':
import numpy as np
import pandas as pd
data = {'Type 1': {0: 1, 1: 3, 2: 5, 3: 'HH', 4: 9, 5: 11, 6: 13, 7: 15, 8: 17},
'Type 2': {0: 'AA',
1: 'BB',
2: 'np.NaN',
3: '55',
4: '3.14',
5: '-96',
6: 'String',
7: 'FFFFFF',
8: 'FEEE'},
'Type 3': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0},
'Type 4': {0: '23',
1: 'fefe',
2: 'abcd',
3: 'dddd',
4: 'dad',
5: 'cfe',
6: 'cf42',
7: '321',
8: '0'},
'Type 5': {0: -120,
1: -120,
2: -120,
3: -120,
4: -120,
5: -120,
6: -120,
7: -120,
8: -120}}
df = pd.DataFrame(data)
df
int_count = df['Type 1'].count(0,numeric_only = True) # should count only cells that contain integers and return 8
total_count = df['Type 1'].count(0,numeric_only = False) # should count all the cells and return 9
I want to count only the numeric values in a particular column, e.g.:
df['Type 1'].count(0, numeric_only = True) should return 8 (excluding the string 'HH' in the Type 1 column)
df['Type 1'].count(0, numeric_only = False) should return 9 (the total number of cells in that column)
But df['Type 1'].count(0, numeric_only = True/False) does not work as I expect...

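I would suggest the following. Note that str.isnumeric is only true for strings made up entirely of numeric characters, so negative numbers and decimals would not be counted: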
int_count = len(df.loc[df['Type 1'].astype(str).str.isnumeric()])
total_count = len(df)
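If negative numbers or floats should count as numeric too, a possible alternative is pd.to_numeric with errors='coerce', which turns anything unparseable into NaN. This is a sketch on the 'Type 1' column only, not a drop-in replacement for the line above:

```python
import pandas as pd

# Only the 'Type 1' column from the question's sample data.
df = pd.DataFrame({"Type 1": [1, 3, 5, "HH", 9, 11, 13, 15, 17]})

# Values that cannot be parsed as numbers become NaN and are not counted.
numeric = pd.to_numeric(df["Type 1"], errors="coerce")
int_count = int(numeric.notna().sum())
total_count = len(df["Type 1"])

print(int_count, total_count)  # 8 9
```

Be aware that numeric-looking strings such as '55' are also parsed successfully by pd.to_numeric, so they would be counted here.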

Python - making scatterplot from non-numeric values

I have a csv file with data that I have imported into a dataframe.
RI_df = pd.read_csv("../Week15/police.csv")
Using .head() my data looks like this:
state stop_date stop_time county_name driver_gender driver_race violation_raw violation search_conducted search_type stop_outcome is_arrested stop_duration drugs_related_stop district
0 RI 2005-01-04 12:55 NaN M White Equipment/Inspection Violation Equipment False NaN Citation False 0-15 Min False Zone X4
1 RI 2005-01-23 23:15 NaN M White Speeding Speeding False NaN Citation False 0-15 Min False Zone K3
2 RI 2005-02-17 04:15 NaN M White Speeding Speeding False NaN Citation False 0-15 Min False Zone X4
3 RI 2005-02-20 17:15 NaN M White Call for Service Other False NaN Arrest Driver
RI_df.head().to_dict()
Out[55]:
{'state': {0: 'RI', 1: 'RI', 2: 'RI', 3: 'RI', 4: 'RI'},
'stop_date': {0: '2005-01-04',
1: '2005-01-23',
2: '2005-02-17',
3: '2005-02-20',
4: '2005-02-24'},
'stop_time': {0: '12:55', 1: '23:15', 2: '04:15', 3: '17:15', 4: '01:20'},
'county_name': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'driver_gender': {0: 'M', 1: 'M', 2: 'M', 3: 'M', 4: 'F'},
'driver_race': {0: 'White', 1: 'White', 2: 'White', 3: 'White', 4: 'White'},
'violation_raw': {0: 'Equipment/Inspection Violation',
1: 'Speeding',
2: 'Speeding',
3: 'Call for Service',
4: 'Speeding'},
'violation': {0: 'Equipment',
1: 'Speeding',
2: 'Speeding',
3: 'Other',
4: 'Speeding'},
'search_conducted': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0},
'search_type': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'stop_outcome': {0: 'Citation',
1: 'Citation',
2: 'Citation',
3: 'Arrest Driver',
4: 'Citation'},
'is_arrested': {0: False, 1: False, 2: False, 3: True, 4: False},
'stop_duration': {0: '0-15 Min',
1: '0-15 Min',
2: '0-15 Min',
3: '16-30 Min',
4: '0-15 Min'},
'drugs_related_stop': {0: False, 1: False, 2: False, 3: False, 4: False},
'district': {0: 'Zone X4',
1: 'Zone K3',
2: 'Zone X4',
3: 'Zone X1',
4: 'Zone X3'}}
RI_df['drugs_related_stop'].value_counts()
Out[27]:
False 90879
True 862
Name: drugs_related_stop, dtype: int64
I am trying to take the true value counts of "drug related stops" and put them on a line graph, in order to see if "drug related stops" have been increasing over time.
ax = RI_df['drugs_related_stop'].value_counts().plot(kind='line',
figsize=(10,8),
title="Drug stops")
ax.set_xlabel("drug stops")
ax.set_ylabel("number of stops")

You can use groupby().count() to count the rows per stop_date and plot the result:
ax = df.groupby('stop_date', as_index=False).count().plot(kind='line',
figsize=(10,8), title="Drug stops", x='stop_date',
y='district')
Here is the complete code so you can double-check:
import pandas as pd
import numpy as np
df = pd.DataFrame({'state': {0: 'RI', 1: 'RI', 2: 'RI', 3: 'RI', 4: 'RI'},
'stop_date': {0: '2005-01-23',
1: '2005-01-23',
2: '2005-02-17',
3: '2005-02-17',
4: '2005-02-24'},
'stop_time': {0: '12:55', 1: '23:15', 2: '04:15', 3: '17:15', 4: '01:20'},
'county_name': {0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan, 4: np.nan},
'driver_gender': {0: 'M', 1: 'M', 2: 'M', 3: 'M', 4: 'F'},
'driver_race': {0: 'White', 1: 'White', 2: 'White', 3: 'White', 4: 'White'},
'violation_raw': {0: 'Equipment/Inspection Violation',
1: 'Speeding',
2: 'Speeding',
3: 'Call for Service',
4: 'Speeding'},
'violation': {0: 'Equipment',
1: 'Speeding',
2: 'Speeding',
3: 'Other',
4: 'Speeding'},
'search_conducted': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0},
'search_type': {0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan, 4: np.nan},
'stop_outcome': {0: 'Citation',
1: 'Citation',
2: 'Citation',
3: 'Arrest Driver',
4: 'Citation'},
'is_arrested': {0: False, 1: False, 2: False, 3: True, 4: False},
'stop_duration': {0: '0-15 Min',
1: '0-15 Min',
2: '0-15 Min',
3: '16-30 Min',
4: '0-15 Min'},
'drugs_related_stop': {0: False, 1: False, 2: False, 3: False, 4: False},
'district': {0: 'Zone X4',
1: 'Zone K3',
2: 'Zone X4',
3: 'Zone X1',
4: 'Zone X3'}})
ax = df.groupby('stop_date', as_index=False).count().plot(kind='line',
figsize=(10,8), title="Drug stops", x='stop_date',
y='district')
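The groupby above counts every stop per date. To follow only the drug-related stops over time, as the question asks, one option is to sum the boolean column per year. The sketch below uses a small made-up frame (the dates and flags are invented for illustration) and leaves the plotting call as a comment because it needs matplotlib:

```python
import pandas as pd

# Made-up sample: only the two columns the calculation needs.
stops = pd.DataFrame({
    "stop_date": ["2005-01-04", "2005-06-23", "2006-02-17", "2006-02-20", "2007-02-24"],
    "drugs_related_stop": [True, False, True, True, False],
})
stops["stop_date"] = pd.to_datetime(stops["stop_date"])

# True counts as 1 and False as 0, so the sum is the number of drug-related stops.
drug_stops_per_year = (
    stops["drugs_related_stop"].astype(int)
    .groupby(stops["stop_date"].dt.year)
    .sum()
)
print(drug_stops_per_year)

# With matplotlib installed, the yearly counts can be drawn as a line:
# ax = drug_stops_per_year.plot(kind="line", figsize=(10, 8), title="Drug stops")
```

Grouping by year is an assumption. Any other period, such as month, works the same way by changing the grouping key.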

This is what I'm getting with the code below...
ax = df.groupby('stop_date', as_index=False).count().plot(kind='line',
figsize=(10,8), title="Drug stops", x='stop_date',
y='district')

replicate iferror and vlookup in a pandas join

I want to join two dataframes:
df1 = pd.DataFrame({'Banner': {0: 'banner1', 1: 'banner2', 2: 'banner3'},
'Campaign': {0: 'campaign1', 1: 'campaign2', 2: '12345'},
'Country ': {0: 'de', 1: 'it', 2: 'de'},
'Date': {0: '1/1/2016', 1: '2/1/2016', 2: '1/1/2016'},
'Value_1': {0: 10, 1: 5, 2: 20}})
df2 = pd.DataFrame({'Banner': {0: 'banner1', 1: 'banner2', 2: 'banner3', 3: 'banner4', 4: 'banner5'},
'Campaign': {0: 'campaign1',1: 'campaign2', 2: 'none',3: 'campaign4',4: 'campaign5'},
'Country ': {0: 'de', 1: 'it', 2: 'de', 3: 'en', 4: 'en'},
'Date': {0: '1/1/2016', 1: '2/1/2016', 2: '1/1/2016', 3: '3/1/2016', 4: '4/1/2016'},
'Value_2': {0: 5, 1: 10, 2: 15, 3: 20, 4: 25},
'id_campaign': {0: 'none', 1: 'none', 2: '12345', 3: 'none', 4: 'none'}})
edit:
let's even imagine the option:
df1 = pd.DataFrame({'Banner': {0: 'banner1', 1: 'banner2', 2: 'banner3'},
'Campaign': {0: 'campaign1', 1: 'campaign2', 2: '12345'},
'Date': {0: '1/1/2016', 1: '2/1/2016', 2: '1/1/2016'},
'Value_1': {0: 10, 1: 5, 2: 20}})
I have to join df2 and df1 on the keys:
Date
Campaign
Banner
The issue here is that when the match under the key "Campaign" is not found, the key should be switched to field "id_campaign".
I would like to obtain this dataframe:
df_joined = pd.DataFrame({'Banner': {0: 'banner1', 1: 'banner2', 2: 'banner3', 3: 'banner4', 4: 'banner5'},
'Campaign': {0: 'campaign1', 1: 'campaign2', 2: 'none', 3: 'campaign4', 4: 'campaign5'},
'Country ': {0: 'de', 1: 'it', 2: 'de', 3: 'en', 4: 'en'},
'Date': {0: '1/1/2016', 1: '2/1/2016', 2: '1/1/2016', 3: '3/1/2016', 4: '4/1/2016'},
'Value_1': {0: 10, 1: 5, 2: 20, 3: 0, 4: 0},
'Value_2': {0: 5, 1: 10, 2: 15, 3: 20, 4: 25},
'id_campaign': {0: 'none', 1: 'none', 2: '12345', 3: 'none', 4: 'none'}})
any help is really appreciated.

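You can merge twice, first on three keys and then on two, and then fill the unmatched values of Value_1 in df3 with combine_first from the Value_1 column of df4:
# the column is named 'Country ' (with a trailing space) in the sample data;
# errors='ignore' also covers the edited df1 that has no such column
df3 = pd.merge(df2, df1.drop(columns='Country ', errors='ignore'), on=['Date','Campaign','Banner'], how='left')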
df4 = pd.merge(df2, df1, on=['Date','Banner'], how='left')
print (df3)
Banner Campaign Country Date Value_2 id_campaign Value_1
0 banner1 campaign1 de 1/1/2016 5 none 10.0
1 banner2 campaign2 it 2/1/2016 10 none 5.0
2 banner3 none de 1/1/2016 15 12345 NaN
3 banner4 campaign4 en 3/1/2016 20 none NaN
4 banner5 campaign5 en 4/1/2016 25 none NaN
print (df4['Value_1'])
0 10.0
1 5.0
2 20.0
3 NaN
4 NaN
Name: Value_1, dtype: float64
df3['Value_1'] = df3['Value_1'].combine_first(df4['Value_1']).fillna(0).astype(int)
print (df3)
Banner Campaign Country Date Value_2 id_campaign Value_1
0 banner1 campaign1 de 1/1/2016 5 none 10
1 banner2 campaign2 it 2/1/2016 10 none 5
2 banner3 none de 1/1/2016 15 12345 20
3 banner4 campaign4 en 3/1/2016 20 none 0
4 banner5 campaign5 en 4/1/2016 25 none 0
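A variant that follows the "switch the key to id_campaign" wording more literally is to run the second lookup against id_campaign instead of dropping the campaign key. This is a sketch under the assumption that the edited df1 (without the country column) is used, and the names by_campaign, by_id and joined are mine:

```python
import pandas as pd

df1 = pd.DataFrame({"Banner": ["banner1", "banner2", "banner3"],
                    "Campaign": ["campaign1", "campaign2", "12345"],
                    "Date": ["1/1/2016", "2/1/2016", "1/1/2016"],
                    "Value_1": [10, 5, 20]})
df2 = pd.DataFrame({"Banner": ["banner1", "banner2", "banner3", "banner4", "banner5"],
                    "Campaign": ["campaign1", "campaign2", "none", "campaign4", "campaign5"],
                    "Date": ["1/1/2016", "2/1/2016", "1/1/2016", "3/1/2016", "4/1/2016"],
                    "Value_2": [5, 10, 15, 20, 25],
                    "id_campaign": ["none", "none", "12345", "none", "none"]})

# First lookup: match on the regular Campaign key.
by_campaign = df2.merge(df1, on=["Date", "Campaign", "Banner"], how="left")["Value_1"]

# Fallback lookup: match df1's Campaign values against df2's id_campaign.
lookup = df1.rename(columns={"Campaign": "id_campaign"})
by_id = df2.merge(lookup, on=["Date", "id_campaign", "Banner"], how="left")["Value_1"]

# Prefer the first lookup, fall back to the second, and default to 0.
joined = df2.assign(Value_1=by_campaign.combine_first(by_id).fillna(0).astype(int))
print(joined)
```

This relies on the keys in df1 being unique, so that each left merge returns exactly one row per row of df2 in the original order.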
