Simple way to convert a Pandas Series for integer comparison - python

I have the following very simple code, and would like to select all the teams that have a highest_rank of 1.
import pandas as pd
table = pd.read_table('team_rankings.dat')
table.head()
rank team rating highest_rank highest_rating
0 1 Germany 2097 1 2205
1 2 Brazil 2086 1 2161
2 3 Spain 2011 1 2147
3 4 Portugal 1968 2 1991
4 5 Argentina 1967 1 2128
type((table['highest_rank']))
pandas.core.series.Series
table.loc[(table['highest_rank']) < 2]
then gives me a
TypeError: unorderable types: str() < int()
since some highest_rank entries are '-'. Urgh. What's a simple way to perform this (integer) selection?

You can parse the "-" entries as NaN values, which might also help with future tasks.
table = pd.read_table('team_rankings.dat', na_values="-")
See https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
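With the "-" entries read as NaN, the highest_rank column comes back numeric (float, since NaN is present), so the original selection then works directly; a quick sketch under that assumption:
table = pd.read_table('team_rankings.dat', na_values='-')
# NaN rows fail the comparison and are simply excluded
table.loc[table['highest_rank'] < 2]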

Use pd.to_numeric with errors='coerce', i.e.:
df.loc[pd.to_numeric(df['highest_rank'], errors='coerce') < 2]
Output:
rank team rating highest_rank highest_rating
0 1 Germany 2097 1 2205
1 2 Brazil 2086 1 2161
2 3 Spain 2011 1 2147
4 5 Argentina 1967 1 2128
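Note that to_numeric above only coerces inside the comparison; the column itself still holds strings. If you want the numeric version permanently, a small sketch is to overwrite the column once:
df['highest_rank'] = pd.to_numeric(df['highest_rank'], errors='coerce')
df.loc[df['highest_rank'] < 2]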

Related

How to divide multiple columns based on three conditions

This is my dataset, where I have different countries, different models for each country, years, and the price and volume:
data_dic = {
    "Country": [1, 1, 1, 1, 2, 2, 2, 2],
    "Model": ["A", "B", "B", "A", "A", "B", "B", "A"],
    "Year": [2005, 2005, 2020, 2020, 2005, 2005, 2020, 2020],
    "Price": [100, 172, 852, 953, 350, 452, 658, 896],
    "Volume": [4, 8, 9, 10, 12, 6, 8, 9]
}
Country Model Year Price Volume
0 1 A 2005 100 4
4 2 A 2005 350 12
3 1 A 2020 953 10
7 2 A 2020 896 9
1 1 B 2005 172 8
5 2 B 2005 452 6
2 1 B 2020 852 9
6 2 B 2020 658 8
I would like to obtain the following, where 1) the column "Division_Price" is the division of Price between the years 2005 and 2020 for each Country and Model (e.g., Country 1, Model A), and 2) the column "Division_Volume" is the same division for Volume.
data_dic2 = {
    "Country": [1, 1, 1, 1, 2, 2, 2, 2],
    "Model": ["A", "B", "B", "A", "A", "B", "B", "A"],
    "Year": [2005, 2005, 2020, 2020, 2005, 2005, 2020, 2020],
    "Price": [100, 172, 852, 953, 350, 452, 658, 896],
    "Volume": [4, 8, 9, 10, 12, 6, 8, 9],
    "Division_Price": [0.953, 4.95, 4.95, 0.953, 2.56, 1.45, 1.45, 2.56],
    "Division_Volume": [2.5, 1.125, 1.125, 2.5, 1, 1.33, 1.33, 1],
}
print(data_dic2)
Country Model Year Price Volume Division_Price Division_Volume
0 1 A 2005 100 4 0.953 2.500
4 2 A 2005 350 12 2.560 1.000
3 1 A 2020 953 10 0.953 2.500
7 2 A 2020 896 9 2.560 1.000
1 1 B 2005 172 8 4.950 1.125
5 2 B 2005 452 6 1.450 1.330
2 1 B 2020 852 9 4.950 1.125
6 2 B 2020 658 8 1.450 1.330
My whole dataset has up to 50 countries and up to 10 models, with years ranging from 1990 to 2030.
I am still unsure how to account for the three conditions (Country, Year, and Model) so that I can automatically divide the Price and Volume columns based on them.
Thanks!
You can try the following, using df.pivot, df.stack() and df.merge:
>>> df2 = (
...     df.pivot(['Year'], columns=['Model', 'Country'], values=['Price', 'Volume'])
...       .diff().bfill(downcast='infer').abs().stack().stack()
...       .sort_index(level=-1).add_prefix('Difference_')
... )
>>> df2
Difference_Price Difference_Volume
Year Country Model
2005 1 A 853 6
2 A 546 3
2020 1 A 853 6
2 A 546 3
2005 1 B 680 1
2 B 206 2
2020 1 B 680 1
2 B 206 2
>>> df.merge(df2, on=['Country', 'Model', 'Year'], how='right')
Country Model Year Price Volume Difference_Price Difference_Volume
0 1 A 2005 100 4 853 6
1 2 A 2005 350 12 546 3
2 1 A 2020 953 10 853 6
3 2 A 2020 896 9 546 3
4 1 B 2005 172 8 680 1
5 2 B 2005 452 6 206 2
6 1 B 2020 852 9 680 1
7 2 B 2020 658 8 206 2
EDIT:
For your new dataframe, I think the 0.953 should be 9.530 (i.e., 953/100). If so, you can use pct_change and add 1: pct_change gives (new - old) / old, so adding 1 yields the ratio new / old:
>>> df2 = (
...     df.pivot(['Year'], columns=['Model', 'Country'], values=['Price', 'Volume'])
...       .pct_change(1).add(1).bfill(downcast='infer').abs().stack().stack()
...       .sort_index(level=-1).add_prefix('Division_').round(3)
... )
>>> df2
Division_Price Division_Volume
Year Country Model
2005 1 A 9.530 2.500
2 A 2.560 0.750
2020 1 A 9.530 2.500
2 A 2.560 0.750
2005 1 B 4.953 1.125
2 B 1.456 1.333
2020 1 B 4.953 1.125
2 B 1.456 1.333
>>> df.merge(df2, on=['Country', 'Model', 'Year'], how='right')
Country Model Year Price Volume Division_Price Division_Volume
0 1 A 2005 100 4 9.530 2.500
1 2 A 2005 350 12 2.560 0.750
2 1 A 2020 953 10 9.530 2.500
3 2 A 2020 896 9 2.560 0.750
4 1 B 2005 172 8 4.953 1.125
5 2 B 2005 452 6 1.456 1.333
6 1 B 2020 852 9 4.953 1.125
7 2 B 2020 658 8 1.456 1.333
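If the pivot/stack chain feels heavy, a minimal alternative sketch with groupby plus transform gives the same ratios, assuming each (Country, Model) pair has exactly one earlier-year row and one later-year row:
# sort so the later year comes last in each group, then broadcast last/first
g = df.sort_values('Year').groupby(['Country', 'Model'])
df['Division_Price'] = g['Price'].transform(lambda s: s.iloc[-1] / s.iloc[0]).round(3)
df['Division_Volume'] = g['Volume'].transform(lambda s: s.iloc[-1] / s.iloc[0]).round(3)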

Add rows based on column date with two column unique key - pandas

So I have a dataframe like:
Number Country StartDate EndDate
12 US 1/1/2023 12/1/2023
12 Mexico 1/1/2024 12/1/2024
And what I am trying to do is:
Number Country Date
12 US 1/1/2023
12 US 2/1/2023
12 US 3/1/2023
12 US 4/1/2023
12 US 5/1/2023
12 US 6/1/2023
12 US 7/1/2023
12 US 8/1/2023
12 US 9/1/2023
12 US 10/1/2023
12 US 11/1/2023
12 US 12/1/2023
12 Mexico 1/1/2024
12 Mexico 2/1/2024
12 Mexico 3/1/2024
12 Mexico 4/1/2024
12 Mexico 5/1/2024
12 Mexico 6/1/2024
12 Mexico 7/1/2024
12 Mexico 8/1/2024
12 Mexico 9/1/2024
12 Mexico 10/1/2024
12 Mexico 11/1/2024
12 Mexico 12/1/2024
This problem is very similar to Adding rows for each month in a dataframe based on column date
However, that problem only accounts for a single-column unique key. In this example the unique key is Number and Country together.
This is what I am currently doing; however, it only accounts for the one column 'Number', and I need to include both Number and Country, as together they are the unique key.
df1 = pd.concat([pd.Series(r.Number, pd.date_range(start=r.StartDate, end=r.EndDate, freq='MS'))
                 for r in df1.itertuples()]).reset_index().drop_duplicates()
Create the range, then explode:
df['New'] = [pd.date_range(start=x, end=y, freq='MS') for x, y in zip(df.pop('StartDate'), df.pop('EndDate'))]
df = df.explode('New')
Out[54]:
Number Country New
0 12 US 2023-01-01
0 12 US 2023-02-01
0 12 US 2023-03-01
0 12 US 2023-04-01
0 12 US 2023-05-01
0 12 US 2023-06-01
0 12 US 2023-07-01
0 12 US 2023-08-01
0 12 US 2023-09-01
0 12 US 2023-10-01
0 12 US 2023-11-01
0 12 US 2023-12-01
1 12 Mexico 2024-01-01
1 12 Mexico 2024-02-01
1 12 Mexico 2024-03-01
1 12 Mexico 2024-04-01
1 12 Mexico 2024-05-01
1 12 Mexico 2024-06-01
1 12 Mexico 2024-07-01
1 12 Mexico 2024-08-01
1 12 Mexico 2024-09-01
1 12 Mexico 2024-10-01
1 12 Mexico 2024-11-01
1 12 Mexico 2024-12-01
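If you need the exact layout from the question, you can finish by renaming the column and resetting the index:
df = df.rename(columns={'New': 'Date'}).reset_index(drop=True)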

Summing on all previous values of a dataframe in Python

I have data that looks like:
Year Month Region Value
1978 1 South 1
1990 1 North 22
1990 2 South 33
1990 2 Mid W 12
1998 1 South 1
1998 1 North 12
1998 2 South 2
1998 3 South 4
1998 1 Mid W 2
.
.
(rows continue up to 2010)
My end date is 2010, but I want to sum all Values by Region and Month, adding all previous years' values together.
I don't want just a regular cumulative sum but a monthly cumulative sum by region, where Month 1 of Region South is the cumulative total over all previous Month 1s of Region South, etc.
Desired output is something like:
Month Region Cum_Value
1 South 2
2 South 34
3 South 4
.
.
1 North 34
2 North 10
.
.
1 MidW 2
2 MidW 12
Use pd.DataFrame.groupby with pd.DataFrame.cumsum
df1['cumsum'] = df1.groupby(['Month', 'Region'])['Value'].cumsum()
Result:
Year Month Region Value cumsum
0 1978 1 South 1.0 1.0
1 1990 1 North 22.0 22.0
2 1990 2 South 33.0 33.0
3 1990 2 Mid W 12.0 12.0
4 1998 1 South 1.0 2.0
5 1998 1 North 12.0 34.0
6 1998 2 South 2.0 35.0
7 1998 3 South 4.0 4.0
8 1998 1 Mid W 2.0 2.0
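One caveat: cumsum accumulates in row order, so if the rows are not already sorted chronologically, sort by Year first so that "all previous years" add up in the right order:
df1 = df1.sort_values('Year')
df1['cumsum'] = df1.groupby(['Month', 'Region'])['Value'].cumsum()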
Here's another solution that corresponds more closely to your expected output.
df = pd.DataFrame({'Year': [1978, 1990, 1990, 1990, 1998, 1998, 1998, 1998, 1998],
                   'Month': [1, 1, 2, 2, 1, 1, 2, 3, 1],
                   'Region': ['South', 'North', 'South', 'Mid West', 'South', 'North', 'South', 'South', 'Mid West'],
                   'Value': [1, 22, 33, 12, 1, 12, 2, 4, 2]})
#DataFrame Result
Year Month Region Value
0 1978 1 South 1
1 1990 1 North 22
2 1990 2 South 33
3 1990 2 Mid West 12
4 1998 1 South 1
5 1998 1 North 12
6 1998 2 South 2
7 1998 3 South 4
8 1998 1 Mid West 2
Code to run:
df1 = df.groupby(['Month', 'Region']).sum()  # sums Value (and Year) within each Month/Region group
df1 = df1.drop('Year', axis=1)               # the summed Year is meaningless, so drop it
df1 = df1.sort_values(['Month', 'Region'])   # order by Month, then Region
#Final Result
Month Region Value
1 Mid West 2
1 North 34
1 South 2
2 Mid West 12
2 South 35
3 South 4

how to apply unique function and transform and keep the complete columns in the data frame pandas

My goal here is, for each PatientNumber, year, and month, to show their count in the data frame while keeping all the columns.
This is the original data frame:
PatientNumber QT Answer Answerdate year month dayofyear count formula
1 1 transferring No 2017-03-03 2017 3 62 2.0 (1/3)
2 1 preparing food No 2017-03-03 2017 3 62 2.0 (1/3)
3 1 medications Yes 2017-03-03 2017 3 62 1.0 (1/3)
4 2 transferring No 2006-10-05 2006 10 275 3.0 0
5 2 preparing food No 2006-10-05 2006 10 275 3.0 0
6 2 medications No 2006-10-05 2006 10 275 3.0 0
7 2 transferring Yes 2007-4-15 2007 4 105 2.0 2/3
8 2 preparing food Yes 2007-4-15 2007 4 105 2.0 2/3
9 2 medications No 2007-4-15 2007 4 105 1.0 2/3
10 2 transferring Yes 2007-12-15 2007 12 345 1.0 1/3
11 2 preparing food No 2007-12-15 2007 12 345 2.0 1/3
12 2 medications No 2007-12-15 2007 12 345 2.0 1/3
13 2 transferring Yes 2008-10-10 2008 10 280 1.0 (1/3)
14 2 preparing food No 2008-10-10 2008 10 280 2.0 (1/3)
15 2 medications No 2008-10-10 2008 10 280 2.0 (1/3)
16 3 medications No 2008-10-10 2008 12 280 …… ………..
So the desired output should be the same as this, with one more column that shows the count of unique [PatientNumber, year, month] rows: for PatientNumber 1 it shows 1; for PatientNumber 2 it shows 1 in year 2006 and 2 in year 2007.
I applied this code:
data = data.groupby(['Clinic Number', 'year'])['month'].nunique().reset_index(name='counts')
The output of this code looks like:
Clinic Number year counts
0 494383 1999 1
1 494383 2000 2
2 494383 2001 1
3 494383 2002 1
4 494383 2003 1
The output counts are correct, except it does not keep the rest of the fields. I want the complete columns because later I have to do some calculations on them.
Then I tried this code:
data['counts'] = data.groupby(['Clinic Number','year','month'])['month'].transform('count')
Again, it's no good because it does not show the correct count. The output of this code is like this:
Clinic Number Question Text Answer Text ... year month counts
1 3529933 bathing No ... 2011 1 10
2 3529933 dressing No ... 2011 1 10
3 3529933 feeding No ... 2011 1 10
4 3529933 housekeeping No ... 2011 1 10
5 3529933 medications No ... 2011 1 10
Here counts should be 1, because for that patient and that year there is just one month.
Use the following modification to your code:
df['counts'] = df.groupby(['PatientNumber', 'year'])['month'].transform('nunique')
transform returns a Series of the same length as your original dataframe, so you can add it to your dataframe as a column.
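For illustration, a minimal reproduction with made-up values using the question's column names:
import pandas as pd

df = pd.DataFrame({
    'PatientNumber': [1, 1, 2, 2, 2],
    'year': [2017, 2017, 2007, 2007, 2007],
    'month': [3, 3, 4, 4, 12],
})
df['counts'] = df.groupby(['PatientNumber', 'year'])['month'].transform('nunique')
# patient 1 / 2017 has one distinct month -> counts == 1
# patient 2 / 2007 has two distinct months (4 and 12) -> counts == 2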

pandas: if intersection then update dataframe

I have two dataframes:
countries:
Country or Area Name ISO-2 ISO-3
0 Afghanistan AF AFG
1 Philippines PH PHL
2 Albania AL ALB
3 Norway NO NOR
4 American Samoa AS ASM
contracts:
Country Name Jurisdiction Signature year
0 Yemen KY;NO;CA;NO 1999.0
1 Yemen BM;TC;YE 2007.0
2 Congo, CD;CD 2015.0
3 Philippines PH 2009.0
4 Philippines PH;PH 2007.0
5 Philippines PH 2001.0
6 Philippines PH;PH 1997.0
7 Bolivia, Plurinational State of BO;BO 2006.0
I want to:
Check whether the column Jurisdiction in contracts contains at least one two-letter code from the countries' ISO-2 column.
I have tried numerous ways of testing whether there is an intersection, but none of them works. My last try was:
i1 = pd.Index(contracts['Jurisdiction of Incorporation'].str.split(';'))
i2 = pd.Index(countries['ISO-2'])
print i1, i2
i1.intersection(i2)
Which gives me TypeError: unhashable type: 'list'
If at least one of the codes is present, I want to update the contracts dataframe with a new column that will contain just boolean values:
contracts['new column'] = np.where("piece of code that will actually work", 1, 0)
So the desired output would be
Country Name Jurisdiction Signature year new column
0 Yemen KY;NO;CA;NO 1999.0 1
1 Yemen BM;TC;YE 2007.0 0
2 Congo, CD;CD 2015.0 0
3 Philippines PH 2009.0 1
4 Philippines PH;PH 2007.0 1
5 Philippines PH 2001.0 1
6 Philippines PH;PH 1997.0 1
7 Bolivia, Plurinational State of BO;BO 2006.0 0
How can I achieve this?
A bit of a mouthful, but try this:
occurring_iso_2_codes = set(countries['ISO-2'])
contracts['new column'] = contracts.Jurisdiction.apply(
    lambda s: int(bool(set(s.split(';')).intersection(occurring_iso_2_codes))))
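An alternative vectorized sketch (assuming pandas >= 0.25 for Series.explode): split the codes, explode to one code per row, test membership against the set, and collapse back per original row:
codes = set(countries['ISO-2'])
contracts['new column'] = (
    contracts['Jurisdiction']
    .str.split(';')
    .explode()          # one code per row, original index preserved
    .isin(codes)        # True where the code appears in ISO-2
    .groupby(level=0)   # collapse back to one row per contract
    .any()
    .astype(int)
)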
