Calculation in a dataframe column - python

I have a dataframe similar to:
City %SC Team
0 London 50.5 A
1 London 40.1 B
2 London 9.4 C
3 Birmingham 31.3 B
4 Birmingham 27.1 A
5 Birmingham 23.7 D
6 Birmingham 17.9 C
7 York 40.1 A
8 York 38.8 C
9 York 21.1 B
...
I want to classify the cities as Clear win, Marginal win, or Extremely Marginal win based on the difference between the top two teams.
I have tried the following code:
df = pd.read_csv('file.csv')
Clear, Marginal, EMarginal = [], [], []
for i in file['%SC']:
    if i[0] - i[1] >= 10:
        Clear.append('City','Team')
    elif i[0] - i[1] < 10 and i[0] - i[1] >= 2:
        Marginal.append('City','Team')
    else:
        EMarginal.append('City','Team')
Expected output:
Clear = [London , A]
Marginal = [Birmingham , B]
EMarginal = [York , A]
My approach doesn't seem right; can anyone suggest a way I could achieve the desired result? Many thanks.

If I understand correctly, you want to divide the cities into groups according to the first two teams.
def classify(city):
    largest = city.nlargest(2, '%SC')
    diff = largest['%SC'].iloc[0] - largest['%SC'].iloc[1]
    if diff >= 10:
        return 'Clear'
    elif diff >= 2:  # diff < 10 is already guaranteed here
        return 'Marginal'
    return 'EMarginal'
groups = df.groupby("City").apply(classify)
# groups is the following table:
# City
# Birmingham Marginal
# London Clear
# York EMarginal
# dtype: object
If you insist on having them as a list, you can call
groups.groupby(groups).apply(lambda g: list(g.index)).to_dict()
# Output:
# {'Clear': ['London'], 'EMarginal': ['York'], 'Marginal': ['Birmingham']}
If you further want to include the winning team in each city, you can call
groups.name = "Margin"
df.join(groups, on="City")\
  .groupby("Margin")\
  .apply(
      lambda g: list(
          g.nlargest(1, "%SC")
           .apply(lambda row: (row["City"], row["Team"]), axis=1)
      )
  ).to_dict()
# Output
# {'Clear': [('London', 'A')], 'EMarginal': [('York', 'A')], 'Marginal': [('Birmingham', 'B')]}
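One caveat (an observation, not part of the answer above): if a city has only one team, nlargest(2, '%SC') returns a single row, so iloc[1] raises an IndexError. A guarded variant of the sketch:
def classify(city):
    largest = city.nlargest(2, '%SC')
    if len(largest) < 2:
        return 'Clear'  # a lone team wins unopposed; treating this as a clear win is an assumption
    diff = largest['%SC'].iloc[0] - largest['%SC'].iloc[1]
    if diff >= 10:
        return 'Clear'
    elif diff >= 2:
        return 'Marginal'
    return 'EMarginal'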

Related

LEFT ON Case When in Pandas

I wanted to ask: in SQL I can do JOIN ON CASE WHEN; is there a way to do this in Pandas?
disease = [
    {"City":"CH","Case_Recorded":5300,"Recovered":2839,"Deaths":2461},
    {"City":"NY","Case_Recorded":1311,"Recovered":521,"Deaths":790},
    {"City":"TX","Case_Recorded":1991,"Recovered":1413,"Deaths":578},
    {"City":"AT","Case_Recorded":3381,"Recovered":3112,"Deaths":269},
    {"City":"TX","Case_Recorded":3991,"Recovered":2810,"Deaths":1311},
    {"City":"LA","Case_Recorded":2647,"Recovered":2344,"Deaths":303},
    {"City":"LA","Case_Recorded":4410,"Recovered":3344,"Deaths":1066}
]
region = {"North": ["AT"], "West": ["TX","LA"]}
So what I have is two dummy dicts that I have already converted to DataFrames. The first holds the names of the cities with their cases, and I'm trying to figure out which region each city belongs to:
Region|City
North|AT
West|TX
West|LA
None|NY
None|CH
What I thought of in SQL was a LEFT JOIN with CASE WHEN: if the result is null when joining with the North region, then join with the West region. But if there are 15 or 30 regions in some country, I think that would be a problem.
Use:
#get City without duplicates
df1 = pd.DataFrame(disease)[['City']].drop_duplicates()
#create DataFrame from region dictionary
region = {"North": ["AT"], "West":["TX","LA"]}
df2 = pd.DataFrame([(k, x) for k, v in region.items() for x in v],
                   columns=['Region','City'])
#append not matched cities to df2
out = pd.concat([df2, df1[~df1['City'].isin(df2['City'])]])
print (out)
Region City
0 North AT
1 West TX
2 West LA
0 NaN CH
1 NaN NY
If order is not important:
out = df2.merge(df1, how = 'right')
print (out)
Region City
0 NaN CH
1 NaN NY
2 West TX
3 North AT
4 West LA
I'm sorry, I'm not exactly sure what your expected result is; can you elaborate? If your expected result is just getting each city's region, there is no need for conditional joining. For example, you can transform the city-region table into one row per city per region and directly join it with the main df:
disease = [
    {"City":"CH","Case_Recorded":5300,"Recovered":2839,"Deaths":2461},
    {"City":"NY","Case_Recorded":1311,"Recovered":521,"Deaths":790},
    {"City":"TX","Case_Recorded":1991,"Recovered":1413,"Deaths":578},
    {"City":"AT","Case_Recorded":3381,"Recovered":3112,"Deaths":269},
    {"City":"TX","Case_Recorded":3991,"Recovered":2810,"Deaths":1311},
    {"City":"LA","Case_Recorded":2647,"Recovered":2344,"Deaths":303},
    {"City":"LA","Case_Recorded":4410,"Recovered":3344,"Deaths":1066}
]
region = [
    {'City':'AT','Region':"North"},
    {'City':'TX','Region':"West"},
    {'City':'LA','Region':"West"}
]
df = pd.DataFrame(disease)
df_reg = pd.DataFrame(region)
df.merge(df_reg, on='City', how='left')

Splitting a csv into multiple csv of maximum 2000 rows while respecting grouping condition using Python

This is my very first question...
I'm trying to split a big CSV into files of at most 2000 rows each. If it were just splitting, it would be too easy; in this case I can't simply divide the CSV, because some rows need to be grouped together. No file can be bigger than 2000 rows (it can be smaller), and rows that need to be together must end up in the same file. Rows that need to be together share the same combination of values in two columns; that's how I know they belong together.
Example with 10 records and split CSVs of maximum 5 rows:
Country  Category  Product
Spain    A         1
Spain    A         2
Spain    A         3
Spain    B         4
Spain    B         5
Spain    B         6
Spain    B         7
Italy    B         8
Germany  A         9
Germany  A         10
Here all the rows having the same combination of Country and Category need to be together. If the maximum size of the split file is 5, we get the following:
Country  Category  Product
Spain    A         1
Spain    A         2
Spain    A         3

Country  Category  Product
Spain    B         4
Spain    B         5
Spain    B         6
Spain    B         7
Italy    B         8

Country  Category  Product
Germany  A         9
Germany  A         10
Any idea how I could solve this?
Thanks!!
You can find group sizes, then calculate which group should be the last in each file (the point at which it overflows the given maximum number of rows), then convert it to file numbers and group by file numbers to save individual CSVs.
In code this would look like the following (please see comments for explanation):
# set max records per file
N = 5
# find counts per group
z = df.groupby(['Country', 'Category'], sort=False).size().reset_index()
# mark with `x` = 1 each group that starts a new file
# (i.e. the point at which the running row count would overflow N)
i = 0
for j in range(len(z)):
    if z.loc[i:j, 0].sum() > N:
        z.loc[j, 'x'] = 1
        i = j
# calculate the file number `f` as the cumsum of `x`
z['f'] = z['x'].fillna(0).cumsum().astype(int) + 1
# merge df and z to get the file number for each record,
# then group by it and save to separate CSV files
for f, df_g in df.merge(z[['Country', 'Category', 'f']]).groupby('f'):
    df_g.drop(columns='f').to_csv(f'{f:03}.csv', index=False)
This would save your sample DataFrame into 3 files:
001.csv
Country Category Product
0 Spain A 1
1 Spain A 2
2 Spain A 3
002.csv
Country Category Product
0 Spain B 4
1 Spain B 5
2 Spain B 6
3 Spain B 7
4 Italy B 8
003.csv
Country Category Product
0 Germany A 9
1 Germany A 10
Another, more manual approach is to filter a subset for each group. Example:
Country Category Product
0 Spain a 1
1 Belgium b 2
2 Spain a 2
3 Cuba c 3
4 Belgium c 4
5 Cuba a 5
new_df = df[df['Country']=='Spain']
Country Category Product
0 Spain a 1
2 Spain a 2
Then write the subset to a new CSV file, and do the same for the other Country/Category combinations (see the sketch below).
new_df.to_csv(file)
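A minimal sketch of doing that for every group at once, assuming the data is already loaded into a DataFrame df with Country, Category and Product columns (the file names are illustrative, and this alone does not enforce the maximum-rows cap; oversized groups would still need chunking):
import pandas as pd

df = pd.read_csv('file.csv')  # assumed input file

# Write one CSV per (Country, Category) combination.
for n, (key, group) in enumerate(df.groupby(['Country', 'Category'], sort=False), start=1):
    group.to_csv(f'subset_{n}.csv', index=False)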
This is not an optimal solution but it is an easy and tractable solution that is likely good enough.
import csv
from itertools import groupby

with open(fn) as f:
    reader = csv.reader(f)
    header = next(reader)
    data = [row for row in reader]

def key_func(li):
    return (li[1], li[0], li[2])

data_dic = {}
# If you only want a single country in a file, change the following line to
# for k, v in groupby(sorted(data, key=key_func), key=lambda li: (li[0], li[1])):
for k, v in groupby(sorted(data, key=key_func), key=lambda li: li[1]):
    data_dic[k] = list(v)

chnk = 5  # change this for the max lines per file
cnt = 1
for k, v in data_dic.items():
    for chunk in (v[i:i+chnk] for i in range(0, len(v), chnk)):
        print(f'\n=== file {cnt}:')
        cnt += 1
        print('\n'.join([','.join(e) for e in [header] + chunk]))
With your example, prints:
=== file 1:
Country,Category,Product
Germany,A,10
Germany,A,9
Spain,A,1
Spain,A,2
Spain,A,3
=== file 2:
Country,Category,Product
Italy,B,8
Spain,B,4
Spain,B,5
Spain,B,6
Spain,B,7
With this input:
Country,Category,Product
Spain,A,1
Spain,A,2
Spain,A,3
Spain,A,4
Spain,B,4
Spain,B,5
Spain,B,6
Spain,B,7
Spain,B,8
Italy,B,8
Germany,A,9
Germany,A,10
Germany,A,11
Cuba,C,22
Prints:
=== file 1:
Country,Category,Product
Germany,A,10
Germany,A,11
Germany,A,9
Spain,A,1
Spain,A,2
=== file 2:
Country,Category,Product
Spain,A,3
Spain,A,4
=== file 3:
Country,Category,Product
Italy,B,8
Spain,B,4
Spain,B,5
Spain,B,6
Spain,B,7
=== file 4:
Country,Category,Product
Spain,B,8
=== file 5:
Country,Category,Product
Cuba,C,22

Groupby and fill data into a text template in Python

Given a small dataset as follow:
id city price area
0 1 a 12 6
1 2 a 3 7
2 3 a 3 8
3 4 b 2 9
4 5 b 5 6
I would like to groupby city and fill the data into a text template as follows:
For 【】 city, it has 【】district, 【】district and 【】district, the price is respectively 【】dollars, 【】dollars and【】dollar, the area sold is respectively 【】㎡,【】㎡ and 【】㎡.
Code:
df.groupby('city')['district'].apply(list)
Out[14]:
city
bj [hd, cy, tz]
sh [hp, pd]
Name: district, dtype: object
df.groupby('city')['price'].apply(list)
Out[15]:
city
bj [12, 3, 3]
sh [2, 5]
Name: price, dtype: object
df.groupby('city')['area'].apply(list)
Out[16]:
city
bj [6, 7, 8]
sh [9, 6]
Name: area, dtype: object
For example, the result will be like this:
For bj city, it has hd district, cy district and tz district, the price is respectively 12 dollars, 3 dollars and 3 dollar, the area sold is respectively 6 ㎡,7 ㎡ and 8 ㎡.
Is it possible to get an approximate result (it does not have to be exactly the same) as above with Python? Many thanks in advance for any Python or Pandas master's kind help.
df = pd.DataFrame(
    dict(id=[1, 2, 3, 4, 5],
         city=list("a" * 3) + list("b" * 2),
         district=["hd", "cy", "tz", "hp", "pd"],
         price=[12, 3, 3, 2, 5],
         area=[6, 7, 8, 9, 6])
)

# We can set a few initial variables to help the process out.
target = ["city"]
ignore = ["id"]

# This will produce -> ['district', 'price', 'area']
groupers = [i for i in df.columns if i not in tuple(target + ignore)]

# Iterate through each unique city value.
for c in df["city"].unique():
    # Start our message.
    msg = f"For city '{c}',"  # I tweaked the formatting here.
    # Subset of data based on target variable (in this case, 'city').
    # Use `.drop_duplicates()` to retain unique rows.
    dft = df.loc[df["city"] == c, groupers].drop_duplicates()
    # --- OR, the following to use the `target` variable value. --- #
    # dft = df.loc[df[target[0]] == c, groupers].drop_duplicates()
    # Iterate a transposed index.
    for idx in dft.T.index:
        # Make a quick value variable for clarity.
        vals = dft.T.loc[idx].values
        # Do some ad hoc formatting for each specific field, if required.
        # Add your desired message start to the respective variable.
        # `field` will be what is output to the message string.
        if idx == "price":
            msg_start = "the price is respectively "
            field = "dollars"
        elif idx == "area":
            msg_start = "the area sold is respectively "
            field = "m\u00b2"
        else:
            msg_start = " it has\n"
            field = idx
        # Add the message start section.
        msg += msg_start
        # Use .join(), prefixing "and" to any value equal to the last one
        # (note: duplicates of the last value also get "and", as in the output below).
        msg += ", ".join([f"{i} {field}" if i != vals[-1] else f"and {i} {field}" for i in vals])
        # Add a newline for separation between each set of items.
        msg += "\n"
    print(msg)
Output:
For city 'a', it has
hd district, cy district, and tz district
the price is respectively 12 dollars, and 3 dollars, and 3 dollars
the area sold is respectively 6 m², 7 m², and 8 m²
For city 'b', it has
hp district, and pd district
the price is respectively 2 dollars, and 5 dollars
the area sold is respectively 9 m², and 6 m²
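For output closer to the question's one-sentence template, here is a more compact sketch along the same lines; it assumes the same df as above, and join_and is a hypothetical helper for "a, b and c"-style joining:
def join_and(items):
    # Hypothetical helper: join items as "a, b and c".
    items = [str(i) for i in items]
    return items[0] if len(items) == 1 else ", ".join(items[:-1]) + " and " + items[-1]

for city, g in df.groupby("city"):
    print(f"For {city} city, it has {join_and(d + ' district' for d in g['district'])}, "
          f"the price is respectively {join_and(str(p) + ' dollars' for p in g['price'])}, "
          f"the area sold is respectively {join_and(str(a) + ' \u33a1' for a in g['area'])}.")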

Multiply columns based on two columns conditions from different dataframes?

I have two dataframes as indicated below:
dfA =
Country City Pop
US Washington 1000
US Texas 5000
CH Geneva 500
CH Zurich 500
dfB =
Country City Density (pop/km2)
US Washington 10
US Texas 50
CH Geneva 5
CH Zurich 5
What I want is to compare the Country and City columns from both dataframes, and when they match, such as:
US Washington & US Washington in both dataframes, take the Pop value and divide it by Density, so as to get a new area column in dfB holding the resulting division. For example, the first row gives dfB['area km2'] = 100.
I have tried np.where() but it is not working. Any hints on how to achieve this?
Using index matching and div:
match_on = ['Country', 'City']
dfA = dfA.set_index(match_on)
dfA.assign(ratio=dfA.Pop.div(dfB.set_index(match_on)['Density (pop/km2)']))
Country City
US Washington 100.0
Texas 100.0
CH Geneva 100.0
Zurich 100.0
dtype: float64
You can also use merge to combine the two dataframes and divide as usual:
dfMerge = dfA.merge(dfB, on=['Country', 'City'])
dfMerge['area'] = dfMerge['Pop'].div(dfMerge['Density (pop/km2)'])
print(dfMerge)
Output:
Country City Pop Density (pop/km2) area
0 US Washington 1000 10 100.0
1 US Texas 5000 50 100.0
2 CH Geneva 500 5 100.0
3 CH Zurich 500 5 100.0
You can also use merge like below:
dfB["Area"] = dfB.merge(dfA, on=["Country", "City"], how="left")["Pop"] / dfB["Density (pop/km2)"]
dfB

Max value using idxmax

I am trying to calculate the biggest difference between summer gold medal counts and winter gold medal counts relative to their total gold medal count. The problem is that I need to consider only countries that have won at least 1 gold medal in both summer and winter.
Gold: Count of summer gold medals
Gold.1: Count of winter gold medals
Gold.2: Total Gold
This a sample of my data:
Gold Gold.1 Gold.2 ID diff gold %
Afghanistan 0 0 0 AFG NaN
Algeria 5 0 5 ALG 1.000000
Argentina 18 0 18 ARG 1.000000
Armenia 1 0 1 ARM 1.000000
Australasia 3 0 3 ANZ 1.000000
Australia 139 5 144 AUS 0.930556
Austria 18 59 77 AUT 0.532468
Azerbaijan 6 0 6 AZE 1.000000
Bahamas 5 0 5 BAH 1.000000
Bahrain 0 0 0 BRN NaN
Barbados 0 0 0 BAR NaN
Belarus 12 6 18 BLR 0.333333
This is the code that I have but it is giving the wrong answer:
def answer():
Gold_Y = df2[(df2['Gold'] > 1) | (df2['Gold.1'] > 1)]
df2['difference'] = (df2['Gold']-df2['Gold.1']).abs()/df2['Gold.2']
return df2['diff gold %'].idxmax()
answer()
Try this code after subbing in the correct (your) function and variable names. I'm new to Python, but I think the issue was that you had to use the same variable in line 4 (df1['difference']) and just add the method (.idxmax()) to the end. I don't think you need the first line of the function either, since you never use the local variable (Gold_Y). FYI, I don't think we're working with the same dataset.
def answer_three():
    df1['difference'] = (df1['Gold'] - df1['Gold.1']).abs() / df1['Gold.2']
    return df1['difference'].idxmax()

answer_three()
def answer_three():
    # "at least 1 gold medal" in both seasons means > 0, not > 1
    atleast_one_gold = df[(df['Gold'] > 0) & (df['Gold.1'] > 0)]
    return ((atleast_one_gold['Gold'] - atleast_one_gold['Gold.1']) / atleast_one_gold['Gold.2']).idxmax()

answer_three()
def answer_three():
    _df = df[(df['Gold'] > 0) & (df['Gold.1'] > 0)]
    return ((_df['Gold'] - _df['Gold.1']) / _df['Gold.2']).argmax()

answer_three()
This looks like a question from the programming assignment of the Coursera course "Introduction to Data Science in Python".
Having said that, if you are not cheating, "maybe" the bug is here:
Gold_Y = df2[(df2['Gold'] > 1) | (df2['Gold.1'] > 1)]
You should use the & operator. The | operator gives you countries that have won gold in either the Summer or the Winter Olympics, rather than in both. With the correct filter, you should not get a NaN in your diff gold %.
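With that fix applied (and the filtered frame actually used), the function might look like this sketch; it assumes the same df2 as in the question, and uses > 0 since the requirement is at least one gold in each season:
def answer_three():
    # "at least 1 gold in both" -> strictly positive in both columns, combined with &
    both = df2[(df2['Gold'] > 0) & (df2['Gold.1'] > 0)]
    return ((both['Gold'] - both['Gold.1']).abs() / both['Gold.2']).idxmax()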
def answer_three():
    diff = df['Gold'] - df['Gold.1']
    relativegold = diff.abs() / df['Gold.2']
    df['relativegold'] = relativegold
    x = df[(df['Gold.1'] > 0) & (df['Gold'] > 0)]
    return x['relativegold'].idxmax(axis=0)

answer_three()
I am pretty new to Python, and to programming as a whole, so my solution might be the most novice one ever! I love to create variables, so you'll see a lot of them in the solution.
def answer_three():
    # Boolean masking that keeps only the 'Gold' values matching the condition;
    # in this case, countries with at least one gold medal in the Summer Olympics.
    a = df.loc[df['Gold'] > 0, 'Gold']
    # Same as above, but 'Gold.1' is gold medals in the Winter Olympics.
    b = df.loc[df['Gold.1'] > 0, 'Gold.1']
    # The absolute value of the difference between a and b.
    dif = abs(a - b)
    # This step isn't essential, because the DataFrame has already summed
    # these up in the column 'Gold.2'.
    tots = a + b
    # dropna() drops all NaN values (countries missing from either mask).
    result = dif.dropna() / tots.dropna()
    # Return the index label of the max result.
    return result.idxmax()
def answer_two():
    df2 = pd.Series.max(df['Gold'] - df['Gold.1'])
    df2 = df[df['Gold'] - df['Gold.1'] == df2]
    return df2.index[0]

answer_two()
def answer_three():
    return ((df[(df['Gold'] > 0) & (df['Gold.1'] > 0)]['Gold'] - df[(df['Gold'] > 0) & (df['Gold.1'] > 0)]['Gold.1']) / df[(df['Gold'] > 0) & (df['Gold.1'] > 0)]['Gold.2']).argmax()
