Groupby and fill data into a text template in Python

Given a small dataset as follows:
   id city district  price  area
0   1    a       hd     12     6
1   2    a       cy      3     7
2   3    a       tz      3     8
3   4    b       hp      2     9
4   5    b       pd      5     6
I would like to group by city and fill the data into a text template as follows:
For 【】 city, it has 【】 district, 【】 district and 【】 district, the price is respectively 【】 dollars, 【】 dollars and 【】 dollars, the area sold is respectively 【】 ㎡, 【】 ㎡ and 【】 ㎡.
Code:
df.groupby('city')['district'].apply(list)
Out[14]:
city
a    [hd, cy, tz]
b    [hp, pd]
Name: district, dtype: object

df.groupby('city')['price'].apply(list)
Out[15]:
city
a    [12, 3, 3]
b    [2, 5]
Name: price, dtype: object

df.groupby('city')['area'].apply(list)
Out[16]:
city
a    [6, 7, 8]
b    [9, 6]
Name: area, dtype: object
For example, the result will be like this:
For a city, it has hd district, cy district and tz district, the price is respectively 12 dollars, 3 dollars and 3 dollars, the area sold is respectively 6 ㎡, 7 ㎡ and 8 ㎡.
Is it possible to get an approximate result (it does not need to be exactly the same) as above with Python? Many thanks in advance for any Python or Pandas masters' kind help.

import pandas as pd

df = pd.DataFrame(
    dict(
        id=[1, 2, 3, 4, 5],
        city=list("a" * 3) + list("b" * 2),
        district=["hd", "cy", "tz", "hp", "pd"],
        price=[12, 3, 3, 2, 5],
        area=[6, 7, 8, 9, 6],
    )
)

# We can set a few initial variables to help the process out.
target = ["city"]
ignore = ["id"]

# This will produce -> ['district', 'price', 'area']
groupers = [i for i in df.columns if i not in tuple(target + ignore)]

# Iterate through each unique city value.
for c in df["city"].unique():
    # Start our message.
    msg = f"For city '{c}',"  # I tweaked the formatting here.
    # Subset of data based on the target variable (in this case, 'city').
    # Use `.drop_duplicates()` to retain unique rows.
    dft = df.loc[df["city"] == c, groupers].drop_duplicates()
    # --- OR, the following to use the `target` variable value. --- #
    # dft = df.loc[df[target[0]] == c, groupers].drop_duplicates()
    # Iterate a transposed index.
    for idx in dft.T.index:
        # Make a quick value variable for clarity.
        vals = dft.T.loc[idx].values
        # Do some ad hoc formatting for each specific field, if required.
        # Add your desired message start to the respective variable.
        # `field` will be what is output to the message string.
        if idx == "price":
            msg_start = "the price is respectively "
            field = "dollars"
        elif idx == "area":
            msg_start = "the area sold is respectively "
            field = "m\u00b2"
        else:
            msg_start = " it has\n"
            field = idx
        # Add the message start section.
        msg += msg_start
        # Use .join(), checking positions so that only the final item
        # gets the "and" prefix (comparing values would mislabel duplicates).
        msg += ", ".join(
            f"and {v} {field}" if n == len(vals) - 1 else f"{v} {field}"
            for n, v in enumerate(vals)
        )
        # Add a newline for separation between each set of items.
        msg += "\n"
    print(msg)
Output:
For city 'a', it has
hd district, cy district, and tz district
the price is respectively 12 dollars, 3 dollars, and 3 dollars
the area sold is respectively 6 m², 7 m², and 8 m²
For city 'b', it has
hp district, and pd district
the price is respectively 2 dollars, and 5 dollars
the area sold is respectively 9 m², and 6 m²
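For reference, here is a more compact sketch of the same idea (my own, not part of the answer above): aggregate every column into per-city lists with a single groupby().agg(list), then fill the sentence with f-strings. join_with_and is a hypothetical helper for the "x, y and z" phrasing.
def join_with_and(items):
    # Hypothetical helper for "x, y and z" phrasing; assumes at least one item.
    items = list(items)
    return items[0] if len(items) == 1 else ", ".join(items[:-1]) + " and " + items[-1]

# Every non-grouped column becomes a per-city list in one shot.
agg = df.groupby("city").agg(list)
for city, row in agg.iterrows():
    districts = join_with_and(f"{d} district" for d in row["district"])
    prices = join_with_and(f"{p} dollars" for p in row["price"])
    areas = join_with_and(f"{a} ㎡" for a in row["area"])
    print(f"For {city} city, it has {districts}, the price is respectively "
          f"{prices}, the area sold is respectively {areas}.")
This prints one sentence per city, matching the question's template almost verbatim.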

Related

LEFT ON Case When in Pandas

I wanted to ask: in SQL I can do a JOIN ON CASE WHEN; is there a way to do this in Pandas?
disease = [
{"City":"CH","Case_Recorded":5300,"Recovered":2839,"Deaths":2461},
{"City":"NY","Case_Recorded":1311,"Recovered":521,"Deaths":790},
{"City":"TX","Case_Recorded":1991,"Recovered":1413,"Deaths":578},
{"City":"AT","Case_Recorded":3381,"Recovered":3112,"Deaths":269},
{"City":"TX","Case_Recorded":3991,"Recovered":2810,"Deaths":1311},
{"City":"LA","Case_Recorded":2647,"Recovered":2344,"Deaths":303},
{"City":"LA","Case_Recorded":4410,"Recovered":3344,"Deaths":1066}
]
region = {"North": ["AT"], "West":["TX","LA"]}
So what I have are two dummy dicts that I have already converted to DataFrames. The first contains the city names with the cases, and I'm trying to figure out which region each city belongs to.
Region|City
North|AT
West|TX
West|LA
None|NY
None|CH
What I thought of in SQL was a LEFT JOIN ON CASE WHEN: if the result is null when joining with the North region, then join with the West region.
But if there are 15 or 30 regions in some country, I think that would be a problem.
Use:
# get City without duplicates
df1 = pd.DataFrame(disease)[['City']].drop_duplicates()

# create DataFrame from region dictionary
region = {"North": ["AT"], "West": ["TX", "LA"]}
df2 = pd.DataFrame([(k, x) for k, v in region.items() for x in v],
                   columns=['Region', 'City'])

# append not matched cities to df2
out = pd.concat([df2, df1[~df1['City'].isin(df2['City'])]])
print(out)
Region City
0 North AT
1 West TX
2 West LA
0 NaN CH
1 NaN NY
If order is not important:
out = df2.merge(df1, how='right')
print(out)
Region City
0 NaN CH
1 NaN NY
2 West TX
3 North AT
4 West LA
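A tiny follow-up sketch (my addition, not part of the answer above): if the output should show the literal None from the question's expected table rather than NaN, fillna covers it:
out['Region'] = out['Region'].fillna('None')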
I'm sorry, I'm not exactly sure what your expected result is; can you explain more? If your expected result is just getting each city's region, there is no need for conditional joining. For example, you can transform the city-region table into one row per city per region and directly left join it with the main df:
disease = [
{"City":"CH","Case_Recorded":5300,"Recovered":2839,"Deaths":2461},
{"City":"NY","Case_Recorded":1311,"Recovered":521,"Deaths":790},
{"City":"TX","Case_Recorded":1991,"Recovered":1413,"Deaths":578},
{"City":"AT","Case_Recorded":3381,"Recovered":3112,"Deaths":269},
{"City":"TX","Case_Recorded":3991,"Recovered":2810,"Deaths":1311},
{"City":"LA","Case_Recorded":2647,"Recovered":2344,"Deaths":303},
{"City":"LA","Case_Recorded":4410,"Recovered":3344,"Deaths":1066}
]
region = [
{'City':'AT','Region':"North"},
{'City':'TX','Region':"West"},
{'City':'LA','Region':"West"}
]
df = pd.DataFrame(disease)
df_reg = pd.DataFrame(region)
df.merge(df_reg, on='City', how='left')
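If the regions start out as the dict from the question, a small sketch (my assumption, not from either answer) can derive that one-row-per-city table instead of typing it out, using Series.explode:
region = {"North": ["AT"], "West": ["TX", "LA"]}
# One row per (Region, City) pair, built straight from the dict.
df_reg = (pd.Series(region, name="City")
            .explode()
            .rename_axis("Region")
            .reset_index())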

Finding the name of the maximum value for each category

I have a groupby object that shows the total price, brand-wise and state-wise, of different car brands:
grouped_a = cars_data.groupby(['brand','state'])
grouped_a['price'].sum()
What function can I use that returns the brand associated with the highest total price in each state? I have tried looping through the groupby object but it doesn't work.
You can groupby to get the max total per state per brand, then merge to find the brand(s) with the highest total. Note you may have more than one brand that has the same total. Here's one way to do it:
row1list = ['Ford', 'California', 100]
row2list = ['Toyota', 'California', 200]
row3list = ['Toyota', 'California', 300]
cars_data = pd.DataFrame([row1list, row2list, row3list], columns=['brand', 'state', 'price'])

df_total_by_br_st = cars_data.groupby(['brand', 'state'], as_index=False).agg({'price': 'sum'})

df_max_by_st = df_total_by_br_st.groupby('state', as_index=False).agg({'price': 'max'})
df_max_by_st = df_max_by_st.rename(columns={'price': 'max_price'})

df_total_by_br_st = df_total_by_br_st.merge(df_max_by_st, on='state', how='left')
df_max_brand_by_state = df_total_by_br_st[df_total_by_br_st['price'] == df_total_by_br_st['max_price']]
print(df_max_brand_by_state)
#     brand       state  price  max_price
# 1  Toyota  California    500        500
idxmax() gives you the row indexes with maximum values.
>>> cars_data
price state brand
0 10 emergency on
1 20 emergency ing
2 15 trance on
3 12 trance ing
>>> max_price_rows = cars_data.groupby(['state'])['price'].idxmax()
>>> max_price_rows
state
emergency 1
trance 2
Name: price, dtype: int64
So rows 1 and 2, i.e. with prices 20 and 15.
If you want the max per state, then pass that result to .loc[]:
>>> cars_data.loc[max_price_rows]
price state brand
1 20 emergency ing
2 15 trance on
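Tying the two answers back to the original total-price question, here is a sketch of mine (assuming the cars_data frame from the first answer): sum per state and brand first, then feed the totals to idxmax. Note that, unlike the merge approach, idxmax keeps only one brand per state when totals tie.
totals = cars_data.groupby(['state', 'brand'], as_index=False)['price'].sum()
# Row index of the highest total within each state.
best = totals.loc[totals.groupby('state')['price'].idxmax()]
print(best)
#         state   brand  price
# 1  California  Toyota    500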

How to find every number that a certain digit is present in, and label based on them

The dataset I have contains an id feature and the total number of purchases; I want to add a new column as a label based on the first 3 or 4 digits of id.
for example
id total
08773338 100
08333777 80
for example
0877 = California
083 = Tokyo
then
id total label
08773338 100 california
08333777 80 tokyo
The idea is to create a dictionary matching the starts of the numbers to labels, and then set the new column with DataFrame.loc where values match by Series.str.startswith:
d = {'0877': 'California', '083': 'Tokyo'}

for k, v in d.items():
    df.loc[df['id'].str.startswith(k), 'label'] = v

print(df)
         id  total       label
0  08773338    100  California
1  08333777     80       Tokyo
You can vectorize this with str.extract and map:
df
id total label
0 08773338 100 California
1 08333777 80 Tokyo
df.dtypes
id string
total int64
label object
dtype: object
d = {'0877': 'California', '083': 'Tokyo'}
p = r'(^{})'.format('|'.join(d.keys()))
df['label'] = df['id'].str.extract(p, expand=False).map(d)
df
id total label
0 08773338 100 California
1 08333777 80 Tokyo
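One caveat with the regex approach (my addition, not from the answer): regex alternation is first-match-wins, so if one key is a prefix of another (say a hypothetical '08' alongside '0877'), the shorter key could shadow the longer one. Sorting the keys by length, longest first, avoids that:
# Longest prefixes first, so '0877' is tried before a hypothetical '08'.
p = r'(^{})'.format('|'.join(sorted(d, key=len, reverse=True)))
df['label'] = df['id'].str.extract(p, expand=False).map(d)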

Calculation in a dataframe column

I have a dataframe similar to:
City %SC Team
0 London 50.5 A
1 London 40.1 B
2 London 9.4 C
3 Birmingham 31.3 B
4 Birmingham 27.1 A
5 Birmingham 23.7 D
6 Birmingham 17.9 C
7 York 40.1 A
8 York 38.8 C
9 York 21.1 B
.
.
.
I want to separate the cities into Clear win, Marginal win and Extremely Marginal win based on the difference between the top 2 teams.
I have tried the following code:
df = pd.read_csv('file.csv')
Clear, Marginal, EMarginal = [], [], []
for i in file['%SC']:
    if i[0] - i[1] >= 10:
        Clear.append('City', 'Team')
    elif i[0] - i[1] < 10 and i[0] - i[1] >= 2:
        Marginal.append('City', 'Team')
    else:
        EMarginal.append('City', 'Team')
Expected output:
Clear = [London , A]
Marginal = [Birmingham , B]
EMarginal = [York , A]
My approach doesn't seem right; can anyone suggest a way I could achieve the desired result? Many thanks.
If I understand correctly, you want to divide the cities into groups according to the margin between the top two teams.
def classify(city):
    largest = city.nlargest(2, '%SC')
    diff = largest['%SC'].iloc[0] - largest['%SC'].iloc[1]
    if diff >= 10:
        return 'Clear'
    elif diff < 10 and diff >= 2:
        return 'Marginal'
    return 'EMarginal'

groups = df.groupby("City").apply(classify)
# groups is the following table:
# City
# Birmingham     Marginal
# London            Clear
# York          EMarginal
# dtype: object
If you insist on having them as a list, you can call
groups.groupby(groups).apply(lambda g: list(g.index)).to_dict()
# Output:
# {'Clear': ['London'], 'EMarginal': ['York'], 'Marginal': ['Birmingham']}
If you further insist on including the winning team in each city, you can call
groups.name = "Margin"
df.join(groups, on="City")\
.groupby("Margin")\
.apply(
lambda g: list(
g.nlargest(1, "%SC")
.apply(lambda row: (row["City"], row["Team"]), axis=1)
)
).to_dict()
# Output
# {'Clear': [('London', 'A')], 'EMarginal': [('York', 'A')], 'Marginal': [('Birmingham', 'B')]}
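A vectorized sketch of the same classification (my addition, assuming every city has at least two teams): take the top two rows per city, compute the margin, and bin it with pd.cut.
import numpy as np

# Top two teams per city, in descending %SC order.
top2 = df.sort_values('%SC', ascending=False).groupby('City').head(2)
margin = top2.groupby('City')['%SC'].agg(lambda s: s.iloc[0] - s.iloc[1])
# [10, inf) -> Clear, [2, 10) -> Marginal, below 2 -> EMarginal
labels = pd.cut(margin, bins=[-np.inf, 2, 10, np.inf], right=False,
                labels=['EMarginal', 'Marginal', 'Clear'])
# labels is a Series indexed by City holding the three categories.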

pandas: replace NaNs with the mode of another column based on a second column

I have a pandas dataframe with two columns, city and country. Both city and country contain missing values. Consider this data frame:
import numpy as np
import pandas as pd

temp = pd.DataFrame({
    "country": ["country A", "country A", "country A", "country A",
                "country B", "country B", "country B", "country B",
                "country C", "country C", "country C", "country C"],
    "city": ["city 1", "city 2", np.nan, "city 2", "city 3", "city 3",
             np.nan, "city 4", "city 5", np.nan, np.nan, "city 6"],
})
I now want to fill in the NaNs in the city column with the mode of the country's city in the remaining data frame, e.g. for country A: city 1 is mentioned once; city 2 is mentioned twice; thus, fill the column city at index 2 with city 2 etc.
I have done
cities = [city for city in temp["country"].value_counts().index]
modes = temp.groupby(["country"]).agg(pd.Series.mode)
dict_locations = modes.to_dict(orient="index")

new_dict_locations = {}
for k in dict_locations.keys():
    new_dict_locations[k] = dict_locations[k]["city"]
Now having the value of the country and the corresponding city mode, I face two issues:
First: the case country C is bimodal - the key contains two entries. I want this key to refer to each of the entries with equal probability. The real data set has multiple modes, so it would be a list of len > 2.
Second: I'm stuck replacing the NaNs in city with the value corresponding to the same row's country in new_dict_locations. In pseudo-code, this would be: go through the column 'city'; if you find a missing value at position temp[i, 'city'], take the value of 'country' in that row (-> country_tmp); use country_tmp as the key into the dictionary new_dict_locations; if the value at key country_tmp is a list, randomly select one item from that list; take the returned value (-> city_tmp) and fill the missing cell temp[i, 'city'] with city_tmp.
I've tried using different combinations of .fillna() and .replace(), and read this and other questions, to no avail.* Can someone give me a pointer?
Many thanks in advance.
(Note: the referenced question replaces values in one cell according to a dict; my reference values are, however, in a different column.)
EDIT:
Executing temp["city"].fillna(temp['country'], inplace=True) and temp.replace({'city': dict_locations}) gives me an error: TypeError: unhashable type: 'dict'. [This error is TypeError: unhashable type: 'numpy.ndarray' for the original data set, but I cannot reproduce it with an example; if someone knows where the difference comes from, I'd be very happy to hear their thoughts.]
Try map with the dict new_dict_locations to create a new series s, and map again on s with np.random.choice to pick a value from each array. Finally, use s to fillna:
s = (temp.country.map(new_dict_locations)
         .map(lambda x: np.random.choice(x) if isinstance(x, np.ndarray) else x))
temp['city'] = temp.city.fillna(s)
Out[247]:
country city
0 country A city 1
1 country A city 2
2 country A city 2
3 country A city 2
4 country B city 3
5 country B city 3
6 country B city 3
7 country B city 4
8 country C city 5
9 country C city 6
10 country C city 5
11 country C city 6
Note: I thought the two map calls might be combined into one by using a dict comprehension. However, doing so would lose the randomness, because the random choice would then be made once per country instead of once per missing row.
def get_mode(d):
    for k, v in d.items():
        if len(v) > 1 and isinstance(v, np.ndarray):
            # pick one of the tied modes uniformly at random
            d[k] = np.random.choice(list(v))
    return d
The dictionary below is the one that will be used for filling.
new_dict_locations = get_mode(new_dict_locations)
keys = list(new_dict_locations.keys())
values = list(new_dict_locations.values())

# Filling happens here
temp.city = temp.city.fillna(temp.country).replace(keys, values)
This will give the desired output:
country city
0 country A city 1
1 country A city 2
2 country A city 2
3 country A city 2
4 country B city 3
5 country B city 3
6 country B city 3
7 country B city 4
8 country C city 5
9 country C city 5
10 country C city 5
11 country C city 6
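For completeness, a compact per-row sketch of the whole fill (my own, not from either answer above); it keeps the first answer's per-row randomness by drawing independently from each country's list of modes, only where city is missing:
# List of modal cities per country, computed from the non-missing rows.
modes = (temp.dropna(subset=['city'])
             .groupby('country')['city']
             .agg(lambda s: list(s.mode())))
mask = temp['city'].isna()
# One independent random draw per missing row.
temp.loc[mask, 'city'] = temp.loc[mask, 'country'].map(modes).map(np.random.choice)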
