I have a data frame with country names as the row index and the corresponding number of medals won in the summer and winter Olympics.
From this data frame I want to get the country name with the largest difference between summer gold and winter gold; say the summer gold column is named x and the winter gold column is named y.
All the country names form the row index.
It is always good to provide a sample data frame so we can help better. I think you are looking for this:
(df.x - df.y).idxmax()
And if you only care about the absolute value of the difference:
(df.x - df.y).abs().idxmax()
Example:
import pandas as pd
df = pd.DataFrame({'x': [1, 2, 3], 'y': [2, 10, 5]}, index=['a', 'b', 'c'])
x y
a 1 2
b 2 10
c 3 5
print((df.y-df.x).abs().idxmax())
b
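If you also want the size of the gap and not just the country, the label returned by idxmax can be fed back into the difference; for the example above:
diff = (df.y - df.x).abs()
country = diff.idxmax()            # index label of the largest absolute gap: 'b'
print(country, diff.loc[country])  # b 8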
I have a dataframe (much larger than this example) as follows, where every value in the first two columns is repeated 5 times.
import pandas as pd
df = pd.DataFrame({
    'text': ['the weather is nice'] * 5
            + ['the house is beautiful'] * 5
            + ['the day is long'] * 5,
    'reference': ['weather'] * 5 + ['house'] * 5 + ['day'] * 5,
    'id': list(range(1, 16)),
})
I would like to divide this pandas dataframe into two dataframes so that the first two consecutive rows of each group go into one and the other three go into the second dataframe, as follows.
The desired output:
first df:
text reference id
0 the weather is nice weather 1
1 the weather is nice weather 2
3 the house is beautiful house 6
4 the house is beautiful house 7
5 the day is long day 11
6 the day is long day 12
second df:
text reference id
0 the weather is nice weather 3
1 the weather is nice weather 4
2 the weather is nice weather 5
3 the house is beautiful house 8
4 the house is beautiful house 9
5 the house is beautiful house 10
6 the day is long day 13
7 the day is long day 14
8 the day is long day 15
Obviously selecting every nth row does not work (e.g. df.iloc[::3, :] or df[df.index % 3 == 0]), so I would like to know how the above output could be achieved.
If you want to group on the value of reference (first 2 items vs rest):
mask = df.groupby('reference').cumcount().gt(1)
groups = [g for k,g in df.groupby(mask)]
# or manually
# df1 = df[~mask]
# df2 = df[mask]
Using position:
import numpy as np
mask = (np.arange(len(df)) % 5) > 1
# or with a range index
# mask = df.index.mod(5).gt(1)
# then same as above using groupby or slicing
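For instance, applying the first mask to the sample frame and slicing (reset_index is optional, just to get the compact 0..n index shown in the desired output):
mask = df.groupby('reference').cumcount().gt(1)
df1 = df[~mask].reset_index(drop=True)  # first two rows of each block: ids 1, 2, 6, 7, 11, 12
df2 = df[mask].reset_index(drop=True)   # remaining three rows: ids 3, 4, 5, 8, 9, 10, 13, 14, 15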
Make a mask m:
import numpy as np
m = np.tile([True, True, False, False, False], len(df) // 5)
df1 = df[m]
df2 = df[~m]
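Note that np.tile as written assumes len(df) is an exact multiple of 5. If that may not hold, np.resize repeats the pattern cyclically and truncates it to the frame's length; a small sketch:
import numpy as np
pattern = np.array([True, True, False, False, False])
m = np.resize(pattern, len(df))  # pattern repeated and cut to exactly len(df) entries
df1, df2 = df[m], df[~m]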
I am trying to collapse all the rows of a dataframe into one single row across all columns.
My data frame looks like the following:
name   job       value
bob    business  100
NaN    dentist   NaN
jack   NaN       NaN
I am trying to get the following output:
name      job                value
bob jack  business dentist   100
I am just trying to collapse all the rows across all columns; I do not care if the value column is converted to dtype object (string).
I've tried groupby(index=0) but it did not give good results.
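For reference, a minimal construction of the frame above (assuming the NAN/Nan entries are genuine missing values):
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'name': ['bob', np.nan, 'jack'],
    'job': ['business', 'dentist', np.nan],
    'value': [100, np.nan, np.nan],
})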
You could apply join:
out = df.apply(lambda x: ' '.join(x.dropna().astype(str))).to_frame().T
Output:
name job value
0 bob jack business dentist 100.0
Try this:
new_df = df.agg(lambda x: x.dropna().astype(str).tolist()).str.join(' ').to_frame().T
Output:
>>> new_df
name job value
0 bob jack business dentist 100.0
I have an Excel file with product names. The first row holds the categories (A1: Water, B1: Soft Drinks, etc.) and each cell below is a product in that category (A2: Sparkling, A3: Still, B2: Coca Cola, B3: Orange Juice, B4: Lemonade, etc.). I want to keep this list in a viewable format (not comma separated, etc.) because that makes it very easy for anybody to update the product names (a second person runs the script without understanding it).
If it helps, I can also save the Excel file as CSV, and I can also move the categories from the top row to the first column.
I would like to replace the cells of a dataframe (df) with the product categories. For example, Coca Cola would become Soft Drinks. If a product is not in the Excel file it should not be replaced (e.g. Cookie).
print(df)
Product Quantity
0 Coca Cola 1234
1 Cookie 4
2 Still 333
3 Chips 88
Expected Outcome:
print (df1)
Product Quantity
0 Soft Drinks 1234
1 Cookie 4
2 Water 333
3 Snacks 88
Use DataFrame.melt with DataFrame.dropna, or DataFrame.stack, to build a helper Series, and then use Series.replace:
s = df1.melt().dropna().set_index('value')['variable']
Alternative:
s = df1.stack().reset_index(name='v').set_index('v')['level_1']
df['Product'] = df['Product'].replace(s)
#if performance is important
#df['Product'] = df['Product'].map(s).fillna(df['Product'])
print (df)
Product Quantity
0 Soft Drinks 1234
1 Cookie 4
2 Water 333
3 Snacks 88
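For a self-contained sketch of the whole flow, here is a small hypothetical category table standing in for the Excel sheet (in practice df1 would come from pd.read_excel; the Snacks column is assumed from the expected output):
import pandas as pd
df1 = pd.DataFrame({
    'Water': ['Sparkling', 'Still', None],
    'Soft Drinks': ['Coca Cola', 'Orange Juice', 'Lemonade'],
    'Snacks': ['Chips', None, None],
})
df = pd.DataFrame({'Product': ['Coca Cola', 'Cookie', 'Still', 'Chips'],
                   'Quantity': [1234, 4, 333, 88]})
s = df1.melt().dropna().set_index('value')['variable']  # maps product -> category
df['Product'] = df['Product'].map(s).fillna(df['Product'])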
I have this real estate data:
neighborhood type_property type_negotiation price
Smallville house rent 2000
Oakville apartment for sale 100000
King Bay house for sale 250000
...
I have this groupby that identifies which rows in the data set are houses for sale, and then returns the 10th and 90th percentiles and the quantity of these houses for each neighborhood in a new data frame called df_breakdown. The result looks like this:
neighborhood tenthpercentile ninetiethpercentile Quantity
King Bay 250000.0 250000.0 1
Smallville 99000.0 120000.0 8
Oakville 45000.0 160000.0 6
...
I now want to take this information back to my original real estate data set and filter out every listing that is a house for sale priced above the 90th percentile or below the 10th percentile calculated for its neighborhood. For example, a house in the Oakville neighborhood priced at 350000 should be filtered out.
I have used this argument before:
df1 = df[df.price < df.price.quantile(.90)]
But I don't know how to utilize it for differing values for each neighborhood, or even if it is useful to use. Thank you in advance for the help.
Probably not the most elegant, but you could join the percentile aggregations onto the real estate data:
df.join(df.groupby('neighborhood').quantile([0.1, 0.9]), on='neighborhood')
On mobile, so forgive me if the syntax isn't perfect.
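Spelling that idea out (a sketch, assuming df_breakdown from the question with neighborhood as a regular column): join the per-neighborhood band onto the listings, then keep the prices inside it:
bands = df_breakdown.set_index('neighborhood')[['tenthpercentile', 'ninetiethpercentile']]
joined = df.join(bands, on='neighborhood')
df1 = joined[joined.price.between(joined.tenthpercentile, joined.ninetiethpercentile)]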
You can set them to have the same index, broadcast the percentiles, and just use .between.
So first,
df2 = df2.set_index('neighborhood')
df = df.set_index('neighborhood')
Then, broadcast using loc
df.loc[:, 't'], df.loc[:, 'n'] = df2.tenthpercentile, df2.ninetiethpercentile
Finally,
df.price.between(df.t, df.n)
which yields
neighborhood
Smallville False
Oakville True
King Bay True
King Bay False
dtype: bool
So to filter, just slice
df[df.price.between(df.t, df.n)]
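And since the question only wants houses for sale dropped when they fall outside the band (other listings untouched), the filter can be relaxed along these lines, a sketch using the type columns from the original data:
is_house_sale = (df.type_property == 'house') & (df.type_negotiation == 'for sale')
df[~is_house_sale | df.price.between(df.t, df.n)]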
I have a dataset containing indicators for different cities, so the same indicator is repeated several times for different cities. The dataset is something like this:
df
city indicator Value
0 Tokio Solid Waste Recycled 1.162000e+01
1 Boston Solid Waste Recycled 3.912000e+01
2 London Solid Waste Recycled 0.000000e+00
3 Tokio Own-Source Revenues 1.420000e+00
4 Boston Own-Source Revenues 0.000000e+00
5 London Own-Source Revenues 3.247000e+01
6 Tokio Green Area 4.303100e+02
7 Boston Green Area 7.166350e+01
8 London Green Area 1.997610e+01
9 Tokio City Land Area 9.910000e+01
10 Boston City Land Area 4.200000e+01
11 London City Land Area 8.956000e+01
From the original dataframe I would like to create a second dataframe like the following:
df1
Solid Waste Recycled Own-Source Revenues Green Area City Land Area
Tokio 1.162000e+01 1.420000e+00 4.303100e+02 9.910000e+01
Boston 3.912000e+01 0.000000e+00 7.166350e+01 4.200000e+01
London 0.000000e+00 3.247000e+01 1.997610e+01
Maybe there is a better solution, but you can groupby and then apply a function on each grouped dataframe to create a new one:
from collections import defaultdict
grouped = df.groupby('city')
res = defaultdict(list)
for k, k_df in grouped:
    res['city'].append(k)
    k_df.apply(lambda row: res[row['indicator']].append(row['Value']), axis=1)
pd.DataFrame(res)
Note that this only works if every indicator appears for every city. If you want to support missing values, append None for each indicator a city lacks. This requires collecting all possible indicator names and checking whether each one received a value:
grouped = df.groupby('city')
res = defaultdict(list)
new_columns = set(df['indicator'])  # all possible indicator names
for k, k_df in grouped:
    res['city'].append(k)
    k_df.apply(lambda row: res[row['indicator']].append(row['Value']), axis=1)
    for col in new_columns:
        if len(res[col]) < len(res['city']):  # this city is missing this indicator
            res[col].append(None)
pd.DataFrame(res)
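Since each (city, indicator) pair occurs exactly once, the same reshaping can also be done directly with pivot; a short sketch (the column reordering is optional, just to match the layout above):
df1 = df.pivot(index='city', columns='indicator', values='Value')
df1 = df1[['Solid Waste Recycled', 'Own-Source Revenues', 'Green Area', 'City Land Area']]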