Python: how to reshape a pandas dataframe? [duplicate] - python

This question already has answers here:
How can I pivot a dataframe?
(5 answers)
Closed 5 years ago.
I have a dataset containing indicators for different cities, so the same indicator is repeated several times, once per city. The dataset looks something like this:
df
city indicator Value
0 Tokio Solid Waste Recycled 1.162000e+01
1 Boston Solid Waste Recycled 3.912000e+01
2 London Solid Waste Recycled 0.000000e+00
3 Tokio Own-Source Revenues 1.420000e+00
4 Boston Own-Source Revenues 0.000000e+00
5 London Own-Source Revenues 3.247000e+01
6 Tokio Green Area 4.303100e+02
7 Boston Green Area 7.166350e+01
8 London Green Area 1.997610e+01
9 Tokio City Land Area 9.910000e+01
10 Boston City Land Area 4.200000e+01
11 London City Land Area 8.956000e+01
From the original dataframe I would like to create a second dataframe like the following:
df1
Solid Waste Recycled Own-Source Revenues Green Area City Land Area
Tokio 1.162000e+01 1.420000e+00 4.303100e+02 9.910000e+01
Boston 3.912000e+01 0.000000e+00 7.166350e+01 4.200000e+01
London 0.000000e+00 3.247000e+01 1.997610e+01 8.956000e+01

Maybe there is a better solution, but you can group by city and then apply a function to each group's dataframe to build a new one:
from collections import defaultdict
import pandas as pd

grouped = df.groupby('city')
res = defaultdict(list)
for k, k_df in grouped:
    res['city'].append(k)
    k_df.apply(lambda row: res[row['indicator']].append(row['Value']), axis=1)
df1 = pd.DataFrame(res)
Notice this will work only if every indicator appears for every city. If you want to support missing values, you have to append None for each indicator a city is missing, which requires collecting all possible indicators and checking whether each one was inserted:
grouped = df.groupby('city')
res = defaultdict(list)
new_columns = set(df['indicator'])  # all possible indicators
for k, k_df in grouped:
    res['city'].append(k)
    k_df.apply(lambda row: res[row['indicator']].append(row['Value']), axis=1)
    for col in new_columns:
        if len(res[col]) < len(res['city']):  # check whether a value is missing
            res[col].append(None)
df1 = pd.DataFrame(res)
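For reference, the pivot approach from the linked duplicate produces the same wide layout in a single call. A minimal sketch, assuming the column names from the question ('city', 'indicator', 'Value'):
import pandas as pd

# reshape long -> wide: one row per city, one column per indicator
df1 = df.pivot(index='city', columns='indicator', values='Value')
Missing city/indicator combinations simply come out as NaN, so no manual None handling is needed.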

Related

Drop 50% of rows containing a keyword [duplicate]

This question already has answers here:
"Drop random rows" from pandas dataframe
(2 answers)
Closed 2 years ago.
I have a dataset which contains a column 'location' with countries.
id location
0 001 United State
1 002 United State
2 003 Germany
3 004 Brazil
4 005 China
Now I only want the rows with specific countries.
I did it like this:
df2 = df[(df['location'].str.contains('United States')) | (df['location'].str.contains('Germany'))]
That works.
Now I want only half of the rows with 'United States'.
(The reason is that I have a really large dataset and most of the rows are 'United States'. For the sake of performance in further operations I want to cut half of it, or really any percentage.)
Can anyone help me do that in a fast and clean way? I'm struggling.
TY <3
You can use sample for that, together with drop:
df.drop(df[df['location'] == 'United State'].sample(frac=.5).index)
The filter inside returns all the rows whose location equals 'United State'. sample then randomly takes 50% of those, .index returns their index labels, and drop removes those rows from the dataframe.
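Putting the country filter and the sampling together, a minimal sketch assuming the column and country names shown above (random_state is only there to make the result reproducible):
import pandas as pd

# keep only the countries of interest
df2 = df[df['location'].str.contains('United State') | df['location'].str.contains('Germany')]
# randomly pick 50% of the 'United State' rows and drop them
to_drop = df2[df2['location'] == 'United State'].sample(frac=0.5, random_state=0).index
df2 = df2.drop(to_drop)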

How to get the index of a row after it satisfies a certain condition

I have a data frame with country names as the row index and the corresponding medals won in the summer and winter Olympics as columns.
I want to get the country name with the maximum difference between summer gold and winter gold; let's say the summer gold column is named x and the winter gold column is named y.
All the country names are the row index.
It is always good to provide a sample data frame so we can help better. I think you are looking for this:
(df.y-df.x).idxmax()
And if you care only about the absolute value of difference:
(df.x-df.y).abs().idxmax()
Example:
import pandas as pd
df = pd.DataFrame({'x': [1, 2, 3], 'y': [2, 10, 5]}, index=['a', 'b', 'c'])
x y
a 1 2
b 2 10
c 3 5
print((df.y-df.x).abs().idxmax())
b
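If you want the whole row rather than just its label, you can feed the result back into .loc; continuing the example above:
idx = (df.y - df.x).abs().idxmax()
print(df.loc[idx])  # the row labelled 'b': x=2, y=10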

Forward fill or back fill NaN values in Pandas columns based on grouping of other columns

I have a dataframe as below:
import pandas as pd
df = pd.DataFrame({'Country': ['USA','USA','MEX','IND','UK','UK','UK'],
                   'Region': ['Americas','NaN','NaN','Asia','Europe','NaN','NaN'],
                   'Flower': ['Rose','Rose','Lily','Orchid','Dandelion','Dandelion','Dandelion'],
                   'Animal': ['Bison','NaN','Golden Eagle','Tiger','Lion','Lion','NaN'],
                   'Game': ['Baseball','Baseball','soccer','hockey','cricket','cricket','cricket']})
I want to group by Country and Flower and forward fill or backward fill the columns Region and Animal where there are missing values; however, the column Game should remain intact.
I have tried this but it didn't work:
df['Region'] = df.groupby(['Country','Flower'])['Region'].transform(lambda x: x.ffill())
and also:
df.groupby(['Country','Flower'])['Animal', 'Region'].isna().bfill()
I want to know how to go about this.
This works, but it removes the Game column:
df = df.replace({'NaN': np.nan})
df.groupby(['Country','Flower'])[['Animal', 'Region']].bfill().ffill()
And if I do a transform there is a mismatch in length. Also, please note that this is a sample dataframe where I added "NaN" as a string; in the original frame the values are np.nan.
If you change your dataframe code to actually include np.nan, then the code you provided actually works. Although NaNs appear as the text 'NaN' when a frame is printed, you can't create a dataframe by writing that text by hand, because it will be interpreted as a string, not an actual missing value.
import pandas as pd
import numpy as np
df = pd.DataFrame({'Country': ['USA','USA','MEX','IND','UK','UK','UK'],
                   'Region': ['Americas',np.nan,np.nan,'Asia','Europe',np.nan,np.nan],
                   'Flower': ['Rose','Rose','Lily','Orchid','Dandelion','Dandelion','Dandelion'],
                   'Animal': ['Bison',np.nan,'Golden Eagle','Tiger','Lion','Lion',np.nan],
                   'Game': ['Baseball','Baseball','soccer','hockey','cricket','cricket','cricket']})
Then, this:
df['Region'] = df.groupby(['Country','Flower'])['Region'].transform(lambda x: x.ffill())
actually yields this:
Animal Country Flower Game Region
0 Bison USA Rose Baseball Americas
1 NaN USA Rose Baseball Americas
2 Golden Eagle MEX Lily soccer NaN
3 Tiger IND Orchid hockey Asia
4 Lion UK Dandelion cricket Europe
5 Lion UK Dandelion cricket Europe
6 NaN UK Dandelion cricket Europe
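The same transform can be repeated for Animal, which the snippet above leaves untouched. A small sketch under the same assumptions (ffill followed by bfill, so a group's single known value propagates in both directions):
df['Animal'] = df.groupby(['Country', 'Flower'])['Animal'].transform(lambda x: x.ffill().bfill())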
First, you need to know that the string 'NaN' is not NaN:
df=df.replace({'NaN':np.nan})
df.groupby(['Country','Flower'])['Region'].ffill()
Out[109]:
0 Americas
1 Americas
2 NaN  # this group has only a single row, so it stays NaN
3 Asia
4 Europe
5 Europe
6 Europe
Name: Region, dtype: object
Second, if you need to chain two fill methods like this within groups, you need apply:
df.update(df.groupby(['Country','Flower'])[['Animal', 'Region']].apply(lambda x: x.bfill().ffill()))
df
Out[119]:
Animal Country Flower Game Region
0 Bison USA Rose Baseball Americas
1 Bison USA Rose Baseball Americas
2 Golden Eagle MEX Lily soccer NaN
3 Tiger IND Orchid hockey Asia
4 Lion UK Dandelion cricket Europe
5 Lion UK Dandelion cricket Europe
6 Lion UK Dandelion cricket Europe
As the MEX/Lily group has only one row and its Region value is NaN, fillna cannot find an appropriate group value. If we catch the exception while filling with the group mode, the values that have no usable group are left as they are; ffill and bfill are then applied to cover the values that don't have an appropriate group:
df_stack = pd.DataFrame({'Country': ['USA','USA','MEX','IND','UK','UK','UK'],
                         'Region': ['Americas',np.nan,np.nan,'Asia','Europe',np.nan,np.nan],
                         'Flower': ['Rose','Rose','Lily','Orchid','Dandelion','Dandelion','Dandelion'],
                         'Animal': ['Bison',np.nan,'Golden Eagle','Tiger','Lion','Lion',np.nan],
                         'Game': ['Baseball','Baseball','soccer','hockey','cricket','cricket','cricket']})
print("-------Before imputation------")
print(df_stack)
def fillna_Region(grp):
    try:
        return grp.fillna(grp.mode()[0])
    except BaseException as e:
        print('Error as no corresponding group: ' + str(e))

df_stack["Region"] = df_stack["Region"].fillna(
    df_stack.groupby(['Country', 'Flower'])['Region'].transform(lambda grp: fillna_Region(grp)))
df_stack["Animal"] = df_stack["Animal"].fillna(
    df_stack.groupby(['Country', 'Flower'])['Animal'].transform(lambda grp: fillna_Region(grp)))
df_stack = df_stack.ffill(axis = 0)
df_stack = df_stack.bfill(axis =0)
print("-------After imputation------")
print(df_stack)

How to filter out entries in a data frame with specific and different values?

I have this real estate data:
neighborhood type_property type_negotiation price
Smallville house rent 2000
Oakville apartment for sale 100000
King Bay house for sale 250000
...
I have a groupby that identifies which rows in the data set are houses for sale, and then returns the 10th and 90th percentiles and the quantity of these houses for each neighborhood in a new data frame called df_breakdown. The result looks like this:
neighborhood tenthpercentile ninetiethpercentile Quantity
King Bay 250000.0 250000.0 1
Smallville 99000.0 120000.0 8
Oakville 45000.0 160000.0 6
...
I now want to take this information back to my original real estate data set and filter out every listing that is a house for sale priced above the 90th percentile or below the 10th percentile computed for its neighborhood. For example, I would want a house in the Oakville neighborhood priced at 350000 filtered out.
I have used this argument before:
df1 = df[df.price < df.price.quantile(.90)]
But I don't know how to apply it with different thresholds for each neighborhood, or whether it is even the right tool here. Thank you in advance for the help.
Probably not the most elegant, but you could join the percentile aggregations onto the real estate data:
df.join(df.groupby('neighborhood').quantile([0.1, 0.9]), on='neighborhood')
On mobile, so forgive me if the syntax isn’t perfect.
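A slightly fuller version of that idea, assuming the column names shown above: compute both quantiles per neighborhood, unstack them into columns, and join them back onto the listings:
q = df.groupby('neighborhood')['price'].quantile([0.1, 0.9]).unstack()
q.columns = ['tenthpercentile', 'ninetiethpercentile']
df_joined = df.join(q, on='neighborhood')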
You can set them to have the same index, broadcast the percentiles, and just use .between.
So first,
df2 = df2.set_index('neighborhood')
df = df.set_index('neighborhood')
Then, broadcast using loc
df.loc[:, 't'], df.loc[:, 'n'] = df2.tenthpercentile, df2.ninetiethpercentile
Finally,
df.price.between(df.t, df.n)
which yields
neighborhood
Smallville False
Oakville True
King Bay True
King Bay False
dtype: bool
So to filter, just slice
df[df.price.between(df.t, df.n)]
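An equivalent route that avoids re-indexing is to broadcast the per-neighborhood thresholds with groupby().transform and filter with between. A minimal sketch, starting from the original (un-indexed) df and its column names:
lo = df.groupby('neighborhood')['price'].transform(lambda s: s.quantile(0.1))
hi = df.groupby('neighborhood')['price'].transform(lambda s: s.quantile(0.9))
df_filtered = df[df['price'].between(lo, hi)]
If only houses for sale should be clipped, restrict the mask to those rows and keep everything else unchanged.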

Pandas dataframe: groupby one column, but concatenate and aggregate by others [duplicate]

This question already has answers here:
Concatenate strings from several rows using Pandas groupby
(8 answers)
Pandas groupby: How to get a union of strings
(8 answers)
Closed 4 years ago.
How do I turn the below input data (a Pandas dataframe fed from an Excel file):
ID Category Speaker Price
334014 Real Estate Perspectives Tom Smith 100
334014 E&E Tom Smith 200
334014 Real Estate Perspectives Janet Brown 100
334014 E&E Janet Brown 200
into this:
ID Category Speaker Price
334014 Real Estate Perspectives Tom Smith, Janet Brown 100
334014 E&E Tom Smith, Janet Brown 200
So basically I want to group by Category and concatenate the Speakers, but not aggregate Price.
I tried different approaches with Pandas dataframe.groupby() and .agg(), but to no avail. Maybe there is a simpler pure Python solution?
There are two possible solutions. Either aggregate by multiple columns and join the speakers:
dataframe.groupby(['ID','Category','Price'])['Speaker'].apply(', '.join).reset_index()
Or, if you want to group by Price only, then it is necessary to aggregate all the other columns with first or last:
dataframe.groupby('Price', as_index=False).agg({'Speaker': ', '.join, 'ID': 'first', 'Category': 'first'})
Try this
df.groupby(['ID','Category'], as_index=False).agg(lambda x: x.iloc[0] if x.dtype == 'int64' else ', '.join(x))
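For reference, a self-contained run of the first approach (grouping by ID, Category and Price), assuming the column names from the question:
import pandas as pd

df = pd.DataFrame({
    'ID': [334014, 334014, 334014, 334014],
    'Category': ['Real Estate Perspectives', 'E&E', 'Real Estate Perspectives', 'E&E'],
    'Speaker': ['Tom Smith', 'Tom Smith', 'Janet Brown', 'Janet Brown'],
    'Price': [100, 200, 100, 200]})

out = df.groupby(['ID', 'Category', 'Price'], as_index=False).agg({'Speaker': ', '.join})
print(out)
#        ID                  Category  Price                 Speaker
# 0  334014                       E&E    200  Tom Smith, Janet Brown
# 1  334014  Real Estate Perspectives    100  Tom Smith, Janet Brown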
