Group by one column and find first of two columns pandas - python

I have a dataframe geomerge. I need to group by the column grpno. and take the first value of the column MaxOfcount percent and the first value of the column state code, keeping grpno. in the output. I have renamed them FirstOfMaxOfState count percent and FirstOfstate code.
My input dataframe:
count percent grpno. state code MaxOfcount percent
0 14.78 1 CA 14.78
1 0.00 2 CA 0.00
2 0.00 2 FL 0.00
3 8.80 3 CA 8.80
4 0.00 6 NC 0.00
5 0.00 5 NC 0.00
6 59.00 4 MA 59.00
My output dataframe:
FirstOfMaxOfState count percent state pool number FirstOfstate code
0 14.78 1 CA
1 0.00 2 CA
2 8.80 3 CA
3 59.00 4 MA
4 0.00 5 NC
5 0.00 6 NC
Can anyone help on this?

Drop the unneeded column, group by grpno, take the first value, and flatten the multi-index:
df2 = df.drop(columns='count percent').groupby('grpno.').take([0]).reset_index(0)
Rename the columns:
mapping = {'state code': 'FirstOfstate code',
           'grpno.': 'state pool number',
           'MaxOfcount percent': 'FirstOfMaxOfState count percent'}
df2 = df2.rename(columns=mapping)
Result:
>>> df2
state pool number FirstOfMaxOfState count percent FirstOfstate code
0 1 14.78 CA
1 2 0.00 CA
3 3 8.80 CA
6 4 59.00 MA
5 5 0.00 NC
4 6 0.00 NC
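On recent pandas versions the same result can be had without take; a minimal sketch, assuming the frame is named df as above:
first = (df.drop(columns='count percent')
           .groupby('grpno.', as_index=False)
           .first()   # first row of each group, groups sorted by grpno.
           .rename(columns={'grpno.': 'state pool number',
                            'MaxOfcount percent': 'FirstOfMaxOfState count percent',
                            'state code': 'FirstOfstate code'}))
The only visible difference is that first() resets the index to 0..n-1 instead of keeping the original row labels.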

Related

Select certain columns based on multiple criteria in pandas

I have the following dataset:
import pandas as pd

my_df = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                      'type': ['corp', 'smb', 'smb', 'corp', 'mid'],
                      'sales': [34567, 2190, 1870, 22000, 10000],
                      'sales_roi': [.10, .21, .22, .15, .16],
                      'sales_pct': [.38, .05, .08, .30, .20],
                      'sales_ln': [4.2, 2.1, 2.0, 4.1, 4],
                      'cost_pct': [22000, 1000, 900, 14000, 5000],
                      'flag': [0, 1, 0, 1, 1],
                      'gibberish': ['bla', 'ble', 'bla', 'ble', 'bla'],
                      'tech': ['lnx', 'mst', 'mst', 'lnx', 'mc']})
my_df['type'] = pd.Categorical(my_df.type)
my_df
id type sales sales_roi sales_pct sales_ln cost_pct flag gibberish tech
0 1 corp 34567 0.10 0.38 4.2 22000 0 bla lnx
1 2 smb 2190 0.21 0.05 2.1 1000 1 ble mst
2 3 smb 1870 0.22 0.08 2.0 900 0 bla mst
3 4 corp 22000 0.15 0.30 4.1 14000 1 ble lnx
4 5 mid 10000 0.16 0.20 4.0 5000 1 bla mc
And I want to filter out all variables that end in "_pct" or "_ln", or are named "gibberish" or "tech". This is what I have tried:
df_selected = my_df.loc[:, ~my_df.columns.str.endswith('_pct') &
                           ~my_df.columns.str.endswith('_ln') &
                           ~my_df.columns.str.contains('gibberish','tech')]
But it returns an unwanted column ("tech"):
id type sales sales_roi flag tech
0 1 corp 34567 0.10 0 lnx
1 2 smb 2190 0.21 1 mst
2 3 smb 1870 0.22 0 mst
3 4 corp 22000 0.15 1 lnx
4 5 mid 10000 0.16 1 mc
This is the expected result:
id type sales sales_roi flag
0 1 corp 34567 0.10 0
1 2 smb 2190 0.21 1
2 3 smb 1870 0.22 0
3 4 corp 22000 0.15 1
4 5 mid 10000 0.16 1
Please consider that I have to deal with hundreds of variables and this is just an example of what I need.
The "tech" column slips through because the second positional argument of str.contains is the case flag, so 'tech' is never tested as a pattern. endswith accepts a tuple, so just put everything you want to exclude in a single tuple and filter:
my_df[my_df.columns[~my_df.columns.str.endswith(('_pct','_ln','gibberish','tech'))]]
id type sales sales_roi flag
0 1 corp 34567 0.10 0
1 2 smb 2190 0.21 1
2 3 smb 1870 0.22 0
3 4 corp 22000 0.15 1
4 5 mid 10000 0.16 1
I would do it like this:
to_drop = ["_pct", "_ln", "gibberish", "tech"]
for column in list(my_df.columns):
    for pattern in to_drop:
        if pattern in column:
            my_df = my_df.drop(column, axis=1)
            break  # column is gone; no need to test the remaining patterns
Of course you can tighten the if test to column.endswith(pattern) or anything else you need.
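With hundreds of columns the same filtering also fits in one pass, since plain str.endswith accepts a tuple too; a sketch using the patterns from the question:
bad = ('_pct', '_ln', 'gibberish', 'tech')
df_selected = my_df.drop(columns=[c for c in my_df.columns if c.endswith(bad)])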

Add a column of normalised values based on sections of a dataframe column

I have a dataframe of several hundred thousand rows, in the following format:
time_elapsed cycle
0 0.00 1
1 0.50 1
2 1.00 1
3 1.30 1
4 1.50 1
5 0.00 2
6 0.75 2
7 1.50 2
8 3.00 2
I want to create a third column that gives, for each row, the percentage of its cycle's total elapsed time (resetting whenever time_elapsed returns to 0). To give something like:
time_elapsed cycle percentage
0 0.00 1 0
1 0.50 1 33
2 1.00 1 67
3 1.30 1 87
4 1.50 1 100
5 0.00 2 0
6 0.75 2 25
7 1.50 2 50
8 3.00 2 100
I'm not fussed about the number of decimal places, I've just excluded them for ease here.
I started going along this route, but I keep getting errors.
data['percentage'] = data['time_elapsed'].sub(data.groupby(['cycle'])['time_elapsed'].transform(lambda x: x*100/data['time_elapsed'].max()))
I think it's the lambda function causing errors, but I'm not sure what I should do to change it. Any help is much appreciated :)
Use Series.div for division instead of sub for subtraction; the solution then simplifies: get only the max per group with transform, multiply by 100 with Series.mul, round with Series.round if necessary, and convert to integers with Series.astype:
s = data.groupby(['cycle'])['time_elapsed'].transform('max')
data['percentage'] = data['time_elapsed'].div(s).mul(100).round().astype(int)
print(data)
time_elapsed cycle percentage
0 0.00 1 0
1 0.50 1 33
2 1.00 1 67
3 1.30 1 87
4 1.50 1 100
5 0.00 2 0
6 0.75 2 25
7 1.50 2 50
8 3.00 2 100
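For reference, a self-contained version of the same approach, with the frame rebuilt from the question's data:
import pandas as pd

data = pd.DataFrame({
    'time_elapsed': [0.00, 0.50, 1.00, 1.30, 1.50, 0.00, 0.75, 1.50, 3.00],
    'cycle': [1, 1, 1, 1, 1, 2, 2, 2, 2],
})
# transform('max') broadcasts each cycle's maximum back onto every row of that
# cycle, so the division lines up element-wise with the original column
s = data.groupby('cycle')['time_elapsed'].transform('max')
data['percentage'] = data['time_elapsed'].div(s).mul(100).round().astype(int)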

Pandas read csv with multiple whitespaces and parse dates

I have a csv file that looks like
Year Mo Da (01,52)
1950 1 1 0.00
1950 1 2 0.00
1950 1 3 0.05
1950 1 4 0.00
1950 1 5 0.07
1950 1 6 0.07
and I would like to transform it into a dataframe with 2 columns: a datetime column in YYYYMMDD form (built from the "Year", "Mo", and "Da" columns in the raw data), and the rainfall at the grid point (e.g. 01, 52) as the second column.
A desired output would be:
Datetime Rainfall
19500101 0.00
19500102 0.00
19500103 0.05
I am stuck on two issues: appropriately accounting for the whitespace during the read-in and properly using parse_dates.
The simple read-in command:
df = pd.read_csv(csv_fl)
Almost correctly reads in the headers, but splits the (01,52) into separate columns, yielding a trailing NaN, which shouldn't be there.
Year Mo Da (01 52)
0 1950 1 1 0.00 NaN
And trying to parse the dates using
df = pd.read_csv(csv_fl, parse_dates={'Datetime':[0,1,2]}, index_col=0)
leads to an IndexError
colnames.append(str(columns[c]))
IndexError: list index out of range
Any guidance is much appreciated.
If you pass delim_whitespace=True and pass the three date columns as a nested list to parse_dates, the last step is just to overwrite the column names:
In [96]:
import pandas as pd
import io
t="""Year Mo Da (01,52)
1950 1 1 0.00
1950 1 2 0.00
1950 1 3 0.05
1950 1 4 0.00
1950 1 5 0.07
1950 1 6 0.07"""
df = pd.read_csv(io.StringIO(t), delim_whitespace=True, parse_dates=[['Year','Mo','Da']])
df.columns = ['Datetime', 'Rainfall']
df
Out[96]:
Datetime Rainfall
0 1950-01-01 0.00
1 1950-01-02 0.00
2 1950-01-03 0.05
3 1950-01-04 0.00
4 1950-01-05 0.07
5 1950-01-06 0.07
So I expect that
df = pd.read_csv(csv_fl, delim_whitespace=True, parse_dates=[['Year','Mo','Da']])
should work, followed by overwriting the column names as above.
>>> filename = "..."
>>> pd.read_csv(filename,
...             sep=" ",
...             skipinitialspace=True,
...             parse_dates={'Datetime': [0, 1, 2]},
...             usecols=[0, 1, 2, 3],
...             names=["Y", "M", "D", "Rainfall"],
...             skiprows=1)
Datetime Rainfall
0 1950-01-01 0.00
1 1950-01-02 0.00
2 1950-01-03 0.05
3 1950-01-04 0.00
4 1950-01-05 0.07
5 1950-01-06 0.07
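On recent pandas both delim_whitespace and the nested-list form of parse_dates are deprecated; a sketch of the same read without them, assuming csv_fl is the question's filename and that the Datetime column should literally be YYYYMMDD strings as in the desired output:
import pandas as pd

df = pd.read_csv(csv_fl, sep=r'\s+', skiprows=1,
                 names=['Year', 'Mo', 'Da', 'Rainfall'])
# to_datetime can assemble a datetime from year/month/day columns
dt = pd.to_datetime(df[['Year', 'Mo', 'Da']].set_axis(['year', 'month', 'day'], axis=1))
df = pd.DataFrame({'Datetime': dt.dt.strftime('%Y%m%d'), 'Rainfall': df['Rainfall']})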

find the max of column group by another column pandas

I have a dataframe with 2 columns:
count percent grpno.
0 14.78 1
1 0.00 2
2 8.80 3
3 9.60 4
4 55.90 4
5 0.00 2
6 0.00 6
7 0.00 5
8 6.90 1
9 59.00 4
I need to get the max of the column 'count percent' grouped by the column 'grpno.'. I tried
geostat.groupby(['grpno.'], sort=False)['count percent'].max()
and the output I get is
grpno.
1 14.78
2 0.00
3 8.80
4 59.00
6 0.00
5 0.00
Name: count percent, dtype: float64
But I need the output to be a dataframe with the column names 'MaxOfcount percent' and 'grpno.'. Can anyone help with this? Thanks
res = df.groupby('grpno.')['count percent'].max().reset_index()
res.columns = ['grpno.', 'MaxOfcount percent']
grpno. MaxOfcount percent
0 1 14.78
1 2 0.00
2 3 8.80
3 4 59.00
4 5 0.00
5 6 0.00
You could also do it in one line:
res = df.groupby('grpno.', as_index=False)['count percent'].max().rename(columns={'count percent': 'MaxOfcount percent'})
You could use groupby with argument as_index=False:
In [119]: df.groupby(['grpno.'], as_index=False)[['count percent']].max()
Out[119]:
grpno. count percent
0 1 14.78
1 2 0.00
2 3 8.80
3 4 59.00
4 5 0.00
5 6 0.00
df1 = df.groupby(['grpno.'], as_index=False)[['count percent']].max()
df1.columns = df1.columns[:-1].tolist() + ['MaxOfcount percent']
In [130]: df1
Out[130]:
grpno. MaxOfcount percent
0 1 14.78
1 2 0.00
2 3 8.80
3 4 59.00
4 5 0.00
5 6 0.00
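An equivalent spelling that renames the aggregated Series instead of assigning to .columns; a sketch on the same frame:
res = (df.groupby('grpno.')['count percent']
         .max()
         .rename('MaxOfcount percent')   # rename the Series before resetting the index
         .reset_index())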
