Getting maximum values in a column - python

My dataframe looks like this:
Country Code Duration
A 1 0
A 1 1
A 1 2
A 1 3
A 2 0
A 2 1
A 1 0
A 1 1
A 1 2
I need to get max values from the "Duration" column - not just a single maximum value, but one maximum for each consecutive sequence of numbers in this column. The output might look like this:
Country Code Duration
A 1 3
A 2 1
A 1 2
I could have grouped by "Code", but its values are often repeating, so that's probably not an option. Any help or tips would be much appreciated.
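For reference, here is a minimal reconstruction of the example frame (a sketch, assuming the plain column layout shown above), which the answers below operate on:
import pandas as pd

# Rebuild the question's data
df = pd.DataFrame({'Country': ['A'] * 9,
                   'Code': [1, 1, 1, 1, 2, 2, 1, 1, 1],
                   'Duration': [0, 1, 2, 3, 0, 1, 0, 1, 2]})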

Use idxmax after creating another group key with diff and cumsum:
df.loc[df.groupby([df.Country,df.Code.diff().ne(0).cumsum()]).Duration.idxmax()]
Country Code Duration
3 A 1 3
5 A 2 1
8 A 1 2
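To see why this works, inspect the helper key on its own: diff().ne(0) flags every row where Code changes, and cumsum() turns those flags into a running run id (shown here against the frame reconstructed above):
# One id per consecutive run of identical Code values
print(df.Code.diff().ne(0).cumsum().tolist())
# [1, 1, 1, 1, 2, 2, 3, 3, 3]
Grouping by (Country, run id) therefore isolates each sequence even when the same Code repeats later.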

First we create a mask to mark the sequences, then we group by it to build the wanted output:
m = (~df['Code'].eq(df['Code'].shift())).cumsum()
df.groupby(m).agg({'Country': 'first',
                   'Code': 'first',
                   'Duration': 'max'}).reset_index(drop=True)
Country Code Duration
0 A 1 3
1 A 2 1
2 A 1 2

The problem is slightly unclear. However, assuming that order is important, we can move toward a solution.
import pandas as pd

d = pd.read_csv('data.csv')
s = d.Code
d['series'] = s.ne(s.shift()).cumsum()
print(d.groupby(['Country', 'Code', 'series'])['Duration'].max().reset_index())
Returns:
Country Code series Duration
0 A 1 1 3
1 A 1 3 2
2 A 2 2 1
You can then drop the series column.

You might want to check this link; it might be the answer you're looking for:
pandas groupby where you get the max of one column and the min of another column. It goes as:
result = df.groupby(['Code', 'Country']).agg({'Duration':'max'})[['Duration']].reset_index()

Related

Compare all columns value to another one with Pandas

I am having trouble with Pandas.
I am trying to compare each value of a row to another one.
In the attached link you will be able to see a slice of my dataframe.
For each date I have the daily variation of some stocks.
I want to compare each stock's variation to the variation of the column labelled 'CAC 40'.
If the value is greater I want to turn it into a Boolean 1, or 0 if lower.
This should return a dataframe filled only with 1s and 0s so I can then summarize by columns.
I have tried the apply method, but it doesn't work: it returns a pandas Series (attached below).
def compare_to_cac(row):
    for i in row:
        if row[i] >= row['CAC 40']:
            return 1
        else:
            return 0

data2 = data.apply(compare_to_cac, axis=1)
Please can someone help me out ?
I worked with this data (column names are not important here, only the CAC 40 one is):
A B CAC 40
0 0 2 9
1 1 3 9
2 2 4 1
3 3 5 2
4 4 7 2
With just a for loop:
import numpy as np

for column in df.columns:
    if column == "CAC 40":
        continue
    condition = [df[column] > df["CAC 40"]]
    value = [1]
    df[column] = np.select(condition, value, default=0)
Which gives me as a result:
A B CAC 40
0 0 0 9
1 0 0 9
2 1 1 1
3 1 1 2
4 1 1 2
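A vectorized alternative, as a sketch under the same assumptions (all columns numeric, 'CAC 40' as the benchmark), compares every column at once and skips the loop entirely:
# gt() broadcasts the benchmark column across the frame row by row;
# unlike the loop above, the result omits 'CAC 40' itself and leaves df untouched
result = df.drop(columns='CAC 40').gt(df['CAC 40'], axis=0).astype(int)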

Pandas groupby cumulative sum ignore current row

I know there are some questions about this topic (like Pandas: Cumulative sum of one column based on value of another); however, none of them fulfills my requirements.
Let's say I have a dataframe like the one below (shown as an image in the original post).
I want to compute the cumulative sum of Cost grouping by Month, without taking the current row into account, in order to get the Desired column. By using groupby and cumsum I obtain the CumSum column instead.
The code to generate the dataframe is
df = pd.DataFrame({'Month': [1, 1, 1, 2, 2, 1, 3],
                   'Cost': [5, 8, 10, 1, 3, 4, 1]})
IIUC you can use groupby.cumsum and then just subtract the cost:
df['cumsum_'] = df.groupby('Month').Cost.cumsum().sub(df.Cost)
print(df)
Month Cost cumsum_
0 1 5 0
1 1 8 5
2 1 10 13
3 2 1 0
4 2 3 1
5 1 4 23
6 3 1 0
You can do the following: shift each group's Cost down one row, then take the cumulative sum of the shifted values within the group:
df['agg'] = df.groupby('Month')['Cost'].shift().fillna(0)
df['Cumsum'] = df.groupby('Month')['agg'].cumsum()
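A quick sanity check that both answers agree, reusing the frame defined above:
# Both derived columns should hold [0, 5, 13, 0, 1, 23, 0]
print(df[['cumsum_', 'Cumsum']])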

Pyspark Dataframe pivot and groupby count

I am working on a pyspark dataframe which looks like below
id  category
1   A
1   A
1   B
2   B
2   A
3   B
3   B
3   B
I want to unstack the category column and count their occurrences. So, the result I want is shown below
id  A     B
1   2     1
2   1     1
3   Null  3
I tried finding something on the internet that can help me but I couldn't find anything that could give me this specific result.
Short version, you don't have to do multiple groupBy's:
df.groupBy("id").pivot("category").count().show()
Try this (not sure it's optimized):
df = spark.createDataFrame([(1,'A'),(1,'A'),(1,'B'),(2,'B'),(2,'A'),(3,'B'),(3,'B'),(3,'B')],['id','category'])
df = df.groupBy('id','category').count()
df.groupBy('id').pivot('category').sum('count').show()
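If you'd rather see 0 than null for the missing id/category combinations, you can fill after pivoting (a small follow-up sketch using standard PySpark fillna):
df.groupBy('id').pivot('category').count().fillna(0).show()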

How to count number of records in a group and save them in a csv file?

I have a dataset as below:
import pandas as pd
dict = {"A":[1,1,1,1,5],"B":[1,1,2,4,1]}
dt = pd.DataFrame(data=dict)
so, it is as below:
A B
1 1
1 1
1 2
1 4
5 1
I need to apply a groupby based on A and B and count how many records each group has.
I have applied the below solution:
dtSize = dt.groupby(by=["A","B"], as_index=False).size()
dtSize.to_csv("./datasets/Final DT/dtSize.csv", sep=',', encoding='utf-8', index=False)
I have 2 problems:
When I open the saved file, it only contains the last column, which holds the number of elements in each group, but it does not include the group columns.
When I print the final dtSize (shown as an image in the original post), some repeated values in A seem to be missing.
My desired output in a .csv file is as below:
A B Number of elements in group
1 1 2
1 2 1
1 4 1
5 1 1
Actually, data from A isn't missing. GroupBy.size returns a Series, so A and B are used as a MultiIndex. Due to this, repeated values for A in the first three rows aren't printed.
You're close. You need to reset the index and, optionally, name the result:
dt.groupby(['A', 'B']).size().reset_index(name='Size')
The result is:
A B Size
0 1 1 2
1 1 2 1
2 1 4 1
3 5 1 1
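To also fix the CSV problem, save the reset frame instead of the raw Series; the path below is the one from the question:
dtSize = dt.groupby(['A', 'B']).size().reset_index(name='Size')
dtSize.to_csv('./datasets/Final DT/dtSize.csv', sep=',', encoding='utf-8', index=False)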

accessing Groupby Sum results [duplicate]

I have a dataframe with 2 index levels:
                   value
Trial measurement
1     0               13
      1                3
      2                4
2     0              NaN
      1               12
3     0               34
Which I want to turn into this:
Trial measurement value
1 0 13
1 1 3
1 2 4
2 0 NaN
2 1 12
3 0 34
How can I best do this?
I need this because I want to aggregate the data as instructed here, but I can't select my columns like that if they are in use as indices.
The reset_index() is a pandas DataFrame method that will transfer index values into the DataFrame as columns. The default setting for the parameter is drop=False (which will keep the index values as columns).
All you have to do is call .reset_index() after the name of the DataFrame:
df = df.reset_index()
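For a self-contained run, the question's frame can be rebuilt like this (a sketch; the NaN comes from numpy):
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [(1, 0), (1, 1), (1, 2), (2, 0), (2, 1), (3, 0)],
    names=['Trial', 'measurement'])
df = pd.DataFrame({'value': [13, 3, 4, np.nan, 12, 34]}, index=idx)
print(df.reset_index())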
This doesn't really apply to your case but could be helpful for others (like myself 5 minutes ago) to know. If one's multiindex has the same name for both levels, like this:
            value
Trial Trial
1     0        13
      1         3
      2         4
2     0       NaN
      1        12
3     0        34
df.reset_index(inplace=True) will fail, because the columns that are created cannot have the same name.
So then you need to rename the multindex with df.index = df.index.set_names(['Trial', 'measurement']) to get:
                   value
Trial measurement
1     0               13
1     1                3
1     2                4
2     0              NaN
2     1               12
3     0               34
And then df.reset_index(inplace=True) will work like a charm.
I encountered this problem after grouping by year and month on a datetime column (not the index) called live_date, which meant that both the year and the month levels were named live_date.
There may be situations when df.reset_index() cannot be used (e.g., when you need the index, too). In this case, use index.get_level_values() to access index values directly:
df['Trial'] = df.index.get_level_values(0)
df['measurement'] = df.index.get_level_values(1)
This will assign index values to individual columns and keep the index.
See the docs for further info.
As #cs95 mentioned in a comment, to drop only one level, use:
df.reset_index(level=[...])
This avoids having to redefine your desired index after reset.
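For example, with the Trial/measurement frame above, the following moves only measurement into a column and keeps Trial as the index (a sketch):
# 'measurement' becomes a regular column; 'Trial' stays in the index
df.reset_index(level=['measurement'])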
I ran into Karl's issue as well. I just found myself renaming the aggregated column then resetting the index.
df = pd.DataFrame(df.groupby(['arms', 'success'])['success'].sum()).rename(columns={'success':'sum'})
df = df.reset_index()
Short and simple
df2 = pd.DataFrame({'test_col': df['test_col'].describe()})
df2 = df2.reset_index()
A solution that might be helpful in cases when not every column has multiple index levels:
df.columns = df.columns.map(''.join)
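For instance, an aggregation over several statistics produces two column levels, and joining the level names collapses the header into single strings (a sketch; the value/sum and value/max columns are illustrative):
agg = df.groupby('Trial').agg({'value': ['sum', 'max']})  # columns ('value', 'sum'), ('value', 'max')
agg.columns = agg.columns.map(''.join)  # columns 'valuesum', 'valuemax'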
Similar to Alex's solution, in a more generalized form. It keeps the index untouched and adds each index level as a new column with its name:
for i in df.index.names:
    df[i] = df.index.get_level_values(i)
which gives
                   value  Trial  measurement
Trial measurement
1     0               13      1            0
      1                3      1            1
      2                4      1            2
...
