Sum DataFrame rows where a column contains a substring

I have this DataFrame:
df1:
Date Value Info
1 1 XXX.othertext2
1 4 somerandomtext
1 2 XXX.othertext2
1 3 XXX.othertext3
1 2 XXX.othertext3
1 1 XXX.othertext2
1 1 XXX.othertext3
2 6 somerandomtext
2 9 XXX.othertext2
I want to sum rows with the same Date, starting at each XXX.othertext2 row and continuing until the next XXX.othertext2 or somerandomtext row (so each sum is the first XXX.othertext2 plus all the XXX.othertext3 rows that follow it). The Info value of the resulting row should be XXX.othertext2:
newdf:
Date Value Info
1 1 XXX.othertext2
1 4 somerandomtext
1 7 XXX.othertext2
1 2 XXX.othertext2
2 6 somerandomtext
2 9 XXX.othertext2

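Here's one option, with a custom grouper (df1 is the DataFrame from the question):
grouper = (df1.Info.str.contains('some') | (df1.Info == 'XXX.othertext2')).cumsum().rename('block')
df1.groupby(['Date', grouper]).agg({'Value': 'sum', 'Info': 'first'}).reset_index().drop(columns='block')
Every row that starts a new block (a somerandomtext row or an XXX.othertext2 row) increments the cumulative sum, so the XXX.othertext3 rows share a group with the XXX.othertext2 row before them. Aggregating explicitly keeps the first Info label of each block instead of concatenating the strings, and renaming the grouper avoids a name clash with the Info column in reset_index. You can refine the match with a regex if necessary.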
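As a self-contained sketch of the grouper idea, the snippet below rebuilds the question's df1 and applies the block-wise aggregation. The names starts_block and block are illustrative, not from the original answer.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Date": [1, 1, 1, 1, 1, 1, 1, 2, 2],
    "Value": [1, 4, 2, 3, 2, 1, 1, 6, 9],
    "Info": [
        "XXX.othertext2", "somerandomtext", "XXX.othertext2",
        "XXX.othertext3", "XXX.othertext3", "XXX.othertext2",
        "XXX.othertext3", "somerandomtext", "XXX.othertext2",
    ],
})

# A row opens a new block if it is a "some..." row or an XXX.othertext2 row.
starts_block = df1["Info"].str.contains("some") | df1["Info"].eq("XXX.othertext2")

# The running count of block starts gives every row its block id.
block = starts_block.cumsum().rename("block")

newdf = (
    df1.groupby(["Date", block], sort=False)
       .agg(Value=("Value", "sum"), Info=("Info", "first"))
       .reset_index()
       .drop(columns="block")
)
print(newdf)
```

Under these assumptions the Value column of newdf matches the desired result in the question (1, 4, 7, 2, 6, 9).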

Related

How to get number of rows since last peak Pandas

I would like to get a running count of how many rows lie between the current row and the last peak. Example data:
Value | Rows since Peak
-----------------------
1 0
3 0
1 1
2 2
1 3
4 0
6 0
5 1

You can compare the values to the cummax and use the result as the key for a groupby.cumcount:
df['Rows since Peak'] = (df.groupby(df['Value'].eq(df['Value'].cummax())
                                    .cumsum())
                           .cumcount()
                         )
How it works:
Every time a value equals the cumulative maximum (df['Value'].eq(df['Value'].cummax())), a new group starts (cumsum turns those True flags into group ids). cumcount then numbers the rows from the start of each group, beginning at 0.
output:
Value Rows since Peak
0 1 0
1 3 0
2 1 1
3 2 2
4 1 3
5 4 0
6 6 0
7 5 1
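The same steps can be split into named intermediates, which makes the grouping easier to inspect. This is a minimal sketch on the question's values; is_peak and peak_id are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({"Value": [1, 3, 1, 2, 1, 4, 6, 5]})

# True wherever the value ties or sets the running maximum, i.e. a new peak.
is_peak = df["Value"].eq(df["Value"].cummax())

# Each peak increments the counter, so rows after a peak share its id.
peak_id = is_peak.cumsum()

# Position of each row within its peak group, starting at 0 on the peak itself.
df["Rows since Peak"] = df.groupby(peak_id).cumcount()
print(df)
```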

Filter based on pairs within a group - if a value is present at the end

Group Code
1 2
1 2
1 4
1 1
2 4
2 1
2 2
2 3
2 1
2 1
2 3
Within each group, consecutive rows form pairs. In Group 1, for example, the pairs are (2,2), (2,4) and (4,1).
I want to filter these pairs based on code number 2 OR 4 being present at the END of the pair. In Group 1, for example, only (2,2) and (2,4) will be kept, while (4,1) will be filtered out.
The code I am using to check for the code number at the beginning of a pair is
df[df.groupby("Group")['Code'].shift().isin([2,4])|df['Code'].isin([2,4])]
Expected output:
Group Code
1 2
1 2
1 4
2 1
2 2

Starting from your own code, you can modify it to achieve your goal:
idx = df.groupby("Group")['Code'].shift(-1).isin([2,4])
df[idx | idx.shift()]
First you group by 'Group', shift Code one row up within each group, and check for the values 2 or 4. idx is therefore True on the first row of every pair whose second row is 2 or 4. Finally, you keep both the beginning of each such pair (idx) and its end (idx.shift()).
Output:
Group Code
0 1 2
1 1 2
2 1 4
5 2 1
6 2 2
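For reference, a runnable sketch of this groupby approach on the question's data. It passes fill_value=False to shift so the mask stays boolean instead of picking up a NaN in its first row; starts_pair and ends_pair are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "Group": [1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2],
    "Code":  [2, 2, 4, 1, 4, 1, 2, 3, 1, 1, 3],
})

# True on the first row of a pair whose second row (same group) is 2 or 4.
starts_pair = df.groupby("Group")["Code"].shift(-1).isin([2, 4])

# The row right after a pair start is that pair's end.
# The last row of a group is never a start, so nothing leaks across groups.
ends_pair = starts_pair.shift(fill_value=False)

result = df[starts_pair | ends_pair]
print(result)
```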

Assuming the data is sorted by Group, you can also do it without groupby(), which avoids the grouping overhead:
m = df['Code'].isin([2,4]) & df['Group'].eq(df['Group'].shift())
df[m | m.shift(-1)]
Result:
Group Code
0 1 2
1 1 2
2 1 4
5 2 1
6 2 2
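A sketch of the groupby-free variant with the sorting assumption made explicit. The stable sort is an addition here, not part of the original answer; it keeps the row order inside each group, which the pair logic depends on.

```python
import pandas as pd

df = pd.DataFrame({
    "Group": [1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2],
    "Code":  [2, 2, 4, 1, 4, 1, 2, 3, 1, 1, 3],
})

# Assumption: rows of one group must be contiguous. A stable sort guarantees
# that without reordering rows inside a group.
df = df.sort_values("Group", kind="stable")

# A row can only end a pair if the previous row belongs to the same group.
same_group_as_prev = df["Group"].eq(df["Group"].shift())
ends_pair = df["Code"].isin([2, 4]) & same_group_as_prev

# The row just before a pair end is the pair start.
starts_pair = ends_pair.shift(-1, fill_value=False)

result = df[ends_pair | starts_pair]
print(result)
```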

Pandas rank values in rows of DataFrame

Learning Python. I have a dataframe like this
cand1 cand2 cand3
0 40.0900 39.6700 36.3700
1 44.2800 44.2800 35.4200
2 43.0900 51.2200 46.3500
3 35.7200 55.2700 36.4700
and I want to rank each row according to the value of the columns, so that I get
cand1 cand2 cand3
0 1 2 3
1 1 1 3
2 1 3 2
3 3 1 2
I have now
for index, row in df.iterrows():
    df.loc['Rank'] = df.loc[index].rank(ascending=False).astype(int)
    print (df)
However, this keeps on repeating the whole dataframe. Note also the special case in the second row (index 1), where two values are the same.
Suggestions appreciated.

Use DataFrame.rank with axis=1 instead of ranking each row as a Series. method='min' gives tied values the same, lowest rank:
df_rank = df.rank(axis=1, ascending=False, method='min').astype(int)
Output:
cand1 cand2 cand3
0 1 2 3
1 1 1 3
2 3 1 2
3 3 1 2
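A small runnable version on the question's data, which also shows how the tie in row 1 behaves under a different tie-breaking method. The comparison with method="dense" is an illustration added here, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "cand1": [40.09, 44.28, 43.09, 35.72],
    "cand2": [39.67, 44.28, 51.22, 55.27],
    "cand3": [36.37, 35.42, 46.35, 36.47],
})

# Rank across columns within each row; the largest value gets rank 1.
# method="min": tied values share the lowest rank and the next rank is skipped.
df_rank = df.rank(axis=1, ascending=False, method="min").astype(int)

# method="dense": tied values share a rank and no rank is skipped afterwards.
df_dense = df.rank(axis=1, ascending=False, method="dense").astype(int)

print(df_rank)
print(df_dense)
```

With method="min" the tied row reads 1, 1, 3, while method="dense" gives 1, 1, 2, so the choice depends on whether gaps after ties are wanted.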

Adding new column to a DataFrame based on values in a list

Novice programmer here seeking help. I have a Dataframe that looks like this:
Name
0 "jackolsen"
1 "IsabelClark"
2 "JaneDoe"
3 "JackOlsen"
4 "JACKOLSEN"
5 "MariaSmith"
6 "JohnSmith"
7 "MaryKent"
8 "MaryKent"
And a list of names:
l = ["jackolsen", "janedoe", "johnsmith"]
My desired output is a new column in the DataFrame which tells me whether the name is on the list (value = 1) or not (value = 0), regardless of whether it is uppercase or lowercase. In this example it would be:
Name List
0 "jackolsen" 1
1 "IsabelClark" 0
2 "JaneDoe" 1
3 "JackOlsen" 1
4 "JACKOLSEN" 1
5 "MariaSmith" 0
6 "JohnSmith" 1
7 "MaryKent" 0
8 "MaryKent" 0
How can I achieve my desired output?

Use str.lower with isin:
df['List'] = df.Name.str.lower().isin(l).view('i1')
print(df)
Name List
0 jackolsen 1
1 IsabelClark 0
2 JaneDoe 1
3 JackOlsen 1
4 JACKOLSEN 1
5 MariaSmith 0
6 JohnSmith 1
7 MaryKent 0
8 MaryKent 0

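Alternatively, convert the boolean mask with astype, which is the safer choice because Series.view is deprecated in recent pandas versions:
df["List"] = df["Name"].str.lower().isin(l).astype(int)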
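A self-contained sketch of the case-insensitive lookup. It also lowercases the lookup names, which is an extra precaution not in the original answers, in case the list itself contains mixed case; names and wanted are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": [
        "jackolsen", "IsabelClark", "JaneDoe", "JackOlsen", "JACKOLSEN",
        "MariaSmith", "JohnSmith", "MaryKent", "MaryKent",
    ]
})
names = ["jackolsen", "janedoe", "johnsmith"]

# Normalise both sides to lowercase so the comparison ignores case.
wanted = {name.lower() for name in names}

# isin returns a boolean mask; astype(int) turns True/False into 1/0.
df["List"] = df["Name"].str.lower().isin(wanted).astype(int)
print(df)
```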

expand pandas groupby results to initial dataframe

Say I have a dataframe df and group it by a few columns into dfg, taking the median of one of its columns. How can I then take those median values and expand them so that they sit in a new column of the original df, matched to the respective conditions? This will produce duplicates, but I will use this column in a subsequent calculation, and having the medians in a column makes that possible.
Example data:
import numpy as np
import pandas as pd
data = {'idx':[1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2],
'condition1':[1,1,2,2,3,3,4,4,1,1,2,2,3,3,4,4],
'condition2':[1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2],
'values':np.random.normal(0,1,16)}
df = pd.DataFrame(data)
dfg = df.groupby(['idx', 'condition2'], as_index=False)['values'].median()
example of desired result (note duplicates corresponding to correct conditions):
idx condition1 condition2 values medians
0 1 1 1 0.35031 0.656355
1 1 1 2 -0.291736 -0.024304
2 1 2 1 1.593545 0.656355
3 1 2 2 -1.275154 -0.024304
4 1 3 1 0.075259 0.656355
5 1 3 2 1.054481 -0.024304
6 1 4 1 0.9624 0.656355
7 1 4 2 0.243128 -0.024304
8 2 1 1 1.717391 1.155406
9 2 1 2 0.788847 1.006583
10 2 2 1 1.145891 1.155406
11 2 2 2 -0.492063 1.006583
12 2 3 1 -0.157029 1.155406
13 2 3 2 1.224319 1.006583
14 2 4 1 1.164921 1.155406
15 2 4 2 2.042239 1.006583

I believe you need GroupBy.transform with median for new column:
df['medians'] = df.groupby(['idx', 'condition2'])['values'].transform('median')
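To illustrate, the sketch below applies transform to the question's layout and checks it against the more manual route of merging dfg back onto df. The seeded generator and the medians_merged column are assumptions made for reproducibility and comparison only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the example is reproducible
df = pd.DataFrame({
    "idx": [1] * 8 + [2] * 8,
    "condition1": [1, 1, 2, 2, 3, 3, 4, 4] * 2,
    "condition2": [1, 2] * 8,
    "values": rng.normal(0, 1, 16),
})

# transform returns a result aligned to df's index, one value per original row.
df["medians"] = df.groupby(["idx", "condition2"])["values"].transform("median")

# Equivalent but longer: aggregate first, then merge the medians back.
dfg = (
    df.groupby(["idx", "condition2"], as_index=False)["values"]
      .median()
      .rename(columns={"values": "medians_merged"})
)
check = df.merge(dfg, on=["idx", "condition2"], how="left")
print(check)
```

transform avoids the intermediate frame and the merge, which is why it tends to be the shorter option when the result is only needed as a column.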
