Split dataframe into smaller dataframes based on range of values

I have the following dataframe:
x text
1 500 aa
2 550 bb
3 700 cc
4 750 dd
My goal is to split this df if the x-values are more than 100 points apart.
Is there a pandas function that allows you to make a split based on range of values?
Here is my desired output:
df_1:
x text
0 500 aa
1 550 bb
df_2:
x text
0 700 cc
1 750 dd

I believe you need to group by a helper Series, then convert the groupby object to a tuple and then to a dictionary of DataFrames:
d = dict(tuple(df.groupby(df['x'].diff().gt(100).cumsum())))
print (d)
{0: x text
1 500 aa
2 550 bb, 1: x text
3 700 cc
4 750 dd}
Detail:
First get the differences with Series.diff, compare them with Series.gt (greater than), and create consecutive group labels with Series.cumsum:
print (df['x'].diff().gt(100).cumsum())
1 0
2 0
3 1
4 1
Name: x, dtype: int32
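As a minimal end-to-end sketch of the steps above, the following rebuilds the sample frame from the question and produces the two pieces with a fresh 0-based index, as in the desired output. The names group_id, parts, df_1 and df_2 are only illustrative.

```python
import pandas as pd

# Sample data from the question (index 1..4 as shown there).
df = pd.DataFrame(
    {"x": [500, 550, 700, 750], "text": ["aa", "bb", "cc", "dd"]},
    index=[1, 2, 3, 4],
)

# A new group starts whenever x jumps by more than 100 from the previous row.
group_id = df["x"].diff().gt(100).cumsum()

# One DataFrame per group, re-indexed from 0 to match the desired output.
parts = {key: part.reset_index(drop=True) for key, part in df.groupby(group_id)}

df_1, df_2 = parts[0], parts[1]
print(df_1)
print(df_2)
```

Dropping reset_index(drop=True) keeps the original row labels instead.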

Make a new column with shift(1), then split wherever the difference between the values of the two columns exceeds the threshold.
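One way this shift-based idea might look, assuming the threshold of 100 from the question. The helper column names prev_x and block are made up for illustration and are dropped again before the pieces are returned.

```python
import pandas as pd

df = pd.DataFrame({"x": [500, 550, 700, 750], "text": ["aa", "bb", "cc", "dd"]})

# Helper column holding the previous row's x value (NaN for the first row).
df["prev_x"] = df["x"].shift(1)

# Flag rows whose gap to the previous row exceeds 100; the running total of
# those flags labels each block of rows that belong together.
df["block"] = (df["x"] - df["prev_x"]).gt(100).cumsum()

# Split on the block label and remove the helper columns again.
pieces = [
    part.drop(columns=["prev_x", "block"]).reset_index(drop=True)
    for _, part in df.groupby("block")
]
print(pieces[0])
print(pieces[1])
```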

Related

Pivot select tables in dataframe to make values column headers in Python

I have a dataframe, df, where I would like to transform and pivot select values.
I wish to groupby id and date, sum the 'pwr' values and then count the type values.
df
df values that will be column headers: 'hi', 'hey'
id date type pwr de_id de_date de_type de_pwr base base_pos
aa q1 hey 10 aa q1 hey 5 200 40
aa q1 hi 5 200 40
aa q1 hey 5 200 40
aa q2 hey 2 aa q2 hey 3 200 40
aa q2 hey 2 aa q2 hey 3 200 40
bb q1 0 bb q1 hi 6 500 10
bb q1 0 bb q1 hi 6 500 10
Desired
id date hey hi total sum hey hi totald desum base base_pos
aa q1 2 1 3 20 1 0 1 5 200 40
aa q2 2 0 2 4 2 0 2 6 200 40
bb q1 0 0 0 0 0 2 2 12 500 10
Doing
sum1 = df.groupby(['id','date']).agg({'pwr': 'sum', 'type': 'count', 'de_pwr': 'sum', 'de_type': 'count'})
pd.pivot_table(df, values = '' , columns = 'type')
Any suggestion will be helpful.

This is definitely not a 'clean' way to go about it, but since you have 2 separate totals summing along columns, I don't know how much cleaner it could get (and the output seems accurate).
You don't mention what aggregation you use to get base and base_pos values, so I went with mean (might need to change it).
type_col = pd.crosstab(index = [df['id'], df['date']], columns = df['type'])
type_col['total'] = type_col.sum(axis = 1)
pwr_sum = df.groupby(['id','date'])['pwr'].sum()
de_type_col = pd.crosstab(index = [df['id'], df['date']], columns = df['de_type'])
de_type_col['total_de'] = de_type_col.sum(axis = 1)
pwr_de_sum = df.groupby(['id','date'])['de_pwr'].sum()
base_and_pos = df.groupby(['id','date'])[['base','base_pos']].mean()
out = pd.concat([type_col, pwr_sum, de_type_col, pwr_de_sum, base_and_pos], axis = 1).fillna(0).astype('int')
Essentially use crosstab to get value counts and sum them along columns. The index of resulting DataFrame is the same as groupby(['id','date']), so you can then concatenate results of groupby without issue. Repeat the same process for de columns, apply groupby with your choice of aggregation to base and base_pos columns, and concatenate all results along axis = 1. Obviously, you can group some operations together (such as pwr sum, de_pwr sum and base/base_pos aggregation), but you'll need to reorder your columns after that to get the desired order.
Output:
id date hey hi total pwr hey hi total_de de_pwr base base_pos
aa q1 2 1 3 20 1 0 1 5 200 40
aa q2 2 0 2 4 2 0 2 6 200 40
bb q1 0 0 0 0 0 2 2 12 500 10
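To see why the crosstab and groupby results can be concatenated directly, here is a reduced sketch on a small made-up frame (the data below is invented for illustration and is not the asker's table). Both results are indexed by the same (id, date) pairs, so pd.concat with axis=1 lines the rows up.

```python
import pandas as pd

# Small invented sample with the same column roles as in the question.
df = pd.DataFrame({
    "id":   ["aa", "aa", "aa", "bb"],
    "date": ["q1", "q1", "q2", "q1"],
    "type": ["hey", "hi", "hey", "hi"],
    "pwr":  [10, 5, 2, 1],
})

# Count each type per (id, date) and add a row total.
counts = pd.crosstab(index=[df["id"], df["date"]], columns=df["type"])
counts["total"] = counts.sum(axis=1)

# Sum pwr per (id, date); the index matches the crosstab index.
pwr_sum = df.groupby(["id", "date"])["pwr"].sum()

# Align on the shared (id, date) index and place the results side by side.
out = pd.concat([counts, pwr_sum], axis=1)
print(out)
```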

Combine rows of a dataframe on ID to sum values, but keep categorical data

I’m cleaning data and had a question. I have a Contact dataset and an Account dataset. I need to merge the two dataframes on “ContactID”. Some ContactID’s have multiple accounts. So, when I merge them there are still some ContactID’s that have multiple rows. I need to combine these rows so that the numeric columns add together, while still keeping the categorical columns. Below is an example:
When I merge:
ContactID Value Type
1 800 A
1 70 A
2 100 B
3 300 A
4 200 C
5 500 B
5 600 B
What I need the data to look like when I merge:
ContactID Value Type
1 870 A
2 100 B
3 300 A
4 200 C
5 1100 B
I have tried this:
fulldf.groupby(fulldf.ContactID).sum()
But, then I only get a dataframe that contains numeric values.

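Check each column's dtype and build the aggregation dict from it: object columns take the first value and all other columns are summed. Note that the dict also covers ContactID, which is why the grouping key shows up summed in the output below: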
out = df.groupby('ContactID').agg(df.dtypes.map({'O':'first'}).fillna('sum').to_dict())
ContactID Value Type
ContactID
1 2 870 A
2 2 100 B
3 3 300 A
4 4 200 C
5 10 1100 B

After grouping you can apply individual aggregation functions to the columns of the DataFrame. For the numeric column use the sum, for the categorical column take the first element.
df.groupby("ContactID").agg(
{
"Value": lambda col: col.sum(),
"Type": lambda col: col.iloc[0],
}
)
# Output
ContactID Value Type
1 870 A
2 100 B
3 300 A
4 200 C
5 1100 B
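The same aggregation can also be written with pandas' built-in aggregation names instead of lambdas. This is a sketch on the sample data from the question; as_index=False is an optional choice here that keeps ContactID as a regular column.

```python
import pandas as pd

df = pd.DataFrame({
    "ContactID": [1, 1, 2, 3, 4, 5, 5],
    "Value": [800, 70, 100, 300, 200, 500, 600],
    "Type": ["A", "A", "B", "A", "C", "B", "B"],
})

# "sum" adds the numeric values; "first" keeps the first category per contact.
out = df.groupby("ContactID", as_index=False).agg({"Value": "sum", "Type": "first"})
print(out)
```

Taking the first Type assumes that every row of a contact carries the same category, as in the example data.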

Calculate the sum of the first n rows for each group

What I want to do is group by column A and then take the sum of first two rows, then assign that value as a new column. Example below:
DF:
ColA ColB
AA 2
AA 1
AA 5
AA 3
BB 9
BB 3
BB 2
BB 12
CC 0
CC 10
CC 5
CC 3
Desired DF:
ColA ColB NewCol
AA 2 3
AA 1 3
AA 5 3
AA 3 3
BB 9 12
BB 3 12
BB 2 12
BB 12 12
CC 0 10
CC 10 10
CC 5 10
CC 3 10
For AA, it looks at ColB, takes the sum of the first two rows, and assigns that summed value to NewCol. I've tried this by looping through the unique ColA values, creating a subset dataframe of the first two rows, summing, and populating a dictionary with the values, then mapping the dictionary back. But my dataframe is VERY big and it takes forever. Any ideas?
Thank you!

You can use transform with a lambda function to get a value for every row. In the lambda, use head(2) to get the first 2 rows of each group and sum() to add them up:
df.groupby('ColA')['ColB'].transform(lambda x: x.head(2).sum())
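A runnable sketch on the question's data follows. Because the asker mentions a very large frame, it also shows one possible lambda-free variant built on cumcount; that variant is an alternative suggested here, not part of the answer above, and the column name NewCol_fast is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "ColA": ["AA"] * 4 + ["BB"] * 4 + ["CC"] * 4,
    "ColB": [2, 1, 5, 3, 9, 3, 2, 12, 0, 10, 5, 3],
})

# Answer's approach: per group, sum the first two rows and broadcast the result.
df["NewCol"] = df.groupby("ColA")["ColB"].transform(lambda x: x.head(2).sum())

# Alternative without a Python-level lambda: zero out every row after the
# first two of its group (cumcount numbers rows 0, 1, 2, ... per group),
# then broadcast the per-group sum.
first_two = df["ColB"].where(df.groupby("ColA").cumcount() < 2, 0)
df["NewCol_fast"] = first_two.groupby(df["ColA"]).transform("sum")

print(df)
```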

Counting mode occurrences for all columns in a dataframe

I have a dataframe that looks like below.
dataframe1 =
In AA BB CC
0 10 1 0
1 11 2 3
2 10 6 0
3 9 1 0
4 10 3 1
5 1 2 0
Now I want to create a dataframe that gives me the count of modes for each column. For column AA the count is 3 (mode 10), and for column CC the count is 4 (mode 0). Column BB has two modes, 1 and 2, so there I want the sum of the counts of both modes: 2+2=4.
Therefore the final dataframe that I want looks like below.
Columns Counts
AA 3
BB 4
CC 4
How to do it?

Another slightly more scalable solution using list comprehension:
pd.concat([df.eq(x) for _, x in df.mode().iterrows()]).sum()
[out]
AA 3
BB 4
CC 4
dtype: int64

You can compare columns with modes and count matches by sum:
df = pd.DataFrame({'Columns': df.columns,
'Val':[df[x].isin(df[x].mode()).sum() for x in df]})
print (df)
Columns Val
0 AA 3
1 BB 4
2 CC 4

First we get the modes of the columns with DataFrame.mode.
Then we compare each column to its modes, using Series.isin to flag the values that are modes, and sum the flags.
modes = df.iloc[:, 1:].mode()
data = {col: df[col].isin(modes[col]).sum() for col in df.iloc[:, 1:].columns}
df = pd.DataFrame.from_dict(data, orient='index', columns=['Counts'])
Counts
AA 3
BB 4
CC 4
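For reference, a self-contained sketch of the mode-and-isin idea on the question's data. It assumes that In is the index rather than a regular column, so no column slicing is needed; the name counts is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "AA": [10, 11, 10, 9, 10, 1],
    "BB": [1, 2, 6, 1, 3, 2],
    "CC": [0, 3, 0, 0, 1, 0],
})

# DataFrame.mode returns one row per mode; columns with fewer modes
# are padded with NaN (here AA and CC have one mode, BB has two).
print(df.mode())

# For each column, flag the values that equal one of its modes and count them.
counts = pd.DataFrame({
    "Columns": df.columns,
    "Counts": [df[col].isin(df[col].mode()).sum() for col in df.columns],
})
print(counts)
```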

This uses the pyjanitor module, whose groupby_agg method works like a groupby transform and returns a dataframe:
(df.melt(id_vars='In')
.groupby('variable')
.agg(numbers=('value','value_counts'))
.groupby_agg(by='variable',
# here, it subtracts the max of numbers (for each group) from each number in the group
agg = lambda x : x - x.max(),
agg_column_name='numbers',
new_column_name = 'test'
)
.query('test==0')
.groupby('variable')
.agg(count=('numbers','sum'))
)
count
variable
AA 3
BB 4
CC 4

Pandas df.mode with multiple values per cell in column

I have a dataframe with a Keywords column. Each cell in that column has 5-10 individual values (comma separated), each consisting of 1-3 words. How can I count the most frequently occurring keywords in the column?
I have tried df.Keywords.mode but it returns all values for each cell because they obviously don't occur multiple times within each cell.
All input is appreciated,
Thanks!

First use Series.str.split with expand=True to get a DataFrame and reshape it with DataFrame.stack, then count with Series.value_counts and take the top values with Series.head:
df = pd.DataFrame({'Keywords':['aa,bb,vv,vv','aa,aa,cc,bb','zz,bb,aa,ss']})
N = 5
df1 = (df.Keywords.str.split(',', expand=True)
.stack()
.value_counts()
.head(N)
.rename_axis('val')
.reset_index(name='count'))
print (df1)
val count
0 aa 4
1 bb 3
2 vv 2
3 zz 1
4 cc 1
Another solution, if there are no missing values, is to flatten the split lists and count with Counter:
from collections import Counter
N = 5
df1 = pd.DataFrame(Counter([y for x in df.Keywords for y in x.split(',')]).most_common(N),
columns=['val','count'])
print (df1)
val count
0 aa 4
1 bb 3
2 vv 2
3 zz 1
4 cc 1
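A third variant, not from the answer above, swaps the expand/stack step for Series.explode (available in pandas 0.25 and later). The strip call is an added assumption that real keyword cells may contain spaces after the commas.

```python
import pandas as pd

df = pd.DataFrame({"Keywords": ["aa,bb,vv,vv", "aa,aa,cc,bb", "zz,bb,aa,ss"]})

N = 3
top = (
    df["Keywords"]
    .str.split(",")   # each cell becomes a list of keywords
    .explode()        # one row per keyword
    .str.strip()      # drop stray whitespace around keywords
    .value_counts()   # count occurrences over the whole column
    .head(N)          # keep the N most frequent keywords
)
print(top)
```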
