Group separated counting values in a pandas dataframe - python

I have the following df:
     A   B
0    1  10
1    2  20
2  NaN   5
3    3   1
4  NaN   2
5  NaN   3
6    1  10
7    2  50
8  NaN  80
9    3   5
It consists of repeating sequences from 1-3, separated by a variable number of NaNs. I want to group each of these 1-3 sequences and get the minimum value of column B within each sequence.
Desired output, something like:
   B_min
0      1
6      5
Many thanks in advance
draj
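For reproducibility, a minimal sketch of the example frame, with the values taken from the table above:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, np.nan, 3, np.nan, np.nan, 1, 2, np.nan, 3],
    'B': [10, 20, 5, 1, 2, 3, 10, 50, 80, 5],
})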

The idea is to first remove the rows with missing values using DataFrame.dropna, then aggregate with GroupBy.min, grouping by a helper Series that marks each 1-3 sequence, created by comparing A to 1 with Series.eq and taking the Series.cumsum (the helper is built from the original df, so it aligns by index with the remaining rows), and finally clean up the result into a one-column DataFrame:
# a new group starts whenever A equals 1; cumsum turns that into a group id
df = (df.dropna(subset=['A'])
        .groupby(df['A'].eq(1).cumsum())['B']
        .min()
        .reset_index(drop=True)
        .to_frame(name='B_min'))
print(df)
   B_min
0      1
1      5

All you need is df.groupby() followed by min(). Is this what you are expecting?
df.groupby('A')['B'].min()
Output:
A
1.0    10
2.0    20
3.0     1
Name: B, dtype: int64
Note that groupby excludes NaN keys by default, so the NaN rows are already left out. If you want to drop them explicitly beforehand, you can use df.dropna():
df.dropna().groupby('A')['B'].min()
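Conversely, if the NaN rows should be kept as their own group, pandas 1.1+ accepts dropna=False in groupby; a minimal sketch:
# keep NaN as its own group key (requires pandas >= 1.1)
df.groupby('A', dropna=False)['B'].min()
which yields:
A
1.0    10
2.0    20
3.0     1
NaN     2
Name: B, dtype: int64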

Related

Mesh / divide / explode values in a column of a DataFrame according to a number of meshes for each value

Given a DataFrame df1:
   value  mesh
0     10     2
1     12     3
2      5     2
obtain a new DataFrame df2 in which, for each row of df1, there are mesh rows, each obtained by dividing that row's value by its mesh:
df2:
   value/mesh
0         5.0
1         5.0
2         4.0
3         4.0
4         4.0
5         2.5
6         2.5
More generally, given df1:
   value  mesh_value  other_value
0     10           2            0
1     12           3            1
2      5           2            2
obtain df2:
   value/mesh_value  other_value
0               5.0            0
1               5.0            0
2               4.0            1
3               4.0            1
4               4.0            1
5               2.5            2
6               2.5            2
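For reference, a minimal sketch of the general example frame (the last two answers below refer to it as df):
import pandas as pd

df = pd.DataFrame({'value': [10, 12, 5],
                   'mesh_value': [2, 3, 2],
                   'other_value': [0, 1, 2]})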
You can use map to recover, for each row of df2, the index of the df1 row it came from (this assumes df2 already exists and that the value/mesh ratios are unique across rows):
df2['new'] = df2['value/mesh'].map(dict(zip(df1.eval('value/mesh'), df1.index)))
Out[243]:
0    0
1    0
2    1
3    1
4    1
5    2
6    2
Name: value/mesh, dtype: int64
Try as follows:
Use Series.div for value / mesh_value, and apply Series.reindex using np.repeat with df.mesh_value as the repeats argument, so that each row appears mesh_value times.
Next, use pd.concat to combine the result with df.other_value along axis=1.
Finally, rename the column with the result of value / mesh_value (its default name will be 0) using df.rename, and chain df.reset_index to get back a standard index.
import numpy as np

df2 = (pd.concat([df.value.div(df.mesh_value)
                    .reindex(np.repeat(df.index, df.mesh_value)),
                  df.other_value], axis=1)
         .rename(columns={0: 'value_mesh_value'})
         .reset_index(drop=True))
print(df2)
   value_mesh_value  other_value
0               5.0            0
1               5.0            0
2               4.0            1
3               4.0            1
4               4.0            1
5               2.5            2
6               2.5            2
Or, slightly differently:
Use df.assign to add a column with the result of df.value.div(df.mesh_value), and reindex / rename in the same way as above.
Use df.drop to get rid of the columns you don't want (value, mesh_value) and df.iloc to change the column order (we want ['value_mesh_value', 'other_value'] instead of the other way around, hence [1, 0]). And again, reset the index.
We put all of this between parentheses and assign it to df2.
df2 = (df.assign(tmp=df.value.div(df.mesh_value))
         .reindex(np.repeat(df.index, df.mesh_value))
         .rename(columns={'tmp': 'value_mesh_value'})
         .drop(columns=['value', 'mesh_value'])
         .iloc[:, [1, 0]]
         .reset_index(drop=True))
# same result
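As a side note, the same repeat-then-divide idea can be sketched more directly with Index.repeat and df.loc (same column names as above assumed):
# repeat each row mesh_value times, then compute the ratio on the expanded frame
df2 = (df.loc[df.index.repeat(df.mesh_value)]
         .assign(value_mesh_value=lambda d: d.value / d.mesh_value)
         [['value_mesh_value', 'other_value']]
         .reset_index(drop=True))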

How to aggregate all values in a pandas dataframe column into 2 values

I have a Pandas dataframe that contains some columns. Each column has some different values. See the image.
In col1 the value 1 is more frequent than the others, so I need to transform this column to have only the values 1 and "more than 1".
How can I do that?
My goal here is to turn this column into a categorical column, but I have no idea how to do that.
The expected output is something like the next image:
Try the clip function on the column:
df["col1"].clip(upper=2)
0 1
1 2
2 2
3 2
4 1
5 2
6 2
7 1
8 1
9 1
10 1
11 2
12 1
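Since the stated goal is a categorical column, the clipped values can then be mapped to labels; a minimal sketch (the label strings "1" and ">1" and the column name col1_cat are illustrative choices, not from the question):
# map the clipped codes to labels and store them as an ordered Categorical
df["col1_cat"] = pd.Categorical(
    df["col1"].clip(upper=2).map({1: "1", 2: ">1"}),  # hypothetical labels
    categories=["1", ">1"],
    ordered=True,
)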

Adding multiple row values into one row while keeping the index interval the same as the number of rows added in python

I have a data frame with many columns (30-40) in a time series running continuously from 1 to 1440 minutes.
df
time  colA  colB  colC ...
   1     5     4     3
   2     1     2     3
   3     5     4     3
   4     6     7     3
   5     9     0     3
   6     4     4     0
..
Now I want to add every two row values into one, but I want to keep the interval of the 'time' index the same as the number of rows I am adding together. The resulting data frame is:
df
time  colA  colB  colC ...
   1     6     6     6
   3    11    11     6
   5    13     4     3
..
Here I added every two row values into one, and the time index interval is likewise the same as 2 rows: 1, 3, 5, ...
Is it possible to achieve that?
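A minimal sketch of the example frame, using the excerpt shown above:
import pandas as pd

df = pd.DataFrame({
    'time': [1, 2, 3, 4, 5, 6],
    'colA': [5, 1, 5, 6, 9, 4],
    'colB': [4, 2, 4, 7, 0, 4],
    'colC': [3, 3, 3, 3, 3, 0],
})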
Another way would be to group your data set every two rows and aggregate, using sum on your 'colX' columns and mean on your time column. Chaining astype(int) will truncate the resulting mean (1.5 becomes 1, matching the desired time values):
# sum every 'col*' column, take the mean of 'time'
d = {col: 'sum' for col in df.columns if col.startswith('col')}
df.groupby(df.index // 2).agg({**d, 'time': 'mean'}).astype(int)
prints back:
   colA  colB  colC  time
0     6     6     6     1
1    11    11     6     3
2    13     4     3     5
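Note that df.index // 2 pairs rows correctly only with a default RangeIndex. For an arbitrary index, a position-based key is the safer sketch:
import numpy as np

# pair consecutive rows by position, regardless of the index labels
df.groupby(np.arange(len(df)) // 2).agg({**d, 'time': 'mean'}).astype(int)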
Another way is to do the addition for all columns and then fix time:
# assumes an even number of rows so the two slices line up
df_new = df[1::2].reset_index(drop=True) + df[::2].reset_index(drop=True)
df_new['time'] = df[::2]['time'].values

NaN values emerging when concatenating multiple dataframes with Pandas

I have multiple dataframes with data for each quarter of the year. My goal is to concatenate all of them so I can sum values and get a view of my entire year.
I managed to concatenate the four dataframes (which have the same column names and the same row names) into one. But I keep getting NaN in two columns, even though I have the data. It goes like this:
df1:
        my_data  1st_quarter
0  occurrence_1            2
1  occurrence_3            3
2  occurrence_2            0
df2:
        my_data  2nd_quarter
0  occurrence_1            5
1  occurrence_3           10
2  occurrence_2            3
df3:
        my_data  3th_quarter
0  occurrence_1           10
1  occurrence_3            2
2  occurrence_2            1
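As a side note, the NaN pattern shown below can be reproduced when one frame's key differs slightly; a sketch with a hypothetical trailing space in df2's first key (the real data is not shown in the question):
import pandas as pd

df1 = pd.DataFrame({'my_data': ['occurrence_1', 'occurrence_3', 'occurrence_2'],
                    '1st_quarter': [2, 3, 0]})
df2 = pd.DataFrame({'my_data': ['occurrence_1 ', 'occurrence_3', 'occurrence_2'],  # note the hypothetical stray space
                    '2nd_quarter': [5, 10, 3]})
df3 = pd.DataFrame({'my_data': ['occurrence_1', 'occurrence_3', 'occurrence_2'],
                    '3th_quarter': [10, 2, 1]})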
So I run this:
df_results = pd.concat(
    (df.set_index('my_data') for df in [df1, df2, df3]),
    axis=1, join='outer'
).reset_index()
What is happening is this output:
           type  1st_quarter  2nd_quarter  3th_quarter
0  occurrence_1            2          NaN           10
1  occurrence_3            3           10            2
2  occurrence_2            0            3            1
3  occurrence_1          NaN            5          NaN
If I use join='inner', the first row disappears. Note that the rows have the exact same name in all dataframes.
How can I solve the NaN problem? Or, after doing pd.concat, how can I reorganize my DF to "fill" the NaN with the correct numbers?
Update: my original dataset (which I unfortunately cannot post publicly) has an inconsistency in the first row name. Any suggestions on how I can get around it? Can I rename a row? Or combine two rows after concatenating the dataframes?
I managed to get around this problem using combine_first with loc:
df_results.loc[0] = df_results.loc[0].combine_first(df_results.loc[3])
So I got this:
           type  1st_quarter  2nd_quarter  3th_quarter
0  occurrence_1            2            5           10
1  occurrence_3            3           10            2
2  occurrence_2            0            3            1
3  occurrence_1          NaN            5          NaN
Then I dropped the last row:
df_results = df_results.drop([3])
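An alternative sketch: if the duplicate key differs only by something like stray whitespace or casing (an assumption, since the real data is not shown), normalizing my_data before concatenating avoids the duplicate row entirely:
# hypothetical normalization of the key column before concatenating
for df in (df1, df2, df3):
    df['my_data'] = df['my_data'].str.strip().str.lower()

df_results = pd.concat(
    (df.set_index('my_data') for df in [df1, df2, df3]),
    axis=1, join='outer'
).reset_index()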

How, in python, can I count unique values in a column for gradually increasing numbers of rows within groups

I am working in python on a pandas data frame and am trying to count the unique values of a column within groups. My problem is that I need that count to cover steadily increasing numbers of rows within each group, and I also don't want NaNs to be counted.
Simplified, the data looks like this:
ID  occup
1   NaN
1   A
1   NaN
1   B
1   A
2   K
2   NaN
2   L
2   K
2   M
The new column 'occupcount' should, within the groups defined by 'ID', count the number of unique values in 'occup'; but in the first row of each group I want the count to consider only the first row of that group, in the second row the first two rows, and so on, until in the fifth row it counts the unique values over all five rows of the group. It should look like this:
ID  occup  occupcount
1   NaN    0
1   A      1
1   NaN    1
1   B      2
1   A      2
2   K      1
2   NaN    1
2   L      2
2   K      2
2   M      3
I tried to solve the task with something like
df['occupcount'] = df.groupby(['ID'])['occup'].transform('nunique')
but it only provides the total number of unique values over all rows within each group, with no gradual increase. Thanks in advance!
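For reproducibility, a minimal sketch of the example frame (using the occup sequence from the desired-output table above):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID':    [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'occup': [np.nan, 'A', np.nan, 'B', 'A', 'K', np.nan, 'L', 'K', 'M'],
})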
The idea is to build a mask that is True only for the first non-missing occurrence of each (ID, occup) pair, by chaining DataFrame.duplicated over both columns with Series.notna, and then take the GroupBy.cumsum of that mask per ID:
# True only for the first non-NaN occurrence of each (ID, occup) pair;
# the per-ID cumulative sum of the mask is the running unique count
df['occupcount'] = ((~df.duplicated(['ID', 'occup']) & df['occup'].notna())
                        .groupby(df['ID'])
                        .cumsum())
print(df)
   ID occup  occupcount
0   1   NaN           0
1   1     A           1
2   1   NaN           1
3   1     B           2
4   1     A           2
5   2     K           1
6   2   NaN           1
7   2     L           2
8   2     K           2
9   2     M           3
