How can I compute averages on my data frame by slicing the index - python

I have the data as
A=[0.1,0.3,0.5,0.7,0.9,1.1,1.3,1.5,1.7]
B=[3,4,6,8,2,10,2,3,4]
A is my index and B holds the values corresponding to A. I have to group the first three index values, i.e. [0.1, 0.3, 0.5], and calculate the average of the corresponding B values, i.e. [3, 4, 6]; similarly the average of the second three values [8, 2, 10] corresponding to [0.7, 0.9, 1.1], and again of [2, 3, 4] corresponding to [1.3, 1.5, 1.7], and then prepare a table of these three averages. The final data frame should look like
A=[1,2,3]
B=[average 1, average 2, average 3]

If you need to aggregate the mean over every 3 values, build a helper array from the length of the DataFrame using integer division by 3, and pass it to GroupBy.mean:
import pandas as pd
import numpy as np

A = [0.1,0.3,0.5,0.7,0.9,1.1,1.3,1.5,1.7]
B = [3,4,6,8,2,10,2,3,4]
df = pd.DataFrame({'col': B}, index=A)
print (df)
     col
0.1    3
0.3    4
0.5    6
0.7    8
0.9    2
1.1   10
1.3    2
1.5    3
1.7    4
df = df.groupby(np.arange(len(df)) // 3).mean()
df.index += 1
print (df)
        col
1  4.333333
2  6.666667
3  3.000000
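If you also need the exact A/B layout from the question, a small follow-up sketch (using the df from above) turns the index into a regular column and renames col:
df = df.rename_axis('A').reset_index().rename(columns={'col': 'B'})
print (df)
   A         B
0  1  4.333333
1  2  6.666667
2  3  3.000000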

Related

How to average n adjacent columns together in python pandas dataframe?

I have a dataframe that is a histogram with 2000 bins, with a column for each bin. I need to reduce it down to a quarter of the size - 500 bins.
Let's say we have the original dataframe:
A B C D E F G H
1 1 1 1 2 2 2 2
I want to reduce it to a new quarter width dataframe:
A B
1 2
where, in the new dataframe, A is the average (A + B + C + D) / 4 of the original dataframe's columns.
Feels like it should be easy, but can't work out how to do it! Cheers :)
Assuming you want to group the first 4 and last 4 columns (or any number of columns 4 by 4):
out = df.groupby(np.arange(df.shape[1])//4, axis=1).mean()
output:
     0    1
0  1.0  2.0
If you further want to relabel the columns A/B:
out = (df.groupby(np.arange(df.shape[1]) // 4, axis=1).mean()
         .set_axis(['A', 'B'], axis=1)
      )
output:
     A    B
0  1.0  2.0
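Note that axis=1 in groupby is deprecated in recent pandas releases; if that is a concern, here is a sketch of the same idea via a transpose, assuming the one-row example frame from the question:
import pandas as pd
import numpy as np

df = pd.DataFrame([[1, 1, 1, 1, 2, 2, 2, 2]], columns=list('ABCDEFGH'))

# group the transposed rows (the original columns) 4 by 4, average, transpose back
out = df.T.groupby(np.arange(df.shape[1]) // 4).mean().T.set_axis(['A', 'B'], axis=1)
print(out)
     A    B
0  1.0  2.0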

Way to produce a table in pandas given a formula and zeros

Let's say I have the following df:
Letter Number
a 0
b 0
c 0
d 1
e 2
f 3
I want to apply the following formula to the df
for i in range(1, len(df)):
    x = df.loc[i, 'Number'] / df.loc[i-1, 'Number'] + df.loc[i, 'Number']
    df.loc[i, 'Number'] = x
Note: The column 'Number' only has zeros in the first few rows. After, there are no more zeros.
How would I apply the formula to the df without slicing the zeros off?
You can get the previous row's number by using shift(). Then you can compute the value using the formula you defined; for that we can use df.apply(). Here's how we can do it:
import pandas as pd
df = pd.DataFrame({'Letter':list('abcdef'),'Number':[0,0,0,1,2,3]})
print (df)
# capture the previous row's value
df['Prev'] = df.Number.shift()
# divide only when the previous value is non-zero; when it is 0, add 0 instead
# (Prev is NaN for the first row, so that division yields NaN)
df['New'] = df.apply(lambda x: x.Number + ((x.Number / x.Prev) if (pd.isnull(x.Prev) or x.Prev != 0) else 0), axis=1)
print (df)
The output of this will be (Prev is the previous row; New is the computed result):
  Letter  Number  Prev  New
0      a       0   NaN  NaN
1      b       0   0.0  0.0
2      c       0   0.0  0.0
3      d       1   0.0  1.0
4      e       2   1.0  4.0
5      f       3   2.0  4.5
If you want the first row to have a value of 0, we can modify the .shift() step a bit with fillna(0). That makes the first row's value 0. You can drop the Prev column after the computation.
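For instance, a minimal sketch of that variant, using the same df as above:
# shift and fill the leading NaN with 0, so the first row takes the "don't divide" branch
df['Prev'] = df.Number.shift().fillna(0)
df['New'] = df.apply(lambda x: x.Number + ((x.Number / x.Prev) if x.Prev != 0 else 0), axis=1)
df = df.drop(columns='Prev')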
This snippet divides, using pd.Series.div, the Series Number by the shifted values of Number and then adds Number using pd.Series.add
>>> df.Number.div(df.Number.shift()).add(df.Number)
0    NaN
1    NaN
2    NaN
3    inf
4    4.0
5    4.5
Name: Number, dtype: float64
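If you want this vectorized version to mimic the loop's behaviour of skipping zero denominators instead of producing inf, one sketch is to blank out the zeros with where before dividing:
prev = df.Number.shift()
# zero (and missing) denominators become NaN, drop out of the division, and are filled with 0
df.Number.add(df.Number.div(prev.where(prev != 0)).fillna(0))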

Calculate and add columns to a data frame using multiple columns for sorting

I have a pretty simple data frame with columns A, B, C, and I would like to add several more. I would like to create two cumulative-sum columns and have them stored in that same data frame. Currently I'm doing this by creating two differently ordered data frames and plotting the results on the same graph, but I'm guessing there is a more efficient approach. The columns I'm trying to create are:
(1) Column D = the cumulative sum of Column C ordered by increasing values in Column A
(2) Column E = The cumulative sum of Column C ordered by decreasing values in column B
This should work:
# cumsum gets the running total; sort_index afterwards restores the original row order
df = pd.read_csv('Sample.csv')
df.insert(3, 'D', df.sort_values(by=['A']).C.cumsum().sort_index().values)
df.insert(4, 'E', df.sort_values(by=['B'], ascending=False).C.cumsum().sort_index().values)
print(df)
   A    B  C  D  E
0  1  0.1  1  1  9
1  2  0.3  3  4  6
2  3  0.6  1  5  3
3  4  0.7  2  7  2
4  5  0.3  2  9  8
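Since pandas aligns on the index when assigning a Series, a slightly shorter variant avoids .values entirely (a sketch with the Sample.csv contents spelled out, as shown in the output above):
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                   'B': [0.1, 0.3, 0.6, 0.7, 0.3],
                   'C': [1, 3, 1, 2, 2]})

# cumsum in the sorted order; index alignment puts each value back on its own row
df['D'] = df.sort_values('A')['C'].cumsum()
df['E'] = df.sort_values('B', ascending=False)['C'].cumsum()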

pandas unique values how to iterate as a starting point

Good Morning, (bad beginner)
I have the following pandas dataframe:
My goal is: the first time a new ID appears, the VALUE column should be 1000 * the DELTA of that row. For all consecutive rows of that ID, the VALUE is the VALUE of the row above * the DELTA of the current row.
I tried by getting all unique ID values:
a=stocks2.ID.unique()
a.tolist()
It works; unfortunately I do not really know how to iterate in the way I described. Any kind of help or tip would be greatly appreciated!
A way to do it would be as follows. Example dataframe:
df = pd.DataFrame({'ID':[1,1,5,3,3], 'delta':[0.3,0.5,0.2,2,4]}).assign(value=[2,5,4,2,3])
print(df)
   ID  delta  value
0   1    0.3      2
1   1    0.5      5
2   5    0.2      4
3   3    2.0      2
4   3    4.0      3
Fill value from the row above (the previous row's value times the current row's delta):
df['value'] = df['value'].shift(1) * df['delta']
Groupby to get the indices where each ID first appears:
w = df.groupby('ID', as_index=False).nth(0).index.values
And compute the values for value using the indices in w:
df.loc[w,'value'] = df.loc[w,'delta'] * 1000
Which gives for this example:
   ID  delta   value
0   1    0.3   300.0
1   1    0.5     1.0
2   5    0.2   200.0
3   3    2.0  2000.0
4   3    4.0     8.0
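Note that if the rule is meant to chain, i.e. each row should multiply the previously computed value rather than the original one, a per-group cumulative product reproduces that recursion exactly (a sketch on the same example df):
# 1000 * delta for the first row of each ID, then running value * delta for the rest
df['value'] = df.groupby('ID')['delta'].cumprod() * 1000
print(df)
   ID  delta   value
0   1    0.3   300.0
1   1    0.5   150.0
2   5    0.2   200.0
3   3    2.0  2000.0
4   3    4.0  8000.0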

Python - Pivot and create histograms from Pandas column, with missing values

Having the following Data Frame:
    name  value  count  total_count
0      A      0      1           20
1      A      1      2           20
2      A      2      2           20
3      A      3      2           20
4      A      4      3           20
5      A      5      3           20
6      A      6      2           20
7      A      7      2           20
8      A      8      2           20
9      A      9      1           20
----------------------------------
10     B      0     10           75
11     B      5     30           75
12     B      6     20           75
13     B      8     10           75
14     B      9      5           75
I would like to pivot the data, grouping each row by the name value, then create columns based on the value & count columns aggregated into bins.
Explanation: I have 10 possible values, range 0-9, but not all of the values are present in each group. In the example above, group B is missing the values 1, 2, 3, 4, 7. I would like to create a histogram with 5 bins, ignore the missing values, and calculate the percentage of count for each bin. So the result will look like this:
  name       0-1  2-3  4-5  6-7       8-9
0    A  0.150000  0.2  0.3  0.2  0.150000
1    B  0.133333  0.0  0.4  0.4  0.066667
For example, for bin 0-1 of group A, the calculation is the sum of count for the values 0 and 1 (1+2), divided by the total_count of group A:
name 0-1
0 A (1+2)/20 = 0.15
I was looking into the hist method and this StackOverflow question, but I'm still struggling to figure out the right approach.
Use pd.cut to bin your feature, then use df.groupby() with the .unstack() method to get the dataframe you are looking for. During the groupby you can use any aggregation function (.sum(), .count(), etc.). The code below is a working example.
import pandas as pd
import numpy as np

df = pd.DataFrame(
    data={'name': ['Group A', 'Group B'] * 5,
          'number': np.arange(0, 10),
          'value': np.arange(30, 40)})

df['number_bin'] = pd.cut(df['number'], bins=np.arange(0, 10))

# Option 1: sums
df.groupby(['number_bin', 'name'])['value'].sum().unstack(0)

# Option 2: counts
df.groupby(['number_bin', 'name'])['value'].count().unstack(0)
The null values in the original data will not affect the result.
To get the exact result, you could try this:
bins = range(10)
res = df.groupby('name')['count'].sum()
intervals = pd.cut(df.value, bins=bins, include_lowest=True)
df1 = (df.groupby([intervals, 'name'])['count'].sum() / res).unstack(0)
df1.columns = df1.columns.astype(str)  # convert the interval cols to strings
df1.columns = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']  # rename the cols
# pair adjacent intervals: each kept column absorbs the one after it
cols = ['a', 'b', 'd', 'f', 'h']
df1 = df1.add(df1.iloc[:, 1:].shift(-1, axis=1), fill_value=0)[cols]
print(df1)
You can manually rename the cols later.
# Output:
             a    b    d         f     h
name
A     0.150000  0.2  0.3  0.200000  0.15
B     0.133333  NaN  0.4  0.266667  0.20
You can replace the NaN values using df1.fillna(0.0).
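For example, a final touch-up sketch using the bin labels from the question:
df1 = df1.fillna(0.0)
df1.columns = ['0-1', '2-3', '4-5', '6-7', '8-9']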
