If I have a dataframe df_i and I want to split it into sub-dataframes based on unique values of 'Cycle Number'
I use:
dfs = {k: df_i[df_i['Cycle Number'] == k] for k in df_i['Cycle Number'].unique()}
Assuming the 'Cycle Number' ranges from 1 to 50 and in each cycle, I have steps ranging from 1 to 15, how do I split each data frame into 15 further data frames?
I am presuming something of this type would work:
for i in range(1,51):
    dsfs = {k: dfs[i][dfs[i]['Step Number'] == k] for k in dfs[i]['Step Number'].unique()}
But this gives me only the 15 data frames for cycle number 50, not the ones from the cycles before it.
If I want to access a sub-dataframe in the 20th Cycle with step number 10, is there a way of generating the subdata frame such that I can access it using something like dfs[20][10]?
A simple parallel:
Step Number   Cycle Number   Desired Access
1             1              dfs[1][1]
2             1              dfs[1][2]
3             1              dfs[1][3]
4             1              dfs[1][4]
5             1              dfs[1][5]
1             2              dfs[2][1]
2             2              dfs[2][2]
3             2              dfs[2][3]
4             2              dfs[2][4]
5             2              dfs[2][5]
1             3              dfs[3][1]
2             3              dfs[3][2]
3             3              dfs[3][3]
4             3              dfs[3][4]
5             3              dfs[3][5]
1             4              dfs[4][1]
2             4              dfs[4][2]
3             4              dfs[4][3]
4             4              dfs[4][4]
5             4              dfs[4][5]
You can use tuple keys instead and utilize groupby. Here's a minimal example:
import pandas as pd

df = pd.DataFrame([[0, 1, 2], [0, 1, 3], [1, 2, 4], [1, 2, 5], [1, 3, 6], [1, 3, 7]],
                  columns=['col1', 'col2', 'col3'])
# one sub-DataFrame per (col1, col2) pair, keyed by that tuple
dfs = dict(tuple(df.groupby(['col1', 'col2'])))
for k, v in dfs.items():
    print(k)
    print(v)
(0, 1)
col1 col2 col3
0 0 1 2
1 0 1 3
(1, 2)
col1 col2 col3
2 1 2 4
3 1 2 5
(1, 3)
col1 col2 col3
4 1 3 6
5 1 3 7
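If the dfs[20][10] style of access from the question is preferred over tuple keys, a nested dictionary is one option. This is only a minimal sketch: it assumes the column names 'Cycle Number' and 'Step Number' from the question, and the small df_i with its 'value' column is made up for illustration.

```python
import pandas as pd

# Made-up stand-in for the question's df_i; 'value' is a hypothetical data column.
df_i = pd.DataFrame({
    "Cycle Number": [1, 1, 1, 2, 2, 2],
    "Step Number": [1, 1, 2, 1, 2, 2],
    "value": [10, 11, 12, 13, 14, 15],
})

# Outer dict: one entry per cycle. Inner dict: one sub-DataFrame per step in that cycle.
dfs = {
    cycle: {step: step_df for step, step_df in cycle_df.groupby("Step Number")}
    for cycle, cycle_df in df_i.groupby("Cycle Number")
}

print(dfs[2][2])  # rows of cycle 2, step 2
```

With the tuple-key dictionary from the answer, the same sub-dataframe would be looked up with a (cycle, step) tuple instead of two chained indexes.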
Let's assume I have the following dataframe:
import pandas as pd
d = {'col1': [1, 2,3,4], 'col2': [4, 2, 1, 3], 'col3': [1,0,1,1], 'outcome': [1,0,1,0]}
df = pd.DataFrame(data=d)
I want this dataframe sorted by the minimum of col1 and col2 in each row. The order of the indexes should be 2, 0, 1, 3.
I tried this with df.sort_values(by=['col2', 'col1']), but then it sorts by one column first and then by the other. Is there any way to order by the minimum of the two columns?
Using numpy.lexsort:
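import numpy as np
# sort the values within each row, then order the rows by (min, max); lexsort uses the last key first
order = np.lexsort(np.sort(df[['col1', 'col2']])[:, ::-1].T)
out = df.iloc[order]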
Output:
col1 col2 col3 outcome
2 3 1 1 1
0 1 4 1 1
1 2 2 0 0
3 4 3 1 0
Note that you can easily handle any number of columns:
df.iloc[np.lexsort(np.sort(df[['col1', 'col2', 'col3']])[:, ::-1].T)]
col1 col2 col3 outcome
1 2 2 0 0
2 3 1 1 1
0 1 4 1 1
3 4 3 1 0
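For readers who find the one-liner dense, the same idea can be spelled out step by step. This is a sketch on the question's data, not a different method; the intermediate name row_sorted is introduced here only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4], 'col2': [4, 2, 1, 3],
                   'col3': [1, 0, 1, 1], 'outcome': [1, 0, 1, 0]})

# Sort the values inside each row: column 0 holds the row minimum, column 1 the maximum.
row_sorted = np.sort(df[['col1', 'col2']].to_numpy(), axis=1)

# np.lexsort treats the LAST key as the primary one, so pass (max, min)
# to sort by the minimum first and break ties with the maximum.
order = np.lexsort((row_sorted[:, 1], row_sorted[:, 0]))

out = df.iloc[order]
print(out.index.tolist())  # [2, 0, 1, 3]
```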
One way (not the most efficient):
idx = df[['col2', 'col1']].apply(lambda x: tuple(sorted(x)), axis=1).sort_values().index
Output:
>>> df.loc[idx]
col1 col2 col3 outcome
2 3 1 1 1
0 1 4 1 1
1 2 2 0 0
3 4 3 1 0
>>> idx
Int64Index([2, 0, 1, 3], dtype='int64')
You can decorate-sort-undecorate, where the decoration is the minimal and the other (i.e., maximal) value per row:
cols = ["col1", "col2"]
(df.assign(_min=df[cols].min(axis=1), _other=df[cols].max(axis=1))
   .sort_values(["_min", "_other"])
   .drop(columns=["_min", "_other"]))
to get
col1 col2 col3 outcome
2 3 1 1 1
0 1 4 1 1
1 2 2 0 0
3 4 3 1 0
I would compute min(col1, col2) as a new column and then sort by it:
import pandas as pd
d = {'col1': [1, 2,3,4], 'col2': [4, 2, 1, 3], 'col3': [1,0,1,1], 'outcome': [1,0,1,0]}
df = pd.DataFrame(data=d)
df['colmin'] = df[['col1','col2']].min(axis=1) # compute min
df = df.sort_values(by='colmin').drop(columns='colmin') # sort then drop min
print(df)
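This gives the output below. Note that rows 0 and 2 tie on the minimum and are not ordered by the other column, so the result differs from the requested 2, 0, 1, 3: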
col1 col2 col3 outcome
0 1 4 1 1
2 3 1 1 1
1 2 2 0 0
3 4 3 1 0
below is the df
df = pd.DataFrame({
'Sr. No': [1, 2, 3, 4, 5, 6],
'val1' : [2,3,2,4,1,2],
})
I want Val2 such that the first row is the same as the first row of val1,
but for row 2 and below the formula is as shown in the pic. I am assuming it should be an easy one with shift, but I am just not getting my head around this.
This is mul and cumsum:
df["new"] = df["Sr. No"].mul(df["val1"]).cumsum()
print (df)
Sr. No val1 new
0 1 2 2
1 2 3 8
2 3 2 14
3 4 4 30
4 5 1 35
5 6 2 47
I want to do a groupby on column 1 then get the sum of values from column 2, conditional on the value in column 3, which are then divided by the total sum in column 2, still grouped by column 1.
An example is given below:
d = {'col1': [1, 2, 1, 2], 'col2': [3, 4, 2, 7], 'col3': [1, 1, 0, 0]}
df = pd.DataFrame(data=d)
col1 col2 col3
0 1 3 1
1 2 4 1
2 1 2 0
3 2 7 0
I want to create a new column, col4. For this column I group by col1 and then divide the sum of the col2 values where col3 is 1 by the total grouped sum of col2, so that I end up with the following result. (I put it in fractions to make it easier to follow the calculations.)
col1 col2 col3 col4
0 1 3 1 3/5
1 2 4 1 4/11
2 1 2 0 3/5
3 2 7 0 4/11
I tried the following, but this does not work unfortunately:
df.col4 = df.groupby(['col1']).transform(lambda x: np.where(x.col3 == 1, x.col2, 0).sum()) / df.groupby(['col1']).col2.transform('sum')
Edit | Extended example
I extended the example as the solution provided by Wen only covered the above simple example.
d = {'col1': [1, 2, 1, 2, 1, 2], 'col2': [3, 4, 2, 7, 6, 8], 'col3': [1, 1, 0, 0, 1, 0]}
df = pd.DataFrame(data=d)
col1 col2 col3
0 1 3 1
1 2 4 1
2 1 2 0
3 2 7 0
4 1 6 1
5 2 8 0
Edit | Possible solution
I found a possible solution. I would like to do it in a cleaner way, but this is readable and pretty simple. Any alternatives that combine these two lines of code are still appreciated, of course.
df['col4'] = np.where(df.col3 == 1, df.col2, 0)
df['col4'] = df.groupby(['col1']).col4.transform('sum') / df.groupby(['col1']).col2.transform('sum')
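One possible way to fold the two lines into a single expression without a temporary col4 is to mask col2 first and group the masked Series by col1. This is only a sketch on the extended example; the intermediate name flagged is introduced here for illustration.

```python
import pandas as pd

d = {'col1': [1, 2, 1, 2, 1, 2], 'col2': [3, 4, 2, 7, 6, 8], 'col3': [1, 1, 0, 0, 1, 0]}
df = pd.DataFrame(data=d)

# Keep col2 only where col3 == 1 (0 elsewhere), sum per col1 group,
# then divide by the per-group total of col2.
flagged = df['col2'].where(df['col3'].eq(1), 0)
df['col4'] = (flagged.groupby(df['col1']).transform('sum')
              / df.groupby('col1')['col2'].transform('sum'))

print(df)
```

For the extended example this gives 9/11 for the rows with col1 == 1 and 4/19 for the rows with col1 == 2.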
You may need to correct your expected output. You can then use map after filtering:
df.col1.map(df.loc[df.col3==1,].set_index('col1').col2)/df.groupby(['col1']).col2.transform('sum')
Out[566]:
0 0.600000
1 0.363636
2 0.600000
3 0.363636
dtype: float64
simple :)
d = {'col1': [1, 2, 1, 2], 'col2': [3, 4, 2, 7], 'col3': [1, 1, 0, 0]}
df = pd.DataFrame(data=d)
df['col4'] = 0.0
def con(data):
    # sum of col2 over the rows of this group where col3 == 1
    part_a = sum(data[data['col3'] == 1]['col2'])
    # total of col2 over the whole group
    part_b = sum(data['col2'])
    data.col4 = part_a/part_b
    return data
df.groupby('col1').apply(con)
Output
col1 col2 col3 col4
0 1 3 1 0.600000
1 2 4 1 0.363636
2 1 2 0 0.600000
3 2 7 0 0.363636
I am trying to create a new df which summarises my key information, by taking that information from 3 (say) other dataframes.
dfdate = {'x1': [2, 4, 7, 5, 6],
          'x2': [2, 2, 2, 6, 7],
          'y1': [3, 1, 4, 5, 9]}
dfdate = pd.DataFrame(dfdate, index=range(0, 5))
dfqty = {'x1': [1, 2, 6, 6, 8],
         'x2': [3, 1, 1, 7, 5],
         'y1': [2, 4, 3, 2, 8]}
dfqty = pd.DataFrame(dfqty, index=range(0, 5))
dfprices = {'x1': [0, 2, 2, 4, 4],
            'x2': [2, 0, 0, 3, 4],
            'y1': [1, 3, 2, 1, 3]}
dfprices = pd.DataFrame(dfprices, index=range(0, 5))
Let us say the above 3 dataframes are my data. Say, some dates, qty, and prices of goods. My new df is to be constructed from the above data:
rng = len(dfprices.columns)*len(dfprices.index) # This is the len of new df
dfnew = pd.DataFrame(np.nan, index=range(0, rng), columns=['Letter', 'Number', 'date', 'qty', 'price'])
Now, this is where I struggle to piece my stuff together. I am trying to take all the data in dfdate and put it into a column in the new df, and the same with dfqty and dfprices (so each 5x3 matrix is essentially flattened to a 15-element vector and placed into the new df).
As well as that, I need a couple of columns in dfnew as identifiers, from the names of the columns of the old df.
I've tried for loops, but to no avail, and I don't know how to convert a df to a series. My desired output is:
dfnew:
'Lettercol','Numbercol', 'date', 'qty', 'price'
0 X 1 2 1 0
1 X 1 4 2 2
2 X 1 7 6 2
3 X 1 5 6 4
4 X 1 6 8 4
5 X 2 2 3 2
6 X 2 2 1 0
7 X 2 2 1 0
8 X 2 6 7 3
9 X 2 7 5 4
10 Y 1 3 2 1
11 Y 1 1 4 3
12 Y 1 4 3 2
13 Y 1 5 2 1
14 Y 1 9 8 3
where the numbers 0-14 are the index.
letter = letter from col header in DFs
number = number from col header in DFs
next 3 columns are data from the orig df's
(don't ask why the original data is in that funny format :)
thanks so much. my last Q wasn't well received so have tried to make this one better, thanks
Use:
#list of DataFrames
dfs = [dfdate, dfqty, dfprices]
#list comprehension with reshape
comb = [x.unstack() for x in dfs]
#join together
df = pd.concat(comb, axis=1, keys=['date', 'qty', 'price'])
#remove second level of MultiIndex and index to column
df = df.reset_index(level=1, drop=True).reset_index().rename(columns={'index':'col'})
#extract all values without first by indexing [1:] and first letter by [0]
df['Number'] = df['col'].str[1:]
df['Letter'] = df['col'].str[0]
cols = ['Letter', 'Number', 'date', 'qty', 'price']
#change order of columns
df = df.reindex(columns=cols)
print (df)
Letter Number date qty price
0 x 1 2 1 0
1 x 1 4 2 2
2 x 1 7 6 2
3 x 1 5 6 4
4 x 1 6 8 4
5 x 2 2 3 2
6 x 2 2 1 0
7 x 2 2 1 0
8 x 2 6 7 3
9 x 2 7 5 4
10 y 1 3 2 1
11 y 1 1 4 3
12 y 1 4 3 2
13 y 1 5 2 1
14 y 1 9 8 3
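As an alternative to unstack, melt produces the same column-after-column flattening. This is a rough sketch, assuming the three frames share the same column order and index as in the question; the upper-casing of the letter and the integer conversion of the number are added here to match the desired output and are not part of the answer above.

```python
import pandas as pd

dfdate = pd.DataFrame({'x1': [2, 4, 7, 5, 6], 'x2': [2, 2, 2, 6, 7], 'y1': [3, 1, 4, 5, 9]})
dfqty = pd.DataFrame({'x1': [1, 2, 6, 6, 8], 'x2': [3, 1, 1, 7, 5], 'y1': [2, 4, 3, 2, 8]})
dfprices = pd.DataFrame({'x1': [0, 2, 2, 4, 4], 'x2': [2, 0, 0, 3, 4], 'y1': [1, 3, 2, 1, 3]})

# melt stacks the columns one after another: all of x1, then x2, then y1.
out = dfdate.melt(var_name='col', value_name='date')
out['qty'] = dfqty.melt()['value']
out['price'] = dfprices.melt()['value']

# Split the old column header into its letter and its number.
out['Letter'] = out['col'].str[0].str.upper()
out['Number'] = out['col'].str[1:].astype(int)
out = out[['Letter', 'Number', 'date', 'qty', 'price']]

print(out)
```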
I have a pandas dataframe that has a column where the data is a list of statistics calculated from a groupby operation.
df = pd.DataFrame({'a':[1,1,1,2,2,2,3], 'b':[3,4,2,3,4,3,2]})
def calculate_stuff(x):
    # floor division (//) gives the integer results shown below on Python 3 as well
    return len(x)//5, sum(x)//len(x), sum(x)
>>> df.groupby('a').apply(lambda row : calculate_stuff(row.b))
a
1 (0, 3, 9)
2 (0, 3, 10)
3 (0, 2, 2)
dtype: object
Basically, I have several statistics that depend on each other and have to be calculated for each groupby row. The function that does this returns a tuple of the statistics values. What I want is to create a new column for each index of the tuple so that it looks like this:
a col1 col2 col3
1 0 3 9
2 0 3 10
3 0 2 2
I don't think I can use df.groupby('a').agg because one of the calculations is required for the other calculations. Any suggestions?
edit: I realized my aggregate functions in my example were not aggregate functions so I changed them
Adding an extra a category item so the result is 4x3.
df = pd.DataFrame({'a': [1, 1, 1, 2, 2, 2, 3, 4],
                   'b': [3, 4, 2, 3, 4, 3, 2, 1]})
new_cols = ['col1', 'col2', 'col3']
gb = df.groupby('a').apply(lambda group: calculate_stuff(group.b))
>>> pd.DataFrame(zip(*gb), columns=gb.index, index=new_cols).T
col1 col2 col3
a
1 0 3 9
2 0 3 10
3 0 2 2
4 0 1 1
You can try list comprehension:
import pandas as pd
df = pd.DataFrame({'a':[1,1,1,2,2,2,3], 'b':[3,4,2,3,4,3,2]})
def calculate_stuff(x):
    # floor division (//) keeps the integer results shown below on Python 3
    return len(x)//5, sum(x)//len(x), sum(x)
group_df = df.groupby('a').apply(lambda row : calculate_stuff(row.b))
print(pd.DataFrame([x for x in group_df],
                   columns=['col1','col2','col3'],
                   index=group_df.index))
col1 col2 col3
a
1 0 3 9
2 0 3 10
3 0 2 2
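Another option is to have the function return a labelled pd.Series instead of a tuple, so that the statistics can be unstacked into columns directly. This is a sketch under the assumption that the statistics can be computed from column b alone; the intermediate names total and count are introduced here to show one statistic feeding into another.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 1, 2, 2, 2, 3], 'b': [3, 4, 2, 3, 4, 3, 2]})

def calculate_stuff(x):
    # Compute shared intermediates once, then reuse them for the dependent statistics.
    total = x.sum()
    count = len(x)
    return pd.Series({'col1': count // 5, 'col2': total // count, 'col3': total})

# Applying on the 'b' column gives a Series indexed by (a, statistic name);
# unstack moves the statistic names into columns.
out = df.groupby('a')['b'].apply(calculate_stuff).unstack()

print(out)
```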