I have a dataframe with multilevel headers for the columns like this:
name  1     2     3     4
      x  y  x  y  x  y  x  y
A     1  4  3  7  2  1  5  2
B     2  2  6  1  4  5  1  7
How can I calculate the mean for 1x, 2x and 3x, but not 4x?
I tried:
df['mean']= df[('1','x'),('2','x'),('3','x')].mean()
This did not work; it raises a KeyError. I would like to get:
name  1     2     3     4     mean
      x  y  x  y  x  y  x  y
A     1  4  3  7  2  1  5  2  2
B     2  2  6  1  4  5  1  7  4
Is there a way to calculate the mean while keeping the first column header as an integer?
This is just one possible solution:
import pandas as pd
iterables = [[1, 2, 3, 4], ["x", "y"]]
array = [
    [1, 4, 3, 7, 2, 1, 5, 2],
    [2, 2, 6, 1, 4, 5, 1, 7]
]
index = pd.MultiIndex.from_product(iterables)
df = pd.DataFrame(array, index=["A", "B"], columns=index)
df["mean"] = df.xs("x", level=1, axis=1).loc[:,1:3].mean(axis=1)
print(df)
   1     2     3     4    mean
   x  y  x  y  x  y  x  y
A  1  4  3  7  2  1  5  2   2.0
B  2  2  6  1  4  5  1  7   4.0
Steps:
Select all the "x"-columns with df.xs("x", level=1, axis=1)
Select only columns 1 to 3 with .loc[:,1:3]
Calculate the mean value with .mean(axis=1)
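An equivalent selection, if you prefer not to use xs, is to pass a list of first-level labels together with "x" to .loc (a small sketch, reusing the df built above):

# select the columns (1, 'x'), (2, 'x') and (3, 'x') directly and average across them
df["mean"] = df.loc[:, ([1, 2, 3], "x")].mean(axis=1)

This keeps the first-level column keys as integers, so no conversion to strings is needed.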
Suppose I have a nested dictionary of the format:
dictionary = {
    "A": [1, 2],
    "B": [2, 3],
    "Coords": [{
        "X": [1, 2, 3],
        "Y": [1, 2, 3],
        "Z": [1, 2, 3],
    }, {
        "X": [2, 3],
        "Y": [2, 3],
        "Z": [2, 3],
    }]
}
How can I turn this into a pandas MultiIndex DataFrame?
Equivalently, how can I produce a DataFrame where the information in the row is not duplicated for every co-ordinate?
In what I imagine, the two rows of the output DataFrame would appear as follows:
Index  A  B  Coords
---------------------
0      1  2  X  Y  Z
             1  1  1
             2  2  2
             3  3  3
---------------------
---------------------
1      2  3  X  Y  Z
             2  2  2
             3  3  3
---------------------
From your dictionary:
>>> import pandas as pd
>>> df = pd.DataFrame.from_dict(dictionary)
>>> df
A B Coords
0 1 2 {'X': [1, 2, 3], 'Y': [1, 2, 3], 'Z': [1, 2, 3]}
1 2 3 {'X': [2, 3], 'Y': [2, 3], 'Z': [2, 3]}
Then we can use pd.Series to extract the data from the dicts in the Coords column, like so:
df_concat = pd.concat([df.drop(['Coords'], axis=1), df['Coords'].apply(pd.Series)], axis=1)
>>> df_concat
A B X Y Z
0 1 2 [1, 2, 3] [1, 2, 3] [1, 2, 3]
1 2 3 [2, 3] [2, 3] [2, 3]
To finish, we use the explode method to turn the lists into rows and set the index on columns A and B to get the expected result:
>>> df_concat.explode(['X', 'Y', 'Z']).reset_index().set_index(['index', 'A', 'B'])
           X  Y  Z
index A B
0     1 2  1  1  1
           2  2  2
           3  3  3
1     2 3  2  2  2
           3  3  3
UPDATE:
If you are using a version of pandas lower than 1.3.0, we can use the trick given by @MillerMrosek in this answer:
def explode(df, columns):
    df['tmp'] = df.apply(lambda row: list(zip(*[row[_clm] for _clm in columns])), axis=1)
    df = df.explode('tmp')
    df[columns] = pd.DataFrame(df['tmp'].tolist(), index=df.index)
    df.drop(columns='tmp', inplace=True)
    return df
explode(df_concat, ["X", "Y", "Z"]).reset_index().set_index(['index', 'A', 'B'])
Output:
           X  Y  Z
index A B
0     1 2  1  1  1
           2  2  2
           3  3  3
1     2 3  2  2  2
           3  3  3
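A rough alternative sketch, using the same dictionary as above: build one small DataFrame per Coords entry and concatenate them, passing the matching (A, B) pairs as keys (the keys/names choices here are mine, not from the original answer):

import pandas as pd

# one DataFrame per Coords entry
frames = [pd.DataFrame(coords) for coords in dictionary["Coords"]]
# use the corresponding (A, B) pairs as the outer index levels
keys = list(zip(dictionary["A"], dictionary["B"]))
alt = pd.concat(frames, keys=keys, names=["A", "B"])
print(alt)

This gives the same grouping of coordinates under each (A, B) pair, with the per-coordinate position as the innermost index level.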
I am trying to create a new df which summarises my key information, by taking that information from 3 (say) other dataframes.
import numpy as np
import pandas as pd

dfdate = {'x1': [2, 4, 7, 5, 6],
          'x2': [2, 2, 2, 6, 7],
          'y1': [3, 1, 4, 5, 9]}
dfdate = pd.DataFrame(dfdate, index=range(5))

dfqty = {'x1': [1, 2, 6, 6, 8],
         'x2': [3, 1, 1, 7, 5],
         'y1': [2, 4, 3, 2, 8]}
dfqty = pd.DataFrame(dfqty, index=range(5))

dfprices = {'x1': [0, 2, 2, 4, 4],
            'x2': [2, 0, 0, 3, 4],
            'y1': [1, 3, 2, 1, 3]}
dfprices = pd.DataFrame(dfprices, index=range(5))
Let us say the above 3 dataframes are my data: some dates, quantities, and prices of goods. My new df is to be constructed from this data:
rng = len(dfprices.columns) * len(dfprices.index)  # this is the length of the new df
dfnew = pd.DataFrame(np.nan, index=range(0, rng), columns=['Letter', 'Number', 'date', 'qty', 'price'])
Now, this is where I struggle to piece things together. I am trying to take all the data in dfdate and put it into a column of the new df, and the same with dfqty and dfprices (so the 3x5 matrices essentially become 1x15 vectors and are placed into the new df).
As well as that, I need a couple of columns in dfnew as identifiers, taken from the names of the columns of the old dfs.
I've tried for loops but to no avail, and I don't know how to convert a df to a Series. My desired output is:
dfnew:
'Lettercol','Numbercol', 'date', 'qty', 'price'
0 X 1 2 1 0
1 X 1 4 2 2
2 X 1 7 6 2
3 X 1 5 6 4
4 X 1 6 8 4
5 X 2 2 3 2
6 X 2 2 1 0
7 X 2 2 1 0
8 X 2 6 7 3
9 X 2 7 5 4
10 Y 1 3 2 1
11 Y 1 1 4 3
12 Y 1 4 3 2
13 Y 1 5 2 1
14 Y 1 9 8 3
where the numbers 0-14 are the index.
letter = letter from col header in DFs
number = number from col header in DFs
next 3 columns are data from the orig df's
(don't ask why the original data is in that funny format :)
Thanks so much. My last question wasn't well received, so I have tried to make this one better.
Use:
#list of DataFrames
dfs = [dfdate, dfqty, dfprices]
#list comprehension with reshape
comb = [x.unstack() for x in dfs]
#join together
df = pd.concat(comb, axis=1, keys=['date', 'qty', 'price'])
#remove second level of MultiIndex and index to column
df = df.reset_index(level=1, drop=True).reset_index().rename(columns={'index':'col'})
#extract the number (everything after the first character) with [1:] and the letter (first character) with [0]
df['Number'] = df['col'].str[1:]
df['Letter'] = df['col'].str[0]
cols = ['Letter', 'Number', 'date', 'qty', 'price']
#change order of columns
df = df.reindex(columns=cols)
print (df)
Letter Number date qty price
0 x 1 2 1 0
1 x 1 4 2 2
2 x 1 7 6 2
3 x 1 5 6 4
4 x 1 6 8 4
5 x 2 2 3 2
6 x 2 2 1 0
7 x 2 2 1 0
8 x 2 6 7 3
9 x 2 7 5 4
10 y 1 3 2 1
11 y 1 1 4 3
12 y 1 4 3 2
13 y 1 5 2 1
14 y 1 9 8 3
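For comparison, here is a rough melt-based sketch that builds the same long format (assuming dfdate, dfqty and dfprices are defined as in the question; the intermediate names are my own):

# melt each frame into long form, then combine the value columns side by side
pieces = {'date': dfdate, 'qty': dfqty, 'price': dfprices}
melted = {name: d.melt(var_name='col', value_name=name) for name, d in pieces.items()}

out = melted['date'].copy()
out['qty'] = melted['qty']['qty']        # same row order, so plain assignment lines up
out['price'] = melted['price']['price']
out['Letter'] = out['col'].str[0]
out['Number'] = out['col'].str[1:]
out = out[['Letter', 'Number', 'date', 'qty', 'price']]
print(out)

melt walks the columns in order (x1, x2, y1), which matches the unstack order used above.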
I have two pandas.Series...
import pandas as pd
import numpy as np
length = 5
s1 = pd.Series( [1]*length ) # [1, 1, 1, 1, 1]
s2 = pd.Series( [2]*length ) # [2, 2, 2, 2, 2]
...and I would like to have them joined together in a single Series with the interleaved values from the first 2 series.
Something like: [1, 2, 1, 2, 1, 2, 1, 2, 1, 2]
Using np.column_stack:
In [27]: pd.Series(np.column_stack((s1, s2)).flatten())
Out[27]:
0 1
1 2
2 1
3 2
4 1
5 2
6 1
7 2
8 1
9 2
dtype: int64
Here we are:
s1.index = range(0,len(s1)*2,2)
s2.index = range(1,len(s2)*2,2)
interleaved = pd.concat([s1,s2]).sort_index()
idx values
0 1
1 2
2 1
3 2
4 1
5 2
6 1
7 2
8 1
9 2
Here's one using NumPy stacking, np.vstack -
pd.Series(np.vstack((s1,s2)).ravel('F'))
I have two data frames:
In [14]: rep1
Out[14]:
x y z
A 1 2 3
B 4 5 6
C 1 1 2
In [15]: rep2
Out[15]:
x y z
A 7 3 4
B 3 3 3
created with this code:
import pandas as pd
rep1 = pd.DataFrame.from_dict(dict([('A', [1, 2, 3]), ('B', [4, 5, 6]), ('C', [1, 1, 2])]), orient='index', columns=['x', 'y', 'z'])
rep2 = pd.DataFrame.from_dict(dict([('A', [7, 3, 4]), ('B', [3, 3, 3])]), orient='index', columns=['x', 'y', 'z'])
What I want to do then is to mesh rep1 and rep2 so that the result looks something like this:
gene rep1 rep2 type
A 1 7 x
B 4 3 x
A 2 3 y
B 5 3 y
A 3 4 z
B 6 3 z
row C is skipped because it is not shared by rep1 and rep2.
How can I achieve that?
This does it:
df = pd.concat([rep1.stack(), rep2.stack()], axis=1).reset_index().dropna()
df.columns = ['GENE', 'TYPE', 'REP1', 'REP2']
df.sort_values(by=['TYPE', 'GENE'], inplace=True)
Concatenate the stacked data frames on axis=1. Resetting the index gets you back the gene and type columns. dropna takes care of the nulls produced for gene C. Add the correct column names, etc.
returns:
GENE TYPE REP1 REP2
0 A x 1 7
3 B x 4 3
1 A y 2 3
4 B y 5 3
2 A z 3 4
5 B z 6 3
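A compact variant of the same idea (a sketch): let an inner join drop gene C, so no explicit dropna is needed:

merged = pd.concat([rep1.stack().rename('REP1'),
                    rep2.stack().rename('REP2')],
                   axis=1, join='inner').reset_index()
merged.columns = ['GENE', 'TYPE', 'REP1', 'REP2']
merged = merged.sort_values(['TYPE', 'GENE'])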
>>> import numpy as np
>>> c1 = rep1.values.T.flatten()
>>> c2 = rep2.values.T.flatten()
>>> c3 = np.vstack((rep1.columns.values, rep2.columns.values)).T.flatten()
>>> pd.DataFrame(np.vstack((c1,c2,c3)).T)
0 1 2
0 1 7 x
1 4 3 x
2 2 3 y
3 5 3 y
4 3 4 z
5 6 3 z
Edit: When I was answering this, the question did not have row C at all. Now things are more complicated, but I'll leave this here anyway.
The title should say it all: I want to turn this DataFrame:
A NaN 4 3
B 2 1 4
C 3 4 2
D 4 2 8
into this DataFrame:
A 2 1 2
B 3 2 3
C 4 4 4
D NaN 4 8
And I want to do it in a nice manner. The ugly solution would be to take every column and form a new DataFrame.
To test, use:
import pandas as pd

d = {'one': [None, 2, 3, 4],
     'two': [4, 1, 4, 2],
     'three': [3, 4, 6, 8]}
df = pd.DataFrame(d, index=list('ABCD'))
The desired sort ignores the index values, so the operation appears to be more
like a NumPy operation than a Pandas one:
import pandas as pd
d = {'one': [None, 2, 3, 4],
     'two': [4, 1, 4, 2],
     'three': [3, 4, 6, 8]}
df = pd.DataFrame(d, index=list('ABCD'))
# one three two
# A NaN 3 4
# B 2 4 1
# C 3 6 4
# D 4 8 2
arr = df.values
arr.sort(axis=0)
df = pd.DataFrame(arr, index=df.index, columns=df.columns)
print(df)
yields
one three two
A 2 3 1
B 3 4 2
C 4 6 4
D NaN 8 4
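If you would rather not sort the array in place, np.sort returns a sorted copy, so the same result can be written in one step (a sketch reusing the df defined above):

import numpy as np
import pandas as pd

# np.sort sorts each column independently and pushes NaN to the bottom
df_sorted = pd.DataFrame(np.sort(df.values, axis=0),
                         index=df.index, columns=df.columns)
print(df_sorted)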