I am trying to attach label values to a collection of 2D DataFrames. This is what I have done so far: I am reading csv files using pd.read_csv and appending them to a list. For the purpose of this question, let's consider the following code:
import numpy as np
import pandas as pd
raw_sample = []
labels = [1,1,1,2,2,2]
samples = np.random.randn(6, 5, 4)
for contents in range(samples.shape[0]):
    raw_sample.append(pd.DataFrame(samples[contents]))
Then I created df = pd.DataFrame(raw_sample) and added the labels to df by doing the following:
df = df.set_index([df.index, labels])
df.index = df.index.set_names('index', level=0)
df.index = df.index.set_names('labels', level=1)
I tried printing this and I got
0
index labels
0 1 0 1 2 3
0 0...
1 1 0 1 2 3
0 0...
2 1 0 1 2 3
0 1...
3 2 0 1 2 3
0 -0...
4 2 0 1 2 3
0 0...
5 2 0 1 2 3
0 -0...
I have also tried printing df[0], but I still got the same thing.
What I actually want is something in the form of
index labels 0
0 1 1 2 3 4 5 6 7
3 5 6 7 9 5 4
3 4 5 6 7 8 9
1 1 4 3 2 4 5 6 7
3 5 6 7 4 5 6
2 3 4 3 4 5 3
...
I know that a DataFrame cannot hold 2D arrays as values. The other option was to use pd.Panel; for this I converted all the contents of raw_sample to NumPy arrays, then converted raw_sample itself to a NumPy array, and did the following:
p1 = pd.Panel(samples, items=map(str, labels))
but when I print this, I get
<class 'pandas.core.panel.Panel'>
Dimensions: 6 (items) x 5 (major_axis) x 4 (minor_axis)
Items axis: 1 to 2
Major_axis axis: 0 to 4
Minor_axis axis: 0 to 3
Looking at the Items, it looks like all the common values are grouped together.
I am not sure what to do at this point. Help!!
Update
Inputs:
labels = [1,1,1,2,2,2]
samples = [5x4 pd.DataFrame, 5x4 pd.DataFrame, 5x4 pd.DataFrame, 5x4 pd.DataFrame, 5x4 pd.DataFrame, 5x4 pd.DataFrame]
Desired Output:
index labels samples
0 1 1 2 3 4 5 6 7
3 5 6 7 9 5 4
3 4 5 6 7 8 9
1 1 4 3 2 4 5 6 7
3 5 6 7 4 5 6
2 3 4 3 4 5 3
...
If you select with non-unique items, you get another Panel:
np.random.seed(10)
labels = [1,1,1,2,2,2]
samples = np.random.randn(6, 5, 4)
p1 = pd.Panel(samples, items=map(str, labels))
print (p1)
<class 'pandas.core.panel.Panel'>
Dimensions: 6 (items) x 5 (major_axis) x 4 (minor_axis)
Items axis: 1 to 2
Major_axis axis: 0 to 4
Minor_axis axis: 0 to 3
print (p1['1'])
<class 'pandas.core.panel.Panel'>
Dimensions: 3 (items) x 5 (major_axis) x 4 (minor_axis)
Items axis: 1 to 1
Major_axis axis: 0 to 4
Minor_axis axis: 0 to 3
print (p1.to_frame())
1 1 1 2 2 2
major minor
0 0 1.331587 1.331587 1.331587 -0.232182 -0.232182 -0.232182
1 0.715279 0.715279 0.715279 -0.501729 -0.501729 -0.501729
2 -1.545400 -1.545400 -1.545400 1.128785 1.128785 1.128785
3 -0.008384 -0.008384 -0.008384 -0.697810 -0.697810 -0.697810
1 0 0.621336 0.621336 0.621336 -0.081122 -0.081122 -0.081122
1 -0.720086 -0.720086 -0.720086 -0.529296 -0.529296 -0.529296
2 0.265512 0.265512 0.265512 1.046183 1.046183 1.046183
3 0.108549 0.108549 0.108549 -1.418556 -1.418556 -1.418556
2 0 0.004291 0.004291 0.004291 -0.362499 -0.362499 -0.362499
1 -0.174600 -0.174600 -0.174600 -0.121906 -0.121906 -0.121906
2 0.433026 0.433026 0.433026 0.319356 0.319356 0.319356
3 1.203037 1.203037 1.203037 0.460903 0.460903 0.460903
3 0 -0.965066 -0.965066 -0.965066 -0.215790 -0.215790 -0.215790
1 1.028274 1.028274 1.028274 0.989072 0.989072 0.989072
2 0.228630 0.228630 0.228630 0.314754 0.314754 0.314754
3 0.445138 0.445138 0.445138 2.467651 2.467651 2.467651
4 0 -1.136602 -1.136602 -1.136602 -1.508321 -1.508321 -1.508321
1 0.135137 0.135137 0.135137 0.620601 0.620601 0.620601
2 1.484537 1.484537 1.484537 -1.045133 -1.045133 -1.045133
3 -1.079805 -1.079805 -1.079805 -0.798009 -0.798009 -0.798009
But if the items are unique, you get a DataFrame:
np.random.seed(10)
labels = list('abcdef')
samples = np.random.randn(6, 5, 4)
p1 = pd.Panel(samples, items=labels)
print (p1)
<class 'pandas.core.panel.Panel'>
Dimensions: 6 (items) x 5 (major_axis) x 4 (minor_axis)
Items axis: a to f
Major_axis axis: 0 to 4
Minor_axis axis: 0 to 3
print (p1['a'])
0 1 2 3
0 1.331587 0.715279 -1.545400 -0.008384
1 0.621336 -0.720086 0.265512 0.108549
2 0.004291 -0.174600 0.433026 1.203037
3 -0.965066 1.028274 0.228630 0.445138
4 -1.136602 0.135137 1.484537 -1.079805
print (p1.to_frame())
a b c d e f
major minor
0 0 1.331587 -1.977728 0.660232 -0.232182 1.985085 0.117476
1 0.715279 -1.743372 -0.350872 -0.501729 1.744814 -1.907457
2 -1.545400 0.266070 -0.939433 1.128785 -1.856185 -0.922909
3 -0.008384 2.384967 -0.489337 -0.697810 -0.222774 0.469751
1 0 0.621336 1.123691 -0.804591 -0.081122 -0.065848 -0.144367
1 -0.720086 1.672622 -0.212698 -0.529296 -2.131712 -0.400138
2 0.265512 0.099149 -0.339140 1.046183 -0.048831 -0.295984
3 0.108549 1.397996 0.312170 -1.418556 0.393341 0.848209
2 0 0.004291 -0.271248 0.565153 -0.362499 0.217265 0.706830
1 -0.174600 0.613204 -0.147420 -0.121906 -1.994394 -0.787269
2 0.433026 -0.267317 -0.025905 0.319356 1.107708 0.292941
3 1.203037 -0.549309 0.289094 0.460903 0.244544 -0.470807
3 0 -0.965066 0.132708 -0.539879 -0.215790 -0.061912 2.404326
1 1.028274 -0.476142 0.708160 0.989072 -0.753893 -0.739357
2 0.228630 1.308473 0.842225 0.314754 0.711959 -0.312829
3 0.445138 0.195013 0.203581 2.467651 0.918269 -0.348882
4 0 -1.136602 0.400210 2.394704 -1.508321 -0.482093 -0.439026
1 0.135137 -0.337632 0.917459 0.620601 0.089588 0.141104
2 1.484537 1.256472 -0.112272 -1.045133 0.826999 0.273049
3 -1.079805 -0.731970 -0.362180 -0.798009 -1.954512 -1.618571
It is the same as a DataFrame with non-unique columns:
samples = np.random.randn(6, 5)
df = pd.DataFrame(samples, columns=list('11122'))
print (df)
1 1 1 2 2
0 0.346338 -0.855797 -0.932463 -2.289259 0.634696
1 0.272794 -0.924357 -1.898270 -0.743083 -1.587480
2 -0.519975 -0.136836 0.530178 -0.730629 2.520821
3 0.137530 -1.232763 0.508548 -0.480384 -1.213064
4 -0.157787 -1.600004 -1.287620 0.384642 -0.568072
5 -0.649427 -0.659585 -0.813359 -1.487412 -0.044206
print (df['1'])
1 1 1
0 0.346338 -0.855797 -0.932463
1 0.272794 -0.924357 -1.898270
2 -0.519975 -0.136836 0.530178
3 0.137530 -1.232763 0.508548
4 -0.157787 -1.600004 -1.287620
5 -0.649427 -0.659585 -0.813359
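The duplicate-label behaviour can be illustrated with a minimal, self-contained sketch (the random data here is only illustrative):

```python
import numpy as np
import pandas as pd

np.random.seed(0)

# Duplicate column labels: selecting '1' returns a DataFrame of all matches
df_dup = pd.DataFrame(np.random.randn(3, 5), columns=list('11122'))
print(type(df_dup['1']).__name__)  # DataFrame

# Unique column labels: selecting 'a' returns a Series
df_uni = pd.DataFrame(np.random.randn(3, 5), columns=list('abcde'))
print(type(df_uni['a']).__name__)  # Series
```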
EDIT:
Also, for creating df from the list you need unique labels (non-unique labels raise an error) and the function concat with the parameter keys; for a Panel, call to_panel:
np.random.seed(100)
raw_sample = []
labels = list('abcdef')
samples = np.random.randn(6, 5, 4)
for contents in range(samples.shape[0]):
    raw_sample.append(pd.DataFrame(samples[contents]))
df = pd.concat(raw_sample, keys=labels)
print (df)
0 1 2 3
a 0 -1.749765 0.342680 1.153036 -0.252436
1 0.981321 0.514219 0.221180 -1.070043
2 -0.189496 0.255001 -0.458027 0.435163
3 -0.583595 0.816847 0.672721 -0.104411
4 -0.531280 1.029733 -0.438136 -1.118318
b 0 1.618982 1.541605 -0.251879 -0.842436
1 0.184519 0.937082 0.731000 1.361556
2 -0.326238 0.055676 0.222400 -1.443217
3 -0.756352 0.816454 0.750445 -0.455947
4 1.189622 -1.690617 -1.356399 -1.232435
c 0 -0.544439 -0.668172 0.007315 -0.612939
1 1.299748 -1.733096 -0.983310 0.357508
2 -1.613579 1.470714 -1.188018 -0.549746
3 -0.940046 -0.827932 0.108863 0.507810
4 -0.862227 1.249470 -0.079611 -0.889731
d 0 -0.881798 0.018639 0.237845 0.013549
1 -1.635529 -1.044210 0.613039 0.736205
2 1.026921 -1.432191 -1.841188 0.366093
3 -0.331777 -0.689218 2.034608 -0.550714
4 0.750453 -1.306992 0.580573 -1.104523
e 0 0.690121 0.686890 -1.566688 0.904974
1 0.778822 0.428233 0.108872 0.028284
2 -0.578826 -1.199451 -1.705952 0.369164
3 1.876573 -0.376903 1.831936 0.003017
4 -0.076023 0.003958 -0.185014 -2.487152
f 0 -1.704651 -1.136261 -2.973315 0.033317
1 -0.248889 -0.450176 0.132428 0.022214
2 0.317368 -0.752414 -1.296392 0.095139
3 -0.423715 -1.185984 -0.365462 -1.271023
4 1.586171 0.693391 -1.958081 -0.134801
p1 = df.to_panel()
print (p1)
<class 'pandas.core.panel.Panel'>
Dimensions: 4 (items) x 6 (major_axis) x 5 (minor_axis)
Items axis: 0 to 3
Major_axis axis: a to f
Minor_axis axis: 0 to 4
EDIT1:
If you need a MultiIndex DataFrame, it is possible to create a helper range for unique values, use concat, and finally remove the helper level of the MultiIndex:
np.random.seed(100)
raw_sample = []
labels = [1,1,1,2,2,2]
mux = pd.MultiIndex.from_arrays([labels, range(len(labels))])
samples = np.random.randn(6, 5, 4)
for contents in range(samples.shape[0]):
    raw_sample.append(pd.DataFrame(samples[contents]))
df = pd.concat(raw_sample, keys=mux)
df = df.reset_index(level=1, drop=True)
print (df)
0 1 2 3
1 0 -1.749765 0.342680 1.153036 -0.252436
1 0.981321 0.514219 0.221180 -1.070043
2 -0.189496 0.255001 -0.458027 0.435163
3 -0.583595 0.816847 0.672721 -0.104411
4 -0.531280 1.029733 -0.438136 -1.118318
0 1.618982 1.541605 -0.251879 -0.842436
1 0.184519 0.937082 0.731000 1.361556
2 -0.326238 0.055676 0.222400 -1.443217
3 -0.756352 0.816454 0.750445 -0.455947
4 1.189622 -1.690617 -1.356399 -1.232435
0 -0.544439 -0.668172 0.007315 -0.612939
1 1.299748 -1.733096 -0.983310 0.357508
2 -1.613579 1.470714 -1.188018 -0.549746
3 -0.940046 -0.827932 0.108863 0.507810
4 -0.862227 1.249470 -0.079611 -0.889731
2 0 -0.881798 0.018639 0.237845 0.013549
1 -1.635529 -1.044210 0.613039 0.736205
2 1.026921 -1.432191 -1.841188 0.366093
3 -0.331777 -0.689218 2.034608 -0.550714
4 0.750453 -1.306992 0.580573 -1.104523
0 0.690121 0.686890 -1.566688 0.904974
1 0.778822 0.428233 0.108872 0.028284
2 -0.578826 -1.199451 -1.705952 0.369164
3 1.876573 -0.376903 1.831936 0.003017
4 -0.076023 0.003958 -0.185014 -2.487152
0 -1.704651 -1.136261 -2.973315 0.033317
1 -0.248889 -0.450176 0.132428 0.022214
2 0.317368 -0.752414 -1.296392 0.095139
3 -0.423715 -1.185984 -0.365462 -1.271023
4 1.586171 0.693391 -1.958081 -0.134801
But creating a Panel is not possible:
p1 = df.to_panel()
print (p1)
ValueError: Can't convert non-uniquely indexed DataFrame to Panel
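Note that Panel was removed entirely in pandas 1.0, so in current pandas the MultiIndex DataFrame above is the end point. A sketch (under that assumption) of pulling a per-label block back out with loc:

```python
import numpy as np
import pandas as pd

np.random.seed(100)
labels = [1, 1, 1, 2, 2, 2]
samples = np.random.randn(6, 5, 4)

# Same construction as above: helper level for uniqueness, then drop it
mux = pd.MultiIndex.from_arrays([labels, range(len(labels))])
df = pd.concat([pd.DataFrame(s) for s in samples], keys=mux)
df = df.reset_index(level=1, drop=True)

# All rows belonging to label 1: three 5x4 samples stacked
block = df.loc[1]
print(block.shape)  # (15, 4)
```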
Related
Assume I have the following pandas data frame:
my_class value
0 1 1
1 1 2
2 1 3
3 2 4
4 2 5
5 2 6
6 2 7
7 2 8
8 2 9
9 3 10
10 3 11
11 3 12
I want to identify the indices of "my_class" where the class changes and remove n rows after and before this index. The output of this example (with n=2) should look like:
my_class value
0 1 1
5 2 6
6 2 7
11 3 12
My approach:
# where class changes happen
s = df['my_class'].ne(df['my_class'].shift(-1).fillna(df['my_class']))
# mask with `bfill` and `ffill`
df[~(s.where(s).bfill(limit=1).ffill(limit=2).eq(1))]
Output:
my_class value
0 1 1
5 2 6
6 2 7
11 3 12
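For reference, a self-contained version of this approach on the example data, with n=2:

```python
import pandas as pd

df = pd.DataFrame({'my_class': [1] * 3 + [2] * 6 + [3] * 3,
                   'value': range(1, 13)})

# True on the last row of each class except the final one
s = df['my_class'].ne(df['my_class'].shift(-1).fillna(df['my_class']))

# Mask 2 rows on each side of every change, then keep the rest
out = df[~(s.where(s).bfill(limit=1).ffill(limit=2).eq(1))]
print(out)
```

This keeps rows 0, 5, 6 and 11, matching the output above.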
One possible solution is to:
Make use of the fact that the index contains consecutive integers.
Find index values where class changes.
For each such index generate a sequence of indices from n-2 to n+1 and concatenate them.
Retrieve rows with indices not in this list.
The code to do it is:
ind = df[df['my_class'].diff().fillna(0, downcast='infer') == 1].index
df[~df.index.isin([item for sublist in
                   [range(i-2, i+2) for i in ind] for item in sublist])]
my_class = np.array([1] * 3 + [2] * 6 + [3] * 3)
cols = np.c_[my_class, np.arange(len(my_class)) + 1]
df = pd.DataFrame(cols, columns=['my_class', 'value'])
df['diff'] = df['my_class'].diff().fillna(0)
idx2drop = []
for i in df[df['diff'] == 1].index:
    idx2drop += range(i - 2, i + 2)
print(df.drop(idx2drop)[['my_class', 'value']])
Output:
my_class value
0 1 1
5 2 6
6 2 7
11 3 12
I have a df:
idx pairs
1 ['000001.jpg', '000002.jpg']
2 ['000006.jpg', '000007.jpg', '000008.jpg', '000004.jpg', '000005.jpg', '000003.jpg']
3 ['000016.jpg', '000020.jpg', '000017.jpg', '000010.jpg', '000011.jpg', '000012.jpg'...]
The pairs can be lists of any length. I'd like to create a new df where each row of 'pairs' is split into pairs whose first element is always the first item of the original list. E.g.:
idx pairs
1 ['000001.jpg', '000002.jpg']
2 ['000006.jpg', '000007.jpg']
3 ['000006.jpg', '000008.jpg']
4 ['000006.jpg', '000004.jpg']
5 ['000006.jpg', '000005.jpg']
6 ['000006.jpg', '000003.jpg']
7 ['000016.jpg', '000020.jpg']
8 ['000016.jpg', '000017.jpg']
9 ['000016.jpg', '000010.jpg']
10 ['000016.jpg', '000011.jpg']
11 ['000016.jpg', '000012.jpg']
Seems like a great case for explode.
df['first'] = df.pairs.apply(lambda x: x[0])
df['others'] = df.pairs.apply(lambda x: x[1:])
df = df.explode('others')[['first', 'others']]
df = pd.DataFrame({'pairs': df.values.tolist()})
df = df.rename_axis('idx').reset_index()
df.idx += 1
Then the head of df will look like this:
idx pairs
0 1 [000001.jpg, 000002.jpg]
1 2 [000006.jpg, 000007.jpg]
2 3 [000006.jpg, 000008.jpg]
3 4 [000006.jpg, 000004.jpg]
4 5 [000006.jpg, 000005.jpg]
5 6 [000006.jpg, 000003.jpg]
Here's one using a nested list comprehension and reconstructing the dataframe:
from itertools import chain
l = [[[i,j_] for j_ in j] for i, *j in df.pairs]
print(pd.DataFrame(chain.from_iterable(l)))
0 1
0 000001.jpg 000002.jpg
1 000006.jpg 000007.jpg
2 000006.jpg 000008.jpg
3 000006.jpg 000004.jpg
4 000006.jpg 000005.jpg
5 000006.jpg 000003.jpg
6 000016.jpg 000020.jpg
7 000016.jpg 000017.jpg
8 000016.jpg 000010.jpg
9 000016.jpg 000011.jpg
10 000016.jpg 000012.jpg
You can use the str accessor and explode:
df['first']= df.pairs.str[0]
df['pairs']= df.pairs.str[1:]
pairs first
0 [000002.jpg] 000001.jpg
1 [000007.jpg, 000008.jpg, 000004.jpg, 000005.jp... 000006.jpg
2 [000020.jpg, 000017.jpg, 000010.jpg, 000011.jp... 000016.jpg
df= df.explode("pairs")
pairs first
0 000002.jpg 000001.jpg
1 000007.jpg 000006.jpg
1 000008.jpg 000006.jpg
1 000004.jpg 000006.jpg
1 000005.jpg 000006.jpg
1 000003.jpg 000006.jpg
2 000020.jpg 000016.jpg
2 000017.jpg 000016.jpg
2 000010.jpg 000016.jpg
2 000011.jpg 000016.jpg
2 000012.jpg 000016.jpg
Swap the cols:
df= df.reindex(columns=["first","pairs"])
If you need one list in each row:
df= df.reset_index(drop=True).agg(list,axis=1)
0 [000001.jpg, 000002.jpg]
1 [000006.jpg, 000007.jpg]
2 [000006.jpg, 000008.jpg]
3 [000006.jpg, 000004.jpg]
4 [000006.jpg, 000005.jpg]
5 [000006.jpg, 000003.jpg]
6 [000016.jpg, 000020.jpg]
7 [000016.jpg, 000017.jpg]
8 [000016.jpg, 000010.jpg]
9 [000016.jpg, 000011.jpg]
10 [000016.jpg, 000012.jpg]
dtype: object
Edit:
'reindex' is not necessary in this case:
df['tail']= df.pairs.str[1:]
df['pairs']= df.pairs.str[0]
df= df.explode("tail")
etc.
I have the following df:
import numpy as np
import pandas as pd
a = []
for i in range(5):
    tmp_df = pd.DataFrame(np.random.random((10,4)))
    tmp_df['lvl'] = i
    a.append(tmp_df)
df = pd.concat(a, axis=0)
df =
0 1 2 3 lvl
0 0.928623 0.868600 0.854186 0.129116 0
1 0.667870 0.901285 0.539412 0.883890 0
2 0.384494 0.697995 0.242959 0.725847 0
3 0.993400 0.695436 0.596957 0.142975 0
4 0.518237 0.550585 0.426362 0.766760 0
5 0.359842 0.417702 0.873988 0.217259 0
6 0.820216 0.823426 0.585223 0.553131 0
7 0.492683 0.401155 0.479228 0.506862 0
..............................................
3 0.505096 0.426465 0.356006 0.584958 3
4 0.145472 0.558932 0.636995 0.318406 3
5 0.957969 0.068841 0.612658 0.184291 3
6 0.059908 0.298270 0.334564 0.738438 3
7 0.662056 0.074136 0.244039 0.848246 3
8 0.997610 0.043430 0.774946 0.097294 3
9 0.795873 0.977817 0.780772 0.849418 3
0 0.577173 0.430014 0.133300 0.760223 4
1 0.916126 0.623035 0.240492 0.638203 4
2 0.165028 0.626054 0.225580 0.356118 4
3 0.104375 0.137684 0.084631 0.987290 4
4 0.934663 0.835608 0.764334 0.651370 4
5 0.743265 0.072671 0.911947 0.925644 4
6 0.212196 0.587033 0.230939 0.994131 4
7 0.945275 0.238572 0.696123 0.536136 4
8 0.989021 0.073608 0.720132 0.254656 4
9 0.513966 0.666534 0.270577 0.055597 4
I am learning pandas and wondering: what is the easiest way to compute the average along the lvl column?
What I mean is:
(df[df.lvl ==0 ] + df[df.lvl ==1 ] + df[df.lvl ==2 ] + df[df.lvl ==3 ] + df[df.lvl ==4 ]) / 5
The desired output should be a table of shape (10, 4), without the column lvl, where each element is the average of 5 elements (one for each lvl in [0, 1, 2, 3, 4]). I hope that helps.
I think you need:
np.random.seed(456)
a = []
for i in range(5):
    tmp_df = pd.DataFrame(np.random.random((10,4)))
    tmp_df['lvl'] = i
    a.append(tmp_df)
df = pd.concat(a, axis=0)
#print (df)
df1 = (df[df.lvl ==0 ] + df[df.lvl ==1 ] +
df[df.lvl ==2 ] + df[df.lvl ==3 ] +
df[df.lvl ==4 ]) / 5
print (df1)
0 1 2 3 lvl
0 0.411557 0.520560 0.578900 0.541576 2
1 0.253469 0.655714 0.532784 0.620744 2
2 0.468099 0.576198 0.400485 0.333533 2
3 0.620207 0.367649 0.531639 0.475587 2
4 0.699554 0.548005 0.683745 0.457997 2
5 0.322487 0.316137 0.489660 0.362146 2
6 0.430058 0.159712 0.631610 0.641141 2
7 0.399944 0.511944 0.346402 0.754591 2
8 0.400190 0.373925 0.340727 0.407988 2
9 0.502879 0.399614 0.321710 0.715812 2
df = df.set_index('lvl')
df2 = df.groupby(df.groupby('lvl').cumcount()).mean()
print (df2)
0 1 2 3
0 0.411557 0.520560 0.578900 0.541576
1 0.253469 0.655714 0.532784 0.620744
2 0.468099 0.576198 0.400485 0.333533
3 0.620207 0.367649 0.531639 0.475587
4 0.699554 0.548005 0.683745 0.457997
5 0.322487 0.316137 0.489660 0.362146
6 0.430058 0.159712 0.631610 0.641141
7 0.399944 0.511944 0.346402 0.754591
8 0.400190 0.373925 0.340727 0.407988
9 0.502879 0.399614 0.321710 0.715812
EDIT:
If each subset of the DataFrame has an index from 0 to len(subset):
df2 = df.mean(level=0)
print (df2)
0 1 2 3 lvl
0 0.411557 0.520560 0.578900 0.541576 2
1 0.253469 0.655714 0.532784 0.620744 2
2 0.468099 0.576198 0.400485 0.333533 2
3 0.620207 0.367649 0.531639 0.475587 2
4 0.699554 0.548005 0.683745 0.457997 2
5 0.322487 0.316137 0.489660 0.362146 2
6 0.430058 0.159712 0.631610 0.641141 2
7 0.399944 0.511944 0.346402 0.754591 2
8 0.400190 0.373925 0.340727 0.407988 2
9 0.502879 0.399614 0.321710 0.715812 2
The groupby function is exactly what you want. It will group based on a condition, in this case where 'lvl' is the same, and then apply the mean function to the values for each column in that group.
df.groupby('lvl').mean()
It seems like you want to group by the index and take the average of all the columns except lvl, i.e.:
df.groupby(df.index)[[0,1,2,3]].mean()
For a dataframe generated using
np.random.seed(456)
a = []
for i in range(5):
    tmp_df = pd.DataFrame(np.random.random((10,4)))
    tmp_df['lvl'] = i
    a.append(tmp_df)
df = pd.concat(a, axis=0)
df.groupby(df.index)[[0,1,2,3]].mean()
outputs:
0 1 2 3
0 0.411557 0.520560 0.578900 0.541576
1 0.253469 0.655714 0.532784 0.620744
2 0.468099 0.576198 0.400485 0.333533
3 0.620207 0.367649 0.531639 0.475587
4 0.699554 0.548005 0.683745 0.457997
5 0.322487 0.316137 0.489660 0.362146
6 0.430058 0.159712 0.631610 0.641141
7 0.399944 0.511944 0.346402 0.754591
8 0.400190 0.373925 0.340727 0.407988
9 0.502879 0.399614 0.321710 0.715812
which is identical to the output from
df.groupby(df.groupby('lvl').cumcount()).mean()
without resorting to double groupby.
IMO this is cleaner to read and, for a large dataframe, will be much faster.
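As a quick sanity check, the two approaches can be compared numerically on the seeded data (a sketch):

```python
import numpy as np
import pandas as pd

np.random.seed(456)
a = []
for i in range(5):
    tmp_df = pd.DataFrame(np.random.random((10, 4)))
    tmp_df['lvl'] = i
    a.append(tmp_df)
df = pd.concat(a, axis=0)

# Position-within-group mean (double groupby) vs. mean over the repeated index
by_cumcount = df.groupby(df.groupby('lvl').cumcount()).mean()[[0, 1, 2, 3]]
by_index = df.groupby(df.index)[[0, 1, 2, 3]].mean()
print(np.allclose(by_cumcount.to_numpy(), by_index.to_numpy()))  # True
```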
I have two data frames as below:
Sample_name C14-Cer C16-Cer C18-Cer C18:1-Cer C20-Cer
0 1 1 0.161456 0.033139 0.991840 2.111023 0.846197
1 1 10 0.636140 1.024235 36.333741 16.074662 3.142135
2 1 13 0.605840 0.034337 2.085061 2.125908 0.069698
3 1 14 0.038481 0.152382 4.608259 4.960007 0.162162
4 1 5 0.035628 0.087637 1.397457 0.768467 0.052605
5 1 6 0.114375 0.020196 0.220193 7.662065 0.077727
Sample_name C14-Cer C16-Cer C18-Cer C18:1-Cer C20-Cer
0 1 1 0.305224 0.542488 66.428382 73.615079 10.342252
1 1 10 0.814696 1.246165 73.802644 58.064363 11.179206
2 1 13 0.556437 0.517383 50.555948 51.913547 9.412299
3 1 14 0.314058 1.148754 56.165767 61.261950 9.142128
4 1 5 0.499129 0.460813 40.182454 41.770906 8.263437
5 1 6 0.300203 0.784065 47.359506 52.841821 9.833513
I want to divide the numerical values in the selected cells of the first by the second and I am using the following code:
df1_int.loc[:,'C14-Cer':].div(df2.loc[:,'C14-Cer':])
However, this way I lose the information from the column "Sample_name".
C14-Cer C16-Cer C18-Cer C18:1-Cer C20-Cer
0 0.528977 0.061088 0.014931 0.028677 0.081819
1 0.780831 0.821909 0.492309 0.276842 0.281070
2 1.088785 0.066367 0.041243 0.040951 0.007405
3 0.122529 0.132650 0.082047 0.080964 0.017738
4 0.071381 0.190178 0.034778 0.018397 0.006366
5 0.380993 0.025759 0.004649 0.145000 0.007904
How can I perform the division while keeping the column "Sample_name" in the resulting dataframe?
You can selectively overwrite using loc, the same way that you're already performing the division:
df1_int.loc[:,'C14-Cer':] = df1_int.loc[:,'C14-Cer':].div(df2.loc[:,'C14-Cer':])
This preserves the Sample_name col:
In [12]:
df.loc[:,'C14-Cer':] = df.loc[:,'C14-Cer':].div(df1.loc[:,'C14-Cer':])
df
Out[12]:
Sample_name C14-Cer C16-Cer C18-Cer C18:1-Cer C20-Cer
index
0 1 1 0.528975 0.061087 0.014931 0.028677 0.081819
1 1 10 0.780831 0.821910 0.492309 0.276842 0.281070
2 1 13 1.088785 0.066367 0.041243 0.040951 0.007405
3 1 14 0.122528 0.132650 0.082047 0.080964 0.017738
4 1 5 0.071380 0.190179 0.034778 0.018397 0.006366
5 1 6 0.380992 0.025758 0.004649 0.145000 0.007904
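The same pattern in a minimal, self-contained form (the two tiny frames below are made up; only the column layout follows the question):

```python
import pandas as pd

df1 = pd.DataFrame({'Sample_name': ['1 1', '1 10'],
                    'C14-Cer': [1.0, 2.0],
                    'C16-Cer': [4.0, 8.0]})
df2 = pd.DataFrame({'Sample_name': ['1 1', '1 10'],
                    'C14-Cer': [2.0, 4.0],
                    'C16-Cer': [2.0, 2.0]})

# Overwrite only the numeric slice in place; 'Sample_name' is left untouched
df1.loc[:, 'C14-Cer':] = df1.loc[:, 'C14-Cer':].div(df2.loc[:, 'C14-Cer':])
print(df1)
```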
I have got the following pandas data frame
Y X id WP_NER
0 35.973496 -2.734554 1 WP_01
1 35.592138 -2.903913 2 WP_02
2 35.329853 -3.391070 3 WP_03
3 35.392608 -3.928513 4 WP_04
4 35.579265 -3.942995 5 WP_05
5 35.519728 -3.408771 6 WP_06
6 35.759485 -3.078903 7 WP_07
I'd like to round the Y and X columns using pandas.
How can I do that?
You can now use round on a dataframe.
Option 1
In [661]: df.round({'Y': 2, 'X': 2})
Out[661]:
Y X id WP_NER
0 35.97 -2.73 1 WP_01
1 35.59 -2.90 2 WP_02
2 35.33 -3.39 3 WP_03
3 35.39 -3.93 4 WP_04
4 35.58 -3.94 5 WP_05
5 35.52 -3.41 6 WP_06
6 35.76 -3.08 7 WP_07
Option 2
In [662]: cols = ['Y', 'X']
In [663]: df[cols] = df[cols].round(2)
In [664]: df
Out[664]:
Y X id WP_NER
0 35.97 -2.73 1 WP_01
1 35.59 -2.90 2 WP_02
2 35.33 -3.39 3 WP_03
3 35.39 -3.93 4 WP_04
4 35.58 -3.94 5 WP_05
5 35.52 -3.41 6 WP_06
6 35.76 -3.08 7 WP_07
You can apply round:
In [142]:
df[['Y','X']].apply(pd.Series.round)
Out[142]:
Y X
0 36 -3
1 36 -3
2 35 -3
3 35 -4
4 36 -4
5 36 -3
6 36 -3
If you want to apply to a specific number of places:
In [143]:
df[['Y','X']].apply(lambda x: pd.Series.round(x, 3))
Out[143]:
Y X
0 35.973 -2.735
1 35.592 -2.904
2 35.330 -3.391
3 35.393 -3.929
4 35.579 -3.943
5 35.520 -3.409
6 35.759 -3.079
EDIT
You assign the above to the columns you want to modify like the following:
In [144]:
df[['Y','X']] = df[['Y','X']].apply(lambda x: pd.Series.round(x, 3))
df
Out[144]:
Y X id WP_NER
0 35.973 -2.735 1 WP_01
1 35.592 -2.904 2 WP_02
2 35.330 -3.391 3 WP_03
3 35.393 -3.929 4 WP_04
4 35.579 -3.943 5 WP_05
5 35.520 -3.409 6 WP_06
6 35.759 -3.079 7 WP_07
round is smart enough to operate only on float columns, so the simplest solution is just:
df = df.round(2)
You can do the below:
df['column_name'] = df['column_name'].apply(lambda x: round(x,2) if isinstance(x, float) else x)
This also checks whether the cell value is a float; if it is not a float, the same value is returned. This matters because a cell value can be a string or a NaN.
You can also first check which columns are of type float, then round those columns:
import math

for col in df.select_dtypes(include=['float']).columns:
    df[col] = df[col].apply(lambda x: x if math.isnan(x) else round(x, 1))
This also manages potential errors when trying to round NaN values, via the math.isnan(x) check.