This question already has answers here:
How can I pivot a dataframe?
(5 answers)
Closed 4 years ago.
Suppose I have a set of data with two labels, put in a pandas DataFrame:
label1 label2
0 0 a
1 1 a
2 1 a
3 1 a
4 1 a
5 2 b
6 0 b
7 1 b
8 2 b
9 0 b
10 2 c
11 1 c
12 2 c
13 0 c
14 2 c
Using the following code, the number of elements for each combination of labels can be obtained:
grouped = df.groupby(['label1', 'label2'], sort=False)
grouped.size()
The result is something like this:
label1 label2
0 a 1
1 a 4
2 b 2
0 b 2
1 b 1
2 c 3
1 c 1
0 c 1
dtype: int64
However, I would also like to compare the distribution of data counts for label2 across the label1 groups. I imagine the most convenient way to manipulate the data further would be a DataFrame (or some sort of table) with label1 as rows, label2 as columns, and the counts as content, like this:
a b c
0 1 2 1
1 4 1 1
2 0 2 3
After a while of searching, to my surprise, there seems to be no easy way to do this kind of DataFrame reshaping in pandas.
Using a loop is possible, but I assume it would be very slow, since the real data contains hundreds of thousands of distinct labels.
Moreover, there seems to be no way to get a group by label1 alone after grouping by both label1 and label2, so the loop would have to run over the label combinations, which might make things even slower and more complicated.
Does anyone know a smart way to do this?
Probably crosstab:
pd.crosstab(df.label1, df.label2)
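With the sample data above, this should produce exactly the desired table:
label2  a  b  c
label1
0       1  2  1
1       4  1  1
2       0  2  3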
Are you looking for pd.pivot_table?
df.pivot_table(index='label1', columns='label2', aggfunc='size').fillna(0)
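One note on this approach: size leaves NaN for label combinations that never occur, and fillna(0) then turns the counts into floats. Passing fill_value=0 to pivot_table instead should keep the counts as integers:
df.pivot_table(index='label1', columns='label2', aggfunc='size', fill_value=0)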
This question already has an answer here:
Pandas long to wide (unmelt or similar?) [duplicate]
(1 answer)
Closed 2 months ago.
I have a pandas dataframe with a few columns; let's say that it looks like this:
Heading 1  Values
A          1
A          2
B          9
B          8
B          6
What I want is to "pivot" or group the table so it would look something like:
Heading 1  Value 1  Value 2  Value 3
A          1        2
B          9        8        6
I have tried grouping the table and pivoting/unpivoting it in several ways, but I cannot figure out how to do it properly.
You can derive a new column that will hold a row number (so to speak) for each partition of heading 1.
import pandas as pd

df = pd.DataFrame({"heading 1": ['A','A','B','B','B'], "Values": [1,2,9,8,6]})
df['rn'] = df.groupby(['heading 1']).cumcount() + 1
heading 1 Values rn
0 A 1 1
1 A 2 2
2 B 9 1
3 B 8 2
4 B 6 3
Then you can pivot, using the newly derived column as your columns argument:
df = df.pivot(index='heading 1', columns='rn', values='Values').reset_index()
rn heading 1 1 2 3
0 A 1.0 2.0 NaN
1 B 9.0 8.0 6.0
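If you want the headers from the question (Value 1, Value 2, ...), one option is to rename the pivoted columns afterwards. A small sketch, assuming the columns come out as the integers 1, 2, 3 as shown above:
# hypothetical renaming step; adjust to your real column names
df.columns = ['Heading 1'] + [f'Value {c}' for c in df.columns[1:]]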
I have a DataFrame with columns A, B, and C. For each value of A, I would like to select the row with the minimum value in column B.
That is, from this:
df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
                   'B': [4, 5, 2, 7, 4, 6],
                   'C': [3, 4, 10, 2, 4, 6]})
A B C
0 1 4 3
1 1 5 4
2 1 2 10
3 2 7 2
4 2 4 4
5 2 6 6
I would like to get:
A B C
0 1 2 10
1 2 4 4
For the moment I am grouping by column A, then creating a value that indicates to me the rows I will keep:
a = df.groupby('A').min()
a['A'] = a.index
to_keep = [str(x[0]) + str(x[1]) for x in a[['A', 'B']].values]
df['id'] = df['A'].astype(str) + df['B'].astype(str)
df[df['id'].isin(to_keep)]
I am sure that there is a much more straightforward way to do this.
I have seen many answers here that use MultiIndex, which I would prefer to avoid.
Thank you for your help.
I feel like you're overthinking this. Just use groupby and idxmin:
df.loc[df.groupby('A').B.idxmin()]
A B C
2 1 2 10
4 2 4 4
df.loc[df.groupby('A').B.idxmin()].reset_index(drop=True)
A B C
0 1 2 10
1 2 4 4
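One caveat: if the minimum of B is tied within a group, idxmin returns only the first matching index, so only one row per group is kept.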
Had a similar situation, but with a more complex column heading (e.g. "B val"), in which case this is needed:
df.loc[df.groupby('A')['B val'].idxmin()]
The accepted answer (suggesting idxmin) cannot be used with the pipe pattern. A pipe-friendly alternative is to first sort values and then use groupby with DataFrame.head:
df.sort_values('B').groupby('A').apply(pd.DataFrame.head, n=1)
This is possible because, by default, groupby preserves the order of rows within each group, which is stable and documented behaviour (see pandas.DataFrame.groupby).
This approach has additional benefits:
it can easily be expanded to select the n rows with the smallest values in a specific column
it can break ties by providing another column (as a list) to .sort_values(), e.g.:
data.sort_values(['final_score', 'midterm_score']).groupby('year').apply(pd.DataFrame.head, n=1)
As with the other answers, to exactly match the result desired in the question, .reset_index(drop=True) is needed, making the final snippet:
df.sort_values('B').groupby('A').apply(pd.DataFrame.head, n=1).reset_index(drop=True)
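A closely related one-liner (a sketch of the same idea) avoids apply entirely, since GroupBy.head keeps the first rows of each group in the frame's current order:
df.sort_values('B').groupby('A').head(1).reset_index(drop=True)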
I found an answer that is a little more wordy, but a lot more efficient:
This is the example dataset:
data = pd.DataFrame({'A': [1,1,1,2,2,2], 'B':[4,5,2,7,4,6], 'C':[3,4,10,2,4,6]})
data
Out:
A B C
0 1 4 3
1 1 5 4
2 1 2 10
3 2 7 2
4 2 4 4
5 2 6 6
First we get the min values as a Series from a groupby operation:
min_value = data.groupby('A').B.min()
min_value
Out:
A
1 2
2 4
Name: B, dtype: int64
Then, we merge this Series back onto the original data frame:
data = data.merge(min_value, on='A', suffixes=('', '_min'))
data
Out:
A B C B_min
0 1 4 3 2
1 1 5 4 2
2 1 2 10 2
3 2 7 2 4
4 2 4 4 4
5 2 6 6 4
Finally, we keep only the rows where B is equal to B_min and drop B_min, since we don't need it anymore:
data = data[data.B==data.B_min].drop('B_min', axis=1)
data
Out:
A B C
2 1 2 10
4 2 4 4
I have tested it on very large datasets and this was the only way I could make it work in a reasonable time.
You can use sort_values and drop_duplicates:
df.sort_values('B').drop_duplicates('A')
Output:
A B C
2 1 2 10
4 2 4 4
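Note that drop_duplicates keeps the first occurrence by default (keep='first'), so if the minimum of B is tied within a group, the row that comes first after the sort is the one retained.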
The solution is, as written above:
df.loc[df.groupby('A')['B'].idxmin()]
But if you then get an error like:
"Passing list-likes to .loc or [] with any missing labels is no longer supported.
The following labels were missing: Float64Index([nan], dtype='float64').
See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike"
In my case, there were NaN values in column B, so I added dropna() and then it worked:
df.loc[df.groupby('A')['B'].idxmin().dropna()]
You can also use boolean indexing to select the rows where column B equals the group minimum:
out = df[df['B'] == df.groupby('A')['B'].transform('min')]
print(out)
A B C
2 1 2 10
4 2 4 4
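Unlike the idxmin approach, this keeps all rows that tie for the minimum within a group rather than just the first one.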
I would like to know whether I can get some help in "translating" a multi-dimensional list into a single column of a data frame in pandas.
I found help here on translating a multi-dimensional list into a data frame with multiple columns, but I need to translate the data into one column.
Suppose I have the following list of lists:
x=[[1,2,3],[4,5,6]]
If I create a frame I get
frame = pd.DataFrame(x)
   0  1  2
0  1  2  3
1  4  5  6
But my desired outcome is:
0
1
2
3
4
5
6
with the zero as column header.
I can of course get the result with a for loop, but from my point of view that would take too much time. Is there any pythonic/pandas way to get it?
Thanks for helping me.
You can use np.concatenate:
import numpy as np
import pandas as pd

x = [[1,2,3],[4,5,6]]
frame = pd.DataFrame(np.concatenate(x))
print(frame)
Output:
0
0 1
1 2
2 3
3 4
4 5
5 6
First it is necessary to flatten the values of the nested lists and pass them to the DataFrame constructor:
df = pd.DataFrame([z for y in x for z in y])
Or:
from itertools import chain
df = pd.DataFrame(list(chain.from_iterable(x)))
print (df)
0
0 1
1 2
2 3
3 4
4 5
5 6
If you use numpy you can utilize the method ravel():
pd.DataFrame(np.array(x).ravel())
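Note that np.array(x).ravel() only works cleanly when the sublists all have the same length; with ragged sublists, numpy cannot build a regular array. In newer pandas (0.25+), Series.explode is another option; a small sketch:
s = pd.Series(x).explode().reset_index(drop=True)
frame = s.to_frame()        # the column header is 0, since the Series is unnamed
frame = frame.astype(int)   # optional: explode leaves an object dtype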
This question already has answers here:
cartesian product in pandas
(13 answers)
Closed 4 years ago.
For example, the data is:
a=pd.DataFrame({'aa':[1,2,3]})
b=pd.DataFrame({'bb':[4,5]})
What I want is to combine these two data frames so that the new frame is:
aa bb
1 4
1 5
2 4
2 5
3 4
3 5
You can see that every value in a is linked to all the values in b in the new frame. I could probably use tile or repeat to do this, but I have multiple frames which need this done repeatedly, so I want to know if there is a better way.
Could anyone help me out here?
You can do it like this:
In [24]: a['key'] = 1
In [25]: b['key'] = 1
In [27]: pd.merge(a, b, on='key').drop('key', axis=1)
Out[27]:
aa bb
0 1 4
1 1 5
2 2 4
3 2 5
4 3 4
5 3 5
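If you are on pandas 1.2 or later, merge also supports how='cross' directly, so the dummy key column is not needed at all (starting from the original frames, before adding 'key'):
pd.merge(a, b, how='cross')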
You can use pd.MultiIndex.from_product and then reset_index. It generates all the combinations of both sets of data (the same idea as itertools.product):
df_output = (pd.DataFrame(index=pd.MultiIndex.from_product([a.aa, b.bb], names=['aa', 'bb']))
             .reset_index())
and you get
aa bb
0 1 4
1 1 5
2 2 4
3 2 5
4 3 4
5 3 5
I have a pandas dataframe like this:
X a b c
1 1 0 2
5 4 7 3
6 7 8 9
I want to print a column called 'count' which outputs the number of values greater than the value in the first column ('X' in my case). The output should look like:
X a b c Count
1 1 0 2 2
5 4 7 3 1
6 7 8 9 3
I would like to refrain from using lambda functions, for loops, or any kind of looping technique, since my dataframe has a large number of rows. I tried something like this, but I couldn't get what I wanted:
df['count'] = df[df.iloc[:, 1:] > df.iloc[:, 0]].count(axis=1)
I also tried numpy.where(), but didn't have any luck with that either, so any help will be appreciated. I also have NaN values in my dataframe, and I would like to ignore those when I count the values.
Thanks for your help in advance!
You can use ge (>=) with sum. (Note that your expected output counts values greater than or equal to X, which is why ge rather than gt matches it.)
df.iloc[:, 1:].ge(df.iloc[:, 0], axis=0).sum(axis=1)
Out[784]:
0 2
1 1
2 3
dtype: int64
Then assign it back:
df['Count'] = df.iloc[:, 1:].ge(df.iloc[:, 0], axis=0).sum(axis=1)
df
Out[786]:
X a b c Count
0 1 1 0 2 2
1 5 4 7 3 1
2 6 7 8 9 3
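Regarding the NaN values mentioned in the question: comparisons against NaN evaluate to False, so ge simply never counts them, which is exactly the behaviour asked for.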
In case anyone needs such a solution: you can just add the outputs you get from .le and .ge in one line (here applied to my own column layout):
df['count'] = (df.iloc[:, 2:5].le(df.iloc[:, 0], axis=0).sum(axis=1) + df.iloc[:, 2:5].ge(df.iloc[:, 1], axis=0).sum(axis=1))
Thanks to @Wen for the answer to my question, though!