I'm trying to pivot data in a way so that the index and columns of the resulting table aren't automatically sorted. An example of the data might be:
X Y Z
1 1 1
3 1 2
2 1 3
4 1 4
1 2 5
3 2 6
2 2 7
4 2 8
The data is interpreted as an X, Y and Z axis. The pivoted result should look like this:
X 1 3 2 4
Y
1 1 2 3 4
2 5 6 7 8
Instead the result looks like this, where the index and columns are sorted, and the data accordingly:
X 1 2 3 4
Y
1 1 3 2 4
2 5 7 6 8
At this point I have lost information about the order in which the measurements were taken. For example, say I plot the row at Y=1, with X on the x-axis and the data value on the y-axis.
This would result in the figures in this picture (image not included), where the plot on the right shows how I would like the data to be plotted. Does anyone have an idea how to prevent pandas from sorting the index and columns when pivoting a table?
As a workaround, you can restore the order after pivoting. In this sample the Z values in the first Y row happen to match the original X order, so you can re-order your X columns with something like this:
import pandas as pd
# using your sample data
df = pd.read_clipboard()
df = df.pivot(index='Y', columns='X', values='Z')
df
X 1 2 3 4
Y
1 1 3 2 4
2 5 7 6 8
# re-order your X columns by the values of first Y, for instance
df = df[df.T[1].values]
df
X 1 3 2 4
Y
1 1 2 3 4
2 5 6 7 8
Not the best approach, since it relies on the values in that row matching the column labels, but it achieves what you want for this sample.
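A more general option, sketched here under the assumption that the original long-format frame is still available, is to record the order of first appearance with `unique()` and then `reindex` the pivoted table. The names `x_order`, `y_order` and `pivoted` are only illustrative.

```python
import pandas as pd

# Sample data from the question, in measurement order
df = pd.DataFrame({
    "X": [1, 3, 2, 4, 1, 3, 2, 4],
    "Y": [1, 1, 1, 1, 2, 2, 2, 2],
    "Z": [1, 2, 3, 4, 5, 6, 7, 8],
})

# unique() keeps values in order of first appearance rather than sorting them
x_order = df["X"].unique()
y_order = df["Y"].unique()

# pivot sorts both axes; reindex puts them back in the recorded order
pivoted = (
    df.pivot(index="Y", columns="X", values="Z")
      .reindex(index=y_order, columns=x_order)
)
print(pivoted)
```

This does not depend on the Z values, so it should also work when the data are not as conveniently laid out as in the sample.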
I'm trying to even up a dataset for machine learning. There are great answers for how to sample a dataframe with two values in a column (a binary choice).
In my case I have many values in column x. I want an equal number of records in the dataframe where
x is 0 or not 0
or, in a more complicated example, the value in x is 0, 5 or some other value
Examples
x
0 5
1 5
2 5
3 0
4 0
5 9
6 18
7 3
8 5
For the first:
I have 2 rows where x = 0 and 7 where x != 0. The result should balance this up and be 4 rows: the two with x = 0 and 2 where x != 0 (randomly selected). Preserving the same index for the sake of illustration:
1 5
3 0
4 0
6 18
For the second:
I have 2 rows where x = 0, 4 rows where x = 5 and 3 rows where x != 0 && x != 5. The result should balance this up and be 6 rows in total: two for each condition. Preserving the same index for the sake of illustration:
1 5
3 0
4 0
5 9
6 18
8 5
I've done examples with 2 conditions & 3 conditions. A solution that generalises to more would be good. It is better if it detects the minimum number of rows (for 0 in this example) so I don't need to work this out first before writing the condition.
How do I do this with pandas? Can I pass a custom function to .groupby() to do this?
IIUC, you could groupby on the condition whether "x" is 0 or not and sample the smallest-group-size number of entries from each group:
g = df.groupby(df['x']==0)['x']
out = g.sample(n=g.count().min()).sort_index()
(An example) output:
1 5
3 0
4 0
5 9
Name: x, dtype: int64
For the second case, we could use numpy.select to label the groups and numpy.unique to count them (the rest is essentially the same as above):
import numpy as np
groups = np.select([df['x']==0, df['x']==5], [1,2], 3)
g = df.groupby(groups)['x']
out = g.sample(n=np.unique(groups, return_counts=True)[1].min()).sort_index()
An example output:
2 5
3 0
4 0
5 9
7 3
8 5
Name: x, dtype: int64
IIUC, and you want any two non-zero records:
mask = df['x'].eq(0)
pd.concat([df[mask], df[~mask].sample(mask.sum())]).sort_index()
Output:
x
1 5
2 5
3 0
4 0
Part II:
mask0 = df['x'].eq(0)
mask5 = df['x'].eq(5)
pd.concat([df[mask0],
df[mask5].sample(mask0.sum()),
df[~(mask0 | mask5)].sample(mask0.sum())]).sort_index()
Output:
x
2 5
3 0
4 0
6 18
7 3
8 5
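Both answers can be generalised to any number of conditions. The following is a minimal sketch rather than a definitive implementation: `balanced_sample` is a hypothetical helper name, and it assumes each condition is a boolean Series aligned with `df`. Rows matching none of the conditions form one extra "other" group, and the smallest group size is detected automatically.

```python
import numpy as np
import pandas as pd

def balanced_sample(df, conditions, random_state=None):
    """Sample equally many rows from each condition group and from the 'other' group."""
    # Label each row with the position of the first condition it satisfies;
    # rows satisfying none of them get the label len(conditions)
    labels = np.select(conditions, list(range(len(conditions))),
                       default=len(conditions))
    labels = pd.Series(labels, index=df.index)
    # Size of the smallest non-empty group
    n = labels.value_counts().min()
    return df.groupby(labels).sample(n=n, random_state=random_state).sort_index()

df = pd.DataFrame({"x": [5, 5, 5, 0, 0, 9, 18, 3, 5]})
out = balanced_sample(df, [df["x"] == 0, df["x"] == 5], random_state=0)
print(out)
```

Passing only `[df["x"] == 0]` reproduces the two-group case. `GroupBy.sample` needs pandas 1.1 or later.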
I have a pandas DataFrame with columns 'x', 'y', 'z'
However, a lot of the x and y values are redundant. I want to take all rows that have the same x and y values and sum the third column, returning a smaller DataFrame.
So given
x y z
0 1 2 1
1 1 2 5
2 1 2 0
3 1 3 0
4 2 6 1
it would return:
x y z
0 1 2 6
1 1 3 0
2 2 6 1
I've tried
df = df.groupby(['x', 'y'])['z'].sum
but I'm not sure how to work with grouped objects.
Very close as-is; you just need to call .sum() and then reset the index:
>>> df.groupby(['x', 'y'])['z'].sum().reset_index()
x y z
0 1 2 6
1 1 3 0
2 2 6 1
There is also a parameter to groupby() that handles that:
>>> df.groupby(['x', 'y'], as_index=False)['z'].sum()
x y z
0 1 2 6
1 1 3 0
2 2 6 1
In your question, you have df.groupby(['x', 'y'])['z'].sum without parentheses. This simply references the method .sum as a Python object, without calling it.
>>> type(df.groupby(['x', 'y'])['z'].sum)
method
>>> callable(df.groupby(['x', 'y'])['z'].sum)
True
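Another option is to move x and y into the index and sum on the index levels. Older pandas versions accepted sum(level=[0,1]) for this, but the level argument was deprecated in pandas 1.3 and removed in 2.0, so group by the index levels instead:
df.set_index(['x','y']).groupby(level=[0,1]).sum().reset_index()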
Output:
x y z
0 1 2 6
1 1 3 0
2 2 6 1
How can I divide a column into 5 groups based on its sorted values, and add a new column holding the group number?
For example:
import pandas as pd
df = pd.DataFrame({'x1':[1,2,3,4,5,6,7,8,9,10]})
and I want to add a column like this:
You probably want to look at pd.cut, and set the argument bins to an integer of however many groups you want, and the labels argument to False (to return integer indicators of your groups instead of ranges):
df['add_col'] = pd.cut(df['x1'], bins=5, labels=False) + 1
>>> df
x1 add_col
0 1 1
1 2 1
2 3 2
3 4 2
4 5 3
5 6 3
6 7 4
7 8 4
8 9 5
9 10 5
Note that the + 1 is only there so that your groups are numbered 1 to 5, as in your desired output. If you omit the + 1, they will be numbered 0 to 4.
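One caveat: pd.cut with an integer bins argument makes bins of equal width, not groups of equal size. If "by the column's sorted values" means equal-sized groups, pd.qcut may be closer to what you want. The sketch below uses made-up data with one outlier so the difference is visible; the column names `by_width` and `by_rank` are only illustrative.

```python
import pandas as pd

# Illustrative data: same as the question except the last value is an outlier
df = pd.DataFrame({"x1": [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]})

# Equal-width bins over the range 1..100: the outlier ends up alone
df["by_width"] = pd.cut(df["x1"], bins=5, labels=False) + 1

# Quantile-based bins: roughly the same number of rows in every group
df["by_rank"] = pd.qcut(df["x1"], q=5, labels=False) + 1

print(df)
```

For the evenly spaced 1 to 10 data in the question, both functions give the same grouping.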
I want to update values in one pandas data frame based on the values in another dataframe, but I want to specify which column to update by (i.e., which column should be the “key” for looking up matching rows). Right now it seems to treat the first column as the key. Is there a way to pass it a specific column name?
Example:
import pandas as pd
import numpy as np
df_a = pd.DataFrame()
df_a['x'] = range(5)
df_a['y'] = range(4, -1, -1)
df_a['z'] = np.random.rand(5)
df_b = pd.DataFrame()
df_b['x'] = range(5)
df_b['y'] = range(5)
df_b['z'] = range(5)
print('df_b:')
print(df_b.head())
print('\nold df_a:')
print(df_a.head(10))
df_a.update(df_b)
print('\nnew df_a:')
print(df_a.head())
Out:
df_b:
x y z
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
old df_a:
x y z
0 0 4 0.333648
1 1 3 0.683656
2 2 2 0.605688
3 3 1 0.816556
4 4 0 0.360798
new df_a:
x y z
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
As you can see, it replaced y and z in df_a with the respective columns in df_b based on matches of x between df_a and df_b.
What if I wanted to keep y the same? What if I want it to replace based on y and not x? Also, what if there are multiple columns on which I’d like to do the replacement? (In the real problem, I have to update a dataset from a new dataset, taking the values of a fourth column wherever two or three key columns match.)
Basically, I want to do some sort of a merge-replace action, where I specify which columns I am merging/replacing on and which column should be replaced.
Hope this makes things clearer. If this cannot be accomplished with update in pandas, I am wondering if there is another way (short of writing a separate function with for loops for it).
This is my current solution, but it seems somewhat inelegant:
df_merge = df_a.merge(df_b, on='y', how='left', suffixes=('_a', '_b'))
print(df_merge.head())
df_merge['x'] = df_merge.x_b
df_merge['z'] = df_merge.z_b
df_update = df_a.copy()
df_update.update(df_merge)
print(df_update)
Out:
x_a y z_a x_b z_b
0 0 0 0.505949 0 0
1 1 1 0.231265 1 1
2 2 2 0.241109 2 2
3 3 3 0.579765 NaN NaN
4 4 4 0.172409 NaN NaN
x y z
0 0 0 0.000000
1 1 1 1.000000
2 2 2 2.000000
3 3 3 0.579765
4 4 4 0.172409
5 5 5 0.893562
6 6 6 0.638034
7 7 7 0.940911
8 8 8 0.998453
9 9 9 0.965866
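For what it's worth, DataFrame.update aligns rows on the index labels, not on the first column. In the example, x simply happens to equal the default index. One possible approach, sketched here on the assumption that the key values are unique in df_b, is to move the key column(s) into the index before updating. The names `key`, `target` and `result`, and the fixed z values, are only for illustration.

```python
import pandas as pd

df_a = pd.DataFrame({"x": range(5), "y": range(4, -1, -1),
                     "z": [0.1, 0.2, 0.3, 0.4, 0.5]})
df_b = pd.DataFrame({"x": range(5), "y": range(5),
                     "z": [10.0, 11.0, 12.0, 13.0, 14.0]})

key = ["y"]     # column(s) to match rows on; can list several
target = ["z"]  # column(s) to overwrite

# update() aligns on the index, so make the key column(s) the index first
a = df_a.set_index(key)
a.update(df_b.set_index(key)[target])

# Restore the key as a regular column and the original column order
result = a.reset_index()[df_a.columns]
print(result)
```

Here x is left untouched and z is taken from the df_b row with the same y. Keys present only in df_b are ignored, and rows of df_a without a match keep their old values.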
I would like to create a stacked bar plot from the following dataframe:
VALUE COUNT RECL_LCC RECL_PI
0 1 15686114 3 1
1 2 27537963 1 1
2 3 23448904 1 2
3 4 1213184 1 3
4 5 14185448 3 2
5 6 13064600 3 3
6 7 27043180 2 2
7 8 11732405 2 1
8 9 14773871 2 3
There would be 2 bars in the plot. One for RECL_LCC and other for RECL_PI. There would be 3 sections in each bar corresponding to the unique values in RECL_LCC and RECL_PI i.e 1,2,3 and would sum up the COUNT for each section. So far, I have something like this:
df = df.apply(pd.to_numeric, errors='coerce')
sub_df = df.groupby(['RECL_LCC','RECL_PI'])['COUNT'].sum().unstack()
sub_df.plot(kind='bar',stacked=True)
However, I get this plot:
Any idea on how to obtain 2 columns (RECL_LCC and RECL_PI) instead of these 3?
Your problem was that the columns held strings rather than numeric dtypes, so no aggregation function would work on them. You can convert each offending column like so:
df['col'] = df['col'].astype(int)
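or convert every column at once. convert_objects was deprecated and then removed in pandas 1.0, so on current versions use pd.to_numeric instead:
df = df.apply(pd.to_numeric, errors='coerce')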
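Once the dtypes are numeric, one way to get exactly two bars is to sum COUNT separately for each classification column and put the two results side by side. This is a sketch under the assumption that the data match the table in the question; `stacked` is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    "VALUE": [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "COUNT": [15686114, 27537963, 23448904, 1213184, 14185448,
              13064600, 27043180, 11732405, 14773871],
    "RECL_LCC": [3, 1, 1, 1, 3, 3, 2, 2, 2],
    "RECL_PI": [1, 1, 2, 3, 2, 3, 2, 1, 3],
})

# One row per bar (RECL_LCC, RECL_PI), one column per section (1, 2, 3)
stacked = pd.DataFrame({
    col: df.groupby(col)["COUNT"].sum() for col in ["RECL_LCC", "RECL_PI"]
}).T
print(stacked)

# With matplotlib installed, this draws the two stacked bars:
# stacked.plot(kind="bar", stacked=True)
```

Each bar adds up to the total COUNT, split by the class values 1, 2 and 3 of the respective column.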