So I have an extremely simple dataframe:
values
1
1
1
2
2
I want to add a new column and, for each row, assign the number of times its value occurs in the column, so the table would look like:
values unique_sum
1 3
1 3
1 3
2 2
2 2
I have seen some examples in R, but for Python and pandas I have not come across anything and am stuck. I can list the value counts using .value_counts(), and I have tried groupby routines but cannot work it out.
Just use map to map your column onto its value_counts:
>>> x
A
0 1
1 1
2 1
3 2
4 2
>>> x['unique'] = x.A.map(x.A.value_counts())
>>> x
A unique
0 1 3
1 1 3
2 1 3
3 2 2
4 2 2
(I named the column A instead of values. values is not a great choice for a column name, because DataFrames have a special attribute called values, which prevents you from getting the column with x.values; you'd have to use x['values'] instead.)
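If you would rather keep the original column name values, an equivalent sketch (assuming the frame is called df) uses groupby with transform('count'), which sidesteps the attribute-access issue entirely:

df = pd.DataFrame({'values': [1, 1, 1, 2, 2]})
# count of rows within each group of equal values = number of occurrences
df['unique_sum'] = df.groupby('values')['values'].transform('count')
print(df)
   values  unique_sum
0       1           3
1       1           3
2       1           3
3       2           2
4       2           2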
I am working in Python on a pandas DataFrame and am trying to count the unique values of a column within groups. My problem is that I need the count to be computed over a steadily increasing number of rows within each group, and I also don't want NaNs to be counted.
Simplified, the data looks like this
ID occup
1 NaN
1 A
1 NaN
1 B
1 A
2 K
2 NaN
2 L
2 L
2 M
The new column 'occupcount' should count, within the groups defined by 'ID', the number of unique non-NaN values seen in 'occup' so far: in the first row of each group the count should consider only that first row, in the second row the first two rows, and so on, until in the fifth row it covers all five rows of the group. It should look like this:
ID occup occupcount
1 NaN 0
1 A 1
1 NaN 1
1 B 2
1 A 2
2 K 1
2 NaN 1
2 L 2
2 L 2
2 M 3
I tried to solve the task with something like
df['occupcount'] = (df.groupby(["ID"])['occup'].transform('nunique'))
But it only provides the total number of unique values over all rows within each group, with no gradual increase. Thanks in advance!
The idea is to build a boolean mask that marks the first occurrence of each non-missing (ID, occup) pair, by chaining duplicated on both columns with notna, and then take a GroupBy.cumsum per ID:
df['occupcount'] = ((~df.duplicated(['ID','occup']) & df['occup'].notna())
.groupby(df['ID'])
.cumsum())
print (df)
ID occup occupcount
0 1 NaN 0
1 1 A 1
2 1 NaN 1
3 1 B 2
4 1 A 2
5 2 K 1
6 2 NaN 1
7 2 L 2
8 2 L 2
9 2 M 3
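To see why this works, here is a small sketch using the df above: True marks the first time a non-missing (ID, occup) pair appears, and the per-ID cumulative sum of those True values is exactly the running count of unique values.

mask = ~df.duplicated(['ID', 'occup']) & df['occup'].notna()
print(mask.tolist())
[False, True, False, True, False, True, False, True, False, True]

# summing the mask cumulatively within each ID gives the running unique count
print(mask.groupby(df['ID']).cumsum().tolist())
[0, 1, 1, 2, 2, 1, 1, 2, 2, 3]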
A super simple question, for which I cannot find an answer.
I have a dataframe with 1000+ columns and cannot drop by column number, because I do not know the positions. I want to drop all columns between two columns, based on their names.
foo = foo.drop(columns = ['columnWhatever233':'columnWhatever826'])
does not work. I tried several other options, but do not see a simple solution. Thanks!
You can use .loc with a column range. For example, if you have this dataframe:
A B C D E
0 1 3 3 6 0
1 2 2 4 9 1
2 3 1 5 8 4
Then to delete columns B to D:
df = df.drop(columns=df.loc[:, "B":"D"].columns)
print(df)
Prints:
A E
0 1 0
1 2 1
2 3 4
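An equivalent sketch, starting again from the original df, finds the two boundary columns by integer position and drops the slice, which may read more naturally if you think in positions:

start = df.columns.get_loc("B")
stop = df.columns.get_loc("D") + 1  # +1 so that "D" itself is included
df = df.drop(columns=df.columns[start:stop])
print(df)
   A  E
0  1  0
1  2  1
2  3  4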
I would like to know whether I can get some help in "translating" a multi-dimensional list into a single column of a pandas DataFrame.
I found help here for translating a multi-dimensional list into a DataFrame with multiple columns, but I need the data in a single column.
Suppose I have the following list of lists
x=[[1,2,3],[4,5,6]]
If I create a frame I get
frame = pd.DataFrame(x)
   0  1  2
0  1  2  3
1  4  5  6
But my desired outcome is
0
1
2
3
4
5
6
with the zero as column header.
I can of course get the result with a for loop, but that seems slow. Is there a pythonic/pandas way to get it?
Thanks for helping me.
You can use np.concatenate:
import numpy as np

x = [[1, 2, 3], [4, 5, 6]]
frame = pd.DataFrame(np.concatenate(x))
print(frame)
Output:
0
0 1
1 2
2 3
3 4
4 5
5 6
First it is necessary to flatten the nested lists and pass the result to the DataFrame constructor:
df = pd.DataFrame([z for y in x for z in y])
Or:
from itertools import chain
df = pd.DataFrame(list(chain.from_iterable(x)))
print (df)
0
0 1
1 2
2 3
3 4
4 5
5 6
If you use numpy, you can use the ravel() method:
pd.DataFrame(np.array(x).ravel())
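Note that np.array(x).ravel() assumes the sublists all have the same length; for ragged sublists, a sketch based on Series.explode also works:

x = [[1, 2, 3], [4, 5, 6, 7]]  # sublists of different lengths
frame = pd.Series(x).explode().reset_index(drop=True).to_frame()
print(frame)
   0
0  1
1  2
2  3
3  4
4  5
5  6
6  7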
I have a pandas dataframe like this:
a b c
0 1 1 1
1 1 1 0
2 2 4 1
3 3 5 0
4 3 5 0
where the first 2 columns ('a' and 'b') are IDs while the last one ('c') is a validation (0 = neg, 1 = pos). I know how to remove duplicates based on the values of the first 2 columns; however, in this case I would also like to get rid of inconsistent data, i.e. duplicated data validated both as positive and negative. So, for example, the first 2 rows are duplicated but inconsistent, hence I should remove the entire record, while the last 2 rows are both duplicated and consistent, so I'd keep one of the records. The expected result should be:
a b c
0 2 4 1
1 3 5 0
The real dataframe can have more than two duplicates per group and, as you can see, the index has also been reset. Thanks.
First, filter the rows with GroupBy.transform and SeriesGroupBy.nunique to keep only the groups that have a single unique value of 'c' (boolean indexing), then apply DataFrame.drop_duplicates:
df = (df[df.groupby(['a','b'])['c'].transform('nunique').eq(1)]
.drop_duplicates(['a','b'])
.reset_index(drop=True))
print (df)
a b c
0 2 4 1
1 3 5 0
Detail:
print (df.groupby(['a','b'])['c'].transform('nunique'))
0 2
1 2
2 1
3 1
4 1
Name: c, dtype: int64
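For comparison, a sketch with GroupBy.filter reads a little more literally, although the transform approach is usually faster on large frames because filter calls a Python function once per group:

out = (df.groupby(['a', 'b'])
         .filter(lambda g: g['c'].nunique() == 1)  # keep only consistent groups
         .drop_duplicates(['a', 'b'])
         .reset_index(drop=True))
print(out)
   a  b  c
0  2  4  1
1  3  5  0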
I have a dataframe on which I'm using pandas.groupby on a specific column and then running aggregate statistics (mean, median, count) on the produced groups. I want to treat certain column values as members of the same group produced by the groupby, rather than having a distinct group per distinct value in the grouping column. I was looking at how I would accomplish such a thing.
For example:
>> my_df
ID SUB_NUM ELAPSED_TIME
1 1 1.7
2 2 1.4
3 2 2.1
4 4 3.0
5 6 1.8
6 6 1.2
So instead of the typical behavior:
>> my_df.groupby(['SUB_NUM']).agg(['count'])
ID SUB_NUM Count
1 1 1
2 2 2
4 4 1
5 6 2
I want certain values (SUB_NUM in [1, 2]) to be computed as one group so instead something like below is produced:
>> # Some mystery pandas function calls
ID SUB_NUM Count
1 1, 2 3
4 4 1
5 6 2
Any help would be much appreciated, thanks!
This works for me:
#for join values convert values to string
df['SUB_NUM'] = df['SUB_NUM'].astype(str)
#create mapping dict by dict comprehension
L = ['1','2']
d = {x: ','.join(L) for x in L}
print (d)
{'2': '1,2', '1': '1,2'}
#replace values by dict
a = df['SUB_NUM'].replace(d)
print (a)
0 1,2
1 1,2
2 1,2
3 4
4 6
5 6
Name: SUB_NUM, dtype: object
#groupby by mapping column and aggregating `first` and `size`
print (df.groupby(a)
.agg({'ID':'first', 'ELAPSED_TIME':'size'})
.rename(columns={'ELAPSED_TIME':'Count'})
.reset_index())
SUB_NUM ID Count
0 1,2 1 3
1 4 4 1
2 6 5 2
See also: What is the difference between size and count in pandas?
You can create another column mapping the SUB_NUM values to actual groups and then group by it.
my_df['SUB_GROUP'] = my_df['SUB_NUM'].apply(lambda x: 1 if x < 3 else x)
my_df.groupby(['SUB_GROUP']).agg(['count'])
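A minimal end-to-end sketch of that idea (assuming pandas >= 0.25 for named aggregation, and labeling the merged group '1, 2' as in the question):

my_df = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6],
                      'SUB_NUM': [1, 2, 2, 4, 6, 6],
                      'ELAPSED_TIME': [1.7, 1.4, 2.1, 3.0, 1.8, 1.2]})

# map SUB_NUM values 1 and 2 to a single label, keep the rest as strings
my_df['SUB_GROUP'] = my_df['SUB_NUM'].apply(lambda x: '1, 2' if x in (1, 2) else str(x))

out = (my_df.groupby('SUB_GROUP')
            .agg(ID=('ID', 'first'), Count=('ID', 'size'))
            .reset_index())
print(out)
  SUB_GROUP  ID  Count
0      1, 2   1      3
1         4   4      1
2         6   5      2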