Say I have a series:
A a 1
b 1
B c 5
d 8
e 5
where the first two columns together form the hierarchical index. I want to find how many unique values there are for each value of index level 0; e.g., the output here should be A 1; B 2. How can this be done easily? Thanks!
Use groupby on level 0 and then call .nunique on the column:
>>> df
val
A a 1
b 1
B c 5
d 8
e 5
>>> df.groupby(level=0)['val'].nunique()
A 1
B 2
Name: val, dtype: int64
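For reference, the example frame can be rebuilt like this (a minimal sketch; the exact construction of the original data is assumed):
>>> import pandas as pd
>>> df = pd.DataFrame(
...     {'val': [1, 1, 5, 8, 5]},
...     index=pd.MultiIndex.from_tuples(
...         [('A', 'a'), ('A', 'b'), ('B', 'c'), ('B', 'd'), ('B', 'e')]))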
I have a DataFrame which looks like this:
df:
A B
1 a
1 a
1 b
2 c
3 d
Now, using this DataFrame, I want to get the following new_df:
new_df:
item val_not_present
1    c    # 1 doesn't have values c and d (values not part of group 1)
1    d
2    a    # 2 doesn't have values a, b, and d (values not part of group 2)
2    b
2    d
3    a    # 3 doesn't have values a, b, and c (values not part of group 3)
3    b
3    c
or an individual DataFrame for each item, like:
df1:
item val_not_present
1 c
1 d
df2:
item val_not_present
2 a
2 b
2 d
df3:
item val_not_present
3 a
3 b
3 c
For each group, I want to get all the values that are not part of that group.
You can use np.setdiff1d and explode:
import numpy as np

values_b = df.B.unique()
pd.DataFrame(df.groupby("A")["B"].unique()
               .apply(lambda x: np.setdiff1d(values_b, x))
               .rename("val_not_present").explode())
Output:
val_not_present
A
1 c
1 d
2 a
2 b
2 d
3 a
3 b
3 c
Another approach is to use crosstab/pivot_table to get counts, then filter where the count is 0 and convert back to a DataFrame:
m = pd.crosstab(df['A'], df['B'])
pd.DataFrame(m.where(m.eq(0)).stack().index.tolist(), columns=['A', 'val_not_present'])
A val_not_present
0 1 c
1 1 d
2 2 a
3 2 b
4 2 d
5 3 a
6 3 b
7 3 c
You could convert B to a categorical datatype and then compute the value counts. Categorical variables show categories with frequency counts of zero, so you could do something like this:
df['B'] = df['B'].astype('category')
new_df = (
    df.groupby('A')
      .apply(lambda x: x['B'].value_counts())
      .reset_index()
      .query('B == 0')
      .drop(labels='B', axis=1)
      .rename(columns={'level_1': 'val_not_present',
                       'A': 'item'})
)
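For completeness, a sketch of one more route: build the full cross product of items and values with MultiIndex.from_product, then keep only the pairs that never occur, via a left merge with indicator=True (the intermediate names full and pairs are illustrative):
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 3], 'B': ['a', 'a', 'b', 'c', 'd']})
# Every possible (item, value) combination.
full = (pd.MultiIndex
          .from_product([df['A'].unique(), df['B'].unique()],
                        names=['item', 'val_not_present'])
          .to_frame(index=False))
# The combinations that actually occur in the data.
pairs = df.drop_duplicates().rename(columns={'A': 'item', 'B': 'val_not_present'})
# Rows present in full but not in pairs are exactly the missing pairs.
new_df = (full.merge(pairs, how='left', indicator=True)
              .query("_merge == 'left_only'")
              .drop(columns='_merge'))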
Simple DataFrame:
df = pd.DataFrame({'A': [1,1,2,2], 'B': [0,1,2,3], 'C': ['a','b','c','d']})
df
A B C
0 1 0 a
1 1 1 b
2 2 2 c
3 2 3 d
For every value of column A (via groupby), I wish to get the value of column C for which column B is maximal. For example, for group 1 of column A, the maximum of column B is 1, so I want the value "b" of column C:
A C
0 1 b
1 2 d
There is no need to assume column B is sorted. Performance is the top priority, then elegance.
Use sort_values + drop_duplicates: sort by B so the row with the group maximum comes last, then keep only the last row for each A:
df.sort_values('B').drop_duplicates(['A'], keep='last')
Out[127]:
A B C
1 1 1 b
3 2 3 d
df.groupby('A').apply(lambda x: x.loc[x['B'].idxmax(), 'C'])
# A
# 1    b
# 2    d
Use idxmax to find the index where B is maximal, then select column C within that group (using a lambda function).
Here's a little fun with groupby and nlargest:
(df.set_index('C')
.groupby('A')['B']
.nlargest(1)
.index
.to_frame()
.reset_index(drop=True))
A C
0 1 b
1 2 d
Or, sort_values, groupby, and last:
df.sort_values('B').groupby('A')['C'].last().reset_index()
A C
0 1 b
1 2 d
A similar solution to @Jondiedoop's, but avoiding the apply:
u = df.groupby('A')['B'].idxmax()
df.loc[u, ['A', 'C']].reset_index(drop=True)
A C
0 1 b
1 2 d
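If ties on the maximum matter, here is a sketch of one more apply-free variant: compare B against the per-group maximum with transform; with tied maxima it keeps all of the tied rows:
df.loc[df['B'] == df.groupby('A')['B'].transform('max'), ['A', 'C']].reset_index(drop=True)
   A  C
0  1  b
1  2  d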
I have this dataframe:
dfx = pd.DataFrame([[1,2],['A','B'],[['C','D'],'E']],columns=list('AB'))
A B
0 1 2
1 A B
2 [C, D] E
... that I want to transform into ...
A B
0 1 2
1 A B
2 C E
3 D E
... adding a row for each value contained in column A if it is a list.
What is the most Pythonic way to do this?
And vice versa: what if I want to group by a column (say B) and have in column A a list of the grouped values (the opposite of the example above)?
Thanks in advance,
Gianluca
You have a mixed DataFrame: int, str, and list values (very problematic, because many functions raise errors). So first convert all numeric values to str using where; the mask comes from to_numeric with errors='coerce', which converts non-numeric values to NaN:
dfx.A = dfx.A.where(pd.to_numeric(dfx.A, errors='coerce').isnull(), dfx.A.astype(str))
print (dfx)
A B
0 1 2
1 A B
2 [C, D] E
and then create a new DataFrame with np.repeat, flattening the list values with chain.from_iterable:
import numpy as np
from itertools import chain

df = pd.DataFrame({
    "A": list(chain.from_iterable(dfx.A)),
    "B": np.repeat(dfx.B.values, dfx.A.str.len())})
print (df)
A B
0 1 2
1 A B
2 C E
3 D E
A pure pandas solution: convert column A to a list and then create a new DataFrame with DataFrame.from_records. Then drop the original column A and join the stacked df:
df = pd.DataFrame.from_records(dfx.A.values.tolist(), index=dfx.index)
df = (dfx.drop('A', axis=1)
         .join(df.stack().rename('A')
                 .reset_index(level=1, drop=True))[['A', 'B']])
print (df)
A B
0 1 2
1 A B
2 C E
2 D E
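On a side note, from pandas 0.25 onwards DataFrame.explode does this forward transformation directly; scalar cells are kept unchanged (a minimal sketch):
print (dfx.explode('A').reset_index(drop=True))
   A  B
0  1  2
1  A  B
2  C  E
3  D  E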
For the opposite direction, if you need lists, use groupby and apply tolist:
print (df.groupby('B')['A'].apply(lambda x: x.tolist()).reset_index())
B A
0 2 [1]
1 B [A]
2 E [C, D]
but if you need a list only when there is more than one value, an if..else is necessary:
print (df.groupby('B')['A']
         .apply(lambda x: x.tolist() if len(x) > 1 else x.values[0])
         .reset_index())
B A
0 2 1
1 B A
2 E [C, D]
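In more recent pandas, a common spelling for this reverse direction is .agg(list), which always produces lists (a sketch, equivalent to the first variant above):
print (df.groupby('B')['A'].agg(list).reset_index())
   B       A
0  2     [1]
1  B     [A]
2  E  [C, D]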
Let me simplify my problem for easy explanation.
I have a pandas DataFrame in the format below:
a b c
0 1 3 2
1 3 1 2
2 3 2 1
The numbers in each row represent the ranks of the columns.
For example, the order of the first row is {a, c, b}.
How can I convert the above to the below?
1 2 3
0 a c b
1 c a b
2 c b a
I googled all day long but couldn't find any solution.
Looks like you are just mapping one value to another and renaming the columns, e.g.:
>>> df = pd.DataFrame({'a':[1,3,3], 'b':[3,1,2], 'c':[2,2,1]})
>>> df = df.applymap(lambda x: df.columns[x-1])
>>> df.columns = [1,2,3]
>>> df
1 2 3
0 a c b
1 c a b
2 c b a
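As a side note, in pandas 2.1+ applymap is deprecated in favour of DataFrame.map, so on recent versions the same idea would be spelled like this (an equivalent sketch; capture the original columns first, since they get renamed afterwards):
>>> cols = df.columns
>>> df = df.map(lambda x: cols[x - 1])
>>> df.columns = [1, 2, 3]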
I have a series of the form:
s = Series([['a','a','b'],['b','b','c','d'],[],['a','b','e']])
which looks like
0 [a, a, b]
1 [b, b, c, d]
2 []
3 [a, b, e]
dtype: object
I would like to count how many elements I have in total.
My naive attempts, like
s.values.hist()
or
s.values.flatten()
didn't work.
What am I doing wrong?
If we stick with the pandas Series as in the original question, one neat option from pandas version 0.25.0 onwards is the Series.explode() routine. It explodes the lists into rows, duplicating the index for these rows.
The original Series from the question:
s = pd.Series([['a','a','b'],['b','b','c','d'],[],['a','b','e']])
Let's explode it: we get a Series where the index is repeated, each index value indicating which original list the element came from.
>>> s.explode()
Out:
0 a
0 a
0 b
1 b
1 b
1 c
1 d
2 NaN
3 a
3 b
3 e
dtype: object
>>> type(s.explode())
Out:
pandas.core.series.Series
To count the occurrences of each element, we can now use Series.value_counts():
>>> s.explode().value_counts()
Out:
b 4
a 3
d 1
c 1
e 1
dtype: int64
To also include NaN values:
>>> s.explode().value_counts(dropna=False)
Out:
b 4
a 3
d 1
c 1
e 1
NaN 1
dtype: int64
Finally, plotting the histogram using Series.plot():
>>> s.explode().value_counts(dropna=False).plot(kind = 'bar')
s.map(len).sum()
does the trick: s.map(len) applies len() to each element and returns a Series of all the lengths; then you can simply call sum() on that Series.
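A roughly equivalent vectorized spelling uses the .str accessor, whose len() also works on list elements (a sketch; it assumes every cell really is a list):
s.str.len().sum()  # 3 + 4 + 0 + 3 == 10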
import itertools

# Flattening with itertools.chain also gives the set of unique elements:
word_lists = [['apple', 'orange'], ['red', 'yellow']]
vocab = list(set(itertools.chain.from_iterable(word_lists)))
Personally, I love having the arrays in a DataFrame, with a single column for every item. It gives you much more functionality. So here's my alternative approach:
>>> raw = [['a', 'a', 'b'], ['b', 'b', 'c', 'd'], [], ['a', 'b', 'e']]
>>> df = pd.DataFrame(raw)
>>> df
Out[217]:
0 1 2 3
0 a a b None
1 b b c d
2 None None None None
3 a b e None
Now, see how many values we have in each row:
>>> df.count(axis=1)
Out[226]:
0 3
1 4
2 0
3 3
Applying sum() here would give you what you wanted.
Second, what you mentioned in a comment: getting the distribution. There may be a cleaner approach here, but I still prefer the following over the hint given in the comment:
>>> foo = [col.value_counts() for x, col in df.items()]
>>> foo
Out[246]:
[a 2
b 1
dtype: int64, b 2
a 1
dtype: int64, b 1
c 1
e 1
dtype: int64, d 1
dtype: int64]
foo contains distribution for every column now. The interpretation of columns is still "xth value", such that column 0 contains the distribution of all the "first values" in your arrays.
Next step: sum them up.
>>> df2 = pd.DataFrame(foo)
>>> df2
Out[266]:
a b c d e
0 2 1 NaN NaN NaN
1 1 2 NaN NaN NaN
2 NaN 1 1 NaN 1
3 NaN NaN NaN 1 NaN
>>> df2.sum(axis=0)
Out[264]:
a 3
b 4
c 1
d 1
e 1
dtype: float64
Note that for these very simple problems the difference between a Series of lists and a DataFrame with a column per item is not big, but once you do real data work, the latter gives you far more functionality. Moreover, it can potentially be more efficient, since you can use pandas' internal methods.