Rolling up a data frame along with a count of rows in Python

I am still in the learning phase with Python and want to know how to roll up the data and record the number of duplicate rows in a column called Count.
The data frame structure is as follows:
Col1 | Value
A    | 1
B    | 1
A    | 1
B    | 1
C    | 3
C    | 3
C    | 3
C    | 3
My result should be as follows:
Col1 | Value | Count
A    | 1     | 2
B    | 1     | 2
C    | 3     | 4

>>> df2 = df.groupby(['Col1', 'Value']).size().reset_index()
>>> df2.columns = ['Col1', 'Value', 'Count']
>>> df2
  Col1  Value  Count
0    A      1      2
1    B      1      2
2    C      3      4

Roman Pekar's fine answer is correct for this case. However, I saw it after trying to write a solution for the general case stated in the text of your question, not just the example with specific column names. So, for the general case, consider:
df.groupby([df[c] for c in df.columns]).size().reset_index().rename(columns={0: 'Count'})
For example:
import pandas as pd
df = pd.DataFrame({'Col1': ['a', 'a', 'a', 'b', 'c'], 'Value': [1, 2, 1, 3, 2]})
>>> df.groupby([df[c] for c in df.columns]).size().reset_index().rename(columns={0: 'Count'})
  Col1  Value  Count
0    a      1      2
1    a      2      1
2    b      3      1
3    c      2      1
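If you are on pandas 1.1 or newer, where DataFrame.value_counts exists, the same general rollup is a one-liner; note that it sorts by count, descending, rather than by the key columns:
df.value_counts().reset_index(name='Count')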

You can also try:
df.groupby('Col1')['Value'].value_counts().reset_index(name='Count')

How to transform a list of lists of strings to a frequency DataFrame?

I have a list of lists of strings (essentially a corpus) and I'd like to convert it to a matrix where each row is a document in the corpus and the columns are the corpus's vocabulary.
I can do this with CountVectorizer, but it would require quite a lot of memory, since I would need to join each list into a string that CountVectorizer would then tokenize.
I think it's possible to do this with Pandas only, but I'm not sure how.
Example:
corpus = [['a', 'b', 'c'],['a', 'a'],['b', 'c', 'c']]
expected result:
| a | b | c |
|---|---|---|
| 1 | 1 | 1 |
| 2 | 0 | 0 |
| 0 | 1 | 2 |
I would combine collections.Counter and the DataFrame constructor:
import pandas as pd
from collections import Counter

corpus = [['a', 'b', 'c'], ['a', 'a'], ['b', 'c', 'c']]
df = pd.DataFrame(map(Counter, corpus)).fillna(0, downcast='infer')
Output:
   a  b  c
0  1  1  1
1  2  0  0
2  0  1  2
Using only Pandas:
import pandas as pd
corpus = pd.DataFrame(corpus).T
corpus_freq = corpus.apply(pd.Series.value_counts).T
corpus_freq = corpus_freq.fillna(0)
End result:
     a    b    c
0  1.0  1.0  1.0
1  2.0  0.0  0.0
2  0.0  1.0  2.0
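The floats appear because value_counts leaves NaN where a token is absent; if integer counts are wanted, casting after fillna cleans this up (a minor tweak of the code above):
corpus_freq = corpus.apply(pd.Series.value_counts).T.fillna(0).astype(int)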
You can also do this:
from collections import Counter

rows = []
for lst in corpus:
    # a plain dict of per-document counts; no need to cap at most_common(3),
    # which only works here because each document has at most 3 distinct tokens
    rows.append(dict(Counter(lst)))
pd.DataFrame(rows).fillna(0, downcast='infer')
which gives:
   a  b  c
0  1  1  1
1  2  0  0
2  0  1  2
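Another pandas-only sketch, assuming pandas 0.25+ for Series.explode: flatten to one row per (document, token) pair and cross-tabulate:
import pandas as pd

corpus = [['a', 'b', 'c'], ['a', 'a'], ['b', 'c', 'c']]

# the exploded Series keeps the document number as its index
tokens = pd.Series(corpus).explode()
freq = pd.crosstab(tokens.index, tokens, rownames=['doc'], colnames=['token'])
print(freq)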

Compare two columns in pandas

I have a dataframe like this
| A       | B | C |
|---------|---|---|
| ['1']   | 1 | 1 |
| ['1,2'] | 2 |   |
| ['2']   | 3 | 0 |
| ['1,3'] | 2 |   |
If the value of B appears inside the quoted list in A, then C is 1; if it is not present in A, C is 0. The expected output is:
| A       | B | C |
|---------|---|---|
| ['1']   | 1 | 1 |
| ['1,2'] | 2 | 1 |
| ['2']   | 3 | 0 |
| ['1,3'] | 2 | 0 |
How do I write this in Python so that it produces this kind of data frame across all rows?
If values in A are strings use:
print (df.A.tolist())
["['1']", "['1,2']", "['2']", "['1,3']"]
df['C'] = [int(str(b) in a.strip("[]'").split(',')) for a, b in zip(df.A, df.B)]
print (df)
         A  B  C
0    ['1']  1  1
1  ['1,2']  2  1
2    ['2']  3  0
3  ['1,3']  2  0
Or, if the values are one-element lists, use:
print (df.A.tolist())
[['1'], ['1,2'], ['2'], ['1,3']]
df['C'] = [int(str(b) in a[0].split(',')) for a, b in zip(df.A, df.B)]
print (df)
       A  B  C
0    [1]  1  1
1  [1,2]  2  1
2    [2]  3  0
3  [1,3]  2  0
My code:
df = pd.read_clipboard()
df
'''
         A  B
0    ['1']  1
1  ['1,2']  2
2    ['2']  3
3  ['1,3']  2
'''
(
    df.assign(A=df.A.str.replace("'", '').map(eval))
      .assign(C=lambda d: d.apply(lambda s: s.B in s.A, axis=1))
      .assign(C=lambda d: d.C.astype(int))
)
'''
        A  B  C
0     [1]  1  1
1  [1, 2]  2  1
2     [2]  3  0
3  [1, 3]  2  0
'''
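Since calling eval on data is risky, a safer variant of the same idea (a sketch assuming the A values are the quoted strings shown above) uses ast.literal_eval plus an explicit split:
from ast import literal_eval

# literal_eval turns "['1,2']" into the one-element list ['1,2'];
# splitting that element on ',' yields the individual tokens
tokens = df.A.map(literal_eval).map(lambda lst: lst[0].split(','))
df['C'] = [int(str(b) in t) for t, b in zip(tokens, df.B)]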
import numpy as np

# note: df['B'].astype(str).isin(df.A) tests exact equality against the
# values of A, so "1" would never match "['1']"; test membership row by row
df['C'] = np.where([str(b) in a.strip("[]'").split(',')
                    for a, b in zip(df.A, df.B)], 1, 0)
Basically, you need to transform column B to string, since column A holds strings, and then look for column B inside column A row by row. The result will be as you defined.

How to get the aggregate result only on the first occurrence, and 0 for the others, in a computation with groupby and transform?

I would like to group and sum a dataframe without changing the number of rows, applying the aggregated result to the first occurrence of each group only.
Initial DF:
C1 | Val
a | 1
a | 1
b | 1
c | 1
c | 1
Wanted DF:
C1 | Val
a | 2
a | 0
b | 1
c | 2
c | 0
I tried to apply the following code:
df.groupby(['C1'])['Val'].transform('sum')
which propagates the aggregated results to all rows. However, transform does not seem to have an argument that applies the result to the first or last occurrence only.
Indeed, what I currently get is:
C1 | Val
a | 2
a | 2
b | 1
c | 2
c | 2
Using pandas.DataFrame.groupby:
s = df.groupby('C1')['Val']
v = s.sum().values                    # one aggregated sum per group
df.loc[:, 'Val'] = 0                  # zero out every row first
df.loc[s.head(1).index, 'Val'] = v    # write the sums to each group's first row
print(df)
Output:
  C1  Val
0  a    2
1  a    0
2  b    1
3  c    2
4  c    0
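A compact alternative, sketched with Series.where and Series.duplicated: keep the transformed sum on each group's first row and zero out the rest:
# the first occurrence of each C1 value keeps the group sum, the rest get 0
df['Val'] = (df.groupby('C1')['Val'].transform('sum')
               .where(~df['C1'].duplicated(), 0))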

Creating new DataFrame from the cartesian product of 2 lists

What I want to achieve is the following in Pandas:
a = [1,2,3,4]
b = ['a', 'b']
Can I create a DataFrame like:
column1  column2
'a'      1
'a'      2
'a'      3
'a'      4
'b'      1
'b'      2
'b'      3
'b'      4
Use itertools.product with the DataFrame constructor:
from itertools import product
import pandas as pd

a = [1, 2, 3, 4]
b = ['a', 'b']

# pandas 0.24.0+ accepts the iterator directly
df = pd.DataFrame(product(b, a), columns=['column1', 'column2'])
# earlier pandas needs the list materialized first:
# df = pd.DataFrame(list(product(b, a)), columns=['column1', 'column2'])
print (df)
  column1  column2
0       a        1
1       a        2
2       a        3
3       a        4
4       b        1
5       b        2
6       b        3
7       b        4
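If you are on pandas 1.2+, where merge gained how='cross', the product can also be built without itertools (named Series are accepted as single-column inputs to merge):
import pandas as pd

a = [1, 2, 3, 4]
b = ['a', 'b']

df = pd.merge(pd.Series(b, name='column1'),
              pd.Series(a, name='column2'),
              how='cross')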
I will put another method here, just in case someone prefers it. Full mock-up below:
import pandas as pd

a = [1, 2, 3, 4]
b = ['a', 'b']
df = pd.DataFrame([(y, x) for x in a for y in b], columns=['column1', 'column2'])
df
Result below:
  column1  column2
0       a        1
1       b        1
2       a        2
3       b        2
4       a        3
5       b        3
6       a        4
7       b        4
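One more sketch, assuming pandas 0.24+ for MultiIndex.to_frame(index=False): build the product as a MultiIndex and flatten it, which gives the same ordering as the itertools.product answer:
import pandas as pd

a = [1, 2, 3, 4]
b = ['a', 'b']

idx = pd.MultiIndex.from_product([b, a], names=['column1', 'column2'])
df = idx.to_frame(index=False)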

In Pandas, how to create a variable that is n for the nth observation within a group?

Consider this:
df = pd.DataFrame({'B': ['a', 'a', 'b', 'b'], 'C': [1, 2, 6, 2]})
df
Out[128]:
   B  C
0  a  1
1  a  2
2  b  6
3  b  2
I want to create a variable that simply corresponds to the ordering of observations after sorting by 'C' within each groupby('B') group.
df.sort_values(['B', 'C'])
Out[129]:
   B  C  order
0  a  1      1
1  a  2      2
3  b  2      1
2  b  6      2
How can I do that? I am thinking about creating a column of ones and using cumsum, but that seems too clunky...
I think you can use range with len(df):
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': ['a', 'a', 'b'],
                   'C': [5, 3, 2]})
print(df)
   A  B  C
0  1  a  5
1  2  a  3
2  3  b  2

df.sort_values(by='C', inplace=True)
# or without inplace:
# df = df.sort_values(by='C')
print(df)
   A  B  C
2  3  b  2
1  2  a  3
0  1  a  5

df['order'] = range(1, len(df) + 1)
print(df)
   A  B  C  order
2  3  b  2      1
1  2  a  3      2
0  1  a  5      3
EDIT by comment:
I think you can use groupby with cumcount:
import pandas as pd

df = pd.DataFrame({'B': ['a', 'a', 'b', 'b'], 'C': [1, 2, 6, 2]})
df.sort_values(['B', 'C'], inplace=True)
# or without inplace:
# df = df.sort_values(['B', 'C'])
print(df)
   B  C
0  a  1
1  a  2
3  b  2
2  b  6

df['order'] = df.groupby('B', sort=False).cumcount() + 1
print(df)
   B  C  order
0  a  1      1
1  a  2      2
3  b  2      1
2  b  6      2
Nothing wrong with Jezrael's answer, but there's a simpler (though less general) method for this particular example. Just add groupby to JohnGalt's suggestion of using rank.
>>> df['order'] = df.groupby('B')['C'].rank()
   B  C  order
0  a  1    1.0
1  a  2    2.0
2  b  6    2.0
3  b  2    1.0
In this case you don't really need the ['C'], but it makes the ranking a little more explicit, and you would need it if the dataframe had other, unrelated columns. If you are ranking by more than one column, though, you should use Jezrael's method.
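If you want integer positions and deterministic tie handling, rank's method='first' breaks ties by order of appearance, and a cast drops the floats, matching the cumcount output:
df['order'] = df.groupby('B')['C'].rank(method='first').astype(int)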
