pandas: adding rows in a column - python

I have a dataframe like this:
Count
1
0
1
1
1
I want to add row N and row N+1 in the Count column and keep a running total; is it possible to do this in a pandas way?
The result should look like this (technically it is a cumulative sum):
Counts
1
1
2
3
4

You can use the cumulative sum function, cumsum():

import pandas as pd

df = pd.DataFrame([1, 0, 1, 1, 1], columns=['Count'])
df['Counts'] = df['Count'].cumsum()
print(df)
giving you the desired output.
   Count  Counts
0      1       1
1      0       1
2      1       2
3      1       3
4      1       4
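
For intuition, cumsum() keeps a running total while walking down the column. A minimal pure-Python sketch of the same logic (the pandas version above is the idiomatic way):

values = [1, 0, 1, 1, 1]
counts = []
total = 0
for v in values:
    total += v           # add the current row to the running total
    counts.append(total)
print(counts)  # [1, 1, 2, 3, 4]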

Related

How to create new dataframe rows based on df value

I have a dataframe which is something like this:
index  buyedA  total
a           2      4
b           1      2
and I need to turn it into something like this:
index  buyedA  total
a           1      1
a           1      1
a           0      1
a           0      1
b           1      1
b           0      1
I need, for each index, as many rows as specified by the total column (each with total set to 1), and if buyedA says 2, I need 2 of those rows to have buyedA set to 1.
Is there a way to do so in Python?
Thanks!
Using repeat and a simple groupby:
n = df.loc[df.index.repeat(df.total)].assign(total=1)
n['buyedA'] = n.groupby('index').total.cumsum().le(n.buyedA).astype(int)
  index  buyedA  total
0     a       1      1
0     a       1      1
0     a       0      1
0     a       0      1
1     b       1      1
1     b       0      1
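
The repeated rows keep their original index labels (0, 0, 0, 0, 1, 1 above). If you want a fresh 0..n-1 index, an optional cleanup step (not part of the original answer):

n = n.reset_index(drop=True)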
Let's try this:
# make sure 'index' is in the dataframe index
df = df.set_index('index')
# use repeat and reindex
df_out = df.reindex(df.index.repeat(df['total'])).assign(total=1)
# limit buyedA by row number within each group of the index
df_out['buyedA'] = ((df_out.groupby('index').cumcount() + 1) <= df_out['buyedA']).mul(1)
df_out
Output:

       buyedA  total
index
a           1      1
a           1      1
a           0      1
a           0      1
b           1      1
b           0      1
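
If you need index back as an ordinary column afterwards, an optional follow-up step:

df_out = df_out.reset_index()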

How to count the elements in a column and take the result as a new column?

The DataFrame named df is shown as follows.
import pandas as pd
df = pd.DataFrame({'id': [1, 1, 3]})
Input:

   id
0   1
1   1
2   3
I want to count the number of each id, and take the result as a new column count.
Expected:

   id  count
0   1      2
1   1      2
2   3      1
pd.factorize and np.bincount
My favorite. factorize does not sort and has O(n) time complexity; for big data sets, factorize should be preferred over np.unique, which sorts first.

import numpy as np

i, u = df.id.factorize()
df.assign(Count=np.bincount(i)[i])
   id  Count
0   1      2
1   1      2
2   3      1
np.unique and np.bincount
u, i = np.unique(df.id, return_inverse=True)
df.assign(Count=np.bincount(i)[i])
   id  Count
0   1      2
1   1      2
2   3      1
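
To check the performance claim, a rough benchmark sketch; the data here (one million rows, 1000 distinct ids) is made up for illustration:

import timeit

import numpy as np
import pandas as pd

s = pd.Series(np.random.randint(0, 1000, size=1_000_000))

print(timeit.timeit(lambda: s.factorize(), number=10))                      # no sort
print(timeit.timeit(lambda: np.unique(s, return_inverse=True), number=10))  # sorts first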
Assign the new count column to the dataframe by grouping on id and then transforming that column with value_counts (or size). This works because all values within a group are identical, so value_counts yields a single count that is broadcast across the group:

>>> df.assign(count=df.groupby('id')['id'].transform('value_counts'))
   id  count
0   1      2
1   1      2
2   3      1
Use Series.map with Series.value_counts:
df['count'] = df['id'].map(df['id'].value_counts())
# alternative:
# from collections import Counter
# df['count'] = df['id'].map(Counter(df['id']))
Detail:

print(df['id'].value_counts())
1    2
3    1
Name: id, dtype: int64
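
The map step then looks up each id in that table:

print(df['id'].map(df['id'].value_counts()))
0    2
1    2
2    1
Name: id, dtype: int64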
Or use GroupBy.transform with 'size' to return a Series with the same length as the original DataFrame:
df['count'] = df.groupby('id')['id'].transform('size')
print(df)

   id  count
0   1      2
1   1      2
2   3      1

Finding the count of letters in each column

I need to find the count of letters in each column as follows:
string (a pandas Series):
ATCG
TGCA
AAGC
GCAT
I need to write a program to get the following:
   0  1  2  3
A  2  1  1  1
T  1  1  0  1
C  0  1  2  1
G  1  1  1  1
I have written the following code, but I am getting an extra row at index 0 and an extra column at the end (column index 450, i.e. the 451st column) filled with NaN values. I should not be getting either the row or the extra column; I need only 450 columns.
f = zip(*string)
counts = [{letter: column.count(letter) for letter in column} for column in f]
counts = pd.DataFrame(counts).transpose()
print(counts)
counts = counts.drop(counts.columns[[450]], axis=1)
Can anyone please help me understand the issue?
Here is one way you can implement your logic. If required, you can turn your series into a list via lst = s.tolist().
import pandas as pd

lst = ['ATCG', 'TGCA', 'AAGC', 'GCAT']
arr = [[col.count(x) for col in zip(*lst)] for x in 'ATCG']
res = pd.DataFrame(arr, index=list('ATCG'))
Result:

   0  1  2  3
A  2  1  1  1
T  1  1  0  1
C  0  1  2  1
G  1  1  1  1
Explanation
In the list comprehension, the inner loop handles the columns, iterating the first, second, third and fourth characters of each string in turn via zip(*lst). The outer loop handles the rows by iterating through 'ATCG'. This produces a list of lists which can be fed directly into pd.DataFrame.
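
To see what the zip produces, a small illustrative check:

lst = ['ATCG', 'TGCA', 'AAGC', 'GCAT']

# each tuple is one column of characters
print(list(zip(*lst)))
# [('A', 'T', 'A', 'G'), ('T', 'G', 'A', 'C'), ('C', 'C', 'G', 'A'), ('G', 'A', 'C', 'T')]

# counting 'A' in each column gives the first row of the result
print([col.count('A') for col in zip(*lst)])
# [2, 1, 1, 1]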
With Series.value_counts():
>>> s = pd.Series(['ATCG', 'TGCA', 'AAGC', 'GCAT'])
>>> s.str.join('|').str.split('|', expand=True)\
... .apply(lambda row: row.value_counts(), axis=0)\
... .fillna(0.)\
... .astype(int)
   0  1  2  3
A  2  1  1  1
C  0  1  2  1
G  1  1  1  1
T  1  1  0  1
I'm not sure how logically you want to order the index, but you could call .reindex() or .sort_index() on this result.
The first line, s.str.join('|').str.split('|', expand=True), gets you an "expanded" version:

   0  1  2  3
0  A  T  C  G
1  T  G  C  A
2  A  A  G  C
3  G  C  A  T
which should be faster than calling pd.Series(list(x)) ... on each row.
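
For comparison, the slower row-wise alternative mentioned above would look roughly like this (a sketch, not a recommendation):

import pandas as pd

s = pd.Series(['ATCG', 'TGCA', 'AAGC', 'GCAT'])

# builds a new Series for every row, hence the extra overhead
expanded = s.apply(lambda x: pd.Series(list(x)))
print(expanded.apply(lambda col: col.value_counts(), axis=0).fillna(0).astype(int))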

Add columns to pandas dataframe containing max of each row, AND corresponding column name

My system
Windows 7, 64 bit
python 3.5.1
The challenge
I've got a pandas dataframe, and I would like to know the maximum value for each row and append that info as a new column. I would also like to add another column to the existing dataframe containing the name of the column where the max value can be found.
A similar question has been asked and answered for R in this post.
Reproducible example
In[1]:
import pandas as pd

# Make pandas dataframe
df = pd.DataFrame({'a': [1, 0, 0, 1, 3], 'b': [0, 0, 1, 0, 1], 'c': [0, 0, 0, 0, 0]})
# Calculate max
my_series = df.max(numeric_only=True, axis=1)
my_series.name = "maxval"
# Include maxval in df
df = df.join(my_series)
df
Out[1]:
   a  b  c  maxval
0  1  0  0       1
1  0  0  0       0
2  0  1  0       1
3  1  0  0       1
4  3  1  0       3
So far so good. Now for the "add another column containing the name of the column" part:
In[2]:
?
?
?
# This is what I'd like to accomplish:
Out[2]:
   a  b  c  maxval maxcol
0  1  0  0       1      a
1  0  0  0       0  a,b,c
2  0  1  0       1      b
3  1  0  0       1      a
4  3  1  0       3      a
Notice that I'd like to return all column names if multiple columns contain the same maximum value. Also please notice that the column maxval is not included in maxcol since that would not make much sense. Thanks in advance if anyone out there finds this interesting.
You can compare the df against maxval using eq with axis=0, then use apply with a lambda to produce a boolean mask, mask the columns, and join them:

In [183]:
df['maxcol'] = df.loc[:, :'c'].eq(df['maxval'], axis=0).apply(lambda x: ','.join(df.columns[:3][x == x.max()]), axis=1)
df
Out[183]:
   a  b  c  maxval maxcol
0  1  0  0       1      a
1  0  0  0       0  a,b,c
2  0  1  0       1      b
3  1  0  0       1      a
4  3  1  0       3      a
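
If you prefer to avoid the row-wise apply, an alternative sketch using a boolean mask and DataFrame.dot to concatenate the matching column names (assumes the same df as above):

mask = df.loc[:, 'a':'c'].eq(df['maxval'], axis=0)
df['maxcol'] = mask.dot(mask.columns + ',').str.rstrip(',')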

Group by value of sum of columns with Pandas

I got lost in the pandas docs and features trying to figure out a way to group a DataFrame by the values of the sums of its columns.
For instance, let's say I have the following data:
In [2]: dat = {'a':[1,0,0], 'b':[0,1,0], 'c':[1,0,0], 'd':[2,3,4]}
In [3]: df = pd.DataFrame(dat)
In [4]: df
Out[4]:
   a  b  c  d
0  1  0  1  2
1  0  1  0  3
2  0  0  0  4
I would like columns a, b and c to be grouped, since their sums are all equal to 1. The resulting DataFrame's column labels would be the sums of the columns it grouped together, like this:
   1  9
0  2  2
1  1  3
2  0  4
Any idea to point me in the right direction? Thanks in advance!
Here you go:
In [57]: df.groupby(df.sum(), axis=1).sum()
Out[57]:
   1  9
0  2  2
1  1  3
2  0  4

[3 rows x 2 columns]
df.sum() is your grouper. It sums over axis 0 (the index), giving you the two groups: 1 (columns a, b and c) and 9 (column d). You want to group the columns (axis=1) and take the sum of each group.
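
Note that the axis=1 argument to groupby is deprecated in recent pandas releases; assuming a current version, an equivalent transpose-based sketch:

# group the transposed rows by the column sums, then transpose back
df.T.groupby(df.sum()).sum().T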
Because pandas is designed with database concepts in mind, it really expects information to be stored together in rows, not in columns. Because of this, it's usually more elegant to do things row-wise. Here's how to solve your problem row-wise:
import pandas as pd

dat = {'a': [1, 0, 0], 'b': [0, 1, 0], 'c': [1, 0, 0], 'd': [2, 3, 4]}
df = pd.DataFrame(dat)
df = df.transpose()
df['totals'] = df.sum(axis=1)
print(df.groupby('totals').sum().transpose())

# totals  1  9
# 0       2  2
# 1       1  3
# 2       0  4
