How do I get the row count of a Pandas DataFrame? - python

How do I get the number of rows of a pandas dataframe df?

For a dataframe df, one can use any of the following:
len(df.index)
df.shape[0]
df[df.columns[0]].count() (== number of non-NaN values in first column)
Code to reproduce the plot:
import numpy as np
import pandas as pd
import perfplot
perfplot.save(
    "out.png",
    setup=lambda n: pd.DataFrame(np.arange(n * 3).reshape(n, 3)),
    n_range=[2**k for k in range(25)],
    kernels=[
        lambda df: len(df.index),
        lambda df: df.shape[0],
        lambda df: df[df.columns[0]].count(),
    ],
    labels=["len(df.index)", "df.shape[0]", "df[df.columns[0]].count()"],
    xlabel="Number of rows",
)

Suppose df is your dataframe then:
count_row = df.shape[0] # Gives number of rows
count_col = df.shape[1] # Gives number of columns
Or, more succinctly,
r, c = df.shape
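For example, a minimal sketch with a made-up DataFrame (the data here is purely for illustration):
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})  # hypothetical 3-row, 2-column frame

count_row = df.shape[0]  # 3
count_col = df.shape[1]  # 2
r, c = df.shape          # r == 3, c == 2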

Use len(df) :-).
__len__() is documented with "Returns length of index".
Timing info, set up the same way as in root's answer:
In [7]: timeit len(df.index)
1000000 loops, best of 3: 248 ns per loop
In [8]: timeit len(df)
1000000 loops, best of 3: 573 ns per loop
Due to one additional function call, it is of course correct to say that it is a bit slower than calling len(df.index) directly. But this should not matter in most cases. I find len(df) to be quite readable.
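If you want to reproduce a comparison like this outside IPython, here is a minimal sketch using the standard timeit module (the frame size and loop count are arbitrary, and the exact numbers will depend on your hardware and pandas version):
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(30).reshape(10, 3))  # small example frame

print(timeit.timeit(lambda: len(df.index), number=1_000_000))
print(timeit.timeit(lambda: len(df), number=1_000_000))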

How do I get the row count of a Pandas DataFrame?
This table summarises the different situations in which you'd want to count something in a DataFrame (or Series, for completeness), along with the recommended method(s):

What to count                   Recommended method(s)
Rows of a DataFrame             len(df), df.shape[0], len(df.index)
Columns of a DataFrame          df.shape[1], len(df.columns)
Rows of a Series                len(s), s.size, len(s.index)
Non-null values                 DataFrame.count, Series.count
Rows per group                  DataFrameGroupBy.size, SeriesGroupBy.size
Non-null values per group       DataFrameGroupBy.count, SeriesGroupBy.count
Footnotes
DataFrame.count returns counts for each column as a Series since the non-null count varies by column.
DataFrameGroupBy.size returns a Series, since all columns in the same group share the same row-count.
DataFrameGroupBy.count returns a DataFrame, since the non-null count could differ across columns in the same group. To get the group-wise non-null count for a specific column, use df.groupby(...)['x'].count() where "x" is the column to count.
Minimal Code Examples
Below, I show examples of each of the methods described in the table above. First, the setup -
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': list('aabbc'),
    'B': ['x', 'x', np.nan, 'x', np.nan]})
s = df['B'].copy()
df
A B
0 a x
1 a x
2 b NaN
3 b x
4 c NaN
s
0 x
1 x
2 NaN
3 x
4 NaN
Name: B, dtype: object
Row Count of a DataFrame: len(df), df.shape[0], or len(df.index)
len(df)
# 5
df.shape[0]
# 5
len(df.index)
# 5
It seems silly to compare the performance of constant time operations, especially when the difference is on the level of "seriously, don't worry about it". But this seems to be a trend with other answers, so I'm doing the same for completeness.
Of the three methods above, len(df.index) (as mentioned in other answers) is the fastest.
Note
All the methods above are constant time operations as they are simple attribute lookups.
df.shape (similar to ndarray.shape) is an attribute that returns a tuple of (# Rows, # Cols). For example, df.shape returns (5, 2) for the example here.
Column Count of a DataFrame: df.shape[1], len(df.columns)
df.shape[1]
# 2
len(df.columns)
# 2
Analogous to len(df.index), len(df.columns) is the faster of the two methods (but takes more characters to type).
Row Count of a Series: len(s), s.size, len(s.index)
len(s)
# 5
s.size
# 5
len(s.index)
# 5
s.size and len(s.index) are about the same in terms of speed. But I recommend len(s).
Note
size is an attribute, and it returns the number of elements (= count of rows for any Series). DataFrames also define a size attribute, which returns the same result as df.shape[0] * df.shape[1].
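To illustrate with the df and s from the setup above:
s.size   # 5
df.size  # 10 (== df.shape[0] * df.shape[1] == 5 * 2)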
Non-Null Row Count: DataFrame.count and Series.count
The methods described here only count non-null values (meaning NaNs are ignored).
Calling DataFrame.count will return non-NaN counts for each column:
df.count()
A 5
B 3
dtype: int64
For Series, use Series.count to similar effect:
s.count()
# 3
Group-wise Row Count: GroupBy.size
For DataFrames, use DataFrameGroupBy.size to count the number of rows per group.
df.groupby('A').size()
A
a 2
b 2
c 1
dtype: int64
Similarly, for Series, you'll use SeriesGroupBy.size.
s.groupby(df.A).size()
A
a 2
b 2
c 1
Name: B, dtype: int64
In both cases, a Series is returned. This makes sense for DataFrames as well since all groups share the same row-count.
Group-wise Non-Null Row Count: GroupBy.count
Similar to above, but use GroupBy.count, not GroupBy.size. Note that size always returns a Series, while count returns a Series if called on a specific column, or else a DataFrame.
The following methods return the same thing:
df.groupby('A')['B'].size()
df.groupby('A').size()
A
a 2
b 2
c 1
Name: B, dtype: int64
Meanwhile, for count, we have
df.groupby('A').count()
B
A
a 2
b 1
c 0
...called on the entire GroupBy object, vs.,
df.groupby('A')['B'].count()
A
a 2
b 1
c 0
Name: B, dtype: int64
Called on a specific column.

TL;DR use len(df)
len() returns the number of items (the length) of a list object (it also works for dictionaries, strings, tuples and range objects). So, to get the row count of a DataFrame, simply use len(df).
For more about the len function, see the official page.
Alternatively, you can access all rows with df.index and all columns with df.columns. Since len(anyList) gives the number of elements, len(df.index) will give the number of rows and len(df.columns) will give the number of columns.
Or, you can use df.shape, which returns the number of rows and columns together as a tuple, where each item can be accessed by its index. For the number of rows only, use df.shape[0]; for the number of columns only, use df.shape[1].
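A short sketch pulling these together (the numbers assume a hypothetical DataFrame with 5 rows and 2 columns):
len(df)          # 5 -> number of rows
len(df.index)    # 5 -> number of rows
len(df.columns)  # 2 -> number of columns
df.shape         # (5, 2)
df.shape[0]      # 5 -> number of rows
df.shape[1]      # 2 -> number of columns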

Apart from the previous answers, you can use df.axes to get the tuple with row and column indexes and then use the len() function:
total_rows = len(df.axes[0])
total_cols = len(df.axes[1])

...building on Jan-Philip Gehrcke's answer.
The reason why len(df) or len(df.index) is faster than df.shape[0]:
Look at the code. df.shape is a @property that runs a DataFrame method calling len twice.
df.shape??
Type: property
String form: <property object at 0x1127b33c0>
Source:
# df.shape.fget
@property
def shape(self):
    """
    Return a tuple representing the dimensionality of the DataFrame.
    """
    return len(self.index), len(self.columns)
And beneath the hood of len(df)
df.__len__??
Signature: df.__len__()
Source:
def __len__(self):
    """Returns length of info axis, but here we use the index """
    return len(self.index)
File: ~/miniconda2/lib/python2.7/site-packages/pandas/core/frame.py
Type: instancemethod
len(df.index) will be slightly faster than len(df) since it has one less function call, but both are always faster than df.shape[0].
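Outside IPython, the same sources can be inspected with the standard library's inspect module (the exact text printed depends on your pandas version):
import inspect

import pandas as pd

# The getter function behind the DataFrame.shape property
print(inspect.getsource(pd.DataFrame.shape.fget))

# The __len__ method that len(df) dispatches to
print(inspect.getsource(pd.DataFrame.__len__))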

I come to Pandas from an R background, and I see that Pandas is more complicated when it comes to selecting rows or columns.
I had to wrestle with it for a while, and then I found some ways to deal with it:
Getting the number of columns:
len(df.columns)
## Here:
# df is your data.frame
# df.columns returns an Index object holding the column labels of df.
# Then, len() gets its length, i.e. the number of columns.
Getting the number of rows:
len(df.index) # It's similar.

In case you want to get the row count in the middle of a chained operation, you can use:
df.pipe(len)
Example:
row_count = (
    pd.DataFrame(np.random.rand(3, 4))
    .reset_index()
    .pipe(len)
)
This can be useful if you don't want to put a long statement inside a len() function.
You could use __len__() instead but __len__() looks a bit weird.
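For completeness, the __len__() variant mentioned above would look like this (same hypothetical chain as before):
import numpy as np
import pandas as pd

row_count = (
    pd.DataFrame(np.random.rand(3, 4))
    .reset_index()
    .__len__()  # equivalent to wrapping the chain in len(...), just callable inside the chain
)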

You can do this also:
Let's say df is your dataframe. Then df.shape gives you the shape of the dataframe, i.e. (rows, columns).
Thus, use the commands below to get what you need:
row = df.shape[0]  # number of rows
col = df.shape[1]  # number of columns

Either of this can do it (df is the name of the DataFrame):
Method 1: Using the len function:
len(df) will give the number of rows in a DataFrame named df.
Method 2: using count function:
df[col].count() will count the number of non-null rows in a given column col.
df.count() will give the number of non-null rows for each of the columns.
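Keep in mind that count() skips null values, so it can report fewer rows than len(df). A minimal sketch with made-up data:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col': [1.0, np.nan, 3.0]})  # hypothetical column with one NaN

len(df)            # 3 -> total number of rows
df['col'].count()  # 2 -> non-NaN values in 'col' only
df.count()         # Series of per-column non-NaN counts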

For dataframe df, a printed comma formatted row count used while exploring data:
def nrow(df):
    print("{:,}".format(df.shape[0]))
Example:
nrow(my_df)
12,456,789

When using len(df) or len(df.index) you might encounter this error:
----> 4 df['id'] = np.arange(len(df.index))
TypeError: 'int' object is not callable
Solution:
length = df.shape[0]
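For what it's worth, that TypeError usually means the name len has been rebound to an integer earlier in the session, so len(...) no longer refers to the built-in. A guess at how that situation arises, using hypothetical names (this runs at the top level of a script or interactive session):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})

len = df.shape[0]                      # accidentally shadows the built-in len with an int
# df['id'] = np.arange(len(df.index))  # would now raise: TypeError: 'int' object is not callable

del len                                # remove the shadowing name to restore the built-in
df['id'] = np.arange(len(df.index))    # works again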

An alternative method for finding out the number of rows in a dataframe, which I think is the most readable variant, is pandas.Index.size.
Do note that, as I commented on the accepted answer,
I suspected pandas.Index.size would actually be faster than len(df.index), but timeit on my computer tells me otherwise (~150 ns slower per loop).
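In code, that is simply (a small illustration):
df.index.size    # number of rows
df.columns.size  # number of columns, by the same logic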

I'm not sure if this would work (data could be omitted), but this may work:
df.tail(1)
and then, by running that snippet and looking at the index label of the row that is returned, you can find the number of rows.

len(df.index) would work the fastest of all the ways listed

For a dataframe df:
When you're still writing your code:
len(df)
df.shape[0]
Fastest once your code is done:
len(df.index)
At normal data sizes each option will finish in under a second. So the "fastest" option is actually whichever one lets you work the fastest, which can be len(df) or df.shape[0] if you already have a subsetted df and want to just add .shape[0] briefly in an interactive session.
In final optimized code, the fastest runtime is len(df.index).
df[df.columns[0]].count() was omitted from the discussion above because no commenter has identified a case where it is useful. It is dramatically slower (it scales with the number of rows, while the other options are constant time) and longer to type. It provides the number of non-NaN values in the first column.
Code to reproduce the plot:
pip install pandas perfplot
import numpy as np
import pandas as pd
import perfplot
perfplot.save(
    "out.png",
    setup=lambda n: pd.DataFrame(np.arange(n * 3).reshape(n, 3)),
    n_range=[2**k for k in range(25)],
    kernels=[
        lambda df: len(df.index),
        lambda df: len(df),
        lambda df: df.shape[0],
        lambda df: df[df.columns[0]].count(),
    ],
    labels=["len(df.index)", "len(df)", "df.shape[0]", "df[df.columns[0]].count()"],
    xlabel="Number of rows",
)

Suppose the dataset is stored in "data.csv". Read it into a DataFrame called data_fr, and store the number of rows of data_fr in nu_rows:
# Import the data into a DataFrame. The extension could differ, e.g. csv, xlsx, etc.
data_fr = pd.read_csv('data.csv')
#print the number of rows
nu_rows = data_fr.shape[0]
print(nu_rows)

Related

Is there a faster way to search every column of a dataframe for a String than with .apply and str.contains?

So basically I have a bunch of dataframes with about 100 columns and 500-3000 rows, filled with different string values. Now I want to search the entire dataframe for, let's say, the string "Airbag" and delete every row which doesn't contain this string. I was able to do this with the following code:
df = df[df.apply(lambda row: row.astype(str).str.contains('Airbag', regex=False).any(), axis=1)]
This works exactly like I want it to, but it is way too slow. So I tried to find a way to do it with vectorization or a list comprehension, but I wasn't able to do it or find example code on the internet. So my question is whether it is possible to speed this process up or not.
Example Dataframe:
df = pd.DataFrame({'col1': ['Airbag_101', 'Distance_xy', 'Sensor_2'], 'col2': ['String1', 'String2', 'String3'], 'col3': ['Tires', 'Wheel_Airbag', 'Antenna']})
Let's start from this dataframe with random strings and numbers in COLUMN:
import numpy as np
import pandas as pd
np.random.seed(0)
strings = np.apply_along_axis(''.join, 1, np.random.choice(list('ABCD'), size=(100, 5)))
junk = list(range(10))
col = list(strings)+junk
np.random.shuffle(col)
df = pd.DataFrame({'COLUMN': col})
>>> df.head()
COLUMN
0 BBCAA
1 6
2 ADDDA
3 DCABB
4 ADABC
You can simply apply pandas.Series.str.contains. You need to use fillna to account for the non-string elements:
>>> df[df['COLUMN'].str.contains('ABC').fillna(False)]
COLUMN
4 ADABC
31 BDABC
40 BABCB
88 AABCA
101 ABCBB
testing all columns:
Here is an alternative using a good old custom function. One might think it should be slower than apply/transform, but it is actually faster when you have a lot of columns and a decent frequency of the searched term (tested on the example dataframe, a 3x3 with no match, and on 3x3000 dataframes with matches and no matches):
def has_match(series):
    for s in series:
        if 'Airbag' in s:
            return True
    return False

df[df.apply(has_match, axis=1)]
Update (exact match)
Since it looks like you actually want an exact match, test with eq() instead of str.contains(). Then use boolean indexing with loc:
df.loc[df.eq('Airbag').any(axis=1)]
Original (substring)
Test for the string with applymap() and turn it into a row mask using any(axis=1):
df[df.applymap(lambda x: 'Airbag' in x).any(axis=1)]
# col1 col2 col3
# 0 Airbag_101 String1 Tires
# 1 Distance_xy String2 Wheel_Airbag
As mozway said, "optimal" depends on the data. These are some timing plots for reference.
Timings vs number of rows (fixed at 3 columns):
Timings vs number of columns (fixed at 3,000 rows):
Ok I was able to speed it up with the help of numpy arrays, but thanks for the help :D
master_index = []
for column in df.columns:
    np_array = df[column].values
    index = np.where(np_array == 'Airbag')
    master_index.append(index)

print(df.iloc[master_index[1][0]])

Question: Dividing values from two different Pandas Dataframes selected based on column values returns NaN

I am trying to select values from two different DataFrames based on certain column values and divide them by each other, but I always get NaN values.
I added a simplified example below:
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'b', 'c', 'd'],
                   'col2': ['x1', 'x2', 'x3', 'x4'],
                   'col3': [10, 3, 2, 8]})
print(df)
df1 = pd.DataFrame({'col1': ['a1', 'b1', 'c1', 'd1'],
                    'col2': ['y1', 'y1', 'y3', 'y4'],
                    'col3': [5, 4, 1, 6]})
print(df1)
a = df.loc[((df['col1']=='a')&(df['col2']=='x1')),'col3']
print(a)
b = df1.loc[((df1['col1']=='d1')&(df1['col2']=='y4')), 'col3']
print(b)
c = a/b
print(c)
How can I overcome this problem?
Pandas is so powerful because it aligns on the indices when doing certain actions, for example mathematical methods.
If we print both a and b in your example, we can see that their indices are respectively 0 and 3:
print(a.index)
print(b.index)
Int64Index([0], dtype='int64')
Int64Index([3], dtype='int64')
This means that when doing the operation a/b, pandas cannot align any values and thus returns NaN.
Your solution would be to reset_index:
a.reset_index(drop=True) / b.reset_index(drop=True)
0 1.666667
Name: col3, dtype: float64
Or cast to numpy arrays so we lose the indices:
a.to_numpy() / b.to_numpy()
array([1.66666667])
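Another option, not shown above, is to pull the underlying scalars out before dividing, so no index alignment is involved at all (this assumes each selection matches exactly one row):
c = a.iloc[0] / b.iloc[0]  # plain float division of the two matched values
print(c)                   # 1.666...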
That said, your operation does not seem entirely logical, and the problem probably lies deeper: matching these two values by position does not make much sense as it stands.

pandas dataframe: len(df) is not equal to number of iterations in df.iterrows()

I have a dataframe where I want to print each row to a different file. When the dataframe consists of e.g. only 50 rows, len(df) will print 50 and iterating over the rows of the dataframe like
for index, row in df.iterrows():
    print(index)
will print the index from 0 to 49.
However, if my dataframe contains more than 50'000 rows, len(df) and the number of iterations when iterating over df.iterrows() differ significantly. For example, len(df) will say e.g. 50'554 while printing the index will go up to over 400'000.
How can this be? What am I missing here?
First, as #EdChum noted in the comment, your question's title refers to iterrows, but the example you give refers to iteritems, which loops in the orthogonal direction to that relevant to len. I assume you meant iterrows (as in the title).
Note that a DataFrame's index need not be a running index, irrespective of the size of the DataFrame. For example:
df = pd.DataFrame({'a': [1, 2, 3, 4]}, index=[2, 4, 5, 1000])
>>> for index, row in df.iterrows():
... print index
2
4
5
1000
Presumably, your long DataFrame was just created differently, then, or underwent some manipulation, affecting the index.
If you really must iterate with a running index, you can use Python's enumerate:
>>> for index, row in enumerate(df.iterrows()):
... print index
0
1
2
3
(Note that, in this case, row is itself a tuple.)
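Alternatively (not part of the original answer), you can reset the index up front so that iterrows itself yields a running 0..n-1 index:
for index, row in df.reset_index(drop=True).iterrows():
    print(index)  # 0, 1, 2, 3 for the example df above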

Return multiple objects from an apply function in Pandas

I'm practicing with using apply with Pandas dataframes.
So I have cooked up a simple dataframe with dates, and values:
import numpy as np
import pandas as pd

dates = pd.date_range('2013', periods=10)
values = list(np.arange(1, 11, 1))
DF = pd.DataFrame({'date': dates, 'value': values})
I have a second dataframe, which is made up of 3 rows of the original dataframe:
DFa = DF.iloc[[1,2,4]]
So, I'd like to use the 2nd dataframe, DFa, and get the dates from each row (using apply), and then find and sum up any dates in the original dataframe, that came earlier:
def foo(DFa, DF=DF):
    cutoff_date = DFa['date']
    ans = DF[DF['date'] < cutoff_date]
DFa.apply(foo, axis=1)
Things work fine. My question is, since I've created 3 ans, how do I access these values?
Obviously I'm new to apply and I'm eager to get away from loops. I just don't understand how to return values from apply.
Your function needs to return a value. E.g.,
def foo(df1, df2):
    cutoff_date = df1.date
    ans = df2[df2.date < cutoff_date].value.sum()
    return ans

DFa.apply(lambda x: foo(x, DF), axis=1)
Also, note that your current function builds a DataFrame (ans) for each row in DFa without returning it; if it returned ans as-is, you would end up with a collection of DataFrames rather than the summed values you want.
There's a bit of a mixup the way you're using apply. With axis=1, foo will be applied to each row (see the docs), and yet your code implies (by the parameter name) that its first parameter is a DataFrame.
Additionally, you state that you want to sum up the original DataFrame's values for those less than the date. So foo needs to do this, and return the values.
So the code needs to look something like this:
def foo(row, DF=DF):
    cutoff_date = row['date']
    return DF[DF['date'] < cutoff_date].value.sum()
Once you make the changes, as foo returns a scalar, then apply will return a series:
>> DFa.apply(foo, axis=1)
1 1
2 3
4 10
dtype: int64

Cleanest iteration/functional application on Pandas Dataframe regardless of length

I constantly struggle with cleanly iterating or applying a function to Pandas DataFrames of variable length. Specifically, a length 1 DataFrame slice (Pandas Series).
Simple example, a DataFrame and a function that acts on each row of it. The format of the dataframe is known/expected.
def stringify(row):
    return "-".join([row["y"], str(row["x"]), str(row["z"])])
df = pd.DataFrame(dict(x=[1,2,3],y=["foo","bar","bro"],z=[-99,1.04,213]))
Out[600]:
x y z
0 1 foo -99.00
1 2 bar 1.04
2 3 bro 213.00
df_slice = df.iloc[0] # This is a Series
Usually, you can apply the function in one of the following ways:
stringy = df.apply(stringify,axis=1)
# or
stringy = [stringify(row) for _,row in df.iterrows()]
Out[611]: ['foo-1--99.0', 'bar-2-1.04', 'bro-3-213.0']
## Error with same syntax if Series
stringy = df_slice.apply(stringify, axis=1)
If the dataframe is empty, or has only one entry, these methods no longer work. A Series does not have an iterrows() method, and apply on a Series is applied to each element (the former columns) rather than to the row as a whole.
Is there a cleaner built in method to iterate/apply functions to DataFrames of variable length? Otherwise you have to constantly write cumbersome logic.
if type(df) is pd.DataFrame:
    if len(df) == 0:
        return None
    else:
        return df.apply(stringify, axis=1)
elif type(df) is pd.Series:
    return stringify(df)
I realize there are methods to ensure you form length 1 DataFrames, but what I am asking is for a clean way to apply/iterate on the various pandas data structures when it could be like-formatted dataframes or series.
There is no generic way to write a function which will seamlessly handle both DataFrames and Series. You would either need to use an if-statement to check for type, or use try..except to handle exceptions.
Instead of doing either of those things, I think it is better to make sure you create the right type of object before calling apply. For example, instead of using df.iloc[0] which returns a Series, use df.iloc[:1] to select a DataFrame of length 1. As long as you pass a slice range instead of a single value to df.iloc, you'll get back a DataFrame.
In [155]: df.iloc[0]
Out[155]:
x 1
y foo
z -99
Name: 0, dtype: object
In [156]: df.iloc[:1]
Out[156]:
x y z
0 1 foo -99
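With the length-1 DataFrame, the original apply call then works unchanged; a quick check using the stringify function and df defined earlier in this question:
df.iloc[:1].apply(stringify, axis=1)
# 0    foo-1--99.0
# dtype: object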
