I have a large dataset with columns labelled 1 to 65 (among other titled columns), and I want to find how many of those columns, per row, contain a string (of any value). For example, if all of columns 1 to 65 are filled in a particular row, the count for that row should be 65; if only 10 are filled, the count should be 10.
Is there any easy way to do this? I'm currently using the following code, which is taking very long as there are a large number of rows.
array = pd.read_csv(csvlocation, encoding="ISO-8859-1")
for i in range(0, lengthofarray):
    for k in range(1, 66):
        if array[k][i] != "":
            array["count"][i] = array["count"][i] + 1
From my understanding of the post and the subsequent comments, you want the number of strings in each row across the columns labelled 1 through 65. There are two steps: first subset your data down to columns 1 through 65, then count the number of strings in each row. To do this:
import pandas as pd
import numpy as np
# create sample data
df = pd.DataFrame({'col1': list('abdecde'),
                   'col2': np.random.rand(7).astype(object)})
# change one val of column two to string for illustration purposes
# (col2 is stored as object dtype so that a string can be assigned to it)
df.loc[3, 'col2'] = 'b'
# to create the subset of columns, you could use
# subset = [str(num) for num in list(range(1, 66))]
# and then just use df[subset]
# for each row, count the number of columns that have a string value
# applymap operates elementwise, so we are essentially creating
# a new representation of your data, where a 1 means a
# string value was there, and a 0 means it was not a string.
# (in pandas >= 2.1 applymap has been renamed to DataFrame.map)
# we then sum along the rows to get the final counts
col_str_counts = np.sum(df.applymap(lambda x: 1 if isinstance(x, str) else 0), axis=1)
# we changed the column two value above, so to check that the count is 2 for that row idx:
col_str_counts[3]
>>> 2
# and for the subset, it would simply become:
# col_str_counts = np.sum(df[subset].applymap(lambda x: 1 if isinstance(x, str) else 0), axis=1)
You should be able to adapt your problem to this example
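For the columns labelled 1 to 65, the same two steps can be sketched on a smaller frame. This is only an illustration under assumptions: the column labels "1" to "3", the "name" column and the sample values are made up, and it uses Series.map per column instead of applymap so that it does not depend on the pandas version.

```python
import numpy as np
import pandas as pd

# Hypothetical data: three numbered columns plus one titled column
df = pd.DataFrame({
    "name": ["r0", "r1", "r2"],
    "1": ["a", np.nan, np.nan],
    "2": ["b", "c", np.nan],
    "3": ["d", 1.5, np.nan],
})

# Labels of the numbered columns (range(1, 66) in the original question)
subset = [str(num) for num in range(1, 4)]

# True where a cell holds a str, False otherwise; summing along axis=1
# counts the string cells in each row
is_str = df[subset].apply(lambda col: col.map(lambda x: isinstance(x, str)))
col_str_counts = is_str.sum(axis=1)
print(col_str_counts.tolist())
```

The "name" column is left out of the count because it is not in subset.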
Say we have this dataframe
df = pd.DataFrame([["","foo","bar"],["","","bar"],["","",""],["foo","bar","bar"]])
     0    1    2
0       foo  bar
1            bar
2
3  foo  bar  bar
Then we create a boolean mask where a cell != "" and sum those values
df['count'] = (df != "").sum(1)
print(df)
     0    1    2  count
0       foo  bar      2
1            bar      1
2                     0
3  foo  bar  bar      3
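One caveat: if the frame comes from pd.read_csv with default settings, empty cells are read as NaN rather than "", so comparing with "" would count them as filled. A minimal sketch for that situation, assuming a tiny made-up CSV and numbered columns "1" to "3" in place of "1" to "65":

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for the real file
csv_text = "name,1,2,3\na,x,y,z\nb,x,,\nc,,,\n"
df = pd.read_csv(io.StringIO(csv_text))

# Column labels read from a CSV header are strings
numbered = ["1", "2", "3"]  # would be "1" .. "65" in the question

# notna() is True for every cell that is not missing
df["count"] = df[numbered].notna().sum(axis=1)
print(df["count"].tolist())
```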
df = pd.DataFrame([["","foo","bar"],["","","bar"],["","",""],["foo","bar","bar"]])
# take the cell count before the count column is added
total_cells = df.size
df['filled_cell_count'] = (df != "").sum(1)
print(df)
     0    1    2  filled_cell_count
0       foo  bar                  2
1            bar                  1
2                                 0
3  foo  bar  bar                  3
filled_fraction = df['filled_cell_count'].sum() / total_cells
print()
print(f"Fraction of filled cells in dataframe: {filled_fraction}")
Fraction of filled cells in dataframe: 0.5
I have a pandas dataframe as shown below:
[image: Pandas Dataframe]
I want to drop the rows that have only one non-zero value. What's the most efficient way to do this?
Try boolean indexing
# sample data
df = pd.DataFrame(np.zeros((10, 10)), columns=list('abcdefghij'))
df.iloc[2:5, 3] = 1
df.iloc[4:5, 4] = 1
# boolean indexing based on condition
df[df.ne(0).sum(axis=1).ne(1)]
Only rows 2 and 3 are removed, because each of them has exactly one non-zero value. Row 4 has two non-zero values and every other row has none, so those rows are kept. The per-row counts of non-zero values are:
df.ne(0).sum(axis=1)
0 0
1 0
2 1
3 1
4 2
5 0
6 0
7 0
8 0
9 0
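The same filter can be written with named intermediate steps, which makes it easier to check which rows survive. This is a sketch on the sample data above; the names nonzero_per_row, keep and result are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((10, 10)), columns=list("abcdefghij"))
df.iloc[2:5, 3] = 1
df.iloc[4:5, 4] = 1

nonzero_per_row = df.ne(0).sum(axis=1)  # number of non-zero cells in each row
keep = nonzero_per_row.ne(1)            # False only where exactly one cell is non-zero
result = df[keep]
print(result.index.tolist())
```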
Not sure if this is the most efficient but I'll try:
df[[col for col in df.columns if (df[col] != 0).sum() == 1]]
2 loops per column here: 1 for checking if != 0 and one more to sum the boolean values up (could break earlier if the second value is found).
Otherwise, you can define a custom function to check without looping twice per column:
def check(column):
    already_has_one = False
    for value in column:
        if value != 0:
            if already_has_one:
                return False
            already_has_one = True
    return already_has_one
then:
df[[col for col in df.columns if check(df[col])]]
Which is much faster than the first.
Or like this:
df[(df.applymap(lambda x: bool(x)).sum(1) > 1).values]
I am applying a function on a dataframe df and that function returns a dataframe int_df, but the result is getting stored as a series.
df
   limit
0      4
new_df
   A       B
0  0  Number
1  1  Number
2  2  Number
3  3  Number
This is pseudocode of what I have done:
def foo(x):
    limit = x['limit']
    int_df = pd.DataFrame(columns=['A', 'B'])  # Create empty dataframe
    # Append a new row to the dataframe
    for i in range(0, limit):
        int_df.loc[len(int_df.index)] = [i, 'Number']
    return int_df  # This is dataframe
new_df = df.apply(foo, axis=1)
new_df  # This is a series but I need a dataframe
Is this the right way to do this?
IIUC, here's one way (reset_index is needed because explode repeats the original row label):
df = df.limit.apply(range).explode().reset_index(drop=True).to_frame('A').assign(B='Number')
OUTPUT:
   A       B
0  0  Number
1  1  Number
2  2  Number
3  3  Number
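If the per-row function really has to build a DataFrame, another option is to call it in an explicit loop and concatenate the pieces, since the question observed that apply with axis=1 returns those frames wrapped in a Series. A rough sketch; the helper name expand_row is made up here and is not part of the original code:

```python
import pandas as pd

df = pd.DataFrame({"limit": [4]})

def expand_row(row):
    # Hypothetical helper: one output row per integer below row['limit']
    return pd.DataFrame({"A": range(row["limit"]), "B": "Number"})

# Build one small frame per input row, then stack them into a single frame
pieces = [expand_row(row) for _, row in df.iterrows()]
new_df = pd.concat(pieces, ignore_index=True)
print(new_df)
```

For large inputs the explode approach is likely the better fit, because this version creates one DataFrame per row.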
edit: I understand how to get the actual values, but I wonder how to append a row with these 2 sums to the existing df?
I have a dataframe score_card that looks like:
15min_colour  15min_high  15min_price  30min_colour  30min_high  30min_price
           1           1           -1             1          -1            1
           1          -1            1             1           1            1
          -1           1           -1             1           1            1
          -1           1           -1             1          -1            1
Now I'd like to add a row that sums up all the 15min numbers (first 3 columns), all the 30min numbers, and so on (the actual df is larger). That is, I don't want the sums of the individual columns but rather the sum of the columns' sums for each group. The row I'd like to add would look like:
sum_15min_colour&15min_high&15min_price  sum_30min_colour&30min_high&30min_price
                                      0                                        8
Please disregard the header, it's only to clarify what I'm intending to do.
I assume there's a multiindex involved, but I couldn't figure out how to apply it to my existing df to achieve the desired output.
Also, is it possible to add a column with the sum of the whole table?
Thanks for your support.
You can sum in this way:
np.sum(df.filter(like='15').values), np.sum(df.filter(like='30').values)
0,8
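To make this reproducible, here is the same calculation on the table from the question, as a sketch. The frame is rebuilt by hand from the values shown there, and sum_15 and sum_30 are illustrative names. Note that filter(like=...) matches any column label containing the substring.

```python
import numpy as np
import pandas as pd

score_card = pd.DataFrame(
    [[1, 1, -1, 1, -1, 1],
     [1, -1, 1, 1, 1, 1],
     [-1, 1, -1, 1, 1, 1],
     [-1, 1, -1, 1, -1, 1]],
    columns=["15min_colour", "15min_high", "15min_price",
             "30min_colour", "30min_high", "30min_price"],
)

# Sum every cell of the columns whose label contains "15" (or "30")
sum_15 = int(np.sum(score_card.filter(like="15").values))
sum_30 = int(np.sum(score_card.filter(like="30").values))
print(sum_15, sum_30)
```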
groupby
groupby can take a callable (a function) and apply it to the index or column labels to form the groups:
df.groupby(lambda x: x.split('_')[0], axis=1).sum().sum()
15min 0
30min 8
dtype: int64
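In pandas 2.1 and later, passing axis=1 to groupby is deprecated. A sketch of an equivalent that groups the transposed frame instead, using the table from the question rebuilt by hand (totals is an illustrative name):

```python
import pandas as pd

score_card = pd.DataFrame(
    [[1, 1, -1, 1, -1, 1],
     [1, -1, 1, 1, 1, 1],
     [-1, 1, -1, 1, 1, 1],
     [-1, 1, -1, 1, -1, 1]],
    columns=["15min_colour", "15min_high", "15min_price",
             "30min_colour", "30min_high", "30min_price"],
)

# After transposing, the column labels become the index, so the callable
# receives each label and returns its prefix ("15min" or "30min")
by_prefix = score_card.T.groupby(lambda label: label.split("_")[0]).sum()

# One total per prefix: add up the per-row sums
totals = by_prefix.sum(axis=1)
print(totals.to_dict())
```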
It depends on the axis.
With axis=0, sum adds up the values down each column, so you get one total per column:
df.sum(axis = 0, skipna = True)
To add two columns element-wise and store the result in a new column (col1, col2 and col3 are generic names):
sum_column = df["col1"] + df["col2"]
df["col3"] = sum_column
print(df)
So in your case:
summed0Axis = df.sum(axis = 0, skipna = True)
sum_column = summed0Axis["15min_colour"] + summed0Axis["15min_high"] + summed0Axis["15min_price"]
print(sum_column)
A more general option is to select all columns whose name contains 15 (or 30) and then sum them:
columnsWith15 = df.loc[:, df.columns.str.contains("15")]
columnsWith30 = df.loc[:, df.columns.str.contains("30")]
print(columnsWith15.sum().sum(), columnsWith30.sum().sum())
I have seen a variant of this question that keeps the top n rows of each group in a pandas dataframe, where the solutions use n as an absolute number rather than a percentage (see "Pandas get topmost n records within each group"). However, in my dataframe each group has a different number of rows, and I want to keep the top n% of rows of each group. How would I approach this problem?
You can construct a Boolean series of flags and filter before you groupby. First let's create an example dataframe and look at the number of rows for each unique value in the first series:
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 2, (10, 3)))
print(df[0].value_counts())
0 6
1 4
Name: 0, dtype: int64
Then define a fraction, e.g. 50% below, and construct a Boolean series for filtering:
n = 0.5
g = df.groupby(0)
flags = (g.cumcount() + 1) <= g[1].transform('size') * n
Then apply the condition, set the index as the first series and (if required) sort the index:
df = df.loc[flags].set_index(0).sort_index()
print(df)
   1  2
0
0  1  1
0  1  1
0  1  0
1  1  1
1  1  0
As you can see, the resultant dataframe has only three rows with index 0 and two rows with index 1, in each case half the number in the original dataframe.
Here is another option which builds on some of the answers in the post you mentioned
First of all here is a quick function to either round up or round down. If we want the top 30% of rows of a dataframe 8 rows long then we would try to take 2.4 rows. So we will need to either round up or down.
My preferred option is to round up. This is because, for example, if we were to take 50% of the rows but had one group with only one row, we would still keep that one row. I kept this separate so that you can change the rounding as you wish.
import math

def round_func(x, up=True):
    '''Function to round up or round down a float'''
    if up:
        return math.ceil(x)
    else:
        return math.floor(x)
Next I make a dataframe to work with and set a parameter p to be the fraction of the rows from each group that we should keep. Everything follows and I have commented it so that hopefully you can follow.
import pandas as pd
df = pd.DataFrame({'id':[1,1,1,2,2,2,2,3,4],'value':[1,2,3,1,2,3,4,1,1]})
p = 0.30 # top fraction to keep. Currently set to 30%
df_top = df.groupby('id').apply( # group by the ids
    lambda x: x.reset_index()['value'].nlargest( # in each group take the top rows by column 'value'
        round_func(x.count().max()*p))) # calculate how many to keep from each group
df_top = df_top.reset_index().drop('level_1', axis=1) # make the dataframe nice again
df looked like this
   id  value
0   1      1
1   1      2
2   1      3
3   2      1
4   2      2
5   2      3
6   2      4
7   3      1
8   4      1
df_top looks like this
   id  value
0   1      3
1   2      4
2   2      3
3   3      1
4   4      1
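For comparison, the same selection can be sketched without groupby.apply by sorting once and comparing each row's position within its group against the rounded-up group quota. This is an alternative under the same assumptions (fraction p, rounding up), not the method above; ordered, groups and keep_n are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"id": [1, 1, 1, 2, 2, 2, 2, 3, 4],
                   "value": [1, 2, 3, 1, 2, 3, 4, 1, 1]})
p = 0.30  # fraction of each group to keep

# Largest values first within each id
ordered = df.sort_values(["id", "value"], ascending=[True, False])
groups = ordered.groupby("id")

# Rows to keep per group, rounded up so every group keeps at least one row
keep_n = np.ceil(groups["value"].transform("size") * p)

# cumcount() numbers the rows of each group 0, 1, 2, ... in sorted order
df_top = ordered[groups.cumcount() < keep_n].reset_index(drop=True)
print(df_top)
```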
Given a dataframe df, I would like to generate a new variable/column for each row based on the values in the previous row. df is sorted so that the order of the rows is meaningful.
Normally, we can use either map or apply, but it seems that neither of them allows the access to values in the previous row.
For example, given existing rows a b c, I want to generate a new column d, which is based on some calculation using the value of c in the previous row.
How should I do it in pandas?
If you just want to do a calculation based on the previous row, you can calculate and then shift:
In [2]: df = pd.DataFrame({'a':[0,1,2], 'b':[0,10,20]})
In [3]: df
Out[3]:
   a   b
0  0   0
1  1  10
2  2  20
# a calculation based on other column
In [4]: df['c'] = df['b'] + 1
# shift the column
In [5]: df['c'] = df['c'].shift()
In [6]: df
Out[6]:
   a   b    c
0  0   0  NaN
1  1  10    1
2  2  20   11
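If you want to do a calculation based on multiple rows, you could look at the rolling_apply function (http://pandas.pydata.org/pandas-docs/stable/computation.html#moving-rolling-statistics-moments and http://pandas.pydata.org/pandas-docs/stable/generated/pandas.rolling_apply.html#pandas.rolling_apply). In current pandas versions rolling_apply is no longer available as a top-level function; the equivalent is the rolling(...).apply(...) method on a Series or DataFrame.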
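As a rough sketch of the multi-row case with the rolling method, the following applies a function to a window made of the current row and the one before it. The window size of 2, the summing function and the column name b_pair_sum are only illustrative choices.

```python
import pandas as pd

df = pd.DataFrame({"a": [0, 1, 2], "b": [0, 10, 20]})

# Window of 2 rows: the current row and the one before it.
# raw=True passes each window to the function as a NumPy array.
df["b_pair_sum"] = df["b"].rolling(window=2).apply(lambda w: w.sum(), raw=True)
print(df)
```

The first row has no previous row, so its window is incomplete and the result there is NaN.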
You can use the dataframe 'apply' function together with a decorator that stores the previous row.
import pandas as pd
df = pd.DataFrame({'a':[0,1,2], 'b':[0,10,20]})
new_col = 'c'
def apply_func_decorator(func):
    prev_row = {}
    def wrapper(curr_row, **kwargs):
        val = func(curr_row, prev_row)
        # remember this row (and its computed value) for the next call
        prev_row.update(curr_row)
        prev_row[new_col] = val
        return val
    return wrapper
@apply_func_decorator
def running_total(curr_row, prev_row):
    return curr_row['a'] + curr_row['b'] + prev_row.get('c', 0)
df[new_col] = df.apply(running_total, axis=1)
print(df)
# Output will be:
#    a   b   c
# 0  0   0   0
# 1  1  10  11
# 2  2  20  33
This example uses a decorator to store the previous row in a dictionary and then pass it to the function when Pandas calls it on the next row.
Disclaimer 1: The 'prev_row' variable starts off empty for the first row so when using it in the apply function I had to supply a default value to avoid a 'KeyError'.
Disclaimer 2: I am fairly certain this will be slower than a plain apply operation, but I did not run any tests to figure out by how much.
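For this particular running total, a cumulative sum gives the same column without carrying state between rows. This is a sketch that applies only when the calculation reduces to a cumulative operation such as cumsum; more general dependencies on the previous row may still need an approach like the one above.

```python
import pandas as pd

df = pd.DataFrame({"a": [0, 1, 2], "b": [0, 10, 20]})

# Each row's total is a + b plus the total of the previous row,
# which is exactly a cumulative sum of a + b
df["c"] = (df["a"] + df["b"]).cumsum()
print(df["c"].tolist())
```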