edit: I understand how to get the actual values, but I wonder how to append a row with these 2 sums to the existing df?
I have a dataframe score_card that looks like:
   15min_colour  15min_high  15min_price  30min_colour  30min_high  30min_price
0             1           1           -1             1          -1            1
1             1          -1            1             1           1            1
2            -1           1           -1             1           1            1
3            -1           1           -1             1          -1            1
Now I'd like to add a row that sums up all the 15min numbers (the first 3 columns), all the 30min numbers, and so on (the actual df is larger). That is, I don't want to add up the individual columns, but rather take the sum of each group of columns' sums. The row I'd like to add would look like:
sum_15min_colour&15min_high&15min_price   sum_30min_colour&30min_high&30min_price
                                      0                                         8
Please disregard the header, it's only to clarify what I'm intending to do.
I assume there's a multiindex involved, but I couldn't figure out how to apply it to my existing df to achieve the desired output.
Also, is it possible to add a column with the sum of the whole table?
Thanks for your support.
You can sum in this way:
np.sum(df.filter(like='15').values), np.sum(df.filter(like='30').values)
(0, 8)
groupby can take a callable (think: a function) and apply it to the index or the columns:
df.groupby(lambda x: x.split('_')[0], axis=1).sum().sum()
15min 0
30min 8
dtype: int64
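If you also want to append those group totals as an extra row (the edit at the top of the question) and add a column with the whole-table sum, here is a minimal sketch. It assumes the sample score_card above; the row label 'sum', the column name table_sum, and the choice to repeat each group total under every column of its group are illustrative choices, not anything prescribed by the question:
import pandas as pd

# assumed sample data matching the question's score_card
score_card = pd.DataFrame({
    '15min_colour': [1, 1, -1, -1],
    '15min_high':   [1, -1, 1, 1],
    '15min_price':  [-1, 1, -1, -1],
    '30min_colour': [1, 1, 1, 1],
    '30min_high':   [-1, 1, 1, -1],
    '30min_price':  [1, 1, 1, 1],
})

# total per prefix group, e.g. {'15min': 0, '30min': 8}
group_totals = {prefix: score_card.filter(like=prefix).to_numpy().sum()
                for prefix in ('15min', '30min')}

# whole-table sum, computed before the extra row is appended
table_sum = score_card.to_numpy().sum()   # 8 here

# append a row labelled 'sum' that repeats each group total under its columns
score_card.loc['sum'] = [group_totals[col.split('_')[0]] for col in score_card.columns]

# optional extra column holding the whole-table sum (same value in every row)
score_card['table_sum'] = table_sum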
It depends on the axis. Simply put, this sums the values along axis 0, i.e. down each column (all values in a column are summed vertically):
df.sum(axis = 0, skipna = True)
For example, you can also add two columns into a new third column:
sum_column = df["col1"] + df["col2"]
df["col3"] = sum_column
print(df)
So in your case:
summed0Axis = df.sum(axis = 0, skipna = True)
sum_column = summed0Axis["15min_colour"] + summed0Axis["15min_high"] + summed0Axis["15min_price"]
print(sum_column)
A more concise option is to select all the columns whose names contain "15" (or "30"):
columnsWith15 = df.loc[:, df.columns.str.contains("15")]
columnsWith30 = df.loc[:, df.columns.str.contains("30")]
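As a short usage sketch (my own addition, not part of the original answer), these subsets can then be reduced to the two totals from the question:
sum_15 = columnsWith15.to_numpy().sum()   # 0 for the sample data
sum_30 = columnsWith30.to_numpy().sum()   # 8 for the sample data
print(sum_15, sum_30)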
Related
I have a pandas dataframe (shown as an image in the original post). I want to drop the rows that have only one non-zero value. What's the most efficient way to do this?
Try boolean indexing
# sample data
df = pd.DataFrame(np.zeros((10, 10)), columns=list('abcdefghij'))
df.iloc[2:5, 3] = 1
df.iloc[4:5, 4] = 1
# boolean indexing based on condition
df[df.ne(0).sum(axis=1).ne(1)]
Only rows 2 and 3 are removed: they are the only rows with exactly one non-zero value (row 4 has two non-zero values and every other row has none).
df.ne(0).sum(axis=1)
0 0
1 0
2 1
3 1
4 2
5 0
6 0
7 0
8 0
9 0
Not sure if this is the most efficient but I'll try:
df[[col for col in df.columns if (df[col] != 0).sum() == 1]]
This does two passes per column: one to check != 0 and one more to sum up the boolean values (a loop could break early once a second non-zero value is found).
Otherwise, you can define a custom function to check without looping twice per column:
def check(column):
    already_has_one = False
    for value in column:
        if value != 0:
            if already_has_one:
                return False
            already_has_one = True
    return already_has_one
then:
df[[col for col in df.columns if check(df[col])]]
This is much faster than the first.
Or like this:
df[(df.applymap(lambda x: bool(x)).sum(1) > 1).values]
I have a pandas df which looks like the picture in the original post. I want to delete any column in which more than half of the values are the same, and I don't know how to do this. I tried using pandas.Series.value_counts, but with no luck.
You can iterate over the columns, count the occurrences of values with value_counts as you tried, and check whether the most common value accounts for more than 50% of the column's data.
n = len(df)
cols_to_drop = []
for e in list(df.columns):
    max_occ = df[e].value_counts().iloc[0]  # occurrences of the most common value
    if 2 * max_occ > n:                     # check if it is more than half the len of the dataset
        cols_to_drop.append(e)
df = df.drop(cols_to_drop, axis=1)
You can use apply + value_counts, taking the first value to get the max count:
count = df.apply(lambda s: s.value_counts().iat[0])
col1 4
col2 2
col3 6
dtype: int64
Thus, simply turn it into a mask depending on whether the greatest count is more than half len(df), and slice:
count = df.apply(lambda s: s.value_counts().iat[0])
df.loc[:, count.le(len(df)/2)] # use 'lt' if needed to drop if exactly half
output:
col2
0 0
1 1
2 0
3 1
4 2
5 3
Used input:
df = pd.DataFrame({'col1': [0, 1, 0, 0, 0, 1],
                   'col2': [0, 1, 0, 1, 2, 3],
                   'col3': [0, 0, 0, 0, 0, 0],
                   })
Boolean slicing with a comprehension:
df.loc[:, [
    df.shape[0] // s.value_counts().max() >= 2
    for _, s in df.iteritems()   # use df.items() on pandas >= 2.0, where iteritems was removed
]]
col2
0 0
1 1
2 0
3 1
4 2
5 3
Credit to #mozway for input data.
I have a dataset (shown as an image in the original post). What I would like to do is three things.
Step 1: AA to CC is an index; however, I'm happy to keep it in the dataset for future purposes.
Step 2: Count the 0 values in each row.
Step 3: If 0 makes up more than 20% of a row, remove the row; in this case that means more than 2 zeros, because DD to MM is 10 columns.
So I took a rather clumsy approach to achieve the three steps above.
df = pd.read_csv("dataset.csv", header=None)
df_bool = (df == "0")
print(df_bool.sum(axis=1))
Then I got the expected result shown below.
0 0
1 0
2 1
3 0
4 1
5 8
6 1
7 0
So I removed row #5 as indicated below.
df2 = df.drop([5], axis=0)
print(df2)
This works well, even though it's not an elegant way to go about it.
However, if I import my dataset with header=0, this approach does not work at all.
df = pd.read_csv("dataset.csv", header=0)
0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 0
Why does this happen?
Also, if I wanted to write the code with loop, count and drop functions, what would it look like?
You can just continue using boolean indexing.
First we calculate the number of columns and the number of zeroes per row:
n_columns = len(df.columns) # or df.shape[1]
zeroes = (df == "0").sum(axis=1)
We then select only rows that have less than 20 % zeroes.
proportion_zeroes = zeroes / n_columns
max_20 = proportion_zeroes < 0.20
df[max_20] # This will contain only rows that have less than 20 % zeroes
One liner:
df[((df == "0").sum(axis=1) / len(df.columns)) < 0.2]
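The question also asks why the header=0 read breaks the comparison, and what a loop-based version would look like. A likely cause is that with a header row pandas infers numeric dtypes, so the cells hold the number 0 rather than the string "0", and df == "0" never matches. Below is a minimal loop/count/drop sketch under that assumption (the 20% threshold comes from the question; the variable names are my own):
# assumption: zeros are numeric after read_csv(header=0), so compare against 0;
# use (df.astype(str) == "0") if the file really does contain the string "0"
rows_to_drop = []
for idx, row in df.iterrows():
    zero_count = (row == 0).sum()          # count the zeros in this row
    if zero_count / df.shape[1] > 0.2:     # more than 20% of the columns are zero
        rows_to_drop.append(idx)
df = df.drop(rows_to_drop, axis=0)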
It would have been great if you could have posted how the dataframe looks in pandas rather than a picture of an Excel file. However, constructing a dummy df:
df = pd.DataFrame({'index1': ['a', 'b', 'c'], 'index2': ['b', 'g', 'f'], 'index3': ['w', 'q', 'z'],
                   'Col1': [0, 1, 0], 'Col2': [1, 1, 0], 'Col3': [1, 1, 1], 'Col4': [2, 2, 0]})
Step 1: assigning the index can be done using the .set_index() method, as per below:
df.set_index(['index1','index2','index3'],inplace=True)
Instead of doing everything manually when it comes to filtering, you can use the result you got from df_bool.sum(axis=1) directly when filtering the dataframe, as per below:
df.loc[(df==0).sum(axis=1) / (df.shape[1])>0.6]
                      Col1  Col2  Col3  Col4
index1 index2 index3
c      f      z          0     0     1     0
Using that, you can drop those rows; assuming the 20% threshold, you would use:
df = df.loc[(df==0).sum(axis=1) / (df.shape[1])<0.2]
When it comes to the header issue, it's a bit difficult to answer without seeing what the file or dataframe looks like.
I want something like this.
Index  Sentence
0      I
1      want
2      like
3      this

Keyword  Index
want     1
this     3
I tried with df.index("Keyword"), but it's not giving the result for all the rows. It would be really helpful if someone could solve this.
Use isin with boolean indexing only:
df = df[df['Sentence'].isin(['want', 'this'])]
print (df)
Index Sentence
1 1 want
3 3 this
EDIT: If you need to compare against another column:
df = df[df['Sentence'].isin(df['Keyword'])]
#another DataFrame df2
#df = df[df['Sentence'].isin(df2['Keyword'])]
And if you need the index values:
idx = df.index[df['Sentence'].isin(df['Keyword'])]
#alternative
#idx = df[df['Sentence'].isin(df['Keyword'])].index
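If the goal is the small Keyword/Index table from the question, here is a minimal sketch; the sample data and the keyword list are assumptions based on the example above:
import pandas as pd

# assumed sample data matching the question
df = pd.DataFrame({'Sentence': ['I', 'want', 'like', 'this']})
keywords = ['want', 'this']

# keep only the keyword rows, then turn the index into a column
result = (df.loc[df['Sentence'].isin(keywords), 'Sentence']
            .rename('Keyword')
            .rename_axis('Index')
            .reset_index())
print(result)
#    Index Keyword
# 0      1    want
# 1      3    this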
I have a large dataset with columns labelled from 1 - 65 (among other titled columns), and want to find how many of the columns, per row, have a string (of any value) in them. For example, if all rows 1 - 65 are filled, the count should be 65 in this particular row, if only 10 are filled then the count should be 10.
Is there any easy way to do this? I'm currently using the following code, which is taking very long as there are a large number of rows.
array = pd.read_csv(csvlocation, encoding="ISO-8859-1")
for i in range(0, lengthofarray):
    for k in range(1, 66):
        if array[k][i] != "":
            array["count"][i] = array["count"][i] + 1
From my understanding of the post and the subsequent comments, you are interested in knowing the number of strings in each row for the columns labelled 1 through 65. There are two steps: first, subset your data down to columns 1 through 65; then count the number of strings in each row. To do this:
import pandas as pd
import numpy as np
# create sample data
df = pd.DataFrame({'col1': list('abdecde'),
                   'col2': np.random.rand(7)})
# change one val of column two to string for illustration purposes
df.loc[3, 'col2'] = 'b'
# to create the subset of columns, you could use
# subset = [str(num) for num in list(range(1, 66))]
# and then just use df[subset]
# for each row, count the number of columns that have a string value.
# applymap operates elementwise, so we are essentially creating a new
# representation of your data where a 1 means the value was a string
# and a 0 means it was not.
# we then sum along the rows to get the final counts
col_str_counts = np.sum(df.applymap(lambda x: 1 if isinstance(x, str) else 0), axis=1)
# we changed the column-two value of row 3 to a string above, so the count for that row should be 2:
col_str_counts[3]
>>> 2
# and for the subset, it would simply become:
# col_str_counts = np.sum(df[subset].applymap(lambda x: 1 if isinstance(x, str) else 0), axis=1)
You should be able to adapt your problem to this example
Say we have this dataframe
df = pd.DataFrame([["","foo","bar"],["","","bar"],["","",""],["foo","bar","bar"]])
     0    1    2
0       foo  bar
1            bar
2
3  foo  bar  bar
Then we create a boolean mask where a cell != "" and sum those values per row:
df['count'] = (df != "").sum(1)
print(df)
     0    1    2  count
0       foo  bar      2
1            bar      1
2                     0
3  foo  bar  bar      3
import pandas

df = pandas.DataFrame([["", "foo", "bar"], ["", "", "bar"], ["", "", ""], ["foo", "bar", "bar"]])
total_cells = df.size
df['filled_cell_count'] = (df != "").sum(1)
print(f"{df}")
     0    1    2  filled_cell_count
0       foo  bar                  2
1            bar                  1
2                                 0
3  foo  bar  bar                  3
total_filled_cells = df['filled_cell_count'].sum() / total_cells  # fraction of filled cells
print()
print(f"Total Filled Cells in dataframe: {total_filled_cells}")
Total Filled Cells in dataframe: 0.5