pandas: select rows by condition across all dataframe columns - python

I have a dataframe:
import pandas as pd

d = {'col1': [1, 2], 'col2': [3, 4], 'col3': [5, 6]}
df = pd.DataFrame(data=d)
df
   col1  col2  col3
0     1     3     5
1     2     4     6
For example, I need to select all rows with value = 1, so my code is:
df[df['col1']==1]
   col1  col2  col3
0     1     3     5
But how can I check not only 'col1' but all columns? I have tried this code:
for col in df.columns:
    print(df[df[col] == 1])
but the output is not a single pandas DataFrame view:
   col1  col2  col3
0     1     3     5
Empty DataFrame
Columns: [col1, col2, col3]
Index: []
Empty DataFrame
Columns: [col1, col2, col3]
Index: []
Can I go over all the columns and get a single DataFrame view like the one above?

You can use df.eq to check if any value in the df is equal to 1, then df.any on axis=1, which returns True for every row where any of the column values is 1. Finally, use boolean indexing:
output = df[df.eq(1).any(axis=1)]
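With the sample frame above, the intermediate mask and the final selection look like this:
>>> df.eq(1)
    col1   col2   col3
0   True  False  False
1  False  False  False
>>> df[df.eq(1).any(axis=1)]
   col1  col2  col3
0     1     3     5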

Related

Creating a New Column in a Pandas Dataframe in a more pythonic way

I am trying to find a better, more pythonic way of accomplishing the following:
I want to add a new column to business_df called 'dot_prod', which is the dot product of a fixed vector (fixed_vector) and a vector from another data frame (rating_df). The rows of both business_df and rating_df have the same index values (business_id).
I have this loop, which appears to work; however, I know it's super clumsy (and takes forever). Essentially it loops through once for every row, calculates the dot product, then dumps it into the business_df dataframe.
n = 0
for i in range(business_df.shape[0]):
    dot_prod = np.dot(fixed_vector, rating_df.iloc[n])
    business_df['dot_prod'][n] = dot_prod
    n += 1
IIUC, you are looking for apply across axis=1 like:
business_df['dot_prod'] = rating_df.apply(lambda x: np.dot(fixed_vector, x), axis=1)
>>> fixed_vector = [1, 2, 3]
>>> df = pd.DataFrame({'col1' : [1,2], 'col2' : [3,4], 'col3' : [5,6]})
>>> df
   col1  col2  col3
0     1     3     5
1     2     4     6
>>> df['col4'] = np.dot(fixed_vector, [df['col1'], df['col2'], df['col3']])
>>> df
   col1  col2  col3  col4
0     1     3     5    22
1     2     4     6    28
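As a side note, if rating_df's columns already line up with the entries of fixed_vector (an assumption here), DataFrame.dot computes the same row-wise dot product in one vectorized call, avoiding the per-row Python overhead of apply. With the toy frame above:
>>> df[['col1', 'col2', 'col3']].dot(fixed_vector)
0    22
1    28
dtype: int64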

Pandas DataFrame filter

My question is about the pandas.DataFrame.filter command. It seems that pandas creates a copy of the data frame when writing any changes. How can I write to the data frame itself?
In other words:
d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
df.filter(regex='col1').iloc[0]=10
Output:
   col1  col2
0     1     3
1     2     4
Desired Output:
   col1  col2
0    10     3
1     2     4
I think you need to extract the column names and then use the loc or iloc functions:
cols = df.filter(regex='col1').columns
df.loc[0, cols]=10
Or:
df.iloc[0, df.columns.get_indexer(cols)] = 10
print (df)
   col1  col2
0    10     3
1     2     4
You cannot use the filter function for this, because it returns a new Series/DataFrame whose data may be a view. That's why a SettingWithCopyWarning is possible there (or an error is raised if you set the option).
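Putting it together with the sample data from the question:
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
cols = df.filter(regex='col1').columns  # names of the matching columns
df.loc[0, cols] = 10                    # write through loc on the original frame
print(df)
   col1  col2
0    10     3
1     2     4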

Python Pandas .groupby().mean() with NaN values [duplicate]

Assuming that I have a dataframe with the following values:
df:
col1  col2  value
   1     2      3
   1     2      1
   2     3      1
I want to first group my dataframe by the first two columns (col1 and col2) and then average over the values of the third column (value). So the desired output would look like this:
col1  col2  avg-value
   1     2          2
   2     3          1
I am using the following code:
columns = ['col1','col2','avg']
df = pd.DataFrame(columns=columns)
df.loc[0] = [1,2,3]
df.loc[1] = [1,3,3]
print(df[['col1','col2','avg']].groupby('col1','col2').mean())
which gets the following error:
ValueError: No axis named col2 for object type <class 'pandas.core.frame.DataFrame'>
Any help would be much appreciated.
You need to pass a list of the columns to groupby; what you passed was interpreted as the axis param, which is why it raised an error:
In [30]:
columns = ['col1','col2','avg']
df = pd.DataFrame(columns=columns)
df.loc[0] = [1,2,3]
df.loc[1] = [1,3,3]
print(df[['col1','col2','avg']].groupby(['col1','col2']).mean())
           avg
col1 col2
1    2       3
     3       3
If you want to group by multiple columns, you should put them in a list:
columns = ['col1','col2','value']
df = pd.DataFrame(columns=columns)
df.loc[0] = [1,2,3]
df.loc[1] = [1,3,3]
df.loc[2] = [2,3,1]
print(df.groupby(['col1','col2']).mean())
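With those three rows, the grouped means print roughly as:
           value
col1 col2
1    2         3
     3         3
2    3         1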
Or, slightly more verbose, for the sake of getting the word 'avg' into your aggregated dataframe, use named aggregation (the nested-dict renaming syntax was removed in modern pandas):
columns = ['col1','col2','value']
df = pd.DataFrame(columns=columns)
df.loc[0] = [1,2,3]
df.loc[1] = [1,3,3]
df.loc[2] = [2,3,1]
print(df.groupby(['col1','col2']).agg(avg=('value', 'mean')))

Group by with aggregation function as new field in pandas

If I do the following group by on a MySQL table
SELECT col1, count(col2) * count(distinct(col3)) as agg_col
FROM my_table
GROUP BY col1
what I get is a table with two columns
col1 agg_col
How can I do the same on a pandas dataframe?
Suppose I have a DataFrame with three columns: col1, col2 and col3. The group by operation
grouped = my_df.groupby('col1')
will return the data grouped by col1.
Also
agg_col_series = grouped.col2.size() * grouped.col3.nunique()
will return the aggregated column equivalent to the one in the SQL query. But how can I add this to the grouped dataframe?
We'd need to see your data to be sure, but I think you need to simply reset the index of your agg_col_series:
agg_col_series.reset_index(name='agg_col')
Full example with dummy data:
import random
import pandas as pd

col1 = [random.randint(1,5) for x in range(1,1000)]
col2 = [random.randint(1,100) for x in range(1,1000)]
col3 = [random.randint(1,100) for x in range(1,1000)]
df = pd.DataFrame(data={
    'col1': col1,
    'col2': col2,
    'col3': col3,
})
grouped = df.groupby('col1')
agg_col_series = grouped.col2.size() * grouped.col3.nunique()
print(agg_col_series.reset_index(name='agg_col'))
   col1  agg_col
0     1    15566
1     2    20056
2     3    17313
3     4    17304
4     5    16380
Let's use groupby with a lambda function that uses size and nunique, then rename the series to 'agg_col' and reset_index to get a dataframe.
import pandas as pd
import numpy as np
np.random.seed(443)
df = pd.DataFrame({'Col1': np.random.choice(['A','B','C'], 50),
                   'Col2': np.random.randint(1000, 9999, 50),
                   'Col3': np.random.choice(['A','B','C','D','E','F','G','H','I','J'], 50)})
df_out = df.groupby('Col1').apply(lambda x: x.Col2.size * x.Col3.nunique()).rename('agg_col').reset_index()
Output:
  Col1  agg_col
0    A      120
1    B       96
2    C      190
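A third option, as a sketch, reuses the two grouped aggregations from the first answer via named aggregation (the intermediate column names n and u are arbitrary):
g = df.groupby('Col1').agg(n=('Col2', 'size'), u=('Col3', 'nunique'))
df_out = (g['n'] * g['u']).rename('agg_col').reset_index()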

Python - Combining pandas dataframes

I have 3 dataframes that I'd like to combine. They look like this:
df1        |df2        |df3
col1 col2  |col1 col2  |col1 col3
   1    5  |   2    9  |   1 some
           |           |   2 data
I'd like the first two dfs to be merged into the third df based on col1, so the desired output is:
df3
col1  col3  col2
   1  some     5
   2  data     9
How can I achieve this? I'm trying:
df3['col2'] = df1[df1.col1 == df3.col1].col2 if df1[df1.col1 == df3.col1].col2 is not None else df2[df2.col1 == df3.col1].col2
For this I get ValueError: Series lengths must match to compare
It is guaranteed that df3's col1 values are present in either df1 or df2. What's the way to do this? Please note that a simple concat will not work, since there is other data in df3, not just col1.
If df1 and df2 don't have duplicates in col1, you can try this:
pd.concat([df1, df2]).merge(df3)
Data:
df1 = pd.DataFrame({'col1': [1], 'col2': [5]})
df2 = pd.DataFrame({'col1': [2], 'col2': [9]})
df3 = pd.DataFrame({'col1': [1,2], 'col3': ['some', 'data']})
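With that data, the concat stacks df1 and df2 on top of each other, and the merge then lines the rows up with df3 on col1:
>>> pd.concat([df1, df2]).merge(df3)
   col1  col2  col3
0     1     5  some
1     2     9  data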
