Getting rows where multiple columns are not blank in pandas - python

I have a table like so
id    col_1  col_2  col_3
101   1      17     12
102          17
103   4
                    2
How do I select only the records where col_1, col_2, and col_3 are all non-blank?
Expected output:
id col_1 col_2 col_3
101 1 17 12

This selects only those rows of the DataFrame where all of ['col_1', 'col_2', 'col_3'] are non-empty:
df[df[['col_1', 'col_2', 'col_3']].ne('').all(axis=1)]
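A minimal runnable sketch of this answer, assuming the blanks are empty strings (the exact blank positions in the sample, and the id 104, are assumptions):

import pandas as pd

# Hypothetical reconstruction of the sample, with blanks as empty strings
df = pd.DataFrame({
    'id': [101, 102, 103, 104],
    'col_1': ['1', '', '4', ''],
    'col_2': ['17', '17', '', ''],
    'col_3': ['12', '', '', '2'],
})

# Keep only the rows where all three columns are non-empty
mask = df[['col_1', 'col_2', 'col_3']].ne('').all(axis=1)
print(df[mask])  # only id 101 survives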

Here is one way to do it:
make a list of the blanks, nulls, etc., flag the cells that contain any of those values as True/False, and take the row-wise sum; you want the rows where the sum is zero:

import numpy as np
df[df.isin([' ', '', np.nan]).astype(int).sum(axis=1).eq(0)]
id col_1 col_2 col_3
0 101 1 17 12

This can be done using DataFrame.query():
df = df.query(' and '.join([f'{col}!=""' for col in ['col_1','col_2','col_3']]))
Alternatively you can do it this way:
df = df[lambda x: (x.drop(columns='id') != "").all(axis=1)]
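If the blanks are actually NaN (as they typically are after read_csv), dropna with a subset is a simpler alternative; a minimal sketch under that assumption:

import pandas as pd
import numpy as np

df = pd.DataFrame({'id': [101, 102],
                   'col_1': [1, np.nan],
                   'col_2': [17, 17],
                   'col_3': [12, np.nan]})

# Drop any row with NaN in one of the three columns
print(df.dropna(subset=['col_1', 'col_2', 'col_3']))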


Creating a column containing the other columns as a JSON object?

I'm trying to add a column to my DataFrame that contains the information from the other columns as a JSON object.
My dataframe looks like this:
col_1  col_2
1      1
2      2
I'm then trying to add the JSON column using the following:
for i, row in df.iterrows():
    i_val = row.to_json()
    df.at[i, 'raw_json'] = i_val
However, it results in a "cascaded" DataFrame where the JSON appears twice:
col_1  col_2  raw_json
1      1      {"col_1":1,"col_2":1,"raw_json":{"col_1":1,"col_2":1}}
2      2      {"col_1":2,"col_2":2,"raw_json":{"col_1":2,"col_2":2}}
I'm expecting it to look like the following:
col_1  col_2  raw_json
1      1      {"col_1":1,"col_2":1}
2      2      {"col_1":2,"col_2":2}
Use df.to_json(orient='records'), but note that it returns a single string for the whole frame, so assigning it directly puts the entire array in every row:
df['raw_json'] = df.to_json(orient='records')
   col_1  col_2                                       raw_json
0      1      1  [{"col_1":1,"col_2":1},{"col_1":2,"col_2":2}]
1      2      2  [{"col_1":1,"col_2":1},{"col_1":2,"col_2":2}]
Parsing it back into a list first gives each row its own object:
import json
df['raw_json'] = json.loads(df.to_json(orient='records'))
Using a list comprehension and iterrows (your expected output has a dict; if you want a list of dicts instead, you can remove the [0]):
df["raw_json"] = [pd.DataFrame(data=[row], columns=df.columns).to_dict(orient="records")[0] for _, row in df.iterrows()]
print(df)
Output:
col_1 col_2 raw_json
0 1 1 {'col_1': 1, 'col_2': 1}
1 2 2 {'col_1': 2, 'col_2': 2}
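A shorter route that skips the JSON round-trip, if plain dicts per row are enough (a sketch, not from the original answers):

import pandas as pd

df = pd.DataFrame({'col_1': [1, 2], 'col_2': [1, 2]})

# to_dict(orient='records') already yields one dict per row
df['raw_json'] = df.to_dict(orient='records')
print(df)
#    col_1  col_2                  raw_json
# 0      1      1  {'col_1': 1, 'col_2': 1}
# 1      2      2  {'col_1': 2, 'col_2': 2}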

How to summarize rows on column in pandas dataframe

I have a DataFrame that looks like this:
Col2 Col3
0 5 8
1 1 0
2 3 5
3 4 1
4 0 7
How can I sum the values and get rid of the index, to make it look like this?
Col2 Col3
13 21
Sample code:
import pandas as pd
df = pd.DataFrame()
df["Col1"] = [0,2,4,6,2]
df["Col2"] = [5,1,3,4,0]
df["Col3"] = [8,0,5,1,7]
df["Col4"] = [1,4,6,0,8]
df_new = df.iloc[:, 1:3]
print(df_new)
Use .sum() to get the sums for each column. It produces a Series where each row contains the sum of a column. Transpose this to turn each row into a column, and then use .to_string(index=False) to print out the DataFrame without the index:
pd.DataFrame(df_new.sum()).T.to_string(index=False)
This outputs:
Col2 Col3
13 21
You can try summing the columns and transposing the resulting Series back into a one-row DataFrame:
df_new = df_new.sum(axis=0).to_frame().T
print(df_new.to_string(index=False))
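Putting the pieces together as a runnable sketch with the question's sample code:

import pandas as pd

df = pd.DataFrame({'Col1': [0, 2, 4, 6, 2],
                   'Col2': [5, 1, 3, 4, 0],
                   'Col3': [8, 0, 5, 1, 7],
                   'Col4': [1, 4, 6, 0, 8]})
df_new = df.iloc[:, 1:3]  # Col2 and Col3

# Sum each column, then turn the Series back into a one-row frame
totals = df_new.sum().to_frame().T
print(totals.to_string(index=False))
#  Col2  Col3
#    13    21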

Match columns pandas Dataframe

I want to match two pandas Dataframes by the name of their columns.
import pandas as pd
df1 = pd.DataFrame([[0,2,1],[1,3,0],[0,4,0]], columns=['A', 'B', 'C'])
A B C
0 0 2 1
1 1 3 0
2 0 4 0
df2 = pd.DataFrame([[0,0,1],[1,5,0],[0,7,0]], columns=['A', 'B', 'D'])
A B D
0 0 0 1
1 1 5 0
2 0 7 0
If the names match, do nothing. (Keep the column of df2)
If a column is in Dataframe 1 but not in Dataframe 2, add the column in Dataframe 2 as a vector of zeros.
If a column is in Dataframe 2 but not in Dataframe 1, drop it.
The output should look like this:
A B C
0 0 0 0
1 1 5 0
2 0 7 0
I know if I do:
df2 = df2[df1.columns]
I get:
KeyError: "['C'] not in index"
I could also add the vectors of zeros manually, but of course this is a toy example of a much longer dataset. Is there any smarter/pythonic way of doing this?
It appears that df2 columns should be the same as df1 columns after this operation, as columns that are in df1 and not df2 should be added, while columns only in df2 should be removed. We can simply reindex df2 to match df1 columns with a fill_value=0 (this is the safe equivalent to df2 = df2[df1.columns] when adding new columns with a fill value):
df2 = df2.reindex(columns=df1.columns, fill_value=0)
df2:
A B C
0 0 0 0
1 1 5 0
2 0 7 0
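A runnable sketch of the reindex approach, using the frames from the question:

import pandas as pd

df1 = pd.DataFrame([[0, 2, 1], [1, 3, 0], [0, 4, 0]], columns=['A', 'B', 'C'])
df2 = pd.DataFrame([[0, 0, 1], [1, 5, 0], [0, 7, 0]], columns=['A', 'B', 'D'])

# Missing columns ('C') are created and filled with 0; extra columns ('D') are dropped
df2 = df2.reindex(columns=df1.columns, fill_value=0)
print(df2)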

I want to merge a DataFrame with a CSV

I need to merge a DataFrame with a CSV.
df1 contains only 1 column (the ids of the products I want to update).
df2 contains 2 columns (the ids of all the products, and their quantities).
df1=pd.read_csv(id_file, header=0, index_col=False)
df2 = pd.DataFrame(data=result_q)
df3=pd.merge(df1, df2)
What I want: a DataFrame that contains only the ids from the CSV (df1), merged with the quantities from df2 for the same id.
If you only want the products that you have in the first DataFrame, you can use this:
df_1
Out[11]:
id
0 1
1 2
2 4
3 5
df_2
Out[12]:
id prod
0 1 a
1 2 b
2 3 c
3 4 d
4 5 e
5 6 f
6 7 g
7 8 h
df_3 = df_1.merge(df_2,on='id')
df_3
Out[14]:
id prod
0 1 a
1 2 b
2 4 d
3 5 e
You need to use the parameter on='column' so that the merge generates a new DataFrame containing only the corresponding rows that share the same id.
You can use new_df = pd.merge(df1, df2, on=['Product_id'])
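A minimal sketch of the inner merge under the question's setup (the sample ids and quantities are assumptions; df1 would really come from pd.read_csv(id_file)):

import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 4]})                # ids to update
df2 = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                    'quantity': [10, 5, 7, 3, 9]})   # all products

# An inner merge (the default) keeps only the ids present in both frames
df3 = df1.merge(df2, on='id')
print(df3)
#    id  quantity
# 0   1        10
# 1   2         5
# 2   4         3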
I've found the solution. I needed to reset the index for my df2
df1=pd.read_csv(id_file)
df2 = pd.DataFrame(data=result_q).reset_index()
df1['id'] = pd.to_numeric(df1['id'], errors = 'coerce')
df2['id'] = pd.to_numeric(df2['id'], errors = 'coerce')
df3=df1.merge(df2, on='id')
Thank you everyone!

How to reverse a dummy variables from a pandas dataframe

I would like to reverse a dataframe with dummy variables. For example,
from df_input:
Course_01 Course_02 Course_03
0 0 1
1 0 0
0 1 0
To df_output
Course
0 03
1 01
2 02
I have been looking at the solution provided at Reconstruct a categorical variable from dummies in pandas, but it did not work. Any help would be much appreciated.
We can use wide_to_long, then select the rows that are not equal to zero, i.e.:
ndf = pd.wide_to_long(df, stubnames='T_', i='id',j='T')
T_
id T
id1 30 0
id2 30 1
id1 40 1
id2 40 0
not_dummy = ndf[ndf['T_'].ne(0)].reset_index().drop(columns='T_')
id T
0 id2 30
1 id1 40
Update based on your edit:
ndf = pd.wide_to_long(df.reset_index(), stubnames='T_', i='index', j='T')
not_dummy = ndf[ndf['T_'].ne(0)].reset_index(level='T').drop(columns='T_')
T
index
1 30
0 40
You can use:
#create id to index if necessary
df = df.set_index('id')
#create MultiIndex
df.columns = df.columns.str.split('_', expand=True)
#reshape by stack and remove 0 rows
df = df.stack().reset_index().query('T != 0').drop(columns='T').rename(columns={'level_1':'T'})
print(df)
id T
1 id1 40
2 id2 30
EDIT:
import numpy as np

col_name = 'Course'
df.columns = df.columns.str.split('_', expand=True)
df = (df.replace(0, np.nan)
        .stack()
        .reset_index()
        .drop(columns=[col_name, 'level_0'])
        .rename(columns={'level_1': col_name})
     )
print(df)
Course
0 03
1 01
2 02
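As a side note, pandas 1.5+ ships a built-in inverse of get_dummies; a minimal sketch, assuming each row is strictly one-hot:

import pandas as pd

df_input = pd.DataFrame({'Course_01': [0, 1, 0],
                         'Course_02': [0, 0, 1],
                         'Course_03': [1, 0, 0]})

# from_dummies splits the column names on sep and recovers the category
print(pd.from_dummies(df_input, sep='_'))
#   Course
# 0     03
# 1     01
# 2     02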
Suppose you have the following dummy DF:
In [152]: d
Out[152]:
id T_30 T_40 T_50
0 id1 0 1 1
1 id2 1 0 1
we can prepare the following helper Series:
In [153]: v = pd.Series(d.columns.drop('id').str.replace(r'\D', '', regex=True).astype(int), index=d.columns.drop('id'))
In [155]: v
Out[155]:
T_30 30
T_40 40
T_50 50
dtype: int64
now we can multiply them, stack and filter:
In [154]: d.set_index('id').mul(v).stack().reset_index(name='T').drop(columns='level_1').query("T > 0")
Out[154]:
id T
1 id1 40
2 id1 50
3 id2 30
5 id2 50
I think melt() was pretty much made for this?
Your data, I think:
df_input = pd.DataFrame.from_dict({'Course_01': [0, 1, 0],
                                   'Course_02': [0, 0, 1],
                                   'Course_03': [1, 0, 0]})
Change names to match your desired output:
df_input.columns = df_input.columns.str.replace('Course_','')
Melt the dataframe:
dataMelted = pd.melt(df_input,
                     var_name='Course',
                     ignore_index=False)
Clean up zeros, etc:
df_output = (dataMelted[dataMelted['value'] != 0]
             .drop('value', axis=1)
             .sort_index())
>>> df_output
Course
0 03
1 01
2 02
# Create a new column for the categorical
df['categ'] = 0
for i in range(len(df)):
    if df.loc[i, 'Course_01'] == 1:
        df.loc[i, 'categ'] = '01'
    if df.loc[i, 'Course_02'] == 1:
        df.loc[i, 'categ'] = '02'
    if df.loc[i, 'Course_03'] == 1:
        df.loc[i, 'categ'] = '03'
df['categ'] = df['categ'].astype('category')
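For completeness, a vectorized sketch that avoids the Python loop (assuming exactly one 1 per row):

import pandas as pd

df_input = pd.DataFrame({'Course_01': [0, 1, 0],
                         'Course_02': [0, 0, 1],
                         'Course_03': [1, 0, 0]})

# idxmax(axis=1) returns the name of the column holding the 1 in each row;
# stripping the prefix leaves the category code
df_output = df_input.idxmax(axis=1).str.replace('Course_', '', regex=False).to_frame('Course')
print(df_output)
#   Course
# 0     03
# 1     01
# 2     02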
