I have a data frame. I want to change values in column C to null values based on whether conditions in columns A and B are met. To do this, I think I need to iterate over the rows of the dataframe, but I can't figure out how:
import pandas as pd
df = {'A': [1, 4, 1, 4], 'B': [9, 2, 5, 3], 'C': [0, 0, 5, 3]}
I tried something like this:
for row in df.iterrows()
    if df['A'] > 2 and df['B'] == 3:
        df['C'] == np.nan
but I just keep getting errors. Could someone please show me how to do this?
Yours is not a DataFrame, it's a dictionary. This is a DataFrame:
import pandas as pd
df = pd.DataFrame({'A': [1, 4, 1, 4], 'B': [9, 2, 5, 3], 'C': [0, 0, 5, 3]})
It is usually much faster to use vectorized pandas/numpy operations instead of plain Python loops:
import numpy as np

df.loc[(df['A'].values > 2) & (df['B'].values == 3), 'C'] = np.nan
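For the sample data, only the last row meets both conditions. Note that assigning NaN upcasts the integer column C to float; the output below is reproduced by hand from the 4-row example, so it is worth re-checking locally:
print(df)
#    A  B    C
# 0  1  9  0.0
# 1  4  2  0.0
# 2  1  5  5.0
# 3  4  3  NaN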
Or, if you insist on iterating, the code (besides constructing df as a real DataFrame) can be fixed like this:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [1, 4, 1, 4], 'B': [9, 2, 5, 3], 'C': [0, 0, 5, 3]})
for i, row in df.iterrows():
    if row.loc['A'] > 2 and row.loc['B'] == 3:
        df.loc[i, 'C'] = np.nan
or
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [1, 4, 1, 4], 'B': [9, 2, 5, 3], 'C': [0, 0, 5, 3]})
for i, row in df.iterrows():
    if df.loc[i, 'A'] > 2 and df.loc[i, 'B'] == 3:
        df.loc[i, 'C'] = np.nan
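If an explicit loop is really wanted, itertuples is usually faster than iterrows. A minimal sketch of the same update, on the same sample data:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 4, 1, 4], 'B': [9, 2, 5, 3], 'C': [0, 0, 5, 3]})

# itertuples yields namedtuples; .Index carries the row label
for t in df.itertuples():
    if t.A > 2 and t.B == 3:
        df.loc[t.Index, 'C'] = np.nan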
You can also try
df.loc[(df["A"].values > 2) & (df["B"].values == 3), "C"] = None
(in a numeric column, pandas coerces None to NaN).
Using pandas and numpy is way easier for you :D
Having a random df:
df = pd.DataFrame([[1,2,3,4],[4,5,6,7],[7,8,9,10],[10,11,12,13],[14,15,16,17]], columns=['A', 'B','C','D'])
cols_in = list(df)[0:2]+list(df)[4:]
now:
x = []
for i in range(df.shape[0]):
    x.append(df.iloc[i, cols_in])
Obviously, inside the loop, x raises an error because of the cols_in argument passed to iloc.
How is it possible to apply this kind of mixed-style slicing of df, as in the append call?
It seems like you want to exclude one column? There is no column index 4, so depending on which columns you want, something like this might be what you are after:
import numpy as np
import pandas as pd

df = pd.DataFrame([[1,2,3,4],[4,5,6,7],[7,8,9,10],[10,11,12,13],[14,15,16,17]], columns=['A', 'B','C','D'])
If you want to get the column indices from the column names you can do:
cols = ['A', 'B', 'D']
cols_in = np.nonzero(df.columns.isin(cols))[0]
x = []
for i in range(df.shape[0]):
    x.append(df.iloc[i, cols_in].to_list())
x
Output:
[[1, 2, 4], [4, 5, 7], [7, 8, 10], [10, 11, 13], [14, 15, 17]]
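As an aside, if the end goal is just that list of lists, the explicit loop can be skipped entirely; a sketch using label-based selection (assuming the wanted columns are A, B and D, as above):
# select the wanted columns by label and convert all rows at once
x = df.loc[:, ['A', 'B', 'D']].to_numpy().tolist()
# [[1, 2, 4], [4, 5, 7], [7, 8, 10], [10, 11, 13], [14, 15, 17]]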
I have two pandas DataFrames, and they do not have the same length. df1 has unique ids in column id. These ids occur (multiple times) in df2.colA. For each df1.id I'd like to collect all occurrences in df2.colA, together with the value of another column (colB) at the matching index, into a new column of df1: either just the index of the match in df2.colA, or additionally the other row entries of all matches.
Example:
df1.id = [1, 2, 3, 4]
df2.colA = [3, 4, 4, 2, 1, 1]
df2.colB = [5, 9, 6, 5, 8, 7]
So that my operation creates something like:
df1.colAB = [ [[1,8],[1,7]], [[2,5]], [[3,5]], [[4,9],[4,6]] ]
I've tried a bunch of approaches with mapping, explicit looping (super slow), checking with isin, etc.
You could use pandas apply to iterate over each value of df1.id while building the list of matches. Boolean indexing on df2.colA selects the rows of df2 whose colA equals the current id; loc then pulls the corresponding colB values, and a list comprehension inside the apply builds the [id, value] pairs.
import pandas as pd
# setup
df1 = pd.DataFrame({'id':[1,2,3,4]})
print(df1)
df2 = pd.DataFrame({
    'colA' : [3, 4, 4, 2, 1, 1],
    'colB' : [5, 9, 6, 5, 8, 7]
})
print(df2)
# code
# for each id, collect [id, colB] pairs from the matching rows of df2
df1['colAB'] = df1['id'].apply(
    lambda row: [[row, val] for val in df2.loc[df2.colA == row, 'colB']])
print(df1)
Output from df1
   id             colAB
0   1  [[1, 8], [1, 7]]
1   2          [[2, 5]]
2   3          [[3, 5]]
3   4  [[4, 9], [4, 6]]
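If df2 is large, scanning it once per id gets slow; here is a sketch of a single-pass alternative that builds a lookup dict with groupby first (the name lookup is just illustrative; column names as in the example):
# build {colA value: [colB values]} once, then map over df1.id
lookup = df2.groupby('colA')['colB'].apply(list).to_dict()
df1['colAB'] = df1['id'].map(lambda i: [[i, v] for v in lookup.get(i, [])])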
I was wondering if I can use the pandas .drop method to drop rows when chaining methods to construct a data frame.
Dropping rows is straightforward once the data frame exists:
import pandas as pd
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [5, 4, 3]})
print(df1)
# drop the entries that match "2"
df1 = df1[df1['A'] != 2]
print(df1)
However, I would like to do this while I am creating the data frame:
df2 = (pd.DataFrame({'A': [1, 2, 3], 'B': [5, 4, 3]})
       .rename(columns={'A': 'AA'})
       # .drop(lambda x: x['A']!=2)
       )
print(df2)
The commented line does not work, but maybe there is a correct way of doing this. Grateful for any input.
Use DataFrame.loc with a callable:
df2 = (pd.DataFrame({'A': [1, 2, 3], 'B': [5, 4, 3]})
       .rename(columns={'A': 'AA'})
       .loc[lambda x: x['AA'] != 2]
       )
Or DataFrame.query:
df2 = (pd.DataFrame({'A': [1, 2, 3], 'B': [5, 4, 3]})
       .rename(columns={'A': 'AA'})
       .query("AA != 2")
       )
print(df2)
   AA  B
0   1  5
2   3  3
You can use DataFrame.apply with DataFrame.dropna:
import numpy as np

df2 = (pd.DataFrame({'A': [1, 2, 3], 'B': [5, 4, 3]})
       .rename(columns={'A': 'AA'})
       .apply(lambda x: x if x['AA'] != 2 else np.nan, axis=1)
       .dropna()
       )
print(df2)
    AA    B
0  1.0  5.0
2  3.0  3.0
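One more chaining option, sketched here as an alternative: DataFrame.pipe with plain boolean indexing, which (unlike the apply/dropna route) keeps the integer dtypes:
df2 = (pd.DataFrame({'A': [1, 2, 3], 'B': [5, 4, 3]})
       .rename(columns={'A': 'AA'})
       .pipe(lambda d: d[d['AA'] != 2]))
print(df2)
#    AA  B
# 0   1  5
# 2   3  3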
Maybe you can try a selection with .loc at the end of your chained definition.
col1= ['A','B','A','C','A','B','A','C','A','C','A','A','A']
col2= [1,1,4,2,4,5,6,3,1,5,2,1,1]
df = pd.DataFrame({'col1':col1, 'col2':col2})
For A we have [1, 4, 4, 6, 1, 2, 1, 1], 8 items, but I want to limit the size to 5 while converting the DataFrame to a dict/list.
Output:
Dict = {'A': [1, 4, 4, 6, 1], 'B': [1, 5], 'C': [2, 3, 5]}
Use pandas.DataFrame.groupby with apply:
df.groupby('col1')['col2'].apply(lambda x: list(x.head(5))).to_dict()
Output:
{'A': [1, 4, 4, 6, 1], 'B': [1, 5], 'C': [2, 3, 5]}
Use DataFrame.groupby with a lambda function: convert each group to a list, keep the first 5 values by slicing, and lastly convert to a dictionary with Series.to_dict:
d = df.groupby('col1')['col2'].apply(lambda x: x.tolist()[:5]).to_dict()
print(d)
{'A': [1, 4, 4, 6, 1], 'B': [1, 5], 'C': [2, 3, 5]}
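Another variant (a sketch, same sample data assumed): limit the rows per group up front with groupby().head(5), then convert:
# keep at most the first 5 rows per group, then build the dict
d = df.groupby('col1').head(5).groupby('col1')['col2'].apply(list).to_dict()
print(d)
# {'A': [1, 4, 4, 6, 1], 'B': [1, 5], 'C': [2, 3, 5]}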