I have another basic question. So I have a dataframe like so:
cols = a,b,c,d,e which contains integers.
I want column e's value to equal 1 if columns b and c or columns a, b and c = 1.
Although d's column does not matter in this computation, it matters somewhere else so I cannot drop it.
How would I do that on pandas?
Use .loc with a boolean mask. Since "(b and c) or (a, b and c)" reduces to b and c both being 1, the condition only needs those two columns, and the comparisons must be combined with & rather than chained (a chained == on Series raises a ValueError):
df.loc[(df['b'] == 1) & (df['c'] == 1), 'e'] = 1
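A minimal sketch with made-up data (column names as in the question, values assumed to be 0/1 integers):

import pandas as pd

# toy frame; column d is unused in this computation but kept, as in the question
df = pd.DataFrame({'a': [1, 0, 1],
                   'b': [1, 1, 0],
                   'c': [1, 1, 1],
                   'd': [0, 1, 0],
                   'e': [0, 0, 0]})

# (b and c) or (a, b and c) reduces to: b == 1 and c == 1
df.loc[(df['b'] == 1) & (df['c'] == 1), 'e'] = 1
print(df)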
Assume we have one dataframe, df1, of shape (5000, 6).
Let's assume it has the following structure (I will only write the first 3 columns since the others should just be copied)
A B C
'A-0000-ALEX,A-00030-PAUL' 1 '1112-PPAI 12.00: First Name\n4554-ALGF 09:00 Groceries\n'
and I want to create a dataframe where, whenever a cell holds multiple values, extra rows are created that separate them, as follows:
A B C
'A-0000-ALEX' 1 '1112-PPAI 12.00: First Name'
'A-00030-PAUL' 1 '4554-ALGF 09:00 Groceries'
Thus, new rows should be created from column A; for column B, just copy the value that already existed. For column C the values should also be split across the new rows, and the rest of the columns (D, E, F) should just be copied into the new rows that are created.
Edit: I just realized it is not always a 1-to-1 match between the values of A and C. Some rows have only one value in column A, while other rows have the same number of values in column A as they have in column C.
One solution here is to turn the string in each of these columns into a list of the wanted values by splitting the string:
df['A'] = df['A'].apply(lambda x: x.split(","))
df['C'] = df['C'].apply(lambda x: x.split('\n')[:-1])
Note that [:-1] is used here because the trailing '\n' in column C makes split produce an extra empty string at the end.
Then you can use DataFrame.explode to expand the lists into rows (exploding multiple columns at once requires pandas >= 1.3):
df = df.explode(['A','C'])
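A self-contained sketch of this approach on a single made-up row (the values are just the placeholders from the question):

import pandas as pd

df = pd.DataFrame({
    'A': ['A-0000-ALEX,A-00030-PAUL'],
    'B': [1],
    'C': ['1112-PPAI 12.00: First Name\n4554-ALGF 09:00 Groceries\n'],
})

# split the delimited strings into lists; [:-1] drops the empty piece
# produced by the trailing '\n' in column C
df['A'] = df['A'].apply(lambda x: x.split(','))
df['C'] = df['C'].apply(lambda x: x.split('\n')[:-1])

# explode both list columns together; the lists must have equal lengths
# per row, and multi-column explode needs pandas >= 1.3
df = df.explode(['A', 'C'], ignore_index=True)
print(df)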
Is there any way to drop duplicate columns but replace their values depending on conditions, like below?
In the table below I would like to remove the duplicate (second) A and B columns, but replace the value of the primary A and B columns (1st and 2nd column) wherever the value is 0 there but 1 in the duplicate columns.
Example: in the 3rd row, where A and B have the value 0, they should be replaced with the value 1 from their respective duplicate columns.
Input Data :
Output Data:
This is an example of the problem I'm working on; my real data has around 200 columns, so I'm hoping to find an optimal solution without hard-coding column names for removal.
Use DataFrame.any per duplicated column name if the columns hold only 0/1 values:
df = df.any(axis=1, level=0).astype(int)
Or, if you need the maximal value per duplicated column name:
df = df.max(axis=1, level=0)
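Both calls rely on the level= keyword of any/max, which newer pandas (2.x) no longer accepts. A hedged equivalent for those versions, assuming duplicated column labels, is to group the transposed frame by label:

import pandas as pd

# toy frame with duplicated column labels holding 0/1 flags
df = pd.DataFrame([[1, 0, 0, 1],
                   [0, 0, 1, 1],
                   [0, 1, 0, 0]],
                  columns=['A', 'B', 'A', 'B'])

# group the transposed frame by column label instead of using level=
out_any = df.T.groupby(level=0).any().T.astype(int)   # 0/1 case
out_max = df.T.groupby(level=0).max().T                # general case
print(out_any)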
I have two dataframes like this: the first has one column and 720 rows (dataframe A), the second has ten columns and 720 rows (dataframe B). The dataframes contain only numerical values.
I am trying to compare them this way: I want to go through each column of dataframe B and compare each cell (row) of that column to the corresponding row in dataframe A.
(Example: For the first column of dataframe B I compare the first row to the first row of dataframe A, then the second row of B to the second row of A etc.)
Basically I want to compare each column of dataframe B to the single column in dataframe A, row by row.
If the value in dataframe B is smaller than or equal to the value in dataframe A, I want to add +1 to another dataframe (or list, depending on which is easier). In the end, I want to drop any column in dataframe B that doesn't have at least one cell satisfying the condition (basically, if the value added to the list or new dataframe is 0).
I tried something like this (written for a single row, I was thinking of creating a for loop using this) but it doesn't seem to do what I want:
DfA_i = pd.DataFrame(DA.iloc[i])
DfB_j = pd.DataFrame(DB.iloc[j])
B = DfB_j.values
DfC['Criteria'] = DfA_i.apply(lambda x: len(np.where(x.values <= B)), axis=1)
dv = dt_dens.values
if dv[1] < 1:
    DF = DA.drop(i)
I hope I made my problem clear enough and sorry for any mistakes. Thanks for any help.
Let's try:
dfB.loc[:, dfB.le(dfA.values).any()]
Explanation: dfA.values returns the NumPy array with shape (720, 1). Then dfB.le(dfA.values) checks each column of dfB against that single column of dfA (True where the value in dfB is smaller than or equal to the one in dfA); this returns a boolean dataframe of the same size as dfB. Finally, .any() checks each column of that boolean dataframe for any True, and .loc keeps only those columns.
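A hedged sketch with random data of the shapes from the question (the column names are arbitrary):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dfA = pd.DataFrame(rng.random((720, 1)))
dfB = pd.DataFrame(rng.random((720, 10)), columns=[f'c{i}' for i in range(10)])

# keep only the columns of dfB where at least one row satisfies B <= A
kept = dfB.loc[:, dfB.le(dfA.values).any()]
print(kept.shape)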
how about this:
pd.DataFrame(np.where(B.to_numpy() <= A.to_numpy(), 1, np.nan), columns=B.columns, index=A.index).dropna(axis=1, how='all')
You can replace the np.nan in the np.where call with whatever value you wish, including keeping the original values of dataframe B.
Suppose I have a pandas dataframe whose columns are a, b, c and whose index contains dates.
df = pd.DataFrame(columns = ["a","b","c"],index=range(5),data=np.random.rand(5,3))
And I have a string, formula = "(2*a+b)/c", where a, b, c refer to columns of the pandas dataframe. What is the most efficient way to go about this?
The solution should give the same answer as this
(2*df["a"]+df["b"])/df["c"]
The bonus question is: what if the formula contains a lagged value, e.g. formula = (2*a[-1]+b)/c, where a[-1] would use the data from the previous row of column a? Thanks.
Use DataFrame.eval to evaluate a string describing operations on DataFrame columns.
formula = "(2*a+b)/c"
df.eval(formula)
0 6.432992
1 1.175234
2 3.274955
3 2.050857
4 7.605282
dtype: float64
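The bonus question about the lagged value is not covered by eval directly, since its expression syntax cannot index previous rows. One hedged workaround (the helper column name a_lag is just an illustration) is to materialize the lag with shift and reference it in the expression:

# a[-1] would mean the previous row of column a; shift(1) builds that
# lagged column so eval can refer to it by name
result = df.assign(a_lag=df['a'].shift(1)).eval("(2*a_lag + b)/c")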
Let's say I have a dataframe with column names as follows:
col_id_1, col_id_2, ..., col_id_m, property_1, property_2 ..., property_n
As an example, how would I search across all col_ids for, say, the value 5 (note that 5 won't appear in multiple col_ids in the same row), and then choose all rows that contain this value? On top of that, once I've found all rows that have a col_id containing the value 5, I'll combine all the col_ids with value 5 into a single id column, and also only choose, say, property_8 and property_25000 as additional columns.
In this case, I would have a table with the following columns:
id, property_8, property_25000
where the id column only contains rows with value 5. Is such a thing possible in pandas?
IIUC, first filter your columns to those whose name contains col_id, then use any to check whether any of those columns equals 5:
df.loc[df.filter(like='col_id').eq(5).any(axis=1), ['property_8','property_25000']].assign(id=5)
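A small sketch with made-up data to show what this returns (the values are placeholders):

import pandas as pd

df = pd.DataFrame({'col_id_1': [5, 3, 7],
                   'col_id_2': [2, 5, 9],
                   'property_8': ['x', 'y', 'z'],
                   'property_25000': [10, 20, 30]})

# rows where any col_id_* column equals 5, keeping only the two properties
mask = df.filter(like='col_id').eq(5).any(axis=1)
out = df.loc[mask, ['property_8', 'property_25000']].assign(id=5)
print(out)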
You can refine this answer with a list comprehension over the column names (see the sketch after the code below). One simple approach is to subset using OR (|), or alternatively AND (&):
df_new = df[(df['col_id_1'] == 5) | (df['col_id_2'] == 5) | (df['col_id_3'] == 5)]
df_new will be a dataframe reflecting your criteria; you can then subset the columns accordingly:
df_new = df_new[['id', 'property_8', 'property_25000']]
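A hedged sketch of that list-comprehension idea, continuing from the toy df defined in the sketch above (the startswith('col_id') check is an assumption about the naming pattern):

# build the OR of the per-column comparisons without hard-coding each name
id_cols = [c for c in df.columns if c.startswith('col_id')]
mask = pd.concat([df[c] == 5 for c in id_cols], axis=1).any(axis=1)
df_new = df[mask]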