remove duplicate values in the next n rows, but keeping first - python

I would really appreciate help with the question below; I don't really know where to start.
I have a dataframe:
pd.DataFrame({'value':[1,1,2,2,1,1,1,1,1,2,1,1]})
I want to write a function that iterates through the values and removes any duplicates in the next n rows.
For example, if n=5, starting from the first number "1", any "1" in the next 5 rows is deleted (marked by "x"). In the next iteration, the second "1" isn't used, because it was deleted in the first iteration.
The resulting dataframe would be
pd.DataFrame({'value':[1,'x',2,'x','x','x',1,'x','x',2,'x','x']})
I want to eventually drop the "x" rows, but for the purpose of illustration I've marked them.

Do you actually want to see the 'x' values, or are they just there to show us which rows are to be deleted?
If the latter you could do something like this:
x1 = pd.DataFrame({'value':[1,1,2,2,1,1,1,1,1,2,1,1]})
x1['t'] = x1.index //5
x1.drop_duplicates(subset = ['value', 't']).drop(columns = 't')
    value
0       1
2       2
5       1
9       2
10      1
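Note that `x1.index // 5` groups the rows into fixed blocks of 5, so the result above keeps indices 0, 2, 5, 9 and 10 rather than the 0, 2, 6 and 9 shown in the question. A minimal sketch of the sliding-window reading, assuming the window starts at each kept row and covers the next n rows (the helper name drop_within_next_n is made up for illustration):

```python
import pandas as pd

def drop_within_next_n(df, column, n):
    """Keep a row only if the same value was not kept within the previous n rows."""
    last_kept = {}  # value -> position of the most recent kept row with that value
    keep = []
    for pos, value in enumerate(df[column]):
        if value in last_kept and pos - last_kept[value] <= n:
            keep.append(False)       # inside the window of an earlier kept row
        else:
            keep.append(True)
            last_kept[value] = pos   # this row opens a new window
    return df.loc[keep]

x1 = pd.DataFrame({'value': [1, 1, 2, 2, 1, 1, 1, 1, 1, 2, 1, 1]})
result = drop_within_next_n(x1, 'value', n=5)
```

This loops over the rows explicitly because whether a row is kept depends on which earlier rows were kept, which is hard to express as a single vectorised operation.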

Related

Iterate over two arrays looking for coincidences

I have two CSV files loaded in Python using pandas. Both have two columns, where column A is an ID and column B is its value. I need to figure out how to search CSV 1 for one of the values from CSV 2. For example, say the value of ID #5 in CSV 2 is "10", but in CSV 1 "10" belongs to ID #15, whereas ID #5 has a value of 20.
This is my code right now:
def searchValue(array_A, array_B):
    for z, b in array_b:
        for y, a in array_a:
            if(b['value'] in a['value']):
                print('true')
                return True
            else:
                print('fake')
                return False
I'd appreciate any tips on this. I've tried wrapping array_a and array_b in len, range and others with no luck, and I really don't know what I'm doing wrong.
My algorithm simply needs to return True or False depending on whether the value from B is present anywhere in the values of A.
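One way to avoid the nested loops, sketched under the assumption that both CSVs are already loaded into DataFrames with columns named 'id' and 'value' (these column names and the helper value_in_other are hypothetical): look up the value for one ID in the second frame and test it against the value column of the first with Series.isin.

```python
import pandas as pd

def value_in_other(df_a, df_b, id_b):
    """Return True if the value stored under id_b in df_b occurs anywhere in df_a."""
    values_b = df_b.loc[df_b['id'] == id_b, 'value']
    # isin checks every value of df_a against values_b; any() collapses to one bool
    return bool(df_a['value'].isin(values_b).any())

# Toy data mirroring the example in the question (assumed layout)
csv_1 = pd.DataFrame({'id': [5, 15], 'value': [20, 10]})
csv_2 = pd.DataFrame({'id': [5], 'value': [10]})
found = value_in_other(csv_1, csv_2, id_b=5)
```

Unlike the loop in the question, which returns during the very first comparison, this only answers after all values of the first frame have been checked.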

Replace values for each group

I want to replace values in ['animal'] for each subid/group, based on a condition.
The values in the animal column are numbers (0-3) and vary for each subid, so a (where cond == 1) might look like [0,3] for one subid and [2,1] or [0,3] for another, and the same goes for b.
for s in sids:
    a = df[(df['subid']==s)&(df['cond'] == 1)]['animal'].unique()
    b = df[(df['subid']==s)&(df['cond'] == 0)]['animal'].unique()
    df["animal"].replace({a[0]: 0,a[1]:1,b[0]:2,b[1]:3})
The thing is, I think the dataframe is overwritten entirely each time and only the last iteration of the for loop is used, instead of the appropriate values being saved for each group.
I tried specifying the subid at the beginning like so, df[df['subid']==s]["animal"].replace({a[0]:0,a[1]:1,b[0]:2,b[1]:3}), but it didn't work.
Any pointers are appreciated, thanks!
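A minimal sketch of a per-group version, assuming the columns are named subid, cond and animal as in the question, that the index is unique, and that every subid has exactly two distinct animal values per condition (the helper remap_animals and the toy data are made up for illustration). Note that Series.replace returns a new Series, so the loop in the question never stores its result; here each group's mapping is written back to that group's rows only.

```python
import pandas as pd

def remap_animals(df):
    """Recode 'animal' per subid: cond == 1 values become 0, 1; cond == 0 values become 2, 3."""
    out = df.copy()
    for subid, group in df.groupby('subid'):
        a = group.loc[group['cond'] == 1, 'animal'].unique()
        b = group.loc[group['cond'] == 0, 'animal'].unique()
        mapping = {a[0]: 0, a[1]: 1, b[0]: 2, b[1]: 3}
        # Assign only to the rows of this subid so other groups stay untouched
        out.loc[group.index, 'animal'] = group['animal'].map(mapping)
    return out

example = pd.DataFrame({
    'subid':  [1, 1, 1, 1, 2, 2, 2, 2],
    'cond':   [1, 1, 0, 0, 1, 1, 0, 0],
    'animal': [0, 3, 1, 2, 2, 1, 0, 3],
})
remapped = remap_animals(example)
```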

how to increment each row value in round robin method till some extent of given integer

I am trying to split an integer value across the rows of a dataframe.
I have a pandas dataframe with 4 rows and an integer, say 5. The end result should have 2 in the first row and 1 in each of the remaining 3 rows.
df=pd.DataFrame(['a','b','c','d'],columns=['name'])
df['val']=0
no=5
while no>0:
    for row in df['val']:
        df['val']= row+1
        no-=1
One count at a time has to be taken from 'no' and added to the next row in the dataframe.
I need to iterate through the rows of the dataframe and increment the cell values until the given integer count is used up.
Expected output:
After the 1st iteration the df will look like this, and no will decrement to 4:
df=pd.DataFrame(list(zip(['a','b','c','d'],[1,0,0,0])),columns=['name','val'])
After the 2nd iteration the df will be as below, and no will decrement to 3:
df=pd.DataFrame(list(zip(['a','b','c','d'],[1,1,0,0])),columns=['name','val'])
The iteration continues until the given integer is used up. In this case, the 4th iteration will be as below, and no will decrement to 1:
df=pd.DataFrame(list(zip(['a','b','c','d'],[1,1,1,1])),columns=['name','val'])
The final, 5th iteration will be as below; no will decrement to 0 and the loop ends:
df=pd.DataFrame(list(zip(['a','b','c','d'],[2,1,1,1])),columns=['name','val'])
From what I understand, you want the value 1 in each row other than the first one, and [no - (number of non-first rows)] in the first row. You can get that by:
df['col'] = 1
df.loc[0, 'col'] = no - (len(df) - 1)
It is worth mentioning that loops over rows should almost always be avoided when working with pandas. Almost everything can be done without looping over the rows, and usually much faster.
EDIT:
This is the best way to achieve what you requested in the edited question, if you have to loop on no for some other reason:
df['val'] = 0
no = 5
for i in range(no):
    df.iloc[i % len(df), -1] += 1
But if you don't really need to loop and just want the output in val, use:
df.loc[df.reset_index().index, 'val'] = no // len(df) + 1 * (df.reset_index().index < no % len(df))
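The same arithmetic can be sketched a little more readably with divmod and NumPy, which also avoids relying on the dataframe having a default integer index. This is an alternative formulation under that reading of the question, not the answerer's code; the helper name round_robin_counts is made up for illustration.

```python
import numpy as np
import pandas as pd

def round_robin_counts(n_rows, total):
    """Counts per row after handing out `total` units one row at a time, wrapping around."""
    base, extra = divmod(total, n_rows)
    # Every row gets `base`; the first `extra` rows get one more
    return base + (np.arange(n_rows) < extra).astype(int)

df = pd.DataFrame({'name': ['a', 'b', 'c', 'd']})
df['val'] = round_robin_counts(len(df), 5)
```

With 4 rows and no = 5, divmod gives base = 1 and extra = 1, so only the first row receives the additional unit.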

First Transition Value of DataFrame Column without Temporary Variables

I am trying to find the first transition value of a dataframe column as efficiently as possible. I would prefer not to have temporary variables. Say I have a dataframe (df) with a column of:
Column1
0
0
0
-1
1
In this case, the value which I'm looking for is -1, which is the first time the value changes. I want to use this in an if statement for whether the value is first transitioning to 1 or -1. The pseudocode being:
if (first transition value == 1):
    # Something
elif (first transition value == -1):
    # Something else
General case
You can compare the values in the dataframe to the first one, take only the differing values and use the first value of these.
df[df.Column1 != df.Column1.iloc[0]].Column1.values[0]
Special case
If you always want to find the first element that differs from 0, you could just do it like this:
df[df.Column1 != 0].Column1.values[0]
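Both one-liners raise an IndexError when the column never changes, because values[0] is then taken from an empty array. A minimal sketch that guards against that case and plugs into the if/elif from the question (the helper first_transition and the direction labels are hypothetical, and it does use one intermediate variable for readability):

```python
import pandas as pd

def first_transition(series):
    """Return the first value that differs from the initial one, or None if there is none."""
    changed = series[series != series.iloc[0]]
    return None if changed.empty else changed.iloc[0]

df = pd.DataFrame({'Column1': [0, 0, 0, -1, 1]})
value = first_transition(df['Column1'])

if value == 1:
    direction = 'up'      # first transition is to 1
elif value == -1:
    direction = 'down'    # first transition is to -1
else:
    direction = 'none'    # no transition, or a transition to another value
```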

Split value present in one Column by specific character

I have a dataframe which contains two columns.
In the first column (Motif_name) my values look like this:
Motif_Name_xx/Description/Homer
The second column just contains a score.
I'm trying to split my first column by '/' and keep the first element.
Basically, this is what I tried:
df=df['Motif_name'].str.split('/').str[1]
The problem here is that my dataframe is replaced:
print(df)
0 Motif_1
1 Motif_2
I lost the header and the second column...
I expect to have a dataframe like this:
   Motif_name      Score
0  Motif_Name_xx1  0.001
1  Motif_Name_xx2  0.05
2  Motif_Name_xx3  0.02
3  Motif_Name_xx4  0.01
It seems you need the parameter n=1 to split on the first / only, and str[0] to get the first value of each list (Python counts from 0); then assign the result back to the same column:
df['Motif_name'] = df['Motif_name'].str.split('/', n=1).str[0]
OK, I found the solution while I was editing my question, so in case someone else needs the answer:
EF1a_R1_df['Motif_name']=EF1a_R1_df['Motif_name'].str.split('/').str[0]
Basically, instead of replacing the whole dataframe, just replace the column and it works well.
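For completeness, a small self-contained sketch of the column assignment. The sample rows are assumed, pieced together from the value format and the expected output in the question:

```python
import pandas as pd

# Assumed sample data in the 'name/Description/Homer' format from the question
df = pd.DataFrame({
    'Motif_name': ['Motif_Name_xx1/Description/Homer',
                   'Motif_Name_xx2/Description/Homer'],
    'Score': [0.001, 0.05],
})

# Assign the result back to the column instead of rebinding df,
# so the header and the Score column are preserved
df['Motif_name'] = df['Motif_name'].str.split('/', n=1).str[0]
```

Writing df = df['Motif_name']... instead rebinds df to the resulting Series, which is why the second column disappeared in the original attempt.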
