Split value present in one column by a specific character - Python

I have a data-frame which contains two columns.
In the first column (Motif_name) my values look like this:
Motif_Name_xx/Description/Homer
The second column just contains a score.
I'm trying to split my first column on '/' and keep the first element.
Basically, what I tried:
df = df['Motif_name'].str.split('/').str[1]
The problem here is that my whole data-frame gets replaced:
print(df)
0 Motif_1
1 Motif_2
I lost the header and the second column...
I expect to get a data-frame like this:
Motif_name Score
0 Motif_Name_xx1 0.001
1 Motif_Name_xx2 0.05
2 Motif_Name_xx3 0.02
3 Motif_Name_xx4 0.01

It seems you need the parameter n=1 to split only on the first /, then str[0] to get the first element of each list (Python counts from 0), and finally to assign the result back to the same column:
df['Motif_name'] = df['Motif_name'].str.split('/', n=1).str[0]
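For reference, a minimal runnable sketch of this on data shaped like the question's (the sample values are illustrative):
import pandas as pd

# Sample data shaped like the question's.
df = pd.DataFrame({
    'Motif_name': ['Motif_Name_xx1/Description/Homer',
                   'Motif_Name_xx2/Description/Homer'],
    'Score': [0.001, 0.05],
})

# Split on the first '/' only and keep the first piece.
df['Motif_name'] = df['Motif_name'].str.split('/', n=1).str[0]
print(df)
#        Motif_name  Score
# 0  Motif_Name_xx1  0.001
# 1  Motif_Name_xx2  0.050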

OK, I found the solution while I was editing my question, so in case someone else needs the answer:
EF1a_R1_df['Motif_name'] = EF1a_R1_df['Motif_name'].str.split('/').str[0]
Basically, instead of replacing the whole data-frame, just replace the column and it works well.

Related

Find Sign when Sign Changes in Pandas Column while Ignoring Zeros using Vectorization

I'm trying to find a vectorized way of determining the first instance where my column of data has a sign change. I looked at this question and it gets close to what I want, except it evaluates my first zeros as sign changes too. I'm open to different solutions, including changing how the data is set up in the first place. I'll detail what I'm doing below.
I have two columns, let's call them positive and negative, that look at a third column. The third column has values ranging between [-5, 5]. When this column is in [3, 5], my positive column gets a +1 on that same row; all other rows are 0 in that column. Likewise, when the third column is in [-5, -3], my negative column gets a -1 in that row; all other rows are 0.
I combine these columns into one column. You can conceptualize this as 'turn machine on, keep it on/off, turn it off, keep it on/off, turn machine on ... etc.' The problem I'm having is that my combined column looks something like below:
pos = [1,1,1,0, 0, 0,0,0,0,0,1, 0,1]
neg = [0,0,0,0,-1,-1,0,0,0,0,0,-1,0]
com = [1,1,1,0,-1,-1,0,0,0,0,1,-1,1]
# Below is what I want to have as the final column.
cor = [1,0,0,0,-1, 0,0,0,0,0,1,-1,1]
The problem with what I've linked is that it gets close, but it evaluates the first 0 as a sign change as well. Zeros should be ignored; I tried a few things but seem to be creating new errors. For the sake of completeness, this is what the linked code outputs:
lnk = [True,False,False,True,True,False,True,False,False,False,True,True,True]
As you can see, it handles the 1s and -1s correctly (no spurious flips), but it flips on the zeros. I'm not sure whether I should change how the combined column is made, change the logic for the creation of the component columns, or both. The big thing is that I need to vectorize this code for performance reasons.
Any help would be greatly appreciated!
Let's suppose your dataframe is named df with columns pos and neg; then you can try something like the following:
df.loc[:, "switch_pos"] = (np.diff(df.pos, prepend=0) > 0)*1
df.loc[:, "switch_neg"] = (np.diff(df.neg, prepend=0) < 0)*(-1)
You can then combine your two switch columns.
Explanations
np.diff gives you the row-by-row difference: for the pos column it yields 1 on a 0 to 1 transition and -1 on a 1 to 0 transition. Considering your desired output, you only want to keep the 0 to 1 transitions, which is why you keep only the greater-than-zero differences for pos (and, symmetrically, only the less-than-zero differences for neg, which is why the second line uses < 0).
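Putting it together on the question's data, a minimal sketch (building df from the raw lists is an assumption, since the question only gave the lists):
import numpy as np
import pandas as pd

# The pos/neg columns from the question.
df = pd.DataFrame({
    'pos': [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1],
    'neg': [0, 0, 0, 0, -1, -1, 0, 0, 0, 0, 0, -1, 0],
})

# Flag only the "turn on" transitions: 0 -> 1 for pos, 0 -> -1 for neg.
df['switch_pos'] = (np.diff(df.pos, prepend=0) > 0) * 1
df['switch_neg'] = (np.diff(df.neg, prepend=0) < 0) * (-1)

# Combining the two switch columns reproduces the desired cor column.
df['cor'] = df.switch_pos + df.switch_neg
print(df.cor.tolist())
# [1, 0, 0, 0, -1, 0, 0, 0, 0, 0, 1, -1, 1]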

remove duplicate values in the next n rows, but keeping first

I would really appreciate help with the question below; I don't really know where to start.
I have a dataframe
pd.DataFrame({'value':[1,1,2,2,1,1,1,1,1,2,1,1]})
I want to write a function that iterates through the values and removes any duplicates in the next n rows.
For example, if n=5, starting from the first number "1", if there is any "1" in the next 5 rows, it is deleted (marked by "x"). In the next iteration, the second "1" wouldn't be used, given it was deleted in the first iteration.
The resulting dataframe would be
pd.DataFrame({'value':[1,'x',2,'x','x','x',1,'x','x',2,'x','x']})
I would want to eventually drop the "x" rows but for the purpose of illustration I've marked it out.
Do you want to actually see the 'x's, or are they just to demonstrate to us which rows are to be deleted?
If the latter, you could do something like this:
x1 = pd.DataFrame({'value': [1,1,2,2,1,1,1,1,1,2,1,1]})
x1['t'] = x1.index // 5
x1.drop_duplicates(subset=['value', 't']).drop(columns='t')
value
0 1
2 2
5 1
9 2
10 1
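Note that the bucketing above resets at fixed 5-row boundaries rather than counting 5 rows forward from each kept value, so its output differs slightly from the question's expected result (it keeps row 5 where the question expects row 6). If you need the exact windowed behaviour, a plain loop is a straightforward, if unvectorized, sketch (drop_dups_within_n is an illustrative name):
import pandas as pd

def drop_dups_within_n(df, col='value', n=5):
    # Keep a value only if the same value was not kept in the previous n rows.
    last_kept = {}                      # value -> position where it was last kept
    keep = []
    for i, v in enumerate(df[col]):
        prev = last_kept.get(v)
        if prev is not None and i - prev <= n:
            keep.append(False)          # duplicate within the window: drop it
        else:
            last_kept[v] = i
            keep.append(True)
    return df[keep]

x1 = pd.DataFrame({'value': [1, 1, 2, 2, 1, 1, 1, 1, 1, 2, 1, 1]})
print(drop_dups_within_n(x1).index.tolist())   # [0, 2, 6, 9]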

Iterating over a dataframe and using the replace method based on conditions

I am attempting to iterate over a specific column in my dataframe.
The column is:
df['column'] = ['1.4million', '1,235,000', '100million', np.nan, '14million', '2.5mill']
I am trying to clean this column and eventually get it all to integers to do more work with. I am stuck on the step to clean out "million." I would like to replace the "million" with five zeros when there is a decimal (ie 1.4million becomes 1.400000) and the "million" with six zeros when there is no decimal (ie 100million becomes 100000000).
To simplify, the first step I'm trying is to just focus on filtering out the values with a decimal and replacing those with 5 zeros. I have attempted to use np.where for this; however, I cannot use the replace method with numpy.
I also attempted to use pd.DataFrame.where, but am getting an error:
for i, row in df.iterrows():
    df.at[i, 'column'] = pd.DataFrame.where('.' in df.at[i, 'column'],
                                            df.at[i, 'column'].replace('million', ''),
                                            df.at[i, 'column'])
AttributeError: 'numpy.ndarray' object has no attribute 'replace'
I'm sure there is something I'm missing here. (I'm also sure I'll be told that I don't need to use iterrows here, so I'm open to suggestions on that as well.)
Given your sample data - it looks like you can strip out commas and then take all digits (and . characters) until the string mill or end of string and split those out, eg:
x = df['column'].str.replace(',', '').str.extract('(.*?)(mill.*)?$')
This'll give you:
0 1
0 1.4 million
1 1235000 NaN
2 100 million
3 NaN NaN
4 14 million
5 2.5 mill
Then take the number part and multiply it by a million where there's something in column 1, else multiply it by 1, eg:
res = pd.to_numeric(x[0]) * np.where(x[1].notna(), 1_000_000, 1)
That'll give you:
0 1400000.0
1 1235000.0
2 100000000.0
3 NaN
4 14000000.0
5 2500000.0
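For completeness, the same approach as a self-contained sketch (the dataframe construction just reproduces the question's sample data):
import numpy as np
import pandas as pd

# Rebuild the sample column from the question.
df = pd.DataFrame({'column': ['1.4million', '1,235,000', '100million',
                              np.nan, '14million', '2.5mill']})

# Strip commas, then split each value into (number, optional 'mill...' suffix).
x = df['column'].str.replace(',', '', regex=False).str.extract(r'(.*?)(mill.*)?$')

# Multiply by a million wherever a suffix was captured.
df['as_number'] = pd.to_numeric(x[0]) * np.where(x[1].notna(), 1_000_000, 1)
print(df)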
Try this:
df['column'].apply(lambda x: x.replace('million', '00000'))
Make sure your dtype is string before applying this.
For the given data:
df['column'].apply(lambda x: float(str(x).split('m')[0]) * 10**6
                   if 'million' in str(x) or 'mill' in str(x) else x)
If there may be many forms of 'million' in the column, use a regex search instead.
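For example, a rough sketch of such a regex-based cleaner applied to the question's df (to_number is an illustrative helper name):
import re
import numpy as np
import pandas as pd

def to_number(x):
    # Normalize values like '1.4million', '2.5mill' or '1,235,000' to floats.
    if pd.isna(x):
        return np.nan
    s = str(x).replace(',', '')
    m = re.fullmatch(r'([\d.]+)\s*mill\w*', s)
    return float(m.group(1)) * 1_000_000 if m else float(s)

df['column'].apply(to_number)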

First Transition Value of DataFrame Column without Temporary Variables

I am trying to find the first transition value of a dataframe column as efficiently as possible. I would prefer not to have temporary variables. Say I have a dataframe (df) with a column of:
Column1
0
0
0
-1
1
In this case, the value which I'm looking for is -1, which is the first time the value changes. I want to use this in an if statement for whether the value is first transitioning to 1 or -1. The pseudocode being:
if (first transition value == 1):
    # Something
elif (first transition value == -1):
    # Something else
General case
You can compare the values in the dataframe to the first one, take only the differing values and use the first value of these.
df[df.Column1 != df.Column1.iloc[0]].Column1.values[0]
Special case
If you always want to find the first differing element from 0 you could just do it like this:
df[df.Column1 != 0].Column1.values[0]
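A minimal sketch wiring the general case into the if/elif from the question (sample data taken from the question):
import pandas as pd

df = pd.DataFrame({'Column1': [0, 0, 0, -1, 1]})

# General case: first value that differs from the first row's value.
first_transition = df[df.Column1 != df.Column1.iloc[0]].Column1.values[0]

if first_transition == 1:
    pass   # Something
elif first_transition == -1:
    pass   # Something else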

Pandas - slice sections of dataframe into multiple dataframes

I have a Pandas dataframe with 3000+ rows that looks like this:
t090:   c0S/m:    pr:      timeJ:  potemp090C:   sal00:  depSM:
407  19.3574  4.16649  1.836  189.617454      19.3571  30.3949   1.824
408  19.3519  4.47521  1.381  189.617512      19.3517  32.9250   1.372
409  19.3712  4.44736  0.710  189.617569      19.3711  32.6810   0.705
410  19.3602  4.26486  0.264  189.617627      19.3602  31.1949   0.262
411  19.3616  3.55025  0.084  189.617685      19.3616  25.4410   0.083
412  19.2559  0.13710  0.071  189.617743      19.2559   0.7783   0.071
413  19.2092  0.03000  0.068  189.617801      19.2092   0.1630   0.068
414  19.4396  0.00522  0.068  189.617859      19.4396   0.0321   0.068
What I want to do is: create individual dataframes from each portion of the dataframe in which the values in column 'c0S/m' exceed 0.1 (eg rows 407-412 in the example above).
So let's say that I have 7 sections in my 3000+ row dataframe in which a series of rows exceed 0.1 in the second column. My if/for/while statement will slice these sections and create 7 separate dataframes.
I tried researching the best I could but could not find a question that would address this problem. Any help is appreciated.
Thank you.
You can try this:
First add a column of 0 or 1 based on whether the value is greater than 1 or not:
df['splitter'] = np.where(df['c0S/m:'] > 1, 1, 0)
Now group by the cumulative sum of this column's diff:
df.groupby((df['splitter'].diff(1) != 0).astype('int').cumsum()).apply(lambda x: [x.index.min(),x.index.max()])
You get the required blocks of indices
splitter
1 [407, 411]
2 [412, 414]
3 [415, 415]
Now you can create dataframes using loc
df.loc[407:411]
Note: I added a line to your sample df using:
df.loc[415] = [19.01, 5.005, 0.09, 189.62, 19.01, 0.026, 0.09]
to be able to test better, hence the split into 3 groups.
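If you then want the actual dataframes rather than just the index bounds, one way is a groupby comprehension (a sketch against the question's df; blocks is an illustrative name, and only the sections where splitter is 1 are kept):
import numpy as np

df['splitter'] = np.where(df['c0S/m:'] > 1, 1, 0)
groups = (df['splitter'].diff(1) != 0).astype('int').cumsum()

# One dataframe per contiguous block where splitter is 1.
blocks = [g.drop(columns='splitter')
          for _, g in df[df['splitter'] == 1].groupby(groups)]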
Here's another way.
sub_set = df[df['c0S/m:'] > 0.1]
last = None
start = None
for i in sub_set.index:
    if last is None:
        start = i
    elif i - last > 1:          # gap in the index: the previous block ended
        print(start, last)
        start = i
    last = i
if last is not None:            # don't forget the final block
    print(start, last)
I think it works. (Instead of print(start, last) you could insert code to create the slices you wanted of the original data frame, e.g. df.loc[start:last].)
Some neat tricks here that do an even better job.
