Pandas dataframe: slicing column values using second column for slice index - python

I'm trying to create a column of microsatellite motifs in a pandas dataframe. I have one column that gives the length of the motif and another that has the whole microsatellite.
Here's an example of the columns of interest.
motif_len sequence
0 3 ATTATTATTATT
1 4 ATCTATCTATCT
2 3 ATCATCATCATC
I would like to slice the values in sequence using the values in motif_len to give a single repeat(motif) of each microsatellite. I'd then like to add all these motifs as a third column in the data frame to give something like this.
motif_len sequence motif
0 3 ATTATTATTATT ATT
1 4 ATCTATCTATCT ATCT
2 3 ATCATCATCATC ATC
I've tried a few things with no luck.
>>> df['motif'] = df.sequence.str[:df.motif_len]
>>> df['motif'] = df.sequence.str[:df.motif_len.values]
Both make the motif column but all the values are NaN.
I think I understand why these don't work: I'm passing a Series/array as the upper bound of the slice rather than a single value from the motif_len column.
I also tried to create a series by iterating through each
Any ideas?

You can call apply on the DataFrame and pass axis=1 to apply the function row-wise, using each row's column values to slice the string:
In [5]:
df['motif'] = df.apply(lambda x: x['sequence'][:x['motif_len']], axis=1)
df
Out[5]:
motif_len sequence motif
0 3 ATTATTATTATT ATT
1 4 ATCTATCTATCT ATCT
2 3 ATCATCATCATC ATC
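A list comprehension over zip is a common alternative to a row-wise apply, since it pairs each sequence with its own scalar length instead of handing a whole Series to the slice. The following is only a minimal sketch, assuming the three sample rows from the question:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "motif_len": [3, 4, 3],
    "sequence": ["ATTATTATTATT", "ATCTATCTATCT", "ATCATCATCATC"],
})

# Pair each sequence with its own motif length and slice in plain Python.
df["motif"] = [seq[:n] for seq, n in zip(df["sequence"], df["motif_len"])]

print(df)
```

Whether this is faster than apply depends on the data, so it is worth timing both on a realistic frame.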

Related

df.index vs df["index"] after resetting index [duplicate]

import pandas as pd
df1 = pd.DataFrame({
"value": [1, 1, 1, 2, 2, 2]})
print(df1)
print("-------------------------")
print(df1.reset_index())
print("-------------------------")
print(df1.reset_index().index)
print("-------------------------")
print(df1.reset_index()["index"])
produces the output
value
0 1
1 1
2 1
3 2
4 2
5 2
-------------------------
index value
0 0 1
1 1 1
2 2 1
3 3 2
4 4 2
5 5 2
-------------------------
RangeIndex(start=0, stop=6, step=1)
-------------------------
0 0
1 1
2 2
3 3
4 4
5 5
Name: index, dtype: int64
I am wondering why print(df1.reset_index().index) and print(df1.reset_index()["index"]) print different things in this case. The latter prints the "index" column, while the former prints the indices.
If we want to access the reset indices (the column), then it seems we have to use brackets?
The .index attribute of a pandas DataFrame always points to the Index (the row labels) of the DataFrame, not to a column named "index".
If we want to access the reset indices (the column), then it seems we
have to use brackets?
Yes, or you can assign a name when resetting the index (the names parameter requires pandas 1.5 or later), for example:
df1.reset_index(names='the_index').the_index
# 0 0
# 1 1
# 2 2
# 3 3
# 4 4
# 5 5
# Name: the_index, dtype: int64
Several things happened. First, when you don't specify an index, pandas uses a RangeIndex object as a virtual index for the dataframe. The dataframe is a collection of numpy arrays, which are naturally indexed 0, 1, 2, and so on. Since a RangeIndex is just 0, 1, 2, ..., it doesn't actually create its values in memory. Had you printed the index of the original df1, it would have been a RangeIndex, just like df1.reset_index().index.
reset_index has an optional drop parameter. By default, pandas will take the existing index and turn it into a column of the dataframe. This was a RangeIndex object but it had to be expanded into a realized column to fit with the other columns in the df. Had you included drop=True, there would be no "index" column.
A dataframe always has to have some index, so when you reset it, the replacement is the default virtual RangeIndex you see.
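The effect of the drop parameter can be checked directly. This is a small sketch using the df1 from the question; the variable names kept and dropped are only illustrative:

```python
import pandas as pd

df1 = pd.DataFrame({"value": [1, 1, 1, 2, 2, 2]})

kept = df1.reset_index()              # old index becomes an "index" column
dropped = df1.reset_index(drop=True)  # old index is discarded

print(kept.columns.tolist())         # ['index', 'value']
print(dropped.columns.tolist())      # ['value']
print(type(dropped.index).__name__)  # RangeIndex
```

In both cases the resulting frame gets a fresh RangeIndex as its row labels.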
DataFrames have a shortcut where some columns can be addressed by attribute name rather than item (the square brackets). But, if the column name doesn't meet python's attribute naming rules or if it clashes with an existing attribute, you can't reference it that way. .index is the dataframe index so if you happen to also have a column "index", you need to access it via the square bracket item protocol.
One could argue that pandas should never have allowed the attribute access path because it can't be used consistently. I wouldn't argue that (except I totally would).
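To illustrate the name clash described above, here is a short sketch (the variable names row_labels and index_column are made up for the example):

```python
import pandas as pd

df = pd.DataFrame({"value": [1, 1, 1, 2, 2, 2]}).reset_index()

# Attribute access resolves to the DataFrame's row labels, not the column.
row_labels = df.index
# Bracket access always looks up a column by name.
index_column = df["index"]

print(type(row_labels).__name__)    # RangeIndex
print(type(index_column).__name__)  # Series

# A column whose name does not clash is reachable both ways.
print(df.value.equals(df["value"]))  # True
```

Sticking to bracket access everywhere avoids having to remember which names clash.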
It does this because you are printing different things:
print(df1.reset_index().index)
is the same as:
df = df1.reset_index()
print(df.index)
This first moves the existing index into a new "index" column and then prints the actual index of the df.
print(df1.reset_index()["index"])
is the equivalent of
df = df1.reset_index()
print(df["index"])
It first moves the existing index into an "index" column, so the dataframe has both "index" and "value" columns. It then prints the column named "index" (which is NOT the index of the df).
If you want to make the "index" column the index again, you must use:
df = df.set_index("index")

Difference between giving pandas a python iterable vs a pd.Series for column

What are some of the differences between passing a List vs a pd.Series type to create a new dataFrame column? For example, from trial-and-error I've noticed:
# (1d) We can also give it a Series, which is quite similar to giving it a List
df['cost1'] = pd.Series([random.choice([1.99,2.99,3.99]) for i in range(len(df))])
df['cost2'] = [random.choice([1.99,2.99,3.99]) for i in range(len(df))]
df['cost3'] = pd.Series([1,2,3]) # <== will pad length with `NaN`
df['cost4'] = [1,2,3] # <== this one will fail because not the same size
d
Are there any other ways in which passing a pd.Series differs from passing a standard python list? Can a dataframe take any python iterable, or are there restrictions on what can be passed to it? Finally, is using pd.Series the 'correct' way to add columns, or can it be used interchangeably with other types?
Assigning a list to a dataframe column requires the list to have the same length as the dataframe.
When assigning a pd.Series, pandas uses the Series index as the key to match against the DataFrame's index, then fills in each value at the matching index label:
df=pd.DataFrame([1,2,3],index=[9,8,7])
df['New']=pd.Series([1,2,3])
# the default index is a RangeIndex, which runs from 0 to n-1
# since the dataframe index does not match the Series index, the new column is all NaN
df
Out[88]:
0 New
9 1 NaN
8 2 NaN
7 3 NaN
A Series of a different length with a matching index:
df['New']=pd.Series([1,2],index=[9,8])
df
Out[90]:
0 New
9 1 1.0
8 2 2.0
7 3 NaN
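If positional assignment is what you want from a Series, one option is to strip its index first so pandas treats it like a plain array. A minimal sketch, assuming the same df as above; the column names aligned and positional are only for illustration:

```python
import pandas as pd

df = pd.DataFrame([1, 2, 3], index=[9, 8, 7])
values = pd.Series([10, 20, 30])  # default index 0, 1, 2

df["aligned"] = values                # matched by label: no labels in common
df["positional"] = values.to_numpy()  # matched by position, like a list

print(df)
```

Here the aligned column is all NaN, while the positional column holds 10, 20, 30 in row order.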

Pandas - slicing column values based on another column

How can I slice column values based on first & last character location indicators from two other columns?
Here is the code for a sample df:
import pandas as pd
d = {'W': ['abcde','abcde','abcde','abcde']}
df = pd.DataFrame(data=d)
df['First']=[0,0,0,0]
df['Last']=[1,2,3,5]
df['Slice']=['a','ab','abc','abcde']
print(df.head())
Just do it with a for loop. If you are worried about speed, see "For loops with pandas - When should I care?"
df['Slice']=[x[y:z]for x,y,z in zip(df.W,df.First,df.Last)]
df
Out[918]:
W First Last Slice
0 abcde 0 1 a
1 abcde 0 2 ab
2 abcde 0 3 abc
3 abcde 0 5 abcde
I am not sure if this will be faster, but a similar approach would be:
df['Slice'] = df.apply(lambda x: x['W'][x['First']:x['Last']], axis=1)
Briefly, you go through each row (axis=1) and apply a custom function. The function takes the row (stored as x) and slices the value in column W, using the values in First and Last as the slice bounds (that's the lambda part). Looking the values up by column name rather than by position avoids relying on integer position lookups on a labelled row, which newer pandas versions deprecate. I will be happy to elaborate more if this isn't clear.
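The two approaches can be compared side by side. This is only a sketch on the sample frame from the question; by_zip and by_apply are illustrative names:

```python
import pandas as pd

df = pd.DataFrame({
    "W": ["abcde"] * 4,
    "First": [0, 0, 0, 0],
    "Last": [1, 2, 3, 5],
})

# List comprehension: slice each string with its own start and stop.
by_zip = [w[i:j] for w, i, j in zip(df["W"], df["First"], df["Last"])]

# Row-wise apply: same slicing, looked up by column name on each row.
by_apply = df.apply(lambda row: row["W"][row["First"]:row["Last"]], axis=1)

print(by_zip)                       # ['a', 'ab', 'abc', 'abcde']
print(by_apply.tolist() == by_zip)  # True
```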

pandas Series to Dataframe using Series indexes as columns

I have a Series, like this:
series = pd.Series({'a': 1, 'b': 2, 'c': 3})
I want to convert it to a dataframe like this:
a b c
0 1 2 3
pd.Series.to_frame() doesn't work; it gives a result like this:
0
a 1
b 2
c 3
How can I construct a DataFrame from Series, with index of Series as columns?
You can also try this:
df = pd.DataFrame(series).transpose()
Using the transpose() function, you can interchange the index and the columns.
The output looks like this:
a b c
0 1 2 3
You don't need the transposition step, just wrap your Series inside a list and pass it to the DataFrame constructor:
pd.DataFrame([series])
a b c
0 1 2 3
Alternatively, call Series.to_frame, then transpose using the shortcut .T:
series.to_frame().T
a b c
0 1 2 3
You can also try this:
a = series.to_frame()
a['id'] = list(a.index)
Explanation:
The first line converts the Series into a single-column DataFrame.
The second line adds a column to this DataFrame holding the same values as the index.
Try reset_index. It will convert your index into a column in your dataframe.
df = series.to_frame().reset_index()
This
pd.DataFrame([series]) #method 1
produces a slightly different result than
series.to_frame().T #method 2
With method 1, the elements in the resulting dataframe retain their types, e.g. an int64 in the series is kept as an int64.
With method 2, the elements in the resulting dataframe become objects if there is an object-type element anywhere in the series, e.g. an int64 in the series becomes an object.
This difference may cause different behaviors in your subsequent operations depending on the version of pandas.
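One way to see this for yourself is to print the dtypes produced by each method on a Series with mixed contents. This is a hedged sketch: the Series named mixed is an invented example, and the exact dtypes from method 1 may vary with the pandas version, so no output is asserted for it here:

```python
import pandas as pd

# An invented Series with mixed contents, so its dtype is object.
mixed = pd.Series({"a": 1, "b": "x", "c": 3.5})

via_list = pd.DataFrame([mixed])  # method 1
via_T = mixed.to_frame().T        # method 2

print(via_list.dtypes)
print(via_T.dtypes)  # every column is object after the transpose
```

If you need specific dtypes afterwards, converting explicitly with astype is more predictable than relying on either constructor path.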

Occurrence frequency from a list against each row in Pandas dataframe

Let's say I have a list of 6 integers named ‘base’ and a dataframe of 100,000 rows with 6 columns of integers.
I need to create an additional column which shows the frequency of occurrences of the list ‘base’ against each row of the dataframe.
The order of the integers in both the list ‘base’ and the dataframe rows is to be ignored.
The occurrence frequency can have a value ranging from 0 to 6.
0 means none of the 6 integers in the list ‘base’ matches any of the 6 columns in a row of the dataframe.
Can anyone shed some light on this, please?
You can try this:
import pandas as pd
# create frame with six columns of ints
df = pd.DataFrame({'a':[1,2,3,4,10],
'b':[8,5,3,2,11],
'c':[3,7,1,8,8],
'd':[3,7,1,8,8],
'e':[3,1,1,8,8],
'f':[7,7,1,8,8]})
# list of ints
base =[1,2,3,4,5,6]
# define function to count membership of list
def base_count(y):
    return sum(True for x in y if x in base)
# apply the function row wise using the axis =1 parameter
df.apply(base_count, axis=1)
outputs:
0 4
1 3
2 6
3 2
4 0
dtype: int64
then assign it to a new column:
df['g'] = df.apply(base_count, axis=1)
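For a frame with 100,000 rows, a vectorised alternative may be worth trying. The sketch below swaps the row-wise apply for DataFrame.isin followed by a row sum; it is a different technique from the answer above, shown on the same sample data, and the cols list is just a helper name:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 10],
                   'b': [8, 5, 3, 2, 11],
                   'c': [3, 7, 1, 8, 8],
                   'd': [3, 7, 1, 8, 8],
                   'e': [3, 1, 1, 8, 8],
                   'f': [7, 7, 1, 8, 8]})
base = [1, 2, 3, 4, 5, 6]

# Restrict to the six data columns so a later re-run does not count 'g' itself.
cols = ['a', 'b', 'c', 'd', 'e', 'f']

# isin gives a boolean frame; summing across columns counts matches per row.
df['g'] = df[cols].isin(base).sum(axis=1)

print(df['g'].tolist())  # [4, 3, 6, 2, 0]
```

Like the apply version, this counts every cell whose value is in base, so repeated values within a row are counted each time they appear.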
