How can I slice column values based on first & last character location indicators from two other columns?
Here is the code for a sample df:
import pandas as pd
d = {'W': ['abcde','abcde','abcde','abcde']}
df = pd.DataFrame(data=d)
df['First'] = [0, 0, 0, 0]
df['Last'] = [1, 2, 3, 5]
df['Slice'] = ['a', 'ab', 'abc', 'abcde']
print(df.head())
The Slice column, built by hand above, shows the desired output: each value of W sliced from position First to position Last.
Just do it with a for loop (here, a list comprehension). If you are worried about speed, see "For loops with pandas - When should I care?":
df['Slice'] = [x[y:z] for x, y, z in zip(df.W, df.First, df.Last)]
df
Out[918]:
W First Last Slice
0 abcde 0 1 a
1 abcde 0 2 ab
2 abcde 0 3 abc
3 abcde 0 5 abcde
I am not sure if this will be faster, but a similar approach would be:
df['Slice'] = df.apply(lambda x: x['W'][x['First']:x['Last']], axis=1)
Briefly, you go through each row (axis=1) and apply a custom function. The function takes the row (stored as x) and slices its 'W' value, using the 'First' and 'Last' values as the slice bounds (that's the lambda part). Indexing the row by column label also avoids positional Series indexing, which is deprecated in recent pandas. I will be happy to elaborate more if this isn't clear.
I'm just getting into pandas, and I am trying to add a new column to an existing dataframe.
I have two dataframes where the index of one dataframe links to a column in the other. Where these values are equal, I need to put the value of another column from the source dataframe into a new column of the destination dataframe.
The code section below illustrates what I mean. The commented part is what I need as an output.
I guess I need the .loc[] function.
Another, minor, question: is it bad practice to have a non-unique index?
import pandas as pd
d = {'key':['a', 'b', 'c'],
'bar':[1, 2, 3]}
d2 = {'key':['a', 'a', 'b'],
'other_data':['10', '20', '30']}
df = pd.DataFrame(d)
df2 = pd.DataFrame(data = d2)
df2 = df2.set_index('key')
print(df2)
## other_data new_col
##key
##a 10 1
##a 20 1
##b 30 2
Use rename, passing a Series to map the index:
df2['new'] = df2.rename(index=df.set_index('key')['bar']).index
print (df2)
other_data new
key
a 10 1
a 20 1
b 30 2
Or map:
df2['new'] = df2.index.to_series().map(df.set_index('key')['bar'])
print (df2)
other_data new
key
a 10 1
a 20 1
b 30 2
If you want better performance, it is best to avoid duplicates in the index. Some functions, such as reindex, also fail on a duplicate index.
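A minimal sketch of that reindex failure on a duplicate index (the exact error message varies by pandas version):
import pandas as pd

s = pd.Series([1, 2, 3], index=['a', 'a', 'b'])
try:
    s.reindex(['a', 'b'])
except ValueError as err:
    print(err)  # e.g. cannot reindex on an axis with duplicate labels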
You can use join:
df2.join(df.set_index('key'))
other_data bar
key
a 10 1
a 20 1
b 30 2
One way to rename the column in the process:
df2.join(df.set_index('key').bar.rename('new'))
other_data new
key
a 10 1
a 20 1
b 30 2
Another, minor, question: is it bad practice to have a non-unique index?
It is not great practice, but it depends on your needs and can be okay in some circumstances.
Issue 1: join operations
A good place to start is to think about what makes an Index different from a standard DataFrame column. This raises the question: if your Index has duplicate values, does it really need to be specified as an Index, or could it just be another column in a RangeIndex-ed DataFrame? If you've ever used SQL or any other DBMS and want to mimic join operations in pandas with functions such as .join or .merge, you'll lose the functionality of a primary key if you have duplicate index values. A merge will give you what is basically a Cartesian product, which is probably not what you're looking for.
For example:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 2),
                  index=2*list('abcde'))
df2 = df.rename(columns={0: 'a', 1 : 'b'})
print(df.merge(df2, left_index=True, right_index=True).head(7))
0 1 a b
a 0.73737 1.49073 0.73737 1.49073
a 0.73737 1.49073 -0.25562 -2.79859
a -0.25562 -2.79859 0.73737 1.49073
a -0.25562 -2.79859 -0.25562 -2.79859
b -0.93583 1.17583 -0.93583 1.17583
b -0.93583 1.17583 -1.77153 -0.69988
b -1.77153 -0.69988 -0.93583 1.17583
Issue 2: performance
Unique-valued indices make certain operations efficient, as explained in this post:
When the index is unique, pandas uses a hashtable to map key to value, O(1).
When the index is non-unique and sorted, pandas uses binary search, O(log N); when the index is randomly ordered, pandas needs to check all the keys in the index, O(N).
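As a quick sketch of the three regimes described above (illustrative only; actual lookup timings depend on your data):
import numpy as np
import pandas as pd

s_unique = pd.Series(np.arange(4), index=list('abcd'))  # unique: hashtable, O(1)
s_sorted = pd.Series(np.arange(4), index=list('aabb'))  # duplicated but sorted: binary search, O(log N)
s_random = pd.Series(np.arange(4), index=list('abab'))  # duplicated, unsorted: full scan, O(N)

print(s_unique.index.is_unique)                # True
print(s_sorted.index.is_monotonic_increasing)  # True
print(s_random.index.is_unique, s_random.index.is_monotonic_increasing)  # False False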
A word on .loc
Using .loc will return all instances of the label. This can be a blessing or a curse depending on what your objective is. For example,
df = pd.DataFrame(np.random.randn(10,2),
index=2*list('abcde'))
print(df.loc['a'])
0 1
a 0.73737 1.49073
a -0.25562 -2.79859
With the help of .loc, selecting the bar column explicitly so a Series (not a one-column DataFrame) is assigned:
df2['new'] = df.set_index('key').loc[df2.index, 'bar']
Output:
other_data new
key
a 10 1
a 20 1
b 30 2
Using combine_first:
In [442]: df2.combine_first(df.set_index('key')).dropna()
Out[442]:
bar other_data
key
a 1.0 10
a 1.0 20
b 2.0 30
Or, using map:
In [461]: df2.assign(bar=df2.index.to_series().map(df.set_index('key')['bar']))
Out[461]:
other_data bar
key
a 10 1
a 20 1
b 30 2
I have a dataframe generated by:
import numpy as np
import pandas as pd

df = pd.DataFrame([[100, ' tes t ', 3], [100, np.nan, 2], [101, ' test1', 3], [101, ' ', 4]])
It looks like
0 1 2
0 100 tes t 3
1 100 NaN 2
2 101 test1 3
3 101 4
I would like to fill column 1 "forward" with test and test1. I believe one approach would be to replace the whitespace-only entries with np.nan, but that is difficult since the words contain whitespace as well. I could also groupby column 0 and then use the first element of each group to fill forward. Could you provide me with some code for both alternatives? I cannot get them coded myself.
Additionally, I would like to add a column that contains the group means, that is, the final dataframe should look like this:
0 1 2 3
0 100 tes t 3 2.5
1 100 tes t 2 2.5
2 101 test1 3 3.5
3 101 test1 4 3.5
Could you also please advise how to accomplish something like this?
Many thanks; please let me know in case you need further information.
IIUC, you could use str.strip and then check whether the stripped string is empty.
Then do the groupby operations: fill the NaNs forward within each group and calculate the means with groupby.transform, as shown:
df[1] = df[1].str.strip().dropna().apply(lambda x: np.nan if len(x) == 0 else x)
df[1] = df.groupby(0)[1].ffill()
df[3] = df.groupby(0)[2].transform('mean')
df
Note: If you must fill the NaN values with the first element of each group instead, do this:
df.groupby(0)[1].apply(lambda x: x.fillna(x.iloc[0]))
Breakup of steps:
Since we want to apply the function only to strings, we first drop all the NaN values; otherwise we would get a TypeError, because the column contains both floats and strings, and float has no len().
df[1].str.strip().dropna()
0 tes t # operates only on indices where strings are present (empty strings included)
2 test1
3
Name: 1, dtype: object
The reindexing part isn't a necessary step, as the function only computes on the indices where strings are present.
Also, the reset_index(drop=True) part was indeed unwanted, as the grouped fill returns a series that can be assigned straight back to column 1.
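For the "replace whitespace with NaN" alternative mentioned in the question, a minimal sketch (stripping first, so only entries that are empty after stripping become NaN while the inner spaces of 'tes t' survive):
import numpy as np
import pandas as pd

df = pd.DataFrame([[100, ' tes t ', 3], [100, np.nan, 2], [101, ' test1', 3], [101, ' ', 4]])
df[1] = df[1].str.strip().replace('', np.nan)  # whitespace-only entries become NaN
df[1] = df.groupby(0)[1].ffill()               # forward-fill within each group
df[3] = df.groupby(0)[2].transform('mean')     # per-group mean as a new column
print(df)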
I'm trying to create a column of microsatellite motifs in a pandas dataframe. I have one column that gives the length of the motif and another that has the whole microsatellite.
Here's an example of the columns of interest.
motif_len sequence
0 3 ATTATTATTATT
1 4 ATCTATCTATCT
2 3 ATCATCATCATC
I would like to slice the values in sequence using the values in motif_len to give a single repeat (motif) of each microsatellite. I'd then like to add all these motifs as a third column in the dataframe to give something like this:
motif_len sequence motif
0 3 ATTATTATTATT ATT
1 4 ATCTATCTATCT ATCT
2 3 ATCATCATCATC ATC
I've tried a few things with no luck.
>>df['motif'] = df.sequence.str[:df.motif_len]
>>df['motif'] = df.sequence.str[:df.motif_len.values]
Both make the motif column but all the values are NaN.
I think I understand why these don't work: I'm passing a series/array as the upper index in the slice rather than a value from the motif_len column.
I also tried to create the series by iterating through each row, with no luck.
Any ideas?
You can call apply on the df, passing axis=1 to apply row-wise, and use the column values to slice the str:
In [5]:
df['motif'] = df.apply(lambda x: x['sequence'][:x['motif_len']], axis=1)
df
Out[5]:
motif_len sequence motif
0 3 ATTATTATTATT ATT
1 4 ATCTATCTATCT ATCT
2 3 ATCATCATCATC ATC
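If apply is slow on a large frame, a list-comprehension sketch in the spirit of the zip answer to the first question above should produce the same column:
df['motif'] = [seq[:n] for seq, n in zip(df['sequence'], df['motif_len'])]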
So I created two dataframes from existing CSV files, both consisting entirely of numbers. The second dataframe has an index from 0 to 8783 and one column of numbers, and I want to add it as a new column to the first dataframe, whose index consists of a month, day and hour. I tried using append, merge and concat and none of them worked, and then I tried simply using:
x1GBaverage['Power'] = x2_cut
where x1GBaverage is the first dataframe and x2_cut is the second. When I did this it added x2_cut on properly but all the values were entered as NaN instead of the numerical values that they should be. How should I be approaching this?
x1GBaverage['Power'] = x2_cut.values
problem solved :)
The thing about pandas is that values are implicitly aligned by their indices unless you deliberately specify that only the values should be transferred over.
If the row counts are the same and you just want to tack the column on the end, the indexes either need to match or you need to pass just the underlying values. In the example below, columns 3 and 5 are the index-matching and raw-values versions, and column 4 is what you're running into now:
In [58]: df = pd.DataFrame(np.random.random((3,3)))
In [59]: df
Out[59]:
0 1 2
0 0.670812 0.500688 0.136661
1 0.185841 0.239175 0.542369
2 0.351280 0.451193 0.436108
In [61]: df2 = pd.DataFrame(np.random.random((3,1)))
In [62]: df2
Out[62]:
0
0 0.638216
1 0.477159
2 0.205981
In [64]: df[3] = df2
In [66]: df.index = ['a', 'b', 'c']
In [68]: df[4] = df2
In [70]: df[5] = df2.values
In [71]: df
Out[71]:
0 1 2 3 4 5
a 0.670812 0.500688 0.136661 0.638216 NaN 0.638216
b 0.185841 0.239175 0.542369 0.477159 NaN 0.477159
c 0.351280 0.451193 0.436108 0.205981 NaN 0.205981
If the row counts differ, you'll need to use df.merge and tell it which columns it should use to join the two frames.
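If the frames do share a key column, a hedged sketch of that merge (the column name 'hour' here is hypothetical; substitute whatever actually links the two frames):
# 'hour' is a made-up example key present in both frames
merged = x1GBaverage.merge(x2_cut, on='hour', how='left')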
I have a two-column dataframe df where each row is distinct, and one element in one column can map to one or more elements in the other column. I want to filter OUT those elements, so that in the final dataframe one element in one column maps only to a unique element in the other column.
What I am doing is a groupby on one column to count the duplicates, then removing rows with counts of more than 1, and then doing it again for the other column. I am wondering if there is a better, simpler way.
Thanks
edit1: I just realized my solution is INCORRECT: removing multi-mapping elements in column A changes the mapping counts in column B. Consider the following example:
A B
1 4
1 3
2 4
1 maps to both 3 and 4, so the first two rows should be removed; likewise, 4 maps to both 1 and 2, so the final table should be empty. However, my solution keeps the last row.
Can anyone provide a fast and simple solution? Thanks
Well, you could do something like the following:
>>> df
A B
0 1 4
1 1 3
2 2 4
3 3 5
You only want to keep a row if no other row has that row's value of 'A' and no other row has its value of 'B'. Only the row at index 3 meets those conditions in this example:
>>> Aone = df.groupby('A').filter(lambda x: len(x) == 1)
>>> Bone = df.groupby('B').filter(lambda x: len(x) == 1)
>>> Aone.merge(Bone, on=['A', 'B'], how='inner')
A B
0 3 5
Explanation:
>>> Aone = df.groupby('A').filter(lambda x: len(x) == 1)
>>> Aone
A B
2 2 4
3 3 5
The above grabs the rows that may be allowed based on looking at column 'A' alone.
>>> Bone = df.groupby('B').filter(lambda x: len(x) == 1)
>>> Bone
A B
1 1 3
3 3 5
The above grabs the rows that may be allowed based on looking at column 'B' alone. An inner merge of the two then leaves you with the rows that meet both conditions:
>>> Aone.merge(Bone, on=['A', 'B'], how='inner')
Note, you could also do a similar thing using groupby/transform. But transform tends to be slowish, so I didn't include it as an alternative.
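For completeness, a transform-based sketch of the same filtering (untimed); it keeps a row only when both its 'A' value and its 'B' value occur exactly once:
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 3], 'B': [4, 3, 4, 5]})
mask = (df.groupby('A')['A'].transform('size') == 1) & \
       (df.groupby('B')['B'].transform('size') == 1)
print(df[mask])  # only the A=3, B=5 row survives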