So I am stuck on how to approach a data manipulation problem in pandas. I have an example dataframe below, where each row sums to 25 counts.
I would like to merge columns whose names are reverse complements of each other.
AA  CC  GG  AT  TT
 4   7   0   9   5
 3   8   5   5   2
 8   6   2   8   1
The columns "AA" and "TT" are reverse compliments of each other as are "CC" and "GG"
AA/TT  CC/GG  AT
    9      7   9
    5     13   5
    9      8   8
How can I match the reverse complement of a column name and merge that column with its partner?
Note: I already have a function to find the reverse complement of a string.
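(For readers following along without such a helper, here is a minimal sketch of one; the function name and the plain A/T/C/G alphabet are illustrative assumptions, not the asker's actual code:)

def reverse_complement(seq):
    """Reverse a DNA string and swap each base for its complement."""
    complement = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}
    return ''.join(complement[base] for base in reversed(seq))

reverse_complement('AA')  # 'TT'
reverse_complement('AT')  # 'AT' -- its own reverse complement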
I'd suggest just creating a new frame using pd.concat:
new_df = pd.concat([df[['AA', 'TT']].sum(axis=1).rename('AA/TT'),
                    df[['CC', 'GG']].sum(axis=1).rename('CC/GG'),
                    df['AT']], axis=1)
>>> new_df
   AA/TT  CC/GG  AT
0      9      7   9
1      5     13   5
2      9      8   8
More generally, you could do it with a list comprehension. Given the reverse complements:
reverse_complements = [['AA', 'TT'], ['CC', 'GG']]
Find the columns of your original dataframe that are not in reverse_complements (there might be a better way here, but this works):
import numpy as np

reverse_complements.append(df.columns.difference(
    np.array(reverse_complements).flatten()))
And use pd.concat with a list comprehension:
new_df = pd.concat([df[x].sum(axis=1).rename('/'.join(x))
                    for x in reverse_complements], axis=1)
>>> new_df
   AA/TT  CC/GG  AT
0      9      7   9
1      5     13   5
2      9      8   8
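If you'd rather derive the pairs automatically from the column names instead of hardcoding them, a sketch along these lines should work; it assumes the asker's helper is called reverse_complement (my assumption):

import pandas as pd

def pair_columns(df, reverse_complement):
    # Group each column with its reverse complement when that column exists;
    # self-complementary or unpaired columns form singleton groups.
    groups, seen = [], set()
    for col in df.columns:
        if col in seen:
            continue
        rc = reverse_complement(col)
        if rc != col and rc in df.columns:
            groups.append([col, rc])
            seen.update((col, rc))
        else:
            groups.append([col])
            seen.add(col)
    return groups

new_df = pd.concat([df[g].sum(axis=1).rename('/'.join(g))
                    for g in pair_columns(df, reverse_complement)], axis=1)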
I have a Pandas DataFrame in Python such as this:
  Group Pre/post  Value
0     A      Pre      3
1     A      Pre      5
2     A     Post     13
3     A     Post     15
4     B      Pre      7
5     B      Pre      8
6     B     Post     17
7     B     Post     18
And I'd like to turn it into a different table such as:
  Group  Pre  Post
0     A    3    13
1     A    5    15
2     B    7    17
3     B    8    18
I tried pivoting with df.pivot(index='Group', columns='Pre/post', values='Value'), but since I have repeated values and order is important, it raised a ValueError (pivot cannot reshape an index with duplicate entries).
Here is one way to do it: use list as the aggfunc in pivot_table to collect the duplicated values for each index/column pair into a list, then use explode to split each list back into multiple rows.
(df.pivot_table(index='Group', columns='Pre/post', values='Value', aggfunc=list)
   .reset_index()
   .explode(['Post', 'Pre'], ignore_index=True))
Pre/post Group  Post  Pre
0            A    13    3
1            A    15    5
2            B    17    7
3            B    18    8
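One small caveat: pivot_table orders the new columns alphabetically, so Post lands before Pre. If the Pre-before-Post order from the desired output matters, reorder the columns afterwards and drop the leftover column-axis name, e.g.:

out = (df.pivot_table(index='Group', columns='Pre/post', values='Value', aggfunc=list)
         .reset_index()
         .explode(['Post', 'Pre'], ignore_index=True))
out = out[['Group', 'Pre', 'Post']]
out.columns.name = None  # remove the 'Pre/post' label left over from the pivot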
I just started learning pandas and I am trying to figure out the easiest possible solution for the problem mentioned below.
Suppose I have a dataframe like this:
A  B
6  7
8  9
5  6
7  8
Here, I'm selecting the minimum-value cell from column 'A' as the starting point and recording the sequence in a new column 'C'. After sequencing, the dataframe must look like this:
A  B  C
5  6  0
6  7  1
7  8  2
8  9  3
Is there any easy way to pick a cell from column 'A', match it with the matching cell in column 'B', and update the sequence accordingly in column 'C'?
Some extra conditions ->
If 5 is present in column 'B' then I need to add another row like this -
A  B  C
0  5  0
5  6  1
......
Try sort_values:
import numpy as np

df.sort_values('A').assign(C=np.arange(len(df)))
Output:
   A  B  C
2  5  6  0
0  6  7  1
3  7  8  2
1  8  9  3
I'm not sure what you mean with the extra conditions though.
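If the extra condition means "when the chain's starting value (the minimum of 'A') also appears somewhere in column 'B', prepend a seed row" -- and that is only my guess at the intent -- a sketch under that assumption:

import numpy as np
import pandas as pd

out = df.sort_values('A', ignore_index=True)
start = out['A'].iloc[0]
if start in set(out['B']):
    # Prepend the extra row (A=0, B=start) from the question's example
    out = pd.concat([pd.DataFrame({'A': [0], 'B': [start]}), out],
                    ignore_index=True)
out['C'] = np.arange(len(out))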
Assuming the following DataFrame:
  key.0 key.1 key.2  topic
1   abc   def   ghi      8
2   xab   xcd   xef      9
How can I combine the values of all the key.* columns into a single column 'key', that's associated with the topic value corresponding to the key.* columns? This is the result I want:
   topic  key
1      8  abc
2      8  def
3      8  ghi
4      9  xab
5      9  xcd
6      9  xef
Note that the number of key.N columns is variable on some external N.
You can melt your dataframe:
>>> keys = [c for c in df if c.startswith('key.')]
>>> pd.melt(df, id_vars='topic', value_vars=keys, value_name='key')
   topic variable  key
0      8    key.0  abc
1      9    key.0  xab
2      8    key.1  def
3      9    key.1  xcd
4      8    key.2  ghi
5      9    key.2  xef
It also gives you the source of the key.
From v0.20, melt is a first-class method of the pd.DataFrame class:
>>> df.melt('topic', value_name='key').drop('variable', axis=1)
   topic  key
0      8  abc
1      9  xab
2      8  def
3      9  xcd
4      8  ghi
5      9  xef
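Note that melt interleaves the topics (8, 9, 8, 9, ...). If you want the rows grouped by topic as in the desired output, a sort afterwards should do it:

(df.melt('topic', value_name='key')
   .drop(columns='variable')
   .sort_values('topic', ignore_index=True))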
After trying various ways, I find the following more or less intuitive, provided stack's magic is understood:
# keep topic as index, stack other columns 'against' it
stacked = df.set_index('topic').stack()
# set the name of the new series created
df = stacked.reset_index(name='key')
# drop the 'source' level (key.*)
df.drop('level_1', axis=1, inplace=True)
The resulting dataframe is as required:
   topic  key
0      8  abc
1      8  def
2      8  ghi
3      9  xab
4      9  xcd
5      9  xef
You may want to print intermediary results to understand the process in full. If you don't mind having more columns than needed, the key steps are set_index('topic'), stack() and reset_index(name='key').
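For reference, the same three steps chain into a single expression (equivalent to the code above):

df = (df.set_index('topic')
        .stack()
        .reset_index(name='key')
        .drop(columns='level_1'))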
OK, since one of the current answers was marked as a duplicate of this question, I will answer here.
Using wide_to_long (note sep='.', needed because the columns are named key.0, key.1, ...; j is just the name given to the captured suffix level):
pd.wide_to_long(df, ['key'], i='topic', j='num', sep='.').reset_index().drop('num', axis=1)
Out[123]:
   topic  key
0      8  abc
1      9  xab
2      8  def
3      9  xcd
4      8  ghi
5      9  xef
I have a pandas dataframe that I group by, then aggregate to get the mean:
grouped = df.groupby(['year_month', 'company'])
means = grouped.agg({'size':['mean']})
Which gives me a dataframe back, but I can't seem to filter it to the specific company and year_month that I want:
means[(means['year_month']=='201412')]
gives me a KeyError
The issue is that you are grouping on 'year_month' and 'company'. Hence, in the means DataFrame, year_month and company are part of the index (a MultiIndex), so you cannot access them the way you access ordinary columns.
One way to do this is to select on the values of the index level 'year_month'. Example:
means.loc[means.index.get_level_values('year_month') == '201412']
Demo:
In [38]: df
Out[38]:
   A  B   C
0  1  2  10
1  3  4  11
2  5  6  12
3  1  7  13
4  2  8  14
5  1  9  15
In [39]: means = df.groupby(['A','B']).mean()

In [40]: means
Out[40]:
      C
A B
1 2  10
  7  13
  9  15
2 8  14
3 4  11
5 6  12
In [41]: means.loc[means.index.get_level_values('A') == 1]
Out[41]:
      C
A B
1 2  10
  7  13
  9  15
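An alternative worth knowing: DataFrame.xs selects on an index level directly; pass drop_level=False if you want to keep that level in the result:

means.xs(1, level='A', drop_level=False)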
As already pointed out, you will end up with a two-level index. You could instead unstack the aggregated dataframe:
means = df.groupby(['year_month', 'company']).agg({'size':['mean']}).unstack(level=1)
This should give you a single 'year_month' index, 'company' as columns and your aggregate size as values. You can then slice by the index:
means.loc['201412']
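After the unstack, the columns form a MultiIndex of ('size', 'mean', company), so a single company's value can also be pulled out with a full column key (here 'acme' is a hypothetical company name):

means[('size', 'mean', 'acme')].loc['201412']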
I have the following pd.DataFrame:
AllData =
a#a.6  f#s.2  c#c.2  d#w.4  k#a.3
    1      8      3      3      8
    4      4      7      4      3
    6      8      9      1      6
    3      4      5      6      1
    7      6      0      8      1
And I would like to create a new pd.DataFrame with only the columns whose names are keys in the following dictionary:
my_dict = {'a#a.6': value1, 'c#c.2': value2, 'd#w.4': value5}
So the new DataFrame would be:
FilteredData =
a#a.6  c#c.2  d#w.4
    1      3      3
    4      7      4
    6      9      1
    3      5      6
    7      0      8
What is the most efficient way of doing this?
I have tried to use:
FilteredData = AllData.filter(regex=my_dict.keys)
but unsurprisingly, this didn't work. Any suggestions/advice welcome
Cheers, Alex
You can also do this without the filter method at all, like this:
FilteredData = AllData[list(my_dict.keys())]
Pandas dataframes have a filter method that returns a new dataframe. Try this:
FilteredData = AllData.filter(items=my_dict.keys())
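As for why the original attempt failed: filter(regex=...) expects a pattern string, and my_dict.keys (without parentheses) is just a method reference. If you do want the regex route, you can build a pattern from the keys; note re.escape, since '.' in the column names is a regex metacharacter:

import re

pattern = '|'.join(map(re.escape, my_dict.keys()))
FilteredData = AllData.filter(regex=pattern)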