Reorder the contents of my string in Python

This is a sample dataframe which I made for better understanding.
import pandas as pd
import numpy as np
df = pd.DataFrame([["{postal2:express}", "{postal1:regular}", "{postal4:slow}", "{postal3:slow}","{address1:ABC}","{address4:VPN}","{address3:XYZ}","{address2:POL}"],["{postal1:regular}","{postal4:slow}","{address4:VPN}","{address3:XYZ}","{postal3:slow}","{address1:ABC}","{postal2:express}","{address2:POL}"]], columns=['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'])
However, the output I need should be grouped as in the target image (targetdf).
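Since the target image is not included here, the following is only a minimal sketch of one possible grouping, assuming the goal is to reorder each row so that the {postalN:...} and {addressN:...} cells with the same number sit next to each other (the group_row helper and its regex are assumptions, not taken from the question):
import re

# Assumed approach: sort each row's cells by the number embedded in the
# "{postalN:...}" / "{addressN:...}" tag, so matching pairs end up adjacent.
def group_row(row):
    def sort_key(cell):
        kind, num = re.match(r"\{(postal|address)(\d+):", cell).groups()
        return (int(num), kind)  # e.g. {address1:...}, {postal1:...}, {address2:...}, ...
    return pd.Series(sorted(row, key=sort_key), index=row.index)

grouped = df.apply(group_row, axis=1)
print(grouped)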

Related

How to speed up nested for loop with dataframe?

I have a dataframe like this:
test = pd.DataFrame({'id':['a','C','D','b','b','D','c','c','c'], 'text':['a','x','a','b','b','b','c','c','c']})
Using the following for-loop I can add x to a new_col. This for-loop works fine for the small dataframe. However, for dataframes that have thousands of rows, it will take many hours to process. Any suggestions to speed it up?
for index, row in test.iterrows():
    if row['id'] == 'C':
        if test['id'][index+1] == 'D':
            test['new_col'][index+1] = test['text'][index]
Try using shift() and conditions.
import pandas as pd
import numpy as np
df = pd.DataFrame({'id': ['a', 'C', 'D', 'b', 'b', 'D', 'c', 'c', 'c'],
                   'text': ['a', 'x', 'a', 'b', 'b', 'b', 'c', 'c', 'c']})
df['temp_col'] = df['id'].shift()
df['new_col'] = np.where((df['id'] == 'D') & (df['temp_col'] == 'C'), df['text'].shift(), "")
del df['temp_col']
print(df)
We can also do it without a temporary column. (Thanks & credits to Prayson 🙂)
df['new_col'] = np.where((df['id'].eq('D')) & (df['id'].shift().eq('C')), df['text'].shift(), "")

How to merge multiple data frames rows into a list?

I have multiple data frames. I need to merge them all and then take one column at a time from each df.
To make it simple, I have multiple lists, like:
l1=[a,b,c]
l2=[d,e,f]
l3=[g,h,i]
I want my list to be like the one given below:
list=[a,d,g,b,e,h,c,f,i]
I am using a numpy array:
np.array([l1,l2,l3]).ravel('F')
Out[537]: array(['a', 'd', 'g', 'b', 'e', 'h', 'c', 'f', 'i'], dtype='<U1')
Since you mention pandas:
pd.DataFrame([l1,l2,l3]).melt().value.tolist()
Out[543]: ['a', 'd', 'g', 'b', 'e', 'h', 'c', 'f', 'i']
l1=['a','b','c']
l2=['d','e','f']
l3=['g','h','i']
list1 = []
for i in range(len(l1)):
    list1.append(l1[i])
    list1.append(l2[i])
    list1.append(l3[i])
print(list1)
import itertools
list(itertools.chain.from_iterable(zip(l1, l2, l3)))
worked for me.

Frequent pattern mining in Python

I want to know how to get the absolute support and relative support of itemsets in python. Presently I have the following:
import pandas as pd
import pyfpgrowth
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
from collections import Counter
dataset = [['a', 'b', 'c', 'd'],
           ['b', 'c', 'e', 'f'],
           ['a', 'd', 'e', 'f'],
           ['a', 'e', 'f'],
           ['b', 'd', 'f']]
te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)
print (df)
#print support
print(apriori(df, min_support = 0.0))
#print frequent itemset
frequent_itemsets = apriori(df, min_support=0.6, use_colnames=True)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(lambda x: len(x))
frequent_itemsets
print ("frequent itemset at min support = 0.6")
print(frequent_itemsets)
but I do not know how to return the absolute support and relative support.
The relative support is part of your frequent_itemsets DataFrame. You can get it from:
frequent_itemsets['support']
And you can calculate the absolute support by multiplying the support by the number of baskets (transactions):
frequent_itemsets['support']*len(dataset)
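For example, both measures can sit side by side in the same DataFrame (a small sketch building on the frequent_itemsets above; the absolute_support column name is just a choice, and the rounding only guards against floating-point noise):
frequent_itemsets['absolute_support'] = (frequent_itemsets['support'] * len(dataset)).round().astype(int)
print(frequent_itemsets[['itemsets', 'support', 'absolute_support', 'length']])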

How to convert multiple fasta lines in a matrix in python?

I have a file (txt or fasta) like this, where each sequence is on a single line.
>Line1
ATCGCGCTANANAGCTANANAGCTAGANCACGATAGAGAGAGACTATAGC
>Line2
ATTGCGCTANANAGCTANANCGATAGANCACGAAAGAGATAGACTATAGC
>Line3
ATCGCGCTANANAGCTANANGGCTAGANCNCGAAAGNGATAGACTATAGC
>Line4
ATTGCGCTANANAGCTANANGGATAGANCACGAGAGAGATAGACTATAGC
>Line5
ATTGCGCTANANAGCTANANCGATAGANCACGATNGAGATAGACTATAGC
I have to get a matrix in which each position corresponds to one of the letters (nucleotides) of the sequences, in this case a matrix of (5x50).
I've been trying numpy methods. I hope someone can help me.
If you are working with DNA sequence data in python, I would recommend using the Biopython library. You can install it with pip install biopython.
Here is how you would achieve your desired result:
from Bio import SeqIO
import os
import numpy as np

pathToFile = os.path.join("C:\\", "Users", "Kevin", "Desktop", "test.fasta")  # windows machine
allSeqs = []
for seq_record in SeqIO.parse(pathToFile, "fasta"):
    allSeqs.append(seq_record.seq)
seqMat = np.array(allSeqs)
But in the for loop, each seq_record.seq is a Seq object, giving you the flexibility to perform operations on them.
In [5]: seqMat.shape
Out[5]: (5L, 50L)
You can slice your seqMat array however you like.
In [6]: seqMat[0]
Out[6]: array(['A', 'T', 'C', 'G', 'C', 'G', 'C', 'T', 'A', 'N', 'A', 'N', 'A',
               'G', 'C', 'T', 'A', 'N', 'A', 'N', 'A', 'G', 'C', 'T', 'A', 'G',
               'A', 'N', 'C', 'A', 'C', 'G', 'A', 'T', 'A', 'G', 'A', 'G', 'A',
               'G', 'A', 'G', 'A', 'C', 'T', 'A', 'T', 'A', 'G', 'C'],
              dtype='|S1')
Highly recommend checking out the tutorial though!
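If np.array does not expand the Seq objects into single characters on your Biopython/NumPy versions, a safer variant (just a sketch reusing the same pathToFile) is to split each record into characters explicitly:
from Bio import SeqIO
import numpy as np

# Convert every record to a list of single-character strings before stacking,
# so the result is always a (num_sequences x sequence_length) character array.
seqMat = np.array([list(str(rec.seq)) for rec in SeqIO.parse(pathToFile, "fasta")])
print(seqMat.shape)  # (5, 50) for the example file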
I hope this short bit of code helps. You basically need to split the string into a character array. After that you just put everything into a matrix.
Line1 = "ATGC"
Line2 = "GCTA"
Matr1 = np.matrix([n for n in Line1], [n for n in Line2])
Matr1[0,0] will return the first element in your matrix.
One way of achieving the matrix is to read the content of the file and convert it into a list where each element is the sequence from one line. Then you can access it as a 2D data structure.
Ex: [ATCGCGCTANANAGCTANANAGCTAGANCACGATAGAGAGAGACTATAGC, ATCGCGCTANANAGCTANANAGCTAGANCACGATAGAGAGAGACTATAGC, ATCGCGCTANANAGCTANANAGCTAGANCACGATAGAGAGAGACTATAGC, ATCGCGCTANANAGCTANANAGCTAGANCACGATAGAGAGAGACTATAGC, ATCGCGCTANANAGCTANANAGCTAGANCACGATAGAGAGAGACTATAGC]
filePath = "file path containing the sequence"
List that store the sequence as a matrix
listFasta =list ((open(filePath).read()).split("\n"))
for seq in listFasta:
for charac in seq:
print charac
Another way to access each element of your matrix
for seq in range(len(listFasta)):
for ch in range(len(listFasta[seq])):
print listFasta[seq][ch]
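To actually get the 5x50 matrix the question asks for, a minimal sketch (assuming filePath points at the FASTA file above) is to skip the '>' header lines and split each remaining line into characters:
import numpy as np

# Keep only the sequence lines (drop '>' headers and blanks), then split each
# sequence into single characters to build the character matrix.
with open(filePath) as fh:
    seqs = [line.strip() for line in fh if line.strip() and not line.startswith(">")]
seqMat = np.array([list(s) for s in seqs])
print(seqMat.shape)  # (5, 50) for the example file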

Stacked bar chart with differently ordered colors using matplotlib

I am a beginner in Python. I am trying to make a horizontal bar chart with differently ordered colors.
I have a data set like the one below:
dataset = [{'A':19, 'B':39, 'C':61, 'D':70},
           {'A':34, 'B':68, 'C':32, 'D':38},
           {'A':35, 'B':45, 'C':66, 'D':50},
           {'A':23, 'B':23, 'C':21, 'D':16}]
data_orders = [['A', 'B', 'C', 'D'],
               ['B', 'A', 'C', 'D'],
               ['A', 'B', 'D', 'C'],
               ['B', 'A', 'C', 'D']]
The first list contains the numerical data, and the second one contains the order of the items. I need the second list here because the order of A, B, C, and D is crucial when presenting each row in my case.
Using data like the above, I want to make a stacked bar chart like the picture below, which I made manually in MS Excel. What I hope to do now is produce this type of bar chart with Matplotlib from the dataset above in a more automatic way. I also want to add a legend to the chart if possible.
Actually, I have gotten totally lost trying this by myself. Any help would be very much appreciated.
Thank you very much for your attention!
It's a long program, but it works. I added one dummy row to make the row count differ from the column count:
import numpy as np
from matplotlib import pyplot as plt
dataset = [{'A':19, 'B':39, 'C':61, 'D':70},
           {'A':34, 'B':68, 'C':32, 'D':38},
           {'A':35, 'B':45, 'C':66, 'D':50},
           {'A':23, 'B':23, 'C':21, 'D':16},
           {'A':35, 'B':45, 'C':66, 'D':50}]
data_orders = [['A', 'B', 'C', 'D'],
               ['B', 'A', 'C', 'D'],
               ['A', 'B', 'D', 'C'],
               ['B', 'A', 'C', 'D'],
               ['A', 'B', 'C', 'D']]
colors = ["r","g","b","y"]
names = sorted(dataset[0].keys())
values = np.array([[data[name] for name in order] for data,order in zip(dataset, data_orders)])
lefts = np.insert(np.cumsum(values, axis=1),0,0, axis=1)[:, :-1]
orders = np.array(data_orders)
bottoms = np.arange(len(data_orders))
for name, color in zip(names, colors):
    idx = np.where(orders == name)
    value = values[idx]
    left = lefts[idx]
    # barh draws the horizontal segments; align='edge' matches the yticks offset below
    plt.barh(bottoms, value, height=0.8, left=left,
             color=color, label=name, align='edge')
plt.yticks(bottoms+0.4, ["data %d" % (t+1) for t in bottoms])
plt.legend(loc="best", bbox_to_anchor=(1.0, 1.00))
plt.subplots_adjust(right=0.85)
plt.show()
The resulting figure shows one horizontal stacked bar per data row, with the segments colored in that row's own order.
>>> dataset = [{'A':19, 'B':39, 'C':61, 'D':70},
...            {'A':34, 'B':68, 'C':32, 'D':38},
...            {'A':35, 'B':45, 'C':66, 'D':50},
...            {'A':23, 'B':23, 'C':21, 'D':16}]
>>> data_orders = [['A', 'B', 'C', 'D'],
...                ['B', 'A', 'C', 'D'],
...                ['A', 'B', 'D', 'C'],
...                ['B', 'A', 'C', 'D']]
>>> for i, x in enumerate(data_orders):
...     for y in x:
...         pass  # do something here with dataset[i][y] in matplotlib
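One way the loop body could be filled in (only a sketch; the colors mapping and the use of plt.barh are assumptions, not part of the answer above):
import matplotlib.pyplot as plt

colors = {'A': 'r', 'B': 'g', 'C': 'b', 'D': 'y'}  # assumed color mapping
for i, x in enumerate(data_orders):
    left = 0
    for y in x:
        # one horizontal segment per letter, stacked from the running left edge
        plt.barh(i, dataset[i][y], left=left, color=colors[y],
                 label=y if i == 0 else None)
        left += dataset[i][y]
plt.yticks(range(len(data_orders)), ["data %d" % (i + 1) for i in range(len(data_orders))])
plt.legend(loc="best")
plt.show()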
