convert columns of text to rows in python 3.x - python

I am using the code below to convert the columns in the text to rows. My requirement is to find the count of each character in each column of the text.
b=[''.join(i) for i in zip(*a2.split())]
print(b)
I am getting the output below:
['CCACTCGT', 'GTGGCCCC', 'AGCACTGC', 'CCTGCAGA', 'TTTAACCA', 'CGTACCTC', 'CACCCCCA', 'CGCCCCTT', 'GCTCCATG', 'CCAAAGGA', 'GCTCGCCT', 'ACTCACCC', 'ATCCTGGG', 'GGAACGCT', 'ACATCCTG', 'CGGCTTGC', 'TCAACCCG', 'TACGCGTT', 'GTCATCGT', 'ACAGAACC', 'CCCCCCTC', 'CACCCTGT', 'CACTTCCG', 'CGACTTCC', 'AGCCTCGA', 'AACCTGCA', 'ACTTCGTG', 'GCCTTCGT', 'CCTCGTCG', 'TTGCGGTC', 'CTGAGTGA', 'GCTCGGTG', 'GTACACGC', 'GCCTGCGT', 'CGCCAGCG', 'GGATCGTA', 'CAGGCGGG', 'ATACCGCG', 'CCTTCGTC', 'CCCCTGAC', 'CGTCCCGC', 'CGCTAGTC', 'CGGCGCGG', 'CACCCCCC', 'TGCGCGTC', 'GACTCCGC', 'CCATCCAC', 'AGTCTTCG', 'CGCTGCGC', 'AATCTCCC', 'CACCACCC', 'TTGCGCTA', 'TCGTGCGC', 'CTTGGAGA', 'CGTAGTCG', 'CTTGCGCC', 'CCTAGCGC', 'ATTGGCGC', 'CCTCGGCC', 'TACCGCCG', 'CGCTCCGC', 'TAGCCTGC', 'CCTATTCC', 'ACAACCCA', 'GTGCCGGC']
You can see that the last 5 columns of the text are not coming through in the list. I am not able to figure out why this is happening; any help would be highly appreciated.
Also, please suggest if there is any other way to achieve the same result.

zip returns as many tuples as there are items in the shortest iterable, so only full columns are returned. To get all columns you can use zip_longest, like this:
from itertools import zip_longest
b = [''.join(i) for i in zip_longest(*a2.split(), fillvalue='')]
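Since the stated goal is to count each character in each column, a collections.Counter per column is a natural next step. A minimal sketch building on the line above (a2 is the original multi-line text from the question):
from collections import Counter
from itertools import zip_longest

b = [''.join(col) for col in zip_longest(*a2.split(), fillvalue='')]
# One Counter per column, tallying how often each character occurs in it.
counts = [Counter(col) for col in b]
print(counts[0])  # e.g. Counter({'C': 4, 'T': 2, 'A': 1, 'G': 1}) for 'CCACTCGT'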

Related

Trouble converting a pandas dataframe into a list with the right utf-8 encoding

I'm trying to convert a Pandas DataFrame into a list, which works, but I have some issues with the encoding. I hope someone can give me advice on how to handle this problem. Right now, I'm using Python 2.7.
I'm loading an excel file and it loads correctly.
I'm using following code and I get following output:
germanStatesExcelFile='German_States.xlsx'
ePath_german_states=(os.path.dirname(__file__))+'/'+germanStatesExcelFile
german_states = pd.read_excel(ePath_german_states)
print("doc " + str(german_states))
Output:
states
0 baden-württemberg
1 bayern
2 hessen
3 rheinland-pfalz
4 saarland
5 nordrhein-westfalen
The next step is converting this Dataframe into a list, which I do with following code:
german_states = german_states['states'].tolist()
Output:
[u'baden-w\xfcrttemberg', u'bayern', u'hessen', u'rheinland-pfalz', u'saarland', u'nordrhein-westfalen']
It seems like the list is not handling the UTF-8 encoding correctly, so I tried the following step:
german_states = [x.encode('utf-8') for x in german_states]
Output:
['baden-w\xc3\xbcrttemberg', 'bayern', 'hessen', 'rheinland-pfalz', 'saarland', 'nordrhein-westfalen']
I would like to have following Output:
['baden-württemberg', 'bayern', 'hessen', 'rheinland-pfalz', 'saarland', 'nordrhein-westfalen']
A little late to the party, but if encoding to UTF-8 as below doesn't work, you could use the unicodedata.normalize function:
german_states_decoded = [x.encode('utf-8') for x in german_states]
On Python 3, where all strings are Unicode, the built-in str gives the output you want, as below. Note that on Python 2 (which the question uses), str() raises UnicodeEncodeError for non-ASCII characters such as 'ü', so this only works if your strings are pure ASCII.
Otherwise, there are a number of good answers to a similar question.
german_states = [u'baden-w\xfcrttemberg', u'bayern', u'hessen', u'rheinland-pfalz', u'saarland', u'nordrhein-westfalen']
german_states = list(map(str, german_states))
# ['baden-württemberg', 'bayern', 'hessen', 'rheinland-pfalz', 'saarland', 'nordrhein-westfalen']
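Worth noting: the \xfc escapes in the question's output are only how Python 2 displays the repr of a list; the underlying strings are intact. A small Python 2 sketch showing that printing the elements gives the readable form:
# -*- coding: utf-8 -*-
german_states = [u'baden-w\xfcrttemberg', u'bayern', u'hessen',
                 u'rheinland-pfalz', u'saarland', u'nordrhein-westfalen']
# repr() escapes non-ASCII characters, but printing each element
# writes the actual character:
for state in german_states:
    print state  # baden-württemberg, bayern, ...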

Ordering file in list by pattern python

I have a list of 240 figure filenames, each starting with 'fig' followed by the number of the figure.
Here is an example:
fig1-24-24-32
fig3-45-32-12
fig2-24-24-31
fig5-24-24-31
fig6-24-24-31
fig4-24-24-31
I would like to order that list by fig name:
fig1-24-24-32
fig2-24-24-31
fig3-45-32-12
fig4-24-24-31
fig5-24-24-31
fig6-24-24-31
I have tried :
print(glob.glob('fig*[1-241]*'))
However, this does not work: in a glob pattern, [1-241] is a character class matching a single character ('1', '2', or '4'), not the numeric range 1-241, and glob does not sort its results.
UPDATE
Found the answer to my question here:
https://stackoverflow.com/a/2669120/6235069 (answer by @Mark Byers)
I am assuming here that all the filenames start with the same 3-character prefix (it does not have to be 'fig'; the prefix itself is not taken into account), followed by one or more digits up to the first dash ('-').
If that is indeed the case, you can use the following:
sorted(my_files, key=lambda x: int(x.split('-')[0][3:]))
Note that my_files is a list containing all the filenames (basenames).
Output:
['fig1-24-24-32',
'fig2-24-24-31',
'fig3-45-32-12',
'fig4-24-24-31',
'fig5-24-24-31',
'fig6-24-24-31']
The code below will do the job:
mylist=['fig1-24-24-32',
'fig3-45-32-12',
'fig2-24-24-31',
'fig5-24-24-31',
'fig6-24-24-31',
'fig4-24-24-31']
updated_list=sorted(mylist)
A plain sorted() does the job here only because every figure number is a single digit; with multi-digit numbers, lexicographic order puts 'fig10' before 'fig2' (see the natural-sort sketch after the output below).
updated_list
['fig1-24-24-32',
'fig2-24-24-31',
'fig3-45-32-12',
'fig4-24-24-31',
'fig5-24-24-31',
'fig6-24-24-31']
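If the figure numbers ever go past one digit, a general natural-sort key handles the ordering. A sketch, not tied to the 3-character prefix assumption above:
import re

def natural_key(name):
    # Split into digit and non-digit runs, converting digit runs to ints,
    # so 'fig10' compares numerically against 'fig2'.
    return [int(tok) if tok.isdigit() else tok for tok in re.split(r'(\d+)', name)]

sorted(mylist, key=natural_key)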

Process as string for each item in dataframe or list in Python

I'm trying something very simple on python.
zips = sempmme['Zip code'].unique()
I want to apply zipcode.isequal('12345') to each element of zips, but I'm not sure how to do it in a Pythonic, efficient way.
I tried zipcode.isequal(lambda x: x in zips) and even a for loop, but I can't seem to get it to work.
for i in range(0, len(zips)):
    # print(zips[i])
    cities[i] = zipcode.isequal("" + zips[i])
It shows 'isequal() can only take string'. Needless to say, this is the first time I'm coding in Python, and I figured the best way to learn is to take a project and figure it out.
EDIT:
output of repr(zips):
"array([u'25404', u'265056555', u'251772049', u'25177', u'26508', u'25262',\n u'26554', u'265053816', u'154741359', u'15461', u'26250',\n u'262413392', u'25443', u'26505', u'258809366', u'217331141',\n u'26757', u'26201', u'25419', u'25427', u'25401', u'26003',\n u'25428', u'26150', u'268479803', u'24426', u' ', u'25813',\n u'253099769', u'22603', u'25174', u'25984', u'25430', u'25438',\n u'268360008', u'254356541', u'26170', u'25971', u'24622', u'24986',\n u'26847', u'24957', u'25963', u'25064', u'260039425', u'25526',\n u'25523', u'26452', u'25143', u'26301', u'25285', u'26104',\n u'25951', u'25206', u'24740', u'252137436', u'25420', u'26330',\n u'24701', u'25309', u'25304', u'26408', u'25564', u'26753',\n u'15349', u'45767', u'25213', u'25168', u'25302', u'24931',\n u'26623', u'25704', u'26362', u'24966', u'250641730', u'26415',\n u'25130', u'26134', u'25413', u'26101', u'25193', u'26354',\n u'260031309', u'26651', u'24954', u'26180', u'256700145', u'26033',\n u'26444', u'25661', u'26555', u'264521704', u'25111', u'25043',\n u'26278', u'25560', u'25181', u'25854', u'259210233', u'24874',\n u'26181', u'24963', u'254381574', u'25557', u'26203', u'26836',\n u'255109768', u'25035', u'25214', u'26726', u'25132', u'25411',\n u'24853', u'26750', u'25071', u'25913', u'26374', u'25110',\n u'24901', u'25843', u'25880', u'26610', u'26456', u'41514',\n u'26684', u'25541', u'25311', u'26431', u'26241', u'26541',\n u'25162', u'25312', u'24801', u'26159', u'25239', u'255269325',\n u'26293', u'249460055', u'25149', u'26743', u'261871112', u'25315',\n u'25570', u'25123', u'254300341', u'25705', u'25421', u'24747',\n u'261709789', u'26438', u'26448', u'263011836', u'26041', u'25248',\n u'24739', u'25125', u'25510', u'26531', u'251860464', u'263690126',\n u'26205', u'25678', u'251238805', u'25320', u'249707005', u'25414',\n u'26133', u'263850384', u'26501', u'25405', u'25882', u'25244',\n u'25504', u'25635', u'24868', u'26143', u'25313', u'45769',\n u'24870', u'25508', u'26323', u'24832', u'25202', u'26451',\n u'25637', u'26288', u'26656', u'25670', u'25550', u'25059',\n u'456197853', u'249011225', u'25303', u'45680', u'26155', u'25002',\n u'25387', u'251771047', u'263230278', u'256250601', u'246051700',\n u'25045', u'25085', u'25011', u'25136', u'26405', u'25241',\n u'26070', u'25075', u'259181310', u'26105', u'25253', u'25275',\n u'24811', u'26287', u'25669', u'25159', u'26833', u'26378',\n u'24850', u'45760', u'26519', u'22802', u'25039', u'25403',\n u'26425', u'25625', u'254254109', u'253099281', u'258821226',\n u'255609701', u'252761627', u'25545', u'26546', u'25674',\n u'255701081', u'25547', u'257021403', u'25555', u'25113',\n u'255609730', u'255089543', u'25909', u'250489721', u'25958',\n u'25831', u'25825', u'25701', u'258479621', u'267630283', u'26588',\n u'24945', u'254280359', u'257029632', u'254253549', u'24869',\n u'25203', u'24847', u'248440000', u'25425', u'24614', u'26807',\n u'253069761', u'28104', u'26525', u'24910', u'25361', u'259813804',\n u'24808', u'253027228', u'26601', u'25801', u'25702', u'26208',\n u'255249621', u'25652', u'25033', u'26416', u'24712', u'25444',\n u'32707', u'259621513', u'25644', u'26034', u'262419617', u'25917',\n u'26062', u'25169', u'24731', u'254434652', u'25314', u'24620',\n u'75092', u'25306', u'26385'], dtype=object)"
Depending on what your goal is in applying zipcode.isequal to each element of zips:
To return a list where each element is the return value of zipcode.isequal() of the elements in zips:
cities = [zipcode.isequal(str(zip)) for zip in zips]
or return a list containing the elements in zips for which zipcode.isequal() returns true:
cities = [zip for zip in zips if zipcode.isequal(str(zip))]
Edit: Given that zips does not consist entirely of numeric strings, you probably need to do an additional filter on either one:
cities = [zipcode.isequal(str(zip)) for zip in zips if zip.isdigit()]
cities = [zip for zip in zips if zip.isdigit() and zipcode.isequal(str(zip))]
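Since the repr of zips also shows ZIP+4 values without a dash (e.g. u'265056555') and blank entries (u' '), normalizing to the 5-digit prefix may be needed too. A sketch, assuming the same zipcode.isequal API as in the question:
# Filter out blank entries such as u' ', then truncate ZIP+4 values
# like u'265056555' to their 5-digit prefix before querying.
cities = [zipcode.isequal(str(z)[:5]) for z in zips if z.strip().isdigit()]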

Finding exon/ intron borders in a gene

I would like to go through a gene and get a list of 10 bp long sequences containing the exon/intron borders from each feature.type == 'mRNA'. It seems like I need to use CompoundLocation and the locations used in 'join', but I cannot figure out how to do it, or find a tutorial.
Could anyone please give me an example or point me to a tutorial?
Assuming all the info is in the exact format you show in the comment, and that you're looking for 20 bp on either side of each intron/exon boundary, something like this might be a start:
Edit: If you're actually starting from a GenBank record, then it's not much harder. Assuming that the full junction string you're looking for is in the CDS feature info, then:
for f in record.features:
    if f.type == 'CDS':
        jct_info = str(f.location)
converts the "location" information into a string and you can continue as below.
(There are ways to work directly with the location information without converting to a string - in particular you can use "extract" to pull the spliced sequence directly out of the parent sequence -- but the steps involved in what you want to do are faster and more easily done by converting to str and then int.)
import re
jct_info = "join{[0:229](+), [11680:11768](+), [11871:12135](+), [15277:15339](+), [16136:16416](+), [17220:17471](+), [17547:17671](+)"
jctP = re.compile("\[\d+\:\d+\]")
jcts = jctP.findall(jct_info)
jcts
['[0:229]', '[11680:11768]', '[11871:12135]', '[15277:15339]', '[16136:16416]', '[17220:17471]', '[17547:17671]']
Now you can loop through the list of start:end values, pull them out of the text and convert them to ints so that you can use them as sequence indexes. Something like this:
for jct in jcts:
    start, end = jct.replace('[', '').replace(']', '').split(':')
    # Slicing never raises IndexError, and a negative start index would
    # silently wrap around to the end of the sequence, so clamp it to 0
    # explicitly (e.g. for the first junction, where start = 0).
    lo = max(int(start) - 20, 0)
    start_20_20 = seq[lo:int(start) + 20]
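For completeness, a hedged sketch of the direct route mentioned above, working with the location objects instead of parsing their string form (assumes a Biopython SeqRecord named record and its parent sequence seq):
for f in record.features:
    if f.type == 'CDS':
        # f.location.parts yields one location per exon segment,
        # even when the location is a CompoundLocation from a 'join'.
        for part in f.location.parts:
            start = int(part.start)
            lo = max(start - 20, 0)  # clamp so a boundary near 0 cannot wrap
            junction = seq[lo:start + 20]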

Python: fast iteration through file

I need to iterate through two files many millions of times, counting the number of appearances of word pairs throughout the files (in order to build a contingency table of two words to calculate a Fisher's exact test score).
I'm currently using
from itertools import izip

src = tuple(open('src.txt', 'r'))
tgt = tuple(open('tgt.txt', 'r'))
w1count = 0
w2count = 0
w1 = 'someword'
w2 = 'anotherword'
for x, y in izip(src, tgt):
    if w1 in x:
        w1count += 1
    if w2 in y:
        w2count += 1
    .....
While this is not bad, I want to know if there is any faster way to iterate through two files, hopefully significantly faster.
I appreciate your help in advance.
I still don't quite get what exactly you are trying to do, but here's some example code that might point you in the right direction.
We can use a dictionary or a collections.Counter instance to count all occurring words and pairs in a single pass through the files. After that, we only need to query the in-memory data.
import collections
import itertools
import re
def find_words(line):
    for match in re.finditer(r"\w+", line):
        yield match.group().lower()

counts1 = collections.Counter()
counts2 = collections.Counter()
counts_pairs = collections.Counter()

with open("src.txt") as f1, open("tgt.txt") as f2:
    for line1, line2 in itertools.izip(f1, f2):
        words1 = list(find_words(line1))
        words2 = list(find_words(line2))
        counts1.update(words1)
        counts2.update(words2)
        counts_pairs.update(itertools.product(words1, words2))

print counts1["someword"]
print counts1["anotherword"]
print counts_pairs["someword", "anotherword"]
In general if your data is small enough to fit into memory then your best bet is to:
Pre-process data into memory
Iterate from memory structures
If the files are large, you may be able to pre-process them into data structures (such as your zipped data) and save the result in a format like pickle, which is much faster to load and work with; a separate run can then process the pickled data.
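As a sketch of that idea (Python 2, matching the question's izip, with the same file names): pre-pair the lines once, pickle the result, and let later runs load the binary file instead of re-parsing the text:
import pickle
from itertools import izip

# One-time pre-processing pass: pair up the lines and pickle them.
pairs = list(izip(open('src.txt'), open('tgt.txt')))
with open('pairs.pkl', 'wb') as f:
    pickle.dump(pairs, f, protocol=pickle.HIGHEST_PROTOCOL)

# Subsequent runs skip the text files entirely:
with open('pairs.pkl', 'rb') as f:
    pairs = pickle.load(f)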
Just as an out-of-the-box solution: have you tried making the files into Pandas DataFrames? I.e., I assume you already make a word list out of the input (by removing punctuation such as '.' and ',') using input.split(' ') or something similar. You can then turn that into DataFrames, perform a word count, and then do a Cartesian join:
import pandas as pd
df_1 = pd.DataFrame(src, columns=['word_1'])
df_1['count_1'] = 1
df_1 = df_1.groupby(['word_1']).sum()
df_1 = df_1.reset_index()
df_2 = pd.DataFrame(trg, columns=['word_2'])
df_2['count_2'] = 1
df_2 = df_2.groupby(['word_2']).sum()
df_2 = df_2.reset_index()
df_1['link'] = 1
df_2['link'] = 1
result_df = pd.merge(left=df_1, right=df_2, left_on='link', right_on='link')
del result_df['link']
I use stuff like this for basket analysis; it works really well.
