Ordering file in list by pattern python - python

I have a list of 240 figure filenames, each starting with 'fig' followed by the figure number.
Here is an example:
fig1-24-24-32
fig3-45-32-12
fig2-24-24-31
fig5-24-24-31
fig6-24-24-31
fig4-24-24-31
I would like to order that list by figure number:
fig1-24-24-32
fig2-24-24-31
fig3-45-32-12
fig4-24-24-31
fig5-24-24-31
fig6-24-24-31
I have tried:
print(glob.glob('fig*[1-241]*'))
However, this does not work.
This is what I get
UPDATE
Found the answer to my question here:
https://stackoverflow.com/a/2669120/6235069 (the answer is given by @Mark Byers)
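As a side note on why the glob attempt above behaves unexpectedly: in shell-style patterns, square brackets describe a set of single characters, not a numeric range, so [1-241] means "one character that is 1, 2 or 4". The sketch below checks this with fnmatch.fnmatchcase, which applies the same pattern rules as glob without touching the file system. The name fig7-33-33-33 is made up for illustration.

```python
from fnmatch import fnmatchcase

pattern = 'fig*[1-241]*'

# '[1-241]' is a character set: the range '1'-'2' plus the characters '4' and '1'.
# A name matches as soon as any character after 'fig' is 1, 2 or 4.
matches_sample = fnmatchcase('fig3-45-32-12', pattern)    # contains '4', '2' and '1'
matches_made_up = fnmatchcase('fig7-33-33-33', pattern)   # hypothetical name without 1, 2 or 4

print(matches_sample, matches_made_up)
```

Either way, glob returns its results in arbitrary order, so the list still has to be sorted afterwards.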

I am assuming here that all the filenames start with the same 3-character prefix (it does not have to be 'fig', since the prefix itself is ignored), followed by one or more digits up to the first dash ('-').
If that is indeed the case, you can use the following:
sorted(my_files, key=lambda x: int(x.split('-')[0][3:]))
Note that my_files is a list containing all the filenames (basenames).
Output:
['fig1-24-24-32',
'fig2-24-24-31',
'fig3-45-32-12',
'fig4-24-24-31',
'fig5-24-24-31',
'fig6-24-24-31']
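If the prefix length is not guaranteed to be three characters, a regular expression can pull out the first run of digits instead of slicing at position 3. This is only a sketch under that assumption: the helper name fig_number and the two- and three-digit sample names are made up for illustration.

```python
import re

def fig_number(name):
    """Return the first run of digits in name as an int (hypothetical helper)."""
    match = re.search(r'\d+', name)
    if match is None:
        raise ValueError(f'no number found in {name!r}')
    return int(match.group())

# Made-up names with one-, two- and three-digit figure numbers.
my_files = ['fig10-24-24-32', 'fig2-24-24-31', 'fig100-45-32-12', 'fig1-24-24-31']
ordered = sorted(my_files, key=fig_number)
print(ordered)
```

Because the key is an int, fig2 sorts before fig10, which matters once the list goes up to 240 figures.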

The code below will do the job:
mylist=['fig1-24-24-32',
'fig3-45-32-12',
'fig2-24-24-31',
'fig5-24-24-31',
'fig6-24-24-31',
'fig4-24-24-31']
updated_list=sorted(mylist)
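sorted() compares the strings character by character, which is enough for this sample because every figure number has a single digit. With numbers of 10 or more it is no longer sufficient (for example, 'fig10-...' would sort before 'fig2-...'), and a numeric key is needed instead.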
updated_list
['fig1-24-24-32',
'fig2-24-24-31',
'fig3-45-32-12',
'fig4-24-24-31',
'fig5-24-24-31',
'fig6-24-24-31']

Related

How do I store a large int in python without the newlines being counted by len()?

So I wanna store a long integer which is too big for one line in python. Do I just ignore PEP 8 and just make it longer than 120 characters? Cause if I do it like this:
num="""7316717653133062491922511967442657474235534919493496983520312774506326239578318016984801869478851843
8586156078911294949545950173795833195285320880551112540698747158523863050715693290963295227443043557
6689664895044524452316173185640309871112172238311362229893423380308135336276614282806444486645238749
3035890729629049156044077239071381051585930796086670172427121883998797908792274921901699720888093776
6572733300105336788122023542180975125454059475224352584907711670556013604839586446706324415722155397
5369781797784617406495514929086256932197846862248283972241375657056057490261407972968652414535100474
8216637048440319989000889524345065854122758866688116427171479924442928230863465674813919123162824586
1786645835912456652947654568284891288314260769004224219022671055626321111109370544217506941658960408
0719840385096245544436298123098787992724428490918884580156166097919133875499200524063689912560717606
0588611646710940507754100225698315520005593572972571636269561882670428252483600823257530420752963450"""
and try to access a specific index of that integer or use len() on it I get a length of 1009 instead of the 1000 digits the number actually has. And putting everything into one line would make that line 1004 characters long which doesn't seem that great either.
I would use the following literal over multiple lines in parentheses for cleanliness:
num = (
'7316717653'
'1330624919'
'2251196744'
)
so that len(num) from the above example returns: 30
Another option you have is to put the number into another file (say number.txt) and read it at runtime:
number.txt

main.py
with open("number.txt", "r") as f:
    number = f.read().strip()  # strip() drops a trailing newline so it is not counted by len()
I wouldn't use this personally, but one option is to remove the newlines:
num = """
123
456
""".replace('\n', '')
print(repr(num)) # -> '123456'
There's lots of good answers already, but here's one that will give you a bit of extra convenience. You just have to put in a number and the size of the chunks per line, and you can reuse it for lots of long numbers, if needed:
Format your number into multiple strings using a for loop and string concatenation:
x = str(7316717653133062491922511967442657474235534919493496983520312774506326239578318016984801869478851843858615607891129494954595017379583319528532088055111254069874715852386305071569329096329522744304355766896648950445244523161731856403098711121722383113622298934233803081353362766142828064444866452387493035890729629049156044077239071381051585930796086670172427121883998797908792274921901699720888093776657273330010533678812202354218097512545405947522435258490771167055601360483958644670632441572215539753697817977846174064955149290862569321978468622482839722413756570560574902614079729686524145351004748216637048440319989000889524345065854122758866688116427171479924442928230863465674813919123162824586178664583591245665294765456828489128831426076900422421902267105562632111110937054421750694165896040807198403850962455444362981230987879927244284909188845801561660979191338754992005240636899125607176060588611646710940507754100225698315520005593572972571636269561882670428252483600823257530420752963450)
y = []
y.append("long_num = (")
chunksize = 10
for i in range(0, len(x), chunksize):
    y.append("\t" + "\"" + x[i:i+chunksize] + "\"")
y.append(")")
for part in y:
    print(part)
This outputs the following text, which you can paste into your code (it uses the parenthesized form from @blhsing's answer):
long_num = (
"7316717653"
"1330624919"
"2251196744"
"2657474235"
"5349194934"
"9698352031"
"2774506326"
"2395783180"
"1698480186"
...
)
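To make the generator above reusable for several long numbers, it could be wrapped in a function. This is a minimal sketch: the function name format_long_literal and its parameters are made up here, and it indents with four spaces rather than a tab. Passing the digits as a string instead of an int keeps any leading zeros.

```python
def format_long_literal(number, chunksize=10, name='long_num'):
    """Return source text that assigns the digits of number to name in chunks."""
    digits = str(number)  # accepts an int or a string of digits
    lines = [f'{name} = (']
    for i in range(0, len(digits), chunksize):
        lines.append(f'    "{digits[i:i + chunksize]}"')
    lines.append(')')
    return '\n'.join(lines)

snippet = format_long_literal(731671765313306249192251196744, chunksize=10)
print(snippet)
```

The printed text is valid Python: adjacent string literals inside parentheses are joined into one string with no newlines.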
You can take a look at this post: "Is there a way to implement methods like __len__ or __eq__ as classmethods?"
Simply make a class for your long integer and override its __len__ method so that it does not count '\n'.
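A minimal sketch of such a class might look like the following. The class name LongNumber is made up for illustration. Rather than skipping newlines inside __len__, it removes all whitespace once in the constructor, so that len(), indexing and int() all agree.

```python
class LongNumber:
    """Hypothetical wrapper that ignores whitespace in a multi-line digit string."""

    def __init__(self, text):
        # str.split() with no arguments splits on any whitespace, including '\n'.
        self.digits = ''.join(text.split())

    def __len__(self):
        return len(self.digits)

    def __getitem__(self, index):
        return self.digits[index]

    def __int__(self):
        return int(self.digits)

num = LongNumber("""123
456
789""")
print(len(num), num[3], int(num))
```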

convert columns of text to rows in python 3x

[image: input text]
I am using the below code to convert the columns to rows in the text.
My requirement is to find the count of each character in each column in the text
b=[''.join(i) for i in zip(*a2.split())]
print(b)
I am getting the output below:
['CCACTCGT', 'GTGGCCCC', 'AGCACTGC', 'CCTGCAGA', 'TTTAACCA', 'CGTACCTC', 'CACCCCCA', 'CGCCCCTT', 'GCTCCATG', 'CCAAAGGA', 'GCTCGCCT', 'ACTCACCC', 'ATCCTGGG', 'GGAACGCT', 'ACATCCTG', 'CGGCTTGC', 'TCAACCCG', 'TACGCGTT', 'GTCATCGT', 'ACAGAACC', 'CCCCCCTC', 'CACCCTGT', 'CACTTCCG', 'CGACTTCC', 'AGCCTCGA', 'AACCTGCA', 'ACTTCGTG', 'GCCTTCGT', 'CCTCGTCG', 'TTGCGGTC', 'CTGAGTGA', 'GCTCGGTG', 'GTACACGC', 'GCCTGCGT', 'CGCCAGCG', 'GGATCGTA', 'CAGGCGGG', 'ATACCGCG', 'CCTTCGTC', 'CCCCTGAC', 'CGTCCCGC', 'CGCTAGTC', 'CGGCGCGG', 'CACCCCCC', 'TGCGCGTC', 'GACTCCGC', 'CCATCCAC', 'AGTCTTCG', 'CGCTGCGC', 'AATCTCCC', 'CACCACCC', 'TTGCGCTA', 'TCGTGCGC', 'CTTGGAGA', 'CGTAGTCG', 'CTTGCGCC', 'CCTAGCGC', 'ATTGGCGC', 'CCTCGGCC', 'TACCGCCG', 'CGCTCCGC', 'TAGCCTGC', 'CCTATTCC', 'ACAACCCA', 'GTGCCGGC']
You can see that the last 5 columns of the text are missing from the list.
I am not able to figure out why this is happening. Any help would be highly appreciated.
Also, please suggest any other way to achieve the same result.
zip returns as many tuples as there are items in the shortest iterable, so only full columns are returned. To get all columns you can use zip_longest, like this:
from itertools import zip_longest
b = [''.join(i) for i in zip_longest(*a2.split(), fillvalue='')]
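For the stated goal of counting each character per column, collections.Counter could be applied to every column string. The sketch below uses a small made-up a2, since the real input is only available as an image.

```python
from collections import Counter
from itertools import zip_longest

# Made-up input with rows of different lengths.
a2 = """ACGT
AGG
CCGTA"""

# Transpose rows into columns; short rows contribute '' past their end.
columns = [''.join(col) for col in zip_longest(*a2.split(), fillvalue='')]

# One Counter per column maps each character to its number of occurrences.
counts = [Counter(col) for col in columns]
for index, count in enumerate(counts):
    print(index, dict(count))
```

Using fillvalue='' means short rows add nothing to the later columns, so the counts only reflect characters that are actually present.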

Process as string for each item in dataframe or list in Python

I'm trying something very simple in Python.
zips = sempmme['Zip code'].unique()
I want to apply zipcode.isequal('12345') to each value in zips, but I'm not sure how to do it in an efficient, Pythonic way.
I tried 'zipcode.isequal(lambda x: x in zips)' and even a for loop, but I can't seem to get it.
for i in range(0, len(zips)):
    #print(zips[i])
    cities[i] = zipcode.isequal("" + zips[i])
It shows 'isequal() can only take string'. Needless to say, this is the first time I'm coding in Python, and I figured the best way to learn is to take a project and figure it out.
Edit:
output of repr(zips):
"array([u'25404', u'265056555', u'251772049', u'25177', u'26508', u'25262',\n u'26554', u'265053816', u'154741359', u'15461', u'26250',\n u'262413392', u'25443', u'26505', u'258809366', u'217331141',\n u'26757', u'26201', u'25419', u'25427', u'25401', u'26003',\n u'25428', u'26150', u'268479803', u'24426', u' ', u'25813',\n u'253099769', u'22603', u'25174', u'25984', u'25430', u'25438',\n u'268360008', u'254356541', u'26170', u'25971', u'24622', u'24986',\n u'26847', u'24957', u'25963', u'25064', u'260039425', u'25526',\n u'25523', u'26452', u'25143', u'26301', u'25285', u'26104',\n u'25951', u'25206', u'24740', u'252137436', u'25420', u'26330',\n u'24701', u'25309', u'25304', u'26408', u'25564', u'26753',\n u'15349', u'45767', u'25213', u'25168', u'25302', u'24931',\n u'26623', u'25704', u'26362', u'24966', u'250641730', u'26415',\n u'25130', u'26134', u'25413', u'26101', u'25193', u'26354',\n u'260031309', u'26651', u'24954', u'26180', u'256700145', u'26033',\n u'26444', u'25661', u'26555', u'264521704', u'25111', u'25043',\n u'26278', u'25560', u'25181', u'25854', u'259210233', u'24874',\n u'26181', u'24963', u'254381574', u'25557', u'26203', u'26836',\n u'255109768', u'25035', u'25214', u'26726', u'25132', u'25411',\n u'24853', u'26750', u'25071', u'25913', u'26374', u'25110',\n u'24901', u'25843', u'25880', u'26610', u'26456', u'41514',\n u'26684', u'25541', u'25311', u'26431', u'26241', u'26541',\n u'25162', u'25312', u'24801', u'26159', u'25239', u'255269325',\n u'26293', u'249460055', u'25149', u'26743', u'261871112', u'25315',\n u'25570', u'25123', u'254300341', u'25705', u'25421', u'24747',\n u'261709789', u'26438', u'26448', u'263011836', u'26041', u'25248',\n u'24739', u'25125', u'25510', u'26531', u'251860464', u'263690126',\n u'26205', u'25678', u'251238805', u'25320', u'249707005', u'25414',\n u'26133', u'263850384', u'26501', u'25405', u'25882', u'25244',\n u'25504', u'25635', u'24868', u'26143', u'25313', u'45769',\n u'24870', u'25508', u'26323', 
u'24832', u'25202', u'26451',\n u'25637', u'26288', u'26656', u'25670', u'25550', u'25059',\n u'456197853', u'249011225', u'25303', u'45680', u'26155', u'25002',\n u'25387', u'251771047', u'263230278', u'256250601', u'246051700',\n u'25045', u'25085', u'25011', u'25136', u'26405', u'25241',\n u'26070', u'25075', u'259181310', u'26105', u'25253', u'25275',\n u'24811', u'26287', u'25669', u'25159', u'26833', u'26378',\n u'24850', u'45760', u'26519', u'22802', u'25039', u'25403',\n u'26425', u'25625', u'254254109', u'253099281', u'258821226',\n u'255609701', u'252761627', u'25545', u'26546', u'25674',\n u'255701081', u'25547', u'257021403', u'25555', u'25113',\n u'255609730', u'255089543', u'25909', u'250489721', u'25958',\n u'25831', u'25825', u'25701', u'258479621', u'267630283', u'26588',\n u'24945', u'254280359', u'257029632', u'254253549', u'24869',\n u'25203', u'24847', u'248440000', u'25425', u'24614', u'26807',\n u'253069761', u'28104', u'26525', u'24910', u'25361', u'259813804',\n u'24808', u'253027228', u'26601', u'25801', u'25702', u'26208',\n u'255249621', u'25652', u'25033', u'26416', u'24712', u'25444',\n u'32707', u'259621513', u'25644', u'26034', u'262419617', u'25917',\n u'26062', u'25169', u'24731', u'254434652', u'25314', u'24620',\n u'75092', u'25306', u'26385'], dtype=object)"
Depending on what your goal is in "applying zipcode.isequal for each zips"...
To return a list where each element is the return value of zipcode.isequal() for the corresponding element of zips:
cities = [zipcode.isequal(str(z)) for z in zips]
or to return a list containing the elements of zips for which zipcode.isequal() returns a truthy value:
cities = [z for z in zips if zipcode.isequal(str(z))]
Edit: Given that zips does not consist entirely of numeric strings, you probably need an additional filter on either one:
cities = [zipcode.isequal(str(z)) for z in zips if z.isdigit()]
cities = [z for z in zips if z.isdigit() and zipcode.isequal(str(z))]
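The repr output in the question also contains a blank entry and several nine-digit values. Assuming (the question does not confirm this) that the nine-digit values are ZIP+4 codes written without a hyphen, it may help to normalize everything to five digits before doing any lookup. The helper name normalize_zips is made up, and the sketch deliberately does not call the zipcode library.

```python
def normalize_zips(zips):
    """Return unique five-digit ZIP strings, preserving first-seen order.

    Assumption: a nine-digit value is a ZIP+4 code without the hyphen,
    so its first five digits are the base ZIP code.
    """
    seen = set()
    result = []
    for raw in zips:
        value = str(raw).strip()
        if not value.isdigit() or len(value) not in (5, 9):
            continue  # skip blanks and anything that is not 5 or 9 digits
        base = value[:5]
        if base not in seen:
            seen.add(base)
            result.append(base)
    return result

# A few values taken from the repr output in the question.
sample = ['25404', '265056555', '251772049', '25177', ' ', '26505']
cleaned = normalize_zips(sample)
print(cleaned)
```

The cleaned list could then be used in place of zips in the list comprehensions above.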

Finding exon/ intron borders in a gene

I would like to go through a gene and get a list of 10bp long sequences containing the exon/intron borders from each feature.type =='mRNA'. It seems like I need to use compoundLocation, and the locations used in 'join' but I can not figure out how to do it, or find a tutorial.
Could anyone please give me an example or point me to a tutorial?
Assuming all the info is in the exact format you show in the comment, and that you're looking for 20 bp on either side of each intron/exon boundary, something like this might be a start:
Edit: If you're actually starting from a GenBank record, then it's not much harder. Assuming that the full junction string you're looking for is in the CDS feature info, then:
for f in record.features:
    if f.type == 'CDS':
        jct_info = str(f.location)
converts the "location" information into a string and you can continue as below.
(There are ways to work directly with the location information without converting to a string - in particular you can use "extract" to pull the spliced sequence directly out of the parent sequence -- but the steps involved in what you want to do are faster and more easily done by converting to str and then int.)
import re
jct_info = "join{[0:229](+), [11680:11768](+), [11871:12135](+), [15277:15339](+), [16136:16416](+), [17220:17471](+), [17547:17671](+)"
jctP = re.compile(r"\[\d+:\d+\]")  # raw string, so the backslashes reach the regex engine unchanged
jcts = jctP.findall(jct_info)
jcts
['[0:229]', '[11680:11768]', '[11871:12135]', '[15277:15339]', '[16136:16416]', '[17220:17471]', '[17547:17671]']
Now you can loop through the list of start:end values, pull them out of the text and convert them to ints so that you can use them as sequence indexes. Something like this:
for jct in jcts:
    start, end = (int(n) for n in jct.strip('[]').split(':'))
    # Slicing never raises IndexError. A negative start would instead wrap
    # around to the end of the sequence, so clamp it at 0 (e.g. where start = 0).
    start_20_20 = seq[max(0, start - 20):start + 20]
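The same idea can be sketched as a small pure-Python function that collects a window around every exon start and end. The function name boundary_windows, the flank parameter and the toy sequence are made up for illustration. The coordinates are assumed to be 0-based and end-exclusive, as in the '[start:end]' strings above.

```python
def boundary_windows(seq, exons, flank=20):
    """Return (position, window) pairs around every exon start and end.

    exons is a list of (start, end) pairs using 0-based, end-exclusive
    coordinates.
    """
    windows = []
    for start, end in exons:
        for pos in (start, end):
            left = max(0, pos - flank)          # clamp so the slice never wraps around
            right = min(len(seq), pos + flank)  # clamp at the end of the sequence
            windows.append((pos, seq[left:right]))
    return windows

seq = 'A' * 10 + 'C' * 10 + 'G' * 10   # toy 30 bp sequence, made up for illustration
exons = [(0, 10), (20, 30)]
result = boundary_windows(seq, exons, flank=5)
for pos, window in result:
    print(pos, window)
```

Note that the first start and the last end are the ends of the coding region rather than exon/intron junctions, so they may need to be dropped depending on what you are after.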

find missing numeric from ALPHANUMERIC - Python

How would I write a function in Python to determine if a list of filenames matches a given pattern and which files are missing from that pattern? For example:
Input ->
KUMAR.3.txt
KUMAR.4.txt
KUMAR.6.txt
KUMAR.7.txt
KUMAR.9.txt
KUMAR.10.txt
KUMAR.11.txt
KUMAR.13.txt
KUMAR.15.txt
KUMAR.16.txt
Desired Output-->
KUMAR.5.txt
KUMAR.8.txt
KUMAR.12.txt
KUMAR.14.txt
Input -->
KUMAR3.txt
KUMAR4.txt
KUMAR6.txt
KUMAR7.txt
KUMAR9.txt
KUMAR10.txt
KUMAR11.txt
KUMAR13.txt
KUMAR15.txt
KUMAR16.txt
Desired Output -->
KUMAR5.txt
KUMAR8.txt
KUMAR12.txt
KUMAR14.txt
You can approach this as:
1. Convert the filenames to appropriate integers.
2. Find the missing numbers.
3. Combine the missing numbers with the filename template as output.
For (1), if the file structure is predictable, then this is easy.
def to_num(s, start=6):
    return int(s[start:s.index('.txt')])
Given:
lst = ['KUMAR.3.txt', 'KUMAR.4.txt', 'KUMAR.6.txt', 'KUMAR.7.txt',
'KUMAR.9.txt', 'KUMAR.10.txt', 'KUMAR.11.txt', 'KUMAR.13.txt',
'KUMAR.15.txt', 'KUMAR.16.txt']
you can get a list of the known numbers with list(map(to_num, lst)). To look for gaps, you only really need the minimum and maximum: combine them with the range function to get all the numbers that you should see, and then remove the numbers you've got. Sets are helpful here.
def find_gaps(int_list):
    present = set(int_list)  # materialize first, so an iterator such as map() also works
    return sorted(set(range(min(present), max(present))) - present)
Putting it all together:
missing = find_gaps(list(map(to_num, lst)))
for i in missing:
    print('KUMAR.%d.txt' % i)
Assuming the patterns are relatively static, this is easy enough with a regex:
import re
inlist = "KUMAR.3.txt KUMAR.4.txt KUMAR.6.txt KUMAR.7.txt KUMAR.9.txt KUMAR.10.txt KUMAR.11.txt KUMAR.13.txt KUMAR.15.txt KUMAR.16.txt".split()
def get_count(s):
    return int(re.match(r'.*\.(\d+)\..*', s).groups()[0])
mincount = get_count(inlist[0])   # assumes the list is already sorted by number
maxcount = get_count(inlist[-1])
values = set(map(get_count, inlist))
for ii in range(mincount, maxcount):
    if ii not in values:
        print('KUMAR.%d.txt' % ii)
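Both answers hard-code the 'KUMAR.%d.txt' template, which only covers the first input format. A sketch that handles both formats from the question might take the prefix and extension from the names themselves. The function name missing_files is made up, and it assumes a non-empty list in which every name has the shape prefix, number, extension with the same prefix and extension throughout.

```python
import re

def missing_files(names):
    """Return the filenames missing from a numbered sequence.

    Assumes every name looks like <prefix><number><.extension> and that
    all names share the same prefix and extension.
    """
    pattern = re.compile(r'^(.*?)(\d+)(\.[^.\d]+)$')
    numbers = set()
    prefix = suffix = ''
    for name in names:
        match = pattern.match(name)
        if match is None:
            raise ValueError('unrecognized filename: %r' % name)
        prefix, digits, suffix = match.groups()
        numbers.add(int(digits))
    expected = range(min(numbers), max(numbers) + 1)
    return ['%s%d%s' % (prefix, n, suffix) for n in expected if n not in numbers]

dotted = ['KUMAR.3.txt', 'KUMAR.4.txt', 'KUMAR.6.txt', 'KUMAR.7.txt', 'KUMAR.9.txt',
          'KUMAR.10.txt', 'KUMAR.11.txt', 'KUMAR.13.txt', 'KUMAR.15.txt', 'KUMAR.16.txt']
plain = [name.replace('KUMAR.', 'KUMAR') for name in dotted]
print(missing_files(dotted))
print(missing_files(plain))
```

The input does not need to be sorted, because the minimum and maximum are computed from the parsed numbers.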
