How to cycle through indices - python

In this script, which I am writing to learn Python, I would like to use something like a wildcard instead of rewriting the whole block just to change line 2. What would be the most efficient way to consolidate this into a loop that uses every d.entries[0-99].content and repeats until finished? if, while, for?
Also, my try/except does not perform as expected.
What gives?
import feedparser, base64
from urlextract import URLExtract
d = feedparser.parse('https://www.reddit.com/r/PkgLinks.rss')
print (d.entries[3].title)
sr = str(d.entries[3].content)
spl1 = sr.split("<p>")
ss = str(spl1)
spl2 = ss.split("</p>")
try:
    st = str(spl2[0])
    # print(st)
except:
    binascii.Error
    st = str(spl2[1])
print(st)
#st = str(spl2[0])
spl3 =st.split("', '")
stringnow=str(spl3[1])
b64s1 = stringnow.encode('ascii')
b64s2 = base64.b64decode(b64s1)
stringnew = b64s2.decode('ascii')
print(stringnew)
## but line 15 does nothing, how to fix and also loop through all d.entries[?].content

The loop part can be done simply as follows:
import feedparser, base64
from urlextract import URLExtract
d = feedparser.parse('https://www.reddit.com/r/PkgLinks.rss')
# loop from 0 to 99
# range(100) goes from 0 and up to and not including 100
for i in range(100):
    print (d.entries[i].title)
    sr = str(d.entries[i].content)
    << the rest of your code here>>
The data returned from d.entries[i].content is a dictionary, but you are converting it to a string, so you may want to check whether you are doing what you really want to. Also, .split() produces a list of the split items, but you convert that to a string once again (a few times). You may want to take another look at that part of the code.

I haven't used regex much but decided to play with it and got this to work. I retrieved the contents of the 'value' key from the dictionary, then used regex to get the base64 info. I only tried it for the first 5 items (i.e., I changed range(100) to range(5)). Hope it helps. If not, I enjoyed doing this. Oh, I left in all of the print statements I used as I was working down the code.
import feedparser, base64
from urlextract import URLExtract
import re
d = feedparser.parse('https://www.reddit.com/r/PkgLinks.rss')
for i in range(100):
    print (d.entries[i].title)
    # .content is a list.
    # print("---------")
    # print (type(d.entries[i].content))
    print (d.entries[i].content)
    print("---------")
    # gets the contents of key 'value' in the dictionary that is the 1st item in the list.
    string_value = d.entries[i].content[0]['value']
    print(string_value)
    print("---------")
    # this assumes there is always a space between the 1st </p> and the 2nd <p>
    # grabs text between using re.search
    pattern = "<p>(.*?)</p>"
    substring = re.search(pattern, string_value).group(1)
    print(substring)
    print("---------")
    print("---------")
    print("---------")
    # rest of your code here
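On the try/except part of the question: the bare binascii.Error line inside the except block is only an expression that gets evaluated and thrown away, binascii is never imported, and the b64decode call sits outside the try block, so nothing is actually being caught. A minimal sketch of naming the exception in the except clause instead, assuming each entry's HTML holds one base64 string in its first <p> block. The helper name decode_first_paragraph and the sample strings are made up for illustration; they stand in for d.entries[i].content[0]['value'].

```python
import base64
import binascii
import re


def decode_first_paragraph(html):
    """Decode the base64 text of the first <p>...</p> block; None if absent or invalid."""
    match = re.search(r"<p>(.*?)</p>", html, re.DOTALL)
    if match is None:
        return None
    try:
        # validate=True makes b64decode reject characters outside the base64 alphabet
        raw = base64.b64decode(match.group(1).strip(), validate=True)
        return raw.decode("ascii")
    except (binascii.Error, UnicodeDecodeError):
        # the exception types belong in the except clause itself
        return None


# Made-up sample data; a real feed would supply these strings.
encoded = base64.b64encode(b"https://example.com").decode("ascii")
samples = ["<p>" + encoded + "</p>", "<p>not base64!</p>", "no paragraph here"]

# Iterating over the list directly avoids hard-coding range(100).
for html in samples:
    print(decode_first_paragraph(html))
```

With a real feed, `for entry in d.entries:` would play the role of the loop over samples, which also avoids an IndexError when the feed has fewer than 100 entries.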

Related

how to stop repeating same text in loops python

from re import I
from requests import get
res = get("https://subsplease.org/api/?f=latest&tz=canada/central").json()
kek = []
for x in res:
    kek.append(x)
lnk = res[kek[0]]['downloads']
anime_name = res[kek[0]]['show']
for x in lnk:
    quality = x['res']
    links = x['magnet']
    data = f"{anime_name}:\n\n{quality}: {links}\n\n"
    print(data)
In this code, how can I prevent the anime name from being repeated?
If I move this outside of the loop, only one link is printed.
You can split your string: print the first half outside the loop and the second half inside the loop:
print(f"{anime_name}:\n\n")
for x in lnk:
    quality = x['res']
    links = x['magnet']
    data = f"{quality}: {links}\n\n"
    print(data)
I rewrote it a bit. Make sure you look at a 'pretty' version of the JSON response, using pprint or something similar, to understand where the elements are and where you can loop (remembering to iterate through the dict).
from requests import get
data = get("https://subsplease.org/api/?f=latest&tz=canada/central").json()
for show, info in data.items():
    print(show, '\n')
    for download in info['downloads']:
        print(download['magnet'])
        print(download['res'])
    print('\n')
Also, you usually won't be able to just copy these links to get to the download; magnet links normally need to be opened with a torrent client.
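To try the nested loop without hitting the network, here is a small sketch on a hand-made dictionary. The sample data and the format_show helper are invented for illustration; they only assume the shape used above (each key maps to a dict with a 'show' name and a 'downloads' list of 'res'/'magnet' entries).

```python
def format_show(show, info):
    """Build one text block: the show name once, then one line per download."""
    lines = [show + ":"]
    for download in info["downloads"]:
        lines.append("%s: %s" % (download["res"], download["magnet"]))
    return "\n".join(lines)


# Invented sample mirroring the assumed response shape.
sample = {
    "Some Show - 01": {
        "show": "Some Show",
        "downloads": [
            {"res": "720", "magnet": "magnet:?xt=urn:btih:aaa"},
            {"res": "1080", "magnet": "magnet:?xt=urn:btih:bbb"},
        ],
    },
}

for key, info in sample.items():
    print(format_show(info["show"], info))
```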

Get list from string with exec in python

I have:
"[15765,22832,15289,15016,15017]"
I want:
[15765,22832,15289,15016,15017]
What should I do to convert this string to list?
P.S. Post was edited without my permission and it lost important part. The type of line that looks like list is 'bytes'. This is not string.
P.S. №2. My initial code was:
import urllib.request, re
f = urllib.request.urlopen("http://www.finam.ru/cache/icharts/icharts.js")
lines = f.readlines()
for line in lines:
    m = re.match('var\s+(\w+)\s*=\s*\[\\s*(.+)\s*\]\;', line.decode('windows-1251'))
    if m is not None:
        varname = m.group(1)
        if varname == "aEmitentIds":
            aEmitentIds = line #its type is 'bytes', not 'string'
I need to get list from line
line from web page looks like
[15765, 22832, 15289, 15016, 15017]
Assuming s is your string, you can just use split and then cast each number to integer:
s = [int(number) for number in s[1:-1].split(',')]
For detailed information about split function:
Python3 split documentation
What you have is a stringified list. You could use a json parser to parse that information into the corresponding list
import json
test_str = "[15765,22832,15289,15016,15017]"
l = json.loads(test_str) # List that you need.
Or another way to do this would be to use ast
import ast
test_str = "[15765,22832,15289,15016,15017]"
data = ast.literal_eval(test_str)
The result is
[15765, 22832, 15289, 15016, 15017]
To understand why using eval() is bad practice you could refer to this answer
You can also use regex to pull out numeric values from the string as follows:
import re
lst = "[15765,22832,15289,15016,15017]"
lst = [int(number) for number in re.findall(r'\d+', lst)]
Output of the above code is,
[15765, 22832, 15289, 15016, 15017]
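Since the question notes that the value is of type bytes rather than str, it has to be decoded before any of the approaches above apply. A minimal sketch, assuming the line looks like the JavaScript assignment that the question's regex is aimed at; the line value below is a made-up stand-in for one element of f.readlines().

```python
import json
import re

# Made-up stand-in for one raw line of the downloaded file (type bytes).
line = b"var aEmitentIds = [15765, 22832, 15289, 15016, 15017];\r\n"

text = line.decode("windows-1251")  # bytes -> str, same codec as in the question
match = re.match(r"var\s+(\w+)\s*=\s*(\[.*\])\s*;", text)
if match and match.group(1) == "aEmitentIds":
    ids = json.loads(match.group(2))  # parse only the bracketed part
    print(ids)
```

json.loads is used here only because the bracketed part happens to be valid JSON; ast.literal_eval would work the same way on this input.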

Python: Replacing a string with unique replacements

I'm reading a file and I need to replace certain empty tags ([[Image:]]).
The problem is every replacement has to be unique.
Here's the code:
import re
import codecs
re_imagematch = re.compile('(\[\[Image:([^\]]+)?\]\])')
wf = codecs.open('converted.wiki', "r", "utf-8")
wikilines = wf.readlines()
wf.close()
imgidx = 0
for i in range(0,len(wikilines)):
    if re_imagematch.search(wikilines[i]):
        print 'MATCH #######################################################'
        print wikilines[i]
        wikilines[i] = re_imagematch.sub('[[Image:%s_%s.%s]]' % ('outname', imgidx, 'extension'), wikilines[i])
        print wikilines[i]
        imgidx += 1
This does not work, as there can be many tags in one line:
Here's the input file.
[[Image:]][[Image:]]
[[Image:]]
This is what the output should look like:
[[Image:outname_0.extension]][[Image:outname_1.extension]]
[[Image:outname_2.extension]]
This is what it currently looks like:
[[Image:outname_0.extension]][[Image:outname_0.extension]]
[[Image:outname_1.extension]]
I tried using a replacement function, the problem is this function gets only called once per line using re.sub.
You can use itertools.count here and take some advantage of the fact that default arguments are calculated when function is created and value of mutable default arguments can persist between function calls.
from itertools import count
def rep(m, cnt=count()):
    return '[[Image:%s_%s.%s]]' % ('outname', next(cnt) , 'extension')
This function will be invoked for each match found and it'll use a new value for each replacement.
So, you simply need to change this line in your code:
wikilines[i] = re_imagematch.sub(rep, wikilines[i])
Demo:
def rep(m, count=count()):
    return str(next(count))
>>> re.sub(r'a', rep, 'aaa')
'012'
To get the current counter value:
>>> from copy import copy
>>> next(copy(rep.__defaults__[0])) - 1
2
I'd use a simple string replacement wrapped in a while loop:
s = '[[Image:]][[Image:]]\n[[Image:]]'
pattern = '[[Image:]]'
i = 0
while s.find(pattern) >= 0:
    s = s.replace(pattern, '[[Image:outname_' + str(i) + '.extension]]', 1)
    i += 1
print(s)
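For reference, a Python 3 sketch of the callable-replacement idea that keeps the counter in a local itertools.count instead of a default argument, so the numbering restarts on every call. The function name number_image_tags is made up for this example, and it is applied to the whole text at once rather than line by line.

```python
import re
from itertools import count

IMAGE_TAG = re.compile(r"\[\[Image:[^\]]*\]\]")


def number_image_tags(text, name="outname", ext="extension"):
    """Replace every [[Image:...]] tag with a uniquely numbered one."""
    counter = count()  # fresh counter per call

    def replacement(match):
        # re.sub calls this once per match, so each tag gets the next number
        return "[[Image:%s_%d.%s]]" % (name, next(counter), ext)

    return IMAGE_TAG.sub(replacement, text)


print(number_image_tags("[[Image:]][[Image:]]\n[[Image:]]"))
```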

parsing a .srt file with regex

I am doing a small script in python, but since I am quite new I got stuck in one part:
I need to get timing and text from a .srt file. For example, from
1
00:00:01,000 --> 00:00:04,074
Subtitles downloaded from www.OpenSubtitles.org
I need to get:
00:00:01,000 --> 00:00:04,074
and
Subtitles downloaded from www.OpenSubtitles.org.
I have already managed to make the regex for timing, but i am stuck for the text. I've tried to use look behind where I use my regex for timing:
( ?<=(\d+):(\d+):(\d+)(?:\,)(\d+) --> (\d+):(\d+):(\d+)(?:\,)(\d+) )\w+
but with no effect. Personally, i think that using look behind is the right way to solve this, but i am not sure how to write it correctly. Can anyone help me? Thanks.
Honestly, I don't see any reason to throw regex at this problem. .srt files are highly structured. The structure goes like:
an integer starting at 1, monotonically increasing
start --> stop timing
one or more lines of subtitle content
a blank line
... and repeat. Note the "one or more" in the third item: you might have to capture 1, 2, or 20 lines of subtitle content after the time code.
So, just take advantage of the structure. In this way you can parse everything in just one pass, without needing to put more than one line into memory at a time and still keeping all the information for each subtitle together.
from itertools import groupby
# "chunk" our input file, delimited by blank lines
with open(filename) as f:
    res = [list(g) for b,g in groupby(f, lambda x: bool(x.strip())) if b]
For example, using the example on the SRT doc page, I get:
res
Out[60]:
[['1\n',
'00:02:17,440 --> 00:02:20,375\n',
"Senator, we're making\n",
'our final approach into Coruscant.\n'],
['2\n', '00:02:20,476 --> 00:02:22,501\n', 'Very good, Lieutenant.\n']]
And I could further transform that into a list of meaningful objects:
from collections import namedtuple
Subtitle = namedtuple('Subtitle', 'number start end content')
subs = []
for sub in res:
    if len(sub) >= 3: # not strictly necessary, but better safe than sorry
        sub = [x.strip() for x in sub]
        number, start_end, *content = sub # py3 syntax
        start, end = start_end.split(' --> ')
        subs.append(Subtitle(number, start, end, content))
subs
Out[65]:
[Subtitle(number='1', start='00:02:17,440', end='00:02:20,375', content=["Senator, we're making", 'our final approach into Coruscant.']),
Subtitle(number='2', start='00:02:20,476', end='00:02:22,501', content=['Very good, Lieutenant.'])]
I disagree with @roippi. Regex is a very nice solution for text matching, and the regex for this problem is not tricky.
import re
f = open(yoursrtfile)
# Read the file content
content = f.read()
f.close()
# Find all results in content
# The first group (...) captures the timing, \s+ matches the whitespace after it,
# and (.+) captures the text that follows, up to the end of that line.
result = re.findall(r"(\d+:\d+:\d+,\d+ --> \d+:\d+:\d+,\d+)\s+(.+)", content)
# Just print out the result list. I recommend you do some formatting here.
print(result)
number:^[0-9]+$
Time:
^[0-9][0-9]:[0-9][0-9]:[0-9][0-9],[0-9][0-9][0-9] --> [0-9][0-9]:[0-9][0-9]:[0-9][0-9],[0-9][0-9][0-9]$
string: *[a-zA-Z]+*
Hope this helps.
Thanks @roippi for this excellent parser.
It helped me a lot to write a srt to stl converter in less than 40 lines (in python2 though, as it has to fit in a larger project)
from __future__ import print_function, division
from itertools import groupby
from collections import namedtuple
# prepare - adapt to your needs or use sys.argv
inputname = 'FR.srt'
outputname = 'FR.stl'
stlheader = """
$FontName = Arial
$FontSize = 34
$HorzAlign = Center
$VertAlign = Bottom
"""
def converttime(sttime):
    "convert from srt time format (0...999) to stl one (0...25)"
    st = sttime.split(',')
    return "%s:%02d"%(st[0], round(25*float(st[1]) /1000))
# load
with open(inputname,'r') as f:
    res = [list(g) for b,g in groupby(f, lambda x: bool(x.strip())) if b]
# parse
Subtitle = namedtuple('Subtitle', 'number start end content')
subs = []
for sub in res:
    if len(sub) >= 3: # not strictly necessary, but better safe than sorry
        sub = [x.strip() for x in sub]
        number, start_end, content = sub[0], sub[1], sub[2:] # py 2 syntax
        start, end = start_end.split(' --> ')
        subs.append(Subtitle(number, start, end, content))
# write
with open(outputname,'w') as F:
    F.write(stlheader)
    for sub in subs:
        F.write("%s , %s , %s\n"%(converttime(sub.start), converttime(sub.end), "|".join(sub.content)) )
For the timing:
pattern = (r"(\d{2}:\d{2}:\d{2},\d{3}?.*)")
None of the pure regex solutions above worked for real-life SRT files.
Let's take a look at the following SRT-patterned text:
1
00:02:17,440 --> 00:02:20,375
Some multi lined text
This is a second line

2
00:02:20,476 --> 00:02:22,501
as well as a single line

3
00:03:20,476 --> 00:03:22,501
should be able to parse unicoded text too
こんにちは
Note that:
text may contain Unicode characters.
text can consist of several lines.
every cue starts with an integer value and ends with a blank line; both Unix-style and Windows-style (CR/LF) line endings are accepted.
Here is the working regex:
\d+[\r\n](\d+:\d+:\d+,\d+ --> \d+:\d+:\d+,\d+)[\r\n]((.+\r?\n)+(?=(\r?\n)?))
https://regex101.com/r/qICmEM/1
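As a rough illustration of applying a multi-line cue pattern from Python, here is a sketch. Note that the CUE pattern below is a variant written for this example, not the regex above, and parse_srt and the sample text are made-up names and data. It assumes cues are separated by blank lines and accepts LF or CR/LF line endings.

```python
import re

CUE = re.compile(
    r"(\d+)\r?\n"                                     # cue number
    r"(\d+:\d+:\d+,\d+) --> (\d+:\d+:\d+,\d+)\r?\n"   # start and end times
    r"((?:[^\r\n]+(?:\r?\n|$))+)"                     # one or more non-empty text lines
)


def parse_srt(text):
    """Return a list of (number, start, end, lines) tuples."""
    cues = []
    for m in CUE.finditer(text):
        cues.append((int(m.group(1)), m.group(2), m.group(3), m.group(4).splitlines()))
    return cues


# Made-up sample; the second cue uses Windows-style line endings on purpose.
sample = (
    "1\n00:02:17,440 --> 00:02:20,375\nSome multi lined text\nThis is a second line\n\n"
    "2\r\n00:02:20,476 --> 00:02:22,501\r\nas well as a single line\r\n\r\n"
    "3\n00:03:20,476 --> 00:03:22,501\nshould be able to parse unicoded text too\nこんにちは\n"
)

for cue in parse_srt(sample):
    print(cue)
```

The text lines are matched with [^\r\n]+ rather than .+ because in Python's re module the dot also matches a carriage return, which would let a CR/LF blank line be swallowed as subtitle text.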

error comparing sequences - string interpreted as number

I'm trying to do something similar with my previous question.
My purpose is to join all sequences that are equal. But this time instead of letters, I have numbers.
alignment file can be found here - phylip file
the problem is when I try to do this:
records = list(SeqIO.parse(file(filename),'phylip'))
I get this error:
ValueError: Sequence 1 length 49, expected length 1001000000100000100000001000000000000000
I don't understand why because this is the second file I'm creating and the first one worked perfectly..
Below is the code used to build the alignment file:
fl.write('\t')
fl.write(str(161))
fl.write('\t')
fl.write(str(size))
fl.write('\n')
for i in info_plex:
    if 'ref' in i[0]:
        i[0] = 'H37Rv'
    fl.write(str(i[0]))
    num = 10 - len(i[0])
    fl.write(' ' * num)
    for x in i[1:]:
        fl.write(str(x))
    fl.write('\n')
So it shouldn't interpret 1001000000100000100000001000000000000000 as a number, since it's a string.
Any ideas?
Thank you!
Your PHYLIP file is broken. The header says 161 sequences, but there are 166. After fixing that, the current version of Biopython seems to load your file fine. Maybe use len(info_plex) when creating the header line.
P.S. It would have been a good idea to include the version of Biopython in your question.
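A minimal sketch of deriving the header from the data rather than hard-coding the counts. The write_phylip helper and the records below are made up for illustration, and the layout (names padded to 10 characters, one sequence per line) follows the question's code rather than any particular Biopython writer.

```python
import io


def write_phylip(records, out):
    """Write (name, sequence) pairs in a simple sequential PHYLIP-like layout."""
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same length")
    # Header: number of sequences and sequence length, both taken from the data
    out.write(" %d %d\n" % (len(records), lengths.pop()))
    for name, seq in records:
        # names are padded to 10 characters as in the question (and cut if longer)
        out.write(name[:10].ljust(10) + seq + "\n")


# Made-up records for illustration.
records = [("H37Rv", "0100100"), ("154995", "0000100"), ("168481", "0000100")]
buffer = io.StringIO()
write_phylip(records, buffer)
print(buffer.getvalue())
```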
The code of Kevin Jacobs in your former question employs Biopython that uses sequences of type Seq that
« are essentially strings of letters like AGTACACTGGT, which seems very
natural since this is the most common way that sequences are seen in
biological file formats. »
« There are two important differences between Seq objects and standard
Python strings. (...)
First of all, they have different methods. (...)
Secondly, the Seq object has an important
attribute, alphabet, which is an object describing what the individual
characters making up the sequence string “mean”, and how they should
be interpreted. For example, is AGTACACTGGT a DNA sequence, or just a
protein sequence that happens to be rich in Alanines, Glycines,
Cysteines and Threonines?
The alphabet object is perhaps the important thing that makes the Seq
object more than just a string. The currently available alphabets for
Biopython are defined in the Bio.Alphabet module. »
http://biopython.org/DIST/docs/tutorial/Tutorial.html
The reason for your problem is simply that SeqIO.parse() can't create Seq objects from a file containing characters that no alphabet attribute is able to manage.
So you must use another method, rather than forcing an ill-suited method onto a different problem.
Here's my way:
from itertools import groupby
from operator import itemgetter
import re
regx = re.compile('^(\d+)[ \t]+([01]+)',re.MULTILINE)
with open('pastie-2486250.rb') as f:
    records = regx.findall(f.read())
records.sort(key=itemgetter(1))
print 'len(records) == %s\n' % len(records)
n = 0
for seq,equal in groupby(records, itemgetter(1)):
    ids = tuple(x[0] for x in equal)
    if len(ids)>1:
        print '>%s :\n%s' % (','.join(ids), seq)
    else:
        n+=1
print '\nNumber of unique occurences : %s' % n
result
len(records) == 165
>154995,168481 :
0000000000001000000010000100000001000000000000000
>123031,74772 :
0000000000001111000101100011100000100010000000000
>176816,178586,80016 :
0100000000000010010010000010110011100000000000000
>129575,45329 :
0100000000101101100000101110001000000100000000000
Number of unique occurences : 156
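The grouping step can also be sketched without sorting, by collecting the ids in a dictionary keyed on the sequence. This is a different technique from the sort-then-groupby code above, written for Python 3; group_identical and the short records are made up for illustration.

```python
from collections import defaultdict


def group_identical(records):
    """Map each sequence that occurs more than once to the list of its ids."""
    groups = defaultdict(list)
    for seq_id, seq in records:
        groups[seq].append(seq_id)
    return {seq: ids for seq, ids in groups.items() if len(ids) > 1}


# Made-up (id, sequence) pairs, shortened for readability.
records = [
    ("154995", "0010"),
    ("123031", "0111"),
    ("168481", "0010"),
    ("74772", "0111"),
    ("80016", "1000"),
]

for seq, ids in group_identical(records).items():
    print(">%s\n%s" % (",".join(ids), seq))
```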
Edit
I've understood MY problem: I had left 'fasta' instead of 'phylip' in my code.
'phylip' is a valid value for the format argument of SeqIO.parse(), and with it everything works fine:
records = list(SeqIO.parse(file('pastie-2486250.rb'),'phylip'))
def seq_getter(s): return str(s.seq)
records.sort(key=seq_getter)
ecr = []
for seq,equal in groupby(records, seq_getter):
    ids = tuple(s.id for s in equal)
    if len(ids)>1:
        ecr.append( '>%s\n%s' % (','.join(ids),seq) )
print '\n'.join(ecr)
produces
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
>154995,168481
0000000000001000000010000100000001000000000000000
>123031,74772
0000000000001111000101100011100000100010000000000
>176816,178586,80016
0100000000000010010010000010110011100000000000000
>129575,45329
0100000000101101100000101110001000000100000000000
There is an incredible amount of characters ,,,,,,,,,,,,,,,, before the interesting data, I wonder what it is.
But my code isn't useless. See:
from time import clock
from itertools import groupby
from operator import itemgetter
import re
from Bio import SeqIO
def seq_getter(s): return str(s.seq)
t0 = clock()
with open('pastie-2486250.rb') as f:
    records = list(SeqIO.parse(f,'phylip'))
records.sort(key=seq_getter)
print clock()-t0,'seconds'
t0 = clock()
regx = re.compile('^(\d+)[ \t]+([01]+)',re.MULTILINE)
with open('pastie-2486250.rb') as f:
    records = regx.findall(f.read())
records.sort(key=itemgetter(1))
print clock()-t0,'seconds'
result
12.4826178327 seconds
0.228640588399 seconds
ratio = 55 !
