How to cycle through indices - python
So, in this script I am writing to learn Python, I would like to just put a wildcard instead of rewriting this whole block just to change line 2. What would be the most efficient way to consolidate this into a loop, where it will just use all of d.entries[0-99].content and repeat until finished? if, while, or for?
Also, my try/except does not perform as expected. What gives?
import feedparser, base64
from urlextract import URLExtract

d = feedparser.parse('https://www.reddit.com/r/PkgLinks.rss')
print(d.entries[3].title)
sr = str(d.entries[3].content)
spl1 = sr.split("<p>")
ss = str(spl1)
spl2 = ss.split("</p>")
try:
    st = str(spl2[0])
    # print(st)
except:
    binascii.Error
    st = str(spl2[1])
print(st)
#st = str(spl2[0])
spl3 = st.split("', '")
stringnow = str(spl3[1])
b64s1 = stringnow.encode('ascii')
b64s2 = base64.b64decode(b64s1)
stringnew = b64s2.decode('ascii')
print(stringnew)
## but line 15 does nothing; how do I fix it, and also loop through all d.entries[?].content?
The loop part is done simply by doing the following:
import feedparser, base64
from urlextract import URLExtract

d = feedparser.parse('https://www.reddit.com/r/PkgLinks.rss')

# loop from 0 to 99
# range(100) goes from 0 up to, but not including, 100
for i in range(100):
    print(d.entries[i].title)
    sr = str(d.entries[i].content)
    # << the rest of your code here >>
The data returned from d.entries[i].content is a list containing a dictionary, but you are converting it to a string, so you may want to check that you are doing what you really want to. Also, when you use .split() it produces a list of the split items, but you convert it back to a string once again (a few times). You may want to take another look at that part of the code.
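To illustrate the type issue above, here is a minimal sketch. The dict shape is a stand-in mimicking what feedparser typically puts in entry.content, so treat the exact keys and payload as assumptions:

```python
# Hypothetical stand-in for d.entries[i].content: feedparser returns a
# list containing a dict, with the HTML payload under the 'value' key.
entry_content = [{'type': 'text/html', 'value': '<p>aGVsbG8=</p>'}]

print(type(entry_content))        # a list, not a str
html = entry_content[0]['value']  # access the payload directly, no str()/split() chain needed
print(html)
```

Indexing into the structure directly avoids the str()/split() round-trips the question's code relies on.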
I haven't used regex much, but decided just to play with it and got this to work. I retrieved the contents of the 'value' key from the dictionary, then used regex to get the base64 info. I only tried it for the first 5 items (i.e., I changed range(100) to range(5)). Hope it helps. If not, I enjoyed doing this. Oh, I left all of the print statements I used as I was working down the code.
import feedparser, base64
from urlextract import URLExtract
import re

d = feedparser.parse('https://www.reddit.com/r/PkgLinks.rss')

for i in range(100):
    print(d.entries[i].title)
    # .content is a list.
    # print("---------")
    # print(type(d.entries[i].content))
    print(d.entries[i].content)
    print("---------")
    # gets the contents of key 'value' in the dictionary that is the 1st item in the list
    string_value = d.entries[i].content[0]['value']
    print(string_value)
    print("---------")
    # this assumes there is always a space between the 1st </p> and the 2nd <p>
    # grabs the text in between using re.search
    pattern = "<p>(.*?)</p>"
    substring = re.search(pattern, string_value).group(1)
    print(substring)
    print("---------")
    # rest of your code here
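On the try/except from the original question: for an except clause to catch binascii.Error, the exception class must go on the except line itself, and binascii must be imported. Writing "except:" and then "binascii.Error" on the next line creates a bare except that catches everything and then evaluates the name as a no-op. A minimal sketch of the corrected shape (the helper name and sample inputs are illustrative, not from the original):

```python
import base64
import binascii

def decode_b64(candidate):
    # The exception class belongs on the except line itself.
    # validate=True makes b64decode raise binascii.Error on bad input
    # instead of silently discarding non-alphabet characters.
    try:
        return base64.b64decode(candidate, validate=True).decode('ascii')
    except binascii.Error:
        return None

print(decode_b64('aGVsbG8='))     # hello
print(decode_b64('not base64!'))  # None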
Related
How to stop repeating the same text in loops - python
from requests import get

res = get("https://subsplease.org/api/?f=latest&tz=canada/central").json()
kek = []
for x in res:
    kek.append(x)
lnk = res[kek[0]]['downloads']
anime_name = res[kek[0]]['show']
for x in lnk:
    quality = x['res']
    links = x['magnet']
    data = f"{anime_name}:\n\n{quality}: {links}\n\n"
    print(data)

In this code, how can I prevent the repeating of the anime name? If I add it outside of the loop, only 1 link will be printed.
You can separate your string: the 1st half outside the loop, the 2nd inside the loop:

print(f"{anime_name}:\n\n")
for x in lnk:
    quality = x['res']
    links = x['magnet']
    data = f"{quality}: {links}\n\n"
    print(data)
Rewrote a bit. Make sure you look at a 'pretty' version of the JSON request using pprint or something, to understand where elements are and where you can loop (remembering to iterate through the dict):

from requests import get

data = get("https://subsplease.org/api/?f=latest&tz=canada/central").json()
for show, info in data.items():
    print(show, '\n')
    for download in info['downloads']:
        print(download['magnet'])
        print(download['res'])
    print('\n')

Also, you won't usually be able to just copy these links to get to the download; you usually need to use a torrent website.
Get list from string with exec in python
I have: "[15765,22832,15289,15016,15017]"
I want: [15765,22832,15289,15016,15017]
What should I do to convert this string to a list?
P.S. The post was edited without my permission and it lost an important part. The type of the line that looks like a list is 'bytes'. This is not a string.
P.S. №2. My initial code was:

import urllib.request, re

f = urllib.request.urlopen("http://www.finam.ru/cache/icharts/icharts.js")
lines = f.readlines()
for line in lines:
    m = re.match(r'var\s+(\w+)\s*=\s*\[\s*(.+)\s*\]\;', line.decode('windows-1251'))
    if m is not None:
        varname = m.group(1)
        if varname == "aEmitentIds":
            aEmitentIds = line  # its type is 'bytes', not 'string'

I need to get a list from line. A line from the web page looks like [15765, 22832, 15289, 15016, 15017]
Assuming s is your string, you can just use split and then cast each number to an integer:

s = [int(number) for number in s[1:-1].split(',')]

For detailed information about the split function: Python 3 split documentation
What you have is a stringified list. You could use a JSON parser to parse that information into the corresponding list:

import json

test_str = "[15765,22832,15289,15016,15017]"
l = json.loads(test_str)  # the list that you need

Or another way to do this would be to use ast:

import ast

test_str = "[15765,22832,15289,15016,15017]"
data = ast.literal_eval(test_str)

The result is [15765, 22832, 15289, 15016, 15017]. To understand why using eval() is bad practice, you could refer to this answer.
You can also use regex to pull out the numeric values from the string, as follows:

import re

lst = "[15765,22832,15289,15016,15017]"
lst = [int(number) for number in re.findall(r'\d+', lst)]

The output of the above code is [15765, 22832, 15289, 15016, 15017]
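Since the asker notes the line is actually bytes rather than str, a minimal sketch would decode it first and then parse (the windows-1251 encoding is taken from the question's own decode call; the bytes value is a toy stand-in):

```python
import ast

raw = b"[15765,22832,15289,15016,15017]"  # stand-in for the bytes line from the page
values = ast.literal_eval(raw.decode('windows-1251'))
print(values)  # [15765, 22832, 15289, 15016, 15017]
```

ast.literal_eval only accepts str input, so the decode step is required either way.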
Python: Replacing a string with unique replacements
I'm reading a file and I need to replace certain empty tags ([[Image:]]). The problem is every replacement has to be unique. Here's the code:

import re
import codecs

re_imagematch = re.compile(r'(\[\[Image:([^\]]+)?\]\])')

wf = codecs.open('converted.wiki', "r", "utf-8")
wikilines = wf.readlines()
wf.close()

imgidx = 0
for i in range(0, len(wikilines)):
    if re_imagematch.search(wikilines[i]):
        print 'MATCH #######################################################'
        print wikilines[i]
        wikilines[i] = re_imagematch.sub('[[Image:%s_%s.%s]]' % ('outname', imgidx, 'extension'), wikilines[i])
        print wikilines[i]
        imgidx += 1

This does not work, as there can be many tags in one line. Here's the input file:

[[Image:]][[Image:]]
[[Image:]]

This is what the output should look like:

[[Image:outname_0.extension]][[Image:outname_1.extension]]
[[Image:outname_2.extension]]

This is what it currently looks like:

[[Image:outname_0.extension]][[Image:outname_0.extension]]
[[Image:outname_1.extension]]

I tried using a replacement function; the problem is this function gets called only once per line using re.sub.
You can use itertools.count here and take some advantage of the fact that default arguments are calculated when the function is created, and the value of a mutable default argument can persist between function calls:

from itertools import count

def rep(m, cnt=count()):
    return '[[Image:%s_%s.%s]]' % ('outname', next(cnt), 'extension')

This function will be invoked for each match found, and it'll use a new value for each replacement. So, you simply need to change this line in your code:

wikilines[i] = re_imagematch.sub(rep, wikilines[i])

Demo:

def rep(m, count=count()):
    return str(next(count))

>>> re.sub(r'a', rep, 'aaa')
'012'

To get the current counter value:

>>> from copy import copy
>>> next(copy(rep.__defaults__[0])) - 1
2
I'd use a simple string replacement wrapped in a while loop:

s = '[[Image:]][[Image:]]\n[[Image:]]'
pattern = '[[Image:]]'

i = 0
while s.find(pattern) >= 0:
    s = s.replace(pattern, '[[Image:outname_' + str(i) + '.extension]]', 1)
    i += 1
print s
parsing a .srt file with regex
I am writing a small script in Python, but since I am quite new I got stuck on one part: I need to get the timing and text from a .srt file. For example, from

1
00:00:01,000 --> 00:00:04,074
Subtitles downloaded from www.OpenSubtitles.org

I need to get 00:00:01,000 --> 00:00:04,074 and Subtitles downloaded from www.OpenSubtitles.org. I have already managed to make the regex for the timing, but I am stuck on the text. I've tried to use a lookbehind with my regex for the timing:

(?<=(\d+):(\d+):(\d+)(?:\,)(\d+) --> (\d+):(\d+):(\d+)(?:\,)(\d+))\w+

but with no effect. Personally, I think that using a lookbehind is the right way to solve this, but I am not sure how to write it correctly. Can anyone help me? Thanks.
Honestly, I don't see any reason to throw regex at this problem. .srt files are highly structured. The structure goes like:

an integer starting at 1, monotonically increasing
start --> stop timing
one or more lines of subtitle content
a blank line

...and repeat. Note the "one or more lines" part - you might have to capture 1, 2, or 20 lines of subtitle content after the time code.

So, just take advantage of the structure. In this way you can parse everything in just one pass, without needing to put more than one line into memory at a time, and still keep all the information for each subtitle together.

from itertools import groupby

# "chunk" our input file, delimited by blank lines
with open(filename) as f:
    res = [list(g) for b, g in groupby(f, lambda x: bool(x.strip())) if b]

For example, using the example on the SRT doc page, I get:

res
Out[60]:
[['1\n',
  '00:02:17,440 --> 00:02:20,375\n',
  "Senator, we're making\n",
  'our final approach into Coruscant.\n'],
 ['2\n', '00:02:20,476 --> 00:02:22,501\n', 'Very good, Lieutenant.\n']]

And I could further transform that into a list of meaningful objects:

from collections import namedtuple

Subtitle = namedtuple('Subtitle', 'number start end content')

subs = []
for sub in res:
    if len(sub) >= 3:  # not strictly necessary, but better safe than sorry
        sub = [x.strip() for x in sub]
        number, start_end, *content = sub  # py3 syntax
        start, end = start_end.split(' --> ')
        subs.append(Subtitle(number, start, end, content))

subs
Out[65]:
[Subtitle(number='1', start='00:02:17,440', end='00:02:20,375', content=["Senator, we're making", 'our final approach into Coruscant.']),
 Subtitle(number='2', start='00:02:20,476', end='00:02:22,501', content=['Very good, Lieutenant.'])]
Disagree with #roippi. Regex is a very nice solution to text matching, and the regex for this solution is not tricky:

import re

# Parse the file content
f = open(yoursrtfile)
content = f.read()

# Find all results in content.
# The first group retrieves the timing, \s+ matches the whitespace in between,
# and the (.+) retrieves any text content after that.
result = re.findall("(\d+:\d+:\d+,\d+ --> \d+:\d+:\d+,\d+)\s+(.+)", content)

# Just print out the result list. I recommend you do some formatting here.
print result
number: ^[0-9]+$
time: ^[0-9][0-9]:[0-9][0-9]:[0-9][0-9],[0-9][0-9][0-9] --> [0-9][0-9]:[0-9][0-9]:[0-9][0-9],[0-9][0-9][0-9]$
string: *[a-zA-Z]+*

Hope this helps.
Thanks #roippi for this excellent parser. It helped me a lot to write an srt-to-stl converter in less than 40 lines (in Python 2 though, as it has to fit in a larger project):

from __future__ import print_function, division
from itertools import groupby
from collections import namedtuple

# prepare - adapt to your needs or use sys.argv
inputname = 'FR.srt'
outputname = 'FR.stl'
stlheader = """
$FontName = Arial
$FontSize = 34
$HorzAlign = Center
$VertAlign = Bottom
"""

def converttime(sttime):
    "convert from srt time format (0...999) to stl one (0...25)"
    st = sttime.split(',')
    return "%s:%02d" % (st[0], round(25 * float(st[1]) / 1000))

# load
with open(inputname, 'r') as f:
    res = [list(g) for b, g in groupby(f, lambda x: bool(x.strip())) if b]

# parse
Subtitle = namedtuple('Subtitle', 'number start end content')
subs = []
for sub in res:
    if len(sub) >= 3:  # not strictly necessary, but better safe than sorry
        sub = [x.strip() for x in sub]
        number, start_end, content = sub[0], sub[1], sub[2:]  # py 2 syntax
        start, end = start_end.split(' --> ')
        subs.append(Subtitle(number, start, end, content))

# write
with open(outputname, 'w') as F:
    F.write(stlheader)
    for sub in subs:
        F.write("%s , %s , %s\n" % (converttime(sub.start), converttime(sub.end), "|".join(sub.content)))
For the time:

pattern = "(\d{2}:\d{2}:\d{2},\d{3}?.*)"
None of the pure regex solutions above worked for real-life srt files. Let's take a look at the following SRT-patterned text:

1
00:02:17,440 --> 00:02:20,375
Some multi lined text
This is a second line

2
00:02:20,476 --> 00:02:22,501
as well as a single line

3
00:03:20,476 --> 00:03:22,501
should be able to parse unicoded text too
こんにちは

Take note that:

the text may contain unicode characters
the text can consist of several lines
every cue starts with an integer value and ends with a blank new line; both unix-style and windows-style CR/LF are accepted

Here is the working regex:

\d+[\r\n](\d+:\d+:\d+,\d+ --> \d+:\d+:\d+,\d+)[\r\n]((.+\r?\n)+(?=(\r?\n)?))

https://regex101.com/r/qICmEM/1
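As a quick sketch of how that regex might be applied from Python (the sample text is a shortened version of the one above; note that re.findall returns one tuple per cue, with the timing in the first slot and the text block in the second):

```python
import re

# Two cues from the sample above, with a trailing newline on the last line.
srt_text = (
    "1\n00:02:17,440 --> 00:02:20,375\n"
    "Some multi lined text\nThis is a second line\n\n"
    "2\n00:02:20,476 --> 00:02:22,501\n"
    "as well as a single line\n"
)

pattern = r"\d+[\r\n](\d+:\d+:\d+,\d+ --> \d+:\d+:\d+,\d+)[\r\n]((.+\r?\n)+(?=(\r?\n)?))"
for match in re.findall(pattern, srt_text):
    timing, text = match[0], match[1]  # later tuple slots are inner repetition groups
    print(timing)
    print(text.strip())
```

One caveat worth noting: because the `(.+\r?\n)+` repetition requires a trailing newline, a final cue with no newline at end-of-file would lose its last text line.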
error comparing sequences - string interpreted as number
I'm trying to do something similar to my previous question. My purpose is to join all sequences that are equal, but this time instead of letters I have numbers. The alignment file can be found here - phylip file. The problem is when I try to do this:

records = list(SeqIO.parse(file(filename), 'phylip'))

I get this error:

ValueError: Sequence 1 length 49, expected length 1001000000100000100000001000000000000000

I don't understand why, because this is the second file I'm creating and the first one worked perfectly. Below is the code used to build the alignment file:

fl.write('\t')
fl.write(str(161))
fl.write('\t')
fl.write(str(size))
fl.write('\n')
for i in info_plex:
    if 'ref' in i[0]:
        i[0] = 'H37Rv'
    fl.write(str(i[0]))
    num = 10 - len(i[0])
    fl.write(' ' * num)
    for x in i[1:]:
        fl.write(str(x))
    fl.write('\n')

So it shouldn't interpret 1001000000100000100000001000000000000000 as a number, since it's a string. Any ideas? Thank you!
Your PHYLIP file is broken. The header says 161 sequences, but there are 166. After fixing that, the current version of Biopython seems to load your file fine. Maybe use len(info_plex) when creating the header line. P.S. It would have been a good idea to include the version of Biopython in your question.
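A minimal sketch of that suggestion, deriving the header from the data so the declared count always matches the body (the info_plex and size names come from the question's code; the toy rows here are assumptions):

```python
# Toy stand-in data shaped like the question's info_plex:
# one [id, bit, bit, ...] row per sequence.
info_plex = [['74772', 1, 0, 1], ['45329', 0, 0, 1]]
size = len(info_plex[0]) - 1  # sequence length (everything after the id)

# Derive the sequence count from the data instead of hard-coding 161,
# so the PHYLIP header always agrees with the number of rows written.
header = "\t%d\t%d\n" % (len(info_plex), size)
print(repr(header))  # '\t2\t3\n'
```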
The code of Kevin Jacobs in your former question employs Biopython, which uses sequences of type Seq that "are essentially strings of letters like AGTACACTGGT, which seems very natural since this is the most common way that sequences are seen in biological file formats."

"There are two important differences between Seq objects and standard Python strings. (...) First of all, they have different methods. (...) Secondly, the Seq object has an important attribute, alphabet, which is an object describing what the individual characters making up the sequence string 'mean', and how they should be interpreted. For example, is AGTACACTGGT a DNA sequence, or just a protein sequence that happens to be rich in Alanines, Glycines, Cysteines and Threonines? The alphabet object is perhaps the important thing that makes the Seq object more than just a string. The currently available alphabets for Biopython are defined in the Bio.Alphabet module."

http://biopython.org/DIST/docs/tutorial/Tutorial.html

The reason for your problem is simply that SeqIO.parse() can't create Seq objects from a file containing characters for which there is no alphabet attribute able to manage them. So you must use another method, rather than trying to force an unsuitable method onto a different problem.
Here's my way:

from itertools import groupby
from operator import itemgetter
import re

regx = re.compile('^(\d+)[ \t]+([01]+)', re.MULTILINE)

with open('pastie-2486250.rb') as f:
    records = regx.findall(f.read())
records.sort(key=itemgetter(1))

print 'len(records) == %s\n' % len(records)
n = 0
for seq, equal in groupby(records, itemgetter(1)):
    ids = tuple(x[0] for x in equal)
    if len(ids) > 1:
        print '>%s :\n%s' % (','.join(ids), seq)
    else:
        n += 1
print '\nNumber of unique occurrences : %s' % n

result

len(records) == 165

>154995,168481 :
0000000000001000000010000100000001000000000000000
>123031,74772 :
0000000000001111000101100011100000100010000000000
>176816,178586,80016 :
0100000000000010010010000010110011100000000000000
>129575,45329 :
0100000000101101100000101110001000000100000000000

Number of unique occurrences : 156

Edit

I've understood MY problem: I had left 'fasta' instead of 'phylip' in my code. 'phylip' is a valid value for the attribute alphabet; with it, it works fine:

records = list(SeqIO.parse(file('pastie-2486250.rb'), 'phylip'))

def seq_getter(s):
    return str(s.seq)

records.sort(key=seq_getter)
ecr = []
for seq, equal in groupby(records, seq_getter):
    ids = tuple(s.id for s in equal)
    if len(ids) > 1:
        ecr.append('>%s\n%s' % (','.join(ids), seq))
print '\n'.join(ecr)

produces

,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
>154995,168481
0000000000001000000010000100000001000000000000000
>123031,74772
0000000000001111000101100011100000100010000000000
>176816,178586,80016
0100000000000010010010000010110011100000000000000
>129575,45329
0100000000101101100000101110001000000100000000000

There is an incredible number of ',' characters before the interesting data; I wonder what they are.

But my code isn't useless. See:

from time import clock
from itertools import groupby
from operator import itemgetter
import re
from Bio import SeqIO

def seq_getter(s):
    return str(s.seq)

t0 = clock()
with open('pastie-2486250.rb') as f:
    records = list(SeqIO.parse(f, 'phylip'))
records.sort(key=seq_getter)
print clock() - t0, 'seconds'

t0 = clock()
regx = re.compile('^(\d+)[ \t]+([01]+)', re.MULTILINE)
with open('pastie-2486250.rb') as f:
    records = regx.findall(f.read())
records.sort(key=itemgetter(1))
print clock() - t0, 'seconds'

result

12.4826178327 seconds
0.228640588399 seconds

ratio = 55!