Rewriting code with an array - python

I have a file X_true that consists of sentences like these:
evid emerg interview show done deal
munich hamburg train crash wednesday first gener ice model power two electr power locomot capac 759 passeng
one report earlier week said older two boy upset girlfriend broken polic confirm
jordan previous said
Now, instead of storing these sentences in a file, I wish to put them in an array (a list of strings) to work with them throughout the code. So the array would look something like this:
['evid emerg interview show done deal',
'munich hamburg train crash wednesday first gener ice model power two electr power locomot capac 759 passeng',
'one report earlier week said older two boy upset girlfriend broken polic confirm',
'jordan previous said']
Earlier when working with the file, this was the code I was using:
def run(command):
output = subprocess.check_output(command, shell=True)
return output
row = run('cat '+'/Users/mink/X_true.txt'+" | wc -l").split()[0]
Now that I'm working with X_true as an array, how can I write an equivalent statement for the row assignment above?

Use len(X_true_array), where X_true_array is the list holding your file's contents.
Before, you used wc -l to get the line count of your file; here the line count is simply the number of items in the list.
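For instance, a minimal sketch with a hypothetical two-sentence list standing in for the file:

```python
# Hypothetical two-element stand-in for the file's contents
X_true_array = [
    'evid emerg interview show done deal',
    'jordan previous said',
]
row = len(X_true_array)  # plays the role of `wc -l`
print(row)
```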

So, if I understand this correctly, you just want to read in a file and store each line as an element of an array?
X_true = []
with open("X_true.txt") as f:
    for line in f:
        X_true.append(line.strip())
Another option (thanks @roeland):
with open("X_true.txt") as f:
    X_true = list(map(str.strip, f))

with open("X_true.txt") as f:
    X_true = f.readlines()
or with stripping the newline character:
X_true = [line.rstrip('\n') for line in open("X_true.txt")]
Refer to Input and Output:

Try this:
Using readlines
X_true = open("x_true.txt").readlines()
Using read (note this leaves a trailing empty string if the file ends with a newline):
X_true = open("x_true.txt").read().split("\n")
Using List comprehension:
X_true = [line.rstrip() for line in open("x_true.txt")]

with open("X_true.txt") as f:
    array_of_lines = f.readlines()
array_of_lines will look like your example above. Note: each string in the array will still have the newline character at the end. Those can be removed with .strip() if they're a concern.
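As a quick illustration of the retained newlines (io.StringIO stands in for the open file here):

```python
import io

# Stand-in for open("X_true.txt")
f = io.StringIO("jordan previous said\n")
array_of_lines = f.readlines()
print(array_of_lines)                          # newline still attached
print([line.strip() for line in array_of_lines])  # newline removed
```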

Related

Why does "IndexError: list index out of range" appear? I did some tests and still can't find out

I'm working on a simple Python project to practice. I'm trying to retrieve data from a file and run a test on a value.
In my case I retrieve the data from a file as a table, and I test the last value of each row; if the test passes, I add the whole line to another file.
Here my data
AE300812 AFROUKH HAMZA 21 admis
AE400928 VIEGO SAN 22 refuse
AE400599 IBN KHYAT mohammed 22 admis
B305050 BOUNNEDI SALEM 39 refuse
Here is my code:
fichier = open("concours.txt","r")
fichier2 = open("admis.txt","w")
contenu = fichier.read()
tab = contenu.split()
for i in range(0,len(tab),5):
    if tab[i+4]=="admis":
        fichier2.write(tab[i]+" "+tab[i+1]+" "+tab[i+2]+" "+tab[i+3]+" "+tab[i+4]+" "+"\n")
fichier.close()
And here is the resulting error:
if tab[i+4]=="admis":
IndexError: list index out of range
You look at tab[i+4], so you have to make sure you stop the loop before that, e.g. with range(0, len(tab)-4, 5). The step=5 alone does not guarantee that you have a full "block" of 5 elements left.
But why does this occur, since each of the lines has 5 elements? They don't! Notice how one line has 6 elements (maybe a double name?), so if you just read and then split, you will run out of sync with the lines. Better to iterate over lines, and then split each line individually. Also, the actual separator seems to be either a tab \t or double spaces; it's not entirely clear from your data. A bare split() will split at any whitespace.
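A small demonstration of the drift, using two records from the question (the second has a double name, so a single split() yields 11 tokens instead of 10):

```python
# Two records from the question; the second has 6 fields because of the double name
lines = ["AE300812 AFROUKH HAMZA 21 admis",
         "AE400599 IBN KHYAT mohammed 22 admis"]
tab = " ".join(lines).split()
print(len(tab))   # 11 tokens -- every index after the first record shifts by one
print(tab[9])     # '22', not the status field the i+4 arithmetic expects
```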
Something like this (not tested):
fichier = open("concours.txt","r")
fichier2 = open("admis.txt","w")
for line in fichier:
    tab = line.strip().split(" ") # actual separator seems to be tab or double-space
    if tab[4]=="admis":
        fichier2.write(tab[0]+" "+tab[1]+" "+tab[2]+" "+tab[3]+" "+tab[4]+" "+"\n")
Depending on what you actually want to do, you might also try this:
with open("concours.txt","r") as fichier, open("admis.txt","w") as fichier2:
    for line in fichier:
        if line.strip().endswith("admis"):
            fichier2.write(line)
This should just copy the admis lines to the second file, with the original double-space separator.

Parsing numbers in strings from a file

I have a txt file as here:
pid,party,state,res
SC5,Republican,NY,Donald Trump 45%-Marco Rubio 18%-John Kasich 18%-Ted Cruz 11%
TB1,Republican,AR,Ted Cruz 27%-Marco Rubio 23%-Donald Trump 23%-Ben Carson 11%
FX2,Democratic,MI,Hillary Clinton 61%-Bernie Sanders 34%
BN1,Democratic,FL,Hillary Clinton 61%-Bernie Sanders 30%
PB2,Democratic,OH,Hillary Clinton 56%-Bernie Sanders 35%
What I want to do is check that the percentages in each "res" field add up to 100%.
def addPoll(pid,party,state,res,filetype):
    with open('Polls.txt', 'a+') as file: # open file temporarily for writing and reading
        lines = file.readlines() # get all lines from file
        file.seek(0)
        next(file) # go to next line --
        # this is supposed to skip the 1st line with pid/party/state/res
        for line in lines: # loop
            line = line.split(',', 3)[3]
            y = line.split()
            print y
        #else:
        #file.write(pid + "," + party + "," + state + "," + res+"\n")
        #file.close()
        return "pass"
print addPoll("123","Democratic","OH","bla bla 50%-Asd ASD 50%",'f')
So in my code I manage to split off the field after the last ',' and put it into a list, but I'm not sure how I can get only the numbers out of that text.
You can use regex to find all the numbers:
import re
for line in lines:
    numbers = re.findall(r'\d+', line)
    numbers = [int(n) for n in numbers]
    print(sum(numbers))
This will print
0 # no numbers in the first line
97
85
97
92
93
The re.findall() method finds all substrings matching the specified pattern, which in this case is \d+, meaning any continuous string of digits. This returns a list of strings, which we cast to a list of ints, then take the sum.
It seems like what you have is CSV. Instead of trying to parse it on your own, Python already has a built-in parser that will give you back nice dictionaries (so you can do line['res']):
import csv
with open('Polls.txt') as f:
    reader = csv.DictReader(f)
    for row in reader:
        # Do something with row['res']
        pass
For the # Do something part, you can parse the field manually (it appears to be structured): split('-'), then rsplit(' ', 1) each '-'-separated part (the last piece should be the percent). If you're trying to enforce a format, then I'd definitely go this route, but regexes are also a fine solution for quickly pulling out what you want. You'll want to read up on them, but in your case, you want \d+%:
# Manually parse (throws IndexError if there isn't a space separating candidate name and %)
percents = [candidate.rsplit(' ', 1)[1] for candidate in row['res'].split('-')]
if not all(p.endswith('%') for p in percents):
    # Handle bad percent (not ending in %)
    pass
else:
    # Throws ValueError if any of the percents aren't integers
    percents = [int(p[:-1]) for p in percents]
    if sum(percents) != 100:
        # Handle bad total
        pass
Or with regex:
percents = [int(match.group(1)) for match in re.finditer(r'(\d+)%', row['res'])]
if sum(percents) != 100:
    # Handle bad total here
    pass
Regex is certainly shorter, but the former will enforce more strict formatting requirements on row['res'] and will allow you to later extract things like candidate names.
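For example, a sketch of the manual route that pulls out both names and percents (the res string is borrowed from the question's sample data):

```python
# res value taken from the question's sample data
res = "Hillary Clinton 61%-Bernie Sanders 34%"
pairs = [part.rsplit(" ", 1) for part in res.split("-")]
names = [name for name, pct in pairs]
percents = [int(pct.rstrip("%")) for name, pct in pairs]
print(names)          # candidate names come along for free
print(sum(percents))  # 95 -- this row would fail a == 100 check
```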
Also some random notes:
You don't need to open with 'a+' unless you plan to append to the file; 'r' will do (and 'r' is implicit, so you don't have to specify it).
Instead of next() use a for loop!
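Putting those notes together, a minimal sketch (io.StringIO stands in for the opened file here):

```python
import io

# io.StringIO stands in for open('Polls.txt')
f = io.StringIO("pid,party,state,res\n"
                "FX2,Democratic,MI,Hillary Clinton 61%-Bernie Sanders 34%\n")
next(f)  # skip the header line -- no seek()/readlines() gymnastics needed
for line in f:
    res = line.split(',', 3)[3].strip()
    print(res)
```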

Regex remove certain characters from a file

I'd like to write a python script that reads a text file containing this:
FRAME
1 J=1,8 SEC=CL1 NSEG=2 ANG=0
2 J=8,15 SEC=CL2 NSEG=2 ANG=0
3 J=15,22 SEC=CL3 NSEG=2 ANG=0
And output a text file that looks like this:
1 1 8
2 8 15
3 15 22
I essentially don't need the commas or the SEC, NSEG and ANG data. Could someone help me use regex to do this?
So far I have this:
import re
r = re.compile(r"\s*(\d+)\s+J=(\S+)\s+SEC=(\S+)\s+NSEG=(\S+)+ANG=(\S+)\s")
with open('RawDataFile_445.txt') as a:
    # open all 4 files with a meaningful name
    file = open("outputfile.txt","w")
    for line in a:
Without regex:
for line in file:
    keep = []
    line = line.strip()
    if line.startswith('FRAME'):
        continue
    first, second, *_ = line.split()
    keep.append(first)
    first, second = second.split('=')
    keep.extend(second.split(','))
    print(' '.join(keep))
My advice? Since I don't write many regexes, I avoid writing big ones all at once. Since you've already done that, I would try to verify it a small chunk at a time, as illustrated in this code.
import re
r = re.compile(r"\s*(\d+)\s+J=(\S+)\s+SEC=(\S+)\s+NSEG=(\S+)+ANG=(\S+)\s")
r = re.compile(r"\s*(\d+)")
r = re.compile(r"\s*(\d+)\s+J=(\d+)")
with open('RawDataFile_445.txt') as a:
    a.readline()
    for line in a.readlines():
        result = r.match(line)
        if result:
            print(result.groups())
The first regex is your entire brute of an expression. The next line is the first chunk I verified. The next line is the second, bigger chunk that worked. Notice the slight change.
At this point I would go back, make the correction to the original, whole regex and then copy a bigger chunk to try. And re-run.
Let's focus on an example string we want to parse:
1 J=1,8
We have space(s), digit(s), more space(s), some characters, then digit(s), a comma, and more digit(s). If we replace them with regex characters, we get (\d+)\s+J=(\d+),(\d+), where + means we want 1 or more of that type. Note that we surround the digits with parentheses so we can capture them later with .groups() or .group(#), where # is the nth group.
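Assembling those pieces into a complete sketch (input lines taken from the question):

```python
import re

# Input from the question
text = """FRAME
1 J=1,8 SEC=CL1 NSEG=2 ANG=0
2 J=8,15 SEC=CL2 NSEG=2 ANG=0
3 J=15,22 SEC=CL3 NSEG=2 ANG=0"""

r = re.compile(r"\s*(\d+)\s+J=(\d+),(\d+)")
out = []
for line in text.splitlines():
    m = r.match(line)
    if m:                          # the FRAME header simply doesn't match
        out.append(" ".join(m.groups()))
print("\n".join(out))
```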

Using multiple genfromtxt on a single file

I'm fairly new to Python and am currently having problems with handling my input file reads. Basically I want my code to take an input file, where the relevant info is contained in blocks of 4 lines. For my specific purpose, I only care about the info in lines 1-3 of each block.
A two-block example of the input I'm dealing with, looks like:
#Header line 1
#Header line 2
'Mn 1', 5130.0059, -2.765, 5.4052, 2.5, 7.8214, 1.5, 1.310, 2.390, 0.500, 8.530,-5.360,-7.630,
' LS 3d6.(5D).4p z6F*'
' LS 3d6.(5D).4d e6F'
'K07 A Kurucz MnI 2007 1 K07 1 K07 1 K07 1 K07 1 K07 1 K07 1 K07 1 K07 1 K07 Mn '
'Fe 2', 5130.0127, -5.368, 7.7059, 2.5, 10.1221, 2.5, 1.030, 0.860, 0.940, 8.510,-6.540,-7.900,
' LS 3d6.(3F2).4p y4F*'
' LS 3d5.4s2 2F2'
'RU Kurucz FeII 2013 4 K13 5 RU 4 K13 4 K13 4 K13 4 K13 4 K13 4 K13 4 K13 Fe+ '
I would prefer to store the info from each of these three lines in separate arrays. Since the entries are a mix of strings and floats, I'm using Numpy.genfromtxt to read the input file, as follows:
import itertools
import numpy as np
with open(input_file) as f_in:
    # Opening file, reading every fourth line starting with line 2.
    data = np.genfromtxt(itertools.islice(f_in,2,None,4),dtype=None,delimiter=",")
    # Storing lower transition designation:
    low = np.genfromtxt(itertools.islice(f_in,3,None,4),dtype=str)
    # Storing upper transition designation:
    up = np.genfromtxt(itertools.islice(f_in,4,None,4),dtype=str)
Upon executing the code, genfromtxt correctly reads the information from the file the first time. However, for the second and third call to genfromtxt, I get the following warning
UserWarning: genfromtxt: Empty input file: "<itertools.islice object at 0x102d7a1b0>"
warnings.warn('genfromtxt: Empty input file: "%s"' % fname)
Whereas this is only a warning, the arrays returned by the second and third call of genfromtxt are empty, and not containing strings as expected. If I comment out the second and third call of genfromtxt, the code behaves as expected.
As far as I understand, the above should be working, and I'm a bit at a loss as to why it doesn't. Ideas?
After the first genfromtxt (well, really the islice), the file iterator has reached the end of the file. Thus the warnings and empty arrays: the second two islice calls are using an exhausted iterator.
You'll want to read the file into memory line-by-line with f_in.readlines() as in hpaulj's answer, or add f_in.seek(0) before your subsequent reads, to reset the file pointer back to the beginning of the input. This is a slightly more memory-friendly solution, which could be important if those files are really huge.
# Note: Untested code follows
with open(input_file) as f_in:
    data = np.genfromtxt(itertools.islice(f_in,2,None,4),dtype=None,delimiter=",")
    f_in.seek(0) # Set the file pointer back to the beginning
    low = np.genfromtxt(itertools.islice(f_in,3,None,4),dtype=str)
    f_in.seek(0) # Set the file pointer back to the beginning
    up = np.genfromtxt(itertools.islice(f_in,4,None,4),dtype=str)
Try this:
with open(input_file) as f_in:
    # Opening file, reading every fourth line starting with line 2.
    lines = f_in.readlines()
    data = np.genfromtxt(lines[2::4],dtype=None,delimiter=",")
    # Storing lower transition designation:
    low = np.genfromtxt(lines[3::4],dtype=str)
    # Storing upper transition designation:
    up = np.genfromtxt(lines[4::4],dtype=str)
I haven't used islice much, but the itertools tend to be generators, which iterate through to the end. You have to be careful when calling them repeatedly. You might be able to make islice work with tee or repeat. But the simplest, I think, is to get a list of lines and select the relevant ones with ordinary indexing.
Example with tee:
with open('myfile.txt') as f:
    its = itertools.tee(f,2)
    print(list(itertools.islice(its[0],0,None,2)))
    print(list(itertools.islice(its[1],1,None,2)))
Now the file is read once, but can be iterated through twice.

how to sort a list by the nth element in v2.3?

This is a simple script I wrote:
#!/usr/bin/env python
file = open('readFile.txt', 'r')
lines = file.readlines()
file.close()
del file
sortedList = sorted(lines, key=lambda lines: lines.split('\t')[-2])
file = open('outfile.txt', 'w')
for line in sortedList:
    file.write(line)
file.close()
del file
to rewrite a file like this:
161788 group_monitor.sgmops 4530 1293840320 1293840152
161789 group_atlas.atlas053 22350 1293840262 1293840152
161790 group_alice.alice017 210 1293840254 1293840159
161791 group_lhcb.pltlhc15 108277 1293949235 1293840159
161792 group_atlas.sgmatlas 35349 1293840251 1293840160
(where the last two fields are epoch time) ordered by the next to last field to this:
161792 group_atlas.sgmatlas 35349 1293840251 1293840160
161790 group_alice.alice017 210 1293840254 1293840159
161789 group_atlas.atlas053 22350 1293840262 1293840152
161788 group_monitor.sgmops 4530 1293840320 1293840152
161791 group_lhcb.pltlhc15 108277 1293949235 1293840159
As you can see, I used sorted(), which was introduced in v2.4. How can I rewrite the script for v2.3 so that it does the same thing?
In addition, I want to convert the epoch time to the human-readable format, so the resultant file looks like this:
161792 group_atlas.sgmatlas 35349 01/01/11 00:04:11 01/01/11 00:02:40
161790 group_alice.alice017 210 01/01/11 00:04:14 01/01/11 00:02:39
161789 group_atlas.atlas053 22350 01/01/11 00:04:22 01/01/11 00:02:32
I know, this strftime("%d/%m/%y %H:%M:%S", gmtime()) can be used to convert the epoch time but I just can't figure out how can I apply that to the script to rewrite the file in that format.
Comments? Advice treasured!
@Mark: Update
In some cases, the epoch time comes as 3600, which is to indicate an unfinished business. I wanted to print aborted instead of 01/01/1970 for such a line. So, I changed the format_seconds_since_epoch() like this:
def format_seconds_since_epoch(t):
    if t == 3600:
        return "aborted"
    else:
        return strftime("%d/%m/%y %H:%M:%S",datetime.fromtimestamp(t).timetuple())
which solved the problem. Is it the best that can be done in this regard? Cheers!!
file = open('readFile.txt', 'r')
lines = file.readlines()
file.close()
del file
lines = [line.split(' ') for line in lines]
lines.sort(lambda x,y: cmp(x[2], y[2]))
lines = [' '.join(line) for line in lines]
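A decorate-sort-undecorate variant (not in the answer above) also works on 2.3, and needs neither sorted() nor a cmp function: build (key, line) tuples, sort them, then strip the keys. Tab separators are assumed here, matching the other answer:

```python
# Decorate-sort-undecorate: 2.3-compatible sort on the next-to-last field.
# Sample lines borrowed from the question, assumed tab-separated.
lines = [
    "161791\tgroup_lhcb.pltlhc15\t108277\t1293949235\t1293840159",
    "161788\tgroup_monitor.sgmops\t4530\t1293840320\t1293840152",
]
decorated = [(int(line.split("\t")[-2]), line) for line in lines]
decorated.sort()  # tuples sort by their first element, the integer key
sorted_lines = [line for (key, line) in decorated]
print(sorted_lines[0].split("\t")[0])  # the row with the smaller timestamp
```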
In reply to your final query, you can create a datetime object from a time_t-like "seconds since the epoch" value using datetime.fromtimestamp, e.g.
from datetime import datetime
from time import strftime
def format_seconds_since_epoch(t):
    return strftime("%d/%m/%y %H:%M:%S",datetime.fromtimestamp(t).timetuple())
print format_seconds_since_epoch(1293840160)
So, putting that together with a slightly modified version of pynator's answer, you script might look like:
#!/usr/bin/env python
from datetime import datetime
from time import strftime
import os
def format_seconds_since_epoch(t):
    return strftime("%d/%m/%y %H:%M:%S",datetime.fromtimestamp(t).timetuple())
fin = open('readFile.txt', 'r')
lines = fin.readlines()
fin.close()
del fin
split_lines = [ line.split("\t") for line in lines ]
split_lines.sort( lambda a, b: cmp(int(a[-2]),int(b[-2])) )
fout = open('outfile.txt', 'w')
for split_line in split_lines:
    for i in (-2,-1):
        split_line[i] = format_seconds_since_epoch(int(split_line[i]))
    fout.write("\t".join(split_line)+os.linesep)
fout.close()
del fout
Note that using file as a variable name is a bad idea, since it shadows the built-in file type, so I changed them to fin and fout. (Even though you are deling the variables afterwards, it's still good style to avoid the name file, I think.)
In reply to your further question about the special "3600" value, your solution is fine. Personally, I would probably keep the format_seconds_since_epoch function as it is, so that it doesn't have a surprising special case and is more generally useful. You could create an additional wrapper function with the special case, or just change the split_line[i] = format_seconds_since_epoch(int(split_line[i])) line to:
entry = int(split_line[i])
if entry == 3600:
    split_line[i] = "aborted"
else:
    split_line[i] = format_seconds_since_epoch(entry)
... however I don't think there's much in the difference.
Incidentally, if this is more than a one-off task, I would see if you can use a later version of Python in the 2 series than 2.3, which is rather old now - they have lots of nice features that help one to write cleaner scripts.
