Separate lines in Python - python

I have a .txt file with 3 columns. The first column is just numbers. The second is a number from 0 to 7. The last is a sentence. I want to keep them in separate lists so that I can match rows by their numbers, and I want to write a function for this. How can I separate the columns into different lists without mixing up the rows?
The example of .txt:
1234 0 my name is
6789 2 I am coming
2346 1 are you new?
1234 2 Who are you?
1234 1 how's going on?
And I have to keep them like this:
----1----
1234 0 my name is
1234 1 how's going on?
1234 2 Who are you?
----2----
2346 1 are you new?
----3-----
6789 2 I am coming
What I've tried so far:
inputfile=open('input.txt','r').read()
m_id=[]
p_id=[]
packet_mes=[]
input_file=inputfile.split(" ")
print(input_file)
input_file=line.split()
m_id=[int(x) for x in input_file if x.isdigit()]
p_id=[x for x in input_file if not x.isdigit()]

With your current approach, you read the entire file as one string and split it on spaces. You'd much rather split on newlines, because each record is on its own line. Furthermore, you're not separating your data into columns properly.
You have 3 columns. You can split each line into 3 parts using str.split(None, 2). Passing None splits on runs of whitespace, and the 2 limits it to two splits so the sentence stays in one piece. Each group is stored as a key-list pair inside a dictionary. Here I use an OrderedDict in case you need to maintain insertion order; on Python 3.7 and later a plain dictionary (o = {}) preserves insertion order as well, while on older versions it does not.
from collections import OrderedDict
o = OrderedDict()
with open('input.txt', 'r') as f:
    for line in f:
        i, j, k = line.strip().split(None, 2)
        o.setdefault(i, []).append([int(i), int(j), k])
print(dict(o))
{'1234': [[1234, 0, 'my name is'],
[1234, 2, 'Who are you?'],
[1234, 1, "how's going on?"]],
'6789': [[6789, 2, 'I am coming']],
'2346': [[2346, 1, 'are you new?']]}
Always use the with...as context manager when working with file I/O - it makes for clean code. Also, note that for larger files, iterating over each line is more memory efficient.
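If the groups also need to come out in the order shown in the question (sorted by the first column, then by the second), one option is to sort after grouping. The following is a minimal sketch, not part of the original answer: the function name group_records and the in-memory sample list are illustrative assumptions, and it relies on Python 3.7+ dictionaries keeping insertion order.

```python
from collections import defaultdict


def group_records(lines):
    """Group 'id index text' lines by id, sorting ids and rows numerically."""
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        m_id, p_id, text = line.split(None, 2)
        groups[int(m_id)].append((int(p_id), text))
    # Sort the ids, then sort each group's rows by the second column.
    return {m_id: sorted(rows) for m_id, rows in sorted(groups.items())}


# Sample data copied from the question (stands in for reading input.txt).
sample = [
    "1234 0 my name is",
    "6789 2 I am coming",
    "2346 1 are you new?",
    "1234 2 Who are you?",
    "1234 1 how's going on?",
]

grouped = group_records(sample)
for number, (m_id, rows) in enumerate(grouped.items(), start=1):
    print("----{}----".format(number))
    for p_id, text in rows:
        print(m_id, p_id, text)
```

To use it on a real file, pass the open file object instead of the sample list, since iterating a file yields its lines.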

Maybe you want something like that:
import re
# Collect data from input file
h = {}
with open('input.txt', 'r') as f:
    for line in f:
        res = re.match(r"^(\d+)\s+(\d+)\s+(.*)$", line)
        if res:
            if res.group(1) not in h:
                h[res.group(1)] = []
            h[res.group(1)].append((res.group(2), res.group(3)))
# Output result
for i, x in enumerate(sorted(h.keys())):
    print("-------- %s -----------" % (i+1))
    for y in sorted(h[x]):
        print("%s %s %s" % (x, y[0], y[1]))
The result is as follows (add more newlines if you like):
-------- 1 -----------
1234 0 my name is
1234 1 how's going on?
1234 2 Who are you?
-------- 2 -----------
2346 1 are you new?
-------- 3 -----------
6789 2 I am coming
It's based on regexes (the re module in Python), a good tool when you want to match simple line-based patterns.
Here it relies on whitespace as the column separator, but it can just as easily be adapted for fixed-width columns.
The results are collected in a dictionary of lists, each list containing tuples (pairs) of position and text.
The program defers sorting the items until output.

It's quite ugly code, but it's easy to understand.
raw = []
with open("input.txt", "r") as file:
    for x in file:
        raw.append(x.strip().split(None, 2))
# Sorting puts rows with the same first column next to each other
raw = sorted(raw)
title = raw[0][0]
refined = []
cluster = []
for x in raw:
    if x[0] == title:
        cluster.append(x)
    else:
        # First column changed: close the current cluster and start a new one
        refined.append(cluster)
        cluster = []
        title = x[0]
        cluster.append(x)
refined.append(cluster)
# Number the groups from 1, as in the desired output
for number, group in enumerate(refined, 1):
    print("-"*10+str(number)+"-"*10)
    for line in group:
        print(*line)
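The sort-then-cluster loop above is the pattern that itertools.groupby from the standard library implements. A minimal sketch of the same idea, with the sample list standing in for the file (an assumption for illustration); note that the rows are sorted as strings, which only matches numeric order when the ids have equal width, as they do here.

```python
from itertools import groupby

# Sample data copied from the question (stands in for reading input.txt).
sample = [
    "1234 0 my name is",
    "6789 2 I am coming",
    "2346 1 are you new?",
    "1234 2 Who are you?",
    "1234 1 how's going on?",
]

# groupby only merges adjacent items, so the rows must be sorted first.
rows = sorted(line.split(None, 2) for line in sample)
clusters = [list(group) for _, group in groupby(rows, key=lambda row: row[0])]

for number, cluster in enumerate(clusters, start=1):
    print("-" * 4 + str(number) + "-" * 4)
    for row in cluster:
        print(*row)
```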

Related

How to parse a text file into a dictionary in python with key on one line followed by two lines of values

I have a file with lines in this format:
CALSPHERE 1
1 00900U 64063C 20161.15561498 .00000210 00000-0 21550-3 0 9996
2 00900 90.1544 28.2623 0029666 80.8701 43.4270 13.73380512769319
CALSPHERE 2
1 00902U 64063E 20161.16836122 .00000025 00000-0 23933-4 0 9990
2 00902 90.1649 30.9038 0019837 126.9344 3.6737 13.52683749559421
..etc.
I would like to parse this into a dictionary of the format:
{CALSPHERE 1:(1 00900U 64063C 20161.15561498 .00000210 00000-0 21550-3 0 9996, 2 00900 90.1544 28.2623 0029666 80.8701 43.4270 13.73380512769319),
CALSPHERE 2:(1 00902U 64063E 20161.16836122 .00000025 00000-0 23933-4 0 9990, 2 00902 90.1649 30.9038 0019837 126.9344 3.6737 13.52683749559421),...}
I'm puzzled as to how to parse this, so that every third line is the key, with the following two lines forming a tuple for the value. What would be the best way to do this in python?
I've attempted to add some logic for "every third line" though it seems kind of convoluted; something like
with open(r"file") as f:
    i = 3
    for line in f:
        if i % 3 == 0:
            key = line
        else:
            # not sure what to do with the next lines here
            pass
If your file always has the same layout (i.e. a 'CALSPHERE' line, or any other line that you want as your dictionary key, followed by two lines), you can achieve what you want by doing something like the following:
with open(filename) as file:
    lines = file.read().splitlines()
d = dict()
for i in range(0, len(lines), 3):
    d[lines[i].strip()] = (lines[i + 1], lines[i + 2])
Output:
{
'CALSPHERE 1': ('1 00900U 64063C 20161.15561498 .00000210 00000-0 21550-3 0 9996', '2 00900 90.1544 28.2623 0029666 80.8701 43.4270 13.73380512769319'),
'CALSPHERE 2': ('1 00902U 64063E 20161.16836122 .00000025 00000-0 23933-4 0 9990', '2 00902 90.1649 30.9038 0019837 126.9344 3.6737 13.52683749559421')
}
Assuming that your content is in file.txt, you can use the following.
It should work for any number of CALSPHERE entries and for a varying number of lines between them.
with open('file.txt') as inp:
    buffer = []
    for line in inp:
        # remove newline
        copy = line.replace('\n','')
        # check if next entry
        if 'CALSPHERE' in copy:
            buffer.append([])
        # add line
        buffer[-1].append(copy)
# put the output into dictionary
res = {}
for chunk in buffer:
    # safety check
    if len(chunk) > 1:
        res[chunk[0]] = tuple( chunk[1:] )
print(res)
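A variant of the same idea that builds the dictionary directly and tolerates stray lines before the first key might look like this. It is only a sketch: the function name parse_blocks and the is_key parameter are hypothetical, and the sample lines are copied from the question.

```python
def parse_blocks(lines, is_key=lambda line: line.startswith("CALSPHERE")):
    """Map each key line to a tuple of the lines that follow it."""
    result = {}
    current = None
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # ignore blank lines
        if is_key(line):
            current = line.strip()
            result[current] = []
        elif current is not None:
            result[current].append(line)
        # Lines that appear before the first key are silently dropped.
    return {key: tuple(values) for key, values in result.items()}


sample = [
    "CALSPHERE 1\n",
    "1 00900U 64063C 20161.15561498 .00000210 00000-0 21550-3 0 9996\n",
    "2 00900 90.1544 28.2623 0029666 80.8701 43.4270 13.73380512769319\n",
    "CALSPHERE 2\n",
    "1 00902U 64063E 20161.16836122 .00000025 00000-0 23933-4 0 9990\n",
    "2 00902 90.1649 30.9038 0019837 126.9344 3.6737 13.52683749559421\n",
]

parsed = parse_blocks(sample)
print(parsed)
```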

Parsing numbers in strings from a file

I have a txt file as here:
pid,party,state,res
SC5,Republican,NY,Donald Trump 45%-Marco Rubio 18%-John Kasich 18%-Ted Cruz 11%
TB1,Republican,AR,Ted Cruz 27%-Marco Rubio 23%-Donald Trump 23%-Ben Carson 11%
FX2,Democratic,MI,Hillary Clinton 61%-Bernie Sanders 34%
BN1,Democratic,FL,Hillary Clinton 61%-Bernie Sanders 30%
PB2,Democratic,OH,Hillary Clinton 56%-Bernie Sanders 35%
What I want to do is check that the percentages in each "res" add up to 100%.
def addPoll(pid,party,state,res,filetype):
    with open('Polls.txt', 'a+') as file: # open file temporarily for writing and reading
        lines = file.readlines() # get all lines from file
        file.seek(0)
        next(file) # go to next line --
        # this is supposed to skip the 1st line with pid/party/state/res
        for line in lines: # loop
            line = line.split(',', 3)[3]
            y = line.split()
            print y
            #else:
            #file.write(pid + "," + party + "," + state + "," + res+"\n")
            #file.close()
    return "pass"
print addPoll("123","Democratic","OH","bla bla 50%-Asd ASD 50%",'f')
So in my code I manage to split off everything after the last ',' and put it into a list, but I'm not sure how I can get only the numbers out of that text.
You can use regex to find all the numbers:
import re
for line in lines:
    numbers = re.findall(r'\d+', line)
    numbers = [int(n) for n in numbers]
    print(sum(numbers))
This will print
0 # no numbers in the first line
97
85
97
92
93
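The re.findall() method finds all substrings matching the specified pattern, which in this case is \d+, meaning any continuous string of digits. This returns a list of strings, which we convert to a list of ints, then take the sum. Note that because the pattern runs over the whole line, the digit in each pid (for example the 5 in SC5) is included in the sums shown; to total only the percentages, apply the pattern to the res field alone or match (\d+)% instead.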
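To total only the percentages, the pattern can be restricted to digits followed by a percent sign and applied to the last field. A minimal sketch, with the sample lines copied from the question standing in for Polls.txt (the totals dictionary is an illustrative name):

```python
import re

lines = [
    "pid,party,state,res",
    "SC5,Republican,NY,Donald Trump 45%-Marco Rubio 18%-John Kasich 18%-Ted Cruz 11%",
    "TB1,Republican,AR,Ted Cruz 27%-Marco Rubio 23%-Donald Trump 23%-Ben Carson 11%",
    "FX2,Democratic,MI,Hillary Clinton 61%-Bernie Sanders 34%",
    "BN1,Democratic,FL,Hillary Clinton 61%-Bernie Sanders 30%",
    "PB2,Democratic,OH,Hillary Clinton 56%-Bernie Sanders 35%",
]

totals = {}
for line in lines[1:]:  # skip the header line
    pid, party, state, res = line.split(",", 3)
    # Capture only digits that are immediately followed by '%'.
    totals[pid] = sum(int(n) for n in re.findall(r"(\d+)%", res))

for pid, total in totals.items():
    print(pid, total, "ok" if total == 100 else "does not add up to 100")
```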
It seems like what you have is CSV. Instead of trying to parse that on your own, Python already has a builtin parser that will give you back nice dictionaries (so you can do line['res']):
import csv
with open('Polls.txt') as f:
    reader = csv.DictReader(f)
    for row in reader:
        # Do something with row['res']
        pass
For the # Do something part, you can parse the field manually, since it appears to be structured: split('-'), then rsplit(' ', 1) each '-'-separated part (the last piece should be the percent). If you're trying to enforce a format, then I'd definitely go this route, but a regex is also a fine solution for quickly pulling out what you want. You'll want to read up on them, but in your case, you want \d+%:
# Manually parse (throws IndexError if there isn't a space separating candidate name and %)
percents = [candidate.rsplit(' ', 1)[1] for candidate in row['res'].split('-')]
if not all(p.endswith('%') for p in percents):
    # Handle bad percent (not ending in %)
    pass
else:
    # Throws ValueError if any of the percents aren't integers
    percents = [int(p[:-1]) for p in percents]
    if sum(percents) != 100:
        # Handle bad total
        pass
Or with regex:
percents = [int(match.group(1)) for match in re.finditer(r'(\d+)%', row['res'])]
if sum(percents) != 100:
    # Handle bad total here
    pass
Regex is certainly shorter, but the former will enforce more strict formatting requirements on row['res'] and will allow you to later extract things like candidate names.
Also some random notes:
You don't need to open with 'a+' unless you plan to append to the file, 'r' will do (and 'r' is implicit, so you don't have to specify it).
Instead of next() use a for loop!
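Putting the csv.DictReader loop and the manual parsing together could look roughly like this. It is a sketch under assumptions: io.StringIO stands in for the real Polls.txt, the helper name percent_total is hypothetical, and the last row reuses the values from the question's addPoll call.

```python
import csv
import io

# In-memory stand-in for Polls.txt.
data = """pid,party,state,res
SC5,Republican,NY,Donald Trump 45%-Marco Rubio 18%-John Kasich 18%-Ted Cruz 11%
FX2,Democratic,MI,Hillary Clinton 61%-Bernie Sanders 34%
123,Democratic,OH,bla bla 50%-Asd ASD 50%
"""


def percent_total(res):
    """Sum the percentages in a 'Name NN%-Name NN%' field, enforcing the format."""
    total = 0
    for part in res.split("-"):
        name, percent = part.rsplit(" ", 1)  # ValueError if no space is present
        if not percent.endswith("%"):
            raise ValueError("bad percent: {!r}".format(percent))
        total += int(percent[:-1])  # ValueError if not an integer
    return total


# DictReader consumes the header row itself, so no manual skipping is needed.
bad = [row["pid"] for row in csv.DictReader(io.StringIO(data))
       if percent_total(row["res"]) != 100]
print(bad)
```

Because the field is split on '-', this strict version assumes candidate names never contain a hyphen.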

Formatting output of CSV data?

I'm fairly new to python and made something that had this output:
(The text is in a CSV file like so:
1,A
2,B
3,C etc)
Number Letter
1 A
2 B
3 C
26 Z
Unfortunately, I spent a good amount of time making it using a complicated method in which I manually made spaces like this:
Updated code:
fx = int(input('Number?\n'))
f=open('nums.txt','r')
lines=f.readlines()
line = lines[fx - 1]
with open('nums.txt','r') as f:
    for i, line in enumerate(f):
        if i >= 5:
            break
        NUM, LTR, SMB = line.rsplit(',', 1)
        print(NUM.ljust(13) + LTR.ljust(13) + SMB)
How do I get it to make 3 columns? Right now it comes up with a
ValueError: not enough values to unpack (expected 3, got 2)
So is there a simpler method of achieving this that doesn't move the strings around like this:
Number Letter
1 A
2 B
3 C
26 Z #< string moves with spaces.
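For simple alignment, you can use ljust or rjust. There is also no need to read the entire file for each line you want to process. The ValueError comes from rsplit(',', 1), which returns at most two parts, so the result cannot be unpacked into three names:
with open('numberletter','r') as f:
    for i, line in enumerate(f):
        if i >= 5:
            break
        # strip() drops the trailing newline so print() does not emit a blank line
        number, letter = line.strip().rsplit(',', 1)
        print(number.ljust(13) + letter)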
For more complex output formatting, look at str.format() and the format specification syntax.
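As an illustration of the str.format() route, a format spec such as {:<13} left-aligns a value in a 13-character field, so the second column starts at the same position regardless of how many digits the number has. A minimal sketch with the rows hard-coded (an assumption standing in for the CSV file):

```python
rows = [("1", "A"), ("2", "B"), ("3", "C"), ("26", "Z")]

# '<' left-aligns and 13 is the field width; both columns share one template.
template = "{:<13}{:<13}"

lines = [template.format("Number", "Letter")]
lines += [template.format(number, letter) for number, letter in rows]
print("\n".join(lines))
```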
You can use the sys module for that.
import sys
a=[1,"A"]
sys.stdout.write("%-6s %-50s " % (a[0],a[1]))

Regex remove certain characters from a file

I'd like to write a python script that reads a text file containing this:
FRAME
1 J=1,8 SEC=CL1 NSEG=2 ANG=0
2 J=8,15 SEC=CL2 NSEG=2 ANG=0
3 J=15,22 SEC=CL3 NSEG=2 ANG=0
And output a text file that looks like this:
1 1 8
2 8 15
3 15 22
I essentially don't need the commas or the SEC, NSEG and ANG data. Could someone help me use regex to do this?
So far I have this:
import re
r = re.compile(r"\s*(\d+)\s+J=(\S+)\s+SEC=(\S+)\s+NSEG=(\S+)+ANG=(\S+)\s")
with open('RawDataFile_445.txt') as a:
# open all 4 files with a meaningful name
file=[open(outputfile.txt","w")
for line in a:
Without regex:
for line in file:
    keep = []
    line = line.strip()
    if line.startswith('FRAME'):
        continue
    first, second, *_ = line.split()
    keep.append(first)
    first, second = second.split('=')
    keep.extend(second.split(','))
    print(' '.join(keep))
My advice? Since I don't write many regexes, I avoid writing big ones all at once. Since you've already done that, I would try to verify it a small chunk at a time, as illustrated in this code.
import re
r = re.compile(r"\s*(\d+)\s+J=(\S+)\s+SEC=(\S+)\s+NSEG=(\S+)+ANG=(\S+)\s")
r = re.compile(r"\s*(\d+)")
r = re.compile(r"\s*(\d+)\s+J=(\d+)")
with open('RawDataFile_445.txt') as a:
    a.readline()
    for line in a.readlines():
        result = r.match(line)
        if result:
            print (result.groups())
The first regex is your entire brute of an expression. The next line is the first chunk I verified. The next line is the second, bigger chunk that worked. Notice the slight change.
At this point I would go back, make the correction to the original, whole regex and then copy a bigger chunk to try. And re-run.
Let's focus on an example string we want to parse:
1 J=1,8
We have space(s), digit(s), more space(s), some characters, then digit(s), a comma, and more digit(s). If we replace them with regex characters, we get (\d+)\s+J=(\d+),(\d+), where + means we want 1 or more of that type. Note that we surround the digits with parentheses so we can capture them later with .groups() or .group(#), where # is the nth group.
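Applied to the sample input, that pattern could be used roughly as follows. This is a sketch: the raw list stands in for the input file, converted is an illustrative name, and writing the lines to an output file is left out.

```python
import re

# Leading \s* tolerates indentation; three groups capture the three numbers.
pattern = re.compile(r"\s*(\d+)\s+J=(\d+),(\d+)")

raw = [
    "FRAME",
    "1 J=1,8 SEC=CL1 NSEG=2 ANG=0",
    "2 J=8,15 SEC=CL2 NSEG=2 ANG=0",
    "3 J=15,22 SEC=CL3 NSEG=2 ANG=0",
]

converted = []
for line in raw:
    match = pattern.match(line)
    if match:  # the FRAME header does not match and is skipped
        converted.append(" ".join(match.groups()))

print("\n".join(converted))
```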

How can I increase the i in a for?

I want to increase the value of i by 6 on each iteration, to read a file in which each question has 4 answers and a character marking the correct answer, for example:
A
Which sport uses the term LOVE ?
Tennis
Golf
Football
Swimming
B
What is the German word for WATER ?
Wodar
Wasser
Werkip
Waski
My code:
fd = open(dFile)
lineas=fd.readlines()
fd.close()
for i in range(len(lineas)):
    print "CA:"+lineas[i]+"Q:"+lineas[i+1]+"A1:"+lineas[i+2]+"A2:"+lineas[i+3]+"A3:"+lineas[i+4]+"A4:"+lineas[i+5];
    i=i+6
Try using the "step size" argument to range or xrange:
fd = open(dFile)
lineas = fd.readlines()
fd.close()
for i in xrange(0, len(lineas), 6):
    print "CA:"+lineas[i]+"Q:"+lineas[i+1]+"A1:"+lineas[i+2]+"A2:"+lineas[i+3]+"A3:"+lineas[i+4]+"A4:"+lineas[i+5];
range has an optional step argument.
for i in range(0, 10, 3):
    print i # Prints 0, 3, 6, 9
For your case, use a step size of 6.
for i in range(0, len(lineas), 6):
The following code snippets illustrate a way of treating each group of six lines from the input file as a tuple, which removes some of the clumsy lineas[i], lineas[i+1] etc indexing (at the cost of a clumsy-looking zip statement). The first snippet of code is just to create a test file containing numbered lines.
with open('eh','w') as fo:
    for i in range(19):
        fo.write('{}\n'.format(i))
...
with open('eh') as fi:
    for (ca,q,a1,a2,a3,a4) in zip(*[iter(fi.readlines())]*6):
        print "CA:"+ca+"Q:"+q+"A1:"+a1+"A2:"+a2+"A3:"+a3+"A4:"+a4
This produces output like
CA:0
Q:1
A1:2
A2:3
A3:4
A4:5
CA:6
Q:7
A1:8
(etc)
Note, I would ordinarily write something like
print 'CA: {} Q: {} A1: {} A2: {} A3: {} A4: {}'.format(ca,q,a1,a2,a3,a4)
instead of
print "CA:"+ca+"Q:"+q+"A1:"+a1+"A2:"+a2+"A3:"+a3+"A4:"+a4
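The snippets in this thread use Python 2 syntax (print statement, xrange). For Python 3, a stepped range combined with slicing gives the same six-line grouping. A minimal sketch: the function name read_questions is hypothetical, and the sample list stands in for the lines read from the file.

```python
def read_questions(lines, size=6):
    """Split lines into tuples of `size` items, dropping an incomplete tail."""
    cleaned = [line.strip() for line in lines]
    usable = len(cleaned) - len(cleaned) % size
    return [tuple(cleaned[i:i + size]) for i in range(0, usable, size)]


sample = [
    "A", "Which sport uses the term LOVE ?",
    "Tennis", "Golf", "Football", "Swimming",
    "B", "What is the German word for WATER ?",
    "Wodar", "Wasser", "Werkip", "Waski",
]

questions = read_questions(sample)
for correct, question, *answers in questions:
    print("CA: {} Q: {} Answers: {}".format(correct, question, ", ".join(answers)))
```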
