Reading lists like "list = [][][]" from a couple of txt files - python

I have 9 txt files. Every text is like:
[(0.0, 32.633221, 39.91769),
(8.32, 32.633717, 39.917892),
(25.35, 32.633945, 39.917538),
(25.93, 32.634262, 39.916946),
(7.24, 32.634888, 39.91674),
(0.0, 32.635014, 39.916737),
(15.31, 32.635242, 39.916569),
(22.12, 32.635727, 39.916176)....
I want to create a new text file that contains only the first elements of every element. I mean like:
list_firsttxtfile = [(0.0), (8.32), (25.35), (25.93),... ]

Another way of doing it.
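list_firsttxtfile = []
with open("mytextfile.txt", "r") as f:
    data = f.read()
# Every tuple ends with ")", so splitting on it gives one chunk per tuple
parts = data.split(")")
for part in parts:
    # Drop the brackets, commas and whitespace left around the chunk
    sections = part.strip(" \n,[(]").split(",")
    if sections[0]:
        list_firsttxtfile.append(float(sections[0]))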

If every line of your file contains at least one float, you can extract it using a regular expression:
import re
first_float = re.compile(r'[-+]?\d*\.?\d+')
with open("firstfile.txt") as txtfile:
    # if line.strip() skips blank lines if any
    list_first_txt_file = [float(first_float.search(line).group()) for line in txtfile if line.strip()]
Be aware that if your file contains lines where the regex doesn't match, this will fail with AttributeError: 'NoneType' object has no attribute 'group'. You can change the if part of the list comprehension to if first_float.search(line) to avoid that, but it will then run the regex on the same line twice, which might not be great if your files are big.
If you are absolutely sure that there are no blank lines and that every line will have a match, then you can remove the if entirely.
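The double search mentioned above can be avoided by keeping the match object around. The following is a minimal sketch, assuming the same regex; the helper name first_floats and the sample lines are illustrative only and not part of the original answer.

```python
import re

FIRST_FLOAT = re.compile(r'[-+]?\d*\.?\d+')

def first_floats(lines):
    """Yield the first number found on each line, skipping lines without one."""
    for line in lines:
        match = FIRST_FLOAT.search(line)  # the regex runs once per line
        if match:
            yield float(match.group())

# Hypothetical sample standing in for the lines of one txt file
sample = [
    "[(0.0, 32.633221, 39.91769),\n",
    "(8.32, 32.633717, 39.917892),\n",
    "\n",
    "(25.35, 32.633945, 39.917538),\n",
]
values = list(first_floats(sample))
print(values)
```

Because the function accepts any iterable of lines, an open file object can be passed in place of the sample list.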

Related

Using Regex to search a plaintext file line by line and cherry pick lines based on matches

I'm trying to read a plaintext file line by line, cherry pick lines that begin with a pattern of any six digits. Pass those to a list and then write that list row by row to a .csv file.
Here's an example of a line I'm trying to match in the file:
**000003** ANW2248_08_DESOLATE-WASTELAND-3. A9 C 00:55:25:17 00:55:47:12 10:00:00:00 10:00:21:20
And here is a link to two images, one showing the above line in context of the rest of the file and the expected result: https://imgur.com/a/XHjt9e1
import csv
identifier = re.compile(r'^(\d\d\d\d\d\d)')
matched_line = []
with open('file.edl', 'r') as file:
    reader = csv.reader(file)
    for line in reader:
        line = str(line)
        if identifier.search(line) == True:
            matched_line.append(line)
        else: continue
with open('file.csv', 'w') as outputEDL:
    print('Copying EDL contents into .csv file for reformatting...')
    outputEDL.write(str(matched_line))
Expected result would be the reader gets to a line, searches using the regex, then if the result of the search finds the series of 6 numbers at the beginning, it appends that entire line to the matched_line list.
What I'm actually getting is, once I write what reader has read to a .csv file, it has only picked out [], so the regex search obviously isn't functioning properly in the way I've written this code. Any tips on how to better form it to achieve what I'm trying to do would be greatly appreciated.
Thank you.
More examples of the expected input and output would help with solving this problem, but from what I can see you are trying to write each line of a text file that contains a timestamp to a CSV file. In that case, here is some pseudocode that might help you solve your problem, along with a separate regex match function to make your code more readable:
import re
def match_time(line):
    pattern = re.compile(r'(?:\d+[:]\d+[:]\d+[:]\d+)+')
    result = pattern.findall(line)
    return " ".join(result)
This returns a string containing every timecode found on the line, or an empty string if there is no match.
lines = []
with open('yourfile.txt', 'r') as txtfile:
    for line in txtfile:
        res = match_time(line)
        # alternatively you can test if res in line which might be better
        if res != "":
            lines.append(line)
# Write to a different file: opening the input file with 'w' would truncate it
with open('yourfile.csv', 'w') as csvfile:
    for item in lines:
        csvfile.write(item)
This opens the text file for reading, appends each line that contains a timecode to a list, then iterates over that list and writes each line to the CSV file.
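For the pattern in the question itself (six digits at the start of a line), two details in the posted code seem to explain the empty result: re.search returns a match object or None, so comparing it with == True is never true, and str() of a csv.reader row begins with "['", so the ^ anchor cannot match a digit. Below is a minimal sketch that reads plain lines instead. It assumes the asterisks around 000003 in the question are emphasis markup rather than file content; the function names and the sample lines are hypothetical.

```python
import csv
import io
import re

# Six digits at the very start of the line
IDENTIFIER = re.compile(r'^\d{6}')

def pick_event_lines(lines):
    """Return the lines that start with a six-digit event number."""
    return [line.rstrip('\n') for line in lines if IDENTIFIER.match(line)]

def write_rows(rows, out):
    """Write each matched line as one CSV row, split on whitespace."""
    csv.writer(out).writerows(row.split() for row in rows)

# Hypothetical sample standing in for the lines of file.edl
sample = [
    "TITLE: EXAMPLE\n",
    "000003 ANW2248_08_DESOLATE-WASTELAND-3. A9 C 00:55:25:17 00:55:47:12 10:00:00:00 10:00:21:20\n",
    "* FROM CLIP NAME: EXAMPLE\n",
]
matched = pick_event_lines(sample)

# io.StringIO stands in for open('file.csv', 'w', newline='')
out = io.StringIO()
write_rows(matched, out)
print(out.getvalue())
```

Splitting on whitespace is only one possible choice of columns; if the whole line should land in a single cell, pass [row] instead of row.split().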

Concatenate lines with previous line based on number of letters in first column

I'm new to coding and trying to figure out how to fix a broken csv file so that I can work with it properly.
The file has been exported from a case management system and contains fields for username, casenr, time spent, notes and date.
The problem is that occasional notes have newlines in them, and when exporting the csv the tooling does not add quotation marks to mark the note as a single string field.
See the example below:
user;case;hours;note;date;
tnn;123;4;solved problem;2017-11-27;
tnn;124;2;random comment;2017-11-27;
tnn;125;3;I am writing a comment
that contains new lines
without quotation marks;2017-11-28;
HJL;129;8;trying to concatenate lines to re form the broken csv;2017-11-29;
I would like to concatenate lines 3,4 and 5 to show the following:
tnn;125;3;I am writing a comment that contains new lines without quotation marks;2017-11-28;
Since every line starts with a username (always 3 letters) I thought I would be able to iterate the lines to find which lines do not start with a username and concatenate that with the previous line.
It is not really working as expected though.
This is what I have got so far:
import re
with open('Rapp.txt', 'r') as f:
    for line in f:
        previous = line #keep current line in variable to join next line
        if not re.match(r'^[A-Za-z]{3}', line): #regex to match 3 letters
            print(previous.join(line))
Script shows no output just finishes silently, any thoughts?
I think I would go a slightly different way:
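import re
all_the_data = ""
with open('Rapp.txt', 'r') as f:
    for line in f:
        # A complete record ends with a date and ";"; the header row is complete too
        complete = line.startswith("user;") or re.search(r"\d{4}-\d{1,2}-\d{1,2};\n", line)
        if not complete:
            # Swap the newline for a space so the joined words stay separated
            line = re.sub(r"\n", " ", line)
        all_the_data = "".join([all_the_data, line])
print (all_the_data)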
There are several ways to do this, each with pros and cons, but I think this keeps it simple.
Loop over the file as you have done, and if the line doesn't end in a date and ; replace its newline with a space before adding it to all_the_data. That way you don't have to look back 'up' the file. Again, there are lots of ways to do this. If you would rather use the logic of "starts with 3 letters and a ;" and look back, this works:
import re
all_the_data = ""
with open('Rapp.txt', 'r') as f:
    for line in f:
        if not re.search(r"^[A-Za-z]{3};", line):
            # Continuation line: replace the previous line's newline with a space
            all_the_data = re.sub(r"\n$", " ", all_the_data)
        all_the_data = "".join([all_the_data, line])
print ("results:")
print (all_the_data)
Pretty much what was asked for. The logic being if the current line doesn't start right, take out the previous line's carriage return from all_the_data.
If you need help playing with the regex itself, this site is great: http://regex101.com
The regex in your code matches every line in the file, including the continuation lines (for example, "that contains new lines" also starts with three letters). The if not condition is therefore never true, and hence nothing prints.
with open('./Rapp.txt', 'r') as f:
    join_words = []
    for line in f:
        line = line.strip()
        # A ";" within the first four characters marks the start of a new record
        if len(line) > 3 and ";" in line[0:4] and len(join_words) > 0:
            print(' '.join(join_words))
            join_words = []
            join_words.append(line)
        else:
            join_words.append(line)
    # Print the last record, which is still buffered
    print(" ".join(join_words))
I've tried not to use regex here to keep it a little clearer, but regex is the better option.
A simple way would be to use a generator that acts as a filter on the original file. That filter concatenates a line to the previous one if its 4th character is not a semicolon (;). The code could be:
def preprocess(fd):
    previous = next(fd)
    for line in fd:
        # Slicing avoids an IndexError on lines shorter than 4 characters
        if line[3:4] == ';':
            yield previous
            previous = line
        else:
            previous = previous.strip() + " " + line
    yield previous # don't forget last line!
You could then use:
import csv
with open('test.txt') as fd:
    rd = csv.DictReader(preprocess(fd), delimiter=';')
    for row in rd:
        ...
The trick here is that the csv module only requires an object that returns a line each time the next function is applied to it, so a generator is appropriate.
But this is only a workaround; the correct fix would be for the export step to produce a valid CSV file in the first place.
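As a quick check of the generator approach, here is a sketch that runs it against the sample data from the question. It uses io.StringIO as a stand-in for the real file and csv.reader with delimiter=';'; both are choices made for this illustration, not part of the original answer.

```python
import csv
import io

def preprocess(fd):
    """Yield complete records, folding continuation lines into the previous one."""
    previous = next(fd)
    for line in fd:
        if line[3:4] == ';':   # three-letter username followed by ';'
            yield previous
            previous = line
        else:                  # continuation of the previous note
            previous = previous.strip() + " " + line
    yield previous

# Sample data copied from the question; StringIO stands in for open('Rapp.txt')
sample = io.StringIO(
    "user;case;hours;note;date;\n"
    "tnn;123;4;solved problem;2017-11-27;\n"
    "tnn;124;2;random comment;2017-11-27;\n"
    "tnn;125;3;I am writing a comment\n"
    "that contains new lines\n"
    "without quotation marks;2017-11-28;\n"
    "HJL;129;8;trying to concatenate lines to re form the broken csv;2017-11-29;\n"
)
rows = list(csv.reader(preprocess(sample), delimiter=';'))
for row in rows:
    print(row)
```

Each row ends with an empty field because every line in the export ends with a trailing semicolon.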

Iterate over a CSV file Python

I have a CSV file that looks like this
a,b,c
d1,g4,4m
t,35,6y
mm,5,m
I'm trying to replace all the m's and y's preceded by a number with 'month' and 'year' respectively. I'm using the following script.
import re,csv
out = open ("out.csv", "wb")
file = "in.csv"
with open(file, 'r') as f:
    reader = csv.reader(f)
    for ss in reader:
        s = str(ss)
month_pair = (re.compile('(\d\s*)m'), 'months')
year_pair = (re.compile('(\d\s*)y'), 'years')
def substitute(s, pairs):
    for (pattern, substitution) in pairs:
        match = pattern.search(s)
        if match:
            s = pattern.sub(match.group(1)+substitution, s)
    return s
pairs = [month_pair, year_pair]
print (substitute(s, pairs))
It does replace but it does that only on the last row, ignoring the ones before it. How can I have it iterate over all the rows and write to another csv file?
You can use a positive look-behind:
>>> re.sub(r'(?<=\d)m','months',s)
'a,b,c\nd1,g4,4months\nt,35,6y\nmm,5,m'
>>> re.sub(r'(?<=\d)y','years',s)
'a,b,c\nd1,g4,4m\nt,35,6years\nmm,5,m'
In this line
print (substitute(s, pairs))
your variable s holds only the last line of your file. The reading loop overwrites s with the current line on every iteration, and the substitution runs only once, after the loop has finished.
Solutions (choose one):
You could add another for-loop that iterates over all the lines.
Or move the substitution into the for-loop where you read the lines of the file. This is definitely the better solution!
You can easily look up how to write a new file or change the file you are working on.
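Moving the substitution into the reading loop could look like the sketch below. It is one possible arrangement, not the asker's final code: the function names are made up for this example, io.StringIO objects stand in for in.csv and out.csv, and a \b word boundary has been added to the patterns so that only a trailing m or y is expanded.

```python
import csv
import io
import re

# A digit (plus optional whitespace) followed by a lone m or y
MONTH = re.compile(r'(\d\s*)m\b')
YEAR = re.compile(r'(\d\s*)y\b')

def substitute(cell):
    """Expand an m/y unit that directly follows a digit."""
    cell = MONTH.sub(r'\g<1>months', cell)
    return YEAR.sub(r'\g<1>years', cell)

def convert(src, dst):
    """Copy every row from src to dst, substituting cell by cell."""
    writer = csv.writer(dst, lineterminator='\n')
    for row in csv.reader(src):
        # The substitution now runs for each row, not only the last one
        writer.writerow([substitute(cell) for cell in row])

# StringIO objects stand in for open('in.csv') and open('out.csv', 'w', newline='')
src = io.StringIO("a,b,c\nd1,g4,4m\nt,35,6y\nmm,5,m\n")
dst = io.StringIO()
convert(src, dst)
print(dst.getvalue())
```

Working on individual cells rather than on str(row) keeps the list brackets and quotes out of the output file.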

Python using re module to parse an imported text file

def regexread():
    import re
    result = ''
    savefileagain = open('sliceeverfile3.txt','w')
    #text=open('emeverslicefile4.txt','r')
    text='09,11,14,34,44,10,11, 27886637, 0\n561, Tue, 5,Feb,2013, 06,25,31,40,45,06,07, 19070109, 0\n560, Fri, 1,Feb,2013, 05,21,34,37,38,01,06, 13063500, 0\n559, Tue,29,Jan,2013,'
    pattern='\d\d,\d\d,\d\d,\d\d,\d\d,\d\d,\d\d'
    #with open('emeverslicefile4.txt') as text:
    f = re.findall(pattern,text)
    for item in f:
        print(item)
    savefileagain.write(item)
    #savefileagain.close()
The above function as written parses the text and returns sets of seven numbers. I have three problems.
Firstly, the 'read' file, which contains exactly the same text as text='09,...etc', raises TypeError: expected string or buffer, which I cannot solve even after reading some of the posts.
Secondly, when I try to write the results to the 'write' file, nothing is written.
Thirdly, I am not sure how to get the same output in the file that I get with the print statement, which is three lines of seven numbers each; that is the output I want.
This should do the trick:
import re
# File names taken from the question: read from one file, write to the other
in_filename = 'emeverslicefile4.txt'
out_filename = 'sliceeverfile3.txt'
pattern = r'\d\d,\d\d,\d\d,\d\d,\d\d,\d\d,\d\d'
new_file = []
# Make sure file gets closed after being iterated
with open(in_filename, 'r') as f:
    # Read the file contents and generate a list with each line
    lines = f.readlines()
# Iterate each line
for line in lines:
    # Regex applied to each line
    match = re.search(pattern, line)
    if match:
        # Make sure to add \n to display correctly when we write it back
        new_line = match.group() + '\n'
        print(new_line)
        new_file.append(new_line)
# Opening with 'w' starts from an empty file, so no seek is needed
with open(out_filename, 'w') as f:
    # actually write the lines
    f.writelines(new_file)
You're sort of on the right track...
You'll iterate over the file:
How to iterate over the file in python
and apply the regex to each line. The link above should really answer all 3 of your questions when you realize you're trying to write 'item', which doesn't exist outside of that loop.

Deleting Relative Lines with Regex

Using pdftotext, a text file was created that includes footers from the source pdf. The footers get in the way of other parsing that needs to be done. The format of the footer is as follows:
This is important text.
9
Title 2012 and 2013
\fCompany
Important text begins again.
The line for Company is the only one that does not recur elsewhere in the file. It appears as \x0cCompany\n. I would like to search for this line and remove it and the preceding three lines (the page number, title, and a blank line) based on where the \x0cCompany\n appears. This is what I have so far:
report = open('file.txt').readlines()
data = range(len(report))
name = []
for line_i in data:
    line = report[line_i]
    if re.match('.*\\x0cCompany', line ):
        name.append(report[line_i])
print name
This allows me to make a list storing which line numbers have this occurrence, but I do not understand how to delete these lines as well as the three preceding lines. It seems I need to create some other loop based on this loop but I cannot make it work.
Instead of iterating through and collecting the indices of the lines you want to delete, iterate through your lines and append only the lines that you want to keep.
It would also be more efficient to iterate your actual file object, rather than putting it all into one list:
import re
keeplines = []
with open('file.txt') as b:
    for line in b:
        if re.match('.*\\x0cCompany', line):
            keeplines = keeplines[:-3] #shave off the preceding lines
        else:
            keeplines.append(line)
# Reopen the same file for writing once all lines have been read
with open('file.txt', 'w') as file:
    for line in keeplines:
        file.write(line)
