Creating column data from multiple sources with varying formats in python

Creating column data from multiple sources with varying formats in python - python

So as part of my code, I'm reading file paths that have varying names, but tend to stick to the following format
p(number)_(temperature)C
What I've done with those paths is separate it into 2 columns (along with 2 more columns with actual data) so I end up with a row that looks like this:
p2 18 some number some number
However, I've found a few folders that use the following format:
p(number number)_(temperature)C
As it stands, for the first case, I use the following code to separate the file path into the proper columns:
def finale():
for root, dirs, files in os.walk('/Users/Bashe/Desktop/12/'):
file_name = os.path.join(root,"Graph_Info.txt")
file_name_out = os.path.join(root,"Graph.txt")
file = os.path.join(root, "StDev.txt")
if os.path.exists(os.path.join(root,"Graph_Info.txt")):
with open(file_name) as fh, open(file) as th, open(file_name_out,"w") as fh_out:
first_line = fh.readline()
values = eval(first_line)
for value, line in zip(values, fh):
first_column = value[0:2]
second_column = value[3:5]
third_column = line.strip()
fourth_column = th.readline().strip()
fh_out.write("%s\t%s\t%s\t%s\n" % (first_column, second_column, third_column, fourth_column))
else:
pass
I've played around with things and found that if I make the following changes, the program works properly.
first_column = value[0:3]
second_column = value[4:6]
Is there a way I can get the program to look and see what the file path is and act accordingly?

welcome to the fabulous world of regex.
import re
#..........
#case 0
if re.match(r"p\(\d+\).*", path) :
#stuff
#case 1
elif re.match(r"p\(\d+\s\d+\).*", path):
#other stuff

>>> for line in s.splitlines():
... first,second = re.search("p([0-9 ]+)_(\d+)C",line).groups()
... print first, " +",second
...
22 + 66
33 44 + 44
23 33 + 22

Related

Compare content of two files

I am comaring 2 files with same names in 2 directories as follows: This is my pseudocode and I have also included the code I wrote for this program.
dir1 = "/home/1"
dir2 = "/home/2"
loop through all the files in dir1 and dir2
if the name of file in dir1 is same as the name then:
for those particular files:
read all the lines in file1 in dir1 and file2 in dir2
#the file1.txt dir1 has below data( just an example my files data are different):
20 30 40
2 7 8
#file1.txt in dir2 has below data:
31 41 51
11 14 14
#I want to now compare these files in the following way:
#compare each line in file1.txt in dir with each line of
file1.txt in dir2
(i.e first line of file1.txt in dir1 with 1st, 2nd, 3rd..last
line of file1.txt in dir2....
second line with 1st,2nd,3rd...last line and so on)
If the difference of all the corresponding elements(i.e 20-
31<=10 and 30-41<=10 and 40-51<=10 then
print the 1st line of file1.txt of dir1)
do the same thing for each line of each files print out the
result.
This is my code:
dir1 = "/home/one"
dir2 = "/home/two"
for file in os.listdir(dir1):
file2 = os.path.join(dir2, file)
if os.path.exists(file2):
file1 = os.path.join(dir1, file)
print(file)
with open(file1, "r") as f1, open(file2, "r") as f2:
# how to do the comparison?
# how to compare first four element of line 1 of f1 with all the first four
#element of each line of f2 ?**
same = True
while True:
line1 = f1.readline().split()[:4]
print(line1)
line2 = f2.readline().split()[:4]
print(line2)
# one way to compare but is not very logical and thorough
if all(float(el1)-float(el2) < 11 for el1, el2 in zip(line2, line1)):
print(el1,el2)
same = True
if len(line1) == 0:
break
if same:
print(line1)
#print(el1, el2)
#print(line2)
else:
print("files are different")
I think I am missing something, because it doesn't print all the similar lines as it should.
If input file are as follows:
file1.txt in dir1:
10 20 30
100 200 300
1000 2000 3000
file1.txt in dir2:
15 30 40
120 215 315
27 25 35
Expected output:
10 20 30
(as 1st line of file1.txt in dir1 satisfies the condition with only the first line of file1.txt in dir2 where 10-15<=10 and 20-30<=10 and 30-40=10<=10 ( consider all modulus)

This problem is fairly tricky as there is a number of requirements. If I understand correctly, they are:
get a list of files in first directory
for each file find file with same name in second directory
get corresponding line in paired files
compare corresponding value in each line
if all values in a line match criteria print line in first file
It can be easier to create such a script if you break the problem down into smaller problems and test each piece. Python allows you to do this with functions for each piece.
As an example you could break your problem down as follows:
from pathlib import Path
def get_matching_file(known_file, search_dir):
f2_file = search_dir.joinpath(known_file.name)
if f2_file.exists():
return f2_file
raise FileNotFoundError
def get_lines(file_loc):
result = []
lines = file_loc.read_text().splitlines()
for row in lines:
row_as = []
for num in row.split():
row_as.append(int(num))
result.append(row_as)
return result
def compare_content(f1_rows, f2_rows):
result = []
for row1 in f1_rows:
for row2 in f2_rows:
row_result = []
for v1, v2 in zip(row1, row2):
# print(f'{v2} - {v1} = {v2 - v1}')
row_result.append(abs(v2 - v1) <= 10)
if all(row_result):
result.append(row1)
return result
def print_result(result):
for line in result:
print(' '.join(str(int(num)) for num in line))
def main(dir1, dir2):
for file_one in dir1.glob('*.txt'):
file_two = get_matching_file(file_one, dir2)
f1_content = get_lines(file_one)
f2_content = get_lines(file_two)
result = compare_content(f1_content, f2_content)
print_result(result)
if __name__ == '__main__':
main(dir1=Path(r'/Temp/text_compare/one'),
dir2=Path(r'/Temp/text_compare/two'))
The above gave me the output you were looking for of:
10 20 30
Structuring the code this way then allows you to test small parts of your script at a time. For example if I wanted to just test the compare_content function I could change the code at the bottom to read:
if __name__ == '__main__':
test_result = compare_content(f1_rows=[[10, 20, 30]],
f2_rows=[[15, 30, 40]])
print('Test 1:', test_result)
test_result = compare_content(f1_rows=[[100, 200, 300]],
f2_rows=[[120, 215, 315]])
print('Test 2:', test_result)
which would give the output:
Test 1: [[10, 20, 30]]
Test 2: []
Or test reading values from file:
if __name__ == '__main__':
test_result = get_lines(Path(r'/Temp/text_compare/one/log.txt'))
print(test_result)
Gave the output of:
[[10, 20, 30], [100, 200, 300], [1000, 2000, 3000]]
This also allows you to optimize just parts of your script. For example, you might want to use Python's csv module to simplify the reading of the files. This would require the change of just one function. For example:
def get_lines(file_loc):
with file_loc.open() as ssv:
space_reader = csv.reader(ssv,
delimiter=' ',
quoting=csv.QUOTE_NONNUMERIC)
return list(space_reader)

python regex: Parsing file name

I have a text file (filenames.txt) that contains the file name with its file extension.
filename.txt
[AW] One Piece - 629 [1080P][Dub].mkv
EP.585.1080p.mp4
EP609.m4v
EP 610.m4v
One Piece 0696 A Tearful Reunion! Rebecca and Kyros!.mp4
One_Piece_0745_Sons'_Cups!.mp4
One Piece - 591 (1080P Funi Web-Dl -Ks-)-1.m4v
One Piece - 621 1080P.mkv
One_Piece_S10E577_Zs_Ambition_A_Great_and_Desperate_Escape_Plan.mp4
these are the example filename and its extension. I need to rename filename with the episode number (without changing its extension).
Example:
Input:
``````
EP609.m4v
EP 610.m4v
EP.585.1080p.mp4
One Piece - 621 1080P.mkv
[AW] One Piece - 629 [1080P][Dub].mkv
One_Piece_0745_Sons'_Cups!.mp4
One Piece 0696 A Tearful Reunion! Rebecca and Kyros!.mp4
One Piece - 591 (1080P Funi Web-Dl -Ks-)-1.m4v
One_Piece_S10E577_Zs_Ambition_A_Great_and_Desperate_Escape_Plan.mp4
Expected Output:
````````````````
609.m4v
610.m4v
585.mp4
621.mkv
629.mkv
745.mp4 (or) 0745.mp4
696.mp4 (or) 0696.mp4
591.m4v
577.mp4
Hope someone will help me parse and rename these filenames. Thanks in advance!!!

As you tagged python, I guess you are willing to use python.
(Edit: I've realized a loop in my original code is unnecessary.)
import re
with open('filename.txt', 'r') as f:
files = f.read().splitlines() # read filenames
# assume: an episode comprises of 3 digits possibly preceded by 0
p = re.compile(r'0?(\d{3})')
for file in files:
if m := p.search(file):
print(m.group(1) + '.' + file.split('.')[-1])
else:
print(file)
This will output
609.m4v
610.m4v
585.mp4
621.mkv
629.mkv
745.mp4
696.mp4
591.m4v
577.mp4
Basically, it searches for the first 3-digit number, possibly preceded by 0.
I strongly advise you to check the output; in particular, you would want to run sort OUTPUTFILENAME | uniq -d to see whether there are duplicate target names.
(Original answer:)
p = re.compile(r'\d{3,4}')
for file in files:
for m in p.finditer(file):
ep = m.group(0)
if int(ep) < 1000:
print(ep.lstrip('0') + '.' + file.split('.')[-1])
break # go to next file if ep found (avoid the else clause)
else: # if ep not found, just print the filename as is
print(file)

Program to parse episode number and renaming it.
Modules used:
re - To parse File Name
os - To rename File Name
full/path/to/folder - is the path to the folder where your file lives
import re
import os
for file in os.listdir(path="full/path/to/folder/"):
# searches for the first 3 or 4 digit number less than 1000 for each line.
for match_obj in re.finditer(r'\d{3,4}', file):
episode = match_obj.group(0)
if int(episode) < 1000:
new_filename = episode.lstrip('0') + '.' + file.split('.')[-1]
old_name = "full/path/to/folder/" + file
new_name = "full/path/to/folder/" + new_filename
os.rename(old_name, new_name)
# go to next file if ep found (avoid the else clause)
break
else:
# if episode not found, just leave the filename as it is
pass

Data Deletion in File Using Python

As new programmer in Python Programming Language, I thought to create a student Database Management System in Python. But while deleting the Data from the file I got stuck and I thought to apply these steps to the file to delete the characters but how shall I Implement it? I have developed my code but it's not working.
The algorithm:
STEP 1: Create an additional file and open the current file in reading mode and open the new file in writing mode
STEP 2: Read and copy the Data to the newly created file except for the line we want to delete
STEP 3: Close both the file and remove the old file and rename the newly created file with the deleted filename
But while implementing it I got stuck on how to implement as it is not remaining the same.
Here is the code which I wrote:
def delete():
rollno = int(input('\n Enter The Roll number : '))
f = open('BCAstudents3.txt','r')
f1 = open('temp.txt','a+')
for line in f:
fo = line.split()
if fo:
if fo[3] != rollno:
f1.write(str(str(fo).replace('[','').replace(']','').replace("'","").replace(",","")))
f.close()
f1.close()
os.remove('BCAstudents3.txt')
os.rename('temp.txt','BCAstudents3.txt')
The Data From the Original File Looks Like This :
Roll Number = 1 Name : Alex Section = C Optimisation Technique = 99 Maths III = 99 Operating System = 99 Software Engneering = 99 Computer Graphics = 99 {Here Line change is present but it is not showing while typing on to stackoverflow } Roll Number = 2 Name : Shay Section = C Optimisation Technique = 99 Maths III = 99 Operating System = 99 Software Engneering = 99 Computer Graphics = 99`
and the Resullt after The Deletion is this :
Roll Number = 1 Name : Alex Section = C Optimisation Technique = 99 Maths III = 99 Operating System = 99 Software Engneering = 99 Computer Graphics = 99Roll Number = 2 Name : Shay Section = C Optimisation Technique = 99 Maths III = 99 Operating System = 99 Software Engneering = 99 Computer Graphics = 99
and I also want to give comma after the end of the data But don't have any idea that how to do this one

I modified your code and it should work how you wanted. A couple of things to consider:
Your original text file seems to indicate that there are line breaks for each Roll Number. I assumed that with my answer.
Because you are reading a text file, there are no integers so fo[3] would not ever match rollno if you are converting the input to an int.
I wasn't sure exactly where you wanted the comma. After each line? Or just at the very end.
I wasn't sure if you wanted new lines for each Roll Number.
def delete():
rollno = input('\n Enter The Roll number : ')
f = open('BCAstudents3.txt','r')
f1 = open('temp.txt','a+')
for line in f:
fo = line.split()
if fo:
if fo[3] != rollno:
newline = " ".join(fo) + ","
#print(newline)
f1.write(newline)
f.close()
f1.close()
os.remove('BCAstudents3.txt')
os.rename('temp.txt','BCAstudents3.txt')

I made your programm a little simpler.
Hopefully you can use it:
def delete():
line = input("Line you want to delete: ")
line = int(line)
line -= 1
file = open("file.txt","r")
data = file.readlines()
del data[line]
file = open("file.txt","w")
for line in data:
file.write(line)
file.close()

Python sum certain values from multiple text files

I have multiple text files that contain multiple lines of floats and each line has two floats separated by white space, like this: 1.123 456.789123. My task is to sum floats after white space from each text file. This has to be done for all lines. For example, if I have 3 text files:
1.213 1.1
23.33 1
0.123 2.2
23139 0
30.3123 3.3
44.4444 444
Now the sum of numbers on the first lines should be 1.1 + 2.2 + 3.3 = 6.6. And the sum of numbers on second lines should be 1 + 0 + 444 = 445. I tried something like this:
def foo(folder_path):
contents = os.listdir(folder_path)
for file in contents:
path = os.path.join(folder_path, file)
with open(path, "r") as data:
rows = data.readlines()
for row in rows:
value = row.split()
second_float = float(value[1])
return sum(second_float)
When I run my code I get this error: TypeError: 'float' object is not iterable. I've been pulling my hair out with this, and don't know what to do can anyone help?

Here is how I would do it:
def open_file(file_name):
with open(file_name) as f:
for line in f:
yield line.strip().split() # Remove the newlines and split on spaces
files = ('text1.txt', 'text2.txt', 'text3.txt')
result = list(zip(*(open_file(f) for f in files)))
print(*result, sep='\n')
# result is now equal to:
# [
# (['1.213', '1.1'], ['0.123', '2.2'], ['30.3123', '3.3']),
# (['23.33', '1'], ['23139', '0'], ['44.4444', '444'])
# ]
for lst in result:
print(sum(float(x[1]) for x in lst)) # 6.6 and 445.0
It may be more logical to type cast the values to float inside open_file such as:
yield [float(x) for x in line.strip().split()]
but I that is up to you on how you want to change it.
See it in action.
-- Edit --
Note that the above solution loads all the files into memory before doing the math (I do this so I can print the result), but because of how the open_file generator works you don't need to do that, here is a more memory friendly version:
# More memory friendly solution:
# Note that the `result` iterator will be consumed by the `for` loop.
files = ('text1.txt', 'text2.txt', 'text3.txt')
result = zip(*(open_file(f) for f in files))
for lst in result:
print(sum(float(x[1]) for x in lst))

How do I print a range of lines after a specific pattern into separate files when this pattern appears several times in an input file

Sorry for my previous post, I had no idea what I was doing. I am trying to cut out certain ranges of lines in a given input file and print that range to a separate file. This input file looks like:
18
generated by VMD
C 1.514895 -3.887949 2.104134
C 2.371076 -2.780954 1.718424
C 3.561071 -3.004933 1.087316
C 4.080424 -4.331872 1.114878
C 3.289761 -5.434047 1.607808
C 2.018473 -5.142150 2.078551
C 3.997237 -6.725186 1.709355
C 5.235126 -6.905640 1.295296
C 5.923666 -5.844841 0.553037
O 6.955216 -5.826197 -0.042920
O 5.269004 -4.590026 0.590033
H 4.054002 -2.184680 0.654838
H 1.389704 -5.910354 2.488783
H 5.814723 -7.796634 1.451618
O 1.825325 -1.537706 1.986256
H 2.319215 -0.796042 1.550394
H 3.390707 -7.564847 2.136680
H 0.535358 -3.663175 2.483943
18
generated by VMD
C 1.519866 -3.892621 2.109595
I would like to print every 100th frame starting from the first frame into its own file named "snapshot0.xyz" (The first frame is frame 0).
For example, the above input shows two snapshots. I would like to print out lines 1:20 into its own file named snapshot0.xyz and then skip 100 (2000 lines) snapshots and print out snapshot1.xyz (with the 100th snapshot). My attempt was in python, but you can choose either grep, awk, sed, or Python.
My input file: frames.dat
1 #!/usr/bin/Python
2
3
4
5 mest = open('frames.dat', 'r')
6 test = mest.read().strip().split('\n')
7
8 for i in range(len(test)):
9 if test[i] == '18':
10 f = open("out"+`i`+".dat", "w")
11 for j in range(19):
12 print >> f, test[j]
13 f.close()

I suggest using the csv module for this input.
import csv
def strip_empty_columns(line):
return filter(lambda s: s.strip() != "", line)
def is_count(line):
return len(line) == 1 and line[0].strip().isdigit()
def is_float(s):
try:
float(s.strip())
return True
except ValueError:
return False
def is_data_line(line):
return len(line) == 4 and is_float(line[1]) and is_float(line[2]) and is_float(line[3])
with open('frames.dat', 'r') as mest:
r = csv.reader(mest, delimiter=' ')
current_count = 0
frame_nr = 0
outfile = None
for line in r:
line = strip_empty_columns(line)
if is_count(line):
if frame_nr % 100 == 0:
outfile = open("snapshot%d.xyz" % frame_nr, "w+")
elif outfile:
outfile.close()
outfile = None
frame_nr += 1 # increment the frame counter every time you see this header line like '18'
elif is_data_line(line):
if outfile:
outfile.write(" ".join(line) + "\n")
The opening post mentions to write every 100th frame to an output file named snapshot0.xyz. I assume the 0 should be a counter, ot you would continously overwrite the file. I updated the code with a frame_nr counter and a few lines which open/close an output file depending on the frame_nr and write data if an output file is open.

This might work for you (GNU sed and csplit):
sed -rn '/^18/{x;/x{100}/z;s/^/x/;x};G;/\nx$/P' file | csplit -f snapshot -b '%d.xyz' -z - '/^18/' '{*}'
Filter every 100th frame using sed and pass that file to csplit to create the individual files.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Creating column data from multiple sources with varying formats in python - python

welcome to the fabulous world of regex. import re #.......... #case 0 if re.match(r"p\(\d+\).", path) : #stuff #case 1 elif re.match(r"p\(\d+\s\d+\).", path): #other stuff

>>> for line in s.splitlines(): ... first,second = re.search("p([0-9 ]+)_(\d+)C",line).groups() ... print first, " +",second ... 22 + 66 33 44 + 44 23 33 + 22

Related

Compare content of two files

python regex: Parsing file name

Data Deletion in File Using Python

Python sum certain values from multiple text files

How do I print a range of lines after a specific pattern into separate files when this pattern appears several times in an input file

Categories

Resources

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Creating column data from multiple sources with varying formats in python - python

welcome to the fabulous world of regex. import re #.......... #case 0 if re.match(r"p\(\d+\).*", path) : #stuff #case 1 elif re.match(r"p\(\d+\s\d+\).*", path): #other stuff

>>> for line in s.splitlines(): ... first,second = re.search("p([0-9 ]+)_(\d+)C",line).groups() ... print first, " +",second ... 22 + 66 33 44 + 44 23 33 + 22

Related

Compare content of two files

python regex: Parsing file name

Data Deletion in File Using Python

Python sum certain values from multiple text files

How do I print a range of lines after a specific pattern into separate files when this pattern appears several times in an input file

Categories

Resources

welcome to the fabulous world of regex. import re #.......... #case 0 if re.match(r"p\(\d+\).", path) : #stuff #case 1 elif re.match(r"p\(\d+\s\d+\).", path): #other stuff