Python extract infos from file


I have a text file listing the sizes of all files with the extension *.AAA on different servers. I would like to extract the filename and size of every file bigger than 20 GB, together with the server it belongs to. I know how to extract a line from a file and display it; below is my example and what I would like to achieve.
The example of the file itself:
Pad 1001
Volume in drive \\192.168.0.101\c$ has no label.
Volume Serial Number is XXXX-XXXX
Directory of \\192.168.0.101\c$\TESTUSER\
02/11/2016 02:07 AM 894,889,984 File1.AAA
05/25/2015 07:18 AM 25,673,969,664 File2.AAA
02/11/2016 02:07 AM 17,879,040 File3.AAA
05/25/2015 07:18 AM 12,386,304 File4.AAA
10/13/2008 10:29 AM 1,186,988,032 File3.AAA_oct13
02/15/2016 11:15 AM 2,799,263,744 File5.AAA
6 File(s) 30,585,376,768 bytes
0 Dir(s) 28,585,127,936 bytes free
Pad 1002
Volume in drive \\192.168.0.101\c$ has no label.
Volume Serial Number is XXXX-XXXX
Directory of \\192.168.0.101\c$\TESTUSER\
02/11/2016 02:08 AM 1,379,815,424 File1.AAA
02/11/2016 02:08 AM 18,542,592 File3.AAA
02/15/2016 12:41 AM 853,659,648 File5.AAA
3 File(s) 2,252,017,664 bytes
0 Dir(s) 49,306,902,528 bytes free
Here is the output I would like, showing the Pad number and each file that is bigger than 20 GB:
Pad 1001 05/25/2015 07:18 AM 25,673,969,664 File2.AAA
I will eventually put this into an Excel spreadsheet, but I know how to do that part.
Any ideas?
Thank you

The following should get you started:
import re

output = []

with open('input.txt') as f_input:
    text = f_input.read()

# Split the text into (pad name, block of following lines) pairs
for pad, block in re.findall(r'(Pad \d+)(.*?)(?=Pad|\Z)', text, re.M | re.S):
    # Capture each .AAA line together with its size field
    file_list = re.findall(r'^(.*? +([0-9,]+) +.*?\.AAA\w*?)$', block, re.M)
    for line, length in file_list:
        length = int(length.replace(',', ''))
        if length > 2e10:  # Or your choice of what 20GB is
            output.append((pad, line))

print(output)
This would display a list with one tuple entry as follows:
[('Pad 1001', '05/25/2015 07:18 AM 25,673,969,664 File2.AAA')]

[EDIT] Here is my approach:
import re

result = []
with open('txtfile.txt', 'r') as f:
    content = [line.strip() for line in f.readlines()]

for line in content:
    m = re.findall(r'\d{2}/\d{2}/\d{4}\s+\d{2}:\d{2}\s+(A|P)M\s+([0-9,]+)\s+((?!\.AAA).)*\.AAA((?!\.AAA).)*', line)
    # keep every Pad header, plus any file line larger than 20 GiB
    if line.startswith('Pad') or (m and int(m[0][1].replace(',', '')) > 20 * 1024 ** 3):
        result.append(line)

# drop a trailing Pad header that has no matching file after it
print(re.sub(r'Pad\s+\d+$', '', ' '.join(result)))
Output is:
Pad 1001 05/25/2015 07:18 AM 25,673,969,664 File2.AAA
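A line-by-line variant of the two answers above may be easier to extend. This is only a minimal sketch, not either author's code: the function name `find_large_files`, its `threshold` and `ext` parameters, and the assumption that every file entry has the fields date, time, AM/PM, size, name are all illustrative.

```python
def find_large_files(lines, threshold=20 * 1024 ** 3, ext=".AAA"):
    """Return (pad, line) pairs for files larger than `threshold` bytes.

    Assumes the `dir`-style layout from the question:
    date, time, AM/PM, size, name.
    """
    results = []
    pad = None
    for raw in lines:
        line = raw.strip()
        if line.startswith("Pad "):
            pad = line  # remember which pad the following lines belong to
            continue
        parts = line.split(None, 4)  # date, time, AM/PM, size, name
        if len(parts) < 5 or ext not in parts[4]:
            continue  # header, summary, or unrelated line
        try:
            size = int(parts[3].replace(",", ""))
        except ValueError:
            continue  # fourth field is not a size, so not a file entry
        if size > threshold:
            results.append((pad, line))
    return results


sample = """Pad 1001
02/11/2016 02:07 AM 894,889,984 File1.AAA
05/25/2015 07:18 AM 25,673,969,664 File2.AAA
6 File(s) 30,585,376,768 bytes
Pad 1002
02/11/2016 02:08 AM 1,379,815,424 File1.AAA
""".splitlines()

for pad, line in find_large_files(sample):
    print(pad, line)
```

The default threshold here is 20 GiB (20 * 1024 ** 3 bytes); pass `threshold=20 * 10 ** 9` if 20 GB should mean 20 billion bytes.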

Related

Reading from a txt file in Python

I have this data (remark: don't treat this data as a JSON file; treat it as a normal txt file):
{"tstp":1383173780727,"ststates":[{"nb":901,"state":"open","freebk":6,"freebs":14},{"nb":903,"state":"open","freebk":2,"freebs":18}]}{"tstp":1383173852184,"ststates":[{"nb":901,"state":"open","freebk":6,"freebs":14}]}
I want to take all the values inside the first tstp only, stopping when the next tstp is reached.
What I am trying to do is create a file for each tstp, with nb, state, freebk, and freebs as the columns in that file.
expected output:
first tstp file:
nb state freebk freebs
901 open 6 14
903 open 2 18
second tstp file:
nb state freebk freebs
901 open 6 14
The first output above is for the first tstp. I want to create a separate file for each tstp in my data, so for the data provided, 2 files will be created (because there are only 2 tstp entries).

The approach below handles "tstp" entries in any layout, including ones with whitespace between the opening brace and the key.
I used a regex to find the start of each JSON object so that the data can be turned into one valid list. (It also works if the data is not neatly organized in your file.)
import re
import ast

# Reading content from the text file
with open("text.txt", "r") as file:
    data = file.read()

# Transforming the data into a list of JSON-like dicts for easier value collection
regex = r'{[\s]*"tstp"'
replaced_content = ',{"tstp"'
# replacing the start of every {json} dictionary with ,{json}
data = re.sub(regex, replaced_content, data)
data = "[" + data.strip()[1:] + "]"  # removing the first unnecessary comma (,)
data = ast.literal_eval(data)  # converting the string to a list of dicts

# Preparing the data for each file
headings_data = "nb state freebk freebs"
for count, json in enumerate(data, start=1):
    # Replace this line with row = "" if you don't want the tstp value in the file.
    row = "File - {0}\n\n".format(json["tstp"])
    row += headings_data
    for item in json["ststates"]:
        row += "\n{0} {1} {2} {3}".format(
            item["nb"], item["state"], item["freebk"], item["freebs"])
    # Writing a different file for each tstp
    filename = "file-{0}.txt".format(count)
    with open(filename, "w") as file:
        file.write(row)
Output:
File 1
File - 1383173780727
nb state freebk freebs
901 open 6 14
903 open 2 18
File 2
File - 1383173852184
nb state freebk freebs
901 open 6 14
And so on, for the total number of "tstp" entries.
Note: We cannot rely on replacing "}{" in every situation; in your data the braces may be placed on different lines.
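If the entries happen to be valid JSON (the question asks not to rely on that, so treat this as an assumption), the standard library's `json.JSONDecoder.raw_decode` can read objects that sit back to back without depending on "}{" or on a regex. The helper name `iter_objects` is hypothetical; this is a sketch, not part of the answer above.

```python
import json


def iter_objects(text):
    """Yield each top-level JSON value found back to back in `text`."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # Skip whitespace (including newlines) between objects
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        # raw_decode returns the parsed value and the index where it ended
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


data = (
    '{"tstp":1383173780727,"ststates":[{"nb":901,"state":"open","freebk":6,"freebs":14},'
    '{"nb":903,"state":"open","freebk":2,"freebs":18}]}\n'
    '  {"tstp":1383173852184,"ststates":[{"nb":901,"state":"open","freebk":6,"freebs":14}]}'
)

for entry in iter_objects(data):
    print(entry["tstp"], len(entry["ststates"]))
```

Because the decoder tracks where each object ends, the braces may sit on the same line or on different lines.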

Well, it looks like }{ is a nice separator for the entries, so let's (ab)use that fact. Better formatting of the output is left as an exercise to the reader.
import ast

# (0) could be read with f.read()
data = """{"tstp":1383173780727,"ststates":[{"nb":901,"state":"open","freebk":6,"freebs":14},{"nb":903,"state":"open","freebk":2,"freebs":18}]}{"tstp":1383173852184,"ststates":[{"nb":901,"state":"open","freebk":6,"freebs":14}]}"""

# (1) split data by `}{`
entries = data.replace("}{", "}\n{").splitlines()

# (2) read each entry (since we were told it's not JSON,
#     don't use JSON but ast.literal_eval, but the effect is the same)
entries = [ast.literal_eval(ent) for ent in entries]

# (3) print out some ststates!
for ent in entries:
    print("nb\tstate\tfreebk\tfreebs")
    for ststate in ent.get("ststates", []):
        print("{nb}\t{state}\t{freebk}\t{freebs}".format_map(ststate))
    print("---")
The output is
nb state freebk freebs
901 open 6 14
903 open 2 18
---
nb state freebk freebs
901 open 6 14
---

python regex: Parsing file name

I have a text file (filenames.txt) that contains the file name with its file extension.
filename.txt
[AW] One Piece - 629 [1080P][Dub].mkv
EP.585.1080p.mp4
EP609.m4v
EP 610.m4v
One Piece 0696 A Tearful Reunion! Rebecca and Kyros!.mp4
One_Piece_0745_Sons'_Cups!.mp4
One Piece - 591 (1080P Funi Web-Dl -Ks-)-1.m4v
One Piece - 621 1080P.mkv
One_Piece_S10E577_Zs_Ambition_A_Great_and_Desperate_Escape_Plan.mp4
These are example filenames with their extensions. I need to rename each file to its episode number (without changing its extension).
Example:
Input:
EP609.m4v
EP 610.m4v
EP.585.1080p.mp4
One Piece - 621 1080P.mkv
[AW] One Piece - 629 [1080P][Dub].mkv
One_Piece_0745_Sons'_Cups!.mp4
One Piece 0696 A Tearful Reunion! Rebecca and Kyros!.mp4
One Piece - 591 (1080P Funi Web-Dl -Ks-)-1.m4v
One_Piece_S10E577_Zs_Ambition_A_Great_and_Desperate_Escape_Plan.mp4
Expected Output:
609.m4v
610.m4v
585.mp4
621.mkv
629.mkv
745.mp4 (or) 0745.mp4
696.mp4 (or) 0696.mp4
591.m4v
577.mp4
Hope someone will help me parse and rename these filenames. Thanks in advance!!!

As you tagged python, I guess you are willing to use python.
(Edit: I've realized a loop in my original code is unnecessary.)
import re

with open('filename.txt', 'r') as f:
    files = f.read().splitlines()  # read filenames

# assume: an episode comprises of 3 digits possibly preceded by 0
p = re.compile(r'0?(\d{3})')
for file in files:
    if m := p.search(file):
        print(m.group(1) + '.' + file.split('.')[-1])
    else:
        print(file)
This will output
609.m4v
610.m4v
585.mp4
621.mkv
629.mkv
745.mp4
696.mp4
591.m4v
577.mp4
Basically, it searches for the first 3-digit number, possibly preceded by 0.
I strongly advise you to check the output; in particular, you would want to run sort OUTPUTFILENAME | uniq -d to see whether there are duplicate target names.
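The duplicate check suggested above can also be done in Python before anything is renamed. This is a sketch under the same assumption as the answer (an episode is the first 3-digit number, possibly preceded by 0); the helper names `target_name` and `duplicate_targets` and the sample names are made up for illustration.

```python
import re
from collections import Counter

EPISODE = re.compile(r'0?(\d{3})')  # same pattern as in the answer above


def target_name(filename):
    """Return '<episode>.<ext>', or the name unchanged if no episode is found."""
    m = EPISODE.search(filename)
    if not m:
        return filename
    return m.group(1) + '.' + filename.split('.')[-1]


def duplicate_targets(filenames):
    """Return the target names that more than one input would map to."""
    counts = Counter(target_name(name) for name in filenames)
    return sorted(name for name, n in counts.items() if n > 1)


names = ["EP609.m4v", "EP 610.m4v", "One Piece - 609 1080P.m4v"]
print(duplicate_targets(names))
```

An empty list means every file would get a distinct new name.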
(Original answer:)
p = re.compile(r'\d{3,4}')
for file in files:
    for m in p.finditer(file):
        ep = m.group(0)
        if int(ep) < 1000:
            print(ep.lstrip('0') + '.' + file.split('.')[-1])
            break  # go to next file if ep found (avoid the else clause)
    else:  # if ep not found, just print the filename as is
        print(file)

A program to parse the episode number and rename each file.
Modules used:
re - to parse the file name
os - to rename the file
full/path/to/folder is the path to the folder where your files live.
import re
import os

for file in os.listdir(path="full/path/to/folder/"):
    # searches for the first 3 or 4 digit number less than 1000 for each line.
    for match_obj in re.finditer(r'\d{3,4}', file):
        episode = match_obj.group(0)
        if int(episode) < 1000:
            new_filename = episode.lstrip('0') + '.' + file.split('.')[-1]
            old_name = "full/path/to/folder/" + file
            new_name = "full/path/to/folder/" + new_filename
            os.rename(old_name, new_name)
            # go to next file if ep found (avoid the else clause)
            break
    else:
        # if episode not found, just leave the filename as it is
        pass
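Since renaming is hard to undo, it may help to build the list of renames first and apply it only after a look at the result. The sketch below uses `pathlib` instead of string concatenation and searches only the name without its extension; `plan_renames` and `apply_renames` are hypothetical helper names, and skipping any target name that already exists or was already planned is my assumption about the desired behaviour.

```python
import re
import tempfile
from pathlib import Path


def plan_renames(folder):
    """Return (old_path, new_path) pairs without touching any file."""
    plan = []
    taken = set()
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue
        for match in re.finditer(r'\d{3,4}', path.stem):
            episode = match.group(0)
            if int(episode) < 1000:
                new_path = path.with_name(episode.lstrip('0') + path.suffix)
                # Skip names that already exist or were already planned
                if new_path not in taken and not new_path.exists():
                    taken.add(new_path)
                    plan.append((path, new_path))
                break
    return plan


def apply_renames(plan):
    """Perform the renames collected by plan_renames."""
    for old_path, new_path in plan:
        old_path.rename(new_path)


# Demo in a throwaway directory so no real files are touched
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ["EP609.m4v", "One Piece - 621 1080P.mkv"]:
        (Path(demo_dir) / name).touch()
    plan = plan_renames(demo_dir)
    planned = [(old.name, new.name) for old, new in plan]
    apply_renames(plan)
    renamed = sorted(p.name for p in Path(demo_dir).iterdir())

print(planned)
print(renamed)
```

Printing `planned` before calling `apply_renames` gives a dry run of what would change.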

Python : to add serial number in a text file and header

I am a Python noob. Imagine I have a .txt file which contains
123456789 1234 apple\wasdsa\sgfgf\sgf\rgfd.csv
124555669 6547 mango\sdf\hjt\sthsdth\eth.txt
564984565 58475 ksfjk\hjkf\tkohj\fdgs.opp
and the list continues. But I need to format it like this, with a header and with serial numbers that keep incrementing with the number of lines:
Sr.no. MD5 Size Path
1 123456789 1234 apple\wasdsa\sgfgf\sgf\rgfd.csv
2 124555669 6547 mango\sdf\hjt\sthsdth\eth.txt
3 564984565 58475 ksfjk\hjkf\tkohj\fdgs.opp
I am not able to overwrite the same .txt file, and I am also not able to generate the serial numbers. Please help me.

You could use
data = """
123456789 1234 test123
124555669 6547 test456
564984565 58475 test789
"""

header = "Sr.no. MD5 Size Path\n"
output = header + "\n".join(
    "{}\t{}".format(line_number, line)
    for line_number, line in enumerate(
        (item for item in data.split("\n") if item), 1))
print(output)
Which would yield
Sr.no. MD5 Size Path
1 123456789 1234 test123
2 124555669 6547 test456
3 564984565 58475 test789
The question is whether these escaped characters (the backslash sequences in the paths) are really in the actual string.
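The question also asks about overwriting the same file, which the snippet above does not cover. One way, sketched here under the assumption that the file fits in memory, is to read all lines first and then write the numbered version back to the same path. The function name `add_serial_numbers`, the tab-separated header, and the sample file name are illustrative choices. Text read from a file keeps its backslashes literally, so the paths need no special handling.

```python
import tempfile
from pathlib import Path


def add_serial_numbers(path, header="Sr.no.\tMD5\tSize\tPath"):
    """Rewrite the file at `path` with a header and a running number per line."""
    path = Path(path)
    # Read everything first, so the file can safely be reopened for writing
    lines = [line for line in path.read_text().splitlines() if line.strip()]
    numbered = ["{}\t{}".format(n, line) for n, line in enumerate(lines, 1)]
    path.write_text("\n".join([header] + numbered) + "\n")


# Demo on a temporary file; raw strings keep the backslashes as typed
with tempfile.TemporaryDirectory() as demo_dir:
    demo = Path(demo_dir) / "hashes.txt"
    demo.write_text(
        r"123456789 1234 apple\wasdsa\rgfd.csv" + "\n"
        + r"124555669 6547 mango\sdf\eth.txt" + "\n"
    )
    add_serial_numbers(demo)
    result = demo.read_text()

print(result)
```

Running the function twice on the same file would number the header and the already-numbered lines again, so it should be applied once per file.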

Python: How to display the top numbers from text files using regex

My assignment is to display the top views from two different text files. Each line is formatted as 'file' followed by the path_folder, views, and open/close. What I'm having trouble with is displaying the top views while keeping the path_folder titles in alphabetical order in case the views are the same.
I've already used glob to read the two different files, and I am using regex to make sure the files are read the way they are supposed to be. I also know I can use sort/sorted to get alphabetical order. My main concern is displaying the top views from the text files.
Here are my files:
file1.txt
file Marvel/GuardiansOfGalaxy 300 1
file DC/Batman 504 1
file GameOfThrones 900 0
file DC/Superman 200 1
file Marvel/CaptainAmerica 342 0
file2.txt
file Science/Biology 200 1
file Math/Calculus 342 0
file Psychology 324 1
file Anthropology 234 0
file Science/Chemistry 444 1
(As you can tell from the format, the third field is the views.)
The output should look like this:
file GameOfThrones 900 0
file DC/Batman 504 1
file Science/Chemistry 444 1
file Marvel/CaptainAmerica 342 0
file Math/Calculus 342 0
...
Aside from that here is the function I am currently working on to display the top views :
records = dict(re.findall(r"files (.+) (\d+).+", files))
main_dict = {}
for file in records:
print(file)
#idk how to display the top views
return main_dict

Extracting the sorting criteria
First, you need to get the information by which you want to sort out of each line.
You can use this regex to extract views and the path from your lines:
>>> import re
>>> criteria_re = re.compile(r'file (?P<path>\S*) (?P<views>\d*) \d*')
>>> m = criteria_re.match('file GameOfThrones 900 0')
>>> res = (int(m.group('views')), m.group('path'))
>>> res
(900, 'GameOfThrones')
Sorting
Now the whole thing just needs to be applied to your file collection. Since we don't want the default sort order, we need to set the key parameter of the sort function to tell it what exactly we want to sort by:
def sort_files(files):
    lines = []
    for file in files:
        with open(file) as f:
            for line in f:
                m = criteria_re.match(line)
                # maybe do some error handling here, in case the regex doesn't match
                # taking the negative view count makes the comparison later a
                # bit simpler, since one ascending sort then gives descending
                # views and alphabetical paths at the same time
                lines.append((line, (-int(m.group('views')), m.group('path'))))
    # the sorting criteria were only tagging along to help with the order, so
    # we can discard them in the result
    return [line for line, criterion in sorted(lines, key=lambda x: x[1])]

You can use the following code:
# open the 2 files in read mode
with open('file1.txt', 'r') as f1, open('file2.txt', 'r') as f2:
    data = f1.read() + f2.read()  # store the content of the two files in a string variable
lines = data.split('\n')  # split each line to generate a list
# do the sorting in reverse mode, based on the 3rd word, in your case number of views
print(sorted(lines[:-1], reverse=True, key=lambda x: int(x.split()[2])))
output:
['file GameOfThrones 900 0', 'file DC/Batman 504 1', 'file Science/Chemistry 444 1', 'file Marvel/CaptainAmerica 342 0', 'file Math/Calculus 342 0', 'file Psychology 324 1', 'file Marvel/GuardiansOfGalaxy 300 1', 'file Anthropology 234 0', 'file DC/Superman 200 1', 'file Science/Biology 200 1']
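Note that the snippet above sorts by views only; lines with equal views keep the order in which they were read, which happens to be alphabetical for this data. To make the alphabetical tie-break explicit, the key can return a tuple. This is a sketch assuming every non-empty line has exactly four whitespace-separated fields; `sort_by_views` is a hypothetical name.

```python
def sort_by_views(lines):
    """Sort 'file <path> <views> <flag>' lines: most views first, ties by path."""
    def key(line):
        _, path, views, _ = line.split()
        # Negating the views makes an ascending sort put the largest first
        return (-int(views), path)

    return sorted((line for line in lines if line.strip()), key=key)


lines = [
    "file Math/Calculus 342 0",
    "file DC/Batman 504 1",
    "file Marvel/CaptainAmerica 342 0",
]

for line in sort_by_views(lines):
    print(line)
```

Here the two entries with 342 views come out as Marvel/CaptainAmerica before Math/Calculus, regardless of their input order.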

Continuing from the comment I made above:
Read both the files and store their lines in a list
Flatten the list
Sort the list by the views in the string
Hence:
list.txt:
file Marvel/GuardiansOfGalaxy 300 1
file DC/Batman 504 1
file GameOfThrones 900 0
file DC/Superman 200 1
file Marvel/CaptainAmerica 342 0
list2.txt:
file Science/Biology 200 1
file Math/Calculus 342 0
file Psychology 324 1
file Anthropology 234 0
file Science/Chemistry 444 1
And:
fileOne = 'list.txt'
fileTwo = 'list2.txt'
result = []
with open(fileOne, 'r') as file1Obj, open(fileTwo, 'r') as file2Obj:
    result.append(file1Obj.readlines())
    result.append(file2Obj.readlines())
result = sum(result, [])  # flattening the nested list
result = [i.split('\n', 1)[0] for i in result]  # removing the \n char
print(sorted(result, reverse=True, key=lambda x: int(x.split()[2])))  # sorting by the view
OUTPUT:
[
'file GameOfThrones 900 0', 'file DC/Batman 504 1', 'file Science/Chemistry 444 1',
'file Marvel/CaptainAmerica 342 0', 'file Math/Calculus 342 0',
'file Psychology 324 1', 'file Marvel/GuardiansOfGalaxy 300 1',
'file Anthropology 234 0', 'file DC/Superman 200 1', 'file Science/Biology 200 1'
]
Shorter-version:
with open(fileOne, 'r') as file1Obj, open(fileTwo, 'r') as file2Obj:
    result = file1Obj.readlines() + file2Obj.readlines()
print(list(i.split('\n', 1)[0] for i in sorted(result, reverse=True, key=lambda x: int(x.split()[2]))))  # sorting by the view

Replacing a string in a file in python

What my text is
$TITLE = XXXX YYYY
1 $SUBTITLE= XXXX YYYY ANSA
2 $LABEL = first label
3 $DISPLACEMENTS
4 $MAGNITUDE-PHASE OUTPUT
5 $SUBCASE ID = 30411
What I want
$TITLE = XXXX YYYY
1 $SUBTITLE= XXXX YYYY ANSA
2 $LABEL = new label
3 $DISPLACEMENTS
4 $MAGNITUDE-PHASE OUTPUT
5 $SUBCASE ID = 30411
The code I am using
import re
fo=open("test5.txt", "r+")
num_lines = sum(1 for line in open('test5.txt'))
count=1
while (count <= num_lines):
    line1=fo.readline()
    j= line1[17 : 72]
    j1=re.findall('\d+', j)
    k=map(int,j1)
    if (k==[30411]):
        count1=count-4
        line2=fo.readlines()[count1]
        r1=line2[10:72]
        r11=str(r1)
        r2="new label"
        r22=str(r2)
        newdata = line2.replace(r11,r22)
        f1 = open("output7.txt",'a')
        lines=f1.writelines(newdata)
    else:
        f1 = open("output7.txt",'a')
        lines=f1.writelines(line1)
    count=count+1
The problem is in writing the lines. Once 30411 is found, the code has to go 3 lines back and change the label to the new one. The new output text should have all lines the same as before except the label line, but it is not being written properly. Can anyone help?

Apart from many blood-curdling but noncritical problems, you are calling readlines() in the middle of an iteration using readline(), causing you to read lines not from the beginning of the file but from the current position of the fo handle, i.e. after the line containing 30411.
You need to open the input file again with a separate handle or (better) store the last 4 lines in memory instead of rereading the one you need to change.
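The in-memory suggestion can be sketched as follows. This is not the asker's fixed-column logic: it assumes the file fits in memory, that the label sits exactly `offset` lines above the matching $SUBCASE ID line (3 in the sample), and that the label text is everything after the first "=". The function name `relabel` and its parameters are hypothetical.

```python
def relabel(lines, subcase_id, new_label, offset=3):
    """Return a copy of `lines` in which the $LABEL line `offset` lines above
    the matching $SUBCASE ID line carries `new_label`."""
    out = list(lines)
    for i, line in enumerate(lines):
        if "$SUBCASE ID" in line and line.split("=")[-1].strip() == str(subcase_id):
            j = i - offset
            if j >= 0 and "$LABEL" in out[j]:
                # Keep everything before the '=' and swap only the label text
                prefix = out[j].split("=", 1)[0]
                out[j] = prefix + "= " + new_label
    return out


text = """$TITLE = XXXX YYYY
1 $SUBTITLE= XXXX YYYY ANSA
2 $LABEL = first label
3 $DISPLACEMENTS
4 $MAGNITUDE-PHASE OUTPUT
5 $SUBCASE ID = 30411"""

print("\n".join(relabel(text.splitlines(), 30411, "new label")))
```

With a real file, the lines would come from reading it once in full, and the returned list would be written to the output file in a single pass.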
