My assignment is to display the top views from two different text files. Each line in the text files is formatted as 'file' followed by the path_folder, the views, and an open/close flag. What I'm having trouble with is displaying the top views, AND the titles of the path_folders have to be in alphabetical order in case the views are the same.
I've already used glob to read the two different files. I'm even using a regex to make sure the files are read the way they're supposed to be. I also know I can use sort/sorted to get alphabetical order. My main concern is mostly displaying the top views from the text files.
Here are my files:
file1.txt
file Marvel/GuardiansOfGalaxy 300 1
file DC/Batman 504 1
file GameOfThrones 900 0
file DC/Superman 200 1
file Marvel/CaptainAmerica 342 0
file2.txt
file Science/Biology 200 1
file Math/Calculus 342 0
file Psychology 324 1
file Anthropology 234 0
file Science/Chemistry 444 1
(As you can tell from the format, the third field is the views.)
The output should look like this:
file GameOfThrones 900 0
file DC/Batman 504 1
file Science/Chemistry 444 1
file Marvel/CaptainAmerica 342 0
file Math/Calculus 342 0
...
Aside from that, here is the function I am currently working on to display the top views:
records = dict(re.findall(r"files (.+) (\d+).+", files))
main_dict = {}
for file in records:
    print(file)
    # idk how to display the top views
return main_dict
Extracting the sorting criteria
First, you need to extract from each line the information by which you want to sort.
You can use this regex to extract views and the path from your lines:
>>> import re
>>> criteria_re = re.compile(r'file (?P<path>\S*) (?P<views>\d*) \d*')
>>> m = criteria_re.match('file GameOfThrones 900 0')
>>> res = (int(m.group('views')), m.group('path'))
>>> res
(900, 'GameOfThrones')
Sorting
Now the whole thing just needs to be applied to your file collection. Since we don't want the default sort order, we need to set the key parameter of the sorting function so it knows what exactly we want to sort by:
def sort_files(files):
    lines = []
    for file in files:
        for line in open(file):
            m = criteria_re.match(line)
            # maybe do some error handling here, in case the regex doesn't match
            lines.append((line, (-int(m.group('views')), m.group('path'))))
            # taking the negative view count makes the comparison later a
            # bit simpler: a single ascending sort then gives descending
            # view counts and ascending (alphabetical) path order
    # the sorting criteria were only tagging along to help with the order, so
    # we can discard them in the result
    return [line for line, criterion in sorted(lines, key=lambda x: x[1])]
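A quick usage sketch (assuming the two file names from the question; sort_files returns all lines in order, so a slice gives the top entries):
top_lines = sort_files(['file1.txt', 'file2.txt'])
for line in top_lines[:5]:  # top 5 by views, ties broken alphabetically
    print(line, end='')     # each line still carries its trailing newline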
You can use the following code:
# open the 2 files in read mode
with open('file1.txt', 'r') as f1, open('file2.txt', 'r') as f2:
    data = f1.read() + f2.read()  # store the content of the two files in one string variable

lines = data.split('\n')  # split on newlines to generate a list (the last element is empty)
# do the sorting in reverse mode, based on the 3rd word, in your case the number of views
print(sorted(lines[:-1], reverse=True, key=lambda x: int(x.split()[2])))
output:
['file GameOfThrones 900 0', 'file DC/Batman 504 1', 'file Science/Chemistry 444 1', 'file Marvel/CaptainAmerica 342 0', 'file Math/Calculus 342 0', 'file Psychology 324 1', 'file Marvel/GuardiansOfGalaxy 300 1', 'file Anthropology 234 0', 'file DC/Superman 200 1', 'file Science/Biology 200 1']
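Note that this sorts by views only, so ties keep the order the lines were read in. If you also need the alphabetical tiebreak the assignment asks for, a tuple key handles both criteria in one pass (a small sketch; negating the view count makes an ascending sort descend by views while staying ascending, i.e. alphabetical, on the path):
# views descending, then path ascending (alphabetical)
print(sorted(lines[:-1], key=lambda x: (-int(x.split()[2]), x.split()[1])))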
Continuing from the comment I made above:
Read both the files and store their lines in a list
Flatten the list
Sort the list by the views in the string
Hence:
list.txt:
file Marvel/GuardiansOfGalaxy 300 1
file DC/Batman 504 1
file GameOfThrones 900 0
file DC/Superman 200 1
file Marvel/CaptainAmerica 342 0
list2.txt:
file Science/Biology 200 1
file Math/Calculus 342 0
file Psychology 324 1
file Anthropology 234 0
file Science/Chemistry 444 1
And:
fileOne = 'list.txt'
fileTwo = 'list2.txt'
result = []
with open(fileOne, 'r') as file1Obj, open(fileTwo, 'r') as file2Obj:
    result.append(file1Obj.readlines())
    result.append(file2Obj.readlines())

result = sum(result, [])  # flattening the nested list
result = [i.split('\n', 1)[0] for i in result]  # removing the \n char
print(sorted(result, reverse=True, key=lambda x: int(x.split()[2])))  # sorting by the views
OUTPUT:
[
'file GameOfThrones 900 0', 'file DC/Batman 504 1', 'file Science/Chemistry 444 1',
'file Marvel/CaptainAmerica 342 0', 'file Math/Calculus 342 0',
'file Psychology 324 1', 'file Marvel/GuardiansOfGalaxy 300 1',
'file Anthropology 234 0', 'file DC/Superman 200 1', 'file Science/Biology 200 1'
]
Shorter version:
with open(fileOne, 'r') as file1Obj, open(fileTwo, 'r') as file2Obj: result = file1Obj.readlines() + file2Obj.readlines()
print(list(i.split('\n', 1)[0] for i in sorted(result, reverse=True, key=lambda x: int(x.split()[2]))))  # sorting by the views
Related
I have a file with lines in this format:
CALSPHERE 1
1 00900U 64063C 20161.15561498 .00000210 00000-0 21550-3 0 9996
2 00900 90.1544 28.2623 0029666 80.8701 43.4270 13.73380512769319
CALSPHERE 2
1 00902U 64063E 20161.16836122 .00000025 00000-0 23933-4 0 9990
2 00902 90.1649 30.9038 0019837 126.9344 3.6737 13.52683749559421
..etc.
I would like to parse this into a dictionary of the format:
{CALSPHERE 1:(1 00900U 64063C 20161.15561498 .00000210 00000-0 21550-3 0 9996, 2 00900 90.1544 28.2623 0029666 80.8701 43.4270 13.73380512769319),
CALSPHERE 2:(1 00902U 64063E 20161.16836122 .00000025 00000-0 23933-4 0 9990, 2 00902 90.1649 30.9038 0019837 126.9344 3.6737 13.52683749559421),...}
I'm puzzled as to how to parse this, so that every third line is the key, with the following two lines forming a tuple for the value. What would be the best way to do this in python?
I've attempted to add some logic for "every third line" though it seems kind of convoluted; something like
with open(r"file") as f:
i = 3
for line in f:
if i%3=0:
key = line
else:
#not sure what to do with the next lines here
If your file always has the same structure (i.e. a 'CALSPHERE' line, or any other line that you want as your dictionary key, followed by two lines), you can achieve what you want as follows:
with open(filename) as file:
    lines = file.read().splitlines()

d = dict()
for i in range(0, len(lines), 3):
    d[lines[i].strip()] = (lines[i + 1], lines[i + 2])
Output:
{
'CALSPHERE 1': ('1 00900U 64063C 20161.15561498 .00000210 00000-0 21550-3 0 9996', '2 00900 90.1544 28.2623 0029666 80.8701 43.4270 13.73380512769319'),
'CALSPHERE 2': ('1 00902U 64063E 20161.16836122 .00000025 00000-0 23933-4 0 9990', '2 00902 90.1649 30.9038 0019837 126.9344 3.6737 13.52683749559421')
}
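An equivalent idiom, if the strict three-line grouping is guaranteed, is to zip an iterator with itself so each zip step consumes three consecutive lines (a sketch using the same lines list):
it = iter(lines)
d = {name.strip(): (line1, line2) for name, line1, line2 in zip(it, it, it)}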
Assuming that your content is in file.txt, you can use the following.
It works for any number of CALSPHERE keyword occurrences, and also for a varying number of entries between them.
with open('file.txt') as inp:
    buffer = []
    for line in inp:
        # remove newline
        copy = line.replace('\n', '')
        # check if next entry
        if 'CALSPHERE' in copy:
            buffer.append([])
        # add line
        buffer[-1].append(copy)

# put the output into a dictionary
res = {}
for chunk in buffer:
    # safety check
    if len(chunk) > 1:
        res[chunk[0]] = tuple(chunk[1:])

print(res)
I have a text file listing the sizes of all files with the extension *.AAA on different servers. I would like to extract, for each server, the filename + size of the files that are bigger than 20 GB. I know how to extract a line from a file and display it, but here is my example and what I would like to achieve.
The example of the file itself:
Pad 1001
Volume in drive \\192.168.0.101\c$ has no label.
Volume Serial Number is XXXX-XXXX
Directory of \\192.168.0.101\c$\TESTUSER\
02/11/2016 02:07 AM 894,889,984 File1.AAA
05/25/2015 07:18 AM 25,673,969,664 File2.AAA
02/11/2016 02:07 AM 17,879,040 File3.AAA
05/25/2015 07:18 AM 12,386,304 File4.AAA
10/13/2008 10:29 AM 1,186,988,032 File3.AAA_oct13
02/15/2016 11:15 AM 2,799,263,744 File5.AAA
6 File(s) 30,585,376,768 bytes
0 Dir(s) 28,585,127,936 bytes free
Pad 1002
Volume in drive \\192.168.0.101\c$ has no label.
Volume Serial Number is XXXX-XXXX
Directory of \\192.168.0.101\c$\TESTUSER\
02/11/2016 02:08 AM 1,379,815,424 File1.AAA
02/11/2016 02:08 AM 18,542,592 File3.AAA
02/15/2016 12:41 AM 853,659,648 File5.AAA
3 File(s) 2,252,017,664 bytes
0 Dir(s) 49,306,902,528 bytes free
Here is what I would like as my output: the Pad # and the file that is bigger than 20 GB:
Pad 1001 05/25/2015 07:18 AM 25,673,969,664 File2.AAA
I will eventually put this in an Excel spreadsheet, but that part I know how to do.
Any ideas?
Thank you
The following should get you started:
import re
output = []
with open('input.txt') as f_input:
    text = f_input.read()

for pad, block in re.findall(r'(Pad \d+)(.*?)(?=Pad|\Z)', text, re.M + re.S):
    file_list = re.findall(r'^(.*? +([0-9,]+) +.*?\.AAA\w*?)$', block, re.M)
    for line, length in file_list:
        length = int(length.replace(',', ''))
        if length > 2e10:  # or your choice of what 20 GB is
            output.append((pad, line))

print(output)
This would display a list with one tuple entry as follows:
[('Pad 1001', '05/25/2015 07:18 AM 25,673,969,664 File2.AAA')]
[EDIT] Here is my approach:
import re

result = []
with open('txtfile.txt', 'r') as f:
    content = [line.strip() for line in f.readlines()]

for line in content:
    m = re.findall(r'\d{2}/\d{2}/\d{4}\s+\d{2}:\d{2}\s+(A|P)M\s+([0-9,]+)\s+((?!\.AAA).)*\.AAA((?!\.AAA).)*', line)
    if line.startswith('Pad') or m and int(m[0][1].replace(',', '')) > 20 * 1024 ** 3:
        result.append(line)

print(re.sub(r'Pad\s+\d+$', '', ' '.join(result)))
Output is:
Pad 1001 05/25/2015 07:18 AM 25,673,969,664 File2.AAA
I have a huge input file that looks like this,
c651 OS05T0-00 492 749 29.07
c651 OS01T0-00 1141 1311 55.00
c1638 MLOC_8.3 27 101 72.00
c1638 MLOC_8.3 25 117 70.97
c2135 TRIUR3_3-P1 124 210 89.66
c2135 EMT17965 25 117 70.97
c1914 OS02T0-00 2 109 80.56
c1914 OS02T0-00 111 155 93.33
c1914 OS08T0-00 528 617 50.00
I would like to iterate over each c group, see whether its lines all have the same element in column 2, and print them into 2 separate files:
c groups whose lines all contain the same element, and
c groups that do not.
In the case of c1914, since 2 of its elements are the same and 1 is not, it goes to file 2. So the desired 2 output files will look like this, file1.txt:
c1638 MLOC_8.3 27 101 72.00
c1638 MLOC_8.3 25 117 70.97
file2.txt
c651 OS05T0-00 492 749 29.07
c651 OS01T0-00 1141 1311 55.00
c2135 TRIUR3_3-P1 124 210 89.66
c1914 OS02T0-00 2 109 80.56
c1914 OS02T0-00 111 155 93.33
c1914 OS08T0-00 528 617 50.00
This is what I tried,
oh1 = open('result.txt', 'w')
oh2 = open('result2.txt', 'w')
f = open('file.txt', 'r')
lines = f.readlines()
for line in lines:
    new_list = line.split()
    protein = new_list[1]
    for i in range(1, len(protein)):
        (p, c) = protein[i-1], protein[i]
        if c == p:
            new_list.append(protein)
            oh1.write(line)
        else:
            oh2.write(line)
If I understand you correctly, you want to send all lines from your input file that share a first element txt1 to your first output file if the second element txt2 of all those lines is the same; otherwise all those lines go to the second output file. Here is a program that does that.
from collections import defaultdict

# Read in the file line-by-line for the first time
# Build up a dictionary of txt1 to set of txt2s
txt1totxt2 = defaultdict(set)
f = open('file.txt', 'r')
for line in f:
    lst = line.split()
    txt1 = lst[0]
    txt2 = lst[1]
    txt1totxt2[txt1].add(txt2)

# The dictionary tells us whether the second text
# is unique or not. If it's unique the set has
# just one element; otherwise the set has > 1 elts.

# Read in the file for a second time, sending each line
# to the appropriate output file
f.seek(0)
oh1 = open('result1.txt', 'w')
oh2 = open('result2.txt', 'w')
for line in f:
    lst = line.split()
    txt1 = lst[0]
    if len(txt1totxt2[txt1]) == 1:
        oh1.write(line)
    else:
        oh2.write(line)
The program logic is very simple. For each txt1 it builds up a set of the txt2s that it sees. When you're done reading the file, if the set has just one element, then you know that the txt2s are unique; if the set has more than one element, then there are at least two distinct txt2s. Note that this means that if you only have one line in the input file with a particular txt1, it will always be sent to the first output file. There are ways round this if this is not the behaviour you want.
Note also that because the file is large, I've read it in line-by-line: lines=f.readlines() in your original program reads the whole file into memory at once. I've stepped through the file twice: the second pass does the output. If this increases the run time, then you can restore the lines=f.readlines() approach instead of reading the file a second time. However, the program as-is should be much more robust with very large files. Conversely, if your files are very large indeed, it would be worth looking at the program to reduce the memory usage further (the dictionary txt1totxt2 could be replaced with something more optimal, albeit more complicated, if necessary).
Edit: there was a good point in comments (now deleted) about the memory cost of this algorithm. To elaborate, the memory usage could be high, but on the other hand it isn't as severe as storing the whole file: rather txt1totxt2 is a dictionary from the first text in each line to a set of the second text, which is of the order of (size of unique first text) * (average size of unique second text for each unique first text). This is likely to be a lot smaller than the file size, but the approach may require further optimization. The approach here is to get something simple going first -- this can then be iterated to optimize further if necessary.
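For instance, a minimal sketch of one such optimization (hypothetical names): keep only the first txt2 seen per txt1, plus a set flagging the txt1 values that turned out to have more than one distinct txt2.
seen = {}      # txt1 -> first txt2 seen
mixed = set()  # txt1 values with more than one distinct txt2
for line in f:
    t1, t2 = line.split()[:2]
    if t1 in seen and seen[t1] != t2:
        mixed.add(t1)
    seen.setdefault(t1, t2)
# a line then goes to the first output file iff its txt1 is not in mixed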
Try this...
import collections

parsed_data = collections.OrderedDict()
with open("input.txt", "r") as fd:
    for line in fd.readlines():
        line_data = line.split()
        key = line_data[0]
        key2 = line_data[1]
        if key not in parsed_data:
            parsed_data[key] = collections.OrderedDict()
        if key2 not in parsed_data[key]:
            parsed_data[key][key2] = [line]
        else:
            parsed_data[key][key2].append(line)

# now process the parsed data and write the result files
fsimilar = open("similar.txt", "w")
fdifferent = open("different.txt", "w")
for key in parsed_data:
    if len(parsed_data[key]) == 1:
        f = fsimilar
    else:
        f = fdifferent
    for key2 in parsed_data[key]:
        for line in parsed_data[key][key2]:
            f.write(line)

fsimilar.close()
fdifferent.close()
Hope this helps
So as part of my code, I'm reading file paths that have varying names, but tend to stick to the following format
p(number)_(temperature)C
What I've done with those paths is separate them into 2 columns (along with 2 more columns of actual data), so I end up with a row that looks like this:
p2 18 some number some number
However, I've found a few folders that use the following format:
p(number number)_(temperature)C
As it stands, for the first case, I use the following code to separate the file path into the proper columns:
def finale():
    for root, dirs, files in os.walk('/Users/Bashe/Desktop/12/'):
        file_name = os.path.join(root, "Graph_Info.txt")
        file_name_out = os.path.join(root, "Graph.txt")
        file = os.path.join(root, "StDev.txt")
        if os.path.exists(os.path.join(root, "Graph_Info.txt")):
            with open(file_name) as fh, open(file) as th, open(file_name_out, "w") as fh_out:
                first_line = fh.readline()
                values = eval(first_line)
                for value, line in zip(values, fh):
                    first_column = value[0:2]
                    second_column = value[3:5]
                    third_column = line.strip()
                    fourth_column = th.readline().strip()
                    fh_out.write("%s\t%s\t%s\t%s\n" % (first_column, second_column, third_column, fourth_column))
        else:
            pass
I've played around with things and found that if I make the following changes, the program works properly.
first_column = value[0:3]
second_column = value[4:6]
Is there a way I can get the program to look and see what the file path is and act accordingly?
welcome to the fabulous world of regex.
import re
# ..........

# case 0: p(number)_(temperature)C, e.g. p2_18C
if re.match(r"p\d+_\d+C", path):
    pass  # stuff
# case 1: p(number number)_(temperature)C, e.g. p33 44_44C
elif re.match(r"p\d+ \d+_\d+C", path):
    pass  # other stuff
>>> for line in s.splitlines():
...     first, second = re.search(r"p([0-9 ]+)_(\d+)C", line).groups()
...     print first, " +", second
...
22 + 66
33 44 + 44
23 33 + 22
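To tie this back to the question's code: the fixed slices value[0:2] and value[3:5] could be replaced with a match, so both name formats work (a sketch using the question's variable names):
m = re.match(r"(p[0-9 ]+)_(\d+)C", value)
if m:
    first_column, second_column = m.groups()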
I have a corpus of words like the ones below. There are more than 3000 words, in 2 files:
File #1:
#fabulous 7.526 2301 2
#excellent 7.247 2612 3
#superb 7.199 1660 2
#perfection 7.099 3004 4
#terrific 6.922 629 1
#magnificent 6.672 490 1
File #2:
) #perfect 6.021 511 2
? #great 5.995 249 1
! #magnificent 5.979 245 1
) #ideal 5.925 232 1
day #great 5.867 219 1
bed #perfect 5.858 217 1
) #heavenly 5.73 191 1
night #perfect 5.671 180 1
night #great 5.654 177 1
. #partytime 5.427 141 1
I also have many sentences, more than 3000 lines, like the ones below:
superb, All I know is the road for that Lomardi start at TONIGHT!!!! We will set a record for a pre-season MNF I can guarantee it, perfection.
All Blue and White fam, we r meeting at Golden Corral for dinner to night at 6pm....great
I have to go through every line and do the following task:
1) find whether any of those corpus words match anywhere in the sentences
2) find whether any of those corpus words match at the leading and trailing ends of the sentences
I am able to do part 2) but not part 1). Well, I can do it, but I'm trying to find an efficient way.
I have the following code:
for line in sys.stdin:
    (id, num, senti, words) = re.split("\t+", line.strip())
    sentence = re.split("\s+", words.strip().lower())
    for line1 in f1:  # f1 is the file containing the corpus of words like File #1
        (term2, sentimentScore, numPos, numNeg) = re.split("\t", line1.strip())
        wordanalysis["trail"] = found if re.match(sentence[len(sentence) - 1], term2.lower()) else not found
        wordanalysis["lead"] = found if re.match(sentence[0], term2.lower()) else not found
    for line1 in f2:  # f2 is the file containing the corpus of words like File #2
        (term2, sentimentScore, numPos, numNeg) = re.split("\t", line1.strip())
        wordanalysis["trail_2"] = found if re.match(sentence[len(sentence) - 1], term2.lower()) else not found
        wordanalysis["lead_2"] = found if re.match(sentence[0], term2.lower()) else not found
Am I doing this right? Is there a better way to do it?
This is a classic map-reduce problem. If you want to get serious about efficiency, you should consider something like: http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
And if you are too lazy / have too few resources to set up your own Hadoop environment, you can try a ready-made one: http://aws.amazon.com/elasticmapreduce/
Feel free to post your code here after it's done :) it will be nice to see how it is translated into a map-reduce algorithm...
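For reference, a minimal Hadoop Streaming mapper could look something like the sketch below. Everything here is an assumption for illustration: it expects the tab-separated input layout from the question on stdin, and a corpus file named corpus.txt (e.g. shipped via the distributed cache) small enough to load into a set.
import sys

# load corpus terms into a set for O(1) membership tests
with open('corpus.txt') as f:
    corpus = set(line.split('\t')[0].lstrip('#') for line in f if line.strip())

for line in sys.stdin:
    (id, num, senti, words) = line.rstrip('\n').split('\t')
    for word in words.strip().lower().split():
        if word.strip('.,!?') in corpus:
            # emit (sentence id, matched word) pairs for the reducer
            print('%s\t%s' % (id, word))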