I have two files which look exactly the same:
file1
1 in seattle today the secretary of education richard riley delivered his address
1 one of the things he focused on as the president had done
1 abc's michele norris has been investigating this
2 we're going to take a closer look tonight at the difficulty of getting meaningful
file2
1 in seattl today the secretari of educ richard riley deliv hi address
1 one of the thing he focus on a the presid had done
1 abc michel norri ha been investig thi
2 we'r go to take a closer look tonight at the difficulti of get meaning
When I run this code:
from collections import defaultdict

result = defaultdict(list)
with open("onthis.txt","r") as filer:
    for line in filer:
        label, sentence= line.strip().split(' ', 1)
        result[label].append(sentence)
It works perfectly for file1 but gives me a value error for file2:
label, sentence= line.strip().split(' ', 1)
ValueError: need more than 1 value to unpack
I can't see the reason, since they are both in the same format.
So I just removed the empty lines with this terminal command:
sed '/^$/d' onthis.txt > trial
But the same error appears.
They can't be exactly the same. My guess is that there is an empty / white-space-only line somewhere in your second file, most likely right at the end.
The error is telling you that when it performs the split, there are no spaces to split on, so only one value is returned rather than a value for both label and sentence.
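You can reproduce the failure in isolation; a whitespace-only line strips down to an empty string, and splitting that returns a single-element list, which cannot be unpacked into two names:

>>> line = "   \n"   # a line containing only whitespace
>>> line.strip().split(' ', 1)
['']
>>> label, sentence = line.strip().split(' ', 1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: need more than 1 value to unpack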
Based on your edit I suspect you might still have "empty" lines in your text file. Or rather: lines filled with nothing but whitespace.
I've extended your example file:
1 in seattl today the secretari of educ richard riley deliv hi address
1 one of the thing he focus on a the presid had done
1 abc michel norri ha been investig thi
2 we'r go to take a closer look tonight at the difficulti of get meaning
3 foo
4 bar
5 qun
It's probably not clear here, but the line between 3 foo and 4 bar is filled with a couple of spaces, while the lines between 4 bar and 5 qun are "just" newlines (\n).
Notice the output of sed '/^$/d'
1 in seattl today the secretari of educ richard riley deliv hi address
1 one of the thing he focus on a the presid had done
1 abc michel norri ha been investig thi
2 we'r go to take a closer look tonight at the difficulti of get meaning
3 foo
4 bar
5 qun
The empty lines are truly removed - no doubt. But the pseudo-empty whitespace line is still there. Running your Python script will throw an error as soon as it reaches that line, right after processing these two:
2 we'r go to take a closer look tonight at the difficulti of get meaning
3 foo
Traceback (most recent call last):
  File "python.py", line 9, in <module>
    label, sentence= line.strip().split(' ', 1)
ValueError: need more than 1 value to unpack
So my suggestion would be to extend your script by one line, making it skip empty lines in your input file.
for line in filer:
    if not line.strip(): continue
Doing so has the positive side effect that you don't have to prepare your input files with any sed magic beforehand.
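For completeness, here is a minimal sketch of your original loop with that one guard added (same file name and defaultdict setup as in your code):

from collections import defaultdict

result = defaultdict(list)
with open("onthis.txt", "r") as filer:
    for line in filer:
        # skip empty and whitespace-only lines instead of letting them break the unpack
        if not line.strip():
            continue
        label, sentence = line.strip().split(' ', 1)
        result[label].append(sentence)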
Based on what you have provided (with a tweak), this seems to give the expected result.
result = {}
with open("test.txt", "r") as filer:
    for line in filer:
        label, sentence = line.strip().split(' ', 1)
        try:
            result[label].append(sentence)
        except KeyError:
            result[label] = [sentence]
Output:
{'2': ["we'r go to take a closer look tonight at the difficulti of get meaning"], '1': ['in seattl today the secretari of educ richard riley deliv hi address', 'one of the thing he focus on a the presid had done', 'abc michel norri ha been investig thi']}
So this must mean that there is something missing from what you have provided. If the above doesn't give you what you need, then more information is required.
Related
So I have two .txt files that I'm trying to match up. The first .txt file is just lines of about 12,500 names.
John Smith
Jane Smith
Joe Smith
The second .txt file also contains lines with names (that might repeat) but also extra info, about 17GB total.
584 19423 john smith John Smith 79946792 5 5 11 2016-06-24
584 19434 john smith John Smith 79923732 5 4 11 2018-03-14
584 19423 jane smith Jane Smith 79946792 5 5 11 2016-06-24
My goal is to find all the names from File 1 in File 2, and then spit out the File 2 lines that contain any of those File 1 names.
Here is my python code:
with open("Documents/File1.txt", "r") as t:
terms = [x.rstrip('\n') for x in t]
with open("Documents/File2.txt", "r") as f, open("Documents/matched.txt","w") as w:
for line in f:
if any([term in line for term in terms]):
w.write(line)
So this code definitely works, but it has been running for 3 days (and is still going!!!). I did some back-of-the-envelope calculations, and I'm very worried that my algorithm is computationally intractable (or hyper inefficient) given the size of the data.
Would anyone be able to provide feedback re: (1) whether this is actually intractable and/or extremely inefficient and if so (2) what an alternative algorithm might be?
Thank you!!
First, when testing membership, set and dict are going to be much, much faster, so terms should be a set:
with open("Documents/File1.txt", "r") as t:
terms = set(line.strip() for line in t)
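As a quick illustration of the lookup (with a tiny hand-built set just for the example):

>>> terms = {"John Smith", "Jane Smith", "Joe Smith"}
>>> "John Smith" in terms   # average O(1) hash lookup
True
>>> "Bob Jones" in terms    # a made-up name that is not in the set
False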
Next, I would split each line into a list and check whether the name is in the set, rather than checking whether each member of the set is a substring of the line; the latter costs on the order of O(T·N) per line, where T is the number of terms (~12,500) and N is the length of the line. This way you can directly pick out, via slicing, the columns that contain the first and last name:
with open("Documents/File2.txt", "r") as f, open("Documents/matched.txt","w") as w:
for line in f:
# split the line on whitespace
names = line.split()
# Your names seem to occur here
name = ' '.join(names[4:6])
if name in terms:
w.write(line)
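With this approach, each line of the 17 GB file costs one split plus a single average-O(1) set lookup, instead of roughly 12,500 substring scans per line, so the filter should get through the file in a single pass rather than running for days.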
Students.txt
64 Mary Ryan
89 Michael Murphy
22 Pepe
78 Jenny Smith
57 Patrick James McMahon
89 John Kelly
22 Pepe
74 John C. Reilly
My code
f = open("students.txt","r")
for line in f:
words = line.strip().split()
mark = (words[0])
name = " ".join(words[1:])
for i in (mark):
print(i)
The output I'm getting is
6
4
8
9
2
2
7
8
etc...
My expected output is
64
89
22
78
etc..
Just curious to know how I would print the whole number, not just a single digit at a time.
Any help would be much appreciated.
As I can see, you have an integer followed by a string on each line of the text file, and you want to know how your code can output only the full integer.
You can use this code:
f = open("Students.txt","r")
for line in f:
l = line.split(" ")
print(l[0])
In Python, when you do this:
for i in (mark):
    print(i)
and mark is of type string, you are asking Python to iterate over each character in the string. So, if your string contains the digits of a number and you iterate over the string, you'll get one digit (one character) at a time.
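For example, with the first mark from your file:

>>> mark = "64"
>>> for i in mark:
...     print(i)
...
6
4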
I believe in your code the line
mark = (words[0])name = " ".join(words[1:])
is a typo. If you fix that we can help you with what's missing (it's most likely a statement like mark = something.split(), but not sure what something is based on the code).
You should be using context managers when you open files so that they are automatically closed for you when the scope ends. Also mark should be a list to which you append the first element of the line split. All together it will look like this:
with open("students.txt","r") as f:
mark = []
for line in f:
mark.append(line.strip().split()[0])
for i in mark:
print(i)
The line
for i in (mark):
is the same as this, because mark is a string:
for i in mark:
I believe you want to make mark an element of some iterable; you can create a single-item tuple like this:
for i in (mark,):
and this should give what you want.
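For example:

>>> mark = "64"
>>> for i in (mark,):
...     print(i)
...
64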
in your line:
line.strip().split()
you're not telling the string to split based on a space. Try the following:
str(line).strip().split(" ")
A quick one with list comprehensions:
with open("students.txt","r") as f:
mark = [line.strip().split()[0] for line in f]
for i in mark:
print(i)
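With your Students.txt this should print each mark on its own line:

64
89
22
78
57
89
22
74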
First of all, I'm very new to MapReduce (just this week in fact) and doing it as part of a course I'm currently on so forgive me if I am making basic errors.
I have tried searching for an answer to my problem but I'm not finding anything of relevance.
I have a text file of lines where the data is simple, for example:
Reg1, Yes
Reg2, No
Reg3, Yes
Reg4, Yes
Reg5, Yes
Reg6, Yes
Reg7, Yes
Reg8, No
Reg9, Yes
Reg10, Yes
Reg11, Yes
Reg12, Yes
Reg13, Yes
Reg14, No
Reg15, Yes
The first thing I wanted to do is count the yes and no answers - this part is working fine - but now I am using a second model to pipe the 'reg' words to a text file if the line is a 'No'. I have read somewhere that it is better to look at the lines rather than the words in this situation, which makes sense.
Below is my attempt at gaining a mapper that does this:
import sys

for line in sys.stdin:
    line = line.strip()
    lines = line.split()
    for line in lines:
        if 'Yes' in line:
            sys.stdout.write('%s\t%s\n' % (line,1))
        else:
            sys.stderr.write('%s\t%s\n' % (line,1))
            print('%s\t%s' % (line, 1))
but the resulting output is:
Reg1, 1
Reg2, 1
No 1
Reg3, 1
Reg4, 1
Reg5, 1
Reg6, 1
Reg7, 1
Reg8, 1
No 1
Reg9, 1
Reg10, 1
Reg11, 1
Reg12, 1
Reg13, 1
Reg14, 1
No 1
Reg15, 1
whereas I just want my output to be:
Reg2, No
Reg8, No
Reg14, No
Can anyone please give me a pointer on where I am going wrong? This bit of work is only for theoretical purposes, which is why I am using Python (plus this is what the tutor demonstrated in).
Thanks in advance.
No need to split the lines into words.
The in operator can identify a sub-string within a string.
And then you also don't need to do so much printing; eventually your code would be:
import sys

for line in sys.stdin:
    line = line.strip()
    if 'Yes' in line:
        # print(line)  # we don't want to print the Yes lines
        pass
        # but if we want to leave the IF unchanged, then a pass instruction needs to fill it
    else:
        print(line)
        # if you want results to be pipe-able, comment line above, uncomment line below
        # sys.stdout.write(line)
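With your sample input, this mapper should emit only the No lines:

Reg2, No
Reg8, No
Reg14, No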
Okay, so. Bit of an annoying one, this. I have a file with multiple 'sections' in it. The file may look like this:
"""
Begin presentation. Welcome the new guest. Do this, do that
[+] Start group 1
+derrek
+bob
+james
+harry
[+] Start group 2
+Will
+Paul
+Mark
+Eric
Hello and welcome to this years presentation of the "New Show" feature me your host Troy Mcleur. Something blah blah blah
"""
So my question is: is it possible to write some Python to parse out both the first and second groups of names, so that only they are printed? The output would then be:
[+] Start group 1
derrek
bob
james
harry
[+] Start group 2
Will
Paul
Mark
Eric
The code I currently have is this:
for line in file:
    if 'Start Group' in line:
        print line
        break

for line in file:
    if 'Start Group' in line:
        break
    print line
This only prints Group 1 though; it won't print the next group. Also, occasionally some files may have between 2 and 9 groups, so I'd need it to iterate through and find all the instances of Group and print all the names within them.
This could work:
from __future__ import print_function

show = False
for line in fobj:
    if line.strip().startswith('[+]'):
        print()
        show = True
    elif not line.strip():
        show = False
    if show:
        print(line, end='')
Output:
[+] Start group 1
+derrek
+bob
+james
+harry
[+] Start group 2
+Will
+Paul
+Mark
+Eric
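If you also want to drop the leading + from the names, as in the output you listed, a small variation of the same idea might work; like the code above, it assumes a blank line separates each group from the surrounding prose:

from __future__ import print_function

show = False
for line in fobj:
    stripped = line.strip()
    if stripped.startswith('[+]'):
        print(stripped)              # keep the "[+] Start group N" header
        show = True
    elif not stripped:
        show = False                 # a blank line ends the current group
    elif show:
        print(stripped.lstrip('+'))  # print the name without its leading '+'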
I have searched high and low for a resolution to this situation, and tested a few different methods, but I haven't had any luck thus far. Basically, I have a file with data in the following format that I need to convert into a CSV:
(previously known as CyberWay Pte Ltd)
0 2019
01.com
0 1975
1 TRAVEL.COM
0 228
1&1 Internet
97 606
1&1 Internet AG
0 1347
1-800-HOSTING
0 8
1Velocity
0 28
1st Class Internet Solutions
0 375
2iC Systems
0 192
I've tried using re.sub and replacing the whitespace between the numbers on every other line with a comma, but haven't had any success so far. I admit that I normally parse from CSVs, so raw text has been a bit of a challenge for me. I would need to maintain the string formats that are above each respective set of numbers.
I'd prefer the CSV to be formatted as such:
foo bar
0,8
foo bar
0,9
foo bar
0,10
foo bar
0,11
There's about 50,000 entries, so manually editing this would take an obscene amount of time.
If anyone has any suggestions, I'd be most grateful.
Thank you very much.
If you just want to replace whitespace with comma, you can just do:
line = ','.join(line.split())
You'll have to do this only on every other line, but from your question it sounds like you already figured out how to work with every other line.
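For example, on one of the number lines from your file:

>>> line = "0 2019"
>>> ','.join(line.split())
'0,2019'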
If I have correctly understood your requirement, you need a strip() on all lines and a whitespace-based split on the even lines (counting lines from 1):
import re

fp = open("csv.txt", "r")
while True:
    line = fp.readline()
    if '' == line:
        break
    line = line.strip()
    fields = re.split("\s+", fp.readline().strip())
    print "\"%s\",%s,%s" % ( line, fields[0], fields[1] )
fp.close()
The output is a CSV (you might need to escape quotes if they occur in your input):
"Content of odd line",Number1,Number2
I do not understand the 'foo,bar' you place as header on your example's odd lines, though.