MapReduce Python, Can't seem to pipe criteria to text file - python

First of all, I'm very new to MapReduce (just this week in fact) and doing it as part of a course I'm currently on so forgive me if I am making basic errors.
I have tried searching for an answer to my problem but I'm not finding anything of relevance.
I have a text file of lines where the data is simple, for example:
Reg1, Yes
Reg2, No
Reg3, Yes
Reg4, Yes
Reg5, Yes
Reg6, Yes
Reg7, Yes
Reg8, No
Reg9, Yes
Reg10, Yes
Reg11, Yes
Reg12, Yes
Reg13, Yes
Reg14, No
Reg15, Yes
The first thing I wanted to do is count the yes and no answers - that part is working fine. Now I am using a second model to pipe the 'reg' words to a text file if the line is a 'No'. I have read somewhere that it is better to look at lines rather than words in this situation, which makes sense.
Below is my attempt at gaining a mapper that does this:
import sys

for line in sys.stdin:
    line = line.strip()
    lines = line.split()
    for line in lines:
        if 'Yes' in line:
            sys.stdout.write('%s\t%s\n' % (line,1))
        else:
            sys.stderr.write('%s\t%s\n' % (line,1))
            print('%s\t%s' % (line, 1))
but the resulting output is:
Reg1, 1
Reg2, 1
No 1
Reg3, 1
Reg4, 1
Reg5, 1
Reg6, 1
Reg7, 1
Reg8, 1
No 1
Reg9, 1
Reg10, 1
Reg11, 1
Reg12, 1
Reg13, 1
Reg14, 1
No 1
Reg15, 1
whereas I just want my output to be:
Reg2, No
Reg8, No
Reg14, No
Can anyone please give me a pointer on where I am going wrong? This bit of work is only for theoretical purposes, which is why I am using Python (plus it is what the tutor demonstrated in).
Thanks in advance.

No need to split the lines into words: the in operator can identify a sub-string within a string.
You also don't need to do so much printing, so eventually your code would be:
import sys

for line in sys.stdin:
    line = line.strip()
    if 'Yes' in line:
        # print(line)  # we don't want to print the Yes lines
        pass
        # but if we want to leave the if unchanged, then a pass instruction needs to fill it
    else:
        print(line)
        # if you want results to be pipe-able, comment the line above, uncomment the line below
        # sys.stdout.write(line)
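If you want to sanity-check the filtering logic locally before wiring the mapper into a streaming job, a minimal sketch could look like this; the file names input.txt and no_lines.txt are placeholders, not anything from the question:
# Keep only the "No" records and write them to a separate text file.
# input.txt / no_lines.txt are placeholder names.
with open('input.txt') as src, open('no_lines.txt', 'w') as dst:
    for line in src:
        line = line.strip()
        if line and 'Yes' not in line:
            dst.write(line + '\n')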

Related

Using python to search for strings in a file and use output to group the content of the second folder

I tried writing Python code that searches for one or more strings in File1.txt, and then makes a change to the findall output (e.g., change cap0001 to 1). Next, the code uses the modified output to group the content of File2.txt based on matches to the "capNo" column in File2.txt.
File1.txt:
>cap00001 supr2
x2shh qewrrw
dsfff rggfdd
>cap00002 supr5
dadamic adertsy
waeee ddccmet
File2.txt
Ref capNo qual
AM1 1 Good
AM8 1 Good
AM7 2 Poor
AM2 2 Good
AM9 2 Good
AM6 3 Poor
AM1 3 Poor
AM2 3 Good
Required output:
capNo counts
1 2
2 3
The following code did not work for me:
import re

With open("File1.txt","r") as InFile1:
    for line in InFile1:
        match=re.findall(r'cap\d+',line)
        if len(match) > 0:
            match=match.remove(cap0000)

With open("File2.txt","r") as InFile2:
    df=InFile2.read()
    df2=df.groupby(match)["capNo"].value_counts()
    print(df2)
How can I get this code working? Thanks
Change the Withs to with.
Call the read function, e.g.:
with open('File1.txt') as f:
    InFile1 = f.read()
    # Do something with InFile1
In your code df is a string - you can't call groupby on it (did you mean to convert it to a pandas DataFrame?)
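If pandas is what you had in mind for the groupby, here is a minimal sketch; it assumes you have already turned the cap00001-style matches from File1.txt into plain integers (the wanted list below stands in for that result):
import pandas as pd

# Read the whitespace-delimited table into a DataFrame.
df = pd.read_csv("File2.txt", sep=r"\s+")

# Stand-in for the capNo values extracted from File1.txt (cap00001 -> 1, ...).
wanted = [1, 2]

# Count how many rows of File2.txt carry each wanted capNo.
counts = df[df["capNo"].isin(wanted)]["capNo"].value_counts().sort_index()
print(counts)
This produces the counts per capNo (1 -> 2, 2 -> 3 for the sample data), which matches the required output.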

How to read ONLY 1 word in python?

I've created an empty text file, and saved some stuff to it. This is what I saved:
Saish ddd TestUser ForTestUse
There is a space before these words. Anyway, I wanted to know how to read only one WORD from the text file using Python. This is the code I used:
#Uncommenting the line below the line does literally nothing.
import time
#import mmap, re

print("Loading Data...")
time.sleep(2)

with open("User_Data.txt") as f:
    lines = f.read()  ##Assume the sample file has 3 lines
    first = lines.split(None, 1)[0]
    print(first)

print("Type user number 1 - 4 for using different user.")
ans = input('Is the name above correct?(y/1 - 4) ')
if ans == 'y':
    print("Ok! You will be called", first)
elif ans == '1':
    print("You are already registered to", first)
elif ans == '2':
    print('Switching to accounts...')
    time.sleep(0.5)
    with open("User_Data.txt") as f:
        lines = f.read()  ##Assume the sample file has 3 lines
        second = lines.split(None, 2)[2]
        print(second)
#Fix the passord issue! Very important as this is SECURITY!!!
when I run the code, my output is:
Loading Data...
Saish
Type user number 1 - 4 for using different user.
Is the name above correct?(y/1 - 4) 2
Switching to accounts...
TestUser ForTestUse
As you can see, it displays both "TestUser" and "ForTestUse", while I only want it to display "TestUser".
When you give a limit to split(), everything from that limit onward is left combined in the last element. So if you do
lines = 'Saish ddd TestUser ForTestUse'
split = lines.split(None, 2)
the result is
['Saish', 'ddd', 'TestUser ForTestUse']
If you just want the third word, don't give a limit to split().
second = lines.split()[2]
You can use it directly without passing None:
lines.split()[2]
I understand you're passing (None, 2) because you want to get None if there is no value at index 2.
A simple way to check whether the index is available in the list:
Python 2
2 in zip(*enumerate(lines.split()))[0]
Python 3
2 in list(zip(*enumerate(lines.split())))[0]
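If the goal is simply to avoid an IndexError when a third word is missing, a plainer check (just a sketch) would be:
words = lines.split()
# Take the third word only if it exists; otherwise fall back to None.
third = words[2] if len(words) > 2 else None
print(third)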

matching and displaying specific lines through python

I have 15 lines in a log file and I want to read, for example, the 4th and 10th lines through Python and display them in the output, saying that the string is found:
abc
def
aaa
aaa
aasd
dsfsfs
dssfsd
sdfsds
sfdsf
ssddfs
sdsf
f
dsf
s
d
Please suggest, through code, how to achieve this in Python.
To elaborate more on this example: the first string (or line) is unique and can be found easily in the log file. The next string, B, comes within 40 lines of the first one, but it occurs in lots of places in the log file, so I need to look for this string within the 40 lines after finding string A and print that both strings were found.
Also, I can't use the with statement of Python, as it gives me errors like "'with' will become a reserved keyword in Python 2.6". I am using Python 2.5.
You can use this:
fp = open("file")
for i, line in enumerate(fp):
if i == 3:
print line
elif i == 9:
print line
break
fp.close()
def bar(start, end, search_term):
    with open("foo.txt") as fil:
        # look for the term anywhere in the selected slice of lines
        if any(search_term in line for line in fil.readlines()[start:end]):
            print search_term + " was found"

>>> bar(4, 10, "dsfsfs")
dsfsfs was found
#list of random characters
from random import randint
a = list(chr(randint(0,100)) for x in xrange(100))

#look for this
lookfor = 'b'
for element in xrange(100):
    if lookfor==a[element]:
        print a[element],'on',element

#b on 33
#b on 34
is one easy to read and simple way to do it. Can you give part of your log file as an example? There are other ways that may work better :).
After the edits by the author, the easiest thing you can do is:
looking_for = 'findthis'
i = 1
for line in open('filename.txt', 'r'):
    if looking_for == line.strip():  # strip the trailing newline before comparing
        print i, line
    i += 1
It's efficient and easy :)
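None of the answers above cover the elaboration (string B must be found only within the 40 lines after the unique string A), so here is a rough sketch of that logic; string_a, string_b and logfile.txt are placeholders, and it avoids the with statement so it stays Python 2.5 friendly:
# Find the unique string A, then look for string B only in the next 40 lines.
# string_a, string_b and logfile.txt are placeholder names.
string_a = 'unique string A'
string_b = 'string B'

fp = open('logfile.txt')
try:
    lines = fp.readlines()
finally:
    fp.close()

for i, line in enumerate(lines):
    if string_a in line:
        print('found string A on line %d' % (i + 1))
        # search only the 40 lines that follow line i
        for offset, later in enumerate(lines[i + 1:i + 41]):
            if string_b in later:
                print('found string B on line %d' % (i + 2 + offset))
                break
        break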

How do I use index iteration to search in a list in Python?

This is for an assignment which I've nearly finished. So the goal is to be able to search the list based on CID, which is the first value in each line of the txt file.
The text file contains the following records, and is tab delimited:
0001 001 -- -- 1234.00 -- -- 148.08 148.08 13.21 1395.29
0002 011 -- 100.00 12000.00 -- 5.00 1440.00 1445.00 414.15 13959.15
0003 111 100.00 1000.00 1000.00 8.00 50.00 120.00 178.00 17.70 2295.70
0004 110 1200.00 100.00 -- 96.00 5.00 -- 101.00 6.15 1407.15
0005 101 100.00 -- 1300.00 8.00 -- 156.00 164.00 15.60 1579.60
0006 100 1200.00 -- -- 96.00 -- -- 96.00 5.40 1301.40
0007 010 -- 1500.00 -- -- 75.00 -- 75.00 2.25 1577.25
0008 001 -- -- 1000.00 -- -- 120.00 120.00 9.00 1129.00
0009 111 1000.00 1000.00 1000.00 80.00 50.00 120.00 250.00 28.50 3278.50
0010 111 100.00 10000.00 1000.00 8.00 500.00 120.00 628.00 123.90 11851.90
I'm new to Python and haven't got my head around it yet. I need to be able to somehow dynamically fill in lines[0] with other index positions. For example, '0002' is found in index [0], and 0002 is found if I change to lines[1], and so forth. I've tried various while loops, enumerating, and list comprehensions, but most of that is beyond my understanding. Or maybe there's an easier way to display the line for a particular 'customer'?
with open('customer.txt', 'r') as file:
    for line in file:
        lines = file.read().split('\n')

search = input("Please enter a CID to search for: ")
if search in lines[0]:
    print(search, "was found in the database.")
    CID = lines[0]
    print(CID)
else:
    print(search, "does not exist in the database.")
Not sure, are the lines supposed to be split into fields somehow?
search = input("Please enter a CID to search for: ")
with open('customer.txt', 'r') as file:
for line in file:
fields = line.split('\t')
if fields[0] == search:
print(search, "was found in the database.")
CID = fields[0]
print(line)
break
else:
print(search, "does not exist in the database.")
Here's how I think you should solve this problem. Comments below the code.
_MAX_CID = 9999

while True:
    search = input("Please enter a CID to search for: ")
    try:
        cid = int(search)
    except ValueError:
        print("Please enter a valid number")
        continue
    if not 0 <= cid <= _MAX_CID:
        print("Please enter a number within the range 0..%d" % _MAX_CID)
        continue
    else:
        # number is good
        break

with open("customer.txt", "r") as f:
    for line in f:
        if not line.strip():
            continue  # completely blank line so skip it
        fields = line.split()
        try:
            line_cid = int(fields[0])
        except ValueError:
            continue  # invalid line so skip it
        if cid == line_cid:
            print("%d was found in the database." % cid)
            print(line.strip())
            break
    else:
        # NOTE! This "else" goes with the "for"! This case
        # will be executed if the for loop runs to the end
        # without breaking. We break when the CID matches,
        # so this code runs when the CID never matched.
        print("%d does not exist in the database." % cid)
Instead of searching for a text match, we are parsing the user's input as a number and searching for a numeric match. So, if the user enters 0, a text match would match every single line of your example file, but a numeric match won't match anything!
We take input, then convert it to an integer. Then we check it to see if it makes sense (isn't negative or too large). If it fails any test we keep looping, making the user re-enter. Once it's a valid number we break out of the loop and continue. (Your teacher may not like the way I use break here. If it makes your teacher happier, add a variable called done that is initially set to False, and set it to True when the input validates, and make the loop while not done:).
You seem a bit confused about input. When you open a file, you get back an object that represents the opened file. You can do several things with this object. One thing you can do is use method functions like .readlines() or .read(), but another thing you can do is just iterate over it. To iterate it you just put it in a for loop; when you do that, each loop iteration gets one line of input from the file. So my code sample sets the variable line to a line from the file on each iteration. If you use the .read() method, you slurp the entire file into memory all at once, which isn't needed, and then your loop isn't looping over the lines of the file. Usually you should use the for line in f: sort of loop; sometimes you need to slurp the file with f.read(); you never do both at the same time.
It's a small point, but file is a built-in type in Python, and by assigning to that you are rebinding the name, and "shadowing" the built-in type. Why not simply use f as I did in my program? Or, use something like in_file. When I have both an input file and an output file at the same time I usually use in_file and out_file.
Once we have the line, we can split it into fields using the .split() method function. Then the code forces the 0th field to an integer and checks for an exact match.
This code checks the input lines, and if they don't work, silently skips the line. Is that what you want? Maybe not! Maybe it would be better for the code to blow up if the database file is malformed. Then instead of using the continue statement, maybe you would want to put in a raise statement, and raise an exception. Maybe define your own MalformedDatabase exception, which should be a subclass of ValueError I think.
This code uses a pretty unique feature of Python, the else statement on a for loop. This is for code that is only executed when the loop runs all the way to the end, without an early exit. When the loop finds the customer ID, it exits early with a break statement; when the customer ID is never found, the loop runs to the end and this code executes.
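If the for/else construct is unfamiliar, here is a stripped-down illustration (separate from the assignment code):
# The else body runs only when the loop finishes without hitting break.
for item in [1, 2, 3]:
    if item == 99:
        break                       # never happens for this list
else:
    print("99 was never found")     # so this line runs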
This code will actually work okay with Python 2.x, but the error checking isn't quite adequate. If you run it under Python 3.x it is pretty well-checked. I'm assuming you are using Python 3.x to run this. If you run this with Python 2.x, enter xxx or crazy junk like 0zz and you will get different exceptions than just the ValueError being tested! (If you actually wanted to use this with Python 2.x, you should change input() to raw_input(), or catch more exceptions in the try/except.)
Another approach. Since the file is tab delimited, you can use the csv module as well.
This approach, unlike #gnibbler's answer, will read the entire file and then search its contents (so it will load the file in memory).
import csv

with open('customer.txt') as file:
    reader = csv.reader(file, delimiter='\t')
    lines = list(reader)

search = input('Please enter the id: ')
result = [line for line in lines if search in line]
print('\t'.join(result[0]) if result else 'Not Found')

Python - Parsing Conundrum

I have searched high and low for a resolution to this situation, and tested a few different methods, but I haven't had any luck thus far. Basically, I have a file with data in the following format that I need to convert into a CSV:
(previously known as CyberWay Pte Ltd)
0 2019
01.com
0 1975
1 TRAVEL.COM
0 228
1&1 Internet
97 606
1&1 Internet AG
0 1347
1-800-HOSTING
0 8
1Velocity
0 28
1st Class Internet Solutions
0 375
2iC Systems
0 192
I've tried using re.sub and replacing the whitespace between the numbers on every other line with a comma, but haven't had any success so far. I admit that I normally parse from CSVs, so raw text has been a bit of a challenge for me. I would need to maintain the string formats that are above each respective set of numbers.
I'd prefer the CSV to be formatted as such:
foo bar
0,8
foo bar
0,9
foo bar
0,10
foo bar
0,11
There's about 50,000 entries, so manually editing this would take an obscene amount of time.
If anyone has any suggestions, I'd be most grateful.
Thank you very much.
If you just want to replace whitespace with comma, you can just do:
line = ','.join(line.split())
You'll have to do this only on every other line, but from your question it sounds like you already figured out how to work with every other line.
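Putting that together for alternating lines might look like this sketch, assuming the name line always comes first; data.txt and out.csv are placeholder file names:
# Copy name lines through unchanged; turn "0 2019" into "0,2019" on the
# alternating number lines.  data.txt / out.csv are placeholder names.
with open('data.txt') as src, open('out.csv', 'w') as dst:
    for i, line in enumerate(src):
        line = line.strip()
        if i % 2 == 0:                                  # name line
            dst.write(line + '\n')
        else:                                           # number line
            dst.write(','.join(line.split()) + '\n')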
If I have correctly understood your requirement, you need a strip() on all lines and a whitespace split on the even lines (counting lines from 1):
import re

fp = open("csv.txt", "r")
while True:
    line = fp.readline()
    if '' == line:
        break
    line = line.strip()
    fields = re.split(r"\s+", fp.readline().strip())
    print "\"%s\",%s,%s" % (line, fields[0], fields[1])
fp.close()
The output is a CSV (you might need to escape quotes if they occur in your input):
"Content of odd line",Number1,Number2
I do not understand the 'foo,bar' you place as header on your example's odd lines, though.
