I'm trying to read a log file from a GitHub URL, add some geographic info using the IP address as a lookup key, and then write some of the log info plus the geographic info to a file. I've got the reading from the log and writing to a file working, but I'm not sure what library to use for looking up coordinates and such from an IP address, nor how to really go about this part. I found the regex module, and by the time I started to understand it, I found out it's deprecated. Here's what I've got; any help would be great.
import urllib2

apacheLog = 'https://raw.githubusercontent.com/myAccessLog.log'
data = urllib2.urlopen(apacheLog)
for line in data:
    with open('C:\LogCopy.txt', 'a') as f:
        f.write(line)
The re module isn't deprecated; it's part of the standard library. Edit: here's the link for the 2.7 module.
Your for loop is opening and closing the file at each iteration. That's probably not a big deal, but for large files it might be faster to open the file once and write everything that needs to be written. Just swap the locations of the for and with lines.
So
data = urllib2.urlopen(apacheLog)
for line in data:
    with open('C:\LogCopy.txt', 'a') as f:  # probably need a double backslash
        f.write(line)
becomes
data = urllib2.urlopen(apacheLog)
with open('C:\LogCopy.txt', 'a') as f:  # probably need a double backslash
    for line in data.read().splitlines():
        f.write(line)  # might need a newline character:
        # f.write(line + '\n')
Similar question regarding geolocation Python library
Best of luck!
Edit: added the data.read().splitlines() call after reading Piotr Kempa's answer.
Well, the first part is simple. Just use for line in data.read().split('\n'), assuming the lines end with a normal newline (they should).
Then you use the re module (import re) - I hope it was still around in Python 2.7... You can extract the IP address with something like re.search(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", line); look up the re.search() function for details on how to use it.
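For example, a minimal sketch (the sample log line below is made up for illustration):

import re

IP_RE = re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}")

line = '203.0.113.7 - - [10/Oct/2000:13:55:36 -0700] "GET / HTTP/1.0" 200 2326'
match = IP_RE.search(line)
if match:
    print match.group(0)  # prints the first IP found: 203.0.113.7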
As for locating the IP geographically, it was already asked I think, try this question: What python libraries can tell me approximate location and time zone given an IP address?
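To give a flavour of what those answers suggest, here's a rough sketch using the pygeoip package with MaxMind's free GeoLiteCity database; both the package and the database filename are assumptions on my part, so install/download and adjust as needed:

import pygeoip

gi = pygeoip.GeoIP('GeoLiteCity.dat')  # assumed path to the downloaded database
record = gi.record_by_addr('203.0.113.7')  # returns a dict of fields, or None
if record:
    print record['city'], record['latitude'], record['longitude']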
First of all, I have done a lot of research and tried many approaches to find a solution, but maybe I am doing it wrong, because I was not able to find one.
My data:
https://knsim.com/test.txt
The problem is that I want to write every occurrence of 'Idea_Print' to a new file (not just the name 'Idea_Print' but the complete log line).
I have tried many approaches but had no success.
My most recent code:
file = open(filename, 'r')
for line in file:
    if 'Idea_Print' in line:
        print(line)
But this didn't work for me. I even tried using re, but with no success.
Thanks for the help.
I tried running your code with my setup and it worked fine. Most likely, you are using the wrong filename in open().
In file = open(filename, 'r'), make sure that filename is the exact same name as your saved data, including extension. Ensure that the file you are trying to access is in the same folder as your python file. If it is not in the same folder, try including the full path to the file.
Also, at the end of your program make sure you call file.close() to allow other programs to access it correctly.
Another thing that could be causing problems is that your variable is called file. This is a built-in name in Python, so you probably don't want to overwrite it. Try changing it to something like data_file.
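Putting those suggestions together, a minimal sketch that also writes the matching lines to a new file, which is what you said you wanted (the output filename is made up):

with open('test.txt', 'r') as data_file, open('matches.txt', 'w') as out_file:
    for line in data_file:
        if 'Idea_Print' in line:
            out_file.write(line)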
Edit: Looking at that file, it appears that there is a space between every character. That means you will need to search for 'E p i s o d e _ 2 0' instead of 'Episode_20'. However, it appears that even that does not appear in the file. Maybe you should double-check that it is the text you are looking for.
This question already has answers here: Unable to read huge (20GB) file from CPython (2 answers). Closed 9 years ago.
I'm a Python newbie and had a quick question regarding memory usage when reading large text files. I have a ~13GB csv I'm trying to read line by line, following the Python documentation and more experienced Python users' advice not to use readlines(), in order to avoid loading the entire file into memory.
When trying to read a line from the file I get the error below and am not sure what might be causing it. Besides this error, I also notice my PC's memory usage is excessively high. This was a little surprising since my understanding of the readline function is that it only loads a single line from the file at a time into memory.
For reference, I'm using Continuum Analytic's Anaconda distribution of Python 2.7 and PyScripter as my IDE for debugging and testing. Any help or insight is appreciated.
with open(R'C:\temp\datasets\a13GBfile.csv', 'r') as f:
    foo = f.readline()  # <-- Err: SystemError: ..\Objects\stringobject.c:3902 bad argument to internal function
UPDATE:
Thank you all for the quick, informative and very helpful feedback. I reviewed the referenced link, which is exactly the problem I was having. After applying the documented 'rU' mode option, I was able to read lines from the file as normal. I hadn't noticed this mode mentioned in the documentation link I was referencing initially, and neglected to look at the details of the open function first. Thanks again.
Unix text files end each line with \n.
Windows text files end each line with \r\n.
When you open a file in text mode, 'r', Python assumes it has the native line endings for your platform.
So, if you open a Unix text file on Windows, Python will look for \r\n sequences to split the lines. But there won't be any, so it'll treat your whole file as one giant 13-billion-character line. So that readline() call ends up trying to read the whole thing into memory.
The fix for this is to use universal newlines mode, by opening the file in mode rU. As explained in the docs for open:
supplying 'U' opens the file as a text file, but lines may be terminated by any of the following: the Unix end-of-line convention '\n', the Macintosh convention '\r', or the Windows convention '\r\n'.
So, instead of searching for \r\n sequences to split the lines, it looks for \r\n, or \n, or \r. And there are millions of \n. So, the problem is solved.
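With the question's path, a minimal sketch would be:

with open(R'C:\temp\datasets\a13GBfile.csv', 'rU') as f:
    foo = f.readline()  # now reads a single line instead of the whole file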
A different way to fix this is to use binary mode, 'rb'. In this mode, Python doesn't do any conversion at all, and assumes all lines end in \n, no matter what platform you're on.
On its own, this is pretty hacky—it means you'll end up with an extra \r on the end of every line in a Windows text file.
But it means you can pass the file on to a higher-level file reader like csv that wants binary files, so it can parse them the way it wants to. On top of magically solving this problem for you, a higher-level library will also probably make the rest of your code a lot simpler and more robust. For example, it might look something like this:
import csv

with open(R'C:\temp\datasets\a13GBfile.csv', 'rb') as f:
    for row in csv.reader(f):
        pass  # do stuff with row
Now each row is automatically split on commas, except that commas that are inside quotes or escaped in the appropriate way don't count, and so on, so all you need to deal with is a list of column values.
I do appreciate that this question has been asked a million times, but I can't figure out why, while attempting to read a .txt file line by line, I get the entire file read in one go.
This is my little snippet
num = 0
with open(inStream, "r") as f:
    for line in f:
        num += 1
        print line + " ..."
print num
Having a look at the open function, there isn't anything that suggests a second parameter to limit the reading, as that is just the "mode" used to open the file.
So I can only guess there is some problem with my file, but it is a txt file with entries line by line.
Any hint?
Without a little more information, it's hard to be absolutely sure… but most likely, your problem is inappropriate line endings.
For example, on a modern Mac OS X system, lines in text files end with '\n' newline characters. So, when you do for line in f:, Python breaks the text file on '\n' characters.
But on classic Mac OS 9, lines in text files ended with '\r' instead. If you have some ancient classic Mac text files lying around, and you give one to Python, it will go looking for '\n' characters and not find any, so it'll think the whole file is one giant line.
(Of course in real life, Windows is a problem more often than classic Mac OS, but I used this example because it's simpler.)
Python 2: Fortunately, Python has a feature called "universal newlines". For full details, see the link, but the short version is that adding "U" onto the end of the mode when opening a text file means Python will read any of the three standard line-ending conventions (and give them to your code as Unix-style '\n').
In other words, just change one line:
with open(inStream, "rU") as f:
Python 3: Universal newlines are part of the standard behavior; adding "U" has no effect and is deprecated.
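So a Python 3 version of your loop needs no special mode at all; a minimal sketch, assuming the same inStream variable:

num = 0
with open(inStream, "r") as f:  # newline=None by default: '\n', '\r' and '\r\n' all split lines
    for line in f:
        num += 1
        print(line + " ...")
print(num)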
From what I've researched, the csv writer's writerow method should take in a list and then write it to the given csv file. Here's what I tried:
from csv import writer
with open('Test.csv', 'wb') as file:
    csvFile, count = writer(file), 0
    titles = ["Hello", "World", "My", "Name", "Is", "Simon"]
    csvFile.writerow(titles)
I'm just trying to write it so that each word is in a different column.
When I open the file that it creates, however, Excel shows me a warning. After pressing to continue anyway, I get a message saying that the file is either corrupted or is a SYLK file. I can then open the file, but only after going through two error messages every time I open it.
Why is this?
Thanks!
It's a documented issue that Excel will assume a csv file is SYLK if the first two characters are 'ID'.
Venturing into the realm of opinion: it shouldn't, but Excel thinks it knows better than the extension. To be fair, people expect it to be able to figure out cases where the extension really is wrong, but in a case like this, assuming the extension is wrong, and then further assuming the file is corrupt when it doesn't appear corrupt if interpreted according to the extension, is just mind-boggling.
@John Y points out:
One thing to watch out for: the "workaround" given by the Microsoft issue linked to by @PeterDeGlopper is to (manually) prepend an apostrophe to the file. (This is also advice commonly found on the web, including Stack Overflow, to try to force CSV digits to be treated as strings rather than numbers.) This is not what I'd call good advice, as it injects a literal apostrophe into your data.
@DSM suggests using quoting=csv.QUOTE_NONNUMERIC on the writer. Excel is not confused by a file beginning with "ID" rather than ID, so if the other tools that are going to work with the CSV accept that quoting level, this is probably the best solution, other than just ignoring Excel's confusion.
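A minimal sketch of that suggestion (the header row here is made up so that it actually starts with ID):

import csv

with open('Test.csv', 'wb') as f:
    w = csv.writer(f, quoting=csv.QUOTE_NONNUMERIC)
    w.writerow(["ID", "Name"])  # written as "ID","Name", so Excel no longer sees SYLK
    w.writerow([1, "Simon"])    # numeric fields stay unquoted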
I'm running into a problem that I haven't seen anyone on Stack Overflow encounter, or even found on Google for that matter.
My main goal is to be able to replace occurrences of a string in the file with another string. Is there a way to access all of the lines in the file?
The problem is that when I try to read in a large text file (1-2 GB) of text, Python only reads a subset of it.
For example, I'll do a really simply command such as:
newfile = open("newfile.txt", "w")
f = open("filename.txt", "r")
for line in f:
    replaced = line.replace("string1", "string2")
    newfile.write(replaced)
And it only writes the first 382 mb of the original file. Has anyone encountered this problem previously?
I tried a few different solutions such as using:
import fileinput
import sys

for i, line in enumerate(fileinput.input("filename.txt", inplace=1)):
    sys.stdout.write(line.replace("string1", "string2"))
But it has the same effect. Nor does reading the file in chunks such as using
f.read(10000)
I've narrowed it down to most likely being a reading problem and not a writing problem, because it happens even when simply printing out lines. I know that there are more lines: when I open the file in a full text editor such as Vim, I can see what the last line should be, and it is not the last line that Python prints.
Can anyone offer any advice or things to try?
I'm currently using a 32-bit version of Windows XP with 3.25 gb of ram, and running Python 2.7
Try:
f = open("filename.txt", "rb")
On Windows, rb means open file in binary mode. According to the docs, text mode vs. binary mode only has an impact on end-of-line characters. But (if I remember correctly) I believe opening files in text mode on Windows also does something with EOF (hex 1A).
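If you want to check that suspicion, here's a quick demonstration sketch (the filename is made up, and the truncation only shows up on Windows):

with open("demo.bin", "wb") as f:
    f.write("before\x1aafter")

print len(open("demo.bin", "r").read())   # on Windows, text mode stops at the 0x1A byte: 6
print len(open("demo.bin", "rb").read())  # binary mode reads all 12 characters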
You can also specify the mode when using fileinput:
fileinput.input("filename.txt", inplace=1, mode="rb")
Are you sure the problem is with reading and not with writing out?
Do you close the file that is written to, either explicitly newfile.close() or using the with construct?
Not closing the output file is often the source of such problems when buffering is going on somewhere. If that's the case in your setting too, closing should fix your initial solutions.
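For example, a minimal sketch of your loop with both files closed automatically, so all buffered output gets flushed:

with open("filename.txt", "r") as f, open("newfile.txt", "w") as newfile:
    for line in f:
        newfile.write(line.replace("string1", "string2"))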
If you use the file like this:
with open("filename.txt") as f:
for line in f:
newfile.write(line.replace("string1", "string2"))
It should only read into memory one line at a time, unless you keep a reference to that line in memory.
After each line is read it will be up to Python's garbage collector to get rid of it. Give this a try and see if it works for you :)
Found the solution thanks to Gareth Latty. Use an iterator:
def read_in_chunks(file, chunk_size=1000):
    while True:
        data = file.read(chunk_size)
        if not data:
            break
        yield data
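A usage sketch for that generator (process is a hypothetical placeholder for whatever you do with each chunk):

with open("filename.txt", "rb") as f:
    for chunk in read_in_chunks(f, chunk_size=4096):
        process(chunk)  # process() is a placeholder, not a real function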
This answer was posted as an edit to the question Python Does Not Read Entire Text File by the OP user1297872 under CC BY-SA 3.0.