First of all, I'm completely new to Python, so this question might be easy.
I want to read a text file and save it as a new file, separating the header of each email from the body. The header ends at the empty line under "X-UID: 81" (as seen in the image). Not all emails have "X-UID:", so the empty line is where I want to split. Is there an easy way to do this?
My code currently looks like this:
with open("1.txt") as fReader:
corpus = fReader.read()
loc = corpus.find("X-UID")
print(corpus[:loc])
This sort of works, but I can't split at the empty line, and I don't know how to save the result as a new file.
Example email
One way to do this is to read the entire file as one string, and split it by X-UID: 81 (provided that this substring is present, of course), like so:
parts = s.split('X-UID: 81')
header, body = parts[0], parts[1]
If the file doesn't contain X-UID: 81, you could just split() by double newline (\n\n) with maxsplit=1 to make sure it doesn't split further on the newlines in the email body:
parts = s.split('\n\n', maxsplit=1)
header, body = parts[0], parts[1]
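To save the result as new files, you can then simply write each part out. A minimal sketch, assuming the email is in 1.txt and using example output names header.txt and body.txt:
with open("1.txt") as fReader:
    corpus = fReader.read()

# Split at the first blank line; everything before it is the header.
header, body = corpus.split('\n\n', maxsplit=1)

with open("header.txt", "w") as fOut:
    fOut.write(header)
with open("body.txt", "w") as fOut:
    fOut.write(body)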
I need to manipulate a csv file: go through the file and look for blank fields between columns c0-c5 in my example csv file. Wherever there is a blank, I would like to replace it with any verbiage I want, like "not found".
The only code I have so far drops a column I do not need, but for the manipulation itself I really cannot find anything.. maybe it is not possible?
Also, I am wondering how to change a column name.. thanks..
#!/bin/env python
import pandas

data = pandas.read_csv('report.csv')
data = data.drop(['date'], axis=1)
data.to_csv('final_report.csv')
Alternatively, and taking your comment question into account (if you do not necessarily want to use pandas as in n1colas.m's answer), use string replacements and simply loop over your file:
with open("modified_file.csv","w") as of:
with open("report.csv", "r") as inf:
for line in inf:
if "#" not in line: # in the case your csv file has a comment marker somewhere and it is called #, the line is skipped, which means you get a clean comma separated value file as the outfile- if you do want to keep such lines simply remove the if condition
mystring=line.replace(", ,","not_found").replace("data","input") # in case it is not only one blank space you can also use the regex for n times blank space here
print(mystring, file=of, end=""); # prints the replaced line to outfile and writes no newline
I know this is not the most efficient way to do it, but it is probably the one where you can most easily understand what you are doing and modify it to your heart's desire.
For any reasonably sized csv file it should still work nearly instantaneously.
Also, for testing purposes always write such replacements to a separate file (of) instead of writing back to your input file, as your question seems to suggest. Check that it did what you wanted. ONLY THEN overwrite your input file. This may seem unnecessary at first, but mistakes happen...
You need to run this line:
data['data'] = data['data'].fillna("not found")
Here is the documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html
And here is an example:
import pandas
data = pandas.read_csv('final_report.csv')
data.info()
data['data'] = data['data'].fillna("Something")
print(data)
I would suggest changing the data variable to something different, because your column has the same name and that can be confusing.
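Since the question also asks how to change a column name, here is a minimal sketch using DataFrame.rename; the new name values is only an example:
import pandas

data = pandas.read_csv('final_report.csv')
# Rename the 'data' column so it no longer shares a name with the variable.
data = data.rename(columns={'data': 'values'})
data['values'] = data['values'].fillna('not found')
data.to_csv('final_report.csv', index=False)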
I wrote a function to update a JSON template with provided information, specifically an EAN, which I read from a file. The EAN ends up with \n at the end and I'm not able to get rid of it.
I tried replace('\n', '') and value[:-3], but that only affected the number itself.
I tried adding the parameter newline=""/None to the open function, but that only added a \r between the number and the \n.
eans.txt is simply a file containing EANs, each on a new line, without any gaps or tabs.
material_template = {
    "eanUpc": "",
}
def get_ean():
    with open('eans.txt', 'r') as x:
        first = x.readline()
        all = x.read()
    with open('eans.txt', 'w') as y:
        y.writelines(all[1:])
    return first

def make_material(material_template):
    material_template["eanUpc"] = get_ean()
    print(material_template)
    print(material_template["eanUpc"])

make_material(material_template)
{'eanUpc': '061500127178\n'}
061500127178
thanks in advance
Instead of return first use return first.strip("\n") to trim off the \n.
However, I question the organization of the original system... reading one line then writing the whole file back over the original.
What if the process fails during the processing of one line? The data is lost.
If this file is large, you'll be doing a lot of disk I/O.
Maybe read one line at a time, and save an offset into the file (last_processed) in another file? Then you can seek to that location.
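A minimal sketch of that offset idea, keeping eans.txt untouched; the eans.offset file name is hypothetical:
import os

OFFSET_FILE = 'eans.offset'  # hypothetical progress file

def get_ean():
    # Read the last processed offset, defaulting to the start of the file.
    offset = 0
    if os.path.exists(OFFSET_FILE):
        with open(OFFSET_FILE) as f:
            offset = int(f.read() or 0)
    with open('eans.txt') as x:
        x.seek(offset)            # jump past the lines already processed
        line = x.readline()
        new_offset = x.tell()     # position right after the line just read
    with open(OFFSET_FILE, 'w') as f:
        f.write(str(new_offset))  # persist progress instead of rewriting eans.txt
    return line.strip('\n')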
You should use return first.strip(); it removes leading and trailing whitespace, including the trailing \n.
I have a template of an html file with the header and footer. I try to insert text right after <tbody>.
The way I'm doing it right now is with fileinput.input():
def write_to_html(self, path):
    for line in fileinput.input(path, inplace=1):
        line = re.sub(r'CURRENT_APPLICATION', obj, line)
        line = re.sub(r'IN_PROGRESS', time.strftime("%Y-%m-%d %H:%M:%S"), line)
        line = re.sub(r'CURRENT_VERSION', svers, line)
        print line,  # preserve old content
        if "<tbody>" in line:
            print ("<tr>")
            ### PRINT MY STUFFS
            print ("</tr>")
I call this for each table line I have to add to my html table, but I have around 5k table lines to add (each line is about 30 lines of html code). It starts fast, but each line takes more and more time to be added. That's because it has to rewrite the whole file for each line, right?
Is there a way to speed up the process?
EDIT thanks for the responses:
I like the idea of building one big string and then going through the file just once.
I'll have to change some things, because right now the function I showed is in a class, and in my main program I just iterate over a folder containing .json files:
for json in jsonfolder:
    Object_a = CLASS-A(json)  # unserialization
    Object_a.write_to_html()  # the function I showed
I should turn that into:
block_of_lines = ''
for json in jsonfolder:
    Object_a = CLASS-A(json)  # unserialization
    block_of_lines += Object_a.to_html_string()

Create_html(block_of_lines)
Would that be faster?
Re-reading the question a couple more times, the following thought occurs.
Could you split the writing into 3 blocks: one for the header, one for the table lines, and one for the footer? It does rather seem to depend on what those three substitution lines are doing, but if I'm right, they only need to update lines the first time the template is used, i.e. while acting on the first json file, and then remain unchanged for the others.
footer = CLASS-A.write_html_header(path)
for json in jsonfolder:
    Object_a = CLASS-A(json)  # unserialization
    Object_a.write_to_html(path)  # use the part of the function
                                  # that just handles the json file here
CLASS-A.write_html_footer(path, footer)
Then in your class, define the two new functions to write the header and footer as static methods (which means they can be used from the class rather than just on an instance)
i.e. (using a copy from your own code)
@staticmethod
def write_html_header(path):
    footer = []
    save_for_later = False
    for line in fileinput.input(path, inplace=1):
        line = re.sub(r'CURRENT_APPLICATION', obj, line)
        line = re.sub(r'IN_PROGRESS', time.strftime("%Y-%m-%d %H:%M:%S"), line)
        line = re.sub(r'CURRENT_VERSION', svers, line)
        # this block prints the header, and saves the
        # footer from your template
        if save_for_later:
            footer.append(line)
        else:
            print line,  # preserve old content
        if "<tbody>" in line:
            save_for_later = True
    return footer
I do wonder why you're editing 'inplace'. Doesn't that mean the template gets overwritten, and thus it's less of a template and more of a single-use form? Normally when I use a template, I read from the template and write an edited version out to a new file. That way the template can be re-used time and time again.
For the footer section, open your file in append mode, and then write the lines in the footer array created by the call to the header writing function.
I do think not editing the template in place would benefit you. Then you'd just need to:
open the template (read only)
open the new_file (in new, write mode)
write the header into new_file
loop over json files
append table content into new_file
append the footer into new_file
That way you're never re-reading the bits of the file you created while looping over the json files. Nor are you trying to store the whole file in memory if that is a concern.
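A minimal sketch of that flow, assuming the header of template.html ends at the <tbody> line; make_table_rows is a hypothetical stand-in for your per-json row generation:
with open('template.html') as template, open('report.html', 'w') as out:
    # Header: copy everything up to and including the <tbody> line.
    for line in template:
        out.write(line)
        if '<tbody>' in line:
            break
    # Table content: append the rows for every json file.
    for json in jsonfolder:
        out.write(make_table_rows(json))  # hypothetical helper
    # Footer: the file iterator resumes right after <tbody>.
    for line in template:
        out.write(line)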
5000 lines is nothing. Read the entire file using f.readlines() to get a list of lines:
with open(path) as f:
    lines = f.readlines()
Then process each line, and eventually join them to one string and write the entire thing back to the file.
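For example, a sketch where process_line is a hypothetical stand-in for your substitutions and row insertion:
path = 'report.html'  # example path

def process_line(line):
    # hypothetical stand-in for the re.sub calls and <tbody> handling
    return line

with open(path) as f:
    lines = f.readlines()

new_lines = [process_line(line) for line in lines]

with open(path, 'w') as f:
    f.write(''.join(new_lines))  # one single write instead of thousands of rewrites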
So I was learning how to download files from the web using Python but got a bit thrown by one part of the code.
Here is the code:
from urllib import request

def download_stock_data(csv_url):
    response = request.urlopen(csv_url)
    csv = response.read()
    csv_str = str(csv)
    lines = csv_str.split("\\n")
    dest_url = r"stock.csv"
    fx = open(dest_url, "w")
    for line in lines:
        fx.write(line + "\n")
    fx.close()
I don't quite understand the code in the lines variable. How does it know where to split into new lines in a csv file?
A csv file is essentially just a text file with comma-separated data, but it also contains newlines (via the newline ascii character).
If the csv file consisted of one long single comma-separated line, for line in lines: would only see that single line.
You can open it up in notepad++ or something to see the raw .csv file. Excel will put data separated by commas into cells, and data on a new line into a new row.
"\n" is where the instruction to create a new line comes from.
In the code you have presented, you are telling Python to split the string you received on "\n", so you get a list of strings split into lines.
When you write to fx, you insert a newline character at the end of every line you write by appending "\n". If you didn't do this, you would just get one very long line.
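As a side note, str() on the raw bytes is what leaves literal "\n" sequences in the text (hence the split on "\\n" in the question). A sketch that decodes the bytes instead, assuming the feed is UTF-8:
from urllib import request

def download_stock_data(csv_url, dest_url="stock.csv"):
    response = request.urlopen(csv_url)
    csv_str = response.read().decode("utf-8")  # real newlines, not "\n" literals
    with open(dest_url, "w") as fx:
        for line in csv_str.split("\n"):
            fx.write(line + "\n")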
I am trying to get a script working where it will check the existence of an IP in a lookup csv file, and if it exists, take the third element and remove that third element from another (second) file. Here is an extract of what I have:
for line in fileinput.input(hostsURLFileLoc, inplace=1):
    elements = open(hostsLookFileLoc, 'r').read().split(".").split("\n")
    first = elements[0].strip()
    third = elements[2].strip()
    if first == hostIP:
        if line != third:
            print line.strip()
This obviously doesn't work. I have tried playing with a few options, but this is my latest (crazy) attempt.
I think the problem is that there are two input files open at once.
Any thoughts welcome,
Cheers
All right, even though I haven't got any response to my comment on the question, here's my shot at a general answer. If I've got something wrong, just say so and I'll edit to try to address the errors.
First, here are my assumptions. You have two files, whose names are stored in the HostsLookFileLoc and HostsURLFileLoc variables.
The file at HostsLookFileLoc is a CSV file, with an IP address in the third column of each row. Something like this:
HostsLookFile.csv:
blah,blah,192.168.1.1,whatever,stuff
spam,spam,82.94.164.162,eggs,spam
me,myself,127.0.0.1,and,I
...
The file at HostsURLFileLoc is a flat text file with one IP address per line, like so:
HostsURLFile.txt:
10.1.1.2
10.1.1.3
10.1.2.253
127.0.0.1
8.8.8.8
192.168.1.22
82.94.164.162
64.34.119.12
...
Your goal is to read and then rewrite the HostsURLFile.txt file, excluding all of the IP addresses that are found in the third column of a row in the CSV file. In the example lists above, localhost (127.0.0.1) and python.org (82.94.164.162) would be excluded, but the rest of the IPs in the list would remain.
Here's how I'd do it, in three steps:
Read in the CSV file and parse it using the csv module to find the IP addresses. Stick them into a set.
Open the flat file and read the IP addresses into a list, closing the file afterwards.
Reopen the flat file and overwrite it with the loaded list of addresses, skipping any that are contained in the set from the first step.
Code:
import csv

def cleanURLFile(HostsLookFileLoc, HostsURLFileLoc):
    """
    Remove IP addresses from the file at HostsURLFileLoc if they are in
    the third column of the file at HostsLookFileLoc.
    """
    with open(HostsLookFileLoc, "r") as hostsLookFile:
        reader = csv.reader(hostsLookFile)
        ipsToExclude = set(line[2].strip() for line in reader)
    with open(HostsURLFileLoc, "r") as hostsURLFile:
        ipList = [line.strip() for line in hostsURLFile]
    with open(HostsURLFileLoc, "w") as hostsURLFile:  # truncates the file!
        hostsURLFile.write("\n".join(ip for ip in ipList
                                     if ip not in ipsToExclude))
This code is deliberately simple. There are a few things that could be improved, if they are important to your use case:
If something crashes the program during the rewriting step, HostsURLFile.txt may be clobbered. A safer way of rewriting (at least, on Unix-style systems) is to write to a temp file, then after the writing has finished (and the file has been closed), rename the temp file over the top of the old file. That way, if the program crashes, you'll still have either the original version or a completely written replacement, but never anything in between (see the sketch after this list).
If the checking you needed to do was more complicated than set membership, I'd add an extra step between 2 and 3 to do the actual processing, then write the results out without further manipulation (other than adding newlines).
Speaking of newlines, if you have a trailing newline, it will be passed through as an empty string in the list of IP addresses, which should be OK for this scenario (it won't be in the set of IPs to exclude, unless your CSV file has a messed up line), but might cause trouble if you were doing something more complicated.
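A minimal sketch of that safer rewrite, assuming the new contents are already built as a single string; os.replace needs Python 3.3+:
import os
import tempfile

def safe_rewrite(path, new_contents):
    # Create the temp file in the same directory so the rename stays on one filesystem.
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(new_contents)
        os.replace(tmp_path, path)  # atomic: you get the old file or the new one, never half
    except BaseException:
        os.remove(tmp_path)
        raise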
In test file test.csv (note there is an IP address in there):
'aajkwehfawe;fh192.168.0.1awefawrgaer'
(I am pretty much ignoring that it is CSV for now. I am just going to use regex matches.)
# Get the file data
with open('test.csv', 'r') as f:
    data = f.read()

# Look for the IP:
find_ip = '192.168.0.1'

import re
# re.escape keeps the dots in the IP from matching any character
m = re.search('[^0-9]({})[^0-9]'.format(re.escape(find_ip)), data)
if m:  # found!
    # this is weird, because you already know the value in find_ip, but anyway...
    ip = m.group(1).split('.')
    print('Third ip = ' + ip[2])
else:
    print('Did not find a match for {}'.format(find_ip))
I do not understand the second part of your question, i.e. removing the third value from a second file. Are there numbers listed line by line, and do you want to find the line that contains the number found above and delete that line? If yes:
# Make a new list of lines that omits the matched one
new_lines = []
for line in open('iplist.txt', 'r'):
    if line.strip() != ip[2]:  # skip the matched line
        new_lines.append(line.rstrip('\n'))

# Replace the file with the new list of lines
with open('iplist.txt', 'w') as f:
    f.write('\n'.join(new_lines))
If, once you have found values in the first file, you need to remove them from the second file, I suggest something like this pseudocode:
Load first file into memory
Search the string representing the first file for matches using a regular expression
(in Python, check re.findall(regex, string), where regex = re.compile(r"[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}"); with a raw string a single backslash is enough to escape the dots)
Build up a list of all matches
Exit first file
Load the second file into memory
Search the string representing the second file for the start index and end index of each match
For each match, use the expression string = string[:start_of_match] + string[end_of_match:]
Re-write the string representing the second (now trimmed) file back to the second file
Essentially whenever you find a match, redefine the string to be the slices on either side of it, excluding it from the new string assignment. Then rewrite your string to a file.
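A rough translation of that pseudocode into Python; the file names lookup.csv and second_file.txt are assumptions, and the pattern allows one to three digits per octet:
import re

ip_pattern = re.compile(r"[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}")

# Load the first file and build up the set of matches.
with open('lookup.csv') as f:
    found_ips = set(ip_pattern.findall(f.read()))

# Load the second file into memory.
with open('second_file.txt') as f:
    text = f.read()

# For each match, slice it out: string = string[:start] + string[end:]
for ip in found_ips:
    start = text.find(ip)
    while start != -1:
        text = text[:start] + text[start + len(ip):]
        start = text.find(ip)

# Re-write the now trimmed string to the second file.
with open('second_file.txt', 'w') as f:
    f.write(text)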