So I was learning how to download files from the web using Python, but I got a bit thrown by one part of the code.
Here is the code:
from urllib import request

def download_stock_data(csv_url):
    response = request.urlopen(csv_url)
    csv = response.read()
    csv_str = str(csv)
    lines = csv_str.split("\\n")
    dest_url = r"stock.csv"
    fx = open(dest_url, "w")
    for line in lines:
        fx.write(line + "\n")
    fx.close()
I don't quite understand the code behind the variable lines. How does it know where to split a csv file into new lines?
A csv file is essentially just a text file with comma-separated data, but it also contains newlines (via the newline ASCII character).
If a csv file consisted of one long comma-separated line, then for line in lines: would only see that single line.
You can open it up in Notepad++ or something to see the raw .csv file. Excel will put data separated by commas into cells, and data on a new line into a new row.
"\n" is where the instruction to create a new line comes from.
In the code you have presented, you are telling python to split the string you received based upon "\n". So you get a list of strings split into lines.
When you write to fx, you are inserting a newline character onto every line you write by appending "\n". If you didn't do this, then you would just get one very long line.
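A minimal sketch of what splitting on and re-joining with "\n" does (the sample data here is made up):

```python
# A made-up three-line CSV, already decoded to str.
csv_text = "date,close\n2024-01-02,101.5\n2024-01-03,99.8"

lines = csv_text.split("\n")   # break the text at each newline character
# lines is now a list of three strings, one per CSV row

rejoined = "\n".join(lines)    # putting "\n" back between lines restores the text
```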
First of all, I'm completely new to Python, so this question might be easy.
I want to read a text file and save it as a new file, separating the header of an email from the body. The place where the header ends is the empty line under "X-UID: 81" (as seen in the image). Not all emails have "X-UID:", so the empty line is where I want to separate it. Is there an easy way to do this?
My code currently looks like this:
with open("1.txt") as fReader:
corpus = fReader.read()
loc = corpus.find("X-UID")
print(corpus[:loc])
This sort of works, but I can't separate at the empty line, and I don't know how to save the result as a new file.
One way to do this is to read the entire file as one string, and split it by X-UID: 81 (provided that this substring is present, of course), like so:
parts = s.split('X-UID: 81')
header, body = parts[0], parts[1]
If the file doesn't contain X-UID: 81, you could just split() by double newline (\n\n) with maxsplit=1 to make sure it doesn't split further on the newlines in the email body:
parts = s.split('\n\n', maxsplit=1)
header, body = parts[0], parts[1]
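To also save the two pieces as new files, a minimal sketch (the sample email text and the output filenames below are made up):

```python
# Made-up email text; the first blank line separates header from body.
s = "From: a@example.com\nSubject: hi\n\nHello there.\n\nBye.\n"

header, body = s.split("\n\n", 1)   # maxsplit=1 keeps later blank lines in the body

with open("header.txt", "w") as out:  # example output filenames
    out.write(header)
with open("body.txt", "w") as out:
    out.write(body)
```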
I am working with a vcf file and trying to extract information from it, but the file has formatting errors.
The file has a column that contains long character strings. The error is that a number of tabs and a newline character are erroneously placed within some rows of this column, so when I try to read in this tab-delimited file, all the columns are messed up.
I have an idea how to solve this, but don't know how to execute it in code. The string is DNA, so it always consists of the characters ATCG. Basically, if one could look for a run of tabs and a newline between ATCG characters and remove them, the file would be fixed:
ACTGCTGA\t\t\t\t\nCTGATCGA would become:
ACTGCTGACTGATCGA
So one would need to look into this file, look for [ACTG] followed by tabs or newlines, followed by more [ACTG], and then replace this with nothing. Any idea how to do this?
with open('file.vcf', 'r') as f:
    lines = [l for l in f if not l.startswith('##')]
Here's one way with regex:
First read the file in:
import re

with open('file.vcf', 'r') as file:
    dnafile = file.read()
Then write a new file with the changes:
with open('fileNew.vcf', 'w') as file:
    file.write(re.sub(r"(?<=[ACTG]{2})((\t)*(\n)*)(?=[ACTG]{2})", "", dnafile))
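A quick check of that substitution against the broken row from the question:

```python
import re

s = "ACTGCTGA\t\t\t\t\nCTGATCGA"  # the example row from the question

# Remove tabs/newlines only when flanked by DNA characters on both sides.
fixed = re.sub(r"(?<=[ACTG]{2})((\t)*(\n)*)(?=[ACTG]{2})", "", s)
# fixed == "ACTGCTGACTGATCGA"
```

Note that the lookbehind must stay fixed-width ([ACTG]{2} is fine; a variable-width lookbehind would raise an error in Python's re module.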
I'm writing a function to update a JSON template with provided information, specifically an EAN, which I read from a file. The EAN comes back with \n at the end and I'm not able to get rid of it.
I tried replace('\n', '') and value[:-3], but that only affects the number.
I tried adding the parameter newline=""/None to the open function, which only added a \r between the number and the \n.
eans.txt is simply a file containing EANs, each on a new line, without any gaps or tabs.
material_template = {
    "eanUpc": "",
}

def get_ean():
    with open('eans.txt', 'r') as x:
        first = x.readline()
        all = x.read()
    with open('eans.txt', 'w') as y:
        y.writelines(all[1:])
    return first

def make_material(material_template):
    material_template["eanUpc"] = get_ean()
    print(material_template)
    print(material_template["eanUpc"])

make_material(material_template)
{'eanUpc': '061500127178\n'}
061500127178
thanks in advance
Instead of return first, use return first.strip("\n") to trim off the \n.
However, I question the organization of the original system: reading one line, then writing the whole file back over the original.
What if the process fails during the processing of one line? The data is lost.
If this file is large, you'll be doing a lot of disk I/O.
Maybe read one line at a time, and save an offset into the file (last_processed) into another file? Then you can seek to that location.
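A rough sketch of that offset idea (the state filename and its format here are made up):

```python
def next_ean(data_path="eans.txt", state_path="last_offset.txt"):
    # Load the byte offset of the last processed line, defaulting to 0.
    try:
        with open(state_path) as f:
            offset = int(f.read() or 0)
    except FileNotFoundError:
        offset = 0

    with open(data_path) as f:
        f.seek(offset)            # skip lines that were already processed
        line = f.readline()
        new_offset = f.tell()

    if line:                      # only advance the offset if we read something
        with open(state_path, "w") as f:
            f.write(str(new_offset))

    return line.strip() or None   # strip the trailing "\n"; None at end of file
```

This way eans.txt is never rewritten, so a crash mid-line loses at most the offset update, not the data.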
You could also use return first.strip(); with no argument it removes leading as well as trailing whitespace.
I had a data table I converted to a text file with a tilde ~ at the end of each line. This is how I ended each line, so it is not a delimiter.
I used Linux to fold the data into an 80 byte length wrapped text file and added a line feed at the end of each line.
Example (if I did this at 10 bytes per line):
Original file or table:
abcdefghigklmnop~
1234567890~
New file:
abcdefghig
klmnop~123
4567890~
Linux/Unix, Perl, or even Python responses would help and be appreciated.
I need the new file to look exactly like the original. Sometimes line lengths will be over 80 characters in length which is ok.
If your original data was delimited by ~\n (a tilde at the end of each line), and the new format removed those newlines and inserted new ones every 80 bytes, then the reverse is to remove the newlines and replace ~ with ~\n again:
with open(inputfile, 'r') as inp, open(outputfile, 'w') as outp:
    data = inp.read().replace('\n', '')
    outp.write(data.replace('~', '~\n'))
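Checked against the 10-byte example above (assuming the fold inserted plain \n characters):

```python
folded = "abcdefghig\nklmnop~123\n4567890~\n"  # the wrapped file from the example

data = folded.replace("\n", "")       # undo the fold: drop every inserted newline
restored = data.replace("~", "~\n")   # a tilde marks a true end of line: put "\n" back
# restored == "abcdefghigklmnop~\n1234567890~\n", the original two lines
```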
So I have a TSV file that contains the locations of parks, and I'm trying to append each one to a base Google Maps API address, to eventually write a GeoJSON file.
Here's what the issue is: I can't get the formatting down so that the address I have is concatenated to the base Google Maps API URL. The basic code is this:
import csv

def geocode(address):
    url = ("http://maps.googleapis.com/maps/api/geocode/json?"
           "sensor=false&address={0}".format(address.replace(" ", "+")))
    print(url)

with open("MovieParksFixed.tsv", "rU") as f:
    reader = csv.DictReader(f, delimiter="\t")
    for line in reader:
        response = geocode(line['Location'])
but running this outputs:
http://maps.googleapis.com/maps/api/geocode/json?sensor=false&address=
Edgebrook+Park,+Chicago+
http://maps.googleapis.com/maps/api/geocode/json?sensor=false&address=
Gage+Park,+Chicago+
and so on, where the first line just won't connect to the second line. So what I end up with is http://maps.googleapis.com/maps/api/geocode/json?sensor=false&address=
and then Edgebrook+Park,+Chicago+ on the following line, but not connected.
I swear it's like there's a hidden newline or something that's screwing it up...
I had to manually edit one or two cells of the parsed TSV file a bit in Excel (but it still looks fine now - https://github.com/yongcho822/Movies-in-the-park/blob/master/MovieParksFixed.tsv)... did that screw it all up or something?
note: the original TSV file when written was obviously delimited by tabs...
Before you interpolate your string into the URL string, try the following:
address.strip().replace(" ", "+")
The strip() method will remove all leading and trailing whitespace (tabs, newlines, spaces, etc.). The final line becomes:
url = ("http://maps.googleapis.com/maps/api/geocode/json?"
"sensor=false&address={0}".format(address.strip().replace(" ", "+")))