I want to be able to append to a .txt file each time I run a function.
The output I am trying to write from the function is something like this:
somelist = ['a','b','b','c']
somefloat = -0.64524
sometuple = (235,633,4245,524)
output = (somelist, somefloat, sometuple) (the output does not need to be in tuple format.)
Right now, I am outputting like this:
outfile = open('log.txt','a')
out = str(output) + '\n'
outfile.write(out)
This kind of works, but I have to import it like this:
with open('log.txt', "r") as myfile:
    mydata = myfile.readlines()

for line in mydata:
    line = eval(line)
Ideally, I would like to be able to import it back directly into a Pandas DataFrame something like this:
dflog = pd.read_csv('log.txt')
and have it generate a three column dataset with the first column containing a list (string format is fine), the second column containing a float, and the third column containing a tuple (same deal as the list).
My questions are:
Is there a way to append the output in a format that can be more easily imported into pandas?
Is there a simpler way of doing this? It seems like a pretty common task, so I wouldn't be surprised if somebody has already reduced it to a line or two of code.
One way to do this is to separate your columns with a custom separator such as '|'
Say:
somelist = ['a','b','b','c']
somefloat = -0.64524
sometuple = (235,633,4245,524)
output = str(somelist) + "|" + str(somefloat) + "|" + str(sometuple)
(if you want many more columns, use '|'.join() or something like that)
Then, just as before:
outfile = open('log.txt','a')
out = output + '\n'
outfile.write(out)
And just read the whole file with
pd.read_csv("log.txt", sep='|')
Do note that using lists or tuples for an entry in pandas is discouraged (I couldn't find an official reference for that, though). For speedups with operations, you might consider dividing your tuples or lists into separate columns so that you're left with floats, integers or simple strings. Pandas can easily handle automatic naming if you so need.
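For example, a minimal sketch of the round trip (the column names passed to read_csv are just an assumption for illustration):

import pandas as pd

somelist = ['a', 'b', 'b', 'c']
somefloat = -0.64524
sometuple = (235, 633, 4245, 524)

# Append one '|'-separated row per call.
with open('log.txt', 'a') as outfile:
    fields = [str(somelist), str(somefloat), str(sometuple)]
    outfile.write('|'.join(fields) + '\n')

# Read the whole log back; header=None because the file has no header row.
dflog = pd.read_csv('log.txt', sep='|', header=None,
                    names=['somelist', 'somefloat', 'sometuple'])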
Related
I have a dataframe that consists of lines that look like:
"{'displayName':'MartinscroftTramStop','locationIdentifier':'STATION^15306','normalisedSearchTerm':'MARTINSCROFTTRAMSTOP'}"
How do I split this into columns? I've tried str.slice[stop and start].
I suspect it's all the quotes, but finding and replacing them doesn't seem to work either.
You can handle the first problem, the string object, using the eval() function. It evaluates the string, so it returns the dict itself.
For the second one, the dict structure, you have multiple choices. Here is one solution:
import pandas as pd
# Transform the string into a dict
dict_data=eval("{'displayName':'MartinscroftTramStop','locationIdentifier':'STATION^15306','normalisedSearchTerm':'MARTINSCROFTTRAMSTOP'}")
# Organize the data
columns_name = dict_data.keys()
data_list = [list(dict_data.values())] # a row must be a list inside a list
pd.DataFrame(data_list, columns=columns_name)
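If you have a whole DataFrame column of such strings, a sketch along the same lines (the column name 'raw' is an assumption) is to apply eval row-wise and expand the resulting dicts into columns; ast.literal_eval from the standard library does the same job here and is safer on untrusted input:

import pandas as pd

df = pd.DataFrame({'raw': [
    "{'displayName':'MartinscroftTramStop','locationIdentifier':'STATION^15306','normalisedSearchTerm':'MARTINSCROFTTRAMSTOP'}",
]})

# eval() turns each string into a dict; pd.Series spreads the keys into columns.
expanded = df['raw'].apply(eval).apply(pd.Series)
print(expanded)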
These are two example rows of my tab-delimited file:
id reference_rc_001 alternative_rc_001 reference_rc_002 alternative_rc_002 reference_rc_003 alternative_rc_003
id1 0 433 0 0 69 0
I would like to merge fields every two columns. The example output should look like this. This is a step of a Python script, so it has to be done with Python.
id reference_rc_001alternative_rc_001 reference_rc_002alternative_rc_002 reference_rc_003alternative_rc_003
id1 0433 00 690
This looks really horrible, is probably the worst way to do this, and might be about as efficient as a donkey, but...
I think it works.
You'll need to open the file, preferably using a with so you can iterate across the lines in the file. (Many other SO articles demonstrate doing that with a decent explanation, so I'm not going to.)
Then use the bit of code inside my demonstration for loop:
for line in file:
    items = line.split("\t")
    counters = range(len(items)/2)
    new_items = [items[0]] + [items[1+2*x] + items[2+2*x] for x in counters]
    new_line = '\t'.join(new_items)
    print new_line
To explain:
I'm splitting each line into a list (using the tab as delimiter).
Then I'm creating a new list by indexing the n and n+1 elements of the list and adding them together (as strings).
Finally, to recreate a line of text with tab separated entries, I'm joining the new list back together with tab delimiters.
Hopefully that gives you the pieces you might need for your solution.
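Putting those pieces together, here is a minimal sketch for Python 3 (the file names are placeholders; note that Python 3 needs // for the integer division the Python 2 snippet above does with /):

# Merge every pair of data columns in a tab-delimited file.
with open('input.tsv') as infile, open('output.tsv', 'w') as outfile:
    for line in infile:
        items = line.rstrip('\n').split('\t')
        counters = range(len(items) // 2)
        new_items = [items[0]] + [items[1 + 2*x] + items[2 + 2*x] for x in counters]
        outfile.write('\t'.join(new_items) + '\n')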
I'm trying to learn Python and am working on making an external merge sort using an input file with ints. I'm using heapq.merge, and my code almost works, but it seems to be sorting my lines as strings instead of ints. If I try to convert to ints, writelines won't accept the data. Can anyone help me find an alternative? Additionally, am I correct in thinking this will allow me to sort a file bigger than memory (given adequate disk space)?
import itertools
from itertools import islice
import tempfile
import heapq
#converts heapq.merge to ints
#def merge(*temp_files):
# return heapq.merge(*[itertools.imap(int, s) for s in temp_files])
with open("path\to\input", "r") as f:
    temp_file = tempfile.TemporaryFile()
    temp_files = []
    elements = []
    while True:
        elements = list(islice(f, 1000))
        if not elements:
            break
        elements.sort(key=int)
        temp_files.append(elements)
        temp_file.writelines(elements)
        temp_file.flush()
        temp_file.seek(0)
        with open("path\to\output", "w") as output_file:
            output_file.writelines(heapq.merge(*temp_files))
Your elements are read as strings by default, so you have to do something like:
elements = list(islice(f, 1000))
elements = [int(elem) for elem in elements]
so that they would be interpreted as integers instead.
That would also mean that you need to convert them back to strings when writing, e.g.:
temp_file.writelines([str(elem) for elem in elements])
Apart from that, you would need to convert your elements again to int for the final merging. In your case, you probably want to uncomment your merge method (and then convert the result back to strings again, same way as above).
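As a sketch of just that chunk-handling part (it assumes f is the open input file from the question and uses a text-mode temporary file):

from itertools import islice
import tempfile

temp_file = tempfile.TemporaryFile(mode='w+')        # text mode so str lines can be written
elements = [int(elem) for elem in islice(f, 1000)]   # strings -> ints
elements.sort()
temp_file.writelines('{}\n'.format(elem) for elem in elements)  # ints -> text lines again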
Your code doesn't make much sense to me (temp_files.append(elements)? Merging inside the loop?), but here's a way to merge files sorting numerically:
import heapq
files = open('a.txt'), open('b.txt')
with open('merged.txt', 'w') as out:
    out.writelines(map('{}\n'.format,
                       heapq.merge(*(map(int, f)
                                     for f in files))))
First, the map(int, ...) turns each file's lines into ints. Then those get merged with heapq.merge. Then map('{}\n'.format, ...) turns each of the integers back into a string, with a newline. Then writelines writes those lines. In other words, you were already close; you just had to convert the ints back to strings before writing them.
A different way to write it (might be clearer for some):
import heapq
files = open('a.txt'), open('b.txt')
with open('merged.txt', 'w') as out:
    int_streams = (map(int, f) for f in files)
    int_stream = heapq.merge(*int_streams)
    line_stream = map('{}\n'.format, int_stream)
    out.writelines(line_stream)
And in any case, do use itertools.imap if you're using Python 2 as otherwise it'll read the whole files into memory at once. In Python 3, you can just use the normal map.
And yes, if you do it right, this will allow you to sort gigantic files with very little memory.
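For completeness, here is one way the whole thing could be assembled (a sketch for Python 3; the file names are placeholders and the chunk size is arbitrary):

import heapq
import tempfile
from itertools import islice

chunk_size = 1000
run_files = []

# Phase 1: split the input into sorted runs, one temporary file per run.
with open('input.txt') as f:
    while True:
        chunk = [int(line) for line in islice(f, chunk_size)]
        if not chunk:
            break
        chunk.sort()
        run = tempfile.TemporaryFile(mode='w+')   # text-mode temp file
        run.writelines('{}\n'.format(n) for n in chunk)
        run.seek(0)
        run_files.append(run)

# Phase 2: k-way merge of the runs, comparing numerically.
with open('output.txt', 'w') as out:
    out.writelines('{}\n'.format(n)
                   for n in heapq.merge(*(map(int, run) for run in run_files)))

for run in run_files:
    run.close()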
You are doing a k-way merge within the loop, which adds a lot of runtime complexity. Better to store the file handles in a separate list and perform the k-way merge once.
You also don't have to strip the newline and add it back; just sort based on the number:
sorted(elements, key=lambda no: int(no.strip()))
The rest is fine.
https://github.com/melvilgit/external-Merge-Sort/blob/master/README.md
A newbie question for which I apologize if it's basic.
I have a set, myset, that is filled by reading a csv file; see the printed representation below.
set(['value1', 'value2'])
The number of elements in the set is arbitrary, depending on the file read. I want to add entries to a csv file using the individual elements of the set. I've tried:
file_row = ['#Entry','Time', str(myset), 'cpu usage']
print file_row
filewriter.writerow(file_row)
However the output I get is:
#Entry,Time,"set(['value1', 'value2'])",cpu usage
where I actually wanted
#Entry,Time,value1,value2,cpu usage.
Can you suggest how to get my desired result?
You could approach this as follows:
file_row = ['#Entry','Time'] # start with pre-myset elements
file_row.extend(myset) # add on myset
file_row.append('cpu usage') # add final item
Note that using a set means the order of elements will also be arbitrary.
If you want to do it all in one line:
file_row = ['#Entry','Time'] + [x for x in myset] + ['cpu usage']
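A minimal sketch of how that row fits with the csv module (the csv.writer setup here is an assumption about how filewriter was created; list(myset) does the same job as the comprehension):

import csv

myset = set(['value1', 'value2'])

with open('out.csv', 'w', newline='') as f:
    filewriter = csv.writer(f)
    file_row = ['#Entry', 'Time'] + list(myset) + ['cpu usage']
    filewriter.writerow(file_row)
    # e.g. #Entry,Time,value2,value1,cpu usage  (set order is arbitrary)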
I am trying to get an output such as this:
169.764569892, 572870.0, 19.6976
However, I have a problem because the files that I am inputting have a format similar to the output I just showed, but some lines in the data have 'nan' as a value, which I need to remove.
I am trying to use this to do so:
TData_Pre_Out = map(itemgetter(0, 7, 8), HDU_DATA)
TData_Pre_Filter = [Data for Data in TData_Pre_Out if Data != 'nan']
Here I am trying to use a list comprehension to get the 'nan' to go away, but the output still displays it. Any help on properly filtering this would be much appreciated.
EDIT: The improper output looks like this:
169.519361471, nan, nan
instead of what I showed above. Also, some more info: 1) This is coming from a special data file, not a text file, so splitting lines won't work. 2) The input is exactly the same as the output, just mapped using the map() line that I show above and split into the indices I actually need (i.e. instead of using all of a data list like L = [(1,2,3),(3,4,5)] I only pull 1 and 3 from that list, to give you the gist of the data structure).
The Data is read in as so:
with pyfits.open(allfiles) as HDU:
    HDU_DATA = HDU[1].data
The syntax is from a specialized program but you get the idea
TData_Pre_Out = map(itemgetter(0, 7, 8), HDU_DATA)
This statement gives you a list of tuples, and then you compare each tuple with a string. All the != comparisons succeed.
Without showing how you read in your data, the solution can only be guessed.
However, if HDU_DATA stores real NaN values, try following:
Comparing a variable to NaN does not work with the equality operator ==:
foo == nan
always gives False when nan and foo are both NaN.
Use math.isnan() instead:
import math
...if math.isnan(Data)…
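For example, if the values really are floats, a sketch of dropping any row that contains a NaN (the indices 0, 7, 8 and HDU_DATA come from the question; everything else is an assumption):

import math
from operator import itemgetter

TData_Pre_Out = map(itemgetter(0, 7, 8), HDU_DATA)
# Keep only rows where none of the three selected values is NaN.
TData_Pre_Filter = [row for row in TData_Pre_Out
                    if not any(math.isnan(value) for value in row)]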
Based on my understanding of your description, this could work
with open('path/to/file') as infile:
    for line in infile:
        vals = line.strip().split(',')
        print [v for v in vals if v != 'nan']