I am pretty new to Python and am busy with a bootcamp; one of the tasks I have to complete has me a bit stumped. They give us an input txt file that looks like the following:
min:1,2,3,4,5,6
max:1,2,3,4,5,6
avg:1,2,3,4,5,6
The task is to open the txt file in my program and then work out the min, max and avg of each line. I can do this the long way with .readlines(), but they want it done generically so that the order of the lines doesn't matter: I should read through the lines with a loop, check the first word of each, and let that word determine which operation to run.
I hope that I have phrased the question correctly.
Regards
While your question wasn't entirely clear about how you are meant to use readlines, maybe this is what you were looking for:
f = open("store.txt", "r")
for i in f.readlines():
    func, data = i.split(":")
    data = [int(j) for j in data.rstrip('\n').split(",")]
    print(func, end=":")
    if func == "max":
        print(max(data))
    elif func == "min":
        print(min(data))
    else:
        print(sum(data) / len(data))
f.close()
Next time, please try to show your work and ask about specific errors, i.e. not how to solve the problem but how to change your solution to fix the problem you are facing.
eval() may be useful here.
The name of the math operation to perform is conveniently the first word of each line in the text file, and two of them (min, max) are Python built-ins. So after parsing each line into a math expression, I found it tempting to just use Python's eval function to perform the operations on the list of numbers.
Note: this is a one-off solution, as use of eval on unknown data is discouraged, but it is safe here since we control the input data.
avg is not a built-in function, so we can define it (and any other operations that are not built-ins) with a lambda.
with open('input.txt', 'r') as f:
    data = f.readlines()

clean = [d.strip('\n').split(':') for d in data]
lines = []

# define operations in the input file that are not built-in functions
avg = lambda x: sum(x) / float(len(x))  # float for an accurate result

for i in clean:
    lines.append([i[0], list(map(int, i[1].split(',')))])

for expr in lines:
    info = '{}({})'.format(str(expr[0]), str(expr[1]))
    print('{} = {}'.format(info, eval('{op}({d})'.format(op=expr[0], d=expr[1]))))
output:
min([1, 2, 3, 4, 5, 6]) = 1
max([1, 2, 3, 4, 5, 6]) = 6
avg([1, 2, 3, 4, 5, 6]) = 3.5
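If you'd rather not use eval at all, the same dispatch can be done with a plain dictionary mapping each keyword to a function. A minimal sketch, assuming the same input.txt format:

# eval-free variant: look the operation up in a dict instead
operations = {
    'min': min,
    'max': max,
    'avg': lambda x: sum(x) / len(x),
}

with open('input.txt') as f:
    for line in f:
        op, _, numbers = line.strip().partition(':')
        values = [int(n) for n in numbers.split(',')]
        print('{}: {}'.format(op, operations[op](values)))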
This seems like it should be very simple, but I am not sure of the proper syntax in Python. To streamline my code I want a while loop (or for loop if better) to cycle through 9 datasets, using the counter to pick the correct file.
I would like to use the "i" variable within the while loop so that, for each file with a sequential name, I can get the average of 2 arrays, the max-min of this delta, and the max-min of another array.
Below is example code of what I am trying to do, but avg(i) and calling temp(i) in the loop does not seem proper. Thank you very much for any help; I will continue to look for solutions but am unsure how best to phrase this in a search.
temp1 = pd.read_excel("/content/113VW.xlsx")
temp2 = pd.read_excel("/content/113W6.xlsx")
..-> temp9
i=1
while i<=9
avg(i) =np.mean(np.array([temp(i)['CC_H='],temp(i)['CC_V=']]),axis=0)
Delta(i)=(np.max(avg(i)))-(np.min(avg(i)))
deltaT(i)=(np.max(temp(i)['temperature='])-np.min(temp(i)['temperature=']))
i+= 1
E.g. the slow method would be repeating this code for each file:
avg1 = np.mean(np.array([temp1['CC_H='], temp1['CC_V=']]), axis=0)
Delta1 = np.max(avg1) - np.min(avg1)
deltaT1 = np.max(temp1['temperature=']) - np.min(temp1['temperature='])
avg2 = np.mean(np.array([temp2['CC_H='], temp2['CC_V=']]), axis=0)
Delta2 = np.max(avg2) - np.min(avg2)
deltaT2 = np.max(temp2['temperature=']) - np.min(temp2['temperature='])
......
Think of things in terms of lists.
import numpy as np
import pandas as pd

temps = []
for name in ('113VW', '113W6', ...):
    temps.append(pd.read_excel(f"/content/{name}.xlsx"))

avg = []
Delta = []
deltaT = []
for data in temps:
    avg.append(np.mean(np.array([data['CC_H='], data['CC_V=']]), axis=0))
    Delta.append(np.max(avg[-1]) - np.min(avg[-1]))
    deltaT.append(np.max(data['temperature=']) - np.min(data['temperature=']))
You could just do your computations inside the first loop, if you don't need the dataframes after that point.
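For instance, a compact sketch of that single-loop variant (reusing the question's column names; the file list is abbreviated):

import numpy as np
import pandas as pd

avg, Delta, deltaT = [], [], []
for name in ('113VW', '113W6'):  # extend with the remaining seven names
    data = pd.read_excel(f"/content/{name}.xlsx")
    a = np.mean(np.array([data['CC_H='], data['CC_V=']]), axis=0)
    avg.append(a)
    Delta.append(np.max(a) - np.min(a))
    deltaT.append(np.max(data['temperature=']) - np.min(data['temperature=']))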
The way that I would tackle this problem would be to create a list of filenames, and then iterate through them to do the necessary calculations as per the following:
import numpy as np
import pandas as pd

# Place the files to read into this list
files_to_read = ["/content/113VW.xlsx", "/content/113W6.xlsx"]

results = []
for i, filename in enumerate(files_to_read):
    temp = pd.read_excel(filename)
    avg_val = np.mean(np.array([temp['CC_H='], temp['CC_V=']]), axis=0)
    Delta = np.max(avg_val) - np.min(avg_val)
    deltaT = np.max(temp['temperature=']) - np.min(temp['temperature='])
    results.append({"avg": avg_val, "Delta": Delta, "deltaT": deltaT})

# Create a dataframe to show the results
df = pd.DataFrame(results)
print(df)
I have included enumerate to grab the index (i) should you want to access it for anything or include it in the results. For example, you could change the results.append line to something like this:
results.append({"index":i, "Filename":filename, "avg":avg_val, "Delta":Delta, "deltaT":deltaT})
Not sure if I understood the question correctly, but if you want to read the files inside a loop using an index (the i variable), you can create a list to hold the contents of the Excel files instead of using 9 different variables.
Something like:
files = []
files.append(pd.read_excel("/content/113VW.xlsx"))
files.append(pd.read_excel("/content/113W6.xlsx"))
...
then use the index variable to iterate over the list (note that list indexes start at 0, so the nine files live at files[0] through files[8]):
avgs = []
i = 0
while i < 9:
    avgs.append(np.mean(np.array([files[i]['CC_H='], files[i]['CC_V=']]), axis=0))
    ...
    i += 1
P.S.: I am not a pandas/NumPy expert, so you may have to adapt the code to your needs.
I'm new to Python (pandas, NumPy, etc.).
I'd like to know the best, most performant approach to this task.
I have a huge file in the following format, except that everything is on one line:
{"order_reference":"0658-2147","billing_address_zip_code":"8800"}
{"order_reference":"0453-2200","billing_address_zip_code":"8400"}
{"order_reference":"0554-3027","billing_address_zip_code":"8820"}
{"order_reference":"0382-3108","billing_address_zip_code":"3125"}
{"order_reference":"0534-4059","billing_address_zip_code":"3775"}
{"order_reference":"0118-1566","billing_address_zip_code":"3072"}
{"order_reference":"0384-6897","billing_address_zip_code":"8630"}
{"order_reference":"0361-5226","billing_address_zip_code":"4716"}
{"order_reference":"0313-6812","billing_address_zip_code":"9532"}
{"order_reference":"0344-6262","billing_address_zip_code":"3600"}
What is the easiest way to read this file into a dictionary in Python or a DataFrame in pandas? The goal is to join on billing_address_zip_code with a big JSON file to get more insight into the order_reference values.
I was thinking of solving it with regular expressions, but as the file is huge and needs to be joined to another file, I think I should use pandas, shouldn't I?
Or, as all records have the same length, I could also split by fixed length.
Is there a pandas function for that? I guess this would be the fastest way, but as it isn't standard JSON, I don't know how to do it.
I'm sorry for the beginner questions, but I searched quite a bit on the internet and couldn't find the right answer, and it would really help me to figure out the right approach to these kinds of tasks.
For any help or links, I'm very thankful.
Simon
PS: Which cloud environment do you use for these kinds of tasks? Which one works best with Python and the data science libraries?
UPDATE
I used the following code to format it into valid JSON and loaded it with json.loads() successfully:
# syntax: Python 3
import json
#small test file
my_list = "["+open("orders_play_around.json").read().replace("}{","},\n{")+"]"
d = json.loads(my_list)
So far, so good. Now the next challenge: how do I join this JSON data with another JSON file on billing_address_zip_code?
The other JSON looks like this:
{
"data": [
{
"BFS-Nr": 1,
"Raum mit städtischem Charakter 2012": 4,
"Typologie der MS-Regionen 2000 (2)": 3,
"E": 679435,
"Zusatzziffer": 0,
"Agglomerationsgrössenklasse 2012": 1,
"Gemeinde-typen (9 Typen) 2000 (1)": 4,
"N": 235653,
"Stadt/Land-Typologie 2012": 3,
"Städte 2012": 0,
"Gemeinde-Grössenklasse 2015": 7,
"BFS Nr.": 1,
"Sprachgebiete 2016": 1,
"Europäsiche Berggebietsregionen (2)": 1,
"Gemeindename_1": "Aeugst am Albis",
"Anwendungsgebiete für Steuerer-leichterungen 2016": 0,
"Kantonskürzel": "ZH",
"Kanton": 1,
"Metropolräume 2000 (2)": 1,
"PLZ": 8914,
"Bezirk": 101,
"Gemeindetypologie 2012\n(25 Typen)": 237,
"Raumplanungs-regionen": 105,
"Gemeindetypologie 2012\n(9 Typen)": 23,
"Agglomerationen und Kerne ausserhalb Agglomerationen 2012": 261,
"Ortschaftsname": "Aeugst am Albis",
"Arbeitsmarktregionen 2000 (2)": 10,
"Gemeinde-typen\n(22 Typen) 2000 (1)": 11,
"Städtische / Ländliche Gebiete 2000 (1)": 2,
"Grossregionen": 4,
"Gemeindename": "Aeugst am Albis",
"MS-Regionen (2)": 4,
"Touris-mus Regionen 2017": 3,
"DEGURBA 2011 eurostat": 3
},
    {....}
  ]
}
What is the easiest way to join them on the key PLZ from plz.js and billing_address_zip_code from orders_play_around.json?
I could load it without any problems:
plz_data=open('plz.js').read()
plz = json.loads(plz_data)
Sorry for the long message, but hopefully someone can help me with this easy problem. The goal is to plot it on a map or a graph, where I can see which PLZ (zip code) has the most orders.
Since you mention turning your file to proper JSON is your initial goal, and you don't mind sed, try:
sed 's|}{|}\n{|g' originalfile > result
Note I added newlines, not commas; probably better for your future editing. You can use the -i flag so sed edits in place, but this way is safer. If you really want to use Python, it's not a big deal with the standard library. Safest is to read character by character:
with open("originalfile") as fd:
while True:
ch=fd.read(1)
if not ch: break
if ch =="{": print("\n")
print(ch,end="")
or just replace and print (I have never tested the limits of Python, but I'm guessing this will work):
print(open("originalfile").read().replace("}{","}\n{"))
No need for regex here; it's overkill. Once this is a proper JSON file it will be easier to use, including loading it through pandas.read_json.
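For example, once the file has one JSON object per line (the output of the sed command above), both the loading and the join asked about in the update can be done in pandas. A sketch, assuming the file names and the plz.js structure shown in the question:

import json
import pandas as pd

# one JSON object per line (NDJSON) loads directly
orders = pd.read_json("orders_play_around.json", lines=True)

# plz.js holds its records under the "data" key
with open("plz.js") as f:
    plz = pd.DataFrame(json.load(f)["data"])

# align the join-key dtypes (zip codes are quoted strings in the orders file)
orders["billing_address_zip_code"] = orders["billing_address_zip_code"].astype(int)

merged = orders.merge(plz, left_on="billing_address_zip_code",
                      right_on="PLZ", how="left")

# e.g. which PLZ has the most orders
print(merged["PLZ"].value_counts().head())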
Here's one way.
data = []
with open("originalfile") as fp:
    for l in fp:
        clean_line = [x.replace("{", "").replace("}\n", "").replace("\"", "") for x in l.split(",")]
        data.append(clean_line)
Then you can convert the data list into a pandas dataframe and export to JSON.
import pandas

df = pandas.DataFrame(data)
df.to_json()
If you want to remove the text, e.g. "billing_address_zip_code", and keep only data, then you can do
data = []
with open(filepath) as fp:
    for l in fp:
        splitted = [x.split(":")[1] for x in l.split(",")]
        data.append([x.replace("}\n", "").replace("\"", "") for x in splitted])
I'm converting python2.7 scripts to python3.
2to3 makes these kinds of suggestions:
  result = result.split(',')
  syslog_trace("Result : {0}".format(result), False, DEBUG)
- data.append(map(float, result))
+ data.append(list(map(float, result)))
  if (len(data) > samples):
      data.pop(0)
  syslog_trace("Data : {0}".format(data), False, DEBUG)
  # report sample average
  if (startTime % reportTime < sampleTime):
-     somma = map(sum, zip(*data))
+     somma = list(map(sum, list(zip(*data))))
      # not all entries should be float
      # 0.37, 0.18, 0.17, 4, 143, 32147, 3, 4, 93, 0, 0
      averages = [format(sm / len(data), '.3f') for sm in somma]
I'm sure the makers of Python3 did not want to do it like that. At least, it gives me a "you must be kidding" feeling.
Is there a more pythonic way of doing this?
What's wrong with the unconverted somma?
2to3 cannot know how somma is going to be used. In this case, consumed as a generator in the next line to compute averages, it is fine and optimal; there is no need to convert it to a list.
That's the genius of the Python 3 list-to-iterator changes: most people consumed those lists as one-shot iterables anyway, wasting precious memory materializing lists they did not need.
# report sample average
if (startTime % reportTime < sampleTime):
    somma = map(sum, zip(*data))
    # not all entries should be float
    # 0.37, 0.18, 0.17, 4, 143, 32147, 3, 4, 93, 0, 0
    averages = [format(sm / len(data), '.3f') for sm in somma]
Of course the first statement, unconverted, will fail, since we append a generator where we need a list. In that case, the error is quickly fixed.
If left like this, data.append(map(float, result)), the next trace shows something fishy: 'Data : [<map object at 0x00000000043DB6A0>]', which you can quickly fix by converting to a list as 2to3 suggested.
2to3 does its best to create running Python 3 code, but it does not replace manual rework or produce optimal code. When you are in a hurry you can apply it, but always check the diffs against the old code, as the OP did.
The -3 option of the latest Python 2 versions prints warnings where Python 3 would raise an error. It's another approach, better suited for when you have more time to perform your migration.
I'm sure the makers of Python3 did not want to do it like that
Well, the makers of Python generally don't like seeing Python 2 being used, I've seen that sentiment being expressed in pretty much every recent PyCon.
Is there a more pythonic way of doing this?
That really depends on your interpretation of Pythonic. List comprehensions seem more intuitive in your case: you want to construct a list, so there's no need to create an iterator with map or zip and then feed it to list().
Now, why 2to3 chose list() wrapping instead of comprehensions, I do not know; probably it was the easiest to implement.
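Concretely, the two 2to3-converted lines could be written as comprehensions instead (a sketch reusing the variables from the question):

# instead of: data.append(list(map(float, result)))
data.append([float(x) for x in result])

# instead of: somma = list(map(sum, list(zip(*data))))
somma = [sum(column) for column in zip(*data)]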
I am a beginner in Python (and in programming). I have a large file containing a repeating pattern of 3 lines with numbers, then 1 empty line, and so on.
If I print the file it looks like:
1.93202838
1.81608154
1.50676177
2.35787777
1.51866227
1.19643624
...
I want to take each group of three numbers as one vector, do some math operations on it, write the result to a new file, and then move on to the next three lines, i.e. the next vector. Here is my code (it doesn't work):
import math

inF = open("data.txt", "r+")
outF = open("blabla.txt", "w")
a = []
fin = []
b = []
for line in inF:
    a.append(line)
    if line.startswith(" \n"):
        fin.append(b)
        h1 = float(fin[0])
        k2 = float(fin[1])
        l3 = float(fin[2])
        h = h1/(math.sqrt(h1*h1+k1*k1+l1*l1)+1)
        k = k1/(math.sqrt(h1*h1+k1*k1+l1*l1)+1)
        l = l1/(math.sqrt(h1*h1+k1*k1+l1*l1)+1)
        vector = [str(h), str(k), str(l)]
        outF.write('\n'.join(vector)
        b = a
        a = []
inF.close()
outF.close()
print "done!"
I want to get "vector" from each 3 lines in my file and put it into blabla.txt output file. Thanks a lot!
My 'code comment' answer:
Take care to close all parentheses, in order to match the opened ones! (This is very likely to raise a SyntaxError ;-).)
fin is created as an empty list and is never filled; trying to access any value via fin[n] is therefore very likely to break with an IndexError;
k2 and l3 are created but never used;
k1 and l1 are used but never created; this is very likely to break with a NameError;
b is assigned from a, so it is a list. But then you do fin.append(b): what do you expect by appending (not extending) a list?
Hope this helps!
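For reference, a minimal sketch of what the loop was probably meant to do, assuming each blank-line-separated block of three numbers is one (h, k, l) vector and that the +1 in the denominator is intentional:

import math

with open("data.txt") as inF, open("blabla.txt", "w") as outF:
    block = []
    for line in inF:
        stripped = line.strip()
        if stripped:                      # collect the three numbers of a block
            block.append(float(stripped))
        if len(block) == 3:               # one full vector: normalise and write
            h1, k1, l1 = block
            norm = math.sqrt(h1*h1 + k1*k1 + l1*l1) + 1
            outF.write('\n'.join(str(v / norm) for v in block) + '\n')
            block = []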
This is only in the answers section for length and formatting.
Input and output.
Control flow
I know nothing of vectors; you might want to look into the math module or NumPy.
Those links should hopefully give you all the information you need to at least get started with this problem. As yuvi said, the code won't be written for you, but you can come back when you have something that isn't working as you expected or that you don't fully understand.
How can I read n lines from a file at a time, instead of just one, when iterating over it? I have a file with a well-defined structure, and I would like to do something like this:
for line1, line2, line3 in file:
    do_something(line1)
    do_something_different(line2)
    do_something_else(line3)
but it doesn't work:
ValueError: too many values to unpack
For now I am doing this:
for line in file:
    do_something(line)
    newline = file.readline()
    do_something_else(newline)
    newline = file.readline()
    do_something_different(newline)
    ... etc.
which sucks because I am writing endless 'newline = file.readline()' lines that clutter the code.
Is there any smart way to do this? (I really want to avoid reading the whole file at once, because it is huge.)
Basically, your file is an iterator which yields your file one line at a time. This turns your problem into how to yield several items at a time from an iterator. A solution to that is given in this question. Note that the function islice is in the itertools module, so you will have to import it from there.
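A short sketch of that islice approach (the file name is hypothetical); each call to islice(f, 3) consumes the next three lines of the iterator:

from itertools import islice

with open("data.txt") as f:
    while True:
        chunk = list(islice(f, 3))    # next three lines, fewer at EOF
        if not chunk:
            break
        line1, line2, line3 = chunk   # assumes the line count is a multiple of 3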
If it's XML, why not just use lxml?
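If the file really is XML, a streaming parse avoids loading it all at once. A minimal sketch with the standard library's xml.etree.ElementTree, whose iterparse API lxml mirrors; the tag name and file name are hypothetical:

import xml.etree.ElementTree as ET

for event, elem in ET.iterparse("huge.xml"):
    if elem.tag == "record":   # hypothetical element name
        do_something(elem)
        elem.clear()           # free memory as you go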
You could use a helper function like this:
def readnlines(f, n):
    lines = []
    for x in range(0, n):
        lines.append(f.readline())
    return lines
Then you can do something like you want:
while True:
    line1, line2, line3 = readnlines(file, 3)
    if not line1:
        break  # readline() returns '' at end of file
    do_stuff(line1)
    do_stuff(line2)
    do_stuff(line3)
That being said, if you are using XML files, you will probably be happier in the long run if you use a real XML parser...
itertools to the rescue:
import itertools

def grouper(n, iterable, fillvalue=None):
    "grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx"
    args = [iter(iterable)] * n
    return itertools.izip_longest(fillvalue=fillvalue, *args)  # zip_longest in Python 3

fobj = open(yourfile, "r")
for line1, line2, line3 in grouper(3, fobj):
    pass
for i in file produces a str, so you can't just do for i, j, k in file and read it in batches of three (try a, b, c = 'bar' and a, b, c = 'too many characters' and look at the values of a, b and c to work out why you get the "too many values to unpack").
It's not entirely clear what you mean, but if you're doing the same thing for each line and just want to stop at some point, then do it like this:
for line in file_handle:
    do_something(line)
    if some_condition:
        break  # don't want to read anything else
(Also, don't use file as a variable name; you're shadowing a builtin.)
If you're doing the same thing, why do you need to process multiple lines per iteration?
for line in file is your friend. It is in general much more efficient than manually reading the file, both in terms of I/O performance and memory.
Do you know something about the length of the lines/format of the data? If so, you could read in the first n bytes (say 80*3) and use f.read(240).split("\n")[0:3].
If you want to be able to use this data over and over again, one approach might be to do this:
lines = []
for line in file_handle:
    lines.append(line)
This will give you a list of the lines, which you can then access by index. Also, when you say a HUGE file, the size is most likely trivial; Python can process thousands of lines very quickly.
Why can't you just do:
ctr = 0
for line in file:
    if ctr == 0:
        ....
    elif ctr == 1:
        ....
    ctr = (ctr + 1) % 3  # cycle 0, 1, 2 so the pattern repeats every three lines
If you find the if/elif construct ugly, you could just create a hash table or list of function pointers and then do:
for line in file:
    function_list[ctr]()
or something similar
It sounds like you are trying to read from disk in parallel... that is really hard to do. All the solutions given to you are realistic and legitimate. You shouldn't let something put you off just because the code "looks ugly". The most important thing is how efficient/effective it is; if the code is messy, you can tidy it up, but don't look for a whole new method of doing something just because you don't like how one way of doing it looks in code.
As for running out of memory, you may want to check out pickle.
It's possible to do it with a clever use of the zip function. It's short, but a bit voodoo-ish for my tastes (hard to see how it works). It cuts off any lines at the end that don't fill a group, which may be good or bad depending on what you're doing. If you need the final lines, itertools.izip_longest might do the trick.
zip(*[iter(inputfile)] * 3)
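Used directly in a loop it looks like this (a sketch with the question's placeholder functions and a hypothetical file name):

with open("data.txt") as f:
    for line1, line2, line3 in zip(*[iter(f)] * 3):
        # the same iterator is advanced three times per pass
        do_something(line1)
        do_something_different(line2)
        do_something_else(line3)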
Doing it more explicitly and flexibly, this is a modification of Mats Ekberg's solution:
def groupsoflines(f, n):
    while True:
        group = []
        for i in range(n):
            try:
                group.append(next(f))
            except StopIteration:
                if group:
                    tofill = n - len(group)
                    yield group + [None] * tofill
                return
        yield group

for line1, line2, line3 in groupsoflines(inputfile, 3):
    ...
N.B. If this runs out of lines halfway through a group, it will fill in the gaps with None, so that you can still unpack it. So, if the number of lines in your file might not be a multiple of three, you'll need to check whether line2 and line3 are None.