I have a file with ID numbers along with the specifics of each event that is logged (time, temp, location).
I want Python to group all records with the same ID into their own file, storing all of the event specifics from each record.
That is to say: go through each record; if the ID does not have a log file yet, create one; if it does, append the new record to that ID's file.
Example input
1, 11:00, 70, port A
1, 11:02, 70, port B
2, 11:00, 40, blink
3, 11:00, 30, front
Desired output
file name "1" with :[11:00, 70, port A ;
11:02, 70, port B ]
file name "2" with :[11:00, 40, blink]
file name "3" with :[11:00, 30, front]
I am very new to python and I am having trouble finding a reference guide. If anyone knows a good place where I can look for an answer I would appreciate it.
This is pretty straightforward, presuming your file is as you described it. However, I am on Python 2.7, so I am not sure what the differences could be.
from collections import defaultdict

# Read every record; each line looks like "id, time, temp, location"
all_lines = open('c:\\path_to_my_file.txt').readlines()
my_classified_lines = defaultdict(list)
for line in all_lines:
    data_type, value1, value2, value3 = line.split(',')
    # value3 keeps its trailing newline, so writelines() below emits one record per line
    my_classified_lines[data_type].append(','.join([value1, value2, value3]))
for data_type in my_classified_lines:
    outref = open('c:\\directory\\' + data_type + '.txt', 'w')
    outref.writelines(my_classified_lines[data_type])
    outref.close()
To understand this you need to learn about dictionaries (useful containers for data), file operations and loops.
I found Dive into Python a great resource when I was starting out.
After looking at your desired output, I may not have exactly what you want. Mine would look like:
11:00, 70, port A
11:02, 70, port B
I think you are saying that you want a list-like object with semicolons as the separator, which to me suggests you are asking for a string with brackets around it. If your output is really as you describe it, then try this:
for data_type in my_classified_lines:
    outref = open('c:\\directory\\' + data_type + '.txt', 'w')
    # strip the trailing newlines so the joined records end up on a single line
    out_list = [';'.join([item.strip() for item in my_classified_lines[data_type]])]
    outref.writelines(out_list)  # should be only one line
    outref.close()
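For readers on Python 3, the same grouping idea can be sketched as below. This is only an illustration under the assumption that every record has the form "id, time, temp, location"; the function names group_by_id and write_groups are made up for this sketch and do not come from the answer above.

```python
import os
from collections import defaultdict


def group_by_id(lines):
    """Group records by their first field (hypothetical helper)."""
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        # split only once: everything after the first comma is the record body
        record_id, rest = line.split(",", 1)
        groups[record_id.strip()].append(rest.strip())
    return groups


def write_groups(groups, out_dir):
    """Write one file per id, one record per line (hypothetical helper)."""
    for record_id, records in groups.items():
        path = os.path.join(out_dir, record_id + ".txt")
        with open(path, "w") as out:
            out.write("\n".join(records) + "\n")
```

Typical use would be to open the log file, pass the file object to group_by_id, and then call write_groups with the resulting dictionary and an existing output directory.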
I'm trying to parse variables and the intervals in which they live from a file. For example, an input file might contain something like this:
Constants
x1 = 1;
Variables
x45 in [-45,5844];
x63 in [0,456];
x41 in [45, 1.e8]; #Where 1.e8 stands for 10^8
All I want to do is store every (variable, interval) pair in a dictionary. If it's a constant, the interval would be [constant, constant]. I first imagined I had to use re.findall to search through the whole file for all the lines of the type "x"random_number" in "random_interval" or "x"random_number = random_number", but I don't know how to extract and store the "x" name and the interval after I find all the lines I wanted.
Also, whenever there is a "1.e8" in an interval, I want to replace it with "10^8" before storing it in the dictionary.
Any clue?
Thanks for helping me solve my problem.
Honestly I don't exactly know how to begin. All I did is this :
from sys import argv
import re
script, filename = argv
# The file that contains the variables is filename, so that I can run this from
# a terminal as: python myscript.py filename.txt
file_data = open(filename, 'r')
Dict = {}  # The dictionary in which I want to store the variables and the intervals
txt = file_data.read()
# Here I thought I might use re.findall(.*"in".*, txt) or something like that,
# but it will get me all the lines of the type "blabla in bloblo".
# I also want blabla and bloblo so that I can put them into my dictionary like this:
# Dict[blabla] = bloblo
# In the dictionary it will be, for example, Dict = {x45 : [-10, 456]}
file_data.close()
To answer Akilan,
if the file is the following :
Constants
x1 = 5;
Variables
x56 in [3,12];
x78 in [1,4];
It should return the following :
{"x1" : [5,5], "x56" : [3,12], "x78" : [1,4]}
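One possible way to get both the name and the bounds is a regular expression with capture groups, applied line by line. The sketch below is an assumption-laden illustration: it assumes names always look like "x" followed by digits and that every statement ends with ";". The names LINE_RE and parse_intervals are invented for this example, and the bounds are kept as strings so that "1.e8" can be rewritten as "10^8".

```python
import re

# Hypothetical pattern: a name such as x45, then either "= value"
# or "in [low, high]", followed by a semicolon.
LINE_RE = re.compile(
    r"^\s*(x\d+)\s*(?:=\s*([^;\s]+)|in\s*\[\s*([^,\s]+)\s*,\s*([^\]\s]+)\s*\])\s*;"
)


def parse_intervals(text):
    """Return {name: [low, high]} with the bounds kept as strings."""
    result = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # section headers such as "Constants", or blank lines
        name, const, low, high = match.groups()
        if const is not None:
            low = high = const  # a constant becomes [constant, constant]
        result[name] = [bound.replace("1.e8", "10^8") for bound in (low, high)]
    return result
```

If numeric bounds are needed instead, each string could be converted with float() before storing, at the cost of losing the "10^8" spelling.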
I'm more of a SQL guy but I was asked a question that stumped me during an interview. I'll put the gist of it here:
there is a flatfile with two columns: 'Course' and 'Student_id' with several rows
Course: Science, Math, Science, History, Science, Math
Student_id: 101, 103, 102, 101, 103, 101
Using only base Python with no packages or libraries, how would you go about grouping the students by course, returning the count of students in each course, returning 'Science' with the number of students enrolled, and returning 'Math' with each student_id enrolled?
I knew how I would go about this in SQL and with pandas but did not know how to go about this in base python without packages or libraries. Please help.
You can build a dictionary with courses as keys and keep sets of student ids.
(You could keep lists of student ids but then you might end up with duplicates which would skew your numbers, although maybe that's something you should check and warn about or stop with an error.)
dict has a method setdefault which creates a value for a key only if it doesn't already exist, and returns the value. If you pass it a set it will return that set and you can add the latest student id:
course_students = {}
# input_path is a placeholder for the path of the flat file
with open(input_path) as flatfile:
    for line in flatfile:
        course, student_id = line.split(',')
        # strip whitespace and the trailing newline so keys and ids compare cleanly
        course_students.setdefault(course.strip(), set()).add(student_id.strip())
print(len(course_students['Science']))
print(course_students['Math'])
Edit:
It seems I misread your description of the file format. This solution works if you get two rows with comma-separated values in them, not for lots of rows with two comma-separated values each.
I am leaving it in as it is an MCVE for the file format I thought you faced.
You could do this:
data = """Course: Science, Math, Science, History, Science, Math
Student_id: 101, 103, 102, 101, 103, 101"""
fn = "data.txt"
# write file
with open(fn,"w") as f:
    f.write(data)
With that file you:
# read file
d = {}
with open(fn,"r") as f:
    for line in f:
        c,cc = line.split(":")
        d[c] = [x.strip() for x in cc.split(",")]
# create a (course,student)-tuple list
tups = list(zip( d["Course"],d["Student_id"]))
# create a dict of course : student_list
# you can streamline this using defaultdict from collections but that needs an import
courses = {}
for course,student in tups: # iterate, create course:pupillist dict
    if course in courses:
        courses[course].append(student)
    else:
        courses[course] = [student]
# print all (including Science) with amount of pupils
for k in courses:
    print(k, len(courses[k]))
# print Math + StudentIds
print("Math:", courses["Math"])
Output:
Science 3
Math 2
History 1
Math: ['103', '101']
I'm new to Python (pandas, NumPy, etc.).
I'd like to know the best and most performant approach to this task.
I have a huge file that has the following format, except that everything is on one line:
{"order_reference":"0658-2147","billing_address_zip_code":"8800"}
{"order_reference":"0453-2200","billing_address_zip_code":"8400"}
{"order_reference":"0554-3027","billing_address_zip_code":"8820"}
{"order_reference":"0382-3108","billing_address_zip_code":"3125"}
{"order_reference":"0534-4059","billing_address_zip_code":"3775"}
{"order_reference":"0118-1566","billing_address_zip_code":"3072"}
{"order_reference":"0384-6897","billing_address_zip_code":"8630"}
{"order_reference":"0361-5226","billing_address_zip_code":"4716"}
{"order_reference":"0313-6812","billing_address_zip_code":"9532"}
{"order_reference":"0344-6262","billing_address_zip_code":"3600"}
What is the easiest way to read this file into a dictionary in Python or a DataFrame in pandas? The goal is to join the billing_address_zip_code to a big JSON file to get more insight into the order_reference.
I was thinking of solving it with a regular expression, but as the file is huge and needs to be joined to another file, I think I should use pandas, shouldn't I?
Or, as all records have the same length, I could also split by length.
Is there a pandas function for that? I guess this would be the fastest way, but as it isn't standard JSON, I don't know how to do it.
I'm sorry for the beginner questions, but I searched quite a bit on the internet and couldn't find the right answer. It would really help me to figure out the right approach to this kind of task.
I'm very thankful for any help or links.
Simon
PS: Which cloud environment do you use for this kind of tasks? Which works best with python and the data science libraries?
UPDATE
I used the following code to format into a valid JSON and loaded it with json.loads() successfully:
# syntax: Python 3
import json
#small test file
my_list = "["+open("orders_play_around.json").read().replace("}{","},\n{")+"]"
d = json.loads(my_list)
So far so good. Now the next challenge, how do I join this json dictionary with another JSON file that has a join on the billing_address_zip_code?
The other JSON looks like this:
{
"data": [
{
"BFS-Nr": 1,
"Raum mit städtischem Charakter 2012": 4,
"Typologie der MS-Regionen 2000 (2)": 3,
"E": 679435,
"Zusatzziffer": 0,
"Agglomerationsgrössenklasse 2012": 1,
"Gemeinde-typen (9 Typen) 2000 (1)": 4,
"N": 235653,
"Stadt/Land-Typologie 2012": 3,
"Städte 2012": 0,
"Gemeinde-Grössenklasse 2015": 7,
"BFS Nr.": 1,
"Sprachgebiete 2016": 1,
"Europäsiche Berggebietsregionen (2)": 1,
"Gemeindename_1": "Aeugst am Albis",
"Anwendungsgebiete für Steuerer-leichterungen 2016": 0,
"Kantonskürzel": "ZH",
"Kanton": 1,
"Metropolräume 2000 (2)": 1,
"PLZ": 8914,
"Bezirk": 101,
"Gemeindetypologie 2012\n(25 Typen)": 237,
"Raumplanungs-regionen": 105,
"Gemeindetypologie 2012\n(9 Typen)": 23,
"Agglomerationen und Kerne ausserhalb Agglomerationen 2012": 261,
"Ortschaftsname": "Aeugst am Albis",
"Arbeitsmarktregionen 2000 (2)": 10,
"Gemeinde-typen\n(22 Typen) 2000 (1)": 11,
"Städtische / Ländliche Gebiete 2000 (1)": 2,
"Grossregionen": 4,
"Gemeindename": "Aeugst am Albis",
"MS-Regionen (2)": 4,
"Touris-mus Regionen 2017": 3,
"DEGURBA 2011 eurostat": 3
},
{....}
]
}
What is the easiest way to join them on a key PLZ from plz.js and billing_address_zip_code from orders_play_around.json?
I could load the JSON file without any problems:
plz_data=open('plz.js').read()
plz = json.loads(plz_data)
Sorry for the long message. But hopefully, someone can help me with this easy problem. The goal would be to plot it on a map or on a graph, where I can see which PLZ (zipcode) has the most orders.
Since you mention turning your file to proper JSON is your initial goal, and you don't mind sed, try:
sed 's|}{|}\n{|g' originalfile > result
Note I added in newlines, not commas. Probably better for your future editing. You can use the -i flag so sed edits in place, but this is safer. If you really want to use Python, it's not a big deal with standard Python. Safest is to read character by character:
with open("originalfile") as fd:
    first = True
    while True:
        ch = fd.read(1)
        if not ch:
            break
        # start a new line before every object except the first one
        if ch == "{" and not first:
            print()
        first = False
        print(ch, end="")
or just replace and print (I never tested the limits of Python, but I'm guessing this will work):
print(open("originalfile").read().replace("}{","}\n{"))
No need for regex for this; it's a bit of overkill. Once the file has one JSON object per line it will be easier to use, including loading it through pandas.read_json (with lines=True for this one-object-per-line layout).
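As an alternative to string replacement, the standard library's json.JSONDecoder.raw_decode can walk through back-to-back objects directly, which avoids touching any "}{" that might occur inside a string value. The following is a minimal sketch under the assumption that the whole file fits in memory; iter_json_objects is a name made up for this example.

```python
import json


def iter_json_objects(text):
    """Yield each JSON object from a string of concatenated objects."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not accept leading whitespace, so skip it first
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        # raw_decode returns the parsed object and the index where it ended
        obj, pos = decoder.raw_decode(text, pos)
        yield obj
```

Something like list(iter_json_objects(open("originalfile").read())) would then give a list of dictionaries, one per order.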
Here's one way.
data = []
with open("originalfile") as fp:
    for l in fp:
        # drop the braces, the newline and the quotes, then split into "key:value" pieces
        clean_line = [x.replace("{","").replace("}\n","").replace("\"","") for x in l.split(",")]
        data.append(clean_line)
Then you can convert the data list into a pandas DataFrame and export it to JSON.
import pandas
df = pandas.DataFrame(data)
df.to_json()
If you want to remove the key text, e.g. "billing_address_zip_code", and keep only the data, then you can do
data = []
with open(filepath) as fp:
    for l in fp:
        # keep only the part after the colon of each "key":"value" pair
        splitted = [x.split(":")[1] for x in l.split(",")]
        data.append([x.replace("}\n","").replace("\"","") for x in splitted])
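For the join the question asks about, a plain-Python sketch could look like the following. It assumes the orders have already been parsed into a list of dictionaries and that the second file has the shape shown in the question, with a "data" list whose entries carry "PLZ" and "Ortschaftsname". Note that PLZ is a number there while billing_address_zip_code is a string, so one side has to be converted. The function name orders_per_place is hypothetical.

```python
from collections import Counter


def orders_per_place(orders, plz):
    """Count orders per zip code and attach the place name where one is known."""
    # index the PLZ entries by zip code, converted to str to match the orders
    by_plz = {str(entry["PLZ"]): entry for entry in plz["data"]}
    counts = Counter(order["billing_address_zip_code"] for order in orders)
    rows = []
    for zip_code, n_orders in counts.most_common():
        entry = by_plz.get(zip_code)
        name = entry["Ortschaftsname"] if entry else None  # None if no match
        rows.append((zip_code, name, n_orders))
    return rows
```

The result is a list of (zip code, place name, order count) tuples sorted by count, which is a convenient starting point for a bar chart or a map. If several entries share one PLZ, this sketch keeps only the last of them.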
About a year back, I wrote a little program in python that basically automates a part of my job (with quite a bit of assistance from you guys!) However, I ran into a problem. As I kept making the program better and better, I realized that Python did not want to play nice with excel, and (without boring you with the details suffice to say xlutils will not copy formulas) I NEED to have more access to excel for my intentions.
So I am starting back at square one with VB (2010 Express if it helps.) The only programming course I ever took in my life was on it, and it was pretty straight forward so I decided I'd go back to it for this. Unfortunately, I've forgotten much of what I had learned, and we never really got this far down the rabbit hole in the first place. So, long story short I am trying to:
1) Read data from a .csv structured as so:
41,332.568825,22.221759,-0.489714,eow
42,347.142926,-2.488763,-0.19358,eow
46,414.9969,19.932693,1.306851,r
47,450.626074,21.878299,1.841957,r
48,468.909171,21.362568,1.741944,r
49,506.227269,15.441723,1.40972,r
50,566.199838,17.656284,1.719818,r
51,359.069935,-11.773073,2.443772,l
52,396.321911,-8.711589,1.83507,l
53,423.766684,-4.238343,1.85591,l
2) Sort that data alphabetically by column 5
3) Then selecting only the ones with an "l" in column 5, sort THOSE numerically by column 2 (ascending order) AND copy them to a new file called coil.csv
4) Then selecting only the ones that have an "r" in column 5, sort those numerically by column 2 (descending order) and copy them to the SAME file coil.csv (appended after the others obviously)
After all of that hoopla I wish to get out:
51,359.069935,-11.773073,2.443772,l
52,396.321911,-8.711589,1.83507,l
53,423.766684,-4.238343,1.85591,l
50,566.199838,17.656284,1.719818,r
49,506.227269,15.441723,1.40972,r
48,468.909171,21.362568,1.741944,r
47,450.626074,21.878299,1.841957,r
46,414.9969,19.932693,1.306851,r
I realize that this may be a pretty involved question, and I certainly understand if no one wants to deal with all this bs, lol. Anyway, some full on code, snippets, ideas or even relevant links would be GREATLY appreciated. I've been, and still am googling, but it's harder than expected to find good reliable information pertaining to this.
P.S. Here is the piece of Python code that did what I am talking about (although it created two separate files for the lefts and rights, which I don't really need), if it helps you at all.
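import csv
# assumption: msgbox, fileopenbox and boolbox are the easygui dialog helpers
from easygui import msgbox, fileopenbox, boolbox
msgbox(msg="Please locate your survey file in the next window.")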
mainfile = fileopenbox(title="Open survey file")
toponame = boolbox(msg="What is the name of the shots I should use for topography? Note: TOPO is used automatically",choices=("Left","Right"))
fieldnames = ["A","B","C","D","E"]
surveyfile = open(mainfile, "r")
left_file = open("left.csv",'wb')
right_file = open("right.csv",'wb')
coil_file = open("coil1.csv","wb")
reader = csv.DictReader(surveyfile, fieldnames=fieldnames, delimiter=",")
left_writer = csv.DictWriter(left_file, fieldnames + ["F"], delimiter=",")
sortedlefts = sorted(reader,key=lambda x:float(x["B"]))
surveyfile.seek(0,0)
right_writer = csv.DictWriter(right_file, fieldnames + ["F"], delimiter=",")
sortedrights = sorted(reader,key=lambda x:float(x["B"]), reverse=True)
coil_writer = csv.DictWriter(coil_file, fieldnames, delimiter=",",extrasaction='ignore')
for row in sortedlefts:
    if row["E"] == "l" or row["E"] == "cl+l":
        row['F'] = '%s,%s' % (row['B'], row['D'])
        left_writer.writerow(row)
        coil_writer.writerow(row)
for row in sortedrights:
    if row["E"] == "r":
        row['F'] = '%s,%s' % (row['B'], row['D'])
        right_writer.writerow(row)
        coil_writer.writerow(row)
One option you have is to start with a class to hold the fields. This allows you to override the ToString method to facilitate the output. Then, it's a fairly simple matter of reading each line and assigning the values to a list of the class. In your case you'll want the extra step of making two lists, sorting one in descending order, and combining them:
Class Fields
    Property A As Double = 0
    Property B As Double = 0
    Property C As Double = 0
    Property D As Double = 0
    Property E As String = ""
    Public Overrides Function ToString() As String
        Return Join({A.ToString, B.ToString, C.ToString, D.ToString, E}, ",")
    End Function
End Class
Function SortedFields(filename As String) As List(Of Fields)
    SortedFields = New List(Of Fields)
    ' Rows marked "r" are collected separately so they can be sorted in descending order
    Dim test As New List(Of Fields)
    Using sr As New IO.StreamReader(filename)
        Do Until sr.EndOfStream
            Dim fieldarray() As String = sr.ReadLine.Split(","c)
            ' Skip malformed lines and rows whose last field starts with "e" (eow)
            If fieldarray.Length = 5 AndAlso Not fieldarray(4)(0) = "e"c Then
                If fieldarray(4) = "r" Then
                    test.Add(New Fields With {.A = Double.Parse(fieldarray(0)), .B = Double.Parse(fieldarray(1)), .C = Double.Parse(fieldarray(2)), .D = Double.Parse(fieldarray(3)), .E = fieldarray(4)})
                Else
                    SortedFields.Add(New Fields With {.A = Double.Parse(fieldarray(0)), .B = Double.Parse(fieldarray(1)), .C = Double.Parse(fieldarray(2)), .D = Double.Parse(fieldarray(3)), .E = fieldarray(4)})
                End If
            End If
        Loop
    End Using
    ' Lefts ascending by column 2, followed by rights descending by column 2
    SortedFields = SortedFields.OrderBy(Function(x) x.B).Concat(test.OrderByDescending(Function(x) x.B)).ToList
End Function
One simple way of writing the data to a csv file is to use the IO.File.WriteAllLines method and the ConvertAll method of the List:
IO.File.WriteAllLines("coil.csv", SortedFields("textfile1.txt").ConvertAll(New Converter(Of Fields, String)(Function(x As Fields) x.ToString)))
You'll notice how the ToString method facilitates this quite easily.
If the class will only be used for this you do have the option to make all the fields string.
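For comparison with the original Python script, the same selection and ordering can be sketched in Python 3 with the csv module. This is only an illustration of steps 3 and 4 from the question, assuming five columns per row with the left/right marker in the last one; select_and_sort and write_coil are names invented for this sketch.

```python
import csv


def select_and_sort(rows):
    """Lefts ascending by column 2, then rights descending by column 2."""
    lefts = sorted((r for r in rows if r[4] == "l"), key=lambda r: float(r[1]))
    rights = sorted(
        (r for r in rows if r[4] == "r"), key=lambda r: float(r[1]), reverse=True
    )
    return lefts + rights


def write_coil(in_path, out_path):
    """Read the survey csv and write the combined result (hypothetical helper)."""
    with open(in_path, newline="") as src:
        rows = [r for r in csv.reader(src) if len(r) == 5]  # skip malformed rows
    with open(out_path, "w", newline="") as dst:
        csv.writer(dst).writerows(select_and_sort(rows))
```

Because the rows stay as lists of strings, the numbers are written back exactly as they were read, and rows such as the "eow" ones are dropped simply because they match neither marker.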
I am a beginner in Python (and in programming in general). I have a large file containing a repeating pattern of 3 lines with numbers, then 1 empty line, and so on.
if I print the file it looks like:
1.93202838
1.81608154
1.50676177
2.35787777
1.51866227
1.19643624
...
I want to take each group of three numbers as one vector, do some math operations on it, write the result to a new file, and then move on to the next three lines, i.e. the next vector. Here is my code (it doesn't work):
import math
inF = open("data.txt", "r+")
outF = open("blabla.txt", "w")
a = []
fin = []
b = []
for line in inF:
    a.append(line)
    if line.startswith(" \n"):
        fin.append(b)
        h1 = float(fin[0])
        k2 = float(fin[1])
        l3 = float(fin[2])
        h = h1/(math.sqrt(h1*h1+k1*k1+l1*l1)+1)
        k = k1/(math.sqrt(h1*h1+k1*k1+l1*l1)+1)
        l = l1/(math.sqrt(h1*h1+k1*k1+l1*l1)+1)
        vector = [str(h), str(k), str(l)]
        outF.write('\n'.join(vector)
        b = a
        a = []
inF.close()
outF.close()
print "done!"
I want to get "vector" from each 3 lines in my file and put it into blabla.txt output file. Thanks a lot!
My 'code comment' answer:
Take care to close every parenthesis you open (a missing one raises a SyntaxError ;-) ).
fin is created as an empty list and never receives the individual numbers: the only thing appended to it is the list b. Indexing it with fin[0], fin[1], fin[2] and calling float() on the result is therefore very likely to break with a TypeError or an IndexError.
k2 and l3 are created but never used.
k1 and l1 are used but never created; this is very likely to break with a NameError.
b is assigned from a, so it is a list. But you do a fin.append(b): what do you expect in this case by appending (not extending) a list?
Hope this helps!
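Putting those points together, one way the loop could be restructured is sketched below. It keeps the formula from the question (each component divided by the vector's length plus one) and assumes the groups of three numbers are separated by blank or whitespace-only lines; transform_vectors is a made-up name, and this is an illustration rather than the asker's final code.

```python
import math


def transform_vectors(lines):
    """Turn every group of three numeric lines into a scaled [h, k, l] vector."""
    results = []
    current = []
    for line in lines:
        stripped = line.strip()
        if stripped:  # blank separator lines are simply skipped
            current.append(float(stripped))
        if len(current) == 3:
            h1, k1, l1 = current
            # same denominator as in the question: vector length plus one
            denom = math.sqrt(h1 * h1 + k1 * k1 + l1 * l1) + 1
            results.append([h1 / denom, k1 / denom, l1 / denom])
            current = []
    return results
```

Writing the output then only needs a loop over the returned vectors, for example joining the three values of each vector with newlines and adding an empty line between vectors.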
This is only in the answers section for length and formatting.
Input and output.
Control flow
I know nothing of vectors; you might want to look into the math module or NumPy.
Those links should hopefully give you all the information you need to at least get started with this problem. As yuvi said, the code won't be written for you, but you can come back when you have something that isn't working as you expected or that you don't fully understand.