I am a beginner in programming and I use Python. I have code that calculates the rotation period of a star, but I have to change the star ID by hand each time, which takes a lot of effort and time.
Can I change the star ID automatically?
from lightkurve import search_lightcurvefile
lcf = search_lightcurvefile('201691589').download() ## star Id = 201691589
lc = lcf.PDCSAP_FLUX.remove_nans()
pg = lc.to_periodogram()
Prot = pg.frequency_at_max_power**-1
print(Prot)
I saved all the star IDs that I want to use in a text file (starID.txt) with 10,000 lines, and I want to calculate the rotation period (Prot) automatically, so that the code takes the star IDs from the text file one by one, does the calculation, then saves the star_ID and Prot in a CSV file (two columns: 'star_ID', 'Prot'). Can you please help me do it?
This should get you close but I don't have a bunch of star IDs handy, nor is this in my field.
The main points:
Use the csv module for reading and writing files.
When you have code that you need to call many times (often even just once, for a logical grouping), you want to consider packaging it into a function.
There are other pointers here for you to research. I tried to keep things more succinct than basic loops without making the code too terse; hopefully it's enough for you to follow up on.
from lightkurve import search_lightcurvefile
import csv
# You need to read the file and get the star IDs. I'm taking a guess here that
# the file has a single column of IDs
with open('starID.txt') as infile:
    reader = csv.reader(infile)
    # Below is where my guess matters. A "list comprehension" assuming a single
    # column, so I just take the first value of each row.
    star_ids = [item[0] for item in reader]
def data_reader(star_id):
    """
    Compute the rotation period for a single star_id.
    Returns [star_id, Prot]
    """
    lcf = search_lightcurvefile(star_id).download()  # use the star_id that was passed in
    lc = lcf.PDCSAP_FLUX.remove_nans()
    pg = lc.to_periodogram()
    Prot = pg.frequency_at_max_power**-1
    return [star_id, Prot]
# Now start calling the function on your list of star IDs and storing the result
results = []
for id_number in star_ids:
    individual_result = data_reader(id_number)  # Call function
    results.append(individual_result)  # Add function response to the result collection
# Now write the data out
with open('star_results.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(['star_ID', 'Prot'])  # header row with the two column names you asked for
    writer.writerows(results)
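One caveat for a 10,000-star run: if any single download fails (or a light curve is missing PDCSAP_FLUX), an unhandled exception will stop the whole loop. A hedged variation of the collection loop above that just logs and skips failures (the broad except is a deliberate simplification):
results = []
for id_number in star_ids:
    try:
        results.append(data_reader(id_number))  # call the function as before
    except Exception as err:
        # Assumption: skipping a problematic ID is acceptable; record it and move on
        print('Skipping', id_number, '->', err)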
You said 'beginner' so this is an answer for someone writing programs almost for the first time.
In Python the straightforward way to change the star ID each time is to write a function and call it multiple times with different star IDs. Taking the code you have and turning it into a function without changing the behavior at all might look like this:
from lightkurve import search_lightcurvefile
def prot_for_star(star_id):
    lcf = search_lightcurvefile(star_id).download()
    pg = lcf.PDCSAP_FLUX.remove_nans().to_periodogram()
    return pg.frequency_at_max_power**-1
# Now use map to call the function for each star id in the list
star_ids = ['201691589', '201692382']
prots = list(map(prot_for_star, star_ids))
print(prots)
This could be inefficient code though. I don't know what this lightkurve package does exactly, so there could be additional ways to save time. If you need to do more than one thing with each lcf object, you might need your functions to be structured differently. Or if creating a periodogram is CPU intensive, and you end up generating the same ones multiple times, there could be ways to save time doing that.
But this is the basic idea of using abstraction to avoid repeating the same lines of code over and over again.
Combining the original star id with its period of rotation can be achieved like this. This is a bit of functional programming magic.
# interleave the star ids with their periods of rotation
for pair in zip(star_ids, prots):
    # separate the id and prot with a comma for csv format
    print(','.join(str(value) for value in pair))
The output of the Python script can then be stored in a csv file.
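If you want the script itself to write that file rather than redirecting its output, a minimal sketch with the csv module (reusing star_ids and prots from above) could look like this:
import csv

with open('star_results.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(['star_ID', 'Prot'])        # header row
    writer.writerows(zip(star_ids, prots))      # one (star_id, Prot) pair per row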
What is the best way to create a Python script which, every time it is called, ALWAYS generates a new UNIQUE ID (autoincremental)?
You run the script and it will tell you 1, then you close the script, open it again, and it will tell you 2.
The purpose is to create a script which will be used across the board, and this ID will be used to track the latest changes and so on.
P.S. I'm not talking about making a function which will do it.
import uuid
uniqueid = uuid.uuid1()  # time-based UUID: unique on every run, but not an incrementing counter
Since you didn't provide any code, I will also not provide any code.
Solution 1: Unique ID
1) TIME: create a function that gives you a timestamp
2) ID: create a function that generates a long string of random numbers and letters
This is of course 'risky' because there is a chance you will generate an already existing ID, but from a statistical point of view it is, so to speak, 'impossible even if it is possible'.
Save the result in a file or somewhere similar.
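For what it's worth, a minimal sketch of Solution 1 (the module choices are mine, not a requirement):
import time
import secrets

def generate_id():
    # timestamp plus a random hex suffix; a collision is 'impossible even if it is possible'
    return '%d-%s' % (int(time.time()), secrets.token_hex(8))

# save in a file, as suggested above
with open('ids.txt', 'a') as f:
    f.write(generate_id() + '\n')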
Solution 2: offset - incremental
1) Have a file with a 0 in it.
2) Open the file, read the line, convert it to an integer, increment it by 1, and write it back to the file.
Note:
Your title is misleading. One moment you talk about a UNIQUE ID, the next you are talking about an offset. A unique ID and counting how many times a Python script has run are quite different ideas.
I assume you have a script that generates some result every time it is executed. Then you need a value that (1) distinguishes one result from another and (2) shows which result came last. Is that right? If so, we have many options here. In the simplest case (a script always running on the same machine) I would suggest two options.
Save a count to a file
In this case, you would have a file and would read the number from it:
try:
    with open('count.txt') as count_file:
        content = count_file.read()
        count = int(content)
except Exception:
    count = 0
After doing whatever your script does, you would write to the file the value you've read, but incremented:
with open('count.txt', 'w') as count_file:
    count_file.write(str(count + 1))
Save the timestamp
A simpler option, however, is not to increment a value but get a timestamp. You could use time.time(), that returns the number of seconds since Unix epoch:
>>> import time
>>> time.time()
1547479233.9383247
You will always know which result came later than the others. Personally, however, I would rather format the current time; it is easier to read and reason about:
>>> from datetime import datetime
>>> datetime.now().strftime('%Y%m%d%H%M%S')
'20190114132407'
Those are basic ideas, you may need to pay attention to corner cases and possible failures (especially with the file-based solution). That said, I guess those are quite viable first steps.
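As one example of such a corner case: if the script dies halfway through writing count.txt, you can end up with an empty file. A hedged way around that (purely illustrative, not required by the approach above) is to write to a temporary file and atomically swap it in:
import os
import tempfile

def save_count(count, path='count.txt'):
    # write the new value to a temp file in the same directory, then replace the original atomically
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or '.')
    with os.fdopen(fd, 'w') as tmp:
        tmp.write(str(count))
    os.replace(tmp_path, path)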
A technical note
What you want here is for a program to remember a piece of information between two or more executions, and we have a technical term for that: the information should be persistent. Since you asked for an autoincrementing feature, you want a persistent count. I suspect, however, that you do not need that if you use the timestamp option. It is up to you to decide what to do here.
I had the same situation. I ended up creating a CSV file so that I could map variable names.
import sys
import pandas as pd

def itemId_generator(itemIdLocation):
    # Read the current value of ItemId from the csv file
    df = pd.read_csv(itemIdLocation)
    # Current ItemId stored in the csv file
    ItemId = df.loc[0, 'ItemId']
    # Cap ItemId at a maximum value, then wrap around to 1
    if ItemId >= 10000:
        df.loc[0, 'ItemId'] = 1
    elif ItemId < 10000:
        # Update the column value
        df.loc[0, 'ItemId'] = df.loc[0, 'ItemId'] + 1
    else:
        print("Invalid value returned")
        sys.exit()
    # Write the new ItemId back to the file
    df.to_csv(itemIdLocation, index=False)
    # .item() converts a numpy integer to a normal int
    return str(ItemId.item())
If there is any chance of the file being accessed concurrently, it is best to lock the file. Keep trying if the file is locked.
http://tilde.town/~cristo/file-locking-in-python.html
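For reference, a minimal Unix-only locking sketch with fcntl (my assumption; see the linked article for other approaches), using the lock as a mutex around the generator call:
import fcntl

with open(itemIdLocation, 'r+') as lock_handle:
    fcntl.flock(lock_handle, fcntl.LOCK_EX)      # blocks until no other process holds the lock
    try:
        new_id = itemId_generator(itemIdLocation)
    finally:
        fcntl.flock(lock_handle, fcntl.LOCK_UN)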
Old answer:
You could store it as an environment variable on the system. If not set, initialise to 1. Else increment it by 1.
csv data:
c1,v1,c2,v2,Time
13.9,412.1,29.7,177.2,14:42:01
13.9,412.1,29.7,177.2,14:42:02
13.9,412.1,29.7,177.2,14:42:03
13.9,412.1,29.7,177.2,14:42:04
13.9,412.1,29.7,177.2,14:42:05
0.1,415.1,1.3,-0.9,14:42:06
0.1,408.5,1.2,-0.9,14:42:07
13.9,412.1,29.7,177.2,14:42:08
0.1,413.4,1.3,-0.9,14:42:09
0.1,413.8,1.3,-0.9,14:42:10
The code I currently have:
import pandas as pd
import csv
import datetime as dt
#Read .csv file, get timestamp and split it into date and time separately
Data = pd.read_csv('filedata.csv', parse_dates=['Time_Stamp'], infer_datetime_format=True)
Data['Date'] = Data.Time_Stamp.dt.date
Data['Time'] = Data.Time_Stamp.dt.time
#print (Data)
print (Data['Time_Stamp'])
Data['Time_Stamp'] = pd.to_datetime(Data['Time_Stamp'])
#Read timestamp within a certain range
mask = (Data['Time_Stamp'] > '2017-06-12 10:48:00') & (Data['Time_Stamp']<= '2017-06-12 11:48:00')
june13 = Data.loc[mask]
#print (june13)
What I'm trying to do is read the data in 5-second blocks, and if any of the c1 values within a block is 10.0 or above, replace that value of c1 with 0.
I'm still new to Python and I could not find examples for this. May I have some assistance, as this problem is beyond my Python programming skills for now. Thank you!
I don't know the modules around csv files, so my answer might look primitive, and I'm not quite sure what you are trying to accomplish here, but have you thought of dealing with the file textually?
From what I get, you want to read every c1, check the value and modify it.
To read and modify the file, you could do:
with open('filedata.csv', 'r+') as csv_file:
    lines = csv_file.readlines()
    # For each data line, isolate the first value and check - and modify - it if needed.
    # I'm seriously not sure, you might have wanted to read only one out of five lines.
    # For that, just do a while loop with an index which increments through lines by 5.
    for i, line in enumerate(lines):
        if i == 0:
            continue  # skip the header row
        fields = line.split(',')  # split comma-separated values
        # Check the condition and apply the needed change.
        if float(fields[0]) >= 10:
            fields[0] = "0"  # directly as a string
        # Transform the list back into a single string and store it back in the list.
        lines[i] = ",".join(fields)
    # Rewrite the file.
    csv_file.seek(0)
    csv_file.writelines(lines)
    csv_file.truncate()  # drop leftover bytes if the new content is shorter
# Here you are ready to use the file just like you were already doing.
# Of course, the above code could be put in a function for known advantages.
(I don't have python here, so I couldn't test it and typos might be there.)
If you only need the dataframe without the file being modified:
Pretty much the same to be honest.
Instead of the file-writing at the end, you could do :
from io import StringIO  # pandas wants a file-like object such as StringIO instead of a plain string.

# Above code here, but without the part that rewrites the file.
Data = pd.read_csv(
    StringIO("\n".join(lines)),
    parse_dates=['Time_Stamp'],
    infer_datetime_format=True
)
This should give you the Data you have, with changed values where needed.
Hope this wasn't completely off. Also, some people might find this approach horrible; we already have working modules to do this kind of thing, so why bother dealing with the rough raw data ourselves? Personally, I think it's often easier to understand how the text representation of a file can be used than to learn every external module I'll ever need. Your opinion might differ.
Also, this code might perform worse, as we need to iterate through the text twice (pandas does it again when reading). However, I don't think you'd get a faster result by reading the csv the way you already do and then iterating through the data anyway to check the condition. (You might save a cast per checked c1 value, but the difference is small, and iterating through a pandas dataframe may well be slower than iterating through a list, depending on the state of its current optimisation.)
Of course, if you don't really need the pandas dataframe format, you could do it completely manually; it would take only a few more lines (or not, to be honest) and shouldn't be slower, as the number of iterations would be minimized: you could check the conditions on the data at the same time as you read it. It's getting late and I'm sure you can figure that out by yourself, so I won't code it in my great editor (known as Stack Overflow). Ask if there's anything!
I have a large file (1.6 gigs) with millions of rows that has columns delimited with:
[||]
I have tried to use the csv module, but it says I can only use a single character as a delimiter. So here is what I have:
fileHandle = open('test.txt', 'r', encoding="UTF-16")
thelist = []
for line in fileHandle:
    fields = line.split('[||]')
    therow = {
        'dea_reg_nbr': fields[0],
        'bus_actvty_cd': fields[1],
        'drug_schd': fields[3],
        #50 more columns like this
    }
    thelist.append(therow)
fileHandle.close()
#now I have thelist which is what I want
And boom, now I have a list of dictionaries and it works. I want a list because I care about the order, and the dictionaries because that's what is expected downstream. It just feels like I should be taking advantage of something more efficient. I don't think this scales well with over a million rows and so much data. So, my question is as follows:
What would be the more efficient way of taking a multi-character delimited text file (UTF-16 encoded) and creating a list of dictionaries?
Any thoughts would be appreciated!
One way to make it scale better is to use a generator instead of loading all million rows into memory at once. This may or may not be possible depending on your use-case; it will work best if you only need to make one pass over the full data set. Multiple passes will require you to either store all the data in memory in some form or another or to read it from the file multiple times.
Anyway, here's an example of how you could use a generator for this problem:
def file_records():
    with open('test.txt', 'r', encoding='UTF-16') as fileHandle:
        for line in fileHandle:
            fields = line.split('[||]')
            therow = {
                'dea_reg_nbr': fields[0],
                'bus_actvty_cd': fields[1],
                'drug_schd': fields[3],
                #50 more columns like this
            }
            yield therow

for record in file_records():
    ...  # do work on one record
The function file_records is a generator function because of the yield keyword. When this function is called, it returns an iterator that you can iterate over exactly like a list. The records will be returned in order, and each one will be a dictionary.
If you're unfamiliar with generators, this is a good place to start reading about them.
The thing that makes this scale so well is that you will only ever have one therow in memory at a time. Essentially what's happening is that at the beginning of every iteration of the loop, the file_records function is reading the next line of the file and returning the computed record. It will wait until the next row is needed before doing the work, and the previous record won't linger in memory unless it's needed (such as if it's referenced in whatever data structure you build in # do work on one record).
Note also that I moved the open call to a with statement. This will ensure that the file gets closed and all related resources are freed once the iteration is done or an exception is raised. This is much simpler than trying to catch all those cases yourself and calling fileHandle.close().
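As an example of a single pass, you could stream each record straight into a CSV file so that only one row is ever held in memory (the output file name and CSV format are my assumptions about your downstream step):
import csv

fieldnames = ['dea_reg_nbr', 'bus_actvty_cd', 'drug_schd']   # extend with your other 50 columns
with open('output.csv', 'w', newline='', encoding='utf-8') as out:
    writer = csv.DictWriter(out, fieldnames=fieldnames, extrasaction='ignore')
    writer.writeheader()
    for record in file_records():
        writer.writerow(record)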
I have a .dat file with no delimiters that I am trying to read into an array. Say each new line represents one person, and the variables in each line are defined in terms of a fixed number of characters, e.g. the first variable "year" is the first four characters and the second variable "age" is the next 2 characters (no delimiters within the line), e.g.:
201219\n
201220\n
201256\n
Here is what I am doing right now:
data_file = 'filename.dat'
file = open(data_file, 'r')
year = []
age = []
for line in file:
    year.append(line[0:4])
    age.append(line[4:])
This works fine for a small number of lines and variables, but when I try loading the full data file (500Mb with 10 million lines and 20 variables) I get a MemoryError. Is there a more efficient way to load this type of data into arrays?
First off, you're probably better off with a list of class instances than a bunch of parallel lists, from a software engineering standpoint. If you try this, you probably should look into __slots__ to decrease the memory overhead.
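A minimal sketch of that idea with the two fields from the question (the class name and parsing details are just placeholders):
class Person:
    __slots__ = ('year', 'age')   # no per-instance __dict__, so noticeably less memory

    def __init__(self, year, age):
        self.year = year
        self.age = age

people = []
with open('filename.dat') as f:
    for line in f:
        people.append(Person(int(line[0:4]), int(line[4:6])))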
You could also try pypy - it has some memory optimizations for homogeneous lists.
I'd probably use gdbm or bsddb rather than sqlite if you want an on-disk solution. gdbm and bsddb look like dicts, except you index them (the keys) by a string and the values are strings too. So your class (the one I mentioned above) would have a __str__ and/or __repr__ method that converts it to a string (you could use pickle) for storage in the table. Then your constructor would be made to deal with reversing the process somehow.
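In Python 3 the bsddb module is gone, but the same dict-like interface is available through the standard dbm module; a rough sketch of the pickle idea (keying records by line number is my assumption):
import dbm
import pickle

with dbm.open('people.db', 'c') as db:           # 'c' creates the database file if needed
    with open('filename.dat') as f:
        for i, line in enumerate(f):
            record = {'year': int(line[0:4]), 'age': int(line[4:6])}
            db[str(i)] = pickle.dumps(record)     # keys and values are stored as bytes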
If you ever get to data so large that gdbm or bsddb is too slow, you could try just writing to a flat file - that won't be as nice for jumping around, obviously, but it eliminates a lot of seek()'ing, which can be very advantageous sometimes.
HTH
The problem here doesn't appear to be reading the file so much as fitting it into memory. When you're talking about 200 million of anything in memory, you're going to have some issues.
Try storing the data as a list of strings (i.e. trade CPU time for memory), or, if you can, just don't store it at all.
Another option to try is dumping it into a sqlite database. If you use an in-memory db you might end out with the same issue, but maybe not.
If you go for the string style, do something like this:
def get_age(person):
    return int(person[4:])

people = file.readlines()  # Wait a while....
for person in people:
    print(get_age(person)*2)  # Or something else
Here's an example of getting mean income for a particular age in a particular year:
def get_mean_income_by_age_and_year(people, target_age, target_year):
    # get_income and get_year are defined just like get_age above
    count = 0
    total = 0.0
    for person in people:
        income, age, year = get_income(person), get_age(person), get_year(person)
        if age == target_age and year == target_year:
            total += income
            count += 1
    if count:
        return total/count
    else:
        return 0.0
Really, though, this basically does what storing it in a sqlite database would do for you. If there are only a couple of very specific things you want to do, then going this way is probably reasonable. But it sounds like there are probably several things you want to be doing with this info - if so a sqlite database is probably what you want.
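If you do go the sqlite route, a hedged sketch of loading and querying the file (the income slice mirrors the array-based answer below and is an assumption about the real layout):
import sqlite3

conn = sqlite3.connect('people.db')   # on-disk database file
conn.execute('CREATE TABLE IF NOT EXISTS people (year INTEGER, age INTEGER, income REAL)')
with open('filename.dat') as f:
    conn.executemany(
        'INSERT INTO people VALUES (?, ?, ?)',
        ((int(line[0:4]), int(line[4:6]), float(line[6:12])) for line in f),
    )
conn.commit()

# mean income for a particular age and year, computed by the database (values are hypothetical)
row = conn.execute('SELECT AVG(income) FROM people WHERE age = ? AND year = ?', (30, 2012)).fetchone()
print(row[0])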
A more efficient data structure for lots of uniform numeric data is the array. Depending on how much memory you have, using an array may work.
import array
year = array.array('i') # int
age = array.array('i') # int
income = array.array('f') # float
with open('data.txt', 'r') as f:
    for line in f:
        year.append(int(line[0:4]))
        age.append(int(line[4:6]))
        income.append(float(line[6:12]))
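As a usage example, the mean-income function from the previous answer translates directly to these arrays (a sketch assuming the same column layout):
def mean_income(target_age, target_year):
    total = 0.0
    count = 0
    for y, a, inc in zip(year, age, income):
        if a == target_age and y == target_year:
            total += inc
            count += 1
    return total / count if count else 0.0

print(mean_income(30, 2012))   # hypothetical age and year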
To start, I am a complete newcomer to Python and to programming anything other than web languages.
So, I have developed a script using Python as an interface between a piece of software called Spendmap and an online app called Freeagent. This script works perfectly. It imports and parses the text file and pushes it through the API to the web app.
What I am struggling with is that Spendmap exports multiple lines per order whereas Freeagent wants one line per order. So I need to add the cost values from any orders spread across multiple lines and then 'flatten' the lines into one so it can be sent through the API. The 'key' field is the 'PO' field. So if the script sees any matching PO numbers, I want it to flatten them as described above.
This is a 'dummy' example of the text file produced by Spendmap:
5090071648,2013-06-05,2013-09-05,P000001,1133997,223.010,20,2013-09-10,104,xxxxxx,AP
COMMENT,002091
301067,2013-09-06,2013-09-11,P000002,1133919,42.000,20,2013-10-31,103,xxxxxx,AP
COMMENT,002143
301067,2013-09-06,2013-09-11,P000002,1133919,359.400,20,2013-10-31,103,xxxxxx,AP
COMMENT,002143
301067,2013-09-06,2013-09-11,P000003,1133910,23.690,20,2013-10-31,103,xxxxxx,AP
COMMENT,002143
The above has been formatted for easier reading; normally it is just one line after the next with no text formatting.
The 'key' or PO field is the fourth comma-separated value (e.g. P000001) and the sixth value is the cost to be totalled. So if this example were passed through the script, I'd expect the first row to be left alone, the second and third row costs to be added together as they're both from the same PO number, and the fourth line to be left alone.
Expected result:
5090071648,2013-06-05,2013-09-05,P000001,1133997,223.010,20,2013-09-10,104,xxxxxx,AP
COMMENT,002091
301067,2013-09-06,2013-09-11,P000002,1133919,401.400,20,2013-10-31,103,xxxxxx,AP
COMMENT,002143
301067,2013-09-06,2013-09-11,P000003,1133910,23.690,20,2013-10-31,103,xxxxxx,AP
COMMENT,002143
Any help with this would be greatly appreciated and if you need any further details just say.
Thanks in advance for looking!
I won't give you the solution. But you should:
Write and test a regular expression that breaks the line down into its parts, or use the CSV library.
Parse the numbers out so they're decimal numbers rather than strings
Collect the lines up by ID. Perhaps you could use a dict that maps IDs to lists of orders?
When all the input is finished, iterate over that dict and add up all orders stored in that list.
Make a string format function that outputs the line in the expected format.
Maybe feed the output back into the input to test that you get the same result. Second time round there should be no changes, if I understood the problem.
Good luck!
I would use a dictionary to compile the lines, using get(key,0.0) to sum values if they exist already, or start with zero if not:
InputData = """5090071648,2013-06-05,2013-09-05,P000001,1133997,223.010,20,2013-09-10,104,xxxxxx,AP COMMENT,002091
301067,2013-09-06,2013-09-11,P000002,1133919,42.000,20,2013-10-31,103,xxxxxx,AP COMMENT,002143
301067,2013-09-06,2013-09-11,P000002,1133919,359.400,20,2013-10-31,103,xxxxxx,AP COMMENT,002143
301067,2013-09-06,2013-09-11,P000003,1133910,23.690,20,2013-10-31,103,xxxxxx,AP COMMENT,002143"""
OutD = {}
ValueD = {}
for Line in InputData.split('\n'):
    # commas in comments won't matter because we are joining after anyway
    Fields = Line.split(',')
    PO = Fields[3]
    Value = float(Fields[5])
    # set up the output string with a placeholder for .format()
    OutD[PO] = ",".join(Fields[:5] + ["{0:.3f}"] + Fields[6:])
    # add the value to the old value or to zero if it is not found
    ValueD[PO] = ValueD.get(PO, 0.0) + Value
# the output is unsorted by default, but you could sort or preserve original order
for POKey in ValueD:
    print(OutD[POKey].format(ValueD[POKey]))
P.S. Yes, I know Capitals are for Classes, but this makes it easier to tell what variables I have defined...
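On the "preserve original order" comment: in Python 3.7+ a plain dict already keeps insertion order, so iterating over ValueD prints the POs in the order they were first seen; on older versions you could track that order yourself, for example:
first_seen = []
for Line in InputData.split('\n'):
    PO = Line.split(',')[3]
    if PO not in first_seen:
        first_seen.append(PO)

for POKey in first_seen:
    print(OutD[POKey].format(ValueD[POKey]))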