I'm working on creating a program that will pull some data from a preformatted file that does not include a timestamp but requires one. I know the following things:
The name of the file, which includes the hour at which the data were logged. I can assume that the first data point was collected at the start of that hour, and I can parse the hour from the filename.
I know that the data were sampled at a frequency of 64 Hz, so I know the time delta between consecutive data points.
As I write the code chunk to extract these data, I am running into a problem where my date updates but my hour doesn't. The result is that all my data have the correct date but the same hour. I'm hoping this is just the result of missing something logical due to sleep deprivation, but I'd appreciate it if someone could point out the problem with my code.
import glob
import os
import pandas as pd

# Paths for files to process
advpath = '/Users/stnixon/Dropbox/GradSchool/Research/EddyCovarianceData/data/palmyra2016/**'
# Create list of files to process
advfiles = glob.glob(os.path.join(advpath, '*.A16'))
# Create data frames, load files, concatenate, and sort adv files and dfetfiles
advframe = []
for f in advfiles:
    advdf = pd.read_csv(f, sep=r'\s+',
                        names=['ID', 'u', 'v', 'w', 'u1', 'v1', 'w1', 'ucorr', 'vcorr', 'wcorr'],
                        usecols=[0, 1, 2, 3, 7, 8, 9])
    file_now = os.path.basename(f)
    print(int(file_now[4:6]))
    # hour parsed from characters 4-6 of the filename
    advdf['Time'] = pd.to_datetime(int(file_now[4:6]), unit='h')
    # month/day parsed from characters 0-4 of the filename, year assumed to be 2016
    advdf['Date'] = pd.to_datetime('2016' + file_now[0:2] + file_now[2:4])
    advframe.append(advdf)
advdata = pd.concat(advframe)
Essentially, the Date column gives me the right date across each row, while the Time column just gives me the same value for all.
It turns out that this was not a bug, but instead a weird coincidence. The files I needed to process were parsing the hour correctly, but they were being processed in a random order and it just so happens that the hours for the first two and last two files were all the same. So it looked in the terminal as if it wasn't updating.
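For completeness, a minimal sketch of how the full 64 Hz timestamp could then be assembled inside the loop above; the combined 'Timestamp' column and the use of pd.to_timedelta are my own additions, not part of the original code.

import numpy as np

# start of the hour parsed from the filename (same slicing as above)
start = pd.to_datetime('2016' + file_now[0:2] + file_now[2:4]) + pd.to_timedelta(int(file_now[4:6]), unit='h')
# one sample every 1/64 s, counted from the start of the hour
advdf['Timestamp'] = start + pd.to_timedelta(np.arange(len(advdf)) / 64.0, unit='s')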
I have a file named "sample name_TIC.txt". The first three columns in this file are useful: Scan, Time, and TIC. It also has 456 non-useful columns after the first three. For other data processing, I need these non-useful columns to go away. So I wrote a bit of code to start:
import os
import pandas as pd

os.chdir(main_folder)
mydir = os.getcwd()
nameslist = ['Scan', 'Time', 'TIC']
for path, subdirs, files in os.walk(mydir):
    for file in files:
        if file.endswith('TIC.txt'):
            myfile = os.path.join(path, file)
            TIC_df = pd.read_csv(myfile, sep="\t", skiprows=1,
                                 usecols=[0, 1, 2], names=nameslist)
Normally, the for loop is set into a function that is iterated over a very large set of folders with a lot of samples, hence the os.walk stuff, but we can ignore that right now. This code will be completed to save a new .txt file with only the 3 relevant columns.
The problem comes in the last line, the pd.read_csv line. It results in a dataframe whose index column comprises the data from the first 456 columns, while the last 3 columns of the .txt are given the names in nameslist and are callable as columns in pandas (e.g. using .iloc). This is not a multi-index; it is a single index holding all the data and whitespace of those first columns.
In this example code sep="\t" because that's how excel can successfully import it. But I've also tried:
sep="\s"
delimiter=r"\s+" rather than a sep argument
including header=None
not including the usecols argument — edit: I made an error here and did not check the proper result from this change; the correct solution is in the edit below and in the answer
setting index_col=False
How can I get pd.read_csv to take the first 3 columns and ignore the rest?
Thanks.
EDIT: In my end-of-day foolishness, I made an error when changing the target df to the example TIC_df. In the original code set I took this from, it was named mz207_df. My call was still referencing the old df name.
Changing the last line of code to:
TIC_df = pd.read_csv(myfile, sep="\s+", skiprows=1, usecols=[0,1,2], names=nameslist)
successfully resolved my problem. Using sep="\t" also worked. Sorry for wasting people's time. I will post this with an answer as well in case someone needs to learn about usecols like I did.
Answering here to make sure the problem gets flagged as answered, in case someone else searches for it.
I made an error when calling the result from the code which included the usecols=[0,1,2] argument; I was calling an older dataframe. The following line of code successfully generated the desired dataframe.
TIC_df = pd.read_csv(myfile,sep="\s+",skiprows=1, usecols=[0,1,2],names=nameslist)
Using sep="\t" also generated the correct dataframe, but I default to \s+ to accommodate different and variable formatting from analytical machine outputs.
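Since the stated follow-up goal is to save a new .txt file containing only the three relevant columns, here is a minimal sketch of that step (the output filename is just an assumed naming scheme):

# write the trimmed dataframe back out as a tab-separated .txt file
out_name = myfile.replace('TIC.txt', 'TIC_trimmed.txt')  # hypothetical output name
TIC_df.to_csv(out_name, sep='\t', index=False)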
I'm still new to Python and can't manage to do what I'm looking for. I'm using Python 3.7.0.
I have one file, called log.csv, containing a log of CANbus messages.
I want to check the content of the columns labelled Data2 and Data3 when the value in the ID column is 348.
If they are both different from "00", I want to build a new string called fault_code from "Data3+Data2".
Then I want to check in another CSV file where this code string appears, and print column 6 of that row (labelled Description). But I want to do this last part only once per fault_code.
Here is my code:
import csv

CAN_ID = "348"
with open('0.csv') as log:
    reader = csv.reader(log, delimiter=',')
    for log_row in reader:
        if log_row[1] == CAN_ID:
            if (log_row[5] + log_row[4]) != "0000":
                fault_code = log_row[5] + log_row[4]
                with open('Fault_codes.csv') as fault:
                    readerFC = csv.reader(fault, delimiter=';')
                    for fault_row in readerFC:
                        if "0x" + fault_code in readerFC:
                            print("{fault_row[6]}")
Here is a part of the log.csv file
Timestamp,ID,Data0,Data1,Data2,Data3,Data4,Data5,Data6,Data7,
396774,313,0F,00,28,0A,00,00,C2,FF
396774,314,00,00,06,02,10,00,D8,00
396775,**348**,2C,00,**00,00**,FF,7F,E6,02
and this is a part of faultcode.csv
Level;LED Flashes;UID;FID;Type;Display;Message;Description;RecommendedAction
1;2;1;**0x4481**;Warning;F12001;Handbrake Fault;Handbrake is active;Release handbrake
1;5;1;**0x4541**;Warning;F15001;Fan Fault;blablabla;blablalba
1;5;2;**0x4542**;Warning;F15002;blablabla
Also, can you think of a better way to do this task? I've read that pandas can be very good for large files. As log.csv can have 100,000+ rows, it's maybe a better idea to use it. What do you think?
Thank you for your help!
Be careful with your indentation: you get this error because you sometimes use spaces and other times tabs to indent.
As PM 2Ring said, reading 'Fault_codes.csv' every time you read one line of your log is really not efficient.
You should read faultcode once and store the content in RAM (if it fits). You can use pandas to do it, and store the content into a DataFrame. I would do that before reading your logs.
You do not need to store all the lines of log.csv in RAM, so I'd keep reading it line by line with the csv module, do my processing, write to a new file, and read the next line. There is no need to use pandas here, as it would fill your RAM for nothing.
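A minimal sketch of that structure, based on the column layouts shown in the question (I use a plain dict instead of a DataFrame for the fault-code lookup; the 'seen' set for reporting each code only once and the exact file names are my own additions, and the lookup of column 6 simply follows the original print statement):

import csv

CAN_ID = "348"

# read Fault_codes.csv once, keeping a dict: FID (e.g. "0x4481") -> message text (column 6)
fault_messages = {}
with open('Fault_codes.csv') as fault:
    for fault_row in csv.reader(fault, delimiter=';'):
        if len(fault_row) > 6:
            fault_messages[fault_row[3]] = fault_row[6]

# stream the log line by line and report each fault code only once
seen = set()
with open('log.csv') as log:
    for log_row in csv.reader(log, delimiter=','):
        if len(log_row) > 5 and log_row[1] == CAN_ID:
            if log_row[5] + log_row[4] != "0000":
                fault_code = "0x" + log_row[5] + log_row[4]
                if fault_code not in seen:
                    seen.add(fault_code)
                    print(fault_code, fault_messages.get(fault_code, "unknown code"))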
I have a Python code which is logging some data into a .csv file.
import datetime

logging_file = 'test.csv'
dt = datetime.datetime.now()
f = open(logging_file, 'a')
f.write('\n "{:%H:%M:%S}",{},{}'.format(dt, x, y))
The above code is the core part and this produces continuous data in .csv file as
"00:34:09" ,23.05,23.05
"00:36:09" ,24.05,24.05
"00:38:09" ,26.05,26.05
... etc.,
Now I wish to add the following line as the first row of this data: time, data1, data2. I expect the output to be
time, data1, data2
"00:34:09" ,23.05,23.05
"00:36:09" ,24.05,24.05
"00:38:09" ,26.05,26.05
... etc.,
I have tried many ways, but none of them produced the result in the preferred format, so I am unable to get my expected output.
Please help me to solve the problem.
I would recommend writing a class specifically for creating and managing logs. Have it initialize a file on creation with the expected first line (don't forget a \n character!), and keep track of any necessary information about that log (the name of the log it created, where it is, etc.). You can then have the class 'write' to the log (append to it, really), create new logs as necessary, and have it check for existing logs and decide whether to update what already exists or scrap it and start over.
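A minimal sketch of such a class, assuming the header and row format from the question (the class and method names are my own):

import datetime
import os

class CsvLogger:
    def __init__(self, path, header='time, data1, data2'):
        self.path = path
        # write the header only when the file does not exist yet
        if not os.path.exists(path):
            with open(path, 'w') as f:
                f.write(header + '\n')

    def write(self, x, y):
        # append one timestamped row
        dt = datetime.datetime.now()
        with open(self.path, 'a') as f:
            f.write('"{:%H:%M:%S}",{},{}\n'.format(dt, x, y))

log = CsvLogger('test.csv')
log.write(23.05, 23.05)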
This is my function to build a record of a user's performed actions in a CSV file with Python. It gets the username from a global and adds the increment given in the amount parameter to the specific cell of the CSV that matches the user's row and the current date.
In brief, the function will read the csv in a list, and do any modification on the data before rewriting the whole list back into the csv file.
The first item of every row is the username, and the header holds the dates.
Accs\Dates,12/25/2016,12/26/2016,12/27/2016
user1,217,338,653
user2,261,0,34
user3,0,140,455
However, I'm not sure why the header sometimes gets pushed down to the second row and the data gets wiped entirely when it crashes.
Also, I need to point out that there may be multiple scripts running this function and writing to the same file; I'm not sure if that's causing the issue.
I'm thinking maybe I could write the stats separately and uniquely for each user and combine them later, eliminating the possible clash while writing. Although it would be great if I could just improve what I have here and read/write everything in one file.
Any fail-safe way to do what I'm trying to do here?
# Search current user in first rows and updating the count on the column (today's date)
# 'amount' will be added to the respective position
def dailyStats(self, amount, code = None):
    def initStats():
        # prepping table
        with open(self.stats, 'r') as f:
            reader = csv.reader(f)
            for row in reader:
                if row:
                    self.statsTable.append(row)
                    self.statsNames.append(row[0])

    def getIndex(list, match):
        # get the index of the matched date or user
        for i, j in enumerate(list):
            if j == match:
                return i

    self.statsTable = []
    self.statsNames = []
    self.statsDates = None
    initStats()

    today = datetime.datetime.now().strftime('%m/%d/%Y')
    user_index = None
    today_index = None

    # append header if the csv is empty
    if len(self.statsTable) == 0:
        self.statsTable.append([r'Accs\Dates'])
        # rebuild updated table
        initStats()

    # add new user/date if not found in first row/column
    self.statsDates = self.statsTable[0]
    if getIndex(self.statsNames, self.username) is None:
        self.statsTable.append([self.username])
    if getIndex(self.statsDates, today) is None:
        self.statsDates.append(today)

    # rebuild statsNames after table appended
    self.statsNames = []
    for row in self.statsTable:
        self.statsNames.append(row[0])

    # getting the index of user (row) and date (column)
    user_index = getIndex(self.statsNames, self.username)
    today_index = getIndex(self.statsDates, today)

    # the row where user is matched, if there are previous dates than today which
    # has no data, append 0 (e.g. user1,0,0,0,) until the column where today's date is match
    if len(self.statsTable[user_index]) < today_index + 1:
        for i in range(0, today_index + 1 - len(self.statsTable[user_index])):
            self.statsTable[user_index].append(0)

    # insert pv or tb code if found
    if code is None:
        self.statsTable[user_index][today_index] = amount + int(re.match(r'\b\d+?\b', str(self.statsTable[user_index][today_index])).group(0))
    else:
        self.statsTable[user_index][today_index] = str(re.match(r'\b\d+?\b', str(self.statsTable[user_index][today_index])).group(0)) + ' - ' + code

    # Writing final table
    with open(self.stats, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerows(self.statsTable)

    # return the summation of the user's total count
    total_follow = 0
    for i in range(1, len(self.statsTable[user_index])):
        total_follow += int(re.match(r'\b\d+?\b', str(self.statsTable[user_index][i])).group(0))
    return total_follow
As David Z says, concurrency is most likely the cause of your problem.
I will add that the CSV format is not suitable for database-style storing, indexing, or sorting, because it is plain text and sequential.
You could handle this by using an RDBMS for storing and updating your data and periodically processing your stats. Your CSV format then becomes just an import/export format.
Python offers an SQLite binding in its standard library; if you build a connector that imports/updates the CSV content into an SQLite schema and then dumps the results as CSV, you will be able to handle concurrency and keep your native format without worrying about installing a database server or new Python packages.
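A minimal sketch of that idea with the standard library's sqlite3 module (the long-format table, the database file name, and the column names are my own assumptions; exporting back to the original Accs\Dates layout would be a separate query-and-pivot step):

import sqlite3

conn = sqlite3.connect('stats.db')  # hypothetical database file
conn.execute('''CREATE TABLE IF NOT EXISTS stats
                (username TEXT, date TEXT, count INTEGER,
                 PRIMARY KEY (username, date))''')

def add_amount(username, date, amount):
    # insert the cell if it does not exist yet, then increment it
    conn.execute('INSERT OR IGNORE INTO stats VALUES (?, ?, 0)', (username, date))
    conn.execute('UPDATE stats SET count = count + ? WHERE username = ? AND date = ?',
                 (amount, username, date))
    conn.commit()

add_amount('user1', '12/25/2016', 5)
for row in conn.execute('SELECT username, date, count FROM stats ORDER BY username, date'):
    print(row)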
Also, I need to point out that there may be multiple scripts running this function and writing to the same file; I'm not sure if that's causing the issue.
More likely than not that is exactly your issue. When two things are trying to write to the same file at the same time, the outputs from the two sources can easily get mixed up together, resulting in a file full of gibberish.
An easy way to fix this is just what you mentioned in the question, have each different process (or thread) write to its own file and then have separate code to combine all those files in the end. That's what I would probably do.
If you don't want to do that, what you can do is have different processes/threads send their information to an "aggregator process", which puts everything together and writes it to the file - the key is that only the aggregator ever writes to the file. Of course, doing that requires you to build in some method of interprocess communication (IPC), and that in turn can be tricky, depending on how you do it. Actually, one of the best ways to implement IPC for simple programs is by using temporary files, which is just the same thing as in the previous paragraph.
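A minimal sketch of the separate-files approach from the previous paragraph, where each process appends only to its own file (tagged with its PID) and a single pass merges them at the end; the file naming is my own assumption:

import csv
import glob
import os

def write_own_stats(rows):
    # each process writes only to its own file, so writes can never interleave
    own_file = 'stats_{}.csv'.format(os.getpid())
    with open(own_file, 'a', newline='') as f:
        csv.writer(f).writerows(rows)

def combine(pattern='stats_*.csv', out='stats_combined.csv'):
    # run once, from a single process, after all workers are done
    with open(out, 'w', newline='') as out_f:
        writer = csv.writer(out_f)
        for name in sorted(glob.glob(pattern)):
            with open(name, newline='') as f:
                writer.writerows(csv.reader(f))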
I have a lot of data files with unknown names. I have figured out a way to get them all read and printed, but I want to make graphs of them, so I need the data in a workable form.
The datafiles are very neatly arranged (every line of the header contains information on what is stored there) but I am having trouble making a script that selects the data I need. The first 50+ lines of the file contain headers of which I need only a few to be used, this poses no problem when using something like:
for filename in glob.glob(fullpath):
    with open(filename, 'r') as f:
        for line in f:
            if 'xx' in line:
                ...  # do my thing
            if 'yy' in line:
                ...  # do my thing, etc.
But below the headers there is a block of data with an undetermined number of columns and an undetermined number of lines (the number of columns, and what each column is, is specified in the headers). I can't get this read in a way that a graph can be made of it by, for example, matplotlib. (I can get it right by manually copying the data to a separate file and reading that into a plottable format, but that is not what I want to do every time for every file...) The line before the data starts contains the very useful #eoh, but I can't figure out a way to combine the selective reading of the first 50+ lines with then switching to reading everything into an array. If there are methods to do what I want in a better way (including selecting the folder and seeing which files are there and readable), I am open to suggestions.
Update:
The solution proposed by @ImportanceOfBeingErnest seems very useful, but I can't get it to work.
So I'll start with the data mentioned as missing in the answer.
Column names are given in the following format:
#COLUMNINFO= NUMBER1, UNIT, MEASUREMENT, NUMBER2
In this format, NUMBER1 is the column number, UNIT is the unit of the measurement, MEASUREMENT is what is measured, and NUMBER2 is a numeric code for what is measured.
The data is separated by spaces but that won't be a problem, I suspect.
I tried to implement reading the headers in a loop to determine where the headers end, but it failed to have any visible effect; even the print commands I added to check intermediate results did not show anything.
Once I put 'print line' after 'for line in f:' I thought I could see what went wrong, but it appears as if the whole loop is ignored, including the break command, which causes an error since the file has already been read to the end and no data is left for the other parts...
Any help would be appreciated.
First of all, if the header has a certain character at the beginning of each line, this can be used to filter the header out automatically. Using numpy you could call numpy.loadtxt(filename, delimiter=";", comments="#") to load the data, and every line starting with # would simply be ignored.
I don't know if this is the case here?!
In the case that you describe, where you have a header-ending flag #eoh you could first read in the header line by line to find out how many lines you later need to ignore and then use that number when loading the file.
I have assembled a little example, how it could work.
import glob
import numpy as np
import matplotlib.pyplot as plt

def readFile(filename):
    # first find the number of lines to skip in the header
    eoh = 0
    with open(filename, "r") as f:
        for line in f:
            eoh = eoh + 1
            if "#eoh" in line:
                break
    # now at this point we need to find out about the column names
    # but as no data is given as example, this is impossible
    columnnames = []
    # load the file by skipping eoh lines,
    # the rest should be well behaving
    a = np.genfromtxt(filename, skip_header=eoh, delimiter=";")
    return a, columnnames

def plot(a, columnnames, show=True, save=False, filename="something"):
    fig = plt.figure()
    ax = fig.add_subplot(111)
    # plot the fourth column against the second
    ax.plot(a[:, 1], a[:, 3])
    # if we had some columnnames, we could also plot
    # column named "ulf" against the one named "alf"
    # ax.plot(a[:, columnnames.index("alf")], a[:, columnnames.index("ulf")])
    # now save and/or show
    if save:
        plt.savefig(filename + ".png")
    if show:
        plt.show()

if __name__ == "__main__":
    fullpath = "path/to/files/file*.txt"  # or whatever
    for filename in glob.glob(fullpath):
        a, columnnames = readFile(filename)
        plot(a, columnnames, show=True, save=False, filename=filename[:-4])
One remaining problem is the names of the columns. Since you did not provide any example data, it's hard to estimate how to exactly do that.
This all assumes that you do not have any missing data in between or anything of that kind. If this was the case, then you'd need to use all the arguments to numpy.genfromtxt() to filter the data accordingly.
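For example, a hedged sketch of how missing entries could be filled when loading, reusing the variables from readFile above (the fill value of 0 is just an assumption):

# replace empty fields with 0 instead of NaN while skipping the header
a = np.genfromtxt(filename, skip_header=eoh, delimiter=";",
                  missing_values="", filling_values=0)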