Finding the time difference between two columns - Python
I am currently working with a dataset which has two DateTime columns: ACTUAL_SHIPMENT_DTM and SHIPMENT_CONFIRMED_DTM.
I am trying to find the difference in time between the two columns. I have tried the following code, but the output gives me the time difference within one column across rows. Basically, I want a new column populated with the time difference (ACTUAL_SHIPMENT_DTM - SHIPMENT_CONFIRMED_DTM).
Golden['Cycle_TIme'] = Golden.groupby('ACTUAL_SHIPMENT_DTM')['SHIPMENT_CONFIRMED_DTM'].diff().dt.total_seconds()
Can anyone see errors in my code or guide me to proper documentation?
Lol, I underestimated myself and asked the question way too soon. Well, if anyone wants to know how to find the time difference between two columns, here is my example code (Golden is the DataFrame):
Golden['Cycle_TIme'] = Golden["SHIPMENT_CONFIRMED_DTM"]-
Golden["ACTUAL_SHIPMENT_DTM"]
Related
Resample().mean() in Python/Pandas and adding the results to my dataframe when the starting point is missing
I'm pretty new to coding and have a problem resampling my dataframe with Pandas. I need to resample my data ("value") to means for every 10 minutes (13:30, 13:40, etc.). The problem: the data start around 13:36, and I can't fix this by hand because I need to do it for 143 dataframes. Resampling adds the mean at the respective index (e.g. 13:40 for the second value), but because 13:30 is not among my indices, that value gets lost. I'm trying two different approaches here. First, I tried every option of resample() (offset, origin, convention, ...). Then I tried adding the missing values manually with a loop, which doesn't run properly because I don't know how to access the correct spot in the list; the list does include all relevant values, though. I also tried adding a row with 13:30 as the index on top of the dataframe, but didn't manage to convince Python that my index is legit because it's a timestamp (this is not in the code). Sorry for the very rough code, it just didn't work in several places, which is why I'm asking here. If you have a possible solution, please keep in mind that it has to work inside an already long loop, because of the many dataframes I have to process simultaneously. Thank you very much!

df["tenminavg"] = df["value"].resample("10Min").mean()
df["tenminavg"] = df["tenminavg"].ffill()
ls1 = df["value"].resample("10Min").mean()  # my alternative: list the resampled values in order to eventually access the first relevant timespan
for i in df.index:  # this loop doesn't work. It should add the value for the first 10 min
    if df["tenminavg"][i] == "nan":
        if datetime.time(13, 30) <= df.index.time < datetime.time(13, 40):
            df["tenminavg"][i] = ls1.index.loc[i.floor("10Min")]["value"]  # tried to access the corresponding data point in the list
    else:
        continue
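One possible approach, sketched on synthetic data rather than the asker's 143 dataframes (assumes pandas >= 1.1 for the origin keyword): compute the 10-minute means as a separate series and broadcast them back onto the original index with a forward-filling reindex, so the 13:30 bin survives even though 13:30 itself never appears in the data:

import pandas as pd

# Synthetic stand-in for one dataframe: 1-minute data starting at 13:36.
idx = pd.date_range("2021-01-01 13:36", periods=30, freq="1min")
df = pd.DataFrame({"value": range(30)}, index=idx)

# origin="start_day" anchors the bins at midnight, so they land on :00, :10, :20, ...
# and the first bin is labeled 13:30 even though the data begin at 13:36.
tenmin = df["value"].resample("10Min", origin="start_day").mean()

# Broadcast the bin means back onto the original 1-minute index; reindexing
# with ffill assigns each row the mean of the bin it falls into, so the
# 13:30 value is not lost when the indices don't line up.
df["tenminavg"] = tenmin.reindex(df.index, method="ffill")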
Is there a pandas function to merge 2 dfs so that repeating items in the second df are added as columns to the first df?
I have a hard time formulating this problem in abstract terms, so I will mostly explain it with examples. I have 2 pandas dataframes (I get them from a sqlite DB). First DF: Second DF: So the thing is: there are several images per "capture". I would like to add the images to the capture df as columns, so that each capture has 9 image columns, each with a path. There are always 9 images per capture. I solved it in pandas, with what I know, in the following way:

cam_idxs = sorted(list(range(9)) * 2)
for cam_idx in cam_idxs:
    sub_df = images.loc[(images["camera_id"] == cam_idx)]
    captures = captures.merge(sub_df[["image", "capture_id"]], left_on="id", right_on="capture_id")

I imagine, though, that there must be a better way. At least I imagine people probably stumble into this problem more often when getting data from a SQL database. Since I am getting the data into pandas from a SQL database, I am also open to SQL commands that get me this result. And I'm also grateful for people telling me what this kind of operation is called; I did not find a good way to google for this, which is why I am asking here. Excuse me if this question was asked somewhere; I did not find anything with my search terms. So the question at the end is: is there a better way to do this, especially a more efficient way?
What you are looking for is a pivot table. You just need to create a column containing, for each capture_id, the running index (1 to 9) of the image, which you will then use as the columns of the pivot table. For example this could be:

images['column_pivot'] = [x for x in range(1, 10)] * int(images.shape[0] / 9)

In your case 'column_pivot' would be [1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7,8,9,...,7,8,9] (i.e. rolling from 1 to 9). Then you pivot (aggfunc='first' is needed because the image paths are strings, and the default mean aggregation would drop them):

pd.pivot_table(images, columns='column_pivot', index='capture_id', values='image', aggfunc='first')

This will give the expected result.
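To make this concrete, a tiny synthetic check of the approach (the column names follow the question; the paths are made up, and 3 images per capture stand in for the 9):

import pandas as pd

images = pd.DataFrame({
    "capture_id": [1, 1, 1, 2, 2, 2],
    "camera_id":  [0, 1, 2, 0, 1, 2],
    "image": ["a0.png", "a1.png", "a2.png", "b0.png", "b1.png", "b2.png"],
})

# Running 1..3 index per capture, used as the pivot columns.
images["column_pivot"] = [x for x in range(1, 4)] * int(images.shape[0] / 3)

wide = pd.pivot_table(images, columns="column_pivot", index="capture_id",
                      values="image", aggfunc="first")
print(wide)
# column_pivot       1       2       3
# capture_id
# 1             a0.png  a1.png  a2.png
# 2             b0.png  b1.png  b2.png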
Pandas dataframe to numpy array [duplicate]
This question already has answers here: Convert pandas dataframe to NumPy array (15 answers). Closed 3 years ago. I am very new to Python and have very little experience. I've managed to get some code working by copying, pasting, and substituting the data I have, but I've been looking up how to select data from a dataframe and can't make sense of the examples enough to substitute my own data in. The overarching goal (if anyone could actually help me write the entire thing, that would be helpful, but highly unlikely and probably not allowed): I am trying to use scipy to fit the curve of a temperature change when two chemicals react. There are 40 trials. The model I am hoping to use is a generalized logistic function with six parameters. All I need are the 40 functions, nothing else. I have no idea how to achieve this, but I will ask another question when I get there. The current issue: I had imported 40 .csv files and compiled/shortened the data into 2 sections so that there are 20 trials in 1 file. Now the data has 21 columns and 63 rows. There is a title in the first row for each column, and the first column is a consistent time interval. However, not every trial runs that long (one of them does, though). So I've managed to write the following code for a dataframe:

import pandas as pd

df = pd.read_csv("~/Truncated raw data hcl.csv")
print(df)

It prints the table out, but as expected, there are NaNs where no data exist. So I would like to know how to arrange it into a workable array with 2 columns, time and one trial, like (x, y) pairs for a graph, for future work with numpy or scipy, such that rows with no data are not included. Part of the .csv file begins after the horizontal line. I'm too lazy to put it in a code block, sorry. Thank you.
time,1mnaoh trial 1,1mnaoh trial 2,1mnaoh trial 3,1mnaoh trial 4,2mnaoh trial 1,2mnaoh trial 2,2mnaoh trial 3,2mnaoh trial 4,3mnaoh trial 1,3mnaoh trial 2,3mnaoh trial 3,3mnaoh trial 4,4mnaoh trial 1,4mnaoh trial 2,4mnaoh trial 3,4mnaoh trial 4,5mnaoh trial 1,5mnaoh trial 2,5mnaoh trial 3,5mnaoh trial 4 0.0,23.2,23.1,23.1,23.8,23.1,23.1,23.3,22.0,22.8,23.4,23.3,24.0,23.0,23.8,23.8,24.0,23.3,24.3,24.1,24.1 0.5,23.2,23.1,23.1,23.8,23.1,23.1,23.3,22.1,22.8,23.4,23.3,24.0,23.0,23.8,23.8,24.0,23.4,24.3,24.1,24.1 1.0,23.2,23.1,23.1,23.7,23.1,23.1,23.3,22.3,22.8,23.4,23.3,24.0,23.0,23.8,23.8,24.0,23.5,24.3,24.1,24.1 1.5,23.2,23.1,23.1,23.7,23.1,23.1,23.3,22.4,22.8,23.4,23.3,24.0,23.0,23.8,23.8,23.9,23.6,24.3,24.1,24.1 2.0,23.3,23.2,23.2,24.2,23.6,23.2,24.3,22.5,23.0,23.7,24.4,24.1,23.1,23.9,24.4,24.2,23.7,24.5,24.7,25.1 2.5,24.0,23.5,23.5,25.4,25.3,23.3,26.4,22.7,23.5,25.8,27.9,25.1,23.1,23.9,27.4,26.8,23.8,27.2,26.7,28.1 3.0,25.4,24.4,24.1,26.5,27.8,23.3,28.5,22.8,24.6,28.6,31.2,27.2,23.2,23.9,30.9,30.5,23.9,31.4,29.8,31.3 3.5,26.9,25.5,25.1,27.4,29.9,23.4,30.1,22.9,26.4,31.4,34.0,30.0,23.3,24.2,33.8,34.0,23.9,35.1,33.2,34.4 4.0,27.8,26.5,26.2,27.9,31.4,23.4,31.3,23.1,28.8,34.0,36.1,32.6,23.3,26.6,36.0,36.7,24.0,37.7,35.9,36.8 4.5,28.5,27.3,27.0,28.2,32.6,23.5,32.3,23.1,31.2,36.0,37.5,34.8,23.4,30.0,37.7,38.7,24.0,39.7,38.0,38.7 5.0,28.9,27.9,27.7,28.5,33.4,23.5,33.1,23.2,33.2,37.6,38.6,36.5,23.4,33.2,39.0,40.2,24.0,40.9,39.6,40.2 5.5,29.2,28.2,28.3,28.9,34.0,23.5,33.7,23.3,35.0,38.7,39.4,37.9,23.5,35.6,39.9,41.2,24.0,41.9,40.7,41.0 6.0,29.4,28.5,28.6,29.1,34.4,24.9,34.2,23.3,36.4,39.6,40.0,38.9,23.5,37.3,40.6,42.0,24.1,42.5,41.6,41.2 6.5,29.5,28.8,28.9,29.3,34.7,27.0,34.6,23.3,37.6,40.4,40.4,39.7,23.5,38.7,41.1,42.5,24.1,43.1,42.3,41.7 7.0,29.6,29.0,29.1,29.5,34.9,28.8,34.8,23.5,38.6,40.9,40.8,40.2,23.5,39.7,41.4,42.9,24.1,43.4,42.8,42.3 7.5,29.7,29.2,29.2,29.6,35.1,30.5,35.0,24.9,39.3,41.4,41.1,40.6,23.6,40.5,41.7,43.2,24.0,43.7,43.1,42.9 8.0,29.8,29.3,29.3,29.7,35.2,31.8,35.2,26.9,40.0,41.6,41.3,40.9,23.6,41.1,42.0,43.4,24.2,43.8,43.3,43.3 8.5,29.8,29.4,29.4,29.8,35.3,32.8,35.4,28.9,40.5,41.8,41.4,41.2,23.6,41.6,42.2,43.5,27.0,43.9,43.5,43.6 9.0,29.9,29.5,29.5,29.9,35.4,33.6,35.5,30.5,40.8,41.8,41.6,41.4,23.6,41.9,42.4,43.7,30.8,44.0,43.6,43.8 9.5,29.9,29.6,29.5,30.0,35.5,34.2,35.6,31.7,41.0,41.8,41.7,41.5,23.6,42.2,42.5,43.7,33.9,44.0,43.7,44.0 10.0,30.0,29.7,29.6,30.0,35.5,34.6,35.7,32.7,41.1,41.9,41.8,41.7,23.6,42.4,42.6,43.8,36.2,44.0,43.7,44.1 10.5,30.0,29.7,29.6,30.1,35.6,35.0,35.7,33.3,41.2,41.9,41.8,41.8,23.6,42.6,42.6,43.8,37.9,44.0,43.8,44.2 11.0,30.0,29.7,29.6,30.1,35.7,35.2,35.8,33.8,41.3,41.9,41.9,41.8,24.0,42.9,42.7,43.8,39.3,,43.8,44.3 11.5,30.0,29.8,29.7,30.1,35.8,35.4,35.8,34.1,41.4,41.9,42.0,41.8,26.6,43.1,42.7,43.9,40.2,,43.8,44.3 12.0,30.0,29.8,29.7,30.1,35.8,35.5,35.9,34.3,41.4,42.0,42.0,41.9,30.3,43.3,42.7,43.9,40.9,,43.9,44.3 12.5,30.1,29.8,29.7,30.2,35.9,35.7,35.9,34.5,41.5,42.0,42.0,,33.4,43.4,42.7,44.0,41.4,,43.9,44.3 13.0,30.1,29.8,29.8,30.2,35.9,35.8,36.0,34.7,41.5,42.0,42.1,,35.8,43.5,42.7,44.0,41.8,,43.9,44.4 13.5,30.1,29.9,29.8,30.2,36.0,36.0,36.0,34.8,41.5,42.0,42.1,,37.7,43.5,42.8,44.1,42.0,,43.9,44.4 14.0,30.1,29.9,29.8,30.2,36.0,36.1,36.0,34.9,41.6,,42.2,,39.0,43.5,42.8,44.1,42.1,,,44.4 14.5,,29.9,29.8,,36.0,36.2,36.0,35.0,41.6,,42.2,,40.0,43.5,42.8,44.1,42.3,,,44.4 15.0,,29.9,,,36.0,36.3,,35.0,41.6,,42.2,,40.7,,42.8,44.1,42.4,,, 15.5,,,,,36.0,36.4,,35.1,41.6,,42.2,,41.3,,,,42.4,,,
To convert a whole DataFrame into a NumPy array, use

df = df.to_numpy()

(df.values also works, but note it is an attribute, not a method, so no parentheses). If I understood you correctly, you want separate arrays for every trial, though. That can be done like this:

data = [df.iloc[:, [0, i]].to_numpy() for i in range(1, 21)]

which makes a list of 20 NumPy arrays, each containing the time column and one of the trial columns.
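Since the question also asks to leave out the rows where a trial has no data, a hedged extension of the same idea (the file path is the one from the question; adjust as needed): drop the NaN rows per trial before converting, so each array only covers the span that trial actually ran:

import pandas as pd

df = pd.read_csv("~/Truncated raw data hcl.csv")

# One (time, trial) array per trial column; dropna() removes the rows
# where that particular trial has no reading, so short trials end early.
arrays = [
    df.iloc[:, [0, i]].dropna().to_numpy()
    for i in range(1, df.shape[1])
]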
Finding multiple maximum values from a file using Python
I am working with a CSV file and I need to find the greatest several items in a column. I was able to find the top value just by doing the standard loop-and-compare. My idea for getting the top few values would be to store all of the values from that column in an array, sort it, and then pull the last three indices. However, I'm not sure if that would be a good idea in terms of efficiency. I also need to pull other attributes associated with the top value, and it seems like separating out these column values would make everything messy. Another thing I thought about doing is having three variables and doing a running top-value sort of deal, where every time I find something bigger I compare the "top three" amongst each other and reorder them. That also seems a bit complex, and I'm not sure how I would implement it. I would appreciate some ideas, or someone telling me if I'm missing something obvious. Let me know if you need to see my sample code (I felt it was probably unnecessary here). Edit: To clarify, if the column values are something like [2,5,6,3,1,7], I would want first = 7, second = 6, third = 5.
Pandas looks perfect for your task, and nlargest() returns the whole rows, so the other attributes stay attached to each top value:

import pandas as pd

df = pd.read_csv('data.csv')
df.nlargest(3, 'column name')
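If pandas is not available, a standard-library sketch with heapq does the same running top-3 the question describes, without a full sort (this assumes a header row and a hypothetical column named "score"):

import csv
import heapq

with open("data.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# nlargest keeps only the 3 biggest rows as it scans, so it's O(n log k)
# rather than a full sort, and whole rows keep the other attributes.
top3 = heapq.nlargest(3, rows, key=lambda r: float(r["score"]))
for rank, row in enumerate(top3, start=1):
    print(rank, row)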
Python Pandas - Main DataFrame, want to drop all columns in smaller DataFrame
I have a DataFrame ('main') that has about 300 columns. I created a smaller DataFrame ('public') and have been working on this. I now want to delete the columns contained within 'public' from the larger DataFrame ('main'). I've tried following these: http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.drop.html and "Python Pandas - Deleting multiple series from a data frame in one command", without any success, along with various other statements that have been unsuccessful. The columns that make up 'public' are not consecutive, i.e. they are taken from various points in the larger DataFrame 'main'. All of the columns have the same index. [Not sure if this is important, but 'public' was created using the 'join' function.] Yes, I'm being lazy: I don't want to have to type out the names of every column! I'm hoping there's a way to use the DataFrame 'public' in a statement that will allow deletion of these columns en masse. If anyone has any suggestions and/or guidance, I'd be most grateful. (I have Python 2.7 and am using pandas, numpy, math, pylab, etc.) Thanks in advance.
Ignore my question - Murphy's Law prevails and I've just solved it. I was using the statement from the stackoverflow question mentioned above:

df.drop(df.columns[1:], axis=1)

and this was not working. I instead used

df = df.drop(df2, axis=1)

and this worked (df = main, df2 = public). Simple really, once you don't overthink it.
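On newer pandas versions, passing the DataFrame itself to drop() may not behave as it did here; a more explicit sketch (hypothetical column names) drops the smaller frame's columns by name instead:

import pandas as pd

main = pd.DataFrame({"a": [1], "b": [2], "c": [3]})
public = main[["b", "c"]]  # smaller frame built from some of main's columns

# Drop every column of 'public' from 'main' by name.
main = main.drop(columns=public.columns)
print(main)  # only column "a" remains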