Get time difference between two values in csv file [duplicate] - python

I am trying to get the average, max and min time difference between value occurrences in a CSV file.
The file contains multiple columns and rows.
I am currently working in Python and trying to use pandas to solve my problem.
I have managed to break the CSV file down to the time column and the column I want the time differences from, i.e. the "payload" column where the value occurrences happen,
looking like:
time | payload
12.1 2368
13.8 2508
I have also tried putting the times of the value occurrences into an array and stepping through it, but that failed badly. I felt like there had to be an easier way to do it.
def average_time(avg_file):
    avg_read = pd.read_csv(avg_file, skiprows=2, names=new_col_names, usecols=[2, 3], na_filter=False, skip_blank_lines=True)
    test = []
    i = 0
    for row in avg_read.payload:
        if row != None:
            test[i] = avg_read.time
            i += 1
            if len[test] > 2:
                average = test[1] - test[0]
                i = 0
                test = []
    return average
The CSV file currently looks like:
time | payload
12.1 2250
12.5 2305
12.9 (blank)
13.1 (blank)
13.5 2309
14.6 2350
14.9 2680
15.0 (blank)
I want to get the time difference between the values in the payload column, for example the time between
2250 and 2305 --> 12.5 - 12.1 = 0.4 s
and the difference between
2305 and 2309 --> 13.5 - 12.5 = 1 s
skipping the blank rows,
to later on get the maximum, minimum and average difference.

First use dropna, then use Series.diff:
DataFrame used:
print(df)
time payload
0 12.1 2250.0
1 12.5 2305.0
2 12.9 NaN
3 13.1 NaN
4 13.5 2309.0
5 14.6 2350.0
6 14.9 2680.0
7 15.0 NaN
df.dropna().time.diff()
0 NaN
1 0.4
4 1.0
5 1.1
6 0.3
Name: time, dtype: float64
Note: I assumed your (blank) values are NaN; if not, run the following before my code:
df.replace('(blank)', np.nan, inplace=True)
# Or if they are empty strings
df.replace('', np.nan, inplace=True)
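The question also asks for the maximum, minimum and average gap, and those follow directly from the diffed series. A minimal self-contained sketch reusing the frame above (values reconstructed from the answer, so treat it as illustrative):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'time':    [12.1, 12.5, 12.9, 13.1, 13.5, 14.6, 14.9, 15.0],
    'payload': [2250, 2305, np.nan, np.nan, 2309, 2350, 2680, np.nan],
})

# Drop the blank payload rows, diff the times, then drop the leading NaN.
diffs = df.dropna().time.diff().dropna()
print(diffs.max(), diffs.min(), diffs.mean())  # roughly 1.1, 0.3 and 0.7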

Related

How can I subtract two panda data frame columns without getting an index error? [duplicate]

In Python, how can I reference the previous row and calculate something against it? Specifically, I am working with DataFrames in pandas. I have a data frame full of stock price information that looks like this:
Date Close Adj Close
251 2011-01-03 147.48 143.25
250 2011-01-04 147.64 143.41
249 2011-01-05 147.05 142.83
248 2011-01-06 148.66 144.40
247 2011-01-07 147.93 143.69
Here is how I created this dataframe:
import pandas
url = 'http://ichart.finance.yahoo.com/table.csv?s=IBM&a=00&b=1&c=2011&d=11&e=31&f=2011&g=d&ignore=.csv'
data = pandas.read_csv(url)
## now sort the data frame ascending by date
data = data.sort_values(by='Date')
Starting with row number 2, or in this case, I guess it's 250 (PS, is that the index?), I want to calculate the difference between 2011-01-03 and 2011-01-04, for every entry in this dataframe. I believe the appropriate way is to write a function that takes the current row, figures out the previous row, and calculates the difference between them, then use the pandas apply function to update the dataframe with the value.
Is that the right approach? If so, should I be using the index to determine the difference? (Note: I'm still in Python beginner mode, so index may not be the right term, nor even the correct way to implement this.)
I think you want to do something like this:
In [26]: data
Out[26]:
Date Close Adj Close
251 2011-01-03 147.48 143.25
250 2011-01-04 147.64 143.41
249 2011-01-05 147.05 142.83
248 2011-01-06 148.66 144.40
247 2011-01-07 147.93 143.69
In [27]: data.set_index('Date').diff()
Out[27]:
Close Adj Close
Date
2011-01-03 NaN NaN
2011-01-04 0.16 0.16
2011-01-05 -0.59 -0.58
2011-01-06 1.61 1.57
2011-01-07 -0.73 -0.71
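As a variation on diff, shift(1) aligns each row with the previous one, which covers the more general "compute something against the previous row" case from the question, not just plain differences. A small self-contained sketch (values copied from the question's frame):
import pandas as pd

data = pd.DataFrame({'Date': ['2011-01-03', '2011-01-04', '2011-01-05'],
                     'Close': [147.48, 147.64, 147.05]})

# Each row minus the row before it; the first row has no predecessor, so NaN.
data['Close_change'] = data['Close'] - data['Close'].shift(1)
print(data)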
To calculate the difference of one column, here is what you can do.
df=
A B
0 10 56
1 45 48
2 26 48
3 32 65
We want to compute the row difference in A only, and keep only the rows where that difference is less than 15.
df['A_dif'] = df['A'].diff()
df=
    A   B  A_dif
0  10  56    NaN
1  45  48   35.0
2  26  48  -19.0
3  32  65    6.0
df = df[df['A_dif']<15]
df=
    A   B  A_dif
2  26  48  -19.0
3  32  65    6.0
I don't know pandas, and I'm pretty sure it has something specific for this; however, I'll give you the pure-Python solution, which might be of some help even if you end up using pandas:
import csv
import urllib.request

# This retrieves the CSV file and loads it into a list, converting
# all numeric values to floats.
url = 'http://ichart.finance.yahoo.com/table.csv?s=IBM&a=00&b=1&c=2011&d=11&e=31&f=2011&g=d&ignore=.csv'
lines = urllib.request.urlopen(url).read().decode().splitlines()
reader = csv.reader(lines, delimiter=',')
# We sort the output list (skipping the header row) so the records are ordered by date.
cleaned = sorted([row[0]] + [float(x) for x in row[1:]] for row in list(reader)[1:])
for i, row in enumerate(cleaned):  # enumerate() yields two-tuples: (<index>, <item>)
    if i == 0:
        continue  # the first row has no previous row to compare against
    # This calculates the difference of each numeric field with the same field
    # in the row before this one.
    print(row[0], [row[j] - cleaned[i - 1][j] for j in range(1, 7)])

how to read url .txt files using pandas

I have a problem reading files using pandas (read_csv). I can do it with the built-in open(...), however it is much easier with pandas. I just need to read the data (numbers) between the ---- lines. This is the LINK to one of my data URLs; there are more depending on the date that I insert. A sample of it is:
MONTHLY CLIMATOLOGICAL SUMMARY for JUN. 2020
NAME: Krieza Evias CITY: Krieza Evias STATE:
ELEV: 119 m LAT: 38° 24' 00" N LONG: 24° 18' 00" E
TEMPERATURE (°C), RAIN (mm), WIND SPEED (km/hr)
HEAT COOL AVG
MEAN DEG DEG WIND DOM
DAY TEMP HIGH TIME LOW TIME DAYS DAYS RAIN SPEED HIGH TIME DIR
------------------------------------------------------------------------------------
1 18.2 22.4 10:20 13.5 23:50 1.0 0.9 0.0 4.5 33.8 12:30 E
2 17.6 22.3 15:00 10.8 4:10 2.0 1.3 0.0 4.5 30.6 15:20 E
3 18.1 21.9 12:20 14.1 3:40 1.3 1.1 1.0 4.2 24.1 14:40 E
Keep in mind that I cannot just use skiprows=8 and skipfooter=9 to get the data between the --------, because not all files of this format have the same number of header or footer lines to skip. Some have 2 or 3, and others have 8 or 9 lines of footer or title to skip. But every file has two -------- lines with the data between them.
I don't think you can use read_csv directly on that, but you could do this:
import pandas as pd
import urllib.request
from io import StringIO

count = 0
txt = ""
data = urllib.request.urlopen(LINK)
for line in data:
    decoded = line.decode('windows-1252')
    if "---" in decoded:
        count += 1
    elif count == 1:
        # Between the first and second dashed lines: keep the row.
        txt += decoded
    elif count >= 2:
        # Past the second dashed line: the data block is over.
        break
df = pd.read_csv(StringIO(txt), sep=r"\s+", header=None)
header is None because, in your link, the column names are not in a single row but are split across multiple rows. If they are fixed, I suggest putting them in by hand, such as ["DAY", "MEAN TEMP", ...].
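If the columns are fixed, one way to set the names by hand is to pass them to read_csv. The names below are my reading of the multi-row header block shown in the question (DAY, MEAN TEMP, HIGH, TIME, ...), so verify them against the actual file:
import pandas as pd
from io import StringIO

columns = ['DAY', 'MEAN_TEMP', 'HIGH', 'HIGH_TIME', 'LOW', 'LOW_TIME',
           'HEAT_DEG_DAYS', 'COOL_DEG_DAYS', 'RAIN', 'AVG_WIND_SPEED',
           'WIND_HIGH', 'WIND_HIGH_TIME', 'DOM_DIR']

# One sample row from the question, standing in for the txt built above.
txt = "1 18.2 22.4 10:20 13.5 23:50 1.0 0.9 0.0 4.5 33.8 12:30 E\n"
df = pd.read_csv(StringIO(txt), sep=r"\s+", header=None, names=columns)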

Pandas Way of Weighted Average in a Large DataFrame

I have a large dataset (around 8 million rows x 25 columns) in pandas and I am struggling to find a way to compute a weighted average over this dataframe which in turn creates another data frame.
Here is what my dataset looks like (a very simplified version of it):
prec temp
location_id hours
135 1 12.0 4.0
2 14.0 4.1
3 14.3 3.5
4 15.0 4.5
5 15.0 4.2
6 15.0 4.7
7 15.5 5.1
136 1 12.0 4.0
2 14.0 4.1
3 14.3 3.5
4 15.0 4.5
5 15.0 4.2
6 15.0 4.7
7 15.5 5.1
I have a multi-index on [location_id, hours]. I have around 60k locations and 140 hours for each location (making up the 8 million rows).
The rest of the data is numeric (float) or categorical. I have only included 2 columns here, normally there are around 20 columns.
What I want to do is create a new data frame that is basically a weighted average of this data frame. The requirements indicate that 12 of these location_ids should be averaged out by a specified weight to form the combined_location_id values.
For example, location_ids 1,3,5,7,9,11,13,15,17,19,21,23 with their appropriate weights (separate data coming in from another data frame) should be weighted-averaged to form the combined_location_id CL_1's data.
That is a lot of data to handle and I wasn't able to find a completely Pandas way of solving it. Therefore, I went with a for loop approach. It is extremely slow and I am sure this is not the right way to do it:
def __weighted(self, ds, weights):
    return np.average(ds, weights=weights)

f = {'hours': 'first', 'location_id': 'first',
     'temp': lambda x: self.__weighted(x, weights),
     'prec': lambda x: self.__weighted(x, weights)}

data_frames = []
for combined_location in all_combined_locations:
    mapped_location_ids = combined_location.location_ids
    weights = combined_location.weights_of_location_ids
    data_for_this_combined_location = pd.concat(df_data.loc[df_data.index.get_level_values(0) == location_id] for location_id in mapped_location_ids)
    data_grouped_by_distance = data_for_this_combined_location.groupby("hours", as_index=False)
    data_grouped_by_distance = data_grouped_by_distance.agg(f)
    data_frames.append(data_grouped_by_distance)

df_combined_location_data = pd.concat(data_frames)
df_combined_location_data.set_index(['location_id', 'hours'], inplace=True)
This works functionally, however the performance and the memory consumption are horrible. It takes over 2 hours on my dataset and that is currently not acceptable. The existence of the for loop is an indicator that this could be handled better.
Is there a better/faster way to implement this?
From what I saw, you can remove one for loop by selecting all mapped_location_ids at once:
data_for_this_combined_location = df_data.loc[df_data.index.get_level_values(0).isin(mapped_location_ids)]
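For illustration, here is a minimal self-contained sketch of how that selection plus a weighted groupby could replace the per-id concat inside the loop. The frame, ids and weights are made-up stand-ins for df_data, mapped_location_ids and weights from the question, not a tested rewrite of the full pipeline:
import numpy as np
import pandas as pd

# Tiny stand-in for df_data: a (location_id, hours) MultiIndex.
df_data = pd.DataFrame(
    {'prec': [12.0, 14.0, 13.0, 15.0], 'temp': [4.0, 4.1, 3.5, 4.5]},
    index=pd.MultiIndex.from_product([[135, 136], [1, 2]],
                                     names=['location_id', 'hours']))

mapped_location_ids = [135, 136]   # hypothetical ids forming one combined location
weights = [0.7, 0.3]               # hypothetical weights, one per location_id

# Select every row for the mapped ids in one call instead of concatenating
# per-id slices, then take the weighted average within each hour.
subset = df_data.loc[df_data.index.get_level_values(0).isin(mapped_location_ids)]
combined = subset.groupby('hours').agg(lambda col: np.average(col, weights=weights))
print(combined)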

Txt to csv format with rows and columns [python]

Need help converting a txt file to csv with the rows and columns intact. The text file is here:
(http://weather.uwyo.edu/cgi-bin/sounding?region=naconf&TYPE=TEXT%3ALIST&YEAR=2020&MONTH=06&FROM=2300&TO=2300&STNM=72265)
So far I only have this...
df = pd.read_csv('sounding-72265-2020010100.txt',delimiter=',')
df.to_csv('sounding-72265-2020010100.csv')
But it gives me only one column, with all the other columns crammed into its rows.
Instead I want to format it to something like the CSV layout in the linked "CSV Format" screenshot.
Thanks for any help
I'm assuming you can start with text copied from the website; i.e. you create a data.txt file looking like the following by copy/pasting:
1000.0 8
925.0 718
909.0 872 39.6 4.6 12 5.88 80 7 321.4 340.8 322.5
900.0 964 37.6 11.6 21 9.62 75 8 320.2 351.3 322.1
883.0 1139 36.6 7.6 17 7.47 65 9 321.0 345.3 322.4
...
...
...
Then the following works, mainly based on this answer:
import pandas as pd
df = pd.read_table('data.txt', header=None, sep='\n')
df = df[0].str.strip().str.split('\s+', expand=True)
You read the data only separating by new lines, generating a one column df. Then use string methods to format the entries and expand them into a new DataFrame.
You can then add the column names in as such with help from this answer:
col1 = 'PRES HGHT TEMP DWPT RELH MIXR DRCT SKNT THTA THTE THTV'.split()
col2 = 'hPa m C C % g/kg deg knot K K K '.split()
df.columns = pd.MultiIndex.from_tuples(zip(col1,col2), names = ['Variable','Unit'])
The result (df.head()):
Variable PRES HGHT TEMP DWPT RELH MIXR DRCT SKNT THTA THTE THTV
Unit hPa m C C % g/kg deg knot K K K
0 1000.0 8 None None None None None None None None None
1 925.0 718 None None None None None None None None None
2 909.0 872 39.6 4.6 12 5.88 80 7 321.4 340.8 322.5
3 900.0 964 37.6 11.6 21 9.62 75 8 320.2 351.3 322.1
4 883.0 1139 36.6 7.6 17 7.47 65 9 321.0 345.3 322.4
If it were me, I would actually probably drop the "Unit" level of the column names, because I think MultiIndex columns can make things more complicated to slice.
Again, both reading the data and the column names assume you can just copy/paste them into a text file or into Python and then parse. If you are reading many pages like this, or are looking to do some sort of web scraping, that will require additional work.
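Since the original goal was a CSV file, the parsed frame can then be written back out, reusing the filename from the question's own snippet (this is just a follow-up to the df built above):
df.to_csv('sounding-72265-2020010100.csv')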

Assigning arrays as values for a key in a dictionary

I have a .dat file with different data. The file has numbers arranged in 7 columns separated by two whitespaces. Is it possible to read and extract the data for each column and assign the data to a key in a dictionary, using arrays? Is it possible to assign numpy arrays as values for a key in a dictionary?
The .dat file has numbers like this:
1 -0.8 92.3 2.8 150 0 0
2 -0.7 99.3 1.9 140 0 0
3 -0.3 96.4 2.5 120 0 0
4 -0.3 95.0 3.1 130 0 0
5 -0.8 95.7 3.1 130 0 0
6 -0.5 95.0 2.1 120 0 0
7 -0.7 90.9 3.6 110 0 0
8 -0.6 85.7 2.6 80 0 0
9 -0.7 85.7 3.1 60 0 0
10 -1.2 85.6 3.6 50 0 8
I first read all the lines, then split the values of each line with whitespace as the separator. I tried to assign the values in each column to the corresponding key in the dictionary, but this does not work. I think I have to put the values in an array and then put the array in the dictionary in some way?
def read_data(filename):
    infile = open(filename, 'r')
    for line in infile.readlines():
        data = {'hour': None, 'temperature': None, 'humidity': None,
                'wind_speed': None, 'wind_direction': None,
                'direct_flux': None, 'diffuse_flux': None}
        lines = line.split()
        data['hour'] = lines[0]
        data['temperature'] = lines[1]
        data['humidity'] = lines[2]
        data['wind_speed'] = lines[3]
        data['wind_direction'] = lines[4]
        data['direct_flux'] = lines[5]
        data['diffuse_flux'] = lines[6]
    return data
EDIT: I realized numpy arrays are a specific scientific data structure. I have not used them, but I assume converting the below lists (and their append operations) into numpy arrays is trivial.
You are correct. A dictionary holds (key, value) pairs. An entry of the form (key, value, value, ..., value) is not acceptable. Using a list() as the value (as you suggested) is a solution. Note now that the index corresponds to the line number the data was in.
def read_data(filename):
    infile = open(filename, 'r')
    data = {'hour': None, 'temperature': None, 'humidity': None,
            'wind_speed': None, 'wind_direction': None,
            'direct_flux': None, 'diffuse_flux': None}
    # For each key, initialize a list as its value.
    for key in data:
        data[key] = list()
    for line in infile.readlines():
        lines = line.split()
        # We simply append into the list this key references.
        data['hour'].append(lines[0])
        data['temperature'].append(lines[1])
        data['humidity'].append(lines[2])
        data['wind_speed'].append(lines[3])
        data['wind_direction'].append(lines[4])
        data['direct_flux'].append(lines[5])
        data['diffuse_flux'].append(lines[6])
    return data
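As a follow-up to the EDIT above, a minimal sketch of turning those lists into numpy arrays (the small data dict here is a made-up subset standing in for the one returned by read_data):
import numpy as np

data = {'hour': ['1', '2', '3'], 'temperature': ['-0.8', '-0.7', '-0.3']}  # example subset
arrays = {key: np.array(values, dtype=float) for key, values in data.items()}
print(arrays['temperature'].mean())  # roughly -0.6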
I'm not quite sure I got right what you are asking for, but I'll try to answer.
I guess you want to load those tabulated data in a way you can easily work with, making use of numpy's functionality.
Then, I think you have two options.
Using PANDAS
Pandas (here the documentation) is a really complete package that uses numpy to let you work with labelled data (so that columns and rows have a name, and not only a positional index)
Using pandas, the idea would be to do:
import pandas as pd
df = pd.read_csv('data.tab', sep=r"\s+", index_col=0, header=None,
                 names=['hour', 'temp', 'hum', 'w_speed', 'w_direction',
                        'direct_flux', 'diffuse_flux'])
df
temp hum w_speed w_direction direct_flux diffuse_flux
hour
1 -0.8 92.3 2.8 150 0 0
2 -0.7 99.3 1.9 140 0 0
3 -0.3 96.4 2.5 120 0 0
4 -0.3 95.0 3.1 130 0 0
5 -0.8 95.7 3.1 130 0 0
6 -0.5 95.0 2.1 120 0 0
7 -0.7 90.9 3.6 110 0 0
8 -0.6 85.7 2.6 80 0 0
9 -0.7 85.7 3.1 60 0 0
10 -1.2 85.6 3.6 50 0 8
Or, if you have the column names as the first row of the file simply:
import pandas as pd
df = pd.read_csv('data.tab', sep=r"\s+", index_col=0)
If you haven't heard of this library and you are managing this kind of data, I think it is really worthwhile to give it a close look.
Using only Numpy
If you don't need to do much with those data, or won't do it again or whatever, getting Pandas may be a bit too much...
In any case, you can always read the tabulated file from numpy
import numpy as np
array = np.loadtxt("data.tab")  # the default delimiter is any run of whitespace
It will ignore comment lines (by default lines with #) and you can also skip the first row and so on.
Now you'll have all the data in array, and you can access it by slicing and indexing. If you want labelled categories (and you don't like the first option), you can build your dictionary of arrays following the last snippet of code:
data = {}
headers = ['hour', 'temp', 'hum', 'w_speed', 'w_direction', 'direct_flux',
           'diffuse_flux']
for i, name in enumerate(headers):
    data[name] = array[:, i]
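And if you went with the pandas option but still want the dictionary of numpy arrays asked about in the question, the frame converts directly (this bridge is mine, not part of the original answer; df is the frame from the pandas snippet above):
data = {name: df[name].to_numpy() for name in df.columns}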
