Python CSV data import to table-like object

I have data from a CSV file:
time, meas, meas2
15:10, 10, 0.3
15:22, 12, 0.4
15:30, 4
So each row can contain a different number of values, less than or equal to the number of columns in the first row.
I am writing a simple stats app. For one graph I need, for example, the sum of the values in the column named meas, but for a second graph I would like to filter the data by time.
Is there any ready-made class with some kind of object that makes it easy to get data from columns or rows, depending on need?
Or do I just need to keep the data in rows and calculate the input for the first graph on the fly?

You are looking for the pandas library. The docs can be found at https://pandas.pydata.org/pandas-docs/stable/
You can run pip install pandas to install it.
The DataFrame is the basic pandas object that you work with. You can read your data in like this:
>>> import pandas as pd
>>> df = pd.read_csv(file_name)
>>> df
    time  meas  meas2
0  15:10    10    0.3
1  15:22    12    0.4
2  15:30     4    NaN
>>> df['meas'].sum()
26
At this point the time column holds string values. To convert them to time objects you could do this (there may be a better way):
>>> df['time'] = [x.time() for x in pd.to_datetime(df['time'])]
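A slightly tidier alternative (a sketch using the .dt accessor, assuming the times are all HH:MM strings as in the sample) would be:
>>> df['time'] = pd.to_datetime(df['time'], format='%H:%M').dt.time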
Now to filter on time... Let's say you want everything after row 1.
>>> time1 = df['time'][1]
>>> df['time'] > time1
0 False
1 False
2 True
Name: time, dtype: bool
You can use the boolean Series to filter your DataFrame like this:
>>> df[df['time'] > time1]
       time  meas  meas2
2  15:30:00     4    NaN
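For the second graph you could then aggregate over the filtered frame directly; a small sketch reusing time1 from above:
>>> df[df['time'] > time1]['meas'].sum()
4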

Your question is a little confusing, but it sounds like a pandas DataFrame would be helpful. You can read CSV files right into them.
import pandas as pd
df = pd.read_csv('your_csv_file.csv')
Of course you may need to get familiar with pandas for this to be useful.

Related

How to convert all data in a column to datetime

I have a large dataframe that, in its date column, has a mixture of date formats (only 2).
Most are in the correct format but there is some data that is in a different format.
i.e. most are 2013-11-07, but some are 20170510. Pandas throws an exception when I try to validate the data against a schema I have.
Is there a quick way to convert all dates to have the same format as the majority? Or do I have to do something more painful/manual?
i.e.
              date    ...
0       2013-11-07  False
2       2013-11-07  False
...            ...    ...
3595037   20170510    NaN
3595038   20200701    NaN
Is there a quick way to convert all dates to have the same format as the majority?
Considering that you have only two formats, one represented by 2013-11-07 and the other by 20170510, it is enough to remove the - from the first to get a common format, i.e.
import pandas as pd
df = pd.DataFrame({'day':['2013-11-07','20170510']})
df['day'] = df['day'].str.replace('-','')
print(df)
output
        day
0  20131107
1  20170510
pandas.to_datetime then understands it correctly:
df['day'] = pd.to_datetime(df['day'])
print(df)
output
         day
0 2013-11-07
1 2017-05-10
Disclaimer: I converted to the format of the minority, not the majority. It is possible to convert to the majority format using a regular expression, but if you are ultimately interested in datetime objects, that is an unnecessary complication.
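If you would rather parse both formats straight into datetimes, here is a sketch of a two-pass approach (assuming exactly those two formats appear):
import pandas as pd

df = pd.DataFrame({'day': ['2013-11-07', '20170510']})
# Parse each known format separately; rows that don't match become NaT...
dashed = pd.to_datetime(df['day'], format='%Y-%m-%d', errors='coerce')
compact = pd.to_datetime(df['day'], format='%Y%m%d', errors='coerce')
# ...then combine the two passes.
df['day'] = dashed.fillna(compact)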

Why does repeating a pd.Series not work as expected?

I have just started working with Python 3.7 and I am trying to create a series, e.g. from 0 to 23, and repeat it. Using
rep1 = pd.Series(range(24))
I figured out how to make the first 24 values, and I wanted to "copy-paste" them many times so that the final series is the original 5 times, one after the other. The result of rep = pd.Series.repeat(rep1, 5) looks like this, and it's not what I want:
0 0 0 0 0 1 1 1 1 1 2 2 2 2 2 ...
What I'm after is the 0-23 range repeated multiple times. Any advice?
You can try this:
pd.concat([rep1]*5)
This will repeat your series 5 times.
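Note (an addition, not part of the original answer): pd.concat keeps the 0-23 index from each copy; if you want a fresh 0-119 index, ignore_index should do it:
pd.concat([rep1] * 5, ignore_index=True)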
Another solution using numpy.tile:
import numpy as np
rep = pd.Series(np.tile(rep1, 5))
If you want the repeated Series as one data object, then use a pandas DataFrame for this. A DataFrame is multiple pandas Series in one object, sharing an index.
So first I create a Python list of 0-23, 5 times.
Then I put this into a DataFrame and optionally transpose it, so that in this example the rows go down rather than across.
import pandas as pd
lst = [list(range(0,24))] * 5
rep = pd.DataFrame(lst).transpose()
You could use a list to generate directly your Series.
rep = pd.Series(list(range(24))*5)
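A further option (a sketch, assuming numpy is acceptable) is to build the values arithmetically with the modulo operator:
import numpy as np
import pandas as pd

rep = pd.Series(np.arange(24 * 5) % 24)  # 0..23 repeated 5 times, with a fresh 0..119 index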

Pandas - get first n rows based on percentage

I have a dataframe and I want to pop a certain number of records, but instead of a number I want to pass a percentage value.
For example,
df.head(n=10)
pops out the first 10 records from the data set. I want a small change: instead of the first 10 records, I want the first 5% of records from my data set.
How do I do this in pandas?
I'm looking for code like this:
df.head(frac=0.05)
Is there any simple way to get this?
I want to pop the first 5% of records
There is no built-in method but you can do this:
You can multiply the total number of rows by your percentage and use the result as the parameter for the head method.
n = 5
df.head(int(len(df)*(n/100)))
So if your dataframe contains 1000 rows and n = 5, you will get the first 50 rows.
I've extended Mihai's answer for my own usage and it may be useful to people out there.
The purpose is automated top-n record selection for time-series sampling, so you can be sure you are taking old records for training and recent records for testing.
# having
# import pandas as pd
# df = pd.DataFrame...
def sample_first_prows(data, perc=0.7):
    return data.head(int(len(data) * perc))

train = sample_first_prows(df)
test = df.iloc[len(train):]  # the remaining rows, with no overlap
I also had the same problem and @mihai's solution was useful. For my case I rewrote it to:
percentage_to_take = 5/100
rows = int(df.shape[0]*percentage_to_take)
df.head(rows)
I presume that for the last percentage of rows, df.tail(rows) or df.head(-rows) would work as well.
Maybe this will help if you want the first 5% within each group:
tt = tmp.groupby('id').apply(lambda x: x.head(int(len(x)*0.05))).reset_index(drop=True)
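To make that concrete, a self-contained sketch with made-up data (tmp and its id column are placeholders, not from the question):
import pandas as pd

tmp = pd.DataFrame({'id': ['a'] * 40 + ['b'] * 60, 'val': range(100)})
# first 5% of rows within each id group: 2 rows for 'a', 3 rows for 'b'
tt = tmp.groupby('id').apply(lambda x: x.head(int(len(x) * 0.05))).reset_index(drop=True)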
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 2))
print(df)
          0         1
0  0.375727 -1.297127
1 -0.676528  0.301175
2 -2.236334  0.154765
3 -0.127439  0.415495
4  1.399427 -1.244539
5 -0.884309 -0.108502
6 -0.884931  2.089305
7  0.075599  0.404521
8  1.836577 -0.762597
9  0.294883  0.540444
# 70% of the DataFrame
part_70 = df.sample(frac=0.7, random_state=10)
print(part_70)
          0         1
8  1.836577 -0.762597
2 -2.236334  0.154765
5 -0.884309 -0.108502
6 -0.884931  2.089305
3 -0.127439  0.415495
1 -0.676528  0.301175
0  0.375727 -1.297127
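Note that sample draws a random 70% of the rows (random_state just makes it reproducible), not the first rows. If you also need the remaining 30%, dropping the sampled labels should work:
rest_30 = df.drop(part_70.index)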

Pandas.DataFrame - find the oldest date for which a value is available

I have a pandas.DataFrame object containing 2 time series. One series is much shorter than the other.
I want to determine the oldest date for which data is available in the shorter series, and remove the data in the 2 columns before that date.
What is the most pythonic way to do that?
(I apologize that I don't really follow the SO guidelines for submitting questions.)
Here is a fragment of my dataframe:
              osr      go
Date
1990-08-17    NaN  239.75
1990-08-20    NaN  251.50
1990-08-21 352.00  265.00
1990-08-22 353.25  274.25
1990-08-23 351.75  290.25
In this case, I want to get rid of all rows before 1990-08-21. (I should add that there may be NaNs in one of the columns for more recent dates.)
You can use idxmax on the inverted series s = df['osr'][::-1] and then take a subset of df:
print(df)
#               osr      go
# Date
# 1990-08-17    NaN  239.75
# 1990-08-20    NaN  251.50
# 1990-08-21 352.00  265.00
# 1990-08-22 353.25  274.25
# 1990-08-23 351.75  290.25

s = df['osr'][::-1]
print(s)
# Date
# 1990-08-23    351.75
# 1990-08-22    353.25
# 1990-08-21    352.00
# 1990-08-20       NaN
# 1990-08-17       NaN
# Name: osr, dtype: float64

maxnull = s.isnull().idxmax()
print(maxnull)
# 1990-08-20 00:00:00

print(df[df.index > maxnull])
#               osr      go
# Date
# 1990-08-21 352.00  265.00
# 1990-08-22 353.25  274.25
# 1990-08-23 351.75  290.25
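A shorter alternative (a sketch, not from the original answer): Series.first_valid_index() returns the earliest index label with a non-NaN value, so you can simply slice from there:
start = df['osr'].first_valid_index()
df = df.loc[start:]  # keeps 1990-08-21 onwards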
EDIT: New answer based upon comments/edits
It sounds like the data is sequential, and once you have rows that don't have data you want to throw them out. This can be done easily with dropna.
df = df.dropna()
This answer assumes that once you are past the bad rows, they stay good. It also works if you don't care about dropping rows in the middle... it depends on how sequential you need to be. If the data needs to be sequential and your input is well formed, jezrael's answer is good.
Original answer
You haven't given much here by way of structure in your dataframe, so I am going to make some assumptions. I'm going to assume you have a date index and many columns, two of which, time_series_1 and time_series_2, are the series you referred to in your question, all stored in df.
First we can find the shorter series. Both columns of a DataFrame have the same length, so compare the number of non-NaN values instead:
shorter_col = df['time_series_1'] if df['time_series_1'].count() < df['time_series_2'].count() else df['time_series_2']
Now we want the first date in it that actually has data:
remove_date = shorter_col.first_valid_index()
Now we remove the data before that date:
df = df[df.index >= remove_date]

Search in pandas dataframe

Potentially a slightly misleading title but the problem is this:
I have a large dataframe with multiple columns. This looks a bit like
df =
id  date        value
A   01-01-2015  1.0
A   03-01-2015  1.2
...
B   01-01-2015  0.8
B   02-01-2015  0.8
...
What I want to do is, within each of the IDs, identify the date one week earlier and place the value from that date into e.g. a 'lagvalue' column. The problem comes from not all dates existing for all IDs, so a simple .shift(7) won't pull the correct value [in that case I guess I should put a NaN in].
I can do this with a lot of horrible iterating over the dates and IDs to find the value, for example something rough like
[
    df[
        df['date'] == df['date'].iloc[i] - datetime.timedelta(weeks=1)
    ][
        df['id'] == df['id'].iloc[i]
    ]['value']
    for i in range(len(df.index))
]
but I'm certain there is a 'better' way to do it that cuts down on time and processing; I just can't think of it right now.
I could write a function using a groupby on the id and then look within that, and I'm certain that would reduce the time the operation takes - is there a much quicker, simpler way [aka am I having a dim day]?
The basic strategy is, for each id, to:
Use a date index
Use reindex to expand the data to include all dates
Use shift to shift 7 spots
Use ffill to do last-value interpolation. I'm not sure if you want this, or possibly bfill, which will use the last value less than a week in the past. It's simple to change. Alternatively, if you want NaN when nothing is available 7 days in the past, you can just remove the *fill completely.
Drop the unneeded data
This algorithm gives NaN when the lag is too far in the past.
There are a few assumptions here. In particular, that the dates are unique inside each id and that they are sorted. If not sorted, use sort_values to sort by id and date. If there are duplicate dates, some rule will be needed to resolve which value to use.
import pandas as pd
import numpy as np

dates = pd.date_range('2001-01-01', periods=100)
dates = dates[::3]
A = pd.DataFrame({'date': dates,
                  'id': ['A'] * len(dates),
                  'value': np.random.randn(len(dates))})

dates = pd.date_range('2001-01-01', periods=100)
dates = dates[::5]
B = pd.DataFrame({'date': dates,
                  'id': ['B'] * len(dates),
                  'value': np.random.randn(len(dates))})

df = pd.concat([A, B])

with_lags = []
for id, group in df.groupby('id'):
    group = group.set_index(group.date)
    index = group.index
    # expand to a full daily range so that shift(7) means "7 days"
    group = group.reindex(pd.date_range(group.index[0], group.index[-1]))
    group = group.ffill()
    group['lag_value'] = group.value.shift(7)
    group = group.loc[index]  # drop the filler dates again
    with_lags.append(group)

with_lags = pd.concat(with_lags, axis=0)
with_lags.index = np.arange(with_lags.shape[0])
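An alternative sketch (not from the original answer) that avoids the reindexing: shift each date forward by a week in a copy and self-merge on (id, date), so rows with no record exactly one week earlier naturally get NaN:
# hypothetical helper frame; assumes df['date'] is already datetime64
lagged = df.assign(date=df['date'] + pd.Timedelta(weeks=1))
lagged = lagged.rename(columns={'value': 'lag_value'})
df = df.merge(lagged[['id', 'date', 'lag_value']], on=['id', 'date'], how='left')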
