Why repeating a pd.series does not work as expected? - python

I have just started working with python 3.7 and I am trying to create a series e.g from 0 to 23 and repeat it. Using
rep1 = pd.Series(range(24))
I figured out how to make the first 24 values and I wanted to "copy-paste" it many times so that the final series is the original 5 times, one after the other. The result with rep = pd.Series.repeat(rep1, 5) gives me a result that looks like this and it's not what I want
0 0 0 0 0 1 1 1 1 1 2 2 2 2 2 ...
What I seek for is the 0-23 range multiple times. Any advice?

you can try this:
pd.concat([rep1]*5)
This will repeat your series 5 times.

Another solution using numpy.tile:
import numpy as np
rep = pd.Series(np.tile(rep1, 5))

If you want the repeated Series as one data object then use a pandas DataFrame for this. A DataFrame is multiple pandas Series in one object, sharing an index.
So firstly I am creating a python list, of 0-23, 5 times.
Then I put this into a DataFrame and optionally transpose so that I have the rows going down rather than across in this example.
import pandas as pd
lst = [list(range(0,24))] * 5
rep = pd.DataFrame(lst).transpose()

You could use a list to generate directly your Series.
rep = pd.Series(list(range(24))*5)

Related

Pandas: add a new column with one single value at the last row of a dataframe

My request is simple but I am stuck and do not know why... my dataframe looks like this:
price time
0 1 3
1 3 6
2 4 7
What I need to do is to add a new column mkt with only one value equal to 10 at the last row (in my example index = 2). What I have tried:
df.mkt=''
df.mkt.loc[-1] =10
But when I want to see again my dataframe the last row is not updated... I know the answer must be simple but I am stuck? Any idea? thanks!
You can use the at function:
df.mkt = ''
df.at[-1, 'mkt'] = 10
Use pd.Series:
df['mkt'] = [''] * (len(df)-1) +[10]

Finding rows with highest means in dataframe

I am trying to find the rows, in a very large dataframe, with the highest mean.
Reason: I scan something with laser trackers and used a "higher" point as reference to where the scan starts. I am trying to find the object placed, through out my data.
I have calculated the mean of each row with:
base = df.mean(axis=1)
base.columns = ['index','Mean']
Here is an example of the mean for each row:
0 4.407498
1 4.463597
2 4.611886
3 4.710751
4 4.742491
5 4.580945
This seems to work fine, except that it adds an index column, and gives out columns with an index of type float64.
I then tried this to locate the rows with highest mean:
moy = base.loc[base.reset_index().groupby(['index'])['Mean'].idxmax()]
This gives out tis :
index Mean
0 0 4.407498
1 1 4.463597
2 2 4.611886
3 3 4.710751
4 4 4.742491
5 5 4.580945
But it only re-index (I have now 3 columns instead of two) and does nothing else. It still shows all rows.
Here is one way without using groupby
moy=base.sort_values('Mean').tail(1)
It looks as though your data is a string or single column with a space in between your two numbers. Suggest splitting the column into two and/or using something similar to below to set the index to your specific column of interest.
import pandas as pd
df = pd.read_csv('testdata.txt', names=["Index", "Mean"], delimiter="\s+")
df = df.set_index("Index")
print(df)

Pandas: Nesting Dataframes

Hello I want to store a dataframe in another dataframe cell.
I have a data that looks like this
I have daily data which consists of date, steps, and calories. In addition, I have minute by minute HR data of a specific date. Obviously it would be easy to put the minute by minute data in 2 dimensional list but I'm fearing that would be harder to analyze later.
What would be the best practice when I want to have both data in one dataframe? Is it even possible to even nest dataframes?
Any better ideas ? Thanks!
Yes, it seems possible to nest dataframes but I would recommend instead rethinking how you want to structure your data, which depends on your application or the analyses you want to run on it after.
How to "nest" dataframes into another dataframe
Your dataframe containing your nested "sub-dataframes" won't be displayed very nicely. However, just to show that it is possible to nest your dataframes, take a look at this mini-example:
Here we have 3 random dataframes:
>>> df1
0 1 2
0 0.614679 0.401098 0.379667
1 0.459064 0.328259 0.592180
2 0.916509 0.717322 0.319057
>>> df2
0 1 2
0 0.090917 0.457668 0.598548
1 0.748639 0.729935 0.680409
2 0.301244 0.024004 0.361283
>>> df3
0 1 2
0 0.200375 0.059798 0.665323
1 0.086708 0.320635 0.594862
2 0.299289 0.014134 0.085295
We can make a main dataframe that includes these dataframes as values in individual "cells":
df = pd.DataFrame({'idx':[1,2,3], 'dfs':[df1, df2, df3]})
We can then access these nested datframes as we would access any value in any other dataframe:
>>> df['dfs'].iloc[0]
0 1 2
0 0.614679 0.401098 0.379667
1 0.459064 0.328259 0.592180
2 0.916509 0.717322 0.319057

Pandas - get first n-rows based on percentage

I have a dataframe i want to pop certain number of records, instead on number I want to pass as a percentage value.
for example,
df.head(n=10)
Pops out first 10 records from data set. I want a small change instead of 10 records i want to pop first 5% of record from my data set.
How to do this in pandas.
I'm looking for a code like this,
df.head(frac=0.05)
Is there any simple way to get this?
I want to pop first 5% of record
There is no built-in method but you can do this:
You can multiply the total number of rows to your percent and use the result as parameter for head method.
n = 5
df.head(int(len(df)*(n/100)))
So if your dataframe contains 1000 rows and n = 5% you will get the first 50 rows.
I've extended Mihai's answer for my usage and it may be useful to people out there.
The purpose is automated top-n records selection for time series sampling, so you're sure you're taking old records for training and recent records for testing.
# having
# import pandas as pd
# df = pd.DataFrame...
def sample_first_prows(data, perc=0.7):
import pandas as pd
return data.head(int(len(data)*(perc)))
train = sample_first_prows(df)
test = df.iloc[max(train.index):]
I also had the same problem and #mihai's solution was useful. For my case I re-wrote to:-
percentage_to_take = 5/100
rows = int(df.shape[0]*percentage_to_take)
df.head(rows)
I presume for last percentage rows df.tail(rows) or df.head(-rows) would work as well.
may be this will help:
tt = tmp.groupby('id').apply(lambda x: x.head(int(len(x)*0.05))).reset_index(drop=True)
df=pd.DataFrame(np.random.randn(10,2))
print(df)
0 1
0 0.375727 -1.297127
1 -0.676528 0.301175
2 -2.236334 0.154765
3 -0.127439 0.415495
4 1.399427 -1.244539
5 -0.884309 -0.108502
6 -0.884931 2.089305
7 0.075599 0.404521
8 1.836577 -0.762597
9 0.294883 0.540444
#70% of the Dataframe
part_70=df.sample(frac=0.7,random_state=10)
print(part_70)
0 1
8 1.836577 -0.762597
2 -2.236334 0.154765
5 -0.884309 -0.108502
6 -0.884931 2.089305
3 -0.127439 0.415495
1 -0.676528 0.301175
0 0.375727 -1.297127

Filling missing time values in a multi-indexed dataframe

Problem and what I want
I have a data file that comprises time series read asynchronously from multiple sensors. Basically for every data element in my file, I have a sensor ID and time at which it was read, but I do not always have all sensors for every time, and read times may not be evenly spaced. Something like:
ID,time,data
0,0,1
1,0,2
2,0,3
0,1,4
2,1,5 # skip some sensors for some time steps
0,2,6
2,2,7
2,3,8
1,5,9 # skip some time steps
2,5,10
Important note the actual time column is of datetime type.
What I want is to be able to zero-order hold (forward fill) values for every sensor for any time steps where that sensor does not exist, and either set to zero or back fill any sensors that are not read at the earliest time steps. What I want is a dataframe that looks like it was read from:
ID,time,data
0,0,1
1,0,2
2,0,3
0,1,4
1,1,2 # ID 1 hold value from time step 0
2,1,5
0,2,6
1,2,2 # ID 1 still holding
2,2,7
0,3,6 # ID 0 holding
1,3,2 # ID 1 still holding
2,3,8
0,5,6 # ID 0 still holding, can skip totally missing time steps
1,5,9 # ID 1 finally updates
2,5,10
Pandas attempts so far
I initialize my dataframe and set my indices:
df = pd.read_csv(filename, dtype=np.int)
df.set_index(['ID', 'time'], inplace=True)
I try to mess with things like:
filled = df.reindex(method='ffill')
or the like with various values passed to the index keyword argument like df.index, ['time'], etc. This always either throws an error because I passed an invalid keyword argument, or does nothing visible to the dataframe. I think it is not recognizing that the data I am looking for is "missing".
I also tried:
df.update(df.groupby(level=0).ffill())
or level=1 based on Multi-Indexed fillna in Pandas, but I get no visible change to the dataframe again, I think because I don't have anything currently where I want my values to go.
Numpy attempt so far
I have had some luck with numpy and non-integer indexing using something like:
data = [np.array(df.loc[level].data) for level in df.index.levels[0]]
shapes = [arr.shape for arr in data]
print(shapes)
# [(3,), (2,), (5,)]
data = [np.array([arr[i] for i in np.linspace(0, arr.shape[0]-1, num=max(shapes)[0])]) for arr in data]
print([arr.shape for arr in data])
# [(5,), (5,), (5,)]
But this has two problems:
It takes me out of the pandas world, and I now have to manually maintain my sensor IDs, time index, etc. along with my feature vector (the actual data column is not just one column but a ton of values from a sensor suite).
Given the number of columns and the size of the actual dataset, this is going to be clunky and inelegant to implement on my real example. I would prefer a way of doing it in pandas.
The application
Ultimately this is just the data-cleaning step for training recurrent neural network, where for each time step I will need to feed a feature vector that always has the same structure (one set of measurements for each sensor ID for each time step).
Thank you for your help!
Here is one way , by using reindex and category
df.time=df.time.astype('category',categories =[0,1,2,3,4,5])
new_df=df.groupby('time',as_index=False).apply(lambda x : x.set_index('ID').reindex([0,1,2])).reset_index()
new_df['data']=new_df.groupby('ID')['data'].ffill()
new_df.drop('time',1).rename(columns={'level_0':'time'})
Out[311]:
time ID data
0 0 0 1.0
1 0 1 2.0
2 0 2 3.0
3 1 0 4.0
4 1 1 2.0
5 1 2 5.0
6 2 0 6.0
7 2 1 2.0
8 2 2 7.0
9 3 0 6.0
10 3 1 2.0
11 3 2 8.0
12 4 0 6.0
13 4 1 2.0
14 4 2 8.0
15 5 0 6.0
16 5 1 9.0
17 5 2 10.0
You can have a dictionary of last readings for each sensors. You'll have to pick some initial value; the most logical choice is probably to back-fill the earliest reading to earlier times. Once you've populated your last_reading dictionary, you can just sort all the readings by time, update the dictionary for each reading, and then fill in rows according to the dictionay. So after you have your last_reading dictionary initialized:
last_time = readings[1][time]
for reading in readings:
if reading[time] > last_time:
for ID in ID_list:
df.loc[last_time,ID] = last_reading[ID]
last_time = reading[time]
last_reading[reading[ID]] = reading[data]
#the above for loop doesn't update for the last time
#so you'll have to handle that separately
for ID in ID_list:
df.loc[last_time,ID] = last_reading[ID]
last_time = reading[time]
This assumes that you have only one reading for each time/sensor pair, and that 'readings' a list of dictionaries sorted by time. It also assumes that df has the different sensors as columns and different times as index. Adjust the code as necessary if otherwise. You can also probably optimize it a bit more by updating a whole row at once instead of using a for loop, but I didn't want to deal with making sure I had the Pandas syntax right.
Looking at the application, though, you might want to have each cell in the dataframe be not a number but a tuple of last value and time it was read, so replace last_reading[reading[ID]] = reading[data] with
last_reading[reading[ID]] = [reading[data],reading[time]]. Your neural net can then decide how to weight data based on how old it is.
I got this to work with the following, which I think is pretty general for any case like this where the time index for which you want to fill values is the second in a multi-index with two indices:
# Remove duplicate time indices (happens some in the dataset, pandas freaks out).
df = df[~df.index.duplicated(keep='first')]
# Unstack the dataframe and fill values per serial number forward, backward.
df = df.unstack(level=0)
df.update(df.ffill()) # first ZOH forward
df.update(df.bfill()) # now back fill values that are not seen at the beginning
# Restack the dataframe and re-order the indices.
df = df.stack(level=1)
df = df.swaplevel()
This gets me what I want, although I would love to be able to keep the duplicate time entries if anybody knows of a good way to do this.
You could also use df.update(df.fillna(0)) instead of backfilling if starting unseen values at zero is preferable for a particular application.
I put the above code block in a function called clean_df that takes the dataframe as argument and returns the cleaned dataframe.

Categories

Resources