Python Pandas value-dependent column creation - python

I have a pandas DataFrame with columns "Time" and "A". For each row, df["Time"] is an integer timestamp and df["A"] is a float. I want to create a new column "B" that holds the value of df["A"] from the latest row whose Time is at most five seconds after the current row's Time. I can do this iteratively as:
for i in df.index:
    df.loc[i, "B"] = df["A"][max(df[df["Time"] <= df["Time"][i] + 5].index)]
However, the df has tens of thousands of records so this takes far too long, and I need to run this a few hundred times so my solution isn't really an option. I am somewhat new to pandas (and only somewhat less new to programming in general) so I'm not sure if there's an obvious solution to this supported by pandas.
It would help if I had a way of referencing the specific value of df["Time"] in each row while creating the column, so I could do something like:
df["B"] = df["A"][max(df[df["Time"] <= df["Time"][corresponding_row]+5].index)]
Thanks.
Edit: Here's an example of what my goal is. If the dataframe is as follows:
Time A
0 0
1 1
4 2
7 3
8 4
10 5
12 6
15 7
18 8
20 9
then I would like the result to be:
Time A B
0 0 2
1 1 2
4 2 4
7 3 6
8 4 6
10 5 7
12 6 7
15 7 9
18 8 9
20 9 9
where each value in B is the value of A from the latest row whose Time exceeds the current row's Time by at most 5. So if Time is the index as well, then df["B"][0] = df["A"][4], since 4 is the largest time which is at most 5 greater than 0. In code, 4 = max(df["Time"][df["Time"] <= 0+5]), which is why df["B"][0] is df["A"][4].

Use tshift. You may need to resample first to fill in any missing values. I don't have time to test this, but try this.
df['B'] = df.resample('s', how='ffill').tshift(5, freq='s').reindex_like(df)
And a tip for getting help here: if you provide a few rows of sample data and an example of your desired result, it's easy for us to copy/paste and try out a solution for you.
Edit
OK, looking at your example data, let's leave your Time column as integers.
In [59]: df
Out[59]:
      A
Time
0     0
1     1
4     2
7     3
8     4
10    5
12    6
15    7
18    8
20    9
Make an array containing the first and last Time values and all the integers in between.
In [60]: index = np.arange(df.index.values.min(), df.index.values.max() + 1)
Make a new DataFrame with all the gaps filled in.
In [61]: df1 = df.reindex(index, method='ffill')
Make a new column with the same data shifted up by 5 -- that is, looking forward in time by 5 seconds.
In [62]: df1['B'] = df1['A'].shift(-5)
And now drop all the filled-in times we added, taking only values from the original Time index.
In [63]: df1.reindex(df.index)
Out[63]:
      A    B
Time
0     0    2
1     1    2
4     2    4
7     3    6
8     4    6
10    5    7
12    6    7
15    7    9
18    8  NaN
20    9  NaN
How you fill in the last values, for which there is no "five seconds later", is up to you. Judging from your desired output, you could use fillna with a constant set to the last value in column A.
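As an alternative that is not part of the answer above, the lookup can also be done without filling in the gaps. This is a minimal sketch, assuming the Time column is sorted in ascending order; it uses np.searchsorted to find, for each row, the position of the last row whose Time is at most 5 greater. The variable names times and pos are only illustrative.

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "Time": [0, 1, 4, 7, 8, 10, 12, 15, 18, 20],
    "A": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
})

# Assumption: Time is sorted ascending (searchsorted requires this)
times = df["Time"].to_numpy()

# side="right" counts how many Time values are <= Time + 5;
# subtracting 1 gives the position of the last such row
pos = np.searchsorted(times, times + 5, side="right") - 1

# Pick the value of A at those positions
df["B"] = df["A"].to_numpy()[pos]
print(df)
```

Because every row satisfies Time <= Time + 5, the last rows fall back to the final value of A, so no fillna step is needed with this approach.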

Related

How can I combine chronologically consecutive rows based on a condition in pandas?

I have a dataset that looks like this:
begin end type
0 1 3 A
1 3 7 A
2 7 9 A
3 9 15 B
4 15 17 B
5 17 20 A
I would like to group rows that are from the same type and that are chronologically consecutive, such that the resulting DataFrame looks like this:
begin end type
0 1 9 A
1 9 17 B
2 17 20 A
I could, of course, write a function that checks each row and looks ahead until a different type is found, but I feel there must be an easier way. What would be the most idiomatic pandas way to do this?
I have seen other similar questions, but none of them really apply to my case.
Given your data, you can shift the end column and compare that to begin:
groups = df.groupby('type')['end'].shift().ne(df['begin']).cumsum()
(df.groupby(['type', groups])
   .agg({'begin':'first', 'end':'last'})
)
Output:
        begin  end
type
A    1      1    9
     3     17   20
B    2      9   17
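The output above is ordered by type rather than chronologically. A minimal sketch of a variant that keeps the original row order and the type column is shown below; it assumes the rows are already sorted by begin, and the names new_block and block_id are only illustrative.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "begin": [1, 3, 7, 9, 15, 17],
    "end": [3, 7, 9, 15, 17, 20],
    "type": list("AAABBA"),
})

# A new block starts when the type changes or when a row does not
# begin where the previous row ended
new_block = (df["type"] != df["type"].shift()) | (df["begin"] != df["end"].shift())

# Running count of block starts gives one id per consecutive block
block_id = new_block.cumsum()

result = (
    df.groupby(block_id)
      .agg(begin=("begin", "first"), end=("end", "last"), type=("type", "first"))
      .reset_index(drop=True)
)
print(result)
```

Grouping only by the block id (and not by type) is what preserves the chronological order of the merged rows.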

Adding multiple row values into one row keeping the index interval as same as the number of row added in python

I have a data frame with multiple columns (30/40) in a time series continuously from 1 to 1440 minutes.
df
time colA colB colC.....
1 5 4 3
2 1 2 3
3 5 4 3
4 6 7 3
5 9 0 3
6 4 4 0
..
Now I want to add every two rows together into one, while keeping the 'time' index spaced by the number of rows I am adding. The resulting data frame is:
df
time colA colB colC.......
1 6 6 6
3 11 11 6
5 13 4 3
..
Here I added two row values into one but the time index interval is also same as 2 rows. 1,3,5...
Is it possible to achieve that?
Another way would be to group your data set every two rows and aggregate using sum on your 'colX' columns and mean on your time column. Chaining astype(int) truncates the averaged time values (1.5 becomes 1, 3.5 becomes 3):
d = {col: 'sum' for col in [c for c in df.columns if c.startswith('col')]}
df.groupby(df.index // 2).agg({**d,'time': 'mean'}).astype(int)
prints back:
colA colB colC time
0 6 6 6 1
1 11 11 6 3
2 13 4 3 5
One way is to add the alternating rows together for all columns and then fix the time column:
df_new = df[1::2].reset_index(drop=True) + df[::2].reset_index(drop=True)
df_new['time'] = df[::2]['time'].values
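Both answers can be generalized to blocks of n rows. The following is a minimal sketch under the assumption that the frame has a default integer row order and that the first time value of each block should be kept; n, block, and agg_spec are illustrative names, not part of the original answers.

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "time": [1, 2, 3, 4, 5, 6],
    "colA": [5, 1, 5, 6, 9, 4],
    "colB": [4, 2, 4, 7, 0, 4],
    "colC": [3, 3, 3, 3, 3, 0],
})

n = 2  # number of consecutive rows to add together

# Positional block label: 0, 0, 1, 1, 2, 2 for n = 2
block = np.arange(len(df)) // n

# Sum every data column, keep the first time value of each block
agg_spec = {col: "sum" for col in df.columns if col != "time"}
agg_spec["time"] = "first"

result = (
    df.groupby(block)
      .agg(agg_spec)[df.columns.tolist()]  # restore the original column order
      .reset_index(drop=True)
)
print(result)
```

Using "first" for the time column avoids the float average and the astype(int) truncation step.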

Best way to compute sequence?

I just started learning pandas and I was trying to figure out the easiest possible solution for the problem mentioned below.
Suppose, I've a dataframe like this ->
A B
6 7
8 9
5 6
7 8
Here, I'm selecting the minimum value cell from column 'A' as the starting point and updating the sequence in the new column 'C'. After sequencing dataframe must look like this ->
A B C
5 6 0
6 7 1
7 8 2
8 9 3
Is there any easy way to pick a cell from column 'A', match it with the matching cell in column 'B', and update the sequence accordingly in column 'C'?
Some extra conditions ->
If 5 is present in column 'B' then I need to add another row like this -
A B C
0 5 0
5 6 1
......
Try sort_values:
df.sort_values('A').assign(C=np.arange(len(df)))
Output:
A B C
2 5 6 0
0 6 7 1
3 7 8 2
1 8 9 3
I'm not sure what you mean by the extra conditions, though.

Pandas Count values across rows that are greater than another value in a different column

I have a pandas dataframe like this:
X a b c
1 1 0 2
5 4 7 3
6 7 8 9
I want to add a column called 'count' which holds, for each row, the number of values greater than the value in the first column ('X' in my case). The output should look like:
X a b c Count
1 1 0 2 2
5 4 7 3 1
6 7 8 9 3
I would like to avoid a lambda function, a for loop, or any other kind of looping technique, since my dataframe has a large number of rows. I tried something like this, but I couldn't get what I wanted:
df['count']=df [ df.iloc [:,1:] > df.iloc [:,0] ].count(axis=1)
I also tried numpy.where() but didn't have any luck with that either. So any help will be appreciated. My dataframe also contains NaN values, which I would like to ignore when counting.
Thanks for your help in advance!
You can use ge (>=) with sum:
df.iloc[:,1:].ge(df.iloc[:,0],axis = 0).sum(axis = 1)
Out[784]:
0 2
1 1
2 3
dtype: int64
Then assign it back:
df['Count']=df.iloc[:,1:].ge(df.iloc [:,0],axis=0).sum(axis=1)
df
Out[786]:
X a b c Count
0 1 1 0 2 2
1 5 4 7 3 1
2 6 7 8 9 3
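The question also mentions NaN values. As a small sketch of how they behave here (the NaN placed in column b is an assumption added for illustration): a comparison against NaN evaluates to False, so missing cells are simply not counted.

```python
import numpy as np
import pandas as pd

# Data from the question, with one value replaced by NaN for illustration
df = pd.DataFrame({
    "X": [1, 5, 6],
    "a": [1, 4, 7],
    "b": [0, 7, np.nan],
    "c": [2, 3, 9],
})

# Compare every other column against X row by row (axis=0 aligns on the index)
others = df.drop(columns="X")
df["Count"] = others.ge(df["X"], axis=0).sum(axis=1)
print(df)
```

In the last row only 7 and 9 are counted, because the NaN comparison is False.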
In case anyone needs to count against two reference columns, you can add the results of .le and .ge in one line (thanks to @Wen for the answer to my question):
df['count']=(df.iloc[:,2:5].le(df.iloc[:,0],axis=0).sum(axis=1) + df.iloc[:,2:5].ge(df.iloc[:,1],axis=0).sum(axis=1))

Automating slicing procedures using pandas

I am currently using Pandas and Python to handle many of the repetitive tasks I need done for my master's thesis. At this point, I have written some code (with help from Stack Overflow) that, based on some event dates in one file, finds a start and end date to use as a date range in another file. The rows in each date range are then located and appended to an initially empty list, which I can then output to Excel. The code below gives me a dataframe with 5 columns and 400,000+ rows (which is basically the data I want), but it is not laid out the way I want for Excel. Below is my code:
end_date = pd.DataFrame(data=(df_sample['Date']-pd.DateOffset(days=2)))
start_date = pd.DataFrame(data=(df_sample['Date']-pd.offsets.BDay(n=252)))
merged_dates = pd.merge(end_date,start_date,left_index=True,right_index=True)
ff_factors = []
for index, row in merged_dates.iterrows():
    time_range= (df['Date'] > row['Date_y']) & (df['Date'] <= row['Date_x'])
    df_factor = df.loc[time_range]
    ff_factors.append(df_factor)
appended_data = pd.concat(ff_factors, axis=0)
I need the data to be blocks of 5 columns and 250 rows (columns are variable identifiers) placed side by side, so that when outputting it to Excel I have, for example, columns A-D with 250 rows each. This then needs to be repeated for columns E-H and so on. Using iloc, I can locate the first 250 observations with appended_data.iloc[0:250], which has 5 columns and 250 rows, and then output it to Excel.
Is there any way for me to automate the process, so that after selecting the first 250 rows and outputting them to Excel, it selects the next 250 and outputs them next to the first 250, and so on?
I hope the above is precise and clear, else I'm happy to elaborate!
EDIT:
The above picture illustrates what I get when outputting to Excel: 5 columns and 407,764 rows. What I need is to get this split up in the following way:
The second picture illustrates how I need the total sample to be split up. The first five columns and the corresponding 250 rows need to look like the second picture. When I do the next split using iloc[250:500], I will get the next 250 rows, which need to be added after the initial five columns, and so on.
You can do this with a combination of np.reshape and pd.concat. np.reshape can be made to behave as desired on individual columns and should be much faster than looping through the rows; pd.concat joins the resulting dataframes back together:
def reshape_appended(df, target_rows, pad=4):
    df = df.copy() # don't modify in-place
    # below line adds strings, '0000',...,'0004' to the column names
    # this ensures sorting the columns preserves the order
    df.columns = [str(i).zfill(pad)+df.columns[i] for i in range(len(df.columns))]
    #target number of new columns per column in df
    target_cols = len(df.index)//target_rows
    last_group = pd.DataFrame()
    # below conditional fires if there will be leftover rows - % is mod
    if len(df.index)%target_rows != 0:
        last_group = df.iloc[-(len(df.index)%target_rows):].reset_index(drop=True)
        df = df.iloc[:-(len(df.index)%target_rows)] # keep rows that divide nicely
    #this is a large list comprehension, that I'll elaborate on below
    groups = [pd.DataFrame(df[col].values.reshape((target_rows, target_cols),
                                                  order='F'),
                           columns=[str(i).zfill(pad)+col for i in range(target_cols)])
              for col in df.columns]
    if not last_group.empty: # if there are leftover rows, add them back
        last_group.columns = [pad*'9'+col for col in last_group.columns]
        groups.append(last_group)
    out = pd.concat(groups, axis=1).sort_index(axis=1)
    out.columns = out.columns.str[2*pad:] # remove the extra characters in the column names
    return out
last_group takes care of any rows that don't divide evenly into sets of 250. The playing around with column names enforces proper sorting order.
df[col].values.reshape((target_rows, target_cols), order='F')
Reshapes the values in the column col of df into the shape specified by the tuple (target_rows, target_cols), with the ordering Fortran uses, indicated by F.
columns=[str(i).zfill(pad)+col for i in range(target_cols)]
just gives names to these columns, with an eye to establishing proper ordering afterward.
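To see what the order='F' argument does in isolation, here is a small sketch on a toy array (the array values and the names by_column and by_row are chosen only for illustration). With Fortran order the values fill the result column by column, which is what puts each consecutive run of target_rows values into its own column.

```python
import numpy as np

values = np.arange(6)  # [0, 1, 2, 3, 4, 5]

# Fortran order: fill down the first column, then the second
by_column = values.reshape((3, 2), order="F")

# Default C order: fill across the first row, then the next
by_row = values.reshape((3, 2))

print(by_column)
print(by_row)
```

Here by_column has 0, 1, 2 in its first column and 3, 4, 5 in its second, while by_row has 0, 1 in its first row.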
Ex:
df = pd.DataFrame(np.random.randint(0, 10, (23, 3)), columns=list('abc'))
reshape_appended(df, 5)
Out[160]:
a b c a b c a b c a b c a b c
0 8 3 0 4 1 9 5 4 7 2 3 4 5.0 7.0 2.0
1 1 6 1 3 5 1 1 6 0 5 9 4 6.0 0.0 1.0
2 3 1 3 4 3 8 9 3 9 8 7 8 7.0 3.0 2.0
3 4 0 1 5 5 6 6 4 4 0 0 3 NaN NaN NaN
4 9 7 3 5 7 4 6 5 8 9 5 5 NaN NaN NaN
df
Out[161]:
a b c
0 8 3 0
1 1 6 1
2 3 1 3
3 4 0 1
4 9 7 3
5 4 1 9
6 3 5 1
7 4 3 8
8 5 5 6
9 5 7 4
10 5 4 7
11 1 6 0
12 9 3 9
13 6 4 4
14 6 5 8
15 2 3 4
16 5 9 4
17 8 7 8
18 0 0 3
19 9 5 5
20 5 7 2
21 6 0 1
22 7 3 2
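My best guess at solving the problem would be to loop over the rows in steps of 250 until the left limit passes the length of the dataframe, collecting each slice:
i = 250  # right limit (exclusive)
j = 0  # left limit
pieces = []
while j < len(appended_data):
    # iloc clips the slice at the end of the frame, so the last piece may be shorter
    pieces.append(appended_data.iloc[j:i].reset_index(drop=True))
    j = i
    i += 250
# put the pieces next to each other
result = pd.concat(pieces, axis=1)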
