Best way to compute sequence? - python

I just started learning pandas and I was trying to figure out the easiest possible solution for the problem below.
Suppose I have a DataFrame like this:
A B
6 7
8 9
5 6
7 8
Here, I'm selecting the minimum value in column 'A' as the starting point and writing the sequence into a new column 'C'. After sequencing, the DataFrame must look like this:
A B C
5 6 0
6 7 1
7 8 2
8 9 3
Is there an easy way to pick a cell from column 'A', match it with the matching cell in column 'B', and update the sequence accordingly in column 'C'?
Some extra conditions ->
If 5 is present in column 'B', then I need to add another row like this:
A B C
0 5 0
5 6 1
......

Try sort_values (assuming numpy is imported as np):
df.sort_values('A').assign(C=np.arange(len(df)))
Output:
A B C
2 5 6 0
0 6 7 1
3 7 8 2
1 8 9 3
I'm not sure what you mean with the extra conditions though.
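For completeness, a self-contained sketch of the approach above, reproducing the question's example data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [6, 8, 5, 7], "B": [7, 9, 6, 8]})

# Sort by column A, then attach the running sequence 0..n-1 as column C.
out = df.sort_values("A").assign(C=np.arange(len(df)))
print(out)
```

Use out.reset_index(drop=True) afterwards if the original index values (2, 0, 3, 1) should be discarded.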

Related

Is there a way in Pandas to modify previous rows based on a certain row value?

I have a dataframe where something happens on every fifth row. I would like to represent column A below as a countdown before the 'x' on every fifth trial. Is there a simple code I could run that finds every 'x' and then numbers the rows after them 1-4?
Currently:
index A
1 1
2 2
3 3
4 4
5 x
6 6
7 7
8 8
9 9
10 x
Desired output:
index A
1 1
2 2
3 3
4 4
5 x
6 1
7 2
8 3
9 4
10 x
Instead of trying to look at it based on the position of the row within the actual DataFrame, you can reframe the problem to be related to modular arithmetic. You're looking to count the trials with mod 5 arithmetic. You can use a statement like the following, which will label which number the trial is and will give every 5th trial an 'x' as its value:
df["A"] = df.index.map(lambda x: 'x' if x%5 == 0 else x%5)
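A runnable sketch of that mapping, assuming a 1-based index as in the question's example:

```python
import pandas as pd

# 1-based index, with 'x' on every fifth row, as in the question.
df = pd.DataFrame({"A": [1, 2, 3, 4, "x", 6, 7, 8, 9, "x"]},
                  index=range(1, 11))

# Label each trial by its position within its block of five (mod-5 arithmetic);
# every fifth row (index divisible by 5) keeps its 'x'.
df["A"] = df.index.map(lambda x: "x" if x % 5 == 0 else x % 5)
print(df)
```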

How can I slice one column in a dataframe to several series based on a condition

I have a data frame that looks like this:
'A' diff('A')
0 1 NaN
1 2 1
2 5 3
3 2 -3
4 4 2
5 6 2
6 1 -5
7 7 6
8 9 2
What I would like to obtain is something like this:
'B'
0 1
1 2
2 5
'C'
0 2
1 4
2 6
'D'
0 1
1 7
2 9
I would like to slice column 'A' into several new columns; the condition for slicing is that the value in the column diff('A') is negative. I was thinking that an iterator should go through the dataframe and, whenever it encounters a negative value in diff('A'), slice the column off into a Series, then continue until it reaches the end of the column.
Does anyone have any ideas to do this?
Thanks in advance!
I believe your idea works, but it will be more efficient to use the pandas built-in boolean selector:
decreased_value = df[df['diff'] < 0]['A'].reset_index(drop=True)
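If you need the separate Series themselves rather than just the rows where the drop happens, one common pattern (a sketch, not part of the answer above) is to build a segment id with cumsum() and group on it:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 5, 2, 4, 6, 1, 7, 9]})
df["diff"] = df["A"].diff()

# Start a new segment wherever the difference turns negative;
# the cumulative sum of that boolean gives a segment id per row.
segment_id = (df["diff"] < 0).cumsum()

# One Series per segment, each re-indexed from 0.
segments = [g["A"].reset_index(drop=True) for _, g in df.groupby(segment_id)]
for s in segments:
    print(s.tolist())
```

On the question's data this yields the three pieces [1, 2, 5], [2, 4, 6], and [1, 7, 9].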

Pandas Count values across rows that are greater than another value in a different column

I have a pandas dataframe like this:
X a b c
1 1 0 2
5 4 7 3
6 7 8 9
I want to add a column called 'count' which holds the number of values in each row greater than the value in the first column ('X' in my case). The output should look like:
X a b c Count
1 1 0 2 2
5 4 7 3 1
6 7 8 9 3
I would like to refrain from using a lambda function, a 'for' loop, or any other kind of looping technique, since my dataframe has a large number of rows. I tried something like this, but I couldn't get what I wanted:
df['count'] = df[df.iloc[:, 1:] > df.iloc[:, 0]].count(axis=1)
I also tried numpy.where(), but didn't have any luck with that either, so any help will be appreciated. I also have NaN values in my dataframe, and I would like to ignore those when counting.
Thanks for your help in advance!
You can use ge (>=) with sum:
df.iloc[:,1:].ge(df.iloc[:,0],axis = 0).sum(axis = 1)
Out[784]:
0 2
1 1
2 3
dtype: int64
Then assign it back:
df['Count']=df.iloc[:,1:].ge(df.iloc [:,0],axis=0).sum(axis=1)
df
Out[786]:
X a b c Count
0 1 1 0 2 2
1 5 4 7 3 1
2 6 7 8 9 3
In case anyone needs such a solution for two bounding columns, you can just add the outputs you get from .le and .ge in one line. Thanks to @Wen for the answer to my question, though!
df['count'] = (df.iloc[:,2:5].le(df.iloc[:,0],axis=0).sum(axis=1) + df.iloc[:,2:5].ge(df.iloc[:,1],axis=0).sum(axis=1))
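One point from the question that the answer does not demonstrate: NaN values are ignored automatically, because comparisons against NaN evaluate to False. A small sketch with hypothetical data containing a NaN:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"X": [1, 5], "a": [1.0, np.nan], "b": [0, 7], "c": [2, 3]})

# Compare every column after the first against X (broadcast down the rows),
# then sum the booleans per row; NaN >= X is False, so NaNs are not counted.
df["Count"] = df.iloc[:, 1:].ge(df.iloc[:, 0], axis=0).sum(axis=1)
print(df)
```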

Sorting pandas.DataFrame in python sorted() function manner

Description
Long story short, I need a way to sort a DataFrame by a specific column, given a specific function, analogous to the "key" parameter of Python's built-in sorted() function. Yet there is no such "key" parameter in pd.DataFrame.sort_values().
The approach used for now
I have to create a new column to store the "scores" of each row, and delete it at the end. The problem with this approach is the necessity of generating a column name that does not already exist in the DataFrame, and it becomes more troublesome when sorting by multiple columns.
I wonder if there's a more suitable way for such purpose, in which there's no need to come up with a new column name, just like using a sorted() function and specifying parameter "key" in it.
Update: I changed my implementation by using a new object instead of generating a new string beyond those in the columns to avoid collision, as shown in the code below.
Code
Here is the example code. In this sample the DataFrame needs to be sorted according to the length of the data in the "snippet" column. Please don't make additional assumptions about the type of the objects in each row of the specific column. The only things given are the column itself and a function object/lambda expression (in this example: len) that takes each object in the column as input and produces a value, which is used for comparison.
def sort_table_by_key(self, ascending=True, key=len):
    """
    Sort the table inplace.
    """
    # column_tmp = "".join(self._table.columns)
    column_tmp = object()  # Create a new object to avoid column name collision.
    # Calculate the scores of the objects.
    self._table[column_tmp] = self._table["snippet"].apply(key)
    self._table.sort_values(by=column_tmp, ascending=ascending, inplace=True)
    del self._table[column_tmp]
This is not implemented at the moment; see GitHub issue 3942.
I think you need argsort and then select by iloc:
df = pd.DataFrame({
'A': ['assdsd','sda','affd','asddsd','ffb','sdb','db','cf','d'],
'B': list(range(9))
})
print (df)
A B
0 assdsd 0
1 sda 1
2 affd 2
3 asddsd 3
4 ffb 4
5 sdb 5
6 db 6
7 cf 7
8 d 8
def sort_table_by_length(column, ascending=True):
    if ascending:
        return df.iloc[df[column].str.len().argsort()]
    else:
        return df.iloc[df[column].str.len().argsort()[::-1]]
print (sort_table_by_length('A'))
A B
8 d 8
6 db 6
7 cf 7
1 sda 1
4 ffb 4
5 sdb 5
2 affd 2
0 assdsd 0
3 asddsd 3
print (sort_table_by_length('A', False))
A B
3 asddsd 3
0 assdsd 0
2 affd 2
5 sdb 5
4 ffb 4
1 sda 1
7 cf 7
6 db 6
8 d 8
How it works:
First, get the lengths as a new Series:
print (df['A'].str.len())
0 6
1 3
2 4
3 6
4 3
5 3
6 2
7 2
8 1
Name: A, dtype: int64
Then get the indices that would sort the values via argsort; for descending order, the result is simply reversed with [::-1]:
print (df['A'].str.len().argsort())
0 8
1 6
2 7
3 1
4 4
5 5
6 2
7 0
8 3
Name: A, dtype: int64
Finally, change the ordering with iloc:
print (df.iloc[df['A'].str.len().argsort()])
A B
8 d 8
6 db 6
7 cf 7
1 sda 1
4 ffb 4
5 sdb 5
2 affd 2
0 assdsd 0
3 asddsd 3
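For what it's worth, pandas 1.1 and later accept a key callable in sort_values itself, which makes both the temporary-column workaround and the argsort approach unnecessary. A sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["assdsd", "sda", "affd", "asddsd", "ffb", "sdb", "db", "cf", "d"],
    "B": range(9),
})

# pandas >= 1.1: the key callable receives each column as a Series
# before sorting, just like sorted()'s key receives each element.
out = df.sort_values("A", key=lambda s: s.str.len())
print(out)
```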

Python Pandas value-dependent column creation

I have a pandas DataFrame with columns "Time" and "A". For each row, df["Time"] is an integer timestamp and df["A"] is a float. I want to create a new column "B" which has the value of df["A"], but the one that occurs at or immediately before five seconds in the future. I can do this iteratively as:
for i in df.index:
    df["B"][i] = df["A"][max(df[df["Time"] <= df["Time"][i] + 5].index)]
However, the df has tens of thousands of records so this takes far too long, and I need to run this a few hundred times so my solution isn't really an option. I am somewhat new to pandas (and only somewhat less new to programming in general) so I'm not sure if there's an obvious solution to this supported by pandas.
It would help if I had a way of referencing the specific value of df["Time"] in each row while creating the column, so I could do something like:
df["B"] = df["A"][max(df[df["Time"] <= df["Time"][corresponding_row]+5].index)]
Thanks.
Edit: Here's an example of what my goal is. If the dataframe is as follows:
Time A
0 0
1 1
4 2
7 3
8 4
10 5
12 6
15 7
18 8
20 9
then I would like the result to be:
Time A B
0 0 2
1 1 2
4 2 4
7 3 6
8 4 6
10 5 7
12 6 7
15 7 9
18 8 9
20 9 9
where each entry in B comes from the value of A in the row whose Time is greater by at most 5. So if Time is the index as well, then df["B"][0] = df["A"][4], since 4 is the largest time which is at most 5 greater than 0. In code, 4 = max(df["Time"][df["Time"] <= 0 + 5]), which is why df["B"][0] is df["A"][4].
Use tshift. You may need to resample first to fill in any missing values. I don't have time to test this, but try this.
df['B'] = df.resample('s', how='ffill').tshift(5, freq='s').reindex_like(df)
And a tip for getting help here: if you provide a few rows of sample data and an example of your desired result, it's easy for us to copy/paste and try out a solution for you.
Edit
OK, looking at your example data, let's leave your Time column as integers.
In [59]: df
Out[59]:
A
Time
0 0
1 1
4 2
7 3
8 4
10 5
12 6
15 7
18 8
20 9
Make an array containing the first and last Time values and all the integers in between.
In [60]: index = np.arange(df.index.values.min(), df.index.values.max() + 1)
Make a new DataFrame with all the gaps filled in.
In [61]: df1 = df.reindex(index, method='ffill')
Make a new column with the same data shifted up by 5 rows -- that is, looking forward in time by 5 seconds.
In [62]: df1['B'] = df1['A'].shift(-5)
And now drop all the filled-in times we added, taking only values from the original Time index.
In [63]: df1.reindex(df.index)
Out[63]:
A B
Time
0 0 2
1 1 2
4 2 4
7 3 6
8 4 6
10 5 7
12 6 7
15 7 9
18 8 NaN
20 9 NaN
How you fill in the last values, for which there is no "five seconds later" is up to you. Judging from your desired output, maybe use fillna with a constant value set to the last value in column A.
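On current pandas, the same "value at or immediately before Time + 5" lookup can be done in one step with Series.asof, which returns the value at the latest index not exceeding each lookup key. This is an alternative to the resample/shift answer above, sketched on the question's example data:

```python
import pandas as pd

df = pd.DataFrame({"Time": [0, 1, 4, 7, 8, 10, 12, 15, 18, 20],
                   "A":    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]})

# asof() returns, for each lookup key, the value at the latest index <= that key,
# which is exactly "at or immediately before five seconds in the future".
s = df.set_index("Time")["A"]
df["B"] = s.asof(df["Time"] + 5).to_numpy()
print(df)
```

Unlike the shift approach, no NaN appears in the last rows: asof simply returns the last available value, matching the desired output.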
