Pandas: Get row number by comparing values of different columns

I have a dataframe consisting of the following, and want to add a new column based on:
High - Open < x
and High.rowNum >= Open.rowNum
Basically, I just want to get the first row number that matches the criteria above and store it in a separate column.
S/N  High  Low  Open  Close  Date   New Column (e.g. High - Open >= 85 -> value of S/N)
1    100   20   22    90     1 Jan  1
2    200   40   72    50     2 Jan  3
3    390   20   55    90     2 Jan

As per my understanding of your question and comment, you need the 'S/N' value in the new column for rows which satisfy the criteria, so you can simply use the apply function on the dataframe and store the result as a new column:
import numpy as np

df['New'] = df.apply(lambda x: x['S/N'] if x['High'] - x['Open'] >= 85 else np.nan, axis=1)
Here we get a new column holding 'S/N' where the condition is satisfied; otherwise it is filled with NaN.
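For larger frames, the same logic can be vectorized; a minimal sketch with numpy.where (assuming the same column names) would be:
import numpy as np

# keep 'S/N' where the condition holds, NaN otherwise
df['New'] = np.where(df['High'] - df['Open'] >= 85, df['S/N'], np.nan)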

Creating new column from columns whose name contains a specific string

For the columns whose name contains a specific string, Time, I would like to create a new column with the same name. For each item of Pax_cols (if there is more than one), I want to update the column with its sum with the Temp column.
import pandas as pd

data = {'Run_Time': [60, 20, 30, 45, 70, 100], 'Temp': [10, 20, 30, 50, 60, 100], 'Rest_Time': [5, 5, 5, 5, 5, 5]}
df = pd.DataFrame(data)
Pax_cols = [col for col in df.columns if 'Time' in col]
df[Pax_cols[0]] = df[Pax_cols[0]] + df["Temp"]
This is what I came up with for the case where Pax_cols has only one value, but it does not work in general.
Expected output:
data = {'Run_Time': [70, 40, 60, 95, 130, 200], 'Temp': [10, 20, 30, 50, 60, 100], 'Rest_Time': [15, 25, 35, 55, 65, 105]}
You can use:
# get columns with "Time" in the name
cols = list(df.filter(like='Time'))
# ['Run_Time', 'Rest_Time']
# add the value of df['Temp']
df[cols] = df[cols].add(df['Temp'], axis=0)
output:
Run_Time Temp Rest_Time
0 70 10 15
1 40 20 25
2 60 30 35
3 95 50 55
4 130 60 65
5 200 100 105
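If you prefer something closer to the original attempt, a plain loop over the matched columns gives the same result:
# equivalent loop: add Temp to every column whose name contains 'Time'
for col in cols:
    df[col] = df[col] + df['Temp']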

Pandas: Creating New Column for Previous Week Values with Multiple IDs

I'm working with time series data. I need to find the previous week's value for each entry. My data has 3 columns: ID, Date, and Value. I want to create a 4th column, LWValue (Last Week's Value). Here is what the sample data looks like:
ID Date Value
0 1 2/1/2020 100
1 2 2/1/2020 80
2 1 2/2/2020 105
3 2 2/2/2020 84
4 1 2/8/2020 102
5 2 2/8/2020 82
6 1 2/9/2020 104
7 2 2/9/2020 86
How would I go about doing this in Pandas?
I tried this:
# create new column
df["LWValues"] = pd.Series()
# test out code on same values
df.loc[((df.ID == df.ID) & (df.Date == (df.Date) )), "LWValues"].values
# test out code with timedelta grabbing last week
df.loc[((df.ID == df.ID) & (df.Date == (df.Date - datetime.timedelta(days=7)) )), "LWValues"].values
When I do that, the 2nd operation works, but the final one, trying to pull data from the previous week with the timedelta argument does not. Instead, I get an empty array.
How do I need to fix this code?
Alternatively, is there a better way to get the previous week data in Pandas than this?
df.loc doesn't work that way. In your case, it is just comparing each row with itself. One way to do this is by using apply:
df.apply(lambda row: df.loc[(df.ID == row['ID']) & (df.Date == row['Date'] - datetime.timedelta(days=7)), 'Value'], axis=1)
Don't forget to handle the case where there's no last week entry for a row.
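Alternatively, a merge-based sketch handles the missing-week case naturally: shift a copy of the data one week forward and left-join it back (this assumes Date is already a datetime column):
import pandas as pd

# each row, shifted forward 7 days, carries its Value as that date's LWValue
prev = df.assign(Date=df['Date'] + pd.Timedelta(days=7)).rename(columns={'Value': 'LWValue'})
df = df.merge(prev[['ID', 'Date', 'LWValue']], on=['ID', 'Date'], how='left')
Rows with no entry one week earlier simply get NaN in LWValue.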

Get first and last value for a sequence of pairs between two columns of a pandas dataframe

I have a dataframe with 3 columns Replaced_ID, New_ID and Installation Date of New_ID.
Each New_ID replaces the Replaced_ID.
Replaced_ID New_ID Installation Date (of New_ID)
3 5 16/02/2018
5 7 17/05/2019
7 9 21/06/2019
9 11 23/08/2020
25 39 16/02/2017
39 41 16/08/2018
My goal is to get a dataframe which includes the first and last record of each sequence. I care only about the first Replaced_ID value and the last New_ID value.
i.e from above dataframe I want this
Replaced_ID New_ID Installation Date (of New_ID)
3 11 23/08/2020
25 41 16/08/2018
Sorting by date and performing a shift is not the solution here, as far as I can tell.
Also, I tried joining the New_ID column with the Replaced_ID column, but that only links each row to its immediate predecessor, not the whole sequence.
I need to find a way to get the sequences [3,5,7,9,11] & [25,39,41] by combining the Replaced_ID & New_ID columns across all rows.
I care mostly about getting the first Replaced_ID value and the last New_ID value, not the Installation Date, because I can perform a join at the end.
Any ideas here? Thanks.
First, let's create the DataFrame:
import pandas as pd
import numpy as np
from io import StringIO
data = """Replaced_ID,New_ID,Installation Date (of New_ID)
3,5,16/02/2018
5,7,17/05/2019
7,9,21/06/2019
9,11,23/08/2020
25,39,16/02/2017
39,41,16/08/2018
11,14,23/09/2020
41,42,23/10/2020
"""
### note that I've added two rows to check whether it works with non-consecutive rows
### defining some short hands
r = "Replaced_ID"
n = "New_ID"
i = "Installation Date (of New_ID)"
df = pd.read_csv(StringIO(data), header=0, sep=",")
df[i] = pd.to_datetime(df[i], dayfirst=True)  # dates are day-first (dd/mm/yyyy)
And now for my actual solution:
a = df[[r,n]].values.flatten()
### returns a flat list of r and n values which clearly show duplicate entries, i.e.:
# [ 3 5 5 7 7 9 9 11 25 39 39 41 11 14 41 42]
### now only get values that occur once,
# and reshape them nicely, such that the first column gives the lowest (replaced) id,
# and the second column gives the highest (new) id, i.e.:
# [[ 3 14]
# [25 42]]
u, c = np.unique(a, return_counts=True)
res = u[c == 1].reshape(-1, 2)  # one row per chain: (first replaced, last new)
### now filter the dataframe where "New_ID" is equal to the second column of res, i.e. [14,42]:
# and replace the entries in "r" with the "lowest possible values" of r
dfn = df[df[n].isin(res[:, 1])].copy()
dfn[r] = res[:, 0]
print(dfn)
Which yields:
Replaced_ID New_ID Installation Date (of New_ID)
6 3 14 2020-09-23
7 25 42 2020-10-23
Assuming dates are sorted, you can create a helper series and then groupby and aggregate:
df['Installation Date (of New_ID)']=pd.to_datetime(df['Installation Date (of New_ID)'])
s = df['Replaced_ID'].ne(df['New_ID'].shift()).cumsum()
out = df.groupby(s).agg(
{"Replaced_ID":"first","New_ID":"last","Installation Date (of New_ID)":"last"}
)
print(out)
Replaced_ID New_ID Installation Date (of New_ID)
1 3 11 2020-08-23
2 25 41 2018-08-16
The helper series s differentiates the groups by comparing each Replaced_ID with the previous row's New_ID; where they do not match, it returns True. Series.cumsum then takes a cumulative sum over these booleans, producing a separate group id for each chain:
print(s)
0 1
1 1
2 1
3 1
4 2
5 2
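Note that the shift-based grouping relies on each chain's rows being adjacent in the dataframe. For the general case, a sketch (not part of either answer above) that walks each chain explicitly could look like:
# map each Replaced_ID to its New_ID and walk every chain to its end
new_by_old = dict(zip(df['Replaced_ID'], df['New_ID']))
# chain starts are IDs that never appear as a New_ID
starts = set(df['Replaced_ID']) - set(df['New_ID'])
pairs = []
for start in sorted(starts):
    end = start
    while end in new_by_old:
        end = new_by_old[end]
    pairs.append((start, end))
# pairs == [(3, 11), (25, 41)] for the original six rows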

Python Pandas: Delete all rows of data below first empty cell in column A

I have a csv file that gets mailed to me every day, and I want to write a script to clean up the data before I push it into a database. At the bottom of the csv file are 2 empty rows (rows 73 & 74 in the image) and two rows with some junk data in them (rows 75 & 76 in the image), and I need to delete these rows.
To identify the first empty row, it might be helpful to know that column A will always have data in it until the first empty row (row 73 in the image).
Can you help me figure out how to identify these rows and delete the data in them?
You can check for missing values with Series.isna, create a cumulative sum with Series.cumsum, and keep only the rows where it equals 0 via boolean indexing. This solution also works if there is no missing value in the first column.
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['as', 'bf', np.nan, 'vd', 'ss'],
                   'B': [1, 2, 3, 4, 5]})
print (df)
A B
0 as 1
1 bf 2
2 NaN 3
3 vd 4
4 ss 5
df = df[df['A'].isna().cumsum() == 0]
print (df)
A B
0 as 1
1 bf 2
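Applied to the mailed file itself, the whole cleanup might look like this sketch (the file name is a placeholder; the first column stands in for column A):
import pandas as pd

df = pd.read_csv('daily_report.csv')          # placeholder file name
first_col = df.columns[0]                     # column A of the spreadsheet
df = df[df[first_col].isna().cumsum() == 0]   # drop everything from the first empty cell down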

Getting a merged dataframe from two dataframes

I have two dataframes:
Source dataframe
index A x y
1 1 100 100
2 1 100 400
3 1 100 700
4 1 300 200
5 2 50 200
6 2 100 200
7 2 800 400
8 2 1200 800
Destination dataframe
index A x y
1 1 105 100
2 1 110 410
3 1 110 780
4 2 1000 90
For each row in the source dataframe I need to find the nearest values in the destination dataframe, grouped by the 'A' column. The resultant dataframe should be as below (just a sample, taking only one row from the source (index 1) and the corresponding nearest ones from the destination in that group (A == 1)):
A x_1 y_1 x_2 y_2 nearness(approx.)
1 100 100 105 100 95
1 100 100 110 410 50
1 100 100 110 780 20
NOTE: The nearness column is just a representation and will be a calculated function of x and y in the future. What I need is row-wise merging between the two dataframes.
This might seem arbitrary, so let me explain how merge works:
pd.merge(source_df, dest_df, on='A')
Basically, it will go through every item of the left dataframe, look for its key in the right dataframe, and create an entry in the merged dataframe (it creates one entry each time the key is found in the right dataframe; the validate keyword lets you assert the relationship you expect, such as one-to-one).
See https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html for more info.
source_df.merge(dest_df, on='A')
What it does: it matches source_df's column 'A' with dest_df's column 'A' (when on is specified), much like a SQL join; if on is not given, it joins on the columns common to both dataframes. You can also join on differently named columns with the left_on and right_on arguments, or on the index with left_index/right_index.
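A minimal runnable sketch of this merge on the sample data, with an illustrative (assumed) nearness metric based on inverse Euclidean distance:
import pandas as pd

source_df = pd.DataFrame({'A': [1, 1], 'x': [100, 300], 'y': [100, 200]})
dest_df = pd.DataFrame({'A': [1, 1, 1], 'x': [105, 110, 110], 'y': [100, 410, 780]})

# every source row is paired with every destination row sharing the same 'A'
merged = source_df.merge(dest_df, on='A', suffixes=('_1', '_2'))

# placeholder nearness: inverse Euclidean distance (the real metric is up to the asker)
dist = ((merged['x_1'] - merged['x_2']) ** 2 + (merged['y_1'] - merged['y_2']) ** 2) ** 0.5
merged['nearness'] = 1 / dist
print(merged)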
