I have a dataframe with a thousand records, like this:
ID to from Date price Type
1 69 18 2/2020 10 A
2 11 12 2/2020 5 A
3 18 10 3/2020 4 B
4 10 11 3/2020 10 A
5 12 69 3/2020 4 B
6 12 20 3/2020 3 B
7 69 21 3/2020 3 A
The output that I want is:
ID to from Date price Type ID to from Date price Type
1 69 18 2/2020 4 A 5 12 69 3/2020 4 B
1' 69 18 2/2020 6 A Nan Nan Nan Nan Nan Nan
2 11 12 2/2020 5 A Nan Nan Nan Nan Nan Nan
4 10 11 3/2020 4 A 3 18 10 3/2020 4 B
4' 10 11 3/2020 6 A Nan Nan Nan Nan Nan Nan
Nan Nan Nan Nan Nan Nan 6 12 20 3/2020 3 B
7 69 21 3/2020 3 A Nan Nan Nan Nan Nan Nan
The idea is to iterate over the rows: if the Type is B, put the row next to the first Type A record whose to equals the B row's from.
If the prices are equal, that's fine. If they are not, split the row with the higher price; the extra row gets the remainder (the higher price minus the lower one).
I split the dataframe into Type A and Type B, and I'm trying to iterate over both of them:
import pandas as pd

grp = df.groupby('Type')  # the column is named 'Type', not 'type'
transformed_df_list = []
for idx, frame in grp:
    transformed_df_list.append(frame.reset_index(drop=True))
A = transformed_df_list[0]  # groups come out sorted, so 'A' is first
B = transformed_df_list[1]
for i, row in A.iterrows():
    for j, row1 in B.iterrows():
        if row['to'] == row1['from']:
            if row['price'] == row1['price']:
                row_df = pd.DataFrame([row1])
output = pd.merge(A, B, how='left', left_on=['to'], right_on=['from'])
The problem is that with the merge function I get several duplicate rows, and I can't check the price to split the row.
Is there a way to insert a B row into the A dataframe without the merge function?
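One possible way to do this without merge is sketched below, using the sample data from the question. It is only a sketch under assumptions: the names a_rows, b_rows, emit and the _A/_B column suffixes are made up for illustration, a B row with the higher price is split the same way as an A row (the question only shows A rows being split), and unmatched B rows are placed at the end rather than interleaved as in the desired output.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6, 7],
    "to": [69, 11, 18, 10, 12, 12, 69],
    "from": [18, 12, 10, 11, 69, 20, 21],
    "Date": ["2/2020", "2/2020", "3/2020", "3/2020", "3/2020", "3/2020", "3/2020"],
    "price": [10, 5, 4, 10, 4, 3, 3],
    "Type": ["A", "A", "B", "A", "B", "B", "A"],
})

a_rows = df[df["Type"] == "A"].to_dict("records")
b_rows = df[df["Type"] == "B"].to_dict("records")

# For each B row, claim the first still-unclaimed A row with A.to == B.from.
match_for_a = {}   # position in a_rows -> position in b_rows
unmatched_b = []
for j, b in enumerate(b_rows):
    i = next((i for i, a in enumerate(a_rows)
              if i not in match_for_a and a["to"] == b["from"]), None)
    if i is None:
        unmatched_b.append(j)
    else:
        match_for_a[i] = j

blank = dict.fromkeys(df.columns)  # all-None placeholder for an empty side
left, right = [], []

def emit(a, b):
    """Append one output row; None means that side stays empty."""
    left.append(a if a is not None else blank)
    right.append(b if b is not None else blank)

for i, a in enumerate(a_rows):
    if i not in match_for_a:
        emit(a, None)
        continue
    b = b_rows[match_for_a[i]]
    common = min(a["price"], b["price"])
    emit({**a, "price": common}, {**b, "price": common})
    # Whichever side had the higher price keeps the remainder on its own row.
    if a["price"] > common:
        emit({**a, "ID": f"{a['ID']}'", "price": a["price"] - common}, None)
    elif b["price"] > common:
        emit(None, {**b, "ID": f"{b['ID']}'", "price": b["price"] - common})

for j in unmatched_b:
    emit(None, b_rows[j])

output = pd.concat(
    [pd.DataFrame(left).add_suffix("_A"), pd.DataFrame(right).add_suffix("_B")],
    axis=1,
)
print(output)
```

Because every B row claims at most one A row, no duplicate rows appear, and the price comparison happens at the moment the pair is formed. For a thousand records the nested search should be fast enough; for much larger data a dictionary keyed on to would avoid the repeated scan.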
I have the following dataframe, in which the values should be increasing. Some of the values are unknown.
index value
0 1
1
2
3 2
4
5
6
7 4
8
9
10 3
11 3
12
13
14
15 5
Based on the assumption that the values should be increasing, I would like to remove the rows at index 10 and 11. This would be the desired dataframe:
index value
0 1
1
2
3 2
4
5
6
7 4
8
9
12
13
14
15 5
Thank you very much
Assuming NaN in the empty cells (if not, temporarily replace them with NaN), use boolean indexing:
# if not NaNs uncomment below
# and use s in place of df['value'] afterwards
# s = pd.to_numeric(df['value'], errors='coerce')
# is the cell empty?
m1 = df['value'].isna()
# is the value at least as large as the running maximum (i.e. not a decrease)?
m2 = df['value'].ge(df['value'].cummax())
out = df[m1|m2]
Output:
index value
0 0 1.0
1 1 NaN
2 2 NaN
3 3 2.0
4 4 NaN
5 5 NaN
6 6 NaN
7 7 4.0
8 8 NaN
9 9 NaN
12 12 NaN
13 13 NaN
14 14 NaN
15 15 5.0
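For anyone who wants to reproduce this, here is a self-contained sketch that rebuilds the question's data (assuming the unknown cells are NaN) and applies the same two masks. The variable names is_empty and at_running_max are just illustrative.

```python
import numpy as np
import pandas as pd

values = [1, np.nan, np.nan, 2, np.nan, np.nan, np.nan, 4,
          np.nan, np.nan, 3, 3, np.nan, np.nan, np.nan, 5]
df = pd.DataFrame({"index": range(16), "value": values})

# Unknown cells are always kept.
is_empty = df["value"].isna()
# cummax skips NaN, so each known value is compared with the
# largest known value seen so far (including itself).
at_running_max = df["value"].ge(df["value"].cummax())

out = df[is_empty | at_running_max]
print(out["index"].tolist())
```

Note that because the comparison is ge, a value that merely repeats the running maximum is kept; only values below it (here the two 3s after the 4) are dropped.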
Try this:
def del_df(df):
    df_no_na = df.dropna().reset_index(drop=True)
    num_tmp = df_no_na['value'][0]  # first value that is not NaN
    del_index_list = []  # indices to delete
    for row_index in range(1, len(df_no_na)):
        if df_no_na['value'][row_index] > num_tmp:  # increasing
            num_tmp = df_no_na['value'][row_index]  # new value to compare against
        else:  # not increasing (same or decreasing)
            del_index_list.append(df_no_na['index'][row_index])  # index to delete
    df_goal = df.drop([df.index[i] for i in del_index_list])
    return df_goal
output:
index value
0 0 1.0
1 1 NaN
2 2 NaN
3 3 2.0
4 4 NaN
5 5 NaN
6 6 NaN
7 7 4.0
8 8 NaN
9 9 NaN
12 12 NaN
13 13 NaN
14 14 NaN
15 15 5.0
I have a data frame like this:
df:
number score
12 NaN
13 NaN
14 NaN
15 NaN
16 10
17 NaN
18 NaN
I want to filter this data frame from the start to the row where it finds a number in the score column.
So, after filtering the data frame should look like this:
new_df:
number score
12 NaN
13 NaN
14 NaN
15 NaN
16 10
I want to filter this data frame from the row where it finds a number in the score column to the end of the data frame.
So, after filtering the data frame should look like this:
new_df:
number score
16 10
17 NaN
18 NaN
How do I filter this data frame?
Kindly help
You can use pd.Series.last_valid_index and pd.Series.first_valid_index like this:
df.loc[df['score'].first_valid_index():]
Output:
number score
4 16 10.0
5 17 NaN
6 18 NaN
And,
df.loc[:df['score'].last_valid_index()]
Output:
number score
0 12 NaN
1 13 NaN
2 14 NaN
3 15 NaN
4 16 10.0
And if you want to clip both the leading and the trailing NaN, you can combine the two:
df.loc[df['score'].first_valid_index():df['score'].last_valid_index()]
Output:
number score
4 16 10.0
You can use a reverse cummax and boolean slicing:
new_df = df[df['score'].notna()[::-1].cummax()]
Output:
number score
0 12 NaN
1 13 NaN
2 14 NaN
3 15 NaN
4 16 10.0
For the second one, a simple cummax:
new_df = df[df['score'].notna().cummax()]
Output:
number score
4 16 10.0
5 17 NaN
6 18 NaN
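Here is a small self-contained sketch that rebuilds the question's frame and runs one variant from each answer. The last part shows a difference worth knowing: when the column contains no number at all, first_valid_index() returns None, so the .loc slice keeps every row, while the cummax mask keeps none. The names head, tail and empty are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "number": range(12, 19),
    "score": [np.nan, np.nan, np.nan, np.nan, 10, np.nan, np.nan],
})

# Rows from the start up to (and including) the last number in 'score'.
head = df.loc[:df["score"].last_valid_index()]
# Rows from the first number in 'score' to the end.
tail = df[df["score"].notna().cummax()]

print(head["number"].tolist())
print(tail["number"].tolist())

# Edge case: a 'score' column that is entirely NaN.
empty = df.assign(score=np.nan)
first = empty["score"].first_valid_index()   # None
by_loc = empty.loc[first:]                   # slice(None) keeps all rows
by_mask = empty[empty["score"].notna().cummax()]  # all-False mask keeps none
print(first, len(by_loc), len(by_mask))
```

Which behaviour is preferable depends on what an all-NaN column should mean in your data.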
I just want to know how to get, for every row, the sum of the previous 5 values with the same id.
df:
id values
-----------------
a 5
a 10
a 10
b 2
c 2
d 2
a 5
a 10
a 20
a 10
a 15
a 20
expected df:
id values sum(x.tail(5))
-------------------------------------
a 5 NaN
a 10 NaN
a 10 NaN
b 2 NaN
c 2 NaN
d 2 NaN
a 5 NaN
a 10 NaN
a 20 40
a 10 55
a 15 55
a 20 60
For simplicity, I'm only trying to find, for every row with id a, the sum of the values from the previous 5 rows with id a.
I tried df.apply(lambda x: x.tail(5)), but that only showed me the last 5 rows of the entire df. I want the sum of the last n rows relative to each and every row. Basically, it's like a rolling sum for time series data.
You can calculate the sum of the last 5 like this (selecting the values column before rolling, so the string id column is not part of the sum):
df["rolling As"] = df.loc[df['id'] == 'a', 'values'].rolling(window=5).sum()
(This includes the current row as one of the 5; I'm not sure if that is what you want.)
id values rolling As
0 a 5 NaN
1 a 10 NaN
2 a 10 NaN
3 b 2 NaN
4 c 2 NaN
5 d 5 NaN
6 a 10 NaN
7 a 20 55.0
8 a 10 60.0
9 a 10 60.0
10 a 15 65.0
11 a 20 75.0
If you don't want it included, you can shift:
df["rolling"] = df.loc[df['id'] == 'a', 'values'].rolling(window=5).sum().shift()
to give:
id values rolling
0 a 5 NaN
1 a 10 NaN
2 a 10 NaN
3 b 2 NaN
4 c 2 NaN
5 d 5 NaN
6 a 10 NaN
7 a 20 NaN
8 a 10 55.0
9 a 10 60.0
10 a 15 60.0
11 a 20 65.0
Try using groupby, transform, and rolling:
df['sum(x.tail(5))'] = df.groupby('id')['values']\
    .transform(lambda x: x.rolling(5, min_periods=5).sum().shift())
Output:
id values sum(x.tail(5))
1 a 5 NaN
2 a 10 NaN
3 a 10 NaN
4 b 2 NaN
5 c 2 NaN
6 d 2 NaN
7 a 5 NaN
8 a 10 NaN
9 a 20 40.0
10 a 10 55.0
11 a 15 55.0
12 a 20 60.0
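To check this against the question's data, here is a self-contained sketch. It shifts before rolling instead of after, which should give the same result here; the column name prev5_sum is made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "id": list("aaabcdaaaaaa"),
    "values": [5, 10, 10, 2, 2, 2, 5, 10, 20, 10, 15, 20],
})

# Within each id: shift by one so the current row is excluded,
# then sum a window of 5; windows with fewer than 5 values give NaN.
df["prev5_sum"] = (
    df.groupby("id")["values"]
      .transform(lambda x: x.shift().rolling(5).sum())
)
print(df)
```

The ids b, c and d never have 5 earlier rows, so their result stays NaN, which matches the expected output in the question.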
I would like to set values in a pandas dataframe based on the values of another column. In a nutshell, for example, if I wanted to set the rows of a column my_column of a pandas dataframe pd where another column, my_interesting_column, is between 10 and 30, I would like to do something like:
start_index=pd.find_closest_index_where_pd["my_interesting_column"].is_closest_to(10)
end_index=pd.find_closest_index_where_pd["my_interesting_column"].is_closest_to(30)
pd["my_column"].between(start_index, end_index)= some_value
As a simple illustration, suppose I have the following dataframe
df = pd.DataFrame(np.arange(10, 20), columns=list('A'))
df["B"]=np.nan
>>> df
A B
0 10 NaN
1 11 NaN
2 12 NaN
3 13 NaN
4 14 NaN
5 15 NaN
6 16 NaN
7 17 NaN
8 18 NaN
9 19 NaN
How can I do something like
df.where(df["A"].is_between(13,16))= 5
so that the end result looks like
>>> df
A B
0 10 NaN
1 11 NaN
2 12 NaN
3 13 5
4 14 5
5 15 5
6 16 5
7 17 NaN
8 18 NaN
9 19 NaN
pd.loc[start_idx:end_idx, 'my_column'] = some_value
I think this is what you are looking for
df.loc[(df['A'] >= 13) & (df['A'] <= 16), 'B'] = 5
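Since the question asks for something like is_between: pandas has Series.between, which is inclusive on both ends by default, so the same assignment can be sketched as follows on the question's example frame.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": np.arange(10, 20), "B": np.nan})

# between(13, 16) is True where 13 <= A <= 16 (both bounds included by default).
df.loc[df["A"].between(13, 16), "B"] = 5
print(df)
```

This selects by value, not by position, so it also works when column A is not sorted.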
I am trying to join a fragment of a dataframe with another one. The structure of the dataframe to join is simplified below:
left:
ID f1 TIME
1 10 1
3 10 1
7 10 1
9 10 2
2 10 2
1 10 2
3 10 2
right:
ID f2 f3
1 0 11
7 9 11
I need to select rows of the left dataset by TIME and attach the right one to them. The result I would like to have is the following:
left:
ID f1 TIME f2 f3
1 10 1 0 11
3 10 1 nan nan
7 10 1 9 11
9 10 2 nan nan
2 10 2 nan nan
1 10 2 nan nan
3 10 2 nan nan
Currently I am usually joining dataframes in this way:
left = left.join(right.set_index('ID'), on='ID')
In this case I am using:
left[left.TIME == 1] = left[left.TIME == 1].join(right.set_index('ID'), on='ID')
I have also tried with merge, but the result is the left dataframe without any of the other columns.
Finally, my script needs to do this for every unique TIME in the dataframe, thus:
for t in numpy.unique(left.TIME):
    # do the join on the fragment left.TIME == t
If I save the value returned by join in a new dataframe, everything works fine, but assigning it back into the left dataframe does not work.
EDIT: The IDs of the left dataset can be present multiple times, but not inside the same TIME value.
You can filter first by boolean indexing, merge and concat last:
df1 = left[left['TIME']==1]
#alternative
#df1 = left.query('TIME == 1')
df2 = left[left['TIME']!=1]
#alternative
#df2 = left.query('TIME != 1')
df = pd.concat([df1.merge(right, how='left'), df2])
print (df)
ID TIME f1 f2 f3
0 1 1 10 0.0 11.0
1 3 1 10 NaN NaN
2 7 1 10 9.0 11.0
3 9 2 10 NaN NaN
4 2 2 10 NaN NaN
5 1 2 10 NaN NaN
6 3 2 10 NaN NaN
EDIT: merge creates a new default index, so a possible solution is to move the original index into a column first and then set it back as the index afterwards:
print (left)
ID f1 TIME
10 1 10 1
11 3 10 1
12 7 10 1
13 9 10 2
14 2 10 2
15 1 10 2
16 3 10 2
#df = left.merge(right, how='left')
df1 = left[left['TIME']==1]
df2 = left[left['TIME']!=1]
df = pd.concat([df1.reset_index().merge(right, how='left').set_index('index'), df2])
print (df)
ID TIME f1 f2 f3
10 1 1 10 0.0 11.0
11 3 1 10 NaN NaN
12 7 1 10 9.0 11.0
13 9 2 10 NaN NaN
14 2 2 10 NaN NaN
15 1 2 10 NaN NaN
16 3 2 10 NaN NaN
EDIT:
After the discussion, and once the input data is modified, it is possible to use:
df = left.merge(right, how='left', on=['ID','TIME'])
This is one way:
# DataFrame.append was removed in pandas 2.0, so use pd.concat
res = pd.concat([
    left.drop_duplicates('ID').merge(right, how='left'),
    left[left.duplicated(subset=['ID'])],
])
# ID TIME f1 f2 f3
# 0 1 1 10 0.0 11.0
# 1 3 1 10 NaN NaN
# 2 7 1 10 9.0 11.0
# 3 9 2 10 NaN NaN
# 4 2 2 10 NaN NaN
# 5 1 2 10 NaN NaN
# 6 3 2 10 NaN NaN
Note that columns f2 and f3 become float since NaN is considered a float.
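Another option, sketched here on the question's sample data, is to merge everything once and then blank out the new columns for the rows outside the selected TIME. This assumes the IDs in right are unique (otherwise the merge would add rows), and like any merge it resets the index. The name merged is illustrative.

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({
    "ID": [1, 3, 7, 9, 2, 1, 3],
    "f1": [10] * 7,
    "TIME": [1, 1, 1, 2, 2, 2, 2],
})
right = pd.DataFrame({"ID": [1, 7], "f2": [0, 9], "f3": [11, 11]})

# A left merge keeps every row of `left`, in its original order.
merged = left.merge(right, on="ID", how="left")
# Keep the attached values only for the selected TIME.
merged.loc[merged["TIME"] != 1, ["f2", "f3"]] = np.nan
print(merged)
```

With this shape there is no need to split the frame and concat it back, and the column order of left is preserved.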