Concatenating the values of a column and putting them back into the same row - python

   Customer  Material ID  Bill Quantity
0         1        64578            100
1         2        64579             58
2         3        64580             36
3         4        64581             45
4         5        64582            145
We have to concatenate the Material ID at index 0 with the Material ID at index 1 and put the result into the record at index 0, and similarly for index pairs 1,2 and 3,4.
The result should contain only the concatenated records.

Just shift the data and combine the columns.
df.assign(new_ID=df["Material ID"] + df.shift(-1)["Material ID"])
   Customer  Material ID  Bill Quantity    new_ID
0         1        64578            100  129157.0
1         2        64579             58  129159.0
2         3        64580             36  129161.0
3         4        64581             45  129163.0
4         5        64582            145       NaN
If you need to concatenate the values as strings instead, then the following would work.
df["Material ID"] = df["Material ID"].astype(str)
df.assign(new_ID=df["Material ID"] + df.shift(-1)["Material ID"])
   Customer  Material ID  Bill Quantity      new_ID
0         1        64578            100  6457864579
1         2        64579             58  6457964580
2         3        64580             36  6458064581
3         4        64581             45  6458164582
4         5        64582            145         NaN
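The question also asks for only the concatenated records in the result. A minimal sketch of that final step, assuming "only concatenated records" means dropping the rows that had nothing to pair with (result is a hypothetical name):

result = df.assign(new_ID=df["Material ID"] + df["Material ID"].shift(-1))
result = result.dropna(subset=["new_ID"])  # the last row has no following row to pair with

If instead only non-overlapping pairs (rows 0-1, 2-3, ...) are wanted, selecting every other row with result.iloc[::2] before the dropna would keep just those.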

Related

How to filter a dataframe and identify records based on a condition on multiple other columns

        id  zone  price
0  0000001     1   33.0
1  0000001     2   24.0
2  0000001     3   34.0
3  0000001     4   45.0
4  0000001     5   51.0
I have the above pandas dataframe; there are multiple ids (only one id is shown here). For each id, the dataframe consists of 5 zones and 5 prices, and the prices should follow the pattern
p1 (price of zone 1) < p2 < p3 < p4 < p5
If anything is out of order we should identify the anomalous records and print them to a file.
In this example p3 < p4 < p5 holds, but p1 and p2 are erroneous (p1 > p2, whereas p1 < p2 is expected), so the first 2 records should be printed to a file. Likewise, this has to be done across the entire dataframe for every unique id in it.
My dataframe is huge; what is the most efficient way to do this filtering and identify the erroneous records?
You can compute the diff per group after sorting the values to ensure the zones are increasing. If the diff is ≤ 0 the price is not strictly increasing and the rows should be flagged:
s = (df.sort_values(by=['id', 'zone'])  # sort rows
       .groupby('id')                   # group by id
       ['price'].diff()                 # compute the diff
       .le(0))                          # flag diffs ≤ 0 (not strictly increasing)
df[s | s.shift(-1)]                     # slice flagged rows + the row before each
Example output:
   id  zone  price
0   1     1   33.0
1   1     2   24.0
Example input:
   id  zone  price
0   1     1   33.0
1   1     2   24.0
2   1     3   34.0
3   1     4   45.0
4   1     5   51.0
5   2     1   20.0
6   2     2   24.0
7   2     3   34.0
8   2     4   45.0
9   2     5   51.0
Saving to a file:
df[s|s.shift(-1)].to_csv('incorrect_prices.csv')
Another way would be to first sort your dataframe by id and zone in ascending order, then compare each price with the previous one within each id using groupby().shift(), storing the result in a new column. Then you can just print out the prices that have fallen in value:
import numpy as np
import pandas as pd

df = df.sort_values(by=['id', 'zone'], ascending=True)  # assign back: sort_values returns a copy
df['increase'] = np.where(df.zone.eq(1), 'no change',
                          np.where(df.groupby('id')['price'].shift(1) < df['price'],
                                   'inc', 'dec'))
>>> df
    id  zone  price   increase
0    1     1     33  no change
1    1     2     24        dec
2    1     3     34        inc
3    1     4     45        inc
4    1     5     51        inc
5    2     1     34  no change
6    2     2     56        inc
7    2     3     22        dec
8    2     4     55        inc
9    2     5     77        inc
10   3     1     44  no change
11   3     2     55        inc
12   3     3     44        dec
13   3     4     66        inc
14   3     5     33        dec
>>> df.loc[df.increase.eq('dec')]
    id  zone  price increase
1    1     2     24      dec
7    2     3     22      dec
12   3     3     44      dec
14   3     5     33      dec
I have added some extra IDs to try to mimic your real data.
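An alternative sketch (my own, not part of the answers above): since the rule is simply that prices must be strictly increasing within each id, is_monotonic_increasing plus a uniqueness check can flag offending ids in one pass. Note this flags every record of a bad id rather than just the offending pair; bad and the output file name are hypothetical:

import pandas as pd

df = df.sort_values(['id', 'zone'])
# strictly increasing == non-decreasing with no repeated prices
bad = df.groupby('id')['price'].transform(
    lambda s: not (s.is_monotonic_increasing and s.is_unique))
df[bad.astype(bool)].to_csv('anomalies.csv', index=False)  # hypothetical file name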

How to use LOCF in this time series data for pandas in python

Given the data below, I need to fill each row's missing observations with the values from the last time that ID appeared:
ID    OpenDate     ObsDate  Amount  ClosedDate  Output
 1  10-12-1990  15-08-1991      20  15-08-1992       2
 3  10-12-1993  15-12-1993      25  15-08-1994       1
 5  10-12-1995  25-11-1997       0  18-08-1998       1
 1         NaN         NaN     NaN         NaN     NaN
 3         NaN         NaN     NaN         NaN     NaN
The expected output should have the rows for IDs 1 and 3 filled with their previous values, i.e.
ID    OpenDate     ObsDate  Amount  ClosedDate  Output
 1  10-12-1990  15-08-1991      20  15-08-1992       2
 3  10-12-1993  15-12-1993      25  15-08-1994       1
 5  10-12-1995  25-11-1997       0  18-08-1998       1
 1  10-12-1990  15-08-1991      20  15-08-1992       2
 3  10-12-1993  15-12-1993      25  15-08-1994       1
Consider this to be a dataframe, the input needed for the Python code.
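This question carries no answer here, so the following is only a minimal sketch of the usual LOCF (last observation carried forward) approach in pandas: group by ID and forward-fill within each group. The frame construction below is a hypothetical reconstruction of the table above:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID':         [1, 3, 5, 1, 3],
    'OpenDate':   ['10-12-1990', '10-12-1993', '10-12-1995', np.nan, np.nan],
    'ObsDate':    ['15-08-1991', '15-12-1993', '25-11-1997', np.nan, np.nan],
    'Amount':     [20, 25, 0, np.nan, np.nan],
    'ClosedDate': ['15-08-1992', '15-08-1994', '18-08-1998', np.nan, np.nan],
    'Output':     [2, 1, 1, np.nan, np.nan],
})

# LOCF per ID: forward-fill every non-key column within each ID group
filled = df.groupby('ID').ffill()
df[filled.columns] = filled

Row order matters here: ffill only carries values forward, so each ID's observed row must come before its empty one, as in the example.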

Compare value from two dataframes

How can I compare two dataframes and encode the result as 0/1?
df1
L_ID  L_Values
   1     20-25
   2     30-35
   3        25
   4        45
   5     30-45
df2
The L_ID values in df1 correspond to columns 1, 2, 3, 4, 5 in df2.
Name    1    2    3   4   5
John   25   25   20  30  45
Zara   20  NaN  NaN  25  30
Kim   NaN  NaN  NaN  45  50
I would like to check whether each value in df2 falls within the corresponding range in df1:
yes = 0, no = 1
Expected output in df3:
Name  1  2  3  4  5
John  0  1  1  1  0
Zara  0  1  1  1  0
Kim   1  1  1  0  1
The following should work.
Just take care with the NaN check
df3[col].iloc[i] == df3[col].iloc[i]
which relies on NaN != NaN and is therefore False exactly for missing values. If NaN is stored in string format, just replace that part of the code with
df3[col].iloc[i] != 'NaN'
The code is:
# build a {L_ID: [low, high]} lookup from df1
d = {}
for i in range(len(df1)):
    temp = df1.L_Values.iloc[i].split('-')
    if len(temp) == 2:
        d[df1.L_ID.iloc[i]] = [float(temp[0]), float(temp[1])]
    else:  # a single value such as '25' becomes the range [25, 25]
        d[df1.L_ID.iloc[i]] = [float(temp[0]), float(temp[0])]

df3 = df2.copy()
for i in range(len(df3)):
    for col in df3.columns.drop('Name'):
        val = df3[col].iloc[i]
        # val == val is False only for NaN
        if val == val and d[col][0] <= val <= d[col][1]:
            df3.loc[df3.index[i], col] = 0
        else:
            df3.loc[df3.index[i], col] = 1
print(df3)
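A vectorized sketch of the same idea (my own alternative, not from the answer above): parse the ranges once into low/high bounds indexed by L_ID, then compare all of df2 at once. NaN comparisons evaluate to False, which yields 1 for missing values, matching the expected output; bounds, vals and in_range are hypothetical names:

import pandas as pd

bounds = df1['L_Values'].str.split('-', expand=True).astype(float)
bounds[1] = bounds[1].fillna(bounds[0])  # a single value such as '25' becomes the range [25, 25]
bounds.index = df1['L_ID']

vals = df2.set_index('Name')             # assumes columns 1..5 line up with the L_ID index
in_range = vals.ge(bounds[0]) & vals.le(bounds[1])
df3 = (~in_range).astype(int).reset_index()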

How to shift the column values based on the difference with previous row in python pandas?

I have dataframe which looks like below:
   Name  width  height  breadth
0     1     13      90        2
1     2    101      45        1
2     3     78       6        1
3     5     11      34        1
4     6     23       8        2
As seen, Name is not in sequence; there are missing entries in between.
I want to shift the width and height values one row down where Name is in sequence. Where it is not, I want to populate width and height with NaN.
I tried the below code:
diff=data['Name'].diff()
Then I tried a groupby using this diff value, but it did not work.
I am expecting a result like below:
   Name  width  height  breadth
0     1    NaN     NaN        2
1     2     13      90        1
2     3    101      45        1
3     5    NaN     NaN        1
4     6     11      34        2
Create a helper Series for the groups with Series.diff, compare it with Series.ne, take the Series.cumsum, and pass the result to DataFrameGroupBy.shift:
diff = data['Name'].diff().ne(1).cumsum()
data[['width', 'height']] = data.groupby(diff)[['width', 'height']].shift()
print(data)
   Name  width  height  breadth
0     1    NaN     NaN        2
1     2   13.0    90.0        1
2     3  101.0    45.0        1
3     5    NaN     NaN        1
4     6   11.0    34.0        2
You could use a temporary dataframe to add the missing rows and shift the values:
import numpy as np
import pandas as pd

# build a frame with the full Name range and left-merge the data onto it
temp = (pd.DataFrame({'Name': np.arange(data.Name.min(), data.Name.max() + 1)})
          .merge(data, on='Name', how='left'))
temp[['width', 'height']] = temp[['width', 'height']].shift()
result = pd.DataFrame(data.Name).merge(temp, on='Name')  # keep only the original Names
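A shorter sketch of the same idea (my own variant, not from the answers): reindex to the full Name range so the gaps become real rows, shift inside that frame, then select the original Names back; full is a hypothetical name:

full = data.set_index('Name').reindex(range(data.Name.min(), data.Name.max() + 1))
full[['width', 'height']] = full[['width', 'height']].shift()
result = full.loc[data.Name].reset_index()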

Iterate through the rows of a dataframe and reassign minimum values by group

I am working with a dataframe that looks like this.
   id  time  diff
0   0    34   NaN
1   0    36     2
2   1    43     7
3   1    55    12
4   1    59     4
5   2     2   -57
6   2    10     8
What is an efficient way to find the minimum value of 'time' per id and then set 'diff' to NaN at those minimums? I am looking for a solution that results in:
   id  time  diff
0   0    34   NaN
1   0    36     2
2   1    43   NaN
3   1    55    12
4   1    59     4
5   2     2   NaN
6   2    10     8
Use groupby('id') and idxmin to find the locations of the minimum values of 'time', then use loc to assign np.nan:
import numpy as np

df.loc[df.groupby('id').time.idxmin(), 'diff'] = np.nan
df
You can group the time by id and compute a logical vector that is True where the time is the minimum within its group and False otherwise, then use the logical vector to assign NaN to the corresponding rows:
import numpy as np
import pandas as pd
df.loc[df.groupby('id')['time'].apply(lambda g: g == min(g)), "diff"] = np.nan
df
#    id  time  diff
# 0   0    34   NaN
# 1   0    36   2.0
# 2   1    43   NaN
# 3   1    55  12.0
# 4   1    59   4.0
# 5   2     2   NaN
# 6   2    10   8.0
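Since the question asks for efficiency, one more sketch (my own, not from the answers above): groupby().transform('min') builds the same mask without a Python-level lambda, which tends to be faster on large frames; mask is a hypothetical name:

import numpy as np

mask = df['time'].eq(df.groupby('id')['time'].transform('min'))
df.loc[mask, 'diff'] = np.nan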
