Use pandas to select the lagged row along with current row based on criteria

I have a dataframe like the one shown below:

person_id  source_system   r_diff
  1              A          NULL
  1              B           0
  1              B           9
  1              A           15
  1              A           574
  1              B           0
  1              A           63
  1              A           136
  1              B           0 

I would like to select rows that satisfy either of two rules (a logical OR):

a) Select all records where source_system = B

b) Select rows n and n-1 wherever r_diff = 0 at row n.

For example, in the above data, r_diff = 0 at rows 2, 6 and 9. So I would like to select rows 1, 2 and 5, 6 and 8, 9, that is, each matching row n together with the row n-1 just before it.

I tried the following:

df['flag_1'] = np.where((df['source_system'] == 'B'), '1','0')
df['flag_2'] = np.where((df['r_diff'] == 0), '1','0')
df['flag_3'] = np.where((df['r_diff'].shift(-1) == 0), '1','0')
df = df[((df['flag_1'] == '1') or (df['flag_2'] == '1') or (df['flag_3'] == '1'))]

I expect my output to look like this:

person_id  source_system   r_diff
  1              A          NULL
  1              B           0
  1              B           9
  1              A           574
  1              B           0
  1              A           136
  1              B           0

Submitted June 21st 2021 by Admin

Answers

I think you are close. Python's or cannot combine Series element-wise, so assign each mask to a variable and chain them with | for bitwise OR, like this:

m1 = df['source_system'] == 'B'
m2 = df['r_diff'] == 0
m3 = df.groupby('person_id')['r_diff'].shift(-1) == 0

df = df[m1 | m2 | m3]
print (df)
   person_id source_system  r_diff
0          1             A     NaN
1          1             B     0.0
2          1             B     9.0
4          1             A   574.0
5          1             B     0.0
7          1             A   136.0
8          1             B     0.0
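
If you want to try this end to end, here is a minimal self-contained sketch. It rebuilds the sample frame from the question, assuming NULL is stored as NaN and r_diff is numeric (neither is stated in the question), and keeps the filtered rows in a separate variable named result, which is my own choice. The last part shows why the original attempt with or fails.

```python
import numpy as np
import pandas as pd

# Sample data from the question; NULL is assumed to be NaN here.
df = pd.DataFrame({
    "person_id": [1] * 9,
    "source_system": ["A", "B", "B", "A", "A", "B", "A", "A", "B"],
    "r_diff": [np.nan, 0, 9, 15, 574, 0, 63, 136, 0],
})

# Rule (a): every row coming from source system B.
m1 = df["source_system"] == "B"
# Rule (b), row n: r_diff is 0 on this row.
m2 = df["r_diff"] == 0
# Rule (b), row n-1: the NEXT row of the same person has r_diff == 0.
# Shifting inside groupby keeps one person's rows from matching another's.
m3 = df.groupby("person_id")["r_diff"].shift(-1) == 0

# | combines the boolean masks row by row.
result = df[m1 | m2 | m3]
print(result)

# Python's `or` asks for a single True/False from a whole Series,
# which pandas refuses to give, so it raises ValueError.
try:
    m1 or m2
except ValueError as exc:
    print("ValueError:", exc)
```

The comparison with NaN simply evaluates to False, so the first row is picked up only through m3 (its next row has r_diff equal to 0), and the last row of each person never matches m3 because there is no next row to shift in.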

Admin | 3 months ago


