I changed this post due to a lack of response and, I think, asking too broad a question on my part. I have taken the original csv, grouped/selected out all the records that are the latest, and placed them in a new df, as well as created another df without the latest records.
I have two data frames, one with the latest version of each record:
new_df = [{'ful_id': '000c1a6c-1f1c', 'version': 3, 'xs': 123, 'at_grade': 'yes', 'date': 20171003},
          {'ful_id': '00dc5fec-ddb8', 'version': 2, 'xs': 556, 'at_grade': '', 'date': 20171009}]
and another with the older versions of each record:
old = [{'ful_id': '000c1a6c-1f1c', 'version': 2, 'xs': '', 'at_grade': 'yes', 'date': 20170902},
       {'ful_id': '000c1a6c-1f1c', 'version': 1, 'xs': '', 'at_grade': 'yes', 'date': 20170810},
       {'ful_id': '00dc5fec-ddb8', 'version': 1, 'xs': 556, 'at_grade': 'no', 'date': 20170803}]
*Data example only; the real spreadsheet has 130 columns and 20k+ records.
I need to run through each record, compare ids, and then loop through all versions of that id to see if data that was in the old version was deleted in the new one. I do not care about other changes, e.g. if the new version contains data that the old did not. So I was thinking about doing a Boolean comparison? The output would be any record id missing info, along with the column that was changed.
import pandas as pd
import numpy as np

# empty list to collect the comparison results
compare = []

# now I'm not sure how to proceed
for i, r in new_df.iterrows():
    if r['ful_id'] in old['ful_id'].values:
        # turn the values into booleans and compare with any([])
        # if a value is false in the new version but true in the old,
        # compare.append(...)
        pass
    else:
        continue  # on to the next id group
The last part is not code, I know; I'm just unsure how to proceed.
For my output I would like a csv with the id and the columns that are different from the current version. So for the example above, only one record and one column would be in the output, due to the current version having no at_grade value.
ful_id at_grade
00dc5fec-ddb8 false
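One way to get there (a sketch, assuming the records above, that a missing value is stored as an empty string, and a hypothetical output file name):

import pandas as pd

new = pd.DataFrame(new_df)   # the latest versions from above
prev = pd.DataFrame(old)     # the older versions

# compare every data column, ignoring the bookkeeping columns
data_cols = [c for c in new.columns if c not in ('ful_id', 'version', 'date')]

merged = prev.merge(new, on='ful_id', suffixes=('_old', '_new'))
rows = []
for col in data_cols:
    # flag ids where an old version had a value but the new one is blank
    lost = merged[(merged[col + '_old'] != '') & (merged[col + '_new'] == '')]
    for ful_id in lost['ful_id'].unique():
        rows.append({'ful_id': ful_id, 'column': col})

pd.DataFrame(rows).to_csv('missing_data.csv', index=False)

For the example data this writes a single row: 00dc5fec-ddb8 with column at_grade.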
I have a pandas data frame in which the values of one of its columns look like this:
print(VCF['INFO'].iloc[0])
Results (sorry, I can't copy and paste this data as I am working from a cluster without an internet connection)
I need to create new columns with the names END, SVTYPE and SVLEN, with their info as the values of those columns. Following the example, this would be:
END SVTYPE SVLEN
224015456 DEL -223224913
I do not need the rest of the info contained in the INFO column so far.
The information contained in this column is huge, but as far as I can read there is nothing more than something=value pairs, as you can see in the picture.
Simply use .str.extract:
extracted = df['INFO'].str.extract('END=(?P<END>.+?);SVTYPE=(?P<SVTYPE>.+?);SVLEN=(?P<SVLEN>.+?);')
Output:
>>> extracted
END SVTYPE SVLEN
0 224015456 DEL -223224913
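For reference, a self-contained check of that pattern; the INFO string here is made up to match the something=value format described above:

import pandas as pd

df = pd.DataFrame({'INFO': ['END=224015456;SVTYPE=DEL;SVLEN=-223224913;OTHER=x;']})
extracted = df['INFO'].str.extract('END=(?P<END>.+?);SVTYPE=(?P<SVTYPE>.+?);SVLEN=(?P<SVLEN>.+?);')
df = df.join(extracted)   # END, SVTYPE and SVLEN become regular columns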
I work with incomplete data that also has duplicates ('doubles'), and I need to clear out the doubles, choosing complete rows where available.
For example, this is what the data looks like (see the screenshot).
I need to search through each row to see whether it's a double (has a 'rank' > 1), and whether it is incomplete itself but has some complete doubles.
I'll explain now:
not every row with 'rank' = 1 has a date in it (this is crucial),
but some of them have doubles ('rank' > 1) which do have a date.
Not every row has a double, and if such a row doesn't have a date, that's OK.
So I need to find the double with the date if it exists, and write it over the row with rank 1 (or delete the incomplete first row).
In the end I need to have a DataFrame with no doubles and as much dates as available.
Here's my code with an EXTREMELY inefficient iterative loop, but I don't know how to rewrite it with vectorization or the .apply() method:
def test_func(dataframe):
    df = dataframe
    df.iloc[0:0]
    for i in range(0, dataframe.shape[0]):
        if dataframe.iloc[i]['rank'] == 1:
            # a rank-1 row becomes the current candidate
            temp_row = dataframe.iloc[i]
        elif ((dataframe.iloc[i+1]['rank'] > 1) &
              (pd.isna(dataframe.iloc[i]['date']) &
               (~pd.isna(dataframe.iloc[i+1]['date'])))):
            # the next row is a double that has the date this row lacks
            temp_row = dataframe.iloc[i+1]
        df.loc[i] = temp_row
    return df
Hope to find some help! From Russia with love xo.
Assuming that you are grouping by phone and are interested in populating missing dates, you can use backfill with groupby, which will fill the missing dates with the next available non-null date within the group.
test_df['date'] = test_df.groupby(['phone'])['date'].apply(lambda x: x.bfill())
If you need to populate other missing data, just replace 'date' with the relevant column name.
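Putting it together, a minimal sketch of the full clean-up, under the same assumption that 'phone' is the grouping key:

test_df['date'] = test_df.groupby('phone')['date'].bfill()   # pull dates up from the doubles
deduped = test_df[test_df['rank'] == 1]                      # keep only the rank-1 rows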
[updated with expected outcome]
I'm trying to implement a "running" check where the sum and mean of two rows need to be more than those of the previous 2 rows.
Referring to the dataframe (copied into a spreadsheet) below: I'm trying to code out a function where, if the mean of the two orange cells is more than that of the blue cells, the function will return true for row 8, under a new column called 'Cond11'. The dataframe here is historical, so all rows are available.
Note that the Rows column was added in the spreadsheet; it's easier for me to reference the rows here.
I have been using .rolling to refer to the current row plus however many rows before it, or shift(1) to refer to the previous row.
df.loc[:, ('Cond9')] = df.n.rolling(4).mean() >= 30
df.loc[:, ('Cond10')] = df.a > df.a.shift(1)
I'm stuck here... how do I compare these 2 rows against the previous 2 rows? Please advise!
The 2nd part of this question: I have another function that checks the latest rows in the dataframe for the same condition above. This function is meant to be used in real time, when new data is streaming into the dataframe and the function is supposed to check only the latest rows.
Can I check if the following code works to detect the same conditions above?
cond11 = candles.n[-2:-1].sum() > candles.n[-4:-3].sum()
I believe this solves your problem:
df['n'].rolling(4).apply(lambda rows: rows[0] + rows[1] < rows[2] + rows[3], raw=True)
The first 3 rows will be NaNs but you did not define what you would like to happen there.
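An equivalent without apply, using shifted rolling sums (this assumes 'n' is the column being compared, as in your Cond9):

two = df['n'].rolling(2).sum()      # sum of each row and the one before it
df['Cond11'] = two > two.shift(2)   # compare with the pair two rows earlier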
As for the second part: to produce this condition live for new data, you just have to prepend the last 3 rows of your current data and then apply the same process to it:
pd.concat([df[-3:], df])
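Concretely, a minimal sketch of the live check under the same assumptions, where new_rows is a hypothetical name for the freshly streamed candles:

live = pd.concat([candles[-3:], new_rows])         # prepend the last 3 historical rows
two = live['n'].rolling(2).sum()
cond11_now = bool((two > two.shift(2)).iloc[-1])   # the condition for the newest row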
Trying to create a unique key for a dataframe based on some columns. I used hashlib and zlib; both generate different values on each new Python program execution for the same record in the dataframe.
I'm looking for a way to create a unique checksum that stays the same for a given data record across runs. There are many columns, so I don't want to use a concatenated column as the key. Any insights would be much appreciated. Sample code tested using hashlib and zlib is below.
Hashlib
stg_matchdf["Unique travelid"] = pd.DataFrame(stg_matchdf[uniquecols_list].astype(str).values.sum(axis=1))[0].\
str.encode('utf-8').apply(lambda x: (hashlib.sha512(x).hexdigest().upper()))
zlib.adler32
stg_matchdf["Unique travelid"] = pd.DataFrame(stg_matchdf[uniquecols_list].astype(str).values.sum(axis=1))[0].\
str.encode('utf-8').apply(lambda x: (zlib.adler32(x) & 0xffffffff ))
Edited (10/21): I changed the code and am hitting a new problem. Please review, and sorry for any confusion.
The code snippets above have a problem: for a given row, the hash of some other row's column values ended up in the 'Unique travelid' column, because pd.DataFrame() altered the original df's row order. The modified code below fetches the respective column values for each row, but hits the new issue explained below.
Modified code
stg_matchdf["Unique travelid_Sum"] = stg_matchdf[uniquecols_list].astype(str).values.sum(axis=1)
stg_matchdf["Unique travelid_Key"] = stg_matchdf["Unique travelid_Sum"].apply(lambda x: (zlib.adler32(str(x).encode('utf-8')) & 0xffffffff))
stg_matchdf[uniquecols_list].astype(str).values.sum(axis=1) is not concatenating the columns in one particular order across multiple runs. Please see the sample below from two runs. The overall length is the same, but the order of concatenation is random, so hashlib/zlib return different values each time. Is there any way to specify the order of the columns in the above code?
Run1:
AHKGCANADACANADANORTH AMERICA266430RDirect WDAYYZINTERNATIONALMANULIFE - CANADA TRANSIENTFeb-2020HONG KONGASIA/PACIFICPARTIAL REFUND2020-02-15Canada266430.02020-02-02Hong Kong2020-03-01QVKGS6
Run2:
YYZCANADAPARTIAL REFUND2664302020-02-02AMANULIFE - CANADA TRANSIENTHONG KONGNORTH AMERICA2020-03-01Hong KongQVKGS6INTERNATIONALDirect WDRHKGACanadaFeb-2020266430.02020-02-15CANADAASIA/PACIFIC
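One likely culprit, though this is an assumption since the definition of uniquecols_list isn't shown: if uniquecols_list is a set, its iteration order changes between Python runs because of hash randomization. Selecting the columns in an explicitly fixed order makes the concatenation, and hence the hash, deterministic:

import hashlib

ordered_cols = sorted(uniquecols_list)   # any fixed order works; sorted() is deterministic
joined = stg_matchdf[ordered_cols].astype(str).agg(''.join, axis=1)
stg_matchdf['Unique travelid'] = joined.apply(
    lambda s: hashlib.sha512(s.encode('utf-8')).hexdigest().upper())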
Hi, I have a result set from psycopg2 like so:
(
(timestamp1, val11, val12, val13, val14),
(timestamp2, val21, val22, val23, val24),
(timestamp3, val31, val32, val33, val34),
(timestamp4, val41, val42, val43, val44),
)
I have to return the difference between the values of the rows (except for the timestamp column).
Each row should subtract the previous row's values.
The first row would be
timestamp, 'NaN', 'NaN' ....
This then has to be returned as a generic object, i.e. something like an array of the following objects:
Group(timestamp=timestamp, rows=[val11, val12, val13, val14])
I was going to use Pandas to do the diff.
Something like the following works OK on the values:
df = DataFrame().from_records(data=results, columns=headers)
diffs = df.set_index('time', drop=False).diff()
But diff also operates on the timestamp column, and I can't get it to ignore a column while leaving the original timestamp column in place. Also, I wasn't sure it was going to be efficient to get the data into my return format, as pandas advises against row access.
What would be a fast way to get the result set differences in my required output format?
Why did you set drop=False? That puts the timestamps in the index (where they will not be touched by diff) but also leaves a copy of the timestamps as a proper column, to be processed by diff.
I think this will do what you want:
diffs = df.set_index('time').diff().reset_index()
Since you mention psycopg2, take a look at the docs for pandas 0.14, released just a few days ago, which features improved SQL functionality, including new support for PostgreSQL. You can read and write directly between the database and pandas DataFrames.
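As for getting the diffs back into your return format, a sketch using a namedtuple; the Group class here is an assumption matching the shape described in the question:

from collections import namedtuple

Group = namedtuple('Group', ['timestamp', 'rows'])

diffs = df.set_index('time').diff().reset_index()
groups = [Group(timestamp=rec[0], rows=list(rec[1:]))
          for rec in diffs.itertuples(index=False)]

itertuples is considerably faster than iterrows if you do need to walk the rows.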