I have a CSV where some rows are parents and some are children, and I want to validate whether the format of the child rows is correct by adding a Yes/No value to a new column via a dataframe.
Type,ID,Ext_ID,Name
Parent,1111,abc.xyz.num1,yyy
Child,1,break://abc.xyz.num1/break1,break1
Child,2,break://abc.xyz.num1,break2
Parent,2222,abc.xyz.num2,zzz
Child,1,break://abc.xyz.num2/break1,break1
Child,2,break://abc.xyz.num2/break2,break2
Child,3,abc.xyz.num2/,break3
Child,4,break://abc.xyz.num2/break4,break4
Parent,3333,abc.xyz.num3,sss
Child,1,break://abc.xyz.num3/break1,break1
The correct format of a child's Ext_ID is break://{Ext_ID of parent}/{Name}, so what would be the best way to achieve the desired output below?
Type ID Ext_ID Name all_breaks_correct_format
Parent 1111 abc.xyz.num1 yyy No
Child 1 break://abc.xyz.num1/break1 break1
Child 2 break://abc.xyz.num1 break2
Parent 2222 abc.xyz.num2 zzz No
Child 1 break://abc.xyz.num2/break1 break1
Child 2 break://abc.xyz.num2/break2 break2
Child 3 abc.xyz.num2/ break3
Child 4 break://abc.xyz.num2/break4 break4
Parent 3333 abc.xyz.num3 sss Yes
Child 1 break://abc.xyz.num3/break1 break1
import numpy as np
import pandas as pd

df = pd.read_csv("/content/sample_data/your_csv.csv")
df
Define a parents dataframe (simple part):
parents = df.loc[df.Type=="Parent"].copy()
parents.index.name = "parent_id"
# Type ID Ext_ID
#parent_id
#0 Parent 1111 abc.xyz.num1
#3 Parent 2222 abc.xyz.num2
#8 Parent 3333 abc.xyz.num3
Define the children dataframe (difficult part):
# split the row index at each Parent position: each chunk holds a Parent row's
# index followed by the indices of its children
children = pd.DataFrame(
    np.split(df.index, df.loc[df.Type=="Parent"].index)[1:],
    index=df.loc[df.Type=="Parent"].index
).iloc[:, 1:].stack().astype(int).rename("child_id").reset_index()\
 .set_index("child_id").level_0.rename("parent_id").to_frame().join(df)
# .iloc[:, 1:] drops each chunk's leading Parent index; the stack/rename chain
# turns the result into a child_id -> parent_id frame, and join(df) pulls the
# child rows back in
Retrieve the parent's Ext_ID for each child:
children = children.merge(
    parents.Ext_ID, left_on="parent_id", right_index=True,
    suffixes=["_child", "_parent"]
)
Check whether each child's Ext_ID is correct:
children["expected_Ext_ID_child"] = "break://"+children.Ext_ID_parent+"/break"+children.ID.astype(str)
children["correct_Ext_ID"] = (children.Ext_ID_child==children.expected_Ext_ID_child)
children is:
parent_id Type ID Ext_ID_child Ext_ID_parent expected_Ext_ID_child correct_Ext_ID
child_id
1 0 Child 1 break://abc.xyz.num1/break1 abc.xyz.num1 break://abc.xyz.num1/break1 True
2 0 Child 2 break://abc.xyz.num1 abc.xyz.num1 break://abc.xyz.num1/break2 False
4 3 Child 1 break://abc.xyz.num2/break1 abc.xyz.num2 break://abc.xyz.num2/break1 True
5 3 Child 2 break://abc.xyz.num2/break2 abc.xyz.num2 break://abc.xyz.num2/break2 True
6 3 Child 3 abc.xyz.num2/ abc.xyz.num2 break://abc.xyz.num2/break3 False
7 3 Child 4 break://abc.xyz.num2/break4 abc.xyz.num2 break://abc.xyz.num2/break4 True
9 8 Child 1 break://abc.xyz.num3/break1 abc.xyz.num3 break://abc.xyz.num3/break1 True
Finally:
parents["all_breaks_correct_format"] = parents.join(children.set_index("parent_id").correct_Ext_ID).groupby(axis=0, level=0).correct_Ext_ID.all()
df = df.join(parents.all_breaks_correct_format)
PS: note that I assumed that the {num} in break{num} must equal the child's ID.
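PPS: the "difficult part" above can also be sketched with a forward-fill, assuming (as in the sample) that parents always appear before their children:
# propagate each Parent row's index and Ext_ID down to the Child rows below it
df["parent_id"] = pd.Series(df.index, index=df.index).where(df.Type == "Parent").ffill()
df["Ext_ID_parent"] = df.Ext_ID.where(df.Type == "Parent").ffill()
children_alt = df.loc[df.Type == "Child"].copy()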
I don't have experience with dataframes and I'm stuck on the following problem:
There is a table that looks like this:
parent account account number account name code
0 parent 1 123122 account1 1
1 parent 1 456222 account2 1
2 parent 1 456334 account3 1
3 parent 2 456446 account4 1
4 parent 2 456558 account5 2
5 parent 2 456670 account6 3
6 parent 2 456782 account7 1
7 parent 2 456894 account8 1
8 parent 2 457006 account9 1
9 parent 2 457118 account10 1
10 parent 2 457230 account11 1
11 parent 2 457342 account12 1
12 parent 2 457454 account13 1
13 parent 2 457566 account14 1
14 parent 3 457678 account15 1
15 parent 3 457790 account16 1
16 parent 4 457902 account17 5
17 parent 4 458014 account18 5
18 parent 4 458126 account19 5
19 parent 4 458238 account20 5
20 parent 4 458350 account21 1
I need to check which parents have only one version of code (last column) and which have more.
The needed output is a table that looks like the sample, but with every parent that has only one version of code excluded.
import pandas as pd

# read by default the 1st sheet of an excel file
dataframe1 = pd.read_excel("./input/dane.xlsx")
parents = dataframe1.groupby(["parent account", "code"])
This is the only output I've got at the moment; it's something, but it is not the result I need:
for i in parents["parent account"]:
    print(list(i)[0])
('parent 1', 1)
('parent 2', 1)
('parent 2', 2)
('parent 2', 3)
('parent 3', 1)
('parent 4', 1)
('parent 4', 5)
Could you please help me with that?
First obtain a list of parent accounts that have more than one distinct code:
condition = df.groupby('parent account').code.nunique() > 1
parent_list = list(condition.index[condition])
Then apply the filter on your data:
df[df['parent account'].isin(parent_list)]
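The same logic can also be written in one pass with transform, skipping the intermediate list (a sketch, using the same column names):
# keep only rows whose parent account has more than one distinct code
df[df.groupby('parent account')['code'].transform('nunique') > 1]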
ct_data['IM NO'] = ct_data['IM NO'].apply(lambda x: pyffx.Integer(b'dkrya#Jppl1994', length=20).encrypt(int(x)))
I am trying to encrypt with the line above; here is the head of ct_data:
Unnamed: 0 IM NO CT ID
0 0 214281340 x1E5e3ukRyEFRT6SUAF6lg|d543d3d064da465b8576d87
1 1 214281244 -vf6738ee3bedf47e8acf4613034069ab0|aa0d2dac654
2 2 175326863 __g3d877adf9d154637be26d9a0111e1cd6|6FfHZRoiWs
3 3 299631931 __gbe204670ca784a01b7207b42a7e5a5d3|54e2c39cd3
4 4 214282320 773840905c424a10a4a31aba9d6458bb|__g1114a30c6e
But I get the following:
Unnamed: 0 ... CT ID
0 0 ... x1E5e3ukRyEFRT6SUAF6lg|d543d3d064da465b8576d87
1 1 ... aa0d2dac654d4154bf7c09f73faeaf62|-vf6738ee3bed
2 2 ... 6FfHZRoiWs2VO02Pruk07A|__g3d877adf9d154637be26
3 3 ... 54e2c39cd35044ffbd9c0918d07923dc|__gbe204670ca
4 4 ... __g1114a30c6ea548a2a83d5a51718ff0fd|773840905c
5 5 ... 9e6eb976075b4b189ae7dde42b67ca3d|WgpKucd28IcdE
The IM NO column's values should be encrypted as 20-digit numbers. Normally, encryption is done as below:
import pyffx
strEncrypt = pyffx.Integer(b'dkrya#Jppl1994', length=20)
strEncrptVal = strEncrypt.encrypt(int('9digit IM No'))
ct_data.iloc[:, 1] displays the following:
0 214281340
1 214281244
2 175326863
3 299631931
4 214282320
5 214279026
This should be a comment but it contains formatted data.
It is probably a mere display problem. With the initial sample of your dataframe, I executed your command and printed its returned values:
print(ct_data['IM NO'].apply(lambda x: pyffx.Integer(b'dkrya#Jppl1994', length=20).encrypt(int(x))))
0 88741194526272080902
1 2665012251053580165
2 18983388112345132770
3 85666027666173191357
4 78253063863998100367
Name: IM NO, dtype: object
So it is correctly executed. Let us go one step further:
ct_data['IM NO'] = ct_data['IM NO'].apply(lambda x: pyffx.Integer(b'dkrya#Jppl1994', length=20).encrypt(int(x)))
print(ct_data['IM NO'])
0 88741194526272080902
1 2665012251053580165
2 18983388112345132770
3 85666027666173191357
4 78253063863998100367
Name: IM NO, dtype: object
Again...
That means your command was successful, but as the IM NO column is now wider, your system can no longer display all the columns; it shows only the first and last ones, with ellipses (...) in the middle.
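If you want pandas to print every column in full anyway, a small sketch using standard display options:
import pandas as pd

# show all columns instead of truncating the output with '...'
pd.set_option('display.max_columns', None)
# let the printed output use whatever width it needs
pd.set_option('display.width', None)
print(ct_data)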
First time here and a beginner in Pandas, so I'll try to be as clear as possible.
I have a data set with a column "Name" that contains child and parent rows.
The parent row gives me a start and a stop value, and with that I know which children are associated with this parent.
My data
In[64]: df
Out[64]:
Name Start Stop Id
0 child 2 4 x
1 child 5 6 x
2 child 7 8 x
3 parent 1 10 x
4 child 12 15 y
5 child 15 16 y
6 child 16 19 y
7 child 20 22 y
8 child 23 24 y
9 parent 11 25 y
10 child 27 28 z
11 child 29 34 z
12 parent 26 35 z
What I want is a dataframe for each parent that will contain all of its child rows.
A child's start and stop values must fall within the parent's range, and the Id needs to match as well.
UPDATE: Multiple parents can have the same Id.
I have a working strategy that goes like this:
Build a dataframe containing all the parent rows.
Iterate through all the rows of this new dataframe.
For each row, take its start, stop and id and test every row of my source dataframe for a match.
Append the matches to a new dataframe and insert it into a list of dataframes.
The code looks like this:
import pandas as pd

data = {'Name': ['child','child','child','parent','child','child','child','child','child','parent','child','child','parent'],
        'Start': [2,5,7,1,12,15,16,20,23,11,27,29,26],
        'Stop': [4,6,8,10,15,16,19,22,24,25,28,34,35],
        'Id': ['x','x','x','x','y','y','y','y','y','y','z','z','z']
       }
df = pd.DataFrame(data)
dfParent = df[df['Name'].str.contains('parent', regex=False)]
dfList = []  # list that will hold one dataframe of children per parent
for index, row in dfParent.iterrows():
    # Select every child of this parent: rows strictly inside the parent's
    # Start/Stop range with a matching Id
    dfTemp = df.loc[(df['Start'] > row['Start']) & (df['Stop'] < row['Stop']) & (df['Id'] == row['Id'])]
    dfList.append(dfTemp)
dfList
dfList
Out[61]:
[ Name Start Stop Id
0 child 2 4 x
1 child 5 6 x
2 child 7 8 x,
Name Start Stop Id
4 child 12 15 y
5 child 15 16 y
6 child 16 19 y
7 child 20 22 y
8 child 23 24 y,
Name Start Stop Id
10 child 27 28 z
11 child 29 34 z]
The result is OK, but the performance is terrible when I use my real data set (~500,000 rows).
So my question is: do you have any tips on how I can start improving this code?
Thanks!
Assuming that each Id only has one parent:
dfList = []
for k, d in df.groupby('Id'):
    start, stop = d.loc[d["Name"] == 'parent', ['Start', 'Stop']].iloc[0]
    dfList.append(d[d["Name"].eq('child') & d["Start"].ge(start) & d["Stop"].le(stop)])
Or you can do a merge and query:
(df[df["Name"].eq('parent')]
.merge(df[df["Name"].eq('child')], on='Id',
suffixes=['_p','_c'])
.query('Start_p<=Start_c<=Stop_c<=Stop_p')
)
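If you still need one dataframe per parent afterwards, a follow-up sketch (using the suffixed column names the merge above produces, and grouping by Id plus the parent's Start so that multiple parents sharing an Id stay separate):
merged = (df[df["Name"].eq('parent')]
          .merge(df[df["Name"].eq('child')], on='Id', suffixes=['_p','_c'])
          .query('Start_p<=Start_c<=Stop_c<=Stop_p'))

# one dataframe of child rows per parent
dfList = [g[['Name_c', 'Start_c', 'Stop_c', 'Id']]
          for _, g in merged.groupby(['Id', 'Start_p'])]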
Before I begin: I can hack something together to do this on a small scale, but my goal is to apply this to a 200k+ row dataset, so efficiency is a priority and I lack more... nuanced techniques. :-)
So, I have an ordered data set that represents data from a very complex hierarchical structure. I only have a unique ID, the tree depth, and the fact that it is in order. For example:
a
 b
  c
   d
   e
  f
  g
   h
i
 j
  k
 l
Which is stored as:
ID depth
0 a 0
1 b 1
2 c 2
3 d 3
4 e 3
5 f 2
6 g 2
7 h 3
8 i 0
9 j 1
10 k 2
11 l 1
Here's a line that should generate my example.
df = pd.DataFrame.from_dict({ "ID":["a","b","c","d","e","f","g","h","i","j","k","l"],
"depth":[0,1,2,3,3,2,2,3,0,1,2,1] })
What I want is to return either the index of each element's nearest parent node or the parent's unique ID (they'll both work, since they're both unique). Something like:
ID depth parent p.idx
0 a 0
1 b 1 a 0
2 c 2 b 1
3 d 3 c 2
4 e 3 c 2
5 f 2 b 1
6 g 2 b 1
7 h 3 g 6
8 i 0
9 j 1 i 8
10 k 2 j 9
11 l 1 i 8
My initial sloppy solution involved adding a column that was index-1, then self matching the data set with idx-1 (left) and idx (right), then identifying the maximum parent idx less than the child index... it didn't scale up well.
Here are a couple of routes to performing this task that I've put together; they work, but they aren't very efficient.
The first uses simple loops and includes a break to exit once the first match is identified.
df = pd.DataFrame.from_dict({"ID": ["a","b","c","d","e","f","g","h","i","j","k","l"],
                             "depth": [0,1,2,3,3,2,2,3,0,1,2,1]})
df["parent"] = ""
# loop over entire dataframe
for i1 in range(len(df.depth)):
    # loop back up from current row to top
    for i2 in range(i1):
        # identify the row where the depth is 1 less
        if df.depth[i1] - 1 == df.depth[i1-i2-1]:
            # Set parent value and exit loop
            df.loc[i1, "parent"] = df.ID[i1-i2-1]
            break
df.head(15)
The second merges the dataframe with itself and then uses a groupby to identify the maximum parent row less than each original row:
df = pd.DataFrame.from_dict({"ID": ["a","b","c","d","e","f","g","h","i","j","k","l"],
                             "depth": [0,1,2,3,3,2,2,3,0,1,2,1]})
# Columns for comparison and merging
df["parent_depth"] = df.depth - 1
df["row"] = df.index
# Merge to return ALL elements matching the parent depth of each row
df = df.merge(df[["ID","depth","row"]], left_on="parent_depth", right_on="depth",
              how="left", suffixes=('','_y'))
# Identify the maximum parent row less than the original row
g1 = df[(df.row_y < df.row) | (df.row_y.isnull())].groupby("ID").max()
g1.reset_index(inplace=True)
# clean up
g1.drop(["parent_depth","row","depth_y","row_y"], axis=1, inplace=True)
g1.rename(columns={"ID_y":"parent"}, inplace=True)
g1.head(15)
I'm confident those with more experience can provide more elegant solutions, but since I got something working, I wanted to provide my "solution". Thanks!
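One more sketch in the same spirit, vectorized per depth level rather than per row. It assumes the data is a valid preorder walk (the depth never jumps up by more than one step), in which case each row's parent is simply the most recent earlier row one level up, so a forward-fill per level does the work:
import pandas as pd

df = pd.DataFrame.from_dict({"ID": ["a","b","c","d","e","f","g","h","i","j","k","l"],
                             "depth": [0,1,2,3,3,2,2,3,0,1,2,1]})

df["parent"] = None
for d in range(1, df["depth"].max() + 1):
    # IDs of candidate parents (depth d-1), forward-filled down the frame
    candidates = df["ID"].where(df["depth"].eq(d - 1)).ffill()
    df.loc[df["depth"].eq(d), "parent"] = candidates

# map parent IDs back to row positions (the IDs are unique here)
df["p_idx"] = df["parent"].map(pd.Series(df.index, index=df["ID"]))
This makes one vectorized pass per depth level instead of one per row, which should scale much better than the nested loops.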
I want to remove specific sequences from my column, because they appear a lot and don't give me much extra information. The database consists of edges between nodes; in this case there will be an edge between node 1 and node 1, node 1 and node 2, node 2 and node 3, and so on.
However, the edge 1-5 occurs around 80,000 times in the real database. I want to filter those out, keeping only the 'not so common' interactions.
Let's say my dataframe looks like this:
>>> datatry
num line
0 1 56
1 1 90
2 2 66
3 3 4
4 1 23
5 5 22
6 3 144
7 5 33
What I have so far removes a value that repeats itself consecutively, keeping only its first occurrence:
c1 = datatry['num'].eq(1)
c2 = datatry['num'].eq(datatry['num'].shift(1))
datatry2 = datatry[(c1 & ~c2) | ~c1]
How could I alter the code above (which removes all rows that repeat the integer 1, keeping only the first row with the value 1) into code that removes all rows forming a specific sequence, for example a 1 followed by a 5? In that case I want to remove both the row with value 1 and the row with value 5 that appear in that sequence. My end result would ideally be:
>>> datatry
num line
0 1 56
1 1 90
2 2 66
3 3 4
4 3 144
5 5 33
Here is one way:
import numpy as np
import pandas as pd

def find_drops(seq, df):
    if seq:
        # True where the sequence starts: row i matches seq[0], row i+1 matches seq[1], ...
        m = np.logical_and.reduce([df.num.shift(-i).eq(seq[i]) for i in range(len(seq))])
        if len(seq) == 1:
            return pd.Series(m, index=df.index)
        else:
            # extend each True forward so every row of the matched sequence is flagged
            return pd.Series(m, index=df.index).replace({False: np.nan}).ffill(limit=len(seq)-1).fillna(False)
    else:
        return pd.Series(False, index=df.index)
find_drops([1], df)
#0 True
#1 True
#2 False
#3 False
#4 True
#5 False
#6 False
#7 False
#dtype: bool
find_drops([1,1,2,3], df)
#0 True
#1 True
#2 True
#3 True
#4 False
#5 False
#6 False
#7 False
#dtype: bool
Then just use those Series to slice: df[~find_drops([1,5], df)]
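Applied to the sample (with datatry as the frame), that drops rows 4 and 5, the 1-then-5 pair, and keeps everything else with its original index:
datatry[~find_drops([1, 5], datatry)]
#    num  line
# 0    1    56
# 1    1    90
# 2    2    66
# 3    3     4
# 6    3   144
# 7    5    33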
Did you look at duplicated? It has a default of keep='first', so to keep only the first occurrence of each value you can simply do:
datatry.loc[~datatry['num'].duplicated(), :]