Removing rows from a dataframe until the actual column names are found - python

I am reading tabular data from an email into a pandas dataframe.
There is no guarantee that the column names will be in the first row. Sometimes the data arrives in the following format, where the actual column names are [ID, Name, Year]:
dummy1 dummy2 dummy3
test_column1 test_column2 test_column3
ID Name Year
1 John Sophomore
2 Lisa Junior
3 Ed Senior
Sometimes the column names come in the first row as expected.
ID Name Year
1 John Sophomore
2 Lisa Junior
3 Ed Senior
Once I have read the HTML table from the email, how do I remove the initial rows that don't contain the column names? In the first case above I would need to remove the first 2 rows of the dataframe (including the column row), and in the second case I wouldn't need to remove anything.
Also, the column names can come in any sequence.
Basically, I want to do the following:
1. Check whether one of the column names is contained in one of the rows of the dataframe.
2. Remove the rows above it.
if "ID" in row:
    remove the rows above
How can I achieve this?

You can first get the index of the row that holds the valid column names, then set the columns and filter accordingly.
import pandas as pd

df = pd.read_csv("d.csv", sep=r"\s+", header=None)
col_index = df.index[(df == ["ID", "Name", "Year"]).all(1)].item()  # index of the row holding the column names
df.columns = df.iloc[col_index].to_numpy()  # set valid columns
df = df.iloc[col_index + 1:]  # keep only the data rows
df
ID Name Year
3 1 John Sophomore
4 2 Lisa Junior
5 3 Ed Senior
Or, if you want to set ID as the index:
df = df.iloc[col_index + 1:].set_index('ID')
df
Name Year
ID
1 John Sophomore
2 Lisa Junior
3 Ed Senior
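Since the question notes the column names can come in any sequence, here is a minimal variation that matches the row as a set rather than an ordered list (a sketch assuming df is still the raw header=None frame and the expected names are known):
expected = {"ID", "Name", "Year"}

# find the first row whose values form exactly the expected set of names
col_index = df.index[df.apply(lambda row: set(row) == expected, axis=1)][0]
df.columns = df.iloc[col_index].to_numpy()
df = df.iloc[col_index + 1:]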

Ugly but effective quick try:
id_name = df.columns[0]
# keep the header row itself plus the data rows whose first field is numeric;
# note that df[id_name].dtype is a single scalar, not a per-row mask
df_clean = df[(df[id_name] == 'ID') | (df[id_name].astype(str).str.isnumeric())]

Related

Compare different df's row by row and return changes

Every month I collect data that contains details of employees to be stored in our database.
I need to compare the data stored in the previous month with the newly received data and return, in a new dataframe, every row in which any of the columns changed.
I would also need to know which columns changed in each row of this returned dataframe.
There are also some important details to mention:
Each column can also contain blank values in any of the dataframes;
The dataframes have the same column names but not necessarily the same data type;
The dataframes do not have the same number of rows necessarily;
If a row does not find its Index match, it should not be returned in the new dataframe;
The rows of the dataframes can be matched by a column named "Index"
So, for example, we would have this dataframe (which is just a slice of the real one as it has 63 columns):
df1:
Index Department Salary Manager Email Start_Date
1 IT 6000.00 Jack ax@i.com 01-01-2021
2 HR 7000 O'Donnel ay@i.com
3 MKT $7600 Maria d 30-06-2021
4 I'T 8000 Peter az@i.com 14-07-2021
df2:
Index Department Salary Manager Email Start_Date
1 IT 6000.00 Jack ax@i.com 01-01-2021
2 HR 7000 O'Donnel ay@i.com 01-01-2021
3 MKT 7600 Maria dy@i.com 30-06-2021
4 IT 8000 Peter az@i.com 14-07-2021
5 IT 9000 John NOT PROVIDED
6 IT 9900 John NOT PROVIDED
df3:
Index Department Salary Manager Email Start_Date
2 HR 7000 O'Donnel ay@i.com 01-01-2021
3 MKT 7600 Maria dy@i.com 30-06-2021
4 IT 8000 Peter az@i.com 14-07-2021
The differences in this example are:
Start date added in row of Index 2
Salary format corrected and email corrected for row Index 3
Department format corrected for row Index 4
What would be the best way to do this comparison?
I am not sure if there is an easy solution to understand what changed in each field but returning the dataframe with rows that had at least 1 change would be helpful.
Thank you for the support!
I think compare could do the trick: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.compare.html
But first you would need to align the rows between the old and new dataframes via the index:
new_df_to_compare = new_df.loc[old_df.index]
When the datatypes don't match, you would also need to align them:
new_df_to_compare = new_df_to_compare.astype(old_df.dtypes.to_dict())
Then compare should work just like this:
difference_df = old_df.compare(new_df_to_compare)
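Putting it together as a sketch, assuming both frames carry the matching key in a column named "Index" as in the example:
old = old_df.set_index("Index")
new = new_df.set_index("Index")

# rows without an Index match are dropped, per the requirements
common = old.index.intersection(new.index)
old, new = old.loc[common], new.loc[common]

# align dtypes so compare() doesn't flag pure type differences
new = new.astype(old.dtypes.to_dict())

# changed cells appear as (self, other) pairs; unchanged rows are omitted
difference_df = old.compare(new)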

Compare two data-frames with different column names and update first data-frame with the column from second data-frame

I am working with two data-frames which have different column names and dimensions.
The first data-frame, "df1", contains a single column, "name", holding the names that need to be located in the second data-frame. If a name is matched, the value from the first column of df2 (df2[0]) needs to be returned and added to result_df.
The second data-frame, "df2", has multiple columns and no header. It contains all the possible diminutive names and full names; any of its columns can hold the "name" that needs to be matched.
Goal: locate each name from "df1" in "df2" and, if it is matched, return the value from the first column of df2 and add it to the respective row of df1.
df1
name
ab
alex
bob
robert
bill
df2
0          1     2    3
abram      ab
robert     rob   bob  robbie
alexander  alex  al
william    bill
result_df
name    matched_name
ab      abram
alex    alexander
bob     robert
robert  robert
bill    william
The code I have written so far is giving an error. It also needs to be efficient, as it will be checking millions of entries in df1 against df2:
def process_name(df1, df2):
    for elem in df2.values:
        if elem in df1['name']:
            df1["matched_name"] = df2[0]

result_df = process_name(df1, df2)
Try it via the concat(), merge(), drop(), rename() and reset_index() methods:
df = (pd.concat((df1.merge(df2, left_on='name', right_on=x) for x in df2.columns))
        .drop(columns=[1, 2, 3])  # df2's columns are the default integer labels
        .rename(columns={0: 'matched_name'})
        .reset_index(drop=True))
Output of df:
name matched_name
0 robert robert
1 ab abram
2 alex alexander
3 bill william
4 bob robert
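Since the question stresses efficiency over millions of rows, a single merge against a long-format lookup may scale better than one merge per column. This is only a sketch, assuming df2's columns are the default integer labels:
import pandas as pd

# reshape df2 so every name variant sits next to the full name from column 0
mapping = (
    df2.set_index(0)
       .stack()  # all variant columns into one long Series; NaNs are dropped
       .reset_index(name="name")
       .rename(columns={0: "matched_name"})[["name", "matched_name"]]
)
# the full name itself can also appear in df1 (e.g. "robert"), so add identity rows
identity = pd.DataFrame({"name": df2[0], "matched_name": df2[0]})
lookup = pd.concat([mapping, identity], ignore_index=True)
result_df = df1.merge(lookup, on="name", how="left")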

How to drop rows in one DataFrame based on one similar column in another Dataframe that has a different number of rows

I have two DataFrames that are completely dissimilar except for certain values in one particular column:
df
First Last Email Age
0 Adam Smith email1@email.com 30
1 John Brown email2@email.com 35
2 Joe Max email3@email.com 40
3 Will Bill email4@email.com 25
4 Johnny Jacks email5@email.com 50
df2
ID Location Contact
0 5435 Austin email5@email.com
1 4234 Atlanta email1@email.com
2 7896 Toronto email3@email.com
How would I go about finding the matching values in the Email column of df and the Contact column of df2, and then dropping the whole row in df based on that match?
Output I'm looking for (index numbering doesn't matter):
df1
First Last Email Age
1 John Brown email2@email.com 35
3 Will Bill email4@email.com 25
I've been able to identify matches using a few different methods like:
Changing the column names to be identical
common = df.merge(df2,on=['Email'])
df3 = df[(~df['Email'].isin(common['Email']))]
But df3 still shows all the rows from df.
I've also tried:
common = df['Email'].isin(df2['Contact'])
df.drop(df[common].index, inplace = True)
And again, identifies the matches but df still contains all original rows.
So the main thing I'm having difficulty with is updating df with the matches dropped or creating a new DataFrame that contains only the rows with dissimilar values when comparing the Email column from df and the Contact column in df2. Appreciate any suggestions.
As mentioned in the comments (@Arkadiusz), it is enough to filter your data as follows:
df3 = df[(~df['Email'].isin(df2.Contact))].copy()
print(df3)
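An equivalent way to express this, as a sketch, is an anti-join via merge with indicator=True:
# rows of df whose Email has no match in df2.Contact
df3 = (
    df.merge(df2[["Contact"]], left_on="Email", right_on="Contact",
             how="left", indicator=True)
      .query('_merge == "left_only"')
      .drop(columns=["Contact", "_merge"])
)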

Duplicate Analysis Using Python

I am a beginner in Python.
So far I have identified the duplicates using pandas lib but don't know how this will help me.
import pandas as pd
import numpy as np

dataframe = pd.read_csv("HKTW_INDIA_Duplicate_Account.csv")
dataframe.info()
name = dataframe["PARTY_NAME"]
duplicate_data = dataframe[name.isin(name[name.duplicated()])].sort_values("PARTY_NAME")
duplicate_data.head()
What I want: I have a set of duplicated records, and I need to merge the duplicates based on certain conditions and record the outcome in a new column.
I could do this manually in Excel, but the record count is very high and it would take a lot of time (more than 400,000 rows).
Primary Account ID  Secondary Account ID  Account Name  Translated Name  Created on Date  Amount Today  Amount Total  Split  Remarks  New ID
1234                245                   Julia         Julia            24-May-20        530           45            N
2345                                      Julia         Julia            24-Sep-20                                    N
3456                42                    Sara          Sara             24-Aug-20        230                         Y
4567                                      Sara          Sara             24-Sep-20                                    Y
5678                                      Matt          Matt             24-Jun-20                                    N
6789                                      Matt          Matt             24-Sep-20                                    N
7890                58                    Robert        Robert           24-Feb-20        525           21            N
1937                                      Robert        Robert           24-Sep-20                                    N
7854                55                    Robert        Robert           24-Jan-20        543           74            N
Conditions:
Only those accounts can be merged where Split is "N" and both Amount_Total and Amount_Today are blank.
The decision depends on whether there is a value in Secondary_Account_ID or not.
Example: Row 2 does not have a Secondary Account ID and has no value in Amount_Total or Amount_Today, but Row 1 has a value in Secondary_Account_ID, so Row 2 can be merged into Row 1 because both have the same name. The Remarks column should then read "Winner account has secondary id" for rows 1 and 2, and the Account ID from Row 1 should be copied into the "New ID" column of both rows.
If duplicate accounts have Amount_Total and Amount_Today, they should not be merged.
If duplicate accounts have no value in Secondary_Account_ID, check the Amount_Today and Amount_Total columns; the account without values in those two columns is merged into the one that has them.
If more than one duplicate account has a Secondary ID and Amount_Today or Amount_Total is available for one of them, that account is considered the winner account.
If more than one duplicate account has a Secondary ID and Amount_Today or Amount_Total is available for more than one of them, the account with the maximum Amount_Total is considered the winner account.
If Secondary_Account_ID, Total_Amount, and Today_Amount are all blank, the oldest account should be considered the winner.
(Each condition was followed by an "Expected Output" screenshot in the original post.)
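As a rough sketch of the winner-selection part (column names are assumed from the table above, and only a subset of the rules is implemented), a groupby on the account name could look like this:
import pandas as pd

df = pd.read_csv("HKTW_INDIA_Duplicate_Account.csv")

def pick_winner(group):
    # prefer accounts that carry a Secondary Account ID
    with_id = group[group["Secondary Account ID"].notna()]
    candidates = with_id if not with_id.empty else group
    # among those, the highest Amount Total wins
    if candidates["Amount Total"].notna().any():
        return candidates.loc[candidates["Amount Total"].idxmax()]
    # otherwise the oldest account wins
    created = pd.to_datetime(candidates["Created on Date"], format="%d-%b-%y")
    return candidates.loc[created.idxmin()]

# only rows flagged Split == "N" are candidates for merging
eligible = df[df["Split"] == "N"]
winners = eligible.groupby("Account Name").apply(pick_winner)

# propagate the winner's Primary Account ID to every row of its group
df["New ID"] = df["Account Name"].map(winners["Primary Account ID"])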

Search for string pattern in dataframe column, return each occurrence and join to another dataframe

I'm trying to do something similar to How to loop through pandas df column, finding if string contains any string from a separate pandas df column?, specifically the second problem
My initial dataframe logs contains the following:
ID DB USER MDX_TEXT
1 DB1 JOE SELECT [ATTR1].[ATTR1].[THE_ATTR1] ON COLUMNS,[ATTR2].[ATTR2].[THE_ATTR2] ON ROWS FROM [THE_DB] WHERE [ATTR3].[ATTR3].[THE_ATTR3]
2 DB1 JAY SELECT [ATTR1].[ATTR1].[THE_ATTR1] ON COLUMNS, [ATTR3].[ATTR3].[THE_ATTR3] ON ROWS FROM [THE_DB] WHERE [ATTR3].[ATTR3].[THE_ATTR3]
Using regex, I then extract the distinct attribute instances from MDX_TEXT per ID:
# Step 1: Define a regex pattern to extract MDX attributes
import re
import pandas as pd

pattern = re.compile(r'(\[[\w ]+\]\.\[[\w ]+\](?:\.(?:Members|\[Q\d\]))?)')

# Step 2: Extract the distinct MDX query attributes, based on the pattern
extrpat = (
    logs['MDX_TEXT'].str.extractall(pattern)
    .drop_duplicates()
    .to_numpy()
)

# Step 3: Create a dataframe to store the distinct list of attributes used in a query
attr = pd.DataFrame(data=extrpat)
attr.rename({0: 'attrname'}, inplace=True, axis=1)

# Preview the first 5 lines of the attributes dataframe
attr.head()
Which results in:
attrname
[THE_ATTR1]
[THE_ATTR2]
[THE_ATTR3]
[THE_ATTR1]
[THE_ATTR3]
What I would like is, in addition to extracting the unique attributes in step 2, to also extract the ID and USER, like this:
ID USER attrname
1 JOE [THE_ATTR1]
1 JOE [THE_ATTR2]
1 JOE [THE_ATTR3]
2 JAY [THE_ATTR1]
2 JAY [THE_ATTR3]
and then finally, join the attr and logs dataframes on the ID. The idea is to bring in a third dataframe users:
USER LOC
JOE NY
JIL NJ
MAC CA
...which I will join with the aforementioned on USER to end up with this:
ID USER LOC attrname
1 JOE NY [THE_ATTR1]
1 JOE NY [THE_ATTR2]
1 JOE NY [THE_ATTR3]
2 JAY NJ [THE_ATTR1]
2 JAY NJ [THE_ATTR3]
The pattern piece is a good start, but then you have to merge / join it with the original dataframe:
import re

df.index.name = "inx"
pattern = re.compile(r'(\[[\w ]+\]\.\[[\w ]+\])')
# extract the attributes (one row per match, indexed by inx and match number)
extracts = df.MDX_TEXT.str.extractall(pattern).rename(columns={0: "attrname"})
# join the matches back to the original dataframe
res = df.join(extracts).reset_index()[["ID", "USER", "attrname"]].drop_duplicates()
# take just the last part of each attribute name
res["attrname"] = res["attrname"].str.split(".", expand=True).iloc[:, -1]
The result is:
ID USER attrname
0 1 JOE [ATTR1]
1 1 JOE [ATTR2]
2 1 JOE [ATTR3]
3 2 JAY [ATTR1]
4 2 JAY [ATTR3]
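The remaining step, joining in the users dataframe on USER, could then look like this (a sketch reusing the res frame from above):
# left join keeps attribute rows even when a USER has no LOC entry
final = res.merge(users, on="USER", how="left")[["ID", "USER", "LOC", "attrname"]]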
