I have written a script to parse a csv file that contains an ID and a timestamp.
df = pd.read_csv(dataset_path, names=['UID', 'TSTAMP'], delimiter=';')
d = {'min':'TSTAMP-INIT','max':'TSTAMP-FIN'}
df = df.groupby(['UID'])['TSTAMP'].agg(['min', 'max']).reset_index().rename(columns=d)
df['DIFF'] = (df['TSTAMP-FIN'] - df['TSTAMP-INIT'])
Think of this as the csv file (the dots indicate other rows in the series):
3w]{;1495714405280
...
3w]{;1495714405340
...
3w]{;1495714571213
...
3w]{;1495714571317
...
3w]{;1495714405280
...
3w]{;1495714405340
...
3w]{;1495714571213
...
3w]{;1495714571317
the df gives me the difference between the first and last occurrence of 3w]{:
UID DIFF
0 3w]{ 166037
Instead, I want the output to be the difference between consecutive timestamps of the same ID:
UID DIFF
0 3w]{ 60
1 3w]{ 104
...
What am I missing?
On the UID column you are aggregating the timestamps, picking the min and max for that UID, and then taking the difference. For your requirement, select the two columns, rank the rows, and do a self-join on UID with rank = rank - 1. Alternatively, you can apply pandas' rolling() method.
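A minimal sketch of the self-join idea (assuming the csv holds exactly two fields, UID and TSTAMP, as in the question, and that dataset_path is the same variable used above):

import pandas as pd

df = pd.read_csv(dataset_path, names=['UID', 'TSTAMP'], delimiter=';')

# Rank rows within each UID by order of appearance (cumcount serves as the rank).
df['rank'] = df.groupby('UID').cumcount()

# Self-join each row to the previous row of the same UID (rank = rank - 1).
prev = df.rename(columns={'TSTAMP': 'TSTAMP_PREV'})
prev['rank'] = prev['rank'] + 1
out = df.merge(prev, on=['UID', 'rank'])
out['DIFF'] = out['TSTAMP'] - out['TSTAMP_PREV']

# The same result in one line, using groupby().diff() instead of a self-join:
# df['DIFF'] = df.groupby('UID')['TSTAMP'].diff()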
I get a file every day with around 15 columns. Some days there are 2 date columns and some days only one. Also, the date format is YYYY-MM-DD on some days and DD-MM-YYYY on others. The task is to convert the one or two date columns to MM-DD-YYYY. Sample data in the csv file for a few columns:
Execution_date,Extract_date,Requestor_Name,Count
2023-01-15,2023-01-15,John Smith,7
Sometimes we don't get the second column above (Extract_date):
Execution_date,Requestor_Name,Count
17-01-2023,Andrew Mill,3
The task is to find all the date columns in the file and change the date format to MM-DD-YYYY. So the sample output of the above 2 files will be:
Execution_date,Extract_date,Requestor_Name,Count
01-15-2023,01-15-2023,John Smith,7
Execution_date,Requestor_Name,Count
01-17-2023,Andrew Mill,3
I am using pandas and can't figure out how to deal with the missing second column on some days and the change of the date value format.
I can hardcode the 2 column names and change the format with:
df['Execution_Date'] = pd.to_datetime(df['Execution_Date'], format='%d-%m-%Y').dt.strftime('%m-%d-%Y')
df['Extract_Date'] = pd.to_datetime(df['Extract_Date'], format='%d-%m-%Y').dt.strftime('%m-%d-%Y')
This only works when the file has both date columns and the values are in DD-MM-YYYY format.
Looking for guidance on how to dynamically find the number of date columns and the date value format so that I can use them in the two lines above. If not, any other solution would also work for me. I can use PowerShell if it can't be done in Python, but I am guessing Python will offer far more avenues for this than PowerShell.
The following loads a CSV file into a dataframe, checks each value that is a str to see whether it matches one of the date formats, and, if it does, rearranges the date into the format you're looking for. Other values are untouched.
import pandas as pd
import re
df = pd.read_csv("today.csv")
# compiling the patterns ahead of time saves a lot of processing power later
d_m_y = re.compile(r"(\d\d)-(\d\d)-(\d\d\d\d)")
d_m_y_replace = r"\2-\1-\3"
y_m_d = re.compile(r"(\d\d\d\d)-(\d\d)-(\d\d)")
y_m_d_replace = r"\2-\3-\1"
def change_dt(value):
    if isinstance(value, str):
        if d_m_y.fullmatch(value):
            return d_m_y.sub(d_m_y_replace, value)
        elif y_m_d.fullmatch(value):
            return y_m_d.sub(y_m_d_replace, value)
    return value

new_df = df.applymap(change_dt)
However, if there are other columns containing dates that you don't want to change, and you just want to specify the columns to be altered, use this instead of the last line above:
cols = ["Execution_date", "Extract_date"]
for col in cols:
    if col in df.columns:
        df[col] = df[col].apply(change_dt)
You can convert the columns to datetimes if you wish.
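For example, a minimal sketch (assuming the values are now MM-DD-YYYY strings, as produced above):

df["Execution_date"] = pd.to_datetime(df["Execution_date"], format="%m-%d-%Y")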
You can use a function that checks all column names containing "date" and uses .fillna to try other formats (add all the formats you expect):
import pandas as pd
def convert_to_datetime(df: pd.DataFrame, column_name: str) -> pd.DataFrame:
    for column in df.columns[df.columns.str.contains(column_name, case=False)]:
        df[column] = (
            pd.to_datetime(df[column], format="%d-%m-%Y", errors="coerce")
            .fillna(pd.to_datetime(df[column], format="%Y-%m-%d", errors="coerce"))
        ).dt.strftime("%m-%d-%Y")
    return df
data1 = {'Execution_date': '2023-01-15', 'Extract_date': '2023-01-15', 'Requestor_Name': "John Smith", 'Count': 7}
df1 = pd.DataFrame(data=[data1])
data2 = {'Execution_date': '17-01-2023', 'Requestor_Name': 'Andrew Mill', 'Count': 3}
df2 = pd.DataFrame(data=[data2])
final1 = convert_to_datetime(df=df1, column_name="date")
print(final1)
final2 = convert_to_datetime(df=df2, column_name="date")
print(final2)
Output:
Execution_date Extract_date Requestor_Name Count
0 01-15-2023 01-15-2023 John Smith 7
Execution_date Requestor_Name Count
0 01-17-2023 Andrew Mill 3
I want to select many cells which are filtered only by month and year. For example, there are 01.01.2017, 15.01.2017, 03.02.2017 and 15.02.2017 cells. I want to group these cells looking only at the month and year information. If they are in January, they should be grouped together.
Output Expectation:
01.01.2017 ---- 1
15.01.2017 ---- 1
03.02.2017 ---- 2
15.02.2017 ---- 2
Edit: I have 2 datasets in different Excel files, as you can see below.
first data
second data
What I'm trying to do is get the 'Su Seviye' data for every 'DH_ID' separately from the first dataset, and then paste that data into the 'Kuyu Yüksekliği' column in the second dataset. The problems are that every 'DH_ID' is in a different sheet, and while the first dataset has only month and year information, the second dataset additionally has day information. How can I write this kind of code?
import pandas as pd
df = pd.read_excel('...Gözlem kuyu su seviyeleri- 2017.xlsx', sheet_name= 'GÖZLEM KUYULARI1', header=None)
df2 = pd.read_excel('...YERALTI SUYU GÖZLEM KUYULARI ANALİZ SONUÇLAR3.xlsx', sheet_name= 'HJ-8')
HJ8 = df.iloc[:, [0,5,7,9,11,13,15,17,19,21,23,25,27,29]]
##writer = pd.ExcelWriter('yıllarsuseviyeler.xlsx')
##HJ8.to_excel(writer)
##writer.save()
rb = pd.read_excel('...yıllarsuseviyeler.xlsx')
rb.loc[0,7]='01.2022'
rb.loc[0,9]='02.2022'
rb.loc[0,11]='03.2022'
rb.loc[0,13]='04.2022'
rb.loc[0,15]='05.2021'
rb.loc[0,17]='06.2022'
rb.loc[0,19]='07.2022'
rb.loc[0,21]='08.2022'
rb.loc[0,23]='09.2022'
rb.loc[0,25]='10.2022'
rb.loc[0,27]='11.2022'
rb.loc[0,29]='12.2022'
You can see what I have done above.
First, you can convert the date column to a datetime, then get the year and month part with to_period, and finally get the group number with ngroup().
df['group'] = df.groupby(pd.to_datetime(df['date'], format='%d.%m.%Y').dt.to_period('M')).ngroup() + 1
date group
0 01.01.2017 1
1 15.01.2017 1
2 03.02.2017 2
3 15.02.2017 2
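For reference, a self-contained version of the same idea, built from the sample dates in the question:

import pandas as pd

df = pd.DataFrame({'date': ['01.01.2017', '15.01.2017', '03.02.2017', '15.02.2017']})
# Parse the day.month.year strings, truncate them to monthly periods,
# and number each period starting from 1.
df['group'] = df.groupby(pd.to_datetime(df['date'], format='%d.%m.%Y').dt.to_period('M')).ngroup() + 1
print(df)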
I am extracting a data frame in pandas and want to only extract rows where the date is after a variable.
I can do this in multiple steps, but I'd like to know if it's possible to apply all the logic in one call, as best practice.
Here is my code:
import pandas as pd
self.min_date = "2020-05-01"
#Extract DF from URL
self.df = pd.read_html("https://webgate.ec.europa.eu/rasff-window/portal/index.cfm?event=notificationsList")[0]
#Here is where the error lies, I want to extract the columns ["Subject","Reference","Date of case"] but where the date is after min_date.
self.df = self.df.loc[["Date of case" < self.min_date], ["Subject","Reference","Date of case"]]
return(self.df)
I keep getting the error: "IndexError: Boolean index has wrong length: 1 instead of 100"
I cannot find the solution online because every answer is too specific to the scenario of the person who asked the question.
E.g. this solution only works if you are filtering on a single column: How to select rows from a DataFrame based on column values?
I appreciate any help.
Replace this:
["Date of case" < self.min_date]
with this:
self.df["Date of case"] < self.min_date
That is:
self.df = self.df.loc[self.df["Date of case"] < self.min_date,
["Subject","Reference","Date of case"]]
You have a slight syntax issue.
Keep in mind that it's best practice to convert string dates into pandas datetime objects using pd.to_datetime.
min_date = pd.to_datetime("2020-05-01")
#Extract DF from URL
df = pd.read_html("https://webgate.ec.europa.eu/rasff-window/portal/index.cfm?event=notificationsList")[0]
#Here is where the error lies, I want to extract the columns ["Subject","Reference","Date of case"] but where the date is after min_date.
df['Date of case'] = pd.to_datetime(df['Date of case'])
df = df.loc[df["Date of case"] > min_date, ["Subject","Reference","Date of case"]]
Output:
Subject Reference Date of case
0 Salmonella enterica ser. Enteritidis (presence... 2020.2145 2020-05-22
1 migration of primary aromatic amines (0.4737 m... 2020.2131 2020-05-22
2 celery undeclared on green juice drink from Ge... 2020.2118 2020-05-22
3 aflatoxins (B1 = 29.4 µg/kg - ppb) in shelled ... 2020.2146 2020-05-22
4 too high content of E 200 - sorbic acid (1772 ... 2020.2125 2020-05-22
I have two csv files with the same column names:
In file1 I have all the people who took a test, with their status (passed/missed)
In file2 I only have those who missed the test
I'd like to compare file1.column1 and file2.column1
If they match then compare file1.column4 and file2.column4
If they are different, remove the line from file2
I can't figure out how to do that.
I looked into pandas but didn't manage to get anything working.
What I have is:
file1.csv:
name;DOB;service;test status;test date
Smith;12/12/2012;compta;Missed;01/01/2019
foo;02/11/1989;office;Passed;01/01/2019
bar;03/09/1972;sales;Passed;02/03/2018
Doe;25/03/1958;garage;Missed;02/04/2019
Smith;12/12/2012;compta;Passed;04/05/2019
file2.csv:
name;DOB;service;test status;test date
Smith;12/12/2012;compta;Missed;01/01/2019
Doe;25/03/1958;garage;Missed;02/04/2019
What I want to get is:
file1.csv:
name;DOB;service;test status;test date
Smith;12/12/2012;compta;Missed;01/01/2019
foo;02/11/1989;office;Passed;01/01/2019
bar;03/09/1972;sales;Passed;02/03/2018
Doe;25/03/1958;garage;Missed;02/04/2019
Smith;12/12/2012;compta;Passed;04/05/2019
file2.csv:
name;DOB;service;test status;test date
Doe;25/03/1958;garage;Missed;02/04/2019
So first you will have to open the files:
import pandas as pd
df1 = pd.read_csv('file1.csv',delimiter=';')
df2 = pd.read_csv('file2.csv',delimiter=';')
Cleaning up the data frames, because of the white spaces found:
df1.columns = df1.columns.str.strip()
df2.columns = df2.columns.str.strip()
# Assuming only strings
df1 = df1.apply(lambda column: column.str.strip())
df2 = df2.apply(lambda column: column.str.strip())
Now for the expected solution, assuming that the name is UNIQUE.
Merging the files:
new_merged_df = df2.merge(df1[['name','test status']], how='left', on=['name'], suffixes=('','file1'))
DataFrame Result:
name DOB service test status test date test statusfile1
0 Smith 12/12/2012 compta Missed 01/01/2019 Missed
1 Smith 12/12/2012 compta Missed 01/01/2019 Passed
2 Doe 25/03/1958 garage Missed 02/04/2019 Missed
Filtering based on the requirements and removing the rows with a name whose test status differs between the files:
filter = new_merged_df['test status'] != new_merged_df['test statusfile1']
# Check if there are different values
if len(new_merged_df[filter]) > 0:
    drop_names = list(new_merged_df[filter]['name'])
    # Removing the values that we don't want
    new_merged_df = new_merged_df[~new_merged_df['name'].isin(drop_names)]
Removing the helper column and storing:
# Saving as a file with the same schema as file2
new_merged_df.drop(columns=['test statusfile1'],inplace=True)
new_merged_df.to_csv('file2.csv', sep=';', index=False)
Result
name DOB service test status test date
2 Doe 25/03/1958 garage Missed 02/04/2019
I have the following pandas data frame, which I want to sort by 'test_type':
test_type tps mtt mem cpu 90th
0 sso_1000 205.263559 4139.031090 24.175933 34.817701 4897.4766
1 sso_1500 201.127133 5740.741266 24.599400 34.634209 6864.9820
2 sso_2000 203.204082 6610.437558 24.466267 34.831947 8005.9054
3 sso_500 189.566836 2431.867002 23.559557 35.787484 2869.7670
My code to load the dataframe and sort it is below; the first print line produces the data frame above.
df = pd.read_csv(file) #reads from a csv file
print df
df = df.sort_values(by=['test_type'], ascending=True)
print '\nAfter sort...'
print df
After doing the sort and printing the dataframe contents, the data frame still looks like this:
Program output:
After sort...
test_type tps mtt mem cpu 90th
0 sso_1000 205.263559 4139.031090 24.175933 34.817701 4897.4766
1 sso_1500 201.127133 5740.741266 24.599400 34.634209 6864.9820
2 sso_2000 203.204082 6610.437558 24.466267 34.831947 8005.9054
3 sso_500 189.566836 2431.867002 23.559557 35.787484 2869.7670
I expect row 3 (test type: sso_500) to be on top after sorting. Can someone help me figure out why it's not working as it should?
Presumably, what you're trying to do is sort by the numerical value after sso_. You can do this as follows:
import numpy as np
df.iloc[np.argsort(df.test_type.str.split('_').str[-1].astype(int).values)]
This:
splits the strings at '_'
converts what follows this character into an integer
finds the indices that sort these numerical values
reorders the DataFrame according to those indices
Example
In [15]: df = pd.DataFrame({'test_type': ['sso_1000', 'sso_500']})
In [16]: df.sort_values(by=['test_type'], ascending=True)
Out[16]:
test_type
0 sso_1000
1 sso_500
In [17]: df.iloc[np.argsort(df.test_type.str.split('_').str[-1].astype(int).values)]
Out[17]:
test_type
1 sso_500
0 sso_1000
Alternatively, you could also extract the numbers from test_type and sort them, followed by reindexing the DataFrame according to those indices.
df.reindex(df['test_type'].str.extract(r'(\d+)', expand=False) \
           .astype(int).sort_values().index).reset_index(drop=True)
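With the two-row example above, this gives:

In [18]: df.reindex(df['test_type'].str.extract(r'(\d+)', expand=False).astype(int).sort_values().index).reset_index(drop=True)
Out[18]:
  test_type
0   sso_500
1  sso_1000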