Pandas excel difference generator - python

I am trying to create a Python program that gives me the differences between two big Excel files with multiple sheets. I got it to print the results to an Excel file, but apparently when one of the cells contains datetime data, the operation of multiplying a boolean dataframe by the dataframe that contains dates no longer works. I get the following error:
TypeError: unsupported operand type(s) for *: 'bool' and 'datetime.datetime'
EDIT: I just realized this method doesn't work for strings either (it only works for purely numerical data). What would be a better way to do this, one that works on strings, numbers and time data?
#start of program
import pandas as pd
from pandas import ExcelWriter
import numpy as np

df1 = pd.read_excel('4_Input EfE_2030.xlsm', None)
df2 = pd.read_excel('5_Input EfE_2030.xlsm', None)
keys1 = df1.keys()
keys2 = df2.keys()
writer = ExcelWriter('test1.xlsx')

# loop over all sheets and create new dataframes with the differences
for x in keys1:
    df3 = pd.read_excel('4_Input EfE_2030.xlsm', sheetname=x, header=None)
    df4 = pd.read_excel('5_Input EfE_2030.xlsm', sheetname=x, header=None)
    dif = df3 != df4
    df = dif * df3
    df2 = dif * df4
    nrcolumns = len(df.columns)
    # when there are no differences in the entire sheet the dataframe will be empty.
    # Add 1 to the row indexes so the numbers coincide with Excel row numbers.
    if not df.empty:
        # df.columns = ['A']
        df.index = np.arange(1, len(df) + 1)
    if not df2.empty:
        # df2.columns = ['A']
        df2.index = np.arange(1, len(df) + 1)
    # delete rows that are all 0
    df = df.loc[~(df == 0).all(axis=1)]
    df2 = df2.loc[~(df2 == 0).all(axis=1)]
    # create a new df with the data of the 2 sheets
    result = pd.concat([df, df2], axis=1)
    print(result)
    result.to_excel(writer, sheet_name=x)

Updated answer
Approach
This is an interesting question. Another approach is to compare the column values in one Excel worksheet against the column values in another Excel worksheet by using the Panel data structure offered by Pandas, which stores data as a 3-dimensional array. With the data from two Excel worksheets stored in a Panel, we can compare rows across worksheets that are uniquely identified by one or a combination of columns (e.g., a unique ID). We make this comparison by applying a custom function that compares the value in each cell of each column in one worksheet to the value in the same cell of the same column in the second worksheet. One benefit of this approach is that the datatype of each value no longer matters, since we're just comparing values (e.g., 1 == 1, 'my name' == 'my name', etc.).
Assumptions
This approach makes several assumptions about your data:
The rows in each of the worksheets share one or a combination of columns that uniquely identify each row.
The columns of interest for comparison exist in both worksheets and share the same column headers.
(There may be other assumptions I'm failing to notice.)
Implementation
The implementation of this approach is a bit involved. Also, because I do not have access to your data, I cannot customize the implementation specifically to your data. With that said, I'll implement this approach using some dummy data shown below.
"Old" dataset:
id col_num col_str col_datetime
1 123 My string 1 2001-12-04
2 234 My string 2 2001-12-05
3 345 My string 3 2001-12-06
"New" dataset:
id col_num col_str col_datetime
1 123 My string 1 MODIFIED 2001-12-04
3 789 My string 3 2001-12-10
4 456 My string 4 2001-12-07
Notice the following differences between these two dataframes:
col_str in the row with id 1 is different
col_num in the row with id 3 is different
col_datetime in the row with id 3 is different
The row with id 2 exists in "old" but not "new"
The row with id 4 exists in "new" but not "old"
Okay, let's get started. First, we read in the datasets into separate dataframes:
df_old = pd.read_excel('old.xlsx', 'Sheet1', na_values=['NA'])
df_new = pd.read_excel('new.xlsx', 'Sheet1', na_values=['NA'])
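If you want to follow along without creating the spreadsheets, here is a minimal sketch that builds the same dummy frames in memory (the file and sheet names above are just placeholders for your own data):
import pandas as pd

df_old = pd.DataFrame({
    'id': [1, 2, 3],
    'col_num': [123, 234, 345],
    'col_str': ['My string 1', 'My string 2', 'My string 3'],
    'col_datetime': pd.to_datetime(['2001-12-04', '2001-12-05', '2001-12-06']),
})
df_new = pd.DataFrame({
    'id': [1, 3, 4],
    'col_num': [123, 789, 456],
    'col_str': ['My string 1 MODIFIED', 'My string 3', 'My string 4'],
    'col_datetime': pd.to_datetime(['2001-12-04', '2001-12-10', '2001-12-07']),
})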
Then we add a new version column to each dataframe to keep our thinking straight. We'll also use this column later to separate out rows from the "old" and "new" dataframes into their own separate dataframes:
df_old['VER'] = 'OLD'
df_new['VER'] = 'NEW'
Then we concatenate the "old" and "new" datasets into a single dataframe. Notice that the ignore_index parameter is set to True so that we ignore the index as it is not meaningful for this operation:
df_full = pd.concat([df_old, df_new], ignore_index=True)
Now we're going to identify all of the duplicate rows that exist across the two dataframes. These are rows where all of the column values are the same across the "old" and "new" dataframes; in other words, these are rows where no differences exist.
Once identified, we drop these duplicate rows (keep=False drops every copy). What we're left with are the rows that (a) are different between the two dataframes, (b) exist in the "old" dataframe but not the "new" dataframe, and (c) exist in the "new" dataframe but not the "old" dataframe:
df_diff = df_full.drop_duplicates(subset=['id', 'col_num', 'col_str', 'col_datetime'], keep=False)
Next we identify and extract the values for id (i.e., the primary key across the "old" and "new" dataframes) for the rows that exist in both the "old" and "new" dataframes. It's important to note that these ids do not include rows that exist in one or the other dataframes but not both (i.e., rows removed or rows added):
diff_ids = df_diff.set_index('id').index.get_duplicates()
Now we restrict df_full to only those rows identified by the ids in diff_ids:
df_diff_ids = df_full[df_full['id'].isin(diff_ids)]
Now we move the duplicate rows from the "old" and "new" dataframes into separate dataframes that we can plug into a Panel data structure for comparison:
df_diff_old = df_diff_ids[df_diff_ids['VER'] == 'OLD']
df_diff_new = df_diff_ids[df_diff_ids['VER'] == 'NEW']
Next we set the index for both of these dataframes to the primary key (i.e., id). This is necessary for Panel to work effectively:
df_diff_old.set_index('id', inplace=True)
df_diff_new.set_index('id', inplace=True)
We slot both of these dataframes into a Panel data structure:
df_panel = pd.Panel(dict(df1=df_diff_old, df2=df_diff_new))
Finally we make our comparison using a custom function (find_diff) and the apply method:
def find_diff(x):
    return x[0] if x[0] == x[1] else '{} -> {}'.format(*x)

df_diff = df_panel.apply(find_diff, axis=0)
If you print out the contents of df_diff you can easily see which values changed between the "old" and "new" dataframes:
col_num col_str col_datetime
id
1 123 My string 1 -> My string 1 MODIFIED 2001-12-04 00:00:00
3 345 -> 789 My string 3 2001-12-06 00:00:00 -> 2001-12-10 00:00:00
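Note that Panel (and Index.get_duplicates()) has since been removed from pandas, so on a current version you can get the same result from the two aligned dataframes directly. A minimal sketch, reusing df_diff_old and df_diff_new from above (column names come from the dummy data):
# replacement for Index.get_duplicates() on current pandas:
# idx = df_diff.set_index('id').index
# diff_ids = idx[idx.duplicated()].unique()

cols = ['col_num', 'col_str', 'col_datetime']
old = df_diff_old[cols].sort_index().astype(str)
new = df_diff_new[cols].sort_index().astype(str)

changed = old.ne(new)                              # True where a cell differs
df_diff = old.where(~changed, old + ' -> ' + new)  # keep old value, or show "old -> new"
On pandas 1.1 or newer, old.compare(new) gives a similar side-by-side view of just the differing cells.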
Improvements
There are a few improvements I'll leave to you to make to this implementation (a rough sketch of them follows this list):
Add a binary (1/0) flag that indicates whether one or more values in a row changed
Identify which rows in the "old" dataframe were removed (i.e., are not present in the "new" dataframe)
Identify which rows in the "new" dataframe were added (i.e., not present in the "old" dataframe)
Original answer
Issue:
The issue is that you cannot perform arithmetic operations on datetimes.
However, you can perform arithmetic operations on timedeltas.
I can think of a few solutions that might help you:
Solution 1:
Convert your datetimes to strings.
If I'm understanding your problem correctly, you're comparing Excel worksheets for differences, correct? If this is the case, then I don't think it matters if the datetimes are represented as explicit datetimes (i.e., you're not performing any datetime calculations).
To implement this solution you would modify your pd.read_excel() calls and explicitly set the dtype parameter to read your datetime columns in as strings:
df1 = pd.read_excel('4_Input EfE_2030.xlsm', dtype={'LABEL FOR DATETIME COL 1': str})
Solution 2:
Convert your datetimes to timedeltas.
For each datetime column, you can subtract a reference date to get timedeltas, for example: df['LABEL FOR DATETIME COL'] - pd.Timestamp('1970-01-01')
Overall, without seeing your data, I believe Solution 1 is the most straightforward.
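As yet another option that sidesteps the type problem entirely: instead of multiplying by the boolean mask (the arithmetic is exactly what breaks for datetimes and strings), you can keep only the differing cells with where(). A minimal sketch using df3 and df4 from your loop:
dif = df3.ne(df4)                 # True where the two sheets differ
df_left = df3.where(dif)          # NaN where the sheets agree
df_right = df4.where(dif)

# keep only rows with at least one difference
# (note: as in your original code, NaN cells always compare as different)
changed_rows = dif.any(axis=1)
result = pd.concat([df_left[changed_rows], df_right[changed_rows]], axis=1)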

Related

Using pandas I need to create a new column that takes a value from a previous row

I have many rows of data and one of the columns is a flag. I have 3 identifiers that need to match between rows.
What I have:
partnumber, datetime1, previousdatetime1, datetime2, previousdatetime2, flag
What I need:
partnumber, datetime1, previousdatetime1, datetime2, previousdatetime2, flag, previous_flag
I need to find the flag from the row where partnumber matches, previousdatetime1 (current row)* == datetime1 (other row), and previousdatetime2 (current row) == datetime2 (other row).
*To note, the rows are not necessarily in order, so the "previous" row may come later in the dataframe.
I'm not quite sure where to start. I got this logic working in PBI using a LookUpValue, basically finding where partnumber = Value(partnumber), datetime1 = Value(datetime1), datetime2 = Value(datetime2). Thanks for the help!
Okay, so assuming you've read this in as a pandas dataframe df1:
(1) Make a copy of the dataframe:
df2=df1.copy()
(2) For sanity, drop some columns in df2
df2.drop(['previousdatetime1','previousdatetime2'],axis=1,inplace=True)
Now you have a df2 that has columns:
['partnumber','datetime1','datetime2','flag']
(3) Merge the two dataframes
newdf = df1.merge(df2, how='left',
                  left_on=['partnumber', 'previousdatetime1', 'previousdatetime2'],
                  right_on=['partnumber', 'datetime1', 'datetime2'],
                  suffixes=('', '_previous'))
Now you have a newdf that keeps all of df1's columns and brings in df2's columns with a _previous suffix wherever the names clash (most importantly flag_previous).
(4) Drop the unnecessary columns
newdf.drop(['datetime1_previous', 'datetime2_previous'], axis=1, inplace=True)
Now you have a newdf that has columns:
['partnumber','datetime1','previousdatetime1','datetime2','previousdatetime2','flag','flag_previous']
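A small self-contained example of the idea with made-up values (the data here is purely illustrative):
import pandas as pd

df1 = pd.DataFrame({
    'partnumber': ['A', 'A'],
    'datetime1': ['2020-01-02', '2020-01-03'],
    'previousdatetime1': ['2020-01-01', '2020-01-02'],
    'datetime2': ['2020-01-02', '2020-01-03'],
    'previousdatetime2': ['2020-01-01', '2020-01-02'],
    'flag': [0, 1],
})
df2 = df1.drop(['previousdatetime1', 'previousdatetime2'], axis=1)

newdf = df1.merge(df2, how='left',
                  left_on=['partnumber', 'previousdatetime1', 'previousdatetime2'],
                  right_on=['partnumber', 'datetime1', 'datetime2'],
                  suffixes=('', '_previous'))
print(newdf[['partnumber', 'datetime1', 'flag', 'flag_previous']])
# The second row picks up flag_previous = 0 from the row whose datetime1/datetime2
# match its previousdatetime1/previousdatetime2; the first row has no match, so NaN.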

Pandas fill dataframe from another dataframe where the [non-] index doesn't always overlap

I have a bunch of dataframes where I want to pull out a single column from each and merge them into another dataframe with a timestamp column that is not indexed.
So e.g. all the dataframes look like:
[Index] [time] [col1] [col2] [etc]
0 2020-04-21T18:00:00Z 1 2 ...
All of the dataframes have a 'time' column and a 'col1' column. Because the 'time' column does not necessarily overlap, I made a new dataframe with a join of all the dataframes (that I added to a dictionary)
di = ...  # dictionary of all the dataframes of interest
fulltimeslist = []
for key in di:
    temptimeslist = di[key]['time'].tolist()
    fulltimeslist.extend(x for x in temptimeslist if x not in fulltimeslist)
datadf['time'] = fulltimeslist  # make a new df and add this as a column
(I'm sure there's an easier way to do the above; any suggestions are welcome.) Note that, for a number of reasons, translating the ISO datetime format into a datetime and setting that as an index is not ideal.
The dumb way to do what I want is obvious enough:
for key in di:
    datadf[key] = float("NaN")
    tempdf = di[key]  # could skip this probably
    for i in range(len(datadf)):
        if tempdf.time[tempdf.time == datadf.time[i]].index.tolist():
            # make sure the value only shows up once; could reasonably skip this and put protection in elsewhere
            if len(tempdf.time[tempdf.time == datadf.time[i]].index.tolist()) == 1:
                datadf.loc[i, key] = float(tempdf[colofinterest][tempdf.time[tempdf.time == datadf.time[i]].index.tolist()])
# I guess I could do the above backwards so I loop over only the shorter dataframe to save some time.
but this seems needlessly long for Python... I originally tried the pandas merge and join methods but got various KeyErrors when trying them; the same goes for 'in' statements inside the if statements.
e.g., I've tried things like
datadf.join(Nodes_dict[key],datadf['time']==Nodes_dict[key]['time'],how="left").select()
but this fails.
I guess the question boils down to the following steps:
1) given 2 dataframes with a column of strings (times in ISO format), find the indexes in the larger one where they match the shorter one (or vice versa)
2) given that list of indexes, populate a separate column in the larger df using values from the smaller df, but only in the correct spots, and NaN otherwise
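For what it's worth, here is a sketch of steps 1) and 2) using a plain left merge on the raw time strings, reusing di, datadf and colofinterest from above (the drop_duplicates call is just a cheap stand-in for the "value only shows up once" check):
for key in di:
    right = di[key][['time', colofinterest]].drop_duplicates(subset='time')
    right = right.rename(columns={colofinterest: key})
    # left join: values land only where the time strings match, NaN elsewhere
    datadf = datadf.merge(right, on='time', how='left')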

How to avoid _x _y columns using pandas

I've checked other questions here but I don't think they've answered my issue (though it is quite possible I don't understand the solution).
I have daily data CSV files and have created a year-long pandas dataframe with a datetime index. I'm trying to merge all of these CSVs onto the main DataFrame and populate the columns, but I end up with hundreds of columns with the _x _y appendix as they all have the same column names.
I want to populate all these columns in-place, I know there must be a logical way of doing so but I can't seem to find it.
Edit to add info:
The original dataframe has several columns, of which I use a subset.
Index SOC HiTemp LowTemp UploadTime Col_B Col_C Col_D Col_E
0 55 24 22 2019-01-01T00:02:00 z z z z
1
2
I create an empty dataframe with the datetimeindex I want then run a loop for all of the CSV files.
datindex = pd.DatetimeIndex(start="01/01/2019", periods=525600, freq='T')
master_index = pd.DataFrame(index=datindex)

for fname in os.listdir('.'):
    data = pd.read_csv(fname)
    data["UploadTime"] = data["UploadTime"].str.replace('T', '-').str[:-3]
    data["UploadTime"] = pd.to_datetime(data["UploadTime"], format="%Y-%m-%d-%H:%M")
    data.drop_duplicates(subset="UploadTime", keep='first', inplace=True)
    data.set_index("UploadTime", inplace=True)
    selection = data[['Soc', 'EDischarge', 'EGridCharge', 'Echarge', 'Einput',
                      'Pbat', 'PrealL1', 'PrealL2', 'PrealL3']].copy(deep=True)
    master_index = master_index.merge(selection, how="left", left_index=True, right_index=True)
The initial merge creates the appropriate columns in master_index, but each subsequent merge creates a new set of columns: I want them to fill up the same columns, overwriting the NaN that the initial merge put there. In this way I should end up with as complete a dataset as possible (some days and timestamps are missing)
If you're talking about the headers as the 'appendix', you probably need to skip the first line before opening the CSVReader.
EDIT: This assumes all the columns in the csv's are sequenced the same, otherwise you'd have to map to a list after reading in the header
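A sketch of another way to avoid the suffixed columns altogether: since every file ends up indexed by UploadTime, you can let the index do the alignment and fill gaps in place with combine_first instead of merging repeatedly. Inside the loop above, the merge line would become something like:
    # keeps existing values in master_index and only fills the NaN gaps from this
    # file's selection, so every CSV updates the same columns (no Soc_x / Soc_y)
    master_index = master_index.combine_first(selection)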

python pandas difference between df_train["x"] and df_train[["x"]]

I have the following dataset, which I am reading from a CSV file:
x = [1, 2, 3, 4, 5]
With pandas I can access the column like this:
df_train = pd.read_csv("train.csv")
x = df_train["x"]
and like this:
x = df_train[["x"]]
Since both appear to produce the same result, I wonder what the difference is and when to use each. Please could you explain the difference and their use?
In pandas, you can slice your data frame in different ways. On a high level, you can choose to select a single column out of a data frame, or many columns.
When you select many columns, you have to slice using a list, and the return is a pandas DataFrame. For example
df[['col1', 'col2', 'col3']] # returns a data frame
When you select only one column, you can pass only the column name, and the return is just a pandas Series
df['col1'] # returns a series
When you do df[['col1']], you return a DataFrame with only one column. In other words, it's like you're telling pandas "give me all the columns from the following list" and giving it a list with one column in it. It will filter your df, returning all columns in your list (in this case, a data frame with only one column).
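A quick way to see the difference for yourself:
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 4, 5]})

print(type(df['x']))    # <class 'pandas.core.series.Series'>
print(type(df[['x']]))  # <class 'pandas.core.frame.DataFrame'>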
If you want more details on the difference between a Series and a one-column DataFrame, check this thread with very good answers

pandas dataframe: Changing from single index to multi-column index

In python pandas I have a dataframe
df_aaa:
date data otherdata symbol
2015/1/1 11 12 aaa
2015/2/1 21 22 aaa
2015/3/1 31 31 aaa
df_all:
date data otherdata symbol
2015/1/1 31 31 bbb
Currently the index of both is date.
I want to append df_aaa to df_all, and have them with a composite index of both symbol and date.
How do I do that?
Basically the following are all one question: How do I set a multi-index and use it when appending. Can I do it with different column order? Do I need to refresh? Etc.:
I'm not sure if a multi-index is an index that has multiple 'columns' (or rows), or is it the ability to have more than one index (and any of them could be for multiple columns or rows). Or are both correct?
Must I first set the index of both dataframes to a multi-index so the append will work? (Otherwise I'll have duplicates for different symbols.)
Do I have to "drop" the existing index before creating the new one?
Is there such a thing as a dataframe with data but no index?
Must a (single) index be of unique values?
When do I use which of the following dataframe methods: set_index(), reindex(), reset_index(), set_level, reset_level?
And what is the default when I give these methods an array. Python docs are daunting, and I can't find my hands or legs in them. Giving some good examples would help...
Do I have to add anything (like axis=1) when setting the index?
How do I set the index to be the data in a column? (And why does using ['symbol', 'date'] as a parameter sometimes give me a new column with those two values, instead of setting the index on the existing values of the columns with those two names?)
After I append and assuming the old index is correct do I need to 'update' the index (perhaps using reindex?) or since I told the dataframe that the index is in a certain column, is my data correctly indexed?
And since my dataframes (will) have indices on the same column name, can I do an append of df_aaa on df_all even if df_all was defined to have the columns originally in a different order. (say: ['symbol', 'date', 'data', 'otherdata'] with symbol the first column)?
You can just concatenate them and then set the index.
df_aaa = df_aaa.reset_index()
df_all = df_all.reset_index()
df = df_aaa.append(df_all).set_index(['symbol', 'date'])
Note that this would work only if your dataframes have the same columns.
If you must perform multiple appends in the future, the best thing to do would be to get one of them in the shape of the other, perform the concatenation, and reset index as needed.
I'll answer all your questions one by one.
I'm not sure if a multi-index is an index that has multiple 'columns' (or rows), or is it the ability to have more than one index (and any of them could be for multiple columns or rows). Or are both correct?
It depends on what axis you're referring to. Along the row (0th axis), you have 2 or more columns forming a MultiIndex. Similarly for along the columns (1st axis).
Must I first set the index of both dataframes to a multi-index, so the append will work? (Otherwise I'll have duplicates for different symbols.)
No need. Although you could, not doing so would be simpler in this case.
Do I have to "drop" the existing index before creating the new one?
No, just that the columns must align (column name and number of columns should be the same).
Is there such a thing as a dataframe with data but no index?
No. All rows are indexed. Even if there is no column as the index, the index is a monotonically increasing number. The model followed here is similar to that in RDBMs.
Must a (single) index be of unique values?
In general, they must, so rows can be uniquely identified. If you have a MultiIndex, each combination of values that makes up the index must be unique.
When do I use which of the following dataframe methods: set_index(), reindex(), reset_index(), set_level, reset_level?
This is a broad question. It depends, when do you want to operate on the index and if so, what do you want to do with it? Look at the documentation for each one carefully.
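As a small illustration of the two methods you will use most for this question, set_index() and reset_index() (the frame below is just the df_aaa sample re-typed by hand):
import pandas as pd

df = pd.DataFrame({'date': ['2015/1/1', '2015/2/1', '2015/3/1'],
                   'symbol': ['aaa', 'aaa', 'aaa'],
                   'data': [11, 21, 31],
                   'otherdata': [12, 22, 31]})

df = df.set_index(['symbol', 'date'])  # 'symbol' and 'date' now form a MultiIndex on the rows
df = df.reset_index()                  # moves them back out into ordinary columns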
Just append the dfs and reset_index(), so you can then set_index() with the keys argument. Here's a one-liner:
df_all = df_all.append(df_aaa).reset_index().set_index(keys=['symbol', 'date'])
And here is full working sample.
In [1]: import pandas as pd
...: from io import StringIO
...:
In [2]: df_aaa = pd.read_csv(StringIO("""date data otherdata symbol
...: 2015/1/1 11 12 aaa
...: 2015/2/1 21 22 aaa
...: 2015/3/1 31 31 aaa
...: """), sep="\s+", index_col='date')
...:
In [3]: df_all = pd.read_csv(StringIO("""date data otherdata symbol
...: 2015/1/1 31 31 bbb"""), sep="\s+", index_col='date')
...:
In [4]: df_all.append(df_aaa).reset_index().set_index(keys=['symbol', 'date'])
Out[4]:
data otherdata
symbol date
bbb 2015/1/1 31 31
aaa 2015/1/1 11 12
2015/2/1 21 22
2015/3/1 31 31
Here is what I gather from the answers and from dragging through the docs:
There is a "default index", which is a row number for each row and which is not part of any of the columns.
When merging with that index, there seems to be no need to re-index.
But if I want to change the index after it was made "non-standard", I have to reset_index() to turn it back to the default, and from there I can create the new multi-index (as explained in the revised answer below).
A multi-index is one that has more than one key (i.e., if indexing the rows, then more than one column will be used).
I'm still not sure if you have to re-index a column after a merge, but according to this it seems you get an automatically generated new "default index", and you have to save the old one, remove the index before the merge (reset_index), and set it again when done.
The other question, about the index replacing a column, I'll check and follow up on here.
