I have three different dataframes which all contain a column with certain IDs.
DF_1
DF_2
DF_3
What I am trying to achieve is to create an Excel file per unique ID, named after that ID, with the dataframes DF_1, DF_2 and DF_3 as its sheets. So '1.xlsx' should contain three sheets (one per dataframe) with only the records that are associated with that ID. The thing I get stuck at is that I either get the multiple sheets or only the corresponding values per unique ID, but not both.
for name, r in df_1.groupby("ID"):
    r.to_excel(f'{name}.xlsx', index=False)
This piece of code gives me the correct output, but only for df_1. I get 5 Excel files with the corresponding rows per ID, but each with only one sheet, namely the one for df_1. I can't figure out how to include df_2 and df_3 per ID. When I try to use the following piece of code with nested loops, I get all the rows instead of only the rows for that unique ID:
writer = pd.ExcelWriter(f'{name}.xlsx')
r.to_excel(writer, sheet_name=f'{name}_df1')
r.to_excel(writer, sheet_name=f'{name}_df2')
r.to_excel(writer, sheet_name=f'{name}_df3')
writer.save()
There is more data transformation going on before this part, and the final dataframes are the ones that are needed eventually. Frankly, I have no idea how to fix this or how to achieve it. Hopefully, someone has some insightful comments.
Can you try the following:
unique_ids = df_1['ID'].unique()
for name in unique_ids:
    with pd.ExcelWriter(f'{name}.xlsx') as writer:
        r1 = df_1[df_1['ID'].eq(name)]
        r1.to_excel(writer, sheet_name=f'{name}_df1')
        r2 = df_2[df_2['ID'].eq(name)]
        r2.to_excel(writer, sheet_name=f'{name}_df2')
        r3 = df_3[df_3['ID'].eq(name)]
        r3.to_excel(writer, sheet_name=f'{name}_df3')
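If an ID can show up in df_2 or df_3 but not in df_1, you could build the list of IDs from all three frames instead; a small variation, assuming the column is called 'ID' in each dataframe:

unique_ids = pd.unique(pd.concat([df_1['ID'], df_2['ID'], df_3['ID']]))  # union of IDs across the three frames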
Good day All,
I have two dataframes that need to be merged; my case is a little different from the ones I found so far, and I could not get it working. What I am currently getting is wrong, and I am sure it has to do with the index, as dataframe 1 only has one record. I need to copy the contents of dataframe 1 into new columns of dataframe 2 for all rows.
Current problem highlighted in red
I have tried merge, append, reset index etc...
DF 1: (screenshot: Dataframe 1)
DF 2: (screenshot: Dataframe 2)
Output requirement: (screenshot: Required Output)
Any suggestions would be highly appreciated
Update:
I got it to work using the below statements, is there a more dynamic way than specifying the column names?
mod_df['Type'] = mod_df['Type'].fillna(method="ffill")
mod_df['Date'] = mod_df['Date'].fillna(method="ffill")
mod_df['Version'] = mod_df['Version'].fillna(method="ffill")
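If every column (or a known subset) should be forward-filled, the per-column calls can be collapsed; a minimal sketch:

mod_df = mod_df.ffill()  # forward-fill every column at once
# or restrict it to a subset without repeating the call per column:
cols = ['Type', 'Date', 'Version']
mod_df[cols] = mod_df[cols].ffill()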
Assuming you have a single row in df1, use a cross merge:
out = df2.merge(df1, how='cross')
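A quick sketch with made-up data to show what the cross merge does (the column names here are only illustrative):

import pandas as pd

df1 = pd.DataFrame({'Type': ['A'], 'Date': ['2021-01-01'], 'Version': [3]})  # single-row frame
df2 = pd.DataFrame({'Id': [10, 20, 30]})

out = df2.merge(df1, how='cross')
# Every row of df2 now carries the columns of the single df1 row:
#    Id Type        Date  Version
# 0  10    A  2021-01-01        3
# 1  20    A  2021-01-01        3
# 2  30    A  2021-01-01        3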
I've been struggling with this assignment for quite a while now and I can't seem to get the correct solution. My problem is as follows: I have imported two worksheets (A & B) from an Excel file and assigned these to two dataframe variables. Worksheets A & B have rows with similar names; however, the cells in worksheet B are empty whereas the cells in worksheet A contain numbers. My assignment is to copy the rows from worksheet A to worksheet B. This needs to be done in a specific order. For example: in worksheet A the row "sales revenue" is at index 1, but in worksheet B the row "sales revenue" is at index 5. Does anybody know how to do this? I'm an absolute beginner with python/pandas. I have included a screenshot of the situation.
You need to merge the two tables using a left join:
# keep only the row-label column (this drops the all-NaN data columns)
df_data3 = df_data3[['Unnamed:0']]
df_data3 = pd.merge(df_data3, df_data2, on='Unnamed:0', how='left')
See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html for more options; play with it a bit to better understand how this works.
It is similar to an SQL join.
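A small sketch with made-up row labels (here 'Unnamed:0' stands in for whatever column holds the row names in both sheets):

import pandas as pd

df_b = pd.DataFrame({'Unnamed:0': ['sales revenue', 'costs', 'profit']})           # worksheet B: labels only
df_a = pd.DataFrame({'Unnamed:0': ['costs', 'sales revenue'], '2021': [40, 100]})  # worksheet A: labels plus numbers

merged = df_b.merge(df_a, on='Unnamed:0', how='left')
# The result keeps worksheet B's row order and pulls in the numbers from A;
# labels with no match in A (e.g. 'profit') get NaN.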
I have a CSV file that contains multiple tables. Each table has a title and a variable number of rows and columns (these numbers may vary between files). The titles, as well as the names of rows and columns, may also change between different files I will need to parse in the future, so I cannot hardcode them. Some columns may contain empty cells as well.
Here is a screenshot of an example CSV file with this structure:
I need to find a solution that will parse all the tables from the CSV into Pandas DFs. Ideally the final output would be an Excel file, where each table is saved as a sheet, and the name of each sheet will be the corresponding table title.
I tried the suggested solution in this post, but it kept failing to identify the start/end of the tables. When I used a simpler version of the input CSV file, the suggested code only returned one table.
I would appreciate any assistance!!
You could try this:
df = pd.read_csv("file.csv")
dfs = []
start = 0
for i, row in df.iterrows():
    if all(row.isna()):  # empty row marks the end of a table
        # Slice out the current table and remove fully empty columns
        temp_df = df.loc[start:i, :].dropna(how="all", axis=1)
        if start:  # Grab header, except for first df
            new_header = temp_df.iloc[0]
            temp_df = temp_df[1:]
            temp_df.columns = new_header
        temp_df = temp_df.dropna(how="all", axis=0)
        dfs.append(temp_df)
        start = i + 1

# Don't forget the last table if the file doesn't end with an empty row
if start < len(df):
    temp_df = df.loc[start:, :].dropna(how="all", axis=1)
    if start:
        new_header = temp_df.iloc[0]
        temp_df = temp_df[1:]
        temp_df.columns = new_header
    dfs.append(temp_df.dropna(how="all", axis=0))
Then, you can access each df as dfs[0], dfs[1], ...
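To get the Excel output the question asks for, you could then write each parsed table to its own sheet. A minimal sketch; the sheet names here are generic placeholders, so swap in the table titles if you keep track of them while parsing:

with pd.ExcelWriter("tables.xlsx") as writer:
    for n, t in enumerate(dfs):
        t.to_excel(writer, sheet_name=f"table_{n}", index=False)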
I have one Excel file with several identically structured sheets (same headers and number of columns); the sheet names are 01, 02, ..., 12.
How can I get this into one dataframe?
Right now I would load them all separately with:
df1 = pd.read_excel('path.xls', sheet_name='01')
df2 = pd.read_excel('path.xls', sheet_name='02')
...
and would then concatenate them.
What is the most pythonic way to do this and get one dataframe with all the sheets directly? Also assume I do not know every sheet name in advance.
read the file as:
collection = pd.read_excel('path.xls', sheet_name=None)
combined = pd.concat([value.assign(sheet_source=key)
for key,value in collection.items()],
ignore_index=True)
sheet_name=None ensures all the sheets are read in.
collection is a dictionary, with the sheet_name as key, and the actual data as the values. combined uses the pandas concat method to get you one dataframe. I added the extra column sheet_source, in case you need to track where the data for each row comes from.
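For example, to check what was read and how many rows each sheet contributed (assuming the sheet names from the question):

print(list(collection))                         # ['01', '02', ..., '12']
print(combined['sheet_source'].value_counts())  # row count per source sheet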
You can read more about it in the pandas docs.
you can use:
df_final = pd.concat([pd.read_excel('path.xls', sheet_name="{:02d}".format(sheet)) for sheet in range(1, 13)], axis=0)
I am trying to create a python program that gives me the differences between two big Excel files with multiple sheets. I got it to print the results to an Excel file, but apparently when one of the cells contains datetime data, multiplying a boolean dataframe with the dataframe that contains dates no longer works. I get the following error:
TypeError: unsupported operand type(s) for *: 'bool' and 'datetime.datetime'
EDIT: I just realized this method doesn't work for strings either (it only works for pure numerical data). What would be a better way to do this that works on strings, numbers and time data?
# start of program
import pandas as pd
from pandas import ExcelWriter
import numpy as np

df1 = pd.read_excel('4_Input EfE_2030.xlsm', None)
df2 = pd.read_excel('5_Input EfE_2030.xlsm', None)
keys1 = df1.keys()
keys2 = df2.keys()
writer = ExcelWriter('test1.xlsx')

# loop over all sheets and create new dataframes with the differences
for x in keys1:
    df3 = pd.read_excel('4_Input EfE_2030.xlsm', sheet_name=x, header=None)
    df4 = pd.read_excel('5_Input EfE_2030.xlsm', sheet_name=x, header=None)
    dif = df3 != df4
    df = dif * df3
    df2 = dif * df4
    nrcolumns = len(df.columns)
    # when there are no differences in the entire sheet the dataframe will be empty;
    # add 1 to the row indexes so the numbers coincide with the Excel row numbers
    if not df.empty:
        df.index = np.arange(1, len(df) + 1)
    if not df2.empty:
        df2.index = np.arange(1, len(df) + 1)
    # delete rows that are all 0
    df = df.loc[~(df == 0).all(axis=1)]
    df2 = df2.loc[~(df2 == 0).all(axis=1)]
    # create a new df with the data of the 2 sheets
    result = pd.concat([df, df2], axis=1)
    print(result)
    result.to_excel(writer, sheet_name=x)
Updated answer
Approach
This is an interesting question. Another approach is to compare the column values in one Excel worksheet against the column values in another Excel worksheet by using the Panel data structure offered by Pandas. This data structure stores data as a 3-dimensional array. With the data from two Excel worksheets stored in a Panel, we can compare rows across worksheets that are uniquely identified by one or a combination of columns (e.g., a unique ID). We make this comparison by applying a custom function that compares the value in each cell of each column in one worksheet to the value in the same cell of the same column in the second worksheet. One benefit of this approach is that the datatype of each value no longer matters, since we're just comparing values (e.g., 1 == 1, 'my name' == 'my name', etc.).
Assumptions
This approach makes several assumptions about your data:
The rows in each of the worksheets share one or a combination of columns that uniquely identify each row.
The columns of interest for comparison exist in both worksheets and share the same column headers.
(There may be other assumptions I'm failing to notice.)
Implementation
The implementation of this approach is a bit involved. Also, because I do not have access to your data, I cannot customize the implementation specifically to your data. With that said, I'll implement this approach using some dummy data shown below.
"Old" dataset:
id col_num col_str col_datetime
1 123 My string 1 2001-12-04
2 234 My string 2 2001-12-05
3 345 My string 3 2001-12-06
"New" dataset:
id col_num col_str col_datetime
1 123 My string 1 MODIFIED 2001-12-04
3 789 My string 3 2001-12-10
4 456 My string 4 2001-12-07
Notice the following differences about these two dataframes:
col_str in the row with id 1 is different
col_num in the row with id 3 is different
col_datetime in the row with id 3 is different
The row with id 2 exists in "old" but not "new"
The row with id 4 exists in "new" but not "old"
Okay, let's get started. First, we read in the datasets into separate dataframes:
df_old = pd.read_excel('old.xlsx', 'Sheet1', na_values=['NA'])
df_new = pd.read_excel('new.xlsx', 'Sheet1', na_values=['NA'])
Then we add a new version column to each dataframe to keep our thinking straight. We'll also use this column later to separate out rows from the "old" and "new" dataframes into their own separate dataframes:
df_old['VER'] = 'OLD'
df_new['VER'] = 'NEW'
Then we concatenate the "old" and "new" datasets into a single dataframe. Notice that the ignore_index parameter is set to True so that we ignore the index as it is not meaningful for this operation:
df_full = pd.concat([df_old, df_new], ignore_index=True)
Now we're going to identify all of the duplicate rows that exist across the two dataframes. These are rows where all of the column values are the same across the "old" and "new" dataframes. In other words, these are rows where no differences exist.
Once identified, we drop these duplicate rows. What we're left with are the rows that (a) are different between the two dataframes, (b) exist in the "old" dataframe but not the "new" dataframe, and (c) exist in the "new" dataframe but not the "old" dataframe:
df_diff = df_full.drop_duplicates(subset=['id', 'col_num', 'col_str', 'col_datetime'])
Next we identify and extract the values for id (i.e., the primary key across the "old" and "new" dataframes) for the rows that exist in both the "old" and "new" dataframes. It's important to note that these ids do not include rows that exist in one or the other dataframes but not both (i.e., rows removed or rows added):
diff_ids = df_diff.set_index('id').index.get_duplicates()
Now we restrict df_full to only those rows identified by the ids in diff_ids:
df_diff_ids = df_full[df_full['id'].isin(diff_ids)]
Now we move the duplicate rows from the "old" and "new" dataframes into separate dataframes that we can plug into a Panel data structure for comparison:
df_diff_old = df_diff_ids[df_diff_ids['VER'] == 'OLD']
df_diff_new = df_diff_ids[df_diff_ids['VER'] == 'NEW']
Next we set the index for both of these dataframes to the primary key (i.e., id). This is necessary for Panel to work effectively:
df_diff_old.set_index('id', inplace=True)
df_diff_new.set_index('id', inplace=True)
We slot both of these dataframes into a Panel data structure:
df_panel = pd.Panel(dict(df1=df_diff_old, df2=df_diff_new))
Finally we make our comparison using a custom function (find_diff) and the apply method:
def find_diff(x):
    return x[0] if x[0] == x[1] else '{} -> {}'.format(*x)

df_diff = df_panel.apply(find_diff, axis=0)
If you print out the contents of df_diff you can easily see which values changed between the "old" and "new" dataframes:
col_num col_str col_datetime
id
1 123 My string 1 -> My string 1 MODIFIED 2001-12-04 00:00:00
3 345 -> 789 My string 3 2001-12-06 00:00:00 -> 2001-12-10 00:00:00
Improvements
There are a few improvements I'll leave to you to make to this implementation.
- Add a binary (1/0) flag that indicates whether one or more values in a row changed.
- Identify which rows in the "old" dataframe were removed (i.e., are not present in the "new" dataframe).
- Identify which rows in the "new" dataframe were added (i.e., are not present in the "old" dataframe).
Original answer
Issue:
The issue is that you cannot perform arithmetic operations on datetimes.
However, you can perform arithmetic operations on timedeltas.
I can think of a few solutions that might help you:
Solution 1:
Convert your datetimes to strings.
If I'm understanding your problem correctly, you're comparing Excel worksheets for differences, correct? If this is the case, then I don't think it matters if the datetimes are represented as explicit datetimes (i.e., you're not performing any datetime calculations).
To implement this solution you would modify your pd.read_excel() calls and explicitly set the dtype parameter to convert your datetime columns to strings:
df1 = pd.read_excel('4_Input EfE_2030.xlsm', dtype={'LABEL FOR DATETIME COL 1': str})
Solution 2:
Convert your datetimes to timedeltas.
For each datetime column, you can subtract a reference date to get timedeltas, e.g.: df['LABEL FOR DATETIME COL'] - pd.Timestamp(0)
Overall, without seeing your data, I believe Solution 1 is the most straightforward.
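If you don't want to hardcode the datetime column labels, here is a small sketch of Solution 1 applied after reading, inside your existing loop over sheets (the dtype check used here is just one possible way to detect datetime columns):

df3 = pd.read_excel('4_Input EfE_2030.xlsm', sheet_name=x, header=None)
# cast any datetime columns to strings so the boolean comparison/multiplication works
for col in df3.columns:
    if pd.api.types.is_datetime64_any_dtype(df3[col]):
        df3[col] = df3[col].astype(str)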