How to avoid _x _y columns using pandas - python

I've checked other questions here but I don't think they've answered my issue (though it is quite possible I don't understand the solution).
I have daily data CSV files and have created a year-long pandas dataframe with a datetime index. I'm trying to merge all of these CSVs onto the main DataFrame and populate the columns, but I end up with hundreds of columns with the _x _y appendix as they all have the same column names.
I want to populate all these columns in-place, I know there must be a logical way of doing so but I can't seem to find it.
Edit to add info:
The original dataframe has several columns, of which I use a subset.
Index SOC HiTemp LowTemp UploadTime Col_B Col_C Col_D Col_E
0 55 24 22 2019-01-01T00:02:00 z z z z
1
2
I create an empty dataframe with the datetimeindex I want then run a loop for all of the CSV files.
datindex = pd.DatetimeIndex(start="01/01/2019",periods = 525600, freq = 'T')
master_index = pd.DataFrame(index=datindex)
for fname in os.listdir('.'):
data = pd.read_csv(fname)
data["UploadTime"] = data["UploadTime"].str.replace('T','-').str[:-3]
data["UploadTime"] = pd.to_datetime(data["UploadTime"], format="%Y-%m-%d-
%H:%M")
data.drop_duplicates(subset="UploadTime", keep='first', inplace=True)
data.set_index("UploadTime", inplace=True)
selection = data[['Soc','EDischarge', 'EGridCharge',
'Echarge','Einput','Pbat','PrealL1','PrealL2','PrealL3']].copy(deep=True)
master_index = master_index.merge(selection, how= "left", left_index=True,right_index=True)
The initial merge creates the appropriate columns in master_index, but each subsequent merge creates a new set of columns: I want them to fill up the same columns, overwriting the NaN that the initial merge put there. In this way I should end up with as complete a dataset as possible (some days and timestamps are missing)

If you're talking about the headers as the 'appendix', you probably need to skip the first line before opening the CSVReader.
EDIT: This assumes all the columns in the csv's are sequenced the same, otherwise you'd have to map to a list after reading in the header

Related

Using pandas I need to create a new column that takes a value from a previous row

I have many rows of data and one of the columns is a flag. I have 3 identifiers that need to match between rows.
What I have:
partnumber, datetime1, previousdatetime1, datetime2, previousdatetime2, flag
What I need:
partnumber, datetime1, previousdatetime1, datetime2, previousdatetime2, flag, previous_flag
I need to find flag from the row where partnumber matches, and where the previousdatetime1(current row*) == datetime1(other row)*, and the previousdatetime2(current row) == datetime2(other row).
*To note, the rows are not necessarily in order so the previous row may come later in the dataframe
I'm not quite sure where to start. I got this logic working in PBI using a LookUpValue and basically finding where partnumber = Value(partnumber), datetime1 = Value(datetime1), datetime2 = Value(datetime2). Thanks for the help!
Okay, so assuming you've read this in as a pandas dataframe df1:
(1) Make a copy of the dataframe:
df2=df1.copy()
(2) For sanity, drop some columns in df2
df2.drop(['previousdatetime1','previousdatetime2'],axis=1,inplace=True)
Now you have a df2 that has columns:
['partnumber','datetime1','datetime2','flag']
(3) Merge the two dataframes
newdf=df1.merge(df2,how='left',left_on=['partnumber','previousdatetime1'],right_on=['partnumber','datetime1'],suffixes=('','_previous'))
Now you have a newdf that has columns:
['partnumber','datetime1','previousdatetime1','datetime2','previousdatetime2','flag','partnumber_previous','datetime1_previous','datetime2_previous','flag_previous']
(4) Drop the unnecessary columns
newdf.drop(['partnumber_previous', 'datetime1_previous', 'datetime2_previous'],axis=1,inplace=True)
Now you have a newdf that has columns:
['partnumber','datetime1','previousdatetime1','datetime2','previousdatetime2','flag','flag_previous']

Combining two dataframes with different rows, keeping contents of first dataframe on all rows

Good day All,
I have two data frames that needs to be merged which is a little different to the ones I found so far and could not get it working. What I am currently getting, which I am sure is to do with the index, as dataframe 1 only has 1 record. I need to copy the contents of dataframe one into new columns of dataframe 2 for all rows.
Current problem highlighted in red
I have tried merge, append, reset index etc...
DF 1:
Dataframe 1
DF 2:
Dataframe 2
Output Requirement:
Required Output
Any suggestions would be highly appreciated
Update:
I got it to work using the below statements, is there a more dynamic way than specifying the column names?
mod_df['Type'] = mod_df['Type'].fillna(method="ffill")
mod_df['Date'] = mod_df['Date'].fillna(method="ffill")
mod_df['Version'] = mod_df['Version'].fillna(method="ffill")
Assuming you have a single row in df1, use a cross merge:
out = df2.merge(df1, how='cross')

Pandas data frame not allowing me to drop first empty column in python?

I have read in some data from a csv, and there were a load of spare columns and rows that were not needed. I've managed to get rid of most of them, but the first column is showing as an NaN and will not drop despite several attempts. This means I cannot promote the titles in row 0 to headers. I have tried the below:
df = pd.read_csv(("List of schools.csv"))
df = df.iloc[3:]
df.dropna(how='all', axis=1, inplace =True)
df.head()
But I am still getting this returned:
Any help please? I'm a newbie
You can improve your read_csv() operation.
Avloss can tell your "columns" are indices because they are bold. Looking at your output, there are two things of note.
The "columns" are bold implying that pandas read them in as part of the index of the DataFrame rather than as values
There is no information above the horizontal line at the top indicating there are currently no column names. The top row of the csv file that contains the column names is being read in as values.
To solve your column deletion problem, you should first improve your read_csv() operation by being more explicit. Your current code is placing column headers in the data and placing some of the data in the indicies. Since you have the operation df = df.iloc[3:] in your code, I'm assuming the data in your csv file doesn't start until the 4th row. Try this:
header_row = 3 #or 4 - I am bad at zero-indexing
df = pd.read_csv('List of schools.csv', header=header_row, index_col=False)
df.dropna(how='all', axis=1, inplace =True)
This code should read the column names in as column names and not index any of the columns, giving you a cleaner DataFrame to work from when dropping NA values.
those aren't columns, those are indices. You can convert them to columns by doing
df = df.reset_index()

inner join with huge dataframes (~2 million columns)

I am trying to join two data frames (df1 and df2) based on matching values from one column (called 'Names') that is found in each data frame. I have tried this using R's inner_join function as well as Python's pandas merge function, and have been able to get both to work successfully on smaller subsets of my data. I think my problem is with the size of my data frames.
My data frames are as follows:
df1 has the 'Names' column with 5 additional columns and has ~900 rows.
df2 has the 'Names' column with ~2million additional columns and has ~900 rows.
I have tried (in R):
df3 <- inner_join(x = df1, y = df2, by = 'Name')
I have also tried (in Python where df1 and df2 are Pandas data frames):
df3 = df1.merge(right = df2, how = 'inner', left_on = 1, right_on = 0)
(where the 'Name' column is at index 1 of df1 and at index 0 of df2)
When I apply the above to my full data frames, it runs for a very long time and eventually crashes. Additionally, I suspect that the problem may be with the 2 million columns of my df2, so I tried sub-setting it (row-wise) into smaller data frames. My plan was to join the small subsets of df2 with df1 and then row bind the new data frames together at the end. However, joining even the smaller partitioned df2s was unsuccessful.
I would appreciate any suggestions anyone would be able to provide.
Thank you everyone for your help! Using data.table as #shadowtalker suggested, sped up the process tremendously. Just for reference in case anyone is trying to do something similar, df1 was approximately 400 mb and my df2 file was approximately 3gb.
I was able to accomplish the task as follows:
library(data.table)
df1 <- setDT(df1)
df2 <- setDT(df2)
setkey(df1, Name)
setkey(df2, Name)
df3 <- df1[df2, nomatch = 0]
This is a really ugly workaround where I break up df2's columns and add them piece by piece. Not sure it will work, but it might be worth a try:
# First, I only grab the "Name" column from df2
df3 = df1.merge(right=df2[["Name"]], how="inner", on="Name")
# Then I save all the column headers (excluding
# the "Name" column) in a separate list
df2_columns = df2.columns[np.logical_not(df2.columns.isin(["Name"]))]
# This determines how many columns are going to get added each time.
num_cols_per_loop = 1000
# And this just calculates how many times you'll need to go through the loop
# given the number of columns you set to get added each loop
num_loops = int(len(df2_columns)/num_cols_per_loop) + 1
for i in range(num_loops):
# For each run of the loop, we determine which rows will get added
this_column_sublist = df2_columns[i*num_cols_per_loop : (i+1)*num_cols_per_loop]
# You also need to add the "Name" column to make sure
# you get the observations in the right order
this_column_sublist = np.append("Name",this_column_sublist)
# Finally, merge with just the subset of df2
df3 = df3.merge(right=df2[this_column_sublist], how="inner", on="Name")
Like I said, it's an ugly workaround, but it just might work.

Pandas excel difference generator

I am trying to create a python program that gives me the differences between 2 big excel files with multiple sheets. I got it to print the results to an excel, but apparently when one of the cells contains datetime data then the operation of multiplying a boolean dataframe with the dataframe that contains dates doesn't work anymore. I get the following error :
TypeError: unsupported operand type(s) for *: 'bool' and 'datetime.datetime'
'EDIT' : I just realized this method doesn't work for strings neither (It only works for pure numerical data). What would be a better way to do this, that works on strings, numbers and time data?
#start of program
import pandas as pd
from pandas import ExcelWriter
import numpy as np
df1 = pd.read_excel('4_Input EfE_2030.xlsm',None)
df2 = pd.read_excel('5_Input EfE_2030.xlsm',None)
keys1=df1.keys()
keys2=df2.keys()
writer = ExcelWriter('test1.xlsx')
#loop for all sheets and create new dataframes with the differences
for x in keys1:
df3 = pd.read_excel('4_Input EfE_2030.xlsm',sheetname=x,header=None)
df4 = pd.read_excel('5_Input EfE_2030.xlsm',sheetname=x,header=None)
dif = df3 != df4
df=dif*df3
df2=dif*df4
nrcolumns=len(df.columns)
#when there are no differences in the entire sheet the dataframe will be empty. Add 1 to row indexes so the number coincides with excel rownumbers
if not df.empty:
# df.columns = ['A']
df.index = np.arange(1, len(df) + 1)
if not df2.empty:
# df2.columns = ['A']
df2.index = np.arange(1, len(df) + 1)
#delete rows with all 0
df = df.loc[~(df == 0).all(axis=1)]
df2 = df2.loc[~(df2 == 0).all(axis=1)]
#create new df with the data of the 2 sheets
result = pd.concat([df,df2],axis=1)
print(result)
result.to_excel(writer,sheet_name=x)
Updated answer
Approach
This is an interesting question. Another approach is to compare the column values in one Excel worksheet against the column values in another Excel worksheet by using the Panel data structure offered by Pandas. This data structure stores data as a 3-dimensional array. With the data from two Excel worksheets stored in a Panel we can then compare rows across worksheets that are uniquely identified by one or a combination of columns (e.g, a unique ID). to make this comparison by applying a custom function that compares the value in each cell of each column in one worksheet to the value in the same cell of the same column in the second worksheet. One benefit of this approach is that the datatype of each value no longer matters since we're just comparing values (e.g., 1 == 1, 'my name' == 'my name', etc.).
Assumptions
This approach makes several assumptions about your data:
The rows in each of the worksheets share one or a combination of columns that uniquely identify each row.
The columns of interest for comparison exist in both worksheets and share the same column headers.
(There may be other assumptions I'm failing to notice.)
Implementation
The implementation of this approach is a bit involved. Also, because I do not have access to your data, I cannot customize the implementation specifically to your data. With that said, I'll implement this approach using some dummy data shown below.
"Old" dataset:
id col_num col_str col_datetime
1 123 My string 1 2001-12-04
2 234 My string 2 2001-12-05
3 345 My string 3 2001-12-06
"New" dataset:
id col_num col_str col_datetime
1 123 My string 1 MODIFIED 2001-12-04
3 789 My string 3 2001-12-10
4 456 My string 4 2001-12-07
Notice the following differences about these two dataframes:
col_str in the row with id 1 is different
col_num in the row with id 3 is different
col_datetime in the row with id 3 is different
The row with id 2 exists in "old" but not "new"
The row with id 4 exists in "new" but not "old"
Okay, let's get started. First, we read in the datasets into separate dataframes:
df_old = pd.read_excel('old.xlsx', 'Sheet1', na_values=['NA'])
df_new = pd.read_excel('new.xlsx', 'Sheet1', na_values=['NA'])
Then we add a new version column to each dataframe to keep our thinking straight. We'll also use this column later to separate out rows from the "old" and "new" dataframes into their own separate dataframes:
df_old['VER'] = 'OLD'
df_new['VER'] = 'NEW'
Then we concatenate the "old" and "new" datasets into a single dataframe. Notice that the ignore_index parameter is set to True so that we ignore the index as it is not meaningful for this operation:
df_full = pd.concat([df_old, df_new], ignore_index=True)
Now we're going to identify all of the duplicate rows that exist across the two dataframes. These are rows where all of the column values are the same across the "old" and "new" dataframes. In other words, these are rows where no differences exist:
Once identified, we drop these duplicates rows. What we're left with are the rows that (a) are different between the two dataframes, (b) exist in the "old" dataframe but not the "new" dataframe, and (c) exist in the "new" dataframe but not the "old" dataframe:
df_diff = df_full.drop_duplicates(subset=['id', 'col_num', 'col_str', 'col_datetime'])
Next we identify and extract the values for id (i.e., the primary key across the "old" and "new" dataframes) for the rows that exist in both the "old" and "new" dataframes. It's important to note that these ids do not include rows that exist in one or the other dataframes but not both (i.e., rows removed or rows added):
diff_ids = df_diff.set_index('id').index.get_duplicates()
Now we restrict df_full to only those rows identified by the ids in diff_ids:
df_diff_ids = df_full[df_full['id'].isin(diff_ids)]
Now we move the duplicate rows from the "old" and "new" dataframes into separate dataframes that we can plug into a Panel data structure for comparison:
df_diff_old = df_diff_ids[df_diff_ids['VER'] == 'OLD']
df_diff_new = df_diff_ids[df_diff_ids['VER'] == 'NEW']
Next we set the index for both of these dataframes to the primary key (i.e., id). This is necessary for Panel to work effectively:
df_diff_old.set_index('id', inplace=True)
df_diff_new.set_index('id', inplace=True)
We slot both of these dataframes into a Panel data structure:
df_panel = pd.Panel(dict(df1=df_diff_old, df2=df_diff_new))
Finally we make our comparison using a custom function (find_diff) and the apply method:
def find_diff(x):
return x[0] if x[0] == x[1] else '{} -> {}'.format(*x)
df_diff = df_panel.apply(find_diff, axis=0)
If you print out the contents of df_diff you can easily note the which values changed between the "old" and "new" dataframes:
col_num col_str col_datetime
id
1 123 My string 1 -> My string 1 MODIFIED 2001-12-04 00:00:00
3 345 -> 789 My string 3 2001-12-06 00:00:00 -> 2001-12-10 00:00:00
Improvements
There are a few improvements I'll leave to you to make to this implementation.
Add a binary (1/0) flag that indicates if a one or more values in a
row changed
Identify which rows in the "old" dataframe were removed
(i.e., are not present in the "new" dataframe)
Identify which rows in the
"new" dataframe were added (i.e., not present in the "old" dataframe)
Original answer
Issue:
The issue is that you cannot perform arithmetic operations on datetimes.
However, you can perform arithmetic operations on timedeltas.
I can think of a few solutions that might help you:
Solution 1:
Convert your datetimes to strings.
If I'm understanding your problem correctly, you're comparing Excel worksheets for differences, correct? If this is the case, then I don't think it matters if the datetimes are represented as explicit datetimes (i.e., you're not performing any datetime calculations).
To implement this solution you would modify your pd.read_excel()' calls and explicitly set thedtypesparameter to convert yourdatetimes` to strings:
df1 = pd.read_excel('4_Input EfE_2030.xlsm', dtypes={'LABEL FOR DATETIME COL 1': str})
Solution 2:
Convert your datetimes to timedeltas.
For each datetime column, you can use: pd.Timedelta(df['LABEL FOR DATETIME COL'])
Overall, without seeing your data, I believe Solution 1 is the most straightforward.

Categories

Resources