pandas dataframe: Changing from single index to multi-column index - python

In python pandas I have a dataframe
df_aaa:
date data otherdata symbol
2015/1/1 11 12 aaa
2015/2/1 21 22 aaa
2015/3/1 31 31 aaa
df_all:
2015/1/1 31 31 bbb
Currently the index of both is by date.
I want to append df_aaa to df_all, and have them with a composite index of both symbol and date.
How do I do that?
Basically the following are all one question: how do I set a multi-index and use it when appending? Can I do it with a different column order? Do I need to refresh? Etc.:
I'm not sure if a multi-index is an index that has multiple 'columns' (or rows), or is it the ability to have more than one index (and any of them could be for multiple columns or rows). Or are both correct?
Must I first set the index of both dataframes to a multi-index, so the append will work? (Otherwise I'll have duplicates for different symbols.)
Do I have to "drop" the existing index before creating the new one?
Is there such a thing as a dataframe with data but no index?
Must a (single) index be of unique values?
When do I use which of the following dataframe methods: set_index(), reindex(), reset_index(), set_level, reset_level?
And what is the default when I give these methods an array? The Python docs are daunting, and I can't find my way around in them. Some good examples would help...
Do I have to add anything (like axis=1) when setting the index?
How do I set the index to be the data in a column? (And why does using ['symbol', 'date'] as a parameter sometimes give me a new column with those two values, instead of setting the index on the existing values of the columns with those two names?)
After I append, and assuming the old index is correct, do I need to 'update' the index (perhaps using reindex?), or, since I told the dataframe that the index is in a certain column, is my data already correctly indexed?
And since my dataframes (will) have indices on the same column names, can I do an append of df_aaa on df_all even if df_all was defined with its columns originally in a different order (say ['symbol', 'date', 'data', 'otherdata'], with symbol as the first column)?

You can just concatenate them and then set the index.
df_aaa = df_aaa.reset_index()
df_all = df_all.reset_index()
df = df_aaa.append(df_all).set_index(['symbol', 'date'])
Note that this will work only if your dataframes have the same columns.
If you must perform multiple appends in the future, the best thing to do would be to get one of them in the shape of the other, perform the concatenation, and reset index as needed.
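For example, here is a minimal sketch with made-up frames (not the poster's actual data) showing that append aligns on column names, so a different column order in the second dataframe is not a problem:
import pandas as pd

# Hypothetical stand-ins for df_aaa and df_all.
df_aaa = pd.DataFrame({'date': ['2015/1/1', '2015/2/1'],
                       'data': [11, 21],
                       'otherdata': [12, 22],
                       'symbol': ['aaa', 'aaa']})
# Same columns, deliberately listed in a different order.
df_all = pd.DataFrame({'symbol': ['bbb'],
                       'date': ['2015/1/1'],
                       'data': [31],
                       'otherdata': [31]})
# append matches columns by name, not by position; then set the composite index.
# (In pandas 2.0+ append was removed; pd.concat([df_aaa, df_all]) is the equivalent.)
df = df_aaa.append(df_all).set_index(['symbol', 'date'])
print(df)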
I'll answer all your questions one by one.
I'm not sure if a multi-index is an index that has multiple 'columns'
(or rows), or is it the ability to have more than one index (and any
of them could be for multiple columns or rows). Or are both correct?
It depends on which axis you're referring to. Along the rows (0th axis), you have two or more columns forming a MultiIndex. Similarly along the columns (1st axis).
Must I first set the index of both dataframes to a multi-index, so the
append will work? (otherwise I'll have duplicates for different
symbols
No need. Although you could, not doing so would be simpler in this case.
Do I have to "drop" the existing index before creating the new one?
No, just that the columns must align (column name and number of columns should be the same).
Is there such a thing as a dataframe with data but no index?
No. All rows are indexed. Even if no column is set as the index, there is a default index of monotonically increasing integers. The model here is similar to that of an RDBMS.
Must a (single) index be of unique values?
In general they should be, so rows can be uniquely identified (pandas does allow duplicate index values, but many operations assume uniqueness). If you have a MultiIndex, each combination of values that makes up the index should be unique.
When do I use which of the following dataframe methods: set_index(),
reindex(), reset_index(), set_level, reset_level?
This is a broad question. It depends on when you want to operate on the index and, if so, what you want to do with it. Look at the documentation for each one carefully.
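As a rough illustration (using a tiny made-up frame, not the one from the question) of what the most commonly needed ones do:
import pandas as pd

df = pd.DataFrame({'symbol': ['aaa', 'bbb'], 'data': [11, 31]})

# set_index: move an existing column (or list of columns) into the index.
indexed = df.set_index('symbol')

# reset_index: move the index back out into a regular column and restore
# the default integer index.
flat = indexed.reset_index()

# reindex: conform the frame to a new set of labels; labels that don't exist
# yet get rows of NaN. It does not recompute anything, it only realigns.
conformed = indexed.reindex(['aaa', 'bbb', 'ccc'])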

Just append the dfs and reset_index() to be able to set_index() with the keys argument. Here's a one-liner:
df_all = df_all.append(df_aaa).reset_index().set_index(keys=['symbol', 'date'])
And here is a full working sample.
In [1]: import pandas as pd
...: from io import StringIO
...:
In [2]: df_aaa = pd.read_csv(StringIO("""date data otherdata symbol
...: 2015/1/1 11 12 aaa
...: 2015/2/1 21 22 aaa
...: 2015/3/1 31 31 aaa
...: """), sep="\s+", index_col='date')
...:
In [3]: df_all = pd.read_csv(StringIO("""date data otherdata symbol
...: 2015/1/1 31 31 bbb"""), sep="\s+", index_col='date')
...:
In [4]: df_all.append(df_aaa).reset_index().set_index(keys=['symbol', 'date'])
Out[4]:
data otherdata
symbol date
bbb 2015/1/1 31 31
aaa 2015/1/1 11 12
2015/2/1 21 22
2015/3/1 31 31

Here is what I gather from the answers and from digging through the docs:
There is a "default index" which is a "row-number" for each row, and which is not part of any of the columns.
When merging with that index, there (seems to be) no need to re-index.
But if I want to change the index after it was made "non-standard", I have to reset_index() to turn it back to the default, and then from there I can create the new multi-index (as explained in the revised answer below).
A multi-index is one that has more than one key (i.e. if indexing the rows, then more than one column will be used).
I'm still not sure whether you have to re-index a column after a merge, but according to this it seems you get an automatically generated new "default index", so you have to save the old one, remove the index before the merge (reset_index), and set it again when done.
The other question about the index replacing a column - I'll check and get back here.
This is a follow-up.

Related

Pandas: How to Squash Multiple Rows into One Row with More Columns

I'm looking for a way to convert 5 rows in a pandas dataframe into one row with 5 times the amount of columns (so I have the same information, just squashed into one row). Let me explain:
I'm working with hockey game statistics. Currently, there are 5 rows representing the same game in different situations, each with 111 columns. I want to convert these 5 rows into one row (so that one game is represented by one row) but keep the information contained in the different situations. In other words, I want to convert 5 rows, each with 111 columns into one row with 554 columns (554=111*5 minus one since we're joining on gameId).
Here is my DF head:
So, as an example, we can see the first 5 rows have gameId = 2008020001, but each have a different situation (i.e. other, all, 5on5, 4on5, and 5on4). I'd like these 5 rows to be converted into one row with gameId = 2008020001, and with columns labelled according to their situation.
For example, I want columns for all unblockedShotAttemptsAgainst, 5on5 unblockedShotAttemptsAgainst, 5on4 unblockedShotAttemptsAgainst, 4on5 unblockedShotAttemptsAgainst, and other unblockedShotAttemptsAgainst (and the same for every other stat).
Any info would be greatly appreciated. It's also worth mentioning that my dataset is fairly large (177990 rows), so an efficient solution is desired. The resulting dataframe should have one-fifth the rows and 5 times the columns. Thanks in advance!
---- What I've Tried Already ----
I tried to do this using df.apply() and some nested for loops, but it got very ugly very quickly and was incredibly slow. I think pandas has a better way of doing this, but I'm not sure how.
Looking at other SO answers, I initially thought it might have something to do with df.pivot() or df.groupby(), but I couldn't figure it out. Thanks again!
It sounds like what you are looking for is pd.get_dummies()
cols = list(df.columns)
# get dummies for the 'situation' column
df1 = pd.get_dummies(df, columns=['situation'])
# drop all original columns from the dummy frame, keeping only the new dummy columns
# ('situation' itself no longer exists after get_dummies, hence errors='ignore')
df1 = df1.drop(columns=cols, errors='ignore')
# add the dummy cols to the original df
df = pd.concat([df, df1], axis=1)
# collapse duplicate rows
df = df.groupby(cols).first()
For the last line you can also use df.drop_duplicates() : https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html
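Since the question also mentions df.pivot(), here is a hedged alternative sketch (hypothetical mini-data and column names, not the real 111-column frame) that reshapes the situation rows into wide columns with set_index plus unstack rather than dummies:
import pandas as pd

# Hypothetical mini-version of the hockey data: a few situations per gameId.
df = pd.DataFrame({
    'gameId': [1, 1, 1, 2, 2],
    'situation': ['all', '5on5', '5on4', 'all', '5on5'],
    'shots': [30, 25, 3, 28, 22],
    'goals': [3, 2, 1, 4, 3],
})
# One row per gameId; each (stat, situation) pair becomes its own column.
wide = df.set_index(['gameId', 'situation']).unstack('situation')
# Flatten the resulting column MultiIndex into "situation stat"-style names.
wide.columns = ['{} {}'.format(situation, stat) for stat, situation in wide.columns]
print(wide)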

How to get related value between two dataframes

If I have two dataframes. Dataframe A have an a_id column, dataframe B have an b_id column and a b_value column. How can I join A and B on a_id = b_id and get C with id and max(b_value)?
You can use the concat function in pandas to append either columns or rows from one DataFrame to another.
Here's an example:
# Read in first 10 lines of surveys table
survey_sub = surveys_df.head(10)
# Grab the last 10 rows
survey_sub_last10 = surveys_df.tail(10)
# Reset the index values so the second dataframe appends properly
survey_sub_last10 = survey_sub_last10.reset_index(drop=True)
# drop=True option avoids adding new index column with old index values
When I concatenate DataFrames, I need to specify the axis. axis=0 tells pandas to stack the second DataFrame UNDER the first one. It will automatically detect whether the column names are the same and will stack accordingly. axis=1 will stack the columns in the second DataFrame to the RIGHT of the first DataFrame. To stack the data vertically, I need to make sure both datasets have the same columns and associated column formats. When I stack horizontally, I want to make sure what I am doing makes sense (i.e. the data are related in some way).
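A small sketch (with a made-up surveys_df, just to illustrate the two directions described above):
import pandas as pd

surveys_df = pd.DataFrame({'species': ['NL', 'DM', 'PE', 'DM'],
                           'weight': [32, 40, 22, 38]})

survey_sub = surveys_df.head(2)
survey_sub_last2 = surveys_df.tail(2).reset_index(drop=True)

# axis=0: stack the second frame UNDER the first (rows are appended).
vertical = pd.concat([survey_sub, survey_sub_last2], axis=0, ignore_index=True)

# axis=1: place the second frame to the RIGHT of the first (columns are appended);
# rows are matched on the index, which is why the index was reset above.
horizontal = pd.concat([survey_sub, survey_sub_last2], axis=1)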

How to append rows to a Pandas dataframe, and have it turn multiple overlapping cells (with the same index) into a single value, instead of a series?

I am appending different dataframes to make one set. Occasionally, some values have the same index, so it stores the value as a series. Is there a quick way within Pandas to just overwrite the value instead of storing all the values as a series?
You weren't very clear. If you want to resolve the duplicated-index problem, the pd.DataFrame.reset_index() method will probably be enough. If you have duplicate rows when you concat the DataFrames, just use the pd.DataFrame.drop_duplicates() method. Otherwise, share a bit of your code or be clearer.
I'm not sure the code below is what you're looking for.
Say we have two dataframes with one column, the same index, and different values, and you want to overwrite the values in one dataframe with those from the other. You can do it with a simple loop using the iloc/loc indexers.
import pandas as pd
df_1 = pd.DataFrame({'col_1': ['a', 'b', 'c', 'd']})
df_2 = pd.DataFrame({'col_1': ['q', 'w', 'e', 'r']})
rows = df_1.shape[0]
for idx in range(rows):
    # overwrite the value in df_1 with the value at the same position in df_2
    df_1.loc[idx, 'col_1'] = df_2.loc[idx, 'col_1']
Then check df_1; you should get this:
df_1
col_1
0 q
1 w
2 e
3 r
Whether or not this is what you want, let me know so I can help you further.
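And a hedged sketch of the reset_index / drop_duplicates idea mentioned above (a made-up two-frame example): after appending, keep only the last value seen for each index label, so an overlapping cell ends up holding a single value rather than a Series:
import pandas as pd

df_a = pd.DataFrame({'val': [1, 2]}, index=['x', 'y'])
df_b = pd.DataFrame({'val': [20]}, index=['y'])

combined = pd.concat([df_a, df_b])
# reset_index turns the (unnamed) index into a column called 'index';
# drop_duplicates keeps the last row per label, i.e. the value from df_b wins.
resolved = (combined.reset_index()
                    .drop_duplicates(subset='index', keep='last')
                    .set_index('index'))
print(resolved)  # 'y' now holds 20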

How do I dynamically update a column in pandas with the value of the column to its left?

I have a dataframe with a series of columns that contain boolean values, one column for each month of the year. (The screenshot of the df is not reproduced here.)
I'm trying to update the 2019.04_flag, 2019.05_flag, etc. columns with the last valid value. I know that I can use df['2019.04_flag'].fillna(df['2019.03_flag']), but I don't want to write 11 fillna lines. Is there a way to update the values dynamically? I've tried the fillna method with the ffill parameter, but it doesn't propagate across the row.
Edited
I would look into the pandas fillna method, documentation is here. It has different methods for filling NaN -- I think "ffill" would suit your needs. It fills the NaN with the last valid entry. Try the following:
df = df.fillna(method = "ffill", axis = 1)
Setting axis = 1 will perform the imputation across the columns, the axis I believe you want (a single row across columns).
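A tiny sketch of that line in action, using made-up flag columns (1.0 = flag set, NaN = missing):
import numpy as np
import pandas as pd

df = pd.DataFrame({'2019.01_flag': [1.0, 0.0],
                   '2019.02_flag': [np.nan, 1.0],
                   '2019.03_flag': [np.nan, np.nan]})

# Forward-fill each row from left to right, so every month inherits the last
# valid value to its left.
filled = df.ffill(axis=1)
# Equivalent, older spelling: df.fillna(method='ffill', axis=1)
print(filled)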

Pandas excel difference generator

I am trying to create a Python program that gives me the differences between 2 big Excel files with multiple sheets. I got it to print the results to an Excel file, but apparently when one of the cells contains datetime data, the operation of multiplying a boolean dataframe with the dataframe that contains dates doesn't work anymore. I get the following error:
TypeError: unsupported operand type(s) for *: 'bool' and 'datetime.datetime'
'EDIT': I just realized this method doesn't work for strings either (it only works for pure numerical data). What would be a better way to do this that works on strings, numbers and time data?
# start of program
import pandas as pd
from pandas import ExcelWriter
import numpy as np

df1 = pd.read_excel('4_Input EfE_2030.xlsm', None)
df2 = pd.read_excel('5_Input EfE_2030.xlsm', None)
keys1 = df1.keys()
keys2 = df2.keys()
writer = ExcelWriter('test1.xlsx')

# loop over all sheets and create new dataframes with the differences
for x in keys1:
    df3 = pd.read_excel('4_Input EfE_2030.xlsm', sheetname=x, header=None)
    df4 = pd.read_excel('5_Input EfE_2030.xlsm', sheetname=x, header=None)
    dif = df3 != df4
    df = dif * df3
    df2 = dif * df4
    nrcolumns = len(df.columns)
    # when there are no differences in the entire sheet the dataframe will be empty.
    # Add 1 to the row indexes so the numbers coincide with Excel row numbers
    if not df.empty:
        # df.columns = ['A']
        df.index = np.arange(1, len(df) + 1)
    if not df2.empty:
        # df2.columns = ['A']
        df2.index = np.arange(1, len(df) + 1)
    # delete rows that are all 0
    df = df.loc[~(df == 0).all(axis=1)]
    df2 = df2.loc[~(df2 == 0).all(axis=1)]
    # create a new df with the data of the 2 sheets
    result = pd.concat([df, df2], axis=1)
    print(result)
    result.to_excel(writer, sheet_name=x)
Updated answer
Approach
This is an interesting question. Another approach is to compare the column values in one Excel worksheet against the column values in another Excel worksheet by using the Panel data structure offered by pandas. This data structure stores data as a 3-dimensional array. With the data from two Excel worksheets stored in a Panel, we can then compare rows across worksheets that are uniquely identified by one or a combination of columns (e.g., a unique ID). We make this comparison by applying a custom function that compares the value in each cell of each column in one worksheet to the value in the same cell of the same column in the second worksheet. One benefit of this approach is that the datatype of each value no longer matters, since we're just comparing values (e.g., 1 == 1, 'my name' == 'my name', etc.).
Assumptions
This approach makes several assumptions about your data:
The rows in each of the worksheets share one or a combination of columns that uniquely identify each row.
The columns of interest for comparison exist in both worksheets and share the same column headers.
(There may be other assumptions I'm failing to notice.)
Implementation
The implementation of this approach is a bit involved. Also, because I do not have access to your data, I cannot customize the implementation specifically to your data. With that said, I'll implement this approach using some dummy data shown below.
"Old" dataset:
id col_num col_str col_datetime
1 123 My string 1 2001-12-04
2 234 My string 2 2001-12-05
3 345 My string 3 2001-12-06
"New" dataset:
id col_num col_str col_datetime
1 123 My string 1 MODIFIED 2001-12-04
3 789 My string 3 2001-12-10
4 456 My string 4 2001-12-07
Notice the following differences about these two dataframes:
col_str in the row with id 1 is different
col_num in the row with id 3 is different
col_datetime in the row with id 3 is different
The row with id 2 exists in "old" but not "new"
The row with id 4 exists in "new" but not "old"
Okay, let's get started. First, we read in the datasets into separate dataframes:
df_old = pd.read_excel('old.xlsx', 'Sheet1', na_values=['NA'])
df_new = pd.read_excel('new.xlsx', 'Sheet1', na_values=['NA'])
Then we add a new version column to each dataframe to keep our thinking straight. We'll also use this column later to separate out rows from the "old" and "new" dataframes into their own separate dataframes:
df_old['VER'] = 'OLD'
df_new['VER'] = 'NEW'
Then we concatenate the "old" and "new" datasets into a single dataframe. Notice that the ignore_index parameter is set to True so that we ignore the index as it is not meaningful for this operation:
df_full = pd.concat([df_old, df_new], ignore_index=True)
Now we're going to identify all of the duplicate rows that exist across the two dataframes. These are rows where all of the column values are the same across the "old" and "new" dataframes. In other words, these are rows where no differences exist:
Once identified, we drop these duplicate rows. What we're left with are the rows that (a) are different between the two dataframes, (b) exist in the "old" dataframe but not the "new" dataframe, and (c) exist in the "new" dataframe but not the "old" dataframe:
df_diff = df_full.drop_duplicates(subset=['id', 'col_num', 'col_str', 'col_datetime'])
Next we identify and extract the values for id (i.e., the primary key across the "old" and "new" dataframes) for the rows that exist in both the "old" and "new" dataframes. It's important to note that these ids do not include rows that exist in one or the other dataframes but not both (i.e., rows removed or rows added):
diff_ids = df_diff.set_index('id').index.get_duplicates()
Now we restrict df_full to only those rows identified by the ids in diff_ids:
df_diff_ids = df_full[df_full['id'].isin(diff_ids)]
Now we move the duplicate rows from the "old" and "new" dataframes into separate dataframes that we can plug into a Panel data structure for comparison:
df_diff_old = df_diff_ids[df_diff_ids['VER'] == 'OLD']
df_diff_new = df_diff_ids[df_diff_ids['VER'] == 'NEW']
Next we set the index for both of these dataframes to the primary key (i.e., id). This is necessary for Panel to work effectively:
df_diff_old.set_index('id', inplace=True)
df_diff_new.set_index('id', inplace=True)
We slot both of these dataframes into a Panel data structure:
df_panel = pd.Panel(dict(df1=df_diff_old, df2=df_diff_new))
Finally we make our comparison using a custom function (find_diff) and the apply method:
def find_diff(x):
    # keep the value when old and new agree; otherwise show "old -> new"
    return x[0] if x[0] == x[1] else '{} -> {}'.format(*x)

df_diff = df_panel.apply(find_diff, axis=0)
If you print out the contents of df_diff you can easily note which values changed between the "old" and "new" dataframes:
col_num col_str col_datetime
id
1 123 My string 1 -> My string 1 MODIFIED 2001-12-04 00:00:00
3 345 -> 789 My string 3 2001-12-06 00:00:00 -> 2001-12-10 00:00:00
Improvements
There are a few improvements I'll leave to you to make to this implementation.
- Add a binary (1/0) flag that indicates whether one or more values in a row changed
- Identify which rows in the "old" dataframe were removed (i.e., are not present in the "new" dataframe)
- Identify which rows in the "new" dataframe were added (i.e., not present in the "old" dataframe)
Original answer
Issue:
The issue is that you cannot perform arithmetic operations on datetimes.
However, you can perform arithmetic operations on timedeltas.
I can think of a few solutions that might help you:
Solution 1:
Convert your datetimes to strings.
If I'm understanding your problem correctly, you're comparing Excel worksheets for differences, correct? If this is the case, then I don't think it matters if the datetimes are represented as explicit datetimes (i.e., you're not performing any datetime calculations).
To implement this solution you would modify your pd.read_excel() calls and explicitly set the dtype parameter to read your datetimes in as strings:
df1 = pd.read_excel('4_Input EfE_2030.xlsm', dtype={'LABEL FOR DATETIME COL 1': str})
Solution 2:
Convert your datetimes to timedeltas.
For each datetime column, you can convert to a timedelta by subtracting a reference timestamp, for example: df['LABEL FOR DATETIME COL'] - pd.Timestamp('1970-01-01')
Overall, without seeing your data, I believe Solution 1 is the most straightforward.
