Pandas recalculate index after a concatenation - python

I have a problem where I produce a pandas dataframe by concatenating along the row axis (stacking vertically).
Each of the constituent dataframes has an autogenerated index (ascending numbers).
After concatenation, my index is screwed up: it runs from 0 to n - 1 (where n is the shape[0] of the corresponding constituent dataframe) and then restarts at zero for the next dataframe.
I am trying to "re-calculate the index, given the current order", or "re-index" (or so I thought). Turns out that isn't exactly what DataFrame.reindex seems to be doing.
Here is what I tried to do:
train_df = pd.concat(train_class_df_list)
train_df = train_df.reindex(index=[i for i in range(train_df.shape[0])])
It failed with "cannot reindex from a duplicate axis." I don't want to change the order of my data... just need to delete the old index and set up a new one, with the order of rows preserved.

If your index is autogenerated and you don't want to keep it, you can use the ignore_index option.
train_df = pd.concat(train_class_df_list, ignore_index=True)
This will autogenerate a new index for you, and my guess is that this is exactly what you are after.
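For illustration, a minimal sketch with two throwaway frames (names invented, not the asker's data):
import pandas as pd

# Two frames that each carry their own auto-generated 0-based index
a = pd.DataFrame({'x': [10, 20]})
b = pd.DataFrame({'x': [30, 40]})

# ignore_index=True discards both original indexes and builds a fresh 0..n-1 index
combined = pd.concat([a, b], ignore_index=True)
print(combined.index.tolist())  # [0, 1, 2, 3]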

After vertical concatenation, if you get an index of [0, n) followed by [0, m), all you need to do is call reset_index:
train_df.reset_index(drop=True)
(you can do this in place using inplace=True).
>>> import pandas as pd
>>> pd.concat([
...     pd.DataFrame({'a': [1, 2]}),
...     pd.DataFrame({'a': [1, 2]})]).reset_index(drop=True)
   a
0  1
1  2
2  1
3  2

This should work:
train_df.reset_index(inplace=True, drop=True)
Set drop to True to avoid an additional column in your dataframe.
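As a quick illustration of what drop changes (a toy frame, not the asker's data):
import pandas as pd

df = pd.concat([pd.DataFrame({'a': [1, 2]}), pd.DataFrame({'a': [3, 4]})])
print(df.reset_index())            # keeps the old index as an extra 'index' column
print(df.reset_index(drop=True))   # discards the old index entirely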

Related

Take rows of ith duplicated index and put in i number of dataframes

This is a bit tricky to put into words, but I'll give it a try. I have a dataframe with duplicated indices as provided below.
a = [0.00000, 0.071928, 1.294, 2.592563, 0.000318, 2.575291, 0.439986, 2.232147, 6.091523, 2.075441, 0.96152]
b = [0.00000, 0.399791, 1.302446, 1.388957, 1.276451, 1.527568, 1.614107, 2.686325, 4.167600, 6.135689, 5.945807]
df = pd.DataFrame({'a' : a, 'b' : b})
df.index = [1,1,1,1,1,2,2,3,3,3,4]
I want the row of the first occurrence of each duplicated index to be appended to df1, the row of the second occurrence to be appended to df2, and so on: the first time indices 1, 2, 3, 4, ..., n have a duplicate, those rows get appended to dataframe 1; the second time indices 1, 2, 3, 4, ..., n have a duplicate, those rows get appended to dataframe 2, and so on. Ideally, it would look something like this if concatenated for the first three duplicates under the 'index' column:
Any idea how to go about this? I've tried to run df[df.duplicated(subset=['index'])] in a for loop to whittle the df down to the very first duplicates, but it doesn't seem to work the way I think it will.
Slicing out the duplicate indices via cumcount and using concat to stitch together the resulting sub-dataframes will do the job.
cols = df.columns
df['id'] = df.index
occurrence = df.groupby('id').cumcount()
pd.concat([df[occurrence == i][cols] for i in range(occurrence.max() + 1)], axis=1)
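A runnable sketch of the same idea on the question's data, building a list of sub-dataframes instead of hand-naming df1, df2, ... (the list is my own convenience, not part of the question):
import pandas as pd

a = [0.00000, 0.071928, 1.294, 2.592563, 0.000318, 2.575291,
     0.439986, 2.232147, 6.091523, 2.075441, 0.96152]
b = [0.00000, 0.399791, 1.302446, 1.388957, 1.276451, 1.527568,
     1.614107, 2.686325, 4.167600, 6.135689, 5.945807]
df = pd.DataFrame({'a': a, 'b': b})
df.index = [1, 1, 1, 1, 1, 2, 2, 3, 3, 3, 4]

# cumcount over the index numbers each occurrence 0, 1, 2, ... per index value
occurrence = df.groupby(level=0).cumcount()

# sub_dfs[0] holds the first occurrence of every index, sub_dfs[1] the second, ...
sub_dfs = [df[occurrence == i] for i in range(occurrence.max() + 1)]
for i, sub in enumerate(sub_dfs, start=1):
    print('df{}:'.format(i))
    print(sub)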

Replace values in a slice of columns in a pandas dataframe with a value based on a condition

I have a large Pandas dataframe, and want to replace some values in a subset of the columns based on a condition.
Specifically, I want to replace the values that are greater than one with 1 in every column to the right of the 9th column.
Because the dataframe is so large and growing in both the number of rows and columns over time, I cannot manually specify the names of the columns to change values in. Rather, I just need to specify that column 10 and greater should be inspected for values > 1.
After looking at many different Stack Overflow posts and Pandas documentation, I tried:
df.iloc[df[:,10: ] > 1] = 1
However, this gives me the error “unhashable type: ‘slice’”.
I then tried:
df[df.iloc[:, 10:] > 1] = 1
and
df[df.loc[:, df.columns[10:]] > 1] = 1
as per 2 suggestions in the comments, but both of those give me the error “Cannot do inplace boolean setting on mixed-types with a non np.nan value”.
Does anyone know why I’m getting these errors and/or what I should change about my code to avoid them?
Thank you!
1. DataFrame.where
We can use iloc to select all the columns to the right of the 9th column, then use where to keep the values for which the condition x.le(1) holds and replace the rest with 1.
df.iloc[:, 10:] = df.iloc[:, 10:].where(lambda x: x.le(1), 1)
2. DataFrame.clip
Alternatively, we can use clip with an upper limit of 1, which assigns 1 to every value in the slice of the dataframe that is greater than 1.
df.iloc[:, 10:] = df.iloc[:, 10:].clip(upper=1)
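A small self-contained sketch of both approaches (column names and data are invented for illustration):
import pandas as pd
import numpy as np

# Toy frame with 12 columns so that "column 10 and greater" exists (positions 10 and 11)
df = pd.DataFrame(np.arange(36).reshape(3, 12),
                  columns=['c{}'.format(i) for i in range(12)])

# 1. where: keep values that satisfy x.le(1), replace everything else with 1
via_where = df.copy()
via_where.iloc[:, 10:] = via_where.iloc[:, 10:].where(lambda x: x.le(1), 1)

# 2. clip: cap the slice at an upper limit of 1
via_clip = df.copy()
via_clip.iloc[:, 10:] = via_clip.iloc[:, 10:].clip(upper=1)

print(via_where.iloc[:, 9:])  # column c9 untouched, c10 and c11 capped at 1
print(via_clip.iloc[:, 9:])   # same result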

Pandas: best way to add a total row that calculates the sum of specific (multiple) columns while preserving the data type

I am trying to create a row at the bottom of a dataframe to show the sum of certain columns. I am under the impression that this should be a really simple operation, but to my surprise, none of the methods I found on SO works for me in one step.
The methods that I've found on SO:
df.loc['TOTAL'] = df.sum()
This doesn't work for me as long as there are non-numeric columns in the dataframe. I need to select the columns first and then concat the non-numeric columns back
df.append(df.sum(numeric_only=True), ignore_index=True)
This won't preserve my data types. Integer column will be converted to float.
df3.loc['Total', 'ColumnA']= df['ColumnA'].sum()
I can only use this to sum one column.
I must have missed something in the process as this is not that hard an operation. Please let me know how I can add a sum row while preserving the data type of the dataframe.
Thanks.
Edit:
First off, sorry for the late update. I was on the road for the last weekend
Example:
df1 = pd.DataFrame(data={'CountyID': [77, 95], 'Acronym': ['LC', 'NC'],
                         'Developable': [44490, 56261], 'Protected': [40355, 35943],
                         'Developed': [66806, 72211]},
                   index=['Lehigh', 'Northampton'])
What I want to get would be
Please ignore the differences of the index.
It's a little tricky for me because I don't need to get the sum of the column 'CountyID', since it's used for specific indexing. So the question is more about getting the sum of specific numeric columns.
Thanks again.
Here is some toy data to use as an example:
df = pd.DataFrame({'A':[1.0,2.0,3.0],'B':[1,2,3],'C':['A','B','C']})
So that we can preserve the dtypes after the sum, we will store them as d
d = df.dtypes
Next, since we only want to sum numeric columns, pass numeric_only=True to sum(), but follow similar logic to your first attempt
df.loc['Total'] = df.sum(numeric_only=True)
And finally, reset the dtypes of your DataFrame to their original values.
df.astype(d)
         A  B    C
0      1.0  1    A
1      2.0  2    B
2      3.0  3    C
Total  6.0  6  NaN
To select the numeric columns, you can do
df_numeric = df.select_dtypes(include = ['int64', 'float64'])
df_num_cols = df_numeric.columns
Then do what you did first (using what I found here)
df.loc['Total'] = pd.Series(df[df_num_cols].sum(), index=df_num_cols)
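Putting the pieces together on the question's own df1, here is a sketch that sums only the columns the asker wants totalled (everything except CountyID and the text column Acronym) and then restores their integer dtype; the list of columns to sum is my reading of the question, not something stated as code:
import pandas as pd

df1 = pd.DataFrame(data={'CountyID': [77, 95], 'Acronym': ['LC', 'NC'],
                         'Developable': [44490, 56261], 'Protected': [40355, 35943],
                         'Developed': [66806, 72211]},
                   index=['Lehigh', 'Northampton'])

sum_cols = ['Developable', 'Protected', 'Developed']  # columns we actually want totalled
dtypes = df1.dtypes                                   # remember the original dtypes

df1.loc['Total'] = df1[sum_cols].sum()                # unsummed columns get NaN in the new row
df1 = df1.astype({c: dtypes[c] for c in sum_cols})    # summed columns back to int64
print(df1)   # CountyID stays float because its 'Total' cell is NaN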

Python: for cycle in range over files

I am trying to create a list that takes values from different files.
I have three dataframes called for example "df1","df2","df3"
each file contains two columns of data, so for example "df1" looks like this:
0, 1
1, 4
7, 7
I want to create a list that takes the value from the first row of the second column in each file, so it should look like this:
F=[1,value from df2,value from df3]
my try
import pandas as pd
df1 = pd.read_csv(file1)
df2 = pd.read_csv(file2)
df3 = pd.read_csv(file3)
F=[]
for i in range(3):
F.append(df{"i"}[1][0])
That is probably not how to iterate over them, but I cannot figure out the correct way.
You can use iloc and a list comprehension:
vals = [df.iloc[0, 1] for df in [df1, df2, df3]]
iloc gets the value from the first row (index 0) and the second column (index 1). If you wanted, say, the value from the third row and fourth column, you'd do .iloc[2, 3], and so forth.
As suggested by @jpp, you may use iat instead:
vals = [df.iat[0, 1] for df in [df1, df2, df3]]
For the difference between them, check this and this question.
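A minimal runnable sketch (inline frames stand in for the three CSV files):
import pandas as pd

# Stand-ins for the three files; in practice these come from pd.read_csv(...)
df1 = pd.DataFrame([[0, 1], [1, 4], [7, 7]])
df2 = pd.DataFrame([[2, 5], [3, 6]])
df3 = pd.DataFrame([[4, 9], [5, 8]])

# First row, second column of each frame
F = [df.iat[0, 1] for df in (df1, df2, df3)]
print(F)  # [1, 5, 9]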

pandas appending df1 to df2 get 0s/NaNs in result

I have 2 dataframes. df1 is built from several Series of values.
df1 = pd.DataFrame({'winnings': cumsums_winnings_s, 'returns': cumsums_returns_s,
                    'spent': cumsums_spent_s, 'runs': cumsums_runs_s,
                    'wins': cumsums_wins_s, 'expected': cumsums_expected_s},
                   columns=["winnings", "returns", "runs", "wins", "expected"])
df2 runs each row through a function which takes 3 columns and produces a result for each row - specialSauce:
df2 = pd.DataFrame(list(map(lambda w, r, e: doStuff(w, r, e),
                            df1['wins'], df1['runs'], df1['expected'])),
                   columns=["specialSauce"])
print(df2.append(df1))
produces all the df1 columns, but with NaN values for the rows that came from df2 (and vice versa if df1/df2 are switched in the append).
So the problem I have is how to append these 2 dataframes correctly.
As I understand things, your issue seems to be related to the fact that you get NaN's in the result DataFrame.
The reason for this is that you are trying to .append() one dataframe to the other while they don't have the same columns.
df2 has one extra column, the one created with map() and doStuff, while df1 does not have that column. When appending one pd.DataFrame to the other, the result will have all the columns from both pd.DataFrame objects. Naturally, you will have some NaN's in ['specialSauce'], since this column does not exist in df1.
This would be the same if you were to use pd.concat(), both methods do the same thing in this case. The one thing that you could do to bring the result closer to your desired result is use the ignore_index flag like this:
>>> df2.append(df1, ignore_index=True)
This would at least give you a 'fresh' index for the result pd.DataFrame.
EDIT
If what you're looking for is to "append" the result of doStuff to the end of your existing df, in the form of a new column (['specialSauce']), then what you'll have to do is use pd.concat() like this:
>>> pd.concat([df1, df2], axis=1)
This will return the result pd.DataFrame as you want it.
If you had a pd.Series to add to the columns of df1 then you'd need to add it like this:
>>> df1['specialSauce'] = <'specialSauce values'>
I hope that helps, if not please rephrase the description of what you're after.
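For concreteness, a small sketch of the difference between stacking rows and adding a column (toy frames, not the asker's data; pd.concat is used for both cases since it covers the append case too):
import pandas as pd

df1 = pd.DataFrame({'wins': [1, 2], 'runs': [3, 4]})
df2 = pd.DataFrame({'specialSauce': [10, 20]})

# Row-wise: the columns are unioned, so the missing cells become NaN
print(pd.concat([df2, df1], ignore_index=True))

# Column-wise: the rows are aligned on the index, giving one wider frame
print(pd.concat([df1, df2], axis=1))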
Ok, there are a couple of things going on here. You've left code out and I had to fill in the gaps. For example you did not define doStuff, so I had to.
doStuff = lambda w, r, e: w + r + e
With that defined, your code does not run. I had to guess what you were trying to do. I'm guessing that you want to have an additional column called 'specialSauce' adjacent to your other columns.
So, this is how I set it up and solved the problem.
Setup and Solution
import pandas as pd
import numpy as np

np.random.seed(314)
df = pd.DataFrame(np.random.randn(100, 6),
                  columns=["winnings", "returns",
                           "spent", "runs",
                           "wins", "expected"]).cumsum()

doStuff = lambda w, r, e: w + r + e

df['specialSauce'] = df[['wins', 'runs', 'expected']].apply(lambda x: doStuff(*x), axis=1)
print(df.head())
   winnings   returns     spent      runs      wins  expected  specialSauce
0  0.166085  0.781964  0.852285 -0.707071 -0.931657  0.886661     -0.752067
1 -0.055704  1.163688  0.079710  0.155916 -1.212917 -0.045265     -1.102266
2 -0.554241  1.928014  0.271214 -0.462848  0.452802  1.692924      1.682878
3  0.627985  3.047389 -1.594841 -1.099262 -0.308115  4.356977      2.949601
4  0.796156  3.228755 -0.273482 -0.661442 -0.111355  2.827409      2.054611
Also
You tried to use pd.DataFrame.append(). Per the linked documentation, it attaches the DataFrame specified as the argument to the end of the DataFrame that is being appended to. What you wanted here is pd.concat().
