Append Data to Pandas Dataframe - python

I have the following pandas dataframe:
df = pd.DataFrame({'A': [1, 2, 3, 7], 'B': [4, 5, 6, 29]})
I'm working on a for loop that grabs an index and then appends data to the end of that row.
How do I append columns C, D, E for a given index of the table? Let's say on iteration one, the index is 2:
   A   B   C   D   E
0  1   4   0   0   0
1  2   5   0   0   0
2  3   6  34  12  23
3  7  29   0   0   0
On the next iteration of the for loop, the index might be 1. Then the dataframe would be:
   A   B   C   D   E
0  1   4   0   0   0
1  2   5   8  11   4
2  3   6  34  12  23
3  7  29   0   0   0
How do I do this?

You can target specific rows by using loc and providing the index.
For example:
df.loc[5, 'D'] = 10
This will set the value 10 in column D of the row with index 5.
Your question states that you want to add new columns depending on the row index. That is not how a dataframe works: it is not like a NoSQL document, where each row can carry its own columns independently of the others.
What you should do is add all the columns to your dataframe up front, then fill in the values as you go.
To add multiple values:
df.loc[5, ['D', 'B']] = 10
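Putting it together for the question above, a minimal sketch (the (index, values) pairs in the loop are hypothetical stand-ins for whatever your own loop produces):
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 7], 'B': [4, 5, 6, 29]})

# Create the new columns up front, defaulting to 0
for col in ['C', 'D', 'E']:
    df[col] = 0

# Fill in one row per iteration of the loop
for idx, values in [(2, [34, 12, 23]), (1, [8, 11, 4])]:
    df.loc[idx, ['C', 'D', 'E']] = values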

Get the index of n maximum values in a column in dataframe

I have a data frame and I want to get the index and value of the 4 maximum values in a column. For example, in the following df, the four maximum values in column a are 10, 8, 7, 6.
import pandas as pd
df = pd.DataFrame()
df['a'] = [10, 2, 3, -1, 4, 5, 6, 7, 8]
df['id'] = [100, 2, 3, -1, 4, 5, 0, 1, 2]
df
The output which I want is:
   index   a   id
0      0  10  100
1      8   8    2
2      7   7    1
3      6   6    0
Try nlargest:
df.nlargest(4, 'a').reset_index()
Output:
   index   a   id
0      0  10  100
1      8   8    2
2      7   7    1
3      6   6    0
You can sort the a column:
out = (df.sort_values('a', ascending=False).iloc[:4]
       .sort_index(ascending=True)
       .reset_index())
print(out)
   index   a   id
0      0  10  100
1      6   6    0
2      7   7    1
3      8   8    2
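If you only need the index/value pairs from the single column, Series.nlargest already returns exactly that, with the row labels as the index (a small sketch on the same df):
df['a'].nlargest(4)
0    10
8     8
7     7
6     6
Name: a, dtype: int64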

Pandas GroupBy and select rows with the minimum value in a specific column

I have a DataFrame with columns A, B, and C. For each value of A, I would like to select the row with the minimum value in column B.
That is, from this:
df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
                   'B': [4, 5, 2, 7, 4, 6],
                   'C': [3, 4, 10, 2, 4, 6]})
   A  B   C
0  1  4   3
1  1  5   4
2  1  2  10
3  2  7   2
4  2  4   4
5  2  6   6
I would like to get:
   A  B   C
0  1  2  10
1  2  4   4
For the moment I am grouping by column A, then creating a value that indicates to me the rows I will keep:
a = df.groupby('A').min()
a['A'] = a.index
to_keep = [str(x[0]) + str(x[1]) for x in a[['A', 'B']].values]
df['id'] = df['A'].astype(str) + df['B'].astype(str)
df[df['id'].isin(to_keep)]
I am sure that there is a much more straightforward way to do this.
I have seen many answers here that use MultiIndex, which I would prefer to avoid.
Thank you for your help.
I feel like you're overthinking this. Just use groupby and idxmin:
df.loc[df.groupby('A').B.idxmin()]
   A  B   C
2  1  2  10
4  2  4   4
df.loc[df.groupby('A').B.idxmin()].reset_index(drop=True)
   A  B   C
0  1  2  10
1  2  4   4
I had a similar situation, but with a more complex column heading (e.g. "B val"), in which case this is needed:
df.loc[df.groupby('A')['B val'].idxmin()]
The accepted answer (suggesting idxmin) cannot be used with the pipe pattern. A pipe-friendly alternative is to first sort values and then use groupby with DataFrame.head:
df.sort_values('B').groupby('A').apply(pd.DataFrame.head, n=1)
This is possible because, by default, groupby preserves the order of rows within each group, which is stable and documented behaviour (see pandas.DataFrame.groupby).
This approach has additional benefits:
it can be easily expanded to select the n rows with the smallest values in a specific column
it can break ties by providing another column (as a list) to .sort_values(), e.g.:
data.sort_values(['final_score', 'midterm_score']).groupby('year').apply(pd.DataFrame.head, n=1)
As with other answers, to exactly match the result desired in the question, .reset_index(drop=True) is needed, making the final snippet:
df.sort_values('B').groupby('A').apply(pd.DataFrame.head, n=1).reset_index(drop=True)
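Note that GroupBy.head can also be called directly on the grouped object, which keeps the pipe style without apply; a sketch of the same idea:
df.sort_values('B').groupby('A').head(1).reset_index(drop=True)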
I found an answer that is a little more wordy, but a lot more efficient:
This is the example dataset:
data = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2], 'B': [4, 5, 2, 7, 4, 6], 'C': [3, 4, 10, 2, 4, 6]})
data
Out:
   A  B   C
0  1  4   3
1  1  5   4
2  1  2  10
3  2  7   2
4  2  4   4
5  2  6   6
First we get the minimum values of B per group as a Series from a groupby operation:
min_value = data.groupby('A').B.min()
min_value
Out:
A
1    2
2    4
Name: B, dtype: int64
Then, we merge this Series back onto the original data frame:
data = data.merge(min_value, on='A', suffixes=('', '_min'))
data
Out:
   A  B   C  B_min
0  1  4   3      2
1  1  5   4      2
2  1  2  10      2
3  2  7   2      4
4  2  4   4      4
5  2  6   6      4
Finally, we get only the lines where B is equal to B_min and drop B_min since we don't need it anymore.
data = data[data.B == data.B_min].drop('B_min', axis=1)
data
Out:
   A  B   C
2  1  2  10
4  2  4   4
I have tested it on very large datasets and this was the only way I could make it work in a reasonable time.
You can use sort_values and drop_duplicates:
df.sort_values('B').drop_duplicates('A')
Output:
   A  B   C
2  1  2  10
4  2  4   4
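To exactly match the output desired in the question (rows ordered by A with a fresh 0-based index), re-sort and reset the index, e.g.:
df.sort_values('B').drop_duplicates('A').sort_values('A').reset_index(drop=True)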
The solution, as written before, is:
df.loc[df.groupby('A')['B'].idxmin()]
But if you get an error like:
"Passing list-likes to .loc or [] with any missing labels is no longer supported.
The following labels were missing: Float64Index([nan], dtype='float64').
See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike"
In my case, there were NaN values in column B, so I applied dropna() to the idxmin result and then it worked:
df.loc[df.groupby('A')['B'].idxmin().dropna()]
You can also use boolean indexing to select the rows where column B equals its group-wise minimum:
out = df[df['B'] == df.groupby('A')['B'].transform('min')]
print(out)
   A  B   C
2  1  2  10
4  2  4   4
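One practical difference worth noting: idxmin keeps a single row per group, while the transform approach keeps every row that ties for the minimum. A quick check on a small hypothetical frame:
tied = pd.DataFrame({'A': [1, 1], 'B': [2, 2], 'C': [5, 6]})
tied[tied['B'] == tied.groupby('A')['B'].transform('min')]  # returns both rows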

How to sort a dataframe based on idxmax?

I have a dataframe like this:
    A  B   C
0   1  2   1
1   3 -8  10
2  10  3 -20
3  50  7   1
I would like to rearrange its columns based on the index of the maximal absolute value in each column. In column A, the maximal absolute value is in row 3, in B it is row 1 and in C it is row 2 which means that my new dataframe should be in the order B C A.
Currently I do this as follows:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, 3, 10, 50], 'B': [2, -8, 3, 7], 'C': [1, 10, -20, 1]})
indMax = abs(df).idxmax(axis=0)
df = df[np.argsort(indMax)]
So I first determine the indices of the maximal value per column which are stored in indMax, then I sort it and rearrange the dataframe accordingly which gives me the desired output:
   B   C   A
0  2   1   1
1 -8  10   3
2  3 -20  10
3  7   1  50
My question is whether there is the possibility to pass the function idxmax directly to a sort function and change the dataframe inplace.
IIUC the following does what you want:
In [69]:
df.loc[:, df.abs().idxmax().sort_values().index]
Out[69]:
   B   C   A
0  2   1   1
1 -8  10   3
2  3 -20  10
3  7   1  50
Here we determine the idxmax of the absolute values, sort the result, and pass its index to select the columns of the df.
As to sorting in place, you can just assign the result back to the df.
For a pre 0.17.0 version the following works:
In [75]:
df.ix[:, df.abs().idxmax().sort(inplace=False).index]
Out[75]:
   B   C   A
0  2   1   1
1 -8  10   3
2  3 -20  10
3  7   1  50
This is ugly, but it seems to work using reindex_axis:
import numpy as np
>>> df.reindex_axis(df.columns[list(np.argsort(abs(df).idxmax(axis=0)))], axis=1)
   B   C   A
0  2   1   1
1 -8  10   3
2  3 -20  10
3  7   1  50
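On current pandas versions, where .ix and reindex_axis have been removed, the same reordering can be written with plain label indexing; a minimal sketch:
import pandas as pd

df = pd.DataFrame({'A': [1, 3, 10, 50], 'B': [2, -8, 3, 7], 'C': [1, 10, -20, 1]})

# Column order given by the row position of each column's maximal absolute value
order = df.abs().idxmax().sort_values().index
df = df[order]  # assigning back is the closest thing to an in-place sort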

Efficiently integrate a series into a pandas dataframe

I have a pandas dataframe with index [0, 1, 2...], and a list something like this: [1, 2, 2, 0, 1...].
I'd like to add a 'count' column to the dataframe, that reflects the number of times the digit in the index is referenced in the list.
Given the example lists above, the 'count' column would have the value 2 at index 2, because 2 occurred twice (so far). Is there a more efficient way to do this than iterating over the list?
Well, here is a way of doing it: first load the list into a df, then add the 'occurrence' column using value_counts, and then merge this to your orig df:
In [61]:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.arange(10)})
l = [1, 2, 2, 0, 1]
df1 = pd.DataFrame(l, columns=['data'])
df1['occurrence'] = df1['data'].map(df1['data'].value_counts())
df1
Out[61]:
   data  occurrence
0     1           2
1     2           2
2     2           2
3     0           1
4     1           2
In [65]:
s = df1['data'].value_counts().rename_axis('data').reset_index(name='count')
df.merge(s, left_index=True, right_on='data', how='left').fillna(0).drop_duplicates().reset_index(drop=True)
Out[65]:
   a  data  count
0  0     0      1
1  1     1      2
2  2     2      2
3  3     3      0
4  4     4      0
5  5     5      0
6  6     6      0
7  7     7      0
8  8     8      0
9  9     9      0
Counting occurrences of numbers in a dataframe is easy in pandas.
You just use the Series.value_counts method.
Then you join the resulting counts with the original dataframe using the pandas.merge function.
Setting up a DataFrame like the one you have:
df = pd.DataFrame({'nomnom': np.random.choice(['cookies', 'biscuits', 'cake', 'lie'], 10)})
df is now a DataFrame with some arbitrary data in it (since you said you had more data in there).
     nomnom
0  biscuits
1       lie
2  biscuits
3      cake
4       lie
5   cookies
6      cake
7      cake
8      cake
9      cake
Setting up a list like the one you have:
yourlist = np.random.choice(10, 10)
yourlist is now:
array([2, 9, 2, 3, 4, 8, 5, 8, 6, 8])
The actual code you need (TL;DR):
counts = pd.DataFrame(pd.Series(yourlist).value_counts())
pd.merge(left=df, left_index=True,
         right=counts, right_index=True,
         how='left').fillna(0)
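If you just want the count as a column on df, a compact alternative (a sketch reusing the same yourlist) is to reindex the value counts onto the dataframe's index:
counts = pd.Series(yourlist).value_counts()
df['count'] = counts.reindex(df.index, fill_value=0)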
