I have a pandas dataframe that looks like this:
Area1  Area2
1      2
1      4
1      5
1      9
2      8
2      16
2      4
2      1
3      8
3      9
How can I convert the 'Area2' column so that it becomes a list of values for each 'Area1' group?
So the output I would want is:
Area1  Area2
1      2, 4, 5, 9
2      8, 16, 4, 1
3      8, 9
I have done this in R previously:
df %>% group_by(Area1) %>% summarise(Area2= toString(sort(unique(Area2))))
I have been trying out groupby() and agg() but have had no success.
Could someone explain what I can use once I have grouped the data using df.groupby('Area1')?
Many thanks in advance for any suggestions.
You can groupby and apply list:
import pandas as pd

df = pd.read_csv("test.csv")  # the sample data shown above
df.groupby('Area1')['Area2'].apply(list)
Note that the R snippet produces a concatenated string, whereas the following line keeps the original type of Area2 by collecting the values into lists:
import pandas as pd
df.groupby('Area1').Area2.apply(pd.Series.tolist).reset_index()
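If you want to reproduce the R output exactly (sorted unique values joined into one string), here is a minimal sketch using the sample data above; note that sorting changes the order relative to the desired output shown, just as the R code would:

import pandas as pd

df = pd.DataFrame({'Area1': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3],
                   'Area2': [2, 4, 5, 9, 8, 16, 4, 1, 8, 9]})

# Mirror R's toString(sort(unique(Area2))): deduplicate, sort,
# then join into one comma-separated string per group.
out = (df.groupby('Area1')['Area2']
         .agg(lambda s: ', '.join(map(str, sorted(s.unique()))))
         .reset_index())
print(out)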
Good morning! I have a three-column dataframe and need to find the second-largest value in each row:
DATA = pd.DataFrame({"A": [10, 11, 4, 5], "B": [23, 8, 3, 4], "C": [12, 7, 11, 9]})
A B C
0 10 23 12
1 11 8 7
2 4 3 11
3 5 4 9
I tried using nlargest, but it seems to be column-based, and I can't find a pandas solution for this problem. Thank you in advance!
import pandas as pd

df = pd.DataFrame({"A": [10, 11, 4, 5], "B": [23, 8, 3, 4], "C": [12, 7, 11, 9]})

# find the second-largest value in each row
df['largest2'] = df.apply(lambda x: x.nlargest(2).iloc[1], axis=1)
print(df.head())
result:
A B C largest2
0 10 23 12 12
1 11 8 7 8
2 4 3 11 4
3 5 4 9 5
In A Python List
mylist = [1, 2, 8, 3, 12]
print(sorted(mylist, reverse=True)[1])
In A Python Pandas List
import pandas as pd

df = pd.DataFrame({"A": [10, 11, 4, 5], "B": [23, 8, 3, 4], "C": [12, 7, 11, 9]})
print(sorted(df['A'].nlargest(4))[2])
print(sorted(df['B'].nlargest(4))[2])
print(sorted(df['C'].nlargest(4))[2])
In A Python Pandas List mk.2
import pandas as pd

df = pd.DataFrame({"A": [10, 11, 4, 5], "B": [23, 8, 3, 4], "C": [12, 7, 11, 9]})
num_of_rows = len(df.index)
second_highest = num_of_rows - 2
print(sorted(df['A'].nlargest(num_of_rows))[second_highest])
print(sorted(df['B'].nlargest(num_of_rows))[second_highest])
print(sorted(df['C'].nlargest(num_of_rows))[second_highest])
In A Python Pandas List mk.3
import pandas as pd

df = pd.DataFrame({"A": [10, 11, 4, 5], "B": [23, 8, 3, 4], "C": [12, 7, 11, 9]})
col_names = df.columns
num_of_rows = len(df.index)
second_highest = num_of_rows - 2
for col_name in col_names:
    print(sorted(df[col_name].nlargest(num_of_rows))[second_highest])
In A Python Pandas List mk.4
import pandas as pd

df = pd.DataFrame({"A": [10, 11, 4, 5], "B": [23, 8, 3, 4], "C": [12, 7, 11, 9]})
top_n = len(df.columns)
# For each row, rank the column labels by value, largest first.
pd.DataFrame({n: df.T[col].nlargest(top_n).index.tolist()
              for n, col in enumerate(df.T)}).T
Or, staying entirely in pandas, this keeps the two largest values in each row (the remaining cells become NaN):

df.apply(pd.Series.nlargest, axis=1, n=2)
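Row-wise apply can be slow on larger frames. Here is a vectorized sketch using numpy sorting on the same sample data, assuming all columns are numeric:

import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [10, 11, 4, 5], "B": [23, 8, 3, 4], "C": [12, 7, 11, 9]})

# Sort each row ascending and take the next-to-last element,
# i.e. the second-largest value in that row.
df['largest2'] = np.sort(df.to_numpy(), axis=1)[:, -2]
print(df)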
I have code that creates a list of dataframes with the same structure. My wish is to append all of these dataframes together but add a column to the new dataframe that identifies which dataframe the row originally came from.
I easily appended the list of dataframes with:
import pandas as pd
df_rosters = pd.concat(list_of_rosters)
However, I haven't been able to figure out how to add a column with the original dataframe name or index. I've found a bunch of examples suggesting the keys argument, but each example has hardcoded keys. The size of my list is constantly changing, so I need to figure out how to add the keys dynamically.
Thanks in advance!
Let's assign an indicator column to each DataFrame in the list. (Names can be zipped together with the list of DataFrames or created by something like enumerate):
With enumerate
pd.concat(d.assign(df_name=f'{i:02d}') for i, d in enumerate(list_of_rosters))
0 1 df_name
0 4 7 00
1 7 1 00
2 9 5 00
0 8 1 01
1 1 8 01
2 2 6 01
Or with zip:
pd.concat(d.assign(df_name=name)
          for name, d in zip(['name1', 'name2'], list_of_rosters))
0 1 df_name
0 4 7 name1
1 7 1 name1
2 9 5 name1
0 8 1 name2
1 1 8 name2
2 2 6 name2
Setup:
import numpy as np
import pandas as pd

np.random.seed(5)
list_of_rosters = [
    pd.DataFrame(np.random.randint(1, 10, (3, 2))),
    pd.DataFrame(np.random.randint(1, 10, (3, 2)))
]
list_of_rosters:
[ 0 1
0 4 7
1 7 1
2 9 5,
0 1
0 8 1
1 1 8
2 2 6]
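If you would rather use concat's keys argument, the keys can be generated from the list itself, so nothing is hardcoded. A sketch, assuming the integer position of each frame is an acceptable identifier:

import pandas as pd

# One key per DataFrame, derived from the list length, so the keys
# always match however many frames the list currently holds.
df_rosters = pd.concat(list_of_rosters, keys=range(len(list_of_rosters)))

# The keys form the outer level of a MultiIndex; promote that level
# to a regular column if a flat frame is preferred.
df_rosters = df_rosters.reset_index(level=0).rename(columns={'level_0': 'df_name'})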
I have a column in a DataFrame named fatalities in which a few of the values look like this:
data['fatalities'] = [1, 4, , 10, 1+8, 5, 2+9, , 16, 4+5]
I want values like '1+8', '2+9', etc. to be converted to their aggregated value, i.e.,
data['fatalities'] = [1, 4, , 10, 9, 5, 11, , 16, 9]
I'm not sure how to write code to perform the above aggregation on a pandas DataFrame column in Python. When I tried the code below, it threw an error.
def addition(col):
    col = col.split('+')
    col = int(col[0]) + int(col[1])
    return col

data['fatalities'] = [addition(row) for row in data['fatalities']]
Error:
IndexError: list index out of range
Use pandas.eval, which works differently from pure Python eval:
data['fatalities'] = pd.eval(data['fatalities'])
print (data)
fatalities
0 1
1 4
2 10
3 9
4 5
5 11
6 16
7 9
But this only works for up to about 100 rows; beyond that it fails because of a bug:

AttributeError: 'PandasExprVisitor' object has no attribute 'visit_Ellipsis'

The solution then is:
data['fatalities'] = data['fatalities'].apply(pd.eval)
Use .map together with .astype(str) to force conversion if you have mixed data types:
df['fatalities'].astype(str).map(eval)
print(df)
fatalities
0 1
1 4
2 10
3 9
4 5
5 11
6 16
7 9
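If eval feels too permissive, here is a sketch of an eval-free alternative using the data frame from the question; it assumes each entry is an integer, or integers joined by '+', with blanks treated as missing:

import pandas as pd

def add_parts(value):
    # Blank or missing entries stay missing instead of raising.
    s = str(value).strip()
    if not s or s == 'nan':
        return pd.NA
    # '1+8' -> ['1', '8'] -> 9; a plain '4' passes through unchanged.
    return sum(int(part) for part in s.split('+'))

data['fatalities'] = data['fatalities'].apply(add_parts)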
Suppose I have the following data frame:
import pandas as pd
df = pd.DataFrame()
df['ID'] = 1, 1, 1, 2, 2, 3, 3
df['a'] = 3, 5, 6, 3, 8, 1, 2
I want to create a for loop that loops over ID and returns the sum of 'a' for that ID. So far I have this:
for i in df['ID']:
    print(i, df.loc[df['ID'] == i, 'a'].sum())
However, this returns duplicates of the same value, like so:
1 14
1 14
1 14
2 11
2 11
3 3
3 3
How do I edit my loop so that once it has returned the value for ID == 1 it moves on to the next ID value, rather than just down to the next row?
I'm looking to get the following:
1 14
2 11
3 3
Thanks in advance!
This is much better suited to groupby rather than looping (as are many pandas dataframe problems):
>>> df.groupby('ID')['a'].sum()
ID
1 14
2 11
3 3
Name: a, dtype: int64
However, just to explain where your loop went wrong, you can just loop through the unique values of df['ID'], rather than all rows:
for i in df['ID'].unique():
    print(i, df.loc[df['ID'] == i, 'a'].sum())
1 14
2 11
3 3
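If you do want a loop, iterating over the groupby object itself visits each ID exactly once; a short sketch:

for id_value, group in df.groupby('ID'):
    # Each iteration yields one ID and the sub-frame of its rows.
    print(id_value, group['a'].sum())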
I have a dataframe in Python pandas with several columns taken from a CSV file.
For instance, data looks like this:
Day P1S1 P1S2 P1S3 P2S1 P2S2 P2S3
1 1 2 2 3 1 2
2 2 2 3 5 4 2
What I need is the sum of all columns whose name starts with P1... something like P1* with a wildcard.
Something like the following, which gives an error:
P1Sum = data["P1*"]
Is there any way to do this with pandas?
I found the answer. Using the data dataframe from the question:

P1Channels = data.filter(regex="P1")
P1Sum = P1Channels.sum(axis=1)
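Note that filter(regex="P1") matches "P1" anywhere in the column name, not just at the start. Anchoring the pattern, or using like for a plain substring match, makes the intent explicit; a sketch:

# Anchor the regex so only columns *starting* with P1 match.
P1Channels = data.filter(regex=r"^P1")

# Or match a plain substring without regex syntax.
P1Channels = data.filter(like="P1")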
A list comprehension over the columns allows more complex filters in the if condition:
In [1]: df = pd.DataFrame(np.arange(15).reshape(5, 3), columns=['P1S1', 'P1S2', 'P2S1'])
In [2]: df
Out[2]:
P1S1 P1S2 P2S1
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
In [3]: df.loc[:, [x for x in df.columns if x.startswith('P1')]].sum(axis=1)
Out[3]:
0 1
1 7
2 13
3 19
4 25
dtype: int64
Thanks for the tip jbssm. For anyone else looking for a grand total, I ended up adding .sum() at the end:

P1Sum = P1Channels.sum(axis=1).sum()