This question already has answers here:
Split a Pandas column of lists into multiple columns
(11 answers)
Closed 1 year ago.
I have 4 values per row as the output of a function. Here's my data:
Name Grade
usia (75,78,90,52)
shdh (85,68,60,72)
fbjg (95,58,65,66)
Here's what I want
Name Math English Physics Chemistry
usia 75 78 90 52
shdh 85 68 60 72
fbjg 95 58 65 66
Use the DataFrame constructor with DataFrame.pop to remove the original Grade column:
import ast

# if the values are strings instead of tuples, parse them first
# df['Grade'] = df['Grade'].apply(ast.literal_eval)

cols = ['Math', 'English', 'Physics', 'Chemistry']
df[cols] = pd.DataFrame(df.pop('Grade').tolist(), index=df.index)
print(df)
Name Math English Physics Chemistry
0 usia 75 78 90 52
1 shdh 85 68 60 72
2 fbjg 95 58 65 66
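For completeness, here is a minimal, self-contained sketch of the same approach, assuming Grade arrives as strings such as "(75,78,90,52)" (the sample frame below is made up to mirror the question):
import ast
import pandas as pd

# made-up sample mirroring the question, with Grade stored as strings
df = pd.DataFrame({'Name': ['usia', 'shdh', 'fbjg'],
                   'Grade': ['(75,78,90,52)', '(85,68,60,72)', '(95,58,65,66)']})

# parse each string into a tuple, then expand into separate columns
df['Grade'] = df['Grade'].apply(ast.literal_eval)
cols = ['Math', 'English', 'Physics', 'Chemistry']
df[cols] = pd.DataFrame(df.pop('Grade').tolist(), index=df.index)
print(df)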
I'm interested in figuring out how to do vectorized computations in a numpy array / pandas dataframe where each new cell is updated with local information.
For example, let's say I'm a weatherman interested in making predictions about the weather. My prediction algorithm will be the mean of the past 3 days. While this prediction is simple, I'd like to be able to do this with an arbitrary function.
Example data:
day temp
1 70
2 72
3 68
4 67
...
After a transformation should become
day temp prediction
1 70 None (no previous data)
2 72 70 (only one data point)
3 68 71 (two data points)
4 67 70
5 70 69
...
I'm only interested in the prediction column, so there's no need to join the data back together after computing the prediction! Thanks!
Use rolling with window=3 and min_periods=1, then shift so each day's prediction only uses previous days:
df['prediction'] = df['temp'].rolling(window = 3, min_periods = 1).mean().shift()
df
day temp prediction
0 1 70 NaN
1 2 72 70
2 3 68 71
3 4 67 70
4 5 70 69
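Since the question mentions wanting an arbitrary prediction function, the same pattern generalizes to rolling().apply(); a small sketch below, where recency_weighted is just a made-up illustrative function, not anything from the original answer:
import pandas as pd

df = pd.DataFrame({'day': [1, 2, 3, 4, 5],
                   'temp': [70, 72, 68, 67, 70]})

def recency_weighted(window):
    # toy example: weight more recent days more heavily
    weights = range(1, len(window) + 1)
    return sum(w * t for w, t in zip(weights, window)) / sum(weights)

# shift() keeps each prediction from seeing the current day's temp
df['prediction'] = (df['temp']
                    .rolling(window=3, min_periods=1)
                    .apply(recency_weighted, raw=True)
                    .shift())
print(df)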
My df looks as follows:
Roll Name Age Physics English Maths
0 A1 Max 16 87 79 90
1 A2 Lisa 15 47 75 60
2 A3 Luna 17 83 49 95
3 A4 Ron 16 86 79 93
4 A5 Silvia 15 57 99 91
I'd like to add the columns Physics, English, and Maths and display the results in a separate column 'Grade'.
I've tried the code:
df['Physics'] + df['English'] + df['Maths']
But it just concatenates. I haven't been taught the lambda function yet.
How do I go about this?
df['Grade'] = df['Physics'] + df['English'] + df['Maths']
If it concatenates, your data is probably stored as strings; convert it to float or integer.
Check the data types with df.dtypes.
Try:
df["total"] = df[["Physics", "English", "Maths"]].sum(axis=1)
df
Check the code below. It's possible your columns are in string format; the following will solve that:
import pandas as pd

df = pd.DataFrame({"Physics": ['1', '2', '3'],
                   "English": ['1', '2', '3'],
                   "Maths": ['1', '2', '3']})
df['Total'] = df['Physics'].astype('int') + df['English'].astype('int') + df['Maths'].astype('int')
df
Output:
  Physics English Maths  Total
0       1       1     1      3
1       2       2     2      6
2       3       3     3      9
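A slightly more defensive variant (my own suggestion, not part of the answer above) is pd.to_numeric with errors='coerce', which turns anything unparseable into NaN instead of raising:
import pandas as pd

df = pd.DataFrame({"Physics": ['1', '2', '3'],
                   "English": ['1', '2', '3'],
                   "Maths": ['1', '2', 'oops']})   # one bad value for illustration

subjects = ["Physics", "English", "Maths"]
# coerce non-numeric strings to NaN rather than raising a ValueError
df[subjects] = df[subjects].apply(pd.to_numeric, errors='coerce')
df["Total"] = df[subjects].sum(axis=1)
print(df)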
When creating a dataframe as below (instructions from here), the order of the columns changes from "Day, Visitors, Bounce Rate" to "Bounce Rate, Day, Visitors"
import pandas as pd
web_stats = {'Day': [1, 2, 3, 4, 5, 6],
             'Visitors': [43, 34, 65, 56, 29, 76],
             'Bounce Rate': [65, 67, 78, 65, 45, 52]}
df = pd.DataFrame(web_stats)
Gives:
Bounce Rate Day Visitors
0 65 1 43
1 67 2 34
2 78 3 65
3 65 4 56
4 45 5 29
5 52 6 76
How can the order be kept intact (i.e. Day, Visitors, Bounce Rate)?
One approach is to use the columns parameter.
Ex:
import pandas as pd
web_stats = {'Day': [1, 2, 3, 4, 5, 6],
             'Visitors': [43, 34, 65, 56, 29, 76],
             'Bounce Rate': [65, 67, 78, 65, 45, 52]}
df = pd.DataFrame(web_stats, columns = ['Day', 'Visitors', 'Bounce Rate'])
print(df)
Output:
Day Visitors Bounce Rate
0 1 43 65
1 2 34 67
2 3 65 78
3 4 56 65
4 5 29 45
5 6 76 52
Dictionaries are not considered to be ordered in Python <3.7.
You can use collections.OrderedDict instead:
from collections import OrderedDict
web_stats = OrderedDict([('Day', [1, 2, 3, 4, 5, 6]),
                         ('Visitors', [43, 34, 65, 56, 29, 76]),
                         ('Bounce Rate', [65, 67, 78, 65, 45, 52])])
df = pd.DataFrame(web_stats)
If you don't want to write out the column names (which becomes inconvenient when you have many keys), you can use:
df = pd.DataFrame(web_stats, columns = web_stats.keys())
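Worth noting: on Python 3.7+ plain dicts preserve insertion order, so the original code already keeps Day, Visitors, Bounce Rate on modern versions. If you ever need to reorder after construction, plain column selection also works:
import pandas as pd

web_stats = {'Day': [1, 2, 3, 4, 5, 6],
             'Visitors': [43, 34, 65, 56, 29, 76],
             'Bounce Rate': [65, 67, 78, 65, 45, 52]}

df = pd.DataFrame(web_stats)
# reorder (or restore) the columns explicitly after construction
df = df[['Day', 'Visitors', 'Bounce Rate']]
print(df)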
I have a dataframe (df_input), and I'm trying to convert it to another dataframe (df_output) by applying a formula to each element in each row. The formula requires information about the whole row (min, max, median).
df_input:
A B C D E F G H I J
2011-01-01 60 48 26 29 41 91 93 87 39 65
2011-01-02 88 52 24 99 1 27 12 26 64 87
2011-01-03 13 1 38 60 8 50 59 1 3 76
df_output:
F(A) F(B) F(C) F(D) F(E) F(F) F(G) F(H) F(I) F(J)
2011-01-01 93 54 45 52 8 94 65 37 2 53
2011-01-02 60 44 94 62 78 77 37 97 98 76
2011-01-03 53 58 16 63 60 9 31 44 79 35
I'm trying to go from df_input to df_output, as above, by applying f(x) to each cell per row. The function foo maps element x to f(x) by doing an OLS regression of the row's min, median and max against some co-ordinates. This is done for each period.
I'm aware that I can iterate over the rows and then, for each row, apply the function to each element. Where I'm struggling is getting the output of foo into df_output.
for index, row in df_input.iterrows():
    row_min = row.min()
    row_max = row.max()
    row_mean = row.mean()
    # apply function to row
    new_row = row.apply(lambda x: foo(x, row_min, row_max, row_mean))
    # add this to df_output
help!
My current thinking is to build up the new df row by row. I'm trying to do that, but I'm getting a lot of MultiIndex columns etc. Any pointers would be great.
thanks so much... merry xmas to you all.
Consider calculating row-wise aggregates with DataFrame methods and then passing the Series values into a DataFrame.apply() across columns:
# ROW-WISE AGGREGATES
df['row_min'] = df.min(axis=1)
df['row_max'] = df.max(axis=1)
df['row_mean'] = df.mean(axis=1)

# COLUMN-WISE CALCULATION (DEFAULT axis=0)
new_df = df[list('ABCDEFGHIJ')].apply(
    lambda col: foo(col, df['row_min'], df['row_max'], df['row_mean'])
)
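If you prefer to keep the explicit loop from the question, here is a minimal sketch that collects each transformed row and builds df_output at the end; foo here is a hypothetical placeholder (a simple min-max scaling), since the real regression isn't shown:
import numpy as np
import pandas as pd

def foo(x, row_min, row_max, row_mean):
    # hypothetical stand-in for the OLS-based mapping: scale x within its row
    return (x - row_min) / (row_max - row_min)

df_input = pd.DataFrame(np.random.randint(1, 100, (3, 10)),
                        columns=list('ABCDEFGHIJ'),
                        index=pd.date_range('2011-01-01', periods=3))

rows = []
for index, row in df_input.iterrows():
    rows.append(row.apply(lambda x: foo(x, row.min(), row.max(), row.mean())))

df_output = pd.DataFrame(rows)
df_output.columns = ['F(%s)' % c for c in df_input.columns]
print(df_output)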
This question already has answers here:
Remove row with null value from pandas data frame
(5 answers)
Closed 5 years ago.
I have the following dataframe. If there is a null in any of Participation, Homework, Test, or Presentation (i.e. a null in any of the four columns), then I want to remove that row. How do I achieve this in pandas?
Name Participation Homework Test Presentation Attendance
Andrew 92 Null 85 95 88
John 95 88 98 Null 90
Carrie 82 99 96 89 92
Simone 100 91 88 99 90
Here, I would want to remove everyone except for Carrie and Simone from the dataframe. How do I achieve this in pandas?
I found this on Stack Overflow, which I think may help: df = df[pd.notnull(df['column_name'])], but is there any way I can do this for all four columns (a subset) at once instead of each column individually?
Thanks!
You can skip the replace if you compare against the 'Null' string directly with ne:
df[df.ne('Null').all(1)]
Name Participation Homework Test Presentation Attendance
2 Carrie 82 99 96 89 92
3 Simone 100 91 88 99 90
As preparation, let's replace the string 'Null' with np.nan first. Then we can use notnull and all with axis=1:
df[df.replace('Null',np.nan).notnull().all(1)]
Output:
Name Participation Homework Test Presentation Attendance
2 Carrie 82 99 96 89 92
3 Simone 100 91 88 99 90
Or using isnull, any, and ~:
df[~df.replace('Null',np.nan).isnull().any(1)]
replace + dropna
df.replace({'Null':np.nan}).dropna()
Out[504]:
Name Participation Homework Test Presentation Attendance
2 Carrie 82 99 96 89 92
3 Simone 100 91 88 99 90
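Since the question only cares about the four graded columns, you can also restrict the check with dropna(subset=...); a small sketch, assuming the 'Null' strings are converted to real NaN first:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['Andrew', 'John', 'Carrie', 'Simone'],
                   'Participation': [92, 95, 82, 100],
                   'Homework': ['Null', 88, 99, 91],
                   'Test': [85, 98, 96, 88],
                   'Presentation': [95, 'Null', 89, 99],
                   'Attendance': [88, 90, 92, 90]})

# convert the 'Null' strings to real NaN, then drop rows missing any
# of the four graded columns (Attendance is ignored by the subset)
df = df.replace('Null', np.nan).dropna(
    subset=['Participation', 'Homework', 'Test', 'Presentation'])
print(df)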