Hi, I would like to give each student a final score equal to their current score plus the score for their favourite subject.
import pandas as pd
new_data = [['tom', 31, 50, 30, 20, 'English'], ['nick', 30, 42, 23, 21, 'Math'], ['juli', 39, 14, 40, 38, 'Science']]
df = pd.DataFrame(new_data, columns = ['Name','Current_Score','English','Science','Math','Favourite_Subject'])
for subj in df['Favourite_Subject'].unique():
    mask = (df['Favourite_Subject'] == subj)
    df['Final_Score'] = df[mask].apply(lambda row: row['Current_Score'] + row[subj], axis=1)
Name Score English Science Math Favourite_Subject Final_Score
0 tom 31 50 30 20 English NaN
1 nick 30 42 23 21 Math NaN
2 juli 39 14 40 38 Science 79.0
When I run the code above, I get NaN in the other two 'Final_Score' entries. How do I get the following result without overwriting with NaN? Thanks!
Name Score English Science Math Favourite_Subject Final_Score
0 tom 31 50 30 20 English 81
1 nick 30 42 23 21 Math 51
2 juli 39 14 40 38 Science 79
We can look up, for each row, the score in the column named by Favourite_Subject, then add it to Current_Score to calculate Final_Score. Note that the code below uses df.index as row positions, so it relies on the default integer index:
i = df.columns.get_indexer(df['Favourite_Subject'])
df['Final_Score'] = df['Current_Score'] + df.values[df.index, i]
Name Current_Score English Science Math Favourite_Subject Final_Score
0 tom 31 50 30 20 English 81
1 nick 30 42 23 21 Math 51
2 juli 39 14 40 38 Science 79
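If the DataFrame does not have the default integer index, the same positional lookup can be written with explicit row positions. This is only a sketch of that variation: the string index and the names subjects, row_pos, col_pos and fav_scores are made up for illustration.

```python
import numpy as np
import pandas as pd

new_data = [['tom', 31, 50, 30, 20, 'English'],
            ['nick', 30, 42, 23, 21, 'Math'],
            ['juli', 39, 14, 40, 38, 'Science']]
df = pd.DataFrame(new_data, columns=['Name', 'Current_Score', 'English',
                                     'Science', 'Math', 'Favourite_Subject'])
# Hypothetical non-default index, to show that row labels are not used below.
df.index = ['a', 'b', 'c']

subjects = ['English', 'Science', 'Math']
# Position, within the subject columns, of each row's favourite subject.
col_pos = df[subjects].columns.get_indexer(df['Favourite_Subject'])
# Row positions 0..n-1, independent of the index labels.
row_pos = np.arange(len(df))
# Pick one value per row from the numeric subject block.
fav_scores = df[subjects].to_numpy()[row_pos, col_pos]
df['Final_Score'] = df['Current_Score'] + fav_scores
print(df)
```

Restricting the lookup to the subject columns also keeps the intermediate array numeric instead of object dtype.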
You do not need a loop, you can apply this directly to the dataframe:
import pandas as pd
new_data = [['tom', 31, 50, 30, 20, 'English'], ['nick', 30, 42, 23, 21, 'Math'], ['juli', 39, 14, 40, 38, 'Science']]
df = pd.DataFrame(new_data, columns = ['Name','Current_Score','English','Science','Math','Favourite_Subject'])
df['Final_Score'] = df.apply(lambda x: x['Current_Score'] + x[x['Favourite_Subject']], axis=1)
You can use .apply() with axis=1 and take each row's Favourite_Subject value as the column label, which fetches that row's score for the subject. Then add the result to the Current_Score column, as follows:
df['Final_Score'] = df['Current_Score'] + df.apply(lambda x: x[x['Favourite_Subject']], axis=1)
Result:
print(df)
Name Current_Score English Science Math Favourite_Subject Final_Score
0 tom 31 50 30 20 English 81
1 nick 30 42 23 21 Math 51
2 juli 39 14 40 38 Science 79
It seems you are overwriting the whole column on each iteration of the loop, which is why only the last subject's row has a Final_Score when the loop ends.
Here is my implementation:
import pandas as pd
new_data = [['tom', 31, 50, 30, 20, 'English'], ['nick', 30, 42, 23, 21, 'Math'], ['juli', 39, 14, 40, 38, 'Science']]
df = pd.DataFrame(new_data, columns = ['Name','Current_Score','English','Science','Math','Favourite_Subject'])
favsubj = df['Favourite_Subject'].to_list()
final_scores = []
for i in range(0,len(df)):
    final_scores.append(df['Current_Score'].iloc[i] + df[favsubj[i]].iloc[i])
df['Final_Score'] = final_scores
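If you would rather keep the loop over subjects from the question, one possible fix is to assign only to the masked rows with .loc, so rows belonging to other subjects are left alone. This is a sketch, not the only way to do it. Initialising the column to 0 is an assumption that every row has a favourite subject.

```python
import pandas as pd

new_data = [['tom', 31, 50, 30, 20, 'English'],
            ['nick', 30, 42, 23, 21, 'Math'],
            ['juli', 39, 14, 40, 38, 'Science']]
df = pd.DataFrame(new_data, columns=['Name', 'Current_Score', 'English',
                                     'Science', 'Math', 'Favourite_Subject'])

# Create the column once, before the loop (assumes every row gets filled).
df['Final_Score'] = 0
for subj in df['Favourite_Subject'].unique():
    mask = df['Favourite_Subject'] == subj
    # Write only the rows selected by the mask; other rows keep their values.
    df.loc[mask, 'Final_Score'] = df.loc[mask, 'Current_Score'] + df.loc[mask, subj]
print(df)
```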
sample dataframe:
df = pd.DataFrame({'sales': ['2020-01','2020-02','2020-03','2020-04','2020-05','2020-06'],
                   '2020-01': [24,42,18,68,24,30],
                   '2020-02': [24,42,18,68,24,30],
                   '2020-03': [64,24,70,70,88,57],
                   '2020-04': [22,11,44,3,5,78],
                   '2020-05': [11,35,74,12,69,51]})
I want to compute the df['L2'] column shown below.
I looked at pandas rolling, groupby, etc., but could not solve it.
Please read the L2 formula and give me your opinion.
L2 formula
L2(Jan-20) = 24
-------------------
sales 2020-01
0 2020-01 24
-------------------
L2(Feb-20) = 132 (sum of below matrix 2x2)
sales 2020-01 2020-02
0 2020-01 24 24
1 2020-02 42 42
-------------------
L2(Mar-20) = 154 (sum of matrix 2x2)
sales 2020-02 2020-03
0 2020-02 42 24
1 2020-03 18 70
-------------------
L2(Apr-20) = 187 (sum of below matrix 2x2)
sales 2020-03 2020-04
0 2020-03 70 44
1 2020-04 70 3
output
Unnamed: 0 sales Jan-20 Feb-20 Mar-20 Apr-20 May-20 L2 L3
0 0 Jan-20 24 24 64 22 11 24 24
1 1 Feb-20 42 42 24 11 35 132 132
2 2 Mar-20 18 18 70 44 74 154 326
3 3 Apr-20 68 68 70 3 12 187 350
4 4 May-20 24 24 88 5 69 89 545
5 5 Jun-20 30 30 57 78 51 203 433
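import numpy as np
# f is the DataFrame from the question (named df there)
Values=f.values[:,1:]
L2=[]
RANGE=Values.shape[0]
for a in range(RANGE):
    if a==0:
        result=Values[a,a]
    else:
        # on the last row the column window runs past the last column, so shift it left by one
        if Values[a-1:a+1,a-1:a+1].shape==(2,1):
            result=np.sum(Values[a-1:a+1,a-2:a])
        else:
            result=np.sum(Values[a-1:a+1,a-1:a+1])
    L2.append(result)
print(L2)
L2 output:-->[24, 132, 154, 187, 89, 203]
f["L2"]=L2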
import pandas as pd
import numpy as np
# make a dataset
df = pd.DataFrame({'sales': ['2020-01','2020-02','2020-03','2020-04','2020-05','2020-06'],
'2020-01': [24,42,18,68,24,30],
'2020-02': [24,42,18,68,24,30],
'2020-03': [64,24,70,70,88,57],
'2020-04': [22,11,44,3,5,78],
'2020-05': [11,35,74,12,69,51]})
print(df)
# datawork(L2)
for i in range(0,df.shape[0]):
    if i==0:
        df.loc[i,'L2']=df.loc[i,'2020-01']
    else:
        if i!=df.shape[0]-1:
            df.loc[i,'L2']=df.iloc[i-1:i+1,i:i+2].sum().sum()
        if i==df.shape[0]-1:
            df.loc[i,'L2']=df.iloc[i-1:i+1,i-1:i+1].sum().sum()
print(df)
# sales 2020-01 2020-02 2020-03 2020-04 2020-05 L2
#0 2020-01 24 24 64 22 11 24.0
#1 2020-02 42 42 24 11 35 132.0
#2 2020-03 18 18 70 44 74 154.0
#3 2020-04 68 68 70 3 12 187.0
#4 2020-05 24 24 88 5 69 89.0
#5 2020-06 30 30 57 78 51 203.0
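The first-row and last-row special cases above can also be folded into one rule by clamping the 2x2 window to the edges of the table. The following is a sketch of that idea. The helper name l2_sums and its label_col parameter are made up for illustration, and it has only been checked against the L2 values in the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'sales': ['2020-01', '2020-02', '2020-03', '2020-04', '2020-05', '2020-06'],
                   '2020-01': [24, 42, 18, 68, 24, 30],
                   '2020-02': [24, 42, 18, 68, 24, 30],
                   '2020-03': [64, 24, 70, 70, 88, 57],
                   '2020-04': [22, 11, 44, 3, 5, 78],
                   '2020-05': [11, 35, 74, 12, 69, 51]})

def l2_sums(frame, label_col='sales'):
    """Sum a window of up to 2x2 cells ending at row i, clamped to the table edges."""
    values = frame.drop(columns=label_col).to_numpy()
    n_rows, n_cols = values.shape
    sums = []
    for i in range(n_rows):
        row_start = max(i - 1, 0)         # previous row, if there is one
        col_end = min(i + 1, n_cols)      # exclusive end, clamped at the last column
        col_start = max(col_end - 2, 0)   # two columns wide where possible
        sums.append(int(values[row_start:i + 1, col_start:col_end].sum()))
    return sums

df['L2'] = l2_sums(df)
print(df)
```

For the sample data this gives the same L2 column as the loop above, with an integer dtype.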
I tried another method.
This method uses reshape long (in Python: melt). I applied melt twice because sales and the other columns in df hold monthly, not daily, dates, so a second melt was needed to create an integer column corresponding to each monthly date.
(I have used Stata more often than Python. In Stata a single reshape long is enough because it supports a monthly time frequency, and reshaping is much easier there than in pandas.)
If you are interested, take a look:
# 00.module
import pandas as pd
import numpy as np
from order import order # https://stackoverflow.com/a/68464246/16478699
# 0.make a dataset
df = pd.DataFrame({'sales': ['2020-01', '2020-02', '2020-03', '2020-04', '2020-05', '2020-06'],
'2020-01': [24, 42, 18, 68, 24, 30],
'2020-02': [24, 42, 18, 68, 24, 30],
'2020-03': [64, 24, 70, 70, 88, 57],
'2020-04': [22, 11, 44, 3, 5, 78],
'2020-05': [11, 35, 74, 12, 69, 51]}
)
df.to_stata('dataset.dta', version=119, write_index=False)
print(df)
# 1.reshape long(in python: melt)
t = list(df.columns)
t.remove('sales')
df_long = df.melt(id_vars='sales', value_vars=t, var_name='var', value_name='val')
df_long['id'] = list(range(1, df_long.shape[0] + 1)) # make id for another reshape long
print(df_long)
# 2.another reshape long(in python: melt, reason: make int(col name: tid) corresponding to monthly date of sales and monthly columns in df)
df_long2 = df_long.melt(id_vars=['id', 'val'], value_vars=['sales', 'var'])
df_long2['tid'] = df_long2['value'].apply(lambda x: 1 + list(df_long2.value.unique()).index(x))
print(df_long2)
# 3.back to wide form with tid(in python: pd.pivot)
df_wide = pd.pivot(df_long2, index=['id', 'val'], columns='variable', values=['value', 'tid'])
df_wide.columns = df_wide.columns.map(lambda x: x[1] if x[0] == 'value' else f'{x[0]}_{x[1]}') # change multiindex columns name into just normal columns name
df_wide = df_wide.reset_index()
print(df_wide)
# 4.make values of L2
for i in df_wide.tid_sales.unique():
    if list(df_wide.tid_sales.unique()).index(i) + 1 == len(df_wide.tid_sales.unique()):
        df_wide.loc[df_wide['tid_sales'] == i, 'L2'] = df_wide.loc[(((df_wide['tid_sales'] == i) | (
            df_wide['tid_sales'] == i - 1)) & ((df_wide['tid_var'] == i - 1) | (
            df_wide['tid_var'] == i - 2))), 'val'].sum()
    else:
        df_wide.loc[df_wide['tid_sales'] == i, 'L2'] = df_wide.loc[(((df_wide['tid_sales'] == i) | (
            df_wide['tid_sales'] == i - 1)) & ((df_wide['tid_var'] == i) | (
            df_wide['tid_var'] == i - 1))), 'val'].sum()
print(df_wide)
# 5.back to shape of df with L2(reshape wide, in python: pd.pivot)
df_final = df_wide.drop(columns=df_wide.filter(regex='^tid').columns) # columns starting with tid are no longer needed
df_final = pd.pivot(df_final, index=['sales', 'L2'], columns='var', values='val').reset_index()
df_final = order(df_final, 'L2', f_or_l='last') # order function is made by me
print(df_final)
I have a dataset of four years' worth of ACT participation percentages by state entitled 'part_ACT'. Here's a snippet of it:
Index State ACT17 ACT18 ACT19 ACT20
0 Alabama 100 100 100 100
1 Alaska 65 33 38 33
2 Arizona 62 66 73 71
3 Arkansas 100 100 100 100
4 California 31 27 23 19
5 Colorado 100 30 27 25
6 Connecticut 31 26 22 19
I'm trying to produce a line graph with each of the four column headings on the x-axis and their values on the y-axis (1-100). I would prefer to display all of these lines in a single figure.
What's the easiest way to do this? I'm fine with Pandas, Matplotlib, Seaborn, or whatever. Thanks much!
One solution is to melt the df and plot with hue
import numpy as np
import pandas as pd
import seaborn as sns
df = pd.DataFrame({
    'State': ['A', 'B', 'C', 'D'],
    'x18': sorted(np.random.randint(0, 100, 4)),
    'x19': sorted(np.random.randint(0, 100, 4)),
    'x20': sorted(np.random.randint(0, 100, 4)),
    'x21': sorted(np.random.randint(0, 100, 4)),
})
df_melt = df.melt(id_vars='State', var_name='year')
sns.relplot(
    kind='line',
    data=df_melt,
    x='year', y='value',
    hue='State'
)
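To see what melt does to the shape before plotting, here is a small sketch with fixed numbers. It uses two states and two years from the question's table. The variable names are illustrative only.

```python
import pandas as pd

# Wide form: one row per state, one column per year.
df_wide = pd.DataFrame({
    'State': ['Alabama', 'Alaska'],
    'ACT17': [100, 65],
    'ACT18': [100, 33],
})

# Long form: one row per (state, year) pair; the values land in a 'value' column.
df_long = df_wide.melt(id_vars='State', var_name='year')
print(df_long)
```

The long frame has the columns State, year and value, which is the layout the x, y and hue arguments expect.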
Creating a plot is all about the shape of the DataFrame.
One way to accomplish this is by converting the DataFrame from wide to long, with melt, but this isn't necessary.
The primary requirement is to set 'State' as the index.
Plots can be generated directly with df, or df.T (.T is the transpose of the DataFrame).
The OP requests a line plot, but this is discrete data, and the correct way to visualize discrete data is with a bar plot, not a line plot.
pandas v1.2.3, seaborn v0.11.1, and matplotlib v3.3.4
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
data = {'State': ['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado', 'Connecticut'],
'ACT17': [100, 65, 62, 100, 31, 100, 31],
'ACT18': [100, 33, 66, 100, 27, 30, 26],
'ACT19': [100, 38, 73, 100, 23, 27, 22],
'ACT20': [100, 33, 71, 100, 19, 25, 19]}
df = pd.DataFrame(data)
# set State as the index - this is important
df.set_index('State', inplace=True)
# display(df)
ACT17 ACT18 ACT19 ACT20
State
Alabama 100 100 100 100
Alaska 65 33 38 33
Arizona 62 66 73 71
Arkansas 100 100 100 100
California 31 27 23 19
Colorado 100 30 27 25
Connecticut 31 26 22 19
# display(df.T)
State Alabama Alaska Arizona Arkansas California Colorado Connecticut
ACT17 100 65 62 100 31 100 31
ACT18 100 33 66 100 27 30 26
ACT19 100 38 73 100 23 27 22
ACT20 100 33 71 100 19 25 19
Plot 1
Use pandas.DataFrame.plot
df.T.plot()
plt.legend(title='State', bbox_to_anchor=(1.05, 1), loc='upper left')
# get rid of the ticks between the labels - not necessary
plt.xticks(ticks=range(0, len(df.T)))
plt.show()
Plot 2 & 3
Use pandas.DataFrame.plot with kind='bar' or kind='barh'
The bar plot is much better at conveying the yearly changes in the data, and allows for an easy comparison between states.
df.plot(kind='bar')
plt.legend(title='Year', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()
[Figures: the bar plot rendered with kind='bar' and with kind='barh']
Plot 4
Use seaborn.lineplot
Will correctly plot a line plot from a wide dataframe with the columns and index labels.
sns.lineplot(data=df.T)
plt.legend(title='State', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()
I've got a pandas dataframe of golfers' round scores going back to 2003 (approx 300000 rows). It looks something like this:
Date----Golfer---Tournament-----Score---Player Total Rounds Played
2008-01-01---Tiger Woods----Invented Tournament R1---72---50
2008-01-01---Phil Mickelson----Invented Tournament R1---73---108
I want the 'Player Total Rounds Played' column to be a running total of the number of rounds (i.e. instance in the dataframe) that a player has played up to that date. Is there a quick way of doing it? My current solution (basically using iterrows and then a one-line function) works fine but will take approx 11hrs to run.
Thanks,
Tom
Here is one way:
df = df.sort_values('Date')
df['Rounds CumSum'] = df.groupby('Golfer')['Rounds'].cumsum()
For example:
import pandas as pd
df = pd.DataFrame([['A', 70, 50],
['B', 72, 55],
['A', 73, 45],
['A', 71, 60],
['B', 74, 55],
['A', 72, 65]],
columns=['Golfer', 'Rounds', 'Played'])
df['Rounds CumSum'] = df.groupby('Golfer')['Rounds'].cumsum()
# Golfer Rounds Played Rounds CumSum
# 0 A 70 50 70
# 1 B 72 55 72
# 2 A 73 45 143
# 3 A 71 60 214
# 4 B 74 55 146
# 5 A 72 65 286
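Note that cumsum adds up the values in a column. If what you need is a running count of rows per golfer, as the question describes, groupby(...).cumcount() may be closer. The sketch below assumes the question's column names (Date, Golfer, Score) and uses invented rows. Adding 1 makes the count include the current round, so drop it if the round being played should not count.

```python
import pandas as pd

# Invented sample rows using the column names from the question.
df = pd.DataFrame({
    'Date': pd.to_datetime(['2008-01-02', '2008-01-01', '2008-01-01', '2008-01-03']),
    'Golfer': ['Tiger Woods', 'Tiger Woods', 'Phil Mickelson', 'Tiger Woods'],
    'Score': [70, 72, 73, 71],
})

# Sort chronologically first; mergesort is stable, so ties keep their original order.
df = df.sort_values('Date', kind='mergesort')
# cumcount numbers each golfer's rows 0, 1, 2, ... in the current row order.
df['Player Total Rounds Played'] = df.groupby('Golfer').cumcount() + 1
print(df)
```

This stays vectorised, so it avoids the row-by-row iterrows loop.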
I have a database stored as a plain text file named DB.TXT (a tab delimiter is applied only to the numbers), like this:
Date Id I II III IV V
17-jan-13 aa 47 56 7 74 58
18-jan-13 ab 86 2 30 40 75
19-jan-13 ac 72 64 41 81 80
20-jan-13 ad 51 26 43 61 32
21-jan-13 ae 31 62 32 25 75
22-jan-13 af 60 83 18 35 5
23-jan-13 ag 29 8 47 12 69
I would like to know the Python code to skip the first line (Date, Id, I, II, III, IV, V) and the first two columns (Date and Id) while reading the text file. (The remaining numbers will be used for sums, multiplications, etc.)
After reading the txt file, it will appear like this:
47 56 7 74 58
86 2 30 40 75
72 64 41 81 80
51 26 43 61 32
31 62 32 25 75
60 83 18 35 5
29 8 47 12 69
The file is in txt format, not CSV.
If you are only going to do calculations on the rows, you can simply do:
with open("data.txt") as fh:
    next(fh) # skip the header line
    for line in fh:
        line = line.split() # This split works equally well for tabs and other spaces
        do_something(line[2:])
If your needs are more complex, you're better off using a library like Pandas, which can take care of headers and label columns, as well as regex delimiters, and gives you easy access to columns:
import pandas
data = pandas.read_csv("blah.txt", sep=r"\s+", index_col=[0,1])
data.values # array of values as requested
data.sum() # sum of each column
data.product(axis=1) # product of each row
etc...
sep is a regex since you said it's not always \t, and index_col makes the first two columns the row labels (the index).
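As a quick check of the pandas approach without touching a real file, the sketch below feeds the first three data rows from the question through io.StringIO. The in-memory text stands in for the file, and nothing else is assumed.

```python
import io
import pandas as pd

# First rows of the question's file; io.StringIO stands in for the real file.
text = """Date Id I II III IV V
17-jan-13 aa 47 56 7 74 58
18-jan-13 ab 86 2 30 40 75
19-jan-13 ac 72 64 41 81 80
"""

# Any run of whitespace separates fields; Date and Id become the row index.
data = pd.read_csv(io.StringIO(text), sep=r"\s+", index_col=[0, 1])
print(data.values)           # only the numeric columns I..V remain
print(data.sum())            # sum of each column
print(data.product(axis=1))  # product of each row
```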
"the code in python" is pretty broad. Using numpy, it's:
In [21]: np.genfromtxt('db.txt',dtype=None,skip_header=1,usecols=range(2,7))
Out[21]:
array([[47, 56,  7, 74, 58],
       [86,  2, 30, 40, 75],
       [72, 64, 41, 81, 80],
       [51, 26, 43, 61, 32],
       [31, 62, 32, 25, 75],
       [60, 83, 18, 35,  5],
       [29,  8, 47, 12, 69]])
Using the csv module, to skip the first line, just advance the file iterator by calling next(f). To skip the first two columns you could use row = row[2:]:
import csv
with open(filename, newline='') as f:
    next(f) # skip the first line
    for row in csv.reader(f, delimiter='\t'):
        row = row[2:] # skip the first two columns
        row = list(map(int, row)) # convert the strings to ints
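For the sums and multiplications mentioned in the question, a standard-library-only sketch could look like the following. It splits on any whitespace rather than on tabs, uses io.StringIO in place of the real file, and the names rows, row_sums and row_products are illustrative. math.prod needs Python 3.8 or later.

```python
import io
import math

# Two rows from the question; io.StringIO stands in for the real file.
text = ("Date Id I II III IV V\n"
        "17-jan-13 aa 47 56 7 74 58\n"
        "18-jan-13 ab 86 2 30 40 75\n")

rows = []
with io.StringIO(text) as fh:
    next(fh)                     # skip the header line
    for line in fh:
        fields = line.split()    # split on any run of spaces or tabs
        rows.append([int(x) for x in fields[2:]])  # drop Date and Id

row_sums = [sum(r) for r in rows]
row_products = [math.prod(r) for r in rows]
print(rows)
print(row_sums)
print(row_products)
```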