Using python apply function to add columns to dataframe?

Let's say I have the following dataframe:
fix_id lg home_team away_team
9887 30 Leganes Alaves
9886 30 Valencia Las Palmas
9885 30 Celta Vigo Real Sociedad
9884 30 Girona Atletico Madrid
and I run an apply function over all the rows of the dataframe. The output of the apply function is the following pandas series:
9887 ({'defense': '74', 'midfield': '75', 'attack': '74', 'overall': '75'},
{'defense': '74', 'midfield': '75', 'attack': '77', 'overall': '75'}),
9886 ({'defense': '80', 'midfield': '80', 'attack': '80', 'overall': '80'},
{'defense': '75', 'midfield': '74', 'attack': '77', 'overall': '75'}),
...
How could I add the output dictionaries as new columns to my dataframe? I want to add all eight values to the same row.
I will be glad to get any guidance, not necessarily code. Maybe just tell me how to approach it, and I will try?
Thanks.

Supposing your output is stored in a Series s, you can do the following:
pd.concat([df, s.apply(pd.Series)[0].apply(pd.Series), s.apply(pd.Series)[1].apply(pd.Series)], axis=1)
Example:
import pandas as pd
df = pd.DataFrame({'lg': {9887: 30, 9886: 30, 9885: 30, 9884: 30}, 'home_team': {9887: 'Leganes', 9886: 'Valencia', 9885: 'Celta Vigo', 9884: 'Girona'}, 'away_team': {9887: 'Alaves', 9886: 'Las Palmas', 9885: 'Real Sociedad', 9884: 'Atletico Madrid'}})
s = pd.Series({9887: ({'defense': '74', 'midfield': '75', 'attack': '74', 'overall': '75'}, {'defense': '74', 'midfield': '75', 'attack': '77', 'overall': '75'}), 9886: ({'defense': '80', 'midfield': '80', 'attack': '80', 'overall': '80'}, {'defense': '75', 'midfield': '74', 'attack': '77', 'overall': '75'})})
print(df)
# lg home_team away_team
#9887 30 Leganes Alaves
#9886 30 Valencia Las Palmas
#9885 30 Celta Vigo Real Sociedad
#9884 30 Girona Atletico Madrid
print(s)
#9887 ({'defense': '74', 'midfield': '75', 'attack':...
#9886 ({'defense': '80', 'midfield': '80', 'attack':...
#dtype: object
df = pd.concat([df, s.apply(pd.Series)[0].apply(pd.Series), s.apply(pd.Series)[1].apply(pd.Series)], axis=1)
# lg home_team away_team defense ... defense midfield attack overall
#9884 30 Girona Atletico Madrid NaN ... NaN NaN NaN NaN
#9885 30 Celta Vigo Real Sociedad NaN ... NaN NaN NaN NaN
#9886 30 Valencia Las Palmas 80 ... 75 74 77 75
#9887 30 Leganes Alaves 74 ... 74 75 77 75
#[4 rows x 11 columns]
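Note that the concat above produces two sets of identically named columns (defense, midfield, attack, overall). As a rough sketch of one way around that, assuming the first dictionary in each tuple belongs to the home team and the second to the away team (the question does not say so explicitly), you could build one frame per side and prefix the column names. The home_/away_ prefixes and the int conversion are illustrative choices, not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame(
    {
        'lg': [30, 30],
        'home_team': ['Leganes', 'Valencia'],
        'away_team': ['Alaves', 'Las Palmas'],
    },
    index=[9887, 9886],
)
s = pd.Series({
    9887: ({'defense': '74', 'midfield': '75', 'attack': '74', 'overall': '75'},
           {'defense': '74', 'midfield': '75', 'attack': '77', 'overall': '75'}),
    9886: ({'defense': '80', 'midfield': '80', 'attack': '80', 'overall': '80'},
           {'defense': '75', 'midfield': '74', 'attack': '77', 'overall': '75'}),
})

# One frame per tuple position; the prefixes (an assumption) keep the names unique.
home = pd.DataFrame([pair[0] for pair in s], index=s.index).astype(int).add_prefix('home_')
away = pd.DataFrame([pair[1] for pair in s], index=s.index).astype(int).add_prefix('away_')

# Joining on the index lines each rating up with its fixture row.
result = df.join(home).join(away)
print(result)
```

Rows of df that have no entry in s would simply get NaN in the new columns, as in the concat version.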

Try something like this:
def mymethod(row):
    # Here whatever operation you have in mind, for example summing two columns of the row:
    return row['A']+row['B']
df['newCol'] = df.apply(lambda row: mymethod(row), axis=1)

df.merge(df.textcol.apply(lambda s: pd.Series({'feature1':s+1, 'feature2':s-1})),
left_index=True, right_index=True)
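To make the one-liner above easier to follow, here is a small self-contained sketch of the same idea: the function passed to apply returns a pd.Series, so the result is a DataFrame with one column per key, which is then merged back on the index. The sample values and the helper name make_features are made up for illustration; textcol is assumed to be numeric, since the original adds and subtracts 1:

```python
import pandas as pd

df = pd.DataFrame({'textcol': [10, 20, 30]})

def make_features(value):
    # Returning a Series makes apply expand the result into columns.
    return pd.Series({'feature1': value + 1, 'feature2': value - 1})

features = df['textcol'].apply(make_features)

# Both frames share the same index, so an index-on-index merge lines them up.
result = df.merge(features, left_index=True, right_index=True)
print(result)
```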

Related

I am outputting an altered tsv file to a new tsv file but the format is wrong

I've been working on a lab project that reads the tsv values of a "grade book", runs some calculations, and writes the results to a new report.txt file. When I print report.txt at the end, the file isn't the row-by-row tsv I expect; it is one giant list instead.
So my question is how can I convert my list into a proper line by line tsv like below?
Barrett Edan 70 45 59 F
Bradshaw Reagan 96 97 88 A
Charlton Caius 73 94 80 B
Mayo Tyrese 88 61 36 D
Stern Brenda 90 86 45 C
Averages: midterm1 83.40, midterm2 76.60, final 61.60
My input is shown in the commented-out #StudentInfo.tsv= lines in the code below.
My current output is listed below the code.
# TODO: Declare any necessary variables here.
import csv
studentgrades=[]
s_grades=[]
all_grades=[]
all_grades_computed=[]
# TODO: Read a file name from the user and read the tsv file here.
#StudentInfo.tsv= Barrett Edan 70 45 59
# Bradshaw Reagan 96 97 88
# Charlton Caius 73 94 80
# Mayo Tyrese 88 61 36
# Stern Brenda 90 86 45
with open('StudentInfo.tsv', 'r') as file:
    studentgrades = csv.reader(file, delimiter='\t')
    sgrades=list(studentgrades)
for row in sgrades:
    avg2=0
    rowstr=' '
    for i in row[2:]:
        avg2+=int(i)
        #print(i)
        #print(avg2)
    studentavg=float(int(avg2)/len(row[2:]))
    if studentavg >= 90:
        s_grades.append('A')
    elif 80<=studentavg<90:
        s_grades.append('B')
    elif 70<=studentavg<80:
        s_grades.append('C')
    elif 60<=studentavg<70:
        s_grades.append('D')
    else:
        s_grades.append('F')
    print('{} average: {:.2f}'.format(rowstr.join(row[0:2]), studentavg))
    #print('test length',len(row[2:]))
    #print(row[2:])
print()
print(s_grades)
print(sgrades)
# for i in s_grades:
# sgrades.append(i)
# print(sgrades)
for i in range(len(s_grades)):
    all_grades.append(str(sgrades[i])+str(s_grades[i]))
print(all_grades)
print()
# TODO: Compute student grades and exam averages, then output results to a text file here.
student1=sgrades[0]
student2=sgrades[1]
student3=sgrades[2]
student4=sgrades[3]
student5=sgrades[4]
print('Averages:',end=' ')
midterm1=(int(student1[2])+int(student2[2])+int(student3[2])+int(student4[2])+int(student5[2]))/len(sgrades)
#print(student1[2])
print('midterm1','{:.2f},'.format(midterm1),end=' ')
midterm2=(int(student1[3])+int(student2[3])+int(student3[3])+int(student4[3])+int(student5[3]))/len(sgrades)
#print(student1[3])
print('midterm2','{:.2f},'.format(midterm2),end=' ')
final=(int(student1[4])+int(student2[4])+int(student3[4])+int(student4[4])+int(student5[4]))/len(sgrades)
print('final','{:.2f}'.format(final))
all_grades_computed=['Averages:','midterm1','{:.2f},'.format(midterm1),'midterm2','{:.2f},'.format(midterm2), 'final','{:.2f}'.format(final)]
with open('report.txt', 'w+') as report:
    csv_writer=csv.writer(report, delimiter='\t')
    csv_writer.writerow(all_grades)
    csv_writer.writerow(all_grades_computed)
with open('report.txt','r') as report:
    reports=csv.reader(report, delimiter='\t')
    reportlist=list(reports)
    print(reportlist)
Barrett Edan average: 58.00
Bradshaw Reagan average: 93.67
Charlton Caius average: 82.33
Mayo Tyrese average: 61.67
Stern Brenda average: 73.67
['F', 'A', 'B', 'D', 'C']
[['Barrett', 'Edan', '70', '45', '59'], ['Bradshaw', 'Reagan', '96', '97', '88'], ['Charlton', 'Caius', '73', '94', '80'], ['Mayo', 'Tyrese', '88', '61', '36'], ['Stern', 'Brenda', '90', '86', '45']]
["['Barrett', 'Edan', '70', '45', '59']F", "['Bradshaw', 'Reagan', '96', '97', '88']A", "['Charlton', 'Caius', '73', '94', '80']B", "['Mayo', 'Tyrese', '88', '61', '36']D", "['Stern', 'Brenda', '90', '86', '45']C"]
Averages: midterm1 83.40, midterm2 76.60, final 61.60
[["['Barrett', 'Edan', '70', '45', '59']F", "['Bradshaw', 'Reagan', '96', '97', '88']A", "['Charlton', 'Caius', '73', '94', '80']B", "['Mayo', 'Tyrese', '88', '61', '36']D", "['Stern', 'Brenda', '90', '86', '45']C"], ['Averages:', 'midterm1', '83.40,', 'midterm2', '76.60,', 'final', '61.60']]
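The output above suggests two things: str(sgrades[i]) turns each row into the text of a Python list, and a single writerow(all_grades) call puts every student on one line. A minimal sketch of one possible fix, assuming the rows are already parsed as in the question, is to append the letter grade to each row's list and call writerow once per student. The helper names letter_grade and build_report are hypothetical, and the sketch writes to an in-memory buffer instead of report.txt so it can run anywhere:

```python
import csv
import io

# Sample rows standing in for the parsed StudentInfo.tsv (values from the question).
students = [
    ['Barrett', 'Edan', '70', '45', '59'],
    ['Bradshaw', 'Reagan', '96', '97', '88'],
    ['Charlton', 'Caius', '73', '94', '80'],
    ['Mayo', 'Tyrese', '88', '61', '36'],
    ['Stern', 'Brenda', '90', '86', '45'],
]

def letter_grade(average):
    """Map a numeric average to a letter using the question's cut-offs."""
    if average >= 90:
        return 'A'
    if average >= 80:
        return 'B'
    if average >= 70:
        return 'C'
    if average >= 60:
        return 'D'
    return 'F'

def build_report(rows):
    """Return the report text: one tab-separated line per student, then the averages."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter='\t', lineterminator='\n')
    for row in rows:
        scores = [int(score) for score in row[2:]]
        # Extend the list of fields instead of converting the whole row to a string.
        writer.writerow(row + [letter_grade(sum(scores) / len(scores))])
    # Column-wise means for the three exams.
    exam_means = [sum(int(row[i]) for row in rows) / len(rows) for i in (2, 3, 4)]
    buffer.write('Averages: midterm1 {:.2f}, midterm2 {:.2f}, final {:.2f}\n'.format(*exam_means))
    return buffer.getvalue()

report_text = build_report(students)
print(report_text, end='')
```

To write a real file, the same writer calls should work on a file opened with open('report.txt', 'w', newline='').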

Is there a way to do this in Python?

I have a data frame that looks like this:
data = {'State': ['24', '24', '24',
'24','24','24','24','24','24','24','24','24'],
'County code': ['001', '001', '001',
'001','002','002','002','002','003','003','003','003'],
'TT code': ['123', '123', '123',
'123','124','124','124','124','125','125','125','125'],
'BLK code': ['221', '221', '221',
'221','222','222','222','222','223','223','223','223'],
'Age Code': ['1', '1', '2', '2','2','2','2','2','2','1','2','1']}
df = pd.DataFrame(data)
Essentially, I want to keep only the TT codes where every age code is 2 and there are no 1's. So I just want to have the data frame where:
'State': ['24', '24', '24', '24'],
'County code': ['002','002','002','002',],
'TT code': ['124','124','124','124',],
'BLK code': ['222','222','222','222'],
'Age Code': ['2','2','2','2']
Is there a way to do this?

IIUC, you want to keep only the TT groups where there are only Age groups with value '2'?
You can use a groupby.transform('all') on the boolean Series:
df[df['Age Code'].eq('2').groupby(df['TT code']).transform('all')]
output:
State County code TT code BLK code Age Code
4 24 002 124 222 2
5 24 002 124 222 2
6 24 002 124 222 2
7 24 002 124 222 2
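To see what the one-liner does, here is a small sketch that splits it into steps on a reduced version of the data (only the two relevant columns, with fewer rows than the question). The intermediate names is_two and keep are just for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'TT code':  ['123', '123', '124', '124', '125', '125'],
    'Age Code': ['1', '2', '2', '2', '2', '1'],
})

# Step 1: True where the row's age code is '2'.
is_two = df['Age Code'].eq('2')

# Step 2: within each TT code, broadcast whether *all* rows are True.
keep = is_two.groupby(df['TT code']).transform('all')

# Step 3: use the row-aligned boolean mask to filter.
filtered = df[keep]
print(filtered)
```

Because transform returns a result aligned to the original rows, the mask can be used directly for boolean indexing.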

This should work.
df111['Age Code'] = "2"
I am just wondering why strings were chosen for values that are really integers.

Pandas - Resort Column Location

I have a pandas frame. When I print the columns (shown below), it turns out that my columns are out of order. Is there a way to sort only the first 30 columns so they are in order (30,60,90...900)?
[in] df.columns
[out] Index(['120', '150', '180', '210', '240', '270', '30', '300', '330', '360',
'390', '420', '450', '480', '510', '540', '570', '60', '600', '630',
'660', '690', '720', '750', '780', '810', '840', '870', '90', '900',
'Item', 'Price', 'Size', 'Time', 'Type', 'Unnamed: 0'],
dtype='object')
The fixed frame would be as follows:
[out] Index(['30', '60', '90', '120', '150', '180', '210', '240', '270', '300', '330', '360',
'390', '420', '450', '480', '510', '540', '570','600', '630',
'660', '690', '720', '750', '780', '810', '840', '870','900',
'Item', 'Price', 'Size', 'Time', 'Type', 'Unnamed: 0'],
dtype='object')

If you know that the columns will be named 30 through 900 in multiples of 30, you can generate that explicitly like this:
c = [str(i) for i in range(30, 901, 30)]
Then add it to the other columns:
c = c + ['Item', 'Price', 'Size', 'Time', 'Type', 'Unnamed: 0']
Then you should be able to access it as df[c]

You need to select the first 30 column names, convert them to int and sort. Then convert back to str if necessary and use reindex (reindex_axis in pandas versions before 1.0):
np.sort(df.columns[:30].astype(int)).astype(str).tolist() +
df.columns[30:].tolist()
Sample:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(36).reshape(1,-1),
columns=['120', '150', '180', '210', '240', '270', '30', '300',
'330', '360','390', '420', '450', '480', '510', '540', '570', '60', '600', '630',
'660', '690', '720', '750', '780', '810', '840', '870', '90', '900',
'Item', 'Price', 'Size', 'Time', 'Type', 'Unnamed: 0'])
print (df)
120 150 180 210 240 270 30 300 330 360 ... 840 870 90 \
0 0 1 2 3 4 5 6 7 8 9 ... 26 27 28
900 Item Price Size Time Type Unnamed: 0
0 29 30 31 32 33 34 35
[1 rows x 36 columns]
df = df.reindex(columns=np.sort(df.columns[:30].astype(int)).astype(str).tolist() +
                df.columns[30:].tolist())
print (df)
30 60 90 120 150 180 210 240 270 300 ... 810 840 870 \
0 6 17 28 0 1 2 3 4 5 7 ... 25 26 27
900 Item Price Size Time Type Unnamed: 0
0 29 30 31 32 33 34 35
[1 rows x 36 columns]
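If you would rather not depend on the numeric columns being exactly the first 30, a possible variation is to pick them out by name and sort them with key=int. This is only a sketch on a tiny made-up frame, and it assumes the numeric column names consist purely of digits:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4, 5]], columns=['120', '30', '90', 'Item', '60'])

# Digit-only names are sorted by their integer value, not alphabetically.
numeric_cols = sorted((c for c in df.columns if c.isdigit()), key=int)
# Everything else keeps its original relative order.
other_cols = [c for c in df.columns if not c.isdigit()]

df = df[numeric_cols + other_cols]
print(df)
```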

BeautifulSoup - Scraping Multiple Tables from a page?

I'm trying to scrape the content from this URL which contains multiple tables. The desired output would be:
NAME FG% FT% 3PM REB AST STL BLK TO PTS SCORE
Team Jackson (0-8) .4313 .7500 21 71 34 11 12 15 189 1-8-0
Team Keyrouze (4-4) .4441 .8090 31 130 71 18 13 45 373 8-1-0
Nutz Vs. Draymond Green (4-4) .4292 .8769 30 86 66 15 9 28 269 3-6-0
Team Pauls 2 da Wall (3-5) .4784 .8438 40 123 64 18 20 30 316 6-3-0
Team Noey (2-6) .4350 .7679 21 125 62 20 9 33 278 7-2-0
YOU REACH, I TEACH (2-5-1) .4810 .7432 20 114 56 30 7 50 277 2-7-0
Kris Kaman His Pants (5-3) .4328 .8000 20 74 59 20 5 27 238 3-6-0
Duke's Balls In Daniels Face (3-4-1) .5000 .7045 42 139 38 27 22 30 303 6-3-0
Knicks Tape (5-3) .5000 .8152 34 143 92 12 9 47 397 4-5-0
Suck MyDirk (5-3) .4734 .8814 29 106 86 22 17 40 435 5-4-0
In Porzingod We Trust (4-4) .4928 .7222 27 180 95 16 16 46 423 7-2-0
Team Aguilar (6-1-1) .4718 .7053 28 177 65 12 35 48 413 2-7-0
Team Li (7-0-1) .4714 .8118 35 134 74 17 17 47 368 6-3-0
Team Iannetta (4-4) .4527 .7302 22 125 90 20 13 44 288 3-6-0
If it's too difficult to format the tables like that, I'd like to know how I can scrape all the tables? My code to scrape all rows is like this:
tableStats = soup.find('table', {'class': 'tableBody'})
rows = tableStats.findAll('tr')
for row in rows:
    print(row.string)
But it only prints the value "TEAM" and nothing else... Why doesn't it contain all the rows in the table?
Thanks.

Instead of looking for the table tag, you should look for the rows directly with a more dependable class, such as linescoreTeamRow. This code snippet does the trick:
from bs4 import BeautifulSoup
import requests
a = requests.get("http://games.espn.com/fba/scoreboard?leagueId=224165&seasonId=2017")
soup = BeautifulSoup(a.text, 'lxml')
# searching for the rows directly
rows = soup.findAll('tr', {'class': 'linescoreTeamRow'})
# you will need to isolate elements in the row for the table
for row in rows:
    print(row.text)
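As for why the original loop printed so little: in BeautifulSoup, .string only returns text when a tag has a single child, and gives None otherwise, so a row with several cells yields None; also, find returns only the first matching table. The sketch below illustrates this on a small hand-written HTML fragment (the markup is an assumption and far simpler than the real page) and uses the built-in html.parser so it does not need lxml:

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for the real page.
html = """
<table class="tableBody">
  <tr><th>TEAM</th></tr>
  <tr class="linescoreTeamRow"><td>Team Li (7-0-1)</td><td>.4714</td><td>.8118</td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')
header_row, team_row = soup.find_all('tr')

# A row with exactly one cell: .string drills down to that cell's text.
header_string = header_row.string
# A row with several cells: .string is ambiguous, so it is None.
team_string = team_row.string
# Reading each cell separately recovers the values.
team_cells = [cell.get_text(strip=True) for cell in team_row.find_all('td')]

print(header_string, team_string, team_cells)
```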

I found a way to get exactly the 2-D matrix I specified in the question. It's stored as the list teams.
Code:
from bs4 import BeautifulSoup
import requests
source_code = requests.get("http://games.espn.com/fba/scoreboard?leagueId=224165&seasonId=2017")
plain_text = source_code.text
soup = BeautifulSoup(plain_text, 'lxml')
teams = []
rows = soup.findAll('tr', {'class': 'linescoreTeamRow'})
# Creates a 2-D matrix.
for row in rows:
    team_row = []
    columns = row.findAll('td')
    for column in columns:
        team_row.append(column.getText())
    print(team_row)
    # Add each team to a teams matrix.
    teams.append(team_row)
Output:
['Team Jackson (0-10)', '', '.4510', '.8375', '41', '135', '101', '23', '11', '50', '384', '', '5-4-0']
['YOU REACH, I TEACH (3-6-1)', '', '.4684', '.7907', '22', '169', '103', '22', '10', '32', '342', '', '4-5-0']
['Nutz Vs. Draymond Green (4-6)', '', '.4552', '.8372', '30', '157', '68', '15', '16', '39', '356', '', '2-7-0']
["Jesse's Blue Balls (4-5-1)", '', '.4609', '.7576', '47', '158', '71', '30', '20', '38', '333', '', '7-2-0']
['Team Noey (4-6)', '', '.4763', '.8261', '42', '164', '70', '25', '29', '44', '480', '', '5-4-0']
['Suck MyDirk (6-3-1)', '', '.4733', '.8403', '54', '160', '132', '23', '11', '47', '544', '', '4-5-0']
['Kris Kaman His Pants (5-5)', '', '.4569', '.8732', '53', '138', '105', '27', '21', '53', '465', '', '6-3-0']
['Team Aguilar (6-3-1)', '', '.4433', '.7229', '40', '202', '68', '30', '22', '54', '452', '', '3-6-0']
['Knicks Tape (6-3-1)', '', '.4406', '.8824', '52', '172', '108', '24', '13', '49', '513', '', '6-3-0']
['Team Iannetta (4-6)', '', '.5321', '.6923', '24', '146', '94', '32', '16', '60', '428', '', '3-6-0']
['In Porzingod We Trust (6-4)', '', '.4694', '.6364', '37', '216', '133', '31', '21', '77', '468', '', '4-5-0']
['Team Keyrouze (6-4)', '', '.4705', '.8854', '51', '135', '108', '25', '17', '43', '550', '', '5-4-0']
['Team Li (8-1-1)', '', '.4369', '.8182', '57', '203', '130', '34', '22', '54', '525', '', '6-3-0']
['Team Pauls 2 da Wall (5-5)', '', '.4780', '.5970', '27', '141', '47', '19', '25', '28', '263', '', '3-6-0']

Method to get two separate lists after parsing output

I want to get 2 separate lists from the following output:
>>> a = """
... ===================================================================
... IO Statistics
... Interval: 2.000 secs
... Column #0: COUNT(frame.time)frame.time
... | Column #0
... Time | COUNT
... 000.000-002.000 1921
... 002.000-004.000 2000
... 004.000-006.000 1999
... 006.000-008.000 1999
... 008.000-010.000 1995
... 010.000-012.000 1997
... 012.000-014.000 1999
... 014.000-016.000 2001
... 016.000-018.000 2004
... 018.000-020.000 1995
... 020.000-022.000 1997
... 022.000-024.000 2007
... 024.000-026.000 2003
... 026.000-028.000 1998
... 028.000-030.000 1995
... 030.000-032.000 1994
... 032.000-034.000 2001
... 034.000-036.000 2008
... 036.000-038.000 1996
... 038.000-040.000 1996
... 040.000-042.000 95
... ===================================================================
... """
Current code with output:
>>> print(re.findall(r'\s*(?P<first>\d+\.\d+)\-\d+\.\d+\s*(?P<id>\d+)\s*',a))
[('000.000', '1921'), ('002.000', '2000'), ('004.000', '1999'), ('006.000', '1999'), ('008.000', '1995'), ('010.000', '1997'), ('012.000', '1999'), ('014.000', '2001'), ('016.000', '2004'), ('018.000', '1995'), ('020.000', '1997'), ('022.000', '2007'), ('024.000', '2003'), ('026.000', '1998'), ('028.000', '1995'), ('030.000', '1994'), ('032.000', '2001'), ('034.000', '2008'), ('036.000', '1996'), ('038.000', '1996'), ('040.000', '95')]
Here I am getting one list of combined pairs, but the desired output is:
['0','2','4','6','8',...,'38','40'] -> 1st list
['1241', '1272', '1315', '1371', '1195', '1299', '1305', '1391', '1463', '1454', '1392', '1438', '1362', '1491', '1392', '1422', '1425', '1486', '1449', '1487', '1402', '1420', '1330', '1458', '1420', '144'] -> 2nd list
It would be helpful if someone could suggest a way to achieve the desired output.

Use zip(*..) to transpose your output to two separate lists:
lst1, lst2 = zip(*re.findall(r'\s*(?P<first>\d+\.\d+)\-\d+\.\d+\s*(?P<id>\d+)\s*',a))
To get just the integer portion of the values in lst1 you'd need to interpret them as floats first then map those back to just the rounded values:
lst1 = [format(float(i), '.0f') for i in lst1]
Demo:
>>> list(zip(*re.findall(r'\s*(?P<first>\d+\.\d+)\-\d+\.\d+\s*(?P<id>\d+)\s*',a)))
[('000.000', '002.000', '004.000', '006.000', '008.000', '010.000', '012.000', '014.000', '016.000', '018.000', '020.000', '022.000', '024.000', '026.000', '028.000', '030.000', '032.000', '034.000', '036.000', '038.000', '040.000'), ('1921', '2000', '1999', '1999', '1995', '1997', '1999', '2001', '2004', '1995', '1997', '2007', '2003', '1998', '1995', '1994', '2001', '2008', '1996', '1996', '95')]
>>> lst1, lst2 = zip(*re.findall(r'\s*(?P<first>\d+\.\d+)\-\d+\.\d+\s*(?P<id>\d+)\s*',a))
>>> [format(float(i), '.0f') for i in lst1]
['0', '2', '4', '6', '8', '10', '12', '14', '16', '18', '20', '22', '24', '26', '28', '30', '32', '34', '36', '38', '40']
>>> lst2
('1921', '2000', '1999', '1999', '1995', '1997', '1999', '2001', '2004', '1995', '1997', '2007', '2003', '1998', '1995', '1994', '2001', '2008', '1996', '1996', '95')
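For reference, a compact Python 3 sketch of the same approach on a shortened sample (three of the rows from the question, without the header lines). It converts both columns to integers, which goes slightly beyond the string lists asked for; the variable names are just for illustration:

```python
import re

# Shortened stand-in for the statistics text in the question.
sample = """
000.000-002.000 1921
002.000-004.000 2000
040.000-042.000 95
"""

# Capture the interval start and the count; the interval end is matched but ignored.
pattern = re.compile(r'(\d+\.\d+)-\d+\.\d+\s+(\d+)')
pairs = pattern.findall(sample)

# zip(*pairs) transposes the list of pairs into two tuples.
starts, counts = zip(*pairs)

start_seconds = [int(float(value)) for value in starts]
count_values = [int(value) for value in counts]

print(start_seconds)
print(count_values)
```

Note that the unpacking step raises a ValueError if the pattern matches nothing, so real input may need a check for an empty pairs list first.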
