I have the following pandas DataFrame (only one column is shown):
0 Atlantic Division
1 Tampa Bay Lightning*
2 Boston Bruins*
3 Toronto Maple Leafs*
4 Florida Panthers
5 Detroit Red Wings
6 Montreal Canadiens
7 Ottawa Senators
8 Buffalo Sabres
9 Metropolitan Division
10 Washington Capitals*
11 Pittsburgh Penguins*
12 Philadelphia Flyers*
13 Columbus Blue Jackets*
14 New Jersey Devils*
15 Carolina Hurricanes
16 New York Islanders
17 New York Rangers
18 Central Division
19 Nashville Predators*
20 Winnipeg Jets*
21 Minnesota Wild*
22 Colorado Avalanche*
23 St. Louis Blues
24 Dallas Stars
25 Chicago Blackhawks
26 Pacific Division
27 Vegas Golden Knights*
28 Anaheim Ducks*
29 San Jose Sharks*
30 Los Angeles Kings*
31 Calgary Flames
32 Edmonton Oilers
33 Vancouver Canucks
34 Arizona Coyotes
35 Atlantic Division
36 Montreal Canadiens*
37 Ottawa Senators*
38 Boston Bruins*
39 Toronto Maple Leafs*
40 Tampa Bay Lightning
41 Florida Panthers
42 Detroit Red Wings
43 Buffalo Sabres
44 Metropolitan Division
45 Washington Capitals*
46 Pittsburgh Penguins*
47 Columbus Blue Jackets*
48 New York Rangers*
49 New York Islanders
50 Philadelphia Flyers
51 Carolina Hurricanes
52 New Jersey Devils
53 Central Division
54 Chicago Blackhawks*
55 Minnesota Wild*
56 St. Louis Blues*
57 Nashville Predators*
58 Winnipeg Jets
59 Dallas Stars
60 Colorado Avalanche
61 Pacific Division
62 Anaheim Ducks*
63 Edmonton Oilers*
64 San Jose Sharks*
65 Calgary Flames*
66 Los Angeles Kings
67 Arizona Coyotes
68 Vancouver Canucks
69 Atlantic Division
70 Florida Panthers*
71 Tampa Bay Lightning*
72 Detroit Red Wings*
73 Boston Bruins
74 Ottawa Senators
75 Montreal Canadiens
76 Buffalo Sabres
77 Toronto Maple Leafs
78 Metropolitan Division
79 Washington Capitals*
80 Pittsburgh Penguins*
81 New York Rangers*
82 New York Islanders*
83 Philadelphia Flyers*
84 Carolina Hurricanes
85 New Jersey Devils
86 Columbus Blue Jackets
87 Central Division
88 Dallas Stars*
89 St. Louis Blues*
90 Chicago Blackhawks*
91 Nashville Predators*
92 Minnesota Wild*
93 Colorado Avalanche
94 Winnipeg Jets
95 Pacific Division
96 Anaheim Ducks*
97 Los Angeles Kings*
98 San Jose Sharks*
99 Arizona Coyotes
100 Calgary Flames
101 Vancouver Canucks
102 Edmonton Oilers
103 Atlantic Division
104 Montreal Canadiens*
105 Tampa Bay Lightning*
106 Detroit Red Wings*
107 Ottawa Senators*
108 Boston Bruins
109 Florida Panthers
110 Toronto Maple Leafs
111 Buffalo Sabres
112 Metropolitan Division
113 New York Rangers*
114 Washington Capitals*
115 New York Islanders*
116 Pittsburgh Penguins*
117 Columbus Blue Jackets
118 Philadelphia Flyers
119 New Jersey Devils
120 Carolina Hurricanes
121 Central Division
122 St. Louis Blues*
123 Nashville Predators*
124 Chicago Blackhawks*
125 Minnesota Wild*
126 Winnipeg Jets*
127 Dallas Stars
128 Colorado Avalanche
129 Pacific Division
130 Anaheim Ducks*
131 Vancouver Canucks*
132 Calgary Flames*
133 Los Angeles Kings
134 San Jose Sharks
135 Edmonton Oilers
136 Arizona Coyotes
137 Atlantic Division
138 Boston Bruins*
139 Tampa Bay Lightning*
140 Montreal Canadiens*
141 Detroit Red Wings*
142 Ottawa Senators
143 Toronto Maple Leafs
144 Florida Panthers
145 Buffalo Sabres
146 Metropolitan Division
147 Pittsburgh Penguins*
148 New York Rangers*
149 Philadelphia Flyers*
150 Columbus Blue Jackets*
151 Washington Capitals
152 New Jersey Devils
153 Carolina Hurricanes
154 New York Islanders
155 Central Division
156 Colorado Avalanche*
157 St. Louis Blues*
158 Chicago Blackhawks*
159 Minnesota Wild*
160 Dallas Stars*
161 Nashville Predators
162 Winnipeg Jets
163 Pacific Division
164 Anaheim Ducks*
165 San Jose Sharks*
166 Los Angeles Kings*
167 Phoenix Coyotes
168 Vancouver Canucks
169 Calgary Flames
170 Edmonton Oilers
Name: team, dtype: object
I need to create one additional column with the city name.
At first glance the regex looks simple: the first word should be the city name, and the rest is the team name.
However, some city names have two words (Los Angeles, St. Louis, etc.).
Is it possible to do this with a regex, or does it have to be done manually?
Update: I tried the following:
nhl_df['city']=nhl_df['team'].str.extract(r'^(?:([\w.]{1,5}\s\w+)|(\w+)|)(?:\s\w+)+\*?$')
But I get this error:
ValueError: Wrong number of items passed 2, placement implies 1
You can try something like this:
^(?:([\w.]{1,5}\s\w+)|(\w+)|)(?:\s\w+)+\*?$
Look for the city name in the first or the second capturing group.
This pattern assumes that the first word of a two-word city name has no more than 5 characters. The result might not be perfectly clean, but it seems to work fine on the given example.
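The ValueError in the question most likely comes from the two capturing groups: str.extract returns one column per group, so the result cannot be assigned to a single column directly. A minimal sketch of one way to merge the two groups, using a small hand-picked sample (the variable names are only illustrative):

```python
import pandas as pd

# Small sample taken from the team column in the question.
teams = pd.Series(
    ["Tampa Bay Lightning*", "Boston Bruins*", "St. Louis Blues", "Toronto Maple Leafs*"],
    name="team",
)

pattern = r"^(?:([\w.]{1,5}\s\w+)|(\w+)|)(?:\s\w+)+\*?$"

# Two capturing groups -> a DataFrame with two columns (0 and 1).
parts = teams.str.extract(pattern)

# Use group 1 (two-word city) where it matched, otherwise fall back to group 2.
city = parts[0].fillna(parts[1])
print(city.tolist())
```

The same fillna step should work on the full nhl_df['team'] column before assigning the result to nhl_df['city'].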
You can use
^([\w.]{1,5}(?:\s\w+)?\w*)
Details:
^ - start of string
([\w.]{1,5}(?:\s\w+)?\w*) - Capturing group 1:
[\w.]{1,5} - one to five word or dot chars
(?:\s\w+)? - an optional occurrence of a whitespace and then one or more word chars
\w* - zero or more word chars.
Pandas test:
import pandas as pd
nhl_df = pd.DataFrame({"team":["Atlantic Division","Tampa Bay Lightning*","Boston Bruins*","Toronto Maple Leafs*","Florida Panthers","Detroit Red Wings","Montreal Canadiens","Ottawa Senators","Buffalo Sabres","Metropolitan Division","Washington Capitals*","Pittsburgh Penguins*","Philadelphia Flyers*","Columbus Blue Jackets*","New Jersey Devils*","Carolina Hurricanes","New York Islanders","New York Rangers","Central Division","Nashville Predators*","Winnipeg Jets*","Minnesota Wild*","Colorado Avalanche*","St. Louis Blues","Dallas Stars","Chicago Blackhawks","Pacific Division","Vegas Golden Knights*","Anaheim Ducks*","San Jose Sharks*","Los Angeles Kings*","Calgary Flames","Edmonton Oilers","Vancouver Canucks","Arizona Coyotes","Atlantic Division","Montreal Canadiens*","Ottawa Senators*","Boston Bruins*","Toronto Maple Leafs*","Tampa Bay Lightning","Florida Panthers","Detroit Red Wings","Buffalo Sabres","Metropolitan Division","Washington Capitals*","Pittsburgh Penguins*","Columbus Blue Jackets*","New York Rangers*","New York Islanders","Philadelphia Flyers","Carolina Hurricanes","New Jersey Devils","Central Division","Chicago Blackhawks*","Minnesota Wild*","St. Louis Blues*","Nashville Predators*","Winnipeg Jets","Dallas Stars","Colorado Avalanche","Pacific Division","Anaheim Ducks*","Edmonton Oilers*","San Jose Sharks*","Calgary Flames*","Los Angeles Kings","Arizona Coyotes","Vancouver Canucks","Atlantic Division","Florida Panthers*","Tampa Bay Lightning*","Detroit Red Wings*","Boston Bruins","Ottawa Senators","Montreal Canadiens","Buffalo Sabres","Toronto Maple Leafs","Metropolitan Division","Washington Capitals*","Pittsburgh Penguins*","New York Rangers*","New York Islanders*","Philadelphia Flyers*","Carolina Hurricanes","New Jersey Devils","Columbus Blue Jackets","Central Division","Dallas Stars*","St. Louis Blues*","Chicago Blackhawks*","Nashville Predators*","Minnesota Wild*","Colorado Avalanche","Winnipeg Jets","Pacific Division","Anaheim Ducks*","Los Angeles Kings*","San Jose Sharks*","Arizona Coyotes","Calgary Flames","Vancouver Canucks","Edmonton Oilers","Atlantic Division","Montreal Canadiens*","Tampa Bay Lightning*","Detroit Red Wings*","Ottawa Senators*","Boston Bruins","Florida Panthers","Toronto Maple Leafs","Buffalo Sabres","Metropolitan Division","New York Rangers*","Washington Capitals*","New York Islanders*","Pittsburgh Penguins*","Columbus Blue Jackets","Philadelphia Flyers","New Jersey Devils","Carolina Hurricanes","Central Division","St. Louis Blues*","Nashville Predators*","Chicago Blackhawks*","Minnesota Wild*","Winnipeg Jets*","Dallas Stars","Colorado Avalanche","Pacific Division","Anaheim Ducks*","Vancouver Canucks*","Calgary Flames*","Los Angeles Kings","San Jose Sharks","Edmonton Oilers","Arizona Coyotes","Atlantic Division","Boston Bruins*","Tampa Bay Lightning*","Montreal Canadiens*","Detroit Red Wings*","Ottawa Senators","Toronto Maple Leafs","Florida Panthers","Buffalo Sabres","Metropolitan Division","Pittsburgh Penguins*","New York Rangers*","Philadelphia Flyers*","Columbus Blue Jackets*","Washington Capitals","New Jersey Devils","Carolina Hurricanes","New York Islanders","Central Division","Colorado Avalanche*","St. Louis Blues*","Chicago Blackhawks*","Minnesota Wild*","Dallas Stars*","Nashville Predators","Winnipeg Jets","Pacific Division","Anaheim Ducks*","San Jose Sharks*","Los Angeles Kings*","Phoenix Coyotes","Vancouver Canucks","Calgary Flames","Edmonton Oilers"]})
nhl_df['city']=nhl_df['team'].str.extract(r'^([\w.]{1,5}(?:\s\w+)?\w*)')
>>> nhl_df
team city
0 Atlantic Division Atlantic
1 Tampa Bay Lightning* Tampa Bay
2 Boston Bruins* Boston
3 Toronto Maple Leafs* Toronto
4 Florida Panthers Florida
.. ... ...
166 Los Angeles Kings* Los Angeles
167 Phoenix Coyotes Phoenix
168 Vancouver Canucks Vancouver
169 Calgary Flames Calgary
170 Edmonton Oilers Edmonton
^\S+(?=\s\S+$)
This regex gives you the first word of every team name that consists of exactly two words. The others you have to sort out manually, because the pattern alone cannot tell whether the middle word belongs to the city or to the team name.
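Since a pattern alone cannot decide where the city ends, one option is to keep a short hand-maintained list of multi-word place names and fall back to the first word otherwise. The sketch below is only an illustration: the list MULTI_WORD_PLACES is an assumption and has to be extended by hand. It also shows that a length-based pattern such as ^([\w.]{1,5}(?:\s\w+)?\w*) picks up "Vegas Golden" for "Vegas Golden Knights*", because "Vegas" is only five characters long.

```python
import re
import pandas as pd

# Hand-maintained list of multi-word place names (an assumption; extend as needed).
MULTI_WORD_PLACES = ["New York", "New Jersey", "Los Angeles", "San Jose", "St. Louis", "Tampa Bay"]

# Try the known multi-word names first, then fall back to the first word.
lookup_pattern = r"^(" + "|".join(re.escape(p) for p in MULTI_WORD_PLACES) + r"|\S+)"
# Length-based heuristic quoted from the thread, for comparison.
length_pattern = r"^([\w.]{1,5}(?:\s\w+)?\w*)"

teams = pd.Series(
    ["Vegas Golden Knights*", "St. Louis Blues", "Toronto Maple Leafs*", "New York Rangers*"],
    name="team",
)

by_length = teams.str.extract(length_pattern, expand=False)
by_lookup = teams.str.extract(lookup_pattern, expand=False)

print(pd.DataFrame({"team": teams, "by_length": by_length, "by_lookup": by_lookup}))
```

The lookup list costs a little maintenance, but it does not depend on how long the first word happens to be.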
Try using the regex below:
/(\d*\s*)([a-zA-Z\s]*)(\s)(\b([a-zA-Z\*]*))$/
Check this:
function Replace(str) {
var result = str.replace(/(\d*\s*)([a-zA-Z\s]*)(\s)(\b([a-zA-Z\*]*))$/gim, function (a, $1, $2, $3, $4) {
return `${$2}--${$4}`;
});
return result;
}
I've combined two DataFrames into one but can't figure out how to relabel "state_x" and "state_y" as "West Coast" and "East Coast". I will be plotting them later.
What I have so far:
West_quakes = pd.DataFrame({'state': ['California', 'Oregon', 'Washington', 'Alaska'],
'Occurrences': [18108, 376, 973, 12326]})
East_quakes = pd.DataFrame({'state': ['Maine', 'New Hampshire', 'Massachusetts',
'Connecticut', 'New York', 'New Jersey', 'Pennsylvania', 'Maryland',
'Virginia', 'North Carolina', 'South Carolina', 'Georgia', 'Florida'],
'Occurrences': [36, 13, 10, 5, 35, 10, 14, 2, 28, 17, 32, 14, 1]})
West_quakes.reset_index(drop=True).merge(East_quakes.reset_index(drop=True), left_index=True, right_index=True)
Output:
state_x Occurrences_x state_y Occurrences_y
0 California 18108 Maine 36
1 Oregon 376 New Hampshire 13
2 Washington 973 Massachusetts 10
3 Alaska 12326 Connecticut 5
Other merging methods I've tried result in a syntax error, such as:
West_quake.set_index('West Coast', inplace=True)
East_quake.set_index('East Coast', inplace=True)
I'm really lost after searching on Google and searching on here.
Any help would be greatly appreciated.
Thank you.
Maybe you are looking for concat instead:
pd.concat((West_quakes, East_quakes))
gives:
state Occurrences
0 California 18108
1 Oregon 376
2 Washington 973
3 Alaska 12326
0 Maine 36
1 New Hampshire 13
2 Massachusetts 10
3 Connecticut 5
4 New York 35
5 New Jersey 10
6 Pennsylvania 14
7 Maryland 2
8 Virginia 28
9 North Carolina 17
10 South Carolina 32
11 Georgia 14
12 Florida 1
Or:
pd.concat((West_quakes, East_quakes), keys=('West','East'))
which gives:
state Occurrences
West 0 California 18108
1 Oregon 376
2 Washington 973
3 Alaska 12326
East 0 Maine 36
1 New Hampshire 13
2 Massachusetts 10
3 Connecticut 5
4 New York 35
5 New Jersey 10
6 Pennsylvania 14
7 Maryland 2
8 Virginia 28
9 North Carolina 17
10 South Carolina 32
11 Georgia 14
12 Florida 1
Or:
pd.concat((West_quakes, East_quakes), axis=1, keys=('West','East'))
outputs:
West East
state Occurrences state Occurrences
0 California 18108.0 Maine 36
1 Oregon 376.0 New Hampshire 13
2 Washington 973.0 Massachusetts 10
3 Alaska 12326.0 Connecticut 5
4 NaN NaN New York 35
5 NaN NaN New Jersey 10
6 NaN NaN Pennsylvania 14
7 NaN NaN Maryland 2
8 NaN NaN Virginia 28
9 NaN NaN North Carolina 17
10 NaN NaN South Carolina 32
11 NaN NaN Georgia 14
12 NaN NaN Florida 1
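If the goal is to plot the two coasts later, a long table with an explicit label column is often easier to work with than a MultiIndex. The following is a minimal sketch under that assumption; the column name coast and the suffixes _west and _east are illustrative choices, not something from the question.

```python
import pandas as pd

West_quakes = pd.DataFrame({'state': ['California', 'Oregon', 'Washington', 'Alaska'],
                            'Occurrences': [18108, 376, 973, 12326]})
East_quakes = pd.DataFrame({'state': ['Maine', 'New Hampshire', 'Massachusetts',
                                      'Connecticut', 'New York', 'New Jersey', 'Pennsylvania',
                                      'Maryland', 'Virginia', 'North Carolina',
                                      'South Carolina', 'Georgia', 'Florida'],
                            'Occurrences': [36, 13, 10, 5, 35, 10, 14, 2, 28, 17, 32, 14, 1]})

# Long format: one row per state, plus a label column saying which coast it is on.
combined = pd.concat(
    [West_quakes.assign(coast='West Coast'), East_quakes.assign(coast='East Coast')],
    ignore_index=True,
)
totals = combined.groupby('coast')['Occurrences'].sum()

# Wide format: keep the index-aligned merge, but replace the default _x/_y suffixes.
side_by_side = West_quakes.merge(East_quakes, left_index=True, right_index=True,
                                 suffixes=('_west', '_east'))

print(totals)
print(side_by_side.columns.tolist())
```

Note that the index-aligned merge keeps only the four rows present in both frames, so the long format is probably the safer starting point for plotting.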
I have a text file that looks like this:
************************************************************************************************
English Premier Division - Saturday 25th May 2002
************************************************************************************************
================================================================================================
2001/2 Assists
================================================================================================
Pos Player Club Apps Asts
-------------------------------------------------------------------------
1st David Beckham Man Utd 29 15
2nd Dean Gordon Middlesbrough 30 (1) 11
3rd John Collins Fulham 32 11
4th Ryan Giggs Man Utd 32 11
5th Kieron Dyer Newcastle 33 10
6th Sean Davis Fulham 23 (1) 10
7th Damien Duff Blackburn 30 (3) 10
8th Alan Smith Leeds 23 (6) 9
9th Jesper Grønkjær Chelsea 34 9
10th Andrejs Stolcers Fulham 28 9
11th Ian Harte Leeds 37 8
12th Eidur Gudjohnsen Chelsea 28 (3) 8
13th Robert Pires Arsenal 24 (3) 7
14th Lauren Arsenal 32 (1) 7
15th John Robinson Charlton 33 7
16th Michael Gray Sunderland 37 7
17th Henrik Pedersen Bolton 36 7
18th Anders Svensson Southampton 34 (2) 7
19th Lee Bowyer Leeds 32 7
20th Craig Hignett Blackburn 21 (6) 7
21st Paul Merson Aston Villa 27 7
22nd Teddy Sheringham Tottenham 37 7
23rd Steed Malbranque Fulham 16 (14) 7
24th Marian Pahars Southampton 37 7
25th Muzzy Izzet Leicester 28 7
26th Sergei Rebrov Tottenham 36 (1) 7
27th Julio Arca Sunderland 32 (1) 7
28th Christian Bassedas Newcastle 37 7
29th Juan Sebastián Verón Man Utd 29 (2) 7
30th Joe Cole West Ham 32 6
I'm trying to read it into a pandas data frame like this:
df = pd.read_table('assist1.txt',
sep='\s+',
skiprows=6,
header=0,)
This code throws an exception - pandas.errors.ParserError: Error tokenizing data. C error: Expected 7 fields in line 31, saw 8.
I guess that's because of the space between the first and last name of the player (should be the value of the Player column).
Is there a way to achieve this?
Furthermore, it is a part of a larger text file that looks like this:
************************************************************************************************
English Premier Division - Saturday 25th May 2002
************************************************************************************************
================================================================================================
2001/2 Table
================================================================================================
Pos Team Pld Won Drn Lst For Ag Won Drn Lst For Ag Pts
--------------------------------------------------------------------------------------------------
1st C Man Utd 38 15 4 0 41 4 10 4 5 34 20 83
--------------------------------------------------------------------------------------------------
2nd Arsenal 38 15 2 2 38 9 11 3 5 28 14 83
3rd Leeds 38 15 4 0 33 8 9 4 6 36 37 80
4th Liverpool 38 13 4 2 25 7 9 2 8 26 24 72
5th Chelsea 38 16 1 2 44 18 4 5 10 24 33 66
6th Newcastle 38 11 5 3 40 23 7 3 9 25 33 62
7th Blackburn 38 11 3 5 36 24 5 5 9 23 30 56
8th Middlesbrough 38 9 7 3 31 19 5 6 8 20 29 55
9th Sunderland 38 8 5 6 31 30 8 2 9 22 25 55
10th West Ham 38 11 3 5 31 17 3 7 9 14 29 52
11th Tottenham 38 10 3 6 35 26 4 5 10 23 35 50
12th Leicester 38 7 5 7 23 20 6 4 9 26 28 48
13th Fulham 38 7 5 7 39 35 5 7 7 33 44 48
14th Ipswich 38 9 4 6 23 22 3 3 13 14 34 43
15th Charlton 38 5 5 9 18 26 5 4 10 16 30 39
16th Everton 38 8 4 7 30 28 1 5 13 11 36 36
17th Aston Villa 38 2 8 9 19 28 5 6 8 21 26 35
--------------------------------------------------------------------------------------------------
18th R Derby 38 6 4 9 25 28 3 3 13 14 39 34
19th R Southampton 38 5 7 7 34 34 1 4 14 12 35 29
20th R Bolton 38 6 3 10 25 31 1 4 14 15 40 28
================================================================================================
2001/2 Goals
================================================================================================
Pos Player Club Apps Gls
-------------------------------------------------------------------------
1st Thierry Henry Arsenal 34 25
2nd Alan Shearer Newcastle 36 25
3rd Ruud van Nistelrooy Man Utd 26 23
4th Steve Marlet Fulham 38 20
5th Jimmy Floyd Hasselbaink Chelsea 30 (1) 20
6th Les Ferdinand Sunderland 27 (2) 17
7th Kevin Phillips Sunderland 36 17
8th Frédéric Kanouté West Ham 32 (3) 14
9th Marcus Bent Blackburn 28 (4) 13
10th Alen Boksic Middlesbrough 36 13
11th Eidur Gudjohnsen Chelsea 28 (3) 13
12th Luis Boa Morte Fulham 36 13
13th Michael Owen Liverpool 32 (1) 12
14th Dwight Yorke Man Utd 29 (1) 11
15th Henrik Pedersen Bolton 36 11
16th Juan Pablo Angel Aston Villa 34 (2) 11
17th Juan Sebastián Verón Man Utd 29 (2) 11
18th Shaun Bartlett Charlton 35 10
19th Matt Jansen Blackburn 28 (5) 10
20th Duncan Ferguson Everton 28 (5) 10
21st Ian Harte Leeds 37 10
22nd Bosko Balaban Aston Villa 36 10
23rd Robbie Fowler Liverpool 25 (3) 10
24th Georgi Kinkladze Derby 36 (1) 10
25th Hamilton Ricard Middlesbrough 28 (2) 10
26th Robert Pires Arsenal 24 (3) 9
27th Andrew Cole Man Utd 15 (5) 9
28th Rod Wallace Bolton 31 9
29th James Beattie Southampton 28 (1) 9
30th Robbie Keane Leeds 28 (8) 9
================================================================================================
2001/2 Assists
================================================================================================
Pos Player Club Apps Asts
-------------------------------------------------------------------------
1st David Beckham Man Utd 29 15
2nd Dean Gordon Middlesbrough 30 (1) 11
3rd John Collins Fulham 32 11
4th Ryan Giggs Man Utd 32 11
5th Kieron Dyer Newcastle 33 10
6th Sean Davis Fulham 23 (1) 10
7th Damien Duff Blackburn 30 (3) 10
8th Alan Smith Leeds 23 (6) 9
9th Jesper Grønkjær Chelsea 34 9
10th Andrejs Stolcers Fulham 28 9
11th Ian Harte Leeds 37 8
12th Eidur Gudjohnsen Chelsea 28 (3) 8
13th Robert Pires Arsenal 24 (3) 7
14th Lauren Arsenal 32 (1) 7
15th John Robinson Charlton 33 7
16th Michael Gray Sunderland 37 7
17th Henrik Pedersen Bolton 36 7
18th Anders Svensson Southampton 34 (2) 7
19th Lee Bowyer Leeds 32 7
20th Craig Hignett Blackburn 21 (6) 7
21st Paul Merson Aston Villa 27 7
22nd Teddy Sheringham Tottenham 37 7
23rd Steed Malbranque Fulham 16 (14) 7
24th Marian Pahars Southampton 37 7
25th Muzzy Izzet Leicester 28 7
26th Sergei Rebrov Tottenham 36 (1) 7
27th Julio Arca Sunderland 32 (1) 7
28th Christian Bassedas Newcastle 37 7
29th Juan Sebastián Verón Man Utd 29 (2) 7
30th Joe Cole West Ham 32 6
================================================================================================
2001/2 Average Rating
================================================================================================
Pos Player Club Apps Av R
-------------------------------------------------------------------------
1st Ruud van Nistelrooy Man Utd 26 8.54
2nd Thierry Henry Arsenal 34 8.09
3rd Alan Shearer Newcastle 36 7.97
4th Kieron Dyer Newcastle 33 7.94
5th Steve Marlet Fulham 38 7.89
6th Ian Harte Leeds 37 7.86
7th Andrew Cole Man Utd 15 (5) 7.85
8th Roy Keane Man Utd 19 7.84
9th Les Ferdinand Sunderland 27 (2) 7.83
10th Juan Sebastián Verón Man Utd 29 (2) 7.81
11th Eidur Gudjohnsen Chelsea 28 (3) 7.77
12th Jesper Grønkjær Chelsea 34 7.76
13th Michaël Silvestre Man Utd 32 7.72
14th Dean Gordon Middlesbrough 30 (1) 7.71
15th Michael Owen Liverpool 32 (1) 7.70
16th Patrick Vieira Arsenal 29 7.69
17th Robert Pires Arsenal 24 (3) 7.67
18th Ryan Giggs Man Utd 32 7.66
19th Dwight Yorke Man Utd 29 (1) 7.63
20th Mario Stanic Chelsea 29 (3) 7.63
21st Frédéric Kanouté West Ham 32 (3) 7.57
22nd Mark Viduka Leeds 21 7.57
23rd David Beckham Man Utd 29 7.55
24th Jimmy Floyd Hasselbaink Chelsea 30 (1) 7.55
25th Martin Taylor Blackburn 14 (8) 7.55
26th Titus Bramble Ipswich 33 7.55
27th Sol Campbell Arsenal 20 (1) 7.52
28th Mario Melchiot Chelsea 19 (2) 7.52
29th Stephane Henchoz Liverpool 29 7.52
30th Rio Ferdinand Leeds 36 (1) 7.51
================================================================================================
2001/2 Man of Match
================================================================================================
Pos Player Club Apps MoM
-------------------------------------------------------------------------
1st Thierry Henry Arsenal 34 8
2nd Ruud van Nistelrooy Man Utd 26 8
3rd Kieron Dyer Newcastle 33 6
4th Les Ferdinand Sunderland 27 (2) 6
5th Steve Marlet Fulham 38 6
6th Eidur Gudjohnsen Chelsea 28 (3) 6
7th Ian Harte Leeds 37 5
8th Richie Wellens Leicester 20 (9) 5
9th Henrik Pedersen Bolton 36 5
10th Alan Shearer Newcastle 36 5
11th Michael Owen Liverpool 32 (1) 4
12th Dean Gordon Middlesbrough 30 (1) 4
13th Matt Jansen Blackburn 28 (5) 4
14th Marcus Bent Blackburn 28 (4) 4
15th Kevin Campbell Everton 27 (4) 4
16th Titus Bramble Ipswich 33 4
17th Roy Keane Man Utd 19 4
18th Frédéric Kanouté West Ham 32 (3) 4
19th Patrick Vieira Arsenal 29 4
20th Hermann Hreidarsson Ipswich 34 4
21st Dennis Bergkamp Arsenal 22 (9) 4
22nd Jimmy Floyd Hasselbaink Chelsea 30 (1) 4
23rd Claus Lundekvam Southampton 27 (2) 4
24th Robert Pires Arsenal 24 (3) 3
25th Shaun Bartlett Charlton 35 3
26th Kevin Phillips Sunderland 36 3
27th Lucas Radebe Leeds 31 (1) 3
28th Ragnvald Soma West Ham 27 (3) 3
29th Dean Richards Tottenham 34 3
30th Wayne Quinn Liverpool 25 (4) 3
Ideally I would like to run a function that creates a data frame out of each table above, but can't figure it out.
Thanks
Another way: you can specify the separator as two or more whitespace characters and pass skiprows as a list of row numbers. I tried this and it gave me your expected output. You can write a simple script to find which lines should be skipped and which should be kept.
df = pd.read_table('assist1.txt', sep=r'\s\s+', skiprows=[0,1,2,3,4,5,6,7,8,10], header=0, engine='python')
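The "simple script" idea can be sketched as a small function that walks the file once and builds one DataFrame per section. This is only an illustration under stated assumptions: columns in the real file are separated by at least two spaces (the post appears to show runs of spaces collapsed), section titles sit between two lines of '=' characters, and the function name parse_player_tables is made up. Rows whose field count differs from the header, such as league table rows carrying an extra C or R flag, are skipped rather than guessed at.

```python
import re
import pandas as pd

RULE = re.compile(r"^[*=\-]+$")   # lines made only of *, = or - characters
SPLIT = re.compile(r"\s{2,}")     # assumed column separator: two or more spaces

def parse_player_tables(text):
    """Return {section title: DataFrame} for every section found in text."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    sections = []
    for i, line in enumerate(lines):
        if RULE.match(line):
            continue
        before = lines[i - 1] if i > 0 else ""
        after = lines[i + 1] if i + 1 < len(lines) else ""
        if before.startswith("=") and after.startswith("="):
            # A title is the text line sandwiched between two '=' rules.
            sections.append({"title": line, "header": None, "rows": []})
        elif sections and sections[-1]["header"] is None:
            # First text line after the title is the column header.
            sections[-1]["header"] = SPLIT.split(line)
        elif sections:
            fields = SPLIT.split(line)
            # Keep only rows that line up with the header.
            if len(fields) == len(sections[-1]["header"]):
                sections[-1]["rows"].append(fields)
    return {s["title"]: pd.DataFrame(s["rows"], columns=s["header"])
            for s in sections if s["rows"]}
```

Calling it on the contents of the text file should give a dictionary keyed by titles such as "2001/2 Assists". All values come back as strings, so numeric columns still need converting, for example with pd.to_numeric.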
You're using whitespace as a delimiter, but this file is fixed-width, not whitespace-delimited. Look into fixed-width parsing, e.g. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_fwf.html.
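A minimal read_fwf sketch follows. The column widths in WIDTHS are hypothetical, since the real spacing of the file is not visible in the post; the sample text is built with those same widths so the example is self-consistent, and the helper row exists only to build it.

```python
import io
import pandas as pd

# Hypothetical column widths; measure the real ones in your file.
WIDTHS = [6, 20, 16, 9, 4]

def row(pos, player, club, apps, last):
    """Build one fixed-width line using the assumed widths."""
    return f"{pos:<6}{player:<20}{club:<16}{apps:<9}{last:>4}"

sample = "\n".join([
    row("Pos", "Player", "Club", "Apps", "Asts"),
    "-" * sum(WIDTHS),
    row("1st", "David Beckham", "Man Utd", "29", "15"),
    row("2nd", "Dean Gordon", "Middlesbrough", "30 (1)", "11"),
])

# skiprows=[1] drops the dashed rule under the header line.
df = pd.read_fwf(io.StringIO(sample), widths=WIDTHS, skiprows=[1])
print(df)
```

Because the fields are cut by position, names and clubs containing spaces and values like "30 (1)" stay in one column.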
I am having trouble counting the number of counties using the well-known census.csv data.
Task: Count the number of counties in each state.
I think the problem is in how I compare the values; please read below.
I've tried this:
df = pd.read_csv('census.csv')
dfd = df[:]['STNAME'].unique()  # gives the names of the states
serr = pd.Series(dfd)  # converting to a Series (from an array)
After this, I've tried two approaches:
1:
df[df['STNAME'] == serr]  # ERROR: series length must match
2:
i = 0
for name in serr:  # this generates the error 'Alabama'
df['STNAME'] == name
for i in serr:
serr[i] == serr[name]
print(serr[name].count)
i+=1
Please guide me; it has been three days with this stuff.
Use groupby and aggregate COUNTY using nunique:
In [1]: import pandas as pd
In [2]: df = pd.read_csv('census.csv')
In [3]: unique_counties = df.groupby('STNAME')['COUNTY'].nunique()
Now the results
In [4]: unique_counties
Out[4]:
STNAME
Alabama 68
Alaska 30
Arizona 16
Arkansas 76
California 59
Colorado 65
Connecticut 9
Delaware 4
District of Columbia 2
Florida 68
Georgia 160
Hawaii 6
Idaho 45
Illinois 103
Indiana 93
Iowa 100
Kansas 106
Kentucky 121
Louisiana 65
Maine 17
Maryland 25
Massachusetts 15
Michigan 84
Minnesota 88
Mississippi 83
Missouri 116
Montana 57
Nebraska 94
Nevada 18
New Hampshire 11
New Jersey 22
New Mexico 34
New York 63
North Carolina 101
North Dakota 54
Ohio 89
Oklahoma 78
Oregon 37
Pennsylvania 68
Rhode Island 6
South Carolina 47
South Dakota 67
Tennessee 96
Texas 255
Utah 30
Vermont 15
Virginia 134
Washington 40
West Virginia 56
Wisconsin 73
Wyoming 24
Name: COUNTY, dtype: int64
juanpa.arrivillaga has a great solution. However, the code needs a minor modification.
The "counties" with 'SUMLEV' == 40 or 'COUNTY' == 0 are state-level summary rows and should be filtered out. Otherwise, every county count is too large by one.
So, with the data loaded as census_df, the correct answer should be:
unique_counties = census_df[census_df['SUMLEV'] == 50].groupby('STNAME')['COUNTY'].nunique()
with the following result:
STNAME
Alabama 67
Alaska 29
Arizona 15
Arkansas 75
California 58
Colorado 64
Connecticut 8
Delaware 3
District of Columbia 1
Florida 67
Georgia 159
Hawaii 5
Idaho 44
Illinois 102
Indiana 92
Iowa 99
Kansas 105
Kentucky 120
Louisiana 64
Maine 16
Maryland 24
Massachusetts 14
Michigan 83
Minnesota 87
Mississippi 82
Missouri 115
Montana 56
Nebraska 93
Nevada 17
New Hampshire 10
New Jersey 21
New Mexico 33
New York 62
North Carolina 100
North Dakota 53
Ohio 88
Oklahoma 77
Oregon 36
Pennsylvania 67
Rhode Island 5
South Carolina 46
South Dakota 66
Tennessee 95
Texas 254
Utah 29
Vermont 14
Virginia 133
Washington 39
West Virginia 55
Wisconsin 72
Wyoming 23
Name: COUNTY, dtype: int64
@Bakhtawar - this is a very simple way:
df.groupby(df['STNAME']).count().COUNTY
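To see why the summary rows matter, here is a small sketch on made-up data that mimics the layout of census.csv (one SUMLEV 40 row per state with COUNTY 0, then SUMLEV 50 rows for the counties). The numbers are invented for illustration only.

```python
import pandas as pd

# Tiny made-up stand-in for census.csv.
df = pd.DataFrame({
    "SUMLEV": [40, 50, 50, 40, 50],
    "STNAME": ["Alabama", "Alabama", "Alabama", "Alaska", "Alaska"],
    "COUNTY": [0, 1, 3, 0, 13],
})

# Without filtering, the state summary row is counted as if it were a county.
with_summary = df.groupby("STNAME")["COUNTY"].nunique()
plain_count = df.groupby("STNAME").count().COUNTY

# Keeping only SUMLEV == 50 rows counts the real counties.
counties_only = df[df["SUMLEV"] == 50].groupby("STNAME").size()

print(with_summary.to_dict(), plain_count.to_dict(), counties_only.to_dict())
```

On this sample both unfiltered variants report one county too many per state, which matches the off-by-one difference between the two result listings above.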