I have a text file of Clash Royale stats from Kaggle. Each line is formatted as a Python dictionary. I am struggling to load it in a meaningful way (for example, into a pandas DataFrame). What is the best way to do this? Each dict is fairly complex and contains nested dicts and lists.
Original Dataset here:
https://www.kaggle.com/s1m0n38/clash-royale-matches-dataset
{'players': {'right': {'deck': [['Mega Minion', '9'], ['Electro Wizard', '3'], ['Arrows', '11'], ['Lightning', '5'], ['Tombstone', '9'], ['The Log', '2'], ['Giant', '9'], ['Bowler', '5']], 'trophy': '4258', 'clan': 'TwoFiveOne', 'name': 'gpa raid'}, 'left': {'deck': [['Fireball', '9'], ['Archers', '12'], ['Goblins', '12'], ['Minions', '11'], ['Bomber', '12'], ['The Log', '2'], ['Barbarians', '12'], ['Royal Giant', '13']], 'trophy': '4325', 'clan': 'battusai', 'name': 'Supr4'}}, 'type': 'ladder', 'result': ['2', '0'], 'time': '2017-07-12'}
{'players': {'right': {'deck': [['Ice Spirit', '10'], ['Valkyrie', '9'], ['Hog Rider', '9'], ['Inferno Tower', '9'], ['Goblins', '12'], ['Musketeer', '9'], ['Zap', '12'], ['Fireball', '9']], 'trophy': '4237', 'clan': 'The Wolves', 'name': 'TITAN'}, 'left': {'deck': [['Royal Giant', '13'], ['Ice Wizard', '2'], ['Bomber', '12'], ['Knight', '12'], ['Fireball', '9'], ['Barbarians', '12'], ['The Log', '2'], ['Archers', '12']], 'trophy': '4296', 'clan': 'battusai', 'name': 'Supr4'}}, 'type': 'ladder', 'result': ['1', '0'], 'time': '2017-07-12'}
{'players': {'right': {'deck': [['Miner', '3'], ['Ice Golem', '9'], ['Spear Goblins', '12'], ['Minion Horde', '12'], ['Inferno Tower', '8'], ['The Log', '2'], ['Skeleton Army', '6'], ['Fireball', '10']], 'trophy': '4300', 'clan': '#LA PERLA NEGRA', 'name': 'Victor'}, 'left': {'deck': [['Royal Giant', '13'], ['Ice Wizard', '2'], ['Bomber', '12'], ['Knight', '12'], ['Fireball', '9'], ['Barbarians', '12'], ['The Log', '2'], ['Archers', '12']], 'trophy': '4267', 'clan': 'battusai', 'name': 'Supr4'}}, 'type': 'ladder', 'result': ['0', '1'], 'time': '2017-07-12'}
According to this dataset's synopsis on Kaggle, each dictionary represents a match between two players, so it makes sense for each row of the dataframe to hold all the characteristics of a single match.
This can be accomplished in a few short steps.
Store all the match dictionaries (each row of the dataset from Kaggle) inside one list:
matches = [
{'players': {'right': {'deck': [['Mega Minion', '9'], ['Electro Wizard', '3'], ['Arrows', '11'], ['Lightning', '5'], ['Tombstone', '9'], ['The Log', '2'], ['Giant', '9'], ['Bowler', '5']], 'trophy': '4258', 'clan': 'TwoFiveOne', 'name': 'gpa raid'}, 'left': {'deck': [['Fireball', '9'], ['Archers', '12'], ['Goblins', '12'], ['Minions', '11'], ['Bomber', '12'], ['The Log', '2'], ['Barbarians', '12'], ['Royal Giant', '13']], 'trophy': '4325', 'clan': 'battusai', 'name': 'Supr4'}}, 'type': 'ladder', 'result': ['2', '0'], 'time': '2017-07-12'},
{'players': {'right': {'deck': [['Ice Spirit', '10'], ['Valkyrie', '9'], ['Hog Rider', '9'], ['Inferno Tower', '9'], ['Goblins', '12'], ['Musketeer', '9'], ['Zap', '12'], ['Fireball', '9']], 'trophy': '4237', 'clan': 'The Wolves', 'name': 'TITAN'}, 'left': {'deck': [['Royal Giant', '13'], ['Ice Wizard', '2'], ['Bomber', '12'], ['Knight', '12'], ['Fireball', '9'], ['Barbarians', '12'], ['The Log', '2'], ['Archers', '12']], 'trophy': '4296', 'clan': 'battusai', 'name': 'Supr4'}}, 'type': 'ladder', 'result': ['1', '0'], 'time': '2017-07-12'},
{'players': {'right': {'deck': [['Miner', '3'], ['Ice Golem', '9'], ['Spear Goblins', '12'], ['Minion Horde', '12'], ['Inferno Tower', '8'], ['The Log', '2'], ['Skeleton Army', '6'], ['Fireball', '10']], 'trophy': '4300', 'clan': '#LA PERLA NEGRA', 'name': 'Victor'}, 'left': {'deck': [['Royal Giant', '13'], ['Ice Wizard', '2'], ['Bomber', '12'], ['Knight', '12'], ['Fireball', '9'], ['Barbarians', '12'], ['The Log', '2'], ['Archers', '12']], 'trophy': '4267', 'clan': 'battusai', 'name': 'Supr4'}}, 'type': 'ladder', 'result': ['0', '1'], 'time': '2017-07-12'}
]
Create a dataframe from the above list, which will automatically populate columns that contain info for the type, time, and result of the match:
import pandas as pd

df = pd.DataFrame(matches)
Then, use some simple logic to populate columns containing info on the deck, trophy, clan, and name of both the left and right players in the match:
sides = ['right', 'left']
player_keys = ['deck', 'trophy', 'clan', 'name']
for side in sides:
    for key in player_keys:
        # apply already visits every row, so no explicit row loop is needed
        df[side + '_' + key] = df['players'].apply(lambda x: x[side][key])
df = df.drop('players', axis=1)  # no longer needed after populating the other columns
df = df.iloc[:, ::-1]  # reverse the column order: player info from left to right,
                       # followed by general match info at the far right of the dataframe
The resulting dataframe looks like this:
left_name left_clan left_trophy left_deck right_name right_clan right_trophy right_deck type time result
0 Supr4 battusai 4325 [[Fireball, 9], [Archers, 12], [Goblins, 12], ... gpa raid TwoFiveOne 4258 [[Mega Minion, 9], [Electro Wizard, 3], [Arrow... ladder 2017-07-12 [2, 0]
1 Supr4 battusai 4296 [[Royal Giant, 13], [Ice Wizard, 2], [Bomber, ... TITAN The Wolves 4237 [[Ice Spirit, 10], [Valkyrie, 9], [Hog Rider, ... ladder 2017-07-12 [1, 0]
2 Supr4 battusai 4267 [[Royal Giant, 13], [Ice Wizard, 2], [Bomber, ... Victor #LA PERLA NEGRA 4300 [[Miner, 3], [Ice Golem, 9], [Spear Goblins, 1... ladder 2017-07-12 [0, 1]
I saved your data to a .json file, then looped through the lines, treating each one as its own JSON document, and used pandas.json_normalize to load it into a DataFrame. I made some guesses at how you wanted the df to look, and came up with this:
note: proper JSON needs double quotes, not single quotes, so I used replace to work around this. Be careful that this does not corrupt any data that itself contains quote characters.
note: To get this to work, I had to merge 'right' and 'left', so you lose the information about which side each player was on. If you need it, you could use a dict comprehension as a workaround.
import json
import pandas as pd

with open('cr.json', 'r') as f:
    df = None
    for line in f:
        data = json.loads(line.replace("'", '"'))
        # needed to put the right and left keys together, maybe you can find a way around this, I wasn't able to
        df1 = pd.json_normalize([data['players']['right'], data['players']['left']],
                                'deck',
                                ['name', 'trophy', 'clan'],
                                meta_prefix='player.',
                                errors='ignore')
        df = pd.concat([df, df1])

df.rename(columns={0: 'player.troop.name', 1: 'player.troop.level'},
          inplace=True)
print(df)
This prints:
player.troop.name player.troop.level player.name player.clan \
0 Mega Minion 9 gpa raid TwoFiveOne
1 Electro Wizard 3 gpa raid TwoFiveOne
2 Arrows 11 gpa raid TwoFiveOne
3 Lightning 5 gpa raid TwoFiveOne
4 Tombstone 9 gpa raid TwoFiveOne
5 The Log 2 gpa raid TwoFiveOne
6 Giant 9 gpa raid TwoFiveOne
7 Bowler 5 gpa raid TwoFiveOne
8 Fireball 9 Supr4 battusai
9 Archers 12 Supr4 battusai
10 Goblins 12 Supr4 battusai
11 Minions 11 Supr4 battusai
12 Bomber 12 Supr4 battusai
13 The Log 2 Supr4 battusai
14 Barbarians 12 Supr4 battusai
15 Royal Giant 13 Supr4 battusai
0 Ice Spirit 10 TITAN The Wolves
1 Valkyrie 9 TITAN The Wolves
2 Hog Rider 9 TITAN The Wolves
3 Inferno Tower 9 TITAN The Wolves
4 Goblins 12 TITAN The Wolves
5 Musketeer 9 TITAN The Wolves
6 Zap 12 TITAN The Wolves
7 Fireball 9 TITAN The Wolves
8 Royal Giant 13 Supr4 battusai
9 Ice Wizard 2 Supr4 battusai
10 Bomber 12 Supr4 battusai
11 Knight 12 Supr4 battusai
12 Fireball 9 Supr4 battusai
13 Barbarians 12 Supr4 battusai
14 The Log 2 Supr4 battusai
15 Archers 12 Supr4 battusai
0 Miner 3 Victor #LA PERLA NEGRA
1 Ice Golem 9 Victor #LA PERLA NEGRA
2 Spear Goblins 12 Victor #LA PERLA NEGRA
3 Minion Horde 12 Victor #LA PERLA NEGRA
4 Inferno Tower 8 Victor #LA PERLA NEGRA
5 The Log 2 Victor #LA PERLA NEGRA
6 Skeleton Army 6 Victor #LA PERLA NEGRA
7 Fireball 10 Victor #LA PERLA NEGRA
8 Royal Giant 13 Supr4 battusai
9 Ice Wizard 2 Supr4 battusai
10 Bomber 12 Supr4 battusai
11 Knight 12 Supr4 battusai
12 Fireball 9 Supr4 battusai
13 Barbarians 12 Supr4 battusai
14 The Log 2 Supr4 battusai
15 Archers 12 Supr4 battusai
player.trophy
0 4258
1 4258
2 4258
3 4258
4 4258
5 4258
6 4258
7 4258
8 4325
9 4325
10 4325
11 4325
12 4325
13 4325
14 4325
15 4325
0 4237
1 4237
2 4237
3 4237
4 4237
5 4237
6 4237
7 4237
8 4296
9 4296
10 4296
11 4296
12 4296
13 4296
14 4296
15 4296
0 4300
1 4300
2 4300
3 4300
4 4300
5 4300
6 4300
7 4300
8 4267
9 4267
10 4267
11 4267
12 4267
13 4267
14 4267
15 4267
And df.iloc[0] is as follows:
player.troop.name Mega Minion
player.troop.level 9
player.name gpa raid
player.trophy 4258
player.clan TwoFiveOne
Name: 0, dtype: object
You can rework the json_normalize parameters as you see fit, but I hope this is enough to get you going.
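The caveat about replacing quotes can be made concrete. The sketch below uses a made-up record (the name D'Angelo is hypothetical and not taken from the dataset) to show how the quote replacement breaks once a value contains an apostrophe. ast.literal_eval, which parses Python literal syntax directly, is one possible alternative for lines like these.

```python
import json
from ast import literal_eval

# Hypothetical line in the same style as the dataset; the name contains an apostrophe.
line = "{'name': \"D'Angelo\", 'trophy': '4258'}"

# Replacing every single quote also rewrites the apostrophe inside the name,
# which leaves invalid JSON behind.
try:
    json.loads(line.replace("'", '"'))
    replaced_ok = True
except json.JSONDecodeError:
    replaced_ok = False

# literal_eval reads the line as a Python dict literal, so no replacement is needed.
record = literal_eval(line)

print(replaced_ok)
print(record["name"])
```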
The other answers only work with the toy data, as presented in the OP. This answer deals with the actual file from Kaggle, and how to clean it.
The Kaggle file, matches.txt, is rows of nested dicts
Within the file, each row has 4 top level keys, ['players', 'type', 'result', 'time']
Read the file in, which will make each row a str type
Convert it from str to dict type with ast.literal_eval
Some of the rows are not correctly formatted, and will result in a SyntaxError
The data can be converted to a dataframe with pandas.json_normalize
Imports
import pandas as pd
from ast import literal_eval
Clean the File
# store the data
data = list()
# store the broken rows
broken_row = list()
# read in the file
with open('matches.txt', 'r', encoding='utf-8') as f:
    # read the rows
    rows = f.readlines()
    for row in rows:
        # try to convert a row from a string to dict
        try:
            row = literal_eval(row)
            data.append(row)
        except SyntaxError:
            broken_row.append(row)
            continue
Convert data to a long DataFrame
For each match, each 'players.right.deck', 'players.left.deck' gets a separate row.
# convert data to a dataframe
df = pd.json_normalize(data)
# add a unique id for each row, which can be used to identify players for a particular game
df['id'] = df.index
# split the list of lists in right.deck and left.deck to separate rows
players = df[['id', 'players.right.deck', 'players.left.deck']].apply(pd.Series.explode).reset_index(drop=True)
# drop the original columns
df.drop(columns=['players.right.deck', 'players.left.deck'], inplace=True)
# right.deck and left.deck are still a list with two values, which need to have separate columns
players[['right.deck.name', 'right.deck.number']] = pd.DataFrame(players.pop('players.right.deck').values.tolist())
players[['left.deck.name', 'left.deck.number']] = pd.DataFrame(players.pop('players.left.deck').values.tolist())
# separate the result column into two columns
df[['right.result', 'left.result']] = pd.DataFrame(df.pop('result').values.tolist())
# merge df with players
df = df.merge(players, on='id')
df.head(8)
type time players.right.trophy players.right.clan players.right.name players.left.trophy players.left.clan players.left.name id right.result left.result right.deck.name right.deck.number left.deck.name left.deck.number
0 ladder 2017-07-12 4258 TwoFiveOne gpa raid 4325 battusai Supr4 0 2 0 Mega Minion 9 Fireball 9
1 ladder 2017-07-12 4258 TwoFiveOne gpa raid 4325 battusai Supr4 0 2 0 Electro Wizard 3 Archers 12
2 ladder 2017-07-12 4258 TwoFiveOne gpa raid 4325 battusai Supr4 0 2 0 Arrows 11 Goblins 12
3 ladder 2017-07-12 4258 TwoFiveOne gpa raid 4325 battusai Supr4 0 2 0 Lightning 5 Minions 11
4 ladder 2017-07-12 4258 TwoFiveOne gpa raid 4325 battusai Supr4 0 2 0 Tombstone 9 Bomber 12
5 ladder 2017-07-12 4258 TwoFiveOne gpa raid 4325 battusai Supr4 0 2 0 The Log 2 The Log 2
6 ladder 2017-07-12 4258 TwoFiveOne gpa raid 4325 battusai Supr4 0 2 0 Giant 9 Barbarians 12
7 ladder 2017-07-12 4258 TwoFiveOne gpa raid 4325 battusai Supr4 0 2 0 Bowler 5 Royal Giant 13
Convert data to a wide DataFrame
This option uses the flatten_json function.
For each match, each 'players.right.deck', 'players.left.deck' gets a separate column.
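flatten_json is not defined in this answer. A minimal sketch of such a helper follows, assuming it should join nested dict keys and list positions with underscores, which matches column names such as players_right_deck_0_0 in the output below. It is an illustration only and not necessarily the implementation the answer relied on.

```python
def flatten_json(nested, sep="_"):
    """Flatten nested dicts and lists into a single-level dict (illustrative sketch)."""
    flat = {}

    def _walk(value, prefix):
        if isinstance(value, dict):
            # descend into each key, extending the column name
            for key, child in value.items():
                _walk(child, f"{prefix}{key}{sep}")
        elif isinstance(value, list):
            # list positions become part of the column name
            for position, child in enumerate(value):
                _walk(child, f"{prefix}{position}{sep}")
        else:
            # strip the trailing separator before storing the leaf value
            flat[prefix[: len(prefix) - len(sep)]] = value

    _walk(nested, "")
    return flat


sample = {"players": {"right": {"deck": [["Miner", "3"]], "name": "Victor"}},
          "result": ["0", "1"]}
flat_sample = flatten_json(sample)
print(flat_sample)
```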
# convert data to a wide dataframe
df = pd.DataFrame([flatten_json(x) for x in data])
# display(df.head(3))
players_right_deck_0_0 players_right_deck_0_1 players_right_deck_1_0 players_right_deck_1_1 players_right_deck_2_0 players_right_deck_2_1 players_right_deck_3_0 players_right_deck_3_1 players_right_deck_4_0 players_right_deck_4_1 players_right_deck_5_0 players_right_deck_5_1 players_right_deck_6_0 players_right_deck_6_1 players_right_deck_7_0 players_right_deck_7_1 players_right_trophy players_right_clan players_right_name players_left_deck_0_0 players_left_deck_0_1 players_left_deck_1_0 players_left_deck_1_1 players_left_deck_2_0 players_left_deck_2_1 players_left_deck_3_0 players_left_deck_3_1 players_left_deck_4_0 players_left_deck_4_1 players_left_deck_5_0 players_left_deck_5_1 players_left_deck_6_0 players_left_deck_6_1 players_left_deck_7_0 players_left_deck_7_1 players_left_trophy players_left_clan players_left_name type result_0 result_1 time
0 Mega Minion 9 Electro Wizard 3 Arrows 11 Lightning 5 Tombstone 9 The Log 2 Giant 9 Bowler 5 4258 TwoFiveOne gpa raid Fireball 9 Archers 12 Goblins 12 Minions 11 Bomber 12 The Log 2 Barbarians 12 Royal Giant 13 4325 battusai Supr4 ladder 2 0 2017-07-12
1 Ice Spirit 10 Valkyrie 9 Hog Rider 9 Inferno Tower 9 Goblins 12 Musketeer 9 Zap 12 Fireball 9 4237 The Wolves TITAN Royal Giant 13 Ice Wizard 2 Bomber 12 Knight 12 Fireball 9 Barbarians 12 The Log 2 Archers 12 4296 battusai Supr4 ladder 1 0 2017-07-12
2 Miner 3 Ice Golem 9 Spear Goblins 12 Minion Horde 12 Inferno Tower 8 The Log 2 Skeleton Army 6 Fireball 10 4300 #LA PERLA NEGRA Victor Royal Giant 13 Ice Wizard 2 Bomber 12 Knight 12 Fireball 9 Barbarians 12 The Log 2 Archers 12 4267 battusai Supr4 ladder 0 1 2017-07-12
I have a dataset that contains NBA players' average statistics per game. Some players' statistics are repeated because they played for different teams during the season.
For example:
Player Pos Age Tm G GS MP FG
8 Jarrett Allen C 22 TOT 28 10 26.2 4.4
9 Jarrett Allen C 22 BRK 12 5 26.7 3.7
10 Jarrett Allen C 22 CLE 16 5 25.9 4.9
I want to average Jarrett Allen's stats and put them into a single row. How can I do this?
You can groupby and use agg to get the mean. For the non numeric columns, let's take the first value:
df.groupby('Player').agg({k: 'mean' if v in ('int64', 'float64') else 'first'
for k,v in df.dtypes[1:].items()})
output:
Pos Age Tm G GS MP FG
Player
Jarrett Allen C 22 TOT 18.666667 6.666667 26.266667 4.333333
NB. content of the dictionary comprehension:
{'Pos': 'first',
'Age': 'mean',
'Tm': 'first',
'G': 'mean',
'GS': 'mean',
'MP': 'mean',
'FG': 'mean'}
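For reference, a self-contained sketch of the same idea on the three rows from the question. It assumes pandas.api.types.is_numeric_dtype is an acceptable way to decide between 'mean' and 'first'; this is a substitution for the dtype-name check used above, not the answer's own code.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

df = pd.DataFrame({
    "Player": ["Jarrett Allen"] * 3,
    "Pos": ["C", "C", "C"],
    "Age": [22, 22, 22],
    "Tm": ["TOT", "BRK", "CLE"],
    "G": [28, 12, 16],
    "GS": [10, 5, 5],
    "MP": [26.2, 26.7, 25.9],
    "FG": [4.4, 3.7, 4.9],
})

# numeric columns are averaged, everything else keeps its first value
how = {col: "mean" if is_numeric_dtype(df[col]) else "first"
       for col in df.columns if col != "Player"}
result = df.groupby("Player", as_index=False).agg(how)
print(result)
```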
x = [['a', 12, 5],['a', 12, 7], ['b', 15, 10],['b', 15, 12],['c', 20, 1]]
import pandas as pd
df = pd.DataFrame(x, columns=['name', 'age', 'score'])
print(df)
print('-----------')
df2 = df.groupby(['name', 'age']).mean()
print(df2)
Output:
name age score
0 a 12 5
1 a 12 7
2 b 15 10
3 b 15 12
4 c 20 1
-----------
score
name age
a 12 6
b 15 11
c 20 1
Option 1
Taking df to be the dataframe that OP shares in the question, the following will do the work:
df_new = df.groupby('Player').agg(lambda x: x.iloc[0] if pd.api.types.is_string_dtype(x.dtype) else x.mean())
[Out]:
Pos Age Tm G GS MP FG
Player
Jarrett Allen C 22.0 TOT 18.666667 6.666667 26.266667 4.333333
This one uses:
pandas.DataFrame.groupby to group by the Player column
pandas.core.groupby.GroupBy.agg to aggregate the values based on a custom made lambda function.
pandas.api.types.is_string_dtype to check if a column is of string type (see here how the method is implemented)
Let's test it with a new dataframe, df2, with more elements in the Player column.
import numpy as np
df2 = pd.DataFrame({'Player': ['John Collins', 'John Collins', 'John Collins', 'Trae Young', 'Trae Young', 'Clint Capela', 'Jarrett Allen', 'Jarrett Allen', 'Jarrett Allen'],
'Pos': ['PF', 'PF', 'PF', 'PG', 'PG', 'C', 'C', 'C', 'C'],
'Age': np.random.randint(0, 100, 9),
'Tm': ['ATL', 'ATL', 'ATL', 'ATL', 'ATL', 'ATL', 'TOT', 'BRK', 'CLE'],
'G': np.random.randint(0, 100, 9),
'GS': np.random.randint(0, 100, 9),
'MP': np.random.uniform(0, 100, 9),
'FG': np.random.uniform(0, 100, 9)})
[Out]:
Player Pos Age Tm G GS MP FG
0 John Collins PF 71 ATL 75 39 16.123225 77.949756
1 John Collins PF 60 ATL 49 49 30.308092 24.788401
2 John Collins PF 52 ATL 33 92 11.087317 58.488575
3 Trae Young PG 72 ATL 20 91 62.862313 60.169282
4 Trae Young PG 85 ATL 61 77 30.248551 85.169038
5 Clint Capela C 73 ATL 5 67 45.817690 21.966777
6 Jarrett Allen C 23 TOT 60 51 93.076624 34.160823
7 Jarrett Allen C 12 BRK 2 77 74.318568 78.755869
8 Jarrett Allen C 44 CLE 82 81 7.375631 40.930844
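If one tests the operation on df2, one gets output of the following form (the values below do not match the table above, presumably because the random data was regenerated between runs):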
df_new2 = df2.groupby('Player').agg(lambda x: x.iloc[0] if pd.api.types.is_string_dtype(x.dtype) else x.mean())
[Out]:
Pos Age Tm G GS MP FG
Player
Clint Capela C 95.000000 ATL 30.000000 98.000000 46.476398 17.987104
Jarrett Allen C 60.000000 TOT 48.666667 19.333333 70.050540 33.572896
John Collins PF 74.333333 ATL 50.333333 52.666667 78.181457 78.152235
Trae Young PG 57.500000 ATL 44.500000 47.500000 46.602543 53.835455
Option 2
Depending on the desired output, assuming that one only wants to group by player (independently of Age or Tm), a simpler solution would be to just group by and pass .mean() as follows
df_new3 = df.groupby('Player').mean(numeric_only=True)  # numeric_only is required in pandas 2.0+ when string columns are present
[Out]:
Age G GS MP FG
Player
Jarrett Allen 22.0 18.666667 6.666667 26.266667 4.333333
Notes:
The output of this previous operation won't display non-numerical columns (apart from the Player name).
I have the following dataframes:
df = pd.DataFrame({'nameCompany': ['Piestrita Inc', 'Total Play', 'Yate Inc', 'Spider Comp', 'Tech solutions', 'LG Inno'],
                   'code': ['1', '1', '2', '3', '3', '3'],
                   'results': ['Rick', 'Patram', 'Pulis', 'Marie', 'Landon', 'Freddy']})
df2 = pd.DataFrame({'nameCompany': ['Alaska Inc', 'Kira', 'Joli Molly', 'Health Society'],
'code': ['1', '2', '3', '3']})
df:
nameCompany     code  results
Piestrita Inc   1     Rick
Total Play      1     Patram
Yate Inc        2     Pulis
Spider Comp     3     Marie
Tech solutions  3     Landon
LG Inno         3     Freddy
df2:
nameCompany     code
Alaska Inc      1
Kira            2
Joli Molly      3
Health Society  3
I need to update nameCompany in df for every code that also appears in df2. If a code appears only once in df2, the new name must replace the last row of df with that code; if it appears several times, the new names must fill the last rows of df with that code. The output should therefore be the following:
df_new = pd.DataFrame({'nameCompany': ['Piestrita Inc', 'Alaska Inc', 'Kira', 'Spider Comp', 'Joli Molly', 'Health Society'],
                       'code': ['1', '1', '2', '3', '3', '3'],
                       'results': ['Rick', 'Patram', 'Pulis', 'Marie', 'Landon', 'Freddy']})
df_new:
nameCompany     code  results
Piestrita Inc   1     Rick
Alaska Inc      1     Patram
Kira            2     Pulis
Spider Comp     3     Marie
Joli Molly      3     Landon
Health Society  3     Freddy
I have tried with the update method but I have not obtained the expected results, any suggestions?
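Use GroupBy.cumcount with ascending=False to create a counter column that numbers the rows of each code from the last one backwards, then use DataFrame.merge on that counter together with code, and finally use Series.combine_first to keep the original name wherever there was no match: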
df['g'] = df.groupby('code').cumcount(ascending=False)
df2['g'] = df2.groupby('code').cumcount(ascending=False)
df = df.merge(df2, on=['code','g'], how='left', suffixes=['','_']).drop('g', axis=1)
df['nameCompany'] = df.pop('nameCompany_').combine_first(df['nameCompany'])
print (df)
nameCompany code results
0 Piestrita Inc 1 Rick
1 Alaska Inc 1 Patram
2 Kira 2 Pulis
3 Spider Comp 3 Marie
4 Joli Molly 3 Landon
5 Health Society 3 Freddy
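To make the role of the counter column visible, here is a self-contained sketch of the same steps on the question's data, with the intermediate g values written out in comments. The suffix _new and the variable name merged are choices made for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "nameCompany": ["Piestrita Inc", "Total Play", "Yate Inc",
                    "Spider Comp", "Tech solutions", "LG Inno"],
    "code": ["1", "1", "2", "3", "3", "3"],
    "results": ["Rick", "Patram", "Pulis", "Marie", "Landon", "Freddy"],
})
df2 = pd.DataFrame({
    "nameCompany": ["Alaska Inc", "Kira", "Joli Molly", "Health Society"],
    "code": ["1", "2", "3", "3"],
})

# count backwards within each code, so the last row of every group gets 0
df["g"] = df.groupby("code").cumcount(ascending=False)    # 1, 0, 0, 2, 1, 0
df2["g"] = df2.groupby("code").cumcount(ascending=False)  # 0, 0, 1, 0

# rows pair up only when both code and distance-from-the-end agree
merged = df.merge(df2, on=["code", "g"], how="left", suffixes=("", "_new"))

# take the new name where one matched, otherwise keep the original
merged["nameCompany"] = merged.pop("nameCompany_new").combine_first(merged["nameCompany"])
merged = merged.drop(columns="g")
print(merged)
```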
The dataframe has a column containing lists of dictionaries that all share the same key names. How can I convert it into a tall dataframe? The dataframe is shown below.
A B
1 [{"name":"john","age":"28","salary":"50000"},{"name":"Todd","age":"36","salary":"54000"}]
2 [{"name":"Alex","age":"48","salary":"70000"},{"name":"Mark","age":"89","salary":"150000"}]
3 [{"name":"jane","age":"36","salary":"20000"},{"name":"Rose","age":"28","salary":"90000"}]
How can I convert the dataframe above into the one below?
A name age salary
1 john 28 50000
1 Todd 36 54000
2 Alex 48 70000
2 Mark 89 150000
3 jane 36 20000
3 Rose 28 90000
You need to unnest the column first, then use the same method I provided before.
newdf=unnesting(df,['B'])
pd.concat([newdf,pd.DataFrame(newdf.pop('B').tolist(),index=newdf.index)],axis=1)
A age name salary
0 1 28 john 50000
0 1 36 Todd 54000
1 2 48 Alex 70000
1 2 89 Mark 150000
2 3 36 jane 20000
2 3 28 Rose 90000
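More info: I have attached my self-defined function below; you can also find it on the page I linked.
import numpy as np
import pandas as pd

def unnesting(df, explode):
    # repeat each index label once per element in the first listed column
    idx = df.index.repeat(df[explode[0]].str.len())
    df1 = pd.concat([pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx
    # drop(columns=...) replaces the positional axis argument removed in pandas 2.0
    return df1.join(df.drop(columns=explode), how='left')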
Data Input
df.B.to_dict()
{0: [{'name': 'john', 'age': '28', 'salary': '50000'}, {'name': 'Todd', 'age': '36', 'salary': '54000'}], 1: [{'name': 'Alex', 'age': '48', 'salary': '70000'}, {'name': 'Mark', 'age': '89', 'salary': '150000'}], 2: [{'name': 'jane', 'age': '36', 'salary': '20000'}, {'name': 'Rose', 'age': '28', 'salary': '90000'}]}
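On pandas versions that provide DataFrame.explode with ignore_index (1.1 and later), the same reshaping can be sketched without a custom helper. This is an alternative to the unnesting function above rather than the answer's own method, and the two-row frame is a shortened copy of the question's data.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2],
    "B": [
        [{"name": "john", "age": "28", "salary": "50000"},
         {"name": "Todd", "age": "36", "salary": "54000"}],
        [{"name": "Alex", "age": "48", "salary": "70000"}],
    ],
})

# one row per dictionary; ignore_index gives a clean 0..n-1 index
tall = df.explode("B", ignore_index=True)

# expand each dictionary into its own columns and put A back in front
tall = pd.concat([tall[["A"]], pd.DataFrame(tall["B"].tolist())], axis=1)
print(tall)
```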
I have a function that opens a file called: "table1.txt" and outputs the comma separated values into a certain format.
My function is:
def sort_and_format():
    contents = []
    with open('table1.txt', 'r+') as f:
        for line in f:
            contents.append(line.split(','))
    max_name_length = max([len(line[0]) for line in contents])
    print(" Team Points Diff Goals \n")
    print("--------------------------------------------------------------------------\n")
    for i, line in enumerate(contents):
        line = [el.replace('\n', '') for el in line]
        print("{i:3} {0:{fill_width}} {1:3} {x:3} {2:3} :{3:3}".format(i=i+1, *line,
              x = (int(line[2])- int(line[3])), fill_width=max_name_length))
I figured out how to format it correctly, so for a "table1.txt" file of:
FC Ingolstadt 04, 13, 4, 6
Hamburg, 9, 8, 10
SV Darmstadt 98, 9, 8, 9
Mainz, 9, 6, 9
FC Augsburg, 4, 7, 12
Werder Bremen, 6, 7, 12
Borussia Moenchengladbach, 6, 9, 15
Hoffenheim, 5, 8, 12
VfB Stuttgart, 4, 9, 17
Schalke 04, 16, 14, 3
Hannover 96, 2, 6, 18
Borrusia Dortmund, 16, 15, 4
Bayern Munich, 18, 18, 2
Bayer Leverkusen, 14, 11, 8
Eintracht Frankfurt, 9, 13, 9
Hertha BSC Berlin, 14, 5, 4
1. FC Cologne, 13, 10, 10
VfB Wolfsburg, 14, 10, 6
It would output:
Team Points Diff Goals
--------------------------------------------------------------------------
1 FC Ingolstadt 04 13 -2 4 : 6
2 Hamburg 9 -2 8 : 10
3 SV Darmstadt 98 9 -1 8 : 9
4 Mainz 9 -3 6 : 9
5 FC Augsburg 4 -5 7 : 12
6 Werder Bremen 6 -5 7 : 12
7 Borussia Moenchengladbach 6 -6 9 : 15
8 Hoffenheim 5 -4 8 : 12
9 VfB Stuttgart 4 -8 9 : 17
10 Schalke 04 16 11 14 : 3
11 Hannover 96 2 -12 6 : 18
12 Borrusia Dortmund 16 11 15 : 4
13 Bayern Munich 18 16 18 : 2
14 Bayer Leverkusen 14 3 11 : 8
15 Eintracht Frankfurt 9 4 13 : 9
16 Hertha BSC Berlin 14 1 5 : 4
17 1. FC Cologne 13 0 10 : 10
18 VfB Wolfsburg 14 4 10 : 6
I am trying to figure out how to sort the file so that the team with the highest points would be ranked number 1, and if a team has equal points then they are ranked by diff(the difference in goals for and against the team), and if the diff is the same they are ranked by goals scored.
I thought of implementing a bubble sort function similar to:
def bubble_sort(lst):
    j = len(lst)
    made_swap = True
    swaps = 0
    while made_swap:
        made_swap = False
        for cnt in range (j-1):
            if lst[cnt] < lst[cnt+1]:
                lst[cnt], lst[cnt+1] = lst[cnt+1], lst[cnt]
                made_swap = True
                swaps = swaps + 1
    return swaps
But I do not know how to isolate each line and compare the values of each to one another to sort.
The following code will sort the list in the ways you asked:
from operator import itemgetter

def sort_and_format():
    contents = []
    with open('table1.txt', 'r+') as f:
        for line in f:
            l = line.split(',')
            l[1:]=map(int,l[1:])
            contents.append(l)
    # reverse=True puts the highest values first and still keeps the sort stable
    contents.sort(key=itemgetter(2), reverse=True)
    contents.sort(key=lambda team:team[2]-team[3], reverse=True)
    contents.sort(key=itemgetter(1), reverse=True)
    # [printing and formatting code]
What this does differently:
First of all, it converts all the data about each team to numbers, excluding the name. This allows the later code to do math on them.
Then the first contents.sort statement sorts the list by goals scored (index 2). operator.itemgetter(2) is just a faster way to say lambda l:l[2]. The next contents.sort statement stably sorts the list by goals for minus goals against, as that is what the lambda does. Stable sorting means that the order of elements that compare equal does not change, so teams with equal goal difference remain sorted by goals scored. The third contents.sort statement does the same stable sort by points.
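The stable multi-pass idea can be checked on a few rows from the table. This sketch assumes descending order is wanted (highest points first), so every pass uses reverse=True, and it compares the result against a single sort on a tuple key. The five teams are a subset of the question's data, already converted to numbers.

```python
from operator import itemgetter

# [name, points, goals scored, goals against]
teams = [
    ["Schalke 04", 16, 14, 3],
    ["Borrusia Dortmund", 16, 15, 4],
    ["Bayern Munich", 18, 18, 2],
    ["Hamburg", 9, 8, 10],
    ["Mainz", 9, 6, 9],
]

# least significant key first; each later pass keeps earlier order among ties
multi_pass = list(teams)
multi_pass.sort(key=itemgetter(2), reverse=True)            # goals scored
multi_pass.sort(key=lambda t: t[2] - t[3], reverse=True)    # goal difference
multi_pass.sort(key=itemgetter(1), reverse=True)            # points

# the same ranking expressed as one sort on a tuple key
single_pass = sorted(teams, key=lambda t: (t[1], t[2] - t[3], t[2]), reverse=True)

print([t[0] for t in multi_pass])
```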
Read the file into a list of rows:
contents = [row.strip('\n').split(', ') for row in open('table1.txt', 'r+')]
so that your rows look like:
['FC Ingolstadt 04', '13', '4', '6']
Then you can use Python's built-in sort function:
table = sorted(contents, key=lambda r: (int(r[1]), int(r[2])-int(r[3]), int(r[2])), reverse=True)  # points, then diff, then goals scored
and print 'table' with the specific formatting you want.
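Put together, parsing, ranking and printing might look like the sketch below. It reads from an in-memory list instead of table1.txt so that it runs on its own; the helper names parse and rank are made up for this example, and the column layout follows the format string from the question.

```python
lines = [
    "Hamburg, 9, 8, 10",
    "Mainz, 9, 6, 9",
    "Schalke 04, 16, 14, 3",
    "Borrusia Dortmund, 16, 15, 4",
    "Bayern Munich, 18, 18, 2",
]

def parse(line):
    # split on commas and strip the surrounding spaces and newline
    name, points, scored, conceded = [part.strip() for part in line.split(",")]
    return name, int(points), int(scored), int(conceded)

def rank(raw_lines):
    teams = [parse(line) for line in raw_lines]
    # points first, then goal difference, then goals scored, all descending
    return sorted(teams, key=lambda t: (t[1], t[2] - t[3], t[2]), reverse=True)

table = rank(lines)
width = max(len(team[0]) for team in table)
for position, (name, points, scored, conceded) in enumerate(table, start=1):
    print(f"{position:3} {name:{width}} {points:3} {scored - conceded:3} {scored:3} :{conceded:3}")
```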
I have joined spaces in the first column with _ to make life easier, so the data looks like:
F_ngolstad_4 13 -2 4:6
Hamburg 9 -2 8:10
S_armstad_8 9 -1 8:9
Mainz 9 -3 6:9
F_ugsburg 4 -5 7:12
Werde_remen 6 -5 7:12
Borussi_oenchengladbach 6 -6 9:15
Hoffenheim 5 -4 8:12
Vf_tuttgart 4 -8 9:17
Schalk_4 16 11 14:3
Hannove_6 2 -12 6:18
Borrusi_ortmund 16 11 15:4
Bayer_munich 18 16 18:2
Baye_everkusen 14 3 11:8
Eintrach_rankfurt 9 4 13:9
Herth_S_erlin 14 1 5:4
1._F_ologne 13 0 10:10
Vf_olfsburg 14 4 10:6
all_lines = []
with open('data', 'r') as f:
    for line in f:
        li = line.split()
        all_lines.append(li)
l = sorted(all_lines,key=lambda x: (int(x[1]),int(x[2])),reverse=True)
for el in l:
    print(el)
['Bayer_munich', '18', '16', '18:2']
['Schalk_4', '16', '11', '14:3']
['Borrusi_ortmund', '16', '11', '15:4']
['Vf_olfsburg', '14', '4', '10:6']
['Baye_everkusen', '14', '3', '11:8']
['Herth_S_erlin', '14', '1', '5:4']
['1._F_ologne', '13', '0', '10:10']
['F_ngolstad_4', '13', '-2', '4:6']
['Eintrach_rankfurt', '9', '4', '13:9']
['S_armstad_8', '9', '-1', '8:9']
['Hamburg', '9', '-2', '8:10']
['Mainz', '9', '-3', '6:9']
['Werde_remen', '6', '-5', '7:12']
['Borussi_oenchengladbach', '6', '-6', '9:15']
['Hoffenheim', '5', '-4', '8:12']
['F_ugsburg', '4', '-5', '7:12']
['Vf_tuttgart', '4', '-8', '9:17']
['Hannove_6', '2', '-12', '6:18']
Working on an NFL CSV file that can help me automate scoring for games. Right now, I can upload the teams and scores into ONLY 1 column of the csv file.
THESE ARE ALL IN COLUMN A
Example:
A
1 NYJ
2 27
3 PHI
4 20
5 BUF
6 13
7 DET
8 35
9 CIN
10 27
11 IND
12 10
13 MIA
14 24
15 NO
16 21
OR
[['NYJ'], ['27'], ['PHI'], ['20'], ['BUF'], ['13'], ['DET'], ['35'], ['CIN'], ['27'], ['IND'], ['10'], ['MIA'], ['24'], ['NO'], ['21'], ['TB'], ['12'], ['WAS'], ['30'], ['CAR'], ['25'], ['PIT'], ['10'], ['ATL'], ['16'], ['JAC'], ['20'], ['NE'], ['28'], ['NYG'], ['20'], ['MIN'], ['24'], ['TEN'], ['23'], ['STL'], ['24'], ['BAL'], ['21'], ['CHI'], ['16'], ['CLE'], ['18'], ['KC'], ['30'], ['GB'], ['8'], ['DAL'], ['6'], ['HOU'], ['24'], ['DEN'], ['24'], ['ARI'], ['32'], ['SD'], ['6'], ['SF'], ['41'], ['SEA'], ['22'], ['OAK'], ['6']]
What I want is this:
A B C D
1 NYJ 27 PHI 20
2 BUF 13 DET 35
3 CIN 27 IND 10
4 MIA 24 NO 21
I have read through previous articles on this and have not got it to work yet. Any ideas on this?
Any help is appreciated and thanks!
current script:
import nflgame
import csv
print "Purpose of this script is to get NFL Scores to help out with GUT"
pregames = nflgame.games(2013, week=[4], kind='PRE')
out = open("scores.csv", "wb")
output = csv.writer(out)
for score in pregames:
    output.writerows([[score.home],[score.score_home],[score.away],[score.score_away]])
You're currently using .writerows() to write 4 rows, each with one column.
Instead, you want:
output.writerow([score.home, score.score_home, score.away, score.score_away])
to write a single row with 4 columns.
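The difference between the two calls can be seen without nflgame. The sketch below is written for Python 3 and uses a hypothetical list of tuples in place of the game objects (the values are the first two games from the question), writing to an in-memory buffer instead of scores.csv.

```python
import csv
import io

# stand-in for the nflgame results: (home, home score, away, away score)
games = [("NYJ", 27, "PHI", 20), ("BUF", 13, "DET", 35)]

# writerows with four one-element lists: four rows of one column each
one_column = io.StringIO()
writer = csv.writer(one_column, lineterminator="\n")
for home, score_home, away, score_away in games:
    writer.writerows([[home], [score_home], [away], [score_away]])

# writerow with one flat list: one row of four columns per game
four_columns = io.StringIO()
writer = csv.writer(four_columns, lineterminator="\n")
for home, score_home, away, score_away in games:
    writer.writerow([home, score_home, away, score_away])

print(one_column.getvalue())
print(four_columns.getvalue())
```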
Without knowing the score data, try to change writerows to writerow:
import nflgame
import csv
print "Purpose of this script is to get NFL Scores to help out with GUT"
pregames = nflgame.games(2013, week=[4], kind='PRE')
out = open("scores.csv", "wb")
output = csv.writer(out)
for score in pregames:
    # pass one flat list so each value lands in its own column
    output.writerow([score.home, score.score_home, score.away, score.score_away])
This will write each game on a single row with four columns.