Python Pandas: Getting only the first 3 elements from a table - python

I am using pandas to scrape this site https://www.mapsofworld.com/lat_long/poland-lat-long.html but I am only getting 3 elements. How can I get all of the elements from the table?
import numpy as np
import pandas as pd
#for getting world map
import folium
# Retrieving latitude and longitude coordinates
info = pd.read_html("https://www.mapsofworld.com/lat_long/poland-lat-long.html",match='Augustow',skiprows=2)
# converting the table data into a DataFrame
coordinates = pd.DataFrame(info[0])
data = coordinates.head()
print(data)

It looks like installing html5lib (pip install html5lib) and using it as your parser may fix your issue:
df = pd.read_html("https://www.mapsofworld.com/lat_long/poland-lat-long.html",attrs={"class":"tableizer-table"},skiprows=2,flavor="html5lib")
>>> df
[ 0 1 2
0 Locations Latitude Longitude
1 NaN NaN NaN
2 Augustow 53°51'N 23°00'E
3 Auschwitz/Oswiecim 50°02'N 19°11'E
4 Biala Podxlaska 52°04'N 23°06'E
.. ... ... ...
177 Zawiercie 50°30'N 19°24'E
178 Zdunska Wola 51°37'N 18°59'E
179 Zgorzelec 51°10'N 15°0'E
180 Zyrardow 52°3'N 20°28'E
181 Zywiec 49°42'N 19°10'E
[182 rows x 3 columns]]
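If you also want the parsed result as a tidy DataFrame, here is a small follow-up sketch (assuming the output shown above, where the real header sits in the first row and the second row is empty) that promotes that row to column names and drops the two leading rows:
coordinates = df[0]
coordinates.columns = coordinates.iloc[0]  # first row holds the header labels
coordinates = coordinates.drop([0, 1]).reset_index(drop=True)  # drop header row and empty row
print(coordinates.shape)  # all ~180 locations, not just the first few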

Related

Pandas replace not working on known values in a Dataframe

I have a quick question about pandas replace.
import pandas as pd
import numpy as np
infile = pd.read_csv('sum_prog.csv')
df = pd.DataFrame(infile)
df_no_na = df.replace({0, np.nan})
df_no_na = df_no_na.dropna()
print(df_no_na.head())
print(df.head())
This code will return:
Cell ID Duration ... Overall Angle Median Overall Euclidean Median
0 372003 148 ... 0.0 1.9535615635898635
1 372005 536 ... 45.16432169606084 37.85959470668756
2 372006 840 ... 0.0 1.0821891332154392
3 372010 840 ... 0.0 1.4200380286464513
4 372011 840 ... 0.0 1.0594536197046835
[5 rows x 20 columns]
Cell ID Duration ... Overall Angle Median Overall Euclidean Median
0 372003 148 ... 0.0 1.9535615635898635
1 372005 536 ... 45.16432169606084 37.85959470668756
2 372006 840 ... 0.0 1.0821891332154392
3 372010 840 ... 0.0 1.4200380286464513
4 372011 840 ... 0.0 1.0594536197046835
I have done this exact same thing before and it worked; I have no idea why it won't work now. Any help would be awesome, thanks!
You are passing df.replace() a set instead of a dictionary. You need to replace {0, np.nan} with {0: np.nan}:
import pandas as pd
import numpy as np
infile = pd.read_csv('sum_prog.csv')
df = pd.DataFrame(infile)
print(df)
df_no_na = df.replace({0: np.nan}) # change this line
print(df_no_na)
df_no_na = df_no_na.dropna()
print(df_no_na)
index Cell_ID Duration Overall_Angle_Median Overall_Euclidean_Median
1 1.0 372005.0 536.0 45.164322 37.859595
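For reference, replace also accepts a nested dict when the substitution should apply only to particular columns; here is a minimal sketch (the column names are illustrative, not from the original file):
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [0, 1, 2], 'b': [0, 5, 0]})
print(df.replace({0: np.nan}))         # {value: replacement} applies to every column
print(df.replace({'b': {0: np.nan}}))  # {column: {value: replacement}} is per-column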

Visualizing a file based on byte ranges in python

I have data that describes the makeup of a binary file. Each point of data specifies a start and end range in bytes as well as a type:
[0x046270, 0x057574, "type1"]
[0x057574, 0x05BF20, "type2"]
[0x05BF20, 0x05EF80, "type1"]
[0x05EF80, 0x05F050, "type2"]
I would like to be able to visualize the file by coloring sections and getting something similar to what can be seen in the old Windows disk defragmentation utility.
I have tried using matplotlib's stacked bar chart for this, but I am seeing some issues and think I may be misusing it for this purpose. Is there a name for the type of graph below or any clean way of going about rendering this?
Here is a basic stacked graph with 256 sectors. To make it two-tiered like the image you presented, you would need to add a second axes (ax2) or change the structure of the data. Note that the process is quite heavy, so it takes some time to render.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import random
FAT_No = np.arange(0, pow(2,8))
sector_st = random.choices(['type1','type2','type3','type4'], k=256)
value = [1]*256
before = ['before']*256
df = pd.DataFrame({'before':before,'fat_no':FAT_No, 'sector':sector_st, 'value':value})
df
before fat_no sector value
0 before 0 type1 1
1 before 1 type1 1
2 before 2 type4 1
3 before 3 type2 1
4 before 4 type2 1
... ... ... ... ...
251 before 251 type2 1
252 before 252 type2 1
253 before 253 type3 1
254 before 254 type2 1
255 before 255 type4 1
fig = plt.figure(figsize=(16,3),dpi=144)
ax = fig.add_subplot(111)
color = {'type1':'b','type2':'g','type3':'r','type4':'w'}
for i in range(len(df)):
    ax.barh(df['before'], df['value'].iloc[i],
            color=color[df['sector'].iloc[i]],
            left=df['value'].iloc[:i].sum())
plt.show()
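As an alternative worth considering (not part of the answer above), matplotlib's broken_barh draws horizontal spans directly from (start, width) pairs, which maps naturally onto the question's byte ranges and avoids the per-sector loop. A minimal sketch using the ranges from the question:
import matplotlib.pyplot as plt
# (start, end, type) byte ranges taken from the question
ranges = [
    (0x046270, 0x057574, "type1"),
    (0x057574, 0x05BF20, "type2"),
    (0x05BF20, 0x05EF80, "type1"),
    (0x05EF80, 0x05F050, "type2"),
]
color = {"type1": "tab:blue", "type2": "tab:green"}
fig, ax = plt.subplots(figsize=(16, 2), dpi=144)
for start, end, kind in ranges:
    # broken_barh takes a sequence of (xmin, xwidth) pairs and a (ymin, yheight) pair
    ax.broken_barh([(start, end - start)], (0, 1), facecolors=color[kind])
ax.set_xlim(ranges[0][0], ranges[-1][1])
ax.set_yticks([])
plt.show()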

Reading a pandas data frame having unequal columns in observations

I am trying to read this small data file,
Link - https://drive.google.com/open?id=1nAS5mpxQLVQn9s_aAKvJt8tWPrP_DUiJ
I am using the code -
df = pd.read_table('/Data/123451_date.csv', sep=';', index_col=0, engine='python', error_bad_lines=False)
It has ';' as the separator, and values are missing for some columns in some observations (rows).
How can I read it properly? The dataframe I currently get is not loaded correctly.
It looks like the data you are using has some garbage in it. Specifically, rows 1-33 (inclusive) include additional, unnecessary (non-GPS) information. You can either fix the file by manually removing the unneeded information, or use the following code snippet to skip those rows:
from pandas import read_table
data = read_table('34_2017-02-06.gpx.csv', sep=';', skiprows=list(range(1, 34))).drop("Unnamed: 28", axis=1)
The drop("Unnamed: 28", axis=1) is simply there to remove an additional column that is created probably due to each row in your datasheet ending with a ; (because it reads the empty space at the end of each line as data).
The result of print(data.head()) is then as follows:
index cumdist ele ... esttotalpower lat lon
0 49 340 -34.8 ... 9 52.077362 5.114530
1 51 350 -34.8 ... 17 52.077468 5.114543
2 52 360 -35.0 ... -54 52.077521 5.114551
3 53 370 -35.0 ... -173 52.077603 5.114505
4 54 380 -34.8 ... 335 52.077677 5.114387
[5 rows x 28 columns]
To explain the role of the drop command further, here is what would happen without it (notice the last, odd column):
index cumdist ele ... lat lon Unnamed: 28
0 49 340 -34.8 ... 52.077362 5.114530 NaN
1 51 350 -34.8 ... 52.077468 5.114543 NaN
2 52 360 -35.0 ... 52.077521 5.114551 NaN
3 53 370 -35.0 ... 52.077603 5.114505 NaN
4 54 380 -34.8 ... 52.077677 5.114387 NaN
[5 rows x 29 columns]
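A variant sketch that avoids creating the extra column in the first place (assuming the file really has 28 data columns, as the output above suggests) is to limit the parse with usecols:
from pandas import read_table
# read only the first 28 columns, skipping the non-GPS rows 1-33
data = read_table('34_2017-02-06.gpx.csv', sep=';',
                  skiprows=list(range(1, 34)), usecols=range(28))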

How to create pandas dataframe from web scrape?

I would like to use this web scrape to create a pandas dataframe so that I can export the data to Excel. Is anyone familiar with this? I have seen different methods online and on this site, but have been unable to successfully duplicate the results with this scrape.
Here is the code so far:
import requests
source = requests.get("https://api.lineups.com/nba/fetch/lineups/gateway").json()
for team in source['data']:
    print("\n%s players\n" % team['home_route'].capitalize())
    for player in team['home_players']:
        print(player['name'])
    print("\n%s players\n" % team['away_route'].capitalize())
    for player in team['away_players']:
        print(player['name'])
This site seems useful but the examples are different:
https://www.tutorialspoint.com/python_pandas/python_pandas_dataframe.htm
Here is another example from stackoverflow.com:
Loading web scraping results into Pandas DataFrame
I am new to coding/scraping, so any help will be greatly appreciated. Thanks in advance for your time and effort!
I have added a solution that builds one dataframe column per team; I hope this helps. Updated code:
import requests
source = requests.get("https://api.lineups.com/nba/fetch/lineups/gateway").json()
players = []
teams = []
for team in source['data']:
    print("\n%s players\n" % team['home_route'].capitalize())
    teams.append(team['home_route'].capitalize())
    teams.append(team['away_route'].capitalize())
    temp = []
    temp1 = []
    for player in team['home_players']:
        print(player['name'])
        temp.append(player['name'])
    print("\n%s players\n" % team['away_route'].capitalize())
    for player in team['away_players']:
        print(player['name'])
        temp1.append(player['name'])
    players.append(temp)
    players.append(temp1)
import pandas as pd
df = pd.DataFrame(columns=teams)
for i in range(0, len(df.columns)):
    df[df.columns[i]] = players[i]
df
In order to export to excel, you can do
df.to_excel('result.xlsx')
Python requests conveniently parses the JSON into a dict, so you can just pass the dict to the pd.DataFrame constructor.
import pandas as pd
df = pd.DataFrame([dict1, dict2, dict3])  # each placeholder dict becomes one row
# Do your data processing here
df.to_csv("myfile.csv")
Pandas also has pd.io.json with helpers like json_normalize so once your data is in a dataframe you can process nested json in to tabular data, and so on.
You can try it as below:
>>> import pandas as pd
>>> import json
>>> import requests
>>> source = requests.get("https://api.lineups.com/nba/fetch/lineups/gateway").json()
>>> df = pd.DataFrame.from_dict(source)  # use source directly; it is already a dict
Now you can write the dataframe out to CSV with df.to_csv as follows:
>>> df.to_csv("nba_play.csv")
Below are just your columns, which you can process as desired:
>>> df.columns
Index(['bottom_header', 'bottom_paragraph', 'data', 'heading',
'intro_paragraph', 'page_title', 'twitter_link'],
dtype='object')
However, as Charles said, you can use json_normalize, which will give you a better view of the data in tabular form:
>>> from pandas.io.json import json_normalize
>>> json_normalize(df['data']).head()
away_bets.key away_bets.moneyline away_bets.over_under \
0 ATL 500 o232.0
1 POR 165 o217.0
2 SAC 320 o225.0
3 BKN 110 o216.0
4 TOR -140 o221.0
away_bets.over_under_moneyline away_bets.spread \
0 -115 11.0
1 -115 4.5
2 -105 9.0
3 -105 2.0
4 -105 -2.0
away_bets.spread_moneyline away_bets.total \
0 -110 121.50
1 -105 110.75
2 -115 117.00
3 -110 109.00
4 -115 109.50
away_injuries \
0 [{'name': 'J. Collins', 'profile_url': '/nba/p...
1 [{'name': 'M. Harkless', 'profile_url': '/nba/...
2 [{'name': 'K. Koufos', 'profile_url': '/nba/pl...
3 [{'name': 'T. Graham', 'profile_url': '/nba/pl...
4 [{'name': 'O. Anunoby', 'profile_url': '/nba/p...
away_players away_route \
0 [{'draftkings_projection': 30.04, 'yahoo_posit... atlanta-hawks
1 [{'draftkings_projection': 47.33, 'yahoo_posit... portland-trail-blazers
2 [{'draftkings_projection': 28.88, 'yahoo_posit... sacramento-kings
3 [{'draftkings_projection': 37.02, 'yahoo_posit... brooklyn-nets
4 [{'draftkings_projection': 45.2, 'yahoo_positi... toronto-raptors
... nav.matchup_season nav.matchup_time \
0 ... 2019 2018-10-29T23:00:00+00:00
1 ... 2019 2018-10-29T23:00:00+00:00
2 ... 2019 2018-10-29T23:30:00+00:00
3 ... 2019 2018-10-29T23:30:00+00:00
4 ... 2019 2018-10-30T00:00:00+00:00
nav.status.away_team_score nav.status.home_team_score nav.status.minutes \
0 None None None
1 None None None
2 None None None
3 None None None
4 None None None
nav.status.quarter_integer nav.status.seconds nav.status.status \
0 None Scheduled
1 None Scheduled
2 None Scheduled
3 None Scheduled
4 None Scheduled
nav.updated order
0 2018-10-29T17:51:05+00:00 0
1 2018-10-29T17:51:05+00:00 1
2 2018-10-29T17:51:05+00:00 2
3 2018-10-29T17:51:05+00:00 3
4 2018-10-29T17:51:05+00:00 4
[5 rows x 383 columns]
Hope this helps.

"ValueError: labels ['timestamp'] not contained in axis" error

I have this code; I want to remove the column 'timestamp' from the file u.data, but can't. It shows the error:
"ValueError: labels ['timestamp'] not contained in axis"
How can I correct it?
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.rc("font", size=14)
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.cross_validation import KFold
from sklearn.cross_validation import train_test_split
data = pd.read_table('u.data')
data.columns=['userID', 'itemID','rating', 'timestamp']
data.drop('timestamp', axis=1)
N = len(data)
print data.shape
print list(data.columns)
print data.head(10)
One of the biggest problems one faces, and one that often goes unnoticed, is that when inserting headers into the u.data file, the separator must be exactly the same as the separator used within a row of data. For example, if a tab separates the items of a row, then you should not use spaces. In your u.data file, add headers and separate them with exactly the same whitespace as is used between the items of a row.
PS: Use Sublime Text; Notepad/Notepad++ sometimes does not work.
"ValueError: labels ['timestamp'] not contained in axis"
You don't have headers in the file, so the way you loaded it you got a df whose column names are the first row of the data. You then tried to access the column timestamp, which doesn't exist.
Your u.data doesn't have headers in it
$head u.data
196 242 3 881250949
186 302 3 891717742
So working with column names isn't going to be possible unless you add the headers. You can add the headers to the file u.data; e.g., I opened it in a text editor and added the line a b c timestamp at the top (this seems to be a tab-separated file, so be careful not to use spaces when adding the header, or it will break the format):
$head u.data
a b c timestamp
196 242 3 881250949
186 302 3 891717742
Now your code works and data.columns returns
Index([u'a', u'b', u'c', u'timestamp'], dtype='object')
And the rest of the trace of your working code is now
(100000, 4) # the shape
['a', 'b', 'c', 'timestamp'] # the columns
a b c timestamp # the df
0 196 242 3 881250949
1 186 302 3 891717742
2 22 377 1 878887116
3 244 51 2 880606923
4 166 346 1 886397596
5 298 474 4 884182806
6 115 265 2 881171488
7 253 465 5 891628467
8 305 451 3 886324817
9 6 86 3 883603013
If you don't want to add headers
Or you can drop the column 'timestamp' using its position (presumably 3). We can do this using df.ix as below; it selects all rows and the columns at positions 0 through 2 (the end of an integer slice is exclusive), thus dropping the column at position 3:
data.ix[:, 0:3]
I would do it this way:
data = pd.read_table('u.data', header=None,
names=['userID', 'itemID','rating', 'timestamp'],
usecols=['userID', 'itemID','rating']
)
Check:
In [589]: data.head()
Out[589]:
userID itemID rating
0 196 242 3
1 186 302 3
2 22 377 1
3 244 51 2
4 166 346 1
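One more detail worth noting beyond the header issue: in the question's code the result of drop is never assigned back, so even with correct headers the column would remain. A minimal sketch of the full fix (my addition, reusing the column names from the answer above):
import pandas as pd
data = pd.read_table('u.data', header=None,
                     names=['userID', 'itemID', 'rating', 'timestamp'])
# drop returns a new DataFrame; assign it back (or pass inplace=True)
data = data.drop('timestamp', axis=1)
print(data.head())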
