Reading Data from URL into a Pandas Dataframe - python

I have a URL that I am having difficulty reading. It is unusual in the sense that the data is self-generated, i.e. created from my own inputs. I have used something like this with other queries and it works fine, but not in this case:
bst = pd.read_csv('https://psl.noaa.gov/data/correlation/censo.data', skiprows=1,
                  skipfooter=2, index_col=[0], header=None,
                  engine='python',  # c engine doesn't have skipfooter
                  delim_whitespace=True)
Here is the code + URL that is providing the challenge:
zwnd = pd.read_csv('https://psl.noaa.gov/cgi-bin/data/timeseries/timeseries.pl?ntype=1&var=Zonal+Wind&level=1000&lat1=50&lat2=25&lon1=-135&lon2=-65&iseas=0&mon1=0&mon2=0&iarea=0&typeout=1&Submit=Create+Timeseries',
                   skiprows=1, skipfooter=2, index_col=[0], header=None,
                   engine='python',  # c engine doesn't have skipfooter
                   delim_whitespace=True)
Thank you for any help that you can provide.
Here is the full error message:
pd.read_csv('https://psl.noaa.gov/cgi-bin/data/timeseries/timeseries.pl?ntype=1&var=Zonal+Wind&level=1000&lat1=50&lat2=25&lon1=-135&lon2=-65&iseas=0&mon1=0&mon2=0&iarea=0&typeout=1&Submit=Create+Timeseries', skiprows=1, skipfooter=2,index_col=[0], header=None,
engine='python', # c engine doesn't have skipfooter
delim_whitespace=True)
Traceback (most recent call last):
Cell In[240], line 1
pd.read_csv('https://psl.noaa.gov/cgi-bin/data/timeseries/timeseries.pl?ntype=1&var=Zonal+Wind&level=1000&lat1=50&lat2=25&lon1=-135&lon2=-65&iseas=0&mon1=0&mon2=0&iarea=0&typeout=1&Submit=Create+Timeseries', skiprows=1, skipfooter=2,index_col=[0], header=None,
File ~\Anaconda3\envs\Stats\lib\site-packages\pandas\util\_decorators.py:211 in wrapper
return func(*args, **kwargs)
File ~\Anaconda3\envs\Stats\lib\site-packages\pandas\util\_decorators.py:331 in wrapper
return func(*args, **kwargs)
File ~\Anaconda3\envs\Stats\lib\site-packages\pandas\io\parsers\readers.py:950 in read_csv
return _read(filepath_or_buffer, kwds)
File ~\Anaconda3\envs\Stats\lib\site-packages\pandas\io\parsers\readers.py:611 in _read
return parser.read(nrows)
File ~\Anaconda3\envs\Stats\lib\site-packages\pandas\io\parsers\readers.py:1778 in read
) = self._engine.read( # type: ignore[attr-defined]
File ~\Anaconda3\envs\Stats\lib\site-packages\pandas\io\parsers\python_parser.py:282 in read
alldata = self._rows_to_cols(content)
File ~\Anaconda3\envs\Stats\lib\site-packages\pandas\io\parsers\python_parser.py:1045 in _rows_to_cols
self._alert_malformed(msg, row_num + 1)
File ~\Anaconda3\envs\Stats\lib\site-packages\pandas\io\parsers\python_parser.py:765 in _alert_malformed
raise ParserError(msg)
ParserError: Expected 2 fields in line 133, saw 3. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.

pd.read_csv does not parse HTML. You might try pd.read_html, but would find that it works on <table> tags, not <pre> tags.
On inspecting the HTML content of the given URL, it is evident that the data is contained in a <pre> tag.
Use something like requests to get the page content, and BeautifulSoup4 to parse the HTML page contents (with an appropriate parsing engine, either lxml or html5lib). Then pull out the content of the <pre> tag, splitting on newlines, slicing to ignore unwanted lines, and then splitting on whitespace.
Minimal working code:
import pandas as pd
import requests
from bs4 import BeautifulSoup
url = 'https://psl.noaa.gov/cgi-bin/data/timeseries/timeseries.pl?ntype=1&var=Zonal+Wind&level=1000&lat1=50&lat2=25&lon1=-135&lon2=-65&iseas=0&mon1=0&mon2=0&iarea=0&typeout=1&Submit=Create+Timeseries'
res = requests.get(url)
# get the text from the 'pre' tag, split it on newlines
# slice off 1 head and 5 tail rows
# (inspect the contents of 'soup.find('pre').text' to determine correct values)
soup = BeautifulSoup(res.content, "html5lib")
data = soup.find('pre').text.split("\n")[1:-5]
df = pd.DataFrame([row.split() for row in data]).apply(pd.to_numeric)
df = df.set_index(df.iloc[:,0])
results in
>>> print(df.head(5))
0 1 2 3 4 5 6 7 8 9 10 11 12
0
1948 1948 0.878 0.779 0.851 0.393 0.461 0.747 0.867 0.539 -0.106 0.045 0.819 1.506
1949 1949 0.386 1.197 1.154 1.054 0.358 0.645 0.643 0.477 0.128 -0.091 1.500 0.390
1950 1950 0.674 0.973 1.640 0.821 0.572 1.002 0.635 0.196 -0.020 0.268 0.844 1.045
1951 1951 1.524 0.698 0.971 0.790 0.789 0.587 0.682 0.238 0.256 0.035 0.906 1.268
1952 1952 1.524 1.510 1.353 0.705 0.710 1.188 0.412 0.432 -0.091 0.415 0.443 1.509
and
>>> print(df.dtypes)
0 int64
1 float64
2 float64
...
12 float64
This answer is a good starting point for what you're trying to accomplish.
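As an optional next step (a sketch, not part of the original answer, assuming the layout shown above: the year in column 0 followed by twelve monthly values), you could drop the duplicated year column and label the months:
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
zwnd = df.drop(columns=0)   # column 0 duplicates the year, which is already the index
zwnd.columns = months
zwnd.index.name = "Year"
print(zwnd.head())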

It's because the first URL points directly to a dataset in .data format, while the second URL points to a website (which is made up of HTML, CSS, JSON, etc.). You can only use pd.read_csv if you are parsing a .csv file - and, I guess, a .data file too, since it worked for you.
If you can find a link on that website to the actual .data or .csv file, you will be able to parse it without a problem. Since it's a gov website, it will probably offer a sensible file format.
If you cannot, and you still need this data, you will have to do some web scraping of that website (for example with Selenium), store the results as DataFrames, and possibly preprocess them so they end up in the shape you expect.
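A quick way to see the difference between the two URLs (a sketch using requests; the content type and preview printed depend entirely on what the server returns):
import requests

for url in ['https://psl.noaa.gov/data/correlation/censo.data',
            'https://psl.noaa.gov/cgi-bin/data/timeseries/timeseries.pl?ntype=1&var=Zonal+Wind&level=1000&lat1=50&lat2=25&lon1=-135&lon2=-65&iseas=0&mon1=0&mon2=0&iarea=0&typeout=1&Submit=Create+Timeseries']:
    res = requests.get(url)
    # a plain data file is usable by read_csv directly; an HTML page is not
    print(res.headers.get('Content-Type'), '->', res.text[:60].replace('\n', ' '))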

Related

Read multiple yaml files to pandas Dataframe

I do realize this has already been addressed elsewhere (e.g., Reading csv zipped files in python, How can I parse a YAML file in Python, Retrieving data from a yaml file based on a Python list). Nevertheless, I hope this question is different enough.
I know how to load a single YAML file into a pandas DataFrame:
import yaml
import pandas as pd

with open(r'1000851.yaml') as file:
    df = pd.io.json.json_normalize(yaml.load(file))
df.head()
I would like to read several yaml files from a directory into pandas dataframe and concatenate them into one big DataFrame. I have not been able to figure it out though...
import pandas as pd
import glob
import yaml

path = r'../input/cricsheet-a-retrosheet-for-cricket/all' # use your path
all_files = glob.glob(path + "/*.yaml")

li = []
for filename in all_files:
    df = pd.json_normalize(yaml.load(filename, Loader=yaml.FullLoader))
    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True)
Error
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<timed exec> in <module>
/opt/conda/lib/python3.7/site-packages/pandas/io/json/_normalize.py in _json_normalize(data, record_path, meta, meta_prefix, record_prefix, errors, sep, max_level)
268
269 if record_path is None:
--> 270 if any([isinstance(x, dict) for x in y.values()] for y in data):
271 # naive normalization, this is idempotent for flat records
272 # and potentially will inflate the data considerably for
/opt/conda/lib/python3.7/site-packages/pandas/io/json/_normalize.py in <genexpr>(.0)
268
269 if record_path is None:
--> 270 if any([isinstance(x, dict) for x in y.values()] for y in data):
271 # naive normalization, this is idempotent for flat records
272 # and potentially will inflate the data considerably for
AttributeError: 'str' object has no attribute 'values'
Sample Dataset Zipped
Sample Dataset
Is there a way to do this and read files efficiently?
It seems the first part of your code and the second part you added are different.
The first part reads the yaml files correctly, but the second part is broken:
for filename in all_files:
    # `filename` here is just a string containing the name of the file.
    df = pd.json_normalize(yaml.load(filename, Loader=yaml.FullLoader))
    li.append(df)
The problem is that you need to read the files. Currently you're just giving the filename and not the file content. Do this instead
li = []
# Only loading 3 files:
for filename in all_files[:3]:
    with open(filename, 'r') as fh:
        df = pd.json_normalize(yaml.safe_load(fh.read()))
    li.append(df)

len(li)
3

pd.concat(li)
output:
innings meta.data_version meta.created meta.revision info.city info.competition ... info.player_of_match info.teams info.toss.decision info.toss.winner info.umpires info.venue
0 [{'1st innings': {'team': 'Glamorgan', 'delive... 0.9 2020-09-01 1 Bristol Vitality Blast ... [AG Salter] [Glamorgan, Gloucestershire] field Gloucestershire [JH Evans, ID Blackwell] County Ground
0 [{'1st innings': {'team': 'Pune Warriors', 'de... 0.9 2013-05-19 1 Pune IPL ... [LJ Wright] [Pune Warriors, Delhi Daredevils] bat Pune Warriors [NJ Llong, SJA Taufel] Subrata Roy Sahara Stadium
0 [{'1st innings': {'team': 'Botswana', 'deliver... 0.9 2020-08-29 1 Gaborone NaN ... [A Rangaswamy] [Botswana, St Helena] bat Botswana [R D'Mello, C Thorburn] Botswana Cricket Association Oval 1
[3 rows x 18 columns]
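Putting it all together for the whole directory (a sketch following the same pattern; the path comes from the question):
import glob
import yaml
import pandas as pd

path = r'../input/cricsheet-a-retrosheet-for-cricket/all'  # use your path
li = []
for filename in glob.glob(path + "/*.yaml"):
    with open(filename, 'r') as fh:
        # safe_load also accepts an open file object directly
        li.append(pd.json_normalize(yaml.safe_load(fh)))
frame = pd.concat(li, axis=0, ignore_index=True)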

Ignoring bad rows of data in pandas.read_csv() that break header= keyword

I have a series of very messy *.csv files that are being read in by pandas. An example csv is:
Instrument 35392
"Log File Name : station"
"Setup Date (MMDDYY) : 031114"
"Setup Time (HHMMSS) : 073648"
"Starting Date (MMDDYY) : 031114"
"Starting Time (HHMMSS) : 090000"
"Stopping Date (MMDDYY) : 031115"
"Stopping Time (HHMMSS) : 235959"
"Interval (HHMMSS) : 010000"
"Sensor warmup (HHMMSS) : 000200"
"Circltr warmup (HHMMSS) : 000200"
"Date","Time","","Temp","","SpCond","","Sal","","IBatt",""
"MMDDYY","HHMMSS","","øC","","mS/cm","","ppt","","Volts",""
"Random message here 031114 073721 to 031114 083200"
03/11/14,09:00:00,"",15.85,"",1.408,"",.74,"",6.2,""
03/11/14,10:00:00,"",15.99,"",1.96,"",1.05,"",6.3,""
03/11/14,11:00:00,"",14.2,"",40.8,"",26.12,"",6.2,""
03/11/14,12:00:01,"",14.2,"",41.7,"",26.77,"",6.2,""
03/11/14,13:00:00,"",14.5,"",41.3,"",26.52,"",6.2,""
03/11/14,14:00:00,"",14.96,"",41,"",26.29,"",6.2,""
"message 3"
"message 4"**
I have been using this code to import the *csv file, process the double headers, pull out the empty columns, and then strip the offending rows with bad data:
DF = pd.read_csv(BADFILE,parse_dates={'Datetime_(ascii)': [0,1]}, sep=",", \
header=[10,11],na_values=['','na', 'nan nan'], \
skiprows=[10], encoding='cp1252')
DF = DF.dropna(how="all", axis=1)
DF = DF.dropna(thresh=2)
droplist = ['message', 'Random']
DF = DF[~DF['Datetime_(ascii)'].str.contains('|'.join(droplist))]
DF.head()
Datetime_(ascii) (Temp, øC) (SpCond, mS/cm) (Sal, ppt) (IBatt, Volts)
0 03/11/14 09:00:00 15.85 1.408 0.74 6.2
1 03/11/14 10:00:00 15.99 1.960 1.05 6.3
2 03/11/14 11:00:00 14.20 40.800 26.12 6.2
3 03/11/14 12:00:01 14.20 41.700 26.77 6.2
4 03/11/14 13:00:00 14.50 41.300 26.52 6.2
This was working fine and dandy until I hit a file that has an erroneous one-field line after the header: "Random message here 031114 073721 to 031114 083200"
The error I receive is:
C:\Users\USER\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\parsers.py in _do_date_conversions(self, names, data)
   1554         data, names = _process_date_conversion(
   1555             data, self._date_conv, self.parse_dates, self.index_col,
-> 1556             self.index_names, names, keep_date_col=self.keep_date_col)
   1557
   1558         return names, data

C:\Users\USER\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\parsers.py in _process_date_conversion(data_dict, converter, parse_spec, index_col, index_names, columns, keep_date_col)
   2975     if not keep_date_col:
   2976         for c in list(date_cols):
-> 2977             data_dict.pop(c)
   2978             new_cols.remove(c)
   2979
KeyError: ('Time', 'HHMMSS')
If I remove that line, the code works fine. Similarly, if I remove the header= line the code works fine. However, I want to be able to preserve this because I am reading in hundreds of these files.
Difficulty: I would prefer not to open each file before the call to pandas.read_csv(), as these files can be rather large - thus I don't want to read and save them multiple times! Also, I would prefer a real pandas/pythonic solution that doesn't involve opening the file first as a StringIO buffer to remove offending lines.
Here's one approach, making use of the fact that skiprows accepts a callable. The callable receives only the row index being considered, which is a built-in limitation of that parameter.
As such, the callable function skip_test() first checks whether the current index is in the set of known indices to skip. If not, then it opens the actual file and checks the corresponding row to see if its contents match.
The skip_test() function is a little hacky in the sense that it does inspect the actual file, although it only inspects up to the row index it is currently evaluating. It also assumes that the bad line always begins with the same string (in the example case, "foo"), but that seems to be a safe assumption given the OP's description.
# example data
""" foo.csv
uid,a,b,c
0,1,2,3
skip me
1,11,22,33
foo
2,111,222,333
"""

import pandas as pd

def skip_test(r, fn, fail_on, known):
    if r in known:  # we know we always want to skip these
        return True
    # check if the row index matches a problem line in the file;
    # for efficiency, quit once we pass that row index
    with open(fn, "r") as f:
        data = f.read()
    for i, line in enumerate(data.splitlines()):
        if i == r and line.startswith(fail_on):
            return True
        elif i > r:
            break
    return False

fname = "foo.csv"
fail_str = "foo"
known_skip = [2]

pd.read_csv(fname, sep=",", header=0,
            skiprows=lambda x: skip_test(x, fname, fail_str, known_skip))
# output
uid a b c
0 0 1 2 3
1 1 11 22 33
2 2 111 222 333
If you know exactly which line the random message will appear on when it does appear, then this will be much faster, as you can just tell it not to inspect the file contents for any index past the potential offending line.
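For instance, with the foo.csv example above, if the only indices a bad line can ever occupy are 2 and 4, a plain list (or a bounded lambda) avoids inspecting the file at all (a sketch):
# every possibly-bad index is known up front, so no file inspection is needed
pd.read_csv("foo.csv", sep=",", header=0, skiprows=[2, 4])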
After some tinkering yesterday I found a solution and, I think, what the underlying issue may be.
I tried the skip_test() function answer above, but I was still getting errors related to the size of the table:
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader.read (pandas\_libs\parsers.c:10862)()
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory (pandas\_libs\parsers.c:11138)()
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._read_rows (pandas\_libs\parsers.c:11884)()
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows (pandas\_libs\parsers.c:11755)()
pandas\_libs\parsers.pyx in pandas._libs.parsers.raise_parser_error (pandas\_libs\parsers.c:28765)()
ParserError: Error tokenizing data. C error: Expected 1 fields in line 14, saw 11
So after playing around with skiprows= I discovered that I was just not getting the behavior I wanted when using engine='c'. read_csv() was still determining the size of the table from those first few rows, and some of those single-column rows were still being passed through. It may be that I have a few more bad single-column rows in my csv set than I had planned on.
Instead, I create an arbitrarily sized DataFrame as a template. I pull in the entire .csv file, then use logic to strip out the NaN rows and columns.
For example, I know that the widest table I will encounter in my data is 10 columns. So my call to pandas is:
DF = pd.read_csv(csv_file, sep=',',
                 parse_dates={'Datetime_(ascii)': [0,1]},
                 na_values=['', 'na', '999999', '#'], engine='c',
                 encoding='cp1252', names=list(range(0,10)))
I then use these two lines to drop the NaN rows and columns from the DataFrame:
#drop the null columns created by double deliminators
DF = DF.dropna(how="all", axis=1)
DF = DF.dropna(thresh=2) # drop if we don't have at least 2 cells with real values
If anyone in the future comes across this question: pandas has since implemented the on_bad_lines argument, so you can now handle this by passing on_bad_lines="skip".
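For example (a sketch; on_bad_lines is available from pandas 1.3 onward, and whether it resolves a given file depends on how the offending rows are malformed):
# rows that raise "Expected N fields, saw M" are silently dropped
DF = pd.read_csv(BADFILE, sep=',', encoding='cp1252', on_bad_lines="skip")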

Change CSV before importing it to pandas

I have an issue with a CSV file I am trying to import into pandas. The structure of the file is as follows:
first character of the file is a single quote;
last character of the file is a single quote;
every line of the CSV starts with a double quote and ends with a double quote followed by \n.
So I have issues importing it with pandas.read_csv. Ideally I would like pandas to simply ignore the single and double quotes when importing (neither taking them into account for the structure of the data frame nor importing them as characters).
I do not really know if I should manipulate the CSV file before using pandas.read_csv, or if I have option for just ignoring these characters.
The pd.read_csv method's first argument is either a file name or a stream.
You can read the file manually and manipulate the stream before handing it to pandas.
from io import StringIO
import pandas as pd

sio = StringIO("id,category,value\n1,beer,2.40\n2,wine,6.40\n3,$$$Theawsomestuff$$$###,166.00")
pd.read_csv(sio)
id category value
0 1 beer 2.4
1 2 wine 6.4
2 3 $$$Theawsomestuff$$$### 166.0
Thus, by subclassing StringIO, you can change the behavior of the read method:
class StreamChanger(StringIO):
    def read(self, **kwargs):
        data = super().read(**kwargs)
        data = data.replace("$", "")
        data = data.replace("#", "")
        return data

sio = StreamChanger("id,category,value\n1,beer,2.40\n2,wine,6.40\n3,$$$Theawsomestuff$$$###,166.00")
pd.read_csv(sio)
id category value
0 1 beer 2.4
1 2 wine 6.4
2 3 Theawsomestuff 166.0
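Applied to the file described in the question (a sketch; data.csv is a hypothetical file name, and it assumes the only stray characters are the wrapping single and double quotes):
from io import StringIO
import pandas as pd

with open("data.csv", "r") as fh:                    # hypothetical file name
    raw = fh.read().strip("' \n")                    # drop the single quotes wrapping the whole file
cleaned = "\n".join(line.strip('"') for line in raw.splitlines())  # drop the per-line double quotes
df = pd.read_csv(StringIO(cleaned))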
Use the parameter quoting and pass the value 3 (csv.QUOTE_NONE) to read_csv. Once you have the dataframe created, you should take care of the quotes in the data and in the headers.
import pandas as pd

df = pd.read_csv('check.txt', doublequote=True, delimiter=',', quoting=3)
df = df.replace({'"': '', '\'': ''}, regex=True)
df.columns = ['Id1', 'StartTime', 'start_lat', 'start_long', 'StartGeohash']
print(df)
Sample File
'Id1,StartTime,start_lat,start_long,StartGeohash
"113,2016-11-01 10:50:28.063,-33.139507,-100.226715,9vbsx2"
"113,2016-11-02 10:49:24.063,-33.139507,-100.226715,9vbsx2"
"115,2016-11-03 10:55:20.063,-36.197660,-101.186050,9y2jcm"'
output
Id1 StartTime start_lat start_long StartGeohash
0 113 2016-11-01 10:50:28.063 -33.139507 -100.226715 9vbsx2
1 113 2016-11-02 10:49:24.063 -33.139507 -100.226715 9vbsx2
2 115 2016-11-03 10:55:20.063 -36.197660 -101.186050 9y2jcm

How to exclude first word in Pandas header?

I'm importing text files into Pandas data frames. The number of columns can vary, and the names vary as well.
However, the header line always starts with ~A, and read_csv interprets this as the name of the first column; consequently, all the column names are shifted one step to the right.
Earlier I used np.genfromtxt() with the argument deletechars='A__', but I haven't found any equivalent for pandas. Is there a way to exclude that name when reading or, as a second option, delete the first name but keep the columns intact?
I'm reading file like this:
in_file = pd.read_csv(file_name, header=header_row,delim_whitespace=True)
Now I got this (just as the text file looks):
~A DEPTH TIME TX1 TX2 TX3 OUT6
11705 2.94 10525.38 126.14 169.71 353.86 4.59 NaN
11706 2.93 10525.38 NaN 168.29 368.00 4.75 NaN
11707 2.92 10525.38 126.14 166.71 369.86 4.93 NaN
but I want to get this:
DEPTH TIME TX1 TX2 TX3 OUT6
11705 2.94 10525.38 126.14 169.71 353.86 4.59
11706 2.93 10525.38 NaN 168.29 368.00 4.75
11707 2.92 10525.38 126.14 166.71 369.86 4.93
Why not just post-process?
df = ...
df_modified = df[df.columns[:-1]]
df_modified.columns = df.columns[1:]
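In full, for the question's whitespace-delimited file (a sketch; file_name and header_row are as in the question):
df = pd.read_csv(file_name, header=header_row, delim_whitespace=True)
df_modified = df[df.columns[:-1]]       # drop the trailing all-NaN column created by the shift
df_modified.columns = df.columns[1:]    # re-label so the bogus '~A' name is discarded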
How about reading the file twice? First, use pd.read_csv() but skip the header row. Second, open the file and use readline() to parse the header, dropping its first item. The result can then be assigned to your dataframe's columns.
in_file = pd.read_csv(file_name, delim_whitespace=True, header=None, skiprows=[0])
with open(file_name, 'rt') as h:
    hdrs = h.readline().rstrip('\n').split()  # split on whitespace to match delim_whitespace
in_file.columns = hdrs[1:]
Choose which columns to import
in_file = pd.read_csv(file_name, header=header_row,
                      delim_whitespace=True,
                      usecols=['DEPTH','TIME','TX1','TX2','TX3','OUT6'])
Ok so if the number of columns varies,
and you want to remove the first column (whose name varies),
AND you do not want to do this in a post-read_csv phase...
then
.... (Drum Roll)
import pandas as pd

# tim.csv is
# a,b,c
# 1,2,3
# 2,3,4
# 3,4,5

headers = ['BADCOL', 'Happy', 'Sad']  # example names; not actually needed below
data = pd.read_csv('tim.csv').iloc[:, 1:]
Data will now look like
b c
2 3
3 4
4 5
Not sure if this counts as Post-CSV processing or not...

Pandas Read CSV with string delimiters via regex

I am trying to import a weirdly formatted text file into a pandas DataFrame. Two example lines are below:
LOADED LANE 1 MAT. TYPE= 2 LEFFECT= 1 SPAN= 200. SPACE= 10. BETA= 3.474 LOADEFFECT 5075. LMAX= 3643. COV= .13
LOADED LANE 1 MAT. TYPE= 3 LEFFECT= 1 SPAN= 200. SPACE= 10. BETA= 3.515 LOADEFFECT10009. LMAX= 9732. COV= .08
First I tried the following:
df = pd.read_csv('beta.txt', header=None, delim_whitespace=True, usecols=[2,5,7,9,11,13,15,17,19])
This seemed to work fine; however, it got messed up when it hit the second example line above, where there is no whitespace after the LOADEFFECT string (you may need to scroll right a bit to see it). I got a result like:
632 1 2 1 200 10 3.474 5075. 3643. 0.13
633 1 3 1 200 10 3.515 LMAX= COV= NaN
Then I decided to use a regular expression to define my delimiters. After many trial and error runs (I am no expert in regex), I managed to get close with the following line:
df = pd.read_csv('beta.txt', header=None, sep='/s +|LOADED LANE|MAT. TYPE=|LEFFECT=|SPAN=|SPACE=|BETA=|LOADEFFECT|LMAX=|COV=', engine='python')
This almost works, but creates a NaN column for some reason at the very beginning:
632 NaN 1 2 1 200 10 3.474 5075 3643 0.13
633 NaN 1 3 1 200 10 3.515 10009 9732 0.08
At this point I think I can just delete that first column, and get away with it. However I wonder what would be the correct way to set up the regex to correctly parse this text file in one shot. Any ideas? Other than that, I am sure there is a smarter way to parse this text file. I would be glad to hear your recommendations.
Thanks!
import re
import csv
import pandas as pd

new_list = []
with open("parsing.txt") as csvfile:  # open the text file
    reader = csv.reader(csvfile)
    for line in reader:
        for i in line:
            # pull every integer or decimal number out of the field
            new_list.append(re.findall(r'(\d*\.\d+|\d+)', i))

table = pd.DataFrame(new_list)
table  # a pandas DataFrame of the extracted values (as strings)
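If you want named, numeric columns, you can label the extracted values afterwards (a sketch; the names are taken from the keywords in the sample lines and are an assumption about what each number means):
table.columns = ["LANE", "TYPE", "LEFFECT", "SPAN", "SPACE",
                 "BETA", "LOADEFFECT", "LMAX", "COV"]
table = table.apply(pd.to_numeric)  # findall returns strings, so convert to numbers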
