how to read url .txt files using pandas

how to read url .txt files using pandas - python

I have a problem reading files using pandas (read_csv). I can do it using the built in, with open(...), however it is much easier with pandas. I just need to read the data (numbers) between the ----. This is the LINK with one of my data url. There are more depending on the date that i insert. A sample of this is :
MONTHLY CLIMATOLOGICAL SUMMARY for JUN. 2020
NAME: Krieza Evias CITY: Krieza Evias STATE:
ELEV: 119 m LAT: 38° 24' 00" N LONG: 24° 18' 00" E
TEMPERATURE (°C), RAIN (mm), WIND SPEED (km/hr)
HEAT COOL AVG
MEAN DEG DEG WIND DOM
DAY TEMP HIGH TIME LOW TIME DAYS DAYS RAIN SPEED HIGH TIME DIR
------------------------------------------------------------------------------------
1 18.2 22.4 10:20 13.5 23:50 1.0 0.9 0.0 4.5 33.8 12:30 E
2 17.6 22.3 15:00 10.8 4:10 2.0 1.3 0.0 4.5 30.6 15:20 E
3 18.1 21.9 12:20 14.1 3:40 1.3 1.1 1.0 4.2 24.1 14:40 E
Keep in mind that i cannot just use skiprows=8 and skipfooter=9 to get the data between the --------, because not all files of this format have a specific number of footer (skipfooter)or title (skiprows) to skip. Some have 2 or 3 and some others have 8-9 lines of footer or title to skip. But every file has 2 lines of -------- where the data are between them.

I think you can't directly use read_csv but you could do this:
import urllib
from io import StringIO
count = 0
txt=""
data = urllib.request.urlopen(LINK)
for line in data:
if "---" in line.decode('windows-1252'):
count+=1
elif count==1:
txt+=line.decode('windows-1252')
else:
break
df = pd.read_csv(StringIO(txt), sep="\s+", header=None)
header is None because in your link column names are not in a row only but divided into multiple rows. If they're fixed I suggest you to put them by hand such as ["DAY", "MEAN TEMP", ...].

Related

Select value from dataframe based on other dataframe

i try to calculate the position of an object based on a timestamp. For this I have two dataframes in pandas. One for the measurement data and one for the position. All the movement is a straightforward acceleration.
Dataframe 1 contains the measurement data:
ms force ... ... ...
1 5 20
2 10 20
3 15 25
4 20 30
5 25 20
..... (~ 6000 lines)
Dataframe 2 contains "positioning data"
ms speed (m/s)
1 0 0.66
2 4500 0.66
3 8000 1.3
4 16000 3.0
5 20000 3.0
.....(~300 lines)
Now I want to calculate the position of the first dataframe with the data from secound dataframe
In Excel I solved the problem by using an array formular but now I have to use Python/Pandas and I cant find a way how to select the correct row from dataframe 2.
My idea is to make something like this: if
In the end I want to display a graph "force <-> way" and not "force <-> time"
Thank you in andvance
==========================================================================
Update:
In the meantime I could almost solve my issue. Now my Data look like this:
Dataframe 2 (Speed Data):
pos v a t t-end t-start
0 -3.000 0.666667 0.000000 4.500000 4.500000 0.000000
1 0.000 0.666667 0.187037 0.071287 4.571287 4.500000
2 0.048 0.680000 0.650794 0.010244 4.581531 4.571287
3 0.055 0.686667 0.205432 0.064904 4.646435 4.581531
...
15 0.055 0.686667 0.5 0.064904 23.0 20.0
...
28 0.055 0.686667 0.6 0.064904 35.0 34.0
...
30 0.055 0.686667 0.9 0.064904 44.0 39.0
And Dataframe 1 (time based measurement):
Fx Fy Fz abs_t expected output ('a' from DF1)
0 -13.9 170.3 45.0 0.005 0.000000
1 -14.1 151.6 38.2 0.010 0.000000
...
200 -14.1 131.4 30.4 20.015 0.5
...
300 -14.3 111.9 21.1 34.01 0.6
...
400 -14.5 95.6 13.2 40.025
So i want to check the time(abs_t) from DF1 and search for the corract 'a' in DF2
So somthing like this (pseudo code):
if (DF1['t_abs'] between (DF2['t-start'], DF2['t-end']):
DF1['a'] = DF2['a']
I could make two for loops but it looks like the wrong way and is very very slow.
I hope you understand my problem; to provide a running sample is very hard.
In Excel I did like this:

I found a very slow solution but atleast its working :(
df1['a'] = 0
for index, row in df2.iterrows():
start = row['t-start']
end = row ['t-end']
a = row ['a']
df1.loc[(df1['tabs']>start)&(df1['tabs']<end), 'a'] = a

Get time difference between two values in csv file [duplicate]

This question already has answers here:
Pandas: Difference to previous value
(2 answers)
Closed 3 years ago.
I trying to get the avarage, max and min time difference between value occurrences in a csv file.
The contains a multiple columns and rows.
I am currently working in python and trying to use pandas to solve my problem.
I have managed to break down the csv file to the column i want to get the time difference from and the time column.
Where the "payload" column "value occurrences" happens.
looking like:
time | payload
12.1 2368
13.8 2508
I have also tried to get the time in a array when the value occurrences happens and tried to step through the array but failed bad. I felt like there was a easier way to do it.
def average_time(avg_file):
avg_read = pd.read_csv(avg_file, skiprows=2, names=new_col_names, usecols=[2, 3], na_filter=False, skip_blank_lines=True)
test=[]
i=0
for row in avg_read.payload:
if row != None:
test[i]=avg_read.time
i+=1
if len[test] > 2:
average=test[1]-test[0]
i=0
test=[]
return average
The csv-file currently look like:
time | payload
12.1 2250
12.5 2305
12.9 (blank)
13.1 (blank)
13.5 2309
14.6 2350
14.9 2680
15.0 (blank)
I want to get the time diffenrence between the values in the payload columen. example time between
2250 and 2305 --> 12.5-12.1 = 0.4 sec
and the get the difference between
2305 and 2309 --> 13.5-12.5 = 1 s
Skipping the blank numbers
To later on get the maximum, minimun and average difference.

First use dropna then use Series.diff
DataFrame used:
print(df)
time payload
0 12.1 2250.0
1 12.5 2305.0
2 12.9 NaN
3 13.1 NaN
4 13.5 2309.0
5 14.6 2350.0
6 14.9 2680.0
7 15.0 NaN
df.dropna().time.diff()
0 NaN
1 0.4
4 1.0
5 1.1
6 0.3
Name: time, dtype: float64
Note I assumed your (blank) values are NaN, else use the following before running my code:
df.replace('(blank)', np.NaN, inplace=True, axis=1)
# Or if they are whitespaces
df.replace('', np.NaN, inplace=True, axis=1)

Python: Imported csv not being split into proper columns

I am importing a csv file into python using pandas but the data frame is only in one column. I copied and pasted data from the comma-separated format from The Player Standing Field table at this link (second one) into an excel file and saved it as a csv (originally as ms-dos, then both as normal and utf-8 per recommendation by AllthingsGo42). But it only returned a single column data frame.
Examples of what I tried:
dataset=pd.read('MLB2016PlayerStats2.csv')
dataset=pd.read('MLB2016PlayerStats2.csv', delimiter=',')
dataset=pd.read_csv('MLB2016PlayerStats2.csv',encoding='ISO-8859-9',
delimiter=',')
The each line of code above all returned:
Rk,Name,Age,Tm,Lg,G,GS,CG,Inn,Ch,PO,A,E,DP,Fld%,Rtot,Rtot/yr,Rdrs,Rdrs/yr,RF/9,RF/G,Pos Summary
1,Fernando Abad\abadfe01,30,TOT,AL,57,0,0,46.2...
2,Jose Abreu\abreujo02,29,CHW,AL,152,152,150,1...
3,A.J. Achter\achteaj01,27,LAA,AL,27,0,0,37.2,...
4,Dustin Ackley\ackledu01,28,NYY,AL,23,16,10,1...
5,Cristhian Adames\adamecr01,24,COL,NL,69,43,3...
Also tried:
dataset=pd.read_csv('MLB2016PlayerStats2.csv',encoding='ISO-8859-9',
delimiter=',',quoting=3)
Which returned:
"Rk Name Age Tm Lg G GS CG Inn Ch
\
0 "1 Fernando Abad\abadfe01 30 TOT AL 57 0 0 46.2 4
1 "2 Jose Abreu\abreujo02 29 CHW AL 152 152 150 1355.2 1337
2 "3 A.J. Achter\achteaj01 27 LAA AL 27 0 0 37.2 6
3 "4 Dustin Ackley\ackledu01 28 NYY AL 23 16 10 140.1 97
4 "5 Cristhian Adames\adamecr01 24 COL NL 69 43 38 415.0 212
E DP Fld% Rtot Rtot/yr Rdrs Rdrs/yr RF/9 RF/G \
0 ... 0 1 1.000 NaN NaN NaN NaN 0.77 0.07
1 ... 10 131 0.993 -2.0 -2.0 -5.0 -4.0 8.81 8.73
2 ... 0 0 1.000 NaN NaN 0.0 0.0 1.43 0.22
3 ... 0 8 1.000 1.0 9.0 3.0 27.0 6.22 4.22
4 ... 6 24 0.972 -4.0 -12.0 1.0 3.0 4.47 2.99
Pos Summary"
0 P"
1 1B"
2 P"
3 1B-OF-2B"
4 SS-2B-3B"
Below is what the data looks like in notepad++
"Rk,Name,Age,Tm,Lg,G,GS,CG,Inn,Ch,PO,A,E,DP,Fld%,Rtot,Rtot/yr,Rdrs,Rdrs/yr,RF/9,RF/G,Pos Summary"
"1,Fernando Abad\abadfe01,30,TOT,AL,57,0,0,46.2,4,0,4,0,1,1.000,,,,,0.77,0.07,P"
"2,Jose Abreu\abreujo02,29,CHW,AL,152,152,150,1355.2,1337,1243,84,10,131,.993,-2,-2,-5,-4,8.81,8.73,1B"
"3,A.J. Achter\achteaj01,27,LAA,AL,27,0,0,37.2,6,2,4,0,0,1.000,,,0,0,1.43,0.22,P"
"4,Dustin Ackley\ackledu01,28,NYY,AL,23,16,10,140.1,97,89,8,0,8,1.000,1,9,3,27,6.22,4.22,1B-OF-2B"
"5,Cristhian Adames\adamecr01,24,COL,NL,69,43,38,415.0,212,68,138,6,24,.972,-4,-12,1,3,4.47,2.99,SS-2B-3B"
"6,Austin Adams\adamsau01,29,CLE,AL,19,0,0,18.1,1,0,0,1,0,.000,,,0,0,0.00,0.00,P"
Sorry for the confusion with my question before. I hope this edit will clear things up. Thank you to those that answered thus far.

Running it quickly myself, I was able to get what I am understanding is the desired output.
My only thought is that there i s no need to call out a delimiter for a csv, because a csv is a comma separated variable file, but that should not matter. I am thinking that there is something incorrect with your actual data file and I would go and make sure it is saved correctly. I would echo previous comments and make sure that the csv is a UTF-8, and not an MS-DOS or Macintosh (both options when saving in excel)
Best of luck!

There is no need to call for a delimiter for a csv. You only have to change the separator from ";" to ",". For this you can open your csv file with notepad and change them with the replace tool.

Read multiple excel file with different sheets names in pandas

To read files from a directory, try the following:
import os
import pandas as pd
path=os.getcwd()
files=os.listdir(path)
files
['wind-diciembre.xls', 'stat_noviembre.xls', 'stat_marzo.xls', 'wind-noviembre.xls', 'wind-enero.xls', 'stat_octubre.xls', 'wind-septiembre.xls', 'stat_septiembre.xls', 'wind-febrero.xls', 'wind-marzo.xls', 'wind-julio.xls', 'wind-octubre.xls', 'stat_diciembre.xls', 'stat_julio.xls', 'wind-junio.xls', 'stat_abril.xls', 'stat_enero.xls', 'stat_junio.xls', 'stat_agosto.xls', 'stat_febrero.xls', 'wind-abril.xls', 'wind-agosto.xls']
where:
stat_enero
Fecha HR PreciAcu RadSolar T Presion Tmax HRmax \
01/01/2011 37 0 162 18.5 0 31.2 86
02/01/2011 70 0 58 12.0 0 14.6 95
03/01/2011 62 0 188 15.3 0 24.9 86
04/01/2011 69 0 181 17.0 0 29.2 97
.
.
.
Presionmax RadSolarmax Tmin HRmin Presionmin
0 0 774 12.3 9 0
1 0 314 9.2 52 0
2 0 713 8.3 32 0
3 0 730 7.7 26 0
.
.
.
and
wind-enero
Fecha MagV MagMax Rachas MagRes DirRes DirWind
01/08/2011 00:00 4.3 14.1 17.9 1.0 281.3 ONO
02/08/2011 00:00 4.2 15.7 20.6 1.5 28.3 NNE
03/08/2011 00:00 4.6 23.3 25.6 2.9 49.2 ENE
04/08/2011 00:00 4.8 17.9 23.0 2.0 30.5 NNE
.
.
.
The next step is to read, parse and add the files to a dataframe, Now I do the following:
for f in files:
data=pd.ExcelFile(f)
data1=data.sheet_names
print data1
[u'diciembre']
[u'Hoja1']
[u'Hoja1']
[u'noviembre']
[u'enero']
[u'Hoja1']
[u'septiembre']
[u'Hoja1']
[u'febrero']
[u'marzo']
[u'julio']
.
.
.
for sheet in data1:
data2=data.parse(sheet)
data2
Fecha MagV MagMax Rachas MagRes DirRes DirWind
01/08/2011 00:00 4.3 14.1 17.9 1.0 281.3 ONO
02/08/2011 00:00 4.2 15.7 20.6 1.5 28.3 NNE
03/08/2011 00:00 4.6 23.3 25.6 2.9 49.2 ENE
04/08/2011 00:00 4.8 17.9 23.0 2.0 30.5 NNE
05/08/2011 00:00 6.0 22.5 26.3 4.4 68.7 ENE
06/08/2011 00:00 4.9 23.8 23.0 3.3 57.3 ENE
07/08/2011 00:00 3.4 12.9 20.2 1.6 104.0 ESE
08/08/2011 00:00 4.0 20.5 22.4 2.6 79.1 ENE
09/08/2011 00:00 4.1 22.4 25.8 2.9 74.1 ENE
10/08/2011 00:00 4.6 18.4 24.0 2.3 52.1 ENE
11/08/2011 00:00 5.0 22.3 27.8 3.3 65.0 ENE
12/08/2011 00:00 5.4 24.9 25.6 4.1 78.7 ENE
13/08/2011 00:00 5.3 26.0 31.7 4.5 79.7 ENE
14/08/2011 00:00 5.9 31.7 29.2 4.5 59.5 ENE
15/08/2011 00:00 6.3 23.0 25.1 4.6 70.8 ENE
16/08/2011 00:00 6.3 19.5 30.8 4.8 64.0 ENE
17/08/2011 00:00 5.2 21.2 25.3 3.9 57.5 ENE
18/08/2011 00:00 5.0 22.3 23.7 2.6 59.4 ENE
19/08/2011 00:00 4.4 21.6 27.5 2.4 57.0 ENE
The above output shows only part of the file,how I can parse all files and add them to a dataframe

First off, it appears you have a few different datasets in these files. You may want them all in one dataframe, but for now, I am going to assume you want them separated. Ex (All of the wind*.xls files in one dataframe and all of the stat*.xls files in another.) You could parse the data using read_excel and then concatenate the results using the timestamp as the index as follows:
import numpy as np
import pandas as pd, datetime as dt
import glob, os
runDir = "Path to files"
if os.getcwd() != runDir:
os.chdir(runDir)
files = glob.glob("wind*.xls")
df = pd.DataFrame()
for each in files:
sheets = pd.ExcelFile(each).sheet_names
for sheet in sheets:
df = df.append(pd.read_excel(each, sheet, index_col='Fecha'))
You now have a time-indexed dataframe! If you really want to have all of the data in one dataframe (from all of the file types), you can just adjust the glob to include all of the files using something like glob.glob('*.xls'). I would warn from personal experience that it may be easier for you to read in each type of data separately and then merge them after you have done some error checking/munging etc.

Below solution is just a minor tweak on #DavidHagan's answer above.
This one includes a column to identify the read File No like F0, F1, etc.
and sheet no of each file as S0, S1, etc.
So that we can know where the rows came from.
import numpy as np
import pandas as pd, datetime as dt
import glob, os
import sys
runDir = r'c:\blah\blah'
if os.getcwd() != runDir:
os.chdir(runDir)
files = glob.glob(r'*.*xls*')
df = pd.DataFrame()
#fno is 0, 1, 2, ... (for each file)
for fno, each in enumerate(files):
sheets = pd.ExcelFile(each).sheet_names
# sno iss 0, 1, 2, ... (for each sheet)
for sno, sheet in enumerate(sheets):
FileNo = 'F' + str(fno) #F0, F1, F2, etc.
SheetNo = 'S' + str(sno) #S0, S1, S2, etc.
# print FileNo, SheetNo, each, sheet #debug info
#header = None if you don't want header or take this out.
#dfxl is dataframe of each xl sheet
dfxl = pd.read_excel(each, sheet, header=None)
#add column of FileNo and SheetNo to the dataframe
dfxl['FileNo'] = FileNo
dfxl['SheetNo'] = SheetNo
#now add the current xl sheet to main dataframe
df = df.append(dfxl)
After doing above.. i.e. reading multiple XL Files and Sheets into a single dataframe (df)... you can do this.. to get a sample row from each File, Sheet combination.. and the sample wil be available in dataframe (dfs1).
#get unique FileNo and SheetNo in dft2
dft2 = df.loc[0,['FileNo', 'SheetNo']]
#empty dataframe to collect sample from each of the read file/sheets
dfs1 = pd.DataFrame()
#loop through each sheet and fileno names
for row in dft2.itertuples():
#get a sample from each file to view
dfts = df[(df.FileNo == row[1]) & (df.SheetNo ==row[2])].sample(1)
#append the 1 sample to dfs1. this will have a sample row
# from each xl sheet and file
dfs1 = dfs1.append(dfts, ignore_index = True)
dfs1.to_clipboard()

How to interpret values in a .txt data file as a time series

I have a data file that has values in it like this:
# DD MM YYYY HH MN SS Hs Hrms Hma x Tz Ts Tc THmax EP S T0 2 Tp Hrms EPS
29 11 2000 13 17 56 2.44 1.71 3.12 9.12 11.94 5.03 12.74 .83 8.95 15.03 1.80 .86
29 11 2000 13 31 16 2.43 1.74 4.16 9.17 11.30 4.96 11.70 .84 8.84 11.86 1.80 .87
I use the following to get the data in:
infile = open ("testfile.txt", 'r')
data = np.genfromtxt(infile,skiprows=2)
which gives me a numpy.ndarray
I want to be able to interpret the first 0-5 columns as a timestamp (DD:MM:YYY:HH:MN:SS), but this is where I get stumped - there seems to be a million ways to do it and I don't know what's best.
I've been looking at dateutil and pandas - I know there is something blindingly obvious I should do, but am at a loss. Should I convert to a csv format first? Somehow concatenate the values from each row (cols 0-5) using a for loop?
After this I'll plot values from other columns against the timestamps/deltas.
I'm totally new to python, so any pointers appreciated :)

Here's a pandas solution for you:
test.csv:
29 11 2000 13 17 56 2.44 1.71 3.12 9.12 11.94 5.03 12.74 .83 8.95 15.03 1.80 .86
29 11 2000 13 31 16 2.43 1.74 4.16 9.17 11.30 4.96 11.70 .84 8.84 11.86 1.80 .87
pandas provide a read_csv util for reading the csv, you should give the following parameters to parse your file:
delimiter: the default one is comma, so you need to set it as a space
parse_dates: those date columns (order sensitive)
date_parser: the default is dateutil.parser.parse, but seems it doesn't work for your case, so you should implement your own parser
header: if your csv doesn't have the column name, you should set it as None
Finally, here the sample code:
In [131]: import datetime as dt
In [132]: import pandas as pd
In [133]: pd.read_csv('test.csv',
parse_dates=[[2,1,0,3,4,5]],
date_parser=lambda *arr:dt.datetime(*[int(x) for x in arr]),
delimiter=' ',
header=None)
Out[133]:
2_1_0_3_4_5 6 7 8 9 10 11 12 13 14 \
0 2000-11-29 13:17:56 2.44 1.71 3.12 9.12 11.94 5.03 12.74 0.83 8.95
1 2000-11-29 13:31:16 2.43 1.74 4.16 9.17 11.30 4.96 11.70 0.84 8.84
15 16 17
0 15.03 1.8 0.86
1 11.86 1.8 0.87

This is how I would do it:
from datetime import datetime
# assuming you have a row of the data in a list like this
# (also works on ndarrays in numpy, but you need to keep track of the row,
# so let's assume you've extracted a row like the one below...)
rowData = [29, 11, 2000, 13, 17, 56, 2.44, 1.71, 3.12, 9.12, 11.94, 5.03, 12.74, 0.83, 8.95, 15.03, 1.8, 0.86]
# unpack the first six values
day, month, year, hour, min, sec = rowData[:6]
# create a datetime based on the unpacked values
theDate = datetime(year,month,day,hour,min,sec)
No need to convert the data to a string and parse that. Might be good to check out the datetime documentation.

I barely know anything about numpy, but you can use the datetime module to convert the dates into a date object:
import datetime
line = "29 11 2000 13 17 56 2.44 1.71 3.12 9.12 11.94 5.03 12.74 .83 8.95 15.03 1.80 .86"
times = line.split()[:6]
Now from here you have two options:
print ':'.join(times)
# 29:11:2000:13:17:56
Or, as I said before, use the datetime module:
mydate = datetime.datetime.strptime(':'.join(times), '%d:%m:%Y:%H:%M:%S')
print datetime.datetime.strftime(mydate, '%d:%m:%Y:%H:%M:%S')
# 29:11:2000:13:17:56
Of course, you're probably thinking that the second option is useless, but if you want more information from the dates (i.e like the year), then it's probably better to convert it to a datetime object.

import datetime
import re
import numpy as np
def convert_to_datetime(x):
return datetime.datetime.strptime(x, '%d:%m:%Y:%H:%M:%S')
infile = open("testfile.txt", 'r')
infile = (re.sub(r'^(\d+) (\d+) (\d+) (\d+) (\d+) (\d+)', r'\1:\2:\3:\4:\5:\6', line, 1) for line in infile)
data = np.genfromtxt(infile, skiprows=2, converters={0: convert_to_datetime})

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

how to read url .txt files using pandas - python

Related

Select value from dataframe based on other dataframe

Get time difference between two values in csv file [duplicate]

Python: Imported csv not being split into proper columns

Read multiple excel file with different sheets names in pandas

How to interpret values in a .txt data file as a time series

Categories

Resources