https://m-selig.ae.illinois.edu/ads/coord/ag25.dat
I'm trying to scrape data from the UIUC airfoil database website, but the files behind the links are all formatted differently from one another. I tried using pandas read_table with skiprows
to skip the non-data part of each file, but every URL has a different number of rows to skip.
How can I manage to read only the numbers from each URL?
Use pd.read_fwf(), which reads a table of fixed-width formatted lines into a DataFrame.
As for handling files with different numbers of rows to skip: once the file is fetched, count lines until we reach one that contains only numeric values, then feed that count into the skiprows parameter.
Values greater than 1.0 can simply be filtered out of the dataframe afterwards.
import pandas as pd
from io import StringIO
import requests

url = 'https://m-selig.ae.illinois.edu/ads/coord/ag25.dat'
response = requests.get(url).text

# find the first line whose tokens are all numeric (a blank line also
# matches, which conveniently skips blank separator lines as well)
for idx, line in enumerate(response.split('\n'), start=1):
    if all(x.replace('.', '').isdecimal() for x in line.split()):
        break

skip = idx
df = pd.read_fwf(StringIO(response), skiprows=skip, header=None)

# drop rows containing values greater than 1.0 (stray header numbers, not coordinates)
df = df[~(df > 1).any(axis=1)]
Output:
print(df)
0 1
0 1.000000 0.000283
1 0.994054 0.001020
2 0.982050 0.002599
3 0.968503 0.004411
4 0.954662 0.006281
.. ... ...
155 0.954562 0.001387
156 0.968423 0.000836
157 0.982034 0.000226
158 0.994050 -0.000374
159 1.000000 -0.000680
[160 rows x 2 columns]
**Option 2:**
import pandas as pd
import requests

url = 'https://m-selig.ae.illinois.edu/ads/coord/b707b.dat'
response = requests.get(url).text

# keep only the lines whose tokens are all numeric (allowing '.' and '-')
lines = []
for line in response.split('\n'):
    if all(x.replace('.', '').replace('-', '').isdecimal() for x in line.split()):
        lines.append(line)

lines = [x.split() for x in lines]
df = pd.DataFrame(lines)
df = df.dropna(axis=0)
df = df.astype(float)
df = df[~(df > 1).any(axis=1)]
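Both options boil down to the same idea: treat a line as data only if every whitespace-separated token parses as a number. A minimal self-contained sketch of that heuristic, using an inline sample (values copied from the output above) instead of a live URL; using float() instead of the isdecimal() tricks also tolerates signs and exponents:

```python
import pandas as pd

# sample mimicking a UIUC .dat file: a name line followed by coordinate pairs
sample = """AG25 sample
  1.000000   0.000283
  0.994054   0.001020
  0.982050   0.002599
"""

def is_data_line(line):
    # a data line is non-empty and every token parses as a float
    tokens = line.split()
    if not tokens:
        return False
    try:
        [float(t) for t in tokens]
        return True
    except ValueError:
        return False

rows = [line.split() for line in sample.splitlines() if is_data_line(line)]
df = pd.DataFrame(rows, dtype=float)
print(df.shape)  # (3, 2)
```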
Related
I have the following code to import some data from a website, and I want to convert my data variable into a dataframe.
I've tried pd.DataFrame and pd.read_csv(io.StringIO(data), sep=";"), but both always raise an error.
import requests
import io

# load file
data = requests.get('https://www.omie.es/sites/default/files/dados/AGNO_2020/MES_08/TXT/INT_PDBC_MARCA_TECNOL_1_01_08_2020_31_08_2020.TXT').content

# decode data
data = data.decode('latin-1')

# skip first 2 rows
data = data.split('\r\n')[2:]
del data[1]

# trying to fix csv structure
lines = []
lines_2 = []
for line in data:
    line = ';'.join(line.split(';'))
    if len(line) > 0 and line[0].isdigit():
        lines.append(line)
        lines_2.append(line)
    else:
        if len(lines) > 0:
            lines_2.append(lines_2[-1] + line)
            lines_2.remove(lines_2[-2])
        else:
            lines.append(line)
data = '\r\n'.join(lines_2)
print(data)
The expected output should be like this:
date 1 2
0 29/08/2020 HI RE ....
1 30/08/2020 HI RE ....
2 31/08/2020 HI RE ...
There are a few rows that need to be appended to the previous one (the main rows are those that start with a date).
Prayson's answer is correct, but the skiprows parameter should also be used (otherwise the metadata is interpreted as column names).
import pandas as pd
df = pd.read_csv(
"https://www.omie.es/sites/default/files/dados/AGNO_2020/MES_08/TXT/INT_PDBC_MARCA_TECNOL_1_01_08_2020_31_08_2020.TXT",
sep=";",
skiprows=2,
encoding='latin-1',
)
print(df)
You can read text/CSV data directly from a URL with pandas:
import pandas as pd
URI = 'https://www.omie.es/sites/default/files/dados/AGNO_2020/MES_08/TXT/INT_PDBC_MARCA_TECNOL_1_01_08_2020_31_08_2020.TXT'
df = pd.read_csv(URI, sep=';', encoding='latin1')
print(df)
pandas will do the downloading for you, so there's no need for requests or io.StringIO.
Let's say I have a .txt file like that:
#D=H|ID|STRINGIDENTIFIER
#D=T|SEQ|DATETIME|VALUE
H|879|IDENTIFIER1
T|1|1569972384|7
T|2|1569901951|9
T|3|1569801600|8
H|892|IDENTIFIER2
T|1|1569972300|109
T|2|1569907921|101
T|3|1569803600|151
And I need to create a dataframe like this:
IDENTIFIER SEQ DATETIME VALUE
879_IDENTIFIER1 1 1569972384 7
879_IDENTIFIER1 2 1569901951 9
879_IDENTIFIER1 3 1569801600 8
892_IDENTIFIER2 1 1569972300 109
892_IDENTIFIER2 2 1569907921 101
892_IDENTIFIER2 3 1569803600 151
What would be the possible code?
A basic way to do it might simply be to process the text file and convert it into CSV before using pandas' read_csv function. Assuming the file you want to process is as consistent as the example:
import pandas as pd

with open('text.txt', 'r') as file:
    fileAsRows = file.read().split('\n')

pdInput = 'IDENTIFIER,SEQ,DATETIME,VALUE\n'  # add header
for row in fileAsRows:
    cols = row.split('|')  # break up row
    if row.startswith('H'):  # get identifier info from H row
        Identifier = cols[1] + '_' + cols[2]
    if row.startswith('T'):  # get other info from T row
        Seq = cols[1]
        DateTime = cols[2]
        Value = cols[3]
        tempList = [Identifier, Seq, DateTime, Value]
        pdInput += ','.join(tempList) + '\n'

with open("pdInput.csv", "w") as file:  # 'w' (not 'a') so reruns don't append duplicates
    file.write(pdInput)

# import into pandas
df = pd.read_csv("pdInput.csv")
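The CSV string built in the loop can also be fed to pandas directly via io.StringIO, skipping the intermediate file altogether. A sketch of that variant, using a couple of sample rows in place of the full pdInput string built above:

```python
import pandas as pd
from io import StringIO

# stand-in for the pdInput string assembled in the loop above,
# shown here with the header and two sample rows from the example file
pdInput = (
    'IDENTIFIER,SEQ,DATETIME,VALUE\n'
    '879_IDENTIFIER1,1,1569972384,7\n'
    '879_IDENTIFIER1,2,1569901951,9\n'
)

# parse the in-memory string directly, with no temporary file on disk
df = pd.read_csv(StringIO(pdInput))
print(df.columns.tolist())  # ['IDENTIFIER', 'SEQ', 'DATETIME', 'VALUE']
```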
I have a data frame from pandas and now I want to add column names, but only for the second row. Here is an example of my previous output:
Desired output:
My code:
data_line = open("file1.txt", mode="r")
lines = []
for line in data_line:
    lines.append(line)
for i, line in enumerate(lines):
    # print('{}={}'.format(i+1, line.strip()))
    file1_header = lines[0]

num_line = 1
Dictionary_File1 = {}
Value_File1 = data_type[0:6]
Value_File1_short = []
i = 1
for element in Value_File1:
    type = element.split(',')
    Value_File1_short.append(type[0] + ", " + type[1] + ", " + type[4])
    i += 1
Dictionary_File1[file1_header] = Value_File1_short
pd_file1 = pd.DataFrame.from_dict(Dictionary_File1)
You should have a look at pandas.read_csv. The header keyword parameter allows you to indicate which line in the file to use for header names.
You could probably do it with something like:
pd.read_csv("file1.txt", header=1)
From my python shell I tested it out with:
>>> from io import StringIO # I use python3
>>> import pandas as pd
>>> data = """Type Type2 Type3
... A B C
... 1 2 3
... red blue green"""
>>> # StringIO below allows us to use "data" as input to read_csv
>>> # "sep" keyword is used to indicate how columns are separated in data
>>> df = pd.read_csv(StringIO(data), header=1, sep=r'\s+')
>>> df
A B C
0 1 2 3
1 red blue green
You can write a row using the csv module before writing your dataframe to the same file. Note that this won't help when reading the file back into pandas, which doesn't work with "duplicate" header rows. You could create MultiIndex columns, but that isn't necessary for your desired output.
import pandas as pd
import csv
from io import StringIO
# input file
x = """A,B,C
1,2,3
red,blue,green"""
# replace StringIO(x) with 'file.txt'
df = pd.read_csv(StringIO(x))
with open('file.txt', 'w', newline='') as fout:
    writer = csv.writer(fout)
    writer.writerow(['Type', 'Type2', 'Type3'])
    df.to_csv(fout, index=False)
# read file to check output is correct
df = pd.read_csv('file.txt')
print(df)
# Type Type2 Type3
# 0 A B C
# 1 1 2 3
# 2 red blue green
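If you do want both header rows interpreted on read-back (the MultiIndex option mentioned above, not needed for the desired output here), pandas can build MultiIndex columns from the first two rows with header=[0, 1]. A sketch using the same layout as file.txt:

```python
import pandas as pd
from io import StringIO

# same layout as file.txt above: an extra 'Type' row above the data header
text = """Type,Type2,Type3
A,B,C
1,2,3
red,blue,green"""

# header=[0, 1] turns the first two rows into MultiIndex columns
df = pd.read_csv(StringIO(text), header=[0, 1])
print(df.columns.tolist())  # [('Type', 'A'), ('Type2', 'B'), ('Type3', 'C')]
```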
So if I understand properly, you have a file "file.txt" containing your data, and a list containing the types of your data.
You want to add the list of types, to the pandas.DataFrame of your data. Correct?
If so, you can read the data from the txt file into a pandas DataFrame using pandas.read_csv(), and then define the column headers using df.columns.
So it would look something like:
df = pd.read_csv("file1.txt", header=None)
df.columns = data_type[0:6]
I hope this helps!
Cheers
I have a CSV file which basically looks like the following (I shortened it to a minimal example showing the structure):
ID1#First_Name
TIME_BIN,COUNT,AVG
09:00-12:00,100,50
15:00-18:00,24,14
21:00-23:00,69,47
ID2#Second_Name
TIME_BIN,COUNT,AVG
09:00-12:00,36,5
15:00-18:00,74,68
21:00-23:00,22,76
ID3#Third_Name
TIME_BIN,COUNT,AVG
09:00-12:00,15,10
15:00-18:00,77,36
21:00-23:00,55,18
As one can see, the data is separated into multiple blocks. Each block has a headline (e.g. ID1#First_Name) which contains two pieces of information (IDx and x_Name), separated by #.
Each headline is followed by the column headers (TIME_BIN, COUNT, AVG) which stay the same for all blocks.
Then follow some lines of data which belong to the column headers (e.g. TIME_BIN=09:00-12:00, COUNT=100, AVG=50).
I would like to parse this file into a Pandas dataframe which would look like the following:
ID Name TIME_BIN COUNT AVG
ID1 First_Name 09:00-12:00 100 50
ID1 First_Name 15:00-18:00 24 14
ID1 First_Name 21:00-23:00 69 47
ID2 Second_Name 09:00-12:00 36 5
ID2 Second_Name 15:00-18:00 74 68
ID2 Second_Name 21:00-23:00 22 76
ID3 Third_Name 09:00-12:00 15 10
ID3 Third_Name 15:00-18:00 77 36
ID3 Third_Name 21:00-23:00 55 18
This means that the headline may not be skipped but has to be split by the # and then linked to the data from the block it belongs to. Besides, the column headers are only needed once since they do not change later on.
Somehow I managed to achieve my goal with the following code. However, the approach looks kind of overcomplicated and not robust to me and I am sure that there are better ways to do this. Any suggestions are welcome!
import pandas as pd
from io import StringIO  # Python 3; for Python 2 use: from StringIO import StringIO
pathToFile = 'mydata.txt'
# read the textfile into a StringIO object and skip the repeating column header rows
s = StringIO()
with open(pathToFile) as file:
    for line in file:
        if not line.startswith('TIME_BIN'):
            s.write(line)
# reset buffer to the beginning of the StringIO object
s.seek(0)
# create new dataframe with desired column names
df = pd.read_csv(s, names=['TIME_BIN', 'COUNT', 'AVG'])
# split the headline string which is currently found in the TIME_BIN column and insert both parts as new dataframe columns.
# the headline is identified by its start which is 'ID'
df['ID'] = df[df.TIME_BIN.str.startswith('ID')].TIME_BIN.str.split('#').str.get(0)
df['Name'] = df[df.TIME_BIN.str.startswith('ID')].TIME_BIN.str.split('#').str.get(1)
# fill the NaN values in the ID and Name columns by propagating the last valid observation
df['ID'] = df['ID'].fillna(method='ffill')
df['Name'] = df['Name'].fillna(method='ffill')
# remove all rows where TIME_BIN starts with 'ID'
df['TIME_BIN'] = df['TIME_BIN'].drop(df[df.TIME_BIN.str.startswith('ID')].index)
df = df.dropna(subset=['TIME_BIN'])
# reorder columns to bring ID and Name to the front
cols = list(df)
cols.insert(0, cols.pop(cols.index('Name')))
cols.insert(0, cols.pop(cols.index('ID')))
df = df[cols]  # .ix has been removed from pandas; plain column selection does the reordering
import pandas as pd
from io import StringIO
import sys

pathToFile = 'mydata.txt'

s = StringIO()
cur_ID = None
with open(pathToFile) as f:
    for ln in f:
        if not ln.strip():
            continue
        if ln.startswith('ID'):
            # turn e.g. 'ID1#First_Name\n' into the row prefix 'ID1,First_Name,'
            cur_ID = ln.replace('\n', ',', 1).replace('#', ',', 1)
            continue
        if ln.startswith('TIME'):
            continue
        if cur_ID is None:
            print('No ID found')
            sys.exit(1)
        s.write(cur_ID + ln)
s.seek(0)

# create new dataframe with desired column names
df = pd.read_csv(s, names=['ID', 'Name', 'TIME_BIN', 'COUNT', 'AVG'])
If I have a file of 100+ columns, how can I make each column into an array, referenced by the column header, without having to write header1 = [1,2,3], header2 = ['a','b','c'], and so on?
Here is what I have so far, where headers is a list of the header names:
import pandas as pd
data = []
df = pd.read_csv('outtest.csv')
for i in headers:
data.append(getattr(df, i).values)
I want each element of the array headers to be the variable name of the corresponding data array in data (they are in order). Somehow I want one line that does this so that the next line I can say, for example, test = headername1*headername2.
import pandas as pd
If the headers are in the csv file, we can simply use:
df = pd.read_csv('outtest.csv')
If the headers are not present in the csv file:
headers = ['list', 'of', 'headers']
df = pd.read_csv('outtest.csv', header=None, names=headers)
Assuming headername1 and headername2 are constants:
test = df.headername1 * df.headername2
Or
test = df['headername1'] * df['headername2']
Assuming they are variable:
test = df[headername1] * df[headername2]
By default this form of access returns a pd.Series, which is generally interoperable with numpy. You can fetch the values explicitly using .values:
df[headername1].values
But you seem to already know this.
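If the goal really is one array per header without declaring a variable for each, a dict comprehension over df.columns gives a header-to-array lookup in one line. A sketch of that idea, with a tiny frame standing in for the 100+ column CSV:

```python
import pandas as pd

# small frame standing in for pd.read_csv('outtest.csv')
df = pd.DataFrame({'foo': [1, 2, 3], 'bar': [4, 5, 6]})

# map each header name to its underlying numpy array
data = {col: df[col].values for col in df.columns}

# columns can then be combined by name, e.g. an elementwise product
test = data['foo'] * data['bar']
print(test)  # elementwise products: 4, 10, 18
```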
I think I see what you're going for, so using a StringIO object to simulate a file object as the setup:
import pandas as pd
from io import StringIO

txt = '''foo,bar,baz
1, 2, 3
3, 2, 1'''

fileobj = StringIO(txt)
Here's the approximate code you want:
data = []
df = pd.read_csv(fileobj)
for i in df.columns:
    data.append(df[i])

for i in data:
    print(i)
prints
0    1
1    3
Name: foo, dtype: int64
0    2
1    2
Name: bar, dtype: int64
0    3
1    1
Name: baz, dtype: int64