How to read a url into a dataframe and join unwanted rows? - python

I have the following code to import some data from a website, and I want to convert my data variable into a dataframe.
I've tried pd.DataFrame and pd.read_csv(io.StringIO(data), sep=";"), but it always shows me an error.
import requests
import io

# load file
data = requests.get('https://www.omie.es/sites/default/files/dados/AGNO_2020/MES_08/TXT/INT_PDBC_MARCA_TECNOL_1_01_08_2020_31_08_2020.TXT').content
# decode data
data = data.decode('latin-1')
# skip first 2 rows
data = data.split('\r\n')[2::]
del data[1]
# trying to fix csv structure
lines = []
lines_2 = []
for line in data:
    line = ';'.join(line.split(';'))
    if len(line) > 0 and line[0].isdigit():
        lines.append(line)
        lines_2.append(line)
    else:
        if len(lines) > 0:
            lines_2.append(lines_2[-1] + line)
            lines_2.remove(lines_2[-2])
        else:
            lines.append(line)
data = '\r\n'.join(lines_2)
print(data)
the expected output should be like this:
date 1 2
0 29/08/2020 HI RE ....
1 30/08/2020 HI RE ....
2 31/08/2020 HI RE ...
There are a few rows that need to be appended to the previous one (the main rows are the ones that start with a date).

Prayson's answer is correct, but the skiprows parameter should also be used (otherwise the metadata is interpreted as column names).
import pandas as pd

df = pd.read_csv(
    "https://www.omie.es/sites/default/files/dados/AGNO_2020/MES_08/TXT/INT_PDBC_MARCA_TECNOL_1_01_08_2020_31_08_2020.TXT",
    sep=";",
    skiprows=2,
    encoding='latin-1',
)
print(df)

You can read text/csv data directly from a URL with pandas:
import pandas as pd
URI = 'https://www.omie.es/sites/default/files/dados/AGNO_2020/MES_08/TXT/INT_PDBC_MARCA_TECNOL_1_01_08_2020_31_08_2020.TXT'
df = pd.read_csv(URI, sep=';', encoding='latin1')
print(df)
pandas will do the downloading for you. So no need for requests or io.StringIO.
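If the stray continuation rows still need to be folded into the record above them, here is a minimal sketch in the spirit of the question's own loop (assuming, as in the question's code, that every real record starts with a digit and the first two lines are metadata):
import requests
import pandas as pd
from io import StringIO

URI = 'https://www.omie.es/sites/default/files/dados/AGNO_2020/MES_08/TXT/INT_PDBC_MARCA_TECNOL_1_01_08_2020_31_08_2020.TXT'
raw = requests.get(URI).content.decode('latin-1')

merged = []
for line in raw.split('\r\n')[2:]:  # skip the two metadata lines
    if line and line[0].isdigit():  # a new record starts with a date
        merged.append(line)
    elif merged and line:           # continuation line: glue it onto the previous record
        merged[-1] += line

df = pd.read_csv(StringIO('\n'.join(merged)), sep=';', header=None)
print(df)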

Related

How to manipulate csv entries from horizontal to vertical

I have a csv with the following entries:
apple,orange,bannana,grape
10,5,6,4
four,seven,eight,nine
yes,yes,no,yes
3,5,7,4
two,one,six,nine
no,no,no,yes
2,4,7,8
yellow,four,eight,one
no,yes,no,no
I would like to make a new csv file with the following format and so on:
apple,10,four,yes
orange,5,seven,yes
bannana,6,seven,no
grape,4,nine,yes
apple,3,two,no
orange,5,one,no
bannana,7,six,no
grape,4,nine,yes
So after grape it starts at apple with the new values.
I have tried using pandas DataFrames but can't figure out how to get the data formatted how I need it.
You could try the following in pure Python (data.csv being the name of the input file):
import csv
from itertools import islice

with open("data.csv", "r") as fin, \
        open("data_new.csv", "w") as fout:
    reader, writer = csv.reader(fin), csv.writer(fout)
    header = next(reader)
    length = len(header) - 1
    # pull `length` rows at a time; the walrus assignment ends the loop on an empty chunk
    while (rows := list(islice(reader, length))):
        # zip(*rows) turns the chunk's rows into columns, one per header name
        writer.writerows([first, *rest] for first, rest in zip(header, zip(*rows)))
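To illustrate the pairing, here is what the writerows expression produces for the first chunk of the provided sample:
header = ['apple', 'orange', 'bannana', 'grape']
rows = [['10', '5', '6', '4'],
        ['four', 'seven', 'eight', 'nine'],
        ['yes', 'yes', 'no', 'yes']]
print([[first, *rest] for first, rest in zip(header, zip(*rows))])
# [['apple', '10', 'four', 'yes'], ['orange', '5', 'seven', 'yes'],
#  ['bannana', '6', 'eight', 'no'], ['grape', '4', 'nine', 'yes']]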
Or with Pandas:
import pandas as pd

df = pd.read_csv("data.csv")
# put every 3 consecutive rows into one group (df.index // 3), number the rows
# inside each group 0-2 (df.index % 3), then transpose each group and stack them
df = pd.concat(gdf.T for _, gdf in df.set_index(df.index % 3).groupby(df.index // 3))
df.reset_index().to_csv("data_new.csv", index=False, header=False)
Output file data_new.csv for the provided sample:
apple,10,four,yes
orange,5,seven,yes
bannana,6,eight,no
grape,4,nine,yes
apple,3,two,no
orange,5,one,no
bannana,7,six,no
grape,4,nine,yes
apple,2,yellow,no
orange,4,four,yes
bannana,7,eight,no
grape,8,one,no
Hope it works for you.
df = pd.read_csv('<source file name>')
df.T.to_csv('<destination file name>')
You can transpose your dataframe in pandas as below.
pd.read_csv('file.csv', index_col=0, header=None).T
This question is already answered:
Can pandas read a transposed CSV?
According to your new description, the problem is completely changed.
You need to split your dataframe into subsets and merge them.
import pandas as pd

# Read dataframe without header
df = pd.read_csv('your_dataframe.csv', header=None)
# Create, transpose and collect subsets (header row + 3 data rows each),
# then concatenate them into a new DataFrame
parts = []
for i in range(1, df.shape[0], 3):
    temp = pd.concat([df.iloc[[0]], df.iloc[i:i+3]])
    temp = temp.transpose()
    temp.columns = [0, 1, 2, 3]
    parts.append(temp)
tr = pd.concat(parts)

How to scrape data into an excel file

https://m-selig.ae.illinois.edu/ads/coord/ag25.dat
I'm trying to scrape data from the UIUC airfoil database website, but every file is formatted a bit differently from the others. I tried using pandas read_table with skiprows
to skip the non-data part of the file, but every URL has a different number of rows to skip.
How can I manage to read only the numbers from the URL?
Use pd.read_fwf(), which reads a table of fixed-width formatted lines into a DataFrame.
In terms of handling different files with different numbers of rows to skip, we can count the rows once the file is read until we reach a line that contains only numeric values, then feed that count into the skiprows parameter.
In the case of values greater than 1.0, we can simply just filter those out from the dataframe.
import pandas as pd
from io import StringIO
import requests

url = 'https://m-selig.ae.illinois.edu/ads/coord/ag25.dat'
response = requests.get(url).text

# count lines until the first one whose fields are all numeric
for idx, line in enumerate(response.split('\n'), start=1):
    if all([x.replace('.', '').isdecimal() for x in line.split()]):
        break
skip = idx

df = pd.read_fwf(StringIO(response), skiprows=skip, header=None)
df = df[~(df > 1).any(axis=1)]
Output:
print(df)
0 1
0 1.000000 0.000283
1 0.994054 0.001020
2 0.982050 0.002599
3 0.968503 0.004411
4 0.954662 0.006281
.. ... ...
155 0.954562 0.001387
156 0.968423 0.000836
157 0.982034 0.000226
158 0.994050 -0.000374
159 1.000000 -0.000680
[160 rows x 2 columns]
Option 2:
import pandas as pd
import requests

url = 'https://m-selig.ae.illinois.edu/ads/coord/b707b.dat'
response = requests.get(url).text

# keep only the lines whose fields are all numeric (tolerating '.' and '-')
lines = []
for idx, line in enumerate(response.split('\n'), start=1):
    if all([x.replace('.', '').replace('-', '').isdecimal() for x in line.split()]):
        lines.append(line)

lines = [x.split() for x in lines]
df = pd.DataFrame(lines)
df = df.dropna(axis=0)
df = df.astype(float)
df = df[~(df > 1).any(axis=1)]

Convert list of multiple strings into a Python data frame

I have a list of string values that I read from a text document with splitlines, which yields something like this:
X = ["NAME|Contact|Education","SMITH|12345|Graduate","NITA|11111|Diploma"]
I have tried this:
for i in X:
    textnew = i.split("|")
    data[x] = textnew
I want to make a dataframe out of this
Name Contact Education
SMITH 12345 Graduate
NITA 11111 Diploma
You can read it directly from your file by specifying a sep argument to pd.read_csv.
df = pd.read_csv("/path/to/file", sep='|')
Or if you wish to convert it from list of string instead:
data = [row.split('|') for row in X]
headers = data.pop(0) # Pop the first element since it's header
df = pd.DataFrame(data, columns=headers)
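For the sample list, this should print something like the following (the values are strings here, since they come from split):
print(df)
#     NAME Contact Education
# 0  SMITH   12345  Graduate
# 1   NITA   11111   Diploma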
You had it almost correct, actually, but don't use data as a dictionary (i.e. assigning by key: data[x] = textnew):
X = ["NAME|Contact|Education","SMITH|12345|Graduate","NITA|11111|Diploma"]
df = []
for i in X:
    df.append(i.split("|"))
print(df)
# [['NAME', 'Contact', 'Education'], ['SMITH', '12345', 'Graduate'], ['NITA', '11111', 'Diploma']]
Depending on further transformations, pandas might be overkill for this kind of task.
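For instance, a minimal stdlib-only sketch (assuming the lines live in a file, hypothetically named data.txt) could be:
import csv

with open('data.txt', newline='') as f:
    rows = list(csv.reader(f, delimiter='|'))
header, records = rows[0], rows[1:]
print(header)   # ['NAME', 'Contact', 'Education']
print(records)  # [['SMITH', '12345', 'Graduate'], ['NITA', '11111', 'Diploma']]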
Here is a solution for your problem:
import pandas as pd

X = ["NAME|Contact|Education","SMITH|12345|Graduate","NITA|11111|Diploma"]
data = []
for i in X:
    data.append(i.split("|"))
# data.pop(0) removes the header row from data and returns it as the column names
df = pd.DataFrame(data, columns=data.pop(0))
In your situation, you can avoid loading the file with readlines and let pandas take care of loading the file.
As mentioned above, the solution is a standard read_csv:
import os
import pandas as pd

path = "/tmp"
filepath = "file.xls"
filename = os.path.join(path, filepath)
df = pd.read_csv(filename, sep='|')
print(df.head())
Another approach (useful when you have no access to the file or have to deal with a list of strings) is to wrap the list of strings as an in-memory text file, then load it normally using pandas:
import pandas as pd
from io import StringIO
X = ["NAME|Contact|Education", "SMITH|12345|Graduate", "NITA|11111|Diploma"]
# Wrap the string list as an in-memory file, one element per line
DATA = StringIO("\n".join(X))
# Load as a pandas dataframe
df = pd.read_csv(DATA, delimiter="|")
Here is the result:
    NAME  Contact Education
0  SMITH    12345  Graduate
1   NITA    11111   Diploma

How to get rid of "changing" rows above headers (length changes every time but headers and data are always the same)

I have a csv file with about 6-8 rows of extra content at the top. I know how to make a new dataframe in Pandas, and filter the data:
df = pd.read_csv('payments.csv')
df = df[df["type"] == "Order"]
print(df.groupby('sku').size())
df = df[df["marketplace"] == "amazon.com"]
print(df.groupby('sku').size())
df = df[df["promotional rebates"] > ((df["product sales"] + df["shipping credits"]) * -.25)]
print(df.groupby('sku').size())
df.to_csv("out.csv")
My issue is with the headers. I need to:
1. look for the row that has date/time and another field, so that I do not have to change my code if the file keeps changing the number of rows before the headers;
2. make a new DataFrame excluding those rows.
What is the best approach to make sure the code does not break on changes, as long as the header row exists and has a few matching fields? Open to any suggestions.
Considering a CSV file like this:
random line content
another random line
yet another one
datetime, settelment id, type
dd, dd, dd
You can use the following to compute the header's line number:
import pandas as pd

# load the first 20 rows of the csv file as a one-column dataframe
# to look for the header
df = pd.read_csv("csv_file.csv", sep="|", header=None, nrows=20)
# use a regular expression to check which row contains the header;
# the following generates an array of booleans that is True where
# the row matches the regex "datetime.+settelment id.+type"
indices = df.iloc[:,0].str.contains("datetime.+settelment id.+type")
# get the row index of the header
header_index = df[indices].index.values[0]
and read the csv file starting from the header's index:
# to read the csv file, use the following:
df = pd.read_csv("csv_file.csv", skiprows=header_index+1)
Reproducible example:
import pandas as pd
from io import StringIO
st = """
random line content
another random line
yet another one
datetime, settelment id, type
dd, dd, dd
"""
df = pd.read_csv(StringIO(st), sep="|", header=None, nrows=20)
indices = df.iloc[:,0].str.contains("datetime.+settelment id.+type")
header_index = df[indices].index.values[0]
df = pd.read_csv(StringIO(st), skiprows=header_index+1)
print(df)
print("columns")
print(df.columns)
print("shape")
print(df.shape)
Output:
datetime settelment id type
0 dd dd dd
columns
Index(['datetime', ' settelment id', ' type'], dtype='object')
shape
(1, 3)
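If you would rather avoid the regular expression and the extra pandas read, a plain-Python scan can compute the same line number (a sketch assuming the header line always starts with "datetime"):
import pandas as pd

with open("csv_file.csv") as f:
    for header_index, line in enumerate(f):
        if line.startswith("datetime"):
            break

# skiprows skips the junk lines above the header,
# so the header line itself becomes the column row
df = pd.read_csv("csv_file.csv", skiprows=header_index)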

Python: extracting data values from one file with IDs from a second file

I’m new to coding, and trying to extract a subset of data from a large file.
File_1 contains the data in two columns: ID and Values.
File_2 contains a large list of IDs, some of which may be present in File_1 while others will not be present.
If an ID from File_2 is present in File_1, I would like to extract those values and write the ID and value to a new file, but I’m not sure how to do this. Here is an example of the files:
File_1: data.csv
ID Values
HOT224_1_0025m_c100047_1 16
HOT224_1_0025m_c10004_1 3
HOT224_1_0025m_c100061_1 1
HOT224_1_0025m_c10010_2 1
HOT224_1_0025m_c10020_1 1
File_2: ID.xlsx
IDs
HOT224_1_0025m_c100047_1
HOT224_1_0025m_c100061_1
HOT225_1_0025m_c100547_1
HOT225_1_0025m_c100561_1
I tried the following:
import pandas as pd
data_file = pd.read_csv('data.csv', index_col = 0)
ID_file = pd.read_excel('ID.xlsx')
values_from_ID = data_file.loc[['ID_file']]
The following error occurs:
KeyError: "None of [['ID_file']] are in the [index]"
Not sure if I am reading in the excel file correctly.
I also do not know how to write the extracted data to a new file once I get the code to do it.
Thanks for your help.
With pandas:
import pandas as pd
data_file = pd.read_csv('data.csv', index_col=0, delim_whitespace=True)
ID_file = pd.read_excel('ID.xlsx', index_col=0)
res = data_file.loc[ID_file.index].dropna()
res.to_csv('result.csv')
Content of result.csv:
IDs,Values
HOT224_1_0025m_c100047_1,16.0
HOT224_1_0025m_c100061_1,1.0
In steps:
You need to read your csv with whitespace delimited:
data_file = pd.read_csv('data.csv', index_col=0, delim_whitespace=True)
it looks like this:
>>> data_file
Values
ID
HOT224_1_0025m_c100047_1 16
HOT224_1_0025m_c10004_1 3
HOT224_1_0025m_c100061_1 1
HOT224_1_0025m_c10010_2 1
HOT224_1_0025m_c10020_1 1
Now, read your Excel file, using the ids as index:
ID_file = pd.read_excel('ID.xlsx', index_col=0)
and you use its index with loc to get the matching entries from your first dataframe. Drop the missing values with dropna():
res = data_file.loc[ID_file.index].dropna()
Finally, write to the result csv:
res.to_csv('result.csv')
You can do it using a simple dictionary in Python. You can make a dictionary from file 1 and read the IDs from file 2. The IDs from file 2 can be looked up in the dictionary and only the matching ones written to your output file. Something like this could work:
with open('data.csv', 'r') as f:
    lines = f.readlines()
# Skip the CSV header
lines = lines[1:]
table = {l.split()[0]: l.split()[1] for l in lines if len(l.strip()) != 0}

with open('id.csv', 'r') as f:
    lines = f.readlines()
# Skip the CSV header
lines = lines[1:]
matchedIDs = [(l.strip(), table[l.strip()]) for l in lines if l.strip() in table]
Now you will have your matched IDs and their values in a list of tuples called matchedIDs. You can write them in any format you like in a file.
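For example, to dump the matched pairs back out as CSV:
import csv

with open('matched.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['ID', 'Value'])   # header
    writer.writerows(matchedIDs)       # one (id, value) pair per row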
I'm also new to Python programming, so the code I used below might not be the most efficient. The situation I assumed is that we want the IDs that appear in both data.csv and id.csv; there might be some IDs in data.csv that are not in id.csv, and vice versa.
import pandas as pd

data = pd.read_csv('data.csv')
id2 = pd.read_csv('id.csv')
data.ID = data['ID']
id2.ID = id2['IDs']

# collect the IDs from both files
d = []
for row in data.ID:
    d.append(row)
f = []
for row in id2.ID:
    f.append(row)

# keep only the IDs present in both lists
g = []
for i in d:
    if i in f:
        g.append(i)

data = pd.read_csv('data.csv', index_col='ID')
new_data = data.loc[g, :]
new_data.to_csv('new_data.csv')
This is the code I ended up using. It worked perfectly. Thanks to everyone for their responses.
import pandas as pd
data_file = pd.read_csv('data.csv', index_col=0)
ID_file = pd.read_excel('ID.xlsx', index_col=0)
res = data_file.loc[ID_file.index].dropna()
res.to_csv('result.csv')
