Parsing a tab-separated file in Python

Fairly new to Python.
I want to parse a file with \t-separated values. How do I remove the \t characters and separate the values into columns?
Code below.
import pandas as pd
import io
import requests
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00236/seeds_dataset.txt"
s = requests.get(url).content
df = pd.read_csv(io.StringIO(s.decode('utf-8')))

Add sep="\t" to pd.read_csv. The data is messy: some fields are separated by a double tab, which needs to be collapsed first:
df = pd.read_csv(
    io.StringIO(s.decode('utf-8').replace("\t\t", "\t")),
    header=None, sep="\t")
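Alternatively, a regex separator can absorb repeated tabs without pre-processing. A minimal sketch on made-up data shaped like the seeds file (the numbers below are illustrative, not the real dataset):

```python
import io
import pandas as pd

# Sample with inconsistent single/double tabs, mimicking the messy file
raw = "15.26\t14.84\t\t0.871\n14.88\t14.57\t0.8811\n"

# sep=r"\t+" treats one or more tabs as a single delimiter;
# regex separators require the slower Python engine
df = pd.read_csv(io.StringIO(raw), header=None, sep=r"\t+", engine="python")
print(df.shape)  # (2, 3)
```

This avoids the `replace("\t\t", "\t")` step, which would miss runs of three or more tabs.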

If using the csv library is an option, you can try (Python 3):
import pandas as pd
import requests
import csv

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00236/seeds_dataset.txt"
raw_data = requests.get(url).content

# Write the downloaded bytes to a local file, then read it back as text
with open("raw_data.txt", "wb") as file:
    file.write(raw_data)

with open("raw_data.txt", newline="") as file:
    data = list(csv.reader(file, delimiter="\t"))

df = pd.DataFrame.from_records(data)
print(df)

Related

How to create a function to import CSV files?

I'd like to create a function in Python to import CSV files from GitHub by just passing in the file name.
I tried the following code, but no dataframe is created. Can anyone give me a hand? Many thanks.
import pandas as pd
def in_csv(file_name):
    file = 'https://raw.githubusercontent.com/USER/file_name.csv'
    file_name = pd.read_csv(file, header=0)

in_csv('csv_name')
There are many ways to read a CSV, but with a pandas.DataFrame:
import pandas as pd
def in_csv(file_name):
    file_path = f'https://raw.githubusercontent.com/USER/{file_name}'
    df = pd.read_csv(file_path, header=0)
    return df

df = in_csv('csv_name')
print(df.head())
Thanks #Alian and #Aaron.
Just add '.csv' after {file_name} and it works perfectly:
import pandas as pd
def in_csv(file_name):
    file_path = f'https://raw.githubusercontent.com/USER/{file_name}.csv'
    df = pd.read_csv(file_path, header=0)
    return df

df = in_csv('csv_name')
In order to do this, you probably want to use a Python 3 F-string rather than a regular string. In this case, you would want to change your first line in the function to this:
file = f'https://raw.githubusercontent.com/USER/{file_name}.csv'
The f'{}' syntax substitutes the value of the variable or expression inside the braces in place of the literal text you included.
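A quick illustration of the substitution, using a hypothetical file name:

```python
# file_name is a hypothetical example value
file_name = "csv_name"

# The {file_name} placeholder is replaced with the variable's value
url = f'https://raw.githubusercontent.com/USER/{file_name}.csv'
print(url)  # https://raw.githubusercontent.com/USER/csv_name.csv
```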

How to handle a .json file in tabular form in Python?

By using this code:
import pandas as pd
patients_df = pd.read_json('/content/students.json',lines=True)
patients_df.head()
the data are shown in tabular form.
The raw JSON file can also be read line by line like this:
import json

data = []
for line in open('/content/students.json', 'r'):
    data.append(json.loads(line))
How can I get the scores column of the table split into organized columns named Exam, Quiz, and Homework?
A possible solution could be the following:
# pip install pandas
import pandas as pd
import json

def separate_column(row):
    # Copy each {"type": ..., "score": ...} entry into its own column
    for e in row["scores"]:
        row[e["type"]] = e["score"]
    return row

with open('/content/students.json', 'r') as file:
    data = [json.loads(line.rstrip()) for line in file]

df = pd.json_normalize(data)
df = df.apply(separate_column, axis=1)
df = df.drop(['scores'], axis=1)
print(df)
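The same flattening can be tried without the file, on made-up records shaped like the students data (the names and scores below are invented assumptions, not the real file):

```python
import pandas as pd

# Hypothetical records mimicking one JSON object per line
data = [
    {"name": "Aimee", "scores": [{"type": "exam", "score": 88.0},
                                 {"type": "quiz", "score": 79.0},
                                 {"type": "homework", "score": 91.0}]},
    {"name": "Zach", "scores": [{"type": "exam", "score": 72.0},
                                {"type": "quiz", "score": 64.0},
                                {"type": "homework", "score": 83.0}]},
]

df = pd.json_normalize(data)
# Turn each {"type": ..., "score": ...} pair into its own column
for col in ("exam", "quiz", "homework"):
    df[col] = df["scores"].apply(
        lambda scores, c=col: next(s["score"] for s in scores if s["type"] == c)
    )
df = df.drop(columns=["scores"])
print(df)
```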

How to read a URL into a dataframe and join broken rows?

I have the following code to import some data from a website, I want to convert my data variable into a dataframe.
I've tried pd.DataFrame and pd.read_csv(io.StringIO(data), sep=";"), but both show me an error.
import requests
import io
# load file
data = requests.get('https://www.omie.es/sites/default/files/dados/AGNO_2020/MES_08/TXT/INT_PDBC_MARCA_TECNOL_1_01_08_2020_31_08_2020.TXT').content
# decode data
data = data.decode('latin-1')
# skip first 2 rows
data = data.split('\r\n')[2::]
del data[1]
# trying to fix csv structure
lines = []
lines_2 = []
for line in data:
    line = ';'.join(line.split(';'))
    if len(line) > 0 and line[0].isdigit():
        lines.append(line)
        lines_2.append(line)
    else:
        if len(lines) > 0:
            lines_2.append(lines_2[-1] + line)
            lines_2.remove(lines_2[-2])
        else:
            lines.append(line)
data = '\r\n'.join(lines_2)
print(data)
the expected output should be like this:
date 1 2
0 29/08/2020 HI RE ....
1 30/08/2020 HI RE ....
2 31/08/2020 HI RE ...
There are a few rows that need to be appended to the previous one (the main rows are those that start with a date).
Prayson's answer is correct, but the skiprows parameter should also be used (otherwise the metadata is interpreted as column names).
import pandas as pd
df = pd.read_csv(
    "https://www.omie.es/sites/default/files/dados/AGNO_2020/MES_08/TXT/INT_PDBC_MARCA_TECNOL_1_01_08_2020_31_08_2020.TXT",
    sep=";",
    skiprows=2,
    encoding='latin-1',
)
print(df)
You can read text/CSV data directly from a URL with pandas:
import pandas as pd
URI = 'https://www.omie.es/sites/default/files/dados/AGNO_2020/MES_08/TXT/INT_PDBC_MARCA_TECNOL_1_01_08_2020_31_08_2020.TXT'
df = pd.read_csv(URI, sep=';', encoding='latin1')
print(df)
pandas does the downloading for you, so there is no need for requests or io.StringIO.
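For the row-joining part of the question, the loop can be reduced to one rule: any line that does not start with a digit is a continuation of the previous record. A sketch on made-up lines (the content is invented to match the expected output, not the real OMIE format):

```python
# Hypothetical wrapped lines; real rows start with a date like 29/08/2020
raw_lines = [
    "29/08/2020;HI;RE",
    "30/08/2020;HI",
    ";RE",            # continuation of the previous row
    "31/08/2020;HI;RE",
]

merged = []
for line in raw_lines:
    if line[:1].isdigit():
        merged.append(line)       # a new record starts with a digit
    elif merged:
        merged[-1] += line        # glue the continuation onto the previous record
print(merged)
```

The merged lines can then be joined with '\r\n' and fed to pd.read_csv as in the question.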

How to put this distorted data into a CSV file in table format

Please find below the code that I am using to write the scraped data into CSV files in table format:
import requests
from bs4 import BeautifulSoup
import csv
f = open('moneyControl-bonus', 'w', newline='')
writer = csv.writer(f)
f2 = open('moneyControl-dividend', 'w', newline='')
writer2 = csv.writer(f2)

url = 'https://www.moneycontrol.com/stocks/marketinfo/upcoming_actions/home.html'
headers = {'user-agent': 'Mozilla/5.0'}
response = requests.get(url, headers)
soup = BeautifulSoup(response.content, 'lxml')

div = soup.find_all('div', class_='tbldata36 PT10')[0]
for table in div.find_all('table'):
    for row in table.find_all('tr'):
        writer.writerow[row]

div2 = soup.find_all('div', class_='tbldata36 PT20')[0]
for table2 in soup.find_all('table'):
    for row2 in table2.find_all('tr'):
        writer2.writerow[row2]
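Note that writer.writerow[row] indexes the method instead of calling it, and writerow() expects a list of cell strings rather than a BeautifulSoup tag. A minimal sketch of the corrected pattern (the row data here is a made-up stand-in for the scraped cells):

```python
import csv
import io

# Stand-in for the lists of cell text extracted from each <tr>
rows = [["ACC", "10:1"], ["TCS", "5:2"]]

buf = io.StringIO()  # writing to memory here; a file object works the same way
writer = csv.writer(buf)
for row in rows:
    writer.writerow(row)   # call writerow(), don't index it
print(buf.getvalue())
```

In the scraping loop, the row list would come from something like [cell.get_text() for cell in row.find_all('td')].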
Have you tried using pandas? It is the go-to library for writing the CSV format.
# pip install pandas
import pandas as pd
You can write into the CSV file by passing either a list of lists or a dictionary:
df = pd.DataFrame({'col1':[1,2,3,4],'col2':['a','b','c','d']})
df = pd.DataFrame([[1,2,3,4],['a','b','c','d']],columns=['col1','col2'])
df.to_csv(path_and_name_of_file)
You can export to many other formats with it as well, such as Excel, table, text, JSON, etc.
Please take a look at the official DataFrame Documentation

JSON from API call to pandas dataframe

I'm trying to make an API call and save the response as a dataframe.
The problem is that I need the data from the 'result' column, and I haven't managed to extract it.
I'm basically just trying to save the API response as a CSV file in order to work with it.
P.S. When I do this with a "JSON to CSV converter" from the web, it works as I wish (example: https://konklone.io/json/).
import requests
import pandas as pd
import json
res = requests.get("http://api.etherscan.io/api?module=account&action=txlist&address=0xddbd2b932c763ba5b1b7ae3b362eac3e8d40121a&startblock=0&endblock=99999999&sort=asc&apikey=YourApiKeyToken")
j = res.json()
j
df = pd.DataFrame(j)
df.head()
output example picture
Try this
import requests
import pandas as pd
import json
res = requests.get("http://api.etherscan.io/api?module=account&action=txlist&address=0xddbd2b932c763ba5b1b7ae3b362eac3e8d40121a&startblock=0&endblock=99999999&sort=asc&apikey=YourApiKeyToken")
j = res.json()
# print(j)
filename ="temp.csv"
df = pd.DataFrame(j['result'])
print(df.head())
df.to_csv(filename)
Looks like you need:
df = pd.DataFrame(j["result"])
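The shape of the fix can be tried offline with a stand-in for the API response (the field names below are illustrative, not the full Etherscan schema):

```python
import io
import pandas as pd

# Hypothetical response mimicking {"status": ..., "message": ..., "result": [...]}
j = {
    "status": "1",
    "message": "OK",
    "result": [
        {"blockNumber": "47884", "value": "100"},
        {"blockNumber": "47985", "value": "250"},
    ],
}

# Only the list under "result" becomes rows; the outer keys are metadata
df = pd.DataFrame(j["result"])

buf = io.StringIO()  # in-memory stand-in for a CSV file on disk
df.to_csv(buf, index=False)
print(buf.getvalue())
```

Passing the whole response to pd.DataFrame instead stuffs the metadata into every row, which is why indexing 'result' first matters.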
