I extracted some tables from an HTML file using Pandas and BeautifulSoup. However, the output is really messy. For some reason, Pandas adds columns and leaves the original columns unaligned: the header and the rest of the values no longer sit in the same column. Similarly, it splits a value across columns when ) or % appears in it.
This is the code I used to extract the table I want:
import requests
import pandas as pd
from bs4 import BeautifulSoup

r = requests.get('https://www.sec.gov/Archives/edgar/data/1961/0001264931-18-000031.txt')
soup = BeautifulSoup(r.content, "lxml")
tables = soup.find_all('table')
for i, table in enumerate(tables):
    # keep only tables mentioning both "income tax" and "statutory"
    if (('income tax' in str(table)) or ('Income tax' in str(table))) and (('Statutory' in str(table)) or ('statutory' in str(table))):
        df = pd.read_html(str(table))
        print(df[0])
    else:
        pass
Not sure why df[0] prints the whole list, but that is no problem for me. The table I am interested in looks like:
[ 0 1 2 3 4 \
0 NaN NaN 2017.0 NaN 2016
1 Income tax computed at the federal statutory rate NaN NaN 34 %
2 Income tax computed at the state statutory rate NaN NaN 5 %
3 Valuation allowance NaN NaN (39 )%
4 Total deferred tax asset NaN NaN 0 %
5 6 7 8
0 NaN NaN NaN NaN
1 NaN NaN 34 %
2 NaN NaN 5 %
3 NaN NaN (39 )%
4 NaN NaN 0 % ]
If I display the HTML code this table is based on, it looks clean, with the columns aligned. However, through Pandas it is messy. Does anyone know how to solve this?
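One common cleanup (a sketch, not part of the original post) is to drop the all-NaN filler columns that read_html creates from the filing's spacer cells, which realigns most headers with their values:

import pandas as pd

# A minimal cleanup sketch, assuming df[0] is the table parsed above:
# read_html turns the filing's padding cells into all-NaN columns,
# so dropping those realigns most headers with their data
clean = df[0].dropna(axis=1, how='all')
print(clean)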
How can I extract the strings that belong to certain words?
Here is my data:
ID Event Name Event Type
0 1 Taltz Seminar for Dermathologists Out of office
2 3 Experiment Results for Taltz In Ofice
3 4 Use of Taltz in Rheumathology OUTOFOFFICE
5 6 RHeums Experiences with Taltz IO
How can I get the strings Dermathologists and Rheumathology out of 'Event Name' using regex? I have tried this:
import re

pattern = r'(derma\w+),\s(RHeums\w+).*'
df_named = df['Event Name'].str.extract(pattern, flags=re.I)
df_clean = df_named.reindex(columns=['dermatological ', 'rheumatological'])
df_clean.head()
Convert each extracted Series to a new DataFrame (the single pattern above fails because it demands both words, comma-separated, in one string, which never occurs; extract them separately instead):
df_named = pd.DataFrame({
    'dermatological': df['Event Name'].str.extract(r'(derma\w+)', flags=re.I, expand=False),
    'rheumatological': df['Event Name'].str.extract(r'(RHeuma\w+)', flags=re.I, expand=False)
})
print (df_named)
dermatological rheumatological
0 Dermathologists NaN
1 NaN NaN
2 NaN Rheumathology
3 NaN NaN
If you need to append the new columns to the original DataFrame, use:
df_named = df.assign(
    dermatological=df['Event Name'].str.extract(r'(derma\w+)', flags=re.I),
    rheumatological=df['Event Name'].str.extract(r'(RHeum\w+)', flags=re.I)
)
print (df_named)
ID Event Name Event Type dermatological \
0 1 Taltz Seminar for Dermathologists Out of office Dermathologists
1 3 Experiment Results for Taltz In Ofice NaN
2 4 Use of Taltz in Rheumathology OUTOFOFFICE NaN
3 6 RHeums Experiences with Taltz IO NaN
rheumatological
0 NaN
1 NaN
2 Rheumathology
3 RHeums
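As a side note, the two extracts can also be combined into one pattern with named groups (a sketch of my own, not part of the answer above); str.extract names the output columns after the groups:

import re
import pandas as pd

# One pattern, two named groups; the loose 'rheum\w+' branch matches both
# 'RHeums' and 'Rheumathology' (case-insensitive)
pattern = r'(?P<dermatological>derma\w+)|(?P<rheumatological>rheum\w+)'
df_named = df['Event Name'].str.extract(pattern, flags=re.I)
print(df_named)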
I'm working with several CSV files in Pandas. I changed some data names in the original CSV file and saved it. Then I restarted and reloaded my Jupyter notebook, but now I get something like this for every DataFrame I load from that data source:
Department Zone Element Product Year Unit Value
0 U1,"Z3","ODD 2.a.1... NaN NaN NaN NaN NaN NaN
1 U1,"Z3","ODD 2.a.1... NaN NaN NaN NaN NaN NaN
2 U1,"Z5","ODD 2.a.1... NaN NaN NaN NaN NaN NaN
3 U1,"Z6","ODD 2.a.1... NaN NaN NaN NaN NaN NaN
4 U1,"Z9","ODD 2.a.1... NaN NaN NaN NaN NaN NaN
I tried to use sep=',', encoding='UTF-8-SIG', quotechar='"', quoting=0, engine='python', but the issue persists. I don't know how to parse the CSV, because even when I created a new CSV from the data (without the quotes, and with ; as the separator) the same issue appears...
The CSV is 321 rows; here is an example with the problem: https://www.cjoint.com/c/LDCmfvq06R6
and the original CSV file that Pandas reads without problems: https://www.cjoint.com/c/LDCmlweuR66
I think the problem is with the quotes in the file.
import csv
import pandas as pd

df = pd.read_csv('LDCmfvq06R6_FAOSTAT.csv',
                 quotechar='"',
                 delimiter=',',
                 quoting=csv.QUOTE_NONE,
                 on_bad_lines='skip')

# QUOTE_NONE keeps the literal quote characters, so strip them column by column
for i, col in enumerate(df.columns):
    df.iloc[:, i] = df.iloc[:, i].str.replace('"', '')

df.head()
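When the quoting is this unclear, it also helps to inspect the raw lines before picking read_csv options (a diagnostic sketch; the filename comes from the post above):

# Print the first few raw lines, escapes included, to see how the fields
# are actually quoted and separated
with open('LDCmfvq06R6_FAOSTAT.csv', encoding='utf-8-sig') as f:
    for _ in range(3):
        print(repr(f.readline()))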
What I am looking for, but can't seem to get right, is essentially this: I'll read in a CSV file to build a df, with the output being a table that I can manipulate or edit. Ultimately, I'm looking for a method to rename the placeholder column headers with the first "string" in each column.
Currently:
import pandas as pd
df = pd.read_csv("NameOfCSV.csv")
df
Example Output:
m1_name m1_delta m2_name m2_delta m3_name m3_delta ... m10_name
CO2     1        NMHC    2        NaN     1        ... NaN
CO2     2        NMHC    1        NaN     2        ... NaN
CO2     1        NMHC    2        NaN     1        ... NaN
CO2     2        NMHC    1        CH4     2        ... NaN
What I am trying to understand is how to create a generic program that grabs the gas name found in each "m*_name" column and uses it to rename that header and every other header with the same prefix ("m*_blah").
Example Desired Output where the headers reflect the gas name:
CO2_name CO2_delta NMHC_name NMHC_delta NO2_name NO2_delta ... m10_name
CO2      1         NMHC      2          NaN      1         ... NaN
CO2      2         NMHC      1          NaN      2         ... NaN
CO2      1         NMHC      2          NO2      1         ... NaN
I've tried playing around with several functions (primarily .rename()), tried to find examples of similar problems, and dug through the documentation, but I have been unsuccessful in coming up with a solution. Beyond the couple of column headers in this example there are about a dozen per gas name, so I was also trying to build a loop structure that finds the headers with the corresponding m-number ("m*_otherHeader") and populates those too. The datasets I'm working with are dynamic, and the positions of these original columns in the CSV change as well. Some help would be appreciated, and/or pointing me toward some examples or the proper documentation to read through would be great!
df = df.drop_duplicates().set_axis(
    [(f"{df[f'{x}_name'].mode()[0]}_{y}"
      if df[f'{x}_name'].mode()[0] != 0
      else f'{x}_{y}')
     for x, y in df.columns.str.split('_', n=1)],
    axis=1)
Output:
>>> df
CO2_name CO2_delta NMHC_name NMHC_delta CH4_name CH4_delta m10_name
0 CO2 1 NMHC 2 CH4 1 0
1 CO2 2 NMHC 1 CH4 2 0
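For readability, here is the same logic unrolled into a loop (a sketch under the same assumptions as the one-liner, i.e. missing names appear as 0 in the '_name' columns):

import pandas as pd

def rename_gas_columns(df: pd.DataFrame) -> pd.DataFrame:
    # Rename every 'm*_<suffix>' header after the gas in its 'm*_name' column
    df = df.drop_duplicates()
    new_cols = []
    for col in df.columns:
        prefix, suffix = col.split('_', 1)    # e.g. 'm1', 'delta'
        gas = df[f'{prefix}_name'].mode()[0]  # most frequent value in the name column
        new_cols.append(f'{gas}_{suffix}' if gas != 0 else col)
    return df.set_axis(new_cols, axis=1)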
I have the following frame:
USERID, EVENT1TIME, EVENT2TIME, MISC1, MISC2
123,45,,,
123,,46,,
123,,47,,
123,,48,,
123,,49,,
123,,51,,
124,45,,,
124,,46,,
124,,47,,
124,,48,,
124,,49,,
124,,51,,
I'd like to add a DELTA column that is (EVENT2TIME - EVENT1TIME):
USERID, EVENT1TIME, EVENT2TIME, MISC1, MISC2, DELTA
123,45,,,,
123,,46,,,1
123,,47,,,2
123,,48,,,3
123,,49,,,4
123,,51,,,6
124,45,,,,
124,,46,,,1
124,,47,,,2
124,,48,,,3
124,,49,,,4
124,,51,,,6
I think the first thing to do is to copy the value from the row where EVENT1TIME is populated into the other instances of that USERID. But I suspect there may be a better way.
I am making some assumptions:
You want to calculate the difference between column EVENT2TIME and first row of EVENT1TIME
You want to store the results into DELTA
You can do this as follows:
import pandas as pd
df = pd.read_csv('abc.txt')
print (df)
df['DELTA'] = df.iloc[:,2] - df.iloc[0,1]
print (df)
The output of this will be:
USERID EVENT1TIME EVENT2TIME MISC1 MISC2 DELTA
0 123 45.0 NaN NaN NaN NaN
1 123 NaN 46.0 NaN NaN 1.0
2 123 NaN 47.0 NaN NaN 2.0
3 123 NaN 48.0 NaN NaN 3.0
4 123 NaN 49.0 NaN NaN 4.0
5 123 NaN 51.0 NaN NaN 6.0
If you know EVENT1TIME is always and only in the first row, just store it as a variable and subtract it.
val = df.EVENT1TIME[0]
df['DELTA'] = df.EVENT2TIME - val
If you have multiple values every so often in EVENT1TIME, use some logic to back or forward fill all the empty rows for EVENT1TIME. This fill is not stored in the final output df.
df['DELTA'] = df.EVENT2TIME - df.EVENT1TIME.ffill() # forward fill (down) all nan values
# OR
df['DELTA'] = df.EVENT2TIME - df.EVENT1TIME.bfill() # back fill (up) all nan values
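If the frame mixes several USERIDs, as in the example, a grouped fill keeps the subtraction inside each user's rows (a variation of my own on the fill above):

# Forward-fill EVENT1TIME within each USERID so one user's start time
# never leaks into the next user's rows
df['DELTA'] = df['EVENT2TIME'] - df.groupby('USERID')['EVENT1TIME'].ffill()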
EDIT: Keeping this for continuity despite how hacky it is.
import numpy as np

df['DELTA'] = np.nan
locations = list(df[~np.isnan(df.EVENT1TIME)].index)  # rows where EVENT1TIME is set
vals = df.EVENT1TIME.loc[locations]                   # all EVENT1TIME values
locations.append(df.EVENT1TIME.index[-1] + 1)         # last row index + 1
last_loc = locations[0]
for idx, next_loc in enumerate(locations[1:]):
    # subtract the block's EVENT1TIME from every EVENT2TIME in the block
    df.loc[last_loc:next_loc - 1, 'DELTA'] = (
        df.loc[last_loc:next_loc - 1, 'EVENT2TIME'] - vals[last_loc])
    last_loc = next_loc
I am trying to read a delimited text file into a DataFrame in Python. The delimiter is not being identified when I use pd.read_table, and if I explicitly set sep=' ' I get a tokenizing error (shown below). Notably, the defaults work when I use np.loadtxt().
Example:
import pandas as pd

pd.read_table('http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt',
              comment='%',
              header=None)
0
0 1850 1 -0.777 0.412 NaN NaN...
1 1850 2 -0.239 0.458 NaN NaN...
2 1850 3 -0.426 0.447 NaN NaN...
3 1850 4 -0.680 0.367 NaN NaN...
4 1850 5 -0.687 0.298 NaN NaN...
If I set sep=' ', I get this error:
pd.read_table('http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt',
              comment='%',
              header=None,
              sep=' ')
ParserError: Error tokenizing data. C error: Expected 2 fields in line 78, saw 58
Looking up this error, people suggest using header=None (already done) and setting sep explicitly, but the explicit sep is what causes the problem: Python Pandas Error tokenizing data. I looked at line 78 and can't see any problems. If I set error_bad_lines=False I get an empty df, suggesting there is a problem with every entry.
Notably this works when I use np.loadtxt():
import numpy as np

pd.DataFrame(np.loadtxt('http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt',
                        comments='%'))
0 1 2 3 4 5 6 7 8 9 10 11
0 1850.0 1.0 -0.777 0.412 NaN NaN NaN NaN NaN NaN NaN NaN
1 1850.0 2.0 -0.239 0.458 NaN NaN NaN NaN NaN NaN NaN NaN
2 1850.0 3.0 -0.426 0.447 NaN NaN NaN NaN NaN NaN NaN NaN
3 1850.0 4.0 -0.680 0.367 NaN NaN NaN NaN NaN NaN NaN NaN
4 1850.0 5.0 -0.687 0.298 NaN NaN NaN NaN NaN NaN NaN NaN
This suggests to me that there isn't something wrong with the file, but rather with how I am calling pd.read_table(). I looked through the documentation for np.loadtxt() in the hope of setting the sep to the same value, but that just shows: delimiter=None (https://numpy.org/doc/stable/reference/generated/numpy.loadtxt.html).
I'd prefer to be able to import this as a pd.DataFrame, setting the names, rather than having to import as a matrix and then convert to pd.DataFrame.
What am I getting wrong?
This one is quite tricky. The file is aligned with runs of spaces, so a single-space separator yields dozens of empty fields per row (hence the tokenizing error); a regex separator that swallows whole whitespace runs fixes it. Please try the snippet below:
import pandas as pd

url = 'http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt'
df = pd.read_csv(url,
                 sep=r'\s+',
                 comment='%',
                 usecols=range(12),
                 names=('Year', 'Month', 'M.Anomaly', 'M.Unc.',
                        'A.Anomaly', 'A.Unc.', '5y.Anomaly', '5y.Unc.',
                        '10y.Anomaly', '10y.Unc.', '20y.Anomaly', '20y.Unc.'))
The issue is that the file has 77 rows of commented text for 'Global Average Temperature Anomaly with Sea Ice Temperature Inferred from Air Temperatures'.
Two of those rows are headers.
There's a bunch of data, then two more header rows, and a new set of data for 'Global Average Temperature Anomaly with Sea Ice Temperature Inferred from Water Temperatures'.
This solution separates the two tables in the file into separate dataframes.
It is not as nice as the other answer, but the data is properly split into different dataframes.
The headers were a pain; it would probably be easier to create a custom header manually and skip the lines of code that separate the headers from the text.
The important point is separating the air and water data.
import requests
import pandas as pd
import math
# read the file with requests
url = 'http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt'
response = requests.get(url)
data = response.text
# convert data into a list
data = [d.strip().replace('% ', '') for d in data.split('\n')]
# specify the data from the ranges in the file
air_header1 = data[74].split()  # first header row: the 'Monthly', 'Annual', ... group names
air_header2 = [v.strip() for v in data[75].split(',')]
# combine the 2 parts of the header into a single header
air_header = air_header2[:2] + [f'{air_header1[math.floor(i/2)]}_{v}' for i, v in enumerate(air_header2[2:])]
air_data = [v.split() for v in data[77:2125]]
h2o_header1 = data[2129].split()  # first header row: the 'Monthly', 'Annual', ... group names
h2o_header2 = [v.strip() for v in data[2130].split(',')]
# combine the 2 parts of the header into a single header
h2o_header = h2o_header2[:2] + [f'{h2o_header1[math.floor(i/2)]}_{v}' for i, v in enumerate(h2o_header2[2:])]
h2o_data = [v.split() for v in data[2132:4180]]
# create the dataframes
air = pd.DataFrame(air_data, columns=air_header)
h2o = pd.DataFrame(h2o_data, columns=h2o_header)
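One follow-up worth noting: because the rows were built from split strings, every column has object dtype, so a numeric cast is usually wanted (a small sketch):

# Cast the string cells to numbers; literal 'NaN' strings become real NaN
air = air.apply(pd.to_numeric, errors='coerce')
h2o = h2o.apply(pd.to_numeric, errors='coerce')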
Without the header code
Simplify the code by using a manual header list.
import pandas as pd
import requests
# read the file with requests
url = 'http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt'
response = requests.get(url)
data = response.text
# convert data into a list
data = [d.strip().replace('% ', '') for d in data.split('\n')]
# manually created header
headers = ['Year', 'Month', 'Monthly_Anomaly', 'Monthly_Unc.',
'Annual_Anomaly', 'Annual_Unc.',
'Five-year_Anomaly', 'Five-year_Unc.',
'Ten-year_Anomaly', 'Ten-year_Unc.',
'Twenty-year_Anomaly', 'Twenty-year_Unc.']
# separate the air and h2o data
air_data = [v.split() for v in data[77:2125]]
h2o_data = [v.split() for v in data[2132:4180]]
# create the dataframes
air = pd.DataFrame(air_data, columns=headers)
h2o = pd.DataFrame(h2o_data, columns=headers)
air
Year Month Monthly_Anomaly Monthly_Unc. Annual_Anomaly Annual_Unc. Five-year_Anomaly Five-year_Unc. Ten-year_Anomaly Ten-year_Unc. Twenty-year_Anomaly Twenty-year_Unc.
0 1850 1 -0.777 0.412 NaN NaN NaN NaN NaN NaN NaN NaN
1 1850 2 -0.239 0.458 NaN NaN NaN NaN NaN NaN NaN NaN
2 1850 3 -0.426 0.447 NaN NaN NaN NaN NaN NaN NaN NaN
h2o
Year Month Monthly_Anomaly Monthly_Unc. Annual_Anomaly Annual_Unc. Five-year_Anomaly Five-year_Unc. Ten-year_Anomaly Ten-year_Unc. Twenty-year_Anomaly Twenty-year_Unc.
0 1850 1 -0.724 0.370 NaN NaN NaN NaN NaN NaN NaN NaN
1 1850 2 -0.221 0.430 NaN NaN NaN NaN NaN NaN NaN NaN
2 1850 3 -0.443 0.419 NaN NaN NaN NaN NaN NaN NaN NaN