pandas: Add HTML table values where data is int - python

I am getting an HTML table per day, so if I search for 20 days it brings me 20 tables, and I want to add all 20 tables into one table so I can verify the data as a time series.
I have tried the merge and add functions of pandas, but they just concatenate the values as strings.
Table one
[['\xa0', 'All Issues', 'Investment Grade', 'High Yield', 'Convertible'],
['Total Issues Traded', '8039', '5456', '2386', '197'],
['Advances', '3834', '2671', '1075', '88'],
['Declines', '3668', '2580', '994', '94'],
['Unchanged', '163', '54', '99', '10'],
['52 Week High', '305', '100', '193', '12'],
['52 Week Low', '152', '83', '63', '6'],
['Dollar Volume*', '27568', '17000', '9299', '1269']]
Table two
[['\xa0', 'All Issues', 'Investment Grade', 'High Yield', 'Convertible'],
['Total Issues Traded', '8039', '5456', '2386', '197'],
['Advances', '3834', '2671', '1075', '88'],
['Declines', '3668', '2580', '994', '94'],
['Unchanged', '163', '54', '99', '10'],
['52 Week High', '305', '100', '193', '12'],
['52 Week Low', '152', '83', '63', '6'],
['Dollar Volume*', '27568', '17000', '9299', '1269']]
My code (but it adds the values as strings):
tab_data = [[item.text for item in row_data.select("th,td")]
            for row_data in tables.select("tr")]
df = pd.DataFrame(tab_data)
df2 = pd.DataFrame(tab_data)
df3 = df.add(df2, fill_value=0)
df

If you want to convert the numeric cells into integers, you would need to do that explicitly, as follows:
tab_data = [[int(item.text) if item.text.isdigit() else item.text
             for item in row_data.select("th,td")]
            for row_data in tables.select("tr")]
Hope it helps.

The way you are building the data frame treats all values as text.
There are two options here:
Explicitly convert the strings to the data type you want using astype.
Use read_html to create data frames directly from the HTML tables; it also attempts the data type conversion for you.
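A minimal sketch of both options, assuming tab_data was parsed as in the question (first row is the header, first column holds the row labels; html_text stands in for the scraped HTML):
import pandas as pd

header, *rows = tab_data
df = pd.DataFrame(rows, columns=header).set_index(header[0])
df2 = df.copy()

# Option 1: explicit conversion with astype, then a numeric add
total = df.astype(int).add(df2.astype(int), fill_value=0)

# Option 2: read_html parses the tables and infers numeric dtypes itself
# tables = pd.read_html(html_text, index_col=0)
# total = tables[0].add(tables[1], fill_value=0)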

Related

Receiving an error while trying to highlight a row with .style on a pandas DataFrame

I have the following code for trying to highlight a certain row in my dataframe.
# morphological features in WALS
wals_morph_features = pd.DataFrame({
    'Feature Name': ['Fusion of Selected Inflectional Formatives', 'Exponence of Selected Inflectional Formatives', 'Exponence of Tense-Aspect-Mood Inflection', 'Inflectional Synthesis of the Verb', 'Locus of Marking in the Clause', 'Locus of Marking in Possessive Noun Phrases', 'Locus of Marking: Whole-language Typology', 'Zero Marking of A and P Arguments', 'Prefixing vs. Suffixing in Inflectional Morphology Prefixing vs. Suffixing in Inflectional Morphology', 'Reduplication', 'Case Syncretism', 'Syncretism in Verbal Person/Number Marking'],
    'Feature ID': ['20A', '21A', '21B', '22A', '23A', '24A', '25A', '25B', '26A', '27A', '28A', '29A'],
    'Number of Languages': ['165', '162', '160', '145', '236', '236', '236', '235', '969', '368', '198', '198'],
    'Number of Variances': ['7', '5', '6', '7', '5', '6', '5', '2', '6', '3', '4', '3']})

def custom_style(row):
    color = 'white'
    if row.values[-1] == 'Locus of Marking in Possessive Noun Phrases':
        color = 'yellow'
    return ['background-color: %s' % color] * len(row.values)

wals_morph_features.styler.apply(custom_style, axis=1)
wals_morph_features
However, I am receiving the error AttributeError: 'DataFrame' object has no attribute 'styler'
I'm not sure what's wrong, as the documentation says this is the correct syntax.
It should read
wals_morph_features.style.apply(custom_style, axis=1)
instead of
wals_morph_features.styler.apply(custom_style, axis=1),
i.e., the attribute is style and not styler.
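Note also that style.apply returns a Styler object rather than modifying the DataFrame in place, so evaluating wals_morph_features on the last line still displays the unstyled frame. To see the highlighting, display the Styler itself:
styled = wals_morph_features.style.apply(custom_style, axis=1)
styled  # in a Jupyter cell, this renders the highlighted table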
Alternatively, you can try it this way, choosing whatever colors you like (disclosure: this approach was originally posted elsewhere on Stack Overflow):
import numpy as np
import pandas as pd

def red_or_green(dataframe):
    # boolean mask that is True for the row(s) matching the target feature
    row = dataframe['Feature Name'] == 'Locus of Marking in Possessive Noun Phrases'
    # broadcast the mask across all columns and choose a style per cell
    a = np.where(np.repeat(row.to_numpy()[:, None], dataframe.shape[1], axis=1),
                 'background-color: white', 'background-color: #222222')
    return pd.DataFrame(a, columns=dataframe.columns, index=dataframe.index)

wals_morph_features.style.apply(red_or_green, axis=None)

Generating a random sample list from a population that has no elements in common with another list

I wish to sample elements from a list such that none of the elements are contained in another list of specified elements, and to keep generating new samples until a non-intersecting one is generated. The code below is what I have come up with, but it is not working: whenever the initial sample intersects, it goes into an infinite loop, and the print reveals that all of the generated samples are the same.
import random
unique_entities=['100','1001','10001','100001','11111']
pde_fin= ['2151', '2146', '2153', '2135', '2158', '2160', '2137', '2169', '2147', '2015', '2022', '2173', '2028', '2014', '2018', '2009', '1140', '1085', '1136', '1132', '1007', '1080', '1078', '1131', '1106', '1164', '1092', '1108', '1118', '1045', '1051', '1006','1001']
random_entities=random.sample(unique_entities,3) #choses 5 unique entities
while(not(set(random_entities).isdisjoint(pde_fin))):
    random_entites=random.sample(unique_entities,5)
    print(random_entities,"random_entites")
print(unique_entities)
Can you please help me understand what is going wrong?
There are two issues with the line random_entites=random.sample(unique_entities,5):
First, there is a typo, you wrote random_entites instead of random_entities.
Second, you're taking a sample of 5 elements from unique_entities, which happens to contain only 5 elements in total. Therefore the sample always contains the element '1001', the one element which is also in pde_fin.
Here is a working version of the program, which includes some other tweaks:
import random

unique_entities = ['100', '1001', '10001', '100001', '11111']
pde_fin = ['2151', '2146', '2153', '2135', '2158', '2160', '2137', '2169', '2147', '2015', '2022', '2173', '2028',
           '2014', '2018', '2009', '1140', '1085', '1136', '1132', '1007', '1080', '1078', '1131', '1106', '1164',
           '1092', '1108', '1118', '1045', '1051', '1006', '1001']
sample_size = 3
random_entities = set(random.sample(unique_entities, sample_size))
print(f"{random_entities=}")
while not random_entities.isdisjoint(pde_fin):
    random_entities = set(random.sample(unique_entities, sample_size))
    print(f"{random_entities=}")
print(f"Result: {random_entities}")
Alternatively, you can filter unique_entities before doing the sampling. Mathematically, filtering before or after sampling is equivalent in terms of randomness, and it avoids the retry loop entirely.
unique_entities=['100','1001','10001','100001','11111']
pde_fin= ['2151', '2146', '2153', '2135', '2158', '2160', '2137', '2169', '2147', '2015', '2022', '2173', '2028', '2014', '2018', '2009', '1140', '1085', '1136', '1132', '1007', '1080', '1078', '1131', '1106', '1164', '1092', '1108', '1118', '1045', '1051', '1006','1001']
unique_entities_unique = [i for i in unique_entities if i not in pde_fin]
random_entities = random.sample(unique_entities_unique, 3)
print(random_entities, "random_entities")

How do you convert formatted 'epi-week' to date using Python?

I am currently trying to learn how to apply Data Science skills which I am learning through Coursera and Dataquest to little personal projects.
I found a dataset on Google BigQuery from the US Department of Health and Human Services which includes all weekly surveillance reports of nationally notifiable diseases for all U.S. cities and states published between 1888 and 2013.
I exported the data to a .csv file and imported it into a Jupyter notebook which I am running through Anaconda. Upon looking at the header of the dataset I noticed that the dates/weeks are shown as 'epi_week'.
I am trying to make the data more readable and usable for analysis. To do this I was hoping to convert it into something along the lines of DD/MM/YYYY or Week/Month/Year etc.
I did some research, apparently epi-weeks are also referred to as CDC weeks and so far I found an extension/package for python 3 which is called "epiweeks".
Using the epiweeks package I can turn 'normal' dates into what the package creator refers to as an epi-week form, but the results look nothing like what I see in the dataset.
For example, if I use today's date, the 24th of May 2019 (24/05/2019), the output is "Week 21 of Year 2019". But this is what the first four entries in the data look like (and all the other ones follow the same format):
epi_week
'197006'
'197007'
'197008'
'197012'
In [1]: disease_header
Out [1]:
[['epi_week', 'state', 'loc', 'loc_type', 'disease', 'cases', 'incidence_per_100000']]
In [2]: disease[:4]
Out [2]:
[['197006', 'AK', 'ALASKA', 'STATE', 'MUMPS', '0', '0'],
['197007', 'AK', 'ALASKA', 'STATE', 'MUMPS', '0', '0'],
['197008', 'AK', 'ALASKA', 'STATE', 'MUMPS', '0', '0'],
['197012', 'AK', 'ALASKA', 'STATE', 'MUMPS', '0', '0']]
The epiweeks package was developed to solve problems like the one you have here.
With the example data you provided, let's create a new column with week ending date:
import pandas as pd
from epiweeks import Week

columns = ['epi_week', 'state', 'loc', 'loc_type',
           'disease', 'cases', 'incidence_per_100000']
data = [
    ['197006', 'AK', 'ALASKA', 'STATE', 'MUMPS', '0', '0'],
    ['197007', 'AK', 'ALASKA', 'STATE', 'MUMPS', '0', '0'],
    ['197008', 'AK', 'ALASKA', 'STATE', 'MUMPS', '0', '0'],
    ['197012', 'AK', 'ALASKA', 'STATE', 'MUMPS', '0', '0']
]
df = pd.DataFrame(data, columns=columns)

# Now create a new column with the week ending date in ISO format
df['week_ending'] = df['epi_week'].apply(lambda x: Week.fromstring(x).enddate())
That results in something like:
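(output shown here as text; assuming the CDC/MMWR week convention that epiweeks implements, the end dates work out as below)
  epi_week state     loc loc_type disease cases incidence_per_100000 week_ending
0   197006    AK  ALASKA    STATE   MUMPS     0                    0  1970-02-14
1   197007    AK  ALASKA    STATE   MUMPS     0                    0  1970-02-21
2   197008    AK  ALASKA    STATE   MUMPS     0                    0  1970-02-28
3   197012    AK  ALASKA    STATE   MUMPS     0                    0  1970-03-28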
I recommend having a look at the epiweeks package documentation for more examples.
If you only need to have year and week columns, that can be done without using the epiweeks package:
df['year'] = df['epi_week'].apply(lambda x: int(x[:4]))
df['week'] = df['epi_week'].apply(lambda x: int(x[4:6]))
That results in something like:
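  epi_week state     loc loc_type disease cases incidence_per_100000  year  week
0   197006    AK  ALASKA    STATE   MUMPS     0                    0   1970     6
1   197007    AK  ALASKA    STATE   MUMPS     0                    0   1970     7
2   197008    AK  ALASKA    STATE   MUMPS     0                    0   1970     8
3   197012    AK  ALASKA    STATE   MUMPS     0                    0   1970    12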

How to save stack of arrays to .csv when there is a mismatch between array dtype and format specifier?

These are my stacks of arrays, both with variables arranged columnwise.
final_a = np.stack((four, five, st, dist, ru), axis=-1)
final_b = np.stack((org, own, origin, init), axis=-1)
Example:
In: final_a
Out: array([['9999', '10793', ' 1', '99', '2'],
            ['9999', '10799', ' 1', '99', '2'],
            ['9999', '10712', ' 1', '99', '2'],
            ...,
            ['9999', '23960', '33', '99', '1'],
            ['9999', '82920', '33', '99', '2'],
            ['9999', '82920', '33', '99', '2']],
           dtype='<U5')
But when I try to save either of them to a .csv file using this code:
np.savetxt("/Users/jaisaranc/Documents/ASI selected data - A.csv", final_a, delimiter=",")
It throws this error:
TypeError: Mismatch between array dtype ('<U5') and format specifier ('%.18e,%.18e,%.18e,%.18e,%.18e')
I have no idea what to do.
savetxt in NumPy lets you specify a format for how the array is written to the file. The default format (fmt='%.18e') can only format numeric elements. Your array contains strings (dtype='<U5' means Unicode strings up to 5 characters long), so it raises an error. In your case you should pass fmt='%s' as an argument to format the array elements in the output file as strings. For example:
np.savetxt("example.csv", final_a, delimiter=",", fmt="%s")
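For reference, a minimal self-contained sketch (with made-up two-row data of the same '<U5' dtype as final_a):
import numpy as np

final_a = np.array([['9999', '10793', ' 1', '99', '2'],
                    ['9999', '10799', ' 1', '99', '2']])  # dtype='<U5'

# The default fmt='%.18e' would raise the TypeError above;
# fmt='%s' writes every element out as a string.
np.savetxt("example.csv", final_a, delimiter=",", fmt="%s")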

Concatenate values of two keys in the same dictionary

I had a dictionary like:
a = {'date': ['2012-03-09', '2012-01-12', '2012-11-11'],
     'rate': ['199', '900', '899'],
     'country code': ['1', '2', '44'],
     'area code': ['114', '11', '19'],
     'product': ['Mobile', 'Teddy', 'Handbag']}
Then I used the zip function to concatenate the values:
data = [(a,b,c+d,e) for a,b,c,d,e in zip(*a.values())]
Output:
data = [('2012-03-09', '199', '1114', 'Mobile'),
('2012-01-12', '900', '211', 'Teddy'),
('2012-11-11', '899', '4419', 'Handbag')]
What if I want the function to search for 'country code' and 'area code' itself and merge them? Any suggestions, please?
A generic method to merge 'columns', letting you specify what columns to expect and what to merge up front:
def merged_pivot(data, *output_names, **merged_columns):
    input_names = []
    column_map = {}
    for col in output_names:
        start = len(input_names)
        input_names.extend(merged_columns.get(col, [col]))
        column_map[col] = slice(start, len(input_names))
    for row in zip(*(data[c] for c in input_names)):
        yield tuple(''.join(row[column_map[c]]) for c in output_names)
which you call with:
list(merged_pivot(a, 'date', 'rate', 'code', 'product', code=('country code', 'area code')))
passing in:
the dictionary of column data (a in the example)
each column that makes up the output ('date', 'rate', 'code', 'product' in the above example)
any column in the output that is composed of a merged list of input columns (code=('country code', 'area code') in the example, so code in the output is formed by merging country code and area code).
Output:
>>> list(merged_pivot(a, 'date', 'rate', 'code', 'product', code=('country code', 'area code')))
[('2012-03-09', '199', '1114', 'Mobile'), ('2012-01-12', '900', '211', 'Teddy'), ('2012-11-11', '899', '4419', 'Handbag')]
or, slightly reformatted:
[('2012-03-09', '199', '1114', 'Mobile'),
('2012-01-12', '900', '211', 'Teddy'),
('2012-11-11', '899', '4419', 'Handbag')]
Instead of calling list() on the merged_pivot() generator, you can also just loop over its output if all you need to do is process each row separately:
columns = ('date', 'rate', 'code', 'product')
for row in merged_pivot(a, *columns, code=('country code', 'area code')):
    # do something with `row`
    print(row)
You have to define the order of the keys yourself, since a.values() is not guaranteed to return them in a fixed order (dicts preserve insertion order only from Python 3.7 on). I renamed your original dictionary to dd:
[(a,b,c+d,e) for a,b,c,d,e in zip(*(dd[k] for k in ('date', 'rate', 'country code', 'area code', 'product')))]
returns
[('2012-03-09', '199', '1114', 'Mobile'),
('2012-01-12', '900', '211', 'Teddy'),
('2012-11-11', '899', '4419', 'Handbag')]
