Cannot get entire table from HTML - Pandas - Python

I was trying to get data with pandas from a Wikipedia article about the largest U.S. bank failures, but for some reason the table looked incomplete. I used this:
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_largest_U.S._bank_failures')
type(df)
len(df)
df[1]
PS: I am using Hydrogen to run Jupyter in Atom. This was the output:
Please explain what happened. I am new to Data Science and Pandas.

You are getting the whole table. By default, pandas displays only a limited number of rows, hence the ... in the middle. If you want to display all rows, you can change the pandas display default as follows:
# show at most 100 rows
pd.options.display.max_rows = 100
Note that this is a display setting only, the DataFrame contains all table data already.
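To see that the display limit and the actual data are independent, here is a minimal sketch with a synthetic DataFrame (standing in for the Wikipedia table, to avoid a network call):

```python
import pandas as pd

# A DataFrame large enough to be truncated by the default display.
df = pd.DataFrame({"value": range(200)})

# The repr shows "..." in the middle, but every row is really there.
print(len(df))    # 200
print(df.shape)   # (200, 1)

# Raise the display limit so the full table prints.
pd.set_option("display.max_rows", 300)
```

`pd.set_option("display.max_rows", 300)` is equivalent to assigning `pd.options.display.max_rows = 300`; either way it changes only what the notebook renders, not the data.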

Related

Pandas DataFrames and sorting values

I am having a difficult time writing this homework assignment and am not sure where I messed up. I have tried several things, and believe my issue lies in sort_values or maybe in the groupby command.
The issue is that I only want to display graph data from the year 2007 (using pandas and Plotly in a Jupyter notebook for my class). I mostly have the graph I want, but cannot get it to display the data correctly: it simply isn't filtering by year, or taking data from the specific dates requested.
import pandas as pd
import plotly.express as px
df = pd.read_csv('Data/Country_Data.csv')
print(df.shape)
df.head(2)
df_Q1 = df.query("year == '2007'")
print(df_Q1.shape)
df_Q1.head()
This is where the issue begins: it prints a table with only header information, i.e. all the column names but none of the data. Later on it displays a graph of what I assume is the most recent death data rather than the year 2007 as specified.
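One likely cause (an assumption, since the CSV isn't shown): if the year column was parsed as integers, the quoted comparison `"year == '2007'"` compares numbers against a string and matches nothing. A sketch with hypothetical data standing in for Country_Data.csv:

```python
import pandas as pd

# Hypothetical rows; the real file presumably has many more columns.
df = pd.DataFrame({"year": [2002, 2007, 2007, 2012],
                   "deaths": [10, 20, 30, 40]})

print(df["year"].dtype)   # numeric, so the string '2007' never matches

# Compare against an integer when the dtype is numeric.
df_Q1 = df.query("year == 2007")
print(df_Q1.shape)        # (2, 2)
```

If the column really is stored as strings, the original quoted form is the correct one; checking `df.dtypes` tells you which case you are in.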

How to transform values from DataFrame (Python) row attributes into columns?

I have the following DataFrame (Current Dataframe), loaded from a CSV, that I want to use for some sampling tests.
For that I wanted to keep all of the current columns, but transform Element_Count and Tag_Count by turning the values inside them (e.g. link: 10) into separate columns.
I want to extract each value and turn it into a column. The final DataFrame would look something like this (obviously depending on the values inside Element_Count/Tag_Count): the index (the 0, 1, 2, etc. from the DataFrame itself), PageID, Uri, A, AA, AAA, then link as its own column holding its value from Element_Count (e.g. 44 in the row for that specific URL), then html, etc., with all the values inside Tag_Count handled the same way as for Element_Count.
The current code to generate the dataframe is the following:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore") #to ignore some warnings which have no effect in this particular case.
df = pd.read_csv('test.csv', sep=';')
df.head()
I have searched Google, and also on here, for answers, to no avail.
I have tried changing the test CSV to achieve my goal, with no success. After seeing a question on here, I have also tried:
pd.DataFrame(df.ranges.str.split(',').tolist())
to achieve the desired result, again with no success.
Any ideas on how to achieve this via DataFrames, or by any other method?
(If I have forgotten to mention anything you feel is important to understand the problem, please say so and I will edit it in.)
Edit :
Although logic would say that the element and tag count values should be in dictionary form and easily separable, that is not the case, as shown in the print.
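Since the values are apparently plain text rather than real dictionaries, one approach is to parse each cell yourself and expand the result into columns. This is a sketch under the assumption that each cell looks like comma-separated "name: count" pairs (e.g. "link: 44, meta: 3"); the column names and sample rows below are hypothetical:

```python
import pandas as pd

# Hypothetical stand-in for the real CSV rows.
df = pd.DataFrame({
    "PageID": [1, 2],
    "Element_Count": ["link: 44, meta: 3", "link: 10"],
})

def parse_counts(text):
    """Turn 'link: 44, meta: 3' into {'link': 44, 'meta': 3}."""
    pairs = (item.split(":") for item in text.split(","))
    return {key.strip(): int(value) for key, value in pairs}

# Expand each parsed dict into its own set of columns;
# missing keys become NaN in the rows that lack them.
counts = df["Element_Count"].apply(parse_counts).apply(pd.Series)
result = df.drop(columns="Element_Count").join(counts)
print(result)
```

The same pattern would apply to Tag_Count. If the real cells are formatted differently, only `parse_counts` needs to change.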

How to force pdfplumber to extract a table according to the number of columns in the upper row?

I am trying to extract a table from a PDF document with the Python package pdfplumber. The table has four columns and multiple rows. The first row contains headers, the second row has only one merged cell, and after that the values are stored normally (example).
pdfplumber was able to retrieve the table, but it made six columns out of four and did not save the values according to the columns.
Table as shown in PDF document
I tried to use various table settings, including "vertical strategy": "lines", but this yields me the same result.
# Python 2.7.16
import pandas as pd
import pdfplumber
path = 'file_path'
pdf = pdfplumber.open(path)
first_page = pdf.pages[7]
df5 = pd.DataFrame(first_page.extract_table())
I am getting six columns instead of four, with values in the wrong columns.
Output example:
Table as output in jupyter notebooks
I would be happy to hear if anybody has any suggestions or solutions.
Did you ever get an answer? I want to replace the \n appearing in the column text.
This is not exactly what you're looking for, but you could load the output into a DataFrame and iterate over it, using the non-null values in the first row as column names for another DataFrame. After that it is easy: you can collate all the data between two column-name columns in the output DataFrame and insert it into the new DataFrame after merging those cells.
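That idea can be sketched roughly as follows. The `raw` table here is a hypothetical stand-in for `first_page.extract_table()` where six columns were detected but only four header cells are non-empty:

```python
import pandas as pd

# Hypothetical extract_table() output: 6 detected columns, 4 real headers.
raw = pd.DataFrame([
    ["A", None, "B", "C", None, "D"],
    ["a1", "a2", "b1", "c1", "c2", "d1"],
])

header = raw.iloc[0]
real = [i for i, name in enumerate(header) if name]   # positions of real headers
bounds = real + [len(header)]

merged = {}
for start, end in zip(bounds, bounds[1:]):
    # Join the spilled cells between two real headers back into one value.
    block = raw.iloc[1:, start:end].fillna("")
    merged[header[start]] = block.apply(" ".join, axis=1).str.strip()

clean = pd.DataFrame(merged)
print(clean)
```

This assumes the extra columns always fall between (not before) the real ones; depending on how the merged cell was split in your PDF, you may need a different join rule.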

Selecting Pandas DataFrame Rows Based On Conditions

I am new to Python and getting to grips with Pandas. I am trying to perform a simple CSV import, filter, and CSV write, but the filter seems to be dropping rows of data compared to my Access query.
I am importing via the command below:
Costs1516 = pd.read_csv('C:......../1b Data MFF adjusted.csv')
Following the import I get a data warning that the Service Code column contains data of multiple types (some are numerical codes, others are purely text), but the import seems to assign the dtype object, which I thought would just treat them both as strings, and all would be fine...
I want the output DataFrame to have the same structure as the imported data (Costs1516), but only to include rows where 'Service Code' = '110'.
I have pulled the following SQL from Access, which seems to do the job well and returns 136k rows:
SELECT [1b Data MFF adjusted].*, [1b Data MFF adjusted].[Service code]
FROM [1b Data MFF adjusted]
WHERE ((([1b Data MFF adjusted].[Service code])="110"));
My pandas equivalent is below but only returns 99k records:
Costs1516Ortho = Costs1516.loc[Costs1516['Service code'] == '110']
I have compared the two outputs and I can't see any reason why pandas is excluding some lines and including others. I'm really stuck; any suggested areas to look at or approaches to test gratefully received.
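A plausible explanation, given the mixed-types warning: in an object-dtype column some codes may be stored as the integer 110 and others as the string '110', and `== '110'` matches only the exact string form. Normalizing the column to stripped strings before comparing is one hedged fix (the sample values below are made up):

```python
import pandas as pd

# Hypothetical mixed-type column, as the import warning describes:
# some codes came in as integers, others as (possibly padded) strings.
Costs1516 = pd.DataFrame({"Service code": [110, "110", " 110", "2A3"],
                          "Cost": [1, 2, 3, 4]})

# Only the exact string '110' matches; numeric and padded rows are lost.
print(len(Costs1516.loc[Costs1516["Service code"] == "110"]))   # 1

# Normalize everything to a stripped string before comparing.
codes = Costs1516["Service code"].astype(str).str.strip()
Costs1516Ortho = Costs1516.loc[codes == "110"]
print(len(Costs1516Ortho))                                      # 3
```

Counting the rows that match only after normalization would also confirm whether this accounts for the 136k vs 99k discrepancy.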

How to get a column name in a tab-separated dataframe?

I have a tab-separated file which I loaded into a pandas DataFrame as below:
import pandas as pd
data1 = pd.DataFrame.from_csv(r"C:\Users\Ashish\Documents\indeed_ml_dataset\train.tsv", sep="\t")
data1
Here is how data1 looks:
Now I want to view the column named tags. I don't know whether I should call it a column or not, but I have tried accessing it the usual way:
data2=data1[['tags']]
but it errors out. I have tried several other things as well, using index and loc, but all of them fail. Any suggestions?
To fix this you'll need to remove description from the index by resetting it. Try the below:
data2 = data1.reset_index()
data2['tags']
You'll then be able to select by "tags".
Try reading your data using pd.read_csv instead of pd.DataFrame.from_csv, as the latter takes the first column as the index by default.
For more info, refer to this documentation on the pandas website: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.from_csv.html
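The difference can be seen with a small in-memory TSV (hypothetical content standing in for train.tsv, read via io.StringIO instead of a file path):

```python
import io
import pandas as pd

# Hypothetical tab-separated content standing in for train.tsv.
tsv = "description\ttags\nfirst posting\tpart-time\nsecond posting\tlicence\n"

# read_csv keeps every column as a regular column (no implicit index),
# unlike the deprecated DataFrame.from_csv, which consumed the first
# column ('description' here) as the index.
data1 = pd.read_csv(io.StringIO(tsv), sep="\t")
print(list(data1.columns))   # ['description', 'tags']
print(data1["tags"].tolist())
```

With from_csv, 'tags' would have been the only real column and 'description' the index, which is why `data1[['tags']]`-style access behaves unexpectedly there.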
