Table scraping in Python

I'm currently trying to parse this table: http://kuap.ru/banks/8012/balances/en
However, I ran into a problem: the table includes lots of options for drop-down lists (which I don't need), and tbody seems to end unexpectedly somewhere in the beginning of the table.
So, basically, I've got three questions:
1. Could you please provide working code to parse the whole table and turn it into a dataframe?
2. Is it possible to start parsing from a specific line in this kind of table, e.g. "start with id..."? How?
3. Is it possible to parse only specific columns in a table like this, where the columns don't have specific IDs? For example, can I scrape only the first two columns (the names and the first column of numbers)?
Thanks a lot in advance!

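import pandas as pd
# read_html returns a list of DataFrames, one per table found; take the last one.
# skiprows=[0] skips the first row of each parsed table.
df = pd.read_html("http://kuap.ru/banks/8012/balances/en", skiprows=[0])[-1]
# Drop the last column.
df.drop(df.columns[-1], axis=1, inplace=True)
print(df)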
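
For the second and third questions, it is usually simpler to select rows and columns on the resulting DataFrame than during parsing. The following is only a sketch: the small stand-in frame, its column names (name, 2017, 2016) and the label "Assets" are made up so the snippet runs offline, and the real table will look different.

```python
import pandas as pd

# Hypothetical stand-in for the frame returned by pd.read_html(...)[-1].
df = pd.DataFrame({
    "name": ["Header row", "Assets", "Cash", "Loans"],
    "2017": [None, 100, 40, 60],
    "2016": [None, 90, 35, 55],
})

# Question 3: keep only the first two columns, selected by position.
first_two = df.iloc[:, :2]

# Question 2: start from the first row whose name matches a given label.
start = df.index[df["name"] == "Assets"][0]
from_assets = df.loc[start:, :]

print(first_two)
print(from_assets)
```

Positional selection with iloc avoids depending on column IDs, which this table does not have.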

Related

Convert rows to columns in Python

I have an Excel file in the format below.
Note: the values in Column Name are dynamic. The current example shows 10 records; another data set can have a different number of column names.
I want to convert the rows into columns as below
Is there any easy option in python pandas to handle this scenario?
Thanks @juhat for the suggestion on pivot table. I was able to achieve the intended result with this code:
import pandas as pd
fsdData = pd.read_csv("py_fsd.csv")
# One row per "msg Srl", one column per distinct "Column Name".
fsdData.pivot(index="msg Srl", columns="Column Name", values="Value")
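
To illustrate what the pivot does, here is a small self-contained sketch. The sample rows are invented (the real py_fsd.csv is not shown in the question); only the three column names come from the code above.

```python
import pandas as pd

# Hypothetical long-format data standing in for py_fsd.csv.
long_df = pd.DataFrame({
    "msg Srl": [1, 1, 2, 2],
    "Column Name": ["city", "zip", "city", "zip"],
    "Value": ["Oslo", "0150", "Bergen", "5003"],
})

# Each distinct "Column Name" becomes its own column.
wide = long_df.pivot(index="msg Srl", columns="Column Name", values="Value")

# Optional: turn "msg Srl" back into an ordinary column.
wide = wide.reset_index()
wide.columns.name = None

print(wide)
```

Note that pivot raises a ValueError if the same (index, column) pair occurs more than once; in that case pivot_table with an aggregation function is the usual alternative.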

How to export a nested dictionary into a csv and format it into columns and rows?

I tried to search for my problem, but I couldn't find anything closely related, so I decided to put my question here. I'm pretty new to Python and still learning the language, so forgive me if I missed something obvious.
Currently I'm trying to extract data with beautifulsoup from some letters that were converted into .xml files. I put the extracted data into a nested dictionary. What I want to end up with is a csv file that contains all of that data.
I've got a nested dictionary like this:
dict = [
{"id":"0", "foundPersons" : ["Smith, Bernhard","Jackson, Andrew","Dalton, Henry"], "sendfrom" : "Calahan, Agnes"},
{"id":"1", "foundPersons" : ["Cutter, John","Ryan, Jack"], "sendfrom" : "Enrico, Giovanni"},
{"id":"2", "foundPersons" : "Grady, Horace", "sendfrom" : "Calahan, Agnes"}
]
I want to have ID, foundPersons and sendfrom as the headers of the columns. Below the ID header I want every ID in its own cell, and all foundPersons of one ID in a single cell (that means the 3 names of ID 0 are in one cell), and so on.
I tried to do that with the csv module but couldn't figure out how. Can somebody maybe give me a hint? Or is there another module or library which can help me out here?
You want to use the pandas library here.
import pandas as pd
data = pd.DataFrame.from_dict(dict)
data.to_csv('output.csv')
print(data)
Your data is now organized into columns and rows.
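
With the approach above, list values are written to the CSV as their Python representation (for example "['Cutter, John', 'Ryan, Jack']"). If plain text in the cell is preferred, one option is to join each list first. This is a sketch only: the helper name join_names and the "; " separator are arbitrary choices, not part of the original answer.

```python
import pandas as pd

records = [
    {"id": "0", "foundPersons": ["Smith, Bernhard", "Jackson, Andrew", "Dalton, Henry"], "sendfrom": "Calahan, Agnes"},
    {"id": "1", "foundPersons": ["Cutter, John", "Ryan, Jack"], "sendfrom": "Enrico, Giovanni"},
    {"id": "2", "foundPersons": "Grady, Horace", "sendfrom": "Calahan, Agnes"},
]

data = pd.DataFrame(records)


def join_names(value):
    # Lists become one "; "-separated string; plain strings pass through.
    return "; ".join(value) if isinstance(value, list) else value


data["foundPersons"] = data["foundPersons"].apply(join_names)

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = data.to_csv(index=False)
print(csv_text)
```

Passing index=False leaves out the automatic row index, so the file contains only the three requested columns.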

Take a table from a website using pd.read_html and turn into a pandas Dataframe

I am trying to take a table using pandas pd.read_html function from the website https://www.statbunker.com/competitions/FantasyFootballPlayersStats?comp_id=556
I have got as far as making the table into a list, but when I try to change it to a Dataframe, only the 5 outputs from the top line of the 620x20 table appear in the output. The code I have used is:
BPL1617 = pd.read_html("https://www.statbunker.com/competitions/FantasyFootballPlayersStats?comp_id=556")
Then, to convert it to a dataframe, I used BPL20162017 = pd.DataFrame(BPL1617), but the output is wrong.
I hope to later add new tables from this website and merge them.
Any help would be greatly appreciated as I don't know where I am going wrong!
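
A note on the likely cause: pd.read_html returns a list of DataFrames, one per table found on the page, so the list elements are already DataFrames and do not need to be wrapped in pd.DataFrame again. The sketch below uses a made-up stand-in list so it runs offline; the helper largest_table and the column names are hypothetical, and which list element holds the stats table on the real page is an assumption to check.

```python
import pandas as pd


def largest_table(tables):
    # Pick the DataFrame with the most rows from a list of DataFrames.
    return max(tables, key=len)


# Stand-in for the list that pd.read_html(url) would return.
tables = [
    pd.DataFrame({"a": [1]}),
    pd.DataFrame({"player": ["A", "B", "C"], "points": [10, 8, 7]}),
]

stats = largest_table(tables)

# Tables from other pages with the same columns can be stacked with pd.concat.
combined = pd.concat([stats, stats], ignore_index=True)

print(stats)
```

If the page contains a single table, BPL1617[0] is the DataFrame in question.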

Selecting Pandas DataFrame Rows Based On Conditions

I am new to Python and getting to grips with Pandas. I am trying to perform a simple import CSV, filter, write CSV, but the filter seems to be dropping rows of data compared to my Access query.
I am importing via the command below:
Costs1516 = pd.read_csv('C:......../1b Data MFF adjusted.csv')
Following the import I get a warning that the service code column contains data of multiple types (some are numerical codes, others are purely text), but the import seems to assign the data type object, which I thought would just treat them all as strings and all would be fine.
I want the output dataframe to have the same structure as the imported data (Costs1516), but only to include rows where 'Service code' = '110'.
I have pulled the following SQL from Access which seems to do the job well, and returns 136k rows:
SELECT [1b Data MFF adjusted].*, [1b Data MFF adjusted].[Service code]
FROM [1b Data MFF adjusted]
WHERE ((([1b Data MFF adjusted].[Service code])="110"));
My pandas equivalent is below but only returns 99k records:
Costs1516Ortho = Costs1516.loc[Costs1516['Service code'] == '110']
I have compared the two outputs and I can't see any reason why pandas is excluding some lines and including others....I'm really stuck...any suggested areas to look or approaches to test gratefully received.
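
One possible cause, suggested by the mixed-type warning but not confirmed by anything in the question, is that an object column can hold a mix of integers and strings, so some codes are stored as the number 110 and do not equal the string '110'. The sketch below uses invented sample values to show the effect and one way to normalize the column before filtering.

```python
import pandas as pd

# Hypothetical sample: the same code stored as int, str, and str with a space.
Costs1516 = pd.DataFrame({
    "Service code": [110, "110", "ABC", " 110"],
    "Cost": [1.0, 2.0, 3.0, 4.0],
})

# Comparing the raw object column to '110' only matches the exact string.
raw_match = Costs1516.loc[Costs1516["Service code"] == "110"]

# Normalize to stripped strings first, then compare.
codes = Costs1516["Service code"].astype(str).str.strip()
Costs1516Ortho = Costs1516.loc[codes == "110"]

print(len(raw_match), len(Costs1516Ortho))
```

Alternatively, passing dtype={"Service code": str} to pd.read_csv reads the column as strings from the start. If some codes were parsed as floats, astype(str) would give '110.0', so it is worth inspecting Costs1516["Service code"].map(type).value_counts() first.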

How to get column name in tab separated dataframe?

I have a tab separated file which I extracted in pandas dataframe as below:
import pandas as pd
data1 = pd.DataFrame.from_csv(r"C:\Users\Ashish\Documents\indeed_ml_dataset\train.tsv", sep="\t")
data1
Here is what data1 looks like:
Now, I want to view the column named tags. I don't know whether I should call it a column or not, but I have tried accessing it in the usual way:
data2=data1[['tags']]
but it errors out. I have tried several other things as well using index and loc, but all of them fail. Any suggestions?
To fix this you'll need to remove description from the index by resetting. Try the below:
data2 = data1.reset_index()
data2['tags']
You'll then be able to select by "tags".
Try reading your data with pd.read_csv instead of pd.DataFrame.from_csv, because from_csv uses the first column as the index by default. (from_csv was later deprecated and was removed in pandas 1.0.)
For more info refer to this documentation on pandas website: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.from_csv.html
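
A minimal sketch of the read_csv approach. The column names description and tags come from the question, while the sample rows are invented so the snippet runs without the original train.tsv.

```python
import io

import pandas as pd

# Hypothetical two-row stand-in for train.tsv.
tsv_text = (
    "description\ttags\n"
    "Some job text\tpart-time-job\n"
    "Other text\tfull-time-job\n"
)

# read_csv keeps every column as a regular column unless index_col is given.
data1 = pd.read_csv(io.StringIO(tsv_text), sep="\t")
tags = data1["tags"]

print(data1.columns.tolist())
print(tags)
```

For the real file, the StringIO object would be replaced by the path to train.tsv.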
