PDF to Pandas Data Frame - python

Just when I think I am finally getting it... such a newb.
I am trying to get a list of numbers from a column of a table in a PDF.
As a first step I wanted to convert it to a pandas DataFrame.
pip install tabula-py
pip install PyPDF2
import pandas as pd
import tabula
df = tabula.read_pdf('/content/Manifest.pdf')
The output I get, however, is a list of length 1, not a DataFrame. When I look at it, the info is there; I just have no idea how to access it since it is a list of 1.
So I'm not sure why I didn't get a DataFrame, and I have no idea what I'm meant to do with a list of 1.
Not sure if it matters, but I am using Google Colab.
Any help would be awesome.
Thanks

Without any additional arguments, tabula.read_pdf returns a list of DataFrames. To access a specific DataFrame, select it by index.
Here's an example where I read the document, selected the very first index, and compared the types:
import tabula
df = tabula.read_pdf(
    "https://github.com/chezou/tabula-py/raw/master/tests/resources/data.pdf")
df_0 = df[0]
print("type of df :", type(df))
print("type of df_0", type(df_0))
Returns:
type of df : <class 'list'>
type of df_0 <class 'pandas.core.frame.DataFrame'>
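From there, a minimal sketch of getting the question's list of numbers out of one column; the column name 'Quantity' is hypothetical, so check df_0.columns for the real ones:
import tabula

tables = tabula.read_pdf('/content/Manifest.pdf', pages='all')
df_0 = tables[0]                      # first extracted table
print(df_0.columns)                   # inspect the real column names
numbers = df_0['Quantity'].tolist()   # hypothetical column -> plain Python list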

Try something like:
df = tabula.read_pdf('/content/Manifest.pdf', sep=' ')
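Note that in recent tabula-py versions, pandas reader options such as sep are passed through the pandas_options dict rather than as bare keyword arguments; a hedged equivalent would be:
df = tabula.read_pdf('/content/Manifest.pdf', pandas_options={'sep': ' '})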

Related

How to remove b' from values in dataframe

I read my ARFF data from here (https://archive.ics.uci.edu/ml/machine-learning-databases/00426/) like this:
from scipy.io import arff
import pandas as pd
data = arff.loadarff('Autism-Adult-Data.arff')
df = pd.DataFrame(data[0])
df.head()
But my dataframe has b' in all values in all columns:
How do I remove it?
When I try this, it doesn't work either:
from scipy.io import arff
import pandas as pd
data = arff.loadarff('Autism-Adult-Data.arff')
df = pd.DataFrame(data[0].str.decode('utf-8'))
df.head()
It says AttributeError: 'numpy.ndarray' object has no attribute 'str'.
As you can see, the .str.decode('utf-8') approach from "Removing b'' from string column in a pandas dataframe" didn't solve the problem.
This doesn't work either:
df.index = df.index.str.encode('utf-8')
As you can see, both the strings and the numbers are bytes objects.
I was looking at the same dataset and had a similar issue, and I found a workaround that may be helpful. Rather than using from scipy.io import arff, I used another library called liac-arff. The code should look like:
pip install liac-arff
(or whatever pip command works for your operating system or IDE), and then:
import arff
import pandas as pd
data = arff.load(open('Autism-Adult-Data.arff'))  # liac-arff's load() takes a file object; loads() takes a string
This returns a dictionary. To find what keys that dictionary has, you do
data.keys()
and you will find that all arff files have the following keys
['description', 'relation', 'attributes', 'data']
Where data is the actual data and attributes has the column names and the unique values of those columns. So to get a data frame you need to do the following
colnames = []
for i in range(len(data['attributes'])):
    colnames.append(data['attributes'][i][0])
df = pd.DataFrame.from_dict(data['data'])
df.columns = colnames
df.head()
I went a bit overboard with building the dataframe here, but this returns a data frame with no b'' issues, and the key is using import arff (liac-arff).
So the GitHub for the library I used can be found here.
Although Shimon shared an answer, you could also give this a try:
df.apply(lambda x: x.str.decode('utf8'))
Note that this raises an AttributeError on numeric columns, so it only works if every column holds byte strings.
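If you prefer to stay with scipy.io.arff, here is a minimal sketch (under the assumption that every byte-valued column loads with object dtype, which is how loadarff behaves) that decodes only those columns and leaves the numeric ones untouched:
from scipy.io import arff
import pandas as pd

data = arff.loadarff('Autism-Adult-Data.arff')
df = pd.DataFrame(data[0])

# Only the object-dtype columns hold byte strings; numeric columns pass through.
str_cols = df.select_dtypes([object]).columns
df[str_cols] = df[str_cols].apply(lambda col: col.str.decode('utf-8'))
df.head()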

Convert dask to pandas dataframe

I have a question quite similar to this one: Dask read_csv -- Mismatched dtypes found in `pd.read_csv`/`pd.read_table`
I am running the following script:
import pandas as pd
import dask.dataframe as dd
df2 = dd.read_csv("Path/*.csv", sep='\t', encoding='unicode_escape', sample=2500000)
df2 = df2.loc[~df2['Type'].isin(['STVKT','STKKT', 'STVK', 'STKK', 'STKET', 'STVET', 'STK', 'STKVT', 'STVVT', 'STV', 'STVZT', 'STVV', 'STKV', 'STVAT', 'STKAT', 'STKZT', 'STKAO', 'STKZE', 'STVAO', 'STVZE', 'STVT', 'STVNT'])]
df2 = df2.compute()
And I get the following error: ValueError: Mismatched dtypes found in pd.read_csv/pd.read_table.
How can I avoid that? I have over 32 columns, so I can't set up the dtypes up front. As a hint, the message also says: Specify dtype option on import or set low_memory=False.
When Dask loads your CSV, it tries to infer the dtypes from a sample at the start of the file, and then assumes that the rest of the file's partitions have the same dtypes for each column. Since the pandas types inferred from a CSV depend on the set of values actually seen, this is where the error comes from.
To fix, you either have to explicitly tell dask what types to expect, or increase the size of the portion dask tries to guess types from (sample=).
The error message should have told you which columns were not matching and the types found, so you only need to specify those to get things working.
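As a hedged sketch, explicitly telling dask the types looks like this; 'SomeColumn' stands in for whichever columns your error message flagged, so substitute the real names and types:
import dask.dataframe as dd

df2 = dd.read_csv("Path/*.csv", sep='\t', encoding='unicode_escape',
                  dtype={'SomeColumn': 'object'})  # hypothetical column name
df2 = df2.compute()  # now a plain pandas DataFrame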
Maybe try this:
df = df2.compute()
(There is no need to create an empty pd.DataFrame() first; compute() already returns one.)

How do I convert these 2 columns as seen below in In [10] to a dataframe/table to be able to export to a csv file

Hi, I am very new to Python. I plan to create a final exportable table from reviews scraped from a website, to see which words were used most. I have managed to get these two columns (word and count) but have no idea how to proceed: can I export this directly into a table in Excel, or must I convert it into a dataframe and then export it to a CSV? And what code should I run to do that? Thank you so much for your help!
It's convenient to use the pandas library for working with dataframes:
import pandas as pd
series = pd.Series(wordcount)
series.to_csv("wordcount.csv")
However, if you use the code above, you'll get a warning. To fix it, there are 2 ways:
1) Add header parameter:
series.to_csv("wordcount.csv", header=True)
2) Or convert series to dataframe and then save it (without new index):
df = series.reset_index()
df.to_csv("wordcount.csv", index=False)

Import Excel file into Python as a list

I want to import one column with 10 rows in to Python as a list.
So in Excel I have, for example: One, Two, Three, Four, ..., Ten,
written in column A across rows 1-10.
Now I want to import these cells into Python, so that my result is:
list = ['One', 'Two', 'Three', 'Four', ..., 'Ten']
Since I am a total noob at programming, I have no clue how to do it, so please tell me the easiest way. None of the tutorials I have found got me the result I want.
Thank you
I am using Python 2.7
Even though pandas is a great library, for your simple task you can just use xlrd:
import xlrd
wb = xlrd.open_workbook(path_to_my_workbook)
ws = wb.sheet_by_index(0)
mylist = ws.col_values(0)
Note that list is not a good name for a variable in Python, because that is the name of a built-in function.
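For the layout in the question (ten text cells in column A of the first sheet), the snippet above would give:
print mylist
# ['One', 'Two', 'Three', 'Four', ..., 'Ten']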
I am unsure if your data is in XLSX or CSV form. If XLSX, use this Python Excel tutorial. If CSV, it is much easier, and you can follow the code snippet below. If you don't want to use pandas, you can use the numpy library. The example snippet below reads a CSV file into a list:
import numpy as np
csv_file = np.genfromtxt('filepath/relative/to/your/script.csv',
                         delimiter=',', dtype=str)
my_list = csv_file.tolist()
This will work for a file that has only one column of text. If you have more columns, use the following snippet to get just the first column (the 0 indexes the first column):
first_column = csv_file[:, 0].tolist()
I recommend installing pandas.
pip install pandas
and
import pandas
df = pandas.read_excel('path/to/data.xlsx') # The options of that method are quite neat; Stores to a pandas.DataFrame object
print df.head() # show a preview of the loaded data
idx_of_column = 5-1 # in case the column of interest is the 5th in Excel
print list(df.iloc[:,idx_of_column]) # access via index
print list(df.loc[['my_row_1','my_row_2'],['my_column_1','my_column_2']]) # access certain elements via row and column names
print list(df['my_column_1']) # straight forward access via column name
(check out the pandas docs)
or
pip install xlrd
and then:
from xlrd import open_workbook
wb = open_workbook('simple.xls')
for s in wb.sheets():
    print 'Sheet:', s.name
    for row in range(s.nrows):
        values = []
        for col in range(s.ncols):
            values.append(s.cell(row, col).value)
        print ','.join(values)
(example from https://github.com/python-excel/tutorial/raw/master/python-excel.pdf)

Python: convert excel data into dataframes

I want to put some data available in an excel file into a dataframe in Python.
The code I use is as below (two examples I use to read an excel file):
d=pd.ExcelFile(fileName).parse('CT_lot4_LDO_3Tbin1')
e=pandas.read_excel(fileName, sheetname='CT_lot4_LDO_3Tbin1',convert_float=True)
The problem is that the dataframe I get keeps only two digits after the decimal point. In other words, Excel values like 0.123456 come into the dataframe as 0.12.
Some rounding or truncation seems to be happening, but I cannot find how to change it.
Can anyone help me?
Thanks for the help!
You can try this. I used test.xlsx which has two sheets, and 'CT_lot4_LDO_3Tbin1' is the second sheet. I also set the first value as Text format in excel.
import pandas as pd
fileName = 'test.xlsx'
df = pd.read_excel(fileName,sheetname='CT_lot4_LDO_3Tbin1')
Result:
In [9]: df
Out[9]:
       Test
0  0.123456
1  0.123456
2  0.132320
Without seeing the real raw data file, I think this is the best answer I can think of.
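One thing worth checking (an assumption, since the raw file isn't available): pandas often stores the full precision and merely displays fewer digits. display.precision is a standard pandas display option:
import pandas as pd

pd.set_option('display.precision', 6)  # show six decimal places when printing frames
print(df['Test'].iloc[0])              # a single scalar also prints at full precision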
Well, when I try:
df = pd.read_csv(r'my file name')
I get something like this in df:
http://imgur.com/a/Q2upp
And I cannot put .fileformat in the statement.
You might be interested in turning off the column datatype inference that pandas performs automatically, by manually specifying the datatype for the column. Here is what you might be looking for:
Python pandas: how to specify data types when reading an Excel file?
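As a hedged sketch of that approach, reusing the sheet and column names from this thread (converters is a standard read_excel parameter; sheetname is the pandas 0.20-era spelling used above):
import pandas as pd

# Force the 'Test' column to load as text, bypassing float inference entirely.
df = pd.read_excel('test.xlsx', sheetname='CT_lot4_LDO_3Tbin1',
                   converters={'Test': str})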
Using pandas 0.20.1 something like this should work:
df = pd.read_csv('CT_lot4_LDO_3Tbin1.fileformat')
For an Excel file, use pd.read_excel instead, since read_csv cannot parse .xlsx:
df = pd.read_excel('CT_lot4_LDO_3Tbin1.xlsx')
Read this documentation:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
