I read my arff file from https://archive.ics.uci.edu/ml/machine-learning-databases/00426/ into a dataframe like this:
from scipy.io import arff
import pandas as pd
data = arff.loadarff('Autism-Adult-Data.arff')
df = pd.DataFrame(data[0])
df.head()
But my dataframe has a b' prefix on every value in every column:
How do I remove it?
When I try this, it doesn't work either:
from scipy.io import arff
import pandas as pd
data = arff.loadarff('Autism-Adult-Data.arff')
df = pd.DataFrame(data[0].str.decode('utf-8'))
df.head()
It says AttributeError: 'numpy.ndarray' object has no attribute 'str'
As you see, .str.decode('utf-8') from Removing b'' from string column in a pandas dataframe didn't solve the problem.
This doesn't work either:
df.index = df.index.str.encode('utf-8')
As you see, both the strings and the numbers are bytes objects.
I was looking at the same dataset and had a similar issue. I found a workaround, and I'm not sure if this post will be helpful. Rather than using from scipy.io import arff, I used another library called liac-arff. So the code should be like
pip install liac-arff
Or whatever pip command works for your operating system or IDE, and then
import arff
import pandas as pd
data = arff.load(open('Autism-Adult-Data.arff'))
Note that liac-arff's loads expects the file contents as a string, while load takes an open file object. load returns a dictionary. To find what keys that dictionary has, you do
data.keys()
and you will find that all arff files have the following keys
['description', 'relation', 'attributes', 'data']
where data holds the actual rows and attributes holds the column names and the unique values of those columns. So to get a data frame you need to do the following:
colnames = []
for i in range(len(data['attributes'])):
    colnames.append(data['attributes'][i][0])
df = pd.DataFrame(data['data'], columns=colnames)
df.head()
I went a bit overboard with building the dataframe, but this returns a data frame with no b'' issues, and the key is using import arff from liac-arff.
So the GitHub for the library I used can be found here.
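Since the dict liac-arff returns has a fixed set of keys, the frame-building step can be checked without the actual file; here is a minimal sketch using a toy version of that dict (all values made up, only the shape matches what load returns):

```python
import pandas as pd

# Toy version of the dict liac-arff's load() returns (values made up).
data = {
    "description": "",
    "relation": "toy",
    "attributes": [("age", "NUMERIC"), ("gender", ["f", "m"])],
    "data": [[26.0, "f"], [24.0, "m"]],
}

# Column names are the first element of each (name, type) pair.
colnames = [name for name, _ in data["attributes"]]
df = pd.DataFrame(data["data"], columns=colnames)
```

The same two lines at the end work unchanged on the real dict loaded from Autism-Adult-Data.arff.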
Although Shimon shared an answer, you could also give this a try:
df.apply(lambda x: x.str.decode('utf8'))
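One caveat: that one-liner assumes every column holds bytes; calling .str on a numeric column raises AttributeError. A sketch that restricts the decode to object columns, using a toy frame in place of the real dataset:

```python
import pandas as pd

# Toy frame mimicking scipy.io.arff.loadarff output: object columns
# hold bytes, numeric columns hold floats (data made up).
df = pd.DataFrame({"gender": [b"f", b"m"], "age": [26.0, 24.0]})

# Restrict the decode to the object columns; applying .str.decode to
# the float column would raise AttributeError.
obj_cols = df.select_dtypes(include="object").columns
df[obj_cols] = df[obj_cols].apply(lambda s: s.str.decode("utf-8"))
```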
I just want to import graphs from external sources into Python and read the corresponding x and y values.
Is this possible with any Python module, and if so, what formats can the graphs be imported in?
I searched for such modules but could only find articles about plotting graphs.
You can use the pandas library for loading .csv, .xls, .xlsx and some other files.
You can install it using pip install pandas. This is an example of loading a file:
import pandas as pd
df = pd.read_csv("file.csv")
df.head()
For loading a CSV column into a Python list you can use:
import pandas as pd
df = pd.read_csv("file.csv")
x_list = df["x"].tolist()  # read the column named "x" and convert it to a list
y_list = df["y"].tolist()  # read the column named "y" and convert it to a list
And for loading them into NumPy arrays you can use:
import pandas as pd
df = pd.read_csv("file.csv")
x_arr = df["x"].to_numpy()  # read the column named "x" and convert it to a NumPy array
y_arr = df["y"].to_numpy()  # read the column named "y" and convert it to a NumPy array
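Both steps can be tried without a file on disk by feeding read_csv an in-memory buffer; a minimal sketch (the column names x and y are assumed, the numbers are made up):

```python
import io
import pandas as pd

# In-memory stand-in for "file.csv".
csv_text = "x,y\n1,2\n3,4\n5,6\n"
df = pd.read_csv(io.StringIO(csv_text))

x_list = df["x"].tolist()   # plain Python list
y_arr = df["y"].to_numpy()  # NumPy array
```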
I have a (theoretically) simple task. I need to pull out a single column of 4000ish names from a table and use it in another table.
I'm trying to extract the column using pandas and I have no idea what is going wrong. It keeps flagging an error:
TypeError: string indices must be integers
import pandas as pd
file ="table.xlsx"
data = file['Locus tag']
print(data)
You have only assigned the file name to a string; you never actually load the file, so file['Locus tag'] is indexing a string, which raises the TypeError. You first need pandas' read_excel function; that will let you read the data and then extract the column.
Sample Code
import pandas as pd
import os
p = os.path.dirname(os.path.realpath(r"C:\Car_sales.xlsx"))
name = 'Car_sales.xlsx'
path = os.path.join(p, name)
Z = pd.read_excel(path)
Z.head()
Sample Code
import pandas as pd
df = pd.read_excel("add the path")
df.head()
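Once the file is actually loaded with read_excel, pulling the column out is a single indexing step; a sketch on a stand-in frame (the column name "Locus tag" comes from the question, the values are made up):

```python
import pandas as pd

# Stand-in for the frame pd.read_excel("table.xlsx") would return.
df = pd.DataFrame({"Locus tag": ["b0001", "b0002", "b0003"],
                   "Product": ["thrL", "thrA", "thrB"]})

# Index the DataFrame (not the file-name string) by the column header.
tags = df["Locus tag"].tolist()
```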
Just when I think I am finally getting it, such a newb.
I am trying to get a list of numbers from a column from a table that is an PDF.
First step I wanted to convert to a Panda DF.
pip install tabula-py
pip install PyPDF2
import pandas as pd
import tabula
df = tabula.read_pdf('/content/Manifest.pdf')
The output I get, however, is a list of length 1, not a DataFrame. When I inspect it, the info is there; I just have no idea how to access it.
So I'm not sure why I didn't get a DataFrame, and I have no idea what to do with a list of 1.
Not sure if it matters but I am using google Colab.
Any help would be awesome.
Thanks
tabula.read_pdf returns a list of DataFrames when called without additional arguments. To access your specific dataframe, select it by index.
Here's an example where I read the document, selected the very first index, and compared the types:
import tabula
df = tabula.read_pdf(
    "https://github.com/chezou/tabula-py/raw/master/tests/resources/data.pdf")
df_0 = df[0]
print("type of df :", type(df))
print("type of df_0", type(df_0))
Returns:
type of df : <class 'list'>
type of df_0 <class 'pandas.core.frame.DataFrame'>
Or try something like:
df = tabula.read_pdf('/content/Manifest.pdf', sep=' ')
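If read_pdf hands back several tables (one per page, say), they can be stitched into a single frame with pd.concat; a sketch using stand-in frames instead of real tabula output (column name and values made up):

```python
import pandas as pd

# Stand-ins for the per-page frames in the list tabula.read_pdf returns.
pages = [pd.DataFrame({"qty": [1, 2]}),
         pd.DataFrame({"qty": [3]})]

# Stitch them into one frame with a fresh index.
manifest = pd.concat(pages, ignore_index=True)
```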
I have quite a similar question to this one: Dask read_csv-- Mismatched dtypes found in `pd.read_csv`/`pd.read_table`
I am running the following script:
import pandas as pd
import dask.dataframe as dd
df2 = dd.read_csv("Path/*.csv", sep='\t', encoding='unicode_escape', sample=2500000)
df2 = df2.loc[~df2['Type'].isin(['STVKT','STKKT', 'STVK', 'STKK', 'STKET', 'STVET', 'STK', 'STKVT', 'STVVT', 'STV', 'STVZT', 'STVV', 'STKV', 'STVAT', 'STKAT', 'STKZT', 'STKAO', 'STKZE', 'STVAO', 'STVZE', 'STVT', 'STVNT'])]
df2 = df.compute()
And I get the following error: ValueError: Mismatched dtypes found in pd.read_csv/pd.read_table.
How can I avoid that? I have over 32 columns, so I can't set up the dtypes upfront. As a hint, it also says: Specify dtype option on import or set low_memory=False
When Dask loads your CSV, it infers the dtypes from a sample at the start of the file, and then assumes the rest of the file's partitions have the same dtypes for each column. Since the pandas types inferred from a CSV depend on the set of values seen, this is where the error comes from.
To fix it, you either have to tell Dask explicitly what types to expect (dtype=), or increase the size of the portion Dask guesses types from (sample=).
The error message should have told you which columns were mismatched and the types found, so you only need to specify those to get things working.
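Passing explicit dtypes uses the same mapping in dd.read_csv as in pd.read_csv; a sketch with an in-memory CSV and plain pandas (the column name Type comes from the question, Value and the rows are made up):

```python
import io
import pandas as pd

# In-memory stand-in for one of the tab-separated CSVs.
csv_text = "Type\tValue\nSTK\t1\nABC\t2\n"

# Force the ambiguous columns to fixed dtypes instead of letting the
# parser guess per partition; dd.read_csv accepts the same dtype= mapping.
df = pd.read_csv(io.StringIO(csv_text), sep="\t",
                 dtype={"Type": "object", "Value": "float64"})
```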
Maybe try this:
df = df2.compute()
Note that your last line calls .compute() on df, which was never defined; it has to be called on the Dask frame df2.
I have loaded an arff file to python using this code:
import pandas as pd, scipy as sp
from scipy.io import arff
datos,meta = arff.loadarff(open('selectividad.arff', 'r'))
d = pd.DataFrame(datos)
When I use the head function to see the data frame, this is how it looks:
However, those 'b' are not present in the arff file as we can see below:
https://gyazo.com/3123aa4c7007cb4d6f99241b1fc41bcb
What is the problem here? Thank you very much
For one column, apply the following code:
data['name_column'] = data['name_column'].str.decode('utf-8')
For a whole dataframe, apply:
str_df = df.select_dtypes([object])
df[str_df.columns] = str_df.stack().str.decode('utf-8').unstack()
(np.object was removed in NumPy 1.24; the builtin object works the same here, and writing the result back into df keeps the numeric columns untouched.)
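On a toy frame the stack/decode/unstack round trip looks like this (data made up; the bytes column stands in for what loadarff produces):

```python
import pandas as pd

# Toy frame with a bytes column and a numeric column.
df = pd.DataFrame({"name": [b"ana", b"bob"], "score": [7.0, 9.0]})

# Decode only the object columns and write them back into the frame.
str_df = df.select_dtypes([object])
df[str_df.columns] = str_df.stack().str.decode("utf-8").unstack()
```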