I have loaded an ARFF file into Python using this code:
import pandas as pd, scipy as sp
from scipy.io import arff
datos,meta = arff.loadarff(open('selectividad.arff', 'r'))
d = pd.DataFrame(datos)
When I use the head function to view the data frame, this is how it looks:
However, those 'b' prefixes are not present in the ARFF file, as we can see below:
https://gyazo.com/3123aa4c7007cb4d6f99241b1fc41bcb
What is the problem here? Thank you very much
For one column, apply the following code:
data['name_column'] = data['name_column'].str.decode('utf-8')
For a dataframe, apply:
str_df = df.select_dtypes([object])  # np.object was removed in NumPy 1.24; the builtin object works
str_df = str_df.stack().str.decode('utf-8').unstack()
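As a self-contained illustration of the decoding step, here is a minimal sketch. The frame below is a hypothetical stand-in for the one built from the ARFF file (the column names and values are made up for the example). It decodes each object column and writes the result back into the original frame, leaving numeric columns untouched. It assumes every object column holds bytes, which is what loadarff produces for nominal and string attributes.

```python
import pandas as pd

# Hypothetical stand-in for the frame built from loadarff output:
# nominal/string attributes arrive as bytes, numeric ones as floats.
df = pd.DataFrame({
    "sexo": [b"H", b"M"],
    "convocatoria": [b"JUN", b"SEP"],
    "nota": [6.5, 7.2],
})

# Pick the object columns (assumed here to contain only bytes values)
bytes_cols = df.select_dtypes([object]).columns

# Decode each one and assign it back to the original frame
for col in bytes_cols:
    df[col] = df[col].str.decode("utf-8")

print(df.head())
```

Assigning back column by column keeps the numeric columns and the column order exactly as they were.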
I just want to import graphs from external sources into Python and read the corresponding x and y values.
Is this possible with any Python module, and if so, in what format can the graphs be imported?
I searched for such modules but could only find articles about plotting graphs.
You can use the pandas library to load .csv, .xls, .xlsx and some other file types.
You can install it with pip install pandas. This is an example of loading a file:
import pandas as pd
df = pd.read_csv("file.csv")
df.head()
To load the columns of a CSV file into Python lists, you can use:
import pandas as pd
df = pd.read_csv("file.csv")
x_list = df["x"].tolist() # read the column named "x" and convert it to a list
y_list = df["y"].tolist() # read the column named "y" and convert it to a list
And to load them into NumPy arrays, you can use:
import pandas as pd
df = pd.read_csv("file.csv")
x_array = df["x"].to_numpy() # read the column named "x" and convert it to a NumPy array
y_array = df["y"].to_numpy() # read the column named "y" and convert it to a NumPy array
I have a (theoretically) simple task. I need to pull out a single column of 4000ish names from a table and use it in another table.
I'm trying to extract the column using pandas and I have no idea what is going wrong. It keeps flagging an error:
TypeError: string indices must be integers
import pandas as pd
file ="table.xlsx"
data = file['Locus tag']
print(data)
You have only assigned the file name to a variable; the file itself was never loaded. Because file is a plain string, file['Locus tag'] tries to index a string with a column name, which raises the TypeError. Load the workbook with the pandas read_excel function first, and then you can read the data and extract the column.
Sample Code
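import os
import pandas as pd
# raw strings keep the backslashes in Windows paths from being read as escape sequences
p = os.path.dirname(os.path.realpath(r"C:\Car_sales.xlsx"))
name = "Car_sales.xlsx"  # file name only; joining an absolute path would discard p
path = os.path.join(p, name)
Z = pd.read_excel(path)
Z.head()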
Sample Code
import pandas as pd
df = pd.read_excel("add the path")
df.head()
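To make the cause of the error concrete, here is a small sketch that reproduces it and then shows column extraction working on a real DataFrame. The in-memory table is a hypothetical stand-in for what pd.read_excel("table.xlsx") would return; its values and the second column name are invented for the example.

```python
import pandas as pd

file = "table.xlsx"  # this is only a string, not the table

# Indexing a string with a column name reproduces the error from the question
try:
    file["Locus tag"]
except TypeError as exc:
    message = str(exc)
    print("TypeError:", message)

# Hypothetical stand-in for: table = pd.read_excel(file)
table = pd.DataFrame({
    "Locus tag": ["tag_1", "tag_2", "tag_3"],
    "Other column": [10, 20, 30],
})

# Indexing the DataFrame by column name returns that column as a Series
locus_tags = table["Locus tag"]
print(locus_tags.tolist())
```

Once the column is a Series, it can be assigned into another table or converted with tolist().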
I read my ARFF data from here https://archive.ics.uci.edu/ml/machine-learning-databases/00426/ like this:
from scipy.io import arff
import pandas as pd
data = arff.loadarff('Autism-Adult-Data.arff')
df = pd.DataFrame(data[0])
df.head()
But my dataframe has b' in front of every value in every column:
How do I remove it?
When I try this, it doesn't work either:
from scipy.io import arff
import pandas as pd
data = arff.loadarff('Autism-Adult-Data.arff')
df = pd.DataFrame(data[0].str.decode('utf-8'))
df.head()
It says AttributeError: 'numpy.ndarray' object has no attribute 'str'
As you can see, .str.decode('utf-8') from "Removing b'' from string column in a pandas dataframe" didn't solve the problem.
This doesn't work either:
df.index = df.index.str.encode('utf-8')
As you can see, both the strings and the numbers are bytes objects.
I was looking at the same dataset and had a similar issue. I found a workaround that may be helpful. Rather than using from scipy.io import arff, I used another library called liac-arff. So the code should look like this:
pip install liac-arff
Or whichever pip command works for your operating system or IDE, and then
import arff
import pandas as pd
# arff.load takes a file object; arff.loads expects the ARFF text itself, not a file name
data = arff.load(open('Autism-Adult-Data.arff'))
Here data is a dictionary. To see which keys that dictionary has, run
data.keys()
and you will find that all arff files have the following keys
['description', 'relation', 'attributes', 'data']
Where data is the actual data and attributes has the column names and the unique values of those columns. So to get a data frame you need to do the following
colnames = []
for i in range(len(data['attributes'])):
    colnames.append(data['attributes'][i][0])
df = pd.DataFrame.from_dict(data['data'])
df.columns = colnames
df.head()
I went a bit overboard with building the dataframe, but this returns a data frame without the b' prefixes. The key is using import arff (liac-arff) instead of scipy.io.arff.
So the GitHub for the library I used can be found here.
Although Shimon shared an answer, you could also give this a try:
df.apply(lambda x: x.str.decode('utf8'))
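One caveat with the one-liner: the .str accessor is only available on columns that hold strings or bytes, so on a frame that also contains numeric columns the plain apply raises an AttributeError. The sketch below, using a hypothetical two-column frame (the names and values are invented), shows the failure and a guarded variant that decodes only object columns. The guard assumes that every object column holds bytes.

```python
import pandas as pd

# Hypothetical frame mixing a bytes column and a numeric column
df = pd.DataFrame({"gender": [b"f", b"m"], "age": [26.0, 24.0]})

# The unguarded version fails as soon as it reaches the numeric column
try:
    df.apply(lambda x: x.str.decode("utf8"))
    failed = False
except AttributeError:
    failed = True

# Guarded version: decode object columns, pass everything else through
decoded = df.apply(
    lambda x: x.str.decode("utf8") if x.dtype == object else x
)
print(decoded)
```

Note that apply returns a new frame, so the result has to be assigned (for example df = decoded) for the change to stick.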
While trying to load a big CSV file (150 MB) I get the error "Kernel died, restarting". The only code that I use is the following:
import pandas as pd
from pprint import pprint
from pathlib import Path
from datetime import date
import numpy as np
import matplotlib.pyplot as plt
basedaily = pd.read_csv('combined_csv.csv')
It used to work before, and I do not know why it no longer does. I tried to fix it using engine="python" as follows:
basedaily = pd.read_csv('combined_csv.csv', engine='python')
But it gives me an "execution aborted" error.
Any help would be welcome!
Thanks in advance!
You probably got this error because you ran out of memory. You can read the data as several smaller data frames, do your work on each one, and then merge them again. Below is some code that you may find useful:
import pandas as pd
# the number of row in each data frame
# you can put any value here according to your situation
chunksize = 1000
# the list that contains all the dataframes
list_of_dataframes = []
for df in pd.read_csv('combined_csv.csv', chunksize=chunksize):
    # process your data frame here
    # then add the current data frame into the list
    list_of_dataframes.append(df)
# if you want all the dataframes together, here it is
result = pd.concat(list_of_dataframes)
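Keep in mind that concatenating every chunk rebuilds the full table in memory, so if memory really is the limit it may help to reduce each chunk before keeping anything. The sketch below illustrates that idea on a tiny in-memory CSV. The column names and values are hypothetical, and io.StringIO stands in for the real file path.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for 'combined_csv.csv'
csv_text = "day,value\n1,10\n1,20\n2,30\n2,40\n3,50\n"

total = 0
rows = 0

# Read two rows at a time and keep only running aggregates,
# so no more than one chunk is held in memory at once
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total += int(chunk["value"].sum())
    rows += len(chunk)

print(rows, total)
```

If the full table is still needed, the usecols and dtype parameters of read_csv can reduce how much memory each chunk takes.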
I was able to load the .arff file using the following commands, but I was not able to extract the data from the resulting object and convert it into a dataframe. I need this in order to apply machine learning algorithms to the dataframe.
Command:-
import arff
dataset = pd.DataFrame(arff.load(open('Training Dataset.arff')))
print(dataset)
Please help me to convert the data from here into a dataframe.
import numpy as np
import pandas as pd
from scipy.io.arff import loadarff
raw_data = loadarff('Training Dataset.arff')
df_data = pd.DataFrame(raw_data[0])
Try this. Hope it helps
from scipy.io.arff import loadarff
import pandas as pd
data = loadarff('Training Dataset.arff')
df = pd.DataFrame(data[0])
Similar to the answer above, but with no need to import numpy.