I've been learning GraphLab, but I wanted to take a look at pandas as well, since it's open source and in the future I might find myself at a company without a GL license. I was wondering how pandas would handle creating a basic model the way I can with GL.
import graphlab as gl
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv("~/Downloads/diamonds.csv")
sframe = gl.SFrame(data)
train_data, test_data = sframe.random_split(.8, seed=1)
train, test = train_test_split(data, train_size=0.75, random_state=88)
reg_model = gl.linear_regression.create(train_data, target="price", features=["carat","cut","color"], validation_set=None)
What would be the pandas equivalent of the last line above?
pandas itself doesn't have any predictive modeling built in (that I know of).
Here is a good link on how to leverage pandas in a statistical model. This one too.
pandas is probably one of the best (if not the best) modules for data manipulation in Python. It makes storing and manipulating data for modeling much easier than working with raw lists or hand-rolled CSV reading.
Reading in files is as easy as (notice how intuitive it is):
import pandas as pd
# Excel
df1 = pd.read_excel(PATH_HERE)
# CSV
df1 = pd.read_csv(PATH_HERE)
# JSON
df1 = pd.read_json(PATH_HERE)
and to spit it out:
# Excel
df1.to_excel(PATH_HERE)
# Need I go on again??
It also makes filtering and slicing your data very simple. Here is the official doc.
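For example, here's a quick sketch of filtering and slicing, using the diamonds.csv columns from the question above (just a taste, not an exhaustive list of methods):
import pandas as pd

df = pd.read_csv("~/Downloads/diamonds.csv")
cheap = df[df["price"] < 1000]  # boolean filtering
subset = df[["carat", "cut", "price"]]  # column selection
first_ten = df.loc[:9, ["cut", "color"]]  # label-based slicing of the first 10 rows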
For modeling purposes, have a look at sklearn, and NLTK for text analysis. There are others, but those are the ones I've used.
For modelling you'll want the sklearn library. The equivalent of that last line is roughly this (note that cut and color are strings, so they need to be encoded before fitting; see the sketch below):
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(train_data[["carat", "cut", "color"]], train_data["price"])
docs
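Since cut and color are categorical strings, here is a slightly fuller sketch of a pandas + sklearn equivalent. One-hot encoding with pd.get_dummies is just one reasonable choice, not necessarily what GraphLab does internally:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

data = pd.read_csv("~/Downloads/diamonds.csv")
# One-hot encode the categorical columns so sklearn can use them
X = pd.get_dummies(data[["carat", "cut", "color"]], columns=["cut", "color"])
y = data["price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, random_state=88)
model = LinearRegression()
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # R^2 on the held-out split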
I have a file bigger than 7GB. I am trying to place it into a dataframe using pandas, like this:
df = pd.read_csv('data.csv')
But it takes too long. Is there a better way to speed up the dataframe creation? I was considering changing the parameter engine='c', since it says in the documentation:
"engine{‘c’, ‘python’}, optional
Parser engine to use. The C engine is faster while the python engine is currently more feature-complete."
But I don't see much gain in speed.
If the problem is that you are not able to create the dataframe at all, because the file's size makes the operation fail, you can check how to chunk it in this answer.
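A minimal sketch of chunked reading with pandas (the chunk size here is just an example; tune it to your memory):
import pandas as pd

chunks = []
# Read the file in pieces instead of loading it all at once
for chunk in pd.read_csv('data.csv', chunksize=1_000_000):
    # Optionally filter or aggregate each chunk here to keep memory low
    chunks.append(chunk)
df = pd.concat(chunks, ignore_index=True)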
If the dataframe does get created at some point but you consider it too slow, you can use datatable to read the file, convert it to pandas, and continue with your operations:
import pandas as pd
import datatable as dt
# Read with datatable
datatable_df = dt.fread('myfile.csv')
# Then convert the frame into a pandas DataFrame
pandas_df = datatable_df.to_pandas()
I use Pandas and Dask all the time. I also have a number of custom classes and functions that I use across many different analyses, and I am always having to edit them to account for either Dask or Pandas. I consistently find myself wishing I could assign attributes to the dataset I am analyzing, which would minimize calls to Dask's compute and make it easier to manage my functions as I switch between data types. Something effectively akin to:
import pandas as pd
import dask.dataframe as dd
from pydataset import data
df = data('titanic')
setattr(df, 'vals12', 1)
test = dd.from_pandas(df, npartitions = 2)
test.vals12 #would still contain the attribute vals12
df = test.compute()
df.vals12 #would still contain the attribute vals12
However, I do not know of a way to achieve this without editing the base packages (Pandas / Dask). Is there a way to achieve the above example without creating a new class (or a static version of the packages)? Alternatively, is there a way to "branch" the repos in a non-public way, so that my edits are included but I can still pick up future features without pain?
In the upcoming release of Dask, you will be able to do this by using the attrs feature added in pandas 1.0. For now, you can pip install dask from GitHub to use this functionality.
import pandas as pd
import dask.dataframe as dd
df = pd.DataFrame({
    "a": [0, 1, 2],
    "b": [2, 3, 4],
})
df.attrs["vals12"] = 1
ddf = dd.from_pandas(df, npartitions=2)
ddf.attrs
{'vals12': 1}
I'm kinda new to Python and data science.
I have a 33 GB csv dataset, and I want to parse it into a DataFrame to do some work on it.
I tried to do it the 'casual' way with pandas.read_csv and it's taking ages to parse.
I searched the internet and found this article.
It says that the most efficient way to read a large csv file is to use csv.DictReader.
So I tried that:
import pandas as pd
import csv
df = pd.DataFrame(csv.DictReader(open("MyFilePath")))
Even with this solution it's taking ages to do the job.
Can you guys please tell me what's the most efficient way to parse a large dataset into pandas?
There is no way to read such a big file in a short time. In any case, there are some strategies for dealing with large data; here are a few that let you implement your code without leaving the comfort of Pandas:
Sampling
Chunking
Optimising Pandas dtypes
Parallelising Pandas with Dask.
The simplest option is sampling your dataset (this may be helpful for you). Sometimes a random part of a large dataset will already contain enough information for the calculations that follow. If you don't actually need to process your entire dataset, this is an excellent technique to use.
Sample code:
import pandas
import random

filename = "data.csv"
n = sum(1 for line in open(filename)) - 1  # number of data rows in the file
m = 10  # keep roughly 1/m of the rows (pick whatever fraction you need)
s = n // m  # sample size
skip = sorted(random.sample(range(1, n + 1), n - s))  # rows to skip, keeping the header
df = pandas.read_csv(filename, skiprows=skip)
This is the link for chunking large data.
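On the "Optimising Pandas dtypes" point above, here is a quick sketch; the column names and dtypes are made up, and the idea is simply that declaring dtypes (and only the columns you need) up front avoids costly type inference and shrinks memory:
import pandas as pd

dtypes = {
    "user_id": "int32",
    "score": "float32",
    "category": "category",
}
df = pd.read_csv("data.csv", usecols=list(dtypes), dtype=dtypes)
print(df.memory_usage(deep=True))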
I have a very simple csv file that looks like this:
time,is_boy,is_girl
135,1,0
136,0,1
137,0,1
I also have this csv sitting in a Hive table, where all the values were created as doubles.
Behind the scenes this table is actually enormous, with a huge number of rows, so I have chosen to use Spark 2 to solve this problem.
I would like to use this clustering library, with Python:
https://spark.apache.org/docs/2.2.0/ml-clustering.html
If anyone knows how to load this data, either directly from the csv or by using some Spark SQL magic, and preprocess it correctly with Python so that it can be passed into the kmeans fit() method to calculate a model, I would be very grateful. I also think it would be useful for others, as I haven't found an example for csvs with this library yet.
The fit method just takes a vector / DataFrame.
spark.read.csv or spark.sql both return a DataFrame.
However you want to preprocess your data, read over the DataFrame documentation before getting into the MLlib / KMeans examples.
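For the load-directly-from-csv part, a minimal PySpark sketch (the path and app name are placeholders):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kmeans-example").getOrCreate()
# header=True uses the first line as column names,
# inferSchema=True parses the numeric columns as doubles
df = spark.read.csv("/path/to/data.csv", header=True, inferSchema=True)
df.printSchema()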
So I guessed enough times and finally solved this. There were quite a few weird things I had to do to get it to work, so I feel it's worth sharing:
I created a simple csv like so:
time,is_boy,is_girl
123,1.0,0.0
132,1.0,0.0
135,0.0,1.0
139,0.0,1.0
140,1.0,0.0
Then I created a Hive table by executing this query in Hue:
CREATE EXTERNAL TABLE pollab02.experiment_raw(
`time` double,
`is_boy` double,
`is_girl` double)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' with
serdeproperties( 'separatorChar' = ',' )
STORED AS TEXTFILE LOCATION "/user/me/hive/experiment"
TBLPROPERTIES ("skip.header.line.count"="1", "skip.footer.line.count"="0")
Then my pyspark script was as follows:
(I'm assuming a SparkSession has been created with the name "spark")
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans
raw_data = spark.sql("select * from dbname.experiment_raw")
#filter out row of null values that were added for some reason
raw_data_filtered=raw_data.filter(raw_data.time>-1)
#convert rows of strings to doubles for kmeans:
data=raw_data_filtered.select([col(c).cast("double") for c in raw_data_filtered.columns])
cols = data.columns
#Merge data frame with column called features, that contains all data as a vector in each row
vectorAss = VectorAssembler(inputCols=cols, outputCol="features")
vdf=vectorAss.transform(data)
kmeans = KMeans(k=2, maxIter=10, seed=1)
model = kmeans.fit(vdf)
and the rest is history. I haven't followed best practices here. We could maybe drop some columns that we don't need from the vdf DataFrame to save space and improve performance, but this works.
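If it helps, here is a quick sketch of inspecting the fitted model, continuing from the code above:
# Cluster centres found by KMeans
for centre in model.clusterCenters():
    print(centre)

# transform() appends a "prediction" column with each row's assigned cluster
clustered = model.transform(vdf)
clustered.select("features", "prediction").show(5)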
In the lab that I work in, we process a lot of data produced by a 96 well plate reader. I'm trying to write a script that will perform a few calculations and output a bar graph using matplotlib.
The problem is that the plate reader outputs data into a .xlsx file. I understand that some modules like pandas have a read_excel function; can you explain how I should go about reading the Excel file and putting it into a dataframe?
Thanks
Data sample of a 24 well plate (for simplicity):
0.0868 0.0910 0.0912 0.0929 0.1082 0.1350
0.0466 0.0499 0.0367 0.0445 0.0480 0.0615
0.6998 0.8476 0.9605 0.0429 1.1092 0.0644
0.0970 0.0931 0.1090 0.1002 0.1265 0.1455
I'm not exactly sure what you mean when you say array, but if you mean a matrix, you might be looking for:
import pandas as pd
df = pd.read_excel([path here])
df.to_numpy()  # .as_matrix() was removed in pandas 1.0; .to_numpy() is its replacement
This returns a numpy.ndarray type.
This task is super easy in Pandas these days.
import pandas as pd
df = pd.read_excel('file_name_here.xlsx', sheet_name='Sheet1')
or
df = pd.read_csv('file_name_here.csv')
This returns a pandas.DataFrame object which is very powerful for performing operations by column, row, over an entire df, or over individual items with iterrows. Not to mention slicing in different ways.
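For example, a quick sketch of the per-column and per-row operations mentioned above (the "total" column is made up):
import pandas as pd

df = pd.read_excel('file_name_here.xlsx', sheet_name='Sheet1')

col_means = df.mean(numeric_only=True)  # aggregate each numeric column
df["total"] = df.sum(axis=1, numeric_only=True)  # compute a value per row

# Row-by-row access when you really need it (slower than vectorised operations)
for idx, row in df.iterrows():
    print(idx, row["total"])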
There is an awesome xlrd package, with a quick start example here.
You can just google it to find code snippets. I have never used pandas' read_excel function, but xlrd covers all my needs, and can offer even more, I believe.
You could also try it with my wrapper library, which uses xlrd as well:
import pyexcel as pe # pip install pyexcel
import pyexcel.ext.xls # pip install pyexcel-xls
your_matrix = pe.get_array(file_name=path_here) # done