How do I import datasets in Python? - python

I am trying to import some datasets in my code. I need help, because I have tried a lot of tutorials and web pages and I am still getting errors. I use the Spyder IDE and Python 3.7:
import numpy as np
import pandas as pd
import tensorflow as tf
import os
dts1=pd.read_csv(r"C:\Users\Cucu\Desktop\sample_submission.csv")
dts1

This works for me. If you are still experiencing errors, please post them.
import pandas as pd
# Read data from the file 'sample_submission.csv'
# (a bare filename would be resolved relative to the directory your
# Python process runs in; here an absolute path is used instead)
# Control delimiters, rows, and column names with read_csv's options
data = pd.read_csv(r"C:\Users\Cucu\Desktop\sample_submission.csv")
# Preview the first 5 lines of the loaded data
print(data.head())

Try using other approaches:
pd.read_csv("C:\\Users\\Cucu\\Desktop\\sample_submission.csv")
pd.read_csv("C:/Users/Cucu/Desktop/sample_submission.csv")

Related

Is there a way to download a sample CSV file

I used a sample of a CSV file to build some tables in a Jupyter notebook. I now need to download that sample CSV file so I can look at it in Excel. Is there a way I can download the sample?
I need to download lf if possible.
Here is my code:
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
import io
import requests
df = pd.read_csv("diamonds.csv")
lf = df.sample(5000, random_state=999)
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
plt.style.use("seaborn")
lf.sample(5000, random_state=999)
The sample returned by df.sample is already a DataFrame, so you can export it directly:
lf.to_csv("file_name.csv")
Let me know if it works.
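Putting it together with the code from the question, a minimal sketch (index=False keeps the pandas row index out of the file; the output name is just an example):
import pandas as pd
df = pd.read_csv("diamonds.csv")
lf = df.sample(5000, random_state=999)
# write the sample next to the notebook so Excel can open it
lf.to_csv("diamonds_sample.csv", index=False)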
Answer from here:
import urllib.request
urllib.request.urlretrieve("http://jupyter.com/diamond.csv", "diamond.csv")
If what you mean by download is exporting the dataframe to a spreadsheet format, pandas has functions for that:
import pandas as pd
df = pd.read_csv("diamond.csv")
# do your stuff
df.to_csv("diamond2.csv") # if you want to export to csv with different name
df.to_csv("folder/diamond2.csv") # if you want to export to csv inside existed folder
df.to_excel("diamond2.xlsx") # if you want to export to excel
The file will appear in the same directory as your Jupyter notebook.
You can also specify the directory:
df.to_csv('D:/folder/diamond.csv')
To check your current working directory, you can use
import os
print(os.getcwd())

Is it possible to switch between similar libraries (data analysis) in Python within the same code?

I use the modin library for multiprocessing.
While the library is great for faster processing, it fails at merge, and I would like to revert to default pandas in between the code.
I understand that, per the PEP 8 E402 convention, imports should be declared once at the top of the code; however, my case would need otherwise.
import pandas as pd
import modin.pandas as mpd
import os
import ray
ray.init()
os.environ["MODIN_ENGINE"] = "ray"
df = mpd.read_csv()
# do stuff
Then I would like to revert to default pandas within the same code
But how would I do the below in pandas? There does not seem to be a clear way to switch between pd and mpd in the lines below, and unfortunately modin seems to take precedence over pandas.
df = df.loc[:, df.columns.intersection(['col1', 'col2'])]
df = df.drop_duplicates()
df = df.sort_values(['col1', 'col2'], ascending=[True, True])
Is it possible?
if yes, how?
You can simply do the following:
import modin.pandas as mpd
import pandas as pd
This way you have both modin and the original pandas in memory, and you can switch between them as needed.
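For instance, keeping both aliases lets each call site pick the implementation; a small sketch with hypothetical file names:
import modin.pandas as mpd
import pandas as pd
big = mpd.read_csv("big_input.csv") # parallel read via Modin
small = pd.read_csv("lookup.csv") # plain pandas for the small table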
Since many have posted answers: in this particular case, as pointed out by @Nin17 and this comment from the Modin GitHub, to convert from Modin to pandas for single-core processing of operations like df.merge you can use
import pandas as pd
import modin.pandas as mpd
import os
import ray
ray.init()
os.environ["MODIN_ENGINE"] = "ray"
df_modin = mpd.read_csv() # reading the dataframe into Modin for parallel processing
df_pandas = df_modin._to_pandas() # converting the Modin DataFrame into pandas for single-core processing
and if you would like to reconvert the dataframe to a modin dataframe for parallel processing
df_modin = mpd.DataFrame(df_pandas)
You can try the pandarallel package instead of modin; it is based on a similar concept: https://pypi.org/project/pandarallel/#description
Pandarallel Benchmarks : https://libraries.io/pypi/pandarallel
As @Nin17 said in a comment on the question, this comment from the Modin GitHub describes how to convert a Modin dataframe to pandas. Once you have a pandas dataframe, you can call any pandas method on it. This other comment from the same issue describes how to convert the pandas dataframe back to a Modin dataframe.
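A minimal sketch of that round trip around a merge, with hypothetical file names and a hypothetical key column 'col1' (note that _to_pandas() is a private Modin method and may change between versions):
import modin.pandas as mpd
df1 = mpd.read_csv("left.csv")
df2 = mpd.read_csv("right.csv")
# drop to plain pandas for the operation that misbehaves under Modin
merged = df1._to_pandas().merge(df2._to_pandas(), on="col1")
# hand the result back to Modin for the rest of the parallel pipeline
df_modin = mpd.DataFrame(merged)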

Is the dataset "faithful" preloaded by default in some library?

When I write and run the following code, everything works fine, but I have a doubt I hope someone can confirm for me:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
import pandas as pd
import seaborn as sns
from pydataset import data
sns.set_palette("deep", desat=.6)
sns.set_context(rc={"figure.figsize": (8, 4)})
faithful = data('faithful')
faithful.head(10)
All works fine. But in the penultimate line above, I have not loaded or copied the dataset 'faithful', nor have I linked to a URL to access that data. However, it runs and reads all the data. I must assume that this dataset is included by default in some library? Which one? Where is it located? How can I verify this? Any command? Thanks!
You are importing the built-in datasets from the pydataset module when you run your 7th line:
from pydataset import data
If you run the data() command, you will see all the 750+ datasets contained in this module. The 'faithful' data is among them.
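A quick way to verify this yourself; a sketch assuming data() with no arguments returns the catalogue as a DataFrame with a dataset_id column, which is how pydataset ships it:
from pydataset import data
catalogue = data() # DataFrame listing every bundled dataset
print(len(catalogue)) # 750+ entries
print(catalogue[catalogue["dataset_id"] == "faithful"])
faithful = data("faithful") # loads the Old Faithful eruptions data
print(faithful.head())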

python/pandas "Kernel died, restarting" while loading a csv file

While trying to load a big CSV file (150 MB), I get the error "Kernel died, restarting". The only code that I use is the following:
import pandas as pd
from pprint import pprint
from pathlib import Path
from datetime import date
import numpy as np
import matplotlib.pyplot as plt
basedaily = pd.read_csv('combined_csv.csv')
It used to work before, but I do not know why it is not working anymore. I tried to fix it using engine="python" as follows:
basedaily = pd.read_csv('combined_csv.csv', engine='python')
But it gives me an "execution aborted" error.
Any help would be welcome!
Thanks in advance!
You may be getting this error because of a lack of memory. You can split your data into many data frames, do your work, and then re-merge them; below is some useful code that you may use:
import pandas as pd
# the number of row in each data frame
# you can put any value here according to your situation
chunksize = 1000
# the list that contains all the dataframes
list_of_dataframes = []
for df in pd.read_csv('combined_csv.csv', chunksize=chunksize):
    # process your data frame here
    # then add the current data frame into the list
    list_of_dataframes.append(df)
# if you want all the dataframes together, here it is
result = pd.concat(list_of_dataframes)
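Note that concatenating every chunk back together still needs enough memory for the full frame. If only some columns matter, read_csv's usecols and dtype options cut memory per row; a sketch with hypothetical column names:
import pandas as pd
# load only the columns you actually need, with compact dtypes
# ('date', 'price', 'qty' are hypothetical column names)
basedaily = pd.read_csv(
    "combined_csv.csv",
    usecols=["date", "price", "qty"],
    dtype={"price": "float32", "qty": "int32"},
    parse_dates=["date"],
)
print(basedaily.info(memory_usage="deep"))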

Read SAS data into Python

I am trying to read a SAS dataset using Python, but it is showing an error:
"IndexError list assignment index out of range"
I am not sure what the reason could be. Can anyone help me out?
Following is the code where I am trying to read the SAS data (which has multiple millions of rows) into Python:
import pandas as pd
import numpy as np
from sas7bdat import SAS7BDAT
with SAS7BDAT('/dat_xyz/mdpqr/data_test.sas7bdat') as m:
    mdata = m.to_data_frame()
Let me know the solution.
Thanks,
Surya
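One thing worth trying: pandas ships its own SAS reader, and with several million rows, reading in chunks keeps memory bounded. A minimal sketch, assuming the file is a standard sas7bdat (the chunk size is arbitrary):
import pandas as pd
chunks = []
# chunksize makes read_sas return an iterator instead of one huge frame
for chunk in pd.read_sas("/dat_xyz/mdpqr/data_test.sas7bdat", chunksize=100_000):
    chunks.append(chunk) # or filter/aggregate here to save memory
mdata = pd.concat(chunks, ignore_index=True)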
