I want to read a CSV file that has more than 200M rows, where the data looks like this:
id_1    date        id_2
Hf23R   01-01-2005  M9R34
There are no nulls and no special characters in the data. I tried reading it the basic way, but it crashes my computer every time I run the code, so I can't do any analysis on it. Is there a way to read it efficiently with pandas?
Here is how I read it:
import pandas as pd
path = "data.csv"
df = pd.read_csv(path)
I used Dask and it works perfectly:
import dask.dataframe as dd
df = dd.read_csv('data.csv')
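If adding a dependency isn't an option, pandas itself can stream the file with the chunksize argument, so the whole 200M rows never sit in memory at once. A minimal sketch, using an in-memory stand-in for the real file (the column names come from the sample above; the second data row is made up):

```python
import io

import pandas as pd

# In-memory stand-in for the huge file; pass the real path in practice
data = io.StringIO(
    "id_1,date,id_2\n"
    "Hf23R,01-01-2005,M9R34\n"
    "Zq9Lp,02-01-2005,K4T77\n"
)

# Each iteration yields a DataFrame of at most `chunksize` rows
total_rows = 0
for chunk in pd.read_csv(data, chunksize=1):
    total_rows += len(chunk)

print(total_rows)  # 2
```

With the real file you would use a much larger chunksize (say, a few hundred thousand rows) and aggregate or filter inside the loop.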
I am attempting to read a CSV via Pandas. The code works fine but the result does not properly separate the data. Below is my code:
df = pd.read_csv('data.csv', encoding='utf-16', sep='\\', error_bad_lines=False)
df.loc[:3]
When I run this the output looks something like this:
Anything I can do to adjust this? All help is appreciated!
Just use \t as the sep argument when reading the CSV file:
import pandas as pd
import io
data="""id\tname\temail
1\tJohn\tjohn#example.com
2\tJoe\tjoe#example.com
"""
df = pd.read_csv(io.StringIO(data), sep="\t")
   id  name             email
0   1  John  john#example.com
1   2   Joe   joe#example.com
You don't need io; it's just for the example.
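One caveat about the question's code: error_bad_lines was deprecated in pandas 1.3 and removed in 2.0; the replacement is on_bad_lines. A small sketch of the newer argument (the sample data is made up):

```python
import io

import pandas as pd

# The third line has an extra field, so it will be dropped
data = "id\tname\n1\tJohn\nbad\tline\textra\n2\tJoe\n"

# on_bad_lines='skip' replaces the old error_bad_lines=False
df = pd.read_csv(io.StringIO(data), sep="\t", on_bad_lines="skip")
print(df["name"].tolist())  # ['John', 'Joe']
```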
I have a lot of gzip files which I need to extract. The file name looks like this -
FGT6HD3917800515[root].2020-07-03-13-20-35.tlog.1593759574.csv
All these files have a single CSV file each. I want to read the contents of these CSV files in a dataframe in Python. The data in CSV looks like this -
NTP 1593759574 accept unscanned India port10 1x.1xx.xx.xxx 123 1593779419 181 17 India portxx 1xx.xxx.1xx.1xx 42338 1xx.1xx.xxx.xx 123 1xx.1xx.xxx.x 42338
This is what I have tried -
import gzip
import pandas as pd
import numpy as np
import os
file_list = os.listdir(r'C:\Users\SAKSHI SHARMA\.spyder-py3\filter data')
print(file_list)
a = np.empty((0))
for i in file_list:
    with gzip.open(r'C:\Users\SAKSHI SHARMA\.spyder-py3\filter data/' + i) as f:  # why do I have to give /
        features_train = pd.read_csv(f)
        a = np.append(a, features_train)
        del features_train
final_data = pd.concat(a, axis=0, ignore_index=True)
print(final_data)
I get the following error: TypeError: cannot concatenate object of type '<class 'str'>'; only Series and DataFrame objs are valid.
Someone suggested incorporating Hadoop since I am working with ~40 GB of data. However, I still have a lot to learn in Python, and switching to new software like Hadoop would complicate things for me.
Can someone please help me out with how to read these zipped files and load their contents into a dataframe? Thanks!
Check out the Dask library, which reads many files into one df, as follows:
>>> import dask.dataframe as dd
>>> df = dd.read_csv('data*.csv')
Read their docs https://examples.dask.org/dataframes/01-data-access.html#Read-CSV-files
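If you'd rather stay with plain pandas, read_csv can decompress gzip itself, and collecting the frames in a list before a single concat avoids the numpy-array detour that caused the TypeError (np.append flattened the DataFrames into object arrays). A self-contained sketch, with throwaway temp files standing in for the real .tlog…csv files:

```python
import glob
import gzip
import os
import tempfile

import pandas as pd

# Two small gzipped CSVs standing in for the real files
tmp = tempfile.mkdtemp()
for name, rows in [("a.csv.gz", "1,NTP\n2,DNS\n"), ("b.csv.gz", "3,HTTP\n")]:
    with gzip.open(os.path.join(tmp, name), "wt") as f:
        f.write("id,proto\n" + rows)

# pandas handles the decompression; pass compression='gzip' explicitly
# when the filename doesn't end in .gz
frames = [pd.read_csv(path, compression="gzip")
          for path in sorted(glob.glob(os.path.join(tmp, "*.csv.gz")))]
final_data = pd.concat(frames, ignore_index=True)
print(final_data["proto"].tolist())  # ['NTP', 'DNS', 'HTTP']
```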
I have been able to generate several CSV files through an API. Now I am trying to combine them all into a single master file so that I can work on it, but it does not work. The code below is what I have attempted. What am I doing wrong?
import glob
import pandas as pd
from pandas import read_csv
master_df = pd.DataFrame()
for file in files:
    df = read_csv(file)
    master_df = pd.concat([master_df, df])
    del df
master_df.to_csv("./master_df.csv", index=False)
Although it is hard to tell what the precise problem is without more information (i.e., error message, pandas version), I believe it is that in the first iteration, master_df and df do not have the same columns. master_df is an empty DataFrame, whereas df has whatever columns are in your CSV. If this is indeed the problem, then I'd suggest storing all your data-frames (each of which represents one CSV file) in a single list, and then concatenating all of them. Like so:
import pandas as pd
df_list = [pd.read_csv(file) for file in files]
pd.concat(df_list, sort=False).to_csv("./master_df.csv", index=False)
Don't have time to find/generate a set of CSV files and test this right now, but am fairly sure this should do the job (assuming pandas version 0.23 or compatible).
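Another possibility worth checking: files is never defined in the code shown, which by itself would raise a NameError before any concatenation happens. glob is the usual way to build that list. A runnable sketch, with throwaway CSVs standing in for the API output:

```python
import glob
import os
import tempfile

import pandas as pd

# Throwaway CSVs standing in for the files the API produced
tmp = tempfile.mkdtemp()
pd.DataFrame({"x": [1, 2]}).to_csv(os.path.join(tmp, "a.csv"), index=False)
pd.DataFrame({"x": [3]}).to_csv(os.path.join(tmp, "b.csv"), index=False)

# Build the 'files' list explicitly, then concat once
files = sorted(glob.glob(os.path.join(tmp, "*.csv")))
master_df = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)
master_df.to_csv(os.path.join(tmp, "master_df.csv"), index=False)
print(master_df["x"].tolist())  # [1, 2, 3]
```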
I want to read a dataset from a file with pandas. The program reads it, but when I try to view the dataframe, this appears instead:
pandas.io.parsers.TextFileReader at 0x1b3b6b3e198
As additional information, the file is quite large (around 9 GB).
The file uses vertical bars as the separator, and I tried using chunksize but it doesn't work.
import pandas as pd
df = pd.read_csv(r"C:\Users\dguerr\Documents\files\Automotive\target_file", iterator=True, sep='|', chunksize=1000)
I want to import my data in the traditional pandas dataframe format.
You can load it chunk by chunk by doing:
import pandas as pd
path_to_file = "C:/Users/dguerr/Documents/Acxiom files/Automotive/auto_model_target_file"
chunk_size = 1000
for chunk in pd.read_csv(path_to_file, chunksize=chunk_size):
    # do your stuff
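To end up with one "traditional" dataframe, reduce each chunk (filter or aggregate) and then concatenate the pieces; concatenating all raw chunks of a 9 GB file would just recreate the memory problem. A sketch with a tiny pipe-separated stand-in (the column names are made up):

```python
import io

import pandas as pd

# Pipe-separated stand-in for the real 9 GB file
data = io.StringIO("a|b\n1|x\n2|y\n3|z\n")

# Keep only the rows you need from each chunk, then stitch together
pieces = [chunk[chunk["a"] > 1]
          for chunk in pd.read_csv(data, sep="|", chunksize=2)]
df = pd.concat(pieces, ignore_index=True)
print(df["a"].tolist())  # [2, 3]
```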
You might also want to check the file's encoding. pd.read_csv defaults to utf-8; if the file is actually latin-1, for instance, that mismatch can lead to errors:
import pandas as pd
df = pd.read_csv('C:/Users/dguerr/Documents/Acxiom files/Automotive/auto_model_target_file',
encoding='latin-1', chunksize=1000)
I have a large dataframe in a CSV file, sample1. From it I have to generate a new CSV file containing only the first 100 rows. I wrote code for it, but I am getting a KeyError: the label [100] is not in the index.
I have tried the below; any help would be appreciated.
import pandas as pd
data_frame = pd.read_csv("C:/users/raju/sample1.csv")
data_frame1 = data_frame[:100]
data_frame.to_csv("C:/users/raju/sample.csv")
The correct syntax is with iloc:
data_frame.iloc[:100]
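The distinction matters because .loc looks up index labels while .iloc uses positions; with a non-default index, the label 100 may simply not exist, which is exactly what produces this KeyError. A small illustration (the index values are made up):

```python
import pandas as pd

# A frame whose index labels are not 0..n-1
df = pd.DataFrame({"v": [10, 20, 30, 40, 50]}, index=[7, 8, 9, 10, 11])

# Positional: the first two rows, regardless of index labels
print(df.iloc[:2]["v"].tolist())  # [10, 20]
# df.loc[100], by contrast, would raise KeyError here,
# because no row carries the *label* 100
```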
A more efficient way is to use the nrows argument, whose purpose is exactly to read only a portion of a file. This way you avoid wasting resources and time parsing rows you don't need:
import pandas as pd
data_frame = pd.read_csv("C:/users/raju/sample1.csv", nrows=101) # 100+1 for header
data_frame.to_csv("C:/users/raju/sample.csv")