I read a census ACS file into an IPython Notebook in chunks using:
pusb = pd.read_csv('ss14pusb.csv', low_memory=False, chunksize = 25000)
Then I selected some columns I want to keep and use for analysis. Now I want to export pusb to a txt or csv file, but `pusb.to_csv(...)` didn't work. How do I do this? Is there a way to concatenate the chunks once I read them so that they're one DataFrame?
Thanks in advance!
You can try function concat:
pusb = pd.read_csv('ss14pusb.csv', low_memory=False, chunksize = 25000)
print(pusb)
#<pandas.io.parsers.TextFileReader object at 0x00000000150E0048>
df = pd.concat(pusb, ignore_index=True)
I think it is necessary to add the parameter ignore_index to concat, to avoid duplicate indexes.
Let me try to explain it better:
pusb = pd.read_csv('ss14pusb.csv', low_memory=False, chunksize = 25000)
read_csv with chunksize reads the file in chunks (see the docs), and the output is a TextFileReader, not a DataFrame.
You can check this iterable object with:
for chunk in pusb:
    print(chunk)
And then you need to concatenate the chunks into one big DataFrame - use concat.
See the docs on concatenating objects.
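Putting it together for your case, here is a minimal sketch; the column names in keep_cols are only placeholders, so swap in the ACS variables you actually selected:
import pandas as pd

keep_cols = ['SERIALNO', 'AGEP', 'PINCP']  # placeholder column names - replace with yours

chunks = []
for chunk in pd.read_csv('ss14pusb.csv', low_memory=False, chunksize=25000):
    chunks.append(chunk[keep_cols])          # keep only the columns you need from each chunk

pusb = pd.concat(chunks, ignore_index=True)  # one DataFrame from all chunks
pusb.to_csv('pusb_subset.csv', index=False)  # now to_csv works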
I'm trying to reduce the size of a dataframe I have.
The source data is gzipped JSON and approx. 20 GB in s3 and each line looks like this:
{"timestamp":"1633121635","name":"www.hello.com","type":"a","value":"ipv4:1.1.1.1"}
I'm using pandas.read_json to read it in chunks and then drop the keys I don't want, e.g.
for df in pd.read_json(s3_source_data_location,
                       lines=True,
                       chunksize=20000000):
    df.drop('timestamp', axis=1, inplace=True)
    df.drop('type', axis=1, inplace=True)
I know I can try reducing the size by fiddling around with the datatypes for 'value' and 'name' but I want to first see if I can only read the keys I want.
pandas.read_csv has a 'usecols' argument where you can specify the columns you want to read. I'm hoping there is a way to do the same with JSON.
Looking at the structure of the JSON, you only want to keep name and value. To do that and reduce the computing time, first create a list of the columns you want to keep in the DataFrame:
cols = ['name', 'value']
Then create an empty list and define your file path:
data = []
file_name = 'path-of-json-file.json'
Then open the file and read only those keys line by line:
import json
import pandas as pd

with open(file_name, encoding='latin-1') as f:
    for line in f:
        doc = json.loads(line)             # parse one JSON record
        lst = [doc['name'], doc['value']]  # keep only the keys you need
        data.append(lst)

df = pd.DataFrame(data=data, columns=cols)
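If you would rather stay with pandas' chunked reader, read_json has no usecols argument, but you can slice each chunk down to the columns you want as soon as it is parsed and concatenate the slimmed chunks. A rough sketch, with the file path and chunk size as placeholders:
import pandas as pd

cols = ['name', 'value']
parts = []
# the path and chunksize are placeholders - point this at your gzipped, line-delimited JSON
for chunk in pd.read_json('records.json.gz', lines=True, chunksize=100000):
    parts.append(chunk[cols])  # drop every other column right away

df = pd.concat(parts, ignore_index=True)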
I am trying to save a csv to a folder after making some edits to the file.
Every time I use pd.to_csv('C:/Path of file.csv'), the csv file has a separate index column. I want to avoid writing the index to the csv.
I tried:
pd.read_csv('C:/Path to file to edit.csv', index_col = False)
And to save the file...
pd.to_csv('C:/Path to save edited file.csv', index_col = False)
However, I still got the unwanted index column. How can I avoid this when I save my files?
Use index=False.
df.to_csv('your.csv', index=False)
There are two ways to handle the situation where we do not want the index to be stored in the csv file.
As others have stated, you can use index=False while saving your dataframe to the csv file:
df.to_csv('file_name.csv', index=False)
Or you can save your dataframe as it is, with an index, and while reading it back just drop the 'Unnamed: 0' column containing your previous index. Simple!
df.to_csv('file_name.csv')
df_new = pd.read_csv('file_name.csv').drop(['Unnamed: 0'], axis=1)
If you don't want the index, read the file using:
import pandas as pd
df = pd.read_csv('file.csv', index_col=0)
save it using
df.to_csv('file.csv', index=False)
As others have stated, if you don't want to save the index column in the first place, you can use df.to_csv('processed.csv', index=False)
However, since the data you usually work with has some sort of index of its own, say a 'timestamp' column, I would keep the index and load the data using it.
So, to save the indexed data, first set the index and then save the DataFrame:
df = df.set_index('timestamp')
df.to_csv('processed.csv')
Afterwards, you can either read the data with the index:
df = pd.read_csv('processed.csv', index_col='timestamp')
or read the data and then set the index:
df = pd.read_csv('filename.csv')
df = df.set_index('column_name')
Another solution, if you want to keep this column as the index:
pd.read_csv('filename.csv', index_col='Unnamed: 0')
If you want a clean output, the following statement works well:
dataframe_prediction.to_csv('filename.csv', sep=',', encoding='utf-8', index=False)
In this case you get a csv file with ',' as the separator between columns and utf-8 encoding.
In addition, the numerical index won't appear.
I have a large csv file, about 600 MB with 11 million rows, and I want to create statistical outputs like pivots, histograms, graphs, etc. Obviously, just trying to read it normally:
df = pd.read_csv('Check400_900.csv', sep='\t')
doesn't work, so I found iterator and chunksize in a similar post and used:
df = pd.read_csv('Check1_900.csv', sep='\t', iterator=True, chunksize=1000)
All good, I can for example print df.get_chunk(5) and search the whole file with just:
for chunk in df:
    print(chunk)
My problem is that I don't know how to use things like the ones below on the whole df, rather than on just one chunk:
plt.plot()
print(df.head())
print(df.describe())
print(df.dtypes)
customer_group3 = df.groupby('UserID')
y3 = customer_group3.size()
Solution, if you need to create one big DataFrame and process all the data at once (which is possible, but not recommended):
Use concat to combine all the chunks into a DataFrame, because the output type of:
df = pd.read_csv('Check1_900.csv', sep='\t', iterator=True, chunksize=1000)
isn't a DataFrame, but a pandas.io.parsers.TextFileReader (source).
tp = pd.read_csv('Check1_900.csv', sep='\t', iterator=True, chunksize=1000)
print(tp)
#<pandas.io.parsers.TextFileReader object at 0x00000000150E0048>
df = pd.concat(tp, ignore_index=True)
I think it is necessary to add the parameter ignore_index to concat, to avoid duplicate indexes.
EDIT:
But if you want to work with large data, e.g. for aggregations, it is much better to use dask, because it provides advanced parallelism.
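As a rough illustration, a minimal dask sketch for the groupby from the question (assuming dask is installed; the file path and 'UserID' column are taken from the question):
import dask.dataframe as dd

# dask reads the csv lazily in partitions and parallelizes the groupby
ddf = dd.read_csv('Check1_900.csv', sep='\t')
y3 = ddf.groupby('UserID').size().compute()  # .compute() triggers the actual work
print(y3)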
You do not need concat here. It's exactly like writing sum(map(list, grouper(tup, 1000))) instead of list(tup). The only thing iterator and chunksize=1000 do is give you a reader object that iterates over 1000-row DataFrames instead of reading the whole thing. If you want the whole thing at once, just don't use those parameters.
But if reading the whole file into memory at once is too expensive (e.g., it takes so much memory that you get a MemoryError, or slows your system to a crawl by throwing it into swap hell), that's exactly what chunksize is for.
The problem is that you named the resulting iterator df, and then tried to use it as a DataFrame. It's not a DataFrame; it's an iterator that gives you 1000-row DataFrames one by one.
When you say this:
My problem is I don't know how to use stuff like these below for the whole df and not for just one chunk
The answer is that you can't. If you can't load the whole thing into one giant DataFrame, you can't use one giant DataFrame. You have to rewrite your code around chunks.
Instead of this:
df = pd.read_csv('Check1_900.csv', sep='\t', iterator=True, chunksize=1000)
print(df.dtypes)
customer_group3 = df.groupby('UserID')
… you have to do things like this:
for df in pd.read_csv('Check1_900.csv', sep='\t', iterator=True, chunksize=1000):
    print(df.dtypes)
    customer_group3 = df.groupby('UserID')
Often, what you need to do is aggregate some data—reduce each chunk down to something much smaller with only the parts you need. For example, if you want to sum the entire file by groups, you can groupby each chunk, then sum the chunk by groups, and store a series/array/list/dict of running totals for each group.
Of course it's slightly more complicated than just summing a giant series all at once, but there's no way around that. (Except to buy more RAM and/or switch to 64 bits.) That's how iterator and chunksize solve the problem: by allowing you to make this tradeoff when you need to.
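As a rough sketch of that pattern (the file path and the 'UserID' column come from the question), you could accumulate per-group counts chunk by chunk and combine them at the end:
import pandas as pd

totals = pd.Series(dtype='int64')            # running per-group counts across all chunks
for chunk in pd.read_csv('Check1_900.csv', sep='\t', chunksize=1000):
    counts = chunk.groupby('UserID').size()  # reduce this chunk to per-group counts
    totals = totals.add(counts, fill_value=0)

print(totals.astype('int64'))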
You need to concatenate the chunks. For example:
df2 = pd.concat([chunk for chunk in df])
And then run your commands on df2
This might not answer the question directly, but when you have to load a big dataset it is good practice to convert the dtypes of your columns while reading it. Also, if you know which columns you need, use the usecols argument to load only those.
df = pd.read_csv("data.csv",
usecols=['A', 'B', 'C', 'Date'],
dtype={'A':'uint32',
'B':'uint8',
'C':'uint8'
},
parse_dates=['Date'], # convert to datetime64
sep='\t'
)
I have 100 XLS files that I would like to combine into a single CSV file. Is there a way to improve the speed of combining them all together?
The issue with using concat is that it lacks the arguments that to_csv affords me:
listOfFiles = glob.glob(file_location)
frame = pd.DataFrame()
for idx, a_file in enumerate(listOfFiles):
    print(a_file)
    data = pd.read_excel(a_file, sheet_name=0, skiprows=range(1, 2), header=1)
    frame = frame.append(data)

# Save to CSV..
print(frame.info())
frame.to_csv(output_dir, index=False, encoding='utf-8', date_format="%Y-%m-%d")
Using multiprocessing, you could read them in parallel using something like:
import multiprocessing
import pandas as pd
dfs = multiprocessing.Pool().map(pd.read_excel, f_names)
and then concatenate them into a single one:
df = pd.concat(dfs)
You should probably check whether the first part is at all faster than
dfs = map(pd.read_excel, f_names)
YMMV - it depends on the files, the disks, etc.
It'd be more performant to read them into a list and then call concat:
merged = pd.concat(df_list)
so something like
df_list = []
for f in xl_list:
    df_list.append(pd.read_csv(f))  # or read_excel
merged = pd.concat(df_list)
The problem with repeatedly appending to a DataFrame is that memory has to be allocated to fit the new size and the contents copied each time, when you really only want to do this once.
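Putting that together for your case, here is a minimal sketch; file_location and output_path are placeholders for your own paths, and the read_excel arguments are copied from your loop:
import glob
import pandas as pd

# file_location and output_path are placeholders - use your own paths
xl_files = glob.glob(file_location)
df_list = [pd.read_excel(f, sheet_name=0, skiprows=range(1, 2), header=1) for f in xl_files]

merged = pd.concat(df_list, ignore_index=True)  # a single concat instead of repeated appends
merged.to_csv(output_path, index=False, encoding='utf-8', date_format="%Y-%m-%d")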