Splitting CSV files in pandas in Python

I am trying to load a specific column in pandas, but it keeps printing the name of the column and it also skips the first part.
Can anyone help me?
This is the code I am using:
import pandas as pd
pd.set_option('display.max_colwidth', -1)
df_iter = pd.read_csv('tweets.csv', chunksize=10000, iterator=True, usecols=["text"])
df_iter = df_iter[1:]
for iter_num in enumerate(df_iter, -1):
    for line in df_iter:
        print(line)

Firstly, since you are reading the CSV in chunks, I would assume that the file is very large. You need to loop through those chunks to read all of the file's data, and then you can merge/concatenate those chunks.
Secondly, enumerate() is not meant for DataFrames; you need iterrows().
Something like this:
import pandas as pd
pd.set_option('display.max_colwidth', -1)
df_iter = pd.read_csv('tweets.csv', chunksize=10000, iterator=True, usecols=["text"])
df_records = []  # list to collect the chunks
for chunk in df_iter:
    df_records.append(chunk)

# Concatenate all chunks into one DataFrame and iterate over its rows
df_new = pd.concat(df_records)
for iter_num, value in df_new.iterrows():
    print(value[0])
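If the goal is only to print the text column, a variation that avoids holding the whole file in memory is sketched below; it simply streams the chunks and prints the values of the "text" column (the column name and chunksize are taken from the question above).
import pandas as pd

# Stream the file chunk by chunk; the full CSV is never held in memory at once
for chunk in pd.read_csv('tweets.csv', chunksize=10000, usecols=["text"]):
    for value in chunk["text"]:
        print(value)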

Related

Python Pandas: read_csv with chunksize and concat still throws MemoryError

I am trying to extract certain rows from a 10 GB, ~35 million row CSV file into a new CSV based on a condition (the value of a column: Geography = Ontario). It runs for a few minutes, I can see my free hard drive space getting drained from 14 GB to basically zero, and then I get the MemoryError. I thought chunksize would help here, but it did not :( Please advise.
import pandas as pd
df = pd.read_csv("Data.csv", chunksize = 10000)
result = pd.concat(df)
output=result[result['Geography']=='Ontario']
rowcount=len(output)
print(output)
print(rowcount)
output.to_csv('data2.csv')
You can try writing in chunks. Roughly:
df = pd.read_csv("Data.csv", chunksize=10000)
header = True
for chunk in df:
    # Filter each chunk and append the matching rows to the output file
    chunk = chunk[chunk['Geography'] == 'Ontario']
    chunk.to_csv(outfilename, header=header, mode='a')
    header = False  # write the header only once
Idea from here.
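Since the question also prints a row count, a small extension of the same pattern (a sketch, untested, reusing the column name and the data2.csv output path from the question) can keep a running total while writing:
import pandas as pd

outfilename = 'data2.csv'
rowcount = 0
header = True
for chunk in pd.read_csv("Data.csv", chunksize=10000):
    # Keep only the Ontario rows, count them, and append them to the output file
    matches = chunk[chunk['Geography'] == 'Ontario']
    rowcount += len(matches)
    matches.to_csv(outfilename, header=header, index=False, mode='a')
    header = False
print(rowcount)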

what is an efficient way to load and aggregate a large .bz2 file into pandas?

I'm trying to load a large bz2 file in chunks and aggregate it into a pandas DataFrame, but Python keeps crashing. The methodology I'm using is below; I've had success with it on smaller datasets. What is a more efficient way to aggregate larger-than-memory files into pandas?
Data is line delimited json compressed to bz2, taken from https://files.pushshift.io/reddit/comments/ (all publicly available reddit comments).
import pandas as pd

reader = pd.read_json('RC_2017-09.bz2', compression='bz2', lines=True, chunksize=100000)
df = pd.DataFrame()
for chunk in reader:
    # Count of comments in each subreddit
    count = chunk.groupby('subreddit').size()
    df = pd.concat([df, count], axis=0)

df = df.groupby(df.index).sum()
reader.close()
EDIT: Python crashed when I used chunksize 1e5. The script worked when I increased chunksize to 1e6.
I used this iterator method, which worked for me without a memory error. You can try it.
import pandas as pd

chunksize = 10 ** 6
cols = ['a', 'b', 'c', 'd']
iter_csv = pd.read_csv('filename.bz2', compression='bz2', delimiter='\t', usecols=cols,
                       low_memory=False, iterator=True, chunksize=chunksize, encoding="utf-8")
# replace the filter below with the group-by work you need per chunk
df = pd.concat([chunk[chunk['b'] == 1012] for chunk in iter_csv])
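The commented line marks where the per-chunk work goes. Applied to the original question (counting comments per subreddit), it could look roughly like the sketch below; it reuses the read_json call from the question and the chunksize of 1e6 that reportedly worked, and is untested against the actual dump.
import pandas as pd

chunksize = 10 ** 6
reader = pd.read_json('RC_2017-09.bz2', compression='bz2', lines=True, chunksize=chunksize)

# Aggregate each chunk separately, then combine the partial counts at the end
partial_counts = []
for chunk in reader:
    partial_counts.append(chunk.groupby('subreddit').size())

counts = pd.concat(partial_counts).groupby(level=0).sum()
print(counts.sort_values(ascending=False).head())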

How to merge two csv files using multiprocessing with python pandas

I want to merge two CSV files on a common column using Python pandas.
With a 32-bit processor, it throws a memory error after about 2 GB of memory is used.
How can I do the same with multiprocessing or any other method?
import gc
import pandas as pd
csv1_chunk = pd.read_csv('/home/subin/Desktop/a.txt',dtype=str, iterator=True, chunksize=1000)
csv1 = pd.concat(csv1_chunk, axis=1, ignore_index=True)
csv2_chunk = pd.read_csv('/home/subin/Desktop/b.txt',dtype=str, iterator=True, chunksize=1000)
csv2 = pd.concat(csv2_chunk, axis=1, ignore_index=True)
new_df = csv1[csv1["PROFILE_MSISDN"].isin(csv2["L_MSISDN"])]
new_df.to_csv("/home/subin/Desktop/apyb.txt", index=False)
gc.collect()
Please help me to fix this.
Thanks in advance.
I think you only need one column from your second file (actually, only unique elements from this column are needed), so there is no need to load the whole data frame.
import pandas as pd
csv2 = pd.read_csv('/home/subin/Desktop/b.txt', usecols=['L_MSISDN'])
unique_msidns = set(csv2['L_MSISDN'])
If this still gives a memory error, try doing this in chunks:
chunk_reader = pd.read_csv('/home/subin/Desktop/b.txt', usecols=['L_MSISDN'], chunksize=1000)
unique_msidns = set()
for chunk in chunk_reader:
    unique_msidns = unique_msidns | set(chunk['L_MSISDN'])
Now, we can deal with the first data frame.
chunk_reader = pd.read_csv('/home/subin/Desktop/a.txt', chunksize=1000)
for chunk in chunk_reader:
    bool_idx = chunk['PROFILE_MSISDN'].isin(unique_msidns)
    # *append* selected lines from every chunk to a file (mode='a')
    # col names are not written
    chunk[bool_idx].to_csv('output_file', header=False, index=False, mode='a')
If you need column names to be written into the output file, you can do it with the first chunk (I've skipped it for code clarity).
I believe it's safe (and probably faster) to increase chunksize.
I didn't test this code, so be sure to double check it.
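For the header note above, one way (also a rough, untested sketch like the rest of the answer, reusing the unique_msidns set built earlier) is to let only the first chunk write the column names:
chunk_reader = pd.read_csv('/home/subin/Desktop/a.txt', chunksize=1000)
first_chunk = True
for chunk in chunk_reader:
    bool_idx = chunk['PROFILE_MSISDN'].isin(unique_msidns)
    # only the first chunk writes the header row; later chunks append data only
    chunk[bool_idx].to_csv('output_file', header=first_chunk, index=False, mode='a')
    first_chunk = False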

How to subtract one column's previous-row value from the current row in a CSV file using Python

I have CSV file with data like
data,data,10.00
data,data,11.00
data,data,12.00
I need to update this as
data,data,10.00
data,data,11.00,1.00(11.00-10.00)
data,data,12.30,1.30(12.30-11.00)
Could you help me update the CSV file using Python?
You can use pandas and numpy. pandas reads/writes the csv and numpy does the calculations:
import pandas as pd
import numpy as np
data = pd.read_csv('test.csv', header=None)
col_data = data[2].values
diff = np.diff(col_data)
diff = np.insert(diff, 0, 0)
data['diff'] = diff
# write data to file
data.to_csv('test1.csv', header=False, index=False)
When you open test1.csv, you will find the correct results as described above, with the addition of a zero next to the first data point.
For more info see the following docs:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
http://pandas.pydata.org/pandas-docs/version/0.18.1/generated/pandas.DataFrame.to_csv.html
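As a side note, the same running difference can be computed with pandas alone via Series.diff(), which avoids the explicit numpy insert. A minimal sketch against the same test.csv:
import pandas as pd

data = pd.read_csv('test.csv', header=None)
# diff() leaves NaN in the first row; fill it with 0 to match the numpy version above
data['diff'] = data[2].diff().fillna(0)
data.to_csv('test1.csv', header=False, index=False)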

Not reading all rows while importing csv into pandas dataframe

I am trying the kaggle challenge here, and unfortunately I am stuck at a very basic step.
I am trying to read the datasets into a pandas dataframe by executing following command:
test = pd.DataFrame.from_csv("C:/Name/DataMining/hillary/data/output/emails.csv")
The problem is that this file, as you will find out, has over 300,000 records, but I am reading only 7945.
print (test.shape)
(7945, 21)
Now I have double checked the file and I cannot find anything special about line number 7945. Any pointers why this could be happening?
I think it is better to use the read_csv function with the parameters quoting=csv.QUOTE_NONE and error_bad_lines=False.
import pandas as pd
import csv
test = pd.read_csv("output/Emails.csv", quoting=csv.QUOTE_NONE, error_bad_lines=False)
print (test.shape)
#(381422, 22)
But some (problematic) data will be skipped.
If you want to skip the email body data, you can use:
import pandas as pd
import csv
test = pd.read_csv(
    "output/Emails.csv",
    quoting=csv.QUOTE_NONE,
    sep=',',
    error_bad_lines=False,
    header=None,
    names=[
        "Id", "DocNumber", "MetadataSubject", "MetadataTo", "MetadataFrom",
        "SenderPersonId", "MetadataDateSent", "MetadataDateReleased",
        "MetadataPdfLink", "MetadataCaseNumber", "MetadataDocumentClass",
        "ExtractedSubject", "ExtractedTo", "ExtractedFrom", "ExtractedCc",
        "ExtractedDateSent", "ExtractedCaseNumber", "ExtractedDocNumber",
        "ExtractedDateReleased", "ExtractedReleaseInPartOrFull",
        "ExtractedBodyText", "RawText"])
print (test.shape)
#delete row with NaN in column MetadataFrom
test = test.dropna(subset=['MetadataFrom'])
#delete headers in data
test = test[test.MetadataFrom != 'MetadataFrom']
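On recent pandas versions (1.3 and later), error_bad_lines is deprecated in favour of on_bad_lines, so the first call above would become roughly:
import pandas as pd
import csv

# on_bad_lines='skip' replaces the deprecated error_bad_lines=False
test = pd.read_csv("output/Emails.csv", quoting=csv.QUOTE_NONE, on_bad_lines='skip')
print(test.shape)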
