How to split a large excel file using Pandas? - python

I've tried the following (pd is pandas):
for i, chunk in pd.read_excel(os.path.join(INGEST_PATH,file), chunksize=5):
but I am getting this error:
NotImplementedError: chunksize keyword of read_excel is not implemented
I've tried searching for other methods, but most of them are for CSV files, not xlsx. I am also on pandas version 0.20.1.
Any help is appreciated.

import os
import numpy as np
import pandas as pd

df = pd.read_excel(os.path.join(INGEST_PATH, file))
# split the index into 5 roughly equal parts (np.array_split, unlike
# np.split, tolerates divisions that don't come out even)
idxes = np.array_split(df.index.values, 5)
chunks = [df.loc[idx] for idx in idxes]  # .ix is deprecated; use .loc
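If the goal is separate files rather than in-memory chunks, each chunk can be written straight back out. A minimal sketch (not part of the original answer; the output file names are illustrative):

for n, chunk in enumerate(chunks):
    chunk.to_excel('chunk_{}.xlsx'.format(n), index=False)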

The above solutions weren't working for me because the file wasn't being split properly and the last few rows were omitted; actually, it gave me an error saying something about unequal divisions.
So I wrote the following, which works for any file size.
import math
import pandas as pd

url_1 = r'C:/Users/t3734uk/Downloads/ML-GooGLECRASH/amp_ub/df2.csv'
target_folder = r'C:\Users\t3734uk\Downloads\ML-GooGLECRASH\amp_ub'
df = pd.read_csv(url_1)
rows, columns = df.shape

def calcRowRanges(_no_of_files):
    row_ranges = []
    interval_size = math.ceil(rows / _no_of_files)
    print('interval size is ----> {}'.format(interval_size))
    for n in range(_no_of_files):
        row_range = (n * interval_size, (n + 1) * interval_size)
        if row_range[1] > rows:
            # clamp the last range so no rows are dropped
            row_range = (n * interval_size, rows)
        row_ranges.append(row_range)
    return row_ranges

def splitFile(_df_, _row_ranges):
    for row_range in _row_ranges:
        _df = _df_[row_range[0]:row_range[1]]
        writer = pd.ExcelWriter('FILE_' + str(_row_ranges.index(row_range)) + '_' + '.xlsx')
        _df.to_excel(writer)
        writer.save()  # without save() the .xlsx is never written to disk

def dosplit(num_files):
    row_ranges = calcRowRanges(num_files)
    print(row_ranges)
    print(len(row_ranges))
    splitFile(df, row_ranges)

dosplit(enter_no_files_to_be_split_in)  # replace with the desired number of output files
On second thought, the following function is more intuitive:

def splitFile2(_df_, no_of_splits):
    _row_ranges = calcRowRanges(no_of_splits)
    for row_range in _row_ranges:
        _df = _df_[row_range[0]:row_range[1]]
        writer = pd.ExcelWriter('FILE_' + str(_row_ranges.index(row_range)) + '_' + '.xlsx')
        _df.to_excel(writer)
        writer.save()
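A quick usage example, assuming df was loaded as above:

splitFile2(df, 4)  # writes FILE_0_.xlsx through FILE_3_.xlsx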

Related

JSON File Parsing In Python Brings Different Line In Each Execution

I am trying to analyze a large dataset from Yelp. The data is in JSON format, but it is too large, so the script crashes when it tries to read all the data at once. So I decided to read it line by line and concatenate the lines into a dataframe to get a proper sample of the data.
f = open('./yelp_academic_dataset_review.json', encoding='utf-8')
I tried without encoding='utf-8', but that raises an error.
I created a function that reads the file line by line and builds a pandas dataframe up to a given number of lines.
Some lines are lists, and the script iterates over each list too and adds it to the dataframe.
import json
import pandas as pd

def json_parser(file, max_chunk):
    f = open(file)
    df = pd.DataFrame([])
    for i in range(2, max_chunk + 2):
        try:
            type(f.readlines(i)) == list
            for j in range(len(f.readlines(i))):
                part = json.loads(f.readlines(i)[j])
                df2 = pd.DataFrame(part.items()).T
                df2.columns = df2.iloc[0]
                df2 = df2.drop(0)
                datas = [df2, df]
                df2 = pd.concat(datas)
                df = df2
        except:
            f = open(file, encoding="utf-8")
            for j in range(len(f.readlines(i))):
                try:
                    part = json.loads(f.readlines(i)[j-1])
                except:
                    print(i, j)
                df2 = pd.DataFrame(part.items()).T
                df2.columns = df2.iloc[0]
                df2 = df2.drop(0)
                datas = [df2, df]
                df2 = pd.concat(datas)
                df = df2
    df2.reset_index(inplace=True, drop=True)
    return df2
But I am still getting a "list index out of range" error (yes, I used print to debug).
So I looked closer at the lines that cause this error. Very interestingly, when I try to look at those lines, the script gives me a different list each time.
Here is what I mean: I ran the cells repeatedly and got lists of different lengths. Looking at the lists themselves, they seem to be completely different lists; each run brings a different list even though the line number is the same. And the readlines documentation is not helping. What am I missing?
Thanks in advance.
You are using the expression f.readlines(i) several times as if it referred to the same set of lines each time.
But as a side effect of evaluating the expression, more lines are actually read from the file. At some point you are basing the indices j on more lines than are actually available, because they came from a different invocation of f.readlines.
You should call f.readlines(i) only once in each iteration of the for i in ... loop and store its result in a variable instead.
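A minimal sketch of that fix (my illustration, not the asker's code): read each batch exactly once, keep it in a variable, and build the dataframe at the end.

import json
import pandas as pd

def json_parser(file, max_chunk):
    rows = []
    with open(file, encoding='utf-8') as f:
        for i in range(2, max_chunk + 2):
            lines = f.readlines(i)  # call readlines once per iteration
            if not lines:
                break  # end of file
            for line in lines:
                rows.append(json.loads(line))
    return pd.DataFrame(rows)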

Stacking .csv files vertically with Pandas in Python

So I have been trying to merge .csv files with Pandas, and to write a couple of functions to automate it, but I keep having an issue.
My problem is that I want to stack one .csv after the other (same number of columns, different numbers of rows), but instead of getting a bigger csv with the same number of columns, I get a bigger csv with more columns and rows (the correct number of rows, but more columns than there should be).
The code I'm using is this:
import os
import pandas as pd
def stackcsv(content_folder):
    global combined_csv
    combined_csv = []
    entries = os.listdir(content_folder)
    for i in entries:
        csv_path = os.path.join(content_folder, i)
        solo_csv = pd.read_csv(csv_path, index_col=None)
        combined_csv.append(solo_csv)
    csv_final = pd.concat(combined_csv, axis=0, ignore_index=True)
    return csv_final.to_csv("final_data.csv", index=None, header=None)
I have 3 .csv files of size 20000x17 each, and I want to merge them into one of 60000x17. I suppose my error must be in the arguments index, header, index_col, etc.
Thanks in advance.
So after modifying the code, it worked. First of all, as Serge Ballesta said, it is necessary to tell read_csv that there is no header. Finally, using sort=False, the function works perfectly. This is the final code I used, and the final .csv is 719229 rows × 17 columns. Thanks to everybody!
import os
import pandas as pd
def stackcsv(content_folder):
    global combined_csv
    combined_csv = []
    entries = os.listdir(content_folder)
    for i in entries:
        csv_path = os.path.join(content_folder, i)
        solo_csv = pd.read_csv(csv_path, index_col=None, header=None)
        combined_csv.append(solo_csv)
    csv_final = pd.concat(combined_csv, axis=0, sort=False)
    return csv_final.to_csv("final_data.csv", header=None)
If the files have no header, you must tell read_csv so. If you don't, the first line of each file is read as a header line. As a result the DataFrames have different column names and concat will add new columns. So you should read with:
solo_csv = pd.read_csv(csv_path, index_col=None, header=None)
Alternatively, there is no reason to decode them at all; you could just concatenate the files as plain text:
def stackcsv(content_folder):
    with open("final_data.csv", "w") as fdout:
        entries = os.listdir(content_folder)
        for i in entries:
            csv_path = os.path.join(content_folder, i)
            with open(csv_path) as fdin:
                while True:
                    chunk = fdin.read(65536)  # copy in blocks rather than slurping the whole file
                    if len(chunk) == 0:
                        break
                    fdout.write(chunk)
Set the sort parameter to False in the pandas concat function:
csv_final = pd.concat(combined_csv, axis=0, ignore_index=True, sort=False)

Copy the first column of a .csv file into a new file

I know this is a very easy task, but I am acting pretty dumb right now and can't get it solved. I need to copy the first column of a .csv file, including the header, into a newly created file. My code:
station = 'SD_01'
import csv
import pandas as pd

df = pd.read_csv(str(station) + "_ED.csv", delimiter=';')
list1 = []
matrix1 = df[df.columns[0]].as_matrix()
list1 = matrix1.tolist()
with open('{0}_RRS.csv'.format(station), "r+") as f:
    writer = csv.writer(f)
    writer.writerows(map(lambda x: [x], list1))
As a result, my file has an empty line between the values, has no header (I could continue without the header, though), and something at the bottom that I cannot identify:
>350
>
>351
>
>352
>
>...
>
>949
>
>950
>
>Ž‘’“”•–—˜™š›œžŸ ¡¢
Just a short impression of the 1200+ lines.
I am pretty sure that this is a very clunky way to do this; easier ways are always welcome.
How do I get rid of all the empty lines, and of this crazy stuff at the end?
When you get a column from a dataframe, it's returned as type Series, and a Series has a built-in to_csv method you can use. So you don't need to do any matrix casting or anything like that.
import pandas as pd

df = pd.read_csv('name.csv', delimiter=';')
first_column = df[df.columns[0]]
first_column.to_csv('new_file.csv')
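As for the symptoms in the original snippet: the blank line between values is the classic csv.writer-on-Windows issue (the csv module docs say to open the file with newline=''), and the unidentifiable characters at the end are leftovers of the old file contents, since "r+" mode does not truncate the file. A hedged sketch of a corrected csv.writer version, reusing the asker's station, df, and list1:

import csv

with open('{0}_RRS.csv'.format(station), 'w', newline='') as f:  # 'w' truncates; newline='' avoids blank lines
    writer = csv.writer(f)
    writer.writerow([df.columns[0]])      # keep the header
    writer.writerows([x] for x in list1)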

python pandas automatic excel lookup system

I know this is a lot of code and there is a lot to do, but I am really stuck and don't know how to continue now that I have the function that lets the program match identical fields. I am pretty sure you know how lookup works in Excel; this program does basically the same. I tried to comment the important parts and hope you can give me some help on how to continue this project. Thank you very much!
import pandas as pd
import xlrd

# the two excel files with the columns that should be compared
File1 = pd.read_excel("Excel_test.xlsx", usecols=[0], header=None, index=False)
File2 = pd.read_excel("Excel_test02.xlsx", usecols=[0], header=None, index=False)
# the full excel files
fullFile1 = pd.read_excel("Excel_test.xlsx", header=None, index=False)
fullFile2 = pd.read_excel("Excel_test02.xlsx", header=None, index=False)
i = 0
writer = pd.ExcelWriter("output.xlsx")

def loadingTime():  # just a loader that shows the percentage of the matching process
    global i
    loading = (i / len(File1)) * 100
    loading = round(loading, 2)
    print(str(loading) + "%/100%")

def matcher():
    global i
    while i < len(File1):  # walk the compare column, moving on once a match is found in the second file
        for o in range(len(File2)):  # runs through the column in the second file
            matching = File1.iloc[i].str.lower() == File2.iloc[o].str.lower()  # compares the column contents of the two files
            if matching.bool() == True:
                print("Match")
                """
                df.append(File1.iloc[i])  # the whole row of the matched column should be appended to a DataFrame, keeping the arrangement of the excel file
                df.append(File2.iloc[o])
                """
        i += 1

matcher()
df.to_excel(writer, "Sheet")
writer.save()  # after the two files have been compared, write a file containing both excel contents, arranged correctly
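Not part of the original post, but the manual double loop is normally replaced by pandas.merge, which is the idiomatic equivalent of an Excel VLOOKUP. A rough sketch under the same file names (the 'key' helper column and the lower-casing for case-insensitive matching are my assumptions):

import pandas as pd

full1 = pd.read_excel("Excel_test.xlsx", header=None)
full2 = pd.read_excel("Excel_test02.xlsx", header=None)

# normalize the lookup column (column 0) so matching ignores case
full1['key'] = full1[0].astype(str).str.lower()
full2['key'] = full2[0].astype(str).str.lower()

# inner join keeps only rows whose keys appear in both files
merged = full1.merge(full2, on='key', suffixes=('_file1', '_file2'))
merged.to_excel("output.xlsx", index=False)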

difference in csv.reader and pandas - python

I am importing a csv file using both csv.reader and pandas. However, the two give different numbers of rows for the same file.
import csv

reviews = []
openfile = open("reviews.csv", 'rb')
r = csv.reader(openfile)
for i in r:
    reviews.append(i)
openfile.close()
print len(reviews)
The result is 10,000 (which is the correct value). However, pandas returns a different value.
df = pd.read_csv("reviews.csv", header=None)
df.info()
This returns 9,985.
Does anyone know why there is a difference between the two methods of importing data?
I just tried this:
reviews_df = pd.DataFrame(reviews)
reviews_df.info()
This returns 10,000.
Referring to the pandas.read_csv documentation, there is an argument named skip_blank_lines whose default value is True; hence, unless you set it to False, pandas will not read the blank lines.
Consider the following example; the file contains two blank rows:
A,B,C,D
0.07,-0.71,1.42,-0.37
0.08,0.36,0.99,0.11

1.06,1.55,-0.93,-0.90
-0.33,0.13,-0.11,0.89
1.91,-0.74,0.69,0.83

-0.28,0.14,1.28,-0.40
0.35,1.75,-1.10,1.23
-0.09,0.32,0.91,-0.08
Read it with skip_blank_lines=False:
df = pd.read_csv('test_data.csv', skip_blank_lines=False)
len(df)
10
Read it with skip_blank_lines=True:
df = pd.read_csv('test_data.csv', skip_blank_lines=True)
len(df)
8
