Difference in csv.reader and pandas - python

I am importing a CSV file using both csv.reader and pandas. However, the number of rows reported for the same file differs.
import csv

reviews = []
openfile = open("reviews.csv", 'rb')
r = csv.reader(openfile)
for i in r:
    reviews.append(i)
openfile.close()
print len(reviews)
The result is 10,000 (the correct value). However, pandas returns a different value.
df = pd.read_csv("reviews.csv", header=None)
df.info()
This returns 9,985.
Does anyone know why there is a difference between the two methods of importing data?
I just tried this:
reviews_df = pd.DataFrame(reviews)
reviews_df.info()
This returns 10,000.

Referring to the pandas.read_csv documentation, there is an argument named skip_blank_lines whose default value is True; unless you set it to False, blank lines will not be read.
Consider the following example, which contains two blank rows:
A,B,C,D
0.07,-0.71,1.42,-0.37
0.08,0.36,0.99,0.11

1.06,1.55,-0.93,-0.90
-0.33,0.13,-0.11,0.89
1.91,-0.74,0.69,0.83

-0.28,0.14,1.28,-0.40
0.35,1.75,-1.10,1.23
-0.09,0.32,0.91,-0.08
Read it with skip_blank_lines=False:
df = pd.read_csv('test_data.csv', skip_blank_lines=False)
len(df)
10
Read it with skip_blank_lines=True:
df = pd.read_csv('test_data.csv', skip_blank_lines=True)
len(df)
8
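Applied to the original file, a minimal sketch of the fix (assuming the 15-row gap is entirely due to blank lines) would be:

import pandas as pd

# keep blank lines so the row count matches csv.reader, which also
# yields an (empty) row for every blank line
df = pd.read_csv("reviews.csv", header=None, skip_blank_lines=False)
print(len(df))  # expected: 10000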

Related

JSON File Parsing In Python Brings Different Line In Each Execution

I am trying to analyze a large dataset from Yelp. The data is in JSON format, but the file is too large, so the script crashes when it tries to read all the data at once. So I decided to read it line by line and concatenate the lines into a dataframe to get a proper sample of the data.
f = open('./yelp_academic_dataset_review.json', encoding='utf-8')
I tried without the utf-8 encoding, but that raises an error.
I created a function that reads the file line by line and builds a pandas dataframe up to a given number of lines.
Anyway, some lines are lists, and the script iterates over each list too and adds the items to the dataframe.
def json_parser(file, max_chunk):
    f = open(file)
    df = pd.DataFrame([])
    for i in range(2, max_chunk + 2):
        try:
            type(f.readlines(i)) == list
            for j in range(len(f.readlines(i))):
                part = json.loads(f.readlines(i)[j])
                df2 = pd.DataFrame(part.items()).T
                df2.columns = df2.iloc[0]
                df2 = df2.drop(0)
                datas = [df2, df]
                df2 = pd.concat(datas)
                df = df2
        except:
            f = open(file, encoding="utf-8")
            for j in range(len(f.readlines(i))):
                try:
                    part = json.loads(f.readlines(i)[j-1])
                except:
                    print(i, j)
                df2 = pd.DataFrame(part.items()).T
                df2.columns = df2.iloc[0]
                df2 = df2.drop(0)
                datas = [df2, df]
                df2 = pd.concat(datas)
                df = df2
    df2.reset_index(inplace=True, drop=True)
    return df2
But I am still getting a "list index out of range" error (yes, I used print to debug).
So I looked closer at the lines that cause this error.
Very interestingly, when I try to look at those lines, the script gives me a different list each time.
Here is what I mean: I ran the cells repeatedly and got lists of different lengths.
So I looked at the lists themselves: they seem to be completely different lists. Each run brings back a different list even though the line number is the same, and the readlines documentation is not helping. What am I missing?
Thanks in advance.
You are using the expression f.readlines(i) several times as if it referred to the same set of lines each time.
But as a side effect of evaluating the expression, more lines are actually read from the file. At some point you are basing the indices j on more lines than are actually available, because they came from a different invocation of f.readlines.
You should call f.readlines(i) only once in each iteration of the for i in ... loop and store its result in a variable instead.
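A minimal sketch of that fix, keeping the asker's readlines-based chunking (the function signature and file handling are the asker's; the simplified dataframe construction is illustrative):

import json
import pandas as pd

def json_parser(file, max_chunk):
    records = []
    with open(file, encoding='utf-8') as f:
        for i in range(2, max_chunk + 2):
            lines = f.readlines(i)  # call readlines once per iteration and store the result
            for line in lines:
                records.append(json.loads(line))
    # build the dataframe once at the end instead of concatenating inside the loop
    return pd.DataFrame(records)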

CSV dataframe doesn't match with dataframe generated from URL

I have a file that I download from the NREL API. When I try to compare it with an older CSV, I get a difference using the .equals method in pandas, even though both files are 100% the same. The only difference is that one dataframe is generated from the CSV and the other directly from the API URL.
Below is my code; why is there a difference?
import pandas as pd
NERL_url = "https://developer.nrel.gov/api/alt-fuel-stations/v1.csv?api_key=DEMO_KEY&fuel_type=ELEC&country=all"
outputPath = r"D:\<myPCPath>\nerl.csv"
urlDF = pd.read_csv(NERL_url, low_memory=False)
urlDF.to_csv(outputPath, header=True,index=None, encoding='utf-8-sig')
csv_df = pd.read_csv(outputPath, low_memory=False)
if csv_df.equals(urlDF):
    print("Same")
else:
    print("Different")
My output comes out as Different. How do I fix this, and why is there a difference?
The problem is floating-point precision in read_csv: set float_precision='round_trip'. Then, to compare NaN values, you need to replace them with the same placeholder value, like 'same':
NERL_url = "https://developer.nrel.gov/api/alt-fuel-stations/v1.csv?api_key=DEMO_KEY&fuel_type=ELEC&country=all"
outputPath = r"nerl.csv"
urlDF = pd.read_csv(NERL_url, low_memory=False)
urlDF.to_csv(outputPath, header=True, index=None, encoding='utf-8-sig')
csv_df = pd.read_csv(outputPath, low_memory=False, float_precision='round_trip')
if csv_df.fillna('same').equals(urlDF.fillna('same')):
    print("Same")
else:
    print("Different")
Same
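An alternative worth knowing (a sketch, reusing urlDF and csv_df from the snippet above) is pandas.testing.assert_frame_equal, which treats NaNs in the same positions as equal and can tolerate floating-point noise:

from pandas.testing import assert_frame_equal

try:
    # check_exact=False allows tiny floating-point differences to pass
    assert_frame_equal(urlDF, csv_df, check_exact=False)
    print("Same")
except AssertionError as err:
    print("Different:", err)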

Pandas .DAT file import error with skip rows

I am trying to break a huge data file into smaller parts. I am using the following script:
df = pd.read_csv(file_name, header=None,encoding='latin1',sep='\t',nrows=100000, skiprows = 100000)
but I see that the skiprows argument skips around 200,000 rows instead of 100,000. Can anyone tell me why this is happening?
Thanks to @EdChum I was able to solve the problem using chunksize with the following code:
i = 0
tp = pd.read_csv(filename, header=None, encoding='latin1', sep='\t', iterator=True, chunksize=1000000)
for c in tp:
    ca = pd.DataFrame(c)
    ca.to_csv(file_destination + str(i) + 'test.csv', index=False, header=False)
    i = i + 1
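As a likely explanation for the original behaviour (an assumption, not confirmed from the asker's data): skiprows skips physical lines at the start of the file, so if any field contains embedded newlines, skipping 100,000 lines can discard more than 100,000 parsed rows. Iterating with chunksize reads the file sequentially exactly once, so no rows are lost or double-counted between the output parts.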

Loading only a list of rows using Panda read_csv function - Python

I would like to know if there is an option for the pandas.read_csv function that allows me to load only a certain list of rows from the original CSV file.
The CSV file is really big, and I can't load the whole file due to a lack of memory.
Is there an option like:
df = pandas.read_csv(file, read_only=list_to_read) ?
with list_to_read = [0, 2, 10], for example (this would read only row 0, row 2 and row 10)?
Many thanks in advance.
If you go over the docs for read_csv you will find the nrows kwarg:
nrows : int, default None
Number of rows of file to read. Useful for reading pieces of large files
Note however that this reads the first n rows of the file, not arbitrary lines (i.e. you can't pass it [0, 2, 10] and expect it to read the first, third and eleventh rows).
You may want to iteratively update the dataframe as you read through the file. This is not a fast process, but it will get only the rows of interest into a dataframe without pulling the entire file into memory.
import pandas as pd

col_list = ['columnA', 'columnB', ... ]  # fill in your data columns
row_list = [0, 3, 10, ... ]

df = pd.DataFrame(columns=col_list)
row_number = 0
with open('path/to/file', 'r') as fp:
    for i, line in enumerate(fp):  # iterate the file object line by line
        if i in row_list:
            data_line = list(map(float, line.strip().split(',')))  # assumes all columns are floats
            df.loc[row_number] = data_line
            row_number += 1
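As an aside, newer pandas versions (0.19+) also let skiprows take a callable, which handles exactly the case asked about (a sketch, reusing file and list_to_read from the question):

import pandas as pd

list_to_read = [0, 2, 10]
# skiprows is evaluated against each row index; returning True skips that
# row, so every row whose index is not in list_to_read is skipped
df = pd.read_csv(file, header=None, skiprows=lambda i: i not in list_to_read)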

Python: extracting data values from one file with IDs from a second file

I’m new to coding, and trying to extract a subset of data from a large file.
File_1 contains the data in two columns: ID and Values.
File_2 contains a large list of IDs, some of which may be present in File_1 while others will not be present.
If an ID from File_2 is present in File_1, I would like to extract those values and write the ID and value to a new file, but I’m not sure how to do this. Here is an example of the files:
File_1: data.csv
ID Values
HOT224_1_0025m_c100047_1 16
HOT224_1_0025m_c10004_1 3
HOT224_1_0025m_c100061_1 1
HOT224_1_0025m_c10010_2 1
HOT224_1_0025m_c10020_1 1
File_2: ID.xlsx
IDs
HOT224_1_0025m_c100047_1
HOT224_1_0025m_c100061_1
HOT225_1_0025m_c100547_1
HOT225_1_0025m_c100561_1
I tried the following:
import pandas as pd
data_file = pd.read_csv('data.csv', index_col = 0)
ID_file = pd.read_excel('ID.xlsx')
values_from_ID = data_file.loc[['ID_file']]
The following error occurs:
KeyError: "None of [['ID_file']] are in the [index]"
I am not sure if I am reading in the Excel file correctly.
I also do not know how to write the extracted data to a new file once I get the code to do it.
Thanks for your help.
With pandas:
import pandas as pd
data_file = pd.read_csv('data.csv', index_col=0, delim_whitespace=True)
ID_file = pd.read_excel('ID.xlsx', index_col=0)
res = data_file.loc[ID_file.index].dropna()
res.to_csv('result.csv')
Content of result.csv:
IDs,Values
HOT224_1_0025m_c100047_1,16.0
HOT224_1_0025m_c100061_1,1.0
In steps:
You need to read your csv with whitespace delimited:
data_file = pd.read_csv('data.csv', index_col=0, delim_whitespace=True)
it looks like this:
>>> data_file
                          Values
ID
HOT224_1_0025m_c100047_1      16
HOT224_1_0025m_c10004_1        3
HOT224_1_0025m_c100061_1       1
HOT224_1_0025m_c10010_2        1
HOT224_1_0025m_c10020_1        1
Now, read your Excel file, using the ids as index:
ID_file = pd.read_excel('ID.xlsx', index_col=0)
and you use its index with loc to get the matching entries from your first dataframe. Drop the missing values with dropna():
res = data_file.loc[ID_file.index].dropna()
Finally, write to the result csv:
res.to_csv('result.csv')
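One caveat (a version note, not part of the original answer): in newer pandas releases, .loc raises a KeyError when any of the requested labels are missing from the index, so a more version-proof variant of the lookup is reindex:

# reindex inserts NaN rows for missing IDs instead of raising, and
# dropna() then removes them, matching the behaviour described above
res = data_file.reindex(ID_file.index).dropna()
res.to_csv('result.csv')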
You can do it using a simple dictionary in Python. You can build a dictionary from File 1 and read the IDs from File 2. The IDs from File 2 can be checked against the dictionary, and only the matching ones written to your output file. Something like this could work:
with open('data.csv', 'r') as f:
    lines = f.readlines()

# Skip the CSV header
lines = lines[1:]
table = {l.split()[0]: l.split()[1] for l in lines if len(l.strip()) != 0}

with open('id.csv', 'r') as f:
    lines = f.readlines()

# Skip the CSV header
lines = lines[1:]
matchedIDs = [(l.strip(), table[l.strip()]) for l in lines if l.strip() in table]
Now you will have your matched IDs and their values in a list of tuples called matchedIDs. You can write them to a file in any format you like.
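For example, a minimal sketch of writing the matches back out as CSV (the output file name matched.csv is illustrative):

# write the (ID, value) tuples collected above to a new CSV file
with open('matched.csv', 'w') as out:
    out.write('ID,Values\n')
    for id_, value in matchedIDs:
        out.write('{},{}\n'.format(id_, value))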
I'm also new to Python programming, so the code I used below might not be the most efficient. The situation I assumed is that we need to find the IDs in data.csv that are also in id.csv; there might be some IDs in data.csv that are not in id.csv and vice versa.
import pandas as pd

data = pd.read_csv('data.csv')
id2 = pd.read_csv('id.csv')

data.ID = data['ID']
id2.ID = id2['IDs']

d = []
for row in data.ID:
    d.append(row)
f = []
for row in id2.ID:
    f.append(row)
g = []
for i in d:
    if i in f:
        g.append(i)

data = pd.read_csv('data.csv', index_col='ID')
new_data = data.loc[g, :]
new_data.to_csv('new_data.csv')
This is the code I ended up using. It worked perfectly. Thanks to everyone for their responses.
import pandas as pd
data_file = pd.read_csv('data.csv', index_col=0)
ID_file = pd.read_excel('ID.xlsx', index_col=0)
res = data_file.loc[ID_file.index].dropna()
res.to_csv('result.csv')
