I am trying to analyze a large dataset from Yelp. The data is in JSON format, but the file is too large, so the script crashes when it tries to read all of the data at once. So I decided to read it line by line and concatenate the lines into a dataframe to get a proper sample of the data.
f = open('./yelp_academic_dataset_review.json', encoding='utf-8')
I tried it without the utf-8 encoding, but that raises an error.
I created a function that reads the file line by line and builds a pandas dataframe up to a given number of lines.
Some reads return lists, so the script iterates over each list too and adds the items to the dataframe.
def json_parser(file, max_chunk):
    f = open(file)
    df = pd.DataFrame([])
    for i in range(2, max_chunk + 2):
        try:
            type(f.readlines(i)) == list
            for j in range(len(f.readlines(i))):
                part = json.loads(f.readlines(i)[j])
                df2 = pd.DataFrame(part.items()).T
                df2.columns = df2.iloc[0]
                df2 = df2.drop(0)
                datas = [df2, df]
                df2 = pd.concat(datas)
                df = df2
        except:
            f = open(file, encoding="utf-8")
            for j in range(len(f.readlines(i))):
                try:
                    part = json.loads(f.readlines(i)[j-1])
                except:
                    print(i, j)
                df2 = pd.DataFrame(part.items()).T
                df2.columns = df2.iloc[0]
                df2 = df2.drop(0)
                datas = [df2, df]
                df2 = pd.concat(datas)
                df = df2
    df2.reset_index(inplace=True, drop=True)
    return df2
But I am still getting a list index out of range error. (Yes, I used print to debug.)
So I looked closer at the lines that cause this error.
But very interestingly, when I try to look at those lines, the script gives me a different list.
Here is what I mean:
I ran the cells repeatedly and got a different length for the list each time.
So I looked at the lists:
They seem to be completely different lists. Each run brings back a different list even though the line number is the same, and the readlines documentation is not helping. What am I missing?
Thanks in advance.
You are using the expression f.readlines(i) several times as if it referred to the same set of lines each time.
But as a side effect of evaluating the expression, more lines are actually read from the file. At some point you are basing the indices j on more lines than are actually available, because they came from a different invocation of f.readlines.
You should use f.readlines(i) only once in each iteration of the for i in ... loop and store its result in a variable instead.
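As an aside: since the file is in the JSON Lines format (one JSON object per line), the whole sampler can be written without readlines at all, by iterating over the file object and collecting the parsed records in a list. A minimal sketch of that idea (the function name is just a placeholder):

import json
import pandas as pd

def json_sample(path, max_lines):
    records = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f):
            if line_no >= max_lines:
                break
            records.append(json.loads(line))  # one JSON object per line
    return pd.DataFrame(records)              # dict keys become columns

df = json_sample('./yelp_academic_dataset_review.json', 1000)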
Related
I have a script that I use to fire orders from a csv file to an exchange, using a for loop.
data = pd.read_csv('orderparameters.csv')
df = pd.DataFrame(data)

for i in range(len(df)):
    order = Client.new_order(...
                             ...)
    file = open('orderData.txt', 'a')
    original_stdout = sys.stdout
    with file as f:
        sys.stdout = f
        print(order)
    file.close()
    sys.stdout = original_stdout
I put the response from the exchange in a txt file like this...
I want to turn the multiple responses into 1 single dataframe. I would hope it would look something like...
(I did that manually).
I tried:
data = pd.read_csv('orderData.txt', header=None)
dfData = pd.DataFrame(data)
print(dfData)
but I got:
I have also tried:
data = pd.read_csv('orderData.txt', header=None)
organised = data.apply(pd.Series)
print(organised)
but I got the same output.
I can print order['symbol'] within the loop etc.
I'm not certain whether I should be populating this dataframe within the loop, or by capturing and writing the response and processing it afterwards. Appreciate your advice.
It looks like you are getting JSON strings back; you could parse each one into a dictionary and then create a dataframe from the list of dictionaries. Perhaps try something like this (it no longer needs a file):
import json

data = pd.read_csv('orderparameters.csv')
df = pd.DataFrame(data)

response_data = []
for i in range(len(df)):
    order_json = Client.new_order(...
                                  ...)
    response_data.append(json.loads(order_json))  # parse the JSON string; safer than eval

response_dataframe = pd.DataFrame(response_data)
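pd.DataFrame on a list of dictionaries aligns the rows by key, so each response field becomes a column. A toy illustration with made-up response dicts (the field names here are hypothetical):

responses = [{"symbol": "BTCUSDT", "status": "FILLED"},
             {"symbol": "ETHUSDT", "status": "NEW"}]
pd.DataFrame(responses)
#     symbol  status
# 0  BTCUSDT  FILLED
# 1  ETHUSDT     NEW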
If I understand your question correctly, you can simply do the following:
import pandas as pd
orders = pd.read_csv('orderparameters.csv')
responses = pd.DataFrame(Client.new_order(...) for _ in range(len(orders)))
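Note that this one-liner assumes Client.new_order returns a dict-like response per call; pandas then builds one row per response, just as in the list-based version above. If the client returns a JSON string instead, wrap each call in json.loads first.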
I have a strange problem. In my folder I have .dat files with CO2 values from a CO2 sensor in the laboratory: data from experiments 4, 5, 6, 7 and 8, named CO2_4.dat, CO2_5.dat, CO2_6.dat, CO2_7.dat, CO2_8.dat.
I know how to read them manually. For example, this works for reading CO2_4:
dfCO2_4_manual = pd.read_csv(r'C:\data\CO2\co2_4.dat', sep=";", encoding= 'unicode_escape', header = 0, skiprows=[0], usecols=[0,1,2,4], names =["ts","t","co2_4", "p"])
display(dfCO2_4_manual)
which gives me a dataframe with the correct values:
one value per minute.
But if I want to loop over my folder and read them all with this technique (which works for other CSV files from the laboratory), saving the dataframes in a dictionary:
exp_list = [4,5,6,7,8]  # list with the number of each experiment
path_CO2 = r'C:\data\CO2'
CO2_files = glob.glob(os.path.join(path_CO2, "*.dat"))

CO2_dict = {}
for f, i in zip(offline_files, exp_list):
    CO2_dict["CO2_{0}".format(i)] = pd.read_csv(f, sep=";", encoding='unicode_escape', header=0, skiprows=[0], usecols=[0,1,2,4], names=["ts","t","CO2_{0}".format(i), "p"])

display(CO2_dict["CO2_4"])
it gives me a dataframe with many skipped and completely wrong values.
If I open CO2_4.dat in a text editor, it looks like this:
Does someone know what is happening?
It's not clear how to help exactly, given that we don't have access to your files. However, is this line
for f, i in zip(offline_files, exp_list):
correct? Where is offline_files defined? It's not in the code you have provided. Also, do you want to analyze each df separately? Is that why you are storing them in a dictionary?
As an alternative you can store each df in a list and concatenate them. You can then group and apply analyses to them that way.
df_hold_list = []
for f, i in zip(CO2_files, exp_list):  # changed file list name; please verify
    df = pd.read_csv(f, sep=";", encoding='unicode_escape', header=0, skiprows=[0], usecols=[0,1,2,4], names=["ts","t","CO2_{0}".format(i), "p"])
    df['file'] = 'CO2_{0}'.format(i)  # add column for sorting/grouping
    df_hold_list.append(df)

df_new = pd.concat(df_hold_list, axis=0)  # axis=0 stacks the frames row-wise
I can't test the code, but it should work. Let me know if it doesn't.
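One more note: glob does not guarantee that the files come back in experiment order, so zipping the glob result with exp_list can silently pair the wrong file with the wrong number. If the files really follow the CO2_<n>.dat pattern, a sketch that builds each filename explicitly (assuming that naming convention) avoids the issue:

CO2_dict = {}
for i in exp_list:
    fname = os.path.join(path_CO2, "CO2_{0}.dat".format(i))  # assumed naming pattern
    CO2_dict["CO2_{0}".format(i)] = pd.read_csv(fname, sep=";", encoding='unicode_escape',
                                                header=0, skiprows=[0], usecols=[0,1,2,4],
                                                names=["ts", "t", "CO2_{0}".format(i), "p"])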
I know this is a very easy task, but I am acting pretty dumb right now and can't get it solved. I need to copy the first column of a .csv file, including the header, into a newly created file. My code:
station = 'SD_01'
import csv
import pandas as pd

df = pd.read_csv(str(station) + "_ED.csv", delimiter=';')
list1 = []
matrix1 = df[df.columns[0]].as_matrix()
list1 = matrix1.tolist()

with open('{0}_RRS.csv'.format(station), "r+") as f:
    writer = csv.writer(f)
    writer.writerows(map(lambda x: [x], list1))
As a result, my file has an empty line between the values, has no header (I could continue without the header, though), and something at the bottom which I cannot identify:
>350
>
>351
>
>352
>
>...
>
>949
>
>950
>
>Ž‘’“”•–—˜™š›œžŸ ¡¢
Just a short impression of the 1200+ lines.
I am pretty sure this is a very clunky way to do it; easier ways are always welcome.
How do I get rid of all the empty lines and the crazy stuff at the end?
When you get a column from a dataframe, it's returned as type Series, and a Series has a built-in to_csv method you can use. So you don't need to do any matrix casting or anything like that.
import pandas as pd

df = pd.read_csv('name.csv', delimiter=';')
first_column = df[df.columns[0]]
first_column.to_csv('new_file.csv')
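If you would rather keep the csv-module version, two small changes should clear up the symptoms you saw: open the output with newline='' (as the csv docs recommend; it removes the extra blank line after every row on Windows), and use mode 'w' instead of 'r+' so leftover bytes from an older, longer file don't survive past the new content, which is the likely source of the garbage at the bottom. A sketch, reusing the variables from your code:

with open('{0}_RRS.csv'.format(station), 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow([df.columns[0]])      # keep the header
    writer.writerows([x] for x in list1)  # one value per row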
I have the following code:
for state in state_list:
    state_df = pd.DataFrame()
    for df in pd.read_csv(tax_sample, sep='\|\|', engine='python',
                          dtype=tax_column_types, chunksize=10, nrows=100):
        state_df = pd.concat(state_df, df[df['state'] == state])
    state_df.to_csv('property' + state + '.csv')
My dataset is quite big, and I'm breaking it into chunks (in reality these would be bigger than 10 obs). I'm taking each chunk, checking whether the state matches a particular state in a list, and, if so, storing it in a dataframe and saving it down.
In short, I'm trying to take a dataframe with many different states in it, break it into several dataframes, each with only one state, and save each to CSV.
However, the code above gives the error:
TypeError: first argument must be an iterable of pandas objects, you
passed an object of type "DataFrame"
Any idea why?
Thanks,
Mike
Consider iterating over the chunks, each time running .isin() to filter on state_list, and saving the results in a container like a dict or list. (The immediate TypeError comes from calling pd.concat(state_df, ...): pd.concat expects the objects in an iterable, e.g. pd.concat([state_df, filtered]).) As commented, avoid the overhead of expanding dataframes in a loop.
Afterwards, bind the container with pd.concat and then run a looped groupby on the state field to output each file individually.
df_list = []
reader = pd.read_csv(tax_sample, sep='\|\|', engine='python',
                     dtype=tax_column_types, chunksize=10, nrows=100)

for chunk in reader:
    tmp = chunk[chunk['state'].isin(state_list)]
    df_list.append(tmp)

master_df = pd.concat(df_list)

for state, group in master_df.groupby('state'):
    group.to_csv('property' + state + '.csv')
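Two notes on the design: filtering with .isin() on each chunk keeps memory bounded by the chunk size, and collecting the filtered chunks in a list before a single pd.concat avoids the repeated copying that comes from growing state_df inside the loop (each in-loop concat re-copies all rows accumulated so far). The groupby at the end then writes one file per state in a single pass over the combined data.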
I am importing a csv file using csv.reader and pandas. However, the number of rows reported for the same file is different.
reviews = []
openfile = open("reviews.csv", 'rb')
r = csv.reader(openfile)
for i in r:
    reviews.append(i)
openfile.close()
print len(reviews)
The result is 10,000 (which is the correct value). However, pandas returns a different value.
df = pd.read_csv("reviews.csv", header=None)
df.info()
This returns 9,985.
Does anyone know why there is a difference between the two methods of importing data?
I just tried this:
reviews_df = pd.DataFrame(reviews)
reviews_df.info()
This returns 10,000.
Refer to pandas.read_csv: there is an argument named skip_blank_lines whose default value is True, hence unless you set it to False it will not read blank lines.
Consider the following example, which contains two blank rows:
A,B,C,D
0.07,-0.71,1.42,-0.37
0.08,0.36,0.99,0.11
1.06,1.55,-0.93,-0.90

-0.33,0.13,-0.11,0.89
1.91,-0.74,0.69,0.83
-0.28,0.14,1.28,-0.40

0.35,1.75,-1.10,1.23
-0.09,0.32,0.91,-0.08
Read it with skip_blank_lines=False:
df = pd.read_csv('test_data.csv', skip_blank_lines=False)
len(df)
10
Read it with skip_blank_lines=True:
df = pd.read_csv('test_data.csv', skip_blank_lines=True)
len(df)
8
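Applied to the original question: the 15-row gap (10,000 vs 9,985) suggests reviews.csv contains 15 blank lines, which csv.reader yields as empty rows but read_csv drops by default. If that is the case, this should reproduce the csv.reader count:

df = pd.read_csv("reviews.csv", header=None, skip_blank_lines=False)
df.info()  # should now report 10,000 entries, matching csv.reader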