Python: iteratively add to dataframe

I have the following code:
for state in state_list:
    state_df = pd.DataFrame()
    for df in pd.read_csv(tax_sample, sep='\|\|', engine='python', dtype=tax_column_types, chunksize=10, nrows=100):
        state_df = pd.concat(state_df, df[df['state'] == state])
    state_df.to_csv('property' + state + '.csv')
My dataset is quite big, so I'm breaking it into chunks (in reality each chunk would be much bigger than 10 observations). For each chunk I check whether the state matches a particular state in a list and, if so, store it in a dataframe and save it to disk.
In short, I'm trying to take a dataframe with many different states in it, break it into several dataframes, each with only one state, and save each one to CSV.
However, the code above gives the error:
TypeError: first argument must be an iterable of pandas objects, you
passed an object of type "DataFrame"
Any idea why?
Thanks,
Mike

Consider iterating over the chunks and, for each one, filtering on state_list with .isin(), saving the result in a container such as a dict or list. As commented, avoid the overhead of expanding dataframes inside a loop.
Afterwards, bind the container together with pd.concat and then loop over a groupby on the state field to output each file individually.
df_list = []

reader = pd.read_csv(tax_sample, sep='\|\|', engine='python',
                     dtype=tax_column_types, chunksize=10, nrows=100)

for chunk in reader:
    tmp = chunk[chunk['state'].isin(state_list)]
    df_list.append(tmp)

master_df = pd.concat(df_list)

for g in master_df.groupby('state'):
    g[1].to_csv('property' + g[0] + '.csv')

Related

JSON File Parsing In Python Brings Different Line In Each Execution

I am trying to analyze a large dataset from Yelp. The data is in JSON format, but it is too large, so the script crashes when it tries to read all the data at once. So I decided to read it line by line and concatenate the lines into a dataframe to get a proper sample of the data.
f = open('./yelp_academic_dataset_review.json', encoding='utf-8')
I tried without the utf-8 encoding, but that raises an error.
I created a function that reads the file line by line and builds a pandas dataframe up to a given number of lines.
Some lines are lists, so the script iterates over each list as well and adds the items to the dataframe.
def json_parser(file, max_chunk):
    f = open(file)
    df = pd.DataFrame([])
    for i in range(2, max_chunk + 2):
        try:
            type(f.readlines(i)) == list
            for j in range(len(f.readlines(i))):
                part = json.loads(f.readlines(i)[j])
                df2 = pd.DataFrame(part.items()).T
                df2.columns = df2.iloc[0]
                df2 = df2.drop(0)
                datas = [df2, df]
                df2 = pd.concat(datas)
                df = df2
        except:
            f = open(file, encoding="utf-8")
            for j in range(len(f.readlines(i))):
                try:
                    part = json.loads(f.readlines(i)[j-1])
                except:
                    print(i, j)
                df2 = pd.DataFrame(part.items()).T
                df2.columns = df2.iloc[0]
                df2 = df2.drop(0)
                datas = [df2, df]
                df2 = pd.concat(datas)
                df = df2
    df2.reset_index(inplace=True, drop=True)
    return df2
But I am still getting a "list index out of range" error. (Yes, I used print to debug.)
So I looked more closely at the lines that cause this error.
Very interestingly, when I try to look at those lines, the script gives me a different list each time.
Here is what I mean: I ran the cells repeatedly and got a different list length each time.
So I looked at the lists themselves: they are completely different lists. Each run brings back a different list even though the line number is the same, and the readlines documentation is not helping. What am I missing?
Thanks in advance.
You are using the expression f.readlines(i) several times as if it referred to the same set of lines each time.
But as a side effect of evaluating the expression, more lines are actually read from the file. At some point you are basing the indices j on more lines than are actually available, because they came from a different invocation of f.readlines.
You should call f.readlines(i) only once in each iteration of the for i in ... loop and store its result in a variable instead.
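For example, a minimal sketch of that change -- the simplified DataFrame construction and variable names here are only illustrative; the key point is that f.readlines(i) is called once per iteration and the stored result is reused:
import json
import pandas as pd

def json_parser(file, max_chunk):
    frames = []
    with open(file, encoding="utf-8") as f:
        for i in range(2, max_chunk + 2):
            lines = f.readlines(i)          # call readlines once and store the result
            for line in lines:              # iterate over the stored list, not a new call
                frames.append(pd.DataFrame([json.loads(line)]))
    return pd.concat(frames, ignore_index=True)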

Python: Pandas - Nested Loop takes very long to complete. How to speed up?

I am self-learning Python and Pandas to support my daily job. With a lot of trial and error, I have built the function below. This function takes as arguments (i) a dataframe referenced as 'dataset', (ii) a list of country names and (iii) a list of unique legal entity IDs. (The function works.)
The dataset is a large dataframe with 300,000+ rows and approximately 30 columns -- it is a dump of a general ledger. The key columns are "LE_ID" and "COUNTRY", which respectively contain (i) a unique ID for the relevant legal entity and (ii) the name of that legal entity's country. Not all rows are unique; there are approx 5000 LE_IDs filling the 300,000+ rows.
I want to "split" this dataset into XLSX files: one file per country, with one tab per LE_ID showing that entity's ledger details. The function below achieves this, but at a horrible speed -- it takes 40 mins to complete on my pretty recent laptop.
The function:
def SplitDatasetByCountry(dataset, countries, ids):
    for country in countries:
        ## output
        output_folder = Path('/Users/XXXXXX/Desktop/TOOL/Reports')
        output_filename = 'Data__for_' + str(country) + '_.xlsx'
        output = output_folder / output_filename
        ## writer
        writer = pd.ExcelWriter(output)
        workbook = writer.book
        ## country logic
        x = dataset.loc[dataset['COUNTRY'] == country]
        ids_for_entities_in_country = x['LE_ID'].to_list()
        unique_ids = list(set(ids_for_entities_in_country))
        for id in ids:
            if id in unique_ids:
                y = x.loc[x['LE_ID'] == id]
                y.to_excel(writer, sheet_name=str(id))
            else:
                pass
        writer.save()
        workbook.close()
I would be grateful for any suggestions on how to speed this up. I think I am iterating excessively, which is causing the issue, but I am not sure how to solve it. Yesterday's version of this function was a bit faster, but I ended up with corrupted XLSX files -- probably because I accidentally had the code write over the same file multiple times.
I understand from the community here that list comprehensions are preferred, but I can't figure out the right syntax to organize my function. I would prefer to keep iterating, but eliminate the current (seemingly?) redundant iterations.
Thanks for your thoughts.
First UPDATE MAY 28:
The dataset has the following fields
LE_ID object
LEGAL_ENTITY_NAME object
COUNTRY object
GL_ACCOUNT object
BOOK_AMT int64
ADJUSTED_TAX float64
Second UPDATE MAY 28:
Revised code per suggestion in comments, works flawlessly:
def SplitDatasetByCountry(dataset):
    for country, country_df in dataset.groupby('COUNTRY'):
        ## output
        output_folder = Path('/Users/XXXX/Desktop/Reports')
        output_filename = 'Data_for_' + str(country) + '_.xlsx'
        output = output_folder / output_filename
        ## writer
        writer = pd.ExcelWriter(output)
        workbook = writer.book
        ## country logic
        country_expenses = function_A(country_df)
        country_income = function_B(country_df)
        expense = country_expenses.groupby('LE_ID')['BOOK_AMT'].sum()
        income = country_income.groupby('LE_ID')['BOOK_AMT'].sum()
        expense.to_excel(writer, sheet_name='Country Expense')
        income.to_excel(writer, sheet_name='Country Income')
        for le_id, le_id_df in country_df.groupby('LE_ID'):
            le_id_df.to_excel(writer, sheet_name=str(le_id))
        writer.save()
        workbook.close()
There are two terms that need to be understood: vectorization and traditional loops.
If you use for loops, especially nested for loops, the time complexity grows on the order of O(n^2).
But if you use vectorization, your algorithm is likely to run faster.
Instead of for loops, try to solve the problem using numpy summations or dot functions and it will be much faster.
The sketch below illustrates the vectorization technique.
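As an illustration of the general point, here is a generic sketch (not tied to the question's dataset) comparing a plain Python loop with the equivalent numpy dot product:
import time
import numpy as np

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)

# traditional loop: one Python-level iteration per element
start = time.time()
total = 0.0
for x, y in zip(a, b):
    total += x * y
print('loop:      ', time.time() - start)

# vectorized: the same dot product computed in compiled numpy code
start = time.time()
total = np.dot(a, b)
print('vectorized:', time.time() - start)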
You need to group by COUNTRY first, then group by LE_ID and write the result to file.
for country, country_df in dataset.groupby('COUNTRY'):
    ## output
    output_folder = Path('/Users/XXXXXX/Desktop/TOOL/Reports')
    output_filename = 'Data__for_' + str(country) + '_.xlsx'
    output = output_folder / output_filename
    ## writer
    with pd.ExcelWriter(output) as writer:
        for le_id, le_df in country_df.groupby("LE_ID"):
            le_df.to_excel(writer, sheet_name=str(le_id))
Could you please share the format of the dataframe before it enters your function?
If that format is good enough, then simply splitting the dataframe by country and then by the ids should be enough:
list_of_dfs = [dataset[dataset["COUNTRY"] == country] for country in dataset["COUNTRY"].unique()]
I do not understand what you are trying to accomplish with the ids, but I can at least point out that you can get the unique values directly, as above, by calling .unique() on a Series (cf. the pandas docs):
unique_ids = dataset["LE_ID"].unique()

Pandas DataFrame takes too long

I am running the code below on a file with close to 300k lines. I know my code is not very efficient, as it takes forever to finish; can anyone advise me on how to speed it up?
import sys
import numpy as np
import pandas as pd
file = sys.argv[1]
df = pd.read_csv(file, delimiter=' ',header=None)
df.columns = ["ts", "proto", "orig_bytes", "orig_pkts", "resp_bytes", "resp_pkts", "duration", "conn_state"]
orig_bytes = np.array(df['orig_bytes'])
resp_bytes = np.array(df['resp_bytes'])
size = np.array([])
ts = np.array([])
for i in range(len(df)):
    if orig_bytes[i] > resp_bytes[i]:
        size = np.append(size, orig_bytes[i])
        ts = np.append(ts, df['ts'][i])
    else:
        size = np.append(size, resp_bytes[i])
        ts = np.append(ts, df['ts'][i])
The aim is to record only the instances where one of the two (orig_bytes or resp_bytes) is the larger one.
Thank you all for your help.
I can't guarantee that this will run faster than what you have, but it is a more direct route to where you want to go. Also, I'm assuming based on your example that you don't want to keep instances where the two byte values are equal and that you want a separate DataFrame in the end, not a new column in the existing df:
After you've created your DataFrame and renamed the columns, you can use query to drop all the instances where orig_bytes and resp_bytes are the same, create a new column with the max value of the two, and then narrow the DataFrame down to just the two columns you want.
df = pd.read_csv(file, delimiter=' ',header=None)
df.columns = ["ts", "proto", "orig_bytes", "orig_pkts", "resp_bytes", "resp_pkts", "duration", "conn_state"]
df_new = df.query("orig_bytes != resp_bytes")
df_new['biggest_bytes'] = df_new[['orig_bytes', 'resp_bytes']].max(axis=1)
df_new = df_new[['ts', 'biggest_bytes']]
If you do want to include the entries where they are equal to each other, then just skip the query step.
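As a side note, assigning a new column into the result of query() can trigger pandas' SettingWithCopyWarning; a minimal variant of the same idea that sidesteps it (same column names as above) is:
df_new = (
    df.query("orig_bytes != resp_bytes")
      .assign(biggest_bytes=lambda d: d[['orig_bytes', 'resp_bytes']].max(axis=1))
      [['ts', 'biggest_bytes']]
)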

Best/Fastest way to read 3k of sheets from an Excel and Upload them in a Pandas Dataframe

I have an Excel file with 3k worth of sheets. I'm currently reading the sheets one by one, converting each to a dataframe, appending it to a list and repeating.
One iteration of the for loop takes approx 90 seconds, which is a huge amount of time. Each sheet has around 35 rows of data with 5 columns.
Can somebody suggest a better methodology in approaching this?
This is my code:
import pandas as pd
import time
nr_pages_workbook = list(range(1,3839))
nr_pages_workbook = ['Page '+str(x) for x in nr_pages_workbook]
list_df = []
start = time.time()
for number in nr_pages_workbook:
    data = pd.read_excel('D:\\DEV\\Stage\\Project\\Extras.xlsx', sheet_name=number)
    list_df.append(data)
    break
stop = time.time() - start
Df_Date_Raw = pd.concat(list_df)
You can try passing nr_pages_workbook directly to the sheet_name parameter of read_excel; according to the docs it can be a list, and the return value will then be a dict of dataframes. This way you avoid the overhead of opening and reading the file in every cycle.
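For example, a sketch of that first option, reusing the nr_pages_workbook list from the question (the ignore_index flag is just a choice here, not required):
# one read_excel call opens the workbook once and returns {sheet name: DataFrame}
sheets = pd.read_excel('D:\\DEV\\Stage\\Project\\Extras.xlsx',
                       sheet_name=nr_pages_workbook)
Df_Date_Raw = pd.concat(sheets.values(), ignore_index=True)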
Or simply pass sheet_name=None to read all sheets into a dict, and then concatenate from the dict:
data = pd.read_excel('D:\\DEV\\Stage\\Project\\Extras.xlsx', sheet_name=None)
df = pd.concat([v for k,v in data.items()])
You are reading the whole file again on every iteration of the loop. I would suggest reading it once using ExcelFile and then just accessing a particular sheet in the loop. Try:
import pandas as pd

xl = pd.ExcelFile('foo.xls')
sheet_list = xl.sheet_names
for idx, name in enumerate(sheet_list):
    if idx == 0:
        df = xl.parse(name)
    else:
        df = df.append(xl.parse(name), ignore_index=True)
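Note that DataFrame.append inside a loop is deprecated in recent pandas versions; a variant of the same idea that collects the parsed sheets in a list and concatenates once (using the same foo.xls placeholder) would be:
import pandas as pd

xl = pd.ExcelFile('foo.xls')
frames = [xl.parse(name) for name in xl.sheet_names]
df = pd.concat(frames, ignore_index=True)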

Reduce memory usage of this Pandas code reading JSON file and pickling

I can't figure out a way to reduce memory usage for this program further.
Basically, I'm reading from JSON log files into a pandas dataframe, but:
the list append step is what is causing the issue: it creates two different objects in memory, causing huge memory usage.
the .to_pickle method of pandas is also a huge memory hog, because the biggest memory spike occurs while writing to the pickle.
Here is my most efficient implementation to date:
columns = ['eventName', 'sessionId', "eventTime", "items", "currentPage", "browserType"]
df = pd.DataFrame(columns=columns)
l = []

for i, file in enumerate(glob.glob("*.log")):
    print("Going through log file #%s named %s..." % (i+1, file))
    with open(file) as myfile:
        l += [json.loads(line) for line in myfile]

    tempdata = pd.DataFrame(l)
    for column in tempdata.columns:
        if not column in columns:
            try:
                tempdata.drop(column, axis=1, inplace=True)
            except ValueError:
                print("oh no! We've got a problem with %s column! It don't exist!" % (column))

    l = []
    df = df.append(tempdata, ignore_index=True)

    # very slow version, but is most memory efficient
    # length = len(df)
    # length_temp = len(tempdata)
    # for i in range(1, length_temp):
    #     update_progress((i*100.0)/length_temp)
    #     for column in columns:
    #         df.at[length+i, column] = tempdata.at[i, column]

    tempdata = 0

print("Data Frame initialized and filled! Now Sorting...")
df.sort(columns=["sessionId", "eventTime"], inplace=True)
print("Done Sorting... Changing indices...")
df.index = range(1, len(df)+1)
print("Storing in Pickles...")
df.to_pickle('data.pkl')
Is there an easy way to reduce memory? The commented code does the job but takes 100-1000x longer. I'm currently at 45% memory usage at max during the .to_pickle part, 30% during the reading of the logs. But the more logs there are, the higher that number goes.
This answer is about general pandas DataFrame memory usage optimization:
Pandas loads string columns as object dtype by default. For the columns that have the object type, try assigning the category dtype to them by passing a dictionary to the dtype parameter of read_csv. Memory usage decreases dramatically for columns with 50% or fewer unique values.
Pandas reads numeric columns as float64 by default. Use pd.to_numeric to downcast float64 to 32 or 16 bits where possible; this again saves memory.
Load CSV data chunk by chunk, process each chunk, and move on to the next. This can be done by specifying a value for the chunksize parameter of read_csv. A sketch of these three ideas is shown below.
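A minimal sketch of these three suggestions, using a hypothetical some_file.csv and column names (col_a, col_b) purely for illustration:
import pandas as pd

# 1. read a string column as category instead of object
df = pd.read_csv('some_file.csv', dtype={'col_a': 'category'})

# 2. downcast a numeric column to a smaller float type where possible
df['col_b'] = pd.to_numeric(df['col_b'], downcast='float')

# 3. read the file chunk by chunk instead of holding everything in memory
for chunk in pd.read_csv('some_file.csv', chunksize=100_000):
    print(chunk.shape)   # do the per-chunk processing here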
If you need to build a DataFrame up from pieces, it is generally much more efficient to construct a list of the component frames and combine them all in one step using concat. See the first approach below.
# df = 10 rows of dummy data

In [10]: %%time
    ...: dfs = []
    ...: for _ in xrange(1000):
    ...:     dfs.append(df)
    ...: df_concat = pd.concat(dfs, ignore_index=True)
    ...:
Wall time: 42 ms

In [11]: %%time
    ...: df_append = pd.DataFrame(columns=df.columns)
    ...: for _ in xrange(1000):
    ...:     df_append = df_append.append(df, ignore_index=True)
    ...:
Wall time: 915 ms
