Pandas DataFrame takes too long - python

I am running the code below on a file with close to 300k lines. I know my code is not very efficient because it takes forever to finish. Can anyone advise me on how I can speed it up?
import sys
import numpy as np
import pandas as pd
file = sys.argv[1]
df = pd.read_csv(file, delimiter=' ',header=None)
df.columns = ["ts", "proto", "orig_bytes", "orig_pkts", "resp_bytes", "resp_pkts", "duration", "conn_state"]
orig_bytes = np.array(df['orig_bytes'])
resp_bytes = np.array(df['resp_bytes'])
size = np.array([])
ts = np.array([])
for i in range(len(df)):
    if orig_bytes[i] > resp_bytes[i]:
        size = np.append(size, orig_bytes[i])
        ts = np.append(ts, df['ts'][i])
    else:
        size = np.append(size, resp_bytes[i])
        ts = np.append(ts, df['ts'][i])
The aim is to only record instances where one of the two (orig_bytes or resp_bytes) is the larger one.
Thank you all for your help.

I can't guarantee that this will run faster than what you have, but it is a more direct route to where you want to go. Also, I'm assuming based on your example that you don't want to keep instances where the two byte values are equal and that you want a separate DataFrame in the end, not a new column in the existing df:
After you've created your DataFrame and renamed the columns, you can use query to drop all the instances where orig_bytes and resp_bytes are the same, create a new column with the max value of the two, and then narrow the DataFrame down to just the two columns you want.
df = pd.read_csv(file, delimiter=' ',header=None)
df.columns = ["ts", "proto", "orig_bytes", "orig_pkts", "resp_bytes", "resp_pkts", "duration", "conn_state"]
df_new = df.query("orig_bytes != resp_bytes")
df_new['biggest_bytes'] = df_new[['orig_bytes', 'resp_bytes']].max(axis=1)
df_new = df_new[['ts', 'biggest_bytes']]
If you do want to include the entries where they are equal to each other, then just skip the query step.
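In that case the whole thing collapses to a single row-wise max with no filtering at all. A rough sketch of that variant (reusing the column names from the question; this is not code from the original answer):
import sys
import pandas as pd

file = sys.argv[1]
df = pd.read_csv(file, delimiter=' ', header=None)
df.columns = ["ts", "proto", "orig_bytes", "orig_pkts",
              "resp_bytes", "resp_pkts", "duration", "conn_state"]

# Keep every row and take whichever byte count is larger, fully vectorised.
df_new = df[['ts']].copy()
df_new['biggest_bytes'] = df[['orig_bytes', 'resp_bytes']].max(axis=1)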

Related

Why aren't all ',' converted to decimals when importing in Pandas?

I am using the following code to import the CSV file. It works well except when it encounters a three-digit number followed by a decimal. Below are my code and the result.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
def fft(x, Plot_ShareY=True):
    dfs = pd.read_csv(x, delimiter=";", skiprows=(1,2), decimal=",", na_values='NaN') #loads the csv files
    #replaces non-numeric symbols to NaN.
    dfs = dfs.replace({'-∞': np.nan, '∞': np.nan})
    #print(dfs) #before dropping NaNs
    #each column taken into a separate variable
    time = dfs['Time'] #- np.min(dfs['Time'])
    channelA = dfs['Channel A']
    channelB = dfs['Channel B']
    channelC = dfs['Channel C']
    channelD = dfs['Channel D']
    channels = [channelA, channelB, channelC, channelD]
    #printing the smallest index number which is NaN
    ind_num_A = np.where(channelA.isna())[0][0]
    ind_num_B = np.where(channelB.isna())[0][0]
    ind_num_C = np.where(channelC.isna())[0][0]
    ind_num_D = np.where(channelD.isna())[0][0]
    ind_num = [ind_num_A, ind_num_B, ind_num_C, ind_num_D]
    #dropping all rows after the first NaN is found
    rem_ind = np.amin(ind_num) #finds the array-wise minimum
    #print('smallest index to be deleted is: ' + str(rem_ind))
    dfs = dfs.drop(dfs.index[rem_ind:])
    print(dfs) #after dropping NaNs
The result is as I want, except for the last five rows in Channel B and C, where a comma appears instead of a decimal point. I don't know why it works everywhere else but not for those few rows. The CSV file can be found here.
It looks like a data type issue: some of the values are read in as strings, so pandas will not automatically convert them to float, and the decimal handling never gets applied to them.
One option is to convert each affected column after you read the file with something like: df['colname'] = df['colname'].str.replace(',', '.').astype(float)
I think you need to treat the non-numeric symbols -∞ and ∞ as NaN already while reading, not after the fact. If you replace them only after the DataFrame is created, the values have already been read in and parsed as data type str instead of float, which messes up the data types of the whole column.
So instead of na_values='NaN', pass na_values=["-∞", "∞"], and the code becomes:
dfs = pd.read_csv(x, delimiter=";", skiprows=(1,2), decimal=",", na_values=["-∞", "∞"])
#replaces non-numeric symbols to NaN.
# dfs = dfs.replace({'-∞': np.nan, '∞': np.nan}) # not needed anymore
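Putting the two suggestions together, a read that handles both the infinity markers and any leftover string columns might look like the sketch below. The filename is a placeholder, the column names are taken from the question, and the fallback loop is illustrative rather than part of either original answer:
import numpy as np
import pandas as pd

# Treat the infinity markers as NaN while reading, so decimal="," applies to the numeric columns.
dfs = pd.read_csv("scope_export.csv", delimiter=";", skiprows=(1, 2),
                  decimal=",", na_values=["-∞", "∞"])

# Fallback: if any channel column still came through as strings,
# convert it manually by swapping the decimal comma for a point.
for col in ['Channel A', 'Channel B', 'Channel C', 'Channel D']:
    if dfs[col].dtype == object:
        dfs[col] = dfs[col].str.replace(',', '.').astype(float)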

Python pandas DataFrame set_names length issue: expecting one name but given two

So I have this prep_data function and I am giving it the following CSV data:
identifier,Hugo_Symbol,Tumor_Sample_Barcode,Variant_Classification,patient
1,patient,a,Silent,6
22,mutated,d,e,7
1,Hugo_Symbol,f,g,88
Within this prep_data function, there is this line:
gene_mutation_df.index.set_names(['Hugo_Symbol', 'patient'], inplace=True)
However, it keeps erroring out when it gets to that line, saying:
ValueError: Length of new names must be 1, got 2
Is there something wrong with that line, or is it something wrong with the function?
Here is the whole source code
import pandas as pd
import numpy as np
PRIMARY_TUMOR_PATIENT_ID_REGEX = '^.{4}-.{2}-.{4}-01.*'
SHORTEN_PATIENT_REGEX = '^(.{4}-.{2}-.{4}).*'
def mutations_for_gene(df):
    mutated_patients = df['identifier'].unique()
    return pd.DataFrame({'mutated': np.ones(len(mutated_patients))}, index=mutated_patients)

def prep_data(mutation_path):
    df = pd.read_csv(mutation_path, low_memory=True, dtype=str, header=0) # reads in a low-memory csv file from the given path and casts the data to str
    df = df[~df['Hugo_Symbol'].str.contains('Hugo_Symbol')] # analyzes the 'Hugo_Symbol' heading within the data and makes a new dataframe where any row that contains 'Hugo_Symbol' is dropped
    df['Hugo_Symbol'] = '\'' + df['Hugo_Symbol'].astype(str) # prepends '\'' to all the data remaining in that column
    df['Tumor_Sample_Barcode'] = df['Tumor_Sample_Barcode'].str.strip() # strips away whitespace from the data within this heading
    non_silent = df.where(df['Variant_Classification'] != 'Silent') # creates a new dataframe where the data within the column 'Variant_Classification' is not equal to 'Silent'
    df = non_silent.dropna(subset=['Variant_Classification']) # drops all the rows that are missing at least one element
    non_01_barcodes = df[~df['Tumor_Sample_Barcode'].str.contains(PRIMARY_TUMOR_PATIENT_ID_REGEX)]['Tumor_Sample_Barcode'] # creates a Series of all the 'Tumor_Sample_Barcode' values that do not match PRIMARY_TUMOR_PATIENT_ID_REGEX
    # TODO: Double check that the extra ['Tumor_Sample_Barcode'] serves no purpose
    df = df.drop(non_01_barcodes.index)
    print(df)
    shortened_patients = df['Tumor_Sample_Barcode'].str.extract(SHORTEN_PATIENT_REGEX, expand=False)
    df['identifier'] = shortened_patients
    gene_mutation_df = df.groupby(['Hugo_Symbol']).apply(mutations_for_gene)
    gene_mutation_df.index.set_names(['Hugo_Symbol', 'patient'], inplace=True)
    gene_mutation_df = gene_mutation_df.reset_index()
    gene_patient_mutations = gene_mutation_df.pivot(index='Hugo_Symbol', columns='patient', values='mutated')
    return gene_patient_mutations.transpose().fillna(0)
Any help would be greatly appreciated. (I know this wasn't too specific; I'm still trying to work out what this function does exactly and how I could make data to test it.)
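For context on the error message itself (this is general pandas behaviour, not a diagnosis of the code above): Index.set_names expects exactly as many names as the index has levels, so passing two names fails whenever the grouped result ends up with a plain single-level index rather than a two-level MultiIndex. A small standalone sketch:
import pandas as pd

# A single-level index only accepts one name:
single = pd.Index(['a', 'b', 'c'])
try:
    single.set_names(['Hugo_Symbol', 'patient'])
except ValueError as e:
    print(e)   # Length of new names must be 1, got 2

# A two-level MultiIndex accepts both names:
multi = pd.MultiIndex.from_tuples([('TP53', 1), ('TP53', 2), ('KRAS', 1)])
print(multi.set_names(['Hugo_Symbol', 'patient']).names)   # FrozenList(['Hugo_Symbol', 'patient'])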

Stacking .csv files vertically with Pandas in Python

So I have been trying to merge .csv files with Pandas, and to create a couple of functions to automate it, but I keep having an issue.
My problem is that I want to stack one .csv after the other (same number of columns, different number of rows), but instead of getting a bigger csv with the same number of columns, I get a bigger csv with more columns and rows (correct number of rows, but more columns than there should be).
The code I'm using is this one:
import os
import pandas as pd
def stackcsv(content_folder):
    global combined_csv
    combined_csv = []
    entries = os.listdir(content_folder)
    for i in entries:
        csv_path = os.path.join(content_folder, i)
        solo_csv = pd.read_csv(csv_path, index_col=None)
        combined_csv.append(solo_csv)
    csv_final = pd.concat(combined_csv, axis=0, ignore_index=True)
    return csv_final.to_csv("final_data.csv", index=None, header=None)
I have 3 .csv files that have a size of 20000x17, and I want to merge them into one of 60000x17. I suppose my error must be in the arguments of index, header, index_col, etc.
Thanks in advance.
So after modifying the code, it worked. First of all, as Serge Ballesta said, it is necessary to tell read_csv that there is no header. Finally, using sort=False, the function works perfectly. This is the final code that I have used, and the final .csv is 719229 rows × 17 columns. Thanks to everybody!
import os
import pandas as pd
def stackcsv(content_folder):
    global combined_csv
    combined_csv = []
    entries = os.listdir(content_folder)
    for i in entries:
        csv_path = os.path.join(content_folder, i)
        solo_csv = pd.read_csv(csv_path, index_col=None, header=None)
        combined_csv.append(solo_csv)
    csv_final = pd.concat(combined_csv, axis=0, sort=False)
    return csv_final.to_csv("final_data.csv", header=None)
If the files have no header, you must tell read_csv so. If you don't, the first line of each file is read as a header line. As a result the DataFrames have different column names and concat will add new columns. So you should read with:
solo_csv = pd.read_csv(csv_path,index_col=None, header=None)
Alternatively, there is no reason to decode them at all; you could just concatenate the files sequentially:
def stackcsv(content_folder):
    with open("final_data.csv", "w") as fdout:
        entries = os.listdir(content_folder)
        for i in entries:
            csv_path = os.path.join(content_folder, i)
            with open(csv_path) as fdin:
                while True:
                    chunk = fdin.read(64 * 1024)  # copy in fixed-size chunks to keep memory usage low
                    if len(chunk) == 0:
                        break
                    fdout.write(chunk)
Set the sort parameter to False in the pandas concat call:
csv_final = pd.concat(combined_csv, axis = 0, ignore_index=True, sort=False)
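Putting both suggestions together (header=None on the read, sort=False on the concat), a compact version of the function might look like this sketch; the folder argument and output filename are just placeholders:
import os
import pandas as pd

def stackcsv(content_folder, output_path="final_data.csv"):
    # Read every file without a header row, so all frames share the same
    # positional column labels (0..16) and concat stacks them vertically.
    frames = [pd.read_csv(os.path.join(content_folder, name), header=None, index_col=None)
              for name in sorted(os.listdir(content_folder))]
    csv_final = pd.concat(frames, axis=0, ignore_index=True, sort=False)
    csv_final.to_csv(output_path, index=False, header=False)
    return csv_final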

Excluding count from DataFrame if value is 0

I have the code below:
import pandas as pd
import numpy as np
# Read data from file 'filename.csv'
# (in the same directory that your python process is based)
# Control delimiters, rows, column names with read_csv (see later)
df = pd.read_csv("../example.csv", parse_dates=['DateTime'])
# Preview the first 5 lines of the loaded data
df = df.assign(Burned = df['Quantity'])
df.loc[df['To'] != '0x0000000000000000000000000000000000000000', 'Burned'] = 0.0
# OR:
df['cum_sum'] = df['Burned'].cumsum()
df['percent_burned'] = df['cum_sum']/df['Quantity'].max()*100.0
per_day = df.groupby(df['DateTime'].dt.date)['Burned'].count().reset_index(name='Trx')
The last line creates a DataFrame called per_day which groups all the transactions from the df DataFrame by day. Many of the transactions in df have 'Burned' = 0. My current code counts the total number of transactions per day, but I want to exclude the transactions in which 'Burned' = 0 from that count.
Also, how can I consolidate these 3 lines of code? I don't even want to create per_day_burned. How do I just make this its own column in per_day without doing it the way I did it?
per_day = dg.groupby(dg['DateTime'].dt.date)['Burned'].count().reset_index(name='Trx')
per_day_burned = dg.groupby(dg['DateTime'].dt.date)['Burned'].sum().reset_index(name='Burned')
per_day['Burned'] = per_day_burned['Burned']
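A sketch of one way to do both at once (illustrative only, not from the original thread): count only the rows where Burned is non-zero and sum the burned amounts in the same groupby, assuming the df built above.
# Count only the transactions with Burned != 0 and sum the burned amounts per day,
# in a single groupby instead of three separate lines.
per_day = (
    df.groupby(df['DateTime'].dt.date)['Burned']
      .agg(Trx=lambda s: int((s != 0).sum()), Burned='sum')
      .reset_index()
)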

Save columns as csv pandas

I'm trying to save specific columns to a csv using pandas. However, there is only one line in the output file. Is there anything wrong with my code? My desired output is to save all columns where d.count() == 1 to a csv file.
import pandas as pd
results = pd.read_csv('employee.csv', sep=';', delimiter=';', low_memory=False)
results['index'] = results.groupby('Name').cumcount()
d = results.pivot(index='index', columns='Name', values='Job')
for columns in d:
    if (d[columns]).count() > 1:
        (d[columns]).dropna(how='any').to_csv('output.csv')
An alternative might be to populate a new dataframe containing what you want to save, and then save that one time.
import pandas as pd
results = pd.read_csv('employee.csv', sep=';', delimiter=';', low_memory=False)
results['index'] = results.groupby('Name').cumcount()
d = results.pivot(index='index', columns='Name', values='Job')
keepcols = []
for columns in d:
    if (d[columns]).count() > 1:
        keepcols.append(columns)
output_df = d[keepcols]
output_df.to_csv('output.csv')
No doubt you could rationalise the above, and reduce the memory footprint by saving the output directly without first creating an object to hold it, but it helps identify what's going on in the example.
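As a follow-up on that last point, the loop can be collapsed into a single boolean column selection and one save; a sketch of that rationalised version, using the same pivoted d as above:
# Keep only the pivoted columns that have more than one non-NaN value, then write once.
d.loc[:, d.count() > 1].to_csv('output.csv')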
