I am using the following code to import the CSV file. It works well except when it encounters a three-digit number followed by a decimal. Below is my code and the result.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
def fft(x, Plot_ShareY=True):
    dfs = pd.read_csv(x, delimiter=";", skiprows=(1,2), decimal=",", na_values='NaN') #loads the csv file
    #replaces non-numeric symbols with NaN
    dfs = dfs.replace({'-∞': np.nan, '∞': np.nan})
    #print(dfs) #before dropping NaNs
    #each column taken into a separate variable
    time = dfs['Time'] #- np.min(dfs['Time'])
    channelA = dfs['Channel A']
    channelB = dfs['Channel B']
    channelC = dfs['Channel C']
    channelD = dfs['Channel D']
    channels = [channelA, channelB, channelC, channelD]
    #finding the smallest index that is NaN in each channel
    ind_num_A = np.where(channelA.isna())[0][0]
    ind_num_B = np.where(channelB.isna())[0][0]
    ind_num_C = np.where(channelC.isna())[0][0]
    ind_num_D = np.where(channelD.isna())[0][0]
    ind_num = [ind_num_A, ind_num_B, ind_num_C, ind_num_D]
    #dropping all rows after the first NaN is found
    rem_ind = np.amin(ind_num) #finds the array-wise minimum
    #print('smallest index to be deleted is: ' + str(rem_ind))
    dfs = dfs.drop(dfs.index[rem_ind:])
    print(dfs) #after dropping NaNs
The result is what I want except for the last five rows in Channels B and C, where a comma appears instead of a point as the decimal separator. I don't know why it works everywhere else but not for those few rows. The CSV file can be found here.
It looks like a data type issue. Some of the values are strings, so pandas does not convert those columns to float, and the ',' decimals never get replaced with '.'.
One option is to convert each affected column after you read the file, with something like: df['colname'] = df['colname'].str.replace(',', '.').astype(float)
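For example, a minimal sketch applying that to the two columns reported as affected (assuming 'Channel B' and 'Channel C' were read in as strings):
for col in ['Channel B', 'Channel C']:
    # only needed if the column came in as object/str instead of float
    dfs[col] = dfs[col].str.replace(',', '.', regex=False).astype(float)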
I think you need to treat the non-numeric symbols -∞ and ∞ as NaN already while reading, not after the fact. If you do it after the data frame is created, the values have already been read in and parsed as data type str instead of float. This messes up the data types of the columns.
So instead of na_values='NaN', use na_values=["-∞", "∞"], making the code look like this:
dfs = pd.read_csv(x, delimiter=";", skiprows=(1,2), decimal=",", na_values=["-∞", "∞"])
#replaces non-numeric symbols with NaN
# dfs = dfs.replace({'-∞': np.nan, '∞': np.nan}) # not needed anymore
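A quick way to confirm the fix (my own check, not part of the original answer) is to print the dtypes after reading; with the infinities declared as na_values, the channel columns should come back as float64 and the decimal commas are converted on read:
dfs = pd.read_csv(x, delimiter=";", skiprows=(1,2), decimal=",", na_values=["-∞", "∞"])
print(dfs.dtypes)  # the 'Channel ...' columns should now be float64, not object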
I have data that looks like this (from jq)
script_runtime{application="app1",runtime="1651394161"} 1651394161
folder_put_time{application="app1",runtime="1651394161"} 22
folder_get_time{application="app1",runtime="1651394161"} 128.544
folder_ls_time{application="app1",runtime="1651394161"} 3.868
folder_ls_count{application="app1",runtime="1651394161"} 5046
The dataframe should allow manipulating each row into this:
script_runtime,app1,1651394161,1651394161
folder_put_time,app1,1651394161,22
It's in a text file. How can I easily load it into pandas for data manipulation?
Load the .txt using pd.read_csv(), specifying whitespace as the separator (similar StackOverflow answer). The result will be a two-column dataframe with the bracketed text in the first column and the numeric value in the second column.
df = pd.read_csv("textfile.txt", header=None, delimiter=r"\s+")
Parse the bracketed text into separate columns:
df['function'] = df[0].str.split("{",expand=True)[0]
df['application'] = df[0].str.split("\"",expand=True)[1]
df['runtime'] = df[0].str.split("\"",expand=True)[3]
The result is a dataframe with the new function, application and runtime columns added alongside the original two.
If you want to drop the first column which contains the bracketed value:
df = df.iloc[: , 1:]
Full code:
df = pd.read_csv("textfile.txt", header=None, delimiter=r"\s+")
df['function'] = df[0].str.split("{",expand=True)[0]
df['application'] = df[0].str.split("\"",expand=True)[1]
df['runtime'] = df[0].str.split("\"",expand=True)[3]
df = df.iloc[: , 1:]
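As a variation (my own sketch, not part of the answer above), the three fields can also be pulled out in one pass with str.extract, assuming every line follows the name{application="...",runtime="..."} value pattern shown in the question:
import pandas as pd

df = pd.read_csv("textfile.txt", header=None, delimiter=r"\s+")

# one regex with three named capture groups: metric name, application, runtime
parts = df[0].str.extract(r'^(?P<function>\w+)\{application="(?P<application>[^"]+)",runtime="(?P<runtime>[^"]+)"\}')
df = parts.join(df[1].rename("value"))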
So I have this prep_data function and I am giving it the following CSV data:
identifier,Hugo_Symbol,Tumor_Sample_Barcode,Variant_Classification,patient
1,patient,a,Silent,6
22,mutated,d,e,7
1,Hugo_Symbol,f,g,88
Within this prep_data function, there is this line:
gene_mutation_df.index.set_names(['Hugo_Symbol', 'patient'], inplace=True)
However, it keeps erroring out when it gets to that line, saying:
ValueError: Length of new names must be 1, got 2
Is there something wrong with that line, or is it something wrong with the function?
Here is the whole source code:
import pandas as pd
import numpy as np
PRIMARY_TUMOR_PATIENT_ID_REGEX = '^.{4}-.{2}-.{4}-01.*'
SHORTEN_PATIENT_REGEX = '^(.{4}-.{2}-.{4}).*'
def mutations_for_gene(df):
    mutated_patients = df['identifier'].unique()
    return pd.DataFrame({'mutated': np.ones(len(mutated_patients))}, index=mutated_patients)
def prep_data(mutation_path):
    df = pd.read_csv(mutation_path, low_memory=True, dtype=str, header=0)  # reads the csv file from the given path, treating the first row as the header and casting the data to str
    df = df[~df['Hugo_Symbol'].str.contains('Hugo_Symbol')]  # drops any row whose 'Hugo_Symbol' value contains 'Hugo_Symbol' (i.e. repeated header rows)
    df['Hugo_Symbol'] = '\'' + df['Hugo_Symbol'].astype(str)  # prepends a ' character to all the values remaining in that column
    df['Tumor_Sample_Barcode'] = df['Tumor_Sample_Barcode'].str.strip()  # strips whitespace from the values in this column
    non_silent = df.where(df['Variant_Classification'] != 'Silent')  # masks (sets to NaN) every row where 'Variant_Classification' equals 'Silent'
    df = non_silent.dropna(subset=['Variant_Classification'])  # drops the rows that were just masked
    non_01_barcodes = df[~df['Tumor_Sample_Barcode'].str.contains(PRIMARY_TUMOR_PATIENT_ID_REGEX)]['Tumor_Sample_Barcode']  # barcodes that do not match PRIMARY_TUMOR_PATIENT_ID_REGEX
    #TODO: Double check that the extra ['Tumor_Sample_Barcode'] serves no purpose
    df = df.drop(non_01_barcodes.index)
    print(df)
    shortened_patients = df['Tumor_Sample_Barcode'].str.extract(SHORTEN_PATIENT_REGEX, expand=False)
    df['identifier'] = shortened_patients
    gene_mutation_df = df.groupby(['Hugo_Symbol']).apply(mutations_for_gene)
    gene_mutation_df.index.set_names(['Hugo_Symbol', 'patient'], inplace=True)
    gene_mutation_df = gene_mutation_df.reset_index()
    gene_patient_mutations = gene_mutation_df.pivot(index='Hugo_Symbol', columns='patient', values='mutated')
    return gene_patient_mutations.transpose().fillna(0)
Any help would be greatly appreciated (I know this wasn't too specific; I'm still trying to work out what this function does exactly and how I could make data to test it).
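One quick way to narrow this down is to check how many index levels the groupby/apply result actually has, since set_names(['Hugo_Symbol', 'patient']) only succeeds when there are two. A minimal sketch using just the sample rows from the question (a toy check, not the real mutation file):
import io
import numpy as np
import pandas as pd

csv = """identifier,Hugo_Symbol,Tumor_Sample_Barcode,Variant_Classification,patient
1,patient,a,Silent,6
22,mutated,d,e,7
1,Hugo_Symbol,f,g,88
"""
df = pd.read_csv(io.StringIO(csv), dtype=str)

def mutations_for_gene(df):
    mutated_patients = df['identifier'].unique()
    return pd.DataFrame({'mutated': np.ones(len(mutated_patients))}, index=mutated_patients)

gene_mutation_df = df.groupby(['Hugo_Symbol']).apply(mutations_for_gene)

# two levels are expected here; a value of 1 on the real data would explain the ValueError
print(gene_mutation_df.index.nlevels)
print(gene_mutation_df.index)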
I am running the below code on a file with close to 300k lines. I know my code is not very efficient as it takes forever to finish; can anyone advise me on how I can speed it up?
import sys
import numpy as np
import pandas as pd
file = sys.argv[1]
df = pd.read_csv(file, delimiter=' ',header=None)
df.columns = ["ts", "proto", "orig_bytes", "orig_pkts", "resp_bytes", "resp_pkts", "duration", "conn_state"]
orig_bytes = np.array(df['orig_bytes'])
resp_bytes = np.array(df['resp_bytes'])
size = np.array([])
ts = np.array([])
for i in range(len(df)):
    if orig_bytes[i] > resp_bytes[i]:
        size = np.append(size, orig_bytes[i])
        ts = np.append(ts, df['ts'][i])
    else:
        size = np.append(size, resp_bytes[i])
        ts = np.append(ts, df['ts'][i])
The aim is to only record instances where one of the two (orig_bytes or resp_bytes) is the larger one.
Thank you all for your help.
I can't guarantee that this will run faster than what you have, but it is a more direct route to where you want to go. Also, I'm assuming based on your example that you don't want to keep instances where the two byte values are equal and that you want a separate DataFrame in the end, not a new column in the existing df:
After you've created your DataFrame and renamed the columns, you can use query to drop all the instances where orig_bytes and resp_bytes are the same, create a new column with the max value of the two, and then narrow the DataFrame down to just the two columns you want.
df = pd.read_csv(file, delimiter=' ',header=None)
df.columns = ["ts", "proto", "orig_bytes", "orig_pkts", "resp_bytes", "resp_pkts", "duration", "conn_state"]
df_new = df.query("orig_bytes != resp_bytes")
df_new['biggest_bytes'] = df_new[['orig_bytes', 'resp_bytes']].max(axis=1)
df_new = df_new[['ts', 'biggest_bytes']]
If you do want to include the entries where they are equal to each other, then just skip the query step.
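For reference, the exact vectorized equivalent of the original loop (which records the larger value for every row, ties included) is just a row-wise maximum; this sketch assumes df has already been read and its columns renamed as in the question:
import numpy as np

# row-wise maximum of the two byte columns, plus the matching timestamps
size = np.maximum(df['orig_bytes'].to_numpy(), df['resp_bytes'].to_numpy())
ts = df['ts'].to_numpy()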
I have the following code:
import pandas as pd
file_path = 'https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data'
dataset2 = pd.read_csv(file_path, header=None, dtype=str)
v = dataset2.values
f = pd.factorize(v.ravel())[0].reshape(v.shape)
dataset1 = pd.DataFrame(f)
df = dataset1.astype('str')
dataset = df.values.tolist()
print (type (dataset))
print (type (dataset[1]))
print (type (dataset[1][1]))
The target is to transform the whole dataset into values 1..n for each distinct value, and afterwards to turn it into a list of lists where each element is a string.
The above code works great. However, when I change the dataset to:
file_path ='https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/vowel/vowel-context.data'
I get an error. How can it work for this dataset as well?
You need to understand the data you're working with. A quick print call would've helped you realise the delimiters with this one are different.
Furthermore, it appears to be numeric data; you don't need an str conversion anymore.
file_path = 'https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/vowel/vowel-context.data'
t = pd.read_csv(file_path, header=None, delim_whitespace=True)
v = t.values
f = pd.factorize(v.ravel())[0].reshape(v.shape)
df = pd.DataFrame(f)
If you want pandas to guess the delimiter, you can use sep=None (delimiter sniffing uses the python parsing engine):
t = pd.read_csv(file_path, header=None, sep=None, engine='python')
I don't recommend this because it is very easy for pandas to make mistakes when loading your data with an inferred delimiter.
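To get back to the question's end goal (a list of lists where every element is a string), a minimal end-to-end sketch combining the corrected read with the original post-processing might look like this:
import pandas as pd

file_path = 'https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/vowel/vowel-context.data'

# whitespace-delimited, no header row
t = pd.read_csv(file_path, header=None, delim_whitespace=True)

# factorize every distinct value across the whole frame, then convert back to strings
v = t.values
f = pd.factorize(v.ravel())[0].reshape(v.shape)
dataset = pd.DataFrame(f).astype('str').values.tolist()

print(type(dataset), type(dataset[1]), type(dataset[1][1]))  # list, list, str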
I have the following csv file:
There are about 6-8 rows at the top of the file. I know how to make a new dataframe in Pandas and filter the data:
df = pd.read_csv('payments.csv')
df = df[df["type"] == "Order"]
print df.groupby('sku').size()
df = df[df["marketplace"] == "amazon.com"]
print df.groupby('sku').size()
df = df[df["promotional rebates"] > ((df["product sales"] + df["shipping credits"])*-.25)]
print df.groupby('sku').size()
df.to_csv("out.csv")
My issue is with the headers. I need to:
1. look for the row that has date/time and another field, so that I do not have to change my code if the file keeps changing the number of rows before the headers;
2. make a new DF excluding those rows.
What is the best approach to make sure the code does not break when the file changes, as long as the header row exists and has a few matching fields? Open to any suggestions.
Considering a CSV file like this:
random line content
another random line
yet another one
datetime, settelment id, type
dd, dd, dd
You can use the following to compute the header's line number:
#load the first 20 rows of the csv file as a one column dataframe
#to look for the header
df = pd.read_csv("csv_file.csv", sep="|", header=None, nrows=20)
# use a regular expression to check which row contains the header
# the following will generate an array of booleans
# with True if the row contains the regex "datetime.+settelment id.+type"
indices = df.iloc[:,0].str.contains("datetime.+settelment id.+type")
# get the row index of the header
header_index = df[indices].index.values[0]
and read the csv file starting from the header's index:
# to read the csv file, use the following:
df = pd.read_csv("csv_file.csv", skiprows=header_index+1)
Reproducible example:
import pandas as pd
from StringIO import StringIO  # Python 2; on Python 3 use: from io import StringIO
st = """
random line content
another random line
yet another one
datetime, settelment id, type
dd, dd, dd
"""
df = pd.read_csv(StringIO(st), sep="|", header=None, nrows=20)
indices = df.iloc[:,0].str.contains("datetime.+settelment id.+type")
header_index = df[indices].index.values[0]
df = pd.read_csv(StringIO(st), skiprows=header_index+1)
print(df)
print("columns")
print(df.columns)
print("shape")
print(df.shape)
Output:
datetime settelment id type
0 dd dd dd
columns
Index([u'datetime', u' settelment id', u' type'], dtype='object')
shape
(1, 3)