I produce a query with 13 columns of values. Every single one of these values is manually entered, which means there is a less than 10% chance that a row is entered wrong. However, that is not the issue. The issue is that sometimes certain special characters are entered that can cause havoc in the database, and I need to filter/remove this content from the CSV file.
Here is a simple sample of the output of the CSV file
TypeOfEntry;Schoolid;Schoolyear;Grade;Classname;ClassId;firstname;lastname;Gender;nationality;Street;housenumber;Email;
;;;;;;;;;;;;; (1st line empty, 13 semicolons per row)
U;98645;2022;4;4AG;59845;John;Bizley;Male;United Kingdom;Canterburrystreet; 15a; Jb2004#hotmail.com;
U;98645;2022;4;4AG;59847;Alice;Schmidt;Female;United Kingdom;Milton street; 2/3; alice.schmidt#hotmail.com;
Now, on rare occasions someone might want to add a second email, which is not allowed, but they still do it, and what's worse, they add a semicolon to it. That means that when the CSV is loaded there are rows that surpass 13 columns.
U;98645;2022;5;6CD;59845;Billy;Snow;Male;United Kingdom;Freedom street; 2a; BillyS#gmail.com;Billysnow2004#hotmail.com;
Therefore, to solve this problem I need to count the number of delimiters in each row, and if I find a row that exceeds that count, I need to clear the excessive data, even if it means losing that data for that particular person. In other words, everything after the 13th column needs to be removed.
Here is my code sample in Python. You will also notice that I am filtering other special characters from the CSV file:
import pandas as pd
from datetime import datetime

data = pd.read_csv("schooldata.csv", sep=';')
data.columns = ['TypeOfEntry','Schoolid','Schoolyear','Grade','Classname','ClassId','Firstname','Lastname','Gender','Nationality','Street','Housenumber','Email']
date = datetime.now().strftime("%Y_%m_%d")
data = data.convert_dtypes()
#df = data.dataframe()

# special characters to strip from every field (the caret is escaped so it is matched literally)
rep_chars = r'°|\^|!|"|\(|\)|\?'
rep_chars2 = r"'|`|´|\*|#"
data = data.replace(rep_chars, '', regex=True)
data = data.replace(rep_chars2, '', regex=True)
data = data.replace(r'\+', '-', regex=True)

print(data.head())
print(data.dtypes)
data.to_csv(f'schoolexport_{date}.csv', sep=';', date_format='%Y%m%d', index=False)
A very, very basic approach, but maybe it will be enough:
import pandas as pd

df = pd.read_csv(r"C:\Test\test.csv", sep=';')

data = df.iloc[:, :13].copy()                                   # data to use in later code
excessive_data = df.iloc[:, 13:].copy().reset_index(drop=True)  # excessive data will land after column 13

# checking if any excessive data is present
if not excessive_data.empty:
    pos = excessive_data[excessive_data.notnull().any(axis=1)].index.tolist()  # rows where any extra column is populated
    print(f"excessive data is present in rows index: {pos}")
I am using the following code to import the CSV file. It works well except when it encounters a three-digit number followed by a decimal. Below is my code and the result.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def fft(x, Plot_ShareY=True):
    dfs = pd.read_csv(x, delimiter=";", skiprows=(1, 2), decimal=",", na_values='NaN')  # loads the csv file

    # replaces non-numeric symbols with NaN
    dfs = dfs.replace({'-∞': np.nan, '∞': np.nan})
    #print(dfs)  # before dropping NaNs

    # each column taken into a separate variable
    time = dfs['Time']  #- np.min(dfs['Time'])
    channelA = dfs['Channel A']
    channelB = dfs['Channel B']
    channelC = dfs['Channel C']
    channelD = dfs['Channel D']
    channels = [channelA, channelB, channelC, channelD]

    # finding the smallest index number which is NaN in each channel
    ind_num_A = np.where(channelA.isna())[0][0]
    ind_num_B = np.where(channelB.isna())[0][0]
    ind_num_C = np.where(channelC.isna())[0][0]
    ind_num_D = np.where(channelD.isna())[0][0]
    ind_num = [ind_num_A, ind_num_B, ind_num_C, ind_num_D]

    # dropping all rows after the first NaN is found
    rem_ind = np.amin(ind_num)  # finds the array-wise minimum
    #print('smallest index to be deleted is: ' + str(rem_ind))
    dfs = dfs.drop(dfs.index[rem_ind:])

    print(dfs)  # after dropping NaNs
The result is what I want, except for the last five rows in Channel B and C, where a comma is shown instead of a point as the decimal separator. I don't know why it works everywhere else but not for these few rows. The CSV file can be found here.
It looks like a data type issue. Some of the values are read in as strings, so pandas does not convert them to float, and the ',' never gets replaced by '.'.
One option is to convert each affected column after you read the file with something like: df['colname'] = df['colname'].str.replace(',', '.').astype(float)
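For example, applied to the channel columns from the question, a minimal sketch (the file name data.csv is a placeholder, and it assumes the affected columns are 'Channel B' and 'Channel C' and that they came in as strings):

import pandas as pd

dfs = pd.read_csv("data.csv", delimiter=";", skiprows=(1, 2), decimal=",", na_values='NaN')

# only columns that stayed object (string) still need the comma-to-point fix
for col in ['Channel B', 'Channel C']:
    if dfs[col].dtype == object:
        dfs[col] = dfs[col].str.replace(',', '.').astype(float)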
I think you need to treat the non-numeric symbols -∞ and ∞ as NaN already while reading, not after the fact. If you do it after the data frame is created, the values have already been read in and the column is parsed as data type str instead of float, which messes up the data types of the column.
So instead of na_values='NaN', use na_values=["-∞", "∞"], so the code looks like this:
dfs = pd.read_csv(x, delimiter=";", skiprows=(1,2), decimal=",", na_values=["-∞", "∞"])
#replaces non-numeric symbols to NaN.
# dfs = dfs.replace({'-∞': np.nan, '∞': np.nan}) # not needed anymore
I've read all related topics - like this, this and this - but couldn't get a solution to work.
I have an input csv file like this:
ItemId,Content
i0000008,{"Title":"Edison Kinetoscopic Record of a Sneeze","Year":"1894","Rated":"N/A"}
i0000010,{"Title":"Employees, Leaving the Lumiére, Factory","Year":"1895","Rated":"N/A"}
I've tried several different approaches but couldn't get it to work. I want to read this csv file into a Dataframe like this:
ItemId Content
-------- -------------------------------------------------------------------------------
i0000008 {"Title":"Edison Kinetoscopic Record of a Sneeze","Year":"1894","Rated":"N/A"}
i0000010 {"Title":"Employees, Leaving the Lumiére, Factory","Year":"1895","Rated":"N/A"}
With following code (Python 3.9):
df = pd.read_csv('test.csv', sep=',', skipinitialspace = True, quotechar = '"')
As far as I understand, the commas inside the dictionary column, even those inside quotation marks, are being treated as regular separators, so it raises the following error:
pandas.errors.ParserError: Error tokenizing data. C error: Expected 4 fields in line 3, saw 6
Is it possible to produce the desired result? Thanks.
The problem is that the commas in the Content column are interpreted as separators. You can work around this by using pd.read_fwf to manually set the character positions on which to split:
df = pd.read_fwf('test.csv', colspecs=[(0, 8),(9,100)], header=0, names=['ItemId', 'Content'])
Result:
     ItemId                                                                          Content
0  i0000008  {"Title":"Edison Kinetoscopic Record of a Sneeze","Year":"1894","Rated":"N/A"}
1  i0000010  {"Title":"Employees, Leaving the Lumiére, Factory","Year":"1895","Rated":"N/A"}
I don't think you'll be able to read it directly with pandas, because the delimiter appears multiple times within a single value; however, by reading it with plain Python and doing some processing, you should be able to convert it to a pandas DataFrame:
import pandas as pd

def splitValues(x):
    # split only on the first comma: everything after it belongs to the Content column
    index = x.find(',')
    return x[:index], x[index+1:].strip()

data = open('file.csv')
columns = next(data)
columns = columns.strip().split(',')
df = pd.DataFrame(columns=columns, data=(splitValues(row) for row in data))
OUTPUT:
ItemId Content
0 i0000008 {"Title":"Edison Kinetoscopic Record of a Sneeze","Year":"1894","Rated":"N/A"}
1 i0000010 {"Title":"Employees, Leaving the Lumiére, Factory","Year":"1895","Rated":"N/A"}
We are using Pandas to read a CSV into a dataframe:
someDataframe = pandas.read_csv(
    filepath_or_buffer=our_filepath_here,
    error_bad_lines=False,
    warn_bad_lines=True
)
Since we are allowing bad lines to be skipped, we want to be able to track how many have been skipped and store that count in a value so that we can report a metric from it.
To do this, I was thinking of comparing how many rows we have in the dataframe vs the number of rows in the original file.
I think this does what I want:
someDataframe = pandas.read_csv(
    filepath_or_buffer=our_filepath_here,
    error_bad_lines=False,
    warn_bad_lines=True
)
initialRowCount = sum(1 for line in open(our_filepath_here))
difference = initialRowCount - len(someDataframe.index)
But the hardware running this is super limited and I would rather not open the file and iterate through the whole thing just to get a row count when we're already going through the whole thing once via .read_csv. Does anyone know of a better way to get both the successfully processed count and the initial row count for the CSV?
Though I haven't tested this personally, I believe you can count the number of warnings generated by capturing them and checking the length of the returned list of captured warnings. Then add that to the current length of your dataframe:
import warnings
import pandas as pd

with warnings.catch_warnings(record=True) as warning_list:
    someDataframe = pd.read_csv(
        filepath_or_buffer=our_filepath_here,
        error_bad_lines=False,
        warn_bad_lines=True
    )

# may want to check that each captured warning is actually a pandas "bad line" warning
number_of_warned_lines = len(warning_list)
initialRowCount = len(someDataframe) + number_of_warned_lines
https://docs.python.org/3/library/warnings.html#warnings.catch_warnings
Edit: it took a little bit of toying around, but this seems to work with pandas. Instead of depending on the warnings built-in, we'll just temporarily redirect stderr. Then we can count the number of times "Skipping line" occurs in that string, and we end up with the count of bad lines that triggered this warning message!
import contextlib
import io
import pandas as pd

bad_data = io.StringIO("""
a,b,c,d
1,2,3,4
f,g,h,i,j,
l,m,n,o
p,q,r,s
7,8,9,10,11
""".lstrip())

new_stderr = io.StringIO()
with contextlib.redirect_stderr(new_stderr):
    df = pd.read_csv(bad_data, error_bad_lines=False, warn_bad_lines=True)

n_warned_lines = new_stderr.getvalue().count("Skipping line")
print(n_warned_lines)  # 2
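As a side note, on newer pandas versions (1.3+) error_bad_lines/warn_bad_lines are deprecated in favor of on_bad_lines, and from 1.4 the python engine accepts a callable, which makes counting the skipped lines straightforward. A rough sketch under that assumption:

import io
import pandas as pd

bad_data = io.StringIO("""a,b,c,d
1,2,3,4
f,g,h,i,j,
l,m,n,o
p,q,r,s
7,8,9,10,11
""")

bad_rows = []

def handle_bad_line(line):
    # called once per malformed row with the already-split fields; returning None drops the row
    bad_rows.append(line)
    return None

df = pd.read_csv(bad_data, engine="python", on_bad_lines=handle_bad_line)
print(len(bad_rows))  # 2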
I have a csv file that I am trying to convert into a data frame. But the data has some extra heading material that gets repeated. For example:
Results Generated Date Time
Sampling Info
Time; Data
1; 4.0
2; 5.2
3; 6.1
Results Generated Date Time
Sampling Info
Time; Data
6; 3.2
7; 4.1
8; 9.7
If it is a clean csv file without the extra heading material, I am using
df = pd.read_csv(r'Filelocation', sep=';', skiprows=2)
But I can't figure out how to remove the second set of header info. I don't want to lose the data below the second header set. Is there a way to remove it so the data is clean? The second header set is not always in the same location (basically a data acquisition mistake).
Thank you!
import pandas as pd

filename = 'filename.csv'
lines = open(filename).read().split('\n')        # reading the csv file
list_ = [e for e in lines if e != '']            # removing empty lines
list_ = [e for e in list_ if e[0].isdigit()]     # keeping only lines that start with a digit, i.e. dropping every header line
Time = [float(i.split(';')[0]) for i in list_]   # use int or float depending upon the requirements
Data = [float(i.split(';')[1].strip()) for i in list_]
df = pd.DataFrame({'Time': Time, 'Data': Data})  # making the dataframe
df
I hope this will do the job!
Try to split your text file after the first block of data. Then you can make two dataframes out of it and concatenate them.
with open("yourfile.txt", 'r') as f:
content = f.read()
# Make a list of subcontent
splitContent = content.split('Results Generated Date Time\nSampling Info\n')
Using "Results Generated Date Time\nSampling Info\n" as the argument for split, also removes those lines - This only works if the unnecessary header lines are always equal!
After this you get a list of your data as strings (variable: splitContent) separated by a delimiter (';').
Use this Answer to create dataframes from strings: https://stackoverflow.com/a/22605281/11005812.
Another approach could be to save each subcontent as its own file and read it again.
Concatenating dataframes: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
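Putting the pieces together, a rough sketch of this approach (it assumes the header lines are exactly as shown in the question, so each chunk starts with its own "Time; Data" line):

import io
import pandas as pd

with open("yourfile.txt", 'r') as f:
    content = f.read()

splitContent = content.split('Results Generated Date Time\nSampling Info\n')

# the first element is the (empty) text before the first header, so skip blank chunks
frames = [pd.read_csv(io.StringIO(chunk), sep=';') for chunk in splitContent if chunk.strip()]

df = pd.concat(frames, ignore_index=True)
df.columns = df.columns.str.strip()  # drop the leading space left over from "Time; Data"
print(df)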
My code:
import numpy as np
import pandas as pd
import time
tic = time.time()
I read a long file with the headers [meter] [daycode] [meter reading in kWh]: a time series of over 6,000 meters.
consum = pd.read_csv("data/File1.txt", delim_whitespace=True, encoding = "utf-8", names =['meter', 'daycode', 'val'], engine='python')
consum.set_index('meter', inplace=True)
Because I in fact have a total of 6 files of this humongous size, I want to filter out the meters with insufficient information. These are the time series whose [meter] values fall under code 3 by category. I can collect this category information from another file. The following is where I extract it:
id_total = pd.read_csv("data/meter_id_code.csv", header=0, encoding="cp1252")
#print(len(id_total.index))
id_total.set_index('Code', inplace=True)
id_other = id_total.loc[3].copy()
print(id_other)
And this is where I write to CSV to check whether the last line behaves correctly:
id_other.to_csv('data/id_other.csv', sep='\t', encoding='utf-8')
print(consum[~consum.index.isin(id_other)])
Output (of print(id_other)):
Problem:
I get the following warning. Here it says it shouldn't stop the code from working, but mine is affected. I checked the correct directory (earlier I had confused my remote connection to the GPU server with my own hardware) and the CSV file was created. It turns out the meter IDs in the file are not filtered out.
How can I fix the last line?