Removing rows whose index matches certain values - python

My code:
import numpy as np
import pandas as pd
import time
tic = time.time()
I read a long file with the columns [meter] [daycode] [meter reading in kWh]: a time series of over 6,000 meters.
consum = pd.read_csv("data/File1.txt", delim_whitespace=True, encoding = "utf-8", names =['meter', 'daycode', 'val'], engine='python')
consum.set_index('meter', inplace=True)
Because I have a total of 6 files of this humongous size, I want to filter out the meters with insufficient information: the time series whose [meter] values fall under category code 3. I can collect this category information from another file. The following is where I extract it.
id_total = pd.read_csv("data/meter_id_code.csv", header = 0, encoding="cp1252")
#print(len(id_total.index))
id_total.set_index('Code', inplace=True)
id_other = id_total.loc[3].copy()
print id_other
And this is where I write to csv to check whether the last step was performed correctly:
id_other.to_csv('data/id_other.csv', sep='\t', encoding='utf-8')
print consum[~consum.index.isin(id_other)]
Output: (of print id_other)
Problem:
I get the following warning. It says the warning shouldn't stop the code from working, but in my case it does. I checked the correct directory (earlier I had confused my remote connection to the GPU server with my own hardware) and the csv file was created. It turns out the meter IDs in the file are not filtered.
How can I fix the last line?
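The likely culprit is the last line itself: passing the id_other frame straight to isin means pandas iterates it, which yields column names rather than meter IDs. A minimal sketch of a fix, assuming the meter IDs sit in a single column of id_other (the column name 'ID' below is a hypothetical placeholder; substitute your file's actual header):
# pass the actual meter-ID values, not the DataFrame object itself
bad_meters = id_other['ID'].values  # 'ID' is a guessed column name
consum_filtered = consum[~consum.index.isin(bad_meters)]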

Related

How to count how many delimiters per row there are in Python

I produce a query with 13 columns of values. Every single one of these values is entered manually, which means there is roughly a less-than-10% chance that a row is entered wrong. However, that is not the issue. The issue is that sometimes special characters are entered that can cause havoc in the database, and I need to filter/remove this content from the CSV file.
Here is a simple sample of the output of the CSV file
TypeOfEntry;Schoolid;Schoolyear;Grade;Classname;ClassId;firstname;lastname;Gender;nationality;Street;housenumber;Email;
;;;;;;;;;;;;; (1st line empty, 13 semicolons per row)
U;98645;2022;4;4AG;59845;John;Bizley;Male;United Kingdom;Canterburrystreet; 15a; Jb2004#hotmail.com;
U;98645;2022;4;4AG;59847;Alice;Schmidt;Female;United Kingdom;Milton street; 2/3; alice.schmidt#hotmail.com;
Now, on rare occasions, someone might want to add a second email, which is not allowed, but they still do it, and what's worse, they add a semicolon to it. That means that when the csv is loaded, there are rows that surpass 13 columns.
U;98645;2022;5;6CD;59845;Billy;Snow;Male;United Kingdom;Freedom street; 2a; BillyS#gmail.com;Billysnow2004#hotmail.com;
Therefore, to solve this problem, I need to count the number of delimiters in each row, and if I find a row that exceeds that count, I need to clear the excessive data, even if it means losing that data for that particular person. So everything after the 13th column needs to be removed.
Here is my code sample in python. You will also notice that I am filtering other special characters from the csv file.
import pandas as pd
from datetime import datetime
data = pd.read_csv("schooldata.csv", sep = ';')
data.columns = ['TypeOfEntry','Schoolid','Schoolyear','Grade','Classname','ClassId','Firstname','Lastname','Gender','Nationality','Street','Housenumber','Email']
date = datetime.now().strftime("%Y_%m_%d")
data = data.convert_dtypes()
#df = data.dataframe()
rep_chars = r'°|\^|!|"|\(|\)|\?'  # caret must be escaped to match a literal '^'
rep_chars2 = r'\'|\`|\´|\*|#'
data = data.replace(rep_chars, '', regex=True)
data = data.replace(rep_chars2, '', regex=True)
data = data.replace(r'\+', '-', regex=True)
print(data.head())
print(data.dtypes)
data.to_csv(f'schoolexport_{date}.csv', sep=';', date_format='%Y%m%d', index=False)
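As an aside on the replacement patterns above, the two alternation lists can be collapsed into one character class, which tends to be easier to maintain (inside [...] most characters lose their special meaning, so only minimal escaping is needed). A sketch covering the same characters as rep_chars and rep_chars2 in a single pass:
# one class for °, ^, !, ", (, ), ?, ', `, ´, * and #
data = data.replace(r"[°^!\"()?'`´*#]", '', regex=True)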
A very, very basic approach, but maybe it will be enough:
import pandas as pd
df = pd.read_csv(r"C:\Test\test.csv", sep=';')
data = df.iloc[:, :13].copy()  # data to use in later code
excessive_data = df.iloc[:, 13:].copy().reset_index(drop=True)  # excessive data will land after column 13

# checking if any excessive data is present
if not excessive_data.empty:
    pos = excessive_data[excessive_data.notnull().any(axis=1)].index.tolist()
    print(f"excessive data is present in rows index: {pos}")

How to output only the calculations done in the code into a csv file in Python?

So I am working on some code where I take values from a csv file and multiply them by some numbers, but when I save and export the results, the values from the imported file are also copied to the new file along with the results. I just want the results in the output file.
import csv
import pandas as pd

df = pd.read_csv('DAQ4.csv')
df['furnace_power'] = df['furnace_voltage']*df['furnace_current']*0.52 #calculating the furnace power
df['heat_pump_power'] = (df['pump_current']*230)*0.62
with open('DAQsol.csv', 'w', newline='') as f:
    thewriter = csv.writer(f)
    df.to_csv('DAQsol.csv')
This is not the full code but should be enough to understand. So basically I just want the heat pump power and the furnace power to appear in the output file, not the whole pump current and voltage from the imported DAQ4 file.
df['furnace_power'] = df['furnace_voltage']*df['furnace_current']*0.52 #calculating the furnace power
df['heat_pump_power'] = (df['pump_current']*230)*0.62
These two lines just modify the dataframe that you loaded. This means that all the other columns still exist, but just aren't modified. By calling df.to_csv('DAQsol.csv') you are saving the whole dataframe, with the unwanted and unmodified columns.
One way to not export these columns to the output .csv file, is to drop them out. This can be achieved with the following code:
df = df.drop(columns=['unwanted_column_1', 'unwanted_column_2'])  # drop returns a copy, so assign it back
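Alternatively, to_csv itself takes a columns argument, so nothing has to be dropped from the dataframe at all; a short sketch:
# write only the two computed columns; df itself stays untouched
df.to_csv('DAQsol.csv', columns=['furnace_power', 'heat_pump_power'], index=False)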
Just make an empty dataframe and populate it with your new calculations:
df = pd.read_csv('DAQ4.csv')
df2 = pd.DataFrame()
df2['furnace_power'] = df['furnace_voltage']*df['furnace_current']*0.52 #calculating the furnace power
df2['heat_pump_power'] = (df['pump_current']*230)*0.62
df2.to_csv('DAQsol.csv')
Hi, here you have an example of reading and saving to a different file after some calculations:
import pandas as pd

#reading the csv file as dataframe
df = pd.read_csv('DAQ4.csv')
#calculating the two power columns from the source columns
df['furnace_power'] = df['furnace_voltage'] * df['furnace_current'] * 0.52
df['heat_pump_power'] = df['pump_current'] * 230 * 0.62
#saving the dataframe as csv file
df.to_csv('out_file.csv', sep=';', index=False)
I hope it works for you.

Comparing number of lines in a CSV compared to number successfully processed into dataframe by Pandas?

We are using Pandas to read a CSV into a dataframe:
someDataframe = pandas.read_csv(
    filepath_or_buffer=our_filepath_here,
    error_bad_lines=False,
    warn_bad_lines=True
)
Since we are allowing bad lines to be skipped, we want to be able to track how many have been skipped and put it into a value so that we can metric off of it.
To do this, I was thinking of comparing how many rows we have in the dataframe vs the number of rows in the original file.
I think this does what I want:
someDataframe = pandas.read_csv(
    filepath_or_buffer=our_filepath_here,
    error_bad_lines=False,
    warn_bad_lines=True
)
# note: this count includes the header line, which read_csv consumes separately
initialRowCount = sum(1 for line in open(our_filepath_here))
difference = initialRowCount - len(someDataframe.index)
But the hardware running this is super limited and I would rather not open the file and iterate through the whole thing just to get a row count when we're already going through the whole thing once via .read_csv. Does anyone know of a better way to get both the successfully processed count and the initial row count for the CSV?
Though I haven't tested this personally, I believe you can count the warnings generated by capturing them and checking the length of the returned list of captured warnings, then add that to the current length of your dataframe:
import warnings
import pandas as pd
with warnings.catch_warnings(record=True) as warning_list:
    someDataframe = pd.read_csv(
        filepath_or_buffer=our_filepath_here,
        error_bad_lines=False,
        warn_bad_lines=True
    )

# may want to check that each warning object is a pandas "bad line" warning
number_of_warned_lines = len(warning_list)
initialRowCount = len(someDataframe) + number_of_warned_lines
https://docs.python.org/3/library/warnings.html#warnings.catch_warnings
Edit: it took a little bit of toying around, but this seems to work with pandas. Instead of depending on the warnings built-in, we'll just temporarily redirect stderr. Then we can count the number of times "Skipping line" occurs in that stream, and we end up with the count of bad lines that produced this warning message:
import contextlib
import io

import pandas as pd
bad_data = io.StringIO("""
a,b,c,d
1,2,3,4
f,g,h,i,j,
l,m,n,o
p,q,r,s
7,8,9,10,11
""".lstrip())
new_stderr = io.StringIO()
with contextlib.redirect_stderr(new_stderr):
    df = pd.read_csv(bad_data, error_bad_lines=False, warn_bad_lines=True)

n_warned_lines = new_stderr.getvalue().count("Skipping line")
print(n_warned_lines)  # 2
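For what it's worth, on pandas 1.4+ (where error_bad_lines/warn_bad_lines are deprecated in favour of on_bad_lines) this becomes more direct: with the python engine, on_bad_lines accepts a callable that is invoked once per malformed row, so the bad rows can simply be collected. A sketch:
import io
import pandas as pd

bad_rows = []

def record_bad_row(fields):
    # receives the split fields of the offending row; returning None skips it
    bad_rows.append(fields)
    return None

data = io.StringIO("a,b,c,d\n1,2,3,4\nf,g,h,i,j\n5,6,7,8\n")
df = pd.read_csv(data, on_bad_lines=record_bad_row, engine="python")
print(len(bad_rows))  # 1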

Pandas: Sensor data has 3 headers. Need to retain them after gap-filling

I'm working with sensor data that comes off of Decagon Em50g data loggers. Before I do anything with these data, they need to be gap-filled according to our study period (start and end date/time). The raw sensor data from the loggers have 3 headers that need to be retained after gap-filling (they hold important metadata that gets pulled in by a second script that tidies up the data after gap-filling). I need to do this for many files and am trying to do it for one first.
My approach: I read in the sensor data in two ways, once to store the first 2 headers as 'headers_for_insert' (for later insertion) and a second time with the 3rd header as the main header ('with_headers') for joining/gap-filling. After this, I create a date_time series for the duration of our study trial and then join it to the 'with_headers' dataframe for gap-filling.
Then, I need to basically stack the other 2 headers back on the gap-filled data-frame so that the 3 headers are retained when I export as a csv.
I've tried so many things and would appreciate any guidance on how to come up with a solution.
[Images: the raw table with its 3 header rows, and what the final gap-filled table needs to look like]
# import necessary libraries
import pandas as pd
import glob
# read in data, one with headers and one without
without_headers = pd.read_excel('D1Y-05Feb2018-1331.xls', header=None)
with_headers = pd.read_excel('D1Y-05Feb2018-1331.xls', header=2)
# subset first two rows from 'without_headers' (these will be inserted as headers later)
headers_for_insert = without_headers.iloc[0:2,:]
# change date-time heading to 'date_time'
with_headers = with_headers.rename(columns = {'Measurement Time':'date_time'})
with_headers = with_headers.set_index('date_time')
# create variables for your start and end date/time
start='12/15/2017 00:00:00'
end='2/4/2018 12:00:00'
# create dataframe that has the date-time series for duration of study trial
date_range = pd.date_range(start, end, freq='H')
date_range_series = pd.Series(date_range)
date_range_df = pd.DataFrame(date_range_series)
date_range_df.columns = ['date_time']
date_range_df = date_range_df.set_index('date_time')
# Left hand join using created time-series as series to join on.
gap_filled = date_range_df.join(with_headers)
gap_filled = gap_filled.reset_index()
# stack 'headers_for_insert' on top of 'gap_filled' dataframe
'''Here I want to put the 2 headers that I stored in 'headers_for_insert'
on top of the header in 'gap_filled' so that the output csv will have a
total of 3 headers'''
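For reference, a sketch of that stacking step staying with plain csv writing: write the two stored header rows into an open file first, then append the gap-filled frame (with its own header row) underneath, keeping one explicit encoding throughout. The output file name here is a placeholder:
# 2 metadata rows first, then the gap-filled data with its own (3rd) header row
with open('gap_filled_with_headers.csv', 'w', encoding='utf-8', newline='') as f:
    headers_for_insert.to_csv(f, index=False, header=False)
    gap_filled.to_csv(f, index=False, header=True)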
Thank you to those who have given this question some thought.
Update:
I found a solution with the following code (opening it in Excel, everything looks good).
However, I'm getting an error when I try to read it back in as a dataframe.
# write 'headers_for_insert' to csv as a way to start building table.
headers_for_insert.to_csv('headers.csv', index=False, header=False)
# open 'headers.csv' and append the 'gap_filled' dataframe to the 2 headers
with open('headers.csv', 'a') as f:
    gap_filled.to_csv(f, index=False, header=True, encoding='utf-8')
# the above lines of code worked (here I'm reading in the csv to test)
solved = pd.read_csv('headers.csv')
However, this is the error I get when reading the csv back in as a dataframe: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb3 in position 1: invalid start byte
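A hint in the error itself: 0xb3 decodes as '³' in latin-1/cp1252, which is plausible in sensor-unit metadata (e.g. m³), so the header rows carried over from the Excel file were likely never utf-8. A minimal sketch of a fix, assuming that is the cause:
# read back with an encoding that matches the metadata bytes
solved = pd.read_csv('headers.csv', encoding='latin-1')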

Read specific columns with pandas or other python module

I have a csv file from this webpage.
I want to read some of the columns in the downloaded file (the csv version can be downloaded in the upper right corner).
Let's say I want 2 columns:
59 which in the header is star_name
60 which in the header is ra.
However, for some reason the authors of the webpage sometimes decide to move the columns around.
In the end I want something like this, keeping in mind that values can be missing.
data = #read data in a clever way
names = data['star_name']
ras = data['ra']
This will prevent my program from malfunctioning when the columns are changed again in the future, as long as they keep the names correct.
Until now I have tried various ways using the csv module and recently the pandas module, both without any luck.
EDIT (added two lines + the header of my datafile. Sorry, but it's extremely long.)
# name, mass, mass_error_min, mass_error_max, radius, radius_error_min, radius_error_max, orbital_period, orbital_period_err_min, orbital_period_err_max, semi_major_axis, semi_major_axis_error_min, semi_major_axis_error_max, eccentricity, eccentricity_error_min, eccentricity_error_max, angular_distance, inclination, inclination_error_min, inclination_error_max, tzero_tr, tzero_tr_error_min, tzero_tr_error_max, tzero_tr_sec, tzero_tr_sec_error_min, tzero_tr_sec_error_max, lambda_angle, lambda_angle_error_min, lambda_angle_error_max, impact_parameter, impact_parameter_error_min, impact_parameter_error_max, tzero_vr, tzero_vr_error_min, tzero_vr_error_max, K, K_error_min, K_error_max, temp_calculated, temp_measured, hot_point_lon, albedo, albedo_error_min, albedo_error_max, log_g, publication_status, discovered, updated, omega, omega_error_min, omega_error_max, tperi, tperi_error_min, tperi_error_max, detection_type, mass_detection_type, radius_detection_type, alternate_names, molecules, star_name, ra, dec, mag_v, mag_i, mag_j, mag_h, mag_k, star_distance, star_metallicity, star_mass, star_radius, star_sp_type, star_age, star_teff, star_detected_disc, star_magnetic_field
11 Com b,19.4,1.5,1.5,,,,326.03,0.32,0.32,1.29,0.05,0.05,0.231,0.005,0.005,0.011664,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1,2008,2011-12-23,94.8,1.5,1.5,2452899.6,1.6,1.6,Radial Velocity,,,,,11 Com,185.1791667,17.7927778,4.74,,,,,110.6,-0.35,2.7,19.0,G8 III,,4742.0,,
11 UMi b,10.5,2.47,2.47,,,,516.22,3.25,3.25,1.54,0.07,0.07,0.08,0.03,0.03,0.012887,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1,2009,2009-08-13,117.63,21.06,21.06,2452861.05,2.06,2.06,Radial Velocity,,,,,11 UMi,229.275,71.8238889,5.02,,,,,119.5,0.04,1.8,24.08,K4III,1.56,4340.0,,
An easy way to do this is using the pandas library like this.
import pandas as pd
fields = ['star_name', 'ra']
df = pd.read_csv('data.csv', skipinitialspace=True, usecols=fields)
# See the keys
print df.keys()
# See content in 'star_name'
print df.star_name
The key here is skipinitialspace, which removes the spaces after the delimiter, so ' star_name' in the header becomes 'star_name'.
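A variant of the same idea: usecols also accepts a callable, which is evaluated against each column name, so the match can be made robust to stray spaces without relying on column positions. A sketch (note the kept columns retain their original, possibly space-padded, names unless skipinitialspace is also passed):
import pandas as pd

# keep a column when its stripped name is one of the two we want
df = pd.read_csv('data.csv',
                 usecols=lambda name: name.strip() in {'star_name', 'ra'})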
According to the latest pandas documentation you can read a csv file selecting only the columns which you want to read.
import pandas as pd
df = pd.read_csv('some_data.csv', usecols = ['col1','col2'], low_memory = True)
Here we use usecols which reads only selected columns in a dataframe.
We use low_memory=True so that the file is processed internally in chunks.
The above answers are for Python 2. So for Python 3 users I am giving this answer. You can use the code below:
import pandas as pd
fields = ['star_name', 'ra']
df = pd.read_csv('data.csv', skipinitialspace=True, usecols=fields)
# See the keys
print(df.keys())
# See content in 'star_name'
print(df.star_name)
Got a solution to the above problem in a different way: although I read the entire csv file, I tweak the display part to show only the desired content.
import pandas as pd
df = pd.read_csv('data.csv', skipinitialspace=True)
print df[['star_name', 'ra']]
This one could help in some scenarios when learning the basics and filtering data on the basis of dataframe columns.
I think you need to try this method.
import pandas as pd
data_df = pd.read_csv('data.csv')
print(data_df['star_name'])
print(data_df['ra'])
