I am trying to remove or delete the headers of data I am reading in using pandas. One file has a header and the other doesn't, but I want to be able to check for a header and remove it if present.
So far, I have tried using header=None in the read_csv function:
from pathlib import Path
import pandas as pd
def _reader(fname):
    return pd.read_csv(fname, sep="\t", header=None)

folder = Path("C:\\Me\\Project1")
data = pd.concat([
    _reader(txt)
    for txt in folder.glob("*.txt")
])
I get the following error:
TypeError: must be str, not int
My two files look like this:
File1.txt
ISIN AVL_QTY
BAD 90000
AAB 8550000
BAD 173688
BAD 360000
BAD 90000
BAD 810000
BAD 900000
BAD 900000
File2.txt
TEST 543
HELLO 555
STOCK 900
CODE 785
First, you need to check whether the first line is a header. E.g. you can check whether any of the first row's entries begins with a digit, as that would be untypical for a column header.
In fact, without knowing your thousands of files, any approach to header detection is guesswork - but that's not really the point of your code.
To make use of header detection, you should go with a normal loop instead of a list comprehension, so that in each iteration you can: 1. check for a header, 2. read the file and append the data to a dataframe:
df = pd.DataFrame()
for f in folder.glob("*.txt"):
    with open(f) as fin:
        chk_lst = next(fin).split()
    is_h = not any(v[0].isdecimal() for v in chk_lst)
    df = pd.concat([df, pd.read_csv(f, sep=r'\s+', header=(None, 0)[is_h])], axis=1)
#   ISIN  AVL_QTY      0      1
# 0  BAD    90000   TEST  543.0
# 1  AAB  8550000  HELLO  555.0
# 2  BAD   173688  STOCK  900.0
# 3  BAD   360000   CODE  785.0
# 4  BAD    90000    NaN    NaN
# 5  BAD   810000    NaN    NaN
# 6  BAD   900000    NaN    NaN
# 7  BAD   900000    NaN    NaN
Edit:
For concatenating row-wise, you can use
df = pd.concat([df, pd.read_csv(f, sep=r'\s+', header=None, skiprows=(0, 1)[is_h])], axis=0, ignore_index=True)
# 0 1
# 0 BAD 90000
# 1 AAB 8550000
# 2 BAD 173688
# 3 BAD 360000
# 4 BAD 90000
# 5 BAD 810000
# 6 BAD 900000
# 7 BAD 900000
# 8 TEST 543
# 9 HELLO 555
# 10 STOCK 900
# 11 CODE 785
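As an aside, concatenating onto df inside the loop copies the accumulated data on every iteration; a sketch of the same logic in the more idiomatic collect-then-concat pattern:
frames = []
for f in folder.glob("*.txt"):
    with open(f) as fin:
        # header detection as above: no cell in the first row starts with a digit
        is_h = not any(v[0].isdecimal() for v in next(fin).split())
    frames.append(pd.read_csv(f, sep=r'\s+', header=None,
                              skiprows=1 if is_h else 0))
df = pd.concat(frames, ignore_index=True)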
File2.txt does not have a header, right? But in _reader you set header to None for every file.
Add a header to File2.txt and see what happens.
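Alternatively, if you'd rather not edit the files, here is a minimal sketch of doing the check inside _reader itself (assuming a header line never starts with a digit, and imposing File1's column names on both files):
import pandas as pd

def _reader(fname):
    # peek at the first line to decide whether it is a header
    with open(fname) as f:
        first_row = f.readline().split()
    has_header = not any(tok[0].isdigit() for tok in first_row)
    return pd.read_csv(fname, sep="\t", header=None,
                       names=["ISIN", "AVL_QTY"],
                       skiprows=1 if has_header else 0)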
There are a couple of ways to check whether a csv file has a header.
Using the csv library:
import csv
with open('example.csv', 'r', newline='') as csvfile:
    sniffer = csv.Sniffer()
    has_header = sniffer.has_header(csvfile.read(2048))
    csvfile.seek(0)
    # ...
my source
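To wire the sniffer result back into pandas, something like this sketch should work:
import csv
import pandas as pd

with open('example.csv', newline='') as csvfile:
    has_header = csv.Sniffer().has_header(csvfile.read(2048))

# header=0 takes the first line as column names, header=None does not
df = pd.read_csv('example.csv', header=0 if has_header else None)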
or, if you know your data, checking whether there are any digits in the first row:
# csv_table is assumed to hold the parsed rows, e.g. csv_table = list(csv.reader(f))
is_header = not any(cell.isdigit() for cell in csv_table[0])
my source
or with pandas itself, if you know what the header might be called
import numpy as np

# cols is the list of expected column names; keep all rows if the first
# row differs from cols, otherwise mask out that first (header) row
df = (pd.read_csv(filename, header=None, names=cols)
      [lambda x: np.ones(len(x)).astype(bool)
       if (x.iloc[0] != cols).all()
       else np.concatenate([[False], np.ones(len(x)-1).astype(bool)])]
      )
my source
and of course, if you want to preprocess the files with the command line first, it'll probably be faster.
I have a series of very messy *.csv files that are being read in by pandas. An example csv is:
Instrument 35392
"Log File Name : station"
"Setup Date (MMDDYY) : 031114"
"Setup Time (HHMMSS) : 073648"
"Starting Date (MMDDYY) : 031114"
"Starting Time (HHMMSS) : 090000"
"Stopping Date (MMDDYY) : 031115"
"Stopping Time (HHMMSS) : 235959"
"Interval (HHMMSS) : 010000"
"Sensor warmup (HHMMSS) : 000200"
"Circltr warmup (HHMMSS) : 000200"
"Date","Time","","Temp","","SpCond","","Sal","","IBatt",""
"MMDDYY","HHMMSS","","øC","","mS/cm","","ppt","","Volts",""
"Random message here 031114 073721 to 031114 083200"
03/11/14,09:00:00,"",15.85,"",1.408,"",.74,"",6.2,""
03/11/14,10:00:00,"",15.99,"",1.96,"",1.05,"",6.3,""
03/11/14,11:00:00,"",14.2,"",40.8,"",26.12,"",6.2,""
03/11/14,12:00:01,"",14.2,"",41.7,"",26.77,"",6.2,""
03/11/14,13:00:00,"",14.5,"",41.3,"",26.52,"",6.2,""
03/11/14,14:00:00,"",14.96,"",41,"",26.29,"",6.2,""
"message 3"
"message 4"**
I have been using this code to import the *.csv file, process the double headers, pull out the empty columns, and then strip the offending rows with bad data:
DF = pd.read_csv(BADFILE,parse_dates={'Datetime_(ascii)': [0,1]}, sep=",", \
header=[10,11],na_values=['','na', 'nan nan'], \
skiprows=[10], encoding='cp1252')
DF = DF.dropna(how="all", axis=1)
DF = DF.dropna(thresh=2)
droplist = ['message', 'Random']
DF = DF[~DF['Datetime_(ascii)'].str.contains('|'.join(droplist))]
DF.head()
Datetime_(ascii) (Temp, øC) (SpCond, mS/cm) (Sal, ppt) (IBatt, Volts)
0 03/11/14 09:00:00 15.85 1.408 0.74 6.2
1 03/11/14 10:00:00 15.99 1.960 1.05 6.3
2 03/11/14 11:00:00 14.20 40.800 26.12 6.2
3 03/11/14 12:00:01 14.20 41.700 26.77 6.2
4 03/11/14 13:00:00 14.50 41.300 26.52 6.2
This was working fine and dandy until I hit a file that has an erroneous one-line message after the header: "Random message here 031114 073721 to 031114 083200"
The error I receive is:
C:\Users\USER\AppData\Local\Continuum\Anaconda3\lib\site-
packages\pandas\io\parsers.py in _do_date_conversions(self, names, data)
1554 data, names = _process_date_conversion(
1555 data, self._date_conv, self.parse_dates, self.index_col,
-> 1556 self.index_names, names,
keep_date_col=self.keep_date_col)
1557
1558 return names, data
C:\Users\USER\AppData\Local\Continuum\Anaconda3\lib\site-
packages\pandas\io\parsers.py in _process_date_conversion(data_dict,
converter, parse_spec, index_col, index_names, columns, keep_date_col)
2975 if not keep_date_col:
2976 for c in list(date_cols):
-> 2977 data_dict.pop(c)
2978 new_cols.remove(c)
2979
KeyError: ('Time', 'HHMMSS')
If I remove that line, the code works fine. Similarly, if I remove the header= line, the code works fine. However, I want to preserve both because I am reading in hundreds of these files.
Difficulty: I would prefer not to open each file before the call to pandas.read_csv(), as these files can be rather large - thus I don't want to read and save them multiple times! I would also prefer a real pandas/pythonic solution that doesn't involve opening each file first as a StringIO buffer to remove the offending lines.
Here's one approach, making use of the fact that skiprows accepts a callable. The callable receives only the row index being considered, which is a built-in limitation of that parameter.
As such, the callable function skip_test() first checks whether the current index is in the set of known indices to skip. If not, then it opens the actual file and checks the corresponding row to see if its contents match.
The skip_test() function is a little hacky in the sense that it does inspect the actual file, although it only inspects up until the row index it's currently evaluating. It also assumes that the bad line always begins with the same string (in the example case, "foo"), but that seems to be a safe assumption given the OP.
# example data
""" foo.csv
uid,a,b,c
0,1,2,3
skip me
1,11,22,33
foo
2,111,222,333
"""
import pandas as pd
def skip_test(r, fn, fail_on, known):
    if r in known:  # we know we always want to skip these
        return True
    # check if the row index matches the problem line in the file;
    # for efficiency, quit once we pass that row index
    with open(fn, "r") as f:
        for i, line in enumerate(f):
            if i == r and line.startswith(fail_on):
                return True
            elif i > r:
                break
    return False
fname = "foo.csv"
fail_str = "foo"
known_skip = [2]
pd.read_csv(fname, sep=",", header=0,
skiprows=lambda x: skip_test(x, fname, fail_str, known_skip))
# output
uid a b c
0 0 1 2 3
1 1 11 22 33
2 2 111 222 333
If you know exactly which line the random message will appear on when it does appear, then this will be much faster, as you can just tell it not to inspect the file contents for any index past the potential offending line.
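For instance, if you knew the bad line could only ever appear among the first dozen rows, a small guard (MAX_CHECK is a hypothetical bound introduced here) would avoid inspecting the file for all later indices:
MAX_CHECK = 12  # hypothetical: the bad line only ever appears this early

def skip_test_bounded(r, fn, fail_on, known):
    if r in known:
        return True
    if r > MAX_CHECK:  # past the region of interest: never touch the file
        return False
    return skip_test(r, fn, fail_on, known)

pd.read_csv(fname, sep=",", header=0,
            skiprows=lambda x: skip_test_bounded(x, fname, fail_str, known_skip))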
After some tinkering yesterday I found a solution and what the potential issue may be.
I tried the skip_test() function answer above, but I was still getting errors with the size of the table:
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader.read (pandas\_libs\parsers.c:10862)()
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory (pandas\_libs\parsers.c:11138)()
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._read_rows (pandas\_libs\parsers.c:11884)()
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows (pandas\_libs\parsers.c:11755)()
pandas\_libs\parsers.pyx in pandas._libs.parsers.raise_parser_error (pandas\_libs\parsers.c:28765)()
ParserError: Error tokenizing data. C error: Expected 1 fields in line 14, saw 11
So after playing around with skiprows=, I discovered that I was just not getting the behavior I wanted when using engine='c'. read_csv() was still determining the size of the file from those first few rows, and some of those single-column rows were still being passed. It may be that I have a few more bad single-column rows in my csv set than I planned on.
Instead, I create an arbitrarily sized DataFrame as a template. I pull in the entire .csv file, then use logic to strip out the NaN rows.
For example, I know that the widest table I will encounter in my data will be 10 columns. So my call to pandas is:
DF = pd.read_csv(csv_file, sep=',', \
parse_dates={'Datetime_(ascii)': [0,1]},\
na_values=['','na', '999999', '#'], engine='c',\
encoding='cp1252', names = list(range(0,10)))
I then use these two lines to drop the NaN rows and columns from the DataFrame:
# drop the null columns created by double delimiters
DF = DF.dropna(how="all", axis=1)
DF = DF.dropna(thresh=2) # drop if we don't have at least 2 cells with real values
If anyone comes across this question in the future: pandas has since implemented the on_bad_lines argument. You can now solve this problem by using on_bad_lines="skip".
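A minimal sketch of the call, reusing BADFILE from the question (the argument exists in pandas 1.3 and later; in older versions the rough equivalent was error_bad_lines=False):
import pandas as pd

# malformed rows are dropped instead of raising a ParserError
DF = pd.read_csv(BADFILE, sep=",", encoding='cp1252', on_bad_lines="skip")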
Making the change from R to Python, I am having some difficulty writing multiple CSVs with pandas from a list of DataFrames:
import pandas
from dplython import (DplyFrame, X, diamonds, select, sift, sample_n,
                      sample_frac, head, arrange, mutate, group_by, summarize,
                      DelayFunction)

diamonds = [diamonds, diamonds, diamonds]
path = "/user/me/"

def extractDiomands(path, diamonds):
    for each in diamonds:
        df = DplyFrame(each) >> select(X.carat, X.cut, X.price) >> head(5)
        df = pd.DataFrame(df)  # not sure if that is required
        df.to_csv(os.path.join('.csv', each))
extractDiomands(path,diamonds)
That, however, generates errors. I'd appreciate any suggestions!
Welcome to Python! First, I'll load a couple of libraries and download an example dataset.
import os
import pandas as pd
example_data = pd.read_csv("http://www.ats.ucla.edu/stat/data/binary.csv")
print(example_data.head(5))
first few rows of our example data:
admit gre gpa rank
0 0 380 3.61 3
1 1 660 3.67 3
2 1 800 4.00 1
3 1 640 3.19 4
4 0 520 2.93 4
Now here's what I think you want done:
# spawn a few datasets to loop through
df_1, df_2, df_3 = example_data.head(20), example_data.tail(20), example_data.head(10)
list_of_datasets = [df_1, df_2, df_3]
output_path = 'scratch'

# in Python you can loop through collections of items directly, it's pretty cool.
# with enumerate(), you get the index and the item from the sequence at each step
for index, dataset in enumerate(list_of_datasets):
    # filter to keep just a couple of columns
    keep_columns = ['gre', 'admit']
    dataset = dataset[keep_columns]
    # export to CSV
    filepath = os.path.join(output_path, 'dataset_' + str(index) + '.csv')
    dataset.to_csv(filepath)
At the end, my folder 'scratch' has three new CSVs called dataset_0.csv, dataset_1.csv, and dataset_2.csv.
I have the following data set in a csv file:
vehicle---time-----aspd[m/s]------gspd[m/s]----hdg---alt[m-msl]
veh_1---17:19.5---0.163471505---0.140000001---213---273.8900146
veh_2---17:19.5---0.505786836---0.170000002---214---273.9100037
veh_3---17:19.8---0.173484877---0.109999999---213---273.980011
veh_4---44:12.4---18.64673424---19.22999954---316---388.9299927
veh_5---44:13.0---18.13533401---19.10000038---316---389.1700134
I am trying to write a function launch_time() with two inputs (dataframe, vehicle name) that returns the first time the gspd is reported above 10.0 m/s.
The output time must be converted from a string (HH:MM:SS.SS) to a minutes after 12:00 format.
It should look something like this:
>>> launch_time(df, veh_1)
30.0
I will use this function to iterate through each vehicle and then need to record the results into a list of tuples with the format (v_name, launch time) in launch sequence order.
It should look something like this:
'veh_1', 30.0, 'veh_2', 15.0
Disclosure: my python/pandas knowledge is very entry-level.
You can use read_csv with the regex separator -{3,}, i.e. a run of 3 or more - characters:
import pandas as pd
from io import StringIO  # pandas.compat.StringIO was removed in newer pandas
temp=u"""vehicle---time-----aspd[m/s]------gspd[m/s]----hdg---alt[m-msl]
veh_1---17:19.5---0.163471505---0.140000001---213---273.8900146
veh_2---17:19.5---0.505786836---0.170000002---214---273.9100037
veh_3---17:19.8---0.173484877---0.109999999---213---273.980011
veh_4---44:12.4---18.64673424---19.22999954---316---388.9299927
veh_5---45:13.0---18.13533401---19.10000038---316---389.1700134"""
#after testing replace StringIO(temp) to filename
df = pd.read_csv(StringIO(temp), sep="-{3,}", engine='python')
print (df)
vehicle time aspd[m/s] gspd[m/s] hdg alt[m-msl]
0 veh_1 17:19.5 0.163472 0.14 213 273.890015
1 veh_2 17:19.5 0.505787 0.17 214 273.910004
2 veh_3 17:19.8 0.173485 0.11 213 273.980011
3 veh_4 44:12.4 18.646734 19.23 316 388.929993
4 veh_5 45:13.0 18.135334 19.10 316 389.170013
Then convert column time with to_timedelta, filter all rows above 10 m/s by boolean indexing, sort_values, group on vehicle using groupby and take the first value in each group, and finally zip columns vehicle and time and convert to a list:
df.time = pd.to_timedelta('00:' + df.time).\
          astype('timedelta64[m]').astype(int)
req = df[df['gspd[m/s]'] > 10].\
sort_values('time', ascending=True).\
groupby('vehicle', as_index=False).head(1)
print(req)
vehicle time aspd[m/s] gspd[m/s] hdg alt[m-msl]
4 veh_5 45 18.135334 19.10 316 389.170013
3 veh_4 44 18.646734 19.23 316 388.929993
L = list(zip(req['vehicle'],req['time']))
print (L)
[('veh_5', 45), ('veh_4', 44)]
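To get the launch_time(df, v_name) interface the question asked for, one possible sketch building on the converted frame above:
def launch_time(df, v_name):
    # df as prepared above, with 'time' already converted to integer minutes
    hits = df[(df['vehicle'] == v_name) & (df['gspd[m/s]'] > 10)]
    return hits['time'].min() if not hits.empty else None

launch_order = sorted(L, key=lambda t: t[1])  # tuples in launch-sequence order
print(launch_order)
# [('veh_4', 44), ('veh_5', 45)]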
I am trying to do some simple analyses on the Kenneth French industry portfolios (first time with Pandas/Python); the data is in txt format (see the link in the code). Before I can do computations, I first want to load it into a Pandas dataframe properly, but I've been struggling with this for hours:
import urllib.request
import os.path
import zipfile
import pandas as pd
import numpy as np
# paths
url = 'http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/48_Industry_Portfolios_CSV.zip'
csv_name = '48_Industry_Portfolios.CSV'
local_zipfile = '{0}/data.zip'.format(os.getcwd())
local_file = '{0}/{1}'.format(os.getcwd(), csv_name)
# download data
if not os.path.isfile(local_file):
    print('Downloading and unzipping file!')
    urllib.request.urlretrieve(url, local_zipfile)
    zipfile.ZipFile(local_zipfile).extract(csv_name, os.path.dirname(local_file))
# read from file
df = pd.read_csv(local_file,skiprows=11)
df.rename(columns={'Unnamed: 0' : 'dates'}, inplace=True)
# build new dataframe
first_stop = df['dates'][df['dates']=='201412'].index[0]
df2 = df[:first_stop]
# convert date to datetime object
pd.to_datetime(df2['dates'], format = '%Y%m')
df2.index = df2.dates
All the columns, except dates, represent financial returns. However, due to the file formatting, these are now strings. According to Pandas docs, this should do the trick:
df2.convert_objects(convert_numeric=True)
But the columns remain strings. Other suggestions are to loop over the columns (see for example pandas convert strings to float for multiple columns in dataframe):
for d in df2.columns:
    if d != 'dates':
        df2[d] = df2[d].map(lambda x: float(x)/100)
But this gives me the following warning:
home/<xxxx>/Downloads/pycharm-community-4.5/helpers/pydev/pydevconsole.py:3: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
try:
I have read the documentation on views vs. copies, but I'm having difficulty understanding why it is a problem in my case but not in the code snippets in the question I linked to. Thanks!
Edit:
df2=df2.convert_objects(convert_numeric=True)
This does the trick, although I receive a deprecation warning (strangely enough, that is not mentioned in the docs at http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.convert_objects.html).
Some of df2:
dates Agric Food Soda Beer Smoke Toys Fun \
dates
192607 192607 2.37 0.12 -99.99 -5.19 1.29 8.65 2.50
192608 192608 2.23 2.68 -99.99 27.03 6.50 16.81 -0.76
192609 192609 -0.57 1.58 -99.99 4.02 1.26 8.33 6.42
192610 192610 -0.46 -3.68 -99.99 -3.31 1.06 -1.40 -5.09
192611 192611 6.75 6.26 -99.99 7.29 4.55 0.00 1.82
Edit2: the solution is actually simpler than I thought:
df2.index = pd.to_datetime(df2['dates'], format = '%Y%m')
df2 = df2.astype(float)/100
I would try the following to force convert everything into floats:
df2=df2.astype(float)
You can convert a specific column to float (or any numerical type, for that matter) with
df["column_name"] = pd.to_numeric(df["column_name"])
Posting this because pandas.convert_objects is deprecated in pandas 0.20.1
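For several columns at once, the same routine can be applied column-wise; a sketch for the question's frame (every column except dates):
num_cols = [c for c in df2.columns if c != 'dates']
df2[num_cols] = df2[num_cols].apply(pd.to_numeric)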
You need to assign the result of convert_objects as there is no inplace param:
df2=df2.convert_objects(convert_numeric=True)
You refer to the rename method, but that one has an inplace param, which you set to True.
Most operations in pandas return a copy, and some have an inplace param; convert_objects is one that does not. This is probably because, if the conversion fails, you don't want to blat over your data with NaNs.
Also, the deprecation warning is to split out the different conversion routines, presumably so you can specialise the params, e.g. a format string for datetime, etc.
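In other words, the split-out replacements for convert_objects are the explicit converters; for the question's frame that might look like this sketch:
df2['dates'] = pd.to_datetime(df2['dates'], format='%Y%m')
num_cols = [c for c in df2.columns if c != 'dates']
df2[num_cols] = df2[num_cols].apply(pd.to_numeric, errors='coerce')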