Merging two .csv files with python-pandas

I have two .csv files with the same initial column-header:
NAME RA DEC Mean_I1 Mean_I2 alpha_K24 class alpha_K8 class.1 Av avgAv
Mon-000101 100.27242 9.608597 11.082 10.034 0.39 I 0.39 I 31.1 31.1
Mon-000171 100.29230 9.522860 14.834 14.385 0.45 I 0.45 I 33.7 33.7
and
NAME Sdev_I1 Sdev_I2
Mon-000002, 0.023, 0.028000001,
Mon-000003, 0.016000001, 0.016000001,
I want to merge the two together so that the 'NAME' columns match up; basically I just want to append the two Sdev_I1/Sdev_I2 columns to the end of the first sample. I've tried...
import pandas as pd
df1 = pd.read_csv('h7.csv',sep=r'\s+')
df2 = pd.read_csv('NEW.csv',sep=r'\s+')
df = pd.merge(df1,df2)
df.to_csv('Newh7.csv',index=False)
but it prints 'NAME' twice, everything seems to be out of order, and there are a lot of added zeroes as well. I thought I had solved this one a while back, but I've totally lost it. Help would be appreciated. Thanks.
Here's the output file:
NAME,RA,DEC,Mean_I1,Mean_I2,alpha_K24,class,alpha_K8,class.1,Av,avgAv,Sdev_I1,Sdev_I2

It seems you didn't strip the trailing commas in the second csv; you might use converters to remove them:
In [81]: converters = {
'NAME': lambda x:x[:-1],
'Sdev_I1': lambda x: float(x[:-1]),
'Sdev_I2': lambda x: float(x[:-1])
}
In [82]: pd.read_csv('NEW.csv',sep=r'\s+', converters=converters)
Out[82]:
NAME Sdev_I1 Sdev_I2
0 Mon-000002 0.023 0.028
1 Mon-000003 0.016 0.016
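Putting it together, here is a sketch of the full pipeline. It uses io.StringIO stand-ins for h7.csv and NEW.csv (with matching NAME values, which the excerpts above don't show), and passes on='NAME' and how='left' so every row of the first file is kept even when the second has no match:

```python
import io
import pandas as pd

# Stand-in for h7.csv (trimmed to three columns for brevity)
h7 = io.StringIO(
    "NAME RA DEC\n"
    "Mon-000101 100.27242 9.608597\n"
    "Mon-000171 100.29230 9.522860\n"
)
# Stand-in for NEW.csv, with its trailing commas
new = io.StringIO(
    "NAME Sdev_I1 Sdev_I2\n"
    "Mon-000101, 0.023, 0.028000001,\n"
    "Mon-000171, 0.016000001, 0.016000001,\n"
)

# Strip the trailing comma from each field while parsing the second file
converters = {
    'NAME': lambda x: x.rstrip(','),
    'Sdev_I1': lambda x: float(x.rstrip(',')),
    'Sdev_I2': lambda x: float(x.rstrip(',')),
}

df1 = pd.read_csv(h7, sep=r'\s+')
df2 = pd.read_csv(new, sep=r'\s+', converters=converters)

# 'left' keeps all rows of df1; unmatched rows get NaN in the Sdev columns
df = pd.merge(df1, df2, on='NAME', how='left')
print(df)
```

Being explicit about on='NAME' also avoids merge silently joining on every shared column name.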

Related

How to add Column titles to a dataset without any in pandas python

I have these two datasets.
! curl -O https://raw.githubusercontent.com/msu-cmse-courses/cmse202-S21-student/master/data/Dataset.data
! curl -O https://raw.githubusercontent.com/msu-cmse-courses/cmse202-S21-student/master/data/Dataset.spec
So I read the data in using
import pandas as pd
data = pd.read_csv("Dataset.data", header = None)
Then I want to make column titles for the Dataset.data since it doesn't have any, just the rows with the data for each snail.
I tried using
data.columns = ['sex','length','diameter','height','whole_weight','shucked_weight','viscera_weight','shell_weight','rings']
to add it to the data set but it gives me the error:
Length mismatch: Expected axis has 1 elements, new values have 9 elements
Can anyone help me? I just want my data to have these column titles. Currently it has no column titles, just numbers.
Cheers.
Your data is delimited by space, but read_csv defaults to comma, so you need to specify the delimiter manually:
data = pd.read_csv('Dataset.data', delimiter=' ', header=None)
data.columns = ['sex','length','diameter','height','whole_weight','shucked_weight','viscera_weight','shell_weight','rings']
data.head(2)
  sex  length  diameter  height  whole_weight  shucked_weight  viscera_weight  shell_weight  rings
0   M   0.455     0.365   0.095        0.5140          0.2245          0.1010          0.15     15
1   M   0.350     0.265   0.090        0.2255          0.0995          0.0485          0.07      7
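As a variation on the same idea, the delimiter and the column names can be supplied in a single read_csv call via the names parameter, so no separate data.columns assignment is needed. A sketch using an io.StringIO stand-in for the first two rows of Dataset.data:

```python
import io
import pandas as pd

# Stand-in for the space-delimited Dataset.data (abalone-style records)
data_file = io.StringIO(
    "M 0.455 0.365 0.095 0.5140 0.2245 0.1010 0.15 15\n"
    "M 0.350 0.265 0.090 0.2255 0.0995 0.0485 0.07 7\n"
)

columns = ['sex', 'length', 'diameter', 'height', 'whole_weight',
           'shucked_weight', 'viscera_weight', 'shell_weight', 'rings']

# names= assigns the headers at read time; passing names also tells
# pandas the file itself has no header row
data = pd.read_csv(data_file, delimiter=' ', names=columns)
print(data.head(2))
```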

How can I remove extra digits of a float64 value?

I have a data frame column.
P08107 3.658940e-11
P62979 4.817399e-05
P16401 7.784275e-05
Q96B49 7.784275e-05
Q15637 2.099078e-04
P31689 1.274387e-03
P62258 1.662718e-03
P07437 3.029516e-03
O00410 3.029516e-03
P23381 3.029516e-03
P27348 5.733834e-03
P29590 9.559550e-03
P25685 9.957186e-03
P09429 1.181282e-02
P62937 1.260040e-02
P11021 1.396807e-02
P31946 1.409311e-02
P19338 1.503901e-02
Q14974 2.213431e-02
P11142 2.402201e-02
I want to keep one decimal and remove the extra digits, so that it looks like
3.7e-11
instead of
3.658940e-11
and so on with all the others.
I know how to slice a string but it doesn't seem to work here.
If you have a pandas dataframe you could set the display options.
import pandas as pd
import numpy as np
pd.options.display.float_format = '{:.2f}'.format
pd.DataFrame(dict(randomvalues=np.random.random_sample((5,))))
Returns:
randomvalues
0 0.02
1 0.66
2 0.24
3 0.87
4 0.63
You could use str.format:
>>> '{:.2g}'.format(3.658940e-11)
'3.7e-11'
String slicing will not work here, because it does not round the values:
>>> s = '3.658940e-11'
>>> s[:3] + 'e' + s.split('e')[1]
'3.6e-11'
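To apply the same rounding to the whole column rather than just a single value or the display, a sketch using Series.map (note this converts the values to strings; keep the original column if you still need the full-precision numbers):

```python
import pandas as pd

# Stand-in for the first few rows of the column from the question
s = pd.Series([3.658940e-11, 4.817399e-05, 2.099078e-04],
              index=['P08107', 'P62979', 'Q15637'])

# {:.2g} rounds to two significant digits, unlike string slicing
formatted = s.map('{:.2g}'.format)
print(formatted)
```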

Python pandas--how to merge 2 csv if the relevant column data is "close enough?"

I want to merge two csvs using pandas. The column of interest is relative time, and I want to merge rows if the two times are "close enough", i.e. the times do not have to be exactly the same, just near each other. I think I have merged the csvs and created an output csv, but it is blank and I am getting Key Error: 'name of column I am merging', so I think this is because none of the values are exactly the same.
raw = pd.read_csv('Specimen_RawData_2.csv', low_memory = False)
optical = pd.read_csv('results_avg_optical_strain3.csv', low_memory = False)
result = pd.merge(raw, optical[['Exx, mean']], on = 'Relative Time (s)')
result.to_csv('Results.csv')
I am not really interested in rounding the time values, because the exact value is of interest. I just want to merge the data where the two values are nearest each other. Ex:
raw:
22 1.097 0.11339 1.47275 0.053
23 1.101 0.12211 1.59291 0.057
24 1.105 0.13051 1.71423 0.061
optical:
1 1 1 4.54E-05 cam0_006807_418.910.csv 418.91 0.058
2 2 2 4.48E-05 cam0_006808_418.975.csv 418.975 0.123
3 3 3 0.000138274 cam0_006809_419.037.csv 419.037 0.185
I would merge these where the time value is 0.057 in raw and 0.058 in optical.
Any help is appreciated. I will edit this if there are any questions as I know this is confusing to read through!
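No answer was posted, but pandas ships merge_asof for exactly this kind of nearest-key join. A sketch with hypothetical stand-in data and column names (the real files would use their actual time columns, e.g. 'Relative Time (s)'):

```python
import pandas as pd

# Hypothetical stand-ins for the two files; only the time columns matter here
raw = pd.DataFrame({'time': [1.097, 1.101, 1.105],
                    'strain': [0.053, 0.057, 0.061]})
optical = pd.DataFrame({'time': [1.096, 1.102, 1.106],
                        'exx_mean': [0.058, 0.123, 0.185]})

# Both frames must be sorted on the key; direction='nearest' picks the
# closest match, and tolerance caps how far apart two times may be
result = pd.merge_asof(raw.sort_values('time'),
                       optical.sort_values('time'),
                       on='time',
                       direction='nearest',
                       tolerance=0.01)
print(result)
```

Rows whose nearest match falls outside the tolerance get NaN instead of being dropped, so the exact raw time values are preserved either way.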

How to exclude first word in Pandas header?

I'm importing text files into Pandas data frames. The number of columns can vary, and the names vary too.
However, the header line always starts with ~A, and read_csv interprets this as the name of the first column, so all the column names are shifted one step to the right.
Earlier I used np.genfromtxt() with the argument deletechars='A__', but I haven't found an equivalent for pandas. Is there a way to exclude the name when reading or, as a second option, to delete the first name but keep the columns intact?
I'm reading file like this:
in_file = pd.read_csv(file_name, header=header_row,delim_whitespace=True)
Now I got this (just as the text file looks):
~A DEPTH TIME TX1 TX2 TX3 OUT6
11705 2.94 10525.38 126.14 169.71 353.86 4.59 NaN
11706 2.93 10525.38 NaN 168.29 368.00 4.75 NaN
11707 2.92 10525.38 126.14 166.71 369.86 4.93 NaN
but I want to get this:
DEPTH TIME TX1 TX2 TX3 OUT6
11705 2.94 10525.38 126.14 169.71 353.86 4.59
11706 2.93 10525.38 NaN 168.29 368.00 4.75
11707 2.92 10525.38 126.14 166.71 369.86 4.93
Why not just post-process?
df = ...
df_modified = df[df.columns[:-1]]
df_modified.columns = df.columns[1:]
How about reading the file twice? First, use pd.read_csv() but skip the header row. Second, use readline() on the open file to parse the header and drop the first item. The result can then be assigned to your dataframe. (Since the file is whitespace-delimited, the header is split on whitespace, not commas.)
in_file = pd.read_csv(file_name, delim_whitespace=True, header=None, skiprows=[0])
with open(file_name, 'rt') as h:
    hdrs = h.readline().split()
in_file.columns = hdrs[1:]
Choose which columns to import
in_file = pd.read_csv(file_name, header=header_row,
                      delim_whitespace=True,
                      usecols=['DEPTH','TIME','TX1','TX2','TX3','OUT6'])
Ok, so if the number of columns varies,
and you want to remove the first column (whose name varies),
AND you do not want to do this in a post-read_csv phase...
then
.... (Drum Roll)
import pandas as pd
# tim.csv is
# 1,2,3
# 2,3,4
# 3,4,5
headers = ['BADCOL', 'Happy', 'Sad']
data = pd.read_csv('tim.csv', header=None, names=headers).iloc[:, 1:]
data will now look like
   Happy  Sad
0      2    3
1      3    4
2      4    5
Not sure if this counts as post-CSV processing or not...

Pandas dataframe converting specific columns from string to float

I am trying to do some simple analyses on the Kenneth French industry portfolios (first time with Pandas/Python); the data is in txt format (see the link in the code). Before I can do computations, I first want to load it into a Pandas dataframe properly, but I've been struggling with this for hours:
import urllib.request
import os.path
import zipfile
import pandas as pd
import numpy as np
# paths
url = 'http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/48_Industry_Portfolios_CSV.zip'
csv_name = '48_Industry_Portfolios.CSV'
local_zipfile = '{0}/data.zip'.format(os.getcwd())
local_file = '{0}/{1}'.format(os.getcwd(), csv_name)
# download data
if not os.path.isfile(local_file):
    print('Downloading and unzipping file!')
    urllib.request.urlretrieve(url, local_zipfile)
    zipfile.ZipFile(local_zipfile).extract(csv_name, os.path.dirname(local_file))
# read from file
df = pd.read_csv(local_file,skiprows=11)
df.rename(columns={'Unnamed: 0' : 'dates'}, inplace=True)
# build new dataframe
first_stop = df['dates'][df['dates']=='201412'].index[0]
df2 = df[:first_stop]
# convert date to datetime object
pd.to_datetime(df2['dates'], format = '%Y%m')
df2.index = df2.dates
All the columns, except dates, represent financial returns. However, due to the file formatting, these are now strings. According to Pandas docs, this should do the trick:
df2.convert_objects(convert_numeric=True)
But the columns remain strings. Other suggestions are to loop over the columns (see for example pandas convert strings to float for multiple columns in dataframe):
for d in df2.columns:
    if d != 'dates':
        df2[d] = df2[d].map(lambda x: float(x)/100)
But this gives me the following warning:
home/<xxxx>/Downloads/pycharm-community-4.5/helpers/pydev/pydevconsole.py:3: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
I have read the documentation on views vs copies, but having difficulty to understand why it is a problem in my case, but not in the code snippets in the question I linked to. Thanks
Edit:
df2=df2.convert_objects(convert_numeric=True)
Does the trick, although I receive a deprecation warning (strangely enough that is not mentioned in the docs at http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.convert_objects.html)
Some of df2:
dates Agric Food Soda Beer Smoke Toys Fun \
dates
192607 192607 2.37 0.12 -99.99 -5.19 1.29 8.65 2.50
192608 192608 2.23 2.68 -99.99 27.03 6.50 16.81 -0.76
192609 192609 -0.57 1.58 -99.99 4.02 1.26 8.33 6.42
192610 192610 -0.46 -3.68 -99.99 -3.31 1.06 -1.40 -5.09
192611 192611 6.75 6.26 -99.99 7.29 4.55 0.00 1.82
Edit2: the solution is actually more simple than I thought:
df2.index = pd.to_datetime(df2['dates'], format = '%Y%m')
df2 = df2.astype(float)/100
I would try the following to force convert everything into floats:
df2=df2.astype(float)
You can convert specific column to float(or any numerical type for that matter) by
df["column_name"] = pd.to_numeric(df["column_name"])
Posting this because pandas.convert_objects is deprecated in pandas 0.20.1
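Extending that to the dataframe from the question, a sketch (with a small stand-in for df2) that applies pd.to_numeric to every column except 'dates' and rescales in one step:

```python
import pandas as pd

# Stand-in for df2: returns stored as strings, as read from the file
df2 = pd.DataFrame({'dates': ['192607', '192608'],
                    'Agric': ['2.37', '2.23'],
                    'Food': ['0.12', '2.68']})

# Apply pd.to_numeric column-wise to everything except 'dates';
# errors='coerce' turns unparsable cells into NaN instead of raising
cols = df2.columns.drop('dates')
df2[cols] = df2[cols].apply(pd.to_numeric, errors='coerce') / 100
print(df2.dtypes)
```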
You need to assign the result of convert_objects as there is no inplace param:
df2=df2.convert_objects(convert_numeric=True)
you refer to the rename method but that one has an inplace param which you set to True.
Most operations in pandas return a copy and some have inplace param, convert_objects is one that does not. This is probably because if the conversion fails then you don't want to blat over your data with NaNs.
Also the deprecation warning is to split out the different conversion routines, presumably so you can specialise the params e.g. format string for datetime etc..
