I am trying to perform some arithmetic operations in Python pandas and merge the result into one of the files.
Path_1: File_1.csv, File_2.csv, ....
This path contains several files whose time columns are supposed to increase in interval, with the following columns:
File_1.csv   | File_2.csv
Nos,12:00:00 | Nos,12:30:00
123,1451     | 485,5464
656,4544     | 456,4865
853,5484     | 658,4584
Path_2: Master_1.csv
Nos,00:00:00
123,2000
485,1500
656,1000
853,2500
456,4500
658,5000
I am trying to read the n .csv files from Path_1 and compare each file's col[1] header (a time) with the last column header of Master_1.csv.
If Master_1.csv does not have that time, it should create a new column with the time from the Path_1 file and fill in the values with respect to col['Nos'], subtracting them from col[1] of Master_1.csv.
If the column with that time from the Path_1 file is already present, it should look up col['Nos'] and replace the NaN with the subtracted value for that col['Nos'].
i.e.
Expected Output in Master_1.csv
Nos,00:00:00,12:00:00,12:30:00
123,2000,549,NAN
485,1500,NAN,3964
656,1000,3544,NAN
853,2500,2984,NAN
456,4500,NAN,365
658,5000,NAN,-416
I understand the arithmetic calculations, but I am not able to loop with respect to Nos and the time series. I have put some code together and am trying to work out the looping. Need help in that context. Thanks.
import os
import glob
import pandas as pd
import numpy as np

path_1 = '/'
path_2 = '/'

# the time series column is different in every file, e.g. 12:00, 12:30, 17:30 etc.
df_1 = pd.read_csv(glob.glob(os.path.join(path_1, '*.csv'))[0])  # columns: ['Nos', <time>]
df_2 = pd.read_csv('master_1.csv')  # columns: ['Nos', '00:00:00']

# this loop is where I am stuck -- iterating over Nos across both frames
for Nos in df_1 and df_2:
    df_1['Nos'] = df_2['Nos']
    new_tseries = df_2['00:00:00'] - df_1['timeseries']

# goal: merge new_tseries into master_1.csv as a new, dynamically named time column
# merged.concat('master_1.csv', columns=['Nos', '00:00:00', 'new_tseries'], axis=0)
You can do it in three steps:
1. Read your CSVs into a list of dataframes.
2. Merge the dataframes together (equivalent to a SQL left join or an Excel VLOOKUP).
3. Calculate your derived columns using a vectorized subtraction.
Here's some code you could try:
#read dataframes into a list
import glob
import pandas as pd

L = []
for fname in glob.glob(path_1 + '*.csv'):
    L.append(pd.read_csv(fname))

#read the master dataframe, and merge in the other dataframes
df_2 = pd.read_csv('master_1.csv')
for df in L:
    df_2 = pd.merge(df_2, df, on='Nos', how='left')

#for each merged time column, calculate the difference from the master column
time_cols = [c for c in df_2.columns if c not in ('Nos', '00:00:00')]
df_2[time_cols] = df_2[time_cols].apply(lambda x: x - df_2['00:00:00'])
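To write the result back to the master file afterwards (assuming overwriting master_1.csv is acceptable):

df_2.to_csv('master_1.csv', index=False)  # index=False keeps 'Nos' as a regular column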
Related
So I am trying to create a new dataframe that includes some data from 300+ csv files.
Each file contains up to 200,000 rows of data; I am only interested in one of the columns within each file (the same column for each file).
I am trying to combine these columns into one dataframe, where column 6 from csv 1 would be in the 1st column of the new dataframe, column 6 from csv 2 would be in the 2nd column of the new dataframe, and so on up until the 315th csv file.
I don't need all 200,000 rows of data to be extracted, but I am unsure of how I would extract just 2,000 rows from the middle section of the data (each file varies in number of rows, so the exact same rows for each file aren't necessary, as long as it is the middle 2,000).
Any help in extracting the 2000 rows from each file to populate different columns in the new dataframe would be greatly appreciated.
So far, I have manipulated the data to only contain the relevant column for each file. This displays all the rows of data in the column for each file individually.
I tried to use the iloc function to reduce this to 2000 rows but it did not display any actual data in the output.
I am unsure as to how I would now extract this data into a dataframe for all the columns to be contained.
import pandas as pd
import os
import glob
import itertools
#glob to get all csv files
path = os.getcwd()
csv_files = glob.glob(os.path.join('filepath/', "*.csv"))
#loop list of csv files
for f in csv_files:
    df = pd.read_csv(f, header=None)
    df.rename(columns={6: 'AE'}, inplace=True)
    new_df = df.filter(['AE'])
    print('Location:', f)
    print('File Name:', f.split("\\")[-1])
    print('Content:')
    display(new_df)
    print()
Based on your description, I am inferring that you have a number of different files in csv format, each of which has at least 2000 lines and 6 columns. You want to take the data only from the 6th column of each file and only for the middle 2000 records in each file and to put all of those blocks of 2000 records into a new dataframe, with a column that in some way identifies which file the block came from.
You can read each file using pandas, as you have done, and then you need to use loc, as one of the commenters said, to select the 2000 records you want to keep. If you save each of those blocks of records in a separate dataframe you can then use the pandas concat method to join them all together into different columns of a new dataframe.
Here is some code that I hope will be self-explanatory. I have assumed that you want the 6th column, which is the one with index 5 in pandas because we start counting from 0. I have also used usecols to keep only the 6th column, and I rename the column to an index number based on the order in which the files are being read. You would need to change this for your own choice of column naming.
I choose the middle 2000 records by defining the starting point as record x, say, so that x + 2000 + x = total number of records, therefore x=(total number of records) / 2 - 1000. This might not be exactly how you want to define the middle 2000 records, so you could change this.
df_middles is a list to which we append every new dataframe of the new file's middle 2000 records. We use pd.concat at the end to put all the columns into a new dataframe.
import os
import glob
import pandas as pd
# glob to get all csv files
path = os.getcwd()
csv_files = glob.glob(os.path.join("filepath/", "*.csv"))
df_middles = []
# loop list of csv files
for idx, f in enumerate(csv_files, 1):
    # only keep the 6th column (index 5)
    df = pd.read_csv(f, header=None, usecols=[5])
    colname = f"column_{idx}"
    df.rename(columns={5: colname}, inplace=True)
    number_of_lines = len(df)
    if number_of_lines < 2000:
        raise IOError(f"Not enough lines in the input file: {f}")
    middle_range_start = int(number_of_lines / 2) - 1000
    # .loc slicing is inclusive of the end label, so this keeps exactly 2000 rows
    middle_range_end = middle_range_start + 1999
    df_middle = df.loc[middle_range_start:middle_range_end].reset_index(drop=True)
    df_middles.append(df_middle)

df_final = pd.concat(df_middles, axis="columns")
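A quick way to sanity-check and save the combined frame (the output filename here is just an example):

print(df_final.shape)  # expect 2000 rows and one column per input file
df_final.to_csv("middle_2000_rows.csv", index=False)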
I am using pandas in python.
Both of my files start with an identical "tissue" column, and those tissues end up repeated in the result. Because the rows from each file are stacked rather than aligned, each tissue's information from one file sits on a different row from its information in the other file, leaving a number of NaN entries. I am trying to remove the duplicates from the first column and the NaN data from the following columns.
import pandas as pd
#defining both files
df1 = pd.read_csv('Aaron test 2-CLDN3.csv')
df2 = pd.read_csv('Aaron test 2-CLDN4.csv')
#combining files into one
combined = pd.concat([df1, df2])
#selecting column order
result_df = combined[['Tissue','mean(CLDN3)','mean(CLDN4)','var(CLDN3)','var(CLDN4)']]
print (result_df)
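One way to avoid the repeated tissues and the NaN gaps is to align the two files on the shared 'Tissue' column with merge instead of stacking them row-wise with concat. A minimal sketch, assuming each file has one row per tissue and carries its own mean/var columns:

import pandas as pd

df1 = pd.read_csv('Aaron test 2-CLDN3.csv')
df2 = pd.read_csv('Aaron test 2-CLDN4.csv')

# align rows on 'Tissue' so each tissue appears exactly once
combined = pd.merge(df1, df2, on='Tissue', how='outer')

result_df = combined[['Tissue', 'mean(CLDN3)', 'mean(CLDN4)', 'var(CLDN3)', 'var(CLDN4)']]
print(result_df)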
I have 7 csv files that each contain the same number of columns and rows. I am trying to merge the data from these into one csv where each cell is the average of the 7 corresponding cells (e.g. new_csv(C3) = average(input_csvs(C3))).
Here is an example of what the inputs look like. The output should look identical (6 columns x 15 rows) except the values will be averaged in each cell.
So far I have this code to load the csv files, and I have been reading about turning them into a matrix, but I don't see anything about merging and averaging by each cell, only by row or column.
listdrs = os.listdir(dir_path)
listdrs_path = [ dir_path + x for x in listdrs]
failed_list = []
csv_matrix = []
for file_path in listdrs_path:
    tickercsv = file_path.replace(string, '')
    ticker = tickercsv.replace('.csv', '')
    data = pd.read_csv(file_path, index_col=0)
    csv_matrix.append(data)
If you run this in the directory with all of your csv files, you can use glob to find them all, then build a generator of dataframes with pd.read_csv() (with the optional parameter header=None, depending on whether or not you have column names). Then you can concat them, group by the index, and take the mean.
import pandas as pd
import glob
files = glob.glob('*.csv')
dfs = (pd.read_csv(f, header=None) for f in files)
pd.concat(dfs).groupby(level=0).mean()
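Here groupby(level=0) groups the rows of all the frames that share the same positional index, so mean() averages each cell across the files. Capturing the result in a variable so it can be saved (the output filename is an assumption):

averaged = pd.concat(dfs).groupby(level=0).mean()
averaged.to_csv('averaged.csv')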
I'm reading in an excel .csv file using pandas.read_csv(). I want to read in 2 separate column ranges of the excel spreadsheet, e.g. columns A:D AND H:J, to appear in the final DataFrame. I know I can do it once the file has been loaded in using indexing, but can I specify 2 ranges of columns to load in?
I've tried something like this....
usecols=[0:3,7:9]
I know I could list each column number individually, e.g.
usecols=[0,1,2,3,7,8,9]
but I have simplified the file in question; in my real file I have a large number of columns, so I need to be able to select 2 large ranges to read in...
I'm not sure if there's an official, pretty, pandaic way to do it with pandas.
But you can do it this way:
# say you want to extract 2 ranges of columns
# columns 5 to 14
# and columns 30 to 66
import pandas as pd
range1 = list(range(5, 15))
range2 = list(range(30, 67))
usecols = range1 + range2
file_name = 'path/to/csv/file.csv'
df = pd.read_csv(file_name, usecols=usecols)
As @jezrael notes, you can use numpy.r_ to do this in a more pythonic and legible way:
import pandas as pd
import numpy as np
file_name = 'path/to/csv/file.csv'
df = pd.read_csv(file_name, usecols=np.r_[0:4, 7:10])  # slice ends are exclusive: equivalent to usecols=[0,1,2,3,7,8,9]
Gotchas
Watch out when using this in combination with names: allow for the extra column pandas adds for the index, i.e. for csv columns 1, 2, 3 (3 items), np.r_ needs to cover 4 items (0:4).
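A quick check of what np.r_ actually produces, since the end of each slice is exclusive like an ordinary Python slice:

import numpy as np

print(np.r_[0:3, 7:9])   # [0 1 2 7 8]
print(np.r_[0:4, 7:10])  # [0 1 2 3 7 8 9] -- matches usecols=[0,1,2,3,7,8,9]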
Still not getting the hang of pandas, I am attempting to join two data frames using merge. I have read the CSVs into two data frames (named dropData and deosData in the code below). Both data frames have the column 'Date_Time', a parsed column of date and time information that creates a unique id for each entry. The deosData file is an entire year's worth of observations that I am trying to match up with corresponding entries in dropData.
CSV files:
deosData: https://www.dropbox.com/s/3rr7hf7jzrmxdke/inputDeos.csv?dl=0
dropData: https://www.dropbox.com/s/z9mv4xccjzlsyif/inputDrop.csv?dl=0
I have gone through the documentation for the merge function and have tried the following code in various iterations; so far I have only been able to produce either a blank data frame with the correct header row, or the two data frames merged on the default 0..(N-1) integer index:
My code:
import pandas as pd
import numpy as np
import os
from matplotlib import pyplot as plt
#read in CSV to dataframe
dropData=pd.read_csv("inputDrop.csv", header=0, index_col=None)
deosData=pd.read_csv("inputDeos.csv", header=0, index_col=None)
#merging dataframes into single sf
merge=pd.merge(dropData,deosData, how='inner', on='Date_Time')
#comment out during debugging
#merge.to_csv('output.csv', sep=',', header=True, index=False)
#check merge dataframe creation
print(merge.head(1))
After searching on SE and the docs, I have tried resetting the index, ignoring the index columns, copying the 'Date_Time' column as a separate index and merging on the new column, and using 'on=None', 'left_on', and 'right_on' with permutations of 'Date_Time', all to no avail. I have checked the column data types: 'Date_Time' in both is dtype object. I do not know if this is the source of the error, since the only issues I could find while searching revolved around matching different dtypes to each other.
What I am looking to do is have the two data frames merge where the two 'Date_Time' columns intersect. For example:
Date_Time,Volume(Max),Volume(Sum),Volume(Min),Volume(Mean),Diameter(Count),Diameter(Max),Diameter(Sum),Diameter(Min),Diameter(Mean),Depth(Sum),Velocity(Max),Velocity(Sum),Velocity(Min),Velocity(Mean), Air Temperature (deg. C), Relative humidity (%), Wind Speed (m.s-1), Wind Direction (deg.), Wind Gust Speed (5) (m.s-1), Barometric Pressure (mbar), Gage Precipitation (5) (mm)
9/1/2014 0:00,2.266188524,2.989272461,0.052464219,0.332141385,9,1.629668,5.972978,0.464467,0.663664222,0.003736591,2.288401,16.889656,1.495487,1.876628444,22.5,99,0,216.1,0.4,1016.2,0
Any help would be greatly appreciated.
You need parse_dates when reading the csv files, so that the Date_Time columns in both dataframes hold pd.Timestamp objects instead of raw strings. (If you look at your csv files, one is in ISO format YYYY-MM-DD HH:MM:SS whereas the other is in MM/DD/YYYY HH:MM.) Try the following code:
#read in CSV to dataframe
dropData = pd.read_csv("inputDrop.csv", header=0, index_col=None, parse_dates=['Date_Time'])
deosData = pd.read_csv("inputDeos.csv", header=0, index_col=None, parse_dates=['Date_Time'])
and then do your merge.
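The merge from the question then works as intended, since both columns now hold comparable timestamps:

merge = pd.merge(dropData, deosData, how='inner', on='Date_Time')
print(merge.head(1))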
You can use join, but you first need to set the index:
dropData=pd.read_csv('.../inputDrop.csv', header=0, index_col='Date_Time', parse_dates=True)
deosData=pd.read_csv('.../inputDeos.csv', header=0, index_col='Date_Time', parse_dates=True)
dropData.join(deosData, how='inner')  # inner join keeps only the timestamps present in both files
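Note that join aligns on the index (hence index_col='Date_Time' above), while pd.merge(..., on='Date_Time') does the same without touching the index; with how='inner', both keep only the intersection of timestamps.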