Using iloc keeps picking the last row of multiple data files - python

I have a list of 50 .csv files named data1.csv, data2.csv, etc. I would like to plot the first row, third column of each of these files. But first I would like to check the 50 values to make sure I'm plotting the correct thing. I have:
import glob
import pandas as pd
files = glob.glob('data*.csv')
for f in sorted(files):
    df = pd.read_csv(f)
    print(df.iloc[0,2])
The problem here is in the last line, df.iloc[0,2] prints the 3rd column of the LAST row when I want it to print the 3rd column of the FIRST row.
Essentially print(df.iloc[0,2]) prints the same values as print(df.iloc[-1,2]) and I have no idea why.
How can I check what values the first row, third column are in all of my files?

My mistake: pd.read_csv treats the first line of a file as a header row by default, but my .csv files have no headers, so the first row of data was being consumed as column names. We need:
df = pd.read_csv(f, header=None)
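
To see the effect of the header option, here is a small sketch with made-up two-line data (the in-memory text stands in for one of the data*.csv files and is not from the original post). With the default settings the first line becomes the column names, so only one data row is left and iloc[0] and iloc[-1] point at the same row. That would explain the symptom if each file has only two lines, which is an assumption here.

```python
import io

import pandas as pd

# Made-up stand-in for a headerless data file with two lines.
raw = "1,2,3\n4,5,6\n"

# Default: the first line is used as the header, leaving one data row.
with_header = pd.read_csv(io.StringIO(raw))

# header=None: every line is data and columns are numbered 0, 1, 2.
no_header = pd.read_csv(io.StringIO(raw), header=None)

print(with_header.iloc[0, 2], with_header.iloc[-1, 2])  # same row twice
print(no_header.iloc[0, 2], no_header.iloc[-1, 2])      # first row, last row
```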

Related

How can I skip the first line in CSV files imported into a pandas df but keep the header for one of the files?

I essentially want to preserve the header of one of the files and use it as the column names, but skip the header for the rest of the files. Is there an easier way to do this than the following:
import everything without headers, then set the column names after all the files are imported and delete the duplicate header rows from the df?
My current code is:
import glob
import pandas as pd
import os
path = r"C:\Users\..."
my_files = glob.glob(os.path.join(path, "filename*.xlsx"))
file_li = []
for filename in my_files:
    df = pd.read_excel(filename, index_col=None, header=None)
    file_li.append(df)
I am trying to append 365 files into one, based on the condition that the file name meets the above criteria. The files look like this:
  | Column1 | Column2 | Column3 | Column4 | Column5 | Column6 | Column7 | Column8 | Column9 | Column10 | Column11
2 | DATA | DATA | DATA | DATA | DATA | DATA | DATA | DATA | DATA | DATA | DATA
3 |
4 |
5 |
6 |
7 |
I want to keep the column names (column1, column2, ...) from the first file but skip them for the rest, so I don't have to reindex or change the df afterwards. The reason is that I don't want duplicate rows of column headers in the df, or missing headers. Am I overcomplicating something that has an easier solution?
Why are you putting them in a list?
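Pandas concat lets you combine DataFrames while doing the column name management for you. If you read each file with its header row (drop header=None), every DataFrame gets the same column names and concat lines the columns up, so no header rows end up in the data.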
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html
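
A minimal sketch of that idea, using two small in-memory CSV texts as made-up stand-ins for the real files (the question uses pd.read_excel, which also reads the first row as the header by default, so the same pattern should carry over). The names sources, frames and combined are just for illustration.

```python
import io

import pandas as pd

# Made-up stand-ins for two files that share the same header row.
sources = [
    io.StringIO("Column1,Column2\n1,2\n3,4\n"),
    io.StringIO("Column1,Column2\n5,6\n"),
]

# Read each file with its header (the default), so the header row
# becomes the column names instead of a data row.
frames = [pd.read_csv(src) for src in sources]

# Stack the rows; ignore_index=True renumbers them 0..n-1.
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

With the real files, the list comprehension would loop over the glob result instead of the in-memory texts.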

Creating a new dataframe to contain a section of 1 column from multiple csv files in Python

So I am trying to create a new dataframe that includes some data from 300+ csv files.
Each file contains up to 200,000 rows of data. I am only interested in 1 of the columns within each file (the same column for each file).
I am trying to combine these columns into 1 dataframe, where column 6 from csv 1 would be in the 1st column of the new dataframe, column 6 from csv 2 would be in the 2nd column of the new dataframe, and so on up until the 315th csv file.
I don't need all 200,000 rows of data to be extracted, but I am unsure how to extract just 2,000 rows from the middle section of the data (the number of rows varies from file to file, so the exact same rows for each file aren't necessary, as long as it is the middle 2,000).
Any help in extracting the 2000 rows from each file to populate different columns in the new dataframe would be greatly appreciated.
So far, I have manipulated the data to only contain the relevant column for each file. This displays all the rows of data in the column for each file individually.
I tried to use the iloc function to reduce this to 2000 rows but it did not display any actual data in the output.
I am unsure as to how I would now extract this data into a dataframe for all the columns to be contained.
import pandas as pd
import os
import glob
import itertools
#glob to get all csv files
path = os.getcwd()
csv_files = glob.glob(os.path.join('filepath/', "*.csv"))
#loop list of csv files
for f in csv_files:
    df = pd.read_csv(f, header=None)
    df.rename(columns={6: 'AE'}, inplace=True)
    new_df = df.filter(['AE'])
    print('Location:', f)
    print('File Name:', f.split("\\")[-1])
    print('Content:')
    display(new_df)
    print()
Based on your description, I am inferring that you have a number of different files in csv format, each of which has at least 2000 lines and 6 columns. You want to take the data only from the 6th column of each file and only for the middle 2000 records in each file and to put all of those blocks of 2000 records into a new dataframe, with a column that in some way identifies which file the block came from.
You can read each file using pandas, as you have done, and then you need to use loc, as one of the commenters said, to select the 2000 records you want to keep. If you save each of those blocks of records in a separate dataframe you can then use the pandas concat method to join them all together into different columns of a new dataframe.
Here is some code that I hope will be self-explanatory. I have assumed that you want the 6th column, which is the one with index 5 in pandas because we start counting from 0. I have also used usecols to keep only the 6th column, and I rename the column to an index number based on the order in which the files are being read. You would need to change this for your own choice of column naming.
I choose the middle 2000 records by defining the starting point as record x, say, so that x + 2000 + x = total number of records, therefore x=(total number of records) / 2 - 1000. This might not be exactly how you want to define the middle 2000 records, so you could change this.
df_middles is a list to which we append every new dataframe of the new file's middle 2000 records. We use pd.concat at the end to put all the columns into a new dataframe.
import os
import glob
import pandas as pd
# glob to get all csv files
path = os.getcwd()
csv_files = glob.glob(os.path.join("filepath/", "*.csv"))
df_middles = []
# loop list of csv files
for idx, f in enumerate(csv_files, 1):
    # only keep the 6th column (index 5)
    df = pd.read_csv(f, header=None, usecols=[5])
    colname = f"column_{idx}"
    df.rename(columns={5: colname}, inplace=True)
    number_of_lines = len(df)
    if number_of_lines < 2000:
        raise IOError(f"Not enough lines in the input file: {f}")
    # .loc slices include the end label, so start..start+1999 is 2000 rows
    middle_range_start = number_of_lines // 2 - 1000
    middle_range_end = middle_range_start + 1999
    df_middle = df.loc[middle_range_start:middle_range_end].reset_index(drop=True)
    df_middles.append(df_middle)
df_final = pd.concat(df_middles, axis="columns")
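
To check the middle-range arithmetic on something small, here is a sketch that does the same thing for an arbitrary block size n. The helper name middle_rows and the toy data are made up for illustration, and it uses iloc (end-exclusive) rather than loc, so treat it as one possible variant rather than the code above.

```python
import pandas as pd


def middle_rows(df, n):
    """Return the middle n rows of df, re-indexed from 0."""
    if len(df) < n:
        raise ValueError(f"Need at least {n} rows, got {len(df)}")
    start = len(df) // 2 - n // 2
    # iloc is positional and excludes the end, so this is exactly n rows
    return df.iloc[start:start + n].reset_index(drop=True)


# Toy stand-ins for two files of different lengths.
file_a = pd.DataFrame({"column_1": range(10)})
file_b = pd.DataFrame({"column_2": range(100, 120)})

# Resetting the index lets the blocks sit side by side after concat.
final = pd.concat([middle_rows(file_a, 4), middle_rows(file_b, 4)], axis="columns")
print(final)
```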

Skipping every nth row and copy/transform data into another matrix

I've extracted information from 142 different files and stored it in a CSV file with one column, which contains both numbers and text. I want to copy rows 11-145, transform them, and paste them into another file (xlsx or csv, it doesn't matter). Then I want to skip the next 10 rows, copy rows 156-290, transform and paste them, and so on. I have tried the following code:
import numpy as np
overview = np.zeros((145, 135))
for i in original:
    original[i+11:i+145, 1] = overview[1, i+1:i+135]
print(overview)
The original file is the imported file, for which I used pd.read_csv.
pd.read_csv is a function that returns a dataframe.
To select specific rows from a dataframe you can slice it with the .loc indexer (the slice is label-based and includes the stop label):
df.loc[start:stop:step]
so it would look something like this:
import pandas as pd

df = pd.read_csv(your_file)
new_df = df.loc[11:140]
# transform it as you please
# then write it out as Excel or CSV
new_df.to_excel("new_file.xlsx")  # or: new_df.to_csv("new_file.csv")
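
For the repeating part of the question (skip 10 rows, take 135, skip 10, take 135, ...), one possible sketch is to reshape the single column into blocks and drop the first rows of each block. It assumes the column splits into equal blocks of 145 rows with the 10 unwanted rows first, and that "transform" means turning each block into one row of the result. The function name blocks_to_matrix and the demo data are made up.

```python
import numpy as np
import pandas as pd


def blocks_to_matrix(column, block_len=145, skip=10):
    """Turn one long column into a matrix with one row per block,
    dropping the first `skip` values of every block."""
    values = column.to_numpy()
    n_blocks = len(values) // block_len
    # drop any incomplete trailing block so the reshape is valid
    trimmed = values[: n_blocks * block_len]
    return trimmed.reshape(n_blocks, block_len)[:, skip:]


# Small demo: blocks of 5 values, skipping the first 2 of each block.
demo = pd.Series(range(1, 11))
matrix = blocks_to_matrix(demo, block_len=5, skip=2)
print(matrix)

# The result can be written out through pandas, for example:
# pd.DataFrame(matrix).to_csv("new_file.csv", index=False, header=False)
```

With the real data this would be called on the single column of the imported file, read with header=None so that no data row is used up as a header.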

How do you compare two csv files with identical columns but different values?

Here's my problem, I need to compare two procmon scans which I converted into CSV files.
Both files have identical column names, but obviously the contents differ.
I need to check the "Path" (5th column) from the first file to the one to the second file and print out the ENTIRE row of the second file into a third CSV, if there are corresponding matches.
I've been googling for quite a while and can't seem to get this to work like I want it to, any help is appreciated!
I've tried numerous online tools and other python scripts, to no avail.
Have you tried using pandas and numpy together?
It would look something like this:
import pandas as pd
import numpy as np

# load the second file as a DataFrame, since you need its whole rows later
file2 = pd.read_csv("file2.csv")
# get the columns to compare
file1Column5 = pd.read_csv("file1.csv")["name of column 5"]
file2Column5 = file2["name of column 5"]
# add a column marking rows where the values match ('True') or not ('False');
# == compares row by row, so both files must have the same number of rows
file2["ColumnsMatch"] = np.where(file1Column5 == file2Column5, 'True', 'False')
# keep the matching rows and remove the helper column
file2 = file2[file2['ColumnsMatch'] == 'True'].drop(columns='ColumnsMatch')
# write to a new file
file2.to_csv(r'file3.csv')
Just write your own code for such things. It's probably easier than you expect.
#!/usr/bin/env python
import pandas as pd
# read the csv files
csv1 = pd.read_csv('<first_filename>')
csv2 = pd.read_csv('<second_filename>')
# create a boolean Series comparing the files row by row
iseq = csv1['Path'] == csv2['Path']
# copy the rows of csv2 where the comparison is True into csv3
csv3 = pd.DataFrame(csv2[iseq])
# write to a new csv file
csv3.to_csv('<new_filename>')
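
Both snippets above compare the files position by position, which only works when the two scans have the same number of rows in the same order. If the goal is "keep every row of the second file whose Path appears anywhere in the first file", a membership test with Series.isin may be closer to it. This is a sketch under that assumption. The in-memory data is made up; only the "Path" column name comes from the question.

```python
import io

import pandas as pd

# Made-up stand-ins for the two converted procmon scans.
scan1 = pd.read_csv(io.StringIO("Process,Path\na.exe,C:\\one\nb.exe,C:\\two\n"))
scan2 = pd.read_csv(io.StringIO(
    "Process,Path\nc.exe,C:\\two\nd.exe,C:\\three\ne.exe,C:\\one\n"
))

# True for each row of scan2 whose Path occurs anywhere in scan1.
mask = scan2["Path"].isin(scan1["Path"])
matches = scan2[mask]
print(matches)

# To save the result: matches.to_csv("file3.csv", index=False)
```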

Cleaning dataframe- assign value in one cell to column

I am reading multiple CSV files from a folder into a dataframe. I loop for all the files in the folder and then concat the dataframes to obtain the final dataframe.
However the CSV file has one summary row from which I want to extract the date, and then add as a new column for all the rows in that csv/dataframe.
df = pd.read_csv(f, header=None, names=['Inverter', 'Day Yield', 'month Yield', 'Year Yield', 'SpecificYieldDay', 'SYMth', 'SYYear', 'Power'], sep=';', **kwargs)
df['date'] = df.loc[[0], ['Day Yield']]
df
I expect ['date'] column to be filled with the date for that file for all the rows in that particular csv, but it gets filled correctly only for the first row.
Refer to image of dataframe. I want all the rows of the 'date' column to be showing 7/25/2019 instead of only the first row.
I have also added an example of one of the csv files I am reading from
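If I understood correctly, the value that you want to add as a new column for all rows is the one in row 0 of 'Day Yield'. Note that df.loc[[0],['Day Yield']] returns a one-cell DataFrame, which is aligned on the index when assigned and so fills only row 0, whereas df.loc[0, 'Day Yield'] returns the value itself.
If that is correct you can do the following:
df = df.assign(date=df.loc[0, 'Day Yield'])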
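
As a self-contained sketch of the per-file step: pull the date out of the summary row as a single value and assign it, so pandas broadcasts it to every row. The function name read_with_date, the shortened column list and the sample text are made up for illustration. The ';' separator and the 'Day Yield' column come from the question.

```python
import io

import pandas as pd


def read_with_date(source):
    """Read one file and copy the date from its first row to all rows."""
    df = pd.read_csv(source, header=None, sep=";",
                     names=["Inverter", "Day Yield", "Power"])
    # A scalar on the right-hand side is repeated for every row.
    df["date"] = df.loc[0, "Day Yield"]
    return df


# Made-up file content: a summary row holding the date, then data rows.
sample = io.StringIO("Summary;7/25/2019;\nInv1;5.1;200\nInv2;4.8;190\n")
df = read_with_date(sample)
print(df)
```

The frames returned for each file can then be combined with pd.concat as before.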
