How to merge data frame into one csv file after using glob? - python

I have tried to work on several csv files using glob, for example:
import glob
import pandas as pd
import numpy as np
import csv
# Read all csv files with same file name in the folder
filenames = sorted(glob.glob('./16_2018-02*.csv'))
for f in filenames:
    df = pd.read_csv(f, names=['Date','RSSI','Data','Code'],
                     index_col=None)
    # Slicing information
    df["ID"] = df["Data"].str.slice(0,2)
    df["X"] = df["Data"].str.slice(2,4)
    # Save the output data to csv with different name
    df.to_csv(f'{f[:-4]}-train.csv', index=False)
At the end of the loop, I save each dataframe to a new csv file with a different name. Since I now have many csv files to work with, I want to concatenate them into one file without first writing each one to its own csv. How should I do that?
Original dataset first 5 rows:
Date RSSI Data Code
2018-02-20T00:00:20.886+09:00 -99 1068ffd703d101ec77f425ea98b201 F2D5
2018-02-20T00:00:21.904+09:00 -95 103cffbc032901ee77f49dea98b301 F2D5
2018-02-20T00:00:22.415+09:00 -97 103cffbc032901ee77f49dea98b301 F2D5
2018-02-20T00:00:46.580+09:00 -96 10fdfda803ff01f477f49dfd98cb03 F2D1
2018-02-20T00:00:48.593+09:00 -96 101bfed3037401f577f49dfe98cd03 F2D6
After:
Date RSSI Data Code ID X
2018-02-20T00:00:20.886+09:00 -99 1068ffd703d101ec77f425ea98b201 F2D5 16 0.065384
2018-02-20T00:00:21.904+09:00 -95 103cffbc032901ee77f49dea98b301 F2D5 16 0.065340
2018-02-20T00:00:22.415+09:00 -97 103cffbc032901ee77f49dea98b301 F2D5 16 0.065340
2018-02-20T00:00:46.580+09:00 -96 10fdfda803ff01f477f49dfd98cb03 F2D1 16 0.065021
2018-02-20T00:00:48.593+09:00 -96 101bfed3037401f577f49dfe98cd03 F2D6 16 0.065051

Try the code below to append all the files into one dataframe:
filenames = sorted(glob.glob('./16_2018-02*.csv'))
appended_data = []  # collect one dataframe per file
for f in filenames:
    df = pd.read_csv(f, names=['Date','RSSI','Data','Code'],
                     index_col=None)
    # Slicing information
    df["ID"] = df["Data"].str.slice(0,2)
    df["X"] = df["Data"].str.slice(2,4)
    appended_data.append(df)  # append to the list
# stack the frames vertically, so the rows of each file follow one another
appended_data = pd.concat(appended_data, axis=0, ignore_index=True)
# use axis=1 instead only if you need the files side by side as extra columns
appended_data is now a single dataframe containing the rows of all files, which you can then export to csv/excel with to_csv or to_excel.
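As a minimal sketch of the whole flow, the loop, the vertical concatenation and the single export could be wrapped in one function. It assumes the files are comma-separated, have no header row, and contain the four columns from the question. The function name combine_csvs and the output name combined.csv are illustrative and not part of the original answer.

```python
import glob

import pandas as pd


def combine_csvs(pattern, out_path):
    """Read every CSV matching `pattern`, add the sliced columns, write one CSV."""
    frames = []
    for f in sorted(glob.glob(pattern)):
        # dtype keeps the hex payload as text even if it happens to look like a number
        df = pd.read_csv(f, names=["Date", "RSSI", "Data", "Code"],
                         dtype={"Data": str})
        df["ID"] = df["Data"].str.slice(0, 2)
        df["X"] = df["Data"].str.slice(2, 4)
        frames.append(df)
    if not frames:
        raise FileNotFoundError(f"no files match {pattern!r}")
    # axis=0 stacks the rows; ignore_index renumbers them 0..n-1
    combined = pd.concat(frames, axis=0, ignore_index=True)
    combined.to_csv(out_path, index=False)
    return combined
```

A call such as combine_csvs('./16_2018-02*.csv', 'combined.csv') would then produce one output file. Choose an output name that does not match the glob pattern, otherwise a second run would read the combined file back in.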

Related

Automatic import of multiple CSV files with pandas

I have 2 folders with 365 CSV files each. However, I only need certain columns from these CSV files.
I have already solved this problem with pandas usecols. But only for one file. I want to automate the whole thing.
The file names follow a pattern with an incrementing date variable:
f{date}_sds011_sensor_3659.csv
I don't know what the smartest approach is, though: loop through all the files first and insert everything into the database at the end, or insert after each loop iteration?
I've been stuck on this problem for 2 weeks and have tried every variant I could think of, but I can't find a solution that covers everything (automated import + only selected columns + skipping the first line of each file).
folder names: dht22 and sds011 (the names of 2 sensors)
format of the csv file names: 2020-09-25_sds011_sensor_3659.csv
Start date: 25.09.2020
End date: 24.09.2021
400-700 rows in each file
the sds011 sensor has 12 columns and I need 3 (timestamp, P1, P2 (size of particulate matter))
the dht22 has 8 and I need 3 (timestamp, temperature and humidity)
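A possible solution is the following:
import glob
import pandas as pd
# recursive=True only has an effect when the pattern contains '**', so it is omitted here
all_files = glob.glob('folder_name/*.csv')
all_data = []
for file in all_files:
    # header=0 uses the first line as column names; usecols keeps only the listed columns
    df = pd.read_csv(file, index_col=None, header=0, usecols=['col1', 'col2'])
    all_data.append(df)
result = pd.concat(all_data, axis=0, ignore_index=True)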
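If the files should be selected by date rather than by whatever happens to be in the folder, the names can be generated from the date range given in the question. This is a sketch only: the helper name load_sensor_folder is made up, and the column names passed as columns must match the real header of the files (the question names timestamp, P1 and P2 for the sds011 sensor). Extra keyword arguments such as sep are forwarded to read_csv.

```python
import os

import pandas as pd


def load_sensor_folder(folder, sensor, sensor_id, start, end, columns,
                       **read_csv_kwargs):
    """Load one CSV per day from `start` to `end` (inclusive), keeping `columns`."""
    frames = []
    for day in pd.date_range(start, end, freq="D"):
        # builds names like 2020-09-25_sds011_sensor_3659.csv
        name = f"{day:%Y-%m-%d}_{sensor}_sensor_{sensor_id}.csv"
        path = os.path.join(folder, name)
        if not os.path.exists(path):
            continue  # tolerate days without a file
        # header=0 consumes the first line as column names,
        # so that line never shows up as a data row
        frames.append(
            pd.read_csv(path, header=0, usecols=columns, **read_csv_kwargs)
        )
    if not frames:
        return pd.DataFrame(columns=columns)
    return pd.concat(frames, ignore_index=True)
```

With 400-700 rows per file and 365 files per folder, the combined frame stays small, so collecting everything first and writing to the database once at the end is probably the simpler of the two options raised in the question.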

Create new dataframe based on condition

I have a series of csv files inside a directory. Each csv file has the following columns:
slotID; NLunUn; NLunTot; MeanBPM
Starting from the values in the slotID column, I would like to create data frames that contain the corresponding values. E.g.
the 1st csv has the following values:
slotID NLun An NLunTot MeanBPM
7 11 78 129,7
11 6 63 123,3
12 6 33 120,6
13 5 41 124,5
14 4 43 118,9
the 2nd csv has the following values
slotID NMarAn NMarTot MeanBPM
7 10 72 131,2
11 5 48 121,5
12 4 17 120,9
13 4 19 125,6
16 6 45 127,4
I would like to create a dataframe, called for example dataframe1, that holds the values of slot 7, another one that holds the values of slot 11, and so on. Any suggestion is welcome; I've been trying for several days without success. This is what I've done so far:
import pandas as pd
#import matplotlib.pyplot as plt
import os
import glob
import numpy as np
path = os.getcwd()
csv_files = glob.glob(os.path.join(path, "*.csv"))
for f in csv_files:
    dfDay = pd.read_csv(f, encoding = "ISO-8859-1", sep = ';')
    # inside dfDay there is the data read from the csv files
Provided that all the csv-files have the same structure (i.e. column names) you could do something like this:
...
path = os.getcwd()
csv_files = glob.glob(os.path.join(path, "*.csv"))
df = pd.concat(
    (pd.read_csv(f, encoding='ISO-8859-1', sep=';') for f in csv_files),
    ignore_index=True
)
slot_dfs = {slot: group for slot, group in df.groupby("slotID")}
# Exporting to csv-files
for n, df_slot in enumerate(slot_dfs.values(), start=1):
    df_slot.to_csv(f"dataframe{n}.csv", index=False)
The dictionary slot_dfs contains the dataframes for each available slot.
If you really want to create variables for the dataframes then you could try
for n, (_, group) in enumerate(df.groupby("slotID"), start=1):
    globals()[f"dataframe{n}"] = group
    # Exporting to csv-file
    group.to_csv(f"dataframe{n}.csv", index=False)
instead of creating the slot_dfs dictionary. After that print(dataframe1) should show the dataframe for the first slot etc.
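To illustrate why the dictionary is usually more convenient than generated variable names, here is a small self-contained sketch. Two in-memory frames stand in for the csv files; the values come from the question, reduced to the slotID and MeanBPM columns and written with a decimal point. Each slot's rows are looked up by the slot ID itself instead of by a running number.

```python
import pandas as pd

# two small frames standing in for the csv files
day1 = pd.DataFrame({"slotID": [7, 11, 12], "MeanBPM": [129.7, 123.3, 120.6]})
day2 = pd.DataFrame({"slotID": [7, 11, 16], "MeanBPM": [131.2, 121.5, 127.4]})

df = pd.concat([day1, day2], ignore_index=True)

# one dataframe per slot, keyed by the slot ID
slot_dfs = {
    slot: group.reset_index(drop=True) for slot, group in df.groupby("slotID")
}

slot7 = slot_dfs[7]           # rows of slot 7 from both files
available = sorted(slot_dfs)  # the slot IDs that occur: 7, 11, 12 and 16
```

The files in the question use a decimal comma (129,7); read_csv can parse such values when called with decimal=','.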

How to extract a specific value from multiple csv of a directory, and append them in a dataframe?

I have a directory with hundreds of csv files, each representing the pixels of a thermal camera image (288x383). I want to get the center value of each file (e.g. at 144 x 191) and collect those values in a dataframe alongside the name of each file.
Here is my code, where I created the dataframe with the list of csv file names:
import os
import glob
import numpy as np
import pandas as pd
os.chdir("/Programming/Proj1/Code/Image_Data")
!ls
Out:
2021-09-13_13-42-16.csv
2021-09-13_13-42-22.csv
2021-09-13_13-42-29.csv
2021-09-13_13-42-35.csv
2021-09-13_13-42-47.csv
2021-09-13_13-42-53.csv
...
file_extension = '.csv'
all_filenames = [i for i in glob.glob(f"*{file_extension}")]
files = glob.glob('*.csv')
all_df = pd.DataFrame(all_filenames, columns = ['Full_name'])
all_df.head()
Full_name
0 2021-09-13_13-42-16.csv
1 2021-09-13_13-42-22.csv
2 2021-09-13_13-42-29.csv
3 2021-09-13_13-42-35.csv
4 2021-09-13_13-42-47.csv
5 2021-09-13_13-42-53.csv
6 2021-09-13_13-43-00.csv
You can loop through your files one by one, reading each into a dataframe and taking the center value you want. Save this value along with the file name; the resulting list can then be loaded into a new dataframe ready for you to use.
# position of the center pixel, taken from the example in the question (144 x 191)
row, col = 144, 191
result = []
for file in files:
    # read in the file, you may need to specify some extra parameters
    # check the pandas docs for read_csv
    # header=None assumes the files hold only pixel values, with no header row
    df = pd.read_csv(file, header=None)
    # now select the value you want
    # this will vary depending on what your indexes look like (if any)
    # and also your column names; iloc selects purely by position
    value = df.iloc[row, col]
    # append to the list
    result.append((file, value))
# you should now have a list in the format:
# [('2021-09-13_13-42-16.csv', 100), ('2021-09-13_13-42-22.csv', 255), ...
# load the list of tuples as a dataframe for further processing or analysis...
result_df = pd.DataFrame(result, columns=['file', 'value'])
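A compact variant of the same idea, as a sketch under the assumption that each file holds nothing but a rectangular grid of comma-separated numbers, with no header row or index column. The center position is derived from the shape of each file rather than hard-coded. The helper name center_values and the result column center_value are illustrative.

```python
import glob

import pandas as pd


def center_values(pattern):
    """Return one row per file: the file name and its center pixel value."""
    rows = []
    for file in sorted(glob.glob(pattern)):
        pixels = pd.read_csv(file, header=None)
        n_rows, n_cols = pixels.shape
        # integer division picks the middle row and column;
        # a 288x383 grid gives position (144, 191)
        rows.append((file, pixels.iloc[n_rows // 2, n_cols // 2]))
    return pd.DataFrame(rows, columns=["Full_name", "center_value"])
```

Calling center_values('*.csv') from inside the image directory would return the file names next to their center values in a single dataframe.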

Adding a column to dataframe while reading csv files [pandas]

I'm reading multiple csv files and combining them into a single dataframe like below:
pd.concat([pd.read_csv(f, encoding='latin-1') for f in glob.glob('*.csv')],
ignore_index=False, sort=False)
Problem:
I want to add a column that doesn't exist in any csv (to the dataframe) based on the csv file name for every csv file that is getting concatenated to the dataframe. Any help will be appreciated.
glob.glob returns a list of plain strings, so you can add a column to each individual dataframe in a loop.
Assuming you have the files df1.csv and df2.csv in your directory:
import glob
import pandas as pd
files = glob.glob('df*csv')
dfs = []
for file in files:
    df = pd.read_csv(file)
    df['filename'] = file
    dfs.append(df)
df = pd.concat(dfs, ignore_index=True)
df
a b filename
0 1 2 df1.csv
1 3 4 df1.csv
2 5 6 df2.csv
3 7 8 df2.csv
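The same result can be had without an explicit loop by chaining assign onto read_csv, which stays close to the one-liner in the question. This is a sketch; the function name read_with_source is hypothetical, and os.path.basename is used so that only the file name, not the full path, ends up in the column.

```python
import glob
import os

import pandas as pd


def read_with_source(pattern, **read_csv_kwargs):
    """Concatenate all CSVs matching `pattern`, tagging each row with its file name."""
    frames = (
        pd.read_csv(f, **read_csv_kwargs).assign(filename=os.path.basename(f))
        for f in sorted(glob.glob(pattern))
    )
    # pd.concat accepts any iterable of dataframes, including a generator
    return pd.concat(frames, ignore_index=True)
```

For the question's case this would be called as read_with_source('*.csv', encoding='latin-1'). Note that pd.concat raises a ValueError if no file matches the pattern.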
I have multiple csv files in my local directory. Each filename contains some numbers, and some of those numbers identify the year the file belongs to. I need to add a year column to each file while concatenating, taking the year from the filename. I'm using a regex to extract the two-digit year and prepend the century as a string, like '20' + '11' = '2011'. Then I set the column's data type to int32.
import glob
import re
import pandas as pd
pd.concat(
    [
        pd.read_csv(f)
        .assign(year = '20' + re.search(r'[a-z]+(?P<year>[0-9]{2})', f).group('year'))
        .astype({'year' : 'int32'})
        for f in glob.glob('stateoutflow*[0-9].csv')
    ],
    ignore_index = True
)
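Because re.search returns None when a file name does not match, the expression above fails with an AttributeError on unexpected names. A slightly more defensive sketch pulls the extraction into a small function. The function name year_from_filename and the sample file name stateoutflow1112.csv are assumptions made for illustration.

```python
import re


def year_from_filename(filename):
    """Return the year encoded as two digits right after the leading letters."""
    match = re.search(r"[a-z]+(?P<year>[0-9]{2})", filename)
    if match is None:
        raise ValueError(f"no two-digit year found in {filename!r}")
    # prepend the century as text, then convert once to an integer
    return int("20" + match.group("year"))


year = year_from_filename("stateoutflow1112.csv")
```

Inside the list comprehension it would be used as .assign(year=year_from_filename(f)), followed by the same astype call if int32 is required.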

Python: read through multiple but not all csv files in my folder

I want to read some csv files from my folder and concatenate them into one big pandas dataframe. All of my csv files end with a number, and I only want to read the files whose number falls in the ranges 6~10, 16~20, or 26~30. My goal is to read the files iteratively. Attached is my code so far:
import pandas as pd
data_one = pd.read_csv('Datafile-6.csv', header=None)
for i in range(7,11):
    data99 = pd.read_csv('Datafile-'+i+'*.csv', header=None) #this line needs work
    data_one = pd.concat([data_one, data99.iloc[:,1]],axis=1,ignore_index=True)
data_two = pd.read_csv('Datafile-16.csv', header=None)
for j in range(17,21):
    #Repeat similar process
What should I do about 'data99' such that 'data_one' contains columns from 'Datafile-6' through 'Datafile-10'?
The first five rows of data_one should look like this, after getting data from Datafiles 6-10.
0 1 2 3 4 5
0 -40.0 0.179836 0.179630 0.179397 0.179192 0.179031
1 -39.0 0.183696 0.183441 0.183204 0.182977 0.182795
2 -38.0 0.186720 0.186446 0.186191 0.185949 0.185762
3 -37.0 0.189490 0.189207 0.188935 0.188686 0.188475
4 -36.0 0.192154 0.191851 0.191569 0.191301 0.191086
Column 0 is included in all of the data files, so I'm only concatenating column 1 of all of the subsequent data files.
You need to use the glob module:
import glob, os
import pandas as pd
path = r'C:\YourFolder'  # path to folder with .csv files
all_files = glob.glob(os.path.join(path, "*.csv"))  # avoids shadowing the built-in all()
wanted = [6,7,8,9,10,16,17,18,19,20,26,27,28,29,30]  # modify this list with the conditions you need
list_ = []
for file_ in all_files:
    df = pd.read_csv(file_, index_col=None, header=0)
    # keep the file only if the last value of the chosen column is in the list;
    # .iloc[-1] yields a scalar, so the test is a plain boolean
    if df['YourColumns'].iloc[-1] in wanted:
        list_.append(df)
d_frame = pd.concat(list_)
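The question asks for a selection based on the number at the end of the file name rather than on a column value. A sketch of that variant follows. The Datafile-<number>.csv naming and the rule "keep column 0 once, then column 1 of every file" are taken from the question, while the function names select_files and combine are hypothetical.

```python
import glob
import os
import re

import pandas as pd


def select_files(folder, numbers):
    """Return paths of Datafile-<n>.csv files whose n is in `numbers`, ordered by n."""
    selected = []
    for path in glob.glob(os.path.join(folder, "Datafile-*.csv")):
        match = re.fullmatch(r"Datafile-(\d+)\.csv", os.path.basename(path))
        if match and int(match.group(1)) in numbers:
            selected.append((int(match.group(1)), path))
    # sort numerically so that 10 comes after 7, not before 6
    return [path for _, path in sorted(selected)]


def combine(paths):
    """Keep column 0 of the first file, then column 1 of every file, side by side."""
    if not paths:
        raise ValueError("no files selected")
    frames = [pd.read_csv(p, header=None) for p in paths]
    columns = [frames[0].iloc[:, 0]] + [f.iloc[:, 1] for f in frames]
    return pd.concat(columns, axis=1, ignore_index=True)
```

data_one would then be combine(select_files(folder, range(6, 11))), with range(16, 21) and range(26, 31) for the other two groups. The sketch assumes every file has the same number of rows, since the columns are aligned by row position.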
