Automatic import of multiple CSV files with pandas - python

I have 2 folders with 365 CSV files each, but I only need certain columns from these files.
I have already solved this for a single file with pandas' usecols; now I want to automate the whole thing, using an incrementing date variable in the file name:
f"{date}_sds011_sensor_3659.csv"
What I don't know is the smartest approach: loop through all the files first and insert into the database at the end, or insert after each loop iteration?
I've been stuck on this problem for 2 weeks and have tried every variant I could think of, but I can't find a solution that covers everything (automated import + only selected columns + skipping the first line of each file).
folder names: dht22 and sds011 (the names of 2 sensors)
format of the csv file names: 2020-09-25_sds011_sensor_3659.csv
Start date: 25.09.2020
End date: 24.09.2021
400-700 rows in each file
the sds011 sensor has 12 columns and I need 3 (timestamp, P1 and P2, the particulate matter sizes)
the dht22 has 8 columns and I need 3 (timestamp, temperature and humidity)

Possible solution is the following:

import glob
import pandas as pd

all_files = glob.glob('folder_name/*.csv', recursive=True)
all_data = []
for file in all_files:
    df = pd.read_csv(file, index_col=None, header=0, usecols=['col1', 'col2'])
    all_data.append(df)
result = pd.concat(all_data, axis=0, ignore_index=True)
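
A sketch of how this could be extended to cover both sensor folders with their own column lists and a single database insert at the end; the folder layout, column names, separator and the SQLite database/table names are assumptions based on the description above, not tested against the real files:

import glob
import sqlite3
import pandas as pd

# columns to keep per sensor folder (names assumed from the question)
wanted = {
    'sds011': ['timestamp', 'P1', 'P2'],
    'dht22': ['timestamp', 'temperature', 'humidity'],
}

frames = {}
for sensor, cols in wanted.items():
    files = sorted(glob.glob(f'{sensor}/*.csv'))
    # header=0 is the default, so the first line of each file is consumed as the header
    dfs = [pd.read_csv(f, usecols=cols) for f in files]  # add sep=';' if the files are semicolon-separated
    frames[sensor] = pd.concat(dfs, ignore_index=True)

# one insert per sensor after the loop; with ~365 files of 400-700 rows each,
# collecting everything in memory first and writing once is simpler and faster
# than one insert per loop iteration
with sqlite3.connect('sensors.db') as con:  # database and table names assumed
    for sensor, df in frames.items():
        df.to_sql(sensor, con, if_exists='append', index=False)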

Related

Merging csv files into one (columnwise) in Python

I have many .csv files like this (with one column):
[screenshot of one of the single-column CSV files]
I'd like to merge them into one .csv file, so that each column contains the data of one of the csv files. The headings should be like this (when converted to a spreadsheet):
[screenshot of the desired headings] (the first value is the number of minutes extracted from the file name, the second is the first word after "export_" in the file name, and the third is the whole file name).
I'd like to work in Python.
Could someone please help me with this? I am new to Python.
Thank you very much.
I tried to join only 2 files, but I have no idea how to do it for more files without writing everything out manually. I also don't know how to extract the headings from the file names:
import pandas as pd

file_list = ['export_Control 37C 4h_Single Cells_Single Cells_Single Cells.csv', 'export_Control 37C 0 min_Single Cells_Single Cells_Single Cells.csv']
df = pd.DataFrame()
for file in file_list:
    temp_df = pd.read_csv(file)
    df = pd.concat([df, temp_df], axis=1)
print(df)
df.to_csv('output2.csv', index=False)
Assuming that your .csv files all have a header and the same number of rows, you can use the code below to put all the (single-column) .csv files one beside the other in a single Excel worksheet.
import os
import pandas as pd

csv_path = r'path_to_the_folder_containing_the_csvs'
csv_files = [file for file in os.listdir(csv_path)]

list_of_dfs = []
for file in csv_files:
    temp = pd.read_csv(csv_path + '\\' + file, header=0, names=['Header'])
    time_number = pd.DataFrame([[file.split('_')[1].split()[2]]], columns=['Header'])
    file_title = pd.DataFrame([[file.split('_')[1].split()[0]]], columns=['Header'])
    file_name = pd.DataFrame([[file]], columns=['Header'])
    out = pd.concat([time_number, file_title, file_name, temp]).reset_index(drop=True)
    list_of_dfs.append(out)

final = pd.concat(list_of_dfs, axis=1, ignore_index=True)
final.columns = ['Column' + str(col+1) for col in final.columns]
final.to_csv(csv_path + '\\output.csv', index=False)
final
For example, considering three .csv files, running the code above yields:
[screenshot: output in Jupyter]
[screenshot: output in Excel]

sqlite python import just selected columns

I have a folder with hundreds of csv files, each containing 9 values from a temperature sensor. The columns are sensor_id, lat, lon (the coordinates) and some other things that I don't need. The only columns I need are the three [timestamp, temperature and humidity].
I already tried to use a module to import just the columns that I want, and I tried to delete the columns that I don't want with loops.
I'm slowly starting to despair; can someone help me, please?
If you are open to using pandas, you can do it simply with the usecols parameter while reading the csv file.

import pandas as pd

df = pd.read_csv('your_file/path/file.csv', usecols=['col1', 'col2'])
print(df.shape)
df.head()
Here's some code that should do it, just add in your target directory and change the numbers on the last line to the indices of the columns you want (with the first column being 0):

import os
import csv

targetdir = ""  # fill this in
allrows = []
for filename in os.listdir(targetdir):
    with open(os.path.join(targetdir, filename), 'r', newline='') as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row of each file
        for row in reader:
            allrows.append([row[1], row[3], row[5]])
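
Since the title mentions sqlite, here is a minimal sketch of loading just those three columns and pushing them into a local database with pandas; the folder name, database file and table name are assumptions:

import glob
import sqlite3
import pandas as pd

# read only the three needed columns from every csv in the (assumed) folder
frames = [pd.read_csv(f, usecols=['timestamp', 'temperature', 'humidity'])
          for f in glob.glob('sensor_folder/*.csv')]
df = pd.concat(frames, ignore_index=True)

# write everything into an (assumed) SQLite table in one go
with sqlite3.connect('sensor_data.db') as con:
    df.to_sql('dht22', con, if_exists='append', index=False)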

How do I bring the filename into the data frame with read_excel?

I have a directory of excel files that will continue to grow with weekly snapshots of the same data fields. Each file has a date stamp added to the file name (e.g. "_2021_09_30").
Here are my source files:
[screenshot of the source file directory]
I have figured out how to read all of the excel files into a python data frame using the code below:
import os
import pandas as pd

cwd = os.path.abspath('NETWORK DRIVE DIRECTORY')
files = os.listdir(cwd)

df = pd.DataFrame()
for file in files:
    if file.endswith('.xlsx'):
        df = df.append(pd.read_excel(cwd+"/"+file), ignore_index=True)
df.head()
Since these files are snapshots of the same data fields, I want to be able to track how the underlying data changes from week to week. So I would like to add/include a column that has the filename so I can incorporate the date stamp in downstream analysis.
Any thoughts? Thank you in advance.
Welcome to StackOverflow! I agree with the comments that it's not exactly clear what you're looking for, so maybe clearing that up will help us be more helpful.
For example, with the filename "A_FILENAME_2020-01-23", do you want to use the name "A_FILENAME", or "A_FILENAME_2020-01-23"? Or are you not sure, because you're trying to think through how to track this downstream?
If the latter approach, this is what you would do for adding a new column:
for file in files:
    if file.endswith('.xlsx'):
        tmp = pd.read_excel(cwd+"/"+file)
        tmp['filename'] = file
        df = df.append(tmp, ignore_index=True)
This would allow you to search the table by the start of the 'filename' column, and pull up the data of each snapshot of the file side by side. Unfortunately, this is a LOT of data.
If you ONLY want to store differences, you'd be able to use the .drop_duplicates function to try to drop based off a unique value that you use to decide whether there's a new, modified, or deleted row: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html
But if you don't have a unique identifier for rows, this becomes quite a tough engineering problem. Do you have a unique identifier you can use for your diffing strategy?
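As a minimal sketch of that idea, assuming the 'filename' column added above and using every other column as the deduplication key:

# drop rows whose data columns are identical across snapshots, so only the
# snapshot where each distinct row first appeared is kept
data_cols = [c for c in df.columns if c != 'filename']
changes = df.drop_duplicates(subset=data_cols, keep='first')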
Extra code to split the filename up into separate columns for easier filtering later on (no harm in adding these, the more columns the better I think):

from datetime import datetime

tmp['filename_stripped'] = file[:-16]                                # name without the "_YYYY_MM_DD.xlsx" suffix (assumed format)
tmp['filename_date'] = datetime.strptime(file[-15:-5], "%Y_%m_%d")   # date stamp at the end of the name, e.g. "_2021_09_30.xlsx"
You could add an additional column to the dataframe, modifying your code:

temp = pd.read_excel(cwd+"/"+file)
temp['date'] = file[-15:-5]  # date stamp from the file name, e.g. "2021_09_30"
df = df.append(temp, ignore_index=True)
You can use glob to easily combine xlsx or csv files into one dataframe. You just have to copy-paste your files' absolute path to where it says "/xlsx_path". You can also change read_excel to read_csv if you have csv files.
import pandas as pd
import glob
all_files = glob.glob(r'/xlsx_path' + "/*.xlsx")
file_list = [pd.read_excel(f) for f in all_files]
all_df = pd.concat(file_list, axis=0, ignore_index=True)
Alternatively you can use the one-liner below:
all_df = pd.concat(map(pd.read_excel, glob.glob('/xlsx_path/*.xlsx')))
Not sure what you really want, but regarding tracking changes: let's say you have 2 excel files, you can track the changes between them by doing the following:
df1 = pd.read_excel("file-1.xlsx")
df1
values
0 aa
1 bb
2 cc
3 dd
4 ee
df2 = pd.read_excel("file-2.xlsx")
df2
values
0 aa
1 bb
2 cc
3 ddd
4 e
...and generate a new dataframe containing the rows that changed between your 2 files:
df = pd.concat([df1, df2])
df = df.reset_index(drop=True)
new_df = df.groupby(list(df.columns))
diff = [x[0] for x in new_df.groups.values() if len(x) == 1]
df.reindex(diff)
Output :
values
0 dd
1 ddd
2 e
3 ee

Full join between tsv files and column renaming based on file of origin

I have 176 .tsv files as a result of a gene alignment looking like these:
target_id    length    tpm
ENST0001     12        100
ENST0001     9         5
In these files, I expect a certain overlap between the target_id columns, but not a complete one, so I would like to do a full join and keep all rows. Additionally, I am only interested in keeping the tpm values from each file and renaming that column according to the file name.
The expected dataframe would be something similar to:
target_id    SRR100001    SRR100002
ENST0001     100          7
ENST00015    5            0
I am aware of the join command in bash, but it can only be used on two files at a time, and if I understood correctly I cannot select specific columns...
Thank you in advance!
EDIT: The files are named as SRR*.tsv
Let me know if this code works for you, it's hard to test without having the files.
import re
import os
import sys
import pandas as pd

tpm_dict = {}
for fn in os.listdir(sys.argv[1]):
    if re.match(r'.*\.tsv$', fn):
        header = fn.replace('.tsv', '')
        this_df = pd.read_csv(os.path.join(sys.argv[1], fn), sep='\t')
        for i, row in this_df.iterrows():
            try:
                tpm_dict[row['target_id']][header] = row['tpm']
            except KeyError:
                try:
                    tpm_dict[row['target_id']] = {header: row['tpm']}
                except:
                    print(f"Problem in {fn} at row {i}")

df = pd.DataFrame.from_dict(tpm_dict, orient='index')
df.to_csv('joined.tsv', sep='\t')
Save as tsvjoin.py and then run python3 tsvjoin.py <folder with TSVs>
Edit: typos
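
For reference, a pandas-only sketch of the same full join using an outer merge instead of the dictionary; the SRR*.tsv naming from the edit and the folder name are assumptions:

import glob
import os
from functools import reduce
import pandas as pd

dfs = []
for fn in glob.glob('tsv_folder/SRR*.tsv'):  # folder name assumed
    name = os.path.splitext(os.path.basename(fn))[0]
    df = pd.read_csv(fn, sep='\t', usecols=['target_id', 'tpm'])
    dfs.append(df.rename(columns={'tpm': name}))

# a full (outer) join on target_id keeps every row from every file
joined = reduce(lambda left, right: left.merge(right, on='target_id', how='outer'), dfs)
joined.to_csv('joined.tsv', sep='\t', index=False)

Note that this assumes target_id is unique within each file; duplicated ids would multiply rows in the merge, whereas the dictionary approach above keeps only the last tpm value per id.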

Appending multiple CSV files and creating a new column with the filename in python

I am trying to work with the pandas library; is there a way to use the filename as a column name?
For example, my file names contain dates:
stock_2019-10-11.csv,
stock_2019-11-11.csv.
I want to make 2 different columns named after these files and append the values into them.
Something like the CSV file I expect to get out:

primary_key,article_number,stock_2019-10-11,stock_2019-11-11
101,201,4,2
102,301,5,2

Something like the above, where the new columns contain the values coming in from the merged CSVs.
import pandas as pd
import glob
import os

data = []  # pd.concat takes a list of dataframes
for path in glob.glob('my_directory_of_files/*.csv'):  # placeholder for the directory of files
    frame = pd.read_csv(path, encoding='utf_16', error_bad_lines=False, index_col=False)
    frame['filename'] = os.path.basename(path)
    data.append(frame)
frame1 = pd.concat(data, ignore_index=True)
First, use the filename as the column name for each file, then add each file's column to a dataframe and write the dataframe to csv.
(This assumes each file has one column; customize the column headers to match your own columns.)

import pandas as pd

df = pd.DataFrame()
filenames = ["C:/Users/sghungurde/Documents/server2.csv", "C:/Users/sghungurde/Documents/server3.csv"]
i = 0
while i < len(filenames):
    # extracting the filename from the filepath
    c1 = (filenames[i].split("/")[4]).split(".")[0]
    # reading the csv file and assigning the filename as the column header
    f1 = pd.read_csv(filenames[i], names=[c1])
    # adding the file's column to the dataframe
    df[c1] = f1[c1]
    i += 1
print(df)

# writing the final merged dataframe to csv
df.to_csv("C:/Users/sghungurde/Documents/merge.csv", index=False)
output
server2 server3
209.10.31.50 609.10.31.50
204.12.31.53 704.12.31.53
203.12.31.53 903.12.31.53
102.71.99.13 102.71.99.13
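
If the files actually contain key columns like primary_key and article_number (as in the expected output above), a sketch that renames each file's stock column after the file and merges on the keys could look like this; the folder path and the column names 'primary_key', 'article_number' and 'stock' are assumptions:

import glob
import os
from functools import reduce
import pandas as pd

dfs = []
for path in sorted(glob.glob('stock_files/stock_*.csv')):  # folder name assumed
    name = os.path.splitext(os.path.basename(path))[0]     # e.g. "stock_2019-10-11"
    df = pd.read_csv(path)                                  # assumed columns: primary_key, article_number, stock
    dfs.append(df.rename(columns={'stock': name}))

# outer merge on the key columns gives one stock column per file
merged = reduce(lambda left, right: left.merge(right, on=['primary_key', 'article_number'], how='outer'), dfs)
merged.to_csv('merged_stock.csv', index=False)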
