I have multiple CSV files with names such as 2C-BEB-29-2009-01-18.csv, 2C-BEB-29-2010-02-18.csv, 2C-BEB-29-2010-03-28.csv, 2C-ISI-12-2010-01-01.csv, and so on.
The 2C- part is common to all CSV files.
BEB is the name of the recording device.
29 is the user ID.
2009-01-18 is the date of the recording.
I have around 150 different user IDs, each with recordings from different devices. I would like to automate, for all user IDs, the following approach that I have used for a single user ID.
For a single user I use the code below, with pattern='2C-BEB-29-*.csv' as a string. Note that I am in the correct directory.
import glob
import pandas as pd

def pd_read_pattern(pattern):
    files = glob.glob(pattern)
    df = pd.DataFrame()
    for f in files:
        csv_file = open(f)
        a = pd.read_csv(f,sep='\s+|;|,', engine='python')
        # date column should be changed depending on patient id
        a['date'] = str(csv_file.name).rsplit('29-',1)[-1].rsplit('.',1)[0]
        df = df.append(a)
    df = df[df['hf']!=0]
    return df.reset_index(drop=True)
To apply the above code to all user IDs, I have read the CSV file names in the following way and saved them into a list. To avoid duplicate IDs, I convert the list into a set at the end of this snippet.
import glob

lst = []
for name in glob.glob('*.csv'):
    if len(name) > 15:
        a = name.split('-',3)[0] + "-" + name.split('-',3)[1] + "-" + name.split('-',3)[2] + '-*'
        lst.append(a)
lst = set(lst)
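For what it's worth, a more compact way to build the same set of patterns might be this one-line sketch, assuming the date is always the last three dash-separated fields of the name:

lst = {name.rsplit('-', 3)[0] + '-*' for name in glob.glob('*.csv') if len(name) > 15}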
Now I have the unique ID patterns in this example format: '2C-BEB-29-*.csv'. With the help of the code snippet below, I am trying to read each user's files. However, I get a Unicode/decode error on the pd.read_csv line. Could you help me with this issue?
for file in lst:
    #print(type(file))
    files = glob.glob(file)
    #print(files)
    df = pd.DataFrame()
    for f in files:
        csv_file = open(f)
        #print(f, type(f))
        a = pd.read_csv(f,sep='\s+|;|,', engine='python')
        #date column should be changed depending on patient id
        #a['date'] = str(csv_file.name).rsplit(f.split('-',3)[2]+'-',1)[-1].rsplit('.',1)[0]
        #df = df.append(a)
        #df = df[df['hf']!=0]
    #return df.reset_index(drop=True)
First, import chardet:

import chardet

Then replace this line of your snippet:

a = pd.read_csv(f,sep='\s+|;|,', engine='python')

with this one, which detects each file's encoding from its raw bytes before parsing:

with open(f, 'rb') as file:
    encodings = chardet.detect(file.read())["encoding"]
a = pd.read_csv(f,sep='\s+|;|,', engine='python', encoding=encodings)
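Putting it together, your outer loop might then look like this. This is just a sketch assuming the filename layout described above (the user ID as the third dash-separated field), with the encoding detected once per file:

import glob
import chardet
import pandas as pd

for pattern in lst:
    frames = []
    for f in glob.glob(pattern):
        # detect the encoding from the raw bytes before parsing
        with open(f, 'rb') as fh:
            enc = chardet.detect(fh.read())["encoding"]
        a = pd.read_csv(f, sep=r'\s+|;|,', engine='python', encoding=enc)
        # everything after '<ID>-' and before '.csv' is the recording date
        user_id = f.split('-', 3)[2]
        a['date'] = f.rsplit(user_id + '-', 1)[-1].rsplit('.', 1)[0]
        frames.append(a)
    df = pd.concat(frames, ignore_index=True)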
I'm trying to merge my 119 CSV files into one file with a Python script. The only issue I'm facing is that even though I've applied the sort method, it isn't working and my files are not ordered, which causes the date column to be unordered. Below is the code; when I run it and open my new CSV file "calls.sms.merged", it appears that after my 1st CSV file, data is merged from the 10th CSV, then the 100th through the 109th, and only then from CSV 11. I'm attaching an image for better understanding.
import os
import pandas as pd

file_path = "C:\\Users\\PROJECT\\Data Set\\SMS Data\\"
file_list = [file_path + f for f in os.listdir(file_path) if f.startswith('call. sms ')]

csv_list = []
for file in sorted(file_list):
    csv_list.append(pd.read_csv(file).assign(File_Name=os.path.basename(file)))

csv_merged = pd.concat(csv_list, ignore_index=True)
csv_merged.to_csv(file_path + 'calls.sms.merged.csv', index=False)
[Screenshots attached: the un-sorted/incorrectly ordered merged CSV, the Python code, and the error.]
You can extract the number of each call/file with pandas.Series.str.extract, then use pandas.DataFrame.sort_values to sort ascending on that number.
Try this:
file_path = "C:\\Users\\PROJECT\\Data Set\\SMS Data\\"
file_list = [file_path + f for f in os.listdir(file_path) if f.startswith('call. sms ')]

csv_list = []
for file in file_list:
    csv_list.append(pd.read_csv(file).assign(File_Name=os.path.basename(file)))

csv_merged = (
    pd.concat(csv_list, ignore_index=True)
    # extract the file number from the name and sort on it numerically
    .assign(num_call=lambda x: x["File_Name"].str.extract(r"(\d+)", expand=False).astype(int))
    .sort_values(by="num_call", ignore_index=True)
    .drop(columns="num_call")
)

csv_merged.to_csv(file_path + 'calls.sms.merged.csv', index=False)
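Alternatively, you could sort the file list itself numerically before reading, so no helper column is needed. A minimal sketch, assuming each file name contains a single run of digits (the call number):

import os
import re

# sort on the integer embedded in each file name instead of lexicographically
file_list = sorted(
    file_list,
    key=lambda p: int(re.search(r"\d+", os.path.basename(p)).group())
)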
I have many .csv files like this (with one column):
[Picture: sample of a single-column CSV file.]
I'd like to merge them into one .csv file, so that each column contains one CSV file's data. The headings should look like this (when converted to a spreadsheet):
[Picture: three heading rows per column, where the first is the number of minutes extracted from the file name, the second is the first word after "export_" in the file name, and the third is the whole file name.]
I'd like to work in Python.
Can someone please help me with this? I am new to Python.
Thank you very much.
I tried to join only 2 files, but I have no idea how to do it with more files without writing them all down manually. Also, I don't know how to extract the headings from the file names:
import pandas as pd

file_list = ['export_Control 37C 4h_Single Cells_Single Cells_Single Cells.csv', 'export_Control 37C 0 min_Single Cells_Single Cells_Single Cells.csv']

df = pd.DataFrame()
for file in file_list:
    temp_df = pd.read_csv(file)
    df = pd.concat([df, temp_df], axis=1)

print(df)
df.to_csv('output2.csv', index=False)
Assuming that your .csv files all have a header and the same number of rows, you can use the code below to put all the (single-column) .csv files one beside the other in a single worksheet.
import os
import pandas as pd

csv_path = r'path_to_the_folder_containing_the_csvs'
csv_files = [file for file in os.listdir(csv_path)]

list_of_dfs = []
for file in csv_files:
    temp = pd.read_csv(os.path.join(csv_path, file), header=0, names=['Header'])
    # one-cell frames holding the three heading values derived from the file name
    time_number = pd.DataFrame([[file.split('_')[1].split()[2]]], columns=['Header'])
    file_title = pd.DataFrame([[file.split('_')[1].split()[0]]], columns=['Header'])
    file_name = pd.DataFrame([[file]], columns=['Header'])
    out = pd.concat([time_number, file_title, file_name, temp]).reset_index(drop=True)
    list_of_dfs.append(out)

final = pd.concat(list_of_dfs, axis=1, ignore_index=True)
final.columns = ['Column' + str(col+1) for col in final.columns]
final.to_csv(os.path.join(csv_path, 'output.csv'), index=False)
final
For example, considering three .csv files, running the code above yields:
[Pictures: the resulting output shown in Jupyter and in Excel.]
I have two CSV files, and I want to combine them into one. Assume the two files are A.csv and B.csv; I already know that there are some conflicts between them. For example, both have the columns ID and name; in A.csv the ID "12345" has the name "Jack", while in B.csv the ID "12345" has the name "Tom". So the conflict is that the same ID has different names. I want to keep ID "12345", choose the name from A.csv, and abandon the name from B.csv. How can I do that?
Here is some code I have tried, but it can only combine the two CSV files and cannot deal with the conflicts, or more precisely, it cannot prefer the value from A.csv:
import pandas as pd
import glob

def merge(csv_list, outputfile):
    for input_file in csv_list:
        f = open(input_file, 'r', encoding='utf-8')
        data = pd.read_csv(f, error_bad_lines=False)
        data.to_csv(outputfile, mode='a', index=False)
    print('Combine Completed')

def distinct(file):
    df = pd.read_csv(file, header=None)
    datalist = df.drop_duplicates()
    datalist.to_csv('result_new_month01.csv', index=False, header=False)
    print('Distinct Completed')

if __name__ == '__main__':
    csv_list = glob.glob('*.csv')
    output_csv_path = 'result.csv'
    print(csv_list)
    merge(csv_list, output_csv_path)
    distinct(output_csv_path)
P.S. English is not my native language. Please excuse my syntax error.
Will put down my comments here as an answer:
The problem with your merge function is that you're reading each file and writing it out to the same result.csv in append mode, without any check for duplicate names. To check for duplicates, the rows need to be in the same dataframe. From your code, you're combining multiple CSV files, not necessarily just A.csv and B.csv. So when you say "I want to choose name from A.csv, and abandon name from B.csv", it looks like you really mean "keep the first one".
Anyway, fix your merge() function to keep reading files into a list of dataframes, with A.csv being first. And then use @gold_cy's answer to concatenate the dataframes, keeping only the first occurrence.
Or, in your distinct() function, instead of datalist = df.drop_duplicates(), put datalist = df.drop_duplicates("ID", keep="first").reset_index(drop=True). But this can be done after the read-loop: instead of writing out a CSV full of duplicates, first drop the dupes and then write out to CSV (a sketch of this variant follows the notes below).
So here's the change using the first method:
import pandas as pd
import glob

def merge_distinct(csv_list, outputfile):
    all_frames = []  # list of dataframes
    for input_file in csv_list:
        # skip your file-open line and pass those params to pd.read_csv
        data = pd.read_csv(input_file, encoding='utf-8', error_bad_lines=False)
        all_frames.append(data)  # append to list of dataframes
    print('Combine Completed')

    final = pd.concat(all_frames).drop_duplicates("ID", keep="first").reset_index(drop=True)
    final.to_csv(outputfile, index=False, header=False)
    print('Distinct Completed')

if __name__ == '__main__':
    csv_list = sorted(glob.glob('*.csv'))  # sort the list
    output_csv_path = 'result.csv'
    print(csv_list)
    merge_distinct(csv_list, output_csv_path)
Notes:
Instead of doing f = open(...), just pass those arguments to pd.read_csv().
Why are you writing out the final csv with header=False? That's useful to have.
glob.glob() doesn't guarantee sorting (it depends on the filesystem), so I've used sorted() above.
File-system ordering is different from sorted(), which is essentially ASCII/Unicode code-point order.
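For completeness, here is a sketch of the second method, keeping your original two-function structure. It assumes the combined file was written with a single header row so that the ID column can be referenced by name:

import pandas as pd

def distinct(file):
    df = pd.read_csv(file)  # assumes one header row with an 'ID' column
    # keep only the first occurrence of each ID (so A.csv must be read/appended first)
    datalist = df.drop_duplicates("ID", keep="first").reset_index(drop=True)
    datalist.to_csv('result_new_month01.csv', index=False)
    print('Distinct Completed')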
If you want to keep one DataFrame value over the other, then concatenate them and keep the first duplicate in the output. This means the preferred values should be in the first argument to the sequence you provide to concatenate as shown below.
df = pd.DataFrame({"ID": ["12345", "4567", "897"], "name": ["Jack", "Tom", "Frank"]})
df1 = pd.DataFrame({"ID": ["12345", "333", "897"], "name": ["Tom", "Sam", "Rob"]})
pd.concat([df, df1]).drop_duplicates("ID", keep="first").reset_index(drop=True)
      ID   name
0  12345   Jack
1   4567    Tom
2    897  Frank
3    333    Sam
I have two CSV files containing email addresses. One file consists of email addresses that I need to remove from the second file. I have code, but it seems to be giving an IndexError.
The sample code I worked on is:
import csv
# Open details file and get a unique set of links
details_csv = csv.DictReader(open('D:/emails_to_remove.csv','r'))
details = set(i.get('link') for i in details_csv)
# Open master file and only retain the data not in the set
master_csv = csv.DictReader(open('D:/emails-list.csv','r'))
master = [i for i in master_csv if i.get('link') not in details]
# Overwrite master file with the new results
with open('D:/master-output.csv', 'w') as file:
    writer = csv.DictWriter(file, master[0].keys(), lineterminator='\n')
    writer.writeheader()
    writer.writerows(master)
Content of file 1:
abc#123.com
efg#456.com
Content of file 2:
ijk#987.com
abc#123.com
Desired Output:
efg#456.com
ijk#987.com
The problem can be easily solved with sets, like so:
set1 = {"abc#123.com", "efg#456.com"}
set2 = {"ijk#987.com", "abc#123.com"}
set3 = set1.union(set2) - set1.intersection(set2)
print(set3)
# {'ijk#987.com', 'efg#456.com'}
A good source to learn what can be done with sets is e.g. https://www.geeksforgeeks.org/intersection-function-python/.
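Applied to your actual files, a minimal sketch, assuming each file is a plain one-address-per-line list with no header (adjust the slicing if there is one):

with open('D:/emails_to_remove.csv') as f1, open('D:/emails-list.csv') as f2:
    set1 = {line.strip() for line in f1 if line.strip()}
    set2 = {line.strip() for line in f2 if line.strip()}

# symmetric difference: addresses that appear in exactly one of the two files
result = set1 ^ set2

with open('D:/master-output.csv', 'w') as out:
    out.write('\n'.join(sorted(result)) + '\n')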
You can use a pandas DataFrame for this purpose:
import pandas as pd

details_csv = pd.read_csv('D:/emails_to_remove.csv')
master_csv = pd.read_csv('D:/emails-list.csv')

# rows of each file whose email does not appear in the other file
fn = master_csv[~(master_csv["emails"].isin(details_csv["emails"]))].reset_index(drop=True)
cn = details_csv[~(details_csv["emails"].isin(master_csv["emails"]))].reset_index(drop=True)

final = pd.concat([cn, fn])
final.to_csv(r'Path\File Name.csv')
print(final)
The sample code works for your problem, but you must add an "emails" header to both CSV files.
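That is, each input file would need to begin with that header line, for example:

emails
ijk#987.com
abc#123.com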
The pandas package helps you simplify CSV processing. Below is how to use it for your purpose:
import pandas as pd

details_df = pd.read_csv('D:/emails_to_remove.csv')
master_df = pd.read_csv('D:/emails-list.csv')

# 1. Concat both csv
merged_df = pd.concat([details_df, master_df], ignore_index=True).reset_index(drop=True)

# 2. Drop rows with duplicate emails (keep=False removes every occurrence)
merged_df = merged_df.drop_duplicates(subset='emails', keep=False)

# You can save the result if you wish
merged_df.to_csv("D:/final.csv", index=False)
I’m new to coding, and trying to extract a subset of data from a large file.
File_1 contains the data in two columns: ID and Values.
File_2 contains a large list of IDs, some of which may be present in File_1 while others will not be present.
If an ID from File_2 is present in File_1, I would like to extract those values and write the ID and value to a new file, but I’m not sure how to do this. Here is an example of the files:
File_1: data.csv
ID Values
HOT224_1_0025m_c100047_1 16
HOT224_1_0025m_c10004_1 3
HOT224_1_0025m_c100061_1 1
HOT224_1_0025m_c10010_2 1
HOT224_1_0025m_c10020_1 1
File_2: ID.xlsx
IDs
HOT224_1_0025m_c100047_1
HOT224_1_0025m_c100061_1
HOT225_1_0025m_c100547_1
HOT225_1_0025m_c100561_1
I tried the following:
import pandas as pd
data_file = pd.read_csv('data.csv', index_col = 0)
ID_file = pd.read_excel('ID.xlsx')
values_from_ID = data_file.loc[['ID_file']]
The following error occurs:
KeyError: "None of [['ID_file']] are in the [index]"
Not sure if I am reading in the excel file correctly.
I also do not know how to write the extracted data to a new file once I get the code to do it.
Thanks for your help.
With pandas:
import pandas as pd
data_file = pd.read_csv('data.csv', index_col=0, delim_whitespace=True)
ID_file = pd.read_excel('ID.xlsx', index_col=0)
res = data_file.loc[ID_file.index].dropna()
res.to_csv('result.csv')
Content of result.csv:
IDs,Values
HOT224_1_0025m_c100047_1,16.0
HOT224_1_0025m_c100061_1,1.0
In steps:
You need to read your csv with whitespace delimited:
data_file = pd.read_csv('data.csv', index_col=0, delim_whitespace=True)
it looks like this:
>>> data_file
                          Values
ID
HOT224_1_0025m_c100047_1      16
HOT224_1_0025m_c10004_1        3
HOT224_1_0025m_c100061_1       1
HOT224_1_0025m_c10010_2        1
HOT224_1_0025m_c10020_1        1
Now, read your Excel file, using the ids as index:
ID_file = pd.read_excel('ID.xlsx', index_col=0)
and you use its index with loc to get the matching entries from your first dataframe. Drop the missing values with dropna():
res = data_file.loc[ID_file.index].dropna()
Finally, write to the result csv:
res.to_csv('result.csv')
You can do it using a simple dictionary in Python. You can make a dictionary from file 1 and read the IDs from file 2. The IDs from file 2 can be checked against the dictionary, and only the matching ones are written to your output file. Something like this could work:
with open('data.csv','r') as f:
    lines = f.readlines()

# Skip the CSV header
lines = lines[1:]
table = {l.split()[0]: l.split()[1] for l in lines if len(l.strip()) != 0}

with open('id.csv','r') as f:
    lines = f.readlines()

# Skip the CSV header
lines = lines[1:]
matchedIDs = [(l.strip(), table[l.strip()]) for l in lines if l.strip() in table]
Now you will have your matched IDs and their values in a list of tuples called matchedIDs. You can write them in any format you like to a file.
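For example, a short sketch writing the matchedIDs list from above out as a two-column CSV (the output file name here is just a placeholder):

import csv

with open('matched.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    writer.writerow(['ID', 'Values'])  # header row
    writer.writerows(matchedIDs)       # one (id, value) tuple per row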
I'm also new to Python programming, so the code I used below might not be the most efficient. The situation I assumed is that we want the IDs that appear in both data.csv and id.csv; there might be some IDs in data.csv that are not in id.csv, and vice versa.
import pandas as pd

data = pd.read_csv('data.csv')
id2 = pd.read_csv('id.csv')

data.ID = data['ID']
id2.ID = id2['IDs']

d = []
for row in data.ID:
    d.append(row)

f = []
for row in id2.ID:
    f.append(row)

# keep only the IDs present in both files
g = []
for i in d:
    if i in f:
        g.append(i)

data = pd.read_csv('data.csv', index_col='ID')
new_data = data.loc[g, :]
new_data.to_csv('new_data.csv')
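As an aside, the three list-building loops above could be collapsed with a set lookup. A minimal sketch of the same idea, reusing the data and id2 frames from the code above (before data is re-read with index_col='ID'):

# IDs present in both files, preserving data.csv order
wanted = set(id2['IDs'])
g = [i for i in data['ID'] if i in wanted]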
This is the code I ended up using. It worked perfectly. Thanks to everyone for their responses.
import pandas as pd
data_file = pd.read_csv('data.csv', index_col=0)
ID_file = pd.read_excel('ID.xlsx', index_col=0)
res = data_file.loc[ID_file.index].dropna()
res.to_csv('result.csv')