CSV files and Python

I'm working on a Python script that should merge some columns of a number of CSV files (quite a lot, around 200 files).
All the files look like:
Timestamp; ...; ...; ...; Value; ...
date1;...;...;...; FirstValue;...
date2;...;...;...; SecondValue;...
and so on.
From the first file I want to extract the timestamp and the Value column. From the other files I need only the Value column.
My script for now is:
#!/usr/bin/python
import csv
import os, sys

# Open the folder
path = "Z:/myfolder"
dirs = os.listdir(path)
# Count the number of files in the folder
print len(dirs)
# Take the name of the first file
file = dirs[0]
# Open the first file to read the timestamp and the first value (Value)
primofile = csv.reader(open(file, 'rb'), delimiter=";", quotechar='|')
timestamp, firstValue = [], []
# For every row of the first file
for row in primofile:
    # copy the timestamp
    timestamp.append(row[2])
    # and the Value
    firstValue.append(row[15])
with open("provascript.csv", 'wb') as f:
    writer = csv.writer(f, delimiter=';')
    i = 0
    while i < len(timestamp):
        writer.writerow([timestamp[i]] + [firstValue[i]])
        i = i + 1
So in "provascript.csv" I have the timestamp and the first column with my values from the first file. The next step is to open, one by one, the files in the list "dirs", read the column "Values" (the 15th column), save this column in an array and write it in "provascript.csv".
My code is:
for file in dirs:
    data = csv.reader(open(file, 'rb'), delimiter=";", quotechar='|')
    column = []
    for row in data:
        column.append(row[15])
In the array "column" I should have the values. I have to add this values in a new column in "provascript.csv" and move on doing the same thing with all the files. How can I do that?
I would like to have something like
TimestampFromFirstFile;ValueFromFirstFile;ValueFromSecondFile;ValueFromThirdFile;...
date1;value;value;value;...
date2;value;value;value;...
date3;value;value;value;...
So far so good. I fixed it (thanks), but instead of writing "Value" in the first row I would like to write part of the file name. Instead of having Timestamp;Value;Value;Value I would prefer Timestamp;Temperature1;Temperature2;Presence1;Presence2.
How can I do it?

I would create the full structure and finally save it to the output file (assuming the files are ordered consistently with each other).
# create the full structure: output_rows
primofile = csv.reader(open(file, 'rb'), delimiter=";", quotechar='|')
output_rows = []
for row in primofile:
    output_rows.append([row[2], row[15]])
Once we have an ordered list of lists, complete them with the other files
for file in dirs:
    data = csv.reader(open(file, 'rb'), delimiter=";", quotechar='|')
    for idx, row in enumerate(data):
        output_rows[idx].append(row[15])
Finally save it to a file
with open("output.csv", 'wb') as f:
writer = csv.writer(f, delimiter=';')
for row in output_rows:
writer.writerow(row)
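To answer the follow-up about the header row: the labels could be derived from the file names before writing. A minimal sketch, assuming each file is named after the quantity it holds (e.g. "Temperature1.csv" gives "Temperature1") and that one Value column was appended per file in dirs; adjust the name handling to your real naming scheme:
import os

header = ["Timestamp"]
for file in dirs:
    header.append(os.path.splitext(file)[0])  # file name without its extension

with open("output.csv", 'wb') as f:
    writer = csv.writer(f, delimiter=';')
    writer.writerow(header)  # Timestamp;Temperature1;Temperature2;...
    for row in output_rows:
        writer.writerow(row)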

You can do it with pandas:
import pandas as pd

file1 = pd.read_csv("file1", index_col=0, sep=";", skipinitialspace=1)
file2 = pd.read_csv("file2", index_col=0, sep=";", skipinitialspace=1)
file3 = pd.read_csv("file3", index_col=0, sep=";", skipinitialspace=1)
Here you have plenty of options, notably to parse dates while reading your CSV.
file1 being:
            ...  ....1  ....2        Value  ....3
Timestamp
date1       ...    ...    ...   FirstValue    ...
date2       ...    ...    ...  SecondValue    ...
f1 = pd.DataFrame(file1.Value)
f2 = pd.DataFrame(file2.Value)
f3 = pd.DataFrame(file3.Value)
f2
           Value
Timestamp
date1        AAA
date2        BBB
f3
           Value
Timestamp
date1        456
date2        123
Then you define a function for the recursive merge:
def recursive_merge(list_df):
    suffixe = range(1, len(list_df) + 1)
    merged = list_df[0]
    for i in range(1, len(list_df)):
        merged = merged.merge(list_df[i], left_index=True, right_index=True,
                              suffixes=('_%s' % suffixe[i-1], '_%s' % suffixe[i]))
    if len(list_df) % 2 != 0:
        merged.rename(
            columns={'Value': "Value_%s" % suffixe[i]},
            inplace=True)  # if the number of recursive merges is odd
    return merged
and call it:
recursive_merge([f1,f2,f3])
Output:
               Value_1  Value_2  Value_3
Timestamp
date1       FirstValue      AAA      456
date2      SecondValue      BBB      123
And then you can easily write that dataframe with:
recursive_merge([f1,f2,f3]).to_csv("output.csv")
Of course, if you have more than 3 files you can use for-loops and/or functions to open the files and end up with a list like [f1, f2, f3, ... f200].
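For the 200-file case, a sketch of that loop (assuming all the files sit in one folder, only the Value column matters, and the folder path is the one from the question):
import glob
import pandas as pd

frames = []
for name in sorted(glob.glob("Z:/myfolder/*.csv")):
    df = pd.read_csv(name, index_col=0, sep=";", skipinitialspace=1)
    frames.append(pd.DataFrame(df.Value))  # keep only the Value column

merged = recursive_merge(frames)
merged.to_csv("output.csv")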
Hope this helps

Related

Read CSV file with quotechar-comma combination in string - Python

I have got multiple csv files which look like this:
ID,Text,Value
1,"I play football",10
2,"I am hungry",12
3,"Unfortunately",I get an error",15
I am currently importing the data using the pandas read_csv() function.
df = pd.read_csv(filename, sep = ',', quotechar='"')
This works for the first two rows in my CSV file; unfortunately, I get an error in row 3. The reason is that within the 'Text' column there is a quotechar-comma combination before the end of the column.
ParserError: Error tokenizing data. C error: Expected 3 fields in line 4, saw 4
Is there a way to solve this issue?
Expected output:
ID  Text                           Value
1   I play football                10
2   I am hungry                    12
3   Unfortunately, I get an error  15
You can try to fix the CSV using the re module:
import re
import pandas as pd
from io import StringIO

with open("your_file.csv", "r") as f_in:
    s = re.sub(
        r'"(.*)"',
        lambda g: '"' + g.group(1).replace('"', "\\") + '"',
        f_in.read(),
    )

df = pd.read_csv(StringIO(s), sep=r",", quotechar='"', escapechar="\\")
print(df)
Prints:
   ID                          Text  Value
0   1               I play football     10
1   2                   I am hungry     12
2   3  Unfortunately,I get an error     15
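Note that the greedy "(.*)" pattern assumes at most one quoted field per line; a line with two separately quoted fields would be matched from its first quote to its last, merging the fields. For this data, with a single quoted Text column, that assumption holds.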
One (not so flexible) approach would be to first remove all " quotes from the CSV, and then enclose the elements of the specific column in quotes (this is done to avoid misinterpreting the "," separator while parsing), like this:
import csv

# Specify the column index (0-based)
column_index = 1

# Open the input CSV file
with open('input.csv', 'r') as f:
    reader = csv.reader(f)
    # Open the output CSV file
    with open('output.csv', 'w', newline='') as g:
        writer = csv.writer(g)
        # Iterate through the rows of the input CSV file
        for row in reader:
            # Replace the " character with an empty string
            row[column_index] = row[column_index].replace('"', '')
            # Enclose the modified element in quotes
            row[column_index] = f'"{row[column_index]}"'
            # Write the modified row to the output CSV file
            writer.writerow(row)
This code creates a new, modified CSV file. Your problematic CSV row will then look like this:
3,"Unfortunately,I get an error",15
Then you can import the data like you did: df = pd.read_csv(filename, sep=',', quotechar='"')
To automate this conversion for all csv files within a directory:
import csv
import glob

# Specify the column index (0-based)
column_index = 1

# Get a list of all CSV files in the current directory
csv_files = glob.glob('*.csv')

# Iterate through the CSV files
for csv_file in csv_files:
    # Open the input CSV file
    with open(csv_file, 'r') as f:
        reader = csv.reader(f)
        # Open the output CSV file
        output_file = csv_file.replace('.csv', '_new.csv')
        with open(output_file, 'w', newline='') as g:
            writer = csv.writer(g)
            # Iterate through the rows of the input CSV file
            for row in reader:
                # Replace the " character with an empty string
                row[column_index] = row[column_index].replace('"', '')
                # Enclose the modified element in quotes
                row[column_index] = f'"{row[column_index]}"'
                # Write the modified row to the output CSV file
                writer.writerow(row)
This names the new CSV files like the old ones, but ending in "_new.csv" instead of just ".csv".
A possible solution:
df = pd.read_csv(filename, sep=r'(?<=\d),|,(?=\d)', engine='python')
df = df.reset_index().set_axis(['ID', 'Text', 'Value'], axis=1)
df['Text'] = df['Text'].replace('"', '', regex=True)
Another possible solution:
from io import StringIO

# text = the raw contents of the CSV file, read as a single string
df = pd.read_csv(StringIO(text), sep='\t')
df[['ID', 'Text']] = df.iloc[:, 0].str.split(',', expand=True, n=1)
df[['Text', 'Value']] = df['Text'].str.rsplit(',', expand=True, n=1)
df = df.drop(df.columns[0], axis=1).assign(
    Text=df['Text'].replace('"', '', regex=True))
Output:
   ID                          Text  Value
0   1               I play football     10
1   2                   I am hungry     12
2   3  Unfortunately,I get an error     15
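Another option, assuming pandas 1.4 or newer: the python engine can hand malformed rows to a callable via on_bad_lines, so the stray quote can be repaired during parsing instead of beforehand. A sketch (the merge-the-middle-fields rule in the lambda is specific to this 3-column layout):
import pandas as pd

# a bad line arrives as a list of fields; glue everything between the
# first and last field back together and drop the stray quote
fix = lambda bad: [bad[0], ','.join(bad[1:-1]).replace('"', ''), bad[-1]]

df = pd.read_csv(filename, engine='python', on_bad_lines=fix)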

I need to read a csv with unknown number of columns and then write the data to a csv with a set number of columns

So I have a file that looks like this:
name,number,email,job1,job2,job3,job4
I need to convert it to one that looks like this:
name,number,email,job1
name,number,email,job2
name,number,email,job3
name,number,email,job4
How would I do this in Python?
As said in a comment, you can use pandas to read, write, and manipulate CSV files.
Here is one example of how you can solve your problem with pandas in Python:
import pandas as pd
# df = pd.read_csv("filename.csv")  # read the csv file from disk
# comment out the line below when reading from disk
df = pd.DataFrame([['ss','0152','ss#','student','others']],columns=['name','number','email','job1','job2'])
print(df)
The output of this line is:
  name number email     job1    job2
0   ss   0152   ss#  student  others
Now we need to know how many columns there are:
x = len(df.columns)
print(x)
It stores the number of columns in x:
5
Now let's create an empty DataFrame with columns=[name, number, email, job]:
c = pd.DataFrame(columns=['name','number','email','job'])
print(c)
output:
Columns: [name, number, email, job]
Index: []
Now we loop from column index 3 to the end and concat each slice with our empty DataFrame:
for i in range(3, x):
    df1 = df.iloc[:, 0:3].copy()  # take the first 3 columns
    df2 = df.iloc[:, [i]].copy()  # take the ith column
    df1['job'] = df2              # add the ith column to df1
    c = pd.concat([df1, c])       # concat df1 and c
print(c)
output:
  name number email      job
0   ss   0152   ss#   others
0   ss   0152   ss#  student
The DataFrame c has your desired output. Now you can save it using
c.to_csv('output.csv')
(pass index=False if you don't want the row index written as an extra column).
Let's assume this is the dataframe:
import pandas as pd
df = pd.DataFrame(columns=['name','number','email','job1','job2','job3','job4'])
df = df.append({'name':'jon', 'number':123, 'email':'smth#smth.smth', 'job1':'a','job2':'b','job3':'c','job4':'d'},ignore_index=True)
We define a new dataframe:
new_df = pd.DataFrame(columns=['name','number','email','job'])
Now, we loop over the old one to split it based on the jobs. I assume you have 4 jobs to split:
for i, row in df.iterrows():
    for job in range(1, 5):
        job_col = "job" + str(job)
        new_df = new_df.append({'name': row['name'], 'number': row['number'],
                                'email': row['email'], 'job': row[job_col]},
                               ignore_index=True)
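Note that DataFrame.append was removed in pandas 2.0. On current versions the same loop can collect plain dicts and build the frame once at the end, which is also faster (a sketch using the df defined above):
rows = []
for i, row in df.iterrows():
    for job in range(1, 5):
        rows.append({'name': row['name'], 'number': row['number'],
                     'email': row['email'], 'job': row['job' + str(job)]})
new_df = pd.DataFrame(rows, columns=['name', 'number', 'email', 'job'])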
You can use the csv module and Python's unpacking syntax to get the data from the input file and write it to the output file.
import csv

with open('input.csv', newline='') as infile, open('output.csv', 'w', newline='') as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    # Skip header row, if necessary
    next(reader)
    # Use sequence unpacking to get the fixed variables and
    # an arbitrary number of "jobs".
    for name, number, email, *jobs in reader:
        for job in jobs:
            writer.writerow([name, number, email, job])
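If the output should keep a header row as well, one small variation is to capture it instead of discarding it (assuming the column names from the question):
import csv

with open('input.csv', newline='') as infile, open('output.csv', 'w', newline='') as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    header = next(reader)                  # keep the header this time
    writer.writerow(header[:3] + ['job'])  # name,number,email,job
    for name, number, email, *jobs in reader:
        for job in jobs:
            writer.writerow([name, number, email, job])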
Below:
with open('input.csv') as f_in:
    lines = [l.strip() for l in f_in.readlines()]

with open('output.csv', 'w') as f_out:
    for idx, line in enumerate(lines):
        if idx > 0:
            fields = line.split(',')
            for i in range(3, len(fields)):
                f_out.write(','.join(fields[:3]) + ',' + fields[i] + '\n')
input.csv
header row
name,number,email,job1,job2,job3,job4
name1,number1,email1,job11,job21,job31,job41
output.csv
name,number,email,job1
name,number,email,job2
name,number,email,job3
name,number,email,job4
name1,number1,email1,job11
name1,number1,email1,job21
name1,number1,email1,job31
name1,number1,email1,job41

Subtracting data frames in Python returning NaN in columns

I'm trying to subtract the second columns of two CSV files (mycsv.csv, mycsv2.csv) while keeping the first columns of both the same. It does the latter perfectly fine, as you can see below, but the price columns (2 and 4) just give back NaN.
      col2  col4
col1
MMM    NaN   NaN
WBAI   NaN   NaN
WUBA   NaN   NaN
EGHT   NaN   NaN
AHC    NaN   NaN
I don't know where this error is coming from, so I apologize for so much code. Thank you for any help you can give!
data_sheet1 = pd.read_excel('C:\\Users\\sss\\Downloads\\Book1.xlsx')
data_impor = data_sheet1['DDD'].tolist()

def get_ohlc(symbols):
    data = get_quotes(symbol=symbols)
    symbols_and_lastPrices = []  # create empty list
    for symbol in symbols:
        symbols_and_lastPrices.append([symbol, data[symbol]['lastPrice']])  # append [symbol, lastPrice] pairs to the list
    return symbols_and_lastPrices  # return list

csv_data = get_ohlc(data_impor)  # save returned list

# write csv
with open("mycsv.csv", "w", newline='') as f:
    thewriter = csv.writer(f)
    thewriter.writerow(['col1', 'col2'])
    thewriter.writerows(csv_data)  # write all data rows at the same time

with open('mycsv.csv', 'r') as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row)

def get_ohlc(symbols):
    data = get_quotes(symbol=symbols)
    symbols_and_lastPrices = []  # create empty list
    for symbol in symbols:
        symbols_and_lastPrices.append([symbol, data[symbol]['lastPrice']])  # append [symbol, lastPrice] pairs to the list
    return symbols_and_lastPrices  # return list

time.sleep(2)
csv2_data = get_ohlc(data_impor)  # save returned list

# write csv
with open("mycsv2.csv", "w", newline='') as f:
    thewriter = csv.writer(f)
    thewriter.writerow(['col3', 'col4'])
    thewriter.writerows(csv2_data)  # write all data rows at the same time

with open('mycsv2.csv', 'r') as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row)

df1 = pd.read_csv('mycsv.csv', index_col='col1')
df2 = pd.read_csv('mycsv2.csv', index_col='col3')
df3 = df1.sub(df2)
print(df3.head())
I think the issue is with the initial import of the document. Here:
data_sheet1 = pd.read_excel('C:\\Users\\sss\\Downloads\\Book1.xlsx')
data_impor = data_sheet1['DDD'].tolist()
You are only importing the first column, and the rest of the data is not being saved to data_impor.
You then pass this data to the get_ohlc function, but it never receives any data for the other columns. Also, why are you defining the get_ohlc function twice?
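Separately, note that df1.sub(df2) aligns on index and column labels, so two frames whose value columns are named col2 and col4 come back all-NaN even when the numbers themselves are fine. A minimal sketch of a label-safe subtraction, assuming the two files really do line up symbol by symbol:
import pandas as pd

df1 = pd.read_csv('mycsv.csv', index_col='col1')
df2 = pd.read_csv('mycsv2.csv', index_col='col3')

# give df2 the same column labels as df1 so pandas can align them
df2.columns = df1.columns
df3 = df1.sub(df2)
print(df3.head())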

calculation optimization on a CSV

I am preparing data from several CSV files. One is like this:
userID, createdAt, collectedAt
6301, 2006-09-18 01:07:50, 2010-01-17 20:38:25
10836, 2006-10-27 14:38:04, 2010-06-18 03:35:34
10997, 2006-10-29 09:50:38, 2010-04-24 01:12:40
And another like:
userID, seriesOfNumber
6301, "3269,3310,3695,3732,3788,3872,3929,3893"
10836, "1949,1963,1963,1963,1963,1963,1963,1962,1961,1961"
10997, "1119,1119,999,999,1050,1170,1071,799,799,799,862,862,862,862"
I want to produce an output.csv with all the information from csv1 and the standard deviation of the series from csv2.
This is what I'm doing at the moment:
import pandas as pd
import csv
import statistics
from datetime import datetime

def days_between(d1, d2):
    d1 = datetime.strptime(d1, "%Y-%m-%d %H:%M:%S")
    d2 = datetime.strptime(d2, "%Y-%m-%d %H:%M:%S")
    return abs((d2 - d1).days)

def stddev(id):
    with open('csvFile/csv1.csv', encoding="utf-8") as csv1:
        follow = csv.DictReader(csv1)
        for row in follow:
            if id == row["userID"]:
                # split the line into an array of str, then convert each str to int
                return statistics.stdev(list(map(int, row["seriesOfNumberOfFollowings"].split(",")))))

with open('csvFile/user.csv', encoding="utf-8") as user, open('output.csv', 'w') as output:
    user = csv.DictReader(user)
    writer = csv.writer(output, delimiter=',')
    writer.writerow(["accountLongevity", "standardDeviationFollowing"])  # write a header row
    for row in user:
        data = []
        data.append(days_between(row['createdAt'], row['collectedAt']))
        data.append(stddev(row['userID']))
        writer.writerow(data)
This actually works, but it takes a very long time (I have more than 500,000 users). I think I can avoid opening csv1 at each row of user. I tried to pass my CSV as a parameter to my function, but it only works for the first row, as if the function never advances. And my problem here is with 2 CSVs; my final output is a summary of 4 CSVs.
As Zealseeker suggested, I used pandas for reading my CSV files. Then I am able to merge on ID. Here is the code:
def legitimate():
    csv1 = pd.read_csv("csvFile/csv1.csv", sep='\t', names=["userID", "createdAt", "collectedAt"])
    csv2 = pd.read_csv("csvFile/csv3.csv", sep='\t', names=["userID", "seriesOfNumberOfFollowings"])
    mergeCSV = pd.merge(csv1, csv2, on="userID", how="inner")
    mergeCSV['class'] = 0
    for ind in mergeCSV.index:
        mergeCSV.at[ind, 'seriesOfNumberOfFollowings'] = stddev(mergeCSV['seriesOfNumberOfFollowings'][ind])
        mergeCSV.at[ind, 'createdAt'] = days_between(mergeCSV['createdAt'][ind], mergeCSV['collectedAt'][ind])
    return mergeCSV
With this method I can merge more than 2 CSVs pretty quickly.
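For reference, a sketch of the same idea with the loop pushed into column-wise operations (assuming the column names used above; pd.to_datetime replaces days_between, and a single apply replaces the per-row stddev lookups):
import statistics
import pandas as pd

csv1 = pd.read_csv("csvFile/csv1.csv", sep='\t', names=["userID", "createdAt", "collectedAt"])
csv2 = pd.read_csv("csvFile/csv3.csv", sep='\t', names=["userID", "seriesOfNumberOfFollowings"])
mergeCSV = pd.merge(csv1, csv2, on="userID", how="inner")

# standard deviation of each comma-separated series
mergeCSV['standardDeviationFollowing'] = mergeCSV['seriesOfNumberOfFollowings'].apply(
    lambda s: statistics.stdev(int(x) for x in s.split(',')))

# account longevity in whole days
created = pd.to_datetime(mergeCSV['createdAt'])
collected = pd.to_datetime(mergeCSV['collectedAt'])
mergeCSV['accountLongevity'] = (collected - created).dt.days.abs()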

How do I write a csv based on comparing two csv files [column based]

I have two csv files:
csv1
csv2
(note: headers can differ)
csv1 has 1 single column and csv2 has 5 columns.
Now, column 1 of csv1 has some matching values in column 2 of csv2.
My concern is how I can write a csv where column 1 of csv1 does NOT have a matching value in column 2 of csv2.
I have attached three files: csv1, csv2, and the expected output.
Expected Output:
ProfileID,id,name,class ,rollnumber
1,lkha,prince,sfasd,DAS
2,hgfhfk,kabir,AD,AD
5,jlh,antriskh,ASDA,AD
CSV 1:
id,name
10927,prince
109582,kabir
f546416,rahul
g44674,saini
r7341,antriskh
CSV 2:
ProfileID,id,name,class ,rollnumber
1,lkha,prince,sfasd,DAS
2,hgfhfk,kabir,AD,AD
3,f546416,rahul,AD,FF
44,g44674,saini,DD,FF
5,jlh,antriskh,ASDA,AD
I tried converting them into dictionaries and matching csv1 keys to csv2 values, but it is not working as expected:
def read_csv1(filename):
    prj_structure = {}
    f = open(filename, "r")
    data = f.read()
    f.close()
    lst = data.split("\n")
    prj = ""
    for i in range(0, len(lst)):
        val = lst[i].split(",")
        if len(val) > 0:
            prj = val[0]
            if prj != "":
                if prj not in prj_structure.keys():
                    prj_structure[prj] = []
                prj_structure[prj].append([val[1], val[2], val[3], val[4]])
    return prj_structure

def read_csv2(filename):
    prj_structure = {}
    f = open(filename, "r")
    data = f.read()
    f.close()
    lst = data.split("\n")
    prj = ""
    for i in range(0, len(lst)):
        val = lst[i].split(",")
        if len(val) > 0:
            prj = val[0]
            if prj != "":
                if prj not in prj_structure.keys():
                    prj_structure[prj] = []
                prj_structure[prj].append([val[0]])
    return prj_structure
csv1_data = read_csv1("csv1.csv")
csv2_data = read_csv2("csv2.csv")

for k, v in csv1_data.items():
    for ks, vs in csv2_data.items():
        if k == vs[0][0]:
            # here it is not working
            sublist = []
            sublist.append(k)
Use DictReader from the csv module.
import csv

f1 = open('csv1.csv')
csv_1 = csv.DictReader(f1)
f2 = open('csv2.csv')
csv_2 = csv.DictReader(f2)

first_dict = {}
for row in csv_1:
    first_dict[row['name']] = row
f1.close()

f_out = open('output.csv', 'w')
csv_out = csv.DictWriter(f_out, csv_2.fieldnames)
csv_out.writeheader()
for second_row in csv_2:
    if second_row['name'] in first_dict:
        first_row = first_dict[second_row['name']]
        if first_row['id'] != second_row['id']:
            csv_out.writerow(second_row)
f2.close()
f_out.close()
If you have the option, I have always found pandas to be a great tool for importing and manipulating CSV files.
import pandas as pd

# Read in both the CSV files
df_1 = pd.DataFrame(pd.read_csv('csv1.csv'))
df_2 = pd.DataFrame(pd.read_csv('csv2.csv'))

# Iterate over both DataFrames; if any id from df_2 matches
# an id in df_1, remove that row from df_2
for num1, row1 in df_1.iterrows():
    for num2, row2 in df_2.iterrows():
        if row1['id'] == row2['id']:
            df_2.drop(num2, inplace=True)
df_2.head()
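A vectorized alternative that avoids the nested iterrows loops entirely, a sketch assuming the id column shares a name across both frames: isin builds a boolean mask in one pass, which matters once the files grow.
import pandas as pd

df_1 = pd.read_csv('csv1.csv')
df_2 = pd.read_csv('csv2.csv')

# keep only the rows of df_2 whose id never appears in df_1
df_2 = df_2[~df_2['id'].isin(df_1['id'])]
df_2.to_csv('output.csv', index=False)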
For any kind of CSV processing, using the builtin csv module makes most of the error-prone processing trivial. Given your example values, the following code should produce the desired results. I use comprehensions to do the filtering.
import csv
import io

# example data, as StringIO objects that will behave like file objects
raw_csv_1 = io.StringIO('''\
id,name
10927,prince
109582,kabir
f546416,rahul
g44674,saini
r7341,antriskh''')

raw_csv_2 = io.StringIO('''\
ProfileID,id,name,class,rollnumber
1,lkha,prince,sfasd,DAS
2,hgfhfk,kabir,AD,AD
3,f546416,rahul,AD,FF
44,g44674,saini,DD,FF
5,jlh,antriskh,ASDA,AD''')

# in your actual data, you would use real file objects instead, like:
# with open('location/of/your/csv_1') as file_1:
#     raw_csv_1 = io.StringIO(file_1.read())
# with open('location/of/your/csv_2') as file_2:
#     raw_csv_2 = io.StringIO(file_2.read())
Then we need to transform them into csv.reader objects and materialize the rows:
csv_1 = csv.reader(raw_csv_1)
next(csv_1)           # consume once to skip the header
vals_1 = list(csv_1)  # materialize the remaining rows

csv_2 = csv.reader(raw_csv_2)
header = next(csv_2)  # consume once to skip the header, but store it
vals_2 = list(csv_2)  # materialize the remaining rows
Last but not least, collect the ids of the first csv in a set to use them as a lookup table, filter the second csv with it, and write the result back as 'result.csv' into your file system.
skip_keys = {id_ for id_, name in vals_1}
result = [row for row in vals_2 if row[1] not in skip_keys]
# at this point, result contains
# [['1', 'lkha', 'prince', 'sfasd', 'DAS'],
#  ['2', 'hgfhfk', 'kabir', 'AD', 'AD'],
#  ['5', 'jlh', 'antriskh', 'ASDA', 'AD']]
with open('result.csv', 'w', newline='') as result_file:
    csv.writer(result_file).writerows([header] + result)
