Consecutive reading of files in Python - python

I have 13 files with the extension .las. Each has 80 columns and about 180 thousand rows.
I need to read the files sequentially, one after the other: the first, then the second, and so on.
Below is my script, in which I process the data from the files.
At the end, the program must write the data to a file with the same extension.
Thank you in advance for your response!
I am programming with pandas in a Jupyter notebook.
import pandas as pd
from scipy import stats

cols = ['IK05', 'IK20', 'DA20', 'LLS', 'LLD', 'STP']
data = pd.read_table("data/1.las", delim_whitespace=True, na_values='-999.25', index_col=False)
ndata = data.STP_AX.as_matrix(columns=None)
nstop = 1
stop = 1
for i in range(len(ndata)):
    if ndata[i] > 0.1:
        stop = 0
        ndata[i] = nstop
    else:
        if stop == 0:
            stop = 1
            nstop = nstop + 1
nstop
data.STP = ndata
df = data[cols]
df1 = df.groupby('STP')
df1.head()

dfp = pd.DataFrame()

for name, group in df1:
    #print(name)
    #print(group)
    k, p = stats.mstats.normaltest(group[5:-5])
    #print(p)
    dfp[name] = p

Assuming you don't want to create a separate variable for each file name, put them all in a list; I'm giving an example using 3 files:
d = ["file1.las", "file2.las", "file3.las"]
Assuming your output file is output.las,
open it using mode 'w+', which allows both writing to and reading from the file:
output = open("output.las", 'w+')
Now you can use a for loop to open each file, process it, and then write the data into the output file:
for i in d:
    file = open(i, 'r')
    contents = file.read()
    # *your processing here*
    output.write(processedData)
    file.close()
Finally, you might want to close the output file:
output.close()
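Tying this back to the original pandas question: the same loop can read each .las file with pd.read_table and write one result file per input. This is only a minimal sketch under assumptions of mine — the data/ directory, the process() helper, and the output naming are placeholders, not part of the question's code:
import glob
import pandas as pd

def process(data):
    # placeholder for the per-file processing from the question
    return data

for path in sorted(glob.glob("data/*.las")):
    data = pd.read_table(path, delim_whitespace=True,
                         na_values='-999.25', index_col=False)
    result = process(data)
    # one output file per input, keeping the .las extension
    result.to_csv(path.replace(".las", "_out.las"), sep=" ", index=False)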

Related

Pandas Read CSV Error When Reading Multiple Files

I have multiple csv files, named as 2C-BEB-29-2009-01-18.csv, 2C-BEB-29-2010-02-18.csv, 2C-BEB-29-2010-03-28.csv, 2C-ISI-12-2010-01-01.csv, and so on.
The 2C- part is common to all csv files.
BEB is the name of the recording device,
29 stands for the user ID, and
2009-01-18 stands for the date of the recording.
I have around 150 different IDs and their recordings with different devices. I would like to automate, for all user IDs, the following approach which I have done for a single user ID.
For a single user I use the following code, with pattern='2C-BEB-29-*.csv' as a string. Note that I am in the correct directory.
def pd_read_pattern(pattern):
    files = glob.glob(pattern)
    df = pd.DataFrame()
    for f in files:
        csv_file = open(f)
        a = pd.read_csv(f, sep='\s+|;|,', engine='python')
        #date column should be changed depending on patient id
        a['date'] = str(csv_file.name).rsplit('29-',1)[-1].rsplit('.',1)[0]
        #df = df.append(a)
        #df = df[df['hf']!=0]
    return df.reset_index(drop=True)
To apply the above code for all user IDs, I have read the CSV file names in the following way and saved them into a list. To avoid duplicate IDs I have converted the list into a set at the end of this snippet.
import glob

lst = []
for name in glob.glob('*.csv'):
    if len(name) > 15:
        a = name.split('-',3)[0] + "-" + name.split('-',3)[1] + "-" + name.split('-',3)[2] + '-*'
        lst.append(a)
lst = set(lst)
Now I have the names of the unique IDs in this example format: '2C-BEB-29-*.csv'. With the help of the code snippet below, I am trying to read the files for each user ID. However, I get a Unicode/decode error on the pd.read_csv line. Could you help me with this issue?
for file in lst:
    #print(type(file))
    files = glob.glob(file)
    #print(files)
    df = pd.DataFrame()
    for f in files:
        csv_file = open(f)
        #print(f, type(f))
        a = pd.read_csv(f, sep='\s+|;|,', engine='python')
        #date column should be changed depending on patient id
        #a['date'] = str(csv_file.name).rsplit(f.split('-',3)[2]+'-',1)[-1].rsplit('.',1)[0]
        #df = df.append(a)
        #df = df[df['hf']!=0]
        #return df.reset_index(drop=True)
Firstly,
import chardet
Then, replace your code snippet of
a = pd.read_csv(f,sep='\s+|;|,', engine='python')
with this one
with open(f, 'rb') as file:
    encodings = chardet.detect(file.read())["encoding"]
a = pd.read_csv(f, sep='\s+|;|,', engine='python', encoding=encodings)
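If you want to fold the detection into the per-ID loop, a small helper keeps things readable. A sketch only; read_csv_detected is an assumed name, not part of the original answer:
import glob
import chardet
import pandas as pd

def read_csv_detected(path):
    # detect the file's encoding first, then let pandas parse it
    with open(path, 'rb') as fh:
        encoding = chardet.detect(fh.read())["encoding"]
    return pd.read_csv(path, sep='\s+|;|,', engine='python', encoding=encoding)

for pattern in lst:
    frames = [read_csv_detected(f) for f in glob.glob(pattern)]
    if frames:
        df = pd.concat(frames, ignore_index=True)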

Import data from text file with multiple conditions using Pandas

I'm trying to parse this text file into a pandas DataFrame.
The text file is in this particular format:
Name: Tom
Gender: Male
Books:
The problem of Pain
The reason for God: belief in an age of skepticism
My code so far to import the data is:
import pandas as pd
df = pd.read_table(filename, sep=":|\n", engine='python', index_col=0)
print df
The output I got is:
Name Tom
Gender Male
Books NaN
The problem of Pain NaN
The reason for God belief in an age of skepticism
How should I change the code so that the output I get will be:
Name Gender Books
Tom Male The problem of Pain, The reason for God: belief in an age of skepticism
Thanks for helping!
You can do this in two ways. First, you can use enumerate() together with an if statement; I used a text file named test.txt in the code below.
import pandas as pd

d = {}
value_list = []
for index, text in enumerate(open('test.txt', "r")):
    if index < 2:
        d[text.split(':')[0]] = text.split(':')[1].rstrip('\n')
    elif index == 2:
        value = text.split(':')[0]
    else:
        value_list.append(text.rstrip('\n'))
d[value] = [value_list]
df = pd.DataFrame(d)
Alternatively, you can use readlines(), slice through the lines to populate the dictionary, and then create a DataFrame.
import pandas as pd
text_file = open('test.txt', "r")
lines = text_file.readlines()
d = {}
d[lines[0:1][0].split(':')[0]] = lines[0:1][0].split(':')[1].rstrip('\n')
d[lines[1:2][0].split(':')[0]] = lines[1:2][0].split(':')[1].rstrip('\n')
d[lines[2:3][0].split(':')[0]] = [lines[3:]]
df = pd.DataFrame(d)
The method I use is simple: regex.
import os, re
import pandas as pd

# PROFILES should be set to the directory containing the profile .txt files
# List out all the files in the dir that end with .txt
files = [file for file in os.listdir(PROFILES) if file.endswith(".txt")]

HEADERS = ['Name', 'Gender', 'Books']
DATA = []  # create the empty list to store profiles

for file in files:  # iterate over each file
    filename = PROFILES + file  # full path name of the data file
    text_file = open(filename, "r")  # open the file
    lines = text_file.read()  # read the file into memory
    text_file.close()  # close the file

    ###############################################################
    # Regex to filter out all the column headers and row data. ####
    # Odd group number == header, even group number == data.  ####
    ###############################################################
    books = re.search(r"(Name):(.*)\n+(Gender):(.*)\n+(Books):((?<=Books:)\D+)", lines)

    # append data into DATA list
    DATA.append([books.group(i).strip() for i in range(len(books.groups()) + 1) if not i % 2 and i != 0])

profilesDF = pd.DataFrame(DATA, columns=HEADERS)  # create the dataframe

Python Pandas, Reading in file and skipping rows ahead of header

I am trying to loop over some files and skip the rows before the header in each file using pandas. All of the files are in the same data format, except that some have a different number of rows to skip before the header. Is there a way to loop over the files and start at the header of each file when some have more rows to skip than others?
For example,
some files require this:
f = pd.read_csv(fname,skiprows = 7,parse_dates=[0])
And some require this:
f = pd.read_csv(fname,skiprows = 15, parse_dates=[0])
Here is my chunk of code looping over my files:
for name, ID in stations:
    path = str(ID) + '/*.csv'
    for fname in glob.glob(path):
        print(fname)
        f = pd.read_csv(fname, skiprows=15, parse_dates=[0])  # could also skip 7 depending on file
        ws = f['Wind Spd (km/h)'] * 0.27778  # convert to m/s from km/h
        dt = f['Date/Time']
One way is to read your file using pure Python I/O to extract the header's index, then feed this into the skiprows argument of pd.read_csv.
This is fairly efficient, since the first step uses a generator expression which reads only until the desired row is reached.
from io import StringIO
import pandas as pd
from copy import copy
mystr = StringIO("""dasfaf
kgafsda
Date/Time,num1,num2
2018-01-01,0,1
2018-01-02,2,3
""")
mystr2 = copy(mystr)
# replace mystr with open('file.csv', 'r')
with mystr as fin:
    idx = next(i for i, j in enumerate(fin) if j.startswith('Date/Time'))

# replace mystr2 with 'file.csv'
df = pd.read_csv(mystr2, skiprows=idx-1, parse_dates=[0])
print(df)
Date/Time num1 num2
0 2018-01-01 0 1
1 2018-01-02 2 3
Wrap this in a function if you need to repeat the task:
def calc_skiprows(fname):
    with open(fname) as fin:
        idx = next(i for i, j in enumerate(fin) if j.startswith('Date/Time')) - 1
    return idx

df = pd.read_csv(fname, skiprows=calc_skiprows(fname), parse_dates=[0])
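Used inside the station loop from the question, this would look roughly as follows (a sketch only, assuming calc_skiprows as defined above and the stations iterable from the question):
import glob
import pandas as pd

for name, ID in stations:
    for fname in glob.glob(str(ID) + '/*.csv'):
        f = pd.read_csv(fname, skiprows=calc_skiprows(fname), parse_dates=[0])
        ws = f['Wind Spd (km/h)'] * 0.27778  # convert km/h to m/s
        dt = f['Date/Time']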
The first suggestion/answer seemed like a really good way to handle it, but I couldn't get it to work for me for some reason. I did find another way to fix my problem using a try/except block in Python:
for name, ID in stations:
    # read in each station's .csv files, concatenate together, insert station id column
    path = str(ID) + '/*.csv'
    for fname in glob.glob(path):
        print(fname)
        try:
            f = pd.read_csv(fname, skiprows=7, parse_dates=[0])
        except:
            f = pd.read_csv(fname, skiprows=15, parse_dates=[0])
        ws = f['Wind Spd (km/h)'] * 0.27778  # convert to m/s from km/h
        dt = f['Date/Time']
This way, if the first attempt to read the file fails (skipping 7 rows), it tries again with the other read_csv call (skipping 15 rows). This is not 100% robust, since I am still hardcoding the number of lines to skip, but it works for my needs right now.
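If a read with the wrong skiprows happens to succeed but returns garbage instead of raising, checking for the expected header column is a bit safer than a bare except. A rough sketch, assuming 'Date/Time' is always present in a correctly parsed file:
import pandas as pd

def read_station_csv(fname, skip_options=(7, 15)):
    # try each candidate skiprows value; keep the first result whose
    # header actually contains the expected 'Date/Time' column
    for skip in skip_options:
        try:
            f = pd.read_csv(fname, skiprows=skip, parse_dates=[0])
        except pd.errors.ParserError:
            continue
        if 'Date/Time' in f.columns:
            return f
    raise ValueError("could not find the header row in " + fname)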

Project guidance - multiple CSVs / modification to data frame

Just a quick question asking for guidance before my nerves break! :)
I have multiple CSV files that I would like to merge into one big new CSV file.
All the files have exactly the same structure:
Muzicast V2;;;;;;;;
Zoom mÈdia sur Virgin Radio;;;;;;;;;
Sem. 16 : Du 15 avril 2016 au 21 avril 2016;;;;;;;;;
;;;;;;;;;
;;;;;;;;;
TOP 40;;;;;;;;;
Rg;Evo.;Rg-1;Artiste;Titre;Genre;Label;Audience;Nb.Diffs;Nb.Sem
1;+3;4;Twenty One Pilots;Stressed out;Pop/Rock International;WEA;5 982 000;56;18
2;+1;3;Coldplay;Hymn for the weekend;Pop/Rock International;WEA;5 933 000;55;13
3;-2;1;Imany;Don't be so shy (Filatov & Karas remix);Dance;THINK ZIK;5 354 000;55;7
4;-2;2;Lukas Graham;7 years;Pop/Rock International;POLYDOR;5 927 000;54;16
5; =;5;Justin Bieber;Love yourself;Pop/Rock International;MERCURY GROUP;5 481 000;49;21
All the CSV files have the same formatting.
I would like to:
- open each file one after the other / ignore the first 10 lines
- take all the info with ";" as the separator
- insert variables at the beginning of each line
- write a new file with all the info from each file.
I managed to open a file and make the changes I needed:
handle = open(file_dir + '/' + 'virgin092016.csv', 'r')
results = []
for line in handle:
    line = '12;2016;' + line
    line = line.lower()
    line = line.strip()
    line = line.split(';')
    line = line[0], line[1], line[5]
    results.append(line)
df = pd.DataFrame(results)
print df
I also managed to open multiple files and create a DataFrame:
file_dir = "VIRGIN"
main_df = pd.DataFrame()
for i, file_name in enumerate(os.listdir(file_dir)):
    if i == 0:
        main_df = pd.read_csv(file_dir + "/" + file_name, sep=";")
        main_df["file_name"] = file_name
    else:
        current_df = pd.read_csv(file_dir + "/" + file_name, sep=";")
        current_df["file_name"] = file_name
        main_df = pd.concat([main_df, current_df], ignore_index=True)
print main_df
But now I have an issue trying to do both of them at the same time.
I am missing a part, and I think it is because I am not sure of the order in which I should do things.
Should I open a file, make the changes, write directly to MAIN.CSV (which will have the final info from all files), and only then build a DataFrame,
or should I open a file, build a DataFrame, and make the changes I need after that?
I'm new to Python (taking multiple online courses and reading books), but I feel that I'm still not really "pythonic" in my way of thinking.
Help would be much appreciated.
Thanks
I'm assuming all your CSV files are in "./data/" (defined in main_dir) and that the combined size of all your CSVs does not exceed your RAM. The trick is to use a temporary variable current_df and then append it to a final dataframe final_df with pd.concat.
import os
import pandas as pd

main_dir = "./data/"
all_files = os.listdir(main_dir)

for i, file_name in enumerate(all_files):
    current_df = pd.read_csv(main_dir + file_name,
                             sep=";",
                             skiprows=10)
    # add here whatever information you need to your dataframe
    # dump the results into a separate file with current_df.to_csv()
    if i == 0:
        final_df = current_df
    else:
        final_df = pd.concat([final_df, current_df], axis=0)
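To cover the "insert variables at the beginning of each line" part, you can add columns to current_df before concatenating and write the merged result once at the end. A sketch under my own assumptions — the week and year values are hardcoded here and would normally be derived from each file name:
import os
import pandas as pd

main_dir = "./data/"
final_df = pd.DataFrame()

for file_name in os.listdir(main_dir):
    current_df = pd.read_csv(main_dir + file_name, sep=";", skiprows=10)
    # per-file columns, equivalent to prefixing every line with "12;2016;"
    current_df.insert(0, "week", 12)    # assumption: parse from file_name
    current_df.insert(1, "year", 2016)  # assumption: parse from file_name
    current_df["file_name"] = file_name
    final_df = pd.concat([final_df, current_df], ignore_index=True)

# write the merged result once, using ";" as the separator
final_df.to_csv("MAIN.csv", sep=";", index=False)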

Python_How to write data in new columns for every file read by numpy?

I have several text files with the following structure. They have the same number of columns but different numbers of rows:
1.txt
2013-08-29T15:11:18.55912 0.019494552 0.110042184 0.164076427 0.587849877
2013-08-29T15:11:18.65912 0.036270974 0.097213155 0.122628797 0.556928624
2013-08-29T15:11:18.75912 0.055350041 0.104121094 0.121641949 0.593113069
2013-08-29T15:11:18.85912 0.057159263 0.107410588 0.198122695 0.591797271
2013-08-29T15:11:18.95912 0.05288292 0.102476346 0.172958062 0.591139372
2013-08-29T15:11:19.05912 0.043507861 0.104121094 0.162102731 0.598376261
2013-08-29T15:11:19.15912 0.068343545 0.102805296 0.168517245 0.587849877
2013-08-29T15:11:19.25912 0.054527668 0.105765841 0.184306818 0.587191978
2013-08-29T15:11:19.35912 0.055678991 0.107739538 0.169997517 0.539165352
2013-08-29T15:11:19.45912 0.05321187 0.102476346 0.167530397 0.645744989
2.txt
2013-08-29T16:46:05.41730 0.048771052 0.10642374 0.180852849 0.430612023
2013-08-29T16:46:05.51730 0.046303932 0.112673779 0.166050124 0.518112585
2013-08-29T16:46:05.61730 0.059955334 0.149845068 0.164569851 0.511533595
2013-08-29T16:46:05.71730 0.042192064 0.107410588 0.115227435 0.476007051
2013-08-29T16:46:05.81730 0.037915721 0.115634324 0.177892304 0.519428383
2013-08-29T16:46:05.91730 0.043507861 0.120568566 0.187267364 0.483243939
2013-08-29T16:46:06.01730 0.042356538 0.10642374 0.143352612 0.522059978
This code reads all the text files in the folder, does some math, and is supposed to write the results of each text file in new columns of a single csv.
files_ = glob.glob('D:\Test files\New folder\*.txt')
averages_ = []
seg_len = 3

def cum_sum(lis):
    total = 0
    for x in lis:
        total += x[1]
        yield total

with open('outfile.csv', 'wb') as outfile:
    writer = csv.writer(outfile)
    for i in files_:
        acols, f_column, average_original, fcol = [], [], [], []
        data = loadtxt(i, usecols=(1,2,3,4))
        for x in range(0, len(data[:,0]), seg_len):
            #some math on each column
            sample_means = [x] + [mean(data[x:x+seg_len,i]) for i in range(4)]
            #change types and save in a list
            float_means = ["%1f" % (x) for x in sample_means]
            #append previous two lines in lists
            average_original.append(sample_means)
            acols.append(float_means)
        fcol = list(cum_sum(average_original))
        #write fcol in a column next to acols
        acols = [row + [col] for row, col in zip(acols, fcol)]
        averages_.append(acols)
    for row in averages_:
        writer.writerows(row)
Q:
But I cannot get the code to write new columns for each new file. The most relevant post I found was Python : How do i get a new column for every file I read?, but line.strip() doesn't work for me.
I would appreciate any hints on how to approach this.
Will this work for you?
import pandas as pd

df = pd.DataFrame()
mad = lambda x: x[0] + x.mean()

A = []
for f in ['1.txt', '2.txt']:
    tmp = pd.read_csv(f, header=None, delim_whitespace=True)
    tmp = tmp.ix[:, 1:5]
    df = pd.concat([df, pd.rolling_apply(tmp, 3, mad)], axis=1)
df.to_csv('test.csv')
The rolling_apply function applies a moving function along columns with a window of 3 in this case.
I'm sorry if this isn't quite what you want, but I think it shows how powerful pandas can be.
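Note that pd.rolling_apply and DataFrame.ix were removed in later pandas versions; with current pandas the same idea would look roughly like this (a sketch keeping the answer's window of 3 and the same toy function):
import pandas as pd

mad = lambda x: x[0] + x.mean()  # same toy function as above

df = pd.DataFrame()
for f in ['1.txt', '2.txt']:
    tmp = pd.read_csv(f, header=None, delim_whitespace=True)
    tmp = tmp.iloc[:, 1:5]                        # keep the four numeric columns
    rolled = tmp.rolling(3).apply(mad, raw=True)  # window of 3, applied per column
    df = pd.concat([df, rolled], axis=1)
df.to_csv('test.csv')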

Categories

Resources