Convert dta files to csv - python

I want to convert several dta files to csv.
So far my code is (to be honest, I adapted an answer I found on Stack Overflow):
library(foreign)
setwd("C:\\Users\\Victor\\Folder")  # backslashes must be escaped in R strings
for (f in Sys.glob('*.dta'))
  write.csv(read.dta(f), file = gsub('dta$', 'csv', f))
It works, but if my folder contains sub-folders they are ignored.
My problem is that I have 11 sub-folders (which may contain sub-folders themselves). I would like to find a way to loop over my folder and its sub-folders, because right now I need to change my working directory for each sub-folder.
I'm using R for now; I tried pandas (Python), but the quality of the conversion seemed debatable...
Thank you

In R you can do this by setting recursive = TRUE in list.files().
Specifying recursion when dealing with directories is quite general -- it works with command-line operations on operating systems including Linux and Windows (e.g. rm -rf) and applies to multiple functions in R.
This post has a nice example:
How to use R to Iterate through Subfolders and bind CSV files of the same ID?
Their example (which differs only in what they do with the results of the directory/subdirectory search) is:
lapply(c('1234', '1345', '1456', '1560'), function(x) {
  ## TF is the base directory to search
  sources.files <- list.files(path = TF,
                              recursive = TRUE,
                              pattern = paste('*09061*', x, '*.csv', sep = ''),
                              full.names = TRUE)
  ## Read all files with the id and bind them
  dat <- do.call(rbind, lapply(sources.files, read.csv))
  ## Write the aggregated file for the current id
  write.csv(dat, paste('agg', x, '.csv', sep = ''))
})
So for your case, set pattern = '\\.dta$' and point path at your base directory.

Consider base R's list.files(), whose recursive argument specifies whether to search subdirectories. You will also want full.names = TRUE so absolute paths are returned for file referencing.
So, set your pattern to look for the .dta extension (i.e., Stata datasets) and then run the read-in and write-out function:
library(foreign)
statafiles <- list.files("C:\\Users\\Victor\\Folder", pattern="\\.dta$",
                         recursive = TRUE, full.names = TRUE)
lapply(statafiles, function(x) {
  df <- read.dta(x)
  write.csv(df, gsub("\\.dta$", ".csv", x))  # escape the dot and anchor at the end
})
And the counterpart in Python pandas which has built-in methods to read and write stata files:
import os
import pandas as pd
for dirpath, subdirs, files in os.walk("C:\\Users\\Victor\\Folder"):
    for f in files:
        if f.endswith(".dta"):
            df = pd.read_stata(os.path.join(dirpath, f))
            df.to_csv(os.path.join(dirpath, f.replace(".dta", ".csv")))
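If you prefer pathlib (Python 3.4+), here is a minimal sketch of the same walk; the base folder is the asker's path, and rglob handles the recursion:
from pathlib import Path
import pandas as pd

base = Path(r"C:\Users\Victor\Folder")
for dta_path in base.rglob("*.dta"):   # recursive search for .dta files
    df = pd.read_stata(dta_path)
    # with_suffix swaps .dta for .csv; index=False keeps the pandas row index out of the CSV
    df.to_csv(dta_path.with_suffix(".csv"), index=False)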

Related

Ignore all files other than specific type of file, for directory comparison in Python

I want to compare two directories for all the ".bin" files in them. There may be files of other extension types, such as ".txt" or ".tar.bz2", in those directories. I want to get the common files as well as the files that are not common.
I tried using filecmp.dircmp(), but I am not able to use the ignore parameter with a wildcard to ignore those files. Is there any solution I can use for this purpose?
Select the common subset of *.bin files in the two folders and remove the first part of the path (the folder name), then pass them to cmpfiles():
import filecmp
from pathlib import Path
dir1_files = [f.relative_to('folder1') for f in Path('folder1').glob('*.bin')]
dir2_files = [f.relative_to('folder2') for f in Path('folder2').glob('*.bin')]
common_files = set(dir1_files).intersection(dir2_files)
match, mismatch, error = filecmp.cmpfiles('folder1', 'folder2', common_files)
If you want to avoid the preselection of common files, you can instead take the union of the two sets:
common_files = set(dir1_files).union(dir2_files)
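The question also asks for the files that are not common; those fall out of the same two sets with plain set differences (a small sketch reusing the names above):
only_in_dir1 = set(dir1_files) - set(dir2_files)   # .bin files present only in folder1
only_in_dir2 = set(dir2_files) - set(dir1_files)   # .bin files present only in folder2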

Duplicate in list created from filenames (python)

I'm trying to create a list of the Excel files saved in a specific directory, but I'm having an issue where the generated list contains a duplicate entry for one of the file names (I am absolutely certain there is not actually a duplicate of the file).
import glob
# get data file names
path = r'D:\larvalSchooling\data'
filenames = glob.glob(path + "/*.xlsx")
output:
>>> filenames
['D:\\larvalSchooling\\data\\copy.xlsx', 'D:\\larvalSchooling\\data\\Raw data-SF_Fri_70dpf_GroupABC_n5_20200828_1140-Trial 1.xlsx', 'D:\\larvalSchooling\\data\\Raw data-SF_Sat_70dpf_GroupA_n5_20200808_1015-Trial 1.xlsx', 'D:\\larvalSchooling\\data\\Raw data-SF_Sat_84dpf_GroupABCD_n5_20200822_1440-Trial 1.xlsx', 'D:\\larvalSchooling\\data\\~$Raw data-SF_Fri_70dpf_GroupABC_n5_20200828_1140-Trial 1.xlsx']
You'll note 'D:\larvalSchooling\data\Raw data-SF_Fri_70dpf_GroupABC_n5_20200828_1140-Trial 1.xlsx' is listed twice.
Rather than going through after the fact and removing duplicates, I was hoping to figure out why it's happening in the first place.
I'm using Python 3.7 on Windows 10 Pro.
If you wrote the code to remove duplicates (which can be as simple as filenames = set(filenames)) you'd see that you still have two filenames. Print them out one on top of the other to make a visual comparison easier:
'D:\\larvalSchooling\\data\\Raw data-SF_Sat_84dpf_GroupABCD_n5_20200822_1440-Trial 1.xlsx',
'D:\\larvalSchooling\\data\\~$Raw data-SF_Fri_70dpf_GroupABC_n5_20200828_1140-Trial 1.xlsx'
The second one has a leading ~$ (probably an auto-backup).
Whenever you open an Excel file, Excel creates a ghost copy that works as a temporary backup for that specific file. In this case:
Raw data-SF_Fri_70dpf_GroupABC_n5_20200828_1140-Trial 1.xlsx
~$Raw data-SF_Fri_70dpf_GroupABC_n5_20200828_1140-Trial 1.xlsx
This means that the file is open in some software, which is showing you that backup (usually that file is hidden from Explorer as well).
Just find that program and close it. You may also want to add validation so that files matching "~$*.xlsx" are ignored, if this is something you want to avoid.
You can use os.path.splitext to get the file extension and loop through the directory using os.listdir. Open Excel files can be skipped using the following code:
import os
filenames = []
for file in os.listdir(r'D:\larvalSchooling\data'):
    filename, file_extension = os.path.splitext(file)
    if file_extension == '.xlsx':
        if not file.startswith('~$'):
            filenames.append(file)
Note: this might not be the best solution, but it'll get the job done :)
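A more compact variant, sketched with pathlib (the path is the asker's; the ~$ filter is the same idea):
from pathlib import Path

filenames = [p.name for p in Path(r'D:\larvalSchooling\data').glob('*.xlsx')
             if not p.name.startswith('~$')]   # skip Excel's temporary lock files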

Randomly choose a file inside a folder using Hypothesis

I want to add tests using the Hypothesis library (already used in the software for testing).
For these tests, I have to use a set of txt files contained in a folder.
I need to randomly choose one of these files each time I run my tests.
How do I do that using Hypothesis?
Edit
Here is basically how it would look, to comply with the templates of the already existing tests.
@given(doc=...)
def mytest(doc):
    # assert some stuff according to doc
    assert some_stuff
Static case
If the file list is assumed to be "frozen" (no files will be deleted or added), then we can use os.listdir + hypothesis.strategies.sampled_from, like
import os
from hypothesis import strategies
directory_path = 'path/to/directory/with/txt/files'
txt_files_names = strategies.sampled_from(sorted(os.listdir(directory_path)))
or if we need full paths
from functools import partial
...
txt_files_paths = (strategies.sampled_from(sorted(os.listdir(directory_path)))
                   .map(partial(os.path.join, directory_path)))
or if the directory may have files of different extensions and we need only .txt ones we can use glob.glob
import glob
...
txt_files_paths = strategies.sampled_from(sorted(glob.glob(os.path.join(directory_path, '*.txt'))))
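Any of these strategies then plugs into the question's @given template; a minimal sketch (the test body is hypothetical):
from hypothesis import given

@given(doc=txt_files_paths)
def test_doc_processing(doc):
    # doc is one randomly chosen .txt path from the directory
    with open(doc) as f:
        contents = f.read()
    assert isinstance(contents, str)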
Dynamic case
If the directory contents may change and we want to rescan the directory on each data-generation attempt, it can be done like
dynamic_txt_files_names = (strategies.builds(os.listdir,
                                             strategies.just(directory_path))
                           .map(sorted)
                           .flatmap(strategies.sampled_from))
or with full paths
dynamic_txt_files_paths = (strategies.builds(os.listdir,
                                             strategies.just(directory_path))
                           .map(sorted)
                           .flatmap(strategies.sampled_from)
                           .map(partial(os.path.join, directory_path)))
or with glob.glob
dynamic_txt_files_paths = (strategies.builds(glob.glob,
                                             strategies.just(os.path.join(directory_path,
                                                                          '*.txt')))
                           .map(sorted)
                           .flatmap(strategies.sampled_from))
Edit
Added sorted following a comment by @Zac Hatfield-Dodds.

How can I read files with similar names in Python, rename them, and then work with them?

I've already posted here with the same question, but sadly I couldn't come up with a solution (even though some of you gave me awesome answers, most of them weren't what I was looking for), so I'll try again, this time giving more information about what I'm trying to do.
So, I'm using a program called GMAT to get some outputs (.txt files with numerical values). These outputs have different names, but because I'm using them for more than one thing, I'm getting something like this:
GMATd_1.txt
GMATd_2.txt
GMATf_1.txt
GMATf_2.txt
Now, what I need to do is use these outputs as inputs in my code. I need to work with them in other functions of my script, and since I will have a lot of these .txt files, I want to rename them, as I don't want to refer to them like './path/etc'.
So what I wanted was to write a loop that could get these files and rename them inside the script so I can use these files with the new name in other functions (outside the loop).
So instead of having to this individually:
GMATds1= './path/GMATd_1.txt'
GMATds2= './path/GMATd_2.txt'
I wanted to write a loop that would do that for me.
I've already tried using a dictionary:
import os
import fnmatch
examples = {}  # avoid naming this dict, which would shadow the built-in
for filename in os.listdir('.'):
    if fnmatch.fnmatch(filename, 'thing*.txt'):
        examples[filename[:6]] = filename
This does work but I can't use the dictionary key outside the loop.
If I understand correctly, you are trying to fetch files with similar names (at least a recurring pattern) and rename them. This can be accomplished with the following code:
import glob
import os
all_files = glob.glob('path/to/directory/with/files/GMAT*.txt')
for file in all_files:
    new_path = create_new_path(file)  # possibly split the file name, change directory and/or filename
    os.rename(file, new_path)
The glob library allows searching for files with * wildcards and hence makes it possible to search for files with a specific pattern. It lists all the files in a certain directory (or multiple directories if you include a * wildcard as a directory). When you iterate over the files, you can either work directly with the contents of the files (as you apparently intend to do) or rename them as shown in this snippet. To rename them, you need to generate a new path, so you would have to write the create_new_path function that takes the old path and creates a new one; a sketch of such a function follows.
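For example, a hypothetical create_new_path matching the asker's GMATds1 naming (the "d_" to "ds" substitution is an assumption about the desired scheme):
import os

def create_new_path(old_path):
    # assumed scheme: keep the directory, turn e.g. "GMATd_1.txt" into "GMATds1.txt"
    directory, name = os.path.split(old_path)
    return os.path.join(directory, name.replace("d_", "ds"))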
Since Python 3.4 you should be using the built-in pathlib module instead of os or glob.
from pathlib import Path
import shutil
for file_src in Path("path/to/files").glob("GMAT*.txt"):
    # e.g. turn "GMATd_1.txt" into "GMATds1.txt" to match the asker's naming
    file_dest = str(file_src.resolve()).replace("d_", "ds")
    shutil.move(str(file_src), file_dest)
You can use:
import os
path = '.....'   # path where these files are located
path1 = '.....'  # path where you want these files to be stored
i = 1
for file in os.listdir(path):
    if file.endswith('.txt'):
        os.rename(path + "/" + file, path1 + "/" + str(i) + ".txt")
        i += 1
This will rename all the .txt files in the source folder to 1.txt, 2.txt, ..., n.txt.

Script to rename files in folder to match names of files in another folder

I need to do a batch rename given the following scenario:
I have a bunch of files in Folder A
A bunch of files in Folder B.
The files in Folder A are all ".doc",
the files in Folder B are all ".jpg".
The files in Folder A are named "A0001.doc"
The files in Folder B are named "A0001johnsmith.jpg"
I want to merge the folders, and rename the files in Folder A so that they append the name portion of the matching file in Folder B.
Example:
Before:
Folder A:          Folder B:
A0001.doc          A0001johnsmith.jpg
After:
Folder C:
A0001johnsmith.doc
A0001johnsmith.jpg
I have seen some batch renaming scripts, but the difference here is that I need to assign a variable to contain the name portion so I can append it to the end of the corresponding file in Folder A.
I figure the best way would be a simple Python script that does a recursive loop, working on each item in the folder as follows:
Parse filename of A0001.doc
Match string to filenames in Folder B
Take the portion following the string that matched but before the "." and assign variable
Take the original string A0001 and append the variable containing the name element and rename it
Copy both files to Folder C (non-destructive, in case of errors etc)
I was thinking of using Python for this, but I could use some help with the syntax and such. I only know a little of the base Python library, and I am guessing I would be importing modules such as os and maybe sys. I have never used them before; any help would be appreciated. I am also open to using a Windows batch script or even PowerShell. Any input is helpful.
This is PowerShell, since you said you would use that.
Please note that I HAVE NOT TESTED THIS. I don't have access to a Windows machine right now, so I can't test it. I'm basing this off memory, but I think it's mostly correct.
foreach($aFile in ls "/path/to/FolderA")
{
    $matchString = $aFile.Name.Split(".")[0]
    $bFile = $(ls "/path/to/FolderB" |? { $_.Name -Match $matchString })[0]
    $addString = $bFile.Name.Split(".")[0].Replace($matchString, "")
    # use .FullName so the copy works regardless of the current directory
    cp $aFile.FullName ("/path/to/FolderC/" + $matchString + $addString + ".doc")
    cp $bFile.FullName "/path/to/FolderC"
}
This makes a lot of assumptions about the name structure. For example, I assumed the string to add doesn't appear in the common filename strings.
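For completeness, here is a Python counterpart of the same matching logic, since the question mentioned Python. This is only a sketch: the folder paths are hypothetical, and it assumes each .doc ID prefix matches at most one .jpg:
import os
import shutil

folder_a, folder_b, folder_c = "FolderA", "FolderB", "FolderC"  # hypothetical paths
os.makedirs(folder_c, exist_ok=True)

for doc in os.listdir(folder_a):
    stem = os.path.splitext(doc)[0]                       # e.g. "A0001"
    for jpg in os.listdir(folder_b):
        if jpg.startswith(stem) and jpg.endswith(".jpg"):
            # "A0001johnsmith.jpg" -> "A0001johnsmith.doc"
            new_name = os.path.splitext(jpg)[0] + ".doc"
            shutil.copy(os.path.join(folder_a, doc), os.path.join(folder_c, new_name))
            shutil.copy(os.path.join(folder_b, jpg), os.path.join(folder_c, jpg))
            break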
It is very simple with a plain batch script.
@echo off
for %%A in ("folderA\*.doc") do (
    for %%B in ("folderB\%%~nA*.jpg") do (
        copy "%%A" "folderC\%%~nB.doc"
        copy "%%B" "folderC"
    )
)
I haven't added any error checking.
You could have problems if you have a file like "A1.doc" matching multiple files like "A1file1.jpg" and "A10file2.jpg".
As long as the .doc files have fixed-width names, and there exists a .jpg for every .doc, the code should work.
Obviously more code could be added to handle various scenarios and error conditions.
