Reading Multiple S3 Folders / Paths Into PySpark - python

I am conducting a big data analysis using PySpark. I am able to import all CSV files, stored in a particular folder of a particular bucket, using the following command:
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('file:///home/path/datafolder/data2014/*.csv')
(where * acts like a wildcard)
The issues I have are the following:
What if I want to do my analysis on both 2014 and 2015 data, i.e. path 1 is .load('file:///home/path/SFweather/data2014/*.csv'), path 2 is .load('file:///home/path/SFweather/data2015/*.csv'), path 3 is .load('file:///home/path/NYCweather/data2014/*.csv') and path 4 is .load('file:///home/path/NYCweather/data2015/*.csv')? How do I import multiple paths at the same time to get one dataframe? Or do I need to store them all individually as dataframes and then join them together within PySpark? (You may assume all the CSVs have the same schema.)
Suppose it is November 2014 now. What if I want to run the analysis again later, but on the "most recent data" run, e.g. dec14 once it is December 2014? For example, in December 2014 I would load .load('file:///home/path/datafolder/data2014/dec14/*.csv'), while the original analysis uses .load('file:///home/path/datafolder/data2014/nov14/*.csv'). Is there a way to schedule the Jupyter notebook (or similar) to update the load path and import the latest run (in this case 'nov14' would be replaced by 'dec14', then 'jan15', etc.)?
I had a look through previous questions but was unable to find an answer, given that this is specific to AWS / PySpark integration.
Thank you in advance for the help!
[Background: I have been given access to many S3 buckets from various teams containing various big data sets. Copying it over to my S3 bucket, then building a Jupyter notebook, seems like a lot more work than just pulling in the data directly from their bucket, building a model / table / etc. on top of it, and saving the processed output into a database. Hence I am posting the questions above. If my thinking is completely wrong, please stop me! :)]

You can read in multiple paths with wildcards as long as the files are all in the same format.
In your example:
.load('file:///home/path/SFweather/data2014/*.csv')
.load('file:///home/path/SFweather/data2015/*.csv')
.load('file:///home/path/NYCweather/data2014/*.csv')
.load('file:///home/path/NYCweather/data2015/*.csv')
You could replace the four load statements above with the following path to read all the CSVs at once into one dataframe:
.load('file:///home/path/*/*/*.csv')
If you want to be more specific in order to avoid reading in certain files/folders, you can do the following:
.load('file:///home/path/{SF,NYC}weather/data201[45]/*.csv')
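Putting it together, a minimal sketch (assuming a Spark 2+ SparkSession named spark; with the older Spark 1.x API from the question you would keep sqlContext.read.format('com.databricks.spark.csv') instead):
df = (
    spark.read
    .format('csv')
    .options(header='true', inferSchema='true')
    # {SF,NYC} is glob alternation and [45] is a character class, so this covers
    # SFweather/data2014, SFweather/data2015, NYCweather/data2014 and NYCweather/data2015
    .load('file:///home/path/{SF,NYC}weather/data201[45]/*.csv')
)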

You can load multiple paths at once using lists of pattern strings. The pyspark.sql.DataFrameReader.load method accepts a list of path strings, which is especially helpful if you can't express all of the paths you want to load using a single Hadoop glob pattern. For reference, Hadoop glob patterns support the following syntax:
?              Matches any single character.
*              Matches zero or more characters.
[abc]          Matches a single character from the character set {a,b,c}.
[a-b]          Matches a single character from the character range {a...b}. Note that
               character a must be lexicographically less than or equal to character b.
[^a]           Matches a single character that is not from the character set or range {a}.
               Note that the ^ character must occur immediately to the right of the
               opening bracket.
\c             Removes (escapes) any special meaning of character c.
{ab,cd}        Matches a string from the string set {ab, cd}.
{ab,c{de,fh}}  Matches a string from the string set {ab, cde, cfh}.
For example, if you want to load the following paths:
[
    's3a://bucket/prefix/key=1/year=2010/*.csv',
    's3a://bucket/prefix/key=1/year=2011/*.csv',
    's3a://bucket/prefix/key=2/year=2020/*.csv',
    's3a://bucket/prefix/key=2/year=2021/*.csv',
]
You could reduce these to two path patterns,
s3a://bucket/prefix/key=1/year=201[0-1]/*.csv and
s3a://bucket/prefix/key=2/year=202[0-1]/*.csv,
and call load() twice. You could go further and reduce these to a single pattern string using {ab,cd} alternation, but I think the most readable way to express paths like these using glob patterns with a single call to load() is to pass a list of path patterns:
spark.read.format('csv').load(
    [
        's3a://bucket/prefix/key=1/year=201[0-1]/*.csv',
        's3a://bucket/prefix/key=2/year=202[0-1]/*.csv',
    ]
)
For the paths you listed in your issue № 1, you can express all four with a single pattern string:
'file:///home/path/{NYC,SF}weather/data201[45]/*.csv'
For your issue № 2, you can write logic to construct the paths you want to load.
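For example, a rough sketch (the helper below and the folder scheme are assumptions based on your naming, e.g. 'nov14', 'dec14'): compute the latest run's folder from the current date at the top of the notebook and build the load path from it. Scheduling the notebook itself (cron, a workflow tool, etc.) is a separate concern.
from datetime import date

def latest_run_path(base='file:///home/path/datafolder'):
    # Hypothetical helper: builds e.g. .../data2014/nov14/*.csv from today's date,
    # following the month-abbreviation + two-digit-year folder names in the question.
    today = date.today()
    run = today.strftime('%b').lower() + today.strftime('%y')   # e.g. 'dec14'
    return '{}/data{}/{}/*.csv'.format(base, today.year, run)

df = spark.read.format('csv').options(header='true', inferSchema='true').load(latest_run_path())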

Related

Find most matched string from pandas dataframe object

I have created a pandas dataframe object with all the Files, FilePaths and FileDirectory names that are in a specific folder. Now I am reading filenames from a JSON file and want to find the exact location of the files by searching 'FilePaths' or 'FileDirectory' in the dataframe/pickle (as it is much faster to search).
What I am trying for example:
>> dcm_sure_full_path = '/Images/20150121100254179/1.2.840.113845.13.4353.3528386102.229789272626/1.2.840.113845.13.4353.3528386102.230081008712'
>> set(df[df['FileDirectory'].str.contains(os.path.basename(dcm_sure_full_path), regex=False)]['FileDirectory'])#.iloc[0]
This gives me 3 different paths, which means some part of the files matches with three different locations.
{'/media/banikr2/CAP_Exam_Data0/Images/20150121100254179/1.2.840.113845.13.4353.3528386102.229789272626/1.2.840.113845.13.4353.3528386102.230081008712',
'/media/banikr2/CAP_Exam_Data0/Working Storage/wilist_VascuCAP CT Development_SE/69930316/im0-1.2.840.113845.13.4353.3528386102.230081008712',
'/media/banikr2/CAP_Exam_Data0/Working Storage/wilist_WP A.2 Development_images_noTruth/69930316/im0-1.2.840.113845.13.4353.3528386102.230081008712'}
but you can clearly see that I actually need the first one, which matches the desired path most closely. I then tried to get the closest match with the following code:
>> set(df[np.char.find(df['FileDirectory'].values.astype(str), dcm_sure_full_path) > -1]['FileDirectory'])#.iloc[rn])
or, simply dropping the os.path.basename from the previous one:
set(df[df['FileDirectory'].str.contains((dcm_sure_full_path), regex=False)]['FileDirectory'])#.iloc[0]
which as a result gives the desired path and discards the other two:
{'/media/banikr2/CAP_Exam_Data0/Images/20150121100254179/1.2.840.113845.13.4353.3528386102.229789272626/1.2.840.113845.13.4353.3528386102.230081008712'}
My question is: are there better, smarter ways to do this sort of search and find with more accuracy, so that I don't miss any file directory?
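One possible direction (a sketch, not a definitive answer; it assumes the query path is always the tail of the stored absolute paths, and it reuses the FileDirectory column and dcm_sure_full_path from above): prefer a suffix match over plain substring containment, or rank candidates by similarity and keep only the closest one.
import difflib

# Suffix match: unambiguous when the query is always the tail of the full stored path.
matches = set(df[df['FileDirectory'].str.endswith(dcm_sure_full_path)]['FileDirectory'])

# Fallback: rank all stored directories by similarity and keep the single closest one.
candidates = df['FileDirectory'].astype(str).unique()
best = difflib.get_close_matches(dcm_sure_full_path, candidates, n=1, cutoff=0.0)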

I want to be able to group same file sequence from a list python

I am working on a Python function which parses a file containing a list of strings: basically a walked folder structure dumped to a txt file, so I don't have to work on the real RAID while in production. That is also a requirement: to work from a txt file containing a list of paths.
lpaths =[
'/projects/0100/dbu/shots/11_1/SC11_1_Shot012/render/SC11_1_Shot012.v01_1025.exr',
'/projects/0100/dbu/shots/11_1/SC11_1_Shot012/render/SC11_1_Shot012.v01_1042.exr',
'/projects/0100/dbu/shots/11_1/SC11_1_Shot012/render/SC11_1_Shot012.v01_1016.exr',
'/projects/0100/dbu/shots/11_1/SC11_1_Shot012/2d/app/Shot012_v1.exr',
'/projects/0100/dbu/shots/11_1/SC11_1_Shot012/2d/app/Shot012_v02.exr',
'/projects/0100/dbu/shots/11_1/SC11_1_Shot004/3d/app2/workspace.cfg',
'/projects/0100/dbu/shots/11_1/SC11_1_Shot004/3d/app2/scenes/SC11_1_Shot004_v01.exr',
'/projects/0100/dbu/shots/11_1/SC11_1_Shot004/3d/app2/scenes/Shot004_camera_v01.exr',
'/projects/0100/dbu/shots/11_1/SC11_1_Shot004/render/SC11_1_Shot004.v01_1112.exr',
'/projects/0100/dbu/shots/11_1/SC11_1_Shot004/render/SC11_1_Shot004.v01_1034.exr',
'/projects/0100/dbu/shots/11_1/SC11_1_Shot004/render/SC11_1_Shot004.v02_1116.exr',
'/projects/0100/dbu/shots/11_1/SC11_1_Shot004/render/SC11_1_Shot004.v02_1126.exr'
]
This is a partial list of the cleaned version I've already worked out, and it works fine.
The real problem: I need to parse all the frames in a folder into a list, so that it holds a proper sequence.
There could be 1 frame or 1000, and there are multiple sequences in the same folder, as seen in the list.
My goal is to have a list for each sequence in a folder, so I can push them ahead for more work down the road.
Code:
import itertools
import pprint as pp

groups = [list(group) for key, group in itertools.groupby(sorted(lpaths), len)]
pp.pprint(groups)
Since you seem to have differing naming conventions, you need to write a function that takes a single string and, possibly using regular expressions, returns an unambiguous key for you to sort on. Say your names are critically identified by the shot number, which can be extracted with r".*[Ss]hot_?(\d+).*\.ext"; you could return the match as a base-10 integer, thereby discarding any leading 0s.
Since you may also have a version number, you could do a similar operation to get an unambiguous version number (and possibly only process the latest version of a given shot), as in the sketch below.
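A rough sketch of that idea (the regular expressions and the exact key are assumptions, not a definitive implementation; adjust them to your real naming conventions):
import itertools
import re
from collections import defaultdict

def sequence_key(path):
    # Hypothetical key: (containing folder, shot number, version number), so all
    # frames of one rendered sequence fall into the same group.
    folder = path.rsplit('/', 1)[0]
    shot = re.search(r'[Ss]hot_?(\d+)', path)
    version = re.search(r'[Vv](\d+)', path)
    return (folder,
            int(shot.group(1)) if shot else -1,
            int(version.group(1)) if version else -1)

groups = defaultdict(list)
for p in sorted(lpaths):
    groups[sequence_key(p)].append(p)

# Equivalent with itertools.groupby, sorting on the same key:
grouped = [list(g) for _, g in itertools.groupby(sorted(lpaths, key=sequence_key), key=sequence_key)]
Grouping on a key like this, instead of on len, keeps frames of the same sequence together even when the paths have different lengths.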

Using python to parse a large set of filenames concatenated from inconsistent object names

tl;dr: Looking to parse a large set of filenames that are a concatenation of two names (container + child) back into the original two names, where the nomenclature is inconsistent. Python library suggestions or any other guidance appreciated.
I am looking for a way to parse strings for information where the nomenclature and formatting of information within those strings will likely be inconsistent to some degree.
Background
Industry: Automation controls
Problem to be solved:
Time series data is exported from an automation system with a single data point being saved to a single .csv file. (example: If the controls system were an environmental controls system the point might be the measured temperature of a room taken at 15 minute intervals.) It is possible to have an environment where there are a few dozen points that export to CSV files or several thousand points that export to CSV files. The structure that the points are normally stored in is as follows: points are contained within a controller, controllers are integrated under a management system and occasionally management systems could be integrated into another management system. The resulting structure is a simple hierarchical tree.
The filenames associated with the CSV files are assembled from the path structure of each point as follows: Directories are created for the management systems (nested if necessary) and under those directories are the CSV files where the filename is a concatenation of the controller name and the point name.
I have written a python script that processes a monthly export of the CSV files (currently about 5500 of them [growing]) into a structured data store and another that assembles spreadsheets for others to review. Currently, I am using some really ugly regular expressions and even uglier string.find()s with a list of static string values that I have hand entered to parse out control names and point names for each file so that they can be inserted into the structured data store.
Unfortunately, as mentioned above, the nomenclature used in these environments are rarely consistent. Point names vary widely. The point referenced above might be known as ROOMTEMP, RM_T, RM-T, ROOM-T, ZN_T, ZNT, RMT or several other possibilities. This applies to almost any point contained within a controller. Controller names are also somewhat inconsistent where they may be named for what type of device they are controlling, the geographic location of the device or even an asset number associated with the device.
I would very much like to get out of the business of hand writing regular expressions to parse file names every time a new location is added. I would like to write code that reads in filenames and looks for patterns across the filenames and then makes a recommendation for parsing the controller and point name out of each filename. I already have an interface where I can assign controller name and point name to each point object by hand so if there are errors with the parse I can modify the results. Ideally, the patterns created by the existing objects would influence the suggested names of new files being parsed.
Some examples of filenames are as follows:
UNIT1254_SAT.csv, UNIT1254_RMT.csv, UNIT1254_fil.csv, AHU_5311_CLG_O.csv, QE239-01_DISCH_STPT.csv, HX_E2_CHW_Return.csv, Plant_RM221_CHW_Sys_Enable.csv, TU_E7_Actual Clg Setpoint.csv, 1725_ROOMTEMP.csv, 1725_DA_T.csv, 1725_RA_T.csv
The order will always be consistent where it is a concatenation of controller name and then point name. There will most likely be a consistent character used to separate controller name from point name (normally an underscore, but occasionally a dash or some other character.)
Does anyone have any recommendations on how to get started with parsing these file names? I've thought through a few ideas, but keep shelving them before implementation because I keep finding potential performance issues or failure points. The rest of my code is working pretty much the way I need it to; I just haven't figured out an efficient or useful way to pull the correct names out of the filename. Unfortunately, it is not an option to modify the names on the control-system side to be consistent.
I don't know if the following code will help you, but I hope it'll give you at least some idea.
Considering that a filename such as "QE239-01_STPT_1725_ROOMTEMP_DA" can contain the following names
'QE239-01'
'QE239-01_STPT'
'QE239-01_STPT_1725'
'QE239-01_STPT_1725_ROOMTEMP'
'QE239-01_STPT_1725_ROOMTEMP_DA'
'STPT'
'STPT_1725'
'STPT_1725_ROOMTEMP'
'STPT_1725_ROOMTEMP_DA'
'1725'
'1725_ROOMTEMP'
'1725_ROOMTEMP_DA'
'ROOMTEMP'
'ROOMTEMP_DA'
'DA'
as being possible elements (container name or point name) of the filename,
I defined the function treat() to return this list from the name.
Then the code treats all the filenames to find all the possible elements of filenames.
The function is based on the idea that, in the chosen example, the element ROOMTEMP can't directly follow the element STPT: STPT_ROOMTEMP isn't a possible container name in this example string, since 1725 sits between the two elements.
Then, with the help of a function from the difflib module, I try to discriminate elements that have some similarity, in order to try to detect patterns under which several name elements can be gathered.
You must play with the value passed to the cutoff parameter to choose what gives the most interesting results for you.
It's certainly far from perfect, but I didn't understand all aspects of your problem.
s =\
"""UNIT1254_SAT
UNIT1254_RMT
UNIT1254_fil
AHU_5311_CLG_O
QE239-01_DISCH_STPT
HX_E2_CHW_Return
Plant_RM221_CHW_Sys_Enable
TU_E7_Actual Clg Setpoint
1725_ROOMTEMP
1725_DA_T
1725_RA_T
UNT147_ROOMTEMP
TRU_EZ_RM_T
HXX_V2_RM-T
RHXX_V2_ROOM-T
SIX8_ZN_T
Plint_RP228_ZNT
SOHO79_EZ_RMT"""
li = s.split('\n')
print(li)
print('- - - - - - - - - - - - - - - - - ')
import difflib
from pprint import pprint
def treat(name):
    lu = name.split('_')
    W = []
    while lu:
        W.extend('_'.join(lu[0:x]) for x in range(1, len(lu)+1))
        lu.pop(0)
    return W

if 0:
    q = "QE239-01_STPT_1725_ROOMTEMP_DA"
    pprint(treat(q))

print('==========================================')

WALL = []
for t in li:
    WALL.extend(treat(t))
pprint(WALL)

for x in WALL:
    j = set(difflib.get_close_matches(x, WALL, n=9000000, cutoff=0.7))
    if len(j) > 1:
        print(j, '\n')

How to find a specific file in Python

I have a directory with files of the following structure
A2ML1_A8K2U0_MutationOutput.txt
A4GALT_Q9NPC4_MutationOutput.txt
A4GNT_Q9UNA3_MutationOutput.txt
...
The first few letters represent the gene, the next few the Uniprot Number (a unique protein identifier) and MutationOutput is self explanatory.
In Python, I want to execute the following line:
f_outputfile.write(mutation_directory + SOMETHING +line[1+i]+"_MutationOutput.txt\n")
here, line[1+i] correctly identifies the Uniprot ID.
What I need to do is correctly identify the gene name. So somehow, I need to quickly search over that directory, find the file that has the line[i+1] value in its uniprot field and then pull out the gene name.
I know I can list all the files in the directory, then I can do str.split() on each string and find it. But is there a way I can do that smarter? Should I use a dictionary? Can I just do a quick regex search?
The entire directory is about 8,116 files -- so not that many.
Thank you for your help!
What I need to do is correctly identify the gene name. So somehow, I need to quickly search over that directory, find the file that has the line[i+1] value in its uniprot field and then pull out the gene name.
Think about how you'd do this in the shell:
$ ls mutation_directory/*_A8K2U0_MutationOutput.txt
mutation_directory/A2ML1_A8K2U0_MutationOutput.txt
Or, if you're on Windows:
D:\Somewhere> dir mutation_directory\*_A8K2U0_MutationOutput.txt
A2ML1_A8K2U0_MutationOutput.txt
And you can do the exact same thing in Python, with the glob module:
>>> import glob
>>> glob.glob('mutation_directory/*_A8K2U0_MutationOutput.txt')
['mutation_directory/A2ML1_A8K2U0_MutationOutput.txt']
And of course you can wrap this up in a function:
>>> def find_gene(uniprot):
...     pattern = 'mutation_directory/*_{}_MutationOutput.txt'.format(uniprot)
...     return glob.glob(pattern)[0]
But is there a way I can do that smarter? Should I use a dictionary?
Whether that's "smarter" depends on your use pattern.
If you're looking up thousands of files per run, it would certainly be more efficient to read the directory just once and use a dictionary instead of repeatedly searching. But if you're planning on, e.g., reading in an entire file anyway, that's going to take orders of magnitude longer than looking it up, so it probably won't matter. And you know what they say about premature optimization.
But if you want to, you can make a dictionary keyed by the Uniprot number pretty easily:
import os

d = {}
for f in os.listdir('mutation_directory'):
    gene, uniprot, suffix = f.split('_')
    # Store the full relative path so lookups match what glob returned above.
    d[uniprot] = os.path.join('mutation_directory', f)
And then:
>>> d['A8K2U0']
'mutation_directory/A2ML1_A8K2U0_MutationOutput.txt'
Can I just do a quick regex search?
For your simple case, you don't need regular expressions.*
More importantly, what are you going to search? Either you're going to loop—in which case you might as well use glob—or you're going to have to build up an artificial giant string to search—in which case you're better off just building the dictionary.
* In fact, at least on some platforms/implementations, glob is implemented by making a regular expression out of your simple wildcard pattern, but you don't have to worry about that.
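If you're curious, you can see that translation yourself with the standard-library fnmatch module (a small illustration only, not needed for the solution above):
>>> import fnmatch, re
>>> pattern = fnmatch.translate('*_A8K2U0_MutationOutput.txt')   # wildcard pattern -> regex string
>>> bool(re.match(pattern, 'A2ML1_A8K2U0_MutationOutput.txt'))
True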
You can use glob
In [4]: import glob
In [5]: files = glob.glob('*_Q9UNA3_*')
In [6]: files
Out[6]: ['A4GNT_Q9UNA3_MutationOutput.txt']

How to find and replace 6 digit numbers within HREF links from map of values across site files, ideally using SED/Python

I need to create a Bash script, ideally using sed, to find and replace value lists in href URL link constructs within HTML site files, looking them up in a map (of old to new values), where the links have a given URL construct. There are around 25K site files to look through, and the map has around 6,000 entries that I have to search through.
All old and new values have 6 digits.
The URL construct is:
One value:
HREF=".*jsp\?.*N=[0-9]{1,}.*"
List of values:
HREF=".*\.jsp\?.*N=[0-9]{1,}+N=[0-9]{1,}+N=[0-9]{1,}...*"
The list of values is delimited by the + (plus) symbol, and the list can be 1 to n values in length.
I want to ignore a construct such as this:
HREF=".*\.jsp\?.*N=0.*"
i.e. the list is only N=0.
Effectively, I'm only interested in URLs that include one or more values that are in the map file and are not flagged with CHANGED, i.e. the list requires updating.
PLEASE NOTE: in the above construct examples, .* means any character that isn't a digit; I'm just interested in any 6-digit values in the list of values after N=. So I'm trying to isolate the N= list from the rest of the URL construct, and it should be noted that this N= list can appear anywhere within the URL construct.
Initially, I want to create a script that will produce a report of all links that fulfil the above criteria and have a 6-digit OLD value that's in the map file, along with the file path, to get an understanding of the links impacted. E.g.:
Filename link
filea.jsp /jsp/search/results.jsp?N=204200+731&Ntx=mode+matchallpartial&Ntk=gensearch&Ntt=
filea.jsp /jsp/search/BROWSE.jsp?Ntx=mode+matchallpartial&N=213890+217867+731&
fileb.jsp /jsp/search/results.jsp?N=0+450+207827+213767&Ntx=mode+matchallpartial&Ntk=gensearch&Ntt=
Lastly, I'd like to find and replace all 6-digit numbers within the URL construct lists, as outlined above, as efficiently as possible (I'd like it to be reasonably fast, as there could be around 25K files, with 6K values to look up and potentially multiple values in each list).
PLEASE NOTE: there is an additional issue when finding and replacing: an old value could have been assigned a new value that has itself already been used as an old value elsewhere, and that value may also have to be replaced.
E.G. If the map file is as below:
MAP-FILE.txt
OLD NEW
214865 218494
214866 217854
214867 214868
214868 218633
... ...
and there is a HREF link such as:
/jsp/search/results.jsp?Ntx=mode+matchallpartial&Ntk=gensearch&N=0+450+214867+214868
214867 changes to 214868; the new value would need to be flagged to mark that it has already been changed and should not be replaced again, otherwise what was 214867 would become 218633 (since all 214868 would be changed to 218633). Hope this makes sense. I would then need to run through the file and remove the flag from all 6-digit numbers that had been marked, so that the link would become:
/jsp/search/results.jsp?Ntx=mode+matchallpartial&Ntk=gensearch&N=0+450+214868CHANGED+218633CHANGED
Unless there's a better way to manage these infile changes.
Could someone please help me with this? I'm not an expert with these kinds of changes, so help would be massively appreciated.
Many thanks in advance,
Alex
I will write an outline for the code in a kind of pseudocode, as I don't remember Python well enough to write the code quickly in Python.
First, find what type the link is (if it contains N=0 then type 3, if it contains "+" then type 2, else type 1) and get a list of the "N=..." strings by exploding (a PHP function; split in Python) on the "+" sign.
The first loop is over the links. The second loop is over each N= number. The third loop looks in the map file and finds the replacement value. Load the data of the map file into a variable before all the loops; file reading is the slowest operation you have in programming.
You replace the value in the third loop, then implode (another PHP function; join in Python) the list of new strings into a new link when returning to the first loop.
You probably have several files containing links, so you need another loop over the files.
When dealing with repeated codes you need a while loop until a spare number is found, and you need to save the numbers that are already used in a list.
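A Python sketch of that outline (hedged: the file locations and the exact N= layout are assumptions based on the question; the map file is taken to be whitespace-delimited with OLD and NEW columns as shown above):
import glob
import re

# Load the OLD -> NEW map once, before any file processing (skip the header row).
mapping = {}
with open('MAP-FILE.txt') as fh:
    next(fh)                                   # skip the "OLD NEW" header line
    for line in fh:
        old, new = line.split()
        mapping[old] = new

n_list = re.compile(r'N=[0-9+]+')              # the N=...+...+... value list inside an href

def replace_values(match):
    values = match.group(0)[2:].split('+')     # drop the leading "N=" and split on "+"
    out = []
    for v in values:
        if len(v) == 6 and v in mapping:
            # Tag the replacement so a NEW value that also appears as an OLD
            # value is not replaced a second time.
            out.append(mapping[v] + 'CHANGED')
        else:
            out.append(v)
    return 'N=' + '+'.join(out)

for path in glob.glob('site/**/*.jsp', recursive=True):   # hypothetical location of the 25K files
    with open(path) as fh:
        text = fh.read()
    text = n_list.sub(replace_values, text)
    text = text.replace('CHANGED', '')         # second pass: strip the markers
    with open(path, 'w') as fh:
        fh.write(text)
Because the substitution rewrites each N= list from its original values in a single pass, a value such as 214867 -> 214868 cannot cascade into 218633; the CHANGED marker mirrors the flagging approach described in the question.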
