I am reading a multiline file with PySpark in Databricks and then flattening it to get the individual columns, which will ultimately be stored as a Delta table. I have done this successfully for 40-odd files, but I am having a performance issue with the last one.
The flatten step takes 10+ minutes, but the write step runs forever.
from pyspark.sql import functions as F, types as T

# Flatten the dataframe to get all keys
def flatten(df):
    # Collect every column that is still an array or a struct
    complex_fields = dict([
        (field.name, field.dataType)
        for field in df.schema.fields
        if isinstance(field.dataType, (T.ArrayType, T.StructType))
    ])
    qualify = list(complex_fields.keys())[0] + "_"
    while len(complex_fields) != 0:
        col_name = list(complex_fields.keys())[0]
        if isinstance(complex_fields[col_name], T.StructType):
            # Promote each struct field to a top-level column
            expanded = [F.col(col_name + '.' + k).alias(col_name + '_' + k)
                        for k in [n.name for n in complex_fields[col_name]]]
            df = df.select("*", *expanded).drop(col_name)
        elif isinstance(complex_fields[col_name], T.ArrayType):
            # The body of this branch is missing from the post; exploding the
            # array into one row per element is assumed here
            df = df.withColumn(col_name, F.explode_outer(col_name))
        # Recompute the remaining complex columns after each step
        complex_fields = dict([
            (field.name, field.dataType)
            for field in df.schema.fields
            if isinstance(field.dataType, (T.ArrayType, T.StructType))
        ])
    # Strip the prefix of the first complex column from the column names
    for df_col_name in df.columns:
        df = df.withColumnRenamed(df_col_name, df_col_name.replace(qualify, ""))
    return df
To improve performance I have increased the cores and executor memory, but I am not sure what else can be done. This is a single file, but it has multiple arrays; looking at the DAG, it looks like one input record is going to create millions of records in the target file.
I need help if anyone has faced similar issues.
I have attached the SQL interpreter that is generated automatically. From the stdout log, all I can understand is that the garbage collector (GC) is trying to free up space, and that multiple full GCs are running.
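To make the row blow-up concrete, here is a minimal plain-Python sketch (not Spark code) of what exploding several independent arrays of the same record does: each explode multiplies the row count, so the output is the cross product of the arrays. The record layout and the helper names `exploded_row_count` and `explode_all` are hypothetical, and treating an empty array as one row is an assumption that mirrors explode_outer-style behaviour.

```python
from itertools import product
from math import prod

def exploded_row_count(record):
    """Rows produced when every list-valued field of one record is exploded in turn."""
    # An empty array still contributes one row (explode_outer-style assumption).
    return prod(max(len(v), 1) for v in record.values() if isinstance(v, list))

def explode_all(record):
    """Materialise the exploded rows for a small record (illustration only)."""
    scalars = {k: v for k, v in record.items() if not isinstance(v, list)}
    arrays = {k: (v or [None]) for k, v in record.items() if isinstance(v, list)}
    rows = []
    # Every combination of one element per array becomes its own row
    for combo in product(*arrays.values()):
        rows.append({**scalars, **dict(zip(arrays.keys(), combo))})
    return rows

record = {"id": 1, "a": [1, 2, 3], "b": ["x", "y"], "c": [10, 20, 30, 40]}
print(exploded_row_count(record))  # 3 * 2 * 4 rows
print(len(explode_all(record)))
```

Three sibling arrays of 100 elements each would already give 1,000,000 rows for a single input record, which would be consistent with what the DAG suggests.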
I want to create some code in TabPy that will count the frequency of words in a column and remove stop words for a word cloud in Tableau.
I'm able to do this easily enough in Python:
other1_count = other1.answer.str.split(expand=True).stack().value_counts()
other1_count = other1_count.to_frame().reset_index()
other1_count.columns = ['Word', 'Count']
### Remove stopwords
other1_count['Word'] = other1_count['Word'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
other1_count['Word'].replace('', np.nan, inplace=True)
other1_count.dropna(subset=['Word'], inplace=True)
other1_count = other1_count[~other1_count.Word.str.contains("nan")]
But I'm less sure how to run this through TabPy. Is anyone familiar with TabPy who can tell me how to make this run?
Thanks in advance.
I worked on a project that accomplished something very similar a while back in R. Here's a video example showing the proof-of-concept (no audio). https://www.screencast.com/t/xa0yemiDPl
It essentially shows the end state of using Tableau to interactively examine the description of wines in a word-cloud for the selected countries. The key components were:
have Tableau connect to the data to analyze, as well as a placeholder dataset that has the number of records you expect to get back from your Python/R code (the call out to Python/R from Tableau expects to get back the same number of records it sends off to process... that can be problematic if you're sending text data but processing it to return many more records, as would be the case in the word cloud example)
have the Python/R code connect to your data and return the Word and Frequency counts in a single vector, separated by a delimiter (what Tableau will require for a word cloud)
split the single vector using Tableau Calculated Fields
leverage parameter actions to select parameter values to pass to the Python/R code
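The second component above (returning the word and frequency counts in a single vector, separated by a delimiter) could look roughly like this in Python instead of R. This is only a sketch: the function name `top_words`, the tiny `STOP_WORDS` set and the sample descriptions are hypothetical, and how the function is called from Tableau through TabPy is not shown.

```python
from collections import Counter
import re

# Hypothetical, deliberately tiny stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "with", "is", "in", "to"}

def top_words(texts, n, stop_words=STOP_WORDS, sep="~"):
    """Return the n most frequent non-stop words as 'word~count' strings."""
    counts = Counter()
    for text in texts:
        # Lower-case the text and keep runs of letters/apostrophes as words
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(w for w in words if w not in stop_words)
    return [f"{word}{sep}{count}" for word, count in counts.most_common(n)]

descriptions = [
    "Aromas of cherry and oak",
    "Bright cherry with a hint of oak and spice",
    "Cherry, plum and spice",
]
print(top_words(descriptions, 3))
```

Each returned string can then be split on the `~` delimiter with Tableau calculated fields, which is the job of the third component.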
High-Level Overview
Tableau Calculated Field - [R Words+Freq]:
print(.arg2) # grouping
print(.arg1) # selected country
# TEST VARIABLE (non-prod)
.MaxSourceDataRecords = 1000 # -1 to disable
.country = "' + [Country Parameter] + '"
.wordsToReturn = ' + str([Return Top N Words]) + '
.countryUseAll = (.country == "All")
#setwd("C:/Users/jbelliveau/....FILL IN HERE...")
.fileIn = ' + [Source Data Path] + '
#.fileOut = "winemag-with-DTM.csv"
#install.packages("RColorBrewer") # not needed if installed wordcloud package
library(RColorBrewer) # color package (maps or wordclouds)
wineAll = read.csv(.fileIn, stringsAsFactors=FALSE)
# TODO separately... polarity
# use all the data or just the parameter selected
if ( .countryUseAll ) {
  wine = wineAll # use all the data
} else {
  wine = wineAll[c(wineAll$country == .country),] # filter down to parameter passed from Tableau
}
# limited data for speed (NOT FOR PRODUCTION)
if( .MaxSourceDataRecords > 0 ){
  print("limiting the number of records to use from input data")
  wine = head(wine, .MaxSourceDataRecords)
}
corpus = Corpus(VectorSource(wine$description))
corpus = tm_map(corpus, tolower)
#corpus = tm_map(corpus, PlainTextDocument) # https://stackoverflow.com/questions/32523544/how-to-remove-error-in-term-document-matrix-in-r/36161902
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, stopwords("English"))
dtm = DocumentTermMatrix(corpus)
mysample = dtm # no sampling (used Head on data read... for speed/simplicity on this example)
#mysample <- dtm[sample(1:nrow(dtm), 5000, replace=FALSE),]
wineSample = as.data.frame(as.matrix(mysample))
# column names (the words)
# use colnames to get a vector of the words
# freq of words
# colSums to get the frequency of the words
#wineWordFreq = colSums(wineSample)
# structure in a way Tableau will like it
wordCloudData = data.frame(words=colnames(wineSample), freq=colSums(wineSample))
# sort by word freq
wordCloudDataSorted = wordCloudData[order(-wordCloudData$freq),]
# join together by ~ for processing once Tableau gets it
wordAndFreq = paste(wordCloudDataSorted[, 1], wordCloudDataSorted[, 2], sep = "~")
#write.table(wordCloudData, .fileOut, sep=",",row.names=FALSE) # if needed for performance refactors
topWords = head(wordAndFreq, .wordsToReturn)
return( topWords )
Max([Country Parameter])
, MAX([RowNum]) // for testing the grouping being sent to R
Tableau Calculated Field for the Word Value:
// grab the first token to the left of ~
Left([R Words+Freq], Find([R Words+Freq],"~") - 1)
Tableau Calculated Field for the Frequency Value:
INT(REPLACE([R Words+Freq],[Word]+"~",""))
If you're not familiar with Tableau, you'll likely want to work alongside a Tableau analyst at your company who is. They'll be able to help you create the calculated fields and configure Tableau to connect to TabPy.
I think that the best way to get familiar with Python related to Tableau could be this (old) thread on the Tableau community:
It explains step-by-step the initial set up and how to "call" Python via Tableau Calculated fields.
In addition, you'll find at the top of the post the reference to the more updated TabPy GitHub repository:
Assume this is a sample of my data: dataframe
The entire dataframe is stored in a CSV file (dataframe.csv) that is 40 GB, so I can't open all of it at once.
I am hoping to find the 25 most dominant names for all genders. My instinct is to write a for loop that runs through the file (because I can't open it all at once) and keep a Python dictionary that holds a counter for each name (which I will increment as I go through the data).
To be honest, I'm confused about where to even start with this (how to create the dictionary, since to_dict() does not appear to do what I'm looking for). Also, is this even a good solution? Is there a more efficient way someone can think of?
SUMMARY -- sorry if the question is a bit long:
The CSV file storing the data is very big and I can't open it all at once, but I'd like to find the top 25 dominant names in the data. Any ideas on what to do and how to do it?
I'd appreciate any help I can get! :)
Thanks for the interesting task! I've implemented a pure numpy + pandas solution. It uses a sorted array to keep names and counts, so the algorithm should be around O(n * log n) complexity.
I didn't find any hash table in numpy; a hash table would definitely be faster (O(n)). Hence I used numpy's existing sorting/inserting routines.
I also used .read_csv() from pandas with the iterator = True, chunksize = 1 << 24 params; this allows reading the file in chunks and producing a pandas dataframe of fixed size from each chunk.
Note! For the first runs (until the program is debugged), set limit_chunks (the number of chunks to process) in the code to a small value (like 5). This is to check that the whole program runs correctly on partial data.
You need to run the command python -m pip install pandas numpy once to install these 2 packages if you don't have them.
Progress is printed once in a while: total megabytes done plus speed.
The result is printed to the console and saved to the fname_res file; all constants configuring the script are placed at the beginning of the script. The topk constant controls how many top names are output to the file/console.
I'm interested in how fast my solution is. If it is too slow, maybe I'll devote some time to writing a nice HashTable class using pure numpy.
You can also try running the code below online.
import os, math, time, sys
# Needs: python -m pip install pandas numpy
import pandas as pd, numpy as np

fname = 'test.csv'
fname_res = 'test.res'
chunk_size = 1 << 24
limit_chunks = None # Number of chunks to process, set to None if to process whole file
all_genders = ['Male', 'Female']
topk = 1000 # How many top names to output
progress_step = 1 << 23 # in bytes

fsize = os.path.getsize(fname)
#el_man = enlighten.get_manager() as el_man
#el_ctr = el_man.counter(color = 'green', total = math.ceil(fsize / 2 ** 20), unit = 'MiB', leave = False)
# Per-gender sorted table of names and counts; the largest code point acts as a sentinel
tables = {g : {
    'vals': np.full([1], chr(0x10FFFF), dtype = np.str_),
    'cnts': np.zeros([1], dtype = np.int64),
} for g in all_genders}
tb = time.time()

def Progress(
    done, total = min([fsize] + ([chunk_size * limit_chunks] if limit_chunks is not None else [])),
    cfg = {'progressed': 0, 'done': False},
):
    if not cfg['done'] and (done - cfg['progressed'] >= progress_step or done >= total):
        if done < total:
            while cfg['progressed'] + progress_step <= done:
                cfg['progressed'] += progress_step
        else:
            cfg['progressed'] = total
        sys.stdout.write(
            f'{str(round(cfg["progressed"] / 2 ** 20)).rjust(5)} MiB of ' +
            f'{str(round(total / 2 ** 20)).rjust(5)} MiB ' +
            f'speed {round(cfg["progressed"] / 2 ** 20 / (time.time() - tb), 4)} MiB/sec\n'
        )
        if done >= total:
            cfg['done'] = True

with open(fname, 'rb', buffering = 1 << 26) as f:
    for i, df in enumerate(pd.read_csv(f, iterator = True, chunksize = chunk_size)):
        if limit_chunks is not None and i >= limit_chunks:
            break
        if i == 0:
            name_col = df.columns.get_loc('First Name')
            gender_col = df.columns.get_loc('Gender')
        names = np.array(df.iloc[:, name_col]).astype('str')
        genders = np.array(df.iloc[:, gender_col]).astype('str')
        for g in all_genders:
            ctab = tables[g]
            gnames = names[genders == g]
            vals, cnts = np.unique(gnames, return_counts = True)
            if vals.size == 0:
                continue
            if ctab['vals'].dtype.itemsize < names.dtype.itemsize:
                ctab['vals'] = ctab['vals'].astype(names.dtype)
            # Find where each name sits (or would sit) in the sorted table
            poss = np.searchsorted(ctab['vals'], vals)
            exist = ctab['vals'][poss] == vals
            ctab['cnts'][poss[exist]] += cnts[exist]
            nexist = np.flatnonzero(exist == False)
            ctab['vals'] = np.insert(ctab['vals'], poss[nexist], vals[nexist])
            ctab['cnts'] = np.insert(ctab['cnts'], poss[nexist], cnts[nexist])
        Progress(f.tell()) # call assumed; the call site is missing from the post

with open(fname_res, 'w', encoding = 'utf-8') as f:
    for g in all_genders:
        print(g, '\n')
        order = np.flip(np.argsort(tables[g]['cnts']))[:topk]
        snames, scnts = tables[g]['vals'][order], tables[g]['cnts'][order]
        if snames.size > 0:
            for n, c in zip(np.nditer(snames), np.nditer(scnts)):
                n, c = str(n), int(c)
                if c == 0:
                    continue
                f.write(f'{c} {n}\n')
                print(c, n.encode('ascii', 'replace').decode('ascii'))
import pandas as pd
df = pd.read_csv("sample_data.csv")
print(df['First Name'].value_counts())
The second line will read your CSV into a pandas dataframe and the third line should print the occurrences of each name.
This doesn't seem to be a case where pandas is really going to be an advantage. But if you're committed to going down that route, change the read_csv chunksize parameter, then filter out the useless columns.
Perhaps consider using a different set of tooling such as a database or even vanilla python using a generator to populate a dict in the form of name:count.
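The vanilla Python route could look something like the sketch below, which streams the file one row at a time and keeps one counter per gender. The column names 'First Name' and 'Gender' are taken from the numpy answer above; the function name `top_names` and the in-memory sample are hypothetical stand-ins.

```python
import csv
import io
from collections import Counter, defaultdict

def top_names(lines, k=25, name_col="First Name", gender_col="Gender"):
    """Stream CSV rows and return the k most common names per gender."""
    counts = defaultdict(Counter)      # gender -> Counter of names
    for row in csv.DictReader(lines):  # yields one row at a time
        counts[row[gender_col]][row[name_col]] += 1
    return {gender: c.most_common(k) for gender, c in counts.items()}

# Tiny in-memory stand-in for the 40 GB file; for the real file, pass
# open("dataframe.csv", newline="") instead of the StringIO object.
sample = io.StringIO(
    "First Name,Gender\n"
    "Ann,Female\nBob,Male\nAnn,Female\nCid,Male\nBob,Male\nBob,Male\n"
)
result = top_names(sample, k=2)
print(result)
```

Memory use grows with the number of distinct names rather than with the file size, which is why this should fit in RAM even though the file does not.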
Hoping someone can help me here. I have two BigQuery tables that I read into two different PCollections, p1 and p2. I essentially want to update product based on a type II transformation that keeps track of history (previous values in the nested column in product) and appends new values from dwsku.
The idea is to check every row in each collection. If there is a match based on some table values (between p1 and p2), then check product's nested data to see if it contains all values in p1 (based on its sku number and brand). If it does not contain the most recent data from p2, then take a copy of the format of the current nested data in product and fit the new data into it. Take this nested format and add it to the existing nested products in product.
def process_changes(element, productdata):
    for data in productdata:
        if element['sku_number'] == data['sku_number'] and element['brand'] == data['brand']:
            logging.info('Processing Product: ' + str(element['sku_number']) + ' brand:' + str(element['brand']))
            datatoappend = []
            for nestline in data['product']:
                logging.info('Nested Data: ' + nestline['product'])
                if nestline['in_use'] == 'Y' and (nestline['sku_description'] != element['sku_description'] or nestline['department_id'] != element['department_id'] or nestline['department_description'] != element['department_description']
                        or nestline['class_id'] != element['class_id'] or nestline['class_description'] != element['class_description'] or nestline['sub_class_id'] != element['sub_class_id'] or nestline['sub_class_description'] != element['sub_class_description'] ):
                    logging.info('we found a sku we need to update')
                    logging.info('sku is ' + str(data['sku_number']))
                    newline = nestline.copy()
                    logging.info('most recent nested product element turned off...')
                    nestline['in_use'] = 'N'
                    nestline['expiration_date'] = "%s-%s-%s" % (curdate.year, curdate.month, curdate.day) # CURRENT DATE
                    logging.info('inserting most recent change in dwsku inside nest')
                    newline['sku_description'] = element['sku_description']
                    newline['department_id'] = element['department_id']
                    newline['department_description'] = element['department_description']
                    newline['class_id'] = element['class_id']
                    newline['class_description'] = element['class_description']
                    newline['sub_class_id'] = element['sub_class_id']
                    newline['sub_class_description'] = element['sub_class_description']
                    newline['in_use'] = 'Y'
                    newline['effective_date'] = "%s-%s-%s" % (curdate.year, curdate.month, curdate.day) # CURRENT DATE
                    newline['modified_date'] = "%s-%s-%s" % (curdate.year, curdate.month, curdate.day) # CURRENT DATE
                    newline['modified_time'] = "%s:%s:%s" % (curdate.hour, curdate.minute, curdate.second)
                    # The post assigns this to nestline, which would overwrite the
                    # expiry set above; newline is assumed to be the intent
                    newline['expiration_date'] = "9999-01-01"
                    datatoappend.append(newline) # line missing from the post, assumed
                else:
                    logging.info('Nothing changed for sku ' + str(data['sku_number']))
            for dt in datatoappend:
                logging.info('processed sku ' + str(element['sku_number']))
                logging.info('adding the changes (if any)')
                data['product'].append(dt) # line missing from the post, assumed
            # Placement of the return is a guess; the post's indentation is lost
            return data

changed_product = p1 | beam.FlatMap(process_changes, AsIter(p2))
changed_product = p1 | beam.FlatMap(process_changes, AsIter(p2))
Afterwards I want to add all values in p1 not in p2 in a nested format as seen in nestline.
Any help would be appreciated as I'm wondering why my job is taking hours to run with nothing to show. Even the output logs in dataflow UI don't show anything.
Thanks in advance!
This can be quite expensive if the side input PCollection p2 is large. From your code snippets it's not clear how PCollection p2 is constructed. But if it is, for example, a text file of size 62.7MB, processing it per element can be pretty expensive. Can you consider using CoGroupByKey: https://beam.apache.org/documentation/programming-guide/#cogroupbykey
Also note that from a FlatMap, you are supposed to return an iterator of elements from the processing method. It seems like you are returning a dictionary ('data'), which is probably incorrect.
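To illustrate both points, here is a plain-Python sketch of the idea (it is not Beam code). In a pipeline the grouping would be done by CoGroupByKey on the two PCollections keyed by (sku_number, brand); the helper names `cogroup`, `process_group` and `by_sku_brand` and the sample rows are hypothetical. Each row of p1 then meets only the p2 rows with the same key instead of scanning all of p2, and the per-key function returns a list rather than a dictionary.

```python
from collections import defaultdict

def cogroup(p1_rows, p2_rows, key_fn):
    """Group two row collections by key: key -> {'p1': [...], 'p2': [...]}."""
    grouped = defaultdict(lambda: {"p1": [], "p2": []})
    for row in p1_rows:
        grouped[key_fn(row)]["p1"].append(row)
    for row in p2_rows:
        grouped[key_fn(row)]["p2"].append(row)
    return dict(grouped)

def process_group(key, group):
    """Return a list of outputs, the shape a FlatMap callable should return."""
    # Only keys present in both collections are candidates for an update
    if group["p1"] and group["p2"]:
        return [(key, len(group["p1"]), len(group["p2"]))]
    return []

def by_sku_brand(row):
    return (row["sku_number"], row["brand"])

p1 = [{"sku_number": 1, "brand": "A"}, {"sku_number": 2, "brand": "A"}]
p2 = [{"sku_number": 1, "brand": "A", "product": []}]

grouped = cogroup(p1, p2, by_sku_brand)
for key, group in grouped.items():
    print(key, process_group(key, group))
```

Keys that appear only in p1 come out with an empty 'p2' list, which is also a natural place to handle the rows of p1 that are not in p2.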
I'm trying to analyze a large amount of GitHub Archive Data and am stumped by many limitations.
My analysis requires me to search a 350 GB dataset. I have a local copy of the data, and there is also a copy available via Google BigQuery. The local dataset is split up into 25,000 individual files. The dataset is a timeline of events.
I want to plot the number of stars each repository has had since its creation (only for repos that currently have > 1000 stars).
I can get this result very quickly using Google BigQuery, but it "analyzes" 13.6GB of data each time. This limits me to <75 requests without having to pay $5 per additional 75.
My other option is to search through my local copy, but searching through each file for a specific string (repository name) takes way too long. Took over an hour on an SSD drive to get through half the files before I killed the process.
What is a better way I can approach analyzing such a large amount of data?
Python Code for Searching Through all Local Files:
for yy in range(11,15):
    for mm in range(1,13):
        for dd in range(1,32):
            for hh in range(0,24):
                counter = counter + 1
                # Skip files outside the requested window (continue/break
                # are missing from the post and assumed here)
                if counter < startAt:
                    continue
                if counter > stopAt:
                    break
                #print counter
                strHH = str(hh)
                strDD = str(dd)
                strMM = str(mm)
                strYY = str(yy)
                # Zero-pad day and month to match the file names
                if len(strDD) == 1:
                    strDD = "0" + strDD
                if len(strMM) == 1:
                    strMM = "0" + strMM
                #print strYY + "-" + strMM + "-" + strDD + "-" + strHH
                try:
                    f = json.load (open ("/Volumes/WD_1TB/GitHub Archive/20"+strYY+"-"+strMM+"-"+strDD+"-"+strHH+".json", 'r') , cls=ConcatJSONDecoder)
                    for each_event in f:
                        if(each_event["type"] == "WatchEvent"):
                            try:
                                num_stars = int(each_event["repository"]["watchers"])
                                created_at = each_event["created_at"]
                                json_entry[4][created_at] = num_stars
                            except Exception as e:
                                print(e)
                except Exception as e:
                    print(e)
Google Big Query SQL Command:
SELECT repository_owner, repository_name, repository_watchers, created_at
FROM [githubarchive:github.timeline]
WHERE type = "WatchEvent"
AND repository_owner = "mojombo"
AND repository_name = "grit"
ORDER BY created_at
I am really stumped so any advice at this point would be greatly appreciated.
If most of your BigQuery queries only scan a subset of the data, you can do one initial query to pull out that subset (use "Allow Large Results"). Then subsequent queries against your small table will cost less.
For example, if you're only querying records where type = "WatchEvent", you can run a query like this:
SELECT repository_owner, repository_name, repository_watchers, created_at
FROM [githubarchive:github.timeline]
WHERE type = "WatchEvent"
And set a destination table as well as the "Allow Large Results" flag. This query will scan the full 13.6 GB, but the output is only 1 GB, so subsequent queries against the output table will only charge you for 1 GB at most.
That still might not be cheap enough for you, but just throwing the option out there.
I found a solution to this problem: using a database. I imported the relevant data from my 360+ GB of JSON data into a MySQL database and queried that instead. What used to be a 3+ hour query time per element became <10 seconds.
MySQL wasn't the easiest thing to set up, and the import took approximately 7.5 hours, but the results made it well worth it for me.
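For anyone who wants to try the same idea without setting up a server, here is a rough sketch using Python's built-in sqlite3 module as a stand-in for MySQL (this is not what I ran). The table name, index name and sample rows are made up for illustration; the column names follow the BigQuery query above.

```python
import sqlite3

# Hypothetical, already-parsed WatchEvent records:
# (repository_owner, repository_name, repository_watchers, created_at)
events = [
    ("mojombo", "grit", 1500, "2012-03-01T10:00:00Z"),
    ("mojombo", "grit", 1501, "2012-03-02T11:30:00Z"),
    ("someone", "other", 12, "2012-03-02T12:00:00Z"),
]

conn = sqlite3.connect(":memory:")  # use a file path to persist the import
conn.execute(
    "CREATE TABLE watch_events ("
    "repository_owner TEXT, repository_name TEXT, "
    "repository_watchers INTEGER, created_at TEXT)"
)
conn.executemany("INSERT INTO watch_events VALUES (?, ?, ?, ?)", events)
# The index lets a per-repository lookup avoid scanning the whole table
conn.execute(
    "CREATE INDEX idx_repo ON watch_events "
    "(repository_owner, repository_name, created_at)"
)
conn.commit()

rows = conn.execute(
    "SELECT created_at, repository_watchers FROM watch_events "
    "WHERE repository_owner = ? AND repository_name = ? ORDER BY created_at",
    ("mojombo", "grit"),
).fetchall()
print(rows)
conn.close()
```

The one-off import is the slow part; after that, each repository query only touches the indexed rows instead of every JSON file.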
I have written a script which works, but I'm guessing isn't the most efficient. What I need to do is the following:
Compare two csv files that contain user information. It's essentially a member list where one file is a more updated version of the other.
The files contain data such as ID, name, status, etc, etc
Write to a third csv file ONLY the records in the new file that either don't exist in the older file, or contain updated information. For each record, there is a unique ID that allows me to determine if a record is new or previously existed.
Here is the code I have written so far:
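import csv

fileAin = open('old.csv','rb')
fOld = csv.reader(fileAin)
fileBin = open('new.csv','rb')
fNew = csv.reader(fileBin)
fileCout = open('NewAndUpdated.csv','wb')
fNewUpdate = csv.writer(fileCout)

old = []
new = []

# Read both files fully into lists (the loop bodies are missing
# from the post and assumed here)
for row in fOld:
    old.append(row)
for row in fNew:
    new.append(row)

output = []
x = len(new)
i = 0
num = 0

while i < x:
    if new[num] not in old:
        fNewUpdate.writerow(new[num]) # assumed; missing from the post
    num += 1
    i += 1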
In terms of functionality, this script works. However I'm trying to run this on files that contain hundreds of thousands of records and it's taking hours to complete. I am guessing the problem lies with reading both files to lists and treating the entire row of data as a single string for comparison.
My question is: for what I am trying to do, is there a faster, more efficient way to process the two files to create the third file containing only new and updated records? I don't really have a target time; I mostly want to understand whether there are better ways in Python to process these files.
Thanks in advance for any help.
UPDATE to include sample row of data:
123456789,34,DOE,JOHN,1764756,1234 MAIN ST.,CITY,STATE,305,1,A
How about something like this? One of the biggest inefficiencies of your code is checking whether new[num] is in old every time because old is a list so you have to iterate through the entire list. Using a dictionary is much much faster.
import csv

fileAin = open('old.csv','rb')
fOld = csv.reader(fileAin)
fileBin = open('new.csv','rb')
fNew = csv.reader(fileBin)
fileCout = open('NewAndUpdated.csv','wb')
fNewUpdate = csv.writer(fileCout)

# Key each row by its unique ID (first column)
old = {row[0]:row[1:] for row in fOld}
new = {row[0]:row[1:] for row in fNew}
output = {}

for row_id in new:
    if row_id not in old or not old[row_id] == new[row_id]:
        output[row_id] = new[row_id]

for row_id in output:
    fNewUpdate.writerow([row_id] + output[row_id])
difflib is quite efficient: http://docs.python.org/library/difflib.html
Sort the data by your unique field(s), and then use a comparison process analogous to the merge step of merge sort:
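A minimal sketch of that merge-style comparison, assuming both row lists are already sorted by the ID in the first column and that IDs are unique within each file. The function name `new_and_updated` and the sample rows are hypothetical.

```python
def new_and_updated(old_rows, new_rows, key=lambda row: row[0]):
    """Yield rows of new_rows that are absent from, or differ from, old_rows.

    Both inputs must already be sorted by key, and keys must be unique.
    """
    old_iter = iter(old_rows)
    old = next(old_iter, None)
    for new in new_rows:
        # Advance the old side until it catches up with the current new key
        while old is not None and key(old) < key(new):
            old = next(old_iter, None)
        # New key, or same key with different contents
        if old is None or key(old) != key(new) or old != new:
            yield new

old_rows = [["1", "DOE", "A"], ["2", "ROE", "A"], ["4", "POE", "A"]]
new_rows = [["1", "DOE", "A"], ["2", "ROE", "B"], ["3", "LEE", "A"], ["4", "POE", "A"]]
changed = list(new_and_updated(old_rows, new_rows))
print(changed)
```

Each file is walked exactly once, so after sorting the comparison is linear in the number of rows. Note that the IDs here are compared as strings, so the files have to be sorted with the same (string) ordering; pass a different key function to compare them as integers.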