"Resources exceeded" error in BigQuery - Python

SELECT A, B, C, D, E, F,
       EXTRACT(MONTH FROM PARSE_DATE('%b', Month)) AS MonthNumber,
       PARSE_DATETIME(' %Y%b%d ', CONCAT(CAST(Year AS STRING), Month, '1')) AS G
FROM `XXX.YYY.ZZZ`
WHERE A != 'null' AND B = 'MYSTRING'
ORDER BY A, Year
The query processes about 20 GB per run.
My table ZZZ has 396,567,431 (396 million) rows with a size of 53 GB. If I execute the above query without a LIMIT clause, I get an error saying "Resources exceeded".
If I execute it with a LIMIT clause, it gives the same error for larger limits.
I am writing a Python script using the API that runs the query above, computes some metrics, and then writes the output to another table. It writes some 1.7 million output rows; basically it aggregates the first table based on column A, i.e. the original table has multiple rows per value of column A.
Now I know we can set "Allow large results" to on and select an output table to get around this error, but for the purposes of my script that doesn't serve the purpose.
Also, I read that the ORDER BY is the expensive part causing this, but below is my algorithm and I don't see a way around the ORDER BY.
Also, my script pages through the query results 100,000 rows at a time.
log = []            # buffer of rows sharing the current value of column A
allresults = []     # one Compute() result per group
page_token = None

while True:
    rows, total_rows, page_token = query_job.results.fetch_data(
        max_results=100000, page_token=page_token)
    for row in rows:
        try:
            lastAValue = log[-1][0]
        except IndexError:
            lastAValue = None
        if lastAValue is None or row[0] == lastAValue:
            # Still inside the same group of column A: keep buffering.
            log.append(row)
        else:
            # Column A changed: process the buffered group, then start a new one.
            res = Compute(lastAValue, EntityType, lastAValue)
            allresults.append(res)
            log = [row]
    if not page_token:
        break
I have a few questions. Consider this example data:
Column A | Column B ......
123 | NDG
123 | KOE
123 | TR
345 | POP
345 | KOP
345 | POL
The way my logic works is: I iterate through the rows and check whether column A is the same as the previous row's column A. If it is, I add that row to an array. The moment I encounter a different column A value, i.e. 345, I send the first group of column A rows for processing, compute, and add the result to my output array. Based on this approach, I have the following questions:
1) I am effectively querying only once, so I should be charged for only one query. Does BigQuery charge per total rows / number of pages? i.e. will the individual pages fetched by the code above count as separate queries and be charged separately?
2) Assume the page size in the above example were 5: the 345 entries would be spread across pages. In that case, would I lose the 6th entry (345 - POL) because it ends up on a different page? Is there a workaround for this?
3) Is there a direct way to avoid the whole "check whether successive rows differ in value" step, e.g. a direct GROUP BY that returns each group as an array? The above approach takes a couple of hours (estimated) to run if I add a limit of 1 million.
4) How can I get around the "Resources exceeded" error when specifying limits higher than 1 million?
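For question 2: since log is not reset between pages in my loop, a group that spans a page boundary should not be lost; what I think my loop does miss is the very last group, which is never sent to Compute(). A minimal sketch of a fix (same hypothetical Compute() call as in my snippet above):

# After the paging loop finishes, flush the final buffered group,
# which the loop body alone never processes.
if log:
    allresults.append(Compute(log[-1][0], EntityType, log[-1][0]))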

You are asking BigQuery to produce one huge sorted result, which BigQuery currently cannot efficiently parallelize, so you get the "Resources exceeded" error.
The efficient way to perform this kind of query is to let the computation happen in SQL inside BigQuery, rather than extracting a huge result and post-processing it in Python. Analytic functions are a common way to do what you described, provided the Compute() function can be expressed in SQL.
E.g. to find the value of B in the last row before A changes, you can use the LAST_VALUE function, something like:
SELECT LAST_VALUE(B) OVER (PARTITION BY A ORDER BY Year) FROM ...
If you could describe what Compute() does, we could try to fill in the details.
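If Compute() really has to stay in Python, a hedged alternative (a sketch, not part of the original answer) is to let BigQuery do the grouping and hand each group back as an array, so no global ORDER BY is needed. This assumes a reasonably recent google-cloud-bigquery client and keeps the placeholder table name from the question; the Compute() call is only illustrative, since its real signature isn't shown:

from google.cloud import bigquery

client = bigquery.Client()

# One output row per value of A, with that group's (B, Year) pairs collected
# into an array, so Python no longer has to detect group boundaries itself.
sql = """
SELECT A,
       ARRAY_AGG(STRUCT(B, Year) ORDER BY Year) AS grp
FROM `XXX.YYY.ZZZ`
WHERE A != 'null' AND B = 'MYSTRING'
GROUP BY A
"""

allresults = []
for row in client.query(sql).result(page_size=100000):
    # row.grp is a list of dict-like structs belonging to this value of A.
    allresults.append(Compute(row.A, row.grp))   # illustrative call only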

Related

adding values from two different rows into one using pyspark

I have two rows with the exact same data except for a couple of columns that differ between them:
id  product  class   cost
1   table    large   5.12
1   table    medium  2.20
so I'm trying to get the following:
id  product  class          cost
1   table    large, medium  7.32
I'm currently using the following code to get this:
df.groupBy("id", "product").agg(
    F.collect_list("class"),
    F.sum("cost").alias("Sum")
)
The issue with this snippet is that the grouping keeps only the first value it finds in class, and the addition doesn't seem to be correct (I'm not sure if it takes the first value and adds it as many times as it encounters class for that same id across the rows), so I'm getting something like this:
id  product  class         cost
1   table    large, large  10.24
This is another snippet I used so I could keep all my other fields while performing the addition on those two columns:
df.withColumn("total", F.sum("cost").over(Window.partitionBy("id")))
Will it work the same way if I apply the F.array_join() function here?
You need to use the array_join function to join the results of collect_list with commas (,).
df = df.groupBy('id', 'product').agg(
    F.array_join(F.collect_list('class'), ',').alias('class'),
    F.sum('cost').alias('cost')
)
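As for the window-based variant in the question: collect_list can also be used over a window, so a hedged sketch that keeps every original row while adding the joined classes and the summed cost (assuming the same df with columns id, product, class, cost) would be:

from pyspark.sql import functions as F, Window

w = Window.partitionBy("id", "product")

# Keeps all original rows and columns, adding the aggregates as extra columns
# instead of collapsing the rows the way groupBy().agg() does.
df_with_totals = (
    df.withColumn("classes", F.array_join(F.collect_list("class").over(w), ", "))
      .withColumn("total", F.sum("cost").over(w))
)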

Efficient way to replace a large number of entries in a dataframe

I'm creating an automation program for work that automatically takes care of generating our end-of-the-month reports. The challenge I've run into is finding an efficient way to make a large number of replacements without a for loop and a bunch of if statements.
I have a file that's about 113 entries long, giving me instructions on which entries need to be replaced with another entry:
Uom  Actual UOM
0    ML
3    ML
4    UN
7    ML
11   ML
12   ML
19   ML
55   ML
4U   GR
There are a large number of duplicates where I change the values to the same thing (3, 7, 11, etc. change to ML), but it still seems like I'd have to loop through a decent number of if statements for every cell. I'd probably use a switch statement for this in another language, but Python doesn't seem to have one.
Pseudocode for what I'm thinking:
for each in dataframe
if (3,7,11, etc...)
change cell to ML
if (4)
change cell to UN
if (4U)
change cell to GR
etc.
Is there a more efficient way to do this or am I on the right track?
I would create a dictionary from your mapping_df (I assume the dataframe you posted is called mapping_df), and then map the result in your main dataframe.
This way you won't need to declare anything manually, so even if new rows are added to the 113-row mapping_df, the code will still work smoothly:
# Create a dictionary with your Uom as key
d = dict(zip(mapping_df.Uom, mapping_df['Actual UOM']))

# And then use map on your main_df Uom column
main_df['Actual Uom'] = main_df['Uom'].map(d)
Something like the above should work.
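Note that map() returns NaN for any Uom that has no entry in the dictionary; if you'd rather keep the original value in that case, a small sketch (same main_df / mapping_df names as above) is:

# Fall back to the original Uom wherever the mapping has no entry,
# instead of leaving NaN in the new column.
main_df['Actual Uom'] = main_df['Uom'].map(d).fillna(main_df['Uom'])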
Pandas might throw warning/error messages such as "Truth value of a Series is ambiguous..." if the individual conditions are not wrapped in parentheses.
I'm not sure I understand what you're trying to achieve, but to get you started, if you wanted to modify the "Uom" column you would do:
mask = (df["Uom"] == 3) | (df["Uom"] == 7) | (df["Uom"] == 11)
df.loc[mask, "Uom"] = "ML"
df.loc[df["Uom"] == 4, "Uom"] = "UN"
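An arguably tidier sketch of the same idea uses Series.isin and Series.replace (the codes below are taken from the mapping table in the question, and assume they are stored with the same types shown there):

# Map several codes to the same unit in one step; values not listed are untouched.
df.loc[df["Uom"].isin([0, 3, 7, 11, 12, 19, 55]), "Uom"] = "ML"
df["Uom"] = df["Uom"].replace({4: "UN", "4U": "GR"})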

sqlite, filter rows with dynamic number of keys, but only if they have the same value in a specific column?

I am brand new to sqlite (and databases in general). I have done a ton of reading both here and elsewhere and am unable to find this specific problem. People tend to want counts, or duplicates. I need to filter.
I have a database with 3 columns (and a few hundred thousand entries)
column1 column2 column3
abc 123 ##$
egf 456 $%#
abc 321 !##
kop 123 &$%
pok 321 ^$#
and so on.
What I am trying to do is this. I need to retrieve all possible combinations of a list. For example
[123, 321]
all possible combos would be
[123],[321],[123,321]
I do not know in advance what the input will be; it can be more than 2 strings, so the combinations list can grow pretty fast. For single entries like 123 or 321 above it works out of the gate; the thing I am trying to get to work is more than one value in the list.
So I am dynamically generating the select statement
sqlquery = "SELECT fileloc, frequency FROM words WHERE word=?"
while numOfVariables < len(list):
    sqlquery += " or word=?"
    numOfVariables += 1
This generates the query, then I execute it with
cursor.execute(sqlquery,tuple(list))
Which works. It finds me all rows with any of those combinations.
Now I need one more thing: I need it to ONLY select them if their column1 is the same (I do not know what this value may be).
So in the above example it would select rows 1 and 3, because their column2 has the values I am interested in and their column1 is the same. But row 4 would not be selected, even though it has a value we want, because its column1 does not match 321's column1. Same for row 5: even though it has one of the values we need, its column1 doesn't match 123's.
From what I've been able to find, people compare against a specific value by using GROUP BY, but in my case I do not know what that value may be. All I care about is whether it is the same between the rows or not.
I am sorry if my explanation is not clear. I have never used SQL before this week, so I don't know all the technical terms.
But basically I need the functionality of (pseudo code):
if (column2 is 123 or 321) and 123.column1 == 321.column1:
    count
else:
    dont count
I have a feeling this can be done by first moving whatever matches 123 or 321 into a new table, then going through that table and only keeping records that have both 123 and 321 with the same column1 value. But I am not sure how to do this, or whether it is the proper approach, because this thing is going to scale pretty quickly: if there are 5 inputs, then rows are kept only if there is one row to account for each input and all of their column1 values are the same (so rows would be saved in sets of 5).
Thank you.
(I am using Python 2.7.15)
You wrote:
"I need to retrieve all possible combinations of a list"
"Now I need one more thing, I need it to ONLY select them if their column1 is the same (I do not know what this value may be).
Use self-join for this purpose:
SELECT W1.column2, W2.column2
FROM words W1
JOIN words W2 ON W1.column1 = W2.column1
Correct me if I missed something in your question, but these 3 lines should be sufficient.
Python looks irrelevant to your question; it can be solved in pure SQL.
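If you do want to drive it from Python, a minimal sqlite3 sketch (the table and column names follow the simplified example, so adjust them to your real schema) that keeps only the column1 groups containing every requested column2 value:

import sqlite3

def rows_with_all_values(conn, values):
    # Keep only rows whose column1 group contains every requested column2 value.
    placeholders = ",".join("?" * len(values))
    sql = (
        "SELECT t.column1, t.column2, t.column3 "
        "FROM mytable t "
        "JOIN (SELECT column1 FROM mytable "
        "      WHERE column2 IN ({ph}) "
        "      GROUP BY column1 "
        "      HAVING COUNT(DISTINCT column2) = ?) g "
        "  ON g.column1 = t.column1 "
        "WHERE t.column2 IN ({ph})"
    ).format(ph=placeholders)
    params = list(values) + [len(values)] + list(values)
    return conn.execute(sql, params).fetchall()

conn = sqlite3.connect("words.db")                 # hypothetical database file
print(rows_with_all_values(conn, ["123", "321"]))  # rows 1 and 3 from the example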

Merging large data in Python in local machine

I have 140 csv files. Each file has 3 variables and is about 750 GB. The number of observations varies from 60 to 90 million.
I also have another small file, treatment_data, with 138,000 rows (one for each unique ID) and 21 columns (1 column for ID and 20 columns of 1s and 0s indicating whether the ID was given a particular treatment or not).
The variables are,
ID_FROM: A Numeric ID
ID_TO: A Numeric ID
DISTANCE: A numeric variable of physical distance between ID_FROM and ID_TO
(So in total I have 138,000 * 138,000 (= 19+ billion) rows, one for every possible bilateral combination of IDs, divided across these 140 files.)
Research Question: Given a distance, how many neighbors (of each treatment type) does an ID have?
So I need help with a system (preferably in Pandas) where:
1) the researcher inputs a distance,
2) the program looks over all the files, filters out the rows where the DISTANCE between ID_FROM and ID_TO is less than the given distance, and outputs a single dataframe (DISTANCE can be dropped at this point),
3) merges that dataframe with treatment_data by matching ID_TO with ID (ID_TO can be dropped at this point),
4) collapses the data by ID_FROM (group by and sum the 1s across the 20 treatment variables).
(In the final output dataset, I will have 138,000 rows and 21 columns: 1 column for ID and 20 columns for the different treatment types. So, for example, I will be able to answer the question, "Within 2000 meters, how many neighbors of ID 500 are in the 'treatment_media' category?")
IMPORTANT SIDE NOTE:
The DISTANCE variable ranges from 0 to roughly the radius of an average-sized US state (in meters). The researcher is mostly interested in what happens within 5000 meters, which usually drops 98% of observations, but sometimes he/she will check longer distances too, so I have to keep all the observations available. Otherwise, I could simply have filtered out DISTANCE values greater than 5000 from the raw input files and made my life easier. The reason I think this is important is that the data are sorted by ID_FROM across the 140 files. If I could somehow rearrange these 19+ billion rows based on DISTANCE and associate them with some kind of dictionary/index system, then the program would not need to go over all 140 files; most of the time the researcher will be looking at only the bottom 2 percent of the DISTANCE range. It seems like a colossal waste of time to loop over 140 files, but this is a secondary thought. Please do provide an answer even if you can't use this additional side note.
I tried looping over the 140 files for a particular distance in Stata, and it takes 11+ hours to complete the task, which is not acceptable because the researcher will want to vary the distance within the 0 to 5000 range. Most of the computation time is wasted on reading each dataset into memory (that is how Stata does it). That is why I am seeking help in Python.
Is there a particular reason that you need to do the whole thing in Python?
This seems like something that a SQL database would be very good at. I think a basic outline like the following could work:
TABLE Distances {
    Integer PrimaryKey,
    String IdFrom,
    String IdTo,
    Integer Distance
}
INDEX ON Distances(IdFrom, Distance);

TABLE TreatmentData {
    Integer PrimaryKey,
    String Id,
    String TreatmentType
}
INDEX ON TreatmentData(Id, TreatmentType);
-- How many neighbors of ID 500 are within 2000 meters and have gotten
-- the "treatment_media" treatment?
SELECT
    d.IdFrom AS Id,
    td.TreatmentType,
    COUNT(*) AS Total
FROM Distances d
JOIN TreatmentData td ON d.IdTo = td.Id
WHERE d.IdFrom = "500"
  AND d.Distance <= 2000
  AND td.TreatmentType = "treatment_media"
GROUP BY 1, 2;
There's probably some other combination of indexes that would give better performance, but this seems like it would at least answer your example question.
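If you'd rather stay in Python, here is a hedged sketch of the same idea using the standard library's sqlite3 plus pandas to bulk-load the 140 files once; the file locations and the treatment column name are assumptions based on the question:

import glob
import sqlite3
import pandas as pd

conn = sqlite3.connect("distances.db")   # hypothetical database file

# One-time load: stream each CSV into the database in chunks so that nothing
# close to 19 billion rows ever has to sit in memory at once.
for path in glob.glob("distance_files/*.csv"):           # hypothetical location
    for chunk in pd.read_csv(path, chunksize=1000000):
        chunk.to_sql("distances", conn, if_exists="append", index=False)

treatment = pd.read_csv("treatment_data.csv")            # the small 138k-row file
treatment.to_sql("treatment_data", conn, if_exists="replace", index=False)

# Index so per-distance lookups do not scan all 19+ billion rows.
conn.execute("CREATE INDEX IF NOT EXISTS idx_dist ON distances (DISTANCE)")

# Example in the spirit of the SQL above: per-ID neighbor counts for one
# treatment column within a given distance.
query = """
    SELECT d.ID_FROM, SUM(t.treatment_media) AS n_treatment_media
    FROM distances d
    JOIN treatment_data t ON t.ID = d.ID_TO
    WHERE d.DISTANCE <= ?
    GROUP BY d.ID_FROM
"""
result = pd.read_sql_query(query, conn, params=(2000,))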

Optimizing pandas computation

I have 22 million rows of house property sale data in a database table called sale_transactions. I am performing a job where I read information from this table, perform some calculations, and use the results to create entries to a new table. The process looks like this:
for index, row in zipcodes.iterrows(): # ~100k zipcodes
    sql_string = """SELECT * from sale_transactions WHERE zipcode = '{ZIPCODE}' """
    sql_query = sql_string.format(ZIPCODE=row['zipcode'])
    df = pd.read_sql(sql_query, _engine)
    area_stat = create_area_stats(df) # function does calculations
    area_stat.save() # saves a Django model
At the moment each iteration of this loop takes about 20 seconds on my macbook pro (16GB RAM), which means that the code is going to take weeks to finish. The expensive part is the read_sql line.
How can I optimize this? I can't read the whole sale_transactions table into memory, it is about 5 GB, hence using the sql query each time to capture the relevant rows with the WHERE clause.
Most answers about optimizing pandas talk about reading with chunking, but in this case I need to perform the WHERE on all the data combined, since I am performing calculations in the create_area_stats function like number of sales over a ten year period. I don't have easy access to a machine with loads of RAM, unless I start going to town with EC2, which I worry will be expensive and quite a lot of hassle.
Suggestions would be greatly appreciated.
I also faced a similar problem, and the code below helped me read a database (~40 million rows) effectively.
offsetID = 0
totalrow = 0
while True:
    df_Batch = pd.read_sql_query(
        'set work_mem="1024MB"; SELECT * FROM ' + tableName +
        ' WHERE row_number > ' + str(offsetID) +
        ' ORDER BY row_number LIMIT 100000', con=engine)
    if df_Batch.empty:
        break   # stop once the table has been read completely
    offsetID = offsetID + len(df_Batch)
    # your operation
    totalrow = totalrow + len(df_Batch)
You have to create an index called row_number in your table. This code will then read your table 100,000 rows at a time, by index. For example, when you want to read rows 200,000 - 210,000, you don't need to read from 0 to 210,000; it reads directly via the index, which improves performance.
Since the bottleneck in the operation was the SQL WHERE query, the solution was to index the column upon which the WHERE statement was operating (i.e. the zipcode column).
In MySQL, the command for this was:
ALTER TABLE `db_name`.`table`
ADD INDEX `zipcode_index` USING BTREE (`zipcode` ASC);
After making this change, the loop execution speed increased by 8 fold.
I found this article useful because it encouraged profiling queries using EXPLAIN and looking for column-indexing opportunities when the key and possible_keys values were NULL.
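As a hedged illustration of that EXPLAIN workflow (using SQLAlchemy and the same _engine object as in the question; the zipcode value is just an example):

from sqlalchemy import text

# Inspect the query plan from Python: if the `key` / `possible_keys` columns
# come back NULL, the WHERE clause is scanning the whole table and the
# zipcode index above is worth adding.
with _engine.connect() as conn:
    plan = conn.execute(
        text("EXPLAIN SELECT * FROM sale_transactions WHERE zipcode = :zip"),
        {"zip": "90210"},   # hypothetical zipcode value
    ).fetchall()
    for line in plan:
        print(line)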
