Efficient way to replace a large number of entries in a dataframe - python

I'm creating an automation program for work that automatically takes care of generating our end-of-the-month reports. The challenge I've run into is thinking of an efficient way to make a large number of replacements without a for loop and a bunch of if statements.
I have a file that's about 113 entries long, giving me instructions on which entries need to be replaced with another entry:
Uom   Actual UOM
0     ML
3     ML
4     UN
7     ML
11    ML
12    ML
19    ML
55    ML
4U    GR
There are a large number of duplicates where I change the values to the same thing (3, 7, 11, etc. change to ML), but it still seems like I'd have to loop through a decent number of if statements for every cell. I'd probably use a switch statement for this in another language, but Python doesn't seem to have them.
Pseudocode for what I'm thinking:
for each cell in dataframe:
    if cell in (3, 7, 11, etc...):
        change cell to ML
    elif cell == 4:
        change cell to UN
    elif cell == '4U':
        change cell to GR
    etc.
Is there a more efficient way to do this or am I on the right track?

I would create a dictionary from your mapping_df (I assume the dataframe you posted is called mapping_df), and then map it onto your main dataframe.
This way you won't need to manually declare anything, so even if new rows are added to the 113-row mapping_df, the code will still work smoothly:
# Create a dictionary with your Uom as Key
d = dict(zip(mapping_df.Uom, mapping_df['Actual UOM']))
# And then use map on your main_df Uom column
main_df['Actual Uom'] = main_df['Uom'].map(d)
Something like the above should work.
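As a minimal sketch with made-up values (column names taken from the tables above), note that map() leaves NaN for any Uom missing from the mapping, which fillna can turn back into the original value:
import pandas as pd

mapping_df = pd.DataFrame({'Uom': ['3', '7', '4', '4U'],
                           'Actual UOM': ['ML', 'ML', 'UN', 'GR']})
main_df = pd.DataFrame({'Uom': ['3', '4U', '7', '99']})

# Build the lookup once, then apply it to every cell with map().
d = dict(zip(mapping_df.Uom, mapping_df['Actual UOM']))

# map() leaves NaN for values missing from the mapping; fillna keeps the original value instead.
main_df['Actual Uom'] = main_df['Uom'].map(d).fillna(main_df['Uom'])
print(main_df)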

Pandas might throw warning/error messages such as "The truth value of a Series is ambiguous..." if comparisons are combined without parentheses.
I'm not sure I understand what you're trying to achieve, but to get you started, if you wanted to modify the "Uom" column, you would do:
mask = (df["Uom"] == 3) | (df["Uom"] == 7) | (df["Uom"] == 11)
df.loc[mask, "Uom"] = "ML"
df.loc[df["Uom"] == 4, "Uom"] = "UN"

Related

How to create a Dataframe from rows with conditions from another existing Dataframe using pandas?

So I have this problem: because of the size of the dataframe I am working on, I clearly cannot upload it, but it has the following structure:
    country   coastline  EU   highest
1   Norway    yes        yes  1500
2   Turkey    yes        no   20100
... ...       ...        ...  ...
41  Bolivia   no         no   999
42  Japan     yes        no   89
I have to solve several exercises with Pandas. Among them is, for example, showing the country with the "highest" maximum, minimum, and average, but only for the countries that do belong to the EU. I already solved the maximum and the minimum, but for the average I thought about creating a new dataframe, one created from only the rows that contain a "yes" in the EU column; I've tried a few things, but they haven't worked.
I thought this is the best way to solve it, but if anyone has another idea, I'm looking forward to reading it.
By the way, these are the examples that I said that I was able to solve:
print('Minimum outside the EU')
paises[(paises.EU == "no")].sort_values(by=['highest'], ascending=[True]).head(1)
Which gives me this:
   country  coastline  EU  highest
3  Belarus  no         no  345
As a last condition, this must be solved using pandas, since it is basically the chapter that we are working on in classes.
If you want to create a new dataframe that is based off of a filter on your first, you can do this:
new_df = df[df['EU'] == 'yes'].copy()
This will look at the 'EU' column in the original dataframe df and only return the rows where it is 'yes'. I think it is good practice to add the .copy(), since we can sometimes get strange side effects if we then make changes to new_df (probably wouldn't here).
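Applied to the exercise in the question (reusing the paises name and the columns as posted), the same filter answers the average-for-EU-countries part directly, for example:
# Keep only EU members, then take the mean of the "highest" column.
eu = paises[paises['EU'] == 'yes'].copy()
print('Average "highest" inside the EU')
print(eu['highest'].mean())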

How to create a pandas Series (column), based in a match with a value in another Dataframe?

My question is the following: I do not know all the pandas methods very well, and I think there is surely a more efficient way to do this. I have to load two tables from .csv files into a Postgres database; these tables are related to each other by an id, which serves as a foreign key and comes from the source data, but I must relate them with a different id controlled by my logic.
I explain graphically in the following image:
I'm trying to create a new Series based on the "another_id" that I have, and apply a function that loops through a dataframe's rows to check whether it has the other code and get its id:
def check_foreign_key(id, df_ppal):
    if id:
        for i in df_ppal.index:
            if id == df_ppal.iloc[i]['another_id']:
                return df_ppal.iloc[i]['id']

dfs['id_fk'] = dfs['another_id'].apply(lambda id: check_foreign_key(id, df_ppal))
At this point I think it is not efficient, because I have to loop through the whole column to match another_id and get the correct id I need (shown in yellow in the picture).
So I should think about search algorithms to make the task more efficient, but I wonder whether pandas has a method that lets me do this faster in case there are many records.
I need a dataframe like the table below, which has a new column "ID Principal" based on matching Another_code with a column of another dataframe.
ID  ID Principal  Another_code
1   12            54
2   12            54
3   13            55
4   14            56
5   14            56
6   14            56
Well indeed, I was not understanding all the pandas functions very well. I was able to solve my problem using merge; I did not know that pandas had a good implementation of the typical SQL join.
This documentation helped me a lot:
https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html#database-style-dataframe-or-named-series-joining-merging
Pandas Merging 101
Finally my answer:
new_df = principal.merge(secondary, on='another_id')
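For illustration (assuming the two dataframes are named principal and secondary, as above), a left merge keeps every principal row even when no another_id matches:
# Rows with no match keep NaN in the columns coming from secondary.
new_df = principal.merge(secondary, on='another_id', how='left')
print(new_df.head())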
I thank you all!

Modifying the date column calculation in pandas dataframe

I have a dataframe that looks like this
I need to adjust the time_in_weeks column for the entry with 34. When there is a duplicate uniqueid with a different rma_created_date, that means some failure occurred. The 34 needs to be changed to the number of weeks between the new, most recent rma_created_date (2020-10-15 in this case) and the rma_processed_date of the row above (2020-06-28).
I hope that makes sense in terms of what I am trying to do.
So far I did this
def clean_df(df):
    '''
    This function will fix the time_in_weeks column to calculate the correct number of weeks
    when there are multiple failures for an item.
    '''
    # Sort by rma_created_date
    df = df.sort_values(by=['rma_created_date'])
Now I need to perform what I described above, but I am a little confused about how to do this, especially considering we could have multiple failures and not just 2.
I should get something like this returned as output
As you can see, what happened to the 34 is that it got changed to the number of weeks between 2020-10-15 and 2020-06-26.
Here is another example with more rows
Using the expression suggested:
df['time_in_weeks'] = np.where(
    df.uniqueid.duplicated(keep='first'),
    df.rma_processed_date.dt.isocalendar().week.sub(df.rma_processed_date.dt.isocalendar().week.shift(1)),
    df.time_in_weeks
)
I get this
Final note: if there is a date of 1/1/1900 then don't perform any calculation.
The question is not very clear. Happy to correct if I interpreted it wrongly.
Try using np.where(condition, choice if condition, choice if not condition):
# Coerce dates into datetime
df['rma_processed_date'] = pd.to_datetime(df['rma_processed_date'])
df['rma_created_date'] = pd.to_datetime(df['rma_created_date'])
# Solution
df['time_in_weeks'] = np.where(df.uniqueid.duplicated(keep='first'), df.rma_created_date.sub(df.rma_processed_date), df.time_in_weeks)
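If the intent is the number of weeks between this row's rma_created_date and the previous row's rma_processed_date (my reading of the question; the 1/1/1900 special case is not handled here), a hedged sketch would shift the processed date before subtracting and convert the difference to weeks:
import numpy as np

# Sort so that repeated uniqueids are adjacent and in date order.
df = df.sort_values(['uniqueid', 'rma_created_date'])

# Whole weeks between this row's created date and the previous row's processed date.
delta_weeks = (df['rma_created_date'] - df['rma_processed_date'].shift(1)).dt.days // 7

# Only overwrite rows that repeat an earlier uniqueid; others keep their original value.
df['time_in_weeks'] = np.where(df['uniqueid'].duplicated(keep='first'), delta_weeks, df['time_in_weeks'])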

sqlite, filter rows with dynamic number of keys, but only if they have the same value in a specific column?

I am brand new to sqlite (and databases in general). I have done a ton of reading both here and elsewhere and am unable to find this specific problem. People tend to want counts, or duplicates. I need to filter.
I have a database with 3 columns (and a few hundred thousand entries)
column1 column2 column3
abc 123 ##$
egf 456 $%#
abc 321 !##
kop 123 &$%
pok 321 ^$#
and so on.
What I am trying to do is this. I need to retrieve all possible combinations of a list. For example
[123, 321]
all possible combos would be
[123],[321],[123,321]
I do not know what the input can possibly be; it can be more than 2 strings, so the combinations list can grow pretty fast. For single entries above, like 123 or 321, it works out of the gate; the thing I am trying to get to work is a list with more than one value.
So I am dynamically generating the select statement
sqlquery = "SELECT fileloc, frequency FROM words WHERE word=?"
while numOfVariables < len(list):
sqlquery += " or word=?"
numOfVariables += 1
This generates the query, then I execute it with
cursor.execute(sqlquery, tuple(list))
Which works. It finds me all rows with any of those combinations.
Now I need one more thing: I need it to ONLY select them if their column1 is the same (I do not know what this value may be).
So in the above example it would select rows 1 and 3, because their column2 has the values I am interested in and their column1 is the same. But row 4 would not be selected even though it has a value we want, because its column1 does not match 321's column1. Same thing for row 5: even though it has one of the values we need, its column1 doesn't match 123's.
From the things I've been able to find, people compare against a specific value by using GROUP BY. But in my case I do not know what that value may be. All I care about is whether it's the same between the rows or not.
I am sorry if my explanation is not clear. I have never used SQL before this week, so I don't know all the technical terms.
But basically I need the functionality of (pseudo code):
if (column2 is 123 or 321) and 123.column1 == 321.column1:
    count
else:
    don't count
I have a feeling this can be done by first moving whatever matches 123 or 321 into a new table, then going through that table and only keeping records that have both 123 and 321 with the same column1 value. But I am not sure how to do this, or whether it's the proper approach, because this thing is going to scale pretty quickly: if there are 5 inputs, rows are kept only if there is one row to account for each input and all of their column1 values are the same (so rows would be saved in sets of 5).
Thank you.
(I am using Python 2.7.15)
You wrote:
"I need to retrieve all possible combinations of a list"
"Now I need one more thing, I need it to ONLY select them if their column1 is the same (I do not know what this value may be).
Use self-join for this purpose:
SELECT W1.column2, W2.column2
FROM words W1
JOIN words W2 ON W1.column1 = W2.column1
Correct me if I missed something in your question, but these three lines should be sufficient.
Python looks irrelevant to your question; it could be solved in pure SQL.
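As a hedged sketch of how the self-join could be driven from Python for the two example values (table and column names taken from the question; the database path is assumed), each aliased copy of the table is pinned to one wanted value while the join condition forces both rows to share column1:
import sqlite3

values = ("123", "321")

# W1 is pinned to the first wanted value and W2 to the second; the join
# condition keeps only pairs of rows that share the same column1.
query = (
    "SELECT W1.column1, W1.column2, W2.column2 "
    "FROM words W1 "
    "JOIN words W2 ON W1.column1 = W2.column1 "
    "WHERE W1.column2 = ? AND W2.column2 = ?"
)

conn = sqlite3.connect("words.db")  # database path assumed
cursor = conn.cursor()
cursor.execute(query, values)
print(cursor.fetchall())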

Resources exceed limits big query

SELECT A, B, C, D, E, F,
       EXTRACT(MONTH FROM PARSE_DATE('%b', Month)) AS MonthNumber,
       PARSE_DATETIME(' %Y%b%d ', CONCAT(CAST(Year AS STRING), Month, '1')) AS G
FROM `XXX.YYY.ZZZ`
WHERE A != 'null' AND B = 'MYSTRING'
ORDER BY A, Year
The query processes about 20 GB per run.
My table ZZZ has 396,567,431 (396 million) rows with a size of 53 GB. If I execute the above query without a LIMIT clause, I get an error saying "Resources exceeded".
If I execute it with a LIMIT clause, it gives the same error for larger limits.
I am writing a Python script using the API that runs the query above, then computes some metrics and writes the output to another table. It writes some 1.7 million output rows, so it basically aggregates the first table based on column A, i.e. the original table has multiple rows per value of column A.
Now I know we can set "Allow large results" to on and select an output table to get around this error, but for the purposes of my script it doesn't serve the purpose.
Also, I read that ORDER BY is the expensive part causing this, but below is my algorithm and I don't see a way around ORDER BY.
Also, my script pages the query results 100,000 rows at a time.
log = []
while True:
    rows, total_rows, page_token = query_job.results.fetch_data(max_results=100000, page_token=page_token)
    for row in rows:
        try:
            lastAValue = log[-1][0]
        except IndexError:
            lastAValue = None
        if lastAValue is None or row[0] == lastAValue:
            log.append(row)
        else:
            res = Compute(lastAValue, EntityType, lastAValue)
            allresults.append(res)
            log = []
            log.append(row)
    if not page_token:
        break
I have a few questions.
Column A | Column B ......
123 | NDG
123 | KOE
123 | TR
345 | POP
345 | KOP
345 | POL
The way I kept my logic is: I iterate through the rows and check whether column A is the same as the last row's column A. If it is, I add that row to an array. The moment I encounter a different column A, i.e. 345, I send the first group of column A rows for processing, compute, and add the result to my array. Based on this approach I had some questions:
1) I am effectively querying only once, so I should be charged for only 1 query. Does BigQuery charge per total rows / number of pages? i.e. will the individual pages fetched by the above code be treated as separate queries and charged separately?
2) Assume the page size in the above example were 5; the 345 entries would then be spread across pages. In that case, will I lose information about the 6th entry (345 - POL) because it falls on a different page? Is there a workaround for this?
3) Is there a direct way to get around checking whether successive rows differ in value, like a direct "group by and get the groups as arrays" mechanism? The above approach takes a couple of hours (estimated) to run if I add a limit of 1 million.
4) How can I get around this "Resources exceeded" error when specifying limits higher than 1 million?
You are asking BigQuery to produce one huge sorted result, which BigQuery currently cannot efficiently parallelize, so you get the "Resources exceeded" error.
The efficient way to perform this kind of query is to let your computation happen in SQL inside BigQuery, rather than extracting a huge result from it and doing post-processing in Python. Analytic functions are a common way to do what you described, if the Compute() function can be expressed in SQL.
E.g. for finding the value of B in the last row before A changes, you can find this row using the LAST_VALUE function, something like:
select LAST_VALUE(B) OVER(PARTITION BY A ORDER BY Year) from ...
If you could describe what Compute() does, we could try to fill details.
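As an illustrative sketch only (Compute() is unknown, so this just shows the analytic-function pattern; column names are taken from the question, and the google-cloud-bigquery client is assumed), note that LAST_VALUE needs an explicit window frame to see the whole partition:
from google.cloud import bigquery

client = bigquery.Client()  # project/credentials assumed to be configured

sql = """
SELECT DISTINCT A,
       LAST_VALUE(B) OVER (
         PARTITION BY A
         ORDER BY Year
         ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
       ) AS last_B
FROM `XXX.YYY.ZZZ`
WHERE A != 'null'
"""

# One row per A, computed entirely inside BigQuery rather than in the Python paging loop.
for row in client.query(sql).result():
    print(row.A, row.last_B)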
