Detecting bad information (python/pandas)

I am new to Python and pandas, and I was wondering whether pandas can filter out inconsistent information within a dataframe. For example, imagine a dataframe with two columns: (1) product code and (2) unit of measurement. The same product code may repeat several times in column 1, and there are several different product codes. I would like to filter out the product codes for which there is more than one unit of measurement. Ideally, when this happens, the filter would return all instances of such a product code, not just the instances where the unit of measurement differs. To add more color to my request: the real objective is to identify the product codes with inconsistent units of measurement, since the same product code should always have the same unit of measurement in every instance.
Thanks in advance!!

First you want some mapping of product code -> unit of measurement, i.e. the ground truth. You can either supply this yourself, or try to be clever and derive it from the data, assuming that the most frequently used unit of measurement for a product code is the correct one. You could get this by doing
truth_mapping = df.groupby(['product_code'])['unit_of_measurement'].agg(lambda x:x.value_counts().index[0]).to_dict()
Then you can get a column that is the 'correct' unit of measurement
df['correct_unit'] = df['product_code'].apply(truth_mapping.get)
Then you can filter to rows that do not have the correct mapping:
df[df['correct_unit'] != df['unit_of_measurement']]
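If the goal is to pull back every row of an offending product code (not just the rows that disagree with the majority), a small variant is to count the distinct units per code and keep the codes with more than one. A minimal sketch, assuming the same column names as above:
import pandas as pd

# number of distinct units recorded for each row's product code
n_units = df.groupby('product_code')['unit_of_measurement'].transform('nunique')

# every row belonging to a product code that appears with more than one unit
inconsistent_rows = df[n_units > 1]
print(inconsistent_rows)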

Try this:
Sample df:
df12 = pd.DataFrame({'Product Code': ['A', 'A', 'A', 'A', 'B', 'B', 'C', 'C', 'D', 'E'],
                     'Unit of Measurement': ['x', 'x', 'y', 'z', 'w', 'w', 'q', 'r', 'a', 'c']})
Group by and see the count of all non-unique pairs:
new = df12.groupby(['Product Code','Unit of Measurement']).size().reset_index().rename(columns={0:'count'})
Drop all rows where the Product Code is repeated (this leaves only the codes that use a single unit of measurement):
new.drop_duplicates(subset=['Product Code'], keep=False)
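Conversely, to get the inconsistent codes and then every matching row from the original frame, a hedged follow-up sketch using the same names as above:
# product codes that appear with more than one unit in the grouped frame
bad_codes = new.loc[new.duplicated(subset=['Product Code'], keep=False), 'Product Code'].unique()

# all original rows for those product codes
df12[df12['Product Code'].isin(bad_codes)]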

Related

Group by two columns in pandas

I have the following script running in my Power BI
dataset['Percent_Change'] = dataset.groupby('Contract_Year')['Norm_Price'].pct_change().fillna(0)
dataset['Norm_Change'] = dataset['Percent_Change'].add(1).groupby(dataset['Contract_Year']).cumprod()
Initially I only had one location, so I just needed the calculation to run by contract year.
Now I have several locations in the same dataset. How do I make the calculation perform the same way, but for each location and then by contract year?
Field name: [location]
In order to group by several columns, you can just use a list of columns instead of a string.
Something like:
dataset.groupby(['Contract_Year', 'location'])
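Applied to the original script, a minimal sketch (the location column name is taken from the question and may differ in your dataset):
group_cols = ['Contract_Year', 'location']

dataset['Percent_Change'] = (
    dataset.groupby(group_cols)['Norm_Price'].pct_change().fillna(0)
)
dataset['Norm_Change'] = (
    dataset['Percent_Change'].add(1)
    .groupby([dataset[c] for c in group_cols])
    .cumprod()
)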

How to make processing of a Pandas groupby object more efficient?

"""
I have a data frame of millions of rows that I did .groupby() on.
I'd like to retrieve the rows containing the nlargest value for each id and tissue combination.
Also, I need to generate another df containing the mean value for each id and tissue combination.
Although I'm using a powerful Linux server, the process has been running for more than 24 hours. Therefore, I'm looking for a more efficient strategy. I spent hours on Stack Overflow but failed to apply the solutions to my particular problem.
Thank you in advance for helping me out.
"""
df = pd.DataFrame({'id': ['g1', 'g1', 'g1', 'g1', 'g2', 'g2', 'g2', 'g2', 'g2', 'g2'],
                   'Trans': ['g1.1', 'g1.2', 'g1.3', 'g1.4', 'g2.1', 'g2.2', 'g2.3', 'g2.2', 'g2.1', 'g2.1'],
                   'Tissue': ['Lf', 'Lf', 'Lf', 'pc', 'Pol', 'Pol', 'Pol', 'Ant', 'Ant', 'm2'],
                   'val': [0.0948, 1.5749, 1.8904, 0.8673, 2.1089, 2.5058, 4.5722, 0.7626, 3.1381, 2.723]})
print(df)
df_highest = pd.DataFrame(columns=df.columns)  # brand new df that will contain the rows of interest
for grpID, data in df.groupby(['id', 'Tissue']):
    highest = data.nlargest(1, 'val')
    df_highest = df_highest.append(highest)  # append returns a new frame, so it has to be reassigned
df_highest.to_csv('out.txt', sep='\t', index=False)
If you are trying to get the largest value for each id and tissue combination, try this code.
df_highest = df.loc[df.groupby(['id','Tissue'])['val'].idxmax()]
This will give you the mean for each id and Tissue combination (np.mean assumes numpy is imported as np; the string 'mean' works as well).
df_mean = df.groupby(['id','Tissue']).agg({'val':np.mean})
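For reference, both results can share one groupby object, which avoids recomputing the groups. A sketch using the sample frame above:
grouped = df.groupby(['id', 'Tissue'])['val']

# row holding the largest val for each (id, Tissue) pair
df_highest = df.loc[grouped.idxmax()]

# mean val for each (id, Tissue) pair
df_mean = grouped.mean().reset_index(name='val_mean')

df_highest.to_csv('out.txt', sep='\t', index=False)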

I have a large set of data; I want to perform the following action, but it takes too much time. How can I optimize it?

I am working on a dataset of around 400,000 rows that I need to clean a bit first.
Two actions to perform:
- Resale Invoice Month values are strings such as M201705; I want to make a column named Year containing only the year, 2017 in that case.
- Some commercial product codes, which are also strings, end with TR; I want to delete the TR from those products. For example, M23065TR should become M23065. The column also contains product codes that are already fine, such as M340767.
Here is my code just below; it needs more than 2 hours to run. Would you have a solution to simplify it so it takes less time?
Thank you very much
for i in range(Ndata.shape[0]):
    Ndata.loc[i, 'Year'] = Ndata.loc[i, 'Resale Invoice Month'][1:5]
    if Ndata['Commercial Product Code'][i][-2:] == 'TR':
        Ndata.loc[i, 'Commercial Product Code'] = Ndata.loc[i, 'Commercial Product Code'][:-2]
When using pandas, always try to vectorize instead of looping.
You can do something like:
# for Year
Ndata['Year'] = Ndata['Resale Invoice Month'].str[1:5]
# remove the trailing TR, only for the rows that have it
idx = Ndata['Commercial Product Code'].str[-2:] == 'TR'
Ndata.loc[idx, 'Commercial Product Code'] = Ndata.loc[idx, 'Commercial Product Code'].str[:-2]
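A quick self-contained check of the vectorised version on toy data (column names taken from the question, values invented):
import pandas as pd

Ndata = pd.DataFrame({
    'Resale Invoice Month': ['M201705', 'M201812'],
    'Commercial Product Code': ['M23065TR', 'M340767'],
})

Ndata['Year'] = Ndata['Resale Invoice Month'].str[1:5]
has_tr = Ndata['Commercial Product Code'].str.endswith('TR')
Ndata.loc[has_tr, 'Commercial Product Code'] = (
    Ndata.loc[has_tr, 'Commercial Product Code'].str[:-2]
)
print(Ndata)
# expected: Year becomes 2017 / 2018, codes become M23065 / M340767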

How to do a nested for-each loop with PySpark

Imagine a large dataset (>40GB parquet file) containing value observations of thousands of variables as triples (variable, timestamp, value).
Now think of a query in which you are only interested in a subset of 500 variables, and you want to retrieve the observations (values --> time series) for those variables for specific points in time (an observation window or timeframe), each having a start and end time.
Without distributed computing (Spark), you could code it like this:
for var_ in variables_of_interest:
    for incident in incidents:
        var_df = df_all.filter(
            (df.Variable == var_)
            & (df.Time > incident.startTime)
            & (df.Time < incident.endTime))
My question is: how to do that with Spark/PySpark? I was thinking of either:
- joining the incidents somehow with the variables and filtering the dataframe afterward,
- broadcasting the incident dataframe and using it within a map function when filtering the variable observations (df_all), or
- using RDD.cartesian or RDD.mapPartitions somehow (remark: the parquet file was saved partitioned by variable).
The expected output should be:
incident1 --> dataframe 1
incident2 --> dataframe 2
...
Where dataframe 1 contains all variables and their observed values within the timeframe of incident 1 and dataframe 2 those values within the timeframe of incident 2.
I hope you got the idea.
UPDATE
I tried to code a solution based on idea #1 and the code from the answer given by zero323. It works quite well, but I wonder how to aggregate/group the result by incident in the final step. I tried adding a sequential number to each incident, but then I got errors in the last step. It would be cool if you could review and/or complete the code. Therefore I uploaded sample data and the scripts. The environment is Spark 1.4 (PySpark):
Incidents: incidents.csv
Variable value observation data (77MB): parameters_sample.csv (put it to HDFS)
Jupyter Notebook: nested_for_loop_optimized.ipynb
Python Script: nested_for_loop_optimized.py
PDF export of Script: nested_for_loop_optimized.pdf
Generally speaking only the first approach looks sensible to me. The exact joining strategy should depend on the number of records and their distribution, but you can either create a top-level data frame:
ref = sc.parallelize([
    (var_, incident)
    for var_ in variables_of_interest
    for incident in incidents
]).toDF(["var_", "incident"])
and simply join
same_var = col("Variable") == col("var_")
same_time = col("Time").between(
    col("incident.startTime"),
    col("incident.endTime")
)
ref.join(df.alias("df"), same_var & same_time)
or perform joins against particular partitions:
incidents_ = sc.parallelize([
    (incident, ) for incident in incidents
]).toDF(["incident"])
for var_ in variables_of_interest:
    df = spark.read.parquet("/some/path/Variable={0}".format(var_))
    df.join(incidents_, same_time)
optionally marking one side as small enough to be broadcasted.
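To address the update about grouping the result back to each incident, one hedged option is to carry an explicit incident id through the reference frame, so the joined rows can be split per incident afterwards. A sketch only, assuming incidents is a local Python list of objects with startTime/endTime attributes as in the loop above; incident_id is an invented helper column:
from pyspark.sql.functions import col

ref = sc.parallelize([
    (i, var_, incident.startTime, incident.endTime)
    for i, incident in enumerate(incidents)
    for var_ in variables_of_interest
]).toDF(["incident_id", "var_", "startTime", "endTime"])

joined = df.alias("df").join(
    ref,
    (col("df.Variable") == col("var_")) &
    col("df.Time").between(col("startTime"), col("endTime"))
)

# one (lazy) dataframe per incident
per_incident = {i: joined.filter(col("incident_id") == i)
                for i, _ in enumerate(incidents)}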

Understanding groupby and pandas

I'm trying to use pandas on a movie dataset to find the 10 critics with the most reviews, and to list their names in a table with the name of the magazine publication they work for and the dates of their first and last review.
The movie dataset starts as a CSV file which in Excel looks something like this:
critic fresh date publication title reviewtext
r.ebert fresh 1/2/12 Movie Mag Toy Story 'blahblah'
n.bob rotten 4/2/13 Time Ghostbusters 'blahblah'
r.ebert rotten 3/31/09 Movie Mag CasaBlanca 'blahblah'
(you can assume that a critic posts reviews at only one magazine/publication)
Then my basic code starts out like this:
reviews = pd.read_csv('reviews.csv')
reviews = reviews[~reviews.quote.isnull()]
reviews = reviews[reviews.fresh != 'none']
reviews = reviews[reviews.quote.str.len() > 0]
most_rated = reviews.groupby('critic').size().order(ascending=False)[:30]
print most_rated
output>>>
critic
r.ebert 2
n.bob 1
Then I know how to isolate the top ten critics and the number of reviews they've made (shown above), but I'm still not familiar with pandas groupby, and using it seems to get rid of the rest of the columns (and along with it things like publication and dates). When that code runs, it only prints a list of the movie critics and how many reviews they've done, not any of the other column data.
Honestly I'm lost as to how to do it. Do I need to append data from the original reviews back onto my sorted dataframe? Do I need to make a function to apply onto the groupby function? Tips or suggestions would be very helpful!
As DanB says, groupby() just splits your DataFrame into groups. Then, you apply some number of functions to each group and pandas will stitch the results together as best it can -- indexed by the original group identifiers. Other than that, as far as I understand, there's no "memory" for what the original group looked like.
Instead, you have to specify what you want to output to contain. There are a few ways to do this -- I'd look into 'agg' and 'apply'. 'Agg' is for functions that return a single value for the whole group, whereas apply is much more flexible.
If you specify what you are looking to do, I can be more helpful. For now, I'll just give you two examples.
Suppose you want, for each reviewer, the number of reviews as well as the date of the first and last review and the movies that were reviewed first and last. Since each of these is a single value per group, use 'agg':
grouped_reviews = reviews.groupby('critic')
grouped_reviews.agg({'date': ['size', 'first', 'last'], 'title': ['first', 'last']})
Suppose you want to return a dataframe of the first and last review by each reviewer. We can use 'apply', which works with any function that outputs a pandas object. So we'll write a function that takes each group and returns a dataframe of just the first and last rows:
def get_first_and_last(df):
    return pd.concat((df.iloc[0], df.iloc[-1]), axis=1, ignore_index=True)

grouped_reviews.apply(get_first_and_last)
If you are more specific about what you are looking to do, I can give you a more specific answer.
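For completeness, the original goal (the 10 critics with the most reviews, their publication, and the dates of their first and last review) could look something like this. A hedged end-to-end sketch using the sample column names; named aggregation needs a reasonably recent pandas:
reviews['date'] = pd.to_datetime(reviews['date'])

top_critics = (reviews.groupby('critic')
               .agg(publication=('publication', 'first'),
                    n_reviews=('date', 'size'),
                    first_review=('date', 'min'),
                    last_review=('date', 'max'))
               .nlargest(10, 'n_reviews'))
print(top_critics)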

