how to do a nested for-each loop with PySpark - python

Imagine a large dataset (>40GB parquet file) containing value observations of thousands of variables as triples (variable, timestamp, value).
Now think of a query in which you are just interested in a subset of 500 variables, and you want to retrieve the observations (values --> time series) for those variables for specific points in time (an observation window or timeframe), each having a start and end time.
Without distributed computing (Spark), you could code it like this:
for var_ in variables_of_interest:
    for incident in incidents:
        var_df = df_all.filter(
            (df_all.Variable == var_)
            & (df_all.Time > incident.startTime)
            & (df_all.Time < incident.endTime))
My question is: how do I do that with Spark/PySpark? I was thinking of either:
joining the incidents somehow with the variables and filtering the dataframe afterward,
broadcasting the incident dataframe and using it within a map function when filtering the variable observations (df_all), or
using RDD.cartesian or RDD.mapPartitions somehow (remark: the parquet file was saved partitioned by variable).
The expected output should be:
incident1 --> dataframe 1
incident2 --> dataframe 2
...
Where dataframe 1 contains all variables and their observed values within the timeframe of incident 1 and dataframe 2 those values within the timeframe of incident 2.
I hope you got the idea.
UPDATE
I tried to code a solution based on idea #1 and the code from the answer given by zero323. It works quite well, but I wonder how to aggregate/group it to the incident in the final step? I tried adding a sequential number to each incident, but then I got errors in the last step. It would be cool if you could review and/or complete the code, so I uploaded sample data and the scripts. The environment is Spark 1.4 (PySpark):
Incidents: incidents.csv
Variable value observation data (77MB): parameters_sample.csv (put it on HDFS)
Jupyter Notebook: nested_for_loop_optimized.ipynb
Python Script: nested_for_loop_optimized.py
PDF export of Script: nested_for_loop_optimized.pdf

Generally speaking, only the first approach looks sensible to me. The exact joining strategy depends on the number of records and their distribution, but you can either create a top-level data frame:
ref = sc.parallelize([(var_, incident)
                      for var_ in variables_of_interest
                      for incident in incidents
]).toDF(["var_", "incident"])
and simply join
from pyspark.sql.functions import col

same_var = col("Variable") == col("var_")
same_time = col("Time").between(
    col("incident.startTime"),
    col("incident.endTime")
)

ref.join(df.alias("df"), same_var & same_time)
or perform joins against particular partitions:
incidents_ = sc.parallelize([
    (incident, ) for incident in incidents
]).toDF(["incident"])

for var_ in variables_of_interest:
    df = spark.read.parquet("/some/path/Variable={0}".format(var_))
    df.join(incidents_, same_time)
optionally marking one side as small enough to be broadcasted.
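For instance, a minimal sketch of that broadcast hint, reusing df, incidents_ and same_time from above and assuming a Spark version where pyspark.sql.functions.broadcast is available (1.6+):

from pyspark.sql.functions import broadcast

# Same join as above, but hinting that incidents_ is small enough to be
# shipped to every executor (a broadcast join, no shuffle of the large df).
df.join(broadcast(incidents_), same_time)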

Related

Group by two columns in pandas

I have the following script running in my Power BI:
dataset['Percent_Change'] = dataset.groupby('Contract_Year')['Norm_Price'].pct_change().fillna(0)
dataset['Norm_Change'] = dataset['Percent_Change'].add(1).groupby(dataset['Contract_Year']).cumprod()
Initially I only had one location, so I just needed the calculation to run by contract year.
Now I have several locations in the same dataset. How do I make the calculation perform the same way, but for each location and then by contract year?
Field name: [location]
In order to group by several columns, you can just use a list of columns instead of a string.
something like:
dataset.groupby(['Contract_Year','Locations'])
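Applied to the two lines from the question, that could look like the following sketch (assuming the location column is literally named 'location', as the question suggests):

group_cols = ['Contract_Year', 'location']

# Percent change of Norm_Price within each (contract year, location) group
dataset['Percent_Change'] = (dataset.groupby(group_cols)['Norm_Price']
                             .pct_change().fillna(0))

# Cumulative product of (1 + percent change) within the same groups
dataset['Norm_Change'] = (dataset['Percent_Change'].add(1)
                          .groupby([dataset['Contract_Year'], dataset['location']])
                          .cumprod())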

Compare two date columns in pandas DataFrame to validate third column

Background info
I'm working on a DataFrame where I have successfully joined two different datasets of football players using fuzzymatcher. These datasets did not have keys for an exact match, so the join had to be done on the players' names. An example match of the name columns from the two databases is the following:
long_name name
L. Messi Lionel Andrés Messi Cuccittini
As part of the validation process for an 18,000-row database, I want to check the two date-of-birth columns in the merged DataFrame df, ensuring that the columns match, as in the example below:
dob birth_date
1987-06-24 1987-06-24
Both date columns have been converted from strings to dates using pd.to_datetime(), e.g.
df['birth_date'] = pd.to_datetime(df['birth_date'])
My question
I have another column called 'value'. I want to update my pandas DataFrame so that if the two date columns match, the entry is unchanged; however, if the two date columns don't match, I want the data in the value column to be changed to null (NaN). This is something I can do quite easily in Excel with a date-diff calculation, but I'm unsure how to do it in pandas.
My current code is the following:
df.loc[(df['birth_date'] != df['dob']),'value'] = np.nan
Reason for this step (feel free to skip)
The reason for this code is that it will quickly show me fuzzy matches that are inaccurate (approx 10% of total database) and allow me to quickly fix those.
Ideally I also need to work on the matching algorithm to ensure a perfect date match; however, my current algorithm works quite well in its current state and the project is nearly complete. Any advice on this would still be welcome, if it is something you know about.
Many thanks in advance!
IIUC:
Try np.where. It works as follows:
np.where(condition, value_if_condition_true, value_if_condition_false)
Here the condition is df['birth_date'] != df['dob'], the value if true is np.nan, and the value if false is the prevailing df['value']:
df['value'] = np.where(df['birth_date'] != df['dob'], np.nan, df['value'])
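A minimal, self-contained sketch of that masking on made-up data (the rows and values are purely illustrative):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'dob':        pd.to_datetime(['1987-06-24', '1990-01-01']),
    'birth_date': pd.to_datetime(['1987-06-24', '1991-05-05']),
    'value':      [100.0, 50.0],
})

# Where the dates disagree, blank out the value; otherwise keep it.
df['value'] = np.where(df['birth_date'] != df['dob'], np.nan, df['value'])
print(df)   # row 0 keeps 100.0, row 1 becomes NaN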

Dask - Is there a dask dataframe equivalent of pandas df.values.tolist()?

I am reading a CSV file with about 25 million rows and 4 columns: Lat, Long, Country and Level. After filtering out what I don't want, I am left with around 500k rows which I would like to visualise using Folium.
Folium requires the dataframe with the lat, long and level columns passed to it as individual rows, in the following manner:
data = ddf.apply(lambda row: makeList(row['Latitude'], row['Longitude'], row['Level']), axis=1, meta=object)
makeList is a function defined as follows -
def makeList(x, y, z):
    return [x, y, z]
The above function takes about 120 seconds to compute. I was wondering if there's a way to speed this up, perhaps by using ddf.values.tolist() or any other way that would compute quicker?
thanks!
The title of your post suggests that you want a list, so maybe a Dask bag would be an option.
But your post also contains "Folium requires the dataframe with ...", so more
likely you need to generate just a DataFrame with the 3 mentioned columns.
To generate a DataFrame with a subset of columns, you can run:
data = ddf[['Latitude', 'Longitude', 'Level']]
Then, you could e.g. save it in a single CSV file:
data.to_csv('your_file.csv', single_file=True)
(500k rows is an acceptable number) and process it in another program as an "ordinary" (Pandas) DataFrame.
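If a plain Python list of [lat, lon, level] rows is really what you need (e.g. for a Folium HeatMap layer), a minimal sketch, assuming the filtered result fits comfortably in memory at ~500k rows (the file name is hypothetical and the filtering step is omitted):

import dask.dataframe as dd

ddf = dd.read_csv('points.csv')              # hypothetical input file
subset = ddf[['Latitude', 'Longitude', 'Level']]

# Materialise the subset as a pandas DataFrame, then convert it to a list of
# [lat, lon, level] rows in one vectorised call instead of a row-wise apply.
data = subset.compute().values.tolist()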

Detecting bad information (python/pandas)

I am new to Python and pandas, and I was wondering whether pandas can filter out information within a dataframe that is otherwise inconsistent. For example, imagine a dataframe with two columns: (1) product code and (2) unit of measurement. The same product code in column 1 may repeat several times, and there are several different product codes. I would like to filter out the product codes for which there is more than one unit of measurement for the same product code. Ideally, when this happens, the filter would bring back all instances of such a product code, not just the instance in which the unit of measurement is different. To give more colour to my request: the real objective is to identify the product codes with inconsistent units of measurement, as the same product code should always have the same unit of measurement in all instances.
Thanks in advance!!
First you want some mapping of product code -> unit of measurement, i.e. the ground truth. You can either upload this, or try to be clever and derive it from the data, assuming that the most frequently used unit of measurement for a product code is the correct one. You could get this by doing:
truth_mapping = df.groupby(['product_code'])['unit_of_measurement'].agg(lambda x:x.value_counts().index[0]).to_dict()
Then you can get a column that is the 'correct' unit of measurement
df['correct_unit'] = df['product_code'].apply(truth_mapping.get)
Then you can filter to rows that do not have the correct mapping:
df[df['correct_unit'] != df['unit_of_measurement']]
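Put together on a tiny hypothetical dataframe, the whole approach reads like this sketch (the column names product_code and unit_of_measurement are assumed, as above):

import pandas as pd

df = pd.DataFrame({
    'product_code':        ['A', 'A', 'A', 'B', 'B'],
    'unit_of_measurement': ['kg', 'kg', 'lb', 'ea', 'ea'],
})

# Most frequent unit per product code is taken as the "truth"
truth_mapping = (df.groupby('product_code')['unit_of_measurement']
                   .agg(lambda x: x.value_counts().index[0])
                   .to_dict())

df['correct_unit'] = df['product_code'].apply(truth_mapping.get)

# Rows whose unit disagrees with the majority unit for that code
print(df[df['correct_unit'] != df['unit_of_measurement']])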
Try this:
Sample df:
df12 = pd.DataFrame({'Product Code': ['A','A','A','A','B','B','C','C','D','E'],
                     'Unit of Measurement': ['x','x','y','z','w','w','q','r','a','c']})
Group by and see the count of all non-unique pairs:
new = df12.groupby(['Product Code','Unit of Measurement']).size().reset_index().rename(columns={0:'count'})
Drop all rows where the Product Code is repeated:
new.drop_duplicates(subset=['Product Code'], keep=False)
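Note that the question asks for all instances of an inconsistent product code, not only the duplicated pairs. A minimal sketch of that variant, reusing the df12 sample above:

# Product codes that appear with more than one distinct unit of measurement
bad_codes = df12.groupby('Product Code')['Unit of Measurement'].nunique()
bad_codes = bad_codes[bad_codes > 1].index

# All rows belonging to those inconsistent product codes
print(df12[df12['Product Code'].isin(bad_codes)])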

Pandas converting data to NaN when adding to a new DataSet

I've been trying to extract specific data from a given dataset and add it to a new one in a specific set of organized columns. I'm doing this by reading a CSV file and using string functions. The problem is that even though the data is extracted correctly, pandas adds the second column as NaN, even though there is data stored in the affected variable. Please see my code below; any idea on how to fix this?
processor=pd.DataFrame()
Hospital_beds="SH.MED.BEDS.ZS"
Mask1=data["IndicatorCode"].str.contains(Hospital_beds)
stage=data[Mask1]
Hospital_Data=stage["Value"]
Birth_Rate="SP.DYN.CBRT.IN"
Mask=data["IndicatorCode"].str.contains(Birth_Rate)
stage=data[Mask]
Birth_Data=stage["Value"]
processor["Countries"]=stage["CountryCode"]
processor["Birth Rate per 1000 people"]=Birth_Data
processor["Hospital beds per 100 people"]=Hospital_Data
processor.head(10)
The problem here is that the indices are not matching up. When you initially populate the processor data frame, you are using each line from the original dataframe that contained birth-rate data. These lines are different from the ones that contain the hospital-beds data, so when you do
processor["Hospital beds per 100 people"] = Hospital_Data
pandas will create the new column, but since there are no matching indices for Hospital_Data in processor, it will just contain null values.
Probably what you first want to do is re-index the original data using the country code and the year
data.set_index(['CountryCode','Year'], inplace=True)
You can then create a view of just the indicators you are interested in
indicators = ['SH.MED.BEDS.ZS', 'SP.DYN.CBRT.IN']
dview = data[data.IndicatorCode.isin(indicators)]
Finally, you can pivot on the indicator code to view each indicator on the same line:
dview.pivot(columns='IndicatorCode')['Value']
But note this will still contain a lot of NaNs. This is just because the hospital bed data is updated very infrequently (or e.g. in Aruba not at all). But you can filter these out as needed.
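A minimal, self-contained sketch of that set_index/isin/pivot sequence on made-up data shaped like the indicator file (country codes, years and values are purely illustrative):

import pandas as pd

data = pd.DataFrame({
    'CountryCode':   ['ABW', 'ABW', 'DEU', 'DEU'],
    'Year':          [2010, 2010, 2010, 2010],
    'IndicatorCode': ['SH.MED.BEDS.ZS', 'SP.DYN.CBRT.IN',
                      'SH.MED.BEDS.ZS', 'SP.DYN.CBRT.IN'],
    'Value':         [3.1, 12.4, 8.3, 8.4],
})

data.set_index(['CountryCode', 'Year'], inplace=True)

indicators = ['SH.MED.BEDS.ZS', 'SP.DYN.CBRT.IN']
dview = data[data.IndicatorCode.isin(indicators)]

# One row per (CountryCode, Year), one column per indicator code
print(dview.pivot(columns='IndicatorCode')['Value'])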
