I want to use read_parquet but read the data backwards from where you start (assuming a sorted index). I don't want to read the entire parquet file into memory, because that defeats the whole point of using it. Is there a nice way to do this?
Assuming that the dataframe is indexed, the inversion can be done as a two-step process: invert the order of partitions and invert the index within each partition:
from dask.datasets import timeseries
ddf = timeseries()
ddf_inverted = (
    ddf
    .partitions[::-1]
    .map_partitions(lambda df: df.sort_index(ascending=False))
)
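The same two steps apply to a dataframe loaded with read_parquet; here is a minimal sketch, assuming the file was written with a sorted index (the path is hypothetical):

import dask.dataframe as dd

ddf = dd.read_parquet("data/sorted_table.parquet")  # hypothetical path
ddf_inverted = (
    ddf
    .partitions[::-1]
    .map_partitions(lambda df: df.sort_index(ascending=False))
)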
If the last N rows are all in the last partition, you can use the DataFrame's .tail method. If not, you can iterate backwards over the .partitions attribute. This isn't particularly smart and will blow up your memory if you request too many rows, but it should do the trick:
def get_last_n(n, df):
    read = []
    lines_read = 0
    for i in range(df.npartitions - 1, -1, -1):
        p = df.partitions[i].tail(n - lines_read)
        read.insert(0, p)
        lines_read += len(p)
        if lines_read >= n:
            break
    return pd.concat(read, axis=0)
For example, here's a dataframe with 20 rows and 5 partitions:
import dask.dataframe, pandas as pd, numpy as np, dask
df = dask.dataframe.from_pandas(pd.DataFrame({'A': np.arange(20)}), npartitions=5)
You can call the above function with any number of rows to get that many rows in the tail:
In [4]: get_last_n(4, df)
Out[4]:
A
16 16
17 17
18 18
19 19
In [5]: get_last_n(10, df)
Out[5]:
A
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
Requesting more rows than are in the dataframe just computes the whole dataframe:
In [6]: get_last_n(1000, df)
Out[6]:
A
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
Note that this requests the data iteratively, so it may be very inefficient if your graph is complex and involves lots of shuffles.
student_id 0 1 2 3 4 5 6 7 8 9 10 11 12
0 131X1319 1 14 6 16 1 10 8 15 15 17 15 18 16
1 13212YX3 1 1 4 8 11 9 14 7 0 3 0 17 13
2 13216131 1 1 13 9 15 17 0 9 3 15 11 8 10
3 132921W6 1 14 10 4 18 7 8 15 15 17 15 18 16
I have a dataframe like this, and I want to make a graph from it using networkx. I want to make an edge thicker each time that edge goes from one node to another node. Suppose
15->15->17->15->18->16
appears twice in the dataframe; then I want to increase the thickness of those edges to two. I made the normal graph, but I have not been able to increase the edge thickness.
This is my code to create the normal graph:
columns=list(pattern_df.columns.values)
pattern_g = nx.empty_graph(0, nx.DiGraph())
for i in range(len(columns)-1):
    pattern_g.add_edges_from(zip(pattern_df[columns[i]],
                                 pattern_df[columns[i+1]]))
sum_val=pattern_df.sum(numeric_only=True, axis=0)
values = [sum_val.get(node, 0.25) for node in pattern_g.nodes()]
nx.draw(pattern_g, with_labels=True, font_color='black')
plt.show()
This is the graph I generated for the sample data:
You've done a poor job of explaining what you're trying to do. Also, it would have been nice if you had provided code that could work with a simple copy and paste.
I suspect that what you have in mind is something like this.
You say: "I want to make the edge thicker each time an edge goes from one node to another node." Suppose that the sequence
15 15 17 15 18 16
appears in two different rows in the dataframe. So, I want to increase the thickness of each edge corresponding to a contiguous pair within that sequence, i.e. 15->15, 15->17, 17->15 and so forth.
Your explanation doesn't say what should happen if the same pair appears multiple times within the same row; I assume that such repetitions should separately count towards the thickness of that edge.
Here is some code that does work if you simply copy and paste it, and that does my best guess at what you're trying to do (i.e. it assumes my interpretation above is correct).
from collections import Counter
import numpy as np
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
# Reconstruct the dataframe from its inconvenient format
df_str = ''' student_id 0 1 2 3 4 5 6 7 8 9 10 11 12
0 131X1319 1 14 6 16 1 10 8 15 15 17 15 18 16
1 13212YX3 1 1 4 8 11 9 14 7 0 3 0 17 13
2 13216131 1 1 13 9 15 17 0 9 3 15 11 8 10
3 132921W6 1 14 10 4 18 7 8 15 15 17 15 18 16
'''
lines = df_str.splitlines()
cols = lines[0].split()
data = [line.split()[1:] for line in lines[1:]]
pattern_df = pd.DataFrame(data, columns=cols)
# Count appearance of each edge
columns=list(pattern_df.columns.values)
ct = Counter(p for i in range(len(columns)-1)
               for p in zip(pattern_df[columns[i]], pattern_df[columns[i+1]]))
# Build associated graph
pattern_g = nx.DiGraph()
pattern_g.add_edges_from(ct)
# Draw graph, using frequency of each pair as edge-width
width = [ct[p] for p in pattern_g.edges]
nx.draw(pattern_g, node_color = 'orange', with_labels=True, width = width)
plt.show()
Here's the result.
Regarding your comment: in order to add the width of an edge as an attribute within the graph pattern_g, you can make the following change to the graph-building section of the script I suggested.
# Build associated graph
pattern_g = nx.DiGraph()
for e, v in ct.items():
    pattern_g.add_edge(*e, weight=v)
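With the counts stored as edge attributes, you can read the widths back from the graph when drawing instead of indexing into the Counter; a small sketch, reusing the names from the script above:

# Draw using the stored 'weight' attribute as the line width
width = [d['weight'] for _, _, d in pattern_g.edges(data=True)]
nx.draw(pattern_g, node_color='orange', with_labels=True, width=width)
plt.show()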
I have a data frame containing three columns, where Col_1 and Col_2 contain some arbitrary data:
data = {"Height": range(1, 20, 1), "Col_1": range(2, 40, 2), "Col_2": range(3, 60, 3)}
df = pd.DataFrame(data)
Height Col_1 Col_2
0 1 2 3
1 2 4 6
2 3 6 9
3 4 8 12
4 5 10 15
5 6 12 18
6 7 14 21
7 8 16 24
8 9 18 27
9 10 20 30
10 11 22 33
11 12 24 36
12 13 26 39
13 14 28 42
14 15 30 45
15 16 32 48
16 17 34 51
17 18 36 54
18 19 38 57
and another data frame containing height values that should be used to segment the Height column of the df.
data_segments = {"Section Height" : [1, 10, 20]}
df_segments = pd.DataFrame(data_segments)
Section Height
0 1
1 10
2 20
I want to create two new data frames: df_segment_0, containing all columns of the initial df but only the rows whose Height lies between the first two values in df_segments, and df_segment_1 built the same way from the next pair of values. They should look like:
df_segment_0
Height Col_1 Col_2
0 1 2 3
1 2 4 6
2 3 6 9
3 4 8 12
4 5 10 15
5 6 12 18
6 7 14 21
7 8 16 24
8 9 18 27
df_segment_1
Height Col_1 Col_2
9 10 20 30
10 11 22 33
11 12 24 36
12 13 26 39
13 14 28 42
14 15 30 45
15 16 32 48
16 17 34 51
17 18 36 54
18 19 38 57
I tried the following code using the .loc method, incorporating the suggestion of C Hecht to create a list of data frames:
df_segment_list = []
try:
    for index in df_segments.index:
        df_segment = df[["Height", "Col_1", "Col_2"]].loc[
            (df["Height"] >= df_segments["Section Height"][index])
            & (df["Height"] < df_segments["Section Height"][index + 1])
        ]
        df_segment_list.append(df_segment)
except KeyError:
    pass
The try/except is used only to ignore the KeyError for the last entry, since there is no Section Height at index + 1 when index=2. The data frames in this list can be accessed as suggested by C Hecht:
df_segment_0 = df_segment_list[0]
Height Col_1 Col_2
0 1 2 3
1 2 4 6
2 3 6 9
3 4 8 12
4 5 10 15
5 6 12 18
6 7 14 21
7 8 16 24
8 9 18 27
However, I would like to automate the naming of the final data frames. I tried:
for i in range(0, len(df_segment_list)):
    name = "df_segment_" + str(i)
    name = df_segment_list[i]
I expected this code to simply automate df_segment_0 = df_segment_list[0]; instead I receive the error name 'df_segment_0' is not defined.
The reason I need separate data frames is that I will perform many subsequent operations using Col_1 and Col_2, so I need row-wise access to each one of them, for example:
df_segment_0 = df_segment_0.assign(col_3=df_segment_0["Col_1"] / df_segment_0["Col_2"])
How do I achieve this?
EDIT 1: Clarified question with the suggestion from C Hecht.
If you want to get all entries that are smaller than the current segment height in your segmentation data frame, here you go :)
import pandas as pd
df1 = pd.DataFrame({"Height": range(1, 20, 1), "Col_1": range(2, 40, 2), "Col_2": range(3, 60, 3)})
df_segments = pd.DataFrame({"Section Height": [1, 10, 20]})
def segment_data_frame(data_frame: pd.DataFrame, segmentation_plan: pd.DataFrame):
    df = data_frame.copy()  # making a safety copy because we mutate the df !!!
    for sh in segmentation_plan["Section Height"]:  # sh is the new maximum "Height"
        df_new = df[df["Height"] < sh]  # select all entries below the new maximum "Height"
        df.drop(df_new.index, inplace=True)  # remove them from the original DataFrame
        yield df_new
# ATTENTION: segment_data_frame() will calculate each segment at runtime!
# So if you don't want to iterate over it but rather have one list to contain
# them all, you must use list(segment_data_frame(...)) or [x for x in segment_data_frame(...)]
for segment in segment_data_frame(df1, df_segments):
    print(segment)
    print()
print(list(segment_data_frame(df1, df_segments)))
If you want to execute certain steps on those segments, you can just iterate over the generator like so:
for segment in segment_data_frame(df1, df_segments):
    do_stuff_with(segment)
If you want to keep track of the individual frames by name, you can use a dictionary.
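A minimal sketch of that dictionary approach, using the segment_data_frame generator defined above (the key format is just one possible choice):

segments = {f"df_segment_{i}": seg
            for i, seg in enumerate(segment_data_frame(df1, df_segments))}

# work with a specific segment by name, e.g. add a derived column
df_segment_1 = segments["df_segment_1"]
df_segment_1 = df_segment_1.assign(col_3=df_segment_1["Col_1"] / df_segment_1["Col_2"])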
Unfortunately I don't 100% understand what you have in mind, but I hope that the following helps you find the answer:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Section Height': [20, 90, 111, 232, 252, 3383, 3768, 3826, 3947, 4100], 'df_names': [f'df_section_{i}' for i in range(10)]})
df['shifted'] = df['Section Height'].shift(-1)
new_dfs = []
for index, row in df.iterrows():
    if np.isnan(row['shifted']):
        # Don't know what you want to do here
        pass
    else:
        new_df = pd.DataFrame({'heights': [i for i in range(int(row['Section Height']), int(row['shifted']))]})
        new_df.name = row['df_names']
        new_dfs.append(new_df)
The contents of new_dfs are dataframes that look like this:
heights
0 20
1 21
2 22
3 23
4 24
.. ...
65 85
66 86
67 87
68 88
69 89
[70 rows x 1 columns]
If you clarify your question given this input, we can help you all the way, but this should hopefully point you in the right direction.
Edit: A small comment on using df.name: this is not really stable, and if you do things like dropping a column or pickling/unpickling, the name will likely be lost. But you can surely find a good solution to maintain the name depending on your needs.
I'm supposed to create code that will simulate a 20-sided die (d20) being rolled 25 times, using np.random.choice.
I tried this:
np.random.choice(20,25)
but this still includes 0s, which wouldn't appear on a die.
How do I account for the 0s?
Use np.arange:
import numpy as np
np.random.seed(42) # for reproducibility
result = np.random.choice(np.arange(1, 21), 50)
print(result)
Output
[ 7 20 15 11 8 7 19 11 11 4 8 3 2 12 6 2 1 12 12 17 10 16 15 15
19 12 20 3 5 19 7 9 7 18 4 14 18 9 2 20 15 7 12 8 15 3 14 17
4 18]
Your original code draws numbers from 0 to 19, both inclusive. To understand why, you can check the documentation of np.random.choice, in particular on the first argument:
a : 1-D array-like or int
If an ndarray, a random sample is generated from its elements. If an
int, the random sample is generated as if it were np.arange(a)
np.random.choice() takes as its first argument an array of possible choices (if an int is given, it works like np.arange), so you can use list(range(1, 21)) to get the output you want.
Or just add 1 to each roll:
np.random.choice(20,25) + 1
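If you are on a recent NumPy, the same fix also works with the newer Generator API while still using choice; a small sketch (the seed is arbitrary and only there for reproducibility):

import numpy as np

rng = np.random.default_rng(42)                # arbitrary seed for reproducibility
rolls = rng.choice(np.arange(1, 21), size=25)  # 25 rolls of a d20, values 1..20
print(rolls)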
I have a Pandas DataFrame with the following hypothetical data:
ID Time X-coord Y-coord
0 1 5 68 5
1 2 8 72 78
2 3 1 15 23
3 4 4 81 59
4 5 9 78 99
5 6 12 55 12
6 7 5 85 14
7 8 7 58 17
8 9 13 91 47
9 10 10 29 87
For each row (or ID), I want to find the ID with the closest proximity in time and space (X & Y) within this dataframe. Bonus: Time should have priority over XY.
Ideally, in the end I would like to have a new column called "Closest_ID" containing the most proximal ID within the dataframe.
I'm having trouble coming up with a function for this.
I would really appreciate any help or hint that points me in the right direction!
Thanks a lot!
Let's denote df as our dataframe. Then you can do something like:
from sklearn.metrics import pairwise_distances
import numpy as np

space_vals = df[['X-coord', 'Y-coord']]
time_vals = df[['Time']]  # keep this 2-D so pairwise_distances accepts it
space_distance = pairwise_distances(space_vals)
time_distance = pairwise_distances(time_vals)
space_distance[space_distance == 0] = 1e9  # arbitrary large number, masks self-distances
time_distance[time_distance == 0] = 1e9  # again
closest_space_id = np.argmin(space_distance, axis=0)  # positional indices, not 'ID' values
closest_time_id = np.argmin(time_distance, axis=0)
Then, you can store the last two results in two columns, or somehow decide which one is closer.
Note: this code hasn't been checked end to end, so it might still need a few adjustments...
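To give time priority over XY and map the results back to actual IDs, one possible extension of the same idea is to weight the time distance heavily, add the two matrices, and look up the ID of the argmin row. This is only a sketch; the weighting factor of 1000 is an arbitrary assumption:

import numpy as np
import pandas as pd
from sklearn.metrics import pairwise_distances

df = pd.DataFrame({
    "ID": range(1, 11),
    "Time": [5, 8, 1, 4, 9, 12, 5, 7, 13, 10],
    "X-coord": [68, 72, 15, 81, 78, 55, 85, 58, 91, 29],
    "Y-coord": [5, 78, 23, 59, 99, 12, 14, 17, 47, 87],
})

space_distance = pairwise_distances(df[["X-coord", "Y-coord"]])
time_distance = pairwise_distances(df[["Time"]])

# Weight time heavily (arbitrary factor) so it dominates, and block self-matches.
combined = 1000 * time_distance + space_distance
np.fill_diagonal(combined, np.inf)

# For each row, take the ID of the row with the smallest combined distance.
df["Closest_ID"] = df["ID"].to_numpy()[np.argmin(combined, axis=1)]
print(df)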
I'm wondering if there is a pythonic way to fill nulls for categorical data by randomly choosing from the distribution of unique values. Basically proportionally / randomly filling categorical nulls based on the existing distribution of the values in the column...
Below is an example of what I'm already doing. I'm using numbers as categories to save time; I'm not sure how to randomly fill in letters.
import numpy as np
import pandas as pd
np.random.seed([1])
df = pd.DataFrame(np.random.normal(10, 2, 20).round().astype(object))
df.rename(columns = {0 : 'category'}, inplace = True)
df.loc[::5] = np.nan
print(df)
category
0 NaN
1 12
2 4
3 9
4 12
5 NaN
6 10
7 12
8 13
9 9
10 NaN
11 9
12 10
13 11
14 9
15 NaN
16 10
17 4
18 9
19 9
This is how I'm currently imputing the values:
df.category.value_counts()
9 6
12 3
10 3
4 2
13 1
11 1
df.category.value_counts()/16
9 0.3750
12 0.1875
10 0.1875
4 0.1250
13 0.0625
11 0.0625
# to fill categorical info based on percentage
category_fill = np.random.choice((9, 12, 10, 4, 13, 11), size = 4, p = (.375, .1875, .1875, .1250, .0625, .0625))
df.loc[df.category.isnull(), "category"] = category_fill
Final output works, just takes a while to write
df.category.value_counts()
9 9
12 4
10 3
4 2
13 1
11 1
Is there a faster way to do this or a function that would serve this purpose?
Thanks for any and all help!
You could use stats.rv_discrete:
from scipy import stats
counts = df.category.value_counts()
dist = stats.rv_discrete(values=(counts.index, counts/counts.sum()))
fill_values = dist.rvs(size=df.shape[0] - df.category.count())
df.loc[df.category.isnull(), "category"] = fill_values
EDIT: For general data (not restricted to integers) you can do:
dist = stats.rv_discrete(values=(np.arange(counts.shape[0]),
                                 counts/counts.sum()))
fill_idxs = dist.rvs(size=df.shape[0] - df.category.count())
df.loc[df.category.isnull(), "category"] = counts.iloc[fill_idxs].index.values
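If you'd rather skip scipy, a sketch of the same idea with plain NumPy/pandas (reusing df and the imports from the question's setup): draw the fill values with np.random.choice, using the normalized value counts as the probabilities.

counts = df.category.value_counts(normalize=True)  # observed distribution of the non-null values
n_missing = df.category.isnull().sum()
fill_values = np.random.choice(counts.index, size=n_missing, p=counts.values)
df.loc[df.category.isnull(), "category"] = fill_values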