I am working with an extremely large dataframe. Here is a sample:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'ID': ['A', 'A', 'A', 'X', 'X', 'Y'],
})
ID
0 A
1 A
2 A
3 X
4 X
5 Y
Now, given the frequency of each value in column 'ID', I want to calculate a weight using the function below and add a column that holds the weight associated with each value in 'ID'.
def get_weights_inverse_num_of_samples(label_counts, power=1.):
    no_of_classes = len(label_counts)
    weights_for_samples = 1.0 / np.power(np.array(label_counts), power)
    weights_for_samples = weights_for_samples / np.sum(weights_for_samples) * no_of_classes
    return weights_for_samples
freq = df.value_counts()
print(freq)
ID
A 3
X 2
Y 1
weights = get_weights_inverse_num_of_samples(freq)
print(weights)
[0.54545455 0.81818182 1.63636364]
So, I am looking for an efficient way to get a dataframe like this given the above weights:
ID sample_weight
0 A 0.54545455
1 A 0.54545455
2 A 0.54545455
3 X 0.81818182
4 X 0.81818182
5 Y 1.63636364
If you rely on duck-typing a little bit more, you can rewrite your function so that it returns the same type it was given as input.
This saves you from having to reach back into the .index explicitly before calling .map.
import pandas as pd
df = pd.DataFrame({'ID': ['A', 'A', 'A', 'X', 'X', 'Y']})
def get_weights_inverse_num_of_samples(label_counts, power=1):
    """Using object methods here instead of coercing to numpy ndarray"""
    no_of_classes = len(label_counts)
    weights_for_samples = 1 / (label_counts ** power)
    return weights_for_samples / weights_for_samples.sum() * no_of_classes
# select the column before using `.value_counts()`
# this saves us from ending up with a `MultiIndex` Series
freq = df['ID'].value_counts()
weights = get_weights_inverse_num_of_samples(freq)
print(weights)
# A 0.545455
# X 0.818182
# Y 1.636364
# note that now our weights are still a `pd.Series`
# that we can align directly against our `"ID"` column
df['sample_weight'] = df['ID'].map(weights)
print(df)
# ID sample_weight
# 0 A 0.545455
# 1 A 0.545455
# 2 A 0.545455
# 3 X 0.818182
# 4 X 0.818182
# 5 Y 1.636364
You can map the values:
df['sample_weight'] = df['ID'].map(dict(zip(freq.index.get_level_values(0), weights)))
NB. df.value_counts() returns a Series indexed by a single-level MultiIndex, hence the need for get_level_values.
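To see the difference, compare the two index shapes (a quick check on the sample frame above; the exact repr depends on your pandas version, and newer versions may already flatten the single-column case):
import pandas as pd
df = pd.DataFrame({'ID': ['A', 'A', 'A', 'X', 'X', 'Y']})
# counting the whole frame keys the result by row-tuples,
# which shows up as a single-level MultiIndex on older pandas versions
print(df.value_counts().index)
# counting just the column keys the result by the plain values
print(df['ID'].value_counts().index)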
As noted by @ScottBoston, a better approach would be to use:
freq = df['ID'].value_counts()
df['sample_weight'] = df['ID'].map(dict(zip(freq.index, weights)))
Output:
ID sample_weight
0 A 0.545455
1 A 0.545455
2 A 0.545455
3 X 0.818182
4 X 0.818182
5 Y 1.636364
I am trying to make plots with datashader. The data itself is a time series of points in polar coordinates. I managed to transform them to Cartesian coordinates (to have equally spaced pixels) and I can plot them with datashader.
The point where I am stuck is that if I just plot them with line() instead of points(), it connects the whole dataframe as a single line. I would like to plot the data of the dataframe group by group (the groups are the names in list_of_names) onto the canvas as lines.
The data can be found here.
I get this kind of image with datashader.
This is a zoomed-in view of the plot generated with points() instead of line(). The goal is to produce the same plot but with connected lines instead of points.
import datashader as ds, pandas as pd, colorcet
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv('file.csv')
print(df)
starlink_name = df.loc[:,'Name']
starlink_alt = df.loc[:,'starlink_alt']
starlink_az = df.loc[:,'starlink_az']
name = starlink_name.values
alt = starlink_alt.values
az = starlink_az.values
print(name)
print(df['Name'].nunique())
df['Date'] = pd.to_datetime(df['Date'])
for name, df_name in df.groupby('Name'):
    print(name)
df_grouped = df.groupby('Name')
list_of_names = list(df_grouped.groups)
print(len(list_of_names))
#########################################################################################
#i want this kind of plot with connected lines with datashader
#########################################################################################
fig = plt.figure()
ax = fig.add_axes([0.1,0.1,0.8,0.8], polar=True)
# ax.invert_yaxis()
ax.set_theta_zero_location('N')
ax.set_rlim(90, 60, 1)
# Note: you must set the end of arange to be slightly larger than 90 or it won't include 90
ax.set_yticks(np.arange(0, 91, 15))
ax.set_rlim(bottom=90, top=0)
for name in list_of_names:
    df2 = df_grouped.get_group(name)
    ax.plot(np.deg2rad(df2['starlink_az']), df2['starlink_alt'], linestyle='solid', marker='.', linewidth=0.5, markersize=0.1)
plt.show()
print(df)
#########################################################################################
#transformation to cartesian coordinates
#########################################################################################
df['starlink_alt'] = 90 - df['starlink_alt']
df['x'] = df.apply(lambda row: np.deg2rad(row.starlink_alt) * np.cos(np.deg2rad(row.starlink_az)), axis=1)
df['y'] = df.apply(lambda row: -1 * np.deg2rad(row.starlink_alt) * np.sin(np.deg2rad(row.starlink_az)), axis=1)
#########################################################################################
# this is what i want but as lines group per group
#########################################################################################
cvs = ds.Canvas(plot_width=2000, plot_height=2000)
agg = cvs.points(df, 'y', 'x')
img = ds.tf.shade(agg, cmap=colorcet.fire, how='eq_hist')
#########################################################################################
#here i am stuck
#########################################################################################
for name in list_of_names:
    df2 = df_grouped.get_group(name)
    cvs = ds.Canvas(plot_width=2000, plot_height=2000)
    agg = cvs.line(df2, 'y', 'x')
    img = ds.tf.shade(agg, cmap=colorcet.fire, how='eq_hist')
#plt.imshow(img)
plt.show()
To do this, you have a couple of options. One is to insert NaN rows as breakpoints into your dataframe before calling cvs.line. You need datashader to "pick up the pen", as it were, which it does whenever it hits a row of NaNs, so you add one after each group. It's not the slickest approach, but it is a currently recommended solution.
Really simple, hacky example:
In [17]: df = pd.DataFrame({
...: 'name': list('AABBCCDD'),
...: 'x': np.arange(8),
...: 'y': np.arange(10, 18),
...: })
In [18]: df
Out[18]:
name x y
0 A 0 10
1 A 1 11
2 B 2 12
3 B 3 13
4 C 4 14
5 C 5 15
6 D 6 16
7 D 7 17
This block groups on the 'name' column, then reindexes each group to be one element longer than the original data:
In [20]: res = df.set_index('name').groupby('name').apply(
...: lambda x: x.reset_index(drop=True).reindex(np.arange(len(x) + 1))
...: )
In [21]: res
Out[21]:
x y
name
A 0 0.0 10.0
1 1.0 11.0
2 NaN NaN
B 0 2.0 12.0
1 3.0 13.0
2 NaN NaN
C 0 4.0 14.0
1 5.0 15.0
2 NaN NaN
D 0 6.0 16.0
1 7.0 17.0
2 NaN NaN
You can plug this reindexed dataframe into datashader to have multiple disconnected lines in the result.
This is a still-open issue on the datashader repo, including additional examples and boilerplate code: https://github.com/holoviz/datashader/issues/257
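As a concrete follow-up, the aggregation call on the reindexed frame looks just like your points version (a minimal sketch using the toy data above; the canvas size is arbitrary):
import numpy as np
import pandas as pd
import datashader as ds
import colorcet
# toy data from above, reindexed so each group ends with a NaN row
df = pd.DataFrame({'name': list('AABBCCDD'), 'x': np.arange(8), 'y': np.arange(10, 18)})
res = df.set_index('name').groupby('name').apply(
    lambda g: g.reset_index(drop=True).reindex(np.arange(len(g) + 1))
)
# datashader lifts the pen at each NaN row, so the groups render as separate lines
cvs = ds.Canvas(plot_width=400, plot_height=400)
agg = cvs.line(res, 'x', 'y')
img = ds.tf.shade(agg, cmap=colorcet.fire, how='eq_hist')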
Other options include restructuring your data to accommodate one of cvs.line's other formats. From the Canvas.line docstring:
def line(self, source, x=None, y=None, agg=None, axis=0, geometry=None,
antialias=False):
Parameters
----------
source : pandas.DataFrame, dask.DataFrame, or xarray.DataArray/Dataset
The input datasource.
x, y : str or number or list or tuple or np.ndarray
Specification of the x and y coordinates of each vertex
* str or number: Column labels in source
* list or tuple: List or tuple of column labels in source
* np.ndarray: When axis=1, a literal array of the
coordinates to be used for every row
agg : Reduction, optional
Reduction to compute. Default is ``any()``.
axis : 0 or 1, default 0
Axis in source to draw lines along
* 0: Draw lines using data from the specified columns across
all rows in source
* 1: Draw one line per row in source using data from the
specified columns
There are a number of additional examples in the cvs.line docstring. You can pass lists of column labels as the x and y arguments to draw one line per row when axis=1, or you can pass a dataframe with ragged array values.
See this pull request adding the line options (h/t to @James-a-bednar in the comments) for a discussion of their use.
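As a small illustration of the axis=1 form (toy data of my own, not from the question): each row holds one line, with its x vertices spread over one set of columns and its y vertices over another.
import pandas as pd
import datashader as ds
# two lines, one per row; x0..x2 are the x vertices, y0..y2 the y vertices
lines = pd.DataFrame({
    'x0': [0.0, 0.0], 'x1': [1.0, 2.0], 'x2': [2.0, 4.0],
    'y0': [0.0, 1.0], 'y1': [1.0, 3.0], 'y2': [0.0, 1.0],
})
cvs = ds.Canvas(plot_width=300, plot_height=300)
agg = cvs.line(lines, x=['x0', 'x1', 'x2'], y=['y0', 'y1', 'y2'], axis=1)
img = ds.tf.shade(agg)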
I have data like this in a CSV file:
x Y
[2,3,4] [3.4,2.5,3.1]
[4,5,2] [6.2,7.5,9.7]
[2,6,9] [4.6,2.5,2.4]
[1,3,6] [8.9,7.5,9.2]
I want to calculate the mean for each list in a row
x Y
[2,3,4] < mean [3.4,2.5,3.1] < mean
[4,5,2] < mean [6.2,7.5,9.7] < mean
[2,6,9] < mean [4.6,2.5,2.4] < mean
[1,3,6] < mean [8.9,7.5,9.2] < mean
and output the mean value to a CSV file.
How can I achieve this using Python (pandas)?
EDIT
After some research, I found a solution to my issue:
import csv
import pandas as pd
import numpy as np
from ast import literal_eval
#csv file you want to import
filename ="xy.csv"
fields = ['X','Y'] #field names
df = pd.read_csv(filename,usecols=fields,quotechar='"', sep=',',low_memory = True)
df.X = df.X.apply(literal_eval)
df.X = df.X.apply(np.mean) #calculates mean for the list in field 'X'
print(df.X) #print result
df.Y = df.Y.apply(literal_eval)
df.Y = df.Y.apply(np.mean) #calculates mean for the list in field 'Y'
print(df.Y)
Via applymap:
# if the values are still strings that look like lists, parse and average in one step:
# df = df.applymap(lambda x: sum(eval(x)) / len(eval(x)))
# once the values are real lists, either of these works:
df = df.applymap(np.mean)  # suggested by alex
# df = df.applymap(lambda x: sum(x) / len(x))
OUTPUT:
x Y
0 3.000000 3.000000
1 3.666667 7.800000
2 5.666667 3.166667
3 3.333333 8.533333
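Since the question also asks to write the means out to a CSV file, a final to_csv call covers that (the output filename here is just a placeholder):
df.to_csv('xy_means.csv', index=False)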
You can use .applymap() with np.mean() to map the dataframe element-wise.
import numpy as np
df = df.applymap(eval)  # optional: only needed if the columns hold strings that look like lists rather than actual lists
df = df.applymap(np.mean)
Result:
print(df)
x Y
0 3.000000 3.000000
1 3.666667 7.800000
2 5.666667 3.166667
3 3.333333 8.533333
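One note for recent pandas: DataFrame.applymap was deprecated in pandas 2.1 in favour of the element-wise DataFrame.map, so on newer versions the same idea reads:
df = df.map(np.mean)  # pandas >= 2.1; on older versions keep df.applymap(np.mean)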
I'm looking to make it so that NaN values in a dataframe are filled in by the mean of all the values up to that point, as such:
A
0 1
1 2
2 3
3 4
4 5
5 NaN
6 NaN
7 11
8 NaN
Would become
A
0 1
1 2
2 3
3 4
4 5
5 3
6 3
7 11
8 4
You can solve it by running the following code:
import numpy as np
import pandas as pd
df = pd.DataFrame({
"A": [ 1, 2, 3, 4, 5, pd.NA, pd.NA, 11, pd.NA ]
})
for idx in df[pd.isna(df["A"])].index:
    df.loc[idx, "A"] = np.mean(df.loc[:idx, "A"])
It iterates over each NaN and fills it with the mean of the previous values, including previously filled NaNs.
At the end you will have:
>>> df
A
0 1
1 2
2 3
3 4
4 5
5 3
6 3
7 11
8 4
EDIT
As stated by RichieV, performance may be an issue with this solution when there are many NaNs (its runtime complexity is O(N^2)), and we should also avoid Python-level iteration where possible, since it is slow compared to native pandas / numpy calls.
Here is an optimized version:
last_idx = None
cumsum = 0
cumnum = 0
for idx in df[pd.isna(df["A"])].index:
    prev_values = df.loc[last_idx:idx, "A"]
    # .loc label slices include the endpoint, so drop idx itself
    prev_values = prev_values[:-1]
    cumsum += prev_values.sum()
    cumnum += len(prev_values)
    df.loc[idx, "A"] = int(cumsum / cumnum)
    last_idx = idx
Result:
>>> df
A
0 1
1 2
2 3
3 4
4 5
5 3
6 3
7 11
8 4
Since in the worst case the script passes over the dataframe twice, the runtime complexity is now O(N).
Marco's answer works fine, but it can be optimized further with an incremental average formula, from math.stackexchange.com.
Here is an adaptation of that other question (not the exact formula, just the concept).
cumsum = 0
expanding_mean = []
for i, xi in enumerate(df['A']):
    if pd.isna(xi):
        mean = cumsum / i  # divide by number of items up to previous row
        expanding_mean.append(mean)
        cumsum += mean
    else:
        cumsum += xi
df.loc[df['A'].isna(), 'A'] = expanding_mean
The main advantage of this code is not having to re-read all items up to the current index on each iteration to get the mean.
This option still uses a Python loop, which is not the best choice with pandas, but there seems to be no way around it for this use case (hopefully someone will get inspired by this and find such a method without a loop).
Performance tests
Three alternative functions were defined:
incremental: My answer.
from_origin: Marco's original answer.
incremental_pandas: Marco's updated answer.
Tests were done using the timeit module with 3 repetitions on random samples with a 0.4 probability of NaN.
Full code for testing
import pandas as pd
import numpy as np
import timeit
import collections
from matplotlib import pyplot as plt
def incremental(df: pd.DataFrame):
    # error handling
    if pd.isna(df.iloc[0, 0]):
        df.iloc[0, 0] = 0

    cumsum = 0
    expanding_mean = []
    for i, xi in enumerate(df['A']):
        if pd.isna(xi):
            mean = cumsum / i  # divide by number of items up to previous row
            expanding_mean.append(mean)
            cumsum += mean
        else:
            cumsum += xi
    df.loc[df['A'].isna(), 'A'] = expanding_mean
    return df

def incremental_pandas(df: pd.DataFrame):
    # error handling
    if pd.isna(df.iloc[0, 0]):
        df.iloc[0, 0] = 0

    last_idx = None
    cumsum = 0
    cumnum = 0
    for idx in df[pd.isna(df["A"])].index:
        prev_values = df.loc[last_idx:idx, "A"]
        # .loc label slices include the endpoint, so drop idx itself
        prev_values = prev_values[:-1]
        cumsum += prev_values.sum()
        cumnum += len(prev_values)
        df.loc[idx, "A"] = cumsum / cumnum
        last_idx = idx
    return df

def from_origin(df: pd.DataFrame):
    # error handling
    if pd.isna(df.iloc[0, 0]):
        df.iloc[0, 0] = 0

    for idx in df[pd.isna(df["A"])].index:
        df.loc[idx, "A"] = np.mean(df.loc[:idx, "A"])
    return df

def get_random_sample(n, p):
    np.random.seed(123)
    return pd.DataFrame({'A':
        np.random.choice(list(range(10)) + [np.nan],
                         size=n, p=[(1 - p) / 10] * 10 + [p])})

r = 3
p = 0.4  # portion of NaNs

# check result from all functions
results = []
for func in [from_origin, incremental, incremental_pandas]:
    random_df = get_random_sample(1000, p)
    new_df = random_df.copy(deep=True)
    results.append(func(new_df))

print('Passed' if all(np.allclose(r, results[0]) for r in results[1:])
      else 'Failed', 'implementation test')

timings = {}
for n in np.geomspace(10, 10000, 10):
    random_df = get_random_sample(int(n), p)
    timings[n] = collections.defaultdict(float)
    results = {}
    for func in ['incremental', 'from_origin', 'incremental_pandas']:
        timings[n][func] = (
            timeit.timeit(f'{func}(random_df.copy(deep=True))', number=r, globals=globals())
            / r
        )

timings = pd.DataFrame(timings).T
print(timings)

timings.plot()
plt.xlabel('size of array')
plt.ylabel('avg runtime (s)')
plt.ylim(0)
plt.grid(True)
plt.tight_layout()
plt.show()
plt.close('all')
I have a Pandas data frame with some categorical variables. Something like this -
>>df
'a', 'x'
'a', 'y'
Now, I want to return a matrix with the conditional probabilities of each level appearing with every other level. For the data frame above, it would look like -
[1, 0.5, 0.5],
[1, 1, 0],
[1, 0, 1]
The three entries correspond to the levels 'a', 'x' and 'y'.
This is because conditional on the first column being 'a', the probabilities of 'x' and 'y' appearing are 0.5 each and so on.
I have some code that does this (below). However, the problem is that it is excruciatingly slow. So slow that the application I want to use it in times out. Does anyone have any tips to make it faster?
df = pd.read_csv('pathToData.csv')
df = df.fillna("null")
cols = 0
col_levels = []
columns = {}
num = 0
for i in df.columns:
    cols += len(set(df[i]))
    col_levels.append(np.sort(list(set(df[i]))))
    for j in np.sort(list(set(df[i]))):
        columns[i + '_' + str(j)] = num
        num += 1
res = np.eye(cols)
for i in range(len(df.columns)):
    for j in range(len(df.columns)):
        if i != j:
            row_feature = df.columns[i]
            col_feature = df.columns[j]
            rowLevels = col_levels[i]
            colLevels = col_levels[j]
            for ii in rowLevels:
                for jj in colLevels:
                    frst = (df[row_feature] == ii) * 1
                    scnd = (df[col_feature] == jj) * 1
                    prob = sum(frst * scnd) / (sum(frst) + 1e-9)
                    frst_ind = columns[row_feature + '_' + ii]
                    scnd_ind = columns[col_feature + '_' + jj]
                    res[frst_ind, scnd_ind] = prob
EDIT: Here is a bigger example:
>>df
'a', 'x', 'l'
'a', 'y', 'l'
'b', 'x', 'l'
The distinct categories here are 'a', 'b', 'x', 'y' and 'l'. Since there are 5 categories, the output matrix should be 5x5. The first row, first column would be how often 'a' appears conditional on 'a'. This is of course 1 (as are all the diagonals). The first row, second column is the probability of 'b' conditional on 'a'. Since 'a' and 'b' are values of the same column, this is zero. The first row, third column is the probability of 'x' conditional on 'a'. We see that 'a' appears twice but only once with 'x', so this probability is 0.5. And so on.
The way I approach the problem is to first collect all unique levels in the dataset, then loop through the Cartesian product of those levels. At each step, filter the dataset down to the subset of rows where the condition holds, then count the rows in that subset where the event has happened. Below is my code.
import pandas as pd
from itertools import product
from collections import defaultdict
df = pd.DataFrame({
'col1': ['a', 'a', 'b'],
'col2': ['x', 'y', 'x'],
'col3': ['l', 'l', 'l']
})
levels = df.stack().unique()
res = defaultdict(dict)
for event, cond in product(levels, levels):
    # create a subset of rows with at least one element equal to cond
    conditional_set = df[(df == cond).any(axis=1)]
    conditional_set_size = len(conditional_set)
    # count the number of rows in the subset where at least one element is equal to event
    conditional_event_count = (conditional_set == event).any(axis=1).sum()
    res[event][cond] = conditional_event_count / conditional_set_size
result_df = pd.DataFrame(res)
print(result_df)
# OUTPUT
# a b l x y
# a 1.000000 0.000000 1.0 0.500000 0.500000
# b 0.000000 1.000000 1.0 1.000000 0.000000
# l 0.666667 0.333333 1.0 0.666667 0.333333
# x 0.500000 0.500000 1.0 1.000000 0.000000
# y 1.000000 0.000000 1.0 0.000000 1.000000
I am sure there are other, faster methods, but this is the first approach that came to mind.
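For example, since each level's presence in a row can be encoded as a 0/1 indicator, the whole table can be built from a single matrix product instead of a double loop. This is only a sketch of that idea (not benchmarked here, and it assumes, as in the example, that each level belongs to a single column):
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'col1': ['a', 'a', 'b'],
    'col2': ['x', 'y', 'x'],
    'col3': ['l', 'l', 'l']
})
# one 0/1 indicator column per level; drop the column prefixes so the
# labels are just the level names ('a', 'b', 'x', ...)
indicators = pd.get_dummies(df, prefix='', prefix_sep='').astype(int)
# co-occurrence counts: entry (i, j) is the number of rows containing both levels
cooc = indicators.T @ indicators
# divide each row by its diagonal entry to get P(column level | row level)
probs = cooc.div(np.diag(cooc), axis=0)
print(probs)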
I have a very simple query.
I have a csv that looks like this:
ID X Y
1 10 3
2 20 23
3 21 34
And I want to add a new column called Z which is equal to 1 if X is equal to or bigger than Y, or 0 otherwise.
My code so far is:
import pandas as pd
data = pd.read_csv("XYZ.csv")
for x in data["X"]:
if x >= data["Y"]:
Data["Z"] = 1
else:
Data["Z"] = 0
You can do this without using a loop by using ge (which means greater than or equal to) and casting the boolean array to int using astype:
In [119]:
df['Z'] = (df['X'].ge(df['Y'])).astype(int)
df
Out[119]:
ID X Y Z
0 1 10 3 1
1 2 20 23 0
2 3 21 34 0
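If you prefer spelling the comparison out with the operator, this is an equivalent form (a small self-contained check):
import pandas as pd
df = pd.DataFrame({'ID': [1, 2, 3], 'X': [10, 20, 21], 'Y': [3, 23, 34]})
df['Z'] = (df['X'] >= df['Y']).astype(int)  # same result as df['X'].ge(df['Y'])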
Regarding your attempt:
for x in data["X"]:
if x >= data["Y"]:
Data["Z"] = 1
else:
Data["Z"] = 0
it wouldn't work: firstly, you're using Data, not data; secondly, even with that fixed, you'd be comparing a scalar against an entire column, and using the resulting boolean Series in an if statement raises a ValueError because its truth value is ambiguous; thirdly, you're assigning to the entire column each time, overwriting it.
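A quick way to see the second point (a tiny sketch with made-up values, independent of your CSV):
import pandas as pd
data = pd.DataFrame({'X': [10, 20, 21], 'Y': [3, 23, 34]})
comparison = 10 >= data['Y']  # a boolean Series, not a single True/False
print(comparison)
# if comparison:              # would raise ValueError: the truth value of a Series is ambiguous
#     ...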
You need to access the index label, which your loop didn't; you can use iteritems to do this (on pandas 2.0 and later, iteritems has been removed, so use items instead):
In [125]:
for idx, x in df["X"].iteritems():
if x >= df['Y'].loc[idx]:
df.loc[idx, 'Z'] = 1
else:
df.loc[idx, 'Z'] = 0
df
Out[125]:
ID X Y Z
0 1 10 3 1
1 2 20 23 0
2 3 21 34 0
But really this is unnecessary as there is a vectorised method here
Firstly, note a small slip in your code: you capitalized your dataframe name as 'Data' instead of keeping it 'data'.
However, for efficient code, EdChum has a great answer above. Or another vectorised method that is similar in efficiency but with code that may be easier to remember:
import numpy as np
data['Z'] = np.where(data.X >= data.Y, 1, 0)
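Applied to the sample data from the question, that one-liner gives (a quick self-contained check, assuming the CSV was read as shown):
import numpy as np
import pandas as pd
data = pd.DataFrame({'ID': [1, 2, 3], 'X': [10, 20, 21], 'Y': [3, 23, 34]})
data['Z'] = np.where(data.X >= data.Y, 1, 0)
print(data)
#    ID   X   Y  Z
# 0   1  10   3  1
# 1   2  20  23  0
# 2   3  21  34  0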