I currently have a dataframe as shown below:
This dataframe has 1 million rows. I would like to perform the following operation:
Say for row 0, b is 6.
I would like to create another column, c.
The value of c for row 0 is computed as the mean of a over all rows (i.e. the 8 rows in the image above) whose b lies in the range 6-3 to 6+3 (here 3 is a fixed number for all rows).
Currently I perform this operation by converting columns a and b to numpy arrays and then looping. The code is attached below:
import time

import numpy as np
import pandas as pd

index = range(0, 1000000)
columns = ['A', 'B']
data = np.array([np.random.randint(10, size=1000000),
                 np.random.randint(10, size=1000000)]).T
df = pd.DataFrame(data, index=index, columns=columns)

values_b = df.B.values
values_a = df.A.values

sum_array = []
program_starts = time.time()
for i in range(df.shape[0]):
    value_index = [values_b[i] - 3, values_b[i] + 3]
    sum_array.append(np.sum(values_a[value_index]))
time_now = time.time()
print('time taken ', time_now - program_starts)
This code is taking around 8 seconds to run.
How can I make this run faster? I tried to parallelize the task by splitting the array into chunks of 0.1 million rows and then running this for loop in parallel on each chunk, but it takes even more time. Any help would be appreciated.
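If the goal is the windowed mean described above (the mean of A over all rows whose B falls within ±3 of the current row's B), rather than the two-index lookup the posted loop actually performs, a minimal vectorized sketch is possible, assuming B holds small non-negative integers as in the example:

import numpy as np

# per-value sums and counts of A for each distinct B value (B is 0-9 here)
sums = np.bincount(values_b, weights=values_a)
counts = np.bincount(values_b)

# prefix sums turn any B-range query into two lookups
csum = np.concatenate(([0.0], np.cumsum(sums)))
ccnt = np.concatenate(([0], np.cumsum(counts)))

lo = np.clip(values_b - 3, 0, len(counts) - 1)
hi = np.clip(values_b + 3, 0, len(counts) - 1)

# mean of A over all rows whose B lies in [B_i - 3, B_i + 3]
df['C'] = (csum[hi + 1] - csum[lo]) / (ccnt[hi + 1] - ccnt[lo])

This replaces the per-row Python loop with a handful of whole-array operations, so it should finish in a fraction of a second on a million rows.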
I'm new to Python and Stack Overflow, so please forgive the bad edit on this question.
I have a df with 11 columns and 3 108 730 rows.
Columns 1 and 2 represent the X and Y (mathematical) coordinates, respectively and the other columns represent different frequencies in Hz.
The df looks like this:
df before adjustment
I want to plot this df in ArcGIS, but for that I need to replace the (mathematical) coordinates that currently exist with the real-life geographical coordinates.
The trick is that I was only given the first geographical coordinate which is x=1055000 and y=6315000.
The other rows in columns 1 and 2 should be replaced by adding 5 to the previous row value so for example, for the x coordinates it should be 1055000, 1055005, 1055010, 1055015, .... and so on.
I have written two for loops that replace the values accordingly, but my problem is that they take much too long to run because of the size of the df. I haven't got a result after several hours, because I used the row number as the range, like this:
for i in range(0, 3108729):
    if i == 0:
        df.at[i, 'IDX'] = 1055000
    else:
        df.at[i, 'IDX'] = df.at[i-1, 'IDX'] + 5
df.head()
and like this for the y coordinates:
for j in range(0, 3108729):
    if j == 0:
        df.at[j, 'IDY'] = 6315000
    else:
        df.at[j, 'IDY'] = df.at[j-1, 'IDY'] + 5
df.head()
I have run the loops as a test with range(0, 5) and they work, but I'm sure there is a way to replace the coordinates in a more time-efficient manner without having to define a range. I appreciate any help!
You can just build a range series in one go, no need to iterate:
df.loc[:, 'IDX'] = 1055000 + pd.Series(range(len(df))) * 5
df.loc[:, 'IDY'] = 6315000 + pd.Series(range(len(df))) * 5
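Equivalently, assuming numpy is already imported, np.arange avoids building an intermediate Series from a Python range:

import numpy as np

df['IDX'] = 1055000 + np.arange(len(df)) * 5
df['IDY'] = 6315000 + np.arange(len(df)) * 5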
I have a large Pandas dataframe, 24'000'000 rows × 6 columns plus index.
I need to read an integer in column 1 (which is 1 or 2), then force the value in column 3 to be negative if column 1 is 1, or positive if it is 2. I use the following code in a Jupyter notebook:
for i in range(1000):
    if df.iloc[i, 1] == 1:
        df.iloc[i, 3] = abs(df.iloc[i, 3]) * (-1)
    if df.iloc[i, 1] == 2:
        df.iloc[i, 3] = abs(df.iloc[i, 3])
The code above takes 2 min 30 sec to run for 1'000 rows only. For the 24M rows, it would take 41 days to complete!
Something is not right. The code runs in Jupyter Notebook/Chrome/Windows on a pretty high-end PC.
The Pandas dataframe is created with pd.read_csv and then sorted and indexed this way:
df.sort_values(by = "My_time_stamp", ascending=True,inplace = True)
df = df.reset_index(drop=True)
The creation and sorting of the dataframe just takes a few seconds. I have other calculations to perform on this dataframe, so I clearly need to understand what I'm doing wrong.
Use np.where:
a = np.where(df.iloc[:, 1].to_numpy() == 1, -1, 1)
b = np.abs(df.iloc[:, 3].to_numpy())
df.iloc[:, 3] = a * b
Vectorize it:
df.iloc[:, 3] = df.iloc[:, 3].abs() * (2 * (df.iloc[:, 1] != 1) - 1)
Explanation:
Treated as integers, the boolean series df.iloc[:, 1] != 1 becomes ones and zeroes. Multiplied by 2, it becomes twos and zeroes. After subtracting one, it becomes -1 where column 1 equals 1, and 1 otherwise. Finally, it is multiplied by the absolute value of the third column, which enforces the sign.
Vectorization typically provides a speedup of one to two orders of magnitude compared to for loops.
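For illustration, a toy example of the boolean-to-sign arithmetic described above (not part of the original answer):

import pandas as pd

col1 = pd.Series([1, 2, 2, 1])
sign = 2 * (col1 != 1) - 1   # -1 where col1 == 1, +1 otherwise
print(sign.tolist())         # [-1, 1, 1, -1]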
Use
df.iloc[:, 3] = df.iloc[:, 3].abs().mul(df.iloc[:, 1].map({2: 1, 1: -1}))
Another way to do this:
import pandas as pd
Take an example data set:
df = pd.DataFrame({'x1':[1,2,1,2], 'x2':[4,8,1,2]})
Make new column, code values as -1 and +1:
df['nx1'] = df['x1'].replace({1:-1, 2:1})
Multiply columnwise:
df['nx1'] * df['x2']
I would like to do fuzzy matching where I match strings from a column of a large dataframe (130.000 rows) to a list (400 rows).
The code I wrote was tested on a small sample (matching 3000 rows to 400 rows) and works fine. It is too large to copy here but it roughly works like this:
1) data normalization of columns
2) create Cartesian product of columns and calculate Levenshtein distance
3) select highest scoring matches and store 'large_csv_name' in a separate list.
4) compare list of 'large_csv_names' to 'large_csv', pull out all the intersecting data and write to a csv.
Because the Cartesian product contains over 50 million records I quickly run into memory errors.
That's why I would like to know how to divide the large dataset up in chunks on which I can then run my script.
So far I have tried:
df_split = np.array_split(df, x)  # e.g. x = 50 or 500
for i in df_split:
    (steps 1-4 as above)
As well as:
for chunk in pd.read_csv('large_csv.csv', chunksize=x):  # e.g. x = 50 or 500
    (steps 1-4 as above)
None of these methods seems to work. I would like to know how to run the fuzzy matching in chunks: cut the large csv into pieces, take a piece, run the code, take the next piece, run the code, and so on.
In the meantime I wrote a script that slices a dataframe into chunks, each of which is then ready to be processed further. Since I'm new to Python the code is probably a bit messy, but I still wanted to share it with those who might be stuck with the same problem as I was.
import math
import pandas as pd

partitions = 3  # number of ways to split df
length = len(df)
list_index = list(df.index.values)
counter = 0  # var that will be used to stop slicing when df ends
block_counter0 = 0  # begin index of the current slice
block_counter1 = block_counter0 + math.ceil(length / partitions)  # end index of the current slice

while counter < int(len(list_index)):  # stop slicing when df ends
    df1 = df.iloc[block_counter0:block_counter1]  # temp df that forms the chunk
    for i in range(block_counter0, block_counter1):
        # insert operations on the rows of df1 here
        counter += 1  # increase counter by 1 to stop slicing in time
    block_counter0 = block_counter1  # when the for loop ends, the indices are updated
    if block_counter0 + math.ceil(length / partitions) > int(len(list_index)):
        block_counter1 = len(list_index)
        counter += 1
    else:
        block_counter1 = block_counter0 + math.ceil(length / partitions)
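For comparison, a more compact sketch of the same chunk-and-process pattern; process_chunk is a hypothetical placeholder for steps 1-4 of the matching script:

import numpy as np
import pandas as pd

def process_chunk(chunk: pd.DataFrame) -> pd.DataFrame:
    # placeholder: normalize, compute Levenshtein scores,
    # keep the best matches and return the matching rows
    return chunk

# split the row positions into 50 roughly equal chunks
results = [process_chunk(df.iloc[idx]) for idx in np.array_split(np.arange(len(df)), 50)]
matched = pd.concat(results)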
On a pandas dataframe, I know I can groupby on one or more columns and then filter values that occur more/less than a given number.
But I want to do this on every column on the dataframe. I want to remove values that are too infrequent (let's say that occur less than 5% of times) or too frequent. As an example, consider a dataframe with following columns: city of origin, city of destination, distance, type of transport (air/car/foot), time of day, price-interval.
import string

import numpy as np
import pandas as pd

vals = [(c, np.random.choice(list(string.ascii_lowercase), 100, replace=True)) for c in
        ['city of origin', 'city of destination',
         'distance, type of transport (air/car/foot)', 'time of day, price-interval']]
df = pd.DataFrame(dict(vals))
>> df.head()
city of destination city of origin distance, type of transport (air/car/foot) time of day, price-interval
0 f p a n
1 k b a f
2 q s n j
3 h c g u
4 w d m h
If this is a big dataframe, it makes sense to remove rows that have spurious items, for example, if time of day = night occurs only 3% of the time, or if foot mode of transport is rare, and so on.
I want to remove all such values from all columns (or a list of columns). One idea I have is to do a value_counts on every column, transform and add one column for each value_counts; then filter based on whether they are above or below a threshold. But I think there must be a better way to achieve this?
This procedure will go through each column of the DataFrame and eliminate rows where the given category is less than a given threshold percentage, shrinking the DataFrame on each loop.
This answer is similar to that provided by @Ami Tavory, but with a few subtle differences:
It normalizes the value counts so you can just use a percentile threshold.
It calculates counts just once per column instead of twice. This results in faster execution.
Code:
threshold = 0.03
for col in df:
    counts = df[col].value_counts(normalize=True)
    df = df.loc[df[col].isin(counts[counts > threshold].index), :]
Code timing:
df2 = pd.DataFrame(np.random.choice(list(string.ascii_lowercase), (1000000, 4), replace=True),
                   columns=list('ABCD'))
%%timeit df=df2.copy()
threshold = 0.03
for col in df:
    counts = df[col].value_counts(normalize=True)
    df = df.loc[df[col].isin(counts[counts > threshold].index), :]
1 loops, best of 3: 485 ms per loop
%%timeit df=df2.copy()
m = 0.03 * len(df)
for c in df:
    df = df[df[c].isin(df[c].value_counts()[df[c].value_counts() > m].index)]
1 loops, best of 3: 688 ms per loop
I would go with one of the following:
Option A
m = 0.03 * len(df)
df[np.all(
    df.apply(
        lambda c: c.isin(c.value_counts()[c.value_counts() > m].index).to_numpy()),
    axis=1)]
Explanation:
m = 0.03 * len(df) is the threshold (it's nice to take the constant out of the complicated expression)
df[np.all(..., axis=1)] retains the rows where some condition was obtained across all columns.
df.apply(...) applies a function to all columns; .to_numpy() inside the lambda converts each boolean result to a plain array, so the overall result forms a matrix.
c.isin(...) checks, for each column item, whether it is in some set.
c.value_counts()[c.value_counts() > m].index is the set of all values in a column whose count is above m.
Option B
m = 0.03 * len(df)
for c in df.columns:
    df = df[df[c].isin(df[c].value_counts()[df[c].value_counts() > m].index)]
The explanation is similar to the one above.
Tradeoffs:
Personally, I find B more readable.
B creates a new DataFrame for each filtering of a column; for large DataFrames, it's probably more expensive.
I am new to Python and using Pandas. I came up with the following solution below. Maybe other people might have a better or more efficient approach.
Assuming your DataFrame is DF, you can use the code below to filter out all infrequent values. Just be sure to update the col and bin_freq variables. DF_Filtered is your new filtered DataFrame.
# Column you want to filter
col = 'time of day'

# Set the frequency to filter out. Currently set to 5%
bin_freq = float(5)/float(100)

DF_Filtered = pd.DataFrame()
for i in DF[col].unique():
    counts = DF[DF[col] == i].count()[col]
    total_counts = DF[col].count()
    freq = float(counts)/float(total_counts)
    if freq > bin_freq:
        DF_Filtered = pd.concat([DF[DF[col] == i], DF_Filtered])

print(DF_Filtered)
DataFrames also support clip(lower=..., upper=...), which replaces the older clip_lower and clip_upper methods; note that it caps values below or above the given thresholds rather than removing them.
We can also replace all the rare categories with one label, say "Rare", and drop them later if this doesn't add value to prediction.
# find the labels that occur more often than a given percentage/threshold
def get_freq_labels(df, var, rare_perc):
    df = df.copy()
    tmp = df.groupby(var)[var].count() / len(df)
    return tmp[tmp > rare_perc].index

vars_cat = [val for val in data.columns if data[val].dtype == 'O']

for var in vars_cat:
    # find the frequent categories
    frequent_cat = get_freq_labels(data, var, 0.05)
    # replace rare categories with the string "Rare"
    data[var] = np.where(data[var].isin(frequent_cat), data[var], 'Rare')
I've got a fairly large data set of about 2 million records, each of which has a start time and an end time. I'd like to insert a field into each record that counts how many records there are in the table where:
Start time is less than or equal to "this row"'s start time
AND end time is greater than "this row"'s start time
So basically each record ends up with a count of how many events, including itself, are "active" concurrently with it.
I've been trying to teach myself pandas to do this, but I'm not even sure where to start looking. I can find lots of examples of summing rows that meet a given condition like "> 2", but can't seem to grasp how to iterate over rows to conditionally sum a column based on values in the current row.
You can try the code below to get the final result.
import pandas as pd
import numpy as np

df = pd.DataFrame(np.array([[2, 10], [5, 8], [3, 8], [6, 9]]), columns=["start", "end"])

active_events = {}
for i in df.index:
    active_events[i] = len(df[(df["start"] <= df.loc[i, "start"]) & (df["end"] > df.loc[i, "start"])])

last_columns = pd.DataFrame({'No. active events': pd.Series(active_events)})
df.join(last_columns)
Here goes. This is going to be SLOW.
Note that this counts each row as overlapping with itself, so the results column will never be 0. (Subtract 1 from the result to do it the other way.)
import pandas as pd

df = pd.DataFrame({'start_time': [4, 3, 1, 2], 'end_time': [7, 5, 3, 8]})
df = df[['start_time', 'end_time']]  # just changing the order of the columns for aesthetics

def overlaps_with_row(row, frame):
    starts_before_mask = frame.start_time <= row.start_time
    ends_after_mask = frame.end_time > row.start_time
    return (starts_before_mask & ends_after_mask).sum()

df['number_which_overlap'] = df.apply(overlaps_with_row, frame=df, axis=1)
Yields:
In [8]: df
Out[8]:
start_time end_time number_which_overlap
0 4 7 3
1 3 5 2
2 1 3 1
3 2 8 2
[4 rows x 3 columns]
def counter(s: pd.Series):
    return ((df["start"] <= s["start"]) & (df["end"] >= s["start"])).sum()

df["count"] = df.apply(counter, axis=1)
This feels like a much simpler approach, using the apply method. It doesn't compromise much on speed: apply, although not as fast as vectorized functions like cumsum(), should still be faster than an explicit for loop.
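For larger frames, both apply-based answers above are still quadratic, since every row scans the whole frame. A hedged sketch of an O(n log n) alternative with np.searchsorted, using the strict end > start convention from the question: sort the start and end times once, then for each row count how many events have started at or before its start time and subtract how many have already ended.

import numpy as np
import pandas as pd

df = pd.DataFrame({'start_time': [4, 3, 1, 2], 'end_time': [7, 5, 3, 8]})

starts = np.sort(df['start_time'].to_numpy())
ends = np.sort(df['end_time'].to_numpy())
query = df['start_time'].to_numpy()

# events with start <= s, minus events that have already finished (end <= s)
n_started = np.searchsorted(starts, query, side='right')
n_finished = np.searchsorted(ends, query, side='right')
df['number_which_overlap'] = n_started - n_finished   # [3, 2, 1, 2] for this toy frame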