Count and position a column (python3)

Using python3, I am working with a DataFrame with 2 columns: the first column is made of dates, for example from 2017-12-18 to 2017-12-25 (but it can change based on the user input), and the second column holds the related data. What I need is to add a third column in the first position of the DataFrame, like this:
1;2017-12-18;Data1
2;2017-12-19;Data2
n;last_date;DataN
import numpy as np

def func(df):
    count = np.count_nonzero(df['Data_column'])
    l = np.array([x for x in range(count)])
    df['day_count'][0] = l
    return l
But it keeps giving me wrong results. What am I doing wrong?
Thanks in advance!
P.s. I used count_nonzero because I can sometimes have N/A or blank values in the time series.
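A minimal sketch of one way to get that numbering, assuming hypothetical column names 'date' and 'data', is to let df.insert place a 1-based counter in the first position:

import pandas as pd

# Hypothetical sample frame matching the description above
df = pd.DataFrame({
    'date': pd.date_range('2017-12-18', '2017-12-25'),
    'data': [f'Data{i}' for i in range(1, 9)],
})
# Insert a 1-based day counter as the first column
df.insert(0, 'day_count', list(range(1, len(df) + 1)))
print(df)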

Related

Finding euclidean distance from multiple mean vectors

This is what I am trying to do. I was able to do steps 1 to 4; I need help with step 5 onward. Basically, for each data point I would like to find the euclidean distance from all mean vectors based upon column y:
1. take the data
2. separate out the non-numerical columns
3. find the mean vectors by the y column
4. save the means
5. subtract each mean vector from each row based upon its y value
6. square each column
7. add all columns
8. join back to the numerical dataset, and then join the non-numerical columns
import pandas as pd

data = [['Alex',10,5,0],['Bob',12,4,1],['Clarke',13,6,0],['brke',15,1,0]]
df = pd.DataFrame(data, columns=['Name','Age','weight','class'], dtype=float)
print(df)
df_numeric = df.select_dtypes(include='number')
df_non_numeric = df.select_dtypes(exclude='number')
means = df_numeric.groupby('class').mean()
For each row of means, subtract that row from each row of df_numeric, then take the square of each column in the output, and then for each row add all the columns. Then join this data back to df_numeric and df_non_numeric.
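As a rough sketch, steps 5 to 7 can also be done with numpy broadcasting on the means computed above (this yields each row's squared distance to every class mean):

import numpy as np

feats = df_numeric.drop(columns='class')
# Pair each row of feats with each row of means via broadcasting:
# shape (n_rows, 1, n_feats) - shape (1, n_classes, n_feats)
diff = feats.to_numpy()[:, None, :] - means.to_numpy()[None, :, :]
sq_dist = (diff ** 2).sum(axis=2)  # shape (n_rows, n_classes)
dist_df = pd.DataFrame(sq_dist, index=feats.index,
                       columns=[f'distance{c}' for c in range(means.shape[0])])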
-------------- Update 1
I added code as below. My questions have changed; the updated questions are at the end.
import numpy as np

def calculate_distance(row):
    return np.sum(np.square(row - means.head(1)), axis=1)

def calculate_distance2(row):
    return np.sum(np.square(row - means.tail(1)), axis=1)

df_numeric2 = df_numeric.drop("class", axis=1)
# np.sum(np.square(df_numeric2.head(1) - means.head(1)), axis=1)
df_numeric2['distance0'] = df_numeric.apply(calculate_distance, axis=1)
df_numeric2['distance1'] = df_numeric.apply(calculate_distance2, axis=1)
print(df_numeric2)
final = pd.concat([df_non_numeric, df_numeric2], axis=1)
final["class"] = df["class"]
Could anyone confirm that this is a correct way to achieve the results? I am mainly concerned about the last two statements: would the second-to-last statement do a correct join, and would the final statement assign the original class? I would like to confirm that pandas won't do the concat and the class assignment in a random order, and that it maintains the order in which the rows appear:
final = pd.concat([df_non_numeric, df_numeric2], axis=1)
final["class"]=df["class"]
I think this is what you want
import pandas as pd
import numpy as np

data = [['Alex',10,5,0],['Bob',12,4,1],['Clarke',13,6,0],['brke',15,1,0]]
df = pd.DataFrame(data, columns=['Name','Age','weight','class'], dtype=float)
print(df)
df_numeric = df.select_dtypes(include='number')
# Make df_non_numeric a copy and not a view
df_non_numeric = df.select_dtypes(exclude='number').copy()
# Subtract the per-class mean (calculated with the transform function, which
# preserves the number of rows) to get the distance to the mean on each axis
df_dist_to_mean = df_numeric[['Age', 'weight']] - df_numeric[['Age', 'weight', 'class']].groupby('class').transform('mean')
# Finally calculate the euclidean distance (hypotenuse)
df_non_numeric['euc_dist'] = np.hypot(df_dist_to_mean['Age'], df_dist_to_mean['weight'])
df_non_numeric['class'] = df_numeric['class']
# If you want a separate dataframe named 'final' with the end result
df_final = df_non_numeric.copy()
print(df_final)
It is probably possible to write this even more densely, but this way you'll see what's going on.
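For reference, with the sample data above df_final should come out approximately as:

     Name  euc_dist  class
0    Alex  2.848001    0.0
1     Bob  0.000000    1.0
2  Clarke  2.027588    0.0
3    brke  3.800585    0.0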
I'm sure there is a better way to do this, but I iterated through depending on the class and followed the exact steps:
1. Assigned 'class' as the index.
2. Rotated so that 'class' was in the columns.
3. Performed the operation with the means that corresponded with df_numeric.
4. Squared the values.
5. Summed the rows.
6. Concatenated the dataframes back together.
import pandas as pd
import numpy as np

data = [['Alex',10,5,0],['Bob',12,4,1],['Clarke',13,6,0],['brke',15,1,0]]
df = pd.DataFrame(data, columns=['Name','Age','weight','class'], dtype=float)
# print(df)
df_numeric = df.select_dtypes(include='number')
df_non_numeric = df.select_dtypes(exclude='number')
means = df_numeric.groupby('class').mean().T
# Changed the index to 'class'
df_numeric.index = df_numeric['class']
df_numeric.drop('class', axis=1, inplace=True)
# Rotated the numeric data sideways so the class was in the columns
df_numeric = df_numeric.T
# Iterate through the classes in means and pick the matching df_numeric columns
store = []  # empty list to collect the per-class differences
for j in means:
    sto = df_numeric[j]
    if isinstance(sto, pd.Series):  # a single match comes out as a pd.Series
        sto = sto.to_frame()        # so convert it to a DataFrame
    store.append(sto.sub(means[j], axis=0))  # subtract the class mean vector
values = [s ** 2 for s in store]  # squaring the values
# Summing the rows
summed = []
for i in values:
    summed.append(i.sum(axis=1))
df_new = pd.concat(summed, axis=1)
print(df_new.T)

Comparing dates and filling a new column based on this condition

I want to check if one date is between two other dates (everything in the same row). If it is, a new column should be filled with the sales value from the same row; if not, the row should be dropped.
The code should iterate over the entire dataframe.
This is my code:
for row in final:
    x = 0
    if pd.to_datetime(final['start_date'].iloc[x]) < pd.to_datetime(final['purchase_date'].iloc[x]) < pd.to_datetime(final['end_date'].iloc[x]):
        final['new_col'].iloc[x] = final['sales'].iloc[x]
    else:
        final.drop(final.iloc[x])
    x = x + 1
print(final['new_col'])
Instead of the values of final['sales'] I just get 0 back.
Does anyone know where the mistake is, or any other efficient way to tackle this?
The DataFrame looks like this: (screenshot from the original post not reproduced here)
I would do something like this.
First, create the new column:
import numpy as np

# 'between' check: start_date < purchase_date < end_date
final['new_col'] = np.where(
    (pd.to_datetime(final['start_date']) < pd.to_datetime(final['purchase_date']))
    & (pd.to_datetime(final['purchase_date']) < pd.to_datetime(final['end_date'])),
    final['sales'],
    np.nan,
)
Then, you just drop the NaNs:
final.dropna(inplace=True)
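As a quick sanity check, here is a sketch with made-up data (column names taken from the question); only the first row survives the dropna:

import pandas as pd
import numpy as np

final = pd.DataFrame({
    'start_date': ['2021-01-01', '2021-02-01'],
    'purchase_date': ['2021-01-05', '2021-01-05'],
    'end_date': ['2021-01-31', '2021-02-28'],
    'sales': [100.0, 200.0],
})
final['new_col'] = np.where(
    (pd.to_datetime(final['start_date']) < pd.to_datetime(final['purchase_date']))
    & (pd.to_datetime(final['purchase_date']) < pd.to_datetime(final['end_date'])),
    final['sales'],
    np.nan,
)
final.dropna(inplace=True)
print(final)  # keeps only the row whose purchase date falls in the window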

Python Pandas: fill a column using values from rows at earlier timestamps

I have a dataframe df where one column is a timestamp and one is A. Column A contains decimals.
I would like to add a new column B and fill it with the current value of A divided by the value of A one minute earlier. That is:
df['B'][t] = df['A'][t] / df['A'][t - 1 min]
NOTE: The data does not come in exactly every 1 minute, so "the row one minute earlier" means the row whose timestamp is closest to (current - 1 minute).
Here is how I do it:
First, I use the timestamp as the index in order to use get_loc, and create a new dataframe new_df starting from 1 minute after df. This way I'm sure I have all the data when I look 1 minute earlier within the first minute of data.
new_df = df.loc[df['timestamp'] > df.timestamp[0] + delta]  # delta = 1 min timedelta
values = []
for index, row in new_df.iterrows():
    v = row.A / df.iloc[df.index.get_loc(row.timestamp - delta, method='nearest')]['A']
    values.append(v)
v_ser = pd.Series(values)
new_df['B'] = v_ser.values
I'm afraid this is not that great: it takes a long time for large dataframes. Also, I am not 100% sure it is completely correct. Sometimes I get this message:
A value is trying to be set on a copy of a slice from a DataFrame. Try
using .loc[row_indexer,col_indexer] = value instead
What is the best / most efficient way to do the task above? Thank you.
PS. If someone can think of a better title please let me know. It took me longer to write the title than the post and I still don't like it.
You could try .asof(), provided the DataFrame is indexed by the timestamps (if not, use .set_index() first). Here is a simple example:
import pandas as pd
import numpy as np

n_vals = 50
# Create a DataFrame with random values and 'unusual' (non-round) times
df = pd.DataFrame(data=np.random.randint(low=1, high=6, size=n_vals),
                  index=pd.date_range(start=pd.Timestamp.now(),
                                      freq='23s', periods=n_vals),
                  columns=['value'])
# Demonstrate how to use .asof() to get the value that was the 'state'
# one minute before each index entry. Note the .values call
df['value_one_min_ago'] = df['value'].asof(df.index - pd.Timedelta('1m')).values
# Note that there will be some NaNs to deal with; consider .fillna()

Delete series value from row of a pandas data frame based on another data frame value

My question is a little bit different from the question posted here, so I thought to open a new thread. I have a pandas data frame with 5 attributes. One of these attributes is created using a pandas Series. Here is the sample code for creating the data frame:
import pandas as pd
import numpy as np

mydf1 = pd.DataFrame(columns=['group','id','name','mail','gender'])
data = np.array([2540948, 2540955, 2540956, 2540956, 7138932])
x = pd.Series(data)
mydf1.loc[0] = [1, x, 'abc', 'abc#xyz.com', 'male']
I have another data frame; the code for creating it is given below:
mydf2 = pd.DataFrame(columns=['group','id'])
data1 = np.array([2540948, 2540955, 2540956])
y = pd.Series(data1)
mydf2.loc[0] = [1, y]
These are sample data; the actual data will have a large number of rows, and the series length is large too. I want to match mydf1 with mydf2 (sometimes I won't have a matching element in mydf2) and, where it matches, delete from mydf1 the id values that are present in mydf2. For example, after the run the id for group 1 will be 2540956, 7138932. I also tried the code mentioned in the link above, but for the first line
counts = mydf1.groupby('id').cumcount()
I got the error message
TypeError: 'Series' objects are mutable, thus they cannot be hashed
in my Python 3.x. Can you please suggest how to solve this?
This should work. We use Counter to find the difference between the two lists of ids. (P.s. this approach does not require the difference to be in order.)
Setup
import pandas as pd
import numpy as np
from collections import Counter

mydf1 = pd.DataFrame(columns=['group','id','name','mail','gender'])
x = [2540948, 2540955, 2540956, 2540956, 7138932]
y = [2540948, 2540955, 2540956, 2540956, 7138932]
mydf1.loc[0] = [1, x, 'abc', 'abc#xyz.com', 'male']
mydf1.loc[1] = [2, y, 'def', 'def#xyz.com', 'female']
mydf2 = pd.DataFrame(columns=['group','id'])
x2 = np.array([2540948, 2540955, 2540956])
y2 = np.array([2540955, 2540956])
mydf2.loc[0] = [1, x2]
mydf2.loc[1] = [2, y2]
Code
mydf3 = mydf1[["group", "id"]]
mydf3 = mydf3.merge(mydf2, how="inner", on="group")
new_id_finder = lambda x: list((Counter(x.id_x) - Counter(x.id_y)).elements())
mydf3["new_id"] = mydf3.apply(new_id_finder, axis=1)
mydf3[["group", "new_id"]]

   group                       new_id
0      1           [2540956, 7138932]
1      2  [2540948, 2540956, 7138932]
One Counter object can subtract another to get the difference in occurrences of elements. Then you can use the elements() function to retrieve all the values left.
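In isolation, the Counter arithmetic looks like this (using the group-1 ids from the setup above):

from collections import Counter

a = Counter([2540948, 2540955, 2540956, 2540956, 7138932])
b = Counter([2540948, 2540955, 2540956])
print(list((a - b).elements()))  # [2540956, 7138932]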

Perform operation on each cell of one column with each cell of other columns in pandas

I have a data frame (df) consisting of more than 1000 columns. Each cell contains a list, e.g.:

       0        1        2    ...  n
0  [1,2,3]  [3,7,9]  [1,2,1]  ...  [x,y,z]
1  [2,5,6]  [2,3,1]  [3,3,3]  ...  [x1,y1,z1]
2     None  [2,0,1]  [2,2,2]  ...  [x2,y2,z2]
3     None  [9,5,9]     None  ...  None
Each list actually holds the coordinates of a point. I need to find the euclidean distance of each cell in column 0 to every cell in column 1 and store the minimum.
Similarly from column 0 to column 2, then to column 3, and so on.
Example
distance of df[0][0] from df[1][0], df[1][1], df[1][2]
then of df[0][1] from df[1][0], df[1][1], df[1][2] and so on...
Currently I am doing it with for loops, but it is taking a lot of time for large data.
Following is the implementation:
from scipy.spatial import distance

for n in range(len(df.columns)):
    for m in range(n + 1, len(df.columns)):
        for q in range(df.shape[0]):
            min1 = 9999
            r = 0
            while r < df.shape[0]:
                if df[n][q] is not None and df[m][r] is not None:
                    dist = distance.euclidean(df[n][q], df[m][r])
                    if dist < min1:
                        min1 = dist
                    if min1 == 0:  # because distance can never be less than zero
                        break
                r = r + 1
Is there any other way to do this?
You can use pandas apply. The example below computes the distance between columns 0 and 1 and creates a new column:
import pandas as pd
import numpy as np

df = pd.DataFrame({'0': [[1,2,3], [2,5,6]], '1': [[3,7,9], [2,3,1]]})

def euclidean(row):
    # absolute difference along each coordinate
    x = np.sqrt((row['0'][0] - row['1'][0]) ** 2)
    y = np.sqrt((row['0'][1] - row['1'][1]) ** 2)
    z = np.sqrt((row['0'][2] - row['1'][2]) ** 2)
    return [x, y, z]

df['dist0-1'] = df.apply(euclidean, axis=1)
With this said, I highly recommend unrolling your variables into separate columns (e.g. 0.x, 0.y, 0.z, etc.) and then using numpy to operate directly on the columns. This will be much faster if you have a large amount of data.
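For instance, a rough sketch of that vectorized approach on the same toy frame (here the distance is collapsed to a single euclidean number per row via np.linalg.norm, rather than the per-coordinate list above):

import pandas as pd
import numpy as np

df = pd.DataFrame({'0': [[1,2,3], [2,5,6]], '1': [[3,7,9], [2,3,1]]})
# Unroll the list cells into plain 2-D float arrays, one row per point
a = np.array(df['0'].tolist(), dtype=float)  # shape (n, 3)
b = np.array(df['1'].tolist(), dtype=float)
# Vectorized euclidean distance between the paired rows of the two columns
df['dist0-1'] = np.linalg.norm(a - b, axis=1)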
Since you don't say how to handle None in the distance calculation, I'll just give an example; you will need to handle the None type yourself.
import pandas as pd
import numpy as np
from scipy.spatial.distance import pdist

df = pd.DataFrame([{'a': [1,2,3], 'b': [3,7,9], 'c': [1,2,1]},
                   {'a': [2,5,6], 'b': [2,3,1], 'c': [3,3,3]}])
# pdist gives the pairwise distances between the points in each row;
# result_type='expand' spreads them into columns for a row-wise min
df.apply(lambda row: pdist(np.vstack(row)), axis=1, result_type='expand').apply(min, axis=1)
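With this sample frame, the row-wise minima come out to 2.0 for the first row (pair a-c) and roughly 2.236 for the second (pair b-c).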
