I need help. I have to merge these DataFrames (examples below) by adding a new column holding percentages.
If the value is < 5000 the result is NaN, if 5000 < value <= 7000 it's 5%, if 7000 < value <= 10000 it's 7%, and so on.
import pandas as pd
levels = pd.DataFrame({'lev':[5000,7000,10000],'proc':['5%','7%','10%']})
data = pd.DataFrame({'name':['A','B','C','D','E','F','G'],'sum':[6500,3000,15000,1400,8600,5409,9999]})
Here are my efforts to solve this... It doesn't work, and I don't understand why:
temp = data[data['sum'] >= levels['lev'][2]]
temp['proc']=levels['proc'][2]
lev3 = temp
temp = data[levels['lev'][1]<=data['sum'] and data['sum']<=levels['lev'][2]]
temp['proc']=levels['proc'][1]
lev2 = temp
temp = data[levels['lev'][0]<=data['sum'] and data['sum']<=levels['lev'][1]]
temp['proc']=levels['proc'][0]
lev1 = temp
data = pd.concat([lev1,lev2,lev3,data])
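For reference, two things break the attempt above: Python's `and` cannot combine element-wise Series conditions (it raises "The truth value of a Series is ambiguous"), and assigning a column on a filtered copy triggers SettingWithCopyWarning. A minimal sketch of one band, using `&` with parentheses and `.loc`, and following the strict `<` / `<=` boundaries stated in the question:

```python
import pandas as pd

levels = pd.DataFrame({'lev': [5000, 7000, 10000], 'proc': ['5%', '7%', '10%']})
data = pd.DataFrame({'name': list('ABCDEFG'),
                     'sum': [6500, 3000, 15000, 1400, 8600, 5409, 9999]})

# Element-wise conditions need `&` with parentheses, not `and`;
# .loc writes into the original frame instead of a filtered copy.
mask = (levels['lev'][1] < data['sum']) & (data['sum'] <= levels['lev'][2])
data.loc[mask, 'proc'] = levels['proc'][1]   # 7000 < sum <= 10000 -> '7%'
```

Rows that match no band keep NaN in the new column, since `.loc` only fills the masked rows.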
You can apply a function to each row like this:
import pandas as pd
def levels(s):
    if 5000 < s <= 7000:
        return '5%'
    elif 7000 < s <= 10000:
        return '7%'
    elif s > 10000:
        return '10%'
    # values <= 5000 fall through and return None
df = pd.DataFrame({'name':['A','B','C','D','E','F','G'],'sum':[6500,3000,15000,1400,8600,5409,9999]})
df['Percent'] = df.apply(lambda x: levels(x['sum']), axis=1)
print(df)
name sum Percent
0 A 6500 5%
1 B 3000 None
2 C 15000 10%
3 D 1400 None
4 E 8600 7%
5 F 5409 5%
6 G 9999 7%
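As an aside (not part of the original answer), the same banding can be done without a Python-level function via `pd.cut`, which labels half-open `(left, right]` intervals and leaves values outside every bin as NaN, matching the `5000 < s <= 7000` style boundaries:

```python
import pandas as pd

df = pd.DataFrame({'name': list('ABCDEFG'),
                   'sum': [6500, 3000, 15000, 1400, 8600, 5409, 9999]})

# Bins are half-open on the left: (5000, 7000], (7000, 10000], (10000, inf).
# Anything <= 5000 falls outside every bin and becomes NaN.
df['Percent'] = pd.cut(df['sum'],
                       bins=[5000, 7000, 10000, float('inf')],
                       labels=['5%', '7%', '10%'])
```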
I can't figure out how to write a function that takes the value computed for the previous row, uses it as an input for the next row, and repeats until the end of the DataFrame.
Example (two columns of values, with the profit from A to B in %):
import pandas as pd
df = pd.DataFrame(data={"A":[4,5,2,2], "B":[5,3,3,4]})
df['Profit_%'] = (df["B"] - df["A"]) / df["A"] * 100
print(df)
Function:
new_col = 'c'

def apply_func_decorator(func):
    prev_row = {}
    def wrapper(curr_row, **kwargs):
        val = func(curr_row, prev_row)
        prev_row.update(curr_row)
        prev_row[new_col] = val
        return val
    return wrapper

@apply_func_decorator
def running_total(curr_row, prev_row):
    return curr_row['Profit_%'] * prev_row.get(new_col, 0) / 100 + 125

Output:
df[new_col] = df.apply(running_total, axis=1)
print(df)
So the question is: how do I apply the function starting from the 2nd row, and tell Python that the initial amount to invest is $100?
In real life it should be:
We invest $100, and from the first line we gain 25%, giving $125.
From the second line we lose 40%, and we lose it from the $125, so now we have just $75.
From the third line we gain 50%, giving $112.50.
etc.
You are just compounding returns, so you can simply use a cumulative product:
df["C"] = 100*(df["B"]/df["A"]).cumprod()
>>> df
A B C
0 4 5 125.0
1 5 3 75.0
2 2 3 112.5
3 2 4 225.0
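Worth noting (my addition, not from the original answer): the same cumulative product can be driven from the `Profit_%` column itself, which makes the compounding explicit. Each row contributes a growth factor of `1 + profit/100`:

```python
import pandas as pd

df = pd.DataFrame({"A": [4, 5, 2, 2], "B": [5, 3, 3, 4]})
df['Profit_%'] = (df["B"] - df["A"]) / df["A"] * 100

# 1 + profit/100 is the per-row growth factor, so the running balance
# is the initial $100 times the cumulative product of those factors.
df["C"] = 100 * (1 + df['Profit_%'] / 100).cumprod()
```

This gives the same $125 / $75 / $112.50 sequence described in the question, because `B/A` equals `1 + Profit_%/100` row by row.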
I am definitely still learning Python and have tried countless approaches, but can't figure this one out.
I have a DataFrame with two columns, call them A and B. I need to return a DataFrame that sums the row values of each of these two columns independently until the running sum of A exceeds some threshold, say 10 for this example. So far I am trying to use iterrows() and can segment based on whether A >= 10, but can't work out how to sum rows until the threshold is met. The resulting DataFrame must be exhaustive even if the final A values do not meet the threshold (see the final row of the desired output).
df1 = pd.DataFrame(data = [[20,16],[10,5],[3,2],[1,1],[12,10],[9,7],[6,6],[5,2]],columns=['A','B'])
df1
A B
0 20 16
1 10 5
2 3 2
3 1 1
4 12 10
5 9 7
6 6 6
7 5 2
Desired result:
A B
0 20 16
1 10 5
2 16 13
3 15 13
4 5 2
Thank you in advance, much time spent, and assistance is much appreciated!!!
Cheers
I rarely write long loops for pandas, but I didn't see a way to do this with a pandas method. Try this horrible loop :(
The variable t tracks the cumulative sum of A, so we can check whether it has reached n (which we set to 10). For each row we then decide whether to use t, the cumulative sum, or i, the value in the DataFrame (j and u do the same thing in parallel for column B).
There are a few conditions, hence some elif statements. The last row behaves differently the way I have set it up, so it needs separate logic in the final if block; otherwise the last value wasn't getting appended:
import pandas as pd
df1 = pd.DataFrame(data = [[20,16],[10,5],[3,2],[1,1],[12,10],[9,7],[6,6],[5,2]],columns=['A','B'])
df1
a, b = [], []
t, u, count = 0, 0, 0
n = 10
for (i, j) in zip(df1['A'], df1['B']):
    count += 1
    if i < n and t >= n:
        a.append(t)
        b.append(u)
        t = i
        u = j
    elif 0 < t < n:
        t += i
        u += j
    elif i < n and t == 0:
        t += i
        u += j
    else:
        t = 0
        u = 0
        a.append(i)
        b.append(j)
    if count == len(df1['A']):
        if t == i or t == 0:
            a.append(i)
            b.append(j)
        elif t > 0 and t != i:
            t += i
            u += j
            a.append(t)
            b.append(u)
df2 = pd.DataFrame({'A': a, 'B': b})
df2
Here's one that works that's shorter:
import pandas as pd
df1 = pd.DataFrame(data = [[20,16],[10,5],[3,2],[1,1],[12,10],[9,7],[6,6],[5,2]],columns=['A','B'])
df2 = pd.DataFrame()
index = 0
while index < df1.size/2:
    if df1.iloc[index]['A'] >= 10:
        a = df1.iloc[index]['A']
        b = df1.iloc[index]['B']
        temp_df = pd.DataFrame(data=[[a, b]], columns=['A', 'B'])
        # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
        df2 = pd.concat([df2, temp_df], ignore_index=True)
        index += 1
    else:
        a_sum = 0
        b_sum = 0
        while a_sum < 10 and index < df1.size/2:
            a_sum += df1.iloc[index]['A']
            b_sum += df1.iloc[index]['B']
            index += 1
        # Append the running sums; this also covers the final partial
        # chunk when the threshold is never reached.
        temp_df = pd.DataFrame(data=[[a_sum, b_sum]], columns=['A', 'B'])
        df2 = pd.concat([df2, temp_df], ignore_index=True)
The key is to keep track of where you are in the DataFrame and to track the sums. Don't be afraid to use variables.
In pandas, use iloc to access each row by index. Make sure you don't run off the end of the DataFrame by checking the size. df.size returns the number of elements (rows times columns), which is why I divide the size by the number of columns to get the actual number of rows.
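If a shorter alternative helps, here is a sketch (my own, not from either answer above) that moves the running-sum reset into a small generator producing group ids, then lets `groupby(...).sum()` do the aggregation. The helper name `chunk_ids` is made up for this example:

```python
import pandas as pd

df1 = pd.DataFrame(data=[[20,16],[10,5],[3,2],[1,1],[12,10],[9,7],[6,6],[5,2]],
                   columns=['A', 'B'])

def chunk_ids(values, threshold=10):
    """Yield one group id per row, starting a new group once the
    running sum of `values` reaches `threshold`."""
    gid, total = 0, 0
    for v in values:
        total += v
        yield gid
        if total >= threshold:
            gid, total = gid + 1, 0

df2 = df1.groupby(list(chunk_ids(df1['A']))).sum().reset_index(drop=True)
```

A trailing partial chunk simply keeps its last group id, so the leftover rows are still summed into a final row, which satisfies the "exhaustive" requirement in the question.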
I am hoping to write a program that will run through multiple columns of data and create a new DataFrame from the values found to be outliers and those that are blank. Currently, I have the code below, which replaces the values with "Outlier" and "No Data", but I am struggling to convert this into a new DataFrame.
Visual of request:
import pandas as pd
from pandas import ExcelWriter
# Remove Initial Data Quality
outl = ['.',0,' ']
# Pull in Data
path = r"C:\Users\robert.carmody\desktop\Python\PyTest\PyTGPS.xlsx"
sheet = 'Raw Data'
df = pd.read_excel(path,sheet_name=sheet)
data = pd.read_excel(path,sheet_name=sheet)
j = 0
while j < len(df.keys()):  # run through total number of columns
    list(df.iloc[:,j])  # create a list of all values within the 'j' column
    if type(list(df.iloc[:,j])[0]) == float:
        q1 = df.iloc[:,j].quantile(q=.25)
        med = df.iloc[:,j].quantile(q=.50)
        q3 = df.iloc[:,j].quantile(q=.75)
        iqr = q3 - q1
        ub = q3 + 1.5*iqr
        lb = q1 - 1.5*iqr
        mylist = []  # outlier list is defined
        for i in df.iloc[:,j]:  # identify outliers and add to the list
            if i > ub or i < lb:
                mylist.append(float(i))
            else:
                i
        if mylist == []:
            mylist = ['Outlier']
        else:
            mylist
    else:
        mylist = ['Outlier']
    data.iloc[:,j].replace(mylist,'Outlier',inplace=True)
    j = j + 1
data = data.fillna('No Data')
#Excel
path2 = r"C:\Users\robert.carmody\desktop\Python\PyTest\PyTGPS.xlsx"
writer = ExcelWriter(path2)
df.to_excel(writer,'Raw Data')
data.to_excel(writer,'Adjusted Data')
writer.save()
Suppose your data looks like this, and for simplicity the upper bound is 2 and the lower bound is 0,
df = pd.DataFrame({'group':'A B C D E F'.split(' '), 'Q1':[1,1,5,2,2,2], 'Q2':[1,5,5,2,2,2],'Q3':[2,2,None,2,2,2]})
df.set_index('group', inplace=True)
i.e.:
Q1 Q2 Q3
group
A 1 1 2.0
B 1 5 2.0
C 5 5 NaN
D 2 2 2.0
E 2 2 2.0
F 2 2 2.0
Then the following might give what you want:
newData = []
for quest in df.columns:  # run through the columns
    q1 = df[quest].quantile(q=.25)
    med = df[quest].quantile(q=.50)
    q3 = df[quest].quantile(q=.75)
    iqr = q3 - q1
    #ub = q3 + 1.5*iqr
    ub = 2  # my
    #lb = q1 - 1.5*iqr
    lb = 0  # my
    for group in df.index:
        i = df.loc[group, quest]
        if i > ub or i < lb:  # identify outliers and add to the list
            newData += [[group, quest, 'Outlier', i]]
        elif pd.isna(i):  # NaN compares False to everything, so catch it explicitly
            newData += [[group, quest, 'None', None]]
This creates a two-dimensional list, which can easily be transformed into a DataFrame by
pd.DataFrame(newData)
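To make the result easier to read, the list can be given column names when building the DataFrame. A self-contained sketch using the sample data above (the column names `question`, `flag`, and `value` are my own guesses, not from the question):

```python
import pandas as pd

df = pd.DataFrame({'group': list('ABCDEF'),
                   'Q1': [1, 1, 5, 2, 2, 2],
                   'Q2': [1, 5, 5, 2, 2, 2],
                   'Q3': [2, 2, None, 2, 2, 2]}).set_index('group')
ub, lb = 2, 0  # fixed bounds for the example, as above

newData = []
for quest in df.columns:
    for group in df.index:
        i = df.loc[group, quest]
        if pd.isna(i):                 # blanks
            newData.append([group, quest, 'No Data', None])
        elif i > ub or i < lb:         # outliers
            newData.append([group, quest, 'Outlier', i])

result = pd.DataFrame(newData, columns=['group', 'question', 'flag', 'value'])
```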
I am trying to change the values of a very long column (about 1 million entries) in a DataFrame. I have something like
####ID_Orig
3452
3452
3452
6543
6543
...
I want something like
####ID_new
0
0
0
1
1
...
At the moment I'm doing this:
j = 0
for i in range(0, 1199531):
    if data.ID_orig[i] == data.ID_orig[i+1]:
        data.ID_orig[i] = j
    else:
        data.ID_orig[i] = j
        j = j + 1
which takes ages... Is there a faster way to do this? I don't know in advance what values ID_orig contains or how often a single value comes up.
Use factorize, but note that if a group value repeats later in the column, it is assigned the same number.
Another solution, comparing each value to the previous one with ne (!=) and taking the cumsum, is more general: it always creates a new number whenever the value changes, even for repeated group values:
df['ID_new1'] = pd.factorize(df['ID_Orig'])[0]
df['ID_new2'] = df['ID_Orig'].ne(df['ID_Orig'].shift()).cumsum() - 1
print (df)
ID_Orig ID_new1 ID_new2
0 3452 0 0
1 3452 0 0
2 3452 0 0
3 6543 1 1
4 6543 1 1
5 100 2 2
6 100 2 2
7 6543 1 3 <-repeating group
8 6543 1 3 <-repeating group
You can do this …
import collections
l1 = [3452, 3452, 3452, 6543, 6543]
c = collections.Counter(l1)
l2 = list(c.items())
l3 = []
for i, t in enumerate(l2):
    for x in range(t[1]):
        l3.append(i)
for x in l3:
    print(x)
This is the output:
0
0
0
1
1
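One caveat about the Counter approach (my observation, not from the answer): Counter counts all occurrences of a value, so a value that reappears later in the list would be folded into its first run. If a repeat should start a new id, as in the shift/cumsum answer above, itertools.groupby only groups consecutive duplicates:

```python
import itertools

l1 = [3452, 3452, 3452, 6543, 6543, 100, 100, 6543]

# groupby() splits only when consecutive values change,
# so the trailing 6543 run gets a fresh id.
l3 = []
for new_id, (value, run) in enumerate(itertools.groupby(l1)):
    l3.extend(new_id for _ in run)

print(l3)  # [0, 0, 0, 1, 1, 2, 2, 3]
```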
You can use the following. In this implementation, duplicate ids in the original column get the same new id. It works by dropping duplicates from the column, assigning a different number to each unique id to form the new ids, and then merging those new ids back into the original dataset.
import numpy as np
import pandas as pd
from time import time
num_rows = 119953
input_data = np.random.randint(1199531, size=(num_rows,1))
data = pd.DataFrame(input_data)
data.columns = ["ID_orig"]
data2 = pd.DataFrame(input_data)
data2.columns = ["ID_orig"]
t0 = time()
j = 0
for i in range(0, num_rows-1):
    if data.ID_orig[i] == data.ID_orig[i+1]:
        data.ID_orig[i] = j
    else:
        data.ID_orig[i] = j
        j = j + 1
t1 = time()
id_new = data2.loc[:, "ID_orig"].drop_duplicates().reset_index().drop("index", axis=1)
id_new.reset_index(inplace=True)
id_new.columns = ["id_new"] + id_new.columns[1:].values.tolist()
data2 = data2.merge(id_new, on="ID_orig")
t2 = time()
print("Previous: ", round(t1-t0, 2), " seconds")
print("Current : ", round(t2-t1, 2), " seconds")
The output of the above program using only 119k rows is
Previous: 12.16 seconds
Current : 0.06 seconds
The runtime difference grows even larger as the number of rows increases.
EDIT
Using the same number of rows:
>>> print("Previous: ", round(t1-t0, 2))
Previous: 11.7
>>> print("Current : ", round(t2-t1, 2))
Current : 0.06
>>> print("jezrael's answer : ", round(t3-t2, 2))
jezrael's answer : 0.02
I am trying to copy results from an array into a DataFrame inside a for loop. However, each time I try this, only the last value of the loop ends up in the DataFrame:
import pandas as pd
import datetime as dt

counter = 0
sample = [1,2,5,10,15,20,30,60,120,180,240,300,360,420,480,540,600]
columns = ['1','2','5','10','15','20','30','60','120','180','240','300','360','420','480','540','600']
index = df.set_index([df.index])
resultsDf = pd.DataFrame(columns=columns)
resultsDf = pd.DataFrame()
resultsDf.set_index([resultsDf.index])
results = []
for index, rowEntry in TradesGTC.iterrows():  # entry of trade
    entryVolume = rowEntry[26]
    entryPrice = rowEntry[28]
    ccyPair = rowEntry[12][0:6]
    entryTime = rowEntry['DateTime']
    for data in sample:
        exitTime = entryTime + dt.timedelta(seconds=data)
        f = MD.between_time(entryTime, exitTime)
        buy = entryVolume > 0
        sell = entryVolume < 0
        if buy == True:
            maxBidinTimeFrame = f['bid'].max()
            profit = (maxBidinTimeFrame - entryPrice) * entryVolume
            results.append(profit)
        if sell == True:
            minAskinTimeFrame = f['offer'].min()
            profit = (entryPrice - minAskinTimeFrame) * entryVolume
            results.append(profit)
    resultsDf.append(results)
The output which is returned is:
resultsDf
1 2 5 10 15 20 30 60 120 180 240 300 360 420 480 540 600
I expect to have a dataframe with column headers
1 2 5 10.... 600
and the results listed in each column going down so..
1 2 3 5 10....600
100 200 100...-50
.
.
Appreciate all help
Thanks
Thanks to all, but here was the missing piece of the puzzle:
resultsDf = resultsDf.append(pd.DataFrame(data = results), ignore_index=True)
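One caveat worth adding (my note, not from the thread): `DataFrame.append` was deprecated and then removed in pandas 2.0, so on current pandas the same fix is written with `pd.concat`. The `results` values below are placeholders standing in for one pass of the loop above:

```python
import pandas as pd

resultsDf = pd.DataFrame()
results = [100, 200, -50]   # placeholder values; the real ones come from the loop above

# pd.concat returns a new frame, so reassign the result, just like with append:
resultsDf = pd.concat([resultsDf, pd.DataFrame(data=results)], ignore_index=True)
```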