I have a scenario where I am running two functions in a script:
test.py:
import pandas as pd

def func1():
    df1 = pd.read_csv('test1.csv')
    val1 = df1['col1'].mean().round(2)
    return val1

def func2():
    df2 = pd.read_csv('test2.csv')
    val2 = df2['col1'].mean().round(2)
    return val2

def func3():
    dataf = pd.read_csv('test3.csv')
    col1 = dataf['area']
    col2 = dataf['overall']
    dataf['overall'] = val1  # value from val1 -> leads to error
    dataf['overall'] = val2  # value from val2 -> leads to error
Here I am reading the test1.csv and test2.csv files, storing the mean values in the variables "val1" and "val2" respectively, and returning them.
I want to store these values in a new test3.csv file, which has two columns, with the values stored one after another (appending). The above does not work, and I couldn't find anything about this online. Any help would be great.
You need to pass the variables as parameters to func3, and since the only difference between func1 and func2 is the file name, you can create a single function with a file parameter.
Thanks for the idea, cᴏʟᴅsᴘᴇᴇᴅ ;)
def func1(file):
    df = pd.read_csv(file)
    val = df['col1'].mean().round(2)
    return val

a = func1('test1.csv')
b = func1('test2.csv')

def func3(val1=a, val2=b):
    dataf = pd.read_csv('test3.csv')
    col1 = dataf['area']
    col2 = dataf['overall']
    dataf.iloc[::2, dataf.columns.get_loc('overall')] = val1
    dataf.iloc[1::2, dataf.columns.get_loc('overall')] = val2
    return dataf
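A quick usage note: the defaults val1=a and val2=b are bound when func3 is defined, so a and b must already exist at that point; you can also pass other values explicitly:

result = func3()        # uses a and b captured as defaults
result = func3(10, 20)  # or override them per call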
Sample:
dataf = pd.DataFrame({'overall': [1, 7, 8, 9, 4],
                      'col': list('abcde')})
print (dataf)
col overall
0 a 1
1 b 7
2 c 8
3 d 9
4 e 4
val1 = 20
val2 = 50
dataf.iloc[::2, dataf.columns.get_loc('overall')] = val1
dataf.iloc[1::2, dataf.columns.get_loc('overall')] = val2
print (dataf)
col overall
0 a 20
1 b 50
2 c 20
3 d 50
4 e 20
General solution for appending N values from a list - create an array with numpy.tile and then assign it to the new column:
import numpy as np

val = [1, 8, 4]
a = np.tile(val, int(len(dataf) / len(val)) + 2)[:len(dataf)]
dataf['overall'] = a
print (dataf)
col overall
0 a 1
1 b 8
2 c 4
3 d 1
4 e 8
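To persist the result back to test3.csv, as the question asks, a to_csv call at the end would do it (a minimal sketch; the file name and index=False are assumptions on my part):

dataf.to_csv('test3.csv', index=False)  # write the updated frame back out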
I have an issue with applying a function to a column in pandas; please see the code below:
import pandas as pd

# create a dict as below
data_dic = {
    "text": ['hello', 1, 'how are you?', 4],
    "odd": [0, 2, 4, 6],
    "even": [1, 3, 5, 7]
}

# create a DataFrame
df = pd.DataFrame(data_dic)

# define functions
def checktext(str1):
    if isinstance(str1, str):
        return str1.upper()

def checknum(str1):
    if isinstance(str1, int):
        return str1 + 1

df['new'] = df['text'].apply(lambda x: checktext(x))
df['new'].head()
My df now looks like this:
text odd even new
0 hello 0 1 HELLO
1 1 2 3 None
2 how are you? 4 5 HOW ARE YOU?
3 4 6 7 None
I would like to apply the function checknum to the two cells in column 'new' that have a None value. Can someone assist? Thank you
IIUC, you can use vectorized code:
# make string UPPER
s = df['text'].str.upper()
# where there was no string, get number + 1 instead
df['new'] = s.fillna(df['text'].where(s.isna())+1)
output:
text odd even new
0 hello 0 1 HELLO
1 1 2 3 2
2 how are you? 4 5 HOW ARE YOU?
3 4 6 7 5
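Unpacking that line (the intermediate names here are mine, not from the original answer):

upper = df['text'].str.upper()         # NaN where the entry is not a string
nums = df['text'].where(upper.isna())  # keep only the non-string entries
df['new'] = upper.fillna(nums + 1)     # fill the string gaps with number + 1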
That said, for the sake of the argument, your 2 functions could be combined into one:
def check(str1):
    if isinstance(str1, str):
        return str1.upper()
    elif isinstance(str1, int):
        return str1 + 1

df['new'] = df['text'].apply(check)
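For reference, a quick check (reusing the df built in the question) that this matches the vectorized output:

print(df['new'].tolist())
# ['HELLO', 2, 'HOW ARE YOU?', 5]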
Your function:
def checktext(str1):
    if isinstance(str1, str):
        return str1.upper()
will return None if the if statement is false (i.e. str1 is not a string). To fix it, return the value unchanged by default:
def checktext(str1):
    if isinstance(str1, str):
        return str1.upper()
    return str1
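With the same default-return fix applied to checknum, the two applies can then be chained to cover both cases (a sketch reusing the question's df):

def checknum(str1):
    if isinstance(str1, int):
        return str1 + 1
    return str1  # pass non-ints through unchanged

df['new'] = df['text'].apply(checktext).apply(checknum)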
First, you could use the StringMethods accessor to convert to upper case without any loop. And when you have done that, you can easily process the rows where the result is NaN:
df['new'] = df['text'].str.upper()
mask = df['new'].isna()
df.loc[mask, 'new'] = df.loc[mask, 'text'] + 1
This directly gives:
text odd even new
0 hello 0 1 HELLO
1 1 2 3 2
2 how are you? 4 5 HOW ARE YOU?
3 4 6 7 5
Suppose I have the following dataframe:
CategoryID Days Views
a 1 19
a 2 2000
a 5 5667
a 7 7899
b 1 2
b 3 245
c 1 1
c 2 252
c 7 2657
Given a threshold = n, I want to create two lists and append to them until the threshold is reached, plus one extra element, for each category.
So, if n < 4, I expect for category a:
days_list = [1,2,5]
views_list = [19, 2000, 5667]
After that, I want to apply a function to those lists and then start iterating over the next category. However, I'm facing two issues with the following code:
I can't iterate properly when i == 0
The iteration does not go to the next category.
df['interpolated'] = int
days_list = []
views_list = []
for i, post in enumerate(category):
    if df['category_id'].iloc[i-1] != post:
        days_list.append(df['days new'].iloc[i])
        views_list.append(df['views'].iloc[i])
    elif df['category_id'].iloc[i] == post and df['category_id'].iloc[i-1] == post:
        if df['days new'].iloc[i] < 3:
            days_list.append(df['days new'].iloc[i])
            views_list.append(df['views'].iloc[i])
        elif df['days new'].iloc[i] != 3:
            days_list.append(df['days new'].iloc[i])
            views_list.append(df['views'].iloc[i])
            break
    # Calculate the interpolation
    interpolator = log_interp1d(days_list, views_list)
    df['interpolated'] = round(interpolator(4).astype(int))
    # Reset the lists after the category loop
    days_list = []
    views_list = []
Can someone shed some light on this? Thanks!
You can use a row_number-style operation.
....
df['row_number'] = df.groupby(['CategoryID']).cumcount() + 1
Then, you will have a dataframe
CategoryID Days Views row_number
a 1 19 1
a 2 2000 2
a 5 5667 3
a 7 7899 4
b 1 2 1
b 3 245 2
c 1 1 1
c 2 252 2
c 7 2657 3
Then, you should be able to use boolean filtering to get what you want. So for your example,
df_category_a_filtered_4 = df[(df['row_number'] <= 3) & (df['CategoryID'] == 'a')]
This filters the dataframe so that the two lists you want are its two columns. It can obviously be wrapped in a function to do whatever you need.
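For instance, a sketch of the full per-category flow (log_interp1d is the asker's helper, not defined here; n = 3 matches the example threshold):

n = 3
for cat, group in df[df['row_number'] <= n].groupby('CategoryID'):
    days_list = group['Days'].tolist()
    views_list = group['Views'].tolist()
    # interpolator = log_interp1d(days_list, views_list)  # asker's helper
    # df.loc[df['CategoryID'] == cat, 'interpolated'] = round(interpolator(4))
    print(cat, days_list, views_list)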
If you want a more specific output, please specify what that would look like.
I have a dataframe with two columns, Column A and Column B, and an array of the alphabets from A to P, as follows:
df = pd.DataFrame({
    'Column_A': [0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1],
    'Column_B': []
})
The array is as follows:
label = ['A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P']
The expected output is:
'A':[0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1],
'B':['A','A','A','A','A','E','E','E','E','E','I','I','I','I','I','M']
The value in Column B changes as soon as the value in Column A is 1, and the new value is taken from the given array label.
I have tried using this for loop:
for row in df.index:
    try:
        if df.loc[row, 'Column_A'] == 1:
            df.at[row, 'Column_B'] = label[row+4]
            print(label[row])
        else:
            df.Column_B.fillna('ffill')
    except IndexError:
        row = (row+4) % 4
        df.at[row, 'Column_B'] = label[row]
I also want to loop back to the start of the label array once it reaches the last value.
A solution that should do the trick looks like this:
label = list('ABCDEFGHIJKLMNOP')
df = pd.DataFrame({
    'Column_A': [0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1],
    'Column_B': label
})
Not exactly sure what you intended with the fillna; I think you don't need it.
max_index = len(label)
df['Column_B'] = 'ffill'
lookup = 0
for row in df.index:
    if df.loc[row, 'Column_A'] == 1:
        lookup = lookup + 4 if lookup + 4 < max_index else lookup % 4
        df.at[row, 'Column_B'] = label[lookup]
        print(label[row])
I also avoid the exception handling in this case, because the "index overflow" can be handled without it.
Btw., if you have a large dataframe, you can probably make the code faster by eliminating one lookup (but you'd need to verify whether it really runs faster). That solution would look like this:
max_index = len(label)
df['Column_B'] = 'ffill'
lookup = 0
for row, record in df.iterrows():
    if record['Column_A'] == 1:
        lookup = lookup + 4 if lookup + 4 < max_index else lookup % 4
        df.at[row, 'Column_B'] = label[lookup]
        print(label[row])
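To verify that claim, a hypothetical micro-benchmark could wrap both loops in functions and time them (variant_loc and variant_iterrows are names I'm introducing here; df, label and max_index are the objects set up above):

import timeit

def variant_loc(df):
    lookup = 0
    for row in df.index:
        if df.loc[row, 'Column_A'] == 1:  # extra .loc lookup per row
            lookup = lookup + 4 if lookup + 4 < max_index else lookup % 4
            df.at[row, 'Column_B'] = label[lookup]

def variant_iterrows(df):
    lookup = 0
    for row, record in df.iterrows():
        if record['Column_A'] == 1:  # record already holds the row's values
            lookup = lookup + 4 if lookup + 4 < max_index else lookup % 4
            df.at[row, 'Column_B'] = label[lookup]

print(timeit.timeit(lambda: variant_loc(df.copy()), number=1000))
print(timeit.timeit(lambda: variant_iterrows(df.copy()), number=1000))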
Option 1
cond1 = df.Column_A == 1    # rows where a new label should start
cond2 = df.index == 0       # seed the very first row
mappr = lambda x: label[x]  # positional index -> label
df.assign(Column_B=np.where(cond1 | cond2, df.index.map(mappr), np.nan)).ffill()
Column_A Column_B
0 0 A
1 0 A
2 0 A
3 0 A
4 0 A
5 1 F
6 0 F
7 0 F
8 0 F
9 0 F
10 1 K
11 0 K
12 0 K
13 0 K
14 0 K
15 1 P
Option 2
a = np.append(0, np.flatnonzero(df.Column_A))  # row positions where Column_A == 1, with 0 prepended
b = df.Column_A.to_numpy().cumsum()            # for each row, how many 1s seen so far
c = np.array(label)                            # labels as an array for fancy indexing
df.assign(Column_B=c[a[b]])                    # label at the most recent flag position
Column_A Column_B
0 0 A
1 0 A
2 0 A
3 0 A
4 0 A
5 1 F
6 0 F
7 0 F
8 0 F
9 0 F
10 1 K
11 0 K
12 0 K
13 0 K
14 0 K
15 1 P
Using groupby with transform, then map:
df.reset_index().groupby(df.Column_A.eq(1).cumsum())['index'].transform('first').map(dict(enumerate(label)))
Out[139]:
0 A
1 A
2 A
3 A
4 A
5 F
6 F
7 F
8 F
9 F
10 K
11 K
12 K
13 K
14 K
15 P
Name: index, dtype: object
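A step-by-step reading of that chain (the intermediate names are mine, not from the original answer):

groups = df.Column_A.eq(1).cumsum()  # new group id each time Column_A == 1
first_idx = df.reset_index().groupby(groups)['index'].transform('first')  # index of each group's first row
df['Column_B'] = first_idx.map(dict(enumerate(label)))  # map that row index to its label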
I have a pandas.DataFrame called a, whose structure is as follows:
while the DataFrame structure I want to get looks like this:
where b is like the transpose of a.
To convert a to b, I use this code:
id_uni = a['id'].unique()
b = pd.DataFrame(columns=['id'] + [str(i) for i in range(1, 4)])
b['id'] = id_uni
for i in id_uni:
    for j in range(7):
        ind = (a['id'] == i) & (a['w'] == j)
        med = a.loc[ind, 't'].values
        if med:
            b.loc[b['id'] == i, str(j)] = med[0]
        else:
            b.loc[b['id'] == i, str(j)] = 0
This method is very brute-force: I just use two for-loops to copy every element from a to b, and it is very slow. Do you have a more efficient way to do it?
You can use pivot:
print (df.pivot(index='id', columns='w', values='t'))
w 1 2 3
id
0 54 147 12
1 1 0 1
df1 = df.pivot(index='id', columns='w', values='t').reset_index()
df1.columns.name=None
print (df1)
id 1 2 3
0 0 54 147 12
1 1 1 0 1
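One caveat worth noting: if some (id, w) combinations are missing, pivot leaves NaN where the question's loop wrote 0; a fillna restores that behaviour (a small addition inferred from the loop's else branch):

df1 = df.pivot(index='id', columns='w', values='t').fillna(0).reset_index()
df1.columns.name = None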
I wouldn't be posting this if I hadn't done extensive research in an attempt to find the answer. Alas, I have not been able to find one. I have a paired dataset that looks something like this:
PERSON, ATTRIBUTE
person1, a
person1, b
person1, c
person1, d
person2, c
person2, d
person2, x
person3, a
person3, b
person3, e
person3, f
What I want to do is: 1) drop attributes that don't appear more than 10 times, and 2) turn it into a binary table that would look something like this:
a b c
person1 1 1 1
person2 0 0 1
person3 1 1 0
So far, I have put together a script to drop the attributes that appear fewer than 10 times; however, it is painfully slow, as it has to go through each attribute, determine its frequency, and find the corresponding x and y values to append to new variables.
import pandas as pd
import numpy as np
import csv
from collections import Counter
import time

df = pd.read_csv(filepath_or_buffer='sample.csv', sep=',')
x = df.iloc[:, 1].values
y = df.iloc[:, 0].values
x_vals = []
y_vals = []
counter = Counter(x)
start_time = time.time()
for each in counter:
    if counter[each] >= 10:
        for i, j in enumerate(x):
            if j == each:
                print("Adding position:" + str(i))
                x_vals.append(each)
                y_vals.append(y[i])
print("Time took: %s" % (time.time() - start_time))
I would love some help in 1) finding a faster way to match attributes that appear more than 10 times and append the values to new variables,
OR
2) An alternative method entirely to get the final binary table. I feel like converting a paired table to a binary table is probably a common task in the data world, yet I couldn't find any code, module, etc. that could help with doing that.
Thanks a million!
I would probably add a dummy column and then call pivot_table:
>>> df = pd.DataFrame({"PERSON": ["p1", "p2", "p3"] * 10, "ATTRIBUTE": np.random.choice(["a","b","c","d","e","f","x"], 30)})
>>> df.head()
ATTRIBUTE PERSON
0 d p1
1 b p2
2 x p3
3 b p1
4 f p2
>>> df["count"] = 1
>>> p = df.pivot_table(index="PERSON", columns="ATTRIBUTE", values="count",
...                    aggfunc=sum, fill_value=0)
>>> p
ATTRIBUTE a b c d e f x
PERSON
p1 1 3 1 1 1 0 3
p2 2 1 1 2 1 2 1
p3 0 4 1 1 2 0 2
And then we can select only the attributes with more than 10 occurrences (here 5, from my example):
>>> p.loc[:,p.sum() >= 5]
ATTRIBUTE b x
PERSON
p1 3 3
p2 1 1
p3 4 2
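Since the question asks for a binary (0/1) table rather than counts, clipping at 1 finishes the job (a small addition, not part of the original answer):

>>> (p.loc[:, p.sum() >= 5] > 0).astype(int)
ATTRIBUTE  b  x
PERSON
p1         1  1
p2         1  1
p3         1  1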