I'm currently facing a problem with method chaining when manipulating DataFrames in pandas. Here is the structure of my data:
import pandas as pd
lst1 = range(100)
lst2 = range(100)
lst3 = range(100)
df = pd.DataFrame(
{'Frenquency': lst1,
'lst2Tite': lst2,
'lst3Tite': lst3
})
The task is to get the entries (rows) where the frequency is less than 6, but it needs to be done with method chaining.
I know the traditional way is easy; I could just do
df[df["Frenquency"]<6]
to get the answer.
However, the question is about how to do it with method chaining. I tried something like
df.drop(lambda x:x.index if x["Frequency"] <6 else null)
but it raised the error "[<function <lambda> at 0x7faf529d3510>] not contained in axis".
Could anyone shed some light on this issue?
This is an old question, but since there is no accepted answer I will answer it for future reference.
df[df.apply(lambda x: True if x.Frenquency < 6 else False, axis=1)]
Explanation: the lambda checks each row's frequency and returns True if it is less than 6, otherwise False; df then uses that Series of booleans to keep only the rows marked True. Note that the column name Frenquency is a typo, but I kept it as-is since the question used it.
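As an aside, the True if ... else False is redundant, since the comparison already produces a boolean; the same answer can be written more compactly as:
df[df.apply(lambda x: x.Frenquency < 6, axis=1)]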
Or maybe this:
df.drop([i for i in df.Frenquency if i >= 6])
Or use inplace:
df.drop([i for i in df.Frenquency if i >= 6], inplace=True)
Note that drop removes rows by index label, so this only works here because the Frenquency values happen to coincide with the index labels.
For this sort of selection, you can maintain a fluent interface and keep method chaining by using the query method:
>>> df.query('Frenquency < 6')
Frenquency lst2Tite lst3Tite
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
>>>
So something like:
df.rename(<something>).query('Frenquency < 6').assign(<something>)
Or more concretely:
>>> (df.rename(columns={'Frenquency':'F'})
... .query('F < 6')
... .assign(FF=lambda x: x.F**2))
F lst2Tite lst3Tite FF
0 0 0 0 0
1 1 1 1 1
2 2 2 2 4
3 3 3 3 9
4 4 4 4 16
5 5 5 5 25
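One more query convenience worth noting (not part of the original answer): it can reference local variables with the @ prefix, which keeps the threshold out of the string:
limit = 6
df.query('Frenquency < @limit')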
I feel this post did not have an answer that addressed the spirit of the question. The most chain-friendly way is (probably) to use pandas' .loc.
import pandas as pd
lst1 = range(100)
lst2 = range(100)
lst3 = range(100)
df = pd.DataFrame({"Frequency": lst1, "lst2Tite": lst2, "lst3Tite": lst3})
df.loc[lambda _df: _df["Frequency"] < 6]
Simple!
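Because .loc accepts a callable that is passed the intermediate DataFrame, it slots naturally into the middle of a longer chain. A small sketch (the Double column is added purely for illustration):
(df.assign(Double=lambda d: d["Frequency"] * 2)
   .loc[lambda d: d["Frequency"] < 6])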
Would this satisfy your needs?
df.mask(df.Frequency >= 6).dropna()  # rows matching the condition become NaN, then dropna removes them
Note that the intermediate NaNs will convert integer columns to float, and this assumes the corrected column name Frequency.
Related
I am trying to count how many characters from the first column appear in the second one. They may appear in a different order, and they should not be counted twice.
For example, in this df
df = pd.DataFrame(data=[["AL0","CP1","NM3","PK9","RM2"],["AL0X24",
"CXP44",
"MLN",
"KKRR9",
"22MMRRS"]]).T
the result should be:
result = [3,2,2,2,3]
Looks like set.intersection after zipping the two columns:
[len(set(a).intersection(set(b))) for a, b in zip(df[0], df[1])]
#[3, 2, 2, 2, 3]
The other solutions will fail when you compare names that both contain the same character multiple times, e.g. AAL0 and AAL0X24, where the result should be 4.
from collections import Counter

df = pd.DataFrame(data=[["AL0", "CP1", "NM3", "PK9", "RM2", "AAL0"],
                        ["AL0X24", "CXP44", "MLN", "KKRR9", "22MMRRS", "AAL0X24"]]).T

def num_shared_chars(char_counter1, char_counter2):
    shared_chars = set(char_counter1.keys()).intersection(char_counter2.keys())
    return sum(min(char_counter1[k], char_counter2[k]) for k in shared_chars)

df_counter = df.applymap(Counter)
df['shared_chars'] = df_counter.apply(lambda row: num_shared_chars(row[0], row[1]), axis='columns')
Result:
0 1 shared_chars
0 AL0 AL0X24 3
1 CP1 CXP44 2
2 NM3 MLN 2
3 PK9 KKRR9 2
4 RM2 22MMRRS 3
5 AAL0 AAL0X24 4
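As an aside, Counter already implements multiset intersection via the & operator, so the helper above can be collapsed into a one-liner equivalent to num_shared_chars:
df['shared_chars'] = df.apply(lambda row: sum((Counter(row[0]) & Counter(row[1])).values()), axis='columns')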
Sticking to the dataframe data structure, you could do:
>>> def count_common(s1, s2):
... return len(set(s1) & set(s2))
...
>>> df["result"] = df.apply(lambda x: count_common(x[0], x[1]), axis=1)
>>> df
0 1 result
0 AL0 AL0X24 3
1 CP1 CXP44 2
2 NM3 MLN 2
3 PK9 KKRR9 2
4 RM2 22MMRRS 3
So I have a dataframe that looks something like this:
df1 = pd.DataFrame([[1,2, 3], [5,7,8], [2,5,4]])
0 1 2
0 1 2 3
1 5 7 8
2 2 5 4
I then have a function called add5 that adds 5 to a number. I'm trying to create a new column in df1 that adds 5 to all the numbers in column 2 that are greater than 3. I want to use vectorization, not apply, as this concept is going to be expanded to a dataset with hundreds of thousands of entries, and speed will be important. I can do it without the greater-than-3 constraint like this:
df1['3'] = add5(df1[2])
But my goal is to do something like this:
df1['3'] = add5(df1[2]) if df1[2] > 3
Hoping someone can point me in the right direction on this. Thanks!
With Pandas, a function applied explicitly to each row typically cannot be vectorised. Even implicit loops such as pd.Series.apply will likely be inefficient. Instead, you should use true vectorised operations, which lean heavily on NumPy in both functionality and syntax.
In this case, you can use numpy.where:
import numpy as np

df1[3] = np.where(df1[2] > 3, df1[2] + 5, df1[2])
Alternatively, you can use pd.DataFrame.loc in a couple of steps:
df1[3] = df1[2]
df1.loc[df1[2] > 3, 3] = df1[2] + 5
In each case, the term df1[2] > 3 creates a Boolean series, which is then used to mask another series.
Result:
print(df1)
0 1 2 3
0 1 2 3 3
1 5 7 8 13
2 2 5 4 9
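For completeness, the same conditional update can stay entirely inside pandas using Series.mask, which replaces values where the condition holds (a sketch equivalent to the np.where line above):
df1[3] = df1[2].mask(df1[2] > 3, df1[2] + 5)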
I have a very simple query.
I have a csv that looks like this:
ID X Y
1 10 3
2 20 23
3 21 34
And I want to add a new column called Z which equals 1 if X is greater than or equal to Y, and 0 otherwise.
My code so far is:
import pandas as pd
data = pd.read_csv("XYZ.csv")
for x in data["X"]:
if x >= data["Y"]:
Data["Z"] = 1
else:
Data["Z"] = 0
You can do this without a loop by using ge, which means greater than or equal to, and casting the boolean array to int using astype:
In [119]:
df['Z'] = (df['X'].ge(df['Y'])).astype(int)
df
Out[119]:
ID X Y Z
0 1 10 3 1
1 2 20 23 0
2 3 21 34 0
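For what it's worth, ge is just the method form of the >= operator, so the same line can also be written as:
df['Z'] = (df['X'] >= df['Y']).astype(int)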
Regarding your attempt:
for x in data["X"]:
    if x >= data["Y"]:
        Data["Z"] = 1
    else:
        Data["Z"] = 0
it wouldn't work: firstly, you're using Data, not data; secondly, even with that fixed, you'd be comparing a scalar against an entire column, which raises an error because the truth value of the resulting boolean Series is ambiguous; and thirdly, you're assigning to the whole column on every iteration, overwriting it.
You need to access the index label, which your loop didn't; you can use iteritems to do this (note: in pandas 2.0 iteritems was removed, so use items instead):
In [125]:
for idx, x in df["X"].iteritems():
if x >= df['Y'].loc[idx]:
df.loc[idx, 'Z'] = 1
else:
df.loc[idx, 'Z'] = 0
df
Out[125]:
ID X Y Z
0 1 10 3 1
1 2 20 23 0
2 3 21 34 0
But really this is unnecessary, as there is a vectorised method here.
Firstly, note that one problem with your code is simply that you capitalized your dataframe name as 'Data' instead of keeping it 'data'.
However, for efficient code, EdChum has a great answer above. Here is another vectorised method, comparable in efficiency but easier to remember:
import numpy as np
data['Z'] = np.where(data.X >= data.Y, 1, 0)
I am trying to find the indexes of the rows just before a "None" occurs.
pId=["a","b","c","None","d","e","None"]
df = pd.DataFrame(pId,columns=['pId'])
pId
0 a
1 b
2 c
3 None
4 d
5 e
6 None
df.index[df.pId.eq('None') & df.pId.ne(df.pId.shift(-1))]
I am expecting the output of the above code to be
Index([2, 5])
but it gives me
Index([3, 6])
Please correct me.
I am not sure about the specific example you showed. Anyway, you could do it in a simpler way:
indexes = [i - 1 for i, x in enumerate(pId) if x == 'None']
The problem is that you're returning the index of the "None" itself. You compare it against the previous item, but you're still reporting the index of the "None". Note that your accepted answer doesn't make this check.
In short, you still need to subtract 1 from the result of your check.
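A minimal sketch of that fix, reusing the boolean test from the question:
df.index[df.pId.eq('None')] - 1  # Index([2, 5])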
Just subtract 1 from df[df["pId"] == "None"].index:
import pandas as pd
pId=["a","b","c","None","d","e","None"]
df = pd.DataFrame(pId,columns=['pId'])
print(df[df["pId"] == "None"].index - 1)
Which gives you:
Int64Index([2, 5], dtype='int64')
Or if you just want a list of values:
(df[df["pId"] == "None"].index - 1).tolist()
You should be aware that for a list like:
pId=["None","None","b","c","None","d","e","None"]
You get a df like:
pId
0 None
1 None
2 b
3 c
4 None
5 d
6 e
7 None
And output like:
[-1, 0, 3, 6]
Which does not make a great deal of sense.
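If those edge cases matter, they can be filtered out. A sketch, assuming a leading 'None' and consecutive 'None's should both be dropped:
prev = df.index[df.pId.eq('None')] - 1
prev = prev[prev >= 0]                       # drop the -1 produced by a leading 'None'
prev = prev[df.pId[prev].ne('None').values]  # drop positions that are themselves 'None'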
How do I take multiple lists and put them as different columns in a Python dataframe? I tried this solution but had some trouble.
Attempt 1:
I have three lists, and I zip them together and use that: res = zip(lst1, lst2, lst3)
This yields just one column.
Attempt 2:
percentile_list = pd.DataFrame({'lst1Tite' : [lst1],
'lst2Tite' : [lst2],
'lst3Tite' : [lst3] },
columns=['lst1Tite','lst1Tite', 'lst1Tite'])
This yields either one row by three columns (the way above) or, if I transpose it, three rows and one column.
How do I get a 100 row (length of each independent list) by 3 column (three lists) pandas dataframe?
I think you're almost there. Try removing the extra square brackets around the lsts (also, you don't need to specify the column names when you're creating a dataframe from a dict like this):
import pandas as pd
lst1 = range(100)
lst2 = range(100)
lst3 = range(100)
percentile_list = pd.DataFrame(
{'lst1Title': lst1,
'lst2Title': lst2,
'lst3Title': lst3
})
percentile_list
lst1Title lst2Title lst3Title
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
6 6 6 6
...
If you need a more performant solution, you can use np.column_stack rather than zip as in your first attempt; this gives around a 2x speedup on the example here, but it comes at a bit of a cost to readability, in my opinion:
import numpy as np
percentile_list = pd.DataFrame(np.column_stack([lst1, lst2, lst3]),
columns=['lst1Title', 'lst2Title', 'lst3Title'])
Adding to Aditya Guru's answer here: there is no need to use map. You can do it simply by:
pd.DataFrame(list(zip(lst1, lst2, lst3)))
This will set the column names to 0, 1, 2. To set your own column names, you can pass the keyword argument columns to the method above.
pd.DataFrame(list(zip(lst1, lst2, lst3)),
columns=['lst1_title','lst2_title', 'lst3_title'])
Adding one more scalable solution.
lists = [lst1, lst2, lst3, lst4]
df = pd.concat([pd.Series(x) for x in lists], axis=1)
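A design note on this approach: unlike the dict-based constructor, which requires equal-length lists, concatenating Series pads shorter lists with NaN. For example:
pd.concat([pd.Series([1, 2, 3]), pd.Series([4, 5])], axis=1)
#    0    1
# 0  1  4.0
# 1  2  5.0
# 2  3  NaN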
There are several ways to create a dataframe from multiple lists.
list1 = [1, 2, 3, 4]
list2 = [5, 6, 7, 8]
list3 = [9, 10, 11, 12]
pd.DataFrame({'list1': list1, 'list2': list2, 'list3': list3})
pd.DataFrame(data=list(zip(list1, list2, list3)), columns=['list1', 'list2', 'list3'])
Just adding that, using the first approach, it can be done as:
pd.DataFrame(list(map(list, zip(lst1, lst2, lst3))))
Adding to the above answers, we can create the dataframe on the fly:
df = pd.DataFrame()
list1 = list(range(10))
list2 = list(range(10,20))
df['list1'] = list1
df['list2'] = list2
print(df)
Hope it helps!
@oopsi used pd.concat() but didn't include the column names. You could do the following, which, unlike the first solution in the accepted answer, gives you control over the column order (it avoids dicts, which were unordered before Python 3.7):
import pandas as pd
lst1 = range(100)
lst2 = range(100)
lst3 = range(100)
s1 = pd.Series(lst1, name='lst1Title')
s2 = pd.Series(lst2, name='lst2Title')
s3 = pd.Series(lst3, name='lst3Title')
percentile_list = pd.concat([s1,s2,s3], axis=1)
percentile_list
Out[2]:
lst1Title lst2Title lst3Title
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
6 6 6 6
7 7 7 7
8 8 8 8
...
You can simply use the following code:
train_data['labels'] = train_data[["LABEL1", "LABEL2", "LABEL3", "LABEL4", "LABEL5", "LABEL6", "LABEL7"]].values.tolist()
train_df = pd.DataFrame(train_data, columns=['text','labels'])
I just did it like this (Python 3.9):
import pandas as pd
my_dict = dict(x=x, y=y, z=z)  # set column ordering here; x, y, z are the source lists
my_df = pd.DataFrame.from_dict(my_dict)
This seems to be reasonably straightforward (albeit in 2022) unless I am missing something obvious...
In Python 2 one could have used a collections.OrderedDict().