I want to count the number of pipe (|) symbol occurrences in a column of a data frame, and if the count equals 5, I need to append another pipe symbol to the existing value.
df2['smartexpenseid']
0 878497|253919815?HOTEL?141791520780|||305117||
1 362593||||35068||
2 |231931871509?CARRT?231940968972||||177849|
3 955304|248973233?HOTEL?154687992630||||93191|
4 27984||||5883|3242|
5 3579321|253872763?HOTEL?128891721799|92832814|||
6 127299|248541768?HOTEL?270593355555|||||
7 |231931871509?CARRT?231940968972||||177849|
8 831665||||80658||
9 |247132692?HOTEL?141790728905||||6249|
For example: for row number 5, the (|) count is 5, so it should add another (|) to the existing value; for the other rows, since the count is 6, we just leave them as they are. Can somebody help me with this?
I tried these:
if df2['smartexpenseid'].str.count('\|') == 5:
    df2['smartexpenseid'].append('\|')
This throws an error saying "The truth value of a Series is ambiguous"
and also
a = df2['smartexpenseid'].str.count('\|')
if 5 in a:
    a.index(5)
So you have the vectorized str methods down. Now you need to conditionally append an extra '|' character; see the pandas documentation on boolean masking for more info.
m = df2['smartexpenseid'].str.count('\|') == 5
df2.loc[m, 'smartexpenseid'] = df2['smartexpenseid'][m].values + '|'
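For reference, a minimal runnable sketch of this masking approach on two of the sample rows (assuming the column holds plain strings):

import pandas as pd

df2 = pd.DataFrame({'smartexpenseid': [
    '878497|253919815?HOTEL?141791520780|||305117||',    # 6 pipes
    '3579321|253872763?HOTEL?128891721799|92832814|||',  # 5 pipes
]})

# boolean mask: True where the value contains exactly 5 pipes
m = df2['smartexpenseid'].str.count(r'\|') == 5
# append one more pipe only on the masked rows
df2.loc[m, 'smartexpenseid'] = df2.loc[m, 'smartexpenseid'] + '|'

print(df2['smartexpenseid'].str.count(r'\|').tolist())  # [6, 6]

Using df2.loc[m, ...] on both sides keeps the indices aligned, so the .values detour above is optional.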
I have a dataframe of answers, e.g.
answers = pd.DataFrame(np.random.randint(1, 5, size=(29, 10)))
that needs to be compared to an answer key.
In[1]: keyslist.head()
Out[1]:
0 4
1 3
2 1, 3
3 3
4 2
Some questions have more than one correct answer. I want to compare each cell in each column of answers to the corresponding cell in keyslist. If the values are the same, increment a counter that gets appended to numcorrect=list(), and if the values are not the same, increment a counter that gets appended to numwrong=list().
Here is what I have so far:
nwrong, nempty, ncorrect = [], [], []
for j in range(answers.shape[1]):      # iterate over columns (questions)
    ne = 0
    nc = 0
    nw = 0
    for i in range(answers.shape[0]):  # iterate over rows (students)
        if str(answers.iloc[i, j]) == 'nan':
            ne += 1
        elif answers.iloc[i, j] == keys[j]:
            nc += 1
        else:
            nw += 1
    nwrong.append(nw)
    nempty.append(ne)
    ncorrect.append(nc)
This works when keyslist has one value per cell. I need help figuring out how to get it to work when keyslist has more than one value in some cells.
Thank you
I would use
answers.iloc[i, j] in [int(x) for x in str(keyslist.iloc[i, 0]).split(',')]
Say your keyslist is a dataframe with one column, 0. keyslist.iloc[i, 0] gets the data in row i of column 0.
str(keyslist.iloc[i, 0]) converts it to a string.
str(keyslist.iloc[i, 0]).split(',') splits the multiple correct answers into a list.
[int(x) for x in str(keyslist.iloc[i, 0]).split(',')] converts the comma-separated answers into a list of ints.
answers.iloc[i, j] in .. checks whether the student's answer is in the answer list.
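Putting it together, a hedged sketch of the full loop with that membership check (keyslist and the counter lists are the names from the question; the key values below are invented, and the key is looked up by question index j to match the question's loop):

import numpy as np
import pandas as pd

answers = pd.DataFrame(np.random.randint(1, 5, size=(29, 10)))
# one key per question; multiple correct answers are comma-separated strings
keyslist = pd.DataFrame({0: ['4', '3', '1, 3', '3', '2', '1', '4', '2', '3', '1']})

nempty, ncorrect, nwrong = [], [], []
for j in range(answers.shape[1]):      # iterate over columns (questions)
    ne = nc = nw = 0
    # parse the key once per question: '1, 3' -> [1, 3]
    key = [int(x) for x in str(keyslist.iloc[j, 0]).split(',')]
    for i in range(answers.shape[0]):  # iterate over rows (students)
        if pd.isna(answers.iloc[i, j]):
            ne += 1
        elif answers.iloc[i, j] in key:
            nc += 1
        else:
            nw += 1
    nempty.append(ne)
    ncorrect.append(nc)
    nwrong.append(nw)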
My data consists of Latitude values of object type:
0 4.620881605
1 4.620124518
2 4.619367709
3 4.618609512
4 4.61784758
Then, I split after the decimal point using this code:
marker['Latitude'].str.split('.')
Resulting in:
0 [4, 620881605]
1 [4, 620124518]
2 [4, 619367709]
3 [4, 618609512]
4 [4, 61784758]
which is good but not quite there yet. I want to access the second element of the list for every row, and the end result I am expecting is this:
0 620881605
1 620124518
2 619367709
3 618609512
4 61784758
I was looking for an answer to the same question; it seems there is nothing built-in. The best option I can find is operator.itemgetter(), which is implemented in native code and should perform fine with Series.apply():
from operator import itemgetter
series = pd.Series(["%s|%s" % (-x, x) for x in range(100)])
pairs = series.str.split('|')
# Fetch all the negative numbers
negatives = pairs.apply(itemgetter(0)).astype(int)
# Fetch all the positive numbers
positives = pairs.apply(itemgetter(1)).astype(int)
Note that Series.str.split() also accepts an expand=True argument, which returns a new DataFrame containing columns 0..n rather than a series of lists. This probably should be the default behaviour; it's much easier to work with:
series = pd.Series(["%s|%s" % (-x, x) for x in range(100)])
pairs = series.str.split('|', expand=True)
# Fetch all the negative numbers
negatives = pairs[0]
# Fetch all the positive numbers
positives = pairs[1]
You can use pd.DataFrame.iterrows() to iterate by row and then select the proper index for your list.
import pandas as pd
x = pd.DataFrame({'a':[[1,2],[3,4],[5,6]]})
for index, row in x.iterrows():
    print(row['a'][1])
2
4
6
marker['Latitude'].apply(lambda x : x.strip(',').split('.')[1])
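The split-and-pick can also stay entirely inside the vectorized .str accessor, which avoids a Python-level apply; a small sketch using values like the sample above:

import pandas as pd

marker = pd.DataFrame({'Latitude': ['4.620881605', '4.620124518', '4.61784758']})

# vectorized: split each string on '.' and take the second piece
print(marker['Latitude'].str.split('.').str[1])
# or, with expand=True as shown earlier
print(marker['Latitude'].str.split('.', expand=True)[1])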
I have a data-frame which contains two columns.
In the first column (Motif_name) my values look like this:
Motif_Name_xx/Description/Homer
The second column just contains a score.
I'm trying to split my first column on '/' and keep the first element.
Basically, what I tried:
df=df['Motif_name'].str.split('/').str[1]
The problem here is that my whole data-frame gets replaced:
print(df)
0 Motif_1
1 Motif_2
I lost the header and the second column...
I expect to have a data-frame like this:
Motif_name Score
0 Motif_Name_xx1 0.001
1 Motif_Name_xx2 0.05
2 Motif_Name_xx3 0.02
3 Motif_Name_xx4 0.01
It seems you need the parameter n=1 to split only on the first /, and str[0] to get the first value of each list (Python counts from 0), and then assign it back to the same column:
df['Motif_name'] = df['Motif_name'].str.split('/', n=1).str[0]
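A quick sketch of that line on made-up data shaped like the question (the motif names and scores are invented):

import pandas as pd

df = pd.DataFrame({'Motif_name': ['Motif_Name_xx1/Description/Homer',
                                  'Motif_Name_xx2/Description/Homer'],
                   'Score': [0.001, 0.05]})

# split on the first '/' only and keep the part before it
df['Motif_name'] = df['Motif_name'].str.split('/', n=1).str[0]
print(df)
#        Motif_name  Score
# 0  Motif_Name_xx1  0.001
# 1  Motif_Name_xx2  0.050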
OK, I just saw the solution while I was editing my question, so if someone else needs the answer:
EF1a_R1_df['Motif_name']=EF1a_R1_df['Motif_name'].str.split('/').str[0]
Basically, instead of replacing the whole data-frame, just replace the column, and it works well.
Let's say I have an UNORDERED DataFrame:
df = pandas.DataFrame({'A': [6, 2, 3, 5]})
I have an input:
input = 3
I want to find the rank of my input in the column. Here:
expected_rank_in_df(input) = 2
# Because 2 < 3 < 5 < 6
Assumption: the input is always included in the dataframe. So, for example, I will not find the position of "4" in this df.
My first idea was to use rank(), like here: Pandas rank by column value:
df.rank()
But it seems overkill to me, as I don't need to rank the whole column. Maybe it's not?
If you know for sure that the input is in the column, the rank is the number of strictly smaller values plus one:
df[df < input].count() + 1
Does that make sense? If you intend to call this multiple times, it may be worth it to just sort the column. But this is probably faster if you only care about a few inputs.
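If you do sort once up front, numpy.searchsorted then gives the rank directly; a sketch (searchsorted returns the left insertion position, i.e. the count of strictly smaller values):

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [6, 2, 3, 5]})

sorted_a = np.sort(df['A'].to_numpy())   # sort once: [2, 3, 5, 6]
print(np.searchsorted(sorted_a, 3) + 1)  # one smaller value -> rank 2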
You can get the first position of the matched value with numpy.where and a boolean mask, taking the first True:
a = 3
print(np.where(np.sort(df['A']) == a)[0][0] + 1)
2
If you have the default RangeIndex, you can sort the values, reset the index, and take the position of the first match:
a = 3
print(df['A'].sort_values().reset_index(drop=True).eq(a).idxmax() + 1)
2
Another idea is to count the True values (here, the values smaller than the input) with sum and add one:
print(df['A'].lt(a).sum() + 1)
2
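As a sanity check, the built-in rank() the question mentions gives the same number if you select the matching row (a sketch on the same data):

import pandas as pd

df = pd.DataFrame({'A': [6, 2, 3, 5]})
a = 3
# rank the whole column, then pick the row where the value matches
print(int(df['A'].rank()[df['A'] == a].iloc[0]))  # 2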
As part of trying to learn pandas, I'm trying to reshape a spreadsheet. After removing non-zero values I need to get some data from a single column.
For the sample columns below, I want to find the most effective way of finding the row and column index of the cell that contains the value date, and then get the value next to it (e.g. here it would be 38477).
In practice this would be a much bigger DataFrame and the date row could change and it may not always be in the first column.
What is the best way to find out where date is in the array and return the value in the adjacent cell?
Thanks
0 1 2 4 5 7 8 10
1 some title
2 date 38477
5 cat1 cat2 cat3 cat4
6 a b c d e f g
8 Z 167.9404 151.1389 346.197 434.3589 336.7873 80.52901 269.1486
9 X 220.683 56.0029 73.73679 428.8939 483.7445 251.1877 243.7918
10 C 433.0189 390.1931 251.6636 418.6703 12.21859 113.093 136.28
12 V 226.0135 418.1141 310.2038 153.9018 425.7491 73.08073 277.5065
13 W 295.146 173.2747 2.187459 401.6453 51.47293 175.387 397.2021
14 S 306.9325 157.2772 464.1394 216.248 478.3903 173.948 328.9304
15 A 19.86611 73.11554 320.078 199.7598 467.8272 234.0331 141.5544
This really just reformats a lot of the iteration you are doing, to make it clearer and take advantage of pandas' ability to select easily, etc.
First, we need a dummy dataframe (with date in the last row and columns explicitly ordered the way you have in your setup):
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4, np.nan],
                   "B": [5, 3, np.nan, 3, "date"],
                   "C": [np.nan, 2, 1, 3, 634]})[["A", "B", "C"]]
A clear way to do it is to find the row and then enumerate over the row to find date:
row = df[df.apply(lambda x: (x == "date").any(), axis=1)].values[0]  # will be an array
for i, val in enumerate(row):
    if val == "date":
        print(row[i + 1])
        break
If your spreadsheet only has a few non-numeric columns, you could go by column, check for date, and get a row and column index (this may be faster because it searches by column rather than by row, though I'm not sure):
# gives you column labels, which are `True` if at least one entry has `date` in it
# have to check `kind`, otherwise you get an error
col_result = df.apply(lambda x: x.dtype.kind == "O" and (x == "date").any())
# select only columns where True (this should be one entry) and get their index (for the label)
column = col_result[col_result].index[0]
col_index = df.columns.get_loc(column)
# will be True for the row that contains date
row_selector = df.iloc[:, col_index] == "date"
print(df[row_selector].iloc[:, col_index + 1].values)
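An alternative sketch with current pandas idioms: compare every cell to "date" as objects (so mixed dtypes don't raise), locate it with numpy, then step one column to the right:

import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4, np.nan],
                   "B": [5, 3, np.nan, 3, "date"],
                   "C": [np.nan, 2, 1, 3, 634]})[["A", "B", "C"]]

# (row, col) positions of every cell equal to "date"
rows, cols = np.where(df.astype(object).eq("date").to_numpy())
r, c = rows[0], cols[0]
print(df.iloc[r, c + 1])  # 634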