Counting Values In Columns Ignoring AlphaNumeric Values - python

First post here. I am trying to find the total count of values in an Excel file. After importing the file, I need to count all the values except 0, and wherever a 0 is found, make that cell blank.
> df6 = df5.append(df5.ne(0).sum().rename('Final Value'))
I tried the line above, but it does not work properly: it counts the column names as well. I only need to count the float values.
Demo DataFrame:
0 1 2 3
ID_REF 1007_s 1053_a 117_at 121_at
GSM95473 0.08277 0.00874 0.00363 0.01877
GSM95474 0.09503 0.00592 0.00352 0
GSM95475 0.08486 0.00678 0.00386 0.01973
GSM95476 0.08105 0.00913 0.00306 0.01801
GSM95477 0.00000 0.00812 0.00428 0
GSM95478 0.07615 0.00777 0.00438 0.01799
GSM95479 0 0.00508 1 0
GSM95480 0.08499 0.00442 0.00298 0.01897
GSM95481 0.08893 0.00734 0.00204 0
0 1 2 3
ID_REF 1007_s 1053_a 117_at 121_at
These are the column names and index values, which need to be ignored when counting.
The output should look like this after counting:
Final 8 9 9 5

If you just need the count, without changing the values in your dataframe, you could apply a function to each cell in your DataFrame with the applymap method. First create a function to check for a float:
def floatcheck(value):
    if isinstance(value, float):
        return 1
    else:
        return 0
Then apply it to your dataframe:
df6 = df5.applymap(floatcheck)
This will create a dataframe with a 1 if the value is a float and a 0 if not. Then you can apply your sum method:
df7 = df6.append(df6.sum().rename("Final Value"))
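Note that a float check also counts 0.0, while the question asks to skip zeros and blank them out. A minimal sketch of one way to do both, assuming the alphanumeric cells can simply be coerced to NaN with pd.to_numeric (the small df5 below is a made-up subset of the demo data, and the names numeric, counts, blanked and result are only illustrative):

```python
import pandas as pd

# Hypothetical subset of the demo frame: one alphanumeric row, then numeric rows.
df5 = pd.DataFrame(
    [
        ["1007_s", "1053_a", "117_at", "121_at"],
        [0.08277, 0.00874, 0.00363, 0.01877],
        [0.09503, 0.00592, 0.00352, 0],
        [0.0, 0.00812, 0.00428, 0],
    ],
    index=["ID_REF", "GSM95473", "GSM95474", "GSM95477"],
)

# Coerce every cell to a number; alphanumeric cells become NaN.
numeric = df5.apply(pd.to_numeric, errors="coerce")

# Count cells that are numeric and not equal to 0, per column.
counts = (numeric.notna() & numeric.ne(0)).sum()

# Blank out the zeros while leaving all other cells untouched.
blanked = df5.mask(numeric.eq(0), "")

# Append the counts as a final row (pd.concat instead of the removed DataFrame.append).
result = pd.concat([blanked, counts.rename("Final").to_frame().T])
print(result)
```

pd.concat is used here because DataFrame.append was removed in pandas 2.0; on older versions the append call from the answer behaves the same way.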

I was able to solve the issue, so here it is:
df5 = df4.append(pd.DataFrame(dict(((df4[1:] != 1) & (df4[1:] != 0)).sum()), index=['Final']))
df5.columns = df4.columns
went = df5.to_csv("output3.csv")
What I did was change the starting index so that the first row, which is alphanumeric, is not counted, and then I just compared the remaining values against 0 and 1.
Thanks for your response.

Related

retrieve cell string values in a column between two unknown indexes based on substrings location

I need to find the first position where the word 'their' appears in the Words table. I'm trying to write code that concatenates all strings in the 'text' column from that position up to the first text containing the substring '666' or '999' (in this case a combination of their, stoma22, fe156, sligh334 and pain666; the desired subtrings_output = 'theirstoma22fe156sligh334pain666').
I've tried:
their_loc = np.where(words['text'].str.contains(r'their', na =True))[0][0]
666_999_loc = np.where(words['text'].str.contains(r'666', na =True))[0][0]
subtrings_output = Words['text'].loc[Words.index[their_loc:666_999_loc]]
As you can see, I'm not sure how to extend the condition for 666_999_loc to match the substring 666 or 999. Also, slicing the index between the two variables raises an error. Many thanks.
Words table:
page no  text      font
1        they      0
1        ate       0
1        apples    0
2        and       0
2        then      1
2        their     0
2        stoma22   0
2        fe156     1
2        sligh334  0
2        pain666   1
2        given     0
2        the       1
3        fruit     0
You just need to add one to the end of the slice, and add an or condition to the np.where for contains_666_or_999_loc using the | operator.
text_col = words['text']
their_loc = np.where(text_col.str.contains(r'their', na=True))[0][0]
contains_666_or_999_loc = np.where(text_col.str.contains('666', na=True) |
                                   text_col.str.contains('999', na=True))[0][0]
subtrings_output = ''.join(text_col.loc[words.index[their_loc:contains_666_or_999_loc + 1]])
print(subtrings_output)
Output:
theirstoma22fe156sligh334pain666
IIUC, use pandas.Series.idxmax with "".join().
Series.idxmax(axis=0, skipna=True, *args, **kwargs)
Return the row label of the maximum value.
If multiple values equal the maximum, the first row label with that
value is returned.
So, assuming (Words) is your dataframe, try this :
their_loc = Words["text"].str.contains("their").idxmax()
_666_999_loc = Words["text"].str.contains("666|999").idxmax()
subtrings_output = "".join(Words["text"].loc[Words.index[their_loc:_666_999_loc+1]])
Output :
print(subtrings_output)
#theirstoma22fe156sligh334pain666
#their stoma22 fe156 sligh334 pain666 # <- with " ".join()
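Both answers take the first '666'/'999' match in the whole column, which could land before 'their' in other data. A minimal sketch that only searches from the start position onward, assuming the table is rebuilt as below and that both patterns occur at least once (start, end_mask and end are illustrative names):

```python
import pandas as pd

# The 'text' column from the Words table in the question.
words = pd.DataFrame({
    "text": ["they", "ate", "apples", "and", "then", "their", "stoma22",
             "fe156", "sligh334", "pain666", "given", "the", "fruit"],
})
text = words["text"]

# Position of the first row containing 'their'.
start = text.str.contains("their", na=False).to_numpy().argmax()

# Boolean mask of rows containing '666' or '999' (regex alternation).
end_mask = text.str.contains("666|999", na=False).to_numpy()

# First match at or after the start position.
end = start + end_mask[start:].argmax()

# Positional slice, inclusive of the end row.
subtrings_output = "".join(text.iloc[start:end + 1])
print(subtrings_output)
```

Using iloc with positions avoids relying on the index being a default RangeIndex. If either pattern might be absent, check the masks with .any() first, since argmax returns 0 on an all-False array.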

how to change the iterrows method to apply

I have this code, which runs over around 60k rows. It takes around 4 hours to complete the whole process. That is not feasible, so I want to use apply instead of iterrows because of time constraints.
Here is the code,
all_merged_k = pd.DataFrame(columns=all_merged_f.columns)
for index, row in all_merged_f.iterrows():
    if (row['route_count'] == 0):
        all_merged_k = all_merged_k.append(row)
    else:
        for i in range(row['route_count']):
            row1 = row.copy()
            row['Route Number'] = i
            row['Route_Broken'] = row1['routes'][i]
            all_merged_k = all_merged_k.append(row)
Basically, if the route count is 0 the code appends the row unchanged. Otherwise it appends as many rows as the route count, all with the same values except for the routes column (which contains a nested list), so each route ends up in its own row. The split values go into new columns called Route_Broken and Route Number.
Sample of data:
routes route_count
[[CHN-IND]] 1
[[CHN-IND],[IND-KOR]] 2
O/P data:
routes route_count Broken_Route Route Number
[[CHN-IND]] 1 [CHN-IND] 1
[[CHN-IND],[IND-KOR]] 2 [CHN-IND] 1
[[CHN-IND],[IND-KOR]] 2 [IND-KOR] 2
Is it possible to do this using apply? 4 hours is far too long and this can't be put into production. Any help would be appreciated.
The code below doesn't work:
df.join(df['routes'].explode().rename('Broken_Route')) \
  .assign(**{'Route Number': lambda x: x.groupby(level=0).cumcount().add(1)})
or
(df.assign(Broken_Route=df['routes'],
           count=df['routes'].str.len().apply(range))
   .explode(['Broken_Route', 'count'])
)
It does not work when the index values match: as the last row shows, Route Number should be 1.
Do you expect something like this:
>>> df.join(df['routes'].explode().rename('Broken_Route')) \
...   .assign(**{'Route Number': lambda x: x.groupby(level=0).cumcount().add(1)})
routes route_count Broken_Route Route Number
0 [[CHN-IND]] 1 [CHN-IND] 1
1 [[CHN-IND], [IND-KOR]] 2 [CHN-IND] 1
1 [[CHN-IND], [IND-KOR]] 2 [IND-KOR] 2
2 0 1
Setup:
data = {'routes': [[['CHN-IND']], [['CHN-IND'], ['IND-KOR']], ''],
        'route_count': [1, 2, 0]}
df = pd.DataFrame(data)
Update 1: added a record with route_count=0 and routes=''.
You can assign the routes and counts and explode:
(df.assign(Broken_Route=df['routes'],
           count=df['routes'].str.len().apply(range))
   .explode(['Broken_Route', 'count'])
)
NB. multi-column explode requires pandas ≥1.3.0, if older use this method
output:
routes route_count Broken_Route count
0 [[CHN-IND]] 1 [CHN-IND] 0
1 [[CHN-IND], [IND-KOR]] 2 [CHN-IND] 0
1 [[CHN-IND], [IND-KOR]] 2 [IND-KOR] 1
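For reference, the explode-then-number idea from the answers can be written step by step. This is a minimal sketch, assuming rows with route_count 0 hold an empty list in routes (the answers' setup uses '' instead) and that the original index is unique:

```python
import pandas as pd

# Hypothetical sample: one route, two routes, and no routes (empty list).
df = pd.DataFrame({
    "routes": [[["CHN-IND"]], [["CHN-IND"], ["IND-KOR"]], []],
    "route_count": [1, 2, 0],
})

out = df.copy()

# Copy the nested list, then give each inner route its own row.
# An empty list yields a single row with NaN in Broken_Route.
out["Broken_Route"] = out["routes"]
out = out.explode("Broken_Route")

# Number the routes within each original row (repeated index label), starting at 1.
out["Route Number"] = out.groupby(level=0).cumcount() + 1
print(out)
```

Because explode repeats the original index label for every produced row, groupby(level=0) restarts the numbering for each source row, so this relies on the index having no duplicates beforehand (call reset_index(drop=True) first if unsure).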

Pandas dataframe select Columns based on other dataframe contains column value in it

I have two dataframes. Here is dwpjp.head():
   jp_number
0  25146315052147720191
1  57225427599900052634
2  86076681691411639833
3  50491824499499656478
4  95588382889227620465
and ct_data.head():
   imjp_number           imct_id
0  23605308039805192764  x1E5e3ukRyEFRT6SUAF6lg|d543d3d064da465b8576d87
1  57225427599900052634  aa0d2dac654d4154bf7c09f73faeaf62|-vf6738ee3bed
2  53733358271401869469  6FfHZRoiWs2VO02Pruk07A|__g3d877adf9d154637be26
3  50491824499499656478  __gbe204670ca784a01b7207b42a7e5a5d3|54e2c39cd3
4  82143248133286027306  __g1114a30c6ea548a2a83d5a51718ff0fd|773840905c
I want two new dataframes, cct_data and dct_data, built from ct_data. The ct_data dataframe should be split on one condition: if the row's imjp_number is present in the dwpjp dataframe, put it into cct_data; otherwise put it into dct_data.
I tried this for common jp_number present in dwpjp:
cct_data = ct_data[ct_data.isin(dwpjp).any(1).values]
and for the other I negated the condition as follows:
dct_data = ct_data[~[ct_data.isin(dwpjp).any(1).values]]
but I am not getting the expected results, which are shown below.
cct_data
   imjp_number           imct_id
0  57225427599900052634  aa0d2dac654d4154bf7c09f73faeaf62|-vf6738ee3bed
1  50491824499499656478  __gbe204670ca784a01b7207b42a7e5a5d3|54e2c39cd3
and dct_data:
   imjp_number           imct_id
0  23605308039805192764  x1E5e3ukRyEFRT6SUAF6lg|d543d3d064da465b8576d87
1  53733358271401869469  6FfHZRoiWs2VO02Pruk07A|__g3d877adf9d154637be26
2  82143248133286027306  __g1114a30c6ea548a2a83d5a51718ff0fd|773840905c
Note: jp_number in dwpjp corresponds to imjp_number in ct_data.
I modified your formula as below, comparing the two ID columns directly:
cct_data = ct_data[ct_data.imjp_number.isin(dwpjp.jp_number)]
and
dct_data = ct_data[~ct_data.imjp_number.isin(dwpjp.jp_number)]
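A self-contained sketch of the same split, assuming the IDs are stored as strings (20-digit values do not fit in int64) and using shortened placeholder values for imct_id. The reset_index calls are an addition to match the 0-based index shown in the expected output:

```python
import pandas as pd

# IDs kept as strings; imct_id values are shortened placeholders.
dwpjp = pd.DataFrame({
    "jp_number": ["25146315052147720191", "57225427599900052634",
                  "50491824499499656478"],
})
ct_data = pd.DataFrame({
    "imjp_number": ["23605308039805192764", "57225427599900052634",
                    "53733358271401869469", "50491824499499656478"],
    "imct_id": ["id_a", "id_b", "id_c", "id_d"],
})

# True where the row's ID also appears in dwpjp.
mask = ct_data["imjp_number"].isin(dwpjp["jp_number"])

# Rows with a matching ID, and the complement.
cct_data = ct_data[mask].reset_index(drop=True)
dct_data = ct_data[~mask].reset_index(drop=True)

print(cct_data)
print(dct_data)
```

Computing the mask once and negating it guarantees the two frames partition ct_data exactly. The original attempt compared whole frames with DataFrame.isin, which matches on both index and column labels, and the column names differ here.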

How can I count the element in specific intervals in a dataframe?

I've got a dataframe like the one below, where column c01 holds the start time and c04 the end time of each interval:
c01 c04
1742 8.444991 14.022029
3786 29.91143 31.422439
3951 29.91143 31.145099
5402 37.81136 42.689595
8230 63.12394 65.34602
also a list like this (it's actually way longer):
8.522494
8.54471
8.578426
8.611193
8.644996
8.678053
8.710918
8.744901
8.777851
8.811053
8.844867
8.878389
8.912099
8.944729
8.977601
9.011232
9.04492
9.078157
9.111946
9.144788
9.177663
9.211054
9.245265
9.27805
9.311766
9.344647
9.377612
9.411709
I'd like to count how many elements of the list fall within the intervals given by the dataframe. I coded it like this:
count = 0
for index, row in speech.iterrows():
    count += gtls.count(lambda i : i in [row['c01'], row['c04']])
The script runs without errors, but count always turns out to be 0. Could you please tell me where I went wrong?
I took the liberty of converting your list into a numpy array() (I called it arr). Then you can use the apply function to create your count column. Let's assume your dataframe is called df.
def get_count(row):  # the logic for your summation is here
    return np.sum([(row['c01'] < arr) & (row['c04'] >= arr)])
df['C_sum'] = df.apply(get_count, axis = 1)
print(df)
Output:
c01 c04 C_sum
0 8.444991 14.022029 28
1 29.911430 31.422439 0
2 29.911430 31.145099 0
3 37.811360 42.689595 0
4 63.123940 65.346020 0
You can also do the whole thing in one line using lambda:
df['C_sum'] = df.apply(lambda row: np.sum([(row['c01'] < arr) & (row['c04'] >= arr)]), axis = 1)
Welcome to Stack Overflow! The expression i in [row['c01'], row['c04']] doesn't do what you seem to think: it checks whether element i is one of the two values in that two-element list, not whether it lies in the range between row['c01'] and row['c04']. In addition, list.count counts how many items equal its argument, and no item in the list equals the lambda object, so the result is always 0. To check whether a floating point number is within a range, use row['c01'] < i < row['c04'].
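For a long list and many intervals, the per-row comparison can be replaced by two binary searches on a sorted array. This is a minimal sketch, assuming the same interval convention as the apply-based answer (c01 < x <= c04) and using a few made-up values for gtls:

```python
import numpy as np
import pandas as pd

# Hypothetical small inputs in the shape described in the question.
speech = pd.DataFrame({"c01": [8.444991, 29.91143],
                       "c04": [14.022029, 31.422439]})
gtls = [8.522494, 8.54471, 9.411709, 30.0]

# Sort once so that binary search can be used for every interval.
arr = np.sort(np.asarray(gtls))

# side="right" returns how many elements are <= the given value.
n_le_start = np.searchsorted(arr, speech["c01"].to_numpy(), side="right")
n_le_end = np.searchsorted(arr, speech["c04"].to_numpy(), side="right")

# Elements with c01 < x <= c04 for each interval.
speech["count"] = n_le_end - n_le_start
print(speech)
```

The difference of the two counts gives the number of elements strictly above the start and at or below the end. Switching the first call to side="left" would include elements equal to the start instead.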

Reading values from dataframe.iloc is too slow and problem in dataframe.values

I use Python and I have data with 35,000 rows. I need to change values in a loop, but it takes too much time.
P.S.: I have columns named Succes_1, Succes_2, Succes_5, Succes_7, ..., Succes_120, so I build the column name in the inner loop. The values depend on another column.
Example:
SK_1 Sk_2 Sk_5 .... SK_120 Succes_1 Succes_2 ... Succes_120
1 0 1 0 1 0 0
1 1 0 1 2 1 1
for i in range(len(data_jeux)):
    for d in range(len(succ_len)):
        ids = succ_len[d]
        if data_jeux['SK_%s' % ids][i] == 1:
            data_jeux.iloc[i]['Succes_%s' % ids] = 1+i
Is there a faster way to do this? I tried:
data_jeux.values[i, ('Succes_%s' % ids)] = 1+i
but it raises an error; maybe it does not accept a string index.
You can define columns and then use loc to increment. It's not clear whether your columns are naturally ordered; if they aren't you can use sorted with a custom function. String-based sorting will cause '20' to come before '100'.
def splitter(x):
    return int(x.rsplit('_', maxsplit=1)[-1])
cols = df.columns
sk_cols = sorted(cols[cols.str.startswith('SK')], key=splitter)
succ_cols = sorted(cols[cols.str.startswith('Succes')], key=splitter)
df.loc[df[sk_cols] == 1, succ_cols] += 1
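If the goal is to reproduce the loop from the question exactly (write 1 + row position into Succes_k wherever SK_k equals 1), one possible sketch uses np.where on the underlying arrays. The frame below is made up, and succ_len is assumed to be the list of numeric suffixes from the question:

```python
import numpy as np
import pandas as pd

# Hypothetical small frame with the column naming used in the question.
data_jeux = pd.DataFrame({
    "SK_1": [1, 1, 0], "SK_2": [0, 1, 1],
    "Succes_1": [0, 0, 0], "Succes_2": [0, 0, 0],
})
succ_len = [1, 2]  # assumed list of column suffixes

# Build the paired column lists in the same order.
sk_cols = [f"SK_{i}" for i in succ_len]
succ_cols = [f"Succes_{i}" for i in succ_len]

# Column vector holding 1 + row position, broadcast across all columns.
row_pos = np.arange(len(data_jeux))[:, None] + 1

# Where SK_k == 1 take 1 + row position, otherwise keep the current value.
mask = data_jeux[sk_cols].to_numpy() == 1
data_jeux[succ_cols] = np.where(mask, row_pos, data_jeux[succ_cols].to_numpy())
print(data_jeux)
```

This replaces both Python loops with a single array operation. It depends on sk_cols and succ_cols being in the same order, so that column k of the mask lines up with column k of the target.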
