Creating new dataframes by selecting rows with numbers/digits - python

I have this small dataframe:
index words
0 home # there is a blank in words
1 zone developer zone
2 zero zero
3 z3 z3
4 ytd2525 ytd2525
... ... ...
3887 18TH 18th
3888 180m 180m deal
3889 16th 16th
3890 150M 150m monthly
3891 10am 10am 20200716
I would like to extract all the words in index which contain numbers, in order to create a dataframe with only them, and another dataframe with only the rows where both index and words contain numbers.
To select rows which contain numbers I have considered the following:
m1 = df['index'].apply(lambda x: not any(i.isnumeric() for i in x.split()))
m2 = df['index'].str.isalpha()
m3 = df['index'].apply(lambda x: not any(i.isdigit() for i in x))
m4 = ~df['index'].str.contains(r'[0-9]')
I do not know which one should be preferred (as they are redundant). But I would also consider another case, where both index and words contain numbers (digits), in order to select rows and create two dataframes.

Your question is not entirely clear, so I'm happy to correct this if I misunderstood it.
To put all the words in index that contain numbers into their own dataframe, try:
df.loc[df['index'].str.contains(r'\d', na=False), 'index'].to_frame()
and for rows where both index and words contain numbers:
df.loc[df['index'].str.contains(r'\d', na=False) & df['words'].str.contains(r'\d', na=False), :]
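As a self-contained sketch of both selections, the snippet below rebuilds a few rows from the question. The last row (`q4` / `quarter`) is hypothetical, added only so that the two results differ, and the blank in `words` is assumed to be a missing value.

```python
import pandas as pd

# A few rows from the question; the final row is a hypothetical addition.
df = pd.DataFrame({
    "index": ["home", "zone", "zero", "z3", "ytd2525", "18TH", "180m", "q4"],
    "words": [None, "developer zone", "zero zero", "z3", "ytd2525",
              "18th", "180m deal", "quarter"],
})

# True where the string contains at least one digit; missing values count as False.
has_digit_index = df["index"].str.contains(r"\d", na=False)
has_digit_words = df["words"].str.contains(r"\d", na=False)

# Dataframe 1: only the `index` values that contain a digit.
only_index = df.loc[has_digit_index, ["index"]]

# Dataframe 2: rows where both columns contain a digit.
both = df[has_digit_index & has_digit_words]

print(only_index)
print(both)
```

Of the four masks in the question, `m4` is the closest vectorized match to "contains a digit". `m1` only reacts to whitespace-separated tokens that are entirely numeric, so it treats `z3` as having no number, and `m2` is also False for strings with spaces or punctuation.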

Related

Find which column has unique values that can help distinguish the rows with Pandas

I have the following dataframe, which contains 2 rows:
index name food color number year hobby music
0 Lorenzo pasta blue 5 1995 art jazz
1 Lorenzo pasta blue 3 1995 art jazz
I want to write code that can tell me which column distinguishes between these two rows.
For example, in this dataframe, the column "number" is the one that distinguishes between the two rows.
Until now I have done this very simply by going over column after column using iloc and looking at the values.
duplicates.iloc[:,3]
>>>
0 blue
1 blue
It's important to take into account that:
This should run inside a for loop, because each time I check a newly generated dataframe.
There may be more than 2 rows which I need to check
There may be more than 1 column that can distinguish between the rows.
I thought that the way to check this would be to take one column at a time, get its unique values and check whether they are equal to each other, similar to this:
for n in np.arange(0,len(df.columns)):
tmp=df.iloc[:,n]
and then I thought to compare whether all the values are the same in the temporary dataframe, but here I got stuck because sometimes I have many rows and also I need.
My end goal: to be able to check, inside a for loop, which column has different values in each row of the temporary dataframe and hence can help to distinguish between the rows.
You can apply the duplicated method on all columns:
s = df.apply(pd.Series.duplicated).any()
s[~s].index
Output: ['number']
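With more than two rows, "distinguishes the rows" can mean either that a column varies at all or that every row has its own value. A minimal sketch of both readings using `DataFrame.nunique`, assuming a hypothetical third row added to the question's data:

```python
import pandas as pd

# The two rows from the question plus one hypothetical third row.
df = pd.DataFrame({
    "name":   ["Lorenzo", "Lorenzo", "Lorenzo"],
    "food":   ["pasta", "pasta", "pasta"],
    "color":  ["blue", "blue", "blue"],
    "number": [5, 3, 7],
    "year":   [1995, 1995, 1996],
    "hobby":  ["art", "art", "art"],
    "music":  ["jazz", "jazz", "jazz"],
})

# Number of distinct values per column (NaN counted as a value).
distinct_counts = df.nunique(dropna=False)

# Columns that vary at all across the rows.
varying = df.columns[distinct_counts > 1].tolist()

# Columns where every row has a different value.
fully_distinct = df.columns[distinct_counts == len(df)].tolist()

print(varying)
print(fully_distinct)
```

Because this works on whatever dataframe it is given, it can be called once per iteration inside the loop that generates the dataframes.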

pandas series row-wise comparison (preserve cardinality/indices of larger series)

I have two pandas series, both string dtypes.
reports['corpus'] has 1287 rows
0 point seem peaking effects drug unique compari...
1 mother god seen much difficult withstand spent...
2 getting weird half breakthrough feels like sec...
3 vomited three times bucket suddenly felt much ...
4 reached peak mild walk around without difficul...
labels['uniq_labels'] has 52 rows
0 amplification
1 enhancement
2 psychedelic
3 sensory
4 visual
I want to create a new series object equal to the size of reports['corpus']. In it, each row needs to contain a list of string matches (i.e. searching reports['corpus'] for exact string matches to strings in labels['uniq_labels']).
I have tried looping over the two series to check if a string from labels['uniq_labels'] is in a report from reports['corpus']. I split at the report iter and am able to return a list of the strings that match. Though I can't seem to preserve conditions such as: allocating string matches for a given report to the reports' index position (very important).
Edit (Adding example of the series objects):
reports_series = pd.Series(['This is a test first sentence. \
This is the first row of a pandas series.',
'Here is the second row. The row that means the most. The row that never goes away.',
'The third sentence. The third row to the example pandas series.',
'This is the fourth and only fourth row of the pandas series.',
'Here is the fifth row. The fifth row that means the most.'])
labels_series = pd.Series(['first', 'sentence', 'second row'])
Convert uniq_labels column from the labels dataframe to a list, and split the corpus column from reports dataframe on white space, and take the values that are in both the lists.
(reports['corpus']
.str.split(' ')
.apply(lambda x:[i for i in labels['uniq_labels'].tolist() if i in x]))
0 []
1 []
2 []
3 []
4 []
Name: corpus, dtype: object
In the sample you have shown above, no values actually match, so the output contains only empty lists.
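Splitting on spaces cannot match a multi-word label such as 'second row' or a word followed by punctuation such as 'sentence.'. One possible alternative, sketched here on the example series from the question's edit, is a single word-boundary regex with `str.findall`; the names `pattern` and `matches` are illustrative.

```python
import re
import pandas as pd

reports_series = pd.Series([
    "This is a test first sentence. This is the first row of a pandas series.",
    "Here is the second row. The row that means the most. The row that never goes away.",
    "The third sentence. The third row to the example pandas series.",
    "This is the fourth and only fourth row of the pandas series.",
    "Here is the fifth row. The fifth row that means the most.",
])
labels_series = pd.Series(["first", "sentence", "second row"])

# Try longer labels first so a multi-word label wins over a shorter one it contains.
ordered = sorted(labels_series, key=len, reverse=True)
pattern = r"\b(?:" + "|".join(re.escape(label) for label in ordered) + r")\b"

# findall keeps one result per report, so the index of reports_series is preserved.
# dict.fromkeys drops repeated matches while keeping first-seen order.
matches = reports_series.str.findall(pattern).apply(
    lambda found: list(dict.fromkeys(found))
)

print(matches)
```

The result has the same length and index as `reports_series`, with an empty list where no label occurs.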

flag strings based on previous values in pandas

I would like to flag sentences located in a pandas dataframe. As you can see in the example, some of the sentences are split into multiple rows (these are subtitles from an srt file that I would like to translate to a different language eventually, but first I need to put them in a single cell). The end of the sentence is determined by the period at the end. I want to create a column like the column sentence, where I number each sentence (it doesn't have to be a string, it could be a number too)
values=[
['This is an example of subtitle.','sentence_1'],
['I want to group by sentences, which','sentence_2'],
['the end is determined by a period.','sentence_2'],
['row 0 should have sentece_1, rows 1 and 2 ','sentence_3'],
['should have sentence_2.','sentence_2'],
['and this last row should have sentence_3.','sentence_3']
]
df=pd.DataFrame(values,columns=['subtitle','sentence_number'])
df['presence_of_period']=df.subtitle.str.contains(r'\.')
df
output:
subtitle sentence_number presence_of_period
0 This is an example of subtitle. sentence_1 True
1 I want to group by sentences, which sentence_2 False
2 the end is determined by a period. sentence_2 True
3 row 0 should have sentece_1, rows 1 and 2 sentence_3 False
4 should have sentence_2. and this sentence_3 True
5 last row should have sentence_3. sentence_4 True
How can I create the sentence_number column since it has to read the previous cells on subtitle column? I was thinking of a window function or the shift() but couldn't figure out how to make it work. I added a column to show if the cell has a period, signifying the end of the sentence. Also, if possible, I would like to move the "and this" from row 4 to the beginning of row 5, since it is a new sentence (not sure if this one would require a different question).
Any thoughts?
To fix the sentence number, here's an option for you.
import pandas as pd
values=[
['This is an example of subtitle.','sentence_1'],
['I want to group by sentences, which','sentence_2'],
['the end is determined by a period.','sentence_2'],
['row 0 should have sentece_1, rows 1 and 2 ','sentence_3'],
['should have sentence_2.','sentence_2'],
['and this last row should have sentence_3.','sentence_3']
]
df=pd.DataFrame(values,columns=['subtitle','sentence_number'])
df['presence_of_period']=df.subtitle.str.count(r'\.')
df['end'] = df.subtitle.str.endswith('.').astype(int)
df['sentence_#'] = 'sentence_' + (1 + df['presence_of_period'].cumsum() - df['end']).astype(str)
#print (df['subtitle'])
#print (df[['sentence_number','presence_of_period','end','sentence_#']])
df.drop(['presence_of_period','end'],axis=1, inplace=True)
print (df[['subtitle','sentence_#']])
The output will be as follows:
subtitle sentence_#
0 This is an example of subtitle. sentence_1
1 I want to group by sentences, which sentence_2
2 the end is determined by a period. sentence_2
3 row 0 should have sentece_1, rows 1 and 2 sentence_3
4 should have sentence_2. sentence_3
5 and this last row should have sentence_3. sentence_4
If you need to move the partial sentence to the next row, I need to understand a few more details.
What do you want to do if there are more than two sentences in a row? For example, 'This is first sentence. This second. This is'.
In that case, do you want to split the first sentence into one row, the second into another row, and concatenate the third with the next row's data?
Once I understand this, we can use df.explode() to solve it.
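The question mentions shift(); a minimal sketch of that idea follows, assuming a sentence only ever ends at the end of a row (a row with a period in the middle is not split). The column name `sentence_id` and the variable `merged` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"subtitle": [
    "This is an example of subtitle.",
    "I want to group by sentences, which",
    "the end is determined by a period.",
    "row 0 should have sentece_1, rows 1 and 2 ",
    "should have sentence_2.",
    "and this last row should have sentence_3.",
]})

# True where the row closes a sentence (trailing whitespace ignored).
ends_sentence = df["subtitle"].str.strip().str.endswith(".")

# A new sentence starts on the row after one that ended, so shift by one row
# and take the running count of sentence endings seen so far.
df["sentence_id"] = ends_sentence.shift(fill_value=False).cumsum() + 1

# Put each sentence in a single cell by joining its rows.
merged = df.groupby("sentence_id")["subtitle"].agg(
    lambda rows: " ".join(r.strip() for r in rows)
)

print(df)
print(merged)
```

Here `merged` holds one full sentence per `sentence_id`, which is the form the question wants before translating.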

Adding the quantities of products in a dataframe column in Python

I'm trying to calculate the sum of weights in a column of an excel sheet that contains the product title with the help of Numpy/Pandas. I've already managed to load the sheet into a dataframe, and isolate the rows that contain the particular product that I'm looking for:
dframe = xlsfile.parse('Sheet1')
dfFent = dframe[dframe['Product:'].str.contains("ABC") == True]
But I can't seem to find a way to sum up its weights, due to the obvious complexity of the problem (as shown below). For example, suppose the column 'Product Title' contains values like:
1 gm ABC
98% pure 12 grams ABC
0.25 kg ABC Powder
ABC 5gr
where ABC is the product whose weight I'm looking to add up. Is there any way that I can add all these weights up to get a total of 268 gm? Any help or resources pointing to the solution would be highly appreciated. Thanks! :)
You can use extractall for values with units or percentage:
(?P<a>\d+\.\d+|\d+) means extract float or int to column a
\s* - is zero or more spaces between number and unit
(?P<b>[a-z%]+) is extract lowercase unit or percentage after number to b
# add all possible units to the dictionary
d = {'gm':1,'gr':1,'grams':1,'kg':1000,'%':.01}
df1 = df['Product:'].str.extractall(r'(?P<a>\d+\.\d+|\d+)\s*(?P<b>[a-z%]+)')
print (df1)
            a      b
  match
0 0         1     gm
1 0        98      %
  1        12  grams
2 0      0.25     kg
3 0         5     gr
Then convert the first column to numeric and map the second through the dictionary of units. Reshape with unstack, multiply across the columns with prod, and finally sum:
a = df1['a'].astype(float).mul(df1['b'].map(d)).unstack().prod(axis=1).sum()
print (a)
267.76
Similar solution:
a = df1['a'].astype(float).mul(df1['b'].map(d)).groupby(level=0).prod().sum()
You need to do some data wrangling to get the column into a consistent format. You can do some matching to align the Product column, similar to date-time formatting.
For example, you can do the following:
Make a separate column with only the values (float)
Change each % value to a decimal and multiply it by the quantity
Convert values in kg to grams
Sum the resulting float-only column to get the total.
Pandas works well for this problem.
Note: There is no shortcut here; you need to separate the strings from the decimal values before you can calculate the sum.
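The steps listed above can be sketched column by column as follows. This is one possible reading, assuming the only units are kg, grams, gm and gr, that a percentage is a purity to multiply the weight by, and that a row without a percentage is 100% pure; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"Product:": [
    "1 gm ABC",
    "98% pure 12 grams ABC",
    "0.25 kg ABC Powder",
    "ABC 5gr",
]})
title = df["Product:"]

# Step 1: a number followed by a known weight unit goes into its own columns.
weight = title.str.extract(r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>kg|grams|gm|gr)\b")

# Step 2: a number followed by % becomes a decimal fraction (1.0 when absent).
purity = (
    title.str.extract(r"(\d+(?:\.\d+)?)\s*%", expand=False)
    .astype(float)
    .div(100)
    .fillna(1.0)
)

# Step 3: convert every weight to grams.
to_grams = {"kg": 1000, "grams": 1, "gm": 1, "gr": 1}
grams = weight["value"].astype(float) * weight["unit"].map(to_grams)

# Step 4: only floats are left, so the total is a plain sum.
total = (grams * purity).sum()
print(total)
```

A title whose unit is not in the regex yields NaN for that row, which `sum` skips, so checking `grams.isna()` first is a reasonable way to spot unmapped units.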

How can I count number of columns whose name starts with specific words

In my Python dataframe I have around 40 columns. Of these, around 20 columns start with "Name_", for example "Name_History" and "Name_Language", and the remaining columns start with "Score_", for example "Score_Math" and "Score_Physics".
I would like to determine dynamically First & last index of columns starting with "Name_" .
You can do this to return an array with first and last:
df.columns[df.columns.str.startswith('Name_')][[0,-1]]
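The line above returns the first and last matching column names. If the count and the integer positions are wanted as well, a minimal sketch with a hypothetical five-column dataframe could look like this:

```python
import numpy as np
import pandas as pd

# Hypothetical column layout; only the names matter here.
df = pd.DataFrame(columns=["id", "Name_History", "Name_Language",
                           "Score_Math", "Score_Physics"])

# Boolean array, True for columns whose name starts with "Name_".
mask = df.columns.str.startswith("Name_")

# How many columns match.
count = int(mask.sum())

# Integer positions of the matching columns.
positions = np.flatnonzero(mask)

if count:
    first_pos, last_pos = int(positions[0]), int(positions[-1])
    first_name, last_name = df.columns[positions[0]], df.columns[positions[-1]]
    print(count, first_pos, last_pos, first_name, last_name)
```

The `if count:` guard avoids an IndexError when no column has the prefix.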
