I have the following df
Trends Value
2021-12-13T08:00:00.000Z 45
2021-12-13T07:00:00.000Z 32
2021-12-13T06:42:10.000Z 23
2021-12-13T06:27:00.000Z 45
2021-12-10T05:00:00.000Z 23
I ran the following line:
df['Trends'].str.extract('^(.*:[1-9][1-9].*)$', expand=True)
It returns:
0
NaN
NaN
2021-12-13T06:42:10.000Z
2021-12-13T06:27:00.000Z
NaN
My objective is to use the regex to extract any trends that have minutes and seconds greater than zero. The regex works (tested) and the line also works, but what I don't understand is why it returns NaN when it does not match. I looked through several other SO answers and the line is pretty much the same.
My expected outcome:
2021-12-13T06:42:10.000Z
2021-12-13T06:27:00.000Z
Your solution is close; you can get matches with str.match, then filter:
df[df.Trends.str.match('^(.*:[1-9][1-9].*)$')].Trends
output:
2 2021-12-13T06:42:10.000Z
3 2021-12-13T06:27:00.000Z
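As for the NaN values: str.extract keeps its result aligned with the original index, so rows that do not match come back as NaN rather than being dropped. If you prefer to stay with str.extract, a minimal sketch is to drop those rows afterwards:
# non-matching rows become NaN because extract aligns to the original index
df['Trends'].str.extract('^(.*:[1-9][1-9].*)$', expand=True).dropna()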
The previous answer won't work with the following data (where the minute is 00 but the second is not, or vice versa), but it will work with this updated regex:
df[df.Trends.str.match('^(?!.*:00:00\..*)(.*:[0-9]+:[0-9]+\..*)$')].Trends
or
df[df.Trends.str.match('^(?!.*:00:00\..*)(.*:.*\..*)$')].Trends
or, if the second doesn't matter but a 01 minute should still be selected:
df[df.Trends.str.match('^(?!.*:00:\d+\..*)(.*:.*\..*)$')].Trends
Trends Value
2021-12-13T07:00:00.000Z 32
2021-12-13T07:00:01.000Z 32
2021-12-13T07:00:10.000Z 32
2021-12-13T07:01:00.000Z 32
2021-12-13T07:10:00.000Z 32
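Alternatively, you can sidestep the regex entirely by parsing the column as datetimes and filtering on the time components. A sketch, assuming the Trends strings parse cleanly with pd.to_datetime and that the goal is the original expected output (minute or second non-zero):
ts = pd.to_datetime(df['Trends'])
# keep rows whose minute or second component is non-zero
df.loc[(ts.dt.minute != 0) | (ts.dt.second != 0), 'Trends']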
Hi, I'm cleaning up a big dataset about food products and I'm struggling with one column (df['serving_size'], dtype O) that describes the size of the product. It's a pandas dataframe that contains 300,000 observations. I managed to clean the text that came with the size with the help of regex:
df['serving_size'] = df['serving_size'].str.replace('[^\d\,\.]', ' ')
df['serving_size'] = df['serving_size'].str.replace('(^\d\,\.+)\s', '')
And I got this (the spaces are whitespace):
40.5 23
13
87 23
123
72,5
And my goal would be to keep only the first group of numbers for each row, including the , and ., like so:
40.5
13
87
123
72.5
Despite my research I didn't find how to achieve it. Thanks!
You can use .str.extract() with regex, as follows:
df['serving_size'] = df['serving_size'].str.extract(r'(\d+(?:,\d+)*(?:\.\d+)?)')
Result:
print(df)
serving_size
0 40.5
1 13
2 87
3 123
4 72.5
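If you also need the values as numbers afterwards (note that 72,5 uses a comma as the decimal separator), a possible follow-up sketch, assuming every remaining value is a plain number:
# normalise the decimal comma, then convert the strings to floats
df['serving_size'] = pd.to_numeric(df['serving_size'].str.replace(',', '.', regex=False))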
I am creating a new dataframe which should contain only the middle value (not the median!) of every n rows, however my code doesn't work.
I've tried several approaches with pandas and plain Python but I always fail.
value date index
14 40 1983-07-15 14
15 86 1983-07-16 15
16 12 1983-07-17 16
17 78 1983-07-18 17
18 69 1983-07-19 18
19 78 1983-07-20 19
20 45 1983-07-21 20
21 47 1983-07-22 21
22 48 1983-07-23 22
23 ..... ......... ..
RSDF5 = RSDF4.groupby(pd.Grouper(freq='15D', key='DATE')).[int(len(RSDF5)//2)].reset_index()
I know that the code is wrong and I am completely out of ideas!
SyntaxError: invalid syntax
A solution based on indexes.
df is your original dataframe, N is the number of rows you want to group (assumed to be an odd number, so there is a unique middle row).
df2 = df.groupby(np.arange(len(df))//N).apply(lambda x : x.iloc[len(x)//2])
Be aware that if the total number of rows is not divisible by N, the last group is shorter (you still get its middle value, though).
If N is an even number, you get the central row closer to the end of the group: for example, if N=6, you get the 4th row of each group of 6 rows.
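A minimal runnable sketch of the same idea, using N=3 and the sample values from the question for illustration:
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': [40, 86, 12, 78, 69, 78],
                   'date': pd.date_range('1983-07-15', periods=6)})
N = 3  # rows per group (odd, so each group has a unique middle row)

# label rows 0,0,0,1,1,1,... then take the middle row of each group
df2 = df.groupby(np.arange(len(df)) // N).apply(lambda x: x.iloc[len(x) // 2])
print(df2)
#    value       date
# 0     86 1983-07-16
# 1     69 1983-07-19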
I have two dataframes that might look like this:
df1:
name start end
stuart 0 20
lamp 32 34
hamlet 16 100
df2:
name start end
LOXL1 30 40
FOXP3 0 11
INSN 43 70
I've seen many answers that find the intersection between two ranges. My favorite is:
range(max(start_1, start_2), min(end_1, end_2))
That's fine. But for my context, I just need to know whether the two ranges intersect at all. I can't seem to find an answer that works for my use case. The expected output would basically grab, for each row of df1, the names from df2 whose range intersects it:
name start end intersects
stuart 0 20 FOXP3
lamp 32 34 LOXL1
hamlet 16 100 LOXL1|INSN
Or, if this is easier (this solution would actually be ideal, but I can work with the first one):
name start end intersects
stuart 0 20 FOXP3
lamp 32 34 LOXL1
hamlet 16 100 LOXL1
hamlet 16 100 INSN
What I'm effectively stuck on is getting a True/False for whether the ranges in two rows intersect, without a for loop. A for loop is not a viable solution for me because I have 40k rows being compared against 6m rows.
Just using the mathematical way + numpy broadcasting:
v1 = df1.start.values
v2 = df1.end.values
s1 = df2.start.values
s2 = df2.end.values
# broadcast to a (len(df1), len(df2)) boolean grid of pairwise overlaps
# (df2.end > df1.start and df2.start < df1.end), then .dot with the names
# (plus a '|' separator) to join the matching names per df1 row
s = pd.DataFrame(((s2 - v1[:, None]) > 0) & ((s1 - v2[:, None]) < 0)).dot(df2.name + '|').str[:-1]
s
Out[737]:
0 FOXP3
1 LOXL1
2 LOXL1|INSN
dtype: object
#df1['New']=s.values
The question you need to answer, from what you already have, is whether that intersection range is non-empty:
if max(start_1, start_2) <= min(end_1, end_2):
You might find better tools in the interval module; it offers a variety of operations on known intervals, and I'm hopeful there are vectorized tools you can use.
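A vectorized sketch of that condition with numpy broadcasting, using the sample frames from the question and assuming inclusive endpoints; it produces the long-format ("ideal") output shown above:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'name': ['stuart', 'lamp', 'hamlet'],
                    'start': [0, 32, 16], 'end': [20, 34, 100]})
df2 = pd.DataFrame({'name': ['LOXL1', 'FOXP3', 'INSN'],
                    'start': [30, 0, 43], 'end': [40, 11, 70]})

# pairwise test: max(starts) <= min(ends), broadcast to a (len(df1), len(df2)) grid
overlap = (np.maximum(df1.start.values[:, None], df2.start.values) <=
           np.minimum(df1.end.values[:, None], df2.end.values))

# one output row per intersecting pair
i, j = np.nonzero(overlap)
result = df1.iloc[i].reset_index(drop=True).assign(intersects=df2['name'].values[j])
print(result)
#      name  start  end intersects
# 0  stuart      0   20      FOXP3
# 1    lamp     32   34      LOXL1
# 2  hamlet     16  100      LOXL1
# 3  hamlet     16  100       INSN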
I wrote a function to extract integers from strings. An example of the strings is below; they form a column in my dataframe.
The output I got is in square brackets, with a lot of numbers inside. I want to use those numbers for further computation, but when I check its type, instead of an integer it is a NoneType. Why is that, and how can I convert it into integers so I can compute .sum() or .mean() on the extracted numbers? Ideally, I want the extracted integers as another column, like with str.extract(regex, inplace=True).
Here is part of my data, which is a column in my dataframe df2017
Bo medium lapis 20 cash pr gr
Porte monnaie dogon vert olive 430 euros carte
Bo noires 2015 fleurs clips moins brillant 30 ...
Necklace No 20 2016 80€ carte Grecs 20h00 salo...
Bo mini rouges 30 carte 13h it
Necklace No 17 2016 100€ cash pr US/NYC crois ...
Chocker No 1 2016 + BO No 32 2016 70€ cash pr …
Here is my code
def extract_int_price():
    text = df2017['Items'].astype(str)
    text = text.to_string()
    amount = [int(x) for x in re.findall('(?<!No\s)(?<!new)(?!2016)(\d{2,4})+€?', text)]
    print(amount)
Thank you!
Your function returns None because you forgot the return statement. Since every function in Python has a return value, a function with no return statement implicitly returns None.
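With the missing return added (and nothing else changed), the original function might look like the sketch below; note it still matches against the whole to_string() dump, so the vectorized approach that follows is generally safer:
import re

def extract_int_price():
    text = df2017['Items'].astype(str)
    text = text.to_string()
    amount = [int(x) for x in re.findall('(?<!No\s)(?<!new)(?!2016)(\d{2,4})+€?', text)]
    return amount  # return instead of print, so the caller actually gets the list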
You want to use either str.findall or str.extractall:
In [11]: REGEX = '(?<!No\s)(?<!new)(?!2016)(\d{2,4})+€?'
In [12]: s = df2017['Items']
In [13]: s.str.findall(REGEX)
Out[13]:
0 [20]
1 [430]
2 [2015, 30]
3 [016, 80, 20, 00]
4 [30, 13]
5 [016, 100]
6 [016, 016, 70]
dtype: object
In [14]: s.str.extractall(REGEX)
Out[14]:
0
match
0 0 20
1 0 430
2 0 2015
1 30
3 0 016
1 80
2 20
3 00
4 0 30
1 13
5 0 016
1 100
6 0 016
1 016
2 70
Generally extractall is preferred since it keeps you in numpy rather than using a Series of python lists.
If your problem is getting the sum of the integers, then you can simply:
sum(int(x) for x in ...)
However, if your problem is with the regex, then you should consider improving your filter mechanism (deciding what should be captured). You may also consider filtering manually (though not ideal), word by word, determining which words are irrelevant.
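For example, building on the extractall output above, a sketch of a per-row sum stored as a new column (the column name amount_sum is just illustrative, and df2017 is assumed to keep its default index):
# one captured number per (row, match); sum them back per original row
nums = s.str.extractall(REGEX)[0].astype(int)
df2017['amount_sum'] = nums.groupby(level=0).sum()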
I am working with a Pandas data frame for one of my projects.
I have a column named Count containing integer values.
I have 720 values, one for each hour, i.e. 24 * 30 days.
I want to run a loop which first takes the initial 24 values from the data frame and puts them in a new column, then takes the next 24 and puts them in another new column, and so on.
for example:
input:
34
45
76
87
98
34
output:
34 87
45 98
76 34
Here there are 6 rows and I am taking the first 3 values and putting them in the first column and the next 3 in the second one.
Can someone please help with writing code for this? It would be of great help.
Thanks!
You can also try numpy's reshape method performed on pd.Series.values.
s = pd.Series(np.arange(720))
df = pd.DataFrame(s.values.reshape((30,24)).T)
Or use np.split (specify how many arrays you want to split into):
df = pd.DataFrame({"day" + str(i): v for i, v in enumerate(np.split(s.values, 30))})
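As a quick sanity check with the 6-value example from the question (two blocks of three values become two columns):
import numpy as np
import pandas as pd

s = pd.Series([34, 45, 76, 87, 98, 34])
df = pd.DataFrame(s.values.reshape((2, 3)).T)
print(df)
#     0   1
# 0  34  87
# 1  45  98
# 2  76  34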