I have a few hundred thousand rows of data with many different currency forms, some examples being:
116,319,545 SAR
40,381,846 CNY
57,712,170 CNY
158,073,425 RUB2
0 MYR
0 EUR
USD 110,169,240
These values are read into a DataFrame, and I am not sure of the best way (is there a prebuilt one?) to extract just the integer value from all of these possible cases. There are probably more currencies in the data than the ones shown.
Currently the best approach I have is:
df1['value'].str.replace(r"[a-zA-Z,]",'').astype(int)
But this obviously fails on entries like xxxx RUB2.
EDIT:
In addition to the working answer, it is also reasonable to expect the currency code itself to be important; to extract it, the regex is ([A-Z]+\d*)
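A minimal self-contained sketch of pulling out that currency code (the 3 sample rows and the column names here are just for illustration):

import pandas as pd

df = pd.DataFrame({"col": ["116,319,545 SAR", "158,073,425 RUB2", "USD 110,169,240"]})
# ([A-Z]+\d*) captures the alphabetic code plus any trailing digit, e.g. RUB2
df["currency"] = df["col"].str.extract(r"([A-Z]+\d*)", expand=False)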
Given this df
import pandas as pd

df = pd.DataFrame()
df["col"] = ["116,319,545 SAR",
             "40,381,846 CNY",
             "57,712,170 CNY",
             "158,073,425 RUB2",
             "0 MYR",
             "0 EUR",
             "USD 110,169,240"]
You can use the regex (\d+) after removing commas:
df.col.str.replace(",", "").str.extract(r"(\d+)").astype(int)
0
0 116319545
1 40381846
2 57712170
3 158073425
4 0
5 0
6 110169240
Another, more manual solution is to split on spaces and keep the token that is numeric once commas are stripped:
df.col.str.split(' ').apply(lambda d: pd.Series(int(x.replace(",","")) for x in d if x.replace(",","").isdigit()).item())
0 116319545
1 40381846
2 57712170
3 158073425
4 0
5 0
6 110169240
I have a dataframe with the following column. Each row contains different format strings.
col |
----------------------
GRA/B
TPP
BBMY
...
SOCBBA 0 MAX
CMBD 0 MAX
EPR 5.75 MAX
...
PMUST 5.57643 02/15/34
LEO 0 12/30/2099
RGB 3.125 09/15/14
RGB 3.375 04/15/20
I want to convert all the dates to a format that shows the full year.
Is there a way to regex this so that it looks like this.
col |
----------------------
GRA/B
TPP
BBMY
...
SOCBBA 0 MAX
CMBD 0 MAX
EPR 5.75 MAX
...
PMUST 5.57643 02/15/2034
LEO 0 12/30/2099
RGB 3.125 09/15/2014
RGB 3.375 04/15/2020
Right now the only thing I can think of is
df['col'] = df['col'].str.replace('/14', '/2014')
for each year, but there are many years, and it would also replace days and months.
How can I achieve this properly, should I be using regex?
What about replacing when it "ends with a slash followed by 2 digits"?
In [9]: df["col"] = df["col"].str.replace(r"/(\d{2})$", r"/20\1", regex=True)
In [10]: df
Out[10]:
col
0 GRA/B
1 TPP
2 BBMY
3 ...
4 SOCBBA 0 MAX
5 CMBD 0 MAX
6 EPR 5.75 MAX
7 ...
8 PMUST 5.57643 02/15/2034
9 LEO 0 12/30/2099
10 RGB 3.125 09/15/2014
11 RGB 3.375 04/15/2020
regex:
/: a literal forward slash
(\d{2}): capture 2 digits
$: end of string
replacement:
/20: a literal forward slash followed by 20
\1: the first capturing group in the regex, i.e., the last 2 digits
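To see the pattern on a single string in isolation (a quick illustration, not from the original answer):

import re

re.sub(r"/(\d{2})$", r"/20\1", "RGB 3.125 09/15/14")   # -> 'RGB 3.125 09/15/2014'
re.sub(r"/(\d{2})$", r"/20\1", "LEO 0 12/30/2099")     # no match: the year already has 4 digits, so the string is unchanged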
Is it possible to conditionally append data to an existing template dataframe? I'll try to make the data below as simple as possible, since I'm asking more for conceptual help than actual code so I better understand the mindset of solving these kinds of problems in the future (but actual code would be great too).
Example Data
I have a dataframe below that shows 4 dummy products SKUs that a client may order. These SKUs never change. Sometimes a client orders large quantities of each SKU, and sometimes they only order one or two SKUs. Due to reporting, I need to fill unordered SKUs with zeroes (probably use ffill?)
Dummy dataframe DF

product_sku   quantity   total_cost
1234
5678
4321
2468
Problem
Currently, my data only returns the SKUs that customers have ordered (a), but I would like unordered SKUs to be returned, with zeros filled in for quantity and total_cost (b)
(a)

product_sku   quantity   total_cost
1234          10         50.00
5678          3          75.00

(b)

product_sku   quantity   total_cost
1234          10         50.00
5678          3          75.00
4321          0          0
2468          0          0
I'm wondering if there's a way to take that existing dataframe, and simply append any sales that actually occurred, leaving the unordered SKUs as zero or blank (whatever makes more sense).
I just need some help thinking through the steps logically, and wasn't able to find anything like this. I'm still relatively novice at this stuff, so let me know if I'm missing any pertinent information.
Thanks!
One way is to use reindex after putting the column with the product SKUs as the index with set_index. With your notation it would be something like:
l_products = DF['product_sku'].tolist()  # you may have the list differently
b = (a.set_index('product_sku')
      .reindex(l_products, fill_value=0)
      .reset_index()
     )
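A quick illustrative run with the sample data from (a) above (the names DF and a follow the question's notation; this is only a sketch):

import pandas as pd

DF = pd.DataFrame({'product_sku': ['1234', '5678', '4321', '2468']})
a = pd.DataFrame({'product_sku': ['1234', '5678'],
                  'quantity': [10, 3],
                  'total_cost': [50.00, 75.00]})

l_products = DF['product_sku'].tolist()
b = (a.set_index('product_sku')
      .reindex(l_products, fill_value=0)
      .reset_index())
# 4321 and 2468 now appear in b with quantity 0 and total_cost 0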
If you know the SKUs a priori, maintain one DataFrame initialized with zeros and update the relevant rows. Then you will always have all SKUs.
For example:
import pandas as pd
# initialization
df = pd.DataFrame(0, index=['1234', '5678', '4321', '2468'],
                  columns=['total_cost', 'quantity'])
print(df)
# updating
df.loc['1234', :] = {'total_cost': 100, 'quantity': 4}
print(df)
# incrementing quantity
df.loc['1234', 'quantity'] += 5
print(df)
total_cost quantity
1234 0 0
5678 0 0
4321 0 0
2468 0 0
total_cost quantity
1234 100 4
5678 0 0
4321 0 0
2468 0 0
total_cost quantity
1234 100 9
5678 0 0
4321 0 0
2468 0 0
I have results from an A/B test that I need to evaluate, but while checking the data I noticed that some users appeared in both groups, and I need to drop them so they don't skew the test. My data looks something like this:
transactionId visitorId date revenue group
0 906125958 0 2019-08-16 10.8 B
1 1832336629 1 2019-08-04 25.9 B
2 3698129301 2 2019-08-01 165.7 B
3 4214855558 2 2019-08-07 30.5 A
4 797272108 3 2019-08-23 100.4 A
What I need to do is remove every user that was in both A and B groups while leaving the rest intact. So from the example data I need this output:
transactionId visitorId date revenue group
0 906125958 0 2019-08-16 10.8 B
1 1832336629 1 2019-08-04 25.9 B
4 797272108 3 2019-08-23 100.4 A
I tried to do it in various ways and I can't seem to figure it out, and I couldn't find an answer for it anywhere. I would really appreciate some help here,
thanks in advance
You can get a list of users that are in just one group like this:
group_counts = df.groupby('visitorId').agg({'group': 'nunique'})  ## number of distinct groups per visitor
to_include = group_counts[group_counts['group'] == 1]  ## keep only visitors seen in exactly 1 group
And then filter your original data according to which visitors are in that list:
df = df[df['visitorId'].isin(to_include.index)]
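The same idea as a one-liner, using transform so the filter can be applied directly (just a sketch, equivalent to the two steps above):

df = df[df.groupby('visitorId')['group'].transform('nunique') == 1]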
I am trying to extract two values from arbitrary text, formatted in variable ways. The two values are different, and I want to distinguish them based on a nearby string, let's say "DDT" and "EEG". Here are some examples of how the strings can be formatted.
This contains 42.121% DDT and 2.1% EEG
Now with DDT: 12% EEG: 23.2%
47 DDT 22 EEG
EEG N/A DDT 43
5% EEG 20% DDT and more
Essentially I need to be able to select both values preceded by and followed by their identifier.
I have been using a | between two selectors to capture both "cases" for each value, but I am having trouble. I want to prevent the regex from selecting "12% EEG" in the second example line. I am trying to use negative lookaheads and positive lookbehinds but can't make it work.
Here is the regex for selecting just ddt
(?<=eeg)(\d{1,3}\.?\d{1,6}).{,10}?ddt|ddt(?!.*eeg).{,10}?(\d{1,3}\.?\d{1,6})
This is the closest I have gotten, but it still does not work correctly. This version fails to match "20% DDT."
My original regex did not use lookbehinds, but also fails in some cases.
(?:(?:(\d{1,3}\.?\d*)[^(?:eeg)]{0,10}?ddt)|(?:ddt[^(?:eeg)]{0,10}?(\d{1,3}\.?\d*)))
My original approach fails to recognize the 23.2% EEG strings formatted like this. "DDT: 12% EEG: 23.2%"
I am not sure if this type of selector is possible with regex, but I want to use regex in order to vectorize this extraction. I have a function that does a good job of characterizing these strings, but it is very slow on large datasets (~1 million records). The regex runs quickly and is easy to apply to vectors, which is why I want to use it. If there are other suggestions to solve this problem with NLP or numpy/pandas functions I am open to those as well.
You could try the following, at least for these cases:
1/ work out which comes first, EEG or DDT:
In [11]: s.str.extract("(DDT|EEG)")
Out[11]:
0
0 DDT
1 DDT
2 DDT
3 EEG
4 EEG
2/ pull out all the numbers:
In [12]: s.str.extract(r"(\d+\.?\d*|N/A).*?(\d+\.?\d*|N/A)")
Out[12]:
0 1
0 42.121 2.1
1 12 23.2
2 47 22
3 N/A 43
4 5 20
To get rid of the N/A you can apply to_numeric:
In [13]: res = s.str.extract(r"(\d+\.?\d*|N/A).*?(\d+\.?\d*|N/A)").apply(pd.to_numeric, errors='coerce', axis=1)
In [14]: res
Out[14]:
0 1
0 42.121 2.1
1 12.000 23.2
2 47.000 22.0
3 NaN 43.0
4 5.000 20.0
Now you have to rearrange these columns to match their respective DDT/EEG:
In [15]: pd.DataFrame({
    ...:     "DDT": res[0].where(s.str.extract("(DDT|EEG)")[0] == 'DDT', res[1]),
    ...:     "EEG": res[1].where(s.str.extract("(DDT|EEG)")[0] == 'DDT', res[0])
    ...: })
Out[15]:
DDT EEG
0 42.121 2.1
1 12.000 23.2
2 47.000 22.0
3 43.000 NaN
4 20.000 5.0
Here s is the original Series/column:
In [21]: s
Out[21]:
0 This contains 42.121% DDT and 2.1% EEG
1 Now with DDT: 12% EEG: 23.2%
2 47 DDT 22 EEG
3 EEG N/A DDT 43
4 5% EEG 20% DDT and more
dtype: object
This assumes both DDT and EEG are present; you might need to NaN out the rows where this isn't the case (i.e., rows that contain only one of DDT/EEG)...
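A minimal sketch of that masking step (out here stands for the frame built in In [15], a variable name I'm assuming):

import numpy as np

has_both = s.str.contains("DDT") & s.str.contains("EEG")
out.loc[~has_both] = np.nan   # rows with only one of DDT/EEG become NaN in both columns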
I have been spinning my wheels with this problem and was wondering if anyone has any insight on how best to approach it. I have a pandas DataFrame with a number of columns, including one datetime64[ns]. I would like to find some way to 'group' records together which have datetimes which are very close to one another. For example, I might be interested in grouping the following transactions together if they occur within two seconds of each other by assigning a common ID called Grouped ID:
Transaction ID Time Grouped ID
1 08:10:02 1
2 08:10:03 1
3 08:10:50
4 08:10:55
5 08:11:00 2
6 08:11:01 2
7 08:11:02 2
8 08:11:03 3
9 08:11:04 3
10 08:15:00
Note that I am not looking to have the time window expand ad infinitum if transactions continue to occur at quick intervals - once a full 2 second window has passed, a new window would begin with the next transaction (as shown in transactions 5 - 9). Additionally, I will ultimately be performing this analysis at the millisecond level (i.e. combine transactions within 50 ms) but stuck with seconds for ease of presentation above.
Thanks very much for any insight you can offer!
The solution I suggest requires you to reindex your data with your Time data.
You can use a list of datetimes with the desired frequency, use searchsorted to find the nearest datetimes in your index, and then use it for slicing (as suggested in question python pandas dataframe slicing by date conditions and Python pandas, how to truncate DatetimeIndex and fill missing data only in certain interval).
I'm using pandas 0.14.1 and the DateOffset object (http://pandas.pydata.org/pandas-docs/dev/timeseries.html?highlight=dateoffset). I didn't check with datetime64, but I guess you might adapt the code. DateOffset goes down to the microsecond level.
Using the following code,
import pandas as pd
import pandas.tseries.offsets as pto
import numpy as np
# Create some test data
d_size = 15
df = pd.DataFrame({"value": np.arange(d_size)},
                  index=pd.date_range("2014/11/03", periods=d_size, freq=pto.Milli()))
# Define periods to define groups (ticks)
ticks = pd.date_range("2014/11/03", periods=d_size // 3, freq=5 * pto.Milli())
# find nearest indexes matching the ticks
index_ticks = np.unique(df.index.searchsorted(ticks))
# make a dataframe with the group ids
dgroups = pd.DataFrame(index=df.index, columns=['Group id'])
# set the group ids
for i, (mini, maxi) in enumerate(zip(index_ticks[:-1], index_ticks[1:])):
    dgroups.iloc[mini:maxi] = i
# update original dataframe
df['Group id'] = dgroups['Group id']
I was able to obtain this kind of dataframe:
value Group id
2014-11-03 00:00:00 0 0
2014-11-03 00:00:00.001000 1 0
2014-11-03 00:00:00.002000 2 0
2014-11-03 00:00:00.003000 3 0
2014-11-03 00:00:00.004000 4 0
2014-11-03 00:00:00.005000 5 1
2014-11-03 00:00:00.006000 6 1
2014-11-03 00:00:00.007000 7 1
2014-11-03 00:00:00.008000 8 1
2014-11-03 00:00:00.009000 9 1
2014-11-03 00:00:00.010000 10 2
2014-11-03 00:00:00.011000 11 2
2014-11-03 00:00:00.012000 12 2
2014-11-03 00:00:00.013000 13 2
2014-11-03 00:00:00.014000 14 2
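A rough sketch of adapting the same fixed-tick idea to the question's 2-second example (my own assumption, not verified in the original answer; note it bins against a fixed grid of ticks rather than restarting the window at each transaction that follows a gap):

import pandas as pd

# hypothetical transaction times matching the question's example
times = pd.to_datetime(["08:10:02", "08:10:03", "08:10:50", "08:10:55", "08:11:00",
                        "08:11:01", "08:11:02", "08:11:03", "08:11:04", "08:15:00"])
df = pd.DataFrame({"transaction_id": range(1, 11)}, index=times)

# 2-second ticks spanning the data, then the same searchsorted trick as above
ticks = pd.date_range(times.min(), times.max(), freq="2s")
df["group_id"] = ticks.searchsorted(df.index, side="right") - 1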