I'm working with a dataset of about 32,000,000 rows:
RangeIndex: 32084542 entries, 0 to 32084541
df.head()
time device kpi value
0 2020-10-22 00:04:03+00:00 1-xxxx chassis.routing-engine.0.cpu-idle 100
1 2020-10-22 00:04:06+00:00 2-yyyy chassis.routing-engine.0.cpu-idle 97
2 2020-10-22 00:04:07+00:00 3-zzzz chassis.routing-engine.0.cpu-idle 100
3 2020-10-22 00:04:10+00:00 4-dddd chassis.routing-engine.0.cpu-idle 93
4 2020-10-22 00:04:10+00:00 5-rrrr chassis.routing-engine.0.cpu-idle 99
My goal is to create one additional column named role, filled in based on a regex against the device column.
This is my approach:
def router_role(row):
    if row["device"].startswith("1"):
        row["role"] = '1'
    if row["device"].startswith("2"):
        row["role"] = '2'
    if row["device"].startswith("3"):
        row["role"] = '3'
    if row["device"].startswith("4"):
        row["role"] = '4'
    return row
then,
df = df.apply(router_role,axis=1)
However, it's taking a lot of time... any idea about another possible approach?
Thanks
apply is very slow and rarely the best tool. Try something like this instead:
df['role'] = df['device'].str[0]
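Since the question mentions a regex, a str.extract variant may also be worth sketching; unlike taking the first character, it handles device names with multi-digit prefixes (sample values below are made up):

```python
import pandas as pd

# Hypothetical sample mirroring the question's device column
df = pd.DataFrame({"device": ["1-xxxx", "2-yyyy", "12-zzzz"]})

# str.extract pulls all leading digits in one vectorized pass;
# expand=False returns a Series instead of a one-column DataFrame
df["role"] = df["device"].str.extract(r"^(\d+)", expand=False)
print(df["role"].tolist())  # ['1', '2', '12']
```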
Using apply is notoriously slow because it doesn't take advantage of multithreading (see, for example, pandas multiprocessing apply). Instead, use built-ins:
>>> import pandas as pd
>>> df = pd.DataFrame([["some-data", "1-xxxx"], ["more-data", "1-yyyy"], ["other-data", "2-xxxx"]])
>>> df
0 1
0 some-data 1-xxxx
1 more-data 1-yyyy
2 other-data 2-xxxx
>>> df["Derived Column"] = df[1].str.split("-", expand=True)[0]
>>> df
0 1 Derived Column
0 some-data 1-xxxx 1
1 more-data 1-yyyy 1
2 other-data 2-xxxx 2
Here, I'm assuming that you might have multiple digits before the hyphen (e.g. 42-aaaa), hence the extra work to split the column and take the first value of the split. If you're just getting the first character, do what @teepee did in their answer by simply indexing into the string.
You can trivially convert your code to use np.vectorize().
See here:
Performance of Pandas apply vs np.vectorize to create new column from existing columns
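A minimal sketch of that conversion (sample data assumed): instead of a row-wise apply, wrap a plain scalar function with np.vectorize and call it on the underlying array.

```python
import numpy as np
import pandas as pd

# Scalar function operating on one device string at a time
def first_char(device):
    return device[0]

# np.vectorize turns it into a function that maps over a whole array
vectorized_first_char = np.vectorize(first_char)

df = pd.DataFrame({"device": ["1-xxxx", "2-yyyy", "3-zzzz"]})
df["role"] = vectorized_first_char(df["device"].to_numpy())
print(df["role"].tolist())  # ['1', '2', '3']
```

Note that np.vectorize is a convenience wrapper around a Python-level loop, not true NumPy vectorization, but it is still typically much faster than DataFrame.apply with axis=1.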
Related
I have a column with data that needs some massaging. The column may contain strings or floats, and some strings are in exponential form. I'd like to format all data in this column as a whole number where possible, expanding any exponential notation to an integer. Here is an example:
df = pd.DataFrame({'code': ['1170E1', '1.17E+04', 11700.0, '24477G', '124601', 247602.0]})
df['code'] = df['code'].astype(int, errors = 'ignore')
The above code does not seem to do a thing. I know I can convert the exponential notation and decimals simply by using the int function, and I would think the above astype would do the same, but it does not. For example, the following code works in Python:
int(1170E1), int(1.17E+04), int(11700.0)
> (11700, 11700, 11700)
Any help in solving this would be appreciated. What I'm expecting the output to look like is:
0 '11700'
1 '11700'
2 '11700'
3 '24477G'
4 '124601'
5 '247602'
You may check with pd.to_numeric
df.code = pd.to_numeric(df.code,errors='coerce').fillna(df.code)
Out[800]:
0 11700.0
1 11700.0
2 11700.0
3 24477G
4 124601.0
5 247602.0
Name: code, dtype: object
Update
df['code'] = df['code'].astype(object)
s = pd.to_numeric(df['code'],errors='coerce')
df.loc[s.notna(),'code'] = s.dropna().astype(int)
df
Out[829]:
code
0 11700
1 11700
2 11700
3 24477G
4 124601
5 247602
BENY's answer should work, although you potentially leave yourself open to catching exceptions and filling values that you don't want to. This will also do the integer conversion you are looking for.
def convert(x):
    try:
        return str(int(float(x)))
    except ValueError:
        return x
df = pd.DataFrame({'code': ['1170E1', '1.17E+04', 11700.0, '24477G', '124601', 247602.0]})
df['code'] = df['code'].apply(convert)
outputs
0 11700
1 11700
2 11700
3 24477G
4 124601
5 247602
where each element is a string.
I will be the first to say, I'm not proud of that triple cast.
I use Python and I have data of 35,000 rows. I need to change values in a loop, but it takes too much time.
PS: I have columns named succes_1, succes_2, succes_5, succes_7, ..., succes_120, so I get the name of the column in the other loop; the values depend on the other column.
Example:
SK_1 Sk_2 Sk_5 .... SK_120 Succes_1 Succes_2 ... Succes_120
1 0 1 0 1 0 0
1 1 0 1 2 1 1
for i in range(len(data_jeux)):
    for d in range(len(succ_len)):
        ids = succ_len[d]
        if data_jeux['SK_%s' % ids][i] == 1:
            data_jeux.iloc[i]['Succes_%s' % ids] = 1+i
I am asking if there is a faster way to execute this. I tried:
data_jeux.values[i, ('Succes_%s' % ids)] = 1+i
but it returns an error; maybe it doesn't accept a string index.
You can select the column groups by name and then increment with a vectorized mask. It's not clear whether your columns are naturally ordered; if they aren't, you can use sorted with a custom key function, since plain string sorting puts '20' before '100'.
def splitter(x):
    return int(x.rsplit('_', maxsplit=1)[-1])

cols = df.columns
sk_cols = sorted(cols[cols.str.startswith('SK')], key=splitter)
succ_cols = sorted(cols[cols.str.startswith('Succes')], key=splitter)

# df.loc cannot take a 2-D boolean mask, so use DataFrame.where instead:
mask = (df[sk_cols] == 1).to_numpy()
df[succ_cols] = df[succ_cols].where(~mask, df[succ_cols] + 1)
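For reference, here is a small runnable sketch of the mask-based increment on a toy frame shaped like the question's data (column names and values assumed):

```python
import pandas as pd

# Tiny frame mimicking the question's SK_*/Succes_* layout
df = pd.DataFrame({
    "SK_1": [1, 1], "SK_2": [0, 1],
    "Succes_1": [0, 2], "Succes_2": [0, 1],
})

sk_cols = ["SK_1", "SK_2"]
succ_cols = ["Succes_1", "Succes_2"]

# Boolean mask of positions where the SK flag is 1; .where keeps the
# original value where the condition is True and substitutes +1 elsewhere
mask = (df[sk_cols] == 1).to_numpy()
df[succ_cols] = df[succ_cols].where(~mask, df[succ_cols] + 1)
print(df[succ_cols].to_numpy().tolist())  # [[1, 0], [3, 2]]
```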
I am not entirely sure if this is possible, but I thought I would go ahead and ask. I currently have a string that looks like the following:
myString =
"{"Close":175.30,"DownTicks":122973,"DownVolume":18639140,"High":177.47,"Low":173.66,"Open":177.32,"Status":29,"TimeStamp":"\/Date(1521489600000)\/","TotalTicks":245246,"TotalVolume":33446771,"UnchangedTicks":0,"UnchangedVolume":0,"UpTicks":122273,"UpVolume":14807630,"OpenInterest":0}
{"Close":175.24,"DownTicks":69071,"DownVolume":10806836,"High":176.80,"Low":174.94,"Open":175.24,"Status":536870941,"TimeStamp":"\/Date(1521576000000)\/","TotalTicks":135239,"TotalVolume":19649350,"UnchangedTicks":0,"UnchangedVolume":0,"UpTicks":66168,"UpVolume":8842514,"OpenInterest":0}"
The datasets can be of varying lengths (this example has two datasets, but it could have more); however, the parameters will always be the same (Close, DownTicks, DownVolume, etc.).
Is there a way to create a dataframe from this string that takes the parameters as the index and the numbers as the values in the columns? So the dataframe would look something like this:
df =
0 1
index
Close 175.30 175.24
DownTicks 122973 69071
DownVolume 18639140 10806836
High 177.47 176.80
Low 173.66 174.94
Open 177.32 175.24
(etc)...
It looks like there are some issues with your input. As mentioned by @lmiguelvargasf, there's a missing comma at the end of the first dictionary. Additionally, there's a \n, which you can fix with a simple str.replace.
Once those issues have been solved, the process is pretty simple.
myString = '''{"Close":175.30,"DownTicks":122973,"DownVolume":18639140,"High":177.47,"Low":173.66,"Open":177.32,"Status":29,"TimeStamp":"\/Date(1521489600000)\/","TotalTicks":245246,"TotalVolume":33446771,"UnchangedTicks":0,"UnchangedVolume":0,"UpTicks":122273,"UpVolume":14807630,"OpenInterest":0}
{"Close":175.24,"DownTicks":69071,"DownVolume":10806836,"High":176.80,"Low":174.94,"Open":175.24,"Status":536870941,"TimeStamp":"\/Date(1521576000000)\/","TotalTicks":135239,"TotalVolume":19649350,"UnchangedTicks":0,"UnchangedVolume":0,"UpTicks":66168,"UpVolume":8842514,"OpenInterest":0}'''
myString = myString.replace('\n', ',')

import ast
import pandas as pd

list_of_dicts = list(ast.literal_eval(myString))
df = pd.DataFrame.from_dict(list_of_dicts).T
df
0 1
Close 175.3 175.24
DownTicks 122973 69071
DownVolume 18639140 10806836
High 177.47 176.8
Low 173.66 174.94
Open 177.32 175.24
OpenInterest 0 0
Status 29 536870941
TimeStamp \/Date(1521489600000)\/ \/Date(1521576000000)\/
TotalTicks 245246 135239
TotalVolume 33446771 19649350
UnchangedTicks 0 0
UnchangedVolume 0 0
UpTicks 122273 66168
UpVolume 14807630 8842514
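As an aside, since each line of the blob is itself valid JSON (\/ is a legal JSON escape for /), parsing line by line with json.loads is a possible alternative to patching in commas. A sketch with an abbreviated two-line sample:

```python
import json
import pandas as pd

# Two-line sample trimmed from the question's string (keys abbreviated)
myString = '''{"Close":175.30,"High":177.47,"TimeStamp":"\\/Date(1521489600000)\\/"}
{"Close":175.24,"High":176.80,"TimeStamp":"\\/Date(1521576000000)\\/"}'''

# Each line is a standalone JSON object, so parse them one at a time
records = [json.loads(line) for line in myString.splitlines() if line.strip()]
df = pd.DataFrame(records).T
print(df.loc["Close"].tolist())  # [175.3, 175.24]
```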
I have a csv file that has a primary_id field and a version field and it looks like this:
ful_id version xs at_grade date
000c1a6c-1f1c-45a6-a70d-f3555f7dd980 3 123 yes 20171003
000c1a6c-1f1c-45a6-a70d-f3555f7dd980 1 12 no 20170206
034c1a6c-4f1c-aa36-a70d-f2245f7rr342 1 334 yes 20150302
00dc5fec-ddb8-45fa-9c86-77e09ff590a9 1 556 yes 20170201
000c1a6c-1f1c-45a6-a70d-f3555f7dd980 2 123 no 20170206
Edit: this is what the actual data looks like, plus 106 more columns of data and 20,000 records.
The larger version number is the latest version of that record. I am having a difficult time working out the logic to get the latest record based on version and dumping that into a dictionary. I am pulling the info from the csv into a blank list, but if anyone could give me some guidance on the logic moving forward, I would appreciate it.
import csv
import pprint

reader = csv.DictReader(open('rpm_inv.csv'))
allData = list(reader)
dict_list = []
for line in allData:
    dict_list.append(line)
pprint.pprint(dict_list)
I'm not exactly sure what you want your output to look like, but this might point you at least in the right direction, as long as you're not opposed to pandas.
import pandas as pd
df = pd.read_csv('rpm_inv.csv', header=0)
by_version = df.groupby('version')
latest = by_version.max()
# To put it into a dictionary of {version: id}
{v: row['ful_id'] for v, row in latest.iterrows()}
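Another possible sketch along the same lines, assuming the column names from the csv header: sort by version, then keep only the last (highest-version) row per ful_id with drop_duplicates (ids abbreviated here for readability):

```python
import pandas as pd

# Sample rows shaped like the question's data, with shortened ids
df = pd.DataFrame({
    "ful_id": ["a", "a", "b", "a"],
    "version": [3, 1, 1, 2],
    "xs": [123, 12, 556, 123],
})

# After sorting by version, keep="last" retains the newest row per id
latest = df.sort_values("version").drop_duplicates("ful_id", keep="last")
print(dict(zip(latest["ful_id"], latest["version"])))  # {'b': 1, 'a': 3}
```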
There's no need for anything fancy.
defaultdict is included in Python's standard library. It's an improved dictionary. I've used it here because it obviates the need to initialise entries in a dictionary. This means that I can write, for instance, result[id] = max(result[id], version). If no entry exists for id then defaultdict creates one and puts version in it (because it's obvious that this will be the maximum).
I read through the lines in the input file, one at a time, discarding end-lines and blanks, splitting on the commas, and then use map to apply the int function to each string produced.
I ignore the first line in the file simply by reading it and assigning its contents to a variable that I have arbitrarily called ignore.
Finally, just to make the results more intelligible, I sort the keys in the dictionary, and present the contents of it in order.
>>> from collections import defaultdict
>>> result = defaultdict(int)
>>> with open('to_dict.txt') as input:
... ignore = input.readline()
... for line in input:
... id, version = map(int, line.strip().replace(' ', '').split(','))
... result[id] = max(result[id], version)
...
>>> ids = list(result.keys())
>>> ids.sort()
>>> for id in ids:
... id, result[id]
...
(3, 1)
(11, 3)
(20, 2)
(400, 2)
EDIT: With that much data it becomes a different question, in my estimation, better processed with pandas.
I've put the df.groupby(['ful_id']).version.idxmax() bit in to demonstrate what I've done. I group on ful_id, then ask for the index of the maximum value of version, all in one step using idxmax. Although pandas displays this as a two-column table, the result is actually a series of integers that I can use to select rows from the dataframe.
That's what I do with df.iloc[df.groupby(['ful_id']).version.idxmax(),:]. Here the df.groupby(['ful_id']).version.idxmax() part identifies the rows, and the : part identifies the columns, namely all of them.
Thanks for an interesting question!
>>> import pandas as pd
>>> df = pd.read_csv('different.csv', sep='\s+')
>>> df
ful_id version xs at_grade date
0 000c1a6c-1f1c-45a6-a70d-f3555f7dd980 3 123 yes 20171003
1 000c1a6c-1f1c-45a6-a70d-f3555f7dd980 1 12 no 20170206
2 034c1a6c-4f1c-aa36-a70d-f2245f7rr342 1 334 yes 20150302
3 00dc5fec-ddb8-45fa-9c86-77e09ff590a9 1 556 yes 20170201
4 000c1a6c-1f1c-45a6-a70d-f3555f7dd980 2 123 no 20170206
>>> df.groupby(['ful_id']).version.idxmax()
ful_id
000c1a6c-1f1c-45a6-a70d-f3555f7dd980 0
00dc5fec-ddb8-45fa-9c86-77e09ff590a9 3
034c1a6c-4f1c-aa36-a70d-f2245f7rr342 2
Name: version, dtype: int64
>>> new_df = df.iloc[df.groupby(['ful_id']).version.idxmax(),:]
>>> new_df
ful_id version xs at_grade date
0 000c1a6c-1f1c-45a6-a70d-f3555f7dd980 3 123 yes 20171003
3 00dc5fec-ddb8-45fa-9c86-77e09ff590a9 1 556 yes 20170201
2 034c1a6c-4f1c-aa36-a70d-f2245f7rr342 1 334 yes 20150302
I'm fairly new to programming and I have a question on using loops to recode variables in a pandas data frame that I was hoping I could get some help with.
I want to recode multiple columns in a pandas data frame from units of seconds to minutes. I've written a simple function in Python which I can copy and repeat on each column, and that works, but I wanted to automate this. I appreciate the help.
The ivf.secondsUntilCC.xxx column contains the number of seconds until something happens. I want the new column ivf.minsUntilCC.xxx to be the number of minutes. The data frame name is data.
def f(x, y):
    return x[y]/60
data['ivf.minsUntilCC.500'] = f(data,'ivf.secondsUntilCC.500')
data['ivf.minsUntilCC.1000'] = f(data,'ivf.secondsUntilCC.1000')
data['ivf.minsUntilCC.2000'] = f(data,'ivf.secondsUntilCC.2000')
data['ivf.minsUntilCC.3000'] = f(data,'ivf.secondsUntilCC.3000')
data['ivf.minsUntilCC.4000'] = f(data,'ivf.secondsUntilCC.4000')
I would use a vectorized approach:
In [27]: df
Out[27]:
X ivf.minsUntilCC.500 ivf.minsUntilCC.1000 ivf.minsUntilCC.2000 ivf.minsUntilCC.3000 ivf.minsUntilCC.4000
0 191365 906395 854268 701859 979647 914942
1 288577 300394 577555 880370 924162 897984
2 66705 493545 232603 682509 794074 204429
3 747828 504930 379035 29230 410390 287327
4 926553 913360 657640 336139 210202 356649
In [28]: df.loc[:, df.columns.str.startswith('ivf.minsUntilCC.')] /= 60
In [29]: df
Out[29]:
X ivf.minsUntilCC.500 ivf.minsUntilCC.1000 ivf.minsUntilCC.2000 ivf.minsUntilCC.3000 ivf.minsUntilCC.4000
0 191365 15106.583333 14237.800000 11697.650000 16327.450000 15249.033333
1 288577 5006.566667 9625.916667 14672.833333 15402.700000 14966.400000
2 66705 8225.750000 3876.716667 11375.150000 13234.566667 3407.150000
3 747828 8415.500000 6317.250000 487.166667 6839.833333 4788.783333
4 926553 15222.666667 10960.666667 5602.316667 3503.366667 5944.150000
Setup:
df = pd.DataFrame(np.random.randint(0,10**6,(5,6)),
columns=['X','ivf.minsUntilCC.500', 'ivf.minsUntilCC.1000',
'ivf.minsUntilCC.2000', 'ivf.minsUntilCC.3000',
'ivf.minsUntilCC.4000'])
Explanation:
In [26]: df.loc[:, df.columns.str.startswith('ivf.minsUntilCC.')]
Out[26]:
ivf.minsUntilCC.500 ivf.minsUntilCC.1000 ivf.minsUntilCC.2000 ivf.minsUntilCC.3000 ivf.minsUntilCC.4000
0 906395 854268 701859 979647 914942
1 300394 577555 880370 924162 897984
2 493545 232603 682509 794074 204429
3 504930 379035 29230 410390 287327
4 913360 657640 336139 210202 356649
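Applied to the question's own naming scheme, the same vectorized idea can derive each minutes column from its seconds counterpart without repeating assignments. A sketch, assuming a toy frame with two of the secondsUntilCC columns:

```python
import pandas as pd

# Toy frame with two of the question's seconds columns (values assumed)
data = pd.DataFrame({
    "ivf.secondsUntilCC.500": [60, 120],
    "ivf.secondsUntilCC.1000": [90, 30],
})

# Derive every minutes column from its seconds counterpart in one loop
sec_cols = [c for c in data.columns if c.startswith("ivf.secondsUntilCC.")]
for c in sec_cols:
    data[c.replace("secondsUntilCC", "minsUntilCC")] = data[c] / 60

print(data["ivf.minsUntilCC.500"].tolist())  # [1.0, 2.0]
```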