python pandas: fulfill condition and assign a value to it - python

I am really hoping you can help me here...I need to assign a label(df_label) to an exact file within dataframe (df_data) and save all labels that appear in each file in a separate txt file (that's an easy bit)
df_data:
file_name file_start file_end
0 20190201_000004.wav 0.000 1196.000
1 20190201_002003.wav 1196.000 2392.992
2 20190201_004004.wav 2392.992 3588.992
3 20190201_010003.wav 3588.992 4785.984
4 20190201_012003.wav 4785.984 5982.976
df_label:
Begin Time (s)
0 27467.100000
1 43830.400000
2 43830.800000
3 46378.200000
I have tried to switch to np.array and use for loop and np.where but without any success...

If the time values in df_label fall under exactly one entry in df_data, you can use the following
def get_file_name(begin_time):
file_names = df_data[
(df_data["file_start"] <= begin_time)
& (df_data["file_end"] >= begin_time)
]["file_name"].values
return file_names.values[0] if file_names.values.size > 0 else None
df_label["file_name"] = df_label["Begin Time (s)"].apply(get_label)
This will add another col file_name to df_label

If the labels from df_label matches the order of files in df_data you can simply:
add the labels as new column of df_data (df_data["label"] = df_label["Begin Time (s)"]).
or
use DataFrame.merge() function (df_data = df_data.merge(df_labels, left_index=True, right_index=True)).
More about merging/joining with examples you can find here:
https://thispointer.com/pandas-how-to-merge-dataframes-by-index-using-dataframe-merge-part-3/
https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html

Related

Extract values within the quotes signs into two separate columns with python

How can i extract the values within the quotes signs into two separate columns with python. The dataframe is given below:
df = pd.DataFrame(["'FRH02';'29290'", "'FRH01';'29300'", "'FRT02';'29310'", "'FRH03';'29340'",
"'FRH05';'29350'", "'FRG02';'29360'"], columns = ['postcode'])
df
postcode
0 'FRH02';'29290'
1 'FRH01';'29300'
2 'FRT02';'29310'
3 'FRH03';'29340'
4 'FRH05';'29350'
5 'FRG02';'29360'
i would like to get an output like the one below:
postcode1 postcode2
FRH02 29290
FRH01 29300
FRT02 29310
FRH03 29340
FRH05 29350
FRG02 29360
i have tried several str.extract codes but havent been able to figure this out. Thanks in advance.
Finishing Quang Hoang's solution that he left in the comments:
import pandas as pd
df = pd.DataFrame(["'FRH02';'29290'",
"'FRH01';'29300'",
"'FRT02';'29310'",
"'FRH03';'29340'",
"'FRH05';'29350'",
"'FRG02';'29360'"],
columns = ['postcode'])
# Remove the quotes and split the strings, which results in a Series made up of 2-element lists
postcodes = df['postcode'].str.replace("'", "").str.split(';')
# Unpack the transposed postcodes into 2 new columns
df['postcode1'], df['postcode2'] = zip(*postcodes)
# Delete the original column
del df['postcode']
print(df)
Output:
postcode1 postcode2
0 FRH02 29290
1 FRH01 29300
2 FRT02 29310
3 FRH03 29340
4 FRH05 29350
5 FRG02 29360
You can use Series.str.split:
p1 = []
p2 = []
for row in df['postcode'].str.split(';'):
p1.append(row[0])
p2.append(row[1])
df2 = pd.DataFrame()
df2["postcode1"] = p1
df2["postcode2"] = p2

Efficient way to loop through GroupBy DataFrame

Since my last post did lack in information:
example of my df (the important col):
deviceID: unique ID for the vehicle. Vehicles send data all Xminutes.
mileage: the distance moved since the last message (in km)
positon_timestamp_measure: unixTimestamp of the time the dataset was created.
deviceID mileage positon_timestamp_measure
54672 10 1600696079
43423 20 1600696079
42342 3 1600701501
54672 3 1600702102
43423 2 1600702701
My Goal is to validate the milage by comparing it to the max speed of the vehicle (which is 80km/h) by calculating the speed of the vehicle using the timestamp and the milage. The result should then be written in the orginal dataset.
What I've done so far is the following:
df_ori['dataIndex'] = df_ori.index
df = df_ori.groupby('device_id')
#create new col and set all values to false
df_ori['valid'] = 0
for group_name, group in df:
#sort group by time
group = group.sort_values(by='position_timestamp_measure')
group = group.reset_index()
#since I can't validate the first point in the group, I set it to valid
df_ori.loc[df_ori.index == group.dataIndex.values[0], 'validPosition'] = 1
#iterate through each data in the group
for i in range(1, len(group)):
timeGoneSec = abs(group.position_timestamp_measure.values[i]-group.position_timestamp_measure.values[i-1])
timeHours = (timeGoneSec/60)/60
#calculate speed
if((group.mileage.values[i]/timeHours)<maxSpeedKMH):
df_ori.loc[dataset.index == group.dataIndex.values[i], 'validPosition'] = 1
dataset.validPosition.value_counts()
It definitely works the way I want it to, however it lacks in performance a lot. The df contains nearly 700k in data (already cleaned). I am still a beginner and can't figure out a better solution. Would really appreciate any of your help.
If I got it right, no for-loops are needed here. Here is what I've transformed your code into:
df_ori['dataIndex'] = df_ori.index
df = df_ori.groupby('device_id')
#create new col and set all values to false
df_ori['valid'] = 0
df_ori = df_ori.sort_values(['position_timestamp_measure'])
# Subtract preceding values from currnet value
df_ori['timeGoneSec'] = \
df_ori.groupby('device_id')['position_timestamp_measure'].transform('diff')
# The operation above will produce NaN values for the first values in each group
# fill the 'valid' with 1 according the original code
df_ori[df_ori['timeGoneSec'].isna(), 'valid'] = 1
df_ori['timeHours'] = df_ori['timeGoneSec']/3600 # 60*60 = 3600
df_ori['flag'] = (df_ori['mileage'] / df_ori['timeHours']) <= maxSpeedKMH
df_ori.loc[df_ori['flag'], 'valid'] = 1
# Remove helper columns
df_ori = df.drop(columns=['flag', 'timeHours', 'timeGoneSec'])
The basic idea is try to use vectorized operation as much as possible and to avoid for loops, typically iteration row by row, which can be insanly slow.
Since I can't get the context of your code, please double check the logic and make sure it works as desired.

More pythonic way to remove rows where one value begins by another row's value in a pandas dataframe

I am processing a pandas dataframe and want to remove rows if they contain a "Full Path" that is already contained in other "Full Path" of the dataframe.
In the example below I want to remove the rows 1 2 3 4 because c:/dir/ "contains" them (we are talking about file systems path here):
Full Path Value
0 c:/dir/ x
1 c:/dir/sub1/ x
2 c:/dir/sub2/ x
3 c:/dir/sub2/a x
4 c:/dir/sub2/b x
5 c:/anotherdir/ x
6 c:/anotherdir_A/ x
7 c:/anotherdir_C/ x
Rows 6 & 7 are kept because the path is not contained in 5 (a in b in my code below).
The code I came up with is the following, res is the initial dataframe:
to_drop = []
for index, row in res.iterrows():
a = row['Full Path']
for idx, row2 in res.iterrows():
b = row2['Full Path']
if a != b and a in b:
to_drop.append(idx)
res2 = res.loc[~res.index.isin(to_drop)]
It works but the code does not feel 100% pythonic to me. I am quite sure there is a more elegant/clever way to do this. Any idea?
pd.concat([df, df['Full Path'].str.extract('(.*:\/.*?\/)')], axis = 1)\
.drop_duplicates([0])\
.drop(columns = 0)
You can use .str.extract and regex to extract the base directory, concating the extract back to the original df, dropping the duplicates of the base directory, followed by finally dropping the extracted column.
Edit: Alternate if Path is not in order:
df[df['Full Path'] == df['Full Path'].str.extract('(.*:\/.*?\/)', expand = False)]
The time complexity of this is in the tank (no matter how you turn it, you have to check every path against every other path), but a single line solution using str.startswith:
df = pd.DataFrame({'Full Path': ['c:/dir/', 'c:/dir/sub/', 'c:/anotherdir/dir',
'c:/anotherdir/'],
'Value': ['A', 'B', 'C', 'D']})
print(df[[any(a.startswith(b) if a != b else False for a in df['Full Path'])
for b in df['Full Path']]])
output
Full Path Value
0 c:/dir/ A
3 c:/anotherdir/ D

Reading values from datafram.iloc is too slow and problem in dataframe.values

I use python and I have data of 35 000 rows I need to change values by loop but it takes too much time
ps: I have columns named by succes_1, succes_2, succes_5, succes_7....suces_120 so I get the name of the column by the other loop the values depend on the other column
exemple:
SK_1 Sk_2 Sk_5 .... SK_120 Succes_1 Succes_2 ... Succes_120
1 0 1 0 1 0 0
1 1 0 1 2 1 1
for i in range(len(data_jeux)):
for d in range (len(succ_len)):
ids = succ_len[d]
if data_jeux['SK_%s' % ids][i] == 1:
data_jeux.iloc[i]['Succes_%s' % ids]= 1+i
I ask if there is a way for executing this problem with the faster way I try :
data_jeux.values[i, ('Succes_%s' % ids)] = 1+i
but it returns me the following error maybe it doesn't accept string index
You can define columns and then use loc to increment. It's not clear whether your columns are naturally ordered; if they aren't you can use sorted with a custom function. String-based sorting will cause '20' to come before '100'.
def splitter(x):
return int(x.rsplit('_', maxsplit=1)[-1])
cols = df.columns
sk_cols = sorted(cols[cols.str.startswith('SK')], key=splitter)
succ_cols = sorted(cols[cols.str.startswith('Succes')], key=splitter)
df.loc[df[sk_cols] == 1, succ_cols] += 1

Counting Values In Columns Igonorig AlphaNumeric Values

First post here, I am trying to find out total count of values in an excel file. So after importing the file, I need to run a condition which is count all the values except 0 also where it finds 0 make that blank.
> df6 = df5.append(df5.ne(0).sum().rename('Final Value'))
I tried the above one but not working properly, It is counting the column name as well, I only need to count the float values.
Demo DataFrame:
0 1 2 3
ID_REF 1007_s 1053_a 117_at 121_at
GSM95473 0.08277 0.00874 0.00363 0.01877
GSM95474 0.09503 0.00592 0.00352 0
GSM95475 0.08486 0.00678 0.00386 0.01973
GSM95476 0.08105 0.00913 0.00306 0.01801
GSM95477 0.00000 0.00812 0.00428 0
GSM95478 0.07615 0.00777 0.00438 0.01799
GSM95479 0 0.00508 1 0
GSM95480 0.08499 0.00442 0.00298 0.01897
GSM95481 0.08893 0.00734 0.00204 0
0 1 2 3
ID_REF 1007_s 1053_a 117_at 121_at
These are column name and index value which needs to be ignored when counting.
The output Should be like this after counting:
Final 8 9 9 5
If you just nee the count, but change the values in your dataframe, you could apply a function to each cell in your DataFrame with the applymap method. First create a function to check for a float:
def floatcheck(value):
if isinstance(value, float):
return 1
else:
return 0
Then apply it to your dataframe:
df6 = df5.applymap(floatcheck)
This will create a dataframe with a 1 if the value is a float and a 0 if not. Then you can apply your sum method:
df7 = df6.append(df6.sum().rename("Final Value"))
I was able to solve the issue, So here it is:
df5 = df4.append(pd.DataFrame(dict(((df4[1:] != 1) & (df4[1:] != 0)).sum()), index=['Final']))
df5.columns = df4.columns
went = df5.to_csv("output3.csv")
What i did was i changed the starting index so i didn't count the first row which was alphanumeric and then i just compared it.
Thanks for your response.

Categories

Resources