Problems Sorting Data out of a text-file - python

I have a CSV file imported into a dataframe and am having trouble sorting the data.
df looks like this:
Data
0 <WindSpeed>0.69</WindSpeed>
1 <PowerOutput>0</PowerOutput>
2 <ThrustCoEfficient>0</ThrustCoEffici...
3 <RotorSpeed>8.17</RotorSpeed>
4 <ReactivePower>0</ReactivePower>
5 </DataPoint>
6 <DataPoint>
7 <WindSpeed>0.87</WindSpeed>
8 <PowerOutput>0</PowerOutput
I want it to look like this:
0 Windspeed Poweroutput
1 0.69 0.0
Here's the code that I wrote so far:
import pandas as pd
from pandas.compat import StringIO
import re
import numpy as np

df = pd.read_csv('powercurve.csv', encoding='utf-8', skiprows=42)
df.columns = ['Data']
no_of_rows = df.Data.str.count("WindSpeed").sum() / 2
rows = no_of_rows.astype(np.uint32)
TRBX = pd.DataFrame(index=range(0, abs(rows)), columns=['WSpd[m/s]', 'Power[kW]'], dtype='float')

i = 0
for i in range(len(df)):
    if 'WindSpeed' in df['Data']:
        TRBX['WSpd[m/s]', i] = re.findall("'(\d+)'", 'Data')
    elif 'Rotorspeed' in df['Data']:
        TRBX['WSpd[m/s]', i] = re.findall("'(\d+)'", 'Data')
Is this a suitable approach? If so, where is my mistake? So far no values are written into the TRBX dataframe.

The code below should help you, if your df is indeed in the format shown above:
import re
# split each row on the XML tag delimiters '<' and '>'
split_func = lambda x: re.split('<|>', str(x))
split_series = df.Data.apply(split_func)
# after the split, element 1 is the tag name and element 2 is the value between the tags
data = split_series.apply(lambda x: x[2]).rename('data')
features = split_series.apply(lambda x: x[1]).rename('features')
df = pd.DataFrame(data).set_index(features).T
You may want to drop some columns that have no data or input some N/A values afterwards. You also may want to rename the variables and series to different names that make more sense to you.
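A minimal sketch of that cleanup, assuming the tag names above became the column labels (the renamed labels are just examples):
# drop columns that came out entirely empty, then rename for readability
df = df.dropna(axis=1, how='all')
df = df.rename(columns={'WindSpeed': 'WSpd[m/s]', 'PowerOutput': 'Power[kW]'})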

Related

Compute number of floats in an int range - Python

I have the following dataframe containing floats as input and would like to compute how many values fall in the ranges 0-90 and 90-180. The output dataframe was obtained using the FREQUENCY() function in Excel.
[Input dataframe]
[Desired output]
I'd like to do the same thing with Python but didn't find a solution. Do you have any suggestions?
I can also provide source files if needed.
Here's one way: divide the columns by 90, then use groupby and count:
import numpy as np
import pandas as pd

data = [
    [87.084, 5.293],
    [55.695, 0.985],
    [157.504, 2.995],
    [97.701, 179.593],
    [97.67, 170.386],
    [118.713, 177.53],
    [99.972, 176.665],
    [124.849, 1.633],
    [72.787, 179.459]
]
df = pd.DataFrame(data, columns=['Var1', 'Var2'])
# integer division by 90 maps values in [0, 90) to 0 and values in [90, 180) to 1
df = (df / 90).astype(int)
df1 = pd.DataFrame([["0-90"], ["90-180"]])
df1['Var1'] = df.groupby('Var1').count()
df1['Var2'] = df.groupby('Var2').count()
print(df1)
Output:
0 Var1 Var2
0 0-90 3 4
1 90-180 6 5
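An alternative sketch using pd.cut, which bins the values directly instead of relying on the division trick (the bin edges and labels below are assumptions matching the question):
import pandas as pd

data = [[87.084, 5.293], [55.695, 0.985], [157.504, 2.995],
        [97.701, 179.593], [97.67, 170.386], [118.713, 177.53],
        [99.972, 176.665], [124.849, 1.633], [72.787, 179.459]]
df = pd.DataFrame(data, columns=['Var1', 'Var2'])
# right=False makes the bins [0, 90) and [90, 180), matching the division approach
counts = df.apply(lambda col: pd.cut(col, bins=[0, 90, 180], right=False,
                                     labels=['0-90', '90-180']).value_counts())
print(counts)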

For loop using enumerate runs more times than expected for a pandas DataFrame

So, I was working on the Titanic dataset, extracting the title (Mr, Ms, Mrs) from the Name column of a DataFrame (df). It has 1309 rows.
for ind, name in enumerate(df['Name']):
    if type(name) == str:
        inf = name.find(', ') + 2
        df.loc[ind+1, 'Title'] = name[inf:name.find('.')]
    else:
        print(name, ind)
This piece of code gives the following output:
nan 1309
It should have stopped at ind=1308, but it goes one step further even though nothing tells it to.
What could be the flaw here? Is it because I am using 1-based indexing of the data frame?
If so, what can be done to prevent such behaviour?
I am new to this platform, so please ask for clarifications in case of any discrepancies.
Here is a short example:
import numpy as np
import pandas as pd

dict1 = {'Name': ['Hey, Mr.', 'Hello, Ms.', 'Hi, Mrs,', 'Welcome, Master.', 'Yes, Mr.'], 'ind': [1, 2, 3, 4, 5]}
df = pd.DataFrame(data=dict1)
df.set_index('ind')
for ind, name in enumerate(df['Name']):
    if type(name) == str:
        inf = name.find(', ') + 2
        df.loc[ind+1, 'Title'] = name[inf:name.find('.')]
    else:
        print(name, ind)
print(df['Title'])
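A note on what is likely happening (my reading, not confirmed in the post): df.set_index('ind') returns a new frame and does not modify df, so the loop still runs against the default 0-based index. df.loc[ind+1, 'Title'] on the final iteration then writes to a label one past the end, which enlarges the frame with a NaN row, and that would explain the extra nan 1309 iteration. A vectorized extraction avoids both the loop and the off-by-one; the regex below is an assumption about the "Surname, Title. Given names" format:
# capture the text between ', ' and the first '.' as the title
df['Title'] = df['Name'].str.extract(r',\s*([^.]*)\.', expand=False)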

Pandas read_csv: Columns are being imported as rows

Edit: I believe this was all user error. I have been typing df.T by default, and it just occurred to me that this is very likely the TRANSPOSE output. By typing df, the data frame is output normally (headers as columns). Thank you to those who stepped up to try and help. In the end, it was just my misunderstanding of pandas.
Original Post
I'm not sure if I am making a simple mistake but the columns in a .csv file are being imported as rows using pd.read_csv. The dataframe turns out to be 5 rows by 2000 columns. I am importing only 5 columns out of 14 so I set up a list to hold the names of the columns I want. They match exactly those in the .csv file. What am I doing wrong here?
import os
import numpy as np
import pandas as pd

fp = 'C:/Users/my/file/path'
os.chdir(fp)

cols_to_use = ['VCOMPNO_CURRENT', 'MEASUREMENT_DATETIME',
               'EQUIPMENT_NUMBER', 'AXLE', 'POSITION']

df = pd.read_csv('measurement_file.csv',
                 usecols=cols_to_use,
                 dtype={'EQUIPMENT_NUMBER': np.int,
                        'AXLE': np.int},
                 parse_dates=[2],
                 infer_datetime_format=True)
Output:
0 ... 2603
VCOMPNO_CURRENT T92656 ... T5M247
MEASUREMENT_DATETIME 7/26/2018 13:04 ... 9/21/2019 3:21
EQUIPMENT_NUMBER 208 ... 537
AXLE 1 ... 6
POSITION L ... R
[5 rows x 2000 columns]
Thank you.
Edit: To note, if I import the entire .csv with the standard pd.read_csv('measurement_file.csv'), the columns are imported properly.
Edit 2: Sample csv:
VCOMPNO_CURRENT,MEASUREMENT_DATETIME,REPAIR_ORDER_NUMBER,EQUIPMENT_NUMBER,AXLE,POSITION,FLANGE_THICKNESS,FLANGE_HEIGHT,FLANGE_SLOPE,DIAMETER,RO_NUMBER_SRC,CL,VCOMPNO_AT_MEAS,VCOMPNO_SRC
T92656,10/19/2018 7:11,5653054,208,1,L,26.59,27.34,6.52,691.3,OPTIMESS_DATA,2MTA ,T71614 ,RO_EQUIP
T92656,10/19/2018 7:11,5653054,208,1,R,26.78,27.25,6.64,691.5,OPTIMESS_DATA,2MTA ,T71614 ,RO_EQUIP
T92656,10/19/2018 7:11,5653054,208,2,L,26.6,27.13,6.49,691.5,OPTIMESS_DATA,2MTA ,T71614 ,RO_EQUIP
T92656,10/19/2018 7:11,5653054,208,2,R,26.61,27.45,6.75,691.6,OPTIMESS_DATA,2MTA ,T71614 ,RO_EQUIP
T7L672,10/19/2018 7:11,5653054,208,3,L,26.58,27.14,6.58,644.4,OPTIMESS_DATA,2CTC ,T7L672 ,BOTH
T7L672,10/19/2018 7:11,5653054,208,3,R,26.21,27.44,6.17,644.5,OPTIMESS_DATA,2CTC ,T7L672 ,BOTH
A simple workaround here is just to take the transpose of the dataframe (see the pandas documentation for DataFrame.transpose):
df = df.transpose()
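Given the edit at the top of the question, a quick shape check would have surfaced the mix-up (file and column list as in the question):
df = pd.read_csv('measurement_file.csv', usecols=cols_to_use)
print(df.shape)    # (2000, 5): rows by columns, as imported
print(df.T.shape)  # (5, 2000): the transposed view shown in the question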
Can you try it like this?
import pandas as pd
dataset = pd.read_csv('yourfile.csv')
# filter down to the columns you need here
dataset = dataset[cols_to_use]

How to add a column to a Pandas DataFrame based on other columns

I'm using Pandas and I have a very basic dataframe:
session_id datetime
5 t0ubmqqpbt01rhce201cujjtm7 2014-11-28T04:30:09Z
6 k87akpjpl004nbmhf4loiafi72 2014-11-28T04:30:11Z
7 g0t7hrqo8hgc5vlb7240d1n9l5 2014-11-28T04:30:12Z
8 ugh3fkskmedq3br99d20t78gb2 2014-11-28T04:30:15Z
9 fckkf16ahoe1uf9998eou1plc2 2014-11-28T04:30:18Z
I wish to add a third column based on the values of the current columns:
df['key'] = urlsafe_b64encode(md5('l' + df['session_id'] + df['datetime']))
But I receive:
TypeError: must be convertible to a buffer, not Series
You need to use pandas.DataFrame.apply. The code below will apply the lambda function to each row of df. You could, of course, define a separate function (if you need to do something more complicated).
import pandas as pd
from io import StringIO
from base64 import urlsafe_b64encode
from hashlib import md5
s = ''' session_id datetime
5 t0ubmqqpbt01rhce201cujjtm7 2014-11-28T04:30:09Z
6 k87akpjpl004nbmhf4loiafi72 2014-11-28T04:30:11Z
7 g0t7hrqo8hgc5vlb7240d1n9l5 2014-11-28T04:30:12Z
8 ugh3fkskmedq3br99d20t78gb2 2014-11-28T04:30:15Z
9 fckkf16ahoe1uf9998eou1plc2 2014-11-28T04:30:18Z'''
df = pd.read_csv(StringIO(s), sep='\s+')
df['key'] = df.apply(lambda x: urlsafe_b64encode(md5('l' + x['session_id'] + x['datetime'])), axis=1)
Note: I couldn't get the hashing bit working on my machine unfortunately, some unicode error (might be because I'm using Python 3) and I don't have time to debug the inner workings of it, but the pandas part I'm pretty sure about :P
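For what it's worth, the unicode error is probably Python 3's bytes/str split: hashlib.md5 needs bytes, and urlsafe_b64encode needs the digest bytes rather than the md5 object itself. A sketch of a Python 3-compatible version (same logic as above, just with explicit encoding):
from base64 import urlsafe_b64encode
from hashlib import md5

df['key'] = df.apply(
    lambda x: urlsafe_b64encode(
        # hash the concatenated string as UTF-8 bytes, then base64 the digest
        md5(('l' + x['session_id'] + x['datetime']).encode('utf-8')).digest()
    ).decode('ascii'),
    axis=1)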

Replace text with numbers using dictionary in pandas

I'm trying to replace months represented as characters (e.g. 'NOV') with their numerical counterparts ('-11-'). I can get the following piece of code to work properly:
df_cohorts['ltouch_datetime'] = df_cohorts['ltouch_datetime'].str.replace('NOV','-11-')
df_cohorts['ltouch_datetime'] = df_cohorts['ltouch_datetime'].str.replace('DEC','-12-')
df_cohorts['ltouch_datetime'] = df_cohorts['ltouch_datetime'].str.replace('JAN','-01-')
However, to avoid redundancy, I'd like to use a dictionary and .replace to replace the character variable for all months.
r_month1 = {'JAN':'-01-','FEB':'-02-','MAR':'-03-','APR':'-04-','MAY':'-05-','JUN':'-06-','JUL':'-07-','AUG':'-08-','SEP':'-09-','OCT':'-10-','NOV':'-11-','DEC':'-12-'}
df_cohorts.replace({'conversion_datetime': r_month1,'ltouch_datetime': r_month1})
When I enter the code above, my output dataset is unchanged. For reference, please see my sample data below.
User_ID ltouch_datetime conversion_datetime
001 11NOV14:13:12:56 11NOV14:16:12:00
002 07NOV14:17:46:14 08NOV14:13:10:00
003 04DEC14:17:46:14 04DEC15:13:12:00
Thanks!
Let me suggest a different approach: you could parse the date strings into a column of pandas Timestamps like this:
import pandas as pd
df = pd.read_table('data', sep='\s+')
for col in ('ltouch_datetime', 'conversion_datetime'):
    df[col] = pd.to_datetime(df[col], format='%d%b%y:%H:%M:%S')
print(df)
# User_ID ltouch_datetime conversion_datetime
# 0 1 2014-11-11 13:12:56 2014-11-11 16:12:00
# 1 2 2014-11-07 17:46:14 2014-11-08 13:10:00
# 2 3 2014-12-04 17:46:14 2015-12-04 13:12:00
I would stop right here, since representing dates as Timestamps is the ideal form for the data in pandas.
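Once the columns are Timestamps, the components are available directly through the .dt accessor, with no string slicing; a small illustrative sketch (not part of the original answer):
# month of each last-touch event, and rows whose conversion happened after noon
print(df['ltouch_datetime'].dt.month)
print(df[df['conversion_datetime'].dt.hour >= 12])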
However, if you need/want date strings with 3-letter months like 'NOV' converted to -11-, then you can convert the Timestamps with strftime and apply:
for col in ('ltouch_datetime', 'conversion_datetime'):
    df[col] = df[col].apply(lambda x: x.strftime('%d-%m-%y:%H:%M:%S'))
print(df)
yields
User_ID ltouch_datetime conversion_datetime
0 1 11-11-14:13:12:56 11-11-14:16:12:00
1 2 07-11-14:17:46:14 08-11-14:13:10:00
2 3 04-12-14:17:46:14 04-12-15:13:12:00
To answer your question literally: in order to use Series.replace with your dictionary, you need a column with the month string abbreviations all by themselves, because Series.replace matches whole values rather than substrings (note also that replace returns a new frame, so its result must be assigned back). You can arrange for that by first calling Series.str.extract, then joining the columns back into one with apply:
import pandas as pd
import calendar

month_map = {calendar.month_abbr[m].upper(): '-{:02d}-'.format(m)
             for m in range(1, 13)}
df = pd.read_table('data', sep='\s+')
for col in ('ltouch_datetime', 'conversion_datetime'):
    tmp = df[col].str.extract(r'(.*?)(\D+)(.*)')
    tmp[1] = tmp[1].replace(month_map)
    df[col] = tmp.apply(''.join, axis=1)
print(df)
print(df)
yields
User_ID ltouch_datetime conversion_datetime
0 1 11-11-14:13:12:56 11-11-14:16:12:00
1 2 07-11-14:17:46:14 08-11-14:13:10:00
2 3 04-12-14:17:46:14 04-12-15:13:12:00
Finally, although you haven't asked for this directly, it's good to be aware that if your data is in a file, you can parse the datestring columns into Timestamps directly using
import pandas as pd
import datetime as DT

df = pd.read_table(
    'data', sep='\s+', parse_dates=[1, 2],
    date_parser=lambda x: DT.datetime.strptime(x, '%d%b%y:%H:%M:%S'))
This might be the most convenient method of all (assuming you want Timestamps).
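One caveat worth flagging: pandas 2.0+ deprecates the date_parser argument in favor of date_format, so a roughly equivalent modern sketch (same assumed file layout) would be:
df = pd.read_table('data', sep='\s+', parse_dates=[1, 2],
                   date_format='%d%b%y:%H:%M:%S')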
