I have a dataframe similar to the one below.
I want to remove the text and keep only the digits from each column in that dataframe.
The expected output is something like this.
So far I have tried this:
import json
import requests
import pandas as pd
URL = 'https://xxxxx.com'
req = requests.get(URL,auth=('xxx', 'xxx') )
text_data= req.text
json_dict= json.loads(text_data)
df = pd.DataFrame.from_dict(json_dict["measurements"])
cols_to_keep =['source','battery','c8y_TemperatureMeasurement','time','c8y_DistanceMeasurement']
df_final = df[cols_to_keep]
df_final = df_final.rename(columns={'c8y_TemperatureMeasurement': 'Temperature Or T','c8y_DistanceMeasurement':'Distance'})
for col in df_final:
    df_final[col] = [''.join(re.findall(r"\d*\.?\d+", item)) for item in df_final[col]]
Your code is missing import re, and the data cannot be accessed because it requires credentials.
You can use pandas.DataFrame.replace:
Example data:
df = pd.DataFrame({'a':['abc123abc', 'def456678'], 'b':['123a', 'b456']})
Dataframe:
a b
0 abc123abc 123a
1 def456678 b456
[^0-9.] matches every character that is not a digit or a decimal point and replaces it with an empty string.
df.replace('[^0-9.]', '', regex=True)
Output:
a b
0 123 123
1 456678 456
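If you also want numeric dtypes afterwards (a possible follow-up, not something the question asks for), converting the stripped strings with pd.to_numeric should work:
df.replace('[^0-9.]', '', regex=True).apply(pd.to_numeric, errors='coerce')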
Edit:
The problem here is actually about nested JSON, not about replacing values in a dataframe. The reason the statement above does not work is that the data is stored as dicts inside the dataframe. Since the solution above is still correct in general, I won't edit it out.
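To make that concrete, here is a minimal sketch with a made-up measurement dict (not the real API payload): the cells hold dicts, so a string regex has nothing to act on, while json_normalize flattens the nested keys into ordinary scalar columns.
import pandas as pd

row = {'source': {'id': '83512'}, 'battery': {'percent': {'value': 98.0}}}
df = pd.DataFrame([row])
print(type(df.loc[0, 'source']))                  # <class 'dict'> -- not a string, so the regex replace cannot touch it
print(pd.json_normalize([row]).columns.tolist())  # ['source.id', 'battery.percent.value']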
Revised Answer:
import json
import requests
import pandas as pd

URL = 'https://wastemanagement.post-iot.lu/measurement/measurements?source=83512&pageSize=1000000000&dateFrom=2019-10-26&dateTo=2019-10-28'
req = requests.get(URL, auth=('xxxx', 'xxxx'))
text_data = req.text
json_dict = json.loads(text_data)
# flatten the nested dicts into separate columns
df = pd.json_normalize(json_dict['measurements'])
df = df.rename(columns={'source.id': 'source', 'battery.percent.value': 'battery', 'c8y_TemperatureMeasurement.T.value': 'Temperature Or T', 'c8y_DistanceMeasurement.distance.value': 'Distance'})
cols_to_keep = ['source', 'battery', 'Temperature Or T', 'time', 'Distance']
df_final = df[cols_to_keep]
Output:
source battery Temperature Or T time Distance
0 83512 98.0 NaN 2019-10-26T00:00:06.494Z NaN
1 83512 NaN 23.0 2019-10-26T00:00:06.538Z NaN
2 83512 NaN NaN 2019-10-26T00:00:06.577Z 21.0
3 83512 98.0 NaN 2019-10-26T00:30:06.702Z NaN
4 83512 NaN 23.0 2019-10-26T00:30:06.743Z NaN
I'm testing a light sensor for sensitivity. I now have data that I would like to plot.
The sensor has 24 levels of sensitivity.
I'm only testing levels 0, 6, 12, 18 and 23.
On the x-axis: PWM value, range 0-65000.
My goal is to plot from a dataframe with plotly.
My question is:
How can I combine the data (as shown below) into a dataframe for plotting?
EDIT: The link to my csv files: https://filetransfer.io/data-package/QwzFzT8O
Also below: my code so far
Thanks!
def main_code():
    data = pd.DataFrame(columns=['PWM', 'sens_00', 'sens_06', 'sens_12', 'sens_18', 'sens_23'])
    sens_00 = pd.read_csv('sens_00.csv', sep=';')
    sens_06 = pd.read_csv('sens_06.csv', sep=';')
    sens_12 = pd.read_csv('sens_12.csv', sep=';')
    sens_18 = pd.read_csv('sens_18.csv', sep=';')
    sens_23 = pd.read_csv('sens_23.csv', sep=';')
    print(data)
    print(sens_23)

import plotly.express as px
import pandas as pd

if __name__ == '__main__':
    main_code()
#Dawid's answer is fine, but it does not produce a single complete dataframe (which would let you do more than just plotting), and it contains too much redundancy.
Below is a better way to concatenate the multiple csv files.
Then plotting is just a single call.
Reading csv files into a single dataframe:
from pathlib import Path
import pandas as pd
def read_dataframes(data_root: Path):
    # It can be turned into a single line,
    # but keeping it more readable here
    dataframes = []
    for fpath in data_root.glob("*.csv"):
        df = pd.read_csv(fpath, sep=";")
        df = df[["pwm", "lux"]]
        df = df.rename({"lux": fpath.stem}, axis="columns")
        df = df.set_index("pwm")
        dataframes.append(df)
    return pd.concat(dataframes)
data_root = Path("data")
df = read_dataframes(data_root)
df
sens_06 sens_18 sens_12 sens_23 sens_00
pwm
100 0.00000 NaN NaN NaN NaN
200 1.36435 NaN NaN NaN NaN
300 6.06451 NaN NaN NaN NaN
400 12.60010 NaN NaN NaN NaN
500 20.03770 NaN NaN NaN NaN
... ... ... ... ... ...
64700 NaN NaN NaN NaN 5276.74
64800 NaN NaN NaN NaN 5282.29
64900 NaN NaN NaN NaN 5290.45
65000 NaN NaN NaN NaN 5296.63
65000 NaN NaN NaN NaN 5296.57
[2098 rows x 5 columns]
Plotting:
df.plot(backend="plotly") # equivalent to px.line(df)
Here is my suggestion. You have two columns in each file, and you need to use unique column names to keep both columns. All files are loaded and appended to an initially empty DataFrame called data. To get a plot with all columns, each additional trace is added with fig.add_scatter. The code:
import pandas as pd
import plotly.express as px
def main_code():
    data = pd.DataFrame()
    for filename in ['sens_00', 'sens_06', 'sens_12', 'sens_18', 'sens_23']:
        data[['{}-PWM'.format(filename), '{}-LUX'.format(filename)]] = pd.read_csv('{}.csv'.format(filename), sep=';')
    print(data)

    fig = px.line(data_frame=data, x=data['sens_00-PWM'], y=data['sens_00-LUX'])
    for filename in ['sens_06', 'sens_12', 'sens_18', 'sens_23']:
        fig.add_scatter(x=data['{}-PWM'.format(filename)], y=data['{}-LUX'.format(filename)], mode='lines')
    fig.show()

if __name__ == '__main__':
    main_code()
Based on the suggestion by #Dawid, this is what I was going for.
I have a JSON file that looks like this, and I want to convert it to a dataframe with Android and IOS as the indexes of my DF:
json = {
    "Android": {
        "lastExecutionID": "21-08-16_07_02_25_25111",
        "lastExecutionTime": 1629,
        "avgDuration": 26884
    },
    "IOS": {
        "lastID": "21-08-16_07_02_25_25534",
        "lastTime": 1669,
        "avg": 109802
    }
}
The best way I have found to do this is to convert the JSON into lists: one list of the outer keys (Android, IOS) to use as the indexes of my DF, one list of the inner keys to use as the columns of my DF, and one list of values for each row.
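Roughly, that manual approach looks like this (just a sketch of what I mean, with made-up variable names):
index = list(json.keys())                                            # ['Android', 'IOS']
columns = sorted({key for inner in json.values() for key in inner})
rows = [[inner.get(col) for col in columns] for inner in json.values()]
df = pd.DataFrame(rows, index=index, columns=columns)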
Is there a better way to do that?
Thanks everyone
The easiest way to parse JSON objects into pandas is pd.json_normalize():
>>> df = pd.json_normalize(json)
>>> df
Android.lastExecutionID Android.lastExecutionTime Android.avgDuration IOS.lastID IOS.lastTime IOS.avg
0 21-08-16_07_02_25_25111 1629 26884 21-08-16_07_02_25_25534 1669 109802
Nested names are .-separated, so you then just need to split column names and unstack:
>>> df.columns = df.columns.str.split('.', n=1).map(tuple)
>>> df.loc[0].unstack()
avg avgDuration lastExecutionID lastExecutionTime lastID lastTime
Android NaN 26884 21-08-16_07_02_25_25111 1629 NaN NaN
IOS 109802 NaN NaN NaN 21-08-16_07_02_25_25534 1669
As you want Android and IOS to be the indexes of the resulting dataframe, see whether this is what you want:
Use pd.Series + .apply(), as follows:
df = pd.Series(json).apply(pd.Series)
Result:
print(df)
lastExecutionID lastExecutionTime avgDuration lastID lastTime avg
Android 21-08-16_07_02_25_25111 1629.0 26884.0 NaN NaN NaN
IOS NaN NaN NaN 21-08-16_07_02_25_25534 1669.0 109802.0
I'm doing a finance study based on the YouTube link below, and I would like to understand why I got a NaN return instead of the expected calculation. What do I need to do in this script to reach the expected value?
YouTube case: https://www.youtube.com/watch?v=UpbpvP0m5d8
import investpy as env
import numpy as np
import pandas as pd
lt = ['ABEV3','CEAB3','ENBR3','FLRY3','IRBR3','ITSA4','JHSF3','STBP3']
prices = pd.DataFrame()
for i in lt:
    df = env.get_stock_historical_data(stock=i, from_date='01/01/2020', to_date='29/05/2020', country='brazil')
    df['Ativo'] = i
    prices = pd.concat([prices, df], sort=True)
pivoted = prices.pivot(columns='Ativo', values='Close')
e_r = pivoted.resample('Y').last().pct_change().mean()
e_r
Return:
Ativo
ABEV3 NaN
CEAB3 NaN
ENBR3 NaN
FLRY3 NaN
IRBR3 NaN
ITSA4 NaN
JHSF3 NaN
STBP3 NaN
dtype: float64
You need to change 'from_date' so that you have more than one year of data.
Your current script returns only one row after resampling, and .pct_change() on a single row returns NaN because there is no previous row to compare against.
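Here is a toy illustration (made-up yearly closes, not investpy data) of why a single resampled row yields NaN:
import pandas as pd

one_year = pd.Series([10.0], index=pd.to_datetime(['2020-12-31']))
print(one_year.pct_change())    # NaN -- nothing to compare the only row against

two_years = pd.Series([10.0, 12.0], index=pd.to_datetime(['2019-12-31', '2020-12-31']))
print(two_years.pct_change())   # first row NaN, second row 0.2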
When I changed from_date to '01/01/2018':
import investpy as env
import numpy as np
import pandas as pd
lt = ['ABEV3','CEAB3','ENBR3','FLRY3','IRBR3','ITSA4','JHSF3','STBP3']
prices = pd.DataFrame()
for i in lt:
    df = env.get_stock_historical_data(stock=i, from_date='01/01/2018', to_date='29/05/2020', country='brazil')
    df['Ativo'] = i
    prices = pd.concat([prices, df], sort=True)
pivoted = prices.pivot(columns='Ativo', values='Close')
e_r = pivoted.resample('Y').last().pct_change().mean()
e_r
I get the following output:
Ativo
ABEV3 -0.043025
CEAB3 -0.464669
ENBR3 0.180655
FLRY3 0.191976
IRBR3 -0.175084
ITSA4 -0.035767
JHSF3 1.283291
STBP3 0.223627
dtype: float64
I am trying to read in some sample data from a CSV file, shown below. However, when I print out the data in the dataframe, many of the columns are null. I manually set the column datatypes because I thought that might be the issue, but that didn't solve the problem. Some help would be appreciated.
Data:
People_id,datetime,First Name,Last Name,Utilization,Chargeability,Target,Employee Type,Business Unit,Business Group
2222,2020-05-03,FirstName,LastName,0.8,0.9,0.4,Employee,GGGG,G1
Code:
import pandas as pd
data = pd.read_csv (r'C:\Users\Name\Documents\testdata.csv')
df = pd.DataFrame(data, columns= ['People_id', 'WeekEnding', 'FirstName', 'LastName', 'Utilization', 'Chargeability', 'Target', 'EmployeeType', 'BusinessUnit', 'BusinessGroup'])
df.dropna(subset=['People_id', 'Utilization', 'Chargeability', 'Target'])
df.FirstName = df.FirstName.astype(str)
df.LastName = df.LastName.astype(str)
df.EmployeeType = df.EmployeeType.astype(str)
df.BusinessUnit = df.BusinessUnit.astype(str)
df.BusinessGroup = df.BusinessGroup.astype(str)
df['WeekEnding'] = pd.to_datetime(df['WeekEnding'])
for row in df.itertuples():
    print(row.People_id)
    print(row.WeekEnding)
    print(row.FirstName)
    print(row.LastName)
    print(row.Utilization)
    print(row.Chargeability)
    print(row.Target)
    print(row.EmployeeType)
    print(row.BusinessUnit)
    print(row.BusinessGroup)
Output:
2222
NaT
nan
nan
0.8
0.9
0.4
nan
nan
nan
I have some long-winded code here with an issue: when I attempt to join (or merge/concat) two datasets together, I get this TypeError: Cannot compare type 'Timestamp' with type 'int'.
Both datasets come from resampling the same initial dataset. The master_hrs df comes from a change point detection process using the Python package ruptures (pip install ruptures to run the code). The daily_summary df just uses pandas to resample daily mean and sum values. But I get the error when I attempt to combine the datasets together. Would anyone have any tips to try?
Making up some fake data generates the same error as my real-world dataset. I think the issue is that I am somehow trying to compare a datetime to a numpy value. Any tips greatly appreciated. Thanks.
import ruptures as rpt
import calendar
import numpy as np
import pandas as pd
np.random.seed(11)
rows,cols = 50000,2
data = np.random.rand(rows,cols)
tidx = pd.date_range('2019-01-01', periods=rows, freq='H')
df = pd.DataFrame(data, columns=['Temperature','Value'], index=tidx)
def changPointDf(df):
    arr = np.array(df.Value)

    # Define Binary Segmentation search method
    model = "l2"
    algo = rpt.Binseg(model=model).fit(arr)
    my_bkps = algo.predict(n_bkps=5)

    # getting the timestamps of the change points
    bkps_timestamps = df.iloc[[0] + my_bkps[:-1] + [-1]].index

    # computing the durations between change points
    durations = (bkps_timestamps[1:] - bkps_timestamps[:-1])

    # hours calc
    d = durations.seconds / 60 / 60
    d_f = pd.DataFrame(d)
    df2 = d_f.T
    return df2

master_hrs = pd.DataFrame()
for idx, days in df.groupby(df.index.date):
    changPoint_df = changPointDf(days)
    values = changPoint_df.values.tolist()
    master_hrs = master_hrs.append(values)
master_hrs.columns = ['overnight_AM_hrs', 'moring_startup_hrs', 'moring_ramp_hrs', 'high_load_hrs', 'evening_shoulder_hrs']
daily_summary = pd.DataFrame()
daily_summary['Temperature'] = df['Temperature'].resample('D').mean()
daily_summary['Value'] = df['Value'].resample('D').sum()
final_df = daily_summary.join(master_hrs)
The issue was the indexes themselves: the master_hrs index was int64 whereas the daily_summary index was datetime. Include this before joining the two dataframes together:
master_hrs.index = pd.to_datetime(master_hrs.index)
Just for clarity, here's my output of final_df:
Temperature Value ... high_load_hrs evening_shoulder_hrs
2019-01-01 0.417517 12.154527 ... NaN NaN
2019-01-02 0.521131 13.811842 ... NaN NaN
2019-01-03 0.583205 12.568966 ... NaN NaN
2019-01-04 0.448225 14.036136 ... NaN NaN
2019-01-05 0.542870 10.738192 ... NaN NaN
... ... ... ... ...
2024-09-10 0.470421 13.775528 ... NaN NaN
2024-09-11 0.384672 10.473930 ... NaN NaN
2024-09-12 0.527284 14.000231 ... NaN NaN
2024-09-13 0.555646 11.460867 ... NaN NaN
2024-09-14 0.426003 3.763975 ... NaN NaN
[2084 rows x 7 columns]
Hopefully this gets you what you need.