I'm struggling to solve this problem. I'm creating a data frame from data that can change from one day to the next, but I need to save the first version and block later updates.
This is the code:
# Create data frame for the ideal burndown line
df_ideal_burndown = pd.DataFrame(columns=['dates', 'ideal_trend'])
df_ideal_burndown['dates'] = range_sprint
#### Dates preparation
df_ideal_burndown['dates'] = pd.to_datetime(df_ideal_burndown['dates'], dayfirst=True)
df_ideal_burndown['dates'] = df_ideal_burndown['dates'].dt.strftime('%Y-%m-%d')
# Define the sprint length
days_sprint = int(len(range_sprint)) - int(cont_nonworking)
# Get how many items are in the current sprint
commited = len(df_current_sprint)
# Define the ideal number of items that should be delivered per day
ideal_burn = round(commited/days_sprint,1)
# Create a list of remaining items to be delivered by day
burndown = [commited - ideal_burn]
# Day of the sprint -> starts with 2, since the first day is already in the list above
sprint_day = 2
# Iterate to create the ideal trend line in numbers
for i in range(1, len(df_ideal_burndown), 1):
    burndown.append(round((commited - (ideal_burn * sprint_day)), 1))
    sprint_day += 1
# Add the ideal burndown to the column
df_ideal_burndown['ideal_trend'] = burndown
df_ideal_burndown
This is the output:
dates ideal_trend
0 2022-03-14 18.7
1 2022-03-15 17.4
2 2022-03-16 16.1
3 2022-03-17 14.8
4 2022-03-18 13.5
5 2022-03-21 12.2
6 2022-03-22 10.9
7 2022-03-23 9.6
8 2022-03-24 8.3
9 2022-03-25 7.0
10 2022-03-28 5.7
11 2022-03-29 4.4
12 2022-03-30 3.1
13 2022-03-31 1.8
14 2022-04-01 0.5
My main problem is related to commited = len(df_current_sprint), since the df_current_sprint is (and needs to be) used by other parts of my code.
Basically, even if the API returns new data that should be stored in df_current_sprint, I should keep using the version I had just created.
I am pretty new to Python and I do not know if there is a way to store and, let's say, cache this information until I need to use fresh data again.
I appreciate your support, clues, and guidance.
Marcelo
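A minimal sketch of one way to freeze that first version, assuming df_current_sprint already holds the API data as in the code above; the snapshot file name here is made up:
import os
import pandas as pd

SNAPSHOT_FILE = 'current_sprint_snapshot.pkl'  # made-up name

if os.path.exists(SNAPSHOT_FILE):
    # a frozen version already exists, so ignore any fresh API data for this calculation
    df_sprint_snapshot = pd.read_pickle(SNAPSHOT_FILE)
else:
    # first run: freeze the current state of df_current_sprint
    df_sprint_snapshot = df_current_sprint.copy()
    df_sprint_snapshot.to_pickle(SNAPSHOT_FILE)

commited = len(df_sprint_snapshot)  # the rest of the code can keep using df_current_sprint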
Related
I have a problem reading files using pandas (read_csv). I can do it with the built-in open(...), but it is much easier with pandas. I just need to read the data (numbers) between the ---- lines. This is the LINK to one of my data URLs; there are more depending on the date that I insert. A sample of this is:
MONTHLY CLIMATOLOGICAL SUMMARY for JUN. 2020
NAME: Krieza Evias CITY: Krieza Evias STATE:
ELEV: 119 m LAT: 38° 24' 00" N LONG: 24° 18' 00" E
TEMPERATURE (°C), RAIN (mm), WIND SPEED (km/hr)
HEAT COOL AVG
MEAN DEG DEG WIND DOM
DAY TEMP HIGH TIME LOW TIME DAYS DAYS RAIN SPEED HIGH TIME DIR
------------------------------------------------------------------------------------
1 18.2 22.4 10:20 13.5 23:50 1.0 0.9 0.0 4.5 33.8 12:30 E
2 17.6 22.3 15:00 10.8 4:10 2.0 1.3 0.0 4.5 30.6 15:20 E
3 18.1 21.9 12:20 14.1 3:40 1.3 1.1 1.0 4.2 24.1 14:40 E
Keep in mind that I cannot just use skiprows=8 and skipfooter=9 to get the data between the --------, because not all files of this format have a specific number of footer (skipfooter) or title (skiprows) lines to skip. Some have 2 or 3 and others have 8-9 lines of footer or title to skip. But every file has two lines of -------- with the data between them.
I think you can't directly use read_csv but you could do this:
import urllib.request
from io import StringIO
import pandas as pd

count = 0
txt = ""
data = urllib.request.urlopen(LINK)
for line in data:
    decoded = line.decode('windows-1252')
    if "---" in decoded:
        count += 1          # first dashed line opens the data block, second one closes it
    elif count == 1:
        txt += decoded      # inside the dashed block: keep the data lines
    elif count > 1:
        break               # past the closing dashed line: stop reading
df = pd.read_csv(StringIO(txt), sep=r"\s+", header=None)
header is None because in your link the column names are not on a single row but split across multiple rows. If they're fixed, I suggest putting them in by hand, such as ["DAY", "MEAN TEMP", ...].
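For example, assuming the 13-column layout in the sample above, they could be assigned after reading (these labels are my own guesses from the header text, not names defined by the file):
df.columns = ['DAY', 'MEAN_TEMP', 'HIGH', 'HIGH_TIME', 'LOW', 'LOW_TIME',
              'HEAT_DEG_DAYS', 'COOL_DEG_DAYS', 'RAIN',
              'AVG_WIND_SPEED', 'WIND_HIGH', 'WIND_HIGH_TIME', 'DOM_DIR']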
I have a gpx file that I am manipulating. I would like to add a column to it that describes the terrain based on another dataframe that lists the terrain by distance. Here are the dataframes:
GPS_df
lat lon alt time dist total_dist
0 44.565335 -123.312517 85.314 2020-09-07 14:00:01 0.000000 0.000000
1 44.565336 -123.312528 85.311 2020-09-07 14:00:02 0.000547 0.000547
2 44.565335 -123.312551 85.302 2020-09-07 14:00:03 0.001137 0.001685
3 44.565332 -123.312591 85.287 2020-09-07 14:00:04 0.001985 0.003670
4 44.565331 -123.312637 85.270 2020-09-07 14:00:05 0.002272 0.005942
... ... ... ... ... ... ...
12481 44.565576 -123.316116 85.517 2020-09-07 17:28:14 0.002318 26.091324
12482 44.565559 -123.316072 85.587 2020-09-07 17:28:15 0.002469 26.093793
12483 44.565554 -123.316003 85.637 2020-09-07 17:28:16 0.003423 26.097217
12484 44.565535 -123.315966 85.697 2020-09-07 17:28:17 0.002249 26.099465
12485 44.565521 -123.315929 85.700 2020-09-07 17:28:18 0.002066 26.101532
terrain_df:
dist terrain
0 0.0 Start
1 3.0 Road
2 5.0 Gravel
3 8.0 Trail-hard
4 12.0 Gravel
5 16.0 Trail-med
6 18.0 Road
7 22.0 Gravel
8 23.0 Trail-easy
9 26.2 Road
I have come up with the following code, which works, but I would like to make it more efficient by eliminating the looping:
GPS_df['terrain']=""
i=0
for j in range(0,len(GPS_df)):
if GPS_df.total_dist[j]<= terrain_df.dist[i]:
GPS_df.terrain[j]=terrain_df.terrain[i]
else:
i=i+1
GPS_df.terrain[j]=terrain_df.terrain[i]
I have tried half a dozen different ways, but none seem to work correctly. I am sure there is an easy way to do it, but I just don't have the skills and experience to figure it out so far, so I am looking for some help. I tried using cut and adding the labels, but cut requires unique labels. I could use cut and then replace the generated intervals with labels in another way, but that doesn't seem like the best approach either. I also tried this approach, which I found in another question, but it filled the column with the first label only (I also have trouble understanding how it works, which makes it tough to troubleshoot).
bins = terrain_df['dist']
names = terrain_df['terrain']
d = dict(enumerate(names, 1))
GPS_df['terrain2'] = np.vectorize(d.get)(np.digitize(GPS_df['dist'], bins))
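For what it's worth, one reading of why that attempt returns only the first label is that it digitizes GPS_df['dist'] (the tiny per-step distance) rather than GPS_df['total_dist']; a hedged variant of the same idea using the cumulative distance could look like:
import numpy as np

bins = terrain_df['dist'].to_numpy()
names = terrain_df['terrain'].to_numpy()
# right=True reproduces the loop's "total_dist <= dist" comparison
idx = np.digitize(GPS_df['total_dist'], bins, right=True)
GPS_df['terrain2'] = names[idx]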
Appreciate any guidance that you can give me.
I believe pandas.merge_asof should do the trick. With direction='forward', each GPS row is matched to the first terrain boundary whose dist is greater than or equal to its total_dist, which mirrors your loop's total_dist <= dist check. Try:
result = pd.merge_asof(left=GPS_df, right=terrain_df, left_on='total_dist', right_on='dist', direction='forward')
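A small self-contained check of that call (toy frames reusing the column names above, values made up for illustration):
import pandas as pd

GPS_df = pd.DataFrame({'total_dist': [0.0, 1.2, 2.9, 3.4, 6.1]})
terrain_df = pd.DataFrame({'dist': [0.0, 3.0, 5.0, 8.0],
                           'terrain': ['Start', 'Road', 'Gravel', 'Trail-hard']})

result = pd.merge_asof(left=GPS_df, right=terrain_df,
                       left_on='total_dist', right_on='dist',
                       direction='forward')
print(result['terrain'].tolist())  # ['Start', 'Road', 'Road', 'Gravel', 'Trail-hard']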
So I am not sure if I am taking the best approach to solve this problem, but this is what I have so far:
This is the df that I am working with:
calls.head()
id user_id call_date duration
0 1000_93 1000 2018-12-27 9.0
1 1000_145 1000 2018-12-27 14.0
2 1000_247 1000 2018-12-27 15.0
3 1000_309 1000 2018-12-28 6.0
4 1000_380 1000 2018-12-30 5.0
I am trying to figure out how to create a data frame that tells me how many times a user made a call in a month. This is the code I used to generate that:
calls_per_month = calls.groupby(['user_id',calls['call_date'].dt.month])['call_date'].count()
calls_per_month.head(10)
user_id call_date
1000 12 16
1001 8 27
9 49
10 65
11 64
12 56
1002 10 11
11 55
12 47
1003 12 149
Name: call_date, dtype: int64
Now, the issue is that I need to do further calculations with the user_id attributes of other data frames, so I need to be able to access the total I computed in this table. However, it seems that the table I created is not a DataFrame, which prevents me from doing so. This is a solution I tried:
calls_per_month = calls.groupby(['user_id',calls['call_date'].dt.month])['call_date'].count().reset_index()
#(calls_per_month.to_frame()).columns = ['user_id','date','total_calls']
calls_per_month.columns = ['user_id','date','total_calls']
(I tried with and without to_frame)
But I got the following error:
cannot insert call_date, already exists
Please suggest the best way to go about solving this issue. Considering that I have other dataframes with user_id and attributes like 'data used', how do I make this data frame such that I can do computations like total_use = calls['total_calls']*internet['data_used'] for each user_id?
Thank you.
Use rename to change the level name, so that Series.reset_index works correctly:
calls_per_month = (calls.groupby(['user_id',
                                  calls['call_date'].dt.month.rename('month')])['call_date']
                        .count()
                        .reset_index())
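From there, a hedged sketch of the follow-up computation mentioned in the question (the internet frame and its month/data_used columns are assumptions on my part, not something defined in this thread):
calls_per_month = calls_per_month.rename(columns={'call_date': 'total_calls'})
# assumed: internet has one row per user_id and month with a data_used column
usage = calls_per_month.merge(internet, on=['user_id', 'month'], how='inner')
usage['total_use'] = usage['total_calls'] * usage['data_used']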
I have a large dataset (around 8 million rows x 25 columns) in pandas and I am struggling to find a way to compute a weighted average over this dataframe which, in turn, creates another data frame.
Here is how my dataset looks like (very simplified version of it):
prec temp
location_id hours
135 1 12.0 4.0
2 14.0 4.1
3 14.3 3.5
4 15.0 4.5
5 15.0 4.2
6 15.0 4.7
7 15.5 5.1
136 1 12.0 4.0
2 14.0 4.1
3 14.3 3.5
4 15.0 4.5
5 15.0 4.2
6 15.0 4.7
7 15.5 5.1
I have a multi-index on [location_id, hours]. I have around 60k locations and 140 hours for each location (making up the 8 million rows).
The rest of the data is numeric (float) or categorical. I have only included 2 columns here, normally there are around 20 columns.
What I want to do is create a new data frame that is basically a weighted average of this data frame. The requirements indicate that 12 of these location_ids should be averaged out by a specified weight to form the combined_location_id values.
For example, location_ids 1,3,5,7,9,11,13,15,17,19,21,23 with their appropriate weights (separate data coming in from another data frame) should be weighted-averaged to form the combined_location_id CL_1's data.
That is a lot of data to handle and I wasn't able to find a purely pandas way of solving it. Therefore, I went with a for-loop approach. It is extremely slow and I am sure this is not the right way to do it:
def __weighted(self, ds, weights):
    return np.average(ds, weights=weights)

f = {'hours': 'first', 'location_id': 'first',
     'temp': lambda x: self.__weighted(x, weights),
     'prec': lambda x: self.__weighted(x, weights)}

data_frames = []
for combined_location in all_combined_locations:
    mapped_location_ids = combined_location.location_ids
    weights = combined_location.weights_of_location_ids
    data_for_this_combined_location = pd.concat(df_data.loc[df_data.index.get_level_values(0) == location_id] for location_id in mapped_location_ids)
    data_grouped_by_distance = data_for_this_combined_location.groupby("hours", as_index=False)
    data_grouped_by_distance = data_grouped_by_distance.agg(f)
    data_frames.append(data_grouped_by_distance)

df_combined_location_data = pd.concat(data_frames)
df_combined_location_data.set_index(['location_id', 'hours'], inplace=True)
This works functionally, however the performance and memory consumption are horrible. It takes over 2 hours on my dataset and that is currently not acceptable. The existence of the for loop is an indicator that this could be handled better.
Is there a better/faster way to implement this?
From what I can see, you can at least remove the inner loop over mapped_location_ids by selecting all of its locations at once with isin:
data_for_this_combined_location = df_data.loc[df_data.index.get_level_values(0).isin(mapped_location_ids)]
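A tiny self-contained illustration of that selection (toy MultiIndex and made-up values, just to show the pattern):
import pandas as pd

idx = pd.MultiIndex.from_product([[135, 136, 137], [1, 2, 3]],
                                 names=['location_id', 'hours'])
df_data = pd.DataFrame({'prec': range(9), 'temp': range(9)}, index=idx)

mapped_location_ids = [135, 137]
subset = df_data.loc[df_data.index.get_level_values(0).isin(mapped_location_ids)]
print(subset.index.get_level_values(0).unique().tolist())  # [135, 137]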
This question already has answers here:
Pandas: Difference to previous value
I am trying to get the average, max and min time difference between value occurrences in a csv file.
The file contains multiple columns and rows.
I am currently working in Python and trying to use pandas to solve my problem.
I have managed to break the csv file down to the time column and the column I want the time differences from,
i.e. the "payload" column where the value occurrences happen,
looking like:
time | payload
12.1 2368
13.8 2508
I have also tried to collect the times at which the value occurrences happen in an array and to step through that array, but I failed badly. I felt like there was an easier way to do it.
def average_time(avg_file):
    avg_read = pd.read_csv(avg_file, skiprows=2, names=new_col_names, usecols=[2, 3], na_filter=False, skip_blank_lines=True)
    test = []
    i = 0
    for row in avg_read.payload:
        if row != None:
            test[i] = avg_read.time
            i += 1
        if len[test] > 2:
            average = test[1] - test[0]
            i = 0
            test = []
    return average
The csv file currently looks like:
time | payload
12.1 2250
12.5 2305
12.9 (blank)
13.1 (blank)
13.5 2309
14.6 2350
14.9 2680
15.0 (blank)
I want to get the time difference between the values in the payload column, for example the time between
2250 and 2305 --> 12.5 - 12.1 = 0.4 sec
and then the difference between
2305 and 2309 --> 13.5 - 12.5 = 1 s,
skipping the blank values,
to later on get the maximum, minimum and average difference.
First use dropna, then use Series.diff:
DataFrame used:
print(df)
time payload
0 12.1 2250.0
1 12.5 2305.0
2 12.9 NaN
3 13.1 NaN
4 13.5 2309.0
5 14.6 2350.0
6 14.9 2680.0
7 15.0 NaN
df.dropna().time.diff()
0 NaN
1 0.4
4 1.0
5 1.1
6 0.3
Name: time, dtype: float64
Note I assumed your (blank) values are NaN; if they are not, run the following before my code:
import numpy as np
df.replace('(blank)', np.nan, inplace=True)
# Or, if they are empty/whitespace-only strings
df.replace(r'^\s*$', np.nan, regex=True, inplace=True)
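From there, the maximum, minimum and average gaps asked about in the question follow directly from the same Series (a short follow-up sketch):
diffs = df.dropna().time.diff()
print(diffs.max(), diffs.min(), diffs.mean())  # 1.1, 0.3 and 0.7 for the frame above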