I am taking 30 days of historical data and modifying it.
Hopefully, I can read the historical data and have it refer to a dynamic rolling 30-day window. The 'DateTime' value is a column in the raw data.
df_new = pd.read_csv(loc + filename, parse_dates=['DateTime'])
max_date = df_new['DateTime'].max()
date_range = max_date - pd.Timedelta(30, unit='d')
df_old = pd.read_hdf(loc + filename, 'TableName', where='DateTime > date_range')
Then I read the new data, which is a separate file that always contains month-to-date values (all of June, for example; this file is replaced daily with the latest data), and concat it with the old dataframe.
frames = [df_old, df_new]
df = pd.concat(frames)
Then I do some things to the data (I check whether certain values repeat within a 30-day window; if they do, I place a timestamp in a column).
Now I want to add this modified data back into my original file (it was HDF5, but it could be a .sqlite file too), i.e. the file df_old came from. There will certainly be a ton of duplicates, since I am reading the previous 30 days of data plus the MTD data. How do I manage this?
My only solution is to read the entire file (df_old along with the new data I added), drop duplicates and then overwrite it again. This isn't very efficient.
Can the .sqlite or .hdf formats enforce uniqueness? If so, I have 3 columns which identify a unique row (Date, EmpID, CustomerID). I do not want exact duplicate rows.
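For reference, the inefficient version I have now looks roughly like this (an untested sketch, using the loc/filename/df variables from above and the three key columns):
# read everything back, drop duplicate keys, rewrite the whole table (slow for a big file)
full = pd.concat([pd.read_hdf(loc + filename, 'TableName'), df])
full = full.drop_duplicates(subset=['Date', 'EmpID', 'CustomerID'], keep='last')
full.to_hdf(loc + filename, 'TableName', mode='w', format='table', data_columns=True)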
Define them as primary keys in SQLite. It won't allow you to insert rows whose primary-key combination already exists.
e.g.
CREATE TABLE my_table (
    a INT,
    b INT,
    c INT,
    PRIMARY KEY(a, b)
);
won't allow duplicates of (a, b) to be added to the data. Then use
INSERT OR IGNORE to add data, and any duplicates will be ignored.
http://sqlite.org/lang_insert.html
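A minimal sketch of how that could look from Python with the question's key columns (the table name records and the FlagTime column are placeholder names I made up):
import sqlite3
import pandas as pd

# toy stand-in for the concatenated/modified dataframe from the question
df = pd.DataFrame({
    'Date': ['2019-06-01', '2019-06-01'],
    'EmpID': [1, 2],
    'CustomerID': [10, 20],
    'FlagTime': [None, '2019-06-01 09:00'],   # hypothetical "repeat found" timestamp column
})

conn = sqlite3.connect('data.sqlite')
conn.execute("""
    CREATE TABLE IF NOT EXISTS records (
        Date       TEXT,
        EmpID      INTEGER,
        CustomerID INTEGER,
        FlagTime   TEXT,
        PRIMARY KEY (Date, EmpID, CustomerID)
    )
""")

# INSERT OR IGNORE silently skips rows whose (Date, EmpID, CustomerID) already exist
rows = df.to_records(index=False).tolist()   # plain Python tuples for sqlite3
conn.executemany("INSERT OR IGNORE INTO records VALUES (?, ?, ?, ?)", rows)
conn.commit()
conn.close()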
Related
I keep receiving Excel files where the table starts in a random location, and I want to use Python to export the data. I want to pull two columns from the table, but the start location is random (for example, in one file the table might start at Column 2, Row 3, and in the next file we receive it starts at Column 4, Row 7).
I tried trimming the null values, but sometimes the file has a title so that doesn't work.
The table column is consistent so I was wondering if there's any way to use the index of that column to pull the data.
Below is an example of the data I'm receiving. I want to pull the columns Product Number and Market. The table is in random locations in each file.
import pandas as pd

# read with header=None so cell positions match the raw sheet exactly
df = pd.read_excel('Fun.xlsx', header=None)
header_row = None
for i in range(df.shape[0]):
    for j in range(df.shape[1]):
        if df.iloc[i, j] == "Product Number":
            header_row = i
            print(i, j)
    if header_row is not None:
        break
# re-read the file so that the detected row becomes the header row
correctdata = pd.read_excel('Fun.xlsx', skiprows=header_row)
FinalCorrectData = correctdata[["Product Number", "Sale", "Market"]]
This gets you the data frame you need (FinalCorrectData). You may have to deal with empty rows below the table, but that should be an easy fix. Once you have the data frame you can easily write it to Excel.
This is a naïve brute-force way to solve it, so there is probably a better approach.
I have a large dataset ('df'; ~400,000 rows) of timestamped rows describing features of cities.
eg.
df = pd.DataFrame([['2016-01-01 00:00:00','Jacksonville'], ['2016-01-01 01:00:00','Jacksonville'],
['2016-01-01 02:00:00','Jacksonville'], ['2016-01-01 03:00:00','Toronto']], columns=['timestamp','City'])
I want to merge this with another smaller dataset I've created ('public_holidays'; ~300 lines) that lists public holidays for those cities.
eg.
public_holidays = pd.DataFrame([['1/01/2016','New Year\'s Day','Jacksonville'], ['1/01/2016','New Year\'s Day','San Francisco'],
['25/12/2018','Christmas Day','Toronto'], ['26/12/2018','Boxing Day','Toronto']], columns=['timestamp','Holiday','City'])
Currently I've done this:
new_df= pd.merge(df, public_holidays, how = 'left', on = ['timestamp','City'])
This works; however, as the timestamp in 'df' contains every hour of each day, the merge only occurs at hour 00:00 (as the 'public_holidays' "timestamp" is date-only).
How can I get 'public_holidays' to map to every row that matches its date, regardless of time?
Many thanks for any assistance.
Add to df an auxiliary column with normalized timestamp:
df['dat'] = df.timestamp.dt.normalize()
Then in merge, instead of on=... pass:
left_on=['dat', 'City'],
right_on=['timestamp', 'City'].
Finally (after the new_df is created) you can drop this auxiliary column.
An alternative is to overwrite timestamp column with the normalized timestamp:
df.timestamp = df.timestamp.dt.normalize()
and perform the merge without any change.
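Putting it together with the sample frames from the question (a sketch, not an actual test run; the day-first parsing and the suffixes/drop at the end are my own additions):
# both timestamp columns must be real datetimes first (public_holidays dates are day-first)
df['timestamp'] = pd.to_datetime(df['timestamp'])
public_holidays['timestamp'] = pd.to_datetime(public_holidays['timestamp'], dayfirst=True)

df['dat'] = df['timestamp'].dt.normalize()
new_df = pd.merge(df, public_holidays, how='left',
                  left_on=['dat', 'City'], right_on=['timestamp', 'City'],
                  suffixes=('', '_holiday'))
new_df = new_df.drop(columns=['dat', 'timestamp_holiday'])   # drop the auxiliary columns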
Note: As I have not run this against any actual data, the above advice is only
"theoretical", not supported by an actual test run.
I'm organizing a new dataframe in order to easily insert data into a Bokeh visualization code snippet. I think my problem is due to differing column lengths, but I am not sure.
Below, I organized the dataset in alphabetical order by country name and created an alphabetical list of the individual countries. new_data.tail() shows Zimbabwe listed last, with 80,336 rows in total, so the sorting worked.
df_ind_data = pd.DataFrame(ind_data)
new_data = df_ind_data.sort_values(by=['country'])
new_data = new_data.reset_index(drop=True)
country_list = list(ind_data['country'])
new_country_set = sorted(set(country_list))
My goal is to create a new DataFrame with 76 columns (country names), with the country-specific 'trust' data in the rows underneath each country column.
df = pd.DataFrame()
for country in new_country_set:
    pink = new_data.loc[new_data['country'] == country]
    df[country] = pink.trust
Output here
As you can see, the data does not get included for the rest of the columns after the first. I believe this is because the number of rows of 'trust' data for each country varies. While the first column has 1000 rows, some countries have as many as 2500 data points and as few as 500.
I have attempted a few different methods to specify the number of rows in 'df', but to no avail.
The visualization code snippet I have uses this exact data structure for its template data, which is why I'm attempting to put it in a dataframe. Plus, I can't manage to do it, so I want to know how.
Yes, I can put it in a dictionary, but I want to put it in a dataframe.
You should use combine_first when you add a new column so that the dataframe index gets extended. Instead of
df[country] = pink.trust
you should use
df = df.combine_first(pink.trust.rename(country).to_frame())
which ensures that your index is always the union of the indexes of all added columns.
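A tiny toy illustration of the idea (made-up data, non-overlapping indexes):
import pandas as pd

a = pd.Series([1.0, 2.0], index=[0, 1], name='Albania')
b = pd.Series([3.0, 4.0, 5.0], index=[2, 3, 4], name='Zimbabwe')

df = pd.DataFrame()
df = df.combine_first(a.to_frame())
df = df.combine_first(b.to_frame())
print(df)   # index 0..4; each column is filled only at its own rows, NaN elsewhere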
I think in this case pd.pivot(columns='var', values='val') will work for you, especially since you already have a dataframe. This function turns the values of a particular column into column names. You can check the documentation for additional info. I hope that helps.
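For example, with the new_data frame from the question (a sketch I have not run; the second line is my own addition and is only needed if you want each country's values to start at row 0 instead of keeping their original row positions):
wide = new_data.pivot(columns='country', values='trust')
# compact each column so its values start at row 0; shorter columns are NaN-padded at the bottom
wide = wide.apply(lambda s: s.dropna().reset_index(drop=True))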
I have an Excel data file with thousands of rows and columns.
I am using python and have started using pandas dataframes to analyze data.
What I want to do in column D is to calculate annual change for values in column C for each year for each ID.
I can use Excel to do this: if the ID is the same as that in the prior row, calculate the annual change (leaving the cells highlighted in blue empty, because that's the first period for that particular ID). I don't know how to do this using Python. Can anyone help?
Assuming the dataframe is already sorted:
df.groupby('ID').Cash.pct_change()
However, you can speed things up using that sorted assumption, because it isn't actually necessary to group in order to calculate the percentage change from one row to the next:
df.Cash.pct_change().mask(
    df.ID != df.ID.shift()
)
Either of these produces the column values you are looking for. To add the column, assign it to the dataframe (or create a new dataframe with the new column):
df['AnnChange'] = df.groupby('ID').Cash.pct_change()
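For example, with some made-up toy data (hypothetical IDs and Cash values, just to show the shape of the result):
import pandas as pd

df = pd.DataFrame({
    'ID':   [101, 101, 101, 202, 202],
    'Year': [2016, 2017, 2018, 2017, 2018],
    'Cash': [100.0, 110.0, 99.0, 50.0, 75.0],
})

df['AnnChange'] = df.groupby('ID').Cash.pct_change()
print(df)   # the first row of each ID is NaN, matching the blank 'blue' first-period cells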
I am new to Python, and in the last few days I have taught myself how to open and operate on data stored in txt, xls and asc files with pandas, but I still get confused when doing operations with dataframes.
I have a .wac file which has the right formatting (it then has to be used as an input file for a piece of software) but contains partially wrong values, and an .xlsx file containing the right values.
I have transferred the data to two dataframes with this code (I used skiprows to skip through the string data in both files):
data_format = pd.read_csv('Example.wac', skiprows=11, delim_whitespace=True, names=["Date", "Hour", "Temp gap North [C]", "RH %"])
data_WUFI =pd.read_excel('Temperature_RH_North.xlsx', skiprows=1, header=None, dtype=float, names=["Hour", "Temp gap [C]", "RH %"])
Now I need to do the following modifications to the dataframes, but I do not know where to start from and I hope I came to the right place to seek help.
For data_format:
- the column 'Date' is in the format *2018-01-01* and runs to *2019-12-31*. Being obviously a date, it stays the same for 24 positions and then it increases by 1 day. I need to add rows to that column up to *2027-12-31* (without leap years)
- the column 'Hour' is in the format *01:00*. Values run from *01:00* to *24:00*. I need to add rows so that every 24 hours the date in the first column increases by one day, then the hour numbering restarts at *01:00*
- The column 'RH %' contains the same value in all rows, i.e. 0.5
I add a snapshot of data_format to make it clearer, and below it a rough sketch of how I imagine building the new date/hour grid:
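This is only my own untested idea (the mapping of clock hours 0-23 to the labels 01:00-24:00 is a guess at the convention):
import pandas as pd

# build an hourly grid from 2018-01-01 to 2027-12-31, drop the leap days,
# and format the hours as 01:00 ... 24:00
hours = pd.date_range('2018-01-01 00:00', '2027-12-31 23:00', freq='H')
hours = hours[~((hours.month == 2) & (hours.day == 29))]      # no leap years

data_format_NEW = pd.DataFrame({
    'Date': hours.strftime('%Y-%m-%d'),
    'Hour': ['%02d:00' % (h + 1) for h in hours.hour],        # 0..23 -> 01:00 .. 24:00
    'Temp gap North [C]': float('nan'),                       # to be filled from data_WUFI
    'RH %': 0.5,
})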
Once the new dataframe is created, e.g. data_format_NEW, I can substitute the values in 'Temp gap North [C]' with the correct values from data_WUFI (already of the right size):
data_format_NEW['Temp gap North [C]'] = data_WUFI['Temp gap [C]']
At that point I will write data_format_NEW in a .wac file:
data_format_NEW.to_csv('Example_NEW.wac', index=False, sep=' ')
but the first 12 rows will have to contain string values as in the picture.
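My (untested) idea for that header block is to write the 12 string lines myself and then append the data without a header; header_text below is just a placeholder for those lines:
# write the 12 header lines first, then append the dataframe without a header row
with open('Example_NEW.wac', 'w') as f:
    f.write(header_text)   # header_text = the 12 string lines copied from the original .wac
data_format_NEW.to_csv('Example_NEW.wac', mode='a', sep=' ', index=False, header=False)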
I am not sure whether I got the planning right, but I hope I managed to explain myself clearly enough.