I'm working on a billing system.
On the one hand, I have contracts with start and end dates, which I need to bill monthly. One contract can have several start/end date ranges, but they can't overlap for the same contract.
On the other hand, I have a df with the invoices billed per contract, with their start and end dates. Invoices' start/end dates for a specific contract also can't overlap. There can be gaps, though, between the end date of one invoice and the start of another.
My goal is to look at the contract start/end dates and remove all the periods already billed for each contract, so that I know what's left to be billed.
Here is my data for contract:
contract_df = pd.DataFrame({'contract_id': {0: 'C00770052',
1: 'C00770052',
2: 'C00770052',
3: 'C00770052',
4: 'C00770053'},
'from': {0: pd.to_datetime('2018-07-01 00:00:00'),
1: pd.to_datetime('2019-01-01 00:00:00'),
2: pd.to_datetime('2019-07-01 00:00:00'),
3: pd.to_datetime('2019-09-01 00:00:00'),
4: pd.to_datetime('2019-10-01 00:00:00')},
'to': {0: pd.to_datetime('2019-01-01 00:00:00'),
1: pd.to_datetime('2019-07-01 00:00:00'),
2: pd.to_datetime('2019-09-01 00:00:00'),
3: pd.to_datetime('2021-01-01 00:00:00'),
4: pd.to_datetime('2024-01-01 00:00:00')}})
Here is my invoice data (no invoice for C00770053):
invoice_df = pd.DataFrame({'contract_id': {0: 'C00770052',
1: 'C00770052',
2: 'C00770052',
3: 'C00770052',
4: 'C00770052',
5: 'C00770052',
6: 'C00770052',
7: 'C00770052'},
'from': {0: pd.to_datetime('2018-07-01 00:00:00'),
1: pd.to_datetime('2018-08-01 00:00:00'),
2: pd.to_datetime('2018-09-01 00:00:00'),
3: pd.to_datetime('2018-10-01 00:00:00'),
4: pd.to_datetime('2018-11-01 00:00:00'),
5: pd.to_datetime('2019-05-01 00:00:00'),
6: pd.to_datetime('2019-06-01 00:00:00'),
7: pd.to_datetime('2019-07-01 00:00:00')},
'to': {0: pd.to_datetime('2018-08-01 00:00:00'),
1: pd.to_datetime('2018-09-01 00:00:00'),
2: pd.to_datetime('2018-10-01 00:00:00'),
3: pd.to_datetime('2018-11-01 00:00:00'),
4: pd.to_datetime('2019-04-01 00:00:00'),
5: pd.to_datetime('2019-06-01 00:00:00'),
6: pd.to_datetime('2019-07-01 00:00:00'),
7: pd.to_datetime('2019-09-01 00:00:00')}})
My expected result is:
to_bill_df = pd.DataFrame({'contract_id': {0: 'C00770052',
1: 'C00770052',
2: 'C00770053'},
'from': {0: pd.to_datetime('2019-04-01 00:00:00'),
1: pd.to_datetime('2019-09-01 00:00:00'),
2: pd.to_datetime('2019-10-01 00:00:00')},
'to': {0: pd.to_datetime('2019-05-01 00:00:00'),
1: pd.to_datetime('2021-01-01 00:00:00'),
2: pd.to_datetime('2024-01-01 00:00:00')}})
What I need, therefore, is to go through each row of contract_df, identify the invoices matching the relevant period and remove the periods which have already been billed, possibly splitting a contract_df row into two rows if there is a gap.
The problem is that going about it this way seems very heavy considering that I'll have millions of invoices and contracts. I feel like there is an easy way with pandas, but I'm not sure how to do it.
Thanks
I was solving a similar problem the other day. It's not a simple solution, but it should be generic in identifying any non-overlapping intervals.
The idea is to convert your dates into continuous integers, and then we can remove the overlap with a set OR (union) operation. The function below will transform your DataFrame into a dictionary that maps each ID to its set of non-overlapping integer dates.
from functools import reduce
def non_overlapping_intervals(df, uid, date_from, date_to):
    # Convert each date to an integer day offset from a reference date
    helper_from = date_from + '_helper'
    helper_to = date_to + '_helper'
    df[helper_from] = df[date_from].sub(pd.Timestamp('1900-01-01')).dt.days  # reference date
    df[helper_to] = df[date_to].sub(pd.Timestamp('1900-01-01')).dt.days
    out = (
        df[[uid, helper_from, helper_to]]
        .dropna()
        .groupby(uid)[[helper_from, helper_to]]
        .apply(
            lambda x: reduce(  # handles an arbitrary number of rows per ID
                lambda a, b: a | b,  # set OR (union) eliminates the overlapping dates
                x.apply(
                    lambda y: set(range(y[helper_from], y[helper_to])),  # continuous integers for each date range
                    axis=1
                )
            )
        )
        .to_dict()
    )
    return out
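To make the set-union step concrete, here is a tiny standalone sketch with made-up date ranges (not the sample data): two overlapping ranges expand into sets of day offsets, and their union covers each day exactly once.
import pandas as pd

ref = pd.Timestamp('1900-01-01')
days = lambda d: (pd.Timestamp(d) - ref).days

a = set(range(days('2018-07-01'), days('2018-07-05')))  # 4 days
b = set(range(days('2018-07-03'), days('2018-07-08')))  # 5 days, overlapping a
print(len(a | b))  # 7 -- the overlapping days are only counted once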
From here, we want to do a set subtraction to find the dates and IDs for which there are contracts but no invoices:
from collections import defaultdict
invoice_dates = defaultdict(set, non_overlapping_intervals(invoice_df, 'contract_id', 'from', 'to'))
contract_dates = defaultdict(set, non_overlapping_intervals(contract_df, 'contract_id', 'from', 'to'))
missing_dates = {}
for k, v in contract_dates.items():
    missing_dates[k] = list(v - invoice_dates.get(k, set()))
Now we have a dict called missing_dates that gives us each date for which there are no invoices. To convert it into your output format, we need to separate each continuous group for each ID. Using this answer, we arrive at the below:
from itertools import groupby
from operator import itemgetter
missing_invoices = []
for uid, dates in missing_dates.items():
    # consecutive integers share the same (index - value), so each group is a contiguous run
    for k, g in groupby(enumerate(sorted(dates)), lambda x: x[0] - x[1]):
        group = list(map(int, map(itemgetter(1), g)))
        missing_invoices.append([uid, group[0], group[-1]])
missing_invoices = pd.DataFrame(missing_invoices, columns=['contract_id', 'from', 'to'])
# Convert back to datetime
missing_invoices['from'] = missing_invoices['from'].apply(lambda x: pd.Timestamp('1900-01-01') + pd.DateOffset(days=x))
missing_invoices['to'] = missing_invoices['to'].apply(lambda x: pd.Timestamp('1900-01-01') + pd.DateOffset(days=x + 1))
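As a quick check, sorting the result should line up with the expected to_bill_df from the question (row order aside):
print(missing_invoices.sort_values(['contract_id', 'from']).reset_index(drop=True))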
Probably not the simple solution you were looking for, but this should be reasonably efficient.
I have a column in my DataFrame with values that look like this (I want to sort my DataFrame by this column):
Mutations=['A67D','C447E','C447F','C447G','C447H','C447I','C447K','C447L','C447M','C447N','C447P','C447Q','C447R','C447S','C447T','C447V','C447W','C447Y','C447_','C92A','C92D','C92E','C92F','C92G','C92H','C92I','C92K','C92L','C92M','C92N','C92P','C92Q','C92R','C92S','C92T','C92V','C92W','C92_','D103A','D103C','D103F','D103G','D103H','D103I','D103K','D103L','D103M','D103N','D103R','D103S','D103T','D103V','silent_C88G','silent_G556R']
Basically, all the values are in the format Char_1-Digit-Char_2. I want to sort them with Digit as the highest priority and Char_2 as the second-highest priority, and then sort my whole DataFrame by this column.
I thought I could do that with sorted(), using this function as the key= argument:
import re

def alpha_numeric_sort_key(unsorted_list):
    return int("".join(re.findall(r"\d*", unsorted_list)))
This works for lists. I tried the same thing for my DataFrame but got this error:
df = raw_df.sort_values(by='Mutation',key=alpha_numeric_sort_key,ignore_index=True) #sorts values by one letter amino acid code
TypeError: expected string or bytes-like object
I just need to understand the right way to use key= in df.sort_values(), explained in a way that's understandable to someone with an intermediate level of Python experience.
I also provided the head of my DataFrame in case it's helpful for answering my question. If not, ignore it.
Thanks!
raw_df=pd.DataFrame({'0': {0: 100, 1: 100, 2: 100, 3: 100}, 'Mutation': {0: 'F100D', 1: 'F100S', 2: 'F100M', 3: 'F100G'},
'rep1_AGGTTGGG-TCGATTAG': {0: 2.0, 1: 15.0, 2: 49.0, 3: 19.0},
'Input_AGGTTGGG-TCGATTAG': {0: 48.0, 1: 125.0, 2: 52.0, 3: 98.0}, 'rep2_GTGTGGTG-TGTTCTAG': {0: 8.0, 1: 40.0, 2: 33.0, 3: 11.0}, 'WT_plasmid_GTGTGGTG-TGTTCTAG': {0: 1.0, 1: 4.0, 2: 1.0, 3: 1.0},
'Amplicon': {0: 'Amp1', 1: 'Amp1', 2: 'Amp1', 3: 'Amp1'},
'WT_plas_norm': {0: 1.9076506328630974e-06, 1: 7.63060253145239e-06, 2: 1.9076506328630974e-06, 3: 1.9076506328630974e-06},
'Input_norm': {0: 9.171121666392808e-05, 1: 0.0002388312933956, 2: 9.935381805258876e-05, 3: 0.0001872437340221},
'escape_rep1_norm': {0: 4.499235130027895e-05, 1: 0.000337442634752, 2: 0.0011023126068568, 3: 0.0004274273373526},
'escape_rep1_fitness': {0: -1.5465897459555915, 1: -1.087197258196361, 2: -0.1921857678502714, 3: -0.8788509789836517} } )
If you look at the definition of the key parameter in sort_values, it clearly says:
It should expect a Series and return a Series with the same shape as
the input. It will be applied to each column in by independently.
So you cannot use a key function that expects a single scalar string; the key receives the whole column as a Series.
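To see why the original key fails, here is a small illustration using the raw_df and alpha_numeric_sort_key already defined above:
import re

# The original key expects a single string, so handing it the whole column fails:
try:
    alpha_numeric_sort_key(raw_df["Mutation"])
except TypeError as e:
    print(e)  # expected string or bytes-like object

# A Series-in, Series-out key works instead, e.g. a vectorized digit extraction:
print(raw_df["Mutation"].str.extract(r"(\d+)", expand=False))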
You can do sorting in two ways:
First way:
sort_int_key = lambda col: col.str.extract(r"(\d+)", expand=False).astype(int)
sort_char_key = lambda col: col.str.extract(r"\d+(\w+)", expand=False)

# sort by the secondary key first, then stably by the primary key
raw_df.sort_values(by="Mutation", key=sort_char_key, kind="stable").sort_values(
    by="Mutation", key=sort_int_key, kind="stable"
)
Second way: assign the extracted values as temporary columns and sort by those columns using the by parameter:
raw_df.assign(
    sort_int=raw_df["Mutation"].str.extract(r"(\d+)", expand=False).astype(int),
    sort_char=raw_df["Mutation"].str.extract(r"\d+(\w+)", expand=False),
).sort_values(by=["sort_int", "sort_char"])
I have a DataFrame with 16 columns, and when I group it by one of the columns, it reduces the DataFrame to 12 columns.
def update_centroids(cluster, k):
    df["Cluster"] = cluster
    display(df)
    display(df.groupby("Cluster").mean().shape)
    return df.groupby("Cluster").mean()
This is the "df"
This is what the function returns
It just removes the columns "WindGustDir", "WindDir9am" & "WindDir3pm".
I can't think of anything that would cause that and I can't seem to find anything online.
Sample Data (df)
{'MinTemp': {0: 0.11720244784576628,
1: -0.8427745726259455,
2: 0.03720436280645697,
3: -0.5547814664844322,
4: 0.7731867451681026},
'MaxTemp': {0: -0.10786029175347862,
1: 0.20733745161878284,
2: 0.29330047253849006,
3: 0.6228253860640358,
4: 1.2388937026552729},
'Rainfall': {0: -0.20728093808218293,
1: -0.2769572340371439,
2: -0.2769572340371439,
3: -0.2769572340371439,
4: -0.16083007411220893},
'WindGustDir': {0: 1.0108491579487748,
1: 1.2354908672839122,
2: 0.7862074486136377,
3: -1.2355679354025977,
4: 1.0108491579487748},
'WindGustSpeed': {0: 0.24007558342699453,
1: 0.24007558342699453,
2: 0.390124802975703,
3: -1.2604166120600901,
4: 0.015001754103931822},
'WindDir9am': {0: 1.0468595036148063,
1: 1.6948858890138538,
2: 1.0468595036148063,
3: -0.24919326718328894,
4: -0.8972196525823365},
'WindDir3pm': {0: 1.2126203373471025,
1: 0.764386964628212,
2: 0.764386964628212,
3: -0.8044298398879051,
4: 1.4367370237065478},
'WindSpeed9am': {0: 0.5756272310362935,
1: -1.3396174328344796,
2: 0.45592443954437023,
3: -0.5016978923910164,
4: -0.9805090583587096},
'WindSpeed3pm': {0: 0.5236467885906614,
1: 0.29063908419611084,
2: 0.7566544929852119,
3: -1.2239109943684676,
4: 0.057631379801560294},
'Humidity9am': {0: 0.18973417158101255,
1: -1.2387396584055541,
2: -1.556178287291458,
3: -1.1858332202579036,
4: 0.7717049912051694},
'Humidity3pm': {0: -1.381454000080091,
1: -1.2369929482683248,
2: -0.9962245285820479,
3: -1.6703761037036233,
4: -0.8517634767702817},
'Pressure9am': {0: -1.3829003897707315,
1: -0.9704973422317451,
2: -1.3971211845134583,
3: 0.024958289758919134,
4: -0.9420557527463073},
'Pressure3pm': {0: -1.142657774670493,
1: -1.0420314949852805,
2: -0.9126548496756962,
3: -0.32327235437655133,
4: -1.3007847856044166},
'Temp9am': {0: -0.08857174739799159,
1: -0.0413389204605655,
2: 0.5569435540801637,
3: 0.1003595603517128,
4: 0.0531267334142867},
'Temp3pm': {0: -0.04739593761083206,
1: 0.318392868036414,
2: 0.1574457935516255,
3: 0.6402870170059903,
4: 1.1084966882344651},
'Cluster': {0: 1, 1: 1, 2: 1, 3: 2, 4: 1}}
With your sample data for df, it looks like df.groupby("Cluster").mean() does not remove any columns.
One possible explanation: using groupby().mean() on non-numeric types will cause the column to be removed entirely. If you have a large dataset that you are importing from a file or something, is it possible that there are non-numeric data types?
If, for example, you are reading a csv file using pd.read_csv(), and there are cells that read "NULL", they may be parsed as strings instead of numeric values. One option to find rows with non-numeric values is to use:
import numpy as np
df[~df.applymap(np.isreal).all(1)]
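For a concrete picture of the silent drop, here is a small self-contained sketch (made-up values, not the question's data): a single stray string gives the whole column object dtype, and that column disappears from the grouped mean.
import pandas as pd

demo = pd.DataFrame({
    "Cluster": [1, 1, 2],
    "MinTemp": [0.1, -0.8, 0.0],
    "WindGustDir": [1.0, "NULL", 0.8],   # mixed types -> object dtype
})
# numeric_only=True makes the drop explicit; older pandas dropped non-numeric columns silently by default
print(demo.groupby("Cluster").mean(numeric_only=True).columns.tolist())
# ['MinTemp'] -- 'WindGustDir' is gone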
I am trying to replicate a table, which is currently produced in R, in Python using the plotnine library. I am using facet_grid with two variables (CBRegion and CBIndustry).
I have found a similar problem; however, it is also done in R. I applied similar code as in that link and produced the following table:
I tried to use exactly the same code in Python with plotnine, but the final output is very ugly. This is my Python code so far:
myplot = ggplot(data=df_data_bar) + aes(x="CCR100PDMid %", y="CBSector") + \
    geom_segment(aes(yend="CBSector", xend=0), colour="black", size=2) + \
    geom_text(aes(label="label")) + \
    theme(panel_grid_major_y=element_blank()) + \
    facet_grid('CBIndustry ~ CBRegion', scales="free_y", space="free") + \
    labs(x="", y="", title=title) + \
    theme_bw() + \
    theme(plot_title=element_text(linespacing=0.8, face="bold", size=20, va="center"),
          axis_text_x=element_text(colour="#333333", size=12, rotation=0, ha="center", va="top", face="bold"),
          axis_text_y=element_text(colour="#333333", size=12, rotation=0, ha="right", va="center", face="bold"),
          axis_title_x=element_blank(),
          axis_title_y=element_blank(),
          legend_position="none",
          strip_text_x=element_text(size=12, face="bold", colour="black", angle=0),
          strip_text_y=element_text(size=8, face="bold", colour="black", angle=0, ha="left"),
          strip_background_y=element_text(width=0.2),
          figure_size=(30, 20))
The image from plotnine is as follows:
Comparing Python vs R, we can clearly see that the y-axis labels overlap when using plotnine. In addition, facets with a single group (e.g. Europe / Community Groups) get the same size box as facets with multiple groups, which is unnecessary. I also tried different aspect ratios, but that has not resolved my problem.
In short, I would like to have the same plot as R produces. It does not need to be produced in plotnine; alternatives are also welcome. Data from the top ten rows is:
{'CBRegion': {0: 'Europe', 1: 'Europe', 2: 'Europe', 3: 'Europe', 4: 'Europe', 5: 'Europe', 6: 'Europe', 7: 'Europe', 8: 'Europe', 9: 'Europe'}, 'CBSector': {0: 'Aerospace & Defense', 1: 'Alternative Energy', 2: 'Automobiles & Parts', 3: 'Banks', 4: 'Beverages', 5: 'Chemicals', 6: 'Colleges & Universities', 7: 'Community Groups', 8: 'Construction & Materials', 9: 'Electricity'}, 'CBIndustry': {0: 'Industrials', 1: 'Oil & Gas', 2: 'Consumer Goods', 3: 'Financials', 4: 'Consumer Goods', 5: 'Basic Materials', 6: 'NPO', 7: 'Community Groups', 8: 'Industrials', 9: 'Utilities'}, 'CCR100PDMid': {0: 0.015545818181818181, 1: 0.003296, 2: 0.012897471223021583, 3: 0.008079544600938968, 4: 0.008716597402597401, 5: 0.0094617476340694, 6: 0.008897475862068967, 7: 0.000821, 8: 0.012205547455295736, 9: 0.0050264210526315784}, 'CCR100PDMid %': {0: 1.554581818181818, 1: 0.3296, 2: 1.2897471223021584, 3: 0.8079544600938968, 4: 0.8716597402597401, 5: 0.9461747634069401, 6: 0.8897475862068966, 7: 0.0821, 8: 1.2205547455295735, 9: 0.5026421052631579}, 'label': {0: '1.6%', 1: '0.3%', 2: '1.3%', 3: '0.8%', 4: '0.9%', 5: '0.9%', 6: '0.9%', 7: '0.1%', 8: '1.2%', 9: '0.5%'}}
If necessary, I can upload the entire dataset, but I just read the minimal reproducible example guidance and it says I should only include a subset of the data. I am new to SO and hope that I included all vital information. I will be grateful for any help. Thank you in advance!
The other issues (colours, overlapping labels, wrapping text, etc.) can be fixed, but unfortunately space='free' is not currently supported in plotnine; see the documentation here. That's kind of a deal-breaker for your table, so you will need to do this in R's ggplot.
I have code that checks different columns for all the dates that are >= "2022-03-01" and <= "2024-12-31", then appends them to a list ext = [].
What I would like is to be able to extract more of the information located on the same row.
My code:
from pandas import *
data = read_csv("Book1.csv")
# converting column data to list
D_EXT_1 = data['D_EXT_1'].tolist()
D_INT_1 = data['D_INT_1'].tolist()
D_EXT_2 = data['D_EXT_2'].tolist()
D_INT_2 = data['D_INT_2'].tolist()
D_EXT_3 = data['D_EXT_3'].tolist()
D_INT_3 = data['D_INT_3'].tolist()
D_EXT_4 = data['D_EXT_4'].tolist()
D_INT_4 = data['D_INT_4'].tolist()
D_EXT_5 = data['D_EXT_5'].tolist()
D_INT_5 = data['D_INT_5'].tolist()
D_EXT_6 = data['D_EXT_6'].tolist()
D_INT_6 = data['D_INT_6'].tolist()
ext = []
ext = [i for i in D_INT_1 + D_INT_2 + D_INT_3 + D_INT_4 + D_INT_5 + D_INT_6 if i >= "2022-03-01" and i <= "2024-12-31"]
print(*ext, sep="\n")
Example of data:
NAME,ADRESS,D_INT_1,D_EXT_1,D_INT_2,D_EXT_2
ALEX,h4n1p8,2020-01-01,2024-01-01,2023-02-02,2020-01-01
What my code will print with that data:
2024-01-01
DESIRED OUTPUT:
Alex, 2024-01-01
As requested by not_speshal
-> data.head().to_dict()
{'EMPL. NO': {0: 5}, "NOM A L'EMPLACEMENT": {0: 'C010 - HOPITAL REGIONAL DE RIMOUSKI/CENTRE SERVEUR OPTILAB'}, 'ADRESSE': {0: '150 AVENUE ROULEAU'}, 'VILLE': {0: 'RIMOUSKI'}, 'PROV': {0: 'QC'}, 'OBJET NO': {0: 67}, "EMPLACEMENT DE L'APPAREIL": {0: 'CHAUFFERIE'}, 'RBQ 2018': {0: nan}, "DESCRIPTION DE L'APPAREIL": {0: 'CHAUDIERE AQUA. A VAPEUR'}, 'MANUFACTURIER': {0: 'MIURA'}, 'DIMENSIONS': {0: nan}, 'MAWP': {0: 170}, 'SVP': {0: 150}, 'DERNIERE INSP. EXT.': {0: '2019-05-29'}, 'FREQ. EXT.': {0: 12}, 'DERNIERE INSP. INT.': {0: '2020-06-03'}, 'FREQ. INT.': {0: 12}, 'D_EXT_1': {0: '2020-05-29'}, 'D_INT_1': {0: '2021-06-03'}, 'D_EXT_2': {0: '2021-05-29'}, 'D_INT_2': {0: '2022-06-03'}, 'D_EXT_3': {0: '2022-05-29'}, 'D_INT_3': {0: '2023-06-03'}, 'D_EXT_4': {0: '2023-05-29'}, 'D_INT_4': {0: '2024-06-03'}, 'D_EXT_5': {0: '2024-05-29'}, 'D_INT_5': {0: '2025-06-03'}, 'D_EXT_6': {0: '2025-05-29'}, 'D_INT_6': {0: '2026-06-03'}}
Start with
import pandas as pd
cols = [prefix + str(i) for prefix in ['D_EXT_','D_INT_'] for i in range(1,7)]
data = pd.read_csv("Book1.csv")
for col in cols:
    data.loc[:, col] = pd.to_datetime(data.loc[:, col])
Then use
ext = data[
    (
        data.loc[:, cols].ge(pd.to_datetime("2022-03-01"))
        & data.loc[:, cols].le(pd.to_datetime("2024-12-31"))
    ).any(axis=1)
]
EDIT: while it's not clear what date you want if multiple are in the required range, to get what (I understand) you're requesting, use
# assuming
import numpy as np
import pandas as pd
# and
cols = [prefix + str(i) for prefix in ['D_EXT_','D_INT_'] for i in range(1,7)]
ext = data[
    np.concatenate(
        (
            np.setdiff1d(data.columns, cols),
            np.array(
                (
                    data.loc[:, cols].ge(pd.to_datetime("2022-03-01"))
                    & data.loc[:, cols].le(pd.to_datetime("2024-12-31"))
                ).idxmax(axis=1)
            )
        ),
        axis=None
    )
]
where cols is as above
IIUC, try:
columns = ['D_EXT_1', 'D_EXT_2', 'D_EXT_3', 'D_EXT_4', 'D_EXT_5', 'D_EXT_6', 'D_INT_1', 'D_INT_2', 'D_INT_3', 'D_INT_4', 'D_INT_5', 'D_INT_6']
data[columns] = data[columns].apply(pd.to_datetime)
output = data[((data[columns]>="2022-03-01")&(data[columns]<="2024-12-31")).any(axis=1)]
This will return all the rows where any date in the columns list is between 2022-03-01 and 2024-12-31.
It seems that you want to get only rows where at least one of the dates is in the range ["2022-03-01", "2024-12-31"], correct?
First, convert all the date columns to datetime, using DataFrame.apply + pandas.to_datetime.
import pandas as pd
date_cols = ['D_EXT_1', 'D_EXT_2', 'D_EXT_3', 'D_EXT_4', 'D_EXT_5', 'D_EXT_6', 'D_INT_1', 'D_INT_2', 'D_INT_3', 'D_INT_4', 'D_INT_5', 'D_INT_6']
data[date_cols] = data[date_cols].apply(pd.to_datetime)
Then create a 2D boolean mask of all the dates that are in the desired range
is_between_dates = (data[date_cols] >= "2022-03-01") & (data[date_cols] <= "2024-12-31")
# print(is_between_dates) to clearly understand what it represents
Finally, select the rows that contain at least one True value, meaning that there is at least one date in that row that belongs to the date range. This can be achieved using DataFrame.any with axis=1 on the 2D boolean mask, is_between_dates.
# again, print(is_between_dates.any(axis=1)) to see
data = data[is_between_dates.any(axis=1)]
Use melt to reformat your dataframe to be easily searchable:
df = pd.read_csv('Book1.csv').melt(['NAME', 'ADRESS']) \
       .astype({'value': 'datetime64[ns]'}) \
       .query("'2022-03-01' <= value & value <= '2024-12-31'")
At this point your dataframe looks like:
>>> df
NAME ADRESS variable value
1 ALEX h4n1p8 D_EXT_1 2024-01-01
2 ALEX h4n1p8 D_INT_2 2023-02-02
Now it's easy to get a NAME for a date:
>>> df.loc[df['value'] == '2024-01-01', 'NAME']
1 ALEX
Name: NAME, dtype: object
# OR
>>> df.loc[df['value'] == '2024-01-01', 'NAME'].tolist()
['ALEX']
I have a dictionary which has integer keys and string values. The keys go up to a number N but include gaps. Is there an effective way of filling all the gaps up to a specific number? (Numbering starts at 1, not 0.)
Example:
{1: "fdkh", 3: "wnww", 4: "fdngfne", 5: "wqiw", 7: "sdfsdf"}
N = 9
The result should be:
{1: "fdkh", 3: "wnww", 4: "fdngfne", 5: "wqiw", 7: "sdfsdf", 2: "placeholder", 6: "placeholder", 8: "placeholder", 9: "placeholder"}
Of course I can loop manually through it, but is there a smarter way to do that?
One quick way to do it (which does admittedly involve a bit of looping) is
mydict = dict.fromkeys(range(1, N + 1), "placeholder") | {  # dict union (|) requires Python 3.9+
    1: "fdkh", 3: "wnww", 4: "fdngfne", 5: "wqiw", 7: "sdfsdf"}
Though I suspect you might be reaching for collections.defaultdict:
from collections import defaultdict

mydict = defaultdict(lambda: "placeholder", {
    1: "fdkh", 3: "wnww", 4: "fdngfne", 5: "wqiw", 7: "sdfsdf"})