I have a large data frame and I need to find out when certain resources become available for the first time. Let me explain with my code.
df1 = df[df['Resource_ID'] == 1348]
df1 = df1[['Format', 'Range_Start', 'Number']]
df1["Range_Start"] = df1["Range_Start"].str[:7]
df1 = df1.groupby(['Format', 'Range_Start'], as_index=True).last()
pd.options.display.float_format = '{:,.0f}'.format
df1 = df1.unstack()
df1.columns = df1.columns.droplevel()
df2 = df1[1:4].sum(axis=0)
df2.name = 'sum'
df2 = df1.append(df2)
df3 = df2.T[['entry', 'sum']].copy()
df3.index = pd.to_datetime(df3.index)
Now print(df3.first('1D')) gives the following output:
Format       entry  sum
Range_Start
2011-07-01      97   72
I can now see that Resource_ID 1348 first occurs on 2011-07-01. How do I extract only the year from this information?
This is my sample input csv data:
Access_Stat_ID,Resource_ID,Range_Start,Range_End,Name,Format,Number,Matched_URL
1,15,"2009-03-01 00:00:00","2009-03-31 23:59:59","Mar 2009","entry",3,""
203,13,"2009-04-01 00:00:00","2009-04-30 23:59:59","Apr 2009","entry",18,""
204,13,"2009-04-01 00:00:00","2009-04-30 23:59:59","Apr 2009","pdf",7,""
It seems you need:
first_year = df3.index.year[0]
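As a small self-contained sketch of the same idea: the toy df3 below is made up for illustration (it is not the asker's data) and only assumes that the index is a DatetimeIndex, as it is after pd.to_datetime. Taking the minimum of the index first means the result does not depend on row order.

```python
import pandas as pd

# Toy stand-in for df3: monthly rows indexed by a DatetimeIndex (values are invented)
df3 = pd.DataFrame(
    {'entry': [120, 97], 'sum': [80, 72]},
    index=pd.to_datetime(['2011-08', '2011-07'], format='%Y-%m'),
)

first_date = df3.index.min()   # earliest timestamp, regardless of row order
first_year = first_date.year   # plain Python int
print(first_year)
```

When the index is already sorted in ascending order, df3.index.year[0] gives the same value.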
I am trying to create a dynamic pandas dataframe based on the number of records read, where each record becomes a column.
My approach has been a loop of the form "for i = 1 to N", where N comes from the data read (in string format), creating one column per iteration. This does not work properly. I have tried some alternatives without good results: I only get the last record that was read.
Here is my attempt:
def funct_example(client):
    documents = [ v_document ]
    poller = client.begin_analyze_entities(documents)
    result = poller.result()
    docs = [doc for doc in result if not doc.is_error]
    i = 1
    df_final = pd.DataFrame()
    for idx, doc in enumerate(docs):
        for entity in doc.entities:
            for i in doc.entities:
                d = {'col' + i : [format(entity.text)]}
                df = pd.DataFrame(data=d)
                df_final = pd.concat([df_final, df], axis=1)
                display(df_final)
                i = i + 1
funct_example(client)
What alternative do you recommend?
SOLUTION:
for idx, doc in enumerate(docs):
    for entity in doc.entities:
        name = 'col' + str(i)
        d = {name : [format(entity.text)]}
        df = pd.DataFrame(data=d)
        df_final = pd.concat([df_final, df], axis=1)
        i = i + 1
display(df_final)
Thank you!
This is because df is reassigned on each iteration, so only the last record survives.
Here is one way to accomplish it.
Declare an empty DataFrame before the start of the for loop:
df_final = pd.DataFrame()
Then add this line right after you create df with df = pd.DataFrame(data=d):
df_final = pd.concat([df_final, df], axis=1)
This appends each new column to df_final.
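A variation on the same idea, as a minimal sketch: instead of concatenating inside the loop, collect one column per record in a dict and build the DataFrame once at the end. The entity_texts list is a hypothetical stand-in for the values of entity.text returned by the service; nothing here calls the real client.

```python
import pandas as pd

# Hypothetical stand-in for the entity texts read from the analysis result
entity_texts = ['Microsoft', 'Seattle', 'Bill Gates']

# One column per record: build the name -> value mapping first...
columns = {f'col{i}': [text] for i, text in enumerate(entity_texts, start=1)}

# ...then create the one-row frame in a single step
df_final = pd.DataFrame(columns)
print(df_final)
```

Building the frame once avoids creating a new intermediate DataFrame on every iteration, which matters when the number of records is large.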
Find the total 'awarded_amt' for each keyword in a list of keywords.
The 'awarded_amt' column has more than 20,000 numeric values. Another column in the dataframe, 'description', holds strings. We need to find each keyword in 'description' and sum 'awarded_amt' over all rows whose description contains that keyword. I am required to use str.contains for the solution. Please help!
keywords_list = ['insurance','aircon','food','event management','transport','furniture']
Output (in DataFrame format):
My incorrect code:
keywords_ser = pd.Series(['insurance','aircon','food','event management','transport','furniture'])
keywords_df = pd.DataFrame(keywords_ser,columns=['keywords'])
keywords_list = ['insurance','aircon','food','event management','transport','furniture']
for keyword in keywords_list:
    pd.options.display.float_format = '{:,.0f}'.format
    bool_finding = gebiz_df['tender_description'].str.contains(keyword)
    total_df = gebiz_df[['awarded_amt']][bool_finding].sum()
    df = pd.concat([keywords_df,total_df])
    display(df)
Wrong output:
Amended code (the totals are correct now, but the output format is still not right):
keywords_ser = pd.Series(['insurance','aircon','food','event management','transport','furniture'])
keywords_df = pd.DataFrame(keywords_ser,columns=['keywords'])
keywords_list = ['insurance','aircon','food','event management','transport','furniture']
for keyword in keywords_list:
    pd.options.display.float_format = '{:,.0f}'.format
    sum_ser = gebiz_df['awarded_amt'][gebiz_df['tender_description'].str.lower().str.contains(keyword)]
    sum_df = pd.DataFrame(sum_ser,columns=['awarded_amt'])
    df1 = pd.DataFrame(sum_df.sum(),columns=['awarded_amt'])
    df2 = pd.concat([keywords_df,df1])
    display(df2)
Revised output (instead of one DataFrame, it yielded six DataFrames):
Finally, I managed to crack the code!
pd.options.display.float_format = '{:,.0f}'.format
keywords_list = ['insurance','aircon','food','event management','transport','furniture']
keywords_df = pd.DataFrame(keywords_list,columns=['keywords'])
sum_list = []
for keyword in keywords_list:
    sum_ser = gebiz_df['awarded_amt'][gebiz_df['tender_description'].str.lower().str.contains(keyword)].sum()
    sum_list.append(sum_ser)
sum_df = pd.DataFrame(sum_list, columns=['awarded_amt'] )
final_df = pd.concat([keywords_df,sum_df], axis=1)
display(final_df)
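For comparison, here is a more compact sketch of the same str.contains approach. The small gebiz_df below is invented for illustration; only the column names 'tender_description' and 'awarded_amt' come from the code above. Passing case=False replaces the explicit str.lower(), regex=False treats each keyword as literal text, and na=False makes rows with a missing description count as non-matches.

```python
import pandas as pd

# Invented sample data; only the column names follow the original code
gebiz_df = pd.DataFrame({
    'tender_description': ['Group Insurance renewal', 'Aircon servicing',
                           'Food catering and transport', None],
    'awarded_amt': [1000.0, 250.0, 400.0, 75.0],
})
keywords_list = ['insurance', 'aircon', 'food', 'transport']

# Sum awarded_amt over the rows whose description contains each keyword
totals = {
    kw: gebiz_df.loc[
        gebiz_df['tender_description'].str.contains(kw, case=False, regex=False, na=False),
        'awarded_amt',
    ].sum()
    for kw in keywords_list
}

final_df = pd.DataFrame({'keywords': list(totals), 'awarded_amt': list(totals.values())})
print(final_df)
```

Note that a row matching several keywords (the third row here) is counted once under each of them, just as in the loop version.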
I'm trying to subtract one data frame from another. With the data in my current Excel files every result should be 0 or blank, although in the future the results may be 0, 1, 2, or blank. While some results are 0 or blank, I'm also getting -1 and 1. Any help would be appreciated.
The two Excel sheets are identical except for number changes in second column.
Example
ExternalId  TotalInteractions
name1       1
name2       2
name3       2
name4       1
Both sheets look like the example, and the output should look the same. I just need the difference between the two sheets.
def GCList():
    df1 = pd.read_excel('NewInter.xlsx')
    df2 = pd.read_excel('PrevInter.xlsx')
    df3 = df1['ExternalId']
    df4 = df1['TotalInteractions']
    df5 = df2['TotalInteractions']
    df6 = df4.sub(df5)
    frames = (df3, df6)
    df = pd.concat(frames, axis = 1)
    df.to_excel('GCList.xlsx')
GCList()
I managed to partially solve the problem of the unexpected numbers. My problem now is that NewInter has more names than PrevInter, which results in a blank in TotalInteractions next to each new ExternalId. Any idea how to make a blank take the value from NewInter instead?
def GCList():
    df1 = pd.read_excel('NewInter.xlsx')
    df2 = pd.read_excel('PrevInter.xlsx')
    df3 = pd.merge(df1, df2, on = 'ExternalId', how = 'outer')
    df4 = df3['TotalInteractions_x']
    df5 = df3['TotalInteractions_y']
    df6 = df3['ExternalId']
    df7 = df4 - df5
    frames = [df6,df7]
    df = pd.concat(frames, axis = 1)
    df.to_excel('GCList.xlsx')
GCList()
Figured out the issues. First, the dataframes needed to be merged for the subtraction to work, because they are not the same size. I also had to add fill_value = 0 so that new names take their value from the new file.
def GCList():
    df1 = pd.read_excel('NewInter.xlsx')
    df2 = pd.read_excel('PrevInter.xlsx')
    df3 = pd.merge(df1, df2, on = 'ExternalId', how = 'outer')
    df4 = df3['TotalInteractions_x']
    df5 = df3['TotalInteractions_y']
    df6 = df3['ExternalId']
    df7 = df4.sub(df5, fill_value = 0)
    frames = [df6,df7]
    df = pd.concat(frames, axis = 1)
    df.to_excel('GCList.xlsx')
GCList()
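The same alignment can also be sketched without an explicit merge, by indexing both columns on ExternalId and letting Series.sub align them. The two small frames below are invented stand-ins for the Excel sheets, and the sketch assumes each ExternalId appears at most once per sheet.

```python
import pandas as pd

# Invented stand-ins for NewInter.xlsx and PrevInter.xlsx
new = pd.DataFrame({'ExternalId': ['name1', 'name2', 'name3'],
                    'TotalInteractions': [1, 2, 2]})
prev = pd.DataFrame({'ExternalId': ['name1', 'name2'],
                     'TotalInteractions': [1, 1]})

# Index by ExternalId so the subtraction matches names, not row positions
new_s = new.set_index('ExternalId')['TotalInteractions']
prev_s = prev.set_index('ExternalId')['TotalInteractions']

# fill_value=0 treats a name missing from one sheet as having 0 interactions there
diff = new_s.sub(prev_s, fill_value=0).rename('TotalInteractions').reset_index()
print(diff)
```

Matching on the name rather than on row position is what removes the unexpected -1 and 1 values when the two sheets list names in a different order or have different lengths.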
Happy 2020! I would like to create a dataframe based on two others. I have the below two dataframes:
df1 = pd.DataFrame({'date':['03.05.1982','04.05.1982','05.05.1982','06.05.1982','07.05.1982','10.05.1982','11.05.1982'],'A': [63.63,64.08,64.19,65.11,65.36,65.25,65.36], 'B': [63.83, 64.10, 64.19, 65.08, 65.33, 65.28, 65.36], 'C':[63.99, 64.22, 64.30, 65.16, 65.41, 65.36, 65.44]})
df2 = pd.DataFrame({'Name':['A','B','C'],'Notice': ['05.05.1982','07.05.1982','12.05.1982']})
The idea is to create df3 such that it takes the values of A until A's notice date (found in df2) is reached, then switches to the values of B until B's notice date is reached, and so on. On a notice date, it should take the mean of the current column and the next one.
In the above example, df3 should be as follows (with formulas to illustrate):
df3 = pd.DataFrame({'date':['03.05.1982','04.05.1982','05.05.1982','06.05.1982','07.05.1982','10.05.1982','11.05.1982'], 'Result':[63.63,64.08,(64.19+64.19)/2,65.08,(65.33+65.41)/2,65.36,65.44]})
My idea was to first create a temporary dataframe with same dimensions as df1 and to fill it with 1's when the index date is prior to notice and 0's after. Doing a rolling mean with window 1 would give for each column a series of 1 until I reach 0.5 (signalling a switch).
Not sure if there is a better way to get df3?
I tried the following:
def fill_rule(df_p,df_t):
    return np.where(df_p.index > df_t[df_t.Name==df_p.name]['Notice'][0], 0, 1)
df1['date'] = pd.to_datetime(df1['date'])
df2['Notice'] = pd.to_datetime(df2['Notice'])
df1.set_index("date", inplace = True)
temp = df1.apply(lambda x: fill_rule(x, df2), axis = 0)
And I got the following error: KeyError: (0, 'occurred at index B')
df1['t'] = df1['date'].map(df2.set_index(["Notice"])['Name'])
df1['t'] =df1['t'].fillna(method='bfill').fillna("C")
df3 = pd.DataFrame()
df3['Result'] = df1.apply(lambda row: row[row['t']],axis =1)
df3['date'] = df1['date']
You can use the between method to select the specific date ranges in both dataframes and then use iloc to substitute the specific values
#Initializing the output
df3 = df1.copy()
df3.drop(['B','C'], axis = 1, inplace = True)
df3.columns = ['date','Result']
df3['Result'] = 0.0
df3['count'] = 0
#Modifying df2 to add a dummy sample at the beginning
temp = df2.copy()
temp = temp.iloc[0]
temp = pd.DataFrame(temp).T
temp.Name ='Z'
temp.Notice = pd.to_datetime("05-05-1980")
df2 = pd.concat([temp,df2])
for i in range(len(df2)-1):
    startDate = df2.iloc[i]['Notice']
    endDate = df2.iloc[i+1]['Notice']
    name = df2.iloc[i+1]['Name']
    # On pandas 1.3 and later, pass inclusive="both" instead of inclusive=True
    indices = df1.date.between(startDate, endDate, inclusive=True)
    df3.loc[indices,'Result'] += df1[indices][name]
    df3.loc[indices,'count'] += 1
df3.Result = df3.apply(lambda x : x.Result/x['count'], axis = 1)
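A vectorized sketch of the rule stated in the question, using the question's own df1 and df2. It assumes the notice dates in df2 are sorted in ascending order and that dates are day-first (DD.MM.YYYY). For each date it looks up the first contract whose notice date has not yet passed, and on a notice date it averages that column with the next one. What should happen on the last contract's notice date is not specified in the question; here the last column is simply averaged with itself.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'date': ['03.05.1982', '04.05.1982', '05.05.1982', '06.05.1982',
             '07.05.1982', '10.05.1982', '11.05.1982'],
    'A': [63.63, 64.08, 64.19, 65.11, 65.36, 65.25, 65.36],
    'B': [63.83, 64.10, 64.19, 65.08, 65.33, 65.28, 65.36],
    'C': [63.99, 64.22, 64.30, 65.16, 65.41, 65.36, 65.44],
})
df2 = pd.DataFrame({'Name': ['A', 'B', 'C'],
                    'Notice': ['05.05.1982', '07.05.1982', '12.05.1982']})

dates = pd.to_datetime(df1['date'], format='%d.%m.%Y')
notices = pd.to_datetime(df2['Notice'], format='%d.%m.%Y')
names = df2['Name'].tolist()
last = len(names) - 1

# Position of the first notice date that is >= each row's date
pos = np.searchsorted(notices.to_numpy(), dates.to_numpy(), side='left')
pos = np.minimum(pos, last)          # after the last notice, stay on the last column

values = df1[names].to_numpy()
rows = np.arange(len(df1))
current = values[rows, pos]
following = values[rows, np.minimum(pos + 1, last)]

# On a notice date take the mean of the current and the next column
on_notice = dates.isin(notices).to_numpy()
result = np.where(on_notice, (current + following) / 2, current)

df3 = pd.DataFrame({'date': df1['date'], 'Result': result})
print(df3)
```

With the sample data this reproduces the values listed in the question's expected df3.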
I am web-scraping tables from a website and writing them to an Excel file.
My goal is to split one column into two columns in the correct way.
The column I want to split is "FLIGHT".
I want this form:
First example: KL744 --> KL and 0744
Second example: BE1013 --> BE and 1013
So I need to separate the first 2 characters (into the first column) from the remaining characters, of which there can be 1 to 4. If there are 4, I keep them as they are; if there are 3, I want to put a 0 in front; if there are 2, I want to put 00 in front. My goal is to always have 4 characters in the second column.
How can I do this?
Here is my relevant code, which already contains some formatting code.
df2 = pd.DataFrame(datatable,columns = cols)
df2["UPLOAD_TIME"] = datetime.now()
mask = np.column_stack([df2[col].astype(str).str.contains(r"Scheduled", na=True) for col in df2])
df3 = df2.loc[~mask.any(axis=1)]
if os.path.isfile("output.csv"):
    df1 = pd.read_csv("output.csv", sep=";")
    df4 = pd.concat([df1,df3])
    df4.to_csv("output.csv", index=False, sep=";")
else:
    df3.to_csv("output.csv", index=False, sep=";")
Here is a screenshot of my table in Excel:
You can use indexing with str with zfill:
df = pd.DataFrame({'FLIGHT':['KL744','BE1013']})
df['a'] = df['FLIGHT'].str[:2]
df['b'] = df['FLIGHT'].str[2:].str.zfill(4)
print (df)
   FLIGHT   a     b
0   KL744  KL  0744
1  BE1013  BE  1013
I believe your code needs:
df2 = pd.DataFrame(datatable,columns = cols)
df2['a'] = df2['FLIGHT'].str[:2]
df2['b'] = df2['FLIGHT'].str[2:].str.zfill(4)
df2["UPLOAD_TIME"] = datetime.now()
...
...
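If the flight codes might not all follow the "2 characters, then digits" pattern, a regex-based variant can make that assumption explicit. This is only a sketch: the column names airline and number and the third sample value are made up for illustration, and the pattern assumes a two-character prefix followed by one to four digits. Rows that do not match the pattern come back as missing values instead of being split silently.

```python
import pandas as pd

df = pd.DataFrame({'FLIGHT': ['KL744', 'BE1013', 'AF12']})

# Named groups become the column names of the extracted frame
parts = df['FLIGHT'].str.extract(r'^(?P<airline>.{2})(?P<number>\d{1,4})$')

df['airline'] = parts['airline']
df['number'] = parts['number'].str.zfill(4)   # left-pad the digits to 4 characters
print(df)
```

The padded numbers are strings. If the CSV is read back later, pd.read_csv will parse a column such as number as integers and drop the leading zeros unless it is told otherwise, for example with dtype={'number': str}.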