I have a CSV file that was created using Pandas. Below is the output from the following code:
test = pd.read_csv('order.csv', header=0)
print(test.head())
   3  16258878505032
0  3  16258876670024
1  3  16258876899400
2  3  16258876997704
The only data I need processed is the information in the 2nd and 3rd columns. This is purchase order data where the 2nd column represents a "quantity" and the 3rd column represents the "sku".
I need to take row 1, col 2 and inject it into an input field using selenium, then take row 1, col 3 and use it to select a sku on the webpage. After adding the item to the cart, I loop back and process row 2, row 3, etc.
I know how to write the selenium code to perform the web-based actions, but I'm not sure how to write the pandas/python code to iterate through the CSV file one row at a time and pull those values out. My logic would be the following:
read order.csv
get quantity value and sku value for row (one row at the time)
visit website, inject quantity value
remain on website, select sku
add to cart
repeat loop until no more rows to process
Thanks for your help.
First, use the names parameter in read_csv to avoid converting the first row of data into column names:
test = pd.read_csv('order.csv', names=['quantity','sku'])
print (test)
quantity sku
0 3 16258878505032
1 3 16258876670024
2 3 16258876899400
3 3 16258876997704
Because you are working with selenium and the web, you can use DataFrame.iterrows or another loop-based solution:
def func(x):
    q = x['quantity']
    sku = x['sku']
    print (q, sku)
    #add selenium code

test.apply(func, axis=1)
Or:
for i, row in test.iterrows():
    q = row['quantity']
    sku = row['sku']
    print (q, sku)
    #add selenium code
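As a rough end-to-end sketch, the loop could drive Selenium directly; the URL, element locators and add-to-cart button below are placeholders, not taken from the question:
from selenium import webdriver
from selenium.webdriver.common.by import By
import pandas as pd

test = pd.read_csv('order.csv', names=['quantity', 'sku'])

driver = webdriver.Chrome()
for i, row in test.iterrows():
    driver.get('https://example.com/order')            # placeholder URL
    qty_box = driver.find_element(By.ID, 'quantity')   # placeholder locator
    qty_box.clear()
    qty_box.send_keys(str(row['quantity']))            # inject quantity value
    sku_box = driver.find_element(By.ID, 'sku')        # placeholder locator
    sku_box.send_keys(str(row['sku']))                 # select sku
    driver.find_element(By.ID, 'add-to-cart').click()  # placeholder add-to-cart button
driver.quit()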
I need to identify different groups in Excel files and the rows inside these groups (to be more precise, I need to get the value of the first cell of the main row under which other rows are grouped).
Below is an example of the file structure (I've minimized the groups, but when I receive these files they are expanded):
I know how to create new groups using openpyxl or xlwt, and I'm familiar with both openpyxl and xlrd, but I'm unable to find anything in the API to solve this requirement.
So, is it possible using Python, and if so, which part of the openpyxl or xlrd API should I use?
You should be able to do this using the worksheet's row_dimensions. This returns an object accessible like a dict where the keys are the row numbers of the sheet. outline_level will have a non-zero value for each depth of grouping, or 0 if the row is not part of a group.
So, if you had a sheet where rows 2 and 3 were a group, and rows 5 and 6 were another group, iterating through row_dimensions would look like this:
>>> for row in range(ws.min_row, ws.max_row + 1):
... print(f"row {row} is in group {ws.row_dimensions[row].outline_level}")
...
row 1 is in group 0
row 2 is in group 1
row 3 is in group 1
row 4 is in group 0
row 5 is in group 1
row 6 is in group 1
I should point out that there's some weirdness with accessing the information. My original solution was this:
>>> for row_num, row_data in ws.row_dimensions.items():
... print(f"row {row_num} is group {row_data.outline_level}")
...
row 2 is group 1
row 3 is group 1
row 4 is group 0
row 5 is group 1
row 6 is group 1
Notice that row 1 is missing. It wasn't part of row_dimensions until I manually accessed it as row_dimensions[1] and then it appeared. I don't know how to explain that, but the first approach is probably better as it specifically iterates from the first to last row.
The same process applies to column groups through column_dimensions, except that it must be keyed using column letter(s), e.g. ws.column_dimensions["A"].outline_level.
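A minimal self-contained sketch that ties this together (the filename is assumed, and it assumes each main row sits at outline level 0 directly above its grouped rows):
from openpyxl import load_workbook

wb = load_workbook('grouped.xlsx')   # assumed filename
ws = wb.active

# Collect the first cell of each main row: a row at outline level 0
# that is immediately followed by rows at a deeper outline level.
group_headers = []
for row in range(ws.min_row, ws.max_row):
    level = ws.row_dimensions[row].outline_level
    next_level = ws.row_dimensions[row + 1].outline_level
    if level == 0 and next_level > 0:
        group_headers.append(ws.cell(row=row, column=1).value)

print(group_headers)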
I am trying to create a dataframe from three lists which I have generated using webscraped data. However, when I try to turn these lists into dictionaries and then use them to build my pandas dataframe, it outputs a dataframe for each dictionary item (row) rather than one dataframe containing all of these items as rows.
I believe the issue lies in the for loop that I have used to webscrape the data. I know similar questions have been asked on this one, including here Pandas DataFrame created for each row and here Take multiple lists into dataframe but I have tried the solutions without any joy. I believe the webscrape loop adds a nuance that makes this more tricky.
Step by step walkthrough of my code and the output are below, for reference I have imported pandas as pd and bs4.
# Step 1 create a webscraper which takes three sets of data (price, bedrooms and bathrooms) from a website and populate into three separate lists
for container in containers:
    try:
        price_container = container.find("a", {"class": "listing-price text-price"})
        price_strip = price_container.text.strip()
        price_list = []
        price_list.append(price_strip)
    except TypeError:
        continue
    try:
        bedroom_container = container.find("span", {"class": "icon num-beds"})
        bedroom_strip = bedroom_container["title"]
        bedroom_list = []
        bedroom_list.append(bedroom_strip)
    except TypeError:
        continue
    try:
        bathroom_container = container.find("span", {"class": "icon num-baths"})
        bathroom_strip = bathroom_container["title"]
        bathroom_list = []
        bathroom_list.append(bathroom_strip)
    except TypeError:
        continue

    # Step 2 create a dictionary
    data = {'price': price_list, 'bedrooms': bedroom_list, 'bathrooms': bathrooms_list}

    # Step 3 turn it into a pandas dataframe and print the output
    d = pd.DataFrame(data)
    print(d)
This gives me a dataframe for each dictionary as below.
price bedrooms bathrooms
0 £200,000 3 2
[1 rows x 3 columns]
price bedrooms bathrooms
0 £400,000 5 3
[1 rows x 3 columns]
prices bedrooms bathrooms
0 £900,000 6 4
[1 rows x 3 columns]
and so on.....
I've tried dictionary comprehension and list comprehension, to give me one dataframe rather than a dataframe for each dictionary item:
data = [({'price':price, 'bedrooms':bedrooms, 'bathrooms':bathrooms}) for item in container]
df = pd.DataFrame(data)
print(df)
and, however I write the list comprehension, this yields an even weirder output: it gives me a dataframe for each item in the dictionary, with the same row of information repeated several times:
price bedrooms bathrooms
0 £200,000 3 2
0 £200,000 3 2
0 £200,000 3 2
[3 rows x 3 columns]
price bedrooms bathrooms
0 £400,000 5 3
0 £400,000 5 3
0 £400,000 5 3
[3 rows x 3 columns]
price bedrooms bathrooms
0 £900,000 6 4
0 £900,000 6 4
0 £900,000 6 4
[1 rows x 3 columns]
and so on...
How do I resolve this problem and get all of my data into one pandas dataframe?
Firstly, you should do price_list=[], bedroom_list=[] and bathroom_list=[] before your for loop; otherwise each list is at most one element long, because on every iteration it is reset to [] and then appended with a single element. Secondly, if you want a single dataframe, you should create it outside the for loop, i.e. dedent data = {'price':price_list, 'bedrooms':bedroom_list, 'bathrooms':bathrooms_list} and the following lines. Finally, you should mark missing data: if any continue other than the first one is executed, your price_list, bedroom_list and bathroom_list will end up with different lengths. I suggest replacing the first continue with price_list.append(None), the second with bedroom_list.append(None) and the third with bathroom_list.append(None), so you have a clear indication in your dataframe of where data is missing.
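Putting that advice together, a sketch of the reworked loop might look like this (the find() calls are copied from the question, containers is assumed to be the bs4 result set from the original scrape, and AttributeError is caught as well since a missing element can raise either):
import pandas as pd

price_list = []
bedroom_list = []
bathroom_list = []

for container in containers:
    try:
        price_container = container.find("a", {"class": "listing-price text-price"})
        price_list.append(price_container.text.strip())
    except (TypeError, AttributeError):
        price_list.append(None)     # missing price
    try:
        bedroom_container = container.find("span", {"class": "icon num-beds"})
        bedroom_list.append(bedroom_container["title"])
    except (TypeError, AttributeError):
        bedroom_list.append(None)   # missing bedrooms
    try:
        bathroom_container = container.find("span", {"class": "icon num-baths"})
        bathroom_list.append(bathroom_container["title"])
    except (TypeError, AttributeError):
        bathroom_list.append(None)  # missing bathrooms

# Build the dictionary and the dataframe once, after the loop has finished
data = {'price': price_list, 'bedrooms': bedroom_list, 'bathrooms': bathroom_list}
df = pd.DataFrame(data)
print(df)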
The code part you're testing here is fine: a dictionary of lists will always return a single dataframe. So this part:
pd.DataFrame(data)
can't be the cause of the problem. Instead, the problem is that it's buried inside a loop, so it runs three times. The same goes for your lists, which are being redefined over and over again.
Take those parts out of the loop, and you should be ok.
You have to merge the three lists:
df = pd.DataFrame(data["price"] + data["bedrooms"] + data["bathrooms"])
If you want something more generic:
list_ = [item for i in data for item in data[i]]
df = pd.DataFrame(list_)
I am trying to add rows where there is a gap between month_count. For example, row 0 has month_count = 0 and row 1 has month_count = 7. How can I add extra 6 rows with month counts being 1,2,3,4,5,6? Also, same situation from row 3 to row 4. I would like to add 2 extra rows with month_count 10 and 11. What is the best way to go about this?
One way to do this would be to iterate over all of the rows and re-build the DataFrame with the missing rows inserted. Pandas does not support the direct insertion of rows at an index, however you can hack together a solution using pd.concat():
def pandas_insert(df, idx, row_contents):
    top = df.iloc[:idx]
    bot = df.iloc[idx:]
    inserted = pd.concat([top, row_contents, bot], ignore_index=True)
    return inserted
Here row_contents should be a DataFrame with one (or more) rows. We use ignore_index=True to update the index of the new DataFrame to be labeled 0,1, …, n-2, n-1
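A hedged usage sketch using the pandas_insert helper above; the toy data and the value column are assumed for illustration, mirroring the gaps described in the question:
import pandas as pd

# Assumed toy data: month_count jumps from 0 to 7 and from 9 to 12
df = pd.DataFrame({'month_count': [0, 7, 8, 9, 12],
                   'value': [10, 20, 30, 40, 50]})

i = 1
while i < len(df):
    gap = df.loc[i, 'month_count'] - df.loc[i - 1, 'month_count']
    if gap > 1:
        # Build the missing month_count rows (other columns stay NaN) and insert them
        missing = pd.DataFrame({'month_count': range(df.loc[i - 1, 'month_count'] + 1,
                                                     df.loc[i, 'month_count'])})
        df = pandas_insert(df, i, missing)
        i += len(missing)
    i += 1

print(df)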
I have read all the answers related to my question available on Stack Overflow, but my question is a little different from the available answers. I have a very large dataframe, and a portion of it is shown below.
The input dataframe looks like this:
A B C D
0 foot 17/1: OGChan_2020011717711829281829281 , 7days ...
1 arm this will processed after ;;;
2 leg go_2020011625692400374400374 16/1: Id Imerys_2020011618188744093744093
3 head xyziemen_2020011510691787006787006 en_2020011510749462801462801 ;;;
: : : :
In this dataframe, I first extract IDs from column B based on a regular expression. Some rows of column B may contain these IDs, some may not, and some rows of column B may be blank. Following is the code:
import re
import pandas as pd

df = pd.read_excel("Book1.xlsx", "Sheet1")
dict = {}
for i in df.index:
    j = str(df['B'][i])
    if re.findall(r'_\d{25}', j):
        a = re.findall(r'_\d{25}', j)
        print(a)
        dict[i] = a
The regular expression matches an underscore followed by 25 digits. Examples in the above df are _2020011618188744093744093, _2020011510749462801462801, etc.
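For reference, the same extraction can also be written without an explicit loop; a vectorized sketch, assuming column B holds text:
ids_per_row = df['B'].astype(str).str.findall(r'_\d{25}')   # one list of matches per row
print(ids_per_row)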
Now I want to insert these IDs into column D on particular rows. For example, if two IDs are found at the 0th row, then the first ID should be inserted into the 0th row of column D and the second ID into the 1st row of column D, with the rest of the dataframe content shifted down. What I want will be clear from the following output, which is what I want based on the above input.
A B .. D
0 foot 17/1: OGChan_2020011717711829281829281 ,7days _2020011717711829281829281
1 arm this will processed after
2 leg go_2020011625692400374400374 16/1: _2020011625692400374400374
Id Imerys_2020011618188744093744093
3 _2020011618188744093744093
4 head xyziemen_2020011510691787006787006 _2020011510691787006787006
en_2020011510749462801462801
5 _2020011510749462801462801
: : : :
In the above output, one ID is found at the 0th row, so column D of the 0th row contains that ID. No ID is found at the 1st index, so column D of the 1st index is empty. At the 2nd index there are two IDs, so the first ID is placed on the 2nd row of column D and the second ID is placed on the 3rd row of column D, which shifts the previous content of the 3rd row down to the 4th row. I want the above output as my final output.
Hope I am clear. Thanks in advance.
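A minimal sketch of one way to build that layout, assuming columns A, B, C and D as shown above; each extra ID gets its own new row with the other columns left blank:
import re
import pandas as pd

new_rows = []
for _, row in df.iterrows():
    ids = re.findall(r'_\d{25}', str(row['B']))
    if not ids:
        new_rows.append({'A': row['A'], 'B': row['B'], 'C': row['C'], 'D': ''})
    else:
        # The first ID stays on the original row; every extra ID gets a new blank row
        new_rows.append({'A': row['A'], 'B': row['B'], 'C': row['C'], 'D': ids[0]})
        for extra in ids[1:]:
            new_rows.append({'A': '', 'B': '', 'C': '', 'D': extra})

result = pd.DataFrame(new_rows)
print(result)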
I am trying to pivot a pandas dataframe, but the data follows a strange format that I cannot seem to pivot. The data is structured as below:
Date, Location, Action1, Quantity1, Action2, Quantity2, ... ActionN, QuantityN
<date> 1 Lights 10 CFloor 1 ... Null Null
<date2> 2 CFloor 2 CWalls 4 ... CBasement 15
<date3> 2 CWalls 7 CBasement 4 ... NUll Null
Essentially, each action will always have a quantity attached to it (which may be 0), but null actions will never have a quantity (the quantity will just be null). The format I am trying to achieve is the following:
Lights CFloor CBasement CWalls
1 10 1 0 0
2 0 2 19 11
The index of the rows becomes the location, while the columns become any unique action found across the multiple activity columns. When pulling the data together, the value of each row/column is the sum of each quantity associated with the action (i.e. Action1 corresponds to Quantity1). Is there a way to do this with the native pandas pivot function?
My current code performs a ravel across all the activity columns to get a list of all unique activities. It will also grab all the unique locations from the Location column. Once I have the unique columns, I create an empty dataframe and fill it with zeros:
Lights CFloor CBasement CWalls
1 0 0 0 0
2 0 0 0 0
I then iterate back over the old data frame with the itertuples() method (I was told it was significantly faster than iterrows()) and populate the new dataframe. This empty dataframe acts as a template that is stored in memory and filled later.
#Creates a template from the dataframe
def create_template(df):
    act_cols = ['Activity01', 'Activity02', 'Activity03', 'Activity04']
    activities = df[act_cols]
    flat_acts = activities.values.ravel('K')
    unique_locations = pd.unique(df['Location'])
    unique_acts = pd.unique(flat_acts)
    pivot_template = pd.DataFrame(index=unique_locations, columns=unique_acts).fillna(0)
    return pivot_template

#Fills the template from the dataframe
def create_pivot(df, pivot_frmt):
    act_cols = ['Activity01', 'Activity02', 'Activity03', 'Activity04']
    quant_cols = ['Quantity01', 'Quantity02', 'Quantity03', 'Quantity04']
    for row in df.itertuples():
        for act, quantity in zip(act_cols, quant_cols):
            act_val = getattr(row, act)
            if pd.notna(act_val):
                quantity_val = getattr(row, quantity)
                location = getattr(row, 'Location')
                pivot_frmt.loc[location, act_val] += quantity_val
    return pivot_frmt
While my solution works, it is incredibly slow when dealing with a large dataset and has taken 10 seconds or more to complete this type of operation. Any help would be greatly appreciated!
After experimenting with various pandas functions, such as melt and pivot on multiple columns simultaneously, I found a solution that worked for me:
For every quantity-activity pair, I build a partial frame of the final dataset and store it in a list. Once every pair has been addressed, I end up with multiple dataframes that all have the same row counts but potentially different column counts. I solved this by concatenating along the columns and, where any columns are repeated, summing them to get the final result.
import numpy as np

def test_pivot(df):
    act_cols = ['Activity01', 'Activity02', 'Activity03', 'Activity04']
    quant_cols = ['Quantity01', 'Quantity02', 'Quantity03', 'Quantity04']
    dfs = []
    for act, quant in zip(act_cols, quant_cols):
        partial = pd.crosstab(index=df['Location'], columns=df[act], values=df[quant], aggfunc=np.sum).fillna(0)
        dfs.append(partial)
    finalDf = pd.concat(dfs, axis=1)
    finalDf = finalDf.groupby(finalDf.columns, axis=1).sum()
    return finalDf
There are two assumptions that I make during this approach:
The indexes maintain their order across all partial dataframes
There are an equivalent number of indexes across all partial dataframes
While this is probably not the most elegant solution, it achieves the desired result and reduced the time it took to process the data by a very significant margin (from roughly 10 s to 0.2 s on ~4k rows). If anybody has a better way to deal with this type of scenario and do the process outlined above in one shot, then I would love to see your response!
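For what it's worth, a sketch of a possible single-pass alternative under the same column-name assumptions: stack each Activity/Quantity pair into one long frame, then pivot once.
import pandas as pd

def one_shot_pivot(df):
    act_cols = ['Activity01', 'Activity02', 'Activity03', 'Activity04']
    quant_cols = ['Quantity01', 'Quantity02', 'Quantity03', 'Quantity04']
    # Stack every Activity/Quantity pair into one long frame, dropping null actions
    pairs = [df[['Location', a, q]].rename(columns={a: 'Activity', q: 'Quantity'})
             for a, q in zip(act_cols, quant_cols)]
    long_df = pd.concat(pairs, ignore_index=True).dropna(subset=['Activity'])
    # A single pivot_table then sums quantities per Location and Activity
    return long_df.pivot_table(index='Location', columns='Activity',
                               values='Quantity', aggfunc='sum', fill_value=0)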