convert scraped list to pandas dataframe using columns and index - python

The process that scrapes data from each URL (looping over all the given links) looks like this:
for url in urls:
    page = requests.get(url)
    # fetch and process the page here, acquiring one car's info per page
    print(car.name)
    print(car_table)
and the output:
BMW
['color','red','weight','50kg','height','120cm','width','200cm','serial','','owner','']
FORD
['color','','weight','','height','','width','','serial','','owner','']
HONDA
['color','blue','weight','60kg','height','','width','160cm','serial','OMEGA','owner','']
At the end, how can I build a dataframe like the one below, given that I don't know the number of car fields (columns) or the number of cars (index) in advance, but want them to end up as the dataframe's columns and index?
print(car_df)
-----|color|weight|height|width|serial|owner
BMW  |red  |50    |120   |200  |      |
FORD |     |      |      |     |      |
HONDA|blue |60    |      |160  |OMEGA |
any help appreciated :)

The approach here is to build a list of dicts as we iterate through the urls, and then after the loop convert that list into a DataFrame. I'm assuming that car_table always alternates column name and value.
import pandas as pd
import numpy as np
#Creating lists from your output instead of requesting from the url since you didn't share that
car_names = ['BMW','FORD','HONDA']
car_tables = [
['color','red','weight','50kg','height','120cm','width','200cm','serial','','owner',''],
['color','','weight','','height','','width','','serial','','owner',''],
['color','blue','weight','60kg','height','','width','160cm','serial','OMEGA','owner',''],
]
urls = range(len(car_names))
all_car_data = []
for url in urls:
    car_name = car_names[url]    # using car_name instead of car.name for this example
    car_table = car_tables[url]  # again, you get this value some other way
    car_data = {'name': car_name}
    columns = car_table[::2]     # starting from 0, skip every other entry to just get the columns
    values = car_table[1::2]     # starting from 1, skip every other entry to just get the values
    # Zip the columns together with the values, then iterate and update the dict
    for col, val in zip(columns, values):
        car_data[col] = val
    # Add the dict to a list to keep track of all the cars
    all_car_data.append(car_data)
# Convert to a dataframe
df = pd.DataFrame(all_car_data)
# df = df.replace({'': np.nan})  # you can use this if you want to replace the '' with NaNs
df
Output:
    name color weight height  width serial owner
0    BMW   red   50kg  120cm  200cm
1   FORD
2  HONDA  blue   60kg         160cm  OMEGA
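If you want the car names as the index, matching the layout in the question, a small follow-up step should do it (a sketch using standard pandas; set_index just promotes the name column to the index):
df = df.set_index('name')          # use the car names as the index
df.index.name = None               # optional: drop the 'name' label from the index
print(df.loc['HONDA', 'serial'])   # e.g. 'OMEGA'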

Related

pandas: add a string to certain values in a comma separated column if values exist in a list

I have a pandas dataframe as follows,
import pandas as pd
import numpy as np
df = pd.DataFrame({'text': ['this is the good student', 'she wears a beautiful green dress',
                            'he is from a friendly family of four', 'the house is empty',
                            'the number four five is new'],
                   'labels': ['O,O,O,ADJ,O', 'O,O,O,ADJ,ADJ,O', 'O,O,O,O,ADJ,O,O,NUM',
                              'O,O,O,O', 'O,O,NUM,NUM,O,O']})
I would like to add a 'B-' prefix to ADJ or NUM if they are not repeated right after, and an 'I-' prefix if there is a repetition. So here is my desired output:
output:
                                   text                   labels
0              this is the good student            O,O,O,B-ADJ,O
1     she wears a beautiful green dress      O,O,O,B-ADJ,I-ADJ,O
2  he is from a friendly family of four  O,O,O,O,B-ADJ,O,O,B-NUM
3                    the house is empty                  O,O,O,O
4           the number four five is new      O,O,B-NUM,I-NUM,O,O
So far I have created a list of unique values as such:
unique_labels = (np.unique(sum(df["labels"].str.split(',').dropna().to_numpy(), []))).tolist()
unique_labels.remove('O') # no changes required for O label
and tried to first add the B label, which gave me an error (ValueError: Must have equal len keys and value when setting with an iterable):
for x in unique_labels:
    df.loc[df["labels"].str.contains(x), "labels"] = ['B-' + x for x in df["labels"]]
Try:
from itertools import groupby
def fn(x):
    out = []
    for k, g in groupby(map(str.strip, x.split(","))):
        if k == "O":
            out.extend(g)
        else:
            out.append(f"B-{next(g)}")
            out.extend([f"I-{val}" for val in g])
    return ",".join(out)

df["labels"] = df["labels"].apply(fn)
print(df)
Prints:
                                   text                   labels
0              this is the good student            O,O,O,B-ADJ,O
1     she wears a beautiful green dress      O,O,O,B-ADJ,I-ADJ,O
2  he is from a friendly family of four  O,O,O,O,B-ADJ,O,O,B-NUM
3                    the house is empty                  O,O,O,O
4           the number four five is new      O,O,B-NUM,I-NUM,O,O
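For clarity, itertools.groupby collapses runs of consecutive equal items into one group, which is exactly what lets fn tag the first element of a run as B- and the rest as I-. A tiny illustration (the tag list here is made up):
from itertools import groupby

tags = ['O', 'NUM', 'NUM', 'O']
print([(key, len(list(group))) for key, group in groupby(tags)])
# [('O', 1), ('NUM', 2), ('O', 1)] -> the first NUM becomes B-NUM, the second I-NUM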

DataFrame filter column

I have the following dataframe x_df. Which city has the 5th highest total number of Walmart stores (super stores and regular stores combined)?
import pandas as pd

data_url = 'https://raw.githubusercontent.com/plotly/datasets/master/1962_2006_walmart_store_openings.csv'
x_df = pd.read_csv(data_url, header=0)
x_df['STRSTATE'].where(x_df['type_store'] == 7)
You can use DataFrame.max() to get the max city count and then get the city name:
x_df = x_df[x_df['city_count'] == x_df['city_count'].max()]
x_df["city_name"]
Edit:
I think something like this is what you want:
data_url = 'https://raw.githubusercontent.com/plotly/datasets/master/1962_2006_walmart_store_openings.csv'
x_df = pd.read_csv(data_url, header=0)
city_store_count = x_df.groupby(['STRCITY']).size().sort_values(ascending = False).to_frame()
city_store_count.columns = ['Stores_in_City']
city_store_count.iloc[4]
The fifth biggest is actually a shared 3rd place with ten stores, so you could print the top 10 for instance:
city_store_count.head(10)
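Since several cities can tie on the same store count, a positional iloc[4] can be misleading; a possible refinement (a sketch, reusing city_store_count from above) is Series.nlargest with keep='all', which keeps every city tied at the cutoff:
# top five counts, keeping all cities tied at the boundary
top = city_store_count['Stores_in_City'].nlargest(5, keep='all')
print(top)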

Compile one DataFrame from a loop sequence of smaller DataFrames

I am looping through a list of 103 FourSquare URLs to find "Coffee Shops."
I can create a DataFrame for each URL and print each DataFrame as I loop through the list (sample output at bottom).
I cannot figure out how to append the DataFrame for each URL into a single DataFrame as I loop through the list. My goal is to compile a single DataFrame from the DataFrames I am printing.
x = 0
while x < 103:
    results = requests.get(URLs[x]).json()

    def get_category_type(row):
        try:
            categories_list = row['categories']
        except:
            categories_list = row['venue.categories']
        if len(categories_list) == 0:
            return None
        else:
            return categories_list[0]['name']

    venues = results['response']['groups'][0]['items']
    nearby_venues = json_normalize(venues)  # flatten JSON
    # filter columns
    filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
    nearby_venues = nearby_venues.loc[:, filtered_columns]
    # filter the category for each row
    nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)
    # clean columns
    nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]
    dfven = nearby_venues.loc[nearby_venues['categories'] == 'Coffee Shop']
    print(x, '!!!', dfven, '\n')
    x = x + 1
Here is some output (I do get complete results):
0 !!! name categories lat lng
5 Tim Hortons Coffee Shop 43.80200 -79.198169
8 Tim Hortons / Esso Coffee Shop 43.80166 -79.199133
1 !!! Empty DataFrame
Columns: [name, categories, lat, lng]
Index: []
2 !!! name categories lat lng
5 Starbucks Coffee Shop 43.770367 -79.186313
18 Tim Hortons Coffee Shop 43.769591 -79.187081
3 !!! name categories lat lng
0 Starbucks Coffee Shop 43.770037 -79.221156
4 Country Style Coffee Shop 43.773716 -79.207027
I apologize if this is bad form or a breach of etiquette but I solved my problem and figured I should post. Perhaps making an effort to state the problem for StackOverflow helped me solve it?
First I learned how to ignore empty DataFrames:
dfven = nearby_venues.loc[nearby_venues['categories'] == 'Coffee Shop']
if not dfven.empty:
Once I added this check, my printed output was a clean series of identically formatted data frames, so appending them into one data frame was easy. I created a data frame at the beginning of my code (merge = pd.DataFrame()) and then added this line where I was printing:
merge = merge.append(dfven)
Now my output is perfect.
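One caveat for newer pandas: DataFrame.append was deprecated and then removed in pandas 2.0, so on recent versions the same pattern is usually written by collecting the per-URL frames in a list and concatenating once after the loop (a sketch, reusing dfven from above):
frames = []                                       # created once, before the loop
for x in range(103):
    # ... build dfven for URLs[x] exactly as above ...
    if not dfven.empty:
        frames.append(dfven)                      # collect instead of merge.append
merged = pd.concat(frames, ignore_index=True)     # single concatenation after the loop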

Reorder row values csv pandas

I have a csv file
1 , name , 1012B-Amazon , 2044C-Flipcart , Bosh27-Walmart
2 , name , Kelvi20-Flipcart, LG-Walmart
3, name , Kenstar-Walmart, Sony-Amazon , Kenstar-Flipcart
4, name , LG18-Walmart, Bravia-Amazon
I need the rows to be rearranged by the websites, i.e. the part after the '-':
1, name , 1012B-Amazon , 2044C-Flipcart , Bosh27-Walmart
2, name , , Kelv20-Flipcart, LG-Walmart
3, name , Sony-Amazon, Kenstar-Flipcart ,Kenstar-Walmart
4, name , Bravia-Amazon, ,LG18-Walmart
Is it possible using pandas? Finding the existence of a string, rearranging it, iterating through all rows, and repeating this for the next string? I went through the documentation of Series.str.contains and str.extract but was unable to find a solution.
Using sorted with a key:
df.iloc[:, 1:].apply(lambda x: sorted(x, key=lambda y: (y == '', y)), axis=1)
2 3 4 5
1 ABC DEF GHI JKL
2 ABC DEF GHI
3 ABC DEF GHI JKL
#df.iloc[:,1:]=df.iloc[:,1:].apply(lambda x : sorted(x,key=lambda y: (y=='',y)),1)
Since you mention reindex, I think get_dummies will work:
s = pd.get_dummies(df.iloc[:, 1:], prefix='', prefix_sep='')
s = s.drop('', axis=1)
df.iloc[:, 1:] = s.mul(s.columns).values
df
1 2 3 4 5
1 name ABC DEF GHI JKL
2 name ABC DEF GHI
3 name ABC DEF GHI JKL
Assuming the empty value is np.nan:
# Fill in the empty values with some string to allow sorting
df.fillna('NaN', inplace=True)
# Flatten the dataframe, do the sorting and reshape back to a dataframe
pd.DataFrame(list(map(sorted, df.values)))
0 1 2 3
0 ABC DEF GHI JKL
1 ABC DEF GHI NaN
2 ABC DEF GHI JKL
UPDATE
Given the update to the question and the sample data being as follows
df = pd.DataFrame({'name': ['name1', 'name2', 'name3', 'name4'],
                   'b': ['1012B-Amazon', 'Kelvi20-Flipcart', 'Kenstar-Walmart', 'LG18-Walmart'],
                   'c': ['2044C-Flipcart', 'LG-Walmart', 'Sony-Amazon', 'Bravia-Amazon'],
                   'd': ['Bosh27-Walmart', np.nan, 'Kenstar-Flipcart', np.nan]})
a possible solution could be
def foo(df, retailer):
    # Find cells that contain the name of the retailer
    mask = df.where(df.apply(lambda x: x.str.contains(retailer)), '')
    # Squash the resulting mask into a series
    col = mask.max(skipna=True, axis=1)
    # Optional: trim the name of the retailer
    col = col.str.replace(f'-{retailer}', '')
    return col

df_out = pd.DataFrame(df['name'])
for retailer in ['Amazon', 'Walmart', 'Flipcart']:
    df_out[retailer] = foo(df, retailer)
resulting in
    name  Amazon  Walmart Flipcart
0  name1   1012B   Bosh27    2044C
1  name2               LG  Kelvi20
2  name3    Sony  Kenstar  Kenstar
3  name4  Bravia     LG18
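An alternative sketch of the same idea, assuming each row has at most one item per retailer: melt the sample df above into long form, extract the retailer suffix, and pivot back so the retailers become columns:
long = df.melt(id_vars='name', value_name='item').dropna(subset=['item'])
long['retailer'] = long['item'].str.extract(r'-(\w+)$', expand=False)   # text after the last '-'
out = long.pivot(index='name', columns='retailer', values='item').reset_index()
print(out)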
Edit after Question Update:
This is the abc csv:
1,name,ABC,GHI,DEF,JKL
2,name,GHI,DEF,ABC,
3,name,JKL,GHI,ABC,DEF
This is the company csv (it is necessary to watch the commas carefully):
1,name,1012B-Amazon,2044C-Flipcart,Bosh27-Walmart
2,name,Kelvi20-Flipcart,LG-Walmart,
3,name,Kenstar-Walmart,Sony-Amazon,Kenstar-Flipcart
4,name,LG18-Walmart,Bravia-Amazon,
Here is the code
import pandas as pd
import numpy as np

# These solutions assume that each non-empty value is not repeated within a row.
# If that is not the case for your data, it would be possible to apply some
# transformations so that the non-empty values are unique for each row.

# "get_company" returns the company if the value is non-empty and an
# empty value if the value was empty to begin with:
def get_company(company_item):
    if pd.isnull(company_item):
        return np.nan
    else:
        company = company_item.split('-')[-1]
        return company

# Using the "define_sort_order" function, one can retrieve a template that is later
# used to sort all rows in the reorder_data_rows function. The template is derived
# from all values, aside from empty values, within the matrix when by_largest_row = False.
# One could also choose the single largest row to serve as the template for all
# other rows to follow. Both options work similarly when all rows are subsets of the
# largest row, i.e. every element in every other row (subset) can be found in the
# largest row (or set). The difference arises when rows contain unique elements:
# whether one wants to create a table with all sorted elements serving as the columns,
# or whether one wants to simply exclude elements that are not in the largest row
# when at least one non-subset row exists.
# Rather than only returning the original data values, one can also get back a template
# with different values from those of the original dataset by passing a function that
# operates on the values (value_filtering_function).
def define_sort_order(data, by_largest_row=False, value_filtering_function=None):
    if not by_largest_row:
        if value_filtering_function:
            data = data.applymap(value_filtering_function)
        # data.values returns a numpy array with rows and columns;
        # .flatten() puts all elements in a 1-dim array and
        # set() keeps the unique values in that array.
        filtered_values = list(set(data.values.flatten()))
        filtered_values = [data_value for data_value in filtered_values if not_empty(data_value)]
        # sorted returns a list, even with np.arrays as inputs
        model_row = sorted(filtered_values)
    else:
        if value_filtering_function:
            data = data.applymap(value_filtering_function)
        row_lengths = data.apply(lambda data_row: data_row.notnull().sum(), axis=1)
        # locate the numerical index of the row with the most non-empty elements:
        model_row_idx = row_lengths.idxmax()
        # sort and filter the row with the most values:
        filtered_values = list(set(data.iloc[model_row_idx]))
        model_row = [data_value for data_value in sorted(filtered_values) if not_empty(data_value)]
    return model_row

# "not_empty" is used in the above function to filter the model list so that
# no empty elements remain.
def not_empty(value):
    return pd.notnull(value) and value not in ['', ' ', None]

# Sorts the elements of each data_row into their corresponding positions within the model_row.
# Elements of the model_row that are missing from the current data_row are replaced with np.nan.
def reorder_data_rows(data_row, model_row, check_by_function=None):
    # Here we apply the same function that was used when computing the ordering of the
    # model_row: each value of the data row is transformed temporarily to determine
    # whether the transformed value is in the model row. If so, we find where, and
    # place the original value in that position.
    if check_by_function:
        # create an empty vector that is the same length as the template (model_row)
        sorted_data_row = [np.nan] * len(model_row)
        data_row = [value for value in data_row.values if not_empty(value)]
        for value in data_row:
            value_lookup = check_by_function(value)
            if value_lookup in model_row:
                idx = model_row.index(value_lookup)
                # place the company item in its respective position as indicated by the model_row
                sorted_data_row[idx] = value
    else:
        sorted_data_row = [value if value in data_row.values else np.nan for value in model_row]
    return pd.Series(sorted_data_row)

##################### ABC ######################
# Reading the data:
# the first data row would be treated as a header if the header=None option
# were not included. Note: "name" and the 1,2,3 column are not in the index.
abc = pd.read_csv("abc.csv", header=None, index_col=None)
# Returns a sorted, non-empty list. If you hard-code the order you want,
# you can simply pass the hard-coded order as model_row and avoid
# all functions aside from reorder_data_rows.
model_row = define_sort_order(abc.iloc[:, 2:], False)
# Apply the "reorder_data_rows" function to each row before saving back into
# the original dataframe.
# lambda allows us to create our own function without giving it a name;
# it is useful here in order to pass two inputs to reorder_data_rows.
abc.iloc[:, 2:] = abc.iloc[:, 2:].apply(lambda abc_row: reorder_data_rows(abc_row, model_row), axis=1).values
# Save to a new csv that won't include the pandas-created indices (0,1,2)
# or column names (0,1,2,3,4):
abc.to_csv("sorted_abc.csv", header=False, index=False)
################################################

################## COMPANY #####################
company = pd.read_csv("company.csv", header=None, index_col=None)
model_row = define_sort_order(company.iloc[:, 2:], by_largest_row=False, value_filtering_function=get_company)
# The only thing that changes here is that we tell the sort function what specific
# criterion to use to reorder each row: the result of the get_company function.
# get_company takes an input such as Kenstar-Walmart and outputs Walmart (what's
# after the "-"), and we then sort by the resulting list of companies.
# Because we used define_sort_order to retrieve companies rather than company items,
# we need to pass the same function when reordering each element of the DataFrame.
company.iloc[:, 2:] = company.iloc[:, 2:].apply(lambda companies_row: reorder_data_rows(companies_row, model_row, check_by_function=get_company), axis=1).values
company.to_csv("sorted_company.csv", header=False, index=False)
#################################################
Here is the first result from sorted_abc.csv:
1 name ABC DEF GHI JKL
2 name ABC DEF GHI NaN
3 name ABC DEF GHI JKL
After modifying the code for the updated form of the question, here is the sorted_company.csv produced by running the script:
1 name 1012B-Amazon 2044C-Flipcart Bosh27-Walmart
2 name NaN Kelvi20-Flipcart LG-Walmart
3 name Sony-Amazon Kenstar-Flipcart Kenstar-Walmart
4 name Bravia-Amazon NaN LG18-Walmart
I hope it helps!

Extracting many URLs in a python dataframe

I have a dataframe which contains text including one or more URL(s):
user_id text
1 blabla... http://amazon.com ...blabla
1 blabla... http://nasa.com ...blabla
2 blabla... https://google.com ...blabla ...https://yahoo.com ...blabla
2 blabla... https://fnac.com ...blabla ...
3 blabla....
I want to transform this dataframe into the count of URL(s) per user_id:
user_id count_URL
1 2
2 3
3 0
Is there a simple way to perform this task in Python?
My code so far:
URL = pd.DataFrame(columns=['A','B','C','D','E','F','G'])
for i in range(data.shape[0]):
    for j in range(0, 8):
        URL.iloc[i, j] = re.findall(r"(?P<url>https?://[^\s]+)", str(data.iloc[i]))
Thank you,
Lionel
In general, the definition of a URL is much more complex than what you have in your example. Unless you are sure you have very simple URLs, you should look up a good pattern.
import re
URLPATTERN = r'(https?://\S+)' # Lousy, but...
First, extract the URLs from each string and count them:
df['urlcount'] = df.text.apply(lambda x: re.findall(URLPATTERN, x)).str.len()
Next, group the counts by user id:
df.groupby('user_id').sum()['urlcount']
#user_id
#1 2
#2 3
#3 0
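A slightly shorter variant of the same idea relies on Series.str.count, which counts regex matches directly (same URLPATTERN as above):
df['urlcount'] = df['text'].str.count(URLPATTERN)
print(df.groupby('user_id')['urlcount'].sum())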
Below is another way to do that:
#read data
import pandas as pd
data = pd.read_csv("data.csv")
#Divide data into URL and user_id and cast it to pandas DataFrame
URL = pd.DataFrame(data.loc[:,"text"].values)
user_id = pd.DataFrame(data.loc[:,"user_id"].values)
# count the number of appearances of "http" in each row of data
sub = "http"
count_URL = []
for val in URL.iterrows():
    counter = val[1][0].count(sub)
    count_URL.append(counter)
#list to DataFrame
count_URL = pd.DataFrame(count_URL)
#Concatenate the two data frames and apply the code of #DyZ to group by and count the number of url
finalDF = pd.concat([user_id,count_URL],axis=1)
finalDF.columns=["user_id","urlcount"]
data = finalDF.groupby('user_id').sum()['urlcount']
print(data.head())
