I have a pandas dataframe correct_X_test with a single column, review, that contains reviews.
I need to add two new columns that contain parts of each review, as follows:
For a review such as review = 'x1 x2 x3 x x x xi x x x xn', I need to store sub_review_1_i = 'x1 x2 x3 x x x xi' and sub_review_i_n = 'xi x x x xn' for every i from 1 to n.
I extract the two strings using this code:
for j in correct_y_test.index:
    input_list = correct_X_test["review"][j].split()
    for i in range(len(input_list)):
        # Build the sequence from x1 to xi
        sub_list_1_i = input_list[:i+1]
        sub_str_1_i = ""
        for ele in sub_list_1_i:
            sub_str_1_i += ele + " "
        # Build the sequence from xi to xn
        sub_list_i_n = input_list[i:]
        sub_str_i_n = ""
        for ele in sub_list_i_n:
            sub_str_i_n += ele + " "
but I don't see how to store this in the dataframe, because each review produces n rows (one per value of i) and 2 columns.
Any ideas, please?
The way I see it, you have two options:
Option 1: store the sub-reviews as lists
In this option, for every review you create two lists: one to store the values of sub_str_1_i and another for sub_str_i_n. You then add those lists as new columns in their respective rows. Here's an example:
import pandas as pd

# == Create some dummy data ====================================================
correct_X_test = pd.DataFrame({"review": ["This is a review",
                                          "This is another review",
                                          "This is a third review"]})

# == Solution 1 ================================================================
correct_X_test['1_i'] = None
correct_X_test['i_n'] = None
for j, row in correct_X_test.iterrows():
    input_list = row["review"].split()
    sub_list_1_i, sub_list_i_n = [], []
    for i in range(len(input_list)):
        # Build the sequence from x1 to xi
        sub_str_1_i = " ".join(input_list[:i+1])
        # Build the sequence from xi to xn
        sub_str_i_n = " ".join(input_list[i:])
        sub_list_1_i.append(sub_str_1_i)
        sub_list_i_n.append(sub_str_i_n)
    correct_X_test.loc[j, '1_i'] = sub_list_1_i
    correct_X_test.loc[j, 'i_n'] = sub_list_i_n

print(correct_X_test)
# Prints:
#
#                    review                                                1_i  \
# 0        This is a review       [This, This is, This is a, This is a review]
# 1  This is another review  [This, This is, This is another, This is anoth...
# 2  This is a third review  [This, This is, This is a, This is a third, Th...
#
#                                                  i_n
# 0  [This is a review, is a review, a review, review]
# 1  [This is another review, is another review, an...
# 2  [This is a third review, is a third review, a ...
Option 2: create new rows for every combination of sub_str_1_i and sub_str_i_n
In this option, each combination of sub_str_1_i and sub_str_i_n is stored as a new row in the dataframe. You can use the method pd.DataFrame.explode (passing a list of columns requires pandas 1.3 or later) to convert the output from Option 1 into new rows:
correct_X_test.explode(['i_n', '1_i'])
# Returns:
#
#                    review                     1_i                     i_n
# 0        This is a review                    This        This is a review
# 0        This is a review                 This is             is a review
# 0        This is a review               This is a                a review
# 0        This is a review        This is a review                  review
# 1  This is another review                    This  This is another review
# 1  This is another review                 This is       is another review
# 1  This is another review         This is another          another review
# 1  This is another review  This is another review                  review
# 2  This is a third review                    This  This is a third review
# 2  This is a third review                 This is       is a third review
# 2  This is a third review               This is a          a third review
# 2  This is a third review         This is a third            third review
# 2  This is a third review  This is a third review                  review
You can create two empty lists for the sub-review columns and, for each review, append a list holding the sub-review strings for every value of i. Finally, add the two lists as new columns to the correct_X_test dataframe. Each column needs exactly one entry per row, so the strings belonging to a review are grouped in a per-row list. Try this code:
sub_review_1_i = []
sub_review_i_n = []
for j in correct_X_test.index:
    input_list = correct_X_test["review"][j].split()
    # Per-review lists, so that each dataframe row receives exactly one entry
    row_1_i = []
    row_i_n = []
    for i in range(len(input_list)):
        row_1_i.append(" ".join(input_list[:i+1]))
        row_i_n.append(" ".join(input_list[i:]))
    sub_review_1_i.append(row_1_i)
    sub_review_i_n.append(row_i_n)
correct_X_test["sub_review_1_i"] = pd.Series(sub_review_1_i, index=correct_X_test.index)
correct_X_test["sub_review_i_n"] = pd.Series(sub_review_i_n, index=correct_X_test.index)
The sub_review_1_i and sub_review_i_n lists are initialized before the loop and then populated with one list of sub-review strings per review. Finally, they are added as new columns to the correct_X_test dataframe; wrapping them in pd.Series keeps each list as a single cell and aligns it with the dataframe index.
I'm running a loop to create some features for a dataframe. Here's how it looks:
medianas_origem_midia = []
teste_mannwhitneyu = []
alpha = 0.05
primeiro_pedido = []
volumetria = []
for i in flag_origem_midia:
    aux = df.groupby(['cpf',i]).agg({'margem_total_regime_especial_trat':sum}).reset_index()  # Sum per CPF
    teste_mannwhitneyu.append(teste_medianas(aux,i))
    a = df.groupby(i).agg({'data_primeiro_pedido':min}).reset_index()
    primeiro_pedido.append(f"""Outros: {str(a.data_primeiro_pedido[0])} \
\
{i}: {str(a.data_primeiro_pedido[1])}""")
    a = df.groupby(i).size().reset_index(name = 'vol')
    volumetria.append(f"""Outros: {str(a.vol[0])} {i}: {str(a.vol[1])} """)
    aux = aux.groupby(i).agg({'margem_total_regime_especial_trat':np.median}).reset_index()
    medianas_origem_midia.append((aux.margem_total_regime_especial_trat[1]-aux.margem_total_regime_especial_trat[0])/aux.margem_total_regime_especial_trat[0]*100)
    del aux
d = {'Origem/Midia':flag_origem_midia,'LTV':medianas_origem_midia,'Teste de Medianas':teste_mannwhitneyu,'Data Primeiro Pedido':primeiro_pedido,'Volumetria':volumetria}
df_origem_midia = pd.DataFrame(d)
Here, the primeiro_pedido list stores the date of the first purchase for each of the flags in flag_origem_midia.
My issue is that the string created in this list is too long, so it is displayed like this:
Origem/Midia     LTV        Teste de Medianas                   Data Primeiro Pedido                       Volumetria
fl_direct / n/a  -5.897762  Different distribution (reject H0)  Outros: 2021-01-01 fl_direct / n/a: 20...  Outros: 2722123 fl_direct / n/a: 864342
As you can see in the column "Data Primeiro Pedido", there's a "..." at the end of the string, but my goal is to show the entire string. Is there a way to break the line?
I tried putting a '\n' in the middle of the string, but it didn't work.
Have you tried setting the pandas maximum column width option?
pd.set_option('display.max_colwidth', None)
It should remove the ellipsis and display the whole value of the cell.
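As a minimal sketch of what this option changes (the one-column frame and its dummy text are hypothetical, not the asker's data), the snippet below renders the same frame twice: once with a small display.max_colwidth and once with the limit removed. pd.option_context is used here so the setting only applies inside each with block; pd.set_option would change it for the whole session.

```python
import pandas as pd

# Hypothetical frame with one deliberately long string cell
long_text = " ".join(["word"] * 30)
df_demo = pd.DataFrame({"text": [long_text]})

# With a small column width, the cell is cut off and ends in "..."
with pd.option_context("display.max_colwidth", 20):
    truncated = repr(df_demo)

# With the limit removed (None), the whole cell is rendered
with pd.option_context("display.max_colwidth", None):
    full = repr(df_demo)

print(truncated)
print(full)
```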
I have a dataframe consisting of Wikipedia articles with geocoordinates and some statistics. The column 'Availability' contains a tuple of the languages that article is available in (out of a selection).
What I'm trying to do is plot a bubble map with plotly, with the legend showing the availability in those languages. For example, out of ['ca','es'] you would have [], ['ca'], ['es'], ['ca','es'], meaning not available, only in Catalan, only in Spanish, or available in both, respectively.
The problem is that when I try to use those combinations to create a dataframe with only the matching rows using isin(), it always returns an empty df.
The columns of the dataframe are:
Columns: [French Title, Qitem, Pageviews, page_title_1, page_title_2, Availability, Lat, Lon, Text]
Here is my code:
fig = go.Figure()
scale = 500
for comb in combinations:
    df_sub = df[df['Availability'].isin(tuple(comb))]  # The problem is here. This returns an empty DF
    if(len(df_sub.index)) == 0: continue  # There are no occurrences with that comb
    fig.add_trace(go.Scattergeo(
        lat=df_sub['Lat'],
        lon=df_sub['Lon'],
        text=df_sub['Text'],
        marker = dict(
            size = df[order_by],
            sizeref=2. * max(df[order_by]) / (scale ** 2),
            line_color='rgb(40,40,40)',
            line_width=0.5,
            sizemode='area'
        ), name = comb  # Here is the underlying restriction. I need to separate the traces according to their availability.
    ))
PS: I guess it has something to do with pandas not working well with lists or tuples as column values, but I couldn't figure out how to achieve what I want. Does anyone have an idea? Thank you in advance. comb appears as a string or a tuple of strings, e.g. ('es','ca'), but the values in df['Availability'] appear as (es, ca) when I print them.
Sample dataframe (sorry for the style, I'm new to Stack Overflow):
                 French Title      Qitem  Pageviews  \
0                       Liban       Q822      53903
1                      France       Q142      25728
2                 Biélorussie       Q184      21688
3                 Île Maurice   Q2656389      20478
4  Affaire Dupont de Ligonnès  Q16010109      16075

                                        page_title_1  page_title_2  \
0                                             Líbano         Líban
1                                            Francia        França
2                                        Bielorrusia   Bielorússia
3                                   Isla de Mauricio  Illa Maurici
4  Asesinatos y desapariciones de Dupont de Ligonnès

  Availability              Lat              Lon  \
0     (es, ca)      33.90000000      35.53330000
1     (es, ca)      48.86700000       2.32650000
2     (es, ca)  53.528333333333  28.046666666667
3     (es, ca)     -20.30084200      57.58209200
4        (es,)      47.23613230      -1.56848610

                                                                    Text
0                            Liban<br>(33.90000000, 35.53330000)<br>Q822
1                            France<br>(48.86700000, 2.32650000)<br>Q142
2              Biélorussie<br>(53.528333333333, 28.046666666667)<br>Q184
3                Île Maurice<br>(-20.30084200, 57.58209200)<br>Q2656389
4  Affaire Dupont de Ligonnès<br>(47.23613230, -1.56848610)<br>Q16010109
You can use Series.apply() to achieve your goal:
df['Availability'].apply(lambda x: 'ca' in x)
That will return True for every row where 'ca' is in the tuple. It can easily be modified to return some label, e.g. Catalan.
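A small sketch of the idea, using a hypothetical three-row frame rather than the asker's data: the first mask tests membership of one language, the second tests for an exact combination regardless of order. The names df_demo, has_ca and exact_match are only illustrative.

```python
import pandas as pd

# Hypothetical frame whose cells are tuples of language codes
df_demo = pd.DataFrame({"Availability": [("es", "ca"), ("es",), ()]})

# True where 'ca' is one of the languages in the tuple
has_ca = df_demo["Availability"].apply(lambda langs: "ca" in langs)

# True where the tuple holds exactly the languages of one combination
comb = ("ca", "es")
exact_match = df_demo["Availability"].apply(lambda langs: set(langs) == set(comb))

print(df_demo[has_ca])
print(df_demo[exact_match])
```

Both masks are ordinary boolean Series, so they can be used directly to select the rows for one trace.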
In the end I turned the tuples into lists (since I'm no longer using df.isin(), this doesn't raise the unhashable type error) and was able to separate the traces by combination using apply() (thanks to mkos for the idea):
for comb in combinations:
    if len(comb) == 0:
        name = 'Not available'
        df_sub = df[df['Availability'].apply(lambda x: len(x) == 0)]
    else:
        df_sub = df[df['Availability'].apply(lambda x: set(comb) == set(x))]
        name = ','.join(comb)
    if(len(df_sub.index)) == 0: continue
    fig.add_trace(go.Scattergeo(
        lat=df_sub['Lat'],
        lon=df_sub['Lon'],
        text=df_sub['Text'],
        marker = dict(
            size = df_sub[order_by],  # sizes of the rows in this trace only
            sizeref=2. * max(df[order_by]) / (scale ** 2),  # common scale across all traces
            line_color='rgb(40,40,40)',
            line_width=0.5,
            sizemode='area'
        ), name = name
    ))
I am trying to store values from a linear regression run over a rolling window. Each value should go to a specific position in my df (for example, with a 2300 x 2300 df, the first value from the regression should go in the first column at the 228th row, and so on).
My code is below.
Any help is more than welcome.
df_rolling_tstat  # 2300 x 2300 dataframe
for i in range(len(Switz_fund_ret.iloc[1:, 1:2].columns)):
    s = Switz_fund_ret.loc[birth_date[i], :]
    start = s['contatore']
    e = Switz_fund_ret.loc[death_date[i], :]
    end = e['contatore']
    window = 12
    for j in range(end-start):
        roll_one = Switz_fund_ret[i].iloc[start+j:start+window+j]
        # market excess return while the fund was active
        roll_two = Switz_fund_ret[2308].iloc[start+j:start+window+j]
        # risk-free rate while the fund was active
        roll_three = Switz_fund_ret[2309].iloc[start+j:start+window+j]
        roll_excess_return_fund = roll_one - roll_three
        roll_two = sm.add_constant(roll_two)
        roll_y = np.array(roll_excess_return_fund, dtype=float)
        roll_x = np.array(roll_two, dtype=float)
        roll_model = sm.OLS(roll_y, roll_x).fit()
        roll_reg.append(roll_model)
        alpha_roll.append(roll_model.params[0])
        t_stat_roll.append(roll_model.tvalues[0])
        p_value_roll.append(roll_model.pvalues[0])
I would like, for instance, to retrieve roll_model.pvalues[0] and put it in the first column of the df at the 228th position. Then, for the second regression, I want to store roll_model.pvalues[0] at the 229th entry.
Many thanks.
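One possible way to write each result into a given position is positional assignment with .iloc. The sketch below is only an illustration under stated assumptions: a tiny 10 x 3 frame stands in for the 2300 x 2300 one, start, fund_col and p_values are hypothetical placeholders for the question's start, i and roll_model.pvalues[0], and the target row is assumed to be start + j (the question does not say whether the window length should be added).

```python
import numpy as np
import pandas as pd

# Small hypothetical stand-in for the 2300 x 2300 result frame
df_rolling_tstat = pd.DataFrame(np.nan, index=range(10), columns=range(3))

start = 4                       # hypothetical zero-based row of the first regression
fund_col = 0                    # hypothetical column position of the fund
p_values = [0.03, 0.20, 0.01]   # placeholders for successive roll_model.pvalues[0]

for j, p_value in enumerate(p_values):
    # Row start + j, column fund_col, both addressed by position
    df_rolling_tstat.iloc[start + j, fund_col] = p_value

print(df_rolling_tstat)
```

In the original loop, the assignment would sit where p_value_roll.append(...) is, so no intermediate list is needed.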
I am trying to create a DataFrame with more than 500 rows, derived from an API query. When I check the length of my arrays, as so:
print(len(cities), len(country), len(max_temp), len(latit), len(longit), len(humid_), len(cloud_), len(wind))
I get the following output:
577 526 526 526 526 526 526 526
Now, I read the answers about casting these to Series, which adds NaN to the empty cells. The problem is that this mismatches the column values, i.e., all the numeric values are listed first, then all the NaN's at the end. This will cause the country, max_temp, etc., to line up with the wrong city. What I want to do is have the NaN appear in the correct row of each city with missing data. I could simply dropna if I had a DataFrame; but with the different array lengths, I cannot get a DataFrame.
Okay, editing in light of the comments: I began with a randomly generated list of coordinates, then:
for lat_lng in lat_lngs:
    city = citipy.nearest_city(lat_lng[0], lat_lng[1]).city_name
    # If the city is unique, then add it to our cities list
    if city not in cities:
        cities.append(city)
This generated a list of cities. Then I did:
country = []
latit = []
longit = []
max_temp = []
humid_ = []
cloud_ = []
wind = []
for city in cities:
    try:
        query_url = base_url + "q=" + city + "&appid=" + weather_api_key
        response = requests.get(query_url).json()
        country.append(response['sys']['country'])
        latit.append(response['coord']['lat'])
        longit.append(response['coord']['lon'])
        max_temp.append(response['main']['temp_max'])
        humid_.append(response['main']['humidity'])
        cloud_.append(response['clouds']['all'])
        wind.append(response['wind']['speed'])
    except:
        print(f'Data not found.')
What I believe is occurring is that I am getting an array something like this:
City         Country  Max Temp  (etc.)
Boston       US       30        (etc.)
Honolulu
Rome         IT       27
Vladivostok  RU       20
In this example, "Honolulu" had no data, so generated a row with only the city Column filled. I can't be sure, since I can't view it as a DataFrame. What I want to do is either put NaN in the same row as Honolulu, or drop the row with Honolulu.
So after consulting with an expert offline, I got this solution:
In my lists of variables, add a new one:
new_city = []
country = []
latit = []
longit = []
max_temp = []
humid_ = []
cloud_ = []
wind = []
Then, in my try block:
for city in cities:
    try:
        query_url = base_url + "q=" + city + "&appid=" + weather_api_key
        response = requests.get(query_url).json()
        new_city.append(response['name'])
        country.append(response['sys']['country'])
        latit.append(response['coord']['lat'])
        longit.append(response['coord']['lon'])
        max_temp.append(response['main']['temp_max'])
        humid_.append(response['main']['humidity'])
        cloud_.append(response['clouds']['all'])
        wind.append(response['wind']['speed'])
    except:
        print(f'Data not found.')
And finally, build my dataframe with that new_city variable instead of the original city variable. This gives me all lists of the same length.
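An alternative that keeps the cities with missing data, sketched here under assumptions: instead of parallel lists, each city gets one dictionary, and pandas fills absent keys with NaN when the frame is built. The responses dictionary below is a hypothetical stand-in for the API replies (no request is made), reusing the values from the question's illustration.

```python
import pandas as pd

# Hypothetical replies keyed by city; "Honolulu" mimics a reply without weather data
responses = {
    "Boston": {"sys": {"country": "US"}, "main": {"temp_max": 30}},
    "Honolulu": {"cod": "404"},
    "Rome": {"sys": {"country": "IT"}, "main": {"temp_max": 27}},
}

rows = []
for city, response in responses.items():
    try:
        # The dictionary is built completely before it is appended,
        # so a missing key can never leave a half-filled row behind
        row = {
            "City": city,
            "Country": response["sys"]["country"],
            "Max Temp": response["main"]["temp_max"],
        }
    except KeyError:
        row = {"City": city}  # keep the city; the other columns become NaN
    rows.append(row)

weather_df = pd.DataFrame(rows)
complete_df = weather_df.dropna()  # drop cities with missing data, if preferred
print(weather_df)
```

Because every row is stored as a unit, the values always line up with the right city, whichever of the two outcomes (NaN row or dropped row) is wanted.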
I have two dataframes. The first is a df of actors with a feature holding the identifier numbers of the films they've worked on. The second is a df of movies, each with an identifier number that shows up in an actor's list if the actor was in that movie.
I've attempted to iterate through the movies dataframe, which does produce results but is too slow.
It seems like iterating through the list of movies from the actors dataframe would result in less looping, but I've been unable to save results.
Here is the actors dataframe:
print(actors[['primaryName', 'knownForTitles']].head())
         primaryName                           knownForTitles
0     Rowan Atkinson  tt0109831,tt0118689,tt0110357,tt0274166
1        Bill Paxton  tt0112384,tt0117998,tt0264616,tt0090605
2   Juliette Binoche  tt1219827,tt0108394,tt0116209,tt0241303
3   Linda Fiorentino  tt0110308,tt0119654,tt0088680,tt0120655
4  Richard Linklater  tt0243017,tt1065073,tt2209418,tt0405296
And the movies dataframe:
print(movies[['tconst', 'primaryTitle']].head())
      tconst                 primaryTitle
0  tt0001604            The Fatal Wedding
1  tt0002467          Romani, the Brigand
2  tt0003037   Fantomas: The Man in Black
3  tt0003593  Across America by Motor Car
4  tt0003830       Detective Craig's Coup
As you can see, the movies['tconst'] identifier shows up in a list in the actors dataframe.
My very slow iteration through the movie dataframe is as follows:
def add_cast(movie_df, actor_df):
    results = movie_df.copy()
    length = len(results)
    #create an empty feature
    results['cast'] = ""
    #iterate through the movie identifiers
    for index, value in results['tconst'].items():
        #create a new dataframe containing all the cast associated with the movie id
        cast = actor_df[actor_df['knownForTitles'].str.contains(value)]
        #check to see if the 'primaryName' list is empty
        if len(list(cast['primaryName'].values)) != 0:
            #set the new movie 'cast' feature equal to a list of the cast names
            results.loc[index]['cast'] = list(cast['primaryName'].values)
        #logging
        if index % 1000 == 0:
            logging.warning(f'Results location: {index} out of {length}')
        #delete cast df to free up memory
        del cast
    return results
This generates some results but is not fast enough to be useful. One observation: filtering the actors dataframe for everyone who has the movie identifier in their knownForTitles gives a list of names that can be stored in a single feature of the movies dataframe.
Whereas for my attempt to loop through the actors dataframe below, I don't seem to be able to append items into the movies dataframe:
def actors_loop(movie_df, actor_df):
    results = movie_df.copy()
    length = len(actor_df)
    #create an empty feature
    results['cast'] = ""
    #iterate through all actors
    for index, value in actor_df['knownForTitles'].items():
        #skip empties
        if str(value) == r"\N":
            logging.warning(f'skipping: {index} with a value of {value}')
            continue
        #generate a list of movies that this actor has been in
        cinemetography = [x.strip() for x in value.split(',')]
        #iterate through every movie the actor has been in
        for movie in cinemetography:
            #pull out the movie info if it exists
            movie_info = results[results['tconst'] == movie]
            #continue if empty
            if len(movie_info) == 0:
                continue
            #set the cast variable equal to the actor name
            results[results['tconst'] == movie]['cast'] = (actor_df['primaryName'].loc[index])
            #delete the df to save space ?maybe
            del movie_info
        #logging
        if index % 1000 == 0:
            logging.warning(f'Results location: {index} out of {length}')
    return results
So if I run the above code, I get a very fast result, but the 'cast' field remains empty.
I figured out the problem I was having with the actors_loop(movie_df, actor_df) function. The problem is that
results[results['tconst'] == movie]['cast'] = (actor_df['primaryName'].loc[index])
sets the value on a temporary copy of the results dataframe (chained indexing), so results itself is never updated. It would be better to assign through a single indexer such as df.loc[] or df.at[] (the older df.set_value() method was removed in pandas 1.0).
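A minimal sketch of the single-indexer form, on a hypothetical two-row frame built from identifiers and a name that appear in the results further down. The boolean mask and the column label go into one .loc call, so the assignment reaches the original dataframe.

```python
import pandas as pd

# Hypothetical two-row stand-in for the movies dataframe
results = pd.DataFrame({"tconst": ["tt0000009", "tt0000147"], "cast": ["", ""]})

# Row selection and column selection in a single .loc call
mask = results["tconst"] == "tt0000147"
results.loc[mask, "cast"] = "Bob Fitzsimmons"

print(results)
```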
I also figured out a much faster solution to the problem: rather than iterating through two dataframes with nested loops, it is better to iterate once. So I created a list of tuples:
def actor_tuples(actor_df):
    tuples = []
    for index, value in actor_df['knownForTitles'].items():
        cinemetography = [x.strip() for x in value.split(',')]
        for movie in cinemetography:
            tuples.append((actor_df['primaryName'].loc[index], movie))
    return tuples
This created a list of tuples of the following form:
[('Fred Astaire', 'tt0043044'),
('Lauren Bacall', 'tt0117057')]
I then created a dictionary that maps each movie identifier number to its index position in the movie dataframe. It took this form:
{'tt0000009': 0,
'tt0000147': 1,
'tt0000335': 2,
'tt0000502': 3,
'tt0000574': 4,
'tt0000615': 5,
'tt0000630': 6,
'tt0000675': 7,
'tt0000676': 8,
'tt0000679': 9}
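The answer does not show how this dictionary was built. One way it could be done, sketched on a hypothetical three-row movies frame, is a dictionary comprehension over Series.items(), which yields (index, value) pairs; the name movie_index is only illustrative.

```python
import pandas as pd

# Hypothetical three-row stand-in for the movies dataframe
movies = pd.DataFrame({"tconst": ["tt0000009", "tt0000147", "tt0000335"]})

# Map each identifier to the index label of its row
movie_index = {tconst: idx for idx, tconst in movies["tconst"].items()}

print(movie_index)
```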
I then used the function below to iterate through the actor tuples, using the movie identifier as the key into the movie dictionary. This returns the correct movie index, which I use to append the actor's name to the target dataframe:
def add_cast(movie_df, Atuples, Mtuples):
    results_df = movie_df.copy()
    results_df['cast'] = ''
    counter = 0
    total = len(Atuples)
    for tup in Atuples:
        #this passes the movie ID into the movie_dict that will return an index
        try:
            movie_index = Mtuples[tup[1]]
            if results_df.at[movie_index, 'cast'] == '':
                results_df.at[movie_index, 'cast'] += tup[0]
            else:
                results_df.at[movie_index, 'cast'] += ',' + tup[0]
        except KeyError:
            pass
        #logging
        counter += 1
        if counter % 1000000 == 0:
            logging.warning(f'Index {counter} out of {total}, {counter/total:.1%} finished')
    return results_df
It ran in 10 minutes (making 2 sets of tuples, then the adding function) for 16.5 million actor tuples. The results are below:
0  tt0000009                     Miss Jerry  1894                 Romance
1  tt0000147  The Corbett-Fitzsimmons Fight  1897  Documentary,News,Sport
2  tt0000335          Soldiers of the Cross  1900         Biography,Drama
3  tt0000502                       Bohemios  1905                      \N
4  tt0000574    The Story of the Kelly Gang  1906   Biography,Crime,Drama

                                                cast
0  Blanche Bayliss,Alexander Black,William Courte...
1  Bob Fitzsimmons,Enoch J. Rector,John L. Sulliv...
2  Herbert Booth,Joseph Perry,Orrie Perry,Reg Per...
3  Antonio del Pozo,El Mochuelo,Guillermo Perrín,...
4  Bella Cola,Sam Crewes,W.A. Gibson,Millard John...
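For comparison, the same result can probably be reached without explicit Python loops. This is a different technique from the one above, not the author's method: split the identifier string, explode it into one row per (actor, movie) pair, group the names per movie, and merge onto the movies dataframe. The frames, names and identifiers below are hypothetical, and this has not been timed against the 16.5 million tuples mentioned above.

```python
import pandas as pd

# Hypothetical stand-ins for the actors and movies dataframes
actors = pd.DataFrame({
    "primaryName": ["Actor A", "Actor B", "Actor C"],
    "knownForTitles": ["tt1,tt2", "tt2", r"\N"],  # \N marks "no titles"
})
movies = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3"],
    "primaryTitle": ["Movie 1", "Movie 2", "Movie 3"],
})

# One row per (actor, movie) pair
pairs = actors.assign(tconst=actors["knownForTitles"].str.split(",")).explode("tconst")
pairs["tconst"] = pairs["tconst"].str.strip()

# Join the actor names of each movie into one comma-separated string
cast = pairs.groupby("tconst")["primaryName"].agg(",".join).rename("cast").reset_index()

# A left merge keeps every movie; movies without known actors get an empty string
result = movies.merge(cast, on="tconst", how="left")
result["cast"] = result["cast"].fillna("")

print(result)
```

The "\N" placeholder becomes an identifier that matches no movie, so it is dropped by the merge without special handling.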
Thank you stack overflow!