How to break a line in a string in a dataframe? - python

So, I'm running a loop to create some features for a dataframe. Here's how it looks:
medianas_origem_midia = []
teste_mannwhitneyu = []
alpha = 0.05
primeiro_pedido = []
volumetria = []
for i in flag_origem_midia:
    aux = df.groupby(['cpf',i]).agg({'margem_total_regime_especial_trat':sum}).reset_index() # Sum per CPF
    teste_mannwhitneyu.append(teste_medianas(aux,i))
    a = df.groupby(i).agg({'data_primeiro_pedido':min}).reset_index()
    primeiro_pedido.append(f"""Outros: {str(a.data_primeiro_pedido[0])} \
\
{i}: {str(a.data_primeiro_pedido[1])}""")
    a = df.groupby(i).size().reset_index(name = 'vol')
    volumetria.append(f"""Outros: {str(a.vol[0])} {i}: {str(a.vol[1])} """)
    aux = aux.groupby(i).agg({'margem_total_regime_especial_trat':np.median}).reset_index()
    medianas_origem_midia.append((aux.margem_total_regime_especial_trat[1]-aux.margem_total_regime_especial_trat[0])/aux.margem_total_regime_especial_trat[0]*100)
    del aux
d = {'Origem/Midia':flag_origem_midia,'LTV':medianas_origem_midia,'Teste de Medianas':teste_mannwhitneyu,'Data Primeiro Pedido':primeiro_pedido,'Volumetria':volumetria}
df_origem_midia = pd.DataFrame(d)
Here, the primeiro_pedido list stores the first purchase date for each of the flags in flag_origem_midia.
My issue is that the string created in this list is too long, so the dataframe looks like this:
Origem/Midia | LTV | Teste de Medianas | Data Primeiro Pedido | Volumetria
fl_direct / n/a | -5.897762 | Different distribution (reject H0) | Outros: 2021-01-01 fl_direct / n/a: 20... | Outros: 2722123 fl_direct / n/a: 864342
As you can see in the column "Data Primeiro Pedido", there's a "..." at the end of the string, but my goal is to show the entire string. Is there a way to break the line?
I tried to put a '\n' in the middle of the string, but it didn't work.

Have you tried setting the pandas max column width option, as suggested here?
pd.set_option('display.max_colwidth', None)
It should remove the ellipsis and display the whole value of the cell.
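To see the effect in isolation, here is a minimal sketch. It assumes a made-up one-column dataframe (the column name and the text are invented for illustration) and uses pd.option_context so the option only applies inside the with block instead of changing the global setting:

```python
import pandas as pd

# Hypothetical cell value, longer than the default 50-character column limit.
long_text = "first part of a long label / second part of a long label / third part"
demo = pd.DataFrame({"label": [long_text]})

# Default settings: the printed cell is cut off and ends with "...".
truncated = repr(demo)

# With display.max_colwidth set to None the full value is printed.
with pd.option_context("display.max_colwidth", None):
    full = repr(demo)

print(truncated)
print(full)
```

pd.set_option('display.max_colwidth', None) does the same thing for the rest of the session.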

Related

add rows and columns to pandas dataframe

I have a pandas dataframe correct_X_test with a single column, review, that contains reviews.
I need to add two new columns that contain parts of each review, as follows:
for a review review = 'x1 x2 x3 x x x xi x x x xn', I need to store sub_review_1_i = 'x1 x2 x3 x x x xi' and sub_review_i_n = 'xi x x x xn' for every i from 1 to n.
I extract the two strings using this code:
for j in correct_y_test.index:
    input_list=correct_X_test["review"][j].split()
    for i in range(len(input_list)):
        # Build the sequence from x1 to xi
        sub_list_1_i=input_list[:i+1]
        sub_str_1_i = ""
        for ele in sub_list_1_i:
            sub_str_1_i += ele + " "
        # Build the sequence from xi to xn
        sub_list_i_n=input_list[i:]
        sub_str_i_n = ""
        for ele in sub_list_i_n:
            sub_str_i_n += ele + " "
But I don't see how to store this in the dataframe, because each review produces several rows (one per value of i) and 2 columns.
Any ideas, please?
The way I see it, you have two options:
Option 1: store the sub-reviews as lists
In this option, for every "review", you create two lists: one to store the values of sub_str_1_i and another for sub_str_i_n. Then you add those lists as new columns in their respective rows. Here's an example:
import pandas as pd
# == Create some dummy data ====================================================
correct_X_test = pd.DataFrame({"review": ["This is a review",
                                          "This is another review",
                                          "This is a third review"]})
# == Solution 1 ================================================================
correct_X_test['1_i'] = None
correct_X_test['i_n'] = None
for j, row in correct_X_test.iterrows():
    input_list = row["review"].split()
    sub_list_1_i, sub_list_i_n = [], []
    for i in range(len(input_list)):
        # Build the sequence from x1 to xi
        sub_str_1_i = " ".join(input_list[:i+1])
        # Build the sequence from xi to xn
        sub_str_i_n = " ".join(input_list[i:])
        sub_list_1_i.append(sub_str_1_i)
        sub_list_i_n.append(sub_str_i_n)
    # .at sets a single cell, so each list is stored as one object
    correct_X_test.at[j, '1_i'] = sub_list_1_i
    correct_X_test.at[j, 'i_n'] = sub_list_i_n
print(correct_X_test)
# Prints:
#
#                    review                                                1_i  \
# 0        This is a review       [This, This is, This is a, This is a review]
# 1  This is another review  [This, This is, This is another, This is anoth...
# 2  This is a third review  [This, This is, This is a, This is a third, Th...
#
#                                                  i_n
# 0  [This is a review, is a review, a review, review]
# 1  [This is another review, is another review, an...
# 2  [This is a third review, is a third review, a ...
Option 2: create new rows for every combination of sub_str_1_i and sub_str_i_n
In this option, each combination of sub_str_1_i and sub_str_i_n is stored as a new row in the dataframe. You can use the method pd.DataFrame.explode to convert the output from Option 1 into new rows (passing a list of columns requires pandas 1.3 or later):
correct_X_test.explode(['i_n', '1_i'])
# Returns:
#
#                    review                     1_i                     i_n
# 0        This is a review                    This        This is a review
# 0        This is a review                 This is             is a review
# 0        This is a review               This is a                a review
# 0        This is a review        This is a review                  review
# 1  This is another review                    This  This is another review
# 1  This is another review                 This is       is another review
# 1  This is another review         This is another          another review
# 1  This is another review  This is another review                  review
# 2  This is a third review                    This  This is a third review
# 2  This is a third review                 This is       is a third review
# 2  This is a third review               This is a          a third review
# 2  This is a third review         This is a third            third review
# 2  This is a third review  This is a third review                  review
You can create empty lists for the two sub-review columns and append the corresponding sub-review strings to them for each value of i. Because every review produces several sub-reviews, these lists end up longer than correct_X_test, so assigning them as columns of correct_X_test would fail with a length mismatch. Build a new dataframe from them instead. Try this code:
reviews = []
sub_review_1_i = []
sub_review_i_n = []
for j in correct_X_test.index:
    review = correct_X_test["review"][j]
    input_list = review.split()
    for i in range(len(input_list)):
        # Repeat the original review so each sub-review row keeps its source
        reviews.append(review)
        sub_review_1_i.append(" ".join(input_list[:i+1]))
        sub_review_i_n.append(" ".join(input_list[i:]))
sub_reviews = pd.DataFrame({"review": reviews,
                            "sub_review_1_i": sub_review_1_i,
                            "sub_review_i_n": sub_review_i_n})
The reviews, sub_review_1_i and sub_review_i_n lists are initialized before the loop and then populated with the sub-review strings for each value of i. Finally, the three lists become the columns of a new dataframe, sub_reviews, with one row per (review, i) pair.

Extract several values from a row when a certain value is found

I have code that checks different columns for all the dates that are >= "2022-12-01" and <= "2024-12-31".
What I would like is to also extract some other information located on the same row.
These are the headers of my columns:
EMPL. NO
NOM A L'EMPLACEMENT
ADRESSE
VILLE
PROV
OBJET NO
EMPLACEMENT DE L'APPAREIL
DESCRIPTION DE L'APPAREIL
MANUFACTURIER
DIMENSIONS
MAWP
SVP
DERNIERE INSP. EXT.
FREQ. EXT.
DERNIERE INSP. INT.
FREQ. INT.
D_EXT_1
D_INT_1
D_EXT_2
D_INT_2
D_EXT_3
D_INT_3
D_EXT_4
D_INT_4
D_EXT_5
D_INT_5
D_EXT_6
D_INT_6
What I would like to search for are all the dates between "2022-12-01" and "2024-12-31" (inclusive) in any of the columns with the prefix D_EXT_, and extract them along with all the information on the row that comes before D_EXT_1.
This is the code I got from a question I asked earlier:
import pandas as pd
cols = [prefix + str(i) for prefix in ['D_INT_'] for i in range(1,7)]
data = pd.read_csv("dates.csv")
for col in cols:
    data.loc[:,col] = pd.to_datetime(data.loc[:,col])
ext = data[
    (
        data.loc[:,cols].ge(pd.to_datetime("2022-12-01"))\
        & data.loc[:,cols].le(pd.to_datetime("2024-12-31"))\
    ).any(axis=1)
]
print(ext)
The problem is that it's not doing what it's supposed to do. My file has 1692 lines and 29 columns, but the output shows: [1692 rows x 1715 columns].
Here is the original question:
how to extract entire row when a value is found
Any help would be appreciated.
# Get the rows (a chained comparison such as a <= series <= b raises an error on a Series, so use between)
rows_with_valid_date = df[df[date_column_name].between(after_this, before_this)]
# Get the wanted columns
needed_values = rows_with_valid_date[[wanted_column1, wanted_column2, etc]]
You can fill in the correct names where needed.
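For the multi-column case in the question, a minimal sketch could look like the following. It assumes a tiny made-up stand-in for dates.csv (the rows and the two info columns kept here are invented for illustration) and treats unparseable or missing dates as "not in range":

```python
import pandas as pd

# Hypothetical miniature of dates.csv: two info columns plus two date columns.
data = pd.DataFrame({
    "EMPL. NO": [1, 2, 3],
    "VILLE": ["Montreal", "Quebec", "Laval"],
    "D_EXT_1": ["2021-05-01", "2023-03-15", "2025-01-10"],
    "D_EXT_2": ["2022-12-01", "2026-03-15", None],
})

# Date columns to search, and every column located before D_EXT_1.
date_cols = [c for c in data.columns if c.startswith("D_EXT_")]
info_cols = list(data.columns[: data.columns.get_loc("D_EXT_1")])

# Convert a copy of the date columns; unparseable or missing cells become NaT.
dates = data[date_cols].apply(pd.to_datetime, errors="coerce")

start, end = pd.Timestamp("2022-12-01"), pd.Timestamp("2024-12-31")
in_range = (dates >= start) & (dates <= end)  # cell-wise boolean frame

# Keep rows where at least one D_EXT_ date falls inside the range.
result = data.loc[in_range.any(axis=1), info_cols + date_cols]
print(result)
```

Converting into a separate dates frame leaves the original data untouched, and the final selection restricts the output to the info columns plus the D_EXT_ columns.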

Using df.isin() function over a column of tuples | Pandas

I have a dataframe consisting of Wikipedia articles with geocoordinates and some statistics. The column 'Availability' contains a tuple of the languages that article is available in (out of a selection).
What I'm trying to do is plot a bubble map with plotly, and the legend being the availability in those languages. For example, out of ['ca','es'] you would have [],['ca'],['es'],['ca','es'] meaning not available, only in catalan, only in spanish or available in both respectively.
The problem is that when I try to use those combinations to create a dataframe with only the matching rows using isin(), it always returns an empty dataframe.
The columns of the dataframe are:
Columns: [French Title, Qitem, Pageviews, page_title_1, page_title_2, Availability, Lat, Lon, Text]
Here is my code:
fig = go.Figure()
scale = 500
for comb in combinations:
    df_sub = df[df['Availability'].isin(tuple(comb))] # The problem is here. This returns an empty DF
    if len(df_sub.index) == 0: continue # There are no occurrences with that comb
    fig.add_trace(go.Scattergeo(
        lat=df_sub['Lat'],
        lon=df_sub['Lon'],
        text=df_sub['Text'],
        marker = dict(
            size = df[order_by],
            sizeref=2. * max(df[order_by]) / (scale ** 2),
            line_color='rgb(40,40,40)',
            line_width=0.5,
            sizemode='area'
        ), name = comb # Here is the underlying restriction. I need to separate the traces according to their availability.
    ))
PS: I guess it has something to do with pandas not working very well with lists or tuples as column values, but I couldn't figure out how to achieve what I want. Does anyone have any idea? Thank you in advance. comb appears as a string or a tuple of strings, ('es','ca'), but the values in df['Availability'] appear like (es, ca) when I print them.
Sample dataframe (sorry for the style, I'm new to Stack Overflow):
   French Title                Qitem      Pageviews \
0  Liban                       Q822       53903
1  France                      Q142       25728
2  Biélorussie                 Q184       21688
3  Île Maurice                 Q2656389   20478
4  Affaire Dupont de Ligonnès  Q16010109  16075
   page_title_1                                       page_title_2 \
0  Líbano                                             Líban
1  Francia                                            França
2  Bielorrusia                                        Bielorússia
3  Isla de Mauricio                                   Illa Maurici
4  Asesinatos y desapariciones de Dupont de Ligonnès
   Availability  Lat              Lon \
0  (es, ca)      33.90000000      35.53330000
1  (es, ca)      48.86700000      2.32650000
2  (es, ca)      53.528333333333  28.046666666667
3  (es, ca)      -20.30084200     57.58209200
4  (es,)         47.23613230      -1.56848610
   Text
0  Liban<br>(33.90000000, 35.53330000)<br>Q822
1  France<br>(48.86700000, 2.32650000)<br>Q142
2  Biélorussie<br>(53.528333333333, 28.046666666667)<br>Q184
3  Île Maurice<br>(-20.30084200, 57.58209200)<br>Q2656389
4  Affaire Dupont de Ligonnès<br>(47.23613230, -1.56848610)<br>Q16010109
You can use Series.apply() to achieve your goal:
df['Availability'].apply(lambda x: 'ca' in x)
That returns True for every row whose tuple contains 'ca'. It can easily be modified to return a label instead, e.g. Catalan.
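As far as I can tell, isin(tuple(comb)) compares each cell against the individual elements of comb (the strings 'es' and 'ca'), so a cell that holds a whole tuple never matches. A minimal sketch of matching whole combinations with apply instead, assuming a small made-up sample (the "Exemple" row and the matches helper are invented for illustration):

```python
import pandas as pd

# Hypothetical sample mirroring the 'Availability' column from the question.
df = pd.DataFrame({
    "French Title": ["Liban", "France", "Affaire Dupont de Ligonnès", "Exemple"],
    "Availability": [("es", "ca"), ("es", "ca"), ("es",), ()],
})

def matches(comb):
    """Boolean mask of rows whose languages equal `comb`, ignoring order."""
    wanted = set(comb)
    return df["Availability"].apply(lambda langs: set(langs) == wanted)

both = df[matches(("ca", "es"))]
only_es = df[matches(("es",))]
not_available = df[matches(())]

print(both["French Title"].tolist())
print(only_es["French Title"].tolist())
print(not_available["French Title"].tolist())
```

Comparing sets rather than tuples means ('ca', 'es') and ('es', 'ca') are treated as the same combination.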
In the end I turned the tuple into a list: since I no longer use df.isin(), it doesn't raise the unhashable type error. I was then able to separate the traces by combination using apply() (thanks to mkos for the idea):
for comb in combinations:
    if len(comb) == 0:
        name = 'Not available'
        df_sub = df[df['Availability'].apply(lambda x: len(x)==0)]
    else:
        df_sub = df[df['Availability'].apply(lambda x: set(comb) == set(x))]
        name = ','.join(comb)
    if len(df_sub.index) == 0: continue
    fig.add_trace(go.Scattergeo(
        lat=df_sub['Lat'],
        lon=df_sub['Lon'],
        text=df_sub['Text'],
        marker = dict(
            size = df_sub[order_by], # sizes must come from the same rows as lat/lon
            sizeref=2. * max(df[order_by]) / (scale ** 2),
            line_color='rgb(40,40,40)',
            line_width=0.5,
            sizemode='area'
        ), name = name
    ))
You can see the result here.

append value from loop linear regression in a certain position in dataframe

I am trying to store values from a linear regression run over a rolling window. The values are supposed to go into a specific position of my df (i.e. with a 2300 x 2300 df, the first value from the regression should go in the first column at the 228th row, and so on).
Here is my code.
Any help is more than welcome.
df_rolling_tstat # 2300 x 2300 dataframe
for i in range(len(Switz_fund_ret.iloc[1:, 1:2].columns)):
    s = Switz_fund_ret.loc[birth_date[i], :]
    start = s['contatore']
    e = Switz_fund_ret.loc[death_date[i], :]
    end = e['contatore']
    window = 12
    for j in range(end-start):
        roll_one = Switz_fund_ret[i].iloc[start+j:start+window+j]
        # market excess return while the fund was active
        roll_two = Switz_fund_ret[2308].iloc[start+j:start+window+j]
        # risk-free rate while the fund was active
        roll_three = Switz_fund_ret[2309].iloc[start+j:start+window+j]
        roll_excess_return_fund = roll_one - roll_three
        roll_two = sm.add_constant(roll_two)
        roll_y=np.array(roll_excess_return_fund, dtype=float)
        roll_x=np.array(roll_two, dtype=float)
        roll_model = sm.OLS(roll_y, roll_x).fit()
        roll_reg.append(roll_model)
        alpha_roll.append(roll_model.params[0])
        t_stat_roll.append(roll_model.tvalues[0])
        p_value_roll.append(roll_model.pvalues[0])
I would like, for instance, to retrieve roll_model.pvalues[0] and put it in the first column of df at 228th position. Afterward, for the 2nd regression, I want to store roll_model.pvalues[0] at 229th entry.
Many thanks.
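One possible way to do this, sketched under assumptions: pre-fill the target dataframe with NaN and write each result by integer position with .iat. The frame size, the names df_rolling_pvalue, fund_col and first_row, and the p-values below are all made up for illustration. In the real loop the value would be roll_model.pvalues[0] and the row offset would come from start, window and j:

```python
import numpy as np
import pandas as pd

# Hypothetical small stand-in for the 2300 x 2300 frame, pre-filled with NaN.
df_rolling_pvalue = pd.DataFrame(np.nan, index=range(10), columns=range(3))

fund_col = 0   # column of the fund being processed (i in the loop above)
first_row = 4  # zero-based row of the first result (227 for the "228th row")

# Stand-in values; in the real loop each one would be roll_model.pvalues[0].
fake_pvalues = [0.10, 0.20, 0.30]

for j, p in enumerate(fake_pvalues):
    # .iat writes one scalar at an integer (row, column) position.
    df_rolling_pvalue.iat[first_row + j, fund_col] = p

print(df_rolling_pvalue)
```

Cells that never receive a regression result stay NaN, which makes it easy to see where each fund's window starts and ends.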

Is it possible to do a dropna-like operation before creating a DataFrame?

I am trying to create a DataFrame with more than 500 rows, derived from an API query. When I check the length of my arrays, as so:
print(len(cities), len(country), len(max_temp), len(latit), len(longit), len(humid_), len(cloud_), len(wind))
I get the following output:
577 526 526 526 526 526 526 526
Now, I read the answers about casting these to Series, which adds NaN to the empty cells. The problem is that this mismatches the column values, i.e., all the numeric values are listed first, then all the NaN's at the end. This will cause the country, max_temp, etc., to line up with the wrong city. What I want to do is have the NaN appear in the correct row of each city with missing data. I could simply dropna if I had a DataFrame; but with the different array lengths, I cannot get a DataFrame.
Okay, editing in light of the comments: I began with a randomly generated list of coordinates, then:
for lat_lng in lat_lngs:
    city = citipy.nearest_city(lat_lng[0], lat_lng[1]).city_name
    # If the city is unique, then add it to our cities list
    if city not in cities:
        cities.append(city)
This generated a list of cities. Then I did:
country = []
latit = []
longit = []
max_temp = []
humid_ = []
cloud_ = []
wind = []
for city in cities:
    try:
        query_url = base_url + "q=" + city + "&appid=" + weather_api_key
        response = requests.get(query_url).json()
        country.append(response['sys']['country'])
        latit.append(response['coord']['lat'])
        longit.append(response['coord']['lon'])
        max_temp.append(response['main']['temp_max'])
        humid_.append(response['main']['humidity'])
        cloud_.append(response['clouds']['all'])
        wind.append(response['wind']['speed'])
    except:
        print(f'Data not found.')
What I believe is occurring is that I am getting an array something like this:
City         Country  Max Temp  (etc.)
Boston       US       30        (etc.)
Honolulu
Rome         IT       27
Vladivostok  RU       20
In this example, "Honolulu" had no data, so it generated a row with only the City column filled. I can't be sure, since I can't view it as a DataFrame. What I want to do is either put NaN in the same row as Honolulu, or drop the row with Honolulu.
So after consulting with an expert offline, I got this solution:
In my lists of variables, add a new one:
new_city = []
country = []
latit = []
longit = []
max_temp = []
humid_ = []
cloud_ = []
wind = []
Then, in the try block inside my loop:
for city in cities:
    try:
        query_url = base_url + "q=" + city + "&appid=" + weather_api_key
        response = requests.get(query_url).json()
        new_city.append(response['name'])
        country.append(response['sys']['country'])
        latit.append(response['coord']['lat'])
        longit.append(response['coord']['lon'])
        max_temp.append(response['main']['temp_max'])
        humid_.append(response['main']['humidity'])
        cloud_.append(response['clouds']['all'])
        wind.append(response['wind']['speed'])
    except:
        print(f'Data not found.')
And finally, build my dataframe with that new_city variable instead of the original city variable. This gives me all lists of the same length.
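If you would rather keep the cities that have no data and see NaN in their rows (the other option mentioned in the question), one approach is to collect one dict per city instead of parallel lists. This is only a sketch: fetch_weather and its canned responses are hypothetical stand-ins for requests.get(query_url).json(), and only two of the fields are shown:

```python
import pandas as pd

# Hypothetical stand-in for requests.get(query_url).json(); "Honolulu" has no data.
def fetch_weather(city):
    fake_api = {
        "Boston": {"sys": {"country": "US"}, "main": {"temp_max": 30}},
        "Rome": {"sys": {"country": "IT"}, "main": {"temp_max": 27}},
    }
    return fake_api.get(city, {"message": "city not found"})

cities = ["Boston", "Honolulu", "Rome"]
records = []
for city in cities:
    response = fetch_weather(city)
    # One dict per city keeps every value on the same row as its city;
    # fields missing from the response are stored as None.
    records.append({
        "City": city,
        "Country": response.get("sys", {}).get("country"),
        "Max Temp": response.get("main", {}).get("temp_max"),
    })

weather = pd.DataFrame(records)  # missing entries show up as None/NaN
complete = weather.dropna()      # drop the cities that had no data
print(weather)
print(complete)
```

Because every row is built in one step, a failed lookup can never shift the remaining values onto the wrong city.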
