Print formatting to file in Python

I wish to write to a file in a formatted way. I have been searching for how to do this, and the best solution I have managed to find is this:
write_to_file.write('{:20} {:20} {:3}\n'.format(w[0][0], w[0][1], w[1]))
However, when I do this the columns do not line up precisely:
det er 6
er det 5
den er 5
du kan 4
hva er 3
har en 3
er død 3
å gjøre 3
jeg vil 3
har vi 3
et dikt 2
når du 2
det var 2
må være 2
kan skrive 2
hva gjør 2
ha et 2
jeg har 2
du skal 2
vi kan 2
jeg kan 2
en vakker 2
er du 2
når man 2
får jeg 2
I get things printed in the fashion above. I need everything to align perfectly.

That's because you're still dealing with bytes: in Python 2 a plain str is a byte string, so multi-byte UTF-8 characters such as å and ø make the padded fields count bytes rather than visible characters. Once you start formatting actual unicode characters you'll find that they align perfectly.
write_to_file.write('{:20} {:20} {:3}\n'.format(u'å', u'gjøre', u'3'))
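To see the difference, here is a minimal sketch (the file name is illustrative; in Python 3 every str is already unicode):
# In Python 2 a plain str is a byte string: 'å' is 2 bytes in UTF-8,
# so a {:20} field pads to 20 bytes rather than 20 visible characters.
print(len(u'å'))                  # 1 character
print(len(u'å'.encode('utf-8')))  # 2 bytes
# Formatting unicode strings pads by character count, so columns line up
# (Python 3 shown here):
with open('pairs.txt', 'w', encoding='utf-8') as f:
    f.write('{:20} {:20} {:3}\n'.format('å', 'gjøre', '3'))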

The DataFrame class in the pandas library would also be worth looking into.
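For example, a rough sketch (the pairs list and column names are made up for illustration):
import pandas as pd

pairs = [('det', 'er', 6), ('er', 'det', 5), ('å', 'gjøre', 3)]
df = pd.DataFrame(pairs, columns=['first', 'second', 'count'])
# to_string() pads every column to a common width, so the output stays aligned
with open('pairs.txt', 'w', encoding='utf-8') as f:
    f.write(df.to_string(index=False))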

Related

Add a comma after two words in pandas

I have the following texts in a df column:
La Palma
La Palma Nueva
La Palma, Nueva Concepcion
El Estor
El Estor Nuevo
Nuevo Leon
San Jose
La Paz Colombia
Mexico Distrito Federal
El Estor, Nuevo Lugar
What I need is to add a comma at the end of each row, but only on the condition that the row contains exactly two words. I found a partial solution:
df['Column3'] = df['Column3'].apply(lambda x: str(x)+',')
(a solution found on Stack Overflow, but it appends the comma to every row)
Given:
words
0 La Palma
1 La Palma Nueva
2 La Palma, Nueva Concepcion
3 El Estor
4 El Estor Nuevo
5 Nuevo Leon
6 San Jose
7 La Paz Colombia
8 Mexico Distrito Federal
9 El Estor, Nuevo Lugar
Doing:
df.words = df.words.apply(lambda x: x + ',' if len(x.split(' ')) == 2 else x)
print(df)
Outputs:
words
0 La Palma,
1 La Palma Nueva
2 La Palma, Nueva Concepcion
3 El Estor,
4 El Estor Nuevo
5 Nuevo Leon,
6 San Jose,
7 La Paz Colombia
8 Mexico Distrito Federal
9 El Estor, Nuevo Lugar
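A vectorized alternative (a sketch, not from the original answer): build a boolean mask of the rows with exactly two words and append the comma with .loc:
# str.split() gives the word lists; .str.len() counts the words per row
two_words = df.words.str.split().str.len() == 2
df.loc[two_words, 'words'] = df.loc[two_words, 'words'] + ','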

Pandas categorize a dataframe based on another dataframe with substrings

I'm trying to learn pandas and Python by transferring some problems from Excel to pandas/Python. I have a big CSV file from my bank with over 10,000 records. I want to categorize the records based on their description. For that I have a big mapping file with keywords. In Excel I used VLOOKUP, and I'm trying to port this solution to pandas/Python.
I can read the CSV into a dataframe dfMain. One text column in dfMain, called Description, is the input I want to categorize based on the mapping file called dfMap.
dfMain looks simplified something like this:
Datum Bedrag Description
2020-01-01 -166.47 een cirkel voor je uit
2020-01-02 -171.79 even een borreling
2020-01-02 -16.52 stilte zacht geluid
2020-01-02 -62.88 een steentje in het water
2020-01-02 -30.32 gooi jij je zorgen weg
2020-01-02 -45.99 dan ben je laf weet je dat
2020-01-02 -322.44 je klaagt ook altijd over pech
2020-01-03 -4.80 jij kan niet ophouden zorgen
2020-01-07 5.00 de wereld te besnauwen
dfMap looks simplified like this
sleutel code
0 borreling A1
1 zorgen B2
2 steentje C2
3 een C1
dfMap contains keywords('sleutel') and a Category code ('code').
When a 'sleutel' is a substring of 'Description' in dfMain, a new column in dfMain called 'category' should get the value of the corresponding code.
I'm aware that multiple keywords can apply to certain descriptions, but the first match counts; in other words, the number of rows in dfMain must stay the same.
The resulting dataframe must then look like this:
Out[34]:
Datum Bedrag Description category
2020-01-01 -166.47 een cirkel voor je uit C1
2020-01-02 -171.79 even een borreling A1
2020-01-02 -16.52 stilte zacht geluid NaN
2020-01-02 -62.88 een steentje in het water C2
2020-01-02 -30.32 gooi jij je zorgen weg B2
2020-01-02 -45.99 dan ben je laf weet je dat NaN
2020-01-02 -322.44 je klaagt ook altijd over pech NaN
2020-01-03 -4.80 jij kan niet ophouden zorgen B2
2020-01-07 5.00 de wereld te besnauwen NaN
I tried a lot of things with join but can't get it to work.
Try this:
import pandas as pd
import numpy as np

# prepare the data
Datum = ['2020-01-01', '2020-01-02', '2020-01-02', '2020-01-02', '2020-01-02',
         '2020-01-02', '2020-01-02', '2020-01-03', '2020-01-07']
Bedrag = [-166.47, -171.79, -16.52, -62.88, -30.32, -45.99, -322.44, -4.80, 5.00]
Description = ["een cirkel voor je uit", "even een borreling", "stilte zacht geluid",
               "een steentje in het water", "gooi jij je zorgen weg",
               "dan ben je laf weet je dat", "je klaagt ook altijd over pech",
               "jij kan niet ophouden zorgen", "de wereld te besnauwen"]
dfMain = pd.DataFrame(Datum, columns=['Datum'])
dfMain['Bedrag'] = Bedrag
dfMain['Description'] = Description

sleutel = ["borreling", "zorgen", "steentje", "een"]
code = ["A1", "B2", "C2", "C1"]
dfMap = pd.DataFrame(sleutel, columns=['sleutel'])
dfMap['code'] = code
print(dfMap)

# solution: build a keyword -> code dict, then scan it per description
map_code = pd.Series(dfMap.code.values, index=dfMap.sleutel).to_dict()

def extract_codes(row):
    # the first matching keyword (in dfMap order) wins
    for item in map_code:
        if item in row:
            return map_code[item]
    return np.nan

dfMain['category'] = dfMain['Description'].apply(extract_codes)
print(dfMain)
An efficient solution is to use a regex with extract and then to map the result:
regex = '(%s)' % dfMap['sleutel'].str.cat(sep='|')
dfMain['category'] = (
    dfMain['Description']
    .str.extract(regex, expand=False)
    .map(dfMap.set_index('sleutel')['code'])
)
Output:
Datum Bedrag Description category
0 2020-01-01 -166.47 een cirkel voor je uit C1
1 2020-01-02 -171.79 even een borreling C1
2 2020-01-02 -16.52 stilte zacht geluid NaN
3 2020-01-02 -62.88 een steentje in het water C1
4 2020-01-02 -30.32 gooi jij je zorgen weg B2
5 2020-01-02 -45.99 dan ben je laf weet je dat NaN
6 2020-01-02 -322.44 je klaagt ook altijd over pech NaN
7 2020-01-03 -4.80 jij kan niet ophouden zorgen B2
8 2020-01-07 5.00 de wereld te besnauwen NaN
The generated regex ends up as '(borreling|zorgen|steentje|een)'. Note that str.extract returns the leftmost keyword found in each string, not the first keyword in dfMap order, which is why rows 1 and 3 get C1 (from 'een') here instead of the A1 and C2 shown in the question's expected output.

Cut string if the string size exceeds 5000 bytes

Currently I am trying to translate a pandas dataframe with Amazon Translate; however, the maximum allowed length of the request text is 5000 bytes, and the dataframe contains multiple strings that exceed this limit.
Therefore I want to implement a solution that cuts the strings in the "Content" column into chunks below 5000 bytes, the number of chunks depending on the original string size, so that the limit is not exceeded and no text is lost.
To be more precise: the dataframe contains 3 columns:
Newspaper Date Content
6 Trouw 2018 Het is de laatste kolenmijn in Duitsland de Pr...
7 Trouw 2018 Liever wat meer kwijt aan energieheffing dan r...
8 Trouw 2018 De VVD doet een voorstel dat op het Binnenhof ...
9 Trouw 2018 In Nederland bestaat grote weerstand tegen ker...
10 Trouw 2017 Theo Potma 1932 2017 had zijn blocnote altijd...
11 Trouw 2017 Hoe en hoe snel kan Nederland zijn beloften op...
12 Trouw 2017 transitie Hoe en hoe snel kan Nederland zijn ...
14 Trouw 2016 Welke ideeën koestert Angela Merkel Henri Beun...
15 Trouw 2016 Welke ideeën koestert Angela Merkel Henri Beun...
16 Trouw 2015 Rapport Dwing burger CO\n Nederland heeft e...
Only the "Content" column should be checked for string size and cut into chunks, while the original "Newspaper" and "Date" column data is kept. That way I can still trace the text back to the original row.
Is there anyone who can help with such a solution?
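One possible sketch (an assumption on my part, not an answer from the thread): split each Content string on whitespace, pack words into chunks whose UTF-8 encoding stays under the limit, then explode so every chunk keeps its Newspaper and Date. The chunk_text helper and max_bytes parameter are made up for illustration, and this assumes no single word exceeds the limit:
import pandas as pd

def chunk_text(text, max_bytes=5000):
    # Pack whole words into chunks of at most max_bytes of UTF-8.
    chunks, current, size = [], [], 0
    for word in str(text).split():
        word_bytes = len(word.encode('utf-8')) + 1  # +1 for the joining space
        if current and size + word_bytes > max_bytes:
            chunks.append(' '.join(current))
            current, size = [], 0
        current.append(word)
        size += word_bytes
    if current:
        chunks.append(' '.join(current))
    return chunks

df['Content'] = df['Content'].apply(chunk_text)
# explode() repeats Newspaper and Date for every chunk of the same row
df = df.explode('Content', ignore_index=True)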

How to get the relation between categorical and numerical variables of a dataframe?

I have a dataframe with 49 columns. Most of them are categorical (dtype object), some are numerical. As I'm a newbie in data science, I tried to plot a Pearson correlation heatmap to see the correlation of the independent variables, but only numeric variables are taken into account.
So how to get the relation between categorical and numerical variables of a dataframe?
Here is an excerpt of my dataframe:
>>> df1.head(3)
Sexe date_naissance Groupe_dage ville Statut_marital Niveau_de_scolarite Situation_professionnelle Autre_situation_professionnelle Revenu_mensuel Si_connexion_internet Canal_acces_info Autre_canal_acces_info Si_situtation_ville_degradee Si_intention_emigration Besoin_Sante Besoin_Education Besoin_Conditions_de_vie Besoin_Lutte_contre_criminalite Besoin_Emploi Besoin_Lutte_contre_corruption Besoin_Eau_potable Besoin_Infrastructures Besoin_Culture_art Besoin_Amelioration_services_publics Besoin_Acces_logement Besoin_Autres_besoins Non_declaration_besoins Autres_besoins Si_connait_president_commune Si_connait_parlementaires Si_inscrit_LE Si_vote_2016 Intention_vote_2021 Consentement Langue_du_questionnaire region id_reg status nbr_app adherent
0 Une femme 1964-04-15 Entre 45 et 54 ans Al Hoceima Marié et je n'ai pas encore d'enfants à charge 1er cycle universitaire / Licence Je suis independent NaN 5,000-7,499 DHS Oui Internet NaN Je suis d'accord Je ne suis pas d'accord True False True False True False False False False False False False False NaN Oui Oui Oui Oui Je sais déjà pour qui je vais voter en 2021 J'accepter d'être recontacté Arabe Tanger-Tetouan-Al Hoceima 1.0 Qualifié 3.0 True
1 Une femme NaN Entre 18 et 24 ans Tétouan Célibataire 1er cycle universitaire / Licence Je suis journalier, je travaille de temps à a... NaN 1-2,499 DHS Non Internet NaN Je suis d'accord Je suis d'accord True True False False True False False False False False False False False NaN Oui Non Non NaN Je ne voterai pas en 2021 Non Arabe Tanger-Tetouan-Al Hoceima 1.0 NaN NaN NaN
2 Un homme NaN Entre 25 et 34 ans Khenifra Marié et j'ai des enfants à charge Niveau lycée Je suis journalier, je travaille de temps à a... NaN Je préfére ne pas répondre Non Télévision NaN Je suis d'accord Je suis d'accord True False True False True False False False False False False False False NaN Oui Non Non NaN Je vais voter en 2021 mais je ne sais toujours... J'accepter d'être recontacté Arabe Beni Mellal-Khenifra 5.0 Na veut pas répondre 2.0 NaN
My attempt
Following this guide on categorical encoding I tried the following:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# for each column where the dtype is object, encode the categories as integer codes
for column in df1.columns:
    if df1[column].dtype == object:
        df1[column] = df1[column].astype('category')
        df1[column] = df1[column].cat.codes

# Pearson correlation, masking the upper triangle of the heatmap
cor = df1.corr()
mask = np.zeros_like(cor, dtype=bool)
mask[np.triu_indices_from(mask)] = True
plt.figure(figsize=(12, 10))
sns.heatmap(cor,
            vmin=-1,
            cmap='coolwarm',
            annot=False,
            mask=mask)
I guess this doesn't make sense, as I'm computing Pearson correlations between categorical variables, or between numerical and categorical variables.
For correlation between categorical variables you can use the bias-corrected Cramér's V, and for correlation between numerical and categorical variables you can use the correlation ratio.
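A minimal sketch of both measures (assuming scipy is installed; the function names cramers_v and correlation_ratio are illustrative, not library APIs, and both assume each variable has at least two distinct values):
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x, y):
    # Bias-corrected Cramér's V between two categorical series
    confusion = pd.crosstab(x, y)
    chi2 = chi2_contingency(confusion)[0]
    n = confusion.to_numpy().sum()
    phi2 = chi2 / n
    r, k = confusion.shape
    phi2corr = max(0, phi2 - (k - 1) * (r - 1) / (n - 1))
    rcorr = r - (r - 1) ** 2 / (n - 1)
    kcorr = k - (k - 1) ** 2 / (n - 1)
    return np.sqrt(phi2corr / min(kcorr - 1, rcorr - 1))

def correlation_ratio(categories, values):
    # Correlation ratio (eta): between-group variance over total variance
    df = pd.DataFrame({'cat': categories, 'val': values}).dropna()
    means = df.groupby('cat')['val'].mean()
    counts = df.groupby('cat')['val'].size()
    grand_mean = df['val'].mean()
    ss_between = (counts * (means - grand_mean) ** 2).sum()
    ss_total = ((df['val'] - grand_mean) ** 2).sum()
    return np.sqrt(ss_between / ss_total) if ss_total > 0 else 0.0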

In pandas how to create a dataframe from a list of dictionaries?

In Python 3 and pandas I have a list of dictionaries in this format:
a = [{'texto27/2': 'SENADO: PLS 00143/2016, de autoria de Telmário Mota, fala sobre maternidade e sofreu alterações em sua tramitação. Tramitação: Comissão de Assuntos Sociais. Situação: PRONTA PARA A PAUTA NA COMISSÃO. http://legis.senado.leg.br/sdleg-getter/documento?dm=2914881'}, {'texto27/3': 'SENADO: PEC 00176/2019, de autoria de Randolfe Rodrigues, fala sobre maternidade e sofreu alterações em sua tramitação. Tramitação: Comissão de Constituição, Justiça e Cidadania. Situação: PRONTA PARA A PAUTA NA COMISSÃO. http://legis.senado.leg.br/sdleg-getter/documento?dm=8027142'}, {'texto6/4': 'SENADO: PL 05643/2019, de autoria de Câmara dos Deputados, fala sobre violência sexual e sofreu alterações em sua tramitação. Tramitação: Comissão de Direitos Humanos e Legislação Participativa. Situação: MATÉRIA COM A RELATORIA. http://legis.senado.leg.br/sdleg-getter/documento?dm=8015569'}]
I tried to transform it into a dataframe with these commands:
import pandas as pd
df_lista_sentencas = pd.DataFrame(a)
df_lista_sentencas.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
texto27/2 1 non-null object
texto27/3 1 non-null object
texto6/4 1 non-null object
dtypes: object(3)
memory usage: 100.0+ bytes
But the generated dataframe puts each dictionary on its own row, leaving NaN everywhere else:
df_lista_sentencas.reset_index()
index texto27/2 texto27/3 texto6/4
0 0 SENADO: PLS 00143/2016, de autoria de Telmário... NaN NaN
1 1 NaN SENADO: PEC 00176/2019, de autoria de Randolfe... NaN
2 2 NaN NaN SENADO: PL 05643/2019, de autoria de Câmara do...
I would like to generate something like this:
texto27/2 texto27/3 texto6/4
SENADO: PLS 00143/2016, de autoria de Telmário... SENADO: PEC 00176/2019, de autoria de Randolfe.. SENADO: PL 05643/2019, de autoria de Câmara do...
Please, does anyone know how I can create a dataframe without these NaN cells?
Maybe using bfill to pull each column's single value up to the first row, then keeping only that row:
df = df_lista_sentencas.bfill().iloc[[0]]
print(df)
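An alternative sketch (not from the thread): merge the dictionaries into a single one before building the frame, which avoids the NaN cells entirely:
# each dict contributes its one key/value pair to a single row
merged = {k: v for d in a for k, v in d.items()}
df = pd.DataFrame([merged])
print(df)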
