Dictionary of unique words and their position in a file - python

I'm trying to build a 'database' of words and their corresponding tweet identifiers.
My guess is that a dictionary is the best option for doing this.
Each line contains an identifier, user, text and words, split on tabs.
Example of input:
1035421931321864192 SchipholWatch RT #vinvanoort: Zo, ik heb getekend Genoeg #geschiphol, hoogste tijd voor een eerlijk en duurzaam #luchtvaartbeleid RT #vinvanoort : Zo , ik heb getekend Genoeg #geschiphol , hoogste tijd voor een eerlijk en duurzaam #luchtvaartbeleid
1035421930541772800 ev4uam2 RT #AfshinEllian1: Kennelijk vinden ze daar aan die gezellige tafel normaal dat steltje barbaren onze grondwettelijke rechten bedreigen. Zouden we ook voor andere buitenwettelijke dreigingen moeten capituleren? Wat een door ons gesubsidieerde domheid! #laatop1 #cartoonwedstrijd RT #AfshinEllian1 : Kennelijk vinden ze daar aan die gezellige tafel normaal dat steltje barbaren onze grondwettelijke rechten bedreigen . Zouden we ook voor andere buitenwettelijke dreigingen moeten capituleren ? Wat een door ons gesubsidieerde domheid ! #laatop1 #cartoonwedstrijd
Example of desired output:
{'exampleword' : ['1035421930541772800', '1235424930545772800']}
Current code:
import sys

def main():
    olist = []
    worddict = {}
    for line in sys.stdin:
        i, u, t, w = line.split('\t')
        splitword = w.split()
        olist.extend(splitword)
    for num, name in enumerate(olist):
        print("{} [{}]".format(name.strip(), num))

main()
So far I have tried iterating over the lines and adding splitword plus i (which is the tweet identifier) to a dictionary, without success.

Basically what you want is to "reverse" a dictionary with list values into another dictionary with list values.
I abstracted away the actual tweet data since that would obscure the answer to the actual problem.
A straightforward implementation could be:
import collections

def reverse_dict(input):
    output = collections.defaultdict(list)
    for key, val in input.items():
        for item in val:
            output[item].append(key)
    return output

def main():
    input = {
        'u123': ['hello', 'world'],
        'u456': ['hello', 'you'],
        'u789': ['you', 'world'],
    }
    output = reverse_dict(input)
    print(dict(output))

if __name__ == '__main__':
    main()
As @Michael Butscher said, the expected output from your question is not a valid Python dictionary. The above code will output:
{'hello': ['u123', 'u456'], 'world': ['u123', 'u789'], 'you': ['u456', 'u789']}
Furthermore, as @Austin answered, approaching this problem by brute force won't necessarily give the best solution.
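Applied to the original tab-separated tweet lines, the same defaultdict idea can be sketched like this (the sample lines, the shortened identifiers and the `build_index` name are made up for illustration):

```python
import collections

def build_index(lines):
    # map each word to the list of tweet identifiers it appears in
    index = collections.defaultdict(list)
    for line in lines:
        identifier, user, text, words = line.rstrip('\n').split('\t')
        for word in words.split():
            index[word].append(identifier)
    return index

# made-up sample lines in the question's identifier/user/text/words format
lines = [
    "101\tSchipholWatch\tsome text\thello world",
    "102\tev4uam2\tother text\thello you",
]
print(dict(build_index(lines)))
# → {'hello': ['101', '102'], 'world': ['101'], 'you': ['102']}
```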

Related

Use of random in a Hangman game

This is a Hangman game. The user has one hint available, which can be used to reveal one of the letters of the word. I need it to reveal a letter that is still unknown (at the moment it picks a position at random, so when the user has the word almost done it will most likely reveal a letter that is already shown, so it's no real help).
How can I modify the code so that it reveals a letter that hasn't been revealed yet?
import random

# HANGMAN
lista_palabras = ['programacion', 'python', 'algoritmo', 'computacion', 'antioquia', 'turing', 'ingenio', 'AYUDA']
vidas = ['🧡', '🧡', '🧡', '🧡', '🧡', '🧡', '🧡']
num_word = random.randint(0, 6)
palabra = lista_palabras[num_word]
print(' _ ' * len(palabra))
print('Inicias con siete vidas', "".join(vidas), '\n', 'Pista: escribe AYUDA para revelar una letra (sólo tienes disponible 1)')
#print(palabra)
palabra_actual = ['_ '] * len(palabra)
posicion = 7
contador_pistas = 0
letra = ''  # initialize so the win check below doesn't fail on the first pass
while True:
    fullword = "".join(palabra_actual)
    # win condition
    if letra == palabra or fullword == palabra:
        print(palabra)
        print('¡GANASTE!')
        break
    letra = input('Inserta una letra: ')
    # condition that adds a guessed letter
    if letra in palabra:
        orden = [i for i in range(len(palabra)) if palabra[i] == letra]
        for letras in orden:
            palabra_actual[letras] = letra
        print(''.join(palabra_actual))
    # hint (AYUDA) condition
    elif letra == 'AYUDA' and contador_pistas == 0:
        pista = random.randint(0, len(palabra) - 1)
        palabra_actual[pista] = palabra[pista]
        print(''.join(palabra_actual))
        contador_pistas += 1
    # hint-limit condition
    elif letra == 'AYUDA' and contador_pistas >= 1:
        print('Ya no te quedan pistas restantes')
    # losing condition
    elif letra not in lista_palabras:
        posicion -= 1
        vidas[posicion] = '💀'
        print('¡Perdiste una vida!', ''.join(vidas))
        if posicion == 0:
            print('GAME OVER')
            break
Thank you <3
Create a list of index values for the word, and remove the indexes from the list as the user guesses the correct characters. Then, when they ask for help, you can call random.choice on the remaining indexes, so you are guaranteed to get one that hasn't been revealed yet:
...
...
palabra_actual = ['_ '] * len(palabra)
posicion = 7
contador_pistas = 0
indexes = list(range(len(palabra)))  # list of index values
letra = ''
while True:
    fullword = "".join(palabra_actual)
    if letra == palabra or fullword == palabra:
        print(palabra)
        print('¡GANASTE!')
        break
    letra = input('Inserta una letra: ')
    if letra in palabra:
        orden = [i for i in range(len(palabra)) if palabra[i] == letra]
        for letras in orden:
            indexes.remove(letras)  # remove index values for guessed characters
            palabra_actual[letras] = letra
        print(''.join(palabra_actual))
    elif letra == 'AYUDA' and contador_pistas == 0:
        pista = random.choice(indexes)  # choose from the remaining index values
        palabra_actual[pista] = palabra[pista]
        print(''.join(palabra_actual))
        contador_pistas += 1
...
...
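The effect of the index list can be checked in isolation with a small self-contained sketch (the word and the simulated guesses here are made up):

```python
import random

palabra = 'python'
indexes = list(range(len(palabra)))  # positions that are still hidden

# simulate the user correctly guessing 'p' and 'o'
for letra in ('p', 'o'):
    for i in [i for i in range(len(palabra)) if palabra[i] == letra]:
        indexes.remove(i)

# the hint can now only pick a position that has not been revealed yet
pista = random.choice(indexes)
print(pista, palabra[pista])
```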

Looking at the next word

I would like to know how I can find pairs of consecutive words that both start with a capital letter.
For example:
ID Testo
141 Vivo in una piccola città
22 Gli Stati Uniti sono una grande nazione
153 Il Regno Unito ha votato per uscire dall'Europa
64 Hugh Laurie ha interpretato Dr. House
12 Mi piace bere birra.
My expected output would be:
ID Testo Estratte
141 Vivo in una piccola città []
22 Gli Stati Uniti sono una grande nazione [Gli Stati, Stati Uniti]
153 Il Regno Unito ha votato per uscire dall'Europa [Il Regno, Regno Unito]
64 Hugh Laurie ha interpretato Dr. House [Hugh Laurie, Dr House]
12 Mi piace bere birra. []
To extract capitalised words I do:
df['Estratte'] = df['Testo'].str.findall(r'\b([A-Z][a-z]*)\b')
However, this column collects only single words, since the pattern does not look at the next word.
Could you please tell me which condition I should add to look at the next word?
Regex is not always the best tool; let us try split with explode:
s = df.Testo.str.split(' ').explode()
s2 = s.groupby(level=0).shift(-1)
assign = (s + ' ' + s2)[s.str.istitle() & s2.str.istitle()].groupby(level=0).agg(list)
assign
Out[244]:
1    [Gli Stati, Stati Uniti]
2     [Il Regno, Regno Unito]
3    [Hugh Laurie, Dr. House]
Name: Testo, dtype: object
df['New'] = assign
# notice that after the assignment, rows with no match will be NaN
Maybe you could use my code below:
def getCapitalize(myStr):
    words = myStr.split()
    for i in range(0, len(words) - 1):
        if words[i][0].isupper() and words[i+1][0].isupper():
            yield f"{words[i]} {words[i+1]}"
This function creates a generator, so you will have to convert the result to a list (or whatever container you need).
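For example, using one of the sample sentences from the question (the function is repeated here so the snippet runs on its own):

```python
def getCapitalize(myStr):
    words = myStr.split()
    for i in range(0, len(words) - 1):
        # yield every pair of consecutive words that both start uppercase
        if words[i][0].isupper() and words[i+1][0].isupper():
            yield f"{words[i]} {words[i+1]}"

print(list(getCapitalize('Gli Stati Uniti sono una grande nazione')))
# → ['Gli Stati', 'Stati Uniti']
```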
import re
import pandas as pd

x = {141: 'Vivo in una piccola città', 22: 'Gli Stati Uniti sono una grande nazione',
     153: "Il Regno Unito ha votato per uscire dall'Europa", 64: 'Hugh Laurie ha interpretato Dr. House',
     12: 'Mi piace bere birra.'}
df = pd.DataFrame(x.items(), columns=['id', 'testo'])
caps = []
vals = df.testo
for string in vals:
    string = string.split(' ')
    string = string[1:]
    string = ' '.join(string)
    caps.append(re.findall('([A-Z][a-z]+)', string))
df['Estratte'] = caps
Why not match a word that starts with a capital letter but is not at the start of the line?
df.Testo.str.findall('(?<!^)([A-Z]\w+)')
or
df.Testo.str.findall('(?<!^)[A-Z][a-z]+')
0                        []
1            [Stati, Uniti]
2    [Regno, Unito, Europa]
3       [Laurie, Dr, House]
4                        []
I think the simplest is to use the regex module and search for (pattern-space-pattern) with overlapping matches:
import regex as re
df['Estratte'] = df.Testo.apply(lambda x: re.findall('[A-Z][a-z]+[ ][A-Z][a-z]+', x, overlapped=True))
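If the third-party regex package is not available, the same overlapping behaviour can be emulated with the standard library's re via a zero-width lookahead capture (a sketch on one of the question's sentences):

```python
import re

testo = 'Gli Stati Uniti sono una grande nazione'
# the lookahead matches at every position, so overlapping pairs are captured
pairs = re.findall(r'(?=([A-Z][a-z]+ [A-Z][a-z]+))', testo)
print(pairs)
# → ['Gli Stati', 'Stati Uniti']
```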

Compare two columns and take their position

I have two files:
file1.txt:
-33.68;-53.48;Chuí;Rio Grande do Sul;Brazil;
-33.68;-53.4;Chuí;Rio Grande do Sul;Brazil;
-33.68;-53.32;Santa Vitória do Palmar;Rio Grande do Sul;Brazil;
-33.6;-53.48;Santa Vitória do Palmar;Rio Grande do Sul;Brazil;
-33.6;-53.4;Chuí;Rio Grande do Sul;Brazil;
file2.txt:
-37.6 -57.72 13
-37.6 -57.48 15
-33.6 -53.4 12
-33.6 -53.48 5
I want to compare lat and lon and join the lines
Expected result:
-33.6;-53.48;Santa Vitória do Palmar;Rio Grande do Sul;Brazil;5
-33.6;-53.4;Chuí;Rio Grande do Sul;Brazil;12
Code:
fileWrite = open("out.txt", "w")
gg2 = []
with open("ddprecip.txt", encoding="utf8", mode='r') as file5:
    bruto = [line.split() for line in file5]
    for dd in range(len(bruto)):
        lat = float(bruto[dd][0])
        lon = float(bruto[dd][1])
        valor = int(float(bruto[dd][2]))
        gg2.append(str(lat) + ";" + str(lon) + ";" + str(valor))
with open("geo2.txt", encoding="utf8", mode='r') as f:
    text = f.readlines()
    for ind in range(len(bruto)):
        coord2 = (gg2[ind].split(";")[0] + ";" + gg2[ind].split(";")[1])
        match = [i for i, x in enumerate(text) if coord2 in x]
        if match:
            variaveis = text[match[0]].split(";")
            show = coord2 + ";" + variaveis[2] + ";" + variaveis[3] + ";" + gg2[ind].split(";")[2] + ";" + variaveis[4]
            print(show)
            fileWrite.write(str(show.encode("utf-8")) + ";\n")
fileWrite.close()
Problem:
If file2 has a lat/lon of 3.6;-53.4, the substring check will still return the line:
-33.6;-53.4;Chuí;Rio Grande do Sul;Brazil;
I need the lat and lon to match exactly in both files.
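The false match can be reproduced directly with a substring test (the sample line is taken from file1.txt):

```python
line = '-33.6;-53.4;Chuí;Rio Grande do Sul;Brazil;'
coord2 = '3.6;-53.4'
# substring containment matches, even though the latitude is actually -33.6
print(coord2 in line)  # → True
```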
I think you're making what you want to do much harder than it needs to be, in a relatively slow way that uses a lot more memory than necessary.
To speed up the whole process, a dictionary named geo_dict is first created from the second file. It maps each unique (lat, lon) pair of values to a place name. This makes checking for matches much quicker than doing a linear search through the list of all of them.
It's also unnecessary to convert the values to floats and ints; in fact it may be better not to, because comparing float values for equality can be problematic on a computer.
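A quick illustration of why exact float comparisons are risky:

```python
# 0.1, 0.2 and 0.3 have no exact binary representation, so the sum
# of the first two is not exactly equal to the third.
print(0.1 + 0.2 == 0.3)    # → False
print(0.1 + 0.2)           # → 0.30000000000000004

# comparing the original strings sidesteps the problem entirely
print('-33.6' == '-33.6')  # → True
```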
Anyway, after the dictionary is created, each line in the first file can be read and processed sequentially. Note that lines with no match are skipped.
from pprint import pprint

with open("geo2.txt", encoding="utf8", mode='r') as file2:
    geo_dict = {tuple(line[:2]): line[2:5] for line in (line.split(';') for line in file2)}

pprint(geo_dict)
print()

with open("ddprecip.txt", encoding="utf8", mode='r') as file1, \
     open("out.txt", "w") as fileWrite:
    for line in (line.split() for line in file1):
        lat, lon, valor = line[:3]
        match = geo_dict.get((lat, lon))
        if match:
            show = ';'.join(line[:2] + match[:3] + [valor])
            fileWrite.write(show + "\n")

print('Done')
print('Done')
On-screen output:
{('-33.6', '-53.4'): ['Chuí', 'Rio Grande do Sul', 'Brazil'],
 ('-33.6', '-53.48'): ['Santa Vitória do Palmar',
                       'Rio Grande do Sul',
                       'Brazil'],
 ('-33.68', '-53.32'): ['Santa Vitória do Palmar',
                        'Rio Grande do Sul',
                        'Brazil'],
 ('-33.68', '-53.4'): ['Chuí', 'Rio Grande do Sul', 'Brazil'],
 ('-33.68', '-53.48'): ['Chuí', 'Rio Grande do Sul', 'Brazil']}
Done
Contents of the out.txt file created:
-33.6;-53.4;Chuí;Rio Grande do Sul;Brazil;12
-33.6;-53.48;Santa Vitória do Palmar;Rio Grande do Sul;Brazil;5

How do I divide a list into smaller lists

My list is formatted like:
gymnastics_school,participant_name,all-around_points_earned
I need to divide it up by schools but keep the scores.
import collections

def main():
    names = ["gymnastics_school", "participant_name", "all_around_points_earned"]
    Data = collections.namedtuple("Data", names)
    data = []
    with open('state_meet.txt', 'r') as f:
        for line in f:
            line = line.strip()
            items = line.split(',')
            items[2] = float(items[2])
            data.append(Data(*items))
These are examples of how they're set up:
Lanier City Gymnastics,Ben W.,55.301
Lanier City Gymnastics,Alex W.,54.801
Lanier City Gymnastics,Sky T.,51.2
Lanier City Gymnastics,William G.,47.3
Carrollton Boys,Cameron M.,61.6
Carrollton Boys,Zachary W.,58.7
Carrollton Boys,Samuel B.,58.6
La Fayette Boys,Nate S.,63
La Fayette Boys,Kaden C.,62
La Fayette Boys,Cohan S.,59.1
La Fayette Boys,Cooper J.,56.101
La Fayette Boys,Avi F.,53.401
La Fayette Boys,Frederic T.,53.201
Columbus,Noah B.,50.3
Savannah Metro,Levi B.,52.801
Savannah Metro,Taylan T.,52
Savannah Metro,Jacob S.,51.5
SAAB Gymnastics,Dawson B.,58.1
SAAB Gymnastics,Dean S.,57.901
SAAB Gymnastics,William L.,57.101
SAAB Gymnastics,Lex L.,52.501
Suwanee Gymnastics,Colin K.,57.3
Suwanee Gymnastics,Matthew B.,53.201
After processing it should look like:
Lanier City Gymnastics: participants (4)
as its own list
Carrollton Boys (3)
as its own list
La Fayette Boys (6)
etc.
I would recommend putting them in a dictionary:
data = {}
with open('state_meet.txt', 'r') as f:
    for line in f:
        line = line.strip()
        items = line.split(',')
        items[2] = float(items[2])
        if items[0] in data:
            data[items[0]].append(items[1:])
        else:
            data[items[0]] = [items[1:]]
Then the schools can be accessed in the following way:
>>> data['Lanier City Gymnastics']
[['Ben W.', 55.301], ['Alex W.', 54.801], ['Sky T.', 51.2], ['William G.', 47.3]]
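The if/else bookkeeping can also be handled by collections.defaultdict; a sketch on in-memory lines instead of the file (the sample lines are a subset of the question's data):

```python
import collections

lines = [
    'Lanier City Gymnastics,Ben W.,55.301',
    'Lanier City Gymnastics,Alex W.,54.801',
    'Carrollton Boys,Cameron M.,61.6',
]

# missing schools get an empty list automatically on first append
data = collections.defaultdict(list)
for line in lines:
    school, name, points = line.strip().split(',')
    data[school].append([name, float(points)])

print(data['Lanier City Gymnastics'])
# → [['Ben W.', 55.301], ['Alex W.', 54.801]]
```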
EDIT:
Assuming you need the whole dataset as a list first and then want to divide it into smaller lists, you can generate the dictionary from the list:
data = []
with open('state_meet.txt', 'r') as f:
    for line in f:
        line = line.strip()
        items = line.split(',')
        items[2] = float(items[2])
        data.append(items)

# perform median or other operations on your data

nested_data = {}
for items in data:
    if items[0] in nested_data:
        nested_data[items[0]].append(items[1:])
    else:
        nested_data[items[0]] = [items[1:]]
When you need to get a subset of a list you can use slicing:
mylist[start:stop:step]
where start, stop and step are all optional (see the link for a more comprehensive introduction).
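For instance:

```python
mylist = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

print(mylist[2:5])   # → [2, 3, 4]                       start and stop
print(mylist[:3])    # → [0, 1, 2]                       omitted start defaults to 0
print(mylist[::2])   # → [0, 2, 4, 6, 8]                 every second element
print(mylist[::-1])  # → [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]  negative step reverses
```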

Parsing a simple text file in python

I have the following text file taken from a CSV file. The file is too long to be shown properly here, so here's the line info:
The file has 5 lines:
The 1st one starts with ETIQUETAS
The 2nd one starts with RECURSOS
The 3rd one starts with DATOS CLIENTE Y PIEZA
The 4th one starts with Numero Referencia,
The 5th and last one starts with BRIDA Al
ETIQUETAS:;;;;;;;;;START;;;;;;;;;;;;;;;;;;;;;END;;
RECURSOS:;;;;;;;;;0;0;0;0;0;0;0;0;0;1;0;0;0;0;0;0;1;1;1;0;1;0;;Nota: 0
equivale a infinito, para decir que no existen recursos usar un numero
negativo DATOS CLIENTE Y PIEZA;;;;PLAZOS Y PROCESOS;;;;;;;;;;hoja
de ruta;MU;;;;;;;;;;;;;;;;; Numero Referencia;Descripcion
Referencia;Nombre Cliente;Codigo Cliente;PLAZO DE
ENTREGA;piezas;PROCESO;MATERIAL;stock;PROVEEDOR;tiempo ida
pulidor;pzas dia;TPO;tiempo vuelta pulidor;TIEMPO RECEPCION;CONTROL
CALIDAD DE ENTRADA;TIEMPO CONTROL CALIDAD DE ENTRADA;ALMACEN A (ANTES
DE ENTRAR
MAQUINA);GRANALLA;TPO;LIMPIADO;TPO;BRILLADO;TPO;;CARGA;MAQUINA;SOLTAR;control;EMPAQUETADO;ALMACENB;TIEMPO;
BRIDA Al;BRIDA Al;AEROGRAFICAS AHE,
S.A.;394;;;niquelado;aluminio;;;;matriz;;;5min;NO;;3dias;;;;;;;;1;1;1;;1;4D;;
I want to do two things:
1. Count the fields between START and END on the first line, both inclusive, and save the count as TOTAL_NUMBERS. This means for START;;END it has to count 3: the START itself, the blank field between the two ;, and the END itself. In the example from the file, START;;;;;;;;;;;;;;;;;;;;;END has to count 22.
What I've tried so far:
f = open("lt.csv", 'r')
array = []
for line in f:
    if 'START' in line:
        for i in line.split(";"):
            array.append(i)
i = 0
while i < len(array):
    if array[i] == 'START':
        # START COUNTING, I DON'T KNOW HOW TO CONTINUE
        pass
    i = i + 1
2. Check the file, go until the word PROVEEDOR appears, and save that word and the following TOTAL_NUMBERS fields (in the example, 22) in an array.
This means it has to save:
final_array = ['PROVEEDOR', 'tiempo ida pulidor', 'pzas dia', 'TPO', 'tiempo vuelta pulidor', 'TIEMPO RECEPCION', 'CONTROL CALIDAD DE ENTRADA', 'TIEMPO CONTROL CALIDAD DE ENTRADA', 'ALMACEN A (ANTES DE ENTRAR MAQUINA)', 'GRANALLA', 'TPO', 'LIMPIADO', 'TPO', 'BRILLADO', 'TPO', '', 'CARGA', 'MAQUINA', 'SOLTAR', 'control', 'EMPAQUETADO', 'ALMACENB']
Thanks in advance.
I am assuming the file is split into two parts: the first line with START and END, and then a long line which needs to be parsed. This should work:
with open('somefile.txt') as f:
    first_row = next(f).strip().split(';')
    TOTAL_NUMBER = len(first_row[first_row.index('START'):first_row.index('END')+1])
    bits = ''.join(line.rstrip() for line in f).split(';')
    final_array = bits[bits.index('PROVEEDOR'):bits.index('PROVEEDOR')+TOTAL_NUMBER]
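The same index/slice logic can be checked on a small made-up sample (shorter than the real file, so the count is 3 instead of 22):

```python
first_row = 'ETIQUETAS:;;;;START;;END;;'.strip().split(';')
TOTAL_NUMBER = len(first_row[first_row.index('START'):first_row.index('END') + 1])
print(TOTAL_NUMBER)  # → 3 (START, the empty field, END)

bits = 'x;PROVEEDOR;tiempo ida pulidor;pzas dia;TPO'.split(';')
final_array = bits[bits.index('PROVEEDOR'):bits.index('PROVEEDOR') + TOTAL_NUMBER]
print(final_array)   # → ['PROVEEDOR', 'tiempo ida pulidor', 'pzas dia']
```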
