Separating a .txt file with Python

I have to separate a .txt file into small pieces, based on a matched value. For example, I have a .txt file that looks like this:
Names Age Country
Mark 19 USA
John 19 UK
Elon 20 CAN
Dominic 21 USA
Andreas 21 UK
I have to extract all rows with the same "Age" value and copy them to another file, or perform some other action on them.
How can this be done with Python? I have never done this before.
Thank you in advance :)
I am asking because I have no idea how it should be done. The expected result is to have the data separated like this:
Names Age Country
Mark 19 USA
John 19 UK
Names Age Country
Elon 20 CAN
Names Age Country
Dominic 21 USA
Andreas 21 UK

Here is a possible solution:
with open('yourfile.txt') as infile:
    header = next(infile)  # keep the column header so it can be repeated in each output file
    ages = {}
    for line in infile:
        # split from the right so names containing spaces stay intact
        name, age, country = line.rsplit(' ', 2)
        if age not in ages:
            ages[age] = []
        ages[age].append(line)  # store the original line so writelines() gets strings

for age in ages:
    with open(f'age-{age}.txt', 'w') as agefile:
        agefile.write(header)          # write() takes a single string
        agefile.writelines(ages[age])  # writelines() takes an iterable of strings
For the sample you posted, the code above will leave you with files named age-19.txt, age-20.txt, and age-21.txt, with the contents separated by age, as you requested.
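If you prefer to avoid the explicit membership check, collections.defaultdict does the same grouping; a minimal sketch under the same assumptions (whitespace-separated columns, one header line):
from collections import defaultdict

with open('yourfile.txt') as infile:
    header = next(infile)
    ages = defaultdict(list)  # missing keys start out as empty lists
    for line in infile:
        _, age, _ = line.rsplit(' ', 2)
        ages[age].append(line)

for age, lines in ages.items():
    with open(f'age-{age}.txt', 'w') as agefile:
        agefile.write(header)
        agefile.writelines(lines)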

If you have them all in a list, you can use something like this:
alltext = ["Names Age Country", "Mark 21 USA", "John 21 UK",
           "Elon 20 CAN", "Dominic 21 USA", "Andreas 21 UK"]

Canada = [alltext[0]]     # creates a list with your column header
NotCanada = [alltext[0]]  # creates a list with your column header

for row in alltext[1:]:
    x = row.split()
    if x[2] == "CAN":
        Canada.append(row)
    else:
        NotCanada.append(row)

print(Canada)
print(NotCanada)
This will print two lists of your separated players:
['Names Age Country', 'Elon 20 CAN']
['Names Age Country', 'Mark 21 USA', 'John 21 UK', 'Dominic 21 USA', 'Andreas 21 UK']
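For completeness, a pandas sketch of the split-by-Age output the original question asks for; this assumes the file is whitespace-delimited, as in the sample:
import pandas as pd

df = pd.read_csv('yourfile.txt', sep=r'\s+')
for age, group in df.groupby('Age'):
    # each output file keeps the header row
    group.to_csv(f'age-{age}.txt', sep=' ', index=False)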

Related

Merging strings of people's names in pandas

I have two datasets that I want to merge based on the person's name. One dataset, player_nationalities, has their full names:
Player, Nationality
Kylian Mbappé, France
Wissam Ben Yedder, France
Gianluigi Donnarumma, Italy
The other dataset, player_ratings, shortens the first name to an initial followed by a full stop and keeps the other name(s).
Player, Rating
K. Mbappé, 93
W. Ben Yedder, 89
G. Donnarumma, 91
How do I merge these tables based on the column Player and avoid merging people with the same last name? This is my attempt:
df = pd.merge(player_nationality, player_ratings, on='Player', how='left')
Player, Nationality, Rating
K. Mbappé, France, NaN
W. Ben Yedder, France, NaN
G. Donnarumma, Italy, NaN
You would need to normalize the keys in both DataFrames in order to merge them.
One idea would be to create a function to process the full name in player_nationalities and merge on the processed value for the player name, e.g.:
def convert_player_name(name):
    try:
        first_name, last_name = name.split(' ', maxsplit=1)
        return f'{first_name[0]}. {last_name}'
    except ValueError:
        # single-word names have nothing to abbreviate
        return name

player_nationalities['processed_name'] = [convert_player_name(name)
                                          for name in player_nationalities['Player']]
df_merged = player_nationalities.merge(player_ratings, left_on='processed_name', right_on='Player')
[out]
Player_x Nationality processed_name Player_y Rating
0 Kylian Mbappé France K. Mbappé K. Mbappé 93
1 Wissam Ben Yedder France W. Ben Yedder W. Ben Yedder 89
2 Gianluigi Donnarumma Italy G. Donnarumma G. Donnarumma 91
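If the helper columns are unwanted in the final result, a small optional cleanup (the column names are taken from the output above):
df_merged = (df_merged
             .drop(columns=['processed_name', 'Player_y'])
             .rename(columns={'Player_x': 'Player'}))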

Python Text File to Data Frame with Specific Pattern

I am trying to convert a bunch of text files into a data frame using Pandas.
Each text file contains simple text which starts with two relevant pieces of information: the Number and the Register variables.
Then the text files have some random text that should not be taken into consideration.
Last, the text files contain information such as the share number, the person's name, birth date, address, and some additional rows that start with a lowercase letter. Each group contains this information, and the pattern is always the same: the first row of a group is defined by a number (hereafter, the id), followed by the word "SHARE".
Here is an example:
Number 01600 London Register 4314
Some random text...
1 SHARE: 73/1284
John Smith
BORN: 1960-01-01 ADR: Streetname 3/2 1000
f 4222/2001
h 1334/2000
i 5774/2000
4 SHARE: 58/1284
Boris Morgan
BORN: 1965-01-01 ADR: Streetname 4 2000
c 4222/1988
f 4222/2000
I need to transform the text into a data frame with the following output, where each group is stored in one row:
Number  Register  City    Id  Share    Name          Born        c          f          h          i
01600   4314      London  1   73/1284  John Smith    1960-01-01  NaN        4222/2001  1334/2000  5774/2000
01600   4314      London  4   58/1284  Boris Morgan  1965-01-01  4222/1988  4222/2000  NaN        NaN
My initial approach was to first read the text file and apply a regular expression for each case:
import pandas as pd
import re

df = open(r'Test.txt', 'r').read()
for line in re.findall('SHARE.*', df):
    print(line)
But probably there is a better way to do it.
Any help is highly appreciated. Thanks in advance.
This can be done without regex, using list comprehensions and string splitting:
import pandas as pd

text = '''Number 01600 London Register 4314
Some random text...
1 SHARE: 73/1284
John Smith
BORN: 1960-01-01 ADR: Streetname 3/2 1000
f 4222/2001
h 1334/2000
i 5774/2000
4 SHARE: 58/1284
Boris Morgan
BORN: 1965-01-01 ADR: Streetname 4 2000
c 4222/1988
f 4222/2000'''

text = [line.strip() for line in text.splitlines()]  # create a list of lines
data = []

# extract the metadata from the first line
number = text[0].split()[1]
city = text[0].split()[2]
register = text[0].split()[4]

# collect the index numbers of the lines where new items start
indices = [idx for idx, line in enumerate(text) if 'SHARE' in line]

# split the list at the retrieved indexes to get one list of lines per item
items = [text[i:j] for i, j in zip([0] + indices, indices + [None])][1:]

for item in items:
    d = {'Number': number, 'Register': register, 'City': city,
         'Id': int(item[0].split()[0]), 'Share': item[0].split(': ')[1],
         'Name': item[1], 'Born': item[2].split()[1]}
    rows = [s.split() for s in item[3:]]
    merged_rows = []
    for row in rows:
        if len(row[0]) == 1 and row[0].isalpha():
            merged_rows.append(row)  # a new lowercase-letter row
        else:
            # a continuation line: glue its value onto the previous row's value
            merged_rows[-1][-1] = merged_rows[-1][-1] + row[0]
    d.update({name: value for name, value in merged_rows})
    data.append(d)

# load the list of dicts as a dataframe
df = pd.DataFrame(data)
Output:
  Number  Register  City    Id  Share    Name          Born        f          h          i          c
0  01600      4314  London   1  73/1284  John Smith    1960-01-01  4222/2001  1334/2000  5774/2000  nan
1  01600      4314  London   4  58/1284  Boris Morgan  1965-01-01  4222/2000  nan        nan        4222/1988
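Since the question starts from a regex-based attempt, the metadata line alone could also be parsed that way. A minimal sketch; the pattern below is an assumption based on the 'Number <digits> <City> Register <digits>' layout in the example, not part of the answer above:
import re

# assumed layout: 'Number <digits> <City> Register <digits>'
m = re.match(r'Number\s+(\d+)\s+(.+?)\s+Register\s+(\d+)', text[0])
if m:
    number, city, register = m.groups()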

Converting 2 columns of names into 4 columns of names using Pandas

I have an Excel file that consists of two columns: last_name, first_name. The list is sorted by years of experience. I would like to create a new Excel file (or text file) that prints the names two by two.
Last First
Smith Joe
Jones Mary
Johnson Ken
etc
and converts it to
Smith Joe Jones Mary
Johnson Ken etc.
effectively printing every other name on the same row as the name above.
I have reached the point where the names can be printed into a single set of columns, but I can't move every other name to adjacent columns.
Thanks
TRY:
result = pd.concat([df.iloc[::2].reset_index(drop=True),
                    df.iloc[1::2].reset_index(drop=True)], axis=1)
OUTPUT:
Last First Last First
0 Smith Joe Jones Mary
1 Johnson Ken etc None
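For a full round trip from one Excel file to another, a minimal sketch; the file names are hypothetical, and to_excel needs an engine such as openpyxl installed:
import pandas as pd

df = pd.read_excel('names.xlsx')  # hypothetical input with Last/First columns
result = pd.concat([df.iloc[::2].reset_index(drop=True),
                    df.iloc[1::2].reset_index(drop=True)], axis=1)
result.to_excel('names_two_by_two.xlsx', index=False)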

dataframe one row save to excel sheet

I'd like to ask a simple question. I want to append the rows of a dataframe, without its column names, after the last row of an existing Excel sheet.
Dataframe:
date Name Age
2019/10/1 Kate 18
2019/10/2 Jim 20
2019/10/3 James 23
excel sheet:
date Name Age
2019/9/29 Rose 18
2019/9/30 Eva 20
I want to add dataframe values to excel last row,something like this
excel sheet new:
date Name Age
2019/9/29 Rose 18
2019/9/30 Eva 20
2019/10/1 Kate 18
2019/10/2 Jim 20
2019/10/3 James 23
code:
import xlwings as xw

app = xw.App(visible=False, add_book=False)
wb = app.books.open(path_xl)
sht = wb.sheets[sht_name]
rng = sht.range("A1")
last_rows = rng.current_region.rows.count
sht.range('A' + str(last_rows + 1)).value = df.values
but the result saved in the Excel sheet is wrong, and I don't know why.
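A minimal sketch of one likely fix, assuming path_xl, sht_name, and df are defined as in the question: write the DataFrame itself through xlwings' index/header options rather than the raw .values array, and save the workbook afterwards:
import xlwings as xw

app = xw.App(visible=False, add_book=False)
wb = app.books.open(path_xl)
sht = wb.sheets[sht_name]

last_row = sht.range('A1').current_region.rows.count
# write the DataFrame without its index or header, starting just below the existing data
sht.range(f'A{last_row + 1}').options(index=False, header=False).value = df

wb.save()
wb.close()
app.quit()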

Splitting multiple pipe delimited values in multiple columns of a comma delimited CSV and mapping them to each other

I have a comma-delimited CSV in which one column holds multiple pipe-delimited values. I need to map those values to another column that also holds multiple pipe-delimited values, then give each mapped pair its own row together with the data from the original row that doesn't have multiple values. My CSV looks like this (with commas between the categories):
row name city amount
1 frank | john | dave toronto | new york | anaheim 10
2 george | joe | fred fresno | kansas city | reno 20
I need it to look like this:
row name city amount
1 frank toronto 10
2 john new york 10
3 dave anaheim 10
4 george fresno 20
5 joe kansas city 20
6 fred reno 20
Maybe not the nicest, but a working solution (it handles rows without pipes and rows with different pipe counts):
import pandas as pd

df = pd.read_csv('<your_data>.csv')
str_split = ' | '

# Calculate the maximum number of piped (' | ') values per row
df['max_len'] = df[['name', 'city']].apply(lambda x: max(len(x[0].split(str_split)),
                                                         len(x[1].split(str_split))), axis=1)
max_len = df['max_len'].max()

# Split '|'-piped cell values into columns (needed at the unpivot step):
# create as many new 'name_<x>' & 'city_<x>' columns as 'max_len'
df[['name_{}'.format(i) for i in range(max_len)]] = df['name'].apply(
    lambda x: pd.Series(x.split(str_split)))
df[['city_{}'.format(i) for i in range(max_len)]] = df['city'].apply(
    lambda x: pd.Series(x.split(str_split)))

# Unpivot the 'name_<x>' & 'city_<x>' columns into rows
df_pv_name = pd.melt(df, value_vars=['name_{}'.format(i) for i in range(max_len)],
                     id_vars=['amount'])
df_pv_city = pd.melt(df, value_vars=['city_{}'.format(i) for i in range(max_len)],
                     id_vars=['amount'])

# Rename the unpivoted columns (these are the final columns)
df_pv_name = df_pv_name.rename(columns={'value': 'name'})
df_pv_city = df_pv_city.rename(columns={'value': 'city'})

# Remap the 'city_<x>' variable labels to 'name_<x>' to act as the join key
df_pv_city['variable'] = df_pv_city['variable'].map(
    {'city_{}'.format(i): 'name_{}'.format(i) for i in range(max_len)})

# Join the unpivoted 'name' & 'city' dataframes
df_res = df_pv_name.merge(df_pv_city, on=['variable', 'amount'])

# Drop the 'variable' column and the all-NULL rows that appear when the original
# rows have unequal pipe counts (replace 'all' with 'any' to drop any NULL row)
df_res = df_res.drop(['variable'], axis=1).dropna(subset=['name', 'city'], how='all',
                                                  axis=0).reset_index(drop=True)
The result is:
amount name city
0 10 frank toronto
1 20 george fresno
2 10 john new york
3 20 joe kansas city
4 10 dave anaheim
5 20 fred reno
Another test input:
name city amount
0 frank | john | dave | joe | bill toronto | new york | anaheim | los angeles | caracas 10
1 george | joe | fred fresno | kansas city 20
2 danny miami 30
Result of this test (if you don't want NaN rows, replace how='all' with how='any' in the dropna step):
amount name city
0 10 frank toronto
1 20 george fresno
2 30 danny miami
3 10 john new york
4 20 joe kansas city
5 10 dave anaheim
6 20 fred NaN
7 10 joe los angeles
8 10 bill caracas
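On newer pandas (str.split's regex= keyword needs ≥ 1.4; multi-column explode needs ≥ 1.3), the same fan-out can be written in a few lines. A sketch, assuming each row has matching pipe counts in name and city (explode raises otherwise):
import pandas as pd

df = pd.read_csv('<your_data>.csv')
# split both piped columns into lists, then explode them in parallel
df['name'] = df['name'].str.split(' | ', regex=False)
df['city'] = df['city'].str.split(' | ', regex=False)
df_res = df.explode(['name', 'city']).reset_index(drop=True)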
Given a row:
['1', 'frank|joe|dave', 'toronto|new york|anaheim', '20']
you can use
itertools.zip_longest(*[value.split('|') for value in row])
on it to obtain the following structure:
[('1', 'frank', 'toronto', '20'),
 (None, 'joe', 'new york', None),
 (None, 'dave', 'anaheim', None)]
Here we want to replace every None with the last seen value in the corresponding column, which can be done while looping over the result.
So, given a TSV already split by tabs, the following code should do the trick (izip_longest was the Python 2 name; in Python 3 it is zip_longest):
import itertools

def flatten_tsv(lines):
    result = []
    for line in lines:
        flat_lines = itertools.zip_longest(*[value.split('|') for value in line])
        for flat_line in flat_lines:
            # fill each None with the last seen value in that column
            result.append([result[-1][i] if v is None else v
                           for i, v in enumerate(flat_line)])
    return result
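A quick usage example on the sample row; the printed rows are what the function produces for this input:
rows = [
    ['1', 'frank|joe|dave', 'toronto|new york|anaheim', '20'],
]
for flat in flatten_tsv(rows):
    print(flat)
# ['1', 'frank', 'toronto', '20']
# ['1', 'joe', 'new york', '20']
# ['1', 'dave', 'anaheim', '20']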
