I am trying to convert a bunch of text files into a data frame using Pandas.
Each text file contains simple text which starts with two relevant pieces of information: the Number and the Register variables.
Then, the text files have some random text that should not be taken into consideration.
Last, the text files contain information such as the share number, the name of the person, birth date, address, and some additional rows that start with a lowercase letter. Each group contains such information, and the pattern is always the same: the first row of a group is defined by a number (hereafter the id), followed by the word "SHARE".
Here is an example:
Number 01600 London Register 4314
Some random text...
1 SHARE: 73/1284
John Smith
BORN: 1960-01-01 ADR: Streetname 3/2 1000
f 4222/2001
h 1334/2000
i 5774/2000
4 SHARE: 58/1284
Boris Morgan
BORN: 1965-01-01 ADR: Streetname 4 2000
c 4222/1988
f 4222/2000
I need to transform the text into a data frame with the following output, where each group is stored in one row:
Number  Register  City    Id  Share    Name          Born        c          f          h          i
01600   4314      London  1   73/1284  John Smith    1960-01-01  NaN        4222/2001  1334/2000  5774/2000
01600   4314      London  4   58/1284  Boris Morgan  1965-01-01  4222/1988  4222/2000  NaN        NaN
My initial approach was to first import the text file and apply a regular expression for each case:
import pandas as pd
import re

df = open(r'Test.txt', 'r').read()
for line in re.findall('SHARE.*', df):
    print(line)
But probably there is a better way to do it.
Any help is highly appreciated. Thanks in advance.
This can be done without regex, using list comprehensions and string splitting:
import pandas as pd
text = '''Number 01600 London Register 4314
Some random text...
1 SHARE: 73/1284
John Smith
BORN: 1960-01-01 ADR: Streetname 3/2 1000
f 4222/2001
h 1334/2000
i 5774/2000
4 SHARE: 58/1284
Boris Morgan
BORN: 1965-01-01 ADR: Streetname 4 2000
c 4222/1988
f 4222/2000'''
text = [i.strip() for i in text.splitlines()] # create a list of lines
data = []
# extract metadata from first line
number = text[0].split()[1]
city = text[0].split()[2]
register = text[0].split()[4]
# create a list of the index numbers of the lines where new items start
indices = [text.index(i) for i in text if 'SHARE' in i]
# split the list by the retrieved indexes to get a list of lists of items
items = [text[i:j] for i, j in zip([0]+indices, indices+[None])][1:]
for i in items:
    d = {'Number': number, 'Register': register, 'City': city,
         'Id': int(i[0].split()[0]), 'Share': i[0].split(': ')[1],
         'Name': i[1], 'Born': i[2].split()[1]}
    extra_rows = [s.split() for s in i[3:]]
    merged_rows = []
    for row in extra_rows:
        if len(row[0]) == 1 and row[0].isalpha():
            merged_rows.append(row)
        else:
            # continuation line: append its first token to the previous value
            merged_rows[-1][-1] = merged_rows[-1][-1] + row[0]
    d.update({name: value for name, value in merged_rows})
    data.append(d)
#load the list of dicts as a dataframe
df = pd.DataFrame(data)
Output:
  Number  Register    City  Id    Share          Name        Born          f          h          i          c
0  01600      4314  London   1  73/1284    John Smith  1960-01-01  4222/2001  1334/2000  5774/2000        nan
1  01600      4314  London   4  58/1284  Boris Morgan  1965-01-01  4222/2000        nan        nan  4222/1988
I have this dataframe, and I want to extract the cities into a separate column. As you can see, the format is not the same, and the city can be anywhere in the row. How can I extract only the cities into a new column?
Note: we are talking about German cities here. Maybe I could find a dictionary that lists all German cities and somehow compare it with my dataset?
Here is dictionary of german cities: https://gist.github.com/embayer/772c442419999fa52ca1
Dataframe
Adresse
0 Karlstr 10, 10 B, 30,; 04916 Hamburg
1 München Dorfstr. 28-55, 22555
2 Marnstraße. Berlin 12, 45666 Berlin
3 Musterstr, 24855 Dresden
... ...
850 Muster Hausweg 11, Hannover, 56668
851 Mariestr. 4, 48669 Nürnberg
852 Hilden Weederstr 33-55, 56889
853 Pt-gaanen-Str. 2, 45883 Potsdam
Output
Cities
0 Hamburg
1 München
2 Berlin
3 Dresden
... ...
850 Hannover
851 Nürnberg
852 Hilden
853 Potsdam
You could extract all the cities from the dictionary you provided into a list (I assume it's the 'stadt' key), and then use str.findall on your column:
cities_ = [c['stadt'] for c in cities]
df.Adresse.str.findall(r'|'.join(cities_))
>>>
0 [Karlstr, Hamburg]
1 []
2 []
3 []
4 []
5 []
6 []
7 []
8 []
Name: Adresse, dtype: object
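The partial matches (like Karlstr above) can often be avoided by escaping the names and anchoring the pattern on word boundaries. A sketch, assuming the same cities_ list built above:
import re

pattern = r'\b(?:' + '|'.join(map(re.escape, cities_)) + r')\b'
# take the first whole-word city match per address; rows without a match get NaN
df['Cities'] = df['Adresse'].str.extract('(' + pattern + ')', expand=False)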
You can simply use str.extract, since all the names are enclosed in a couple of stars.
df["cities"] = df["Adress"].str.extract(r'\*\*(\w+)\*\*')
Since it seems the stars are not present in your file, you can do it differently.
Use the dictionary of cities, called cities, from the file you linked, but keep only the unique city names in a set.
german_cities = set(map(lambda x: x['stadt'], cities))
Then, we'll split the address string of each row and look each piece up in the set of German cities.
Since apply passes each value of the series as the function's first argument, we only need to pass the set of German cities as an extra argument.
def lookup_cities(string, cities):
    splits = string.replace(",", "").split(" ")
    for s in splits:
        if s in cities:
            return s
    return "NaN"

df["Adress"].apply(lookup_cities, args=(german_cities,))
Now if you find any "NaN", then either a city in your document has a typo, or maybe there are several ways to write it; you'll have to investigate yourself.
P.S.: I had to remove all the spaces in the cities file, otherwise the names wouldn't match. It was just a matter of using find-and-replace-all in my editor.
You can use a regular expression to extract the city names, as they are indicated by **:
import re
import pandas as pd
df = pd.DataFrame({"Adresse": ["Karlstr 10, 10 B, 30,; 04916 **Hamburg**", "**München** Dorfstr. 28-55, 22555", "Marnstraße. Berlin 12, 45666 **Berlin**", "Musterstr, 24855 **Dresden**"]})
df['Cities'] = [re.findall(r".*\*\*(.*)\*\*", address)[0] for address in df['Adresse']]
This results in:
df
Adresse Cities
0 Karlstr 10, 10 B, 30,; 04916 **Hamburg** Hamburg
1 **München** Dorfstr. 28-55, 22555 München
2 Marnstraße. Berlin 12, 45666 **Berlin** Berlin
3 Musterstr, 24855 **Dresden** Dresden
I have this huge Netflix dataset and I am trying to see which actors appeared in the most movies/TV shows, specifically in America. First, I created a list of unique actors from the dataset. Then I wrote a nested for loop that goes through each name in list3 (which contains the unique actors) and checks, for each row in df3 (the filtered dataset with 2000+ rows), whether the cast column contains the current actor's name. I believe using iterrows takes too long.
myDict1 = {}
for name in list3:
    if name not in myDict1:
        myDict1[name] = 0
    for index, row in df3.iterrows():
        if name in row["cast"]:
            myDict1[name] += 1
myDict1
Title   cast
Movie1  Robert De Niro, Al Pacino, Tarantino
Movie2  Tom Hanks, Robert De Niro, Tom Cruise
Movie3  Tom Cruise, Zendaya, Seth Rogen
I want my output to be like this:
Name            Count
Robert De Niro  2
Tom Cruise      2
Use split, explode and value_counts:
out = df['cast'].str.split(', ').explode().value_counts()
out = pd.DataFrame({'Name': out.index, 'Count': out.values})
>>> out
Name Count
0 Tom Cruise 2
1 Robert De Niro 2
2 Zendaya 1
3 Seth Rogen 1
4 Tarantino 1
5 Al Pacino 1
6 Tom Hanks 1
l = ['Robert De Niro', 'Tom Cruise']  # list of actors to count

# convert cast into a list and explode it into one row per actor
df = df.assign(cast=df['cast'].str.split(', ')).explode('cast')

# group by cast, find the size and rename the columns
df[df['cast'].str.contains("|".join(l))].groupby('cast').size().reset_index().rename(columns={'cast': 'Name', 0: 'Count'})
Name Count
0 Robert De Niro 2
1 Tom Cruise 2
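Note that str.contains does substring matching, so a name like Tom Cruise would also match a hypothetical Tom Cruise Jr. Matching exactly with isin is a safer sketch, assuming the same list l:
exploded = df.assign(cast=df['cast'].str.split(', ')).explode('cast')
out = (exploded[exploded['cast'].isin(l)]
       .groupby('cast').size()
       .reset_index(name='Count')
       .rename(columns={'cast': 'Name'}))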
You could use collections.Counter to get the counts of the actors, after splitting the strings:
from collections import Counter
pd.DataFrame(Counter(df.cast.str.split(", ").sum()).items(),
columns = ['Name', 'Count'])
Name Count
0 Robert De Niro 2
1 Al Pacino 1
2 Tarantino 1
3 Tom Hanks 1
4 Tom Cruise 2
5 Zendaya 1
6 Seth Rogen 1
If you are keen on speed, and you have lots of data, you could do the entire processing in plain Python and rebuild the dataframe:
from itertools import chain
pd.DataFrame(Counter(chain.from_iterable(ent.split(", ")
for ent in df.cast)).items(),
columns = ['Name', 'Count'])
I would like to generate an integer-based unique ID for users (in my df).
Let's say I have:
index first last dob
0 peter jones 20000101
1 john doe 19870105
2 adam smith 19441212
3 john doe 19870105
4 jenny fast 19640822
I would like to generate an ID column like so:
index first last dob id
0 peter jones 20000101 1244821450
1 john doe 19870105 1742118427
2 adam smith 19441212 1841181386
3 john doe 19870105 1742118427
4 jenny fast 19640822 1687411973
A 10-digit ID, based on the values of the fields (identical row values, like the two john doe rows, get the same ID).
I've looked into hashing, encrypting, UUID's but can't find much related to this specific non-security use case. It's just about generating an internal identifier.
I can't use groupby/cat-codes type methods in case the order of the rows changes.
The dataset won't grow beyond 50k rows.
Safe to assume there won't be a first, last, dob duplicate.
Feel like I may be tackling this the wrong way as I can't find much literature on it!
Thanks
You can try using the built-in hash function.
df['id'] = df[['first', 'last']].sum(axis=1).map(hash)
Please note the hash id is usually longer than 10 digits and can be negative; also, Python's string hashing is randomized per process, so the ids are not reproducible across runs unless PYTHONHASHSEED is fixed.
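If reproducible ids matter, a deterministic digest avoids the per-process randomization. A minimal sketch, assuming the first/last/dob columns from the question and using md5 purely as a fast, stable fingerprint:
import hashlib

def stable_id(row):
    # build one canonical key from the identifying fields
    key = f"{row['first']}|{row['last']}|{row['dob']}".encode('utf-8')
    # hash deterministically and keep the low 10 decimal digits
    return int(hashlib.md5(key).hexdigest(), 16) % 10**10

df['id'] = df.apply(stable_id, axis=1)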
Here's a way of doing it using numpy:
import numpy as np
np.random.seed(1)
# create a list of unique names
names = df[['first', 'last']].agg(' '.join, 1).unique().tolist()
# generate ids
ids = np.random.randint(low=1e9, high=1e10, size = len(names))
# maps ids to names
maps = {k:v for k,v in zip(names, ids)}
# add new id column
df['id'] = df[['first', 'last']].agg(' '.join, 1).map(maps)
index first last dob id
0 0 peter jones 20000101 9176146523
1 1 john doe 19870105 8292931172
2 2 adam smith 19441212 4108641136
3 3 john doe 19870105 8292931172
4 4 jenny fast 19640822 6385979058
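Note that independently drawn random ids can collide: by the birthday bound, 50k draws from a range of ~9e9 values produce at least one collision with roughly 13% probability, so it is worth verifying the ids array from above:
# re-draw with a different seed if this ever fires
assert len(set(ids)) == len(ids), 'id collision detected'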
You can apply the below function on your data frame column.
def generate_id(s):
    return abs(hash(s)) % (10 ** 10)

df['id'] = df['first'].apply(generate_id)
In case you find that some values don't come out with the exact number of digits, you can do something like the below:
def generate_id(s, size):
    val = str(abs(hash(s)) % (10 ** size))
    if len(val) < size:
        diff = size - len(val)
        val = str(val) + str(generate_id(s[:diff], diff))
    return int(val)
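A quick usage sketch, assuming the dataframe from the question (the exact digits will differ per process because of hash randomization):
# combine the identifying fields so identical rows map to the same id
key = df['first'] + df['last'] + df['dob'].astype(str)
df['id'] = key.apply(lambda s: generate_id(s, 10))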
I converted this page (it's squad lists for different sports teams) from PDF to text using this code:
import PyPDF3
import sys
import tabula
import pandas as pd

# One method
pdfFileObj = open(sys.argv[1], 'rb')
pdfReader = PyPDF3.PdfFileReader(pdfFileObj)
num_pages = pdfReader.numPages
count = 0
text = ""
while count < num_pages:
    pageObj = pdfReader.getPage(count)
    count += 1
    text += pageObj.extractText()
print(text)
The output looks like this:
2019 SEASON
PREMIER DIVISION SQUAD NUMBERS
CLUB: BOHEMIANS
1
James Talbot
GK
2
Derek Pender
DF
3
Darragh Leahy
DF
.... some more names....
2019 SEASON
PREMIER DIVISION SQUAD NUMBERS
CLUB: CORK CITY
1
Mark McNulty
GK
2
Colm Horgan
DF
3
Alan Bennett
DF
....some more names....
2019 SEASON
PREMIER DIVISION SQUAD NUMBERS
CLUB: DERRY CITY
1
Peter Cherrie
GK
2
Conor McDermott
DF
3
Ciaran Coll
DF
I wanted to transform this output to a tab delimited file with three columns: team name, player name, and number. So for the example I gave, the output would look like:
Bohemians James Talbot 1
Bohemians Derek Pender 2
Bohemians Darragh Leahy 3
Cork City Mark McNulty 1
Cork City Colm Horgan 2
Cork City Alan Bennett 3
Derry City Peter Cherrie 1
Derry City Conor McDermott 2
Derry City Ciaran Coll 3
I know I need to first (1) divide the file into sections based on team, and then (2) within each team section, combine each name + number field into pairs to assign each number to a name.
I wrote this little bit of code to split the big file up by sports team:
import sys

fileopen = open(sys.argv[1])
recording = False
for line in fileopen:
    if not recording:
        if line.startswith('PREMI'):
            recording = True
    elif line.startswith('2019 SEA'):
        recording = False
    else:
        print(line)
But I'm stuck, because the above code won't divide up the block of text per team (i.e. I need multiple blocks of text extracted to separate strings or lists?). Can someone advise how to divide up the text file per team? In this example, I should be left with three blocks of text, and then I can somehow work on each team-divided block to pair numbers with names.
Soooo, not necessarily true to form, and I don't take into consideration the other libraries you used, but this was designed to give you a start. You can reformat it however you wish.
>>> string = '''2019 SEASON
PREMIER DIVISION SQUAD NUMBERS
CLUB: BOHEMIANS
1
James Talbot
GK
2
Derek Pender
DF
3
Darragh Leahy
DF
.... some more names....
2019 SEASON
PREMIER DIVISION SQUAD NUMBERS
CLUB: CORK CITY
1
Mark McNulty
GK
2
Colm Horgan
DF
3
Alan Bennett
DF
....some more names....
2019 SEASON
PREMIER DIVISION SQUAD NUMBERS
CLUB: DERRY CITY
1
Peter Cherrie
GK
2
Conor McDermott
DF
3
Ciaran Coll
DF'''
>>> def reorder(string):
        import re
        headers = ['Team', 'Name', 'Number']
        print('\n')
        print(headers)
        print()
        paragraphs = re.findall(r'2019[\S\s]+?(?=2019|$)', string)
        for paragraph in paragraphs:
            club = re.findall(r'(?i)CLUB:[\s]*([\S\s]+?)\n', paragraph)
            names_numbers = re.findall(r'(?i)([\d]+)[\n]{1,3}[\s]*([\S ]+)', paragraph)
            for i in range(len(names_numbers)):
                if len(club) == 1:
                    print(club[0] + ' | ' + names_numbers[i][1] + ' | ' + names_numbers[i][0])
>>> reorder(string)
['Team', 'Name', 'Number']
BOHEMIANS | James Talbot | 1
BOHEMIANS | Derek Pender | 2
BOHEMIANS | Darragh Leahy | 3
CORK CITY | Mark McNulty | 1
CORK CITY | Colm Horgan | 2
CORK CITY | Alan Bennett | 3
DERRY CITY | Peter Cherrie | 1
DERRY CITY | Conor McDermott | 2
DERRY CITY | Ciaran Coll | 3
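To get from there to the tab-delimited file the question asks for, a sketch that collects rows instead of printing them, reusing the same regexes (the output filename squads.tsv is an assumption):
import csv
import re

def to_tsv(string, path):
    rows = []
    for paragraph in re.findall(r'2019[\S\s]+?(?=2019|$)', string):
        club = re.findall(r'(?i)CLUB:[\s]*([\S\s]+?)\n', paragraph)
        names_numbers = re.findall(r'(?i)([\d]+)[\n]{1,3}[\s]*([\S ]+)', paragraph)
        for number, name in names_numbers:
            if len(club) == 1:
                rows.append([club[0], name, number])
    with open(path, 'w', newline='') as f:
        csv.writer(f, delimiter='\t').writerows(rows)

to_tsv(string, 'squads.tsv')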
I have a CSV with comma delimiters that has multiple values in a column, delimited by a pipe, and I need to map them to another column with multiple pipe-delimited values, giving each pair its own row along with the data in the original row that doesn't have multiple values. My CSV looks like this (with commas between the categories):
row name city amount
1 frank | john | dave toronto | new york | anaheim 10
2 george | joe | fred fresno | kansas city | reno 20
I need it to look like this:
row name city amount
1 frank toronto 10
2 john new york 10
3 dave anaheim 10
4 george fresno 20
5 joe kansas city 20
6 fred reno 20
Maybe not the nicest, but a working solution (it handles rows without pipes and rows with different pipe-lengths):
df = pd.read_csv('<your_data>.csv')
str_split = ' | '
# Calculate maximum length of piped (' | ') values
df['max_len'] = df[['name', 'city']].apply(lambda x: max(len(x[0].split(str_split)),
                                                         len(x[1].split(str_split))), axis=1)
max_len = df['max_len'].max()
# Split '|' piped cell values into columns (needed at unpivot step)
# Create as many new 'name_<x>' & 'city_<x>' columns as 'max_len'
df[['name_{}'.format(i) for i in range(max_len)]] = df['name'].apply(lambda x: \
pd.Series(x.split(str_split)))
df[['city_{}'.format(i) for i in range(max_len)]] = df['city'].apply(lambda x: \
pd.Series(x.split(str_split)))
# Unpivot 'name_<x>' & 'city_<x>' columns into rows
df_pv_name = pd.melt(df, value_vars=['name_{}'.format(i) for i in range(max_len)],
id_vars=['amount'])
df_pv_city = pd.melt(df, value_vars=['city_{}'.format(i) for i in range(max_len)],
id_vars=['amount'])
# Rename upivoted columns (these are the final columns)
df_pv_name = df_pv_name.rename(columns={'value':'name'})
df_pv_city = df_pv_city.rename(columns={'value':'city'})
# Rename 'city_<x>' values (rows) to be 'key' for join (merge)
df_pv_city['variable'] = df_pv_city['variable'].map({'city_{}'.format(i):'name_{}'\
.format(i) for i in range(max_len)})
# Join unpivoted 'name' & 'city' dataframes
df_res = df_pv_name.merge(df_pv_city, on=['variable', 'amount'])
# Drop 'variable' column and NULL rows (these appear when pipe-lengths differ between rows)
# If you want to drop rows with any NULL value then replace 'all' with 'any'
df_res = df_res.drop(['variable'], axis=1).dropna(subset=['name', 'city'], how='all',
axis=0).reset_index(drop=True)
The result is:
amount name city
0 10 frank toronto
1 20 george fresno
2 10 john new york
3 20 joe kansas city
4 10 dave anaheim
5 20 fred reno
Another test input:
name city amount
0 frank | john | dave | joe | bill toronto | new york | anaheim | los angeles | caracas 10
1 george | joe | fred fresno | kansas city 20
2 danny miami 30
Result of this test (if you don't want NaN rows, replace how='all' with how='any' in the dropna call):
amount name city
0 10 frank toronto
1 20 george fresno
2 30 danny miami
3 10 john new york
4 20 joe kansas city
5 10 dave anaheim
6 20 fred NaN
7 10 joe los angeles
8 10 bill caracas
Given a row:
['1','frank|joe|dave', 'toronto|new york|anaheim', '20']
you can use
itertools.zip_longest(*[value.split('|') for value in row])
(izip_longest in Python 2) on it to obtain the following structure:
[('1', 'frank', 'toronto', '20'),
(None, 'joe', 'new york', None),
(None, 'dave', 'anaheim', None)]
Here we want to replace each None with the last seen value in the corresponding column, which can be done while looping over the result.
So given a TSV already split by tabs, the following code should do the trick:
import itertools

def flatten_tsv(lines):
    result = []
    for line in lines:
        flat_lines = itertools.zip_longest(*[value.split('|') for value in line])
        for flat_line in flat_lines:
            # fill None with the last seen value in the same column
            result.append([result[-1][i] if v is None else v
                           for i, v in enumerate(flat_line)])
    return result
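A quick usage sketch with the question's rows, already split on tabs (the literal values are taken from the example):
rows = [
    ['1', 'frank|john|dave', 'toronto|new york|anaheim', '10'],
    ['2', 'george|joe|fred', 'fresno|kansas city|reno', '20'],
]
for flat in flatten_tsv(rows):
    print(flat)
# ['1', 'frank', 'toronto', '10']
# ['1', 'john', 'new york', '10']
# ['1', 'dave', 'anaheim', '10']
# ['2', 'george', 'fresno', '20']
# ['2', 'joe', 'kansas city', '20']
# ['2', 'fred', 'reno', '20']
Note the row ids are carried over from each source row rather than renumbered 1-6 as in the desired output.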