Converting a text document with special format to Pandas DataFrame
I have a text file with the following format:
1: frack 0.733, shale 0.700,
10: space 0.645, station 0.327, nasa 0.258,
4: celebr 0.262, bahar 0.345
I need to convert this text to a DataFrame with the following format:
Id Term weight
1 frack 0.733
1 shale 0.700
10 space 0.645
10 station 0.327
10 nasa 0.258
4 celebr 0.262
4 bahar 0.345
How can I do it?
python pandas
asked 9 hours ago by Mary, edited 6 hours ago by Brad Solomon
I can only think of regex helping here.
– amanb
9 hours ago
Depending on how large/long your file is, you can loop through the file without pandas to format it properly first.
– Quang Hoang
8 hours ago
It can be done with explode and split
– Wen-Ben
8 hours ago
Also, when you read the text into pandas, what is the format of the df?
– Wen-Ben
8 hours ago
The data is in text format.
– Mary
8 hours ago
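For reference, a minimal sketch of the explode-and-split idea Wen-Ben mentions above (my interpretation, not his code; assumes pandas >= 0.25 for Series.explode and a file named doc.txt):
import pandas as pd

# Read "id: rest" lines; the id becomes the index (engine='python' handles the 2-char sep).
s = pd.read_csv("doc.txt", sep=": ", header=None, index_col=0, engine="python")[1]
# Split each row on commas, put one "term weight" string per row, and tidy up.
s = s.str.strip(" ,").str.split(",").explode().str.strip()
df = s.str.split(" ", expand=True).rename_axis("Id").reset_index()
df.columns = ["Id", "Term", "weight"]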
8 Answers
Here's an optimized way to parse the file with re, first taking the ID and then parsing the data tuples. This takes advantage of the fact that file objects are iterable: when you iterate over an open file, you get the individual lines as strings, from which you can extract the meaningful data elements.
import re
import pandas as pd

SEP_RE = re.compile(r":\s+")
DATA_RE = re.compile(r"(?P<term>[a-z]+)\s+(?P<weight>\d+\.\d+)", re.I)

def parse(filepath: str):
    def _parse(filepath):
        with open(filepath) as f:
            for line in f:
                # Split off the leading "id:" once, then pull out each (term, weight) pair.
                id, rest = SEP_RE.split(line, maxsplit=1)
                for match in DATA_RE.finditer(rest):
                    yield [int(id), match["term"], float(match["weight"])]
    return list(_parse(filepath))
Example:
>>> df = pd.DataFrame(parse("/Users/bradsolomon/Downloads/doc.txt"),
... columns=["Id", "Term", "weight"])
>>>
>>> df
Id Term weight
0 1 frack 0.733
1 1 shale 0.700
2 10 space 0.645
3 10 station 0.327
4 10 nasa 0.258
5 4 celebr 0.262
6 4 bahar 0.345
>>> df.dtypes
Id int64
Term object
weight float64
dtype: object
Walkthrough
SEP_RE looks for an initial separator: a literal : followed by one or more whitespace characters. It uses maxsplit=1 to stop once the first split is found. Granted, this assumes your data is strictly formatted: that the format of your entire dataset consistently follows the example format laid out in your question.
After that, DATA_RE.finditer() deals with each (term, weight) pair extracted from rest. The string rest itself will look like frack 0.733, shale 0.700,. .finditer() gives you multiple match objects, where you can use ["key"] notation to access the element from a given named capture group, such as (?P<term>[a-z]+).
An easy way to visualize this is to use an example line from your file as a string:
>>> line = "1: frack 0.733, shale 0.700,\n"
>>> SEP_RE.split(line, maxsplit=1)
['1', 'frack 0.733, shale 0.700,\n']
Now you have the initial ID and rest of the components, which you can unpack into two identifiers.
>>> id, rest = SEP_RE.split(line, maxsplit=1)
>>> it = DATA_RE.finditer(rest)
>>> match = next(it)
>>> match
<re.Match object; span=(0, 11), match='frack 0.733'>
>>> match["term"]
'frack'
>>> match["weight"]
'0.733'
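To collect every pair at once instead of stepping with next(), the same match objects can be gathered in a comprehension:
>>> [(m["term"], m["weight"]) for m in DATA_RE.finditer(rest)]
[('frack', '0.733'), ('shale', '0.700')]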
The better way to visualize it is with pdb. Give it a try if you dare ;)
Disclaimer
This is one of those questions that demands a particular type of solution that may not generalize well if you ease up restrictions on your data format.
For instance, it assumes that each Term can only take upper- or lowercase ASCII letters, nothing else. If you have other Unicode characters as identifiers, you would want to look into other re character classes, such as \w.
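A minimal sketch of that relaxation (my addition, not part of the original answer); in Python 3, \w matches Unicode word characters by default:
# Hypothetical looser pattern: any word characters for the term.
DATA_RE = re.compile(r"(?P<term>\w+)\s+(?P<weight>\d+\.\d+)")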
answered 8 hours ago by Brad Solomon (edited 2 hours ago)
Brilliant answer, I must say.
– amanb
8 hours ago
@amanb Thank you!
– Brad Solomon
8 hours ago
You can use the DataFrame constructor if you massage your input to the appropriate format. Here is one way:
import pandas as pd
from itertools import chain
text="""1: frack 0.733, shale 0.700,
10: space 0.645, station 0.327, nasa 0.258,
4: celebr 0.262, bahar 0.345 """
df = pd.DataFrame(
    list(
        chain.from_iterable(
            map(lambda z: (y[0], *z.strip().split()), y[1].split(",")) for y in
            map(lambda x: x.strip(" ,").split(":"), text.splitlines())
        )
    ),
    columns=["Id", "Term", "weight"]
)
print(df)
#   Id     Term weight
#0   1    frack  0.733
#1   1    shale  0.700
#2  10    space  0.645
#3  10  station  0.327
#4  10     nasa  0.258
#5   4   celebr  0.262
#6   4    bahar  0.345
Explanation
I assume that you've read your file into the string text. The first thing you want to do is strip leading/trailing commas and whitespace before splitting on the colon:
print(list(map(lambda x: x.strip(" ,").split(":"), text.splitlines())))
#[['1', ' frack 0.733, shale 0.700'],
# ['10', ' space 0.645, station 0.327, nasa 0.258'],
# ['4', ' celebr 0.262, bahar 0.345']]
The next step is to split on the comma to separate the values, and assign the Id to each set of values:
print(
    [
        list(map(lambda z: (y[0], *z.strip().split()), y[1].split(","))) for y in
        map(lambda x: x.strip(" ,").split(":"), text.splitlines())
    ]
)
#[[('1', 'frack', '0.733'), ('1', 'shale', '0.700')],
# [('10', 'space', '0.645'),
# ('10', 'station', '0.327'),
# ('10', 'nasa', '0.258')],
# [('4', 'celebr', '0.262'), ('4', 'bahar', '0.345')]]
Finally, we use itertools.chain.from_iterable to flatten this output, which can then be passed straight to the DataFrame constructor.
Note: the * tuple unpacking used here is a Python 3 feature.
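One caveat: built this way, every column holds strings. A small follow-up sketch (my addition, not part of the original answer) to restore numeric dtypes:
df["Id"] = df["Id"].astype(int)
df["weight"] = df["weight"].astype(float)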
answered 8 hours ago by pault (edited 8 hours ago)
Assuming your data (a csv-like file) looks as given:
df = pd.read_csv('untitled.txt', sep=': ', header=None)
df.set_index(0, inplace=True)
# split the `,`
df = df[1].str.strip().str.split(',', expand=True)
# 0 1 2 3
#-- ------------ ------------- ---------- ---
# 1 frack 0.733 shale 0.700
#10 space 0.645 station 0.327 nasa 0.258
# 4 celebr 0.262 bahar 0.345
# stack and drop empty
df = df.stack()
df = df[~df.eq('')]
# split ' '
df = df.str.strip().str.split(' ', expand=True)
# edit to give final expected output:
# rename index and columns for reset_index
df.index.names = ['Id', 'to_drop']
df.columns = ['Term', 'weight']
# final df
final_df = df.reset_index().drop('to_drop', axis=1)
answered 8 hours ago by Quang Hoang (edited 8 hours ago)
How do you not get an error from sep=': ', which is a 2-character separator?
– Rebin
8 hours ago
@Rebin add engine='python'
– pault
8 hours ago
@pault weird, 'cause I already split by ' '. It yields correct data on my computer.
– Quang Hoang
8 hours ago
I don't know how to add engine python? What is the command?
– Rebin
8 hours ago
@Rebin add it as a param to pd.read_csv - df = pd.read_csv(..., engine='python')
– pault
8 hours ago
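Putting the thread together, the read call without the multi-character-separator complaint would presumably be:
df = pd.read_csv('untitled.txt', sep=': ', header=None, engine='python')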
Just to put my two cents in: you could write yourself a parser and feed the result into pandas:
import pandas as pd
from parsimonious.grammar import Grammar
from parsimonious.nodes import NodeVisitor
file = """1: frack 0.733, shale 0.700,
10: space 0.645, station 0.327, nasa 0.258,
4: celebr 0.262, bahar 0.345
"""
grammar = Grammar(
    r"""
    expr = line+
    line = id colon pair*
    pair = term ws weight sep? ws?
    id = ~"\d+"
    colon = ws? ":" ws?
    sep = ws? "," ws?
    term = ~"[a-zA-Z]+"
    weight = ~"\d+(?:\.\d+)?"
    ws = ~"\s+"
    """
)
tree = grammar.parse(file)
class PandasVisitor(NodeVisitor):
    def generic_visit(self, node, visited_children):
        return visited_children or node

    def visit_pair(self, node, visited_children):
        term, _, weight, *_ = visited_children
        return (term.text, weight.text)

    def visit_line(self, node, visited_children):
        id, _, pairs = visited_children
        return [(id.text, *pair) for pair in pairs]

    def visit_expr(self, node, visited_children):
        return [item for lst in visited_children for item in lst]

pv = PandasVisitor()
result = pv.visit(tree)
df = pd.DataFrame(result, columns=["Id", "Term", "weight"])
print(df)
This yields
Id Term weight
0 1 frack 0.733
1 1 shale 0.700
2 10 space 0.645
3 10 station 0.327
4 10 nasa 0.258
5 4 celebr 0.262
6 4 bahar 0.345
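Note: parsimonious is a third-party package, so it needs to be installed first (typically with pip install parsimonious).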
answered 7 hours ago by Jan
Here is another take on your question: create a list which will contain a list for every id and term, and then produce the dataframe from it.
import pandas as pd

file = r"give_your_path".replace('\\', '/')
my_list_of_lists = []  # creating an empty list which will contain lists of [Id, Term, weight]
with open(file, "r+") as f:
    for line in f.readlines():  # looping over every line
        my_id = [line.split(":")[0]]  # storing the Id in order to use it for every term
        # strip the optional trailing comma so the last pair on a line is not lost
        for term in [s.strip().split(" ") for s in line[line.find(":")+1:].strip(" ,\n").split(",")]:
            my_list_of_lists.append(my_id + term)
df = pd.DataFrame.from_records(my_list_of_lists)  # turning the lists into a dataframe
df.columns = ["Id", "Term", "weight"]  # giving the columns their names
answered 8 hours ago by JoPapou13
It is possible to do this entirely in pandas:
from io import StringIO
import pandas as pd

df = pd.read_csv(StringIO(u"""1: frack 0.733, shale 0.700,
10: space 0.645, station 0.327, nasa 0.258,
4: celebr 0.262, bahar 0.345 """), sep=":", header=None)
#df:
0 1
0 1 frack 0.733, shale 0.700,
1 10 space 0.645, station 0.327, nasa 0.258,
2 4 celebr 0.262, bahar 0.345
Turn column 1 into a list and then expand:
df[1] = df[1].str.split(",", expand=False)
dfs = []
for idx, rows in df.iterrows():
    dfslice = pd.DataFrame({"Id": [rows[0]] * len(rows[1]), "terms": rows[1]})
    dfs.append(dfslice)
newdf = pd.concat(dfs, ignore_index=True)
# this creates newdf:
Id terms
0 1 frack 0.733
1 1 shale 0.700
2 1
3 10 space 0.645
4 10 station 0.327
5 10 nasa 0.258
6 10
7 4 celebr 0.262
8 4 bahar 0.345
Now we need to str.split the remaining terms and drop the empties:
newdf["terms"] = newdf["terms"].str.strip()
newdf = newdf.join(newdf["terms"].str.split(" ", expand=True))
newdf.columns = ["Id", "terms", "Term", "Weights"]
newdf = newdf.drop("terms", axis=1).dropna()
Resulting newdf:
Id Term Weights
0 1 frack 0.733
1 1 shale 0.700
3 10 space 0.645
4 10 station 0.327
5 10 nasa 0.258
7 4 celebr 0.262
8 4 bahar 0.345
answered 8 hours ago by Rocky Li
Could I assume that there is just 1 space before 'TERM'?
import pandas as pd

df = pd.DataFrame(columns=['ID', 'Term', 'Weight'])
with open('C:/random/d1', 'r') as readObject:
    for line in readObject:
        line = line.rstrip('\n')
        tempList1 = line.split(':')
        tempList2 = tempList1[1]
        tempList2 = tempList2.rstrip(',')
        tempList2 = tempList2.split(',')
        for item in tempList2:
            # strip the leading space so the term is the first element
            e = item.strip().split(' ')
            tempRow = [tempList1[0], e[0], e[1]]
            df.loc[len(df)] = tempRow
print(df)
answered 8 hours ago by Rebin
1) You can read row by row.
2) Then you can separate by ':' for your index and ',' for the values
1)
with open('path/filename.txt','r') as filename:
    content = filename.readlines()
2)
content = [x.split(':') for x in content]
This will give you the following result:
content =[
['1','frack 0.733, shale 0.700,'],
['10', 'space 0.645, station 0.327, nasa 0.258,'],
['4','celebr 0.262, bahar 0.345 ']]
answered 8 hours ago by CedricLy
Your result is not the result asked for in the question.
– GiraffeMan91
8 hours ago
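As the comment notes, this stops short of the requested DataFrame. A minimal sketch of one way to finish from content (my continuation, not part of the original answer):
import pandas as pd

rows = []
for idx, values in content:
    # Drop trailing comma/whitespace, then split the remaining "term weight" pairs.
    for pair in values.strip(" ,\n").split(","):
        term, weight = pair.strip().split(" ")
        rows.append((int(idx), term, float(weight)))

df = pd.DataFrame(rows, columns=["Id", "Term", "weight"])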